Bioinformatics I (BIF401) book HANDOUTS / POWER POINT SLIDES in pdf



AN INTRODUCTORY COURSE

BIOINFORMATICS-I

A STUDENT HANDOUT

2016

VER 1.0.0.0

LAST REVISED ON May 2, 2016


Handout Chapter 01: Introduction to Bioinformatics

Page 1 of 320

Chapter 1- Introduction to Bioinformatics

Module 001: INTRODUCTION TO BIOINFORMATICS

 BACKGROUND

Bioinformatics is an interdisciplinary science at the crossroads of biology, mathematics, computer science, chemistry and physics. With the digitalization of biological information, the door has been opened wide to the analysis of this information using computer algorithms and software.

We now know that the human genome contains roughly 20,000 protein-coding genes, and that these genes code for thousands of different proteins which perform day-to-day functions in the living cell. Furthermore, these proteins may undergo various post-translational modifications, leading to a very large number of functionally unique molecules. This presents us with a huge challenge in the identification of genes and proteins.

 EXPERIMENTS IN BIOLOGY

With advances in experimental protocols, we now have several next generation instruments and techniques available for obtaining digitalized biological information on genes, proteins and other biomolecules. These instruments include:

1. Next Generation Sequencers (NGS) for whole genome sequencing

2. High Resolution Mass Spectrometry for whole proteome profiling

3. Nuclear Magnetic Resonance Spectroscopy for structural studies

 DIGITALIZATION OF BIOLOGY

In today’s world, when a biologist performs an experiment in the wet lab, he or she in fact produces digital data that is stored on computer disks. The data may include text, numbers, symbols or images.

 SPEED OF DATA GROWTH

Due to advances in the instrumentation used in biological experiments, data is accumulating at an exponentially increasing rate. For example, the volume of genome sequences in genome databases doubles every few years.

 CONCLUSION

The human brain is limited in how much information it can store and recall. We first have to commit information to memory before we can recall it. Computers can help us overcome this limitation, because they can store vast amounts of information, recall it reliably and process it quickly to produce results.


Module 002: INTRODUCTION TO BIOINFORMATICS

 MOTIVATION

Bioinformatics is becoming a popular science for several reasons.

 It is an interdisciplinary field that deals with digital biological information from humans, plants, animals and microorganisms.

 Although it is a new field, it is developing rapidly.

 It demands only low-cost infrastructure and hardly any lab equipment.

 As bioinformatics data concerns a wide range of species such as humans, plants and microorganisms, it presents us with plenty of opportunities for scientific discovery.

 SCOPE OF BIOINFORMATICS

Bioinformatics primarily deals with digitalized biological information as well as data reported from biology experiments. Computational methods, data processing techniques and algorithms are employed in addressing the following issues:

 Storage of data

 Organization of data

 Analysis of data from many experiments

 Representation of biological information

 ACTIVITIES

In modern biological sciences, bioinformatics is used for activities such as:

 Developing algorithms for organizing data collected from experiments

 Writing software and tools for data analysis

 Data processing to determine the role of underlying biomolecules

 Statistical evaluation of data using methods such as t-test and ANOVA

 Data visualization for meaningful presentation of biological information

 CONCLUSION

In Pakistan, the field of biology is undergoing rapid change due to the advent of bioinformatics. New research and educational programs are being established, which is opening new doors of opportunity for future generations.


Module 003: INTRODUCTION TO BIOINFORMATICS

 NEED FOR BIOINFORMATICS –I

If we look at the pace of development in bioinformatics, we can easily observe that from 2000 to 2015 the number of online tools for processing genomics and proteomics information increased rapidly. This reflects the need for bioinformatics in modern biology. The field of Bioinformatics and Computational Biology is characterized by a highly diverse confluence of traditional academic disciplines. Informatics and bioscience are the umbrella terms given to the set of allied disciplines that make up the field, but a much larger array of traditional areas contributes to the set of tools needed by individuals training for this new and expanding interdisciplinary field. Biomedical Engineering, Electrical and Computer Engineering, Computer Science, Applied Mathematics, Genetics, Biology, Anatomy and Cell Biology, Microbiology, and Biostatistics are the principal allied disciplines.

 CONCLUSION

The need for bioinformatics is rising rapidly as biological data grows and becomes available online, free of cost.


Module 004: NEED FOR BIOINFORMATICS –II

If we observe the growth of GenBank, it held well under a million base pairs when it was launched in 1982, but by 2002 it had grown to tens of billions of base pairs. With this data in our hands, there is an urgent need to interpret it. For instance, analysis of this data can help us understand the phylogenetic “tree of life”, which consists of three domains:

 Bacteria

 Archaea

 Eucarya

Towards exploring the possible benefits of using bioinformatics, one needs to answer the following question:

 WHAT IS IT THAT BIOINFORMATICS CAN DELIVER?

The simple answer is that bioinformatics can:

 Provide us with a better understanding of life, evolution, molecular mechanisms and disease.

 Help us make better drugs, given an enhanced molecular understanding of disease.

 POSSIBLE CONTRIBUTIONS

 It can help us organize the large datasets produced by new experimental instruments.

 It can help store and process this data as well.

 It can provide insights into the meanings of our research results and findings.

 Overall, it can help us to better understand paradoxes defining the life forms.

 CONCLUSION

From gene sequencing to protein sequencing, bioinformatics is providing us with an improved understanding of the genes, proteins, protein interaction and signaling pathways involved in biological functioning and disease.


Module 005: APPLICATIONS OF BIOINFORMATICS – I

When we look at bioinformatics, it seems to be a very complex and abstract field. How and where can bioinformatics be applied specifically? How does it improve our fundamental understanding of biological phenomena? Most importantly, how can its benefits be delivered to society at large?

The answers to these questions are categorized as follows:

 GENOMICS

 Bioinformatics can help in assembling DNA sequencing data.

 It can help in finding genes and genetic markers.

 Genes can be assembled using bioinformatics tools (nucleotide alignments).

 It can help derive RNA sequences from gene sequences (transcription).

 Also, databases can be generated from such data.

 EVOLUTIONARY STUDIES

 Evolutionary relationships between different organisms can be derived from data.

 Evolutionary distance among species can be computed by using bioinformatics tools.

 Phylogenetic trees can be constructed to find relationships between species.

 Ancestry can be better understood between several species and organisms.

 PROTEOMICS

 Bioinformatics can help us in decoding protein sequences.

 It can also help us in understanding protein structure.

 We can also understand post-translational modifications of proteins with the help of bioinformatics.

 We can better understand protein-protein interactions in different biological reactions.

 It can also help us in generating databases of these sequences and structures.

 SYSTEMS BIOLOGY

 Bioinformatics can assist us in modelling regulatory mechanisms in gene and protein networks.

 Such models can be analyzed to identify the key regulators in these networks.

 Moreover, the models can help evaluate drugs that target these key regulators.

 CONCLUSION

Bioinformatics can be applied to the life sciences in many ways. It helps us understand the sequence and function of biomolecules and the relationships between them. Recent trends in bioinformatics involve the development of personalized therapeutics for cancer and diabetes.


Module 006: APPLICATIONS OF BIOINFORMATICS - II

Bioinformatics is applied in everyday practice in many areas, such as genomics, transcriptomics, proteomics, metabolomics, structural proteomics, drug design, systems biology and the personalization of medicines.

Beyond these applications, bioinformatics has given us techniques for generating and using large volumes of biological data. Step by step, the applications of bioinformatics have expanded from the genomic level to the level of entire systems.

 SMALL TO BIG

 Bioinformatics helps us understand systems from small to big, from gene finding to the prediction of entire systems.

 It supports structure determination and the modeling of many biological systems so that we can understand them better.

 Bioinformatics has helped us understand protein-protein interactions in many biological systems.

 It shows us how biological processes are interconnected and how they affect each other.

 We are now able to model molecules and genomes at the cellular level.

 Signaling pathways have become easier to study with the help of bioinformatics.

 Tissue morphology can now be understood by building models with bioinformatics tools.

 CONCLUSION

Bioinformatics does not merely collect, analyze and store data. It processes the data in a reliable way and helps validate our hypotheses. In the near future, it may help us predict which diseases are likely to arise and how to tackle them with personalized medicine.


Module 007: FRONTIERS IN BIOINFORMATICS - I

 INTRODUCTION

Bioinformatics is a new and emerging field of science with vast opportunities. Innovation in tools is increasing the scale of biological data, yet many challenges in the life sciences remain unsolved, and bioinformatics is contributing innovative ideas towards solving them.

 FRONTIER IN GENOMICS

We are now able to sequence whole genomes using next generation sequencing (NGS), with bioinformatics tools processing the resulting data.

We are able to store and analyze massive amounts of biological data (terabyte-scale files).

We can handle and process large volumes of data with ease.

Whole genomes can be assembled from sequence data, and flaws can be identified easily.

 FRONTIER IN TRANSCRIPTOMICS

We are now able to investigate questions that are still unknown or under discussion.

The role of RNA in making proteins, and its dynamics, can now be understood more easily.

Interactions of RNA molecules can be understood with simple models.

 FRONTIER IN PROTEOMICS

Proteins that are deficient or present at low levels in a patient’s tissue sample can be identified.

Large-scale expression and production of proteins in an organism can be identified.

Pathways before and after a biological reaction are easier to model.


 CONCLUSION

Bioinformatics is truly a science full of challenges and opportunities, and it is bringing about a revolution in biology and in everyday life.


Module 008: FRONTIER IN BIOINFORMATICS-II

Frontiers in bioinformatics include:

 Next generation genomics

 Transcriptomics

 Proteomics

 FRONTIER IN PROTEIN STRUCTURE

Bioinformatics helps us understand how proteins fold and how they are processed, how proteins interact with each other, and how a drug can affect or stimulate a protein.

 FRONTIER IN SYSTEM BIOLOGY

It helps us understand a single cell as a whole system: how the organelles, genes, proteins and metabolites in that cell are interconnected in a single unified system. Bioinformatics also gives us an idea of how such models can be applied in real time.

 FRONTIER IN PERSONALIZED MEDICINE

An important goal for this century and for coming generations is to personalize medicine so that a disease can be treated precisely. Not every medicine works equally well for every patient, and some medicines affect certain patients badly. With the help of bioinformatics, we are now able to personalize some medicines for some diseases. Bioinformatics also helps us evaluate medicines.

 CONCLUSION

The 21st century can be called the century of bioinformatics: it will enable us to treat many diseases by personalizing therapy to the patient.


Module 009: Overview of Course Contents - I

Philosophy behind the Course Outlay

1. Introduce the classical algorithms in bioinformatics

2. Link them to latest developments in the field

3. Evaluate the future applications


Sequences and operations such as alignment and comparison will be covered, along with phylogenetics and RNA structure modelling. Next, we will delve into protein sequences and structures!


Module 010: Overview of Course Contents - II


Summary

• Protein sequence and structure topics will be dealt with in these modules

• The next set of modules is about homology modelling and systems biology topics!


Module 011: Overview of Course Contents – III


Conclusion

• These contents will give you an initial exposure to the variety of topics in bioinformatics

• After covering these topics, you should have a basic conceptual foundation for further studies in bioinformatics

Handout Chapter 02: Sequence Analysis

Page 18 of 320

Chapter 2 - Sequence Analysis

Module 001: Gene, mRNA and Protein Sequences

 INTRODUCTION

We all know that all living things are composed of cells, and here a question arises: how are cells made? DNA holds the blueprints for building cells, along with the information for the cell’s production of proteins, carbohydrates and vitamins.

The transfer of this information from DNA to these molecules is termed the “Central Dogma”, which is:

DNA → RNA → Protein

Proteins are then used in constructing the cell.

 DNA

Figure 0.1 DNA Double helix

The DNA molecule has a double helix structure made of base pairs. Each strand is a chain of nucleotides, and each nucleotide is composed of a sugar, a phosphate group and a base. The bases on the two strands are bound to each other by hydrogen bonds.


The bases are the same in DNA and RNA with one exception: RNA contains U (Uracil) where DNA contains T (Thymine).

DNA sends its information to the cell via mRNA. The mRNA specifies the order of amino acids according to the coded information, a protein is formed, and proteins in turn build the cell.

 CONCLUSION

According to the central dogma, DNA codes the information for RNA, RNA directs the making of proteins, and proteins, along with organelles, make up cells and their systems.


Module 002: TRANSCRIPTION

All cells are made of carbohydrates and proteins, and DNA codes the information from which both RNA and protein are made.

Figure 0.2 Flow of information from DNA to Proteins

The above figure explains the flow of information in a very simple way. DNA codes the information, which is copied into RNA. This step is known as transcription. The mRNA carries the copied information into the cell, where amino acids are joined together according to the coded information to form a protein. This step is known as translation.

A DNA molecule contains only four bases (A, T, C and G), which are repeated thousands of times. Adenine “A” pairs with Thymine “T”, while Cytosine “C” pairs with Guanine “G”, and all pairings are held together by hydrogen bonds.

Like DNA, RNA contains four bases, but Thymine is replaced with Uracil “U”, and RNA is single stranded.

DNA only codes the information for a protein, whereas RNA helps in making the protein.

DNA (Information) → Transcription → RNA (Copy of Information) → Translation → Proteins (Execution of Information)
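The pairing and transcription rules in this module can be tried out in a few lines of Python. This is only a minimal sketch: the function names are hypothetical, and transcription is simplified to rewriting the coding strand with U in place of T, which ignores the enzymes and the template strand involved in a real cell.

```python
# Base pairing in DNA: A pairs with T, and C pairs with G.
DNA_COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement_strand(dna):
    """Return the base-paired partner of every base in a DNA strand."""
    return "".join(DNA_COMPLEMENT[base] for base in dna.upper())


def transcribe(coding_strand):
    """Write the mRNA for a DNA coding strand by replacing T with U."""
    return coding_strand.upper().replace("T", "U")


if __name__ == "__main__":
    dna = "ATGGCC"                  # made-up example sequence
    print(complement_strand(dna))   # TACCGG
    print(transcribe(dna))          # AUGGCC
```

The dictionary lookup raises a `KeyError` for any letter other than A, T, C or G, which is a simple way to catch typing mistakes in a sequence.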


Module 003: NUCLEOTIDES

DNA and RNA molecules are each composed of four kinds of smaller molecules called nucleotides.

The bases in these nucleotides are Adenine (A), Cytosine (C), Guanine (G) and Thymine (T) in DNA, with Uracil (U) replacing Thymine in RNA.

DNA is double stranded and RNA is single stranded, and the two also differ in their sugar composition.

RNA has ribose sugar and DNA has deoxyribose sugar:

Figure 0.3 Difference between RNA and DNA sugar (panels: RNA, DNA)

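Adenine and Guanine are collectively called purines, while Cytosine, Uracil and Thymine are called pyrimidines.

When a phosphate, a nitrogenous base and a sugar come together to form a nucleotide, the sugar determines the type of molecule: if the sugar carries a hydroxyl group (OH) at the 2′ carbon, the molecule is RNA, and if it carries a hydrogen (H) at that position, the molecule is DNA, as the figure shows.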


 CONCLUSION

DNA is used to make RNA, and RNA is used to make protein. DNA differs from RNA in its sugar and in one of its bases.


Module 004: TRANSLATION

Cells are built of proteins and carbohydrates. Proteins are made by decoding the information in an RNA molecule, and this process is called translation.

Translation takes place at the ribosomes of the cell. After reading the information in the mRNA, a ribosome assembles amino acids taken from the cytosol, which is the part of the cytoplasm that is not enclosed in any of the cell’s organelles.

 MECHANISM

At the ribosome, three nucleotides are read at a time from the mRNA. Each set of three nucleotides is called a codon, and each codon corresponds to a specific amino acid or to a stop signal.

Figure 0.4 sixty four codons combinations
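Reading an mRNA three nucleotides at a time can be sketched in Python as follows. The codon table is the standard genetic code written in a compact form, the one-letter amino acid codes are the usual ones, and the names `CODON_TABLE` and `translate` are illustrative choices rather than part of any particular tool. The sketch starts reading at the first nucleotide and stops at the first stop codon, which is a simplification of what the ribosome does.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code in the order UUU, UUC, UUA, UUG, UCU, ... ("*" marks a stop codon).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Build all 64 codons and pair each one with its amino acid letter.
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna):
    """Translate an mRNA string codon by codon until a stop codon is reached."""
    mrna = mrna.upper()
    protein = []
    for start in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[start:start + 3]]
        if amino_acid == "*":       # stop codon: end of the protein
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(len(CODON_TABLE))         # 64
    print(translate("AUGGCCUAA"))   # MA
```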


 CONCLUSION

RNA codes for protein: at the ribosome, each codon of three nucleotides codes for a specific amino acid, and this process is called translation.


Module 005: AMINO ACIDS

RNA is decoded at the ribosomes in the form of codons, and each codon selects a specific amino acid. There are 20 different amino acids in nature. They are joined together by polymerization and fold to make a protein structure.

If we observe the structure of an amino acid, its backbone contains nitrogen, hydrogen, oxygen and two carbon atoms, along with a variable group R.

Figure 0.5 structure of amino acid

When polymerization takes place, water is released. If any compound is attached to the R group, the structure of the protein is changed.

Amino acids are joined to each other by peptide bonds, and the resulting chain folds in three dimensions to make the protein structure.


Module 006: STORAGE OF BIOLOGICAL SEQUENCE INFORMATION

We know that a DNA sequence contains the nucleotides A, C, T and G, and an RNA sequence contains A, C, U and G, while a protein sequence contains A, R, N, D, C, E, Q, G, H, I, L, K, M, F, P, S, T, W, Y and V. These letters stand for the 20 different amino acids in nature that compose proteins.

When DNA and RNA (including mRNA) are sequenced in the lab, their sequences contain a large and varied number of nucleotides.

Protein sequences likewise contain a large number of amino acids, as proteins are complex in nature.
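The three alphabets above can be written down as Python sets and used to check which kind of sequence a string could be. This is a small illustrative sketch with hypothetical names. It also shows a practical caveat: the letters A, C, G and T are amino acid codes as well, so the alphabet alone cannot always tell a DNA sequence from a protein sequence.

```python
DNA_LETTERS = set("ACGT")
RNA_LETTERS = set("ACGU")
PROTEIN_LETTERS = set("ARNDCEQGHILKMFPSTWYV")   # the 20 amino acid codes


def possible_types(sequence):
    """List every sequence type whose alphabet contains all letters of the sequence."""
    letters = set(sequence.upper())
    types = []
    if letters <= DNA_LETTERS:
        types.append("DNA")
    if letters <= RNA_LETTERS:
        types.append("RNA")
    if letters <= PROTEIN_LETTERS:
        types.append("protein")
    return types


if __name__ == "__main__":
    print(possible_types("ACGU"))   # ['RNA']
    print(possible_types("ACGT"))   # ['DNA', 'protein']
    print(possible_types("MKV"))    # ['protein']
```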

 SOLUTION: DATABASES

Such a large number of sequences cannot be managed on a single personal computer, which is why the solution lies in public sequence databases. For DNA and RNA, the public database is GenBank (by NIH).

For proteins, the public database is UniProt (by the UniProt Consortium).

Both GenBank and UniProt are online databases, and DNA, RNA and protein sequences are available there online for the public and researchers.


Module 007: USING GENBANK

GenBank is an online database where researchers can access DNA, RNA and protein sequences.

To find a sequence, we go online to the NCBI GenBank website, which is a public database site:

www.ncbi.nlm.nih.gov/genbank

For example, suppose we want to find the sequence of an immunoglobulin. Immunoglobulins are glycoprotein antibodies produced by plasma cells (a type of white blood cell), and they act in immunity.


Sequences can be searched in GenBank by typing any of the following:

o Sequence name

o ID

o Name

o Species

o Locus

o Accession Number

o Author

o Journal
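Besides the website, GenBank records can be requested by accession number through NCBI's E-utilities web service, which this handout does not cover. The sketch below only builds the request address and does not contact the server. The function name is hypothetical, and the accession number `NM_000546` is used purely as an illustration.

```python
from urllib.parse import urlencode

# Address of the E-utilities "efetch" service at NCBI.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def genbank_fasta_url(accession):
    """Build the address that returns one nucleotide record in FASTA text format."""
    params = {
        "db": "nuccore",       # nucleotide database
        "id": accession,       # accession number of the record
        "rettype": "fasta",    # ask for the sequence in FASTA format
        "retmode": "text",
    }
    return EFETCH_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(genbank_fasta_url("NM_000546"))
```

Opening the printed address in a browser, or passing it to `urllib.request.urlopen`, would download the sequence, assuming network access and that the service is available.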


Module 008: USING UNIPROT

UniProt is a public database used to search for protein sequences.

www.uniprot.org

For example, suppose we want to search for the sequence of ubiquitin, a protein that plays an important role in the cytosol in recycling proteins. We go online to the website www.uniprot.org, and the home page will appear.

We write the name of the protein in the search box and press Enter. A list of search results will then be displayed.


By clicking on any result, you can download the sequence or run BLAST on it.

On the home page there is a box named “Swiss-Prot”, which contains manually curated protein information, such as molecular mass and observed and predicted modifications.

UniProt can be searched by typing a protein name, an ID or an amino acid sequence.
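The same kind of search can be expressed as a web address for the UniProt REST service, which is not described in this handout. The sketch below only assembles the address. The function name and the choice of five results are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Search endpoint of the UniProt REST service.
UNIPROT_SEARCH_URL = "https://rest.uniprot.org/uniprotkb/search"


def uniprot_search_url(query, size=5):
    """Build a search address that returns the first `size` hits as FASTA sequences."""
    params = {"query": query, "format": "fasta", "size": size}
    return UNIPROT_SEARCH_URL + "?" + urlencode(params)


if __name__ == "__main__":
    print(uniprot_search_url("ubiquitin"))
```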


Module 009: COMPARING SEQUENCES

There are millions of sequences in GenBank and UniProt. What happens if we compare them? By comparing DNA, RNA and protein sequences, we can find:

 Similarity among sequences

 Specific differences that may be due to a disease or mutation

 Possible evolutionary relationships

This is because their nucleotides (or amino acids) can be similar to or different from each other.

Figure 0.6 BLAST is used to compare the nucleotides sequences

The BLAST tool on the UniProt website can be used to compare amino acid sequences.

By comparing the nucleotides or amino acids of DNA, RNA and protein sequences, we can discover many evolutionary facts and relationships among species.


Module 010: SIMILARITIES & DIFFERENCES IN SEQUENCES

When we compare DNA or RNA sequences, we can find similarities, differences and evolutionary relationships. The same is true for the amino acid sequences of proteins.

In a comparison, similar sequences not only have the same number of nucleotides but also have them in the same order or arrangement.

If some sequences are exactly similar to each other, it may mean that:

 They have a similar pattern of expression in the cell or system.

 They indicate a specific feature, such as the signature of a protein or gene.

 Or they are nearly identical, with only one or two nucleotides differing between them.

 CONCLUSION

An exact match between sequences means that their order or arrangement is the same and that most, though not necessarily all, of their nucleotides match each other.

While the genome of each created kind is unique, many animal kinds share some specific types of genes that are generally similar in DNA sequence. When comparing DNA sequences between animal taxa, evolutionary scientists often hand-select the genes that are commonly shared and more similar (conserved), while giving less attention to categories of DNA sequence that are dissimilar. One result of this approach is that comparing the more conserved sequences allows the scientists to include more animal taxa in their analysis, giving a broader data set so they can propose a larger evolutionary tree. Although these types of genes can be easily aligned and compared, the overall approach is biased towards evolution. It also avoids the majority of genes and sequences that would give a better understanding of DNA similarity concepts.

Source: http://www.icr.org/article/common-dna-sequences-evidence-evolution/


Module 011: SIMILARITIES & DIFFERENCES IN SEQUENCES



Module 012: PAIR WISE ALIGNMENT –II

In a pairwise alignment of nucleotide sequences, the nucleotides are arranged in pairs. Matching pairs are colored, while a missing nucleotide is indicated with “.”, and this empty position is called a gap.

Gaps are inserted to account for the deletion or insertion of a nucleotide. A larger number of gaps increases the penalty in scoring, while a smaller number of gaps increases the similarity score of the sequences.

There are two types of pairwise alignment.

1. Global

2. Local

In global alignment, we introduce gaps across the whole length of the sequences to find the overall match. In local alignment, we find the regions where the nucleotides match each other the most. It is used to find similarity or mutations.

Most importantly, gaps are introduced so that positions with missing nucleotides can be accounted for. Pairwise sequence alignment is used to identify regions of similarity that may indicate functional, structural and/or evolutionary relationships between two biological sequences (protein or nucleic acid).

http://www.ebi.ac.uk/Tools/psa/


Module 013: PAIR WISE SEQUENCE ALIGNMENT –III

Pairwise alignment helps us find similarities and differences. There are three ways in which sequences can differ from each other:

 Substitutions ACGA → AGGA

 Insertions ACGA → ACCGA

 Deletions ACGA → AGA

Applying any of these changes to a sequence can increase or decrease the number of matches and mismatches between the two sequences being compared.

Local and global alignments give us different results.

Of the three, a substitution produces a mismatch in the alignment, while insertions and deletions produce gaps.
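The three kinds of change listed above can be reproduced on a Python string. This is a minimal sketch. The function names are hypothetical, and positions are counted from 0, as is usual in Python.

```python
def substitute(seq, pos, new_base):
    """Replace the base at position `pos` with `new_base`."""
    return seq[:pos] + new_base + seq[pos + 1:]


def insert(seq, pos, new_base):
    """Insert `new_base` in front of position `pos`."""
    return seq[:pos] + new_base + seq[pos:]


def delete(seq, pos):
    """Remove the base at position `pos`."""
    return seq[:pos] + seq[pos + 1:]


if __name__ == "__main__":
    print(substitute("ACGA", 1, "G"))   # AGGA
    print(insert("ACGA", 2, "C"))       # ACCGA
    print(delete("ACGA", 1))            # AGA
```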


Module 014: DOT PLOTS

To visualize a sequence alignment, we have a method called the dot plot. In this method, one sequence is written along the top and the other along the left side of a dot matrix grid.

Wherever a nucleotide or amino acid in one sequence matches one in the other, a dot is placed at the corresponding grid position.

Runs of matching dots form a diagonal pattern, while isolated dots off the diagonal mark positions where the sequences differ.

A C A C G A C C G G


Dots along a diagonal represent aligned regions, while isolated dots indicate differences between the sequences.

Figure 0.7 dot plot diagonal pattern
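A dot plot is easy to draw as text. The sketch below is one possible way to do it: the function names are hypothetical, each row of the grid belongs to one letter of the left sequence, and a "." is printed wherever the two letters match.

```python
def dot_plot(top, left):
    """Return the grid as a list of strings, one row per letter of `left`."""
    return [
        "".join("." if top_letter == left_letter else " " for top_letter in top)
        for left_letter in left
    ]


def render(top, left):
    """Format the grid with the two sequences written along the top and left sides."""
    lines = ["  " + " ".join(top)]
    for left_letter, row in zip(left, dot_plot(top, left)):
        lines.append(left_letter + " " + " ".join(row))
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up sequences: identical stretches show up as diagonal runs of dots.
    print(render("ACGACG", "ACGTCG"))
```

For two identical sequences, every position on the main diagonal carries a dot, which is the diagonal pattern described in this module.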


Module 015: EXAMPLE OF DOT PLOTS

In a dot plot, matching nucleotides line up along a diagonal and represent the sequence alignment.

When we compare human cytochrome with tuna fish cytochrome, the diagonal alignment of the sequences appears as shown in the diagram below.

 BENEFITS

Dot plots show us the global similarity between two sequences and help us visualize their alignment. Sequence repeats appear as stacked diagonals in the plot.

 CONCLUSION

Dot plots help us find the differences between two sequences.

Figure 0.8 tuna fish vs Human


Module 016: IDENTITY VS SIMILARITY

When we compare two sequences, the question arises of how biological sequences can be compared and how the degree of similarity can be expressed.

There are two concepts in sequence analysis:

1. Identity

2. Similarity

Identity is the number of nucleotides or amino acids that match exactly when two biological sequences are aligned.

For example:

1: CATGCTT
2: CATGC

Length of sequence (1) = 7

Length of sequence (2) = 5

Smaller length = 5

Number of matches = 5

Formula for Identity:

Identity = (No. of Matches / smaller length) × 100

For the example above, Identity = (5 / 5) × 100 = 100%.

Similarity is a measure of how alike two different sequences are, calculated by an alignment approach.

In both identity and similarity, the dots (gaps) are not counted.
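The identity formula can be checked with a short Python function. This sketch assumes the two sequences are already written one above the other, position by position, and that a gap is written as ".". The function name is an illustrative choice.

```python
def percent_identity(seq1, seq2, gap="."):
    """Identity = (number of matches / smaller length) x 100, with gaps not counted."""
    matches = sum(1 for x, y in zip(seq1, seq2) if x == y and x != gap)
    smaller_length = min(len(seq1.replace(gap, "")), len(seq2.replace(gap, "")))
    if smaller_length == 0:
        raise ValueError("both sequences must contain at least one letter")
    return 100.0 * matches / smaller_length


if __name__ == "__main__":
    # The example from this module: 5 matches, smaller length 5.
    print(percent_identity("CATGCTT", "CATGC"))   # 100.0
```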


Module 017: INTRODUCTION TO ALIGNMENT APPROACHES

Sequences may vary due to insertions and deletions of nucleotides, so to calculate similarity we first need to align the sequences. There are two different approaches to aligning sequences.

1. Global Alignment

2. Local Alignment

In local alignment, we compare one sequence with only a portion of the other, as shown in Figure 0.9.

In global alignment, we compare both sequences completely, from end to end, as shown in Figure 0.10.

Local alignment focuses on the highly matching portions of the sequences, while global alignment compares one whole sequence with the other.

Figure 0.9 local alignment

Figure 0.10 Global alignment


Module 018: WHY LOCAL ALIGNMENT

If global alignment compares whole sequences from end to end, the question arises: why is local alignment needed?

The reason is that local alignment has the power to detect smaller regions of high similarity. Such matches are often motifs or domains related to protein function, which may remain hidden in a global alignment.

 DOMAIN SHUFFLING

Aligned portions of sequences can occur in varying orders, and this process is called domain shuffling.

 ADVANTAGES

 Sequences of different lengths can be compared.

 Conserved domains can be identified in proteins.

 Common functional features can be identified.

 CONCLUSION

Local alignment is used to find the segments of sequences that match most closely.
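One standard way to find the best-matching segment is the Smith-Waterman algorithm, which this module does not name. The sketch below computes only the best local score, not the alignment itself. The scores of +1 for a match and -1 for a mismatch or gap are assumptions chosen to keep the example simple.

```python
def local_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the best local alignment score of a and b (Smith-Waterman, score only)."""
    best = 0
    previous_row = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        current_row = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diagonal = previous_row[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may start anywhere, so a score never drops below 0.
            current_row[j] = max(0, diagonal, previous_row[j] + gap, current_row[j - 1] + gap)
            best = max(best, current_row[j])
        previous_row = current_row
    return best


if __name__ == "__main__":
    # Made-up sequences that share only the short segment "ACG".
    print(local_alignment_score("TTTACGTTT", "GGACGGG"))   # 3
```

Because scores are reset to 0 instead of becoming negative, poorly matching regions around a conserved segment do not pull the score down, which is what allows sequences of different lengths to be compared.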


Module 019: ALIGNING, INSERTION & DELETION

Insertion means the addition of amino acids to a protein sequence or of nucleotides to a DNA or RNA sequence.

Deletion means the removal of amino acids from a protein sequence or of nucleotides from a DNA or RNA sequence.

 ALIGNING INSERTION

For example, we have the following two sequences:

1: A C T G A C T G
2: A C G A C T G

Sequence 1 has one more nucleotide than sequence 2, so to align them we first add a gap to sequence 2. The same is done when aligning a deletion: we add a gap where the nucleotide was deleted from the sequence. Each inserted gap contributes a negative score, called a penalty.

1: A C T G A C T G
2: A C . G A C T G

Module 020: ALIGNING MUTATION IN SEQUENCES

The removal and addition of amino acids in proteins, or of nucleotides in DNA and RNA, are represented with gaps and are collectively called indels.

A mutation (substitution) is different from an indel. In a mutation one amino acid or nucleotide is replaced with another, so no gap is inserted in the template or the target.

 CONCLUSION

When aligning indels we use gaps, and for mutations we use substitution penalties. The size of the penalty depends on the substitution.

1: A C T G A C T
2: A C G G A C T

Module 021: INTRODUCTION TO DYNAMIC PROGRAMMING

To find matches between the nucleotides or amino acids of two sequences we can use the dot plot method. However, a dot plot cannot capture insertions, deletions and gaps in the sequences.

To deal with this situation we modify the dot plot.

We score matching nucleotides as +1, while gaps, substitutions, insertions and mutations are scored as -1. Dynamic programming is an algorithmic technique commonly used in sequence analysis. It is used when recursion could be applied but would be inefficient because it would repeatedly solve the same subproblems.

http://www.ibm.com/developerworks/library/j-seqalign/
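The idea of avoiding repeated subproblems can be illustrated with a small sketch. It assumes the +1/-1 scoring mentioned above and uses Python's `functools.lru_cache` to store each subproblem once; the function name `best_score` and the example sequences are chosen here only for illustration.

```python
from functools import lru_cache

# Scoring assumed from the text: +1 for a match, -1 for a mismatch or a gap.
MATCH, MISMATCH, GAP = 1, -1, -1


def best_score(s1: str, s2: str) -> int:
    """Best global alignment score of s1 and s2, by memoized recursion."""

    @lru_cache(maxsize=None)
    def solve(i: int, j: int) -> int:
        # Best score for aligning the prefixes s1[:i] and s2[:j].
        if i == 0:
            return j * GAP  # only gaps remain
        if j == 0:
            return i * GAP
        pair = MATCH if s1[i - 1] == s2[j - 1] else MISMATCH
        return max(
            solve(i - 1, j - 1) + pair,  # align the two letters
            solve(i - 1, j) + GAP,       # gap in s2
            solve(i, j - 1) + GAP,       # gap in s1
        )

    return solve(len(s1), len(s2))


print(best_score("ACTGACTG", "ACGACTG"))  # 7 matches and 1 gap -> 6
```

Without the cache the same `(i, j)` pairs would be recomputed many times; with it, each pair is solved once, which is the essence of dynamic programming.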

Module 022: DYNAMIC PROGRAMMING ESSENTIALS

Comparing two sequences position by position in every possible way takes time and is computationally expensive. That is why we need an algorithm.

For an algorithm we count the steps involved in the sequence comparison. For example, if we compare two sequences of length n by filling an n × n matrix, the number of steps is about n²,

and its order is O(n²).

Figure 0.11 -1 represents deletions, insertions and gaps, while +1 represents matching nucleotides or amino acids

Comparing sequences one by one is a costly and time-consuming process; we minimize the cost with the help of an algorithm.

Module 023: DYNAMIC PROGRAMMING METHODOLOGY

Dynamic programming helps us reduce the computational cost of sequence comparison. It works on the basis of a "scoring function".

For example:

Match = +a

Mismatch = -b

Gap = -c

Score = a × (#matches) - b × (#mismatches) - c × (#gaps)

All alignments lie along diagonals of the dot plot matrix. The total score is calculated along each diagonal path, and after calculation the best one is selected.

For instance, with match reward = 10, mismatch penalty = 2 and gap penalty = 5:

Sequence 1:  C   T   G   T   C   G   -   C   T   G   C
Sequence 2:  -   T   G   C   -   C   G   -   T   G   -
Score:      -5  10  10  -2  -5  -2  -5  -5  10  10  -5

Total = 11
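The scoring function can be checked with a short sketch. It takes the two aligned rows from the example above (with "-" for a gap) and the stated reward and penalties; the function name `alignment_score` is chosen here for illustration.

```python
def alignment_score(row1: str, row2: str, match: int = 10,
                    mismatch: int = 2, gap: int = 5) -> int:
    """Sum column scores of an existing alignment ('-' marks a gap)."""
    if len(row1) != len(row2):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total -= gap        # gap penalty
        elif a == b:
            total += match      # match reward
        else:
            total -= mismatch   # mismatch penalty
    return total


top = "CTGTCG-CTGC"
bottom = "-TGC-CG-TG-"
print(alignment_score(top, bottom))  # 4 matches, 2 mismatches, 5 gaps -> 11
```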

Module 024: NEEDLEMAN WUNSCH ALGORITHM-I

Alignments of two sequences appear as diagonal patterns in the dot matrix. A total score is computed for each alignment, and at the end the best one is selected.

The Needleman-Wunsch algorithm is a method for global alignment. It resembles the dot plot method but computes the scores in a different way. We start with a zero in the second row, second column, and move through the cells row by row, calculating the score for each cell. The score is the best possible (i.e. highest) score obtainable from the existing scores to the left, top or top-left (diagonal). A score derived from the top or from the left represents an indel in our alignment. A score derived from the diagonal represents the alignment of the two letters that the cell corresponds to.

Since there are no 'top' or 'top-left' cells for the second row, we can only add to the existing cell on the left. Hence we add -1 for each shift to the right, as this represents an indel relative to the previous score. This results in the first row of scores being 0, -1, -2, -3, -4, -5, -6 and -7. The same applies to the second column, as we only have existing scores above. Thus we have:

https://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm

Figure 11 Needleman-Wunsch pairwise sequence alignment
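The initialization and filling rule described in this module can be sketched as follows. This is a minimal illustration rather than the handout's own implementation: the function name `nw_matrix`, the two 7-letter example sequences and the +1/-1/-1 scores are assumptions made for the sketch.

```python
def nw_matrix(s1: str, s2: str, match: int = 1,
              mismatch: int = -1, gap: int = -1) -> list:
    """Fill the Needleman-Wunsch score matrix (rows follow s1, columns s2)."""
    rows, cols = len(s1) + 1, len(s2) + 1
    F = [[0] * cols for _ in range(rows)]
    # First row and column: each step to the right or down is one more indel.
    for j in range(1, cols):
        F[0][j] = F[0][j - 1] + gap
    for i in range(1, rows):
        F[i][0] = F[i - 1][0] + gap
    # Remaining cells: best of diagonal (pair letters), top and left (indels).
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if s1[i - 1] == s2[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + pair,
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)
    return F


matrix = nw_matrix("GCATGCG", "GATTACA")
print(matrix[0])  # first row of scores: 0, -1, ..., -7
for row in matrix:
    print(row)
```

The bottom-right cell of the returned matrix holds the score of the best global alignment.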

Figure 12 Needleman wunsch algorithm way of computation of nucleotides

Module 025: NEEDLEMAN WUNSCH ALGORITHM-II

In a dot matrix plot, alignments appear as unbroken diagonals. In this way we can create numerous combinations.

Figure 13 Various combinations of sequences through dot plot

If the sequences are long, there will be many diagonal alignments, and at the end we select the best alignment from among all the combinations. For this we use the Needleman-Wunsch algorithm.

In the figures that follow, the first row and first column are set to 0 (Module 024 instead fills them with the cumulative gap scores 0, -1, -2 and so on).

Figure 14 Initial column and row are kept zero (0)

Moving left to right and top to bottom, the best (highest-scoring) element is selected for each cell.

Figure 15 maximum score element is selected from all three sides comparison

The terms for match, mismatch and gap are:

Alpha = Match reward

Beta = Mismatch penalty

Gamma = Gap penalty

The matrix is computed progressively until the bottom-right element is reached.

Module 026: NEEDLEMAN WUNSCH ALGORITHM-III

In the Needleman-Wunsch algorithm the matrix is filled from left to right and top to bottom, and for each cell we select the highest of the scores coming from the left, the top and the diagonal.

Figure 16 Filling order of sequence alignment

We select the best score at each step to build the best alignment, and the matrix is computed progressively until we reach the bottom-right cell.

Module 027: NEEDLEMAN WUNSCH ALGORITHM EXAMPLE

The top, left and diagonal elements are considered when calculating an element in the matrix. Match, mismatch and gap scores are computed from all three sides: left to right, top to bottom, and diagonal.

For example:

Figure 17 score computation from all sides

Figure 18 the best score is computed in diagonal way

DNA, RNA and protein sequences can all be aligned using the Needleman-Wunsch algorithm.

Module 028: BACKTRACKING ALIGNMENTS

To find an optimal alignment in the Needleman-Wunsch algorithm we use the traceback method.

Figure 19 Traceback starts from the final score at the end of the matrix.

After the matrix has been completely calculated, we apply traceback to find the optimal alignment. Traceback starts from the bottom-right cell, which holds the score of the best global alignment, and proceeds towards the top-left.
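Traceback can be sketched as below. This is one possible implementation under assumed scores of +1 (match), -1 (mismatch) and -1 (gap); the function name `needleman_wunsch` is illustrative, and when several alignments are equally good it returns just one of them, preferring the diagonal move.

```python
def needleman_wunsch(s1: str, s2: str, match: int = 1,
                     mismatch: int = -1, gap: int = -1):
    """Return (score, aligned_s1, aligned_s2) for one optimal global alignment."""
    n, m = len(s1), len(s2)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if s1[i - 1] == s2[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + pair,
                          F[i - 1][j] + gap,
                          F[i][j - 1] + gap)

    # Traceback from the bottom-right cell back to the top-left cell.
    a1, a2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            pair = match if s1[i - 1] == s2[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + pair:   # came from the diagonal
                a1.append(s1[i - 1]); a2.append(s2[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] + gap:  # came from the top
            a1.append(s1[i - 1]); a2.append("-")
            i -= 1
        else:                                       # came from the left
            a1.append("-"); a2.append(s2[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(a1)), "".join(reversed(a2))


score, row1, row2 = needleman_wunsch("ACTGACTG", "ACGACTG")
print(score)
print(row1)
print(row2)
```

For these two sequences the only way to reach the best score is to place a single gap opposite the extra T, so the second row is printed as `AC-GACTG`.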

Module 029: REVISITING LOCAL AND GLOBAL ALIGNMENTS

We use the Needleman-Wunsch algorithm to score the alignment of two sequences and the traceback method to find the optimal alignment.

In the Needleman-Wunsch algorithm we start traceback from the bottom-right cell and move progressively towards the top-left, which gives us the global alignment.

Figure 20 Traceback in the Needleman-Wunsch algorithm

If instead we allow traceback to start from any point in the matrix, as in the Smith-Waterman algorithm, we obtain local alignments.

Module 030: OVERLAP MATCHES

The dot plot and the Needleman-Wunsch algorithm are closely related methods. The dot plot helps us find matching residues between two sequences, while Needleman-Wunsch helps us find global alignments.

If the sequences have leading or trailing regions that do not match each other, we may still prefer a global rather than a local alignment, but one that does not penalize the leading or trailing ends.

Figure 21 Leading and trailing edge mismatches versus global alignment by gap insertion (stretching) of sequences

In this case traceback can start from any cell on the edge of the matrix. Such tracebacks help us find the overlaps between the aligned sequences.

Figure 22 Traceback method

Module 031: EXAMPLE

A slight variation in traceback can help us find overlaps between sequences and lets us apply some interesting strategies in sequence alignment.

The following example of an amino acid alignment illustrates the ways of performing traceback.

Figure 23 Traceback in amino acid sequence alignment

Sequences are:

Scoring strategy is:

Match = +2, Mismatch = -1, Gap = -2

Module 032: MOVING FROM GLOBAL TO LOCAL ALIGNMENT

DNA has coding and noncoding regions. Coding regions are called "exons"; they are expressed as protein and remain more conserved because of their role in making functional proteins.

Noncoding regions of DNA are called "introns"; they are more likely to accumulate mutations than coding regions. This means a high degree of alignment can be found between two exons.

In local alignment we use small segments of sequences, through which we can find exons and hence "functional subunits". However, the term exon is often misused to refer only to coding sequences for the final protein. This is incorrect, since many noncoding exons are known in human genes.

(Zhang 1998)

Zhang, M. Q. (1998). "Statistical features of human exons and their flanking regions." Human molecular genetics 7(5): 919-932.

Module 033: SMITH WATERMAN ALGORITHM

In global alignment we compare the sequences from end to end, but in local alignment we compare segments of the sequences.

For global pairwise alignment we use the Needleman-Wunsch algorithm, while for local pairwise alignment we use the Smith-Waterman algorithm.

The Smith-Waterman algorithm differs from Needleman-Wunsch as follows:

 Top row and first column are set to zero.

 Alignment can end anywhere.

 Traceback starts from the highest score.

Local alignments can identify the coding portions of DNA, and in the same way we can find functional domains in protein sequences.

Figure 24 Global and local sequence alignment comparison.

Module 034: EXAMPLE OF SMITH WATERMAN ALGORITHM

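The only difference between the Needleman-Wunsch and Smith-Waterman recurrences is that zero "0" is placed in the relationship:

$$
C(i,j) = \max \left\{ \begin{array}{l} 0 \\ C(i-1,\,j-1) + \mathrm{score}(i,j) \\ C(i-1,\,j) - \gamma \\ C(i,\,j-1) - \gamma \end{array} \right.
$$

where $\gamma$ is the gap penalty (Gamma in Module 025).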

In the matrix, the top row and the first column are filled with zeros.

Figure 25 Top row and first column are filled with zeros in the Smith-Waterman algorithm

Local alignments can be extracted by starting from a high score and tracing back until a '0' is reached.
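A compact sketch of this procedure is given below. It assumes the scoring of Module 031 (match +2, mismatch -1, gap -2); the function name `smith_waterman` and the two example sequences are made up for illustration, and only the single best local alignment is reported.

```python
def smith_waterman(s1: str, s2: str, match: int = 2,
                   mismatch: int = -1, gap: int = -2):
    """Return (score, aligned_s1, aligned_s2) for the best local alignment."""
    n, m = len(s1), len(s2)
    H = [[0] * (m + 1) for _ in range(n + 1)]  # top row and first column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if s1[i - 1] == s2[j - 1] else mismatch
            # The extra 0 lets an alignment restart anywhere.
            H[i][j] = max(0,
                          H[i - 1][j - 1] + pair,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Traceback from the highest score until a 0 is reached.
    a1, a2 = [], []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0:
        pair = match if s1[i - 1] == s2[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + pair:
            a1.append(s1[i - 1]); a2.append(s2[j - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            a1.append(s1[i - 1]); a2.append("-")
            i -= 1
        else:
            a1.append("-"); a2.append(s2[j - 1])
            j -= 1
    return best, "".join(reversed(a1)), "".join(reversed(a2))


# The shared segment ACGT is found even though the flanks differ.
print(smith_waterman("TTTACGTAAA", "GGGACGTCCC"))
```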

Module 035: REPEATED ALIGNMENTS

We can find the best local alignment by using the Smith-Waterman algorithm.

By changing the traceback strategy we can also find repeated sequences.

We use a threshold score "T" for matches, which excludes low-scoring local alignments. Traceback can then help us find multiple aligned regions.

Figure 27 (-5) is the threshold score in the table

With some modifications to the Smith-Waterman algorithm, this threshold scoring method can help us find many matching regions in amino acid or DNA sequences.

Module 036: EXAMPLES OF REPEATED ALIGNMENTS

A slight modification of the Smith-Waterman model can help us find exons as well as functional units in a sequence. Matches should end at the threshold score, or we should keep track of the maximum score in the sequence.

Figure 28 Traceback from different sides to find the maximum or threshold (T) score.

Traceback should start from the last element of the row, proceed to the top element, and then move to the highest score in the column. This traceback is done twice and ends at the point where the score becomes "0, 0".

Module 037: INTRODUCTION TO SCORING ALIGNMENTS

There are two types of alignments;

 Optimal Alignments

 Best Alignment

The scoring scheme used for sequence matches plays a crucial role in producing an optimal alignment. An optimal alignment should be:

 Appropriately rewarded for matches and mismatches.

 INTRODUCTION

We identify the pairs of symbols that most frequently appear aligned in sequences. This tells us how often a specific amino acid or nucleotide is substituted by another one in a sequence.

For example, nucleotide pairs such as AA show specific patterns of substitution. The same holds for pairs of amino acids in protein sequences, where certain substitutions help preserve the function of the protein.

 CONCLUSION

Using statistics, we can align protein or DNA sequences better: optimal gap penalties and the scores for insertions and deletions can be derived statistically.

Module 038: MEASURING ALIGNMENTS SCORES

Scores for both matches and mismatches are taken into account during sequence alignment.

For example:

The matrix contains both positive and negative scores. Matches and mismatches are therefore all considered along a diagonal path.

If we build scoring matrices with realistic match and mismatch scores, we can score alignments in a way that reflects real biological sequences.

Figure 29 Needleman wunsch algorithm match, mismatch scoring

Module 039: SCORING MATRICES

Alignments are used to compare biological sequences. Amino acids and nucleotides of similar chemical nature are more easily substituted for one another.

Because amino acids are substituted with different probabilities, we need flexible scoring. Scoring matrices provide this flexible scoring during alignment.

To build scoring matrices, we analyze which amino acids and nucleotides are substituted for one another in gene and protein sequences.

Scoring matrices contain both positive and negative values: positive values for matches and negative values for mismatches.

Figure 0.12 Ubiquitin Protein where amino acids matching

Different type of scoring matrices can be developed based on underlying strategy.

Module 040: DERIVING SCORING MATRICES

Each amino acid has different properties.

Figure 0.13 Properties of amino acids (Image: Esquivel et al. (2013))

Each amino acid also occurs with a different frequency.

When we compare sequences, residues match and mismatch according to their frequencies.

Based on these frequencies, we score the matches and mismatches in sequence alignments.

Module 041: PAM MATRICES

Scoring matrices are a very useful way to score the matches and mismatches in a sequence alignment.

There are two types of scoring matrices:

 PAM

 BLOSUM

PAM means "Point Accepted Mutation".

A point accepted mutation is the substitution of one amino acid in a sequence by another such that the function of the protein remains conserved.

 PAM UNIT

One PAM unit is the amount of evolutionary time over which 1% of amino acids undergo an accepted mutation. If two sequences have diverged by 100 PAM units, it does not mean that they differ at every position, because the same position can mutate more than once.

 STEPS TO COMPUTE PAM MATRICES

1. Align protein sequences that have diverged by 1 PAM unit.

2. Let Ai,j be the number of times Ai is substituted by Aj.

3. Compute the frequency fi of amino acid Ai.

Then, PAM1 = pi,j =

PAMn = (PAM1)^n
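The relation between PAM1 and PAMn is a matrix power, which the sketch below illustrates. The two-letter alphabet and the values in `PAM1_TOY` are entirely hypothetical (a real PAM1 matrix is 20 × 20 and derived from aligned proteins); they only show how repeated multiplication behaves.

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    size = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]


def mat_pow(M, n):
    """Raise a square matrix to the n-th power by repeated multiplication."""
    size = len(M)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, M)
    return result


# Hypothetical 2-letter "PAM1": each letter changes with probability 1%.
PAM1_TOY = [[0.99, 0.01],
            [0.01, 0.99]]

pam2 = mat_pow(PAM1_TOY, 2)
pam100 = mat_pow(PAM1_TOY, 100)
print(pam2[0])    # probabilities after 2 PAM units
print(pam100[0])  # after 100 PAM units the letter is still often unchanged
```

In this toy model the chance of seeing the original letter after 100 PAM units stays above one half, which matches the remark that 100 PAM units does not mean every position differs.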

Module 042: BLOSUM MATRICES

BLOSUM matrices can be used to align protein sequences. They were first proposed in 1992 by Henikoff and Henikoff.

BLOSUM stands for BLOcks SUbstitution Matrix. The matrices are derived from blocks, which are aligned regions without any gaps, although they may contain mismatches.

There are three steps to compute the BLOSUM matrices.

Step 1: Eliminate sequences that are identical in more than x% of positions

Step 2: Compute the observed frequency fi,j of the aligned pair Ai to Aj. Hence, fi,j becomes the probability of aligning Ai and Aj in the selected blocks.

Step 3: Compute fi, which is the frequency of observing Ai in the entire block

Typically used matrices are BLOSUM62 and PAM120. In PAMx, a larger x suits more divergent sequences, whereas in BLOSUMx a smaller x suits more divergent sequences.

Figure 0.14 Sequence of amino acids which have mismatches but no gap.

Figure 0.15 Formula for computation of BLOSUM matrices.

Module 043: MULTIPLE SEQUENCE ALIGNMENT

In pairwise sequence alignment we compare a pair of sequences, and scoring matrices are used to score the alignment.

In multiple sequence alignment we compare several protein or DNA sequences to identify their matches and mismatches.

For pairwise alignment we use dynamic programming, but for multiple alignment this would be computationally very expensive. The solution is progressive alignment.

M Q V K L F T P L H D K S D H G K Y H
M Q V K I F T P L H D K S - H G K S H
M Q V H L Y - P L H D K S - T G K S H
M Q V H L F - P L H D K S D T G K S H
M Q V K L Y T P L H D K S D H G K Y H

Figure 0.16 Multiple sequence alignment

Module 044: MORE ON MULTIPLE SEQUENCE ALIGNMENT

MSA helps compare several sequences by aligning them. It can extract consensus sequences from several aligned sequences and characterize protein families based on homologous regions.
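As a small illustration of extracting a consensus, the sketch below takes the five aligned sequences of Figure 0.16 and picks the most common symbol in each column. Choosing the simple majority symbol (and counting "-" like any other symbol) is an assumption of this sketch; real tools use more refined rules.

```python
from collections import Counter


def consensus(rows):
    """Most common symbol in each column of an alignment."""
    if len({len(row) for row in rows}) != 1:
        raise ValueError("all aligned rows must have the same length")
    return "".join(Counter(column).most_common(1)[0][0]
                   for column in zip(*rows))


msa = [
    "MQVKLFTPLHDKSDHGKYH",
    "MQVKIFTPLHDKS-HGKSH",
    "MQVHLY-PLHDKS-TGKSH",
    "MQVHLF-PLHDKSDTGKSH",
    "MQVKLYTPLHDKSDHGKYH",
]
print(consensus(msa))  # MQVKLFTPLHDKSDHGKSH
```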

APPLICATION OF MSA

 Predict secondary and tertiary structures of new protein sequences

 Evaluate evolutionary order of species or “Phylogeny”

METHODOLOGY

 Pairwise alignment is the alignment of two sequences

 MSA can be performed by repeated application of pairwise alignment

Figure 0.17 Methodology

Figure 0.18 sequence alignments

CONCLUSION

MSA can help align multiple sequences, and progressive alignment is one way to perform MSA. Sequences with more than 80% similarity need to be removed.

Figure 0.19 CLUSTAL – Online tool

http://www.ebi.ac.uk/Tools/msa/clustalo/

Module 045: PROGRESSIVE ALIGNMENT FOR MSA

MSA involves progressive alignment of sequences. Doing so many progressive alignments can be slow.

STEPS:

 Step 1 : Pairwise Alignment of all sequences

Example: for four sequences S1, S2, S3 and S4 there are 6 pairwise comparisons.

 Step 2: Construct a guide tree (dendrogram) using a distance matrix.

 Step 3: Perform progressive alignment following the branching order of the tree.

Figure 0.20 Similarity Matrix
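The count of pairwise comparisons in Step 1 follows from choosing 2 sequences out of n, that is n(n-1)/2. A short sketch with `itertools.combinations` lists the pairs for the four sequences named above; the helper name `pair_count` is illustrative.

```python
from itertools import combinations


def pair_count(n: int) -> int:
    """Number of pairwise comparisons among n sequences."""
    return n * (n - 1) // 2


names = ["S1", "S2", "S3", "S4"]
pairs = list(combinations(names, 2))
print(pairs)          # every unordered pair exactly once
print(len(pairs))     # 6
print(pair_count(4))  # 6
```

The quadratic growth of this count is one reason why aligning many sequences becomes slow.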

 SHORTCOMING OF THIS APPROACH

 Dependence upon initial alignments

 If sequences are dissimilar, errors in alignment are propagated

 Solution: Begin by using an initial alignment, and refine it repeatedly

Progressive alignments are used in aligning multiple sequences. Iterative approaches can help refine results from progressive alignments.

Module 046: MSA-EXAMPLE

MSA involves progressive alignment of sequences. Doing so many progressive alignments can be slow.

For example:

Figure 0.21 MSA on globin sequences

Figure 0.22 Progressive alignment using sequential branching

Figure 0.23 Progressive alignment following a guide tree

Figure 0.24 Alignment results

MSA can be better performed using clustering strategies followed by alignment of the alignments later. CLUSTAL is a free online tool that does all of this for us!

Module 047: CLUSTALW

MSA involves progressive alignment of sequences. Doing so many progressive alignments can be slow. CLUSTALW is an online tool to perform MSA.

It was developed by the European Molecular Biology Laboratory & European Bioinformatics Institute. It performs alignment in two modes:

 slow/accurate

 fast/approximate

SCOPE

 create multiple alignments,

 optimize existing alignments,

 profile analysis &

 create phylogenetic trees

http://www.genome.jp/tools/clustalw

Module 048: INTRODUCTION TO BLAST-I

BLAST, the "Basic Local Alignment Search Tool", was developed in 1990 at the National Center for Biotechnology Information (NCBI), USA. It searches databases for query protein and nucleotide sequences, and also searches for translation products. It is available online:

www.blast.ncbi.nlm.nih.gov/Blast.cgi

BLAST can be used to search for local alignments of protein and nucleotide sequences. It is available online and can perform searches across species and organisms.

Module 049: INTRODUCTION TO BLAST-II

www.blast.ncbi.nlm.nih.gov/Blast.cgi

Smith-Waterman computes a full local alignment of the sequences using dynamic programming, whereas BLAST does this in an approximate way. Hence BLAST is faster but does not guarantee an optimal alignment. The input to BLAST is a FASTA-formatted sequence and a set of search parameters.

OUTPUT OF BLAST

Results are shown in HTML, plain text and XML formats. A table lists the sequence hits found along with their scores. Users can read off this table and evaluate the results.

Figure 0.25 Input to BLAST: Gene IDs

Figure 0.26 Input to BLAST: Protein IDs

Figure 0.27 Results from BLAST

Module 050: BLAST ALGORITHM

BLAST can search sequence databases and identify unknown sequences by comparing them to the known sequences. This can help identify the parent organism, function and evolutionary history.

For example:

Query sequence: PQGELV

Make a list of all possible words (length 3 for proteins)

PQG (score 15)

QGE (score 9)

GEL (score 12)

ELV (score 10)

Assign scores from Blosum62, use those with score> 11: PQG & GEL

Mutate words such that score still > 11

PQG (score 15) similar to PEG (score 13)

At the end, we get: PQG, GEL and PEG

Find all database sequences that have at least 2 matches among our 3 words: PQG, GEL and PEG. Then extend the alignment around each database hit to form a High-scoring Segment Pair (HSP):

High-scoring Segment Pair: PQGI (score 8+5+5+2)

If two HSPs in the query sequence are fewer than 40 positions apart, a full dynamic programming alignment is performed on the query and hit sequences.

BLAST performs quick alignments on sequences. The results are tabulated with alignment regions overlapping each other. Statistical evaluation is also provided alongside
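The first steps of the example, listing the words and keeping those above the threshold, can be sketched as follows. The word scores are simply the values quoted in the example above rather than values computed from BLOSUM62 here, and the function names `words` and `seed_words` are illustrative.

```python
def words(query: str, k: int = 3):
    """All overlapping words of length k in the query."""
    return [query[i:i + k] for i in range(len(query) - k + 1)]


# Scores as quoted in the handout example (not recomputed in this sketch).
WORD_SCORES = {"PQG": 15, "QGE": 9, "GEL": 12, "ELV": 10}


def seed_words(query: str, scores: dict, threshold: int = 11):
    """Keep only the words whose score exceeds the threshold."""
    return [w for w in words(query) if scores[w] > threshold]


print(words("PQGELV"))                    # ['PQG', 'QGE', 'GEL', 'ELV']
print(seed_words("PQGELV", WORD_SCORES))  # ['PQG', 'GEL']
```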

Module 051: TYPES OF BLAST

BLAST can search sequence databases and identify unknown sequences by comparing them to the known sequences. This can help identify the parent organism, function and evolutionary history.

There are two main types of BLAST.

Nucleotides

• Blastn: Compares a nucleotide query sequence against a nucleotide database.

Proteins

• Blastp: Compares an amino acid query sequence against a protein database.

There are also many other types of BLAST:

 Blastx:

Compares a nucleotide query sequence against a protein sequence database.

Helps find potential translation products of unknown nucleotide sequences

 tblastn:

Compares a protein query sequence against a nucleotide sequence database

Nucleotide sequence dynamically translated into all reading frames

 tblastx:

Compares the six-frame translated proteins of a nucleotide query sequence against the six-frame translated proteins of a nucleotide sequence database.

• BLAST performs quick alignments on biological sequences

• Several types of BLAST exist which can assist in comparing nucleotide sequences with amino acids and vice versa
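The six-frame translation used by blastx, tblastn and tblastx can be sketched as below. The sketch assumes the standard genetic code, written compactly in T, C, A, G order, and an input containing only the letters A, C, G and T; the function names are illustrative and this is not how BLAST itself is implemented.

```python
BASES = "TCAG"
# Standard genetic code; '*' marks a stop codon.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna: str) -> str:
    """Reverse complement of a DNA string (A, C, G, T only)."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna: str) -> str:
    """Translate complete codons from the start of dna; leftovers are ignored."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frames(dna: str):
    """Three forward frames followed by three reverse-complement frames."""
    dna = dna.upper()
    strands = (dna, reverse_complement(dna))
    return [translate(strand[offset:]) for strand in strands for offset in range(3)]


print(six_frames("ATGGCC"))  # ['MA', 'W', 'G', 'GH', 'A', 'P']
```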

Module 052: SUMMARY OF BLAST

BLAST can search sequence databases and identify unknown sequences by comparing them to the known sequences. This can help identify the parent organism, function and evolutionary history.

Step 1: Obtain a query sequence

Step 2: Choose a type of BLAST

Step 3: Set up the search parameters

Step 4: Tabulated search results

Figure 0.28 tabulated search results

Module 053: INTRODUCTION TO FASTA

To compare two sequences we use pairwise sequence alignment, and to compare many sequences we use multiple sequence alignment. For local alignment we use the Smith-Waterman algorithm, and for global alignment we use the Needleman-Wunsch algorithm.

Both local and global alignment are dynamic programming approaches. Comparing many sequences this way takes time, so we use BLAST, an approximate local alignment search tool that compares a large number of sequences quickly. FASTA takes a similar approach.

FASTA was developed in 1988 and performs fast alignment. It searches databases for query protein and nucleotide sequences, and was later improved upon by BLAST.

Figure 0.29 Regions of absolute identity

http://www.ebi.ac.uk/Tools/sss/fasta/

Figure 0.30 Nucleotide FASTA

Figure 0.31 Protein FASTA

Module 054: INTRODUCTION TO FASTA-II

FASTA – Fast Alignment Algorithm. Classical global and local alignment algorithms are time consuming. FASTA achieves alignment by using short lengths of exact matches.

 USES OF FASTA

FASTA relies on aligning subsequences of absolute identity. Input to a FASTA search can be in FASTA, EMBL, GenBank, PIR, NBRF, PHYLIP or UniProt formats.

 OUTPUT OF FASTA

Results are output in a visual format along with functional prediction. A table lists the sequence hits found along with their scores. Users can click on each reported match to look at the details.

Figure 0.32 Input to FASTA: Gene IDs

Figure 0.33 Input to FASTA: Protein Sequence

Figure 0.34 Results from FASTA

Module 055: FASTA ALGORITHM

FASTA can search sequence databases and identify unknown sequences by comparing them to the known sequence databases. This can help obtain information on the parent organism, function and evolutionary history.

STEP1: Local regions of identity are found

STEP2: Rescore the local regions using PAM or BLOSUM matrix

STEP3: Eliminate short diagonals below a cutoff score

STEP4: Create a gapped alignment in a narrow segment and then perform Smith-Waterman alignment within it
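STEP1 can be pictured with the sketch below, which looks up short exact words shared by two sequences and groups the hits by diagonal. Using a word (k-tuple) index for this is an assumption based on the usual description of FASTA, and the word length `k = 2`, the function name `diagonal_hits` and the example sequences are made up for illustration.

```python
from collections import defaultdict


def diagonal_hits(query: str, target: str, k: int = 2) -> dict:
    """Count exact k-letter word matches on each diagonal (target pos - query pos)."""
    index = defaultdict(list)
    for i in range(len(query) - k + 1):
        index[query[i:i + k]].append(i)  # where each word occurs in the query

    hits = defaultdict(int)
    for j in range(len(target) - k + 1):
        for i in index.get(target[j:j + k], ()):
            hits[j - i] += 1             # same offset means same diagonal
    return dict(hits)


hits = diagonal_hits("ACGT", "TTACGT")
print(hits)                      # {2: 3}
print(max(hits, key=hits.get))   # diagonal with the most word matches: 2
```

Diagonals with many hits mark the local regions of identity that the later steps rescore, filter and join.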

Module 056: TYPES OF FASTA

There are six types of FASTA:

 fasts35

Compare unordered peptides to a protein sequence database

 fastm35

Compare ordered peptides (or short DNA sequences) to a protein (DNA) sequence database

 fasta35

Scan a protein or DNA sequence library for similar sequences

 fastx35

Compare a translated DNA sequence (6 ORFs) to a protein sequence database

 tfastx35

Compare a protein sequence to a DNA sequence database (6 ORFs)

 fasty35

Compare a DNA sequence (6 ORFs) to a protein sequence database

FASTA performs quick alignments on biological sequences. Several types of FASTA exist which can assist in comparing DNA/RNA/Protein sequences with each other

Module 057: SUMMARY OF FASTA

FASTA can quickly search sequence databases when given a query sequence. Multiple types of FASTA exist which assist in aligning DNA/RNA/Protein sequences

Figure 0.35 Step 1: Obtain a query sequence

Figure 0.36 Step 2: Choose a type of FASTA

Figure 0.37 Type of FASTA

http://fasta.bioch.virginia.edu/fasta_docs/fasta35.shtml

Figure 0.38 Step 3: Setup Search Parameters

Figure 0.39 Step 4: Tabulated Search Results

Figure 0.40 Tabulated data

Module 058: BIOLOGICAL DATABASE AND ONLINE TOOLS

All molecular information on RNA, DNA and proteins needs to be stored and retrieved. Sequences are obtained from genome sequencing and mass spectrometry.

Structures are obtained from X-Ray Crystallography, Atomic Force Microscopy & Nuclear Magnetic Resonance Spectroscopy.

Vast amounts of such data exist. Moreover, this data is rapidly accumulating. Online databases are formed to store and share this data.

 OBJECTIVE

 Make biological data available to scientists in computer-readable form

 For handling, sharing and analysis of the data

 The best way to share is to keep this data on the web

Several sequence, structure and molecular interaction databases exist. These are available online on the web. Users can freely access and download such data

Module 059: EXPASY

ExPASy is developed by the Swiss Institute of Bioinformatics (SIB). The website provides access to databases and tools. Resources for proteomics, genomics, phylogeny, systems biology, population genetics, transcriptomics etc. can be searched.

http://www.expasy.org/

Figure 0.41 flowchart

Figure 0.42 prosite scanning section

Figure 0.43 peptide mass finding

Figure 0.44 for local use of protein sequencing

Figure 0.45 potential protein finding tool

Module 060: UNIPROT AND SWISSPROT

Both UniProt and Swiss-Prot are online databases for proteins.

Figure 0.46 A gene, protein or chemical can be found

Figure 0.47 online database for proteins

Swiss-Prot contains human curated protein information

 Accession number, unique identifier

 The sequence

 Molecular mass

 Observed and predicted modifications

Protein sequences from various species and organisms can be found in UniProt. Swiss-Prot is the manually annotated section of the UniProt database.

Module 061: PROTEIN DATA BANK

The Protein Data Bank is the premier resource for protein structures. These structures have been determined using experimental techniques. It is open and free to use.

Figure 0.48 protein data bank

Figure 0.49 P0CG47 - UBB_HUMAN

Figure 0.50 P0CG47 - UBB_HUMAN

Figure 0.51 searched results

Protein Data Bank provides Cartesian coordinates of each atom in the protein structure. Over 50,000 protein structures are reported and present in this database
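As a rough illustration of how these coordinates can be read, the sketch below parses one ATOM record of the fixed-column PDB text format, in which x, y and z occupy columns 31-38, 39-46 and 47-54. The record is built inside the sketch with made-up values purely for demonstration, and the function name `atom_coordinates` is illustrative.

```python
def atom_coordinates(line: str):
    """Return (x, y, z) from one ATOM/HETATM line of a PDB-format file."""
    if not line.startswith(("ATOM", "HETATM")):
        raise ValueError("not an ATOM or HETATM record")
    # Columns 31-38, 39-46 and 47-54 (1-based) hold x, y and z.
    return float(line[30:38]), float(line[38:46]), float(line[46:54])


# Hypothetical record assembled field by field so the columns line up.
record = (
    "ATOM  "            # record name, columns 1-6
    + f"{1:>5}"         # atom serial number, columns 7-11
    + " "
    + f"{'N':<4}"       # atom name, columns 13-16
    + " "               # alternate location indicator
    + "MET"             # residue name, columns 18-20
    + " "
    + "A"               # chain identifier, column 22
    + f"{1:>4}"         # residue sequence number, columns 23-26
    + " " * 4           # insertion code and padding, columns 27-30
    + f"{27.340:8.3f}{24.430:8.3f}{2.614:8.3f}"
)
print(atom_coordinates(record))  # (27.34, 24.43, 2.614)
```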

Module 062: REVIEW OF SEQUENCE ALIGNMENT

We use next generation sequencing and whole genome sequencing to obtain the genetic information. For protein sequencing we use Mass Spectrometry and Edman Degradation

STORAGE:

 Sequence information is stored digitally

 Databases are designed to store sequence data

 Several databases exist depending on the type of sequence data

SHARING AND ACCESS:

 Sequence databases are shared via online websites

 Access to several such websites is free

 Data can be downloaded or searched on these website

USAGE OF DATA:

Sequence data can be used to obtain:

 Similarity of sequences

 Evolutionary History

 Predict the function of molecules

Module 063: GENBANK

 Maintained by the National Center for Biotechnology Information (NCBI)

 An annotated collection of publicly available nucleotide sequences

 Website provides access to the sequence data and search tools

Figure 0.52 http://www.ncbi.nlm.nih.gov/genbank/

Several sequence, structure and molecular interaction databases exist. These are available online on the web. Users can freely access and download such data


Module 064: ENSEMBL

Because the human brain is limited in how much information it can remember and store over a long time, we use online databases for the storage of molecular information.

Ensembl is a genome search engine which is used to search the genomes of many recorded species.

http://asia.ensembl.org/index.html

Handout Chapter 03: Molecular Evolution


Chapter 3 - Molecular Evolution

Module 001: Molecular Evolution & Phylogeny

Molecular evolution is the process of change in the sequence composition of cellular molecules such as DNA, RNA, and proteins across generations. The field of molecular evolution uses principles of evolutionary biology and population genetics to explain patterns in these changes. Genes and proteins are modified in this process.

All molecules have an evolutionary history. Phylogenetics is the science of studying evolutionary relationships. Phylogenetics has led to the creation of relationship trees between various species of Bacteria, Archaea, and Eukaryota.

(Page and Holmes 2009)

Figure 0.1 Phylogenetic tree


Types of Phylogenetic Trees

Scaled Trees

• Branch lengths are proportional to the amount of change between nodes

Unscaled Trees

• Branches represent only the relationships between sequences; their lengths carry no meaning

Figure 0.2 phylogenetic tree inference

Conclusion

Phylogenetics is the study of extracting evolutionary relationships between species. Sequence information from each species is used to measure the difference between the species.

Page, R. D. and E. C. Holmes (2009). Molecular evolution: a phylogenetic approach, John Wiley & Sons.


Module 002: Evolution of Sequences

DNA acts as the cellular memory unit, and proteins are the translated products of DNA-coded information. Evolution is very important for survival in different types of environments. There are several methods which bring about change or evolution in an organism. (Kluger 2015)

Method of Change

DNA gets modified by:

 Mutation & Substitution

 Insertion

 Deletion

Discussion

Over time, species evolve to adapt to their circumstances. Since the environment and circumstances may be different for each species, they evolve uniquely. Each cell may encounter unique evolutionary pressures in its struggle for life, and the order in which these pressures are presented to the cell is also unique. Combinations of evolutionary factors are involved in evolution. The evolutionary events and their combinations impart relationships between sequences. These relationships are explored in phylogenetics. Several algorithms exist for finding such relationships.

Kluger, M. J. (2015). Fever: its biology, evolution, and function, Princeton University Press.



Module 003: Concepts and Terminologies - I

To understand the concept of evolution we follow some rules. Phylogenetics involves processing sequence information from different species to find evolutionary relationships. The output of such studies includes phylogenetic trees.

Figure 3 phylogenetic tree from ancestor to evolution

In the above figure, point A stands for the ancestor. With the passage of time evolution occurred, and the genome sequences of the organisms changed.

Figure 4 layout of trees

All of these tree layouts have the same meaning.


Figure 5 rooted tree

Root node is the ancestor of all other nodes. The direction of evolution is from ancestor to the terminal nodes.

Conclusion

Phylogenetics specifies evolutionary relationship with the help of trees. Trees can be rooted or unrooted. Rooted trees can show temporal evolutionary direction.


Module 004: Concepts and Terminologies - II

Rooted and Unrooted trees can be used to show phylogenetic relationships between sequences. Let’s examine the properties of these trees further.

Figure 6 rooted tree vs unrooted

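Rooted trees are computationally more expensive than unrooted trees, because many more rooted trees are possible for the same set of sequences.

http://everything.explained.today/Computational_phylogenetics/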

https://github.com/joey711/phyloseq/issues/597

Figure 7 computation comparison
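One way to see the difference in computational cost is to count the possible tree shapes. The sketch below is only an illustration: the function names are our own, and it counts strictly bifurcating trees, for which the standard counts are (2n-5)!! unrooted and (2n-3)!! rooted topologies for n sequences.

```python
def double_factorial(k):
    """Return k * (k - 2) * (k - 4) * ... down to 1 (1 when k <= 1)."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def count_unrooted_trees(n):
    """Number of unrooted bifurcating tree topologies for n >= 3 sequences."""
    if n < 3:
        raise ValueError("need at least 3 sequences")
    return double_factorial(2 * n - 5)


def count_rooted_trees(n):
    """Number of rooted bifurcating tree topologies for n >= 2 sequences."""
    if n < 2:
        raise ValueError("need at least 2 sequences")
    return double_factorial(2 * n - 3)


# Print the counts side by side for a few values of n
for n in (3, 4, 5, 10):
    print(n, count_unrooted_trees(n), count_rooted_trees(n))
```

For 10 sequences this gives 2,027,025 unrooted but 34,459,425 rooted topologies, so a search over rooted trees has far more candidates to examine.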

Conclusion

Rooted and Unrooted trees have their own advantages and disadvantages. Depending on our requirement, we can choose between them.


Module 005: Algorithms and Techniques

Rooted and Unrooted trees can be used to show phylogenetic relationships between sequences. Several types of algorithms exist which are divided into two classes. There are many methods for constructing evolutionary trees.

Figure 9 construction methods

UPGMA (Unweighted Pair Group Method with Arithmetic Mean) is a simple agglomerative (bottom-up) hierarchical clustering method. The method is generally attributed to Sokal and Michener.

In this method, the two sequences with the shortest evolutionary distance between them are considered first. These sequences were the last to diverge, and they are represented by the most recent internal node.

Least Squares Distance Method: branch lengths represent the “observed” distances between sequences (i and j).


Find branch lengths X, Y and Z such that the observed distances D(i, j) are conserved.
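As a small illustration of this question, assume (our assumption, since the figure is not reproduced here) that three sequences 1, 2 and 3 hang from a single internal node on branches of length X, Y and Z. Then D(1,2) = X + Y, D(1,3) = X + Z and D(2,3) = Y + Z, and the three branch lengths can be solved for exactly. The function name below is hypothetical.

```python
def three_taxon_branch_lengths(d12, d13, d23):
    """Solve X + Y = d12, X + Z = d13, Y + Z = d23 for the branch lengths."""
    x = (d12 + d13 - d23) / 2.0
    y = (d12 + d23 - d13) / 2.0
    z = (d13 + d23 - d12) / 2.0
    return x, y, z


x, y, z = three_taxon_branch_lengths(3.0, 4.0, 5.0)
print(x, y, z)  # 1.0 2.0 3.0
```

With more than three sequences the observed distances usually cannot all be matched exactly, which is where a least squares fit comes in: the branch lengths are chosen to make the squared differences between observed and tree distances as small as possible.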

Conclusion

 Several methods exist for constructing phylogenetic trees.

 Broadly, they belong to objective methods or clustering methods.

 We will study UPGMA and Distance Methods.

Module 006: Introduction to UPGMA

Phylogenetic trees can be used to show phylogenetic relationships between sequences. To construct these trees, several types of algorithms exist which are divided into two classes.

UPGMA: Unweighted Pair Group Method using Arithmetic Averages

 Calculating distance between two clusters:

Cluster X + Cluster Y = Cluster Z

Calculate the distance of a cluster (e.g. W) to the new cluster Z:

$$d_{ZW} = \frac{N_X \, d_{XW} + N_Y \, d_{YW}}{N_X + N_Y}$$

$N_X$ is the number of sequences in cluster X.

 Calculating distance between two trees:

Assume we have N sequences.

Cluster X has $N_X$ sequences, cluster Y has $N_Y$ sequences.

$d_{XY}$: the evolutionary distance between X and Y

$$d_{XY} = \frac{1}{N_X N_Y} \sum_{i \in X,\, j \in Y} d_{ij}$$

 Methods for constructing trees

The distance matrix is obtained using pairwise sequence alignment.

A and D merge into a new cluster; let's call it V. We have to modify the distance matrix. What are the distances between:

 V and B (calculated below),

 V and C,

 V and E,

 V and F.

$$d_{VB} = \frac{N_A \, d_{AB} + N_D \, d_{DB}}{N_A + N_D} = \frac{1 \times 6 + 1 \times 6}{1 + 1} = 6$$

Conclusion

UPGMA is a clustering algorithm which can help us compute phylogenetic trees. We will see the detailed working of this approach in later modules.
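The two distance calculations used by UPGMA can be sketched in a few lines. This is a minimal sketch, not a full implementation: the function names and the dictionary layout (pairwise distances keyed by an unordered pair of labels) are our own choices, and only the values d(A,B) = 6 and d(D,B) = 6 come from the worked example.

```python
def merged_distance(n_x, d_xw, n_y, d_yw):
    """Distance from cluster W to Z = X + Y, weighted by the sizes of X and Y."""
    return (n_x * d_xw + n_y * d_yw) / (n_x + n_y)


def average_distance(cluster_x, cluster_y, d):
    """Average of all pairwise distances between members of X and members of Y."""
    total = sum(d[frozenset((i, j))] for i in cluster_x for j in cluster_y)
    return total / (len(cluster_x) * len(cluster_y))


# Distances taken from the handout's example: d(A,B) = 6 and d(D,B) = 6
d = {frozenset(("A", "B")): 6.0, frozenset(("D", "B")): 6.0}

print(merged_distance(1, 6.0, 1, 6.0))         # 6.0
print(average_distance(["A", "D"], ["B"], d))  # 6.0
```

Both routes give the same distance between V = {A, D} and B, because the size-weighted update is just a shortcut for averaging over all pairs of members.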

Module 007: UPGMA-I

UPGMA has two components to it. These include distance calculations between two clusters and between two trees.

 Building trees using UPGMA

Combining Clusters: Cluster X + Cluster Y = Cluster Z

Calculate the distance of each cluster (e.g. W) to the new cluster Z:

$$d_{ZW} = \frac{N_X \, d_{XW} + N_Y \, d_{YW}}{N_X + N_Y}$$

$N_X$ is the number of sequences in cluster X.

 Calculating the distance between two trees

Assume we have N sequences.

Cluster X has $N_X$ sequences, cluster Y has $N_Y$ sequences.

$d_{XY}$: the evolutionary distance between X and Y

$$d_{XY} = \frac{1}{N_X N_Y} \sum_{i \in X,\, j \in Y} d_{ij}$$

Figure 10 the distance matrix is obtained using pairwise sequence alignment

 Methods for constructing trees

A and D merge into a new cluster; let's call it V.

We have to modify the distance matrix!

What are the distances between:

 V and B (calculated below),

 V and C,

 V and E,

 V and F.

$$d_{VB} = \frac{N_A \, d_{AB} + N_D \, d_{DB}}{N_A + N_D} = \frac{1 \times 6 + 1 \times 6}{1 + 1} = 6$$

Conclusion

UPGMA starts by creating a cluster from the sequences which are the closest. Next, the distance is computed between the new cluster and the remaining sequences. The process is repeated for all sequences.

Module 008: UPGMA-II

UPGMA steps include distance calculations between two clusters and between two trees. We formed clusters from the sequences which had the shortest distance.

Building trees using UPGMA

Combining Clusters: Cluster X + Cluster Y = Cluster Z

Calculate the distance of each cluster (e.g. W) to the new cluster Z:

$$d_{ZW} = \frac{N_X \, d_{XW} + N_Y \, d_{YW}}{N_X + N_Y}$$

$N_X$ is the number of sequences in cluster X.

Calculating the distance between two trees

Assume we have N sequences.

Cluster X has $N_X$ sequences, cluster Y has $N_Y$ sequences.

$d_{XY}$: the evolutionary distance between X and Y

$$d_{XY} = \frac{1}{N_X N_Y} \sum_{i \in X,\, j \in Y} d_{ij}$$

Methods for constructing trees

The distance matrix is obtained using pairwise sequence alignment.

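$$d_{VB} = \frac{N_A \, d_{AB} + N_D \, d_{DB}}{N_A + N_D} = \frac{1 \times 6 + 1 \times 6}{1 + 1} = 6$$

$$d_{VC} = \frac{N_A \, d_{AC} + N_D \, d_{DC}}{N_A + N_D} = \frac{1 \times 8 + 1 \times 8}{1 + 1} = 8$$

$$d_{VE} = \frac{N_A \, d_{AE} + N_D \, d_{DE}}{N_A + N_D} = \frac{1 \times 2 + 1 \times 2}{1 + 1} = 2$$

$$d_{VF} = \frac{N_A \, d_{AF} + N_D \, d_{DF}}{N_A + N_D} = \frac{1 \times 6 + 1 \times 6}{1 + 1} = 6$$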


V and E merge into a new cluster; let's call it W.

Now we have to modify the distance matrix again.

What are the distances between:

W and B,

W and C,

W and F.

Conclusion

Once a cluster is selected and its distance is computed with all other sequences, we update the distance matrix. Next, we select the shortest distance from the new matrix and repeat the process.


Module 009: UPGMA-III

UPGMA has two components to it. These include progressive distance calculations between two clusters or between two trees.

Building trees using UPGMA

Combining Clusters: Cluster X + Cluster Y = Cluster Z. Calculate the distance of each cluster (e.g. W) to the new cluster Z:

$$d_{ZW} = \frac{N_X \, d_{XW} + N_Y \, d_{YW}}{N_X + N_Y}$$

$N_X$ is the number of sequences in cluster X.

Calculating the distance between two trees

 Assume we have N sequences

 Cluster X has $N_X$ sequences, cluster Y has $N_Y$ sequences

 $d_{XY}$: the evolutionary distance between X and Y

$$d_{XY} = \frac{1}{N_X N_Y} \sum_{i \in X,\, j \in Y} d_{ij}$$

V and E merge into a new cluster; let's call it W. Now we have to modify the distance matrix again.

What are the distances between:

W and B,

W and C,

W and F.

$$d_{WB} = \frac{N_V \, d_{VB} + N_E \, d_{EB}}{N_V + N_E} = \frac{2 \times 6 + 1 \times 6}{2 + 1} = 6$$

$$d_{WC} = \frac{N_V \, d_{VC} + N_E \, d_{EC}}{N_V + N_E} = \frac{2 \times 8 + 1 \times 8}{2 + 1} = 8$$

$$d_{WF} = \frac{N_V \, d_{VF} + N_E \, d_{EF}}{N_V + N_E} = \frac{2 \times 6 + 1 \times 6}{2 + 1} = 6$$

New matrix

Cluster according to min distance

Conclusion

Now we have formed three clusters. Also, two separate trees have been formed. Next, we need to join these trees to create a complete tree.


Module 010: UPGMA-IV

Application of UPGMA resulted in the formation of two sub-trees. The need now is to join them into a single tree. Let’s see how that is done.

F and B merge into a new cluster; let's call it X. We have to modify the distance matrix yet again. What is the distance between the trees W and X?

$$d_{WX} = \frac{1}{N_W N_X} \sum_{i \in W,\, j \in X} d_{ij} = \frac{d_{AB} + d_{AF} + d_{DB} + d_{DF} + d_{EB} + d_{EF}}{3 \times 2} = \frac{6 + 6 + 6 + 6 + 6 + 6}{3 \times 2} = 6$$


X and W merge into a new cluster; let's call it Y. We have to modify the distance matrix.

What is the distance between Y and C?

Conclusion

We have now seen how trees are generated and connected. Next, we need to finalize the tree by adding the last two clusters.


Module 011: UPGMA-V

Application of UPGMA resulted in formation of two sub-trees. The need now was to join them into a single tree. Let’s see how that is done.

X and W merge into a new cluster; let's call it Y. We have to modify the distance matrix. What is the distance between Y and C?


Conclusion

Un-weighted Pair Group Method using Arithmetic Averages is a clustering method to construct phylogenetic trees. Non-clustering methods such as Maximum Parsimony may be used for making trees as well.
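The whole procedure from the last five modules can be put together in a short sketch. It is an illustration under stated assumptions, not a reference implementation: the function names are ours, ties are broken arbitrarily, and the distances d(A,D) = 1, d(B,F) = 4, d(B,C) = 8 and d(C,F) = 8 are assumed values, because the handout's distance matrix figure is not reproduced in the text. The remaining distances are the ones used in the worked calculations.

```python
from itertools import combinations


def make_distances(triples):
    """Turn (a, b, distance) triples into a lookup keyed by the unordered pair."""
    return {frozenset((a, b)): float(v) for a, b, v in triples}


def upgma(names, distances):
    """Repeatedly merge the two closest clusters; return the merges in order."""
    clusters = {name: 1 for name in names}  # cluster label -> number of sequences
    d = dict(distances)
    merges = []
    while len(clusters) > 1:
        # Select the pair of clusters with the shortest distance
        x, y = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        d_xy = d[frozenset((x, y))]
        z = "(" + x + "," + y + ")"
        n_x, n_y = clusters[x], clusters[y]
        # Update the distance matrix: distance of every other cluster W to Z
        for w in clusters:
            if w != x and w != y:
                d_xw = d[frozenset((x, w))]
                d_yw = d[frozenset((y, w))]
                d[frozenset((z, w))] = (n_x * d_xw + n_y * d_yw) / (n_x + n_y)
        del clusters[x], clusters[y]
        clusters[z] = n_x + n_y
        merges.append((x, y, d_xy))
    return merges


example = make_distances([
    ("A", "B", 6), ("A", "C", 8), ("A", "D", 1), ("A", "E", 2), ("A", "F", 6),
    ("B", "C", 8), ("B", "D", 6), ("B", "E", 6), ("B", "F", 4),
    ("C", "D", 8), ("C", "E", 8), ("C", "F", 8),
    ("D", "E", 2), ("D", "F", 6), ("E", "F", 6),
])

for left, right, distance in upgma(["A", "B", "C", "D", "E", "F"], example):
    print(left, "+", right, "at distance", distance)
```

With these values the merges happen at distances 1, 2, 4, 6 and 8, in the same order as in the modules: A with D (V), V with E (W), B with F (X), W with X (Y), and finally Y with C.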

Handout Chapter 04: RNA Secondary Structure Prediction


Chapter 4 - RNA Secondary Structure Prediction

Module 001: DNA TO RNA SEQUENCES

 MOTIVATION

In the early days, RNA was considered merely an intermediary between DNA and protein: it takes information from DNA and uses that information in protein synthesis. Now we know that it has multiple types, such as mRNA, tRNA, miRNA and siRNA, and that these perform most of the work in gene expression and protein production. Not all RNA molecules are the same; they differ in nucleotide sequence and also in function.

Many viruses have genomes made of RNA. They are therefore called RNA viruses. Examples include Human Immunodeficiency Virus and Hepatitis C Virus.

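There are a few differences between RNA and DNA:

 Thymine is replaced by uracil in the RNA molecule.

 The RNA molecule is usually single-stranded.

 RNA contains ribose sugar.

Ribose carries a hydroxyl (OH) group on the second carbon of its ring, where deoxyribose has only a hydrogen (H). This extra OH group makes RNA chemically less stable than DNA, which is why RNA has a short life span.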

Figure 0.1 Ribose sugar has (OH) and Deoxyribose (H)


Module 002: TYPES OF RNA & THEIR FUNCTIONS

There are two categories of RNA:

 Coding RNA

 Non-Coding RNA

Coding RNA carries the coded information that is used in protein synthesis. Non-coding RNA is not translated into protein; it helps in the translation process and in regulation.

 TYPES OF RNA

There are many types of RNA according to their functions, such as:

 Messenger RNA (mRNA)

 Transfer RNA (tRNA)

 Ribosomal RNA (rRNA)

 Micro RNAs (miRNA)

 Small Interfering RNA (siRNA)

 MESSENGER RNA

Messenger RNA makes up only 5-10% of the RNA present in a cell. It has a variable sequence and a variable size, and it carries genetic information from DNA to the ribosomes, where proteins are assembled. The 5’ end of messenger RNA is capped with 7-methylguanosine triphosphate, which helps the ribosomes identify the mRNA. The 3’ end of the mRNA is a poly(A) tail (around 30-200 adenylate residues), which helps shield it against 3’ exonucleases.


Because RNA molecules differ in their nucleotide sequences, they also differ in their functions.

Figure 2 RNA sequence is complementary to the DNA sequence and is translated as codons of three nucleotides


Module 003: SIGNIFICANCE OF RNA STRUCTURE

RNA can form 3D structures (Sarver 2008). Such structural properties help the RNA molecule perform different functions.

RNA is composed of nucleotides, each made up of a sugar, a phosphate and a base, and these nucleotides have the ability to form hydrogen bonds.

 ‘A’ can make hydrogen bonds with ‘U’

 ‘G’ makes hydrogen bonds with ‘C’

 ‘G’ can also make hydrogen bonds with ‘U’ (Wobble Pair)

Figure 3 In RNA, ribose is used in place of deoxyribose and uracil is used in place of thymine

Because of this bonding ability RNA forms many structures, and because of this variety of structures RNA performs many functions in the cell, such as:

 DNA information transfer


 Regulatory roles

 Catalytic roles

 Defense & immune response

 Structure-based special roles


Module 004: RNA FOLDING AND ENERGY OF FOLDING

RNA molecules form many structures, for stability and for different functions. The Gibbs free energy (Langridge and Kollman 1987) is the free energy available to the RNA molecule for reactions, and RNA structure formation lowers this energy. If an RNA can take up two structures, we select the one with the lowest energy state.

http://chemwiki.ucdavis.edu/Core/Physical_Chemistry/Thermodynamics/State_Functions/Free_Energy/Gibbs_Free_Energy

We can calculate the overall energy of an RNA structure by summing up the energies given out during the process of folding. Base pairing contributes stabilizing (negative) values; to obtain the destabilizing (positive) values, we factor in the ways in which the RNA can be destabilized.

Figure 4 Energy is continuously given out as the RNA molecule folds by pairing complementary bases

Figure 5 calculation of stabilizing and destabilizing values


Module 005: CALCULATING ENERGIES OF FOLDING-AN EXAMPLE

RNA is composed of four nucleotides (A, U, C and G), which are attached to the ribose sugar in the backbone. These nucleotides form hydrogen bonds with each other: G bonds with C and A bonds with U, and energy is released when these bonds form.

That is why the RNA molecule becomes more stable.

Figure 0.2 Nucleotides are held together by strong bonds which are created by the release of energy

Figure 7 RNA Sequence energy


Five nucleotides formed H-bonds. This bond formation released energy (-12.0 kcal/mol), and the RNA molecule took up a 2’ structure. Hence it became more stable.

Module 006: TYPES OF RNA SECONDARY STRUCTURES-I

The complementary bases of an RNA pair together to form RNA secondary structures. The plain nucleotide sequence of an RNA is called its primary structure and is denoted by 1’. When these nucleotides fold together and form a more complex structure, that is called the secondary structure and is denoted by 2’.

The preferred structure of RNA is 2’, which has many structural patterns such as helices, loops, bulges and junctions.

Figure 8 RNA sequence extends from its 5’ end to 3’ end. Upon folding, 3’ end may fold on to the 5’ end

The first type of 2’ RNA structure is the helix. Unlike the DNA helix, the RNA helix is formed when the RNA folds onto itself.


The second type of 2’ structure is the hairpin loop.

The loop of the hairpin must be at least four bases long to avoid steric hindrance with base-pairing in the stem part of the structure. Note that a hairpin reverses the chemical direction of the RNA molecule.


Module 007: TYPES OF RNA SECONDARY STRUCTURES-II

The RNA 1’ structure folds along its 5’-3’ length to make RNA 2’ structures such as the helix and the hairpin.

The third type of 2’ structure is the bulge loop.

Bulges are formed when a double-stranded region cannot form base pairs perfectly. Bulges are asymmetric, with a varying number of unpaired bases on one side of the loop. Bulge loops are commonly found in helical segments of cellular RNAs and are used to measure the helical twist of RNA in solution (Tang and Draper 1990).

The fourth type of 2’ RNA structure is the interior loop.

Interior loops are formed by unpaired bases on both sides of the loop; the number of unpaired bases on each side may differ (Turner, Sugimoto et al. 1988).


Module 008: TYPES OF RNA SECONDARY STRUCTURES-III

Another 2’ RNA structure is the Junction or Intersection.

Figure 9 2' RNA structure called junction

Junctions include two or more double-stranded regions converging to form a closed structure. The unpaired bases appear as a bulge (Zuker and Sankoff 1984).

Figure 10 Unpaired bases in two 2’ structures form hydrogen bonds with each other

RNA tertiary structures are formed when unpaired bases in different 2’ regions bond with each other.


Module 009: RNA TERTIARY STRUCTURES

2’ RNA structures are formed by the folding of nucleotides within the RNA molecule, but after folding some nucleotides remain open for interaction. These can form hydrogen bonds with each other.

Figure 11 Hydrogen bond formation between open nucleotides.

These unpaired nucleotides of the 2’ structure interact with other unpaired nucleotides and form a third level of structure, called the tertiary (3’) structure. The four unpaired nucleotides in a hairpin loop, for example, can do this.

The above figure:

1. Indicate how these 2’ structures come together

2. Indicate the difference between internal loop and multi loop

3. Indicate the yet unpaired bases

In the 3’ structure, unpaired bases in a loop may pair with bases outside that loop through an unusual kind of folding called a pseudoknot. Bases that still do not pair remain available for pairing.

Figure 12 pseudoknots


Module 010: CIRCULAR REPRESENTATION OF STRUCTURES

The tertiary (3’) structure of RNA may contain pseudoknots. To detect pseudoknots in an RNA structure we use a “circular plot”, which is a graphical approach: the sequence is drawn around a circle and each base pair is drawn as an arc.

Intersecting arcs in the circular plot indicate a pseudoknot.
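The same check can be done without drawing anything. In the sketch below (the function name and the example pair lists are hypothetical), two base pairs (i, j) and (k, l) correspond to intersecting arcs exactly when i < k < j < l, i.e. when one pair starts inside the other and ends outside it.

```python
from itertools import combinations


def find_crossing_pairs(pairs):
    """Return every pair of base pairs whose arcs intersect in a circular plot."""
    # Put each base pair in (smaller, larger) order and sort by start position
    normalized = sorted(tuple(sorted(p)) for p in pairs)
    crossings = []
    for (i, j), (k, l) in combinations(normalized, 2):
        # After sorting, i <= k; the arcs cross when k lies inside (i, j) and l outside
        if i < k < j < l:
            crossings.append(((i, j), (k, l)))
    return crossings


hairpin = [(1, 10), (2, 9), (3, 8)]             # nested pairs only
pseudoknot = [(1, 10), (2, 9), (5, 14), (6, 13)]  # second stem reaches out of the loop

print(find_crossing_pairs(hairpin))     # []
print(len(find_crossing_pairs(pseudoknot)))  # 4
```

A simple hairpin has only nested arcs, so nothing is reported, while the second example has a stem whose arcs cut across the first stem and is flagged as a pseudoknot.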


Module 011: EXPERIMENTAL METHODS TO DETERMINE RNA STRUCTURES

RNA has 1’, 2’ and 3’ structures: 1’ is the plain nucleotide sequence, 2’ is the folding of the nucleotides and 3’ includes knots.

For measuring RNA structure we use X-ray crystallography (Smyth and Martin 2000), which works on the principle of diffraction. Crystallized RNA diffracts X-rays, which helps estimate atomic positions.

All isotopes that contain an odd number of protons and/or neutrons have an intrinsic magnetic moment and angular momentum, in other words a nonzero spin, while all nuclides with even numbers of both have a total spin of zero. The most commonly studied nuclei are 1H and 13C.

Figure 13 X-ray Crystallography https://260h.pbworks.com/w/page/30814223/X%20Ray%20Crystallography

Another method for measuring RNA structure is atomic force microscopy. In this technique, a laser connected to a Si3N4 piezoelectric probe scans an RNA sample. It works well in both air and liquid environments.

Figure 14 Atomic force microscopy


The third method for measuring RNA structure is Nuclear Magnetic Resonance (NMR) spectroscopy. In this method, hydrogen atoms in the RNA resonate when placed in a high magnetic field. It works well without crystallizing the RNA.

http://www.slideshare.net/Oatsmith/13-nuclear-magnetic-resonance-spectroscopy-wade-7th

 STORAGE OF STRUCTURES

Reported structures are stored in online databases. Examples include RNA Bricks and RMDB. Bioinformaticians can refer to these databases for RNA structure studies.

RNA Bricks is a database of RNA 3D structure motifs and their contacts, both with themselves and with proteins.

Stanford University’s RNA Mapping Database (RMDB) is an archive that contains results of diverse structural mapping experiments performed on ribonucleic acids.

Module 012: Strategies for RNA Structure Prediction

The 2’ and 3’ structures of RNA can be measured experimentally, but RNA molecules degrade readily because of their short shelf life.


A given 1’ RNA structure creates the 2’ structure, because the nucleotides of the sequence fold to form it. On the basis of this folding we can predict the stability of the RNA molecule.

For example:

Figure 15 Pairs represent the stability of the RNA molecule

Maximizing the number of paired nucleotides increases the stability of the structure, and we have to select the structure according to its stability.

Module 013: Dot Plots for RNA 2' Structure Prediction

Structure measurement through experiments is slow and costly, and there is a good chance that more than one structure exists.


The dot plot method for RNA structure prediction is easy. Draw a square and partition it by drawing gridlines. Put the RNA sequence on the top and left sides of the square. Put a “DOT” wherever two nucleotides are complementary.

For example:

Figure 16 Dots are placed at the positions of complementary base pairs.
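The dot plot procedure can be sketched as a small program. This is only an illustration: the function names are ours, the example sequence is arbitrary, and counting G/U wobble pairs as complementary is a choice made here (drop those two entries from the set to plot only A/U and G/C pairs).

```python
# Pairs treated as complementary, including the G/U wobble pair
COMPLEMENTARY = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def dot_plot(seq):
    """Return a square grid with True wherever row base and column base can pair."""
    n = len(seq)
    return [[(seq[i], seq[j]) in COMPLEMENTARY for j in range(n)] for i in range(n)]


def render(seq):
    """Draw the grid as text, with the sequence along the top and left sides."""
    lines = ["  " + " ".join(seq)]
    for base, row in zip(seq, dot_plot(seq)):
        lines.append(base + " " + " ".join("*" if cell else "." for cell in row))
    return "\n".join(lines)


print(render("GGCAAAUGC"))
```

Runs of dots along an anti-diagonal of the printed grid correspond to stretches of consecutive base pairs, which are the candidate helices.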

Connect regions of paired nucleotides to form 2’ structures, as in the following image.

Figure 17 Potato Spindle Tuber Viroid

In longer RNA sequences, the gaps between complementary nucleotides become the bulges and loops of the structure.

Module 014: ENERGY BASED METHODS

Experimental determination of RNA structure is slow and costly, which is why only a few 2’ RNA structures have been reported experimentally.


During prediction we get many possible 2’ structures for an RNA, and to select the optimal structure we calculate their overall stability.

Figure 18 energy table

 STABILIZING ENERGY

The energy table helps us find the optimal predicted structure, because energy is released when complementary nucleotides form bonds.

 DESTABILIZING ENERGY

The remaining unpaired nucleotides destabilize the RNA structure; they appear as hairpin or bulge structures.

Figure 19 Hairpin+IntLoop+ExtLoop+Bulge+Hairpin

 SUM OF ENERGIES

The sum of stabilizing and destabilizing energies can help determine the quality of a 2’ RNA structure. This lets us compare the 2’ structure with the longest coupled sequences against the one with the lowest energy.
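The bookkeeping can be sketched as follows. All energy values and structure names below are made up for illustration and are not taken from the handout's energy table; stabilizing contributions are entered as negative numbers and destabilizing ones as positive numbers, in kcal/mol.

```python
def total_free_energy(stabilizing, destabilizing):
    """Overall energy: sum of the negative (stabilizing) and positive (destabilizing) terms."""
    return sum(stabilizing) + sum(destabilizing)


def most_stable(candidates):
    """Return the name of the candidate structure with the lowest overall energy."""
    return min(candidates, key=lambda name: total_free_energy(*candidates[name]))


# Hypothetical candidates: (stacking energies, loop and bulge penalties)
candidates = {
    "long stem with bulge": ([-2.1, -3.3, -2.4, -2.1], [3.8, 4.5]),
    "short stem, one hairpin": ([-3.3, -3.4, -2.4], [4.5]),
}

for name, (stabilizing, destabilizing) in candidates.items():
    print(name, round(total_free_energy(stabilizing, destabilizing), 1))
print("selected:", most_stable(candidates))
```

In this made-up case the structure with more paired bases is not the one selected, because its extra bulge penalty outweighs the extra stacking energy.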

Module 015: Zuker’s Algorithm

Energy based methods involve evaluating the free energy of structures. To compute the optimal 2’ structure for a given 1’ RNA sequence we use Zuker’s Algorithm.


Zuker’s Algorithm helps us compute the stabilizing energies (negative values) and the destabilizing energies (positive values). It also computes the sum of the positive and negative energies.

Figure 20 stacking energies

Figure 21 destabilizing energies

Figure 22 working principle of Zuker’s Algorithm (2003)

It computes the energies of all possible 2’ structures, generates combinations of all computed 2’ structures, and selects the one with the lowest energy.

Module 016: Zuker’s Algorithm EXAMPLE

Zuker’s Algorithm involves computing stabilizing and destabilizing energies of a 2’ structure. All possible 2’ structures are generated. The best 2’ structure is selected!


Figure 23 Calculation of all possible structure combinations

We need to construct all the possible combinations of nucleotides for selection of optimal 2’ RNA structure.

Module 017: Zuker’s Algorithm – A Flow Chart

Zuker’s Algorithm involves computing stabilizing and destabilizing energies of 2’ structure. And it also computes the overall energy by summing up the positive and negative energies.


The flow chart for energies is:

Figure 24 flow chart

From all possible diagonal combinations, the one with the lowest overall energy is selected.

The two diagonals (‘D’) given above include:

1. A/U, C/G, G/C, U/G

2. G/U, U/G


Module 018: Martinez Algorithm

Zuker’s Algorithm involves computing the stabilizing and destabilizing energies of a 2’ structure. It also computes the overall energy by summing up the positive and negative energies. The Martinez Algorithm is an improvement on it.

Making combinations of all possible structures is time consuming, so the Martinez Algorithm favors those 2’ structures which are energetically more feasible.

Figure 25 Martinez Algorithm flow chart

In the Martinez algorithm each 2’ structure is weighted by its stability and the optimal one is selected. Monte Carlo methods (or Monte Carlo experiments) are a broad class of computational algorithms that rely on repeated random sampling to obtain numerical results. They are often used in physical and mathematical problems and are most useful when it is difficult or impossible to use other mathematical methods.

Monte Carlo methods do not provide a definitive solution.

Module 019: Dynamic Programming Approaches

An RNA sequence contains 4 types of nucleotides (A, U, G and C), which can form G/C, G/U and A/U pairs. It may contain hundreds of nucleotides, which means many combinations are possible.


A 2’ RNA structure may involve a long nucleotide sequence with a large number of possible combinations, so it is hard to find the optimal one. For this prediction we use Dynamic Programming (DP), which breaks a larger problem into smaller ones.

 PRINCIPLE OF DYNAMIC PROGRAMMING

To select the optimal combination of structures with Dynamic Programming (DP), we take the RNA nucleotide sequence and list all the possible complementary positions for the nucleotides in the complete sequence.

For example:

Dynamic Programming then recombines such combinations in a process called “Traceback” to ensure that the 2’ structure with the most coupling is reported.

Figure 26 all possible complementary bases combinations.


Module 020: Nussinov-Jacobson Algorithm – An Overview

The Nussinov-Jacobson (NJ) Algorithm, proposed in 1980, is a Dynamic Programming (DP) strategy for predicting optimal RNA 2’ structures. It computes the 2’ structure with the most nucleotide coupling.

http://ultrastudio.org/en/Nussinov_algorithm

 HOW IT WORKS

 Create a matrix with the RNA sequence along the top and left

 Set the diagonal and lower diagonal to zero

 Start filling each empty position in the matrix by choosing the maximum of 4 scores

       J   1  2  3  4  5  6  7  8  9
I          G  G  C  A  A  A  U  G  C
1   G      0
2   G      0  0
3   C         0  0
4   A            0  0
5   A               0  0
6   A                  0  0
7   U                     0  0
8   G                        0  0
9   C                           0  0

Figure 27 A. Note the I and J labels. B. Initialize the diagonal and lower diagonal to zero.

The score S ( i , j ) is the maximum of the following four possibilities
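The figure listing the four possibilities is not reproduced in this text. In the usual textbook statement of the algorithm, given here as an assumption about what the handout shows, they are written as follows, where $\delta(i, j)$ is 1 if nucleotides i and j are complementary and 0 otherwise:

```latex
S(i, j) = \max
\begin{cases}
S(i+1,\, j) & \text{(bottom element: } i \text{ is left unpaired)} \\
S(i,\, j-1) & \text{(left element: } j \text{ is left unpaired)} \\
S(i+1,\, j-1) + \delta(i, j) & \text{(diagonal element: } i \text{ pairs with } j\text{)} \\
\max\limits_{i < k < j} \left[\, S(i,\, k) + S(k+1,\, j) \,\right] & \text{(left and bottom elements combined)}
\end{cases}
```

The last case joins the best structures of two adjacent sub-sequences, which is how separate helices end up side by side in the final structure.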


Module 021: Nussinov-Jacobson Algorithm – The Flowchart

The NJ algorithm is a dynamic programming (DP) approach for predicting the 2’ RNA structure. A scoring matrix is initialized to record the scores. To fill the scoring matrix, the maximum score from 4 matrix positions is chosen.

Figure 0.3 for maximum score 4 positions are used in scoring

Figure 28 flow chart of NJ Algorithm.

Traceback is used to report the coupling of structures in sequences.


Module 022: Nussinov-Jacobson Algorithm – EXAMPLE

The main points to be focused in N-J Algorithm are:

 Scoring Matrix

 Matrix Initialization

 Scoring method

 The 4 different positions to be considered for calculating matrix

Figure 29 N-J Algorithm scoring

Each cell of the matrix is filled using four different positions: the left, bottom, diagonal, and left/bottom elements. In this way the coupling of all complementary nucleotides is catered for.


Module 023: Score Calculations & Traceback

The score contribution from each of the four positions is calculated, and the maximum score is selected.

Figure 30 scoring and traceback in N-J Algorithm.

There can be many tracebacks. Each traceback gives an RNA secondary structure, and the traceback with the highest number of coupled nucleotides is selected.
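The scoring and traceback steps can be sketched together. This is a minimal sketch of the basic algorithm, not the handout's own code: the function names are ours, G/U wobble pairs are counted as complementary, no minimum hairpin loop length is enforced, positions are numbered from 0, and only one of the possibly many tracebacks is followed.

```python
COMPLEMENTARY = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def nussinov_fill(seq):
    """Fill the scoring matrix; s[i][j] is the most pairs possible within seq[i..j]."""
    n = len(seq)
    s = [[0] * n for _ in range(n)]  # diagonal and lower triangle stay at zero
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            pair = 1 if (seq[i], seq[j]) in COMPLEMENTARY else 0
            best = max(s[i + 1][j], s[i][j - 1], s[i + 1][j - 1] + pair)
            for k in range(i + 1, j):
                best = max(best, s[i][k] + s[k + 1][j])
            s[i][j] = best
    return s


def nussinov_traceback(seq, s):
    """Follow one traceback from the top-right cell and return the base pairs."""
    pairs = []
    stack = [(0, len(seq) - 1)]
    while stack:
        i, j = stack.pop()
        if i >= j:
            continue
        if s[i][j] == s[i + 1][j]:
            stack.append((i + 1, j))
        elif s[i][j] == s[i][j - 1]:
            stack.append((i, j - 1))
        elif (seq[i], seq[j]) in COMPLEMENTARY and s[i][j] == s[i + 1][j - 1] + 1:
            pairs.append((i, j))
            stack.append((i + 1, j - 1))
        else:
            for k in range(i + 1, j):
                if s[i][j] == s[i][k] + s[k + 1][j]:
                    stack.append((k + 1, j))
                    stack.append((i, k))
                    break
    return sorted(pairs)


sequence = "GGCAAAUGC"
scores = nussinov_fill(sequence)
print(scores[0][len(sequence) - 1])  # 3 pairs at most
print(nussinov_traceback(sequence, scores))
```

For the sequence GGCAAAUGC from the matrix above, the top-right cell holds a score of 3: only one of the three A's can pair with the single U, so at most three pairs can form.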


Module 024: Comparison of Algorithms

RNA has three different levels of structure: 1’, 2’ and 3’. There are many algorithms for predicting these structures, but all of them follow one of two main strategies:

1. Nucleotides stacking

2. Energy minimization

 ENERGY BASED ALGORITHM.

Zuker’s algorithm is based on energy minimization. Updated versions of it are improved and incorporate phylogenetic information. Pseudoknots remain a difficulty: the basic algorithm assumes they are absent, and extended methods are needed to accommodate them. The algorithm predicts the structure of an RNA from its nucleotide sequence.

 NUCLEOTIDES STACKING ALGORITHM.

The NJ algorithm comes under this category. It maximizes the number of paired nucleotides. Traceback helps to find the best 2’ structure.

Its predicted 2’ structures are about 75% accurate, because two or more of the four positions may give equal scores, so the traceback is not unique. To get the best results, the stacking and energy minimization methods need to be combined.

For further improvements in results we take help from:

 Sequence comparison

 Nucleotide covariance analysis


Module 025: WEB RESOURCES-I

To predict the 2’ structure of RNA from its 1’ structure (the sequence), we use algorithms such as Zuker’s, Martinez’s and N-J, as well as online tools. The mfold web server is one of the oldest web servers in computational molecular biology. It is based on an upgraded version of Zuker’s algorithm. mfold is computationally expensive and gives results only for sequences shorter than 8000 nucleotides.

Figure 31 http://unafold.rna.albany.edu/?q=mfold

Figure 32 http://unafold.rna.albany.edu/?q=mfold/RNA-Folding-Form


Figure 33 http://unafold.rna.albany.edu/?q=mfold/Structure-display-and-free-energy-determination

MFOLD helps fold an RNA nucleotide sequence into its possible 2’ structures. MFOLD gives out several structures along with their energetic stability!


Module 026: WEB RESOURCES-II

RNA folds into its 2’ structure from short stretches of its 1’ nucleotide sequence. For example, the sequence CUUCGG occurs in a wide variety of RNAs and mostly forms a stable hairpin loop. We can therefore make a list of all likely 2’ structures arising from 1’ sequences.

Figure 34 http://www.rnasoft.ca/strand/

Figure 35 http://iimcb.genesilico.pl/rnabricks

The RNA 1’ structure folds to form the 2’ structure. These online databases collect known 2’ RNA structures and act as a dictionary of them.

Handout Chapter 05: Protein Sequences


Chapter 5 - Protein Sequences

Module 001: FROM DNA/RNA SEQUENCES TO PROTEIN

DNA has four nucleotide bases (A, C, T & G), while RNA contains A, C, U & G. Proteins contain 20 different amino acids. The flow of information from DNA to RNA to protein is called the central dogma. It includes transcription, translation and protein modifications.

A set of three nucleotides, called a codon, codes for a specific amino acid in protein synthesis.

Figure 0.1 amino acids letters information according to codons

Figure 0.2 codon (set of three nucleotide) codes for specific amino acid.


Codons select the amino acids, and ribosomes make the protein by polymerizing them. The resulting chain of amino acids then folds to form a 3D structure.


Module 002: CODING OF AMINO ACIDS

Nucleotides (A, G, C, and T) are read in sets of three, called codons, which select the amino acids in protein synthesis. Four nucleotides give 4 × 4 × 4 = 64 possible codons, but only 20 amino acids are involved in protein synthesis, so more than one codon can code for the same amino acid.

Figure 3 coding of amino acids

Figure 4 Start Codon ATG and Stop Codon TAG, TGA or TAA
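The codon-to-amino-acid mapping can be written out as a small lookup table. The sketch below builds the standard genetic code for DNA codons and translates a sequence until the first stop codon. The names `CODON_TABLE` and `translate` are chosen here for illustration only.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna):
    """Translate a DNA coding sequence codon by codon, stopping at a stop codon."""
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[pos:pos + 3]]
        if amino_acid == "*":          # TAA, TAG or TGA
            break
        protein.append(amino_acid)
    return "".join(protein)


print(len(CODON_TABLE))               # 64 codons for 20 amino acids plus stop
print(translate("ATGCAGCTGTTCTAA"))   # MQLF
```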


Module 003: OPEN READING FRAMES

Codons code for amino acids, and there are three stop codons and one start codon. When several candidate open reading frames are found, the longest one is taken as the valid one. In molecular genetics, an open reading frame (ORF) is the part of a reading frame that has the potential to code for a protein or peptide. An ORF is a continuous stretch of codons that does not contain a stop codon (usually UAA, UAG or UGA).

https://en.wikipedia.org/wiki/Open_reading_frame

Figure 5 ORF 1 is valid, as it is the longest

There is an online tool with which we can find the ORFs in any sequence.

Figure 6 NCBI, ORF Finder

Any DNA sequence has six possible reading frames (three on each strand). The longest ORF is marked, and the first stop codon marks the end of the protein.


Module 004: ORF Extraction – A Flowchart

A codon of 3 nucleotides codes for each amino acid. There is 1 start codon and there are 3 stop codons. Selection of an ORF is based on its length: the longest ORF is considered the most suitable candidate for protein synthesis.

Figure 7 ORF extraction flowchart

Both the forward and the reverse sequences are considered. Each may contain many ORFs, and the one giving the longest protein sequence is selected.
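The flowchart can be approximated with the following sketch, which scans all six reading frames of a DNA sequence and keeps the longest ORF. It assumes that an ORF runs from an ATG start codon to the first in-frame stop codon and that the reverse strand is read as the reverse complement. The function names are illustrative and are not taken from NCBI ORF Finder.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def orfs_in_frame(dna, frame):
    """Yield every ORF (start codon through stop codon) in one reading frame."""
    start = None
    for pos in range(frame, len(dna) - 2, 3):
        codon = dna[pos:pos + 3]
        if start is None and codon == "ATG":
            start = pos                      # first start codon opens an ORF
        elif start is not None and codon in STOP_CODONS:
            yield dna[start:pos + 3]         # first stop codon closes it
            start = None


def longest_orf(dna):
    """Search three frames on each strand and return the longest ORF found."""
    dna = dna.upper()
    candidates = []
    for strand in (dna, reverse_complement(dna)):
        for frame in range(3):
            candidates.extend(orfs_in_frame(strand, frame))
    return max(candidates, key=len, default="")


print(longest_orf("CCATGCAGCTGTTCTAAGG"))   # ATGCAGCTGTTCTAA
```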


Module 005: SEQUENCING PROTEINS

Given the DNA/RNA sequence, ORFs can be extracted and the protein sequence can be determined. However, the gene sequence of a protein may be unknown, which is why we use the Edman degradation method in protein sequencing. Edman degradation, developed by Pehr Edman, is a method of sequencing amino acids in a peptide. In this method, the amino-terminal residue is labeled and cleaved from the peptide without disrupting the peptide bonds between other amino acid residues. It starts from the N-terminal and removes one amino acid at a time.

Figure 8 Mechanism

https://en.wikipedia.org/wiki/Edman_degradation

Peptides are degraded cyclically by phenyl isothiocyanate (PhNCS). PhNCS attaches to the free amino group of the N-terminal residue. In each cycle, one amino acid is removed as a PhNCS derivative.

Figure 9 Working of Edman degradation

 DRAWBACKS

 It is restricted to chains of up to about 60 residues.

 It is a very time-consuming process, sequencing only 40-50 amino acids per day.

The modern technique for protein sequencing is tandem mass spectrometry.


Module 006: Application of Mass Spectrometry in Protein Sequencing

The Edman degradation method helps us sequence an unknown protein, but it is restricted to about 60 amino acids.

Proteins can be charged by adding electrons or protons. When moving charges pass through a magnetic field, they are deflected, and the amount of deflection depends on their momentum.

Figure 10 Moving charged particles in a magnetic field

Where:

 F is the force applied to the ion

 m is the mass of the particle,

 a is the acceleration

 Q is the electric charge,

 E is the electric field

 v × B is the cross product of the ion's velocity and the magnetic flux density.

Figure 11 equation for MS application in protein sequencing
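The symbols defined above match the Lorentz force law combined with Newton's second law of motion. The figure itself is not reproduced here, so the equation below is a reconstruction from those definitions and should be read as a likely form of Figure 11, not a verbatim copy:

```latex
\mathbf{F} = m\,\mathbf{a} = Q\,(\mathbf{E} + \mathbf{v} \times \mathbf{B})
```

Dividing by $Q$ gives $(m/Q)\,\mathbf{a} = \mathbf{E} + \mathbf{v} \times \mathbf{B}$, which shows that the motion of an ion in given fields depends only on its mass-to-charge ratio.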

 COMPONENTS

 Sample Injection

 Ionization Source

 Mass Analyzer

 Ion Detector

 Spectra search using computational tools

 CONCLUSION

Charged proteins can be set into motion within a magnetic field. Their deflections correspond to their mass-to-charge ratio. The deflections can be measured, and hence so can the protein’s mass.


Module 007: Techniques for MS Proteomics

MS proteomics works on the principle of ionizing proteins and placing them in a very strong magnetic field. Each protein is deflected by an amount that depends on its molecular weight, and in this way its molecular mass is measured. The mass of an unknown protein is compared with the masses of the proteins in a database, and the matching one is selected.

Examples of protein sequence databases are UniProt and Swiss-Prot.

If a protein is unknown, it is measured and then matched against an existing database. Proteins that match are shortlisted.


Module 008: Types of MS-based proteomics

Proteins can be sequenced by Edman degradation and by mass spectrometry. MS-based proteomics helps us sequence larger proteins more quickly.

Following steps are involved in MS:

 Separation

 Ionization

 Mass analysis

 Detection

Two methodologies are involved

1. Bottom up proteomics

2. Top down proteomics

Bottom up proteomics measures the masses of the peptides produced by enzymatic digestion of proteins. Top down proteomics measures the intact proteins first, followed by the peptides obtained after fragmentation.

 BOTTOM UP PROTEOMICS

In this methodology, the protein complex is treated with site-specific enzymes which cleave the proteins into peptides, and the masses of the resulting peptides are measured. One peptide is selected at a time for processing, and when all have been processed, a protein search engine is used to find matches.

 TOP DOWN PROTEOMICS

In this methodology, intact proteins are ionized and their masses are measured. One protein is then mass-selected at a time for fragmentation, and the masses of the resulting peptide fragments are measured.

We can say that bottom up proteomics deals with peptides while top down proteomics can handle the whole protein.


Module 009: BOTTOM UP PROTEOMICS

There are two types of proteomics protocol that are usually employed.

1. Bottom up proteomics

2. Top down proteomics

PROTOCOL

1. A sample containing a mixture of proteins from cells or tissues is obtained.

2. An enzyme such as trypsin is used to cleave the proteins.

3. The enzyme cleaves the protein at specific amino acid sites.

4. Several peptides are formed when a protein is cleaved.

5. The number of peptides depends on the number of sites where the enzyme cleaves the protein. For example, trypsin cleaves the protein after lysine (K) and arginine (R).

6. The mass of each peptide is measured.

7. One peptide is selected at a time.

8. A different enzyme can be used to cleave the protein at different sites.

9. This process continues until all possible peptides have been formed or searched.

10. The peptides are searched against a database and matched.
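Steps 2 to 5 can be imitated in silico. The sketch below digests a protein sequence with a simplified trypsin rule: cleave after lysine (K) or arginine (R), except when the next residue is proline (P). The proline exception and the absence of missed cleavages are assumptions of this sketch, and the function name is hypothetical.

```python
def tryptic_digest(protein):
    """Split a protein sequence into peptides using a simplified trypsin rule."""
    peptides = []
    start = 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        # Cleave on the C-terminal side of K or R, but not before proline.
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):               # trailing peptide without K/R
        peptides.append(protein[start:])
    return peptides


print(tryptic_digest("MAFSAEDVLKEYDRRRRMEALLK"))
```

The number of peptides returned equals the number of cleavage sites used, plus one if the protein does not end at a cleavage site, which is the relationship stated in step 5.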


Module 010: Two Approaches for Bottom up Proteomics (BUP)

There are two approaches for bottom up proteomics.

1. Peptide Mass Fingerprinting.

2. Shotgun Proteomics

Figure 0.33 Peptide mass fingerprinting

Figure 14 Shotgun Proteomics

Shotgun proteomics digests the whole protein mixture first, and the resulting peptides are then compared with the database. Peptide mass fingerprinting first separates the proteins and then analyzes the peptides of a single protein at a time.


Module 011: TOP DOWN PROTEOMICS

Bottom up proteomics identifies proteins by cleaving them into segments at specific sites, and it is not suitable for measuring protein masses directly.

 PROTOCOL

1. A sample containing a mixture of proteins from cells or tissues is obtained.

2. The intact proteins in the mixture are analyzed for their masses.

3. A list of masses is obtained.

4. TDP measures the masses of proteins including all their post-translational modifications.

5. After MS1, one protein is selected at a time and fragmented to obtain its peptides.

6. The process is repeated many times.

The results are compared against protein databases such as UniProt and Swiss-Prot.

TDP thus measures the masses of intact proteins as well as the mass changes caused by post-translational modifications.


Module 012: PROTEIN SEQUENCE IDENTIFICATION

Mass spectrometry helps us measure the molecular weight of proteins and peptides, but several proteins can have the same mass. To identify them, we follow the flowchart of techniques below.

Figure 0.45 protein sequence identification flowchart

The flowcharts discussed above can help us arrive at the sequence of the protein in question.

Scoring schemes are required to quantitatively represent the quality of results

[Flowchart labels: Protein Complex Separation; Ionization; Mass Spectrometer; EST Determination; MS Spectra; Fragmentation (MS2); Filter Protein Database; Post Translational Modifications; In Silico Fragmentation of Candidate Proteins; Compare Theoretical Masses with Experimental Masses; Matching of Experimental and In Silico Peak List; Protein Scores (P62837...51, Q6ZPJ3...40, P21734...33, Q02159...21)]


Module 013: PROTEIN IONIZATION TECHNIQUES

Protein ionization is used in mass spectrometry based proteomics protocols. Ionization involves the addition of a proton to the protein or the removal of a proton from it. Ionization can therefore increase or decrease the mass of the protein or peptide.

 SALIENT IONIZATION

The salient ionization techniques include Matrix Assisted Laser Desorption Ionization (MALDI) and Electrospray Ionization (ESI).

For example:

 MALDI

In this technique, one proton is added to the protein or peptide, so the molecular weight increases by one and the mass spectrometer reports the molecule with a +1 charge.

 ESI

ESI adds many protons to the protein or peptide, and the molecular weight increases by the number of protons added. With ESI it is difficult to find the molecule in its +1 charge state.

 EXAMPLE

Figure 0.56 resolving multiple charges

MS data from MALDI ionization is easier to handle, as the product ions are mostly observed at “mass + 1”. ESI data is more difficult to use, as it does not readily give the +1 charged ion.
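Resolving multiple charges amounts to simple arithmetic. If an ion carries z extra protons, its observed m/z is (M + z × 1.007276) / z, where M is the neutral mass and 1.007276 Da is the proton mass. The sketch below inverts this relationship. The function names are illustrative, and the charge state is assumed to be already known.

```python
PROTON_MASS = 1.007276  # mass of a proton in daltons


def neutral_mass(mz, charge):
    """Neutral molecular mass M of an ion observed at m/z with `charge` protons."""
    return charge * (mz - PROTON_MASS)


def singly_charged_mz(mz, charge):
    """m/z the same molecule would show in the +1 state, i.e. M + one proton."""
    return neutral_mass(mz, charge) + PROTON_MASS


# A molecule of neutral mass 1000 Da carrying two protons appears near 501.007.
observed_mz = (1000.0 + 2 * PROTON_MASS) / 2
print(round(neutral_mass(observed_mz, 2), 3))
print(round(singly_charged_mz(observed_mz, 2), 3))
```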


Module 014: MS1 & INTACT PROTEIN MASS

When we ionize the protein, it can be deflected by a magnetic field in proportion to its mass and the mass of protein can be measured by spectrometry.

Figure 0.67 MS1 Schematic (Image courtesy Wikipedia)

The mass/charge ratio helps us calculate the mass of the protein. “Mass Select” helps to select a specific MS1 ion for further analysis.

MS1 reports the intact masses of the proteins or peptides.


Module 015: SCORING INTACT PROTEIN MASS

MS1 helps us obtain the intact masses of the precursor molecules, which depend on the proteomics protocol applied. The protein masses reported by MS1 are matched against a protein database, but before matching, the masses of all molecules are converted to their +1 charge state.

 SCORING

We score each protein so that a good match gets the maximum score and low quality matches get low scores.

After filtering out the multiple charges, we are left with only the peaks having charge +1. After this filter, we compare them with the protein database.

 SCORING SCHEME

Example: A Protein Sequence from Database “MQLF”

http://web.expasy.org/compute_pi/

All experimental masses are compared with theoretical masses of database and mass is selected on the base of closeness.

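MW: 537.67

$M_{Score} = \sqrt{(M_{Exp} - M_{T})^{2}}$

where $M_{Exp}$ is the experimental mass and $M_{T}$ is the theoretical mass from the database.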
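As a worked example, the theoretical mass of “MQLF” can be computed from average residue masses plus one water molecule, and candidates can then be ranked by how close they are to the experimental mass. The residue masses below are standard average values. The ranking rule (a smaller mass difference is a better match) is an assumption of this sketch, and the function names are illustrative.

```python
import math

# Average residue masses in daltons (amino acid minus one water).
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_AVERAGE = 18.01528


def average_mass(sequence):
    """Theoretical average molecular weight of a peptide or protein."""
    return sum(AVERAGE_RESIDUE_MASS[aa] for aa in sequence) + WATER_AVERAGE


def mass_score(m_exp, m_theo):
    """Mass difference sqrt((M_Exp - M_T)^2); zero means a perfect match."""
    return math.sqrt((m_exp - m_theo) ** 2)


def rank_candidates(m_exp, candidates):
    """Order candidate sequences from closest to farthest in mass."""
    return sorted(candidates, key=lambda seq: mass_score(m_exp, average_mass(seq)))


print(round(average_mass("MQLF"), 2))                     # 537.67
print(rank_candidates(537.67, ["MQL", "MQLF", "MQLFA"]))  # MQLF ranks first
```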


Module 016: PROTEIN FRAGMENTATION TECHNIQUES

We compare the experimental mass with the theoretical mass of each protein in the database, and we rank or score the proteins on the basis of closeness.

If several proteins have the same score, the selection is made using another technique: protein fragmentation. We fragment the protein or peptide and ionize the fragments, which lets us measure the fragment masses in the same way as the mass of their precursor.

There are different techniques for protein fragmentation.

 Electron Capture Dissociation (ECD)

 Electron Transfer Dissociation (ETD)

 Collision Induced Dissociation (CID)

Each fragmentation technique gives result of specific type of fragments.

ECD gives out ‘C’ and ‘Z’ ions. CID gives out ‘B’ and ‘Y’ ions, etc.

Figure 0.78 natural peptide of four residue

If we can measure the masses of the fragments using MS and calculate the theoretical masses of the fragments, then we can award a score on the basis of the similarity between the experimental and theoretical masses.


Module 017: TANDEM MS

MS can measure the masses of intact proteins or peptides. This can be followed by their fragmentation in the MS chamber.

Tandem MS can be extended to the fragments of a fragment. All that is needed is an MS instrument capable of (i) selecting the fragment’s mass range and (ii) fragmenting the selected precursor fragment.

Tandem MS helps us measure the masses of fragments. This makes scoring and protein identification much easier.


Module 018: MEASURING EXPERIMENTAL FRAGMENT’S MASS

In MS1, the molecular weight of the intact sample molecule is measured. The intact molecule is then fragmented in two, and the masses of these two fragments are measured by a second round of MS (MS2).

FRAGMENTATION TECHNIQUES AND MOLECULAR WEIGHT

Fragmentation techniques include ECD, CID etc. Fragmentation splits the intact molecule into two parts.

FRAGMENT MASS

The fragment masses reported by MS2 depend on the technique, because each technique splits the protein or peptide at a different location.

Figure 0.89 Masses after Fragmentation by ECD, CID & ETD

Experimental mass reported from MS2 is matched with theoretical peptides of candidate proteins (from DB). Score is awarded on the basis of the closeness between experimental and theoretical masses.


Module 019: Calculating Theoretical Fragment's Mass

After measuring the fragment masses with MS2, we compare them with the theoretical fragment masses of the proteins in the database.

Figure 20 Masses after Fragmentation by CID

Figure 0.91 Masses after Fragmentation by ECD


Figure 22 Masses after Fragmentation by ECD


Module 020: PEPTIDE SEQUENCE TAG

Peptide sequence tags (PSTs) are short stretches of peptide sequence derived from MS2 data. We can obtain the sequence of a peptide from the variation in the fragmentation site.

Figure 23 variation sites

Precursor proteins or peptides fragmentation leads to formation of multiple ions of the same fragment type. However, fragments have variation in their molecular weights due to variation in site of fragmentation

Fragmentation at consecutive sites leads to a mass difference equal to that of a single amino acid. Such consecutive peaks can reveal partial peptide sequence tags


Module 021: Extracting Peptide Sequence Tags

PSTs are formed due to sequential cleavage of precursor protein/peptide’s backbone.

Figure 24 peptide sequencing tagging

Peptide sequence tags can be extracted from peak list iteratively. A high quality mass spectrum will produce large number of PSTs. The bigger the peptide sequence tags, the better!
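The iterative extraction can be sketched as follows. Two peaks are linked whenever their mass difference matches one amino acid residue, and chains of linked peaks are read out as tags. The sketch assumes monoisotopic residue masses, a single ion series, and a fixed tolerance in daltons. Leucine and isoleucine have identical masses, so "L" stands for either. All names are illustrative.

```python
# Monoisotopic residue masses in daltons; "L" also represents isoleucine.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259,
    "M": 131.04049, "H": 137.05891, "F": 147.06841, "R": 156.10111,
    "Y": 163.06333, "W": 186.07931,
}


def find_links(peaks, tol):
    """Map peak index i to (j, residue) pairs where peaks[j] - peaks[i] fits a residue."""
    links = {}
    for i, low in enumerate(peaks):
        for j in range(i + 1, len(peaks)):
            diff = peaks[j] - low
            for residue, mass in RESIDUE_MASS.items():
                if abs(diff - mass) <= tol:
                    links.setdefault(i, []).append((j, residue))
    return links


def extract_tags(peaks, tol=0.02):
    """Return all maximal sequence tags, longest first."""
    peaks = sorted(peaks)
    links = find_links(peaks, tol)
    has_incoming = {j for targets in links.values() for j, _ in targets}
    tags = []

    def walk(i, tag):
        if i not in links:                 # chain ends here
            tags.append(tag)
            return
        for j, residue in links[i]:
            walk(j, tag + residue)

    for start in links:
        if start not in has_incoming:      # begin only at the start of a chain
            walk(start, "")
    return sorted(tags, key=len, reverse=True)


# b-ion ladder of the peptide MQLF (b1 to b4), used here as a toy peak list.
print(extract_tags([132.047766, 260.106346, 373.190406, 520.258816]))  # ['QLF']
```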


Module 022: Using Peptide Sequence Tags in Protein Search

PSTs provide clues of the precursor protein/peptides sequence. Consider that we extract the following PSTs: M, MQ, QV etc. Search protein sequence database (e.g. Uniprot, Swissprot)

Sample sequence in protein DB

>>sp|Q6GZ4X|0X1R_FRG3G Putative transcription factor 0X1R OS=Random virus 3 (isolate Goorha) GN=FV3-0X1R PE=4 SV=1

MAFSAEDVLKEYDRRRRMEALLLSLYYPNDRKLLDYKEWSPPRVQVECPKAPVEWNNPPSEKGLIVGHFSGIKYKGEKAQASEVDVNKMCCWVSKFKDAMRRYQGIQTCKIPGQVLSDLDAKIKAYNTVEGVEGFVRYSRVTKQHVAAFLKELRHSKQYENVNLIHYILTDKRVDIQHLEKDLVKDFKALVESAHRMRQGHMQNVKYILYQLLKKHGHGPDGPDILTVKTGSMQLYDDSFRKIYTDLGWKFTPL

For all the proteins in the database, we find out which PSTs exist in which proteins. The protein reporting the most PSTs is more probable to be the precursor protein.

If several proteins report the same number of PSTs, or if some proteins report longer PSTs, scoring is used to decide which protein is the better match. After extracting the PSTs, we search the entire database for the proteins that contain them.
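A minimal sketch of this search is shown below. Each protein is credited with the tags that occur in its sequence, and proteins are ranked first by the number of tags found and then by the total tag length. The tie-breaking rule and the toy database are assumptions made for illustration; they are not the handout's scoring formula.

```python
def pst_hits(tags, database):
    """Map each protein ID to the list of tags found in its sequence."""
    return {protein_id: [tag for tag in tags if tag in sequence]
            for protein_id, sequence in database.items()}


def rank_by_pst(tags, database):
    """Rank protein IDs by number of matched tags, then by total matched length."""
    hits = pst_hits(tags, database)
    return sorted(
        hits,
        key=lambda pid: (len(hits[pid]), sum(len(tag) for tag in hits[pid])),
        reverse=True,
    )


# Hypothetical three-entry database built from short made-up sequences.
toy_database = {
    "P1": "MAFSAEDVLK",
    "P2": "GSMQLYDDSF",
    "P3": "RVQVECPK",
}
print(pst_hits(["M", "MQ", "QV"], toy_database))
print(rank_by_pst(["M", "MQ", "QV"], toy_database))   # ['P2', 'P3', 'P1']
```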


Module 023: Scoring Peptide Sequence Tags

According to scoring scheme if a candidate protein matches ‘n’ PSTs, then its score can be given by:

Additionally, if we include the root mean square error (RMSE) in the scoring system, it can highlight the better PST matches.

Figure 25 root mean square error

RMSE for a sequence tag ‘i’ of length ‘n’?
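The handout's own expression is not reproduced in this text. As a point of reference, the general definition of the root mean square error over the $n$ mass differences in a tag $i$ is given below. The symbols $m^{Exp}_{k}$ and $m^{T}_{k}$ (experimental and theoretical mass of the $k$-th residue in the tag) are notation assumed here:

```latex
RMSE_{i} = \sqrt{\frac{1}{n} \sum_{k=1}^{n} \left( m^{Exp}_{k} - m^{T}_{k} \right)^{2}}
```

A small RMSE means the peak spacings in the spectrum agree closely with the residue masses, so the tag is a more reliable match.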

So, the updated relationship is:


Module 024: In silico Fragment Comparison

MS1 reports the intact mass of molecules (proteins or peptides) in the sample. Intact mass can be compared with every protein’s mass in database to identify the molecule in the sample.

In case multiple candidate proteins are reported, MS2 can be performed. MS2 helps measure fragment peptide masses. MS2 data can be used to extract peptide sequence tags.

If the protein identification is still not confirmed, then each experimentally reported MS2 fragment is compared with the in silico spectra of proteins from the database.

Fragmentation techniques determine product ions e.g. ECD -> c/z and CID -> b/y ions etc. With known fragment types, we can compute the MW of all possible protein fragments

For obtaining all possible theoretical fragments in a protein, we need to compute the MWs of each fragment individually

Consider a random protein sequence from DB:

Figure 26 random protein sequence

Matching experimental fragments with in silico fragments is the final resort in protein search and identification.
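For CID, the theoretical fragments are the b ions (N-terminal pieces) and the y ions (C-terminal pieces). The sketch below computes the singly charged m/z of every b and y ion of a sequence from monoisotopic residue masses: a b ion is the sum of its residues plus one proton, and a y ion additionally carries one water. Only b/y ions are covered, and the function name is illustrative.

```python
# Monoisotopic residue masses in daltons.
MONO_RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406,
    "N": 114.04293, "D": 115.02694, "Q": 128.05858, "K": 128.09496,
    "E": 129.04259, "M": 131.04049, "H": 137.05891, "F": 147.06841,
    "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def by_ions(sequence):
    """Return (b_ions, y_ions): singly charged m/z for each backbone cleavage site."""
    masses = [MONO_RESIDUE_MASS[aa] for aa in sequence]
    b_ions, y_ions = [], []
    for site in range(1, len(sequence)):          # n residues give n - 1 sites
        b_ions.append(sum(masses[:site]) + PROTON)
        y_ions.append(sum(masses[site:]) + WATER + PROTON)
    return b_ions, y_ions


b_ions, y_ions = by_ions("MQLF")
print([round(m, 3) for m in b_ions])
print([round(m, 3) for m in y_ions])
```

For every cleavage site, the b ion and its complementary y ion add up to the intact mass plus two protons, which gives a quick consistency check on the computed list.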


Module 025: In silico Fragment Comparison and scoring

Experimental MS2 can be compared with the in silico spectra of protein from database.

 Count the matches between in silico and in-vitro peaks.

 Give an equivalent score to candidate protein.

 Weigh each of the aforementioned match by the mass error

 Accumulate the score

With “all possible” fragments in in silico spectrum, and “reported” fragments in experimental spectrum, we can match and rank.

Scoring scheme should also consider the errors in peak matching
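One simple way to combine match counting with the mass error is sketched below. Every experimental peak that lies within a tolerance of some theoretical fragment contributes to the score, and the contribution shrinks linearly as the error grows. The linear weighting and the tolerance value are assumptions of this sketch, not the scheme of any particular search engine.

```python
def match_score(experimental, theoretical, tol=0.5):
    """Sum of error-weighted matches between experimental and in silico peaks."""
    if not theoretical:
        return 0.0
    score = 0.0
    for peak in experimental:
        error = min(abs(peak - fragment) for fragment in theoretical)
        if error <= tol:
            score += 1.0 - error / tol     # 1 for a perfect match, 0 at the limit
    return score


experimental_peaks = [100.0, 200.0, 450.0]
candidate_fragments = [100.0, 200.25, 300.0]
print(match_score(experimental_peaks, candidate_fragments))   # 1.5
```

Computing this score for each candidate protein and sorting the candidates by it gives the ranking mentioned above.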


Module 026: Protein Sequence Database Search Algorithm

MS1 and MS2 provide us with a host of data that enable us to identify unknown proteins. A step by step approach combining MW, PSTs and in silico spectral matching is required.

Figure 27 protein sequencing flowchart

Integrating MW, PST and in silico comparison algorithms in a workflow can help create a composite protein search engine. A composite scoring system is also required for this search engine.


Module 027: Integrative Scoring Schemes

Three individual scores can be obtained:

 MW Match score

 PST Match score

 In vitro & In silico Match score

For overall cumulative score computation:

 We simply sum the scores up (a linear function).

 Weigh each scoring component up by respective RMSE before summing them up

 Complex non-linear functions integrate the scoring components in Mascot etc.

 Commercial proteomics software uses highly proprietary scoring functions.

Composite scoring schemes are needed to combine scores coming in from multiple criteria. The ability of a scoring scheme to better isolate true positives from false positives is important.
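The first two options in the list can be sketched as follows. The linear version simply adds the three component scores. The weighted version divides each component by its error before summing, which is one possible reading of "weigh by RMSE" and is an assumption of this sketch. The function names are hypothetical.

```python
def linear_score(mw_score, pst_score, fragment_score):
    """Cumulative score as a plain sum of the three components."""
    return mw_score + pst_score + fragment_score


def weighted_score(components, min_error=1e-6):
    """Sum of score / error over (score, error) pairs; small errors weigh more."""
    return sum(score / max(error, min_error) for score, error in components)


print(linear_score(5.0, 3.0, 12.0))                 # 20.0
print(weighted_score([(10.0, 2.0), (6.0, 3.0)]))    # 7.0
```

The `min_error` floor only guards against division by zero when a component matches perfectly.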


Module 028: LARGE SCALE PROTEOMICS

Peptide mixtures in bottom up proteomics are very complex. The number of tryptic peptides may reach the order of 300,000–400,000. In whole proteome samples, the protein count may be over 10,000. Experiments have shown that it is difficult or even impossible to analyze all these peptides in a single analysis, as the mass spectrometer is essentially overwhelmed.

Over half a million peptides reported in a typical large scale proteomics (LSP) experiment are redundant.

If we relied on a single unique peptide for each protein, sequence coverage would suffer, so we have to strike a compromise between sequence coverage and sample coverage.

 TECHNIQUE

One way forward would be to transfer peptides to the MS chamber in a step-by-step manner. However, this imposes a precondition that a peptide is not selected earlier as well (i.e. more than once)

 STEP BY STEP TECHNIQUE

1. The instrument alternates between MS and MS/MS modes.

2. Three most intense peaks are chosen for MS/MS analysis.

3. After the initial MS scan, an MS/MS spectrum from peptide A is obtained by selectively fragmenting this mass only.

4. Next, a spectrum for peptide B is produced, followed by a recording of the MS/MS spectrum for peptide C.

5. After these three fragmentation spectra have been obtained, a new MS scan is started.

From this scan, three more peptides A B C are selected for fragmentation and the cycle starts over again.

The number of MS1/2 scans can be limited by carefully selecting the peptide peaks. Once the intense peptides are identified, next batch of peptides is chosen for MS2.
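The selection cycle can be sketched as a small function that picks the three most intense peaks of an MS scan while skipping peptides that were already chosen in an earlier cycle. The peak representation as (m/z, intensity) pairs, the matching tolerance, and the function name are assumptions of this sketch.

```python
def select_precursors(ms1_peaks, already_selected, top_n=3, tol=0.01):
    """Pick the top_n most intense peaks not selected in an earlier cycle.

    ms1_peaks is a list of (mz, intensity) pairs. already_selected is a list
    of previously chosen m/z values and is extended in place.
    """
    chosen = []
    for mz, _intensity in sorted(ms1_peaks, key=lambda p: p[1], reverse=True):
        if any(abs(mz - seen) <= tol for seen in already_selected):
            continue                        # this peptide was fragmented before
        chosen.append(mz)
        already_selected.append(mz)
        if len(chosen) == top_n:
            break
    return chosen


scan = [(500.1, 90), (620.4, 75), (433.2, 60), (710.9, 40), (388.0, 10)]
history = []
first_cycle = select_precursors(scan, history)
second_cycle = select_precursors(scan, history)
print(first_cycle)    # [500.1, 620.4, 433.2]
print(second_cycle)   # [710.9, 388.0]
```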


Module 029: PROTEOMICS DATA FILE FORMATS

A mass spectrometer is used to measure the mass/charge ratio of ionized proteins and peptides. The data output from the MS comprises the m/z ratio and intensity of each molecule that is measured.

The following formats are used for proteomics data:

Figure 28 Formats used for proteomics data

OPEN FORMATS:

mzXML (tools.proteomecenter.org/mzXMLViewer.php)

MGF (proteomicsresource.washington.edu/mascot/help/data_file_help.html)

Multiple MS data formats exist. Proprietary formats exist which come implemented as software with hardware. Also, open software standards exist for interoperability etc.


Module 030: RAW FILE FORMAT

Mass spectrometer outputs data with ionic mass/charge ratios & respective ion intensities. RAW file is a format in which an instrument outputs data in binary form.

Figure 29 Raw file formats

Figure 30 list of tools for raw data processing

Multiple RAW file formats are prevailing in the industry. Each vendor has its own unique RAW file format. You can convert proprietary formats into open formats


Module 031: MGF FILE FORMAT

MGF stands for Mascot Generic Format. MGF is a simple human-readable format for MS/MS data developed by Matrix Science. The Mascot search engine is available online at http://www.matrixscience.com/search_form_select.html

http://www.matrixscience.com/help/data_file_help.html
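Because MGF is plain text, it is easy to read with a few lines of code. In an MGF file, each spectrum sits between a `BEGIN IONS` line and an `END IONS` line, with `KEY=value` parameter lines (for example `TITLE`, `PEPMASS`, `CHARGE`) followed by one `m/z intensity` pair per line. The parser below is a minimal sketch that handles only this basic layout. The function name and the returned dictionary structure are choices made here for illustration.

```python
def parse_mgf(text):
    """Parse MGF text into a list of {'params': {...}, 'peaks': [(mz, intensity)]}."""
    spectra = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue                                   # skip blanks and comments
        if line == "BEGIN IONS":
            current = {"params": {}, "peaks": []}
        elif line == "END IONS":
            if current is not None:
                spectra.append(current)
            current = None
        elif current is not None:
            if "=" in line:                            # parameter line
                key, value = line.split("=", 1)
                current["params"][key] = value
            else:                                      # peak line
                fields = line.split()
                intensity = float(fields[1]) if len(fields) > 1 else 0.0
                current["peaks"].append((float(fields[0]), intensity))
    return spectra


example = """BEGIN IONS
TITLE=example spectrum
PEPMASS=501.007
CHARGE=2+
132.048 1200
260.106 800
END IONS
"""
for spectrum in parse_mgf(example):
    print(spectrum["params"]["PEPMASS"], len(spectrum["peaks"]))
```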


Module 032: OPEN MS DATA FORMAT

Mass spectrometer outputs data with mass/charge ratios & respective ion intensities. RAW file formats are specific to each instrument and each vendor has its own unique file format. Once an instrument is upgraded, data output from the instrument is also changed. Hence the underlying RAW file format needs to be upgraded as well.

NEED

Proprietary RAW formats are binary formats which are difficult to read and parse. If you have the software from the maker of the MS then you can read the RAW data file as well.

SOLUTION

 mzData was developed by HUPO-PSI

 mzXML was developed at the Institute for Systems Biology

 To combine them, a joint venture produced mzML


Figure 31 Formats used for open use

Several software tools exist for converting RAW file formats into open formats. Each open format has its own unique advantages. The mzXML and MGF formats are the most frequently used.

Module 033: Online Proteomics Tools - MASCOT

Matrix Science developed an online bottom up proteomics search engine called “MASCOT”. Mascot can search peptide mass fingerprinting and shotgun proteomics datasets.

Figure 32 mzML http://tools.proteomecenter.org/software.php

Figure 33 http://www.matrixscience.com/search_form_select.html


Mascot is the most widely used online search tool for proteomics data. However, it lacks a batch processing mode. Also, it does not cater for top-down proteomics data.


Module 034: Online Proteomics Tools – ProSight PTM

Kelleher et al. have developed an online top down proteomics search engine called “ProSight PTM”. ProSight PTM searches top down proteomics data and reports the precursor protein.

https://prosightptm.northwestern.edu/about_retriever.html


https://prosightptm.northwestern.edu/about_retriever.html


ProSight PTM is the state of the art in top down proteomics search. Using Prosight PTM, post-translational modifications can be accurately identified.


Module 035: Example Case Study - I

For the case study we follow these steps:

Step 1 – Monoisotopic Peak Detection

Natural elements occur as multiple isotopes. Isotopes differ in their masses. The abundance of each isotopic variant is unique.

Figure 34 Isotopic variants of natural elements

TYPES OF MASSES

 Nominal Mass

 Monoisotopic Mass

 Average Mass


Figure 35 Detecting Monoisotopic Peaks

Figure 36 Detecting Monoisotopic Peaks

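MS1 data reports the isotopic distribution of the intact molecule’s mass. The monoisotopic mass value has to be selected from this mass distribution. It corresponds to the molecule made up entirely of the most abundant isotope of each element, which for proteins and peptides is the lowest-mass peak in the distribution.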


Module 036: Example Case Study – II

The first step in protein identification and characterization using mass spectrometry involves intact protein/peptide mass measurement. Next, we fragment the protein. A protein or peptide backbone may be fragmented anywhere along the peptide backbone.

This results in formation of two fragments i.e. N-term fragment and C-Term fragment.

To count the possible fragments, let’s take an example protein with 100 residues. Such a molecule has 99 peptide bonds between adjacent residues, so its backbone can be fragmented at 99 different locations. Each cleavage yields two fragments, so the total number of possible fragments is then 198.

TANDEM MS

The masses of these fragments can then be measured by using an MS again. The necessary condition for this measurement is that all the fragments are ionized.

To ensure that all fragments of the precursor molecule are also charged, we can use Electrospray ionization (ESI). ESI induces multiple charges on the intact molecule.

Role of Electrospray Ionization

Since ESI induces multiple charges on the precursor molecule, there is a good chance that upon precursor’s fragmentation, each fragment will have a portion of the charge. ESI allows for production of multiple charged ions. Tandem MS helps measure molecular weights of ionized fragments


Module 037: Example Case Study – III

Tandem MS helps measure the masses of the fragments. Fragments which differ from each other by the mass of one amino acid can provide clues about the sequence of the protein.

Figure 37 Example peptide sequence tags

Peptide sequence tags help derive clues about the sequence of precursor proteins/peptides. The short peptide sequences help us in shortlisting candidate proteins from the database.


Module 038: Example Case Study – IV

MS1 helps measure the intact mass of proteins/peptides. A list of candidate proteins/peptides can be formed by comparing the MS1 mass to the masses of proteins/peptides in the database. MS2, or tandem MS, is performed after fragmentation of the intact proteins. Peptide sequence tags are extracted from the MS2 data, and the candidate proteins can be further shortlisted using the PSTs.

The next step is exhaustive matching of all MS2 peaks with the theoretical fragments of the candidate proteins. The set of theoretical fragments contains all possibilities of fragmentation.

Theoretical vs experimental fragments comparison helps as the third stage for shortlisting candidate protein list. This shortlisting will help you arrive at a small number of proteins


Module 039: Example Case Study – V

MS1 and MS2 provide the masses of the intact molecule and its fragments. This information helps filter proteins from the protein database. For a quantitative measure, a scoring scheme is required.

Figure 38 Example intact protein mass score

Figure 39 Example peptide sequence tags

Three scoring schemes can be applied to score the match at each stage of protein search. These scoring elements can be integrated to arrive at an overall candidate protein score.

Module 040: Example Case Study – VI

Comparisons can be performed at various levels of information. These include MS1, MS2, PSTs and theoretical fragments comparison. Integrated scoring schemes couple these factors.

For comprehensive scoring

A comprehensive scoring scheme can combine all the scores. Several optimizations can be undertaken on the scoring scheme to further improve protein identification

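Simply sum the scores up (a linear function):

$$Score = Score_{MW} + Score_{PST} + Score_{Exp<>Thr}$$

A second form divides each score by its corresponding error term:

$$Score = \frac{Score_{MW}}{E_{MW}} + \sum_{i=0}^{m} \frac{Score_{PST}}{RMSE_{PST}} + \sum_{i=0}^{n} \frac{Score_{Exp<>Thr}}{E_{Exp<>Thr}}$$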
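
The simple linear combination can be sketched in a few lines of Python. The function names and the example scores are hypothetical; how each individual stage score is computed is not specified here and is left to the caller.

```python
def combined_score(score_mw, score_pst, score_fragments):
    """Linear combination: simply sum the three stage scores."""
    return score_mw + score_pst + score_fragments


def rank_candidates(stage_scores):
    """Sort candidates by combined score, best first.

    `stage_scores` maps a protein name to a tuple of
    (intact mass score, PST score, experimental-vs-theoretical score).
    """
    totals = {name: combined_score(*scores)
              for name, scores in stage_scores.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical stage scores for two candidate proteins.
scores = {"protA": (0.9, 0.8, 0.7), "protB": (0.9, 0.2, 0.1)}
print(rank_candidates(scores)[0][0])  # prints protA
```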

Handout Chapter 06: Protein Structures

Page 221 of 320

Chapter 6 - Protein Structures

Module 001: PROPERTIES OF AMINO ACIDS-I

Proteins are made by polymerization of amino acids on ribosomes, and a protein’s properties are linked to the properties of its amino acids. There are 20 standard amino acids in nature. Each has a different chemical composition, and because proteins differ in their amino acid sequences, each protein is different from the others.

Figure 1 Chemical structure of an amino acid

An amino acid has three groups: a carboxyl group, an amine group and an R group. The R group (side chain) differs from one amino acid to another.

Figure 2 Periodic chart of amino acids

During polymerization, amino acids attach to each other through peptide bonds, and one water molecule is released for each bond formed.

Figure 3 Polymerization of amino acids

Amino acids have unique properties such as polarity, charge states and interactions with water. Each of these properties describes the overall characteristic of an amino acid.

Module 002: PROPERTIES OF AMINO ACIDS-II

Amino acids have characteristics like polarity, hydrophobicity, and charge states. These characteristics are governed by the elemental composition of an amino acid’s side chain (R group).

Figure 4 R group in amino acid

HYDROPHOBIC AMINO ACIDS

Since H and C introduce very little dipole moment, hydrophobic amino acids are non-polar. Hydrophobic amino acids are mostly found on the inside of folded proteins. They contain chains of C and H in their R groups.

Figure 5 Hydrophobic group

POLAR AMINO ACIDS

These amino acids are polar but not charged, i.e. there is no net charge on the amino acid. They prefer to reside in and interact with aqueous environments, and are mostly found at the surface of folded proteins.

Figure 6 Polar amino acids

Amino acids have unique properties such as polarity, charge states and interactions with water. Each of these properties describes the overall characteristic of an amino acid.

Module 003: PROPERTIES OF AMINO ACIDS-III

Some amino acids are positively charged and some have negative charge.

Figure 7 positively charged amino acids

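Figure 8 Negatively charged amino acids

Upon polymerization of amino acids into polypeptide chains, the charged backbone amine and carboxyl groups are neutralized by peptide bond formation, so a residue’s charge comes from its side chain. At pH = 7, five amino acids are charged: 2 negatively (aspartate and glutamate) and 3 positively (lysine, arginine and histidine).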

Module 004: PROPERTIES OF AMINO ACIDS-IV

Some amino acids are positively charged and some are negatively charged. The pK value of a chargeable group in an amino acid is the pH at which exactly half of that group is charged.

If pH < pK for an amino acid, the amine side chain gains a proton (H+) and becomes positively charged, hence basic.

If pH > pK for an amino acid, the carboxyl side chain loses a proton (H+) and becomes negatively charged, hence acidic.
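
The relationship between pH, pK and charge can be made quantitative with the Henderson-Hasselbalch equation. The sketch below is illustrative only: the function names are hypothetical, and the pK values in the usage lines (about 10.5 for a lysine side chain and about 3.9 for an aspartate side chain) are approximate textbook figures assumed for the example.

```python
def fraction_protonated(pH, pK):
    """Henderson-Hasselbalch: fraction of a group still holding its proton."""
    return 1.0 / (1.0 + 10.0 ** (pH - pK))


def side_chain_charge(pH, pK, kind):
    """Average charge of a side chain at a given pH.

    kind="basic":  the protonated form carries +1 (amine-like group).
    kind="acidic": the deprotonated form carries -1 (carboxyl-like group).
    """
    protonated = fraction_protonated(pH, pK)
    if kind == "basic":
        return protonated
    if kind == "acidic":
        return -(1.0 - protonated)
    raise ValueError("kind must be 'basic' or 'acidic'")


# At pH == pK exactly half of the group is charged.
print(fraction_protonated(7.0, 7.0))           # prints 0.5
# Assumed approximate pK values, for illustration only.
print(side_chain_charge(7.0, 10.5, "basic"))   # close to +1 (lysine-like)
print(side_chain_charge(7.0, 3.9, "acidic"))   # close to -1 (aspartate-like)
```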

Figure 9 Properties of amino acids according to pK and pH

Depending on the pH, an amino acid may become charged. This may be positive or negative depending on the amino acid.

Module 005: PROPERTIES OF AMINO ACIDS-V

Amino acids may be charged depending on pH. This depends on whether the amino acid accepts or donates a proton. Additionally, amino acids have structural features as well.

Figure 10 Aliphatic Amino Acids (Non polar C and H chains)

Figure 11 Aromatic R groups

Side chains also impact some properties. Side chains consisting merely of carbon and hydrogen are:

 Chemically inert,

 Poorly soluble in water

However, side chains containing organic acids are very different. They are chemically reactive and soluble in water. Elemental composition plays a very important role in determining the properties of amino acids. Solubility and reactivity are key factors in protein folding.

Module 006: STRUCTURAL TRAIT OF AMINO ACID-I

Amino acids have several properties such as charge state, polarity and hydrophobicity. It is important to note that the physical size of each amino acid also varies.

EXAMPLE-1: Glycine

Glycine residues increase backbone flexibility because they have no bulky R group (only an H), which leaves the chain free to move.

EXAMPLE-2: Proline

Proline residues reduce the flexibility of polypeptide chains. Proline cis-trans isomerization is often a rate-limiting step in protein folding.

Figure 12 Cis and trans forms of proline

EXAMPLE-3: Cysteine

Cysteine residues bond to each other through disulfide bonds, which stabilize 3-D protein structures. In eukaryotes, disulfide bonds can be found in secreted proteins or extracellular domains.

Figure 13 cystine

Amino acids not only have physical and chemical properties, but also structural properties. These structural properties are equally important in giving rise to protein structures.

Module 007: STRUCTURAL TRAIT OF AMINO ACID-II

Each amino acid has a unique set of properties such as charge state, polarity and hydrophobicity. Moreover, it may have unique structural traits as well which can help in protein folding. Since some amino acids are hydrophobic, they may be employed in forming a stable core in a protein. Also, chemically inactive amino acids reduce chances of destabilizing reactions in core.

There is a problem in burying hydrophobic amino acids in the protein core: the backbone is highly polar (hydrophilic) due to the polar -NH and C=O groups in each peptide unit, and these polar groups must be neutralized.

The solution is to form regular secondary structures, such as:

• Alpha Helices

• Beta Sheets

These are stabilized by H-bonds.

Module 008: STRUCTURAL TRAIT OF AMINO ACID-III

The size and structure of each amino acid is unique. Coupled with their chemical properties, each amino acid can uniquely contribute in the protein folding process.

The hydrophobic core formed by packed secondary structural elements provides a compact, stable core. Upon establishment of a stable protein core, unstable or reactive groups can be added.

The "functional groups" of a protein are attached to the hydrophobic core framework. The surface (exterior) of a protein must have more flexible regions (loops) and polar/charged residues.

The very few hydrophobic "patches" on the protein surface are involved in protein-protein interactions. The active regions in a protein are almost all present on the surface.

Figure 14 Organization of core and surface in a protein

Each component of the protein structure has a unique and precise role in the construction of proteins. Hydrophobic and hydrophilic components have equally useful roles.

Module 009: STRUCTURAL TRAIT OF AMINO ACID-IV

The size and structure of each amino acid is unique. Coupled with their chemical properties, each amino acid can uniquely contribute in the protein folding process.

Figure 15 Alpha Helix (C = black, O = red, N = blue)

The Alpha Helix is an example of how an amino acid chain folds. It is stabilized by H-bonds between backbone groups of residues about 4 positions apart. Reactive amino acids are exposed for external interactions.

Module 010: INTRODUCTION TO PROTEIN FOLDING

But how does a protein actually fold? A definite answer is still unknown, even though scientists have spent decades trying to find one. After polymerization of amino acids, linear chains are formed. When these chains of amino acids are put in water, the proteins fold spontaneously. The folded protein molecule should have the lowest possible energy. Anfinsen's dogma (also known as the thermodynamic hypothesis) is a postulate in molecular biology that, at least for small globular proteins, the native structure is determined only by the protein's amino acid sequence. The native structure is a unique, stable and kinetically accessible minimum of free energy.

Figure 16 Overall Energy (stability) of the Protein

Proteins fold spontaneously in water. Proteins fold to achieve thermodynamic stability. Proteins fold to organize themselves for performing functions in cells.

Module 011: IMPORTANCE OF PROTEIN FOLDING

Proteins are like functional machines in the cell, therefore understanding the folding behavior of proteins can help us in designing suitable drugs. If a protein is misfolded, it can lose its function. To study anomalies in structures and to discover newer structural forms, computational algorithms are used.

We can study the folding behavior of proteins computationally. First, we collect clues and evidence from experimentally reported structures. We then utilize these observations to analyze unknown structures. The manner in which a newly synthesized chain of amino acids transforms itself into a perfectly folded protein depends both on the intrinsic properties of the amino-acid sequence and on influences from the cellular environment (Dobson 2003).

Dobson, C. M. (2003). "Protein folding and misfolding." Nature 426(6968): 884-890.

Module 012: COMPUTING PROTEIN FOLDING POSSIBILITIES

Computing the protein folding can help us study misfolding, interaction between drugs and proteins etc. However, first, it is important to know the number of the protein folding possibilities.

Let’s assume that each amino acid can take one of three different conformations: Alpha Helix, Beta Sheet or Loop. We know that proteins comprise hundreds of amino acids.

If each amino acid can take 3 different conformations, and its parent protein has 100 amino acids, then there are $3^{100} \approx 5 \times 10^{47}$ possible combinations. If it takes 1/10th of a nanosecond ($10^{-10}$ s) to evaluate each one, then computing all the folding possibilities will take about $1.6 \times 10^{30}$ years.

In fact, it takes a protein less than a second to fold. This is the amazing speed of folding.

Figure 17 Overall Energy (stability) of the Protein

This is called “Levinthal’s Paradox”. We will try to understand this folding process using experimental datasets and algorithms. Molecular simulations are also helpful for it.
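
The arithmetic behind the paradox can be checked with a short Python calculation. The function name levinthal_years is hypothetical; the inputs are the values assumed in this module (3 conformations per residue, 100 residues, one tenth of a nanosecond per conformation).

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600


def levinthal_years(conformations_per_residue=3, residues=100,
                    seconds_per_conformation=1e-10):
    """Return (number of conformations, years needed to enumerate them)."""
    total = conformations_per_residue ** residues
    years = total * seconds_per_conformation / SECONDS_PER_YEAR
    return total, years


total, years = levinthal_years()
print(f"{total:.2e} conformations")  # prints 5.15e+47 conformations
print(f"{years:.2e} years")          # prints 1.63e+30 years
```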

Module 013: PROCESSING OF PROTEIN FOLDING

Levinthal’s Paradox refers to the enormous time required to compute all folding possibilities. It is impossible to consider all the possibilities computationally, so instead we try to understand the folding process itself.

The forces involved in protein folding include:

 Electrostatic interactions

 van der Waals interactions

 Hydrogen bonds

 Hydrophobic interactions

Figure 18 Protein folding

Figure 19 Anfinsen’s Experiment

Figure 20 Anfinsen’s Experiment

All the information required for folding a protein into its native structure is present within the protein’s amino acid sequence. The native folded form of a protein is thermodynamically the most stable compared to other forms.

Module 014: MODELS OF PROTEIN FOLDING

Information required for folding a protein into its native structure is present within the protein’s amino acid sequence. The native folded form of protein is thermodynamically most stable as compared to others.

FRAMEWORK MODEL

Figure 21 Step 1: Formation of secondary structures

Figure 22 Step 2: Arrangement of secondary structures

NUCLEATION CONDENSATION MODEL

Figure 23 Step 1: Formation of a Hydrophobic Core

Figure 24 Step 2: Including remaining amino acids and expanding the nucleus

Several models exist for folding a protein given its amino acid sequence. The fundamental requirement is that the folding process remain spontaneous. There is still no definitive folding hypothesis.

Module 015: PROTEIN STRUCTURE

Proteins spontaneously fold to take 3D forms. It’s a fast yet specific process which leads to a folded protein. Several forces act together to fold the protein structure.

Figure 25 Folding funnel

Figure 26 Energies of Various Bonds & Interactions

Figure 27 Hydroperoxide resistance protein OsmC (1vla)

Figure 28 Cystatin – 3 (C) http://beautifulproteins.blogspot.com/

Protein structures are very complex yet they form spontaneously. We will investigate how to develop algorithms to predict such structures.

Module 016: Primary, Secondary, Tertiary and Quaternary Structures

Complex protein structures form spontaneously as a protein folds. A huge variety of protein structures exist. Each structure is designed to perform a specific function. Interestingly, each protein mega structure gets built out of only a few sub-structures. Combinations from the SMALL substructure set are used to construct larger protein structures.

There are several levels of structure. Single-letter amino acid codes can be put together linearly to represent a protein sequence. This sequence is also called the primary sequence. The primary sequence can also be referred to as the 1’ structure. Sub-structures are formed as a result of the 1’ structure’s folding. Folded sub-structures are called secondary protein structures. Secondary structures are also referred to as 2’ structures.

2’ sub-structures are packed together to form super structures. These protein super structures are called tertiary structures. Tertiary structures are also referred to as 3’ structures.

3’ structures represent the complete monomeric protein structure. 3’ structures can combine with other polypeptide units to form a quaternary structure.

Quaternary structures are also called 4’ structures. 4’ structures are exemplified by protein complexes.

Protein structures are organized into 1’, 2’, 3’ and 4’ modular conformations. We will investigate how to develop algorithms to predict these structures.

Module 017: Primary Structures of Protein

Protein structures are organized into 1’, 2’, 3’ and 4’ modular conformations. 1’ structures are essentially the amino acid sequence of the proteins.

Figure 29 protein folding funnel

Figure 30 list of amino acids

There are two methods for obtaining 1’ structure.

 Edman Degradation

 Tandem Mass Spectrometry

1’ structure databases are essentially protein sequence databases. Examples include UniProt and Swiss-Prot, amongst several others.

Protein sequences are the primary structures of proteins. The primary or 1’ structure of a protein determines its initial properties. The 1’ structure lays the foundation for 2’ structures.

Module 018: Secondary Structures of Proteins - I

The primary or 1’ structure of a protein determines its basic properties and 1’ structure lays the foundation for 2’ structures. 2’ structures are also referred to as secondary structures.

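Figure 31 Organization of Secondary Structure

Formation of 2’ structure

The backbone carbonyl (C=O) oxygen on the C-terminal side of each residue carries a partial negative charge, and the backbone amine (N-H) hydrogen on the N-terminal side carries a partial positive charge. These C=O and N-H groups can therefore make hydrogen bonds with each other. Hydrogen bonds are the reason for 2’ structure formation.

Figure 32 Forming Secondary Structure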

Figure 33 Types of Secondary Structures – Alpha Helix

Protein sequences fold onto themselves and make H-bonds to create 2’ structures. Several types of 2’ structures exist. These include Alpha Helices and Beta Sheets.

Figure 34 Types of Secondary Structures – Beta Sheets

Module 019: Secondary Structures of Proteins – II

2’ structures or secondary protein structures are formed as a result of H-bond formation between the N-H and C=O groups in a protein backbone. Types of 2’ structures include Alpha Helices and Beta Sheets.

Figure 35 A Special Secondary Structure

Properties of Loop

 Loops connect helices and sheets

 Loops vary in length and 3-D configurations

 Loops are mostly located on the surface of proteins

 Loops are more tolerant of mutations

 Loops are flexible and can adopt multiple conformations

 Loops tend to have charged and polar amino acids

 Loops are frequently components of active sites

Coils

 Secondary structures that are not helices, sheets, or recognizable turns

 Disordered regions, but also appear to play important functional roles

Loops and Coils are also secondary structures, which form among the first structures after folding of a protein’s amino acid chain. Loops and Coils are very important 2’ structures in that they often form the active sites of proteins.

Module 020: Tertiary Structures of Proteins

2’ structures include alpha helices, beta sheets, loops and coils. Upon combination of 2’ structures, a tertiary or 3’ structure is formed. The 3’ structure is the next level of structure organization.

Figure 36 Example of Tertiary Structure

Formation of 3’ structure

 Hydrophobic interactions between nonpolar R-groups

 Covalent bonds in the form of Disulphide bridges

Combinations of Alpha helices, Beta sheets, coils and loops help form 3’ structures. Covalent bonds, Hydrogen bonds and hydrophobic interactions enforce the 3’ structure.

Module 021: Quaternary Structures of Proteins

4’ structures or quaternary structures are formed by the different peptide chains that make up a protein. Multimeric proteins, which comprise multiple peptide chains, form 4’ structures.

Monomeric vs. Multimeric Proteins

Proteins comprising only a single chain (monomeric) do not have a quaternary structure. Proteins with multiple chains can form 4’ structures.

Figure 37 Example of Quaternary Structure See how 2’ and 3’ structures come together

4’ structures are kept in conformation by Hydrogen Bonds, Covalent Disulphide Bonds, Hydrophobic Interactions and ionic bonds. In terms of stability 4’ > 3’ > 2’ > 1’

Module 022: Introduction to Bond Angles in Proteins

Protein folding results in a linear chain of amino acids getting packed into a compact 3D structure. This changes the backbone angles from their initial value of 180 degrees (the protein’s linear form).

Figure 38 Linear Protein

Figure 39 Formation of Planar Peptide Bond

The resultant chain gets its own set of attributes, and the peptide bond is planar and rigid.

Dihedral Angles

 Angle between two planes (i.e. 4 points)!

 Considering the middle two points to be aligned (or overlapped), the angle between the 1st, overlapped and the 4th points forms a dihedral angle.

Figure 40 Protein after Folding: Phi and Psi Angles

Figure 41 Protein after Folding: Phi and Psi Angles

Φ (phi, involving C'-N-Cα-C') ψ (psi, involving N-Cα-C'-N)

Proteins fold into 3D structures. Phi and psi angles are taken up as a result of folding. These angles can be measured towards understanding the protein structure.
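
As a minimal sketch of how such an angle can be measured, the Python function below computes the dihedral angle defined by four points using the standard atan2 formulation. The function and helper names are illustrative; for phi the four points would be the C', N, Cα and C' atoms, and for psi the N, Cα, C' and N atoms, taken from the structure's coordinates.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dihedral_degrees(p1, p2, p3, p4):
    """Dihedral angle (degrees) between planes (p1,p2,p3) and (p2,p3,p4)."""
    b1 = _sub(p2, p1)
    b2 = _sub(p3, p2)
    b3 = _sub(p4, p3)
    n1 = _cross(b1, b2)   # normal of the first plane
    n2 = _cross(b2, b3)   # normal of the second plane
    x = _dot(n1, n2)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    return math.degrees(math.atan2(y, x))


# Four points arranged so that the two planes are perpendicular.
print(dihedral_degrees((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))  # prints 90.0
```

With this formulation, a cis arrangement (1st and 4th points on the same side) gives 0 degrees and a trans arrangement gives 180 degrees.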

Module 023: Ramachandran Plot

Phi and psi angles can be measured within folded structures:

 φ (phi) involves C'-N-Cα-C'

 ψ (psi) involves N-Cα-C'-N

Figure 42 Phi and Psi Angles

Figure 43 Allowable Phi and Psi Angles

Data as in (Lovell et al. 2003) showing about 100,000 data points for several amino-acids

A limited range of phi and psi angles is taken up as a result of folding. This range of angles constitutes the allowable range of torsion or rotation angles that are taken up by the protein.

Module 024: Structure Visualization - I

We know that the protein backbone takes up specific rotation angles after folding. A protein consists of multiple amino acids. Each amino acid residue has a C-terminal (carboxyl) end and an N-terminal (amine) end.

Figure 44 Protein Backbone and C atoms

Figure 45 Omitting Planar bonds and Tracing C-Alpha atoms in backbone

Figure 46 C-Alpha Backbone visualization

C-Alpha atoms are traced to recreate a 3D protein structure. The choice is made while keeping planar nature of the peptide bond in view. Later we will see how to insert side chains into the visual models as well.

Module 025: Structure Visualization – II

C-Alphas can be used to construct the backbone of a protein towards its visualization. We also need a representation of measurements for assigning the atomic distances. The ångström is used to express the size of atoms, molecules and extremely small biological structures, the lengths of chemical bonds, the arrangement of atoms in crystals.

1 ångström is a unit of length equal to $10^{-10}$ m (one ten-billionth of a meter) or 0.1 nm.

Atoms of phosphorus, sulfur, and chlorine are ~1 Å in covalent radius, while a hydrogen atom is ~0.25 Å.

Figure 47 Anders Jonas Ångström (1814–1874)

C-Alpha atoms are traced to recreate a 3D protein structure. The distance between C-Alpha atoms can be expressed in the unit “Ångström”. A resolution of 1 Å is better than one of 10 Å.

Module 026: Experimental Determination of Protein Structure

C-Alpha atoms are traced to recreate a 3D protein structure. Distances between C-Alphas are measured in the unit “Angstrom”.

X-Ray Crystallography

 Crystallography data gives relative positions of atomic coordinates

 The data is obtained from diffractions by the atoms in a protein structure

 The coordinates of each atom along the x, y and z axes are output

Figure 48 X-ray crystallography

Crystallized proteins are used to determine protein structures. As X-rays diffract from the atoms in a protein, the atomic distances are noted. These distances in 3D are measured in Angstroms.

Module 027: Protein Databank

Positions of C-Alpha atoms are used to construct the 3D protein structure. X-ray diffraction data helps measure the atomic positions. The X, Y and Z atomic positions for many proteins are available online.

Figure 49 PDB File Format

The PDB contains protein structure information. It has atomic coordinates, including those of the C-Alphas, for over 50,000 proteins. Protein structures can be visualized using this information.

Module 028: Visualization Technique

Proteins fold into 3D structures. Phi and psi angles are assumed as a result of folding. These angles can be measured and viewed towards understanding the protein structure. To view a protein, we need to evaluate the physical location of its atoms. Proteins have Carbon and Nitrogen in their backbone.

CA atomic coordinates

 To trace the backbone of a protein, CA atoms trace can be used

 Note that CA atoms have the side chains attached to them

 CA coordinates can be found in the PDB file

Protein structures can be visualized by tracing the CA atoms. Coordinates of CA atoms can be obtained from the PDB. Next, we need a tool to plot these coordinates.
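
A minimal sketch of extracting the CA trace from PDB-format text is given below. It relies on the fixed-width ATOM record layout of the PDB format (atom name in columns 13-16, x, y and z in columns 31-38, 39-46 and 47-54). The function names and the example coordinates are illustrative assumptions; the helper format_atom_record exists only to build correctly aligned example lines.

```python
import math


def format_atom_record(serial, atom_name, res_name, chain, res_seq, x, y, z):
    """Build a fixed-width PDB ATOM line (fields up to the z coordinate)."""
    return (f"ATOM  {serial:5d} {atom_name:4s} {res_name:3s} {chain}"
            f"{res_seq:4d}    {x:8.3f}{y:8.3f}{z:8.3f}")


def parse_ca_coordinates(pdb_lines):
    """Return (x, y, z) tuples for every C-alpha atom in ATOM records."""
    coords = []
    for line in pdb_lines:
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            x = float(line[30:38])
            y = float(line[38:46])
            z = float(line[46:54])
            coords.append((x, y, z))
    return coords


def consecutive_distances(coords):
    """Distances (in Angstroms) between successive C-alpha atoms."""
    return [math.dist(a, b) for a, b in zip(coords, coords[1:])]


# Illustrative records for two residues (coordinates are example values).
records = [
    format_atom_record(1, " N  ", "MET", "A", 1, 27.340, 24.430, 2.614),
    format_atom_record(2, " CA ", "MET", "A", 1, 26.266, 25.413, 2.842),
    format_atom_record(3, " C  ", "MET", "A", 1, 26.913, 26.639, 3.531),
    format_atom_record(4, " N  ", "GLN", "A", 2, 26.335, 27.770, 3.258),
    format_atom_record(5, " CA ", "GLN", "A", 2, 26.850, 29.021, 3.898),
]
ca_trace = parse_ca_coordinates(records)
print(ca_trace)
print(consecutive_distances(ca_trace))
```

The resulting list of CA coordinates is what a plotting or visualization tool would connect to draw the backbone.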

Module 029: Online Resources for Protein Visualization

Protein structures can be visualized by tracing the CA atoms. CA Coordinates can be taken from PDB.

Online Tools

 Rasmol and CHIME are basic tools for visualizing proteins

 Swiss PDB Viewer offers several features such as protein surface view, alignment of several proteins & modelling secondary structures

 PyMOL is a python-script based tool for visualizing the protein structure

 Cn3D is another tool which helps us visualize protein structures

 It also provides for annotating protein structures

Protein structures are visualized using several online tools. These tools include Rasmol, CHIME, Swiss PDB Viewer and Cn3D.

Module 030: Types of Protein Visualizations

To visualize proteins, we use CA coordinates or positions. We can use several online tools to view the resulting model.

CPK: Corey-Pauling-Koltun Diagrams. In CPK diagrams, each atom is represented by a solid sphere. The radius of each sphere equals the atom’s van der Waals radius, so the sphere represents the volume of the atom.

Figure 50 sphere and surface diagrams of protein

http://www.danforthcenter.org/smith/MolView/Over/overview.html

Ribbon Diagrams

Ribbon diagrams are an easy and frequently used technique for representing protein structures. Structure is represented by the secondary structures (fold) using simple cartoon figures.

Figure 51 ribbon diagrams

http://www.danforthcenter.org/smith/MolView/Over/overview.html

Balls & Stick (BS) Models

BS model is another popular protein structure representation strategy. BS Models have atoms as colored balls and intermediate bonds as sticks.

Figure 52 Balls and sticks model

http://www.danforthcenter.org/smith/MolView/Over/overview.html

Figure 53 Colored Sticks Models

http://www.danforthcenter.org/smith/MolView/Over/overview.html

Protein Structure Visualization can be performed using several atomic representations. These include CPK, Ribbon and Balls & Stick Diagrams.

Module 031: Introduction to Energy of Protein Structures

Proteins come together as a result of peptide bond formation between various amino acids. The resulting polymer then goes through the step of folding which leads to the formation of a 3D structure.

https://folding.stanford.edu/home/the-science/

Role of Amino Acids

We know that amino acids can be polar, charged or hydrophobic. Polar and charged amino acids play one role in folding, and hydrophobic amino acids play another.

Overall Goal of Folding

Anfinsen’s thermodynamic hypothesis: proteins fold into a unique, stable and kinetically accessible structure of minimum free energy. What other factors may come into play in satisfying Anfinsen’s hypothesis?

Minimizing Energy

We know that if bonds can be formed between two atoms, then energy is released. This leads to a situation where there is less free energy accessible to each atom for further interactions. So, proteins maximize the bonds that can be made between the side chains of their constituent amino acids.

Such atomic interactions include:

 Disulphide bonds between Cysteine residues

 Hydrogen Bonds

 Van der Waals Forces

 Electrostatic Interactions between polar/charged amino acids

The greater the number of these bonds, the more stable a protein becomes. Hence, the basic idea of thermodynamic stability is to maximize bonding in order to minimize the free energy.

Module 032: Calculating Energy of a Protein Structure

As we know the greater the number of bonds between the amino acids, the more stable a protein becomes.

Figure 54 Energies of Interactions www.ucdavis.edu

Comparison of bond energies

 Hydrophobic interactions >

 Electrostatic interactions>

 Hydrogen bond > van der Waals

Calculating overall energy of a protein structure

Given the number of atomic interactions in a protein, you can simply sum the energy in the protein molecule.

Energies of protein structures can be computed by first enumerating the types of interactions between the atoms, and then accumulating the energy of each interaction to obtain the overall energy of the protein.
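
The summation can be sketched as follows. The per-interaction energies in the example table are hypothetical placeholder numbers chosen only to make the arithmetic visible; they are not measured values, and the function name total_energy is likewise an assumption.

```python
def total_energy(interaction_counts, energy_per_interaction):
    """Sum count x energy over every interaction type found in the protein."""
    total = 0.0
    for kind, count in interaction_counts.items():
        total += count * energy_per_interaction[kind]
    return total


# Hypothetical placeholder energies (arbitrary units, negative = stabilizing).
energy_table = {
    "disulphide": -60.0,
    "electrostatic": -5.0,
    "hydrogen_bond": -3.0,
    "van_der_waals": -1.0,
}

# Hypothetical interaction counts for a small folded protein.
counts = {"hydrogen_bond": 10, "van_der_waals": 20}
print(total_energy(counts, energy_table))  # prints -50.0
```

A real energy table would be taken from a reference such as the bond/interaction energy figure in this module.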

Module 033: Structure Determination for Energy Calculations

The greater the interactions between the amino acids, the more stable a protein becomes. We can calculate energy of a folded protein based on the number and types of atomic interactions.

How to find the number of interactions

 To determine the number of each type of interaction within a protein, we need to find its inter-atomic distances.

 Based on specific atomic distances, we can guess the type of atomic interaction.

 By looking up the bond/energy table, we can compute the overall energy.

Techniques for structure determination

 X-Ray Crystallography

 Nuclear Magnetic Resonance (NMR) Spectroscopy

We need to know the structure of the protein to calculate atomic distances. Atomic distances tell us about atomic interactions with neighboring atoms. To determine the structure, we use X-Ray or NMR.

Module 034: Review of Experimental Structure Determination

The greater the interactions between the amino acids, the more stable a protein becomes. We can calculate energy of a folded protein based on the number and types of atomic interactions.

The structure also dictates which functions a protein can perform via the positioning of hydrophilic & polar amino acids. For determining stability, structure & function, we need to find the amino acid interactions. Several experimental methods exist for structure determination.

 X-Ray Crystallography

 Nuclear Magnetic Resonance (NMR) Spectroscopy

Figure 55 to measure a bond/interaction, we must first see atoms

Figure 56 Principle of X-Ray Crystallography

Figure 57 from Diffraction Patterns to Atomic Positions

Upon establishing the atomic positions and distances, we can then check for possible interaction between the different atoms. Atomic distances can help us classify interaction types e.g. hydrogen bonds, electrostatic & polar.

Module 035: Alpha Helices - I

Atomic distances can tell us about the interactions that exist between atoms. Different types of interactions may occur between atoms, e.g. hydrogen bonds, polar interactions etc. If two atoms are participating in a covalent bond such as O-H, their distance is ~0.96 Å. In case of hydrogen bond formation between atoms, the inter-atomic distance is ~1.97 Å. X-ray data should therefore have a resolution of 1.97 Å or better.

Figure 58 Hydrogen Bonds to Fold an Amino Acid Chain

X-ray crystallography data shows that the hydrogen atom of a backbone N-H group may come together with the oxygen atom of the backbone C=O group of the amino acid at the 4th neighboring position. Their atomic distance is ~1.9 Å, and hence they are considered to be in a hydrogen bond.
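
A rough sketch of distance-based classification is shown below. The cut-off values (1.2 Å and 2.2 Å) are assumptions chosen to bracket the ~0.96 Å and ~1.97 Å distances quoted in this module for hydrogen-to-oxygen contacts; they are not established thresholds, and other atom pairs would need different limits.

```python
import math


def classify_contact(distance, covalent_max=1.2, hbond_max=2.2):
    """Guess the interaction type of a hydrogen-oxygen pair from distance (in Å)."""
    if distance <= covalent_max:
        return "covalent bond"
    if distance <= hbond_max:
        return "hydrogen bond"
    return "no direct bond"


def classify_atoms(atom_a, atom_b):
    """Classify two atoms given as (x, y, z) coordinates in Angstroms."""
    return classify_contact(math.dist(atom_a, atom_b))


print(classify_contact(0.96))   # prints covalent bond
print(classify_contact(1.97))   # prints hydrogen bond
print(classify_atoms((0.0, 0.0, 0.0), (3.0, 4.0, 0.0)))  # prints no direct bond
```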

Module 036: Alpha Helices – II

X-Ray Crystallography of proteins shows that the amino-group Hydrogen atoms come together with the carbonyl Oxygen atoms of the amino acids at the 4th neighboring position to make Hydrogen bonds.

Figure 59 Forming Alpha Helix

Every carbonyl Oxygen is hydrogen bonded to the Amino Group Hydrogen of the 4th neighboring residue.

Figure 60 Carbons (Black) & Nitrogens (Blue): 1-5, 2-6, 3-7…

Figure 61 Preference of Amino Acids for making Alpha Helices

Helix Formers

Any of the 20 amino acids can be present in the backbone. Do amino acids differ in their preference to form a helix? Yes, “Helix Formers” are generally hydrophobic amino acids (M, A, L…). Alpha Helices are formed by hydrogen bonding (O···H) between the C=O group of residue i and the N–H group of residue i+4 in the protein backbone.
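
The i to i+4 pattern (1-5, 2-6, 3-7…) can be listed with a few lines of Python. This is only a sketch of the ideal pattern; the function name is made up for this example, and real helices may deviate at their ends.

```python
def helix_hbond_pairs(first, last):
    """Return (i, i+4) residue pairs for an ideal alpha helix spanning residues first..last."""
    # The carbonyl oxygen of residue i pairs with the amino hydrogen of residue i+4,
    # so the last residue that can start a pair is last - 4.
    return [(i, i + 4) for i in range(first, last - 3)]


print(helix_hbond_pairs(1, 8))  # [(1, 5), (2, 6), (3, 7), (4, 8)]
```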

Module 037: Beta Sheets - I

Alpha Helices are formed by hydrogen bonding (O···H) between the C=O group of residue i and the N–H group of residue i+4 in the protein backbone. Beta Sheets are another common secondary structure. They are made up of several Beta Strands which come together. Typically, 5 to 10 residues are needed to make a Beta Strand.

Hydrogen Bonds between Beta Strands

The Beta Sheet is made up of several Beta Strands

C-Alpha atoms and the CO and NH groups are shown in blue, yellow, and green, respectively.

This is called a parallel beta sheet.

This is called an anti-parallel beta sheet.

Beta Sheets are another secondary structure, formed as a result of hydrogen bonding between segments of the protein backbone. Some amino acids have a preference for making Beta Sheets.

Module 038: Beta Sheets - II

Beta strands can make hydrogen bonds with each other and organize as beta sheets.

Beta structures come in different forms:

 Beta Strand

 Beta Sheet

 Beta Barrel

 Beta Sandwiches

Beta Barrels

A Beta Barrel is a single large beta sheet that twists and coils upon itself to form a closed structure in which the first strand is hydrogen bonded to the last. Beta strands in beta barrels are typically arranged in an antiparallel fashion. https://en.wikipedia.org/wiki/Beta_barrel

Figure 62 beta barrel

Beta Sandwiches

Beta Sandwiches are made of two beta sheets which are usually twisted and packed so their strands are aligned.

Figure 63 Illustration of the β-sandwich from Tenascin C (PDB entry: 1TEN).

Figure 64 Preference of Amino Acids for making Beta strands

Beta Sheets are formed by H bonds between 5–10 consecutive amino acids in one portion of the backbone and another 5–10 farther down the backbone. Beta strands may be adjacent (with a loop in between) or far apart, with other structures in between.

Module 039: LOOPS-I

Alpha Helices and Beta Sheets are secondary structures formed as a result of hydrogen bonding in between protein backbone atoms.

Protein Backbone and Secondary Structures

Loops are formed by the amino acids that lie between the Alpha Helices and Beta Sheets in a protein backbone.

Figure 65 Joining Alpha Helices and Beta Sheets in a Protein Backbone

Loops are variable in length and 3-D conformation. This variability allows loops to join Alpha Helices and Beta Sheets in a variety of ways.

Characteristics

 Loops are mostly located on the surface of protein structure

 Loops mutate in sequence at a much faster rate than Alpha Helices and Beta Sheets

 Loops are flexible and can adopt multiple conformations

Loops dictate the overall structure of protein as they couple Alpha helices and beta sheets

Module 040: LOOPS-II

Loops dictate the overall structure of protein as they couple Alpha helices and Beta sheets. Loops are flexible and have variable lengths so as to successfully bridge between secondary structures.

Figure 66 Loops in 3D Conformation

Loop Properties

 Loops are mostly comprised of charged and polar amino acids

 Loops frequently participate as components of active sites

Figure 67 Preference of Amino Acids for making Loops

Types of Loops

 Hairpin loops are two amino acids long and join anti-parallel Beta strands

 Other Loops may be 3 to 4 amino acids long

 Loops fall into various families

Loops are the third type of secondary structure after Alpha helices and Beta sheets. Loops are unique in that they are flexible and of variable length. Loops frequently form part of active sites.

Module 041: COILS

Alpha helices and beta sheets are the regular secondary structures. Loops are flexible secondary structures that connect alpha helices and beta sheets. Coils are another secondary structure. Unlike loops, coils are unstructured. Essentially, a secondary structure which is not a helix, sheet or loop is a coil.

Functional Aspects of Coils

 Coils are apparently disordered regions

 They are oriented randomly while being bonded to adjacent amino acids

 However, coils also appear to play important functional roles

Figure 68 Coils in Myoglobin

Coils are those secondary structures formed by the protein backbone which are neither helices, sheets nor loops. In fact, coils do not have a consistent, classifiable structure. Hence, coils have random structure and random length.

Module 042: Structure Classification - I

Proteins have primary, secondary, tertiary & quaternary structures. Each level of protein structure organization is known to impart specific characteristics to the protein.

Review of the 4 structure levels

 Primary Structures

 Secondary Structures

 Tertiary Structures

 Quaternary Structures

Structural features tend to be more conserved than sequences. Therefore, it may be useful to look at the secondary/tertiary structures when studying conservation.

Classification

 The evolution of protein structures and their hierarchy is not systematized

 Hence, we need to classify the function of protein by examining their secondary and tertiary structures

Motifs (Non-functional Combinations of 2’ structures)

Figure 69 Domain (Functionally Complete)

Domains are semi-independent functional structures in a protein. They have a stable structure and are typically over ~40 residues long. A protein may contain multiple domains.

Module 043: Structure Classification - II

Domains are semi-independent functional structures in a protein. A protein may contain multiple domains. Hence, we can try to classify proteins by their domains. Domains are locally compact: they interact (via H-bonds) more internally than externally. Domains have a hydrophobic core. Domains are contiguous (minimal chain breaks).

Domains have minimal contact with the rest of the peptide. The solvent-accessible area of each domain should not change significantly when two domains are separated.

Types of Domains

 Alpha Domains

 Beta Domains

 Alpha/Beta Domains

 Alpha + Beta Domains

 Alpha & Beta Multi-Domains

 Membrane & cell-surface proteins

So, by looking at proteins, we can list the domains present in each protein. Once domains in each protein are listed, we can classify whole proteins into various types and classes.

Module 044: Examples of Protein Domains

Proteins contain many types of domains.

 Alpha Domains

 Beta Domains

 Alpha/Beta Domains

 Alpha + Beta Domains

 Alpha & Beta Multi-Domains

 Membrane & cell-surface proteins

Figure 70 Alpha Domain: Hemoglobin (1bab)

Figure 71 Beta Domain: Immunoglobulin (8fab)

Figure 72 Alpha / Beta: Triosephosphate isomerase (1hti)

Figure 73 Alpha + Beta: Lysozyme (1jsf)

Various types of domain architectures exist in proteins. Such architectures can be classified into general structural classes. Databases can be made from classes.

Module 045: CATH Classification

Domains can be classified into structural classes. Classes can be further classified into Architecture and Topologies. Let’s see how it is done in CATH.

Figure 74 Structural Classes

Class

 Similar secondary structure content

 All α, all β, alternating α/β etc.

Architecture

 Also called FOLD

 Major structural similarity

 SSE’s in similar arrangement

Topology

 Super Family

 Probable common ancestry

 Family membership

Homology

 Same Family

 Clear evolutionary relationship

 Pairwise sequence similarity > 30%

CATH classifies proteins by their structural similarity. It also considers the internal organization of the structural components in proteins.

Module 046: Classification Databases

Proteins are classified into various structural classes. CATH is one such system, in which proteins are organized by class, architecture, topology and homology. SCOP (Structural Classification of Proteins) is another such database:

http://scop.mrc-lmb.cam.ac.uk/scop/

Figure 75 SCOP Classification Statistics

http://scop.mrc-lmb.cam.ac.uk/scop/count.html

FSSP - Family of Structurally Similar Proteins, based on the DALI algorithm. Pclass - Protein Classification, based on the LOCK and 3Dsearch algorithms.

Module 047: Algorithms for Structure Classification

Several algorithms exist for classifying protein structures.

Inter-Molecular Distance Algorithms

 Proteins are considered as rigid bodies.

 They are placed in a 3D Cartesian coordinate system.

 Structural alignment in 3D.

 E.g. VAST, LOCK

Intra-Molecular Distance Algorithms

 Proteins are considered as rigid bodies.

 Their internal distances are represented in 2D.

 Structural alignment using internal distances and angles.

The basic idea is to capture the internal geometry of protein structures. E.g. DALI and SSAP.

Such algorithms are also very useful to compare whole protein structures. They can help determine evolutionary relationship. Also, functional similarity can be estimated.

Module 048: Protein Structure Comparison

Proteins are assembled into primary (1’), secondary (2’), tertiary (3’) and quaternary (4’) structures. Protein sequence is less conserved than its structure. Protein structure determines function. Since protein structure dictates function, comparing two structures can help us evaluate if the proteins do the same or similar function.

Comparing Whole Protein Structures

Proteins contain multiple structural subunits e.g. secondary structures, motifs and domains. Structures of all such subunits are to be considered as one and compared. We know that domains are functionally independent components of the protein structure. Proteins may have multiple domains. So for two different proteins, sharing the same domain, we may want to compare only a portion of the overall structure i.e. a domain. For comparing the complete or partial protein structures, the position of Alpha Carbon atoms can be used. The (x, y, z) positions of Alpha Carbon atoms can be obtained from the PDB.

Figure 74 C-Alpha atoms in backbone

Figure 76 Tracing and Visualizing C-Alpha Backbone

http://www.danforthcenter.org/smith/MolView/Over/overview.html

PDB coordinates of Alpha Carbons in the protein backbone can be used for comparison. In this way, whole protein structures, domains etc. can be compared.
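
As a rough sketch of how Alpha Carbon positions could be pulled out of a PDB file, the Python function below reads the fixed-column ATOM records and keeps the atoms named CA. The function names are made up for this example, `atom_line` is a hypothetical helper that only builds demo records, and the coordinates shown are invented sample values, not taken from a real PDB entry.

```python
def read_alpha_carbons(pdb_lines):
    """Collect the (x, y, z) position of every alpha carbon in PDB-format lines."""
    coords = []
    for line in pdb_lines:
        # Fixed-column PDB format: record name in columns 1-6, atom name in 13-16,
        # x, y and z in columns 31-38, 39-46 and 47-54.
        if line[0:6].strip() == "ATOM" and line[12:16].strip() == "CA":
            coords.append((float(line[30:38]), float(line[38:46]), float(line[46:54])))
    return coords


def atom_line(serial, name, residue, chain, resseq, x, y, z):
    """Build a minimal ATOM record (hypothetical helper, used only for this demo)."""
    return ("ATOM".ljust(6) + f"{serial:5d}" + " " + f"{name:<4s}" + " "
            + f"{residue:>3s}" + " " + chain + f"{resseq:4d}" + " " * 4
            + f"{x:8.3f}{y:8.3f}{z:8.3f}")


# Invented sample records: one backbone nitrogen and two alpha carbons.
sample = [
    atom_line(1, " N  ", "MET", "A", 1, 27.340, 24.430, 2.614),
    atom_line(2, " CA ", "MET", "A", 1, 26.266, 25.413, 2.842),
    atom_line(10, " CA ", "GLN", "A", 2, 25.112, 24.880, 6.397),
]
print(read_alpha_carbons(sample))
```

Only ATOM records are read here, so a calcium ion stored in a HETATM record (also named CA) would be skipped.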

Module 049: Strategies for Structure Comparison - I

PDB coordinates of Alpha Carbons in the protein backbone can be used for comparison. Thus, two whole protein structures or domains within each structure can be compared.

Figure 77 Tracing and Visualizing C-Alpha Backbone

http://www.danforthcenter.org/smith/MolView/Over/overview.html

Strategy # 1 – Whole Protein Structure Comparison by Intermolecular distances

 Two protein sequences are pair-wise aligned with each other

 Corresponding Alpha Carbons are identified

 Coordinates of corresponding Alpha Carbons are retrieved from PDB

 The distance between each pair of corresponding Alpha Carbons is calculated

 Root Mean Square Deviation (RMSD) is computed to assess the similarity

Whole protein structures can be compared by calculating the root mean square deviation (RMSD) between their Alpha Carbon positions. The lower the RMSD, the more similar the proteins are.
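
A minimal Python sketch of the last two steps is given below. It assumes the two coordinate lists already hold corresponding Alpha Carbons in the same order and that the structures have already been placed in the same frame; the function name `rmsd` is chosen for this example.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square deviation between paired (x, y, z) positions."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    # Sum the squared distance of every corresponding pair, then average and take the root.
    squared = sum(math.dist(a, b) ** 2 for a, b in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))


protein_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
protein_b = [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0)]
print(rmsd(protein_a, protein_b))
```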

Module 050: Strategies for Structure Comparison - II

Full protein structures can be compared and ranked by the overall differences in positions between their Alpha Carbons. But proteins are 3D objects found in various positions and orientations, so they must first be superimposed.

Full Protein Comparison

(Translation -> Rotation -> RMSD)

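Calculating RMSD – An Example

$$\mathrm{RMS}(A, B) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} d(a_i, b_i)^2}$$

where $a_i$ and $b_i$ are the corresponding Alpha Carbons of proteins A and B, $d(a_i, b_i)$ is the distance between them, and $n$ is the number of Alpha Carbon pairs.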

Domain or Motif Comparison

(Region Selection -> Translation -> Rotation -> RMSD)

Motifs, Domains and Full Proteins can be compared by using the rigid body super-positioning.

Depending on the RMSD, proteins, their motifs and domains can be selectively compared.
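
The Translation -> Rotation -> RMSD pipeline can be sketched in plain Python as below. This is one possible approach, not necessarily the one used by any particular tool: both structures are centered on their centroids (translation), and the best rotation is found through Horn's quaternion method, using simple power iteration to get the largest eigenvalue. The function names, the starting vector and the iteration count are assumptions made for this sketch.

```python
import math


def centered(coords):
    """Translate points so that their centroid sits at the origin."""
    n = len(coords)
    center = [sum(p[k] for p in coords) / n for k in range(3)]
    return [tuple(p[k] - center[k] for k in range(3)) for p in coords]


def superposed_rmsd(coords_a, coords_b, iterations=500):
    """Smallest RMSD obtainable by rigid-body translation and rotation."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    a, b = centered(coords_a), centered(coords_b)  # translation step
    n = len(a)
    # 3x3 correlation matrix: s[j][k] = sum over atoms of a_j * b_k
    s = [[sum(p[j] * q[k] for p, q in zip(a, b)) for k in range(3)] for j in range(3)]
    (sxx, sxy, sxz), (syx, syy, syz), (szx, szy, szz) = s
    # Horn's symmetric 4x4 matrix; its largest eigenvalue belongs to the best rotation.
    horn = [
        [sxx + syy + szz, syz - szy, szx - sxz, sxy - syx],
        [syz - szy, sxx - syy - szz, sxy + syx, szx + sxz],
        [szx - sxz, sxy + syx, -sxx + syy - szz, syz + szy],
        [sxy - syx, szx + sxz, syz + szy, -sxx - syy + szz],
    ]
    ga = sum(x * x for p in a for x in p)
    gb = sum(x * x for p in b for x in p)
    shift = (ga + gb) / 2  # shifting makes every eigenvalue non-negative
    if shift == 0.0:
        return 0.0
    q = [1.0, 0.5, 0.25, 0.125]  # arbitrary starting vector (assumption)
    for _ in range(iterations):  # power iteration on the shifted matrix
        q = [sum(horn[r][c] * q[c] for c in range(4)) + shift * q[r] for r in range(4)]
        norm = math.sqrt(sum(x * x for x in q))
        q = [x / norm for x in q]
    largest = sum(q[r] * horn[r][c] * q[c] for r in range(4) for c in range(4))
    # Minimum squared deviation after the optimal rotation (rotation step folded in).
    return math.sqrt(max(0.0, (ga + gb - 2.0 * largest) / n))


original = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0)]
# The same points rotated by 90 degrees about z and shifted by (5, -2, 1).
moved = [(5.0, -2.0, 1.0), (5.0, -1.0, 1.0), (3.0, -2.0, 1.0), (5.0, -2.0, 4.0)]
print(superposed_rmsd(original, moved))  # close to zero
```

For comparing a domain or motif only, the same function can be applied to the coordinates of the selected region.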

Module 051: Online Resources for Structure Comparison

Multiple types of comparison can be performed between Proteins, Motifs, and Domains by rigid body super-positioning. RMSD tells us about the quality of the matches.

Protein structures can be compared in multiple ways. So far, we have seen how to compare proteins by their motifs, domains and full structures. There are several advanced techniques for this as well.

Module 052: Protein Structure Prediction

Complex protein structures enable proteins to perform complex functions. We know over a million protein sequences but only about 100,000 protein structures. Determining exact protein structures is very difficult. It is difficult to crystallize proteins. Even if we manage to obtain a protein’s X-ray diffraction data, reconstructing the structure from it is extremely complex.

Since we know so many sequences, they can be used for predicting protein structures. This indeed is possible and helpful.

The Basic Idea

 Amino acids determine the protein structure

 We have a large protein sequence dataset (UniProt)

Hence, we can fold protein sequences and predict their structures.

Why predict and why not exact solutions?

A deterministic solution of protein folding is a major unsolved problem in molecular biology. Proteins fold spontaneously or with the help of enzymes or chaperones. To computationally predict protein structures, we need to copy or mimic the natural folding.

To fold we must learn the steps

Step 1: "Collapse"- leading to burial of hydrophobic AA’s

Step 2: Fluid globule - helices & sheets form, but are unorganized

Step 3: Compaction, and rearrangement of 2’ structures

Protein structure prediction involves learning how the amino acids in primary sequence fold. Using this information, upon getting a protein sequence, we can try to predict how it folds

Module 053: Predicting Secondary Structures

By looking at the structures in the PDB, we know that Alanine is mostly found in Alpha Helices. So if we have several Alanines in a sequence, we can anticipate that they may form a helix. What if we survey the entire PDB and check the presence of each amino acid in each type of secondary structure? If we know which amino acid is found in which specific secondary structure, then we can use this for prediction.
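
The survey idea can be sketched in Python as follows. The sketch assumes we already have a sequence and a same-length string of secondary structure labels (here H for helix and C for anything else, a labelling chosen just for this example). The propensity of an amino acid for a state is taken as its fraction found in that state divided by the overall fraction of residues in that state, so values above 1.0 indicate a preference.

```python
from collections import Counter


def propensities(sequence, structure, state):
    """Propensity of each amino acid for one secondary structure state."""
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure labels must have equal length")
    in_state = structure.count(state)
    if in_state == 0:
        return {}
    overall = in_state / len(sequence)  # fraction of all residues in this state
    totals = Counter(sequence)
    hits = Counter(aa for aa, label in zip(sequence, structure) if label == state)
    # (fraction of this amino acid found in the state) / (overall fraction)
    return {aa: (hits[aa] / totals[aa]) / overall for aa in totals}


# Toy data, invented for illustration only.
print(propensities("AAAAGGGG", "HHHCCCCH", "H"))  # {'A': 1.5, 'G': 0.5}
```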

Figure 78 Chou & Fasman (1974 & 1978)

Several algorithms have been designed to predict 2’ given an amino acid sequence. The first such algorithm was the Chou-Fasman Algorithm. We will see it in the upcoming modules.

Module 054: Introduction to Chou Fasman Algorithm

3D Structure of proteins is determined by their Amino Acid sequence. Note that we only know about 100,000 3D protein structures, but 10 times more sequences. For those proteins whose structure is already known, can we evaluate which amino acids occur in which secondary structures?

Figure 79 Propensity Table

Predicting the 2’ structures

Now, let’s consider that if we are given an amino acid sequence, we can simply look up the propensity table and assign the tentative secondary structure.

Given an amino acid sequence, look up the propensity table for each amino acid’s propensity for various 2’ structures. Product of these propensity values will give you the overall propensity for formation of each 2’ structure.

Module 055: 2’ Structures in Chou Fasman Algorithm

For a primary sequence, and a tentative 2’ structure, propensity table can help us compute the overall propensity. Product of propensity values is computed for overall propensity for each 2’ structure. An important point to note here is that 2’ structures are formed due to hydrogen bonding between amino acids.

So, we need to consider the neighboring amino acids as well.

You only need to compute propensities for a small number of 2’ structures. The 2’ structure with the highest net propensity is the one most likely to be formed.

Module 056: Chou Fasman Algorithm - I

Only a small number of combinations of secondary structures are possible due to their individual properties. For example, 4 amino acids are needed to start an Alpha Helix and 5 amino acids to start a Beta Sheet. Note that besides Alpha Helices and Beta Sheets, loops are another secondary structure. Loops are small (~3-4 amino acids).

1. Scan through the sequence : E M A V I Y P G

2. Identify sequence regions where:

• 4 out of 6 contiguous residues give a P(α ) > 1.0

• That region is declared as alpha-helix

• Extend the helix to both sides until a set of 4 contiguous residues has an average P(α) < 1.0

That is declared the end of the helix. To start an Alpha Helix, 4 out of 6 contiguous amino acids must have an Alpha-Helix propensity above 1.0. Once the average propensity of 4 contiguous residues falls below 1.0, the Alpha Helix stops.
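
The nucleation step for the example sequence can be sketched in Python as below. The P(α) values are the ones commonly quoted for the Chou-Fasman method and are listed only for the residues in the example; treat them as illustrative and check them against the propensity table (Figure 79). The function name and its parameters are assumptions for this sketch, and the extension step is left out.

```python
# P(alpha) values as commonly quoted for Chou-Fasman (illustrative; verify against Figure 79).
P_ALPHA = {"E": 1.51, "M": 1.45, "A": 1.42, "V": 1.06,
           "I": 1.08, "Y": 0.69, "P": 0.57, "G": 0.57}


def helix_nuclei(sequence, window=6, needed=4, cutoff=1.0):
    """Return 0-based start positions of windows that can nucleate an alpha helix."""
    starts = []
    for start in range(len(sequence) - window + 1):
        region = sequence[start:start + window]
        # Count the residues in this window whose helix propensity exceeds the cutoff.
        formers = sum(1 for aa in region if P_ALPHA[aa] > cutoff)
        if formers >= needed:
            starts.append(start)
    return starts


print(helix_nuclei("EMAVIYPG"))  # [0, 1]
```

Here the windows EMAVIY and MAVIYP qualify, while AVIYPG has only 3 residues with P(α) > 1.0.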

Module 057: Chou Fasman Algorithm - II

Alpha Helices are nucleated where 4 out of 6 contiguous amino acids have an Alpha-Helix propensity over 1.0. The Alpha-Helix stops once this propensity falls below 1.0. Once Alpha Helices are constructed and concluded, the remaining amino acids can be evaluated for Beta sheets, turns etc. First, let’s see how candidate Alpha Helices are checked against Beta sheets using Chou Fasman Algorithm.

1. Compute P(β) for contiguous regions of 5 Amino Acids

2. From these regions, identify regions where:

• 5 contiguous residues have P(α) > P(β)

That region is finalized as alpha-helix. Repeat this step for the full amino acid sequence to finalize all possible alpha helical regions in the sequence.

Alpha Helices can be finalized if their propensity is higher than the propensity for Beta Sheets in regions of 5 amino acids. For those regions where that is not the case, further evaluation is required.

Module 058: Chou Fasman Algorithm - III

Alpha Helices are nucleated where 4 out of 6 contiguous amino acids have an Alpha-Helix propensity over 1.0. The Alpha-Helix stops if this propensity falls below 1.0. Alpha Helices were finalized if their propensity was higher than the propensity for Beta Sheets in regions of 5 amino acids.

We can evaluate the remaining regions for Beta Sheets. Let us see step by step how to find beta sheets and how to differentiate them from alpha helices.

Scan the sequence to identify regions where:

 3 out of 5 amino acids have P(β) > 1.0

 That region is declared as beta sheet

 Extend beta sheet to both sides until 4 contiguous residues average P(β) < 1.0

 That is declared end of the beta sheet

 Those regions are finalized as beta-sheets which have average P(β) > 1.05 and the average P(β) > P(α) for that region.

Regions where overlapping alpha-helices and beta-sheets occur are declared helices if

 the average P(a-helix) > P(b-sheet) for that region

Else, a beta sheet is declared if

 average P(b-sheet) > P(a-helix) for that region

Using the strategy of higher propensity, alpha helices and beta sheets can be completely resolved. Assignments for each beta sheet and alpha helix can be finalized.
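
A short Python sketch of the beta sheet nucleation rule and of the overlap rule is given below. As before, the propensity values are the commonly quoted Chou-Fasman numbers for the residues of the example sequence only and should be checked against the propensity table (Figure 79). The function names are assumptions for this sketch, the extension step is omitted, and a tie between the two averages (which the rules above do not cover) is arbitrarily reported as a sheet.

```python
# Commonly quoted Chou-Fasman values (illustrative; verify against Figure 79).
P_ALPHA = {"E": 1.51, "M": 1.45, "A": 1.42, "V": 1.06,
           "I": 1.08, "Y": 0.69, "P": 0.57, "G": 0.57}
P_BETA = {"E": 0.37, "M": 1.05, "A": 0.83, "V": 1.70,
          "I": 1.60, "Y": 1.47, "P": 0.55, "G": 0.75}


def sheet_nuclei(sequence, window=5, needed=3, cutoff=1.0):
    """Return 0-based start positions of windows that can nucleate a beta sheet."""
    starts = []
    for start in range(len(sequence) - window + 1):
        region = sequence[start:start + window]
        if sum(1 for aa in region if P_BETA[aa] > cutoff) >= needed:
            starts.append(start)
    return starts


def average(table, region):
    """Mean propensity of a region for one secondary structure."""
    return sum(table[aa] for aa in region) / len(region)


def resolve_overlap(region):
    """Assign an overlapping region to the structure with the higher average propensity."""
    if average(P_ALPHA, region) > average(P_BETA, region):
        return "helix"
    return "sheet"


print(sheet_nuclei("EMAVIYPG"))   # [0, 1, 2, 3]
print(resolve_overlap("EMAVIY"))  # helix
print(resolve_overlap("VIY"))     # sheet
```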

Module 059: Chou Fasman Algorithm - IV

After computing the propensity of alpha helices and beta sheets, we need to look for turns (loops). Let’s see how we can find the turns using Chou Fasman Algorithm.

For any jth residue in the sequence, we calculate f(Total) = f(j) × f(j+1) × f(j+2) × f(j+3) for the tetrapeptide starting at j, where each f is the frequency of that residue at the corresponding position of a turn. A turn is predicted at that position if:

 f(Total) > 0.000075

 the average value for P(turn) > 1.00 in the tetrapeptide

 the averages for the tetrapeptide are such that P(a-helix) < P(turn) > P(b-sheet)

http://fasta.bioch.virginia.edu/fasta_www/chofas.htm

Chou Fasman Algorithm helps predict Alpha Helices, Beta Sheets and Turns. The algorithm is based on statistical occurrence of Amino Acids in known structures.
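
The three turn conditions can be combined in a small Python check, sketched below. The function takes the four positional frequencies and the per-residue propensities of one tetrapeptide as plain lists; looking these numbers up in the Chou-Fasman tables is left to the caller. The function name is an assumption, and the numbers in the usage example are made-up placeholders, not real table values.

```python
from math import prod


def is_turn(position_freqs, p_turn, p_alpha, p_beta, cutoff=0.000075):
    """Apply the three turn conditions to one tetrapeptide.

    position_freqs: f(j), f(j+1), f(j+2), f(j+3) for the four residues.
    p_turn, p_alpha, p_beta: propensities of the same four residues.
    """
    f_total = prod(position_freqs)
    avg_turn = sum(p_turn) / len(p_turn)
    avg_alpha = sum(p_alpha) / len(p_alpha)
    avg_beta = sum(p_beta) / len(p_beta)
    return (f_total > cutoff
            and avg_turn > 1.00
            and avg_alpha < avg_turn
            and avg_beta < avg_turn)


# Placeholder numbers, invented for illustration only.
print(is_turn([0.1, 0.1, 0.1, 0.1], [1.5] * 4, [0.6] * 4, [0.7] * 4))
```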

Module 060: Chou Fasman Algorithm – Flowchart I

Chou Fasman Algorithm helps predict secondary structures such as Alpha Helices, Beta Sheets and Turns. Here we go through a step by step flowchart of the entire algorithm.

Beta sheets can be predicted from primary amino acid sequences. Next, we will see the flowchart of Alpha Helices and Beta Turns.

Module 061: Chou Fasman Algorithm – Flowchart II

Chou Fasman Algorithm helps predict secondary structures such as Alpha Helices, Beta Sheets and Turns. Here we go through a step by step flowchart of the entire algorithm.

Figure 80 Beta sheet flowchart

Figure 81 Alpha helices flowchart

Now we have reviewed flowcharts for Alpha Helices and Beta Sheets. Next up is the flow chart for Beta Turns.

Module 062: Chou Fasman Algorithm – Flowchart III

Chou Fasman Algorithm helps predict secondary structures such as Alpha Helices, Beta Sheets and Turns. Here we go through a step by step flowchart of the entire algorithm.

Figure 82 Beta sheet flowchart

Figure 83 Alpha helices flowchart

Figure 84 Beta turn

Alpha helices, beta sheets and turns can be predicted using Chou Fasman Algorithm. This algorithm is based on statistical analysis of amino acid occurrences in proteins.

Module 063: Chou Fasman Algorithm – Improvements

Alpha helices, beta sheets and turns can be predicted using Chou-Fasman Algorithm. The algorithm is based on statistical analysis of amino acid occurrences in proteins.

Secondary structure propensity values of alpha helix, beta sheet and turns should be recalculated with the latest protein data sets.

IMPROVEMENTS

Special consideration for:

 Nucleation regions

 Membrane proteins

 Hydrophobic domains

 Consider variable coil and loop sizes besides tetrapeptide turns

 Consider local protein folding environments

 Solvent accessibility of residues

 Protein structural class

 Protein’s organism

Chou Fasman can be improved to better predict secondary structures by incorporating biochemical factors and updated statistics!

Module 064: Summary of Visualization, Classification & Prediction

Structure Classification

 There is a relationship between protein structure and function

 There is a need to classify proteins

 Hierarchy of classification

Structure visualization, classification and prediction equip us to perform functional evaluation of proteins. This is important for understanding diseases and designing drugs to treat them.
