
Bioinformatics- Assignment 1 Report

Phylogeny Construction for Influenza viruses based on hemagglutinin sequence
Reihaneh Rabbany k.

Contents
Assignment Description
First Step, Collecting Protein Sequences
    Influenza Virus
    Collecting Sequences
        FASTA Format
Step 2, Computing Pairwise Distances and Multiple Alignment
    Scoring scheme
    ClustalW
Step3, Phylogeny Construction
    Distance-based phylogeny
    Character-based phylogeny
Step4, Evaluation
    Consistency
    Bootstrapping
References

Figures
Figure 1- annotated phylogeny tree by distance method
Figure 2- annotated phylogeny tree obtained by parsimony
Figure 3 - Zoomed branch of distance tree (left) and parsimony tree (right)
Figure 4- Comparing resulted cladogram from distance method (right) with the reported cladogram for Influenza A virus by Yoshiyuki Suzuki, et. al. (left)
Figure 5- Bootstrap values. The upper one corresponds to the parsimony method and the bottom one to the distance method


Assignment Description
The following is an H1N1 influenza virus hemagglutinin protein sequence, in FASTA format.

>gi|89903075|gb|ABD79112| /Human/4(HA)/H1N1//1946/// hemagglutinin [Influenza A virus (A/Cam/46(H1N1))]
MKAKLLILLCALSATDADTICIGYHANNSTDTVDTVLEKNVTVTHSVNLLEDSHNGKLCRLKGIAPLQLG
KCNIAGWILGNPECESLLSKRSWSYIAETPNSENGACYPGDFADYEELREQLSSVSSFERFEIFPKGRSW
PEHNIDIGVTAACSHAGKSSFYKNLLWLTEKDGSYPNLNKSYVNKKEKEVLILWGVHHPPNIENQKTLYR
KENAYVSVVSSNYNRRFTPEIAERPKVRGQAGRINYYWTLLEPGDTIIFEANGNLIAPWYAFALNRGIGS
GIITSNASMDECDTKCQTPQGAINSSLPFQNIHPFTIGECPKYVRSTKLRMVTGLRNIPSIQSRGLFGAI
AGFIEGGWDGMIDGWYGYHHQNEQGSGYAADQKSTQNAINGITNKVNSVIEKMNTQFTAVGKEFNKLEKR
MENLNKKVDDGFLDIWTYNAELLVLLENERTLDFHDSNVKNLYEKVKNQLRNNAKEIGNGCFEFYHKCNN
ECMESVKNGTYDYPKFSEESKLNREKIDGVKLESMGVYQILAIYSTVASSLVLLVSLGAISFWMCSNGSL
QCRICI

Detailed tasks:
1. Search for at least 100 other hemagglutinin protein sequences for influenza viruses, such that they are distributed well in all 16 subtypes (H1-H16).
2. Using an appropriate scoring scheme to compute the pairwise distances between every pair of sequences in the above; using the same scoring scheme, construct a multiple sequence alignment for these sequences.
3. Use a distance-based and a character-based phylogeny construction method, together with an out-group, to build two phylogenies for these sequences.
4. Evaluate the constructed phylogenies.

Note that the detailed descriptions of steps of operations you perform and the consequences of these operations must be reported (for example, the number of sequences you collected from each database, each tool you have called and their availability).


First Step, Collecting Protein Sequences


In the first step, I have to search for at least 100 other hemagglutinin protein sequences for influenza viruses, such that they are distributed well across all 16 subtypes (H1-H16). To do this, I first need to get familiar with the influenza virus.

Influenza Virus
Influenza viruses are RNA viruses of the family Orthomyxoviridae, which comprises five genera: Influenzavirus A, Influenzavirus B, Influenzavirus C, Isavirus, and Thogotovirus. The type A viruses are the most virulent human pathogens among the influenza types and cause the most severe disease. The Influenza A genome encodes 11 proteins: hemagglutinin (HA), neuraminidase (NA), nucleoprotein (NP), M1, M2, NS1, NS2 (NEP), PA, PB1, PB1-F2 and PB2 [1]. HA and NA are large glycoproteins on the outside of the viral particles; they are targets for antiviral drugs and are antigens to which antibodies can be raised. Influenza A viruses are classified into subtypes based on antibody responses to HA and NA, forming the basis of the H and N distinctions in, for example, H5N1 [1]. There are 16 different HA antigens (H1 to H16) and nine different NA antigens (N1 to N9) for influenza A [2].

Naming
Each subtype virus has mutated into a variety of strains (footnote 1) [2]. Generally, influenza A variants are identified:
- according to the isolate that they resemble and with which they are therefore presumed to share lineage (for example, Fujian flu virus like);
- according to their typical host (for example, Bird flu, Human Flu, Swine Flu, Horse Flu, Dog Flu);
- according to their subtype, an H number (for hemagglutinin) and an N number (for neuraminidase) (for example, H3N2);
- according to their deadliness (for example, LP, low pathogenic) [2,3].

1 A strain is a genetic variant or subtype of a microorganism (e.g. virus). For example, a "flu strain" is a certain biological form of the influenza or "flu" virus.


Collecting Sequences
I used the Influenza Virus Resource at NCBI (footnote 2) to retrieve the HA protein sequences for influenza. It contains more than 11000 viruses. I requested all complete HA sequences of Influenza A from the database, chose 7 of each subtype, and downloaded them in FASTA format (the selected sequences are mostly from the USA and between 2000 and 2008, unless there were not enough such sequences in these years). Although NCBI holds a large number of influenza sequences, it contains only 2 H14 and 5 H15 sequences. Therefore, I used UniProt (footnote 3) to find more sequences of these subtypes, and I found 4 H14 and 7 H15 sequences there. Further, I searched BioHealthBase (footnote 4) and found 8 H15 and 2 H14 sequences there. All these results were integrated and recorded in data/sequences.fasta.

FASTA Format
FASTA is a text-based format for representing peptide sequences, in which amino acids are represented using single-letter codes. The description line begins with the ">" symbol. The word following the ">" symbol is the identifier of the sequence, and the rest of the line is the description (both are optional). There should be no space between the ">" and the first letter of the identifier. In this case the descriptions contain the virus's location, host, year and subtype.

Amino acid codes
The amino acid codes supported are:

Amino Acid   Meaning
A            Alanine
B            Aspartic acid or Asparagine
C            Cysteine
D            Aspartic acid
E            Glutamic acid
F            Phenylalanine
G            Glycine
H            Histidine
I            Isoleucine
K            Lysine
L            Leucine
M            Methionine
N            Asparagine
O            Pyrrolysine
P            Proline
Q            Glutamine
R            Arginine
S            Serine
T            Threonine
U            Selenocysteine
V            Valine
W            Tryptophan
Y            Tyrosine
Z            Glutamic acid or Glutamine
X            Any
*            translation stop
-            gap of indeterminate length

2 http://www.ncbi.nlm.nih.gov/genomes/FLU/Database/select.cgi?go=1
3 http://www.uniprot.org/
4 http://www.biohealthbase.org/GSearch/home.do?decorator=Influenza
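The FASTA layout described above can be read with a few lines of code. The following is a minimal sketch, not the tool used in this report; the function name parse_fasta and the two example records are hypothetical and serve only as illustration.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {identifier: sequence} from FASTA-formatted lines."""
    records: Dict[str, str] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # The first word after '>' is the identifier,
            # the rest of the line is the free-text description.
            parts = line[1:].split(None, 1)
            current = parts[0] if parts else ""
            records[current] = ""
        elif current is not None:
            # Sequence data may be wrapped over several lines.
            records[current] += line
    return records


# Hypothetical two-record input, for illustration only.
example = """>seq1 /Human/4(HA)/H1N1//1946///
MKAKLLILLC
ALSATDADTI
>seq2 hypothetical second record
MKTIIALSYI
""".splitlines()

records = parse_fasta(example)
```

Wrapped sequence lines are concatenated, so each record ends up as one string regardless of the line width used in the file.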


Step 2, Computing Pairwise Distances and Multiple Alignment


In this step I have to use an appropriate scoring scheme to compute the pairwise distances between every pair of the collected sequences, and then, using the same scoring scheme, construct a multiple sequence alignment for these sequences.

Scoring scheme
A scoring scheme encodes the biological information that determines how an alignment is computed. It includes a substitution matrix (to assign scores to amino-acid matches or mismatches) and gap penalties (for matching an amino acid in one sequence to a gap in the other) [7, 8]. The two common families of substitution matrices are the PAM series and the BLOSUM series; when comparing closely related proteins, one should use a lower PAM or a higher BLOSUM matrix, and for distantly related proteins a higher PAM or a lower BLOSUM matrix [7].
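To make the two ingredients of a scoring scheme concrete, the following sketch scores an already aligned pair of sequences. It is only an illustration under stated assumptions: the match and mismatch values are a toy stand-in for a real PAM or BLOSUM matrix, the gap values are placeholders, and the convention that the first gap position costs gap_open and each further position costs gap_extend is one of several in use.

```python
def score_alignment(a: str, b: str, match: float = 2.0, mismatch: float = -1.0,
                    gap_open: float = 10.0, gap_extend: float = 0.2) -> float:
    """Score two aligned sequences with a toy substitution score and affine gaps."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    score = 0.0
    in_gap_a = in_gap_b = False  # are we inside a run of gaps in a / in b?
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue  # a column of two gaps carries no information
        if x == "-":
            score -= gap_extend if in_gap_a else gap_open
            in_gap_a, in_gap_b = True, False
        elif y == "-":
            score -= gap_extend if in_gap_b else gap_open
            in_gap_a, in_gap_b = False, True
        else:
            # Toy substitution score; a real scheme would look up (x, y)
            # in a PAM or BLOSUM matrix instead.
            score += match if x == y else mismatch
            in_gap_a = in_gap_b = False
    return score
```

For example, score_alignment("MKAK-LL", "MKTKALL") adds five matches, one mismatch and one gap opening.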

ClustalW
To perform the multiple sequence alignment, I used ClustalW via the Mobyle web service (footnote 5), a portal for bioinformatics analyses. I also checked ClustalX, which is a windowed interface for the ClustalW multiple sequence alignment program, but as there is no difference in functionality, I kept using the web service.

ClustalW is a progressive method that generates a multiple sequence alignment by first aligning the most similar sequences and then adding successively less related sequences or groups to the alignment until the entire query set has been incorporated into the solution. The initial tree describing the sequence relatedness is based on pairwise comparisons [8].

For its scoring scheme I selected the following settings for both the pairwise alignment parameters and the protein parameters of the multiple sequence alignment (footnote 6):
- Gap opening penalty: 10
- Gap extension penalty: 0.2
- Gap separation penalty range: 8
- Delay divergent sequences: 30% identity for delay
- Protein weight matrix: PAM series
- Protein weight matrix for pairwise alignment: PAM350

5 http://mobyle.pasteur.fr/cgi-bin/MobylePortal/portal.py?form=clustalw-multialign
6 Command for rerunning it is:
clustalw -align -infile=sequences.fasta -type=protein -matrix=blosum -nopgap -nohgap -hgapresidues="RNDQEGKPS" -pwmatrix=blosum -outfile=BlosumAligned

All the results of this section are reported under the directory analysis\*\clustalw-multialign. It includes a sequences.aln file that contains the multiple sequence alignment. There is also sequences.dnd, which contains the resulting tree, and clustalw-multialign.out, which shows the progress of the algorithm, including the pairwise scores between each pair of sequences.
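As an illustration of how a pairwise distance can be read off an alignment, the sketch below computes the fraction of differing residues over the columns where neither sequence has a gap (often called the p-distance). This is a deliberately simple measure chosen for illustration; it is an assumption of this example and not necessarily the score that ClustalW or Protdist reports. The three short sequences are hypothetical.

```python
from itertools import combinations


def p_distance(a: str, b: str) -> float:
    """Fraction of differing residues over the ungapped aligned columns."""
    compared = diffs = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # ignore columns with a gap in either sequence
        compared += 1
        diffs += x != y
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return diffs / compared


def distance_matrix(alignment: dict) -> dict:
    """Map each unordered pair of names to its p-distance."""
    return {
        frozenset((a, b)): p_distance(alignment[a], alignment[b])
        for a, b in combinations(alignment, 2)
    }


# Hypothetical aligned sequences, for illustration only.
toy_alignment = {"s1": "MKA-L", "s2": "MRAAL", "s3": "MRTAL"}
toy_distances = distance_matrix(toy_alignment)
```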


Step3, Phylogeny Construction


In this step I have to use a distance-based and a character-based phylogeny construction method, together with an out-group (which specifies the species on whose branch the root of the tree is placed), to build two phylogenies for these sequences.

Distance-based phylogeny
In distance-based tree reconstruction, we reconstruct an evolutionary tree from a distance matrix. As most distance measures do not guarantee an additive matrix, we usually build the tree with a hierarchical clustering method, such as UPGMA (Unweighted Pair Group Method with Arithmetic Mean), which produces an ultrametric rooted tree (the root corresponds to the cluster created last), or NJ (Neighbor Joining) [9].

To construct this phylogeny I used Protdist (Protein Sequence Distance Method) from the PHYLIP toolbox via the Mobyle web service (footnote 7). This program computed a distance matrix (based on the given multiple sequence alignment), which I then fed into the NJ algorithm (again via the Mobyle web service, footnote 8) to obtain the corresponding phylogenetic tree. The distance matrix produced by Protdist is reported in analysis\*\Distance\protdist\protdist.outfile and the tree produced by NJ is reported in analysis\*\Distance\neighbor\neighbor.outtree. I plotted its cladogram and tree using the drawgram and drawtree programs in PHYLIP. The results are analysis\*\Distance\cladogram.pdf and analysis\*\Distance\tree.pdf.

Figure 1- annotated phylogeny tree by distance method

7 http://mobyle.pasteur.fr/cgi-bin/MobylePortal/portal.py?form=protdist
With command:
ln -s PAMali.phylipi infile && protdist <protdist.params && mv outfile protdist.outfile
8 http://mobyle.pasteur.fr/cgi-bin/MobylePortal/portal.py?form=neighbor
With command:
ln -s protdist.outfile infile && neighbor <neighbor.params && mv outfile neighbor.outfile && mv outtree neighbor.outtree
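The clustering idea behind distance-based reconstruction can be sketched with UPGMA. UPGMA is shown here rather than the NJ program used for the actual trees only because it is shorter; the function below is a hypothetical illustration, the taxon names and distances are made up, and branch lengths are omitted so that only the topology is returned.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster leaves with UPGMA.

    dist maps frozenset({a, b}) to the distance between leaves a and b.
    Returns the rooted tree as nested 2-tuples of leaf names.
    """
    clusters = {name: 1 for name in names}  # cluster -> number of leaves
    d = dict(dist)
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: d[frozenset(p)])
        na, nb = clusters.pop(a), clusters.pop(b)
        merged = (a, b)
        for c in clusters:
            # Size-weighted arithmetic mean of the distances to the merged pair.
            d[frozenset((merged, c))] = (
                na * d[frozenset((a, c))] + nb * d[frozenset((b, c))]
            ) / (na + nb)
        clusters[merged] = na + nb
    return next(iter(clusters))  # the last cluster created is the root


# Made-up distances between four hypothetical taxa.
toy_names = ["A", "B", "C", "D"]
toy_dist = {
    frozenset(("A", "B")): 2.0, frozenset(("C", "D")): 4.0,
    frozenset(("A", "C")): 10.0, frozenset(("A", "D")): 10.0,
    frozenset(("B", "C")): 10.0, frozenset(("B", "D")): 10.0,
}
toy_tree = upgma(toy_names, toy_dist)
```

With these distances A and B are joined first, then C and D, and the root joins the two pairs.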


Character-based phylogeny
Instead of computing distances from the alignment matrix and using the resulting distance matrix to construct the tree, we can use the alignment matrix directly to build the evolutionary tree with character-based methods. These methods try to find the character strings for the internal nodes that best explain their descendant species; here the character string of a species is its amino-acid string, i.e. its protein sequence. An example is the maximum parsimony method, which defines the best tree as the one that needs the minimum number of changes [9].

To construct this phylogeny I used ProtPars (Protein Sequence Parsimony Method) from the PHYLIP toolbox via the Mobyle web service (footnote 9). The results are reported under the analysis\*\Parisomy\protpars\ folder. I also plotted its cladogram and tree using the drawgram and drawtree programs in PHYLIP. The results are analysis\*\Parisomy\cladogram.pdf and analysis\*\Parisomy\tree.pdf.

Figure 2- annotated phylogeny tree obtained by parsimony


9 http://mobyle.pasteur.fr/cgi-bin/MobylePortal/portal.py?form=protpars
With command:
ln -s BlosumAligned.phylipi infile && protpars <protpars.params && mv outfile protpars.outfile && mv outtree protpars.outtree
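The notion of "minimum number of changes" can be illustrated by counting the changes a fixed tree needs, using Fitch's algorithm on each alignment column. This sketch only scores a given topology; it does not search over trees and does not reproduce ProtPars, which uses its own protein-specific parsimony model. The tree encoding (nested 2-tuples of leaf names) and the toy alignment are assumptions of this example.

```python
def fitch_column(tree, states):
    """Return (possible states, number of changes) for one alignment column."""
    if isinstance(tree, str):
        return {states[tree]}, 0  # a leaf: its observed residue, no change
    left, right = tree
    left_set, left_changes = fitch_column(left, states)
    right_set, right_changes = fitch_column(right, states)
    common = left_set & right_set
    if common:
        # Children agree on at least one state: no extra change needed here.
        return common, left_changes + right_changes
    # Children disagree: one change is required on this node's branches.
    return left_set | right_set, left_changes + right_changes + 1


def parsimony_score(tree, alignment):
    """Sum the Fitch change counts over all columns of the alignment."""
    length = len(next(iter(alignment.values())))
    return sum(
        fitch_column(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(length)
    )


# Hypothetical aligned sequences and two candidate topologies.
toy_aln = {"a": "MKA", "b": "MKA", "c": "MRT", "d": "MRT"}
tree_1 = (("a", "b"), ("c", "d"))
tree_2 = (("a", "c"), ("b", "d"))
```

Under maximum parsimony, tree_1 would be preferred over tree_2 because it explains the same columns with fewer changes.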


Step4, Evaluation
To evaluate the constructed phylogenies, I took two approaches: checking their consistency with biological information, and bootstrapping.

Consistency
To evaluate the resulting trees, I compared how consistent they are with the taxonomy or biological data I have, both what is given in the protein sequence descriptions and what others have reported. To do this, I first renamed the sequences so that their subtype plus their year and location becomes their identifier, using \RenameSequences\Renaming\Rename.java. In this way the resulting tree becomes meaningful and readable (Figures 1 and 2). I used these renamed sequences to build the phylogenetic trees by both the distance-based and the parsimony methods; the resulting cladograms are analysis\Consistence\Parisomy\cladogram.pdf and analysis\Consistence\Distance\cladogram.pdf. These trees show that viruses of the same subtype are grouped in the same clade, which shows the consistency of these algorithms with the biological information. Moreover, most of the branches are consistent with the corresponding virus year. To illustrate this, I zoomed in on branch H7 of both trees:

Figure 3 - Zoomed branch of distance tree (left) and parsimony tree (right)

Based on Figures 1, 2 and 3, with this scoring scheme the parsimony method reflects the evolutionary distances between closely related sequences better than the distance method. Further, I compared the resulting cladograms with the trees reported by Yoshiyuki Suzuki, et. al. [11], and there is a high agreement between my results and the results presented in that paper about the divergence of influenza A virus subtypes (see Figure 4).


Figure 4- Comparing resulted cladogram from distance method (right) with the reported cladogram for Influenza A virus by Yoshiyuki Suzuki, et. al. (left)

One can see that the clades are mostly identical.
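The renaming step used for the consistency check can be sketched as follows. This is not the Rename.java mentioned above, whose contents are not shown here; it is a hypothetical Python illustration that assumes the description carries slash-separated fields in the order host, segment, subtype, country, year, as in the example header of the assignment description.

```python
def rename_header(header: str) -> str:
    """Build a short identifier (subtype, year, country) from a FASTA header.

    Assumes slash-separated fields: .../host/segment/subtype/country/year/...
    """
    fields = header.lstrip(">").split("/")
    if len(fields) < 6:
        raise ValueError("unexpected header layout: " + header)
    subtype, country, year = fields[3], fields[4], fields[5]
    # Skip fields that are empty in the source record.
    return "_".join(part for part in (subtype, year, country) if part)


example_header = (
    ">gi|89903075|gb|ABD79112| /Human/4(HA)/H1N1//1946/// "
    "hemagglutinin [Influenza A virus (A/Cam/46(H1N1))]"
)
new_name = rename_header(example_header)
```

In practice the new identifiers would also have to be made unique (for instance with a running number), since several sequences can share subtype, year and country.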

Bootstrapping
Apart from consistency, I evaluated the resulting trees by bootstrapping. To do this, I set the parameters in ProtPars and in ProtDist+NJ to perform bootstrapping, which produced 100 bootstrapped trees, and then fed these trees into the Consensus tree program in the PHYLIP toolbox (via the Mobyle web service, footnote 10). This program generated a consensus tree with bootstrap values on its branches, which show the agreement on each branch among all bootstrapped trees (see \Boostrap\Parisomy\TreeWithBootstrapValues.txt and \Boostrap\Distance\TreeWithBootstrapValues.txt) [10].

One technical point is that, for the distance method, the Protdist web service is extremely slow; therefore, I ran the PHYLIP package directly. I used seqboot from the PHYLIP package to generate 100 bootstrapped multiple sequence alignments. From these bootstrapped samples I produced 100 trees with neighbor and finally obtained the consensus tree with bootstrap values using Consensus. This is still slow, but much faster than the web service. Not surprisingly, the results are identical.

10 http://mobyle.pasteur.fr/cgi-bin/MobylePortal/portal.py?form=consense
With command:
ln -s protpars.outtree intree && consense <consense.params && mv outfile consense.outfile && mv outtree consense.outtree

Comparing the bootstrap values in the consensus trees of the distance and parsimony methods reveals that the parsimony method is by far better than the distance method in this specific task, with this dataset and these settings: most of the branches in its tree have bootstrap values of 100% or close to it, while the distance method produces relatively poor bootstrap values on its branches. For example, compare these two branches with similar species and different bootstrap values. The upper one is from the parsimony method (note that bootstrap values are between 0 and 1, i.e. 1 means 100%) and the lower one is from the distance method.

Figure 5- Bootstrap values. The upper one corresponds to the parsimony method and the bottom one to the distance method
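The bootstrap procedure described in this section rests on two small operations: resampling alignment columns with replacement (the job seqboot performs) and counting in how many replicate trees a given clade appears (the value a consensus program attaches to a branch). The sketch below illustrates both under simplifying assumptions: the function names are hypothetical, and each replicate tree is represented simply as a set of clades, each clade being a frozenset of leaf names.

```python
import random


def bootstrap_alignment(alignment: dict, rng: random.Random) -> dict:
    """Resample the alignment columns with replacement, keeping its length."""
    length = len(next(iter(alignment.values())))
    cols = [rng.randrange(length) for _ in range(length)]
    # The same column indices are applied to every sequence.
    return {name: "".join(seq[i] for i in cols) for name, seq in alignment.items()}


def clade_support(clade, replicate_clades) -> float:
    """Fraction of replicate trees (given as sets of clades) containing the clade."""
    target = frozenset(clade)
    hits = sum(1 for clades in replicate_clades if target in clades)
    return hits / len(replicate_clades)


# Hypothetical alignment; a fixed seed keeps the example reproducible.
boot_aln = {"a": "MKAL", "b": "MKAL", "c": "MRTL"}
replicates = [bootstrap_alignment(boot_aln, random.Random(seed)) for seed in range(100)]
```

Each replicate would then be passed through the same tree-building method as the original alignment, and clade_support gives the bootstrap value of a branch on the 0 to 1 scale used in the consensus output.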


References
[1] http://en.wikipedia.org/wiki/Influenza
[2] http://en.wikipedia.org/wiki/Influenzavirus_A
[3] http://en.wikipedia.org/wiki/Influenza_Genome_Sequencing_Project
[4] http://www.ncbi.nlm.nih.gov/genomes/FLU/FLU.html
[5] http://www.ncbi.nlm.nih.gov/genomes/FLU/Database/select.cgi?go=1
[6] http://www.uniprot.org/
[7] http://en.wikipedia.org/wiki/Substitution_matrix
[8] http://en.wikipedia.org/wiki/Sequence_alignment
[9] N. C. Jones and P. A. Pevzner, "An Introduction to Bioinformatics Algorithms", MIT Press, 2004
[10] http://bioweb2.pasteur.fr/docs/phylip/doc/consense.html
[11] Yoshiyuki Suzuki, et. al., "Origin and Evolution of Influenza Virus Hemagglutinin Genes", Molecular Biology and Evolution, Oxford University Press, April 1, 2002
