
5/13/2017 BiopythonTutorialandCookbook

Biopython Tutorial and Cookbook
Jeff Chang, Brad Chapman, Iddo Friedberg, Thomas Hamelryck,
Michiel de Hoon, Peter Cock, Tiago Antao, Eric Talevich, Bartek Wilczyński

Last Update: 6 April 2017 (Biopython 1.69)

Contents
Chapter 1 Introduction
1.1 What is Biopython?
1.2 What can I find in the Biopython package
1.3 Installing Biopython
1.4 Frequently Asked Questions (FAQ)
Chapter 2 Quick Start - What can you do with Biopython?
2.1 General overview of what Biopython provides
2.2 Working with sequences
2.3 A usage example
2.4 Parsing sequence file formats
2.4.1 Simple FASTA parsing example
2.4.2 Simple GenBank parsing example
2.4.3 I love parsing - please don't stop talking about it!
2.5 Connecting with biological databases
2.6 What to do next
Chapter 3 Sequence objects
3.1 Sequences and Alphabets
3.2 Sequences act like strings
3.3 Slicing a sequence
3.4 Turning Seq objects into strings
3.5 Concatenating or adding sequences
3.6 Changing case
3.7 Nucleotide sequences and (reverse) complements
3.8 Transcription
3.9 Translation
3.10 Translation Tables
3.11 Comparing Seq objects
3.12 MutableSeq objects
3.13 UnknownSeq objects
3.14 Working with strings directly
Chapter 4 Sequence annotation objects
4.1 The SeqRecord object
4.2 Creating a SeqRecord
4.2.1 SeqRecord objects from scratch
4.2.2 SeqRecord objects from FASTA files
4.2.3 SeqRecord objects from GenBank files
4.3 Feature, location and position objects
4.3.1 SeqFeature objects
4.3.2 Positions and locations
4.3.3 Sequence described by a feature or location
4.4 Comparison
4.5 References
4.6 The format method
4.7 Slicing a SeqRecord
4.8 Adding SeqRecord objects
4.9 Reverse-complementing SeqRecord objects
Chapter 5 Sequence Input/Output
5.1 Parsing or Reading Sequences
5.1.1 Reading Sequence Files
5.1.2 Iterating over the records in a sequence file

http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc226 1/184
5.1.3 Getting a list of the records in a sequence file
5.1.4 Extracting data
5.2 Parsing sequences from compressed files
5.3 Parsing sequences from the net
5.3.1 Parsing GenBank records from the net
5.3.2 Parsing SwissProt sequences from the net
5.4 Sequence files as Dictionaries
5.4.1 Sequence files as Dictionaries - In memory
5.4.2 Sequence files as Dictionaries - Indexed files
5.4.3 Sequence files as Dictionaries - Database indexed files
5.4.4 Indexing compressed files
5.4.5 Discussion
5.5 Writing Sequence Files
5.5.1 Round trips
5.5.2 Converting between sequence file formats
5.5.3 Converting a file of sequences to their reverse complements
5.5.4 Getting your SeqRecord objects as formatted strings
5.6 Low level FASTA and FASTQ parsers
Chapter 6 Multiple Sequence Alignment objects
6.1 Parsing or Reading Sequence Alignments
6.1.1 Single Alignments
6.1.2 Multiple Alignments
6.1.3 Ambiguous Alignments
6.2 Writing Alignments
6.2.1 Converting between sequence alignment file formats
6.2.2 Getting your alignment objects as formatted strings
6.3 Manipulating Alignments
6.3.1 Slicing alignments
6.3.2 Alignments as arrays
6.4 Alignment Tools
6.4.1 ClustalW
6.4.2 MUSCLE
6.4.3 MUSCLE using stdout
6.4.4 MUSCLE using stdin and stdout
6.4.5 EMBOSS needle and water
6.4.6 Biopython's pairwise2
Chapter 7 BLAST
7.1 Running BLAST over the Internet
7.2 Running BLAST locally
7.2.1 Introduction
7.2.2 Standalone NCBI BLAST+
7.2.3 Other versions of BLAST
7.3 Parsing BLAST output
7.4 The BLAST record class
7.5 Deprecated BLAST parsers
7.5.1 Parsing plain-text BLAST output
7.5.2 Parsing a plain-text BLAST file full of BLAST runs
7.5.3 Finding a bad record somewhere in a huge plain-text BLAST file
7.6 Dealing with PSI-BLAST
7.7 Dealing with RPS-BLAST
Chapter 8 BLAST and other sequence search tools (experimental code)
8.1 The SearchIO object model
8.1.1 QueryResult
8.1.2 Hit
8.1.3 HSP
8.1.4 HSPFragment
8.2 A note about standards and conventions
8.3 Reading search output files
8.4 Dealing with large search output files with indexing
8.5 Writing and converting search output files
Chapter 9 Accessing NCBI's Entrez databases
9.1 Entrez Guidelines
9.2 EInfo: Obtaining information about the Entrez databases
9.3 ESearch: Searching the Entrez databases
9.4 EPost: Uploading a list of identifiers
9.5 ESummary: Retrieving summaries from primary IDs
9.6 EFetch: Downloading full records from Entrez
9.7 ELink: Searching for related items in NCBI Entrez
9.8 EGQuery: Global Query - counts for search terms
9.9 ESpell: Obtaining spelling suggestions
9.10 Parsing huge Entrez XML files
9.11 Handling errors
9.12 Specialized parsers
9.12.1 Parsing Medline records
9.12.2 Parsing GEO records
9.12.3 Parsing UniGene records
9.13 Using a proxy
9.14 Examples
9.14.1 PubMed and Medline
9.14.2 Searching, downloading, and parsing Entrez Nucleotide records
9.14.3 Searching, downloading, and parsing GenBank records

9.14.4 Finding the lineage of an organism
9.15 Using the history and WebEnv
9.15.1 Searching for and downloading sequences using the history
9.15.2 Searching for and downloading abstracts using the history
9.15.3 Searching for citations
Chapter 10 Swiss-Prot and ExPASy
10.1 Parsing Swiss-Prot files
10.1.1 Parsing Swiss-Prot records
10.1.2 Parsing the Swiss-Prot keyword and category list
10.2 Parsing Prosite records
10.3 Parsing Prosite documentation records
10.4 Parsing Enzyme records
10.5 Accessing the ExPASy server
10.5.1 Retrieving a Swiss-Prot record
10.5.2 Searching Swiss-Prot
10.5.3 Retrieving Prosite and Prosite documentation records
10.6 Scanning the Prosite database
Chapter 11 Going 3D: The PDB module
11.1 Reading and writing crystal structure files
11.1.1 Reading a PDB file
11.1.2 Reading an mmCIF file
11.1.3 Reading files in the MMTF format
11.1.4 Reading files in the PDB XML format
11.1.5 Writing PDB files
11.2 Structure representation
11.2.1 Structure
11.2.2 Model
11.2.3 Chain
11.2.4 Residue
11.2.5 Atom
11.2.6 Extracting a specific Atom/Residue/Chain/Model from a Structure
11.3 Disorder
11.3.1 General approach
11.3.2 Disordered atoms
11.3.3 Disordered residues
11.4 Hetero residues
11.4.1 Associated problems
11.4.2 Water residues
11.4.3 Other hetero residues
11.5 Navigating through a Structure object
11.6 Analyzing structures
11.6.1 Measuring distances
11.6.2 Measuring angles
11.6.3 Measuring torsion angles
11.6.4 Determining atom-atom contacts
11.6.5 Superimposing two structures
11.6.6 Mapping the residues of two related structures onto each other
11.6.7 Calculating the Half Sphere Exposure
11.6.8 Determining the secondary structure
11.6.9 Calculating the residue depth
11.7 Common problems in PDB files
11.7.1 Examples
11.7.2 Automatic correction
11.7.3 Fatal errors
11.8 Accessing the Protein Data Bank
11.8.1 Downloading structures from the Protein Data Bank
11.8.2 Downloading the entire PDB
11.8.3 Keeping a local copy of the PDB up to date
11.9 General questions
11.9.1 How well tested is Bio.PDB?
11.9.2 How fast is it?
11.9.3 Is there support for molecular graphics?
11.9.4 Who's using Bio.PDB?
Chapter 12 Bio.PopGen: Population genetics
12.1 GenePop
Chapter 13 Phylogenetics with Bio.Phylo
13.1 Demo: What's in a Tree?
13.1.1 Coloring branches within a tree
13.2 I/O functions
13.3 View and export trees
13.4 Using Tree and Clade objects
13.4.1 Search and traversal methods
13.4.2 Information methods
13.4.3 Modification methods
13.4.4 Features of PhyloXML trees
13.5 Running external applications
13.6 PAML integration
13.7 Future plans
Chapter 14 Sequence motif analysis using Bio.motifs
14.1 Motif objects
14.1.1 Creating a motif from instances

14.1.2 Creating a sequence logo
14.2 Reading motifs
14.2.1 JASPAR
14.2.2 MEME
14.2.3 TRANSFAC
14.3 Writing motifs
14.4 Position-Weight Matrices
14.5 Position-Specific Scoring Matrices
14.6 Searching for instances
14.6.1 Searching for exact matches
14.6.2 Searching for matches using the PSSM score
14.6.3 Selecting a score threshold
14.7 Each motif object has an associated Position-Specific Scoring Matrix
14.8 Comparing motifs
14.9 De novo motif finding
14.9.1 MEME
14.10 Useful links
Chapter 15 Cluster analysis
15.1 Distance functions
15.2 Calculating cluster properties
15.3 Partitioning algorithms
15.4 Hierarchical clustering
15.5 Self-Organizing Maps
15.6 Principal Component Analysis
15.7 Handling Cluster/TreeView-type files
15.8 Example calculation
15.9 Auxiliary functions
Chapter 16 Supervised learning methods
16.1 The Logistic Regression Model
16.1.1 Background and Purpose
16.1.2 Training the logistic regression model
16.1.3 Using the logistic regression model for classification
16.1.4 Logistic Regression, Linear Discriminant Analysis, and Support Vector Machines
16.2 k-Nearest Neighbors
16.2.1 Background and purpose
16.2.2 Initializing a k-nearest neighbors model
16.2.3 Using a k-nearest neighbors model for classification
16.3 Naïve Bayes
16.4 Maximum Entropy
16.5 Markov Models
Chapter 17 Graphics including GenomeDiagram
17.1 GenomeDiagram
17.1.1 Introduction
17.1.2 Diagrams, tracks, feature-sets and features
17.1.3 A top down example
17.1.4 A bottom up example
17.1.5 Features without a SeqFeature
17.1.6 Feature captions
17.1.7 Feature sigils
17.1.8 Arrow sigils
17.1.9 A nice example
17.1.10 Multiple tracks
17.1.11 Cross-Links between tracks
17.1.12 Further options
17.1.13 Converting old code
17.2 Chromosomes
17.2.1 Simple Chromosomes
17.2.2 Annotated Chromosomes
Chapter 18 KEGG
18.1 Parsing KEGG records
18.2 Querying the KEGG API
Chapter 19 Bio.phenotype: analyse phenotypic data
19.1 Phenotype Microarrays
19.1.1 Parsing Phenotype Microarray data
19.1.2 Manipulating Phenotype Microarray data
19.1.3 Writing Phenotype Microarray data
Chapter 20 Cookbook - Cool things to do with it
20.1 Working with sequence files
20.1.1 Filtering a sequence file
20.1.2 Producing randomised genomes
20.1.3 Translating a FASTA file of CDS entries
20.1.4 Making the sequences in a FASTA file upper case
20.1.5 Sorting a sequence file
20.1.6 Simple quality filtering for FASTQ files
20.1.7 Trimming off primer sequences
20.1.8 Trimming off adaptor sequences
20.1.9 Converting FASTQ files
20.1.10 Converting FASTA and QUAL files into FASTQ files
20.1.11 Indexing a FASTQ file
20.1.12 Converting SFF files
20.1.13 Identifying open reading frames

20.2 Sequence parsing plus simple plots
20.2.1 Histogram of sequence lengths
20.2.2 Plot of sequence GC%
20.2.3 Nucleotide dot plots
20.2.4 Plotting the quality scores of sequencing read data
20.3 Dealing with alignments
20.3.1 Calculating summary information
20.3.2 Calculating a quick consensus sequence
20.3.3 Position Specific Score Matrices
20.3.4 Information Content
20.4 Substitution Matrices
20.4.1 Using common substitution matrices
20.4.2 Creating your own substitution matrix from an alignment
20.5 BioSQL - storing sequences in a relational database
Chapter 21 The Biopython testing framework
21.1 Running the tests
21.1.1 Running the tests using Tox
21.2 Writing tests
21.2.1 Writing a print-and-compare test
21.2.2 Writing a unittest-based test
21.3 Writing doctests
21.4 Writing doctests in the Tutorial
Chapter 22 Advanced
22.1 Parser Design
22.2 Substitution Matrices
22.2.1 SubsMat
22.2.2 FreqTable
Chapter 23 Where to go from here - contributing to Biopython
23.1 Bug Reports + Feature Requests
23.2 Mailing lists and helping newcomers
23.3 Contributing Documentation
23.4 Contributing cookbook examples
23.5 Maintaining a distribution for a platform
23.6 Contributing Unit Tests
23.7 Contributing Code
Chapter 24 Appendix: Useful stuff about Python
24.1 What the heck is a handle?
24.1.1 Creating a handle from a string

Chapter 1 Introduction
1.1 What is Biopython?
The Biopython Project is an international association of developers of freely available Python (http://www.python.org) tools for computational molecular biology. Python is an object-oriented, interpreted, flexible language that is becoming increasingly popular for scientific computing. Python is easy to learn, has a very clear syntax and can easily be extended with modules written in C, C++ or FORTRAN.

The Biopython web site (http://www.biopython.org) provides an online resource for modules, scripts, and web links for developers of Python-based software for bioinformatics use and research. Basically, the goal of Biopython is to make it as easy as possible to use Python for bioinformatics by creating high-quality, reusable modules and classes. Biopython features include parsers for various Bioinformatics file formats (BLAST, Clustalw, FASTA, Genbank, ...), access to online services (NCBI, Expasy, ...), interfaces to common and not-so-common programs (Clustalw, DSSP, MSMS, ...), a standard sequence class, various clustering modules, a KD tree data structure etc. and even documentation.

Basically, we just like to program in Python and want to make it as easy as possible to use Python for bioinformatics by creating high-quality, reusable modules and scripts.

1.2 What can I find in the Biopython package
The main Biopython releases have lots of functionality, including:

The ability to parse bioinformatics files into Python utilizable data structures, including support for the following formats:
    Blast output, both from standalone and WWW Blast
    Clustalw
    FASTA
    GenBank
    PubMed and Medline
    ExPASy files, like Enzyme and Prosite
    SCOP, including "dom" and "lin" files
    UniGene
    SwissProt
Files in the supported formats can be iterated over record by record or indexed and accessed via a Dictionary interface.
Code to deal with popular online bioinformatics destinations such as:
    NCBI: Blast, Entrez and PubMed services
    ExPASy: Swiss-Prot and Prosite entries, as well as Prosite searches
Interfaces to common bioinformatics programs such as:
    Standalone Blast from NCBI
    Clustalw alignment program
    EMBOSS command line tools
A standard sequence class that deals with sequences, ids on sequences, and sequence features.
Tools for performing common operations on sequences, such as translation, transcription and weight calculations.
Code to perform classification of data using k Nearest Neighbors, Naive Bayes or Support Vector Machines.
Code for dealing with alignments, including a standard way to create and deal with substitution matrices.
Code making it easy to split up parallelizable tasks into separate processes.
GUI-based programs to do basic sequence manipulations, translations, BLASTing, etc.
Extensive documentation and help with using the modules, including this file, online wiki documentation, the web site, and the mailing list.
Integration with BioSQL, a sequence database schema also supported by the BioPerl and BioJava projects.

We hope this gives you plenty of reasons to download and start using Biopython!

1.3 Installing Biopython
All of the installation information for Biopython was separated from this document to make it easier to keep updated.

The short version is go to our downloads page (http://biopython.org/wiki/Download), download and install the listed dependencies, then download and install Biopython. Biopython runs on many platforms (Windows, Mac, and on the various flavors of Linux and Unix). For Windows we provide pre-compiled click-and-run installers, while for Unix and other operating systems you must install from source as described in the included README file. This is usually as simple as the standard commands:
python setup.py build
python setup.py test
sudo python setup.py install

(You can in fact skip the build and test, and go straight to the install, but it's better to make sure everything seems to be working.)

The longer version of our installation instructions covers installation of Python, Biopython dependencies and Biopython itself. It is available in PDF (http://biopython.org/DIST/docs/install/Installation.pdf) and HTML formats (http://biopython.org/DIST/docs/install/Installation.html).

1.4 Frequently Asked Questions (FAQ)
1. How do I cite Biopython in a scientific publication?
Please cite our application note [1, Cock et al., 2009] as the main Biopython reference. In addition, please cite any publications from the following list if appropriate, in particular as a reference for specific modules within Biopython (more information can be found on our website):
    For the official project announcement: [13, Chapman and Chang, 2000]
    For Bio.PDB: [18, Hamelryck and Manderick, 2003]
    For Bio.Cluster: [14, De Hoon et al., 2004]
    For Bio.Graphics.GenomeDiagram: [2, Pritchard et al., 2006]
    For Bio.Phylo and Bio.Phylo.PAML: [9, Talevich et al., 2012]
    For the FASTQ file format as supported in Biopython, BioPerl, BioRuby, BioJava, and EMBOSS: [7, Cock et al., 2010]
2. How should I capitalize "Biopython"? Is "BioPython" OK?
The correct capitalization is "Biopython", not "BioPython" (even though that would have matched BioPerl, BioJava and BioRuby).
3. What is going wrong with my print commands?
This tutorial now uses the Python 3 style print function. As of Biopython 1.62, we support both Python 2 and Python 3. The most obvious language difference is that the print statement in Python 2 became a print function in Python 3.

For example, this will only work under Python 2:
>>> print "Hello World!"
Hello World!

If you try that on Python 3 you'll get a SyntaxError. Under Python 3 you must write:

>>> print("Hello World!")
Hello World!

Surprisingly that will also work on Python 2, but only for simple examples printing one thing. In general you need to add this magic line to the start of your Python scripts to use the print function under Python 2.6 and 2.7:
from __future__ import print_function

If you forget to add this magic import, under Python 2 you'll see extra brackets produced by trying to use the print function when Python 2 is interpreting it as a print statement and a tuple.

4. How do I find out what version of Biopython I have installed?
Use this:
>>> import Bio
>>> print(Bio.__version__)
...

If the "import Bio" line fails, Biopython is not installed. Note that those are double underscores before and after version. If the second line fails, your version is very out of date.

If the version string ends with a plus like "1.66+", you don't have an official release, but an old snapshot of the in-development code after that version was released. This naming was used until June 2016 in the run-up to Biopython 1.68.

If the version string ends with ".dev<number>" like "1.68.dev0", again you don't have an official release, but instead a snapshot of the in-development code before that version was released.

5. Where is the latest version of this document?
If you download a Biopython source code archive, it will include the relevant version in both HTML and PDF formats. The latest published version of this document (updated at each release) is online:
    http://biopython.org/DIST/docs/tutorial/Tutorial.html
    http://biopython.org/DIST/docs/tutorial/Tutorial.pdf
6. What is wrong with my sequence comparisons?
There was a major change in Biopython 1.65 making the Seq and MutableSeq classes (and subclasses) use simple string-based comparison (ignoring the alphabet other than if giving a warning), which you can do explicitly with str(seq1) == str(seq2).

Older versions of Biopython would use instance-based comparison for Seq objects, which you can do explicitly with id(seq1) == id(seq2).

If you still need to support old versions of Biopython, use these explicit forms to avoid problems. See Section 3.11.

7. Why is the Seq object missing the upper & lower methods described in this Tutorial?
You need Biopython 1.53 or later. Alternatively, use str(my_seq).upper() to get an upper case string. If you need a Seq object, try Seq(str(my_seq).upper()) but be careful about blindly re-using the same alphabet.
8. Why doesn't the Seq object translation method support the cds option described in this Tutorial?
You need Biopython 1.51 or later.
9. What file formats do Bio.SeqIO and Bio.AlignIO read and write?
Check the built-in docstrings (from Bio import SeqIO, then help(SeqIO)), or see http://biopython.org/wiki/SeqIO and http://biopython.org/wiki/AlignIO on the wiki for the latest listing.
10. Why won't the Bio.SeqIO and Bio.AlignIO functions parse, read and write take filenames? They insist on handles!
You need Biopython 1.54 or later, or just use handles explicitly (see Section 24.1). It is especially important to remember to close output handles explicitly after writing your data.
11. Why won't the Bio.SeqIO.write() and Bio.AlignIO.write() functions accept a single record or alignment? They insist on a list or iterator!
You need Biopython 1.54 or later, or just wrap the item with [...] to create a list of one element.
12. Why doesn't str(...) give me the full sequence of a Seq object?
You need Biopython 1.45 or later.
13. Why doesn't Bio.Blast work with the latest plain text NCBI blast output?
The NCBI keep tweaking the plain text output from the BLAST tools, and keeping our parser up to date is/was an ongoing struggle. If you aren't using the latest version of Biopython, you could try upgrading. However, we (and the NCBI) recommend you use the XML output instead, which is designed to be read by a computer program.
14. Why doesn't Bio.Entrez.parse() work? The module imports fine but there is no parse function!
You need Biopython 1.52 or later.
15. Why has my script using Bio.Entrez.efetch() stopped working?
This could be due to NCBI changes in February 2012 introducing EFetch 2.0. First, they changed the default return modes; you probably want to add retmode="text" to your call. Second, they are now stricter about how to provide a list of IDs; Biopython 1.59 onwards turns a list into a comma-separated string automatically.
16. Why doesn't Bio.Blast.NCBIWWW.qblast() give the same results as the NCBI BLAST website?
You need to specify the same options: the NCBI often adjust the default settings on the website, and they do not match the QBLAST defaults anymore. Check things like the gap penalties and expectation threshold.
17. Why doesn't Bio.Blast.NCBIXML.read() work? The module imports but there is no read function!
You need Biopython 1.50 or later. Or, use next(Bio.Blast.NCBIXML.parse(...)) instead.
18. Why doesn't my SeqRecord object have a letter_annotations attribute?
Per-letter-annotation support was added in Biopython 1.50.
19. Why can't I slice my SeqRecord to get a sub-record?
You need Biopython 1.50 or later.
20. Why can't I add SeqRecord objects together?
You need Biopython 1.53 or later.
21. Why doesn't Bio.SeqIO.convert() or Bio.AlignIO.convert() work? The modules import fine but there is no convert function!
You need Biopython 1.52 or later. Alternatively, combine the parse and write functions as described in this tutorial (see Sections 5.5.2 and 6.2.1).
22. Why doesn't Bio.SeqIO.index() work? The module imports fine but there is no index function!
You need Biopython 1.52 or later.
23. Why doesn't Bio.SeqIO.index_db() work? The module imports fine but there is no index_db function!
You need Biopython 1.57 or later (and a Python with SQLite3 support).
24. Where is the MultipleSeqAlignment object? The Bio.Align module imports fine but this class isn't there!
You need Biopython 1.54 or later. Alternatively, the older Bio.Align.Generic.Alignment class supports some of its functionality, but using this is now discouraged.
25. Why can't I run command line tools directly from the application wrappers?
You need Biopython 1.55 or later. Alternatively, use the Python subprocess module directly.
26. I looked in a directory for code, but I couldn't find the code that does something. Where's it hidden?
One thing to know is that we put code in __init__.py files. If you are not used to looking for code in this file this can be confusing. The reason we do this is to make the imports easier for users. For instance, instead of having to do a "repetitive" import like from Bio.GenBank import GenBank, you can just use from Bio import GenBank.
27. Why does the code from CVS seem out of date?
In late September 2009, just after the release of Biopython 1.52, we switched from using CVS to git, a distributed version control system. The old CVS server will remain available as a static and read-only backup, but if you want to grab the latest code, you'll need to use git instead. See our website for more details.
28. Why doesn't Bio.Fasta work?
We deprecated the Bio.Fasta module in Biopython 1.51 (August 2009) and removed it in Biopython 1.55 (August 2010). There is a brief example showing how to convert old code to use Bio.SeqIO instead in the DEPRECATED.rst file.
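
The version-string rules from FAQ item 4 can be sketched in plain Python. This is only an illustration under stated assumptions: describe_version is a hypothetical helper, not part of Biopython, and it looks at nothing beyond the trailing "+" and ".dev<number>" conventions described above.

```python
import re


def describe_version(version):
    """Classify a Biopython-style version string (hypothetical helper)."""
    # A trailing plus marks an old-style snapshot made after a release.
    if version.endswith("+"):
        return "snapshot taken after release " + version[:-1]
    # A ".dev<number>" suffix marks a snapshot made before a release.
    match = re.match(r"^(\d+\.\d+)\.dev\d+$", version)
    if match:
        return "snapshot taken before release " + match.group(1)
    # Anything else is treated as an official release by this sketch.
    return "official release " + version


for example in ("1.69", "1.66+", "1.68.dev0"):
    print(example, "->", describe_version(example))
```

In a real session you would pass Bio.__version__ to such a helper instead of the literal strings used here.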

For more general questions, the Python FAQ pages http://www.python.org/doc/faq/ may be useful.
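
FAQ item 15 mentions two EFetch details: an explicit retmode="text" and a list of IDs sent as one comma-separated string. The sketch below shows that conversion without contacting NCBI. The helper name build_efetch_query, the EFETCH_URL constant and the choice of parameters are assumptions made for illustration; this is not how Bio.Entrez is implemented, and the two GI numbers are simply the ones that appear in the FASTA example of Section 2.4.1.

```python
from urllib.parse import urlencode

# Assumed address of NCBI's EFetch service; check the NCBI documentation.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_query(db, ids, rettype, retmode="text"):
    """Return EFetch-style query parameters (hypothetical helper)."""
    return {
        "db": db,
        # A list of identifiers becomes a single comma-separated string.
        "id": ",".join(str(identifier) for identifier in ids),
        "rettype": rettype,
        # State the return mode explicitly rather than relying on a default.
        "retmode": retmode,
    }


params = build_efetch_query("nucleotide", [2765658, 2765564], "fasta")
# Only build and show the URL; nothing is sent over the network.
print(EFETCH_URL + "?" + urlencode(params))
```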

Chapter 2 Quick Start - What can you do with Biopython?
This section is designed to get you started quickly with Biopython, and to give a general overview of what is available and how to use it. All of the examples in this section assume that you have some general working knowledge of Python, and that you have successfully installed Biopython on your system. If you think you need to brush up on your Python, the main Python web site provides quite a bit of free documentation to get started with (http://www.python.org/doc/).

Since much biological work on the computer involves connecting with databases on the internet, some of the examples will also require a working internet connection in order to run.

Now that that is all out of the way, let's get into what we can do with Biopython.

2.1 General overview of what Biopython provides
As mentioned in the introduction, Biopython is a set of libraries to provide the ability to deal with "things" of interest to biologists working on the computer. In general this means that you will need to have at least some programming experience (in Python, of course!) or at least an interest in learning to program. Biopython's job is to make your job easier as a programmer by supplying reusable libraries so that you can focus on answering your specific question of interest, instead of focusing on the internals of parsing a particular file format (of course, if you want to help by writing a parser that doesn't exist and contributing it to Biopython, please go ahead!). So Biopython's job is to make you happy!

One thing to note about Biopython is that it often provides multiple ways of "doing the same thing." Things have improved in recent releases, but this can still be frustrating as in Python there should ideally be one right way to do something. However, this can also be a real benefit because it gives you lots of flexibility and control over the libraries. The tutorial helps to show you the common or easy ways to do things so that you can just make things work. To learn more about the alternative possibilities, look in the Cookbook (Chapter 20, this has some cool tricks and tips), the Advanced section (Chapter 22), the built-in "docstrings" (via the Python help command, or the API documentation) or ultimately the code itself.

2.2 Working with sequences
Disputably (of course!), the central object in bioinformatics is the sequence. Thus, we'll start with a quick introduction to the Biopython mechanisms for dealing with sequences, the Seq object, which we'll discuss in more detail in Chapter 3.

Most of the time when we think about sequences we have in mind a string of letters like 'AGTACACTGGT'. You can create a Seq object with this sequence as follows (the ">>>" represents the Python prompt followed by what you would type in):

>>> from Bio.Seq import Seq
>>> my_seq = Seq("AGTACACTGGT")
>>> my_seq
Seq('AGTACACTGGT', Alphabet())
>>> print(my_seq)
AGTACACTGGT
>>> my_seq.alphabet
Alphabet()

What we have here is a sequence object with a generic alphabet, reflecting the fact we have not specified if this is a DNA or protein sequence (okay, a protein with a lot of Alanines, Glycines, Cysteines and Threonines!). We'll talk more about alphabets in Chapter 3.

In addition to having an alphabet, the Seq object differs from the Python string in the methods it supports. You can't do this with a plain string:
>>> my_seq
Seq('AGTACACTGGT', Alphabet())
>>> my_seq.complement()
Seq('TCATGTGACCA', Alphabet())
>>> my_seq.reverse_complement()
Seq('ACCAGTGTACT', Alphabet())
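
To make clear what these two methods compute, here is a minimal plain-string sketch. It is not Biopython's implementation: the function names are hypothetical, and it assumes unambiguous DNA letters only (A, C, G, T in either case), whereas the Seq object also has to deal with alphabets.

```python
# Watson-Crick pairs for unambiguous DNA only (an assumption of this sketch).
COMPLEMENT_TABLE = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(dna):
    """Swap every base for its pairing partner, keeping the order."""
    return dna.translate(COMPLEMENT_TABLE)


def reverse_complement(dna):
    """Complement the sequence, then read it in the opposite direction."""
    return complement(dna)[::-1]


print(complement("AGTACACTGGT"))          # TCATGTGACCA
print(reverse_complement("AGTACACTGGT"))  # ACCAGTGTACT
```

The printed strings match the Seq results shown above for the same input.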

The next most important class is the SeqRecord or Sequence Record. This holds a sequence (as a Seq object) with additional annotation including an identifier, name and description. The Bio.SeqIO module for reading and writing sequence file formats works with SeqRecord objects, which will be introduced below and covered in more detail by Chapter 5.

This covers the basic features and uses of the Biopython sequence class. Now that you've got some idea of what it is like to interact with the Biopython libraries, it's time to delve into the fun, fun world of dealing with biological file formats!

2.3 A usage example
Before we jump right into parsers and everything else to do with Biopython, let's set up an example to motivate everything we do and make life more interesting. After all, if there wasn't any biology in this tutorial, why would you want to read it?

Since I love plants, I think we're just going to have to have a plant-based example (sorry to all the fans of other organisms out there!). Having just completed a recent trip to our local greenhouse, we've suddenly developed an incredible obsession with Lady Slipper Orchids (if you wonder why, have a look at some Lady Slipper Orchids photos on Flickr, or try a Google Image Search).

Of course, orchids are not only beautiful to look at, they are also extremely interesting for people studying evolution and systematics. So let's suppose we're thinking about writing a funding proposal to do a molecular study of Lady Slipper evolution, and would like to see what kind of research has already been done and how we can add to that.

After a little bit of reading up we discover that the Lady Slipper Orchids are in the Orchidaceae family and the Cypripedioideae sub-family and are made up of 5 genera: Cypripedium, Paphiopedilum, Phragmipedium, Selenipedium and Mexipedium.

That gives us enough to get started delving for more information. So, let's look at how the Biopython tools can help us. We'll start with sequence parsing in Section 2.4, but the orchids will be back later on as well: for example, we'll search PubMed for papers about orchids and extract sequence data from GenBank in Chapter 9, extract data from Swiss-Prot from certain orchid proteins in Chapter 10, and work with ClustalW multiple sequence alignments of orchid proteins in Section 6.4.1.

2.4 Parsing sequence file formats
A large part of much bioinformatics work involves dealing with the many types of file formats designed to hold biological data. These files are loaded with interesting biological data, and a special challenge is parsing these files into a format so that you can manipulate them with some kind of programming language. However, the task of parsing these files can be frustrated by the fact that the formats can change quite regularly, and that formats may contain small subtleties which can break even the most well designed parsers.

We are now going to briefly introduce the Bio.SeqIO module; you can find out more in Chapter 5. We'll start with an online search for our friends, the lady slipper orchids. To keep this introduction simple, we're just using the NCBI website by hand. Let's just take a look through the nucleotide databases at NCBI, using an Entrez online search (http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?db=Nucleotide) for everything mentioning the text Cypripedioideae (this is the subfamily of lady slipper orchids).

When this tutorial was originally written, this search gave us only 94 hits, which we saved as a FASTA formatted text file and as a GenBank formatted text file (files ls_orchid.fasta and ls_orchid.gbk, also included with the Biopython source code under docs/tutorial/examples/).

If you run the search today, you'll get hundreds of results! When following the tutorial, if you want to see the same list of genes, just download the two files above or copy them from docs/examples/ in the Biopython source code. In Section 2.5 we will look at how to do a search like this from within Python.

2.4.1 Simple FASTA parsing example
If you open the lady slipper orchids FASTA file ls_orchid.fasta in your favourite text editor, you'll see that the file starts like this:
>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG
...

It contains 94 records; each has a line starting with ">" (greater-than symbol) followed by the sequence on one or more lines. Now try this in Python:
from Bio import SeqIO
for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
    print(seq_record.id)
    print(repr(seq_record.seq))
    print(len(seq_record))

You should get something like this on your screen:
gi|2765658|emb|Z78533.1|CIZ78533
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', SingleLetterAlphabet())
740
...
gi|2765564|emb|Z78439.1|PBZ78439
Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACT...GCC', SingleLetterAlphabet())
592

Notice that the FASTA format does not specify the alphabet, so Bio.SeqIO has defaulted to the rather generic SingleLetterAlphabet() rather than something DNA specific.
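
To show what "a line starting with > followed by the sequence on one or more lines" means in practice, here is a minimal pure-Python FASTA reader. It is a sketch for illustration only, not the Bio.SeqIO parser: read_fasta and _make_record are hypothetical names, records are plain tuples rather than SeqRecord objects, and the demo text is just the first two sequence lines of the record shown above (so the sequence is truncated).

```python
import io


def _make_record(header, chunks):
    # The identifier is taken as the first whitespace-delimited word of the header.
    parts = header.split(None, 1)
    identifier = parts[0] if parts else ""
    return identifier, header, "".join(chunks)


def read_fasta(handle):
    """Yield (identifier, description, sequence) tuples from FASTA text."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip()
        if line.startswith(">"):
            # A new header line closes the previous record, if there was one.
            if header is not None:
                yield _make_record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)
    if header is not None:
        yield _make_record(header, chunks)


demo = io.StringIO(
    ">gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA\n"
    "CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG\n"
    "AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG\n"
)
for identifier, description, sequence in read_fasta(demo):
    print(identifier)
    print(len(sequence))
```

The identifier printed here is the same first word of the header line that SeqIO reports as seq_record.id in the output above.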

2.4.2 Simple GenBank parsing example

Now let's load the GenBank file ls_orchid.gbk instead. Notice that the code to do this is almost identical to the snippet used above for the FASTA file; the only difference is we change the filename and the format string:
from Bio import SeqIO
for seq_record in SeqIO.parse("ls_orchid.gbk", "genbank"):
    print(seq_record.id)
    print(repr(seq_record.seq))
    print(len(seq_record))

This should give:

Z78533.1
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', IUPACAmbiguousDNA())
740
...
Z78439.1
Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACT...GCC', IUPACAmbiguousDNA())
592

This time Bio.SeqIO has been able to choose a sensible alphabet, IUPAC Ambiguous DNA. You'll also notice that a shorter string has been used as the seq_record.id in this case.
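
Because only the filename and the format string change between the two examples, a script that handles both files could pick the format string from the file extension. The sketch below is one possible way to do that; guess_format and the extension table are hypothetical and not part of Bio.SeqIO, and the only format names used are the "fasta" and "genbank" strings from the examples above.

```python
import os

# Hypothetical mapping from file extension to the format strings used above.
FORMAT_BY_EXTENSION = {
    ".fasta": "fasta",
    ".gbk": "genbank",
}


def guess_format(filename):
    """Return the format string for a filename, based on its extension."""
    extension = os.path.splitext(filename)[1].lower()
    try:
        return FORMAT_BY_EXTENSION[extension]
    except KeyError:
        raise ValueError("No format known for extension %r" % extension)


for name in ("ls_orchid.fasta", "ls_orchid.gbk"):
    print(name, "->", guess_format(name))
```

With Biopython installed, the result could be passed as the second argument of the parse call, for example SeqIO.parse(name, guess_format(name)).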

2.4.3 I love parsing - please don't stop talking about it!

Biopython has a lot of parsers, and each has its own little special niches based on the sequence format it is parsing and all of that. Chapter 5 covers Bio.SeqIO in more detail, while Chapter 6 introduces Bio.AlignIO for sequence alignments.

While the most popular file formats have parsers integrated into Bio.SeqIO and/or Bio.AlignIO, for some of the rarer and unloved file formats there is either no parser at all, or an old parser which has not been linked in yet. Please also check the wiki pages http://biopython.org/wiki/SeqIO and http://biopython.org/wiki/AlignIO for the latest information, or ask on the mailing list. The wiki pages should include an up-to-date list of supported file types, and some additional examples.

The next place to look for information about specific parsers and how to do cool things with them is in the Cookbook (Chapter 20 of this Tutorial). If you don't find the information you are looking for, please consider helping out your poor overworked documentors and submitting a cookbook entry about it! (once you figure out how to do it, that is!)

2.5 Connecting with biological databases
One of the very common things that you need to do in bioinformatics is extract information from biological databases. It can be quite tedious to access these databases manually, especially if you have a lot of repetitive work to do. Biopython attempts to save you time and energy by making some online databases available from Python scripts. Currently, Biopython has code to extract information from the following databases:

Entrez (and PubMed) from the NCBI: see Chapter 9.
ExPASy: see Chapter 10.
SCOP: see the Bio.SCOP.search() function.

The code in these modules basically makes it easy to write Python code that interacts with the CGI scripts on these pages, so that you can get results in an easy to deal with format. In some cases, the results can be tightly integrated with the Biopython parsers to make it even easier to extract information.

2.6 What to do next
Now that you've made it this far, you hopefully have a good understanding of the basics of Biopython and are ready to start using it for doing useful work. The best thing to do now is finish reading this tutorial, and then if you want start snooping around in the source code, and looking at the automatically generated documentation.

Once you get a picture of what you want to do, and what libraries in Biopython will do it, you should take a peek at the Cookbook (Chapter 20), which may have example code to do something similar to what you want to do.

If you know what you want to do, but can't figure out how to do it, please feel free to post questions to the main Biopython list (see http://biopython.org/wiki/Mailing_lists). This will not only help us answer your question, it will also allow us to improve the documentation so it can help the next person do what you want to do.

Enjoy the code!

Chapter 3 Sequence objects
Biological sequences are arguably the central object in Bioinformatics, and in this chapter we'll introduce the Biopython mechanism for dealing with sequences, the Seq object. Chapter 4 will introduce the related SeqRecord object, which combines the sequence information with any annotation, used again in Chapter 5 for Sequence Input/Output.

Sequences are essentially strings of letters like AGTACACTGGT, which seems very natural since this is the most common way that sequences are seen in biological file formats.

There are two important differences between Seq objects and standard Python strings. First of all, they have different methods. Although the Seq object supports many of the same methods as a plain string, its translate() method differs by doing biological translation, and there are also additional biologically relevant methods like reverse_complement(). Secondly, the Seq object has an important attribute, alphabet, which is an object describing what the individual characters making up the sequence string "mean", and how they should be interpreted. For example, is AGTACACTGGT a DNA sequence, or just a protein sequence that happens to be rich in Alanines, Glycines, Cysteines and Threonines?

3.1 Sequences and Alphabets
The alphabet object is perhaps the most important thing that makes the Seq object more than just a string. The currently available alphabets for Biopython are defined in the Bio.Alphabet module. We'll use the IUPAC alphabets (http://www.chem.qmw.ac.uk/iupac/) here to deal with some of our favorite objects: DNA, RNA and Proteins.

Bio.Alphabet.IUPAC provides basic definitions for proteins, DNA and RNA, but additionally provides the ability to extend and customize the basic definitions. For instance, for proteins, there is a basic IUPACProtein class, but there is an additional ExtendedIUPACProtein class providing for the additional elements "U" (or "Sec" for selenocysteine) and "O" (or "Pyl" for pyrrolysine), plus the ambiguous symbols "B" (or "Asx" for asparagine or aspartic acid), "Z" (or "Glx" for glutamine or glutamic acid), "J" (or "Xle" for leucine or isoleucine) and "X" (or "Xxx" for an unknown amino acid). For DNA you've got choices of IUPACUnambiguousDNA, which provides for just the basic letters, IUPACAmbiguousDNA (which provides for ambiguity letters for every possible situation) and ExtendedIUPACDNA, which allows letters for modified bases. Similarly, RNA can be represented by IUPACAmbiguousRNA or IUPACUnambiguousRNA.

The advantages of having an alphabet class are twofold. First, this gives an idea of the type of information the Seq object contains. Secondly, this provides a means of constraining the information, as a means of type checking.

Now that we know what we are dealing with, let's look at how to utilize this class to do interesting work. You can create an ambiguous sequence with the default generic alphabet like this:
>>> from Bio.Seq import Seq
>>> my_seq = Seq("AGTACACTGGT")
>>> my_seq
Seq('AGTACACTGGT', Alphabet())
>>> my_seq.alphabet
Alphabet()

However, where possible you should specify the alphabet explicitly when creating your sequence objects - in this case an unambiguous DNA alphabet object:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("AGTACACTGGT", IUPAC.unambiguous_dna)
>>> my_seq
Seq('AGTACACTGGT', IUPACUnambiguousDNA())
>>> my_seq.alphabet
IUPACUnambiguousDNA()

Unless of course, this really is an amino acid sequence:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_prot = Seq("AGTACACTGGT", IUPAC.protein)
>>> my_prot
Seq('AGTACACTGGT', IUPACProtein())
>>> my_prot.alphabet
IUPACProtein()

3.2 Sequences act like strings
In many ways, we can deal with Seq objects as if they were normal Python strings, for example getting the length, or iterating over the elements:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("GATCG", IUPAC.unambiguous_dna)
>>> for index, letter in enumerate(my_seq):
...     print("%i %s" % (index, letter))
0 G
1 A
2 T
3 C
4 G
>>> print(len(my_seq))
5

You can access elements of the sequence in the same way as for strings (but remember, Python counts from zero!):
>>> print(my_seq[0])  # first letter
G
>>> print(my_seq[2])  # third letter
T
>>> print(my_seq[-1])  # last letter
G

The Seq object has a .count() method, just like a string. Note that this means that like a Python string, this gives a non-overlapping count:
>>> from Bio.Seq import Seq
>>> "AAAA".count("AA")
2
>>> Seq("AAAA").count("AA")
2

For some biological uses, you may actually want an overlapping count (i.e. 3 in this trivial example). When searching for single letters, this makes no difference:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPAC.unambiguous_dna)
>>> len(my_seq)
32
>>> my_seq.count("G")
9
>>> 100 * float(my_seq.count("G") + my_seq.count("C")) / len(my_seq)
46.875

While you could use the above snippet of code to calculate a GC%, note that the Bio.SeqUtils module has several GC functions already built. For example:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> from Bio.SeqUtils import GC
>>> my_seq = Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPAC.unambiguous_dna)
>>> GC(my_seq)
46.875

Note that using the Bio.SeqUtils.GC() function should automatically cope with mixed case sequences and the ambiguous nucleotide S which means G or C.
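
To make the mixed case and ambiguous S handling concrete, here is a minimal plain-Python sketch of the same idea. It is not the Bio.SeqUtils implementation; the name gc_percent is made up for this illustration, and it works on ordinary strings (use str(my_seq) to get one from a Seq object).

```python
def gc_percent(sequence):
    """Percentage of G, C and S letters in a plain string, ignoring case.

    Hypothetical helper for illustration only; not Biopython's own code.
    """
    if not sequence:
        return 0.0
    upper = sequence.upper()
    # S is the IUPAC ambiguity code for "G or C", so it counts as GC.
    gc_letters = sum(upper.count(letter) for letter in "GCS")
    return 100.0 * gc_letters / len(upper)


print(gc_percent("GATCGATGGGCCTATATAGGATCGAAAATCGC"))
print(gc_percent("gatcSS"))
```

Upper-casing once up front keeps the counting step simple, at the cost of one temporary copy of the string.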

Also note that just like a normal Python string, the Seq object is in some ways "read-only". If you need to edit your sequence, for example simulating a point mutation, look at Section 3.12 below which talks about the MutableSeq object.
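
Returning to the overlapping count mentioned earlier in this section: if that is what you need, one possible approach is to search repeatedly and move on by a single letter each time. The sketch below works on plain Python strings, and the function name count_overlapping is just an illustrative choice, not part of Biopython.

```python
def count_overlapping(sequence, motif):
    """Count occurrences of motif in a plain string, allowing overlaps.

    Hypothetical helper for illustration only.
    """
    if not motif:
        raise ValueError("motif must not be empty")
    count = 0
    start = sequence.find(motif)
    while start != -1:
        count += 1
        # Step forward by one letter only, so overlapping matches are found.
        start = sequence.find(motif, start + 1)
    return count


print("AAAA".count("AA"))               # non-overlapping count: 2
print(count_overlapping("AAAA", "AA"))  # overlapping count: 3
```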

3.3 Slicing a sequence
A more complicated example, let's get a slice of the sequence:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna)
>>> my_seq[4:12]
Seq('GATGGGCC', IUPACUnambiguousDNA())

Two things are interesting to note. First, this follows the normal conventions for Python strings. So the first element of the sequence is 0 (which is normal for computer science, but not so normal for biology). When you do a slice the first item is included (i.e. 4 in this case) and the last is excluded (12 in this case), which is the way things work in Python, but of course not necessarily the way everyone in the world would expect. The main goal is to stay consistent with what Python does.
The second thing to notice is that the slice is performed on the sequence data string, but the new object produced is another Seq object which retains the alphabet information from the original Seq object.

Also like a Python string, you can do slices with a start, stop and stride (the step size, which defaults to one). For example, we can get the first, second and third codon positions of this DNA sequence:
>>> my_seq[0::3]
Seq('GCTGTAGTAAG', IUPACUnambiguousDNA())
>>> my_seq[1::3]
Seq('AGGCATGCATC', IUPACUnambiguousDNA())
>>> my_seq[2::3]
Seq('TAGCTAAGAC', IUPACUnambiguousDNA())

Another stride trick you might have seen with a Python string is the use of a -1 stride to reverse the string. You can do this with a Seq object too:
>>> my_seq[::-1]
Seq('CGCTAAAAGCTAGGATATATCCGGGTAGCTAG', IUPACUnambiguousDNA())
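
A closely related task is cutting a sequence into codons. The following is a small sketch using ordinary slicing on a plain string; split_codons is a hypothetical helper name, and it simply drops any trailing letters that do not make up a full codon.

```python
def split_codons(sequence):
    """Return the complete codons of a plain string, reading from position 0.

    Hypothetical helper for illustration only.
    """
    usable = len(sequence) - len(sequence) % 3  # ignore a trailing partial codon
    return [sequence[i:i + 3] for i in range(0, usable, 3)]


dna = "GATCGATGGGCCTATATAGGATCGAAAATCGC"
codons = split_codons(dna)
print(codons[:3])
# The first letter of every codon is what the stride slice dna[0::3] picks out
# (the stride slice may also include a letter from a trailing partial codon).
print("".join(codon[0] for codon in codons))
```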

3.4 Turning Seq objects into strings
If you really do just need a plain string, for example to write to a file, or insert into a database, then this is very easy to get:

>>> str(my_seq)
'GATCGATGGGCCTATATAGGATCGAAAATCGC'

Since calling str() on a Seq object returns the full sequence as a string, you often don't actually have to do this conversion explicitly. Python does this automatically in the print function (and the print statement under Python 2):

>>> print(my_seq)
GATCGATGGGCCTATATAGGATCGAAAATCGC

You can also use the Seq object directly with a %s placeholder when using the Python string formatting or interpolation operator (%):
>>> fasta_format_string = ">Name\n%s\n" % my_seq
>>> print(fasta_format_string)
>Name
GATCGATGGGCCTATATAGGATCGAAAATCGC
<BLANKLINE>

This line of code constructs a simple FASTA format record (without worrying about line wrapping). Section 4.6 describes a neat way to get a FASTA formatted string from a SeqRecord object, while the more general topic of reading and writing FASTA format sequence files is covered in Chapter 5.
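
If you do need line wrapping in a hand-built record, the following sketch shows one way to do it with plain strings. The helper name to_fasta and the default width of 60 letters are assumptions made for this example; for real work the Biopython routines referred to above are the better choice.

```python
def to_fasta(name, sequence, width=60):
    """Format one FASTA record, wrapping the sequence every `width` letters.

    Hypothetical helper for illustration only.
    """
    lines = [">" + name]
    for start in range(0, len(sequence), width):
        lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


# A narrow width is used here just to make the wrapping visible.
print(to_fasta("Name", "GATCGATGGGCCTATATAGGATCGAAAATCGC", width=10))
```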


3.5 Concatenating or adding sequences
Naturally, you can in principle add any two Seq objects together, just like you can with Python strings, to concatenate them. However, you can't add sequences with incompatible alphabets, such as a protein sequence and a DNA sequence:

>>> from Bio.Alphabet import IUPAC
>>> from Bio.Seq import Seq
>>> protein_seq = Seq("EVRNAK", IUPAC.protein)
>>> dna_seq = Seq("ACGT", IUPAC.unambiguous_dna)
>>> protein_seq + dna_seq
Traceback (most recent call last):
...
TypeError: Incompatible alphabets IUPACProtein() and IUPACUnambiguousDNA()

If you really wanted to do this, you'd have to first give both sequences generic alphabets:
>>> from Bio.Alphabet import generic_alphabet
>>> protein_seq.alphabet = generic_alphabet
>>> dna_seq.alphabet = generic_alphabet
>>> protein_seq + dna_seq
Seq('EVRNAKACGT', Alphabet())

Here is an example of adding a generic nucleotide sequence to an unambiguous IUPAC DNA sequence, resulting in an ambiguous nucleotide sequence:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_nucleotide
>>> from Bio.Alphabet import IUPAC
>>> nuc_seq = Seq("GATCGATGC", generic_nucleotide)
>>> dna_seq = Seq("ACGT", IUPAC.unambiguous_dna)
>>> nuc_seq
Seq('GATCGATGC', NucleotideAlphabet())
>>> dna_seq
Seq('ACGT', IUPACUnambiguousDNA())
>>> nuc_seq + dna_seq
Seq('GATCGATGCACGT', NucleotideAlphabet())

You may often have many sequences to add together, which can be done with a for loop like this:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_dna
>>> list_of_seqs = [Seq("ACGT", generic_dna), Seq("AACC", generic_dna), Seq("GGTT", generic_dna)]
>>> concatenated = Seq("", generic_dna)
>>> for s in list_of_seqs:
...     concatenated += s
...

>>> concatenated
Seq('ACGTAACCGGTT', DNAAlphabet())

Or, a more elegant approach is to use the built-in sum function with its optional start value argument (which otherwise defaults to zero):
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_dna
>>> list_of_seqs = [Seq("ACGT", generic_dna), Seq("AACC", generic_dna), Seq("GGTT", generic_dna)]
>>> sum(list_of_seqs, Seq("", generic_dna))
Seq('ACGTAACCGGTT', DNAAlphabet())

Unlike the Python string, the Biopython Seq does not (currently) have a .join method.

3.6 Changing case
Python strings have very useful upper and lower methods for changing the case. As of Biopython 1.53, the Seq object gained similar methods which are alphabet aware. For example,

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_dna
>>> dna_seq = Seq("acgtACGT", generic_dna)
>>> dna_seq
Seq('acgtACGT', DNAAlphabet())
>>> dna_seq.upper()
Seq('ACGTACGT', DNAAlphabet())
>>> dna_seq.lower()
Seq('acgtacgt', DNAAlphabet())

These are useful for doing case insensitive matching:

>>> "GTAC" in dna_seq
False
>>> "GTAC" in dna_seq.upper()
True

Note that strictly speaking the IUPAC alphabets are for upper case sequences only, thus:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> dna_seq = Seq("ACGT", IUPAC.unambiguous_dna)
>>> dna_seq
Seq('ACGT', IUPACUnambiguousDNA())
>>> dna_seq.lower()
Seq('acgt', DNAAlphabet())

3.7 Nucleotide sequences and (reverse) complements
For nucleotide sequences, you can easily obtain the complement or reverse complement of a Seq object using its built-in methods:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna)
>>> my_seq
Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPACUnambiguousDNA())
>>> my_seq.complement()
Seq('CTAGCTACCCGGATATATCCTAGCTTTTAGCG', IUPACUnambiguousDNA())
>>> my_seq.reverse_complement()
Seq('GCGATTTTCGATCCTATATAGGCCCATCGATC', IUPACUnambiguousDNA())

As mentioned earlier, an easy way to just reverse a Seq object (or a Python string) is to slice it with a -1 step:

>>> my_seq[::-1]
Seq('CGCTAAAAGCTAGGATATATCCGGGTAGCTAG', IUPACUnambiguousDNA())

In all of these operations, the alphabet property is maintained. This is very useful in case you accidentally end up trying to do something weird like take the (reverse) complement of a protein sequence:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> protein_seq = Seq("EVRNAK", IUPAC.protein)
>>> protein_seq.complement()
Traceback (most recent call last):
...
ValueError: Proteins do not have complements!

The example in Section 5.5.3 combines the Seq object's reverse complement method with Bio.SeqIO for sequence input/output.
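
To see what complementing and reverse complementing amount to, here is a plain-Python sketch for the four unambiguous DNA letters. The names plain_complement and plain_reverse_complement are invented for this illustration; unlike the Seq methods, the sketch knows nothing about alphabets or ambiguity codes and leaves any other letter unchanged.

```python
# Translation table for the four unambiguous DNA letters, in both cases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def plain_complement(dna):
    """Complement a plain DNA string letter by letter (illustration only)."""
    return dna.translate(_COMPLEMENT)


def plain_reverse_complement(dna):
    """Complement, then reverse with a -1 stride (illustration only)."""
    return plain_complement(dna)[::-1]


dna = "GATCGATGGGCCTATATAGGATCGAAAATCGC"
print(plain_complement(dna))
print(plain_reverse_complement(dna))
```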

3.8 Transcription
Before talking about transcription, I want to try to clarify the strand issue. Consider the following (made up) stretch of double stranded DNA which encodes a short peptide:


           DNA coding strand (aka Crick strand, strand +1)
5'  ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG  3'
    |||||||||||||||||||||||||||||||||||||||
3'  TACCGGTAACATTACCCGGCGACTTTCCCACGGGCTATC  5'
           DNA template strand (aka Watson strand, strand -1)

                        |
                  Transcription
                        v

5'  AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG  3'
           Single stranded messenger RNA

The actual biological transcription process works from the template strand, doing a reverse complement (TCAG -> CUGA) to give the mRNA. However, in Biopython and bioinformatics in general, we typically work directly with the coding strand because this means we can get the mRNA sequence just by switching T -> U.

Now let's actually get down to doing a transcription in Biopython. First, let's create Seq objects for the coding and template DNA strands:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna)
>>> coding_dna
Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())
>>> template_dna = coding_dna.reverse_complement()
>>> template_dna
Seq('CTATCGGGCACCCTTTCAGCGGCCCATTACAATGGCCAT', IUPACUnambiguousDNA())

These should match the figure above. Remember that by convention nucleotide sequences are normally read from the 5' to 3' direction, while in the figure the template strand is shown reversed.

Now let's transcribe the coding strand into the corresponding mRNA, using the Seq object's built-in transcribe method:

>>> coding_dna
Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())
>>> messenger_rna = coding_dna.transcribe()
>>> messenger_rna
Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA())

As you can see, all this does is switch T -> U, and adjust the alphabet.

If you do want to do a true biological transcription starting with the template strand, then this becomes a two-step process:
>>> template_dna.reverse_complement().transcribe()
Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA())

The Seq object also includes a back-transcription method for going from the mRNA to the coding strand of the DNA. Again, this is a simple U -> T substitution and associated change of alphabet:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> messenger_rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG", IUPAC.unambiguous_rna)
>>> messenger_rna
Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA())
>>> messenger_rna.back_transcribe()
Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())

Note: The Seq object's transcribe and back_transcribe methods were added in Biopython 1.49. For older releases you would have to use the Bio.Seq module's functions instead, see Section 3.14.
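
Since transcription here is just a letter substitution, the whole section can be mimicked on plain strings in a few lines. This is only a sketch to show the logic; the function names below are made up for the example, and no alphabet checking is done.

```python
def plain_transcribe(coding_dna):
    """Coding strand DNA to mRNA: switch T -> U (illustration only)."""
    return coding_dna.replace("T", "U").replace("t", "u")


def plain_back_transcribe(rna):
    """mRNA to coding strand DNA: switch U -> T (illustration only)."""
    return rna.replace("U", "T").replace("u", "t")


def transcribe_from_template(template_dna):
    """'True' transcription: reverse complement the template, then T -> U."""
    complement = template_dna.translate(str.maketrans("ACGTacgt", "TGCAtgca"))
    return plain_transcribe(complement[::-1])


coding = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
template = "CTATCGGGCACCCTTTCAGCGGCCCATTACAATGGCCAT"
print(plain_transcribe(coding))
# Both routes give the same messenger RNA.
print(transcribe_from_template(template) == plain_transcribe(coding))
```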

3.9 Translation
Sticking with the same example discussed in the transcription section above, now let's translate this mRNA into the corresponding protein sequence, again taking advantage of one of the Seq object's biological methods:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> messenger_rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG", IUPAC.unambiguous_rna)
>>> messenger_rna
Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA())
>>> messenger_rna.translate()
Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))

You can also translate directly from the coding strand DNA sequence:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", IUPAC.unambiguous_dna)
>>> coding_dna
Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())
>>> coding_dna.translate()
Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))

You should notice in the above protein sequences that in addition to the end stop character, there is an internal stop as well. This was a deliberate choice of example, as it gives an excuse to talk about some optional arguments, including different translation tables (Genetic Codes).

The translation tables available in Biopython are based on those from the NCBI (see the next section of this tutorial). By default, translation will use the standard genetic code (NCBI table id 1). Suppose we are dealing with a mitochondrial sequence. We need to tell the translation function to use the relevant genetic code instead:
>>> coding_dna.translate(table="Vertebrate Mitochondrial")
Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*'))

You can also specify the table using the NCBI table number which is shorter, and often included in the feature annotation of GenBank files:
>>> coding_dna.translate(table=2)
Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*'))

Now, you may want to translate the nucleotides up to the first in frame stop codon, and then stop (as happens in nature):

>>> coding_dna.translate()
Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))
>>> coding_dna.translate(to_stop=True)
Seq('MAIVMGR', IUPACProtein())
>>> coding_dna.translate(table=2)
Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*'))
>>> coding_dna.translate(table=2, to_stop=True)
Seq('MAIVMGRWKGAR', IUPACProtein())

Notice that when you use the to_stop argument, the stop codon itself is not translated and the stop symbol is not included at the end of your protein sequence.

You can even specify the stop symbol if you don't like the default asterisk:

>>> coding_dna.translate(table=2, stop_symbol="@")
Seq('MAIVMGRWKGAR@', HasStopCodon(IUPACProtein(), '@'))
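
To show what the table, to_stop and stop_symbol options do, here is a simplified plain-Python sketch of codon-by-codon translation. It is not how Biopython implements translate; simple_translate, STANDARD and VERT_MITO are names invented for this example, the tables are built by hand from the standard genetic code (with only the vertebrate mitochondrial differences overridden), and only upper case unambiguous DNA is handled.

```python
from itertools import product

# Standard genetic code in the usual T, C, A, G ordering; "*" marks a stop.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
STANDARD = {"".join(codon): aa
            for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
# Vertebrate mitochondrial code: only these four codons differ.
VERT_MITO = dict(STANDARD, TGA="W", ATA="M", AGA="*", AGG="*")


def simple_translate(dna, table=STANDARD, to_stop=False, stop_symbol="*"):
    """Translate upper case DNA codon by codon (illustration only)."""
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        amino_acid = table[dna[i:i + 3]]
        if amino_acid == "*":
            if to_stop:
                break  # stop before the first in-frame stop codon
            amino_acid = stop_symbol
        protein.append(amino_acid)
    return "".join(protein)


coding = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
print(simple_translate(coding))
print(simple_translate(coding, to_stop=True))
print(simple_translate(coding, table=VERT_MITO, stop_symbol="@"))
```

Keeping the mitochondrial table as a small set of overrides on top of the standard one makes the differences between the two codes easy to see at a glance.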

Now, suppose you have a complete coding sequence (CDS), which is to say a nucleotide sequence (e.g. mRNA, after any splicing) which is a whole number of codons (i.e. the length is a multiple of three), commences with a start codon, ends with a stop codon, and has no internal in-frame stop codons. In general, given a complete CDS, the default translate method will do what you want (perhaps with the to_stop option). However, what if your sequence uses a non-standard start codon? This happens a lot in bacteria, for example the gene yaaX in E. coli K12:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_dna
>>> gene = Seq("GTGAAAAAGATGCAATCTATCGTACTCGCACTTTCCCTGGTTCTGGTCGCTCCCATGGCA" + \
...            "GCACAGGCTGCGGAAATTACGTTAGTCCCGTCAGTAAAATTACAGATAGGCGATCGTGAT" + \
...            "AATCGTGGCTATTACTGGGATGGAGGTCACTGGCGCGACCACGGCTGGTGGAAACAACAT" + \
...            "TATGAATGGCGAGGCAATCGCTGGCACCTACACGGACCGCCGCCACCGCCGCGCCACCAT" + \
...            "AAGAAAGCTCCTCATGATCATCACGGCGGTCATGGTCCAGGCAAACATCACCGCTAA",
...            generic_dna)
>>> gene.translate(table="Bacterial")
Seq('VKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDH...HR*',
HasStopCodon(ExtendedIUPACProtein(), '*'))
>>> gene.translate(table="Bacterial", to_stop=True)
Seq('VKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDH...HHR',
ExtendedIUPACProtein())

In the bacterial genetic code GTG is a valid start codon, and while it does normally encode Valine, if used as a start codon it should be translated as methionine. This happens if you tell Biopython your sequence is a complete CDS:
>>> gene.translate(table="Bacterial", cds=True)
Seq('MKKMQSIVLALSLVLVAPMAAQAAEITLVPSVKLQIGDRDNRGYYWDGGHWRDH...HHR',
ExtendedIUPACProtein())

In addition to telling Biopython to translate an alternative start codon as methionine, using this option also makes sure your sequence really is a valid CDS (you'll get an exception if not).
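
The four conditions that define a complete CDS can be written down directly. The sketch below is a plain-Python illustration of those checks and not the code behind the cds option; check_cds is a hypothetical helper, and the start and stop codon sets passed in the example are an illustrative subset chosen for this demonstration rather than a full genetic code.

```python
def check_cds(dna, start_codons, stop_codons):
    """Return the reasons why dna is not a complete CDS (empty list if it is).

    Hypothetical helper for illustration only.
    """
    problems = []
    if len(dna) % 3 != 0:
        problems.append("length is not a multiple of three")
    codons = [dna[i:i + 3] for i in range(0, len(dna) - len(dna) % 3, 3)]
    if not codons or codons[0] not in start_codons:
        problems.append("does not begin with a start codon")
    if not codons or codons[-1] not in stop_codons:
        problems.append("does not end with a stop codon")
    if any(codon in stop_codons for codon in codons[:-1]):
        problems.append("contains an internal in-frame stop codon")
    return problems


starts = {"ATG", "GTG"}          # illustrative subset of start codons
stops = {"TAA", "TAG", "TGA"}
print(check_cds("GTGAAATAA", starts, stops))      # a tiny made-up CDS
print(check_cds("ATGTAAAAATAA", starts, stops))   # has an internal stop
```

Returning a list of problems rather than raising on the first failure is a design choice for the example; Biopython itself reports an invalid CDS with an exception, as described above.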

The example in Section 20.1.3 combines the Seq object's translate method with Bio.SeqIO for sequence input/output.

3.10 Translation Tables
In the previous sections we talked about the Seq object translation method (and mentioned the equivalent function in the Bio.Seq module, see Section 3.14). Internally these use codon table objects derived from the NCBI information at ftp://ftp.ncbi.nlm.nih.gov/entrez/misc/data/gc.prt, also shown on http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi in a much more readable layout.

As before, let's just focus on two choices: the Standard translation table, and the translation table for Vertebrate Mitochondrial DNA.
>>> from Bio.Data import CodonTable
>>> standard_table = CodonTable.unambiguous_dna_by_name["Standard"]
>>> mito_table = CodonTable.unambiguous_dna_by_name["Vertebrate Mitochondrial"]

Alternatively, these tables are labeled with ID numbers 1 and 2, respectively:

>>> from Bio.Data import CodonTable
>>> standard_table = CodonTable.unambiguous_dna_by_id[1]
>>> mito_table = CodonTable.unambiguous_dna_by_id[2]

You can compare the actual tables visually by printing them:

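>>> print(standard_table)
Table 1 Standard, SGC0

  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA Stop| A
T | TTG L(s)| TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L(s)| CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I   | ACT T   | AAT N   | AGT S   | T
A | ATC I   | ACC T   | AAC N   | AGC S   | C
A | ATA I   | ACA T   | AAA K   | AGA R   | A
A | ATG M(s)| ACG T   | AAG K   | AGG R   | G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V   | GCG A   | GAG E   | GGG G   | G
--+---------+---------+---------+---------+--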

and:

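>>> print(mito_table)
Table 2 Vertebrate Mitochondrial, SGC1

  |  T      |  C      |  A      |  G      |
--+---------+---------+---------+---------+--
T | TTT F   | TCT S   | TAT Y   | TGT C   | T
T | TTC F   | TCC S   | TAC Y   | TGC C   | C
T | TTA L   | TCA S   | TAA Stop| TGA W   | A
T | TTG L   | TCG S   | TAG Stop| TGG W   | G
--+---------+---------+---------+---------+--
C | CTT L   | CCT P   | CAT H   | CGT R   | T
C | CTC L   | CCC P   | CAC H   | CGC R   | C
C | CTA L   | CCA P   | CAA Q   | CGA R   | A
C | CTG L   | CCG P   | CAG Q   | CGG R   | G
--+---------+---------+---------+---------+--
A | ATT I(s)| ACT T   | AAT N   | AGT S   | T
A | ATC I(s)| ACC T   | AAC N   | AGC S   | C
A | ATA M(s)| ACA T   | AAA K   | AGA Stop| A
A | ATG M(s)| ACG T   | AAG K   | AGG Stop| G
--+---------+---------+---------+---------+--
G | GTT V   | GCT A   | GAT D   | GGT G   | T
G | GTC V   | GCC A   | GAC D   | GGC G   | C
G | GTA V   | GCA A   | GAA E   | GGA G   | A
G | GTG V(s)| GCG A   | GAG E   | GGG G   | G
--+---------+---------+---------+---------+--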

You may find the following properties useful, for example if you are trying to do your own gene finding:

>>> mito_table.stop_codons
['TAA', 'TAG', 'AGA', 'AGG']
>>> mito_table.start_codons
['ATT', 'ATC', 'ATA', 'ATG', 'GTG']
>>> mito_table.forward_table["ACG"]
'T'
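
As a hint of how start and stop codon lists might feed into very naive gene finding, here is a sketch that scans the three forward reading frames of a plain string for stretches running from a start codon to the next in-frame stop codon. The function find_orfs is hypothetical and deliberately simplistic (forward strand only, no minimum length); the codon lists are copied from the mito_table output above.

```python
def find_orfs(dna, start_codons, stop_codons):
    """Return (start, end) index pairs of simple ORFs on the forward strand.

    Hypothetical helper for illustration only. Each ORF runs from the first
    start codon seen in a frame up to and including the next in-frame stop.
    """
    orfs = []
    for frame in range(3):
        orf_start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if orf_start is None and codon in start_codons:
                orf_start = i
            elif orf_start is not None and codon in stop_codons:
                orfs.append((orf_start, i + 3))
                orf_start = None
    return orfs


mito_starts = ["ATT", "ATC", "ATA", "ATG", "GTG"]
mito_stops = ["TAA", "TAG", "AGA", "AGG"]
dna = "CCATGAAATAGGG"  # made-up test sequence
for start, end in find_orfs(dna, mito_starts, mito_stops):
    print(start, end, dna[start:end])
```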

3.11 Comparing Seq objects
Sequence comparison is actually a very complicated topic, and there is no easy way to decide if two sequences are equal. The basic problem is that the meaning of the letters in a sequence is context dependent: the letter "A" could be part of a DNA, RNA or protein sequence. Biopython uses alphabet objects as part of each Seq object to try to capture this information, so comparing two Seq objects could mean considering both the sequence strings and the alphabets.

For example, you might argue that the two DNA Seq objects Seq("ACGT", IUPAC.unambiguous_dna) and Seq("ACGT", IUPAC.ambiguous_dna) should be equal, even though they do have different alphabets. Depending on the context this could be important.

This gets worse. Suppose you think Seq("ACGT", IUPAC.unambiguous_dna) and Seq("ACGT") (i.e. the default generic alphabet) should be equal. Then, logically, Seq("ACGT", IUPAC.protein) and Seq("ACGT") should also be equal. Now, in logic if A=B and B=C, by transitivity we expect A=C. So for logical consistency we'd require Seq("ACGT", IUPAC.unambiguous_dna) and Seq("ACGT", IUPAC.protein) to be equal, which most people would agree is just not right. This transitivity also has implications for using Seq objects as Python dictionary keys.

Now, in everyday use, your sequences will probably all have the same alphabet, or at least all be the same type of sequence (all DNA, all RNA, or all protein). What you probably want is to just compare the sequences as strings, which you can do explicitly:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> seq1 = Seq("ACGT", IUPAC.unambiguous_dna)
>>> seq2 = Seq("ACGT", IUPAC.ambiguous_dna)
>>> str(seq1) == str(seq2)
True
>>> str(seq1) == str(seq1)
True

So, what does Biopython do? Well, as of Biopython 1.65, sequence comparison only looks at the sequence, essentially ignoring the alphabet:

>>> seq1 == seq2
True
>>> seq1 == "ACGT"
True

As an extension to this, using sequence objects as keys in a Python dictionary is now equivalent to using the sequence as a plain string for the key. See also Section 3.4.

Note if you compare sequences with incompatible alphabets (e.g. DNA vs RNA, or nucleotide versus protein), then you will get a warning but for the comparison itself only the string of letters in the sequence is used:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_dna, generic_protein
>>> dna_seq = Seq("ACGT", generic_dna)
>>> prot_seq = Seq("ACGT", generic_protein)

>>> dna_seq == prot_seq
BiopythonWarning: Incompatible alphabets DNAAlphabet() and ProteinAlphabet()
True

WARNING: Older versions of Biopython instead used to check if the Seq objects were the same object in memory. This is important if you need to support scripts on both old and new versions of Biopython. In that case, make the comparison explicit by wrapping your sequence objects with either str(...) for string based comparison or id(...) for object instance based comparison.

3.12 MutableSeq objects
Just like the normal Python string, the Seq object is "read only", or in Python terminology, immutable. Apart from wanting the Seq object to act like a string, this is also a useful default since in many biological applications you want to ensure you are not changing your sequence data:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA", IUPAC.unambiguous_dna)

Observe what happens if you try to edit the sequence:

>>> my_seq[5] = "G"
Traceback (most recent call last):
...
TypeError: 'Seq' object does not support item assignment

However, you can convert it into a mutable sequence (a MutableSeq object) and do pretty much anything you want with it:

>>> mutable_seq = my_seq.tomutable()
>>> mutable_seq
MutableSeq('GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA())

Alternatively, you can create a MutableSeq object directly from a string:
>>> from Bio.Seq import MutableSeq
>>> from Bio.Alphabet import IUPAC
>>> mutable_seq = MutableSeq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA", IUPAC.unambiguous_dna)

Either way will give you a sequence object which can be changed:

>>> mutable_seq
MutableSeq('GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA())
>>> mutable_seq[5] = "C"
>>> mutable_seq
MutableSeq('GCCATCGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA())
>>> mutable_seq.remove("T")
>>> mutable_seq
MutableSeq('GCCACGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA())
>>> mutable_seq.reverse()
>>> mutable_seq
MutableSeq('AGCCCGTGGGAAAGTCGCCGGGTAATGCACCG', IUPACUnambiguousDNA())

Do note that unlike the Seq object, the MutableSeq object's methods like reverse_complement() and reverse() act in-situ!

An important technical difference between mutable and immutable objects in Python means that you can't use a MutableSeq object as a dictionary key, but you can use a Python string or a Seq object in this way.

Once you have finished editing your MutableSeq object, it's easy to get back to a read-only Seq object should you need to:

>>> new_seq = mutable_seq.toseq()
>>> new_seq
Seq('AGCCCGTGGGAAAGTCGCCGGGTAATGCACCG', IUPACUnambiguousDNA())

You can also get a string from a MutableSeq object just like from a Seq object (Section 3.4).
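
If it helps to relate this to core Python, the same pattern (convert to something editable, change it in place, convert back) can be sketched with a plain string and a list of letters. This only mirrors the steps shown above; it is not how MutableSeq is implemented.

```python
original = "GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA"

letters = list(original)   # a list of single letters can be edited in place
letters[5] = "C"           # point mutation at index 5
letters.remove("T")        # delete the first "T"
letters.reverse()          # reverse in place, nothing is returned
edited = "".join(letters)  # back to an immutable string

print(original)            # the original string is untouched
print(edited)
```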

3.13 UnknownSeq objects
The UnknownSeq object is a subclass of the basic Seq object and its purpose is to represent a sequence where we know the length, but not the actual letters making it up. You could of course use a normal Seq object in this situation, but it wastes rather a lot of memory to hold a string of a million "N" characters when you could just store a single letter "N" and the desired length as an integer.

>>> from Bio.Seq import UnknownSeq
>>> unk = UnknownSeq(20)
>>> unk
UnknownSeq(20, alphabet=Alphabet(), character='?')
>>> print(unk)
????????????????????
>>> len(unk)
20

You can of course specify an alphabet, meaning for nucleotide sequences the letter defaults to "N" and for proteins "X", rather than just "?".

>>> from Bio.Seq import UnknownSeq
>>> from Bio.Alphabet import IUPAC
>>> unk_dna = UnknownSeq(20, alphabet=IUPAC.ambiguous_dna)
>>> unk_dna
UnknownSeq(20, alphabet=IUPACAmbiguousDNA(), character='N')
>>> print(unk_dna)
NNNNNNNNNNNNNNNNNNNN

You can use all the usual Seq object methods too. Note that these give back memory saving UnknownSeq objects where appropriate, as you might expect:

>>> unk_dna
UnknownSeq(20, alphabet=IUPACAmbiguousDNA(), character='N')
>>> unk_dna.complement()
UnknownSeq(20, alphabet=IUPACAmbiguousDNA(), character='N')
>>> unk_dna.reverse_complement()
UnknownSeq(20, alphabet=IUPACAmbiguousDNA(), character='N')
>>> unk_dna.transcribe()
UnknownSeq(20, alphabet=IUPACAmbiguousRNA(), character='N')
>>> unk_protein = unk_dna.translate()
>>> unk_protein
UnknownSeq(6, alphabet=ProteinAlphabet(), character='X')
>>> print(unk_protein)
XXXXXX
>>> len(unk_protein)
6

You may be able to find a use for the UnknownSeq object in your own code, but it is more likely that you will first come across them in a SeqRecord object created by Bio.SeqIO (see Chapter 5). Some sequence file formats don't always include the actual sequence, for example GenBank and EMBL files may include a list of features but for the sequence just present the contig information. Alternatively, the QUAL files used in sequencing work hold quality scores but they never contain a sequence; instead there is a partner FASTA file which does have the sequence.
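
The memory saving idea behind UnknownSeq (store one character and a length instead of a long string) can be illustrated with a tiny stand-alone class. UnknownRun below is a hypothetical toy written for this example; it is not Biopython's class and supports only length, string conversion and indexing.

```python
class UnknownRun:
    """A run of `length` copies of one placeholder character (toy example)."""

    def __init__(self, length, character="?"):
        if length < 0:
            raise ValueError("length must not be negative")
        if len(character) != 1:
            raise ValueError("character must be a single letter")
        self._length = length
        self._character = character

    def __len__(self):
        return self._length

    def __str__(self):
        # The full string is only built when somebody asks for it.
        return self._character * self._length

    def __getitem__(self, index):
        if isinstance(index, slice):
            # A slice of an unknown run is just a shorter unknown run.
            new_length = len(range(*index.indices(self._length)))
            return UnknownRun(new_length, self._character)
        if not -self._length <= index < self._length:
            raise IndexError("index out of range")
        return self._character


unk = UnknownRun(1000000, "N")
print(len(unk), len(unk[10:20]), str(unk[:5]))
```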

3.14 Working with strings directly
To close this chapter, for those of you who really don't want to use the sequence objects (or who prefer a functional programming style to an object orientated one), there are module level functions in Bio.Seq which will accept plain Python strings, Seq objects (including UnknownSeq objects) or MutableSeq objects:

>>> from Bio.Seq import reverse_complement, transcribe, back_transcribe, translate
>>> my_string = "GCTGTTATGGGTCGTTGGAAGGGTGGTCGTGCTGCTGGTTAG"
>>> reverse_complement(my_string)
'CTAACCAGCAGCACGACCACCCTTCCAACGACCCATAACAGC'
>>> transcribe(my_string)
'GCUGUUAUGGGUCGUUGGAAGGGUGGUCGUGCUGCUGGUUAG'
>>> back_transcribe(my_string)
'GCTGTTATGGGTCGTTGGAAGGGTGGTCGTGCTGCTGGTTAG'
>>> translate(my_string)
'AVMGRWKGGRAAG*'

You are, however, encouraged to work with Seq objects by default.

Chapter 4 Sequence annotation objects
Chapter 3 introduced the sequence classes. Immediately "above" the Seq class is the Sequence Record or SeqRecord class, defined in the Bio.SeqRecord module. This class allows higher level features such as identifiers and features (as SeqFeature objects) to be associated with the sequence, and is used throughout the sequence input/output interface Bio.SeqIO described fully in Chapter 5.

If you are only going to be working with simple data like FASTA files, you can probably skip this chapter for now. If on the other hand you are going to be using richly annotated sequence data, say from GenBank or EMBL files, this information is quite important.

While this chapter should cover most things to do with the SeqRecord and SeqFeature objects, you may also want to read the SeqRecord wiki page (http://biopython.org/wiki/SeqRecord), and the built-in documentation (also online, SeqRecord and SeqFeature):
>>> from Bio.SeqRecord import SeqRecord
>>> help(SeqRecord)
...

4.1 The SeqRecord object
The SeqRecord (Sequence Record) class is defined in the Bio.SeqRecord module. This class allows higher level features such as identifiers and features to be associated with a sequence (see Chapter 3), and is the basic data type for the Bio.SeqIO sequence input/output interface (see Chapter 5).

The SeqRecord class itself is quite simple, and offers the following information as attributes:

.seq
The sequence itself, typically a Seq object.
.id
The primary ID used to identify the sequence, a string. In most cases this is something like an accession number.
.name
A "common" name/id for the sequence, a string. In some cases this will be the same as the accession number, but it could also be a clone name. I think of this as being analogous to the LOCUS id in a GenBank record.
.description
A human readable description or expressive name for the sequence, a string.
.letter_annotations
Holds per-letter-annotations using a (restricted) dictionary of additional information about the letters in the sequence. The keys are the name of the information, and the information is contained in the value as a Python sequence (i.e. a list, tuple or string) with the same length as the sequence itself. This is often used for quality scores (e.g. Section 20.1.6) or secondary structure information (e.g. from Stockholm/PFAM alignment files).
.annotations
A dictionary of additional information about the sequence. The keys are the name of the information, and the information is contained in the value. This allows the addition of more "unstructured" information to the sequence.
.features
A list of SeqFeature objects with more structured information about the features on a sequence (e.g. position of genes on a genome, or domains on a protein sequence). The structure of sequence features is described below in Section 4.3.

.dbxrefs
A list of database cross-references as strings.

4.2CreatingaSeqRecord
UsingaSeqRecordobjectisnotverycomplicated,sincealloftheinformationispresentedasattributesoftheclass.UsuallyyouwontcreateaSeqRecordbyhand,but
insteaduseBio.SeqIOtoreadinasequencefileforyou(seeChapter5andtheexamplesbelow).However,creatingSeqRecordcanbequitesimple.

4.2.1 SeqRecord objects from scratch
To create a SeqRecord at a minimum you just need a Seq object:

>>> from Bio.Seq import Seq
>>> simple_seq = Seq("GATC")
>>> from Bio.SeqRecord import SeqRecord
>>> simple_seq_r = SeqRecord(simple_seq)

You can also pass the id, name and description to the initialization function. If you don't, they are set to placeholder strings indicating that they are unknown, and can be modified subsequently:
>>> simple_seq_r.id
'<unknown id>'
>>> simple_seq_r.id = "AC12345"
>>> simple_seq_r.description = "Made up sequence I wish I could write a paper about"
>>> print(simple_seq_r.description)
Made up sequence I wish I could write a paper about
>>> simple_seq_r.seq
Seq('GATC', Alphabet())

Including an identifier is very important if you want to output your SeqRecord to a file. You would normally include this when creating the object:

>>> from Bio.Seq import Seq
>>> simple_seq = Seq("GATC")
>>> from Bio.SeqRecord import SeqRecord
>>> simple_seq_r = SeqRecord(simple_seq, id="AC12345")

As mentioned above, the SeqRecord has a dictionary attribute annotations. This is used for any miscellaneous annotations that don't fit under one of the other more specific attributes. Adding annotations is easy, and just involves dealing directly with the annotation dictionary:
>>> simple_seq_r.annotations["evidence"] = "None. I just made it up."
>>> print(simple_seq_r.annotations)
{'evidence': 'None. I just made it up.'}
>>> print(simple_seq_r.annotations["evidence"])
None. I just made it up.

Working with per-letter annotations is similar: letter_annotations is a dictionary-like attribute which will let you assign any Python sequence (i.e. a string, list or tuple) which has the same length as the sequence:

>>> simple_seq_r.letter_annotations["phred_quality"] = [40, 40, 38, 30]
>>> print(simple_seq_r.letter_annotations)
{'phred_quality': [40, 40, 38, 30]}
>>> print(simple_seq_r.letter_annotations["phred_quality"])
[40, 40, 38, 30]
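
The "restricted" nature of letter_annotations can be pictured with a small plain-Python sketch: a dictionary that refuses values whose length differs from the sequence length. The class name LengthCheckedDict and the choice of ValueError are assumptions made for illustration; Biopython's own restricted dictionary may behave differently in detail.

```python
class LengthCheckedDict(dict):
    """Hypothetical dict that only accepts values of one fixed length."""

    def __init__(self, length):
        super().__init__()
        self._length = length

    def __setitem__(self, key, value):
        # Reject anything whose length differs from the sequence length.
        if len(value) != self._length:
            raise ValueError(
                "expected %d items, got %d" % (self._length, len(value))
            )
        super().__setitem__(key, value)


sequence = "GATC"
per_letter = LengthCheckedDict(len(sequence))
per_letter["phred_quality"] = [40, 40, 38, 30]  # accepted: four values

try:
    per_letter["phred_quality"] = [40, 40, 38]  # rejected: only three values
except ValueError as err:
    print("rejected:", err)
```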

The dbxrefs and features attributes are just Python lists, and should be used to store strings and SeqFeature objects (discussed later in this chapter) respectively.

4.2.2 SeqRecord objects from FASTA files

This example uses a fairly large FASTA file containing the whole sequence for Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, originally downloaded from the NCBI. This file is included with the Biopython unit tests under the GenBank folder, or online as NC_005816.fna from our website.

The file starts like this, and you can check there is only one record present (i.e. only one line starting with a greater than symbol):
>gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus ... pPCP1, complete sequence
TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGGGGGTAATCTGCTCTCC
...

Back in Chapter 2 you will have seen the function Bio.SeqIO.parse(...) used to loop over all the records in a file as SeqRecord objects. The Bio.SeqIO module has a sister function for use on files which contain just one record which we'll use here (see Chapter 5 for details):
>>> from Bio import SeqIO
>>> record = SeqIO.read("NC_005816.fna", "fasta")
>>> record
SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG',
SingleLetterAlphabet()), id='gi|45478711|ref|NC_005816.1|', name='gi|45478711|ref|NC_005816.1|',
description='gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus ... sequence',
dbxrefs=[])

Now, let's have a look at the key attributes of this SeqRecord individually, starting with the seq attribute which gives you a Seq object:

>>> record.seq
Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG', SingleLetterAlphabet())

Here Bio.SeqIO has defaulted to a generic alphabet, rather than guessing that this is DNA. If you know in advance what kind of sequence your FASTA file contains, you can tell Bio.SeqIO which alphabet to use (see Chapter 5).

Next, the identifiers and description:

>>> record.id
'gi|45478711|ref|NC_005816.1|'
>>> record.name
'gi|45478711|ref|NC_005816.1|'
>>> record.description
'gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus ... pPCP1, complete sequence'

As you can see above, the first word of the FASTA record's title line (after removing the greater than symbol) is used for both the id and name attributes. The whole title line (after removing the greater than symbol) is used for the record description. This is deliberate, partly for backwards compatibility reasons, but it also makes sense if you have a FASTA file like this:

>Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1
TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGGGGGTAATCTGCTCTCC
...

Note that none of the other annotation attributes get populated when reading a FASTA file:
>>> record.dbxrefs
[]
>>> record.annotations
{}
>>> record.letter_annotations
{}
>>> record.features
[]

In this case our example FASTA file was from the NCBI, and they have a fairly well defined set of conventions for formatting their FASTA lines. This means it would be possible to parse this information and extract the GI number and accession for example. However, FASTA files from other sources vary, so this isn't possible in general.
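
As a rough illustration of both points (the first-word rule and the NCBI conventions), the following plain-Python sketch splits the example title line and pairs up the pipe-separated fields. The helper names split_title and parse_ncbi_identifier are hypothetical, and the assumption that fields alternate between a database tag and a value holds for this example title only, not for FASTA files in general.

```python
def split_title(title):
    """Split a FASTA title line (without '>') into (identifier, description)."""
    identifier = title.split(None, 1)[0]  # first word only
    return identifier, title              # whole line is the description


def parse_ncbi_identifier(identifier):
    """Pair up the 'tag|value' fields of an NCBI-style identifier."""
    fields = identifier.strip("|").split("|")
    return dict(zip(fields[0::2], fields[1::2]))


title = ("gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus "
         "... pPCP1, complete sequence")
identifier, description = split_title(title)
ids = parse_ncbi_identifier(identifier)
print(identifier)
print(ids["gi"], ids["ref"])
```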

4.2.3 SeqRecord objects from GenBank files
As in the previous example, we're going to look at the whole sequence for Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, originally downloaded from the NCBI, but this time as a GenBank file. Again, this file is included with the Biopython unit tests under the GenBank folder, or online as NC_005816.gb from our website.

This file contains a single record (i.e. only one LOCUS line) and starts:

LOCUS       NC_005816               9609 bp    DNA     circular BCT 21-JUL-2008
DEFINITION  Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete
            sequence.
ACCESSION   NC_005816
VERSION     NC_005816.1  GI:45478711
PROJECT     GenomeProject:10638
...

Again, we'll use Bio.SeqIO to read this file in, and the code is almost identical to that used above for the FASTA file (see Chapter 5 for details):

>>> from Bio import SeqIO
>>> record = SeqIO.read("NC_005816.gb", "genbank")
>>> record
SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG',
IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816',
description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence.',
dbxrefs=['Project:10638'])

You should be able to spot some differences already! Taking the attributes individually, the sequence string is the same as before, but this time Bio.SeqIO has been able to automatically assign a more specific alphabet (see Chapter 5 for details):

>>> record.seq
Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG', IUPACAmbiguousDNA())

The name comes from the LOCUS line, while the id includes the version suffix. The description comes from the DEFINITION line:

>>> record.id
'NC_005816.1'
>>> record.name
'NC_005816'
>>> record.description
'Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence.'

GenBank files don't have any per-letter annotations:

>>> record.letter_annotations
{}

Most of the annotations information gets recorded in the annotations dictionary, for example:

>>> len(record.annotations)
11
>>> record.annotations["source"]
'Yersinia pestis biovar Microtus str. 91001'

The dbxrefs list gets populated from any PROJECT or DBLINK lines:

>>> record.dbxrefs
['Project:10638']

Finally, and perhaps most interestingly, all the entries in the features table (e.g. the genes or CDS features) get recorded as SeqFeature objects in the features list.
>>> len(record.features)
41

We'll talk about SeqFeature objects next, in Section 4.3.

4.3 Feature, location and position objects
4.3.1 SeqFeature objects

Sequence features are an essential part of describing a sequence. Once you get beyond the sequence itself, you need some way to organize and easily get at the more "abstract" information that is known about the sequence. While it is probably impossible to develop a general sequence feature class that will cover everything, the Biopython SeqFeature class attempts to encapsulate as much of the information about the sequence as possible. The design is heavily based on the GenBank/EMBL feature tables, so if you understand how they look, you'll probably have an easier time grasping the structure of the Biopython classes.

The key idea about each SeqFeature object is to describe a region on a parent sequence, typically a SeqRecord object. That region is described with a location object, typically a range between two positions (see Section 4.3.2 below).

The SeqFeature class has a number of attributes, so first we'll list them and their general features, and then later in the chapter work through examples to show how this applies to a real life example. The attributes of a SeqFeature are:

.type
This is a textual description of the type of feature (for instance, this will be something like "CDS" or "gene").
.location
The location of the SeqFeature on the sequence that you are dealing with, see Section 4.3.2 below. The SeqFeature delegates much of its functionality to the location object, and includes a number of shortcut attributes for properties of the location:

.ref
shorthand for .location.ref: any (different) reference sequence the location is referring to. Usually just None.
.ref_db
shorthand for .location.ref_db: specifies the database any identifier in .ref refers to. Usually just None.
.strand
shorthand for .location.strand: the strand on the sequence that the feature is located on. For double stranded nucleotide sequence this may either be 1 for the top strand, -1 for the bottom strand, 0 if the strand is important but is unknown, or None if it doesn't matter. This is None for proteins, or single stranded sequences.

.qualifiers
This is a Python dictionary of additional information about the feature. The key is some kind of terse one-word description of what the information contained in the value is about, and the value is the actual information. For example, a common key for a qualifier might be "evidence" and the value might be "computational (non-experimental)". This is just a way to let the person who is looking at the feature know that it has not been experimentally (i.e. in a wet lab) confirmed. Note that the value will be a list of strings (even when there is only one string). This is a reflection of the feature tables in GenBank/EMBL files.
.sub_features
This used to be used to represent features with complicated locations like "joins" in GenBank/EMBL files. This has been deprecated with the introduction of the CompoundLocation object, and should now be ignored.

4.3.2 Positions and locations

The key idea about each SeqFeature object is to describe a region on a parent sequence, for which we use a location object, typically describing a range between two positions. To try to clarify the terminology we're using:

position
This refers to a single position on a sequence, which may be fuzzy or not. For instance, 5, 20, <100 and >200 are all positions.
location
A location is a region of sequence bounded by some positions. For instance 5..20 (i.e. 5 to 20) is a location.

I just mention this because sometimes I get confused between the two.
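
Since a location written as 5..20 in GenBank/EMBL notation uses 1-based inclusive counting while Python slices are 0-based and exclude the end, a small conversion sketch may help keep the two apart. The helper names genbank_to_python and python_to_genbank are hypothetical, and the 24-letter sequence is made up for the demonstration.

```python
def genbank_to_python(start, end):
    """Convert a 1-based inclusive range to a 0-based half-open slice."""
    return start - 1, end


def python_to_genbank(start, end):
    """Convert a 0-based half-open slice back to a 1-based inclusive range."""
    return start + 1, end


sequence = "ACGTACGTACGTACGTACGTACGT"      # made-up 24 letter sequence
start, end = genbank_to_python(5, 20)       # the location 5..20 from the text
region = sequence[start:end]
print(start, end, len(region))
```

Only the start coordinate changes: the inclusive 1-based end and the exclusive 0-based end are the same number.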

4.3.2.1 FeatureLocation object

Unless you work with eukaryotic genes, most SeqFeature locations are extremely simple: you just need start and end coordinates and a strand. That's essentially all the basic FeatureLocation object does.

In practice of course, things can be more complicated. First of all we have to handle compound locations made up of several regions. Secondly, the positions themselves may be fuzzy (inexact).

4.3.2.2 CompoundLocation object

Biopython 1.62 introduced the CompoundLocation as part of a restructuring of how complex locations made up of multiple regions are represented. The main usage is for handling "join" locations in EMBL/GenBank files.

4.3.2.3 Fuzzy Positions

So far we've only used simple positions. One complication in dealing with feature locations comes in the positions themselves. In biology many times things aren't entirely certain (as much as us wet lab biologists try to make them certain!). For instance, you might do a dinucleotide priming experiment and discover that the start of an mRNA transcript is at one of two sites. This is very useful information, but the complication comes in how to represent this as a position. To help us deal with this, we have the concept of fuzzy positions. Basically there are several types of fuzzy positions, so we have six classes to deal with them:

ExactPosition
As its name suggests, this class represents a position which is specified as exact along the sequence. This is represented as just a number, and you can get the position by looking at the position attribute of the object.
BeforePosition
This class represents a fuzzy position that occurs prior to some specified site. In GenBank/EMBL notation, this is represented as something like '<13', signifying that the real position is located somewhere less than 13. To get the specified upper boundary, look at the position attribute of the object.
AfterPosition
Contrary to BeforePosition, this class represents a position that occurs after some specified site. This is represented in GenBank as '>13', and like BeforePosition, you get the boundary number by looking at the position attribute of the object.
WithinPosition
Occasionally used for GenBank/EMBL locations, this class models a position which occurs somewhere between two specified nucleotides. In GenBank/EMBL notation, this would be represented as '(1.5)', to represent that the position is somewhere within the range 1 to 5. To get the information in this class you have to look at two attributes. The position attribute specifies the lower boundary of the range we are looking at, so in our example case this would be one. The extension attribute specifies the range to the higher boundary, so in this case it would be 4. So object.position is the lower boundary and object.position + object.extension is the upper boundary.
OneOfPosition
Occasionally used for GenBank/EMBL locations, this class deals with a position where several possible values exist, for instance you could use this if the start codon was unclear and there were two candidates for the start of the gene. Alternatively, that might be handled explicitly as two related gene features.
UnknownPosition
This class deals with a position of unknown location. This is not used in GenBank/EMBL, but corresponds to the '?' feature coordinate used in UniProt.

Here's an example where we create a location with fuzzy end points:
>>> from Bio import SeqFeature
>>> start_pos = SeqFeature.AfterPosition(5)
>>> end_pos = SeqFeature.BetweenPosition(9, left=8, right=9)
>>> my_location = SeqFeature.FeatureLocation(start_pos, end_pos)

Note that the details of some of the fuzzy locations changed in Biopython 1.59. In particular, for BetweenPosition and WithinPosition you must now make it explicit which integer position should be used for slicing etc. For a start position this is generally the lower (left) value, while for an end position this would generally be the higher (right) value.

If you print out a FeatureLocation object, you can get a nice representation of the information:

>>> print(my_location)
[>5:(8^9)]

We can access the fuzzy start and end positions using the start and end attributes of the location:

>>> my_location.start
AfterPosition(5)
>>> print(my_location.start)
>5
>>> my_location.end
BetweenPosition(9, left=8, right=9)
>>> print(my_location.end)
(8^9)

If you don't want to deal with fuzzy positions and just want numbers, they are actually subclasses of integers so should work like integers:

>>> int(my_location.start)
5
>>> int(my_location.end)
9
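
The idea of a position that is "a subclass of integers" can be sketched in plain Python. The class AfterPos below is a hypothetical stand-in written for this illustration, not Biopython's AfterPosition; it only shows how an int subclass can carry a fuzzy text form while still behaving as a number.

```python
class AfterPos(int):
    """Hypothetical fuzzy position meaning 'somewhere after this index'."""

    def __repr__(self):
        return "AfterPos(%d)" % int(self)

    def __str__(self):
        return ">%d" % int(self)


start = AfterPos(5)
print(start)                   # fuzzy text form
print(int(start))              # plain integer
print("ACGTACGTAC"[start:9])   # usable directly as a slice index
```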

For compatibility with older versions of Biopython you can ask for the nofuzzy_start and nofuzzy_end attributes of the location which are plain integers:

>>> my_location.nofuzzy_start
5
>>> my_location.nofuzzy_end
9

Notice that this just gives you back the position attributes of the fuzzy locations.

Similarly, to make it easy to create a location without worrying about fuzzy positions, you can just pass in numbers to the FeatureLocation constructor, and you'll get back out ExactPosition objects:

>>> exact_location = SeqFeature.FeatureLocation(5, 9)
>>> print(exact_location)
[5:9]
>>> exact_location.start
ExactPosition(5)
>>> int(exact_location.start)
5
>>> exact_location.nofuzzy_start
5

That is most of the nitty gritty about dealing with fuzzy positions in Biopython. It has been designed so that dealing with fuzziness is not that much more complicated than dealing with exact positions, and hopefully you find that true!

4.3.2.4 Location testing

You can use the Python keyword in with a SeqFeature or location object to see if the base/residue for a parent coordinate is within the feature/location or not.

For example, suppose you have a SNP of interest and you want to know which features this SNP is within, and let's suppose this SNP is at index 4350 (Python counting!). Here is a simple brute force solution where we just check all the features one by one in a loop:

>>> from Bio import SeqIO
>>> my_snp = 4350
>>> record = SeqIO.read("NC_005816.gb", "genbank")
>>> for feature in record.features:
...     if my_snp in feature:
...         print("%s %s" % (feature.type, feature.qualifiers.get('db_xref')))
...
source ['taxon:229193']
gene ['GeneID:2767712']
CDS ['GI:45478716', 'GeneID:2767712']

Note that gene and CDS features from GenBank or EMBL files defined with joins are the union of the exons; they do not cover any introns.
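
The "union of the exons" behaviour can be pictured with a plain-Python membership test over a list of parts. This is a sketch under the assumption that each part is a 0-based half-open (start, end) pair; the function name in_location and the two-exon coordinates are made up for illustration.

```python
def in_location(index, parts):
    """Return True if index falls inside any (start, end) part.

    Each part is a 0-based half-open range, like a Python slice.
    """
    return any(start <= index < end for start, end in parts)


# Hypothetical gene with two exons joined together; 20 to 29 is an intron.
exons = [(10, 20), (30, 45)]
print(in_location(15, exons))  # inside the first exon
print(in_location(25, exons))  # inside the intron
```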

4.3.3 Sequence described by a feature or location
A SeqFeature or location object doesn't directly contain a sequence, instead the location (see Section 4.3.2) describes how to get this from the parent sequence. For example consider a (short) gene sequence with location 5:18 on the reverse strand, which in GenBank/EMBL notation using 1-based counting would be complement(6..18), like this:

>>> from Bio.Seq import Seq
>>> from Bio.SeqFeature import SeqFeature, FeatureLocation
>>> example_parent = Seq("ACCGAGACGGCAAAGGCTAGCATAGGTATGAGACTTCCTTCCTGCCAGTGCTGAGGAACTGGGAGCCTAC")
>>> example_feature = SeqFeature(FeatureLocation(5, 18), type="gene", strand=-1)

You could take the parent sequence, slice it to extract 5:18, and then take the reverse complement. If you are using Biopython 1.59 or later, the feature location's start and end are integer-like so this works:
>>> feature_seq = example_parent[example_feature.location.start:example_feature.location.end].reverse_complement()
>>> print(feature_seq)
AGCCTTTGCCGTC

This is a simple example so this isn't too bad. However, once you have to deal with compound features (joins) this is rather messy. Instead, the SeqFeature object has an extract method to take care of all this:

>>> feature_seq = example_feature.extract(example_parent)
>>> print(feature_seq)
AGCCTTTGCCGTC
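
For intuition about what extraction means for a simple location, the slice-then-reverse-complement step can be written with plain strings. The function below is a hypothetical sketch, not Biopython's extract method; it handles only unambiguous DNA letters and a single region.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def extract(parent, start, end, strand):
    """Slice parent[start:end]; reverse complement it if strand is -1."""
    region = parent[start:end]
    if strand == -1:
        region = region.translate(COMPLEMENT)[::-1]
    return region


example_parent = "ACCGAGACGGCAAAGGCTAGCATAGGTATGAGACTTCCTTCCTGCCAGTGCTGAGGAACTGGGAGCCTAC"
print(extract(example_parent, 5, 18, -1))
```

Run on the example parent sequence, this prints the same 13 letters as the Biopython session above.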

The length of a SeqFeature or location matches that of the region of sequence it describes.
>>> print(example_feature.extract(example_parent))
AGCCTTTGCCGTC
>>> print(len(example_feature.extract(example_parent)))
13
>>> print(len(example_feature))
13
>>> print(len(example_feature.location))
13

For simple FeatureLocation objects the length is just the difference between the start and end positions. However, for a CompoundLocation the length is the sum of the constituent regions.
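
The length rule can be checked with a few lines of plain Python. This is only an arithmetic sketch using (start, end) tuples; the name location_length and the two-exon coordinates are hypothetical.

```python
def location_length(parts):
    """Total length of a location given as (start, end) half-open parts."""
    return sum(end - start for start, end in parts)


simple = [(5, 18)]               # the example gene above
compound = [(10, 20), (30, 45)]  # hypothetical two-exon join
print(location_length(simple))
print(location_length(compound))
```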

4.4 Comparison
The SeqRecord objects can be very complex, but here's a simple example:

>>> from Bio.Seq import Seq
>>> from Bio.SeqRecord import SeqRecord
>>> record1 = SeqRecord(Seq("ACGT"), id="test")
>>> record2 = SeqRecord(Seq("ACGT"), id="test")

What happens when you try to compare these "identical" records?

>>> record1 == record2
...

Perhaps surprisingly, older versions of Biopython would use Python's default object comparison for the SeqRecord, meaning record1 == record2 would only return True if these variables pointed at the same object in memory. In this example, record1 == record2 would have returned False!
>>> record1 == record2  # on old versions of Biopython!
False

As of Biopython 1.67, SeqRecord comparison like record1 == record2 will instead raise an explicit error to avoid people being caught out by this:

>>> record1 == record2
Traceback (most recent call last):
...
NotImplementedError: SeqRecord comparison is deliberately not implemented. Explicitly compare the attributes of interest.

Instead you should check the attributes you are interested in, for example the identifier and the sequence:

>>> record1.id == record2.id
True
>>> record1.seq == record2.seq
True

Beware that comparing complex objects quickly gets complicated (see also Section 3.11).
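
If the same attribute checks are needed in several places, they can be wrapped in a small helper. The sketch below is an illustration only: the function name is hypothetical, SimpleNamespace objects stand in for SeqRecord objects so that the example needs nothing beyond the standard library, and comparing the sequences as strings is a design choice made here, not a Biopython rule.

```python
from types import SimpleNamespace


def same_id_and_sequence(rec_a, rec_b):
    """Compare only the attributes of interest, here the id and the sequence."""
    return rec_a.id == rec_b.id and str(rec_a.seq) == str(rec_b.seq)


# SimpleNamespace objects stand in for SeqRecord objects in this sketch.
record1 = SimpleNamespace(id="test", seq="ACGT")
record2 = SimpleNamespace(id="test", seq="ACGT")
record3 = SimpleNamespace(id="other", seq="ACGT")
print(same_id_and_sequence(record1, record2))
print(same_id_and_sequence(record1, record3))
```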

4.5 References
Another common annotation related to a sequence is a reference to a journal or other published work dealing with the sequence. We have a fairly simple way of representing a Reference in Biopython: we have a Bio.SeqFeature.Reference class that stores the relevant information about a reference as attributes of an object.

The attributes include things that you would expect to see in a reference like journal, title and authors. Additionally, it can also hold the medline_id and pubmed_id and a comment about the reference. These are all accessed simply as attributes of the object.

A reference also has a location object so that it can specify a particular location on the sequence that the reference refers to. For instance, you might have a journal that is dealing with a particular gene located on a BAC, and want to specify that it only refers to this position exactly. The location is a potentially fuzzy location, as described in Section 4.3.2.

Any reference objects are stored as a list in the SeqRecord object's annotations dictionary under the key "references". That's all there is to it. References are meant to be easy to deal with, and hopefully general enough to cover lots of usage cases.

4.6 The format method
The format() method of the SeqRecord class gives a string containing your record formatted using one of the output file formats supported by Bio.SeqIO, such as FASTA:

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import generic_protein

record = SeqRecord(Seq("MMYQQGCFAGGTVLRLAKDLAENNRGARVLVVCSEITAVTFRGPSETHLDSMVGQALFGD" \
                      +"GAGAVIVGSDPDLSVERPLYELVWTGATLLPDSEGAIDGHLREVGLTFHLLKDVPGLISK" \
                      +"NIEKSLKEAFTPLGISDWNSTFWIAHPGGPAILDQVEAKLGLKEEKMRATREVLSEYGNM" \
                      +"SSAC", generic_protein),
                   id="gi|14150838|gb|AAK54648.1|AF376133_1",
                   description="chalcone synthase [Cucumis sativus]")

print(record.format("fasta"))

which should give:
>gi|14150838|gb|AAK54648.1|AF376133_1 chalcone synthase [Cucumis sativus]
MMYQQGCFAGGTVLRLAKDLAENNRGARVLVVCSEITAVTFRGPSETHLDSMVGQALFGD
GAGAVIVGSDPDLSVERPLYELVWTGATLLPDSEGAIDGHLREVGLTFHLLKDVPGLISK
NIEKSLKEAFTPLGISDWNSTFWIAHPGGPAILDQVEAKLGLKEEKMRATREVLSEYGNM
SSAC

This format method takes a single mandatory argument, a lower case string which is supported by Bio.SeqIO as an output format (see Chapter 5). However, some of the file formats Bio.SeqIO can write to require more than one record (typically the case for multiple sequence alignment formats), and thus won't work via this format() method. See also Section 5.5.4.
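
To show what the FASTA text consists of, here is a plain-Python sketch that builds the same kind of string by hand. The function to_fasta is hypothetical, and the 60-letter line width is an assumption read off the output shown above rather than a documented setting.

```python
def to_fasta(identifier, description, sequence, width=60):
    """Return a FASTA-formatted string with the sequence wrapped at width."""
    title = ">%s %s" % (identifier, description)
    # Cut the sequence into fixed-width chunks, one per output line.
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([title] + lines) + "\n"


protein = (
    "MMYQQGCFAGGTVLRLAKDLAENNRGARVLVVCSEITAVTFRGPSETHLDSMVGQALFGD"
    "GAGAVIVGSDPDLSVERPLYELVWTGATLLPDSEGAIDGHLREVGLTFHLLKDVPGLISK"
    "NIEKSLKEAFTPLGISDWNSTFWIAHPGGPAILDQVEAKLGLKEEKMRATREVLSEYGNM"
    "SSAC"
)
fasta_text = to_fasta("gi|14150838|gb|AAK54648.1|AF376133_1",
                      "chalcone synthase [Cucumis sativus]", protein)
print(fasta_text, end="")
```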

4.7 Slicing a SeqRecord
You can slice a SeqRecord, to give you a new SeqRecord covering just part of the sequence. What is important here is that any per-letter annotations are also sliced, and any features which fall completely within the new sequence are preserved (with their locations adjusted).

For example, taking the same GenBank file used earlier:

>>> from Bio import SeqIO
>>> record = SeqIO.read("NC_005816.gb", "genbank")

>>> record
SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG',
IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816',
description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence',
dbxrefs=['Project:58037'])

>>> len(record)
9609
>>> len(record.features)
41

For this example we're going to focus in on the pim gene, YP_pPCP05. If you have a look at the GenBank file directly you'll find this gene/CDS has location string 4343..4780, or in Python counting 4342:4780. From looking at the file you can work out that these are the twenty-first and twenty-second entries in the features table, so in Python zero-based counting they are entries 20 and 21 in the features list:

>>> print(record.features[20])
type: gene
location: [4342:4780](+)
qualifiers:
    Key: db_xref, Value: ['GeneID:2767712']
    Key: gene, Value: ['pim']
    Key: locus_tag, Value: ['YP_pPCP05']
<BLANKLINE>

>>> print(record.features[21])
type: CDS
location: [4342:4780](+)
qualifiers:
    Key: codon_start, Value: ['1']
    Key: db_xref, Value: ['GI:45478716', 'GeneID:2767712']
    Key: gene, Value: ['pim']
    Key: locus_tag, Value: ['YP_pPCP05']
    Key: note, Value: ['similar to many previously sequenced pesticin immunity ...']
    Key: product, Value: ['pesticin immunity protein']
    Key: protein_id, Value: ['NP_995571.1']
    Key: transl_table, Value: ['11']
    Key: translation, Value: ['MGGGMISKLFCLALIFLSSSGLAEKNTYTAKDILQNLELNTFGNSLSH...']

Let's slice this parent record from 4300 to 4800 (enough to include the pim gene/CDS), and see how many features we get:

>>> sub_record = record[4300:4800]

>>> sub_record
SeqRecord(seq=Seq('ATAAATAGATTATTCCAAATAATTTATTTATGTAAGAACAGGATGGGAGGGGGA...TTA',
IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816',
description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence.',
dbxrefs=[])

>>> len(sub_record)
500
>>> len(sub_record.features)
2

Our sub-record just has two features, the gene and CDS entries for YP_pPCP05:

>>> print(sub_record.features[0])
type: gene
location: [42:480](+)
qualifiers:
    Key: db_xref, Value: ['GeneID:2767712']
    Key: gene, Value: ['pim']
    Key: locus_tag, Value: ['YP_pPCP05']
<BLANKLINE>

>>> print(sub_record.features[1])
type: CDS
location: [42:480](+)
qualifiers:
    Key: codon_start, Value: ['1']
    Key: db_xref, Value: ['GI:45478716', 'GeneID:2767712']
    Key: gene, Value: ['pim']
    Key: locus_tag, Value: ['YP_pPCP05']
    Key: note, Value: ['similar to many previously sequenced pesticin immunity ...']
    Key: product, Value: ['pesticin immunity protein']
    Key: protein_id, Value: ['NP_995571.1']
    Key: transl_table, Value: ['11']
    Key: translation, Value: ['MGGGMISKLFCLALIFLSSSGLAEKNTYTAKDILQNLELNTFGNSLSH...']

Notice that their locations have been adjusted to reflect the new parent sequence!
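
The adjustment is simple arithmetic, which the following plain-Python sketch spells out: a feature is kept only if it lies fully inside the slice, and its coordinates are shifted by the slice start. The function name slice_features and the tuple layout are assumptions made for this illustration; only the coordinates come from the example above.

```python
def slice_features(features, start, stop):
    """Keep features lying fully inside [start:stop] and shift their coordinates.

    features is a list of (type, start, end) tuples using Python-style
    0-based half-open coordinates.
    """
    kept = []
    for ftype, fstart, fend in features:
        if start <= fstart and fend <= stop:
            kept.append((ftype, fstart - start, fend - start))
    return kept


features = [
    ("source", 0, 9609),   # spans the whole plasmid, so it will be dropped
    ("gene", 4342, 4780),
    ("CDS", 4342, 4780),
]
print(slice_features(features, 4300, 4800))
```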

While Biopython has done something sensible and hopefully intuitive with the features (and any per-letter annotation), for the other annotation it is impossible to know if this still applies to the sub-sequence or not. To avoid guessing, the annotations and dbxrefs are omitted from the sub-record, and it is up to you to transfer any relevant information as appropriate.
>>> sub_record.annotations
{}
>>> sub_record.dbxrefs
[]

The same point could be made about the record id, name and description, but for practicality these are preserved:
>>> sub_record.id
'NC_005816.1'
>>> sub_record.name
'NC_005816'
>>> sub_record.description
'Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence'

This illustrates the problem nicely though: our new sub-record is not the complete sequence of the plasmid, so the description is wrong! Let's fix this and then view the sub-record as a reduced GenBank file using the format method described above in Section 4.6:
>>> sub_record.description = "Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, partial."
>>> print(sub_record.format("genbank"))
...

See Sections 20.1.7 and 20.1.8 for some FASTQ examples where the per-letter annotations (the read quality scores) are also sliced.

4.8 Adding SeqRecord objects
You can add SeqRecord objects together, giving a new SeqRecord. What is important here is that any common per-letter annotations are also added, all the features are preserved (with their locations adjusted), and any other common annotation is also kept (like the id, name and description).

For an example with per-letter annotation, we'll use the first record in a FASTQ file. Chapter 5 will explain the SeqIO functions:
>>> from Bio import SeqIO
>>> record = next(SeqIO.parse("example.fastq", "fastq"))
>>> len(record)
25
>>> print(record.seq)
CCCTTCTTGTCTTCAGCGTTTCTCC

>>> print(record.letter_annotations["phred_quality"])
[26, 26, 18, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 22, 26, 26, 26, 26,
26, 26, 26, 23, 23]

Let's suppose this was Roche 454 data, and that from other information you think the TTT should be only TT. We can make a new edited record by first slicing the SeqRecord before and after the "extra" third T:

>>> left = record[:20]
>>> print(left.seq)
CCCTTCTTGTCTTCAGCGTT
>>> print(left.letter_annotations["phred_quality"])
[26, 26, 18, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 22, 26, 26, 26, 26]
>>> right = record[21:]
>>> print(right.seq)
CTCC
>>> print(right.letter_annotations["phred_quality"])
[26, 26, 23, 23]

Now add the two parts together:

>>> edited = left + right
>>> len(edited)
24
>>> print(edited.seq)
CCCTTCTTGTCTTCAGCGTTCTCC

>>> print(edited.letter_annotations["phred_quality"])
[26, 26, 18, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 26, 22, 26, 26, 26, 26,
26, 26, 23, 23]

Easy and intuitive? We hope so! You can make this shorter with just:
>>> edited = record[:20] + record[21:]

Now, for an example with features, we'll use a GenBank file. Suppose you have a circular genome:
>>> from Bio import SeqIO
>>> record = SeqIO.read("NC_005816.gb", "genbank")

>>> record
SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG',
IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816',
description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence.',
dbxrefs=['Project:10638'])

>>> len(record)
9609
>>> len(record.features)
41
>>> record.dbxrefs
['Project:58037']

>>> record.annotations.keys()
['comment', 'sequence_version', 'source', 'taxonomy', 'keywords', 'references',
'accessions', 'data_file_division', 'date', 'organism', 'gi']

You can shift the origin like this:
>>> shifted = record[2000:] + record[:2000]

>>> shifted
SeqRecord(seq=Seq('GATACGCAGTCATATTTTTTACACAATTCTCTAATCCCGACAAGGTCGTAGGTC...GGA',
IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816',
description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence.',
dbxrefs=[])

>>> len(shifted)
9609

Note that this isn't perfect, in that some annotation like the database cross-references and one of the features (the source feature) has been lost:
>>> len(shifted.features)
40
>>> shifted.dbxrefs
[]
>>> shifted.annotations.keys()
[]

This is because the SeqRecord slicing step is cautious in what annotation it preserves (erroneously propagating annotation can cause major problems). If you want to keep the database cross-references or the annotations dictionary, this must be done explicitly:
>>> shifted.dbxrefs = record.dbxrefs[:]
>>> shifted.annotations = record.annotations.copy()
>>> shifted.dbxrefs
['Project:10638']
>>> shifted.annotations.keys()
['comment', 'sequence_version', 'source', 'taxonomy', 'keywords', 'references',
'accessions', 'data_file_division', 'date', 'organism', 'gi']

Also note that in an example like this, you should probably change the record identifiers since the NCBI references refer to the original unmodified sequence.
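
The origin shift itself is a rotation, and the effect on coordinates can be sketched with plain strings. The helper names shift_origin and shift_position are hypothetical, and the 12-letter string is only a short stand-in for a circular genome.

```python
def shift_origin(sequence, new_origin):
    """Rotate a circular sequence so that new_origin becomes index 0."""
    return sequence[new_origin:] + sequence[:new_origin]


def shift_position(index, new_origin, length):
    """Map an index on the original sequence to the shifted sequence."""
    return (index - new_origin) % length


plasmid = "TGTAACGAACGG"        # short stand-in for a circular genome
shifted = shift_origin(plasmid, 4)
print(shifted)
print(shift_position(6, 4, len(plasmid)))   # a base after the new origin
print(shift_position(1, 4, len(plasmid)))   # a base before the new origin
```

A feature that would straddle the new origin no longer maps to one contiguous range, which is consistent with the whole-sequence source feature being lost in the example above.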

4.9 Reverse-complementing SeqRecord objects
One of the new features in Biopython 1.57 was the SeqRecord object's reverse_complement method. This tries to balance ease of use with worries about what to do with the annotation in the reverse complemented record.

For the sequence, this uses the Seq object's reverse complement method. Any features are transferred with the location and strand recalculated. Likewise any per-letter annotation is also copied but reversed (which makes sense for typical examples like quality scores). However, transfer of most annotation is problematic.

For instance, if the record ID was an accession, that accession should not really apply to the reverse complemented sequence, and transferring the identifier could easily cause subtle data corruption in downstream analysis. Therefore the SeqRecord's id, name, description, annotations and database cross-references are all not transferred by default.

The SeqRecord object's reverse_complement method takes a number of optional arguments corresponding to properties of the record. Setting these arguments to True means copy the old values, while False means drop the old values and use the default value. You can alternatively provide the new desired value instead.

Consider this example record:

>>> from Bio import SeqIO
>>> record = SeqIO.read("NC_005816.gb", "genbank")
>>> print("%s %i %i %i %i" % (record.id, len(record), len(record.features), len(record.dbxrefs), len(record.annotations)))
NC_005816.1 9609 41 1 13

Here we take the reverse complement and specify a new identifier, but notice how most of the annotation is dropped (but not the features):
>>> rc = record.reverse_complement(id="TESTING")
>>> print("%s %i %i %i %i" % (rc.id, len(rc), len(rc.features), len(rc.dbxrefs), len(rc.annotations)))
TESTING 9609 41 0 0
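
What "recalculated" means for a feature location can be sketched as arithmetic on 0-based half-open coordinates. This is an assumption about the mapping made for illustration, with hypothetical helper names; it is not taken from Biopython's source code.

```python
def flip_feature(start, end, strand, length):
    """Map a feature onto the reverse complement of a sequence.

    Coordinates are 0-based half-open; strand is +1, -1 or None.
    """
    new_strand = -strand if strand in (1, -1) else strand
    return length - end, length - start, new_strand


def flip_per_letter(values):
    """Per-letter annotation is kept but reversed."""
    return values[::-1]


# The pim gene at [4342:4780] on the + strand of a 9609 letter sequence.
print(flip_feature(4342, 4780, 1, 9609))
print(flip_per_letter([40, 40, 38, 30]))
```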

Chapter 5 Sequence Input/Output
In this chapter we'll discuss in more detail the Bio.SeqIO module, which was briefly introduced in Chapter 2 and also used in Chapter 4. This aims to provide a simple interface for working with assorted sequence file formats in a uniform way. See also the Bio.SeqIO wiki page (http://biopython.org/wiki/SeqIO), and the built-in documentation (also online):

>>> from Bio import SeqIO
>>> help(SeqIO)
...

The "catch" is that you have to work with SeqRecord objects (see Chapter 4), which contain a Seq object (see Chapter 3) plus annotation like an identifier and description. Note that when dealing with very large FASTA or FASTQ files, the overhead of working with all these objects can make scripts too slow. In this case consider the low-level SimpleFastaParser and FastqGeneralIterator parsers which return just a tuple of strings for each record (see Section 5.6).

5.1 Parsing or Reading Sequences
The workhorse function Bio.SeqIO.parse() is used to read in sequence data as SeqRecord objects. This function expects two arguments:

1. The first argument is a handle to read the data from, or a filename. A handle is typically a file opened for reading, but could be the output from a command line program, or data downloaded from the internet (see Section 5.3). See Section 24.1 for more about handles.
2. The second argument is a lower case string specifying the sequence format. We don't try and guess the file format for you! See http://biopython.org/wiki/SeqIO for a full listing of supported formats.

There is an optional argument alphabet to specify the alphabet to be used. This is useful for file formats like FASTA where otherwise Bio.SeqIO will default to a generic alphabet.

The Bio.SeqIO.parse() function returns an iterator which gives SeqRecord objects. Iterators are typically used in a for loop as shown below.

Sometimes you'll find yourself dealing with files which contain only a single record. For this situation use the function Bio.SeqIO.read() which takes the same arguments. Provided there is one and only one record in the file, this is returned as a SeqRecord object. Otherwise an exception is raised.
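
The "one and only one record" contract can be expressed for any iterator in a few lines. The sketch below is plain Python, not the Bio.SeqIO.read implementation; the function name read_single, the use of ValueError and the message texts are assumptions made for illustration.

```python
def read_single(records):
    """Return the only item from an iterator, or raise ValueError."""
    iterator = iter(records)
    try:
        first = next(iterator)
    except StopIteration:
        raise ValueError("No records found") from None
    try:
        next(iterator)
    except StopIteration:
        return first            # exactly one item, as required
    raise ValueError("More than one record found")


print(read_single(["only record"]))
for bad_input in ([], ["first", "second"]):
    try:
        read_single(bad_input)
    except ValueError as err:
        print("error:", err)
```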

5.1.1 Reading Sequence Files
In general Bio.SeqIO.parse() is used to read in sequence files as SeqRecord objects, and is typically used with a for loop like this:

from Bio import SeqIO
for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
    print(seq_record.id)
    print(repr(seq_record.seq))
    print(len(seq_record))

The above example is repeated from the introduction in Section 2.4, and will load the orchid DNA sequences in the FASTA format file ls_orchid.fasta. If instead you wanted to load a GenBank format file like ls_orchid.gbk then all you need to do is change the filename and the format string:
from Bio import SeqIO
for seq_record in SeqIO.parse("ls_orchid.gbk", "genbank"):
    print(seq_record.id)
    print(repr(seq_record.seq))
    print(len(seq_record))

Similarly, if you wanted to read in a file in another file format, then assuming Bio.SeqIO.parse() supports it you would just need to change the format string as appropriate, for example "swiss" for SwissProt files or "embl" for EMBL text files. There is a full listing on the wiki page (http://biopython.org/wiki/SeqIO) and in the built-in documentation (also online).

AnotherverycommonwaytouseaPythoniteratoriswithinalistcomprehension(orageneratorexpression).Forexample,ifallyouwantedtoextractfromthefilewas
alistoftherecordidentifierswecaneasilydothiswiththefollowinglistcomprehension:
>>>fromBioimportSeqIO
>>>identifiers=[seq_record.idforseq_recordinSeqIO.parse("ls_orchid.gbk","genbank")]
>>>identifiers
['Z78533.1','Z78532.1','Z78531.1','Z78530.1','Z78529.1','Z78527.1',...,'Z78439.1']

TherearemoreexamplesusingSeqIO.parse()inalistcomprehensionlikethisinSection20.2(e.g.forplottingsequencelengthsorGC%).

5.1.2 Iterating over the records in a sequence file

In the above examples, we have usually used a for loop to iterate over all the records one by one. You can use the for loop with all sorts of Python objects (including lists,
tuples and strings) which support the iteration interface.

The object returned by Bio.SeqIO.parse() is actually an iterator which returns SeqRecord objects. You get to see each record in turn, but once and only once. The plus point is that
an iterator can save you memory when dealing with large files.
Instead of using a for loop, you can also use the next() function on an iterator to step through the entries, like this:

from Bio import SeqIO
record_iterator = SeqIO.parse("ls_orchid.fasta", "fasta")

first_record = next(record_iterator)
print(first_record.id)
print(first_record.description)

second_record = next(record_iterator)
print(second_record.id)
print(second_record.description)

Note that if you try to use next() and there are no more results, you'll get the special StopIteration exception.
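If you would rather not handle StopIteration yourself, Python's built-in next() accepts a default value as a second argument. The sketch below uses a plain list iterator as a stand-in for the iterator returned by SeqIO.parse(); the record names are made up for illustration.

```python
# A plain iterator standing in for SeqIO.parse(...); the values are made up.
record_iterator = iter(["rec1", "rec2"])

first = next(record_iterator, None)
second = next(record_iterator, None)
third = next(record_iterator, None)  # exhausted: gives None instead of raising

if third is None:
    print("No third record")
```

This only works if None can never be a real entry, which holds for an iterator of SeqRecord objects.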

One special case to consider is when your sequence files have multiple records, but you only want the first one. In this situation the following code is very concise:

from Bio import SeqIO
first_record = next(SeqIO.parse("ls_orchid.gbk", "genbank"))

A word of warning here: using the next() function like this will silently ignore any additional records in the file. If your files have one and only one record, like some of
the online examples later in this chapter, or a GenBank file for a single chromosome, then use the Bio.SeqIO.read() function instead. This will check there are no
extra unexpected records present.

5.1.3 Getting a list of the records in a sequence file
In the previous section we talked about the fact that Bio.SeqIO.parse() gives you a SeqRecord iterator, and that you get the records one by one. Very often you need to be
able to access the records in any order. The Python list data type is perfect for this, and we can turn the record iterator into a list of SeqRecord objects using the built-in
Python function list() like so:

from Bio import SeqIO
records = list(SeqIO.parse("ls_orchid.gbk", "genbank"))

print("Found %i records" % len(records))

print("The last record")
last_record = records[-1]  # using Python's list tricks
print(last_record.id)
print(repr(last_record.seq))
print(len(last_record))

print("The first record")
first_record = records[0]  # remember, Python counts from zero
print(first_record.id)
print(repr(first_record.seq))
print(len(first_record))

Giving:

Found 94 records
The last record
Z78439.1
Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACT...GCC', IUPACAmbiguousDNA())
592
The first record
Z78533.1
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', IUPACAmbiguousDNA())
740

You can of course still use a for loop with a list of SeqRecord objects. Using a list is much more flexible than an iterator (for example, you can determine the number of
records from the length of the list), but does need more memory because it will hold all the records in memory at once.

5.1.4 Extracting data

The SeqRecord object and its annotation structures are described more fully in Chapter 4. As an example of how annotations are stored, we'll look at the output from
parsing the first record in the GenBank file ls_orchid.gbk.

from Bio import SeqIO
record_iterator = SeqIO.parse("ls_orchid.gbk", "genbank")
first_record = next(record_iterator)
print(first_record)

That should give something like this:

ID: Z78533.1
Name: Z78533
Description: C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA.
Number of features: 5
/sequence_version=1
/source=Cypripedium irapeanum
/taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', ..., 'Cypripedium']
/keywords=['5.8S ribosomal RNA', '5.8S rRNA gene', ..., 'ITS1', 'ITS2']
/references=[...]
/accessions=['Z78533']
/data_file_division=PLN
/date=30-NOV-2006
/organism=Cypripedium irapeanum
/gi=2765658
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', IUPACAmbiguousDNA())

This gives a human readable summary of most of the annotation data for the SeqRecord. For this example we're going to use the .annotations attribute which is just a
Python dictionary. The contents of this annotations dictionary were shown when we printed the record above. You can also print them out directly:

print(first_record.annotations)

Like any Python dictionary, you can easily get a list of the keys:

print(first_record.annotations.keys())

or values:

print(first_record.annotations.values())

In general, the annotation values are strings, or lists of strings. One special case is any references in the file, which get stored as reference objects.

Suppose you wanted to extract a list of the species from the ls_orchid.gbk GenBank file. The information we want, Cypripedium irapeanum, is held in the annotations
dictionary under "source" and "organism", which we can access like this:

>>> print(first_record.annotations["source"])
Cypripedium irapeanum

or:

>>> print(first_record.annotations["organism"])
Cypripedium irapeanum

In general, "organism" is used for the scientific name (in Latin, e.g. Arabidopsis thaliana), while "source" will often be the common name (e.g. thale cress). In this
example, as is often the case, the two fields are identical.

Now let's go through all the records, building up a list of the species each orchid sequence is from:

from Bio import SeqIO
all_species = []
for seq_record in SeqIO.parse("ls_orchid.gbk", "genbank"):
    all_species.append(seq_record.annotations["organism"])
print(all_species)

Another way of writing this code is to use a list comprehension:

from Bio import SeqIO
all_species = [seq_record.annotations["organism"] for seq_record in
               SeqIO.parse("ls_orchid.gbk", "genbank")]
print(all_species)

In either case, the result is:

['Cypripedium irapeanum', 'Cypripedium californicum', ..., 'Paphiopedilum barbatum']

Great. That was pretty easy because GenBank files are annotated in a standardised way.

Now, let's suppose you wanted to extract a list of the species from a FASTA file, rather than the GenBank file. The bad news is you will have to write some code to
extract the data you want from the record's description line, if the information is in the file in the first place! Our example FASTA format file ls_orchid.fasta starts like
this:

>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG
...

You can check by hand, but for every record the species name is in the description line as the second word. This means if we break up each record's .description at the
spaces, then the species is there as field number one (field zero is the record identifier). That means we can do this:

from Bio import SeqIO
all_species = []
for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
    all_species.append(seq_record.description.split()[1])
print(all_species)

This gives:

['C.irapeanum', 'C.californicum', 'C.fasciculatum', 'C.margaritaceum', ..., 'P.barbatum']

The concise alternative using list comprehensions would be:

from Bio import SeqIO
all_species = [seq_record.description.split()[1] for seq_record in
               SeqIO.parse("ls_orchid.fasta", "fasta")]
print(all_species)

In general, extracting information from the FASTA description line is not very nice. If you can get your sequences in a well annotated file format like GenBank or EMBL,
then this sort of annotation information is much easier to deal with.
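One reason description parsing is fragile is that split()[1] raises an IndexError as soon as a record has no second word. A slightly more defensive sketch follows; the helper name species_from_description is hypothetical, and returning None for short descriptions is just one possible design choice.

```python
def species_from_description(description):
    """Return the second whitespace-separated word, or None if absent.

    Hypothetical helper; assumes the species is the second word, as in
    the ls_orchid.fasta description lines.
    """
    words = description.split()
    if len(words) < 2:
        return None
    return words[1]


header = "gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA"
print(species_from_description(header))
print(species_from_description("identifier_only"))
```

You would call this on seq_record.description inside the loop, and then decide whether to skip or report the records that give None.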

5.2 Parsing sequences from compressed files
In the previous section, we looked at parsing sequence data from a file. Instead of using a filename, you can give Bio.SeqIO a handle (see Section 24.1), and in this section
we'll use handles to parse sequences from compressed files.

As you'll have seen above, we can use Bio.SeqIO.read() or Bio.SeqIO.parse() with a filename. For instance, this quick example calculates the total length of the
sequences in a multiple record GenBank file using a generator expression:
>>> from Bio import SeqIO
>>> print(sum(len(r) for r in SeqIO.parse("ls_orchid.gbk", "gb")))
67518

Here we use a file handle instead, using the with statement to close the handle automatically:

>>> from Bio import SeqIO
>>> with open("ls_orchid.gbk") as handle:
...     print(sum(len(r) for r in SeqIO.parse(handle, "gb")))
...
67518

Or, the old fashioned way where you manually close the handle:

>>> from Bio import SeqIO
>>> handle = open("ls_orchid.gbk")
>>> print(sum(len(r) for r in SeqIO.parse(handle, "gb")))
67518
>>> handle.close()

Now, suppose we have a gzip compressed file instead? These are very commonly used on Linux. We can use Python's gzip module to open the compressed file for
reading, which gives us a handle object:

>>> import gzip
>>> from Bio import SeqIO
>>> with gzip.open("ls_orchid.gbk.gz", "rt") as handle:
...     print(sum(len(r) for r in SeqIO.parse(handle, "gb")))
...
67518
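The "rt" mode in the gzip example matters: under Python 3, gzip.open() defaults to binary mode and returns bytes, while "rt" gives decoded text lines. The following self-contained sketch uses only the standard library and a made-up two-line FASTA file in a temporary directory to show the difference.

```python
import gzip
import os
import tempfile

text = ">seq1\nACGT\n"  # made-up example content

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "example.fasta.gz")

    # Write a small gzip compressed text file.
    with gzip.open(path, "wt") as handle:
        handle.write(text)

    # Text mode ("rt") yields str lines.
    with gzip.open(path, "rt") as handle:
        text_lines = handle.readlines()

    # Binary mode ("rb", the default) yields bytes.
    with gzip.open(path, "rb") as handle:
        raw = handle.read()

print(type(text_lines[0]).__name__, type(raw).__name__)
```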

Similarly if we had a bzip2 compressed file (sadly the function name isn't quite as consistent under Python 2):

>>> import bz2
>>> from Bio import SeqIO
>>> if hasattr(bz2, "open"):
...     handle = bz2.open("ls_orchid.gbk.bz2", "rt")  # Python 3
... else:
...     handle = bz2.BZ2File("ls_orchid.gbk.bz2", "r")  # Python 2
...
>>> with handle:
...     print(sum(len(r) for r in SeqIO.parse(handle, "gb")))
...
67518

There is a gzip (GNU Zip) variant called BGZF (Blocked GNU Zip Format), which can be treated like an ordinary gzip file for reading, but has advantages for random
access which we'll talk about later in Section 5.4.4.

5.3 Parsing sequences from the net
In the previous sections, we looked at parsing sequence data from a file (using a filename or handle), and from compressed files (using a handle). Here we'll use
Bio.SeqIO with another type of handle, a network connection, to download and parse sequences from the internet.

Note that just because you can download sequence data and parse it into a SeqRecord object in one go doesn't mean this is a good idea. In general, you should probably
download sequences once and save them to a file for reuse.

5.3.1 Parsing GenBank records from the net
Section 9.6 talks about the Entrez EFetch interface in more detail, but for now let's just connect to the NCBI and get a few Opuntia (prickly-pear) sequences from
GenBank using their GI numbers.

First of all, let's fetch just one record. If you don't care about the annotations and features, downloading a FASTA file is a good choice as these are compact. Now
remember, when you expect the handle to contain one and only one record, use the Bio.SeqIO.read() function:

from Bio import Entrez
from Bio import SeqIO
Entrez.email = "A.N.Other@example.com"
with Entrez.efetch(db="nucleotide", rettype="fasta", retmode="text", id="6273291") as handle:
    seq_record = SeqIO.read(handle, "fasta")
print("%s with %i features" % (seq_record.id, len(seq_record.features)))

Expected output:

gi|6273291|gb|AF191665.1|AF191665 with 0 features

The NCBI will also let you ask for the file in other formats, in particular as a GenBank file. Until Easter 2009, the Entrez EFetch API let you use "genbank" as the return
type, however the NCBI now insist on using the official return types of "gb" (or "gp" for proteins) as described on "EFetch for Sequence and other Molecular Biology
Databases". As a result, in Biopython 1.50 onwards, we support "gb" as an alias for "genbank" in Bio.SeqIO.

from Bio import Entrez
from Bio import SeqIO
Entrez.email = "A.N.Other@example.com"
with Entrez.efetch(db="nucleotide", rettype="gb", retmode="text", id="6273291") as handle:
    seq_record = SeqIO.read(handle, "gb")  # using "gb" as an alias for "genbank"
print("%s with %i features" % (seq_record.id, len(seq_record.features)))

The expected output of this example is:

AF191665.1 with 3 features

Notice this time we have three features.

Now let's fetch several records. This time the handle contains multiple records, so we must use the Bio.SeqIO.parse() function:

from Bio import Entrez
from Bio import SeqIO
Entrez.email = "A.N.Other@example.com"
with Entrez.efetch(db="nucleotide", rettype="gb", retmode="text",
                   id="6273291,6273290,6273289") as handle:
    for seq_record in SeqIO.parse(handle, "gb"):
        print("%s %s..." % (seq_record.id, seq_record.description[:50]))
        print("Sequence length %i, %i features, from: %s"
              % (len(seq_record), len(seq_record.features), seq_record.annotations["source"]))

That should give the following output:

AF191665.1 Opuntia marenae rpl16 gene; chloroplast gene for c...
Sequence length 902, 3 features, from: chloroplast Opuntia marenae
AF191664.1 Opuntia clavata rpl16 gene; chloroplast gene for c...
Sequence length 899, 3 features, from: chloroplast Grusonia clavata
AF191663.1 Opuntia bradtiana rpl16 gene; chloroplast gene for...
Sequence length 899, 3 features, from: chloroplast Opuntia bradtianaa

See Chapter 9 for more about the Bio.Entrez module, and make sure to read about the NCBI guidelines for using Entrez (Section 9.1).

5.3.2 Parsing SwissProt sequences from the net
Now let's use a handle to download a SwissProt file from ExPASy, something covered in more depth in Chapter 10. As mentioned above, when you expect the handle to
contain one and only one record, use the Bio.SeqIO.read() function:

from Bio import ExPASy
from Bio import SeqIO
with ExPASy.get_sprot_raw("O23729") as handle:
    seq_record = SeqIO.read(handle, "swiss")
print(seq_record.id)
print(seq_record.name)
print(seq_record.description)
print(repr(seq_record.seq))
print("Length %i" % len(seq_record))
print(seq_record.annotations["keywords"])

Assuming your network connection is OK, you should get back:

O23729
CHS3_BROFI
RecName: Full=Chalcone synthase 3; EC=2.3.1.74; AltName: Full=Naringenin-chalcone synthase 3;
Seq('MAPAMEEIRQAQRAEGPAAVLAIGTSTPPNALYQADYPDYYFRITKSEHLTELK...GAE', ProteinAlphabet())
Length 394
['Acyltransferase', 'Flavonoid biosynthesis', 'Transferase']

5.4 Sequence files as Dictionaries
We're now going to introduce three related functions in the Bio.SeqIO module which allow dictionary-like random access to a multi-sequence file. There is a trade-off here
between flexibility and memory usage. In summary:

Bio.SeqIO.to_dict() is the most flexible but also the most memory demanding option (see Section 5.4.1). This is basically a helper function to build a normal
Python dictionary with each entry held as a SeqRecord object in memory, allowing you to modify the records.
Bio.SeqIO.index() is a useful middle ground, acting like a read-only dictionary and parsing sequences into SeqRecord objects on demand (see Section 5.4.2).
Bio.SeqIO.index_db() also acts like a read-only dictionary but stores the identifiers and file offsets in a file on disk (as an SQLite3 database), meaning it has very
low memory requirements (see Section 5.4.3), but will be a little bit slower.

See the discussion for a broad overview (Section 5.4.5).

5.4.1 Sequence files as Dictionaries: In memory

The next thing that we'll do with our ubiquitous orchid files is to show how to index them and access them like a database using the Python dictionary data type (like a
hash in Perl). This is very useful for moderately large files where you only need to access certain elements of the file, and makes for a nice quick 'n dirty database. For
dealing with larger files where memory becomes a problem, see Section 5.4.2 below.

You can use the function Bio.SeqIO.to_dict() to make a SeqRecord dictionary (in memory). By default this will use each record's identifier (i.e. the .id attribute) as the
key. Let's try this using our GenBank file:

>>> from Bio import SeqIO
>>> orchid_dict = SeqIO.to_dict(SeqIO.parse("ls_orchid.gbk", "genbank"))

There is just one required argument for Bio.SeqIO.to_dict(), a list or generator giving SeqRecord objects. Here we have just used the output from the SeqIO.parse()
function. As the name suggests, this returns a Python dictionary.

Since this variable orchid_dict is an ordinary Python dictionary, we can look at all of the keys we have available:

>>> len(orchid_dict)
94

>>> list(orchid_dict.keys())
['Z78484.1', 'Z78464.1', 'Z78455.1', 'Z78442.1', 'Z78532.1', 'Z78453.1', ..., 'Z78471.1']

You can leave out the list(...) bit if you are still using Python 2. Under Python 3 the dictionary methods like .keys() and .values() return view objects rather than lists.

If you really want to, you can even look at all the records at once:

>>> list(orchid_dict.values())  # lots of output!
...

We can access a single SeqRecord object via the keys and manipulate the object as normal:

>>> seq_record = orchid_dict["Z78475.1"]
>>> print(seq_record.description)
P.supardii 5.8S rRNA gene and ITS1 and ITS2 DNA
>>> print(repr(seq_record.seq))
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...GGT', IUPACAmbiguousDNA())

So, it is very easy to create an in-memory database of our GenBank records. Next we'll try this for the FASTA file instead.

Note that those of you with prior Python experience should all be able to construct a dictionary like this by hand. However, typical dictionary construction methods will
not deal with the case of repeated keys very nicely. Using Bio.SeqIO.to_dict() will explicitly check for duplicate keys, and raise an exception if any are found.
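To see why repeated keys are a problem, compare a plain dictionary comprehension with a loop that checks each key first. The sketch below uses made-up (identifier, sequence) tuples in place of SeqRecord objects; the helper name build_unique_dict is hypothetical, and the use of ValueError is an assumption of this sketch.

```python
def build_unique_dict(items, key_function):
    """Build a dictionary, refusing to overwrite an existing key.

    Hypothetical helper illustrating an explicit duplicate key check.
    """
    result = {}
    for item in items:
        key = key_function(item)
        if key in result:
            raise ValueError("Duplicate key %r" % key)
        result[key] = item
    return result


# Made-up records: the identifier "Z1" appears twice.
records = [("Z1", "AAA"), ("Z2", "CCC"), ("Z1", "GGG")]

# A plain comprehension silently keeps only the last "Z1" entry.
naive = {rec[0]: rec for rec in records}
print(len(naive), naive["Z1"])

try:
    build_unique_dict(records, lambda rec: rec[0])
except ValueError as err:
    print("Refused:", err)
```

The silent overwrite in the first version is easy to miss, since the resulting dictionary looks perfectly normal.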

5.4.1.1 Specifying the dictionary keys

Using the same code as above, but for the FASTA file instead:

from Bio import SeqIO
orchid_dict = SeqIO.to_dict(SeqIO.parse("ls_orchid.fasta", "fasta"))
print(orchid_dict.keys())

This time the keys are:

['gi|2765596|emb|Z78471.1|PDZ78471', 'gi|2765646|emb|Z78521.1|CCZ78521', ...
..., 'gi|2765613|emb|Z78488.1|PTZ78488', 'gi|2765583|emb|Z78458.1|PHZ78458']

You should recognise these strings from when we parsed the FASTA file earlier in Section 2.4.1. Suppose you would rather have something else as the keys, like the
accession numbers. This brings us nicely to SeqIO.to_dict()'s optional argument key_function, which lets you define what to use as the dictionary key for your records.

First you must write your own function to return the key you want (as a string) when given a SeqRecord object. In general, the details of the function will depend on the sort
of input records you are dealing with. But for our orchids, we can just split up the record's identifier using the "pipe" character (the vertical line) and return the fourth
entry (field three):

def get_accession(record):
    """Given a SeqRecord, return the accession number as a string.

    e.g. "gi|2765613|emb|Z78488.1|PTZ78488" -> "Z78488.1"
    """
    parts = record.id.split("|")
    assert len(parts) == 5 and parts[0] == "gi" and parts[2] == "emb"
    return parts[3]

Then we can give this function to the SeqIO.to_dict() function to use in building the dictionary:

from Bio import SeqIO
orchid_dict = SeqIO.to_dict(SeqIO.parse("ls_orchid.fasta", "fasta"), key_function=get_accession)
print(orchid_dict.keys())

Finally, as desired, the new dictionary keys:

>>> print(orchid_dict.keys())
['Z78484.1', 'Z78464.1', 'Z78455.1', 'Z78442.1', 'Z78532.1', 'Z78453.1', ..., 'Z78471.1']

Not too complicated, I hope!

5.4.1.2 Indexing a dictionary using the SEGUID checksum

To give another example of working with dictionaries of SeqRecord objects, we'll use the SEGUID checksum function. This is a relatively recent checksum, and collisions
should be very rare (i.e. two different sequences with the same checksum), an improvement on the CRC64 checksum.
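As a rough idea of what such a checksum involves, here is a standard library sketch. It assumes SEGUID is the base64 encoded SHA-1 digest of the upper case sequence with the trailing "=" padding removed; treat that definition, and the function name seguid_sketch, as assumptions of this example rather than a description of Biopython's own code.

```python
import base64
import hashlib


def seguid_sketch(sequence):
    """Return a SEGUID-style checksum string for a sequence.

    Assumption: SHA-1 of the upper case sequence, base64 encoded,
    with the trailing "=" padding stripped.
    """
    digest = hashlib.sha1(sequence.upper().encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii").rstrip("=")


checksum = seguid_sketch("ACGTACGT")
print(len(checksum))
```

A SHA-1 digest is 20 bytes, which base64 encodes to 27 characters once the padding is stripped, matching the length of the checksums shown in this section.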

Once again, working with the orchids GenBank file:

from Bio import SeqIO
from Bio.SeqUtils.CheckSum import seguid
for record in SeqIO.parse("ls_orchid.gbk", "genbank"):
    print(record.id, seguid(record.seq))

This should give:

Z78533.1 JUEoWn6DPhgZ9nAyowsgtoD9TTo
Z78532.1 MN/s0q9zDoCVEEc+k/IFwCNF2pY
...
Z78439.1 H+JfaShya/4yyAj7IbMqgNkxdxQ

Now, recall the Bio.SeqIO.to_dict() function's key_function argument expects a function which turns a SeqRecord into a string. We can't use the seguid() function
directly because it expects to be given a Seq object (or a string). However, we can use Python's lambda feature to create a "one off" function to give to
Bio.SeqIO.to_dict() instead:

>>> from Bio import SeqIO
>>> from Bio.SeqUtils.CheckSum import seguid
>>> seguid_dict = SeqIO.to_dict(SeqIO.parse("ls_orchid.gbk", "genbank"),
...                             lambda rec: seguid(rec.seq))
>>> record = seguid_dict["MN/s0q9zDoCVEEc+k/IFwCNF2pY"]
>>> print(record.id)
Z78532.1
>>> print(record.description)
C.californicum 5.8S rRNA gene and ITS1 and ITS2 DNA

That should have retrieved the record Z78532.1, the second entry in the file.

5.4.2 Sequence files as Dictionaries: Indexed files
As the previous couple of examples tried to illustrate, using Bio.SeqIO.to_dict() is very flexible. However, because it holds everything in memory, the size of file you
can work with is limited by your computer's RAM. In general, this will only work on small to medium files.

For larger files you should consider Bio.SeqIO.index(), which works a little differently. Although it still returns a dictionary-like object, this does not keep everything in
memory. Instead, it just records where each record is within the file; when you ask for a particular record, it then parses it on demand.

As an example, let's use the same GenBank file as before:

>>> from Bio import SeqIO
>>> orchid_dict = SeqIO.index("ls_orchid.gbk", "genbank")
>>> len(orchid_dict)
94

>>> orchid_dict.keys()
['Z78484.1', 'Z78464.1', 'Z78455.1', 'Z78442.1', 'Z78532.1', 'Z78453.1', ..., 'Z78471.1']

>>> seq_record = orchid_dict["Z78475.1"]
>>> print(seq_record.description)
P.supardii 5.8S rRNA gene and ITS1 and ITS2 DNA
>>> seq_record.seq
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGATCACAT...GGT', IUPACAmbiguousDNA())
>>> orchid_dict.close()

Note that Bio.SeqIO.index() won't take a handle, but only a filename. There are good reasons for this, but it is a little technical. The second argument is the file format (a
lower case string as used in the other Bio.SeqIO functions). You can use many other simple file formats, including FASTA and FASTQ files (see the example in
Section 20.1.11). However, alignment formats like PHYLIP or Clustal are not supported. Finally as an optional argument you can supply an alphabet, or a key function.

Here is the same example using the FASTA file; all we change is the filename and the format name:

>>> from Bio import SeqIO
>>> orchid_dict = SeqIO.index("ls_orchid.fasta", "fasta")
>>> len(orchid_dict)
94
>>> orchid_dict.keys()
['gi|2765596|emb|Z78471.1|PDZ78471', 'gi|2765646|emb|Z78521.1|CCZ78521', ...
..., 'gi|2765613|emb|Z78488.1|PTZ78488', 'gi|2765583|emb|Z78458.1|PHZ78458']

5.4.2.1 Specifying the dictionary keys

Suppose you want to use the same keys as before? Much like with the Bio.SeqIO.to_dict() example in Section 5.4.1.1, you'll need to write a tiny function to map from
the FASTA identifier (as a string) to the key you want:

def get_acc(identifier):
    """Given a SeqRecord identifier string, return the accession number as a string.

    e.g. "gi|2765613|emb|Z78488.1|PTZ78488" -> "Z78488.1"
    """
    parts = identifier.split("|")
    assert len(parts) == 5 and parts[0] == "gi" and parts[2] == "emb"
    return parts[3]

Then we can give this function to the Bio.SeqIO.index() function to use in building the dictionary:

>>> from Bio import SeqIO
>>> orchid_dict = SeqIO.index("ls_orchid.fasta", "fasta", key_function=get_acc)
>>> print(orchid_dict.keys())
['Z78484.1', 'Z78464.1', 'Z78455.1', 'Z78442.1', 'Z78532.1', 'Z78453.1', ..., 'Z78471.1']

Easy when you know how?

5.4.2.2 Getting the raw data for a record

The dictionary-like object from Bio.SeqIO.index() gives you each entry as a SeqRecord object. However, it is sometimes useful to be able to get the original raw data
straight from the file. For this use the get_raw() method which takes a single argument (the record identifier) and returns a bytes string (extracted from the file without
modification).

A motivating example is extracting a subset of the records from a large file where either Bio.SeqIO.write() does not (yet) support the output file format (e.g. the plain text
SwissProt file format) or where you need to preserve the text exactly (e.g. GenBank or EMBL output from Biopython does not yet preserve every last bit of annotation).

Let's suppose you have downloaded the whole of UniProt in the plain text SwissProt file format from their FTP site
(ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz) and uncompressed it as the file uniprot_sprot.dat,
and you want to extract just a few records from it:

>>> from Bio import SeqIO
>>> uniprot = SeqIO.index("uniprot_sprot.dat", "swiss")
>>> with open("selected.dat", "wb") as out_handle:
...     for acc in ["P33487", "P19801", "P13689", "Q8JZQ5", "Q9TRC7"]:
...         out_handle.write(uniprot.get_raw(acc))
...

Note with Python 3 onwards, we have to open the file for writing in binary mode because the get_raw() method returns bytes strings.

There is a longer example in Section 20.1.5 using the SeqIO.index() function to sort a large sequence file (without loading everything into memory at once).

5.4.3 Sequence files as Dictionaries: Database indexed files

Biopython 1.57 introduced an alternative, Bio.SeqIO.index_db(), which can work on even extremely large files since it stores the record information as a file on disk
(using an SQLite3 database) rather than in memory. Also, you can index multiple files together (providing all the record identifiers are unique).

The Bio.SeqIO.index_db() function takes three required arguments:

Index filename, we suggest using something ending .idx. This index file is actually an SQLite3 database.
List of sequence filenames to index (or a single filename).
File format (lower case string as used in the rest of the SeqIO module).

As an example, consider the GenBank flat file releases from the NCBI FTP site, ftp://ftp.ncbi.nih.gov/genbank/, which are gzip compressed GenBank files.

As of GenBank release 210, there are 38 files making up the viral sequences, gbvrl1.seq, ..., gbvrl38.seq, taking about 8GB on disk once decompressed, and containing
in total nearly two million records.

If you were interested in the viruses, you could download all the virus files from the command line very easily with the rsync command, and then decompress them with
gunzip:

# For illustration only, see reduced example below
$ rsync -avP "ftp.ncbi.nih.gov::genbank/gbvrl*.seq.gz" .
$ gunzip gbvrl*.seq.gz

Unless you care about viruses, that's a lot of data to download just for this example, so let's download just the first four chunks (about 25MB each compressed), and
decompress them (taking in all about 1GB of space):

# Reduced example, download only the first four chunks
$ curl -O ftp://ftp.ncbi.nih.gov/genbank/gbvrl1.seq.gz
$ curl -O ftp://ftp.ncbi.nih.gov/genbank/gbvrl2.seq.gz
$ curl -O ftp://ftp.ncbi.nih.gov/genbank/gbvrl3.seq.gz
$ curl -O ftp://ftp.ncbi.nih.gov/genbank/gbvrl4.seq.gz
$ gunzip gbvrl*.seq.gz

Now, in Python, index these GenBank files as follows:

>>> import glob
>>> from Bio import SeqIO
>>> files = glob.glob("gbvrl*.seq")
>>> print("%i files to index" % len(files))
4 files to index
>>> gb_vrl = SeqIO.index_db("gbvrl.idx", files, "genbank")
>>> print("%i sequences indexed" % len(gb_vrl))
272960 sequences indexed

Indexing the full set of virus GenBank files took about ten minutes on my machine; just the first four files took about a minute or so.

However, once done, repeating this will reload the index file gbvrl.idx in a fraction of a second.

You can use the index as a read-only Python dictionary without having to worry about which file the sequence comes from, e.g.

>>> print(gb_vrl["AB811634.1"].description)
Equine encephalosis virus NS3 gene, complete cds, isolate: Kimron1.

5.4.3.1 Getting the raw data for a record

Just as with the Bio.SeqIO.index() function discussed above in Section 5.4.2.2, the dictionary-like object also lets you get at the raw bytes of each record:

>>> print(gb_vrl.get_raw("AB811634.1"))
LOCUS       AB811634                 723 bp    RNA     linear   VRL 17-JUN-2015
DEFINITION  Equine encephalosis virus NS3 gene, complete cds, isolate: Kimron1.
ACCESSION   AB811634
...
//

5.4.4 Indexing compressed files

Very often when you are indexing a sequence file it can be quite large, so you may want to compress it on disk. Unfortunately efficient random access is difficult with
the more common file formats like gzip and bzip2. In this setting, BGZF (Blocked GNU Zip Format) can be very helpful. This is a variant of gzip (and can be
decompressed using standard gzip tools) popularised by the BAM file format, samtools, and tabix.

To create a BGZF compressed file you can use the command line tool bgzip which comes with samtools. In our examples we use a filename extension *.bgz, so they can
be distinguished from normal gzipped files (named *.gz). You can also use the Bio.bgzf module to read and write BGZF files from within Python.

The Bio.SeqIO.index() and Bio.SeqIO.index_db() functions can both be used with BGZF compressed files. For example, if you started with an uncompressed GenBank file:

>>> from Bio import SeqIO
>>> orchid_dict = SeqIO.index("ls_orchid.gbk", "genbank")
>>> len(orchid_dict)
94
>>> orchid_dict.close()

You could compress this (while keeping the original file) at the command line using the following command, but don't worry, the compressed file is already included
with the other example files:

$ bgzip -c ls_orchid.gbk > ls_orchid.gbk.bgz

You can use the compressed file in exactly the same way:
>>> from Bio import SeqIO
>>> orchid_dict = SeqIO.index("ls_orchid.gbk.bgz", "genbank")
>>> len(orchid_dict)
94
>>> orchid_dict.close()

or:

>>> from Bio import SeqIO
>>> orchid_dict = SeqIO.index_db("ls_orchid.gbk.bgz.idx", "ls_orchid.gbk.bgz", "genbank")
>>> len(orchid_dict)
94
>>> orchid_dict.close()

The SeqIO indexing automatically detects the BGZF compression. Note that you can't use the same index file for the uncompressed and compressed files.

5.4.5 Discussion
So, which of these methods should you use and why? It depends on what you are trying to do (and how much data you are dealing with). However, in general picking
Bio.SeqIO.index() is a good starting point. If you are dealing with millions of records, multiple files, or repeated analyses, then look at Bio.SeqIO.index_db().

Reasons to choose Bio.SeqIO.to_dict() over either Bio.SeqIO.index() or Bio.SeqIO.index_db() boil down to a need for flexibility despite its high memory needs. The
advantage of storing the SeqRecord objects in memory is they can be changed, added to, or removed at will. In addition to the downside of high memory consumption,
indexing can also take longer because all the records must be fully parsed.

Both Bio.SeqIO.index() and Bio.SeqIO.index_db() only parse records on demand. When indexing, they scan the file once looking for the start of each record and do as
little work as possible to extract the identifier.
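The "scan once, parse on demand" idea can be sketched for a FASTA file with nothing but the standard library. The helpers build_offset_index and fetch_raw below are hypothetical and much simpler than Biopython's indexing code; they only show how storing a byte offset per identifier allows random access without holding any records in memory.

```python
import os
import tempfile


def build_offset_index(path):
    """Map each FASTA identifier to the byte offset of its '>' line."""
    offsets = {}
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                # The identifier is the first word after the '>' marker.
                identifier = line[1:].split(None, 1)[0].decode("ascii")
                offsets[identifier] = offset
    return offsets


def fetch_raw(path, offsets, identifier):
    """Seek to one record and return its raw bytes, unmodified."""
    with open(path, "rb") as handle:
        handle.seek(offsets[identifier])
        lines = [handle.readline()]
        while True:
            line = handle.readline()
            if not line or line.startswith(b">"):
                break
            lines.append(line)
    return b"".join(lines)


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo.fasta")
    with open(path, "wb") as out_handle:
        out_handle.write(b">alpha first\nACGT\nAC\n>beta second\nGGCC\n")
    offsets = build_offset_index(path)
    beta_raw = fetch_raw(path, offsets, "beta")

print(offsets)
print(beta_raw)
```

The index holds one integer per record, which is why the memory cost stays small even when the records themselves are large.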

Reasons to choose Bio.SeqIO.index() over Bio.SeqIO.index_db() include:

Faster to build the index (more noticeable in simple file formats).
Slightly faster access as SeqRecord objects (but the difference is only really noticeable for simple to parse file formats).
Can use any immutable Python object as the dictionary keys (e.g. a tuple of strings, or a frozen set), not just strings.
Don't need to worry about the index database being out of date if the sequence file being indexed has changed.

Reasons to choose Bio.SeqIO.index_db() over Bio.SeqIO.index() include:

Not memory limited. This is already important with files from second generation sequencing where 10s of millions of sequences are common, and using
Bio.SeqIO.index() can require more than 4GB of RAM and therefore a 64bit version of Python.
Because the index is kept on disk, it can be reused. Although building the index database file takes longer, if you have a script which will be rerun on the same
data files in future, this could save time in the long run.
Indexing multiple files together.
The get_raw() method can be much faster, since for most file formats the length of each record is stored as well as its offset.

5.5 Writing Sequence Files
We've talked about using Bio.SeqIO.parse() for sequence input (reading files), and now we'll look at Bio.SeqIO.write() which is for sequence output (writing files).
This is a function taking three arguments: some SeqRecord objects, a handle or filename to write to, and a sequence format.

Here is an example, where we start by creating a few SeqRecord objects the hard way (by hand, rather than by loading them from a file):

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import generic_protein

rec1=SeqRecord(Seq("MMYQQGCFAGGTVLRLAKDLAENNRGARVLVVCSEITAVTFRGPSETHLDSMVGQALFGD"\
+"GAGAVIVGSDPDLSVERPLYELVWTGATLLPDSEGAIDGHLREVGLTFHLLKDVPGLISK"\
+"NIEKSLKEAFTPLGISDWNSTFWIAHPGGPAILDQVEAKLGLKEEKMRATREVLSEYGNM"\
+"SSAC",generic_protein),
id="gi|14150838|gb|AAK54648.1|AF376133_1",
description="chalcone synthase [Cucumis sativus]")

rec2=SeqRecord(Seq("YPDYYFRITNREHKAELKEKFQRMCDKSMIKKRYMYLTEEILKENPSMCEYMAPSLDARQ"\
+"DMVVVEIPKLGKEAAVKAIKEWGQ",generic_protein),
id="gi|13919613|gb|AAK33142.1|",
description="chalcone synthase [Fragaria vesca subsp. bracteata]")

rec3=SeqRecord(Seq("MVTVEEFRRAQCAEGPATVMAIGTATPSNCVDQSTYPDYYFRITNSEHKVELKEKFKRMC"\
+"EKSMIKKRYMHLTEEILKENPNICAYMAPSLDARQDIVVVEVPKLGKEAAQKAIKEWGQP"\
+"KSKITHLVFCTTSGVDMPGCDYQLTKLLGLRPSVKRFMMYQQGCFAGGTVLRMAKDLAEN"\
+"NKGARVLVVCSEITAVTFRGPNDTHLDSLVGQALFGDGAAAVIIGSDPIPEVERPLFELV"\
+"SAAQTLLPDSEGAIDGHLREVGLTFHLLKDVPGLISKNIEKSLVEAFQPLGISDWNSLFW"\
+"IAHPGGPAILDQVELKLGLKQEKLKATRKVLSNYGNMSSACVLFILDEMRKASAKEGLGT"\
+"TGEGLEWGVLFGFGPGLTVETVVLHSVAT",generic_protein),
id="gi|13925890|gb|AAK49457.1|",
description="chalcone synthase [Nicotiana tabacum]")

my_records = [rec1, rec2, rec3]

Now we have a list of SeqRecord objects, we'll write them to a FASTA format file:

from Bio import SeqIO
SeqIO.write(my_records, "my_example.faa", "fasta")

And if you open this file in your favourite text editor it should look like this:

>gi|14150838|gb|AAK54648.1|AF376133_1 chalcone synthase [Cucumis sativus]
MMYQQGCFAGGTVLRLAKDLAENNRGARVLVVCSEITAVTFRGPSETHLDSMVGQALFGD
GAGAVIVGSDPDLSVERPLYELVWTGATLLPDSEGAIDGHLREVGLTFHLLKDVPGLISK
NIEKSLKEAFTPLGISDWNSTFWIAHPGGPAILDQVEAKLGLKEEKMRATREVLSEYGNM
SSAC
>gi|13919613|gb|AAK33142.1| chalcone synthase [Fragaria vesca subsp. bracteata]
YPDYYFRITNREHKAELKEKFQRMCDKSMIKKRYMYLTEEILKENPSMCEYMAPSLDARQ
DMVVVEIPKLGKEAAVKAIKEWGQ
>gi|13925890|gb|AAK49457.1| chalcone synthase [Nicotiana tabacum]
MVTVEEFRRAQCAEGPATVMAIGTATPSNCVDQSTYPDYYFRITNSEHKVELKEKFKRMC
EKSMIKKRYMHLTEEILKENPNICAYMAPSLDARQDIVVVEVPKLGKEAAQKAIKEWGQP
KSKITHLVFCTTSGVDMPGCDYQLTKLLGLRPSVKRFMMYQQGCFAGGTVLRMAKDLAEN
NKGARVLVVCSEITAVTFRGPNDTHLDSLVGQALFGDGAAAVIIGSDPIPEVERPLFELV
SAAQTLLPDSEGAIDGHLREVGLTFHLLKDVPGLISKNIEKSLVEAFQPLGISDWNSLFW
IAHPGGPAILDQVELKLGLKQEKLKATRKVLSNYGNMSSACVLFILDEMRKASAKEGLGT
TGEGLEWGVLFGFGPGLTVETVVLHSVAT

Suppose you wanted to know how many records the Bio.SeqIO.write() function wrote to the handle? If your records were in a list you could just use len(my_records),
however you can't do that when your records come from a generator/iterator. The Bio.SeqIO.write() function returns the number of SeqRecord objects written to the file.

Note: If you tell the Bio.SeqIO.write() function to write to a file that already exists, the old file will be overwritten without any warning.
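
If silently replacing an existing file worries you, one possible safeguard is to open the output yourself in Python 3's exclusive-creation mode ("x") and pass the handle on, since Bio.SeqIO.write() accepts a handle as well as a filename. The sketch below uses only the standard library; the helper name open_new_file and the demonstration file are illustrative assumptions, not part of Biopython.

```python
import os
import tempfile


def open_new_file(path):
    """Open path for text writing, refusing to replace an existing file.

    Mode "x" (Python 3.3+) raises FileExistsError if the file already exists.
    """
    return open(path, "x")


# Demonstration in a throwaway directory (hypothetical file name).
workdir = tempfile.mkdtemp()
target = os.path.join(workdir, "my_example.faa")

with open_new_file(target) as out_handle:
    # In real use, this handle would be given to
    # SeqIO.write(my_records, out_handle, "fasta").
    out_handle.write(">demo\nACGT\n")

# A second attempt on the same path is refused instead of overwriting.
try:
    open_new_file(target)
    refused = False
except FileExistsError:
    refused = True
```

With this approach the decision to overwrite becomes explicit: you delete or rename the old file first, or fall back to mode "w" on purpose.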

5.5.1 Round trips
Some people like their parsers to be "round-tripable", meaning if you read in a file and write it back out again it is unchanged. This requires that the parser extract
enough information to reproduce the original file exactly. Bio.SeqIO does not aim to do this.

As a trivial example, any line wrapping of the sequence data in FASTA files is allowed. An identical SeqRecord would be given from parsing the following two examples,
which differ only in their line breaks:

>YAL068C-7235.2170 Putative promoter sequence
TACGAGAATAATTTCTCATCATCCAGCTTTAACACAAAATTCGCACAGTTTTCGTTAAGA
GAACTTAACATTTTCTTATGACGTAAATGAAGTTTATATATAAATTTCCTTTTTATTGGA

>YAL068C-7235.2170 Putative promoter sequence
TACGAGAATAATTTCTCATCATCCAGCTTTAACACAAAATTCGCA
CAGTTTTCGTTAAGAGAACTTAACATTTTCTTATGACGTAAATGA
AGTTTATATATAAATTTCCTTTTTATTGGA

To make a round-tripable FASTA parser you would need to keep track of where the sequence line breaks occurred, and this extra information is usually pointless. Instead
Biopython uses a default line wrapping of 60 characters on output. The same problem with white space applies in many other file formats too. Another issue in some
cases is that Biopython does not (yet) preserve every last bit of annotation (e.g. GenBank and EMBL).
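
The line-wrapping point can be illustrated without Biopython. The following is a minimal sketch, assuming a simplified FASTA reader and writer of our own (read_fasta_text and write_fasta_text are hypothetical helpers, not Biopython functions): the same record wrapped at 60 and at 45 characters parses to identical data, and writing it back at the default width only reproduces the file that was already wrapped at 60.

```python
def read_fasta_text(text):
    """Return a list of (title, sequence) tuples from FASTA-formatted text."""
    records = []
    title, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if title is not None:
                records.append((title, "".join(chunks)))
            title, chunks = line[1:], []
        else:
            chunks.append(line)
    if title is not None:
        records.append((title, "".join(chunks)))
    return records


def write_fasta_text(records, width=60):
    """Format (title, sequence) tuples as FASTA, wrapping sequences at width."""
    lines = []
    for title, seq in records:
        lines.append(">" + title)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(lines) + "\n"


title = "YAL068C-7235.2170 Putative promoter sequence"
sequence = (
    "TACGAGAATAATTTCTCATCATCCAGCTTTAACACAAAATTCGCACAGTTTTCGTTAAGA"
    "GAACTTAACATTTTCTTATGACGTAAATGAAGTTTATATATAAATTTCCTTTTTATTGGA"
)
wrapped_60 = write_fasta_text([(title, sequence)], width=60)
wrapped_45 = write_fasta_text([(title, sequence)], width=45)

# Both layouts carry the same information...
same_records = read_fasta_text(wrapped_60) == read_fasta_text(wrapped_45)
# ...so the original line breaks of the 45-column file are lost on output.
rewritten = write_fasta_text(read_fasta_text(wrapped_45))
```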

Occasionally preserving the original layout (with any quirks it may have) is important. See Section 5.4.2.2 about the get_raw() method of the Bio.SeqIO.index()
dictionary-like object for one potential solution.

5.5.2 Converting between sequence file formats

In the previous example we used a list of SeqRecord objects as input to the Bio.SeqIO.write() function, but it will also accept a SeqRecord iterator like we get from
Bio.SeqIO.parse(). This lets us do file conversion by combining these two functions.

For this example we'll read in the GenBank format file ls_orchid.gbk and write it out in FASTA format:
from Bio import SeqIO
records = SeqIO.parse("ls_orchid.gbk", "genbank")
count = SeqIO.write(records, "my_example.fasta", "fasta")
print("Converted %i records" % count)

Still, that is a little bit complicated. So, because file conversion is such a common task, there is a helper function letting you replace that with just:
from Bio import SeqIO
count = SeqIO.convert("ls_orchid.gbk", "genbank", "my_example.fasta", "fasta")
print("Converted %i records" % count)

The Bio.SeqIO.convert() function will take handles or filenames. Watch out though: if the output file already exists, it will overwrite it! To find out more, see the
built-in help:
>>> from Bio import SeqIO
>>> help(SeqIO.convert)
...

In principle, just by changing the filenames and the format names, this code could be used to convert between any file formats available in Biopython. However, writing
some formats requires information (e.g. quality scores) which other file formats don't contain. For example, while you can turn a FASTQ file into a FASTA file, you
can't do the reverse. See also Sections 20.1.9 and 20.1.10 in the cookbook chapter, which look at inter-converting between different FASTQ formats.
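
To see why the conversion only goes one way, here is a small standard-library sketch, assuming the simplest FASTQ layout of exactly four unwrapped lines per record (real FASTQ files can be more awkward, which is one reason to prefer Bio.SeqIO in practice). The function name is hypothetical. Going to FASTA simply discards the quality line; nothing in the FASTA output would let you reconstruct it.

```python
def fastq_text_to_fasta_text(fastq_text):
    """Convert simple four-line FASTQ records to FASTA by dropping qualities."""
    lines = fastq_text.strip().splitlines()
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    out = []
    for i in range(0, len(lines), 4):
        title, seq, plus, qual = lines[i:i + 4]
        if not title.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(seq) != len(qual):
            raise ValueError("sequence and quality lengths differ")
        # The quality string is read and checked, then thrown away.
        out.append(">%s\n%s\n" % (title[1:], seq))
    return "".join(out)


example_fastq = "@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!!!\n"
example_fasta = fastq_text_to_fasta_text(example_fastq)
```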

Finally, as an added incentive for using the Bio.SeqIO.convert() function (on top of the fact your code will be shorter), doing it this way may also be faster! The reason
for this is that the convert function can take advantage of several file format specific optimisations and tricks.

5.5.3 Converting a file of sequences to their reverse complements
Suppose you had a file of nucleotide sequences, and you wanted to turn it into a file containing their reverse complement sequences. This time a little bit of work is
required to transform the SeqRecord objects we get from our input file into something suitable for saving to our output file.

To start with, we'll use Bio.SeqIO.parse() to load some nucleotide sequences from a file, then print out their reverse complements using the Seq object's built-in
.reverse_complement() method (see Section 3.7):

>>> from Bio import SeqIO
>>> for record in SeqIO.parse("ls_orchid.gbk", "genbank"):
...     print(record.id)
...     print(record.seq.reverse_complement())

Now, if we want to save these reverse complements to a file, we'll need to make SeqRecord objects. We can use the SeqRecord object's built-in .reverse_complement()
method (see Section 4.9) but we must decide how to name our new records.

This is an excellent place to demonstrate the power of list comprehensions, which make a list in memory:

>>> from Bio import SeqIO
>>> records = [rec.reverse_complement(id="rc_" + rec.id, description="reverse complement")
...            for rec in SeqIO.parse("ls_orchid.fasta", "fasta")]
>>> len(records)
94

Now list comprehensions have a nice trick up their sleeves: you can add a conditional statement:
>>> records = [rec.reverse_complement(id="rc_" + rec.id, description="reverse complement")
...            for rec in SeqIO.parse("ls_orchid.fasta", "fasta") if len(rec) < 700]
>>> len(records)
18

That would create an in-memory list of reverse complement records where the sequence length was under 700 base pairs. However, we can do exactly the same with a
generator expression, with the advantage that this does not create a list of all the records in memory at once:
>>> records = (rec.reverse_complement(id="rc_" + rec.id, description="reverse complement")
...            for rec in SeqIO.parse("ls_orchid.fasta", "fasta") if len(rec) < 700)

As a complete example:

>>> from Bio import SeqIO
>>> records = (rec.reverse_complement(id="rc_" + rec.id, description="reverse complement")
...            for rec in SeqIO.parse("ls_orchid.fasta", "fasta") if len(rec) < 700)
>>> SeqIO.write(records, "rev_comp.fasta", "fasta")
18

There is a related example in Section 20.1.3, translating each record in a FASTA file from nucleotides to amino acids.
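
The difference between the list comprehension and the generator expression can be seen in plain Python. In this sketch (Python 3, no Biopython), read_records is a hypothetical stand-in for Bio.SeqIO.parse() that logs each record as it is produced, and reverse_complement is a simplified string version handling only unambiguous DNA letters. Nothing is read until the generator expression is actually iterated.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse complement of a plain DNA string (unambiguous letters only)."""
    return seq.translate(_COMPLEMENT)[::-1]


def read_records(log):
    """Yield (id, sequence) tuples, noting each one in log as it is produced."""
    for name, seq in [("a", "AACG"), ("b", "ACGTACGTAC"), ("c", "GGT")]:
        log.append(name)
        yield name, seq


log = []
short_rc = (("rc_" + name, reverse_complement(seq))
            for name, seq in read_records(log) if len(seq) < 5)

produced_before_iteration = list(log)  # still empty: nothing read so far
results = list(short_rc)               # iterating pulls the records one by one
```

Here results ends up holding only the two short records, while the log shows that all three input records were visited during iteration and none before it.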

5.5.4 Getting your SeqRecord objects as formatted strings
Suppose that you don't really want to write your records to a file or handle; instead you want a string containing the records in a particular file format. The Bio.SeqIO
interface is based on handles, but Python has a useful built-in module which provides a string-based handle.

For an example of how you might use this, let's load in a bunch of SeqRecord objects from our orchids GenBank file, and create a string containing the records in FASTA
format:
from Bio import SeqIO
try:
    from StringIO import StringIO  # Python 2
except ImportError:
    from io import StringIO  # Python 3
records = SeqIO.parse("ls_orchid.gbk", "genbank")
out_handle = StringIO()
SeqIO.write(records, out_handle, "fasta")
fasta_data = out_handle.getvalue()
print(fasta_data)

This isn't entirely straightforward the first time you see it! On the bright side, for the special case where you would like a string containing a single record in a particular
file format, use the SeqRecord class' format() method (see Section 4.6).
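
The reason a string-based handle works is that a handle-based writer only ever calls the handle's write() method. As a rough standard-library illustration (write_tab is a hypothetical function mimicking a simple tab-separated writer, not Biopython code), the same function can target an in-memory buffer, and it can report how many records it consumed even when they come from a generator:

```python
import io


def write_tab(records, handle):
    """Write (id, sequence) tuples as tab-separated lines; return the count."""
    count = 0
    for name, seq in records:
        handle.write("%s\t%s\n" % (name, seq))
        count += 1
    return count


# A generator has no len(), so the returned count is the only tally we get.
records = ((name, seq) for name, seq in [("a", "ACGT"), ("b", "GG")])

buffer = io.StringIO()
written = write_tab(records, buffer)
text = buffer.getvalue()
```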

Note that although we don't encourage it, you can use the format() method to write to a file, for example something like this:
from Bio import SeqIO
with open("ls_orchid_long.tab", "w") as out_handle:
    for record in SeqIO.parse("ls_orchid.gbk", "genbank"):
        if len(record) > 100:
            out_handle.write(record.format("tab"))

While this style of code will work for a simple sequential file format like FASTA or the simple tab separated format used here, it will not work for more complex or
interlaced file formats. This is why we still recommend using Bio.SeqIO.write(), as in the following example:
from Bio import SeqIO
records = (rec for rec in SeqIO.parse("ls_orchid.gbk", "genbank") if len(rec) > 100)
SeqIO.write(records, "ls_orchid.tab", "tab")

Making a single call to SeqIO.write(...) is also much quicker than multiple calls to the SeqRecord.format(...) method.

5.6 Low level FASTA and FASTQ parsers
Working with the low-level SimpleFastaParser or FastqGeneralIterator is often more practical than Bio.SeqIO.parse when dealing with large high-throughput FASTA or
FASTQ sequencing files where speed matters. As noted in the introduction to this chapter, the file-format neutral Bio.SeqIO interface has the overhead of creating many
objects even for simple formats like FASTA.

When parsing FASTA files, internally Bio.SeqIO.parse() calls the low-level SimpleFastaParser with the file handle. You can use this directly: it iterates over the file
handle, returning each record as a tuple of two strings, the title line (everything after the > character) and the sequence (as a plain string):
>>> from Bio.SeqIO.FastaIO import SimpleFastaParser
>>> count = 0
>>> total_len = 0
>>> with open("ls_orchid.fasta") as in_handle:
...     for title, seq in SimpleFastaParser(in_handle):
...         count += 1
...         total_len += len(seq)
...
>>> print("%i records with total sequence length %i" % (count, total_len))
94 records with total sequence length 67518

As long as you don't care about line wrapping (and you probably don't for short read high-throughput data), then outputting FASTA format from these strings is also very
fast:

...
out_handle.write(">%s\n%s\n" % (title, seq))
...

Likewise, when parsing FASTQ files, internally Bio.SeqIO.parse() calls the low-level FastqGeneralIterator with the file handle. If you don't need the quality scores
turned into integers, or can work with them as ASCII strings, this is ideal:

>>> from Bio.SeqIO.QualityIO import FastqGeneralIterator
>>> count = 0
>>> total_len = 0
>>> with open("example.fastq") as in_handle:
...     for title, seq, qual in FastqGeneralIterator(in_handle):
...         count += 1
...         total_len += len(seq)
...
>>> print("%i records with total sequence length %i" % (count, total_len))
3 records with total sequence length 75

There are more examples of this in the Cookbook (Chapter 20), including how to output FASTQ efficiently from strings using this code snippet:
...
out_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
...
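
If you do occasionally need numbers from the ASCII quality string, decoding is cheap to do by hand. The sketch below assumes the Sanger-style encoding (PHRED scores with an ASCII offset of 33); older Solexa and Illumina files use other offsets, so treat the constant as an assumption to check against your data. The helper name and the tiny record list are made up for illustration.

```python
def decode_sanger_qualities(qual):
    """Convert a Sanger-style (PHRED+33) quality string to a list of integers."""
    return [ord(char) - 33 for char in qual]


# Hypothetical (title, sequence, quality) tuples, in the same shape as those
# yielded by a low-level FASTQ iterator.
records = [("r1", "ACG", "III"), ("r2", "TTT", "!!I")]

# Keep only reads whose worst base has PHRED quality of at least 20.
kept = [(title, seq, qual) for title, seq, qual in records
        if min(decode_sanger_qualities(qual)) >= 20]

# Re-emit the survivors with the same string formatting shown above.
fastq_text = "".join("@%s\n%s\n+\n%s\n" % rec for rec in kept)
```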

Chapter 6 Multiple Sequence Alignment objects
This chapter is about Multiple Sequence Alignments, by which we mean a collection of multiple sequences which have been aligned together, usually with the insertion
of gap characters, and addition of leading or trailing gaps, such that all the sequence strings are the same length. Such an alignment can be regarded as a matrix of letters,
where each row is held as a SeqRecord object internally.

We will introduce the MultipleSeqAlignment object which holds this kind of data, and the Bio.AlignIO module for reading and writing them as various file formats
(following the design of the Bio.SeqIO module from the previous chapter). Note that both Bio.SeqIO and Bio.AlignIO can read and write sequence alignment files. The
appropriate choice will depend largely on what you want to do with the data.

The final part of this chapter is about our command line wrappers for common multiple sequence alignment tools like ClustalW and MUSCLE.

6.1 Parsing or Reading Sequence Alignments
We have two functions for reading in sequence alignments, Bio.AlignIO.read() and Bio.AlignIO.parse(), which following the convention introduced in Bio.SeqIO are for
files containing one or multiple alignments respectively.

Using Bio.AlignIO.parse() will return an iterator which gives MultipleSeqAlignment objects. Iterators are typically used in a for loop. Examples of situations where you
will have multiple different alignments include resampled alignments from the PHYLIP tool seqboot, or multiple pairwise alignments from the EMBOSS tools water or
needle, or Bill Pearson's FASTA tools.

However, in many situations you will be dealing with files which contain only a single alignment. In this case, you should use the Bio.AlignIO.read() function which
returns a single MultipleSeqAlignment object.
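
The relationship between the two functions follows a common Python pattern: a "read" that wraps a "parse" iterator and insists on exactly one item. The following standard-library sketch shows that pattern only; read_exactly_one is a hypothetical name and this is not Biopython's implementation. To the best of our knowledge Bio.AlignIO.read() likewise raises a ValueError when the file holds no alignment or more than one, but the exact messages are not reproduced here.

```python
def read_exactly_one(iterator):
    """Return the only item from iterator, raising ValueError otherwise."""
    iterator = iter(iterator)
    try:
        first = next(iterator)
    except StopIteration:
        raise ValueError("No records found")
    try:
        next(iterator)
    except StopIteration:
        # Exactly one item was available, which is what we wanted.
        return first
    raise ValueError("More than one record found")


single = read_exactly_one(["only alignment"])
```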

Both functions expect two mandatory arguments:

1. The first argument is a handle to read the data from, typically an open file (see Section 24.1), or a filename.
2. The second argument is a lower case string specifying the alignment format. As in Bio.SeqIO we don't try and guess the file format for you! See
http://biopython.org/wiki/AlignIO for a full listing of supported formats.

There is also an optional seq_count argument which is discussed in Section 6.1.3 below for dealing with ambiguous file formats which may contain more than one
alignment.

A further optional alphabet argument allows you to specify the expected alphabet. This can be useful as many alignment file formats do not explicitly label the
sequences as RNA, DNA or protein, which means Bio.AlignIO will default to using a generic alphabet.

6.1.1 Single Alignments
As an example, consider the following annotation-rich protein alignment in the PFAM or Stockholm file format:
# STOCKHOLM 1.0
#=GS COATB_BPIKE/30-81  AC P03620.1
#=GS COATB_BPIKE/30-81  DR PDB; 1ifl ; 1-52;
#=GS Q9T0Q8_BPIKE/1-52  AC Q9T0Q8.1
#=GS COATB_BPI22/32-83  AC P15416.1
#=GS COATB_BPM13/24-72  AC P69541.1
#=GS COATB_BPM13/24-72  DR PDB; 2cpb ; 1-49;
#=GS COATB_BPM13/24-72  DR PDB; 2cps ; 1-49;
#=GS COATB_BPZJ2/1-49   AC P03618.1
#=GS Q9T0Q9_BPFD/1-49   AC Q9T0Q9.1
#=GS Q9T0Q9_BPFD/1-49   DR PDB; 1nh4 A; 1-49;
#=GS COATB_BPIF1/22-73  AC P03619.2
#=GS COATB_BPIF1/22-73  DR PDB; 1ifk ; 1-50;
COATB_BPIKE/30-81             AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSSKA
#=GR COATB_BPIKE/30-81  SS    HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
Q9T0Q8_BPIKE/1-52             AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVSRA
COATB_BPI22/32-83             DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSSKA
COATB_BPM13/24-72             AEGDDP...AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA
#=GR COATB_BPM13/24-72  SS    ST...CHCHHHHCCCCTCCCTTCHHHHHHHHHHHHHHHHHHHHCTT
COATB_BPZJ2/1-49              AEGDDP...AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFASKA
Q9T0Q9_BPFD/1-49              AEGDDP...AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA
#=GR Q9T0Q9_BPFD/1-49   SS    ...HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
COATB_BPIF1/22-73             FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVSRA
#=GR COATB_BPIF1/22-73  SS    XXHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHH
#=GC SS_cons                  XHHHHHHHHHHHHHHHCHHHHHHHHCHHHHHHHHHHHHHHHHHHHHHHHC
#=GC seq_cons                 AEssss...AptAhDSLpspAThIu.sWshVsslVsAsluIKLFKKFsSKA
//

This is the seed alignment for the Phage_Coat_Gp8 (PF05371) PFAM entry, downloaded from a now out of date release of PFAM from http://pfam.sanger.ac.uk/. We
can load this file as follows (assuming it has been saved to disk as PF05371_seed.sth in the current working directory):
>>> from Bio import AlignIO
>>> alignment = AlignIO.read("PF05371_seed.sth", "stockholm")

This code will print out a summary of the alignment:

>>> print(alignment)
SingleLetterAlphabet() alignment with 7 rows and 52 columns
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRL...SKA COATB_BPIKE/30-81
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKL...SRA Q9T0Q8_BPIKE/1-52
DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRL...SKA COATB_BPI22/32-83
AEGDDP---AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKL...SKA COATB_BPM13/24-72
AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKL...SKA COATB_BPZJ2/1-49
AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKL...SKA Q9T0Q9_BPFD/1-49
FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKL...SRA COATB_BPIF1/22-73

You'll notice in the above output the sequences have been truncated. We could instead write our own code to format this as we please by iterating over the rows as
SeqRecord objects:

>>> from Bio import AlignIO
>>> alignment = AlignIO.read("PF05371_seed.sth", "stockholm")
>>> print("Alignment length %i" % alignment.get_alignment_length())
Alignment length 52
>>> for record in alignment:
...     print("%s - %s" % (record.seq, record.id))
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSSKA - COATB_BPIKE/30-81
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVSRA - Q9T0Q8_BPIKE/1-52
DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSSKA - COATB_BPI22/32-83
AEGDDP---AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA - COATB_BPM13/24-72
AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFASKA - COATB_BPZJ2/1-49
AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA - Q9T0Q9_BPFD/1-49
FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVSRA - COATB_BPIF1/22-73

You could also use the alignment object's format method to show it in a particular file format; see Section 6.2.2 for details.

Did you notice in the raw file above that several of the sequences include database cross-references to the PDB and the associated known secondary structure? Try this:
>>> for record in alignment:
...     if record.dbxrefs:
...         print("%s %s" % (record.id, record.dbxrefs))
COATB_BPIKE/30-81 ['PDB; 1ifl ; 1-52;']
COATB_BPM13/24-72 ['PDB; 2cpb ; 1-49;', 'PDB; 2cps ; 1-49;']
Q9T0Q9_BPFD/1-49 ['PDB; 1nh4 A; 1-49;']
COATB_BPIF1/22-73 ['PDB; 1ifk ; 1-50;']

To have a look at all the sequence annotation, try this:
>>> for record in alignment:
...     print(record)

Sanger provide a nice web interface at http://pfam.sanger.ac.uk/family?acc=PF05371 which will actually let you download this alignment in several other formats. This
is what the file looks like in the FASTA file format:
>COATB_BPIKE/30-81
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSSKA
>Q9T0Q8_BPIKE/1-52
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVSRA
>COATB_BPI22/32-83
DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSSKA
>COATB_BPM13/24-72
AEGDDP---AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA
>COATB_BPZJ2/1-49
AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFASKA
>Q9T0Q9_BPFD/1-49
AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA
>COATB_BPIF1/22-73
FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVSRA

Note the website should have an option about showing gaps as periods (dots) or dashes; we've shown dashes above. Assuming you download and save this as file
PF05371_seed.faa then you can load it with almost exactly the same code:
from Bio import AlignIO
alignment = AlignIO.read("PF05371_seed.faa", "fasta")
print(alignment)

All that has changed in this code is the filename and the format string. You'll get the same output as before: the sequences and record identifiers are the same. However,
as you should expect, if you check each SeqRecord there are no annotations or database cross-references, because these are not included in the FASTA file format.

Note that rather than using the Sanger website, you could have used Bio.AlignIO to convert the original Stockholm format file into a FASTA file yourself (see below).

With any supported file format, you can load an alignment in exactly the same way just by changing the format string. For example, use "phylip" for PHYLIP files,
"nexus" for NEXUS files or "emboss" for the alignments output by the EMBOSS tools. There is a full listing on the wiki page (http://biopython.org/wiki/AlignIO) and
in the built-in documentation (also online):
>>> from Bio import AlignIO
>>> help(AlignIO)
...

6.1.2 Multiple Alignments
The previous section focused on reading files containing a single alignment. In general however, files can contain more than one alignment, and to read these files we
must use the Bio.AlignIO.parse() function.

Suppose you have a small alignment in PHYLIP format:

    5    6
Alpha     AACAAC
Beta      AACCCC
Gamma     ACCAAC
Delta     CCACCA
Epsilon   CCAAAC

If you wanted to bootstrap a phylogenetic tree using the PHYLIP tools, one of the steps would be to create a set of many resampled alignments using the tool seqboot.
This would give output something like this, which has been abbreviated for conciseness:

    5    6
Alpha     AAACCA
Beta      AAACCC
Gamma     ACCCCA
Delta     CCCAAC
Epsilon   CCCAAA
    5    6
Alpha     AAACAA
Beta      AAACCC
Gamma     ACCCAA
Delta     CCCACC
Epsilon   CCCAAA
    5    6
Alpha     AAAAAC
Beta      AAACCC
Gamma     AACAAC
Delta     CCCCCA
Epsilon   CCCAAC
...
    5    6
Alpha     AAAACC
Beta      ACCCCC
Gamma     AAAACC
Delta     CCCCAA
Epsilon   CAAACC

If you wanted to read this in using Bio.AlignIO you could use:
from Bio import AlignIO
alignments = AlignIO.parse("resampled.phy", "phylip")
for alignment in alignments:
    print(alignment)
    print("")

This would give the following output, again abbreviated for display:
SingleLetterAlphabet() alignment with 5 rows and 6 columns
AAACCA Alpha
AAACCC Beta
ACCCCA Gamma
CCCAAC Delta
CCCAAA Epsilon

SingleLetterAlphabet() alignment with 5 rows and 6 columns
AAACAA Alpha
AAACCC Beta
ACCCAA Gamma
CCCACC Delta
CCCAAA Epsilon

SingleLetterAlphabet() alignment with 5 rows and 6 columns
AAAAAC Alpha
AAACCC Beta
AACAAC Gamma
CCCCCA Delta
CCCAAC Epsilon

...

SingleLetterAlphabet() alignment with 5 rows and 6 columns
AAAACC Alpha
ACCCCC Beta
AAAACC Gamma
CCCCAA Delta
CAAACC Epsilon

As with the function Bio.SeqIO.parse(), using Bio.AlignIO.parse() returns an iterator. If you want to keep all the alignments in memory at once, which will allow you to
access them in any order, then turn the iterator into a list:

from Bio import AlignIO
alignments = list(AlignIO.parse("resampled.phy", "phylip"))
last_align = alignments[-1]
first_align = alignments[0]

6.1.3 Ambiguous Alignments
Many alignment file formats can explicitly store more than one alignment, and the division between each alignment is clear. However, when a general sequence file
format has been used there is no such block structure. The most common such situation is when alignments have been saved in the FASTA file format. For example
consider the following:

>Alpha
ACTACGACTAGCTCAG--G
>Beta
ACTACCGCTAGCTCAGAAG
>Gamma
ACTACGGCTAGCACAGAAG
>Alpha
ACTACGACTAGCTCAG--G
>Beta
ACTACCGCTAGCTCAGAAG
>Gamma
ACTACGGCTAGCACAGAAG

This could be a single alignment containing six sequences (with repeated identifiers). Or, judging from the identifiers, this is probably two different alignments each with
three sequences, which happen to all have the same length.

What about this next example?
>Alpha
ACTACGACTAGCTCAG--G
>Beta
ACTACCGCTAGCTCAGAAG
>Alpha
ACTACGACTAGCTCAG--G
>Gamma
ACTACGGCTAGCACAGAAG
>Alpha
ACTACGACTAGCTCAG--G
>Delta
ACTACGGCTAGCACAGAAG

Again, this could be a single alignment with six sequences. However this time, based on the identifiers, we might guess this is three pairwise alignments which by chance
have all got the same lengths.

This final example is similar:
>Alpha
ACTACGACTAGCTCAG--G
>XXX
ACTACCGCTAGCTCAGAAG
>Alpha
ACTACGACTAGCTCAGG
>YYY
ACTACGGCAAGCACAGG
>Alpha
--ACTACGAC--TAGCTCAGG
>ZZZ
GGACTACGACAATAGCTCAGG

In this third example, because of the differing lengths, this cannot be treated as a single alignment containing all six records. However, it could be three pairwise
alignments.

Clearly trying to store more than one alignment in a FASTA file is not ideal. However, if you are forced to deal with these as input files Bio.AlignIO can cope with the
most common situation where all the alignments have the same number of records. One example of this is a collection of pairwise alignments, which can be produced by
the EMBOSS tools needle and water, although in this situation Bio.AlignIO should be able to understand their native output using "emboss" as the format string.

To interpret these FASTA examples as several separate alignments, we can use Bio.AlignIO.parse() with the optional seq_count argument which specifies how many
sequences are expected in each alignment (in these examples, 3, 2 and 2 respectively). For example, using the third example as the input data:
from Bio import AlignIO
# handle is an open handle (or a filename) for the third example above
for alignment in AlignIO.parse(handle, "fasta", seq_count=2):
    print("Alignment length %i" % alignment.get_alignment_length())
    for record in alignment:
        print("%s - %s" % (record.seq, record.id))
    print("")

giving:

Alignment length 19
ACTACGACTAGCTCAG--G - Alpha
ACTACCGCTAGCTCAGAAG - XXX

Alignment length 17
ACTACGACTAGCTCAGG - Alpha
ACTACGGCAAGCACAGG - YYY

Alignment length 21
--ACTACGAC--TAGCTCAGG - Alpha
GGACTACGACAATAGCTCAGG - ZZZ

Using Bio.AlignIO.read() or Bio.AlignIO.parse() without the seq_count argument would give a single alignment containing all six records for the first two examples.
For the third example, an exception would be raised because the lengths differ, preventing them being turned into a single alignment.
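
The effect of seq_count can be mimicked on plain (identifier, sequence) tuples. This is only a sketch of the idea under our own assumptions (batch_alignments is a hypothetical helper, not the Bio.AlignIO code): records are cut into consecutive groups of seq_count rows, and each group must have a single common length to count as an alignment.

```python
def batch_alignments(records, seq_count):
    """Group (id, sequence) tuples into alignments of seq_count rows each."""
    records = list(records)
    if len(records) % seq_count != 0:
        raise ValueError("Number of records is not a multiple of seq_count")
    for start in range(0, len(records), seq_count):
        batch = records[start:start + seq_count]
        lengths = set(len(seq) for _, seq in batch)
        if len(lengths) != 1:
            raise ValueError("Sequences in an alignment must be the same length")
        yield batch


# The third FASTA example from the text.
third_example = [
    ("Alpha", "ACTACGACTAGCTCAG--G"),
    ("XXX", "ACTACCGCTAGCTCAGAAG"),
    ("Alpha", "ACTACGACTAGCTCAGG"),
    ("YYY", "ACTACGGCAAGCACAGG"),
    ("Alpha", "--ACTACGAC--TAGCTCAGG"),
    ("ZZZ", "GGACTACGACAATAGCTCAGG"),
]

# Three pairwise alignments, each with its own length.
pairwise_lengths = [len(batch[0][1])
                    for batch in batch_alignments(third_example, 2)]

# Treating all six rows as one alignment fails, as described above.
try:
    list(batch_alignments(third_example, 6))
    single_alignment_ok = True
except ValueError:
    single_alignment_ok = False
```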

If the file format itself has a block structure allowing Bio.AlignIO to determine the number of sequences in each alignment directly, then the seq_count argument is not
needed. If it is supplied, and doesn't agree with the file contents, an error is raised.

Note that this optional seq_count argument assumes each alignment in the file has the same number of sequences. Hypothetically you may come across stranger
situations, for example a FASTA file containing several alignments each with a different number of sequences, although I would love to hear of a real world example of
this. Assuming you cannot get the data in a nicer file format, there is no straightforward way to deal with this using Bio.AlignIO. In this case, you could consider reading
in the sequences themselves using Bio.SeqIO and batching them together to create the alignments as appropriate.
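
How you batch the records depends entirely on what you know about the file. As one hypothetical heuristic (an assumption for illustration, not a general solution), suppose every alignment starts with the same reference identifier; then a new batch can be started each time that identifier reappears. The sketch works on plain (identifier, sequence) tuples such as you might build from SeqRecord objects.

```python
def split_on_reference(records, reference_id):
    """Start a new batch each time reference_id reappears."""
    batches = []
    for name, seq in records:
        if name == reference_id or not batches:
            batches.append([])
        batches[-1].append((name, seq))
    return batches


# Made-up data: a pairwise alignment followed by a three-row alignment.
mixed = [
    ("Alpha", "ACT-G"), ("Beta", "ACTAG"),
    ("Alpha", "AC-G"), ("Gamma", "ACTG"), ("Delta", "A-TG"),
]
batches = split_on_reference(mixed, "Alpha")
batch_sizes = [len(batch) for batch in batches]
```

Each resulting batch could then be checked for a common sequence length and turned into its own alignment object.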

6.2 Writing Alignments
We've talked about using Bio.AlignIO.read() and Bio.AlignIO.parse() for alignment input (reading files), and now we'll look at Bio.AlignIO.write() which is for
alignment output (writing files). This is a function taking three arguments: some MultipleSeqAlignment objects (or for backwards compatibility the obsolete Alignment
objects), a handle or filename to write to, and a sequence format.

Here is an example, where we start by creating a few MultipleSeqAlignment objects the hard way (by hand, rather than by loading them from a file). Note we create some
SeqRecord objects to construct the alignment from.

from Bio.Alphabet import generic_dna
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Align import MultipleSeqAlignment

align1 = MultipleSeqAlignment([
             SeqRecord(Seq("ACTGCTAGCTAG", generic_dna), id="Alpha"),
             SeqRecord(Seq("ACT-CTAGCTAG", generic_dna), id="Beta"),
             SeqRecord(Seq("ACTGCTAGDTAG", generic_dna), id="Gamma"),
         ])

align2 = MultipleSeqAlignment([
             SeqRecord(Seq("GTCAGC-AG", generic_dna), id="Delta"),
             SeqRecord(Seq("GACAGCTAG", generic_dna), id="Epsilon"),
             SeqRecord(Seq("GTCAGCTAG", generic_dna), id="Zeta"),
         ])

align3 = MultipleSeqAlignment([
             SeqRecord(Seq("ACTAGTACAGCTG", generic_dna), id="Eta"),
             SeqRecord(Seq("ACTAGTACAGCT-", generic_dna), id="Theta"),
             SeqRecord(Seq("-CTACTACAGGTG", generic_dna), id="Iota"),
         ])

my_alignments = [align1, align2, align3]

Now we have a list of Alignment objects, we'll write them to a PHYLIP format file:
from Bio import AlignIO
AlignIO.write(my_alignments, "my_example.phy", "phylip")

And if you open this file in your favourite text editor it should look like this:

 3 12
Alpha      ACTGCTAGCT AG
Beta       ACT-CTAGCT AG
Gamma      ACTGCTAGDT AG
 3 9
Delta      GTCAGC-AG
Epsilon    GACAGCTAG
Zeta       GTCAGCTAG
 3 13
Eta        ACTAGTACAG CTG
Theta      ACTAGTACAG CT-
Iota       -CTACTACAG GTG

It's more common to want to load an existing alignment, and save that, perhaps after some simple manipulation like removing certain rows or columns.

Suppose you wanted to know how many alignments the Bio.AlignIO.write() function wrote to the handle? If your alignments were in a list like the example above, you
could just use len(my_alignments), however you can't do that when your records come from a generator/iterator. Therefore the Bio.AlignIO.write() function returns the
number of alignments written to the file.

Note: If you tell the Bio.AlignIO.write() function to write to a file that already exists, the old file will be overwritten without any warning.

6.2.1 Converting between sequence alignment file formats
Converting between sequence alignment file formats with Bio.AlignIO works in the same way as converting between sequence file formats with Bio.SeqIO
(Section 5.5.2). We generally load the alignment(s) using Bio.AlignIO.parse() and then save them using Bio.AlignIO.write(), or just use the
Bio.AlignIO.convert() helper function.

For this example, we'll load the PFAM/Stockholm format file used earlier and save it as a Clustal W format file:
from Bio import AlignIO
count = AlignIO.convert("PF05371_seed.sth", "stockholm", "PF05371_seed.aln", "clustal")
print("Converted %i alignments" % count)

Or, using Bio.AlignIO.parse() and Bio.AlignIO.write():
from Bio import AlignIO
alignments = AlignIO.parse("PF05371_seed.sth", "stockholm")
count = AlignIO.write(alignments, "PF05371_seed.aln", "clustal")
print("Converted %i alignments" % count)

The Bio.AlignIO.write() function expects to be given multiple alignment objects. In the example above we gave it the alignment iterator returned by
Bio.AlignIO.parse().

In this case, we know there is only one alignment in the file so we could have used Bio.AlignIO.read() instead, but notice we have to pass this alignment to
Bio.AlignIO.write() as a single element list:

from Bio import AlignIO
alignment = AlignIO.read("PF05371_seed.sth", "stockholm")
AlignIO.write([alignment], "PF05371_seed.aln", "clustal")

Either way, you should end up with the same new Clustal W format file "PF05371_seed.aln" with the following content:
CLUSTAL X (1.81) multiple sequence alignment

COATB_BPIKE/30-81                   AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSS
Q9T0Q8_BPIKE/1-52                   AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVS
COATB_BPI22/32-83                   DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSS
COATB_BPM13/24-72                   AEGDDP---AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTS
COATB_BPZJ2/1-49                    AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFAS
Q9T0Q9_BPFD/1-49                    AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTS
COATB_BPIF1/22-73                   FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVS

COATB_BPIKE/30-81                   KA
Q9T0Q8_BPIKE/1-52                   RA
COATB_BPI22/32-83                   KA
COATB_BPM13/24-72                   KA
COATB_BPZJ2/1-49                    KA
Q9T0Q9_BPFD/1-49                    KA
COATB_BPIF1/22-73                   RA

Alternatively, you could make a PHYLIP format file which we'll name "PF05371_seed.phy":
from Bio import AlignIO
AlignIO.convert("PF05371_seed.sth", "stockholm", "PF05371_seed.phy", "phylip")

This time the output looks like this:

 7 52
COATB_BPIK AEPNAATNYA TEAMDSLKTQ AIDLISQTWP VVTTVVVAGL VIRLFKKFSS
Q9T0Q8_BPI AEPNAATNYA TEAMDSLKTQ AIDLISQTWP VVTTVVVAGL VIKLFKKFVS
COATB_BPI2 DGTSTATSYA TEAMNSLKTQ ATDLIDQTWP VVTSVAVAGL AIRLFKKFSS
COATB_BPM1 AEGDDP---A KAAFNSLQAS ATEYIGYAWA MVVVIVGATI GIKLFKKFTS
COATB_BPZJ AEGDDP---A KAAFDSLQAS ATEYIGYAWA MVVVIVGATI GIKLFKKFAS
Q9T0Q9_BPF AEGDDP---A KAAFDSLQAS ATEYIGYAWA MVVVIVGATI GIKLFKKFTS
COATB_BPIF FAADDATSQA KAAFDSLTAQ ATEMSGYAWA LVVLVVGATV GIKLFKKFVS

           KA
           RA
           KA
           KA
           KA
           KA
           RA

One of the big handicaps of the original PHYLIP alignment file format is that the sequence identifiers are strictly truncated at ten characters. In this example, as you can
see, the resulting names are still unique but they are not very readable. As a result, a more relaxed variant of the original PHYLIP format is now quite widely used:

from Bio import AlignIO
AlignIO.convert("PF05371_seed.sth", "stockholm", "PF05371_seed.phy", "phylip-relaxed")

This time the output looks like this, using a longer indentation to allow all the identifiers to be given in full:
 7 52
COATB_BPIKE/30-81  AEPNAATNYA TEAMDSLKTQ AIDLISQTWP VVTTVVVAGL VIRLFKKFSS
Q9T0Q8_BPIKE/1-52  AEPNAATNYA TEAMDSLKTQ AIDLISQTWP VVTTVVVAGL VIKLFKKFVS
COATB_BPI22/32-83  DGTSTATSYA TEAMNSLKTQ ATDLIDQTWP VVTSVAVAGL AIRLFKKFSS
COATB_BPM13/24-72  AEGDDP---A KAAFNSLQAS ATEYIGYAWA MVVVIVGATI GIKLFKKFTS
COATB_BPZJ2/1-49   AEGDDP---A KAAFDSLQAS ATEYIGYAWA MVVVIVGATI GIKLFKKFAS
Q9T0Q9_BPFD/1-49   AEGDDP---A KAAFDSLQAS ATEYIGYAWA MVVVIVGATI GIKLFKKFTS
COATB_BPIF1/22-73  FAADDATSQA KAAFDSLTAQ ATEMSGYAWA LVVLVVGATV GIKLFKKFVS

                   KA
                   RA
                   KA
                   KA
                   KA
                   KA
                   RA

If you have to work with the original strict PHYLIP format, then you may need to compress the identifiers somehow, or assign your own names or numbering system.
The following bit of code manipulates the record identifiers before saving the output:
from Bio import AlignIO
alignment = AlignIO.read("PF05371_seed.sth", "stockholm")
name_mapping = {}
for i, record in enumerate(alignment):
    name_mapping[i] = record.id
    record.id = "seq%i" % i
print(name_mapping)

AlignIO.write([alignment], "PF05371_seed.phy", "phylip")

This code used a Python dictionary to record a simple mapping from the new sequence numbering system to the original identifiers:
{0: 'COATB_BPIKE/30-81', 1: 'Q9T0Q8_BPIKE/1-52', 2: 'COATB_BPI22/32-83', ...}

Here is the new (strict) PHYLIP format output:
 7 52
seq0       AEPNAATNYA TEAMDSLKTQ AIDLISQTWP VVTTVVVAGL VIRLFKKFSS
seq1       AEPNAATNYA TEAMDSLKTQ AIDLISQTWP VVTTVVVAGL VIKLFKKFVS
seq2       DGTSTATSYA TEAMNSLKTQ ATDLIDQTWP VVTSVAVAGL AIRLFKKFSS
seq3       AEGDDP---A KAAFNSLQAS ATEYIGYAWA MVVVIVGATI GIKLFKKFTS
seq4       AEGDDP---A KAAFDSLQAS ATEYIGYAWA MVVVIVGATI GIKLFKKFAS
seq5       AEGDDP---A KAAFDSLQAS ATEYIGYAWA MVVVIVGATI GIKLFKKFTS
seq6       FAADDATSQA KAAFDSLTAQ ATEMSGYAWA LVVLVVGATV GIKLFKKFVS

           KA
           RA
           KA
           KA
           KA
           KA
           RA

In general, because of the identifier limitation, working with strict PHYLIP file formats shouldn't be your first choice. Using the PFAM/Stockholm format on the other
hand allows you to record a lot of additional annotation too.
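
Before deciding whether to rename anything for strict PHYLIP, it can help to test whether truncation to ten characters would make any identifiers clash. Here is a small standard-library sketch of that check; strict_phylip_names is a hypothetical helper and the fall-back "seq%i" scheme simply mirrors the renaming used above.

```python
def strict_phylip_names(identifiers, width=10):
    """Return (names, mapping) suitable for strict PHYLIP output.

    Truncated identifiers are kept if they are still unique; otherwise every
    record is renamed seq0, seq1, ... The mapping goes from each new name back
    to the original identifier.
    """
    names = [identifier[:width] for identifier in identifiers]
    if len(set(names)) != len(names):
        names = ["seq%i" % index for index in range(len(identifiers))]
    return names, dict(zip(names, identifiers))


pfam_ids = [
    "COATB_BPIKE/30-81", "Q9T0Q8_BPIKE/1-52", "COATB_BPI22/32-83",
    "COATB_BPM13/24-72", "COATB_BPZJ2/1-49", "Q9T0Q9_BPFD/1-49",
    "COATB_BPIF1/22-73",
]
pfam_names, pfam_mapping = strict_phylip_names(pfam_ids)

# Made-up identifiers that only differ after the tenth character.
clashing_names, clashing_mapping = strict_phylip_names(
    ["sample_0001_a", "sample_0001_b"])
```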

6.2.2 Getting your alignment objects as formatted strings
The Bio.AlignIO interface is based on handles, which means if you want to get your alignment(s) into a string in a particular file format you need to do a little bit more
work (see below). However, you will probably prefer to take advantage of the alignment object's format() method. This takes a single mandatory argument, a lower case
string which is supported by Bio.AlignIO as an output format. For example:
from Bio import AlignIO
alignment = AlignIO.read("PF05371_seed.sth", "stockholm")
print(alignment.format("clustal"))

As described in Section 4.6, the SeqRecord object has a similar method using output formats supported by Bio.SeqIO.

Internally the format() method is using the StringIO string-based handle and calling Bio.AlignIO.write(). You can do this in your own code if for example you are using
an older version of Biopython:
from Bio import AlignIO
try:
    from StringIO import StringIO  # Python 2
except ImportError:
    from io import StringIO  # Python 3

alignments = AlignIO.parse("PF05371_seed.sth", "stockholm")

out_handle = StringIO()
AlignIO.write(alignments, out_handle, "clustal")
clustal_data = out_handle.getvalue()

print(clustal_data)

6.3 Manipulating Alignments
Now that we've covered loading and saving alignments, we'll look at what else you can do with them.

6.3.1 Slicing alignments

First of all, in some senses the alignment objects act like a Python list of SeqRecord objects (the rows). With this model in mind hopefully the actions of len() (the
number of rows) and iteration (each row as a SeqRecord) make sense:
>>> from Bio import AlignIO
>>> alignment = AlignIO.read("PF05371_seed.sth", "stockholm")
>>> print("Number of rows: %i" % len(alignment))
Number of rows: 7
>>> for record in alignment:
...     print("%s - %s" % (record.seq, record.id))
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSSKA - COATB_BPIKE/30-81
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVSRA - Q9T0Q8_BPIKE/1-52
DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSSKA - COATB_BPI22/32-83
AEGDDP---AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA - COATB_BPM13/24-72
AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFASKA - COATB_BPZJ2/1-49
AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA - Q9T0Q9_BPFD/1-49
FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVSRA - COATB_BPIF1/22-73

You can also use the list-like append and extend methods to add more rows to the alignment (as SeqRecord objects). Keeping the list metaphor in mind, simple slicing of the alignment should also make sense: it selects some of the rows, giving back another alignment object:
>>> print(alignment)
SingleLetterAlphabet() alignment with 7 rows and 52 columns
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRL...SKA COATB_BPIKE/30-81
AEPNAATNYATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKL...SRA Q9T0Q8_BPIKE/1-52
DGTSTATSYATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRL...SKA COATB_BPI22/32-83
AEGDDP---AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKL...SKA COATB_BPM13/24-72
AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKL...SKA COATB_BPZJ2/1-49
AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKL...SKA Q9T0Q9_BPFD/1-49
FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKL...SRA COATB_BPIF1/22-73

>>> print(alignment[3:7])
SingleLetterAlphabet() alignment with 4 rows and 52 columns
AEGDDP---AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKL...SKA COATB_BPM13/24-72
AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKL...SKA COATB_BPZJ2/1-49
AEGDDP---AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKL...SKA Q9T0Q9_BPFD/1-49
FAADDATSQAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKL...SRA COATB_BPIF1/22-73

What if you wanted to select by column? Those of you who have used the NumPy matrix or array objects won't be surprised at this: you use a double index.
>>> print(alignment[2, 6])
T

Using two integer indices pulls out a single letter, shorthand for this:
>>> print(alignment[2].seq[6])
T

You can pull out a single column as a string like this:

>>> print(alignment[:, 6])
TTT---T

You can also select a range of columns. For example, to pick out the first three of the rows we extracted earlier, but take just their first six columns:
>>> print(alignment[3:6, :6])
SingleLetterAlphabet() alignment with 3 rows and 6 columns
AEGDDP COATB_BPM13/24-72
AEGDDP COATB_BPZJ2/1-49
AEGDDP Q9T0Q9_BPFD/1-49

Leaving the first index as : means take all the rows:
>>> print(alignment[:, :6])
SingleLetterAlphabet() alignment with 7 rows and 6 columns
AEPNAA COATB_BPIKE/30-81
AEPNAA Q9T0Q8_BPIKE/1-52
DGTSTA COATB_BPI22/32-83
AEGDDP COATB_BPM13/24-72
AEGDDP COATB_BPZJ2/1-49
AEGDDP Q9T0Q9_BPFD/1-49
FAADDA COATB_BPIF1/22-73

This brings us to a neat way to remove a section. Notice columns 7, 8 and 9 which are gaps in three of the seven sequences:

>>> print(alignment[:, 6:9])
SingleLetterAlphabet() alignment with 7 rows and 3 columns
TNY COATB_BPIKE/30-81
TNY Q9T0Q8_BPIKE/1-52
TSY COATB_BPI22/32-83
--- COATB_BPM13/24-72
--- COATB_BPZJ2/1-49
--- Q9T0Q9_BPFD/1-49
TSQ COATB_BPIF1/22-73

Again, you can slice to get everything after the ninth column:

>>> print(alignment[:, 9:])
SingleLetterAlphabet() alignment with 7 rows and 43 columns
ATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSSKA COATB_BPIKE/30-81
ATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVSRA Q9T0Q8_BPIKE/1-52
ATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSSKA COATB_BPI22/32-83
AKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA COATB_BPM13/24-72
AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFASKA COATB_BPZJ2/1-49
AKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA Q9T0Q9_BPFD/1-49
AKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVSRA COATB_BPIF1/22-73

Now, the interesting thing is that addition of alignment objects works by column. This lets you do this as a way to remove a block of columns:

>>> edited = alignment[:, :6] + alignment[:, 9:]
>>> print(edited)
SingleLetterAlphabet() alignment with 7 rows and 49 columns
AEPNAAATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSSKA COATB_BPIKE/30-81
AEPNAAATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVSRA Q9T0Q8_BPIKE/1-52
DGTSTAATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSSKA COATB_BPI22/32-83
AEGDDPAKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA COATB_BPM13/24-72
AEGDDPAKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFASKA COATB_BPZJ2/1-49
AEGDDPAKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA Q9T0Q9_BPFD/1-49
FAADDAAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVSRA COATB_BPIF1/22-73

Another common use of alignment addition would be to combine alignments for several different genes into a meta-alignment. Watch out though: the identifiers need to match up (see Section 4.8 for how adding SeqRecord objects works). You may find it helpful to first sort the alignment rows alphabetically by id:
>>> edited.sort()
>>> print(edited)
SingleLetterAlphabet() alignment with 7 rows and 49 columns
DGTSTAATEAMNSLKTQATDLIDQTWPVVTSVAVAGLAIRLFKKFSSKA COATB_BPI22/32-83
FAADDAAKAAFDSLTAQATEMSGYAWALVVLVVGATVGIKLFKKFVSRA COATB_BPIF1/22-73
AEPNAAATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIRLFKKFSSKA COATB_BPIKE/30-81
AEGDDPAKAAFNSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA COATB_BPM13/24-72
AEGDDPAKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFASKA COATB_BPZJ2/1-49
AEPNAAATEAMDSLKTQAIDLISQTWPVVTTVVVAGLVIKLFKKFVSRA Q9T0Q8_BPIKE/1-52
AEGDDPAKAAFDSLQASATEYIGYAWAMVVVIVGATIGIKLFKKFTSKA Q9T0Q9_BPFD/1-49

Note that you can only add two alignments together if they have the same number of rows.
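
To make the row and column model concrete without Biopython, here is a rough plain-Python sketch that treats an alignment as a list of equal-length strings. The helper names (column, slice_columns, add_by_column) are hypothetical and only mimic the indexing behaviour described in this section; the data is the first nine columns of the seven rows shown earlier.

```python
# Toy alignment: the first nine columns of the PF05371 seed rows.
rows = [
    "AEPNAATNY",
    "AEPNAATNY",
    "DGTSTATSY",
    "AEGDDP---",
    "AEGDDP---",
    "AEGDDP---",
    "FAADDATSQ",
]


def column(rows, index):
    """Return one column as a string, similar to alignment[:, index]."""
    return "".join(row[index] for row in rows)


def slice_columns(rows, start, stop):
    """Return a column range for every row, similar to alignment[:, start:stop]."""
    return [row[start:stop] for row in rows]


def add_by_column(left, right):
    """Join two blocks row by row, similar to adding two alignments."""
    if len(left) != len(right):
        raise ValueError("Both blocks need the same number of rows")
    return [a + b for a, b in zip(left, right)]


print(rows[2][6])        # a single letter, similar to alignment[2, 6]
print(column(rows, 6))   # a whole column, similar to alignment[:, 6]

# Drop columns 6 and 7 by gluing the pieces on either side back together.
edited = add_by_column(slice_columns(rows, 0, 6), slice_columns(rows, 8, 9))
print(edited)
```

Unlike real alignment objects, these plain strings carry no identifiers, so nothing here checks that the rows being joined belong to the same sequence.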
6.3.2 Alignments as arrays

Depending on what you are doing, it can be more useful to turn the alignment object into an array of letters, and you can do this with NumPy:
>>> import numpy as np
>>> from Bio import AlignIO
>>> alignment = AlignIO.read("PF05371_seed.sth", "stockholm")
>>> align_array = np.array([list(rec) for rec in alignment], np.character)
>>> print("Array shape %i by %i" % align_array.shape)
Array shape 7 by 52

If you will be working heavily with the columns, you can tell NumPy to store the array by column (as in Fortran) rather than its default of by row (as in C):

>>> align_array = np.array([list(rec) for rec in alignment], np.character, order="F")

Note that this leaves the original Biopython alignment object and the NumPy array in memory as separate objects. Editing one will not update the other!

6.4 Alignment Tools
There are lots of algorithms out there for aligning sequences, both pairwise alignments and multiple sequence alignments. These calculations are relatively slow, and you generally wouldn't want to write such an algorithm in Python. For pairwise alignments Biopython contains the Bio.pairwise2 module (see Section 6.4.6), which is supplemented by functions written in C for speed enhancements. In addition, you can use Biopython to invoke a command line tool on your behalf. Normally you would:

1. Prepare an input file of your unaligned sequences, typically this will be a FASTA file which you might create using Bio.SeqIO (see Chapter 5).
2. Call the command line tool to process this input file, typically via one of Biopython's command line wrappers (which we'll discuss here).
3. Read the output from the tool, i.e. your aligned sequences, typically using Bio.AlignIO (see earlier in this chapter).

All the command line wrappers we're going to talk about in this chapter follow the same style. You create a command line object specifying the options (e.g. the input filename and the output file name), then invoke this command line via a Python operating system call (e.g. using the subprocess module).
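
The "command line object" idea can be illustrated with a few lines of plain Python. This is only a sketch of the pattern: the ToolCommandline class below is hypothetical and is not how Biopython's wrappers are implemented, and the -name=value option style is an assumption chosen for the example.

```python
class ToolCommandline(object):
    """Hypothetical wrapper sketch; NOT a Biopython class."""

    def __init__(self, cmd, **options):
        self.cmd = cmd
        self.options = dict(options)

    def __str__(self):
        # Render options in a stable (sorted) order as -name=value pairs.
        parts = [self.cmd]
        for name in sorted(self.options):
            parts.append("-%s=%s" % (name, self.options[name]))
        return " ".join(parts)


cline = ToolCommandline("clustalw2", infile="opuntia.fasta")
cline.options["outfile"] = "opuntia.aln"   # options can be changed later
print(cline)
```

The resulting string is what would be handed to the operating system. A real wrapper also has to validate option names and quote values containing spaces, which this sketch leaves out.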

Most of these wrappers are defined in the Bio.Align.Applications module:

>>> import Bio.Align.Applications
>>> dir(Bio.Align.Applications)
...
['ClustalwCommandline', 'DialignCommandline', 'MafftCommandline', 'MuscleCommandline',
'PrankCommandline', 'ProbconsCommandline', 'TCoffeeCommandline' ...]

(Ignore the entries starting with an underscore; these have special meaning in Python.) The module Bio.Emboss.Applications has wrappers for some of the EMBOSS suite, including needle and water, which are described below in Section 6.4.5, and wrappers for the EMBOSS packaged versions of the PHYLIP tools (which EMBOSS refer to as one of their EMBASSY packages: third party tools with an EMBOSS style interface). We won't explore all these alignment tools here in this section, just a sample, but the same principles apply.

6.4.1 ClustalW
ClustalW is a popular command line tool for multiple sequence alignment (there is also a graphical interface called ClustalX). Biopython's Bio.Align.Applications module has a wrapper for this alignment tool (and several others).

Before trying to use ClustalW from within Python, you should first try running the ClustalW tool yourself by hand at the command line, to familiarise yourself with the other options. You'll find the Biopython wrapper is very faithful to the actual command line API:

>>> from Bio.Align.Applications import ClustalwCommandline
>>> help(ClustalwCommandline)
...

For the most basic usage, all you need is to have a FASTA input file, such as opuntia.fasta (available online or in the Doc/examples subdirectory of the Biopython source code). This is a small FASTA file containing seven prickly-pear DNA sequences (from the cactus family Opuntia).

By default ClustalW will generate an alignment and guide tree file with names based on the input FASTA file, in this case opuntia.aln and opuntia.dnd, but you can override this or make it explicit:

>>> from Bio.Align.Applications import ClustalwCommandline
>>> cline = ClustalwCommandline("clustalw2", infile="opuntia.fasta")
>>> print(cline)
clustalw2 -infile=opuntia.fasta

Notice here we have given the executable name as clustalw2, indicating we have version two installed, which has a different filename to version one (clustalw, the default). Fortunately both versions support the same set of arguments at the command line (and indeed, should be functionally identical).

You may find that even though you have ClustalW installed, the above command doesn't work: you may get a message about "command not found" (especially on Windows). This indicates that the ClustalW executable is not on your PATH (an environment variable, a list of directories to be searched). You can either update your PATH setting to include the location of your copy of the ClustalW tools (how you do this will depend on your OS), or simply type in the full path of the tool. For example:

>>> import os
>>> from Bio.Align.Applications import ClustalwCommandline
>>> clustalw_exe = r"C:\Program Files\new clustal\clustalw2.exe"
>>> clustalw_cline = ClustalwCommandline(clustalw_exe, infile="opuntia.fasta")

>>> assert os.path.isfile(clustalw_exe), "ClustalW executable missing"
>>> stdout, stderr = clustalw_cline()

Remember, in Python strings \n and \t are by default interpreted as a new line and a tab, which is why we've put a letter "r" at the start for a raw string that isn't translated in this way. This is generally good practice when specifying a Windows style file name.
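
The difference is easy to see with a small check. The path below is a made-up example chosen because its folder names happen to start with n and t:

```python
plain = "C:\new clustal\tools"   # \n and \t are turned into newline and tab
raw = r"C:\new clustal\tools"    # raw string: backslashes are kept literally

print(repr(plain))
print(repr(raw))
print(len(plain), len(raw))
```

The raw string is two characters longer because each backslash survives as its own character, so it still names the intended location.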

Internally this uses the subprocess module, which is now the recommended way to run another program in Python. This replaces older options like os.system() and the os.popen* functions.

Now, at this point it helps to know about how command line tools "work". When you run a tool at the command line, it will often print text output directly to screen. This text can be captured or redirected, via two "pipes", called standard output (the normal results) and standard error (for error messages and debug messages). There is also standard input, which is any text fed into the tool. These names get shortened to stdin, stdout and stderr. When the tool finishes, it has a return code (an integer), which by convention is zero for success.

When you run the command line tool like this via the Biopython wrapper, it will wait for it to finish, and check the return code. If this is non zero (indicating an error), an exception is raised. The wrapper then returns two strings, stdout and stderr.
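
The same three ingredients (stdout, stderr and the return code) can be seen with the subprocess module directly. In this sketch a child Python process stands in for the external tool, purely so that the example runs without ClustalW installed; it is not what the Biopython wrapper does internally.

```python
import subprocess
import sys

# Stand-in for an external tool: a child Python process (an assumption made
# so the sketch runs anywhere; substitute your real command line).
script = (
    "import sys; "
    "sys.stdout.write('results\\n'); "
    "sys.stderr.write('progress\\n'); "
    "sys.exit(0)"
)
child = subprocess.Popen(
    [sys.executable, "-c", script],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
    universal_newlines=True,
)
stdout, stderr = child.communicate()   # waits for the tool to finish
if child.returncode != 0:
    # By convention a non-zero return code signals failure.
    raise RuntimeError("Tool failed with return code %i" % child.returncode)
sys.stdout.write(stdout)
```

Changing sys.exit(0) to sys.exit(1) in the child script makes the check raise, which mirrors the behaviour described above.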

In the case of ClustalW, when run at the command line all the important output is written directly to the output files. Everything normally printed to screen while you wait (via stdout or stderr) is boring and can be ignored (assuming it worked).

What we care about are the two output files, the alignment and the guide tree. We didn't tell ClustalW what filenames to use, but it defaults to picking names based on the input file. In this case the output should be in the file opuntia.aln. You should be able to work out how to read in the alignment using Bio.AlignIO by now:

>>> from Bio import AlignIO
>>> align = AlignIO.read("opuntia.aln", "clustal")
>>> print(align)
SingleLetterAlphabet() alignment with 7 rows and 906 columns
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273285|gb|AF191659.1|AF191
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273284|gb|AF191658.1|AF191
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273287|gb|AF191661.1|AF191
TATACATAAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273286|gb|AF191660.1|AF191
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273290|gb|AF191664.1|AF191
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273289|gb|AF191663.1|AF191
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273291|gb|AF191665.1|AF191

In case you are interested (and this is an aside from the main thrust of this chapter), the opuntia.dnd file ClustalW creates is just a standard Newick tree file, and Bio.Phylo can parse these:

>>> from Bio import Phylo
>>> tree = Phylo.read("opuntia.dnd", "newick")
>>> Phylo.draw_ascii(tree)
                             _______________ gi|6273291|gb|AF191665.1|AF191665
  __________________________|
 |                          |   ______ gi|6273290|gb|AF191664.1|AF191664
 |                          |__|
 |                             |_____ gi|6273289|gb|AF191663.1|AF191663
 |
_|_________________ gi|6273287|gb|AF191661.1|AF191661
 |
 |__________ gi|6273286|gb|AF191660.1|AF191660
 |
 |    __ gi|6273285|gb|AF191659.1|AF191659
 |___|
     | gi|6273284|gb|AF191658.1|AF191658
<BLANKLINE>

Chapter 13 covers Biopython's support for phylogenetic trees in more depth.

6.4.2 MUSCLE
MUSCLE is a more recent multiple sequence alignment tool than ClustalW, and Biopython also has a wrapper for it under the Bio.Align.Applications module. As before, we recommend you try using MUSCLE from the command line before trying it from within Python, as the Biopython wrapper is very faithful to the actual command line API:

>>> from Bio.Align.Applications import MuscleCommandline
>>> help(MuscleCommandline)
...

For the most basic usage, all you need is to have a FASTA input file, such as opuntia.fasta (available online or in the Doc/examples subdirectory of the Biopython source code). You can then tell MUSCLE to read in this FASTA file, and write the alignment to an output file:
>>> from Bio.Align.Applications import MuscleCommandline
>>> cline = MuscleCommandline(input="opuntia.fasta", out="opuntia.txt")
>>> print(cline)
muscle -in opuntia.fasta -out opuntia.txt

Note that MUSCLE uses "-in" and "-out" but in Biopython we have to use "input" and "out" as the keyword arguments or property names. This is because "in" is a reserved word in Python.
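
If you are ever unsure whether an option name would clash with the language, the standard library keyword module can tell you. A quick check:

```python
import keyword

for name in ("in", "input", "out"):
    # iskeyword() is True for words Python reserves for its own syntax.
    print(name, keyword.iskeyword(name))
```

Only "in" is reported as a keyword, which is why it cannot be used as a keyword argument name.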

By default MUSCLE will output the alignment as a FASTA file (using gapped sequences). The Bio.AlignIO module should be able to read this alignment using format="fasta". You can also ask for ClustalW-like output:

>>> from Bio.Align.Applications import MuscleCommandline
>>> cline = MuscleCommandline(input="opuntia.fasta", out="opuntia.aln", clw=True)
>>> print(cline)
muscle -in opuntia.fasta -out opuntia.aln -clw

Or, strict ClustalW output where the original ClustalW header line is used for maximum compatibility:

>>> from Bio.Align.Applications import MuscleCommandline
>>> cline = MuscleCommandline(input="opuntia.fasta", out="opuntia.aln", clwstrict=True)
>>> print(cline)
muscle -in opuntia.fasta -out opuntia.aln -clwstrict

The Bio.AlignIO module should be able to read these alignments using format="clustal".

MUSCLE can also output in GCG MSF format (using the msf argument), but Biopython can't currently parse that, or using HTML which would give a human readable web page (not suitable for parsing).

You can also set the other optional parameters, for example the maximum number of iterations. See the built in help for details.

You would then run the MUSCLE command line string as described above for ClustalW, and parse the output using Bio.AlignIO to get an alignment object.

6.4.3 MUSCLE using stdout
Using a MUSCLE command line as in the examples above will write the alignment to a file. This means there will be no important information written to the standard out (stdout) or standard error (stderr) handles. However, by default MUSCLE will write the alignment to standard output (stdout). We can take advantage of this to avoid having a temporary output file! For example:

>>> from Bio.Align.Applications import MuscleCommandline
>>> muscle_cline = MuscleCommandline(input="opuntia.fasta")
>>> print(muscle_cline)
muscle -in opuntia.fasta

If we run this via the wrapper, we get back the output as a string. In order to parse this we can use StringIO to turn it into a handle. Remember that MUSCLE defaults to using FASTA as the output format:
>>> from Bio.Align.Applications import MuscleCommandline
>>> muscle_cline = MuscleCommandline(input="opuntia.fasta")
>>> stdout, stderr = muscle_cline()
>>> from io import StringIO  # on Python 2 use: from StringIO import StringIO
>>> from Bio import AlignIO
>>> align = AlignIO.read(StringIO(stdout), "fasta")
>>> print(align)
SingleLetterAlphabet() alignment with 7 rows and 906 columns
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273289|gb|AF191663.1|AF191663
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273291|gb|AF191665.1|AF191665
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273290|gb|AF191664.1|AF191664
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273287|gb|AF191661.1|AF191661
TATACATAAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273286|gb|AF191660.1|AF191660
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273285|gb|AF191659.1|AF191659
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273284|gb|AF191658.1|AF191658

The above approach is fairly simple, but if you are dealing with very large output text the fact that all of stdout and stderr is loaded into memory as a string can be a potential drawback. Using the subprocess module we can work directly with handles instead:

>>> import subprocess
>>> import sys
>>> from Bio.Align.Applications import MuscleCommandline
>>> muscle_cline = MuscleCommandline(input="opuntia.fasta")
>>> child = subprocess.Popen(str(muscle_cline),
...                          stdout=subprocess.PIPE,
...                          stderr=subprocess.PIPE,
...                          universal_newlines=True,
...                          shell=(sys.platform != "win32"))
>>> from Bio import AlignIO
>>> align = AlignIO.read(child.stdout, "fasta")
>>> print(align)
SingleLetterAlphabet() alignment with 7 rows and 906 columns
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273289|gb|AF191663.1|AF191663
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273291|gb|AF191665.1|AF191665
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273290|gb|AF191664.1|AF191664
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273287|gb|AF191661.1|AF191661
TATACATAAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273286|gb|AF191660.1|AF191660
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273285|gb|AF191659.1|AF191659
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273284|gb|AF191658.1|AF191658

6.4.4 MUSCLE using stdin and stdout

We don't actually need to have our FASTA input sequences prepared in a file, because by default MUSCLE will read in the input sequence from standard input! Note this is a bit more advanced and fiddly, so don't bother with this technique unless you need to.

First, we'll need some unaligned sequences in memory as SeqRecord objects. For this demonstration I'm going to use a filtered version of the original FASTA file (using a generator expression), taking just six of the seven sequences:

>>> from Bio import SeqIO
>>> records = (r for r in SeqIO.parse("opuntia.fasta", "fasta") if len(r) < 900)

Then we create the MUSCLE command line, leaving the input and output to their defaults (stdin and stdout). I'm also going to ask for strict ClustalW format as the output format.

>>> from Bio.Align.Applications import MuscleCommandline
>>> muscle_cline = MuscleCommandline(clwstrict=True)
>>> print(muscle_cline)
muscle -clwstrict

Now for the fiddly bits using the subprocess module, stdin and stdout:

>>> import subprocess
>>> import sys
>>> child = subprocess.Popen(str(muscle_cline),
...                          stdin=subprocess.PIPE,
...                          stdout=subprocess.PIPE,
...                          stderr=subprocess.PIPE,
...                          universal_newlines=True,
...                          shell=(sys.platform != "win32"))

That should start MUSCLE, but it will be sitting waiting for its FASTA input sequences, which we must supply via its stdin handle:
>>> SeqIO.write(records, child.stdin, "fasta")
6
>>> child.stdin.close()

After writing the six sequences to the handle, MUSCLE will still be waiting to see if that is all the FASTA sequences or not, so we must signal that this is all the input data by closing the handle. At that point MUSCLE should start to run, and we can ask for the output:

>>> from Bio import AlignIO
>>> align = AlignIO.read(child.stdout, "clustal")
>>> print(align)
SingleLetterAlphabet() alignment with 6 rows and 900 columns
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273290|gb|AF191664.1|AF19166
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273289|gb|AF191663.1|AF19166
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273287|gb|AF191661.1|AF19166
TATACATAAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273286|gb|AF191660.1|AF19166
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273285|gb|AF191659.1|AF19165
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273284|gb|AF191658.1|AF19165

Wow! There we are with a new alignment of just the six records, without having created a temporary FASTA input file, or a temporary alignment output file. However, a word of caution: dealing with errors with this style of calling external programs is much more complicated. It also becomes far harder to diagnose problems, because you can't try running MUSCLE manually outside of Biopython (because you don't have the input file to supply). There can also be subtle cross platform issues (e.g. Windows versus Linux, Python 2 versus Python 3), and how you run your script can have an impact (e.g. at the command line, from IDLE or an IDE, or as a GUI script). These are all generic Python issues though, and not specific to Biopython.
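
One way to sidestep some of these pitfalls is to let subprocess handle the writing, closing and reading in a single call. The sketch below uses a child Python process that simply upper-cases its input as a stand-in for MUSCLE (an assumption, so that it runs without MUSCLE installed); the piping pattern is the point, not the tool.

```python
import subprocess
import sys

# Stand-in for MUSCLE: a child Python process that upper-cases whatever it
# reads on stdin (an assumption so the sketch runs without MUSCLE installed).
tool = [sys.executable, "-c",
        "import sys; sys.stdout.write(sys.stdin.read().upper())"]

fasta_text = ">seq1\nacgt\n>seq2\nacct\n"

child = subprocess.Popen(
    tool,
    stdin=subprocess.PIPE,
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
    universal_newlines=True,
)
# communicate() writes the input, closes stdin, and reads both pipes to the
# end, which avoids deadlocks when a pipe buffer fills up.
stdout, stderr = child.communicate(fasta_text)
if child.returncode != 0:
    raise RuntimeError("Child process failed: %s" % stderr)
sys.stdout.write(stdout)
```

The trade-off is that communicate() holds all of the input and output in memory as strings, so it suits modest data sizes rather than very large alignments.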

If you find working directly with subprocess like this scary, there is an alternative. If you execute the tool with muscle_cline() you can supply any standard input as a big string, muscle_cline(stdin=...). So, provided your data isn't very big, you can prepare the FASTA input in memory as a string using StringIO (see Section 24.1):

>>> from Bio import SeqIO
>>> records = (r for r in SeqIO.parse("opuntia.fasta", "fasta") if len(r) < 900)
>>> from io import StringIO  # on Python 2 use: from StringIO import StringIO
>>> handle = StringIO()
>>> SeqIO.write(records, handle, "fasta")
6
>>> data = handle.getvalue()

You can then run the tool and parse the alignment as follows:

>>> stdout, stderr = muscle_cline(stdin=data)
>>> from Bio import AlignIO
>>> align = AlignIO.read(StringIO(stdout), "clustal")
>>> print(align)
SingleLetterAlphabet() alignment with 6 rows and 900 columns
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273290|gb|AF191664.1|AF19166
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273289|gb|AF191663.1|AF19166
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273287|gb|AF191661.1|AF19166
TATACATAAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273286|gb|AF191660.1|AF19166
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273285|gb|AF191659.1|AF19165
TATACATTAAAGAAGGGGGATGCGGATAAATGGAAAGGCGAAAG...AGA gi|6273284|gb|AF191658.1|AF19165

You might find this easier, but it does require more memory (RAM) for the strings used for the input FASTA and output Clustal formatted data.

6.4.5 EMBOSS needle and water
The EMBOSS suite includes the water and needle tools for Smith-Waterman algorithm local alignment, and Needleman-Wunsch global alignment. The tools share the same style interface, so switching between the two is trivial; we'll just use needle here.

Suppose you want to do a global pairwise alignment between two sequences, prepared in FASTA format as follows:

>HBA_HUMAN
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHG
KKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTP
AVHASLDKFLASVSTVLTSKYR

in a file alpha.faa, and secondly in a file beta.faa:
>HBB_HUMAN
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPK
VKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG
KEFTPPVQAAYQKVVAGVANALAHKYH

You can find copies of these example files with the Biopython source code under the Doc/examples/ directory.

Let's start by creating a complete needle command line object in one go:

>>> from Bio.Emboss.Applications import NeedleCommandline
>>> needle_cline = NeedleCommandline(asequence="alpha.faa", bsequence="beta.faa",
...                                  gapopen=10, gapextend=0.5, outfile="needle.txt")
>>> print(needle_cline)
needle -outfile=needle.txt -asequence=alpha.faa -bsequence=beta.faa -gapopen=10 -gapextend=0.5

Why not try running this by hand at the command prompt? You should see it does a pairwise comparison and records the output in the file needle.txt (in the default EMBOSS alignment file format).

Even if you have EMBOSS installed, running this command may not work: you might get a message about "command not found" (especially on Windows). This probably means that the EMBOSS tools are not on your PATH environment variable. You can either update your PATH setting, or simply tell Biopython the full path to the tool, for example:
>>> from Bio.Emboss.Applications import NeedleCommandline
>>> needle_cline = NeedleCommandline(r"C:\EMBOSS\needle.exe",
...                                  asequence="alpha.faa", bsequence="beta.faa",
...                                  gapopen=10, gapextend=0.5, outfile="needle.txt")

Remember in Python that for a default string \n or \t means a new line or a tab, which is why we've put a letter "r" at the start for a raw string.

At this point it might help to try running the EMBOSS tools yourself by hand at the command line, to familiarise yourself with the other options and compare them to the Biopython help text:

>>> from Bio.Emboss.Applications import NeedleCommandline
>>> help(NeedleCommandline)
...

Note that you can also specify (or change or look at) the settings like this:

>>> from Bio.Emboss.Applications import NeedleCommandline
>>> needle_cline = NeedleCommandline()
>>> needle_cline.asequence = "alpha.faa"
>>> needle_cline.bsequence = "beta.faa"
>>> needle_cline.gapopen = 10
>>> needle_cline.gapextend = 0.5
>>> needle_cline.outfile = "needle.txt"
>>> print(needle_cline)
needle -outfile=needle.txt -asequence=alpha.faa -bsequence=beta.faa -gapopen=10 -gapextend=0.5
>>> print(needle_cline.outfile)
needle.txt

Next we want to use Python to run this command for us. As explained above, for full control, we recommend you use the built in Python subprocess module, but for simple usage the wrapper object usually suffices:

>>> stdout, stderr = needle_cline()
>>> print(stdout + stderr)
Needleman-Wunsch global alignment of two sequences

Next we can load the output file with Bio.AlignIO as discussed earlier in this chapter, as the emboss format:
>>> from Bio import AlignIO
>>> align = AlignIO.read("needle.txt", "emboss")
>>> print(align)
SingleLetterAlphabet() alignment with 2 rows and 149 columns
MV-LSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTY...KYR HBA_HUMAN
MVHLTPEEKSAVTALWGKV--NVDEVGGEALGRLLVVYPWTQRF...KYH HBB_HUMAN

In this example, we told EMBOSS to write the output to a file, but you can tell it to write the output to stdout instead (useful if you don't want a temporary output file to get rid of; use stdout=True rather than the outfile argument), and also to read one of the inputs from stdin (e.g. asequence="stdin", much like in the MUSCLE example in the section above).

This has only scratched the surface of what you can do with needle and water. One useful trick is that the second file can contain multiple sequences (say five), and then EMBOSS will do five pairwise alignments.

6.4.6 Biopython's pairwise2
Biopython has its own module to make local and global pairwise alignments, Bio.pairwise2. This module contains essentially the same algorithms as water (local) and needle (global) from the EMBOSS suite (see above) and should return the same results.

Suppose you want to do a global pairwise alignment between the same two hemoglobin sequences from above (HBA_HUMAN, HBB_HUMAN) stored in alpha.faa and beta.faa:

>>> from Bio import pairwise2
>>> from Bio import SeqIO
>>> seq1 = SeqIO.read("alpha.faa", "fasta")
>>> seq2 = SeqIO.read("beta.faa", "fasta")
>>> alignments = pairwise2.align.globalxx(seq1.seq, seq2.seq)

As you see, we call the alignment function with align.globalxx. The tricky part is the last two letters of the function name (here: xx), which encode the scores and penalties for matches (and mismatches) and gaps. The first letter encodes the match score, e.g. x means that a match counts 1 while mismatches have no costs. With m general values for either matches or mismatches can be defined (for more options see Biopython's API). The second letter encodes the cost for gaps; x means no gap costs at all, with s different penalties for opening and extending a gap can be assigned. So, globalxx means that only matches between both sequences are counted.
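
To see what "only matches are counted" means in practice, here is a rough pure-Python sketch of global alignment scoring by dynamic programming. It is not Biopython's implementation: the function name global_score is hypothetical, and it uses a single fixed score per gap position rather than separate open and extend penalties. With the defaults (match 1, mismatch 0, gap 0) it corresponds to the xx scheme described above.

```python
def global_score(seq_a, seq_b, match=1, mismatch=0, gap=0):
    """Best global alignment score with a fixed per-position gap score."""
    # previous[j] is the best score for the prefix of seq_a handled so far
    # against seq_b[:j]; only two rows are kept because we want the score only.
    previous = [j * gap for j in range(len(seq_b) + 1)]
    for i, a in enumerate(seq_a, start=1):
        current = [i * gap]
        for j, b in enumerate(seq_b, start=1):
            diagonal = previous[j - 1] + (match if a == b else mismatch)
            current.append(max(diagonal,               # align a with b
                               previous[j] + gap,      # gap in seq_b
                               current[j - 1] + gap))  # gap in seq_a
        previous = current
    return previous[-1]


print(global_score("ACGT", "AGT"))
print(global_score("ACGT", "AGT", match=1, mismatch=-1, gap=-1))
```

With free gaps and mismatches the score is just the largest number of letters that can be matched in order, which is also why so many different alignments can share the same optimal score.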

Our variable alignments now contains a list of alignments (at least one) which have the same optimal score for the given conditions. In our example these are 80 different alignments with the score 72 (Bio.pairwise2 will return up to 1000 alignments). Have a look at one of these alignments:
>>> len(alignments)
80

>>> print(alignments[0])
('MVLSPADKTNVKAAWGKVGAHAG...YR', 'MVHLTPEEKSAVTALWGKV...YH',
72.0, 0, 217)

Each alignment is a tuple consisting of the two aligned sequences, the score, the start and the end positions of the alignment (in global alignments the start is always 0 and the end the length of the alignment). Bio.pairwise2 has a function format_alignment for a nicer printout:

>>> print(pairwise2.format_alignment(*alignments[0]))
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYF...YR
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||...|||
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRF...YH
  Score=72

Better alignments are usually obtained by penalizing gaps: higher costs for opening a gap and lower costs for extending an existing gap. For amino acid sequences match scores are usually encoded in matrices like PAM or BLOSUM. Thus, a more meaningful alignment for our example can be obtained by using the BLOSUM62 matrix, together with a gap open penalty of 10 and a gap extension penalty of 0.5 (using globalds):

>>> from Bio import pairwise2
>>> from Bio import SeqIO
>>> from Bio.SubsMat.MatrixInfo import blosum62

>>> seq1 = SeqIO.read("alpha.faa", "fasta")
>>> seq2 = SeqIO.read("beta.faa", "fasta")
>>> alignments = pairwise2.align.globalds(seq1.seq, seq2.seq, blosum62, -10, -0.5)
>>> len(alignments)
2

>>> print(pairwise2.format_alignment(*alignments[0]))
MVLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTY...KYR
||||||||||||||||||||||||||||||||||||||||||||...|||
MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFF...KYH
  Score=292.5

This alignment has the same score that we obtained earlier with EMBOSS needle using the same sequences and the same parameters.

Local alignments are called similarly with the function align.localXX, where again XX stands for a two letter code for the match and gap functions:

>>> from Bio import pairwise2
>>> from Bio.SubsMat.MatrixInfo import blosum62
>>> alignments = pairwise2.align.localds("LSPADKTNVKAA", "PEEKSAV", blosum62, -10, -1)
>>> print(pairwise2.format_alignment(*alignments[0]))
LSPADKTNVKAA
|||||||
PEEKSAV
  Score=16
<BLANKLINE>

Instead of supplying a complete match/mismatch matrix, the match code m allows for easily defining general match/mismatch values. The next example uses match/mismatch scores of 5/-4 and gap penalties (open/extend) of 2/0.5 (using localms):
>>> alignments = pairwise2.align.localms("AGAACT", "GAC", 5, -4, -2, -0.5)
>>> print(pairwise2.format_alignment(*alignments[0]))
AGAACT
||||
GAC
  Score=13
<BLANKLINE>

One useful keyword argument of the Bio.pairwise2.align functions is score_only. When set to True it will only return the score of the best alignment(s), but in a significantly shorter time. It will also allow the alignment of longer sequences before a memory error is raised.

Unfortunately, Bio.pairwise2 does not work with Biopython's multiple sequence alignment objects (yet). However, the module has some interesting advanced features: you can define your own match and gap functions (interested in testing affine logarithmic gap costs?), gap penalties and end gap penalties can be different for both sequences, sequences can be supplied as lists (useful if you have residues that are encoded by more than one character), etc. These features are hard (if at all possible) to realize with other alignment tools. For more details see the module's documentation in Biopython's API.

Chapter 7 BLAST
Hey, everybody loves BLAST right? I mean, geez, how can it get any easier to do comparisons between one of your sequences and every other sequence in the known world? But, of course, this section isn't about how cool BLAST is, since we already know that. It is about the problem with BLAST: it can be really difficult to deal with the volume of data generated by large runs, and to automate BLAST runs in general.

Fortunately, the Biopython folks know this only too well, so they've developed lots of tools for dealing with BLAST and making things much easier. This section details how to use these tools and do useful things with them.

Dealing with BLAST can be split up into two steps, both of which can be done from within Biopython. Firstly, running BLAST for your query sequence(s), and getting some output. Secondly, parsing the BLAST output in Python for further analysis.

Your first introduction to running BLAST was probably via the NCBI web service. In fact, there are lots of ways you can run BLAST, which can be categorised in several ways. The most important distinction is running BLAST locally (on your own machine), and running BLAST remotely (on another machine, typically the NCBI servers). We're going to start this chapter by invoking the NCBI online BLAST service from within a Python script.

NOTE: The following Chapter 8 describes Bio.SearchIO, an experimental module in Biopython. We intend this to ultimately replace the older Bio.Blast module, as it provides a more general framework handling other related sequence searching tools as well. However, until that is declared stable, for production code please continue to use the Bio.Blast module for dealing with NCBI BLAST.

7.1 Running BLAST over the Internet
We use the function qblast() in the Bio.Blast.NCBIWWW module to call the online version of BLAST. This has three non-optional arguments:

The first argument is the blast program to use for the search, as a lower case string. The options and descriptions of the programs are available at http://www.ncbi.nlm.nih.gov/BLAST/blast_program.shtml. Currently qblast only works with blastn, blastp, blastx, tblastn and tblastx.
The second argument specifies the databases to search against. Again, the options for this are available on the NCBI web pages at http://www.ncbi.nlm.nih.gov/BLAST/blast_databases.shtml.
The third argument is a string containing your query sequence. This can either be the sequence itself, the sequence in fasta format, or an identifier like a GI number.

The qblast function also takes a number of other optional arguments which are basically analogous to the different parameters you can set on the BLAST web page. We'll just highlight a few of them here:

The argument url_base sets the base URL for running BLAST over the internet. By default it connects to the NCBI, but one can use this to connect to an instance of NCBI BLAST running in the cloud. Please refer to the documentation for the qblast function for further details.
The qblast function can return the BLAST results in various formats, which you can choose with the optional format_type keyword: "HTML", "Text", "ASN.1", or "XML". The default is "XML", as that is the format expected by the parser, described in section 7.3 below.
The argument expect sets the expectation or e-value threshold.

For more about the optional BLAST arguments, we refer you to the NCBI's own documentation, or that built into Biopython:

>>> from Bio.Blast import NCBIWWW
>>> help(NCBIWWW.qblast)
...

Note that the default settings on the NCBI BLAST website are not quite the same as the defaults on QBLAST. If you get different results, you'll need to check the
parameters (e.g., the expectation value threshold and the gap values).

For example, if you have a nucleotide sequence you want to search against the nucleotide database (nt) using BLASTN, and you know the GI number of your query
sequence, you can use:

>>> from Bio.Blast import NCBIWWW
>>> result_handle = NCBIWWW.qblast("blastn", "nt", "8332116")

Alternatively, if we have our query sequence already in a FASTA formatted file, we just need to open the file and read in this record as a string, and use that as the query
argument:

>>> from Bio.Blast import NCBIWWW
>>> fasta_string = open("m_cold.fasta").read()
>>> result_handle = NCBIWWW.qblast("blastn", "nt", fasta_string)

We could also have read in the FASTA file as a SeqRecord and then supplied just the sequence itself:

>>> from Bio.Blast import NCBIWWW
>>> from Bio import SeqIO
>>> record = SeqIO.read("m_cold.fasta", format="fasta")
>>> result_handle = NCBIWWW.qblast("blastn", "nt", record.seq)

Supplying just the sequence means that BLAST will assign an identifier for your sequence automatically. You might prefer to use the SeqRecord object's format method to
make a FASTA string (which will include the existing identifier):

>>> from Bio.Blast import NCBIWWW
>>> from Bio import SeqIO
>>> record = SeqIO.read("m_cold.fasta", format="fasta")
>>> result_handle = NCBIWWW.qblast("blastn", "nt", record.format("fasta"))

This approach makes more sense if you have your sequence(s) in a non-FASTA file format which you can extract using Bio.SeqIO (see Chapter 5).
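
The optional arguments mentioned earlier (expect and format_type) are passed to qblast as keywords. The following is only a sketch of one way to wrap the call: the helper name run_qblast and the SUPPORTED_PROGRAMS tuple are our own inventions for illustration, and the list of program names is an assumption about what the online service accepts. The check runs before any network access, so a mistyped program name fails fast.

```python
# Hypothetical convenience wrapper around NCBIWWW.qblast (not part of Biopython).
SUPPORTED_PROGRAMS = ("blastn", "blastp", "blastx", "tblastn", "tblastx")


def run_qblast(program, database, query, expect=10.0, format_type="XML"):
    """Validate the program name, then submit the query to the online BLAST service."""
    if program not in SUPPORTED_PROGRAMS:
        raise ValueError("Unsupported BLAST program: %r" % program)
    # Imported here so the validation above works even without network access.
    from Bio.Blast import NCBIWWW
    return NCBIWWW.qblast(program, database, query,
                          expect=expect, format_type=format_type)


# Example call (contacts the NCBI servers, so it is left commented out):
# result_handle = run_qblast("blastn", "nt", "8332116", expect=0.001)
```

Passing the threshold explicitly also makes it easier to compare results with a search run through the NCBI web page, where the defaults may differ.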

Whatever arguments you give the qblast() function, you should get back your results in a handle object (by default in XML format). The next step would be to parse the
XML output into Python objects representing the search results (Section 7.3), but you might want to save a local copy of the output file first. I find this especially useful
when debugging my code that extracts info from the BLAST results (because re-running the online search is slow and wastes the NCBI computer time).

We need to be a bit careful since we can use result_handle.read() to read the BLAST output only once; calling result_handle.read() again returns an empty string.

>>> with open("my_blast.xml", "w") as out_handle:
...     out_handle.write(result_handle.read())
...
>>> result_handle.close()

After doing this, the results are in the file my_blast.xml and the original handle has had all its data extracted (so we closed it). However, the parse function of the BLAST
parser (described in 7.3) takes a file-handle-like object, so we can just open the saved file for input:

>>> result_handle = open("my_blast.xml")

Now that we've got the BLAST results back into a handle again, we are ready to do something with them, so this leads us right into the parsing section (see Section 7.3
below). You may want to jump ahead to that now.
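
The read-once behaviour of handles described above can be seen without running BLAST at all. This small sketch uses an in-memory io.StringIO handle as a stand-in for the result handle; the XML text in it is a placeholder we made up, not real BLAST output.

```python
import io

# Stand-in for the handle returned by a BLAST search (placeholder content).
result_handle = io.StringIO("<BlastOutput>placeholder</BlastOutput>")

first_read = result_handle.read()   # consumes everything in the handle
second_read = result_handle.read()  # nothing left, so this is an empty string

print(repr(first_read))
print(repr(second_read))
```

Some handles can be rewound with seek(0), but that depends on the kind of handle, so saving the output to a file and reopening it is the safer habit.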

7.2 Running BLAST locally
7.2.1 Introduction

Running BLAST locally (as opposed to over the internet, see Section 7.1) has at least two major advantages:

Local BLAST may be faster than BLAST over the internet.
Local BLAST allows you to make your own database to search for sequences against.

Dealing with proprietary or unpublished sequence data can be another reason to run BLAST locally. You may not be allowed to redistribute the sequences, so submitting
them to the NCBI as a BLAST query would not be an option.

Unfortunately, there are some major drawbacks too. Installing all the bits and getting it set up right takes some effort:

Local BLAST requires command line tools to be installed.
Local BLAST requires (large) BLAST databases to be set up (and potentially kept up to date).

To further confuse matters there are several different BLAST packages available, and there are also other tools which can produce imitation BLAST output files, such as
BLAT.

7.2.2 Standalone NCBI BLAST+
The new NCBI BLAST+ suite was released in 2009. This replaces the old NCBI "legacy" BLAST package (see below).

This section will show briefly how to use these tools from within Python. If you have already read or tried the alignment tool examples in Section 6.4 this should all seem
quite straightforward. First, we construct a command line string (as you would type in at the command line prompt if running standalone BLAST by hand). Then we can
execute this command from within Python.

For example, taking a FASTA file of gene nucleotide sequences, you might want to run a BLASTX (translation) search against the non-redundant (NR) protein database.
Assuming you (or your systems administrator) have downloaded and installed the NR database, you might run:

blastx -query opuntia.fasta -db nr -out opuntia.xml -evalue 0.001 -outfmt 5

This should run BLASTX against the NR database, using an expectation cut-off value of 0.001 and produce XML output to the specified file (which we can then parse).
On my computer this takes about six minutes, a good reason to save the output to a file so you can repeat any analysis as needed.

From within Biopython we can use the NCBI BLASTX wrapper from the Bio.Blast.Applications module to build the command line string, and run it:

>>> from Bio.Blast.Applications import NcbiblastxCommandline
>>> help(NcbiblastxCommandline)
...
>>> blastx_cline = NcbiblastxCommandline(query="opuntia.fasta", db="nr", evalue=0.001,
...                                      outfmt=5, out="opuntia.xml")
>>> blastx_cline
NcbiblastxCommandline(cmd='blastx', out='opuntia.xml', outfmt=5, query='opuntia.fasta',
db='nr', evalue=0.001)
>>> print(blastx_cline)
blastx -out opuntia.xml -outfmt 5 -query opuntia.fasta -db nr -evalue 0.001
>>> stdout, stderr = blastx_cline()

In this example there shouldn't be any output from BLASTX to the terminal, so stdout and stderr should be empty. You may want to check the output file opuntia.xml has
been created.

As you may recall from earlier examples in the tutorial, the opuntia.fasta contains seven sequences, so the BLAST XML output should contain multiple results.
Therefore use Bio.Blast.NCBIXML.parse() to parse it as described below in Section 7.3.
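
If you would rather not go through the wrapper class, the same command line can be assembled with the standard library alone. This is just a sketch: the helper name build_blastx_command is our own, and nothing is executed unless you uncomment the subprocess call on a machine where BLAST+ and the NR database are installed.

```python
import shlex


def build_blastx_command(query, db, out, evalue=0.001, outfmt=5):
    """Return the blastx invocation as an argument list (nothing is run here)."""
    return [
        "blastx",
        "-query", query,
        "-db", db,
        "-out", out,
        "-evalue", str(evalue),
        "-outfmt", str(outfmt),  # 5 selects XML output
    ]


cmd = build_blastx_command("opuntia.fasta", "nr", "opuntia.xml")
# Quote each argument so the printed line is safe to paste into a shell.
command_line = " ".join(shlex.quote(arg) for arg in cmd)
print(command_line)

# To actually run the search:
# import subprocess
# subprocess.run(cmd, check=True)
```

Keeping the arguments as a list (rather than one long string) avoids quoting problems with file names that contain spaces.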

7.2.3 Other versions of BLAST
NCBI BLAST+ (written in C++) was first released in 2009 as a replacement for the original NCBI "legacy" BLAST (written in C) which is no longer being updated.
There were a lot of changes: the old version had a single core command line tool blastall which covered multiple different BLAST search types (which are now
separate commands in BLAST+), and all the command line options were renamed. Biopython's wrappers for the NCBI "legacy" BLAST tools have been deprecated and
will be removed in a future release. To try to avoid confusion, we do not cover calling these old tools from Biopython in this tutorial.

You may also come across Washington University BLAST (WU-BLAST), and its successor, Advanced Biocomputing BLAST (AB-BLAST, released in 2009, not
free/open source). These packages include the command line tools wu-blastall and ab-blastall, which mimicked blastall from the NCBI "legacy" BLAST suite.
Biopython does not currently provide wrappers for calling these tools, but should be able to parse any NCBI compatible output from them.

7.3 Parsing BLAST output
As mentioned above, BLAST can generate output in various formats, such as XML, HTML, and plain text. Originally, Biopython had parsers for BLAST plain text and
HTML output, as these were the only output formats offered at the time. Unfortunately, the BLAST output in these formats kept changing, each time breaking the
Biopython parsers. Our HTML BLAST parser has been removed, but the plain text BLAST parser is still available (see Section 7.5). Use it at your own risk; it may or
may not work, depending on which BLAST version you're using.

As keeping up with changes in BLAST became a hopeless endeavor, especially with users running different BLAST versions, we now recommend parsing the output in
XML format, which can be generated by recent versions of BLAST. Not only is the XML output more stable than the plain text and HTML output, it is also much easier
to parse automatically, making Biopython a whole lot more stable.
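
To give a feel for why XML is easier to handle than free-form text, here is a sketch that walks a tiny XML fragment with the standard library's xml.etree.ElementTree. The fragment is hand-written for illustration: the element names follow the BLAST XML layout as we understand it, but the hit descriptions and e-values are invented. In practice you would let Bio.Blast.NCBIXML do this work for you.

```python
import xml.etree.ElementTree as ET

# Hand-written toy fragment shaped like BLAST XML output (values are invented).
TOY_XML = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_hits>
        <Hit>
          <Hit_def>toy subject one</Hit_def>
          <Hit_hsps><Hsp><Hsp_evalue>0.034</Hsp_evalue></Hsp></Hit_hsps>
        </Hit>
        <Hit>
          <Hit_def>toy subject two</Hit_def>
          <Hit_hsps><Hsp><Hsp_evalue>2.5</Hsp_evalue></Hsp></Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""

root = ET.fromstring(TOY_XML)
results = []
for hit in root.iter("Hit"):
    for hsp in hit.iter("Hsp"):
        # Every value sits in a named element, so no column counting is needed.
        results.append((hit.findtext("Hit_def"), float(hsp.findtext("Hsp_evalue"))))

print(results)
```

Because each value lives in its own named element, small changes in layout or spacing between BLAST versions do not break this kind of code the way they break a plain text parser.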

You can get BLAST output in XML format in various ways. For the parser, it doesn't matter how the output was generated, as long as it is in the XML format.

You can use Biopython to run BLAST over the internet, as described in section 7.1.
You can use Biopython to run BLAST locally, as described in section 7.2.
You can do the BLAST search yourself on the NCBI site through your web browser, and then save the results. You need to choose XML as the format in which to receive the results, and save the final BLAST page you get (you know, the one with all of the interesting results!) to a file.
You can also run BLAST locally without using Biopython, and save the output in a file. Again, you need to choose XML as the format in which to receive the results.

The important point is that you do not have to use Biopython scripts to fetch the data in order to be able to parse it. Doing things in one of these ways, you then need to
get a handle to the results. In Python, a handle is just a nice general way of describing input to any info source so that the info can be retrieved using read() and
readline() functions (see the appendix section on handles).

If you followed the code above for interacting with BLAST through a script, then you already have result_handle, the handle to the BLAST results. For example, using a
GI number to do an online search:

>>> from Bio.Blast import NCBIWWW
>>> result_handle = NCBIWWW.qblast("blastn", "nt", "8332116")

If instead you ran BLAST some other way, and have the BLAST output (in XML format) in the file my_blast.xml, all you need to do is to open the file for reading:

>>> result_handle = open("my_blast.xml")

Now that we've got a handle, we are ready to parse the output. The code to parse it is really quite small. If you expect a single BLAST result (i.e., you used a single
query):

>>> from Bio.Blast import NCBIXML
>>> blast_record = NCBIXML.read(result_handle)

or, if you have lots of results (i.e., multiple query sequences):

>>> from Bio.Blast import NCBIXML
>>> blast_records = NCBIXML.parse(result_handle)

Just like Bio.SeqIO and Bio.AlignIO (see Chapters 5 and 6), we have a pair of input functions, read and parse, where read is for when you have exactly one object, and
parse is an iterator for when you can have lots of objects, but instead of getting SeqRecord or MultipleSeqAlignment objects, we get BLAST record objects.

To be able to handle the situation where the BLAST file may be huge, containing thousands of results, NCBIXML.parse() returns an iterator. In plain English, an iterator
allows you to step through the BLAST output, retrieving BLAST records one by one for each BLAST search result:

>>> from Bio.Blast import NCBIXML
>>> blast_records = NCBIXML.parse(result_handle)
>>> blast_record = next(blast_records)
# ... do something with blast_record
>>> blast_record = next(blast_records)
# ... do something with blast_record
>>> blast_record = next(blast_records)
# ... do something with blast_record
>>> blast_record = next(blast_records)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration
# No further records

Or, you can use a for-loop:

>>> for blast_record in blast_records:
...     # Do something with blast_record

Note though that you can step through the BLAST records only once. Usually, from each BLAST record you would save the information that you are interested in. If you
want to save all returned BLAST records, you can convert the iterator into a list:

>>> blast_records = list(blast_records)

Now you can access each BLAST record in the list with an index as usual. If your BLAST file is huge though, you may run into memory problems trying to save them all
in a list.
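
The trade-off between an iterator and a list can be shown with a toy generator. Here toy_records is our own stand-in for the iterator returned by the parser; the record names are placeholders, not parsed BLAST data.

```python
def toy_records():
    """Toy generator standing in for an iterator of BLAST records."""
    for name in ("record 1", "record 2", "record 3"):
        yield name


records_iter = toy_records()
first_pass = [record for record in records_iter]   # steps through all records
second_pass = [record for record in records_iter]  # iterator is now exhausted

# Converting a fresh iterator to a list keeps every record in memory,
# which allows indexing and repeated passes.
saved = list(toy_records())

print(first_pass)
print(second_pass)
print(saved[1])
```

The list costs memory in proportion to the number of records, which is exactly the concern raised above for huge BLAST files.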

Usually, you'll be running one BLAST search at a time. Then, all you need to do is to pick up the first (and only) BLAST record in blast_records:

>>> from Bio.Blast import NCBIXML
>>> blast_records = NCBIXML.parse(result_handle)
>>> blast_record = next(blast_records)

or more elegantly:

>>> from Bio.Blast import NCBIXML
>>> blast_record = NCBIXML.read(result_handle)

I guess by now you're wondering what is in a BLAST record.

7.4 The BLAST record class
A BLAST Record contains everything you might ever want to extract from the BLAST output. Right now we'll just show an example of how to get some info out of the
BLAST report, but if you want something in particular that is not described here, look at the info on the record class in detail, and take a gander into the code or
automatically generated documentation; the docstrings have lots of good info about what is stored in each piece of information.

To continue with our example, let's just print out some summary info about all hits in our blast report greater than a particular threshold. The following code does this:

>>> E_VALUE_THRESH = 0.04

>>> for alignment in blast_record.alignments:
...     for hsp in alignment.hsps:
...         if hsp.expect < E_VALUE_THRESH:
...             print('****Alignment****')
...             print('sequence:', alignment.title)
...             print('length:', alignment.length)
...             print('e value:', hsp.expect)
...             print(hsp.query[0:75] + '...')
...             print(hsp.match[0:75] + '...')
...             print(hsp.sbjct[0:75] + '...')

This will print out summary reports like the following:

****Alignment****
sequence: >gb|AF283004.1|AF283004 Arabidopsis thaliana cold acclimation protein WCOR413-like protein
alpha form mRNA, complete cds
length: 783
e value: 0.034
tacttgttgatattggatcgaacaaactggagaaccaacatgctcacgtcacttttagtcccttacatattcctc...
||||||||| | ||||||||||| || ||||  || || |||||||| |||||| |  | |||||||| ||| ||...
tacttgttggtgttggatcgaaccaattggaagacgaatatgctcacatcacttctcattccttacatcttcttc...

Basically, you can do anything you want to with the info in the BLAST report once you have parsed it. This will, of course, depend on what you want to use it for, but
hopefully this helps you get started on doing what you need to do!
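
If you find yourself repeating the nested loop above, it can be wrapped in a small generator. This is only a sketch: significant_hsps is a name we made up, and the SimpleNamespace objects are stand-ins carrying the same attribute names as in the loop (alignments, hsps, expect, title), with invented values, so that the example runs without a BLAST result file.

```python
from types import SimpleNamespace


def significant_hsps(blast_record, threshold=0.04):
    """Yield (alignment, hsp) pairs whose e-value is below the threshold."""
    for alignment in blast_record.alignments:
        for hsp in alignment.hsps:
            if hsp.expect < threshold:
                yield alignment, hsp


# Stand-in record with invented values, mimicking the attributes used above.
toy_record = SimpleNamespace(alignments=[
    SimpleNamespace(title="toy hit A", length=783,
                    hsps=[SimpleNamespace(expect=0.034), SimpleNamespace(expect=1.2)]),
    SimpleNamespace(title="toy hit B", length=500,
                    hsps=[SimpleNamespace(expect=3.0)]),
])

summary = [(alignment.title, hsp.expect)
           for alignment, hsp in significant_hsps(toy_record)]
print(summary)
```

With a real parsed record you would pass blast_record in place of toy_record and print whichever fields you care about.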

An important consideration for extracting information from a BLAST report is the type of objects that the information is stored in. In Biopython, the parsers return Record
objects, either Blast or PSIBlast depending on what you are parsing. These objects are defined in Bio.Blast.Record and are quite complete.

Here are my attempts at UML class diagrams for the Blast and PSIBlast record classes. If you are good at UML and see mistakes/improvements that can be made, please
let me know. The Blast class diagram is shown in Figure 7.4.

The PSIBlast record object is similar, but has support for the rounds that are used in the iteration steps of PSIBlast. The class diagram for PSIBlast is shown in Figure 7.4.

7.5 Deprecated BLAST parsers
Older versions of Biopython had parsers for BLAST output in plain text or HTML format. Over the years, we discovered that it is very hard to maintain these parsers in
working order. Basically, any small change to the BLAST output in newly released BLAST versions tends to cause the plain text and HTML parsers to break. We
therefore recommend parsing BLAST output in XML format, as described in section 7.3.

Depending on which BLAST versions or programs you're using, our plain text BLAST parser may or may not work. Use it at your own risk!

7.5.1 Parsing plain-text BLAST output
The plain text BLAST parser is located in Bio.Blast.NCBIStandalone.

As with the XML parser, we need to have a handle object that we can pass to the parser. The handle must implement the readline() method and do this properly. The
common ways to get such a handle are to either use the provided blastall or blastpgp functions to run the local blast, or to run a local blast via the command line, and
then do something like the following:

>>> result_handle = open("my_file_of_blast_output.txt")

Well, now that we've got a handle (which we'll call result_handle), we are ready to parse it. This can be done with the following code:

>>> from Bio.Blast import NCBIStandalone
>>> blast_parser = NCBIStandalone.BlastParser()
>>> blast_record = blast_parser.parse(result_handle)

This will parse the BLAST report into a Blast Record class (either a Blast or a PSIBlast record, depending on what you are parsing) so that you can extract the
information from it. In our case, let's just print out a quick summary of all of the alignments greater than some threshold value.

>>> E_VALUE_THRESH = 0.04
>>> for alignment in blast_record.alignments:
...     for hsp in alignment.hsps:
...         if hsp.expect < E_VALUE_THRESH:
...             print('****Alignment****')
...             print('sequence:', alignment.title)
...             print('length:', alignment.length)
...             print('e value:', hsp.expect)
...             print(hsp.query[0:75] + '...')
...             print(hsp.match[0:75] + '...')
...             print(hsp.sbjct[0:75] + '...')

If you also read the section 7.3 on parsing BLAST XML output, you'll notice that the above code is identical to what is found in that section. Once you parse something
into a record class you can deal with it independent of the format of the original BLAST info you were parsing. Pretty snazzy!

Sure, parsing one record is great, but I've got a BLAST file with tons of records. How can I parse them all? Well, fear not, the answer lies in the very next section.

7.5.2 Parsing a plain-text BLAST file full of BLAST runs
We can do this using the blast iterator. To set up an iterator, we first set up a parser, to parse our blast reports in Blast Record objects:

>>> from Bio.Blast import NCBIStandalone
>>> blast_parser = NCBIStandalone.BlastParser()

Then we will assume we have a handle to a bunch of blast records, which we'll call result_handle. Getting a handle is described in full detail above in the blast parsing
sections.

Now that we've got a parser and a handle, we are ready to set up the iterator with the following command:

>>> blast_iterator = NCBIStandalone.Iterator(result_handle, blast_parser)

The second option, the parser, is optional. If we don't supply a parser, then the iterator will just return the raw BLAST reports one at a time.

Now that we've got an iterator, we start retrieving blast records (generated by our parser) using next():

>>> blast_record = next(blast_iterator)

Each call to next will return a new record that we can deal with. Now we can iterate through these records and generate our old favorite, a nice little blast report:

>>> for blast_record in blast_iterator:
...     E_VALUE_THRESH = 0.04
...     for alignment in blast_record.alignments:
...         for hsp in alignment.hsps:
...             if hsp.expect < E_VALUE_THRESH:
...                 print('****Alignment****')
...                 print('sequence:', alignment.title)
...                 print('length:', alignment.length)
...                 print('e value:', hsp.expect)
...                 if len(hsp.query) > 75:
...                     dots = '...'
...                 else:
...                     dots = ''
...                 print(hsp.query[0:75] + dots)
...                 print(hsp.match[0:75] + dots)
...                 print(hsp.sbjct[0:75] + dots)

The iterator allows you to deal with huge blast records without any memory problems, since things are read in one at a time. I have parsed tremendously huge files
without any problems using this.
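
The dots logic in the loop above can be pulled out into a tiny helper so that each print call stays on one line. The name preview is our own; it is not part of Biopython.

```python
def preview(text, width=75):
    """Return the first `width` characters, adding '...' only if text was cut."""
    if len(text) > width:
        return text[:width] + "..."
    return text


print(preview("acgt"))              # short strings are returned unchanged
print(len(preview("a" * 80)))       # 75 characters plus the three dots
```

Note one small difference: the loop above decides on the dots from the length of hsp.query alone, whereas this helper checks each string separately. For the three strings of one HSP this should normally give the same result, since they are printed as an aligned block.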

7.5.3 Finding a bad record somewhere in a huge plain-text BLAST file

One really ugly problem that happens to me is that I'll be parsing a huge blast file for a while, and the parser will bomb out with a ValueError. This is a serious problem,
since you can't tell if the ValueError is due to a parser problem, or a problem with the BLAST. To make it even worse, you have no idea where the parse failed, so you
can't just ignore the error, since this could be ignoring an important data point.

We used to have to make a little script to get around this problem, but the Bio.Blast module now includes a BlastErrorParser which really helps make this easier. The
BlastErrorParser works very similarly to the regular BlastParser, but it adds an extra layer of work by catching ValueErrors that are generated by the parser, and
attempting to diagnose the errors.

Let's take a look at using this parser. First we define the file we are going to parse and the file to write the problem reports to:

>>> import os
>>> blast_file = os.path.join(os.getcwd(), "blast_out", "big_blast.out")
>>> error_file = os.path.join(os.getcwd(), "blast_out", "big_blast.problems")

Now we want to get a BlastErrorParser:

>>> from Bio.Blast import NCBIStandalone
>>> error_handle = open(error_file, "w")
>>> blast_error_parser = NCBIStandalone.BlastErrorParser(error_handle)

Notice that the parser takes an optional argument of a handle. If a handle is passed, then the parser will write any blast records which generate a ValueError to this handle.
Otherwise, these records will not be recorded.

Now we can use the BlastErrorParser just like a regular blast parser. Specifically, we might want to make an iterator that goes through our blast records one at a time and
parses them with the error parser:

>>> result_handle = open(blast_file)
>>> iterator = NCBIStandalone.Iterator(result_handle, blast_error_parser)

We can read these records one at a time, but now we can catch and deal with errors that are due to problems with Blast (and not with the parser itself):

>>> try:
...     next_record = next(iterator)
... except NCBIStandalone.LowQualityBlastError as info:
...     print("LowQualityBlastError detected in id %s" % info.args[1])

The next() functionality is normally called indirectly via a for-loop. Right now the BlastErrorParser can generate the following errors:

ValueError: This is the same error generated by the regular BlastParser, and is due to the parser not being able to parse a specific file. This is normally either due to a bug in the parser, or some kind of discrepancy between the version of BLAST you are using and the versions the parser is able to handle.
LowQualityBlastError: When BLASTing a sequence that is of really bad quality (for example, a short sequence that is basically a stretch of one nucleotide), it seems that Blast ends up masking out the entire sequence and ending up with nothing to parse. In this case it will produce a truncated report that causes the parser to generate a ValueError. LowQualityBlastError is reported in these cases. This error returns an info item with the following information (under Python 3, reach these through the exception's args attribute, as in the example above):
item[0]: The error message
item[1]: The id of the input record that caused the error. This is really useful if you want to record all of the records that are causing problems.

As mentioned, with each error generated, the BlastErrorParser will write the offending record to the specified error_handle. You can then go ahead and look at these
and deal with them as you see fit. Either you will be able to debug the parser with a single blast report, or will find out problems in your blast runs. Either way, it will
definitely be a useful experience!

Hopefully the BlastErrorParser will make it much easier to debug and deal with large Blast files.
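
To show the overall pattern of stepping through records while collecting the problem ids, here is a self-contained toy. Everything in it is a stand-in of our own: ToyLowQualityError and ToyIterator are hypothetical and merely imitate an iterator that raises an error for a bad record and then carries on with the next one. Whether the real iterator can always continue after an error is an assumption you should check against your own files.

```python
class ToyLowQualityError(Exception):
    """Hypothetical stand-in for a low-quality-query error (message, record id)."""


class ToyIterator:
    """Stand-in iterator: a bad record raises, but the position still advances."""

    def __init__(self, raw_records):
        self._records = iter(raw_records)

    def __iter__(self):
        return self

    def __next__(self):
        record_id, quality_ok = next(self._records)  # StopIteration ends the loop
        if not quality_ok:
            raise ToyLowQualityError("nothing left to parse", record_id)
        return record_id


def collect(iterator):
    """Return (good_records, bad_ids), skipping over low-quality records."""
    good, bad = [], []
    while True:
        try:
            good.append(next(iterator))
        except StopIteration:
            break
        except ToyLowQualityError as err:
            bad.append(err.args[1])  # second argument carries the record id
    return good, bad


good, bad = collect(ToyIterator([("q1", True), ("q2", False), ("q3", True)]))
print(good, bad)
```

A plain for-loop would stop at the first exception, which is why the sketch calls next() explicitly inside the try block.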

7.6 Dealing with PSI-BLAST
You can run the standalone version of PSI-BLAST (the legacy NCBI command line tool blastpgp, or its replacement psiblast) using the wrappers in the
Bio.Blast.Applications module.

At the time of writing, the NCBI do not appear to support tools running a PSI-BLAST search via the internet.

Note that the Bio.Blast.NCBIXML parser can read the XML output from current versions of PSI-BLAST, but information like which sequences in each iteration are new or
reused isn't present in the XML file. If you care about this information you may have more joy with the plain text output and the PSIBlastParser in
Bio.Blast.NCBIStandalone.

7.7 Dealing with RPS-BLAST
You can run the standalone version of RPS-BLAST (either the legacy NCBI command line tool rpsblast, or its replacement with the same name) using the wrappers in the
Bio.Blast.Applications module.

At the time of writing, the NCBI do not appear to support tools running an RPS-BLAST search via the internet.

You can use the Bio.Blast.NCBIXML parser to read the XML output from current versions of RPS-BLAST.

Chapter 8 BLAST and other sequence search tools (experimental code)
WARNING: This chapter of the Tutorial describes an experimental module in Biopython. It is being included in Biopython and documented here in the tutorial in a
pre-final state to allow a period of feedback and refinement before we declare it stable. Until then the details will probably change, and any scripts using the current
Bio.SearchIO would need to be updated. Please keep this in mind! For stable code working with NCBI BLAST, please continue to use Bio.Blast described in the
preceding Chapter 7.

Biological sequence identification is an integral part of bioinformatics. Several tools are available for this, each with their own algorithms and approaches, such as
BLAST (arguably the most popular), FASTA, HMMER, and many more. In general, these tools usually use your sequence to search a database of potential matches. With
the growing number of known sequences (hence the growing number of potential matches), interpreting the results becomes increasingly hard as there could be hundreds
or even thousands of potential matches. Naturally, manual interpretation of these search results is out of the question. Moreover, you often need to work with several
sequence search tools, each with its own statistics, conventions, and output format. Imagine how daunting it would be when you need to work with multiple sequences
using multiple search tools.

We know this too well ourselves, which is why we created the Bio.SearchIO submodule in Biopython. Bio.SearchIO allows you to extract information from your search
results in a convenient way, while also dealing with the different standards and conventions used by different search tools. The name SearchIO is a homage to BioPerl's
module of the same name.

In this chapter, we'll go through the main features of Bio.SearchIO to show what it can do for you. We'll use two popular search tools along the way: BLAST and BLAT.
They are used merely for illustrative purposes, and you should be able to adapt the workflow to any other search tools supported by Bio.SearchIO in a breeze. You're very
welcome to follow along with the search output files we'll be using. The BLAST output file can be downloaded here, and the BLAT output file here; both are also included with
the Biopython source code under the Doc/examples/ folder. Both output files were generated using this sequence:

>mystery_seq
CCCTCTACAGGGAAGCGCTTTCTGTTGTCTGAAAGAAAAGAAAGTGCTTCCTTTTAGAGGG

The BLAST result is an XML file generated using blastn against the NCBI refseq_rna database. For BLAT, the sequence database was the February 2009 hg19 human
genome draft and the output format is PSL.

We'll start from an introduction to the Bio.SearchIO object model. The model is the representation of your search results, thus it is core to Bio.SearchIO itself. After that,
we'll check out the main functions in Bio.SearchIO that you may often use.

Now that we're all set, let's go to the first step: introducing the core object model.

8.1 The SearchIO object model
Despite the wildly differing output styles among many sequence search tools, it turns out that their underlying concept is similar:

The output file may contain results from one or more search queries.
In each search query, you will see one or more hits from the given search database.
In each database hit, you will see one or more regions containing the actual sequence alignment between your query sequence and the database sequence.
Some programs like BLAT or Exonerate may further split these regions into several alignment fragments (or blocks in BLAT and possibly exons in Exonerate). This is not something you always see, as programs like BLAST and HMMER do not do this.

Realizing this generality, we decided to use it as the base for creating the Bio.SearchIO object model. The object model consists of a nested hierarchy of Python objects, each
one representing one concept outlined above. These objects are:

QueryResult, to represent a single search query.
Hit, to represent a single database hit. Hit objects are contained within QueryResult and in each QueryResult there are zero or more Hit objects.
HSP (short for high-scoring pair), to represent region(s) of significant alignments between query and hit sequences. HSP objects are contained within Hit objects and each Hit has one or more HSP objects.
HSPFragment, to represent a single contiguous alignment between query and hit sequences. HSPFragment objects are contained within HSP objects. Most sequence search tools like BLAST and HMMER unify HSP and HSPFragment objects as each HSP will only have a single HSPFragment. However there are tools like BLAT and Exonerate that produce HSP containing multiple HSPFragment. Don't worry if this seems a tad confusing now, we'll elaborate more on these two objects later on.

These four objects are the ones you will interact with when you use Bio.SearchIO. They are created using one of the main Bio.SearchIO methods: read, parse, index, or
index_db. The details of these methods are provided in later sections. For this section, we'll only be using read and parse. These functions behave similarly to their
Bio.SeqIO and Bio.AlignIO counterparts:

read is used for search output files with a single query and returns a QueryResult object
parse is used for search output files with multiple queries and returns a generator that yields QueryResult objects
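
As a rough picture of the nesting described above, here is a toy model built from dataclasses. These are not Biopython's classes: the Toy* names, their fields, and the coordinates are all invented for illustration, and the real objects carry far more information.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ToyFragment:
    """One contiguous aligned block: (start, end) on query and on hit."""
    query_span: Tuple[int, int]
    hit_span: Tuple[int, int]


@dataclass
class ToyHSP:
    fragments: List[ToyFragment] = field(default_factory=list)


@dataclass
class ToyHit:
    hit_id: str
    hsps: List[ToyHSP] = field(default_factory=list)


@dataclass
class ToyQueryResult:
    query_id: str
    hits: List[ToyHit] = field(default_factory=list)


def count_fragments(qresult):
    """Total number of fragments across all hits and HSPs of one query."""
    return sum(len(hsp.fragments) for hit in qresult.hits for hsp in hit.hsps)


# BLAST-like result: each HSP holds exactly one fragment.
blast_like = ToyQueryResult("query1", [
    ToyHit("hitA", [ToyHSP([ToyFragment((0, 61), (10, 71))])]),
])

# BLAT-like result: one HSP split into two blocks.
blat_like = ToyQueryResult("query1", [
    ToyHit("chrN", [ToyHSP([ToyFragment((0, 30), (100, 130)),
                            ToyFragment((30, 61), (500, 531))])]),
])

print(count_fragments(blast_like), count_fragments(blat_like))
```

The point of the extra fragment layer is visible in the second case: a single HSP can cover several separate blocks on the hit sequence.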

With that settled, let's start probing each Bio.SearchIO object, beginning with QueryResult.

8.1.1 QueryResult

The QueryResult object represents a single search query and contains zero or more Hit objects. Let's see what it looks like using the BLAST file we have:

>>> from Bio import SearchIO
>>> blast_qresult = SearchIO.read('my_blast.xml', 'blast-xml')
>>> print(blast_qresult)
Program: blastn (2.2.27+)
  Query: 42291 (61)
         mystery_seq
 Target: refseq_rna
   Hits: ----  -----  ----------------------------------------------------------
            #  # HSP  ID + description
         ----  -----  ----------------------------------------------------------
            0      1  gi|262205317|ref|NR_030195.1|  Homo sapiens microRNA 52...
            1      1  gi|301171311|ref|NR_035856.1|  Pan troglodytes microRNA...
            2      1  gi|270133242|ref|NR_032573.1|  Macaca mulatta microRNA...
            3      2  gi|301171322|ref|NR_035857.1|  Pan troglodytes microRNA...
            4      1  gi|301171267|ref|NR_035851.1|  Pan troglodytes microRNA...
            5      2  gi|262205330|ref|NR_030198.1|  Homo sapiens microRNA 52...
            6      1  gi|262205302|ref|NR_030191.1|  Homo sapiens microRNA 51...
            7      1  gi|301171259|ref|NR_035850.1|  Pan troglodytes microRNA...
            8      1  gi|262205451|ref|NR_030222.1|  Homo sapiens microRNA 51...
            9      2  gi|301171447|ref|NR_035871.1|  Pan troglodytes microRNA...
           10      1  gi|301171276|ref|NR_035852.1|  Pan troglodytes microRNA...
           11      1  gi|262205290|ref|NR_030188.1|  Homo sapiens microRNA 51...
           12      1  gi|301171354|ref|NR_035860.1|  Pan troglodytes microRNA...
           13      1  gi|262205281|ref|NR_030186.1|  Homo sapiens microRNA 52...
           14      2  gi|262205298|ref|NR_030190.1|  Homo sapiens microRNA 52...
           15      1  gi|301171394|ref|NR_035865.1|  Pan troglodytes microRNA...
           16      1  gi|262205429|ref|NR_030218.1|  Homo sapiens microRNA 51...
           17      1  gi|262205423|ref|NR_030217.1|  Homo sapiens microRNA 52...
           18      1  gi|301171401|ref|NR_035866.1|  Pan troglodytes microRNA...
           19      1  gi|270133247|ref|NR_032574.1|  Macaca mulatta microRNA...
           20      1  gi|262205309|ref|NR_030193.1|  Homo sapiens microRNA 52...
           21      2  gi|270132717|ref|NR_032716.1|  Macaca mulatta microRNA...
           22      2  gi|301171437|ref|NR_035870.1|  Pan troglodytes microRNA...
           23      2  gi|270133306|ref|NR_032587.1|  Macaca mulatta microRNA...
           24      2  gi|301171428|ref|NR_035869.1|  Pan troglodytes microRNA...
           25      1  gi|301171211|ref|NR_035845.1|  Pan troglodytes microRNA...
           26      2  gi|301171153|ref|NR_035838.1|  Pan troglodytes microRNA...
           27      2  gi|301171146|ref|NR_035837.1|  Pan troglodytes microRNA...
           28      2  gi|270133254|ref|NR_032575.1|  Macaca mulatta microRNA...
           29      2  gi|262205445|ref|NR_030221.1|  Homo sapiens microRNA 51...
           ~~~
           97      1  gi|356517317|ref|XM_003527287.1|  PREDICTED: Glycine ma...
           98      1  gi|297814701|ref|XM_002875188.1|  Arabidopsis lyrata su...
           99      1  gi|397513516|ref|XM_003827011.1|  PREDICTED: Pan panisc...

We've just begun to scratch the surface of the object model, but you can see that there's already some useful information. By invoking print on the QueryResult object,
you can see:

The program name and version (blastn version 2.2.27+)
The query ID, description, and its sequence length (ID is 42291, description is mystery_seq, and it is 61 nucleotides long)
The target database to search against (refseq_rna)
A quick overview of the resulting hits. For our query sequence, there are 100 potential hits (numbered 0-99 in the table). For each hit, we can also see how many HSPs it contains, its ID, and a snippet of its description. Notice here that Bio.SearchIO truncates the hit table overview, by showing only hits numbered 0-29, and then 97-99.

Now let's check our BLAT results using the same procedure as above:

>>> blat_qresult = SearchIO.read('my_blat.psl', 'blat-psl')
>>> print(blat_qresult)
Program: blat (<unknown version>)
  Query: mystery_seq (61)
         <unknown description>
 Target: <unknown target>
   Hits: ----  -----  ----------------------------------------------------------
            #  # HSP  ID + description
         ----  -----  ----------------------------------------------------------
            0     17  chr19  <unknown description>

You'll immediately notice that there are some differences. Some of these are caused by the way PSL format stores its details, as you'll see. The rest are caused by the
genuine program and target database differences between our BLAST and BLAT searches:

The program name and version. Bio.SearchIO knows that the program is BLAT, but in the output file there is no information regarding the program version so it defaults to <unknown version>.
The query ID, description, and its sequence length. Notice here that these details are slightly different from the ones we saw in BLAST. The ID is mystery_seq instead of 42291, there is no known description, but the query length is still 61. This is actually a difference introduced by the file formats themselves. BLAST sometimes creates its own query IDs and uses your original ID as the sequence description.
The target database is not known, as it is not stated in the BLAT output file.
And finally, the list of hits we have is completely different. Here, we see that our query sequence only hits the chr19 database entry, but in it we see 17 HSP regions. This should not be surprising however, given that we are using a different program, each with its own target database.

All the details you saw when invoking the print method can be accessed individually using Python's object attribute access notation (a.k.a. the dot notation). There are
also other format-specific attributes that you can access using the same method.

>>> print("%s %s" % (blast_qresult.program, blast_qresult.version))
blastn 2.2.27+
>>> print("%s %s" % (blat_qresult.program, blat_qresult.version))
blat <unknown version>
>>> blast_qresult.param_evalue_threshold    # blast-xml specific
10.0

For a complete list of accessible attributes, you can check each format-specific documentation. Here are the ones for BLAST and for BLAT.

Having looked at using print on QueryResult objects, let's drill down deeper. What exactly is a QueryResult? In terms of Python objects, QueryResult is a hybrid between a list and a dictionary. In other words, it is a container object with all the convenient features of lists and dictionaries.

Like Python lists and dictionaries, QueryResult objects are iterable. Each iteration returns a Hit object:
>>> for hit in blast_qresult:
...     hit
Hit(id='gi|262205317|ref|NR_030195.1|', query_id='42291', 1 hsps)
Hit(id='gi|301171311|ref|NR_035856.1|', query_id='42291', 1 hsps)
Hit(id='gi|270133242|ref|NR_032573.1|', query_id='42291', 1 hsps)
Hit(id='gi|301171322|ref|NR_035857.1|', query_id='42291', 2 hsps)
Hit(id='gi|301171267|ref|NR_035851.1|', query_id='42291', 1 hsps)
...

To check how many items (hits) a QueryResult has, you can simply invoke Python's len function:
>>> len(blast_qresult)
100
>>> len(blat_qresult)
1

Like Python lists, you can retrieve items (hits) from a QueryResult using index notation:

>>> blast_qresult[0]    # retrieves the top hit
Hit(id='gi|262205317|ref|NR_030195.1|', query_id='42291', 1 hsps)
>>> blast_qresult[-1]   # retrieves the last hit
Hit(id='gi|397513516|ref|XM_003827011.1|', query_id='42291', 1 hsps)

To retrieve multiple hits, you can slice QueryResult objects using the slice notation as well. In this case, the slice will return a new QueryResult object containing only the sliced hits:

>>> blast_slice = blast_qresult[:3]    # slices the first three hits
>>> print(blast_slice)
Program: blastn (2.2.27+)
  Query: 42291 (61)
         mystery_seq
 Target: refseq_rna
   Hits: ----  -----  ----------------------------------------------------------
            #  # HSP  ID + description
         ----  -----  ----------------------------------------------------------
            0      1  gi|262205317|ref|NR_030195.1|  Homo sapiens microRNA 52...
            1      1  gi|301171311|ref|NR_035856.1|  Pan troglodytes microRNA...
            2      1  gi|270133242|ref|NR_032573.1|  Macaca mulatta microRNA ...

Like Python dictionaries, you can also retrieve hits using the hit's ID. This is particularly useful if you know a given hit ID exists within a search query's results:
>>> blast_qresult['gi|262205317|ref|NR_030195.1|']
Hit(id='gi|262205317|ref|NR_030195.1|', query_id='42291', 1 hsps)

You can also get a full list of Hit objects using hits and a full list of Hit IDs using hit_keys:
>>> blast_qresult.hits
[...]       # list of all hits
>>> blast_qresult.hit_keys
[...]       # list of all hit IDs

What if you just want to check whether a particular hit is present in the query results? You can do a simple Python membership test using the in keyword:

>>> 'gi|262205317|ref|NR_030195.1|' in blast_qresult
True
>>> 'gi|262205317|ref|NR_030194.1|' in blast_qresult
False
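
The list-and-dictionary behaviour shown so far can be imitated in a few lines of plain Python. The sketch below only illustrates the idea and is not how Bio.SearchIO implements QueryResult; HybridContainer and Item are hypothetical names, and every item is assumed to carry a unique string id.

```python
from collections import OrderedDict


class HybridContainer:
    """Toy ordered container addressable by position, slice, or string ID."""

    def __init__(self, items=()):
        # Assumption: every item exposes a unique string `id` attribute.
        self._items = OrderedDict((item.id, item) for item in items)

    def __len__(self):
        return len(self._items)

    def __iter__(self):
        return iter(self._items.values())

    def __contains__(self, key):
        return key in self._items

    def __getitem__(self, key):
        if isinstance(key, str):        # dictionary-style lookup by ID
            return self._items[key]
        values = list(self._items.values())
        if isinstance(key, slice):      # slicing returns a new container
            return HybridContainer(values[key])
        return values[key]              # list-style lookup by position

    def index(self, key):
        """Return the zero-based rank of the item with the given ID."""
        return list(self._items).index(key)


class Item:
    """Minimal stand-in for a hit: it only has an ID."""

    def __init__(self, id):
        self.id = id


hits = HybridContainer(Item(name) for name in ["NR_1", "NR_2", "NR_3"])
print(len(hits), hits[0].id, hits[-1].id, hits["NR_2"].id)
print(len(hits[:2]), "NR_3" in hits, hits.index("NR_3"))
```

Keeping the items in an ordered mapping keyed by ID is one simple way to get both positional and keyed access from the same object.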

Sometimes, knowing whether a hit is present is not enough; you also want to know the rank of the hit. Here, the index method comes to the rescue:

>>> blast_qresult.index('gi|301171437|ref|NR_035870.1|')
22

Remember that we're using Python's indexing style here, which is zero-based. This means our hit above is ranked at no. 23, not 22.

Also, note that the hit rank you see here is based on the native hit ordering present in the original search output file. Different search tools may order these hits based on different criteria.

If the native hit ordering doesn't suit your taste, you can use the sort method of the QueryResult object. It is very similar to Python's list.sort method, with the addition of an option to create a new sorted QueryResult object or not.

Here is an example of using QueryResult.sort to sort the hits based on each hit's full sequence length. For this particular sort, we'll set the in_place flag to False so that sorting will return a new QueryResult object and leave our initial object unsorted. We'll also set the reverse flag to True so that we sort in descending order.
>>> for hit in blast_qresult[:5]:   # id and sequence length of the first five hits
...     print("%s %i" % (hit.id, hit.seq_len))
...
gi|262205317|ref|NR_030195.1| 61
gi|301171311|ref|NR_035856.1| 60
gi|270133242|ref|NR_032573.1| 85
gi|301171322|ref|NR_035857.1| 86
gi|301171267|ref|NR_035851.1| 80

>>> sort_key = lambda hit: hit.seq_len
>>> sorted_qresult = blast_qresult.sort(key=sort_key, reverse=True, in_place=False)
>>> for hit in sorted_qresult[:5]:
...     print("%s %i" % (hit.id, hit.seq_len))
...
gi|397513516|ref|XM_003827011.1| 6002
gi|390332045|ref|XM_776818.2| 4082
gi|390332043|ref|XM_003723358.1| 4079
gi|356517317|ref|XM_003527287.1| 3251
gi|356543101|ref|XM_003539954.1| 2936

The advantage of having the in_place flag here is that we're preserving the native ordering, so we may use it again later. You should note that this is not the default behavior of QueryResult.sort, however, which is why we needed to set the in_place flag to False explicitly.
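
To make the effect of the in_place flag concrete, here is a minimal sketch on a plain Python list. The function sort_items is a hypothetical helper written for this illustration, not part of Bio.SearchIO; it assumes, as described above, that sorting in place is the default and that the in-place variant returns nothing.

```python
def sort_items(items, key=None, reverse=False, in_place=True):
    """Sort a list in place (returning None) or return a new sorted list."""
    if in_place:
        items.sort(key=key, reverse=reverse)
        return None
    return sorted(items, key=key, reverse=reverse)


lengths = [61, 60, 85, 86, 80]
by_length = sort_items(lengths, reverse=True, in_place=False)
print(by_length)   # a new list, sorted in descending order
print(lengths)     # the original ordering is untouched
```

Calling sort_items(lengths) without in_place=False would instead reorder lengths itself, which is the case the paragraph above warns about.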

At this point, you know enough about QueryResult objects to make them work for you. But before we go on to the next object in the Bio.SearchIO model, let's take a look at two more sets of methods that could make it even easier to work with QueryResult objects: the filter and map methods.

If you're familiar with Python's list comprehensions, generator expressions or the built-in filter and map functions, you'll know how useful they are for working with list-like objects (if you're not, check them out!). You can use these built-in functions to manipulate QueryResult objects, but you'll end up with regular Python lists and lose the ability to do more interesting manipulations.

That's why QueryResult objects provide their own flavor of filter and map methods. Analogous to filter, there are hit_filter and hsp_filter methods. As their names imply, these methods filter a QueryResult object either on its Hit objects or HSP objects. Similarly, analogous to map, QueryResult objects also provide the hit_map and hsp_map methods. These methods apply a given function to all hits or HSPs in a QueryResult object, respectively.

Let's see these methods in action, beginning with hit_filter. This method accepts a callback function that checks whether a given Hit object passes the condition you set or not. In other words, the function must accept as its argument a single Hit object and return True or False.

Here is an example of using hit_filter to filter out Hit objects that only have one HSP:

>>> filter_func = lambda hit: len(hit.hsps) > 1     # the callback function
>>> len(blast_qresult)      # no. of hits before filtering
100
>>> filtered_qresult = blast_qresult.hit_filter(filter_func)
>>> len(filtered_qresult)   # no. of hits after filtering
37
>>> for hit in filtered_qresult[:5]:    # quick check for the hit lengths
...     print("%s %i" % (hit.id, len(hit.hsps)))
gi|301171322|ref|NR_035857.1| 2
gi|262205330|ref|NR_030198.1| 2
gi|301171447|ref|NR_035871.1| 2
gi|262205298|ref|NR_030190.1| 2
gi|270132717|ref|NR_032716.1| 2

hsp_filter works the same as hit_filter, only instead of looking at the Hit objects, it performs filtering on the HSP objects in each hit.

As for the map methods, they too accept a callback function as their argument. However, instead of returning True or False, the callback function must return the modified Hit or HSP object (depending on whether you're using hit_map or hsp_map).

Let's see an example where we're using hit_map to rename the hit IDs:

>>> def map_func(hit):
...     hit.id = hit.id.split('|')[3]   # renames 'gi|301171322|ref|NR_035857.1|' to 'NR_035857.1'
...     return hit
...
>>> mapped_qresult = blast_qresult.hit_map(map_func)
>>> for hit in mapped_qresult[:5]:
...     print(hit.id)
NR_030195.1
NR_035856.1
NR_032573.1
NR_035857.1
NR_035851.1

Again, hsp_map works the same as hit_map, but on HSP objects instead of Hit objects.
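
The two callback contracts (a predicate for filtering, a function returning the modified object for mapping) can be sketched without Biopython. ToyHit, hit_filter and hit_map below are hypothetical stand-ins operating on plain lists; working on copies inside hit_map is an assumption of this sketch, made so that the input is left unchanged.

```python
from copy import deepcopy


class ToyHit:
    """Minimal stand-in for a hit: an ID plus a list of HSP placeholders."""

    def __init__(self, id, hsps):
        self.id = id
        self.hsps = list(hsps)


def hit_filter(hits, func=None):
    """Keep only the hits for which the callback returns True."""
    return [hit for hit in hits if func is None or func(hit)]


def hit_map(hits, func=None):
    """Apply the callback to a copy of every hit and collect what it returns."""
    copies = [deepcopy(hit) for hit in hits]
    return copies if func is None else [func(hit) for hit in copies]


def shorten_id(hit):
    hit.id = hit.id.split("|")[3]   # 'gi|1|ref|NR_1.1|' becomes 'NR_1.1'
    return hit                      # the map callback must return the object


hits = [ToyHit("gi|1|ref|NR_1.1|", ["a"]), ToyHit("gi|2|ref|NR_2.1|", ["a", "b"])]
multi = hit_filter(hits, lambda hit: len(hit.hsps) > 1)
renamed = hit_map(hits, shorten_id)

print([hit.id for hit in multi])
print([hit.id for hit in renamed])
print(hits[0].id)   # the input list still holds the original IDs
```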

8.1.2 Hit
Hit objects represent all query results from a single database entry. They are the second-level container in the Bio.SearchIO object hierarchy. You've seen that they are contained by QueryResult objects, but they themselves contain HSP objects.

Let's see what they look like, beginning with our BLAST search:

>>> from Bio import SearchIO
>>> blast_qresult = SearchIO.read('my_blast.xml', 'blast-xml')
>>> blast_hit = blast_qresult[3]    # fourth hit from the query result
>>> print(blast_hit)
Query: 42291
       mystery_seq
  Hit: gi|301171322|ref|NR_035857.1| (86)
       Pan troglodytes microRNA mir-520c (MIR520C), microRNA
 HSPs: ----  --------  ---------  ------  ---------------  ---------------------
          #   E-value  Bit score    Span      Query range              Hit range
       ----  --------  ---------  ------  ---------------  ---------------------
          0   8.9e-20     100.47      60           [1:61]                [13:73]
          1   3.3e-06      55.39      60           [0:60]                [13:73]

You see that we've got the essentials covered here:

The query ID and description are present. A hit is always tied to a query, so we want to keep track of the originating query as well. These values can be accessed from a hit using the query_id and query_description attributes.
We also have the unique hit ID, description, and full sequence length. They can be accessed using id, description, and seq_len, respectively.
Finally, there's a table containing quick information about the HSPs this hit contains. In each row, we've got the important HSP details listed: the HSP index, its e-value, its bit score, its span (the alignment length including gaps), its query coordinates, and its hit coordinates.

Now let's contrast this with the BLAT search. Remember that in the BLAT search we had one hit with 17 HSPs.
>>> blat_qresult = SearchIO.read('my_blat.psl', 'blat-psl')
>>> blat_hit = blat_qresult[0]      # the only hit
>>> print(blat_hit)
Query: mystery_seq
       <unknown description>
  Hit: chr19 (59128983)
       <unknown description>
 HSPs: ----  --------  ---------  ------  ---------------  ---------------------
          #   E-value  Bit score    Span      Query range              Hit range
       ----  --------  ---------  ------  ---------------  ---------------------
          0         ?          ?       ?           [0:61]    [54204480:54204541]
          1         ?          ?       ?           [0:61]    [54233104:54264463]
          2         ?          ?       ?           [0:61]    [54254477:54260071]
          3         ?          ?       ?           [1:61]    [54210720:54210780]
          4         ?          ?       ?           [0:60]    [54198476:54198536]
          5         ?          ?       ?           [0:61]    [54265610:54265671]
          6         ?          ?       ?           [0:61]    [54238143:54240175]
          7         ?          ?       ?           [0:60]    [54189735:54189795]
          8         ?          ?       ?           [0:61]    [54185425:54185486]
          9         ?          ?       ?           [0:60]    [54197657:54197717]
         10         ?          ?       ?           [0:61]    [54255662:54255723]
         11         ?          ?       ?           [0:61]    [54201651:54201712]
         12         ?          ?       ?           [8:60]    [54206009:54206061]
         13         ?          ?       ?          [10:61]    [54178987:54179038]
         14         ?          ?       ?           [8:61]    [54212018:54212071]
         15         ?          ?       ?           [8:51]    [54234278:54234321]
         16         ?          ?       ?           [8:61]    [54238143:54238196]

Here, we've got a similar level of detail as with the BLAST hit we saw earlier. There are some differences worth explaining, though:

The e-value and bit score column values. As BLAT HSPs do not have e-values and bit scores, the display defaults to '?'.
What about the span column? The span values are meant to display the complete alignment length, which consists of all residues and any gaps that may be present. The PSL format does not have this information readily available and Bio.SearchIO does not attempt to guess what it is, so we get a '?' similar to the e-value and bit score columns.

In terms of Python objects, Hit behaves almost the same as Python lists, but contains HSP objects exclusively. If you're familiar with lists, you should encounter no difficulties working with the Hit object.

Just like Python lists, Hit objects are iterable, and each iteration returns one HSP object it contains:

>>> for hsp in blast_hit:
...     hsp
HSP(hit_id='gi|301171322|ref|NR_035857.1|', query_id='42291', 1 fragments)
HSP(hit_id='gi|301171322|ref|NR_035857.1|', query_id='42291', 1 fragments)

You can invoke len on a Hit to see how many HSP objects it has:
>>> len(blast_hit)
2
>>> len(blat_hit)
17

You can use the slice notation on Hit objects, whether to retrieve a single HSP or multiple HSP objects. Like QueryResult, if you slice for multiple HSPs, a new Hit object will be returned containing only the sliced HSP objects:

>>> blat_hit[0]                 # retrieve single items
HSP(hit_id='chr19', query_id='mystery_seq', 1 fragments)
>>> sliced_hit = blat_hit[4:9]  # retrieve multiple items
>>> len(sliced_hit)
5
>>> print(sliced_hit)
Query: mystery_seq
       <unknown description>
  Hit: chr19 (59128983)
       <unknown description>
 HSPs: ----  --------  ---------  ------  ---------------  ---------------------
          #   E-value  Bit score    Span      Query range              Hit range
       ----  --------  ---------  ------  ---------------  ---------------------
          0         ?          ?       ?           [0:60]    [54198476:54198536]
          1         ?          ?       ?           [0:61]    [54265610:54265671]
          2         ?          ?       ?           [0:61]    [54238143:54240175]
          3         ?          ?       ?           [0:60]    [54189735:54189795]
          4         ?          ?       ?           [0:61]    [54185425:54185486]

You can also sort the HSPs inside a Hit, using the exact same arguments as the sort method you saw in the QueryResult object.

Finally, there are also the filter and map methods you can use on Hit objects. Unlike in the QueryResult object, Hit objects only have one variant of filter (Hit.filter) and one variant of map (Hit.map). Both Hit.filter and Hit.map work on the HSP objects a Hit has.

8.1.3 HSP
HSP (high-scoring pair) represents region(s) in the hit sequence that contain significant alignment(s) to the query sequence. It contains the actual match between your query sequence and a database entry. As this match is determined by the sequence search tool's algorithms, the HSP object contains the bulk of the statistics computed by the search tool. This also makes the distinction between HSP objects from different search tools more apparent compared to the differences you've seen in QueryResult or Hit objects.

Let's see some examples from our BLAST and BLAT searches. We'll look at the BLAST HSP first:

>>> from Bio import SearchIO
>>> blast_qresult = SearchIO.read('my_blast.xml', 'blast-xml')
>>> blast_hsp = blast_qresult[0][0]    # first hit, first hsp
>>> print(blast_hsp)
      Query: 42291 mystery_seq
        Hit: gi|262205317|ref|NR_030195.1| Homo sapiens microRNA 520b (MIR520...
Query range: [0:61] (1)
  Hit range: [0:61] (1)
Quick stats: evalue 4.9e-23; bitscore 111.29
  Fragments: 1 (61 columns)
     Query - CCCTCTACAGGGAAGCGCTTTCTGTTGTCTGAAAGAAAAGAAAGTGCTTCCTTTTAGAGGG
             |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
       Hit - CCCTCTACAGGGAAGCGCTTTCTGTTGTCTGAAAGAAAAGAAAGTGCTTCCTTTTAGAGGG

Just like QueryResult and Hit, invoking print on an HSP shows its general details:

There are the query and hit IDs and descriptions. We need these to identify our HSP.
We've also got the matching range of the query and hit sequences. The slice notation we're using here is an indication that the range is displayed using Python's indexing style (zero-based, half open). The number inside the parentheses denotes the strand. In this case, both sequences have the plus strand.
Some quick statistics are available: the e-value and bit score.
There is information about the HSP fragments. Ignore this for now; it will be explained later on.
And finally, we have the query and hit sequence alignment itself.

These details can be accessed on their own using the dot notation, just like in QueryResult and Hit:

>>> blast_hsp.query_range
(0, 61)

>>> blast_hsp.evalue
4.91307e-23

They're not the only attributes available, though. HSP objects come with a default set of properties that makes it easy to probe their various details. Here are some examples:

>>> blast_hsp.hit_start         # start coordinate of the hit sequence
0
>>> blast_hsp.query_span        # how many residues in the query sequence
61
>>> blast_hsp.aln_span          # how long the alignment is
61

Check out the HSP documentation for a full list of these predefined properties.

Furthermore, each sequence search tool usually computes its own statistics / details for its HSP objects. For example, an XML BLAST search also outputs the number of gaps and identical residues. These attributes can be accessed like so:

>>> blast_hsp.gap_num       # number of gaps
0
>>> blast_hsp.ident_num     # number of identical residues
61

These details are format-specific; they may not be present in other formats. To see which details are available for a given sequence search tool, you should check the format's documentation in Bio.SearchIO. Alternatively, you may also use .__dict__.keys() for a quick list of what's available:
>>> blast_hsp.__dict__.keys()
['bitscore', 'evalue', 'ident_num', 'gap_num', 'bitscore_raw', 'pos_num', '_items']

Finally, you may have noticed that the query and hit attributes of our HSP are not just regular strings:
>>> blast_hsp.query
SeqRecord(seq=Seq('CCCTCTACAGGGAAGCGCTTTCTGTTGTCTGAAAGAAAAGAAAGTGCTTCCTTT...GGG', DNAAlphabet()), id='42291', name='aligned query sequence', description='mystery_
>>> blast_hsp.hit
SeqRecord(seq=Seq('CCCTCTACAGGGAAGCGCTTTCTGTTGTCTGAAAGAAAAGAAAGTGCTTCCTTT...GGG', DNAAlphabet()), id='gi|262205317|ref|NR_030195.1|', name='aligned hit sequence',

They are SeqRecord objects you saw earlier in Section 4! This means that you can do all sorts of interesting things you can do with SeqRecord objects on HSP.query and/or HSP.hit.

It should not surprise you now that the HSP object has an alignment property which is a MultipleSeqAlignment object:

>>> print(blast_hsp.aln)
DNAAlphabet() alignment with 2 rows and 61 columns
CCCTCTACAGGGAAGCGCTTTCTGTTGTCTGAAAGAAAAGAAAG...GGG 42291
CCCTCTACAGGGAAGCGCTTTCTGTTGTCTGAAAGAAAAGAAAG...GGG gi|262205317|ref|NR_030195.1|

Having probed the BLAST HSP, let's now take a look at HSPs from our BLAT results for a different kind of HSP. As usual, we'll begin by invoking print on it:

>>> blat_qresult = SearchIO.read('my_blat.psl', 'blat-psl')
>>> blat_hsp = blat_qresult[0][0]       # first hit, first hsp
>>> print(blat_hsp)
      Query: mystery_seq <unknown description>
        Hit: chr19 <unknown description>
Query range: [0:61] (1)
  Hit range: [54204480:54204541] (1)
Quick stats: evalue ?; bitscore ?
  Fragments: 1 (? columns)

Some of the outputs you may have already guessed. We have the query and hit IDs and descriptions and the sequence coordinates. Values for evalue and bitscore are '?' as BLAT HSPs do not have these attributes. But the biggest difference here is that you don't see any sequence alignments displayed. If you look closer, PSL formats themselves do not have any hit or query sequences, so Bio.SearchIO won't create any sequence or alignment objects. What happens if you try to access HSP.query, HSP.hit, or HSP.aln? You'll get the default value for these attributes, which is None:

>>> blat_hsp.hit is None
True
>>> blat_hsp.query is None
True
>>> blat_hsp.aln is None
True

This does not affect other attributes, though. For example, you can still access the length of the query or hit alignment. Despite not displaying any attributes, the PSL format still has this information so Bio.SearchIO can extract it:

>>> blat_hsp.query_span     # length of query match
61
>>> blat_hsp.hit_span       # length of hit match
61

Other format-specific attributes are still present as well:
>>> blat_hsp.score          # PSL score
61
>>> blat_hsp.mismatch_num   # the mismatch column
0

So far so good? Things get more interesting when you look at another variant of HSP present in our BLAT results. You might recall that in BLAT searches, sometimes we get our results separated into blocks. These blocks are essentially alignment fragments that may have some intervening sequence between them.

Let's take a look at a BLAT HSP that contains multiple blocks to see how Bio.SearchIO deals with this:

>>> blat_hsp2 = blat_qresult[0][1]      # first hit, second hsp
>>> print(blat_hsp2)
      Query: mystery_seq <unknown description>
        Hit: chr19 <unknown description>
Query range: [0:61] (1)
  Hit range: [54233104:54264463] (1)
Quick stats: evalue ?; bitscore ?
  Fragments: ---  --------------  ----------------------  ----------------------
               #            Span             Query range               Hit range
             ---  --------------  ----------------------  ----------------------
               0               ?                  [0:18]     [54233104:54233122]
               1               ?                 [18:61]     [54264420:54264463]

What's happening here? We still have some essential details covered: the IDs and descriptions, the coordinates, and the quick statistics are similar to what you've seen before. But the fragments detail is all different. Instead of showing 'Fragments: 1', we now have a table with two data rows.

This is how Bio.SearchIO deals with HSPs having multiple fragments. As mentioned before, an HSP alignment may be separated by intervening sequences into fragments. The intervening sequences are not part of the query-hit match, so they should not be considered part of the query or the hit sequence. However, they do affect how we deal with sequence coordinates, so we can't ignore them.

Take a look at the hit coordinate of the HSP above. In the Hit range: field, we see that the coordinate is [54233104:54264463]. But looking at the table rows, we see that not the entire region spanned by this coordinate matches our query. Specifically, the intervening region spans from 54233122 to 54264420.

Why then do the query coordinates seem to be contiguous, you ask? This is perfectly fine. In this case it means that the query match is contiguous (no intervening regions), while the hit match is not.

All these attributes are accessible from the HSP directly, by the way:

>>> blat_hsp2.hit_range         # hit start and end coordinates of the entire HSP
(54233104, 54264463)
>>> blat_hsp2.hit_range_all     # hit start and end coordinates of each fragment
[(54233104, 54233122), (54264420, 54264463)]
>>> blat_hsp2.hit_span          # hit span of the entire HSP
31359
>>> blat_hsp2.hit_span_all      # hit span of each fragment
[18, 43]
>>> blat_hsp2.hit_inter_ranges  # start and end coordinates of intervening regions in the hit sequence
[(54233122, 54264420)]
>>> blat_hsp2.hit_inter_spans   # span of intervening regions in the hit sequence
[31298]

Most of these attributes are not readily available from the PSL file we have, but Bio.SearchIO calculates them for you on the fly when you parse the PSL file. All it needs are the start and end coordinates of each fragment.
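
The arithmetic behind these derived values is simple enough to reproduce by hand. The sketch below is not Bio.SearchIO's own code; summarize_fragments is a hypothetical helper that assumes zero-based, half-open, non-overlapping fragment ranges and recomputes the numbers shown above from the two fragment coordinates.

```python
def summarize_fragments(ranges):
    """Derive overall and intervening coordinates from per-fragment ranges."""
    ranges = sorted(ranges)
    overall = (ranges[0][0], max(end for _, end in ranges))
    # Each gap runs from the end of one fragment to the start of the next.
    inter_ranges = [(left[1], right[0]) for left, right in zip(ranges, ranges[1:])]
    return {
        "range": overall,
        "span": overall[1] - overall[0],
        "span_all": [end - start for start, end in ranges],
        "inter_ranges": inter_ranges,
        "inter_spans": [end - start for start, end in inter_ranges],
    }


summary = summarize_fragments([(54233104, 54233122), (54264420, 54264463)])
for name, value in summary.items():
    print(name, value)
```

Because the coordinates are half-open, every span is just end minus start, with no off-by-one correction.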

What about the query, hit, and aln attributes? If the HSP has multiple fragments, you won't be able to use these attributes as they only fetch single SeqRecord or MultipleSeqAlignment objects. However, you can use their *_all counterparts: query_all, hit_all, and aln_all. These properties will return a list containing SeqRecord or MultipleSeqAlignment objects from each of the HSP fragments. There are other attributes that behave similarly, i.e. they only work for HSPs with one fragment. Check out the HSP documentation for a full list.

Finally, to check whether you have multiple fragments or not, you can use the is_fragmented property like so:

>>> blat_hsp2.is_fragmented     # BLAT HSP with 2 fragments
True
>>> blat_hsp.is_fragmented      # BLAT HSP from earlier, with one fragment
False

Before we move on, you should also know that we can use the slice notation on HSP objects, just like QueryResult or Hit objects. When you use this notation, you'll get an HSPFragment object in return, the last component of the object model.

8.1.4 HSPFragment
HSPFragment represents a single, contiguous match between the query and hit sequences. You could consider it the core of the object model and search result, since it is the presence of these fragments that determines whether your search has results or not.

In most cases, you don't have to deal with HSPFragment objects directly since not that many sequence search tools fragment their HSPs. When you do have to deal with them, what you should remember is that HSPFragment objects were written to be as compact as possible. In most cases, they only contain attributes directly related to sequences: strands, reading frames, alphabets, coordinates, the sequences themselves, and their IDs and descriptions.

These attributes are readily shown when you invoke print on an HSPFragment. Here's an example, taken from our BLAST search:

>>> from Bio import SearchIO
>>> blast_qresult = SearchIO.read('my_blast.xml', 'blast-xml')
>>> blast_frag = blast_qresult[0][0][0]    # first hit, first hsp, first fragment
>>> print(blast_frag)
      Query: 42291 mystery_seq
        Hit: gi|262205317|ref|NR_030195.1| Homo sapiens microRNA 520b (MIR520...
Query range: [0:61] (1)
  Hit range: [0:61] (1)
  Fragments: 1 (61 columns)
     Query - CCCTCTACAGGGAAGCGCTTTCTGTTGTCTGAAAGAAAAGAAAGTGCTTCCTTTTAGAGGG
             |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
       Hit - CCCTCTACAGGGAAGCGCTTTCTGTTGTCTGAAAGAAAAGAAAGTGCTTCCTTTTAGAGGG

At this level, the BLAT fragment looks quite similar to the BLAST fragment, save for the query and hit sequences which are not present:

>>> blat_qresult = SearchIO.read('my_blat.psl', 'blat-psl')
>>> blat_frag = blat_qresult[0][0][0]    # first hit, first hsp, first fragment
>>> print(blat_frag)
      Query: mystery_seq <unknown description>
        Hit: chr19 <unknown description>
Query range: [0:61] (1)
  Hit range: [54204480:54204541] (1)
  Fragments: 1 (? columns)

In all cases, these attributes are accessible using our favorite dot notation. Some examples:

>>> blast_frag.query_start      # query start coordinate
0
>>> blast_frag.hit_strand       # hit sequence strand
1
>>> blast_frag.hit              # hit sequence, as a SeqRecord object
SeqRecord(seq=Seq('CCCTCTACAGGGAAGCGCTTTCTGTTGTCTGAAAGAAAAGAAAGTGCTTCCTTT...GGG', DNAAlphabet()), id='gi|262205317|ref|NR_030195.1|', name='aligned hit sequence',

8.2 A note about standards and conventions
Before we move on to the main functions, there is something you ought to know about the standards Bio.SearchIO uses. If you've worked with multiple sequence search tools, you might have had to deal with the many different ways each program deals with things like sequence coordinates. It might not have been a pleasant experience as these search tools usually have their own standards. For example, one tool might use one-based coordinates, while another uses zero-based coordinates. Or, one program might reverse the start and end coordinates if the strand is minus, while others don't. In short, these often create an unnecessary mess that must be dealt with.

We realize this problem ourselves and we intend to address it in Bio.SearchIO. After all, one of the goals of Bio.SearchIO is to create a common, easy to use interface to deal with various search output files. This means creating standards that extend beyond the object model you just saw.

Now, you might complain, "Not another standard!". Well, eventually we have to choose one convention or the other, so this is necessary. Plus, we're not creating something entirely new here; we are just adopting a standard we think is best for a Python programmer (it is Biopython, after all).

There are three implicit standards that you can expect when working with Bio.SearchIO:

The first one pertains to sequence coordinates. In Bio.SearchIO, all sequence coordinates follow Python's coordinate style: zero-based and half open. For example, if in a BLAST XML output file the start and end coordinates of an HSP are 10 and 28, they would become 9 and 28 in Bio.SearchIO. The start coordinate becomes 9 because Python indices start from zero, while the end coordinate remains 28 as Python slices omit the last item in an interval.
The second is on sequence coordinate orders. In Bio.SearchIO, start coordinates are always less than or equal to end coordinates. This isn't always the case with all sequence search tools, as some of them have larger start coordinates when the sequence strand is minus.
The last one is on strand and reading frame values. For strands, there are only four valid choices: 1 (plus strand), -1 (minus strand), 0 (protein sequences), and None (no strand). For reading frames, the valid choices are integers from -3 to 3 and None.

Note that these standards only exist in Bio.SearchIO objects. If you write Bio.SearchIO objects into an output format, Bio.SearchIO will use the format's standard for the output. It does not force its standard over to your output file.
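
As a worked example of the first two conventions, the following sketch converts one-based, inclusive coordinates (possibly reversed, as some tools report for the minus strand) into the zero-based, half-open form. The function to_zero_based_half_open is a hypothetical helper for illustration only, not a Bio.SearchIO function.

```python
def to_zero_based_half_open(start, end):
    """Convert one-based, inclusive coordinates to zero-based, half-open ones.

    The inputs may arrive in either order; the result always has start <= end.
    """
    low, high = sorted((start, end))
    return low - 1, high


print(to_zero_based_half_open(10, 28))   # the BLAST XML example from the text
print(to_zero_based_half_open(28, 10))   # same region reported in reverse order

# The converted pair can be used directly as a Python slice.
sequence = "ACGTACGT"
start, end = to_zero_based_half_open(2, 4)   # residues 2 to 4, one-based
print(sequence[start:end])
```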

8.3 Reading search output files
There are two functions you can use for reading search output files into Bio.SearchIO objects: read and parse. They're essentially similar to the read and parse functions in other submodules like Bio.SeqIO or Bio.AlignIO. In both cases, you need to supply the search output file name and the file format name, both as Python strings. You can check the documentation for a list of format names Bio.SearchIO recognizes.

Bio.SearchIO.read is used for reading search output files with only one query and returns a QueryResult object. You've seen read used in our previous examples. What you haven't seen is that read may also accept additional keyword arguments, depending on the file format.

Here are some examples. In the first one, we use read just like previously to read a BLAST tabular output file. In the second one, we use a keyword argument to modify it so it parses the BLAST tabular variant with comments in it:

>>> from Bio import SearchIO
>>> qresult = SearchIO.read('tab_2226_tblastn_003.txt', 'blast-tab')
>>> qresult
QueryResult(id='gi|16080617|ref|NP_391444.1|', 3 hits)
>>> qresult2 = SearchIO.read('tab_2226_tblastn_007.txt', 'blast-tab', comments=True)
>>> qresult2
QueryResult(id='gi|16080617|ref|NP_391444.1|', 3 hits)

These keyword arguments differ among file formats. Check the format documentation to see if it has keyword arguments that modify its parser's behavior.

As for Bio.SearchIO.parse, it is used for reading search output files with any number of queries. The function returns a generator object that yields a QueryResult object in each iteration. Like Bio.SearchIO.read, it also accepts format-specific keyword arguments:

>>> from Bio import SearchIO
>>> qresults = SearchIO.parse('tab_2226_tblastn_001.txt', 'blast-tab')
>>> for qresult in qresults:
...     print(qresult.id)
gi|16080617|ref|NP_391444.1|
gi|11464971:4-101
>>> qresults2 = SearchIO.parse('tab_2226_tblastn_005.txt', 'blast-tab', comments=True)
>>> for qresult in qresults2:
...     print(qresult.id)
random_s00
gi|16080617|ref|NP_391444.1|
gi|11464971:4-101
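
The generator behaviour of parse (one result per query, produced only when asked for) can be imitated on a few hand-written tabular lines. This is a deliberately simplified sketch and not Bio.SearchIO's parser: parse_by_query is a hypothetical function, the sample rows are made up, and it assumes tab-separated lines whose first column is the query ID, with all rows of a query adjacent.

```python
from itertools import groupby

# Made-up rows in a 12-column tabular layout; only the first column matters here.
SAMPLE = (
    "query1\thitA\t100.00\t61\t0\t0\t1\t61\t1\t61\t4.9e-23\t111\n"
    "query1\thitB\t98.33\t60\t1\t0\t1\t60\t13\t72\t1.7e-22\t110\n"
    "query2\thitC\t95.00\t40\t2\t0\t5\t44\t100\t139\t3.3e-06\t55.4\n"
)


def parse_by_query(lines):
    """Lazily yield (query_id, rows) pairs, one per block of adjacent rows."""
    rows = (line.rstrip("\n").split("\t") for line in lines if line.strip())
    for query_id, group in groupby(rows, key=lambda row: row[0]):
        yield query_id, list(group)


results = parse_by_query(SAMPLE.splitlines())
first_id, first_rows = next(results)     # only the first query is read so far
print(first_id, len(first_rows))
remaining = [(query_id, len(rows)) for query_id, rows in results]
print(remaining)
```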

8.4 Dealing with large search output files with indexing
Sometimes, you're handed a search output file containing hundreds or thousands of queries that you need to parse. You can of course use Bio.SearchIO.parse for this file, but that would be grossly inefficient if you need to access only a few of the queries. This is because parse will parse all queries it sees before it fetches your query of interest.

In this case, the ideal choice would be to index the file using Bio.SearchIO.index or Bio.SearchIO.index_db. If the names sound familiar, it's because you've seen them before in Section 5.4.2. These functions also behave similarly to their Bio.SeqIO counterparts, with the addition of format-specific keyword arguments.

Here are some examples. You can use index with just the filename and format name:

>>> from Bio import SearchIO
>>> idx = SearchIO.index('tab_2226_tblastn_001.txt', 'blast-tab')
>>> sorted(idx.keys())
['gi|11464971:4-101', 'gi|16080617|ref|NP_391444.1|']
>>> idx['gi|16080617|ref|NP_391444.1|']
QueryResult(id='gi|16080617|ref|NP_391444.1|', 3 hits)
>>> idx.close()

Or also with the format-specific keyword argument:

>>> idx = SearchIO.index('tab_2226_tblastn_005.txt', 'blast-tab', comments=True)
>>> sorted(idx.keys())
['gi|11464971:4-101', 'gi|16080617|ref|NP_391444.1|', 'random_s00']
>>> idx['gi|16080617|ref|NP_391444.1|']
QueryResult(id='gi|16080617|ref|NP_391444.1|', 3 hits)
>>> idx.close()

Or with the key_function argument, as in Bio.SeqIO:
>>> key_function = lambda id: id.upper()    # capitalizes the keys
>>> idx = SearchIO.index('tab_2226_tblastn_001.txt', 'blast-tab', key_function=key_function)
>>> sorted(idx.keys())
['GI|11464971:4-101', 'GI|16080617|REF|NP_391444.1|']
>>> idx['GI|16080617|REF|NP_391444.1|']
QueryResult(id='gi|16080617|ref|NP_391444.1|', 3 hits)
>>> idx.close()

Bio.SearchIO.index_db works like index, only it writes the query offsets into an SQLite database file.
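
The idea of indexing by offsets can be shown on an in-memory file. The sketch below is an illustration of the general technique, not Bio.SearchIO's indexer: build_offset_index and fetch_rows are hypothetical helpers, and the data are made-up two-column rows in which the first column is the query ID and all rows of a query are adjacent.

```python
import io


def build_offset_index(handle):
    """Map each query ID to the byte offset at which its first row starts."""
    offsets = {}
    while True:
        position = handle.tell()
        line = handle.readline()
        if not line:
            break
        query_id = line.split(b"\t", 1)[0].decode()
        offsets.setdefault(query_id, position)   # keep the first occurrence only
    return offsets


def fetch_rows(handle, offsets, query_id):
    """Jump straight to one query and read only its rows."""
    handle.seek(offsets[query_id])
    rows = []
    for line in iter(handle.readline, b""):
        fields = line.rstrip(b"\n").decode().split("\t")
        if fields[0] != query_id:
            break
        rows.append(fields)
    return rows


handle = io.BytesIO(b"q1\thitA\nq1\thitB\nq2\thitC\nq3\thitD\n")
offsets = build_offset_index(handle)
print(offsets)
print(fetch_rows(handle, offsets, "q2"))
```

The file is scanned once to record offsets; afterwards a lookup costs one seek plus reading that query's rows, however large the file is.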

8.5 Writing and converting search output files
It is occasionally useful to be able to manipulate search results from an output file and write them again to a new file. Bio.SearchIO provides a write function that lets you do exactly this. It takes as its arguments an iterable returning QueryResult objects, the output filename to write to, the format name to write to, and optionally some format-specific keyword arguments. It returns a four-item tuple, which denotes the number of QueryResult, Hit, HSP, and HSPFragment objects that were written.

>>> from Bio import SearchIO
>>> qresults = SearchIO.parse('mirna.xml', 'blast-xml')     # read XML file
>>> SearchIO.write(qresults, 'results.tab', 'blast-tab')    # write to tabular file
(3, 239, 277, 277)

You should note that different file formats require different attributes of the QueryResult, Hit, HSP and HSPFragment objects. If these attributes are not present, writing won't work. In other words, you can't always write to the output format that you want. For example, if you read a BLAST XML file, you wouldn't be able to write the results to a PSL file as PSL files require attributes not calculated by BLAST (e.g. the number of repeat matches). You can always set these attributes manually, if you really want to write to PSL, though.

Like read, parse, index, and index_db, write also accepts format-specific keyword arguments. Check out the documentation for a complete list of formats Bio.SearchIO can write to and their arguments.

Finally, Bio.SearchIO also provides a convert function, which is simply a shortcut for Bio.SearchIO.parse and Bio.SearchIO.write. Using the convert function, our example above would be:

>>> from Bio import SearchIO
>>> SearchIO.convert('mirna.xml', 'blast-xml', 'results.tab', 'blast-tab')
(3, 239, 277, 277)

As convert uses write, it is only limited to format conversions that have all the required attributes. Here, the BLAST XML file provides all the default values a BLAST tabular file requires, so it works just fine. However, other format conversions are less likely to work since you need to manually assign the required attributes first.

Chapter 9 Accessing NCBI's Entrez databases
Entrez (http://www.ncbi.nlm.nih.gov/Entrez) is a data retrieval system that provides users access to NCBI's databases such as PubMed, GenBank, GEO, and many others. You can access Entrez from a web browser to manually enter queries, or you can use Biopython's Bio.Entrez module for programmatic access to Entrez. The latter allows you for example to search PubMed or download GenBank records from within a Python script.

The Bio.Entrez module makes use of the Entrez Programming Utilities (also known as EUtils), consisting of eight tools that are described in detail on NCBI's page at http://www.ncbi.nlm.nih.gov/entrez/utils/. Each of these tools corresponds to one Python function in the Bio.Entrez module, as described in the sections below. This module makes sure that the correct URL is used for the queries, and that no more than three requests are made every second, as required by NCBI (see Section 9.1).

The output returned by the Entrez Programming Utilities is typically in XML format. To parse such output, you have several options:

1. Use Bio.Entrez's parser to parse the XML output into a Python object
2. Use the DOM (Document Object Model) parser in Python's standard library
3. Use the SAX (Simple API for XML) parser in Python's standard library
4. Read the XML output as raw text, and parse it by string searching and manipulation.

For the DOM and SAX parsers, see the Python documentation. The parser in Bio.Entrez is discussed below.
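
To give a feel for options 2 and 4, the sketch below parses a small hand-written XML snippet with the standard library's DOM parser and, for comparison, with a plain regular expression. The snippet is only an assumed, simplified imitation of the kind of document EInfo returns; it was not fetched from NCBI.

```python
import re
from xml.dom import minidom

# Hand-written sample; real Entrez output is longer and declares a DTD.
SAMPLE_XML = """<?xml version="1.0"?>
<eInfoResult>
  <DbList>
    <DbName>pubmed</DbName>
    <DbName>protein</DbName>
    <DbName>nucleotide</DbName>
  </DbList>
</eInfoResult>"""

# Option 2: build a DOM tree and collect the text of every DbName element.
document = minidom.parseString(SAMPLE_XML)
dom_names = [node.firstChild.data for node in document.getElementsByTagName("DbName")]

# Option 4: treat the XML as raw text and search it directly.
regex_names = re.findall(r"<DbName>(.*?)</DbName>", SAMPLE_XML)

print(dom_names)
print(regex_names)
```

String searching is quick to write but breaks easily on attributes, escaping or nested elements, which is one reason to prefer a real parser.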

NCBI uses DTD (Document Type Definition) files to describe the structure of the information contained in XML files. Most of the DTD files used by NCBI are included in the Biopython distribution. The Bio.Entrez parser makes use of the DTD files when parsing an XML file returned by NCBI Entrez.

Occasionally, you may find that the DTD file associated with a specific XML file is missing in the Biopython distribution. In particular, this may happen when NCBI updates its DTD files. If this happens, Entrez.read will show a warning message with the name and URL of the missing DTD file. The parser will proceed to access the missing DTD file through the internet, allowing the parsing of the XML file to continue. However, the parser is much faster if the DTD file is available locally. For this purpose, please download the DTD file from the URL in the warning message and place it in the directory ...site-packages/Bio/Entrez/DTDs, containing the other DTD files. If you don't have write access to this directory, you can also place the DTD file in ~/.biopython/Bio/Entrez/DTDs, where ~ represents your home directory. Since this directory is read before the directory ...site-packages/Bio/Entrez/DTDs, you can also put newer versions of DTD files there if the ones in ...site-packages/Bio/Entrez/DTDs become outdated. Alternatively, if you installed Biopython from source, you can add the DTD file to the source code's Bio/Entrez/DTDs directory, and reinstall Biopython. This will install the new DTD file in the correct location together with the other DTD files.
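
The effect of the directory search order can be sketched as follows. This is an illustration of "first match wins" lookup, not Biopython's actual lookup code; find_dtd is a hypothetical helper, and two temporary directories stand in for the per-user and the site-packages DTD directories.

```python
import tempfile
from pathlib import Path


def find_dtd(filename, search_dirs):
    """Return the first existing copy of filename, honouring directory order."""
    for directory in search_dirs:
        candidate = Path(directory) / filename
        if candidate.is_file():
            return candidate
    return None


with tempfile.TemporaryDirectory() as tmp:
    user_dir = Path(tmp) / "user_dtds"         # stands in for the per-user directory
    package_dir = Path(tmp) / "package_dtds"   # stands in for the installed directory
    for directory in (user_dir, package_dir):
        directory.mkdir()
    (package_dir / "example.dtd").write_text("old")
    (user_dir / "example.dtd").write_text("new")

    # The per-user directory is listed first, so its copy shadows the other one.
    found = find_dtd("example.dtd", [user_dir, package_dir])
    found_text = found.read_text()
    missing = find_dtd("absent.dtd", [user_dir, package_dir])

print(found_text, missing)
```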

The Entrez Programming Utilities can also generate output in other formats, such as the Fasta or GenBank file formats for sequence databases, or the MedLine format for the literature database, discussed in Section 9.12.

9.1 Entrez Guidelines
Before using Biopython to access the NCBI's online resources (via Bio.Entrez or some of the other modules), please read the NCBI's Entrez User Requirements. If the NCBI finds you are abusing their systems, they can and will ban your access!

To paraphrase:

For any series of more than 100 requests, do this at weekends or outside USA peak times. This is up to you to obey.
Use the http://eutils.ncbi.nlm.nih.gov address, not the standard NCBI Web address. Biopython uses this web address.
Make no more than three requests every second (relaxed from at most one request every three seconds in early 2009). This is automatically enforced by Biopython.
Use the optional email parameter so the NCBI can contact you if there is a problem. You can either explicitly set this as a parameter with each call to Entrez (e.g. include email="A.N.Other@example.com" in the argument list), or you can set a global email address:
>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com"

Bio.Entrez will then use this email address with each call to Entrez. The example.com address is a reserved domain name specifically for documentation (RFC 2606). Please DO NOT use a random email; it's better not to give an email at all. The email parameter has been mandatory since June 1, 2010. In case of excessive usage, NCBI will attempt to contact a user at the email address provided prior to blocking access to the E-utilities.
If you are using Biopython within some larger software suite, use the tool parameter to specify this. You can either explicitly set the tool name as a parameter with each call to Entrez (e.g. include tool="MyLocalScript" in the argument list), or you can set a global tool name:
>>> from Bio import Entrez
>>> Entrez.tool = "MyLocalScript"

The tool parameter will default to Biopython.
For large queries, the NCBI also recommend using their session history feature (the WebEnv session cookie string, see Section 9.15). This is only slightly more complicated.

Inconclusion,besensiblewithyourusagelevels.Ifyouplantodownloadlotsofdata,considerotheroptions.Forexample,ifyouwanteasyaccesstoallthehuman
genes,considerfetchingeachchromosomebyFTPasaGenBankfile,andimportingtheseintoyourownBioSQLdatabase(seeSection20.5).
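
Biopython enforces the request rate for you, so the following is only a sketch of what such a limit means in practice, for instance if you also call the E-utilities from code that does not go through Bio.Entrez. The class name RequestThrottle is hypothetical and this is not how Bio.Entrez itself is implemented.

```python
import time


class RequestThrottle:
    """Allow at most max_per_second calls per second by sleeping between calls."""

    def __init__(self, max_per_second=3, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last_call = None

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last_call is not None:
            remaining = self.min_interval - (now - self._last_call)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last_call = now
```

Calling throttle.wait() immediately before each request keeps consecutive requests at least 1/max_per_second seconds apart.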

9.2 EInfo: Obtaining information about the Entrez databases
EInfo provides field index term counts, last update, and available links for each of NCBI's databases. In addition, you can use EInfo to obtain a list of all database names accessible through the Entrez utilities:

>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com"  # Always tell NCBI who you are
>>> handle = Entrez.einfo()
>>> result = handle.read()
>>> handle.close()

The variable result now contains a list of databases in XML format:
>>> print(result)
<?xml version="1.0"?>
<!DOCTYPE eInfoResult PUBLIC "-//NLM//DTD eInfoResult, 11 May 2002//EN"
 "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eInfo_020511.dtd">
<eInfoResult>
<DbList>
<DbName>pubmed</DbName>
<DbName>protein</DbName>
<DbName>nucleotide</DbName>
<DbName>nuccore</DbName>
<DbName>nucgss</DbName>
<DbName>nucest</DbName>
<DbName>structure</DbName>
<DbName>genome</DbName>
<DbName>books</DbName>
<DbName>cancerchromosomes</DbName>
<DbName>cdd</DbName>
<DbName>gap</DbName>
<DbName>domains</DbName>
<DbName>gene</DbName>
<DbName>genomeprj</DbName>
<DbName>gensat</DbName>
<DbName>geo</DbName>
<DbName>gds</DbName>
<DbName>homologene</DbName>
<DbName>journals</DbName>
<DbName>mesh</DbName>
<DbName>ncbisearch</DbName>
<DbName>nlmcatalog</DbName>
<DbName>omia</DbName>
<DbName>omim</DbName>
<DbName>pmc</DbName>
<DbName>popset</DbName>
<DbName>probe</DbName>
<DbName>proteinclusters</DbName>
<DbName>pcassay</DbName>
<DbName>pccompound</DbName>
<DbName>pcsubstance</DbName>
<DbName>snp</DbName>
<DbName>taxonomy</DbName>
<DbName>toolkit</DbName>
<DbName>unigene</DbName>
<DbName>unists</DbName>
</DbList>
</eInfoResult>

Since this is a fairly simple XML file, we could extract the information it contains simply by string searching. Using Bio.Entrez's parser instead, we can directly parse this XML file into a Python object:
>>> from Bio import Entrez
>>> handle = Entrez.einfo()
>>> record = Entrez.read(handle)

Now record is a dictionary with exactly one key:

>>> record.keys()
[u'DbList']

The value stored under this key is the list of database names shown in the XML above:

>>> record["DbList"]
['pubmed', 'protein', 'nucleotide', 'nuccore', 'nucgss', 'nucest',
 'structure', 'genome', 'books', 'cancerchromosomes', 'cdd', 'gap',
 'domains', 'gene', 'genomeprj', 'gensat', 'geo', 'gds', 'homologene',
 'journals', 'mesh', 'ncbisearch', 'nlmcatalog', 'omia', 'omim', 'pmc',
 'popset', 'probe', 'proteinclusters', 'pcassay', 'pccompound',
 'pcsubstance', 'snp', 'taxonomy', 'toolkit', 'unigene', 'unists']
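
For comparison, the same list can be pulled out of the raw XML with the standard library alone. This is a minimal sketch using xml.etree.ElementTree rather than Bio.Entrez; the function name database_names and the shortened sample document are made up for the example.

```python
import xml.etree.ElementTree as ET


def database_names(einfo_xml):
    """Return the text of every <DbName> element in an EInfo result."""
    root = ET.fromstring(einfo_xml)
    return [element.text for element in root.iter("DbName")]


# A shortened stand-in for the XML shown above.
sample = """<?xml version="1.0"?>
<eInfoResult>
<DbList>
<DbName>pubmed</DbName>
<DbName>protein</DbName>
<DbName>nucleotide</DbName>
</DbList>
</eInfoResult>"""

print(database_names(sample))
```

Unlike Bio.Entrez.read, this approach knows nothing about the DTD, so you have to know in advance which tags to look for.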

For each of these databases, we can use EInfo again to obtain more information:

>>> handle = Entrez.einfo(db="pubmed")
>>> record = Entrez.read(handle)
>>> record["DbInfo"]["Description"]
'PubMed bibliographic record'
>>> record["DbInfo"]["Count"]
'17989604'
>>> record["DbInfo"]["LastUpdate"]
'2008/05/24 06:45'

Try record["DbInfo"].keys() for other information stored in this record. One of the most useful is a list of possible search fields for use with ESearch:
>>> for field in record["DbInfo"]["FieldList"]:
...     print("%(Name)s, %(FullName)s, %(Description)s" % field)
ALL, All Fields, All terms from all searchable fields
UID, UID, Unique number assigned to publication
FILT, Filter, Limits the records
TITL, Title, Words in title of publication
WORD, Text Word, Free text associated with publication
MESH, MeSH Terms, Medical Subject Headings assigned to publication
MAJR, MeSH Major Topic, MeSH terms of major importance to publication
AUTH, Author, Author(s) of publication
JOUR, Journal, Journal abbreviation of publication
AFFL, Affiliation, Author's institutional affiliation and address
...

That's a long list, but indirectly this tells you that for the PubMed database, you can do things like Jones[AUTH] to search the author field, or Sanger[AFFL] to restrict to authors at the Sanger Centre. This can be very handy, especially if you are not so familiar with a particular database.
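
Field tags like these can be combined into one search term. The helper below is a small sketch that joins value/field pairs with AND; the name fielded_term is hypothetical, and the field names you pass should be ones EInfo actually reports for the database you are searching.

```python
def fielded_term(pairs):
    """Join (field, value) pairs into a query such as 'Jones[AUTH] AND Sanger[AFFL]'."""
    return " AND ".join("{}[{}]".format(value, field) for field, value in pairs)


term = fielded_term([("AUTH", "Jones"), ("AFFL", "Sanger")])
print(term)
```

The resulting string can be passed as the term argument of Bio.Entrez.esearch().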

9.3 ESearch: Searching the Entrez databases
To search any of these databases, we use Bio.Entrez.esearch(). For example, let's search in PubMed for publications related to Biopython:

>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com"  # Always tell NCBI who you are
>>> handle = Entrez.esearch(db="pubmed", term="biopython")
>>> record = Entrez.read(handle)
>>> record["IdList"]
['19304878', '18606172', '16403221', '16377612', '14871861', '14630660', '12230038']

In this output, you see seven PubMed IDs (including 19304878, which is the PMID for the Biopython application note), which can be retrieved by EFetch (see Section 9.6).

You can also use ESearch to search GenBank. Here we'll do a quick search for the matK gene in Cypripedioideae orchids (see Section 9.2 about EInfo for one way to find out which fields you can search in each Entrez database):

>>> handle = Entrez.esearch(db="nucleotide", term="Cypripedioideae[Orgn] AND matK[Gene]", idtype="acc")
>>> record = Entrez.read(handle)
>>> record["Count"]
'348'
>>> record["IdList"]
['JQ660909.1', 'JQ660908.1', 'JQ660907.1', 'JQ660906.1', ..., 'JQ660890.1']

Each of the IDs (JQ660909.1, JQ660908.1, JQ660907.1, ...) is a GenBank identifier (accession number). See Section 9.6 for information on how to actually download these GenBank records.

Note that instead of a species name like Cypripedioideae[Orgn], you can restrict the search using an NCBI taxon identifier; here this would be txid158330[Orgn]. This isn't currently documented on the ESearch help page; the NCBI explained this in reply to an email query. You can often deduce the search term formatting by playing with the Entrez web interface. For example, including complete[prop] in a genome search restricts to just completed genomes.

As a final example, let's get a list of computational journal titles:
>>> handle = Entrez.esearch(db="nlmcatalog", term="computational[Journal]", retmax='20')
>>> record = Entrez.read(handle)
>>> print("{} computational journals found".format(record["Count"]))
117 computational journals found
>>> print("The first 20 are\n{}".format(record['IdList']))
The first 20 are
['101660833', '101664671', '101661657', '101659814', '101657941',
 '101653734', '101669877', '101649614', '101647835', '101639023',
 '101627224', '101647801', '101589678', '101585369', '101645372',
 '101586429', '101582229', '101574747', '101564639', '101671907']

Again, we could use EFetch to obtain more information for each of these journal IDs.

ESearch has many useful options; see the ESearch help page for more information.
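
Two of those options, retstart and retmax, let you page through a result set that is larger than one response. The sketch below is one possible way to do that; batch_offsets and search_all_ids are hypothetical helper names, Entrez.email is assumed to be set already, and for large queries the history feature (Section 9.15) is what the NCBI recommend instead.

```python
def batch_offsets(total, batch_size):
    """Yield (retstart, retmax) pairs that cover `total` hits in order."""
    for start in range(0, total, batch_size):
        yield start, min(batch_size, total - start)


def search_all_ids(db, term, batch_size=100):
    """Collect every ID for a query, one ESearch call per batch (needs network)."""
    from Bio import Entrez  # imported here so batch_offsets works without Biopython

    handle = Entrez.esearch(db=db, term=term)
    total = int(Entrez.read(handle)["Count"])
    handle.close()

    ids = []
    for retstart, retmax in batch_offsets(total, batch_size):
        handle = Entrez.esearch(db=db, term=term, retstart=retstart, retmax=retmax)
        ids.extend(Entrez.read(handle)["IdList"])
        handle.close()
    return ids
```

For the 348 matK hits above and a batch size of 100, batch_offsets yields four batches, the last one holding the remaining 48 hits.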

9.4 EPost: Uploading a list of identifiers
EPost uploads a list of UIs for use in subsequent search strategies; see the EPost help page for more information. It is available from Biopython through the Bio.Entrez.epost() function.

To give an example of when this is useful, suppose you have a long list of IDs you want to download using EFetch (maybe sequences, maybe citations, anything). When you make a request with EFetch, your list of IDs, the database etc. are all turned into a long URL sent to the server. If your list of IDs is long, this URL gets long, and long URLs can break (e.g. some proxies don't cope well).

Instead, you can break this up into two steps, first uploading the list of IDs using EPost (this uses an HTTP POST internally, rather than an HTTP GET, getting round the long URL problem). With the history support, you can then refer to this long list of IDs, and download the associated data with EFetch.

Let's look at a simple example to see how EPost works, uploading some PubMed identifiers:
>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com"  # Always tell NCBI who you are
>>> id_list = ["19304878", "18606172", "16403221", "16377612", "14871861", "14630660"]
>>> print(Entrez.epost("pubmed", id=",".join(id_list)).read())
<?xml version="1.0"?>
<!DOCTYPE ePostResult PUBLIC "-//NLM//DTD ePostResult, 11 May 2002//EN"
 "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/ePost_020511.dtd">
<ePostResult>
<QueryKey>1</QueryKey>
<WebEnv>NCID_01_206841095_130.14.22.101_9001_1242061629</WebEnv>
</ePostResult>

The returned XML includes two important strings, QueryKey and WebEnv, which together define your history session. You would extract these values for use with another Entrez call such as EFetch:

>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com"  # Always tell NCBI who you are
>>> id_list = ["19304878", "18606172", "16403221", "16377612", "14871861", "14630660"]
>>> search_results = Entrez.read(Entrez.epost("pubmed", id=",".join(id_list)))
>>> webenv = search_results["WebEnv"]
>>> query_key = search_results["QueryKey"]

Section 9.15 shows how to use the history feature.
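
As a preview, the two values are typically handed to the next call as the webenv and query_key arguments. The helper below is only a sketch of that hand-over; history_kwargs is a hypothetical name, and the dictionary stands in for the parsed EPost result.

```python
def history_kwargs(search_results):
    """Build keyword arguments that point a later Entrez call at a history session."""
    return {
        "webenv": search_results["WebEnv"],
        "query_key": search_results["QueryKey"],
    }


# Stand-in for Entrez.read(Entrez.epost(...)), using the values shown above.
example_results = {
    "QueryKey": "1",
    "WebEnv": "NCID_01_206841095_130.14.22.101_9001_1242061629",
}
print(history_kwargs(example_results))

# With network access this could then be used along the lines of:
# handle = Entrez.efetch(db="pubmed", rettype="medline", retmode="text",
#                        **history_kwargs(search_results))
```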

9.5 ESummary: Retrieving summaries from primary IDs
ESummary retrieves document summaries from a list of primary IDs (see the ESummary help page for more information). In Biopython, ESummary is available as Bio.Entrez.esummary(). Using the search result above, we can for example find out more about the journal with ID 101660833:

>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com"  # Always tell NCBI who you are
>>> handle = Entrez.esummary(db="nlmcatalog", id="101660833")
>>> record = Entrez.read(handle)
>>> info = record[0]['TitleMainList'][0]
>>> print("Journal info\nid: {}\nTitle: {}".format(record[0]["Id"], info["Title"]))
Journal info
id: 101660833
Title: IEEE transactions on computational imaging.

9.6 EFetch: Downloading full records from Entrez
EFetch is what you use when you want to retrieve a full record from Entrez. This covers several possible databases, as described on the main EFetch Help page.

For most of their databases, the NCBI support several different file formats. Requesting a specific file format from Entrez using Bio.Entrez.efetch() requires specifying the rettype and/or retmode optional arguments. The different combinations are described for each database type on the pages linked to on the NCBI efetch webpage (e.g. literature, sequences and taxonomy).

One common usage is downloading sequences in the FASTA or GenBank/GenPept plain text formats (which can then be parsed with Bio.SeqIO, see Sections 5.3.1 and 9.6). From the Cypripedioideae example above, we can download GenBank record EU490707 using Bio.Entrez.efetch:

>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com"  # Always tell NCBI who you are
>>> handle = Entrez.efetch(db="nucleotide", id="EU490707", rettype="gb", retmode="text")
>>> print(handle.read())

LOCUS       EU490707                1302 bp    DNA     linear   PLN 26-JUL-2016
DEFINITION  Selenipedium aequinoctiale maturase K (matK) gene, partial cds;
            chloroplast.
ACCESSION   EU490707
VERSION     EU490707.1
KEYWORDS    .
SOURCE      chloroplast Selenipedium aequinoctiale
  ORGANISM  Selenipedium aequinoctiale
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; Liliopsida; Asparagales; Orchidaceae;
            Cypripedioideae; Selenipedium.
REFERENCE   1  (bases 1 to 1302)
  AUTHORS   Neubig,K.M., Whitten,W.M., Carlsward,B.S., Blanco,M.A., Endara,L.,
            Williams,N.H. and Moore,M.
  TITLE     Phylogenetic utility of ycf1 in orchids: a plastid gene more
            variable than matK
  JOURNAL   Plant Syst. Evol. 277 (1-2), 75-84 (2009)
REFERENCE   2  (bases 1 to 1302)
  AUTHORS   Neubig,K.M., Whitten,W.M., Carlsward,B.S., Blanco,M.A.,
            Endara,C.L., Williams,N.H. and Moore,M.J.
  TITLE     Direct Submission
  JOURNAL   Submitted (14-FEB-2008) Department of Botany, University of
            Florida, 220 Bartram Hall, Gainesville, FL 32611-8526, USA
FEATURES             Location/Qualifiers
     source          1..1302
                     /organism="Selenipedium aequinoctiale"
                     /organelle="plastid:chloroplast"
                     /mol_type="genomic DNA"
                     /specimen_voucher="FLAS:Blanco 2475"
                     /db_xref="taxon:256374"
     gene            <1..>1302
                     /gene="matK"
     CDS             <1..>1302
                     /gene="matK"
                     /codon_start=1
                     /transl_table=11
                     /product="maturase K"
                     /protein_id="ACC99456.1"
/translation="IFYEPVEIFGYDNKSSLVLVKRLITRMYQQNFLISSVNDSNQKG
FWGHKHFFSSHFSSQMVSEGFGVILEIPFSSQLVSSLEEKKIPKYQNLRSIHSIFPFL
EDKFLHLNYVSDLLIPHPIHLEILVQILQCRIKDVPSLHLLRLLFHEYHNLNSLITSK
KFIYAFSKRKKRFLWLLYNSYVYECEYLFQFLRKQSSYLRSTSSGVFLERTHLYVKIE
HLLVVCCNSFQRILCFLKDPFMHYVRYQGKAILASKGTLILMKKWKFHLVNFWQSYFH
FWSQPYRIHIKQLSNYSFSFLGYFSSVLENHLVVRNQMLENSFIINLLTKKFDTIAPV
ISLIGSLSKAQFCTVLGHPISKPIWTDFSDSDILDRFCRICRNLCRYHSGSSKKQVLY
RIKYILRLSCARTLARKHKSTVRTFMRRLGSGLLEEFFMEEE"
ORIGIN
1attttttacgaacctgtggaaatttttggttatgacaataaatctagtttagtacttgtg
61aaacgtttaattactcgaatgtatcaacagaattttttgatttcttcggttaatgattct
121aaccaaaaaggattttgggggcacaagcattttttttcttctcatttttcttctcaaatg
181gtatcagaaggttttggagtcattctggaaattccattctcgtcgcaattagtatcttct
241cttgaagaaaaaaaaataccaaaatatcagaatttacgatctattcattcaatatttccc
301tttttagaagacaaatttttacatttgaattatgtgtcagatctactaataccccatccc
361atccatctggaaatcttggttcaaatccttcaatgccggatcaaggatgttccttctttg
421catttattgcgattgcttttccacgaatatcataatttgaatagtctcattacttcaaag
481aaattcatttacgccttttcaaaaagaaagaaaagattcctttggttactatataattct
541tatgtatatgaatgcgaatatctattccagtttcttcgtaaacagtcttcttatttacga
601tcaacatcttctggagtctttcttgagcgaacacatttatatgtaaaaatagaacatctt
661ctagtagtgtgttgtaattcttttcagaggatcctatgctttctcaaggatcctttcatg
721cattatgttcgatatcaaggaaaagcaattctggcttcaaagggaactcttattctgatg
781aagaaatggaaatttcatcttgtgaatttttggcaatcttattttcacttttggtctcaa
841ccgtataggattcatataaagcaattatccaactattccttctcttttctggggtatttt
901tcaagtgtactagaaaatcatttggtagtaagaaatcaaatgctagagaattcatttata
961ataaatcttctgactaagaaattcgataccatagccccagttatttctcttattggatca
1021ttgtcgaaagctcaattttgtactgtattgggtcatcctattagtaaaccgatctggacc
1081gatttctcggattctgatattcttgatcgattttgccggatatgtagaaatctttgtcgt
1141tatcacagcggatcctcaaaaaaacaggttttgtatcgtataaaatatatacttcgactt
1201tcgtgtgctagaactttggcacggaaacataaaagtacagtacgcacttttatgcgaaga
1261ttaggttcgggattattagaagaattctttatggaagaagaa
//

Please be aware that as of October 2016 GI identifiers are discontinued in favour of accession numbers. You can still fetch sequences based on their GI, but new sequences are no longer given this identifier. You should instead refer to them by the accession number as done in the example.

The arguments rettype="gb" and retmode="text" let us download this record in the GenBank format.

Note that until Easter 2009, the Entrez EFetch API let you use genbank as the return type; however, the NCBI now insist on using the official return types of gb or gbwithparts (or gp for proteins) as described online. Also note that until Feb 2012, the Entrez EFetch API would default to returning plain text files, but now defaults to XML.

Alternatively, you could for example use rettype="fasta" to get the Fasta format; see the EFetch Sequences Help page for other options. Remember that the available formats depend on which database you are downloading from; see the main EFetch Help page.

If you fetch the record in one of the formats accepted by Bio.SeqIO (see Chapter 5), you could directly parse it into a SeqRecord:
>>> from Bio import Entrez, SeqIO
>>> handle = Entrez.efetch(db="nucleotide", id="EU490707", rettype="gb", retmode="text")
>>> record = SeqIO.read(handle, "genbank")
>>> handle.close()
>>> print(record)
ID: EU490707.1
Name: EU490707
Description: Selenipedium aequinoctiale maturase K (matK) gene, partial cds; chloroplast.
Number of features: 3
...
Seq('ATTTTTTACGAACCTGTGGAAATTTTTGGTTATGACAATAAATCTAGTTTAGTA...GAA', IUPACAmbiguousDNA())

Note that a more typical use would be to save the sequence data to a local file, and then parse it with Bio.SeqIO. This can save you having to re-download the same file repeatedly while working on your script, and places less load on the NCBI's servers. For example:

import os
from Bio import SeqIO
from Bio import Entrez
Entrez.email = "A.N.Other@example.com"  # Always tell NCBI who you are
filename = "EU490707.gbk"
if not os.path.isfile(filename):
    # Downloading...
    net_handle = Entrez.efetch(db="nucleotide", id="EU490707", rettype="gb", retmode="text")
    out_handle = open(filename, "w")
    out_handle.write(net_handle.read())
    out_handle.close()
    net_handle.close()
    print("Saved")

print("Parsing...")
record = SeqIO.read(filename, "genbank")
print(record)
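
The download-once pattern in this script can be factored into a small reusable helper. The version below is a sketch: cached_text is a hypothetical name, and the download itself is passed in as a function so that the caching logic can be tried without touching the network.

```python
import os


def cached_text(filename, fetch):
    """Return the contents of filename, calling fetch() to create it only if missing."""
    if not os.path.isfile(filename):
        text = fetch()
        with open(filename, "w") as out_handle:
            out_handle.write(text)
    with open(filename) as in_handle:
        return in_handle.read()


# With network access, fetch could wrap the EFetch call from the script above:
# def fetch_genbank():
#     handle = Entrez.efetch(db="nucleotide", id="EU490707", rettype="gb", retmode="text")
#     try:
#         return handle.read()
#     finally:
#         handle.close()
# text = cached_text("EU490707.gbk", fetch_genbank)
```

The second and later calls read the local copy, so the NCBI servers are contacted only once per file.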

To get the output in XML format, which you can parse using the Bio.Entrez.read() function, use retmode="xml":

>>> from Bio import Entrez
>>> handle = Entrez.efetch(db="nucleotide", id="EU490707", retmode="xml")
>>> record = Entrez.read(handle)
>>> handle.close()
>>> record[0]["GBSeq_definition"]
'Selenipedium aequinoctiale maturase K (matK) gene, partial cds; chloroplast'
>>> record[0]["GBSeq_source"]
'chloroplast Selenipedium aequinoctiale'

So, that dealt with sequences. For examples of parsing file formats specific to the other databases (e.g. the MEDLINE format used in PubMed), see Section 9.12.

If you want to perform a search with Bio.Entrez.esearch(), and then download the records with Bio.Entrez.efetch(), you should use the WebEnv history feature; see Section 9.15.

9.7 ELink: Searching for related items in NCBI Entrez
ELink, available from Biopython as Bio.Entrez.elink(), can be used to find related items in the NCBI Entrez databases. For example, you can use this to find nucleotide entries for an entry in the gene database, and other cool stuff.

Let's use ELink to find articles related to the Biopython application note published in Bioinformatics in 2009. The PubMed ID of this article is 19304878:

>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com"
>>> pmid = "19304878"
>>> record = Entrez.read(Entrez.elink(dbfrom="pubmed", id=pmid))

The record variable is a Python list with one item for each database in which we searched. Since we specified only one PubMed ID to search for, record contains only one item. This item is a dictionary containing information about our search term, as well as all the related items that were found:

>>> record[0]["DbFrom"]
'pubmed'
>>> record[0]["IdList"]
['19304878']

The "LinkSetDb" key contains the search results, stored as a list consisting of one item for each target database. In our search results, we only find hits in the PubMed database (although subdivided into categories):
>>> len(record[0]["LinkSetDb"])
5
>>> for linksetdb in record[0]["LinkSetDb"]:
...     print(linksetdb["DbTo"], linksetdb["LinkName"], len(linksetdb["Link"]))
...
pubmed pubmed_pubmed 110
pubmed pubmed_pubmed_combined 6
pubmed pubmed_pubmed_five 6
pubmed pubmed_pubmed_reviews 5
pubmed pubmed_pubmed_reviews_five 5

The actual search results are stored under the "Link" key. In total, 110 items were found under standard search. Let's now look at the first search result:

>>> record[0]["LinkSetDb"][0]["Link"][0]
{u'Id': '19304878'}

This is the article we searched for, which doesn't help us much, so let's look at the second search result:

>>> record[0]["LinkSetDb"][0]["Link"][1]
{u'Id': '14630660'}

This paper, with PubMed ID 14630660, is about the Biopython PDB parser.

We can use a loop to print out all PubMed IDs:
>>> for link in record[0]["LinkSetDb"][0]["Link"]:
...     print(link["Id"])
19304878
14630660
18689808
17121776
16377612
12368254
......
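
If you want all categories at once rather than only the first, the nested structure can be flattened into a dictionary keyed by link name. The sketch below works on any object shaped like record[0]; links_by_name is a hypothetical helper name and the sample data is a shortened, hand-written stand-in for a real ELink result.

```python
def links_by_name(linkset):
    """Map each LinkName in one ELink result item to its list of linked IDs."""
    return {
        linksetdb["LinkName"]: [link["Id"] for link in linksetdb["Link"]]
        for linksetdb in linkset["LinkSetDb"]
    }


# Hand-written stand-in for record[0]; a real result holds many more links.
example = {
    "DbFrom": "pubmed",
    "IdList": ["19304878"],
    "LinkSetDb": [
        {"DbTo": "pubmed", "LinkName": "pubmed_pubmed",
         "Link": [{"Id": "19304878"}, {"Id": "14630660"}]},
        {"DbTo": "pubmed", "LinkName": "pubmed_pubmed_reviews",
         "Link": [{"Id": "18689808"}]},
    ],
}

for name, ids in links_by_name(example).items():
    print(name, len(ids))
```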

Now that was nice, but personally I am often more interested to find out if a paper has been cited. Well, ELink can do that too, at least for journals in PubMed Central (see Section 9.15.3).

For help on ELink, see the ELink help page. There is an entire sub-page just for the link names, describing how different databases can be cross-referenced.

9.8 EGQuery: Global Query counts for search terms
EGQuery provides counts for a search term in each of the Entrez databases (i.e. a global query). This is particularly useful to find out how many items your search terms would find in each database without actually performing lots of separate searches with ESearch (see the example in 9.14.2 below).

In this example, we use Bio.Entrez.egquery() to obtain the counts for "Biopython":
>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com"  # Always tell NCBI who you are
>>> handle = Entrez.egquery(term="biopython")
>>> record = Entrez.read(handle)
>>> for row in record["eGQueryResult"]:
...     print(row["DbName"], row["Count"])
...
pubmed 6
pmc 62
journals 0
...

See the EGQuery help page for more information.
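
Since most databases usually report zero hits, it can be convenient to keep only the non-empty ones. The sketch below assumes rows shaped like those printed above, with counts given as strings; nonzero_counts is a hypothetical helper name, and rows whose count is not a plain number are simply skipped.

```python
def nonzero_counts(rows):
    """Return {database name: count} for rows whose Count is a positive number."""
    counts = {}
    for row in rows:
        count = str(row["Count"])
        if count.isdigit() and int(count) > 0:
            counts[row["DbName"]] = int(count)
    return counts


# Stand-in for record["eGQueryResult"], using the values shown above.
example_rows = [
    {"DbName": "pubmed", "Count": "6"},
    {"DbName": "pmc", "Count": "62"},
    {"DbName": "journals", "Count": "0"},
]
print(nonzero_counts(example_rows))
```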

9.9 ESpell: Obtaining spelling suggestions
ESpell retrieves spelling suggestions. In this example, we use Bio.Entrez.espell() to obtain the correct spelling of Biopython:

>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com"  # Always tell NCBI who you are
>>> handle = Entrez.espell(term="biopythooon")
>>> record = Entrez.read(handle)
>>> record["Query"]
'biopythooon'
>>> record["CorrectedQuery"]
'biopython'

See the ESpell help page for more information. The main use of this is for GUI tools to provide automatic suggestions for search terms.

9.10 Parsing huge Entrez XML files
The Entrez.read function reads the entire XML file returned by Entrez into a single Python object, which is kept in memory. To parse Entrez XML files too large to fit in memory, you can use the function Entrez.parse. This is a generator function that reads records in the XML file one by one. This function is only useful if the XML file reflects a Python list object (in other words, if Entrez.read on a computer with infinite memory resources would return a Python list).

For example, you can download the entire Entrez Gene database for a given organism as a file from NCBI's ftp site. These files can be very large. As an example, on September 4, 2009, the file Homo_sapiens.ags.gz, containing the Entrez Gene database for human, had a size of 116576 kB. This file, which is in the ASN format, can be converted into an XML file using NCBI's gene2xml program (see NCBI's ftp site for more information):

gene2xml -b T -i Homo_sapiens.ags -o Homo_sapiens.xml

The resulting XML file has a size of 6.1 GB. Attempting Entrez.read on this file will result in a MemoryError on many computers.

The XML file Homo_sapiens.xml consists of a list of Entrez gene records, each corresponding to one Entrez gene in human. Entrez.parse retrieves these gene records one by one. You can then print out or store the relevant information in each record by iterating over the records. For example, this script iterates over the Entrez gene records and prints out the gene numbers and names for all current genes:

>>> from Bio import Entrez
>>> handle = open("Homo_sapiens.xml")
>>> records = Entrez.parse(handle)

>>> for record in records:
...     status = record['Entrezgene_track-info']['Gene-track']['Gene-track_status']
...     if status.attributes['value'] == 'discontinued':
...         continue
...     geneid = record['Entrezgene_track-info']['Gene-track']['Gene-track_geneid']
...     genename = record['Entrezgene_gene']['Gene-ref']['Gene-ref_locus']
...     print(geneid, genename)

This will print:

1 A1BG
2 A2M
3 A2MP
8 AA
9 NAT1
10 NAT2
11 AACP
12 SERPINA3
13 AADAC
14 AAMP
15 AANAT
16 AARS
17 AAVS1
...
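
To see why record-by-record reading keeps memory use low, here is a standard-library analogy using xml.etree.ElementTree.iterparse. It is not how Entrez.parse is implemented, and the element names (Set, Gene, Id, Name) are invented for the example rather than taken from the real Entrez Gene schema.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(handle, tag):
    """Yield each <tag> element once it is complete, then release its contents."""
    for event, element in ET.iterparse(handle, events=("end",)):
        if element.tag == tag:
            yield element
            element.clear()  # free the children before reading the next record


# A tiny in-memory stand-in for a very large XML file.
sample = io.BytesIO(
    b"<Set>"
    b"<Gene><Id>1</Id><Name>A1BG</Name></Gene>"
    b"<Gene><Id>2</Id><Name>A2M</Name></Gene>"
    b"</Set>"
)

for gene in iter_records(sample, "Gene"):
    print(gene.findtext("Id"), gene.findtext("Name"))
```

Only one record's worth of data is held at a time, which is the same property that makes Entrez.parse usable on multi-gigabyte files.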

9.11 Handling errors
Three things can go wrong when parsing an XML file:

The file may not be an XML file to begin with;
The file may end prematurely or otherwise be corrupted;
The file may be correct XML, but contain items that are not represented in the associated DTD.

The first case occurs if, for example, you try to parse a Fasta file as if it were an XML file:

>>> from Bio import Entrez
>>> handle = open("NC_005816.fna")  # a Fasta file
>>> record = Entrez.read(handle)
Traceback (most recent call last):
...
Bio.Entrez.Parser.NotXMLError: Failed to parse the XML data (syntax error: line 1, column 0). Please make sure that the input data are in XML format.

Here, the parser didn't find the <?xml ... tag with which an XML file is supposed to start, and therefore decides (correctly) that the file is not an XML file.

When your file is in the XML format but is corrupted (for example, by ending prematurely), the parser will raise a CorruptedXMLError. Here is an example of an XML file that ends prematurely:
<?xml version="1.0"?>
<!DOCTYPE eInfoResult PUBLIC "-//NLM//DTD eInfoResult, 11 May 2002//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eInfo_020511.dtd">
<eInfoResult>
<DbList>
<DbName>pubmed</DbName>
<DbName>protein</DbName>
<DbName>nucleotide</DbName>
<DbName>nuccore</DbName>
<DbName>nucgss</DbName>
<DbName>nucest</DbName>
<DbName>structure</DbName>
<DbName>genome</DbName>
<DbName>books</DbName>
<DbName>cancerchromosomes</DbName>
<DbName>cdd</DbName>

which will generate the following traceback:

>>> Entrez.read(handle)
Traceback (most recent call last):
...
Bio.Entrez.Parser.CorruptedXMLError: Failed to parse the XML data (no element found: line 16, column 0). Please make sure that the input data are not corrupted.

Note that the error message tells you at what point in the XML file the error was detected.

The third type of error occurs if the XML file contains tags that do not have a description in the corresponding DTD file. This is an example of such an XML file:

<?xml version="1.0"?>
<!DOCTYPE eInfoResult PUBLIC "-//NLM//DTD eInfoResult, 11 May 2002//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eInfo_020511.dtd">
<eInfoResult>
<DbInfo>
<DbName>pubmed</DbName>
<MenuName>PubMed</MenuName>
<Description>PubMed bibliographic record</Description>
<Count>20161961</Count>
<LastUpdate>2010/09/10 04:52</LastUpdate>
<FieldList>
<Field>
...
</Field>
</FieldList>
<DocsumList>
<Docsum>
<DsName>PubDate</DsName>
<DsType>4</DsType>
<DsTypeName>string</DsTypeName>
</Docsum>
<Docsum>
<DsName>EPubDate</DsName>
...
</DbInfo>
</eInfoResult>

In this file, for some reason the tag <DocsumList> (and several others) is not listed in the DTD file eInfo_020511.dtd, which is specified on the second line as the DTD for this XML file. By default, the parser will stop and raise a ValidationError if it cannot find some tag in the DTD:

>>> from Bio import Entrez
>>> handle = open("einfo3.xml")
>>> record = Entrez.read(handle)
Traceback (most recent call last):
...
Bio.Entrez.Parser.ValidationError: Failed to find tag 'DocsumList' in the DTD. To skip all tags that are not represented in the DTD, please call Bio.Entrez.reado

Optionally, you can instruct the parser to skip such tags instead of raising a ValidationError. This is done by calling Entrez.read or Entrez.parse with the argument validate equal to False:

>>> from Bio import Entrez
>>> handle = open("einfo3.xml")
>>> record = Entrez.read(handle, validate=False)

Of course, the information contained in the XML tags that are not in the DTD is not present in the record returned by Entrez.read.
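
The first two failure modes can be reproduced with the standard library alone, which may help when deciding whether a downloaded file is worth passing to Bio.Entrez at all. The sketch below uses Python's expat parser directly; xml_problem is a hypothetical helper name, and this is an analogy rather than a description of how Bio.Entrez classifies errors. It does not check tags against a DTD, so the third case is not covered.

```python
from xml.parsers import expat


def xml_problem(text):
    """Return expat's description of the first error in text, or None if it parses."""
    parser = expat.ParserCreate()
    try:
        parser.Parse(text, True)  # True marks this as the final chunk of input
    except expat.ExpatError as error:
        return "{} (line {}, column {})".format(
            expat.ErrorString(error.code), error.lineno, error.offset
        )
    return None


print(xml_problem("<a><b>x</b></a>"))   # well-formed XML
print(xml_problem(">seq1\nACGT\n"))     # Fasta-like text, not XML
print(xml_problem("<a><b>x</b>"))       # XML that ends prematurely
```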

9.12 Specialized parsers
The Bio.Entrez.read() function can parse most (if not all) XML output returned by Entrez. Entrez typically allows you to retrieve records in other formats, which may have some advantages compared to the XML format in terms of readability (or download size).

To request a specific file format from Entrez using Bio.Entrez.efetch() requires specifying the rettype and/or retmode optional arguments. The different combinations are described for each database type on the NCBI efetch webpage.

One obvious case is you may prefer to download sequences in the FASTA or GenBank/GenPept plain text formats (which can then be parsed with Bio.SeqIO, see Sections 5.3.1 and 9.6). For the literature databases, Biopython contains a parser for the MEDLINE format used in PubMed.

9.12.1 Parsing Medline records
You can find the Medline parser in Bio.Medline. Suppose we want to parse the file pubmed_result1.txt, containing one Medline record. You can find this file in Biopython's Tests\Medline directory. The file looks like this:

PMID- 12230038
OWN - NLM
STAT- MEDLINE
DA  - 20020916
DCOM- 20030606
LR  - 20041117
PUBM- Print
IS  - 1467-5463 (Print)
VI  - 3
IP  - 3
DP  - 2002 Sep
TI  - The Bio* toolkits--a brief overview.
PG  - 296-302
AB  - Bioinformatics research is often difficult to do with commercial software. The
      Open Source BioPerl, BioPython and Biojava projects provide toolkits with
...

We first open the file and then parse it:

>>> from Bio import Medline
>>> with open("pubmed_result1.txt") as handle:
...     record = Medline.read(handle)
...

The record now contains the Medline record as a Python dictionary:
>>> record["PMID"]
'12230038'

>>> record["AB"]
'Bioinformatics research is often difficult to do with commercial software.
The Open Source BioPerl, BioPython and Biojava projects provide toolkits with
multiple functionality that make it easier to create customised pipelines or
analysis. This review briefly compares the quirks of the underlying languages
and the functionality, documentation, utility and relative advantages of the
Bio counterparts, particularly from the point of view of the beginning
biologist programmer.'

The key names used in a Medline record can be rather obscure; use
>>> help(record)

for a brief summary.

To parse a file containing multiple Medline records, you can use the parse function instead:
>>> from Bio import Medline
>>> with open("pubmed_result2.txt") as handle:
...     for record in Medline.parse(handle):
...         print(record["TI"])
...
A high level interface to SCOP and ASTRAL implemented in python.
GenomeDiagram: a python package for the visualization of large-scale genomic data.
Open source clustering software.
PDB file parser and structure class implemented in Python.
Instead of parsing Medline records stored in files, you can also parse Medline records downloaded by Bio.Entrez.efetch. For example, let's look at all Medline records in PubMed related to Biopython:

>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com"  # Always tell NCBI who you are
>>> handle = Entrez.esearch(db="pubmed", term="biopython")
>>> record = Entrez.read(handle)
>>> record["IdList"]
['19304878', '18606172', '16403221', '16377612', '14871861', '14630660', '12230038']

We now use Bio.Entrez.efetch to download these Medline records:

>>> idlist = record["IdList"]
>>> handle = Entrez.efetch(db="pubmed", id=idlist, rettype="medline", retmode="text")

Here, we specify rettype="medline", retmode="text" to obtain the Medline records in plain-text Medline format. Now we use Bio.Medline to parse these records:

>>> from Bio import Medline
>>> records = Medline.parse(handle)
>>> for record in records:
...     print(record["AU"])
['Cock PJ', 'Antao T', 'Chang JT', 'Chapman BA', 'Cox CJ', 'Dalke A', ..., 'de Hoon MJ']
['Munteanu CR', 'Gonzalez-Diaz H', 'Magalhaes AL']
['Casbon JA', 'Crooks GE', 'Saqi MA']
['Pritchard L', 'White JA', 'Birch PR', 'Toth IK']
['de Hoon MJ', 'Imoto S', 'Nolan J', 'Miyano S']
['Hamelryck T', 'Manderick B']
['Mangalam H']
For comparison, here we show an example using the XML format:
>>> handle = Entrez.efetch(db="pubmed", id=idlist, rettype="medline", retmode="xml")
>>> records = Entrez.read(handle)
>>> for record in records['PubmedArticle']:
...     print(record["MedlineCitation"]["Article"]["ArticleTitle"])
Biopython: freely available Python tools for computational molecular biology and
 bioinformatics.
Enzymes/non-enzymes classification model complexity based on composition, sequence,
 3D and topological indices.
A high level interface to SCOP and ASTRAL implemented in python.
GenomeDiagram: a python package for the visualization of large-scale genomic data.
Open source clustering software.
PDB file parser and structure class implemented in Python.
The Bio* toolkits--a brief overview.

Note that in both of these examples, for simplicity we have naively combined ESearch and EFetch. In this situation, the NCBI would expect you to use their history feature, as illustrated in Section 9.15.

9.12.2 Parsing GEO records
GEO (Gene Expression Omnibus) is a data repository of high-throughput gene expression and hybridization array data. The Bio.Geo module can be used to parse GEO-formatted data.

The following code fragment shows how to parse the example GEO file GSE16.txt into a record and print the record:

>>> from Bio import Geo
>>> handle = open("GSE16.txt")
>>> records = Geo.parse(handle)
>>> for record in records:
...     print(record)

You can search the "gds" database (GEO datasets) with ESearch:

>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com"  # Always tell NCBI who you are
>>> handle = Entrez.esearch(db="gds", term="GSE16")
>>> record = Entrez.read(handle)
>>> record["Count"]
'2'
>>> record["IdList"]
['200000016', '100000028']

From the Entrez website, UID 200000016 is GDS16 while the other hit 100000028 is for the associated platform, GPL28. Unfortunately, at the time of writing the NCBI don't seem to support downloading GEO files using Entrez (not as XML, nor in the Simple Omnibus Format in Text (SOFT) format).

However, it is actually pretty straightforward to download the GEO files by FTP from ftp://ftp.ncbi.nih.gov/pub/geo/ instead. In this case you might want ftp://ftp.ncbi.nih.gov/pub/geo/DATA/SOFT/by_series/GSE16/GSE16_family.soft.gz (a compressed file, see the Python module gzip).
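
Once such a file has been downloaded, the gzip module can read it without unpacking it on disk first. The following is a minimal sketch; soft_lines is a hypothetical helper name and the file name in the usage comment is simply the one from the URL above.

```python
import gzip


def soft_lines(path):
    """Yield the lines of a gzip-compressed text file, without trailing newlines."""
    with gzip.open(path, "rt") as handle:  # "rt" decompresses and decodes to text
        for line in handle:
            yield line.rstrip("\n")


# Example use after downloading the file named above:
# for line in soft_lines("GSE16_family.soft.gz"):
#     print(line)
```

A text-mode handle from gzip.open(path, "rt") should likewise be usable wherever the examples above pass an ordinary open file, although that has not been checked here against Bio.Geo.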

9.12.3 Parsing UniGene records

UniGene is an NCBI database of the transcriptome, with each UniGene record showing the set of transcripts that are associated with a particular gene in a specific organism. A typical UniGene record looks like this:

ID          Hs.2
TITLE       N-acetyltransferase 2 (arylamine N-acetyltransferase)
GENE        NAT2
CYTOBAND    8p22
GENE_ID     10
LOCUSLINK   10
HOMOL       YES
EXPRESS      bone| connective tissue| intestine| liver| liver tumor| normal| soft tissue/muscle tissue tumor| adult
RESTR_EXPR   adult
CHROMOSOME  8
STS         ACC=PMC310725P3 UNISTS=272646
STS         ACC=WIAF-2120 UNISTS=44576
STS         ACC=G59899 UNISTS=137181
...
STS         ACC=GDB:187676 UNISTS=155563
PROTSIM     ORG=10090; PROTGI=6754794; PROTID=NP_035004.1; PCT=76.55; ALN=288
PROTSIM     ORG=9796; PROTGI=149742490; PROTID=XP_001487907.1; PCT=79.66; ALN=288
PROTSIM     ORG=9986; PROTGI=126722851; PROTID=NP_001075655.1; PCT=76.90; ALN=288
...
PROTSIM     ORG=9598; PROTGI=114619004; PROTID=XP_519631.2; PCT=98.28; ALN=288
SCOUNT      38
SEQUENCE    ACC=BC067218.1; NID=g45501306; PID=g45501307; SEQTYPE=mRNA
SEQUENCE    ACC=NM_000015.2; NID=g116295259; PID=g116295260; SEQTYPE=mRNA
SEQUENCE    ACC=D90042.1; NID=g219415; PID=g219416; SEQTYPE=mRNA
SEQUENCE    ACC=D90040.1; NID=g219411; PID=g219412; SEQTYPE=mRNA
SEQUENCE    ACC=BC015878.1; NID=g16198419; PID=g16198420; SEQTYPE=mRNA
SEQUENCE    ACC=CR407631.1; NID=g47115198; PID=g47115199; SEQTYPE=mRNA
SEQUENCE    ACC=BG569293.1; NID=g13576946; CLONE=IMAGE:4722596; END=5'; LID=6989; SEQTYPE=EST; TRACE=44157214
...
SEQUENCE    ACC=AU099534.1; NID=g13550663; CLONE=HSI08034; END=5'; LID=8800; SEQTYPE=EST
//

This particular record shows the set of transcripts (shown in the SEQUENCE lines) that originate from the human gene NAT2, encoding an N-acetyltransferase. The PROTSIM lines show proteins with significant similarity to NAT2, whereas the STS lines show the corresponding sequence-tagged sites in the genome.
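
To make the layout of the PROTSIM and SEQUENCE lines concrete, here is a small sketch of how one of their semicolon-separated KEY=value strings could be split into a dictionary. The function name parse_fields is hypothetical and the code is illustrative only (the Bio.UniGene parser handles this for you); it does not cover the STS lines, which separate their fields with spaces:

```python
def parse_fields(value):
    """Split a 'KEY=value; KEY=value' string into a dictionary.

    Illustrative helper only; not a Biopython function.
    """
    fields = {}
    for item in value.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing semicolon
        key, _, val = item.partition("=")
        fields[key.strip()] = val.strip()
    return fields


line = "ACC=NM_000015.2; NID=g116295259; PID=g116295260; SEQTYPE=mRNA"
fields = parse_fields(line)
print(fields["ACC"], fields["SEQTYPE"])
```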

To parse UniGene files, use the Bio.UniGene module:

>>> from Bio import UniGene
>>> handle = open("myunigenefile.data")
>>> record = UniGene.read(handle)

The record returned by UniGene.read is a Python object with attributes corresponding to the fields in the UniGene record. For example,

>>> record.ID
"Hs.2"
>>> record.title
"N-acetyltransferase 2 (arylamine N-acetyltransferase)"

The EXPRESS and RESTR_EXPR lines are stored as Python lists of strings:

>>> record.express
['bone', 'connective tissue', 'intestine', 'liver', 'liver tumor', 'normal', 'soft tissue/muscle tissue tumor', 'adult']

Specialized objects are returned for the STS, PROTSIM, and SEQUENCE lines, storing the keys shown in each line as attributes:

>>> record.sts[0].acc
'PMC310725P3'
>>> record.sts[0].unists
'272646'

and similarly for the PROTSIM and SEQUENCE lines.

To parse a file containing more than one UniGene record, use the parse function in Bio.UniGene:

>>> from Bio import UniGene
>>> handle = open("unigenerecords.data")
>>> records = UniGene.parse(handle)
>>> for record in records:
...     print(record.ID)

9.13 Using a proxy
Normally you won't have to worry about using a proxy, but if this is an issue on your network here is how to deal with it. Internally, Bio.Entrez uses the standard Python library urllib for accessing the NCBI servers. This will check an environment variable called http_proxy to configure any simple proxy automatically. Unfortunately this module does not support the use of proxies which require authentication.

You may choose to set the http_proxy environment variable once (how you do this will depend on your operating system). Alternatively you can set this within Python at the start of your script, for example:

import os
os.environ["http_proxy"] = "http://proxyhost.example.com:8080"

See the urllib documentation for more details.
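
If you are unsure whether the setting has been picked up, you can ask urllib which proxies it will use. This is a minimal sketch for Python 3 (where the function lives in urllib.request); the proxy address is the placeholder from the example above, not a real server:

```python
import os
from urllib.request import getproxies  # Python 3 location of this helper

# Placeholder proxy address taken from the text above; replace with your own.
os.environ["http_proxy"] = "http://proxyhost.example.com:8080"

# getproxies() reports the scheme-to-proxy mapping that urllib will use.
print(getproxies().get("http"))
```

If your requests go over HTTPS, urllib consults the analogous https_proxy variable instead, so you may need to set that one as well.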

9.14 Examples
9.14.1 PubMed and Medline
If you are in the medical field or interested in human issues (and many times even if you are not!), PubMed (http://www.ncbi.nlm.nih.gov/PubMed/) is an excellent source of all kinds of goodies. So like other things, we'd like to be able to grab information from it and use it in Python scripts.

In this example, we will query PubMed for all articles having to do with orchids (see section 2.3 for our motivation). We first check how many such articles there are:

>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com"  # Always tell NCBI who you are
>>> handle = Entrez.egquery(term="orchid")
>>> record = Entrez.read(handle)
>>> for row in record["eGQueryResult"]:
...     if row["DbName"] == "pubmed":
...         print(row["Count"])
463

Now we use the Bio.Entrez.esearch function to download the PubMed IDs of these 463 articles:

>>> handle = Entrez.esearch(db="pubmed", term="orchid", retmax=463)
>>> record = Entrez.read(handle)
>>> idlist = record["IdList"]
>>> print(idlist)

This returns a Python list containing all of the PubMed IDs of articles related to orchids:

['18680603', '18665331', '18661158', '18627489', '18627452', '18612381',
'18594007', '18591784', '18589523', '18579475', '18575811', '18575690',
...

Now that we've got them, we obviously want to get the corresponding Medline records and extract the information from them. Here, we'll download the Medline records in the Medline flat-file format, and use the Bio.Medline module to parse them:

>>> from Bio import Medline
>>> handle = Entrez.efetch(db="pubmed", id=idlist, rettype="medline",
...                        retmode="text")
>>> records = Medline.parse(handle)

NOTE: We've just done a separate search and fetch here; the NCBI much prefer you to take advantage of their history support in this situation. See Section 9.15.

Keep in mind that records is an iterator, so you can iterate through the records only once. If you want to save the records, you can convert them to a list:

>>> records = list(records)

Let's now iterate over the records to print out some information about each record:

>>> for record in records:
...     print("title:", record.get("TI", "?"))
...     print("authors:", record.get("AU", "?"))
...     print("source:", record.get("SO", "?"))
...     print("")

The output for this looks like:

title: Sex pheromone mimicry in the early spider orchid (ophrys sphegodes):
patterns of hydrocarbons as the key mechanism for pollination by sexual
deception [In Process Citation]
authors: ['Schiestl FP', 'Ayasse M', 'Paulus HF', 'Lofstedt C', 'Hansson BS',
'Ibarra F', 'Francke W']
source: J Comp Physiol [A] 2000 Jun;186(6):567-74

Especially interesting to note is the list of authors, which is returned as a standard Python list. This makes it easy to manipulate and search using standard Python tools. For instance, we could loop through a whole bunch of entries searching for a particular author with code like the following:

>>> search_author = "Waits T"

>>> for record in records:
...     if "AU" not in record:
...         continue
...     if search_author in record["AU"]:
...         print("Author %s found: %s" % (search_author, record["SO"]))
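
The same author search can be wrapped in a small function and tried out without a network connection. This is only a sketch: the function name sources_by_author is hypothetical, and the plain dictionaries below are made-up stand-ins for parsed Medline records, which are keyed by field code in the same way:

```python
def sources_by_author(records, author):
    """Return the "SO" (source) field of every record listing `author`.

    `records` is any iterable of dictionary-like objects keyed by Medline
    field codes. Hypothetical helper, not a Biopython function.
    """
    return [
        record.get("SO", "?")
        for record in records
        if author in record.get("AU", [])
    ]


# Plain dictionaries standing in for parsed Medline records (made-up data).
example_records = [
    {"AU": ["Waits T", "Smith J"], "SO": "Example J 2001;1(1):1-2"},
    {"AU": ["Jones K"], "SO": "Example J 2002;2(2):3-4"},
    {"TI": "A record without an author list"},
]
print(sources_by_author(example_records, "Waits T"))
```

Using record.get with a default avoids the separate check for records that have no "AU" field.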

Hopefully this section gave you an idea of the power and flexibility of the Entrez and Medline interfaces and how they can be used together.

9.14.2 Searching, downloading, and parsing Entrez Nucleotide records
Here we'll show a simple example of performing a remote Entrez query. In section 2.3 of the parsing examples, we talked about using NCBI's Entrez website to search the NCBI nucleotide databases for info on Cypripedioideae, our friends the lady slipper orchids. Now, we'll look at how to automate that process using a Python script. In this example, we'll just show how to connect, get the results, and parse them, with the Entrez module doing all of the work.

First, we use EGQuery to find out the number of results we will get before actually downloading them. EGQuery will tell us how many search results were found in each of the databases, but for this example we are only interested in nucleotides:

>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com"  # Always tell NCBI who you are
>>> handle = Entrez.egquery(term="Cypripedioideae")
>>> record = Entrez.read(handle)
>>> for row in record["eGQueryResult"]:
...     if row["DbName"] == "nuccore":
...         print(row["Count"])
814

So, we expect to find 814 Entrez Nucleotide records (this is the number I obtained in 2008; it is likely to increase in the future). If you find some ridiculously high number of hits, you may want to reconsider if you really want to download all of them, which is our next step:

>>> from Bio import Entrez
>>> handle = Entrez.esearch(db="nucleotide", term="Cypripedioideae", retmax=814, idtype="acc")
>>> record = Entrez.read(handle)

Here, record is a Python dictionary containing the search results and some auxiliary information. Just for information, let's look at what is stored in this dictionary:

>>> print(record.keys())
[u'Count', u'RetMax', u'IdList', u'TranslationSet', u'RetStart', u'QueryTranslation']

First, let's check how many results were found:

>>> print(record["Count"])
814

which is the number we expected. The 814 results are stored in record['IdList']:

>>> len(record["IdList"])
814

Let's look at the first five results:

>>> record["IdList"][:5]
['KX265015.1', 'KX265014.1', 'KX265013.1', 'KX265012.1', 'KX265011.1']

We can download these records using efetch. While you could download these records one by one, to reduce the load on NCBI's servers, it is better to fetch a bunch of records at the same time, shown below. However, in this situation you should ideally be using the history feature described later in Section 9.15.

>>> idlist = ",".join(record["IdList"][:5])
>>> print(idlist)
KX265015.1,KX265014.1,KX265013.1,KX265012.1,KX265011.1
>>> handle = Entrez.efetch(db="nucleotide", id=idlist, retmode="xml")
>>> records = Entrez.read(handle)
>>> len(records)
5

Each of these records corresponds to one GenBank record.
>>> print(records[0].keys())
[u'GBSeq_moltype', u'GBSeq_source', u'GBSeq_sequence',
u'GBSeq_primary-accession', u'GBSeq_definition', u'GBSeq_accession-version',
u'GBSeq_topology', u'GBSeq_length', u'GBSeq_feature-table',
u'GBSeq_create-date', u'GBSeq_other-seqids', u'GBSeq_division',
u'GBSeq_taxonomy', u'GBSeq_references', u'GBSeq_update-date',
u'GBSeq_organism', u'GBSeq_locus', u'GBSeq_strandedness']

>>> print(records[0]["GBSeq_primary-accession"])
DQ110336

>>> print(records[0]["GBSeq_other-seqids"])
['gb|DQ110336.1|', 'gi|187237168']

>>> print(records[0]["GBSeq_definition"])
Cypripedium calceolus voucher Davis 03-03 A maturase (matR) gene, partial cds;
mitochondrial

>>> print(records[0]["GBSeq_organism"])
Cypripedium calceolus

You could use this to quickly set up searches, but for heavy usage, see Section 9.15.

9.14.3 Searching, downloading, and parsing GenBank records
The GenBank record format is a very popular method of holding information about sequences, sequence features, and other associated sequence information. The format is a good way to get information from the NCBI databases at http://www.ncbi.nlm.nih.gov/.

In this example we'll show how to query the NCBI databases, to retrieve the records from the query, and then parse them using Bio.SeqIO, something touched on in Section 5.3.1. For simplicity, this example does not take advantage of the WebEnv history feature; see Section 9.15 for this.

First, we want to make a query and find out the ids of the records to retrieve. Here we'll do a quick search for one of our favorite organisms, Opuntia (prickly-pear cacti). We can do a quick search and get back the GIs (GenBank identifiers) for all of the corresponding records. First we check how many records there are:

>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com"  # Always tell NCBI who you are
>>> handle = Entrez.egquery(term="Opuntia AND rpl16")
>>> record = Entrez.read(handle)
>>> for row in record["eGQueryResult"]:
...     if row["DbName"] == "nuccore":
...         print(row["Count"])
...
9

Now we download the list of GenBank identifiers:

>>> handle = Entrez.esearch(db="nuccore", term="Opuntia AND rpl16")
>>> record = Entrez.read(handle)
>>> gi_list = record["IdList"]
>>> gi_list
['57240072', '57240071', '6273287', '6273291', '6273290', '6273289', '6273286',
'6273285', '6273284']

Now we use these GIs to download the GenBank records. Note that with older versions of Biopython you had to supply a comma-separated list of GI numbers to Entrez; as of Biopython 1.59 you can pass a list and this is converted for you:

>>> gi_str = ",".join(gi_list)
>>> handle = Entrez.efetch(db="nuccore", id=gi_str, rettype="gb", retmode="text")

If you want to look at the raw GenBank files, you can read from this handle and print out the result:

>>> text = handle.read()
>>> print(text)
LOCUS       AY851612                 892 bp    DNA     linear   PLN 10-APR-2007
DEFINITION  Opuntia subulata rpl16 gene, intron; chloroplast.
ACCESSION   AY851612
VERSION     AY851612.1  GI:57240072
KEYWORDS    .
SOURCE      chloroplast Austrocylindropuntia subulata
  ORGANISM  Austrocylindropuntia subulata
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; core eudicotyledons;
            Caryophyllales; Cactaceae; Opuntioideae; Austrocylindropuntia.
REFERENCE   1  (bases 1 to 892)
  AUTHORS   Butterworth,C.A. and Wallace,R.S.
...

In this case, we are just getting the raw records. To get the records in a more Python-friendly form, we can use Bio.SeqIO to parse the GenBank data into SeqRecord objects, including SeqFeature objects (see Chapter 5):

>>> from Bio import SeqIO
>>> handle = Entrez.efetch(db="nuccore", id=gi_str, rettype="gb", retmode="text")
>>> records = SeqIO.parse(handle, "gb")

We can now step through the records and look at the information we are interested in:

>>> for record in records:
...     print("%s, length %i, with %i features"
...           % (record.name, len(record), len(record.features)))
AY851612, length 892, with 3 features
AY851611, length 881, with 3 features
AF191661, length 895, with 3 features
AF191665, length 902, with 3 features
AF191664, length 899, with 3 features
AF191663, length 899, with 3 features
AF191660, length 893, with 3 features
AF191659, length 894, with 3 features
AF191658, length 896, with 3 features


Using this automated query retrieval functionality is a big plus over doing things by hand. Although the module should obey the NCBI's max three queries per second rule, the NCBI have other recommendations like avoiding peak hours. See Section 9.1. In particular, please note that for simplicity, this example does not use the WebEnv history feature. You should use this for any non-trivial search and download work, see Section 9.15.

Finally, if you plan to repeat your analysis, rather than downloading the files from the NCBI and parsing them immediately (as shown in this example), you should just download the records once and save them to your hard disk, and then parse the local file.
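
One way to arrange this "download once, then parse the local copy" pattern is sketched below. The helper name fetch_once and the idea of passing in a download function are assumptions made for illustration, not part of Biopython:

```python
import os


def fetch_once(filename, fetch_text):
    """Return the path to a local copy, downloading only if it is missing.

    `fetch_text` is any zero-argument callable returning the record text,
    for example a function that wraps Entrez.efetch(...).read().
    Hypothetical helper, not part of Biopython.
    """
    if not os.path.isfile(filename):
        text = fetch_text()
        with open(filename, "w") as out_handle:
            out_handle.write(text)
    return filename
```

With the example above this might be used as SeqIO.parse(fetch_once("opuntia_rpl16.gbk", download), "gb"), where download is your own function calling Entrez.efetch and the file name is arbitrary. Later runs of the script then read the saved file without contacting the NCBI.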

9.14.4 Finding the lineage of an organism
Staying with a plant example, let's now find the lineage of the Cypripedioideae orchid family. First, we search the Taxonomy database for Cypripedioideae, which yields exactly one NCBI taxonomy identifier:

>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com"  # Always tell NCBI who you are
>>> handle = Entrez.esearch(db="Taxonomy", term="Cypripedioideae")
>>> record = Entrez.read(handle)
>>> record["IdList"]
['158330']
>>> record["IdList"][0]
'158330'

Now, we use efetch to download this entry in the Taxonomy database, and then parse it:

>>> handle = Entrez.efetch(db="Taxonomy", id="158330", retmode="xml")
>>> records = Entrez.read(handle)

Again, this record stores lots of information:

>>> records[0].keys()
[u'Lineage', u'Division', u'ParentTaxId', u'PubDate', u'LineageEx',
u'CreateDate', u'TaxId', u'Rank', u'GeneticCode', u'ScientificName',
u'MitoGeneticCode', u'UpdateDate']

We can get the lineage directly from this record:

>>> records[0]["Lineage"]
'cellular organisms; Eukaryota; Viridiplantae; Streptophyta; Streptophytina;
Embryophyta; Tracheophyta; Euphyllophyta; Spermatophyta; Magnoliophyta;
Liliopsida; Asparagales; Orchidaceae'

The record data contains much more than just the information shown here. For example, look under "LineageEx" instead of "Lineage" and you'll get the NCBI taxon identifiers of the lineage entries too.
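
Because the lineage comes back as a single string, a common next step is to split it into a list of taxon names. The sketch below uses the lineage string shown above and assumes the names are separated by a semicolon followed by a space:

```python
lineage = ("cellular organisms; Eukaryota; Viridiplantae; Streptophyta; "
           "Streptophytina; Embryophyta; Tracheophyta; Euphyllophyta; "
           "Spermatophyta; Magnoliophyta; Liliopsida; Asparagales; Orchidaceae")

# Split on the "; " separator to get one taxon name per list element.
taxa = lineage.split("; ")
print(len(taxa))
print(taxa[-1])
```

The last element is the closest parent of the taxon that was looked up, here the family Orchidaceae.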

9.15 Using the history and WebEnv
Often you will want to make a series of linked queries. Most typically, running a search, perhaps refining the search, and then retrieving detailed search results. You can do this by making a series of separate calls to Entrez. However, the NCBI prefer you to take advantage of their history support, for example combining ESearch and EFetch.

Another typical use of the history support would be to combine EPost and EFetch. You use EPost to upload a list of identifiers, which starts a new history session. You then download the records with EFetch by referring to the session (instead of the identifiers).

9.15.1 Searching for and downloading sequences using the history

Suppose we want to search and download all the Opuntia rpl16 nucleotide sequences, and store them in a FASTA file. As shown in Section 9.14.3, we can naively combine Bio.Entrez.esearch() to get a list of accession numbers, and then call Bio.Entrez.efetch() to download them all.

However, the approved approach is to run the search with the history feature. Then, we can fetch the results by reference to the search results, which the NCBI can anticipate and cache.

To do this, call Bio.Entrez.esearch() as normal, but with the additional argument of usehistory="y":

>>> from Bio import Entrez
>>> Entrez.email = "history.user@example.com"
>>> search_handle = Entrez.esearch(db="nucleotide", term="Opuntia[orgn] and rpl16",
...                                usehistory="y", idtype="acc")
>>> search_results = Entrez.read(search_handle)
>>> search_handle.close()

When you get the XML output back, it will still include the usual search results:

>>> acc_list = search_results["IdList"]
>>> count = int(search_results["Count"])
>>> assert count == len(acc_list)

However, you also get given two additional pieces of information, the WebEnv session cookie, and the QueryKey:

>>> webenv = search_results["WebEnv"]
>>> query_key = search_results["QueryKey"]

Having stored these values in the variables webenv and query_key, we can use them as parameters to Bio.Entrez.efetch() instead of giving the accession numbers as identifiers.

While for small searches you might be OK downloading everything at once, it is better to download in batches. You use the retstart and retmax parameters to specify which range of search results you want returned (starting entry using zero-based counting, and maximum number of results to return). Sometimes you will get intermittent errors from Entrez (HTTPError 5XX); we use a try-except-pause-retry block to address this. For example,


# This assumes you have already run a search as shown above,
# and set the variables count, webenv, query_key

import time

try:
    from urllib.error import HTTPError  # for Python 3
except ImportError:
    from urllib2 import HTTPError  # for Python 2

batch_size = 3
out_handle = open("orchid_rpl16.fasta", "w")
for start in range(0, count, batch_size):
    end = min(count, start + batch_size)
    print("Going to download record %i to %i" % (start + 1, end))
    attempt = 0
    while attempt < 3:
        attempt += 1
        try:
            fetch_handle = Entrez.efetch(db="nucleotide",
                                         rettype="fasta", retmode="text",
                                         retstart=start, retmax=batch_size,
                                         webenv=webenv, query_key=query_key,
                                         idtype="acc")
            break  # success, so stop retrying this batch
        except HTTPError as err:
            # Retry server-side errors; give up after the third attempt
            if 500 <= err.code <= 599 and attempt < 3:
                print("Received error from server %s" % err)
                print("Attempt %i of 3" % attempt)
                time.sleep(15)
            else:
                raise
    data = fetch_handle.read()
    fetch_handle.close()
    out_handle.write(data)
out_handle.close()

For illustrative purposes, this example downloaded the FASTA records in batches of three. Unless you are downloading genomes or chromosomes, you would normally pick a larger batch size.
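
The two ideas in this example, walking over the results in batches and retrying after a transient failure, can be separated into small reusable functions. The sketch below uses only the standard library; batch_ranges and call_with_retries are hypothetical names, not Biopython functions:

```python
import time


def batch_ranges(count, batch_size):
    """Yield (start, end) pairs covering 0..count in steps of batch_size."""
    for start in range(0, count, batch_size):
        yield start, min(count, start + batch_size)


def call_with_retries(func, is_transient, attempts=3, delay=15):
    """Call func(), retrying when is_transient(error) says the failure may pass.

    Hypothetical helper: func is a zero-argument callable, is_transient takes
    the raised exception and returns True if a retry is worthwhile.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as err:
            # Re-raise on the final attempt or for errors that will not go away.
            if attempt == attempts or not is_transient(err):
                raise
            print("Attempt %i of %i failed: %s" % (attempt, attempts, err))
            time.sleep(delay)


print(list(batch_ranges(7, 3)))
```

For Entrez, func could wrap the Entrez.efetch call for one batch, and is_transient could test isinstance(err, HTTPError) and 500 <= err.code <= 599, mirroring the check in the script above.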

9.15.2 Searching for and downloading abstracts using the history
Here is another history example, searching for papers published in the last year about Opuntia, and then downloading them into a file in MedLine format:

from Bio import Entrez
import time
try:
    from urllib.error import HTTPError  # for Python 3
except ImportError:
    from urllib2 import HTTPError  # for Python 2
Entrez.email = "history.user@example.com"
search_results = Entrez.read(Entrez.esearch(db="pubmed",
                                            term="Opuntia[ORGN]",
                                            reldate=365, datetype="pdat",
                                            usehistory="y"))
count = int(search_results["Count"])
print("Found %i results" % count)

batch_size = 10
out_handle = open("recent_orchid_papers.txt", "w")
for start in range(0, count, batch_size):
    end = min(count, start + batch_size)
    print("Going to download record %i to %i" % (start + 1, end))
    attempt = 1
    while attempt <= 3:
        try:
            fetch_handle = Entrez.efetch(db="pubmed", rettype="medline",
                                         retmode="text", retstart=start,
                                         retmax=batch_size,
                                         webenv=search_results["WebEnv"],
                                         query_key=search_results["QueryKey"])
            break  # success, so stop retrying this batch
        except HTTPError as err:
            # Retry server-side errors; give up after the third attempt
            if 500 <= err.code <= 599 and attempt < 3:
                print("Received error from server %s" % err)
                print("Attempt %i of 3" % attempt)
                attempt += 1
                time.sleep(15)
            else:
                raise
    data = fetch_handle.read()
    fetch_handle.close()
    out_handle.write(data)
out_handle.close()

At the time of writing, this gave 28 matches, but because this is a date-dependent search, this will of course vary. As described in Section 9.12.1 above, you can then use Bio.Medline to parse the saved records.

9.15.3 Searching for citations

Back in Section 9.7 we mentioned ELink can be used to search for citations of a given paper. Unfortunately this only covers journals indexed for PubMed Central (doing it for all the journals in PubMed would mean a lot more work for the NIH). Let's try this for the Biopython PDB parser paper, PubMed ID 14630660:

>>> from Bio import Entrez
>>> Entrez.email = "A.N.Other@example.com"
>>> pmid = "14630660"
>>> results = Entrez.read(Entrez.elink(dbfrom="pubmed", db="pmc",
...                                    LinkName="pubmed_pmc_refs", id=pmid))
>>> pmc_ids = [link["Id"] for link in results[0]["LinkSetDb"][0]["Link"]]
>>> pmc_ids
['2744707', '2705363', '2682512', ..., '1190160']

Great, eleven articles. But why hasn't the Biopython application note been found (PubMed ID 19304878)? Well, as you might have guessed from the variable names, these are not actually PubMed IDs, but PubMed Central IDs. Our application note is the third citing paper in that list, PMCID 2682512.

So, what if (like me) you'd rather get back a list of PubMed IDs? Well, we can call ELink again to translate them. This becomes a two-step process, so by now you should expect to use the history feature to accomplish it (Section 9.15).

But first, taking the more straightforward approach of making a second (separate) call to ELink:

>>> results2 = Entrez.read(Entrez.elink(dbfrom="pmc", db="pubmed", LinkName="pmc_pubmed",
...                                     id=",".join(pmc_ids)))
>>> pubmed_ids = [link["Id"] for link in results2[0]["LinkSetDb"][0]["Link"]]
>>> pubmed_ids
['19698094', '19450287', '19304878', ..., '15985178']

This time you can immediately spot the Biopython application note as the third hit (PubMed ID 19304878).
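
The list comprehensions above index straight into the first "LinkSetDb" entry, which fails if a paper has no recorded links. A more defensive way to pull out the IDs is sketched below. The function name linked_ids is hypothetical, the mock data only imitates the nested layout used in the examples above, and unlike those examples the sketch gathers IDs from every "LinkSetDb" entry rather than just the first:

```python
def linked_ids(results):
    """Collect the linked IDs from a parsed ELink result.

    Assumes the nested list/dictionary layout used in the examples above;
    returns an empty list when no links were reported. Hypothetical helper.
    """
    ids = []
    for linkset in results:
        for linksetdb in linkset.get("LinkSetDb", []):
            ids.extend(link["Id"] for link in linksetdb.get("Link", []))
    return ids


# Mock data in the same shape as the parsed ELink output shown above.
mock_results = [{"LinkSetDb": [{"Link": [{"Id": "2744707"}, {"Id": "2705363"}]}]}]
print(linked_ids(mock_results))
print(linked_ids([{"LinkSetDb": []}]))
```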

Now, let's do that all again but with the history ... TODO.

And finally, don't forget to include your own email address in the Entrez calls.

Chapter 10 Swiss-Prot and ExPASy
10.1 Parsing Swiss-Prot files
Swiss-Prot (http://www.expasy.org/sprot) is a hand-curated database of protein sequences. Biopython can parse the "plain text" Swiss-Prot file format, which is still used for the UniProt Knowledgebase which combined Swiss-Prot, TrEMBL and PIR-PSD. We do not (yet) support the UniProtKB XML file format.

10.1.1 Parsing Swiss-Prot records
In Section 5.3.2, we described how to extract the sequence of a Swiss-Prot record as a SeqRecord object. Alternatively, you can store the Swiss-Prot record in a Bio.SwissProt.Record object, which in fact stores the complete information contained in the Swiss-Prot record. In this section, we describe how to extract Bio.SwissProt.Record objects from a Swiss-Prot file.

To parse a Swiss-Prot record, we first get a handle to a Swiss-Prot record. There are several ways to do so, depending on where and how the Swiss-Prot record is stored:

Open a Swiss-Prot file locally:
>>> handle = open("myswissprotfile.dat")

Open a gzipped Swiss-Prot file:
>>> import gzip
>>> handle = gzip.open("myswissprotfile.dat.gz", "rt")

Open a Swiss-Prot file over the internet:
>>> import urllib  # Python 2; on Python 3 use urllib.request.urlopen instead
>>> handle = urllib.urlopen("http://www.somelocation.org/data/someswissprotfile.dat")

Open a Swiss-Prot file over the internet from the ExPASy database (see section 10.5.1):
>>> from Bio import ExPASy
>>> handle = ExPASy.get_sprot_raw(myaccessionnumber)

The key point is that for the parser, it doesn't matter how the handle was created, as long as it points to data in the Swiss-Prot format.

We can use Bio.SeqIO as described in Section 5.3.2 to get file format agnostic SeqRecord objects. Alternatively, we can use Bio.SwissProt to get Bio.SwissProt.Record objects, which are a much closer match to the underlying file format.

To read one Swiss-Prot record from the handle, we use the function read():

>>> from Bio import SwissProt
>>> record = SwissProt.read(handle)

This function should be used if the handle points to exactly one Swiss-Prot record. It raises a ValueError if no Swiss-Prot record was found, and also if more than one record was found.

We can now print out some information about this record:

>>> print(record.description)
RecName: Full=Chalcone synthase 3; EC=2.3.1.74; AltName: Full=Naringenin-chalcone synthase 3;
>>> for ref in record.references:
...     print("authors:", ref.authors)
...     print("title:", ref.title)
...
authors: Liew C.F., Lim S.H., Loh C.S., Goh C.J.;
title: "Molecular cloning and sequence analysis of chalcone synthase cDNAs of
Bromheadia finlaysoniana.";
>>> print(record.organism_classification)
['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta', ..., 'Bromheadia']

To parse a file that contains more than one Swiss-Prot record, we use the parse function instead. This function allows us to iterate over the records in the file.


For example, let's parse the full Swiss-Prot database and collect all the descriptions. You can download this from the ExPASy FTP site as a single gzipped file uniprot_sprot.dat.gz (about 300MB). This is a compressed file containing a single file, uniprot_sprot.dat (over 1.5GB).

As described at the start of this section, you can use the Python library gzip to open and uncompress a .gz file, like this:

>>> import gzip
>>> handle = gzip.open("uniprot_sprot.dat.gz", "rt")

However, uncompressing a large file takes time, and each time you open the file for reading in this way, it has to be decompressed on the fly. So, if you can spare the disk space, you'll save time in the long run if you first decompress the file to disk, to get the uniprot_sprot.dat file inside. Then you can open the file for reading as usual:

>>> handle = open("uniprot_sprot.dat")

As of June 2009, the full Swiss-Prot database downloaded from ExPASy contained 468851 Swiss-Prot records. One concise way to build up a list of the record descriptions is with a list comprehension:

>>> from Bio import SwissProt
>>> handle = open("uniprot_sprot.dat")
>>> descriptions = [record.description for record in SwissProt.parse(handle)]
>>> len(descriptions)
468851
>>> descriptions[:5]
['RecName: Full=Protein MGF 100-1R;',
'RecName: Full=Protein MGF 100-1R;',
'RecName: Full=Protein MGF 100-1R;',
'RecName: Full=Protein MGF 100-1R;',
'RecName: Full=Protein MGF 100-2L;']

Or, using a for loop over the record iterator:

>>> from Bio import SwissProt
>>> descriptions = []
>>> handle = open("uniprot_sprot.dat")
>>> for record in SwissProt.parse(handle):
...     descriptions.append(record.description)
...
>>> len(descriptions)
468851

Because this is such a large input file, either way takes about eleven minutes on my new desktop computer (using the uncompressed uniprot_sprot.dat file as input).

It is equally easy to extract any kind of information you'd like from Swiss-Prot records. To see the members of a Swiss-Prot record, use

>>> dir(record)
['__doc__', '__init__', '__module__', 'accessions', 'annotation_update',
'comments', 'created', 'cross_references', 'data_class', 'description',
'entry_name', 'features', 'gene_name', 'host_organism', 'keywords',
'molecule_type', 'organelle', 'organism', 'organism_classification',
'references', 'seqinfo', 'sequence', 'sequence_length',
'sequence_update', 'taxonomy_id']

10.1.2 Parsing the Swiss-Prot keyword and category list
Swiss-Prot also distributes a file keywlist.txt, which lists the keywords and categories used in Swiss-Prot. The file contains entries in the following form:
ID   2Fe-2S.
AC   KW-0001
DE   Protein which contains at least one 2Fe-2S iron-sulfur cluster: 2 iron
DE   atoms complexed to 2 inorganic sulfides and 4 sulfur atoms of
DE   cysteines from the protein.
SY   Fe2S2; [2Fe-2S] cluster; [Fe2S2] cluster; Fe2/S2 (inorganic) cluster;
SY   Di-mu-sulfido-diiron; 2 iron, 2 sulfur cluster binding.
GO   GO:0051537; 2 iron, 2 sulfur cluster binding
HI   Ligand: Iron; Iron-sulfur; 2Fe-2S.
HI   Ligand: Metal-binding; 2Fe-2S.
CA   Ligand.
//
ID   3D-structure.
AC   KW-0002
DE   Protein, or part of a protein, whose three-dimensional structure has
DE   been resolved experimentally (for example by X-ray crystallography or
DE   NMR spectroscopy) and whose coordinates are available in the PDB
DE   database. Can also be used for theoretical models.
HI   Technical term: 3D-structure.
CA   Technical term.
//
ID   3Fe-4S.
...

The entries in this file can be parsed by the parse function in the Bio.SwissProt.KeyWList module. Each entry is then stored as a Bio.SwissProt.KeyWList.Record, which is a Python dictionary.

>>> from Bio.SwissProt import KeyWList
>>> handle = open("keywlist.txt")
>>> records = KeyWList.parse(handle)
>>> for record in records:
...     print(record['ID'])
...     print(record['DE'])

This prints

2Fe-2S.
Protein which contains at least one 2Fe-2S iron-sulfur cluster: 2 iron atoms
complexed to 2 inorganic sulfides and 4 sulfur atoms of cysteines from the
protein.
...

10.2 Parsing Prosite records
Prosite is a database containing protein domains, protein families, functional sites, as well as the patterns and profiles to recognize them. Prosite was developed in parallel with Swiss-Prot. In Biopython, a Prosite record is represented by the Bio.ExPASy.Prosite.Record class, whose members correspond to the different fields in a Prosite record.

In general, a Prosite file can contain more than one Prosite record. For example, the full set of Prosite records, which can be downloaded as a single file (prosite.dat) from the ExPASy FTP site, contains 2073 records (version 20.24 released on 4 December 2007). To parse such a file, we again make use of an iterator:

>>> from Bio.ExPASy import Prosite
>>> handle = open("myprositefile.dat")
>>> records = Prosite.parse(handle)

We can now take the records one at a time and print out some information. For example, using the file containing the complete Prosite database, we'd find

>>> from Bio.ExPASy import Prosite
>>> handle = open("prosite.dat")
>>> records = Prosite.parse(handle)
>>> record = next(records)
>>> record.accession
'PS00001'
>>> record.name
'ASN_GLYCOSYLATION'
>>> record.pdoc
'PDOC00001'
>>> record = next(records)
>>> record.accession
'PS00004'
>>> record.name
'CAMP_PHOSPHO_SITE'
>>> record.pdoc
'PDOC00004'
>>> record = next(records)
>>> record.accession
'PS00005'
>>> record.name
'PKC_PHOSPHO_SITE'
>>> record.pdoc
'PDOC00005'

and so on. If you're interested in how many Prosite records there are, you could use

>>> from Bio.ExPASy import Prosite
>>> handle = open("prosite.dat")
>>> records = Prosite.parse(handle)
>>> n = 0
>>> for record in records: n += 1
...
>>> n
2073
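
The counting loop works for any iterator, so it can be written once as a small function. This is a sketch with a hypothetical name, count_items, demonstrated on a plain list iterator rather than on real Prosite records:

```python
def count_items(iterable):
    """Count the items produced by any iterator, consuming it in the process."""
    return sum(1 for _ in iterable)


print(count_items(iter(["PS00001", "PS00004", "PS00005"])))
```

As with the explicit loop, the iterator is exhausted afterwards, so the file would have to be parsed again to look at the records themselves.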

To read exactly one Prosite record from the handle, you can use the read function:

>>> from Bio.ExPASy import Prosite
>>> handle = open("mysingleprositerecord.dat")
>>> record = Prosite.read(handle)

This function raises a ValueError if no Prosite record is found, and also if more than one Prosite record is found.

10.3 Parsing Prosite documentation records
In the Prosite example above, the record.pdoc accession numbers 'PDOC00001', 'PDOC00004', 'PDOC00005' and so on refer to Prosite documentation. The Prosite documentation records are available from ExPASy as individual files, and as one file (prosite.doc) containing all Prosite documentation records.

We use the parser in Bio.ExPASy.Prodoc to parse Prosite documentation records. For example, to create a list of all accession numbers of Prosite documentation records, you can use

>>> from Bio.ExPASy import Prodoc
>>> handle = open("prosite.doc")
>>> records = Prodoc.parse(handle)
>>> accessions = [record.accession for record in records]

Again a read() function is provided to read exactly one Prosite documentation record from the handle.

10.4 Parsing Enzyme records
ExPASy's Enzyme database is a repository of information on enzyme nomenclature. A typical Enzyme record looks as follows:

ID   3.1.1.34
DE   Lipoprotein lipase.
AN   Clearing factor lipase.
AN   Diacylglycerol lipase.
AN   Diglyceride lipase.
CA   Triacylglycerol + H(2)O = diacylglycerol + a carboxylate.
CC   -!- Hydrolyzes triacylglycerols in chylomicrons and very low-density
CC       lipoproteins (VLDL).
CC   -!- Also hydrolyzes diacylglycerol.
PR   PROSITE; PDOC00110;
DR   P11151, LIPL_BOVIN ;  P11153, LIPL_CAVPO ;  P11602, LIPL_CHICK ;
DR   P55031, LIPL_FELCA ;  P06858, LIPL_HUMAN ;  P11152, LIPL_MOUSE ;
DR   O46647, LIPL_MUSVI ;  P49060, LIPL_PAPAN ;  P49923, LIPL_PIG   ;
DR   Q06000, LIPL_RAT   ;  Q29524, LIPL_SHEEP ;
//

In this example, the first line shows the EC (Enzyme Commission) number of lipoprotein lipase (second line). Alternative names of lipoprotein lipase are "clearing factor lipase", "diacylglycerol lipase", and "diglyceride lipase" (lines 3 through 5). The line starting with "CA" shows the catalytic activity of this enzyme. Comment lines start with "CC". The "PR" line shows references to the Prosite Documentation records, and the "DR" lines show references to Swiss-Prot records. Not all of these entries are necessarily present in an Enzyme record.
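
To illustrate the line layout just described, the sketch below splits a single line into its two-letter code and its content. The function name split_enzyme_line is hypothetical, and the code assumes the code occupies the first two characters with the content following after some spaces; it is not a replacement for the Bio.ExPASy.Enzyme parser:

```python
def split_enzyme_line(line):
    """Split an Enzyme flat-file line into its two-letter code and its content.

    Assumes the code occupies the first two characters and the content starts
    after the following spaces. Illustrative only; use Bio.ExPASy.Enzyme for
    real parsing.
    """
    code = line[:2]
    content = line[2:].strip()
    return code, content


print(split_enzyme_line("ID   3.1.1.34"))
print(split_enzyme_line("AN   Clearing factor lipase."))
```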

In Biopython, an Enzyme record is represented by the Bio.ExPASy.Enzyme.Record class. This record derives from a Python dictionary and has keys corresponding to the two-letter codes used in Enzyme files. To read an Enzyme file containing one Enzyme record, use the read function in Bio.ExPASy.Enzyme:

>>> from Bio.ExPASy import Enzyme
>>> with open("lipoprotein.txt") as handle:
...     record = Enzyme.read(handle)
...
>>> record["ID"]
'3.1.1.34'
>>> record["DE"]
'Lipoprotein lipase.'
>>> record["AN"]
['Clearing factor lipase.', 'Diacylglycerol lipase.', 'Diglyceride lipase.']
>>> record["CA"]
'Triacylglycerol + H(2)O = diacylglycerol + a carboxylate.'
>>> record["PR"]
['PDOC00110']

>>> record["CC"]
['Hydrolyzes triacylglycerols in chylomicrons and very low-density lipoproteins
(VLDL).', 'Also hydrolyzes diacylglycerol.']
>>> record["DR"]
[['P11151', 'LIPL_BOVIN'], ['P11153', 'LIPL_CAVPO'], ['P11602', 'LIPL_CHICK'],
['P55031', 'LIPL_FELCA'], ['P06858', 'LIPL_HUMAN'], ['P11152', 'LIPL_MOUSE'],
['O46647', 'LIPL_MUSVI'], ['P49060', 'LIPL_PAPAN'], ['P49923', 'LIPL_PIG'],
['Q06000', 'LIPL_RAT'], ['Q29524', 'LIPL_SHEEP']]

The read function raises a ValueError if no Enzyme record is found, and also if more than one Enzyme record is found.

The full set of Enzyme records can be downloaded as a single file (enzyme.dat) from the ExPASy FTP site, containing 4877 records (release of 3 March 2009). To parse such a file containing multiple Enzyme records, use the parse function in Bio.ExPASy.Enzyme to obtain an iterator:

>>> from Bio.ExPASy import Enzyme
>>> handle = open("enzyme.dat")
>>> records = Enzyme.parse(handle)

We can now iterate over the records one at a time. For example, we can make a list of all EC numbers for which an Enzyme record is available:

>>> ecnumbers = [record["ID"] for record in records]

10.5 Accessing the ExPASy server
Swiss-Prot, Prosite, and Prosite documentation records can be downloaded from the ExPASy web server at http://www.expasy.org. Six kinds of queries are available from ExPASy:

get_prodoc_entry
    To download a Prosite documentation record in HTML format
get_prosite_entry
    To download a Prosite record in HTML format
get_prosite_raw
    To download a Prosite or Prosite documentation record in raw format
get_sprot_raw
    To download a Swiss-Prot record in raw format
sprot_search_ful
    To search for a Swiss-Prot record
sprot_search_de
    To search for a Swiss-Prot record

To access this web server from a Python script, we use the Bio.ExPASy module.

10.5.1 Retrieving a Swiss-Prot record

Let's say we are looking at chalcone synthases for Orchids (see section 2.3 for some justification for looking for interesting things about orchids). Chalcone synthase is involved in flavonoid biosynthesis in plants, and flavonoids make lots of cool things like pigment colors and UV protectants.

If you do a search on Swiss-Prot, you can find three orchid proteins for Chalcone Synthase, id numbers O23729, O23730, O23731. Now, let's write a script which grabs these, and parses out some interesting information.

First, we grab the records, using the get_sprot_raw() function of Bio.ExPASy. This function is very nice since you can feed it an id and get back a handle to a raw text record (no HTML to mess with!). We can then use Bio.SwissProt.read to pull out the Swiss-Prot record, or Bio.SeqIO.read to get a SeqRecord. The following code accomplishes what I just wrote:

>>> from Bio import ExPASy
>>> from Bio import SwissProt

>>> accessions = ["O23729", "O23730", "O23731"]
>>> records = []

>>> for accession in accessions:
...     handle = ExPASy.get_sprot_raw(accession)
...     record = SwissProt.read(handle)
...     records.append(record)

If the accession number you provided to ExPASy.get_sprot_raw does not exist, then SwissProt.read(handle) will raise a ValueError. You can catch ValueError exceptions to detect invalid accession numbers:

>>> for accession in accessions:
...     handle = ExPASy.get_sprot_raw(accession)
...     try:
...         record = SwissProt.read(handle)
...     except ValueError:
...         print("WARNING: Accession %s not found" % accession)
...     else:
...         records.append(record)

10.5.2 Searching Swiss-Prot
Now, you may remark that I knew the records' accession numbers beforehand. Indeed, get_sprot_raw() needs either the entry name or an accession number. When you don't have them handy, you can use one of the sprot_search_de() or sprot_search_ful() functions.

sprot_search_de() searches in the ID, DE, GN, OS and OG lines; sprot_search_ful() searches in (nearly) all the fields. They are detailed on http://www.expasy.org/cgi-bin/sprot-search-de and http://www.expasy.org/cgi-bin/sprot-search-ful respectively. Note that they don't search in TrEMBL by default (argument trembl). Note also that they return HTML pages; however, accession numbers are quite easily extractable:

>>> from Bio import ExPASy
>>> import re

>>> handle = ExPASy.sprot_search_de("Orchid Chalcone Synthase")
>>> # or:
>>> # handle = ExPASy.sprot_search_ful("Orchid and {Chalcone Synthase}")
>>> html_results = handle.read()
>>> if "Number of sequences found" in html_results:
...     ids = re.findall(r'HREF="/uniprot/(\w+)"', html_results)
... else:
...     ids = re.findall(r'href="/cgi-bin/niceprot\.pl\?(\w+)"', html_results)

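To see what the regular expression does without contacting the server, it can be tried on a small made-up HTML fragment. The markup below is a hypothetical stand-in for a results page (the real page layout may differ); only the accession numbers are taken from the example above.

```python
import re

# Hypothetical HTML fragment; not an actual ExPASy response.
html_results = (
    '<li><a href="/cgi-bin/niceprot.pl?O23729">first hit</a></li>\n'
    '<li><a href="/cgi-bin/niceprot.pl?O23730">second hit</a></li>\n'
    '<li><a href="/help/search.html">unrelated link</a></li>\n'
)

# The group (\w+) captures the accession that follows the literal "?".
ids = re.findall(r'href="/cgi-bin/niceprot\.pl\?(\w+)"', html_results)
print(ids)
```

The dot and the question mark are escaped so that they match literally; links that do not follow the expected pattern are simply skipped.
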
10.5.3 Retrieving Prosite and Prosite documentation records

Prosite and Prosite documentation records can be retrieved either in HTML format, or in raw format. To parse Prosite and Prosite documentation records with Biopython, you should retrieve the records in raw format. For other purposes, however, you may be interested in these records in HTML format.

To retrieve a Prosite or Prosite documentation record in raw format, use get_prosite_raw(). For example, to download a Prosite record and print it out in raw text format, use

>>> from Bio import ExPASy
>>> handle = ExPASy.get_prosite_raw('PS00001')
>>> text = handle.read()
>>> print(text)

To retrieve a Prosite record and parse it into a Bio.ExPASy.Prosite.Record object, use

>>> from Bio import ExPASy
>>> from Bio.ExPASy import Prosite
>>> handle = ExPASy.get_prosite_raw('PS00001')
>>> record = Prosite.read(handle)

The same function can be used to retrieve a Prosite documentation record and parse it into a Bio.ExPASy.Prodoc.Record object:

>>> from Bio import ExPASy
>>> from Bio.ExPASy import Prodoc
>>> handle = ExPASy.get_prosite_raw('PDOC00001')
>>> record = Prodoc.read(handle)

For non-existing accession numbers, ExPASy.get_prosite_raw returns a handle to an empty string. When faced with an empty string, Prosite.read and Prodoc.read will raise a ValueError. You can catch these exceptions to detect invalid accession numbers.

The functions get_prosite_entry() and get_prodoc_entry() are used to download Prosite and Prosite documentation records in HTML format. To create a web page showing one Prosite record, you can use

>>> from Bio import ExPASy
>>> handle = ExPASy.get_prosite_entry('PS00001')
>>> html = handle.read()
>>> with open("myprositerecord.html", "w") as out_handle:
...     out_handle.write(html)
...

and similarly for a Prosite documentation record:

>>> from Bio import ExPASy
>>> handle = ExPASy.get_prodoc_entry('PDOC00001')
>>> html = handle.read()
>>> with open("myprodocrecord.html", "w") as out_handle:
...     out_handle.write(html)
...

For these functions, an invalid accession number returns an error message in HTML format.

10.6 Scanning the Prosite database
ScanProsite allows you to scan protein sequences online against the Prosite database by providing a UniProt or PDB sequence identifier or the sequence itself. For more information about ScanProsite, please see the ScanProsite documentation as well as the documentation for programmatic access of ScanProsite.

You can use Biopython's Bio.ExPASy.ScanProsite module to scan the Prosite database from Python. This module both helps you to access ScanProsite programmatically, and to parse the results returned by ScanProsite. To scan for Prosite patterns in the following protein sequence:

MEHKEVVLLLLLFLKSGQGEPLDDYVNTQGASLFSVTKKQLGAGSIEECAAKCEEDEEFT
CRAFQYHSKEQQCVIMAENRKSSIIIRMRDVVLFEKKVYLSECKTGNGKNYRGTMSKTKN

you can use the following code:

>>> sequence = ("MEHKEVVLLLLLFLKSGQGEPLDDYVNTQGASLFSVTKKQLGAGSIEECAAKCEEDEEFT"
...             "CRAFQYHSKEQQCVIMAENRKSSIIIRMRDVVLFEKKVYLSECKTGNGKNYRGTMSKTKN")
>>> from Bio.ExPASy import ScanProsite
>>> handle = ScanProsite.scan(seq=sequence)

By executing handle.read(), you can obtain the search results in raw XML format. Instead, let's use Bio.ExPASy.ScanProsite.read to parse the raw XML into a Python object:

>>> result = ScanProsite.read(handle)
>>> type(result)
<class 'Bio.ExPASy.ScanProsite.Record'>

A Bio.ExPASy.ScanProsite.Record object is derived from a list, with each element in the list storing one ScanProsite hit. This object also stores the number of hits, as well as the number of search sequences, as returned by ScanProsite. This ScanProsite search resulted in six hits:

>>> result.n_seq
1
>>> result.n_match
6
>>> len(result)
6
>>> result[0]
{'signature_ac': u'PS50948', 'level': u'0', 'stop': 98, 'sequence_ac': u'USERSEQ1', 'start': 16, 'score': u'8.873'}
>>> result[1]
{'start': 37, 'stop': 39, 'sequence_ac': u'USERSEQ1', 'signature_ac': u'PS00005'}
>>> result[2]
{'start': 45, 'stop': 48, 'sequence_ac': u'USERSEQ1', 'signature_ac': u'PS00006'}
>>> result[3]
{'start': 60, 'stop': 62, 'sequence_ac': u'USERSEQ1', 'signature_ac': u'PS00005'}
>>> result[4]
{'start': 80, 'stop': 83, 'sequence_ac': u'USERSEQ1', 'signature_ac': u'PS00004'}
>>> result[5]
{'start': 106, 'stop': 111, 'sequence_ac': u'USERSEQ1', 'signature_ac': u'PS00008'}

Other ScanProsite parameters can be passed as keyword arguments; see the documentation for programmatic access of ScanProsite for more information. As an example, passing lowscore=1 to include matches with low level scores lets us find one additional hit:

>>> handle = ScanProsite.scan(seq=sequence, lowscore=1)
>>> result = ScanProsite.read(handle)
>>> result.n_match
7
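
Because each hit behaves like a dictionary, ordinary Python tools are enough to summarize a result. The sketch below works on a hand-copied list of the pattern hits from the six-hit example (plain dictionaries standing in for the parsed record, so no web access is needed) and groups the match positions by signature accession; by_signature is just an illustrative name.

```python
from collections import defaultdict

# Pattern hits copied from the six-hit example output in this section.
hits = [
    {"start": 37, "stop": 39, "sequence_ac": "USERSEQ1", "signature_ac": "PS00005"},
    {"start": 45, "stop": 48, "sequence_ac": "USERSEQ1", "signature_ac": "PS00006"},
    {"start": 60, "stop": 62, "sequence_ac": "USERSEQ1", "signature_ac": "PS00005"},
    {"start": 80, "stop": 83, "sequence_ac": "USERSEQ1", "signature_ac": "PS00004"},
    {"start": 106, "stop": 111, "sequence_ac": "USERSEQ1", "signature_ac": "PS00008"},
]

# Collect (start, stop) pairs under each signature accession.
by_signature = defaultdict(list)
for hit in hits:
    by_signature[hit["signature_ac"]].append((hit["start"], hit["stop"]))

for signature, positions in sorted(by_signature.items()):
    print(signature, positions)
```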

Chapter 11 Going 3D: The PDB module
Bio.PDB is a Biopython module that focuses on working with crystal structures of biological macromolecules. Among other things, Bio.PDB includes a PDBParser class that produces a Structure object, which can be used to access the atomic data in the file in a convenient manner. There is limited support for parsing the information contained in the PDB header.

11.1 Reading and writing crystal structure files
11.1.1 Reading a PDB file
First we create a PDBParser object:

>>> from Bio.PDB.PDBParser import PDBParser
>>> p = PDBParser(PERMISSIVE=1)

The PERMISSIVE flag indicates that a number of common problems (see 11.7.1) associated with PDB files will be ignored (but note that some atoms and/or residues will be missing). If the flag is not present a PDBConstructionException will be generated if any problems are detected during the parse operation.

The Structure object is then produced by letting the PDBParser object parse a PDB file (the PDB file in this case is called "pdb1fat.ent", "1fat" is a user defined name for the structure):

>>> structure_id = "1fat"
>>> filename = "pdb1fat.ent"
>>> s = p.get_structure(structure_id, filename)

You can extract the header and trailer (simple lists of strings) of the PDB file from the PDBParser object with the get_header and get_trailer methods. Note however that many PDB files contain headers with incomplete or erroneous information. Many of the errors have been fixed in the equivalent mmCIF files. Hence, if you are interested in the header information, it is a good idea to extract information from mmCIF files using the MMCIF2Dict tool described below, instead of parsing the PDB header.

Now that is clarified, let's return to parsing the PDB header. The structure object has an attribute called header which is a Python dictionary that maps header records to their values.

Example:

>>> resolution = structure.header['resolution']
>>> keywords = structure.header['keywords']

The available keys are name, head, deposition_date, release_date, structure_method, resolution, structure_reference (which maps to a list of references), journal_reference, author, and compound (which maps to a dictionary with various information about the crystallized compound).

The dictionary can also be created without creating a Structure object, i.e. directly from the PDB file:

>>> from Bio.PDB import parse_pdb_header
>>> with open(filename, 'r') as handle:
...     header_dict = parse_pdb_header(handle)
...

11.1.2 Reading an mmCIF file
Similarly to the case of PDB files, first create an MMCIFParser object:

>>> from Bio.PDB.MMCIFParser import MMCIFParser
>>> parser = MMCIFParser()

Then use this parser to create a structure object from the mmCIF file:

>>> structure = parser.get_structure('1fat', '1fat.cif')

To have some more low level access to an mmCIF file, you can use the MMCIF2Dict class to create a Python dictionary that maps all mmCIF tags in an mmCIF file to their values. If there are multiple values (like in the case of tag _atom_site.Cartn_y, which holds the y coordinates of all atoms), the tag is mapped to a list of values. The dictionary is created from the mmCIF file as follows:

>>> from Bio.PDB.MMCIF2Dict import MMCIF2Dict
>>> mmcif_dict = MMCIF2Dict('1FAT.cif')

Example: get the solvent content from an mmCIF file:

>>> sc = mmcif_dict['_exptl_crystal.density_percent_sol']

Example: get the list of the y coordinates of all atoms

>>> y_list = mmcif_dict['_atom_site.Cartn_y']

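Values read from a text-based format usually need converting before any arithmetic. The sketch below assumes that the dictionary values are strings, either a single string or a list of strings as described above; mmcif_like is a hypothetical stand-in with invented numbers, and as_floats is an illustrative helper, not part of Bio.PDB.

```python
def as_floats(value):
    """Convert one value, or a list of values, to a list of floats."""
    values = value if isinstance(value, list) else [value]
    return [float(v) for v in values]


# Hypothetical stand-in for the tag dictionary; the numbers are invented.
mmcif_like = {
    "_exptl_crystal.density_percent_sol": "53.2",
    "_atom_site.Cartn_y": ["12.5", "13.75", "-4.0"],
}

y_coords = as_floats(mmcif_like["_atom_site.Cartn_y"])
solvent = as_floats(mmcif_like["_exptl_crystal.density_percent_sol"])[0]
print(y_coords, solvent)
```

Normalizing to a list first means the same helper works for single-valued and multi-valued tags. Placeholder values that are not numbers would make float() raise a ValueError, which the caller would need to handle.
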
11.1.3 Reading files in the MMTF format

You can use the direct MMTFParser to read a structure from a file:

>>> from Bio.PDB.mmtf import MMTFParser
>>> structure = MMTFParser.get_structure("PDB/4CUP.mmtf")

Or you can use the same class to get a structure by its PDB ID:

>>> structure = MMTFParser.get_structure_from_url("4CUP")

This gives you a Structure object as if read from a PDB or mmCIF file.

You can also have access to the underlying data using the external MMTF library which Biopython is using internally:

>>> from mmtf import fetch
>>> decoded_data = fetch("4CUP")

For example you can access just the X-coordinate.

>>> print(decoded_data.x_coord_list)
...

11.1.4 Reading files in the PDB XML format
That's not yet supported, but we are definitely planning to support that in the future (it's not a lot of work). Contact the Biopython developers (biopython-dev@biopython.org) if you need this.

11.1.5 Writing PDB files

Use the PDBIO class for this. It's easy to write out specific parts of a structure too, of course.

Example: saving a structure

>>> from Bio.PDB import PDBIO
>>> io = PDBIO()
>>> io.set_structure(s)
>>> io.save('out.pdb')

If you want to write out a part of the structure, make use of the Select class (also in PDBIO). Select has four methods:

accept_model(model)
accept_chain(chain)
accept_residue(residue)
accept_atom(atom)

By default, every method returns 1 (which means the model/chain/residue/atom is included in the output). By subclassing Select and returning 0 when appropriate you can exclude models, chains, etc. from the output. Cumbersome maybe, but very powerful. The following code only writes out glycine residues:

>>> from Bio.PDB.PDBIO import Select
>>> class GlySelect(Select):
...     def accept_residue(self, residue):
...         if residue.get_resname() == 'GLY':
...             return True
...         else:
...             return False
...
>>> io = PDBIO()
>>> io.set_structure(s)
>>> io.save('gly_only.pdb', GlySelect())

If this is all too complicated for you, the Dice module contains a handy extract function that writes out all residues in a chain between a start and end residue.

11.2 Structure representation
The overall layout of a Structure object follows the so-called SMCRA (Structure/Model/Chain/Residue/Atom) architecture:

A structure consists of models
A model consists of chains
A chain consists of residues
A residue consists of atoms

This is the way many structural biologists/bioinformaticians think about structure, and provides a simple but efficient way to deal with structure. Additional stuff is essentially added when needed. A UML diagram of the Structure object (forget about the Disordered classes for now) is shown in Fig. 11.1. Such a data structure is not necessarily best suited for the representation of the macromolecular content of a structure, but it is absolutely necessary for a good interpretation of the data present in a file that describes the structure (typically a PDB or MMCIF file). If this hierarchy cannot represent the contents of a structure file, it is fairly certain that the file contains an error or at least does not describe the structure unambiguously. If a SMCRA data structure cannot be generated, there is reason to suspect a problem. Parsing a PDB file can thus be used to detect likely problems. We will give several examples of this in section 11.7.1.


Figure 11.1: UML diagram of SMCRA architecture of the Structure class used to represent a macromolecular structure. Full lines with diamonds denote aggregation, full lines with arrows denote referencing, full lines with triangles denote inheritance and dashed lines with triangles denote interface realization.

Structure, Model, Chain and Residue are all subclasses of the Entity base class. The Atom class only (partly) implements the Entity interface (because an Atom does not have children).

For each Entity subclass, you can extract a child by using a unique id for that child as a key (e.g. you can extract an Atom object from a Residue object by using an atom name string as a key, you can extract a Chain object from a Model object by using its chain identifier as a key).

Disordered atoms and residues are represented by DisorderedAtom and DisorderedResidue classes, which are both subclasses of the DisorderedEntityWrapper base class. They hide the complexity associated with disorder and behave exactly as Atom and Residue objects.

In general, a child Entity object (i.e. Atom, Residue, Chain, Model) can be extracted from its parent (i.e. Residue, Chain, Model, Structure, respectively) by using an id as a key.

>>> child_entity = parent_entity[child_id]

You can also get a list of all child Entities of a parent Entity object. Note that this list is sorted in a specific way (e.g. according to chain identifier for Chain objects in a Model object).

>>> child_list = parent_entity.get_list()

You can also get the parent from a child:

>>> parent_entity = child_entity.get_parent()

At all levels of the SMCRA hierarchy, you can also extract a full id. The full id is a tuple containing all ids starting from the top object (Structure) down to the current object. A full id for a Residue object e.g. is something like:

>>> full_id = residue.get_full_id()
>>> print(full_id)
("1abc", 0, "A", (" ", 10, "A"))

This corresponds to:

The Structure with id "1abc"
The Model with id 0
The Chain with id "A"
The Residue with id (" ", 10, "A").

The Residue id indicates that the residue is not a hetero-residue (nor a water) because it has a blank hetero field, that its sequence identifier is 10 and that its insertion code is "A".
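
As a small illustration, the full id from this example can be taken apart with ordinary tuple unpacking. The variable names are chosen for this sketch only, and the tuple is written out by hand rather than obtained from a parsed structure.

```python
# Full id of a residue, as in the example: (structure, model, chain, residue id).
full_id = ("1abc", 0, "A", (" ", 10, "A"))

structure_id, model_id, chain_id, residue_id = full_id
hetfield, resseq, icode = residue_id

# A blank hetero field marks a standard amino or nucleic acid.
is_standard = hetfield.strip() == ""
print(structure_id, model_id, chain_id, resseq, icode, is_standard)
```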

To get the entity's id, use the get_id method:

>>> entity.get_id()

You can check if the entity has a child with a given id by using the has_id method:

>>> entity.has_id(entity_id)

The length of an entity is equal to its number of children:

>>> nr_children = len(entity)

It is possible to delete, rename, add, etc. child entities from a parent entity, but this does not include any sanity checks (e.g. it is possible to add two residues with the same id to one chain). This really should be done via a nice Decorator class that includes integrity checking, but you can take a look at the code (Entity.py) if you want to use the raw interface.

11.2.1 Structure
The Structure object is at the top of the hierarchy. Its id is a user given string. The Structure contains a number of Model children. Most crystal structures (but not all) contain a single model, while NMR structures typically consist of several models. Disorder in crystal structures of large parts of molecules can also result in several models.

11.2.2 Model

The id of the Model object is an integer, which is derived from the position of the model in the parsed file (they are automatically numbered starting from 0). Crystal structures generally have only one model (with id 0), while NMR files usually have several models. Whereas many PDB parsers assume that there is only one model, the Structure class in Bio.PDB is designed such that it can easily handle PDB files with more than one model.

As an example, to get the first model from a Structure object, use

>>> first_model = structure[0]

The Model object stores a list of Chain children.

11.2.3 Chain
The id of a Chain object is derived from the chain identifier in the PDB/mmCIF file, and is a single character (typically a letter). Each Chain in a Model object has a unique id. As an example, to get the Chain object with identifier "A" from a Model object, use

>>> chain_A = model["A"]

The Chain object stores a list of Residue children.

11.2.4 Residue

A residue id is a tuple with three elements:

The hetero-field (hetfield): this is
    'W' in the case of a water molecule
    'H_' followed by the residue name for other hetero residues (e.g. 'H_GLC' in the case of a glucose molecule)
    blank for standard amino and nucleic acids.
    This scheme is adopted for reasons described in section 11.4.1.
The sequence identifier (resseq), an integer describing the position of the residue in the chain (e.g., 100)
The insertion code (icode), a string, e.g. 'A'. The insertion code is sometimes used to preserve a certain desirable residue numbering scheme. A Ser 80 insertion mutant (inserted e.g. between a Thr 80 and an Asn 81 residue) could e.g. have sequence identifiers and insertion codes as follows: Thr 80 A, Ser 80 B, Asn 81. In this way the residue numbering scheme stays in tune with that of the wild type structure.

The id of the above glucose residue would thus be ('H_GLC', 100, 'A'). If the hetero-flag and insertion code are blank, the sequence identifier alone can be used:

# Full id
>>> residue = chain[(' ', 100, ' ')]
# Shortcut id
>>> residue = chain[100]

The reason for the hetero-flag is that many, many PDB files use the same sequence identifier for an amino acid and a hetero-residue or a water, which would create obvious problems if the hetero-flag was not used.

Unsurprisingly, a Residue object stores a set of Atom children. It also contains a string that specifies the residue name (e.g. "ASN") and the segment identifier of the residue (well known to X-PLOR users, but not used in the construction of the SMCRA data structure).

Let's look at some examples. Asn 10 with a blank insertion code would have residue id (' ', 10, ' '). Water 10 would have residue id ('W', 10, ' '). A glucose molecule (a hetero residue with residue name GLC) with sequence identifier 10 would have residue id ('H_GLC', 10, ' '). In this way, the three residues (with the same insertion code and sequence identifier) can be part of the same chain because their residue ids are distinct.
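
The effect of the three-element id can be mimicked with an ordinary dictionary. In this sketch, which is only an analogy and not how Bio.PDB's Chain class is written, residues maps id tuples to residue names and the hypothetical helper lookup expands a bare integer into the full id of a standard residue.

```python
# Three residues sharing sequence identifier 10 and a blank insertion code.
residues = {
    (" ", 10, " "): "ASN",
    ("W", 10, " "): "HOH",
    ("H_GLC", 10, " "): "GLC",
}


def lookup(residues, key):
    """Accept either a full id tuple or a bare sequence identifier."""
    if isinstance(key, int):
        # Shortcut: blank hetero field and blank insertion code.
        key = (" ", key, " ")
    return residues[key]


print(lookup(residues, 10))
print(lookup(residues, ("W", 10, " ")))
print(lookup(residues, ("H_GLC", 10, " ")))
```

Without the hetero field, all three entries would collapse onto the same key and only one residue could be stored.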

In most cases, the hetflag and insertion code fields will be blank, e.g. (' ', 10, ' '). In these cases, the sequence identifier can be used as a shortcut for the full id:

# use full id
>>> res10 = chain[(' ', 10, ' ')]
# use shortcut
>>> res10 = chain[10]

Each Residue object in a Chain object should have a unique id. However, disordered residues are dealt with in a special way, as described in section 11.3.3.

A Residue object has a number of additional methods:

>>> residue.get_resname()       # returns the residue name, e.g. "ASN"
>>> residue.is_disordered()     # returns 1 if the residue has disordered atoms
>>> residue.get_segid()         # returns the SEGID, e.g. "CHN1"
>>> residue.has_id(name)        # test if a residue has a certain atom

You can use is_aa(residue) to test if a Residue object is an amino acid.

11.2.5 Atom
The Atom object stores the data associated with an atom, and has no children. The id of an atom is its atom name (e.g. "OG" for the side chain oxygen of a Ser residue). An Atom id needs to be unique in a Residue. Again, an exception is made for disordered atoms, as described in section 11.3.2.

The atom id is simply the atom name (e.g. "CA"). In practice, the atom name is created by stripping all spaces from the atom name in the PDB file.

However, in PDB files, a space can be part of an atom name. Often, calcium atoms are called "CA.." in order to distinguish them from Cα atoms (which are called ".CA."). In cases where stripping the spaces would create problems (i.e. two atoms called "CA" in the same residue) the spaces are kept.

In a PDB file, an atom name consists of 4 chars, typically with leading and trailing spaces. Often these spaces can be removed for ease of use (e.g. an amino acid Cα atom is labeled ".CA." in a PDB file, where the dots represent spaces). To generate an atom name (and thus an atom id) the spaces are removed, unless this would result in a name collision in a Residue (i.e. two Atom objects with the same atom name and id). In the latter case, the atom name including spaces is tried. This situation can e.g. happen when one residue contains atoms with names ".CA." and "CA..", although this is not very likely.
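
The naming rule can be written down in a few lines. The function below is a simplified reading of the rule described above, not the parser's actual code; assign_atom_ids is a hypothetical name, and the rule is applied in file order so that the first atom keeps the stripped name and a later colliding atom keeps its spaces.

```python
def assign_atom_ids(fullnames):
    """Map 4-character PDB atom names to ids within one residue."""
    ids = []
    for fullname in fullnames:
        short = fullname.strip()
        if short in ids:
            # Stripping would collide with an earlier atom: keep the spaces.
            ids.append(fullname)
        else:
            ids.append(short)
    return ids


# " CA " is an alpha carbon, "CA  " a calcium atom in the same residue.
print(assign_atom_ids([" N  ", " CA ", " C  ", "CA  "]))
```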

The atomic data stored includes the atom name, the atomic coordinates (including standard deviation if present), the B factor (including anisotropic B factors and standard deviation if present), the altloc specifier and the full atom name including spaces. Less used items like the atom element number or the atomic charge sometimes specified in a PDB file are not stored.

To manipulate the atomic coordinates, use the transform method of the Atom object. Use the set_coord method to specify the atomic coordinates directly.

An Atom object has the following additional methods:

>>> a.get_name()       # atom name (spaces stripped, e.g. "CA")
>>> a.get_id()         # id (equals atom name)
>>> a.get_coord()      # atomic coordinates
>>> a.get_vector()     # atomic coordinates as Vector object
>>> a.get_bfactor()    # isotropic B factor
>>> a.get_occupancy()  # occupancy
>>> a.get_altloc()     # alternative location specifier
>>> a.get_sigatm()     # standard deviation of atomic parameters
>>> a.get_siguij()     # standard deviation of anisotropic B factor
>>> a.get_anisou()     # anisotropic B factor
>>> a.get_fullname()   # atom name (with spaces, e.g. ".CA.")

To represent the atom coordinates, siguij, anisotropic B factor and sigatm, Numpy arrays are used.

The get_vector method returns a Vector object representation of the coordinates of the Atom object, allowing you to do vector operations on atomic coordinates. Vector implements the full set of 3D vector operations, matrix multiplication (left and right) and some advanced rotation-related operations as well.

As an example of the capabilities of Bio.PDB's Vector module, suppose that you would like to find the position of a Gly residue's Cβ atom, if it had one. Rotating the N atom of the Gly residue along the Cα-C bond over -120 degrees roughly puts it in the position of a virtual Cβ atom. Here's how to do it, making use of the rotaxis method (which can be used to construct a rotation around a certain axis) of the Vector module:

>>> from math import pi
>>> from Bio.PDB import rotaxis
# get atom coordinates as vectors
>>> n = residue['N'].get_vector()
>>> c = residue['C'].get_vector()
>>> ca = residue['CA'].get_vector()
# center at origin
>>> n = n - ca
>>> c = c - ca
# find rotation matrix that rotates n
# -120 degrees along the ca-c vector
>>> rot = rotaxis(-pi * 120.0/180.0, c)
# apply rotation to ca-n vector
>>> cb_at_origin = n.left_multiply(rot)
# put on top of ca atom
>>> cb = cb_at_origin + ca

This example shows that it's possible to do some quite nontrivial vector operations on atomic data, which can be quite useful. In addition to all the usual vector operations (cross (use **), and dot (use *) product, angle, norm, etc.) and the above mentioned rotaxis function, the Vector module also has methods to rotate (rotmat) or reflect (refmat) one vector on top of another.
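
For readers who want to see the geometry behind a rotation about an axis, here is a plain-Python sketch based on Rodrigues' rotation formula. It is independent of Bio.PDB: rotate_about_axis is a hypothetical helper, and the right-handed sign convention used here is an assumption of this sketch that may not match the convention of the Vector module.

```python
from math import cos, sin, sqrt, pi


def rotate_about_axis(v, axis, angle):
    """Rotate 3D vector v by angle (radians) about axis (Rodrigues' formula)."""
    norm = sqrt(sum(a * a for a in axis))
    k = [a / norm for a in axis]  # unit vector along the axis
    k_dot_v = sum(ki * vi for ki, vi in zip(k, v))
    k_cross_v = [
        k[1] * v[2] - k[2] * v[1],
        k[2] * v[0] - k[0] * v[2],
        k[0] * v[1] - k[1] * v[0],
    ]
    c, s = cos(angle), sin(angle)
    # v*cos + (k x v)*sin + k*(k.v)*(1 - cos)
    return [v[i] * c + k_cross_v[i] * s + k[i] * k_dot_v * (1 - c) for i in range(3)]


# Rotate the x unit vector a quarter turn about the z axis.
rotated = rotate_about_axis([1.0, 0.0, 0.0], [0.0, 0.0, 2.0], pi / 2)
print([round(x, 6) for x in rotated])
```

The axis does not need to be normalized by the caller, and a vector lying along the axis is left unchanged.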

11.2.6 Extracting a specific Atom/Residue/Chain/Model from a Structure

These are some examples:

>>> model = structure[0]
>>> chain = model['A']
>>> residue = chain[100]
>>> atom = residue['CA']

Note that you can use a shortcut:

>>> atom = structure[0]['A'][100]['CA']

11.3 Disorder
Bio.PDB can handle both disordered atoms and point mutations (i.e. a Gly and an Ala residue in the same position).

11.3.1 General approach
Disorder should be dealt with from two points of view: the atom and the residue points of view. In general, we have tried to encapsulate all the complexity that arises from disorder. If you just want to loop over all Cα atoms, you do not care that some residues have a disordered side chain. On the other hand it should also be possible to represent disorder completely in the data structure. Therefore, disordered atoms or residues are stored in special objects that behave as if there is no disorder. This is done by only representing a subset of the disordered atoms or residues. Which subset is picked (e.g. which of the two disordered OG side chain atom positions of a Ser residue is used) can be specified by the user.

11.3.2 Disordered atoms

Disordered atoms are represented by ordinary Atom objects, but all Atom objects that represent the same physical atom are stored in a DisorderedAtom object (see Fig. 11.1). Each Atom object in a DisorderedAtom object can be uniquely indexed using its altloc specifier. The DisorderedAtom object forwards all uncaught method calls to the selected Atom object, by default the one that represents the atom with the highest occupancy. The user can of course change the selected Atom object, making use of its altloc specifier. In this way atom disorder is represented correctly without much additional complexity. In other words, if you are not interested in atom disorder, you will not be bothered by it.

Each disordered atom has a characteristic altloc identifier. You can specify that a DisorderedAtom object should behave like the Atom object associated with a specific altloc identifier:

>>> atom.disordered_select('A') # select altloc A atom
>>> print(atom.get_altloc())
"A"
>>> atom.disordered_select('B') # select altloc B atom
>>> print(atom.get_altloc())
"B"

11.3.3 Disordered residues

Common case

The most common case is a residue that contains one or more disordered atoms. This is evidently solved by using DisorderedAtom objects to represent the disordered atoms, and storing the DisorderedAtom object in a Residue object just like ordinary Atom objects. The DisorderedAtom will behave exactly like an ordinary atom (in fact the atom with the highest occupancy) by forwarding all uncaught method calls to one of the Atom objects (the selected Atom object) it contains.
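
The forwarding idea can be sketched with Python's __getattr__ hook, which is only called when normal attribute lookup fails. SimpleAtom and DisorderedWrapper are hypothetical classes written for this illustration; they are not the Bio.PDB classes and leave out everything except selection and forwarding.

```python
class SimpleAtom:
    """Minimal stand-in for an atom with an altloc and an occupancy."""

    def __init__(self, altloc, occupancy):
        self.altloc = altloc
        self.occupancy = occupancy

    def get_altloc(self):
        return self.altloc

    def get_occupancy(self):
        return self.occupancy


class DisorderedWrapper:
    """Holds alternative atoms and forwards unknown calls to the selected one."""

    def __init__(self):
        self.children = {}
        self.selected = None

    def add(self, atom):
        self.children[atom.altloc] = atom
        # By default the atom with the highest occupancy is selected.
        if self.selected is None or atom.occupancy > self.selected.occupancy:
            self.selected = atom

    def disordered_select(self, altloc):
        self.selected = self.children[altloc]

    def __getattr__(self, name):
        # Reached only for attributes the wrapper itself does not define.
        return getattr(self.selected, name)


wrapper = DisorderedWrapper()
wrapper.add(SimpleAtom("A", 0.4))
wrapper.add(SimpleAtom("B", 0.6))
default_altloc = wrapper.get_altloc()
wrapper.disordered_select("A")
chosen_altloc = wrapper.get_altloc()
print(default_altloc, chosen_altloc)
```

Code that only calls atom methods never needs to know whether it holds a plain atom or the wrapper, which is the point of the design.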

Point mutations

A special case arises when disorder is due to a point mutation, i.e. when two or more point mutants of a polypeptide are present in the crystal. An example of this can be found in PDB structure 1EN2.

Since these residues belong to a different residue type (e.g. let's say Ser 60 and Cys 60) they should not be stored in a single Residue object as in the common case. In this case, each residue is represented by one Residue object, and both Residue objects are stored in a single DisorderedResidue object (see Fig. 11.1).

The DisorderedResidue object forwards all uncaught methods to the selected Residue object (by default the last Residue object added), and thus behaves like an ordinary residue. Each Residue object in a DisorderedResidue object can be uniquely identified by its residue name. In the above example, residue Ser 60 would have id "SER" in the DisorderedResidue object, while residue Cys 60 would have id "CYS". The user can select the active Residue object in a DisorderedResidue object via this id.

Example: suppose that a chain has a point mutation at position 10, consisting of a Ser and a Cys residue. Make sure that residue 10 of this chain behaves as the Cys residue.

>>> residue = chain[10]
>>> residue.disordered_select('CYS')

In addition, you can get a list of all Atom objects (i.e. all DisorderedAtom objects are "unpacked" to their individual Atom objects) using the get_unpacked_list method of a (Disordered)Residue object.

11.4 Hetero residues
11.4.1 Associated problems

A common problem with hetero residues is that several hetero and non-hetero residues present in the same chain share the same sequence identifier (and insertion code). Therefore, to generate a unique id for each hetero residue, waters and other hetero residues are treated in a different way.

Remember that Residue objects have the tuple (hetfield, resseq, icode) as id. The hetfield is blank (" ") for amino and nucleic acids, and a string for waters and other hetero residues. The content of the hetfield is explained below.

11.4.2 Water residues
The hetfield string of a water residue consists of the letter "W". So a typical residue id for a water is ("W", 1, " ").

11.4.3 Other hetero residues

The hetfield string for other hetero residues starts with "H_" followed by the residue name. A glucose molecule e.g. with residue name "GLC" would have hetfield "H_GLC". Its residue id could e.g. be ("H_GLC", 1, " ").


11.5 Navigating through a Structure object
Parse a PDB file, and extract some Model, Chain, Residue and Atom objects

>>> from Bio.PDB.PDBParser import PDBParser
>>> parser = PDBParser()
>>> structure = parser.get_structure("test", "1fat.pdb")
>>> model = structure[0]
>>> chain = model["A"]
>>> residue = chain[1]
>>> atom = residue["CA"]

Iterating through all atoms of a structure

>>> p = PDBParser()
>>> structure = p.get_structure('X', 'pdb1fat.ent')
>>> for model in structure:
...     for chain in model:
...         for residue in chain:
...             for atom in residue:
...                 print(atom)
...

There is a shortcut if you want to iterate over all atoms in a structure:

>>> atoms = structure.get_atoms()
>>> for atom in atoms:
...     print(atom)
...

Similarly, to iterate over all atoms in a chain, use

>>> atoms = chain.get_atoms()
>>> for atom in atoms:
...     print(atom)
...

Iterating over all residues of a model

To iterate over all residues in a model, use

>>> residues = model.get_residues()
>>> for residue in residues:
...     print(residue)
...

You can also use the Selection.unfold_entities function to get all residues from a structure:

>>> from Bio.PDB import Selection
>>> res_list = Selection.unfold_entities(structure, 'R')

or to get all atoms from a chain:

>>> atom_list = Selection.unfold_entities(chain, 'A')

Obviously, A=atom, R=residue, C=chain, M=model, S=structure. You can use this to go up in the hierarchy, e.g. to get a list of (unique) Residue or Chain parents from a list of Atoms:

>>> residue_list = Selection.unfold_entities(atom_list, 'R')
>>> chain_list = Selection.unfold_entities(atom_list, 'C')

For more info, see the API documentation.

Extract a hetero residue from a chain (e.g. a glucose (GLC) moiety with resseq 10)

>>> residue_id = ("H_GLC", 10, " ")
>>> residue = chain[residue_id]

Print all hetero residues in chain

>>> for residue in chain.get_list():
...     residue_id = residue.get_id()
...     hetfield = residue_id[0]
...     if hetfield[0] == "H":
...         print(residue_id)
...

Print out the coordinates of all CA atoms in a structure with B factor greater than 50

>>> for model in structure.get_list():
...     for chain in model.get_list():
...         for residue in chain.get_list():
...             if residue.has_id("CA"):
...                 ca = residue["CA"]
...                 if ca.get_bfactor() > 50.0:
...                     print(ca.get_coord())
...

Print out all the residues that contain disordered atoms

>>> for model in structure.get_list():
...     for chain in model.get_list():
...         for residue in chain.get_list():
...             if residue.is_disordered():
...                 resseq = residue.get_id()[1]
...                 resname = residue.get_resname()
...                 model_id = model.get_id()
...                 chain_id = chain.get_id()
...                 print(model_id, chain_id, resname, resseq)
...

Loop over all disordered atoms, and select all atoms with altloc A (if present)

This will make sure that the SMCRA data structure will behave as if only the atoms with altloc A are present.

>>> for model in structure.get_list():
...     for chain in model.get_list():
...         for residue in chain.get_list():
...             if residue.is_disordered():
...                 for atom in residue.get_list():
...                     if atom.is_disordered():
...                         if atom.disordered_has_id("A"):
...                             atom.disordered_select("A")
...

Extracting polypeptides from a Structure object

To extract polypeptides from a structure, construct a list of Polypeptide objects from a Structure object using PolypeptideBuilder as follows:

>>> from Bio.PDB.Polypeptide import PPBuilder, CaPPBuilder
>>> model_nr = 1
>>> polypeptide_list = PPBuilder().build_peptides(structure[model_nr])
>>> for polypeptide in polypeptide_list:
...     print(polypeptide)
...

A Polypeptide object is simply a UserList of Residue objects, and is always created from a single Model (in this case model 1). You can use the resulting Polypeptide object to get the sequence as a Seq object or to get a list of Cα atoms as well. Polypeptides can be built using a C-N or a Cα-Cα distance criterion.

Example:

# Using C-N
>>> ppb = PPBuilder()
>>> for pp in ppb.build_peptides(structure):
...     print(pp.get_sequence())
...
# Using CA-CA
>>> ppb = CaPPBuilder()
>>> for pp in ppb.build_peptides(structure):
...     print(pp.get_sequence())
...

Note that in the above case only model 0 of the structure is considered by PolypeptideBuilder. However, it is possible to use PolypeptideBuilder to build Polypeptide objects from Model and Chain objects as well.

Obtaining the sequence of a structure

The first thing to do is to extract all polypeptides from the structure (as above). The sequence of each polypeptide can then easily be obtained from the Polypeptide objects. The sequence is represented as a Biopython Seq object, and its alphabet is defined by a ProteinAlphabet object.

Example:

>>> seq = polypeptide.get_sequence()
>>> print(seq)
Seq('SNVVE...', <class Bio.Alphabet.ProteinAlphabet>)

11.6 Analyzing structures
11.6.1 Measuring distances
The minus operator for atoms has been overloaded to return the distance between two atoms.

# Get some atoms
>>> ca1 = residue1['CA']
>>> ca2 = residue2['CA']
# Simply subtract the atoms to get their distance
>>> distance = ca1 - ca2

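Operator overloading of this kind takes only a few lines in Python. The class below is a hypothetical illustration of the idea (a point whose subtraction yields a Euclidean distance); it is not the Bio.PDB Atom class.

```python
from math import sqrt


class Point:
    """A 3D point whose minus operator returns a distance, not a vector."""

    def __init__(self, x, y, z):
        self.coord = (x, y, z)

    def __sub__(self, other):
        # Euclidean distance between the two coordinate triples.
        return sqrt(sum((a - b) ** 2 for a, b in zip(self.coord, other.coord)))


p1 = Point(0.0, 0.0, 0.0)
p2 = Point(3.0, 4.0, 0.0)
distance = p1 - p2
print(distance)
```
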
11.6.2 Measuring angles

Use the vector representation of the atomic coordinates, and the calc_angle function from the Vector module:

>>> vector1 = atom1.get_vector()
>>> vector2 = atom2.get_vector()
>>> vector3 = atom3.get_vector()
>>> angle = calc_angle(vector1, vector2, vector3)

11.6.3 Measuring torsion angles

Use the vector representation of the atomic coordinates, and the calc_dihedral function from the Vector module:

>>> vector1 = atom1.get_vector()
>>> vector2 = atom2.get_vector()
>>> vector3 = atom3.get_vector()
>>> vector4 = atom4.get_vector()
>>> angle = calc_dihedral(vector1, vector2, vector3, vector4)
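For readers who want to see the geometry behind these two measurements, the following is a plain-Python sketch of a bond angle and a torsion angle computed from four points. It is not the code of the Vector module; the function names bond_angle and dihedral are made up here, both return radians, and the sign convention of the torsion is simply whatever the atan2 formula below yields.

```python
import math


def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def bond_angle(p1, p2, p3):
    """Angle at p2 between the directions p2->p1 and p2->p3."""
    u, v = _sub(p1, p2), _sub(p3, p2)
    cosine = _dot(u, v) / math.sqrt(_dot(u, u) * _dot(v, v))
    return math.acos(max(-1.0, min(1.0, cosine)))  # clamp rounding noise


def dihedral(p1, p2, p3, p4):
    """Torsion angle around the p2-p3 axis."""
    b1, b2, b3 = _sub(p2, p1), _sub(p3, p2), _sub(p4, p3)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)  # normals of the two planes
    y = _dot(b2, _cross(n1, n2)) / math.sqrt(_dot(b2, b2))
    x = _dot(n1, n2)
    return math.atan2(y, x)


right = bond_angle((1, 0, 0), (0, 0, 0), (0, 1, 0))
cis = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1))
trans = dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1))
print(math.degrees(right), math.degrees(cis), math.degrees(trans))
```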

11.6.4 Determining atom-atom contacts

Use NeighborSearch to perform neighbor lookup. The neighbor lookup is done using a KD tree module written in C (see Bio.KDTree), making it very fast. It also includes a fast method to find all point pairs within a certain distance of each other.
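To clarify what "all point pairs within a certain distance" means, here is a brute-force reference in plain Python. It is only a sketch of the question a KD tree answers more quickly, not the NeighborSearch implementation; the function name pairs_within and the sample points are invented for the example.

```python
from itertools import combinations


def pairs_within(points, radius):
    """Return index pairs (i, j), i < j, whose points lie within `radius`."""
    limit = radius * radius  # compare squared distances, no sqrt needed
    hits = []
    for (i, p), (j, q) in combinations(enumerate(points), 2):
        if sum((a - b) ** 2 for a, b in zip(p, q)) <= limit:
            hits.append((i, j))
    return hits


points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0), (5.5, 0.0, 0.0)]
close_pairs = pairs_within(points, 1.5)
print(close_pairs)
```

This scan checks every pair, so its cost grows quadratically with the number of atoms, which is why a spatial index is worth having for whole structures.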

11.6.5 Superimposing two structures

Use a Superimposer object to superimpose two coordinate sets. This object calculates the rotation and translation matrix that rotates two lists of atoms on top of each other in such a way that their RMSD is minimized. Of course, the two lists need to contain the same number of atoms. The Superimposer object can also apply the rotation/translation to a list of atoms. The rotation and translation are stored as a tuple in the rotran attribute of the Superimposer object (note that the rotation is right multiplying!). The RMSD is stored in the rms attribute.

The algorithm used by Superimposer comes from [17, Golub & Van Loan] and makes use of singular value decomposition (this is implemented in the general Bio.SVDSuperimposer module).

Example:

>>> sup = Superimposer()
# Specify the atom lists
# 'fixed' and 'moving' are lists of Atom objects
# The moving atoms will be put on the fixed atoms
>>> sup.set_atoms(fixed, moving)
# Print rotation/translation/rmsd
>>> print(sup.rotran)
>>> print(sup.rms)
# Apply rotation/translation to the moving atoms
>>> sup.apply(moving)

To superimpose two structures based on their active sites, use the active site atoms to calculate the rotation/translation matrices (as above), and apply these to the whole molecule.
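The remark that the rotation is right multiplying can be spelled out with a small plain-Python sketch: each coordinate is treated as a row vector, multiplied by the rotation matrix on its right, and then shifted by the translation. The helper apply_rotran and the example matrix are assumptions made for this illustration; they are not part of Bio.PDB.

```python
def apply_rotran(coords, rot, tran):
    """Return coord . rot + tran for each (x, y, z) row vector in coords."""
    moved = []
    for x, y, z in coords:
        moved.append(tuple(
            x * rot[0][k] + y * rot[1][k] + z * rot[2][k] + tran[k]
            for k in range(3)
        ))
    return moved


# A quarter turn about the z axis, written for right multiplication
rot = [[0.0, 1.0, 0.0],
       [-1.0, 0.0, 0.0],
       [0.0, 0.0, 1.0]]
tran = (1.0, 1.0, 1.0)

moved = apply_rotran([(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)], rot, tran)
print(moved)
```

Multiplying on the left instead would apply the transposed (that is, inverse) rotation, which is the mistake the warning is meant to prevent.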

11.6.6 Mapping the residues of two related structures onto each other

First, create an alignment file in FASTA format, then use the StructureAlignment class. This class can also be used for alignments with more than two structures.

11.6.7 Calculating the Half Sphere Exposure
Half Sphere Exposure (HSE) is a new, 2D measure of solvent exposure [20]. Basically, it counts the number of Cα atoms around a residue in the direction of its side chain, and in the opposite direction (within a radius of 13 Å). Despite its simplicity, it outperforms many other measures of solvent exposure.

HSE comes in two flavors: HSEα and HSEβ. The former only uses the Cα atom positions, while the latter uses the Cα and Cβ atom positions. The HSE measure is calculated by the HSExposure class, which can also calculate the contact number. The latter class has methods which return dictionaries that map a Residue object to its corresponding HSEα, HSEβ and contact number values.

Example:

>>> model = structure[0]
>>> hse = HSExposure()
# Calculate HSEalpha
>>> exp_ca = hse.calc_hs_exposure(model, option='CA3')
# Calculate HSEbeta
>>> exp_cb = hse.calc_hs_exposure(model, option='CB')
# Calculate classical coordination number
>>> exp_fs = hse.calc_fs_exposure(model)
# Print HSEalpha for a residue
>>> print(exp_ca[some_residue])
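The counting idea behind HSE can be sketched in a few lines of plain Python. This is a simplified illustration, not the HSExposure code: the function half_sphere_counts is hypothetical, the side-chain direction is passed in directly rather than derived from Cα or Cβ positions, and points lying exactly on the dividing plane are arbitrarily counted as "up".

```python
def half_sphere_counts(center, direction, neighbours, radius=13.0):
    """Count neighbours inside `radius`, split by the plane normal to `direction`."""
    up = down = 0
    for point in neighbours:
        offset = [p - c for p, c in zip(point, center)]
        if sum(d * d for d in offset) > radius * radius:
            continue  # outside the sphere, ignored
        # A non-negative projection means the side-chain half of the sphere
        if sum(d * o for d, o in zip(direction, offset)) >= 0:
            up += 1
        else:
            down += 1
    return up, down


counts = half_sphere_counts(
    center=(0.0, 0.0, 0.0),
    direction=(0.0, 0.0, 1.0),
    neighbours=[(0.0, 0.0, 5.0), (0.0, 0.0, -5.0), (3.0, 0.0, 1.0), (0.0, 0.0, 20.0)],
)
print(counts)
```

The sum of the two counts is the classical contact number for the same radius, which is why one class can report both.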

11.6.8 Determining the secondary structure

For this functionality, you need to install DSSP (and obtain a license for it, which is free for academic use; see http://www.cmbi.kun.nl/gv/dssp/). Then use the DSSP class, which maps Residue objects to their secondary structure (and accessible surface area). The DSSP codes are listed in Table 11.1. Note that DSSP (the program, and thus by consequence the class) cannot handle multiple models!

Code  Secondary structure
H     α-helix
B     Isolated bridge residue
E     Strand
G     3-10 helix
I     π-helix
T     Turn
S     Bend
-     Other


Table 11.1: DSSP codes in Bio.PDB.

The DSSP class can also be used to calculate the accessible surface area of a residue. But see also section 11.6.9.

11.6.9 Calculating the residue depth
Residue depth is the average distance of a residue's atoms from the solvent accessible surface. It's a fairly new and very powerful parameterization of solvent accessibility. For this functionality, you need to install Michel Sanner's MSMS program (http://www.scripps.edu/pub/olsonweb/people/sanner/html/msms_home.html). Then use the ResidueDepth class. This class behaves as a dictionary which maps Residue objects to corresponding (residue depth, Cα depth) tuples. The Cα depth is the distance of a residue's Cα atom to the solvent accessible surface.

Example:

>>> model = structure[0]
>>> rd = ResidueDepth(model, pdb_file)
>>> residue_depth, ca_depth = rd[some_residue]

You can also get access to the molecular surface itself (via the get_surface function), in the form of a Numeric Python array with the surface points.

11.7 Common problems in PDB files
It is well known that many PDB files contain semantic errors (not the structures themselves, but their representation in PDB files). Bio.PDB tries to handle this in two ways. The PDBParser object can behave in two ways: a restrictive way and a permissive way, which is the default.

Example:

# Permissive parser
>>> parser = PDBParser(PERMISSIVE=1)
>>> parser = PDBParser()  # The same (default)
# Strict parser
>>> strict_parser = PDBParser(PERMISSIVE=0)

In the permissive state (DEFAULT), PDB files that obviously contain errors are "corrected" (i.e. some residues or atoms are left out). These errors include:

Multiple residues with the same identifier
Multiple atoms with the same identifier (taking into account the altloc identifier)

These errors indicate real problems in the PDB file (for details see [18, Hamelryck and Manderick, 2003]). In the restrictive state, PDB files with errors cause an exception to occur. This is useful to find errors in PDB files.

Some errors however are automatically corrected. Normally each disordered atom should have a non-blank altloc identifier. However, there are many structures that do not follow this convention, and have a blank and a non-blank identifier for two disordered positions of the same atom. This is automatically interpreted in the right way.

Sometimes a structure contains a list of residues belonging to chain A, followed by residues belonging to chain B, and again followed by residues belonging to chain A, i.e. the chains are "broken". This is also correctly interpreted.

11.7.1 Examples

The PDBParser/Structure class was tested on about 800 structures (each belonging to a unique SCOP superfamily). This takes about 20 minutes, or on average 1.5 seconds per structure. Parsing the structure of the large ribosomal subunit (1FFK), which contains about 64000 atoms, takes 10 seconds on a 1000 MHz PC.

Three exceptions were generated in cases where an unambiguous data structure could not be built. In all three cases, the likely cause is an error in the PDB file that should be corrected. Generating an exception in these cases is much better than running the chance of incorrectly describing the structure in a data structure.

11.7.1.1 Duplicate residues

One structure contains two amino acid residues in one chain with the same sequence identifier (resseq 3) and icode. Upon inspection it was found that this chain contains the residues Thr A3, …, Gly A202, Leu A3, Glu A204. Clearly, Leu A3 should be Leu A203. A couple of similar situations exist for structure 1FFK (which e.g. contains Gly B64, Met B65, Glu B65, Thr B67, i.e. residue Glu B65 should be Glu B66).

11.7.1.2 Duplicate atoms

Structure 1EJG contains a Ser/Pro point mutation in chain A at position 22. In turn, Ser 22 contains some disordered atoms. As expected, all atoms belonging to Ser 22 have a non-blank altloc specifier (B or C). All atoms of Pro 22 have altloc A, except the N atom which has a blank altloc. This generates an exception, because all atoms belonging to two residues at a point mutation should have non-blank altloc. It turns out that this atom is probably shared by Ser and Pro 22, as Ser 22 misses the N atom. Again, this points to a problem in the file: the N atom should be present in both the Ser and the Pro residue, in both cases associated with a suitable altloc identifier.

11.7.2 Automatic correction
Some errors are quite common and can be easily corrected without much risk of making a wrong interpretation. These cases are listed below.

11.7.2.1 A blank altloc for a disordered atom

Normally each disordered atom should have a non-blank altloc identifier. However, there are many structures that do not follow this convention, and have a blank and a non-blank identifier for two disordered positions of the same atom. This is automatically interpreted in the right way.

11.7.2.2 Broken chains

Sometimes a structure contains a list of residues belonging to chain A, followed by residues belonging to chain B, and again followed by residues belonging to chain A, i.e. the chains are "broken". This is correctly interpreted.

11.7.3 Fatal errors
Sometimes a PDB file cannot be unambiguously interpreted. Rather than guessing and risking a mistake, an exception is generated, and the user is expected to correct the PDB file. These cases are listed below.

11.7.3.1 Duplicate residues

All residues in a chain should have a unique id. This id is generated based on:

The sequence identifier (resseq).
The insertion code (icode).
The hetfield string ("W" for waters and "H_" followed by the residue name for other hetero residues)
The residue names of the residues in the case of point mutations (to store the Residue objects in a DisorderedResidue object).

If this does not lead to a unique id something is quite likely wrong, and an exception is generated.
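The uniqueness rule above can be illustrated with a short plain-Python check. This sketch is not the parser's code: find_duplicate_residue_ids is a hypothetical helper, the input is a hand-written list of (residue name, hetfield, resseq, icode) tuples modelled on the Thr A3 / Leu A3 example from section 11.7.1.1, and the point-mutation case is deliberately ignored.

```python
def find_duplicate_residue_ids(records):
    """Return (id, first_name, clashing_name) for every repeated residue id."""
    seen = {}
    duplicates = []
    for resname, hetfield, resseq, icode in records:
        res_id = (hetfield, resseq, icode)  # the id described in the text
        if res_id in seen:
            duplicates.append((res_id, seen[res_id], resname))
        else:
            seen[res_id] = resname
    return duplicates


chain_a = [
    ("THR", " ", 3, " "),
    ("GLY", " ", 202, " "),
    ("LEU", " ", 3, " "),    # should have been 203
    ("GLU", " ", 204, " "),
]
clashes = find_duplicate_residue_ids(chain_a)
print(clashes)
```

A strict parser would raise an exception at the first clash rather than collect them; collecting is used here only to make the output easy to inspect.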

11.7.3.2 Duplicate atoms

All atoms in a residue should have a unique id. This id is generated based on:

The atom name (without spaces, or with spaces if a problem arises).
The altloc specifier.

If this does not lead to a unique id something is quite likely wrong, and an exception is generated.

11.8 Accessing the Protein Data Bank
11.8.1 Downloading structures from the Protein Data Bank

Structures can be downloaded from the PDB (Protein Data Bank) by using the retrieve_pdb_file method on a PDBList object. The argument for this method is the PDB identifier of the structure.

>>> pdbl = PDBList()
>>> pdbl.retrieve_pdb_file('1FAT')

The PDBList class can also be used as a command-line tool:

python PDBList.py 1fat

The downloaded file will be called pdb1fat.ent and stored in the current working directory. Note that the retrieve_pdb_file method also has an optional argument pdir that specifies a specific directory in which to store the downloaded PDB files.

The retrieve_pdb_file method also has some options to specify the compression format used for the download, and the program used for local decompression (default .Z format and gunzip). In addition, the PDB ftp site can be specified upon creation of the PDBList object. By default, the server of the Worldwide Protein Data Bank (ftp://ftp.wwpdb.org/pub/pdb/data/structures/divided/pdb/) is used. See the API documentation for more details. Thanks again to Kristian Rother for donating this module.

11.8.2 Downloading the entire PDB

The following commands will store all PDB files in the /data/pdb directory:

python PDBList.py all /data/pdb

python PDBList.py all /data/pdb -d

The API method for this is called download_entire_pdb. Adding the -d option will store all files in the same directory. Otherwise, they are sorted into PDB-style subdirectories according to their PDB IDs. Depending on the traffic, a complete download will take 2-4 days.
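As a rough illustration of the two storage layouts, the sketch below builds the expected local file name for a PDB ID. It assumes the usual "divided" convention of the wwPDB archive, where the subdirectory is named after the two middle characters of the ID; the helper local_pdb_path is hypothetical and is not a PDBList method.

```python
import posixpath


def local_pdb_path(root, pdb_id, flat=False):
    """Guess where an entry lives under `root` (assumed wwPDB-style layout)."""
    code = pdb_id.lower()
    filename = "pdb%s.ent" % code
    if flat:
        # Everything in one directory, as with the -d option
        return posixpath.join(root, filename)
    # Divided layout: subdirectory named after the two middle characters
    return posixpath.join(root, code[1:3], filename)


print(local_pdb_path("/data/pdb", "1FAT"))
print(local_pdb_path("/data/pdb", "1FAT", flat=True))
```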

11.8.3 Keeping a local copy of the PDB up to date

This can also be done using the PDBList object. One simply creates a PDBList object (specifying the directory where the local copy of the PDB is present) and calls the update_pdb method:

>>> pl = PDBList(pdb='/data/pdb')
>>> pl.update_pdb()

One can of course make a weekly cronjob out of this to keep the local copy automatically up-to-date. The PDB ftp site can also be specified (see API documentation).

PDBList has some additional methods that can be of use. The get_all_obsolete method can be used to get a list of all obsolete PDB entries. The changed_this_week method can be used to obtain the entries that were added, modified or obsoleted during the current week. For more info on the possibilities of PDBList, see the API documentation.

11.9 General questions
11.9.1 How well tested is Bio.PDB?

Pretty well, actually. Bio.PDB has been extensively tested on nearly 5500 structures from the PDB; all structures seemed to be parsed correctly. More details can be found in the Bio.PDB Bioinformatics article. Bio.PDB has been used/is being used in many research projects as a reliable tool. In fact, I'm using Bio.PDB almost daily for research purposes and continue working on improving it and adding new features.

11.9.2 How fast is it?
The PDBParser performance was tested on about 800 structures (each belonging to a unique SCOP superfamily). This takes about 20 minutes, or on average 1.5 seconds per structure. Parsing the structure of the large ribosomal subunit (1FFK), which contains about 64000 atoms, takes 10 seconds on a 1000 MHz PC. In short: it's more than fast enough for many applications.

11.9.3 Is there support for molecular graphics?

Not directly, mostly since there are quite a few Python based/Python aware solutions already, that can potentially be used with Bio.PDB. My choice is Pymol, BTW (I've used this successfully with Bio.PDB, and there will probably be specific PyMol modules in Bio.PDB soon/some day). Python based/aware molecular graphics solutions include:

PyMol: http://pymol.sourceforge.net/
Chimera: http://www.cgl.ucsf.edu/chimera/
PMV: http://www.scripps.edu/~sanner/python/
Coot: http://www.ysbl.york.ac.uk/~emsley/coot/
CCP4mg: http://www.ysbl.york.ac.uk/~lizp/molgraphics.html
mmLib: http://pymmlib.sourceforge.net/
VMD: http://www.ks.uiuc.edu/Research/vmd/
MMTK: http://starship.python.net/crew/hinsen/MMTK/

11.9.4 Who's using Bio.PDB?
Bio.PDB was used in the construction of DISEMBL, a web server that predicts disordered regions in proteins (http://dis.embl.de/), and COLUMBA, a website that provides annotated protein structures (http://www.columbadb.de/). Bio.PDB has also been used to perform a large scale search for active sites similarities between protein structures in the PDB [19, Hamelryck, 2003], and to develop a new algorithm that identifies linear secondary structure elements [26, Majumdar et al., 2005].

Judging from requests for features and information, Bio.PDB is also used by several LPCs (Large Pharmaceutical Companies :-).

Chapter 12 Bio.PopGen: Population genetics
Bio.PopGen is a Biopython module supporting population genetics, available in Biopython 1.44 onwards.

The medium term objective for the module is to support widely used data formats, applications and databases. This module is currently under intense development and support for new features should appear at a rather fast pace. Unfortunately this might also entail some instability on the API, especially if you are using a development version. APIs that are made available on our official public releases should be much more stable.

12.1 GenePop
GenePop (http://genepop.curtin.edu.au/) is a popular population genetics software package supporting Hardy-Weinberg tests, linkage disequilibrium, population differentiation, basic statistics, Fst and migration estimates, among others. GenePop does not supply sequence based statistics as it doesn't handle sequence data. The GenePop file format is supported by a wide range of other population genetic software applications, thus making it a relevant format in the population genetics field.

Bio.PopGen provides a parser and generator of GenePop file format. Utilities to manipulate the content of a record are also provided. Here is an example on how to read a GenePop file (you can find example GenePop data files in the Tests/PopGen directory of Biopython):

from Bio.PopGen import GenePop

with open("example.gen") as handle:
    rec = GenePop.read(handle)

This will read a file called example.gen and parse it. If you do print(rec), the record will be output again, in GenePop format.

The most important information in rec will be the loci names and population information (but there is more; use help(GenePop.Record) to check the API documentation). Loci names can be found on rec.loci_list. Population information can be found on rec.populations. Populations is a list with one element per population. Each element is itself a list of individuals; each individual is a pair composed of the individual name and a list of alleles (2 per marker). Here is an example for rec.populations:

[
    [
        ('Ind1', [(1, 2), (3, 3), (200, 201)]),
        ('Ind2', [(2, None), (3, 3), (None, None)]),
    ],
    [
        ('Other1', [(1, 1), (4, 3), (200, 200)]),
    ]
]

So we have two populations, the first with two individuals, the second with only one. The first individual of the first population is called Ind1; allelic information for each of the 3 loci follows. Please note that for any locus, information might be missing (see as an example, Ind2 above).
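Because rec.populations is built from ordinary lists and tuples, it can be explored with plain Python. The sketch below works on a literal copy of the example above rather than on a parsed record, and the locus names in loci_list are placeholders invented for the illustration; it reports which loci have missing allele information for each individual.

```python
# Literal copy of the example structure shown above
populations = [
    [
        ("Ind1", [(1, 2), (3, 3), (200, 201)]),
        ("Ind2", [(2, None), (3, 3), (None, None)]),
    ],
    [
        ("Other1", [(1, 1), (4, 3), (200, 200)]),
    ],
]
loci_list = ["Locus1", "Locus2", "Locus3"]  # placeholder names

missing_by_individual = {}
for population in populations:
    for name, genotypes in population:
        # A locus counts as incomplete if either allele is None
        missing_by_individual[name] = [
            loci_list[i] for i, alleles in enumerate(genotypes) if None in alleles
        ]

population_sizes = [len(population) for population in populations]
print(population_sizes)
print(missing_by_individual)
```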

A few utility functions to manipulate GenePop records are made available, here is an example:

from Bio.PopGen import GenePop

# Imagine that you have loaded rec, as per the code snippet above...

rec.remove_population(pos)
# Removes a population from a record, pos is the population position in
# rec.populations, remember that it starts on position 0.
# rec is altered.

rec.remove_locus_by_position(pos)
# Removes a locus by its position, pos is the locus position in
# rec.loci_list, remember that it starts on position 0.
# rec is altered.

rec.remove_locus_by_name(name)
# Removes a locus by its name, name is the locus name as in
# rec.loci_list. If the name doesn't exist the function fails
# silently.
# rec is altered.

rec_loci = rec.split_in_loci()
# Splits a record in loci, that is, for each locus, it creates a new
# record, with a single locus and all populations.
# The result is returned in a dictionary, each key being the locus name.
# The value is the GenePop record.
# rec is not altered.

rec_pops = rec.split_in_pops(pop_names)
# Splits a record in populations, that is, for each population, it creates
# a new record, with a single population and all loci.
# The result is returned in a dictionary, each key being
# the population name. As population names are not available in GenePop,
# they are passed in an array (pop_names).
# The value of each dictionary entry is the GenePop record.
# rec is not altered.

GenePop does not support population names, a limitation which can be cumbersome at times. Functionality to enable population names is currently being planned for Biopython. These extensions won't break compatibility in any way with the standard format. In the medium term, we would also like to support the GenePop web service.

Chapter 13 Phylogenetics with Bio.Phylo
The Bio.Phylo module was introduced in Biopython 1.54. Following the lead of SeqIO and AlignIO, it aims to provide a common way to work with phylogenetic trees independently of the source data format, as well as a consistent API for I/O operations.

Bio.Phylo is described in an open-access journal article [9, Talevich et al., 2012], which you might also find helpful.

13.1 Demo: What's in a Tree?
To get acquainted with the module, let's start with a tree that we've already constructed, and inspect it a few different ways. Then we'll colorize the branches, to use a special phyloXML feature, and finally save it.

Create a simple Newick file named simple.dnd using your favorite text editor, or use simple.dnd provided with the Biopython source code:

(((A,B),(C,D)),(E,F,G));

This tree has no branch lengths, only a topology and labelled terminals. (If you have a real tree file available, you can follow this demo using that instead.)
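To see how little structure such a file carries, here is a toy recursive-descent reader that turns a label-only Newick string into nested Python lists. It is a sketch for this one simple case, not the Bio.Phylo Newick parser: the function parse_simple_newick is hypothetical and it does not handle branch lengths, internal node labels, quoting or comments.

```python
def parse_simple_newick(text):
    """Parse a label-only Newick string into nested lists of leaf names."""
    text = text.strip().rstrip(";")
    pos = 0

    def parse_node():
        nonlocal pos
        if text[pos] == "(":
            pos += 1  # skip "("
            children = [parse_node()]
            while text[pos] == ",":
                pos += 1  # skip ","
                children.append(parse_node())
            pos += 1  # skip ")"
            return children
        # Otherwise read a leaf label up to the next delimiter
        start = pos
        while pos < len(text) and text[pos] not in ",()":
            pos += 1
        return text[start:pos]

    return parse_node()


def leaves(node):
    """Flatten the nested lists back into the ordered leaf names."""
    if isinstance(node, str):
        return [node]
    return [name for child in node for name in leaves(child)]


topology = parse_simple_newick("(((A,B),(C,D)),(E,F,G));")
print(topology)
print(leaves(topology))
```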

Launch the Python interpreter of your choice:
% ipython -pylab

For interactive work, launching the IPython interpreter with the -pylab flag enables matplotlib integration, so graphics will pop up automatically. We'll use that during this demo.

Now, within Python, read the tree file, giving the file name and the name of the format.

>>> from Bio import Phylo
>>> tree = Phylo.read("simple.dnd", "newick")

Printing the tree object as a string gives us a look at the entire object hierarchy.

>>> print(tree)
Tree(rooted=False, weight=1.0)
    Clade()
        Clade()
            Clade()
                Clade(name='A')
                Clade(name='B')
            Clade()
                Clade(name='C')
                Clade(name='D')
        Clade()
            Clade(name='E')
            Clade(name='F')
            Clade(name='G')

The Tree object contains global information about the tree, such as whether it's rooted or unrooted. It has one root clade, and under that, it's nested lists of clades all the way down to the tips.

The function draw_ascii creates a simple ASCII-art (plain text) dendrogram. This is a convenient visualization for interactive exploration, in case better graphical tools aren't available.

>>> from Bio import Phylo
>>> tree = Phylo.read("simple.dnd", "newick")
>>> Phylo.draw_ascii(tree)
                                                    ________________________ A
                           ________________________|
                          |                        |________________________ B
  ________________________|
 |                        |                         ________________________ C
 |                        |________________________|
_|                                                 |________________________ D
 |
 |                         ________________________ E
 |                        |
 |________________________|________________________ F
                          |
                          |________________________ G
<BLANKLINE>

If you have matplotlib or pylab installed, you can create a graphic using the draw function (see Fig. 13.1):

>>> tree.rooted = True
>>> Phylo.draw(tree)

13.1.1 Coloring branches within a tree
The functions draw and draw_graphviz support the display of different colors and branch widths in a tree. As of Biopython 1.59, the color and width attributes are available on the basic Clade object and there's nothing extra required to use them. Both attributes refer to the branch leading to the given clade, and apply recursively, so all descendent branches will also inherit the assigned width and color values during display.

In earlier versions of Biopython, these were special features of PhyloXML trees, and using the attributes required first converting the tree to a subclass of the basic tree object called Phylogeny, from the Bio.Phylo.PhyloXML module.

In Biopython 1.55 and later, this is a convenient tree method:

>>> tree = tree.as_phyloxml()

In Biopython 1.54, you can accomplish the same thing with one extra import:

>>> from Bio.Phylo.PhyloXML import Phylogeny
>>> tree = Phylogeny.from_tree(tree)

Note that the file formats Newick and Nexus don't support branch colors or widths, so if you use these attributes in Bio.Phylo, you will only be able to save the values in PhyloXML format. (You can still save a tree as Newick or Nexus, but the color and width values will be skipped in the output file.)

Now we can begin assigning colors. First, we'll color the root clade gray. We can do that by assigning the 24-bit color value as an RGB triple, an HTML-style hex string, or the name of one of the predefined colors.

>>> tree.root.color = (128, 128, 128)

Or:

>>> tree.root.color = "#808080"

Or:

>>> tree.root.color = "gray"

Colors for a clade are treated as cascading down through the entire clade, so when we colorize the root here, it turns the whole tree gray. We can override that by assigning a different color lower down on the tree.

Let's target the most recent common ancestor (MRCA) of the nodes named "E" and "F". The common_ancestor method returns a reference to that clade in the original tree, so when we color that clade "salmon", the color will show up in the original tree.

>>> mrca = tree.common_ancestor({"name": "E"}, {"name": "F"})
>>> mrca.color = "salmon"

If we happened to know exactly where a certain clade is in the tree, in terms of nested list entries, we can jump directly to that position in the tree by indexing it. Here, the index [0,1] refers to the second child of the first child of the root.

>>> tree.clade[0, 1].color = "blue"

Finally, show our work (see Fig. 13.1.1):

>>> Phylo.draw(tree)

Note that a clade's color includes the branch leading to that clade, as well as its descendents. The common ancestor of E and F turns out to be just under the root, and with this coloring we can see exactly where the root of the tree is.
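The cascading rule can be mimicked with a small plain-Python model: a node uses its own color if one is set, otherwise the color inherited from its nearest colored ancestor. The Node class and leaf_colors function below are hypothetical and exist only to illustrate the rule on the demo topology; they are not how Bio.Phylo stores or draws colors.

```python
class Node:
    """Hypothetical minimal clade: a name, child nodes and an optional color."""

    def __init__(self, name=None, children=(), color=None):
        self.name = name
        self.children = list(children)
        self.color = color


def leaf_colors(node, inherited=None):
    """Map each leaf name to the color it would be drawn with."""
    color = node.color if node.color is not None else inherited
    if not node.children:
        return {node.name: color}
    colors = {}
    for child in node.children:
        colors.update(leaf_colors(child, color))
    return colors


ab = Node(children=[Node("A"), Node("B")])
cd = Node(children=[Node("C"), Node("D")], color="blue")                 # index [0, 1]
efg = Node(children=[Node("E"), Node("F"), Node("G")], color="salmon")   # MRCA of E and F
root = Node(children=[Node(children=[ab, cd]), efg], color="gray")

colors = leaf_colors(root)
print(colors)
```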

My, we've accomplished a lot! Let's take a break here and save our work. Call the write function with a file name or handle (here we use standard output, to see what would be written) and the format phyloxml. PhyloXML saves the colors we assigned, so you can open this phyloXML file in another tree viewer like Archaeopteryx, and the colors will show up there, too.
>>> import sys
>>> Phylo.write(tree, sys.stdout, "phyloxml")

<phy:phyloxml xmlns:phy="http://www.phyloxml.org">
  <phy:phylogeny rooted="true">
    <phy:clade>
      <phy:branch_length>1.0</phy:branch_length>
      <phy:color>
        <phy:red>128</phy:red>
        <phy:green>128</phy:green>
        <phy:blue>128</phy:blue>
      </phy:color>
      <phy:clade>
        <phy:branch_length>1.0</phy:branch_length>
        <phy:clade>
          <phy:branch_length>1.0</phy:branch_length>
          <phy:clade>
            <phy:name>A</phy:name>
            ...

The rest of this chapter covers the core functionality of Bio.Phylo in greater detail. For more examples of using Bio.Phylo, see the cookbook page on Biopython.org:

http://biopython.org/wiki/Phylo_cookbook

13.2 I/O functions
Like SeqIO and AlignIO, Phylo handles file input and output through four functions: parse, read, write and convert, all of which support the tree file formats Newick, NEXUS, phyloXML and NeXML, as well as the Comparative Data Analysis Ontology (CDAO).

The read function parses a single tree in the given file and returns it. Careful: it will raise an error if the file contains more than one tree, or no trees.

>>> from Bio import Phylo
>>> tree = Phylo.read("Tests/Nexus/int_node_labels.nwk", "newick")
>>> print(tree)

(Example files are available in the Tests/Nexus/ and Tests/PhyloXML/ directories of the Biopython distribution.)

To handle multiple (or an unknown number of) trees, use the parse function, which iterates through each of the trees in the given file:

>>> trees = Phylo.parse("../../Tests/PhyloXML/phyloxml_examples.xml", "phyloxml")
>>> for tree in trees:
...     print(tree)

Write a tree or iterable of trees back to file with the write function:

>>> trees = list(Phylo.parse("../../Tests/PhyloXML/phyloxml_examples.xml", "phyloxml"))
>>> tree1 = trees[0]
>>> others = trees[1:]
>>> Phylo.write(tree1, "tree1.nwk", "newick")
1
>>> Phylo.write(others, "other_trees.nwk", "newick")
12

Convert files between any of the supported formats with the convert function:

>>> Phylo.convert("tree1.nwk", "newick", "tree1.xml", "nexml")
1
>>> Phylo.convert("other_trees.xml", "phyloxml", "other_trees.nex", "nexus")
12

To use strings as input or output instead of actual files, use StringIO as you would with SeqIO and AlignIO (on Python 3 it lives in the io module):

>>> from Bio import Phylo
>>> from io import StringIO
>>> handle = StringIO("(((A,B),(C,D)),(E,F,G));")
>>> tree = Phylo.read(handle, "newick")

13.3 View and export trees
The simplest way to get an overview of a Tree object is to print it:

>>> from Bio import Phylo
>>> tree = Phylo.read("PhyloXML/example.xml", "phyloxml")
>>> print(tree)
Phylogeny(description='phyloXML allows to use either a "branch_length" attribute...', name='example from Prof. Joe Felsenstein's book "Inferring Phyl...', rooted=
    Clade()
        Clade(branch_length=0.06)
            Clade(branch_length=0.102, name='A')
            Clade(branch_length=0.23, name='B')
        Clade(branch_length=0.4, name='C')

This is essentially an outline of the object hierarchy Biopython uses to represent a tree. But more likely, you'd want to see a drawing of the tree. There are three functions to do this.

As we saw in the demo, draw_ascii prints an ascii-art drawing of the tree (a rooted phylogram) to standard output, or an open file handle if given. Not all of the available information about the tree is shown, but it provides a way to quickly view the tree without relying on any external dependencies.
>>> tree = Phylo.read("example.xml", "phyloxml")
>>> Phylo.draw_ascii(tree)
             __________________ A
  __________|
_|          |___________________________________________ B
 |
 |___________________________________________________________________________ C

The draw function draws a more attractive image using the matplotlib library. See the API documentation for details on the arguments it accepts to customize the output.

>>> tree = Phylo.read("example.xml", "phyloxml")
>>> Phylo.draw(tree, branch_labels=lambda c: c.branch_length)


draw_graphviz draws an unrooted cladogram, but requires that you have Graphviz, PyDot or PyGraphviz, NetworkX, and matplotlib (or pylab) installed. Using the same example as above, and the dot program included with Graphviz, let's draw a rooted tree (see Fig. 13.3):

>>> tree = Phylo.read("example.xml", "phyloxml")
>>> Phylo.draw_graphviz(tree, prog='dot')
>>> import pylab
>>> pylab.show()  # Displays the tree in an interactive viewer
>>> pylab.savefig('phylo-dot.png')  # Creates a PNG file of the same graphic

(Tip: If you execute IPython with the -pylab option, calling draw_graphviz causes the matplotlib viewer to launch automatically without manually calling show().)

This exports the tree object to a NetworkX graph, uses Graphviz to lay out the nodes, and displays it using matplotlib. There are a number of keyword arguments that can modify the resulting diagram, including most of those accepted by the NetworkX functions networkx.draw and networkx.draw_graphviz.

The display is also affected by the rooted attribute of the given tree object. Rooted trees are shown with a "head" on each branch indicating direction (see Fig. 13.3):

>>> tree = Phylo.read("simple.dnd", "newick")
>>> tree.rooted = True
>>> Phylo.draw_graphviz(tree)


The prog argument specifies the Graphviz engine used for layout. The default, twopi, behaves well for any size tree, reliably avoiding crossed branches. The neato program may draw more attractive moderately-sized trees, but sometimes will cross branches (see Fig. 13.3). The dot program may be useful with small trees, but tends to do surprising things with the layout of larger trees.

>>> Phylo.draw_graphviz(tree, prog="neato")

This viewing mode is particularly handy for exploring larger trees, because the matplotlib viewer can zoom in on a selected region, thinning out a cluttered graphic.

>>> tree = Phylo.read("apaf.xml", "phyloxml")
>>> Phylo.draw_graphviz(tree, prog="neato", node_size=0)

Note that branch lengths are not displayed accurately, because Graphviz ignores them when creating the node layouts. The branch lengths are retained when exporting a tree as a NetworkX graph object (to_networkx), however.

See the Phylo page on the Biopython wiki (http://biopython.org/wiki/Phylo) for descriptions and examples of the more advanced functionality in draw_ascii, draw_graphviz and to_networkx.


13.4 Using Tree and Clade objects
The Tree objects produced by parse and read are containers for recursive sub-trees, attached to the Tree object at the root attribute (whether or not the phylogenetic tree is actually considered rooted). A Tree has globally applied information for the phylogeny, such as rootedness, and a reference to a single Clade; a Clade has node- and clade-specific information, such as branch length, and a list of its own descendent Clade instances, attached at the clades attribute.

So there is a distinction between tree and tree.root. In practice, though, you rarely need to worry about it. To smooth over the difference, both Tree and Clade inherit from TreeMixin, which contains the implementations for methods that would be commonly used to search, inspect or modify a tree or any of its clades. This means that almost all of the methods supported by tree are also available on tree.root and any clade below it. (Clade also has a root property, which returns the clade object itself.)

13.4.1 Search and traversal methods
For convenience, we provide a couple of simplified methods that return all external or internal nodes directly as a list:

get_terminals
makes a list of all of this tree's terminal (leaf) nodes.
get_nonterminals
makes a list of all of this tree's nonterminal (internal) nodes.

These both wrap a method with full control over tree traversal, find_clades. Two more traversal methods, find_elements and find_any, rely on the same core functionality and accept the same arguments, which we'll call a "target specification" for lack of a better description. These specify which objects in the tree will be matched and returned during iteration. The first argument can be any of the following types:

A TreeElement instance, which tree elements will match by identity, so searching with a Clade instance as the target will find that clade in the tree;
A string, which matches tree elements' string representation, in particular a clade's name (added in Biopython 1.56);
A class or type, where every tree element of the same type (or sub-type) will be matched;
A dictionary where keys are tree element attributes and values are matched to the corresponding attribute of each tree element. This one gets even more elaborate:
    If an int is given, it matches numerically equal attributes, e.g. 1 will match 1 or 1.0
    If a boolean is given (True or False), the corresponding attribute value is evaluated as a boolean and checked for the same
    None matches None

    If a string is given, the value is treated as a regular expression (which must match the whole string in the corresponding element attribute, not just a prefix). A given string without special regex characters will match string attributes exactly, so if you don't use regexes, don't worry about it. For example, in a tree with clade names Foo1, Foo2 and Foo3, tree.find_clades({"name": "Foo1"}) matches Foo1, {"name": "Foo.*"} matches all three clades, and {"name": "Foo"} doesn't match anything.

Since floating-point arithmetic can produce some strange behavior, we don't support matching floats directly. Instead, use the boolean True to match every element with a nonzero value in the specified attribute, then filter on that attribute manually with an inequality (or exact number, if you like living dangerously).

If the dictionary contains multiple entries, a matching element must match each of the given attribute values: think "and", not "or".

A function taking a single argument (it will be applied to each element in the tree), returning True or False. For convenience, LookupError, AttributeError and ValueError are silenced, so this provides another safe way to search for floating-point values in the tree, or some more complex characteristic.
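The dictionary rules above can be restated as a short plain-Python sketch. This is an illustration of the rules as described in the text, not the Bio.Phylo source: attribute_matches and element_matches are hypothetical helpers, and the "clades" are simple namespace objects invented for the example.

```python
import re
from types import SimpleNamespace


def attribute_matches(value, wanted):
    """Apply the int / boolean / None / regex rules to one attribute value."""
    if isinstance(wanted, bool):  # test bool first: bool is a subclass of int
        return bool(value) == wanted
    if wanted is None:
        return value is None
    if isinstance(wanted, int):
        return value == wanted  # 1 matches 1 or 1.0
    if isinstance(wanted, str):
        # The regular expression must match the whole string
        return isinstance(value, str) and re.fullmatch(wanted, value) is not None
    raise TypeError("unsupported target value: %r" % (wanted,))


def element_matches(element, spec):
    """Every key in the dictionary must match: 'and', not 'or'."""
    return all(
        hasattr(element, key) and attribute_matches(getattr(element, key), wanted)
        for key, wanted in spec.items()
    )


clades = [
    SimpleNamespace(name="Foo1", branch_length=1.0),
    SimpleNamespace(name="Foo2", branch_length=0.5),
    SimpleNamespace(name="Foo3", branch_length=None),
]


def find(spec):
    return [c.name for c in clades if element_matches(c, spec)]


print(find({"name": "Foo1"}))
print(find({"name": "Foo.*"}))
print(find({"name": "Foo"}))
print(find({"branch_length": True}))
```

Checking for bool before int matters because Python treats True and False as integers; without that ordering, True would be compared numerically with 1.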

After the target, there are two optional keyword arguments:

terminal
A boolean value to select for or against terminal clades (a.k.a. leaf nodes): True searches for only terminal clades, False for non-terminal (internal) clades, and the default, None, searches both terminal and non-terminal clades, as well as any tree elements lacking the is_terminal method.
order
Tree traversal order: "preorder" (default) is depth-first search, "postorder" is DFS with child nodes preceding parents, and "level" is breadth-first search.
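The three traversal orders are easiest to compare side by side on a tiny tree. The sketch below uses plain (name, children) tuples rather than Clade objects, and the generator names are chosen for this example only.

```python
from collections import deque

# A small tree as (name, children) tuples: root -> X -> (A, B), root -> C
tree = ("root", [("X", [("A", []), ("B", [])]), ("C", [])])


def preorder(node):
    """Depth-first, parent before children."""
    name, children = node
    yield name
    for child in children:
        yield from preorder(child)


def postorder(node):
    """Depth-first, children before parent."""
    name, children = node
    for child in children:
        yield from postorder(child)
    yield name


def level_order(node):
    """Breadth-first, one level at a time."""
    queue = deque([node])
    while queue:
        name, children = queue.popleft()
        yield name
        queue.extend(children)


print(list(preorder(tree)))
print(list(postorder(tree)))
print(list(level_order(tree)))
```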

Finally, the methods accept arbitrary keyword arguments which are treated the same way as a dictionary target specification: keys indicate the name of the element attribute to search for, and the argument value (string, integer, None or boolean) is compared to the value of each attribute found. If no keyword arguments are given, then any TreeElement types are matched. The code for this is generally shorter than passing a dictionary as the target specification: tree.find_clades({"name": "Foo1"}) can be shortened to tree.find_clades(name="Foo1").

(In Biopython 1.56 or later, this can be even shorter: tree.find_clades("Foo1"))

Now that we've mastered target specifications, here are the methods used to traverse a tree:

find_clades
Find each clade containing a matching element. That is, find each element as with find_elements, but return the corresponding clade object. (This is usually what you want.)

The result is an iterable through all matching objects, searching depth-first by default. This is not necessarily the same order as the elements appear in the Newick, Nexus or XML source file!
find_elements
Find all tree elements matching the given attributes, and return the matching elements themselves. Simple Newick trees don't have complex sub-elements, so this behaves the same as find_clades on them. PhyloXML trees often do have complex objects attached to clades, so this method is useful for extracting those.
find_any
Return the first element found by find_elements(), or None. This is also useful for checking whether any matching element exists in the tree, and can be used in a conditional.

Two more methods help with navigating between nodes in the tree:

get_path
List the clades directly between the tree root (or current clade) and the given target. Returns a list of all clade objects along this path, ending with the given target, but excluding the root clade.
trace
List all clade objects between two targets in this tree, excluding the start and including the finish.

13.4.2 Information methods
These methods provide information about the whole tree (or any clade).

common_ancestor
Find the most recent common ancestor of all the given targets. (This will be a Clade object). If no target is given, returns the root of the current clade (the one this method is called from); if 1 target is given, this returns the target itself. However, if any of the specified targets are not found in the current tree (or clade), an exception is raised.
count_terminals
Counts the number of terminal (leaf) nodes within the tree.
depths
Create a mapping of tree clades to depths. The result is a dictionary where the keys are all of the Clade instances in the tree, and the values are the distance from the root to each clade (including terminals). By default the distance is the cumulative branch length leading to the clade, but with the unit_branch_lengths=True option, only the number of branches (levels in the tree) is counted.
distance
Calculate the sum of the branch lengths between two targets. If only one target is specified, the other is the root of this tree.
total_branch_length
Calculate the sum of all the branch lengths in this tree. This is usually just called the "length" of the tree in phylogenetics, but we use a more explicit name to avoid confusion with Python terminology.
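The difference between cumulative branch lengths and unit branch lengths in depths, and the meaning of total_branch_length, can be sketched on a toy tree. This is a plain-Python illustration with hypothetical data and function bodies; it only mirrors the behaviour described above and is not how Bio.Phylo implements these methods.

```python
# Toy tree: (name, branch_length, children). Hypothetical data.
tree = ("root", 0.0, [
    ("X", 1.0, [("a", 0.5, []), ("b", 1.5, [])]),
    ("c", 2.0, []),
])


def depths(node, unit_branch_lengths=False, start=0.0):
    """Map each node name to its distance from the root."""
    name, _length, children = node
    result = {name: start}
    for child in children:
        # Either count one level per branch, or add the branch length.
        step = 1 if unit_branch_lengths else child[1]
        result.update(depths(child, unit_branch_lengths, start + step))
    return result


def total_branch_length(node):
    """Sum every branch length in the tree."""
    _name, length, children = node
    return length + sum(total_branch_length(child) for child in children)


by_length = depths(tree)
by_level = depths(tree, unit_branch_lengths=True)
print(by_length)                  # b is 1.0 + 1.5 = 2.5 from the root
print(by_level)                   # a and b are both 2 levels deep
print(total_branch_length(tree))  # 1.0 + 0.5 + 1.5 + 2.0 = 5.0
```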

The rest of these methods are boolean checks:

is_bifurcating
True if the tree is strictly bifurcating, i.e. all nodes have either 2 or 0 children (internal or external, respectively). The root may have 3 descendents and still be considered part of a bifurcating tree.
is_monophyletic
Test if all of the given targets comprise a complete subclade, i.e. there exists a clade such that its terminals are the same set as the given targets. The targets should be terminals of the tree. For convenience, this method returns the common ancestor (MRCA) of the targets if they are monophyletic (instead of the value True), and False otherwise.
is_parent_of
True if target is a descendent of this tree (not required to be a direct descendent). To check direct descendents of a clade, simply use list membership testing: if subclade in clade: ...
is_preterminal
True if all direct descendents are terminal; False if any direct descendent is not terminal.
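Two of these checks are simple enough to sketch on a toy tree. The code below is a plain-Python illustration of the definitions given above, using hypothetical nested tuples rather than Clade objects; treating a leaf as "not preterminal" is an assumption of this sketch.

```python
# Toy tree: (name, children). The root has three children, which the
# definition above still accepts for a bifurcating tree.
tree = ("root", [("X", [("a", []), ("b", [])]), ("c", []), ("d", [])])


def is_bifurcating(node, is_root=True):
    """True if every node has 0 or 2 children; the root may also have 3."""
    _name, children = node
    allowed = (0, 2, 3) if is_root else (0, 2)
    if len(children) not in allowed:
        return False
    return all(is_bifurcating(child, is_root=False) for child in children)


def is_preterminal(node):
    """True if the node has children and all of them are leaves."""
    _name, children = node
    return bool(children) and all(not grandchildren
                                  for _child, grandchildren in children)


subclade_x = tree[1][0]
print(is_bifurcating(tree))        # True
print(is_preterminal(tree))        # False: X is an internal node
print(is_preterminal(subclade_x))  # True: a and b are both leaves
```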

13.4.3 Modification methods

These methods modify the tree in-place. If you want to keep the original tree intact, make a complete copy of the tree first, using Python's copy module:

import copy
from Bio import Phylo

tree = Phylo.read('example.xml', 'phyloxml')
newtree = copy.deepcopy(tree)

collapse
Deletes the target from the tree, relinking its children to its parent.
collapse_all
Collapse all the descendents of this tree, leaving only terminals. Branch lengths are preserved, i.e. the distance to each terminal stays the same. With a target specification (see above), collapses only the internal nodes matching the specification.
ladderize
Sort clades in-place according to the number of terminal nodes. Deepest clades are placed last by default. Use reverse=True to sort clades deepest-to-shallowest.
prune
Prunes a terminal clade from the tree. If taxon is from a bifurcation, the connecting node will be collapsed and its branch length added to the remaining terminal node. This might no longer be a meaningful value.
root_with_outgroup
Reroot this tree with the outgroup clade containing the given targets, i.e. the common ancestor of the outgroup. This method is only available on Tree objects, not Clades.

If the outgroup is identical to self.root, no change occurs. If the outgroup clade is terminal (e.g. a single terminal node is given as the outgroup), a new bifurcating root clade is created with a 0-length branch to the given outgroup. Otherwise, the internal node at the base of the outgroup becomes a trifurcating root for the whole tree. If the original root was bifurcating, it is dropped from the tree.

In all cases, the total branch length of the tree stays the same.

root_at_midpoint
Reroot this tree at the calculated midpoint between the two most distant tips of the tree. (This uses root_with_outgroup under the hood.)
split
Generate n (default 2) new descendants. In a species tree, this is a speciation event. New clades have the given branch_length and the same name as this clade's root plus an integer suffix (counting from 0); for example, splitting a clade named "A" produces the sub-clades "A0" and "A1".

See the Phylo page on the Biopython wiki (http://biopython.org/wiki/Phylo) for more examples of using the available methods.

13.4.4 Features of PhyloXML trees

The phyloXML file format includes fields for annotating trees with additional data types and visual cues.

See the PhyloXML page on the Biopython wiki (http://biopython.org/wiki/PhyloXML) for descriptions and examples of using the additional annotation features provided by PhyloXML.

13.5 Running external applications
While Bio.Phylo doesn't infer trees from alignments itself, there are third-party programs available that do. These are supported through the module Bio.Phylo.Applications, using the same general framework as Bio.Emboss.Applications, Bio.Align.Applications and others.

Biopython 1.58 introduced a wrapper for PhyML (http://www.atgc-montpellier.fr/phyml/). The program accepts an input alignment in phylip-relaxed format (that's Phylip format, but without the 10-character limit on taxon names) and a variety of options. A quick example:

>>> from Bio import Phylo
>>> from Bio.Phylo.Applications import PhymlCommandline
>>> cmd = PhymlCommandline(input='Tests/Phylip/random.phy')
>>> out_log, err_log = cmd()

This generates a tree file and a stats file with the names [input filename]_phyml_tree.txt and [input filename]_phyml_stats.txt. The tree file is in Newick format:

>>> tree = Phylo.read('Tests/Phylip/random.phy_phyml_tree.txt', 'newick')
>>> Phylo.draw_ascii(tree)

A similar wrapper for RAxML (http://sco.h-its.org/exelixis/software.html) was added in Biopython 1.60, and FastTree (http://www.microbesonline.org/fasttree/) in Biopython 1.62.

Note that some popular Phylip programs, including dnaml and protml, are already available through the EMBOSS wrappers in Bio.Emboss.Applications if you have the Phylip extensions to EMBOSS installed on your system. See Section 6.4 for some examples and clues on how to use programs like these.

13.6 PAML integration
Biopython 1.58 brought support for PAML (http://abacus.gene.ucl.ac.uk/software/paml.html), a suite of programs for phylogenetic analysis by maximum likelihood. Currently the programs codeml, baseml and yn00 are implemented. Due to PAML's usage of control files rather than command line arguments to control runtime options, usage of this wrapper strays from the format of other application wrappers in Biopython.

A typical workflow would be to initialize a PAML object, specifying an alignment file, a tree file, an output file and a working directory. Next, runtime options are set via the set_options() method or by reading an existing control file. Finally, the program is run via the run() method and the output file is automatically parsed to a results dictionary.

Here is an example of typical usage of codeml:

>>> from Bio.Phylo.PAML import codeml
>>> cml = codeml.Codeml()
>>> cml.alignment = "Tests/PAML/alignment.phylip"
>>> cml.tree = "Tests/PAML/species.tree"
>>> cml.out_file = "results.out"
>>> cml.working_dir = "./scratch"
>>> cml.set_options(seqtype=1,
...                 verbose=0,
...                 noisy=0,
...                 RateAncestor=0,
...                 model=0,
...                 NSsites=[0, 1, 2],
...                 CodonFreq=2,
...                 cleandata=1,
...                 fix_alpha=1,
...                 kappa=4.54006)
>>> results = cml.run()
>>> ns_sites = results.get("NSsites")
>>> m0 = ns_sites.get(0)
>>> m0_params = m0.get("parameters")
>>> print(m0_params.get("omega"))

Existing output files may be parsed as well using a module's read() function:

>>> results = codeml.read("Tests/PAML/Results/codeml/codeml_NSsites_all.out")
>>> print(results.get("lnL max"))

Detailed documentation for this new module currently lives on the Biopython wiki: http://biopython.org/wiki/PAML

13.7 Future plans
Bio.Phylo is under active development. Here are some features we might add in future releases:

New methods
Generally useful functions for operating on Tree or Clade objects appear on the Biopython wiki first, so that casual users can test them and decide if they're useful before we add them to Bio.Phylo:

http://biopython.org/wiki/Phylo_cookbook

Bio.Nexus port
Much of this module was written during Google Summer of Code 2009, under the auspices of NESCent, as a project to implement Python support for the phyloXML data format (see 13.4.4). Support for Newick and Nexus formats was added by porting part of the existing Bio.Nexus module to the new classes used by Bio.Phylo.

Currently, Bio.Nexus contains some useful features that have not yet been ported to Bio.Phylo classes, notably calculating a consensus tree. If you find some functionality lacking in Bio.Phylo, try poking through Bio.Nexus to see if it's there instead.

We're open to any suggestions for improving the functionality and usability of this module; just let us know on the mailing list or our bug database.

Finally, if you need additional functionality not yet included in the Phylo module, check if it's available in another of the high-quality Python libraries for phylogenetics such as DendroPy (http://pythonhosted.org/DendroPy/) or PyCogent (http://pycogent.org/). Since these libraries also support standard file formats for phylogenetic trees, you can easily transfer data between libraries by writing to a temporary file or StringIO object.

Chapter 14 Sequence motif analysis using Bio.motifs
This chapter gives an overview of the functionality of the Bio.motifs package included in Biopython. It is intended for people who are involved in the analysis of sequence motifs, so I'll assume that you are familiar with basic notions of motif analysis. In case something is unclear, please look at Section 14.10 for some relevant links.

Most of this chapter describes the new Bio.motifs package included in Biopython 1.61 onwards, which is replacing the older Bio.Motif package introduced with Biopython 1.50, which was in turn based on two older former Biopython modules, Bio.AlignAce and Bio.MEME. It provides most of their functionality with a unified motif object implementation.

Speaking of other libraries, if you are reading this you might be interested in TAMO, another Python library designed to deal with sequence motifs. It supports more de-novo motif finders, but it is not a part of Biopython and has some restrictions on commercial use.

14.1 Motif objects
Since we are interested in motif analysis, we need to take a look at Motif objects in the first place. For that we need to import the Bio.motifs library:

>>> from Bio import motifs

and we can start creating our first motif objects. We can either create a Motif object from a list of instances of the motif, or we can obtain a Motif object by parsing a file from a motif database or motif finding software.

14.1.1 Creating a motif from instances

Suppose we have these instances of a DNA motif:

>>> from Bio.Seq import Seq
>>> instances = [Seq("TACAA"),
...              Seq("TACGC"),
...              Seq("TACAC"),
...              Seq("TACCC"),
...              Seq("AACCC"),
...              Seq("AATGC"),
...              Seq("AATGC"),
...             ]

then we can create a Motif object as follows:

>>> m = motifs.create(instances)

The instances are saved in an attribute m.instances, which is essentially a Python list with some added functionality, as described below. Printing out the Motif object shows the instances from which it was constructed:

>>> print(m)
TACAA
TACGC
TACAC
TACCC
AACCC
AATGC
AATGC
<BLANKLINE>

The length of the motif is defined as the sequence length, which should be the same for all instances:

>>> len(m)
5

The Motif object has an attribute .counts containing the counts of each nucleotide at each position. Printing this counts matrix shows it in an easily readable format:

>>> print(m.counts)
        0      1      2      3      4
A:   3.00   7.00   0.00   2.00   1.00
C:   0.00   0.00   5.00   2.00   6.00
G:   0.00   0.00   0.00   3.00   0.00
T:   4.00   0.00   2.00   0.00   0.00
<BLANKLINE>

You can access these counts as a dictionary:

>>> m.counts['A']
[3, 7, 0, 2, 1]

but you can also think of it as a 2D array with the nucleotide as the first dimension and the position as the second dimension:

>>> m.counts['T', 0]
4
>>> m.counts['T', 2]
2
>>> m.counts['T', 3]
0

You can also directly access columns of the counts matrix:

>>> m.counts[:, 3]
{'A': 2, 'C': 2, 'T': 0, 'G': 3}

Instead of the nucleotide itself, you can also use the index of the nucleotide in the sorted letters in the alphabet of the motif:

>>> m.alphabet
IUPACUnambiguousDNA()
>>> m.alphabet.letters
'GATC'
>>> sorted(m.alphabet.letters)
['A', 'C', 'G', 'T']
>>> m.counts['A', :]
(3, 7, 0, 2, 1)
>>> m.counts[0, :]
(3, 7, 0, 2, 1)

The motif has an associated consensus sequence, defined as the sequence of letters along the positions of the motif for which the largest value in the corresponding columns of the .counts matrix is obtained:

>>> m.consensus
Seq('TACGC', IUPACUnambiguousDNA())

as well as an anticonsensus sequence, corresponding to the smallest values in the columns of the .counts matrix:

>>> m.anticonsensus
Seq('GGGTG', IUPACUnambiguousDNA())

You can also ask for a degenerate consensus sequence, in which ambiguous nucleotides are used for positions where there are multiple nucleotides with high counts:

>>> m.degenerate_consensus
Seq('WACVC', IUPACAmbiguousDNA())

Here, W and V follow the IUPAC nucleotide ambiguity codes: W is either A or T, and V is A, C, or G [10]. The degenerate consensus sequence is constructed following the rules specified by Cavener [11].
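To see where these values come from, the counts matrix and both consensus sequences can be recomputed by hand from the seven instances. The sketch below uses plain Python strings rather than Seq objects. The thresholds in the degenerate function are one formulation of Cavener-style rules and are an assumption of this sketch, chosen because they reproduce the WACVC result shown above; consult Cavener [11] for the authoritative rules.

```python
instances = ["TACAA", "TACGC", "TACAC", "TACCC", "AACCC", "AATGC", "AATGC"]
letters = "ACGT"
length = len(instances[0])

# counts[letter][position], the same layout as m.counts in the text.
counts = {letter: [0] * length for letter in letters}
for instance in instances:
    for position, letter in enumerate(instance):
        counts[letter][position] += 1

# Consensus: the letter with the largest count in each column.
consensus = "".join(
    max(letters, key=lambda letter: counts[letter][i]) for i in range(length)
)

# IUPAC ambiguity codes, keyed by the sorted set of letters they stand for.
iupac = {"A": "A", "C": "C", "G": "G", "T": "T",
         "AC": "M", "AG": "R", "AT": "W", "CG": "S", "CT": "Y", "GT": "K",
         "ACG": "V", "ACT": "H", "AGT": "D", "CGT": "B", "ACGT": "N"}


def degenerate(column):
    """column maps letter -> count for one position (assumed thresholds)."""
    ranked = sorted(column, key=column.get, reverse=True)
    values = [column[letter] for letter in ranked]
    if values[0] > sum(values[1:]) and values[0] > 2 * values[1]:
        chosen = ranked[:1]   # one clearly dominant letter
    elif 4 * sum(values[:2]) > 3 * sum(values):
        chosen = ranked[:2]   # top two letters cover more than 75%
    elif values[3] == 0:
        chosen = ranked[:3]   # exactly one letter never occurs
    else:
        chosen = ranked       # anything goes: N
    return iupac["".join(sorted(chosen))]


degenerate_consensus = "".join(
    degenerate({letter: counts[letter][i] for letter in letters})
    for i in range(length)
)

print(counts["A"])           # [3, 7, 0, 2, 1]
print(consensus)             # TACGC
print(degenerate_consensus)  # WACVC
```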

We can also get the reverse complement of a motif:

>>> r = m.reverse_complement()
>>> r.consensus
Seq('GCGTA', IUPACUnambiguousDNA())
>>> r.degenerate_consensus
Seq('GBGTW', IUPACAmbiguousDNA())
>>> print(r)
TTGTA
GCGTA
GTGTA
GGGTA
GGGTT
GCATT
GCATT
<BLANKLINE>

The reverse complement and the degenerate consensus sequence are only defined for DNA motifs.
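As a quick cross-check of the output above, the reverse complement of each instance can be computed with plain string operations. This is a sketch on Python strings, not the Motif.reverse_complement() method itself.

```python
# Translation table mapping each base to its complement.
complement = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Complement each base, then reverse the result."""
    return sequence.translate(complement)[::-1]


instances = ["TACAA", "TACGC", "TACAC", "TACCC", "AACCC", "AATGC", "AATGC"]
rc_instances = [reverse_complement(s) for s in instances]
for s in rc_instances:
    print(s)
```

The printed lines match the instances listed for the reverse-complemented motif r.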

14.1.2 Creating a sequence logo
If we have internet access, we can create a weblogo:

>>> m.weblogo("mymotif.png")

We should get our logo saved as a PNG in the specified file.

14.2 Reading motifs
Creating motifs from instances by hand is a bit boring, so it's useful to have some I/O functions for reading and writing motifs. There are not any really well established standards for storing motifs, but there are a couple of formats that are more used than others.

14.2.1 JASPAR

One of the most popular motif databases is JASPAR. In addition to the motif sequence information, the JASPAR database stores a lot of meta-information for each motif. The module Bio.motifs contains a specialized class jaspar.Motif in which this meta-information is represented as attributes:

matrix_id - the unique JASPAR motif ID, e.g. 'MA0004.1'
name - the name of the TF, e.g. 'Arnt'
collection - the JASPAR collection to which the motif belongs, e.g. 'CORE'
tf_class - the structural class of this TF, e.g. 'Zipper-Type'
tf_family - the family to which this TF belongs, e.g. 'Helix-Loop-Helix'
species - the species to which this TF belongs, may have multiple values, these are specified as taxonomy IDs, e.g. 10090
tax_group - the taxonomic supergroup to which this motif belongs, e.g. 'vertebrates'
acc - the accession number of the TF protein, e.g. 'P53762'
data_type - the type of data used to construct this motif, e.g. 'SELEX'
medline - the Pubmed ID of literature supporting this motif, may be multiple values, e.g. 7592839
pazar_id - external reference to the TF in the PAZAR database, e.g. 'TF0000003'
comment - free form text containing notes about the construction of the motif

The jaspar.Motif class inherits from the generic Motif class and therefore provides all the facilities of any of the motif formats: reading motifs, writing motifs, scanning sequences for motif instances etc.

JASPAR stores motifs in several different ways including three different flat file formats and as an SQL database. All of these formats facilitate the construction of a counts matrix. However, the amount of meta-information described above that is available varies with the format.

The JASPAR sites format

The first of the three flat file formats contains a list of instances. As an example, these are the beginning and ending lines of the JASPAR Arnt.sites file showing known binding sites of the mouse helix-loop-helix transcription factor Arnt.

>MA0004 ARNT 1
CACGTGatgtcctc
>MA0004 ARNT 2
CACGTGggaggtac
>MA0004 ARNT 3
CACGTGccgcgcgc
...
>MA0004 ARNT 18
AACGTGacagccctcc
>MA0004 ARNT 19
AACGTGcacatcgtcc
>MA0004 ARNT 20
aggaatCGCGTGc

The parts of the sequence in capital letters are the motif instances that were found to align to each other.
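The capital-letter convention is easy to illustrate by hand. The sketch below is not the Bio.motifs parser: the function read_sites is hypothetical, and it simply keeps the upper-case letters of every sequence line, skipping the header lines that start with ">".

```python
# Three records copied from the Arnt.sites excerpt above.
sites_text = """\
>MA0004 ARNT 1
CACGTGatgtcctc
>MA0004 ARNT 2
CACGTGggaggtac
>MA0004 ARNT 20
aggaatCGCGTGc
"""


def read_sites(text):
    """Collect the upper-case portion of every sequence line."""
    instances = []
    for line in text.splitlines():
        if not line or line.startswith(">"):
            continue  # skip blank lines and header lines
        instances.append("".join(ch for ch in line if ch.isupper()))
    return instances


site_instances = read_sites(sites_text)
print(site_instances)  # ['CACGTG', 'CACGTG', 'CGCGTG']
```

Note that the aligned site can sit anywhere in the line, as in the last record, where it is flanked by lower-case sequence on both sides.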

We can create a Motif object from these instances as follows:

>>> from Bio import motifs
>>> with open("Arnt.sites") as handle:
...     arnt = motifs.read(handle, "sites")
...

The instances from which this motif was created are stored in the .instances property:

>>> print(arnt.instances[:3])
[Seq('CACGTG', IUPACUnambiguousDNA()), Seq('CACGTG', IUPACUnambiguousDNA()), Seq('CACGTG', IUPACUnambiguousDNA())]
>>> for instance in arnt.instances:
...     print(instance)
...
CACGTG
CACGTG
CACGTG
CACGTG
CACGTG
CACGTG
CACGTG
CACGTG
CACGTG
CACGTG
CACGTG
CACGTG
CACGTG
CACGTG
CACGTG
AACGTG
AACGTG
AACGTG
AACGTG
CGCGTG

The counts matrix of this motif is automatically calculated from the instances:

>>> print(arnt.counts)
        0      1      2      3      4      5
A:   4.00  19.00   0.00   0.00   0.00   0.00
C:  16.00   0.00  20.00   0.00   0.00   0.00
G:   0.00   1.00   0.00  20.00   0.00  20.00
T:   0.00   0.00   0.00   0.00  20.00   0.00
<BLANKLINE>

This format does not store any meta-information.

The JASPAR pfm format

JASPAR also makes motifs available directly as a count matrix, without the instances from which it was created. This pfm format only stores the counts matrix for a single motif. For example, this is the JASPAR file SRF.pfm containing the counts matrix for the human SRF transcription factor:

 2  9  0  1 32  3 46  1 43 15  2  2
 1 33 45 45  1  1  0  0  0  1  0  1
39  2  1  0  0  0  0  0  0  0 44 43
 4  2  0  0 13 42  0 45  3 30  0  0

We can create a motif for this count matrix as follows:

>>> with open("SRF.pfm") as handle:
...     srf = motifs.read(handle, "pfm")
...
>>> print(srf.counts)
        0      1      2      3      4      5      6      7      8      9     10     11
A:   2.00   9.00   0.00   1.00  32.00   3.00  46.00   1.00  43.00  15.00   2.00   2.00
C:   1.00  33.00  45.00  45.00   1.00   1.00   0.00   0.00   0.00   1.00   0.00   1.00
G:  39.00   2.00   1.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00  44.00  43.00
T:   4.00   2.00   0.00   0.00  13.00  42.00   0.00  45.00   3.00  30.00   0.00   0.00
<BLANKLINE>

As this motif was created from the counts matrix directly, it has no instances associated with it:

>>> print(srf.instances)
None

We can now ask for the consensus sequence of these two motifs:

>>> print(arnt.counts.consensus)
CACGTG
>>> print(srf.counts.consensus)
GCCCATATATGG

As with the instances file, no meta-information is stored in this format.

The JASPAR format jaspar

The jaspar file format allows multiple motifs to be specified in a single file. In this format each of the motif records consists of a header line followed by four lines defining the counts matrix. The header line begins with a > character (similar to the Fasta file format) and is followed by the unique JASPAR matrix ID and the TF name. The following example shows a jaspar formatted file containing the three motifs Arnt, RUNX1 and MEF2A:

>MA0004.1 Arnt
A  [ 4 19  0  0  0  0 ]
C  [16  0 20  0  0  0 ]
G  [ 0  1  0 20  0 20 ]
T  [ 0  0  0  0 20  0 ]
>MA0002.1 RUNX1
A  [10 12  4  1  2  2  0  0  0  8 13 ]
C  [ 2  2  7  1  0  8  0  0  1  2  2 ]
G  [ 3  1  1  0 23  0 26 26  0  0  4 ]
T  [11 11 14 24  1 16  0  0 25 16  7 ]
>MA0052.1 MEF2A
A  [ 1  0 57  2  9  6 37  2 56  6 ]
C  [50  0  1  1  0  0  0  0  0  0 ]
G  [ 0  0  0  0  0  0  0  0  2 50 ]
T  [ 7 58  0 55 49 52 21 56  0  2 ]
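The record layout just described (a header line starting with >, then one bracketed row of counts per nucleotide) can be sketched as a tiny hand-written parser. This is only an illustration of the file structure, with a hypothetical function name parse_jaspar; it does not create Motif objects and is not how Bio.motifs reads the format.

```python
# The first record from the example file above.
jaspar_text = """\
>MA0004.1 Arnt
A  [ 4 19  0  0  0  0 ]
C  [16  0 20  0  0  0 ]
G  [ 0  1  0 20  0 20 ]
T  [ 0  0  0  0 20  0 ]
"""


def parse_jaspar(text):
    """Return a list of (matrix_id, name, counts) tuples."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Header: matrix ID, then the TF name.
            matrix_id, name = line[1:].split(None, 1)
            records.append((matrix_id, name, {}))
        else:
            # Matrix row: a letter followed by bracketed integer counts.
            letter, rest = line.split("[", 1)
            values = rest.replace("]", " ").split()
            records[-1][2][letter.strip()] = [int(v) for v in values]
    return records


jaspar_records = parse_jaspar(jaspar_text)
matrix_id, name, counts = jaspar_records[0]
print(matrix_id, name)  # MA0004.1 Arnt
print(counts["C"])      # [16, 0, 20, 0, 0, 0]
```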

The motifs are read as follows:

>>> fh = open("jaspar_motifs.txt")
>>> for m in motifs.parse(fh, "jaspar"):
...     print(m)
TF name  Arnt
Matrix ID  MA0004.1
Matrix:
        0      1      2      3      4      5
A:   4.00  19.00   0.00   0.00   0.00   0.00
C:  16.00   0.00  20.00   0.00   0.00   0.00
G:   0.00   1.00   0.00  20.00   0.00  20.00
T:   0.00   0.00   0.00   0.00  20.00   0.00

TF name  RUNX1
Matrix ID  MA0002.1
Matrix:
        0      1      2      3      4      5      6      7      8      9     10
A:  10.00  12.00   4.00   1.00   2.00   2.00   0.00   0.00   0.00   8.00  13.00
C:   2.00   2.00   7.00   1.00   0.00   8.00   0.00   0.00   1.00   2.00   2.00
G:   3.00   1.00   1.00   0.00  23.00   0.00  26.00  26.00   0.00   0.00   4.00
T:  11.00  11.00  14.00  24.00   1.00  16.00   0.00   0.00  25.00  16.00   7.00

TF name  MEF2A
Matrix ID  MA0052.1
Matrix:
        0      1      2      3      4      5      6      7      8      9
A:   1.00   0.00  57.00   2.00   9.00   6.00  37.00   2.00  56.00   6.00
C:  50.00   0.00   1.00   1.00   0.00   0.00   0.00   0.00   0.00   0.00
G:   0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   2.00  50.00
T:   7.00  58.00   0.00  55.00  49.00  52.00  21.00  56.00   0.00   2.00

Note that printing a JASPAR motif yields both the counts data and the available meta-information.

Accessing the JASPAR database

In addition to parsing these flat file formats, we can also retrieve motifs from a JASPAR SQL database. Unlike the flat file formats, a JASPAR database allows storing of all possible meta-information defined in the JASPAR Motif class. It is beyond the scope of this document to describe how to set up a JASPAR database (please see the main JASPAR website). Motifs are read from a JASPAR database using the Bio.motifs.jaspar.db module. First connect to the JASPAR database using the JASPAR5 class which models the latest JASPAR schema:

>>> from Bio.motifs.jaspar.db import JASPAR5
>>>
>>> JASPAR_DB_HOST = <hostname>
>>> JASPAR_DB_NAME = <db_name>
>>> JASPAR_DB_USER = <user>
>>> JASPAR_DB_PASS = <password>
>>>
>>> jdb = JASPAR5(
...     host=JASPAR_DB_HOST,
...     name=JASPAR_DB_NAME,
...     user=JASPAR_DB_USER,
...     password=JASPAR_DB_PASS
... )

Now we can fetch a single motif by its unique JASPAR ID with the fetch_motif_by_id method. Note that a JASPAR ID consists of a base ID and a version number separated by a decimal point, e.g. 'MA0004.1'. The fetch_motif_by_id method allows you to use either the fully specified ID or just the base ID. If only the base ID is provided, the latest version of the motif is returned.

>>> arnt = jdb.fetch_motif_by_id("MA0004")

Printing the motif reveals that the JASPAR SQL database stores much more meta-information than the flat files:

>>> print(arnt)
TF name  Arnt
Matrix ID  MA0004.1
Collection  CORE
TF class  Zipper-Type
TF family  Helix-Loop-Helix
Species  10090
Taxonomic group  vertebrates
Accession  ['P53762']
Data type used  SELEX
Medline  7592839
PAZAR ID  TF0000003
Comments
Matrix:
        0      1      2      3      4      5
A:   4.00  19.00   0.00   0.00   0.00   0.00
C:  16.00   0.00  20.00   0.00   0.00   0.00
G:   0.00   1.00   0.00  20.00   0.00  20.00
T:   0.00   0.00   0.00   0.00  20.00   0.00

We can also fetch motifs by name. The name must be an exact match (partial matches or database wildcards are not currently supported). Note that as the name is not guaranteed to be unique, the fetch_motifs_by_name method actually returns a list.

>>> motifs = jdb.fetch_motifs_by_name("Arnt")
>>> print(motifs[0])
TF name  Arnt
Matrix ID  MA0004.1
Collection  CORE
TF class  Zipper-Type
TF family  Helix-Loop-Helix
Species  10090
Taxonomic group  vertebrates
Accession  ['P53762']
Data type used  SELEX
Medline  7592839
PAZAR ID  TF0000003
Comments
Matrix:
        0      1      2      3      4      5
A:   4.00  19.00   0.00   0.00   0.00   0.00
C:  16.00   0.00  20.00   0.00   0.00   0.00
G:   0.00   1.00   0.00  20.00   0.00  20.00
T:   0.00   0.00   0.00   0.00  20.00   0.00

The fetch_motifs method allows you to fetch motifs which match a specified set of criteria. These criteria include any of the above described meta-information as well as certain matrix properties such as the minimum information content (min_ic in the example below), the minimum length of the matrix or the minimum number of sites used to construct the matrix. Only motifs which pass ALL the specified criteria are returned. Note that selection criteria which correspond to meta-information which allow for multiple values may be specified as either a single value or a list of values, e.g. tax_group and tf_family in the example below.

>>> motifs = jdb.fetch_motifs(
...     collection='CORE',
...     tax_group=['vertebrates', 'insects'],
...     tf_class='Winged Helix-Turn-Helix',
...     tf_family=['Forkhead', 'Ets'],
...     min_ic=12
... )
>>> for motif in motifs:
...     pass  # do something with the motif

Compatibility with Perl TFBS modules

An important thing to note is that the JASPAR Motif class was designed to be compatible with the popular Perl TFBS modules. Therefore some specifics about the choice of defaults for background and pseudocounts, as well as how information content is computed and sequences are searched for instances, are based on this compatibility criterion. These choices are noted in the specific subsections below.

Choice of background:
The Perl TFBS modules appear to allow a choice of custom background probabilities (although the documentation states that uniform background is assumed). However the default is to use a uniform background. Therefore it is recommended that you use a uniform background for computing the position-specific scoring matrix (PSSM). This is the default when using the Biopython motifs module.
Choice of pseudocounts:
By default, the Perl TFBS modules use a pseudocount equal to sqrt(N) * bg[nucleotide], where N represents the total number of sequences used to construct the matrix. To apply this same pseudocount formula, set the motif pseudocounts attribute using the jaspar.calculate_pseudocounts() function:

>>> motif.pseudocounts = motifs.jaspar.calculate_pseudocounts(motif)

Note that it is possible for the counts matrix to have an unequal number of sequences making up the columns. The pseudocount computation uses the average number of sequences making up the matrix. However, when normalize is called on the counts matrix, each count value in a column is divided by the total number of sequences making up that specific column, not by the average number of sequences. This differs from the Perl TFBS modules because the normalization is not done as a separate step and so the average number of sequences is used throughout the computation of the PSSM. Therefore, for matrices with unequal column counts, the PSSM computed by the motifs module will differ somewhat from the PSSM computed by the Perl TFBS modules.
Computation of matrix information content:
The information content (IC) or specificity of a matrix is computed using the mean method of the PositionSpecificScoringMatrix class. However of note, in the Perl TFBS modules the default behaviour is to compute the IC without first applying pseudocounts, even though by default the PSSMs are computed using pseudocounts as described above.
Searching for instances:
Searching for instances with the Perl TFBS motifs was usually performed using a relative score threshold, i.e. a score in the range 0 to 1. In order to compute the absolute PSSM score corresponding to a relative score one can use the equation:

>>> abs_score = (pssm.max - pssm.min) * rel_score + pssm.min

To convert the absolute score of an instance back to a relative score, one can use the equation:

>>> rel_score = (abs_score - pssm.min) / (pssm.max - pssm.min)

For example, using the Arnt motif before, let's search a sequence with a relative score threshold of 0.8.

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet.IUPAC import unambiguous_dna
>>> test_seq = Seq("TAAGCGTGCACGCGCAACACGTGCATTA", unambiguous_dna)
>>> arnt.pseudocounts = motifs.jaspar.calculate_pseudocounts(arnt)
>>> pssm = arnt.pssm
>>> max_score = pssm.max
>>> min_score = pssm.min
>>> abs_score_threshold = (max_score - min_score) * 0.8 + min_score
>>> for position, score in pssm.search(test_seq,
...                                    threshold=abs_score_threshold):
...     rel_score = (score - min_score) / (max_score - min_score)
...     print("Position %d: score = %5.3f, rel. score = %5.3f" % (
...           position, score, rel_score))
...
Position 2: score = 5.362, rel. score = 0.801
Position 8: score = 6.112, rel. score = 0.831
Position -20: score = 7.103, rel. score = 0.870
Position 17: score = 10.351, rel. score = 1.000
Position -11: score = 10.351, rel. score = 1.000
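The arithmetic behind these numbers can be reproduced by hand from the Arnt counts matrix. The sketch below is a plain-Python approximation, not the Bio.motifs implementation. It assumes a uniform background of 0.25, a pseudocount of sqrt(N) * bg with N taken as the average column total, and base-2 log-odds scores; the helper names to_absolute, to_relative and score are hypothetical. Under those assumptions it gives the same maximum score of about 10.351 as above.

```python
import math

# Arnt counts matrix from the text.
counts = {
    "A": [4, 19, 0, 0, 0, 0],
    "C": [16, 0, 20, 0, 0, 0],
    "G": [0, 1, 0, 20, 0, 20],
    "T": [0, 0, 0, 0, 20, 0],
}
background = {base: 0.25 for base in counts}  # assumed uniform background
length = len(counts["A"])

# Pseudocount per base: sqrt(N) * bg, with N the average column total.
column_totals = [sum(counts[base][i] for base in counts) for i in range(length)]
n_sites = sum(column_totals) / length
pseudocounts = {base: math.sqrt(n_sites) * background[base] for base in counts}

# Log-odds PSSM: log2(normalized frequency / background), column by column.
pssm = {base: [] for base in counts}
for i in range(length):
    total = sum(counts[base][i] + pseudocounts[base] for base in counts)
    for base in counts:
        frequency = (counts[base][i] + pseudocounts[base]) / total
        pssm[base].append(math.log2(frequency / background[base]))

max_score = sum(max(pssm[base][i] for base in counts) for i in range(length))
min_score = sum(min(pssm[base][i] for base in counts) for i in range(length))


def to_absolute(rel_score):
    """Relative score in [0, 1] -> absolute PSSM score."""
    return (max_score - min_score) * rel_score + min_score


def to_relative(abs_score):
    """Absolute PSSM score -> relative score in [0, 1]."""
    return (abs_score - min_score) / (max_score - min_score)


def score(site):
    """Sum the PSSM entries along one candidate site."""
    return sum(pssm[base][i] for i, base in enumerate(site))


print("max = %.3f, min = %.3f" % (max_score, min_score))
print("threshold for rel. score 0.8 = %.3f" % to_absolute(0.8))
print("CACGTG: rel. score = %.3f" % to_relative(score("CACGTG")))
```

The consensus site CACGTG scores the maximum, so its relative score is 1, which matches the hit reported at position 17.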

14.2.2 MEME

MEME [12] is a tool for discovering motifs in a group of related DNA or protein sequences. It takes as input a group of DNA or protein sequences and outputs as many motifs as requested. Therefore, in contrast to JASPAR files, MEME output files typically contain multiple motifs. This is an example.

The top of an output file generated by MEME shows some background information about MEME and the version of MEME used:

********************************************************************************
MEME - Motif discovery tool
********************************************************************************
MEME version 3.0 (Release date: 2004/08/18 09:07:01)
...

Further down, the input set of training sequences is recapitulated:

********************************************************************************
TRAINING SET
********************************************************************************
DATAFILE= INO_up800.s
ALPHABET= ACGT
Sequence name            Weight Length  Sequence name            Weight Length

CHO1                     1.0000    800  CHO2                     1.0000    800
FAS1                     1.0000    800  FAS2                     1.0000    800
ACC1                     1.0000    800  INO1                     1.0000    800
OPI3                     1.0000    800
********************************************************************************

and the exact command line that was used:

********************************************************************************
COMMAND LINE SUMMARY
********************************************************************************
This information can also be useful in the event you wish to report a
problem with the MEME software.

command: meme -mod oops -dna -revcomp -nmotifs 2 -bfile yeast.nc.6.freq INO_up800.s
...

Next is detailed information on each motif that was found:

********************************************************************************
MOTIF  1        width =   12   sites =   7   llr = 95   E-value = 2.0e-001
********************************************************************************

        Motif 1 Description

Simplified        A  :::9:a::::3:
pos.-specific     C  ::a:9:11691a
probability       G  ::::1::94:4:
matrix            T  aa:1::9::11:

To parse this file (stored as meme.dna.oops.txt), use

>>> with open("meme.dna.oops.txt") as handle:
...     record = motifs.parse(handle, "meme")
...

The motifs.parse command reads the complete file directly, so you can close the file after calling motifs.parse. The header information is stored in attributes:

>>> record.version
'3.0'
>>> record.datafile
'INO_up800.s'
>>> record.command
'meme -mod oops -dna -revcomp -nmotifs 2 -bfile yeast.nc.6.freq INO_up800.s'
>>> record.alphabet
IUPACUnambiguousDNA()
>>> record.sequences
['CHO1', 'CHO2', 'FAS1', 'FAS2', 'ACC1', 'INO1', 'OPI3']

The record is an object of the Bio.motifs.meme.Record class. The class inherits from list, and you can think of record as a list of Motif objects:

>>> len(record)
2
>>> motif = record[0]
>>> print(motif.consensus)
TTCACATGCCGC
>>> print(motif.degenerate_consensus)
TTCACATGSCNC

In addition to these generic motif attributes, each motif also stores its specific information as calculated by MEME. For example,

>>> motif.num_occurrences
7
>>> motif.length
12
>>> evalue = motif.evalue
>>> print("%3.1g" % evalue)
0.2
>>> motif.name
'Motif 1'

In addition to using an index into the record, as we did above, you can also find it by its name:

>>> motif = record['Motif 1']

Each motif has an attribute .instances with the sequence instances in which the motif was found, providing some information on each instance:

>>> len(motif.instances)
7
>>> motif.instances[0]
Instance('TTCACATGCCGC', IUPACUnambiguousDNA())
>>> motif.instances[0].motif_name
'Motif 1'
>>> motif.instances[0].sequence_name
'INO1'
>>> motif.instances[0].start
620
>>> motif.instances[0].strand
'-'
>>> motif.instances[0].length
12
>>> pvalue = motif.instances[0].pvalue
>>> print("%5.3g" % pvalue)
1.85e-08

MAST

14.2.3 TRANSFAC
TRANSFAC is a manually curated database of transcription factors, together with their genomic binding sites and DNA binding profiles [27]. While the file format used in the TRANSFAC database is nowadays also used by others, we will refer to it as the TRANSFAC file format.

A minimal file in the TRANSFAC format looks as follows:

ID  motif1
P0      A      C      G      T
01      1      2      2      0      S
02      2      1      2      0      R
03      3      0      1      1      A
04      0      5      0      0      C
05      5      0      0      0      A
06      0      0      4      1      G
07      0      1      4      0      G
08      0      0      0      5      T
09      0      0      5      0      G
10      0      1      2      2      K
11      0      2      0      3      Y
12      1      0      3      1      G
//

This file shows the frequency matrix of motif motif1 of 12 nucleotides. In general, one file in the TRANSFAC format can contain multiple motifs. For example, this is the contents of the example TRANSFAC file transfac.dat:

VV  EXAMPLE January 15, 2013
XX
//
ID  motif1
P0      A      C      G      T
01      1      2      2      0      S
02      2      1      2      0      R
03      3      0      1      1      A
...
11      0      2      0      3      Y
12      1      0      3      1      G
//
ID  motif2
P0      A      C      G      T
01      2      1      2      0      R
02      1      2      2      0      S
...
09      0      0      0      5      T
10      0      2      0      3      Y
//

To parse a TRANSFAC file, use

>>> with open("transfac.dat") as handle:
...     record = motifs.parse(handle, "TRANSFAC")
...

The overall version number, if available, is stored as record.version:

>>> record.version
'EXAMPLE January 15, 2013'

Each motif in record is an instance of the Bio.motifs.transfac.Motif class, which inherits both from the Bio.motifs.Motif class and from a Python dictionary. The dictionary uses the two-letter keys to store any additional information about the motif:

>>> motif = record[0]
>>> motif.degenerate_consensus  # Using the Bio.motifs.Motif method
Seq('SRACAGGTGKYG', IUPACAmbiguousDNA())
>>> motif['ID']  # Using motif as a dictionary
'motif1'

TRANSFAC files are typically much more elaborate than this example, containing lots of additional information about the motif. Table 14.1 lists the two-letter field codes that are commonly found in TRANSFAC files:

Table 14.1: Fields commonly found in TRANSFAC files
AC  Accession number
AS  Accession numbers, secondary
BA  Statistical basis
BF  Binding factors
BS  Factor binding sites underlying the matrix
CC  Comments
CO  Copyright notice
DE  Short factor description
DR  External databases
DT  Date created/updated
HC  Subfamilies
HP  Superfamilies
ID  Identifier
NA  Name of the binding factor
OC  Taxonomic classification
OS  Species/Taxon
OV  Older version
PV  Preferred version
TY  Type
XX  Empty line; these are not stored in the Record.

Each motif also has an attribute .references containing the references associated with the motif, using these two-letter keys:

Table 14.2: Fields used to store references in TRANSFAC files
RN  Reference number
RA  Reference authors
RL  Reference data
RT  Reference title
RX  PubMed ID

Printing the motifs writes them out in their native TRANSFAC format:

>>> print(record)
VV  EXAMPLE January 15, 2013
XX
//
ID  motif1
XX
P0      A      C      G      T
01      1      2      2      0      S
02      2      1      2      0      R
03      3      0      1      1      A
04      0      5      0      0      C
05      5      0      0      0      A
06      0      0      4      1      G
07      0      1      4      0      G
08      0      0      0      5      T
09      0      0      5      0      G
10      0      1      2      2      K
11      0      2      0      3      Y
12      1      0      3      1      G
XX
//
ID  motif2
XX
P0      A      C      G      T
01      2      1      2      0      R
02      1      2      2      0      S
03      0      5      0      0      C
04      3      0      1      1      A
05      0      0      4      1      G
06      5      0      0      0      A
07      0      1      4      0      G
08      0      0      5      0      G
09      0      0      0      5      T
10      0      2      0      3      Y
XX
//
<BLANKLINE>

You can export the motifs in the TRANSFAC format by capturing this output in a string and saving it in a file:

>>> text = str(record)
>>> with open("mytransfacfile.dat", 'w') as out_handle:
...     out_handle.write(text)
...

14.3 Writing motifs
Speaking of exporting, let's look at export functions in general. We can use the format method to write the motif in the simple JASPAR pfm format:

>>> print(arnt.format("pfm"))
  4.00  19.00   0.00   0.00   0.00   0.00
 16.00   0.00  20.00   0.00   0.00   0.00
  0.00   1.00   0.00  20.00   0.00  20.00
  0.00   0.00   0.00   0.00  20.00   0.00

Similarly, we can use format to write the motif in the JASPAR jaspar format:

>>> print(arnt.format("jaspar"))
>MA0004.1 Arnt
A [  4.00  19.00   0.00   0.00   0.00   0.00]
C [ 16.00   0.00  20.00   0.00   0.00   0.00]
G [  0.00   1.00   0.00  20.00   0.00  20.00]
T [  0.00   0.00   0.00   0.00  20.00   0.00]

To write the motif in a TRANSFAC-like matrix format, use
>>> print(m.format("transfac"))
P0      A      C      G      T
01      3      0      0      4      W
02      7      0      0      0      A
03      0      5      0      2      C
04      2      2      3      0      V
05      1      6      0      0      C
XX
//
<BLANKLINE>

To write out multiple motifs, you can use motifs.write. This function can be used regardless of whether the motifs originated from a TRANSFAC file. For example,

>>> two_motifs = [arnt, srf]
>>> print(motifs.write(two_motifs, 'transfac'))
P0      A      C      G      T
01      4     16      0      0      C
02     19      0      1      0      A
03      0     20      0      0      C
04      0      0     20      0      G
05      0      0      0     20      T
06      0      0     20      0      G
XX
//
P0      A      C      G      T
01      2      1     39      4      G
02      9     33      2      2      C
03      0     45      1      0      C
04      1     45      0      0      C
05     32      1      0     13      A
06      3      1      0     42      T
07     46      0      0      0      A
08      1      0      0     45      T
09     43      0      0      3      A
10     15      1      0     30      T
11      2      0     44      0      G
12      2      1     43      0      G
XX
//
<BLANKLINE>

Or, to write multiple motifs in the jaspar format:

>>> two_motifs = [arnt, mef2a]
>>> print(motifs.write(two_motifs, "jaspar"))
>MA0004.1 Arnt
A [  4.00  19.00   0.00   0.00   0.00   0.00]
C [ 16.00   0.00  20.00   0.00   0.00   0.00]
G [  0.00   1.00   0.00  20.00   0.00  20.00]
T [  0.00   0.00   0.00   0.00  20.00   0.00]
>MA0052.1 MEF2A
A [  1.00   0.00  57.00   2.00   9.00   6.00  37.00   2.00  56.00   6.00]
C [ 50.00   0.00   1.00   1.00   0.00   0.00   0.00   0.00   0.00   0.00]
G [  0.00   0.00   0.00   0.00   0.00   0.00   0.00   0.00   2.00  50.00]
T [  7.00  58.00   0.00  55.00  49.00  52.00  21.00  56.00   0.00   2.00]

14.4 Position Weight Matrices
The .counts attribute of a Motif object shows how often each nucleotide appeared at each position along the alignment. We can normalize this matrix by dividing by the
number of instances in the alignment, resulting in the probability of each nucleotide at each position along the alignment. We refer to these probabilities as the position
weight matrix. However, beware that in the literature this term may also be used to refer to the position-specific scoring matrix, which we discuss below.

Usually, pseudocounts are added to each position before normalizing. This avoids overfitting of the position weight matrix to the limited number of motif instances in the
alignment, and can also prevent probabilities from becoming zero. To add a fixed pseudocount to all nucleotides at all positions, specify a number for the pseudocounts
argument:

>>> pwm = m.counts.normalize(pseudocounts=0.5)
>>> print(pwm)
        0      1      2      3      4
A:   0.39   0.83   0.06   0.28   0.17
C:   0.06   0.06   0.61   0.28   0.72
G:   0.06   0.06   0.06   0.39   0.06
T:   0.50   0.06   0.28   0.06   0.06
<BLANKLINE>
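
The normalization step can be sketched in plain Python to make the arithmetic explicit. This is only an illustration, not Biopython's implementation: the helper name normalize_counts is hypothetical, and the counts are those of the five-position motif m as they appear in the TRANSFAC-like output of Section 14.3.

```python
# Counts of motif m per position (rows 01-05 of its TRANSFAC-like output).
counts = {
    "A": [3, 7, 0, 2, 1],
    "C": [0, 0, 5, 2, 6],
    "G": [0, 0, 0, 3, 0],
    "T": [4, 0, 2, 0, 0],
}


def normalize_counts(counts, pseudocounts=0.0):
    """Hypothetical helper: add pseudocounts, then divide each column by its total."""
    letters = sorted(counts)
    if not isinstance(pseudocounts, dict):
        # A single number means the same pseudocount for every letter.
        pseudocounts = {letter: float(pseudocounts) for letter in letters}
    length = len(counts[letters[0]])
    pwm = {letter: [] for letter in letters}
    for i in range(length):
        total = sum(counts[letter][i] + pseudocounts[letter] for letter in letters)
        for letter in letters:
            pwm[letter].append((counts[letter][i] + pseudocounts[letter]) / total)
    return pwm


pwm_sketch = normalize_counts(counts, pseudocounts=0.5)
for letter in "ACGT":
    print(letter + ": " + " ".join("%6.2f" % p for p in pwm_sketch[letter]))
```

With a pseudocount of 0.5 the first entry for A is (3 + 0.5) / (7 + 4 × 0.5) = 0.39, in agreement with the matrix printed above.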

Alternatively, pseudocounts can be a dictionary specifying the pseudocounts for each nucleotide. For example, as the GC content of the human genome is about 40%, you
may want to choose the pseudocounts accordingly:
>>> pwm = m.counts.normalize(pseudocounts={'A': 0.6, 'C': 0.4, 'G': 0.4, 'T': 0.6})
>>> print(pwm)
        0      1      2      3      4
A:   0.40   0.84   0.07   0.29   0.18
C:   0.04   0.04   0.60   0.27   0.71
G:   0.04   0.04   0.04   0.38   0.04
T:   0.51   0.07   0.29   0.07   0.07
<BLANKLINE>

The position weight matrix has its own methods to calculate the consensus, anticonsensus, and degenerate consensus sequences:

>>> pwm.consensus
Seq('TACGC', IUPACUnambiguousDNA())
>>> pwm.anticonsensus
Seq('GGGTG', IUPACUnambiguousDNA())
>>> pwm.degenerate_consensus
Seq('WACNC', IUPACAmbiguousDNA())

Note that due to the pseudocounts, the degenerate consensus sequence calculated from the position weight matrix is slightly different from the degenerate consensus
sequence calculated from the instances in the motif:

>>> m.degenerate_consensus
Seq('WACVC', IUPACAmbiguousDNA())

The reverse complement of the position weight matrix can be calculated directly from the pwm:

>>> rpwm = pwm.reverse_complement()
>>> print(rpwm)
        0      1      2      3      4
A:   0.07   0.07   0.29   0.07   0.51
C:   0.04   0.38   0.04   0.04   0.04
G:   0.71   0.27   0.60   0.04   0.04
T:   0.18   0.29   0.07   0.84   0.40
<BLANKLINE>

14.5 Position-Specific Scoring Matrices
Using the background distribution and the PWM with pseudocounts added, it's easy to compute the log-odds ratios, which tell us the log odds of a particular symbol
coming from the motif rather than from the background. We can use the .log_odds() method on the position weight matrix:

>>> pssm = pwm.log_odds()
>>> print(pssm)
        0      1      2      3      4
A:   0.68   1.76  -1.91   0.21  -0.49
C:  -2.49  -2.49   1.26   0.09   1.51
G:  -2.49  -2.49  -2.49   0.60  -2.49
T:   1.03  -1.91   0.21  -1.91  -1.91
<BLANKLINE>

Here we can see positive values for symbols more frequent in the motif than in the background and negative values for symbols more frequent in the background. 0.0 means that
it's equally likely to see a symbol in the background and in the motif.
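
As an illustration of where these numbers come from, the sketch below computes base-2 log-odds for a single PWM column. It is a simplified stand-in for the library code: the function name log_odds_column is hypothetical, and the probabilities are those of the first column of the PWM built with pseudocounts A = T = 0.6 and C = G = 0.4 from counts A = 3, C = 0, G = 0, T = 4.

```python
import math


def log_odds_column(probabilities, background=None):
    """Hypothetical helper: log2(p / b) per letter; zero probability gives -inf."""
    if background is None:
        # Uniform background over the letters present in the column.
        background = {letter: 1.0 / len(probabilities) for letter in probabilities}
    scores = {}
    for letter, p in probabilities.items():
        b = background[letter]
        scores[letter] = math.log2(p / b) if p > 0 else float("-inf")
    return scores


# First column of the PWM: (count + pseudocount) / 9 for each letter.
column0 = {"A": 3.6 / 9, "C": 0.4 / 9, "G": 0.4 / 9, "T": 4.6 / 9}
scores0 = log_odds_column(column0)
for letter in "ACGT":
    print("%s: %6.2f" % (letter, scores0[letter]))
```

A probability above the background value of 0.25 gives a positive score, a probability below it gives a negative score, and a probability of exactly 0.25 gives 0.0.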

This assumes that A, C, G, and T are equally likely in the background. To calculate the position-specific scoring matrix against a background with unequal probabilities
for A, C, G, T, use the background argument. For example, against a background with a 40% GC content, use

>>> background = {'A': 0.3, 'C': 0.2, 'G': 0.2, 'T': 0.3}
>>> pssm = pwm.log_odds(background)
>>> print(pssm)
        0      1      2      3      4
A:   0.42   1.49  -2.17  -0.05  -0.75
C:  -2.17  -2.17   1.58   0.42   1.83
G:  -2.17  -2.17  -2.17   0.92  -2.17
T:   0.77  -2.17  -0.05  -2.17  -2.17
<BLANKLINE>

The maximum and minimum score obtainable from the PSSM are stored in the .max and .min properties:

>>> print("%4.2f" % pssm.max)
6.59
>>> print("%4.2f" % pssm.min)
-10.85

The mean and standard deviation of the PSSM scores with respect to a specific background are calculated by the .mean and .std methods.

>>> mean = pssm.mean(background)
>>> std = pssm.std(background)
>>> print("mean = %0.2f, standard deviation = %0.2f" % (mean, std))
mean = 3.21, standard deviation = 2.59

A uniform background is used if background is not specified. The mean is particularly important, as its value is equal to the Kullback-Leibler divergence or relative
entropy, and is a measure for the information content of the motif compared to the background. As in Biopython the base-2 logarithm is used in the calculation of the log-
odds scores, the information content has units of bits.
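
The link between the mean score and the relative entropy can be checked by hand. The sketch below is an assumption-laden illustration rather than the library routine: it rebuilds the PWM of motif m from its counts (as shown in Section 14.3) with pseudocounts A = T = 0.6 and C = G = 0.4, and sums p × log2(p / b) over all letters and positions against the 40% GC background.

```python
import math

counts = {
    "A": [3, 7, 0, 2, 1],
    "C": [0, 0, 5, 2, 6],
    "G": [0, 0, 0, 3, 0],
    "T": [4, 0, 2, 0, 0],
}
pseudocounts = {"A": 0.6, "C": 0.4, "G": 0.4, "T": 0.6}
background = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}

relative_entropy = 0.0
for i in range(5):
    total = sum(counts[letter][i] + pseudocounts[letter] for letter in "ACGT")
    for letter in "ACGT":
        p = (counts[letter][i] + pseudocounts[letter]) / total
        # Each term is the motif probability times the log-odds score (in bits).
        relative_entropy += p * math.log2(p / background[letter])

print("relative entropy = %0.2f bits" % relative_entropy)
```

The result agrees with the mean of 3.21 reported by pssm.mean(background) above.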

The .reverse_complement, .consensus, .anticonsensus, and .degenerate_consensus methods can be applied directly to PSSM objects.

14.6 Searching for instances
The most frequent use for a motif is to find its instances in some sequence. For the sake of this section, we will use an artificial sequence like this:

>>> test_seq = Seq("TACACTGCATTACAACCCAAGCATTA", m.alphabet)
>>> len(test_seq)
26

14.6.1 Searching for exact matches
The simplest way to find instances is to look for exact matches of the true instances of the motif:
>>> for pos, seq in m.instances.search(test_seq):
...     print("%i %s" % (pos, seq))
...
0 TACAC
10 TACAA
13 AACCC

We can do the same with the reverse complement (to find instances on the complementary strand):

>>> for pos, seq in r.instances.search(test_seq):
...     print("%i %s" % (pos, seq))
...
6 GCATT
20 GCATT
14.6.2 Searching for matches using the PSSM score

It's just as easy to look for positions that give rise to high log-odds scores against our motif:

>>> for position, score in pssm.search(test_seq, threshold=3.0):
...     print("Position %d: score = %5.3f" % (position, score))
...
Position 0: score = 5.622
Position -20: score = 4.601
Position 10: score = 3.037
Position 13: score = 5.738
Position -6: score = 4.601

The negative positions refer to instances of the motif found on the reverse strand of the test sequence, and follow the Python convention on negative indices. Therefore,
the instance of the motif at pos is located at test_seq[pos:pos+len(m)] both for positive and for negative values of pos.
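
The indexing convention can be illustrated with an ordinary Python string, without any Biopython objects. In this sketch the positions are the ones reported by the search above (the reverse-strand hits being -20 and -6), and reverse_complement is a small hypothetical helper written for the example.

```python
test_seq = "TACACTGCATTACAACCCAAGCATTA"
motif_length = 5

# Translation table for complementing unambiguous DNA letters.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Hypothetical helper: reverse complement of a plain DNA string."""
    return sequence.translate(COMPLEMENT)[::-1]


for pos in (0, -20, 10, 13, -6):
    site = test_seq[pos:pos + motif_length]
    if pos < 0:
        # Reverse-strand hit: the slice is forward-strand text, so the motif
        # itself reads as the reverse complement of that slice.
        print("%4d %s (motif orientation: %s)" % (pos, site, reverse_complement(site)))
    else:
        print("%4d %s" % (pos, site))
```

Position -20 selects the same five letters as position 6 in this 26-letter sequence, which is where the exact-match search on the reverse complement reported GCATT.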

You may notice the threshold parameter, here set arbitrarily to 3.0. This is in log2, so we are now looking only for words that are eight times more likely to occur under
the motif model than in the background. The default threshold is 0.0, which selects everything that looks more like the motif than the background.

You can also calculate the scores at all positions along the sequence:

>>> pssm.calculate(test_seq)
array([  5.62230396,  -5.6796999 ,  -3.43177247,   0.93827754,
        -6.84962511,  -2.04066086, -10.84962463,  -3.65614533,
        -0.03370807,  -3.91102552,   3.03734159,  -2.14918518,
        -0.6016975 ,   5.7381525 ,  -0.50977498,  -3.56422281,
        -8.73414803,  -0.09919716,  -0.6016975 ,  -2.39429784,
       -10.84962463,  -3.65614533], dtype=float32)

In general, this is the fastest way to calculate PSSM scores. The scores returned by pssm.calculate are for the forward strand only. To obtain the scores on the reverse
strand, you can take the reverse complement of the PSSM:
>>> rpssm = pssm.reverse_complement()
>>> rpssm.calculate(test_seq)
array([ -9.43458748,  -3.06172252,  -7.18665981,  -7.76216221,
        -2.04066086,  -4.26466274,   4.60124254,  -4.2480607 ,
        -8.73414803,  -2.26503372,  -6.49598789,  -5.64668512,
        -8.73414803, -10.84962463,  -4.82356262,  -4.82356262,
        -5.64668512,  -8.73414803,  -4.15613794,  -5.6796999 ,
         4.60124254,  -4.2480607 ], dtype=float32)

14.6.3 Selecting a score threshold
If you want to use a less arbitrary way of selecting thresholds, you can explore the distribution of PSSM scores. Since the space for a score distribution grows
exponentially with motif length, we use an approximation with a given precision to keep the computation cost manageable:

>>> distribution = pssm.distribution(background=background, precision=10**4)

The distribution object can be used to determine a number of different thresholds. We can specify the requested false-positive rate (probability of finding a motif
instance in background-generated sequence):

>>> threshold = distribution.threshold_fpr(0.01)
>>> print("%5.3f" % threshold)
4.009

or the false-negative rate (probability of not finding an instance generated from the motif):

>>> threshold = distribution.threshold_fnr(0.1)
>>> print("%5.3f" % threshold)
-0.510

or a threshold (approximately) satisfying some relation between the false-positive rate and the false-negative rate (fnr/fpr ≈ t):

>>> threshold = distribution.threshold_balanced(1000)
>>> print("%5.3f" % threshold)
6.241

or a threshold satisfying (roughly) the equality between the log of the false-positive rate and the information content (as used in the patser software by Hertz and Stormo):

>>> threshold = distribution.threshold_patser()
>>> print("%5.3f" % threshold)
0.346

For example, in the case of our motif, you can get the threshold giving you exactly the same results (for this sequence) as searching for instances with a balanced threshold
with a rate of 1000.

>>> threshold = distribution.threshold_fpr(0.01)
>>> print("%5.3f" % threshold)
4.009
>>> for position, score in pssm.search(test_seq, threshold=threshold):
...     print("Position %d: score = %5.3f" % (position, score))
...
Position 0: score = 5.622
Position -20: score = 4.601
Position 13: score = 5.738
Position -6: score = 4.601

14.7 Each motif object has an associated Position-Specific Scoring Matrix
To facilitate searching for potential TFBSs using PSSMs, both the position weight matrix and the position-specific scoring matrix are associated with each motif. Using
the Arnt motif as an example:

>>> from Bio import motifs
>>> with open("Arnt.sites") as handle:
...     motif = motifs.read(handle, 'sites')
...
>>> print(motif.counts)
        0      1      2      3      4      5
A:   4.00  19.00   0.00   0.00   0.00   0.00
C:  16.00   0.00  20.00   0.00   0.00   0.00
G:   0.00   1.00   0.00  20.00   0.00  20.00
T:   0.00   0.00   0.00   0.00  20.00   0.00
<BLANKLINE>

>>> print(motif.pwm)
        0      1      2      3      4      5
A:   0.20   0.95   0.00   0.00   0.00   0.00
C:   0.80   0.00   1.00   0.00   0.00   0.00
G:   0.00   0.05   0.00   1.00   0.00   1.00
T:   0.00   0.00   0.00   0.00   1.00   0.00
<BLANKLINE>

>>> print(motif.pssm)
        0      1      2      3      4      5
A:  -0.32   1.93   -inf   -inf   -inf   -inf
C:   1.68   -inf   2.00   -inf   -inf   -inf
G:   -inf  -2.32   -inf   2.00   -inf   2.00
T:   -inf   -inf   -inf   -inf   2.00   -inf
<BLANKLINE>
The negative infinities appear here because the corresponding entry in the frequency matrix is 0, and we are using zero pseudocounts by default:

>>> for letter in "ACGT":
...     print("%s: %4.2f" % (letter, motif.pseudocounts[letter]))
...
A: 0.00
C: 0.00
G: 0.00
T: 0.00

If you change the .pseudocounts attribute, the position frequency matrix and the position-specific scoring matrix are recalculated automatically:
>>> motif.pseudocounts = 3.0
>>> for letter in "ACGT":
...     print("%s: %4.2f" % (letter, motif.pseudocounts[letter]))
...
A: 3.00
C: 3.00
G: 3.00
T: 3.00

>>> print(motif.pwm)
        0      1      2      3      4      5
A:   0.22   0.69   0.09   0.09   0.09   0.09
C:   0.59   0.09   0.72   0.09   0.09   0.09
G:   0.09   0.12   0.09   0.72   0.09   0.72
T:   0.09   0.09   0.09   0.09   0.72   0.09
<BLANKLINE>

>>> print(motif.pssm)
        0      1      2      3      4      5
A:  -0.19   1.46  -1.42  -1.42  -1.42  -1.42
C:   1.25  -1.42   1.52  -1.42  -1.42  -1.42
G:  -1.42  -1.00  -1.42   1.52  -1.42   1.52
T:  -1.42  -1.42  -1.42  -1.42   1.52  -1.42
<BLANKLINE>

You can also set the .pseudocounts to a dictionary over the four nucleotides if you want to use different pseudocounts for them. Setting motif.pseudocounts to None resets
it to its default value of zero.

The position-specific scoring matrix depends on the background distribution, which is uniform by default:
>>> for letter in "ACGT":
...     print("%s: %4.2f" % (letter, motif.background[letter]))
...
A: 0.25
C: 0.25
G: 0.25
T: 0.25

Again, if you modify the background distribution, the position-specific scoring matrix is recalculated:

>>> motif.background = {'A': 0.2, 'C': 0.3, 'G': 0.3, 'T': 0.2}
>>> print(motif.pssm)
        0      1      2      3      4      5
A:   0.13   1.78  -1.09  -1.09  -1.09  -1.09
C:   0.98  -1.68   1.26  -1.68  -1.68  -1.68
G:  -1.68  -1.26  -1.68   1.26  -1.68   1.26
T:  -1.09  -1.09  -1.09  -1.09   1.85  -1.09
<BLANKLINE>

Setting motif.background to None resets it to a uniform distribution:
>>> motif.background = None
>>> for letter in "ACGT":
...     print("%s: %4.2f" % (letter, motif.background[letter]))
...
A: 0.25
C: 0.25
G: 0.25
T: 0.25

If you set motif.background equal to a single value, it will be interpreted as the GC content:

>>> motif.background = 0.8
>>> for letter in "ACGT":
...     print("%s: %4.2f" % (letter, motif.background[letter]))
...
A: 0.10
C: 0.40
G: 0.40
T: 0.10

Note that you can now calculate the mean of the PSSM scores over the background against which it was computed:

>>> print("%f" % motif.pssm.mean(motif.background))
4.703928

as well as its standard deviation:
>>> print("%f" % motif.pssm.std(motif.background))
3.290900

and its distribution:
>>> distribution = motif.pssm.distribution(background=motif.background)
>>> threshold = distribution.threshold_fpr(0.01)
>>> print("%f" % threshold)
3.854375

Note that the position weight matrix and the position-specific scoring matrix are recalculated each time you call motif.pwm or motif.pssm, respectively. If speed is an issue
and you want to use the PWM or PSSM repeatedly, you can save them as a variable, as in

>>> pssm = motif.pssm

14.8 Comparing motifs
Once we have more than one motif, we might want to compare them.

Before we start comparing motifs, I should point out that motif boundaries are usually quite arbitrary. This means we often need to compare motifs of different lengths, so
comparison needs to involve some kind of alignment. This means we have to take into account two things:

alignment of motifs
some function to compare aligned motifs

To align the motifs, we use ungapped alignment of PSSMs and substitute zeros for any missing columns at the beginning and end of the matrices. This means that
effectively we are using the background distribution for columns missing from the PSSM. The distance function then returns the minimal distance between motifs, as well
as the corresponding offset in their alignment.

To give an example, let us first load another motif, which is similar to our test motif m:

>>> with open("REB1.pfm") as handle:
...     m_reb1 = motifs.read(handle, "pfm")
...
>>> m_reb1.consensus
Seq('GTTACCCGG', IUPACUnambiguousDNA())
>>> print(m_reb1.counts)
        0      1      2      3      4      5      6      7      8
A:  30.00   0.00   0.00 100.00   0.00   0.00   0.00   0.00  15.00
C:  10.00   0.00   0.00   0.00 100.00 100.00 100.00   0.00  15.00
G:  50.00   0.00   0.00   0.00   0.00   0.00   0.00  60.00  55.00
T:  10.00 100.00 100.00   0.00   0.00   0.00   0.00  40.00  15.00
<BLANKLINE>

To make the motifs comparable, we choose the same values for the pseudocounts and the background distribution as our motif m:

>>> m_reb1.pseudocounts = {'A': 0.6, 'C': 0.4, 'G': 0.4, 'T': 0.6}
>>> m_reb1.background = {'A': 0.3, 'C': 0.2, 'G': 0.2, 'T': 0.3}
>>> pssm_reb1 = m_reb1.pssm
>>> print(pssm_reb1)
        0      1      2      3      4      5      6      7      8
A:   0.00  -5.67  -5.67   1.72  -5.67  -5.67  -5.67  -5.67  -0.97
C:  -0.97  -5.67  -5.67  -5.67   2.30   2.30   2.30  -5.67  -0.41
G:   1.30  -5.67  -5.67  -5.67  -5.67  -5.67  -5.67   1.57   1.44
T:  -1.53   1.72   1.72  -5.67  -5.67  -5.67  -5.67   0.41  -0.97
<BLANKLINE>

We'll compare these motifs using the Pearson correlation. Since we want it to resemble a distance measure, we actually take 1 - r, where r is the Pearson correlation
coefficient (PCC):

>>> distance, offset = pssm.dist_pearson(pssm_reb1)
>>> print("distance = %5.3g" % distance)
distance = 0.239
>>> print(offset)
-2

This means that the best PCC between motif m and m_reb1 is obtained with the following alignment:

m:      bbTACGCbb
m_reb1: GTTACCCGG

where b stands for the background distribution. The PCC itself is roughly 1 - 0.239 = 0.761.

14.9 De novo motif finding
Currently, Biopython has only limited support for de novo motif finding. Namely, we support running xxmotif and also parsing of MEME. Since the number of motif-
finding tools is growing rapidly, contributions of new parsers are welcome.

14.9.1 MEME

Let's assume you have run MEME on sequences of your choice with your favorite parameters and saved the output in the file meme.out. You can retrieve the motifs
reported by MEME by running the following piece of code:

>>> from Bio import motifs
>>> with open("meme.out") as handle:
...     motifsM = motifs.parse(handle, "meme")
...

>>> motifsM
[<Bio.motifs.meme.Motif object at 0xc356b0>]

Besides the most-wanted list of motifs, the result object contains more useful information, accessible through properties with self-explanatory names:

.alphabet
.datafile
.sequence_names
.version
.command

The motifs returned by the MEME parser can be treated exactly like regular Motif objects (with instances); they also provide some extra functionality by adding
additional information about the instances.

>>> motifsM[0].consensus
Seq('CTCAATCGTA', IUPACUnambiguousDNA())
>>> motifsM[0].instances[0].sequence_name
'SEQ10;'
>>> motifsM[0].instances[0].start
3
>>> motifsM[0].instances[0].strand
'+'

>>> motifsM[0].instances[0].pvalue
8.71e-07

14.10 Useful links
Sequence motif in Wikipedia
PWM in Wikipedia
Consensus sequence in Wikipedia
Comparison of different motif-finding programs

Chapter 15 Cluster analysis
Cluster analysis is the grouping of items into clusters based on the similarity of the items to each other. In bioinformatics, clustering is widely used in gene expression
data analysis to find groups of genes with similar gene expression profiles. This may identify functionally related genes, as well as suggest the function of presently
unknown genes.

The Biopython module Bio.Cluster provides commonly used clustering algorithms and was designed with the application to gene expression data in mind. However, this
module can also be used for cluster analysis of other types of data. Bio.Cluster and the underlying C Clustering Library are described by De Hoon et al. [14].

The following four clustering approaches are implemented in Bio.Cluster:

Hierarchical clustering (pairwise centroid-, single-, complete-, and average-linkage)
k-means, k-medians, and k-medoids clustering
Self-Organizing Maps
Principal Component Analysis.

Data representation
The data to be clustered are represented by an n × m Numerical Python array data. Within the context of gene expression data clustering, typically the rows correspond to
different genes whereas the columns correspond to different experimental conditions. The clustering algorithms in Bio.Cluster can be applied both to rows (genes) and to
columns (experiments).

Missing values

Often in microarray experiments, some of the data values are missing, which is indicated by an additional n × m Numerical Python integer array mask. If mask[i,j]==0,
then data[i,j] is missing and is ignored in the analysis.

Random number generator
The k-means/medians/medoids clustering algorithms and Self-Organizing Maps (SOMs) include the use of a random number generator. The uniform random number
generator in Bio.Cluster is based on the algorithm by L'Ecuyer [25], while random numbers following the binomial distribution are generated using the BTPE algorithm
by Kachitvichyanukul and Schmeiser [23]. The random number generator is initialized automatically during its first call. As this random number generator uses a
combination of two multiplicative linear congruential generators, two (integer) seeds are needed for initialization, for which we use the system-supplied random number
generator rand (in the C standard library). We initialize this generator by calling srand with the epoch time in seconds, and use the first two random numbers generated by
rand as seeds for the uniform random number generator in Bio.Cluster.

15.1 Distance functions
In order to cluster items into groups based on their similarity, we should first define what exactly we mean by similar. Bio.Cluster provides eight distance functions,
indicated by a single character, to measure similarity, or conversely, distance:

'e': Euclidean distance
'b': City-block distance.
'c': Pearson correlation coefficient
'a': Absolute value of the Pearson correlation coefficient
'u': Uncentered Pearson correlation (equivalent to the cosine of the angle between two data vectors)
'x': Absolute uncentered Pearson correlation
's': Spearman's rank correlation
'k': Kendall's τ.

The first two are true distance functions that satisfy the triangle inequality:

$$d(u, v) \le d(u, w) + d(w, v) \quad \text{for all } u, v, w,$$

and are therefore referred to as metrics. In everyday language, this means that the shortest distance between two points is a straight line.

The remaining six distance measures are related to the correlation coefficient, where the distance d is defined in terms of the correlation r by d = 1 − r. Note that these
distance functions are semi-metrics that do not satisfy the triangle inequality. For example, for

$$u = (1, 0, -1)$$

$$v = (1, 1, 0)$$

$$w = (0, 1, 1)$$

we find a Pearson distance d(u,w) = 1.8660, while d(u,v) + d(v,w) = 1.6340.
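
These numbers can be reproduced with a few lines of plain Python. The sketch assumes the vectors are u = (1, 0, −1), v = (1, 1, 0) and w = (0, 1, 1), which is the choice of signs that yields the distances quoted above; pearson_distance is a hypothetical helper, not a Bio.Cluster function.

```python
import math


def pearson_distance(x, y):
    """Hypothetical helper: 1 - r, with r the Pearson correlation of x and y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1.0 - covariance / (norm_x * norm_y)


u = (1, 0, -1)
v = (1, 1, 0)
w = (0, 1, 1)

d_uw = pearson_distance(u, w)
d_uv = pearson_distance(u, v)
d_vw = pearson_distance(v, w)
print("d(u,w) = %.4f" % d_uw)
print("d(u,v) + d(v,w) = %.4f" % (d_uv + d_vw))
```

The direct distance d(u,w) exceeds the detour via v, so the triangle inequality is violated for this measure.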

Euclidean distance
In Bio.Cluster, we define the Euclidean distance as

$$d = \frac{1}{n} \sum_{i=1}^{n} \left( x_i - y_i \right)^2.$$

Only those terms are included in the summation for which both $x_i$ and $y_i$ are present, and the denominator n is chosen accordingly. As the expression data $x_i$ and $y_i$ are
subtracted directly from each other, we should make sure that the expression data are properly normalized when using the Euclidean distance.

City-block distance

The city-block distance, alternatively known as the Manhattan distance, is related to the Euclidean distance. Whereas the Euclidean distance corresponds to the length of
the shortest path between two points, the city-block distance is the sum of distances along each dimension. As gene expression data tend to have missing values, in
Bio.Cluster we define the city-block distance as the sum of distances divided by the number of dimensions:

$$d = \frac{1}{n} \sum_{i=1}^{n} \left| x_i - y_i \right|.$$

This is equal to the distance you would have to walk between two points in a city, where you have to walk along city blocks. As for the Euclidean distance, the expression
data are subtracted directly from each other, and we should therefore make sure that they are properly normalized.
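
A minimal sketch of both definitions, written in plain Python rather than with Bio.Cluster, shows how missing values shrink the denominator. The function names are hypothetical, and missing values are represented here by None instead of a separate mask array. The Euclidean variant follows the definition in the text, that is, the average of squared differences with no square root taken.

```python
def present_pairs(x, y):
    """Keep only positions where both values are present (not None)."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    if not pairs:
        raise ValueError("no position has both values present")
    return pairs


def euclidean_sketch(x, y):
    """Average squared difference over the positions present in both vectors."""
    pairs = present_pairs(x, y)
    return sum((a - b) ** 2 for a, b in pairs) / len(pairs)


def cityblock_sketch(x, y):
    """Average absolute difference over the positions present in both vectors."""
    pairs = present_pairs(x, y)
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


x = [1.0, 2.0, None]
y = [3.0, 2.0, 5.0]
print(euclidean_sketch(x, y))  # (4 + 0) / 2
print(cityblock_sketch(x, y))  # (2 + 0) / 2
```

The third position is skipped because x has no value there, so n is 2 rather than 3 in both averages.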

The Pearson correlation coefficient
The Pearson correlation coefficient is defined as

$$r = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{\sigma_x} \right) \left( \frac{y_i - \bar{y}}{\sigma_y} \right),$$

in which $\bar{x}$, $\bar{y}$ are the sample means of x and y respectively, and $\sigma_x$, $\sigma_y$ are the sample standard deviations of x and y. The Pearson correlation coefficient is a measure for
how well a straight line can be fitted to a scatterplot of x and y. If all the points in the scatterplot lie on a straight line, the Pearson correlation coefficient is either +1 or −1,
depending on whether the slope of the line is positive or negative. If the Pearson correlation coefficient is equal to zero, there is no correlation between x and y.

The Pearson distance is then defined as

$$d_P \equiv 1 - r.$$

As the Pearson correlation coefficient lies between −1 and 1, the Pearson distance lies between 0 and 2.

Absolute Pearson correlation
By taking the absolute value of the Pearson correlation, we find a number between 0 and 1. If the absolute value is 1, all the points in the scatter plot lie on a straight line
with either a positive or a negative slope. If the absolute value is equal to zero, there is no correlation between x and y.

The corresponding distance is defined as

$$d_A \equiv 1 - \left| r \right|,$$

where r is the Pearson correlation coefficient. As the absolute value of the Pearson correlation coefficient lies between 0 and 1, the corresponding distance lies between 0
and 1 as well.

In the context of gene expression experiments, the absolute correlation is equal to 1 if the gene expression profiles of two genes are either exactly the same or exactly
opposite. The absolute correlation coefficient should therefore be used with care.

Uncentered correlation (cosine of the angle)

In some cases, it may be preferable to use the uncentered correlation instead of the regular Pearson correlation coefficient. The uncentered correlation is defined as

$$r_U = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{x_i}{\sigma_x^{(0)}} \right) \left( \frac{y_i}{\sigma_y^{(0)}} \right),$$

where

$$\sigma_x^{(0)} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} x_i^2}$$

$$\sigma_y^{(0)} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} y_i^2}.$$

This is the same expression as for the regular Pearson correlation coefficient, except that the sample means $\bar{x}$, $\bar{y}$ are set equal to zero. The uncentered correlation may be
appropriate if there is a zero reference state. For instance, in the case of gene expression data given in terms of log-ratios, a log-ratio equal to zero corresponds to the
green and red signal being equal, which means that the experimental manipulation did not affect the gene expression.

The distance corresponding to the uncentered correlation coefficient is defined as

$$d_U \equiv 1 - r_U,$$

where $r_U$ is the uncentered correlation. As the uncentered correlation coefficient lies between −1 and 1, the corresponding distance lies between 0 and 2.

The uncentered correlation is equal to the cosine of the angle of the two data vectors in n-dimensional space, and is often referred to as such.

Absolute uncentered correlation
As for the regular Pearson correlation, we can define a distance measure using the absolute value of the uncentered correlation:

$$d_{AU} \equiv 1 - \left| r_U \right|,$$

where $r_U$ is the uncentered correlation coefficient. As the absolute value of the uncentered correlation coefficient lies between 0 and 1, the corresponding distance lies
between 0 and 1 as well.

Geometrically, the absolute value of the uncentered correlation is equal to the cosine between the supporting lines of the two data vectors (i.e., the angle without taking
the direction of the vectors into consideration).
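
The equivalence with the cosine can be verified numerically. The following sketch uses two hypothetical helper functions in plain Python: one evaluates the uncentered correlation as the average of products of values scaled by their root-mean-square, the other evaluates the cosine as the dot product divided by the product of the vector lengths.

```python
import math


def uncentered_correlation(x, y):
    """Average of products after scaling each vector by its root-mean-square."""
    n = len(x)
    sigma_x0 = math.sqrt(sum(a * a for a in x) / n)
    sigma_y0 = math.sqrt(sum(b * b for b in y) / n)
    return sum((a / sigma_x0) * (b / sigma_y0) for a, b in zip(x, y)) / n


def cosine(x, y):
    """Cosine of the angle between x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))


x = (1.0, 2.0, 3.0)
y = (2.0, -1.0, 4.0)
print(uncentered_correlation(x, y), cosine(x, y))
```

Both functions return the same value up to floating-point rounding, because the factors of n cancel between the numerator and the two root-mean-square terms.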

Spearman rank correlation

The Spearman rank correlation is an example of a non-parametric similarity measure, and tends to be more robust against outliers than the Pearson correlation.

To calculate the Spearman rank correlation, we replace each data value by its rank when the data in each vector are ordered by value. We then calculate the
Pearson correlation between the two rank vectors instead of the data vectors.

As in the case of the Pearson correlation, we can define a distance measure corresponding to the Spearman rank correlation as

$$d_S \equiv 1 - r_S,$$

where $r_S$ is the Spearman rank correlation.

Kendall's τ
Kendall's τ is another example of a non-parametric similarity measure. It is similar to the Spearman rank correlation, but instead of the ranks themselves only the relative
ranks are used to calculate τ (see Snedecor & Cochran [29]).

We can define a distance measure corresponding to Kendall's τ as

$$d_K \equiv 1 - \tau.$$

As Kendall's τ is always between −1 and 1, the corresponding distance will be between 0 and 2.

Weighting

For most of the distance functions available in Bio.Cluster, a weight vector can be applied. The weight vector contains weights for the items in the data vector. If the
weight for item i is $w_i$, then that item is treated as if it occurred $w_i$ times in the data. The weights do not have to be integers. For the Spearman rank correlation and
Kendall's τ, weights do not have a well-defined meaning and are therefore not implemented.
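
The two-step recipe (rank, then correlate) can be sketched as follows. This is an illustration in plain Python, not the Bio.Cluster code; the helper names are hypothetical, and giving tied values the average of their ranks is an assumption made for this example.

```python
import math


def ranks(values):
    """Rank values from 1 upward; tied values share their average rank (assumption)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at sorted position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation coefficient of two equally long vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return covariance / (norm_x * norm_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


x = [1.0, 10.0, 100.0, 1000.0]
y = [2.0, 3.0, 5.0, 4.0]
print("r_S = %.2f" % spearman(x, y))
```

Because only the ordering matters, the large spread in x has no influence on the result, which is what makes the measure robust against outliers.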

Calculating the distance matrix
The distance matrix is a square matrix with all pairwise distances between the items in data, and can be calculated by the function distancematrix in the Bio.Cluster
module:

>>> from Bio.Cluster import distancematrix
>>> matrix = distancematrix(data)

where the following arguments are defined:

data (required)
    Array containing the data for the items.
mask (default: None)
    Array of integers showing which data are missing. If mask[i,j]==0, then data[i,j] is missing. If mask==None, then all data are present.
weight (default: None)
    The weights to be used when calculating distances. If weight==None, then equal weights are assumed.
transpose (default: 0)
    Determines if the distances between the rows of data are to be calculated (transpose==0), or between the columns of data (transpose==1).
dist (default: 'e', Euclidean distance)
    Defines the distance function to be used (see 15.1).

To save memory, the distance matrix is returned as a list of 1D arrays. The number of columns in each row is equal to the row number. Hence, the first row has zero
elements. An example of the return value is

[array([]),
 array([1.]),
 array([7., 3.]),
 array([4., 2., 6.])]

This corresponds to the distance matrix

$$\begin{pmatrix} 0 & 1 & 7 & 4 \\ 1 & 0 & 3 & 2 \\ 7 & 3 & 0 & 6 \\ 4 & 2 & 6 & 0 \end{pmatrix}.$$
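
If a full square matrix is needed, the ragged lower-triangular form can be expanded by mirroring each stored entry. The sketch below uses plain Python lists in place of Numerical Python arrays, and to_square is a hypothetical helper rather than part of Bio.Cluster.

```python
def to_square(ragged):
    """Expand a ragged lower-triangular distance list into a full symmetric matrix."""
    n = len(ragged)
    full = [[0.0] * n for _ in range(n)]
    for i, row in enumerate(ragged):
        for j, distance in enumerate(row):
            # Row i stores the distances to items 0 .. i-1; mirror each one.
            full[i][j] = distance
            full[j][i] = distance
    return full


ragged = [[], [1.0], [7.0, 3.0], [4.0, 2.0, 6.0]]
for row in to_square(ragged):
    print(row)
```

The diagonal stays at zero because the distance of an item to itself is not stored in the ragged form.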

15.2 Calculating cluster properties
Calculating the cluster centroids
The centroid of a cluster can be defined either as the mean or as the median of each dimension over all cluster items. The function clustercentroids in Bio.Cluster can
be used to calculate either:

>>> from Bio.Cluster import clustercentroids
>>> cdata, cmask = clustercentroids(data)

where the following arguments are defined:

data (required)
    Array containing the data for the items.
mask (default: None)
    Array of integers showing which data are missing. If mask[i,j]==0, then data[i,j] is missing. If mask==None, then all data are present.
clusterid (default: None)
    Vector of integers showing to which cluster each item belongs. If clusterid is None, then all items are assumed to belong to the same cluster.
method (default: 'a')
    Specifies whether the arithmetic mean (method=='a') or the median (method=='m') is used to calculate the cluster center.
transpose (default: 0)
    Determines if the centroids of the rows of data are to be calculated (transpose==0), or the centroids of the columns of data (transpose==1).

This function returns the tuple (cdata, cmask). The centroid data are stored in the 2D Numerical Python array cdata, with missing data indicated by the 2D Numerical
Python integer array cmask. The dimensions of these arrays are (number of clusters, number of columns) if transpose is 0, or (number of rows, number of clusters) if
transpose is 1. Each row (if transpose is 0) or column (if transpose is 1) contains the averaged data corresponding to the centroid of each cluster.

Calculating the distance between clusters
Given a distance function between items, we can define the distance between two clusters in several ways. The distance between the arithmetic means of the two clusters
is used in pairwise centroid-linkage clustering and in k-means clustering. In k-medoids clustering, the distance between the medians of the two clusters is used instead.
The shortest pairwise distance between items of the two clusters is used in pairwise single-linkage clustering, while the longest pairwise distance is used in pairwise
maximum-linkage clustering. In pairwise average-linkage clustering, the distance between two clusters is defined as the average over the pairwise distances.

To calculate the distance between two clusters, use
>>> from Bio.Cluster import clusterdistance
>>> distance = clusterdistance(data)

where the following arguments are defined:

data (required)
    Array containing the data for the items.
mask (default: None)
    Array of integers showing which data are missing. If mask[i,j]==0, then data[i,j] is missing. If mask==None, then all data are present.
weight (default: None)
    The weights to be used when calculating distances. If weight==None, then equal weights are assumed.
index1 (default: 0)
    A list containing the indices of the items belonging to the first cluster. A cluster containing only one item i can be represented either as a list [i], or as an integer i.
index2 (default: 0)
    A list containing the indices of the items belonging to the second cluster. A cluster containing only one item i can be represented either as a list [i], or as an
    integer i.
method (default: 'a')
    Specifies how the distance between clusters is defined:
    'a': Distance between the two cluster centroids (arithmetic mean)
    'm': Distance between the two cluster centroids (median)
    's': Shortest pairwise distance between items in the two clusters
    'x': Longest pairwise distance between items in the two clusters
    'v': Average over the pairwise distances between items in the two clusters.
dist (default: 'e', Euclidean distance)
    Defines the distance function to be used (see 15.1).
transpose (default: 0)
    If transpose==0, calculate the distance between the rows of data. If transpose==1, calculate the distance between the columns of data.

15.3Partitioningalgorithms
Partitioningalgorithmsdivideitemsintokclusterssuchthatthesumofdistancesovertheitemstotheirclustercentersisminimal.Thenumberofclusterskisspecified
bytheuser.ThreepartitioningalgorithmsareavailableinBio.Cluster:

kmeansclustering
kmediansclustering
kmedoidsclustering

Thesealgorithmsdifferinhowtheclustercenterisdefined.Inkmeansclustering,theclustercenterisdefinedasthemeandatavectoraveragedoverallitemsinthe
cluster.Insteadofthemean,inkmediansclusteringthemedianiscalculatedforeachdimensioninthedatavector.Finally,inkmedoidsclusteringtheclustercenteris
definedastheitemwhichhasthesmallestsumofdistancestotheotheritemsinthecluster.Thisclusteringalgorithmissuitableforcasesinwhichthedistancematrixis
knownbuttheoriginaldatamatrixisnotavailable,forexamplewhenclusteringproteinsbasedontheirstructuralsimilarity.

Theexpectationmaximization(EM)algorithmisusedtofindthispartitioningintokgroups.IntheinitializationoftheEMalgorithm,werandomlyassignitemsto
clusters.Toensurethatnoemptyclustersareproduced,weusethebinomialdistributiontorandomlychoosethenumberofitemsineachclustertobeoneormore.We
thenrandomlypermutetheclusterassignmentstoitemssuchthateachitemhasanequalprobabilitytobeinanycluster.Eachclusteristhusguaranteedtocontainatleast
oneitem.

Wetheniterate:

Calculatethecentroidofeachcluster,definedaseitherthemean,themedian,orthemedoidofthecluster
Calculatethedistancesofeachitemtotheclustercenters
Foreachitem,determinewhichclustercentroidisclosest
Reassigneachitemtoitsclosestcluster,orstoptheiterationifnofurtheritemreassignmentstakeplace.
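
The iteration above can be sketched for the k-means case in plain Python. This is a simplified illustration and not the Bio.Cluster implementation: the function name kmeans_sketch is hypothetical, the initial assignment uses a shuffled round-robin labelling instead of the binomial scheme described in the text, distances are squared Euclidean, and a fixed iteration cap is used as a simple safeguard.

```python
import random


def kmeans_sketch(data, k, seed=0, max_iterations=100):
    """Hypothetical k-means sketch; requires len(data) >= k."""
    rng = random.Random(seed)
    n = len(data)
    # Simplified initialization: round-robin labels guarantee that every
    # cluster is non-empty, and shuffling makes the assignment random.
    labels = [i % k for i in range(n)]
    rng.shuffle(labels)
    for _ in range(max_iterations):
        # Centroid of each cluster: mean of its members in every dimension.
        centroids = []
        for cluster in range(k):
            members = [data[i] for i in range(n) if labels[i] == cluster]
            centroids.append([sum(column) / len(members) for column in zip(*members)])
        sizes = [labels.count(cluster) for cluster in range(k)]
        changed = False
        for i, item in enumerate(data):
            # Squared distance of this item to every centroid.
            distances = [
                sum((a - b) ** 2 for a, b in zip(item, centroid))
                for centroid in centroids
            ]
            closest = distances.index(min(distances))
            # Never move the last remaining item out of a cluster.
            if closest != labels[i] and sizes[labels[i]] > 1:
                sizes[labels[i]] -= 1
                sizes[closest] += 1
                labels[i] = closest
                changed = True
        if not changed:
            break
    return labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = kmeans_sketch(points, k=2)
print(labels)
```

For two well-separated groups such as these, the three points near the origin end up in one cluster and the three points near (10, 10) in the other; which group receives label 0 depends on the random initial assignment.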

To avoid clusters becoming empty during the iteration, in k-means and k-medians clustering the algorithm keeps track of the number of items in each cluster, and prohibits the last remaining item in a cluster from being reassigned to a different cluster. For k-medoids clustering, such a check is not needed, as the item that functions as the cluster centroid has a zero distance to itself, and will therefore never be closer to a different cluster.

As the initial assignment of items to clusters is done randomly, usually a different clustering solution is found each time the EM algorithm is executed. To find the optimal clustering solution, the k-means algorithm is repeated many times, each time starting from a different initial random clustering. The sum of distances of the items to their cluster center is saved for each run, and the solution with the smallest value of this sum will be returned as the overall clustering solution.

How often the EM algorithm should be run depends on the number of items being clustered. As a rule of thumb, we can consider how often the optimal solution was found; this number is returned by the partitioning algorithms as implemented in this library. If the optimal solution was found many times, it is unlikely that better solutions exist than the one that was found. However, if the optimal solution was found only once, there may well be other solutions with a smaller within-cluster sum of distances. If the number of items is large (more than several hundreds), it may be difficult to find the globally optimal solution.

The EM algorithm terminates when no further reassignments take place. We noticed that for some sets of initial cluster assignments, the EM algorithm fails to converge due to the same clustering solution reappearing periodically after a small number of iteration steps. We therefore check for the occurrence of such periodic solutions during the iteration. After a given number of iteration steps, the current clustering result is saved as a reference. By comparing the clustering result after each subsequent iteration step to the reference state, we can determine if a previously encountered clustering result is found. In such a case, the iteration is halted. If after a given number of iterations the reference state has not yet been encountered, the current clustering solution is saved to be used as the new reference state. Initially, ten iteration steps are executed before resaving the reference state. This number of iteration steps is doubled each time, to ensure that periodic behavior with longer periods can also be detected.
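
The reference-state check described above can be sketched generically. The following is an illustrative plain-Python sketch, not the library's code: the function name iterate_with_period_check and the toy step functions are hypothetical, and the sketch assumes that the starting state serves as the first reference.

```python
def iterate_with_period_check(step, state, first_period=10):
    """Apply `step` repeatedly until a fixed point or a periodic cycle is detected.

    Returns (final_state, reason), where reason is "converged" or "periodic".
    """
    reference = state        # saved clustering used for comparison
    period = first_period    # number of steps before the reference is re-saved
    counter = 0
    while True:
        new_state = step(state)
        if new_state == state:
            return new_state, "converged"   # no reassignments took place
        if new_state == reference:
            return new_state, "periodic"    # a saved clustering reappeared
        state = new_state
        counter += 1
        if counter == period:
            reference = state               # save a new reference state
            counter = 0
            period *= 2                     # allow for longer periods next time


# Toy example: after one step the assignments oscillate between two states.
cycle = {(0, 0, 1): (0, 1, 1), (0, 1, 1): (0, 1, 0), (0, 1, 0): (0, 1, 1)}
outcome = iterate_with_period_check(lambda s: cycle[s], (0, 0, 1))
```

Doubling the waiting time guarantees that the reference is eventually saved inside the cycle and that the waiting time exceeds the cycle length, so any periodic behavior is detected after a finite number of steps.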

k-means and k-medians
The k-means and k-medians algorithms are implemented as the function kcluster in Bio.Cluster:
>>> from Bio.Cluster import kcluster
>>> clusterid, error, nfound = kcluster(data)

where the following arguments are defined:

data (required)
Array containing the data for the items.
nclusters (default: 2)
The number of clusters k.
mask (default: None)
Array of integers showing which data are missing. If mask[i,j]==0, then data[i,j] is missing. If mask==None, then all data are present.
weight (default: None)
The weights to be used when calculating distances. If weight==None, then equal weights are assumed.
transpose (default: 0)
Determines if rows (transpose is 0) or columns (transpose is 1) are to be clustered.
npass (default: 1)
The number of times the k-means/-medians clustering algorithm is performed, each time with a different (random) initial condition. If initialid is given, the value of npass is ignored and the clustering algorithm is run only once, as it behaves deterministically in that case.
method (default: 'a')
Describes how the center of a cluster is found:
method=='a': arithmetic mean (k-means clustering);
method=='m': median (k-medians clustering).
For other values of method, the arithmetic mean is used.
dist (default: 'e', Euclidean distance)
Defines the distance function to be used (see 15.1). Whereas all eight distance measures are accepted by kcluster, from a theoretical viewpoint it is best to use the Euclidean distance for the k-means algorithm, and the city-block distance for k-medians.
initialid (default: None)
Specifies the initial clustering to be used for the EM algorithm. If initialid==None, then a different random initial clustering is used for each of the npass runs of the EM algorithm. If initialid is not None, then it should be equal to a 1D array containing the cluster number (between 0 and nclusters-1) for each item. Each cluster should contain at least one item. With the initial clustering specified, the EM algorithm is deterministic.

This function returns a tuple (clusterid, error, nfound), where clusterid is an integer array containing the number of the cluster to which each row or column was assigned, error is the within-cluster sum of distances for the optimal clustering solution, and nfound is the number of times this optimal solution was found.

k-medoids clustering

The kmedoids routine performs k-medoids clustering on a given set of items, using the distance matrix and the number of clusters passed by the user:

>>> from Bio.Cluster import kmedoids
>>> clusterid, error, nfound = kmedoids(distance)

where the following arguments are defined:

distance (required)
The matrix containing the distances between the items; this matrix can be specified in three ways:
as a 2D Numerical Python array (in which only the left-lower part of the array will be accessed):
distance = array([[0.0, 1.1, 2.3],
                  [1.1, 0.0, 4.5],
                  [2.3, 4.5, 0.0]])

as a 1D Numerical Python array containing consecutively the distances in the left-lower part of the distance matrix:
distance = array([1.1, 2.3, 4.5])

as a list containing the rows of the left-lower part of the distance matrix:
distance = [array([]),
            array([1.1]),
            array([2.3, 4.5])
           ]

These three expressions correspond to the same distance matrix.
nclusters (default: 2)
The number of clusters k.
npass (default: 1)
The number of times the k-medoids clustering algorithm is performed, each time with a different (random) initial condition. If initialid is given, the value of npass is ignored, as the clustering algorithm behaves deterministically in that case.
initialid (default: None)
Specifies the initial clustering to be used for the EM algorithm. If initialid==None, then a different random initial clustering is used for each of the npass runs of the EM algorithm. If initialid is not None, then it should be equal to a 1D array containing the cluster number (between 0 and nclusters-1) for each item. Each cluster should contain at least one item. With the initial clustering specified, the EM algorithm is deterministic.

This function returns a tuple (clusterid, error, nfound), where clusterid is an array containing the number of the cluster to which each item was assigned, error is the within-cluster sum of distances for the optimal k-medoids clustering solution, and nfound is the number of times the optimal solution was found. Note that the cluster number in clusterid is defined as the item number of the item representing the cluster centroid.
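
The three equivalent ways of writing the distance matrix accepted by kmedoids can be related to each other with a short sketch. It uses plain Python lists rather than Numerical Python arrays so that it runs without dependencies; the helper names lower_rows and lower_flat are hypothetical.

```python
def lower_rows(full):
    """Rows of the strictly left-lower triangle: row i holds i distances."""
    return [list(full[i][:i]) for i in range(len(full))]


def lower_flat(full):
    """The same distances written consecutively as one flat sequence."""
    return [d for row in lower_rows(full) for d in row]


full = [[0.0, 1.1, 2.3],
        [1.1, 0.0, 4.5],
        [2.3, 4.5, 0.0]]
rows = lower_rows(full)   # [[], [1.1], [2.3, 4.5]]
flat = lower_flat(full)   # [1.1, 2.3, 4.5]

# A flat sequence for n items always holds n*(n-1)/2 distances.
nitems = len(full)
expected_length = nitems * (nitems - 1) // 2
```

Only the left-lower part carries information, because the matrix is symmetric and its diagonal is zero.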

15.4 Hierarchical clustering
Hierarchical clustering methods are inherently different from the k-means clustering method. In hierarchical clustering, the similarity in the expression profile between genes or experimental conditions is represented in the form of a tree structure. This tree structure can be shown graphically by programs such as Treeview and Java Treeview, which has contributed to the popularity of hierarchical clustering in the analysis of gene expression data.

The first step in hierarchical clustering is to calculate the distance matrix, specifying all the distances between the items to be clustered. Next, we create a node by joining the two closest items. Subsequent nodes are created by pairwise joining of items or nodes based on the distance between them, until all items belong to the same node. A tree structure can then be created by retracing which items and nodes were merged. Unlike the EM algorithm, which is used in k-means clustering, the complete process of hierarchical clustering is deterministic.

Several flavors of hierarchical clustering exist, which differ in how the distance between subnodes is defined in terms of their members. In Bio.Cluster, pairwise single, maximum, average, and centroid linkage are available.

In pairwise single-linkage clustering, the distance between two nodes is defined as the shortest distance among the pairwise distances between the members of the two nodes.
In pairwise maximum-linkage clustering, alternatively known as pairwise complete-linkage clustering, the distance between two nodes is defined as the longest distance among the pairwise distances between the members of the two nodes.
In pairwise average-linkage clustering, the distance between two nodes is defined as the average over all pairwise distances between the items of the two nodes.
In pairwise centroid-linkage clustering, the distance between two nodes is defined as the distance between their centroids. The centroids are calculated by taking the mean over all the items in a cluster. As the distance from each newly formed node to existing nodes and items needs to be calculated at each step, the computing time of pairwise centroid-linkage clustering may be significantly longer than for the other hierarchical clustering methods. Another peculiarity is that (for a distance measure based on the Pearson correlation), the distances do not necessarily increase when going up in the clustering tree, and may even decrease. This is caused by an inconsistency between the centroid calculation and the distance calculation when using the Pearson correlation: whereas the Pearson correlation effectively normalizes the data for the distance calculation, no such normalization occurs for the centroid calculation.
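
The single, maximum, and average linkage definitions above can be written down directly in terms of a distance matrix. The sketch below is illustrative plain Python, not the library's implementation; the function name linkage_distance is hypothetical, and centroid linkage is deliberately left out because it cannot be computed from pairwise distances alone.

```python
def linkage_distance(distance, node_a, node_b, method):
    """Distance between two nodes, each given as a list of item indices.

    method: 's' (single), 'm' (maximum or complete) or 'a' (average).
    """
    pairs = [distance[i][j] for i in node_a for j in node_b]
    if method == "s":
        return min(pairs)
    if method == "m":
        return max(pairs)
    if method == "a":
        return sum(pairs) / len(pairs)
    raise ValueError("this sketch supports only 's', 'm' and 'a'")


distance = [[0.0, 1.0, 4.0, 6.0],
            [1.0, 0.0, 3.0, 5.0],
            [4.0, 3.0, 0.0, 2.0],
            [6.0, 5.0, 2.0, 0.0]]
single = linkage_distance(distance, [0, 1], [2, 3], "s")    # 3.0
maximum = linkage_distance(distance, [0, 1], [2, 3], "m")   # 6.0
average = linkage_distance(distance, [0, 1], [2, 3], "a")   # 4.5
```

For the same pair of nodes the single-linkage distance is never larger than the average, which in turn is never larger than the maximum-linkage distance.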

For pairwise single-, complete-, and average-linkage clustering, the distance between two nodes can be found directly from the distances between the individual items. Therefore, the clustering algorithm does not need access to the original gene expression data, once the distance matrix is known. For pairwise centroid-linkage clustering, however, the centroids of newly formed subnodes can only be calculated from the original data and not from the distance matrix.

The implementation of pairwise single-linkage hierarchical clustering is based on the SLINK algorithm (R. Sibson, 1973), which is much faster and more memory-efficient than a straightforward implementation of pairwise single-linkage clustering. The clustering result produced by this algorithm is identical to the clustering solution found by the conventional single-linkage algorithm. The single-linkage hierarchical clustering algorithm implemented in this library can be used to cluster large gene expression data sets, for which conventional hierarchical clustering algorithms fail due to excessive memory requirements and running time.

Representing a hierarchical clustering solution

The result of hierarchical clustering consists of a tree of nodes, in which each node joins two items or subnodes. Usually, we are not only interested in which items or subnodes are joined at each node, but also in their similarity (or distance) as they are joined. To store one node in the hierarchical clustering tree, we make use of the class Node, which is defined in Bio.Cluster. An instance of Node has three attributes:

left
right
distance

Here, left and right are integers referring to the two items or subnodes that are joined at this node, and distance is the distance between them. The items being clustered are numbered from 0 to (number of items - 1), while clusters are numbered from -1 to -(number of items - 1). Note that the number of nodes is one less than the number of items.
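
To illustrate the numbering convention, the sketch below lists the items found below each node of a small tree. It assumes that items are referred to by non-negative numbers and that the node created at step i is referred to by the negative number -(i + 1). The namedtuple is a hypothetical stand-in for Bio.Cluster.Node so that the sketch runs without Biopython, and members_per_node is likewise a name introduced only for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for Bio.Cluster.Node (same three attributes).
Node = namedtuple("Node", ["left", "right", "distance"])


def members_per_node(nodes):
    """For each node, list the items below it; node i is referred to as -(i + 1)."""
    members = []
    for node in nodes:
        items = []
        for child in (node.left, node.right):
            if child >= 0:
                items.append(child)                 # child is an item
            else:
                items.extend(members[-child - 1])   # child is an earlier node
        members.append(sorted(items))
    return members


# Five items joined by four nodes.
nodes = [Node(1, 2, 0.2), Node(0, 3, 0.5), Node(-2, 4, 0.6), Node(-1, -3, 0.9)]
groups = members_per_node(nodes)
```

Here the third node joins node -2 (items 0 and 3) with item 4, and the last node joins nodes -1 and -3, so that it contains all five items.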

To create a new Node object, we need to specify left and right; distance is optional.

>>> from Bio.Cluster import Node
>>> Node(2, 3)
(2, 3): 0
>>> Node(2, 3, 0.91)
(2, 3): 0.91

The attributes left, right, and distance of an existing Node object can be modified directly:

>>> node = Node(4, 5)
>>> node.left = 6
>>> node.right = 2
>>> node.distance = 0.73
>>> node
(6, 2): 0.73

An error is raised if left and right are not integers, or if distance cannot be converted to a floating-point value.

The Python class Tree represents a full hierarchical clustering solution. A Tree object can be created from a list of Node objects:

>>> from Bio.Cluster import Node, Tree
>>> nodes = [Node(1, 2, 0.2), Node(0, 3, 0.5), Node(-2, 4, 0.6), Node(-1, -3, 0.9)]
>>> tree = Tree(nodes)
>>> print(tree)
(1, 2): 0.2
(0, 3): 0.5
(-2, 4): 0.6
(-1, -3): 0.9

The Tree initializer checks if the list of nodes is a valid hierarchical clustering result:
>>> nodes = [Node(1, 2, 0.2), Node(0, 2, 0.5)]
>>> Tree(nodes)
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
ValueError: Inconsistent tree

Individual nodes in a Tree object can be accessed using square brackets:

>>> nodes = [Node(1, 2, 0.2), Node(0, -1, 0.5)]
>>> tree = Tree(nodes)
>>> tree[0]
(1, 2): 0.2
>>> tree[1]
(0, -1): 0.5
>>> tree[-1]
(0, -1): 0.5

As a Tree object is read-only, we cannot change individual nodes in a Tree object. However, we can convert the tree to a list of nodes, modify this list, and create a new tree from this list:
>>> tree = Tree([Node(1, 2, 0.1), Node(0, -1, 0.5), Node(-2, 3, 0.9)])
>>> print(tree)
(1, 2): 0.1
(0, -1): 0.5
(-2, 3): 0.9
>>> nodes = tree[:]
>>> nodes[0] = Node(0, 1, 0.2)
>>> nodes[1].left = 2
>>> tree = Tree(nodes)
>>> print(tree)
(0, 1): 0.2
(2, -1): 0.5
(-2, 3): 0.9

This guarantees that any Tree object is always well-formed.

To display a hierarchical clustering solution with visualization programs such as Java Treeview, it is better to scale all node distances such that they are between zero and one. This can be accomplished by calling the scale method on an existing Tree object:

>>> tree.scale()

This method takes no arguments, and returns None.

After hierarchical clustering, the items can be grouped into k clusters based on the tree structure stored in the Tree object by cutting the tree:
>>> clusterid = tree.cut(nclusters=1)

where nclusters (defaulting to 1) is the desired number of clusters k. This method ignores the top k-1 linking events in the tree structure, resulting in k separated clusters of items. The number of clusters k should be positive, and less than or equal to the number of items. This method returns an array clusterid containing the number of the cluster to which each item is assigned.
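
The idea of cutting a tree by ignoring its top k-1 linking events can be sketched with a small union-find structure. This is an illustrative plain-Python sketch, not the Tree.cut implementation: the function name cut_tree is hypothetical, nodes are given as (left, right) pairs in the order in which the joins were made, the last joins are assumed to be the top ones, and the cluster numbers it assigns need not match those chosen by the library.

```python
def cut_tree(nodes, nitems, nclusters):
    """Assign items to nclusters clusters by ignoring the top nclusters-1 joins.

    nodes: list of (left, right) pairs; negative numbers refer to earlier nodes,
    with the node created at step i referred to as -(i + 1).
    """
    parent = list(range(nitems))   # union-find forest over the items

    def find(i):
        while parent[i] != i:
            i = parent[i]
        return i

    representative = []            # one item standing in for each processed node
    for left, right in nodes[:len(nodes) - (nclusters - 1)]:
        a = left if left >= 0 else representative[-left - 1]
        b = right if right >= 0 else representative[-right - 1]
        parent[find(b)] = find(a)  # merge the two groups
        representative.append(a)
    roots = sorted({find(i) for i in range(nitems)})
    return [roots.index(find(i)) for i in range(nitems)]


nodes = [(1, 2), (0, 3), (-2, 4), (-1, -3)]
two_clusters = cut_tree(nodes, nitems=5, nclusters=2)
one_cluster = cut_tree(nodes, nitems=5, nclusters=1)
```

With k equal to 2 only the final join is ignored, which leaves items 1 and 2 in one cluster and items 0, 3, and 4 in the other.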

Performing hierarchical clustering
To perform hierarchical clustering, use the treecluster function in Bio.Cluster.

>>> from Bio.Cluster import treecluster
>>> tree = treecluster(data)

where the following arguments are defined:

data
Array containing the data for the items.
mask (default: None)
Array of integers showing which data are missing. If mask[i,j]==0, then data[i,j] is missing. If mask==None, then all data are present.
weight (default: None)
The weights to be used when calculating distances. If weight==None, then equal weights are assumed.
transpose (default: 0)
Determines if rows (transpose==0) or columns (transpose==1) are to be clustered.
method (default: 'm')
Defines the linkage method to be used:
method=='s': pairwise single-linkage clustering
method=='m': pairwise maximum- (or complete-) linkage clustering
method=='c': pairwise centroid-linkage clustering
method=='a': pairwise average-linkage clustering
dist (default: 'e', Euclidean distance)
Defines the distance function to be used (see 15.1).

To apply hierarchical clustering on a precalculated distance matrix, specify the distancematrix argument when calling the treecluster function instead of the data argument:

>>> from Bio.Cluster import treecluster
>>> tree = treecluster(distancematrix=distance)

In this case, the following arguments are defined:

distancematrix
The distance matrix, which can be specified in three ways:
as a 2D Numerical Python array (in which only the left-lower part of the array will be accessed):
distance = array([[0.0, 1.1, 2.3],
                  [1.1, 0.0, 4.5],
                  [2.3, 4.5, 0.0]])

as a 1D Numerical Python array containing consecutively the distances in the left-lower part of the distance matrix:
distance = array([1.1, 2.3, 4.5])

as a list containing the rows of the left-lower part of the distance matrix:
distance = [array([]),
            array([1.1]),
            array([2.3, 4.5])
           ]

These three expressions correspond to the same distance matrix. As treecluster may shuffle the values in the distance matrix as part of the clustering algorithm, be sure to save this array in a different variable before calling treecluster if you need it later.
method
The linkage method to be used:
method=='s': pairwise single-linkage clustering
method=='m': pairwise maximum- (or complete-) linkage clustering
method=='a': pairwise average-linkage clustering
While pairwise single-, maximum-, and average-linkage clustering can be calculated from the distance matrix alone, pairwise centroid-linkage cannot.

When calling treecluster, either data or distancematrix should be None.

This function returns a Tree object. This object contains (number of items - 1) nodes, where the number of items is the number of rows if rows were clustered, or the number of columns if columns were clustered. Each node describes a pairwise linking event, where the node attributes left and right each contain the number of one item or subnode, and distance the distance between them. Items are numbered from 0 to (number of items - 1), while clusters are numbered -1 to -(number of items - 1).

15.5 Self-Organizing Maps
Self-Organizing Maps (SOMs) were invented by Kohonen to describe neural networks (see for instance Kohonen, 1997 [24]). Tamayo (1999) first applied Self-Organizing Maps to gene expression data [30].

SOMs organize items into clusters that are situated in some topology. Usually a rectangular topology is chosen. The clusters generated by SOMs are such that neighboring clusters in the topology are more similar to each other than clusters far from each other in the topology.

The first step to calculate a SOM is to randomly assign a data vector to each cluster in the topology. If rows are being clustered, then the number of elements in each data vector is equal to the number of columns.

An SOM is then generated by taking rows one at a time, and finding which cluster in the topology has the closest data vector. The data vector of that cluster, as well as those of the neighboring clusters, are adjusted using the data vector of the row under consideration. The adjustment is given by

$$\Delta x_{\text{cell}} = \tau \cdot \left( x_{\text{row}} - x_{\text{cell}} \right).$$

The parameter $\tau$ is a parameter that decreases at each iteration step. We have used a simple linear function of the iteration step:

$$\tau = \tau_{\text{init}} \cdot \left( 1 - \frac{i}{n} \right),$$

where $\tau_{\text{init}}$ is the initial value of $\tau$ as specified by the user, $i$ is the number of the current iteration step, and $n$ is the total number of iteration steps to be performed. While changes are made rapidly in the beginning of the iteration, at the end of iteration only small changes are made.

All clusters within a radius $R$ are adjusted to the gene under consideration. This radius decreases as the calculation progresses as

$$R = R_{\max} \cdot \left( 1 - \frac{i}{n} \right),$$

in which the maximum radius is defined as

$$R_{\max} = \sqrt{N_x^2 + N_y^2},$$

where $(N_x, N_y)$ are the dimensions of the rectangle defining the topology.
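
The linear schedules for the adjustment parameter and the radius, and the adjustment of a single cell, can be sketched as follows. This is an illustrative plain-Python sketch of the formulas in this section, not the somcluster implementation; the function names som_schedule and adjust are hypothetical, and the argument names mirror the somcluster arguments only for readability.

```python
import math


def som_schedule(step, niter, inittau, nxgrid, nygrid):
    """Return (tau, radius) for iteration `step` out of `niter` iterations."""
    fraction = 1.0 - step / niter
    rmax = math.sqrt(nxgrid ** 2 + nygrid ** 2)   # maximum radius of the grid
    return inittau * fraction, rmax * fraction


def adjust(cell, row, tau):
    """Move a cell's data vector toward the row vector by the fraction tau."""
    return [c + tau * (r - c) for c, r in zip(cell, row)]


tau_start, radius_start = som_schedule(0, 100, 0.02, 3, 4)   # start of the run
tau_half, radius_half = som_schedule(50, 100, 0.02, 3, 4)    # halfway through
moved = adjust([0.0, 0.0], [1.0, 2.0], 0.5)                  # [0.5, 1.0]
```

For a 3 by 4 grid the maximum radius is 5, so halfway through the run both the adjustment parameter and the radius have dropped to half of their initial values.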

The function somcluster implements the complete algorithm to calculate a Self-Organizing Map on a rectangular grid. First it initializes the random number generator. The node data are then initialized using the random number generator. The order in which genes or microarrays are used to modify the SOM is also randomized. The total number of iterations in the SOM algorithm is specified by the user.

To run somcluster, use

>>> from Bio.Cluster import somcluster
>>> clusterid, celldata = somcluster(data)

where the following arguments are defined:

data (required)
Array containing the data for the items.
mask (default: None)
Array of integers showing which data are missing. If mask[i,j]==0, then data[i,j] is missing. If mask==None, then all data are present.
weight (default: None)
Contains the weights to be used when calculating distances. If weight==None, then equal weights are assumed.
transpose (default: 0)
Determines if rows (transpose is 0) or columns (transpose is 1) are to be clustered.
nxgrid, nygrid (default: 2, 1)
The number of cells horizontally and vertically in the rectangular grid on which the Self-Organizing Map is calculated.
inittau (default: 0.02)
The initial value for the parameter τ that is used in the SOM algorithm. The default value for inittau is 0.02, which was used in Michael Eisen's Cluster/TreeView program.
niter (default: 1)
The number of iterations to be performed.
dist (default: 'e', Euclidean distance)
Defines the distance function to be used (see 15.1).

This function returns the tuple (clusterid, celldata):

clusterid:
An array with two columns, where the number of rows is equal to the number of items that were clustered. Each row contains the x and y coordinates of the cell in the rectangular SOM grid to which the item was assigned.
celldata:
An array with dimensions (nxgrid, nygrid, number of columns) if rows are being clustered, or (nxgrid, nygrid, number of rows) if columns are being clustered. Each element [ix][iy] of this array is a 1D vector containing the gene expression data for the centroid of the cluster in the grid cell with coordinates [ix][iy].

15.6 Principal Component Analysis
Principal Component Analysis (PCA) is a widely used technique for analyzing multivariate data. A practical example of applying Principal Component Analysis to gene expression data is presented by Yeung and Ruzzo (2001) [33].

In essence, PCA is a coordinate transformation in which each row in the data matrix is written as a linear sum over basis vectors called principal components, which are ordered and chosen such that each maximally explains the remaining variance in the data vectors. For example, an n × 3 data matrix can be represented as an ellipsoidal cloud of n points in three-dimensional space. The first principal component is the longest axis of the ellipsoid, the second principal component the second longest axis of the ellipsoid, and the third principal component is the shortest axis. Each row in the data matrix can be reconstructed as a suitable linear combination of the principal components. However, in order to reduce the dimensionality of the data, usually only the most important principal components are retained. The remaining variance present in the data is then regarded as unexplained variance.

The principal components can be found by calculating the eigenvectors of the covariance matrix of the data. The corresponding eigenvalues determine how much of the variance present in the data is explained by each principal component.

Before applying principal component analysis, typically the mean is subtracted from each column in the data matrix. In the example above, this effectively centers the ellipsoidal cloud around its centroid in 3D space, with the principal components describing the variation of points in the ellipsoidal cloud with respect to their centroid.

The function pca below first uses the singular value decomposition to calculate the eigenvalues and eigenvectors of the data matrix. The singular value decomposition is implemented as a translation in C of the Algol procedure svd [16], which uses Householder bidiagonalization and a variant of the QR algorithm. The principal components, the coordinates of each data vector along the principal components, and the eigenvalues corresponding to the principal components are then evaluated and returned in decreasing order of the magnitude of the eigenvalue. If data centering is desired, the mean should be subtracted from each column in the data matrix before calling the pca routine.

To apply Principal Component Analysis to a rectangular matrix data, use

>>> from Bio.Cluster import pca
>>> columnmean, coordinates, components, eigenvalues = pca(data)

This function returns a tuple columnmean, coordinates, components, eigenvalues:

columnmean
Array containing the mean over each column in data.
coordinates
The coordinates of each row in data with respect to the principal components.
components
The principal components.
eigenvalues
The eigenvalues corresponding to each of the principal components.

The original matrix data can be recreated by calculating columnmean + dot(coordinates, components).
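
The reconstruction identity can be checked with a small NumPy-based stand-in for the decomposition. This is a sketch under stated assumptions, not Bio.Cluster.pca itself: the function name pca_sketch is hypothetical, NumPy's own SVD is used in place of the library's C routine, and the singular values are returned in the position of the eigenvalues (they order the components in the same way, since the eigenvalues of the sample covariance matrix equal the squared singular values divided by n - 1).

```python
import numpy as np


def pca_sketch(data):
    """PCA via SVD of the column-centered data (stand-in, not Bio.Cluster.pca)."""
    data = np.asarray(data, dtype=float)
    columnmean = data.mean(axis=0)
    u, s, vt = np.linalg.svd(data - columnmean, full_matrices=False)
    coordinates = u * s   # coordinates of each row along the components
    components = vt       # one principal component per row
    return columnmean, coordinates, components, s


data = [[3.1, 1.2], [1.4, 1.3], [1.1, 1.5], [2.0, 1.5], [1.7, 1.9]]
columnmean, coordinates, components, singular_values = pca_sketch(data)

# Recreate the original matrix from the returned pieces.
rebuilt = columnmean + np.dot(coordinates, components)

# Keeping only the first component gives a lower-dimensional approximation.
approximation = columnmean + np.outer(coordinates[:, 0], components[0])
```

Dropping the trailing components in this way is the dimensionality reduction described above; the discarded part corresponds to the unexplained variance.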

15.7 Handling Cluster/TreeView-type files
Cluster/TreeView are GUI-based codes for clustering gene expression data. They were originally written by Michael Eisen while at Stanford University. Bio.Cluster contains functions for reading and writing data files that correspond to the format specified for Cluster/TreeView. In particular, by saving a clustering result in that format, TreeView can be used to visualize the clustering results. We recommend using Alok Saldanha's Java TreeView program (http://jtreeview.sourceforge.net/), which can display hierarchical as well as k-means clustering results.

An object of the class Record contains all information stored in a Cluster/TreeView-type data file. To store the information contained in the data file in a Record object, we first open the file and then read it:

>>> from Bio import Cluster
>>> with open("mydatafile.txt") as handle:
...     record = Cluster.read(handle)
...

This two-step process gives you some flexibility in the source of the data. For example, you can use

>>> import gzip  # Python standard library
>>> handle = gzip.open("mydatafile.txt.gz", "rt")

to open a gzipped file, or
>>> from urllib.request import urlopen  # Python 3 standard library
>>> from io import TextIOWrapper  # wraps the byte stream as text
>>> handle = TextIOWrapper(urlopen("http://somewhere.org/mydatafile.txt"))

to open a file stored on the Internet before calling read.

The read command reads the tab-delimited text file mydatafile.txt containing gene expression data in the format specified for Michael Eisen's Cluster/TreeView program. In this file format, rows represent genes and columns represent samples or observations. For a simple time course, a minimal input file would look like this:

YORF      0 minutes   30 minutes   1 hour   2 hours   4 hours
YAL001C   1           1.3          2.4      5.8       2.4
YAL002W   0.9         0.8          0.7      0.5       0.2
YAL003W   0.8         2.1          4.2      10.1      10.1
YAL005C   1.1         1.3          0.8                0.4
YAL010C   1.2         1            1.1      4.5       8.3

Each row (gene) has an identifier that always goes in the first column. In this example, we are using yeast open reading frame codes. Each column (sample) has a label in the first row. In this example, the labels describe the time at which a sample was taken. The first column of the first row contains a special field that tells the program what kind of objects are in each row. In this case, YORF stands for yeast open reading frame. This field can be any alphanumeric value. The remaining cells in the table contain data for the appropriate gene and sample. The 5.8 in row 2 column 4 means that the observed value for gene YAL001C at 2 hours was 5.8. Missing values are acceptable and are designated by empty cells (e.g. YAL005C at 2 hours).
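
How empty cells translate into missing values can be made concrete with a small parser for the minimal layout. This is an illustrative plain-Python sketch and not the parser used by Cluster.read: the function name parse_minimal is hypothetical, and storing 0.0 for a missing value together with a mask entry of 0 is an assumption made here for illustration.

```python
def parse_minimal(text):
    """Parse a minimal tab-delimited table into ids, labels, data and mask."""
    lines = [line for line in text.split("\n") if line.strip()]
    header = lines[0].split("\t")
    uniqid, expid = header[0], header[1:]
    geneid, data, mask = [], [], []
    for line in lines[1:]:
        fields = line.split("\t")
        geneid.append(fields[0])
        # Pad short rows so that every gene has one cell per sample.
        cells = fields[1:] + [""] * (len(expid) - len(fields) + 1)
        # An empty cell is a missing value: store 0.0 and flag it with mask 0.
        data.append([float(c) if c.strip() else 0.0 for c in cells])
        mask.append([1 if c.strip() else 0 for c in cells])
    return uniqid, geneid, expid, data, mask


text = (
    "YORF\t0 minutes\t30 minutes\t1 hour\t2 hours\t4 hours\n"
    "YAL001C\t1\t1.3\t2.4\t5.8\t2.4\n"
    "YAL005C\t1.1\t1.3\t0.8\t\t0.4\n"
)
uniqid, geneid, expid, data, mask = parse_minimal(text)
```

In this example the empty cell of YAL005C at 2 hours produces a zero in the mask, while all other mask entries are one.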

The input file may contain additional information. A maximal input file would look like this:

YORF      NAME                         GWEIGHT   GORDER   0     30    1     2      4
EWEIGHT                                                   1     1     1     1      0
EORDER                                                    5     3     2     1      1
YAL001C   TFIIIC 138 KD SUBUNIT        1         1        1     1.3   2.4   5.8    2.4
YAL002W   UNKNOWN                      0.4       3        0.9   0.8   0.7   0.5    0.2
YAL003W   ELONGATION FACTOR EF1-BETA   0.4       2        0.8   2.1   4.2   10.1   10.1
YAL005C   CYTOSOLIC HSP70              0.4       5        1.1   1.3   0.8          0.4

The added columns NAME, GWEIGHT, and GORDER and rows EWEIGHT and EORDER are optional. The NAME column allows you to specify a label for each gene that is distinct from the ID in column 1.

A Record object has the following attributes:

data
The data array containing the gene expression data. Genes are stored row-wise, while microarrays are stored column-wise.
mask
This array shows which elements in the data array, if any, are missing. If mask[i,j]==0, then data[i,j] is missing. If no data were found to be missing, mask is set to None.
geneid
This is a list containing a unique description for each gene (i.e., ORF numbers).
genename
This is a list containing a description for each gene (i.e., gene name). If not present in the data file, genename is set to None.
gweight
The weights that are to be used to calculate the distance in expression profile between genes. If not present in the data file, gweight is set to None.
gorder
The preferred order in which genes should be stored in an output file. If not present in the data file, gorder is set to None.
expid
This is a list containing a description of each microarray, e.g. experimental condition.
eweight
The weights that are to be used to calculate the distance in expression profile between microarrays. If not present in the data file, eweight is set to None.
eorder
The preferred order in which microarrays should be stored in an output file. If not present in the data file, eorder is set to None.
uniqid
The string that was used instead of UNIQID in the data file.

After loading a Record object, each of these attributes can be accessed and modified directly. For example, the data can be log-transformed by taking the logarithm of record.data.

Calculating the distance matrix

To calculate the distance matrix between the items stored in the record, use

>>> matrix = record.distancematrix()

where the following arguments are defined:

transpose (default: 0)
Determines if the distances between the rows of data are to be calculated (transpose==0), or between the columns of data (transpose==1).
dist (default: 'e', Euclidean distance)
Defines the distance function to be used (see 15.1).

This function returns the distance matrix as a list of rows, where the number of columns of each row is equal to the row number (see section 15.1).

Calculating the cluster centroids

To calculate the centroids of clusters of items stored in the record, use
>>> cdata, cmask = record.clustercentroids()

where the following arguments are defined:

clusterid (default: None)
Vector of integers showing to which cluster each item belongs. If clusterid is not given, then all items are assumed to belong to the same cluster.
method (default: 'a')
Specifies whether the arithmetic mean (method=='a') or the median (method=='m') is used to calculate the cluster center.
transpose (default: 0)
Determines if the centroids of the rows of data are to be calculated (transpose==0), or the centroids of the columns of data (transpose==1).

This function returns the tuple cdata, cmask; see section 15.2 for a description.

Calculating the distance between clusters

To calculate the distance between clusters of items stored in the record, use
>>> distance = record.clusterdistance()

where the following arguments are defined:

index1 (default: 0)
A list containing the indices of the items belonging to the first cluster. A cluster containing only one item i can be represented either as a list [i], or as an integer i.
index2 (default: 0)
A list containing the indices of the items belonging to the second cluster. A cluster containing only one item i can be represented either as a list [i], or as an integer i.
method (default: 'a')
Specifies how the distance between clusters is defined:
'a': Distance between the two cluster centroids (arithmetic mean)
'm': Distance between the two cluster centroids (median)
's': Shortest pairwise distance between items in the two clusters
'x': Longest pairwise distance between items in the two clusters
'v': Average over the pairwise distances between items in the two clusters.
dist (default: 'e', Euclidean distance)
Defines the distance function to be used (see 15.1).
transpose (default: 0)
If transpose==0, calculate the distance between the rows of data. If transpose==1, calculate the distance between the columns of data.

Performing hierarchical clustering

To perform hierarchical clustering on the items stored in the record, use

>>> tree = record.treecluster()

where the following arguments are defined:

transpose (default: 0)
Determines if genes or microarrays are being clustered. If transpose==0, genes (rows) are clustered. If transpose==1, microarrays (columns) are clustered.
method (default: 'm')
Defines the linkage method to be used:
method=='s': pairwise single-linkage clustering
method=='m': pairwise maximum- (or complete-) linkage clustering
method=='c': pairwise centroid-linkage clustering
method=='a': pairwise average-linkage clustering
dist (default: 'e', Euclidean distance)
Defines the distance function to be used (see 15.1).

This function returns a Tree object. This object contains (number of items - 1) nodes, where the number of items is the number of rows if rows were clustered, or the number of columns if columns were clustered. Each node describes a pairwise linking event, where the node attributes left and right each contain the number of one item or subnode, and distance the distance between them. Items are numbered from 0 to (number of items - 1), while clusters are numbered -1 to -(number of items - 1).

Performing k-means or k-medians clustering
To perform k-means or k-medians clustering on the items stored in the record, use

>>> clusterid, error, nfound = record.kcluster()

where the following arguments are defined:

nclusters (default: 2)
The number of clusters k.
transpose (default: 0)
Determines if rows (transpose is 0) or columns (transpose is 1) are to be clustered.
npass (default: 1)
The number of times the k-means/-medians clustering algorithm is performed, each time with a different (random) initial condition. If initialid is given, the value of npass is ignored and the clustering algorithm is run only once, as it behaves deterministically in that case.
method (default: 'a')
Describes how the center of a cluster is found:
method=='a': arithmetic mean (k-means clustering);
method=='m': median (k-medians clustering).
For other values of method, the arithmetic mean is used.
dist (default: 'e', Euclidean distance)
Defines the distance function to be used (see 15.1).

This function returns a tuple (clusterid, error, nfound), where clusterid is an integer array containing the number of the cluster to which each row or column was assigned, error is the within-cluster sum of distances for the optimal clustering solution, and nfound is the number of times this optimal solution was found.

Calculating a Self-Organizing Map

To calculate a Self-Organizing Map of the items stored in the record, use

>>> clusterid, celldata = record.somcluster()

where the following arguments are defined:

transpose (default: 0)
Determines if rows (transpose is 0) or columns (transpose is 1) are to be clustered.
nxgrid, nygrid (default: 2, 1)
The number of cells horizontally and vertically in the rectangular grid on which the Self-Organizing Map is calculated.
inittau (default: 0.02)
The initial value for the parameter τ that is used in the SOM algorithm. The default value for inittau is 0.02, which was used in Michael Eisen's Cluster/TreeView program.
niter (default: 1)
The number of iterations to be performed.
dist (default: 'e', Euclidean distance)
Defines the distance function to be used (see 15.1).

This function returns the tuple (clusterid, celldata):

clusterid:
An array with two columns, where the number of rows is equal to the number of items that were clustered. Each row contains the x and y coordinates of the cell in the rectangular SOM grid to which the item was assigned.
celldata:
An array with dimensions (nxgrid, nygrid, number of columns) if rows are being clustered, or (nxgrid, nygrid, number of rows) if columns are being clustered. Each element [ix][iy] of this array is a 1D vector containing the gene expression data for the centroid of the cluster in the grid cell with coordinates [ix][iy].

Saving the clustering result

To save the clustering result, use

>>> record.save(jobname, geneclusters, expclusters)

where the following arguments are defined:

jobname
The string jobname is used as the base name for names of the files that are to be saved.
geneclusters
This argument describes the gene (row-wise) clustering result. In case of k-means clustering, this is a 1D array containing the number of the cluster each gene belongs to. It can be calculated using kcluster. In case of hierarchical clustering, geneclusters is a Tree object.
expclusters
This argument describes the (column-wise) clustering result for the experimental conditions. In case of k-means clustering, this is a 1D array containing the number of the cluster each experimental condition belongs to. It can be calculated using kcluster. In case of hierarchical clustering, expclusters is a Tree object.

This method writes the text files jobname.cdt, jobname.gtr, jobname.atr, jobname*.kgg, and/or jobname*.kag for subsequent reading by the Java TreeView program. If geneclusters and expclusters are both None, this method only writes the text file jobname.cdt; this file can subsequently be read into a new Record object.

15.8 Example calculation
This is an example of a hierarchical clustering calculation, using single linkage clustering for genes and maximum linkage clustering for experimental conditions. As the Euclidean distance is being used for gene clustering, it is necessary to scale the node distances in genetree such that they are all between zero and one. This is needed for the Java TreeView code to display the tree diagram correctly. To cluster the experimental conditions, the uncentered correlation is being used. No scaling is needed in this case, as the distances in exptree are already between zero and two. The example data cyano.txt can be found in the data subdirectory.

>>> from Bio import Cluster
>>> with open("cyano.txt") as handle:
...     record = Cluster.read(handle)
...
>>> genetree = record.treecluster(method='s')
>>> genetree.scale()
>>> exptree = record.treecluster(dist='u', transpose=1)
>>> record.save("cyano_result", genetree, exptree)

This will create the files cyano_result.cdt, cyano_result.gtr, and cyano_result.atr.
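The statement that uncentered-correlation distances need no scaling can be checked with a few lines of plain Python. The sketch below is an illustration only, not the Bio.Cluster implementation; it assumes the definition from Section 15.1, in which the distance is one minus the uncentered correlation (the cosine of the angle between the two data vectors). The function name uncentered_distance is made up for this example.

```python
import math

def uncentered_distance(x, y):
    """Return 1 - r_U, where r_U is the uncentered correlation of x and y."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)

profile = [1.0, 2.0, 3.0]
same_direction = [2.0, 4.0, 6.0]    # parallel vector: distance close to 0
opposite = [-1.0, -2.0, -3.0]       # antiparallel vector: distance close to 2
print(uncentered_distance(profile, same_direction))
print(uncentered_distance(profile, opposite))
```

Because the cosine always lies between -1 and 1, this distance always lies between zero and two, whereas Euclidean distances have no fixed upper bound and therefore have to be rescaled.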

Similarly, we can save a k-means clustering solution:

>>> from Bio import Cluster
>>> with open("cyano.txt") as handle:
...     record = Cluster.read(handle)
...
>>> (geneclusters, error, ifound) = record.kcluster(nclusters=5, npass=1000)
>>> (expclusters, error, ifound) = record.kcluster(nclusters=2, npass=100, transpose=1)
>>> record.save("cyano_result", geneclusters, expclusters)

This will create the files cyano_result_K_G5_A2.cdt, cyano_result_K_G5.kgg, and cyano_result_K_A2.kag (the numbers after G and A are the numbers of gene and experiment clusters, here 5 and 2).

15.9 Auxiliary functions
median(data) returns the median of the 1D array data.

mean(data) returns the mean of the 1D array data.

version() returns the version number of the underlying C Clustering Library as a string.

Chapter 16 Supervised learning methods
Note the supervised learning methods described in this chapter all require Numerical Python (numpy) to be installed.

16.1 The Logistic Regression Model
16.1.1 Background and Purpose

Logistic regression is a supervised learning approach that attempts to distinguish K classes from each other using a weighted sum of some predictor variables xi. The logistic regression model is used to calculate the weights βi of the predictor variables. In Biopython, the logistic regression model is currently implemented for two classes only (K = 2); the number of predictor variables has no predefined limit.

As an example, let's try to predict the operon structure in bacteria. An operon is a set of adjacent genes on the same strand of DNA that are transcribed into a single mRNA molecule. Translation of the single mRNA molecule then yields the individual proteins. For Bacillus subtilis, whose data we will be using, the average number of genes in an operon is about 2.4.

As a first step in understanding gene regulation in bacteria, we need to know the operon structure. For about 10% of the genes in Bacillus subtilis, the operon structure is known from experiments. A supervised learning method can be used to predict the operon structure for the remaining 90% of the genes.

For such a supervised learning approach, we need to choose some predictor variables xi that can be measured easily and are somehow related to the operon structure. One predictor variable might be the distance in base pairs between genes. Adjacent genes belonging to the same operon tend to be separated by a relatively short distance, whereas adjacent genes in different operons tend to have a larger space between them to allow for promoter and terminator sequences. Another predictor variable is based on gene expression measurements. By definition, genes belonging to the same operon have equal gene expression profiles, while genes in different operons are expected to have different expression profiles. In practice, the measured expression profiles of genes in the same operon are not quite identical due to the presence of measurement errors. To assess the similarity in the gene expression profiles, we assume that the measurement errors follow a normal distribution and calculate the corresponding log-likelihood score.

We now have two predictor variables that we can use to predict if two adjacent genes on the same strand of DNA belong to the same operon:

x1: the number of base pairs between them;
x2: their similarity in expression profile.

In a logistic regression model, we use a weighted sum of these two predictors to calculate a joint score S:

$$S = \beta_0 + \beta_1 x_1 + \beta_2 x_2. \tag{16.1}$$

The logistic regression model gives us appropriate values for the parameters β0, β1, β2 using two sets of example genes:

OP: Adjacent genes, on the same strand of DNA, known to belong to the same operon;
NOP: Adjacent genes, on the same strand of DNA, known to belong to different operons.

In the logistic regression model, the probability of belonging to a class depends on the score via the logistic function. For the two classes OP and NOP, we can write this as

$$\Pr(\mathrm{OP} \mid x_1, x_2) = \frac{\exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2)}{1 + \exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2)} \tag{16.2}$$

$$\Pr(\mathrm{NOP} \mid x_1, x_2) = \frac{1}{1 + \exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2)} \tag{16.3}$$

Using a set of gene pairs for which it is known whether they belong to the same operon (class OP) or to different operons (class NOP), we can calculate the weights β0, β1, β2 by maximizing the log-likelihood corresponding to the probability functions (16.2) and (16.3).
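To make the quantity being maximized concrete, here is a minimal plain-Python sketch of equations (16.1) to (16.3) and of the resulting log-likelihood. It is not the Bio.LogisticRegression code; the function names prob_op and log_likelihood, the two toy gene pairs and the trial weights are all invented for illustration.

```python
import math

def prob_op(beta, x1, x2):
    """Equation (16.2): probability that a gene pair belongs to class OP."""
    score = beta[0] + beta[1] * x1 + beta[2] * x2    # joint score S, equation (16.1)
    return 1.0 / (1.0 + math.exp(-score))            # same as exp(S) / (1 + exp(S))

def log_likelihood(beta, xs, ys):
    """Sum of log Pr(observed class) over all training pairs (1 = OP, 0 = NOP)."""
    total = 0.0
    for (x1, x2), y in zip(xs, ys):
        p = prob_op(beta, x1, x2)
        total += math.log(p) if y == 1 else math.log(1.0 - p)   # 1 - p is equation (16.3)
    return total

# Hypothetical toy data and weights, chosen only for illustration.
xs = [[10, -180.0], [150, -300.0]]
ys = [1, 0]
print(log_likelihood([0.0, 0.0, 0.0], xs, ys))      # every pair at probability 0.5
print(log_likelihood([9.0, -0.04, 0.02], xs, ys))   # closer to zero, so a better fit
```

Training amounts to searching for the weights that bring this sum as close to zero as possible.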

16.1.2 Training the logistic regression model

Table 16.1: Adjacent gene pairs known to belong to the same operon (class OP) or to different operons (class NOP). Intergene distances are negative if the two genes overlap.

Gene pair      Intergene distance (x1)   Gene expression score (x2)   Class
cotJA-cotJB    -53                       -200.78                      OP
yesK-yesL      117                       -267.14                      OP
lplA-lplB      57                        -163.47                      OP
lplB-lplC      16                        -190.30                      OP
lplC-lplD      11                        -220.94                      OP
lplD-yetF      85                        -193.94                      OP
yfmT-yfmS      16                        -182.71                      OP
yfmF-yfmE      15                        -180.41                      OP
citS-citT      -26                       -181.73                      OP
citM-yflN      58                        -259.87                      OP
yfiI-yfiJ      126                       -414.53                      NOP
lipB-yfiQ      191                       -249.57                      NOP
yfiU-yfiV      113                       -265.28                      NOP
yfhH-yfhI      145                       -312.99                      NOP
cotY-cotX      154                       -213.83                      NOP
yjoB-rapA      147                       -380.85                      NOP
ptsI-splA      93                        -291.13                      NOP

Table 16.1 lists some of the Bacillus subtilis gene pairs for which the operon structure is known. Let's calculate the logistic regression model from these data:

>>> from Bio import LogisticRegression
>>> xs = [[-53, -200.78],
...       [117, -267.14],
...       [57, -163.47],
...       [16, -190.30],
...       [11, -220.94],
...       [85, -193.94],
...       [16, -182.71],
...       [15, -180.41],
...       [-26, -181.73],
...       [58, -259.87],
...       [126, -414.53],
...       [191, -249.57],
...       [113, -265.28],
...       [145, -312.99],
...       [154, -213.83],
...       [147, -380.85],
...       [93, -291.13]]
>>> ys = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
...       0, 0, 0, 0, 0, 0, 0]
>>> model = LogisticRegression.train(xs, ys)

Here, xs and ys are the training data: xs contains the predictor variables for each gene pair, and ys specifies if the gene pair belongs to the same operon (1, class OP) or different operons (0, class NOP). The resulting logistic regression model is stored in model, which contains the weights β0, β1, and β2:

>>> model.beta
[8.9830290157144681, -0.035968960444850887, 0.02181395662983519]

Note that β1 is negative, as gene pairs with a shorter intergene distance have a higher probability of belonging to the same operon (class OP). On the other hand, β2 is positive, as gene pairs belonging to the same operon typically have a higher similarity score of their gene expression profiles. The parameter β0 is positive due to the higher prevalence of operon gene pairs than non-operon gene pairs in the training data.

The function train has two optional arguments: update_fn and typecode. The update_fn can be used to specify a callback function, taking as arguments the iteration number and the log-likelihood. With the callback function, we can for example track the progress of the model calculation (which uses a Newton-Raphson iteration to maximize the log-likelihood function of the logistic regression model):

>>> def show_progress(iteration, loglikelihood):
...     print("Iteration:", iteration, "Log-likelihood function:", loglikelihood)
...
>>> model = LogisticRegression.train(xs, ys, update_fn=show_progress)
Iteration: 0 Log-likelihood function: -11.7835020695
Iteration: 1 Log-likelihood function: -7.15886767672
Iteration: 2 Log-likelihood function: -5.76877209868
Iteration: 3 Log-likelihood function: -5.11362294338
Iteration: 4 Log-likelihood function: -4.74870642433
Iteration: 5 Log-likelihood function: -4.50026077146
Iteration: 6 Log-likelihood function: -4.31127773737
Iteration: 7 Log-likelihood function: -4.16015043396
Iteration: 8 Log-likelihood function: -4.03561719785
Iteration: 9 Log-likelihood function: -3.93073282192
Iteration: 10 Log-likelihood function: -3.84087660929
Iteration: 11 Log-likelihood function: -3.76282560605
Iteration: 12 Log-likelihood function: -3.69425027154
Iteration: 13 Log-likelihood function: -3.6334178602
Iteration: 14 Log-likelihood function: -3.57900855837
Iteration: 15 Log-likelihood function: -3.52999671386
Iteration: 16 Log-likelihood function: -3.48557145163
Iteration: 17 Log-likelihood function: -3.44508206139
Iteration: 18 Log-likelihood function: -3.40799948447
Iteration: 19 Log-likelihood function: -3.3738885624
Iteration: 20 Log-likelihood function: -3.3423876581
Iteration: 21 Log-likelihood function: -3.31319343769
Iteration: 22 Log-likelihood function: -3.2860493346
Iteration: 23 Log-likelihood function: -3.2607366863
Iteration: 24 Log-likelihood function: -3.23706784091
Iteration: 25 Log-likelihood function: -3.21488073614
Iteration: 26 Log-likelihood function: -3.19403459259
Iteration: 27 Log-likelihood function: -3.17440646052
Iteration: 28 Log-likelihood function: -3.15588842703
Iteration: 29 Log-likelihood function: -3.13838533947
Iteration: 30 Log-likelihood function: -3.12181293595
Iteration: 31 Log-likelihood function: -3.10609629966
Iteration: 32 Log-likelihood function: -3.09116857282
Iteration: 33 Log-likelihood function: -3.07696988017
Iteration: 34 Log-likelihood function: -3.06344642288
Iteration: 35 Log-likelihood function: -3.05054971191
Iteration: 36 Log-likelihood function: -3.03823591619
Iteration: 37 Log-likelihood function: -3.02646530573
Iteration: 38 Log-likelihood function: -3.01520177394
Iteration: 39 Log-likelihood function: -3.00441242601
Iteration: 40 Log-likelihood function: -2.99406722296
Iteration: 41 Log-likelihood function: -2.98413867259

The iteration stops once the increase in the log-likelihood function is less than 0.01. If no convergence is reached after 500 iterations, the train function fails with an AssertionError.

The optional keyword typecode can almost always be ignored. This keyword allows the user to choose the type of Numeric matrix to use. In particular, to avoid memory problems for very large problems, it may be necessary to use single-precision floats (Float8, Float16, etc.) rather than double, which is used by default.
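The stopping rule and the update_fn callback are easiest to see in a stripped-down setting. The sketch below is not the Bio.LogisticRegression implementation: it assumes a model with the single weight β0 and no predictor variables, so that each Newton-Raphson step is one line of arithmetic. The function name fit_intercept and its keyword arguments are invented for this example; the tolerance of 0.01 and the limit of 500 iterations mirror the description above.

```python
import math

def fit_intercept(ys, update_fn=None, max_iterations=500, tolerance=0.01):
    """Newton-Raphson fit of the model Pr(y = 1) = 1 / (1 + exp(-beta0))."""
    n = len(ys)
    n_ones = sum(ys)
    beta0 = 0.0
    previous = None
    for iteration in range(max_iterations):
        p = 1.0 / (1.0 + math.exp(-beta0))
        loglikelihood = n_ones * math.log(p) + (n - n_ones) * math.log(1.0 - p)
        if update_fn is not None:
            update_fn(iteration, loglikelihood)
        # Stop once the log-likelihood has improved by less than the tolerance.
        if previous is not None and loglikelihood - previous < tolerance:
            return beta0
        previous = loglikelihood
        gradient = n_ones - n * p          # first derivative of the log-likelihood
        curvature = n * p * (1.0 - p)      # minus the second derivative
        beta0 += gradient / curvature      # Newton-Raphson step
    raise AssertionError("no convergence after %d iterations" % max_iterations)

ys = [1] * 10 + [0] * 7                    # class labels of Table 16.1

def show_progress(iteration, loglikelihood):
    print("Iteration:", iteration, "Log-likelihood function:", loglikelihood)

beta0 = fit_intercept(ys, update_fn=show_progress)
print("beta0 =", beta0)
```

With all weights at zero every gene pair has probability 0.5, so iteration 0 reports 17 ln 0.5, about -11.78, which is also the starting value in the listing above. For this reduced model the fitted β0 approaches ln(10/7), the log-odds of the two classes in the training data.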

16.1.3 Using the logistic regression model for classification

Classification is performed by calling the classify function. Given a logistic regression model and the values for x1 and x2 (e.g. for a gene pair of unknown operon structure), the classify function returns 1 or 0, corresponding to class OP and class NOP, respectively. For example, let's consider the gene pairs yxcE, yxcD and yxiB, yxiA:

Table 16.2: Adjacent gene pairs of unknown operon status.

Gene pair     Intergene distance (x1)   Gene expression score (x2)
yxcE-yxcD     6                         -173.143442352
yxiB-yxiA     309                       -271.005880394

The logistic regression model classifies yxcE, yxcD as belonging to the same operon (class OP), while yxiB, yxiA are predicted to belong to different operons:

>>> print("yxcE, yxcD:", LogisticRegression.classify(model, [6, -173.143442352]))
yxcE, yxcD: 1
>>> print("yxiB, yxiA:", LogisticRegression.classify(model, [309, -271.005880394]))
yxiB, yxiA: 0

(which, by the way, agrees with the biological literature).

To find out how confident we can be in these predictions, we can call the calculate function to obtain the probabilities (equations (16.2) and (16.3)) for class OP and NOP. For yxcE, yxcD we find

>>> q, p = LogisticRegression.calculate(model, [6, -173.143442352])
>>> print("class OP: probability =", p, "class NOP: probability =", q)
class OP: probability = 0.993242163503 class NOP: probability = 0.00675783649744

and for yxiB, yxiA

>>> q, p = LogisticRegression.calculate(model, [309, -271.005880394])
>>> print("class OP: probability =", p, "class NOP: probability =", q)
class OP: probability = 0.000321211251817 class NOP: probability = 0.999678788748
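These probabilities follow directly from the trained weights and equations (16.2) and (16.3), which we can confirm without Biopython. The sketch below copies the weights reported by model.beta (with β1 negative, as discussed in Section 16.1.2); the helper class_probabilities is a made-up name that mimics the (NOP, OP) ordering returned by calculate.

```python
import math

# Weights reported by model.beta; beta1, the distance weight, is negative.
beta = [8.9830290157144681, -0.035968960444850887, 0.02181395662983519]

def class_probabilities(beta, x):
    """Return (Pr(NOP), Pr(OP)) from equations (16.3) and (16.2)."""
    score = beta[0] + beta[1] * x[0] + beta[2] * x[1]   # joint score S
    p_op = 1.0 / (1.0 + math.exp(-score))
    return 1.0 - p_op, p_op

for name, x in [("yxcE, yxcD", [6, -173.143442352]),
                ("yxiB, yxiA", [309, -271.005880394])]:
    q, p = class_probabilities(beta, x)
    print(name, "class OP:", round(p, 4), "class NOP:", round(q, 4))
```

The joint score is about 5.0 for yxcE, yxcD and about -8.0 for yxiB, yxiA, which is why the first pair is assigned to class OP with a probability above 0.99 and the second to class NOP.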

To get some idea of the prediction accuracy of the logistic regression model, we can apply it to the training data:

>>> for i in range(len(ys)):
...     print("True:", ys[i], "Predicted:", LogisticRegression.classify(model, xs[i]))
...
True: 1 Predicted: 1
True: 1 Predicted: 0
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0

showing that the prediction is correct for all but one of the gene pairs. A more reliable estimate of the prediction accuracy can be found from a leave-one-out analysis, in which the model is recalculated from the training data after removing the gene pair to be predicted:

>>> for i in range(len(ys)):
...     model = LogisticRegression.train(xs[:i] + xs[i+1:], ys[:i] + ys[i+1:])
...     print("True:", ys[i], "Predicted:", LogisticRegression.classify(model, xs[i]))
...
True: 1 Predicted: 1
True: 1 Predicted: 0
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 1
True: 0 Predicted: 0
True: 0 Predicted: 0

The leave-one-out analysis shows that the prediction of the logistic regression model is incorrect for only two of the gene pairs, which corresponds to a prediction accuracy of 88%.
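The leave-one-out loop does not depend on the particular classifier, so it can be written once as a helper that returns the accuracy directly. The sketch below is one possible way to do this; the names leave_one_out_accuracy, train_majority and classify_majority are invented here, and the majority-vote "classifier" is a deliberately trivial stand-in so that the example runs without Biopython.

```python
from collections import Counter

def leave_one_out_accuracy(xs, ys, train, classify):
    """Fraction of items predicted correctly when each is held out in turn."""
    correct = 0
    for i in range(len(ys)):
        # Retrain on everything except item i, then predict item i.
        model = train(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        if classify(model, xs[i]) == ys[i]:
            correct += 1
    return correct / len(ys)

def train_majority(xs, ys):
    """Stand-in 'model': the most common class label in the training data."""
    return Counter(ys).most_common(1)[0][0]

def classify_majority(model, x):
    """Always predict the stored majority label, ignoring x."""
    return model

xs = [[1.0], [2.0], [3.0], [9.0]]   # made-up data for illustration
ys = [1, 1, 1, 0]
print(leave_one_out_accuracy(xs, ys, train_majority, classify_majority))
```

Passing LogisticRegression.train and LogisticRegression.classify as the last two arguments reproduces the loop shown above; 15 correct predictions out of 17 gives 15/17, or about 0.88.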

16.1.4 Logistic Regression, Linear Discriminant Analysis, and Support Vector Machines

The logistic regression model is similar to linear discriminant analysis. In linear discriminant analysis, the class probabilities also follow equations (16.2) and (16.3). However, instead of estimating the coefficients β directly, we first fit a normal distribution to the predictor variables x. The coefficients β are then calculated from the means and covariances of the normal distribution. If the distribution of x is indeed normal, then we expect linear discriminant analysis to perform better than the logistic regression model. The logistic regression model, on the other hand, is more robust to deviations from normality.

Another similar approach is a support vector machine with a linear kernel. Such an SVM also uses a linear combination of the predictors, but estimates the coefficients β from the predictor variables x near the boundary region between the classes. If the logistic regression model (equations (16.2) and (16.3)) is a good description for x away from the boundary region, we expect the logistic regression model to perform better than an SVM with a linear kernel, as it relies on more data. If not, an SVM with a linear kernel may perform better.

Trevor Hastie, Robert Tibshirani, and Jerome Friedman: The Elements of Statistical Learning. Data Mining, Inference, and Prediction. Springer Series in Statistics, 2001. Chapter 4.4.

16.2 k-Nearest Neighbors
16.2.1 Background and purpose

The k-nearest neighbors method is a supervised learning approach that does not need to fit a model to the data. Instead, data points are classified based on the categories of the k nearest neighbors in the training data set.

In Biopython, the k-nearest neighbors method is available in Bio.kNN. To illustrate the use of the k-nearest neighbor method in Biopython, we will use the same operon data set as in section 16.1.

16.2.2 Initializing a k-nearest neighbors model

Using the data in Table 16.1, we create and initialize a k-nearest neighbors model as follows:

>>> from Bio import kNN
>>> k = 3
>>> model = kNN.train(xs, ys, k)

where xs and ys are the same as in Section 16.1.2. Here, k is the number of neighbors that will be considered for the classification. For classification into two classes, choosing an odd number for k lets you avoid tied votes. The function name train is a bit of a misnomer, since no model training is done: this function simply stores xs, ys, and k in model.

16.2.3 Using a k-nearest neighbors model for classification
To classify new data using the k-nearest neighbors model, we use the classify function. This function takes a data point (x1, x2) and finds the k nearest neighbors in the training data set xs. The data point (x1, x2) is then classified based on which category (ys) occurs most among the k neighbors.

For the example of the gene pairs yxcE, yxcD and yxiB, yxiA, we find:

>>> x = [6, -173.143442352]
>>> print("yxcE, yxcD:", kNN.classify(model, x))
yxcE, yxcD: 1
>>> x = [309, -271.005880394]
>>> print("yxiB, yxiA:", kNN.classify(model, x))
yxiB, yxiA: 0

In agreement with the logistic regression model, yxcE, yxcD are classified as belonging to the same operon (class OP), while yxiB, yxiA are predicted to belong to different operons.

The classify function lets us specify both a distance function and a weight function as optional arguments. The distance function affects which k neighbors are chosen as the nearest neighbors, as these are defined as the neighbors with the smallest distance to the query point x. By default, the Euclidean distance is used. Instead, we could for example use the city-block (Manhattan) distance:

>>> def cityblock(x1, x2):
...     assert len(x1) == 2
...     assert len(x2) == 2
...     distance = abs(x1[0] - x2[0]) + abs(x1[1] - x2[1])
...     return distance
...
>>> x = [6, -173.143442352]
>>> print("yxcE, yxcD:", kNN.classify(model, x, distance_fn=cityblock))
yxcE, yxcD: 1

The weight function can be used for weighted voting. For example, we may want to give closer neighbors a higher weight than neighbors that are further away:

>>> from math import exp
>>> def weight(x1, x2):
...     assert len(x1) == 2
...     assert len(x2) == 2
...     return exp(-abs(x1[0] - x2[0]) - abs(x1[1] - x2[1]))
...
>>> x = [6, -173.143442352]
>>> print("yxcE, yxcD:", kNN.classify(model, x, weight_fn=weight))
yxcE, yxcD: 1

By default, all neighbors are given an equal weight.

To find out how confident we can be in these predictions, we can call the calculate function, which will calculate the total weight assigned to the classes OP and NOP. For the default weighting scheme, this reduces to the number of neighbors in each category. The returned weights are indexed by class label, so index 1 holds the weight for class OP and index 0 the weight for class NOP. For yxcE, yxcD, we find

>>> x = [6, -173.143442352]
>>> weights = kNN.calculate(model, x)
>>> print("class OP: weight =", weights[1], "class NOP: weight =", weights[0])
class OP: weight = 3.0 class NOP: weight = 0.0

which means that all three neighbors of (x1, x2) are in the OP class, consistent with the classification above. As another example, for yesK, yesL we find

>>> x = [117, -267.14]
>>> weights = kNN.calculate(model, x)
>>> print("class OP: weight =", weights[1], "class NOP: weight =", weights[0])
class OP: weight = 1.0 class NOP: weight = 2.0

which means that one neighbor is an operon pair (yesK, yesL itself, which is part of the training data) and two neighbors are non-operon pairs.
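The voting scheme is simple enough to write out in plain Python, which also lets us check which training pairs end up as neighbors. The sketch below is an illustration of the method, not the Bio.kNN code: the function names class_weights and classify and their argument order are invented for this example, while the data are those of Table 16.1 (the expression scores are log-likelihood scores and therefore negative).

```python
import math

# Table 16.1: [intergene distance, expression score]; classes 1 = OP, 0 = NOP.
xs = [[-53, -200.78], [117, -267.14], [57, -163.47], [16, -190.30],
      [11, -220.94], [85, -193.94], [16, -182.71], [15, -180.41],
      [-26, -181.73], [58, -259.87], [126, -414.53], [191, -249.57],
      [113, -265.28], [145, -312.99], [154, -213.83], [147, -380.85],
      [93, -291.13]]
ys = [1] * 10 + [0] * 7

def euclidean(a, b):
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def class_weights(xs, ys, k, x, distance_fn=euclidean, weight_fn=None):
    """Total vote weight per class among the k training points nearest to x."""
    if weight_fn is None:
        weight_fn = lambda a, b: 1.0       # default: every neighbor counts equally
    nearest = sorted(range(len(xs)), key=lambda i: distance_fn(x, xs[i]))[:k]
    weights = {label: 0.0 for label in set(ys)}
    for i in nearest:
        weights[ys[i]] += weight_fn(x, xs[i])
    return weights

def classify(xs, ys, k, x, **kwargs):
    """Return the class label with the largest total weight."""
    weights = class_weights(xs, ys, k, x, **kwargs)
    return max(weights, key=weights.get)

print(class_weights(xs, ys, 3, [6, -173.143442352]))   # yxcE, yxcD
print(class_weights(xs, ys, 3, [117, -267.14]))        # yesK, yesL
```

For yxcE, yxcD the three nearest training pairs are yfmF-yfmE, yfmT-yfmS and lplB-lplC, all of class OP. For yesK, yesL the nearest pair is yesK-yesL itself (class OP), followed by yfiU-yfiV and ptsI-splA (both class NOP).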

To get some idea of the prediction accuracy of the k-nearest neighbors approach, we can apply it to the training data:

>>> for i in range(len(ys)):
...     print("True:", ys[i], "Predicted:", kNN.classify(model, xs[i]))
...
True: 1 Predicted: 1
True: 1 Predicted: 0
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0

showing that the prediction is correct for all but two of the gene pairs. A more reliable estimate of the prediction accuracy can be found from a leave-one-out analysis, in which the model is recalculated from the training data after removing the gene pair to be predicted:
>>> k = 3
>>> for i in range(len(ys)):
...     model = kNN.train(xs[:i] + xs[i+1:], ys[:i] + ys[i+1:], k)
...     print("True:", ys[i], "Predicted:", kNN.classify(model, xs[i]))
...
True: 1 Predicted: 1
True: 1 Predicted: 0
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 1
True: 1 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 1
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 0
True: 0 Predicted: 1

The leave-one-out analysis shows that the k-nearest neighbors model is correct for 13 out of 17 gene pairs, which corresponds to a prediction accuracy of 76%.

16.3 Naïve Bayes
This section will describe the Bio.NaiveBayes module.

16.4 Maximum Entropy
This section will describe the Bio.MaximumEntropy module.

16.5 Markov Models
This section will describe the Bio.MarkovModel and/or Bio.HMM.MarkovModel modules.

Chapter 17 Graphics including GenomeDiagram
The Bio.Graphics module depends on the third-party Python library ReportLab. Although focused on producing PDF files, ReportLab can also create encapsulated postscript (EPS) and SVG files. In addition to these vector-based images, provided certain further dependencies such as the Python Imaging Library (PIL) are installed, ReportLab can also output bitmap images (including JPEG, PNG, GIF, BMP and PICT formats).

17.1 GenomeDiagram
17.1.1 Introduction

The Bio.Graphics.GenomeDiagram module was added to Biopython 1.50, having previously been available as a separate Python module dependent on Biopython. GenomeDiagram is described in the Bioinformatics journal publication by Pritchard et al. (2006) [2], which includes some example images. There is a PDF copy of the old manual here, http://biopython.org/DIST/docs/GenomeDiagram/userguide.pdf which has some more examples.

As the name might suggest, GenomeDiagram was designed for drawing whole genomes, in particular prokaryotic genomes, either as linear diagrams (optionally broken up into fragments to fit better) or as circular wheel diagrams. Have a look at Figure 2 in Toth et al. (2006) [3] for a good example. It proved also well suited to drawing quite detailed figures for smaller genomes such as phage, plasmids or mitochondria, for example see Figures 1 and 2 in Van der Auwera et al. (2009) [4] (shown with additional manual editing).

This module is easiest to use if you have your genome loaded as a SeqRecord object containing lots of SeqFeature objects, for example as loaded from a GenBank file (see Chapters 4 and 5).

17.1.2 Diagrams, tracks, feature-sets and features
GenomeDiagram uses a nested set of objects. At the top level, you have a diagram object representing a sequence (or sequence region) along the horizontal axis (or circle). A diagram can contain one or more tracks, shown stacked vertically (or radially on circular diagrams). These will typically all have the same length and represent the same sequence region. You might use one track to show the gene locations, another to show regulatory regions, and a third track to show the GC percentage.

The most commonly used type of track will contain features, bundled together in feature-sets. You might choose to use one feature-set for all your CDS features, and another for tRNA features. This isn't required (they can all go in the same feature-set), but it makes it easier to update the properties of just selected features (e.g. make all the tRNA features red).

There are two main ways to build up a complete diagram. Firstly, the top-down approach, where you create a diagram object, then use its methods to add track(s), use the track methods to add feature-set(s), and use their methods to add the features. Secondly, you can create the individual objects separately (in whatever order suits your code), and then combine them.

17.1.3 A top-down example

We're going to draw a whole genome from a SeqRecord object read in from a GenBank file (see Chapter 5). This example uses the pPCP1 plasmid from Yersinia pestis biovar Microtus; the file is included with the Biopython unit tests under the GenBank folder, or available online as NC_005816.gb from our website.

from reportlab.lib import colors
from reportlab.lib.units import cm
from Bio.Graphics import GenomeDiagram
from Bio import SeqIO
record = SeqIO.read("NC_005816.gb", "genbank")

We're using a top-down approach, so after loading in our sequence we next create an empty diagram, then add an (empty) track, and to that add an (empty) feature-set:

gd_diagram = GenomeDiagram.Diagram("Yersinia pestis biovar Microtus plasmid pPCP1")
gd_track_for_features = gd_diagram.new_track(1, name="Annotated Features")
gd_feature_set = gd_track_for_features.new_set()

Now the fun part: we take each gene SeqFeature object in our SeqRecord, and use it to generate a feature on the diagram. We're going to color them blue, alternating between a dark blue and a light blue.

for feature in record.features:
    if feature.type != "gene":
        # Exclude this feature
        continue
    if len(gd_feature_set) % 2 == 0:
        color = colors.blue
    else:
        color = colors.lightblue
    gd_feature_set.add_feature(feature, color=color, label=True)

Now we come to actually making the output file. This happens in two steps: first we call the draw method, which creates all the shapes using ReportLab objects. Then we call the write method, which renders these to the requested file format. Note you can output in multiple file formats:

gd_diagram.draw(format="linear", orientation="landscape", pagesize='A4',
                fragments=4, start=0, end=len(record))
gd_diagram.write("plasmid_linear.pdf", "PDF")
gd_diagram.write("plasmid_linear.eps", "EPS")
gd_diagram.write("plasmid_linear.svg", "SVG")

Provided you have the dependencies installed, you can also do bitmaps, for example:

gd_diagram.write("plasmid_linear.png", "PNG")

Notice that the fragments argument, which we set to four, controls how many pieces the genome gets broken up into.
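As a rough picture of what breaking the genome into pieces means, the sketch below splits a base range into consecutive fragments. It assumes the pieces are of equal length apart from a possibly shorter last one, which may differ in detail from GenomeDiagram's own layout code; the function fragment_ranges is hypothetical and not part of Biopython.

```python
import math

def fragment_ranges(start, end, fragments):
    """Split [start, end) into consecutive pieces of (nearly) equal length."""
    size = math.ceil((end - start) / fragments)
    ranges = []
    for i in range(fragments):
        piece_start = min(start + i * size, end)
        piece_end = min(piece_start + size, end)   # the last piece may be shorter
        ranges.append((piece_start, piece_end))
    return ranges

# A made-up 1000 bp region drawn with fragments=4
for piece in fragment_ranges(0, 1000, 4):
    print(piece)
```

Each piece would then be drawn as its own horizontal strip on the page, one below the other.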

If you want to do a circular figure, then try this:

gd_diagram.draw(format="circular", circular=True, pagesize=(20*cm, 20*cm),
                start=0, end=len(record), circle_core=0.7)
gd_diagram.write("plasmid_circular.pdf", "PDF")

These figures are not very exciting, but we've only just got started.

17.1.4 A bottom-up example
Now let's produce exactly the same figures, but using the bottom-up approach. This means we create the different objects directly (and this can be done in almost any order) and then combine them.

from reportlab.lib import colors
from reportlab.lib.units import cm
from Bio.Graphics import GenomeDiagram
from Bio import SeqIO
record = SeqIO.read("NC_005816.gb", "genbank")

# Create the feature set and its feature objects,
gd_feature_set = GenomeDiagram.FeatureSet()
for feature in record.features:
    if feature.type != "gene":
        # Exclude this feature
        continue
    if len(gd_feature_set) % 2 == 0:
        color = colors.blue
    else:
        color = colors.lightblue
    gd_feature_set.add_feature(feature, color=color, label=True)
# (this for loop is the same as in the previous example)

# Create a track, and a diagram
gd_track_for_features = GenomeDiagram.Track(name="Annotated Features")
gd_diagram = GenomeDiagram.Diagram("Yersinia pestis biovar Microtus plasmid pPCP1")

# Now have to glue the bits together...
gd_track_for_features.add_set(gd_feature_set)
gd_diagram.add_track(gd_track_for_features, 1)

You can now call the draw and write methods as before to produce a linear or circular diagram, using the code at the end of the top-down example above. The figures should be identical.

17.1.5 Features without a SeqFeature

In the above example we used a SeqRecord's SeqFeature objects to build our diagram (see also Section 4.3). Sometimes you won't have SeqFeature objects, but just the coordinates for a feature you want to draw. You have to create a minimal SeqFeature object, but this is easy:

from Bio.SeqFeature import SeqFeature, FeatureLocation
my_seq_feature = SeqFeature(FeatureLocation(50, 100), strand=+1)

For strand, use +1 for the forward strand, -1 for the reverse strand, and None for both. Here is a short self-contained example:

from Bio.SeqFeature import SeqFeature, FeatureLocation
from Bio.Graphics import GenomeDiagram
from reportlab.lib.units import cm

gdd = GenomeDiagram.Diagram('Test Diagram')
gdt_features = gdd.new_track(1, greytrack=False)
gds_features = gdt_features.new_set()

# Add three features to show the strand options,
feature = SeqFeature(FeatureLocation(25, 125), strand=+1)
gds_features.add_feature(feature, name="Forward", label=True)
feature = SeqFeature(FeatureLocation(150, 250), strand=None)
gds_features.add_feature(feature, name="Strandless", label=True)
feature = SeqFeature(FeatureLocation(275, 375), strand=-1)
gds_features.add_feature(feature, name="Reverse", label=True)

gdd.draw(format='linear', pagesize=(15*cm, 4*cm), fragments=1,
         start=0, end=400)
gdd.write("GD_labels_default.pdf", "pdf")

The top part of the image in the next subsection shows the output (in the default feature color, pale green).

Notice that we have used the name argument here to specify the caption text for these features. This is discussed in more detail next.

17.1.6 Feature captions
Recall we used the following (where feature was a SeqFeature object) to add a feature to the diagram:

gd_feature_set.add_feature(feature, color=color, label=True)

In the example above the SeqFeature annotation was used to pick a sensible caption for the features. By default the following possible entries under the SeqFeature object's qualifiers dictionary are used: gene, label, name, locus_tag, and product. More simply, you can specify a name directly:

gd_feature_set.add_feature(feature, color=color, label=True, name="My Gene")

In addition to the caption text for each feature's label, you can also choose the font, position (this defaults to the start of the sigil; you can also choose the middle or the end) and orientation (for linear diagrams only, where this defaults to rotated by 45 degrees):

# Large font, parallel with the track
gd_feature_set.add_feature(feature, label=True, color="green",
                           label_size=25, label_angle=0)

# Very small font, perpendicular to the track (towards it)
gd_feature_set.add_feature(feature, label=True, color="purple",
                           label_position="end",
                           label_size=4, label_angle=90)

# Small font, perpendicular to the track (away from it)
gd_feature_set.add_feature(feature, label=True, color="blue",
                           label_position="middle",
                           label_size=6, label_angle=-90)

Combining each of these three fragments with the complete example in the previous section should give something like this:

[Figure: the forward, strandless and reverse features drawn with the default captions (top) and with each of the three label settings above.]

We've not shown it here, but you can also set label_color to control the label's color (used in Section 17.1.9).

You'll notice the default font is quite small. This makes sense because you will usually be drawing many (small) features on a page, not just a few large ones as shown here.

17.1.7 Feature sigils
The examples above have all just used the default sigil for the feature, a plain box, which was all that was available in the last publicly released standalone version of GenomeDiagram. Arrow sigils were included when GenomeDiagram was added to Biopython 1.50:

# Default uses a BOX sigil
gd_feature_set.add_feature(feature)

# You can make this explicit:
gd_feature_set.add_feature(feature, sigil="BOX")

# Or opt for an arrow:
gd_feature_set.add_feature(feature, sigil="ARROW")

Biopython 1.61 added three more sigils:

# Box with corners cut off (making it an octagon)
gd_feature_set.add_feature(feature, sigil="OCTO")

# Box with jagged edges (useful for showing breaks in contigs)
gd_feature_set.add_feature(feature, sigil="JAGGY")

# Arrow which spans the axis with strand used only for direction
gd_feature_set.add_feature(feature, sigil="BIGARROW")

[Figure: the BOX, ARROW, OCTO, JAGGY and BIGARROW sigils.]

Most sigils fit into a bounding box (as given by the default BOX sigil), either above or below the axis for the forward or reverse strand, or straddling it (double the height) for strandless features. The BIGARROW sigil is different, always straddling the axis with the direction taken from the feature's strand.

17.1.8 Arrow sigils

We introduced the arrow sigils in the previous section. There are two additional options to adjust the shapes of the arrows. Firstly, the thickness of the arrow shaft, given as a proportion of the height of the bounding box:

# Full height shafts, giving pointed boxes:
gd_feature_set.add_feature(feature, sigil="ARROW", color="brown",
                           arrowshaft_height=1.0)
# Or, thin shafts:
gd_feature_set.add_feature(feature, sigil="ARROW", color="teal",
                           arrowshaft_height=0.2)
# Or, very thin shafts:
gd_feature_set.add_feature(feature, sigil="ARROW", color="darkgreen",
                           arrowshaft_height=0.1)

[Figure: arrow sigils drawn with the three shaft heights above.]


Secondly, the length of the arrow head, given as a proportion of the height of the bounding box (defaulting to 0.5, or 50%):

# Short arrow heads:
gd_feature_set.add_feature(feature, sigil="ARROW", color="blue",
                           arrowhead_length=0.25)
# Or, longer arrow heads:
gd_feature_set.add_feature(feature, sigil="ARROW", color="orange",
                           arrowhead_length=1)
# Or, very very long arrow heads (i.e. all head, no shaft, so triangles):
gd_feature_set.add_feature(feature, sigil="ARROW", color="red",
                           arrowhead_length=10000)

[Figure: arrow sigils drawn with the three arrow head lengths above.]


Biopython 1.61 adds a new BIGARROW sigil which always straddles the axis, pointing left for the reverse strand or right otherwise:

# A large arrow straddling the axis:
gd_feature_set.add_feature(feature, sigil="BIGARROW")

All the shaft and arrow head options shown above for the ARROW sigil can be used for the BIGARROW sigil too.

17.1.9 A nice example
Now let's return to the pPCP1 plasmid from Yersinia pestis biovar Microtus, and the top-down approach used in Section 17.1.3, but take advantage of the sigil options we've now discussed. This time we'll use arrows for the genes, and overlay them with strandless features (as plain boxes) showing the position of some restriction digest sites.

from reportlab.lib import colors
from reportlab.lib.units import cm
from Bio.Graphics import GenomeDiagram
from Bio import SeqIO
from Bio.SeqFeature import SeqFeature, FeatureLocation

record = SeqIO.read("NC_005816.gb", "genbank")

gd_diagram = GenomeDiagram.Diagram(record.id)
gd_track_for_features = gd_diagram.new_track(1, name="Annotated Features")
gd_feature_set = gd_track_for_features.new_set()

for feature in record.features:
    if feature.type != "gene":
        # Exclude this feature
        continue
    if len(gd_feature_set) % 2 == 0:
        color = colors.blue
    else:
        color = colors.lightblue
    gd_feature_set.add_feature(feature, sigil="ARROW",
                               color=color, label=True,
                               label_size=14, label_angle=0)

# I want to include some strandless features, so for an example
# will use EcoRI recognition sites etc.
for site, name, color in [("GAATTC", "EcoRI", colors.green),
                          ("CCCGGG", "SmaI", colors.orange),
                          ("AAGCTT", "HindIII", colors.red),
                          ("GGATCC", "BamHI", colors.purple)]:
    index = 0
    while True:
        index = record.seq.find(site, start=index)
        if index == -1:
            break
        feature = SeqFeature(FeatureLocation(index, index + len(site)))
        gd_feature_set.add_feature(feature, color=color, name=name,
                                   label=True, label_size=10,
                                   label_color=color)
        index += len(site)

gd_diagram.draw(format="linear", pagesize='A4', fragments=4,
                start=0, end=len(record))
gd_diagram.write("plasmid_linear_nice.pdf", "PDF")
gd_diagram.write("plasmid_linear_nice.eps", "EPS")
gd_diagram.write("plasmid_linear_nice.svg", "SVG")

gd_diagram.draw(format="circular", circular=True, pagesize=(20*cm, 20*cm),
                start=0, end=len(record), circle_core=0.5)
gd_diagram.write("plasmid_circular_nice.pdf", "PDF")
gd_diagram.write("plasmid_circular_nice.eps", "EPS")
gd_diagram.write("plasmid_circular_nice.svg", "SVG")

Andtheoutput:
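
If you want to try the restriction site search from this example on its own, here is a minimal sketch using a plain Python string instead of a Seq object. The helper name find_sites and the toy sequence are made up for illustration; the point is the search loop, which stops when find returns -1 and otherwise resumes just after the previous match.

```python
def find_sites(sequence, site):
    """Return start positions of non-overlapping matches of site in sequence."""
    positions = []
    index = 0
    while True:
        index = sequence.find(site, index)
        if index == -1:
            # str.find signals "no further match" with -1
            break
        positions.append(index)
        index += len(site)  # resume the search after this match
    return positions


# Made-up toy sequence containing two EcoRI sites and one BamHI site
toy_sequence = "AAGAATTCTTGGATCCAAGAATTC"
ecori_positions = find_sites(toy_sequence, "GAATTC")
bamhi_positions = find_sites(toy_sequence, "GGATCC")
print(ecori_positions, bamhi_positions)
```

Each position found this way corresponds to one FeatureLocation(index, index + len(site)) in the example above.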

17.1.10 Multiple tracks
All the examples so far have used a single track, but you can have more than one track, for example show the genes on one, and repeat regions on another. In this example we're going to show three phage genomes side by side to scale, inspired by Figure 6 in Proux et al. (2002) [5]. We'll need the GenBank files for the following three phage:

NC_002703 Lactococcus phage Tuc2009, complete genome (38347 bp)
AF323668 Bacteriophage bIL285, complete genome (35538 bp)
NC_003212 Listeria innocua Clip11262, complete genome, of which we are focussing only on integrated prophage 5 (similar length).

You can download these using Entrez if you like, see Section 9.6 for more details. For the third record we've worked out where the phage is integrated into the genome, and slice the record to extract it (with the features preserved, see Section 4.7), and must also reverse complement to match the orientation of the first two phage (again preserving the features, see Section 4.9):

from Bio import SeqIO

A_rec = SeqIO.read("NC_002703.gbk", "gb")
B_rec = SeqIO.read("AF323668.gbk", "gb")
C_rec = SeqIO.read("NC_003212.gbk", "gb")[2587879:2625807].reverse_complement(name=True)

The figure we are imitating used different colors for different gene functions. One way to do this is to edit the GenBank file to record color preferences for each feature, something Sanger's Artemis editor does, and which GenomeDiagram should understand. Here however, we'll just hard code three lists of colors.

Note that the annotation in the GenBank files doesn't exactly match that shown in Proux et al.; they have drawn some unannotated genes.

from reportlab.lib.colors import red, grey, orange, green, brown, blue, lightblue, purple

A_colors=[red]*5+[grey]*7+[orange]*2+[grey]*2+[orange]+[grey]*11+[green]*4\
+[grey]+[green]*2+[grey,green]+[brown]*5+[blue]*4+[lightblue]*5\
+[grey,lightblue]+[purple]*2+[grey]
B_colors=[red]*6+[grey]*8+[orange]*2+[grey]+[orange]+[grey]*21+[green]*5\
+[grey]+[brown]*4+[blue]*3+[lightblue]*3+[grey]*5+[purple]*2
C_colors=[grey]*30+[green]*5+[brown]*4+[blue]*2+[grey,blue]+[lightblue]*2\
+[grey]*5

Now to draw them. This time we add three tracks to the diagram, and also notice they are given different start/end values to reflect their different lengths (this requires Biopython 1.59 or later).

from Bio.Graphics import GenomeDiagram

name = "Proux Fig 6"
gd_diagram = GenomeDiagram.Diagram(name)
max_len = 0
for record, gene_colors in zip([A_rec, B_rec, C_rec], [A_colors, B_colors, C_colors]):
    max_len = max(max_len, len(record))
    gd_track_for_features = gd_diagram.new_track(1,
                                                 name=record.name,
                                                 greytrack=True,
                                                 start=0, end=len(record))
    gd_feature_set = gd_track_for_features.new_set()

    i = 0
    for feature in record.features:
        if feature.type != "gene":
            # Exclude this feature
            continue
        gd_feature_set.add_feature(feature, sigil="ARROW",
                                   color=gene_colors[i], label=True,
                                   name=str(i + 1),
                                   label_position="start",
                                   label_size=6, label_angle=0)
        i += 1

gd_diagram.draw(format="linear", pagesize='A4', fragments=1,
                start=0, end=max_len)
gd_diagram.write(name + ".pdf", "PDF")
gd_diagram.write(name + ".eps", "EPS")
gd_diagram.write(name + ".svg", "SVG")

The result:

I did wonder why in the original manuscript there were no red or orange genes marked in the bottom phage. Another important point is that here the phage are shown with different lengths. This is because they are all drawn to the same scale (they are different lengths).

The key difference from the published figure is that they have color-coded links between similar proteins, which is what we will do in the next section.

17.1.11 Cross-Links between tracks
Biopython 1.59 added the ability to draw cross links between tracks, both simple linear diagrams as we will show here, but also linear diagrams split into fragments and circular diagrams.

Continuing the example from the previous section inspired by Figure 6 from Proux et al. 2002 [5], we would need a list of cross links between pairs of genes, along with a score or color to use. Realistically you might extract this from a BLAST file computationally, but here I have manually typed them in.

My naming convention continues to refer to the three phage as A, B and C. Here are the links we want to show between A and B, given as a list of tuples (percentage similarity score, gene in A, gene in B).

# Tuc2009 (NC_002703) vs bIL285 (AF323668)
A_vs_B=[
(99,"Tuc2009_01","int"),
(33,"Tuc2009_03","orf4"),
(94,"Tuc2009_05","orf6"),
(100,"Tuc2009_06","orf7"),
(97,"Tuc2009_07","orf8"),
(98,"Tuc2009_08","orf9"),
(98,"Tuc2009_09","orf10"),
(100,"Tuc2009_10","orf12"),
(100,"Tuc2009_11","orf13"),
(94,"Tuc2009_12","orf14"),
(87,"Tuc2009_13","orf15"),
(94,"Tuc2009_14","orf16"),
(94,"Tuc2009_15","orf17"),
(88,"Tuc2009_17","rusA"),
(91,"Tuc2009_18","orf20"),
(93,"Tuc2009_19","orf22"),
(71,"Tuc2009_20","orf23"),
(51,"Tuc2009_22","orf27"),
(97,"Tuc2009_23","orf28"),
(88,"Tuc2009_24","orf29"),
(26,"Tuc2009_26","orf38"),
(19,"Tuc2009_46","orf52"),
(77,"Tuc2009_48","orf54"),
(91,"Tuc2009_49","orf55"),
(95,"Tuc2009_52","orf60"),
]

Likewise for B and C:

# bIL285 (AF323668) vs Listeria innocua prophage 5 (in NC_003212)
B_vs_C=[
(42,"orf39","lin2581"),
(31,"orf40","lin2580"),
(49,"orf41","lin2579"),#terL
(54,"orf42","lin2578"),#portal
(55,"orf43","lin2577"),#protease
(33,"orf44","lin2576"),#mhp
(51,"orf46","lin2575"),
(33,"orf47","lin2574"),
(40,"orf48","lin2573"),
(25,"orf49","lin2572"),
(50,"orf50","lin2571"),
(48,"orf51","lin2570"),
(24,"orf52","lin2568"),
(30,"orf53","lin2567"),
(28,"orf54","lin2566"),
]

For the first and last phage these identifiers are locus tags; for the middle phage there are no locus tags so I've used gene names instead. The following little helper function lets us look up a feature using either a locus tag or gene name:

def get_feature(features, id, tags=("locus_tag", "gene")):
    """Search list of SeqFeature objects for an identifier under the given tags."""
    for f in features:
        for key in tags:
            # tag may not be present in this feature
            for x in f.qualifiers.get(key, []):
                if x == id:
                    return f
    raise KeyError(id)
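
To see how this helper behaves without loading any GenBank files, here is a small sketch using stand-in objects. The toy_features list is hypothetical: real SeqFeature objects carry much more, but the helper only looks at the qualifiers dictionary, so a SimpleNamespace with that one attribute is enough for illustration.

```python
from types import SimpleNamespace


def get_feature(features, id, tags=("locus_tag", "gene")):
    """Search list of feature-like objects for an identifier under the given tags."""
    for f in features:
        for key in tags:
            # tag may not be present in this feature
            for x in f.qualifiers.get(key, []):
                if x == id:
                    return f
    raise KeyError(id)


# Hypothetical stand-ins for SeqFeature objects: only .qualifiers is needed here
toy_features = [
    SimpleNamespace(qualifiers={"locus_tag": ["Tuc2009_01"]}),
    SimpleNamespace(qualifiers={"gene": ["int"]}),
    SimpleNamespace(qualifiers={}),  # a feature with neither tag
]

by_locus_tag = get_feature(toy_features, "Tuc2009_01")
by_gene_name = get_feature(toy_features, "int")

try:
    get_feature(toy_features, "no_such_gene")
    missing_raises = False
except KeyError:
    missing_raises = True
```

Note that an unknown identifier raises KeyError, so a typo in one of the hand-typed link tables will stop the script rather than silently drop a link.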

We can now turn those lists of identifier pairs into SeqFeature pairs, and thus find their location coordinates. We can now add all that code and the following snippet to the previous example (just before the gd_diagram.draw(...) line; see the finished example script Proux_et_al_2002_Figure_6.py included in the Doc/examples folder of the Biopython source code) to add cross links to the figure:

from Bio.Graphics.GenomeDiagram import CrossLink
from reportlab.lib import colors
# Note it might have been clearer to assign the track numbers explicitly...
for rec_X, tn_X, rec_Y, tn_Y, X_vs_Y in [(A_rec, 3, B_rec, 2, A_vs_B),
                                         (B_rec, 2, C_rec, 1, B_vs_C)]:
    track_X = gd_diagram.tracks[tn_X]
    track_Y = gd_diagram.tracks[tn_Y]
    for score, id_X, id_Y in X_vs_Y:
        feature_X = get_feature(rec_X.features, id_X)
        feature_Y = get_feature(rec_Y.features, id_Y)
        color = colors.linearlyInterpolatedColor(colors.white, colors.firebrick, 0, 100, score)
        link_xy = CrossLink((track_X, feature_X.location.start, feature_X.location.end),
                            (track_Y, feature_Y.location.start, feature_Y.location.end),
                            color, colors.lightgrey)
        gd_diagram.cross_track_links.append(link_xy)

There are several important pieces to this code. First the GenomeDiagram object has a cross_track_links attribute which is just a list of CrossLink objects. Each CrossLink object takes two sets of track-specific coordinates (here given as tuples, you can alternatively use a GenomeDiagram.Feature object instead). You can optionally supply a colour, border color, and say if this link should be drawn flipped (useful for showing inversions).
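
The colour for each link in the loop above comes from mapping a 0 to 100 score onto a scale running from white to a dark red. As a rough sketch of what such a linear mapping does (this is plain Python written for illustration, not ReportLab's implementation, and the DARK_RED triple is only an approximate stand-in for the colour used above):

```python
def interpolate_rgb(rgb_low, rgb_high, x_low, x_high, x):
    """Linearly blend two RGB triples (components in 0..1) for x in [x_low, x_high]."""
    fraction = (x - x_low) / float(x_high - x_low)
    return tuple(lo + fraction * (hi - lo) for lo, hi in zip(rgb_low, rgb_high))


WHITE = (1.0, 1.0, 1.0)
DARK_RED = (0.7, 0.13, 0.13)  # illustrative value, assumed rather than taken from ReportLab

weak_link = interpolate_rgb(WHITE, DARK_RED, 0, 100, 19)    # low score: nearly white
strong_link = interpolate_rgb(WHITE, DARK_RED, 0, 100, 99)  # high score: nearly dark red
midpoint = interpolate_rgb(WHITE, DARK_RED, 0, 100, 50)
```

A low score such as 19 therefore gives a pale link, while a score of 99 gives a link very close to the full dark red.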

You can also see how we turn the BLAST percentage identity score into a colour, interpolating between white (0%) and a dark red (100%). In this example we don't have any problems with overlapping cross-links. One way to tackle that is to use transparency in ReportLab, by using colors with their alpha channel set. However, this kind of shaded color scheme combined with overlap transparency would be difficult to interpret. The result:

There is still a lot more that can be done within Biopython to help improve this figure. First of all, the cross links in this case are between proteins which are drawn in a strand-specific manner. It can help to add a background region (a feature using the BOX sigil) on the feature track to extend the cross link. Also, we could reduce the vertical height of the feature tracks to allocate more to the links instead; one way to do that is to allocate space for empty tracks. Furthermore, in cases like this where there are no large gene overlaps, we can use the axis-straddling BIGARROW sigil, which allows us to further reduce the vertical space needed for the track. These improvements are demonstrated in the example script Proux_et_al_2002_Figure_6.py included in the Doc/examples folder of the Biopython source code. The result:

Beyond that, finishing touches you might want to do manually in a vector image editor include fine-tuning the placement of gene labels, and adding other custom annotation such as highlighting particular regions.

Although not really necessary in this example since none of the cross-links overlap, using a transparent color in ReportLab is a very useful technique for superimposing multiple links. However, in this case a shaded color scheme should be avoided.

17.1.12 Further options
You can control the tick marks to show the scale (after all, every graph should show its units), and the number of the grey-track labels.

Also, we have only used the FeatureSet so far. GenomeDiagram also has a GraphSet which can be used to show line graphs, bar charts and heat plots (e.g. to show plots of GC% on a track parallel to the features).

These options are not covered here yet, so for now we refer you to the User Guide (PDF) included with the standalone version of GenomeDiagram (but please read the next section first), and the docstrings.

17.1.13 Converting old code
If you have old code written using the standalone version of GenomeDiagram, and you want to switch it over to using the new version included with Biopython, then you will have to make a few changes, most importantly to your import statements.

Also, the older version of GenomeDiagram used only the UK spellings of color and center (colour and centre). You will need to change to the American spellings, although for several years the Biopython version of GenomeDiagram supported both.

For example, if you used to have:

from GenomeDiagram import GDFeatureSet, GDDiagram
gdd = GDDiagram("An example")
...

you could just switch the import statements like this:

from Bio.Graphics.GenomeDiagram import FeatureSet as GDFeatureSet, Diagram as GDDiagram
gdd = GDDiagram("An example")
...

and hopefully that should be enough. In the long term you might want to switch to the new names, but you would have to change more of your code:

from Bio.Graphics.GenomeDiagram import FeatureSet, Diagram
gdd = Diagram("An example")
...

or:

from Bio.Graphics import GenomeDiagram
gdd = GenomeDiagram.Diagram("An example")
...

If you run into difficulties, please ask on the Biopython mailing list for advice. One catch is that we have not included the old module GenomeDiagram.GDUtilities yet. This included a number of GC% related functions, which will probably be merged under Bio.SeqUtils later on.

17.2 Chromosomes
The Bio.Graphics.BasicChromosome module allows drawing of chromosomes. There is an example in Jupe et al. (2012) [6] (open access) using colors to highlight different gene families.

17.2.1 Simple Chromosomes
Here is a very simple example, for which we'll use Arabidopsis thaliana.

You can skip this bit, but first I downloaded the five sequenced chromosomes from the NCBI's FTP site ftp://ftp.ncbi.nlm.nih.gov/genomes/Arabidopsis_thaliana and then parsed them with Bio.SeqIO to find out their lengths. You could use the GenBank files for this, but it is faster to use the FASTA files for the whole chromosomes:

from Bio import SeqIO
entries = [("Chr I", "CHR_I/NC_003070.fna"),
           ("Chr II", "CHR_II/NC_003071.fna"),
           ("Chr III", "CHR_III/NC_003074.fna"),
           ("Chr IV", "CHR_IV/NC_003075.fna"),
           ("Chr V", "CHR_V/NC_003076.fna")]
for (name, filename) in entries:
    record = SeqIO.read(filename, "fasta")
    print(name, len(record))

This gave the lengths of the five chromosomes, which we'll now use in the following short demonstration of the BasicChromosome module:

from reportlab.lib.units import cm
from Bio.Graphics import BasicChromosome

entries = [("Chr I", 30432563),
           ("Chr II", 19705359),
           ("Chr III", 23470805),
           ("Chr IV", 18585042),
           ("Chr V", 26992728)]

max_len = 30432563  # Could compute this
telomere_length = 1000000  # For illustration

chr_diagram = BasicChromosome.Organism()
chr_diagram.page_size = (29.7*cm, 21*cm)  # A4 landscape

for name, length in entries:
    cur_chromosome = BasicChromosome.Chromosome(name)
    # Set the scale to the MAXIMUM length plus the two telomeres in bp,
    # want the same scale used on all five chromosomes so they can be
    # compared to each other
    cur_chromosome.scale_num = max_len + 2 * telomere_length

    # Add an opening telomere
    start = BasicChromosome.TelomereSegment()
    start.scale = telomere_length
    cur_chromosome.add(start)

    # Add a body - using bp as the scale length here.
    body = BasicChromosome.ChromosomeSegment()
    body.scale = length
    cur_chromosome.add(body)

    # Add a closing telomere
    end = BasicChromosome.TelomereSegment(inverted=True)
    end.scale = telomere_length
    cur_chromosome.add(end)

    # This chromosome is done
    chr_diagram.add(cur_chromosome)

chr_diagram.draw("simple_chrom.pdf", "Arabidopsis thaliana")

This should create a very simple PDF file, shown here:

This example is deliberately short and sweet. The next example shows the location of features of interest.
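
The example above hard codes max_len with a comment that it could be computed. A minimal sketch of doing that from the same entries list (the variable names longest_name and scale_num are just for illustration here):

```python
entries = [("Chr I", 30432563),
           ("Chr II", 19705359),
           ("Chr III", 23470805),
           ("Chr IV", 18585042),
           ("Chr V", 26992728)]
telomere_length = 1000000  # For illustration, as in the example above

# Longest chromosome body in bp, computed rather than typed in by hand
max_len = max(length for name, length in entries)
longest_name = max(entries, key=lambda entry: entry[1])[0]

# The shared scale used for every chromosome: longest body plus two telomeres
scale_num = max_len + 2 * telomere_length
print(longest_name, max_len, scale_num)
```

Computing the value this way keeps the scale correct if you later swap in a different organism or add chromosomes to the list.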

17.2.2 Annotated Chromosomes
Continuing from the previous example, let's also show the tRNA genes. We'll get their locations by parsing the GenBank files for the five Arabidopsis thaliana chromosomes. You'll need to download these files from the NCBI FTP site ftp://ftp.ncbi.nlm.nih.gov/genomes/Arabidopsis_thaliana, and preserve the subdirectory names or edit the paths below:

from reportlab.lib.units import cm
from Bio import SeqIO
from Bio.Graphics import BasicChromosome

entries = [("Chr I", "CHR_I/NC_003070.gbk"),
           ("Chr II", "CHR_II/NC_003071.gbk"),
           ("Chr III", "CHR_III/NC_003074.gbk"),
           ("Chr IV", "CHR_IV/NC_003075.gbk"),
           ("Chr V", "CHR_V/NC_003076.gbk")]

max_len = 30432563  # Could compute this
telomere_length = 1000000  # For illustration

chr_diagram = BasicChromosome.Organism()
chr_diagram.page_size = (29.7*cm, 21*cm)  # A4 landscape

for index, (name, filename) in enumerate(entries):
    record = SeqIO.read(filename, "genbank")
    length = len(record)
    features = [f for f in record.features if f.type == "tRNA"]
    # Record an Artemis style integer color in the feature's qualifiers,
    # 1 = Black, 2 = Red, 3 = Green, 4 = blue, 5 = cyan, 6 = purple
    for f in features:
        f.qualifiers["color"] = [index + 2]

    cur_chromosome = BasicChromosome.Chromosome(name)
    # Set the scale to the MAXIMUM length plus the two telomeres in bp,
    # want the same scale used on all five chromosomes so they can be
    # compared to each other
    cur_chromosome.scale_num = max_len + 2 * telomere_length

    # Add an opening telomere
    start = BasicChromosome.TelomereSegment()
    start.scale = telomere_length
    cur_chromosome.add(start)

    # Add a body - again using bp as the scale length here.
    body = BasicChromosome.AnnotatedChromosomeSegment(length, features)
    body.scale = length
    cur_chromosome.add(body)

    # Add a closing telomere
    end = BasicChromosome.TelomereSegment(inverted=True)
    end.scale = telomere_length
    cur_chromosome.add(end)

    # This chromosome is done
    chr_diagram.add(cur_chromosome)

chr_diagram.draw("tRNA_chrom.pdf", "Arabidopsis thaliana")

It might warn you about the labels being too close together (have a look at the forward strand, on the right hand side, of Chr I), but it should create a colorful PDF file, shown here:

Chapter 18 KEGG
KEGG (http://www.kegg.jp/) is a database resource for understanding high-level functions and utilities of the biological system, such as the cell, the organism and the ecosystem, from molecular-level information, especially large-scale molecular datasets generated by genome sequencing and other high-throughput experimental technologies.

Please note that the KEGG parser implementation in Biopython is incomplete. While the KEGG website indicates many flat file formats, only parsers and writers for compound, enzyme, and map are currently implemented. However, a generic parser is implemented to handle the other formats.

18.1 Parsing KEGG records
Parsing a KEGG record is as simple as using any other file format parser in Biopython. (Before running the following code, please open http://rest.kegg.jp/get/ec:5.4.2.2 with your web browser and save it as ec_5.4.2.2.txt.)

>>> from Bio.KEGG import Enzyme
>>> records = Enzyme.parse(open("ec_5.4.2.2.txt"))
>>> record = list(records)[0]
>>> record.classname
['Isomerases;', 'Intramolecular transferases;', 'Phosphotransferases (phosphomutases)']
>>> record.entry
'5.4.2.2'

Alternatively, if the input KEGG file has exactly one entry, you can use read:

>>> from Bio.KEGG import Enzyme
>>> record = Enzyme.read(open("ec_5.4.2.2.txt"))
>>> record.classname
['Isomerases;', 'Intramolecular transferases;', 'Phosphotransferases (phosphomutases)']
>>> record.entry
'5.4.2.2'

The following section will show how to download the above enzyme using the KEGG API, as well as how to use the generic parser with data that does not have a custom parser implemented.

18.2 Querying the KEGG API
Biopython has full support for querying the KEGG API. Querying all KEGG endpoints is supported; all methods documented by KEGG (http://www.kegg.jp/kegg/rest/keggapi.html) are supported. The interface has some validation of queries which follows rules defined on the KEGG site. However, invalid queries which return a 400 or 404 must be handled by the user.

First, here is how to extend the above example by downloading the relevant enzyme and passing it through the Enzyme parser.

>>> from Bio.KEGG import REST
>>> from Bio.KEGG import Enzyme
>>> request = REST.kegg_get("ec:5.4.2.2")
>>> open("ec_5.4.2.2.txt", 'w').write(request.read())
>>> records = Enzyme.parse(open("ec_5.4.2.2.txt"))
>>> record = list(records)[0]
>>> record.classname
['Isomerases;', 'Intramolecular transferases;', 'Phosphotransferases (phosphomutases)']
>>> record.entry
'5.4.2.2'

Now, here's a more realistic example which combines several queries to the KEGG API. This will demonstrate how to extract a unique set of all human pathway gene symbols which relate to DNA repair. The steps that need to be taken to do so are as follows. First, we need to get a list of all human pathways. Secondly, we need to filter those for ones which relate to "repair". Lastly, we need to get a list of all the gene symbols in all repair pathways.

from Bio.KEGG import REST

human_pathways = REST.kegg_list("pathway", "hsa").read()

# Filter all human pathways for repair pathways
repair_pathways = []
for line in human_pathways.rstrip().split("\n"):
    entry, description = line.split("\t")
    if "repair" in description:
        repair_pathways.append(entry)

# Get the genes for pathways and add them to a list
repair_genes = []
for pathway in repair_pathways:
    pathway_file = REST.kegg_get(pathway).read()  # query and read each pathway

    # iterate through each KEGG pathway file, keeping track of which section
    # of the file we're in, only read the gene in each pathway
    current_section = None
    for line in pathway_file.rstrip().split("\n"):
        section = line[:12].strip()  # section names are within 12 columns
        if not section == "":
            current_section = section

        if current_section == "GENE":
            gene_identifiers, gene_description = line[12:].split("; ")
            gene_id, gene_symbol = gene_identifiers.split()

            if gene_symbol not in repair_genes:
                repair_genes.append(gene_symbol)

print("There are %d repair pathways and %d repair genes. The genes are:" % \
      (len(repair_pathways), len(repair_genes)))
print(", ".join(repair_genes))
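
If you would like to test the section-tracking logic without querying KEGG, here is a minimal offline sketch. The sample record is entirely made up (the entry, gene IDs and symbols are placeholders, and the layout is a simplified imitation of a flat file with section names in the first 12 columns), and gene_symbols is a hypothetical helper name. It uses partition rather than split so that a description which itself contains "; " does not break the unpacking.

```python
def gene_symbols(flat_file_text):
    """Collect unique gene symbols from the GENE section of a KEGG-style flat file."""
    symbols = []
    current_section = None
    for line in flat_file_text.rstrip().split("\n"):
        section = line[:12].strip()  # section names are within 12 columns
        if section:
            current_section = section
        if current_section == "GENE":
            # Split on the first "; " only; the rest is the free-text description
            gene_identifiers, _, gene_description = line[12:].partition("; ")
            gene_id, gene_symbol = gene_identifiers.split()
            if gene_symbol not in symbols:
                symbols.append(gene_symbol)
    return symbols


# Made-up record for illustration only; not real KEGG data
sample_record = "\n".join([
    "ENTRY".ljust(12) + "toy00000",
    "NAME".ljust(12) + "Toy repair pathway",
    "GENE".ljust(12) + "1111  GENEA; made-up description; with a semicolon",
    " " * 12 + "2222  GENEB; made-up description B",
    " " * 12 + "1111  GENEA; duplicate line",
    "COMPOUND".ljust(12) + "C00000  made-up compound",
])

symbols = gene_symbols(sample_record)
print(", ".join(symbols))
```

Continuation lines within a section start with blank columns, which is why the code remembers the most recent non-empty section name.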

The KEGG API wrapper is compatible with all endpoints. Usage is essentially replacing all slashes in the URL with commas and using that list as arguments to the corresponding method in the KEGG module. Here are a few examples from the API documentation (http://www.kegg.jp/kegg/docs/keggapi.html).

/list/hsa:10458+ece:Z5100 -> REST.kegg_list(["hsa:10458", "ece:Z5100"])
/find/compound/300-310/mol_weight -> REST.kegg_find("compound", "300-310", "mol_weight")
/get/hsa:10458+ece:Z5100/aaseq -> REST.kegg_get(["hsa:10458", "ece:Z5100"], "aaseq")

Chapter 19 Bio.phenotype: analyse phenotypic data
This chapter gives an overview of the functionalities of the Bio.phenotype package included in Biopython. The scope of this package is the analysis of phenotypic data, which means parsing and analysing growth measurements of cell cultures. In its current state the package is focused on the analysis of high-throughput phenotypic experiments produced by the Phenotype Microarray technology, but future developments may include other platforms and formats.

19.1 Phenotype Microarrays
The Phenotype Microarray is a technology that measures the metabolism of bacterial and eukaryotic cells on roughly 2000 chemicals, divided in twenty 96-well plates. The technology measures the reduction of a tetrazolium dye by NADH, whose production by the cell is used as a proxy for cell metabolism; color development due to the reduction of this dye is typically measured once every 15 minutes. When cells are grown in a medium that sustains cell metabolism, the recorded phenotypic data resembles a sigmoid growth curve, from which a series of growth parameters can be retrieved.

19.1.1 Parsing Phenotype Microarray data
The Bio.phenotype package can parse two different formats of Phenotype Microarray data: the CSV (comma separated values) files produced by the machine's proprietary software and JSON files produced by analysis software, like opm or DuctApe. The parser will return one PlateRecord object or a generator of them, depending on whether the read or parse method is being used. You can test the parse function by using the Plates.csv file provided with the Biopython source code.

>>> from Bio import phenotype
>>> for record in phenotype.parse("Plates.csv", "pm-csv"):
...     print("%s %i" % (record.id, len(record)))
...
PM01 96
PM01 96
PM09 96
PM09 96

The parser returns a series of PlateRecord objects, each one containing a series of WellRecord objects (holding each well's experimental data) arranged in 8 rows and 12 columns; each row is indicated by an uppercase character from A to H, while columns are indicated by a two-digit number, from 01 to 12. There are several ways to access WellRecord objects from a PlateRecord object:

Well identifier
If you know the well identifier (row + column identifiers) you can access the desired well directly.
>>> record['A02']

Well plate coordinates
The same well can be retrieved by using the row and column numbers (0-based index).
>>> from Bio import phenotype
>>> record = list(phenotype.parse("Plates.csv", "pm-csv"))[-1]
>>> print(record[0, 1].id)
A02

Row or column coordinates
A series of WellRecord objects contiguous to each other in the plate can be retrieved in bulk by using the Python list slicing syntax on PlateRecord objects; rows and columns are numbered with a 0-based index.
>>> print(record[0])
Plate ID: PM09
Well: 12
Rows: 1
Columns: 12
PlateRecord('WellRecord['A01'], WellRecord['A02'], WellRecord['A03'], ..., WellRecord['A12']')
>>> print(record[:, 0])
Plate ID: PM09
Well: 8
Rows: 8
Columns: 1
PlateRecord('WellRecord['A01'], WellRecord['B01'], WellRecord['C01'], ..., WellRecord['H01']')
>>> print(record[:3, :3])
Plate ID: PM09
Well: 9
Rows: 3
Columns: 3
PlateRecord('WellRecord['A01'], WellRecord['A02'], WellRecord['A03'], ..., WellRecord['C03']')
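
The relationship between a well identifier such as A02 and its 0-based plate coordinates is simple enough to sketch in plain Python. The two helper functions below are hypothetical names written for illustration; they are not part of Bio.phenotype.

```python
def well_id_to_coordinates(well_id):
    """Convert e.g. 'A02' to 0-based (row, column) coordinates."""
    row = ord(well_id[0].upper()) - ord("A")  # 'A' -> 0, ..., 'H' -> 7
    column = int(well_id[1:]) - 1             # '01' -> 0, ..., '12' -> 11
    return row, column


def coordinates_to_well_id(row, column):
    """Convert 0-based (row, column) coordinates back to e.g. 'A02'."""
    return "%s%02d" % (chr(ord("A") + row), column + 1)


print(well_id_to_coordinates("A02"))
print(coordinates_to_well_id(7, 11))
```

So record['A02'] and record[0, 1] refer to the same well, and the last well on a 96-well plate, H12, sits at coordinates (7, 11).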

19.1.2 Manipulating Phenotype Microarray data

19.1.2.1 Accessing raw data

The raw data extracted from the PM files is comprised of a series of tuples for each well, containing the time (in hours) and the colorimetric measure (in arbitrary units). Usually the instrument collects data every fifteen minutes, but that can vary between experiments. The raw data can be accessed by iterating on a WellRecord object; in the example below only the first ten time points are shown.

>>> from Bio import phenotype
>>> record = list(phenotype.parse("Plates.csv", "pm-csv"))[-1]
>>> well = record['A02']

>>> for time, signal in well:
...     print(time, signal)
...
(0.0, 12.0)
(0.25, 18.0)
(0.5, 27.0)
(0.75, 35.0)
(1.0, 37.0)
(1.25, 41.0)
(1.5, 44.0)
(1.75, 44.0)
(2.0, 44.0)
(2.25, 44.0)
[...]

This method, while providing a way to access the raw data, doesn't allow a direct comparison between different WellRecord objects, which may have measurements at different time points.

19.1.2.2 Accessing interpolated data

To make it easier to compare different experiments, and in general to allow a more intuitive handling of the phenotypic data, the module allows you to define a custom slicing of the time points that are present in the WellRecord object. Colorimetric data for time points that have not been directly measured are derived through a linear interpolation of the available data; otherwise a NaN is returned. This method only works in the time interval where actual data is available. Time intervals can be defined with the same syntax as list indexing; the default time interval is therefore one hour.

>>> well[:10]
[12.0, 37.0, 44.0, 44.0, 44.0, 44.0, 44.0, 44.0, 44.0, 44.0]

Different time intervals can be used, for instance five minutes:

>>> well[63:64:0.083]
[12.0, 37.0, 44.0, 44.0, 44.0, 44.0, 44.0, 44.0, 44.0, 44.0]
>>> well[9.55]
44.0
>>> well[63.33:73.33]
[113.31999999999999,
 117.0,
 120.31999999999999,
 128.0,
 129.63999999999999,
 132.95999999999998,
 136.95999999999998,
 140.0,
 142.0,
 nan]
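
To make the idea of linear interpolation with NaN outside the measured interval concrete, here is a small sketch in plain Python. It is written for illustration only and is not how Bio.phenotype implements it; the function name interpolate_signal is made up, and the data points are the first five raw measurements shown for well A02 above.

```python
import math


def interpolate_signal(times, signals, t):
    """Linearly interpolate the signal at time t; NaN outside the measured range.

    Assumes times is sorted in ascending order and matches signals in length.
    """
    if t < times[0] or t > times[-1]:
        return float("nan")
    for i in range(len(times) - 1):
        t0, t1 = times[i], times[i + 1]
        if t0 <= t <= t1:
            s0, s1 = signals[i], signals[i + 1]
            return s0 + (s1 - s0) * (t - t0) / (t1 - t0)
    return signals[-1]  # only reached when there is a single time point


times = [0.0, 0.25, 0.5, 0.75, 1.0]
signals = [12.0, 18.0, 27.0, 35.0, 37.0]

halfway = interpolate_signal(times, signals, 0.125)   # between the first two points
measured = interpolate_signal(times, signals, 1.0)    # an actual measurement
outside = interpolate_signal(times, signals, 2.0)     # beyond the data: NaN
print(halfway, measured, outside)
```

A time point that was actually measured comes back unchanged, a point between two measurements lies on the straight line joining them, and anything outside the measured interval gives NaN.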

19.1.2.3 Control well subtraction

Many Phenotype Microarray plates contain a control well (usually A01), that is a well where the media shouldn't support any growth; the low signal produced by this well can be subtracted from the other wells. The PlateRecord objects have a dedicated function for that, which returns another PlateRecord object with the corrected data.

>>> corrected = record.subtract_control(control='A01')
>>> record['A01'][63]
336.0
>>> corrected['A01'][63]
0.0
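
Conceptually, control subtraction removes the control well's signal from every well at each time point. The sketch below shows that idea on a toy plate held in a plain dictionary; the data values and the helper name subtract_control_signals are made up, and it assumes all wells were measured at the same time points, which the real PlateRecord method does not require you to arrange yourself.

```python
def subtract_control_signals(plate, control="A01"):
    """Return a new dict with the control well's signal subtracted from every well."""
    control_signals = plate[control]
    return {
        well_id: [s - c for s, c in zip(signals, control_signals)]
        for well_id, signals in plate.items()
    }


# Made-up signals for two wells measured at the same three time points
toy_plate = {
    "A01": [10.0, 12.0, 11.0],  # control well: low background signal
    "A02": [12.0, 30.0, 44.0],
}

corrected_plate = subtract_control_signals(toy_plate)
print(corrected_plate["A01"], corrected_plate["A02"])
```

As in the PlateRecord example above, the control well itself ends up at zero everywhere, and the original data is left untouched.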

19.1.2.4 Parameters extraction

Those wells where metabolic activity is observed show a sigmoid behavior for the colorimetric data. To allow an easier way to compare different experiments, a sigmoid curve can be fitted onto the data, so that a series of summary parameters can be extracted and used for comparisons. The parameters that can be extracted from the curve are:

Minimum (min) and maximum (max) signal
Average height (average_height)
Area under the curve (area)
Curve plateau point (plateau)
Curve slope during exponential metabolic activity (slope)
Curve lag time (lag).

All the parameters (except min, max and average_height) require the scipy library to be installed.

The fit function uses three sigmoid functions:

Gompertz
$A e^{-e^{\frac{\mu_m e}{A}(\lambda - t) + 1}} + y_0$
Logistic
$\frac{A}{1 + e^{\frac{4 \mu_m}{A}(\lambda - t) + 2}} + y_0$
Richards
$A \left(1 + v e^{1 + v} + e^{\frac{\mu_m}{A}(1 + v)(1 + \frac{1}{v})(\lambda - t)}\right)^{-\frac{1}{v}} + y_0$

Where:

$A$ corresponds to the plateau
$\mu_m$ corresponds to the slope
$\lambda$ corresponds to the lag

These functions have been derived from this publication. The fit method by default tries first to fit the gompertz function: if it fails it will then try to fit the logistic and then the richards function. The user can also specify one of the three functions to be applied.

>>> from Bio import phenotype
>>> record = list(phenotype.parse("Plates.csv", "pm-csv"))[-1]
>>> well = record['A02']
>>> well.fit()
>>> print("Function fitted: %s" % well.model)
Function fitted: gompertz
>>> for param in ["area", "average_height", "lag", "max", "min",
...               "plateau", "slope"]:
...     print("%s\t%.2f" % (param, getattr(well, param)))
...
area    4414.38
average_height  61.58
lag     48.60
max     143.00
min     12.00
plateau 120.02
slope   4.99
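
To get a feel for what the fitted parameters mean, here is a sketch of a Gompertz growth curve in plain Python, in the commonly used parameterisation with a plateau A, slope mu_m, lag time and offset y0. This is an assumption about the functional form made for illustration, not a copy of the library's fitting code, and the parameter values are only loosely based on the rounded figures printed above (y0 = 12 is a guess taken from the minimum signal).

```python
import math


def gompertz(t, A, mu_m, lag, y0):
    """Gompertz growth curve: plateau A, maximum slope mu_m, lag time, offset y0."""
    return A * math.exp(-math.exp((mu_m * math.e / A) * (lag - t) + 1)) + y0


# Illustrative parameter values only
A, mu_m, lag, y0 = 120.02, 4.99, 48.60, 12.0

early = gompertz(0.0, A, mu_m, lag, y0)    # well before the lag: close to y0
at_lag = gompertz(lag, A, mu_m, lag, y0)   # growth is just getting going
late = gompertz(500.0, A, mu_m, lag, y0)   # long after: close to A + y0
print(round(early, 2), round(at_lag, 2), round(late, 2))
```

The curve starts near the offset y0, rises most steeply shortly after the lag time, and levels off at A + y0.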

19.1.3 Writing Phenotype Microarray data
PlateRecord objects can be written to file in the form of JSON files, a format compatible with other software packages such as opm or DuctApe.

>>> phenotype.write(record, "out.json", "pm-json")
1

Chapter 20 Cookbook - Cool things to do with it
Biopython now has two collections of "cookbook" examples: this chapter (which has been included in this tutorial for many years and has gradually grown), and http://biopython.org/wiki/Category:Cookbook which is a user-contributed collection on our wiki.

We're trying to encourage Biopython users to contribute their own examples to the wiki. In addition to helping the community, one direct benefit of sharing an example like this is that you could also get some feedback on the code from other Biopython users and developers, which could help you improve all your Python code.

In the long term, we may end up moving all of the examples in this chapter to the wiki, or elsewhere within the tutorial.

20.1 Working with sequence files
This section shows some more examples of sequence input/output, using the Bio.SeqIO module described in Chapter 5.

20.1.1 Filtering a sequence file
Often you'll have a large file with many sequences in it (e.g. a FASTA file of genes, or a FASTQ or SFF file of reads), a separate shorter list of the IDs for a subset of sequences of interest, and want to make a new sequence file for this subset.

Let's say the list of IDs is in a simple text file, as the first word on each line. This could be a tabular file where the first column is the ID. Try something like this:

from Bio import SeqIO

input_file = "big_file.sff"
id_file = "short_list.txt"
output_file = "short_list.sff"

with open(id_file) as id_handle:
    wanted = set(line.rstrip("\n").split(None, 1)[0] for line in id_handle)
print("Found %i unique identifiers in %s" % (len(wanted), id_file))

records = (r for r in SeqIO.parse(input_file, "sff") if r.id in wanted)
count = SeqIO.write(records, output_file, "sff")
print("Saved %i records from %s to %s" % (count, input_file, output_file))
if count < len(wanted):
    print("Warning %i IDs not found in %s" % (len(wanted) - count, input_file))

Note that we use a Python set rather than a list; this makes testing membership faster.
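
Here is a small self-contained sketch of the ID handling on its own, using an in-memory text file so it can be run without any sequence data. The identifiers are made up. Unlike the snippet above, this variant also skips blank lines, since taking the first word of an empty line would otherwise raise an IndexError.

```python
import io

# Made-up ID list: tab or space separated, with a duplicate and a blank line
id_text = "read_1\tsome note\n\nread_3 extra words\nread_1\n"

with io.StringIO(id_text) as id_handle:
    # First word on each non-blank line is taken as the identifier
    wanted = set(line.split(None, 1)[0] for line in id_handle if line.strip())

all_ids = ["read_1", "read_2", "read_3", "read_4"]
kept = [identifier for identifier in all_ids if identifier in wanted]
print(sorted(wanted), kept)
```

The set collapses the duplicate entry, so len(wanted) is the number of unique identifiers, which is what the warning about missing IDs in the script above relies on.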

As discussed in Section 5.6, for a large FASTA or FASTQ file, for speed you would be better off not using the high-level SeqIO interface, but working directly with strings. This next example shows how to do this with FASTQ files; it is more complicated:

from Bio.SeqIO.QualityIO import FastqGeneralIterator

input_file = "big_file.fastq"
id_file = "short_list.txt"
output_file = "short_list.fastq"

with open(id_file) as id_handle:
    # Taking first word on each line as an identifier
    wanted = set(line.rstrip("\n").split(None, 1)[0] for line in id_handle)
print("Found %i unique identifiers in %s" % (len(wanted), id_file))

count = 0  # number of records written so far
with open(input_file) as in_handle:
    with open(output_file, "w") as out_handle:
        for title, seq, qual in FastqGeneralIterator(in_handle):
            # The ID is the first word in the title line (after the @ sign):
            if title.split(None, 1)[0] in wanted:
                # This produces a standard 4-line FASTQ entry:
                out_handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
                count += 1
print("Saved %i records from %s to %s" % (count, input_file, output_file))
if count < len(wanted):
    print("Warning %i IDs not found in %s" % (len(wanted) - count, input_file))

20.1.2 Producing randomised genomes

Let's suppose you are looking at a genome sequence, hunting for some sequence feature, maybe extreme local GC% bias, or possible restriction digest sites. Once you've got your Python code working on the real genome it may be sensible to try running the same search on randomised versions of the same genome for statistical analysis (after all, any "features" you've found could be there just by chance).

For this discussion, we'll use the GenBank file for the pPCP1 plasmid from Yersinia pestis biovar Microtus. The file is included with the Biopython unit tests under the GenBank folder, or you can get it from our website, NC_005816.gb. This file contains one and only one record, so we can read it in as a SeqRecord using the Bio.SeqIO.read() function:

>>> from Bio import SeqIO
>>> original_rec = SeqIO.read("NC_005816.gb", "genbank")

So, how can we generate a shuffled version of the original sequence? I would use the built-in Python random module for this, in particular the function random.shuffle, but this works on a Python list. Our sequence is a Seq object, so in order to shuffle it we need to turn it into a list:

>>> import random
>>> nuc_list = list(original_rec.seq)
>>> random.shuffle(nuc_list)  # acts in situ!

Now, in order to use Bio.SeqIO to output the shuffled sequence, we need to construct a new SeqRecord with a new Seq object using this shuffled list. In order to do this, we need to turn the list of nucleotides (single letter strings) into a long string; the standard Python way to do this is with the string object's join method.

>>> from Bio.Seq import Seq
>>> from Bio.SeqRecord import SeqRecord
>>> shuffled_rec = SeqRecord(Seq("".join(nuc_list), original_rec.seq.alphabet),
...                          id="Shuffled", description="Based on %s" % original_rec.id)
...

Let's put all these pieces together to make a complete Python script which generates a single FASTA file containing 30 randomly shuffled versions of the original sequence.

This first version just uses a big for loop and writes out the records one by one (using the SeqRecord's format method described in Section 5.5.4):

import random
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO

original_rec = SeqIO.read("NC_005816.gb", "genbank")

with open("shuffled.fasta", "w") as output_handle:
    for i in range(30):
        nuc_list = list(original_rec.seq)
        random.shuffle(nuc_list)
        shuffled_rec = SeqRecord(Seq("".join(nuc_list), original_rec.seq.alphabet),
                                 id="Shuffled%i" % (i+1),
                                 description="Based on %s" % original_rec.id)
        # Write to the handle opened above (output_handle)
        output_handle.write(shuffled_rec.format("fasta"))

Personally I prefer the following version using a function to shuffle the record and a generator expression instead of the for loop:

import random
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO

def make_shuffle_record(record, new_id):
    nuc_list = list(record.seq)
    random.shuffle(nuc_list)
    # Describe the record passed in, rather than relying on the global original_rec
    return SeqRecord(Seq("".join(nuc_list), record.seq.alphabet),
                     id=new_id, description="Based on %s" % record.id)

original_rec = SeqIO.read("NC_005816.gb", "genbank")
shuffled_recs = (make_shuffle_record(original_rec, "Shuffled%i" % (i+1))
                 for i in range(30))
SeqIO.write(shuffled_recs, "shuffled.fasta", "fasta")
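As a quick sanity check on the shuffling idea, here is a minimal sketch using plain Python strings rather than Seq objects. The toy sequence and the helper name shuffled_copy are made up for illustration, and the seeded random.Random instance is an assumption added here so that repeated runs give the same shuffles. It confirms that shuffling changes only the order of the letters, never the composition, which is what you want from a randomised control:

```python
import random
from collections import Counter

def shuffled_copy(sequence, rng):
    """Return a shuffled copy of a string, leaving the input untouched."""
    letters = list(sequence)   # shuffle needs a mutable list
    rng.shuffle(letters)       # shuffles the list in place
    return "".join(letters)

original = "GATTACAGATTACAGGCCTTAA"   # toy sequence, not real data
rng = random.Random(42)               # fixed seed for reproducibility (assumption)
shuffled = [shuffled_copy(original, rng) for _ in range(30)]

# Every shuffled version has exactly the same letter counts as the original
same_composition = all(Counter(s) == Counter(original) for s in shuffled)
print(len(shuffled), same_composition)
```

If you need reproducible randomised genomes for a publication, fixing the seed like this is one option; leave it out to get different shuffles on every run.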

20.1.3 Translating a FASTA file of CDS entries
Suppose you've got an input file of CDS entries for some organism, and you want to generate a new FASTA file containing their protein sequences, i.e. take each
nucleotide sequence from the original file, and translate it. Back in Section 3.9 we saw how to use the Seq object's translate method, and the optional cds argument
which enables correct translation of alternative start codons.

We can combine this with Bio.SeqIO as shown in the reverse complement example in Section 5.5.3. The key point is that for each nucleotide SeqRecord, we need to create
a protein SeqRecord and take care of naming it.

You can write your own function to do this, choosing suitable protein identifiers for your sequences, and the appropriate genetic code. In this example we just use the
default table and add a prefix to the identifier:

from Bio.SeqRecord import SeqRecord
def make_protein_record(nuc_record):
    """Returns a new SeqRecord with the translated sequence (default table)."""
    return SeqRecord(seq=nuc_record.seq.translate(cds=True),
                     id="trans_" + nuc_record.id,
                     description="translation of CDS, using default table")

We can then use this function to turn the input nucleotide records into protein records ready for output. An elegant and memory efficient way to do this is with a
generator expression:

from Bio import SeqIO
proteins = (make_protein_record(nuc_rec) for nuc_rec in
            SeqIO.parse("coding_sequences.fasta", "fasta"))
SeqIO.write(proteins, "translations.fasta", "fasta")

This should work on any FASTA file of complete coding sequences. If you are working on partial coding sequences, you may prefer to use
nuc_record.seq.translate(to_stop=True) in the example above, as this wouldn't check for a valid start codon etc.

20.1.4 Making the sequences in a FASTA file upper case

Often you'll get data from collaborators as FASTA files, and sometimes the sequences can be in a mixture of upper and lower case. In some cases this is deliberate (e.g.
lower case for poor quality regions), but usually it is not important. You may want to edit the file to make everything consistent (e.g. all upper case), and you can do this
easily using the upper() method of the SeqRecord object (added in Biopython 1.55):

from Bio import SeqIO
records = (rec.upper() for rec in SeqIO.parse("mixed.fas", "fasta"))
count = SeqIO.write(records, "upper.fas", "fasta")
print("Converted %i records to upper case" % count)

How does this work? The first line is just importing the Bio.SeqIO module. The second line is the interesting bit: this is a Python generator expression which gives an
upper case version of each record parsed from the input file (mixed.fas). In the third line we give this generator expression to the Bio.SeqIO.write() function and it saves
the new upper case records to our output file (upper.fas).

The reason we use a generator expression (rather than a list or list comprehension) is this means only one record is kept in memory at a time. This can be really important
if you are dealing with large files with millions of entries.
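To make the "one record at a time" point concrete, here is a small sketch in plain Python. The toy_parser function is a hypothetical stand-in for Bio.SeqIO.parse(), working on strings instead of SeqRecord objects, which notes each item as it is read. Nothing is read until the generator expression is asked for a value:

```python
def toy_parser(sequences, log):
    """Stand-in for a file parser: yields one record at a time."""
    for seq in sequences:
        log.append(seq)   # note which records have actually been read
        yield seq

parsed = []
records = (seq.upper() for seq in toy_parser(["acgt", "ggNNcc", "ttaa"], parsed))

first = next(records)             # pulls exactly one record through the pipeline
read_after_first = list(parsed)   # snapshot: only one record read so far
print(first, read_after_first)

rest = list(records)              # consuming the generator reads the remainder
print(rest, parsed)
```

A list comprehension in place of the generator expression would read all three toy records before the first upper case value became available.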

20.1.5 Sorting a sequence file
Suppose you wanted to sort a sequence file by length (e.g. a set of contigs from an assembly), and you are working with a file format like FASTA or FASTQ which
Bio.SeqIO can read, write (and index).

If the file is small enough, you can load it all into memory at once as a list of SeqRecord objects, sort the list, and save it:

from Bio import SeqIO
records = list(SeqIO.parse("ls_orchid.fasta", "fasta"))
records.sort(key=lambda r: len(r))
SeqIO.write(records, "sorted_orchids.fasta", "fasta")

The only clever bit is specifying a comparison method for how to sort the records (here we sort them by length). If you wanted the longest records first, you could flip the
comparison or use the reverse argument:

from Bio import SeqIO
records = list(SeqIO.parse("ls_orchid.fasta", "fasta"))
# Negating the length flips the order, so the longest records come first
records.sort(key=lambda r: -len(r))
SeqIO.write(records, "sorted_orchids.fasta", "fasta")

Now that's pretty straightforward, but what happens if you have a very large file and you can't load it all into memory like this? For example, you might have some
next-generation sequencing reads to sort by length. This can be solved using the Bio.SeqIO.index() function.

from Bio import SeqIO
# Get the lengths and ids, and sort on length
len_and_ids = sorted((len(rec), rec.id) for rec in
                     SeqIO.parse("ls_orchid.fasta", "fasta"))
ids = reversed([id for (length, id) in len_and_ids])
del len_and_ids  # free this memory
record_index = SeqIO.index("ls_orchid.fasta", "fasta")
records = (record_index[id] for id in ids)
SeqIO.write(records, "sorted.fasta", "fasta")

First we scan through the file once using Bio.SeqIO.parse(), recording the record identifiers and their lengths in a list of tuples. We then sort this list to get them in length
order, and discard the lengths. Using this sorted list of identifiers Bio.SeqIO.index() allows us to retrieve the records one by one, and we pass them to Bio.SeqIO.write()
for output.
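The two-pass idea can be tried without any sequence file. In this sketch a plain dictionary (toy_index, with made-up contig names) stands in for the object returned by Bio.SeqIO.index(); that substitution is an assumption for illustration only, since a real index reads each record from disk on demand:

```python
# Toy stand-in for an indexed file: identifier -> sequence string
toy_index = {
    "contig1": "ACGTAC",
    "contig2": "AC",
    "contig3": "ACGTACGTAC",
    "contig4": "ACGT",
}

# Pass 1: keep only (length, identifier) pairs, which are small
len_and_ids = sorted((len(seq), rec_id) for rec_id, seq in toy_index.items())

# Longest first, then discard the lengths
ids = [rec_id for (length, rec_id) in reversed(len_and_ids)]

# Pass 2: look up each record by identifier, one at a time
sorted_seqs = [toy_index[rec_id] for rec_id in ids]
print(ids)
```

Only the small list of lengths and identifiers is ever sorted, which is why the approach scales to files that are too big to hold as a list of SeqRecord objects.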

These examples all use Bio.SeqIO to parse the records into SeqRecord objects which are output using Bio.SeqIO.write(). What if you want to sort a file format which
Bio.SeqIO.write() doesn't support, like the plain text SwissProt format? Here is an alternative solution using the get_raw() method added to Bio.SeqIO.index() in
Biopython 1.54 (see Section 5.4.2.2).

from Bio import SeqIO

# Get the lengths and ids, and sort on length
len_and_ids = sorted((len(rec), rec.id) for rec in
                     SeqIO.parse("ls_orchid.fasta", "fasta"))
ids = reversed([id for (length, id) in len_and_ids])
del len_and_ids  # free this memory

record_index = SeqIO.index("ls_orchid.fasta", "fasta")
with open("sorted.fasta", "wb") as out_handle:
    for id in ids:
        out_handle.write(record_index.get_raw(id))

Note with Python 3 onwards, we have to open the file for writing in binary mode because the get_raw() method returns bytes strings.

As a bonus, because it doesn't parse the data into SeqRecord objects a second time it should be faster. If you only want to use this with FASTA format, we can speed this
up one step further by using the low-level FASTA parser to get the record identifiers and lengths:

from Bio.SeqIO.FastaIO import SimpleFastaParser
from Bio import SeqIO

# Get the lengths and ids, and sort on length
with open("ls_orchid.fasta") as in_handle:
    len_and_ids = sorted((len(seq), title.split(None, 1)[0]) for
                         title, seq in SimpleFastaParser(in_handle))
ids = reversed([id for (length, id) in len_and_ids])
del len_and_ids  # free this memory

record_index = SeqIO.index("ls_orchid.fasta", "fasta")
with open("sorted.fasta", "wb") as out_handle:
    for id in ids:
        out_handle.write(record_index.get_raw(id))

20.1.6 Simple quality filtering for FASTQ files
The FASTQ file format was introduced at Sanger and is now widely used for holding nucleotide sequencing reads together with their quality scores. FASTQ files (and the
related QUAL files) are an excellent example of per-letter-annotation, because for each nucleotide in the sequence there is an associated quality score. Any per-letter-
annotation is held in a SeqRecord in the letter_annotations dictionary as a list, tuple or string (with the same number of elements as the sequence length).

One common task is taking a large set of sequencing reads and filtering them (or cropping them) based on their quality scores. The following example is very simplistic,
but should illustrate the basics of working with quality data in a SeqRecord object. All we are going to do here is read in a file of FASTQ data, and filter it to pick out only
those records whose PHRED quality scores are all above some threshold (here 20).

For this example we'll use some real data downloaded from the ENA sequence read archive,
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR020/SRR020192/SRR020192.fastq.gz (2MB) which unzips to a 19MB file SRR020192.fastq. This is some Roche 454 GS FLX
single end data from virus infected California sea lions (see http://www.ebi.ac.uk/ena/data/view/SRS004476 for details).

First, let's count the reads:

from Bio import SeqIO
count = 0
for rec in SeqIO.parse("SRR020192.fastq", "fastq"):
    count += 1
print("%i reads" % count)

Now let's do a simple filtering for a minimum PHRED quality of 20:

from Bio import SeqIO
good_reads = (rec for rec in
              SeqIO.parse("SRR020192.fastq", "fastq")
              if min(rec.letter_annotations["phred_quality"]) >= 20)
count = SeqIO.write(good_reads, "good_quality.fastq", "fastq")
print("Saved %i reads" % count)

This pulled out only 14580 reads out of the 41892 present. A more sensible thing to do would be to quality trim the reads, but this is intended as an example only.

FASTQ files can contain millions of entries, so it is best to avoid loading them all into memory at once. This example uses a generator expression, which means only one
SeqRecord is created at a time, avoiding any memory limitations.

Note that it would be faster to use the low-level FastqGeneralIterator parser here (see Section 5.6), but that does not turn the quality string into integer scores.
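If you want to experiment with the filtering rule itself before downloading any data, the following sketch applies the same "every score at least 20" test to a few made-up reads. The tuples, read names and the helper all_above are hypothetical stand-ins; in the real example each read is a SeqRecord and the scores live in letter_annotations["phred_quality"]:

```python
def all_above(qualities, threshold):
    """True if every quality score is at or above the threshold."""
    return min(qualities) >= threshold

# (name, sequence, PHRED scores): toy values, not real reads
toy_reads = [
    ("read1", "ACGT", [30, 32, 25, 20]),
    ("read2", "ACGT", [30, 19, 35, 40]),   # one base below 20
    ("read3", "AC", [20, 20]),
]

good = [name for (name, seq, quals) in toy_reads if all_above(quals, 20)]
print(good)
```

A single poor base is enough to reject a read under this rule, which is part of why so few of the real reads pass and why trimming is usually the more sensible approach.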

20.1.7 Trimming off primer sequences
For this example we're going to pretend that GATGACGGTGT is a 5' primer sequence we want to look for in some FASTQ formatted read data. As in the example above,
we'll use the SRR020192.fastq file downloaded from the ENA (ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR020/SRR020192/SRR020192.fastq.gz).

By using the main Bio.SeqIO interface, the same approach would work with any other supported file format (e.g. FASTA files). However, for large FASTQ files it would
be faster to use the low-level FastqGeneralIterator parser here (see the earlier example, and Section 5.6).

This code uses Bio.SeqIO with a generator expression (to avoid loading all the sequences into memory at once), and the Seq object's startswith method to see if the read
starts with the primer sequence:

from Bio import SeqIO
primer_reads = (rec for rec in
                SeqIO.parse("SRR020192.fastq", "fastq")
                if rec.seq.startswith("GATGACGGTGT"))
count = SeqIO.write(primer_reads, "with_primer.fastq", "fastq")
print("Saved %i reads" % count)

That should find 13819 reads from SRR020192.fastq and save them to a new FASTQ file, with_primer.fastq.

Now suppose that instead you wanted to make a FASTQ file containing these reads but with the primer sequence removed? That's just a small change as we can slice the
SeqRecord (see Section 4.7) to remove the first eleven letters (the length of our primer):

from Bio import SeqIO
trimmed_primer_reads = (rec[11:] for rec in
                        SeqIO.parse("SRR020192.fastq", "fastq")
                        if rec.seq.startswith("GATGACGGTGT"))
count = SeqIO.write(trimmed_primer_reads, "with_primer_trimmed.fastq", "fastq")
print("Saved %i reads" % count)

Again, that should pull out the 13819 reads from SRR020192.fastq, but this time strip off the first eleven characters, and save them to another new FASTQ file,
with_primer_trimmed.fastq.

Now, suppose you want to create a new FASTQ file where these reads have their primer removed, but all the other reads are kept as they were? If we want to still use a
generator expression, it is probably clearest to define our own trim function:

from Bio import SeqIO
def trim_primer(record, primer):
    if record.seq.startswith(primer):
        return record[len(primer):]
    else:
        return record

trimmed_reads = (trim_primer(record, "GATGACGGTGT") for record in
                 SeqIO.parse("SRR020192.fastq", "fastq"))
count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq")
print("Saved %i reads" % count)

This takes longer, as this time the output file contains all 41892 reads. Again, we've used a generator expression to avoid any memory problems. You could alternatively
use a generator function rather than a generator expression.

from Bio import SeqIO
def trim_primers(records, primer):
    """Removes perfect primer sequences at start of reads.

    This is a generator function, the records argument should
    be a list or iterator returning SeqRecord objects.
    """
    len_primer = len(primer)  # cache this for later
    for record in records:
        if record.seq.startswith(primer):
            yield record[len_primer:]
        else:
            yield record

original_reads = SeqIO.parse("SRR020192.fastq", "fastq")
trimmed_reads = trim_primers(original_reads, "GATGACGGTGT")
count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq")
print("Saved %i reads" % count)

This form is more flexible if you want to do something more complicated where only some of the records are retained, as shown in the next example.

20.1.8 Trimming off adaptor sequences

This is essentially a simple extension to the previous example. We are going to pretend GATGACGGTGT is an adaptor sequence in some FASTQ formatted read data,
again the SRR020192.fastq file from the ENA (ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR020/SRR020192/SRR020192.fastq.gz).

This time however, we will look for the sequence anywhere in the reads, not just at the very beginning:

from Bio import SeqIO

def trim_adaptors(records, adaptor):
    """Trims perfect adaptor sequences.

    This is a generator function, the records argument should
    be a list or iterator returning SeqRecord objects.
    """
    len_adaptor = len(adaptor)  # cache this for later
    for record in records:
        index = record.seq.find(adaptor)
        if index == -1:
            # adaptor not found, so won't trim
            yield record
        else:
            # trim off the adaptor
            yield record[index+len_adaptor:]

original_reads = SeqIO.parse("SRR020192.fastq", "fastq")
trimmed_reads = trim_adaptors(original_reads, "GATGACGGTGT")
count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq")
print("Saved %i reads" % count)

Because we are using a FASTQ input file in this example, the SeqRecord objects have per-letter-annotation for the quality scores. By slicing the SeqRecord object the
appropriate scores are used on the trimmed records, so we can output them as a FASTQ file too.

Compared to the output of the previous example where we only looked for a primer/adaptor at the start of each read, you may find some of the trimmed reads are quite
short after trimming (e.g. if the adaptor was found in the middle rather than near the start). So, let's add a minimum length requirement as well:

from Bio import SeqIO

def trim_adaptors(records, adaptor, min_len):
    """Trims perfect adaptor sequences, checks read length.

    This is a generator function, the records argument should
    be a list or iterator returning SeqRecord objects.
    """
    len_adaptor = len(adaptor)  # cache this for later
    for record in records:
        len_record = len(record)  # cache this for later
        if len_record < min_len:
            # Too short to keep
            continue
        index = record.seq.find(adaptor)
        if index == -1:
            # adaptor not found, so won't trim
            yield record
        elif len_record - index - len_adaptor >= min_len:
            # after trimming this will still be long enough
            yield record[index+len_adaptor:]

original_reads = SeqIO.parse("SRR020192.fastq", "fastq")
trimmed_reads = trim_adaptors(original_reads, "GATGACGGTGT", 100)
count = SeqIO.write(trimmed_reads, "trimmed.fastq", "fastq")
print("Saved %i reads" % count)

By changing the format names, you could apply this to FASTA files instead. This code also could be extended to do a fuzzy match instead of an exact match (maybe
using a pairwise alignment, or taking into account the read quality scores), but that will be much slower.
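The trimming decision can be checked on plain strings. This sketch mirrors the logic of the generator function above, but the function name trim_adaptor, the None return value for dropped reads, and the toy reads are all assumptions made for illustration. The key detail is that the find method returns -1 when the adaptor is absent:

```python
def trim_adaptor(seq, adaptor, min_len):
    """Return the trimmed string, or None if the read should be dropped."""
    if len(seq) < min_len:
        return None                       # too short to keep
    index = seq.find(adaptor)             # -1 means "not found"
    if index == -1:
        return seq                        # no adaptor, keep unchanged
    trimmed = seq[index + len(adaptor):]
    if len(trimmed) >= min_len:
        return trimmed
    return None                           # too short after trimming

adaptor = "GATGACGGTGT"
kept_untrimmed = trim_adaptor("AAAACCCCGGGG", adaptor, 10)
kept_trimmed = trim_adaptor("TT" + adaptor + "ACGTACGTACGT", adaptor, 10)
dropped = trim_adaptor("TT" + adaptor + "ACG", adaptor, 10)
print(kept_untrimmed, kept_trimmed, dropped)
```

Note that everything up to and including the adaptor is removed, so an adaptor found late in a read leaves very little sequence behind.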

20.1.9 Converting FASTQ files
Back in Section 5.5.2 we showed how to use Bio.SeqIO to convert between two file formats. Here we'll go into a little more detail regarding FASTQ files which are used
in second generation DNA sequencing. Please refer to Cock et al. (2009) [7] for a longer description. FASTQ files store both the DNA sequence (as a string) and the
associated read qualities.

PHRED scores (used in most FASTQ files, and also in QUAL files, ACE files and SFF files) have become a de facto standard for representing the probability of a
sequencing error (here denoted by Pe) at a given base using a simple base ten log transformation:

$$Q_{\mathrm{PHRED}} = -10 \times \log_{10}(P_e) \qquad (20.1)$$

This means a wrong read (Pe = 1) gets a PHRED quality of 0, while a very good read like Pe = 0.00001 gets a PHRED quality of 50. While for raw sequencing data
qualities higher than this are rare, with post processing such as read mapping or assembly, qualities of up to about 90 are possible (indeed, the MAQ tool allows for
PHRED scores in the range 0 to 93 inclusive).
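The two worked values above can be reproduced with a few lines of Python. This is only a sketch of the formula; the function names phred_from_error and error_from_phred are made up here and are not part of Biopython:

```python
import math

def phred_from_error(p_error):
    """PHRED quality for an error probability (0 < p_error <= 1)."""
    return -10 * math.log10(p_error)

def error_from_phred(q_phred):
    """Inverse transformation: error probability for a PHRED quality."""
    return 10 ** (-q_phred / 10.0)

for p_error in (1.0, 0.01, 0.00001):
    print("Pe = %g gives PHRED quality %0.1f" % (p_error, phred_from_error(p_error)))
```

Each step of ten in PHRED quality corresponds to a tenfold drop in the estimated error probability, so a quality of 20 means a 1 in 100 chance that the base call is wrong.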

The FASTQ format has the potential to become a de facto standard for storing the letters and quality scores for a sequencing read in a single plain text file. The only fly in
the ointment is that there are at least three versions of the FASTQ format which are incompatible and difficult to distinguish...

1. The original Sanger FASTQ format uses PHRED qualities encoded with an ASCII offset of 33. The NCBI are using this format in their Short Read Archive. We call
   this the fastq (or fastq-sanger) format in Bio.SeqIO.
2. Solexa (later bought by Illumina) introduced their own version using Solexa qualities encoded with an ASCII offset of 64. We call this the fastq-solexa format.
3. Illumina pipeline 1.3 onwards produces FASTQ files with PHRED qualities (which is more consistent), but encoded with an ASCII offset of 64. We call this the
   fastq-illumina format.
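To see what an "ASCII offset" means in practice, here is a minimal sketch that turns a quality string into integer scores. The helper name decode_qualities and the example strings are made up for illustration; Bio.SeqIO does this for you when you pick the right format name:

```python
def decode_qualities(quality_string, offset):
    """Convert each quality character to an integer score."""
    return [ord(char) - offset for char in quality_string]

# The same three scores (0, 20 and 40) encoded with the two offsets
sanger_scores = decode_qualities("!5I", 33)     # offset 33, as in fastq-sanger
illumina_scores = decode_qualities("@Th", 64)   # offset 64, as in fastq-illumina
print(sanger_scores, illumina_scores)

# Decoding with the wrong offset silently shifts every score by 31
wrong_scores = decode_qualities("@Th", 33)
print(wrong_scores)
```

Nothing in the file says which offset was used, and many characters are valid under both, which is why the variants are so difficult to tell apart.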

The Solexa quality scores are defined using a different log transformation:

$$Q_{\mathrm{Solexa}} = -10 \times \log_{10}\left(\frac{P_e}{1 - P_e}\right) \qquad (20.2)$$

Given Solexa/Illumina have now moved to using PHRED scores in version 1.3 of their pipeline, the Solexa quality scores will gradually fall out of use. If you equate the
error estimates (Pe) these two equations allow conversion between the two scoring systems, and Biopython includes functions to do this in the Bio.SeqIO.QualityIO
module, which are called if you use Bio.SeqIO to convert an old Solexa/Illumina file into a standard Sanger FASTQ file:

from Bio import SeqIO
SeqIO.convert("solexa.fastq", "fastq-solexa", "standard.fastq", "fastq")
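Equating the error estimates in equations (20.1) and (20.2) gives Pe = 1 / (10^(QSolexa/10) + 1), and so QPHRED = 10 × log10(10^(QSolexa/10) + 1). The sketch below works through that arithmetic; the function names are made up for this illustration and it is not the code Biopython itself uses:

```python
import math

def solexa_from_error(p_error):
    """Solexa quality for an error probability (0 < p_error < 1)."""
    return -10 * math.log10(p_error / (1 - p_error))

def phred_from_solexa(q_solexa):
    """PHRED quality with the same error probability as a Solexa quality."""
    return 10 * math.log10(10 ** (q_solexa / 10.0) + 1)

for p_error in (0.5, 0.1, 0.01, 0.0001):
    q_solexa = solexa_from_error(p_error)
    print("Pe = %g: Solexa %0.2f, PHRED %0.2f"
          % (p_error, q_solexa, phred_from_solexa(q_solexa)))
```

The two scales only differ noticeably for poor reads. At Pe = 0.5 the Solexa score is 0 while the PHRED score is about 3, and Solexa scores go negative for even worse bases, which PHRED scores never do.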

If you want to convert a new Illumina 1.3+ FASTQ file, all that gets changed is the ASCII offset because although encoded differently the scores are all PHRED qualities:

from Bio import SeqIO
SeqIO.convert("illumina.fastq", "fastq-illumina", "standard.fastq", "fastq")

Note that using Bio.SeqIO.convert() like this is much faster than combining Bio.SeqIO.parse() and Bio.SeqIO.write() because optimised code is used for converting
between FASTQ variants (and also for FASTQ to FASTA conversion).

For good quality reads, PHRED and Solexa scores are approximately equal, which means since both the fastq-solexa and fastq-illumina formats use an ASCII offset of
64 the files are almost the same. This was a deliberate design choice by Illumina, meaning applications expecting the old fastq-solexa style files will probably be OK
using the newer fastq-illumina files (on good data). Of course, both variants are very different from the original FASTQ standard as used by Sanger, the NCBI, and
elsewhere (format name fastq or fastq-sanger).

For more details, see the built-in help (also online):

>>> from Bio.SeqIO import QualityIO
>>> help(QualityIO)
...

20.1.10 Converting FASTA and QUAL files into FASTQ files

FASTQ files hold both sequences and their quality strings. FASTA files hold just sequences, while QUAL files hold just the qualities. Therefore a single FASTQ file can
be converted to or from paired FASTA and QUAL files.

Going from FASTQ to FASTA is easy:

from Bio import SeqIO
SeqIO.convert("example.fastq", "fastq", "example.fasta", "fasta")

Going from FASTQ to QUAL is also easy:

from Bio import SeqIO
SeqIO.convert("example.fastq", "fastq", "example.qual", "qual")

However, the reverse is a little more tricky. You can use Bio.SeqIO.parse() to iterate over the records in a single file, but in this case we have two input files. There are
several strategies possible, but assuming that the two files are really paired the most memory efficient way is to loop over both together. The code is a little fiddly, so we
provide a function called PairedFastaQualIterator in the Bio.SeqIO.QualityIO module to do this. This takes two handles (the FASTA file and the QUAL file) and returns
a SeqRecord iterator:

from Bio.SeqIO.QualityIO import PairedFastaQualIterator
for record in PairedFastaQualIterator(open("example.fasta"), open("example.qual")):
    print(record)

This function will check that the FASTA and QUAL files are consistent (e.g. the records are in the same order, and have the same sequence length). You can combine this
with the Bio.SeqIO.write() function to convert a pair of FASTA and QUAL files into a single FASTQ file:

from Bio import SeqIO
from Bio.SeqIO.QualityIO import PairedFastaQualIterator
with open("example.fasta") as f_handle, open("example.qual") as q_handle:
    records = PairedFastaQualIterator(f_handle, q_handle)
    count = SeqIO.write(records, "temp.fastq", "fastq")
print("Converted %i records" % count)

20.1.11 Indexing a FASTQ file

FASTQ files are often very large, with millions of reads in them. Due to the sheer amount of data, you can't load all the records into memory at once. This is why the
examples above (filtering and trimming) iterate over the file looking at just one SeqRecord at a time.

However, sometimes you can't use a big loop or an iterator; you may need random access to the reads. Here the Bio.SeqIO.index() function may prove very helpful, as it
allows you to access any read in the FASTQ file by its name (see Section 5.4.2).

Again we'll use the SRR020192.fastq file from the ENA (ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR020/SRR020192/SRR020192.fastq.gz), although this is actually quite a
small FASTQ file with less than 50,000 reads:

>>> from Bio import SeqIO
>>> fq_dict = SeqIO.index("SRR020192.fastq", "fastq")
>>> len(fq_dict)
41892
>>> list(fq_dict.keys())[:4]
['SRR020192.38240', 'SRR020192.23181', 'SRR020192.40568', 'SRR020192.23186']
>>> fq_dict["SRR020192.23186"].seq
Seq('GTCCCAGTATTCGGATTTGTCTGCCAAAACAATGAAATTGACACAGTTTACAAC...CCG', SingleLetterAlphabet())

When testing this on a FASTQ file with seven million reads, indexing took about a minute, but record access was almost instant.

The example in Section 20.1.5 shows how you can use the Bio.SeqIO.index() function to sort a large FASTA file; this could also be used on FASTQ files.

20.1.12 Converting SFF files

If you work with 454 (Roche) sequence data, you will probably have access to the raw data as a Standard Flowgram Format (SFF) file. This contains the sequence reads
(called bases) with quality scores and the original flow information.

A common task is to convert from SFF to a pair of FASTA and QUAL files, or to a single FASTQ file. These operations are trivial using the Bio.SeqIO.convert()
function (see Section 5.5.2):

>>> from Bio import SeqIO
>>> SeqIO.convert("E3MFGYR02_random_10_reads.sff", "sff", "reads.fasta", "fasta")
10
>>> SeqIO.convert("E3MFGYR02_random_10_reads.sff", "sff", "reads.qual", "qual")
10
>>> SeqIO.convert("E3MFGYR02_random_10_reads.sff", "sff", "reads.fastq", "fastq")
10

Remember the convert function returns the number of records, in this example just ten. This will give you the untrimmed reads, where the leading and trailing poor
quality sequence or adaptor will be in lower case. If you want the trimmed reads (using the clipping information recorded within the SFF file) use this:

>>> from Bio import SeqIO
>>> SeqIO.convert("E3MFGYR02_random_10_reads.sff", "sff-trim", "trimmed.fasta", "fasta")
10
>>> SeqIO.convert("E3MFGYR02_random_10_reads.sff", "sff-trim", "trimmed.qual", "qual")
10
>>> SeqIO.convert("E3MFGYR02_random_10_reads.sff", "sff-trim", "trimmed.fastq", "fastq")
10

If you run Linux, you could ask Roche for a copy of their "off instrument" tools (often referred to as the Newbler tools). This offers an alternative way to do SFF to
FASTA or QUAL conversion at the command line (but currently FASTQ output is not supported), e.g.

$ sffinfo -seq -notrim E3MFGYR02_random_10_reads.sff > reads.fasta
$ sffinfo -qual -notrim E3MFGYR02_random_10_reads.sff > reads.qual
$ sffinfo -seq -trim E3MFGYR02_random_10_reads.sff > trimmed.fasta
$ sffinfo -qual -trim E3MFGYR02_random_10_reads.sff > trimmed.qual

The way Biopython uses mixed case sequence strings to represent the trimming points deliberately mimics what the Roche tools do.

For more information on the Biopython SFF support, consult the built-in help:

>>> from Bio.SeqIO import SffIO
>>> help(SffIO)
...

20.1.13 Identifying open reading frames

A very simplistic first step at identifying possible genes is to look for open reading frames (ORFs). By this we mean look in all six frames for long regions without stop
codons; an ORF is just a region of nucleotides with no in frame stop codons.

Of course, to find a gene you would also need to worry about locating a start codon, possible promoters, and in Eukaryotes there are introns to worry about too.
However, this approach is still useful in viruses and Prokaryotes.

To show how you might approach this with Biopython, we'll need a sequence to search, and as an example we'll again use the bacterial plasmid, although this time we'll
start with a plain FASTA file with no pre-marked genes: NC_005816.fna. This is a bacterial sequence, so we'll want to use NCBI codon table 11 (see Section 3.9 about
translation).

>>> from Bio import SeqIO
>>> record = SeqIO.read("NC_005816.fna", "fasta")
>>> table = 11
>>> min_pro_len = 100

Here is a neat trick using the Seq object's split method to get a list of all the possible ORF translations in the six reading frames:

>>> for strand, nuc in [(+1, record.seq), (-1, record.seq.reverse_complement())]:
...     for frame in range(3):
...         length = 3 * ((len(record)-frame) // 3)  # Multiple of three
...         for pro in nuc[frame:frame+length].translate(table).split("*"):
...             if len(pro) >= min_pro_len:
...                 print("%s...%s - length %i, strand %i, frame %i" \
...                       % (pro[:30], pro[-3:], len(pro), strand, frame))
GCLMKKSSIVATIITILSGSANAASSQLIP...YRF - length 315, strand 1, frame 0
KSGELRQTPPASSTLHLRLILQRSGVMMEL...NPE - length 285, strand 1, frame 1
GLNCSFFSICNWKFIDYINRLFQIIYLCKN...YYH - length 176, strand 1, frame 1
VKKILYIKALFLCTVIKLRRFIFSVNNMKF...DLP - length 165, strand 1, frame 1
NQIQGVICSPDSGEFMVTFETVMEIKILHK...GVA - length 355, strand 1, frame 2
RRKEHVSKKRRPQKRPRRRRFFHRLRPPDE...PTR - length 128, strand 1, frame 2
TGKQNSCQMSAIWQLRQNTATKTRQNRARI...AIK - length 100, strand 1, frame 2
QGSGYAFPHASILSGIAMSHFYFLVLHAVK...CSD - length 114, strand -1, frame 0
IYSTSEHTGEQVMRTLDEVIASRSPESQTR...FHV - length 111, strand -1, frame 0
WGKLQVIGLSMWMVLFSQRFDDWLNEQEDA...ESK - length 125, strand -1, frame 1
RGIFMSDTMVVNGSGGVPAFLFSGSTLSSY...LLK - length 361, strand -1, frame 1
WDVKTVTGVLHHPFHLTFSLCPEGATQSGR...VKR - length 111, strand -1, frame 1
LSHTVTDFTDQMAQVGLCQCVNVFLDEVTG...KAA - length 107, strand -1, frame 2
RALTGLSAPGIRSQTSCDRLRELRYVPVSL...PLQ - length 119, strand -1, frame 2

Note that here we are counting the frames from the 5' end (start) of each strand. It is sometimes easier to always count from the 5' end (start) of the forward strand.

You could easily edit the above loop based code to build up a list of the candidate proteins, or convert this to a list comprehension. Now, one thing this code doesn't do is
keep track of where the proteins are.

You could tackle this in several ways. For example, the following code tracks the locations in terms of the protein counting, and converts back to the parent sequence by
multiplying by three, then adjusting for the frame and strand:

from Bio import SeqIO
record = SeqIO.read("NC_005816.gb", "genbank")
table = 11
min_pro_len = 100

def find_orfs_with_trans(seq, trans_table, min_protein_length):
    answer = []
    seq_len = len(seq)
    for strand, nuc in [(+1, seq), (-1, seq.reverse_complement())]:
        for frame in range(3):
            trans = str(nuc[frame:].translate(trans_table))
            trans_len = len(trans)
            aa_start = 0
            aa_end = 0
            while aa_start < trans_len:
                aa_end = trans.find("*", aa_start)
                if aa_end == -1:
                    # no further stop codon, so run to the end of the translation
                    aa_end = trans_len
                if aa_end - aa_start >= min_protein_length:
                    if strand == 1:
                        start = frame + aa_start*3
                        end = min(seq_len, frame + aa_end*3 + 3)
                    else:
                        start = seq_len - frame - aa_end*3 - 3
                        end = seq_len - frame - aa_start*3
                    answer.append((start, end, strand,
                                   trans[aa_start:aa_end]))
                aa_start = aa_end + 1
    answer.sort()
    return answer

orf_list = find_orfs_with_trans(record.seq, table, min_pro_len)
for start, end, strand, pro in orf_list:
    print("%s...%s - length %i, strand %i, %i:%i" \
          % (pro[:30], pro[-3:], len(pro), strand, start, end))

And the output:

NQIQGVICSPDSGEFMVTFETVMEIKILHK...GVA - length 355, strand 1, 41:1109
WDVKTVTGVLHHPFHLTFSLCPEGATQSGR...VKR - length 111, strand -1, 491:827
KSGELRQTPPASSTLHLRLILQRSGVMMEL...NPE - length 285, strand 1, 1030:1888
RALTGLSAPGIRSQTSCDRLRELRYVPVSL...PLQ - length 119, strand -1, 2830:3190
RRKEHVSKKRRPQKRPRRRRFFHRLRPPDE...PTR - length 128, strand 1, 3470:3857
GLNCSFFSICNWKFIDYINRLFQIIYLCKN...YYH - length 176, strand 1, 4249:4780
RGIFMSDTMVVNGSGGVPAFLFSGSTLSSY...LLK - length 361, strand -1, 4814:5900
VKKILYIKALFLCTVIKLRRFIFSVNNMKF...DLP - length 165, strand 1, 5923:6421
LSHTVTDFTDQMAQVGLCQCVNVFLDEVTG...KAA - length 107, strand -1, 5974:6298
GCLMKKSSIVATIITILSGSANAASSQLIP...YRF - length 315, strand 1, 6654:7602
IYSTSEHTGEQVMRTLDEVIASRSPESQTR...FHV - length 111, strand -1, 7788:8124
WGKLQVIGLSMWMVLFSQRFDDWLNEQEDA...ESK - length 125, strand -1, 8087:8465
TGKQNSCQMSAIWQLRQNTATKTRQNRARI...AIK - length 100, strand 1, 8741:9044
QGSGYAFPHASILSGIAMSHFYFLVLHAVK...CSD - length 114, strand -1, 9264:9609

If you comment out the sort statement, then the protein sequences will be shown in the same order as before, so you can check this is doing the same thing. Here we have
sorted them by location to make it easier to compare to the actual annotation in the GenBank file (as visualised in Section 17.1.9).

If however all you want to find are the locations of the open reading frames, then it is a waste of time to translate every possible codon, including doing the reverse
complement to search the reverse strand too. All you need to do is search for the possible stop codons (and their reverse complements). Using regular expressions is an
obvious approach here (see the Python module re). These are an extremely powerful (but rather complex) way of describing search strings, which are supported in lots of
programming languages (and also command line tools like grep). You can find whole books about this topic!
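As a starting point for the regular expression idea, here is a rough sketch that records where stop codons fall in each forward reading frame. It assumes the three stop codons of NCBI table 11 (TAA, TAG and TGA), only looks at the forward strand, and uses a made-up function name and toy sequence. You would still need to search for the reverse complements and turn the gaps between stops into ORF coordinates:

```python
import re

# A lookahead is used so that overlapping candidates are all reported
STOP_CODON = re.compile(r"(?=(TAA|TAG|TGA))")

def stop_positions_by_frame(seq):
    """Map each forward frame (0, 1, 2) to the start positions of its stop codons."""
    frames = {0: [], 1: [], 2: []}
    for match in STOP_CODON.finditer(seq.upper()):
        frames[match.start() % 3].append(match.start())
    return frames

# Toy sequence: ATG TAA in frame 0, and a TGA starting at position 8 (frame 2)
frames = stop_positions_by_frame("ATGTAACCTGA")
print(frames)
```

Within a frame, any gap between consecutive stop positions of at least three times your minimum protein length is a candidate ORF, and no translation was needed to find it.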

20.2 Sequence parsing plus simple plots
This section shows some more examples of sequence parsing, using the Bio.SeqIO module described in Chapter 5, plus the Python library matplotlib's pylab plotting
interface (see the matplotlib website for a tutorial). Note that to follow these examples you will need matplotlib installed, but without it you can still try the data parsing
bits.

20.2.1 Histogram of sequence lengths
There are lots of times when you might want to visualise the distribution of sequence lengths in a dataset, for example the range of contig sizes in a genome assembly
project. In this example we'll reuse our orchid FASTA file ls_orchid.fasta which has only 94 sequences.

First of all, we will use Bio.SeqIO to parse the FASTA file and compile a list of all the sequence lengths. You could do this with a for loop, but I find a list comprehension
more pleasing:

>>> from Bio import SeqIO
>>> sizes = [len(rec) for rec in SeqIO.parse("ls_orchid.fasta", "fasta")]
>>> len(sizes), min(sizes), max(sizes)
(94, 572, 789)
>>> sizes
[740, 753, 748, 744, 733, 718, 730, 704, 740, 709, 700, 726, ..., 592]

Now that we have the lengths of all the genes (as a list of integers), we can use the matplotlib histogram function to display it.

from Bio import SeqIO
sizes = [len(rec) for rec in SeqIO.parse("ls_orchid.fasta", "fasta")]

import pylab
pylab.hist(sizes, bins=20)
pylab.title("%i orchid sequences\nLengths %i to %i" \
            % (len(sizes), min(sizes), max(sizes)))
pylab.xlabel("Sequence length (bp)")
pylab.ylabel("Count")
pylab.show()

That should pop up a new window containing the following graph:

Notice that most of these orchid sequences are about 740 bp long, and there could be two distinct classes of sequence here with a subset of shorter sequences.

Tip: Rather than using pylab.show() to show the plot in a window, you can also use pylab.savefig(...) to save the figure to a file (e.g. as a PNG or PDF).

20.2.2 Plot of sequence GC%
Another easily calculated quantity of a nucleotide sequence is the GC%. You might want to look at the GC% of all the genes in a bacterial genome for example, and
investigate any outliers which could have been recently acquired by horizontal gene transfer. Again, for this example we'll reuse our orchid FASTA file ls_orchid.fasta.

First of all, we will use Bio.SeqIO to parse the FASTA file and compile a list of all the GC percentages. Again, you could do this with a for loop, but I prefer this:

from Bio import SeqIO
from Bio.SeqUtils import GC

gc_values = sorted(GC(rec.seq) for rec in SeqIO.parse("ls_orchid.fasta", "fasta"))

Having read in each sequence and calculated the GC%, we then sorted them into ascending order. Now we'll take this list of floating point values and plot them with
matplotlib:

import pylab
pylab.plot(gc_values)
pylab.title("%i orchid sequences\nGC%% %0.1f to %0.1f" \
            % (len(gc_values), min(gc_values), max(gc_values)))
pylab.xlabel("Genes")
pylab.ylabel("GC%")
pylab.show()

As in the previous example, that should pop up a new window containing a graph:

If you tried this on the full set of genes from one organism, you'd probably get a much smoother plot than this.
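If you are curious what the GC% calculation amounts to, here is a simplified sketch on plain strings. The function name gc_percent and the toy sequences are made up, and unlike a library function this version ignores ambiguity codes; returning 0.0 for an empty sequence is a choice made here to avoid dividing by zero:

```python
def gc_percent(sequence):
    """Percentage of G and C letters in a sequence (case insensitive)."""
    if not sequence:
        return 0.0   # avoid dividing by zero on an empty sequence
    seq = sequence.upper()
    return 100.0 * (seq.count("G") + seq.count("C")) / len(seq)

toy_sequences = ["ATGC", "GGCC", "ATAT", "gcgcat"]   # made-up examples
gc_values = sorted(gc_percent(seq) for seq in toy_sequences)
print(["%0.1f" % value for value in gc_values])
```

The sorted list of floating point values is exactly the kind of input the plotting code above expects.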

20.2.3 Nucleotide dot plots
A dot plot is a way of visually comparing two nucleotide sequences for similarity to each other. A sliding window is used to compare short sub-sequences to each other,
often with a mismatch threshold. Here for simplicity we'll only look for perfect matches (shown in black in the plot below).

To start off, we'll need two sequences. For the sake of argument, we'll just take the first two from our orchid FASTA file ls_orchid.fasta:

from Bio import SeqIO
with open("ls_orchid.fasta") as in_handle:
    record_iterator = SeqIO.parse(in_handle, "fasta")
    rec_one = next(record_iterator)
    rec_two = next(record_iterator)

We're going to show two approaches. Firstly, a simple naive implementation which compares all the window sized sub-sequences to each other to compile a similarity
matrix. You could construct a matrix or array object, but here we just use a list of lists of booleans created with a nested list comprehension:

window = 7
seq_one = str(rec_one.seq).upper()
seq_two = str(rec_two.seq).upper()
# Rows (i) follow seq_two and columns (j) follow seq_one, matching the axis
# labels used when plotting. A match gives False (0), which is drawn in black.
data = [[(seq_one[j:j+window] != seq_two[i:i+window])
         for j in range(len(seq_one)-window)]
        for i in range(len(seq_two)-window)]

Note that we have not checked for reverse complement matches here. Now we'll use the matplotlib's pylab.imshow() function to display this data, first requesting the
gray color scheme so this is done in black and white:

import pylab
pylab.gray()
pylab.imshow(data)
pylab.xlabel("%s (length %i bp)" % (rec_one.id, len(rec_one)))
pylab.ylabel("%s (length %i bp)" % (rec_two.id, len(rec_two)))
pylab.title("Dot plot using window size %i\n(allowing no mis-matches)" % window)
pylab.show()

That should pop up a new window containing a graph like this:

As you might have expected, these two sequences are very similar with a partial line of window-sized matches along the diagonal. There are no off-diagonal matches which would be indicative of inversions or other interesting events.

The above code works fine on small examples, but there are two problems applying this to larger sequences, which we will address below. First of all, this brute force approach to the all-against-all comparisons is very slow. Instead, we'll compile dictionaries mapping the window-sized sub-sequences to their locations, and then take the set intersection to find those sub-sequences found in both sequences. This uses more memory, but is much faster. Secondly, the pylab.imshow() function is limited in the size of matrix it can display. As an alternative, we'll use the pylab.scatter() function.

We start by creating dictionaries mapping the window-sized sub-sequences to locations:

window = 7
dict_one = {}
dict_two = {}
for (seq, section_dict) in [(str(rec_one.seq).upper(), dict_one),
                            (str(rec_two.seq).upper(), dict_two)]:
    for i in range(len(seq) - window):
        section = seq[i:i + window]
        try:
            section_dict[section].append(i)
        except KeyError:
            section_dict[section] = [i]
# Now find any sub-sequences found in both sequences
# (Python 2.3 would require slightly different code here)
matches = set(dict_one).intersection(dict_two)
print("%i unique matches" % len(matches))

In order to use pylab.scatter() we need separate lists for the x and y coordinates:

# Create lists of x and y coordinates for scatter plot
x = []
y = []
for section in matches:
    for i in dict_one[section]:
        for j in dict_two[section]:
            x.append(i)
            y.append(j)

We are now ready to draw the revised dot plot as a scatter plot:

import pylab
pylab.cla()  # clear any prior graph
pylab.gray()
pylab.scatter(x, y)
pylab.xlim(0, len(rec_one) - window)
pylab.ylim(0, len(rec_two) - window)
pylab.xlabel("%s (length %i bp)" % (rec_one.id, len(rec_one)))
pylab.ylabel("%s (length %i bp)" % (rec_two.id, len(rec_two)))
pylab.title("Dot plot using window size %i\n(allowing no mismatches)" % window)
pylab.show()

That should pop up a new window containing a graph like this:


Personally I find this second plot much easier to read! Again note that we have not checked for reverse complement matches here; you could extend this example to do this, and perhaps plot the forward matches in one color and the reverse matches in another.
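One possible way to make that extension is sketched below in plain Python, working on upper-case DNA strings rather than SeqRecord objects. The helper names reverse_complement, index_windows and find_matches are invented for this illustration, and only the four unambiguous letters A, C, G and T are complemented. Plotting is left out so that the sketch runs without matplotlib.

```python
from collections import defaultdict

# Translation table for the four unambiguous DNA letters only (an
# assumption of this sketch; ambiguity codes are left unchanged).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def index_windows(seq, window):
    """Map each window-sized sub-sequence to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(seq) - window + 1):
        index[seq[i:i + window]].append(i)
    return index


def find_matches(seq_one, seq_two, window):
    """Return (forward, reverse) lists of (x, y) window start coordinates."""
    index_one = index_windows(seq_one, window)
    index_two = index_windows(seq_two, window)
    forward = sorted(
        (i, j)
        for section in index_one.keys() & index_two.keys()
        for i in index_one[section]
        for j in index_two[section]
    )
    reverse = sorted(
        (i, j)
        for section in index_one
        for j in index_two.get(reverse_complement(section), [])
        for i in index_one[section]
    )
    return forward, reverse


if __name__ == "__main__":
    forward, reverse = find_matches("AAACCC", "GGGTTT", window=3)
    print("forward:", forward)
    print("reverse:", reverse)
```

The two coordinate lists could then be unpacked into x and y lists and passed to two separate pylab.scatter() calls, each with its own color.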

20.2.4 Plotting the quality scores of sequencing read data
If you are working with second generation sequencing data, you may want to try plotting the quality data. Here is an example using two FASTQ files containing paired end reads, SRR001666_1.fastq for the forward reads, and SRR001666_2.fastq for the reverse reads. These were downloaded from the ENA sequence read archive FTP site (ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR001/SRR001666/SRR001666_1.fastq.gz and ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR001/SRR001666/SRR001666_2.fastq.gz), and are from E. coli; see http://www.ebi.ac.uk/ena/data/view/SRR001666 for details.

In the following code the pylab.subplot(...) function is used in order to show the forward and reverse qualities on two subplots, side by side. There is also a little bit of code to only plot the first fifty reads.

import pylab
from Bio import SeqIO
for subfigure in [1, 2]:
    filename = "SRR001666_%i.fastq" % subfigure
    pylab.subplot(1, 2, subfigure)
    for i, record in enumerate(SeqIO.parse(filename, "fastq")):
        if i >= 50:
            break  # trick: stop after the first fifty reads
        pylab.plot(record.letter_annotations["phred_quality"])
    pylab.ylim(0, 45)
    pylab.ylabel("PHRED quality score")
    pylab.xlabel("Position")
pylab.savefig("SRR001666.png")
print("Done")

You should note that we are using the Bio.SeqIO format name fastq here because the NCBI has saved these reads using the standard Sanger FASTQ format with PHRED scores. However, as you might guess from the read lengths, this data was from an Illumina Genome Analyzer and was probably originally in one of the two Solexa/Illumina FASTQ variant file formats instead.

This example uses the pylab.savefig(...) function instead of pylab.show(...), but as mentioned before both are useful. Here is the result:
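For readers unfamiliar with how PHRED scores are stored, the following small sketch decodes a Sanger-style FASTQ quality string by hand. It relies on two standard facts: the Sanger variant encodes each score as an ASCII character with an offset of 33, and a PHRED score Q corresponds to an error probability of 10^(-Q/10). The function names are made up for the illustration; in practice Bio.SeqIO does this decoding for you and exposes the result as letter_annotations["phred_quality"].

```python
def sanger_phred_scores(quality_string):
    """Decode a Sanger FASTQ quality string (ASCII offset 33) into integers."""
    return [ord(char) - 33 for char in quality_string]


def error_probability(phred):
    """Convert a PHRED score Q into an error probability, P = 10 ** (-Q / 10)."""
    return 10 ** (-phred / 10.0)


if __name__ == "__main__":
    scores = sanger_phred_scores("!5I")
    for score in scores:
        print("Q%i means an error probability of %0.4f"
              % (score, error_probability(score)))
```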


20.3 Dealing with alignments
This section can be seen as a follow-on to Chapter 6.

20.3.1 Calculating summary information
Once you have an alignment, you are very likely going to want to find out information about it. Instead of trying to have all of the functions that can generate information about an alignment in the alignment object itself, we've tried to separate out the functionality into separate classes, which act on the alignment.

Getting ready to calculate summary information about an object is quick to do. Let's say we've got an alignment object called alignment, for example read in using Bio.AlignIO.read(...) as described in Chapter 6. All we need to do to get an object that will calculate summary information is:

from Bio.Align import AlignInfo
summary_align = AlignInfo.SummaryInfo(alignment)

The summary_align object is very useful, and will do the following neat things for you:

1. Calculate a quick consensus sequence; see section 20.3.2
2. Get a position specific score matrix for the alignment; see section 20.3.3
3. Calculate the information content for the alignment; see section 20.3.4
4. Generate information on substitutions in the alignment; section 20.4 details using this to generate a substitution matrix.

20.3.2 Calculating a quick consensus sequence

The SummaryInfo object, described in section 20.3.1, provides functionality to calculate a quick consensus of an alignment. Assuming we've got a SummaryInfo object called summary_align we can calculate a consensus by doing:

consensus = summary_align.dumb_consensus()

As the name suggests, this is a really simple consensus calculator, and will just add up all of the residues at each point in the consensus, and if the most common value is higher than some threshold value will add the common residue to the consensus. If it doesn't reach the threshold, it adds an ambiguity character to the consensus. The returned consensus object is a Seq object whose alphabet is inferred from the alphabets of the sequences making up the consensus. So doing a print consensus would give:

consensus Seq('TATACATNAAAGNAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAAAAAAATGAAT
...', IUPACAmbiguousDNA())

You can adjust how dumb_consensus works by passing optional parameters:

the threshold
    This is the threshold specifying how common a particular residue has to be at a position before it is added. The default is 0.7 (meaning 70%).
the ambiguous character
    This is the ambiguity character to use. The default is 'N'.
the consensus alphabet
    This is the alphabet to use for the consensus sequence. If an alphabet is not specified then we will try to guess the alphabet based on the alphabets of the sequences in the alignment.
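To make the role of the threshold and the ambiguity character concrete, here is a rough sketch of a threshold-based consensus in plain Python. It is not the code behind dumb_consensus; the function name simple_consensus is hypothetical, and the choices to skip gap characters and to accept a residue whose frequency equals the threshold are assumptions made for the example.

```python
from collections import Counter


def simple_consensus(sequences, threshold=0.7, ambiguous="N"):
    """Return a consensus string for equal-length aligned sequences.

    For each column the most common residue is used if its frequency
    among the non-gap residues reaches the threshold; otherwise the
    ambiguity character is used. (Illustrative sketch only.)
    """
    consensus = []
    for column in zip(*sequences):
        residues = [c for c in column if c not in "-."]
        if not residues:
            # A column made up entirely of gaps has no consensus residue.
            consensus.append(ambiguous)
            continue
        residue, count = Counter(residues).most_common(1)[0]
        if count / float(len(residues)) >= threshold:
            consensus.append(residue)
        else:
            consensus.append(ambiguous)
    return "".join(consensus)


if __name__ == "__main__":
    aligned = ["ACGT", "ACGA", "ACTA", "AC-C"]
    print(simple_consensus(aligned))
    print(simple_consensus(aligned, threshold=0.5))
```

Lowering the threshold trades ambiguity characters for residues that are supported by fewer sequences.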

20.3.3 Position Specific Score Matrices

Position specific score matrices (PSSMs) summarize the alignment information in a different way than a consensus, and may be useful for different tasks. Basically, a PSSM is a count matrix. For each column in the alignment, the number of occurrences of each alphabet letter is counted and totaled. The totals are displayed relative to some representative sequence along the left axis. This sequence may be the consensus sequence, but can also be any sequence in the alignment. For instance for the alignment,

GTATC
AT--C
CTGTC

the PSSM is:

    G A T C
G   1 1 0 1
T   0 0 3 0
A   1 1 0 0
T   0 0 2 0
C   0 0 0 3
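The counting step can be sketched in plain Python as follows. This is only an illustration of the idea, not the code Biopython uses; the function name count_matrix and its letters argument are assumptions made for the example. Gap characters are simply never counted because they are not among the letters.

```python
def count_matrix(sequences, letters="GATC"):
    """Count each letter in every column of equal-length aligned sequences.

    Returns one dictionary per alignment column, mapping letter to count.
    (Hypothetical helper for illustration.)
    """
    rows = []
    for column in zip(*sequences):
        rows.append(dict((letter, column.count(letter)) for letter in letters))
    return rows


if __name__ == "__main__":
    alignment = ["GTATC", "AT--C", "CTGTC"]
    representative = alignment[0]  # sequence shown along the left axis
    print("    " + " ".join("GATC"))
    for residue, counts in zip(representative, count_matrix(alignment)):
        print(residue + "   " + " ".join(str(counts[x]) for x in "GATC"))
```

Run on the three-sequence alignment above, this reproduces the counts in the table.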

Let's assume we've got an alignment object called c_align. To get a PSSM with the consensus sequence along the side we first get a summary object and calculate the consensus sequence:

summary_align = AlignInfo.SummaryInfo(c_align)
consensus = summary_align.dumb_consensus()

Now, we want to make the PSSM, but ignore any N ambiguity residues when calculating this:

my_pssm = summary_align.pos_specific_score_matrix(consensus,
                                                  chars_to_ignore=['N'])

Two notes should be made about this:

1. To maintain strictness with the alphabets, you can only include characters along the top of the PSSM that are in the alphabet of the alignment object. Gaps are not included along the top axis of the PSSM.
2. The sequence passed to be displayed along the left side of the axis does not need to be the consensus. For instance, if you wanted to display the second sequence in the alignment along this axis, you would need to do:
   second_seq = alignment.get_seq_by_num(1)
   my_pssm = summary_align.pos_specific_score_matrix(second_seq,
                                                     chars_to_ignore=['N'])

The command above returns a PSSM object. To print out the PSSM as shown above, we simply need to do a print(my_pssm), which gives:

    A   C   G   T
T  0.0 0.0 0.0 7.0
A  7.0 0.0 0.0 0.0
T  0.0 0.0 0.0 7.0
A  7.0 0.0 0.0 0.0
C  0.0 7.0 0.0 0.0
A  7.0 0.0 0.0 0.0
T  0.0 0.0 0.0 7.0
T  1.0 0.0 0.0 6.0
...

You can access any element of the PSSM by subscripting like your_pssm[sequence_number][residue_count_name]. For instance, to get the counts for the 'A' residue in the second element of the above PSSM you would do:

>>> print(my_pssm[1]["A"])
7.0

The structure of the PSSM class hopefully makes it easy both to access elements and to pretty print the matrix.

20.3.4 Information Content
A potentially useful measure of evolutionary conservation is the information content of a sequence.

A useful introduction to information theory targeted towards molecular biologists can be found at http://www.lecb.ncifcrf.gov/~toms/paper/primer/. For our purposes, we will be looking at the information content of a consensus sequence, or a portion of a consensus sequence. We calculate information content at a particular column in a multiple sequence alignment using the following formula:

$$ IC_j = \sum_{i=1}^{N_a} P_{ij} \log\left(\frac{P_{ij}}{Q_i}\right) $$

where:

$IC_j$: The information content for the $j$-th column in an alignment.
$N_a$: The number of letters in the alphabet.
$P_{ij}$: The frequency of a particular letter $i$ in the $j$-th column (i.e. if G occurred 3 out of 6 times in an alignment column, this would be 0.5).
$Q_i$: The expected frequency of a letter $i$. This is an optional argument, usage of which is left at the user's discretion. By default, it is automatically assigned to 0.05 = 1/20 for a protein alphabet, and 0.25 = 1/4 for a nucleic acid alphabet. This is for getting the information content without any assumption of prior distributions. When assuming priors, or when using a non-standard alphabet, you should supply the values for $Q_i$.
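As a worked illustration of the formula, the sketch below evaluates it for a single alignment column given as a string. It is a plain Python restatement of the equation, not the library's implementation; the function name column_information_content, its ignore argument and the uniform nucleotide default for the expected frequencies are assumptions made for the example. Letters that do not occur in the column contribute nothing to the sum.

```python
import math


def column_information_content(column, expected=None, log_base=2, ignore="-N"):
    """Evaluate IC_j = sum_i P_ij * log(P_ij / Q_i) for one column.

    column   -- string holding the letters of one alignment column
    expected -- dict of expected frequencies Q_i (defaults to 0.25 for
                each of A, C, G and T; an assumption of this sketch)
    """
    if expected is None:
        expected = dict((letter, 0.25) for letter in "ACGT")
    letters = [c for c in column if c not in ignore]
    if not letters:
        return 0.0
    total = float(len(letters))
    information = 0.0
    for letter in set(letters):
        p = letters.count(letter) / total
        information += p * math.log(p / expected[letter], log_base)
    return information


if __name__ == "__main__":
    for column in ["GGGGGG", "GGGCCC", "ACGT"]:
        print("%s: %0.2f bits" % (column, column_information_content(column)))
```

A fully conserved nucleotide column gives the maximum of 2 bits with base 2 logarithms, while a column with all four letters equally represented gives 0.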

Well, now that we have an idea what information content is being calculated in Biopython, let's look at how to get it for a particular region of the alignment.

First, we need to use our alignment to get an alignment summary object, which we'll assume is called summary_align (see section 20.3.1 for instructions on how to get this). Once we've got this object, calculating the information content for a region is as easy as:

info_content = summary_align.information_content(5, 30,
                                                 chars_to_ignore=['N'])

Wow, that was much easier than the formula above made it look! The variable info_content now contains a float value specifying the information content over the specified region (from 5 to 30 of the alignment). We specifically ignore the ambiguity residue 'N' when calculating the information content, since this value is not included in our alphabet (so we shouldn't be interested in looking at it!).

As mentioned above, we can also calculate relative information content by supplying the expected frequencies:

expect_freq = {
    'A': .3,
    'G': .2,
    'T': .3,
    'C': .2}

The expected frequencies should not be passed as a raw dictionary, but instead be passed as a SubsMat.FreqTable object (see section 22.2.2 for more information about FreqTables). The FreqTable object provides a standard for associating the dictionary with an Alphabet, similar to how the Biopython Seq class works.

To create a FreqTable object from the frequency dictionary you just need to do:

from Bio.Alphabet import IUPAC
from Bio.SubsMat import FreqTable

e_freq_table = FreqTable.FreqTable(expect_freq, FreqTable.FREQ,
                                   IUPAC.unambiguous_dna)

Now that we've got that, calculating the relative information content for our region of the alignment is as simple as:

info_content = summary_align.information_content(5, 30,
                                                 e_freq_table=e_freq_table,
                                                 chars_to_ignore=['N'])

Now, info_content will contain the relative information content over the region in relation to the expected frequencies.

The value returned is calculated using base 2 as the logarithm base in the formula above. You can modify this by passing the parameter log_base as the base you want:

info_content = summary_align.information_content(5, 30, log_base=10,
                                                 chars_to_ignore=['N'])

By default nucleotide or amino acid residues with a frequency of 0 in a column are not taken into account when the relative information content for that column is computed. If this is not the desired result, you can use pseudo_count instead.

info_content = summary_align.information_content(5, 30,
                                                 chars_to_ignore=['N'],
                                                 pseudo_count=1)

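In this case, the observed frequency $P_{ij}$ of a particular letter $i$ in the $j$-th column is computed as follows:

$$ P_{ij} = \frac{n_{ij} + k \, Q_i}{N_j + k} $$

where:

$n_{ij}$: the number of times letter $i$ is observed in the $j$-th column.
$N_j$: the total number of letters observed in the $j$-th column.
$k$: the pseudo count you pass as argument.
$Q_i$: The expected frequency of the letter $i$ as described above.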
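The effect of the pseudo count can be seen in a short sketch that applies the formula to one column. The function name pseudocount_frequencies is hypothetical, and the code is a direct restatement of the equation rather than Biopython's implementation. Here the number of occurrences of a letter and the column length play the roles of the numerator count and the denominator total.

```python
def pseudocount_frequencies(column, expected, pseudo_count=1.0):
    """Return P_ij = (n_ij + k * Q_i) / (N_j + k) for every letter in expected.

    column   -- string holding the letters of one alignment column
    expected -- dict mapping each letter i to its expected frequency Q_i
    """
    total = len(column)
    return dict(
        (letter, (column.count(letter) + pseudo_count * q) / (total + pseudo_count))
        for letter, q in expected.items()
    )


if __name__ == "__main__":
    uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    freqs = pseudocount_frequencies("GGGG", uniform, pseudo_count=1)
    for letter in "ACGT":
        print("%s %0.2f" % (letter, freqs[letter]))
```

Letters that never occur in the column now get a small non-zero frequency, so they are no longer dropped from the information content sum, and the frequencies still add up to one.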

Well, now you are ready to calculate information content. If you want to try applying this to some real life problems, it would probably be best to dig into the literature on information content to get an idea of how it is used. Hopefully your digging won't reveal any mistakes made in coding this function!

20.4 Substitution Matrices
Substitution matrices are an extremely important part of everyday bioinformatics work. They provide the scoring terms for classifying how likely two different residues are to substitute for each other. This is essential in doing sequence comparisons. The book "Biological Sequence Analysis" by Durbin et al. provides a really nice introduction to Substitution Matrices and their uses. Some famous substitution matrices are the PAM and BLOSUM series of matrices.

Biopython provides a ton of common substitution matrices, and also provides functionality for creating your own substitution matrices.

20.4.1 Using common substitution matrices

20.4.2 Creating your own substitution matrix from an alignment
A very cool thing that you can do easily with the substitution matrix classes is to create your own substitution matrix from an alignment. In practice, this is normally done with protein alignments. In this example, we'll first get a Biopython alignment object and then get a summary object to calculate info about the alignment. The file protein.aln (also available online) contains the Clustalw alignment output.

>>> from Bio import AlignIO
>>> from Bio import Alphabet
>>> from Bio.Alphabet import IUPAC
>>> from Bio.Align import AlignInfo
>>> filename = "protein.aln"
>>> alpha = Alphabet.Gapped(IUPAC.protein)
>>> c_align = AlignIO.read(filename, "clustal", alphabet=alpha)
>>> summary_align = AlignInfo.SummaryInfo(c_align)

Sections 6.4.1 and 20.3.1 contain more information on doing this.

Now that we've got our summary_align object, we want to use it to find out the number of times different residues substitute for each other. To make the example more readable, we'll focus on only amino acids with polar charged side chains. Luckily, this can be done easily when generating a replacement dictionary, by passing in all of the characters that should be ignored. Thus we'll create a dictionary of replacements for only charged polar amino acids using:

>>> replace_info = summary_align.replacement_dictionary(["G", "A", "V", "L", "I",
...                                                      "M", "P", "F", "W", "S",
...                                                      "T", "N", "Q", "Y", "C"])

This information about amino acid replacements is represented as a python dictionary which will look something like (the order can vary):

{('R', 'R'): 2079.0, ('R', 'H'): 17.0, ('R', 'K'): 103.0, ('R', 'E'): 2.0,
('R', 'D'): 2.0, ('H', 'R'): 0, ('D', 'H'): 15.0, ('K', 'K'): 3218.0,
('K', 'H'): 24.0, ('H', 'K'): 8.0, ('E', 'H'): 15.0, ('H', 'H'): 1235.0,
('H', 'E'): 18.0, ('H', 'D'): 0, ('K', 'D'): 0, ('K', 'E'): 9.0,
('D', 'R'): 48.0, ('E', 'R'): 2.0, ('D', 'K'): 1.0, ('E', 'K'): 45.0,
('K', 'R'): 130.0, ('E', 'D'): 241.0, ('E', 'E'): 3305.0,
('D', 'E'): 270.0, ('D', 'D'): 2360.0}

This information gives us our accepted number of replacements, or how often we expect different things to substitute for each other. It turns out, amazingly enough, that this is all of the information we need to go ahead and create a substitution matrix. First, we use the replacement dictionary information to create an Accepted Replacement Matrix (ARM):

>>> from Bio import SubsMat
>>> my_arm = SubsMat.SeqMat(replace_info)

With this accepted replacement matrix, we can go right ahead and create our log odds matrix (i.e. a standard type Substitution Matrix):

>>> my_lom = SubsMat.make_log_odds_matrix(my_arm)

The log odds matrix you create is customizable with the following optional arguments:

exp_freq_table: You can pass a table of expected frequencies for each alphabet. If supplied, this will be used instead of the passed accepted replacement matrix when calculating expected replacements.
logbase: The base of the logarithm taken to create the log odd matrix. Defaults to base 10.
factor: The factor to multiply each matrix entry by. This defaults to 10, which normally makes the matrix numbers easy to work with.
round_digit: The digit to round to in the matrix. This defaults to 0 (i.e. no digits).
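To show how the logbase, factor and round_digit options interact, here is a simplified sketch of a log odds calculation in plain Python. It is an assumption-laden illustration and not the code in Bio.SubsMat: the function name log_odds_matrix is hypothetical, the input is assumed to hold counts for ordered letter pairs (a full rather than triangular matrix), and the expected frequency of a pair is taken to be the product of the two letters' marginal frequencies. The library's handling of these details may differ.

```python
import math


def log_odds_matrix(pair_counts, logbase=10, factor=10, round_digit=0):
    """Return {(a, b): factor * log(observed / expected)} for each pair.

    pair_counts -- dict mapping ordered letter pairs (a, b) to counts
    Pairs with a zero count map to None, since the log is undefined.
    (Simplified illustration only.)
    """
    total = float(sum(pair_counts.values()))
    # Marginal frequency of each letter: every pair contributes half of
    # its weight to each of its two letters.
    letter_freq = {}
    for (a, b), count in pair_counts.items():
        letter_freq[a] = letter_freq.get(a, 0.0) + count / (2 * total)
        letter_freq[b] = letter_freq.get(b, 0.0) + count / (2 * total)
    result = {}
    for (a, b), count in pair_counts.items():
        if count == 0:
            result[(a, b)] = None
            continue
        observed = count / total
        expected = letter_freq[a] * letter_freq[b]
        result[(a, b)] = round(factor * math.log(observed / expected, logbase),
                               round_digit)
    return result


if __name__ == "__main__":
    counts = {("D", "D"): 45, ("D", "E"): 5, ("E", "D"): 5, ("E", "E"): 45}
    for pair, score in sorted(log_odds_matrix(counts).items()):
        print(pair, score)
```

Pairs seen more often than chance would predict get positive scores, and pairs seen less often get negative scores; the factor and rounding only rescale the numbers to make them easier to read.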

Once you've got your log odds matrix, you can display it prettily using the function print_mat. Doing this on our created matrix gives:

>>> my_lom.print_mat()
D   2
E  -1   1
H  -5  -4   3
K -10  -5  -4   1
R  -4  -8  -4  -2   2
   D   E   H   K   R

Very nice. Now we've got our very own substitution matrix to play with!

20.5 BioSQL: storing sequences in a relational database
BioSQL is a joint effort between the OBF projects (BioPerl, BioJava etc) to support a shared database schema for storing sequence data. In theory, you could load a GenBank file into the database with BioPerl, then using Biopython extract this from the database as a record object with features, and get more or less the same thing as if you had loaded the GenBank file directly as a SeqRecord using Bio.SeqIO (Chapter 5).

Biopython's BioSQL module is currently documented at http://biopython.org/wiki/BioSQL which is part of our wiki pages.

Chapter 21 The Biopython testing framework
Biopython has a regression testing framework (the file run_tests.py) based on unittest, the standard unit testing framework for Python. Providing comprehensive tests for modules is one of the most important aspects of making sure that the Biopython code is as bug-free as possible before going out. It also tends to be one of the most undervalued aspects of contributing. This chapter is designed to make running the Biopython tests and writing good test code as easy as possible. Ideally, every module that goes into Biopython should have a test (and should also have documentation!). All our developers, and anyone installing Biopython from source, are strongly encouraged to run the unit tests.

21.1 Running the tests
When you download the Biopython source code, or check it out from our source code repository, you should find a subdirectory called Tests. This contains the key script run_tests.py, lots of individual scripts named test_XXX.py, a subdirectory called output and lots of other subdirectories which contain input files for the test suite.

As part of building and installing Biopython you will typically run the full test suite at the command line from the Biopython source top level directory using the following:

python setup.py test

This is actually equivalent to going to the Tests subdirectory and running:

python run_tests.py

You'll often want to run just some of the tests, and this is done like this:

python run_tests.py test_SeqIO.py test_AlignIO.py

When giving the list of tests, the .py extension is optional, so you can also just type:

python run_tests.py test_SeqIO test_AlignIO

To run the docstring tests (see section 21.3), you can use

python run_tests.py doctest

You can also skip any tests which have been set up with an explicit online component by adding --offline, e.g.

python run_tests.py --offline

By default, run_tests.py runs all tests, including the docstring tests.

If an individual test is failing, you can also try running it directly, which may give you more information.

Importantly, note that the individual unit test files come in two types:

Older simple print-and-compare scripts. These unit tests are essentially short example Python programs, which print out various output text. For a test file named test_XXX.py there will be a matching text file called test_XXX under the output subdirectory which contains the expected output. All that the test framework does is run the script, and check the output agrees.
Standard unittest-based tests. These will import unittest and then define unittest.TestCase classes, each with one or more sub-tests as methods starting with test_ which check some specific aspect of the code. These tests should not print any output directly.
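The idea behind a print-and-compare test can be sketched in a few lines of plain Python. This is not how run_tests.py is implemented; the helper names capture_output and output_matches are invented here, and a function stands in for the test script so that the example is self-contained.

```python
import contextlib
import io


def capture_output(func):
    """Run func and return everything it printed to standard output."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        func()
    return buffer.getvalue()


def output_matches(func, expected_text):
    """Compare the printed output of func with the expected text, line by line."""
    return capture_output(func).splitlines() == expected_text.splitlines()


def example_script():
    """Stand-in for a print-and-compare test script."""
    print("2 + 3 =", 2 + 3)
    print("2 * 3 =", 2 * 3)


if __name__ == "__main__":
    expected = "2 + 3 = 5\n2 * 3 = 6\n"
    print("output agrees:", output_matches(example_script, expected))
```

Because the comparison is purely textual, any change in what the script prints, even in formatting, counts as a failure.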

Currently, about half of the Biopython tests are unittest-style tests, and half are print-and-compare tests.

Running a simple print-and-compare test directly will usually give lots of output on screen, but does not check the output matches the expected output. If the test is failing with an exception error, it should be very easy to locate where exactly the script is failing. For an example of a print-and-compare test, try:

python test_SeqIO.py

The unittest-based tests instead show you exactly which sub-section(s) of the test are failing. For example,

python test_Cluster.py

21.1.1 Running the tests using Tox

Like most Python projects, you can also use Tox to run the tests on multiple Python versions, provided they are already installed in your system.

We do not provide the configuration tox.ini file in our code base because of difficulties pinning down user-specific settings (e.g. executable names of the Python versions). You may also be interested in testing Biopython only against a subset of the Python versions that we support.

If you are interested in using Tox, you may start with the example tox.ini shown below:

[tox]
envlist = py26, py27, pypy, py33, py34, jython

[testenv]
changedir = Tests
commands = {envpython} run_tests.py --offline
deps =
    numpy

Using the template above, executing tox will test your Biopython code against Python 2.6, Python 2.7, PyPy, Python 3.3, Python 3.4, and Jython. It assumes that those Pythons' executables are named accordingly: python2.6 for Python 2.6, and so on.

21.2 Writing tests
Let's say you want to write some tests for a module called Biospam. This can be a module you wrote, or an existing module that doesn't have any tests yet. In the examples below, we assume that Biospam is a module that does simple math.

Each Biopython test can have three important files and directories involved with it:

1. test_Biospam.py: The actual test code for your module.
2. Biospam [optional]: A directory where any necessary input files will be located. If you have any output files that should be manually reviewed, output them here (but this is discouraged) to prevent clogging up the main Tests directory. In general, use a temporary file/folder.
3. output/test_Biospam [for print-and-compare tests only]: This file contains the expected output from running test_Biospam.py. This file is not needed for unittest-style tests, since there the validation is done in the test script test_Biospam.py itself.

It's up to you to decide whether you want to write a print-and-compare test script or a unittest-style test script. The important thing is that you cannot mix these two styles in a single test script. Particularly, don't use unittest features in a print-and-compare test.

Any script with a test_ prefix in the Tests directory will be found and run by run_tests.py. Below, we show an example test script test_Biospam.py both for a print-and-compare test and for a unittest-based test. If you put this script in the Biopython Tests directory, then run_tests.py will find it and execute the tests contained in it:

$ python run_tests.py
test_Ace ... ok
test_AlignIO ... ok
test_BioSQL ... ok
test_BioSQL_SeqIO ... ok
test_Biospam ... ok
test_CAPS ... ok
test_Clustalw ... ok
...
----------------------------------------------------------------------
Ran 107 tests in 86.127 seconds

21.2.1 Writing a print-and-compare test
A print-and-compare style test should be much simpler for beginners or novices to write; essentially it is just an example script using your new module.

Here is what you should do to make a print-and-compare test for the Biospam module.

1. Write a script called test_Biospam.py
   This script should live in the Tests directory
   The script should test all of the important functionality of the module (the more you test the better your test is, of course!).
   Try to avoid anything which might be platform specific, such as printing floating point numbers without using an explicit formatting string to avoid having too many decimal places (different platforms can give very slightly different values).
2. If the script requires files to do the testing, these should go in the directory Tests/Biospam (if you just need something generic, like a FASTA sequence file, or a GenBank record, try and use an existing sample input file instead).
3. Write out the test output and verify the output to be correct.

   There are two ways to do this:

   a. The long way:
      Run the script and write its output to a file. On UNIX (including Linux and Mac OS X) machines, you would do something like: python test_Biospam.py > test_Biospam which would write the output to the file test_Biospam.
      Manually look at the file test_Biospam to make sure the output is correct. When you are sure it is all right and there are no bugs, you need to quickly edit the test_Biospam file so that the first line is: 'test_Biospam' (no quotes).
      Copy the test_Biospam file to the directory Tests/output
   b. The quick way:
      Run python run_tests.py -g test_Biospam.py. The regression testing framework is nifty enough that it'll put the output in the right place in just the way it likes it.
      Go to the output (which should be in Tests/output/test_Biospam) and double check the output to make sure it is all correct.
4. Now change to the Tests directory and run the regression tests with python run_tests.py. This will run all of the tests, and you should see your test run (and pass!).
5. That's it! Now you've got a nice test for your module ready to check in, or submit to Biopython. Congratulations!

As an example, the test_Biospam.py test script to test the addition and multiplication functions in the Biospam module could look as follows:

from __future__ import print_function
from Bio import Biospam

print("2 + 3 =", Biospam.addition(2, 3))
print("9 - 1 =", Biospam.addition(9, -1))
print("2 * 3 =", Biospam.multiplication(2, 3))
print("9 * (- 1) =", Biospam.multiplication(9, -1))

We generate the corresponding output with python run_tests.py -g test_Biospam.py, and check the output file output/test_Biospam:

test_Biospam
2 + 3 = 5
9 - 1 = 8
2 * 3 = 6
9 * (- 1) = -9

Often, the difficulty with larger print-and-compare tests is to keep track of which line in the output corresponds to which command in the test script. For this purpose, it is important to print out some markers to help you match lines in the input script with the generated output.
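A small sketch of both pieces of advice, marker lines and explicit float formatting, is given below. The function name report_lines and the particular marker wording are just one possible convention chosen for this illustration; plain arithmetic is used in place of the imaginary Biospam module so that the sketch runs on its own.

```python
from __future__ import print_function


def report_lines():
    """Build the lines a print-and-compare script would print.

    Each group of results is preceded by a marker line, and floats are
    written with an explicit format so that the text is the same on
    every platform.
    """
    lines = []
    lines.append("Testing division")                      # marker line
    lines.append("3 / 2 = %0.2f" % (3.0 / 2.0))
    lines.append("Testing a value with many decimals")    # marker line
    lines.append("1 / 3 = %0.4f" % (1.0 / 3.0))
    return lines


if __name__ == "__main__":
    for line in report_lines():
        print(line)
```

When a comparison fails, the nearest marker above the offending line tells you which part of the script produced it.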

21.2.2 Writing a unittest-based test

We want all the modules in Biopython to have unit tests, and a simple print-and-compare test is better than no test at all. However, although there is a steeper learning curve, using the unittest framework gives a more structured result, and if there is a test failure this can clearly pinpoint which part of the test is going wrong. The sub-tests can also be run individually which is helpful for testing or debugging.

The unittest framework has been included with Python since version 2.1, and is documented in the Python Library Reference (which I know you are keeping under your pillow, as recommended). There is also online documentation for unittest. If you are familiar with the unittest system (or something similar like the nose test framework), you shouldn't have any trouble. You may find looking at the existing examples within Biopython helpful too.

Here's a minimal unittest-style test script for Biospam, which you can copy and paste to get started:

import unittest
from Bio import Biospam

class BiospamTestAddition(unittest.TestCase):

    def test_addition1(self):
        result = Biospam.addition(2, 3)
        self.assertEqual(result, 5)

    def test_addition2(self):
        result = Biospam.addition(9, -1)
        self.assertEqual(result, 8)

class BiospamTestDivision(unittest.TestCase):

    def test_division1(self):
        result = Biospam.division(3.0, 2.0)
        self.assertAlmostEqual(result, 1.5)

    def test_division2(self):
        result = Biospam.division(10.0, -2.0)
        self.assertAlmostEqual(result, -5.0)

if __name__ == "__main__":
    runner = unittest.TextTestRunner(verbosity=2)
    unittest.main(testRunner=runner)

In the division tests, we use assertAlmostEqual instead of assertEqual to avoid tests failing due to roundoff errors; see the unittest chapter in the Python documentation for details and for other functionality available in unittest (online reference).

These are the key points of unittest-based tests:

Test cases are stored in classes that derive from unittest.TestCase and cover one basic aspect of your code
You can use methods setUp and tearDown for any repeated code which should be run before and after each test method. For example, the setUp method might be used to create an instance of the object you are testing, or open a file handle. The tearDown should do any "tidying up", for example closing the file handle.
The tests are prefixed with test_ and each test should cover one specific part of what you are trying to test. You can have as many tests as you want in a class.
At the end of the test script, you can use

if __name__ == "__main__":
    runner = unittest.TextTestRunner(verbosity=2)
    unittest.main(testRunner=runner)

to execute the tests when the script is run by itself (rather than imported from run_tests.py). If you run this script, then you'll see something like the following:

$ python test_Biospam.py
test_addition1 (__main__.BiospamTestAddition) ... ok
test_addition2 (__main__.BiospamTestAddition) ... ok
test_division1 (__main__.BiospamTestDivision) ... ok
test_division2 (__main__.BiospamTestDivision) ... ok

----------------------------------------------------------------------
Ran 4 tests in 0.059s

OK

To indicate more clearly what each test is doing, you can add docstrings to each test. These are shown when running the tests, which can be useful information if a test is failing.

import unittest
from Bio import Biospam

class BiospamTestAddition(unittest.TestCase):

    def test_addition1(self):
        """An addition test"""
        result = Biospam.addition(2, 3)
        self.assertEqual(result, 5)

    def test_addition2(self):
        """A second addition test"""
        result = Biospam.addition(9, -1)
        self.assertEqual(result, 8)

class BiospamTestDivision(unittest.TestCase):

    def test_division1(self):
        """Now let's check division"""
        result = Biospam.division(3.0, 2.0)
        self.assertAlmostEqual(result, 1.5)

    def test_division2(self):
        """A second division test"""
        result = Biospam.division(10.0, -2.0)
        self.assertAlmostEqual(result, -5.0)

if __name__ == "__main__":
    runner = unittest.TextTestRunner(verbosity=2)
    unittest.main(testRunner=runner)

Running the script will now show you:

$ python test_Biospam.py
An addition test ... ok
A second addition test ... ok
Now let's check division ... ok
A second division test ... ok
Ran 4 tests in 0.001s

OK

If your module contains docstring tests (see section 21.3), you may want to include those in the tests to be run. You can do so as follows by modifying the code under if __name__ == "__main__": to look like this:

# These two imports are needed at the top of the test script:
import doctest
import sys

if __name__ == "__main__":
    unittest_suite = unittest.TestLoader().loadTestsFromName("test_Biospam")
    doctest_suite = doctest.DocTestSuite(Biospam)
    suite = unittest.TestSuite((unittest_suite, doctest_suite))
    runner = unittest.TextTestRunner(sys.stdout, verbosity=2)
    runner.run(suite)

This is only relevant if you want to run the docstring tests when you execute python test_Biospam.py if it has some complex run-time dependency checking.

In general, include the docstring tests by adding them to run_tests.py instead, as explained below.
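The setUp and tearDown methods mentioned in the key points above can be sketched as follows. The class name TemporaryFastaTest, the helper run_suite and the toy file contents are invented for this illustration; following the earlier advice, a temporary file is used so that nothing is left behind in the Tests directory.

```python
import os
import tempfile
import unittest


class TemporaryFastaTest(unittest.TestCase):
    """Each test gets its own freshly written temporary FASTA file."""

    def setUp(self):
        # Runs before every test method: create the input file.
        handle = tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False)
        handle.write(">example\nACGT\n")
        handle.close()
        self.filename = handle.name

    def tearDown(self):
        # Runs after every test method: tidy up the file again.
        os.remove(self.filename)

    def test_header_line(self):
        """The first line is the FASTA header"""
        with open(self.filename) as handle:
            self.assertEqual(handle.readline().strip(), ">example")

    def test_sequence_line(self):
        """The second line is the sequence"""
        with open(self.filename) as handle:
            lines = handle.read().splitlines()
        self.assertEqual(lines[1], "ACGT")


def run_suite():
    """Load and run the test case above, returning the unittest result object."""
    suite = unittest.TestLoader().loadTestsFromTestCase(TemporaryFastaTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)


if __name__ == "__main__":
    run_suite()
```

Because setUp and tearDown run around every test method, the two tests cannot interfere with each other through the shared file.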

21.3 Writing doctests
Python modules, classes and functions support built-in documentation using docstrings. The doctest framework (included with Python) allows the developer to embed working examples in the docstrings, and have these examples automatically tested.

Currently only a small part of Biopython includes doctests. The run_tests.py script takes care of running the doctests. For this purpose, at the top of the run_tests.py script is a manually compiled list of modules to test, which allows us to skip modules with optional external dependencies which may not be installed (e.g. the Reportlab and NumPy libraries). So, if you've added some doctests to the docstrings in a Biopython module, in order to have them included in the Biopython test suite, you must update run_tests.py to include your module. Currently, the relevant part of run_tests.py looks as follows:

# This is the list of modules containing docstring tests.
# If you develop docstring tests for other modules, please add
# those modules here.
DOCTEST_MODULES = ["Bio.Seq",
                   "Bio.SeqRecord",
                   "Bio.SeqIO",
                   "...",
                  ]
# Silently ignore any doctests for modules requiring numpy!
try:
    import numpy
    DOCTEST_MODULES.extend(["Bio.Statistics.lowess"])
except ImportError:
    pass

Note that we regard doctests primarily as documentation, so you should stick to typical usage. Generally complicated examples dealing with error conditions and the like would be best left to a dedicated unit test.

Note that if you want to write doctests involving file parsing, defining the file location complicates matters. Ideally use relative paths assuming the code will be run from the Tests directory, see the Bio.SeqIO doctests for an example of this.

To run the docstring tests only, use

$ python run_tests.py doctest

Note that the doctest system is fragile and care is needed to ensure your output will match on all the different versions of Python that Biopython supports (e.g. differences in floating point numbers).
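As a small illustration of a doctest that avoids the floating point problem, the sketch below formats its result explicitly inside the docstring examples. The function gc_fraction and the helper run_docstring_examples are written for this example only and are not Biopython functions; the helper uses the standard library's doctest.DocTestParser and doctest.DocTestRunner classes so that the examples can be run without going through run_tests.py.

```python
import doctest


def gc_fraction(sequence):
    """Return the fraction of G and C letters in a sequence string.

    Formatting the float explicitly keeps the expected output stable:

    >>> print("%0.2f" % gc_fraction("ACGT"))
    0.50
    >>> print("%0.2f" % gc_fraction("GGGA"))
    0.75
    """
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / float(len(sequence))


def run_docstring_examples(func):
    """Run the examples in func's docstring; return (failed, attempted)."""
    parser = doctest.DocTestParser()
    test = parser.get_doctest(func.__doc__, {func.__name__: func},
                              func.__name__, None, 0)
    runner = doctest.DocTestRunner(verbose=False)
    results = runner.run(test)
    return results.failed, results.attempted


if __name__ == "__main__":
    failed, attempted = run_docstring_examples(gc_fraction)
    print("%i of %i examples failed" % (failed, attempted))
```

Printing the bare float instead of the formatted string would tie the expected output to one particular representation of the number.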

21.4 Writing doctests in the Tutorial
This Tutorial you are reading has a lot of code snippets, which are often formatted like a doctest. We have our own system in file test_Tutorial.py to allow tagging code snippets in the Tutorial source to be run as Python doctests. This works by adding special %doctest comment lines before each verbatim block, e.g.

%doctest
\begin{verbatim}
>>> from Bio.Alphabet import generic_dna
>>> from Bio.Seq import Seq
>>> len("ACGT")
4
\end{verbatim}

Often code examples are not self-contained, but continue from the previous verbatim block. Here we use the magic comment %cont-doctest as shown here:

%cont-doctest
\begin{verbatim}
>>> Seq("ACGT") == Seq("ACGT", generic_dna)
True
\end{verbatim}

The special %doctest comment line can take a working directory (relative to the Doc/ folder) to use if you have any example data files, e.g. %doctest examples will use the Doc/examples folder, while %doctest ../Tests/GenBank will use the Tests/GenBank folder.

After the directory argument, you can specify any Python dependencies which must be present in order to run the test by adding lib:XXX to indicate import XXX must work, e.g. %doctest examples lib:numpy

You can run the Tutorial doctests via:

$ python test_Tutorial.py

or:

$ python run_tests.py test_Tutorial.py

Chapter 22 Advanced
22.1 Parser Design
Many of the older Biopython parsers were built around an event-oriented design that includes Scanner and Consumer objects.

Scanners take input from a data source and analyze it line by line, sending off an event whenever they recognize some information in the data. For example, if the data includes information about an organism name, the scanner may generate an organism_name event whenever it encounters a line containing the name.

Consumers are objects that receive the events generated by Scanners. Following the previous example, the consumer receives the organism_name event, and then processes it in whatever manner necessary in the current application.

This is a very flexible framework, which is advantageous if you want to be able to parse a file format into more than one representation. For example, the Bio.GenBank module uses this to construct either SeqRecord objects or file-format-specific record objects.
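The Scanner/Consumer idea can be sketched in a few lines of plain Python. All class names below (OrganismScanner, NameListConsumer, CountingConsumer) and the toy input format are hypothetical; they only illustrate the pattern and are not the actual Bio.GenBank classes.

```python
from io import StringIO


class OrganismScanner(object):
    """Hypothetical scanner: reads a handle line by line and emits events."""

    def feed(self, handle, consumer):
        for line in handle:
            stripped = line.strip()
            if stripped.startswith("ORGANISM"):
                # Fire the organism_name event with the text after the keyword.
                consumer.organism_name(stripped[len("ORGANISM"):].strip())


class NameListConsumer(object):
    """Consumer that keeps every organism name it is told about."""

    def __init__(self):
        self.names = []

    def organism_name(self, text):
        self.names.append(text)


class CountingConsumer(object):
    """Consumer that builds a different representation: just a count."""

    def __init__(self):
        self.count = 0

    def organism_name(self, text):
        self.count += 1


data = u"LOCUS  X1\n  ORGANISM  Homo sapiens\nLOCUS  X2\n  ORGANISM  Mus musculus\n"
scanner = OrganismScanner()

names = NameListConsumer()
scanner.feed(StringIO(data), names)

counter = CountingConsumer()
scanner.feed(StringIO(data), counter)
```

The same scanner drives both consumers, which is the point of the design: the parsing logic is written once, and each consumer decides what representation to build from the events.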

More recently, many of the parsers added for Bio.SeqIO and Bio.AlignIO take a much simpler approach, but only generate a single object representation (SeqRecord and MultipleSeqAlignment objects respectively). In some cases the Bio.SeqIO parsers actually wrap another Biopython parser. For example, the Bio.SwissProt parser produces SwissProt format specific record objects, which get converted into SeqRecord objects.

22.2 Substitution Matrices
22.2.1 SubsMat
This module provides a class and a few routines for generating substitution matrices, similar to BLOSUM or PAM matrices, but based on user-provided data. Additionally, you may select a matrix from MatrixInfo.py, a collection of established substitution matrices. The SeqMat class derives from a dictionary:

class SeqMat(dict)

The dictionary is of the form {(i1,j1):n1, (i1,j2):n2, ..., (ik,jk):nk} where i, j are alphabet letters, and n is a value.

1. Attributes
   a. self.alphabet: a class as defined in Bio.Alphabet
   b. self.ab_list: a list of the alphabet's letters, sorted. Needed mainly for internal purposes
2. Methods
   a. __init__(self, data=None, alphabet=None, mat_name='', build_later=0):
      i. data: can be either a dictionary, or another SeqMat instance.
      ii. alphabet: a Bio.Alphabet instance. If not provided, construct an alphabet from data.
      iii. mat_name: matrix name, such as "BLOSUM62" or "PAM250"
      iv. build_later: default false. If true, the user may supply only an alphabet and an empty dictionary, if intending to build the matrix later. This skips the sanity check of alphabet size vs. matrix size.
   b. entropy(self, obs_freq_mat)
      i. obs_freq_mat: an observed frequency matrix. Returns the matrix's entropy, based on the frequency in obs_freq_mat. The matrix instance should be LO or SUBS.
   c. sum(self)
      Calculates the sum of values for each letter in the matrix's alphabet, and returns it as a dictionary of the form {i1: s1, i2: s2, ..., in: sn}, where:
      i: an alphabet letter
      s: sum of all values in a half-matrix for that letter
      n: number of letters in alphabet.
   d. print_mat(self, f, format="%4d", bottomformat="%4s", alphabet=None)

      Prints the matrix to file handle f. format is the format field for the matrix values; bottomformat is the format field for the bottom row, containing matrix letters. Example output for a 3-letter alphabet matrix:

      A  23
      B  12  34
      C   7  22  27
          A   B   C

      The alphabet optional argument is a string of all characters in the alphabet. If supplied, the order of letters along the axes is taken from the string, rather than by alphabetical order.

3. Usage

   The following section is laid out in the order by which most people wish to generate a log-odds matrix. Of course, interim matrices can be generated and investigated. Most people just want a log-odds matrix, that's all.

   a. Generating an Accepted Replacement Matrix

      Initially, you should generate an accepted replacement matrix (ARM) from your data. The values in ARM are the counted number of replacements according to your data. The data could be a set of pairs or multiple alignments. So for instance if Alanine was replaced by Cysteine 10 times, and Cysteine by Alanine 12 times, the corresponding ARM entries would be:

      ('A','C'): 10, ('C','A'): 12

      As order doesn't matter, the user can instead provide only one entry:

      ('A','C'): 22

      A SeqMat instance may be initialized with either a full (first method of counting: 10, 12) or half (the latter method, 22) matrix. A full protein alphabet matrix would be of the size 20x20 = 400. A half matrix of that alphabet would be 20x20/2 + 20/2 = 210. That is because same-letter entries (the matrix diagonal) don't change. Given an alphabet size of N:

      i. Full matrix size: N*N
      ii. Half matrix size: N(N+1)/2

      The SeqMat constructor automatically generates a half-matrix, if a full matrix is passed. If a half matrix is passed, letters in the key should be provided in alphabetical order: ('A','C') and not ('C','A').
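The relationship between full and half matrices can be illustrated with plain dictionaries. This is just a sketch of the idea; to_half_matrix and half_matrix_size are hypothetical helper names, not SubsMat functions.

```python
def to_half_matrix(full):
    """Fold a full count matrix into a half matrix with sorted keys."""
    half = {}
    for (a, b), count in full.items():
        key = tuple(sorted((a, b)))  # ('C', 'A') becomes ('A', 'C')
        half[key] = half.get(key, 0) + count
    return half


def half_matrix_size(n):
    """Number of entries in a half matrix for an alphabet of n letters."""
    return n * (n + 1) // 2


arm_full = {("A", "C"): 10, ("C", "A"): 12, ("A", "A"): 30}
arm_half = to_half_matrix(arm_full)
```

Here the two off-diagonal entries (10 and 12) are merged into a single ('A', 'C') entry of 22, while the diagonal entry is left unchanged. For a 20-letter protein alphabet, half_matrix_size(20) gives the 210 entries mentioned above.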

      At this point, if all you wish to do is generate a log-odds matrix, please go to the section titled Example of Use. The following text describes the nitty-gritty of internal functions, to be used by people who wish to investigate their nucleotide/amino-acid frequency data more thoroughly.

   b. Generating the observed frequency matrix (OFM)

      Use:
      OFM = SubsMat._build_obs_freq_mat(ARM)

      The OFM is generated from the ARM, only instead of replacement counts, it contains replacement frequencies.
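Conceptually, turning counts into frequencies means dividing each entry by the total number of counted replacements. A minimal sketch with a plain dictionary follows; build_obs_freq is a hypothetical stand-in for illustration, not the SubsMat function itself.

```python
def build_obs_freq(arm):
    """Convert a half matrix of replacement counts into frequencies."""
    total = float(sum(arm.values()))
    return {pair: count / total for pair, count in arm.items()}


# A toy accepted replacement matrix (half matrix) for a 2-letter alphabet.
arm = {("A", "A"): 30, ("A", "C"): 22, ("C", "C"): 48}
ofm = build_obs_freq(arm)
```

Because every entry is divided by the same total, the observed frequencies sum to 1.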

   c. Generating an expected frequency matrix (EFM)

      Use:
      EFM = SubsMat._build_exp_freq_mat(OFM, exp_freq_table)

      i. exp_freq_table: should be a FreqTable instance. See section 22.2.2 for detailed information on FreqTable. Briefly, the expected frequency table has the frequencies of appearance for each member of the alphabet. It is implemented as a dictionary with the alphabet letters as keys, and each letter's frequency as a value. Values sum to 1.

      The expected frequency table can (and generally should) be generated from the observed frequency matrix. So in most cases you will generate exp_freq_table using:

      >>> exp_freq_table = SubsMat._exp_freq_table_from_obs_freq(OFM)
      >>> EFM = SubsMat._build_exp_freq_mat(OFM, exp_freq_table)

      But you can supply your own exp_freq_table, if you wish.
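For readers who want to see the arithmetic, here is one way the two steps could be done with plain dictionaries. The conventions used are assumptions on our part: each off-diagonal observed frequency is split evenly between its two letters, and the expected pair frequency is p_i * p_i on the diagonal and 2 * p_i * p_j off the diagonal (since a half matrix folds both orders into one entry). The function names are hypothetical and we do not claim they match the private SubsMat functions line for line.

```python
def exp_freq_table_from_obs(ofm):
    """Derive per-letter frequencies from a half matrix of pair frequencies."""
    table = {}
    for (a, b), freq in ofm.items():
        if a == b:
            table[a] = table.get(a, 0.0) + freq
        else:
            # Assumption: an off-diagonal pair contributes half to each letter.
            table[a] = table.get(a, 0.0) + freq / 2.0
            table[b] = table.get(b, 0.0) + freq / 2.0
    return table


def build_exp_freq(ofm, table):
    """Expected pair frequencies if letters were paired independently."""
    efm = {}
    for (a, b) in ofm:
        if a == b:
            efm[(a, b)] = table[a] ** 2
        else:
            # Factor 2: the half matrix merges (a, b) and (b, a).
            efm[(a, b)] = 2.0 * table[a] * table[b]
    return efm


ofm = {("A", "A"): 0.30, ("A", "C"): 0.22, ("C", "C"): 0.48}
table = exp_freq_table_from_obs(ofm)
efm = build_exp_freq(ofm, table)
```

With these toy numbers the letter frequencies come out as 0.41 for A and 0.59 for C, and both the table and the expected matrix sum to 1.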

   d. Generating a substitution frequency matrix (SFM)

      Use:

      SFM = SubsMat._build_subs_mat(OFM, EFM)

      Accepts an OFM and an EFM. Returns the quotient of the corresponding values (observed divided by expected).

   e. Generating a log-odds matrix (LOM)

      Use:

      LOM = SubsMat._build_log_odds_mat(SFM[, logbase=10, factor=10.0, round_digit=1])

      i. Accepts an SFM.
      ii. logbase: base of the logarithm used to generate the log-odds values.
      iii. factor: factor used to multiply the log-odds values. Each entry is generated by log(SFM[key]) * factor, and rounded to the round_digit place after the decimal point, if required.
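The last two steps are simple element-wise operations, sketched here with plain dictionaries. The helper names build_subs_mat and build_log_odds are hypothetical and only mirror the description above; the parameter names logbase, factor and round_digit are taken from the text.

```python
import math


def build_subs_mat(ofm, efm):
    """Observed frequency divided by expected frequency, entry by entry."""
    return {pair: ofm[pair] / efm[pair] for pair in ofm}


def build_log_odds(sfm, logbase=10, factor=10.0, round_digit=1):
    """Scaled, rounded logarithm of each substitution frequency."""
    lom = {}
    for pair, ratio in sfm.items():
        value = factor * math.log(ratio) / math.log(logbase)
        lom[pair] = round(value, round_digit)
    return lom


# Toy substitution frequencies: 10x more, 10x less, and exactly as expected.
sfm = {("A", "A"): 10.0, ("A", "C"): 0.1, ("C", "C"): 1.0}
lom = build_log_odds(sfm)
```

A pair seen ten times more often than expected scores 10.0, one seen ten times less often scores -10.0, and one seen exactly as often as expected scores 0.0.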
4. Example of use

   As most people would want to generate a log-odds matrix, with minimum hassle, SubsMat provides one function which does it all:

   make_log_odds_matrix(acc_rep_mat, exp_freq_table=None, logbase=10,
                        factor=10.0, round_digit=0):

   a. acc_rep_mat: user-provided accepted replacements matrix
   b. exp_freq_table: expected frequencies table. Used if provided; if not, generated from the acc_rep_mat.
   c. logbase: base of logarithm for the log-odds matrix. Default base 10.
   d. round_digit: number of digits after the decimal point to which the result should be rounded. Default zero.

22.2.2 FreqTable
FreqTable.FreqTable(UserDict.UserDict)

1. Attributes:
   a. alphabet: A Bio.Alphabet instance.
   b. data: frequency dictionary
   c. count: count dictionary (in case counts are provided).
2. Functions:
   a. read_count(f): read a count file from stream f. Then convert to frequencies.
   b. read_freq(f): read a frequency data file from stream f. Of course, we then don't have the counts, but it is usually the letter frequencies which are interesting.
3. Example of use: The expected count of the residues in the database is sitting in a file, whitespace delimited, in the following format (example given for a 3-letter alphabet):

   A 35
   B 65
   C 100

   And will be read using the FreqTable.read_count(file_handle) function.

   An equivalent frequency file:

   A 0.175
   B 0.325
   C 0.5

   Alternatively, the residue frequencies or counts can be passed as a dictionary. Example of a count dictionary (3-letter alphabet):

   {'A': 35, 'B': 65, 'C': 100}

   Which means that an expected data count would give a 0.5 frequency for 'C', a 0.325 probability of 'B' and a 0.175 probability of 'A' (out of 200 total, the sum of A, B and C).

   A frequency dictionary for the same data would be:

   {'A': 0.175, 'B': 0.325, 'C': 0.5}

   Summing up to 1.
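The conversion from the count file to the frequency dictionary is easy to check by hand. The sketch below uses only the standard library; read_counts and counts_to_freqs are hypothetical helpers written for this illustration, not the FreqTable functions themselves.

```python
from io import StringIO


def read_counts(handle):
    """Parse whitespace-delimited 'letter count' lines into a dictionary."""
    counts = {}
    for line in handle:
        fields = line.split()
        if len(fields) == 2:
            counts[fields[0]] = int(fields[1])
    return counts


def counts_to_freqs(counts):
    """Divide each count by the total so that the values sum to 1."""
    total = float(sum(counts.values()))
    return {letter: n / total for letter, n in counts.items()}


# The 3-letter example from the text, held in memory instead of a file.
count_file = StringIO(u"A 35\nB 65\nC 100\n")
counts = read_counts(count_file)
freqs = counts_to_freqs(counts)
```

The total count is 200, so the resulting frequencies are 35/200, 65/200 and 100/200, matching the frequency dictionary shown above.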

   When passing a dictionary as an argument, you should indicate whether it is a count or a frequency dictionary. Therefore the FreqTable class constructor requires two arguments: the dictionary itself, and FreqTable.COUNT or FreqTable.FREQ indicating counts or frequencies, respectively.

   Read expected counts. read_count will already generate the frequencies. Any one of the following may be done to generate the frequency table (ftab):

   >>> from Bio.SubsMat import FreqTable
   >>> ftab = FreqTable.FreqTable(my_frequency_dictionary, FreqTable.FREQ)
   >>> ftab = FreqTable.FreqTable(my_count_dictionary, FreqTable.COUNT)
   >>> ftab = FreqTable.read_count(open('myCountFile'))
   >>> ftab = FreqTable.read_freq(open('myFrequencyFile'))

Chapter 23 Where to go from here: contributing to Biopython
23.1 Bug Reports + Feature Requests
Getting feedback on the Biopython modules is very important to us. Open-source projects like this benefit greatly from feedback, bug reports (and patches!) from a wide variety of contributors.

The main forums for discussing feature requests and potential bugs are the Biopython mailing lists:

biopython@biopython.org: An unmoderated list for discussion of anything to do with Biopython.
biopython-dev@biopython.org: A more development-oriented list that is mainly used by developers (but anyone is free to contribute!).

Additionally, if you think you've found a new bug, you can submit it to our issue tracker at https://github.com/biopython/biopython/issues (this has replaced the older tracker hosted at http://redmine.open-bio.org/projects/biopython). This way, it won't get buried in anyone's Inbox and forgotten about.

23.2 Mailing lists and helping newcomers
We encourage all our users to sign up to the main Biopython mailing list. Once you've got the hang of an area of Biopython, we'd encourage you to help answer questions from beginners. After all, you were a beginner once.

23.3 Contributing Documentation
We're happy to take feedback or contributions either via a bug report or on the Mailing List. While reading this tutorial, perhaps you noticed some topics you were interested in which were missing, or not clearly explained. There is also Biopython's built-in documentation (the docstrings, these are also online), where again, you may be able to help fill in any blanks.

23.4 Contributing cookbook examples
As explained in Chapter 20, Biopython now has a wiki collection of user-contributed cookbook examples, http://biopython.org/wiki/Category:Cookbook. Maybe you can add to this?

23.5 Maintaining a distribution for a platform
We currently provide source code archives (suitable for any OS, if you have the right build tools installed), and Windows Installers which are just click-and-run. This covers all the major operating systems.

Most major Linux distributions have volunteers who take these source code releases, and compile them into packages for Linux users to easily install (taking care of dependencies etc). This is really great and we are of course very grateful. If you would like to contribute to this work, please find out more about how your Linux distribution handles this.

Below are some tips for certain platforms to maybe get people started with helping out:

Windows
Windows products typically have a nice graphical installer that installs all of the essential components in the right place. We use Distutils to create an installer of this type fairly easily.

You must first make sure you have a C compiler on your Windows computer, and that you can compile and install things (this is the hard bit; see the Biopython installation instructions for info on how to do this).

Once you are set up with a C compiler, making the installer just requires doing:
python setup.py bdist_wininst

Now you've got a Windows installer. Congrats! At the moment we have no trouble shipping installers built on 32-bit Windows. If anyone would like to look into supporting 64-bit Windows that would be great.

RPMs
RPMs are pretty popular package systems on some Linux platforms. There is lots of documentation on RPMs available at http://www.rpm.org to help you get started with them. To create an RPM for your platform is really easy. You just need to be able to build the package from source (having a C compiler that works is thus essential); see the Biopython installation instructions for more info on this.

To make the RPM, you just need to do:
python setup.py bdist_rpm

This will create an RPM for your specific platform and a source RPM in the directory dist. This RPM should be good and ready to go, so this is all you need to do! Nice and easy.

Macintosh
Since Apple moved to Mac OS X, things have become much easier on the Mac. We generally treat it as just another Unix variant, and installing Biopython from source is just as easy as on Linux. The easiest way to get all the GCC compilers etc installed is to install Apple's XCode. We might be able to provide click-and-run installers for Mac OS X, but to date there hasn't been any demand.

Once you've got a package, please test it on your system to make sure it installs everything in a good way and seems to work properly. Once you feel good about it, send it off to one of the Biopython developers (write to our main mailing list at biopython@biopython.org if you're not sure who to send it to) and you've done it. Thanks!

23.6 Contributing Unit Tests
Even if you don't have any new functionality to add to Biopython, but you want to write some code, please consider extending our unit test coverage. We've devoted all of Chapter 21 to this topic.

23.7 Contributing Code
There are no barriers to joining Biopython code development other than an interest in creating biology-related code in Python. The best place to express an interest is on the Biopython mailing lists; just let us know you are interested in coding and what kind of stuff you want to work on. Normally, we try to have some discussion on modules before coding them, since that helps generate good ideas; then just feel free to jump right in and start coding!

The main Biopython release tries to be fairly uniform and interworkable, to make it easier for users. You can read about some of the (fairly informal) coding style guidelines we try to use in Biopython in the contributing documentation at http://biopython.org/wiki/Contributing. We also try to add code to the distribution along with tests (see Chapter 21 for more info on the regression testing framework) and documentation, so that everything can stay as workable and well documented as possible (including docstrings). This is, of course, the ideal situation; in many situations you'll be able to find other people on the list who will be willing to help add documentation or more tests for your code once you make it available. So, to end this paragraph like the last, feel free to start working!

Please note that to make a code contribution you must have the legal right to contribute it and license it under the Biopython license. If you wrote it all yourself, and it is not based on any other code, this shouldn't be a problem. However, there are issues if you want to contribute a derivative work; for example, something based on GPL or LGPL licensed code would not be compatible with our license. If you have any queries on this, please discuss the issue on the biopython-dev mailing list.

Another point of concern for any additions to Biopython regards any build-time or run-time dependencies. Generally speaking, writing code to interact with a standalone tool (like BLAST, EMBOSS or ClustalW) doesn't present a big problem. However, any dependency on another library, even a Python library (especially one needed in order to compile and install Biopython like NumPy), would need further discussion.

Additionally, if you have code that you don't think fits in the distribution, but that you want to make available, we maintain Script Central (http://biopython.org/wiki/Scriptcentral) which has pointers to freely available code in Python for bioinformatics.

Hopefully this documentation has got you excited enough about Biopython to try it out (and most importantly, contribute!). Thanks for reading all the way through!

Chapter 24 Appendix: Useful stuff about Python
If you haven't spent a lot of time programming in Python, many questions and problems that come up in using Biopython are often related to Python itself. This section tries to present some ideas and code that come up often (at least for us!) while using the Biopython libraries. If you have any suggestions for useful pointers that could go here, please contribute!

24.1 What the heck is a handle?
Handles are mentioned quite frequently throughout this documentation, and are also fairly confusing (at least to me!). Basically, you can think of a handle as being a wrapper around text information.

Handles provide (at least) two benefits over plain text information:

1. They provide a standard way to deal with information stored in different ways. The text information can be in a file, or in a string stored in memory, or the output from a command line program, or at some remote website, but the handle provides a common way of dealing with information in all of these formats.
2. They allow text information to be read incrementally, instead of all at once. This is really important when you are dealing with huge text files which would use up all of your memory if you had to load them all.

Handles can deal with text information that is being read (e.g. reading from a file) or written (e.g. writing information to a file). In the case of a read handle, commonly used functions are read(), which reads the entire text information from the handle, and readline(), which reads information one line at a time. For write handles, the function write() is regularly used.
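As a small illustration of these three functions, the following sketch writes a two-line file and then reads it back. The file name demo.txt and the temporary directory are arbitrary choices made for this example.

```python
import os
import tempfile

# Use a scratch directory so the example does not touch your own files.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# A write handle: write() sends text to the file.
handle = open(path, "w")
handle.write("first line\nsecond line\n")
handle.close()

# A read handle: readline() returns one line at a time...
handle = open(path, "r")
first = handle.readline()
# ...and read() returns whatever text is left after that.
rest = handle.read()
handle.close()
```

Note that read() only returns the second line here, because the handle remembers that the first line has already been consumed.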

The most common usage for handles is reading information from a file, which is done using the built-in Python function open. Here, we open a handle to the file m_cold.fasta, which you can download or find included in the Biopython source code as Doc/examples/m_cold.fasta.

>>> handle = open("m_cold.fasta", "r")
>>> handle.readline()
">gi|8332116|gb|BE037100.1|BE037100 MP14H09 MP Mesembryanthemum ...\n"

Handles are regularly used in Biopython for passing information to parsers. Note that since Biopython 1.54 the main functions in Bio.SeqIO and Bio.AlignIO have allowed you to use a filename instead of a handle:

from Bio import SeqIO
for record in SeqIO.parse("m_cold.fasta", "fasta"):
    print(record.id, len(record))

On older versions of Biopython you had to use a handle, e.g.

from Bio import SeqIO
handle = open("m_cold.fasta", "r")
for record in SeqIO.parse(handle, "fasta"):
    print(record.id, len(record))
handle.close()

This pattern is still useful; for example, suppose you have a gzip-compressed FASTA file you want to parse:

import gzip
from Bio import SeqIO
handle = gzip.open("m_cold.fasta.gz", "rt")
for record in SeqIO.parse(handle, "fasta"):
    print(record.id, len(record))
handle.close()

With our parsers for plain text files, under Python 3 it is essential to use gzip in text mode.
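To see why the mode matters, here is a small standard-library sketch (assuming Python 3) that does not need Biopython at all. It writes a tiny gzip-compressed FASTA file and reads it back in both text and binary mode; the file name demo.fasta.gz and the two toy records are made up for this example.

```python
import gzip
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "demo.fasta.gz")

# Write a small compressed FASTA file in text mode.
with gzip.open(path, "wt") as out_handle:
    out_handle.write(">seq1\nACGT\n>seq2\nGGCC\n")

# Text mode ("rt") yields str lines, which is what a plain text parser expects.
with gzip.open(path, "rt") as handle:
    text_ids = [line[1:].strip() for line in handle if line.startswith(">")]

# Binary mode ("rb") yields bytes, which a plain text parser cannot use directly.
with gzip.open(path, "rb") as handle:
    first_raw = handle.readline()
```

The with statement closes each handle automatically, so no explicit handle.close() call is needed.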

See Section 5.2 for more examples like this, including reading bzip2-compressed files.

24.1.1 Creating a handle from a string

One useful thing is to be able to turn information contained in a string into a handle. The following example shows how to do this using StringIO from the Python standard library (on Python 3, use from io import StringIO instead):

>>> my_info = 'A string\nwith multiple lines.'
>>> print(my_info)
A string
with multiple lines.
>>> from StringIO import StringIO
>>> my_info_handle = StringIO(my_info)
>>> first_line = my_info_handle.readline()
>>> print(first_line)
A string
<BLANKLINE>
>>> second_line = my_info_handle.readline()
>>> print(second_line)
with multiple lines.
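On Python 3 the same idea works with io.StringIO. The short sketch below also shows that such a handle can be iterated over line by line and rewound with seek(), just like a file handle; the toy FASTA text is made up for this example.

```python
from io import StringIO

# Wrap an in-memory string so it can be used wherever a handle is expected.
handle = StringIO(u">seq1\nACGT\n>seq2\nGGCC\n")

# Iterating over a handle yields one line at a time.
lines = [line.rstrip("\n") for line in handle]

# seek(0) rewinds to the start so the text can be read again.
handle.seek(0)
first = handle.readline()
```

This is handy for testing a parser on a small snippet of text without creating a file on disk.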

References
[1]
PeterJ.A.Cock,TiagoAntao,JeffreyT.Chang,BradA.Chapman,CymonJ.Cox,AndrewDalke,IddoFriedberg,ThomasHamelryck,FrankKauff,Bartek
Wilczynski,MichielJ.L.deHoon:Biopython:freelyavailablePythontoolsforcomputationalmolecularbiologyandbioinformatics.Bioinformatics25(11),
14221423(2009).doi:10.1093/bioinformatics/btp163,
[2]
LeightonPritchard,JenniferA.White,PaulR.J.Birch,IanK.Toth:GenomeDiagram:apythonpackageforthevisualizationoflargescalegenomicdata.
Bioinformatics22(5):616617(2006).doi:10.1093/bioinformatics/btk021,
[3]
IanK.Toth,LeightonPritchard,PaulR.J.Birch:Comparativegenomicsrevealswhatmakesanenterobacterialplantpathogen.AnnualReviewof
Phytopathology44:305336(2006).doi:10.1146/annurev.phyto.44.070505.143444,
[4]
GraldineA.vanderAuwera,JaroslawE.Krl,HaruoSuzuki,BrianFoster,RobvanHoudt,CelesteJ.Brown,MaxMergeay,EvaM.Top:Plasmidscapturedin
C.metalliduransCH34:definingthePromAfamilyofbroadhostrangeplasmids.AntonievanLeeuwenhoek96(2):193204(2009).doi:10.1007/s10482009
93169
[5]

http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc226 182/184
5/13/2017 BiopythonTutorialandCookbook
CarolineProux,DouwevanSinderen,JuanSuarez,PilarGarcia,VictorLadero,GeraldF.Fitzgerald,FrankDesiere,HaraldBrssow:Thedilemmaofphage
taxonomyillustratedbycomparativegenomicsofSfi21LikeSiphoviridaeinlacticacidbacteria.JournalofBacteriology184(21):60266036(2002).
http://dx.doi.org/10.1128/JB.184.21.60266036.2002
[6]
FlorianJupe,LeightonPritchard,GrahamJ.Etherington,KatrinMacKenzie,PeterJACock,FrankWright,SanjeevKumarSharma1,DanBolser,GlennJBryan,
JonathanDGJones,IngoHein:IdentificationandlocalisationoftheNBLRRgenefamilywithinthepotatogenome.BMCGenomics13:75(2012).
http://dx.doi.org/10.1186/147121641375
[7]
PeterJ.A.Cock,ChristopherJ.Fields,NaohisaGoto,MichaelL.Heuer,PeterM.Rice:TheSangerFASTQfileformatforsequenceswithqualityscores,andthe
Solexa/IlluminaFASTQvariants.NucleicAcidsResearch38(6):17671771(2010).doi:10.1093/nar/gkp1137
[8]
PatrickO.Brown,DavidBotstein:ExploringthenewworldofthegenomewithDNAmicroarrays.NatureGenetics21(Supplement1),3337(1999).
doi:10.1038/4462
[9]
EricTalevich,BrandonM.Invergo,PeterJ.A.Cock,BradA.Chapman:Bio.Phylo:Aunifiedtoolkitforprocessing,analyzingandvisualizingphylogenetictrees
inBiopython.BMCBioinformatics13:209(2012).doi:10.1186/1471210513209
[10]
AthelCornishBowden:Nomenclatureforincompletelyspecifiedbasesinnucleicacidsequences:Recommendations1984.NucleicAcidsResearch13(9):
30213030(1985).doi:10.1093/nar/13.9.3021
[11]
DouglasR.Cavener:ComparisonoftheconsensussequenceflankingtranslationalstartsitesinDrosophilaandvertebrates.NucleicAcidsResearch15(4):1353
1361(1987).doi:10.1093/nar/15.4.1353
[12]
TimothyL.BaileyandCharlesElkan:Fittingamixturemodelbyexpectationmaximizationtodiscovermotifsinbiopolymers,ProceedingsoftheSecond
InternationalConferenceonIntelligentSystemsforMolecularBiology2836.AAAIPress,MenloPark,California(1994).
[13]
BradChapmanandJeffChang:Biopython:Pythontoolsforcomputationalbiology.ACMSIGBIONewsletter20(2):1519(August2000).
[14]
MichielJ.L.deHoon,SeiyaImoto,JohnNolan,SatoruMiyano:Opensourceclusteringsoftware.Bioinformatics20(9):14531454(2004).
doi:10.1093/bioinformatics/bth078
[15]
MichielB.Eisen,PaulT.Spellman,PatrickO.Brown,DavidBotstein:Clusteranalysisanddisplayofgenomewideexpressionpatterns.Proceedingsofthe
NationalAcademyofScienceUSA95(25):1486314868(1998).doi:10.1073/pnas.96.19.10943c
[16]
GeneH.Golub,ChristianReinsch:Singularvaluedecompositionandleastsquaressolutions.InHandbookforAutomaticComputation,2,(LinearAlgebra)(J.H.
WilkinsonandC.Reinsch,eds),134151.NewYork:SpringerVerlag(1971).
[17]
GeneH.Golub,CharlesF.VanLoan:Matrixcomputations,2ndedition(1989).
[18]
ThomasHamelryckandBernardManderick:11PDBparserandstructureclassimplementedinPython.Bioinformatics,19(17):23082310(2003)doi:
10.1093/bioinformatics/btg299.
[19]
ThomasHamelryck:Efficientidentificationofsidechainpatternsusingamultidimensionalindextree.Proteins51(1):96108(2003).doi:10.1002/prot.10338
[20]
ThomasHamelryck:AnaminoacidhastwosidesAnew2Dmeasureprovidesadifferentviewofsolventexposure.Proteins59(1):2948(2005).
doi:10.1002/prot.20379.
[21]
JohnA.Hartiga.Clusteringalgorithms.NewYork:Wiley(1975).
[22]
AnilL.Jain,RichardC.Dubes:Algorithmsforclusteringdata.EnglewoodCliffs,N.J.:PrenticeHall(1988).
[23]
VoratasKachitvichyanukul,BruceW.Schmeiser:BinomialRandomVariateGeneration.CommunicationsoftheACM31(2):216222(1988).
doi:10.1145/42372.42381
[24]
TeuvoKohonen:Selforganizingmaps,2ndEdition.BerlinNewYork:SpringerVerlag(1997).
[25]
PierreLEcuyer:EfficientandPortableCombinedRandomNumberGenerators.CommunicationsoftheACM31(6):742749,774(1988).
doi:10.1145/62959.62969
[26]
IndraneelMajumdar,S.SriKrishna,NickV.Grishin:PALSSE:Aprogramtodelineatelinearsecondarystructuralelementsfromproteinstructures.BMC
Bioinformatics,6:202(2005).doi:10.1186/147121056202.
[27]
V.Matys,E.Fricke,R.Geffers,E.Gssling,M.Haubrock,R.Hehl,K.Hornischer,D.Karas,A.E.Kel,O.V.KelMargoulis,D.U.Kloos,S.Land,B.Lewicki
Potapov,H.Michael,R.Mnch,I.Reuter,S.Rotert,H.Saxel,M.Scheer,S.Thiele,E.WingenderE:TRANSFAC:transcriptionalregulation,frompatternsto
profiles.NucleicAcidsResearch31(1):374378(2003).doi:10.1093/nar/gkg108
[28]
RobinSibson:SLINK:Anoptimallyefficientalgorithmforthesinglelinkclustermethod.TheComputerJournal16(1):3034(1973).
doi:10.1093/comjnl/16.1.30
[29]
GeorgeW.Snedecor,WilliamG.Cochran:Statisticalmethods.Ames,Iowa:IowaStateUniversityPress(1989).
[30]
PabloTamayo,DonnaSlonim,JillMesirov,QingZhu,SutisakKitareewan,EthanDmitrovsky,EricS.Lander,ToddR.Golub:Interpretingpatternsofgene
expressionwithselforganizingmaps:Methodsandapplicationtohematopoieticdifferentiation.ProceedingsoftheNationalAcademyofScienceUSA96(6):
29072912(1999).doi:10.1073/pnas.96.6.2907
[31]
RobertC.Tryon,DanielE.Bailey:Clusteranalysis.NewYork:McGrawHill(1970).
[32]
JohnW.Tukey:Exploratorydataanalysis.Reading,Mass.:AddisonWesleyPub.Co.(1977).
[33]
KaYeeYeung,WalterL.Ruzzo:PrincipalComponentAnalysisforclusteringgeneexpressiondata.Bioinformatics17(9):763774(2001).
doi:10.1093/bioinformatics/17.9.763

http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc226 183/184
5/13/2017 BiopythonTutorialandCookbook
[34]
AlokSaldanha:JavaTreeviewextensiblevisualizationofmicroarraydata.Bioinformatics20(17):32463248(2004).
http://dx.doi.org/10.1093/bioinformatics/bth349

This document was translated from LaTeX by HEVEA.
