
Basic Models of Nucleotide Evolution

Over time, nucleotides within a sequence can evolve through substitution. This
process can cause a nucleotide (T, C, A or G) to change into another nucleotide and is
one of the main driving forces behind evolution. For example, the nucleotide A in a
sequence of DNA can change over time into the nucleotide C. This change may result in
this sequence of DNA becoming inactive if the sequence was previously involved in
protein synthesis as an exon, or may change the protein that the sequence codes for. As
proteins are the building blocks of organic life, this may cause large changes in an
organism's features. Alternatively, this change may have no effect at all. On average,
this form of mutation only occurs once or twice every million years. However, in
assessing the evolution of species over hundreds of millions of years, models are
useful in evaluating how one sequence of nucleotides may have evolved from
another.
Models of nucleotide evolution can be used when examining two sequences of DNA
of the same length that may be related. This type of model would be used to
compare the two sequences by either assuming that one sequence evolved into the
other, or assuming that they had evolved from a common ancestral sequence of DNA.
Applying the model would give the estimated number of nucleotide substitutions per
site, called the distance, which would then be used to estimate a time. This time
could then relate to when one sequence evolved from the other, or to how long ago
an ancestral sequence of DNA diverged into the two sequences.
In this paper, I will outline the principles and theory behind the main (most
commonly used) models of nucleotide substitution, addressing each model
chronologically and in some senses with increasing complexity. The models are as
follows:
o Jukes and Cantor 1969 (JC69)
o Kimura 1980 (K80)
o Felsenstein 1981 (F81)
o Hasegawa, Kishino and Yano 1985 (HKY85)
o Tamura and Nei 1993 (TN93)
I will demonstrate how programming software may be used to process data using
the formulae proposed within each model. From this I will explain how, continuing to
use programming software, each model is capable of simulating the evolution of a
nucleotide sequence over a given time.

JC69 Model
In terms of creating models that assess nucleotide substitution, the rate of
substitution from one nucleotide to another and the time over which substitution
has been allowed to act are key variables. Different models organise their use of
rates in different ways but time is always used in the same way. The simplest model
of nucleotide substitution is the Jukes and Cantor 1969 (JC69) model. This model
assumes that the rate of substitution is the same between all nucleotides. Therefore,
this model only requires a single parameter denoting rate, along with a value for
time. A 4x4 matrix can be created showing the rates of nucleotide substitution
between the 4 nucleotides. This is known as matrix Q:

Q=

Along the diagonal of this matrix, you can see that the rates of nucleotides changing
into themselves are not displayed, as they are not regarded as substitutions. The
diagonal entries are instead taken to be whatever value makes each row sum to 0.
Using the rates in matrix Q, we can work out the probability of each nucleotide
substitution occurring when t > 0, creating another matrix. This matrix is known as
the transition probability matrix (P(t)) and is also a 4x4 matrix:

P(t)=

These formulae calculate the probability of one nucleotide evolving into another.
They are obtained through the exponentiation of the matrix Q, P(t) = exp(Qt), using
the matrix Taylor series.
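
As a rough illustration of the Taylor series step, the following sketch sums the
first terms of exp(Qt) = I + Qt + (Qt)^2/2! + ... for a JC69 rate matrix. The
function names (mat_mul, jc69_rate_matrix, expm_taylor), the number of terms and
the choice of -3m on the diagonal (so that each row sums to 0) are assumptions made
for this example and are not taken from the original program.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def jc69_rate_matrix(m):
    """JC69 rate matrix Q (order T, C, A, G): every off-diagonal rate is m and
    each diagonal entry is -3m, so that every row sums to 0."""
    return [[-3.0 * m if i == j else m for j in range(4)] for i in range(4)]


def expm_taylor(q, t, terms=30):
    """Approximate P(t) = exp(Qt) by summing the first `terms` Taylor terms."""
    n = len(q)
    qt = [[q[i][j] * t for j in range(n)] for i in range(n)]
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]              # zeroth term: identity
    for k in range(1, terms):
        term = mat_mul(term, qt)
        term = [[x / k for x in row] for row in term]   # now (Qt)^k / k!
        result = [[result[i][j] + term[i][j] for j in range(n)]
                  for i in range(n)]
    return result


P = expm_taylor(jc69_rate_matrix(0.2), 1.0)
for row in P:
    print([round(x, 4) for x in row])
```

For small values of rate multiplied by time the series converges quickly; for large
values a plain Taylor sum becomes inaccurate and a closed form or a more careful
numerical method would be preferable.
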
In terms of using the matrix P(t) with real world or experimental data, a program can
be written which will calculate the transition probabilities of each nucleotide
substitution using the formulae in P(t). Python is a basic but effective programming
language, which can be used in these circumstances. We must first define a function
that will implement the formulae of the matrix P(t) when given certain values to work
from. These values are called parameters and in the case of working out the
transition probabilities, we must input a value for the rate at which nucleotide
substitutions will occur as well as a value for the time over which substitutions will
occur.

The following code, written in Python, emulates the matrix P(t):

As shown at the bottom of the image, inputting an experimental rate (0.2) and time
(1) tests the function JC69 used to calculate the transition probabilities. This is
followed by a matrix displaying the probabilities row by row with nucleotide order T,
C, A and G, in the same orientation as the matrix Q. In looking at the formulae used
to calculate the transition probabilities, conclusions can be made as to how
increasing the rate or time will affect the resultant probabilities.
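
The original listing was an image and is not reproduced in this text. A minimal
sketch of such a function is given below, assuming the standard closed-form JC69
probabilities (1/4 + 3/4*exp(-4*m*t) for no change and 1/4 - 1/4*exp(-4*m*t) for
each specific change). The lower-case name jc69 and the return format are
assumptions for this example.

```python
import math


def jc69(m, t):
    """Return the 4x4 JC69 transition probability matrix P(t).

    m is the substitution rate and t the time; rows and columns are in the
    order T, C, A, G.
    """
    e = math.exp(-4.0 * m * t)
    p_same = 0.25 + 0.75 * e     # probability of no change
    p_diff = 0.25 - 0.25 * e     # probability of each specific substitution
    return [[p_same if i == j else p_diff for j in range(4)]
            for i in range(4)]


# Test with the rate (0.2) and time (1) used in the text.
for row in jc69(0.2, 1):
    print([round(p, 4) for p in row])
```

With these inputs the diagonal entries are about 0.587 and the off-diagonal entries
about 0.138, so each row sums to 1.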

The exponential (exp) of a negative value gives a decimal number smaller than 1. If
the negative value increases in size, the exponential of that value becomes smaller.
Therefore, as the negative value tends to negative infinity, the exponential of that
value tends to 0. In looking at the above formulae X and Y, as the values of m
(rate) and t (time) increase, the values being added to 1/4 in X and subtracted from
1/4 in Y become infinitely smaller. This results in the transition probabilities
tending towards 1/4 for each nucleotide substitution. This supports the assumption
that over an increased time or rate, so many nucleotide substitutions would have
occurred that the target nucleotide is eventually random, with a probability of 1/4
for each nucleotide.

This is demonstrated in the following graph, taking increasing values for rate with a
constant time of 1:

Pii(t) represents the probability that a nucleotide will not experience a substitution
over a period of time (t). Pij(t) represents the probability that a nucleotide will
experience a substitution and evolve into another nucleotide over a period of time.
As the time over which a nucleotide sequence has been allowed to evolve tends to
infinity, the proportion of nucleotides of each type (T, C, A, G) will approach 1/4
for each. This distribution of nucleotides is called the limiting distribution and, as
the rates of change are the same for all nucleotides in the JC69 model, this
proportion will be maintained. This proportional equilibrium is called the stationary
distribution.
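
The convergence towards 1/4 described above can be checked numerically. The short
sketch below tabulates Pii and Pij for increasing rates at a constant time of 1,
again assuming the standard JC69 closed forms; the helper name jc69_probs and the
chosen rate values are illustrative only.

```python
import math


def jc69_probs(m, t):
    """Return (Pii, Pij) under JC69 for rate m and time t."""
    e = math.exp(-4.0 * m * t)
    return 0.25 + 0.75 * e, 0.25 - 0.25 * e


print("rate    Pii      Pij")
for m in [0.0, 0.1, 0.2, 0.5, 1.0, 2.0, 5.0]:
    p_ii, p_ij = jc69_probs(m, 1.0)
    print(f"{m:4.1f}  {p_ii:.5f}  {p_ij:.5f}")
```

Pii starts at 1 and falls towards 1/4, while Pij starts at 0 and rises towards 1/4.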

K80 Model
In 1980, Kimura proposed a model with a more complex mix of rates between
nucleotide substitutions. This model is commonly known as the K80 model and uses
two rates as parameters along with time. Nucleotide substitutions can be classified
as one of two types: transitions and transversions. Transitions are substitutions
between nucleotides of the same or similar molecular structure, that is between
purines or between pyrimidines, and tend to occur more frequently than other
substitutions. Nucleotides A and G are purine molecules and experience higher rates
of substitution between each other, as do nucleotides T and C, which are pyrimidine
molecules. All other substitutions are transversions and are known to occur less
frequently than transitions. In 1980, the first mitochondrial sequences were
published showing a definitive difference between the frequencies of transitions and
transversions, transitions being noticeably higher. The K80 model was developed in
response to these findings.
The rate matrix (Q) in the K80 model displays two rates: alpha (representing the
substitution rates of the transitions) and beta (representing the substitution rates of
the transversions). In the following representation of the matrix Q, alpha = K and
beta = 1:

As with the rate matrix for the JC69 model, the diagonal elements of the matrix Q
are not included, as these are not regarded as substitutions. The total substitution
rate for any nucleotide would be a + 2b (K + 1 + 1). Deriving the transition
probability matrix from the matrix Q is slightly more difficult than for the JC69
model; the transition probability matrix (P(t)) is as follows:

        p0(t)  p1(t)  p2(t)  p2(t)
P(t) =  p1(t)  p0(t)  p2(t)  p2(t)
        p2(t)  p2(t)  p0(t)  p1(t)
        p2(t)  p2(t)  p1(t)  p0(t)

Where:
p0(t) = 1/4.0 + 1/4.0*exp(-4*b*t) + 1/2.0*exp(-2*(a+b)*t)
p1(t) = 1/4.0 + 1/4.0*exp(-4*b*t) - 1/2.0*exp(-2*(a+b)*t)
p2(t) = 1/4.0 - 1/4.0*exp(-4*b*t)

As with the JC69 model, we can also create a program that will emulate the
transition probability matrix with relative ease by inputting the parameter values for
alpha (a), beta (b) and time (t). Organising the formulae of the transition
probability matrix in a similar way to the JC69 model using Python defines the
following function:



The function is tested using the parameters a = 0.4, b = 0.2, t = 1. The transition
probabilities for nucleotides experiencing no substitutions after t = 1 are high,
whereas the transition probabilities for transitions and transversions are relatively
low in comparison. When considering the formulae used to calculate these
probabilities, certain inevitable trends are recognisable:

x represents the probability of a nucleotide experiencing no change over a given
time. When t = 0, x = 1; from this point, x decreases exponentially to the value of
1/4. y represents the probability of a nucleotide experiencing a transition (A<->G or
T<->C) over a given time. At t = 0, the value of y is 0; when no time has passed, the
probability of a given nucleotide experiencing any sort of substitution is 0. This is
also true for transversional substitutions, represented by equation z. As time
increases from 0, the transition probabilities for both transversions and transitions
increase, tending towards 1/4. As the rates of transitional change are higher than
those of transversional change, the transition probabilities for transitional
substitutions increase towards 1/4 at a higher rate. The following graph represents
the changes in the transition probabilities of transitions, transversions and
no-substitution sites as time increases:

To create this graph, the values of alpha and beta were set to 0.4 and 0.2
respectively. These values simulate realistic values for the rates of transitions and
transversions, as observed rates have shown that transitional substitutions occur at a
higher frequency than transversional substitutions. Time ranges from 0 to 10,
increasing by 0.1 within each interval.
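
The K80 function itself was shown as an image and is not reproduced in this text.
A minimal sketch based directly on the formulae for p0(t), p1(t) and p2(t) given
earlier is shown below; the name k80 and the list-of-rows return format are
assumptions for this example.

```python
import math


def k80(a, b, t):
    """Return the 4x4 K80 transition probability matrix P(t).

    a is the transition rate (alpha), b the transversion rate (beta) and t
    the time; rows and columns are in the order T, C, A, G.
    """
    e1 = math.exp(-4.0 * b * t)
    e2 = math.exp(-2.0 * (a + b) * t)
    p0 = 0.25 + 0.25 * e1 + 0.5 * e2   # no change
    p1 = 0.25 + 0.25 * e1 - 0.5 * e2   # transition (T<->C or A<->G)
    p2 = 0.25 - 0.25 * e1              # each specific transversion
    return [[p0, p1, p2, p2],
            [p1, p0, p2, p2],
            [p2, p2, p0, p1],
            [p2, p2, p1, p0]]


# Test with the parameters used in the text: a = 0.4, b = 0.2, t = 1.
for row in k80(0.4, 0.2, 1):
    print([round(p, 4) for p in row])
```

With these parameters p0 is about 0.513, p1 about 0.212 and p2 about 0.138, which
agrees with the observation that the no-substitution probability is the largest.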

HKY85 and TN93 Models
Hasegawa, Kishino and Yano developed a model in 1985 that combined elements of
both the K80 and F81 models. This is known as the HKY85 model and incorporates
multiple parameters to create a more realistic simulation of how nucleotide
sequences behave. First of all, the HKY85 model assumes that the rates of
substitution differ between each nucleotide. A single value defines the rates at
which a given target nucleotide is evolved into. For example, a value for the rate of
T would define the rates by which any nucleotide would be substituted to result in
the creation of the nucleotide T. These rates are known as base frequencies and
within this model, the base frequencies are deemed unequal. Further parameters
are included to distinguish between the rates of transitions and transversions as
within the K80 model. After the first mitochondrial sequences were published in
1980, the difference between the rates of transitions and transversions was made
definitive and so most nucleotide evolution models created after 1980 incorporate
parameters that define the rates of transitions and transversions separately. The
HKY85 model is seen to give a more accurate representation of nucleotide
substitutions in comparison to the JC69, K80 and F81 models by accommodating
multiple factors.
The following image represents the rate matrix Q:

The matrix is organised as the rate matrices for all previous models have been: the
columns and rows are in the nucleotide order T, C, A, G respectively. Within this
representation of the matrix Q, K represents transitional substitutions. All other
substitutions are assumed to be transversional other than the diagonal values of the
matrix, which are not substitutions. T represents the rate of substitutions resulting
in the formation of the nucleotide T (its base frequency), as mentioned before. C
represents the rate of substitutions resulting in the formation of the nucleotide C
and so on.

Deriving the transition probability matrix (P(t)) is not as simple as with the previous
models, because the entries of the matrix Q are no longer as symmetric as in the
JC69 and K80 models. Therefore, the matrix Q is initially diagonalized, followed by
the exponentiation of the resulting diagonal matrix to produce the matrix P(t):



Where:

Most of the transition probabilities differ for each substitution within this model; this
more closely emulates how nucleotides would behave in real life in comparison to
the previous models. More factors are taken into account to achieve this and so the
formulae increase in complexity as they accommodate a larger number of variables.
Writing a function to carry out the formulae in the transition probability matrix is
slightly more time consuming than for previous models but it is still achievable:

Parameters for time, transition rate, transversion rate and the base frequencies
must be defined in order to generate the transition probability matrix. The function
is then tested with experimental parameters, generating the matrix at the bottom of
the image. At t = 0, the diagonal elements of P(t) are at 1 whilst all other values are
at 0. This is because at t = 0, we would not expect any substitutions to have occurred
in a nucleotide sequence. As time tends to infinity, the probabilities of the diagonal
elements decrease, as all other elements increase, until every entry equals the base
frequency of its target nucleotide. This would be the result of the nucleotides in the
sequence reaching a stationary distribution: when the proportions of each nucleotide
match their respective base frequencies. These proportions would be maintained, as
further substitutions would continue to generate the same proportions of nucleotides.
Therefore, in this case, the stationary distribution is also the limiting distribution.
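
The behaviour at t = 0 and for large t can be checked numerically. The sketch below
is not the closed-form function shown in the original image: it builds an
HKY85-style rate matrix (assumed here to be a * pi[j] for transitions and
b * pi[j] for transversions, with the diagonal set so rows sum to 0) and
exponentiates it numerically with a Taylor series combined with scaling and
squaring, rather than by diagonalization. All names and the example parameter
values are hypothetical.

```python
NUCS = "TCAG"
TRANSITION_PAIRS = [{"T", "C"}, {"A", "G"}]


def hky85_rate_matrix(a, b, pi):
    """Rate matrix Q with q[i][j] = a*pi[j] (transition) or b*pi[j]
    (transversion); pi lists the base frequencies in the order T, C, A, G."""
    q = [[0.0] * 4 for _ in range(4)]
    for i in range(4):
        for j in range(4):
            if i != j:
                is_transition = {NUCS[i], NUCS[j]} in TRANSITION_PAIRS
                q[i][j] = (a if is_transition else b) * pi[j]
        q[i][i] = -sum(q[i])      # diagonal makes the row sum to 0
    return q


def mat_mul(x, y):
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_matrix(q, t, terms=20):
    """Approximate P(t) = exp(Qt): Taylor series on a scaled-down time step,
    followed by repeated squaring to get back to the full time t."""
    n = len(q)
    norm = max(sum(abs(v) for v in row) for row in q) * t
    squarings = 0
    while norm > 0.5:
        norm /= 2.0
        squarings += 1
    step = t / (2 ** squarings)
    p = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in p]
    for k in range(1, terms):
        term = [[v * step / k for v in row] for row in mat_mul(term, q)]
        p = [[p[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        p = mat_mul(p, p)
    return p


pi = [0.1, 0.2, 0.3, 0.4]                 # example base frequencies (T, C, A, G)
Q = hky85_rate_matrix(0.4, 0.2, pi)
for t in (0.0, 1.0, 200.0):
    print("t =", t)
    for row in transition_matrix(Q, t):
        print("  ", [round(v, 4) for v in row])
```

At t = 0 the result is the identity matrix, and at t = 200 every row is
approximately equal to the base frequencies, as described above.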

The difference between the rates of substitution of transitions and transversions was
well established and is reflected within most nucleotide models created after the K80
model. However, within transitions a further difference in rates can be distinguished.
Nucleotides A and G are known as purine molecules and nucleotides T and C are
known as pyrimidine molecules; the difference being the molecular structures of the
nucleotides. Generally, purines and pyrimidines tend to have different rates of
substitution; therefore, a more recent model than those discussed so far has been
developed to accommodate this factor. In 1993, Tamura and Nei proposed a new
model, which included parameters that would distinguish between the transition
rates of pyrimidines and purines respectively. This model is commonly known as the
TN93 model and introduces the parameters alpha1 and alpha2 in replacement of the
single alpha parameter present in the HKY85 model for transitional rates. The rate
matrix for this model is therefore very similar to that of the HKY85 model, as is
the transition probability matrix:

Matrix P(t) =

Where:


Simulation of nucleotide sequences
The previously discussed models of nucleotide substitution all allow for the
generation of probabilities that determine how a nucleotide sequence will evolve or
has evolved, based on likelihood. From this, a function can be used to simulate how a
sequence of nucleotides may evolve based on these probabilities. For example,
taking the principles of the simplest model, JC69, we can say that the probabilities
for a nucleotide changing into one of the other nucleotides are equal. Therefore,
when simulating a scheduled substitution of a nucleotide, because each transition
probability is the same, the target nucleotide can be randomly chosen and the
sequence mutated. If the transition probabilities were unequal, the target nucleotide
would be randomly chosen but with incorporated bias favouring more probable
transitions.

A function must be designed to first generate a random time at which a mutation
will occur based on the total substitution rates of all the nucleotides of the sequence
using the rate matrix Q. A time interval over which mutations will occur must be
outlined; for simplicity the interval from t = 0 to t = 1 is often used (timex). To begin
mutation, a sequence of nucleotides must be provided; through the use of a
function, a nucleotide sequence of any length can be generated (genseq). Using the
timex function, a list of times is generated when a rate is inputted into the function.
In this case, the total rate for all nucleotides of the sequence is inputted and a list of
times generated randomly; these times are used as the times of mutation. This
technique cannot be used unchanged for more complex models of nucleotide
evolution, as they assume unequal substitution rates and so after a substitution, the
total rate would change with the departure of one nucleotide and the creation of a
new nucleotide.

In basing simulation on the JC69 model, the transition probability matrix for the
JC69 model is used to generate the probabilities for mutations or for no changes.
The genseq and timex functions are both used to generate a sequence of
nucleotides and to then create a list of times at which mutations will take place.
Please look to the Functions section towards the end of this report for definitions of
each function. The following is a sequence of nucleotides before and after mutation
using the JC69 transition probability matrix:

Before

After

Although 5 differences are visible from the initial sequence to the sequence after
mutation, 7 actual mutations had occurred, with two of the mutations acting on the
same starting nucleotide, the 8th, and the second of these returning the 8th
nucleotide to its starting state (nucleotide C). The 7 mutations were achieved using
the timex function and inputting a value of 4.5 for rate (at).
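
The JC69 simulation described above can be sketched as follows. This is an
illustrative reconstruction, not the original program: the function name
simulate_jc69, the use of a seeded random.Random object and the sequence length of
20 are assumptions. Mutation times are drawn as cumulative exponential waiting
times with the given total rate, and at each time a random site is changed to one
of the three other nucleotides with equal probability.

```python
import random

NUCLEOTIDES = "TCAG"


def simulate_jc69(seq, total_rate, rng):
    """Mutate seq over the time interval 0 to 1 under JC69.

    Returns the mutated sequence and a list of (time, site, old, new) events.
    """
    seq = list(seq)
    events = []
    t = rng.expovariate(total_rate)           # time of the first mutation
    while t <= 1.0:
        site = rng.randrange(len(seq))        # every site is equally likely
        old = seq[site]
        new = rng.choice([x for x in NUCLEOTIDES if x != old])
        seq[site] = new
        events.append((t, site, old, new))
        t += rng.expovariate(total_rate)      # waiting time to the next one
    return "".join(seq), events


rng = random.Random(1)                        # seeded so runs are repeatable
before = "".join(rng.choice(NUCLEOTIDES) for _ in range(20))
after, events = simulate_jc69(before, 4.5, rng)
visible = sum(x != y for x, y in zip(before, after))
print("Before:", before)
print("After: ", after)
print(len(events), "mutations,", visible, "visible differences")
```

As in the example above, the number of visible differences can be smaller than the
number of mutations, because a site may be hit more than once.
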
Simulation of mutation using the K80 model requires a slightly different method, as
does simulation using the HKY85 and TN93 models, due to the differing principles and
parameters between each model. These principles are quite easily summarised:
o K80: as transitions and transversions occur at different rates and must be
  distinguished, the function written for simulating mutation under the
  probabilities generated by the K80 model accounts for this. This results in
  transition mutations and transversion mutations occurring at different rates
  in the nucleotide sequence being mutated.
o HKY85: as the HKY85 model utilises several different parameters, and
  therefore rates, to distinguish probabilities, the function written to simulate
  under the principles of this model uses multiple rates when conducting a
  mutation. Also, as each nucleotide is subject to different rates of mutation,
  the total rate used by the timex function to decide when the next mutation
  will occur is updated after any nucleotide is mutated and changed into
  another, to account for this change.
o TN93: the function simulating mutation under the principles of the TN93
  model acts in the same way as the function used for the HKY85 model. The
  only difference is that the TN93 model introduces an additional rate,
  breaking the rate for alpha (transitions) into alpha1 (transitions between
  pyrimidines) and alpha2 (transitions between purines).
The functions written for the simulation of the mutation of a nucleotide sequence
are included in the appendix and are labelled accordingly.
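
The appendix listings were images and are not reproduced in this text. As a hedged
illustration of the principle summarised above for HKY85 and TN93, the sketch below
simulates mutation from a general rate matrix and recomputes the total rate after
every substitution. The rate matrix layout (rate multiplied by the base frequency
of the target nucleotide), the function names and the example parameter values are
all assumptions for this example. Setting a1 equal to a2 gives an HKY85-style
matrix.

```python
import random

NUCS = "TCAG"   # same nucleotide order as the rate matrices in this report


def tn93_rate_matrix(a1, a2, b, pi):
    """Assumed TN93-style matrix: a1*pi[j] for T<->C, a2*pi[j] for A<->G and
    b*pi[j] for transversions; the diagonal makes each row sum to 0."""
    q = [[0.0] * 4 for _ in range(4)]
    for i in range(4):
        for j in range(4):
            if i == j:
                continue
            pair = {NUCS[i], NUCS[j]}
            if pair == {"T", "C"}:
                rate = a1
            elif pair == {"A", "G"}:
                rate = a2
            else:
                rate = b
            q[i][j] = rate * pi[j]
        q[i][i] = -sum(q[i])
    return q


def simulate(seq, q, rng, t_end=1.0):
    """Mutate seq from time 0 to t_end; return (new sequence, mutation count)."""
    state = [NUCS.index(x) for x in seq]
    site_rates = [-q[s][s] for s in state]    # rate of leaving each site's base
    t, n_events = 0.0, 0
    while True:
        # The total rate is recomputed each time because it changes whenever
        # one nucleotide is replaced by another.
        t += rng.expovariate(sum(site_rates))
        if t > t_end:
            break
        site = rng.choices(range(len(state)), weights=site_rates)[0]
        old = state[site]
        targets = [j for j in range(4) if j != old]
        new = rng.choices(targets, weights=[q[old][j] for j in targets])[0]
        state[site] = new
        site_rates[site] = -q[new][new]
        n_events += 1
    return "".join(NUCS[s] for s in state), n_events


rng = random.Random(2)
q = tn93_rate_matrix(0.5, 0.3, 0.1, [0.1, 0.2, 0.3, 0.4])
before = "".join(rng.choice(NUCS) for _ in range(30))
after, n_events = simulate(before, q, rng)
print(before)
print(after)
print(n_events, "mutations")
```

The site to mutate is chosen in proportion to its current rate of change, and the
target nucleotide in proportion to the corresponding entry of Q, which gives the
bias towards more probable substitutions described earlier.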

Maximum Likelihood Estimates (MLE): JC69 & K80 Models
Maximum likelihood estimates are used to estimate parameter values for a
statistical model when applying that model to a data set. In the case of nucleotide
substitutions, the statistical models fitted to data are the models of nucleotide
substitution and the parameter estimated is the value for rate and time. Rate and
time are dealt with as a single value as they cannot be distinguished from one
another; the same single value (at) can be produced by a number of different
combinations of values of alpha and time.
The data set used will be two sequences of nucleotides of equal length, of which one
sequence will be assumed to have evolved from the other through several
mutations. The total length of a sequence is represented by the letter n and the
number of differences (the number of nucleotides which differ between the two
sequences) is represented by the letter k.

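JC69
To explain the theory behind acquiring the maximum likelihood estimate, the
binomial distribution must be considered. The following is the probability mass
function (pmf) of the binomial distribution:

pmf = C(n, k) * p^k * (1 - p)^(n - k)

C(n, k) = The number of ways of choosing k sites out of n.
p = The probability that a single site differs.
n = The total length of a sequence.
k = The number of differences between the two sequences.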

The probability mass function gives the probability of observing exactly k
differences when the variable (at) takes a proposed value. For example, if a value for
at is inputted into the probability mass function, the value calculated is the
likelihood of that value of at: the probability of obtaining the observed data if that
value were correct. In replacement of the variable p is the equation used in the
transition probability matrix for the JC69 model to calculate the probability of a
mutation occurring. The equation used in replacement of 1 - p is the equation from
the transition probability matrix of the JC69 model used to calculate the probability
of a mutation not occurring. The following equation is the probability mass function,
altered to include the variables mentioned above with the total length of a sequence
(n) as 100 and the number of differences (k) as 40. The notation pow(x, y) represents
the value x to the power of y:

Probability mass function = l

The variable m represents the value at. The value of at with the highest
probability can be found through trial and error; however, using Python, all values
of at within an interval can be tested and plotted onto a graph:

The probability mass function equation displayed above was used to generate the
data to plot this graph. The values of m (at) within the interval 0 to 0.4 were tested
and a peak probability was acquired. The peak represents the value of m (at) with
the highest probability of resulting in the value of k and therefore is the maximum
likelihood estimate. In this case, the maximum likelihood estimate is 0.19 for at.
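
The search over values of m can be sketched in a few lines. The code below assumes
the binomial form of the likelihood, with the probability that a site differs taken
as 3/4 - 3/4*exp(-4*m) (three times the JC69 probability of one specific change);
the function name, the grid spacing of 0.001 and the use of the log-likelihood are
choices made for this example rather than details of the original program.

```python
import math


def jc69_log_likelihood(m, n, k):
    """Log of the binomial likelihood of k differences in n sites, where
    m is the product of rate and time (at)."""
    e = math.exp(-4.0 * m)
    p_diff = 0.75 - 0.75 * e       # probability that a site differs
    p_same = 0.25 + 0.75 * e       # probability that a site is unchanged
    return (math.log(math.comb(n, k))
            + k * math.log(p_diff)
            + (n - k) * math.log(p_same))


n, k = 100, 40
grid = [i / 1000.0 for i in range(1, 401)]       # m from 0.001 to 0.4
best_m = max(grid, key=lambda m: jc69_log_likelihood(m, n, k))

# Closed form obtained by setting p_diff equal to k / n and solving for m.
closed_form = -0.25 * math.log(1.0 - 4.0 * k / (3.0 * n))

print("grid search estimate:", best_m)
print("closed-form estimate:", round(closed_form, 4))
```

Both approaches give a value of approximately 0.19, in agreement with the peak of
the graph.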

K80
Finding the maximum likelihood estimate using the principles of the K80 model is
approached in a very similar way as with the JC69 model. The probability mass
function is adjusted so that two values are estimated, as there are two parameters
for rates in the K80 model, alpha and beta.

Pmf = p0^(n-k-j) * p1^k * p2^j

Where:
p0 = the equation used from the transition probability matrix of the K80
     model to calculate the probability of no mutation occurring.
p1 = the equation used from the transition probability matrix of the K80
     model to calculate the probability of a transition mutation occurring.
p2 = the equation used from the transition probability matrix of the K80
     model to calculate the probability of a transversion mutation occurring.
n = the total length of a sequence.
k = the number of differences between two sequences that have resulted
    from transition mutations.
j = the number of differences between two sequences that have resulted
    from transversion mutations.

Probability mass function = l

a and b represent the values for the rates of transitions (alpha) and transversions
(beta) respectively. Using Python, a table can be generated showing the value of the
probability mass function for each combination of values of a and b. The values in
this table can be plotted graphically using a contour plot. The following is a contour
plot generated using the equation for the probability mass function displayed above,
where the total length of a sequence (n) is 100, the number of differences that have
resulted from transition mutations (k) is 30 and the number of differences between
two sequences that have resulted from transversion mutations (j) is 10:


The lines become concentrated around the maximum likelihood estimates for the
values of alpha (rate of transitions) and beta (rate of transversions). The estimate for
the most probable value of b is clearly centred on the interval between 0.12 and
0.14. Unfortunately, the value for a is not visible as the limits of this contour graph
do not show where the lines of the graph centre on the y axis.
Maximum likelihood estimates are used in conjunction with models of nucleotide
evolution mainly to estimate the time taken for one sequence of nucleotides to
evolve into another, assuming that one sequence is the ancestor of the other.
Although only a value for at, the product of rate and time, can be estimated
directly, time can be separated out if an average rate (or rates in the case of
multiple parameter models) is known. Using the known value for rate, the variable of
time can be distinguished and so the time taken for one sequence to mutate into the
other can be calculated. Practically, biologists and statisticians have adopted this
method when attempting to calculate the time taken for particular species (such as
humans) to have evolved from ancestral species (such as earlier primates). By
assessing the same sections of DNA, of the same length, from the two species, the
number of differences may be recorded and used to estimate a time using the
maximum likelihood method.

Conclusions
As my investigation was not an experiment as such but rather the translation of
statistical models onto software so as to use these models in practical situations, my
conclusion would be to state that the programmes that I have written to emulate
these statistical models have been successful and so may be applied to practical
data sets. This translational link between statistical models and new computing
software embodies the basic principles of bioinformatics and demonstrates how
statisticians and biologists can use these models when dealing with mutated
sequences of DNA.

If I had further research time and possibly slightly more options in terms of
computing software, there are multiple areas that I would have expanded within my
project and report. First of all, I would have included a step by step explanation of
the Taylor series expansion, allowing readers to understand the mathematical
theory behind obtaining the transition probability matrix from the rate matrix of a
nucleotide model. Also, I would have explored further models of nucleotide
evolution, as there are many more significant models that have not been mentioned.
These models would have broadened the scope of my project and would have
depicted further steps by which each model was chronologically improved. I also
believe that the last section of this report, the maximum likelihood estimation for
the JC69 and K80 models, could be progressed further. With access to alternative
computing software that could plot multidimensional graphs, I would have extended
the calculation of maximum likelihood estimates into estimating the parameters for
the HKY85 and TN93 models.

References:

Computational Molecular Evolution (Yang 2006)
www.wikipedia.org
www.python.org
http://docs.python.org/lib/modulerandom.html
http://www.tau.ac.il/~doronadi/F81_model.doc
http://www.megasoftware.net/WebHelp/part_iv___evolutionary_analysis/computing_evolutionary_distances/distance_models/nucleotide_substitution_models/hc_jukes_cantor_distance.htm
http://www.ncbi.nlm.nih.gov/books/bv.fcgi?rid=hmg.figgrp.1080
Evolutionary Trees from DNA Sequences: A Maximum Likelihood Approach
(Joseph Felsenstein 1981)
A Novel Use of Equilibrium Frequencies in Models of Sequence Evolution
(Nick Goldman and Simon Whelan)

Functions

Genseq: the generation of a random sequence of nucleotides is essential to the
simulation of nucleotide substitution. To define a function to generate a sequence, a
parameter for the length of the sequence must be defined. In this case, n is used.
The function randomly chooses a letter, representing each nucleotide, from the list
ACGT using the inbuilt randint function. The chosen letter is added to a list; the
process of choosing a letter is then repeated n times, creating a list or sequence n
nucleotides long.
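
A minimal sketch of such a function is given below. The optional rng argument,
which allows a seeded generator to be passed in for repeatable results, is an
addition for this example and not part of the description above.

```python
import random


def genseq(n, rng=random):
    """Return a list of n nucleotides chosen at random from ACGT."""
    letters = "ACGT"
    sequence = []
    for _ in range(n):
        # randint(0, 3) returns 0, 1, 2 or 3 with equal probability.
        sequence.append(letters[rng.randint(0, 3)])
    return sequence


print("".join(genseq(20)))
```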

Timex: this function allows for the generation of a cumulative set of times that
represent when mutations will occur in a nucleotide sequence. This function is only
used within the simpler models of substitution as it assumes that transition
probabilities are the same for each nucleotide. An inbuilt function
(random.expovariate) takes a value for rate as a parameter and generates a random,
exponentially distributed waiting time using this rate value. Inputting a higher rate
value will increase the probability of the inbuilt function generating a smaller value.
Values are generated using the same rate value and are added cumulatively to
represent the times at which events occur according to the inputted rate value. This
process is terminated when the cumulative time value increases over 1, as we are
only interested in mutations occurring between times 0 and 1. This function is
effective, as theoretically, if events occur at a higher rate, more events will occur in a
given time.
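
A minimal sketch of such a function is given below; as before, the optional rng
argument is an addition for this example.

```python
import random


def timex(rate, rng=random):
    """Return the cumulative event times that fall between 0 and 1 when
    waiting times are exponentially distributed with the given rate."""
    times = []
    t = rng.expovariate(rate)
    while t <= 1.0:
        times.append(t)
        t += rng.expovariate(rate)   # add the waiting time to the next event
    return times


print(timex(4.5))
```

On average the length of the returned list is equal to the rate, so a rate of 4.5
gives about 4 or 5 mutation times in most runs.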

Intgen: this function was created to generate a list, of length n, of random numbers.
These random numbers denote at what points in the sequence mutations will occur.
The timex function is initially used to calculate the number of mutations that will
occur in an allotted time. The number of calculated mutations then gives the length
of the list of random numbers. Each number within this list refers to the position of
the nucleotide in the sequence that is to be mutated.
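
A minimal sketch of such a function is given below. The parameter names, the
zero-based positions and the optional rng argument are assumptions for this
example; in practice the number of mutations would be the length of the list
returned by timex.

```python
import random


def intgen(n_mutations, seq_length, rng=random):
    """Return n_mutations random positions (0-based) in a sequence of
    seq_length nucleotides; the same position may appear more than once."""
    return [rng.randint(0, seq_length - 1) for _ in range(n_mutations)]


# Example: 7 mutations in a sequence of 20 nucleotides.
print(intgen(7, 20))
```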

Appendix:

K80: Function for simulation of mutation of nucleotide sequence

HKY85: Function for simulation of mutation of nucleotide sequence

TN93: Function for simulation of mutation of nucleotide sequence