1. Introduction
Nowadays, one of the most challenging problems in computational biology is to transform the huge volume of data provided by newly developed technologies into knowledge. Machine learning has become an important tool for carrying this out (1). Several techniques and methods have been developed in order to build models which can be trained and make crucial decisions. Bayesian classifiers, logistic regression, discriminant analysis, classification trees, random forests, nearest neighbour, neural networks, support vector machines, ensembles of classifiers, partitional clustering, hierarchical clustering, mixture models, hidden Markov models, Bayesian networks and Gaussian networks are some of these methods. In our project the aim was to develop generalized linear models using the Relevance Vector Machine technique in order to predict ncRNA genes in genomic (DNA) sequences of the Zebrafish genome.

Noncoding RNAs (ncRNA) are RNAs that are transcribed, but not translated into protein. There are two kinds of ncRNA: short and long non-coding RNAs. They include well-characterized transfer RNAs and ribosomal RNAs, snRNAs, snoRNAs, and miRNAs, as well as a plethora of new ncRNAs that have been shown to play major roles in the cellular processes of all living organisms (2)(3). In addition, the functional role of long non-coding RNA in human carcinomas has been studied (4)(5).

The relevance vector machine (RVM) is a machine learning method which exploits a probabilistic Bayesian learning framework and has an identical functional form to the well-known support vector machine (SVM), with models which utilize dramatically fewer basis functions than an SVM (6). The RVM has the ability to construct accurate predictions while offering several additional advantages. The innovative function of an RVM is the probabilistic predictions it creates: it does not decide whether a datum belongs to a class or not, but gives it a probability of belonging to a class. RVMs can be utilized for both classification and regression problems.
In (7), Down and Hubbard tried to gain important information from non-coding regions of similarity between genomes. In particular, their aim was to […] signal from a set of non-coding conserved sequences using RVM. It was shown that the predictions of the model […], as can be seen in the figure below. They also verified that the promoter signal is the strongest single motif-based signal in the non-coding functional fraction of the genome, while subsets of these promoter regions have an abundance of CpG dinucleotides.

Since this work of Down and Hubbard, several other approaches have been developed for the prediction of non-coding RNA genes or regions. In (8) the purpose was the classification of RNA sequence alignments based on SCI and a z-score using the support vector machine. SCI is a measure of RNA secondary structure conservation, while the z-score represents a measure of the thermodynamic stability of alignments (normalized with respect to sequence length and base composition). In the following figure the green circles are the positive examples of the training set (native alignments) and the red crosses the negative ones (shuffled positions of random alignments). The background color, ranging from red to green, indicates the RNA class probability for different regions of the z-SCI plane.
The annotation of noncoding RNA genes remains a major bottleneck in genome sequencing projects. Most genome sequences released today still come with sets of tRNAs and rRNAs as the only annotated RNA elements, ignoring hundreds of other RNA families. Several online tools have been created for this purpose. RNAspace.org (9) is one of the most recent tools for the prediction, annotation and analysis of ncRNA genes. NcRNA.org is another tool for finding long non-coding RNA genes in RNA sequences, which also gives information about the secondary structure of the results (Figure: ncRNA.org) (10). Several other tools and databases can be found in Table 2 of Gibb's publication (5) (Figure).
Figure: ncRNA.org
We randomly selected the same number of ncRNA genes (about 4500 sequences). We did this in order to have a balance between the number of positive and negative examples. As we know, a GLM is a commonly used form of model for both classification and regression […] where φ is a set of basis functions (which can be arbitrary real-valued functions) and w is a vector of weights. In other words, this is equal to:
(Equation 1)
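For reference, a generalized linear model over basis functions is commonly written as below. This is only a sketch in the notation of the sparse Bayesian learning literature (6): the symbols $\phi_m$, $w_m$, $w_0$ and $M$ are assumptions introduced here for illustration and may differ from the notation of the original Equation 1. For classification, a logistic link is typically applied to the linear predictor, which is what yields a class probability rather than a hard decision.

```latex
\eta(x) \;=\; \sum_{m=1}^{M} w_m \,\phi_m(x) \;+\; w_0,
\qquad
P(\text{class } 1 \mid x) \;=\; \frac{1}{1 + e^{-\eta(x)}}
```

Here $\eta(x)$ is the linear predictor for a sequence $x$, each $\phi_m(x)$ is one feature (basis function) value, and $w_0$ is a constant offset.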
1. Strand: 1 and 0 for complementary
2. Position in chromosome: percentage at which the sequence lies with respect to the length of the chromosome
3. Composition frequencies:
   a. %A, %T, %C, %G
   b. Dimer (%AA, %AG, …)
   c. Trimer (%AAA, %TTT, %GGG and %CCC)
4. GC content
5. Two ratios: A/T and AT/GC
6. 11 motifs: To find motifs, we visited www.motifsearch.com, where a de-novo DNA motif sequence search can be performed. We gave as input the DNA sequences (from our positive set, the ncRNA sequences) in FASTA format, and the output returned was 11 motifs: […], "GAAGCTAAG", "CATGGGAAA", "TGAAGCTAA", "GCAGGGCTG", "CACTGAAGC", "ATGGGAGAC", "GATGGGAGA", […]
[…] with their inputs and outputs are in the folder 'TrainSet'. In this folder there are two subfolders, 'NegSet' and 'PosSet', one for each data set (negative and positive examples). The final training set is the file named TrainingSet, which is the concatenation of the files FeaturesNegSet.bed and FeaturesPosSet.bed (Figure).
Figure: Training Set - First column is the Set (1-Positive, 0-Negative) and the rest of the columns are the 40 features
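The feature list above can be sketched in Python as follows. This is a minimal illustration, not the project's actual scripts: the function names are hypothetical, only four of the listed motifs are used, and several details are assumptions (overlapping k-mer counting, motifs encoded as 0/1 presence flags, "+" as the label of the strand encoded as 1, and AT/GC computed as (%A + %T) / (%G + %C)).

```python
from itertools import product

# A few of the motifs listed above; the real run used 11 motifs.
MOTIFS = ["GAAGCTAAG", "CATGGGAAA", "TGAAGCTAA", "GCAGGGCTG"]


def kmer_percentages(seq, k):
    """Percentage of each overlapping k-mer over A/C/G/T in seq."""
    seq = seq.upper()
    total = max(len(seq) - k + 1, 1)
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip k-mers containing N or other symbols
            counts[kmer] += 1
    return {kmer: 100.0 * n / total for kmer, n in counts.items()}


def extract_features(seq, strand, start, chrom_length):
    """Build one feature row for a sequence (assumed encodings, see lead-in)."""
    seq = seq.upper()
    mono = kmer_percentages(seq, 1)
    di = kmer_percentages(seq, 2)
    tri = kmer_percentages(seq, 3)

    features = {
        "Strand": 1 if strand == "+" else 0,
        # Position as a percentage of the chromosome length
        "Position": 100.0 * start / chrom_length,
    }
    features.update(mono)
    features.update(di)
    features.update({t: tri[t] for t in ("AAA", "TTT", "GGG", "CCC")})

    gc = mono["G"] + mono["C"]
    features["GC_Content"] = gc
    features["A/T"] = mono["A"] / mono["T"] if mono["T"] else 0.0
    features["AT/GC"] = (mono["A"] + mono["T"]) / gc if gc else 0.0

    for motif in MOTIFS:
        features[motif] = 1 if motif in seq else 0
    return features
```

A row of the training file would then be the class label (1 or 0) followed by these values in a fixed column order.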
4. RVM - Run
In Figure we can see what the configuration file of the RVM looks like. Rows 3 and 4 are for training and the last 2 for testing. When we train the model we comment out the last 2 rows and, correspondingly, when we want to test our model we comment out the training lines. At row 3 we indicate […], while at the 4th row we indicate two outputs. The first (.od.0) includes the posterior probabilities for each observation, and this probability always refers to the 1st class. In our case, for example (Figure), the 1st datum (which is a negative example, a protein coding gene) has prob=0.006 of belonging to class 1 (being a ncRNA), the 2nd (a ncRNA) has prob=0.99 of being a ncRNA, etc. The 2nd file (.om.0) includes the weights of each basis function. The first row is the U measure (of Equation 1 above), while each L indicates the weight for the feature in the 3rd column (Figure).

In the case of testing, we indicate at line 5 of the configuration file our Test file, and at the last line 2 files are included. The first is the output with the posterior probabilities and the other is the weights file (from the training step) on which the test run was based. In the following figure we can see the NetBeans environment where the RVM runs every time.
[…] was done in order to find the accuracy of our […]. […] was split into 5 equal subsets: 4 were used for training and the remaining 1 to test the performance of our model (5-fold cross validation). We created 5 training sets and therefore […] were […]. The files are located in the directory named 'RVMRuns'. The second part of the runs was about the interpretation of the significant features that were extracted. At this step, we constructed 5 different […].

As we can see from the following figures (ROC curves), our classifier performed very well, almost identically across runs. It is remarkable that all 5 models had a very high accuracy (>96%). This verifies that the RVM can distinguish the data very well. Besides the ROC curves, we created a perl script in which we calculated the true and false positive rate as well as the true and false negative rate, […] observe the distribution of our data against the posterior probabilities from the outcomes. From all of the above we observed that the best probability threshold for distinguishing the data (the threshold for the best accuracy) was at thres=0.5 (accuracy=99.02%) (Figure).
Figure: Scatter plot (Red crosses: Positive examples, Blue crosses: Negative examples)
Figure: Best Threshold at 0.5. True Pos Rate = 99.01%, FP rate = 0.99% and Classification Accuracy = 99.02%
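The threshold search described above can be sketched in Python as follows. This is an illustrative reimplementation under stated assumptions, not the perl script used in the project: the function names are hypothetical, a sequence is predicted as class 1 when its probability is greater than or equal to the threshold, and ties between thresholds are resolved in favour of the first candidate.

```python
def rates_at_threshold(labels, probs, threshold):
    """Return (true positive rate, false positive rate, accuracy)."""
    tp = fp = tn = fn = 0
    for y, p in zip(labels, probs):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    accuracy = (tp + tn) / len(labels) if labels else 0.0
    return tpr, fpr, accuracy


def best_threshold(labels, probs, candidates):
    """Pick the candidate threshold with the highest accuracy."""
    return max(candidates,
               key=lambda t: rates_at_threshold(labels, probs, t)[2])
```

Sweeping the candidates over a fine grid and plotting the true positive rate against the false positive rate gives the points of a ROC curve.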
First of all, it is important to say that for each training set the bar charts refer to features by pointer. The table below lists the pointers and their features:

Pointer  Feature
1        Strand
2        Position
3        A
4        G
5        C
6        T
7        AA
8        AG
9        AC
10       AT
11       GA
12       GG
13       GC
14       GT
15       CA
16       CG
17       CC
18       CT
19       TA
20       TG
21       TC
22       TT
23       AAA
24       GGG
25       CCC
26       TTT
27       GC_Content
28       A/T
29       AT/GC
30       "AGCTAAGCA"
31       "AAGCTAAGC"
32       "GAAGCTAAG"
33       "CATGGGAAA"
34       "GCAGGGCTG"
The following subchapters describe the models we created for each training step, how they were constructed, the features which were extracted, their importance and their possible role in the problem of predicting non-coding RNA genes from genomic sequences.
In our case we have constructed our 1st Generalized Linear Model. As we can see in Figure, only 9 features had a positive contribution to our model (positive weights), and these were the Position of the sequence, the A and T nucleotides, the GC content and 5 motifs (AGCTAAGCA, CATGGGAAA, GCAGGGCTG, CACTGAAGC, ATGGGAGAC). The majority of the remaining features had a negative […] which […] (ACATGGGAA). Therefore, we cannot write out our model in this documentation because of the large number of basis functions; instead we provide a table with the weight of each basis function as extracted from the RVM:
Features, in the order extracted:
Position, GA, GG, GT, CA, CA, CG, CC, CT, TA, TG, TC, TT, AAA, GGG, CCC, AT/GC, "AGCTAAGCA", "CATGGGAAA", "GCAGGGCTG", "CACTGAAGC", "ATGGGAGAC", T, AA, AG, AC, AT

Weights, in the order extracted:
[…]4, -2.144286, -3.8467395, -4.1560565, -2.636189, -4.2876665, -2.392099, -4.3660257, -2.5009519, -4.3658991, 4.57567286, -4.1736209, -2.0326615, -0.3333983, -0.679385, -0.5227896, -1.2794482, 8.28081651, -0.0465274, -8.08e-05, 0.19184617, 0.37633093, 0.20320254, 0.3232945, 0.32747284, 8.67332563, -1.0602642, -3.8995319, -3.69413, -4.0302605

[…] with the weights (as above) for the rest of the weighted features and Negative […]
The contribution of each structural feature to our model is evaluated through a function (R) that quantifies the relative feature importance, rather than through the actual corresponding feature weight (W). Briefly, the importance R of each feature is expressed as the product of the weight and the corresponding standard deviation (SD) of the feature values in the training set. We prefer to assess the feature contribution to the model through the R rather than the W value, because R takes into account the variability of the data set, normalizing the values with the corresponding SD (11). Moreover, it is possible to fall into the trap of considering that a feature with a large positive weight can enhance the […] with a positive weight […] when a feature with a large positive weight but […].

[…] PerlFiles/outputs/NetbeansCommandLineOuts) then calculate their significance R. The perl script that implements this is the ari!"."#. The importance of each feature is depicted in Figure. This chart verifies that the percentage of Adenines and Thymines in a sequence plays an important role in deciding whether a DNA sequence contains a ncRNA gene (I(A)=19.3669 and I(T)=32.7061). But the most significant feature for making a decision is the percentage of GC content in the sequence (I(GC_CONTENT)=64.6421).
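The importance measure described above, the product of a feature's weight and the standard deviation of its values in the training set, can be sketched in Python as follows. This is a minimal illustration with hypothetical names, not the perl script used in the project; in particular, the use of the population standard deviation (rather than the sample one) is an assumption.

```python
from statistics import pstdev


def feature_importance(weights, training_rows):
    """Compute R = W * SD for every weighted feature.

    weights: dict mapping feature name to its RVM weight W.
    training_rows: list of dicts mapping feature name to its value
                   for one training sequence.
    """
    importance = {}
    for name, w in weights.items():
        values = [row[name] for row in training_rows]
        # Population SD is an assumption; swap in statistics.stdev
        # if the sample SD is the intended normalization.
        importance[name] = w * pstdev(values)
    return importance
```

A feature with a large weight but almost constant values gets an importance close to zero, which is the reason for preferring R over W when ranking features.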
[…] the fact that most of them are negatively correlated with each other.

Pointer  Feature
1-23     […]
24       GC_Content
25       A/T
26       AT/GC
27       "AGCTAAGCA"
28       "CATGGGAAA"
29       "GCAGGGCTG"
30       "CACTGAAGC"
31       "ATGGGAGAC"
As far as the features' importance is concerned, it is also verified that GC content is the most significant feature (I=68.0000). The quantities of Thymines and Adenines in a sequence come second and third on the list.

[…] motifs 5-9 are strongly correlated. Essentially, this is to be expected, as these motifs have a very small presence in the sequences. Also, the "AAGCTAAGC" motif is highly non-significant. Strand and G do have some relationship, as we can see from Figure, but they are verified as features which affect our model negatively (Figure).
[…] importance as a predictor. Nevertheless, in our project […] between these basis functions. Consequently, […] weighted features behave. The features for […] are the following:

Pointer  Feature
1        AA
2        AG
3        AC
4        AT
5        GA
6        GG
7        GC
8        GT
9        CA
10       CG
11       CC
12       CT
13       TA
14       TG
15       TC
16       TT
17       AAA
18       GGG
19       CCC
20       TTT
21       A/T
22       AT/GC
One thing that shows that most of the above basis functions affect our model in a "negative way" is the following chart (Figure). We can readily observe that most of the features continue to have the same behavior: most of them have a negative weight. Only $% and &% are related to each other. This can be logically explained by Thymine: both dimers contain Thymine, which leads to the conclusion that Thymine plays a role. Remember from before (Figure), in the first training step, that the presence of A and T plays a really significant role in our model. In particular, as far as importance is concerned, CT proves to be quite important among the negative features.
Adenine, Thymine, GC content and the motifs 6 and 8 are strongly correlated (Figure). Not only is their relationship stable, but their importance is too. The Figure shows that their importance in a model which includes only these basis functions is steady. Last but not least, there are not many differences in importance magnitude (most features' importance measure ranges from 0.15 to 0.55). So we cannot say, for example, that the motif "AGCTAAGCA" has a more important role than GC content.

Pointer  Feature
1        Position
2        A
3        T
4        GC_Content
5        "AGCTAAGCA"
6        "CATGGGAAA"
[…] works have shown that GC-rich isochores include many protein coding genes; thus determining the ratio of these specific regions contributes to mapping gene-rich regions of the genome. For example, as we said in the beginning, it has been shown that human genes associated with CpG islands are located in the GC-richest compartment of the human genome […] with CpG islands increase in number as they increase […]. For this reason we created 2 distributions in order to see the differences between positive (known ncRNA genes) and negative (protein coding genes) examples.

[…] little richer in GC content than known ncRNAs, as in the human genome. The G+C content is bigger than 40% in about 3500 ncRNA sequences and in 3900 Protein Genes. The percentages of Adenine and Thymine in genome sequences were also extracted and verified as significant predictors. This is quite logical, as we showed that GC content plays an important role in our model. Consequently, the remaining bases, A and T, may also be playing some role. The following distributions show us the percentage of Adenines and Thymines (with respect to the length of each sequence) for both Positive and Negative examples.

[…] and Thymines content in each sequence does not exceed 40%. The sequences which encode a protein are a little richer in Adenines than the ones which give RNA genes, as in the case of GC content. On the other hand, both protein coding and ncRNA genes have the same content of Thymines. We cannot extract any major difference between our examples except for the kind of distribution of Adenines in the two datasets: the allocation in Protein Genes is more balanced than in ncRNA ones. Therefore, GC content is really a significant predictor for deciding whether a DNA sequence will be translated into a protein or not.
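The comparison of GC content between the positive and negative sets can be sketched in Python as follows. This is only an illustration with hypothetical function names: the 40% cut-off is the one quoted in the text, while the number of histogram bins and the strict "greater than" comparison are assumptions.

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a sequence (0.0 for an empty one)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def count_above(seqs, threshold=0.40):
    """Number of sequences whose GC fraction exceeds the threshold."""
    return sum(1 for s in seqs if gc_fraction(s) > threshold)


def gc_histogram(seqs, n_bins=10):
    """Counts of sequences per equal-width GC-fraction bin over [0, 1]."""
    bins = [0] * n_bins
    for s in seqs:
        # A fraction of exactly 1.0 falls into the last bin.
        idx = min(int(gc_fraction(s) * n_bins), n_bins - 1)
        bins[idx] += 1
    return bins
```

Running `gc_histogram` separately on the ncRNA sequences and on the protein coding sequences gives the two distributions to compare; the same functions work for Adenine or Thymine content if the counted bases are changed.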
References
1. Larrañaga P, Calvo B, Santana R, Bielza C, Galdiano J, Inza I, et al. Machine learning in bioinformatics. Brief. Bioinform. 2006 Mar 1;7(1):86-112.
2. Liu JM, Camilli A. A broadening world of bacterial small RNAs. Curr. Opin. Microbiol. 2010 Feb;13(1):18-23.
3. Jacquier A. The complex eukaryotic transcriptome: unexpected pervasive transcription and novel small RNAs. Nat. Rev. Genet. 2009 Dec;10(12):833-44.
4. Gibb EA, Vucic EA, Enfield KSS, Stewart GL, Lonergan KM, Kennett JY, et al. Human Cancer Long Non-Coding RNA Transcriptomes. PLoS One. 2011 Oct 3;6(10):e25915.
5. Gibb EA, Brown CJ, Lam WL. The functional role of long non-coding RNA in human carcinomas. Mol. Cancer. 2011 Apr 13;10(1):38.
6. Tipping ME. Sparse Bayesian learning and the relevance vector machine. J Mach Learn Res. 2001 Sep;1:211-44.
7. Down TA, Hubbard TJ. What can we learn from noncoding regions of similarity between genomes? BMC Bioinformatics. 2004 Sep 15;5(1):131.
8. Washietl S, Hofacker IL, Stadler PF. Fast and reliable prediction of noncoding RNAs. Proc. Natl. Acad. Sci. U. S. A. 2005 Feb 15;102(7):2454-9.
9. Cros M-J, Monte A de, Mariette J, Bardou P, Grenier-Boley B, Gautheret D, et al. RNAspace.org: An integrated environment for the prediction, annotation, and analysis of ncRNA. RNA. 2011 Nov 1;17(11):1947-56.
10. Asai K, Kiryu H, Hamada M, Tabei Y, Sato K, Matsui H, et al. Software.ncrna.org: web servers for analyses of RNA sequences. Nucleic Acids Res. 2008 Jul 1;36(suppl 2):W75-W78.
11. Vernikos GS, Parkhill J. Resolving the structural features of genomic islands: A machine learning approach. Genome Res. 2008 Feb 1;18(2):331-42.