,
,
, .
,,
. , (
,,,...),
(, )
,,
( , ).
, , ,
.
,,
,
, , .
, , .
, ,
( , ),
(),
, . ,
,
(,
, ...), /. ,
,
,,,.
' , ,
,.
,
,.,
, ,
.,
:
, ,
/.
,,
, ,
. ( , ) ,
,
.,
,
. ,
, ,
,,,
( ).
( , ),
,
.,
,.,
,
( ),
( ) . ,
1.
,
, ,
,,.
"" " ",
,,,
( ,
, ). ,
Creative Commons, ,
, ,
.,,
" " (open courses),
,," "
" ",,
.
, ,
. ,
, ,
,,
(,
)
.
(400 ),
.,,
""(,).1
, ,
,,,
,
,,,.2
9 , ,
, , ,
. Hidden Markov Models, ,
, ,
,8.1011
( , ). ,
12 Perl,
( ,
, , ...). ,
, ,
.
Hidden Markov Models (8),
,(7),
,
DNA/RNA.,
.
, , ,
,.
,
,,.,2
9 ( 12,
).
,
," "2-8
( ),
" " 9-12,
( ) , 1-2
(
). , , ,
(,,).
,,
,.
, Hidden Markov Models
,,
, Position Specific
Scoring Matrices,,.
,,"-"
. , ,
, 7(
). ,
RNA. , 7, .
',
,,
10. "" ,
.,12,,
,,
.
(DNA, RNA, ,
, , , - , , ,
,,...),,
. ,
. ,
,
( ),
.,
1, /
, ,
.,,
,,,
,
(,
, , ). ,
,,
/,
. ,
,,
.
, ,
,.,
, ,
,,,
,.
,(),
, , .
, , ,
. , , ,
,,
epub. ,
,
.,,
(,). -, ,
, .
, , , ,
, , ,
,,,
,
.,,
. ,
,
.9,
,,
10, .
,,,,
( ). ,
,
, , : www.compgen.org/books/bioinformatics,
email:books@compgen.org.,,
,,
.
, , , ,
, . ,
(1990,
"" ). ,
- ,
, ,
.
(),
,.
, (,
), , , .
,,, ,
,,.
, , , ,
(3)
,
. ,
,.,
,
,.,
,,,
(,,),.,
,
.
,17/11/2015
-
BLAST    Basic Local Alignment Search Tool
DNA      Deoxyribonucleic Acid
RNA      Ribonucleic Acid
MSA      Multiple Sequence Alignment
HMM      Hidden Markov Model
PSSM     Position-Specific Scoring Matrix
EBI      European Bioinformatics Institute
NCBI     National Center for Biotechnology Information
NLM      National Library of Medicine
NIH      National Institutes of Health
PDB      Protein Data Bank
CABIOS   Computer Applications in the Biosciences
SRS      Sequence Retrieval System
ISCB     International Society for Computational Biology
ISMB     Intelligent Systems for Molecular Biology
CAFASP   Critical Assessment of Fully Automated Structure Prediction
CAPRI    Critical Assessment of Prediction of Interactions
CASP     Critical Assessment of protein Structure Prediction
1:
.
,
,
. ,
.
( , ), ,
, ,
. ,
.
.
1.
,
,
,
,.
,,
. , ,
,
.
,
.
(,
,...).,
, ,
,:
( , )
.,,
.
,.
.,
,
(..
,
),
,
.
, '
, .
, (
) ( ).
, / ,
, .
( )
.,,
(, ,
/,...).,,
.,
,.,
,
ISI,
(Mathematical and Computational Biology),
.,,
.
,
.
,,,
.
1.1.
, (bioinformatics),
1990.,
,,informatics
.,,
(
)1990,
.,20,
,
,
.
, (Hagen, 2000; Ouzounis & Valencia, 2003;
Roberts, 2000; Searls, 2010; Trifonov, 2000),.
, ,
,
.
,
,Hardy, Weinberg Fisher, Wright,
,
19501960(1.2).,
Chargaff
DNA,
. ,
Watson Crick DNA
(Wilkins,
Franklin ). 1960
JacobMonod().
,(),
Perutz Kendrew 1962, 1960
(,,...),
.,
1951,RNA1967.
,
1960.
1.2:
1960.
,19501960
,,Shannon,
Turing, von Neumann, (strings),
, Chomsky.,
, ,
,1960
.,
,.
,,
6420,
1960,
.
,
.
,
Zuckerkandl Pauling, Fitch and
Margoliash Kimura Nei.
Ramachandran
,
Ramachandran
,helicalwheelplots.
1.1.2. 1970
,(1.3).
,,
,Kimura.,
, .
Fitch
().,
RNA ,
Crick 1970. ,
Sanger Maxam-Gilbert,
,
.
,Anfinsen
.
,
,
Chou Fasman 1975 (
).,
RNA.,
, .
, RNA
Nussinov.
1970,
(),Needleman
Wunsch 1970.
,1970(dot-plot).
,.PDB
1972(10),Dayhoff1978
,
PIR. , /
( ,
, ...).
,1970
.,
, ,
.
1.3: 1970.
1.1.3. 1980
1980
,
.
(Science, Nature, Nucleic Acids Research),
(Computer Applications in the Biosciences).,
(1.4).
,
.
Smith Waterman 1981,
,
Arratia, Waterman Karlin,
(FASTA).
,
CLUSTAL.
(sequence profiles)
,
.,
.
DNA,PCR,
,
(,
).,1986
(GenBank EMBL Data
Library), SwissProt, 1987.
(EMBnet ),
(LiMB). , NIH EMBL
.
1.4: 1980.
,
.,NMR
. ,
,
(homology modelling).
(),
.,
.,
( ),
(,,positive inside rule)
.
,RNA.
,Felsenstein
,
(
).
(
).
(PAM),,
,
(..
, , , ...). ,
rRNA,
,.
1.1.4. 1990
1990(
). ,
, / ( 1.5).
,
1990. , , Bioinformatics, 1995
Computer Applications in the Biosciences (CABIOS).
1.5: 1990
.
,,BLAST (Basic Local
Alignment Search Tool), NCBI 1990. BLAST
(score) ( Karlin-Altschul)
,
,
.,,
,PSI-BLAST.,
(CLUSTAL)
/,.
.
,Rasmol Kinemage,
(threading),(docking).
,
,70%
( , , ...). ,
/CASP.
,
,,
(gene finders).
, DNA
, , ,
,
.
,
.
(SCOP CATH),
(patterns) , PROSITE, PFAM INTERPRO.
,,EBI (European Bioinformatics
Institute), ,
(Hinxton) 1992 EMBL Wellcome Trust.
EMBL, EMBL-Bank SwissProt-TrEMBL
,TrEMBL., 1993 ISMB
ISCB.
,,Krogh,Eddy,
Hughey , ,
Hidden Markov Model (HMM),
,.
HMMER profile HMM,
PFAM,,
, TMHMM
, SignalP .
,
,,
( profile HMM,
, ). ,
,,.
1.1.5.
2000,
.,
,
...,
( )
(1.6). ,
, .
,
.,
(Next Generation Sequencing), (RNAseq),
.
,
.
, (web-servers),
. , -(meta-genomics)
.
, 2000,
, ,
SwissProt PIR
,Uniprot( EBI).
, ,
,
EBI ELIXIR , ,
.
1.6: 2000.
, (SNPs)
GWAS (Genome-Wide Association Studies),
, ,
DNA.,
, , (HapMap
project),-(data
integration). ,
. ,
,(Proteomics),
,,RNA(ncRNA),
,
(Copy Number Variations- CNVs). ,
.
,Support Vector Machines (SVMs)
,
. , -,
HMMER
BLAST.,
ab initio.,,
,
, ,
, .
, , . ,
,
.
,
,
. ,
, , ,
,
(Ouzounis, 2012).
1.2.
,,
.,
NCBI
(http://www.ncbi.nlm.nih.gov/Class/MLACourse/Modules/MolBioReview/bioinformatics.html):
Bioinformatics is the field of science in which biology, computer science, and information technology merge
into a single discipline. There are three important sub-disciplines within bioinformatics: the development of
new algorithms and statistics with which to assess relationships among members of large data sets; the
analysis and interpretation of various types of data including nucleotide and amino acid sequences, protein
domains, and protein structures; and the development and implementation of tools that enable efficient access
and management of different types of information
,Luscombe (Luscombe, Greenbaum, & Gerstein, 2001):
The mathematical, statistical and computing methods that aim to solve biological problems using DNA and
amino acid sequences and related information.
,
( )
(). Richard Durbin:
I do not think all biological computing is bioinformatics, e.g. mathematical modelling is not bioinformatics,
even when connected with biology-related problems. In my opinion, bioinformatics has to do with
management and the subsequent use of biological information, particular genetic information.
, International Society for Computational Biology (ISCB)
,,,:
a scholarly society dedicated to advancing the scientific understanding of living systems through
computation.
,
,
, . ,
(),
(),,.
, ( )
:
,
(1.7).
,,,
. (
),/,
, .
DNA,
.,,
(GWAS),
(Molenberghs, 2005).
, , .. ,
. ,
,.
,
,
( ), ,
,
. Eddy (Eddy, 2005),
,
:
.
Eddy,
(
),
, (
).
, , ,
-
.,,
,
.
,
,
. ,
( ,.)Eddy
I'm sure my union card has expired (Eddy, 2005).
,
, : ,
,
,ECDL(,
ECDL).
1.7: .
, ,
, ,
,
,.
,
( ,,
Eddy).
. ,
(.. residues) , ...
, , .
; ;(.. residues residuals
,
,
). ,
,,
( ,
10 ). ,
,
(,,
). Venn Anthony Fejes,
(1.8).
1.8: http://blog.fejes.ca/?p=2418
,
(Ouzounis, 2002). ,
, .
,.;
. , ,
, .
, ,
,
.,,
. , ,
,,
50 . ,
(, , ), ,
, , ,
,
. , ,
, (
),.,,
..,(
),
.
,
,
,
.,
(,
, , , ).,
,
,
.,
,
,
.
,
,
(Chalmers, 1999).
,
.
(),
,,...
,
, , ,
... ,
( , PCR, ...).
, .
(..
Watson Crick DNA
).,
.,
.
(.. ) , ,
(.. )
.
, ,
(
),,
.,,
. ,
,UUU(Phe).
, , (
)
. 1960,
,,
(
,).
;
;..,
,
,--,---.,
,
, (
).
,
(
),,,
.
,
.,
,,
- -
. , : )
)
.
,
.,99%
,
,
. , ,
,,
.
( )
,,
, .
,
(,,
).,
,,
.
,
, (Ouzounis, 2000).
, .,
, ,
. , ,
. ,
, , , pyrosequencing,
.,
,.
,
.,,
.
, , ,
.
,
,/,
. ,
,.
, ,
, ... , ,
()(..,,
,,...).
, Edgar
Wingender: Thus, scientific articles publishing experimental findings which have been evaluated using
computational tools, very often give credit to them in the Methods or Results sections with phrases such as
"Computer analysis revealed that ...", without any appropriate reference. In contrast, any experimental
methodology used is extensively explained in these papers, down to the detailed listing of buffer systems,
voltage/current conditions of the electrophoresis systems etc. (Wingender, 1998).
,
,,
,
(paper).,,
,
,,,
,
, .
,,
,.,
,
. ,
, ,
.,,...,
.,
,
,
.
,
.,,
, . , ,
.
1.10: , . ,
,
. , ,
,...
. , ,
,
.
,,
(
BLAST).
.,
(, , ,
...).
,
,
().,
.,
, ...
.
,.,
.
,
.
, 2 3 ( 1 ). ,
bioinformatician 2 bioinformaticist
3,.
bioinformatics scientist bioinformatics engineer 2
3 (Welch et al., 2014),,
(),
, . ,
. ,
,(
), ,
.
(
), ,
...,
,.,
,,
,
.,(
), ,
, .
3 , ,
, ,
.
,
. ,
,
,
,
(Altman, 1998; Ditty et al., 2010;
Floriano, 2008; Honts, 2003; Searls, 2012; Welch, et al., 2014; Yan, Ban, & Tan, 2014).
, 1990 ,
. ()
(.),
(,,,,,),
Altman:
, , ,
(,,,...).
, , ,
. ,
,
/. ,
,
,
,,,,
,,.
1.3.
.
,
,
,
, .
.
-(keywords MESH terms),
, .
,
.
, Patra Mishra (Patra & Mishra, 2006)
MESH terms "Bioinformatics" OR "Bioinformatics" OR "Computational Biology" OR
"Computational Molecular Biology" OR "Biology Computational" OR "Molecular Biology; Computational"
OR "Genomics" PUBMED 16.178
, 2004, 1806
. , 2000 .
,199012,20001000.
(98%)
(97%).(42%)
(10%),(6%)(4%).
Bioinformatics
Nucleic Acids Research
Genome Research
Science
Nature
Proceedings of the National Academy of Sciences USA
Proteomics
Genome Biology
Journal of Molecular Biology
Proteins
Nature Biotechnology
BMC Bioinformatics
Pacific Symposium on Biocomputing
Journal of Computational Biology
Tanpakushitsu Kakusan Koso
Journal of Biological Chemistry
Drug Discovery Today
Trends in Biotechnology
Genomics
Briefings in Bioinformatics
1.1: 20 Patra Mishra (Patra &
Mishra, 2006).
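As a rough illustration of how a search expression like the one quoted above is assembled, the sketch below joins the listed terms with OR. It is only a sketch under stated assumptions: the names `MESH_TERMS` and `build_or_query` are hypothetical, the repeated "Bioinformatics" entry is collapsed into one, and the code neither contacts PUBMED nor reproduces whatever field tags Patra and Mishra actually used.

```python
def build_or_query(terms):
    """Join search terms into one quoted OR expression, dropping repeats."""
    unique_terms = []
    for term in terms:
        if term not in unique_terms:  # keep first occurrence, preserve order
            unique_terms.append(term)
    return " OR ".join(f'"{term}"' for term in unique_terms)


# Terms as they appear in the quoted query (hypothetical variable name).
MESH_TERMS = [
    "Bioinformatics",
    "Bioinformatics",
    "Computational Biology",
    "Computational Molecular Biology",
    "Biology Computational",
    "Molecular Biology; Computational",
    "Genomics",
]

QUERY = build_or_query(MESH_TERMS)
print(QUERY)
```

Keeping the terms in a list rather than a hand-written string makes it easy to add or drop a term and regenerate the expression.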
,20
1/3.1.1.
,
(Bioinformatics, BMC Bioinformatics, Pacific Symposium on Biocomputing, Journal of Computational
Biology...), (Nucleic Acids Research,
Genome Research, Genome Biology, Journal of Molecular Biology) (Science,
Nature, Proceedings of the National Academy of Sciences USA).,
,
, . ,
(,
, ),
.
2006, Perez-Iratxeta, Andrade-Navarro Wren (Perez-Iratxeta,
Andrade-Navarro, & Wren, 2007) PUBMED
3(,
). 1996 2005 (MESH
terms): comput*, *informatic*,
algorithm*, software database internet,
online, world wide web, web-based, http:* ftp:*.
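The truncated search terms listed above (comput*, *informatic*, algorithm*, http:*, ftp:* and the plain keywords) can be read as wildcard patterns. The sketch below shows one way such patterns could be matched against a title or abstract. It assumes that `*` stands for any run of characters, as in PUBMED truncation, and it matches single whitespace-separated tokens only, so the multi-word phrase "world wide web" is left out. The names `PATTERNS` and `matching_patterns` are hypothetical, and this is not the procedure Perez-Iratxeta et al. actually ran.

```python
from fnmatch import fnmatchcase

# Single-token patterns taken from the list above (hypothetical variable name).
PATTERNS = [
    "comput*", "*informatic*", "algorithm*", "software", "database",
    "internet", "online", "web-based", "http:*", "ftp:*",
]


def matching_patterns(text, patterns=PATTERNS):
    """Return the patterns that match at least one token of the text."""
    # Lowercase, split on whitespace, and trim common surrounding punctuation.
    words = [word.strip(".,;()") for word in text.lower().split()]
    return [
        pattern for pattern in patterns
        if any(fnmatchcase(word, pattern) for word in words)
    ]


print(matching_patterns("A computational algorithm for bioinformatics databases."))
```

Note that a plain keyword such as `database` matches only that exact token; the plural "databases" would need a pattern like `database*`.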
1.11: Perez-Iratxeta (Perez-Iratxeta,
Andrade-Navarro, & Wren, 2007)
2007,
(SYMBIOMATICS), (bigrams,
) ,
(2000-2005)(1990-2000)
, ,
(Rebholz-Schuhman et al., 2007).
1.2.
Bioinformatics
AMIA Annu Symp Proc
Biosystems
Artif Intell Med
BMC Bioinformatics
BMC Med Inform Decis Mak
Brief Bioinform
Int J Med Inform
Comput Methods Programs Biomed
J Am Med Inform Assoc
IEEE Trans Inf Technol Biomed
Medinfo
J Bioinform Comput Biol
Methods Inf Med
J Biomed Inform
Proc AMIA Symp
J Comput Aided Mol Des
J Comput Biol
Pac Symp Biocomput
1.2:
SYMBIOMATICS.
2000-2005,
1990-2000. (1990-2005)
gene expression, amino acid, protein sequence,
information system, health care decision support.
1990-2000
2000-2005
.1.12,,
, Support Vector Machines
.
:,
,,(..),
( ,
).,
(.. support vector machines).,
(..,
,...).
1.12:
, SYMBIOMATICS
,2014
(Song, Kim, Zhang, Ding, & Chambers, 2014).
,
, ,
.
,(
), 73% WoS
(PUBMED).
, ,
PubmedCentral, PUBMED
(open access)
.
,
.
,,
,.
1.3
,.
2000-2003,
. 2004-2007
, , . 2008-2011,
.
,mutation RNA.,
protein binding, algorithm/method, cell/model,
network/interaction, genome sequence, immune/virus, gene expression, genetic/evolution, database/software,
gene transcription, DNA/chromosome, ontology/mining, gene/genomics cancer/cell.
,
,
..,
202000-2003,2009-20116,
9 7.
, ,
20().
,
,
(King, 2004),
(http://www.natureindex.com/).
BMC Bioinformatics
Source Code for Biology and Medicine
BMC Genomics
Advanced Bioinformatics
PLoS Biology
BioData Mining
Genome Biology
Journal of Computational Neuroscience
PLoS Genetics
Journal of Proteome Research
PLoS Computational Biology
Journal of Biomedical Semantics
BMC Research Notes
Journal of Computer-Aided Molecular Design
Bioinformatics
Genome Integration
Molecular Systems Biology
Journal of Molecular Modeling
BMC Systems Biology
Bulletin of Mathematical Biology
Comparative and Functional Genomics
Pharmacogenetics and Genomics
Bioinformation
Statistical Methods in Medical Research
Theoretical Biology and Medical Modeling
Neuroinformatics
Human Molecular Genetics
Genomics
The EMBO Journal
Protein Science
Cancer Informatics
Physiological Genomics
Genome Medicine
Trends in Genetics
Evolutionary Bioinformatics
Journal of Proteomics
Biochemistry
Proteomics
Algorithms for Molecular Biology
Trends in Biochemical Sciences
EURASIP Journal on Bioinformatics and Systems Biology
Journal of Biotechnology
Journal of Molecular Biology
Trends in Biotechnology
Molecular & Cellular Proteomics
Briefings in Functional Genomics & Proteomics
Mammalian Genome
Journal of Theoretical Biology
1.3: (Song, Kim, Zhang, Ding, & Chambers,
2014).
Stanford University, Harvard University
University of Washington . Stanford 3
2000-200312004.Harvard6,232000-2003,
2004-2007 2009-2011. , University of Washington 5 2
.,University
of Cambridge (11,8 5), University College London (11,1110), University of Oxford
2000-2003,10
2004-2008 6 2009-2011. , Brandeis University
12000-2003,122004-2007
2008-2011. University of California, Berkeley
2714.
.
.
2000-2003 "Gene ontology: tool for the
unification of biology" Nature Genetics Gene Ontology Consortium
20 .
(D. Botstein, G. Rubin, G. Sherlock, M. Ashburner, J. Cherry, C.
Ball, J. Matese, H. Butler). ,
("Initial sequencing and analysis of the human genome") Nature. 249
48 . 3 "Significance
analysis of microarrays applied to the ionizing radiation response" V. Tusher, R. Tibshirani, G.
Chu, Stanford. R. Tibshirani 12.
2004-2007, "Bioconductor: open
software development for computational biology and bioinformatics" 25
19 . 4
. R,
"R: A language and environment for statistical computing" 3 "Transcriptional regulatory
code of a eukaryotic genome" 20 4 . ,
2008-2011,
PFAM "The Pfam protein families database" 133 (
A Bateman R. Durbin,
). KEGG, "KEGG for
linking genomes to life and the environment" 113,
"Mapping short DNA sequencing reads and calling variants using mapping quality
scores" H. Li, J Ruan R. Durbin.
10
(,
,
).
,
Impact Factor., Proceedings of the National
Academy of Sciences, Nucleic Acids Research, Nature, Bioinformatics Science
. BMC Bioinformatics
6 2004-2007 5 2008-2011. BMC Bioinformatics,
2004-2007 PLoS Biology, BMC Genomics, Nature Reviews
Genetics.
20 2008-2009 PLoS One, PLoS Genetics, PLoS Computational Biology,
Nature Biotechnology Nature Methods.
( Bioinformatics, BMC Bioinformatics PLoS Computational Biology)
.
,
.
,
10,
( ,
,,...).
.,
2000-2003, 2004
. ,
,
,
.
1.4.
.
,
,.
1.4.1.
,,
(,.).
(),
,
( 1.13) (..
).,
(,,&,2012;
,,,&,2013;,,,&,2014)
(, , , &
, 2012).
,
,.
,
(
) . , Nature,
(http://www.natureindex.com)
(
).(1.14),
(WFC),,32
, ()
2004().
1.13: , 27.
1.14: www.natureindex.com
;
:
,
( , , 43
,http://data.worldbank.org/,2012-2013).,
WFC,
, 95%. ,
. . , 33
Nature(-),
.
1.15: WFC .
1.16: WFC .
,
.
, .
,,
(http://www.scimagojr.com, 2011-2013).
,
,
,~70%.,
.
,
WFC
.,,
.,29%
()
48% ,
(48%).
,-15%,19
.
1.17: WFC/ .
,
.,
,
,
.,
.:
.
, .
(),
.
1/32.1%(0.69%)(&,2014).
1.4.2.
, . ,
.
(), http://www.hscbb.gr
,
.
(),
2009 (affiliated) (International Society
for Computational Biology, . http://www.iscb.org/iscb-affiliates-europe#hellenic),
GOBLET (http://www.mygoblet.org) ELIXIR
(https://www.elixir-europe.org/).
, (..
http://hscbb11.hscbb.gr, http://hscbb12.hscbb.gr ...)
, .
,..,
, , . ,
, 120 ,
.(,
,)40,
.
30
,
.
,
2010 (Bagos, 2010).
,,
.
:
(
), ,
.
ISI WoS, MATHEMATICAL & COMPUTATIONAL
BIOLOGY .
,
. ,
Nucleic Acids Research
(web-server database issues). ,
PUBMED ( WoS).
1.4. ,
,
(JMB, PLoS Biology, Protein
Engineering...), (Science, Nature),
.
(Machine Learning, Pattern Recognition...).,
-.,
,
,
. -
(GREECE CYPRUS), SQL
Yahoo, Term
Extraction web service (http://developer.yahoo.com/search/content/V1/termExtraction.html),
- (
KEYWORDS WoS).
PLOS Comput Biol
Journal of Computer-Aided Molecular Design
Bioinformatics
Nucleic Acids Research (web-server and database issues)
BMC Syst Biol
The Open Bioinformatics Journal
BMC Bioinformatics
Statistical Applications in Genetics and Molecular Biology
Biostatistics
Source Code for Biology and Medicine
J Theor Biol
Online Journal of Bioinformatics
Stat Method Med Res
Journal of Integrative Bioinformatics
IET Syst Biol
Journal of Bioinformatics and Computational Biology
J Comput Neurosci
International Journal of Data Mining and Bioinformatics
J Mol Graph Model
International Journal of Computational Biology and Drug Design
Stat Med
International Journal of Bioinformatics Research and Applications
Biometrika
In Silico Biology
Evol Bioinform
Genomics, Proteomics & Bioinformatics
B Math Biol
Genome Informatics
Biometrics
EURASIP Journal on Bioinformatics and Systems Biology
Algorithm Mol Biol
Current Bioinformatics
Med Biol Eng Comput
BioData Mining
J Math Biol
Advances and Applications in Bioinformatics and Chemistry
IEEE T Inf Technol B
Applied Bioinformatics
J Comput Biol
International Journal of Bioinformatics
Curr Bioinform
Bioinformation
SAR QSAR Environ Res
Pac Symp Biocomput
Math Biosci
Database
Comput Biol Med
Genome Res
Biometrical J
BMC Genomics
Math Med Biol
Int J Data Min Bioin
J Agr Biol Envir St
J Biol Syst
1.4: .
[Figure: TOTAL_PAPERS (vertical axis, 10 to 60) by YEAR (horizontal axis, 1978 to 2010)]
1.18: .
[Figure: bar chart by journal, vertical axis 10 to 80]
1.19:
.
405 1976
2010.,,1999,
5(1.18).,405
681 (1,68 ).
( ),
, ,
, . 681 , 636 3
45,18
9 .
(
). , 63
,.
,
,,Impact Factor.
,
23
Computer Applications in the Biosciences,
, Bioinformatics. , A
protein secondary structure prediction scheme for the IBM PC and compatibles 1988 PBM: a
software package to create, display and manipulate interactively models of small molecules and proteins on
IBM-compatible PCs 1995 ( Perrakis A, Constantinides C, Athanasiades A),
,
.
H 2000 Fickett JW, Hatzigeorgiou AC.
"Eukaryotic promoter recognition" Genome Res. 2 Promponas VJ, Enright AJ,
Tsoka S, Kreil DP, Leroy C, Hamodrakas S, Sander C, Ouzounis CA "CAST: an iterative algorithm
for the complexity analysis of sequence tracts" Bioinformatics 3 Pavlou S,
Kevrekidis IG "Microbial predation in a periodically operated chemostat - a global study of the
interaction between natural and externally imposed frequencies", Math Biosci.
2001-2005, Carninci P, Waki K, Shiraki T, Konno H, Shibata
K, Itoh M, Aizawa K, Arakawa T, Ishii Y, Sasaki D et al. (V Aidinis)
"Targeting a complex transcriptome: The construction of the mouse full-length cDNA encyclopedia"
Genome Res, 2 Patrinos GP, Giardine B, Riemer C, Miller W, Chui DHK,
Anagnou NP, Wajcman H, Hardison RC "Improvements in the HbVar database of human
hemoglobin variants and thalassemia mutations for population and sequence variation studies" Nucleic
Acids Res 3 Bagos PG, Liakopoulos TD, Spyropoulos IC, Hamodrakas SJ
"PRED-TMBB: a web server for predicting the topology of beta-barrel outer membrane proteins",
Nucleic Acids Res. , 2006-2010, Liolios K,
Tavernarakis N, Hugenholtz P, Kyrpides NC "The Genomes On Line Database (GOLD) v.2: a
monitor of genome projects worldwide" Nucleic Acids Res, 2
(Liolios K, Mavromatis K, Tavernarakis N, Kyrpides NC) "The Genomes On Line Database
(GOLD) in 2007: status of genomic and metagenomic projects and their associated metadata"
Nucleic Acids Res, 3 Ulrich EL, Akutsu H, Doreleijers JF, Harano Y,
Ioannidis YE, Lin J, Livny M, Mading S, Maziuk D, Miller Z et al. "BioMagResBank"
(Nucleic Acids Res).
20 ,
74 web-servers
. , 20 /
89 web servers .
(Impact Factor,
),(,),
.
,
(..
).,
2005,.
,
, Impact Factor,
,,
(1.5).,
(6),
(8)(9).
. (
18),
1.5.,,
, ,
... ,
(3),2012
, , .
(6)(9).
                                                Impact Factor
University of Athens              53     706    159,86
University of Ioannina            41     192    65,53
University of Central Greece      34     498    97,39
Natl Tech Univ Athens             28     139    54,21
University of Patras              21     66     48,17
BSRC Alexander Fleming            18     362    103,11
Aristotle Univ Thessaloniki       16     49     25,66
Natl Ctr Sci Res Demokritos       15     110    35,9
Tech Educ Inst Athens             14     46     28,73
Acad Athens | Biomed Res Fdn      12     13     35,33
Ctr Res & Technol Hellas CERTH    9      116    32,87
University of Thessaly            9      70     17,46
University of Cyprus              8      139    32,53
Tech Educ Inst Lamia              6      42     11,47
Democritus Univ Thrace            6      31     15,44
1.5: , ,
.
,
, 15
. ,
,
. , ,
,
.
1.20: .
1.4.3.
,
,(
).
(
),,,/,
18 ( 1.6).
, /,
.,
. ,
, / . 11 18
, (..
), 3 18
3 , (
. . ,
).
,
, , ,
.
,3
(, , , ,
).,
4
, .
, , ,
,
.
.
,
,/(
). ,
,
.
,
,,.
(*)
(*)
(*)
(*)
(*)
(*)
(*)
(**)
(**)
/
(**)
,
,
,,
(*)
,
(*)
(*)
(*)
(*)
1.6: . (*)
.
[Figure: bar chart, vertical axis 0 to 12]
1.21:
.
, 6
() ()
(
).1.7.
(
)
,
(
)
,,
,
(
)
/
1.7: .
,
,()
, , , /
.2003,
(
)
. ,
. ,
,
.,
,,
,
()
,.,
. ,
. ,
,
.
,
.,
,
.
Altman, R. B. (1998). A curriculum for bioinformatics: the time is ripe. Bioinformatics, 14(7), 549-550.
Bagos, P. G. (2010). Bioinformatics and Computational Biology in Greece: a bibliometric study. Paper
presented at the 5th Conference of HSCBB (HSCBB10), Alexandroupolis.
Chalmers, A. (1999). What Is This Thing Called Science? (3rd revised edition ed.). Hackett: University of
Queensland Press, Open University press.
Ditty, J. L., Kvaal, C. A., Goodner, B., Freyermuth, S. K., Bailey, C., Britton, R. A., ... Kerfeld, C. A.
(2010). Incorporating genomics and bioinformatics across the life sciences curriculum. PLoS Biol,
8(8), e1000448.
Eddy, S. R. (2005). "Antedisciplinary" science. PLoS Comput Biol, 1(1), e6.
Floriano, W. B. (2008). A portable bioinformatics course for upper-division undergraduate curriculum in
sciences. Biochem Mol Biol Educ, 36(5), 325-335.
Hagen, J. B. (2000). The origins of bioinformatics. Nat Rev Genet, 1(3), 231-236.
Honts, J. E. (2003). Evolving strategies for the incorporation of bioinformatics within the undergraduate cell
biology curriculum. Cell Biol Educ, 2(4), 233-247.
King, D. A. (2004). The scientific impact of nations. Nature, 430(6997), 311-316.
Luscombe, N. M., Greenbaum, D., & Gerstein, M. (2001). What is bioinformatics? A proposed definition and
overview of the field. Methods Inf Med, 40(4), 346-358.
Molenberghs, G. (2005). Biometry, biometrics, biostatistics, bioinformatics, ..., bio-X. Biometrics, 61(1), 1-9.
Ouzounis, C. A. (2000). Two or three myths about bioinformatics. Bioinformatics, 16(3), 187-189.
Ouzounis, C. A. (2002). Bioinformatics and the theoretical foundations of molecular biology. Bioinformatics,
18(3), 377-378.
Ouzounis, C. A. (2012). Rise and demise of bioinformatics? Promise and progress. PLoS Comput Biol, 8(4),
e1002487.
Ouzounis, C. A., & Valencia, A. (2003). Early bioinformatics: the birth of a discipline--a personal view.
Bioinformatics, 19(17), 2176-2190.
Patra, S. K., & Mishra, S. (2006). Bibliometric study of bioinformatics literature. Scientometrics, 67(3), 477-489.
Perez-Iratxeta, C., Andrade-Navarro, M. A., & Wren, J. D. (2007). Evolving research trends in bioinformatics.
Brief Bioinform, 8(2), 88-95.
Rebholz-Schuhman, D., Cameron, G., Clark, D., van Mulligen, E., Coatrieux, J. L., Del Hoyo Barbolla, E., ...
Van der Lei, J. (2007). SYMBIOmatics: synergies in Medical Informatics and Bioinformatics - exploring
current scientific literature for emerging topics. BMC Bioinformatics, 8 Suppl 1, S18.
Roberts, R. J. (2000). The early days of bioinformatics publishing. Bioinformatics, 16(1), 2-4.
Searls, D. B. (2010). The roots of bioinformatics. PLoS Comput Biol, 6(6), e1000809.
Searls, D. B. (2012). An online bioinformatics curriculum. PLoS Comput Biol, 8(9), e1002632.
Song, M., Kim, S., Zhang, G., Ding, Y., & Chambers, T. (2014). Productivity and influence in bioinformatics:
A bibliometric analysis using PubMed central. Journal of the Association for Information Science and
Technology, 65(2), 352-371.
Trifonov, E. N. (2000). Earliest pages of bioinformatics. Bioinformatics, 16(1), 5-9.
Welch, L., Lewitter, F., Schwartz, R., Brooksbank, C., Radivojac, P., Gaeta, B., & Schneider, M. V. (2014).
Bioinformatics curriculum guidelines: toward a definition of core competencies. PLoS Comput Biol,
10(3), e1003496.
Wingender, E. (1998). ISB: Just Another Journal? In Silico Biol, 1(1), 1-4.
Yan, B., Ban, K. H., & Tan, T. W. (2014). Integrating translational bioinformatics into the medical
curriculum. Int J Med Educ, 5, 132-134.
,.,,.,&,.(2014). .:
,.,&,.(2014,25/11/2014)..
.
,.,,.,&,.(2012).1996-2010:
Retrieved
fromhttp://reports.metrics.ekt.gr/
,.,,.,,.,&,.(2012).
2000-2010- Retrievedfromhttp://metrics.ekt.gr/el/node/15
,.,,.,,.,&,.(2013).
1996-2010:-
Scopus Retrievedfromhttp://report03.metrics.ekt.gr
,.,,.,,.,&,.(2014).
1998-2012:-
WebofScience Retrievedfromhttp://report04.metrics.ekt.gr/
2:
,
, ,
(, , , ,
...). ,
.
, ( ) ,
.
,
(DNA, RNA, ).
2.
, ,
.,
,.
,
.
,
,.
, ,
.
.
(annotation)
.
UniProtKB/Swiss-Prot,
547.599 (Rel. 2015_02 2015)
EMBL Nucleotide Sequence Database 510.014.239
(Rel.122-2014).
. ,
.
.
, , 2 ,
,.
, ,
:
, ,
,,:
()
()
2.1.
,
, . ,
,,
, ,
.
2.1.1
(2.1),
. (sequencing),
DNA RNA,
(.. )
.
. GENBANK
14 (Rel. 206, 2015)
181.336.445 187.893.826.750.
EMBL-Bank: EMBL Nucleotide Sequence Database (http://www.ebi.ac.uk/embl/)
,
(EMBL)
(EBI) Cambridge, UK. EMBL-Bank
,
.GENBANK,
,
. Fasta
BLAST. EMBL-Bank (Rel. 122 - 2014) 510.014.239.
1.094.969.877.589.
2.1: GenBank, 1982
2004.
2.2: 3
(INSDC; International Nucleotide Sequence Database Collaboration,
http://www.ddbj.nig.ac.jp/insdc/insdc-e.html).
2.1.2 .
,
( DNA),
,
.,
,
,
.
UniprotKB (Uniprot Knowledgebase, http://www.uniprot.org/), ,
(UniProt, 2014).
,Uniprot/SwissProt
Uniprot/TrEMBL
() . UniprotKB/SwissProt 547.599
(Rel. 2015_02 2015)
,
, ,
( ), . H Uniprot/TrEMBL
(Rel. 2015_02 2015) 92.124.243
. , Uniprot
,
""Uniprot/TrEMBL
Uniprot/SwissProt.,
( ,
,,,
...), , Uniprot/SwissProt ,
( ),
. ,
. Uniprot
,.
2.3: Swiss-Prot,
1986 2004.
, Uniprot 2002
, SwissProt PIR. H SwissProt 1986
(Swiss Institute of Bioinformatics)
(European Bioinformatics Institute). Protein Information
Resource (PIR - http://pir.georgetown.edu/) .
Georgetown
(NBRF) ... PIR-International Protein Sequence Database
(PSD), PIR, Munich Information Center
for Protein Sequences (MIPS), Japanese International Protein Information Database (JIPID). 2002,
PIR, EBI (European Bioinformatics Institute), SIB (Swiss Institute of
Bioinformatics), UniProt consortium. PIR-PSD
, UniProt Knowledgebase.
UniProt PIR-PSD
PIR-PSD. PIR-PSD
UniProt.
2.1.3 .
.
(, , ...),
,,
NMR.,,,
. , PDB,
.
Protein Data Bank: The Protein Data Bank (PDB, www.rcsb.org)
(Kouranov et al., 2006).
1971 Brookhaven National Laboratories (BNL) . 7
'70.'80
,PDB
(NMR). ( 2015) PDB 106.858 .
PDB
,
.
.
.
, PDB , .
,PDB
( ) ,
.,
,
,,
. ,
,
.
Uniprot PDB. ,
,
, . PDB
, . , ,
(MMDB)NCBI,
PDB.
2.4: PDB,
1977 2004.
2.1.4
,
.
,NextGeneration
Sequencing,
.
.
,
.,
,,
.,,
, ""
.,
.
,
. (raw) (
...). ( 2015), 14.031
, 1.357.732 "", (
,,-),
55.725 "" (series) 3.848 " " (datasets).
.
Array Express:
, , http://www.ebi.ac.uk/arrayexpress/ (Brazma et
al., 2003). GEO,
. ,
tutorials. 2015,
57.009 (experiments, series GEO) 1.689.237
(assays,).
Stanford Microarray Database (SMD):
Stanford,
, http://smd.stanford.edu/ (Demeter et al., 2007).
,
84.051631.
2.1.5
, DNA,
, .
(),
( , ...).
dbSNP,
,HapMap.
dbSNP: dbSNP
http://www.ncbi.nlm.nih.gov/snp (Sherry et al., 2001). (single
nucleotide polymorphisms - SNPs),
(deletion insertion polymorphisms - DIPs),
(short tandem repeats - STRs). dbSNP
(),
,,
. dbSNP
,.
NCBI
http://www.ncbi.nlm.nih.gov/books/NBK3848/. 129 (2008) 14
,.
HapMap: International HapMap Project (http://hapmap.ncbi.nlm.nih.gov/)
(HapMap,2003).
()
(,
), . ,
4 , .
, ,
.,,
,
(Linkage Disequilibrium), , .
,
.,
.
2.1.6
,,
, .,
(, ,
...). , PubMed (http://www.ncbi.nlm.nih.gov/pubmed)
NCBI 24
(MEDLINE,
online ).
, PubMed Central (
),.
PubMed ,
, . ,
tutorials(http://www.nlm.nih.gov/bsd/disted/pubmed.html).
, , SCOPUS (http://www.scopus.com/) Web
of Science (http://webofknowledge.com/). , ,
(citations).
( ),
( , ).
.
,
,,(text
mining), ,
(),
(Ananiadou,
Kell, & Tsujii, 2006; Scherf, Epple, & Werner, 2005).
2.2.
, ,
.
2.2.1
,
(domains), . ,
.
.
,
.
( ),
.
,
.,
,
,
.,)
(,pattern,...),)
. CATH SCOP,
PROSITE, PFAM, INTERPRO. ,
,.
, ,
.
PROSITE (http://www.expasy.ch/prosite/)
(sequence domains) (Sigrist et
al., 2010).
.
.
.
,,
'' ,
.
''.
""(regularexpressions),
, (profiles),
,
.,.
PROSITE '' 1716
.
,1308(patterns),11071105""(
). ,
(,).
Uniprot
""()
. ,
,
"",
.
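A PROSITE pattern can be read as a regular expression. The sketch below assumes the commonly documented pattern syntax (elements separated by `-`, `x` for any residue, `[...]` for allowed residues, `{...}` for excluded residues, `(n)` or `(n,m)` for repetition, `<` and `>` for sequence ends). The function name `prosite_to_regex` and the test peptides are illustrative only; rarer constructs, such as `>` inside square brackets, are not handled.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.rstrip(".")          # PROSITE patterns end with a period
    parts = []
    for element in pattern.split("-"):
        anchor_start = element.startswith("<")   # pattern must start at the N-terminus
        anchor_end = element.endswith(">")       # pattern must end at the C-terminus
        element = element.strip("<>")
        # Separate the residue specification from an optional (n) or (n,m) suffix.
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element
        ).groups()
        if core == "x":
            regex = "."                           # any residue
        elif core.startswith("{"):
            regex = "[^" + core[1:-1] + "]"       # any residue except those listed
        else:
            regex = core                          # single residue or [ABC] set
        if low is not None:
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if anchor_start else "") + regex + ("$" if anchor_end else ""))
    return "".join(parts)


# The N-glycosylation site pattern is a frequently quoted example.
n_glyc = prosite_to_regex("N-{P}-[ST]-{P}")
print(n_glyc)
print([m.start() for m in re.finditer(n_glyc, "ANGSAKNPTL")])
```

Short patterns such as this one match many sequences by chance, which is one reason the hit lists they produce need to be checked against other evidence.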
PFAM: Pfam (http://pfam.xfam.org/)
(Finn et al., 2014). PROSITE (
),
hidden Markov model (HMM), ,
. (2013),
14.831 80%
UNIPROT.
PFAM,PFAM-A,PFAM-B.PFAM-A
() , ,
. PFAM-B
,
PFAM-A. PFAM-B , ,
PFAM-A.
PFAM , (
HMMER,.),
,
( ). ,
,
,-(clan).
CATH: CATH (http://www.biochem.ucl.ac.uk/bsm/cath_new/index.html)
PDB
(domains) (Knudsen & Wiuf, 2010). CATH
3Angstroms
.
.:
1) (Class), 2) (Architecture), 3) ( )
.
(Fold):
. , ,
,
.
.
.
(Class):
:1)ll-,-,2)
all-, - , 3) /,
- - 4) +,
--.
.
.
2.6: SCOP (, , , ).
2.2.2
,
,
.,,(
,,
Uniprot), ,
.
. , ,
.
11-12 2014 Wellcome Trust,
Hinxton ,
(Specialized Protein Resources Network).
: (1)
,(2),(3),
(4) ,(5).
(Holliday et
al., 2015).
(,
)(
Pfam, RefSeq, Swiss-Prot, UniProt).
, ,
,,.
.
,
(strains),
(Uniprotgi).
,
.
,
.,
:
;
.
.ESTHER,
-,GPCRDB
GPCRs, UniProtKB,
.
2009 (Schnoes, Brown, Dodevski, & Babbitt, 2009)
/(misannotation)80%.SwissProt UniProtKB
, 0%.
.
, .
,
.
. 50
.
(Kirby, 2001).
: ;
;
; ,
(/ )
,,.
2.7: " " , . Gert Vriend
,
(modular) .
(Module)
(carbohydrate-binding module, CBM)
.
Blast,CBM,
. ,
(single domain protein)
(..
(UniProtKB: B8NM72)). ,
-, (NRPS),
NRPS. ,
(Umemura et al., 2014),
NRPS .
. ,
(Specialist Protein Resources - SPRs).
"
".
/ (.. )
(..).,
,
. BLAST non-redundant NCBI,
.
.
, .
, (gene assembly errors)
, ,
.
,
SPRs.
;,
UniProtKB,
. , PDB, ,
PDB_REDO (Joosten, Long, Murshudov, & Perrakis, 2014)
.
,
(Nagy et al., 2008; Wong, Maurer-Stroh, & Eisenhaber, 2010).
,
; .
UniProtKB,
; SFLD, : SFLD
; (.. UniProtKB) ECO (Evidence Code
Ontology) (Chibucos et al., 2014), .
,
.
ECO.
,
; (
) .
,
.,.,
(
) .
.,
.
.
( )
. ,
,
.
. , ,
Enzyme Commission (EC).
, (WikiPedia).
Rfam.
Wikipedia,Rfam.,
; Swiss-Prot
;
;
; ,
,SPRs,
.
,
. EMBRACE
(Pettifer et al., 2010). superfamily
.,SFLD,
.TIGRFAM,,
.SFLD,PANTHER,
(SFLDPANTHER).
, .
Enzyme Commission (EC).
(,,...)(
,).
MACiE EzCatDB, ,
,
. MACiE , EzCatDB
.
, EC (
).
EC50.
EC ., EC
EC,
.,
,
EC ,
. , EC
,
(.. -)
EC (.. ).
EC
ECBLAST.
,
. ;
,CAZy,
.,MACiE
EzCatDB, EC ,
,
.
(,
),.
.(GO)
,
.PubMed""
1.500 .
, , (.. )
.
, ,
, , BioPortal (Grosjean, Soualmia, Bouarech,
Jonquet, & Darmoni, 2014) OBO Foundry (Smith et al., 2007),
.
Nucleic Acids Research (Fernández-Suárez, Rigden, & Galperin,
2014) 2014 1.552 58 123
..
..2008
40%URL
(Wren, 2008). ,
URL.
.
/,
.
;
. Interpro,
.
Interpro.
/(,
REFSEQUniProtKB)SPRs.SPRs,
, , (.. SFLD
), (), (.. KEGG (Kanehisa et al., 2014)
), ,
.
.
. UniProtKB 2014 86
,
.,
,.
.,,
Structural Genomics Consortium,
.
, , putative
glycosidehydrolase,
. , (
)(
), (glycoside hydrolases).
(
) (
)
(Martinetal.,1998).
(moonlighting proteins) ,
, . ,
. , , ,
,
,
,
.,SPR
,
.,
.
,
Protein Bioinformatics and Community
ResourcesRetreat(2.8).
TCDB: ,
(Stein, 2013)
, .
,
.(structuredquerylanguage-SQL)
(Jamison, 2003). ,
Transporter Classification Database (TCDB, www.tcdb.org (Saier,
Reddy, Tamang, & Västermark, 2014)). TCDB HTML. 1998
(Oracle MySQL) PHP
San Diego Supercomputer Center (www.sdsc.edu).
,,,-.
(multi-component), 100
. TCDB
. 7 , 56 , 937 , 9098
,1180612086.
, .
TC EC (Enzyme Commission) (Bairoch, 1999),
( )
(, - -).
(International Union of Biochemistry and Molecular Biology,
IUBMB) (Saier, 2000).
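Like an EC number, a TC number is a dotted hierarchical identifier, for example `1.B.6.1.1`, the TCDB cross-reference of OmpA. The sketch below splits such an identifier into its five levels. The level names used here (class, subclass, family, subfamily, transport system) follow the usual description of the TC system, and the `TCNumber` type and `parse_tc` function are hypothetical helpers, not part of any TCDB software.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TCNumber:
    """The five levels of a TC identifier such as 1.B.6.1.1."""
    tc_class: int      # e.g. 1 = channels/pores
    subclass: str      # a letter, e.g. B
    family: int
    subfamily: int
    system: int        # the individual transport system

    def family_id(self) -> str:
        """Return the identifier truncated at the family level."""
        return f"{self.tc_class}.{self.subclass}.{self.family}"


def parse_tc(text: str) -> TCNumber:
    """Parse 'V.W.X.Y.Z' into a TCNumber, rejecting malformed input."""
    fields = text.strip().split(".")
    if len(fields) != 5:
        raise ValueError(f"expected five dot-separated fields, got {text!r}")
    tc_class, subclass, family, subfamily, system = fields
    if not subclass.isalpha():
        raise ValueError(f"subclass should be a letter, got {subclass!r}")
    return TCNumber(int(tc_class), subclass, int(family), int(subfamily), int(system))


ompa = parse_tc("1.B.6.1.1")
print(ompa)
print(ompa.family_id())
```

Grouping entries by `family_id()` is one simple way to collect all transporters that share a family.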
TCDB:,
,
TC-BLAST
,.,
PFAMOMPdb.
.
(Pfam TCDB),
, ,
OMPdb.
CAZy: CAZy (www.cazy.org)
(carbohydrate-binding modules) , ,
, (Cantarel et al., 2009; Lombard, Ramulu, Drula,
Coutinho, & Henrissat, 2014). CAZy 1991,
(Henrissat,1991)
. 90
(Campbell,Davies,Bulone,&
Henrissat, 1997). 1998,
SQL ( 1999),
. ,
CAZy ,
, 50%
. ,
,EC
,
CAZy-.,,
.
CAZypedia(http://www.cazypedia.org),
CAZy.CAZypediaHarryBrumer
BritishColumbia
,
. ,
CAZy.,CAZymes,
,CAZy,
.,
, ,
CAZy.
, CAZy .
/
CAZy.
( ) ,
,
,
.
, , ,
.
CAZy.
MEROPS: Merops (http://merops.sanger.ac.uk)
(Rawlings, Waller, Barrett, & Bateman, 2014).
,,,
,
, ( ) , ,
,,
. ( ,
, , II, Alzheimer),
( , ,
, ,
...).1996
400.000 . /
(clan).
. 61 , 251
4.236(377Enzyme Nomenclature).
28.000
, 1.200 .
53.000
,UniProt,PfamInterpro.
,
.
.
,,,
,
.
,
.
:,,
,
.
neXtProt: neXtProt (http://www.nextprot.org/)
(Lane et al., 2012).
UniProtKB/Swiss-Prot
,
.
neXtProt.
. neXtProt
:(>5%),(15% ) ( 1% ).
neXtProt.
,
repositories
.,
,
.
neXtProt
. neXtProt
.
. ,
.
.
10
.
PASS2: PASS2 -
(Protein sequence Alignments of Structural Superfamilies).
CAMPASS (Sowdhamini et al., 1998). -
,
, ,
.PASS2HTML,
(PASS2.4) (Gandhimathi, Nair, & Sowdhamini, 2012), MySQL
PHP. , SCOP 1.75 (Murzin, Brenner,
Hubbard, & Chothia, 1995) -,
1961.
, , ,
,.
, ,
outliers (Gandhimathi, et al., 2012),
-.
,
outliers.
- -.
-.
rogue (Deshmukh, et al., 2010; Rakshambikai, et al., 2014). O
. .
(Bhaskara et al., 2014; Gnanavel et al., 2014; Martin, Anamika, &
Srinivasan, 2010), KinG.
KinG NetBeans IDE Java, JSP, Servlets, AJAX, jQuery,
XML, HTML CSS ,
.
EzCatDB: EzCatDB (http://ezcatdb.cbrc.jp/EzCatDB/) 2004
,,
. (Nagano,
2005; Nagano et al., 2014) Enzyme Commission (E.C.) (NC-IUBMB;
http://www.chem.qmul.ac.uk/iubmb/enzyme/)
(Fleischmann et al., 2004; McDonald, Boyce, & Tipton, 2009; Tipton,
1994).(RLCP)EzCatDB
(nucleophilic substitution reactions),
, , ,
,(Nagano,etal.,2014).
EzCatDBProteinDataBank(PDB)(Roseet
al., 2013) UniProt,
CATH(Cuffetal.,2011).,
KEGG(Kanehisa,etal.,2014).EzCatDB
PostgreSQL.
(Nagano, 2005)
E.C., IDs ,
(Nagano,
2005).,
PDB,
, , ,
.
(Nagano,2005).
.
,
.,
, , .
EzCatDB871,1.610UniProtKB6.704
PDB..
300 ,
.
MACiE: MACiE(Mechanism,AnnotationandClassificationinEnzymes,http://www.ebi.ac.uk/thorntonsrv/databases/MACiE),(Holliday
et al., 2012). MACiE ,
, ,
. MACiE
PDB .
MACiE,
.
,
( ).
,
.
. MACiE
MySQL.
.
MACiE(Hollidayetal.,2007).
,
. MACiE
-
. (1)
, (2) , (3),
, (4) , (5)
, , (6)
, EC, UniProtKB, CATH,
PDB.
MACiE
.
MACiE
. ,
, PDB.
, ..,
,.
,.
ConoServer: Cone
(Akondietal.,2014;Terlau&Olivera,
2004). (Davis, Jones, & Lewis, 2009)
(Biggset al.,2010;Chang&Duda,2012;Puillandre,Koua,Favreau,Olivera,&Stocklin,2012)
(Duda,Chang,Lewis,&Lee,2009;Duda&Lee,2009).2014,
ConoServer(Kaas,Yu,Jin,Dutertre,&Craik,2012)2000,
,(Kaas,Westermann,&Craik,2010).
ConoServer (www.conoserver.org) ( )
,.
.-
MySQL PHP.
ConoServer GenBank (Benson, et al., 2014), UniProt-
(UniProt, 2014) PDB (Berman, Henrick, Nakamura, & Markley, 2007),
. ,
, MySQL,
.
,
,
.
(Trabi &
Craik, 2002).
(Poth,
Chan, & Craik, 2013). CyBase (www.cybase.org.au)
,
(Wang, Kaas, Chiche, & Craik, 2008). 2014 CyBase
420160.
,
cyclotide, 282 . CyBase
ConoServer. CyBase
,
(Wang, et al., 2008). CyBase
.
. ,
.GPCRDB
GPCR,
, (Katritch, Cherezov, & Stevens,
2013).
- .
, -
.
GPCRDB .
HTML5 ( CSS3) JavaScript
.
Scalable Vector Graphics (SVG),
.
MySQL, Apache.
GPCRDB SOAP,
/.
-, ncRNAs ( ,
HGNC), (Tough, Lewis, Rioja, Lindon, & Prinjha, 2014),
(Bonner,2014),allostery(Christopoulosetal.,2014),(),
(Spedding, 2011).
.
GtoPdb 2.700
( G-
GPCRs, , , , , ,
),,
,
. ,
,.
, ,
.
GtoPdbChEMBL(Gaultonetal.,2012),
.
. ,
,..UniProt,Ensembl,EntrezGene,KEGG,.
, ,
, , PubChem,
DrugBank ChEMBL. , Concise Guide to
PHARMACOLOGY,GtoPdb,
. British Journal of Pharmacology
GRAC(Alexanderetal.,2013).
Kinase.com: Kinase.com ,
(Manning,Whyte,Martinez,Hunter,&Sudarsanam,2002).
"kinome", . , KinBase,
7.000
, 14 (Bradham et al., 2006;
Caenepeel, Charydczak, Sudarsanam, Hunter, & Manning, 2004; Eisen et al., 2006; Goldberg et al., 2006;
Srivastava et al., 2010; Stajich et al., 2010). 10 , 287
356 . KinBase
,,.,
BLAST
.
,,
,,,
HMM , .
(,)
,HMM
. KinBase, Kinase.com wiki,
WiKinome,.
wiki-.
Kinase.com 1999
Sugen Caenorhabditis elegans (Bingham,
Plowman,&Sudarsanam,2000;Manning,2005;Plowman,Sudarsanam,Bingham,Whyte,&Hunter,1999)
Saccharomycescerevisiae(Hunter&Plowman,1997). KinBase
2002(Manning,Whyte,et
al.,2002)(Manning,Plowman,Hunter,&Sudarsanam,2002).
MySQL Perl.
Model-view-controller web framework, HTML5, CSS3, JavaScript.
.
6.000,Kinase.com
kinomes 15 .
, ,
/,kinomes15.
,
kinomes.
Structure-Function Linkage Database: Structure-Function Linkage Database (SFLD;
http://sfld.rbvi.ucsf.edu/django/), (Akiva et al., 2014; Pegg et al., 2006),
-
.
- (Gerlt &
Babbitt,2001).SFLD
, " -
"(Babbitt&Gerlt,1997).
-
-(Almonacid&Babbitt,2011).
-,
,
. ( -), SFLD
.
, - ,
-
,
,(Babbittetal.,
1996;Gerlt,Babbitt,&Rayment,2005).
-,-
(CoreSFLD),-.
-,,
(user interface).
- , s,
,3D
Chimera(Pettersenetal.,2004),
. Extended SFLD
.
,SFLDDjango,
PythonWeb,.
,
(GUI),.
,
- 100.000
.
SFLD-CoreExtended
SFLD,(Atkinson,Morris,Ferrin,&Babbitt,2009).
Histone Database:Histone(http://research.nhgri.nih.gov/histones/)1996(Baxevanis&
Landsman, 1996), ,
DNA (Baxevanis, Arents, Moudrianakis, & Landsman, 1995).
,
PSI-BLAST (Altschul et al., 1997) HMMER
(Eddy,2009).,
.
.
,
.
(1, 2, 2, 3, 4) ,
. (Baxevanis,et
al., 1995), (Marino-Ramirez et al., 2011). Histone Database
. Histone Database
Protein Data Bank (PDB) (Rose et al., 2014),
-.
-
- (.. (CENP-A CSE4), 3.3, H2A.B , H2A.Z, H2B.Z,
macroH2A(136),,-H1).
, ,
,
.
(
nextProt
).
SubtiList
(http://genolist.pasteur.fr/SubtiList/) (Moszer, Jones, Moreira, Fabry, & Danchin, 2002) Bacillus
subtilisEcoCyc(http://ecocyc.org/)Escherichia coliK-12(Karpetal.,2002).
, Genome Online Database (GOLD),
(https://gold.jgi-psf.org/)(Reddyetal.,2015).
,.
ONCOMINE,
.
,http://www.oncomine.org/, (Rhodeset
al.,2004).RNA-Seq Atlas (http://medicalgenomics.org/rna_seq_atlas)
RNA
(RNA-Seq). 11 .
(Kruppetal.,2012).Next Generation Sequencing Catalog (NGS
Catalog)
(http://bioinfo.mc.vanderbilt.edu/NGS/index.html). ,
(Xiaetal.,2012).
,
.
( ) . , OMIM
(Online Mendelian Inheritance in Man),
(http://www.ncbi.nlm.nih.gov/omim). ,
,
,
. GAD (Genetic Association Database,
http://geneticassociationdb.nih.gov/) PubMed
(Becker, Barnes, Bright, & Wang, 2004), o Catalog of
Published Genome-Wide Association Studies (http://www.genome.gov/gwastudies/) GWASdb
(http://jjwanglab.org/gwasdb) (genomewide association
studies),
DNA. , ,
, Epilepsy Genetic
Association Database (epiGAD)(Tan&Berkovic,2010),Cancer GAMAdb(Schullyet
al., 2011) , AlzGene Alzheimer (Bertram, McQueen, Mullin, Blacker, &
Tanzi,2007).
,
SPRN,
.
, Database Collection Nucleic Acids Research,
(http://www.oxfordjournals.org/our_journals/nar/database/cap/).
PDBTM(http://pdbtm.enzim.hu/)
(Kozma,
Simon,
&
Tusnady,
2013),
ExTopoDB
(http://bioinformatics.biol.uoa.gr/ExTopoDB)
(Tsaousis et al., 2010) gpDB
(http://bioinformatics.biol.uoa.gr/gpDB),GPCRs
G- (Theodoropoulou, Bagos, Spyropoulos, & Hamodrakas, 2008). DBPTM
(http://dbptm.mbc.nctu.edu.tw/) (Lu et al., 2013), DIP (http://dip.doembi.ucla.edu/dip/Main.cgi) , ,
- (Xenarios et al., 2002). , bioGrid (http://thebiogrid.org/)
(Starketal.,2006).
(DNA RNA), .
2.3.
SRS, LION Bioscience
. ,
.
400
. SRS
,
. ,
,
.
.
SRS EBI.
, Uniprot
.
Entrez
NCBI (National Center for Biotechnology
Information) . Entrez SRS
,
.,,
PUBMED.
NCBI
. ,
NCBI,
.
,SPRN,
,,
MySQL PHP. ,
,,
(
), .
, SQL ,
(
).
2.9: NCBI
Entrez. , . NCBI
,
. , Conserved Domains Database PROSITE Structure (MMDB)
PDB.
1. ompA Escherichia coli
UNIPROT, Gene Name UNIPROT.
2.10: Uniprot
2.11:
2.12:
2.13:
2. Uniprot
(outer membrane) .
( : taxonomy:"Bacteria [2]" existence:"evidence at protein level"
database:(type:pdb) locations:(location:"Cell outer membrane [SL-0040]") keyword:"Cell outer membrane
[KW-0998]")
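The query above is a list of `field:value` terms that must all hold at once. As a minimal sketch, the same string can be assembled programmatically and URL-encoded for use in a link. The helper `build_query` is hypothetical, the request parameter name `query` is an assumption, and the field names are copied from the exercise as written in 2015; the UniProt website may use a different syntax today.

```python
from urllib.parse import urlencode


def build_query(terms, operator="AND"):
    """Join individual field:value terms into a single query string."""
    return f" {operator} ".join(term.strip() for term in terms)


# Terms copied from the exercise: bacterial proteins of the outer membrane
# with evidence at protein level and a cross-reference to PDB.
OUTER_MEMBRANE_TERMS = [
    'taxonomy:"Bacteria [2]"',
    'existence:"evidence at protein level"',
    'database:(type:pdb)',
    'locations:(location:"Cell outer membrane [SL-0040]")',
    'keyword:"Cell outer membrane [KW-0998]"',
]

query = build_query(OUTER_MEMBRANE_TERMS)
encoded = urlencode({"query": query})   # percent-encodes quotes, brackets, spaces

print(query)
print(encoded)
```

Keeping the terms in a list makes it easy to drop one condition (for example the PDB cross-reference) and compare the number of hits.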
2.14: Uniprot ( ),
2.15: Uniprot ( ),
2.16:
2.17: . ,
2.18: . ,
2.19:
2.20: ( )
3. Uniprot
G- () :
:
keyword:"G-protein coupled receptor [KW-0297]" AND organism:"Human [9606]" AND
existence:"evidence at protein level" AND database:(type:pdb)
PDB;,
,
.
( )
1. GENBANK Outer membrane protein A (ompA)
Escherichia coli.
LOCUS       NC_000913               1041 bp    DNA     linear   CON 16-DEC-2014
DEFINITION
ACCESSION
VERSION
DBLINK
KEYWORDS
SOURCE
  ORGANISM
REFERENCE
  AUTHORS
  TITLE
  JOURNAL
   PUBMED
  REMARK
REFERENCE
  AUTHORS
  TITLE
  JOURNAL
   PUBMED
REFERENCE
  AUTHORS
  TITLE
  JOURNAL
   PUBMED
REFERENCE
  AUTHORS
  TITLE
  JOURNAL
  REMARK
REFERENCE
  AUTHORS
  TITLE
  JOURNAL
  REMARK
REFERENCE
  AUTHORS
  TITLE
  JOURNAL   Unpublished
  REMARK    GenBank accessions AG613214 to AG613378 (sequence corrections)
REFERENCE   7  (bases 1 to 1041)
  AUTHORS   Perna,N.T.
  TITLE     Escherichia coli K-12 MG1655 yqiK-rfaE intergenic region, genomic
            sequence correction
  JOURNAL   Unpublished
  REMARK    GenBank accession AY605712 (sequence corrections)
REFERENCE   8  (bases 1 to 1041)
  AUTHORS   Rudd,K.E.
  TITLE     A manual approach to accurate translation start site annotation: an
            E. coli K-12 case study
  JOURNAL   Unpublished
REFERENCE   9  (bases 1 to 1041)
  CONSRTM   NCBI Genome Project
  TITLE     Direct Submission
  JOURNAL   Submitted (26-AUG-2014) National Center for Biotechnology
            Information, NIH, Bethesda, MD 20894, USA
REFERENCE   10 (bases 1 to 1041)
  AUTHORS   Blattner,F.R. and Plunkett,G. III.
  TITLE     Direct Submission
  JOURNAL   Submitted (30-JUL-2014) Laboratory of Genetics, University of
            Wisconsin, 425G Henry Mall, Madison, WI 53706-1580, USA
  REMARK    Protein update by submitter
REFERENCE   11 (bases 1 to 1041)
  AUTHORS   Blattner,F.R. and Plunkett,G. III.
  TITLE     Direct Submission
  JOURNAL   Submitted (15-NOV-2013) Laboratory of Genetics, University of
            Wisconsin, 425G Henry Mall, Madison, WI 53706-1580, USA
  REMARK    Protein update by submitter
REFERENCE   12 (bases 1 to 1041)
  AUTHORS   Blattner,F.R. and Plunkett,G. III.
  TITLE     Direct Submission
  JOURNAL   Submitted (26-SEP-2013) Laboratory of Genetics, University of
            Wisconsin, 425G Henry Mall, Madison, WI 53706-1580, USA
  REMARK    Sequence update by submitter
REFERENCE   13 (bases 1 to 1041)
  AUTHORS   Rudd,K.E.
  TITLE     Direct Submission
  JOURNAL   Submitted (06-FEB-2013) Department of Biochemistry and Molecular
            Biology, University of Miami Miller School of Medicine, 118 Gautier
            Bldg., Miami, FL 33136, USA
  REMARK    Sequence update by submitter
REFERENCE   14 (bases 1 to 1041)
  AUTHORS   Rudd,K.E.
  TITLE     Direct Submission
  JOURNAL   Submitted (24-APR-2007) Department of Biochemistry and Molecular
            Biology, University of Miami Miller School of Medicine, 118 Gautier
            Bldg., Miami, FL 33136, USA
  REMARK    Annotation update from ecogene.org as a multi-database
            collaboration
REFERENCE   15 (bases 1 to 1041)
  AUTHORS   Plunkett,G. III.
  TITLE     Direct Submission
  JOURNAL   Submitted (07-FEB-2006) Laboratory of Genetics, University of
            Wisconsin, 425G Henry Mall, Madison, WI 53706-1580, USA
  REMARK    Protein updates by submitter
REFERENCE   16 (bases 1 to 1041)
  AUTHORS   Plunkett,G. III.
  TITLE     Direct Submission
  JOURNAL   Submitted (10-JUN-2004) Laboratory of Genetics, University of
/translation="MKKTAIAIAVALAGFATVAQAAPKDNTWYTGAKLGWSQYHDTGF
INNNGPTHENQLGAGAFGGYQVNPYVGFEMGYDWLGRMPYKGSVENGAYKAQGVQLTA
KLGYPITDDLDIYTRLGGMVWRADTKSNVYGKNHDTGVSPVFAGGVEYAITPEIATRL
EYQWTNNIGDAHTIGTRPDNGMLSLGVSYRFGQGEAAPVVAPAPAPAPEVQTKHFTLK
SDVLFNFNKATLKPEGQAALDQLYSQLSNLDPKDGSVVVLGYTDRIGSDAYNQGLSER
RAQSVVDYLISKGIPADKISARGMGESNPVTGNTCDNVKQRAALIDCLAPDRRVEIEV
KGIKDVVTQPQA"
ORIGIN
        1 atgaaaaaga cagctatcgc gattgcagtg gcactggctg gtttcgctac cgtagcgcag
       61 gccgctccga aagataacac ctggtacact ggtgctaaac tgggctggtc ccagtaccat
      121 gacactggtt tcatcaacaa caatggcccg acccatgaaa accaactggg cgctggtgct
      181 tttggtggtt accaggttaa cccgtatgtt ggctttgaaa tgggttacga ctggttaggt
      241 cgtatgccgt acaaaggcag cgttgaaaac ggtgcataca aagctcaggg cgttcaactg
      301 accgctaaac tgggttaccc aatcactgac gacctggaca tctacactcg tctgggtggc
      361 atggtatggc gtgcagacac taaatccaac gtttatggta aaaaccacga caccggcgtt
      421 tctccggtct tcgctggcgg tgttgagtac gcgatcactc ctgaaatcgc tacccgtctg
      481 gaataccagt ggaccaacaa catcggtgac gcacacacca tcggcactcg tccggacaac
      541 ggcatgctga gcctgggtgt ttcctaccgt ttcggtcagg gcgaagcagc tccagtagtt
      601 gctccggctc cagctccggc accggaagta cagaccaagc acttcactct gaagtctgac
      661 gttctgttca acttcaacaa agcaaccctg aaaccggaag gtcaggctgc tctggatcag
      721 ctgtacagcc agctgagcaa cctggatccg aaagacggtt ccgtagttgt tctgggttac
      781 accgaccgca tcggttctga cgcttacaac cagggtctgt ccgagcgccg tgctcagtct
      841 gttgttgatt acctgatctc caaaggtatc ccggcagaca agatctccgc acgtggtatg
      901 ggcgaatcca acccggttac tggcaacacc tgtgacaacg tgaaacagcg tgctgcactg
      961 atcgactgcc tggctccgga tcgtcgcgta gagatcgaag ttaaaggtat caaagacgtt
     1021 gtaactcagc cgcaggctta a
//
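The `/translation` qualifier of the record can be checked against the nucleotide sequence under ORIGIN. The following sketch translates the first 120 bases of the ompA entry with the standard genetic code and compares the result with the beginning of the annotated protein. The names `CODON_TABLE` and `translate` are illustrative, and the sketch assumes the coding region starts at base 1 on the forward strand, as is the case for this 1041 bp record.

```python
from itertools import product

# Standard genetic code, codons enumerated in t, c, a, g order.
BASES = "tcag"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"   # first base t
    "LLLLPPPPHHQQRRRR"   # first base c
    "IIIMTTTTNNKKSSRR"   # first base a
    "VVVVAAAADDEEGGGG"   # first base g
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate a coding sequence, stopping at the first stop codon."""
    dna = dna.lower()
    protein = []
    for i in range(0, len(dna) - len(dna) % 3, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# First two ORIGIN lines of the record (bases 1-120), spaces removed.
FIRST_120_BASES = (
    "atgaaaaagacagctatcgcgattgcagtggcactggctggtttcgctaccgtagcgcag"
    "gccgctccgaaagataacacctggtacactggtgctaaactgggctggtcccagtaccat"
)
# First line of the /translation qualifier.
ANNOTATED_START = "MKKTAIAIAVALAGFATVAQAAPKDNTWYTGAKLGWSQYHDTGF"

peptide = translate(FIRST_120_BASES)
print(peptide)
print(ANNOTATED_START.startswith(peptide))
```

Applying `translate` to the full 1041-base sequence should reproduce the 346-residue protein, since the last codon of the record, `taa`, is a stop codon.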
GENBANK
LOCUS:.
DEFINITION:.
ACCESSION:GENBANK.O
VERSION:AccessionNumber,
.
KEYWORDS:-
.
SOURCE:
(,..).
ORGANISM:'..
.
-
.
REFERENCE:
.
AUTHORS:.
TITLE:.
JOURNAL:
,,.
MEDLINE:MEDLINE.
COMMENT:,.
FEATURES:
()RNA()
.
BASE COUNT:.
,,,.
ORIGIN:
.
.
:
ORIGIN
        1 atgaaaaaga cagctatcgc gattgcagtg gcactggctg gtttcgctac cgtagcgcag
       61 gccgctccga aagataacac ctggtacact ggtgctaaac tgggctggtc ccagtaccat
      121 gacactggtt tcatcaacaa caatggcccg acccatgaaa accaactggg cgctggtgct
      181 tttggtggtt accaggttaa cccgtatgtt ggctttgaaa tgggttacga ctggttaggt
      241 cgtatgccgt acaaaggcag cgttgaaaac ggtgcataca aagctcaggg cgttcaactg
      301 accgctaaac tgggttaccc aatcactgac gacctggaca tctacactcg tctgggtggc
      361 atggtatggc gtgcagacac taaatccaac gtttatggta aaaaccacga caccggcgtt
      421 tctccggtct tcgctggcgg tgttgagtac gcgatcactc ctgaaatcgc tacccgtctg
      481 gaataccagt ggaccaacaa catcggtgac gcacacacca tcggcactcg tccggacaac
      541 ggcatgctga gcctgggtgt ttcctaccgt ttcggtcagg gcgaagcagc tccagtagtt
      601 gctccggctc cagctccggc accggaagta cagaccaagc acttcactct gaagtctgac
      661 gttctgttca acttcaacaa agcaaccctg aaaccggaag gtcaggctgc tctggatcag
      721 ctgtacagcc agctgagcaa cctggatccg aaagacggtt ccgtagttgt tctgggttac
      781 accgaccgca tcggttctga cgcttacaac cagggtctgt ccgagcgccg tgctcagtct
      841 gttgttgatt acctgatctc caaaggtatc ccggcagaca agatctccgc acgtggtatg
      901 ggcgaatcca acccggttac tggcaacacc tgtgacaacg tgaaacagcg tgctgcactg
      961 atcgactgcc tggctccgga tcgtcgcgta gagatcgaag ttaaaggtat caaagacgtt
     1021 gtaactcagc cgcaggctta a
//
-
.
-60,,
11.10
.
-9
.
//:.
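The layout of the ORIGIN block (a position number, then up to six groups of ten bases per line, closed by `//`) is simple enough to read with a few lines of code. The sketch below is one possible reader, not the parser used by GenBank or by any particular library; the function name `read_origin` and the two-line excerpt are illustrative. It also checks that each line's position number agrees with the number of bases read so far.

```python
def read_origin(record_text: str) -> str:
    """Return the nucleotide sequence found between ORIGIN and //."""
    blocks = []
    expected_position = None
    in_origin = False
    for line in record_text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
            continue
        if line.startswith("//"):          # end-of-record marker
            break
        if not in_origin or not line.strip():
            continue
        # First token is the position of the first base on the line,
        # the remaining tokens are groups of up to ten bases.
        position, *groups = line.split()
        if expected_position is not None and int(position) != expected_position:
            raise ValueError(f"unexpected position {position}")
        blocks.extend(groups)
        expected_position = int(position) + sum(len(g) for g in groups)
    return "".join(blocks)


# Last two sequence lines of the ompA record shown above.
EXCERPT = """\
ORIGIN
      961 atcgactgcc tggctccgga tcgtcgcgta gagatcgaag ttaaaggtat caaagacgtt
     1021 gtaactcagc cgcaggctta a
//
"""

sequence = read_origin(EXCERPT)
print(len(sequence))
print(sequence[-3:])
```

For the complete record the same function should return all 1041 bases, matching the length given on the LOCUS line.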
ID   OMPA_ECOLI              Reviewed;         346 AA.
AC   P0A910; P02934;
DT   20-JUL-1986, integrated into UniProtKB/Swiss-Prot.
DT   20-JUL-1986, sequence version 1.
DT   06-JAN-2015, entry version 99.
DE   RecName: Full=Outer membrane protein A;
DE   AltName: Full=Outer membrane protein II*;
DE   Flags: Precursor;
GN   Name=ompA; Synonyms=con, tolG, tut; OrderedLocusNames=b0957, JW0940;
OS   Escherichia coli (strain K12).
OC   Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales;
OC   Enterobacteriaceae; Escherichia.
OX   NCBI_TaxID=83333;
RN   [1]
RP   NUCLEOTIDE SEQUENCE [GENOMIC DNA].
RC   STRAIN=K12;
RX   PubMed=6253901; DOI=10.1093/nar/8.13.3011;
RA   Beck E., Bremer E.;
RT   "Nucleotide sequence of the gene ompA coding the outer membrane
RT   protein II of Escherichia coli K-12.";
RL   Nucleic Acids Res. 8:3011-3027(1980).
RN   [2]
RP   NUCLEOTIDE SEQUENCE [GENOMIC DNA].
RC   STRAIN=K12;
RX   PubMed=6260961; DOI=10.1016/0022-2836(80)90193-X;
RA   Movva N.R., Nakamura K., Inouye M.;
RT   "Gene structure of the OmpA protein, a major surface protein of
RT   Escherichia coli required for cell-cell interaction.";
RL   J. Mol. Biol. 143:317-328(1980).
RN   [3]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RC   STRAIN=K12 / W3110 / ATCC 27325 / DSM 5911;
RX   PubMed=8905232; DOI=10.1093/dnares/3.3.137;
RA   Oshima T., Aiba H., Baba T., Fujita K., Hayashi K., Honjo A.,
RA   Ikemoto K., Inada T., Itoh T., Kajihara M., Kanai K., Kashimoto K.,
RA   Kimura S., Kitagawa M., Makino K., Masuda S., Miki T., Mizobuchi K.,
RA   Mori H., Motomura K., Nakamura Y., Nashimoto H., Nishio Y., Saito N.,
RA   Sampei G., Seki Y., Tagami H., Takemoto K., Wada C., Yamamoto Y.,
RA   Yano M., Horiuchi T.;
RT   "A 718-kb DNA sequence of the Escherichia coli K-12 genome
RT   corresponding to the 12.7-28.0 min region on the linkage map.";
RL   DNA Res. 3:137-155(1996).
RN   [4]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RC   STRAIN=K12 / MG1655 / ATCC 47076;
RX   PubMed=9278503; DOI=10.1126/science.277.5331.1453;
RA   Blattner F.R., Plunkett G. III, Bloch C.A., Perna N.T., Burland V.,
RA   Riley M., Collado-Vides J., Glasner J.D., Rode C.K., Mayhew G.F.,
RA   Gregor J., Davis N.W., Kirkpatrick H.A., Goeden M.A., Rose D.J.,
RA   Mau B., Shao Y.;
RT   "The complete genome sequence of Escherichia coli K-12.";
RL   Science 277:1453-1462(1997).
RN   [5]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RC   STRAIN=K12 / W3110 / ATCC 27325 / DSM 5911;
RX   PubMed=16738553; DOI=10.1038/msb4100049;
RA   Hayashi K., Morooka N., Yamamoto Y., Fujita K., Isono K., Choi S.,
RA   Ohtsubo E., Baba T., Wanner B.L., Mori H., Horiuchi T.;
RT   "Highly accurate genome sequences of Escherichia coli K-12 strains
RT   MG1655 and W3110.";
RL   Mol. Syst. Biol. 2:E1-E5(2006).
RN   [6]
RP   PROTEIN SEQUENCE OF 22-346.
RC   STRAIN=K12;
RX   PubMed=7001461; DOI=10.1073/pnas.77.8.4592;
RA   Chen R., Schmidmayr W., Kramer C., Chen-Schmeisser U., Henning U.;
RT   "Primary structure of major outer membrane protein II (ompA protein)
RT   of Escherichia coli K-12.";
RL   Proc. Natl. Acad. Sci. U.S.A. 77:4592-4596(1980).
RN   [7]
RP   PROTEIN SEQUENCE OF 22-34.
RC   STRAIN=K12 / EMG2;
RX   PubMed=9298646; DOI=10.1002/elps.1150180807;
RA   Link A.J., Robison K., Church G.M.;
RT   "Comparing the predicted and observed properties of proteins encoded
RT   in the genome of Escherichia coli K-12.";
RL   Electrophoresis 18:1259-1313(1997).
RN   [8]
RP   PROTEIN SEQUENCE OF 22-32.
RC   STRAIN=K12 / W3110 / ATCC 27325 / DSM 5911;
RA   Pasquali C., Sanchez J.-C., Ravier F., Golaz O., Hughes G.J.,
RA   Frutiger S., Paquet N., Wilkins M., Appel R.D., Bairoch A.,
RA   Hochstrasser D.F.;
RL   Submitted (AUG-1994) to UniProtKB.
RN   [9]
RP   PROTEIN SEQUENCE OF 22-26.
RC   STRAIN=K12 / W3110 / ATCC 27325 / DSM 5911;
RX   PubMed=9629924; DOI=10.1002/elps.1150190539;
RA   Molloy M.P., Herbert B.R., Walsh B.J., Tyler M.I., Traini M.,
RA   Sanchez J.-C., Hochstrasser D.F., Williams K.L., Gooley A.A.;
RT   "Extraction of membrane proteins by differential solubilization for
RT   separation using two-dimensional gel electrophoresis.";
RL   Electrophoresis 19:837-844(1998).
RN   [10]
RP   MUTANTS RESISTANT TO PHAGE ENTRY.
RX   PubMed=6086577;
RA   Morona R., Klose M., Henning U.;
RT   "Escherichia coli K-12 outer membrane protein (OmpA) as a
RT   bacteriophage receptor: analysis of mutant genes expressing altered
RT   proteins.";
RL   J. Bacteriol. 159:570-578(1984).
RN   [11]
RP   MUTANTS RESISTANT TO PHAGE ENTRY.
RX   PubMed=3902787;
RA   Morona R., Kramer C., Henning U.;
RT   "Bacteriophage receptor area of outer membrane protein OmpA of
RT   Escherichia coli K-12.";
RL   J. Bacteriol. 164:539-543(1985).
RN   [12]
RP   PORIN ACTIVITY.
RC   STRAIN=K12;
RX   PubMed=1370823;
RA   Sugawara E., Nikaido H.;
RT   "Pore-forming activity of OmpA protein of Escherichia coli.";
RL   J. Biol. Chem. 267:2507-2511(1992).
RN   [13]
RP   SUBCELLULAR LOCATION.
RX   PubMed=7813480; DOI=10.1111/j.1432-1033.1994.00891.x;
RA   Kuhn A., Kiefer D., Koehne C., Zhu H.-Y., Tschantz W.R., Dalbey R.E.;
RT   "Evidence for a loop-like insertion mechanism of pro-Omp A into the
RT   inner membrane of Escherichia coli.";
RL   Eur. J. Biochem. 226:891-897(1994).
RN   [14]
RP   TOPOLOGY.
RX   PubMed=8106193;
RA   Gromiha M.M., Ponnuswamy P.K.;
RT   "Prediction of transmembrane beta-strands from hydrophobic
RT   characteristics of proteins.";
RL   Int. J. Pept. Protein Res. 42:420-431(1993).
RN   [15]
RP   IDENTIFICATION BY 2D-GEL.
RX   PubMed=9298644; DOI=10.1002/elps.1150180805;
RA   VanBogelen R.A., Abshire K.Z., Moldover B., Olson E.R.,
RA   Neidhardt F.C.;
RT   "Escherichia coli proteome analysis using the gene-protein database.";
RL   Electrophoresis 18:1243-1251(1997).
RN   [16]
RP   TOPOLOGY.
RX   PubMed=10368142;
RA   Koebnik R.;
RT   "Structural and functional roles of the surface-exposed loops of the
RT   beta-barrel membrane protein OmpA from Escherichia coli.";
RL   J. Bacteriol. 181:3688-3694(1999).
RN   [17]
RP   DIMERIZATION, AND SUBCELLULAR LOCATION.
RC   STRAIN=BL21-DE3;
RX   PubMed=16079137; DOI=10.1074/jbc.M506479200;
RA   Stenberg F., Chovanec P., Maslen S.L., Robinson C.V., Ilag L.,
RA   von Heijne G., Daley D.O.;
RT   "Protein complexes of the Escherichia coli cell envelope.";
RL   J. Biol. Chem. 280:34409-34419(2005).
RN   [18]
RP   SUBCELLULAR LOCATION.
RC   STRAIN=K12 / MG1655 / ATCC 47076;
RX   PubMed=21778229; DOI=10.1074/jbc.M111.245696;
RA   Fontaine F., Fuchs R.T., Storz G.;
RT   "Membrane localization of small proteins in Escherichia coli.";
RL   J. Biol. Chem. 286:32464-32474(2011).
RN   [19]
RP   X-RAY CRYSTALLOGRAPHY (2.5 ANGSTROMS) OF 22-192.
RX   PubMed=9808047; DOI=10.1038/2983;
RA   Pautsch A., Schulz G.E.;
RT   "Structure of the outer membrane protein A transmembrane domain.";
RL   Nat. Struct. Biol. 5:1013-1017(1998).
RN   [20]
RP   X-RAY CRYSTALLOGRAPHY (1.65 ANGSTROMS).
RX   PubMed=10764596; DOI=10.1006/jmbi.2000.3671;
RA   Pautsch A., Schulz G.E.;
RT   "High-resolution structure of the OmpA membrane domain.";
RL   J. Mol. Biol. 298:273-282(2000).
RN   [21]
RP   STRUCTURE BY NMR OF 22-197.
RX   PubMed=11276254; DOI=10.1038/86214;
RA   Arora A., Abildgaard F., Bushweller J.H., Tamm L.K.;
RT   "Structure of outer membrane protein A transmembrane domain by NMR
RT   spectroscopy.";
RL   Nat. Struct. Biol. 8:334-338(2001).
RN   [22]
RP   MASS SPECTROMETRY.
RX   PubMed=10757971; DOI=10.1021/bi000150m;
RA   le Coutre J., Whitelegge J.P., Gross A., Turk E., Wright E.M.,
RA   Kaback H.R., Faull K.F.;
RT   "Proteomics on full-length membrane proteins using mass
RT   spectrometry.";
RL   Biochemistry 39:4237-4242(2000).
CC   -!- FUNCTION: Required for the action of colicins K and L and for the
CC       stabilization of mating aggregates in conjugation. Serves as a
CC       receptor for a number of T-even like phages. Also acts as a porin
CC       with low permeability that allows slow penetration of small
CC       solutes.
CC   -!- SUBUNIT: Homodimer.
CC   -!- INTERACTION:
CC       P0C0V0:degP; NbExp=5; IntAct=EBI-371347, EBI-547165;
CC       P0A850:tig; NbExp=3; IntAct=EBI-371347, EBI-544862;
CC   -!- SUBCELLULAR LOCATION: Cell outer membrane
CC       {ECO:0000269|PubMed:16079137, ECO:0000269|PubMed:21778229,
CC       ECO:0000269|PubMed:7813480}; Multi-pass membrane protein
CC       {ECO:0000269|PubMed:16079137, ECO:0000269|PubMed:21778229,
CC       ECO:0000269|PubMed:7813480}.
CC   -!- MASS SPECTROMETRY: Mass=35177; Method=Electrospray; Range=22-346;
CC       Evidence={ECO:0000269|PubMed:10757971};
CC   -!- SIMILARITY: Belongs to the OmpA family. {ECO:0000305}.
CC   -!- SIMILARITY: Contains 1 OmpA-like domain. {ECO:0000255|PROSITE-
CC       ProRule:PRU00473}.
CC   -----------------------------------------------------------------------
CC   Copyrighted by the UniProt Consortium, see http://www.uniprot.org/terms
CC   Distributed under the Creative Commons Attribution-NoDerivs License
CC   -----------------------------------------------------------------------
DR   EMBL; V00307; CAA23588.1; -; Genomic_DNA.
DR   EMBL; U00096; AAC74043.1; -; Genomic_DNA.
DR   EMBL; AP009048; BAA35715.1; -; Genomic_DNA.
DR   PIR; A93707; MMECA.
DR   RefSeq; NP_415477.1; NC_000913.3.
DR   RefSeq; YP_489229.1; NC_007779.1.
DR   PDB; 1BXW; X-ray; 2.50 A; A=21-192.
DR   PDB; 1G90; NMR; -; A=22-197.
DR   PDB; 1QJP; X-ray; 1.65 A; A=22-192.
DR   PDB; 2GE4; NMR; -; A=22-197.
DR   PDB; 2JMM; NMR; -; A=23-197.
DR   PDB; 3NB3; EM; -; A/B/C=1-346.
DR   PDBsum; 1BXW; -.
DR   PDBsum; 1G90; -.
DR   PDBsum; 1QJP; -.
DR   PDBsum; 2GE4; -.
DR   PDBsum; 2JMM; -.
DR   PDBsum; 3NB3; -.
DR   ProteinModelPortal; P0A910; -.
DR   SMR; P0A910; 22-192, 209-346.
DR   DIP; DIP-31879N; -.
DR   IntAct; P0A910; 11.
DR   MINT; MINT-1308131; -.
DR   STRING; 511145.b0957; -.
DR   TCDB; 1.B.6.1.1; the ompa-ompf porin (oop) family.
DR   SWISS-2DPAGE; P0A910; -.
DR   PaxDb; P0A910; -.
DR   PRIDE; P0A910; -.
DR   EnsemblBacteria; AAC74043; AAC74043; b0957.
DR   EnsemblBacteria; BAA35715; BAA35715; BAA35715.
DR   GeneID; 12931038; -.
DR   GeneID; 945571; -.
DR   KEGG; ecj:Y75_p0929; -.
DR   KEGG; eco:b0957; -.
DR   PATRIC; 32117133; VBIEscCol129921_0991.
DR   EchoBASE; EB0663; -.
DR
DR
DR
DR
DR
DR
DR
DR
DR
DR
DR
DR
DR
DR
DR
DR
DR
DR
DR
DR
DR
DR
DR
DR
DR
DR
DR
DR
DR
DR
DR
DR
DR
DR
DR
DR
DR
DR
DR
DR
DR
DR
DR
DR
PE
KW
KW
KW
KW
FT
FT
FT
FT
FT
FT
FT
FT
FT
FT
FT
FT
FT   TOPO_DOM     79     95       Extracellular.
FT   TRANSMEM     96    107       Beta stranded.
FT   TOPO_DOM    108    111       Periplasmic.
FT   TRANSMEM    112    124       Beta stranded.
FT   TOPO_DOM    125    137       Extracellular.
FT   TRANSMEM    138    151       Beta stranded.
FT   TOPO_DOM    152    155       Periplasmic.
FT   TRANSMEM    156    163       Beta stranded.
FT   TOPO_DOM    164    181       Extracellular.
FT   TRANSMEM    182    190       Beta stranded.
FT   TOPO_DOM    191    346       Periplasmic.
FT   REPEAT      201    202       1.
FT   REPEAT      203    204       2.
FT   REPEAT      205    206       3.
FT   REPEAT      207    208       4.
FT   DOMAIN      210    338       OmpA-like. {ECO:0000255|PROSITE-
FT                                ProRule:PRU00473}.
FT   REGION      197    208       Hinge-like.
FT   REGION      201    208       4 X 2 AA tandem repeats of A-P.
FT   DISULFID    311    323
FT   STRAND       27     37       {ECO:0000244|PDB:1QJP}.
FT   STRAND       41     43       {ECO:0000244|PDB:1G90}.
FT   STRAND       46     48       {ECO:0000244|PDB:1G90}.
FT   STRAND       50     53       {ECO:0000244|PDB:2GE4}.
FT   STRAND       55     67       {ECO:0000244|PDB:1QJP}.
FT   STRAND       70     81       {ECO:0000244|PDB:1QJP}.
FT   STRAND       93    128       {ECO:0000244|PDB:1QJP}.
FT   STRAND      130    132       {ECO:0000244|PDB:1QJP}.
FT   STRAND      134    153       {ECO:0000244|PDB:1QJP}.
FT   STRAND      156    165       {ECO:0000244|PDB:1QJP}.
FT   TURN        172    175       {ECO:0000244|PDB:1G90}.
FT   STRAND      182    190       {ECO:0000244|PDB:1QJP}.
SQ   SEQUENCE   346 AA;  37201 MW;  195147734CDF8B04 CRC64;
     MKKTAIAIAV ALAGFATVAQ AAPKDNTWYT GAKLGWSQYH DTGFINNNGP THENQLGAGA
     FGGYQVNPYV GFEMGYDWLG RMPYKGSVEN GAYKAQGVQL TAKLGYPITD DLDIYTRLGG
     MVWRADTKSN VYGKNHDTGV SPVFAGGVEY AITPEIATRL EYQWTNNIGD AHTIGTRPDN
     GMLSLGVSYR FGQGEAAPVV APAPAPAPEV QTKHFTLKSD VLFNFNKATL KPEGQAALDQ
     LYSQLSNLDP KDGSVVVLGY TDRIGSDAYN QGLSERRAQS VVDYLISKGI PADKISARGM
     GESNPVTGNT CDNVKQRAAL IDCLAPDRRV EIEVKGIKDV VTQPQA
//
UNIPROT
ID (Identification):
Entry_name data_class; molecule_type; sequence length
Entry_name:UNIPROT.
..OMPA_ECOLI..
4..
5.
data_class:UNIPROT.
molecule_type:.UNIPROTPRT
(Protein).
sequence length:To().
AC (Accession number):
.
.
DT (Date):,,
.
DE (Description):.
GN (Gene name):.
OS (Organism Species):'.
.
OG (Organelle):,
.
OC (Organism Classification):'.
(Organism taxonomy cross-reference):
.
RN (Reference number):.
RP (Reference Position):.
RX (Reference cross-reference):..PUBMED.
RA (Reference author):.
RT (Reference title):.
RL (Reference Location):.
CC (Comments):.
-:
CATALYTIC ACTIVITY:.
ALTERNATIVE PRODUCTS:
.
FUNCTION:.
SUBCELLULAR LOCATION:.
SUBUNIT:
.
-CC
(Comments).
DR (Database cross-reference):To
PDB,EMBL...
KW (Keyword):-
.
FT (Feature Table):
.:
.
.(..Receptor-Ligand).
..
..
.
.
SQ (Sequence):(),
(MW)Daltons.
:
- IUPAC.
- 60 ,
, 6 . 10
.
//:.
..
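Because every line of a UniProt flat-file entry begins with a two-letter line code, an entry can be split into its fields with very little code. The following is a minimal Python sketch, not an official parser: the function name `parse_flatfile_entry`, the toy entry and its accession `X00000` are hypothetical, and the sketch assumes that the content of a coded line starts at column 6, that the sequence lines follow the SQ line, and that `//` terminates the entry.

```python
from collections import defaultdict

# Toy entry (hypothetical; only the layout mirrors a real flat-file entry).
SAMPLE_ENTRY = """\
ID   TOY_ENTRY      STANDARD;      PRT;    12 AA.
AC   X00000;
DE   Toy protein used for illustration.
SQ   SEQUENCE   12 AA;  0 MW;  0 CRC64;
     MKKTAIAIAV AL
//
"""

def parse_flatfile_entry(text):
    """Group the lines of one entry by line code and join the sequence."""
    fields = defaultdict(list)
    sequence_parts = []
    in_sequence = False
    for line in text.splitlines():
        if not line.strip():
            continue                      # ignore blank lines
        if line.startswith("//"):
            break                         # end of entry
        if in_sequence:
            # Sequence lines carry no line code; drop the spacing blanks.
            sequence_parts.append(line.replace(" ", ""))
            continue
        code, content = line[:2], line[5:].rstrip()
        fields[code].append(content)
        if code == "SQ":
            in_sequence = True            # everything up to "//" is sequence
    result = dict(fields)
    result["sequence"] = "".join(sequence_parts)
    return result

entry = parse_flatfile_entry(SAMPLE_ENTRY)
print(entry["AC"], len(entry["sequence"]))
```

Keeping each field as a list of lines preserves multi-line fields such as CC and DR, which can then be processed separately.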
ID   OMPA_1; PATTERN.
AC   PS01068;
DT   NOV-1995 (CREATED); DEC-2004 (DATA UPDATE); FEB-2015 (INFO UPDATE).
DE   OmpA-like domain.
PA   [LIVMA]-x-[GT]-x-[TA]-[DAN]-x(2,3)-[DG]-[GSTPNKQ]-x(2)-[LFYDEPAVI]-[NQS]-
PA   x(2)-[LI]-[SG]-[QEA]-[KRQENAD]-R-A-x(2)-[LVAIT]-x(3)-[LIVMF]-x(4,5)-
PA   [LIVMF]-x(4)-[LIVM]-x(3)-[SGW]-x-G.
NR   /RELEASE=2015_04,548208;
NR   /TOTAL=55(55); /POSITIVE=55(55); /UNKNOWN=0(0); /FALSE_POS=0(0);
NR   /FALSE_NEG=10; /PARTIAL=2;
CC   /TAXO-RANGE=???P?; /MAX-REPEAT=1;
CC   /VERSION=1;
DR   P65594, ARFA_MYCBO , T; A1KH31, ARFA_MYCBP , T; P9WIU4, ARFA_MYCTO , T;
DR   P9WIU5, ARFA_MYCTU , T; Q9S3P9, MOTY_VIBAN , T; P46233, MOTY_VIBPA , T;
DR   Q8U9L5, OMP16_AGRT5, T; P0A3S9, OMP16_BRUAB, T; P0A3S7, OMP16_BRUME, T;
DR   P0A3S8, OMP16_BRUSU, T; Q98F85, OMP16_RHILO, T; Q926C3, OMP16_RHIME, T;
DR   P07050, OMP3_NEIGO , T; Q9S3R8, OMP40_PORGI, T; Q9S3R9, OMP41_PORGI, T;
DR   P0A0V2, OMP4_NEIMA , T; P0A0V3, OMP4_NEIMB , T; P43840, OMP51_HAEIN, T;
DR   P38368, OMP52_HAEIF, T; P45996, OMP53_HAEIF, T; Q05146, OMPA_BORAV , T;
DR   P57414, OMPA_BUCAI , T; Q8K9L4, OMPA_BUCAP , T; P24016, OMPA_CITFR , T;
DR   P0A911, OMPA_ECO57 , T; P0A910, OMPA_ECOLI , T; P09146, OMPA_ENTAE , T;
DR   B7LNW7, OMPA_ESCF3 , T; P0C8Z2, OMPA_ESCFE , T; P24754, OMPA_ESCHE , T;
DR   P24017, OMPA_KLEPN , T; Q8Z7S0, OMPA_SALTI , T; P02936, OMPA_SALTY , T;
DR   P04845, OMPA_SERMA , T; P24755, OMPA_SEROD , T; I2BAK7, OMPA_SHIBC , T;
DR   P0DJO6, OMPA_SHIBL , T; P02935, OMPA_SHIDY , T; Q8ZG77, OMPA_YERPE , T;
DR   P38399, OMPA_YERPS , T; Q89AJ5, PAL_BUCBP  , T; P0A913, PAL_ECO57  , T;
DR   P0A912, PAL_ECOLI  , T; P10324, PAL_HAEIN  , T; P26493, PAL_LEGPN  , T;
DR   Q51886, PAL_PASMU  , T; Q9I4Z4, PAL_PSEAE  , T; P0A138, PAL_PSEPK  , T;
DR   P0A139, PAL_PSEPU  , T; P0A914, PAL_SHIFL  , T; P13794, PORF_PSEAE , T;
DR   P37726, PORF_PSEFL , T; P22263, PORF_PSESY , T; P38369, TPN50_TREPA, T;
DR   P37665, YIAD_ECOLI , T;
DR   P85410, OMP5_HAEPR , P; P80444, OMPA_ACTLI , P;
DR   D3GSC3, LAFU_ECO44 , N; Q47154, LAFU_ECOLI , N; Q6RYW5, OMP38_ACIBA, N;
DR   A3M8K2, OMP38_ACIBT, N; P84838, OMPC_GLUDA , N; P07021, YFIB_ECOLI , N;
DR   P0C536, YN58_BRUAB , N; Q2YJ83, YP57_BRUA2 , N; Q8YDY8, YU36_BRUME , N;
DR   Q9RPX3, YU58_BRUSU , N;
3D   1OAP; 1R1M; 2AIZ; 2HQS; 2K1S; 2KGW; 2L26; 2LBT; 2LCA; 2W8B;
DO   PDOC00819;
//
PROSITE
ID (Identification):
ID ENTRY_NAME; ENTRY_TYPE
PROSITE,
.
AC (ACcession number):
PROSITEPROSITE.
DT (DaTe):().
DE (DEscription):.
PA (PAttern):(pattern)
.
pattern:
1.
2.
3.
4.
5.
6.
7.
8.
IUPAC.
x.
[...].
[ALT]
.
.
(-).
..x(3).
..(2,4)
234.
'<''>'.
pattern.
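The pattern conventions listed above (x for any residue, square brackets for allowed residues, curly brackets for excluded residues, hyphens between positions, repetition counts in parentheses, and '<' or '>' for the sequence ends) map almost one-to-one onto regular-expression syntax. The sketch below is one possible translation in Python, offered under stated assumptions: the function name `prosite_to_regex` is hypothetical, and rarer constructs such as '>' inside square brackets are not handled. The input pattern is the PA field of PS01068 written on one line, and the test fragment is taken from the C-terminal part of the OMPA_ECOLI sequence.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE pattern into an equivalent regular expression."""
    pattern = pattern.strip().rstrip(".")        # drop the terminating period
    parts = []
    for element in pattern.split("-"):           # '-' separates positions
        element = element.replace("{", "[^").replace("}", "]")  # {ALT} -> [^ALT]
        element = element.replace("x", ".")                     # x -> any residue
        element = element.replace("(", "{").replace(")", "}")   # x(2,4) -> .{2,4}
        element = element.replace("<", "^").replace(">", "$")   # sequence ends
        parts.append(element)
    return "".join(parts)

OMPA_1 = ("[LIVMA]-x-[GT]-x-[TA]-[DAN]-x(2,3)-[DG]-[GSTPNKQ]-x(2)-"
          "[LFYDEPAVI]-[NQS]-x(2)-[LI]-[SG]-[QEA]-[KRQENAD]-R-A-x(2)-"
          "[LVAIT]-x(3)-[LIVMF]-x(4,5)-[LIVMF]-x(4)-[LIVM]-x(3)-[SGW]-x-G.")

# Residues from the OmpA-like domain of OMPA_ECOLI.
FRAGMENT = "KDGSVVVLGYTDRIGSDAYNQGLSERRAQSVVDYLISKGIPADKISARGMGESNPVTGNT"

regex = prosite_to_regex(OMPA_1)
match = re.search(regex, FRAGMENT)
print(regex)
print(match.group(0) if match else "no match")
```

Amino acids are written in upper case, so replacing the lower-case `x` cannot damage a residue code.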
NR (Numerical Results):(patternscan)
SWISS-PROTpatternPROSITE.
:
/RELEASE:UNIPROT
.
/TOTAL:UNIPROT.
/POSITIVE:pattern
PROSITE.
/UNKNOWN:PROSITE.
/FALSE_POS:UNIPROTpattern
.
/FALSE_NEG:UNIPROT
.
/PARTIAL:UNIPROT(fragments),
PROSITE,oPROSITE.
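The counts in the NR lines can be condensed into two figures that are commonly used to judge a pattern: precision (the share of hits that are true members of the family) and recall (the share of known members that the pattern finds). The Python sketch below is illustrative only. It reads the NR lines of PS01068 shown earlier, takes the first number of each count, and ignores the value in parentheses; the function names `parse_nr` and `precision_recall` are hypothetical.

```python
import re

NR_LINES = """\
/RELEASE=2015_04,548208;
/TOTAL=55(55); /POSITIVE=55(55); /UNKNOWN=0(0); /FALSE_POS=0(0);
/FALSE_NEG=10; /PARTIAL=2;
"""

def parse_nr(text):
    """Return the integer counts of an NR block, keyed by qualifier name."""
    counts = {}
    for key, value in re.findall(r"/([A-Z_]+)=(\d+)", text):
        if key != "RELEASE":              # RELEASE is a version, not a count
            counts[key] = int(value)
    return counts

def precision_recall(counts):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    true_pos = counts["POSITIVE"]
    precision = true_pos / (true_pos + counts["FALSE_POS"])
    recall = true_pos / (true_pos + counts["FALSE_NEG"])
    return precision, recall

counts = parse_nr(NR_LINES)
precision, recall = precision_recall(counts)
print(f"precision={precision:.3f} recall={recall:.3f}")
```

For OMPA_1 this gives a precision of 1.0 (no false positives) and a recall of 55/65, about 0.846, because ten known family members are missed.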
CC (Comments):-CommentsPROSITE.
DR (Database Reference):UNIPROT.
3D (3D Structure):ProteinDataBank
.
DO (Documentation):
.
//:.
HEADER    MEMBRANE PROTEIN                        03-OCT-98   1BXW
TITLE     OUTER MEMBRANE PROTEIN A (OMPA) TRANSMEMBRANE DOMAIN
COMPND    MOL_ID: 1;
COMPND   2 MOLECULE: PROTEIN (OUTER MEMBRANE PROTEIN A);
COMPND   3 CHAIN: A;
COMPND   4 FRAGMENT: TRANSMEMBRANE DOMAIN;
COMPND   5 ENGINEERED: YES;
COMPND   6 MUTATION: YES
SOURCE    MOL_ID: 1;
SOURCE   2 ORGANISM_SCIENTIFIC: ESCHERICHIA COLI BL21(DE3);
SOURCE   3 ORGANISM_TAXID: 469008;
SOURCE   4 STRAIN: BL21DE3;
SOURCE   5 GENE: OMPA;
SOURCE   6 EXPRESSION_SYSTEM: ESCHERICHIA COLI BL21(DE3);
SOURCE   7 EXPRESSION_SYSTEM_TAXID: 469008;
SOURCE   8 EXPRESSION_SYSTEM_STRAIN: BL21DE3;
SOURCE   9 EXPRESSION_SYSTEM_PLASMID: PET3B-171
KEYWDS    OUTER MEMBRANE, TRANSMEMBRANE PROTEIN
EXPDTA    X-RAY DIFFRACTION
AUTHOR    G.E.SCHULZ,A.PAUTSCH
REVDAT   3   24-FEB-09 1BXW    1       VERSN
REVDAT   2   22-DEC-99 1BXW    4       HEADER COMPND REMARK JRNL
REVDAT   2 2                   4       ATOM   SOURCE SEQRES
REVDAT   1   14-OCT-98 1BXW    0
JRNL        AUTH   A.PAUTSCH,G.E.SCHULZ
JRNL        TITL   STRUCTURE OF THE OUTER MEMBRANE PROTEIN A
JRNL        TITL 2 TRANSMEMBRANE DOMAIN.
JRNL        REF    NAT.STRUCT.BIOL.              V.   5  1013 1998
JRNL        REFN                   ISSN 1072-8368
JRNL        PMID   9808047
JRNL        DOI    10.1038/2983
REMARK   1
REMARK   2
REMARK   2 RESOLUTION.    2.50 ANGSTROMS.
REMARK   3
REMARK   3 REFINEMENT.
REMARK   3   PROGRAM     : REFMAC
REMARK   3   AUTHORS     : MURSHUDOV,VAGIN,DODSON
REMARK   3
REMARK   3  DATA USED IN REFINEMENT.
REMARK   3   RESOLUTION RANGE HIGH (ANGSTROMS) : 2.50
REMARK   3   RESOLUTION RANGE LOW  (ANGSTROMS) : 50.00
REMARK   3   DATA CUTOFF            (SIGMA(F)) : 0.000
REMARK   3   COMPLETENESS FOR RANGE        (%) : 89.0
REMARK   3   NUMBER OF REFLECTIONS             : 8328
REMARK   3
REMARK   3  FIT TO DATA USED IN REFINEMENT.
REMARK   3   CROSS-VALIDATION METHOD          : THROUGHOUT
REMARK   3   FREE R VALUE TEST SET SELECTION  : RANDOM
REMARK   3   R VALUE     (WORKING + TEST SET) : NULL
REMARK   3   R VALUE            (WORKING SET) : 0.189
REMARK   3   FREE R VALUE                     : 0.235
REMARK   3   FREE R VALUE TEST SET SIZE   (%) : 5.000
REMARK   3   FREE R VALUE TEST SET COUNT      : 404
REMARK   3
REMARK   3  NUMBER OF NON-HYDROGEN ATOMS USED IN REFINEMENT.
REMARK   3   PROTEIN ATOMS            : 1330
REMARK   3   NUCLEIC ACID ATOMS       : 0
REMARK   3   HETEROGEN ATOMS          : 21
REMARK   3   SOLVENT ATOMS            : 39
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
3
4
4
100
100
100
200
200
200
200
200
200
200
B VALUES.
FROM WILSON PLOT
(A**2) : 49.20
MEAN B VALUE
(OVERALL, A**2) : 60.40
OVERALL ANISOTROPIC B VALUE.
B11 (A**2) : NULL
B22 (A**2) : NULL
B33 (A**2) : NULL
B12 (A**2) : NULL
B13 (A**2) : NULL
B23 (A**2) : NULL
ESTIMATED OVERALL COORDINATE ERROR.
ESU BASED ON R VALUE
(A):
ESU BASED ON FREE R VALUE
(A):
ESU BASED ON MAXIMUM LIKELIHOOD
(A):
ESU FOR B VALUES BASED ON MAXIMUM LIKELIHOOD (A**2):
RMS DEVIATIONS FROM IDEAL VALUES.
DISTANCE RESTRAINTS.
BOND LENGTH
ANGLE DISTANCE
INTRAPLANAR 1-4 DISTANCE
H-BOND OR METAL COORDINATION
PLANE RESTRAINT
CHIRAL-CENTER RESTRAINT
(A)
(A)
(A)
(A)
:
:
:
:
RMS
0.015
0.030
NULL
NULL
;
;
;
;
(A) : NULL
(A**3) : NULL
(A)
(A)
(A)
(A)
:
:
:
:
SIGMA
NULL
NULL
NULL
NULL
; NULL
; NULL
NULL
NULL
NULL
NULL
;
;
;
;
NULL
NULL
NULL
NULL
;
;
;
;
NULL
NULL
NULL
NULL
NULL
NULL
NULL
3.640
;
;
;
;
SIGMA
NULL
NULL
NULL
NULL
OTHER REFINEMENT REMARKS: DISORDERED REGIONS FROM GLY22-GLY28, GLY65-GLU68 AND ILE147-PRO157 WERE MODELED
STEREOCHEMICALLY
1BXW COMPLIES WITH FORMAT V. 3.15, 01-DEC-08
THIS ENTRY HAS BEEN PROCESSED BY RCSB ON 19-AUG-99.
THE RCSB ID CODE IS RCSB008140.
EXPERIMENTAL DETAILS
EXPERIMENT TYPE
DATE OF DATA COLLECTION
TEMPERATURE
(KELVIN)
PH
NUMBER OF CRYSTALS USED
:
:
:
:
:
X-RAY DIFFRACTION
15-JAN-98
298
5.0
1
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
200
280
280
280
280
280
280
280
290
290
290
290
290
290
290
290
290
290
290
290
SYNCHROTRON
RADIATION SOURCE
BEAMLINE
X-RAY GENERATOR MODEL
MONOCHROMATIC OR LAUE
WAVELENGTH OR RANGE
MONOCHROMATOR
OPTICS
(Y/N) :
:
:
:
(M/L) :
(A) :
:
:
N
ROTATING ANODE
NULL
RIGAKU RU200
M
1.5418
NI FILTER
NULL
DETECTOR TYPE
:
DETECTOR MANUFACTURER
:
INTENSITY-INTEGRATION SOFTWARE :
DATA SCALING SOFTWARE
:
AREA DETECTOR
SIEMENS
XDS
CCP4 (SCALA)
:
:
:
:
8328
2.500
50.000
NULL
OVERALL.
COMPLETENESS FOR RANGE
(%)
DATA REDUNDANCY
R MERGE
(I)
R SYM
(I)
<I/SIGMA(I)> FOR THE DATA SET
:
:
:
:
:
89.0
2.100
NULL
0.02800
16.8000
SYMMETRY
OPERATOR
X,Y,Z
-X,Y,-Z
X+1/2,Y+1/2,Z
-X+1/2,Y+1/2,-Z
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
290
290
290
290
290
290
290
290
290
290
290
290
290
290
290
290
290
290
290
290
300
300
300
300
300
300
350
350
350
350
350
350
350
350
350
350
350
350
350
470
470
470
470
470
470
470
475
475
475
475
475
475
475
475
475
475
475
475
475
475
475
0.00000
0.00000
0.00000
MISSING ATOM
THE FOLLOWING RESIDUES HAVE MISSING ATOMS(M=MODEL NUMBER;
RES=RESIDUE NAME; C=CHAIN IDENTIFIER; SSEQ=SEQUENCE NUMBER;
I=INSERTION CODE):
M RES CSSEQI ATOMS
HIS A 31
CG
ND1 CD2 CE1 NE2
ZERO OCCUPANCY RESIDUES
THE FOLLOWING RESIDUES WERE MODELED WITH ZERO OCCUPANCY.
THE LOCATION AND PROPERTIES OF THESE RESIDUES MAY NOT
BE RELIABLE. (M=MODEL NUMBER; RES=RESIDUE NAME;
C=CHAIN IDENTIFIER; SSEQ=SEQUENCE NUMBER; I=INSERTION CODE)
M RES C SSEQI
GLY A
22
LEU A
23
ILE A
24
ASN A
25
ASN A
26
ASN A
27
GLY A
28
GLY A
65
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
475
475
475
475
475
475
475
475
475
475
475
475
475
475
480
480
480
480
480
480
480
480
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
SER
VAL
GLU
ILE
GLY
ASP
ALA
HIS
THR
ILE
GLY
THR
ARG
PRO
A
A
A
A
A
A
A
A
A
A
A
A
A
A
66
67
68
147
148
149
150
151
152
153
154
155
156
157
RES
ASN
ASN
ASN
ASN
ASN
ASN
C
A
A
A
A
A
A
SSEQI
26
26
26
5
26
26
ATM2
CA
C
N
CD1
O
N
RES
PRO
PRO
PRO
ILE
PRO
PRO
C
A
A
A
A
A
A
SSEQI
29
29
29
147
29
29
SSYMOP
2556
2556
2556
2657
2556
2556
DISTANCE
1.44
1.68
1.72
2.03
2.08
2.11
REMARK: NULL
GEOMETRY AND STEREOCHEMISTRY
SUBTOPIC: COVALENT BOND LENGTHS
THE STEREOCHEMICAL PARAMETERS OF THE FOLLOWING RESIDUES
HAVE VALUES WHICH DEVIATE FROM EXPECTED VALUES BY MORE
THAN 6*RMSD (M=MODEL NUMBER; RES=RESIDUE NAME; C=CHAIN
IDENTIFIER; SSEQ=SEQUENCE NUMBER; I=INSERTION CODE).
STANDARD TABLE:
FORMAT: (10X,I3,1X,2(A3,1X,A1,I4,A1,1X,A4,3X),1X,F6.3)
EXPECTED VALUES PROTEIN: ENGH AND HUBER, 1999
EXPECTED VALUES NUCLEIC ACID: CLOWNEY ET AL 1996
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
M RES
GLY
GLY
ARG
PRO
PRO
PRO
CSSEQI ATM1
A 28
C
A 148
N
A 156
CA
A 157
N
A 157
CD
A 157
CA
RES
PRO
GLY
ARG
PRO
PRO
PRO
CSSEQI ATM2
A 29
N
A 148
CA
A 156
C
A 157
CA
A 157
N
A 157
C
DEVIATION
0.125
0.090
0.206
-0.251
-0.368
-0.164
REMARK: NULL
GEOMETRY AND STEREOCHEMISTRY
SUBTOPIC: COVALENT BOND ANGLES
THE STEREOCHEMICAL PARAMETERS OF THE FOLLOWING RESIDUES
HAVE VALUES WHICH DEVIATE FROM EXPECTED VALUES BY MORE
THAN 6*RMSD (M=MODEL NUMBER; RES=RESIDUE NAME; C=CHAIN
IDENTIFIER; SSEQ=SEQUENCE NUMBER; I=INSERTION CODE).
STANDARD TABLE:
FORMAT: (10X,I3,1X,A3,1X,A1,I4,A1,3(1X,A4,2X),12X,F5.1)
EXPECTED VALUES PROTEIN: ENGH AND HUBER, 1999
EXPECTED VALUES NUCLEIC ACID: CLOWNEY ET AL 1996
M RES
ASP
GLY
ASN
ASN
ASN
GLY
ARG
ARG
ARG
GLU
GLN
ASP
SER
VAL
ILE
ARG
ARG
ARG
HIS
ALA
ALA
HIS
HIS
THR
ARG
ARG
ARG
ARG
THR
ARG
PRO
PRO
PRO
PRO
PRO
PRO
CSSEQI ATM1
ATM2
ATM3
A
4
CA - C
- N
ANGL. DEV. = 18.3 DEGREES
A 22
O
- C
- N
ANGL. DEV. = 11.3 DEGREES
A 25
C
- N
- CA ANGL. DEV. = -16.5 DEGREES
A 25
CA - C
- N
ANGL. DEV. = -14.2 DEGREES
A 25
O
- C
- N
ANGL. DEV. = 14.8 DEGREES
A 28
O
- C
- N
ANGL. DEV. = -11.6 DEGREES
A 60
CD - NE - CZ ANGL. DEV. =
9.8 DEGREES
A 60
NE - CZ - NH1 ANGL. DEV. =
3.7 DEGREES
A 60
NE - CZ - NH2 ANGL. DEV. = -3.3 DEGREES
A 68
C
- N
- CA ANGL. DEV. = -16.3 DEGREES
A 75
CB - CA - C
ANGL. DEV. = 13.1 DEGREES
A 90
CB - CG - OD1 ANGL. DEV. =
5.7 DEGREES
A 120
N
- CA - CB ANGL. DEV. = -9.2 DEGREES
A 122
CB - CA - C
ANGL. DEV. = -12.2 DEGREES
A 135
CA - CB - CG2 ANGL. DEV. = 15.6 DEGREES
A 138
CA - CB - CG ANGL. DEV. = 14.8 DEGREES
A 138
CD - NE - CZ ANGL. DEV. = 10.7 DEGREES
A 138
NE - CZ - NH2 ANGL. DEV. = -3.3 DEGREES
A 151
CB - CA - C
ANGL. DEV. = -38.2 DEGREES
A 150
CA - C
- N
ANGL. DEV. = -16.2 DEGREES
A 150
O
- C
- N
ANGL. DEV. = 17.6 DEGREES
A 151
CA - C
- N
ANGL. DEV. = -38.2 DEGREES
A 151
O
- C
- N
ANGL. DEV. = 43.4 DEGREES
A 152
C
- N
- CA ANGL. DEV. = 30.0 DEGREES
A 156
CB - CA - C
ANGL. DEV. = 12.4 DEGREES
A 156
N
- CA - CB ANGL. DEV. = -15.9 DEGREES
A 156
NH1 - CZ - NH2 ANGL. DEV. = -6.6 DEGREES
A 156
NE - CZ - NH2 ANGL. DEV. =
3.7 DEGREES
A 155
O
- C
- N
ANGL. DEV. = -14.6 DEGREES
A 156
C
- N
- CA ANGL. DEV. = -16.4 DEGREES
A 157
CA - N
- CD ANGL. DEV. = -25.7 DEGREES
A 157
N
- CA - CB ANGL. DEV. = -25.4 DEGREES
A 157
CB - CG - CD ANGL. DEV. = -24.6 DEGREES
A 157
N
- CD - CG ANGL. DEV. = -33.9 DEGREES
A 157
N
- CA - C
ANGL. DEV. = 19.1 DEGREES
A 157
CA - C
- O
ANGL. DEV. = -17.2 DEGREES
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
500
PRO A 157
CA
REMARK: NULL
GEOMETRY AND STEREOCHEMISTRY
SUBTOPIC: TORSION ANGLES
TORSION ANGLES OUTSIDE THE EXPECTED RAMACHANDRAN REGIONS:
(M=MODEL NUMBER; RES=RESIDUE NAME; C=CHAIN IDENTIFIER;
SSEQ=SEQUENCE NUMBER; I=INSERTION CODE).
STANDARD TABLE:
FORMAT:(10X,I3,1X,A3,1X,A1,I4,A1,4X,F7.2,3X,F7.2)
EXPECTED VALUES: GJ KLEYWEGT AND TA JONES (1996). PHI/PSI-
CHOLOGY: RAMACHANDRAN REVISITED. STRUCTURE 4, 1395 - 1400
M RES
ASN
TYR
ASP
LEU
ASN
ASN
HIS
TYR
SER
VAL
VAL
ALA
HIS
THR
THR
ARG
CSSEQI
A
5
A 18
A 20
A 23
A 25
A 26
A 31
A 63
A 66
A 67
A 110
A 150
A 151
A 152
A 155
A 156
PSI
57.62
120.18
-140.62
150.39
-90.55
121.49
175.58
102.76
52.30
90.53
-72.06
-164.72
-97.21
-139.61
-137.22
-162.43
PHI
-113.00
166.90
-156.38
68.74
-9.26
-29.94
173.40
-169.58
128.49
49.93
-67.54
173.09
35.47
-128.76
-149.83
-178.66
REMARK: NULL
GEOMETRY AND STEREOCHEMISTRY
SUBTOPIC: NON-CIS, NON-TRANS
THE FOLLOWING PEPTIDE BONDS DEVIATE SIGNIFICANTLY FROM BOTH
CIS AND TRANS CONFORMATION. CIS BONDS, IF ANY, ARE LISTED
ON CISPEP RECORDS. TRANS IS DEFINED AS 180 +/- 30 AND
CIS IS DEFINED AS 0 +/- 30 DEGREES.
MODEL
OMEGA
ARG A 156
PRO A 157
-147.47
REMARK: NULL
GEOMETRY AND STEREOCHEMISTRY
SUBTOPIC: MAIN CHAIN PLANARITY
THE FOLLOWING RESIDUES HAVE A PSEUDO PLANARITY
TORSION, C(I) - CA(I) - N(I+1) - O(I), GREATER
THAN 10.0 DEGREES. (M=MODEL NUMBER; RES=RESIDUE NAME;
C=CHAIN IDENTIFIER; SSEQ=SEQUENCE NUMBER;
I=INSERTION CODE).
M RES CSSEQI
ARG A 156
ANGLE
-11.76
REMARK
REMARK
REMARK
REMARK
REMARK
REMARK
DBREF
SEQADV
SEQADV
SEQADV
SEQADV
SEQRES
SEQRES
SEQRES
SEQRES
SEQRES
SEQRES
SEQRES
SEQRES
SEQRES
SEQRES
SEQRES
SEQRES
SEQRES
SEQRES
HET
HETNAM
FORMUL
FORMUL
SHEET
SHEET
SHEET
SHEET
SHEET
SHEET
SHEET
SHEET
LINK
LINK
LINK
LINK
LINK
LINK
LINK
LINK
LINK
LINK
LINK
LINK
LINK
LINK
SITE
CRYST1
ORIGX1
ORIGX2
ORIGX3
SCALE1
SCALE2
SCALE3
PDB
1.24
1.88
1.59
1.96
1.92
1.18
1.38
1.68
1.87
1.45
1.22
2.02
1.70
1.34
HEADER: PDB,
Protein Data Bank.
TITLE: ,
, .
.
COMPND: compound
( , ) .
SOURCE: .
KEYWDS: - .
EXPDTA: (X-Ray Crystallography/NMR/Theoretical Model).
AUTHOR: .
JRNL:
.
REMARK: REMARK .
.
REMARK
, ,
.
SEQRES: . 3
.
HET: () .
.
,
.
: HET.
FORMUL: HET.
HELIX: .
SHEET: .
CRYST1: .
ORIGXn(n=1..3):
PDB.
SCALEn: .
ATOM: , , .
.
()
:
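Columns   Contents
--------------------------------------------------------------------------------
 1 -  6   Record name "ATOM  ".
 7 - 11   Atom serial number.
13 - 16   Atom name.
18 - 20   Residue name (3-letter code).
22        Chain identifier (chain ID).
23 - 26   Residue sequence number.
31 - 38   x coordinate (in Angstroms).
39 - 46   y coordinate (in Angstroms).
47 - 54   z coordinate (in Angstroms).
55 - 60   Occupancy.
61 - 66   Temperature factor.
77 - 78   Element symbol.
79 - 80   Charge on the atom.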
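Since ATOM records use fixed columns, they are best read by slicing each line at the column positions given above rather than by splitting on whitespace, because adjacent fields may touch. The following Python sketch illustrates this under stated assumptions: the `Atom` class and the function `parse_atom_line` are hypothetical names, and the sample record is built from made-up values (it is not taken from 1BXW) with a format string so that every field lands in its column.

```python
from dataclasses import dataclass

@dataclass
class Atom:
    serial: int
    name: str
    res_name: str
    chain_id: str
    res_seq: int
    x: float
    y: float
    z: float
    occupancy: float
    temp_factor: float
    element: str

def parse_atom_line(line):
    """Slice one ATOM record; column n (1-based) is line[n - 1]."""
    line = line.ljust(80)                  # pad short lines to full width
    return Atom(
        serial=int(line[6:11]),            # columns  7 - 11
        name=line[12:16].strip(),          # columns 13 - 16
        res_name=line[17:20].strip(),      # columns 18 - 20
        chain_id=line[21],                 # column  22
        res_seq=int(line[22:26]),          # columns 23 - 26
        x=float(line[30:38]),              # columns 31 - 38
        y=float(line[38:46]),              # columns 39 - 46
        z=float(line[46:54]),              # columns 47 - 54
        occupancy=float(line[54:60]),      # columns 55 - 60
        temp_factor=float(line[60:66]),    # columns 61 - 66
        element=line[76:78].strip(),       # columns 77 - 78
    )

# Made-up record, assembled field by field to guarantee the column layout.
SAMPLE = ("{:<6}{:>5} {:<4}{:1}{:>3} {:1}{:>4}{:1}   "
          "{:>8.3f}{:>8.3f}{:>8.3f}{:>6.2f}{:>6.2f}          {:>2}{:>2}").format(
    "ATOM", 1, " N", "", "ALA", "A", 1, "",
    11.104, 6.134, -6.504, 1.00, 20.00, "N", "")

atom = parse_atom_line(SAMPLE)
print(atom)
```

The same slicing approach extends to HETATM records, which share the ATOM column layout.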
TER: R .
HETATM: . .
CONECT: CONECT .
.
MASTER: .
.
END: .
Akiva, E., Brown, S., Almonacid, D. E., Barber, A. E., 2nd, Custer, A. F., Hicks, M. A., ... Babbitt, P. C. (2014). The Structure-Function Linkage Database. Nucleic acids research, 42(Database issue), D521-530.
Akondi, K. B., Muttenthaler, M., Dutertre, S., Kaas, Q., Craik, D. J., Lewis, R. J., & Alewood, P. F. (2014). Discovery, synthesis, and structure-activity relationships of conotoxins. Chemical reviews, 114(11), 5815-5847.
Alexander, S. P., Benson, H. E., Faccenda, E., Pawson, A. J., Sharman, J. L., McGrath, J. C., ... Zimmermann, M. (2013). The Concise Guide to PHARMACOLOGY 2013/14: overview. British journal of pharmacology, 170(8), 1449-1458.
Alexander, S. P., Mathie, A., & Peters, J. A. (2011). Guide to Receptors and Channels (GRAC), 5th edition. British journal of pharmacology, 164 Suppl 1, S1-324.
Almonacid, D. E., & Babbitt, P. C. (2011). Toward mechanistic classification of enzyme functions. Current opinion in chemical biology, 15(3), 435-442.
Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W., & Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic acids research, 25(17), 3389-3402.
Ananiadou, S., Kell, D. B., & Tsujii, J. (2006). Text mining and its potential applications in systems biology. Trends Biotechnol, 24(12), 571-579.
Andreeva, A., Howorth, D., Brenner, S. E., Hubbard, T. J., Chothia, C., & Murzin, A. G. (2004). SCOP database in 2004: refinements integrate structure and sequence family data. Nucleic Acids Res, 32(Database issue), D226-229.
Atkinson, H. J., Morris, J. H., Ferrin, T. E., & Babbitt, P. C. (2009). Using sequence similarity networks for visualization of relationships across diverse protein superfamilies. PloS one, 4(2), e4345.
Babbitt, P. C., & Gerlt, J. A. (1997). Understanding enzyme superfamilies. Chemistry As the fundamental determinant in the evolution of new catalytic activities. The Journal of biological chemistry, 272(49), 30591-30594.
Babbitt, P. C., Hasson, M. S., Wedekind, J. E., Palmer, D. R., Barrett, W. C., Reed, G. H., ... Gerlt, J. A. (1996). The enolase superfamily: a general strategy for enzyme-catalyzed abstraction of the alpha-protons of carboxylic acids. Biochemistry, 35(51), 16489-16501.
Bairoch, A. (1999). The ENZYME data bank in 1999. Nucleic acids research, 27(1), 310-311.
Barrett, T., & Edgar, R. (2006). Mining microarray data at NCBI's Gene Expression Omnibus (GEO)*. Methods Mol Biol, 338, 175-190.
Baxevanis, A. D., Arents, G., Moudrianakis, E. N., & Landsman, D. (1995). A variety of DNA-binding and multimeric proteins contain the histone fold motif. Nucleic acids research, 23(14), 2685-2691.
Baxevanis, A. D., & Landsman, D. (1996). Histone Sequence Database: a compilation of highly-conserved nucleoprotein sequences. Nucleic acids research, 24(1), 245-247.
Becker, K. G., Barnes, K. C., Bright, T. J., & Wang, S. A. (2004). The genetic association database. Nat Genet, 36(5), 431-432.
Benson, D. A., Clark, K., Karsch-Mizrachi, I., Lipman, D. J., Ostell, J., & Sayers, E. W. (2014). GenBank. Nucleic acids research.
Berman, H., Henrick, K., Nakamura, H., & Markley, J. L. (2007). The worldwide Protein Data Bank (wwPDB): ensuring a single, uniform archive of PDB data. Nucleic acids research, 35(Database issue), D301-303.
Bertram, L., McQueen, M. B., Mullin, K., Blacker, D., & Tanzi, R. E. (2007). Systematic meta-analyses of Alzheimer disease genetic association studies: the AlzGene database. Nat Genet, 39(1), 17-23.
Bhaskara, R. M., Mehrotra, P., Rakshambikai, R., Gnanavel, M., Martin, J., & Srinivasan, N. (2014). The relationship between classification of multi-domain proteins using an alignment-free approach and their functions: a case study with immunoglobulins. Molecular BioSystems, 10(5), 1082-1093.
Biggs, J. S., Watkins, M., Puillandre, N., Ownby, J. P., Lopez-Vera, E., Christensen, S., ... Olivera, B. M. (2010). Evolution of Conus peptide toxins: analysis of Conus californicus Reeve, 1844. Molecular phylogenetics and evolution, 56(1), 1-12.
Bingham, J., Plowman, G. D., & Sudarsanam, S. (2000). Informatics issues in large-scale sequence analysis: elucidating the protein kinases of C. elegans. J Cell Biochem, 80(2), 181-186.
Bockaert, J., & Pin, J. P. (1999). Molecular tinkering of G protein-coupled receptors: an evolutionary success. Embo J, 18(7), 1723-1729.
Bonner, T. I. (2014). Should pharmacologists care about alternative splicing? IUPHAR Review 4. British journal of pharmacology, 171(5), 1231-1240.
Bradham, C. A., Foltz, K. R., Beane, W. S., Arnone, M. I., Rizzo, F., Coffman, J. A., ... Manning, G. (2006). The sea urchin kinome: a first look. Dev Biol, 300(1), 180-193.
Brazma, A., Parkinson, H., Sarkans, U., Shojatalab, M., Vilo, J., Abeygunawardena, N., ... Sansone, S. A. (2003). ArrayExpress--a public repository for microarray gene expression data at the EBI. Nucleic Acids Res, 31(1), 68-71.
Caenepeel, S., Charydczak, G., Sudarsanam, S., Hunter, T., & Manning, G. (2004). The mouse kinome: discovery and comparative genomics of all mouse protein kinases. Proc Natl Acad Sci U S A, 101(32), 11707-11712.
Campbell, J. A., Davies, G. J., Bulone, V., & Henrissat, B. (1997). A classification of nucleotide-diphospho-sugar glycosyltransferases based on amino acid sequence similarities. The Biochemical journal, 326 (Pt 3), 929-939.
Cantarel, B. L., Coutinho, P. M., Rancurel, C., Bernard, T., Lombard, V., & Henrissat, B. (2009). The Carbohydrate-Active EnZymes database (CAZy): an expert resource for Glycogenomics. Nucleic acids research, 37(Database issue), D233-238.
Chang, D., & Duda, T. F., Jr. (2012). Extensive and continuous duplication facilitates rapid evolution and diversification of gene families. Molecular biology and evolution, 29(8), 2019-2029.
Chatonnet, A., Cousin, X., & Robinson, A. (2001). Links between kinetic data and sequences in the alpha/beta-hydrolases fold database. Briefings in bioinformatics, 2(1), 30-37.
Chibucos, M. C., Mungall, C. J., Balakrishnan, R., Christie, K. R., Huntley, R. P., White, O., ... Giglio, M. (2014). Standardized description of scientific evidence using the Evidence Ontology (ECO). Database (Oxford), 2014.
Christopoulos, A., Changeux, J. P., Catterall, W. A., Fabbro, D., Burris, T. P., Cidlowski, J. A., ... Langmead, C. J. (2014). International union of basic and clinical pharmacology. XC. multisite pharmacology: recommendations for the nomenclature of receptor allosterism and allosteric ligands. Pharmacological reviews, 66(4), 918-947.
Cousin, X., Hotelier, T., Lievin, P., Toutant, J. P., & Chatonnet, A. (1996). A cholinesterase genes server (ESTHER): a database of cholinesterase-related sequences for multiple alignments, phylogenetic relationships, mutations and structural data retrieval. Nucleic acids research, 24(1), 132-136.
Craik, D. J. (2006). Chemistry. Seamless proteins tie up their loose ends. Science, 311(5767), 1563-1564.
Cuff, A. L., Sillitoe, I., Lewis, T., Clegg, A. B., Rentzsch, R., Furnham, N., ... Orengo, C. A. (2011). Extending CATH: increasing coverage of the protein structure universe and linking structure with function. Nucleic acids research, 39(Database issue), D420-426.
Davis, J., Jones, A., & Lewis, R. J. (2009). Remarkable inter- and intra-species complexity of conotoxins revealed by LC/MS. Peptides, 30(7), 1222-1227.
Demeter, J., Beauheim, C., Gollub, J., Hernandez-Boussard, T., Jin, H., Maier, D., ... Ball, C. A. (2007). The Stanford Microarray Database: implementation of new analysis tools and open source release of software. Nucleic Acids Res, 35(Database issue), D766-770.
Deshmukh, K., Anamika, K., & Srinivasan, N. (2010). Evolution of domain combinations in protein kinases and its implications for functional diversity. Progress in biophysics and molecular biology, 102(1), 1-15.
Dreos, R., Ambrosini, G., Cavin Perier, R., & Bucher, P. (2013). EPD and EPDnew, high-quality promoter resources in the next-generation sequencing era. Nucleic Acids Res, 41(Database issue), D157-164.
Duda, T. F., Jr., Chang, D., Lewis, B. D., & Lee, T. (2009). Geographic variation in venom allelic composition and diets of the widespread predatory marine gastropod Conus ebraeus. PloS one, 4(7), e6245.
Duda, T. F., Jr., & Lee, T. (2009). Ecological release and venom evolution of a predatory marine snail at Easter Island. PloS one, 4(5), e5558.
Eddy, S. R. (2009). A new generation of homology search tools based on probabilistic inference. Genome informatics. International Conference on Genome Informatics, 23(1), 205-211.
Eisen, J. A., Coyne, R. S., Wu, M., Wu, D., Thiagarajan, M., Wortman, J. R., ... Orias, E. (2006). Macronuclear genome sequence of the ciliate Tetrahymena thermophila, a model eukaryote. PLoS Biol, 4(9), e286.
Fernández-Suárez, X. M., Rigden, D. J., & Galperin, M. Y. (2014). The 2014 nucleic acids research database issue and an updated NAR online molecular biology database collection. Nucleic acids research, 42(D1), D1-D6.
Finn, R. D., Bateman, A., Clements, J., Coggill, P., Eberhardt, R. Y., Eddy, S. R., ... Punta, M. (2014). Pfam: the protein families database. Nucleic acids research, 42(Database issue), D222-D230.
Fleischmann, A., Darsow, M., Degtyarenko, K., Fleischmann, W., Boyce, S., Axelsen, K. B., ... Apweiler, R. (2004). IntEnz, the integrated relational enzyme database. Nucleic acids research, 32(Database issue), D434-437.
Gandhimathi, A., Nair, A. G., & Sowdhamini, R. (2012). PASS2 version 4: an update to the database of structure-based sequence alignments of structural domain superfamilies. Nucleic acids research, 40(Database issue), D531-534.
Garland, S. L. (2013). Are GPCRs Still a Source of New Targets? Journal of Biomolecular Screening, 18(9), 947-966.
Gaulton, A., Bellis, L. J., Bento, A. P., Chambers, J., Davies, M., Hersey, A., ... Overington, J. P. (2012). ChEMBL: a large-scale bioactivity database for drug discovery. Nucleic acids research, 40(Database issue), D1100-1107.
Gerlt, J. A., & Babbitt, P. C. (2001). Divergent evolution of enzymatic function: mechanistically diverse superfamilies and functionally distinct suprafamilies. Annual review of biochemistry, 70, 209-246.
Gerlt, J. A., Babbitt, P. C., & Rayment, I. (2005). Divergent evolution in the enolase superfamily: the interplay of mechanism and specificity. Archives of biochemistry and biophysics, 433(1), 59-70.
Gnanavel, M., Mehrotra, P., Rakshambikai, R., Martin, J., Srinivasan, N., & Bhaskara, R. M. (2014). CLAP: a web-server for automatic classification of proteins with special reference to multi-domain proteins. BMC bioinformatics, 15, 343.
The dictyostelium kinome--analysis of the protein kinases from a simple model organism, 3, 2 Cong. Rec. e38 (2006).
Gowri, V. S., Krishnadev, O., Swamy, C. S., & Srinivasan, N. (2006). MulPSSM: a database of multiple position-specific scoring matrices of protein domain families. Nucleic acids research, 34(Database issue), D243-246.
Griffiths-Jones, S., Grocock, R. J., van Dongen, S., Bateman, A., & Enright, A. J. (2006). miRBase: microRNA sequences, targets and gene nomenclature. Nucleic Acids Res, 34(Database issue), D140-144.
Grosjean, J., Soualmia, L., Bouarech, K., Jonquet, C., & Darmoni, S. (2014). An Approach to Compare Bio-Ontologies Portals. Paper presented at the MIE'2014: 26th International Conference of the European Federation for Medical Informatics.
Hanks, S. K., & Hunter, T. (1995). Protein kinases 6. The eukaryotic protein kinase superfamily: kinase (catalytic) domain structure and classification. FASEB journal : official publication of the Federation of American Societies for Experimental Biology, 9(8), 576-596.
HapMap Consortium. (2003). The International HapMap Project. Nature, 426(6968), 789-796.
Harmar, A. J., Hills, R. A., Rosser, E. M., Jones, M., Buneman, O. P., Dunbar, D. R., ... Spedding, M. (2009). IUPHAR-DB: the IUPHAR database of G protein-coupled receptors and ion channels. Nucleic acids research, 37(Database issue), D680-685.
Henrissat, B. (1991). A classification of glycosyl hydrolases based on amino acid sequence similarities. The Biochemical journal, 280 (Pt 2), 309-316.
Holliday, G. L., Almonacid, D. E., Bartlett, G. J., O'Boyle, N. M., Torrance, J. W., Murray-Rust, P., ... Thornton, J. M. (2007). MACiE (Mechanism, Annotation and Classification in Enzymes): novel tools for searching catalytic mechanisms. Nucleic acids research, 35(Database issue), D515-520.
Holliday, G. L., Andreini, C., Fischer, J. D., Rahman, S. A., Almonacid, D. E., Williams, S. T., & Pearson, W. R. (2012). MACiE: exploring the diversity of biochemical reactions. Nucleic acids research, 40(Database issue), D783-789.
Holliday, G. L., Bairoch, A., Bagos, P. G., Chatonnet, A., Craik, D. J., Finn, R. D., ... Bateman, A. (2015). Key challenges for the creation and maintenance of specialist protein resources. Proteins.
Horn, F., Bettler, E., Oliveira, L., Campagne, F., Cohen, F. E., & Vriend, G. (2003). GPCRDB information system for G protein-coupled receptors. Nucleic Acids Res, 31(1), 294-297.
Horn, F., Weare, J., Beukers, M. W., Horsch, S., Bairoch, A., Chen, W., ... Vriend, G. (1998). GPCRDB: an information system for G protein-coupled receptors. Nucleic Acids Res, 26(1), 275-279.
Hsu, S. D., Lin, F. M., Wu, W. Y., Liang, C., Huang, W. C., Chan, W. L., ... Huang, H. D. (2011). miRTarBase: a database curates experimentally validated microRNA-target interactions. Nucleic Acids Res, 39(Database issue), D163-169.
Hunter, T., & Plowman, G. D. (1997). The protein kinases of budding yeast: six score and more. Trends Biochem Sci, 22(1), 18-22.
Isberg, V., Vroling, B., van der Kant, R., Li, K., Vriend, G., & Gloriam, D. (2014). GPCRDB: an information system for G protein-coupled receptors. Nucleic Acids Res, 42(1), D422-425.
Jamison, D. C. (2003). Structured Query Language (SQL) fundamentals. Curr Protoc Bioinformatics, Chapter 9, Unit 9.2.
Joosten, R. P., Long, F., Murshudov, G. N., & Perrakis, A. (2014). The PDB_REDO server for macromolecular structure model optimization. IUCrJ, 1(4), 213-220.
Kaas,Q.,Westermann,J.C.,&Craik,D.J.(2010).Conopeptidecharacterizationandclassifications:an
analysisusingConoServer.Toxicon : official journal of the International Society on Toxinology,
55(8),1491-1509.
Kaas,Q.,Yu,R.,Jin,A.H.,Dutertre,S.,&Craik,D.J.(2012).ConoServer:updatedcontent,knowledge,and
discoverytoolsintheconopeptidedatabase.Nucleic acids research, 40(Databaseissue),D325-330.
Kanehisa,M.,Goto,S.,Sato,Y.,Kawashima,M.,Furumichi,M.,&Tanabe,M.(2014).Data,information,
knowledgeandprinciple:backtometabolisminKEGG.Nucleic acids research, 42(D1),D199-D205.
Karp,P.D.,Riley,M.,Saier,M.,Paulsen,I.T.,Collado-Vides,J.,Paley,S.M.,...Gama-Castro,S.(2002).
TheEcoCycDatabase.Nucleic Acids Res, 30(1),56-58.
Katritch,V.,Cherezov,V.,&Stevens,R.C.(2013).Structure-functionoftheGprotein-coupledreceptor
superfamily.Annu Rev Pharmacol Toxicol, 53,531-556.
Kedarisetti,P.,Mizianty,M.J.,Kaas,Q.,Craik,D.J.,&Kurgan,L.(2014).Predictionandcharacterizationof
cyclicproteinsfromsequencesinthreedomainsoflife.Biochimica et biophysica acta, 1844(1PtB),
181-190.
Kirby,A.J.(2001).Thelysozymemechanismsortedafter50years.Nature Structural Biology, 8,737-739.
Knudsen,M.,&Wiuf,C.(2010).TheCATHdatabase.Hum Genomics, 4(3),207-212.
Kouranov,A.,Xie,L.,delaCruz,J.,Chen,L.,Westbrook,J.,Bourne,P.E.,&Berman,H.M.(2006).The
RCSBPDBinformationportalforstructuralgenomics.Nucleic Acids Res, 34(Databaseissue),D302305.
Kozma,D.,Simon,I.,&Tusnady,G.E.(2013).PDBTM:ProteinDataBankoftransmembraneproteinsafter
8years.Nucleic Acids Res, 41(Databaseissue),D524-529.
Krupa,A.,Abhinandan,K.,&Srinivasan,N.(2004).KinG:adatabaseofproteinkinasesingenomes.Nucleic
acids research, 32(suppl1),D153-D155.
Krupp,M.,Marquardt,J.U.,Sahin,U.,Galle,P.R.,Castle,J.,&Teufel,A.(2012).RNA-SeqAtlas--a
referencedatabaseforgeneexpressionprofilinginnormaltissuebynext-generationsequencing.
Bioinformatics, 28(8),1184-1185.
Lagerstrom,M.C.,&Schioth,H.B.(2008).StructuraldiversityofGprotein-coupledreceptorsand
significancefordrugdiscovery.Nat Rev Drug Discov, 7(4),339-357.
Lane,L.,Argoud-Puy,G.,Britan,A.,Cusin,I.,Duek,P.D.,Evalet,O.,...Bairoch,A.(2012).neXtProt:a
knowledgeplatformforhumanproteins.Nucleic acids research, 40(Databaseissue),D76-83.
Lenfant,N.,Hotelier,T.,Bourne,Y.,Marchot,P.,&Chatonnet,A.(2013).Proteinswithanalpha/beta
hydrolasefold:Relationshipsbetweensubfamiliesinanever-growingsuperfamily.Chemicobiological interactions, 203(1),266-268.
Lenfant,N.,Hotelier,T.,Bourne,Y.,Marchot,P.,&Chatonnet,A.(2014).Trackingtheoriginand
divergenceofcholinesterasesandneuroligins:theevolutionofsynapticproteins.Journal of
molecular neuroscience : MN, 53(3),362-369.
Lenfant,N.,Hotelier,T.,Velluet,E.,Bourne,Y.,Marchot,P.,&Chatonnet,A.(2013).ESTHER,the
databaseofthealpha/beta-hydrolasefoldsuperfamilyofproteins:toolstoexplorediversityof
functions.Nucleic acids research, 41(Databaseissue),D423-429.
Lombard,V.,Ramulu,H.G.,Drula,E.,Coutinho,P.M.,&Henrissat,B.(2014).Thecarbohydrate-active
enzymesdatabase(CAZy)in2013.Nucleic acids research, 42(D1),D490-D495.
Lu,C.T.,Huang,K.Y.,Su,M.G.,Lee,T.Y.,Bretana,N.A.,Chang,W.C.,...Huang,H.D.(2013).
DbPTM3.0:aninformativeresourceforinvestigatingsubstratesitespecificityandfunctional
associationofproteinpost-translationalmodifications.Nucleic Acids Res, 41(Databaseissue),D295305.
Manning,G.(2005).Genomicoverviewofproteinkinases.WormBook,1-19.
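Manning, G., Plowman, G. D., Hunter, T., & Sudarsanam, S. (2002). Evolution of protein kinase signaling from yeast to man. Trends Biochem Sci, 27(10), 514-520.
Manning, G., Whyte, D. B., Martinez, R., Hunter, T., & Sudarsanam, S. (2002). The protein kinase complement of the human genome. Science, 298(5600), 1912-1934.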
Marchot,P.,&Chatonnet,A.(2012).Enzymaticactivityandproteininteractionsinalpha/betahydrolasefold
proteins:moonlightingversuspromiscuity.Protein and peptide letters, 19(2),132-143.
Marino-Ramirez,L.,Levine,K.M.,Morales,M.,Zhang,S.,Moreland,R.T.,Baxevanis,A.D.,&Landsman,
D.(2011).TheHistoneDatabase:anintegratedresourceforhistonesandhistonefold-containing
proteins.Database (Oxford), 2011,bar048.
Martin,A.C.,Orengo,C.A.,Hutchinson,E.G.,Jones,S.,Karmirantzou,M.,Laskowski,R.A.,...
Thornton,J.M.(1998).Proteinfoldsandfunctions.Structure, 6(7),875-884.
Martin,J.,Anamika,K.,&Srinivasan,N.(2010).Classificationofproteinkinasesonthebasisofbothkinase
andnon-kinaseregions.PloS one, 5(9),e12460.
McDonald,A.G.,Boyce,S.,&Tipton,K.F.(2009).ExplorEnz:theprimarysourceoftheIUBMBenzyme
list.Nucleic acids research, 37(Databaseissue),D593-597.
Moszer,I.,Jones,L.M.,Moreira,S.,Fabry,C.,&Danchin,A.(2002).SubtiList:thereferencedatabasefor
theBacillussubtilisgenome.Nucleic Acids Res, 30(1),62-65.
Murzin,A.G.,Brenner,S.E.,Hubbard,T.,&Chothia,C.(1995).SCOP:astructuralclassificationofproteins
databasefortheinvestigationofsequencesandstructures.Journal of molecular biology, 247(4),536540.
Nagano,N.(2005).EzCatDB:theEnzymeCatalytic-mechanismDatabase.Nucleic acids research,
33(Databaseissue),D407-412.
Nagano,N.,Nakayama,N.,Ikeda,K.,Fukuie,M.,Yokota,K.,Doi,T.,...Tomii,K.(2014).EzCatDB:the
enzymereactiondatabase,2015update.Nucleic acids research.
Nagy,A.,Hegyi,H.,Farkas,K.,Tordai,H.,Kozma,E.,Banyai,L.,&Patthy,L.(2008).Identificationand
correctionofabnormal,incompleteandmispredictedproteinsinpublicdatabases.BMC
bioinformatics, 9,353.
Overington,J.P.,Al-Lazikani,B.,&Hopkins,A.L.(2006).Howmanydrugtargetsarethere?Nat Rev Drug
Discov, 5(12),993-996.
Pawson,A.J.,Sharman,J.L.,Benson,H.E.,Faccenda,E.,Alexander,S.P.,Buneman,O.P.,...Harmar,A.
J.(2014).TheIUPHAR/BPSGuidetoPHARMACOLOGY:anexpert-drivenknowledgebaseofdrug
targetsandtheirligands.Nucleic acids research, 42(Databaseissue),D1098-1106.
Pegg,S.C.,Brown,S.D.,Ojha,S.,Seffernick,J.,Meng,E.C.,Morris,J.H.,...Babbitt,P.C.(2006).
Leveragingenzymestructure-functionrelationshipsforfunctionalinferenceandexperimentaldesign:
thestructure-functionlinkagedatabase.Biochemistry, 45(8),2545-2555.
Pettersen,E.F.,Goddard,T.D.,Huang,C.C.,Couch,G.S.,Greenblatt,D.M.,Meng,E.C.,&Ferrin,T.E.
(2004).UCSFChimera--avisualizationsystemforexploratoryresearchandanalysis.Journal of
computational chemistry, 25(13),1605-1612.
Pettifer,S.,Ison,J.,Kalas,M.,Thorne,D.,McDermott,P.,Jonassen,I.,...Vriend,G.(2010).The
EMBRACEwebservicecollection.Nucleic Acids Res, 38(WebServerissue),W683-688.
Plowman,G.D.,Sudarsanam,S.,Bingham,J.,Whyte,D.,&Hunter,T.(1999).Theproteinkinasesof
Caenorhabditiselegans:amodelforsignaltransductioninmulticellularorganisms.Proc Natl Acad
Sci U S A, 96(24),13603-13610.
Poth,A.G.,Chan,L.Y.,&Craik,D.J.(2013).Cyclotidesasgraftingframeworksforproteinengineeringand
drugdesignapplications.Biopolymers, 100(5),480-491.
Puillandre,N.,Koua,D.,Favreau,P.,Olivera,B.M.,&Stocklin,R.(2012).Molecularphylogeny,
classificationandevolutionofconopeptides.Journal of molecular evolution, 74(5-6),297-309.
Rakshambikai,R.,Gnanavel,M.,&Srinivasan,N.(2014).Hybridandroguekinasesencodedinthegenomes
ofmodeleukaryotes.PloS one, 9(9),e107956.
Rawlings,N.D.,Waller,M.,Barrett,A.J.,&Bateman,A.(2014).MEROPS:thedatabaseofproteolytic
enzymes,theirsubstratesandinhibitors.Nucleic acids research, 42(Databaseissue),D503-D509.
Reddy,T.B.,Thomas,A.D.,Stamatis,D.,Bertsch,J.,Isbandi,M.,Jansson,J.,...Kyrpides,N.C.(2015).
TheGenomesOnLineDatabase(GOLD)v.5:ametadatamanagementsystembasedonafourlevel
(meta)genomeprojectclassification.Nucleic Acids Res, 43(Databaseissue),D1099-1106.
Rhodes,D.R.,Yu,J.,Shanker,K.,Deshpande,N.,Varambally,R.,Ghosh,D.,...Chinnaiyan,A.M.(2004).
ONCOMINE:acancermicroarraydatabaseandintegrateddata-miningplatform.Neoplasia, 6(1),16.
Rose,P.W.,Bi,C.,Bluhm,W.F.,Christie,C.H.,Dimitropoulos,D.,Dutta,S.,...Bourne,P.E.(2013).The
RCSBProteinDataBank:newresourcesforresearchandeducation.Nucleic acids research,
41(Databaseissue),D475-482.
Rose,P.W.,Prlic,A.,Bi,C.,Bluhm,W.F.,Christie,C.H.,Dutta,S.,...Burley,S.K.(2014).TheRCSB
ProteinDataBank:viewsofstructuralbiologyforbasicandappliedresearchandeducation.Nucleic
acids research.
Saier,M.H.,Jr.(2000).Afunctional-phylogeneticclassificationsystemfortransmembranesolute
transporters.Microbiol Mol Biol Rev, 64(2),354-411.
Saier,M.H.,Reddy,V.S.,Tamang,D.G.,&Vstermark,.(2014).TheTransporterClassification
Database..Nucleic acids research, 42(Databaseissue),D251-D258.
Scherf,M.,Epple,A.,&Werner,T.(2005).Thenextgenerationofliteratureanalysis:integrationofgenomic
analysisintotextmining.Brief Bioinform, 6(3),287-297.
Schnoes,A.M.,Brown,S.D.,Dodevski,I.,&Babbitt,P.C.(2009).AnnotationErrorinPublicDatabases:
MisannotationofMolecularFunctioninEnzymeSuperfamilies.PLoS computational biology, 5(12),
e1000605.
Schully,S.D.,Yu,W.,McCallum,V.,Benedicto,C.B.,Dong,L.M.,Wulf,A.,...Khoury,M.J.(2011).
CancerGAMAdb:databaseofcancergeneticassociationsfrommeta-analysesandgenome-wide
associationstudies.Eur J Hum Genet, 19(8),928-930.
Sethupathy,P.,Corda,B.,&Hatzigeorgiou,A.G.(2006).TarBase:Acomprehensivedatabaseof
experimentallysupportedanimalmicroRNAtargets.RNA, 12(2),192-197.
Shepelev,V.,&Fedorov,A.(2006).AdvancesintheExon-IntronDatabase(EID).Brief Bioinform, 7(2),178185.
Sherry,S.T.,Ward,M.H.,Kholodov,M.,Baker,J.,Phan,L.,Smigielski,E.M.,&Sirotkin,K.(2001).
dbSNP:theNCBIdatabaseofgeneticvariation.Nucleic Acids Res, 29(1),308-311.
Sigrist,C.J.,Cerutti,L.,deCastro,E.,Langendijk-Genevaux,P.S.,Bulliard,V.,Bairoch,A.,&Hulo,N.
(2010).PROSITE,aproteindomaindatabaseforfunctionalcharacterizationandannotation.Nucleic
Acids Res, 38(Databaseissue),D161-166.
Smith,B.,Ashburner,M.,Rosse,C.,Bard,J.,Bug,W.,Ceusters,W.,...Lewis,S.(2007).TheOBO
Foundry:coordinatedevolutionofontologiestosupportbiomedicaldataintegration.Nature
biotechnology, 25(11).
Sowdhamini,R.,Burke,D.F.,Huang,J.F.,Mizuguchi,K.,Nagarajaram,H.A.,Srinivasan,N.,...Blundell,
T.L.(1998).CAMPASS:adatabaseofstructurallyalignedproteinsuperfamilies.Structure, 6(9),
1087-1094.
Spedding,M.(2011).Resolutionofcontroversiesindrug/receptorinteractionsbyproteinstructure.
Limitationsandpharmacologicalsolutions.Neuropharmacology, 60(1),3-6.
Srivastava,M.,Simakov,O.,Chapman,J.,Fahey,B.,Gauthier,M.E.,Mitros,T.,...Rokhsar,D.S.(2010).
TheAmphimedonqueenslandicagenomeandtheevolutionofanimalcomplexity.Nature, 466(7307),
720-726.
Stajich,J.E.,Wilke,S.K.,Ahren,D.,Au,C.H.,Birren,B.W.,Borodovsky,M.,...Pukkila,P.J.(2010).
Insightsintoevolutionofmulticellularfungifromtheassembledchromosomesofthemushroom
Coprinopsiscinerea(Coprinuscinereus).Proc Natl Acad Sci U S A, 107(26),11889-11894.
Stark,C.,Breitkreutz,B.J.,Reguly,T.,Boucher,L.,Breitkreutz,A.,&Tyers,M.(2006).BioGRID:ageneral
repositoryforinteractiondatasets.Nucleic Acids Res, 34(Databaseissue),D535-539.
Stein,L.(2013).Creatingdatabasesforbiologicalinformation:anintroduction.Curr Protoc Bioinformatics,
Chapter 9,Unit91.
Sun,H.,Palaniswamy,S.K.,Pohar,T.T.,Jin,V.X.,Huang,T.H.,&Davuluri,R.V.(2006).MPromDb:an
integratedresourceforannotationandvisualizationofmammaliangenepromotersandChIP-chip
experimentaldata.Nucleic Acids Res, 34(Databaseissue),D98-103.
Tan,N.C.,&Berkovic,S.F.(2010).TheEpilepsyGeneticAssociationDatabase(epiGAD):analysisof165
geneticassociationstudies,1996-2008.Epilepsia, 51(4),686-689.
Terlau,H.,&Olivera,B.M.(2004).Conusvenoms:arichsourceofnovelionchannel-targetedpeptides.
Physiological reviews, 84(1),41-68.
Theodoropoulou,M.C.,Bagos,P.G.,Spyropoulos,I.C.,&Hamodrakas,S.J.(2008).gpDB:adatabaseof
GPCRs,G-proteins,effectorsandtheirinteractions.Bioinformatics, 24(12),1471-1472.
Tipton,K.F.(1994).NomenclatureCommitteeoftheInternationalUnionofBiochemistryandMolecular
Biology(NC-IUBMB).Enzymenomenclature.Recommendations1992.Supplement:correctionsand
additions.European journal of biochemistry / FEBS, 223(1),1-5.
Tough,D.F.,Lewis,H.D.,Rioja,I.,Lindon,M.J.,&Prinjha,R.K.(2014).Epigeneticpathwaytargetsfor
thetreatmentofdisease:acceleratingprogressinthedevelopmentofpharmacologicaltools:IUPHAR
Review11.British journal of pharmacology, 171(22),4981-5010.
Trabi,M.,&Craik,D.J.(2002).Circularproteins--noendinsight.Trends Biochem Sci, 27(3),132-138.
Tsaousis,G.N.,Tsirigos,K.D.,Andrianou,X.D.,Liakopoulos,T.D.,Bagos,P.G.,&Hamodrakas,S.J.
(2010).ExTopoDB:adatabaseofexperimentallyderivedtopologicalmodelsoftransmembrane
proteins.Bioinformatics, 26(19),2490-2492.
Tsirigos,K.D.,Bagos,P.G.,&Hamodrakas,S.J.(2011).OMPdb:adatabaseof{beta}-barrelouter
membraneproteinsfromGram-negativebacteria.Nucleic acids research, 39(Databaseissue),D324331.
Umemura,M.,Nagano,N.,Koike,H.,Kawano,J.,Ishii,T.,Miyamura,Y.,...Machida,M.(2014).
Characterizationofthebiosyntheticgeneclusterfortheribosomallysynthesizedcyclicpeptide
ustiloxinBinAspergillusflavus.Fungal Genet Biol, 68,23-30.
UniProt.(2014).ActivitiesattheUniversalProteinResource(UniProt).Nucleic acids research, 42(Database
issue),D191-198.
Vroling,B.,Sanders,M.,Baakman,C.,Borrmann,A.,Verhoeven,S.,Klomp,J.,...Vriend,G.(2011).
GPCRDB:informationsystemforGprotein-coupledreceptors.Nucleic acids research, 39(Database
issue),D309-D319.
Wang,C.K.,Kaas,Q.,Chiche,L.,&Craik,D.J.(2008).CyBase:adatabaseofcyclicproteinsequencesand
structures,withapplicationsinproteindiscoveryandengineering.Nucleic acids research, 36(suppl
1),D206-D210.
Wong,W.C.,Maurer-Stroh,S.,&Eisenhaber,F.(2010).Morethan1,001problemswithproteindomain
databases:transmembraneregions,signalpeptidesandtheissueofsequencehomology.PLoS Comput
Biol, 6(7),e1000867.
Wren,J.D.(2008).URLdecayinMEDLINEa4-yearfollow-upstudy.Bioinformatics, 24(11),1381-1385.
Xenarios,I.,Salwinski,L.,Duan,X.J.,Higney,P.,Kim,S.M.,&Eisenberg,D.(2002).DIP,theDatabaseof
InteractingProteins:aresearchtoolforstudyingcellularnetworksofproteininteractions.Nucleic
Acids Res, 30(1),303-305.
Xia,J.,Wang,Q.,Jia,P.,Wang,B.,Pao,W.,&Zhao,Z.(2012).NGScatalog:Adatabaseofnextgeneration
sequencingstudiesinhumans.Hum Mutat, 33(6),E2341-2355.
3:
,
.
. , ,
,
. ,
(FASTA, BLAST), .
,
..
3.
,
, , ,
.
(,),(),
.
, ,
. (
); ;
;,;
, ,
..99%,
(
).80%
.
;
30% 80 ,
, ,
.
,
,
. ,
,
,,
,
.
,
,.
3.1.
(DNA,RNA)
,,-
DNA -, n
.4(A,T,G,C):
$p_A, p_T, p_G, p_C$, with $p_k \ge 0$ and $\sum_{k \in \{A,T,G,C\}} p_k = 1$.
, , 20 20
. DNA, x=x1, x2,...,xn xi { A, T , G , C}
:
$$P(x) = \prod_{i=1}^{n} p_{x_i}$$
(
4n4n)1:
$$\sum_{j=1}^{4^n} P(x^{j}) = 1.$$
x n DNA. 4
,:
$$P(n_A, n_T, n_G, n_C) = \frac{n!}{n_A!\,n_T!\,n_G!\,n_C!}\; p_A^{n_A}\, p_T^{n_T}\, p_G^{n_G}\, p_C^{n_C} \tag{3.1}$$
,
,:
$$P(X = x) = \binom{n}{x}\, p_A^{x}\, (1 - p_A)^{n-x} \tag{3.2}$$
3(T,G,C).DNA
Bernoullip=pAq=1-pA.
3 . DNA (
)
.()
4 ( 20 ),
.,
(0)
--
,,, (Durbin, Eddy, Krogh, & Mitchison, 1998).
, (
,
)
,:
$$p_A = p_T = p_G = p_C = \frac{1}{4}$$
(Information Theory).
.D,
,Shannon:
$$H(x) = -\sum_{i} P(x_i) \log P(x_i) \tag{3.3}$$
,pA=pG=pT=pC=1/4
$H(x) = -\sum_{k} (1/4)\log(1/4) = \log 4$.
2,bit.:
$$I(x) = H_{max} - H_{obs} \tag{3.4}$$
2bits,
0.
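The entropy (3.3) and the information content (3.4) can be checked numerically. The following is only a minimal sketch, assuming base-2 logarithms (bits) and symbol frequencies estimated from the sequence itself; the function names are illustrative and do not come from the text.

```python
import math
from collections import Counter


def shannon_entropy(seq, base=2.0):
    """Shannon entropy (3.3) of the observed symbol frequencies in seq."""
    n = len(seq)
    counts = Counter(seq)
    return -sum((c / n) * math.log(c / n, base) for c in counts.values())


def information_content(seq, alphabet_size=4, base=2.0):
    """Information content (3.4): H_max - H_obs, with H_max = log(alphabet size)."""
    h_max = math.log(alphabet_size, base)
    return h_max - shannon_entropy(seq, base)


if __name__ == "__main__":
    for s in ("ATGC", "AAAA", "AATTAATT"):
        print(s, shannon_entropy(s), information_content(s))
```

A sequence with equal base frequencies reaches the maximum of 2 bits per position (information content 0), while a homopolymer has entropy 0 and information content 2 bits.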
,
(Wootton&Federhen,1993).,
""(compositionalcomplexity),,
k,:
$$K = \frac{1}{k} \log_N \frac{k!}{\prod_{s=1}^{N} n_s!} \tag{3.5}$$
,ns s
(4,20).,
(),.,4
,
$$K = \frac{1}{4}\log_4 \frac{4!}{4!\,0!\,0!\,0!} = \frac{1}{4}\log_4 1 = 0.$$
,,
. ,
ATGC
$$K = \frac{1}{4}\log_4 \frac{4!}{1!\,1!\,1!\,1!} = \frac{1}{4}\log_4 24 = 0.573$$
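The two worked values of the compositional complexity (3.5) can be reproduced with a few lines of code. This is a sketch under the assumption that the window is a plain string and that N is the alphabet size (4 for DNA, 20 for proteins); `compositional_complexity` is a hypothetical helper name.

```python
import math
from collections import Counter


def compositional_complexity(window, alphabet_size=4):
    """Complexity K of (3.5): (1/k) * log_N( k! / prod(n_s!) ) for a window of length k."""
    k = len(window)
    counts = Counter(window)
    # log of the multinomial coefficient, computed with log-gamma to avoid huge factorials
    log_omega = math.lgamma(k + 1) - sum(math.lgamma(c + 1) for c in counts.values())
    return log_omega / (k * math.log(alphabet_size))


if __name__ == "__main__":
    print(compositional_complexity("AAAA"))  # homopolymer window
    print(compositional_complexity("ATGC"))  # all four bases present once
```

Using `math.lgamma` instead of explicit factorials keeps the computation stable for long windows.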
WoottonFederhen,
:
$$H_k = -\sum_{s=1}^{N} \frac{n_s}{k} \log_2 \frac{n_s}{k} \tag{3.6}$$
,,
(,),
, Boltzman (
Shannon).
, ,
,.
WoottonFederhen,
.
,
(. ),
DNA.,
,
.
,(RelativeEntropy).
P, Q ( Kullback-Leibler)
,,:
$$H(P, Q) = \sum_{i} P(x_i) \log \frac{P(x_i)}{Q(x_i)} \tag{3.7}$$
P(xi)(A,T,G,C)i
, Q(xi)
.
,,
.Q(xi)=1/4()(P,Q)=I(P)
(MutualInformation)..,
:
$$M(X, Y) = \sum_{i,j} P(x_i, y_j) \log \frac{P(x_i, y_j)}{P(x_i)\,P(y_j)} \tag{3.8}$$
, , x y.
..P(xi,yi),
P(xi,yi)=P(xi)P(yi).P(xi)P(yi)
..,.,
. ,
3.2.
- Erdos Renyi
A G G C G A T A A A A A A A A A A A A A A A A C G G A T G C A T C G
3.1: 16DNA
log(n),
Erdos Renyi (Erdos & Renyi, 1970). n
Bernoullip,0p1,Rn
,log1/p(n):
$$P\left(\lim_{n \to \infty} \frac{R_n}{\log_{1/p}(n)} = 1\right) = 1. \tag{3.9}$$
(Waterman, 1995): p k
pk.n(n+)n
E( x)=npk
,,,Rn,1=npk,:
$$R_n \approx \log_{1/p}(n)$$
,,
.
3.2.1
n=10000DNA,(p=1/4),
.,
:
$$R_n \approx \log_{1/p}(n) \Rightarrow R_n \approx \log_4(10000) \Rightarrow R_n \approx \frac{\log_{10} 10000}{\log_{10} 4} = \frac{4}{0.60205} = 6.64$$
n=100050000p=0.1,0.25,0.4(1000)
(3.9).
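The comparison between (3.9) and simulated sequences can be sketched as follows. The code is illustrative only: it assumes independent Bernoulli(p) positions, uses Python's standard pseudo-random generator with a fixed seed, and the function names and the number of replicates are choices made here, not taken from the text.

```python
import math
import random


def expected_longest_run(n, p):
    """Leading-order estimate of the longest run: log_{1/p}(n), as in (3.9)."""
    return math.log(n) / math.log(1.0 / p)


def longest_run(seq, symbol):
    """Length of the longest block of consecutive occurrences of symbol in seq."""
    best = current = 0
    for s in seq:
        current = current + 1 if s == symbol else 0
        best = max(best, current)
    return best


def simulate_mean_longest_run(n, p, replicates=200, seed=0):
    """Average longest run of successes over simulated Bernoulli(p) sequences of length n."""
    rng = random.Random(seed)
    total = 0
    for _ in range(replicates):
        seq = [rng.random() < p for _ in range(n)]
        total += longest_run(seq, True)
    return total / replicates


if __name__ == "__main__":
    print(expected_longest_run(10000, 0.25))
    print(simulate_mean_longest_run(2000, 0.25))
```

The simulated mean usually falls somewhat below log_{1/p}(n), because (3.9) gives only the leading term of the growth.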
[Figure 3.2: E(max) plotted against log_{1/p}(n), for p = 0.1, 0.25 and 0.4, with n from 1,000 to 50,000.]
3.3.
Erdos Renyi
.
,
..90% .,
.
,
( ,
1-2 ). ,
,
.
(Large Deviation Theory)
. 3.3 20
16.(
),
.
A G G C G A T A A A A A A A A T A A G A C C A A A A A C G G A T G C A T
3.3: 20 80% .
, p ,
,:
$$H(a, p) = a \log \frac{a}{p} + (1 - a) \log \frac{1 - a}{1 - p} = \log \frac{a^{a} (1 - a)^{1 - a}}{p^{a} (1 - p)^{1 - a}} \tag{3.10}$$
(k,p)
(DNA
p),,,(k,)
-n,
=s/k. , 0p,a1.
,(,p)
. ,
. ,
0pa1Y~B(k,p),P(Yak),
:
$$P(Y \ge ak) \approx e^{-kH(a, p)} \tag{3.11}$$
(3.9).,n
Bernoulli p, 0pa1, Rna
100%,(Erdos&Renyi,1970;Erdos&Revesz,1975):
$$P\left(\lim_{n \to \infty} \frac{R_n^{a}}{\log(n) / H(a, p)} = 1\right) = 1 \tag{3.12}$$
:
(LargeDeviations)(3.11)k100%
, e-kH(a,p).n-k+1n
:
$$1 = n\, e^{-kH(a, p)} \Rightarrow R_n^{a} = \frac{\log(n)}{H(a, p)}$$
, De L Hospital, =1(,p)=log(1/p)
(3.9).
3.3.1
1,000,000 DNA, $R_n^{a}$
80%():
$$R_n^{a} = \frac{\log(n)}{H(a, p)} = \frac{\log(1000000)}{0.666} = 20.744$$
(loge)
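The arithmetic of this example follows directly from (3.10) and (3.12). The sketch below assumes natural logarithms throughout and treats a = 1 as the limiting case H(1, p) = log(1/p); the function names are hypothetical.

```python
import math


def relative_entropy_bernoulli(a, p):
    """H(a, p) of (3.10), natural logarithms, for 0 < p < a <= 1."""
    if a == 1.0:
        return math.log(1.0 / p)  # limiting value as a -> 1
    return a * math.log(a / p) + (1.0 - a) * math.log((1.0 - a) / (1.0 - p))


def longest_quality_run(n, a, p):
    """Leading-order length of the longest segment with a fraction a of matches, from (3.12)."""
    return math.log(n) / relative_entropy_bernoulli(a, p)


if __name__ == "__main__":
    print(relative_entropy_bernoulli(0.8, 0.25))
    print(longest_quality_run(1_000_000, 0.8, 0.25))
    print(longest_quality_run(10_000, 1.0, 0.25))  # reduces to log_{1/p}(n)
```

With H rounded to 0.666 the result is 20.744; with unrounded H the value is about 20.74, so the two agree to the second decimal.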
3.4.
- (EVD)
,
.,
,,,
( ,
..). ( )
(Extreme ValueDistributions).
,1, 2,..., n,(iid)
:
$$M_n = a_n \left[\max(X_1, X_2, \ldots, X_n) - b_n\right], \quad n \to \infty$$
n, bn
. (cdf)
n,bn,(Davison,1998):
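1. $F(y) = \exp(-e^{-y}), \quad -\infty < y < \infty$ (Gumbel)
2. $F(y) = \begin{cases} 0, & y \le 0 \\ \exp(-y^{-a}), & y > 0,\ a > 0 \end{cases}$ (Frechet)
3. $F(y) = \begin{cases} \exp(-(-y)^{a}), & y < 0,\ a > 0 \\ 1, & y \ge 0 \end{cases}$ (Weibull)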
Gumbel,
(GeneralizedExtremeValueDistributionGEVD):
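$$H(y) = \exp\left\{-\left[1 - k\,\frac{y - a}{b}\right]^{1/k}\right\}, \quad -\infty < k < \infty,\ b > 0 \tag{3.13}$$
$$1 - k\,\frac{y - a}{b} > 0, \quad k \neq 0$$
$$z = \frac{y - a}{b}, \quad t = \frac{1}{k}: \qquad H(y) = \exp\left\{-\left[1 - \frac{z}{t}\right]^{t}\right\}$$
$k \to 0 \Rightarrow t \to \infty$:
$$\lim_{n \to \infty} \left(1 + \frac{a}{n}\right)^{n} = e^{a}$$
$$\lim_{t \to \infty} H(y) = \lim_{t \to \infty} \exp\left\{-\left[1 - \frac{z}{t}\right]^{t}\right\} = \exp\left(-e^{-z}\right) = \exp\left(-e^{-(y - a)/b}\right)$$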
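$$F(y) = \exp\left(-e^{-(y - a)/b}\right), \quad -\infty < y < \infty \tag{3.14}$$
$$E = a - b\,\Gamma'(1), \qquad V = \frac{\pi^2 b^2}{6} \tag{3.15}$$
(Arratia, Gordon, and Waterman, 1986; Arratia, Gordon, and Waterman, 1990;
Waterman,1995),(DNA)
$$a_n = \frac{\log(qn)}{\lambda}, \qquad b_n = \frac{1}{\lambda}, \qquad \lambda = \log(1/p):$$
$$\lim_{n \to \infty} P\left(R_n - \frac{\log(nq)}{\lambda} \le y\right) = \exp\left(-e^{-\lambda y}\right)$$
..Rn:
$$F(y) = P(R_n \le y) = \exp\left\{-\exp\left[-\left(\lambda y - \log(nq)\right)\right]\right\} \tag{3.16}$$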
,Gumbel.
,(3.16)(3.14).,:
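$$E(R_n) = \frac{\log(n)}{\lambda} + \frac{\log(q)}{\lambda} + \frac{\gamma}{\lambda} - \frac{1}{2} \;\Rightarrow\; E(R_n) = \log_{1/p}(n) + \log_{1/p}(q) + \frac{\gamma}{\lambda} - \frac{1}{2} \tag{3.17}$$
$$\mathrm{var}(R_n) = \frac{\pi^2}{6\lambda^2} + \frac{1}{12} \tag{3.18}$$
where $\gamma = -\Gamma'(1) = 0.5772$ is the Euler-Mascheroni constant and $\lambda = \log(1/p)$.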
1/12Sheppard,
...
[Figure 3.4: Gumbel (EVD) density against X; EVD: P(X<x)=exp(-exp(-x)).]
3.4.1
(3.17)(3.18)3.2.1:
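$$E(R_n) = \log_{1/p}(n) + \log_{1/p}(q) + \frac{\gamma}{\lambda} - \frac{1}{2}$$
$$E(R_n) = \log_4(10000) + \log_4\left(\tfrac{3}{4}\right) + \frac{0.5772}{\log(4)} - \frac{1}{2} \approx 6.353$$
$$\mathrm{var}(R_n) = \frac{\pi^2}{6\lambda^2} + \frac{1}{12} = 0.939$$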
,(3.17),
, ,
(3.9),.,
log(n), log(q) /
.,n,.,
RnGumbel(EVD)
(3.17)(3.18).
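The mean (3.17) and variance (3.18) of the longest run are simple to evaluate. The sketch below assumes natural logarithms, λ = log(1/p) and q = 1 - p; the constant and function names are illustrative.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant, gamma = -Gamma'(1)


def longest_run_moments(n, p):
    """Mean (3.17) and variance (3.18) of the longest run R_n under the Gumbel approximation."""
    lam = math.log(1.0 / p)
    q = 1.0 - p
    mean = math.log(n) / lam + math.log(q) / lam + EULER_GAMMA / lam - 0.5
    var = math.pi ** 2 / (6.0 * lam ** 2) + 1.0 / 12.0
    return mean, var


if __name__ == "__main__":
    mean, var = longest_run_moments(10000, 0.25)
    print(round(mean, 3), round(var, 3))
```

For n = 10000 and p = 1/4 the mean is about 6.35, noticeably below the leading-order estimate log_4(10000) = 6.64, and the variance (about 0.939) does not depend on n.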
[Figure 3.5: panels for n = 1000, 5000, 10000, 15000, 20000, 25000, 30000, 35000, 40000, 45000 and 50000; p = 0.1, 0.25 and 0.4; n from 1,000 to 50,000.]
[Figure 3.6: E(max) plotted against log_{1/p}(qn), for p = 0.1, 0.25 and 0.4, with n from 1000 to 50000.]
3.5.
Rna100%,
(score), . ,
,
.
, ..
80%(Karlin&Altschul,1990;Karlin&Brendel,1992),
.,Rna100%
,:
$$s_k = \log \frac{a_k}{p_k} \tag{3.19}$$
pk (.. p=1/4) k
(targetfrequency)
(,),.p,
.i
(segment score)
100% , (maximal
segmentscore)n(n).
( =0.8, p=0.25): (3.19)
,sA=log(0.8/0.25)=1.163
sN=log(0.2/0.25)=-0.223.2010,
s=16*1.163-4*0.223=17.716.
Residue   (ak)    (pk)    log(ak/pk)
A         0.109   0.071    0.429
C         0.019   0.020   -0.051
D         0.007   0.053   -2.024
E         0.007   0.062   -2.181
F         0.090   0.039    0.836
G         0.082   0.070    0.158
H         0.008   0.023   -1.056
I         0.120   0.046    0.959
K         0.005   0.055   -2.398
L         0.168   0.087    0.658
M         0.040   0.025    0.470
N         0.016   0.048   -1.099
P         0.028   0.055   -0.675
Q         0.009   0.043   -1.564
R         0.005   0.061   -2.501
S         0.053   0.071   -0.292
T         0.050   0.060   -0.182
V         0.115   0.061    0.634
W         0.027   0.017    0.463
Y         0.040   0.034    0.163
3.1: , 160
(Krogh, Larsson, von Heijne, & Sonnhammer, 2001). ak=0
( 0.0001).
( -9 ).
- , Uniprot
(http://web.expasy.org/docs/relnotes/relstat.html).
(k)
Rna,=1()
, - (
)... k=10=10
( =-). ,
=1,,
, sA=log(1/0.25)=1.386, sN=log(0/0.25)= - (,
,-10.000).
16,16*1.386=22.176.
, . 3.1
160 (Krogh, et al., 2001).
,
( ) . , (3.19)
.(
I, F, L, V, M W),
- (
). , (Q, D, E, K, R),
. , .
, ,
.
3. 7: 20 .
, .
()
.:
,
$$E(s_k) = \sum_k p_k s_k = \sum_k p_k \log \frac{a_k}{p_k} < 0 \tag{3.20}$$
,(sk)
(,p).
score n ,
(Karlin & Altschul, 1990) n (
score):
$$P\left(M > \frac{\log(n)}{\lambda} + x\right) \approx 1 - \exp\left(-K e^{-\lambda x}\right) \tag{3.21}$$
Gumbel,
.
:
$$\sum_k p_k \exp(\lambda s_k) = 1 \tag{3.22}$$
,n,
$\sum_k p_k \exp(\lambda s_k) = 1$.,:
$$a_k = p_k \exp(\lambda s_k) \tag{3.23}$$
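Equation (3.22) has no closed-form solution for a general scoring scheme, so λ is usually found numerically. The following is a minimal sketch using plain bisection; it assumes a negative expected score and at least one positive score, and `solve_lambda` is a hypothetical name, not a routine from BLAST or any library.

```python
import math


def solve_lambda(probs, scores, tol=1e-12):
    """Unique positive root of sum_k p_k * exp(lambda * s_k) = 1, found by bisection."""
    expected = sum(p * s for p, s in zip(probs, scores))
    if expected >= 0 or max(scores) <= 0:
        raise ValueError("need a negative expected score and at least one positive score")

    def f(lam):
        return sum(p * math.exp(lam * s) for p, s in zip(probs, scores)) - 1.0

    lo, hi = 1e-9, 1.0
    while f(hi) < 0:          # f is negative just above 0 and eventually positive
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    # log-odds scores built from target a = 0.8 and background p = 0.25, as in (3.19)
    p = [0.25, 0.75]
    s = [math.log(0.8 / 0.25), math.log(0.2 / 0.75)]
    print(solve_lambda(p, s))             # close to 1 for log-odds scores
    print(solve_lambda(p, [1.0, -1.0]))   # match +1 / mismatch -1
```

When the scores are already natural-log odds ratios, as in (3.19), the root is λ = 1, which is consistent with (3.23); for the +1/-1 scheme with p = 1/4 the equation reduces to a quadratic with root λ = log 3.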
( x)
(rare events), Poisson , =Knexp(-x)
(3.21):
$$P(M(n) \ge x) \approx 1 - e^{-E} \tag{3.24}$$
-(E-value):
$$1 - \exp(-\exp(-t)) \approx 1 - (1 - \exp(-t)) = \exp(-t) \tag{3.25}$$
P-value E-value. , Poisson,
n,mS(m) x
:
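$$P(S_{(m)} \ge x) \approx 1 - \exp\left(-Kn e^{-\lambda x}\right) \sum_{i=0}^{m-1} \frac{\left(Kn e^{-\lambda x}\right)^{i}}{i!} \tag{3.26}$$
Rn,,
:
$$K = 1 - p = q \tag{3.27}$$
$$\lambda = \log(1/p) \tag{3.28}$$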
3.6.
,
.
,,,
.
x=x1,x2,,xn y=y1,y2,,ym
.:
(
)
, (alignment)
()
,,
,
()
, ,
(dot plot). ,
. ,
(),
, , . , 100%
,.
, () .
,,
.,
().
3.8: (dot plot). ,
, .
, 1-2 . ,
. ,
( 300 )
,,
.,
. DNA x=x1,x2,,xn y=y1,y2,,ym
(),:
$$P(x, y \mid R) = \prod_i q_{x_i} \prod_j q_{y_j}, \quad i = 1, 2, \ldots, n,\ j = 1, 2, \ldots, m \tag{3.29}$$
,
:
$$P(x, y \mid M) = \prod_i p_{x_i y_i}, \quad i = 1, 2, \ldots, n \tag{3.30}$$
(-),
n=m.(likelihoodratio)oddsratio
:
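$$\frac{P(x, y \mid M)}{P(x, y \mid R)} = \frac{\prod_i p_{x_i y_i}}{\prod_i q_{x_i} \prod_j q_{y_j}} = \prod_i \frac{p_{x_i y_i}}{q_{x_i} q_{y_i}} \tag{3.31}$$
$$S = \sum_i \log \frac{p_{x_i y_i}}{q_{x_i} q_{y_i}} = \sum_i s(x_i, y_i) \tag{3.32}$$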
, score
. 4 DNA 4x4
score16
:
$$s(x_i, y_i) = \begin{cases} +1, & x_i = y_i \\ -1, & x_i \neq y_i \end{cases} \tag{3.33}$$
score +1 (match) -1 (mismatch).
(match matrix) -
(..10000)
, (identitymatrix).
(..)
.
(-)
.
, .
(substitution matrices)
score
(mismatches)
.Altschul(Altschul,1991),
.PAM(Dayhoff,
Schwartz,&Orcutt,1978),BLOSUM(Henikoff&Henikoff,1992),GONNET(Gonnet,Cohen,&Benner,
1992),.
, .
,
(insertion)(deletion)
. ( ), (gap)
(),
.
score
.()
score (g), g ,
:
$$\gamma(g) = -gd \tag{3.34}$$
:
$$\gamma(g) = -d - (g - 1)e \tag{3.35}$$
d(gapopenpenalty)e(gap
extension penalty).
(Vingron&Waterman,1994).
3.7.
DNA,
(3.33).,
1 -3 ,
,5-4.
,
(substitutionmatrices)score
(mismatches)
.Altschul(Altschul,1991),
.
PAM(Dayhoff,etal.,1978),BLOSUM(Henikoff&Henikoff,1992),GONNET(Gonnet,etal.,
1992).
BLOSUM(Henikoff&Henikoff,1992),
.
(blocks),
( ).
:
$$s_{ij} = \frac{1}{\lambda} \log \frac{q_{ij}}{p_i p_j} \tag{3.36}$$
qij,ij(targetfrequencies),pi,
pj(backgroundfrequencies)
.
(3.32) .
.
,PointAccepted
Mutations (PAM) (Dayhoff, et al., 1978). ,
(PAM)
, . PAM1
85%.
(
),PAM30,PAM250...
=(1).
(),
.
(..
) , PAM250,
20-25%.
,
. , PAM, BLOSUM
, ,
,.,PAM,
BLOSUM , ,
,(3.2).
PAM       BLOSUM
PAM100    BLOSUM90
PAM120    BLOSUM80
PAM160    BLOSUM60
PAM200    BLOSUM52
PAM250    BLOSUM45
Table 3.2: PAM and BLOSUM
, ,
.
,
,BLOSUM62.
( ),
BLOSUM90. , ,
,
BLOSUM45. ,
, ( )
.
, , :
(,,...),
. ,
,.
,
(,),
score.,
,()
. BLOSUM62, (C)
(W),,(9
11,),(),,
(4). , , .
, ,
, PHAT (Ng, Henikoff, & Henikoff,
2000) SLIM (Muller, Rahmann, & Rehmsmeier, 2001). , ,
().
3.8.
,
. ,
,,
, . ,
. ,
,
. ,
.,,.
,
n m
,
n=m,
n
Stirling(Durbin,etal.,1998):
$$\binom{2n}{n} = \frac{(2n)!}{(n!)^2} \simeq \frac{2^{2n}}{\sqrt{\pi n}} \tag{3.37}$$
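A quick numerical check of (3.37) shows how fast the count grows and how close the Stirling estimate is. The sketch assumes Python 3.8 or later (for `math.comb`); the function names are illustrative.

```python
import math


def count_alignments(n):
    """Exact value of the binomial coefficient C(2n, n) in (3.37)."""
    return math.comb(2 * n, n)


def stirling_estimate(n):
    """Stirling approximation 2^(2n) / sqrt(pi * n) of the same quantity."""
    return 2.0 ** (2 * n) / math.sqrt(math.pi * n)


if __name__ == "__main__":
    for n in (2, 10, 100):
        exact = count_alignments(n)
        print(n, exact, stirling_estimate(n) / exact)
```

Already for two sequences of length 100 the count is of the order of 10^59, which is why exhaustive enumeration is not an option and dynamic programming is used instead.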
.
.
,
.,
( ),
.,,
,.
3.10: , F(i,j) 3
F(i-1,j) F(i,j-1) F(i-1,j-1).
3.9.
- O Needleman Wunsch
(global
alignment).
(..)
.
Needleman-Wunsch(Needleman
&Wunsch,1970).
:
$$F(i, j) = \max\left\{ F(i-1, j-1) + s(x_i, y_j),\ F(i-1, j) - d,\ F(i, j-1) - d \right\} \tag{3.38}$$
,
,F(i,0)=-idF(0,j)=-jd.
, (recursion) ,
,.
$$s(x_i, y_i) = \begin{cases} +1, & x_i = y_i \\ -1, & x_i \neq y_i \end{cases} \tag{3.39}$$
d=1,:
:
AAGTTAGCAG
CAGTATCGCA
3().
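The global alignment recursion with the boundary conditions F(i,0) = -id and F(0,j) = -jd can be sketched in a few lines. This is an illustrative implementation, assuming match +1, mismatch -1 and a linear gap penalty d = 1 as in the example; ties in the traceback are broken in favour of the diagonal move, which is an arbitrary choice, and the function name is hypothetical.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, d=1):
    """Global alignment score and one optimal alignment (linear gap penalty d)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -i * d
    for j in range(1, m + 1):
        F[0][j] = -j * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i - 1][j] - d, F[i][j - 1] - d)

    # Traceback from the bottom-right corner to the origin
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if x[i - 1] == y[j - 1] else mismatch
            if F[i][j] == F[i - 1][j - 1] + s:
                ax.append(x[i - 1]); ay.append(y[j - 1])
                i -= 1; j -= 1
                continue
        if i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-")
            i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1])
            j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))


if __name__ == "__main__":
    score, ax, ay = needleman_wunsch("AAGTTAGCAG", "CAGTATCGCA")
    print(score)
    print(ax)
    print(ay)
```

For the two sequences of the example the optimal global score under this scoring is 3. Several alignments may share that score; the traceback returns only one of them.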
3.10.
(fit)
,
.(3.38).
(Galas,Eggert,&Waterman,1985):
$$F(i, j) = \max\left\{ F(i-1, j-1) + s(x_i, y_j),\ F(i-1, j) - d,\ F(i, j-1) - d \right\} \tag{3.40}$$
$$F(i, 0) = -id, \qquad F(0, j) = 0$$
CATGAT
2(2,
).
,,
- . (local alignment)
().
() (Pearson&
Wood, 2001). ,
, (domains),
.
Smith Waterman (Smith &
Waterman,1981):
$$F(i, j) = \max\left\{ F(i-1, j-1) + s(x_i, y_j),\ F(i-1, j) - d,\ F(i, j-1) - d,\ 0 \right\} \tag{3.41}$$
$$F(i, 0) = 0, \qquad F(0, j) = 0$$
.,,
3.11: .
( ).
, , .
,:
AGTATCGCA
AGTTAGCA
5().
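The local alignment differs from the global one only in the extra option 0 in the recursion, the zero boundary conditions, and the fact that the traceback starts at the highest cell and stops at the first zero. The sketch below assumes match +1, mismatch -1 and d = 1; the function name and the tie-breaking order are choices made here.

```python
def smith_waterman(x, y, match=1, mismatch=-1, d=1):
    """Best local alignment score and one optimal local alignment (linear gap penalty d)."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(0, F[i - 1][j - 1] + s, F[i - 1][j] - d, F[i][j - 1] - d)
            if F[i][j] > best:
                best, bi, bj = F[i][j], i, j

    # Traceback from the maximum cell until a cell with score 0 is reached
    ax, ay = [], []
    i, j = bi, bj
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1])
            i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-")
            i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1])
            j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


if __name__ == "__main__":
    score, ax, ay = smith_waterman("AAGTTAGCAG", "CAGTATCGCA")
    print(score)
    print(ax)
    print(ay)
```

Applied to the sequences AAGTTAGCAG and CAGTATCGCA it recovers a local alignment of the segments AGTTAGCA and AGTATCGCA with score 5.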
(mn). (mn) (big-O notation)
,
nm,.,
. ,
,(2n).
, ,
.,
. ,
,(nm2+mn2)
(open)(extension).
:
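$$F(i, j) = \max \begin{cases} F(i-1, j-1) + s(x_i, y_j), \\ F(k, j) + \gamma(i - k), & k = 0, \ldots, i-1, \\ F(i, k) + \gamma(j - k), & k = 0, \ldots, j-1 \end{cases} \tag{3.42}$$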
. ,
,(mn),
, . ,
(3.35).
3,:
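$$F(i, j) = \max \begin{cases} F(i-1, j-1) + s(x_i, y_j) \\ I_x(i-1, j-1) + s(x_i, y_j) \\ I_y(i-1, j-1) + s(x_i, y_j) \end{cases} \tag{3.43}$$
$$I_x(i, j) = \max\left\{ F(i-1, j) - d,\ I_x(i-1, j) - e \right\} \tag{3.44}$$
$$I_y(i, j) = \max\left\{ F(i, j-1) - d,\ I_y(i, j-1) - e \right\} \tag{3.45}$$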
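The three-matrix recursion for affine gaps can be sketched as below for the global case. It follows the formulation of Durbin et al. (1998), in which the match state is entered from cell (i-1, j-1) of any of the three matrices and an insertion cannot be followed directly by a deletion. The default penalties d = 3 and e = 1 and the function name are assumptions made only for illustration.

```python
NEG_INF = float("-inf")


def affine_global_score(x, y, match=1, mismatch=-1, d=3, e=1):
    """Global alignment score with affine gap cost d + (g - 1) * e, in O(nm) time."""
    n, m = len(x), len(y)
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]   # x_i aligned to y_j
    Ix = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # x_i aligned to a gap
    Iy = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # y_j aligned to a gap
    M[0][0] = 0
    for i in range(1, n + 1):
        Ix[i][0] = -d - (i - 1) * e
    for j in range(1, m + 1):
        Iy[0][j] = -d - (j - 1) * e
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1]) + s
            Ix[i][j] = max(M[i - 1][j] - d, Ix[i - 1][j] - e)
            Iy[i][j] = max(M[i][j - 1] - d, Iy[i][j - 1] - e)
    return max(M[n][m], Ix[n][m], Iy[n][m])


if __name__ == "__main__":
    print(affine_global_score("ACGTACGT", "ACGT"))
    print(affine_global_score("AAGTTAGCAG", "CAGTATCGCA", d=1, e=1))
```

With e = d the affine cost reduces to the linear cost gd, so the last call is a convenient consistency check against the linear-gap recursion.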
,
.,.
.
P-value ( ) ,
, , (
,).
, . ,
Gumbel.
, , .
.,Erdos
Renyi , (Waterman, 1995).
x=x1,x2,,xn y=y1,y2,,ym.
$M_n \approx \log_{1/p}(mn)$:
$$P\left(\lim_{n \to \infty} \frac{M_n}{\log_{1/p}(mn)} = 1\right) = 1 \tag{3.46}$$
$$P\left(\lim_{n \to \infty} \frac{M_n}{\log(mn) / H(a, p)} = 1\right) = 1 \tag{3.47}$$
(,p).,
(local) .
Local Similarity Score
,
.,(Arratia,GordonandWaterman,1986;Arratia,
Gordon and Waterman, 1990)
.,Arratia(1990)
,:
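$$E(M_n) = \frac{\log(mn)}{\lambda} + \frac{\log(q)}{\lambda} + \frac{\gamma}{\lambda} - \frac{1}{2} \tag{3.48}$$
q = 1-p, $\gamma = -\Gamma'(1) = 0.5772$ (Euler-Mascheroni), $\lambda = \log(1/p)$.,
:
$$\mathrm{Var}(M_n) = \frac{\pi^2}{6\lambda^2} + \frac{1}{12} \tag{3.49}$$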
,(3.17)(3.18)
.(3.48)
(3.47),log(mn),
log(q) / . , m n ,
. Arratia Waterman (Arratia & Waterman, 1989),
,k
(mismatches).,x=x1,x2,,xn
y=y1,y2,,ym,
k(kmismatches):
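$$E(M_n) = \log_{1/p}(qn^2) + k \log_{1/p} \log_{1/p}(qn^2) + k \log_{1/p}(q) - \log_{1/p}(k!) + k + \frac{\gamma}{\lambda} - \frac{1}{2} \tag{3.50}$$
:
$$\mathrm{var}(M_n) = \frac{\pi^2}{6\lambda^2} + \frac{1}{12} \tag{3.51}$$
,q = 1-p, $\gamma = -\Gamma'(1) = 0.5772$ (Euler-Mascheroni), $\lambda = \log(1/p)$.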
LocalSimilarityScore
-,
. ,
,
. ,
.,
30%(similarity)80.
,,
, ,
,
.
,,s(xi,yj)~cn
,
.
$$s(x_i, y_i) = \begin{cases} 1, & x_i = y_i \\ -\infty, & x_i \neq y_i \end{cases}, \qquad d = \infty$$
,,s(xi,yj)~klogn
(3.46).
(3.47) n.
,
, . (phase transition)
Arratia,GordonWaterman(Arratia&Waterman,1994;Waterman,1995;Waterman,Gordon,&Arratia,
1987),m(mismatch)
d(gap),.
.
,-,.
,Poisson,:
$$E(S \ge x) \approx Kmn\, e^{-\lambda x} = Kmn\, p^{x} \tag{3.52}$$
<1 mn,
qiqjeS =1.
,:
$$K = 1 - p = q \tag{3.53}$$
$$\lambda = \log(1/p) \tag{3.54}$$
, ,
. ,
KarlinAltschul(Karlin&Altschul,1990)
x=x1,x2,,xny=y1,y2,,ymS(3.32),
x(,p-value),:
$$P(S \ge x) \approx 1 - \exp\left(-Kmn\, e^{-\lambda x}\right) \tag{3.55}$$
(localsimilarityscore)
,maximalsegmentscore.:
,:
$$E(s_{ij}) = \sum_{i,j} q_i q_j s_{ij} = -\sum_{i,j} q_i q_j \log \frac{q_i q_j}{p_{ij}} < 0 \tag{3.56}$$
, qiqjeS =1. ,
,.,:
$$P(S < x) \approx \exp\left(-Kmn\, e^{-\lambda x}\right) \tag{3.57}$$
,,...
Gumbel(EVD).,:
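$$P(S \le x) = \exp\left(-e^{-(x - a)/b}\right), \quad -\infty < x < \infty \tag{3.58}$$
$$E(x) = a - b\,\Gamma'(1), \qquad V(x) = \frac{b^2 \pi^2}{6} \tag{3.59}$$
$a = \dfrac{\log(Kmn)}{\lambda}$, $b = \dfrac{1}{\lambda}$, $K = 1 - p = q$, $\lambda = \log(1/p)$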
.p-value
. ,
(Pearson,1998;Pearson&Wood,2001):
$$P(Z \ge z) = 1 - \exp\left(-\exp\left(-\frac{\pi}{\sqrt{6}}\, z + \Gamma'(1)\right)\right) \tag{3.60}$$
, (3.55)
(z).
(z)
().
p-value
,p-value
.
. p-value
10-4,
100.00010.
, D
, p-value ( - pmatch) , Poisson. (Pearson & Wood,
2001):P =Pr(1Sx)=1-e-Dp Dp(<0.01):PDp.
Sx,D.E-value(expectationvalue)
E(Sx)=DP(Sx)D
.
,
.
()
n ( ). m=N/D
,x,
P(S>x)=1-e-E(S) =1-exp(-Kne-x)(E-value)E(Sx)=Kne-x =DKmne-x.
BLAST(Altschul,Gish,Miller,Myers,&Lipman,1990),pvalue, (output) ,
, E-value ,
(Waterman,1995):1-exp(-exp(-t))1-(1-exp(-t))=exp(-t),p-valueEvalue.,,
p-value-value.
3.14.
. (Altschul et al.,
1997;Clote&Backofen,2000;Mott,2000),
Gumbel:
$$P(S < x) \approx \exp\left(-Kmn\, e^{-\lambda x}\right) \tag{3.61}$$
,.
Gumbel
Mott,1992(Mott,1992).(3.58)
A a0
a1
2 log mn
,B
b1
.0,1,2b1.
q q e
i
1 .
,
(direct estimation) (Waterman, 1995; Waterman & Vingron, 1994).
( ). ,
, (
1000)
( shuffling, ). K,
... (e.c.d.f.) (log[log[cdf]])log[-log[cdf]]S.(slope)
(constant)log(mn).
, ,
,
. ,
(z>7)(z<-3)
,(systematicerrorbias)(Pearson,1998).
Waterman Vingron (Waterman & Vingron, 1994;
Waterman & Vingron, 1994), Poisson (Poisson Approximation -
(Arratia, Goldstein, & Gordon, 1989; Chen, 1975)) de-clumping estimation.
S(1) S(2) ... S(k) .
,.
S(i)Poisson,:
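$$E(S \ge x) = Kmn\, e^{-\lambda x} \tag{3.62}$$
,k,x:
$$P(S_{(k)} \ge x) \approx 1 - \exp\left(-Kmn\, e^{-\lambda x}\right) \sum_{i=0}^{k-1} \frac{\left(Kmn\, e^{-\lambda x}\right)^{i}}{i!} \tag{3.63}$$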
(3.52).,
log[data] ( )
Kmne-x()
,. Poisson
,
,,.
.
PoissonnmSmith-Waterman,
k(sub-optimalalignments).WatermanVingron,
10 , sub-optimal scores
,.
,Pearson(Pearson,1995,1998;Pearson&Wood,2001),
.
. , k
n1,n2,,nk,
, 10%. ,
S,,
(weightedlinearregression):
$$S = a + b \log(n_i) \tag{3.64}$$
, ni, i , log(ni)
(1/var),
. 2 ,
(residualvariance)z-score:
$$z = \frac{S - \left(a + b \log n_i\right)}{\sqrt{\mathrm{var}}} \tag{3.65}$$
,
,.
5 , (
).z-scores,,:
$$P(Z \ge z) = 1 - \exp\left(-\exp\left(-\frac{\pi}{\sqrt{6}}\, z + \Gamma'(1)\right)\right) \tag{3.66}$$
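The conversion of a z-score into a p-value and an E-value can be sketched as follows. The formula is the Gumbel distribution rescaled to mean 0 and variance 1, with Γ'(1) = -0.5772; the database size D and the function names are illustrative assumptions, and the code is not taken from the FASTA package.

```python
import math

EULER_GAMMA = 0.5772156649015329  # gamma = -Gamma'(1)


def z_to_pvalue(z):
    """P(Z >= z) for a standardised Gumbel (extreme value) distribution, as in (3.66)."""
    inner = math.exp(-math.pi / math.sqrt(6.0) * z - EULER_GAMMA)
    return -math.expm1(-inner)  # 1 - exp(-inner), accurate for small p-values


def z_to_evalue(z, database_size):
    """Expected number of database sequences reaching z by chance: E = D * p."""
    return database_size * z_to_pvalue(z)


if __name__ == "__main__":
    for z in (0.0, 5.0, 10.0):
        print(z, z_to_pvalue(z), z_to_evalue(z, 100_000))
```

Using `math.expm1` avoids the loss of precision that the direct form 1 - exp(-t) suffers when the p-value is very small.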
(p-value, -value),
.
,.
OMott(Mott,2000),,,
Gumbel .
Smith-Waterman,,
,.,
,
.
.
,
,
,.,
,
.
,SmithWaterman
,.
, , . ,
,
.
,
,
. , 2,
.,
,
. ,
1980,,
.
(heuristic),,
. ,
.
, BLAST (Altschul, et al., 1990; Altschul, et al., 1997)
FASTA(Lipman&Pearson,1985;Wilbur&Lipman,1983).
3.13: FASTA
FASTA (www.ebi.ac.uk/fasta33/),
,
.:
k-tuples(
k, 1 2)
.
,k-tuples.
,
,
(
),
.
3.14: BLAST
BLAST(www.ncbi.nlm.nih.gov/BLAST/),
FASTA,
:
(=13).
,
,.
(HSPs,highscoringpairs).
HSPs
S
,
,
KarlinAltschul.
BLAST,
.2
,BLAST
.BLAST,
,NCBI,
.
.
BLAST(K,),
,FASTA
.
BLAST
(PSI-BLAST) (Altschul, et al.,
1997),.
BLAST,
, Karlin Altschul. ,
(effectivelength).
$$m' = m - \frac{\log(Kmn)}{H}, \qquad n' = n - \frac{\log(Kmn)}{H}$$
(substitutionmatrix)
.,.,
,,
.
.
$$H = \lambda \sum_{i=1}^{20} \sum_{j=1}^{20} q_{ij}\, s_{ij} \tag{3.67}$$
,
.
. ,
(expected substitution score per position),
(3.56),
(expectedperpositionalignmentscore),.
,
(gap penalties, mismatches),
. bit
(Altschul,etal.,1990;Altschul,etal.,1997):
$$S_{bit} = \frac{\lambda S_{raw} - \log K}{\log 2} \tag{3.68}$$
Sraw, .
(3.47):
$$E(S_{bit}) = mn\, 2^{-S_{bit}} \tag{3.69}$$
,(3.61),
bitScore.
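The relation between raw score, bit score, E-value and p-value can be checked with a short sketch. The parameter values λ = 0.3176 and K = 0.134 used in the example are the commonly quoted ungapped BLOSUM62 values and serve only as an illustrative assumption; the sequence and database lengths are likewise arbitrary, and the function names are hypothetical.

```python
import math


def bit_score(raw_score, lam, K):
    """Normalised bit score of (3.68): (lambda * S_raw - log K) / log 2."""
    return (lam * raw_score - math.log(K)) / math.log(2.0)


def e_value(bits, m, n):
    """Expected number of chance hits of (3.69): E = m * n * 2^(-S_bit)."""
    return m * n * 2.0 ** (-bits)


def p_value(e):
    """P-value from the Poisson approximation: 1 - exp(-E)."""
    return -math.expm1(-e)


if __name__ == "__main__":
    lam, K = 0.3176, 0.134          # assumed ungapped BLOSUM62 parameters
    m, n = 300, 1_000_000           # query length and database length (illustrative)
    bits = bit_score(50, lam, K)
    E = e_value(bits, m, n)
    print(bits, E, p_value(E))
```

By construction the E-value computed from the bit score equals Kmn·exp(-λS) for the raw score, so reporting bit scores makes results comparable across scoring systems without quoting λ and K.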
FASTA
(Pearson, 1998). ,
BLAST FASTA ,
.
BLAST ,
.
, ,
DNA DNA, , ,
(DNA)(),
DNA,DNADNA
( DNA-DNA ).
BLAST FASTA
(),
(, ,
).
Altschul, S. F. (1991). Amino acid substitution matrices from an information theoretic perspective. J Mol Biol, 219(3), 555-565.
Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic local alignment search tool. J Mol Biol, 215(3), 403-410.
Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W., & Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res, 25(17), 3389-3402.
Arratia, R., Goldstein, L., & Gordon, L. (1989). Two moments suffice for Poisson approximation: The Chen-Stein method. Ann. Probab., 17, 9-25.
Arratia, R., Gordon, L., & Waterman, M. S. (1986). An extreme value theory for sequence matching. Ann. Statist., 14, 971-993.
Arratia, R., Gordon, L., & Waterman, M. S. (1990). The Erdos-Renyi law in distribution, for coin tossing and sequence matching. Ann. Statist., 18, 539-570.
Arratia, R., & Waterman, M. S. (1989). The Erdos-Renyi strong law for pattern matching with a given proportion of mismatches. Ann. Probab., 17, 1152-1169.
Arratia, R., & Waterman, M. S. (1994). A phase transition for the score in matching random sequences allowing deletions. Ann. Appl. Probab., 4, 200-225.
Chen, L. H. Y. (1975). Poisson approximation for dependent trials. Ann. Probab., 3, 534-545.
Clote, P., & Backofen, R. (2000). Computational Molecular Biology, an Introduction. John Wiley and Sons, Ltd. USA.
Davison, A. C. (1998). Extreme Values. Encyclopedia of Biostatistics: John Wiley & Sons, Ltd.
Dayhoff, M. O., Schwartz, R. M., & Orcutt, B. C. (1978). A model of evolutionary change in proteins. In M. Dayhoff (Ed.), Atlas of protein sequence and structure (Vol. 5, Suppl. 3, pp. 345-352): National Biomedical Research Foundation, Silver Spring, MD.
Durbin, R., Eddy, S., Krogh, A., & Mitchison, G. (1998). Biological sequence analysis, probabilistic models of proteins and nucleic acids. Cambridge University Press.
Erdos, P., & Renyi, A. (1970). On a new law of large numbers. J. Anal. Math., 22, 103-111.
Erdos, P., & Revesz, P. (1975). On the length of the longest head-run. Topics in Information Theory. Colloquia Math. Soc. J. Bolyai, 16, 219-228.
Galas, D. J., Eggert, M., & Waterman, M. S. (1985). Rigorous pattern-recognition methods for DNA sequences. Analysis of promoter sequences from Escherichia coli. J Mol Biol, 186(1), 117-128.
Gonnet, G. H., Cohen, M. A., & Benner, S. A. (1992). Exhaustive matching of the entire protein sequence database. Science, 256(5062), 1443-1445.
Henikoff, S., & Henikoff, J. G. (1992). Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences (USA), 89, 10915-10919.
Karlin, S., & Altschul, S. F. (1990). Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proceedings of the National Academy of Sciences of the USA., 87, 2264-2268.
Karlin, S., & Brendel, V. (1992). Chance and statistical significance in protein and DNA sequence analysis. Science, 257(5066), 39-49.
Krogh, A., Larsson, B., von Heijne, G., & Sonnhammer, E. L. (2001). Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol, 305(3), 567-580.
Lipman, D. J., & Pearson, W. R. (1985). Rapid and sensitive protein similarity searches. Science, 227(4693), 1435-1441.
Mott, R. (1992). Maximum likelihood estimation of the statistical distribution of Smith-Waterman local sequence similarity scores. Bulletin of Mathematical Biology, 54, 59-75.
Mott, R. (2000). Accurate formula for P-values of gapped local sequence and profile alignments. J Mol Biol, 300(3), 649-659.
Muller, T., Rahmann, S., & Rehmsmeier, M. (2001). Non-symmetric score matrices and the detection of homologous transmembrane proteins. Bioinformatics, 17 Suppl 1, S182-189.
Needleman, S. B., & Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol, 48(3), 443-453.
Ng, P. C., Henikoff, J. G., & Henikoff, S. (2000). PHAT: a transmembrane-specific substitution matrix. Predicted hydrophobic and transmembrane. Bioinformatics, 16(9), 760-766.
Pearson, W. R. (1995). Comparison of methods for searching protein sequence databases. Protein Science, 4, 1145-1160.
Pearson, W. R. (1998). Empirical statistical estimates for sequence similarity searches. J Mol Biol, 276(1), 71-84.
Pearson, W. R., & Wood, T. C. (2001). Statistical significance in biological sequence comparison. In D. J. Balding, M. Bishop & C. Cannings (Eds.), Handbook of statistical genetics (pp. 39-65): John Wiley and Sons, Ltd. England.
Smith, T. F., & Waterman, M. S. (1981). Identification of common molecular subsequences. J Mol Biol, 147(1), 195-197.
Vingron, M., & Waterman, M. S. (1994). Sequence alignment and penalty choice. Review of concepts, case studies and implications. J Mol Biol, 235(1), 1-12.
Waterman, M. S. (1995). Introduction to Computational Biology: Chapman and Hall, London.
Waterman, M. S., Gordon, L., & Arratia, R. (1987). Phase transitions in sequence matches and nucleic acid structure. Proceedings of the National Academy of Sciences of the USA., 84, 1239-1243.
Waterman, M. S., & Vingron, M. (1994). Rapid and accurate estimates of statistical significance for sequence database searches. Proceedings of the National Academy of Sciences of the USA., 91, 4625-4628.
Waterman, M. S., & Vingron, M. (1994). Sequence comparison significance and Poisson approximation. Statistical Science, 2, 367-381.
Wilbur, W. J., & Lipman, D. J. (1983). Rapid similarity searches of nucleic acid and protein data banks. Proceedings of the National Academy of Sciences of the USA., 80, 726-730.
Wootton, J. C., & Federhen, S. (1993). Statistics of local complexity in amino acid sequences and sequence databases. Computers & chemistry, 17(2), 149-163.
1)
1
H ( , p ) log (1 ) log
p
1 p
=1.
(3.11)(3.12);
2)
3.1,:
a
E sk pk sk pk log k 0
pk
;
3)
(%).
);;
),;
4)
250
PAM250,:
F W L E V E G N S M T A P T G
F W L D V Q G D S M T A P A G
, K=0.09, =0.229.
bit-score;
5)
300
550 . (similar residues) 61 166
,BitScore39.
)E-value;
);
;
;
)-value
500.000;
6)
BLAST
NRNCBI.
Score = 34.3 bits (77), Expect =
Identities = 28/85 (32%), Positives = 44/85 (51%), Gaps = 11/85 (12%)
Query
96
Sbjct
118
Query
151
Sbjct
176
INDWASIYGVVGVGYGKFQTTEYPTY---KHDTSDYGFSYGAGLQ--FNPMENVALDFSY
I++
I+G +G YG+ +T+ P +
D S +G SYGAG++ FNP
L+ +
ISEQFDIFGKLGTTYGRTKTSGNPGFGVATGDDSGFGLSYGAGVRWAFNPQWAAVLE--W
EQSRIR----SVDVGTWIAGVGYRF
E+ R+
DV
GV YR+
ERHRLHFADGKSDVDMTTIGVQYRY
150
175
171
200
Score = 77.4 bits (189), Expect = XXXXX
Identities = 62/201 (30%), Positives = 101/201 (50%), Gaps = 32/201 (15%)
Query
Sbjct
Query
56
Sbjct
60
Query
108
Sbjct
119
Query
151
Sbjct
179
MKKIACLSALAAVLAFTAGTSVAAT---STVTGGY--AQSDAQGQMNKMGGFNLKYRYEE
M+K+
AA+
+G
A+
ST++ GY
++ G +++ G N+KYRYE
MRKLYAAILSAAICLAVSGAPAWASEHQSTLSAGYLHVSTNVPGS-DELNGINVKYRYEF
55
59
DNSPLGVIGSFTY--------TEKSRTASSGDYNKNQYYGITAGPAYRINDWASIYGVVG
++ LG++ SF+Y
T S T
D +N+++ + AGP+ R+N+W S Y + G
TDT-LGMVTSFSYAGDKNRQLTHYSDTRWHEDSVRNRWFSVMAGPSVRVNEWFSAYAMAG
107
VGYGKFQT--------TEYPTYKHDT---------SDYGFSYGAGLQFNPMENVALDFSY
+ Y + T
T+
HD
S+
++GAG+Q NP E+VA+D +Y
MAYSRVSTFSGDYLRVTDNKGKTHDVLTGSDDGRHSNTSLAWGAGVQVNPTESVAIDIAY
150
EQSRIRSVDVGTWIAGVGYRF
E S
+I GVGY+F
ECSGSGDWRTDGFIVGVGYKF
118
178
171
199
Gapped
Lambda     K        H
0.267      0.0410   0.140
Number of Sequences: 4496249
Length of query: 171
Length of database: 1544746084
Length adjustment: 122
Effective length of query: 49
Effective length of database: 996203706
Effective search space: 48813981594
Effective search space used: 48813981594
)E-value(Expectation).
;
);
)
;
4:
.
, ,
.
,
, .
.
, ,
, .
3 ( ).
4.
,,
2.
.
, .
, ,
,.
()
.,""
.,
(..
,
).,
,
,
. ,
(patterns),HiddenMarkovModels
. 2
,(PROSITE,PFAM..).
,
, '
.
,
, ,
. ,
(
6),,
.
, (
).
,
.
,,
,
, .
,.
,
"" , , ,
,(profile).
,
.
,.
4.1.
,
.r:
(4.1)
3, r . 3,
. ' , ,
.
4.1: , 3
.
$$S(m) = G + \sum_i S(m_i) \qquad (4.2)$$
miim,S(mi)G (
).,5(-),
,
.,,
,
.,:
$$S(m) = \sum_i S(m_i) \qquad (4.3)$$
,.
,logoddsr:
(4.4)
, ,
(PAM, BLOSUM ),
.,
3 , 4
,...,.
,
.mij i j nb(i)
bi,,:
$$P(m_i) = \prod_b p_b(i)^{\,n_b(i)} \qquad (4.5)$$
ps(i)si,:
$$p_b(i) = \frac{n_b(i)}{\sum_{b'} n_{b'}(i)} \qquad (4.6)$$
,,:
$$S(m_i) = -\sum_b n_b(i) \log p_b(i) \qquad (4.7)$$
, , ,
,
. , : 100%
,0,,,
. , ,
(,,
).,,,
,.,r,-
( ).
,,
.,
. ,
.
, SP (Sum of
Pairs).:
$$SP(m_i) = \sum_{j<j'} s\left(m_i^{j}, m_i^{j'}\right) \qquad (4.8)$$
s
,.,:
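$$SP(m) = \sum_i \sum_{j<j'} s\left(m_i^{j}, m_i^{j'}\right) = \sum_i \left[ \log\frac{p_{x_{1i} x_{2i}}}{q_{x_{1i}}\, q_{x_{2i}}} + \dots + \log\frac{p_{x_{(r-1)i} x_{ri}}}{q_{x_{(r-1)i}}\, q_{x_{ri}}} \right] \qquad (4.9)$$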
2r.,
,.
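The sum-of-pairs score of equations (4.8) and (4.9) can be illustrated with a minimal Python sketch. The function `toy_score` (match +1, mismatch -1, residue against gap -2, gap against gap 0) is a hypothetical stand-in chosen only for the example; in practice the pairwise score s would come from a substitution matrix such as PAM or BLOSUM together with a gap penalty.

```python
from itertools import combinations

def toy_score(a, b):
    """Hypothetical pairwise score: +1 match, -1 mismatch, -2 residue/gap, 0 gap/gap."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return -2
    return 1 if a == b else -1

def sp_column(column, score=toy_score):
    """Sum-of-pairs score of one column: sum of s over all pairs j < j' (eq. 4.8)."""
    return sum(score(a, b) for a, b in combinations(column, 2))

def sp_alignment(rows, score=toy_score):
    """SP score of an alignment given as equal-length strings, one per sequence."""
    return sum(sp_column(col, score) for col in zip(*rows))

alignment = ["ACG-", "ACGT", "A-GT"]
print([sp_column(col) for col in zip(*alignment)], sp_alignment(alignment))
```

With r sequences each column contributes r(r-1)/2 pairwise terms, which is why highly similar sequences dominate the score unless they are down-weighted.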
, , . ,
,
(..1020),,
(Durbin, Eddy, Krogh, & Mitchison, 1998).
,,..
19/20 , 9/10,
.
,,
. ,
,(..>50)
100%.,
,
. , ,
.,
(
),
.
,
.ai1,i2,...,iN
xi11 , xi22 , ..., xiNN ,
3,:
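$$a_{i_1,i_2,\dots,i_N} = \max \begin{cases} a_{i_1-1,\,i_2-1,\,\dots,\,i_N-1} + S\left(x^1_{i_1}, x^2_{i_2}, \dots, x^N_{i_N}\right) \\ a_{i_1-1,\,i_2-1,\,\dots,\,i_N} + S\left(x^1_{i_1}, x^2_{i_2}, \dots, -\right) \\ \quad\vdots \\ a_{i_1,\,i_2,\,i_3-1,\,\dots,\,i_N-1} + S\left(-, -, \dots, x^N_{i_N}\right) \\ \quad\vdots \end{cases} \qquad (4.10)$$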
,().(4.10),
(2N-1 ).
,(Durbin,etal.,1998;Waterman,1995):
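$$a_{i_1,\dots,i_N} = \max_{\Delta_1+\dots+\Delta_N>0} \left\{ a_{i_1-\Delta_1,\dots,i_N-\Delta_N} + S\left(\Delta_1\cdot x^1_{i_1}, \Delta_2\cdot x^2_{i_2}, \dots, \Delta_N\cdot x^N_{i_N}\right) \right\} \qquad (4.11)$$
where:
$$\Delta_i \cdot x = \begin{cases} x, & \Delta_i = 1 \\ -, & \Delta_i = 0 \end{cases} \qquad (4.12)$$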
,,rn(,
), (nr2r)
(nr). ,
,
,
4.2.
(heuristic) ,
progressivemultiplealignmentmethod().
(
), -, .
1980,
Feng Doolittle 1987 (Feng & Doolittle, 1987)
:
,
(guidetree)
,
,,.
,
(BLAST, FASTA).
,
(clustering). ,
,,.
, ,
,.Feng
Doolittle(Feng&Doolittle,1987),,
( ). ,
,:
$$D = -\log S_{\mathrm{eff}} = -\log\frac{S_{\mathrm{obs}} - S_{\mathrm{rand}}}{S_{\max} - S_{\mathrm{rand}}} \qquad (4.13)$$
Sobs,.
Smax ,
,Srand
.
( shuffling), Feng Doolittle
. ,
, log
().
,
Fitch Margoliash (Fitch& Margoliash, 1967).
, .
, - ,
6,,
,.
,
.
, ,
,.,()
,
().,,,
. ,
:,
. 4.2
5.
, ,
(bias) ,
. ,
,
.,
, (consensus),
.
, .
2,1,1G,1C.,
,60%
,.
4.3: 4.2.
x1 x2 , x3 x4 . ,
, (x1-x3).
x3 x4 , x1 x2,
.
, profile alignment ( ),
. ,
,,(-),
. , SP ,
.,
1n,,n+1.
(4.8):
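$$SP(m_i) = \sum_{j<j'\le N} s\left(m_i^{j}, m_i^{j'}\right) = \sum_{j<j'\le n} s\left(m_i^{j}, m_i^{j'}\right) + \sum_{n<j<j'\le N} s\left(m_i^{j}, m_i^{j'}\right) + \sum_{j\le n,\; n<j'\le N} s\left(m_i^{j}, m_i^{j'}\right) \qquad (4.14)$$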
(4.14)
, ,
.
,
3(4.4).,
. ,
,(Edgar
&Sjolander,2004;Wang&Dunbrack,2004).
,
CLUSTALW(Thompson,Higgins,&Gibson,1994)profilealignment
.clustalCLUSTALV(Higgins,Bleasby,&
Fuchs, 1992) CLUSTALW
,CLUSTALX(Thompson,Gibson,&Higgins,2002).,
.
www.ebi.ac.uk/clustalw/.,:
,
(FASTA), .
, ,
,.
x
(D=1-x/100) -
,Neighbor-Joining()(Saitou&Nei,1987).
.
6.
( )
,profilealignment.
.,
(weight)
.
,
.,,,
( ,
, ,
5).,
,
.
. ,
CLUSTAL
,
.
,
, Kalign (Lassmann & Sonnhammer, 2005), (
http://msa.sbc.su.se/cgi-bin/msa.cgi). Kalign,
.
, Lassmann Sonnhammer
,,WuManber,
(Wu&Manber,1992).
Kalignk-tuple,.,
-UPGMA(
).
. profile alignment,
,,(
,).,
,
,,
(BLOSUM50, PAM250 GONNET250). ,
GONNET250(Gonnet,Cohen,&Benner,1992),
. , Kalign
CLUSTAL 10
.,,
.
,.
4.4 4.2.
profile alignment. ,
x1 x2 , x3 x4 .
(4.14) . ,
x3 x4 , x1 x2,
. , , .
, , .
, , .
,
. ,
, ,
(once a gap, always a gap). ,
,.
, ,
.
4.3.
,
,
, .
,,
,
. , CLUSTALW 6% (Wallace,
O'Sullivan,&Higgins,2005).
, Barton
Sternberg (Barton & Sternberg, 1987). ,
.,profile
alignment.,-
,.
Corpet (Corpet, 1988),
MULTALIN(http://prodes.toulouse.inra.fr/multalin/multalin.html).
.MULTALIN,,,
,
.
MUSCLE
(Edgar, 2004) ( http://www.drive5.com/muscle).
, MUSCLE k-mers ( -
k),
UPGMA, ( MSA1).
,Kimura(
6),
profile alignment ( MSA2).
(refinement),,
.
.MSA3,
MSA2(MUSCLE-p)
,
. MUSCLE-p O(N^2 L + N L^2)
O(N^2 + NL + L^2), O(N^3 L)
. MUSCLE
profilealignment,log-expectationscore.MUSCLE
, profiles
,.
, Gotoh
(Gotoh, 1996) PRRP/PRRN (http://www.genome.ist.i.kyoto-u.ac.jp/~aln_user/prrn/index.html).
, SP(weighted sums-of-pairs score)
.,
. SP
, ,
.
PRALINE, (Simossis & Heringa, 2005)
http://ibivu.cs.vu.nl/programs/pralinewww/,
. profile
PSI-BLAST .
, profile .
,
profiles.
, PRALINE
. ,
, profile,
profiles.,
,.,
. , (consistency)
profiles.
.
Dialign (Morgenstern, 2014), (http://bibiserv.techfak.uni-bielefeld.de/dialign/), ,
,
(,,,
).
. Dialign
(diagonal alignments in a dot plot). ,
. ,
,,
.,
, ,
. , :
(),
, . , Dialign
,
(..).
, PRALINE
Dialign, COBALT (Papadopoulos & Agarwala, 2007),
NCBI(ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/cobalt).COBALTBLAST
RPS-BLAST,
Neighbour-Joining,
(dij=1-(Sij/2)(1/Sii-1/Sij)).,
profile alignment ,
. ,
BLAST,RPS-BLAST,
NCBI
(CDD). ,
Dialign.
,
,T-Coffee (Magisetal.,2014),(
http://www.ch.embnet.org/software/TCoffee.html). To T-Coffee CLUSTALW (
,):,
,
profile alignment.
,
,
. , ,
,, ,
,.
CLUSTALW ,
,(
CLUSTALW), (
LALIGN FASTA). , T-Coffee
.
, .
,.,
.
,,
,
.,simulated
annealing (Kim, Pramanik, & Chung, 1994),
.,
SAGA (Notredame & Higgins, 1996),
(Hidden Markov
Models), ProbCons ProbAlign (Roshan, 2014) (
http://probalign.njit.edu/standalone.html). ,
,
, ,
.,
.
4.4.
,
.
;
; ,
,
. , . ,
,
,.
4.5: .
, , ,
(structural alignment). ,
,
.,
,.
,,
.,
,
,.,
,,
.,
, . ,
(goldstandard),
.
,
,.
BAliBASE (Thompson, Plewniak, & Poch, 1999)
.
/4.1.
BAliBASE   http://www-igbmc.u-strasbg.fr/BioInfo/BAliBASE/index.html   (Thompson, et al., 1999)
OxBench    http://www.compbio.dundee.ac.uk/                            (Raghava, Searle, Audley, Barber, & Barton, 2003)
SABmark    http://bioinformatics.vub.ac.be/databases/databases.html    (Van Walle, Lasters, & Wyns, 2005)
PREFAB     http://drive5.com/muscle/prefab.htm                         (Edgar, 2004)
4.1:
,
, . ,
100%.,
,
,..
.,
,
(Thompson, Linard, Lecompte, & Poch, 2011),
(blocks) (Raghava, et al., 2003). ,
,
.:)
(
,
),)(
, ), )
-(fragments),
.
,
,
(Raghava,etal.,2003).APDB(O'Sullivanet
al.,2003).APDB,,
PDB,.,
, APDB
,.
,
,
(Pais, Ruy Pde, Oliveira, & Coimbra, 2014; Thompson, et al., 2011;
Thompson,etal.,1999).,,
, .
, . ,
,
,50%,
20%. T-Coffee, ProbCons ProbAlign
,
( ). ClustalW MUSCLE,
,.Prrp/Prrn
,.Kalign,,
10CLUSTALW(),
.,
, ,
,
(,).T-Coffee
,,,
Dialign (
).
,
. ,
(
). ,
.. ,
(,
)..
,
.
,.,
.
(CLUSTALW, T-Coffee, Dialign, Kalign, MUSCLE, ProbAlign, Prrp/Prrn),
.,(Windows,
Linux, Mac), COBALT PRALINE, (PSI-BLAST), (, PRALINE
).
, ,
.
4.5.
,
, .
,(FASTA)
.
,Multi-FASTA,MSFCLUSTAL.Multi-FASTA,
FASTA.,
(>),
.,(-)
.,
.MSF,,
, . 3
,350
(),350,...,
.,
, ( //)
. CLUSTAL (
),(,
,),
60.
.
, *. : .
(,),
()().,
PHYLIP STOCKHOLM,
.
,
READSEQ
(http://www.ebi.ac.uk/Tools/sfc/readseq/),modulesBioPerl,BioPythonBioJava,
.
4.6: , Strap.
, 3 . ,
JNET.
,,
, , .
,
,,
.,
,
. ,
(,
...).,.
(editors),,
( , ) ,
( ).
:
Jalview(http://www.jalview.org/)
Strap(http://www.bioinformatics.org/strap/)
Seqpup(http://iubio.bio.indiana.edu/soft/molbio/seqpup/java/seqpup-doc.html)
Seaview(http://pbil.univ-lyon1.fr/software/seaview.html)
Cinema(http://aig.cs.man.ac.uk/research/utopia/cinema/cinema.php)
Boxshade(http://www.ch.embnet.org/software/BOX_form.html)
Bioedit(http://www.mbio.ncsu.edu/BioEdit/bioedit.html)
,.
Desktop (
,applets).
.,,
. ,
,
.,,
,
. , ,
.
, , .
,
,.,,
(- ), .
,
,,
.,
PDB,
,
( , ). , , ,
,
, ,
,
.
,,
. ,
,1-2
.
. ,
.
.,
, ( ,
).,
. ,
(PFAM, PROSITE ), ,
( BLAST
).
4.7: Jalview.
RNA. , Jmol
RNA.
,,
. /
. ,
, .
, . ,
, (signal peptide),
,(
). ,
,
- (, ...). ,
,,
, .
,,(
!).
, ,
. ,
profilealignment,
.
Barton, G. J., & Sternberg, M. J. (1987). A strategy for the rapid multiple alignment of protein sequences. Confidence levels from tertiary structure comparisons. J Mol Biol, 198(2), 327-337.
Carrillo, H., & Lipman, D. (1988). The multiple sequence alignment problem in biology. SIAM Journal on Applied Mathematics, 48(5), 1073-1082.
Corpet, F. (1988). Multiple sequence alignment with hierarchical clustering. Nucleic Acids Res, 16(22), 10881-10890.
Durbin, R., Eddy, S., Krogh, A., & Mitchison, G. (1998). Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press.
Duret, L., & Abdeddaim, S. (2000). Multiple alignment for structural, functional, or phylogenetic analyses of homologous sequences. Bioinformatics: Sequence, Structure, and Databanks, 51-76.
Edgar, R. C. (2004). MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics, 5, 113.
Edgar, R. C., & Sjolander, K. (2004). A comparison of scoring functions for protein sequence profile alignment. Bioinformatics, 20(8), 1301-1308.
Feng, D.-F., & Doolittle, R. F. (1987). Progressive sequence alignment as a prerequisite to correct phylogenetic trees. Journal of molecular evolution, 25(4), 351-360.
Fitch, W. M., & Margoliash, E. (1967). Construction of phylogenetic trees. Science, 155(3760), 279-284.
Gonnet, G. H., Cohen, M. A., & Benner, S. A. (1992). Exhaustive matching of the entire protein sequence database. Science, 256(5062), 1443-1445.
Gotoh, O. (1996). Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments. J Mol Biol, 264(4), 823-838.
Higgins, D. G., Bleasby, A. J., & Fuchs, R. (1992). CLUSTAL V: improved software for multiple sequence alignment. Computer applications in the biosciences: CABIOS, 8(2), 189-191.
Kim, J., Pramanik, S., & Chung, M. J. (1994). Multiple sequence alignment using simulated annealing. Comput Appl Biosci, 10(4), 419-426.
Lassmann, T., & Sonnhammer, E. L. (2005). Kalign--an accurate and fast multiple sequence alignment algorithm. BMC Bioinformatics, 6, 298.
Lipman, D. J., Altschul, S. F., & Kececioglu, J. D. (1989). A tool for multiple sequence alignment. Proceedings of the National Academy of Sciences, 86(12), 4412-4415.
Magis, C., Taly, J. F., Bussotti, G., Chang, J. M., Di Tommaso, P., Erb, I., ... Notredame, C. (2014). T-Coffee: Tree-based consistency objective function for alignment evaluation. Methods Mol Biol, 1079, 117-129.
Morgenstern, B. (2014). Multiple sequence alignment with DIALIGN. Methods Mol Biol, 1079, 191-202.
Notredame, C., & Higgins, D. G. (1996). SAGA: sequence alignment by genetic algorithm. Nucleic Acids Res, 24(8), 1515-1524.
O'Sullivan, O., Zehnder, M., Higgins, D., Bucher, P., Grosdidier, A., & Notredame, C. (2003). APDB: a novel measure for benchmarking sequence alignment methods without reference alignments. Bioinformatics, 19 Suppl 1, i215-221.
Pais, F. S., Ruy, P. de C., Oliveira, G., & Coimbra, R. S. (2014). Assessing the efficiency of multiple sequence alignment programs. Algorithms Mol Biol, 9(1), 4.
Papadopoulos, J. S., & Agarwala, R. (2007). COBALT: constraint-based alignment tool for multiple protein sequences. Bioinformatics, 23(9), 1073-1079.
Raghava, G. P., Searle, S. M., Audley, P. C., Barber, J. D., & Barton, G. J. (2003). OXBench: a benchmark for evaluation of protein multiple sequence alignment accuracy. BMC Bioinformatics, 4, 47.
Roshan, U. (2014). Multiple sequence alignment using Probcons and Probalign. Methods Mol Biol, 1079, 147-153.
Saitou, N., & Nei, M. (1987). The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular biology and evolution, 4(4), 406-425.
Simossis, V. A., & Heringa, J. (2005). PRALINE: a multiple sequence alignment toolbox that integrates homology-extended and secondary structure information. Nucleic Acids Res, 33(Web Server issue), W289-294.
Thompson, J. D., Gibson, T. J., & Higgins, D. G. (2002). Multiple sequence alignment using ClustalW and ClustalX. Curr Protoc Bioinformatics, Chapter 2, Unit 2.3.
Thompson, J. D., Higgins, D. G., & Gibson, T. J. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic acids research, 22(22), 4673-4680.
Thompson, J. D., Linard, B., Lecompte, O., & Poch, O. (2011). A comprehensive benchmark study of multiple sequence alignment methods: current challenges and future perspectives. PLoS One, 6(3), e18093.
Thompson, J. D., Plewniak, F., & Poch, O. (1999). A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res, 27(13), 2682-2690.
Van Walle, I., Lasters, I., & Wyns, L. (2005). SABmark--a benchmark for sequence alignment that covers the entire known fold space. Bioinformatics, 21(7), 1267-1268.
Wallace, I. M., O'Sullivan, O., & Higgins, D. G. (2005). Evaluation of iterative alignment algorithms for multiple alignment. Bioinformatics, 21(8), 1408-1414.
Wang, G., & Dunbrack, R. L., Jr. (2004). Scoring profile-to-profile sequence alignments. Protein Sci, 13(6), 1612-1626.
Waterman, M. S. (1995). Introduction to Computational Biology. Chapman and Hall, London.
Wu, S., & Manber, U. (1992). Fast text searching allowing errors. Communications of the ACM, 35(10), 83-91.
Multi-FASTA
MSF
MSF: 307 Type: P Check: 4977 ..
Name: CD5R_BOVIN oo Len: 307 Check: 5281 Weight: 33.3
Name: CD5R_HUMAN oo Len: 307 Check: 5196 Weight: 33.3
Name: CD5R_MOUSE oo Len: 307 Check: 4500 Weight: 33.3
//
CD5R_BOVIN MGTVLSLSPS YRKATLFEDG AATVGHYTAV QNSKNAKDKN LKRHSIISVL
CD5R_HUMAN MGTVLSLSPS YRKATLFEDG AATVGHYTAV QNSKNAKDKN LKRHSIISVL
CD5R_MOUSE MGTVLSLSPS YRKATLFEDG AATVGHYTAV QNSKNAKDKN LKRHSIISVL
CD5R_BOVIN PWKRIVAVSA KKKNSKKVQP NSSYQNNITH LNNENLKKSL SCANLSTFAQ
CD5R_HUMAN PWKRIVAVSA KKKNSKKVQP NSSYQNNITH LNNENLKKSL SCANLSTFAQ
CD5R_MOUSE PWKRIVAVSA KKKNSKKAQP NSSYQSNIAH LNNENLKKSL SCANLSTFAQ
CD5R_BOVIN PPPAQPPAPP ASQLSGSQTG VSSSVKKAPH PAVSSAGTPK RVIVQASTSE
CD5R_HUMAN PPPAQPPAPP ASQLSGSQTG GSSSVKKAPH PAVTSAGTPK RVIVQASTSE
CD5R_MOUSE PPPAQPPAPP ASQLSGSQTG VSSSVKKAPH PAITSAGTPK RVIVQASTSE
CD5R_BOVIN LLRCLGEFLC RRCYRLKHLS PTDPVLWLRS VDRSLLLQGW QDQGFITPAN
CD5R_HUMAN LLRCLGEFLC RRCYRLKHLS PTDPVLWLRS VDRSLLLQGW QDQGFITPAN
CD5R_MOUSE LLRCLGEFLC RRCYRLKHLS PTDPVLWLRS VDRSLLLQGW QDQGFITPAN
CD5R_BOVIN VVFLYMLCRD VISSEVGSDH ELQAVLLTCL YLSYSYMGNE ISYPLKPFLV
CD5R_HUMAN VVFLYMLCRD VISSEVGSDH ELQAVLLTCL YLSYSYMGNE ISYPLKPFLV
CD5R_MOUSE VVFLYMLCRD VISSEVGSDH ELQAVLLTCL YLSYSYMGNE ISYPLKPFLV
CD5R_BOVIN ESCKEAFWDR CLSVINLMSS KMLQINADPH YFTQVFSDLK NESGQEDKKR
CD5R_HUMAN ESCKEAFWDR CLSVINLMSS KMLQINADPH YFTQVFSDLK NESGQEDKKR
CD5R_MOUSE ESCKEAFWDR CLSVINLMSS KMLQINADPH YFTQVFSDLK NESGQEDKKR
CD5R_BOVIN LLLGLDR
CD5R_HUMAN LLLGLDR
CD5R_MOUSE LLLGLDR
CLUSTAL
CLUSTAL W (1.82) multiple sequence alignment
CD5R_BOVIN MGTVLSLSPSYRKATLFEDGAATVGHYTAVQNSKNAKDKNLKRHSIISVLPWKRIVAVSA
CD5R_HUMAN MGTVLSLSPSYRKATLFEDGAATVGHYTAVQNSKNAKDKNLKRHSIISVLPWKRIVAVSA
CD5R_MOUSE MGTVLSLSPSYRKATLFEDGAATVGHYTAVQNSKNAKDKNLKRHSIISVLPWKRIVAVSA
************************************************************
CD5R_BOVIN KKKNSKKVQPNSSYQNNITHLNNENLKKSLSCANLSTFAQPPPAQPPAPPASQLSGSQTG
CD5R_HUMAN KKKNSKKVQPNSSYQNNITHLNNENLKKSLSCANLSTFAQPPPAQPPAPPASQLSGSQTG
CD5R_MOUSE KKKNSKKAQPNSSYQSNIAHLNNENLKKSLSCANLSTFAQPPPAQPPAPPASQLSGSQTG
*******.*******.**:*****************************************
CD5R_BOVIN VSSSVKKAPHPAVSSAGTPKRVIVQASTSELLRCLGEFLCRRCYRLKHLSPTDPVLWLRS
CD5R_HUMAN GSSSVKKAPHPAVTSAGTPKRVIVQASTSELLRCLGEFLCRRCYRLKHLSPTDPVLWLRS
CD5R_MOUSE VSSSVKKAPHPAITSAGTPKRVIVQASTSELLRCLGEFLCRRCYRLKHLSPTDPVLWLRS
***********::***********************************************
CD5R_BOVIN VDRSLLLQGWQDQGFITPANVVFLYMLCRDVISSEVGSDHELQAVLLTCLYLSYSYMGNE
CD5R_HUMAN VDRSLLLQGWQDQGFITPANVVFLYMLCRDVISSEVGSDHELQAVLLTCLYLSYSYMGNE
CD5R_MOUSE VDRSLLLQGWQDQGFITPANVVFLYMLCRDVISSEVGSDHELQAVLLTCLYLSYSYMGNE
************************************************************
CD5R_BOVIN ISYPLKPFLVESCKEAFWDRCLSVINLMSSKMLQINADPHYFTQVFSDLKNESGQEDKKR
CD5R_HUMAN ISYPLKPFLVESCKEAFWDRCLSVINLMSSKMLQINADPHYFTQVFSDLKNESGQEDKKR
CD5R_MOUSE ISYPLKPFLVESCKEAFWDRCLSVINLMSSKMLQINADPHYFTQVFSDLKNESGQEDKKR
************************************************************
CD5R_BOVIN LLLGLDR
CD5R_HUMAN LLLGLDR
CD5R_MOUSE LLLGLDR
*******
5:
.
PROSITE
. ,
(PSSMs)
(profiles),
. ,
.
3 4.
5.
, ,
:
,
.,(patterns)
PROSITE,
(regular expressions) UNIX.
, ,
. ,
(profiles) (Position Specific Scoring Matrices). ,
,
,.
,
( ),
.
5.1.
5.1.1
( ,
),.
,,,
,
. ,
,,
. , ,
,
,.
,(patterns)
(
regularexpressions).,
, ( )
, ( ).
PROSITE,
UNIX
.
(5.1), ,
,.
,,
().PROSITE:
IUPAC.
,
(-).
.
,
(..,...)
,[],[ACG]
A,G,C.
,
x.
/,
{}. ,
{} DNA
[CGT]. ,
.
().(3)
--,x(3)x-x-x(3).,
.,x(2,4)x-x,x-x-x,
x-x-x-x.
< > .
,
<A-x
'>'
. , P-R-L-[G>]
P-R-L-GP-R-L>.
5.1: . .
.
.
, , ,
, ..
( 5.2). , ,
(5.3).
5.2: DNA.
2, PROSITE (http://www.expasy.ch/prosite/)
(sequence domains)
(Sigrist et al., 2010). ,
,,
().PROSITE
1700 . , 1308 , 1107 1105 ""
(
).,
(, ).
Uniprot,
""()
. ,
,
"",
.
,(regularexpressions)PROSITE
.:
(-).
(.)x
^ ,
{}.
,PROSITE:
[RK]-G-{EDRKHPCG}-[AGSCI]-[FY]-[LIVA]-x-[FYM]
:
[RK]G[^EDRKHPCG][AGSCI][FY][LIVA].[FYM]
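The conversion from PROSITE syntax to a regular expression shown above can be sketched in a few lines of Python. The function name `prosite_to_regex` is a hypothetical helper written for this illustration, not part of PROSITE or ScanProsite, and it only covers the elements described in this section (hyphen separators, `x`, `[...]`, `{...}`, repetitions such as `x(3)` or `x(2,4)`, and the `<` and `>` anchors outside brackets). Anchors inside brackets, such as `[G>]`, are not handled.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE pattern into a Python regular expression string."""
    pattern = pattern.rstrip(".")
    parts = []
    for element in pattern.split("-"):
        anchor_start = element.startswith("<")
        anchor_end = element.endswith(">")
        element = element.strip("<>")
        # Split the element into its core and an optional repetition, e.g. x(2,4)
        core, lo, hi = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element).groups()
        if core == "x":
            core = "."                        # x: any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # {..}: any residue except those listed
        # [..] groups and single letters are already valid regular expressions
        if lo is not None:
            core += "{" + lo + ("," + hi if hi else "") + "}"
        parts.append(("^" if anchor_start else "") + core + ("$" if anchor_end else ""))
    return "".join(parts)

regex = prosite_to_regex("[RK]-G-{EDRKHPCG}-[AGSCI]-[FY]-[LIVA]-x-[FYM]")
print(regex)
print(re.search(regex, "AAKGTAFLQYAA"))   # made-up sequence, used only as an example
```

Once the pattern is a regular expression, any standard tool (Perl, grep, Python's `re` module) can scan sequences with it.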
,
1980 , PROSITE
,
, . 5.3
.,
5.4 , 15
.
,(nuclearlocalizationsignals-NLSs)
, , ,
. Cokol, Nair and Rost,
214(
91 ) (Cokol, Nair, & Rost, 2000). ,
.
(peroxisomaltargetingsignal-PTS1)(S-K-L).2004,
Petriv,PTS2(Petriv,Tang,
Titorenko,&Rachubinski,2004).
5.3: .
,.
(signal peptide)
,
,
. ,
(17-30,
), .
1980,PS00013,
PROSITE ( 5.3). ' ,
,[LVI]-[ASTVI]-[GAS]-C
(DOLOP).2002,
SutcliffeHarringtonGram,
,
(Sutcliffe&Harrington,2002).()
5.4.
,
(twin-arginine translocation). ,
SEC.
, ,
.,
(),
2(R-Rx-[FGAVML]-[LITMVF])..
,
, ,
, . ,
SEC. , 2010 Shruthi, Babu
Sankaran,
(R-R-x-[FGAVML]-[LITMVF][LVI]-[ASTVI]-[GAS]-C),
-
(Shruthi,Babu,&Sankaran,2010).
,,
- Gram . 2004, Berven ,
- (Berven,
Flikka,Jensen,&Eidhammer,2004).,
,
.
, ,
,
.
5.4: ,
15 .
5.1.2
, . , .
, . ,
,
.
.,.
PROSITE(regularexpressions),
.12
Perl,
UNIX(grep,egrep).
,
,.
. 5.1 3
[], ,
,()
().100,
6535.(
),G,
;([AGT]),
.
,
. ,
PROSITE100%.
,
,
(,...).,
, ,
. , (sequence profiles)
(PSSMs),.
,
.5.1
, ()
.,1,4,56
23...8
(C); , .
,
(),
HiddenMarkovModels(HMMs)8.
5.5: 2 .
, ,
.,
. 5.5
(,)3
7.,350%G50%A,
7 50% 50% C. PROSITE
G(3)C(7),G(3)T(7).
,G(3)C(7),(3)
T(7).,,
,
.
,
(8),(7),
(10).
5.1.3
:,
(5.6). ,
.,
,
,
..
, 3 (Brazma, Jonassen, Eidhammer, &
Gilbert,1998):
:
.PROSITE,
(..
,
).
:
. ,
,
.
:
,NPcomplete, (heuristic)
(greedy),(..
).,
, (Expectation-Maximization) Gibbs
sampler.
5.6: .
,PRATT
(http://web.expasy.org/pratt/). PRATT ,
PROSITE(Jonassen,Collins,&
Higgins, 1995).
, .. ,
,.PRATT
,
.
MEME(http://meme-suite.org/tools/meme)
(MultipleEMForMotifElicitation).,
().,
(
).
(Bailey&Elkan,1994).
,Gibbs Motif Sampler,
Gibbssampler(http://ccmbweb.ccv.brown.edu/gibbs/gibbs.html).
,
,
(Thompson,Rouchka,&Lawrence,2003).
, TEIRESIAS
, (Rigoutsos &
Floratos,1998).(combinatorial)
,
., ,
.TEIRESIAShttps://cm.jefferson.edu/Teiresias/,
DNA,
5.2.
.,
,
. ,
,
. , (weight
matrices)(profiles).,kxp,k
p (
).,i
pb(i) ( 5.7). nb(i)
bi,pb(i)bi,
:
$$p_b(i) = \frac{n_b(i)}{\sum_{b'} n_{b'}(i)} \qquad (5.1)$$
.
,
.5.1,3,
ATTGAACTA:
$$P(x) = \prod_{i=1}^{p} p_{x_i}(i) \qquad (5.2)$$
, ATGCA 0.00155 (
ATGAACTA0,1).,
. ,
3.,
:
$$s_b(i) = \log\frac{p_b(i)}{p_b} \qquad (5.3)$$
pb ( ) (
) pb(i)
.
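As an illustration of equations (5.1) and (5.3), the following Python sketch builds a log-odds weight matrix from a set of aligned, gap-free sites and scores a candidate segment with it. The function names, the uniform default background and the `pseudocount` parameter are assumptions made for the example; with `pseudocount=0` a residue never observed at a position receives a score of minus infinity, which is exactly the zero-probability problem that motivates pseudocounts.

```python
import math

def pssm_log_odds(sites, alphabet="ACGT", background=None, pseudocount=0.0):
    """Build a log-odds weight matrix (one dict per position) from aligned sites."""
    if background is None:
        background = {b: 1.0 / len(alphabet) for b in alphabet}   # assumed uniform p_b
    matrix = []
    for i in range(len(sites[0])):
        counts = {b: pseudocount for b in alphabet}
        for site in sites:
            counts[site[i]] += 1                                  # n_b(i)
        total = sum(counts.values())
        column = {}
        for b in alphabet:
            p = counts[b] / total                                 # p_b(i), eq. (5.1)
            column[b] = math.log(p / background[b]) if p > 0 else float("-inf")
        matrix.append(column)
    return matrix

def score(matrix, segment):
    """Score of a segment: sum of the position-specific log-odds s_b(i), eq. (5.3)."""
    return sum(column[b] for column, b in zip(matrix, segment))

sites = ["ACG", "ACG", "ATG", "ACG"]      # toy sites, not real binding-site data
m = pssm_log_odds(sites)
print(score(m, "ACG"), score(m, "ATG"), score(m, "AAG"))
```

To scan a long sequence, the same `score` function would be applied to every window of length equal to the matrix width.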
,
. ,
100%
().,
,(3.19)-.,.
,
(..-10,000),
. ,
.,(3.19):
p i z
b
b
p k z
b
s 1 s
sb i log
(5.4)
.,(5.1)
. , ,
. ,
,.
5.7: (weight matrix)
(PSSM), .
,
. ,
.,
(Staden,1990).
(
,..
).,
(,
)(Barton&Sternberg,1990).
.
,
. ,
,,.
5.8: .
,
,.profileanalysis
(Gribskov,
McLachlan,&Eisenberg,1987).(position
specific scoring matrix - PSSM),
(PAM,BLOSUM) .,
. ,
.
,,
.
, . ,
. ,
,
(
).
, ,
.,:
$$s_b(i) = \sum_{j=1}^{k} p_j(i)\, S_{bj} \qquad (5.5)$$
pj(i) j i
( ), Sbj
(BLOSUM62)bj.
,
,
,
.,
PAM, BLOSUM45 ,
,,
(Lthy,Xenarios,&Bucher,1994).
,Gumbel,
. ,
. ,
, (..
60%).
5.9: PSSM.
20 . , ..
,
. 7 8, ,
, .
5.3.
,
PSSMs,.
ScanProsite (http://prosite.expasy.org/scanprosite/). ScanProsite
PROSITE,,,
.
PROSITE,
...(DeCastroetal.,2006)
PFTOOLS (http://web.expasy.org/pftools/)
(Bucher,Karplus,Moeri,&Hofmann,1996).
PFTOOLS ,
(,weightmatrices,PSSMs),
(generalized profile).
,HiddenMarkovModel
8. PFTOOLS ,
,:
pfmake:
pfscale: Gumbel
pfw:
-.
pfsearch:
DNA.
pfscan: DNA
.
,
,
HMMER 8 (psa2msa, gtop, htop, ptoh)
DNA(ptof,2ft,6ft).
PSSM PSI-BLAST
(Position-specific-iterated BLAST) (Altschul et al., 1997).
BLAST.(5.10):
BLASTvalue .
PSSM,
.
, ,
-value.,
,
(34).
(),
. , ,
,
,(contamination).
PSI-BLAST DELTA-BLAST (domain enhanced lookup
timeacceleratedBLAST),PSSM,
.,ConservedDomainDatabase(CDD)NCBI,
, PSIBLAST,
(Boratynetal.,2012).
PHI-BLAST(pattern-hitinitiatedBLAST)BLAST,
(Zhang et al., 1998).
,
, . ,
(5.11).
5.10: PSI-BLAST.
5.11: PHI-BLAST.
,
, WebLogo (http://weblogo.berkeley.edu/) (Crooks, Hon,
Chandonia, & Brenner, 2004). WebLogo
(Sequence Logo) Schneider Stephens (Schneider & Stephens, 1990)
,
.
,:
(5.6)
R S max Sobs log 2 k nb i log pb i
b
, Smax Sobs
3. k ,
(2.2bitsDNA/RNA~4.32).
, .
,
,.
,(5.6).
5.14: Sequence Logo .
(PTS1). ,
.
PROSITE,
.
Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W., & Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic acids research, 25(17), 3389-3402.
Bailey, T. L., & Elkan, C. (1994). Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol, 2, 28-36.
Barton, G. J., & Sternberg, M. J. (1990). Flexible protein sequence patterns: A sensitive method to detect weak structural similarities. Journal of molecular biology, 212(2), 389-402.
Berven, F. S., Flikka, K., Jensen, H. B., & Eidhammer, I. (2004). BOMP: a program to predict integral beta-barrel outer membrane proteins encoded within genomes of Gram-negative bacteria. Nucleic Acids Res, 32(Web Server Issue), W394-W399.
Boratyn, G. M., Schaffer, A. A., Agarwala, R., Altschul, S. F., Lipman, D. J., & Madden, T. L. (2012). Domain enhanced lookup time accelerated BLAST. Biol Direct, 7(1), 12.
Brazma, A., Jonassen, I., Eidhammer, I., & Gilbert, D. (1998). Approaches to the automatic discovery of patterns in biosequences. Journal of computational biology, 5(2), 279-305.
Bucher, P., Karplus, K., Moeri, N., & Hofmann, K. (1996). A flexible motif search technique based on generalized profiles. Computers & chemistry, 20(1), 3-23.
Cokol, M., Nair, R., & Rost, B. (2000). Finding nuclear localization signals. EMBO reports, 1(5), 411-415.
Crooks, G. E., Hon, G., Chandonia, J. M., & Brenner, S. E. (2004). WebLogo: a sequence logo generator. Genome Res, 14(6), 1188-1190.
De Castro, E., Sigrist, C. J., Gattiker, A., Bulliard, V., Langendijk-Genevaux, P. S., Gasteiger, E., ... Hulo, N. (2006). ScanProsite: detection of PROSITE signature matches and ProRule-associated functional and structural residues in proteins. Nucleic Acids Research, 34(suppl 2), W362-W365.
Gribskov, M., McLachlan, A. D., & Eisenberg, D. (1987). Profile analysis: detection of distantly related proteins. Proc Natl Acad Sci U S A, 84(13), 4355-4358.
Jonassen, I., Collins, J. F., & Higgins, D. G. (1995). Finding flexible patterns in unaligned protein sequences. Protein Science, 4(8), 1587-1595.
Lüthy, R., Xenarios, I., & Bucher, P. (1994). Improving the sensitivity of the sequence profile method. Protein Science, 3(1), 139-146.
Petriv, I., Tang, L., Titorenko, V. I., & Rachubinski, R. A. (2004). A new definition for the consensus sequence of the peroxisome targeting signal type 2. Journal of molecular biology, 341(1), 119-134.
Rigoutsos, I., & Floratos, A. (1998). Combinatorial pattern discovery in biological sequences: The TEIRESIAS algorithm. Bioinformatics, 14(1), 55-67.
Schneider, T. D., & Stephens, R. M. (1990). Sequence logos: a new way to display consensus sequences. Nucleic Acids Res, 18(20), 6097-6100.
Shruthi, H., Babu, M. M., & Sankaran, K. (2010). TAT-pathway-dependent lipoproteins as a niche-based adaptation in prokaryotes. Journal of Molecular Evolution, 70(4), 359-370.
Sigrist, C. J., Cerutti, L., de Castro, E., Langendijk-Genevaux, P. S., Bulliard, V., Bairoch, A., & Hulo, N. (2010). PROSITE, a protein domain database for functional characterization and annotation. Nucleic Acids Res, 38(Database issue), D161-166.
Staden, R. (1990). Searching for patterns in protein and nucleic acid sequences. Methods in enzymology, 183, 193-211.
Sutcliffe, I. C., & Harrington, D. J. (2002). Pattern searches for the identification of putative lipoprotein genes in Gram-positive bacterial genomes. Microbiology, 148(Pt 7), 2065-2077.
Thompson, W., Rouchka, E. C., & Lawrence, C. E. (2003). Gibbs Recursive Sampler: finding transcription factor binding sites. Nucleic Acids Research, 31(13), 3580-3585.
Zhang, Z., Miller, W., Schäffer, A. A., Madden, T. L., Lipman, D. J., Koonin, E. V., & Altschul, S. F. (1998). Protein sequence similarity searches using patterns as seeds. Nucleic Acids Research, 26(17), 3986-3990.
6:
, ,
,
.
. ,
, . ,
,
, .
. .
3 4.
6.
,,
( , ),
,.
(, )
Darwin .
()
(), Theodosius Dobzhansky: Nothing
in Biology makes sense except in the light of evolution. ,
,
.
,
.Darwin,
, . ,
(, ),
(
). , ,
,
, ,
.
6.1.
.
, ,
. ( ),
.
(),
.
(orthologues),
. , , , ,
(paralogues),
.-(..
, , ..),
(..),,,,,,,
., (xenologues),
().
,
.
.,
,
,,
. ,
,(taxa).,
, . ,
,.,
,(
,
).6.128
.
,
(Brinkman&Leipe,2001):
( )
. ( ) ,
.
.
taxon ,
.
.
6.2:
http://commons.wikimedia.org/wiki/File:Phylogenetic_Tree_of_Life.png)
(:
6.3: ,
. 6.2
. , ,
, (:
http://creationwiki.org/Macroevolution)
(rooted).
, (
, ).
(unrooted),
. (
).(6.4)
(),(),5.
6.4: 5 . () , ()
.
(,),
,
(-outgroup).
L ,
2L-1(1,2,LLL+1,L+2,,2L-1
),2L-3.N
L,
$$N_{\mathrm{rooted}}(L) = \frac{(2L-3)!}{2^{L-2}\,(L-2)!}$$
,:
$$N_{\mathrm{unrooted}}(L) = \frac{(2L-5)!}{2^{L-3}\,(L-3)!}$$
,L=10, 35.
2..2L-3
.
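The two counting formulas above are easy to evaluate directly. The short Python sketch below (function names are illustrative) shows how quickly the number of binary tree topologies grows with the number of leaves L; for L=10 it gives roughly 34.5 million rooted trees.

```python
from math import factorial

def n_rooted(L):
    """Number of rooted binary tree topologies with L labelled leaves (L >= 2)."""
    return factorial(2 * L - 3) // (2 ** (L - 2) * factorial(L - 2))

def n_unrooted(L):
    """Number of unrooted binary tree topologies with L labelled leaves (L >= 3)."""
    return factorial(2 * L - 5) // (2 ** (L - 3) * factorial(L - 3))

for L in (3, 4, 5, 10, 20):
    print(L, n_unrooted(L), n_rooted(L))
```

Note that the number of rooted trees with L leaves equals the number of unrooted trees with L+1 leaves, since a root can be placed on any of the 2L-3 branches of an unrooted tree.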
,
:
.,.
, . ,
.
,
. (
), ,
.
.,
. , ,
... , ,
.
. ,
. ,
,
.
(,
), . ,
.
6.2.
,
,,
.
Markov.
8,
,
.,(Durbin,Eddy,Krogh,&Mitchison,1998):
$$p_{ab}(t) = P(x_i = b \mid x_i = a, t) \qquad (6.1)$$
bix
t.,Markov.
DNA x x1 , x2 ,..., xn y y1 , y 2 ,..., y n , x
yt:
$$P(x \mid y, t) = \prod_{i=1}^{n} P(x_i \mid y_i, t) \qquad (6.2)$$
, 4x4
t,
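$$S(t) = \begin{pmatrix} P(A \mid A,t) & P(T \mid A,t) & P(G \mid A,t) & P(C \mid A,t) \\ P(A \mid T,t) & P(T \mid T,t) & P(G \mid T,t) & P(C \mid T,t) \\ P(A \mid G,t) & P(T \mid G,t) & P(G \mid G,t) & P(C \mid G,t) \\ P(A \mid C,t) & P(T \mid C,t) & P(G \mid C,t) & P(C \mid C,t) \end{pmatrix} \qquad (6.3)$$
$$p_{i,j} \ge 0 \quad \forall\, i,j = 1,2,3,4, \qquad \sum_{j=1}^{4} p_{i,j} = 1 \quad \forall\, i \qquad (6.4)$$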
.
Markov,,
(Lio & Goldman, 1998). , Markov
(homogeneity),.
(equilibrium).,Markov
(stationary),
. , (reversibility)
, .
3 ,
.
,Chapman-Kolmogorov:
$$S(t)\,S(s) = S(t+s) \qquad (6.5)$$
,
(SubstitutionRateMatrix)R:
(6.6)
3,:
=-(a++)
0.:
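$$S(t) = \exp(Rt) = I + Rt + \frac{(Rt)^2}{2!} + \frac{(Rt)^3}{3!} + \dots$$
or, using the spectral decomposition:
$$S(t) = U\,\mathrm{diag}\left(e^{\lambda_1 t}, \dots, e^{\lambda_4 t}\right) U^{-1}$$
where $\lambda_i$ are the eigenvalues of R and U is the matrix of the corresponding eigenvectors. For a small time interval $\varepsilon$:
$$S(\varepsilon) \approx I + R\varepsilon \;\Rightarrow\; S(t+\varepsilon) = S(t)\,S(\varepsilon) \approx S(t)\,(I + R\varepsilon)$$
$$\frac{S(t+\varepsilon) - S(t)}{\varepsilon} \approx S(t)\,R$$
and in the limit $\varepsilon \to 0$:
$$S'(t) = S(t)\,R \qquad (6.7)$$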
(6.6)a==:
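$$R = \begin{pmatrix} -3\alpha & \alpha & \alpha & \alpha \\ \alpha & -3\alpha & \alpha & \alpha \\ \alpha & \alpha & -3\alpha & \alpha \\ \alpha & \alpha & \alpha & -3\alpha \end{pmatrix}, \qquad S(t) = \begin{pmatrix} r_t & s_t & s_t & s_t \\ s_t & r_t & s_t & s_t \\ s_t & s_t & r_t & s_t \\ s_t & s_t & s_t & r_t \end{pmatrix}$$
$$r_t = \tfrac{1}{4}\left(1 + 3e^{-4\alpha t}\right), \qquad s_t = \tfrac{1}{4}\left(1 - e^{-4\alpha t}\right) \qquad (6.8)$$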
JukesCantor(Jukes&
Cantor,1969)Poisson(,JC69).
,
.
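The closed-form JC69 probabilities of (6.8) can be checked numerically against the properties stated earlier: each row of S(t) sums to 1, as required by (6.4), and the Chapman-Kolmogorov relation (6.5) holds. The sketch below uses plain Python lists; the function names and the parameter values are illustrative assumptions.

```python
import math

def jc69_matrix(alpha, t):
    """4x4 JC69 substitution matrix S(t) with rate alpha, following eq. (6.8)."""
    r = 0.25 * (1 + 3 * math.exp(-4 * alpha * t))   # probability of no net change
    s = 0.25 * (1 - math.exp(-4 * alpha * t))       # probability of each specific change
    return [[r if i == j else s for j in range(4)] for i in range(4)]

def matmul(A, B):
    """Product of two 4x4 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)] for i in range(4)]

S1, S2, S12 = jc69_matrix(0.5, 0.2), jc69_matrix(0.5, 0.7), jc69_matrix(0.5, 0.9)
product = matmul(S1, S2)
print(max(abs(product[i][j] - S12[i][j]) for i in range(4) for j in range(4)))
print(jc69_matrix(0.5, 100.0)[0])   # for large t every entry approaches 1/4
```

The last line illustrates the stationary distribution: after a long time the four nucleotides become equally likely regardless of the starting base.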
()(
).,Kimura(Kimura,
1980)K2P:
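$$R = \begin{pmatrix} -2\beta-\alpha & \beta & \alpha & \beta \\ \beta & -2\beta-\alpha & \beta & \alpha \\ \alpha & \beta & -2\beta-\alpha & \beta \\ \beta & \alpha & \beta & -2\beta-\alpha \end{pmatrix}, \qquad S(t) = \begin{pmatrix} r_t & s_t & u_t & s_t \\ s_t & r_t & s_t & u_t \\ u_t & s_t & r_t & s_t \\ s_t & u_t & s_t & r_t \end{pmatrix}$$
$$s_t = \tfrac{1}{4}\left(1 - e^{-4\beta t}\right), \qquad u_t = \tfrac{1}{4}\left(1 + e^{-4\beta t} - 2e^{-2(\alpha+\beta) t}\right), \qquad r_t = 1 - 2s_t - u_t \qquad (6.9)$$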
6.5: y,
(, , ), t,
x.
-
(..AG,TC)(..AT,GC).
(JC69, K2P) t
qA=qT=qG=qC=1/4,
DNA.
(A+T)/(G+C)
(Felsenstein,1981; Lio&Goldman,1998;
Penny&Hendy,2001). ,(6.6)
(, G, C, T),
,
.,F81Felsenstein(Felsenstein,
1981), JC69, ()
,
(,G,C,T):
C
G
T
- C G T
- A G T
- C A T
(6.10)
A
C
G
- C G A
,
,( ):
C
G
T
- C G T
- A G T
- C A T
- C G A
(6.11)
a A
- a A d G e T
d G
e T
b A
d C
- d C b A f T
c A
e C
f G
- e C f G c A
(6.12)
,
. , HKY85 F81, K2P JC69,
F81K2P()JC69.,
,,
,
.,,
,
,GTR.
,,,
(Yang, 1994).
(randomeffectsmodel)(Yang,1993). ,
,
,JC69+,GTR+,...
(molecularclock).
,
.,
PAM (Dayhoff, Schwartz, & Orcutt, 1978) ( )
, (6.3).
()
.
3 ,
(Lio&Goldman,1998)
,t(
,).
Markov ,
,
.(Lio&
Goldman,1998).
6.3.
, (distancebased methods).,,
(Durbin,etal.,1998).(dij)i,j,
:
$$d_{ii} = 0$$
$$d_{ij} = d_{ji} > 0, \quad i \ne j \qquad (6.13)$$
$$d_{ij} \le d_{ik} + d_{kj}$$
,
(clustering),,,
.
,4.
, dij, i j,
.fu,
xuixuj,.,
,.,
.,JC69:
$$d_{ij} = -\frac{3}{4}\log\left(1 - \frac{4}{3}f\right) \qquad (6.14)$$
K2P,:
$$d_{ij} = -\frac{1}{2}\log\left(1 - 2f - g\right) - \frac{1}{4}\log\left(1 - 2g\right) \qquad (6.15)$$
fg
.,,
.
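The two corrected distances (6.14) and (6.15) can be computed from a pair of aligned sequences as in the following Python sketch. It assumes gap-free DNA sequences of equal length, takes f in (6.15) as the fraction of transitions and g as the fraction of transversions, and uses function names chosen for this example only.

```python
import math

TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def jc69_distance(x, y):
    """JC69 distance, eq. (6.14); f is the fraction of differing sites."""
    f = sum(a != b for a, b in zip(x, y)) / len(x)
    return -0.75 * math.log(1 - 4 * f / 3)

def k2p_distance(x, y):
    """K2P distance, eq. (6.15); f = fraction of transitions, g = fraction of transversions."""
    n = len(x)
    f = sum((a, b) in TRANSITIONS for a, b in zip(x, y)) / n
    g = sum(a != b and (a, b) not in TRANSITIONS for a, b in zip(x, y)) / n
    return -0.5 * math.log(1 - 2 * f - g) - 0.25 * math.log(1 - 2 * g)

x = "ACGTACGTAC"
y = "ACGTACGTAT"   # one transition (C to T) in ten sites
print(jc69_distance(x, y), k2p_distance(x, y))
```

Both corrections exceed the raw fraction of differences (0.1 here) because they account for multiple substitutions at the same site; they become undefined when the sequences are so divergent that the argument of a logarithm is not positive.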
, ,
. , Sokal
Michener, (Sokal & Michener, 1958) UPGMA (Unweighted Pair Group Method with Arithmetic
Mean), (Clusters)
:
$$d_{ij} = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i,\; q \in C_j} d_{pq} \qquad (6.16)$$
,|Ci||Cj|i j.
ki jl :
$$d_{kl} = \frac{d_{il}\,|C_i| + d_{jl}\,|C_j|}{|C_i| + |C_j|} \qquad (6.17)$$
,
: ,
,.
dij/2(6.6).O
O(n2). , ,
.,
,
,,.,
,averagelinkage.
linkage (complete linkage, simple linkage ), ,
UPGMA.
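The UPGMA procedure can be sketched compactly in Python. This is a minimal illustration under stated assumptions, not an optimized implementation: the input is a full symmetric distance matrix, ties are broken by taking the first minimum found, and the tree is returned as nested tuples `(left, right, height)`, a representation chosen only for this example. The update of distances after each merge follows (6.17), and each new node is placed at height d_ij/2.

```python
def upgma(names, d):
    """UPGMA clustering. names: leaf labels; d: symmetric matrix (list of lists).
    Returns nested tuples (left, right, height); leaves are plain labels."""
    trees = {i: name for i, name in enumerate(names)}
    size = {i: 1 for i in trees}
    dist = {(i, j): d[i][j] for i in trees for j in trees if i < j}
    next_id = len(names)
    while len(trees) > 1:
        i, j = min(dist, key=dist.get)            # closest pair of current clusters
        k, next_id = next_id, next_id + 1
        height = dist[(i, j)] / 2                 # new node placed at d_ij / 2
        for l in trees:
            if l not in (i, j):
                dil = dist[(min(i, l), max(i, l))]
                djl = dist[(min(j, l), max(j, l))]
                # eq. (6.17): size-weighted average of the two old distances
                dist[(l, k)] = (dil * size[i] + djl * size[j]) / (size[i] + size[j])
        dist = {p: v for p, v in dist.items() if i not in p and j not in p}
        trees[k] = (trees.pop(i), trees.pop(j), height)
        size[k] = size.pop(i) + size.pop(j)
    return next(iter(trees.values()))

names = ["A", "B", "C", "D"]
d = [[0, 2, 6, 10],
     [2, 0, 6, 10],
     [6, 6, 0, 10],
     [10, 10, 10, 0]]      # toy ultrametric distances
print(upgma(names, d))
```

Because the toy distances are ultrametric (consistent with a molecular clock), UPGMA recovers the generating tree; with non-clock-like data the same code would still run but could return the wrong topology.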
6.6: UPGMA. ,
( ).
(A,B,C,D) ,
UPGMA, , ,
.,
,.
,,
.,
. , .
,i,jk,
,k
,m,:
$$d_{km} = \frac{1}{2}\left(d_{im} + d_{jm} - d_{ij}\right) \qquad (6.18)$$
,,.,
- ,
. ,
,.,
6.7.
6.7: ,
. (A-C -D) 4, (A-B
C-D) ( 3). ,
Neighbour-Joining.
(neighbour joining,NJ)(Saitou&Nei,1987),
,
:
$$D_{ij} = d_{ij} - \frac{1}{L-2}\sum_{k}\left(d_{ik} + d_{jk}\right) \qquad (6.19)$$
L,
.,
6.7 (, Dij ).
,,..(6.8).,
L,L-3,Dij
LxL. , (L3),
.
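The selection step of Neighbour-Joining, equation (6.19), can be sketched as follows. The toy distances below are additive on an unrooted four-leaf tree in which A and B are neighbours (as are C and D) but have long branches, so the closest pair by raw distance (A, C) is not a pair of neighbours. The function name, the dict-of-dicts input and the numbers are assumptions made for this example.

```python
def nj_pick(names, d):
    """Return the pair minimising D_ij = d_ij - (r_i + r_j), r_i = sum_k d_ik / (L - 2)."""
    L = len(names)
    r = {i: sum(d[i][k] for k in names if k != i) / (L - 2) for i in names}
    best = None
    for a in range(L):
        for b in range(a + 1, L):
            i, j = names[a], names[b]
            D = d[i][j] - (r[i] + r[j])
            if best is None or D < best[0]:
                best = (D, i, j)
    return best[1], best[2]

# Tree ((A,B),(C,D)) with branch lengths A=1, B=4, C=1, D=4 and internal branch 1
pairs = {("A", "B"): 5, ("A", "C"): 3, ("A", "D"): 6,
         ("B", "C"): 6, ("B", "D"): 9, ("C", "D"): 5}
names = ["A", "B", "C", "D"]
d = {n: {} for n in names}
for (i, j), v in pairs.items():
    d[i][j] = d[j][i] = v

print(min(pairs, key=pairs.get))   # closest pair by raw distance
print(nj_pick(names, d))           # pair chosen by the corrected criterion
```

After a pair is chosen, the full algorithm replaces it by a new node whose distances to the remaining leaves are given by (6.18) and repeats until only two nodes remain.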
NJ, ,
. ,
, .
,,,
bootstrap. UPGMA,,
. ,
()
. ,,
, ,
. , ,
CLUSTAL,-.
6.8: Neighbour-Joining. 1 2
() , (Y). ,
.
, . ,
Fitch-Margoliash (Fitch & Margoliash, 1967),
,,.,
(dij), ( dij ). , ,
:
$$Q = \sum_{i=1}^{L}\sum_{j=1}^{L} w_{ij}\left(d_{ij} - \hat{d}_{ij}\right)^2 \qquad (6.20)$$
(y=a+bx).wij .
,(Cavalli-Sforza&Edwards,1967)
wij=1 ( ), (Fitch & Margoliash, 1967),
( wi j 1 dij2 ,
).,,
,
.,,
.,
(Q) . ,
. , ,
.
6.4.
(character-based methods),
, ,
, , :
(Durbin, et al., 1998). :
,
,,(,)
..
6.4.1
(maximum parsimony)
, ,
( ) .
(
)
. , ,
(Ockham's razor),
.Pluralitas
non est ponenda sine necessitate,
, .
,
,
.
Fitch(Fitch,1971),,
+1 ,
(weightedparsimony),
. 6.9, 4
,.
(3)(15).
6.9: .
(3 )
( 15 ).
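The counting step of Fitch's algorithm can be sketched in Python as below. The sketch computes only the minimum number of changes (unit cost per change) for a given rooted binary tree, written as nested pairs; the four short sequences are made-up data and the function names are illustrative. It does not search over tree topologies.

```python
def fitch(tree, states):
    """Return (candidate state set, minimum changes) for one site.
    tree: a leaf name or a pair (left, right); states: dict leaf -> character."""
    if not isinstance(tree, tuple):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:                                   # children agree: no extra change
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1   # one more change needed

def parsimony_score(tree, alignment):
    """Total number of changes over all sites of an alignment (dict name -> sequence)."""
    n = len(next(iter(alignment.values())))
    return sum(fitch(tree, {k: s[i] for k, s in alignment.items()})[1] for i in range(n))

alignment = {"a": "AAG", "b": "AAA", "c": "GGA", "d": "AGA"}
for tree in ((("a", "b"), ("c", "d")), (("a", "c"), ("b", "d"))):
    print(tree, parsimony_score(tree, alignment))
```

Under maximum parsimony the first topology would be preferred, since it explains the same data with fewer changes.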
, ,
,.
,branch and bound,
.
. ,
(),
.,
,
.
,
,,
.,
,
(Yang, 1996).
(Edwards & Cavalli-Sforza, 1963)
.
6.4.2
,
(Maximum Likelihood).
L n.:
$$\begin{aligned} x^1 &= x_{11}\,x_{12} \dots x_{1n} \\ x^2 &= x_{21}\,x_{22} \dots x_{2n} \\ &\;\;\vdots \\ x^L &= x_{L1}\,x_{L2} \dots x_{Ln} \end{aligned}$$
,
:
$$X_1 = (x_{11}, x_{21}, \dots, x_{L1}), \quad X_2 = (x_{12}, x_{22}, \dots, x_{L2}), \quad \dots, \quad X_n = (x_{1n}, x_{2n}, \dots, x_{Ln})$$
(Maximum Likelihood) ..
.
,
(
).
.
(.JC69
).
.
,
.:
$$x^1 = (x_{11}, x_{12}, \dots, x_{1n})$$
$$x^2 = (x_{21}, x_{22}, \dots, x_{2n})$$
i:
$$P(x_{1i}, x_{2i}, a \mid T, t_1, t_2) = q_a\, P(x_{1i} \mid a, t_1)\, P(x_{2i} \mid a, t_2) \qquad (6.21)$$
, , t1,t2
( ).
,:
(6.22)
n
:
$$P(x^1, x^2 \mid T, t_1, t_2) = \prod_{i=1}^{n} P(x_{1i}, x_{2i} \mid T, t_1, t_2) \qquad (6.23)$$
(likelihood).
(log-likelihood):
n
(6.24)
i 1
(6.24),
,JC69GTR,
.
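For two sequences the likelihood is simple enough to compute directly. The sketch below is a hedged illustration that assumes the JC69 model with q_a = 1/4 and a fixed rate `alpha`: it sums (6.21) over the unknown ancestral base to get the per-site likelihood and adds the logarithms over sites, as in (6.23). The function names and the example values are hypothetical.

```python
import math

def jc69_prob(a, b, alpha, t):
    """P(b | a, t) under JC69, from eq. (6.8)."""
    e = math.exp(-4 * alpha * t)
    return 0.25 * (1 + 3 * e) if a == b else 0.25 * (1 - e)

def site_likelihood(x1, x2, t1, t2, alpha=1.0):
    """Eq. (6.21) summed over the ancestral base a, with q_a = 1/4."""
    return sum(0.25 * jc69_prob(a, x1, alpha, t1) * jc69_prob(a, x2, alpha, t2)
               for a in "ACGT")

def log_likelihood(s1, s2, t1, t2, alpha=1.0):
    """Log-likelihood of two aligned, gap-free sequences: sum of per-site logs."""
    return sum(math.log(site_likelihood(a, b, t1, t2, alpha)) for a, b in zip(s1, s2))

s1, s2 = "ACGTACGTAC", "ACGTACGTAT"
for t in (0.005, 0.02, 0.05, 0.2):
    print(t, log_likelihood(s1, s2, t, t))
```

Because JC69 is time-reversible, the result depends only on the sum t1 + t2, so the position of the root between the two sequences cannot be estimated from the data.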
L,:
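$$P\left(x_{1i},\dots,x_{Li},\, a^{L+1},\dots,a^{2L-1} \mid T, t\right) = q_{a^{2L-1}} \prod_{k=L+1}^{2L-2} P\left(a^{k} \mid a^{\alpha(k)}, t_k\right) \prod_{k=1}^{L} P\left(x_{ki} \mid a^{\alpha(k)}, t_k\right) \qquad (6.25)$$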
(k)
. L 2L-1
2L-2 , . L
(),L+12L2, .
,
L+1 2L-1 (
). r
n:
n
(6.26)
i 1
(log-likelihood),:
n
(6.27)
i 1
,
.,
(Yang, 1993).
(molecular clock)
UPGMA,(
,).
,,
,,
(time reversibility). ,
,,
, ,
.
, ,
.
(),(6.27),
,,
, Gradient Descent, Newton-Raphson
(Ypma,1995),(Dempster,Laird,&Rubin,1977).,
:
,,.
,
, Felsenstein (Felsenstein, 1981).
,
:
.
,,.
(Yang&Rannala,2012).,
.
, , .
,
,(likelihoodratio
test),.
, ,
,
,
.,,
.,,
,
,
,GPUFPGA.
,
(Bayesian methods) (Huelsenbeck, Ronquist, Nielsen, & Bollback, 2001).
, ,
.
, ,
,P(x|T,),(6.26).(Bland&Altman,1998),,
Bayes,
(posteriordistribution):
P T , P x | T ,
(6.28)
P T , | x
P x
(6.28), P(T, |x) , P(x|T,) ,
P(T,).,P(x),
(,).,
, ,
. , ,
MCMC (Markov Chain Monte Carlo) (Gilks, Richardson, & Spiegelhalter,
1996).,.
,
,(,
;). ,
,
(Bland&Altman,1998).
. ,
-
. ,
( ),
. ,
,
.
6.5.
, ,
.
, ,
,
.
6.10: : (permutation). ,
( 1 2 5 6),
.
, ,
( ). :
(randomised) . ,
. ,
.
, ,
. ,
()
2.
,
, bootstrap (Goldman,
1993; Posada & Crandall, 1998).,,
(
).
,
,
,().
permutationtests,,
, ,
,
.
.
6.11: bootstrap. ,
.
. ,
.
, bootstrap.
,(Efron&Tibshirani,
1993).,:1)
()
,2).
,
p-values.
(Hillis & Bull, 1993; Soltis & Soltis, 2003),,
,
.
( ),
(Durbin,etal.,1998).,
, bootstrap
(,
).,
(consensustree)
.
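The resampling step of the non-parametric bootstrap can be sketched as follows. This is a minimal illustration, assuming the alignment is a list of equal-length strings; the function names are hypothetical, and the tree-building step that would be applied to every replicate is deliberately left out.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Draw n columns with replacement from an alignment of length n."""
    n = len(alignment[0])
    columns = [rng.randrange(n) for _ in range(n)]
    return ["".join(seq[c] for c in columns) for seq in alignment]


def bootstrap_replicates(alignment, n_replicates, seed=0):
    """Generate pseudo-alignments; a tree would be built from each one."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]


alignment = ["ACGT", "ACGA", "TCGA"]
replicates = bootstrap_replicates(alignment, 5)
```

The support of a branch is then the fraction of replicate trees that contain it, and the replicate trees can be summarised in a consensus tree.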
, - bootstrap
.
, bootstrap,
(Goldman,1993;Wollenberg&Atchley,2000).
,
-bootstrap.,
,
(model misspecification). , ,
,
.
,
, . ,
, bootstrap ,
. ,
bootstrap
, .
, Kishino-Hasegawa (KH), Shimodaira-Hasegawa (SH), weighted Shimodaira-Hasegawa (WSH),
Approximately Unbiased (AU) Shimodaira,.
CONSEL,,
bootstrap.,
(Shimodaira&Hasegawa,2001).
6.6.
,()
, ,
, .
,
,
().
,
,,
.Felsenstein(Felsenstein,1973,1996)
,
.
.
(),
,
(Yang,1996).
, 100%
,
. , ,
, ,
,
.,
. , NJ
.
3(Penny&Hendy,2001;Steel&
Penny,2000),:1)
, 2) , 3)
,
,
. , bootstrap
, ( JC69, K2P
),J(UPGMA),
(,!).Felsenstein,
,
(maximumlikelihooddistance),(Felsenstein,1996).
,(NJ,UPGMA)
.NJML,
Neighbour Joining Maximum Likelihood.
NJ
bootstrap.NJML
NJ
(Ota&Li,2000).
,
(Brinkman&Leipe,2001):
.
, ,
,
(:garbage in, garbage out)
,
( ). ,
,
,.
. ,
( )
.,
, ,
.
.
(
),
.
,GC%,
(
).
6.7.
,
, .
,PAUPPHYLIP,
(MEGA,RAxML).
, ,
.
PAUP (Phylogenetic Analysis Using Parsimony *and other methods),
(Wilgenbusch & Swofford, 2003).
,
.
(http://www.sinauer.com/detail.php?id=8060).
PHYLIP (PHYLogeny Inference Package)
Joe Felsenstein (Retief, 2000).
,
(http://evolution.gs.washington.edu/phylip.html). MEGA
(Molecular Evolutionary Genetics Analysis) , ,
(Kumar, Nei, Dudley, & Tamura, 2008). ,
,
Windows
(http://www.megasoftware.net).
, ,
,-
-,
/.,,
HYPHY, PAML, PhyML, RAxML. HYPHY (Hypothesis testing using phylogenies),
.
(http://www.hyphy.org). PAML (Phylogenetic analysis by maximum
likelihood),.
ZihengYang(Yang,2007),
,
,
(http://abacus.gene.ucl.ac.uk/software/paml.html). To PhyML
,
DNA http://www.atgc-montpellier.fr/phyml/binaries.php, (Bazinet,
Zwickl, & Cummings, 2014). , RAxML,
(Stamatakis,2014),
(GTR),
.,
http://scoh-its.org/exelixis/software.html.
,,
. ,
. ,
MCMC (Markov Chain Monte Carlo). ,
. MrBayes ,
MCMC (Huelsenbeck & Ronquist, 2001).
(http://mrbayes.net). BEAST (Bayesian Evolutionary Analysis Sampling Trees),
MCMC(Drummond,Suchard,Xie,&Rambaut,2012).
,
.
, ( ). ,
TracerFigTree,
(http://beast.bio.ed.ac.uk).
GARLI (Genetic Algorithm for Rapid Likelihood Inference),
(Bazinet, et al., 2014).
( )
. GTR ,
,
.
http://code.google.com/p/garli. TNT (Tree analysis using new technology) (Goloboff, Farris, & Nixon,
2008)
http://www.lillo.org.ar/phylogeny/tnt/.,
TreeView (http://taxonomy.zoology.gla.ac.uk/rod/treeview.html)
.
(NEXUS,PHYLIP,Hennig86,
NONA,MEGA,ClustalW/X)TrueTypeandPostscript
PICT(Macintosh)Windowsmetafile(Windows)
.(Windows,Unix/Linux,Macintosh),
editor.
Bazinet, A. L., Zwickl, D. J., & Cummings, M. P. (2014). A gateway for phylogenetic analysis powered by grid computing featuring GARLI 2.0. Syst Biol, 63(5), 812-818.
Bland, J. M., & Altman, D. G. (1998). Bayesians and frequentists. BMJ, 317(7166), 1151-1160.
Brinkman, F. S., & Leipe, D. D. (2001). Phylogenetic Analysis. In A. D. Baxevanis & B. F. Ouellette (Eds.), Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins (pp. 323-358): John Wiley & Sons, Inc.
Cavalli-Sforza, L. L., & Edwards, A. W. (1967). Phylogenetic analysis. Models and estimation procedures. Am J Hum Genet, 19(3 Pt 1), 233-257.
Dayhoff, M. O., Schwartz, R. M., & Orcutt, B. C. (1978). A model of evolutionary change in Proteins. In M. Dayhoff (Ed.), In Atlas of protein sequence and structure (Vol. 5, Suppl. 3, pp. 345-352): National biomedical research foundation, Silver Spring, MD.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J Royal Stat Soc B, 39, 1-38.
Drummond, A. J., Suchard, M. A., Xie, D., & Rambaut, A. (2012). Bayesian phylogenetics with BEAUti and the BEAST 1.7. Molecular biology and evolution, 29(8), 1969-1973.
Durbin, R., Eddy, S. R., Krogh, A., & Mitchison, G. J. (1998). Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids.
Edwards, A. W., & Cavalli-Sforza, L. L. (1963). The reconstruction of evolution. Annals of Human Genetics, 27, 105.
Efron, B., & Tibshirani, R. (1993). An Introduction to the Bootstrap. Boca Raton, FL: Chapman & Hall/CRC.
Felsenstein, J. (1973). Maximum-likelihood estimation of evolutionary trees from continuous characters. American journal of human genetics, 25(5), 471.
Felsenstein, J. (1981). Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of molecular evolution, 17(6), 368-376.
Felsenstein, J. (1996). Inferring phylogenies from protein sequences by parsimony, distance, and likelihood methods. Methods Enzymol, 266, 418-427.
Fitch, W. M. (1971). Toward defining the course of evolution: minimum change for a specific tree topology. Systematic Biology, 20(4), 406-416.
Fitch, W. M., & Margoliash, E. (1967). Construction of phylogenetic trees. Science, 155(3760), 279-284.
Gilks, W. R., Richardson, S., & Spiegelhalter, D. (Eds.). (1996). Markov Chain Monte Carlo in Practice. Chapman & Hall/CRC.
Goldman, N. (1993). Statistical tests of models of DNA substitution. Journal of Molecular Evolution, 36(2), 182-198.
Goloboff, P. A., Farris, J. S., & Nixon, K. C. (2008). TNT, a free program for phylogenetic analysis. Cladistics, 24(5), 774-786.
Hasegawa, M., Kishino, H., & Yano, T. (1985). Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. Journal of molecular evolution, 22(2), 160-174.
Hillis, D. M., & Bull, J. J. (1993). An empirical test of bootstrapping as a method for assessing confidence in phylogenetic analysis. Systematic biology, 42(2), 182-192.
Huelsenbeck, J. P., & Ronquist, F. (2001). MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics, 17(8), 754-755.
Huelsenbeck, J. P., Ronquist, F., Nielsen, R., & Bollback, J. P. (2001). Bayesian inference of phylogeny and its impact on evolutionary biology. Science, 294(5550), 2310-2314.
Jukes, T., & Cantor, C. (1969). Evolution of protein molecules. Pp. 21-132 in H. N. Munro, ed. Mammalian protein metabolism: Academic Press, New York.
Kimura, M. (1980). A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. Journal of molecular evolution, 16(2), 111-120.
Kumar, S., Nei, M., Dudley, J., & Tamura, K. (2008). MEGA: a biologist-centric software for evolutionary analysis of DNA and protein sequences. Brief Bioinform, 9(4), 299-306.
Lio, P., & Goldman, N. (1998). Models of molecular evolution and phylogeny. Genome research, 8(12), 1233-1244.
Ota, S., & Li, W.-H. (2000). NJML: a hybrid algorithm for the neighbor-joining and maximum-likelihood methods. Molecular Biology and Evolution, 17(9), 1401-1409.
Penny, D., & Hendy, M. (2001). Phylogenetics: parsimony and distance methods. In D. J. Balding, M. Bishop & C. Cannings (Eds.), Handbook of Statistical Genetics (pp. 445-484): John Wiley and Sons, Ltd.
Posada, D., & Crandall, K. A. (1998). Modeltest: testing the model of DNA substitution. Bioinformatics, 14(9), 817-818.
Retief, J. D. (2000). Phylogenetic analysis using PHYLIP. Methods Mol Biol, 132, 243-258.
Saitou, N., & Nei, M. (1987). The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular biology and evolution, 4(4), 406-425.
Shimodaira, H., & Hasegawa, M. (2001). CONSEL: for assessing the confidence of phylogenetic tree selection. Bioinformatics, 17(12), 1246-1247.
Sokal, R. R., & Michener, C. D. (1958). A Statistical Method for Evaluating Systematic Relationships. University of Kansas Science Bulletin, 38, 1409-1438.
Soltis, P. S., & Soltis, D. E. (2003). Applying the Bootstrap in Phylogeny Reconstruction. Stat Sci, 18(2), 256-267.
Stamatakis, A. (2014). RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics, 30(9), 1312-1313.
Steel, M., & Penny, D. (2000). Parsimony, likelihood, and the role of models in molecular phylogenetics. Molecular Biology and evolution, 17(6), 839-850.
Tavare, S. (1986). Some Probabilistic and Statistical Problems in the Analysis of DNA Sequences. Lectures on Mathematics in the Life Sciences (American Mathematical Society) 17, 57-86.
Wilgenbusch, J. C., & Swofford, D. (2003). Inferring evolutionary trees with PAUP*. Curr Protoc Bioinformatics, Chapter 6, Unit 6.4.
Wollenberg, K. R., & Atchley, W. R. (2000). Separation of phylogenetic and functional associations in biological sequences by using the parametric bootstrap. Proceedings of the National Academy of Sciences, 97(7), 3288-3291.
Yang, Z. (1993). Maximum-likelihood estimation of phylogeny from DNA sequences when substitution rates differ over sites. Molecular Biology and Evolution, 10(6), 1396-1401.
Yang, Z. (1994). Estimating the pattern of nucleotide substitution. Journal of Molecular Evolution, 39(1), 105-111.
Yang, Z. (1996). Phylogenetic analysis using parsimony and likelihood methods. Journal of Molecular Evolution, 42(2), 294-307.
Yang, Z. (2007). PAML 4: phylogenetic analysis by maximum likelihood. Molecular biology and evolution, 24(8), 1586-1591.
7:
,
DNA RNA.
.
,
. ,
, ,
- .
DNA , (
/, ...), RNA
micro RNA .
2, 3, 4 5.
7.
DNA/RNA.
(
...),
.
.
,
,
( / )
.
(
),,
:
()
.,,
.
,
,.,
,20-30%
.
, ,
,
.,
(remote homology)
(threading),.
,
DNA,.
.
(, , ...) ,
.,
,
(
). ,
.
.
,
. ,
, ,
, .
,
,,
- ,
... ,
,
.
. , ,
.
,,
.
,
( 1970)
().,
,
(- , , )
. , ,
.. ,
DNA , ... DNA,
(gene finding),
,
. , ,
, -,
RNA ... ,
microRNA.
,
.
,
.
7.1.
.
,-
,,
, , ,
--,
GolgiN-X-[ST].DNA
A-U-G,-,
A-G G-T , ... ,
. , ,
PROSITE.,
,,
, , (
),.,
( ) ,
,.
, .. ,
,
,
.
,
.
7.1: . ,
. ,
.
,(7.1).
.
.
,
(,,/,
...). , (
), (
). ,
. ,
, GPCR
,(..,,/...),
.
,
(.. ),
(..-),
.,
, . ,
.
,
(),
(
).
,
( 7.2). ,
( ),
. ,
( ). (smoothing),
.
,
( 15 ,
).
( ).
,..,
(,
).
7.2: , (
). ,
.
,
.,
. ( , ,
) 10-20 ,
.
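The sliding-window smoothing described above can be sketched as follows. This is an illustration only: the function name is hypothetical, the window is assumed to have odd length so that it is centred on a residue, and the Kyte-Doolittle hydropathy scale is used merely as one example of a per-residue scale.

```python
# Kyte & Doolittle (1982) hydropathy values, used here as an example scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def sliding_window_profile(seq, scale, window):
    """Average the scale over a window centred on each residue."""
    half = window // 2
    profile = []
    for i in range(half, len(seq) - half):
        segment = seq[i - half:i + half + 1]
        profile.append(sum(scale[a] for a in segment) / window)
    return profile


profile = sliding_window_profile("IIIKKK", KYTE_DOOLITTLE, window=3)
```

The first and last `window // 2` residues receive no value, which is the edge effect of any such smoothing. A wider window gives a smoother curve at the cost of blurring short features.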
, .. ,
. ,
,
.,
30 . .
(..
95,...).
,..
( ),
.
,
.,,
3. ,
( , ).
, .
http://web.expasy.org/protscale/
,,,,
.
(,)
. k,
p,pk,L
, (L-k+1) pk(L-k+1) .
, , ..
weight
matrices ( ).
,3(),5(,...)
8(),.
,
.
,
,
. ,
,
,sparseencoding()
2041
0(7.3).,dummyvariables,
(
).,
( ) 20 .
k20k,L,
(L-k+1)20k(L-k+1)().
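The sparse encoding can be sketched as follows. This is a minimal illustration, assuming the 20 standard amino acids in an arbitrary fixed order; the function names are hypothetical. Each residue becomes a vector of 20 values with a single 1, a window of k residues becomes 20k inputs, and a sequence of length L yields L-k+1 such windows.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(residue):
    """20 binary values with a single 1 at the position of the residue."""
    return [1 if residue == a else 0 for a in AMINO_ACIDS]


def encode_windows(seq, k):
    """One row of 20*k inputs for each of the L-k+1 windows of the sequence."""
    return [[bit for r in seq[i:i + k] for bit in one_hot(r)]
            for i in range(len(seq) - k + 1)]


rows = encode_windows("ACDEFG", 3)
```

For a protein of 6 residues and k = 3 this produces a 4 x 60 matrix, which illustrates how quickly the number of inputs (and therefore of weights to be estimated) grows with the window size.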
,
,
.,
(,,,
...)79,
.,
BLOSUM62 ( ),
. , , PSSM,
.,
(,...).
,
, .
,
.,
20
.(Reinhardt&Hubbard,1998)
, ,
() , .
,,,
( ) ,
(..
).,
(
,
,).
,
, . ,
- -,
(4008000).,
, ,
,,,
... (
!).
(.. Fourier),
(pseudo amino acid composition)
Chou,()
. , (
), ( i i+1),
( i i+2), (i+3). ,
.
, http://www.csbio.sjtu.edu.cn/bioinf/PseAAC/.
,
,
,.
7.2.
, ,
(Baldi & Brunak, 2001).
. ( , )
(Bishop,1998).
().,
, ()
.,
,
.,
.,
,
(
),
,.
,
, (feed forward)
,
( 7.4). ,
( ).
, . ,
.
(
, , ,
...). ( )
-
-. XOR,
,,
.
. ,
(,
). ,
(interaction),
.
,
.
7.4: 3 , 4 ( ) 2
. .
,(7.5).
( )
. , (weights)
.,
.
(activationfunction).
().
,
0.
(bias)
+1.
(,
). , (
/),i:
$$g(h_i) = \frac{1}{1 + e^{-h_i}}$$
,0
1,
. (logistic
regression). , ,
.
, , .
c(..-,
-,),
softmax:
$$g(h_i) = \frac{e^{h_i}}{\sum_{j=1}^{c} e^{h_j}}$$
7.5: m .
( bias)
.
.
, c ,
01,.
.GPCRG-,
.
,.
( , ...)
.
.
,
(),
.
,-(tanh):
$$g(h_i) = \frac{1 - e^{-h_i}}{1 + e^{-h_i}}$$
-1+1.
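The activation functions discussed above can be written in a few lines. The sketch below is an illustration only; the function names and the example weights are hypothetical. A neuron computes the weighted sum of its inputs plus the bias, and passes it through the activation function.

```python
import math


def logistic(h):
    """Logistic (sigmoid) activation, output between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-h))


def softmax(hs):
    """Softmax over c outputs; the results are positive and sum to 1."""
    m = max(hs)                                # subtract the maximum for numerical stability
    exps = [math.exp(h - m) for h in hs]
    total = sum(exps)
    return [e / total for e in exps]


def neuron_output(inputs, weights, bias, activation=logistic):
    """Weighted sum of the inputs plus the bias, passed through the activation."""
    h = bias + sum(w * x for w, x in zip(weights, inputs))
    return activation(h)


y = neuron_output([1.0, 0.0], [2.0, -1.0], bias=-2.0)
classes = softmax([1.0, 2.0, 3.0])
hidden = neuron_output([1.0, 0.0], [2.0, -1.0], bias=-2.0, activation=math.tanh)
```

With the logistic function a single neuron is equivalent to logistic regression. With softmax the outputs can be read as class probabilities, and the predicted class is the one with the largest value.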
, ,
.,
.(),
,.,
back-propagation (Rumelhart, Hinton, & Williams, 1988).
gradient descent
. ,
. ,
,,
2 (
).,
:
(
)
,
.
,
,,
.gradientdescent
, ,
.,
,
, ,
.(
...)
, ,
gradientdescent,.,,
, .
,
(.. , ...). ,
.
( )
. ,
cross-validation(.),,
,
.
,
.
MATLAB (http://www.mathworks.com/products/neural-network/) R
(https://cran.r-project.org/web/packages/neuralnet/index.html).,
,
, FANN C
(http://leenissen.dk/fann/wp/)
JOONE,
JAVA
(http://sourceforge.net/projects/joone/). ,
(simulators),
. () BILLNET
(http://www.nongnu.org/billnet/) NevProp (http://www.cse.unr.edu/brain/nevprop), SNNS
(http://www.ra.cs.uni-tuebingen.de/SNNS/) . ,
Weka(http://www.cs.waikato.ac.nz/ml/weka/).
7.3.
.
,
(
).
.
.
,
. , ,
,
. ,
.
.
7.6: .
,,
.,
( non-redundantset).
(..
30%, ,
). ,
,
,
(.).
, .
(
),
.
( , ,
)(
,
). , ,
( )
.,
, ,
, , (
),...
7.7: cross-validation.
,
. , (.
) .
, .
- (self-consistency)
-(over-fitting).
,
.
,
( .. ) . ,
(independent
test) ( 7.6).
,
,
, .
, ,
(..
, 30% ). ,
.
,
, ,
cross-validation(7.7).,
k (k-fold cross-validation). ,
,
. k
(unbiased)
,
.,
( )
k.,
,Jackknifek
,
. , , Jackknife
,.
, (
).
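The partitioning used in k-fold cross-validation can be sketched as follows. This is a minimal illustration that only produces the index sets; the function name is hypothetical, and the training and evaluation of the actual prediction method are left out.

```python
import random


def k_fold_splits(n_items, k, seed=0):
    """Yield (training indices, test indices) for each of the k folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test


splits = list(k_fold_splits(10, 5))
```

Every item is used for testing exactly once and never appears in the training set of its own fold. Setting k equal to the number of items gives the jackknife (leave-one-out) test. Note that random partitioning alone does not remove the effect of similar sequences; the data set is assumed to be non-redundant already.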
7.4.
,.,
(per-residue prediction) (per-protein classification). ,
2x2 (). , TP (True Positives)
, (True Negatives)
, FN (False Negatives)
FP(FalsePositives)
. , ,
.
,
(Q),:
$$Q = \frac{TP + TN}{TP + TN + FP + FN} \times 100\%$$
,(sensitivity)
, (
), (specificity)
(
).
,,
.,
,
, . ,
(5%),
95%, .
, , ,
.
7.8: .
( ), (Vihinen, 2012)
,Matthews
(C) (Baldi, Brunak, Chauvin, Andersen, & Nielsen, 2000):
$$C = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FN)(TP + FP)(TN + FP)(TN + FN)}}$$
, Pearson
-1(),+1(
),0.
.
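The two measures can be compared on the unbalanced example mentioned above, where only 5% of the cases are positive. The sketch below is an illustration; the function names are hypothetical, and returning 0 when the denominator of C vanishes is a common convention assumed here, not something stated in the text.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Overall percentage of correct predictions, Q."""
    return 100.0 * (tp + tn) / (tp + tn + fp + fn)


def matthews(tp, tn, fp, fn):
    """Matthews correlation coefficient, C."""
    denominator = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0          # assumed convention when a row or column is empty
    return (tp * tn - fp * fn) / denominator


# A "method" that predicts every case as negative on a set with 5% positives.
q_trivial = accuracy(tp=0, tn=95, fp=0, fn=5)
c_trivial = matthews(tp=0, tn=95, fp=0, fn=5)
```

The trivial predictor reaches Q = 95% although it never finds a positive case, while its correlation coefficient is 0, the value expected for a random prediction.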
, , .
(k),
Q kxkC,
.(H,E
C),Q(Qa, Qb)
C(Ca,Cb).
,
, .
,,
(TP,TN,Q,C),
.,
..
. , ,
(measure of the segments overlap-SOV),
,
0-1(Zemla,Venclovas,Fidelis,&Rost,1999).
7.9: SOV. , ,
, ,
. , ,
, SOV.
7.5.
,
,
. ,
,
. ,
,
(),
.
,
,.
,
.
7.5.1.
/ , ,
.
, (majority vote, consensus)
(ensemble learning, meta-algorithms ...). (
),
(weak classifiers),
(.. >0, Q>0.5),
.,
(.. >0.95 Q>0.99),
.
,
, ( )
.,
(),
.,7.10.
, ,
.
.,
,
.,,
.,
.
,01(
). , ,
c
(0<c<1). , ,
,
.
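A simple majority vote over per-residue predictions can be sketched as follows. This is a minimal illustration, assuming that each method returns a string of equal length with one state per residue (for example H, E and C); the function name is hypothetical, and ties are resolved in favour of the method listed first.

```python
from collections import Counter


def majority_vote(predictions):
    """Per-residue consensus of several equal-length prediction strings."""
    consensus = []
    for column in zip(*predictions):
        counts = Counter(column)
        consensus.append(counts.most_common(1)[0][0])
    return "".join(consensus)


consensus = majority_vote(["HHHC", "HHCC", "HECC"])
```

A weighted variant would multiply each vote by a reliability score between 0 and 1 before counting, so that methods known to be more accurate have more influence on the result.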
7.10: 3 .
, ,
.,
. ,
(.. ,
).
ensemble learning, .
,
,
.,,
,c,
(0.8).
,
(refinement). ,
,
(..
). ,
(, ),
.
,.
ad-hoc ( )
.
,
.
7.5.2.
,
. ,
,.
.,
. ,
(,
).,7.11.
,
,,
.
, ,
,
.,,
,
.
,
. ,
BLAST
CLUSTAL,
KALIGN.,HMMER3.0,
HMM.
,,
6-8%.
,
. , (
).
,
.,.
,
,
.,
.
7.11:
.
,
PSI-BLAST. ,
,(PSSM).
,
55000(7.12).PSI-BLAST
(),
.
, .
,,
,
. ,
,
.
, .
PSI-BLAST
.,
(
), , -,
,
(Przybylski&Rost,2007).
7.12: PSSM. 1, 7 8,
.
, (
), (7.13).,
,,
. ,
,
.
,(
).
7.13:
.
7.6.
7.6.1
,
,1970,
.
, ,
,,
. , ,
, 3
- (), - () (C),
.
7.14: .
.
.
,-
,,.,Chou
Fasman (Chou & Fasman, 1978) ( ) 29
,
3(H,E,C),(P).
: fj(i) =
ij(helix,sheet,turn).<fj>
f j. ,
Pj(i)ijPj(i)=fj(i)/<fj>.
Chou Fasman 7.1. ,
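For example, an amino acid that occurs 228 times in the data set (119 times in α-helix, 38 in β-sheet and 71 in coil)
has f_α = 119/228 = 0.522, f_β = 38/228 = 0.167 and f_C = 71/228 = 0.311. The average fraction of residues in α-helix is
<f_α> = 890/2473 = 0.360, in β-sheet <f_β> = 424/2473 = 0.171, and in coil
<f_C> = 1159/2473 = 0.469. Therefore, using the unrounded fractions, P_α = f_α/<f_α> = 1.45,
P_β = f_β/<f_β> = 0.97 and P_C = f_C/<f_C> = 0.66.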
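The arithmetic of this example can be checked with a few lines of code. The sketch below uses only the counts quoted in the text; the function and variable names are hypothetical.

```python
def propensities(residue_counts, state_totals):
    """P_j(i) = f_j(i) / <f_j> for each conformational state j."""
    n_residue = sum(residue_counts.values())   # occurrences of the amino acid
    n_all = sum(state_totals.values())         # all residues in the data set
    return {state: (residue_counts[state] / n_residue) / (state_totals[state] / n_all)
            for state in residue_counts}


residue_counts = {"helix": 119, "sheet": 38, "coil": 71}      # 228 in total
state_totals = {"helix": 890, "sheet": 424, "coil": 1159}     # 2473 in total
p = propensities(residue_counts, state_totals)
```

Values above 1.0 indicate that the amino acid is found in that conformation more often than expected from the overall composition of the data set.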
Pj(i)>1.0
.,
. , , 4
PH(i)>1 3 5 P(i)>1.
,4
Pj(i)<1.,-
- ,
(, , , ). , ( ),
5 P(i)>1.05, P(i)> P(i) .
,
(, turn), C (coil,
).
,log-oddsscore
.
log-odds score,
,,Chou-Fasman
. , log-odds score
, Chou-Fasman
. ,
.
(~60%)55%.
(,
). ,
.,
http://cho-fas.sourceforge.net/
Amino acid   P(helix)   P(sheet)   P(coil)
A (Ala)      1.420      0.830      0.660
R (Arg)      0.980      0.930      0.950
N (Asn)      0.670      0.890      1.560
D (Asp)      1.010      0.540      1.460
C (Cys)      0.700      1.190      1.190
Q (Gln)      1.110      1.100      0.980
E (Glu)      1.510      0.370      0.740
G (Gly)      0.570      0.750      1.560
H (His)      1.000      0.870      0.950
I (Ile)      1.080      1.600      0.470
L (Leu)      1.210      1.300      0.590
K (Lys)      1.160      0.740      1.010
M (Met)      1.450      1.050      0.600
F (Phe)      1.130      1.380      0.600
P (Pro)      0.570      0.550      1.520
S (Ser)      0.770      0.750      1.430
T (Thr)      0.830      1.190      0.960
W (Trp)      1.080      1.370      0.960
Y (Tyr)      0.690      1.470      1.140
V (Val)      1.060      1.700      0.500
,
.,
(
). GOR (Garnier-Osguthorpe-Robson).
log-oddsscore
17 (Garnier, Osguthorpe, & Robson, 1978).
,
(
). , , ,
.,
, , GOR IV (https://npsa-prabi.ibcp.fr/cgi-bin/npsa_automat.pl?page=npsa_gor4.html), ,
64%,GOR V(http://gor.bb.iastate.edu/),
PSI-BLAST,
74%.
1987
QianSejnowski
,68%(Qian&Sejnowski,1988).,
70%1992RostSanderPHD(Rost&Sander,
1993).
(juryofnetworks),(130),
(structure-to-structurenetwork)
.
6-8%
70% . PSI-PRED
(http://bioinf.cs.ucl.ac.uk/psipred/)profilesPSIBLAST (Jones, 1999) ( 76%).
PSI-PRED
(513 187
).,PHD,PROFphd
PSI-BLAST (~75-76%).
, JNET (Cuff & Barton, 2000).
,,PSIBLAST,Support
VectorMachines(SVM)RecurrentNeuralNetworks(RNN).
,(
),
(
) (
).
(cross-validation)
,
,,(Bagos,Tsaousis,&
Hamodrakas,2009).
7.15:
.
(Bagos, Tsaousis, et al., 2009).
,
70% ( )
( 500 ). ,
80%(
2000).,80%
,,
.
/
. 1988
Hamodrakas (Hamodrakas, 1988)
(Chou-Fasman,GOR,Lim,Dufton-Hider,Burgess,Nagano).
2-3%
SecStr (http://athina.biol.uoa.gr/SecStr/). ,
SecStr,
70%. JPRED (http://www.compbio.dundee.ac.uk/jpred/)
1998(Cuff,Clamp,Siddiqui,Finlay,&Barton,1998).JNET
(NNSSP,DSC,PREDATOR,MULPRED,PHD,ZPRED)
. , 4 (JPRED4)
,
,
.
NPS@ (https://npsa-prabi.ibcp.fr/cgi-bin/npsa_automat.pl?page=/NPSA/npsa_seccons.html)
SOPM,SOPMA,HNN,MLRC,DPM,DSC,
GORI,GORIII,GORIV,PHD,PREDATOR,SIMPA96
. CONCORD
(http://helios.princeton.edu/CONCORD/)PSIPRED,DSC,GORIV,Predator,Prof,
PROFphd, SSpro, SYMPRED (http://www.ibi.vu.nl/programs/sympredwww/)
PHDpsi,PROFsec,SSPro,Predator,YASPIN,JNetPSIPRED.
,
( ,
,...).,B.Rost
PREDICTPROTEIN (www.predictprotein.org/), PSIPRED (http://bioinf.cs.ucl.ac.uk/psipred/)
, SCRATCH
(http://scratch.proteomics.ics.uci.edu/index.html).
, .
. ,
.
,Rost.
EVA(Kohetal.,2003)PDB
,.
,,
7.6.2.
,
, ,
. ,
(Singer
&Nicolson,1972),
(7.16).
7.16: ,
(https://en.wikipedia.org/wiki/Cell_membrane)
(,,,)
( ),
,
. ,
...
(, , , )
.
,
. ,
, ( )
, , (Alberts et al., 1994).
. ,
-,
. ,
,
,
(
) ( ).
, ,
,
. ,
,
,
,
(Marsh, Horvath, Swamy, Mantripragada, & Kleinschmidt, 2002).
,,,
- (.. ),
.
- ,
, , ,
(, , , Golgi
). , ( ),
, .
,.
, ,
,.
, ,
.
7.17: . ,
Natronomonas pharaonis. , NspA, Neisseria meningitidis. ,
.
,
.
-()
-
( 7.17). ,
,
.
,
.
-
(von Heijne, 1999),
Gram (Schulz,
2003). ,
,
( 7.18). Gram
,,(7.19),
. , (.. Mycobacterium),
,
().
7.18: Gram .
( ), ,
.
Gram , , , .
, .
7.19: Gram .
, , ,
, .
-,
. ,
-,
,
(Alberts et al.,
1994). ,
multi-spanning,
.,
, G (G-Protein Coupled Receptors-GPCRs),
(Kristiansen, 2004) , (
) (
G ). -, ,
, ,
-
,.
-
20 (Eisenberg, Weiss, &
Terwilliger, 1984; Kyte & Doolittle, 1982), -
, . , 15-25 ,
. positive inside rule,
(von Heijne, 1992),
. (, positive inside rule),
- ,
(http://bioinf.cs.ucl.ac.uk/?id=756). PRED-TMR
,(Pasquieretal.,1999)(
http://athina.biol.uoa.gr/PRED-TMR/) ,
,, OrienTM (http://athina.biol.uoa.gr/orienTM/),
).,
HMMTOP (http://www.enzim.hu/hmmtop/) (Tusnady & Simon, 2001).
, , CoPreTHi
(http://athina.biol.uoa.gr/CoPreTHi/)
SOSUI, Tmpred, ISREC, DAS, TopPred, PHDtm PRED-TMR (Promponas,
Palaios,Pasquier,Hamodrakas,&Hamodrakas,1999).
,
,
.
(.)
-,
.Phobius (Kall,Krogh,&
Sonnhammer,2004)(http://phobius.sbc.su.se/),
SPOCTOPUS (http://octopus.cbr.su.se/index.php?about=SPOCTOPUS).
,
- . ,
, . ,
.,
.
HMMpTM
(http://bioinformatics.biol.uoa.gr/HMMpTM),
,
(Tsaousis,Bagos,&Hamodrakas,2014).
-,
.
,,
,
.
(gene fusions),
,()
(Drew et al., 2002; Melen, Krogh, & von Heijne, 2003).
(, ),
,
E. coli(Rappetal.,2004)S. cerevisiae(Kim,Melen,&vonHeijne,2003).
,HMMTOP(Tusnady&Simon,
2001), ,
.,
-,Phobius(Kalletal.,2004).
HMM-TM(http://bioinformatics.biol.uoa.gr/HMM-TM/),
,.
, , -
,.,
,
,.,
, TMHMM,
(PRO-TMHMM, PRODIV-TMHMM, S-TMHMM).
, Phobius PolyPhobius
(http://phobius.sbc.su.se/poly.html). ,
. ,
TOPCONS (http://topcons.net), PolyPhobius, OCTOPUS,
SPOCTOPUSSCAMPI(,),Philius
,
.,TOPCONS,.
- , - ( )
,
.-
, , .
,-
,-
(-hairpin).,-
,n8n26S, 8S24
(Schulz,2003).,
, 30-60. ,
,,-
--
,19-.
7.20: - .
- -.
,
- ( 7.20).
(),
-
.
. ()
, OmpA (Morona, Kramer, & Henning, 1985)
-,OmpX (Vogt&Schulz,
1999),.
(OmpA,OmpX),8,
.
OmpX,OmpA,
(Ringler&Schulz,2002),
(Sugawara&Nikaido,1992,1994).
NspA (Vandeputte-Rutten, Bos, Tommassen, & Gros, 2003) 8 , OpcA
(Prince, Achtman, & Derrick, 2002) 10 ,
.
7.21: - -,
. -, TolC, MspA.
-,
- .
-
, -,
, 6 22
.,79
, - (
)-.
,
,
,(..).
- , (Zhai&Saier,2002),
-.,
, (
), , ,
. , -,
,
.
, ,
. ,
-,-,
( ) .
,
,
(Schulz,2002,2003).
:
(1) - ,
- .
,
.
(2)
,
.
(3),
( ). ,
,100,
.
(4) , ,
()
().,
, 12
30.
-.
(5)-
6 22 . ,
, ,
,
.
(6)-
, - .
,.
,
,
.
(7)-,
.
-, ,
, .
,-
. ,
,
.
-,Diederichs(Diederichs,
Freigang,Umhau,Zeth,&Breed,1998).B2TMPRED
(Jacoboni, Martelli, Fariselli, De Pinto, & Casadio, 2001)
http://gpcr.biocomp.unibo.it/cgi/predictors/outer/pred_outercgi.cgi.
TBBpred (http://www.imtech.res.in/raghava/tbbpred/) TMBETA-NET
(http://psfs.cbrc.jp/tmbeta-net/) ,
TMBETAPRED-RBF (http://rbf.bioinfo.tw/~sachen/BARRELpredict/TMBETAPRED-RBF.php)
TMBpro (http://tmbpro.ics.uci.edu/)
.
HiddenMarkovModel(HMM)
2000, (Bagos, Liakopoulos,
Spyropoulos, & Hamodrakas, 2004a, 2004b; Bigelow, Petrey, Liu, Przybylski, & Rost, 2004; Hayat &
Elofsson,2012;Liu,Zhu,Wang,&Li,2003;Martelli,Fariselli,Krogh,&Casadio,2002;Savojardo,Fariselli,
& Casadio, 2013; Singh, Goodman, Walter, Helms, & Hayat, 2011). HMMB2TMR,
(http://gpcr.biocomp.unibo.it/predictors/),
, BetAware (http://www.biocomp.unibo.it/~savojard/betawarecl). PRED-TMBB
(http://bioinformatics.biol.uoa.gr/PRED-TMBB/),
, ,
,
.
PROFtmb (https://www.predictprotein.org/)
, TMBHMM TMBhunt.
,BOCTOPUS(http://boctopus.cbr.su.se/),
SupportVectorMachinesHMMs.
-,
,
- ,
.,
. , 2005,
-,
ConBBPRED
(http://bioinformatics.biol.uoa.gr/ConBBPRED/). To ConBBPRED
. ,
.
, . ,
PRED-TMBB
.PRED-TMBBBOCTOPUS,
PROFtmb, BetAware HMM-B2TMR.
PRED-TMBB,
.,PRED-TMBB2
(www.compgen.org/tools/PRED-TMBB2), ,
.
,PRED-TMBB,BetAware
TMBETA-NET,
-.,
,-
. , -
(~2% )
>90% . ,
(),
.
Michael Gromiha TMBETADISC-RBF (http://rbf.bioinfo.tw/~sachen/OMPpredict/TMBETADISC-RBF.php),
(94%),(85%).BOMP(http://services.cbu.uib.no/tools/bomp)
7.6.3.
,
( ). ,
,
.
(, ),
,
( ,
).,
(n-region),(h-region)(c-region)
,
( --),
, (von Heijne, 1990).
,,(Driessen
& Nouwen, 2007), (Rapoport, Matlack, Plath, Misselwitz, &
Staeck, 1999), (Pohlschroder, Gimenez, & Jarrell, 2005).
, ,
(Tuteja, 2005; van Roosmalen et al., 2004).
,
( ), ,
,,
(Habib, Neupert, &
Rapaport, 2007; G. von Heijne, Steppuhn, & Herrmann, 1989). ,
, .
,,
(PTS1), ,
(PTS2).
,,
(Sec),
(Twin-Argininetranslocase-Tat).Tat
, (RR)
n-region (Berks, Palmer, & Sargent, 2005; Lee, Tullman-Ercek, & Georgiou, 2006; Teter & Klionsky,
1999). Sec Tat,
,
,
(Teter & Klionsky, 1999). (
Sec,Tat),(Spase
I), .
,,,
. ,
(Spase II or Lsp), .
,
, c-region (lipobox),
C,
.
[LVI]-[AST]-[GA]-C,
.,
Tat. , ,
.
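Searching a sequence for a pattern such as the lipobox [LVI]-[AST]-[GA]-C can be sketched with regular expressions. This is a minimal illustration: the function names and the test sequences are hypothetical, and the converter handles only the simple elements used here (single residues, bracketed alternatives and x for any residue), not the full PROSITE syntax.

```python
import re


def pattern_to_regex(pattern):
    """Translate a simple PROSITE-style pattern into a regular expression."""
    parts = []
    for element in pattern.split("-"):
        parts.append("." if element.lower() == "x" else element)
    return "".join(parts)


def find_motif(sequence, pattern):
    """Return (0-based position, matched text) for every occurrence, overlaps included."""
    regex = re.compile("(?=(" + pattern_to_regex(pattern) + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(sequence)]


lipobox_hits = find_motif("MKKLLVAGCSS", "[LVI]-[AST]-[GA]-C")
glyco_hits = find_motif("ANGTNAS", "N-x-[ST]")
```

A match to such a short pattern is only a candidate site. As the text notes, patterns of this kind produce false positives, which is why dedicated predictors combine them with additional information, such as the presence of a signal peptide upstream.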
,
, , 1980.
weight matrices Gunnar von Heijne (von Heijne, 1986),
, SigCleave,
(http://emboss.sourceforge.net/apps/release/6.6/emboss/apps/sigcleave.html).
weightmatrices,
, PrediSi (http://www.predisi.de/). ,
,(,
Gram , Gram ). ,
.
,SignalP(http://www.cbs.dtu.dk/services/SignalP),
4.1,
(),
,,
(Bendtsen, Nielsen, von Heijne, & Brunak, 2004).
,
. Phobius,
http://phobius.sbc.su.se/ (Kall et al., 2004; Kall, Krogh, & Sonnhammer, 2007) Philius (Reynolds,
Kall, Riffle, Bilmes, & Noble, 2008),
http://noble.gs.washington.edu/proj/philius/,(Bayesian
network,
),
SPOCTOPUS
(http://octopus.cbr.su.se/index.php?about=SPOCTOPUS).
,,
,
. ,
(
). ,
, ,
,Uniprot(Bagos,Tsirigos,Plessas,
Liakopoulos, & Hamodrakas, 2009). ,
Gram ,
, PREDSIGNAL(http://www.compgen.org/tools/PRED-SIGNAL).
PROSITE,5(..PS00013).,
. LipoP
(http://www.cbs.dtu.dk/services/LipoP), HMM
Gram(Junckeretal.,2003).LipoP
,
97%
Gram , (,
), 0.3%. ,
Gram , 90-92%. ,
,
Gram , PRED-LIPO
(http://www.compgen.org/tools/PRED-LIPO),
,
,(Bagos,Tsirigos,Liakopoulos,&Hamodrakas,2008).
,,
(GPI-anchor).
, ,
.
, PredGPI (http://gpcr.biocomp.unibo.it/predgpi/pred.htm), big-PI
(http://mendel.imp.ac.at/gpi/gpi_server.html),
FragAnchor
(http://navet.ics.hawaii.edu/~fraganchor/NNHMM/NNHMM.html) GPI-SOM (http://gpi.unibe.ch/).
,
.
LPXTG(
,),Gram(
Gram
). , , CW-PRED
(http://bioinformatics.biol.uoa.gr/CW-PRED/),,
().
, (
) Tat (
),
. TATFIND
(http://signalfind.org/tatfind.html),
(Rose, Bruser, Kissinger, & Pohlschroder, 2002). TatP
(http://www.cbs.dtu.dk/services/TatP/),
RR(Bendtsen,Nielsen,Widdick,Palmer,&Brunak,2005).TatP
,SignalP,TATFIND
RR, .
, PRED-TAT (http://www.compgen.org/tools/PRED-TAT/),
HMMs,(SecTat),
, . ,
,Tat,(Sec)
,SignalP(Bagosetal.,
2010).
,.,
ChloroP
(http://www.cbs.dtu.dk/services/ChloroP),
TargetP
(http://www.cbs.dtu.dk/services/TargetP),
, .
iPSORT (http://ipsort.hgc.jp/how.html).
MitoProt
(https://ihg.gsf.de/ihg/mitoprot.html),Predotar(http://urgi.versailles.inra.fr/predotar/predotar.html),
Tppred2 (http://tppred2.biocomp.unibo.it). , PTS1
predictor (http://mendel.imp.ac.at/mendeljsp/sat/pts1/PTS1predictor.jsp),
cNLS Mapper (http://nls-mapper.iab.keio.ac.jp/cgi-bin/NLS_Mapper_form.cgi), NLStradamus (http://www.moseslab.csb.utoronto.ca/NLStradamus/),
NucPred
(http://www.sbc.su.se/~maccallr/nucpred/)
PredictNLS
(https://rostlab.org/owiki/index.php/PredictNLS).
,,
. ,
, (..
), (.. ,
Golgi ...). ,
.,,
,
(,,...).
,WoLF PSORT(http://wolfpsort.org/),
( PSORT PSORT II).
, PSORTb (http://www.psort.org/psortb/index.html).
LOCtree
(http://cubic.bioc.columbia.edu/cgi-bin/var/nair/loctree/query),
ESLPred2
(http://www.imtech.res.in/raghava/eslpred2/),
LOCSVMPSI
(http://bioinformatics.ustc.edu.cn/locsvmpsi/locsvmpsi.php), CELLO (http://cello.life.nctu.edu.tw/),
BaCELLO (http://gpcr.biocomp.unibo.it/bacello/), Protein Prowler (http://pprowler.imb.uq.edu.au/),
Hum-Ploc2 (http://www.csbio.sjtu.edu.cn/bioinf/hum-multi-2/), AAIndexLoc (http://aaindexloc.bii.a
SOSUI-GramN
(http://bp.nuap.nagoya-u.ac.jp/sosui/sosuigramn/sosuigramn_submit.html) Gram , PSLPred
(http://www.imtech.res.in/raghava/pslpred/), Augur (http://bioinfo.mikrobio.med.uni-giessen.de/augur),
SubLoc(http://www.bioinfo.tsinghua.edu.cn/SubLoc/).
,
-,
,.,
SecretomeP (http://www.cbs.dtu.dk/services/SecretomeP),
NclassG+(http://www.biolisi.unal.edu.co/web-servers/nclassgpositive/).
7.6.4.
,
,
. ,
,
, (coiled coil).
, COILS (http://www.ch.embnet.org/software/COILS_form.html),
PAIRCOIL (http://paircoil2.csail.mit.edu/),
,MULTICOIL(http://multicoil2.csail.mit.edu/cgi-bin/multicoil2.cgi),
, CCHMM (http://gpcr.biocomp.unibo.it/cgi/predictors/cc/pred_cchmm.cgi)
MARCOIL (http://bcf.isb-sib.ch/webmarcoil/webmarcoilINFOC1.html),
.
,.
,
,
.
DisEMBL (http://dis.embl.de/), PrDOS (http://prdos.hgc.jp/cgi-bin/top.cgi), DISpro
(http://www.ics.uci.edu/~baldig/dispro.html),DISOPRED(http://bioinf.cs.ucl.ac.uk/psipred/?disopred=1),
MeDor (http://www.vazymolo.org/MeDor/index.html),
MetaDisorder(http://genesilico.pl/metadisorder/)DisProt(http://www.disprot.org/pondr-fit.php).
,
,
. ,
, . DIpro
(http://download.igb.uci.edu/bridge.html),EDBCP(http://biomedical.ctust.edu.tw/edbcp/),CYSPRED
(http://gpcr.biocomp.unibo.it/cgi/predictors/cyspred/pred_cyspredcgi.cgi),
DiANNA
(http://clavius.bc.edu/~clotelab/DiANNA/),Dinosolve(http://hpcr.cs.odu.edu/dinosolve/),DISULFIND
(http://disulfind.dsi.unifi.it/)CysCON(http://www.csbio.sjtu.edu.cn/bioinf/Cyscon/).
, . -
,
. , -
(
).,
,-,
. , ,
(, )
. , , , , ,
...
Golgi. - ( ), ( ) C- (
). - NetNGlyc
(http://www.cbs.dtu.dk/services/NetNGlyc/), -
NetOGlyc (http://www.cbs.dtu.dk/services/NetOGlyc/) C- NetCGlyc
(http://www.cbs.dtu.dk/services/NetCGlyc/),YinOYang(http://www.cbs.dtu.dk/services/YinOYang/)
. GlycoEP
(http://www.imtech.res.in/raghava/glycoep/submit.html),
, GPP (http://comp.chem.nottingham.ac.uk/glyco/).
,Oglyc(http://www.biosino.org/Oglyc/),ISOGlyP(http://isoglyp.utep.edu/)
CKSSAP_OGlySite(http://bioinformatics.cau.edu.cn/zzd_lab/CKSAAP_OGlySite/).
7.24: .
.
,
,,.
.
NetPhos(http://www.cbs.dtu.dk/services/NetPhos/)
,NetPhosK(http://www.cbs.dtu.dk/services/NetPhosK/)
.GPS (http://gps.biocuckoo.org/)
( ). KinasePhos2 (http://kinasephos2.mbc.nctu.edu.tw/)
HMM. PhosphoSVM (http://sysbio.unl.edu/PhosphoSVM/),
DISPHOS (http://www.dabi.temple.edu/disphos/), pkaPS (http://mendel.imp.ac.at/sat/pkaPS/)
Predikin (http://predikin.biosci.uq.edu.au/).
(
, ), ,
.
MetaPredPS
(http://c1.accurascience.com/MetaPred/MetaPredPS_091201/). ,
, ,
NetPhosYeast (http://www.cbs.dtu.dk/services/NetPhosYeast/) NetPhosBac
(http://www.cbs.dtu.dk/services/NetPhosBac-1.0/).
-
.
Myristoylator
(http://web.expasy.org/myristoylator/)
(http://mendel.imp.ac.at/myristate/SUPLpredictor.htm) ,
, NetAcet (http://www.cbs.dtu.dk/services/NetAcet/)
TermiNator (http://www.isv.cnrs-gif.fr/terminator3/index.html),.
, ,
(),.,CSSPalm(http://csspalm.biocuckoo.org/),GPSTSP (http://tsp.biocuckoo.org/) Sulfinator (http://web.expasy.org/sulfinator/)
, PAIL (http://bdmpail.biocuckoo.org/) KAT
(http://bioinfo.bjmu.edu.cn/huac/) . ,
.
,
.(Ubiquitin)
,
, SUMO (Small
Ubiquitin-like Modifier). UbPred
(http://www.ubpred.org/), BDM-PUB (http://bdmpub.biocuckoo.org/), CKSAAP_UbSite
(http://protein.cau.edu.cn/cksaap_ubsite/), iUbiq-Lys (http://www.jci-bioinfo.cn/iUbiq-Lys)
UbiProber (http://bioinfo.ncu.edu.cn/UbiProber.aspx). , SUMO
SUMOplot (http://www.abgent.com/sumoplot) and GPS-SUMO (http://sumosp.biocuckoo.org/).
7.25: PRED-CLASS
,
.
.,PRED-CLASS
(http://athina.biol.uoa.gr/PRED-CLASS/) , ,
( ,
..)(Pasquieretal.,2001).3,
. ,
()30.
(
),,.
,
.,,
. ,
,
. ,
(20) (10 ,
,).,
( 30 ),
Fourier(FFT),
.
, , ( ).
(>95% )
30%
. ,
, ..
,
.
7.26: PRED-COUPLE.
,
pHMM. ,
.
,
GPCR G-.
G- (G protein-coupled receptors-GPCRs),
.
-,
.,GPCRs,,
G-, . PRED-COUPLE
(http://athina.biol.uoa.gr/bioinformatics/PRED-COUPLE/),
profileHiddenMarkovModels,
GPCR G- (Sgourakis, Bagos, Papasaikas, & Hamodrakas,
2005).,
,G.,GPCRs(Gi/o,GsGq/11)
cross-validation(
5),89.7%.30
,
25(83.3%).
PRED-COUPLE2
(http://athina.biol.uoa.gr/bioinformatics/PRED-COUPLE2/),
, , ,
G-(Sgourakis,Bagos,&Hamodrakas,2005).
, pHMM
. ,
(~95%) --,
.
.,
.
7.27: PRED-COUPLE2.
pHMM PRED-COUPLE.
7.7.
DNA/RNA
,,.
DNA RNA. DNA
(genefinding),
7.28:
:
(exon/intronsplicesite),(promoterrecognition),
(translationinitiationsiteprediction)(Saeys,Abeel,Degroeve,&Van
de Peer, 2007), mRNA (polyadenylation prediction)
(Changetal.,2011),,,.,
.
,DNA,
(Saeysetal.,2007).,
,(
), . ,
. ,
1980, 1990
.
,:,weightmatrices,,
Hidden Markov Models.
ab initio gene finders,
homology-based gene finders.
,:
FrameD(http://tata.toulouse.inra.fr/apps/FrameD/FD)
GeneMark(http://exon.gatech.edu/GeneMark/gmchoice.html)
Glimmer(http://www.ncbi.nlm.nih.gov/genomes/MICROBES/glimmer_3.cgi)
EasyGene(http://www.cbs.dtu.dk/services/EasyGene/)
FGENESB
(http://linux1.softberry.com/berry.phtml?topic=fgenesb&group=help&subgroup=gfindb)
Prodigal (http://prodigal.ornl.gov/)
,,:
FGENESH
(http://linux1.softberry.com/berry.phtml?topic=fgenesh&group=programs&subgroup=gfind)
GlimmerHMM(https://ccb.jhu.edu/software/glimmerhmm/)
HMMgene (http://www.cbs.dtu.dk/services/HMMgene/)
GeneMark.hmm(http://exon.gatech.edu/GeneMark/hmmchoice.html)
GeneID(http://genome.crg.es/software/geneid/geneid.html)
GENSCAN(http://genes.mit.edu/GENSCAN.html)
mGene(http://raetschlab.org/suppl/mgene)
Grail(http://compbio.ornl.gov/grailexp/)
(translationinitiation):
ATGpr(http://atgpr.dbcls.jp/)
NetStart(http://www.cbs.dtu.dk/services/NetStart/)
TIS Miner(http://dnafsminer.bic.nus.edu.sg/Tis.html)
StartScan(http://bioinformatics.psb.ugent.be/webtools/startscan/)
mRNA:
Poly(A) Signal Miner(http://dnafsminer.bic.nus.edu.sg/)
PolyAPred(http://www.imtech.res.in/raghava/polyapred/help.html)
POLYAH
(http://www.softberry.com/berry.phtml?topic=polyah&group=programs&subgroup=promoter
)
PolyApredict(http://cub.comsats.edu.pk/polyapredict.htm)
,/
,:
Human Splice Finder(http://www.umd.be/HSF3/)
NetGene(http://www.cbs.dtu.dk/services/NetGene2/)
NetPlant(http://www.cbs.dtu.dk/services/NetPGene/)
GeneSplicer(https://ccb.jhu.edu/software/genesplicer/)
SpliceView(http://bioinfo4.itb.cnr.it/~webgene/wwwspliceview_ex.html)
SplicePredictor(http://bioservices.usd.edu/splicepredictor/)
,DNA
. ,
Methylator (http://bio.dfci.harvard.edu/Methylator/)
epigram (http://wanglab.ucsd.edu/star/epigram/),
NuPoP
(http://nucleosome.stats.northwestern.edu/)
Segal
(http://genie.weizmann.ac.il/software/nucleo_prediction.html),
DNA uMELT (https://www.dna.utah.edu/umelt/umelt.html),
DNADNAshape(http://rohslab.cmb.usc.edu/DNAshape/)
DNAtools (http://hydra.icgeb.trieste.it/dna/),
(SNPs),
SNAP
(https://rostlab.org/services/snap/), FuncPred (http://snpinfo.niehs.nih.gov/snpinfo/snpfunc.htm)
PredictSNP(http://loschmidt.chemi.muni.cz/predictsnp/).
RNA,
, .
,
RNA,.,RNA
.
,
RNA
10.
RNA,
. microRNA
(miRNA) - RNA ( 21-22 ,
, pri-miRNA)
mRNA-(Cai,Yu,Hu,&
Yu, 2009).
mRNA. , mRNA ,
,
. miRNAs small interfering RNA (siRNA)
miRNAsRNA
siRNAs RNA.
1000 miRNA
60%,.
7.29: miRNA.
.
miRNA,
.
.
miRNA mRNA ,
mRNA.,miRNA
mRNA 68
5' miRNA.
mRNA.,
miRNA,,mRNA,mRNA
miRNA.
miRNA:
,.
, , , ,
,
RNA. miRNA
:
CID miRNA(http://melb.agrf.org.au:8888/cidmirna/)
MiRPara(https://code.google.com/p/mirpara/)
HeteroMirPred(http://ncrna-pred.com/premiRNA.html)
HHMMiR(http://biodev.hgen.pitt.edu/kadriAPBC2009.html)
HuntMi(http://adaa.polsl.pl/agudys/huntmi/huntmi.htm)
MaturePred(http://nclab.hit.edu.cn/maturepred/)
microPred(http://www.cs.ox.ac.uk/people/manohara.rukshan.batuwita/microPred.htm)
MiPred(http://www.bioinf.seu.edu.cn/miRNA/)
miRAbela(http://www.mirz.unibas.ch/cgi/pred_miRNA_genes.cgi)
MiRAlign(http://bioinfo.au.tsinghua.edu.cn/miralign/)
miRBoost(http://evryrna.ibisc.univ-evry.fr/miRBoost/index.html)
mirnaDetect(http://datamining.xmu.edu.cn/main/~leyiwei/mirnaDetect.html)
miRNAFold(http://evryrna.ibisc.univ-evry.fr/miRNAFold/)
MiRscan(http://genes.mit.edu/mirscan/)
novoMIR(http://www.biophys.uni-duesseldorf.de/novomir/)
ProMiR(http://bi.snu.ac.kr/Research/ProMiR/ProMiR.html)
RNAmicro(http://www.tbi.univie.ac.at/~jana/software/RNAmicro.html)
tripletSVM(http://bioinfo.au.tsinghua.edu.cn/mirnasvm/)
SplamiR (http://www.uni-jena.de/SplamiR.html)
SSCprofiler (http://mirna.imbb.forth.gr/SSCprofiler.html)
EumiR (http://miracle.igib.res.in/eumir/)
,miRNA,:
Diana Micro-T(http://diana.cslab.ece.ntua.gr/microT/)
PicTar(http://pictar.mdc-berlin.de/)
TargetScan(http://www.targetscan.org/)
miRTar(http://mirtar.mbc.nctu.edu.tw/human/)
miRanda (http://www.microrna.org/microrna/home.do)
MaMi(http://mami.med.harvard.edu/)
ComiR(http://www.benoslab.pitt.edu/comir/)
PITA(http://genie.weizmann.ac.il/pubs/mir07/mir07_prediction.html)
MirMap(http://mirmap.ezlab.org/)
STarMir (http://sfold.wadsworth.org/starmir.html)
Chou,P.Y.,&Fasman,G.D.(1978).Predictionofthesecondarystructureofproteinsfromtheiraminoacid
sequence.Adv Enzymol Relat Areas Mol Biol, 47,45-148.
Claros,M.G.,&vonHeijne,G.(1994).TopPredII:animprovedsoftwareformembraneproteinstructure
predictions.Comput Appl Biosci, 10(6),685-686.
Cuff,J.A.,&Barton,G.J.(2000).Applicationofmultiplesequencealignmentprofilestoimproveprotein
secondarystructureprediction.Proteins, 40(3),502-511.
Cuff,J.A.,Clamp,M.E.,Siddiqui,A.S.,Finlay,M.,&Barton,G.J.(1998).JPred:aconsensussecondary
structurepredictionserver.Bioinformatics, 14(10),892-893.doi:btb130[pii]
Diederichs,K.,Freigang,J.,Umhau,S.,Zeth,K.,&Breed,J.(1998).Predictionbyaneuralnetworkofouter
membranebeta-strandproteintopology.Protein Sci, 7(11),2413-2420.
Drew,D.,Sjostrand,D.,Nilsson,J.,Urbig,T.,Chin,C.N.,deGier,J.W.,&vonHeijne,G.(2002).Rapid
topologymappingofEscherichiacoliinner-membraneproteinsbypredictionandPhoA/GFPfusion
analysis.Proc Natl Acad Sci U S A, 99(5),2690-2695.
Driessen,A.J.,&Nouwen,N.(2007).ProteinTranslocationAcrosstheBacterialCytoplasmicMembrane.
Annu Rev Biochem.doi:10.1146/annurev.biochem.77.061606.160747
Eisenberg,D.,Weiss,R.M.,&Terwilliger,T.C.(1984).Thehydrophobicmomentdetectsperiodicityin
proteinhydrophobicity.Proc Natl Acad Sci U S A, 81(1),140-144.
Garnier,J.,Osguthorpe,D.J.,&Robson,B.(1978).Analysisoftheaccuracyandimplicationsofsimple
methodsforpredictingthesecondarystructureofglobularproteins.J Mol Biol, 120(1),97-120.
Habib,S.J.,Neupert,W.,&Rapaport,D.(2007).Analysisandpredictionofmitochondrialtargetingsignals.
Methods Cell Biol, 80,761-781.doi:S0091-679X(06)80035-X[pii]10.1016/S0091-679X(06)80035X
Hamodrakas,S.J.(1988).AproteinsecondarystructurepredictionschemefortheIBMPCandcompatibles.
Comput Appl Biosci, 4(4),473-477.
Hayat,S.,&Elofsson,A.(2012).BOCTOPUS:improvedtopologypredictionoftransmembranebetabarrel
proteins.Bioinformatics, 28(4),516-522.doi:10.1093/bioinformatics/btr710
Houben,E.,deGier,J.W.,&vanWijk,K.J.(1999).Insertionofleaderpeptidaseintothethylakoid
membraneduringsynthesisinachloroplasttranslationsystem.Plant Cell, 11(8),1553-1564.
Jacoboni,I.,Martelli,P.L.,Fariselli,P.,DePinto,V.,&Casadio,R.(2001).Predictionofthetransmembrane
regionsofbeta-barrelmembraneproteinswithaneuralnetwork-basedpredictor.Protein Sci, 10(4),
779-787.
Jones,D.T.(1999).Proteinsecondarystructurepredictionbasedonposition-specificscoringmatrices.J Mol
Biol, 292(2),195-202.
Juncker,A.S.,Willenbrock,H.,VonHeijne,G.,Brunak,S.,Nielsen,H.,&Krogh,A.(2003).Predictionof
lipoproteinsignalpeptidesinGram-negativebacteria.Protein Sci, 12(8),1652-1662.
Kall,L.,Krogh,A.,&Sonnhammer,E.L.(2004).Acombinedtransmembranetopologyandsignalpeptide
predictionmethod.J Mol Biol, 338(5),1027-1036.doi:10.1016/j.jmb.2004.03.016
S0022283604002943[pii]
Kall,L.,Krogh,A.,&Sonnhammer,E.L.(2007).Advantagesofcombinedtransmembranetopologyand
signalpeptideprediction--thePhobiuswebserver.Nucleic Acids Res, 35(WebServerissue),W429432.doi:gkm256[pii]10.1093/nar/gkm256
Kim,H.,Melen,K.,&vonHeijne,G.(2003).Topologymodelsfor37Saccharomycescerevisiaemembrane
proteinsbasedonC-terminalreporterfusionsandpredictions.J Biol Chem, 278(12),10208-10213.
Koh,I.Y.,Eyrich,V.A.,Marti-Renom,M.A.,Przybylski,D.,Madhusudhan,M.S.,Eswar,N.,...Sali,A.
(2003).EVA:evaluationofproteinstructurepredictionservers.Nucleic Acids Research, 31(13),
3311-3315.
Kristiansen,K.(2004).Molecularmechanismsofligandbinding,signaling,andregulationwithinthe
superfamilyofG-protein-coupledreceptors:molecularmodelingandmutagenesisapproachesto
receptorstructureandfunction.Pharmacol Ther, 103(1),21-80.
Krogh,A.,Larsson,B.,vonHeijne,G.,&Sonnhammer,E.L.(2001).Predictingtransmembraneprotein
topologywithahiddenMarkovmodel:applicationtocompletegenomes.J Mol Biol, 305(3),567580.
Kyogoku,Y.,Fujiyoshi,Y.,Shimada,I.,Nakamura,H.,Tsukihara,T.,Akutsu,H.,...Nomura,N.(2003).
Structuralgenomicsofmembraneproteins.Acc Chem Res, 36(3),199-206.
Kyte,J.,&Doolittle,R.F.(1982).Asimplemethodfordisplayingthehydropathiccharacterofaprotein.J
Mol Biol, 157(1),105-132.
Lee,P.A.,Tullman-Ercek,D.,&Georgiou,G.(2006).Thebacterialtwin-argininetranslocationpathway.
Annu Rev Microbiol, 60,373-395.doi:10.1146/annurev.micro.60.080805.142212
Liakopoulos,T.D.,Pasquier,C.,&Hamodrakas,S.J.(2001).Anoveltoolforthepredictionof
transmembraneproteintopologybasedonastatisticalanalysisoftheSwissProtdatabase:the
OrienTMalgorithm.Protein Eng, 14(6),387-390.
Liu,Q.,Zhu,Y.S.,Wang,B.H.,&Li,Y.X.(2003).AHMM-basedmethodtopredictthetransmembrane
regionsofbeta-barrelmembraneproteins.Comput Biol Chem, 27(1),69-76.
Loll,P.J.(2003).Membraneproteinstructuralbiology:thehighthroughputchallenge.J Struct Biol, 142(1),
144-153.
Marsh,D.,Horvath,L.I.,Swamy,M.J.,Mantripragada,S.,&Kleinschmidt,J.H.(2002).Interactionof
membrane-spanningproteinswithperipheralandlipid-anchoredmembraneproteins:perspectives
fromprotein-lipidinteractions(Review).Mol Membr Biol, 19(4),247-255.
Martelli,P.L.,Fariselli,P.,Krogh,A.,&Casadio,R.(2002).Asequence-profile-basedHMMforpredicting
anddiscriminatingbetabarrelmembraneproteins.Bioinformatics, 18 Suppl 1,S46-53.
Math,C.,Sagot,M.F.,Schiex,T.,&Rouze,P.(2002).Currentmethodsofgeneprediction,theirstrengths
andweaknesses.Nucleic Acids Research, 30(19),4103-4117.
Melen,K.,Krogh,A.,&vonHeijne,G.(2003).Reliabilitymeasuresformembraneproteintopology
predictionalgorithms.J Mol Biol, 327(3),735-744.doi:S0022283603001827[pii]
Morona,R.,Kramer,C.,&Henning,U.(1985).Bacteriophagereceptorareaofoutermembraneprotein
OmpAofEscherichiacoliK-12.J Bacteriol, 164(2),539-543.
Pasquier,C.,&Hamodrakas,S.J.(1999).Anhierarchicalartificialneuralnetworksystemforthe
classificationoftransmembraneproteins.Protein Eng, 12(8),631-634.
Pasquier,C.,Promponas,V.J.,&Hamodrakas,S.J.(2001).PRED-CLASS:cascadingneuralnetworksfor
generalizedproteinclassificationandgenome-wideapplications.Proteins, 44(3),361-369.
Pasquier,C.,Promponas,V.J.,Palaios,G.A.,Hamodrakas,J.S.,&Hamodrakas,S.J.(1999).Anovel
methodforpredictingtransmembranesegmentsinproteinsbasedonastatisticalanalysisofthe
SwissProtdatabase:thePRED-TMRalgorithm.Protein Eng, 12(5),381-385.
Pohlschroder,M.,Gimenez,M.I.,&Jarrell,K.F.(2005).ProteintransportinArchaea:Secandtwinarginine
translocationpathways.Curr Opin Microbiol, 8(6),713-719.doi:S1369-5274(05)00162-1[pii]
10.1016/j.mib.2005.10.006
Prince,S.M.,Achtman,M.,&Derrick,J.P.(2002).CrystalstructureoftheOpcAintegralmembraneadhesin
fromNeisseriameningitidis.Proc Natl Acad Sci U S A, 99(6),3417-3421.
Promponas,V.J.,Palaios,G.A.,Pasquier,C.M.,Hamodrakas,J.S.,&Hamodrakas,S.J.(1999).CoPreTHi:
aWebtoolwhichcombinestransmembraneproteinsegmentpredictionmethods.In Silico Biol, 1(3),
159-162.doi:1998010014[pii]
Przybylski,D.,&Rost,B.(2007).ConsensussequencesimprovePSI-BLASTthroughmimickingprofile
profilealignments.Nucleic Acids Research, 35(7),2238-2246.
Qian,N.,&Sejnowski,T.J.(1988).Predictingthesecondarystructureofglobularproteinsusingneural
networkmodels.J Mol Biol, 202(4),865-884.
Rapoport,T.A.,Matlack,K.E.,Plath,K.,Misselwitz,B.,&Staeck,O.(1999).Posttranslationalprotein
translocationacrossthemembraneoftheendoplasmicreticulum.Biol Chem, 380(10),1143-1150.
Rapp,M.,Drew,D.,Daley,D.O.,Nilsson,J.,Carvalho,T.,Melen,K.,VonHeijne,G.(2004).
ExperimentallybasedtopologymodelsforE.coliinnermembraneproteins.Protein Sci, 13(4),937945.
Reinhardt,A.,&Hubbard,T.(1998).Usingneuralnetworksforpredictionofthesubcellularlocationof
proteins.Nucleic Acids Res, 26(9),2230-2236.
Reynolds,S.M.,Kall,L.,Riffle,M.E.,Bilmes,J.A.,&Noble,W.S.(2008).Transmembranetopologyand
signalpeptidepredictionusingdynamicbayesiannetworks.PLoS Comput Biol, 4(11),e1000213.doi:
10.1371/journal.pcbi.1000213
Ringler,P.,&Schulz,G.E.(2002).OmpAmembranedomainasatight-bindinganchorforlipidbilayers.
Chembiochem, 3(5),463-466.
Rojo,E.E.,Guiard,B.,Neupert,W.,&Stuart,R.A.(1999).N-terminaltailexportfromthemitochondrial
matrix.Adherencetotheprokaryotic"positive-inside"ruleofmembraneproteintopology.J Biol
Chem, 274(28),19617-19622.
Rose,R.W.,Bruser,T.,Kissinger,J.C.,&Pohlschroder,M.(2002).Adaptationofproteinsecretionto
extremelyhigh-saltconditionsbyextensiveuseofthetwin-argininetranslocationpathway.Mol
Microbiol, 45(4),943-950.doi:3090[pii]
Rost,B.,Casadio,R.,Fariselli,P.,&Sander,C.(1995).Transmembranehelicespredictedat95%accuracy.
Protein Sci, 4(3),521-533.
Rost,B.,&Sander,C.(1993).Predictionofproteinsecondarystructureatbetterthan70%accuracy.J Mol
Biol, 232(2),584-599.
Rumelhart,D.E.,Hinton,G.E.,&Williams,R.J.(1988).Learningrepresentationsbyback-propagating
errors.Cognitive modeling, 5,3.
Saeys,Y.,Abeel,T.,Degroeve,S.,&VandePeer,Y.(2007).Translationinitiationsitepredictionona
genomicscale:beautyinsimplicity.Bioinformatics, 23(13),i418-i423.
Savojardo,C.,Fariselli,P.,&Casadio,R.(2013).BETAWARE:amachine-learningtooltodetectandpredict
transmembranebeta-barrelproteinsinprokaryotes.Bioinformatics, 29(4),504-505.doi:
10.1093/bioinformatics/bts728
Schulz,G.E.(2002).Thestructureofbacterialoutermembraneproteins.Biochim Biophys Acta, 1565(2),
308-317.
Schulz,G.E.(2003).Transmembranebeta-barrelproteins.Adv Protein Chem, 63,47-70.
Sgourakis,N.G.,Bagos,P.G.,&Hamodrakas,S.J.(2005).PredictionofthecouplingspecificityofGPCRs
tofourfamiliesofG-proteinsusinghiddenMarkovmodelsandartificialneuralnetworks.
Bioinformatics, 21(22),4101-4106.doi:bti679[pii]10.1093/bioinformatics/bti679
Sgourakis,N.G.,Bagos,P.G.,Papasaikas,P.K.,&Hamodrakas,S.J.(2005).Amethodforthepredictionof
GPCRscouplingspecificitytoG-proteinsusingrefinedprofileHiddenMarkovModels.BMC
Bioinformatics, 6,104.doi:1471-2105-6-104[pii]10.1186/1471-2105-6-104
Singer,S.J.,&Nicolson,G.L.(1972).Thefluidmosaicmodelofthestructureofcellmembranes.Science,
175(23),720-731.
Singh,N.K.,Goodman,A.,Walter,P.,Helms,V.,&Hayat,S.(2011).TMBHMM:afrequencyprofilebased
HMMforpredictingthetopologyoftransmembranebetabarrelproteinsandtheexposurestatusof
transmembraneresidues.Biochim Biophys Acta, 1814(5),664-670.doi:10.1016/j.bbapap.2011.03.004
Sonnhammer,E.L.,vonHeijne,G.,&Krogh,A.(1998).AhiddenMarkovmodelforpredicting
transmembranehelicesinproteinsequences.Proc Int Conf Intell Syst Mol Biol, 6,175-182.
Sugawara,E.,&Nikaido,H.(1992).Pore-formingactivityofOmpAproteinofEscherichiacoli.J Biol Chem,
267(4),2507-2511.
Sugawara,E.,&Nikaido,H.(1994).OmpAproteinofEscherichiacolioutermembraneoccursinopenand
closedchannelforms.J Biol Chem, 269(27),17981-17987.
Teter,S.A.,&Klionsky,D.J.(1999).Howtogetafoldedproteinacrossamembrane.Trends Cell Biol,
9(11),428-431.doi:S0962-8924(99)01652-9[pii]
Tsaousis,G.N.,Bagos,P.G.,&Hamodrakas,S.J.(2014).HMMpTM:improvingtransmembraneprotein
topologypredictionusingphosphorylationandglycosylationsiteprediction.Biochim Biophys Acta,
1844(2),316-322.doi:10.1016/j.bbapap.2013.11.001S1570-9639(13)00376-2[pii]
Tusnady,G.E.,Dosztanyi,Z.,&Simon,I.(2004).Transmembraneproteinsinproteindatabank:
identificationandclassification.Bioinformatics.
Tusnady,G.E.,&Simon,I.(1998).Principlesgoverningaminoacidcompositionofintegralmembrane
proteins:applicationtotopologyprediction.J Mol Biol, 283(2),489-506.
Tusnady,G.E.,&Simon,I.(2001).TheHMMTOPtransmembranetopologypredictionserver.
Bioinformatics, 17(9),849-850.
Tuteja,R.(2005).TypeIsignalpeptidase:anoverview.Arch Biochem Biophys, 441(2),107-111.doi:S00039861(05)00305-X[pii]10.1016/j.abb.2005.07.013
vanRoosmalen,M.L.,Geukens,N.,Jongbloed,J.D.,Tjalsma,H.,Dubois,J.Y.,Bron,S.,...Anne,J.
(2004).TypeIsignalpeptidasesofGram-positivebacteria.Biochim Biophys Acta, 1694(1-3),279297.doi:S0167488904001235[pii]10.1016/j.bbamcr.2004.05.006
Vandeputte-Rutten,L.,Bos,M.P.,Tommassen,J.,&Gros,P.(2003).CrystalstructureofNeisserialsurface
proteinA(NspA),aconservedoutermembraneproteinwithvaccinepotential.J Biol Chem, 278(27),
24825-24830.
Vihinen,M.(2012).Howtoevaluateperformanceofpredictionmethods?Measuresandtheirinterpretationin
variationeffectanalysis.BMC Genomics, 13(Suppl4),S2.
Vogt,J.,&Schulz,G.E.(1999).ThestructureoftheoutermembraneproteinOmpXfromEscherichiacoli
revealspossiblemechanismsofvirulence.Structure Fold Des, 7(10),1301-1309.
vonHeijne,G.(1986).Anewmethodforpredictingsignalsequencecleavagesites.Nucleic Acids Res,
14(11),4683-4690.
vonHeijne,G.(1990).Thesignalpeptide.J Membr Biol, 115(3),195-201.
vonHeijne,G.(1992).Membraneproteinstructureprediction.Hydrophobicityanalysisandthepositiveinsiderule.J Mol Biol, 225(2),487-494.
vonHeijne,G.(1999).Recentadvancesintheunderstandingofmembraneproteinassemblyandfunction.
Quart Rev Biophys, 32(4),285-307.
vonHeijne,G.,Steppuhn,J.,&Herrmann,R.G.(1989).Domainstructureofmitochondrialandchloroplast
targetingpeptides.Eur J Biochem, 180(3),535-545.
Walian,P.,Cross,T.A.,&Jap,B.K.(2004).Structuralgenomicsofmembraneproteins.Genome Biol, 5(4),
215.
8:
, ,
(Hidden Markov Models)
.
,
,
. , ,
,
( , ...). ,
profile HMM ,
, .
.
3 4.
8.
-,Markov
. Markov (Markov Chain),
DNA
.Markov,,
.
2,3,...,kMarkov2,3, ...,k.
,Markov
DNA ,
. 1970
,..
.
,
o . ,
Q U, U
Q.,
AndreyMarkov(1856-1922),
Pushkin (Markov, 1913).
Markov(MarkovModel-)
Markov(HiddenMarkovModel-).
8.1.
Markov
8.1.1.
Markov 1
.,
,(DNA
20 ). L
,x,:
$$x = x_1, x_2, \ldots, x_{L-1}, x_L$$
i
,Markov,
.(),
x,
. , xi
i. ( )
xi+1,xi+2, xi+3x1,x2,
...xi-1, xi, xi. ,
.
:
$$P(x_i \mid x_{i-1}, \ldots, x_1) = P(x_i \mid x_{i-1}) \qquad (8.1)$$
Markov
(transitionprobabilities),.,
:
$$a_{st} = P(x_i = t \mid x_{i-1} = s) \qquad (8.2)$$
,ti,
(i-1)s.k
, Markov 1 .
:
$$P(x) = P(x_L \mid x_{L-1})\, P(x_{L-1} \mid x_{L-2}) \cdots P(x_2 \mid x_1)\, P(x_1) = P(x_1) \prod_{i=2}^{L} P(x_i \mid x_{i-1}) = P(x_1) \prod_{i=2}^{L} a_{x_{i-1} x_i} \qquad (8.3)$$
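The factorization in (8.3) can be evaluated directly. The following is a minimal sketch, assuming a hypothetical uniform model over the DNA alphabet; the names `markov_chain_log_prob`, `initial` and `transition` are illustrative and not taken from the text. Logarithms are used so that long sequences do not underflow.

```python
import math

def markov_chain_log_prob(seq, initial, transition):
    """Return log P(x) for a first-order Markov chain, as in (8.3)."""
    # log P(x_1)
    log_p = math.log(initial[seq[0]])
    # add log a_{x_{i-1} x_i} for i = 2..L
    for prev, cur in zip(seq, seq[1:]):
        log_p += math.log(transition[prev][cur])
    return log_p

# Hypothetical model: every symbol and every transition is equally likely.
alphabet = "ACGT"
initial = {s: 0.25 for s in alphabet}
transition = {s: {t: 0.25 for t in alphabet} for s in alphabet}

log_p = markov_chain_log_prob("ACGT", initial, transition)
```

With real data the `transition` table would be estimated from counts of adjacent symbol pairs in a training set.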
P(x1) . ,
,:
$p_{ab}(n-1, n) = P(x_n = b \mid x_{n-1} = a) = p_{ab}$, $n = 1, 2, \ldots, L$.
, , ,
.(
, ),
,,
. , ,
1 k
:
$$p_{a,b} \ge 0, \qquad a, b = 1, 2, \ldots, k$$
$$\sum_{b=1}^{k} p_{a,b} = 1, \qquad a = 1, 2, \ldots, k$$
, .
, Markov,
(B=Begin).
:
$$P(x_1 = a) = p_{Ba} \qquad (8.4)$$
()(=End)
:
$$P(E \mid x_n = b) = p_{bE} \qquad (8.5)$$
Markov 8.1
. ,
.
,.
$$P(E \mid x_n = b) = p_{bE} = q$$
(8.3)L:
$$p_L = q\,(1 - q)^{L-1} \qquad (8.6)$$
.
(Durbin,Eddy,Krogh,&Mithison,1998):
n
p P(x)
... P( x1 ) P ( xi | xi 1 ) 1
{X }
(8.7)
i 2
8.1.2.
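$$S(x) = \log \frac{P(x \mid +)}{P(x \mid -)} = \sum_{i=1}^{L} \log \frac{a^{+}_{x_{i-1} x_i}}{a^{-}_{x_{i-1} x_i}} = \sum_{i=1}^{L} \beta_{x_{i-1} x_i} \qquad (8.9)$$
$$S_{norm}(x) = \frac{S(x)}{L} = \frac{1}{L} \sum_{i=1}^{L} \beta_{x_{i-1} x_i} \qquad (8.10)$$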
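A log-odds score of this kind compares two Markov chains on the same sequence. The sketch below assumes two hypothetical transition tables, `plus` and `minus`, standing for the two competing models; the numbers are invented for illustration. As a simplification it sums over adjacent pairs only and omits the begin transition.

```python
import math

def log_odds_score(seq, plus, minus, base=2.0):
    """Sum of log ratios of transition probabilities (score in bits for base 2)."""
    score = 0.0
    for prev, cur in zip(seq, seq[1:]):
        score += math.log(plus[prev][cur] / minus[prev][cur], base)
    return score

def normalized_score(seq, plus, minus, base=2.0):
    """Length-normalized score, so sequences of different length are comparable."""
    return log_odds_score(seq, plus, minus, base) / len(seq)

alphabet = "ACGT"
# Hypothetical background model: all transitions equally likely.
minus = {s: {t: 0.25 for t in alphabet} for s in alphabet}
# Hypothetical foreground model: C is followed by G more often than expected.
plus = {s: {t: 0.25 for t in alphabet} for s in alphabet}
plus["C"] = {"A": 0.1, "C": 0.3, "G": 0.5, "T": 0.1}

score_cg = log_odds_score("CG", plus, minus)
```

A positive score favors the `plus` model and a negative score favors the `minus` model.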
, 1 ,
CG(Durbin,etal.,1998).
8.2: CG .
-, (8.8) (8.9)
.
. bits , 2.
8.1.3.
kMarkov,
(8.1).,k
:
$$P(x_i \mid x_{i-1}, \ldots, x_1) = P(x_i \mid x_{i-1}, x_{i-2}, \ldots, x_{i-k}) = a_{x_{i-k} \ldots x_{i-1} x_i} \qquad (8.11)$$
k Markov,1,
20k. , 20kx20k. ,
, 1 202=400
, 2 203=8000 ,
.
, (Ellrott, Yang,
Sladek,&Jiang,2002;Phillips,Arnold,&Ivarie,1987).,AudicClaverie
(Audic & Claverie, 1998),
. , , ,
,
,90%(Audic&Claverie,1998).
,
. , k
(Yuan,1999),:
k
P xi | xi 1 , xi 2 ,..., xi k P xi | xi j
(8.13)
j 1
(8.14)
(),
Bejerano (Bejerano, Seldin, Margalit, & Tishby, 2001; Bejerano & Yona, 2001),
PFAM (Bateman et al., 2004).
, ,
,HiddenMarkovModels
(.).
, Mixture Transition Distribution (MTD) model
(Raftery, 1985a), (8.11)
:
k
(8.15)
j 1
, (j=1,2k)
. (MTDg),
(Raftery,1985b),j,j:
k
j
s j s0
(8.16)
j 1
,:
k
0 j sjj s0 1
(8.17)
j 1
j
j s j s0 1 ,:
s k ,..., s1 , s0 , s0 Q j 1
1, j 0 j 1, 2, ..., k
(8.18)
j 1
Raftery,
Markov. ,
,
. Raftery, (Raftery, 1985a),
(NAG).Bercthold,
gradient (Berchtold,2001).,
Expectation-Maximization (Lebre & Bourguignon,
2008).
(, ...), ,
.
(gene
finding),(Borodovsky&McIninch,1993;Borodovsky&Peresetsky,1994) (Audic& Claverie,1998). ,
interpolated Markov chains ,
(Salzberg, Delcher,Kasif,&White,1998)(Ohler,
Harbeck, Niemann, Noth, & Reese, 1999; Salzberg, Pertea, Delcher, Gardner, & Tettelin, 1999),
(Barash,Elidan,Friedman,&Kaplan,2003),
(Dalevi,Dubhashi,&Hermansson,2006),
(Yuan, 1999), (Bejerano, et al., 2001),
(Eronen,Geerts,& Toivonen,2004)
(Browning,2006).
8.2.
8.2.1.
$$x = x_1, x_2, \ldots, x_{L-1}, x_L \qquad (8.19)$$
xi20(,
), .
i
,i,(path).,k, l
kl,Markov1.
,:
$$a_{kl} = P(\pi_i = l \mid \pi_{i-1} = k) \qquad (8.20)$$
,k()
l.Markov,
,(Begin)
$$a_{Bk} = P(\pi_1 = k \mid B) \qquad (8.21)$$
E(End)
$$a_{kE} = P(E \mid \pi_i = k) \qquad (8.22)$$
,
:
$$e_k(b) = P(x_i = b \mid \pi_i = k) \qquad (8.23)$$
,i,
b,k.x
,:
$$P(x, \pi) = P(x_L, x_{L-1}, \ldots, x_1, \pi) = a_{B\pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}}, \qquad \pi_{L+1} = E \qquad (8.24)$$
,(Durbin,etal.,1998)
(dishonest casino). ,
,(..0.05
) ,
.,
,.
( 8.3)
0.05 ( )
,0.1().(hidden)
.
()(),
Markov.()Markov(MM)
. (. 4),
.
8.3: .
(-), . ,
.
8.2.2. 3
3,
Rabiner(Rabiner,1989).
,
x ; ,
P(x|);
x,
, , ;
,,: max arg max P(x, ) ;
,
,;,
. ,
, ML arg max P x | .
8.4: . .
. . 1 . . 3 . . Hidden
Markov Model. -, . (
) 1 ,
.
8.2.3.
(8.24),,x,
.
x ,
, ,
.
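$$P(x \mid \theta) = \sum_{\pi} P(x, \pi \mid \theta) = \sum_{\pi} a_{B\pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}} \qquad (8.25)$$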
,,
.,..
50 300 , 50300,
. ,
3, (dynamic programming).
, ,
.
,Forward(8.4),(Durbin,etal.,1998;
Rabiner,1989).
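Forward algorithm
$$\forall k \ne B,\ i = 0: \quad f_B(0) = 1, \quad f_k(0) = 0$$
$$\forall\, 1 \le i \le L: \quad f_l(i) = e_l(x_i) \sum_{k} f_k(i-1)\, a_{kl}$$
$$P(x \mid \theta) = \sum_{k} f_k(L)\, a_{kE} \qquad (8.26)$$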
, (L+1), N
L,fk(i)i
k.,
i,k.:
$$f_k(i) = P(x_1, x_2, \ldots, x_i, \pi_i = k) \qquad (8.27)$$
, fk(i) ,
,
. ,
.L,
(L).
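The Forward recursion translates almost line by line into code. The sketch below assumes a two-state "dishonest casino" model with explicit begin and end transitions; the transition values 0.05 and 0.1 follow the example in the text, while the end probability 0.01, the uniform begin probabilities and the loaded-die emissions are assumptions made here for illustration.

```python
def forward(seq, states, a_begin, a, a_end, e):
    """Forward algorithm with explicit begin/end transitions.

    Returns (P(x), f), where f[i][k] is the joint probability of the first
    i + 1 symbols and of being in state k at that position (0-based i).
    """
    # Initialization folded into the first column: f_k(1) = a_Bk * e_k(x_1)
    f = [{k: a_begin[k] * e[k][seq[0]] for k in states}]
    # Recursion: f_l(i) = e_l(x_i) * sum_k f_k(i-1) * a_kl
    for x in seq[1:]:
        prev = f[-1]
        f.append({l: e[l][x] * sum(prev[k] * a[k][l] for k in states)
                  for l in states})
    # Termination: P(x) = sum_k f_k(L) * a_kE
    px = sum(f[-1][k] * a_end[k] for k in states)
    return px, f

states = ["F", "L"]                      # fair and loaded die
a_begin = {"F": 0.5, "L": 0.5}           # assumed
a = {"F": {"F": 0.94, "L": 0.05}, "L": {"F": 0.10, "L": 0.89}}
a_end = {"F": 0.01, "L": 0.01}           # assumed
e = {"F": {r: 1 / 6 for r in "123456"},
     "L": {**{r: 0.1 for r in "12345"}, "6": 0.5}}  # assumed loaded die

px, f = forward("266616", states, a_begin, a, a_end, e)
```

The table has one column per position and one entry per state, so the cost grows linearly with the sequence length instead of exponentially with the number of paths. For long sequences the products underflow, so a practical version would work with logarithms or scaling.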
Backward (Durbin, et al., 1998; Rabiner, 1989),
.
,bk(i),i
i+1,i k.:
$$b_k(i) = P(x_{i+1}, \ldots, x_L \mid \pi_i = k) \qquad (8.28)$$
,:
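Backward algorithm
$$\forall k,\ i = L: \quad b_k(L) = a_{kE}$$
$$\forall\, 1 \le i < L: \quad b_k(i) = \sum_{l} a_{kl}\, e_l(x_{i+1})\, b_l(i+1) \qquad (8.29)$$
$$P(x \mid \theta) = \sum_{l} a_{Bl}\, e_l(x_1)\, b_l(1)$$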
,,,
1.,Forward.
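The Backward recursion can be sketched in the same way. The model below is the same hypothetical casino model used for illustration (begin, end and loaded-die probabilities are assumptions); the function fills the table from the last position to the first and recovers P(x) from the first column.

```python
def backward(seq, states, a_begin, a, a_end, e):
    """Backward algorithm with explicit begin/end transitions.

    Returns (P(x), b), where b[i][k] is the probability of the symbols after
    position i given state k at position i (0-based i).
    """
    n = len(seq)
    b = [None] * n
    # Initialization: b_k(L) = a_kE
    b[n - 1] = {k: a_end[k] for k in states}
    # Recursion: b_k(i) = sum_l a_kl * e_l(x_{i+1}) * b_l(i+1)
    for i in range(n - 2, -1, -1):
        b[i] = {k: sum(a[k][l] * e[l][seq[i + 1]] * b[i + 1][l] for l in states)
                for k in states}
    # Termination: P(x) = sum_l a_Bl * e_l(x_1) * b_l(1)
    px = sum(a_begin[l] * e[l][seq[0]] * b[0][l] for l in states)
    return px, b

states = ["F", "L"]
a_begin = {"F": 0.5, "L": 0.5}           # assumed
a = {"F": {"F": 0.94, "L": 0.05}, "L": {"F": 0.10, "L": 0.89}}
a_end = {"F": 0.01, "L": 0.01}           # assumed
e = {"F": {r: 1 / 6 for r in "123456"},
     "L": {**{r: 0.1 for r in "12345"}, "6": 0.5}}  # assumed loaded die

px_b, b = backward("266616", states, a_begin, a, a_end, e)
```

The value returned here should agree with the one obtained by the Forward algorithm on the same sequence, which is a convenient check on an implementation.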
8.2.4.
,
. , ,
(decoding).,
Viterbi(Durbin,etal.,1998;Rabiner,1989).
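Viterbi algorithm
$$\forall k \ne B,\ i = 0: \quad u_B(0) = 1, \quad u_k(0) = 0$$
$$\forall\, 1 \le i \le L: \quad u_l(i) = e_l(x_i) \max_{k} \{ u_k(i-1)\, a_{kl} \}$$
$$P(x, \pi^{\max} \mid \theta) = \max_{k} \{ u_k(L)\, a_{kE} \} \qquad (8.30)$$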
Viterbi, Forward,
.
max,
P(x, max|). , P(x, max|) P(x|).
,,
(pointers),()
.(back-tracking),,,
.
.
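The recursion and the back-tracking step can be sketched as follows. This is an illustrative implementation in log space (to avoid underflow), and the toy two-state model at the end is an assumption chosen so that the best path is obvious; none of the names come from the text.

```python
import math

def viterbi(seq, states, a_begin, a, a_end, e):
    """Return (most probable state path, log P(x, path))."""
    def lg(p):
        return math.log(p) if p > 0 else float("-inf")

    # First column: log(a_Bk) + log(e_k(x_1))
    u = [{k: lg(a_begin[k]) + lg(e[k][seq[0]]) for k in states}]
    pointers = []
    for x in seq[1:]:
        prev, col, back = u[-1], {}, {}
        for l in states:
            # Best predecessor of state l at this position
            best = max(states, key=lambda k: prev[k] + lg(a[k][l]))
            col[l] = lg(e[l][x]) + prev[best] + lg(a[best][l])
            back[l] = best
        u.append(col)
        pointers.append(back)
    # Termination, including the transition to the end state
    last = max(states, key=lambda k: u[-1][k] + lg(a_end[k]))
    log_p = u[-1][last] + lg(a_end[last])
    # Back-tracking: follow the stored pointers from the last position
    path = [last]
    for back in reversed(pointers):
        path.append(back[path[-1]])
    path.reverse()
    return path, log_p

# Hypothetical model in which each state emits a single symbol.
states = ["A", "B"]
a_begin = {"A": 0.5, "B": 0.5}
a = {"A": {"A": 0.45, "B": 0.45}, "B": {"A": 0.45, "B": 0.45}}
a_end = {"A": 0.1, "B": 0.1}
e = {"A": {"a": 1.0, "b": 0.0}, "B": {"a": 0.0, "b": 1.0}}

path, log_p = viterbi("aabba", states, a_begin, a, a_end, e)
```

Replacing `max` with a sum in the recursion (and dropping the pointers) turns this into the Forward algorithm, which is the similarity noted above.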
,
xi,xi
k, x. ,
P(i=k|x).xi
k.:
$$P(x_1, x_2, \ldots, x_i, \pi_i = k)\; P(x_{i+1}, \ldots, x_L \mid \pi_i = k)$$
,(8.27),
(8.28).,:
$$P(x, \pi_i = k) = f_k(i)\, b_k(i)$$
,Bayes:
$$P(\pi_i = k \mid x) = \frac{f_k(i)\, b_k(i)}{P(x)} \qquad (8.31)$$
,
.,
:
$$\hat{\pi}_i = \arg\max_{k} P(\pi_i = k \mid x) \qquad (8.32)$$
. ,
.
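Posterior decoding combines the Forward and Backward tables. The sketch below computes the posterior probability of each state at each position and then picks the most probable state per position; the casino-style model is again hypothetical (begin, end and loaded-die values are assumptions).

```python
def posterior_decode(seq, states, a_begin, a, a_end, e):
    """Return (posteriors, path): P(pi_i = k | x) and its per-position argmax."""
    n = len(seq)
    # Forward table
    f = [{k: a_begin[k] * e[k][seq[0]] for k in states}]
    for x in seq[1:]:
        prev = f[-1]
        f.append({l: e[l][x] * sum(prev[k] * a[k][l] for k in states)
                  for l in states})
    # Backward table
    b = [None] * n
    b[n - 1] = {k: a_end[k] for k in states}
    for i in range(n - 2, -1, -1):
        b[i] = {k: sum(a[k][l] * e[l][seq[i + 1]] * b[i + 1][l] for l in states)
                for k in states}
    px = sum(f[n - 1][k] * a_end[k] for k in states)
    # Posterior: f_k(i) * b_k(i) / P(x)
    post = [{k: f[i][k] * b[i][k] / px for k in states} for i in range(n)]
    path = [max(states, key=lambda k: col[k]) for col in post]
    return post, path

states = ["F", "L"]
a_begin = {"F": 0.5, "L": 0.5}           # assumed
a = {"F": {"F": 0.94, "L": 0.05}, "L": {"F": 0.10, "L": 0.89}}
a_end = {"F": 0.01, "L": 0.01}           # assumed
e = {"F": {r: 1 / 6 for r in "123456"},
     "L": {**{r: 0.1 for r in "12345"}, "6": 0.5}}  # assumed loaded die

post, path = posterior_decode("12366666661", states, a_begin, a, a_end, e)
```

Because each position is decided independently, the resulting path is not guaranteed to respect the allowed transitions of the model.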
,
.,,
. ,
, , (..
- ), g(k)
,
$$g(k) = \begin{cases} 1, & k \in C_{TM} \\ 0, & k \in C_{NTM} \end{cases}$$
$$G(i \mid x) = \sum_{k} P(\pi_i = k \mid x)\, g(k) \qquad (8.33)$$
i
.
,
(posterior decoding),
. ,
, (
).,,
,.
8.2.4.1
DNA,
( ): 2
,
4(emissionprobabilities).1:
$$e_1(A) = P(x_i = A \mid \pi_i = 1) = 0.7$$
$$e_1(T) = P(x_i = T \mid \pi_i = 1) = 0.1$$
$$e_1(G) = P(x_i = G \mid \pi_i = 1) = 0.1$$
$$e_1(C) = P(x_i = C \mid \pi_i = 1) = 0.1$$
0:
$$e_0(A) = P(x_i = A \mid \pi_i = 0) = 0.25$$
$$e_0(T) = P(x_i = T \mid \pi_i = 0) = 0.25$$
$$e_0(G) = P(x_i = G \mid \pi_i = 0) = 0.25$$
$$e_0(C) = P(x_i = C \mid \pi_i = 0) = 0.25$$
():
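$$a_{11} = P(\pi_i = 1 \mid \pi_{i-1} = 1) = 0.9$$
$$a_{10} = P(\pi_i = 0 \mid \pi_{i-1} = 1) = 0.1$$
$$a_{00} = P(\pi_i = 0 \mid \pi_{i-1} = 0) = 0.9$$
$$a_{01} = P(\pi_i = 1 \mid \pi_{i-1} = 0) = 0.1$$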
DNA200.
DNA,
(1/0). 8.7
.
AAACAAGAATGCGCACACTACGCAAAAACAATTAGTCGCACTCACGATGAAACAAATTACCACGGTGAA
111111111100000000000001111111111100000000000000111111110000000000001
AACGAATAAACCTCAGAGGCCCAGCGTATATAAACAAGATAAAAACCTAGTCAGCACTCTGACCAGACG
111111111100000000000000000000011111111111111100000000000000000000000
AGCTCACGACTTGAGGATAAGAAAAAAACAACAGCTCACGACTTGAGGATAAGAAAAAAACA
00000000000000001111111111111100000000000000000011111111111111
8.6: DNA .
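A sequence with its hidden state labels, like the one shown above, can be produced by sampling from the two-state model. The sketch below uses the emission and transition probabilities listed in the text; the uniform choice of the first state, the seed and the function name `simulate` are assumptions made for illustration.

```python
import random

def simulate(length, a, e, rng):
    """Sample a (sequence, state path) pair of the given length from the HMM."""
    state = rng.choice([0, 1])           # assumed: first state chosen uniformly
    seq, path = [], []
    for _ in range(length):
        symbols = list(e[state])
        weights = [e[state][s] for s in symbols]
        seq.append(rng.choices(symbols, weights=weights)[0])   # emit a symbol
        path.append(state)
        # move to the next state according to the transition probabilities
        state = rng.choices([0, 1], weights=[a[state][0], a[state][1]])[0]
    return "".join(seq), "".join(str(s) for s in path)

# Emission and transition probabilities as listed in the text.
e = {1: {"A": 0.7, "T": 0.1, "G": 0.1, "C": 0.1},
     0: {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25}}
a = {1: {1: 0.9, 0: 0.1}, 0: {0: 0.9, 1: 0.1}}

seq, path = simulate(200, a, e, random.Random(0))
```

Feeding `seq` to a decoding algorithm and comparing the result with `path` gives a simple way to see how well the hidden states can be recovered.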
(0/1)
( 8.7)
(Viterbi).
8.7: Viterbi
8.6. ,
DNA. ,
7 1, Viterbi
.
,:
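$$a_{11} = P(\pi_i = 1 \mid \pi_{i-1} = 1) = 0.98$$
$$a_{10} = P(\pi_i = 0 \mid \pi_{i-1} = 1) = 0.02$$
$$a_{00} = P(\pi_i = 0 \mid \pi_{i-1} = 0) = 0.97$$
$$a_{01} = P(\pi_i = 1 \mid \pi_{i-1} = 0) = 0.03$$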
,,
:
GCGCACACTAGCGCACACTACGCCTACGCAATTAGTCGCACTCACGAAGAAACAAATTACCACGGTGAG
000000000000000000000000000000000000000000000011111111110000000000001
AACGAATAAAAATCAGAGGCCCAGCGTATATCAGCACTCTGACCACCTAGTCAGCACTCTGACCAGACG
111111111111000000000000000000000000000000000000000000000000000000000
AGCTCACGACTTGAGGATAAGAATAGAAAAACAGCTCACGACTTGAGGCACGACTAGCTCAG
00000000000000001111111111111100000000000000000000000000000000
8.8: DNA , .
(0/1)
8.9: Viterbi
. Viterbi,
1, 1.
.
, (
) ( ) . ,
, (cut-off value)
1. () 0.5
( ), , .
.,,1,
2 , 3 ,
,.
,,,
1-3-2, .
,
. ,
(Viterbi)
.
,
, (..
)
.Jones(Jones,Taylor,&
Thornton,1994).
n
m , n
.,sil(
),l
i., S ij i :1, 2,..., n; j :1, 2,..., m ,
:
S ij max
l lmin lmax
il
j
max
k 1 l An
S
k
j 1
(8.34)
j , lmin lmax
,.
(Farisellietal., 2003),
-.
Posterior-Viterbi
, Fariselli ,
Viterbi (Fariselli,
Martelli,&Casadio,2005).,PV:
$$\pi^{PV} = \arg\max_{\pi \in A_p} \prod_{i=1}^{L} P(\pi_i \mid x)$$
p , P(i=k|x)
,(8.31).,
10.:
$$\delta(k, l) = \begin{cases} 1, & \text{if } a_{kl} \ne 0 \\ 0, & \text{otherwise} \end{cases}$$
,PV,:
$$\pi^{PV} = \arg\max_{\pi} \prod_{i=1}^{L} \delta(\pi_i, \pi_{i+1})\, P(\pi_i \mid x)$$
, ,
Viterbi,
.
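Posterior-Viterbi algorithm
$$\forall k \ne B,\ i = 0: \quad u_B(0) = 1, \quad u_k(0) = 0$$
$$\forall\, 1 \le i \le L: \quad u_l(i) = P(\pi_i = l \mid x) \max_{k} \{ u_k(i-1)\, \delta(k, l) \}$$
$$P(x, \pi^{PV} \mid \theta) = \max_{k} \{ u_k(L)\, \delta(k, E) \}$$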
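The Posterior-Viterbi recursion can be sketched as a Viterbi-style pass over a table of posterior probabilities, restricted to allowed transitions. The example below is hypothetical: a three-state left-to-right model in which choosing the most probable state at each position separately would give the forbidden path 1-3-3. The posterior values and the set `allowed` are invented for illustration, and every state is assumed to be allowed as a start or end state.

```python
def posterior_viterbi(post, states, allowed):
    """Return the allowed path that maximizes the product of posteriors.

    post[i][k] is P(pi_i = k | x); allowed is the set of (k, l) pairs whose
    transition probability is non-zero.
    """
    u = [dict(post[0])]
    pointers = []
    for col in post[1:]:
        prev, cur, back = u[-1], {}, {}
        for l in states:
            # Only predecessors with an allowed transition into l can score
            best = max(states,
                       key=lambda k: prev[k] * (1.0 if (k, l) in allowed else 0.0))
            cur[l] = col[l] * prev[best] * (1.0 if (best, l) in allowed else 0.0)
            back[l] = best
        u.append(cur)
        pointers.append(back)
    last = max(states, key=lambda k: u[-1][k])
    path = [last]
    for back in reversed(pointers):
        path.append(back[path[-1]])
    return path[::-1]

states = ["1", "2", "3"]
allowed = {("1", "1"), ("1", "2"), ("2", "2"), ("2", "3"), ("3", "3")}
post = [{"1": 0.90, "2": 0.10, "3": 0.00},
        {"1": 0.40, "2": 0.15, "3": 0.45},
        {"1": 0.00, "2": 0.20, "3": 0.80}]

pv_path = posterior_viterbi(post, states, allowed)
```

In practice `post` would come from the Forward and Backward tables, as in (8.31), rather than being written by hand.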
Forward
,
Forward.,ForwardDecodingmethod,
( 8.10).
,,
().
(8.26), ,
.
(8.10).,:
$$S(x \mid \theta) = \frac{\log P(x \mid \theta)}{L} \qquad (8.35)$$
L,.,
,(),
. , (Eddy, 1998),
,
,
0,,(
null).
log P x|
(8.36)
S ( x| )
log P x| 0
8.10: . Forward,
.
,,
.
,
.
8.2.5.
- ,
(Maximum Likelihood).
(), ML ,
. ,
,
.:
$$\theta^{ML} = \arg\max_{\theta} P(x \mid \theta) \qquad (8.37)$$
, l(x|),
.
$$l(x \mid \theta) = \log P(x \mid \theta)$$
,
( )
. ,
. , , ,
. ,
, .
, x,
,.
(),
,.,
,
.,:
$$a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}} \qquad (8.38)$$
$$e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')} \qquad (8.39)$$
,.
, ,
, . ,
(pseudo-counts),:
$$a_{kl} = \frac{A_{kl} + r_{kl}}{\sum_{l'} (A_{kl'} + r_{kl'})} \qquad (8.40)$$
$$e_k(b) = \frac{E_k(b) + r_k(b)}{\sum_{b'} (E_k(b') + r_k(b'))} \qquad (8.41)$$
,
rkl,rk(b)Dirichlet.
(prior),
.
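When the state path of each training sequence is known, the estimates reduce to counting. The sketch below counts transitions and emissions from labeled examples and adds a constant pseudocount to every cell; the single constant `pseudo`, the function name and the toy training pair are assumptions made for illustration (in general the pseudocounts can differ per cell).

```python
from collections import Counter

def estimate(pairs, states, alphabet, pseudo=1.0):
    """Estimate transition and emission probabilities from (sequence, path) pairs."""
    A, E = Counter(), Counter()
    for seq, path in pairs:
        for x, k in zip(seq, path):
            E[(k, x)] += 1               # emission counts E_k(b)
        for k, l in zip(path, path[1:]):
            A[(k, l)] += 1               # transition counts A_kl
    a = {k: {l: (A[(k, l)] + pseudo) /
                sum(A[(k, m)] + pseudo for m in states)
             for l in states} for k in states}
    e = {k: {c: (E[(k, c)] + pseudo) /
                sum(E[(k, d)] + pseudo for d in alphabet)
             for c in alphabet} for k in states}
    return a, e

# Hypothetical training data: one short DNA sequence with its 0/1 labels.
a_hat, e_hat = estimate([("AAC", "110")], states="01", alphabet="ACGT")
```

With `pseudo=0` this gives the plain counting estimates, which fail (division by zero) for any state that never appears in the training set; the pseudocounts avoid that and keep every probability strictly positive.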
Baum-Welch
,
,
.,1970Baum,
Bayes:
$$P(\pi \mid x, \theta) = \frac{P(x, \pi \mid \theta)}{P(x \mid \theta)}$$
:
$$\log P(x \mid \theta) = \log P(x, \pi \mid \theta) - \log P(\pi \mid x, \theta)$$
P ( | x, t ) ,,:
(8.42)
,:
log P ( x | ) log P ( x | t )
,:
P( | x, t )
P( | x, )
=t:
$$\log P(x \mid \theta) - \log P(x \mid \theta^{t}) \ge Q(\theta \mid \theta^{t}) - Q(\theta^{t} \mid \theta^{t})$$
Q,:
$$\theta^{t+1} = \arg\max_{\theta} Q(\theta \mid \theta^{t})$$
.:
(8.24):
Ek ( b , )
P(x, | ) ek (b)
k 1
kl
Akl ( )
(8.43)
k 0 l 1
, Ek(b,), , Akl() b,
l , k, . (8.42)
:
(8.44)
k 0 l 1
k 1 b
,Ek(b,)Akl(),
, fk(i), bk(i),
forwardbackward.,
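$$P(x, \pi_i = k, \pi_{i+1} = l \mid \theta) = P(x_1, x_2, \ldots, x_L, \pi_i = k, \pi_{i+1} = l \mid \theta)$$
which factorizes as:
$$P(x, \pi_i = k, \pi_{i+1} = l \mid \theta) = P(x_1, x_2, \ldots, x_i, \pi_i = k \mid \theta)\; P(x_{i+1}, x_{i+2}, \ldots, x_L, \pi_{i+1} = l \mid \pi_i = k, \theta)$$
By (8.27), the first factor is:
$$f_k(i) = P(x_1, x_2, \ldots, x_i, \pi_i = k)$$
For the second factor:
$$P(x_{i+1}, x_{i+2}, \ldots, x_L, \pi_{i+1} = l \mid \pi_i = k, \theta) = P(x_{i+1}, \pi_{i+1} = l \mid \pi_i = k, \theta)\; P(x_{i+2}, \ldots, x_L \mid x_{i+1}, \pi_{i+1} = l, \pi_i = k, \theta) \qquad (8.45)$$
where:
$$P(x_{i+1}, \pi_{i+1} = l \mid \pi_i = k, \theta) = P(\pi_{i+1} = l \mid \pi_i = k)\; P(x_{i+1} \mid \pi_{i+1} = l) = a_{kl}\, e_l(x_{i+1}) \qquad (8.46)$$
and:
$$P(x_{i+2}, \ldots, x_L \mid x_{i+1}, \pi_{i+1} = l, \pi_i = k, \theta) = P(x_{i+2}, \ldots, x_L \mid \pi_{i+1} = l) = b_l(i+1) \qquad (8.47)$$
Substituting (8.46) and (8.47) into (8.45):
$$P(x_{i+1}, x_{i+2}, \ldots, x_L, \pi_{i+1} = l \mid \pi_i = k, \theta) = a_{kl}\, e_l(x_{i+1})\, b_l(i+1)$$
Therefore:
$$P(x, \pi_i = k, \pi_{i+1} = l \mid \theta) = f_k(i)\, a_{kl}\, e_l(x_{i+1})\, b_l(i+1) \qquad (8.48)$$
and, by Bayes' theorem:
$$P(\pi_i = k, \pi_{i+1} = l \mid x, \theta) = \frac{f_k(i)\, a_{kl}\, e_l(x_{i+1})\, b_l(i+1)}{P(x)} \qquad (8.49)$$
$$P(\pi_i = k \mid x) = \frac{f_k(i)\, b_k(i)}{P(x)}$$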
,,:
$$A_{kl} = \frac{1}{P(x)} \sum_{i} f_k(i)\, a_{kl}\, e_l(x_{i+1})\, b_l(i+1) \qquad (8.50)$$
$$E_k(b) = \frac{1}{P(x)} \sum_{i \mid x_i = b} f_k(i)\, b_k(i) \qquad (8.51)$$
(8.52)
(8.41):
k 0 l 1
(8.38)(8.39).
Baum-Welch - (Expectation)
fk(i),bk(i),forwardbackward
Akl,Ek(b)(8.50)(8.51).
Q. - (Maximization)
Akl Ek(b) (8.38) (8.39), ...
. (loglikelihood) Q
(threshold).
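One iteration of the procedure just described (expected counts from the Forward and Backward tables, then re-normalization) can be sketched as follows. To keep the example short it makes simplifying assumptions that are not in the text: there is no explicit end state, the start probabilities are kept fixed, and no pseudocounts are added. The starting parameters and the dice sequences are invented.

```python
import math

def forward_backward(seq, states, start, a, e):
    """Return the Forward table, the Backward table and P(x) (no end state)."""
    f = [{k: start[k] * e[k][seq[0]] for k in states}]
    for x in seq[1:]:
        prev = f[-1]
        f.append({l: e[l][x] * sum(prev[k] * a[k][l] for k in states)
                  for l in states})
    b = [{k: 1.0 for k in states}]
    for x in reversed(seq[1:]):
        nxt = b[0]
        b.insert(0, {k: sum(a[k][l] * e[l][x] * nxt[l] for l in states)
                     for k in states})
    return f, b, sum(f[-1].values())

def baum_welch_step(seqs, states, alphabet, start, a, e):
    """One E-step and M-step; returns (new_a, new_e, log-likelihood before the update)."""
    A = {k: {l: 0.0 for l in states} for k in states}
    E = {k: {c: 0.0 for c in alphabet} for k in states}
    loglik = 0.0
    for seq in seqs:
        f, b, px = forward_backward(seq, states, start, a, e)
        loglik += math.log(px)
        for i, x in enumerate(seq):
            for k in states:
                E[k][x] += f[i][k] * b[i][k] / px            # expected emissions
                if i + 1 < len(seq):
                    for l in states:                          # expected transitions
                        A[k][l] += (f[i][k] * a[k][l] *
                                    e[l][seq[i + 1]] * b[i + 1][l] / px)
    new_a = {k: {l: A[k][l] / sum(A[k].values()) for l in states} for k in states}
    new_e = {k: {c: E[k][c] / sum(E[k].values()) for c in alphabet} for k in states}
    return new_a, new_e, loglik

states, alphabet = ["F", "L"], "123456"
start = {"F": 0.5, "L": 0.5}
a0 = {"F": {"F": 0.9, "L": 0.1}, "L": {"F": 0.1, "L": 0.9}}
e0 = {"F": {r: 1 / 6 for r in alphabet},
      "L": {**{r: 0.1 for r in "12345"}, "6": 0.5}}
seqs = ["266616", "123456", "661626"]

a1, e1, ll0 = baum_welch_step(seqs, states, alphabet, start, a0, e0)
a2, e2, ll1 = baum_welch_step(seqs, states, alphabet, start, a1, e1)
```

Repeating the step until the change in log-likelihood falls below a threshold gives the full training loop; the log-likelihood should not decrease from one iteration to the next.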
Gradient-Descent
Baum-Welch, . ,
, ,
. , (update) ,
(batch mode of learning). ,
,,
.,BaldiChauvin
(Baldi&Chauvin,1994).
Gradient-Descent, .
,fn:
f x f x1 , x2 ,..., xn
,,
x0 x10 , x2 0 ,..., xn 0
,:
f ( xt 1 ) f ( xt ) f ( x)
,
(learningrate).,
(negative log-likelihood),
.,:
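$$\theta^{t+1} = \theta^{t} - \eta\, \frac{\partial \ell(x \mid \theta)}{\partial \theta}, \qquad \ell(x \mid \theta) = -\log P(x \mid \theta) \qquad (8.53)$$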
,
.
log P x |
:
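$$\frac{\partial \log P(x \mid \theta)}{\partial \theta} = \frac{1}{P(x \mid \theta)} \frac{\partial P(x \mid \theta)}{\partial \theta} = \frac{1}{P(x \mid \theta)} \sum_{\pi} \frac{\partial P(x, \pi \mid \theta)}{\partial \theta} = \sum_{\pi} \frac{P(x, \pi \mid \theta)}{P(x \mid \theta)} \frac{\partial \log P(x, \pi \mid \theta)}{\partial \theta} = \sum_{\pi} P(\pi \mid x, \theta)\, \frac{\partial \log P(x, \pi \mid \theta)}{\partial \theta} \qquad (8.54)$$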
(8.43), ,
:
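$$\frac{\partial \log P(x, \pi \mid \theta)}{\partial a_{kl}} = \frac{\partial}{\partial a_{kl}} \sum_{k'} \sum_{l'} A_{k'l'}(\pi) \log a_{k'l'} = \frac{A_{kl}(\pi)}{a_{kl}}$$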
,(8.54):
(8.55)
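$$\frac{\partial \log P(x \mid \theta)}{\partial a_{kl}} = \sum_{\pi} P(\pi \mid x, \theta)\, \frac{A_{kl}(\pi)}{a_{kl}} = \frac{A_{kl}}{a_{kl}} \qquad (8.56)$$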
,
:
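$$\frac{\partial \log P(x \mid \theta)}{\partial e_k(b)} = \sum_{\pi} P(\pi \mid x, \theta)\, \frac{E_k(b, \pi)}{e_k(b)} = \frac{E_k(b)}{e_k(b)} \qquad (8.57)$$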
,
,Q(8.52).
,
,
(Baldi&Chauvin,1994).
,.,
,.
(8.53), Gradient Descent,
, .
,,
0 1,
.,KroghRiss(Krogh&Riis,
1999), soft-max . ,
kl,zkl, :
$$a_{kl} = \frac{\exp(z_{kl})}{\sum_{l'} \exp(z_{kl'})} \qquad (8.58)$$
,GradientDecent,kl,zkl:
$$z_{kl}^{t+1} = z_{kl}^{t} - \eta\, \frac{\partial \ell}{\partial z_{kl}} \qquad (8.59)$$
:
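$$a_{kl}^{t+1} = \frac{a_{kl}^{t} \exp\left(-\eta\, \dfrac{\partial \ell}{\partial z_{kl}}\right)}{\sum_{l'} a_{kl'}^{t} \exp\left(-\eta\, \dfrac{\partial \ell}{\partial z_{kl'}}\right)} \qquad (8.60)$$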
(8.56),
zkl:
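$$\frac{\partial \ell}{\partial z_{kl}} = -\left[ A_{kl} - a_{kl} \sum_{l'} A_{kl'} \right], \qquad \ell = -\log P(x \mid \theta) \qquad (8.61)$$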
,(8.61)(8.60),
:
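$$a_{kl}^{t+1} = \frac{a_{kl}^{t} \exp\left( \eta \left[ A_{kl} - a_{kl} \sum_{l'} A_{kl'} \right] \right)}{\sum_{l'} a_{kl'}^{t} \exp\left( \eta \left[ A_{kl'} - a_{kl'} \sum_{l''} A_{kl''} \right] \right)} \qquad (8.62)$$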
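The multiplicative update obtained from the soft-max parameterization can be sketched for a single row of the transition matrix. The function below multiplies each probability by the exponential of the learning rate times the term `A_kl - a_kl * sum(A_kl')` and then re-normalizes; the function name, the learning rate and the example counts are assumptions made for illustration.

```python
import math

def softmax_update(a_row, A_row, eta):
    """Update one row of transition probabilities given expected counts A_row."""
    total = sum(A_row.values())
    # Unnormalized new values: a_kl * exp(eta * (A_kl - a_kl * sum_l' A_kl'))
    unnorm = {l: a_row[l] * math.exp(eta * (A_row[l] - a_row[l] * total))
              for l in a_row}
    norm = sum(unnorm.values())
    return {l: v / norm for l, v in unnorm.items()}

a_row = {"F": 0.9, "L": 0.1}

# Expected counts proportional to the current probabilities: nothing changes.
unchanged = softmax_update(a_row, {"F": 9.0, "L": 1.0}, eta=0.05)

# Expected counts that favor the second transition: its probability increases.
shifted = softmax_update(a_row, {"F": 5.0, "L": 5.0}, eta=0.05)
```

Because the result is re-normalized, the updated values always remain valid probabilities, which is the reason for using the soft-max parameterization instead of updating the probabilities directly.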
,
.,
. ,
(smooth) ,
Baum-Welch, online training,
-,
. , ,
(Conditional Maximum
Likelihood-CML). , ,
(),
.
Viterbi training
,,,
. Viterbi training
Segmental k-means algorithm Juang Rabiner 1990 (Juang & Rabiner,
1990). , : )
Viterbi,)
(8.34)(8.35).
,
(clustering),1968
k-meansalgorithm(MacQueen,1967).
, ,
(8.20),
.JuangRabiner
, ,
(state-optimized likelihood).
, , ,
.
, ,
.,
(
Baum-Welch gradient).
, Baum-Welch.
,
.
, , , ,
Viterbi BaumWelch Forward Backward ( ,
,/).
8.3.
8.3.1.
,
(unsupervisedmethods).
,-,
. ,
,
()
.
8.11: CHMM. . HMM
. ( ) 1 ,
. CHMM
,
. ( ),
. , 1 .
, .
, ,
,
,
. ,
,3
. ,
(supervised learning). ,
(labeled sequences), Krogh
Class Hidden Markov Model (Anders. Krogh, 1994). , ,
$x = x_1, x_2, \ldots, x_{L-1}, x_L$,
,(labels)
$$y = y_1, y_2, \ldots, y_{L-1}, y_L$$
,3:
(), () ().
,
.,
, ...
,k(c)kc.
,,
, (delta function) 1
0 . ,
.
8.3.2.
, ,
, P(x,y|)
xy,:
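$$P(x, y \mid \theta) = \sum_{\pi} P(x, y, \pi \mid \theta) = \sum_{\pi \in \Pi_y} P(x, \pi \mid \theta) = \sum_{\pi \in \Pi_y} a_{B\pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}}$$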
,(8.25),
y,.
,
Forward Backward,
.,,
(0,1), . ,
Forward,:
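Forward algorithm
$$\forall k \ne B,\ i = 0: \quad f_B(0) = 1, \quad f_k(0) = 0$$
$$\forall\, 1 \le i \le L: \quad f_l(i) = e_l(x_i)\, \delta_{l, y_i} \sum_{k} f_k(i-1)\, a_{kl}$$
$$P(x, y \mid \theta) = \sum_{k} f_k(L)\, a_{kE}$$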
,Backward,:
Backward algorithm
$$\forall k,\ i = L: \quad b_k(L) = a_{kE}$$
$$\forall\, 1 \le i < L: \quad b_k(i) = \sum_{l} a_{kl}\, e_l(x_{i+1})\, \delta_{l, y_{i+1}}\, b_l(i+1) \qquad (8.64)$$
, Forward
Backward,(8.12).
y, ,
P(x,y|)P(x|).
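The labeled version of the Forward algorithm differs from the standard one only by a delta factor that zeroes every state whose label disagrees with the observed label at that position. The sketch below assumes a hypothetical two-state model in which each state carries its own label; all names and numbers are illustrative.

```python
def labeled_forward(seq, labels, states, state_label, a_begin, a, a_end, e):
    """Return P(x, y): the total probability of paths that agree with the labels."""
    def delta(k, y):
        # 1 if state k carries label y, otherwise 0
        return 1.0 if state_label[k] == y else 0.0

    f = {k: a_begin[k] * e[k][seq[0]] * delta(k, labels[0]) for k in states}
    for x, y in zip(seq[1:], labels[1:]):
        f = {l: e[l][x] * delta(l, y) * sum(f[k] * a[k][l] for k in states)
             for l in states}
    return sum(f[k] * a_end[k] for k in states)

# Hypothetical model: state M has label "m", state N has label "n".
states = ["M", "N"]
state_label = {"M": "m", "N": "n"}
a_begin = {"M": 0.5, "N": 0.5}
a = {"M": {"M": 0.6, "N": 0.3}, "N": {"M": 0.2, "N": 0.7}}
a_end = {"M": 0.1, "N": 0.1}
e = {"M": {"a": 0.8, "b": 0.2}, "N": {"a": 0.3, "b": 0.7}}

p_xy = labeled_forward("ab", "mn", states, state_label, a_begin, a, a_end, e)
```

When several states share a label, the same function sums over all paths compatible with the labeling, so the result is never larger than the unconstrained P(x).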
8.12: Forward . 12
, 8 (labels).
,
.
8.3.3.
,
:
$$\theta^{ML} = \arg\max_{\theta} P(x, y \mid \theta)$$
,
.,,(8.65)
(8.66):
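$$A_{kl} = \frac{1}{P(x, y \mid \theta)} \sum_{i} f_k(i)\, a_{kl}\, e_l(x_{i+1})\, \delta_{l, y_{i+1}}\, b_l(i+1) \qquad (8.67)$$
$$E_k(b) = \frac{1}{P(x, y \mid \theta)} \sum_{i \mid x_i = b} f_k(i)\, b_k(i) \qquad (8.68)$$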
P(x,y|),fi(i)bk(i),
. ,
,Baum-WelchGradientDescent.
,Krogh(Anders.Krogh,1994)
, (Conditional Maximum
Likelihood).,,
:
P x, y |
(MaximumMutualInformation)(Rabiner,1989).
,:
$$\ell = -\log P(y \mid x, \theta) = \ell_c - \ell_f$$
where:
$$\ell_c = -\log P(x, y \mid \theta)$$
$$\ell_f = -\log P(x \mid \theta)$$
c f,
(clampedphase),(freerunningphase).(8.56)(8.57),
:
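$$\frac{\partial \ell}{\partial a_{kl}} = \frac{\partial \ell_c}{\partial a_{kl}} - \frac{\partial \ell_f}{\partial a_{kl}} = -\frac{A_{kl}^{c} - A_{kl}^{f}}{a_{kl}} \qquad (8.69)$$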
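$$\frac{\partial \ell}{\partial e_k(b)} = \frac{\partial \ell_c}{\partial e_k(b)} - \frac{\partial \ell_f}{\partial e_k(b)} = -\frac{E_k^{c}(b) - E_k^{f}(b)}{e_k(b)} \qquad (8.70)$$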
.,
Gradient Descent.
,,:
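$$\frac{\partial \ell}{\partial z_{kl}} = -\left[ A_{kl}^{c} - A_{kl}^{f} - a_{kl} \sum_{l'} \left( A_{kl'}^{c} - A_{kl'}^{f} \right) \right] \qquad (8.71)$$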
t,,
:
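$$a_{kl}^{t+1} = \frac{a_{kl}^{t} \exp\left( \eta \left[ A_{kl}^{c} - A_{kl}^{f} - a_{kl} \sum_{l'} \left( A_{kl'}^{c} - A_{kl'}^{f} \right) \right] \right)}{\sum_{l'} a_{kl'}^{t} \exp\left( \eta \left[ A_{kl'}^{c} - A_{kl'}^{f} - a_{kl'} \sum_{l''} \left( A_{kl''}^{c} - A_{kl''}^{f} \right) \right] \right)} \qquad (8.72)$$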
,
(Discriminative Training),
,
. ,
, .
,
( ). ,
,
(Bagos,
Liakopoulos,&Hamodrakas,2004).
8.3.4.
1-best Decoding
,
,
()
.,
,
. ,
,.
1-best (Krogh, 1997), N-best,
(Schwartz & Chow, 1990). ,
,
ymax.,
i , hi-1 ,
. ,
l yi
..
Viterbi, 1-best
.
1-best
i 1: l ( h1 ) a Bl el ( x1 )
1 i L : l (hi y i ) el ( xi ) k ( hi 1 ) akl
(8.73)
,
,
,:
$$P(x, \pi^{\max} \mid \theta) \le P(x, y^{\max} \mid \theta) \le P(x \mid \theta) \qquad (8.74)$$
OAPD arg max i , i 1 P i | x k c
i 1
1 i L : Al (i ) P yi c l | x, max Ak (i 1) k , l (8.75)
k
P ( x, OAPD | ) max Ak ( L ) k , E
k
,Posterior-Viterbi
,),)
,.,
Posterior-Viterbi,OptimalAccuracyPosteriorDecoder
,
.,,
.
8.3.5.
,
,
.
(genefusions),
,
(Drewetal.,2002;Melen,Krogh,&vonHeijne,2003).
(,)
,E. coli(Rappetal.,2004)
S. cerevisiae(Kim,Melen,&vonHeijne,2003).
,HMMTOP(Tusnady&Simon,
2001),
, ,
Phobius (Kall,
Krogh, & Sonnhammer, 2004). ,
,(Bagos,Liakopoulos,
&Hamodrakas,2006).
8.13: Forward,
. () 12 , x 8 .
x, 3,4
, 1 8
.
,
,
. , ,
, .
,,
,.
(Information), 1rL,
, - , 1,
2, ...r,
.,
:
0,if k i 0andi r
d k i =
1,otherwise
forward,
forward f i k
( 8.13).
, y
y, . ,
,
. forward backward
,
. ,
.
,,.
,(Bagos,etal.,2006).
8.3.6.
,
, (underflow error).
,
.
,
.,
:
$$\log(a + b) = \log\left[ a \left( 1 + \frac{b}{a} \right) \right] = \log a + \log\left( 1 + \frac{b}{a} \right) \qquad (8.76)$$
|log()-log(b)| 37,
, 0,
(8.76).
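The identity in (8.76) lets two probabilities be added when only their logarithms are stored, which is what a log-space Forward or Backward implementation needs. The sketch below is one common way to write it; the function name and the optional `cutoff` argument (set to 37 here, following the threshold mentioned above) are choices made for illustration.

```python
import math

def log_add(log_a, log_b, cutoff=37.0):
    """Return log(a + b) given log(a) and log(b), without leaving log space."""
    # Make log_a the larger value so that exp() below cannot overflow.
    if log_a < log_b:
        log_a, log_b = log_b, log_a
    # If the smaller term is negligible (or exactly zero probability), skip it.
    if log_b == float("-inf") or log_a - log_b > cutoff:
        return log_a
    # log a + log(1 + b/a), with b/a = exp(log b - log a)
    return log_a + math.log1p(math.exp(log_b - log_a))

total = log_add(math.log(0.2), math.log(0.3))

# Works even when the probabilities themselves would underflow to zero.
tiny = log_add(-1000.0, -1000.0)
```

Summing more than two terms is done by applying the function repeatedly, for example while accumulating the sum over predecessor states in the Forward recursion.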
8.4.
, .
(Ostendorf & Singer, 1997; Vasko, El-Jaroudi, & Boston, 1996 ),
(Won,Prugel-Bennett,&Krogh,2004;Yada,Ishikawa,H.,&Asai,1994),,
.,
.
,
,
.CHMM,,
-.
,(,,
- ...) :
.,,
.,
. ,
15 35 . ,
parametertyingsharing.,
.
forward backward.
, (
10,2010*20).
, ,
.,,
(8.14).
,.,
,
.,.
,
.,
,.,
, .
3k(8.15).
. , profile
, (
,)(-).
,,
. ,
, .
, . ,
, . ,
,
.
.
8.15: . .
1, 2,3 4. . 4 ,
. 4
( ). . .
1 4 .
(silent states).
,
.
.
silent states ( ) ,
,
.
8.5.
, HMM ,
,
(insert/deletion).
,,
,.
,
,(silentstates).
,
.
,
. ,
k ,
.,
. pHMM,
8.16.
(
)3:
(Matchstates)
Mk
(Insertionstates)
Ik
(Deletionstates)
Dk
, .
.
, .
,
.:
,
,
.,,
.
,
,,
.
(Viterbi,backward,forward,Baum-Welch),
.
8.6.
profile HMM
,
,
.,
(gap penalties),
, . ,
,
(remote homologies),
.
8.17: .
, .
pHMM,,
(Eddy, 1995)
(Krogh,
Brown, Mian, Sjolander,&Haussler, 1994).
,
PFAM(Bateman,etal.,2004),
, .. .
,TIGRFAM,PFAM
.,,
,
, . ,
,
, profile (Eddy, 2011).
:
,
(BLAST,PSI-BLAST,HMMER)
,
()
, ( ,
)
(HMMER)
,
.
8.18: profile .
profileHMM,
, ,
, ,
. , profile HMM
,(Eddy,1995).
, 8.18 8.19. ,
, , ,
.,Viterbi
.,
i,(m,d,i),,.
i ,
(mi).
8.19: profile .
8.7.
HMMER
ProfileHiddenMarkovModels,HMMER(Eddy,
2000).,,
GPL (GNU Public License), ,
.
HMMER,3,:
hmmbuild: ,
,.
hmmalign:
, . ,
.
hmmsearch:,
.
phmmer:
(BLASTP)
jackhmmer:
( PSIBLAST)
hmmscan:
.,
,
.,,
,
.
nhmmer: DNA,
pHMM,DNA.(BLASTN)
nhmmscan: DNA
DNAprofileHM.
hmmconvert:
HMMER3.
hmmemit: , ( )
.
hmmpress: HMMhmmscan.
hmmstat: .
,
. ,
,
(null)
, ,
.,,
, .
, ,
,.,
Profile Hidden Markov Model, ,
() (D) .
,:http://hmmer.janelia.org/.
,HMMER.
1.8 .
2.0 , ,
. ,
. ,
(Zhang &
Wood,2003).,,,
(discriminative training) (Srivastava, Desai, Nandi, & Lynn,
2007). 3.0 , ,
, (Eddy,2011). , . ,
Gumbel,
., ,
, -
BLAST PSI-BLAST-,
. ,
,.
Audic, S., & Claverie, J. M. (1998). Self-identification of protein-coding regions in microbial genomes. Proc Natl Acad Sci U S A, 95(17), 10026-10031.
Bagos, P. G., Liakopoulos, T. D., & Hamodrakas, S. J. (2004). Faster Gradient Descent Conditional Maximum Likelihood Training of Hidden Markov Models, Using Individual Learning Rate Adaptation. In G. Paliouras & Y. Sakakibara (Eds.), Grammatical Inference: Algorithms and Applications (Vol. 3264, pp. 40-52): Springer Berlin/Heidelberg.
Bagos, P. G., Liakopoulos, T. D., & Hamodrakas, S. J. (2006). Algorithms for incorporating prior topological information in HMMs: application to transmembrane proteins. BMC Bioinformatics, 7, 189.
Baldi, P., & Chauvin, Y. (1994). Smooth On-Line Learning Algorithms for Hidden Markov Models. Neural Comput, 6(2), 305-316.
Barash, Y., Elidan, G., Friedman, N., & Kaplan, T. (2003). Modeling dependencies in protein-DNA binding sites. Paper presented at the Proceedings of the seventh annual international conference on Computational molecular biology. RECOMB '03., New York, NY, USA.
Bateman, A., Coin, L., Durbin, R., Finn, R. D., Hollich, V., Griffiths-Jones, S., ... Eddy, S. R. (2004). The Pfam protein families database. Nucleic Acids Res, 32(Database issue), D138-141.
Baum, L. (1972). An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. Inequalities, 3, 1-8.
Bejerano, G. (2004). Algorithms for variable length Markov chain modeling. Bioinformatics, 20(5), 788-789.
Bejerano, G., Seldin, Y., Margalit, H., & Tishby, N. (2001). Markovian domain fingerprinting: statistical segmentation of protein sequences. Bioinformatics, 17(10), 927-934.
Bejerano, G., & Yona, G. (2001). Variations on probabilistic suffix trees: statistical modeling and prediction of protein families. Bioinformatics, 17(1), 23-43.
Berchtold, A. (2001). Estimation in the Mixture Transition Distribution Model. Journal of Time Series Analysis, 22(4), 379-397.
Borodovsky, M., & McIninch, J. (1993). GeneMark: parallel gene recognition for both DNA strands. Comput Chem, 17(19), 123-133.
Borodovsky, M., & Peresetsky, A. (1994). Deriving non-homogeneous DNA Markov chain models by cluster analysis algorithm minimizing multiple alignment entropy. Comput Chem, 18(3), 259-267.
Browning, S. R. (2006). Multilocus association mapping using variable-length Markov chains. Am J Hum Genet, 78(6), 903-913.
Dalevi, D., Dubhashi, D., & Hermansson, M. (2006). Bayesian classifiers for detecting HGT using fixed and variable order markov models of genomic signatures. Bioinformatics, 22(5), 517-522.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J Royal Stat Soc B, 39, 1-38.
Drew, D., Sjostrand, D., Nilsson, J., Urbig, T., Chin, C. N., de Gier, J. W., & von Heijne, G. (2002). Rapid topology mapping of Escherichia coli inner-membrane proteins by prediction and PhoA/GFP fusion analysis. Proc Natl Acad Sci U S A, 99(5), 2690-2695.
Durbin, R., Eddy, S. R., Krogh, A., & Mitchison, G. (1998). Biological sequence analysis, probabilistic models of proteins and nucleic acids. Cambridge University Press.
Eddy, S. R. (1995). Multiple alignment using hidden Markov models. Proc Int Conf Intell Syst Mol Biol, 3, 114-120.
Eddy, S. R. (1998). Profile hidden Markov models. Bioinformatics, 14(9), 755-763.
Ostendorf, M., & Singer, H. (1997). HMM topology design using maximum likelihood successive state splitting. Computer Speech & Language, 11(1), 17-41.
Phillips, G. J., Arnold, J., & Ivarie, R. (1987). Mono- through hexanucleotide composition of the Escherichia coli genome: a Markov chain analysis. Nucleic Acids Res, 15(6), 2611-2626.
Rabiner, L. (1989). A tutorial on hidden Markov models and selected applications in speech recognition. Proc. IEEE, 77(2), 257-286.
Raftery, A. E. (1985a). A model for high-order Markov chains. Journal of the Royal Statistical Society. Series B (Methodological), 47(3), 528-539.
Raftery, A. E. (1985b). A new model for discrete-valued time series: Autocorrelations and extensions. Rassegna di Metodi Statistici ed Applicazioni, 3-4, 149-162.
Rapp, M., Drew, D., Daley, D. O., Nilsson, J., Carvalho, T., Melen, K., ... Von Heijne, G. (2004). Experimentally based topology models for E. coli inner membrane proteins. Protein Sci, 13(4), 937-945.
Ron, D., Singer, Y., & Tishby, N. (1996). The power of amnesia: learning probabilistic automata with variable memory length. Machine Learning, 25, 117-149.
Salzberg, S. L., Delcher, A. L., Kasif, S., & White, O. (1998). Microbial gene identification using interpolated Markov models. Nucleic Acids Res, 26(2), 544-548.
Salzberg, S. L., Pertea, M., Delcher, A. L., Gardner, M. J., & Tettelin, H. (1999). Interpolated Markov models for eukaryotic gene finding. Genomics, 59(1), 24-31.
Schwartz, R., & Chow, Y. L. (1990). The N-Best Algorithm: An Efficient and Exact Procedure for Finding the N Most Likely Sentence Hypotheses. Proc IEEE Int Conf Acoust, Speech, Sig Proc, 1, 81-84.
Srivastava, P. K., Desai, D. K., Nandi, S., & Lynn, A. M. (2007). HMM-ModE--improved classification using profile hidden Markov models by optimising the discrimination threshold and modifying emission probabilities with negative training sequences. BMC Bioinformatics, 8, 104.
Tusnady, G. E., & Simon, I. (2001). The HMMTOP transmembrane topology prediction server. Bioinformatics, 17(9), 849-850.
Vasko, R. C. J., El-Jaroudi, A., & Boston, J. R. (1996). An algorithm to determine hidden Markov model topology. Paper presented at the IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP-96.
Won, K. J., Prugel-Bennett, A., & Krogh, A. (2004). Training HMM structure with genetic algorithm for biological sequence analysis. Bioinformatics, 20(18), 3613-3619.
Yada, T., Ishikawa, M., H., T., & Asai, K. (1994). DNA Sequence Analysis using Hidden Markov Model and Genetic Algorithm. Genome Informatics, 5, 178-179.
Yuan, Z. (1999). Prediction of protein subcellular locations using Markov chain models. FEBS Lett, 451(1), 23-26.
Zhang, Z., & Wood, W. I. (2003). A profile hidden Markov model for signal peptides generated by HMMER. Bioinformatics, 19(2), 307-308.
1) 8.3.1,
x = 214526436636561666232145, π = --------++++++++++------.
(8.34)(8.35).
0.7 0.3 0 x1
0.2 0.1 x4 0.5
0
0 0.8 x2
, e 0.5 x5 0.1 0.1
a
0
0.1 0.2 0.1 0.6
0 0.9 0.1
0
0
x3 0
0.4 0.4 0.1 0.1
)
;.
)x1,x2,x3,x4x5..
).;
)(Begin)
(End);
x:AAACAGGTGAGTAAA
y:TTTAAGGTAAGTGGG
,Forward?.
)(),
(),regularexpression?
:()
(A,C,T,G).
4) :
GSAPSRKFFVGGNWKMNGRKQSLGELIGTLNAAKVPADTEVVCAPPTAYI
CCCCCCEEEEEEEECCCCCHHHHHHHHHHHHCCCCCCEEEEEEECCCCCH
51
DFARQKLDPKIAVAAQNCYKVTNGAFTGEISPGMIKDCGATWVVLGHSER
HHHHHHCCCCEEEEEEEECCCCCCCCCCCHHHHHHHCCCCEEEEECHHHH
101
RHVFGESDELIGQKVAHALAEGLG
HCCCCCCHHHHHHHHHHHHHCCCC
)
SCOP?
)HiddenMarkovModel
3(,,C).
?
),
(
).
:
5) DNA
(=, =, N=- , 5=5-
,3=3-)
GCATGCGTAGTCTGACTGCCAAGATATAAAGTTATAATCTATATACGATCGCTGTCAATGCT
NNEEEEEEEEEEEEEEEEEEEE5IIIIIIIIIIIIIIIIIIIIIII3EEEEEEEEEEEEEEE
)();
.
)HiddenMarkovModel
5(,,5,3,).
;
),
().
:
. , .
6) .
)(,,
).
):
x1=CGGCCG
x2=AT
x3=ACCGAT
x4=TCGAT
x5=GGCCG
.
?
)
?
7) HMM,22.
a e.
, ,
. ,
3, 4, ...
;
8) ,:
$$P(\pi_i = k \mid x, \theta) = \frac{f_k(i)\, b_k(i)}{P(x \mid \theta)}$$
: $P(x, \pi_i = k \mid \theta)$
$$f_k(i) = P(x_1, x_2, \ldots, x_i, \pi_i = k \mid \theta)$$
$$b_k(i) = P(x_{i+1}, \ldots, x_L \mid \pi_i = k, \theta)$$
9) profile?
:
xa=WAYDDR,
xb=WDAYPDDR
(paths)Viterbi:
pa=m0m1m2m3d4d5m6m7m8m9,pb=m0m1i1m2m3d4m5m6m7m8m9
.
9:
,
, , DNA, RNA.
,
, , ,
, .
.
.
, .
9.
,
,,DNARNA.
1950 1960
, ,
.
,,
. ,
,.
,
. ,
.
,
. ,
. , ,
.
.
,
. ,
,
. , ,
(homologymodelling),(threading)abinitio.,
,,
, ,
, (docking)
(
/DNA ( RNA), / ,
).,,
,,
, ,
,,
.
,
.
9.1.
,
.1950
1960,,
.
, , ..
(.. 40%-20%-)
. ,
(far-UV, 170-250 nm) ,
(IR) (NMR)(Meiler
&Baker,2003;Pelton&McLean,2000).,
(-, - ) ,
(
).
9.1: ( IR CD)
2 , (
),1950,DNA.
,SirJohnKendrewMaxPerutz1962,
.
.
, ,
.
,
(
). ,
.
1990 , ,
( ),
CCD,
, .
PDB(,'
,,
).
,
NMR10%()
.(2013-2015)
,2-3,
(,terabytegigabyte).
(single-crystal X-ray diffraction),
..
,
(Shi,2014;
Yaffe,2005).
9.2:
(https://en.wikipedia.org/wiki/X-ray_crystallography)
.
( 10-20m),
(, ...). ,
, .
. , ,
.
( ).
1m.
, ,
, .
.
. ,
,
.
, ,
. ,
.
,
.
,
(
).
(ab initio phasing), (molecular
replacement), (anomalous X-ray scattering)
(heavyatommethods),
. ,
(refinement).
/ .
,/,
,
.,
,
.
.InternationalUnionofCrystallography
,
(http://www.iucr.org/resources/other-directories/software).,
,,CCP4,
http://www.ccp4.ac.uk/(Winnetal.,2011),PHENIX,
https://www.phenix-online.org/ (Adams et al., 2010) X-PLOR (Güntert,
2011),,
NIH, Xplor-NIH,
http://nmr.cit.nih.gov/xplor-nih/(Schwieters,Kuszewski,&Clore,2006)
,,
.
(PDB),
. ,
.PDB
,
(http://www.rcsb.org/pdb/static.do?p=software/software_links/analysis_and_verification.html).
,,.
(Magnetic
ResonanceImaging-MRI),,mm
MRI.,
,
. ,
, .
,
.(,),
.,
,
().,
,
(,),
(,...),
.
,
.,
-,-...
,
,
,
(,
). ,
,,
-,.
DSSP (Define Secondary Structure of Proteins),
http://swift.cmbi.ru.nl/gv/dssp/,
(Kabsch & Sander, 1983). DSSP
,
. , DSSP
8.310-,-,-(G,H
) 3, 4, 5
.--()-(),
S . ,
. , ..
. - , -
(coil)C.
(G,I),,C.2002
9.2.
,
.,
19501960.,
. ,
.
, ball and stick.
,(wireframe)
,().
. C (
, )
.
, (space-filling
model),,,
van der Waals
. CPK models Corey, Pauling, Koltun,
,
.
,,
. ,
.
.
,..
.
()().
,-
/-, (Ribbon diagram) ,
.
. , -
(ribbon) , -
.,,
.,
. ,
,
.
.
9.4: , ATP. ,
2 , (PDB code 2RH1, https://en.wikipedia.org/wiki/Space-filling_model)
,.
,
.,
PDB.,
( , ,
...).
, .
,
, , ,
.,.
,
,ribbon.,
,
,
wireframe ball and stick . ,
,
,(.).
9.5:
. Planctomyces limnophilus (PDB code
3TVA). PV.
PDB
(http://www.rcsb.org/pdb/static.do?p=software/software_links/molecular_graphics.html). PDB
. RCSB Simple Viewer
(http://biojava.org/wiki/RCSB_Viewers:About), JavaWebStart
, Jmol (http://jmol.sourceforge.net/),
Java Applet, Jsmol
JavaScriptHTML5(http://sourceforge.net/projects/jsmol/).
.
PV WebGL
.
RasMol
(http://www.bernstein-plus-sons.com/software/rasmol/),
,
SWISS-MODEL
(.
)
Cn3D
(http://www.ncbi.nlm.nih.gov/Structure/CN3D/cn3d.shtml)
NCBIEntrez,alignmenteditor.,
,
PyMol (http://www.pymol.org/pymol) Python
OpenGL Extension Wrangler Library (GLEW). PyMol
,,
.,
WHAT IF(.)3D
(
SGIUnix).
9.3.
( )
(
). ,
, ,
.,
,
,
.,
,
.,(
).
,
4,
,
/
().
,
.,
( ).
, ..
. ,
(11,22...),
(structural superposition).
,,
.
().
( 9.6),
()
.,,
.,
,RMSD(RootMeanSquareDeviation):
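$$\mathrm{RMSD} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \delta_i^{2}}$$
where N is the number of equivalent atom pairs and $\delta_i$ is the distance between the two atoms of pair i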
.
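The RMSD (Root Mean Square Deviation) named above is the square root of the mean squared distance over N pairs of equivalent atoms. A minimal sketch of that calculation follows; it assumes the two coordinate sets are already superposed and listed in the same atom order (for example the C-alpha atoms), and the function name `rmsd` and the sample coordinates are illustrative only. It does not perform the optimal superposition itself.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of (x, y, z) coordinates.

    Assumption: the structures are already superposed and the i-th atom
    of coords_a is equivalent to the i-th atom of coords_b.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        squared += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(squared / len(coords_a))


# Illustrative coordinates: the second set is the first shifted by 1 unit along x.
a = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 1.0, 0.0)]
b = [(x + 1.0, y, z) for x, y, z in a]
print(rmsd(a, a), rmsd(a, b))
```

A rigid shift of every atom by one unit gives an RMSD of exactly one unit, which is why a superposition step is needed before the value says anything about structural similarity.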
9.6: .
, (C)
. ,
-
.,,,
.
(least squares method)
,
(maximum likelihood) (Theobald & Wuttke, 2006a, 2006b) (robust)
,leastmediansquaresregression(LMS)(Liu,Fang,&Ramani,2009).
LSQMAN
(http://xray.bmc.uu.se/usf/lsqman_man.html), THESEUS (http://www.theseus3d.org)
LMSfit
(https://engineering.purdue.edu/PRECISE/LMSfit).
Profit
(http://www.bioinf.org.uk/software/profit/)
McLachlan(McLachlan,1982),
3dSS(http://cluster.physics.iisc.ernet.in/3dss/)
RasMol Profit,
, (Sumathi, Ananthalakshmi,
Roshan,&Sekar,2006).
,
, .
,
.
; ,
.,
,
/(9.7).
9.7: .
,,,
,.
.
,
(:H,E,C),
,
.
(threading).,
. SuperPose (http://wishart.biology.ualberta.ca/SuperPose/)
(Maiti, Van Domselaar, Zhang, & Wishart, 2004). . ,
,
.
, SuperPose
.,
,,
,
,
9.8: ( ).
.
,
, . ,
,,NP-complete (Lathrop, 1994), ,
,(Poleksic,2009).,
(
),
. , ,
,.
, DALI, (distance alignment matrix
method), http://ekhidna.biocenter.helsinki.fi/dali_server/start
,(DALIlite).(1993),
().
(contacts)
.
.
,.
6x6
(Holm & Rosenström, 2010). DALI FSSP (Families of
Structurally Similar Proteins)
.
CE (Combinatorial extension),
http://source.rcsb.org/jfatcatserver/ceHome.jsp PDB.
,
(CE-MC). CE DALI
, .
(aligned fragment pairs AFPs)
AFP.
,
,,...(Shindyalov&Bourne,1998).
To SSAP (Sequential Structure Alignment Program) ,
(http://www.biochem.ucl.ac.uk/~orengo/ssap.html). ,
C,
.
.,
.
,
(Taylor&Orengo,1989).,
CATH.
SSM (http://www.ebi.ac.uk/msd-srv/ssm/),
PDB.,
(Krissinel&
Henrick, 2004). , ,
.
,
.,
.
MASS (http://bioinfo3d.cs.tau.ac.il/MASS/),
(Dror,Benyamini, Nussinov,&
Wolfson,2003).MASS,
( ,),
, ,
.
To ,
(http://ub.cbm.uam.es/software/online/mamothmult.php). MAMMOTH
,,unit-vectorrootmeansquare
(URMS), ,
.
(Ortiz,Strauss,&Olmea,2002).
MUSTANG (multiple structural alignment algorithm)
(Konagurthu, Whisstock, Stuckey, & Lesk, 2006).
,.
, . ,
RMSD C,
(http://www.csse.monash.edu.au/~karun/Site/mustang.html).
,TM-align(http://zhanglab.ccmb.med.umich.edu/TM-align/)
(Zhang & Skolnick, 2005).
,
(C).,
(4CE20DALI),
.
,
. Wikipedia
(https://en.wikipedia.org/wiki/Structural_alignment_software), ,
' ,
. ,
(Kolodny,Koehl,
&Levitt,2005;Mayr,Domingues,&Lackner,2007;Singh&Brutlag,2000).,
,
, ,
.,DALI,CETM-align,
.',
, '
.
9.4.
,
. ,
,
.in
silico(,,...),
.,
( ...) ,
.,
: ) , )
. ,
,
, ( ,
).
,
, , (
).
,
,(target)
(template).,
..,
(,
).,
( ,
...),(9.9).,
;,
(),
: ,
,
.
, .
. , 30% ,
20%
(twilightzone).
9.9: % .
,,
.
,
(,80%,50%,40%...),
(<30%) .
,
.,,
,
(homology modelling). ,
(threading), (ab initio modelling),
.
, ,
, ,
(,
).abinitio,
, ,
,
.
9.10: .
, 3
. ,
,(>70%)
(RMSD < 1 Å),
().,
2-4 Å.,ab initio,
RMSD 4-10 Å,
.
9.4.1.
,(),
( 9.10). ,
,
.
:
.
BLASTFASTA,,
. ,
.
,.
().
.
.,
.
(C, C, N, ).
,
.
.
.,
.
,
,(loop),
. ,
,
.,,
, PDB
,abinitio,
.,
C-C.
, (>35%)
.
PDB.
.
,C
( ).
(),
(moleculardynamics).
,
.
.,
. ,
(
,,...).
.
,
WHAT IF (http://swift.cmbi.ru.nl/whatif/), 1987 Gert
Vriend (Vriend, 1990). ,
(, , 3D , , ,
...),,
. MODELLER (https://salilab.org/modeller/)
(Eswar et al., 2006). MODELLER
( , ,
) . MODELLER
Šali Blundell (Šali & Blundell, 1993),
de novo , ,
,,,...,
/ (Unix/Linux, Windows, Mac)
, EasyModeller
(http://modellergui.blogspot.gr/). , SWISS-MODEL (http://swissmodel.expasy.org/)
.
,
,,
(Biasinietal.,2014).
9.4.2.
,.,
().
, ,
. ,
, ,
().
,
1000-2000 ( 1300 ). ,
,
.
, ,
(..
).
9.11: .
,
.
(1D), ,
. ,
...,
,
, . ,
(3D)
.,
(),
.,
,.
Bowie, Lüthy Eisenberg 1991 (Bowie,
Luthy, & Eisenberg, 1991) (threading)
Jones, Taylor Thornton 1992 (Jones, Taylor, & Thornton, 1992)
. ' ,
.
,THREADERDavidJones
( http://bioinf.cs.ucl.ac.uk/?id=747),
19923.5
(Jones, 1998).
,PHDthreaderBurkhartRost(Rost,Schneider,&Sander,1997),
server Predict Protein (www.predictprotein.org).
To PHDthreader PHD
( , ).
genTHREADER (Jones, 1999),
PSI-PRED,
,
server (http://bioinf.cs.ucl.ac.uk/psipred/). genTHREADER
MODELLER ,
.
, HHpred (Söding, Biegert, & Lupas,
2005). To HHpred profile
HMM ( HHsearch),
(Söding, 2005).
(PDB, SCOP, PFAM, SMART ...),
,
MODELLER (http://toolkit.tuebingen.mpg.de/hhpred). To
Phyre2(http://www.sbg.bio.ic.ac.uk/phyre2).
,profile-profile,PSSM,
HHsearch(Kelley,Mezulis,Yates,Wass,&Sternberg,2015).To
Phyre2 PSI-PRED,
MEMSAT, -
DISOPRED, ,
abinitio.,,
, , ( ,
, ab initio ),
.
,
RaptorX
(http://raptorx.uchicago.edu/)
MUSTER
(http://zhang.bioinformatics.ku.edu/MUSTER)
,
. RaptorX
,
(multiple template threading)
(Peng & Xu, 2011). MUSTER ,
(,,,
...),
,
.,
, (protein folding problem)
. ,
(,
NP-complete).,
, ,
. , ab initio ,
(
). , deNovo,
.
,
.
9.12: ab initio
.
. Anfinsen
,
Levinthal
. , 100 (
), 99 198
. 3 (
),3^198,
.
, ...
. ,
microsecond millisecond, (
, ,
...). , , ,
abinitiodenovo.
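The counting argument behind the Levinthal paradox in the paragraph above (100 residues, 99 peptide bonds, 198 backbone torsion angles, 3 states per angle) is easy to reproduce. In the sketch below only those four numbers come from the text; the sampling rate of 10^13 conformations per second is an assumed, purely illustrative figure.

```python
residues = 100
peptide_bonds = residues - 1          # 99 bonds join 100 residues
torsion_angles = 2 * peptide_bonds    # one phi and one psi angle per bond: 198
states_per_angle = 3                  # three stable conformations per angle

conformations = states_per_angle ** torsion_angles  # 3**198, exact integer

# Assumption (not from the text): the chain samples 1e13 conformations per second.
samples_per_second = 1e13
seconds = conformations / samples_per_second
years = seconds / (3600 * 24 * 365)

print(f"conformations: {conformations:.2e}")
print(f"years for an exhaustive search: {years:.2e}")
```

The count is on the order of 3 × 10^94, so an exhaustive search is out of the question even at this optimistic rate, while real proteins fold on the microsecond to millisecond scale mentioned in the text.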
,abinitio,(
),
( 50 ),
.,
,,
().:
.
,
.,
:C,C,
.
. ,
, 6-7
.
.
.
.,
,
.
AMBER, CHARMM, UNRES ASTRO-FOLD.
(
ROSETTA TASSER/I-TASSER).
.
.
, Monte Carlo,
Simulated Annealing ( ),
.
(Molecular Dynamics),
. ,
, .
,
.
.
, CHARMM
(http://www.charmm.org/) AMBER (http://ambermd.org/)
. CHARMM
(Chemistry at HARvard Macromolecular Mechanics)
, Martin Karplus Harvard (Brooks et al., 2009).
(,,
).AMBER(AssistedModelBuildingwithEnergyRefinement)
PeterKollmanUniversity
of California (Case et al.,
2005)..
,GROMACS (http://www.gromacs.org).GROMACS
,Windows(Pronket
al., 2013). ,
ab initio ,
,
.
abinitio/denovo,
ROSETTA (http://robetta.bakerlab.org/). To ROSETTA ,
abinitio
( 9) PDB. Bowie
Eisenberg1994.
C ( )
,MonteCarlo
(Rohl,Strauss,Misura,&Baker,2004).
, I-TASSER (http://zhanglab.ccmb.med.umich.edu/I-TASSER/about.html)
ab initio
.I-TASSER,
,
CASP(Helles,2008).
,,.
I-TASSER Rosetta Monte Carlo (
), ,
.
ePROPAINOR (http://www.math.iitb.ac.in/epropainor) PROTinfo
(http://ram.org/compbio/protinfo/), ( I-TASSER ROSETTA). , QUARK
(http://zhanglab.ccmb.med.umich.edu/QUARK/),CABSfold(http://biocomp.chem.uw.edu.pl/CABSfold/),
PEP-FOLD (http://bioserv.rpbs.univ-paris-diderot.fr/services/PEP-FOLD/) BHAGEERATH (
http://www.scfbio-iitd.res.in/bhageerath/index.jsp),
(<100 ),
.
, (distributed) ab initio .
, ROSETTA@home (http://boinc.bakerlab.org/rosetta/)
Folding@home(http://folding.stanford.edu/).,
,
.
ROSETTA@home,
.
, .
FOLDit (http://fold.it/portal/) /
(,
).
,
.
9.5.
(docking)
.,
.
(Bonvin,2006;Gray,2006;Sternberg,
Gabb,&Jackson,1998),
, , , ... (Taylor, Jewsbury, & Essex, 2002). ,
DNA- DNA- . ,
,
,,
-, -,
.
,,
,,
(Alvarez,2004).
9.13: .
.,
-,
(,
...).,,
(rigid docking),
(
),.,
(
) ( ).
,
(flexible docking) ( )
.
,,,
. ,
. ,
.
,
.,
,.
,
.
, .
abinitio.,
- - ,
,
, .
. , ,
abinitio,
,
.
9.14: ( https://en.wikipedia.org/wiki/Docking_%28molecular%29).
, ab initio
,
.,
: (Halperin, Ma, Wolfson, &
Nussinov,2002;Moreira,Fernandes,&Ramos,2010).,
,.,
.
,,
.
(computer
aided drug discovery), . Swiss
Institute of Bioinformatics
(http://www.click2drug.org/directory_Docking.html).
,
(Rodrigues & Bonvin, 2014). ,
, ) , )
, ) -, )
)
.
GRAMM (http://vakser.bioinformatics.ku.edu/resources/gramm/gramm1/). GRAMM (
Global RAnge Molecular Matching)
, ,
. ,
.
. ,
,..
, .
, GRAMM-X
(http://vakser.compbio.ku.edu/resources/gramm/grammx/),
.
AutoDock(http://autodock.scripps.edu/).
.
HADDOCK (High Ambiguity Driven protein-protein DOCKing)
.
HADDOCKabinitio
(http://haddocking.org/).
( )
.
,
HADDOCK.
FTDock(http://www.sbg.bio.ic.ac.uk/docking/ftdock.html)
.
Fourier
.
DOT(http://www.sdsc.edu/CCMS/DOT/)
- .
DOT
.
vanderWaals,Fourier.
ZDOCK (http://www.umassmed.edu/zlab/)
.
,
. ZDOCK
.
To ClusPro (http://cluspro.bu.edu/),
.
Fourier.
RMSDCHARMM.
.
SwissDock(http://www.swissdock.ch/)
.SwissDock
EADock DSS,
(localdocking)
(blind docking). , CHARMM
.
rDock (http://rdock.sourceforge.net/)
.
HighThroughputVirtualScreening(HTVS).
, Linux,
cluster
CPUs.
, RosettaDock(http://rosie.rosettacommons.org/docking2)
Rosetta, . Monte Carlo (MC)
:
(),
.,
.
(Rodrigues&Bonvin,2014;R.D.Taylor,etal.,2002).
CASP CAFASP , ,
, CAPRI (Critical Assessment of
PRediction of Interactions). CAPRI
,
,
.
,,
(http://www.ebi.ac.uk/msd-srv/capri/).
,
.,
,
. ,
.,-
,
(,,
...).
Adams, P. D., Afonine, P. V., Bunkoczi, G., Chen, V. B., Davis, I. W., Echols, N., ... Zwart, P. H. (2010). PHENIX: a comprehensive Python-based system for macromolecular structure solution. Acta Crystallogr D Biol Crystallogr, 66(Pt 2), 213-221.
Alvarez, J. C. (2004). High-throughput docking as a source of novel drug leads. Current opinion in chemical biology, 8(4), 365-370.
Andersen, C. A., Palmer, A. G., Brunak, S., & Rost, B. (2002). Continuum secondary structure captures protein flexibility. Structure, 10(2), 175-184.
Biasini, M., Bienert, S., Waterhouse, A., Arnold, K., Studer, G., Schmidt, T., ... Bordoli, L. (2014). SWISS-MODEL: modelling protein tertiary and quaternary structure using evolutionary information. Nucleic Acids Research, gku340.
Bonvin, A. M. (2006). Flexible protein-protein docking. Current Opinion in Structural Biology, 16(2), 194-200.
Bowie, J. U., Luthy, R., & Eisenberg, D. (1991). A method to identify protein sequences that fold into a known three-dimensional structure. Science, 253(5016), 164-170.
Brooks, B. R., Brooks, C. L., MacKerell, A. D., Nilsson, L., Petrella, R. J., Roux, B., ... Boresch, S. (2009). CHARMM: the biomolecular simulation program. Journal of computational chemistry, 30(10), 1545-1614.
Case, D. A., Cheatham, T. E., Darden, T., Gohlke, H., Luo, R., Merz, K. M., ... Woods, R. J. (2005). The Amber biomolecular simulation programs. Journal of computational chemistry, 26(16), 1668-1688.
Chen, V. B., Arendall, W. B., 3rd, Headd, J. J., Keedy, D. A., Immormino, R. M., Kapral, G. J., ... Richardson, D. C. (2010). MolProbity: all-atom structure validation for macromolecular crystallography. Acta Crystallogr D Biol Crystallogr, 66(Pt 1), 12-21.
Dror, O., Benyamini, H., Nussinov, R., & Wolfson, H. J. (2003). Multiple structural alignment by secondary structures: algorithm and applications. Protein Science, 12(11), 2492-2507.
Eswar, N., Marti-Renom, M. A., Webb, B., Madhusudhan, M. S., Eramian, D., Shen, M., ... Sali, A. (2006). Comparative Protein Structure Modeling With MODELLER. Current Protocols in Bioinformatics (Vol. 5.6.1-5.6.30): John Wiley & Sons, Inc.
Frishman, D., & Argos, P. (1995). Knowledge-based protein secondary structure assignment. Proteins, 23(4), 566-579.
Güntert, P. (2011). Automated protein structure determination from NMR data. In A. J. Dingley & S. M. Pascal (Eds.), Biomolecular NMR spectroscopy (pp. 341). Amsterdam: IOS Press.
Gray, J. J. (2006). High-resolution protein-protein docking. Current Opinion in Structural Biology, 16(2), 183-193.
Halperin, I., Ma, B., Wolfson, H., & Nussinov, R. (2002). Principles of docking: An overview of search algorithms and a guide to scoring functions. Proteins: Structure, Function, and Bioinformatics, 47(4), 409-443.
Heinig, M., & Frishman, D. (2004). STRIDE: a web server for secondary structure assignment from known atomic coordinates of proteins. Nucleic Acids Res, 32(Web Server issue), W500-502.
Helles, G. (2008). A comparative study of the reported performance of ab initio protein structure prediction algorithms. Journal of the Royal Society Interface, 5(21), 387-396.
Holm, L., & Rosenström, P. (2010). Dali server: conservation mapping in 3D. Nucleic Acids Research, 38(suppl 2), W545-W549.
Hooft, R. W., Vriend, G., Sander, C., & Abola, E. E. (1996). Errors in protein structures. Nature, 381(6580), 272.
Jones, D. T. (1998). THREADER: Protein Sequence Threading by Double Dynamic Programming. In S. Salzberg, D. Searls & S. Kasif (Eds.), Computational Methods in Molecular Biology: Elsevier Science.
Jones, D. T. (1999). GenTHREADER: an efficient and reliable protein fold recognition method for genomic sequences. Journal of molecular biology, 287(4), 797-815.
Jones, D. T., Taylor, W. R., & Thornton, J. M. (1992). A new approach to protein fold recognition.
Joosten, R. P., Salzemann, J., Bloch, V., Stockinger, H., Berglund, A. C., Blanchet, C., ... Vriend, G. (2009). PDB_REDO: automated re-refinement of X-ray structure models in the PDB. J Appl Crystallogr, 42(Pt 3), 376-384.
Kabsch, W., & Sander, C. (1983). Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22(12), 2577-2637.
Kelley, L. A., Mezulis, S., Yates, C. M., Wass, M. N., & Sternberg, M. J. E. (2015). The Phyre2 web portal for protein modeling, prediction and analysis. Nat. Protocols, 10(6), 845-858.
Kolodny, R., Koehl, P., & Levitt, M. (2005). Comprehensive evaluation of protein structure alignment methods: scoring by geometric measures. Journal of molecular biology, 346(4), 1173-1188.
Konagurthu, A. S., Whisstock, J. C., Stuckey, P. J., & Lesk, A. M. (2006). MUSTANG: a multiple structural alignment algorithm. Proteins: Structure, Function, and Bioinformatics, 64(3), 559-574.
Krissinel, E., & Henrick, K. (2004). Secondary-structure matching (SSM), a new tool for fast protein structure alignment in three dimensions. Acta Crystallographica Section D: Biological Crystallography, 60(12), 2256-2268.
Laskowski, R. A., MacArthur, M. W., Moss, D. S., & Thornton, J. M. (1993). PROCHECK - a program to check the stereochemical quality of protein structures. J. App. Cryst, 26, 283-291.
Lathrop, R. H. (1994). The protein threading problem with sequence amino acid interaction preferences is NP-complete. Protein engineering, 7(9), 1059-1068.
Liu, Y.-S., Fang, Y., & Ramani, K. (2009). Using least median of squares for structural superposition of flexible proteins. BMC Bioinformatics, 10(1), 29.
Maiti, R., Van Domselaar, G. H., Zhang, H., & Wishart, D. S. (2004). SuperPose: a simple server for sophisticated structural superposition. Nucleic Acids Research, 32(suppl 2), W590-W594.
Mayr, G., Domingues, F. S., & Lackner, P. (2007). Comparative analysis of protein structure alignments. BMC Structural Biology, 7(1), 50.
McLachlan, A. D. (1982). Rapid comparison of protein structures. Acta Crystallogr D Biol Crystallogr, A38, 871-873.
Meiler, J., & Baker, D. (2003). Rapid protein fold determination using unassigned NMR data. Proc Natl Acad Sci U S A, 100(26), 15404-15409.
Moreira, I. S., Fernandes, P. A., & Ramos, M. J. (2010). Protein-protein docking dealing with the unknown. Journal of computational chemistry, 31(2), 317-342.
Ortiz, A. R., Strauss, C. E., & Olmea, O. (2002). MAMMOTH (matching molecular models obtained from theory): an automated method for model comparison. Protein Science, 11(11), 2606-2621.
Pelton, J. T., & McLean, L. R. (2000). Spectroscopic methods for analysis of protein secondary structure. Anal Biochem, 277(2), 167-176.
Peng, J., & Xu, J. (2011). RaptorX: exploiting structure information for protein alignment by statistical inference. Proteins: Structure, Function, and Bioinformatics, 79(S10), 161-171.
Poleksic, A. (2009). Algorithms for optimal protein structure alignment. Bioinformatics, 25(21), 2751-2756.
Pronk, S., Páll, S., Schulz, R., Larsson, P., Bjelkmar, P., Apostolov, R., ... van der Spoel, D. (2013). GROMACS 4.5: a high-throughput and highly parallel open source molecular simulation toolkit. Bioinformatics, btt055.
Read, R. J., Adams, P. D., Arendall, W. B., 3rd, Brunger, A. T., Emsley, P., Joosten, R. P., ... Zwart, P. H. (2011). A new generation of crystallographic validation tools for the protein data bank. Structure, 19(10), 1395-1412.
Rodrigues, J. P., & Bonvin, A. M. (2014). Integrative computational modeling of protein interactions. FEBS Journal, 281(8), 1988-2003.
Rohl, C. A., Strauss, C. E., Misura, K. M., & Baker, D. (2004). Protein structure prediction using Rosetta. Methods in enzymology, 383, 66-93.
Rost, B., Schneider, R., & Sander, C. (1997). Protein fold recognition by prediction-based threading. Journal of molecular biology, 270(3), 471-480.
Roy, A., Kucukural, A., & Zhang, Y. (2010). I-TASSER: a unified platform for automated protein structure and function prediction. Nature protocols, 5(4), 725-738.
Šali, A., & Blundell, T. L. (1993). Comparative protein modelling by satisfaction of spatial restraints. Journal of molecular biology, 234(3), 779-815.
Schwieters, C. D., Kuszewski, J. J., & Clore, G. M. (2006). Using Xplor-NIH for NMR molecular structure determination. Progr. NMR Spectroscopy, 48, 47-62.
Shi, Y. (2014). A glimpse of structural biology through X-ray crystallography. Cell, 159(5), 995-1014.
Shindyalov, I. N., & Bourne, P. E. (1998). Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein engineering, 11(9), 739-747.
Singh, A. P., & Brutlag, D. L. (2000). Protein Structure Alignment: A comparison of methods. Bioinformatics.
Söding, J. (2005). Protein homology detection by HMM-HMM comparison. Bioinformatics, 21(7), 951-960.
Söding, J., Biegert, A., & Lupas, A. N. (2005). The HHpred interactive server for protein homology detection and structure prediction. Nucleic Acids Research, 33(suppl 2), W244-W248.
Sternberg, M. J., Gabb, H. A., & Jackson, R. M. (1998). Predictive docking of protein-protein and protein-DNA complexes. Current Opinion in Structural Biology, 8(2), 250-256.
Sumathi, K., Ananthalakshmi, P., Roshan, M. M., & Sekar, K. (2006). 3dSS: 3D structural superposition. Nucleic Acids Research, 34(suppl 2), W128-W132.
Taylor, R. D., Jewsbury, P. J., & Essex, J. W. (2002). A review of protein-small molecule docking methods. Journal of computer-aided molecular design, 16(3), 151-166.
Taylor, W. R., & Orengo, C. A. (1989). Protein structure alignment. Journal of molecular biology, 208(1), 1-22.
Theobald, D. L., & Wuttke, D. S. (2006a). Empirical Bayes hierarchical models for regularizing maximum likelihood estimation in the matrix Gaussian Procrustes problem. Proceedings of the National Academy of Sciences, 103(49), 18521-18527.
Theobald, D. L., & Wuttke, D. S. (2006b). THESEUS: maximum likelihood superpositioning and analysis of macromolecular structures. Bioinformatics, 22(17), 2171-2172.
Vriend, G. (1990). WHAT IF: a molecular modeling and drug design program. Journal of molecular graphics, 8(1), 52-56.
Winn, M. D., Ballard, C. C., Cowtan, K. D., Dodson, E. J., Emsley, P., Evans, P. R., ... Wilson, K. S. (2011). Overview of the CCP4 suite and current developments. Acta Crystallogr D Biol Crystallogr, 67(Pt 4), 235-242.
Wu, S., & Zhang, Y. (2007). LOMETS: a local meta-threading-server for protein structure prediction. Nucleic Acids Research, 35(10), 3375-3382.
Wu, S., & Zhang, Y. (2008). MUSTER: improving protein sequence profile-profile alignments by using multiple sources of structure information. Proteins: Structure, Function, and Bioinformatics, 72(2), 547-556.
Yaffe, M. B. (2005). X-ray crystallography and structural biology. Crit Care Med, 33(12 Suppl), S435-440.
Zhang, Y., & Skolnick, J. (2005). TM-align: a protein structure alignment algorithm based on the TM-score. Nucleic Acids Research, 33(7), 2302-2309.
10:
, ,
(,
),
.
RNA
.
, ,
.
( 4), ( 8), ( 7).
10.
,(regularexpressions),
(patterns) , (profiles) Hidden Markov Models (HMMs),
.
. ,
,
.
(formal language theory),
.,,
.(rewriting rules) A → xB,
, , - (nonterminal
symbols),,,
(terminalsymbols),.
-x.,-S,
, -
.
,
, Noam Chomsky.
, ,
(),,
. ,
(Durbin, Eddy, Krogh, & Mitchison, 1998)
,..(Searls,2002).
10.1. Chomsky
,G(V, T, P, S). ,
, , ,
,.
(regulargrammars),3,
(right-linear) (left-linear).
,:
W1 → aW2, W → a
,:
W1 → W2a, W → a
,
,
,
,.(Finite
StateAutomata).,,(regular
expressions),.
, ,
.
(context free grammar),
2,:
W → β
, (string) -
( ).
(Push Down Automata).
,
.
(context sensitive grammar),
1, (monotonic grammar).
:
α1Wα2 → α1βα2
,a,
- ,
- .
, , .
.
(LinearlyBoundedAutomata).
, (unrestricted grammars),
0,:
α1Wα2 → γ
-
.,
.
( ) ( ) .
.,
( )
.
(recursively enumerable
languages).(TuringMachines).
10.1: .
(
https://en.wikipedia.org/wiki/Chomsky_hierarchy).
10.2.
, , .
xy,S → xS
S → y.x
y. S ⇒ xS ⇒ xxS ⇒ xxxS ⇒ xxxy,
(
).
,
,.
, , (parsing)
(Finite State Automata).
(regularexpressions)
.,10.2,
PROSITE ( ,
PROSITE).
10.2: .
,PROSITE:
[RK]-G-{EDRKHPCG}-[AGSCI]-[FY]-[LIVA]-x-[FYM]
:
[RK]G[^EDRKHPCG][AGSCI][FY][LIVA].[FYM]
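The correspondence between the PROSITE pattern and the regular expression shown above can be made mechanical. The sketch below is a minimal converter written for illustration: the function name `prosite_to_regex` is hypothetical, and it only handles the elements used here (single residues, `[..]` sets, `{..}` exclusions, `x`, and optional `(n)` or `(n,m)` repeat counts), not the anchors `<` and `>` or other extensions of the PROSITE syntax.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.split("-"):
        # Split an element such as "x(2,4)" into its core and repeat counts.
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = m.group(1), m.group(2), m.group(3)
        if core == "x":
            token = "."                        # any residue
        elif core.startswith("{"):
            token = "[^" + core[1:-1] + "]"    # excluded residues
        else:
            token = core                       # "[..]" set or a single residue
        if low:
            token += "{" + low + ("," + high if high else "") + "}"
        parts.append(token)
    return "".join(parts)


regex = prosite_to_regex("[RK]-G-{EDRKHPCG}-[AGSCI]-[FY]-[LIVA]-x-[FYM]")
print(regex)
for peptide in ("RGACFVKY", "KGACFVKY", "AGACFVKY"):
    print(peptide, bool(re.fullmatch(regex, peptide)))
```

The first two peptides match and the third does not, because the pattern requires R or K at the first position.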
,
.,
(S)R,
G. ,
RG.,...
,,
,rR,
...,,:
S → rW1 | kW1
W1 → gW2
W2 → [afilmnqstvwy] W3
W3 → [agsci] W4
W4 → fW5 | yW5
W5 → lW6 | iW6 | vW6 | aW6
W6 → [acdefghiklmnpqrstvwy] W7
W7 → f | y | m
,
,:
S ⇒ rW1
⇒ rgW2
⇒ rgaW3
⇒ rgacW4
⇒ rgacfW5
⇒ rgacfvW6
⇒ rgacfvkW7
⇒ rgacfvky
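The derivation above can be reproduced by treating the right-linear grammar as a finite state automaton in which each nonterminal is a state. The following is a minimal sketch under that reading: the dictionary `GRAMMAR` encodes the eight rules of the example (with W2 accepting every residue outside {EDRKHPCG}), the function name `derive` is illustrative, and the walk assumes that at most one rule applies for each state and symbol, which holds for this grammar.

```python
# nonterminal -> list of (allowed terminals, next nonterminal or None)
GRAMMAR = {
    "S":  [("rk", "W1")],
    "W1": [("g", "W2")],
    "W2": [("afilmnqstvwy", "W3")],
    "W3": [("agsci", "W4")],
    "W4": [("fy", "W5")],
    "W5": [("liva", "W6")],
    "W6": [("acdefghiklmnpqrstvwy", "W7")],
    "W7": [("fym", None)],            # terminating rules W7 -> f | y | m
}


def derive(seq, grammar=GRAMMAR, start="S"):
    """Return the successive sentential forms, or None if seq is rejected."""
    forms, state = [start], start
    for i, ch in enumerate(seq):
        if state is None:
            return None               # symbols left after a terminating rule
        for terminals, nxt in grammar[state]:
            if ch in terminals:
                state = nxt
                forms.append(seq[:i + 1] + (nxt or ""))
                break
        else:
            return None               # no rule emits this symbol here
    return forms if state is None else None


print(" => ".join(derive("rgacfvky")))
print(derive("agacfvky"))
```

Each accepted symbol consumes one rule, so recognition takes time linear in the sequence length, which is the practical appeal of regular grammars.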
, PROSITE
. ,
. ,
;rg;.,
,()
.,rgacfvkykgacfvky,
,agacfvky.
,,,
.,
,.
8, ( + -) 4
(DNA).10.3,
,.,
,-,(
),.
,
,-(
):
B → + | - | E
+ → + | - | E
- → + | - | E
, ,
:
+: a|c|g|t
-: a|c|g|t
, .
,:
B → a+ | c+ | g+ | t+ | a- | c- | g- | t- | E
+ → a+ | c+ | g+ | t+ | a- | c- | g- | t- | E
- → a+ | c+ | g+ | t+ | a- | c- | g- | t- | E
, ,
. , B → a+
P(B → a+) = P(+|B)P(a|+), + → a- P(+ → a-) =
P(-|+)P(a|-),...
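The rule probabilities described above, for example P(B → a+) = P(+|B) P(a|+), follow directly from the transition and emission tables of the HMM. The sketch below builds the stochastic regular grammar from such tables; every number in `TRANS` and `EMIT` is a hypothetical value chosen for illustration, not a parameter taken from the text.

```python
# Hypothetical transition probabilities between B (begin), + , - and E (end).
TRANS = {
    "B": {"+": 0.5, "-": 0.5, "E": 0.0},
    "+": {"+": 0.7, "-": 0.2, "E": 0.1},
    "-": {"+": 0.2, "-": 0.7, "E": 0.1},
}
# Hypothetical emission probabilities of the two emitting states.
EMIT = {
    "+": {"a": 0.15, "c": 0.35, "g": 0.35, "t": 0.15},
    "-": {"a": 0.30, "c": 0.20, "g": 0.20, "t": 0.30},
}


def rule_probabilities(trans, emit):
    """Map (left-hand side, right-hand side) to the probability of the rule."""
    rules = {}
    for lhs, row in trans.items():
        for nxt, p_trans in row.items():
            if nxt == "E":
                rules[(lhs, "E")] = p_trans
            else:
                for symbol, p_emit in emit[nxt].items():
                    # e.g. P(B -> a+) = P(+|B) * P(a|+)
                    rules[(lhs, symbol + nxt)] = p_trans * p_emit
    return rules


rules = rule_probabilities(TRANS, EMIT)
print(rules[("B", "a+")], rules[("+", "a-")])
```

For each nonterminal the probabilities of its rules sum to 1, as required for a stochastic grammar.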
10.3: 8.
( )
,:
B a aa+
aac+
aact+
aactg aactgc aactgcaE
(Finite State Automata)
Mealy Moore,
, ( Moore
Mealy, Mealy
). ,
,
.
10.3. RNA
,
().,
().
-
,
Chomsky. ,
(context-free grammars) , ,
(..xxxxyyyy).(,
) ,
.
,.
S → xSy
( ) x y. ,
(),,
(),
, .
, ( 10.4). ,
.,
: , .
,BletchleyPark,
AlanTuring()ENIGMA,
. , ,
, , Peter Hilton:
Doc, note: I dissent. A fast never prevents a fatness. I diet on cod. (
,...!).
, aabaabaa
S → aSa | bSb | aa | bb
: S ⇒ aSa ⇒ aaSaa ⇒ aabSbaa ⇒ aabaabaa. , a
a(b).
- (
,).
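The palindrome grammar S → aSa | bSb | aa | bb discussed above generates its strings from the outside in: every rule adds one symbol on the left and the same symbol on the right. A minimal sketch of that derivation follows; the function name `derive_palindrome` is illustrative, and the code covers only this particular grammar (even-length palindromes over the alphabet {a, b}).

```python
def derive_palindrome(s):
    """Sentential forms deriving s under S -> aSa | bSb | aa | bb, else None."""
    n = len(s)
    if n < 2 or n % 2:
        return None                   # the grammar only yields even lengths >= 2
    forms = ["S"]
    for d in range(n // 2):
        left, right = s[d], s[n - 1 - d]
        if left != right or left not in "ab":
            return None               # outer symbols must be an identical pair
        inner = "" if d == n // 2 - 1 else "S"   # last step uses aa or bb
        forms.append(s[:d + 1] + inner + s[n - 1 - d:])
    return forms


print(" => ".join(derive_palindrome("aabaabaa")))
print(derive_palindrome("abab"))
```

The dependency between position d and position n-1-d is exactly the long-range, nested correlation that a regular grammar cannot express.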
10.4: , .
,,
RNA. , DNA,
( ),
(A-U, G-C). ,
.,
.
, ,
RNA
.10.5
(loop)
.,
:
acgugccacgauucaacguggcacag
..((((((((......))))))))..
,,
325,3
24...,
(.).
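The bracket notation shown above encodes base pairs as matching parentheses and unpaired positions as dots. The following minimal sketch recovers the pairs with a stack and checks that each one is a Watson-Crick pair; the function name `base_pairs` and the 1-based position convention are choices made for this illustration.

```python
def base_pairs(structure):
    """Return sorted (i, j) pairs (1-based) encoded by a dot-bracket string."""
    stack, pairs = [], []
    for pos, ch in enumerate(structure, start=1):
        if ch == "(":
            stack.append(pos)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced ')' at position %d" % pos)
            pairs.append((stack.pop(), pos))
        elif ch != ".":
            raise ValueError("unexpected character %r" % ch)
    if stack:
        raise ValueError("unbalanced '(' at position %d" % stack[-1])
    return sorted(pairs)


WATSON_CRICK = {("a", "u"), ("u", "a"), ("g", "c"), ("c", "g")}

seq = "acgugccacgauucaacguggcacag"
dot = "..((((((((......)))))))).."
for i, j in base_pairs(dot):
    ok = (seq[i - 1], seq[j - 1]) in WATSON_CRICK
    print(i, j, seq[i - 1], seq[j - 1], ok)
```

The outermost pair joins positions 3 and 24, the next one 4 and 23, and so on inwards to the six unpaired loop positions; the pairs are nested, never crossing, which is what makes a context-free description possible.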
,
,
:
S → gSc | cSg | aSu | uSa
, ,
,
(..
S → gS | Sg),(
S → S1S2)
(S → g).
S1 → S2g
S2 → aS3
S3 → S4a
S4 → aS5
S5 → gS6c
S6 → uS7a
S7 → gS8c
S8 → cS9g
S9 → cS10g
S10 → aS11u
S11 → cS12g
S12 → gS13c
S13 → aS14
S14 → uS15
S15 → uS16
S16 → cS17
S17 → aS18
S18 → a

S1 ⇒ S2g
⇒ aS3g
⇒ aS4ag
⇒ aaS5ag
⇒ aagS6cag
⇒ aaguS7acag
⇒ aagugS8cacag
⇒ aagugcS9gcacag
⇒ aagugccS10ggcacag
⇒ aagugccaS11uggcacag
⇒ aagugccacS12guggcacag
⇒ aagugccacgS13cguggcacag
⇒ aagugccacgaS14cguggcacag
⇒ aagugccacgauS15cguggcacag
⇒ aagugccacgauuS16cguggcacag
⇒ aagugccacgauucS17cguggcacag
⇒ aagugccacgauucaS18cguggcacag
⇒ aagugccacgauucaacguggcacag
10.7: RNA .
,
, RNA.
,
, , ,
(tRNA).
, , .
, ,
.
(stochastic context-free grammars). ,
,
:
( ).
RNA()
,G-U,C-A,
.
,,
:
(alignmentparsingproblem)
(scoring
problem)
(trainingproblem)
,
.,
Viterbi.
Cocke-Younger-Kasami (CYK algorithm) Younger 1967 (Younger,
1967). , Inside (outside algorithm)
Forward ( outside Backward). ,
, Inside-Outside
Baum-Welch(Forward-Backward)1979(Baker,1979).,
,
,,
(inside/outside).,
(10.1).
, Chomsky
(Chomsky Normal Form). Chomsky 1959 (Chomsky, 1959)
,:
W1 → W2W3
W1 → a
, , ,
.
Chomsky ,
(
). ,
(Lange & Leiß,
2009).,,
,
.
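For a grammar in Chomsky Normal Form, with rules of the two shapes W1 → W2W3 and W1 → a, the CYK algorithm fills a triangular table over all subsequences. The sketch below is the plain, non-stochastic recognition variant, given only to show the table structure; the stochastic version used with SCFGs keeps the best log-probability per cell instead of a set of nonterminals. The example grammar (for strings of the form a...ab...b with equal counts) and the function name `cyk` are assumptions made for this illustration.

```python
def cyk(word, unary, binary, start="S"):
    """Return True if `word` is generated by a CNF grammar.

    unary:  list of (lhs, terminal) for rules lhs -> terminal
    binary: list of (lhs, (b, c))   for rules lhs -> b c
    """
    n = len(word)
    if n == 0:
        return False
    # table[i][j] holds the nonterminals that derive word[i..j] (inclusive).
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(word):
        table[i][i] = {lhs for lhs, terminal in unary if terminal == ch}
    for span in range(2, n + 1):                 # length of the subsequence
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):                # split point
                for lhs, (b, c) in binary:
                    if b in table[i][k] and c in table[k + 1][j]:
                        table[i][j].add(lhs)
    return start in table[0][n - 1]


# Hypothetical CNF grammar: S -> AB | AT, T -> SB, A -> a, B -> b
UNARY = [("A", "a"), ("B", "b")]
BINARY = [("S", ("A", "B")), ("S", ("A", "T")), ("T", ("S", "B"))]

print(cyk("aabb", UNARY, BINARY), cyk("abab", UNARY, BINARY))
```

The three nested loops over span, start and split point give the cubic dependence on the sequence length L that distinguishes SCFG algorithms from the linear-time HMM algorithms.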
Table 10.1: HMM and SCFG algorithms.

Goal                  HMM           SCFG
Optimal alignment     Viterbi       CYK
P(x|θ)                Forward       Inside
EM algorithm          Baum-Welch    Inside-Outside
Memory complexity     O(LM)         O(L^2 M)
Time complexity       O(LM^2)       O(L^3 M^3)
RNA
1990Sakakibara(Sakakibaraetal.,1994).,
RNA Nussinov Zuker.
Nussinov 1978
(Nussinov,Pieczenik,
Griggs, & Kleitman, 1978). Zuker
, (G).
,
,G-U,C-A,(Zuker&
Stiegler,1981).,
SCFG,
,
.,Eddy(SeanR
Eddy, 2004). Zuker MFOLD
(http://unafold.rna.albany.edu/?q=mfold),
. RNAfold (http://rna.tbi.univie.ac.at/cgi-bin/RNAfold.cgi)
,,Zuker
RNA (Lorenz et al., 2011). PFOLD (http://www.daimi.au.dk/~compbio/pfold)
RNA
(Knudsen & Hein, 2003). Dowell Eddy (Dowell & Eddy, 2004)
,
RNA.
.
,
,PFOLD(
).,
CONUS, ,
(http://selab.janelia.org/software/conus/).
TORNADO
(http://selab.janelia.org/software/tornado/tornado.tar.gz)(Rivas,Lang,&Eddy,2012).
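The base-pair maximization approach attributed above to Nussinov et al. (1978) can be written as a short dynamic program. The sketch below is a simplified illustration, not the published implementation: it counts only Watson-Crick pairs, returns the maximum number of nested pairs without a traceback, and the `min_loop` parameter (minimum number of unpaired bases enclosed by a pair) is an assumption added for the example.

```python
def nussinov(seq, min_loop=0):
    """Maximum number of nested Watson-Crick base pairs in `seq`."""
    allowed = {("a", "u"), ("u", "a"), ("g", "c"), ("c", "g")}
    n = len(seq)
    if n == 0:
        return 0
    best = [[0] * n for _ in range(n)]           # best[i][j] for seq[i..j]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            score = max(best[i + 1][j], best[i][j - 1])   # i or j unpaired
            if (seq[i], seq[j]) in allowed and j - i > min_loop:
                inner = best[i + 1][j - 1] if i + 1 <= j - 1 else 0
                score = max(score, inner + 1)             # i pairs with j
            for k in range(i + 1, j):                     # bifurcation
                score = max(score, best[i][k] + best[k + 1][j])
            best[i][j] = score
    return best[0][n - 1]


print(nussinov("gggaaaccc", min_loop=3))
print(nussinov("acgugccacgauucaacguggcacag", min_loop=3))
```

The four cases of the recursion (left base unpaired, right base unpaired, ends paired, bifurcation) mirror the production types of a context-free grammar for RNA, which is why the SCFG algorithms have the same cubic running time.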
RNA,
EddyDurbin,CovarianceModels(
)(Eddy&Durbin,1994).,
, RNA.
SCFG,,profileHMM.,
.
,
.
. ,
,,
.
10.8: ()
.
(, )
37,10.9.,
350%G50%A,750%50%C.
PROSITE ( A-C-[AG]-A-x-T-[CT]-A),
G(3)C(7),G(3)T(7).
,G(3)C(7),
(3)T(7).,covariancemodel,
ACGATTCA,
ACGATTA. ()
PROSITE.
, INFERNAL,
http://infernal.wustl.edu/ Sean Eddy (Nawrocki, Kolbe, & Eddy,
2009), HMMER .
,INFERNAL,RNA
,.
, . ,
, RNA.
PFAM,,INFERNAL
RFAM,RNA,http://rfam.xfam.org/(Gardneretal.,2011).
To EvoFold, http://users.soe.ucsc.edu/~jsp/EvoFold/,
,
(phylo-SCFG).
RNA microRNA (Pedersen et al., 2006). To RNAz,
http://www.tbi.univie.ac.at/~wash/RNAz/
(Washietl, Hofacker, & Stadler, 2005).
, CONTRAfold, http://contra.stanford.edu/contrafold,
SCFG
, conditional log-linear model (CLLM).
CONTRAfold
(Do,Woods,&Batzoglou,2006).
,RNA
.
RNA (pseudoknots). ( 10.10)
().
10.10,
,
( ). ,
AAUUCCGG (nested) ,
AACCUUGG (crossing) ,
.
, ,
.
, . ,
(
Nussinov),.
, NP-complete (Lyngsø &
Pedersen,2000).,
,
. , PKNOTS
http://selab.janelia.org/software/pknots/pknots.tar.gz (Rivas & Eddy, 1999). To CYLOFOLD
http://cylofold.abcc.ncifcrf.gov/ (Bindewald, Kluth, & Shapiro,
2010),KineFOLDhttp://kinefold.curie.fr/cgi-bin/form.pl(Isambert,2009),
IPknot https://github.com/satoken/ipknot, (Sato, Kato, Hamada, Akutsu, & Asai, 2011). ,
SimulFold, http://www.cs.ubc.ca/~irmtraud/simulfold/,
(Meyer & Miklós, 2007).
10.9: (https://en.wikipedia.org/wiki/Pseudoknot).
.
,xxxyyyzzz,x,yz.
,
(context-sensitive grammar)
(copylanguage).,
, .
' ,
.
10.4.
RNA,
(
).,,
.
,:
(204)
( ,
,,VanderWaals)
,
RNA,-(10.11). , -
,Greek-keymotif
-,-,-,--.
,-C=O
.,
,
. ,
(-),
(C=O).
10.10:
.
, ,
.,
,
,.
10.11: - ( ). .
. . helical wheel plot,
. ,
. ,
, .
-.
,
transFold
(http://bioinformatics.bc.edu/clotelab/transFold) ,
, -
- (Waldispuhl,Berger,Clote, &
Steyaert,2006).,
( ...).
, ,
Partifold http://partiFold.csail.mit.edu/
(Waldispuhl,O'Donnell,Devadas,Clote,&Berger,2008).
, Dyrka Nebel,
,.,
,
(Dyrka&Nebel,2009).,,
(Dyrka,Nebel,&Kotulska,2013),
(Dyrka &Nebel, 2007).
,
, ,
,
.
Baker, J. K. (1979). Trainable grammars for speech recognition. The Journal of the Acoustical Society of America, 65(S1), S132-S132.
Bindewald, E., Kluth, T., & Shapiro, B. A. (2010). CyloFold: secondary structure prediction including pseudoknots. Nucleic Acids Research, 38(suppl 2), W368-W372.
Chiang, D., Joshi, A. K., & Searls, D. B. (2006). Grammatical representations of macromolecular structure. Journal of computational biology, 13(5), 1077-1100.
Chomsky, N. (1956). Three models for the description of language. IRE Transactions on Information Theory, 2(3), 113-124.
Chomsky, N. (1959). On Certain Formal Properties of Grammars. Information and Control, 2(2), 137-167.
Do, C. B., Woods, D. A., & Batzoglou, S. (2006). CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics, 22(14), e90-e98.
Dowell, R. D., & Eddy, S. R. (2004). Evaluation of several lightweight stochastic context-free grammars for RNA secondary structure prediction. BMC Bioinformatics, 5(1), 71.
Durbin, R., Eddy, S. R., Krogh, A., & Mitchison, G. (1998). Biological sequence analysis, probabilistic models of proteins and nucleic acids: Cambridge University Press.
Dyrka, W., & Nebel, J.-C. (2007). A probabilistic context-free grammar for the detection of binding sites from a protein sequence. BMC Systems Biology, 1(Suppl 1), P78.
Dyrka, W., & Nebel, J.-C. (2009). A stochastic context free grammar based framework for analysis of protein sequences. BMC Bioinformatics, 10(1), 323.
Dyrka, W., Nebel, J.-C., & Kotulska, M. (2013). Probabilistic grammatical model for helix-helix contact site classification. Algorithms for Molecular Biology, 8(1), 31.
Eddy, S. R. (2004). How do RNA folding algorithms work? Nature biotechnology, 22(11), 1457-1458.
Eddy, S. R., & Durbin, R. (1994). RNA sequence analysis using covariance models. Nucleic Acids Res, 22(11), 2079-2088.
Gardner, P. P., Daub, J., Tate, J., Moore, B. L., Osuch, I. H., Griffiths-Jones, S., ... Bateman, A. (2011). Rfam: Wikipedia, clans and the "decimal" release. Nucleic Acids Res, 39(Database issue), D141-145.
Isambert, H. (2009). The jerky and knotty dynamics of RNA. Methods, 49(2), 189-196.
Ito, K., Murakami, R., Mochizuki, M., Qi, H., Shimizu, Y., Miura, K.-i., ... Uchiumi, T. (2012). Structural basis for the substrate recognition and catalysis of peptidyl-tRNA hydrolase. Nucleic Acids Research, gks790.
Knudsen, B., & Hein, J. (2003). Pfold: RNA secondary structure prediction using stochastic context-free grammars. Nucleic Acids Research, 31(13), 3423-3428.
Lange, M., & Leiß, H. (2009). To CNF or not to CNF? An efficient yet presentable version of the CYK algorithm. Informatica Didactica, 8, 2008-2010.
Lorenz, R., Bernhart, S. H., Zu Siederdissen, C. H., Tafer, H., Flamm, C., Stadler, P. F., & Hofacker, I. L. (2011). ViennaRNA Package 2.0. Algorithms for Molecular Biology, 6(1), 26.
Lyngsø, R. B., & Pedersen, C. N. (2000). RNA pseudoknot prediction in energy-based models. Journal of computational biology, 7(3-4), 409-427.
Mamitsuka, H., & Abe, N. (1994). Predicting location and structure of beta-sheet regions using stochastic tree grammars. Proc Int Conf Intell Syst Mol Biol, 2, 276-284.
Meyer, I. M., & Miklós, I. (2007). SimulFold: simultaneously inferring RNA structures including pseudoknots, alignments, and trees using a Bayesian MCMC framework.
358
Nawrocki, E. P., Kolbe, D. L., & Eddy, S. R. (2009). Infernal 1.0: inference of RNA alignments. Bioinformatics, 25(10), 1335-1337.
Nussinov, R., Pieczenik, G., Griggs, J. R., & Kleitman, D. J. (1978). Algorithms for Loop Matchings. SIAM Journal on Applied Mathematics, 35(1), 68-82.
Pedersen, J. S., Bejerano, G., Siepel, A., Rosenbloom, K., Lindblad-Toh, K., Lander, E. S., ... Haussler, D. (2006). Identification and classification of conserved RNA secondary structures in the human genome. PLoS Comput Biol, 2(4), e33.
Rivas, E., & Eddy, S. R. (1999). A dynamic programming algorithm for RNA structure prediction including pseudoknots. Journal of Molecular Biology, 285(5), 2053-2068.
Rivas, E., Lang, R., & Eddy, S. R. (2012). A range of complex probabilistic models for RNA secondary structure prediction that includes the nearest-neighbor model and more. RNA, 18(2), 193-212.
Sakakibara, Y., Brown, M., Hughey, R., Mian, I. S., Sjolander, K., Underwood, R. C., & Haussler, D. (1994). Stochastic context-free grammars for tRNA modeling. Nucleic Acids Res, 22(23), 5112-5120.
Sato, K., Kato, Y., Hamada, M., Akutsu, T., & Asai, K. (2011). IPknot: fast and accurate prediction of RNA secondary structures with pseudoknots using integer programming. Bioinformatics, 27(13), i85-i93.
Searls, D. B. (2002). The language of genes. Nature, 420(6912), 211-217.
Waldispuhl, J., Berger, B., Clote, P., & Steyaert, J. M. (2006). transFold: a web server for predicting the structure and residue contacts of transmembrane beta-barrels. Nucleic Acids Res, 34(Web Server issue), W189-193.
Waldispuhl, J., O'Donnell, C. W., Devadas, S., Clote, P., & Berger, B. (2008). Modeling ensembles of transmembrane beta-barrel proteins. Proteins, 71(3), 1097-1112.
Washietl, S., Hofacker, I. L., & Stadler, P. F. (2005). Fast and reliable prediction of noncoding RNAs. Proceedings of the National Academy of Sciences of the United States of America, 102(7), 2454-2459.
Younger, D. H. (1967). Recognition and parsing of context-free languages in time n^3. Information and Control, 10(2), 189-208.
Zuker, M., & Stiegler, P. (1981). Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res, 9(1), 133-148.
11:
.
, .
,
( )
.
3, 4, 6 7.
11.
,
,,
, ,
.
,
.,
,
(,RNA).
,
.,,
,
(Zerbino&Birney,2008),(Picardi
&Pesole,2010),(Harbisonetal.,2004),RNA(Rigoutsos,
2010; Vlachos & Hatzigeorgiou, 2013) (Soucy,
Huang, & Gogarten, 2015) . ,
,
(.. ), ,
.
.
,,
:,
,
,()
.
, ,
.
11.1.
,,
.,
(..
, ...).
RNA
. ,
(..GC
...).,
(, )
, () .
,,
. ,
.
,,
, -
20-30% , -
1-2% .
GPCRs ( ,
2% ) 15%
. ,
, (
),
. ,
2.PFAMTCDB,
CAZy...
11.1: .
,
.GC
(Li, 2011).
GC ( GC% content). ,
,
GC
.
, codon bias ( )
,
(..),.
- .
,
codonbias
. ,
, ,
(Quax,Claassens,Soll,&vanderOost,2015).
,,Ouzounis
Kreil,6
(), 2 , 17 2
. GC
(principalcomponentsanalysis).
GC,
(Kreil & Ouzounis, 2001). ,
(Gln)(Glu)
. , (Val) (Thr)
.(His),(Ser)(Asn)
.
, GC
. ,
,
(conceptual translation). ,
,
.
,:64(613).
(2),
21 . ,
,.
(, TGA, TAG). ,
, GC, .
GC
.
,ORF
(34 ). , (Redundancy Reduction)
(nonhypothetical) SwissProt (E-value<10^-6).
, (over-annotation),
(Skovgaard,Jensen,Brunak,Ussery,&Krogh,2001).
11.2:
SwissProt ( ), . ( ).
.
Swissprot,
. , ,
,
(11.2).
(mixture of distributions)
. ,
, . ,
(..100)(..500),
,200
300,.',
.
,GC(,).
11.3: - (
) GC .
( 11.3), ,
GC .
,
, ( outlier ),
A. pernix,
,,
. ,
100% ( ,
).:
, ,
, . ,
, ,
,
.
GC, . -
, ,
. GC . ,
GC.,
11.4, GC ,
GC , .
,,
.,
--
GC.,GC,
(,).
Amino acid   Codons              min(A+T)   min(G+C)
(Ala)        G-C-X               0          2
(Gly)        G-G-X               0          2
(Val)        G-T-X               1          1
(Leu)        C-T-X, T-T-[AG]     1          1
(Ile)        A-T-[ACT]           2          0
(Phe)        T-T-[CT]            2          0
11.4: .
,
(
),(VL)
(FI).,
. , .
,(VL/FI)GC/AT.,
GC,
.
,
GC.,,
()
.
11.5: , .
, GC/AT.
, .
11.2.
,.
,..GC
,,
.,
,,,
.
,
/ ,
,
,().
(Tsoka&Ouzounis,2000)
11.6: . (),
(), ()
().
,
.
,
,
,
,
,
(),
(domains).
(
).,
.
,,.
11.2.1
, , ,
. , ,
(,).
,
().
()
(..
).
,Venn
,
.
,
(LastUniversalCommonAncestor-LUCA).
,Mycoplasma
genitalium 468
.,..Haemophilus influenzae
(1703 ) 240 M. genitalium H.
influenzae. LUCA ( ..
Mycoplasma),,()
. , ( ..
),.
,,
1006 1189,
, 1344 1529, -
( Mycoplasma) (Ouzounis, Kunin, Darzentas, &
Goldovsky,2006).
,,
, . ,
,
1Mbp
. , DNA (Mycoplasma
mycoides JCVI-syn1.0),
(Gibsonetal.,2010).
11.2.2
(
).
.
,..DNA
,
.,
2
12 13 ,
.
11.8: .
,.
(dot-plot)
( BLAST)
. ,
(),
..,
. ,
, ,
.
11.9: . (-),
(-).
. ,
.,
.,2
(
DNA).
11.10: .
, .
11.2.3
.
, .
-,(..
).,
,,
.
.,,
(lacZ)(lacY),
,
.
, ,
, 12
.
,
.
11.11: .
11.2.4
,,
(domains). ,
,.
, ( )
,,
(Enright, Iliopoulos, Kyrpides, & Ouzounis, 1999). ,
,
,
,.
,
. , (DHFR)
,
(TS)
( )
.
11.12: .
,
,,
(, ).
,
,,
,
.,
(
,).
(
BLAST SmithWaterman),(E-value
...). ,
,
. ,
,
.
(interaction
),
( ). ,
, ,
. ,
. ,
,
. ,
(hot-spots),
11.13: .
11.3.
,
,
(Edwards&Holt,2013).
,
. ,
.
ACT (https://www.sanger.ac.uk/resources/software/act/) Java
.
BLAST.BLAST
ACT . ,
.
, . .
CT
(zoom in zoom out)
, ,
(Carveretal.,
2005).
MAUVE(http://darlinglab.org/mauve/mauve.html)Java
.
.MAUVE
,
contigs.
.
. ,
.,
. , MAUVE (
)(SNPs)
,(Darling,Mau,
&Perna,2010).
To EDGAR (http://edgar.cebitec.uni-bielefeld.de)
. EDGAR
.
NCBI,
.,
(synteny plots)
Venn(Blom,etal.,2009).
CGAT (http://mbgd.genome.ad.jp/CGAT/)
. CGAT
client-server, client AlignmentViewer ( Java)
DataServer(Perl).
.
,
.
, CGAT
(Uchiyama,Higuchi,&Kobayashi,2006).
BRIG (BLAST Ring Image Generator, http://sourceforge.net/projects/brig/)
Java,
.,
(),,
. BRIG
,
.
.,
,
.
(Alikhan,Petty,BenZakour,&Beatson,2011).
VISTA(http://genome.lbl.gov/vista/index.shtml)
2000. ,
.
,
,-
().(VISTABrowser)
(VISTA
servers, rVista, mVISTA, phyloVISTA, gVISTA ...)
, ,
,...(Frazer,Pachter,
Poliakov, Rubin, & Dubchak, 2004). VISTA
(standalone),GenomeVISTA,
,(Poliakov,
Foong,Brudno,&Dubchak,2014).
11.14: 8 Yersinia MAUVE (Darling, Miklos, & Ragan, 2008).
,
.,
.
MUMMER (http://mummer.sourceforge.net/), MEGA-BLAST
(http://www.ncbi.nlm.nih.gov/BLAST/), LAGAN (http://bioperl.org/wiki/LAGAN) MGA
(http://bibiserv.techfak.uni-bielefeld.de/mga/).,
(),
,
,
(Carrara et al., 2013). , Ouzounis ,
GeneRAGE (Enright & Ouzounis, 2000), FusionMap
(http://www.omicsoft.com/fusionmap)
(Ge
et
al.,
2011)
MosaicFinder
(http://sourceforge.net/projects/mosaicfinder)(Jachiet,Pogorelcnik,Berry,Lopez,&Bapteste,2013).
11.15: E.coli O157:H7 str. Sakai
27 BRIG.
11.16: (a) KIF3A
(chr5:131949456-132139102) VISTA
. (b) VISTA
KIF3A (c) KIF3A,
(Frazer, et al., 2004).
11.17: ACT, E. coliO104:H4 , E.
coli Ec55989 , E. coliEDL933 (Edwards & Holt, 2013).
Alikhan, N. F., Petty, N. K., Ben Zakour, N. L., & Beatson, S. A. (2011). BLAST Ring Image Generator (BRIG): simple prokaryote genome comparisons. BMC Genomics, 12, 402. doi:10.1186/1471-2164-12-402
Arai, M., Ikeda, M., & Shimizu, T. (2003). Comprehensive analysis of transmembrane topologies in prokaryotic genomes. Gene, 304, 77-86.
Blom, J., Albaum, S. P., Doppmeier, D., Pühler, A., Vorhölter, F.-J., Zakrzewski, M., & Goesmann, A. (2009). EDGAR: a software framework for the comparative analysis of prokaryotic genomes. BMC Bioinformatics, 10(1), 154.
Carrara, M., Beccuti, M., Lazzarato, F., Cavallo, F., Cordero, F., Donatelli, S., & Calogero, R. A. (2013). State-of-the-art fusion-finder algorithms sensitivity and specificity. Biomed Res Int, 2013, 340620. doi:10.1155/2013/340620
Carver, T. J., Rutherford, K. M., Berriman, M., Rajandream, M. A., Barrell, B. G., & Parkhill, J. (2005). ACT: the Artemis Comparison Tool. Bioinformatics, 21(16), 3422-3423. doi:10.1093/bioinformatics/bti553
Darling, A. E., Mau, B., & Perna, N. T. (2010). progressiveMauve: multiple genome alignment with gene gain, loss and rearrangement. PLoS One, 5(6), e11147. doi:10.1371/journal.pone.0011147
Darling, A. E., Miklos, I., & Ragan, M. A. (2008). Dynamics of genome rearrangement in bacterial populations. PLoS Genet, 4(7), e1000128. doi:10.1371/journal.pgen.1000128
Edwards, D. J., & Holt, K. E. (2013). Beginner's guide to comparative bacterial genome analysis using next-generation sequence data. Microb Inform Exp, 3(1), 2. doi:10.1186/2042-5783-3-2
Enright, A. J., Iliopoulos, I., Kyrpides, N. C., & Ouzounis, C. A. (1999). Protein interaction maps for complete genomes based on gene fusion events. Nature, 402(6757), 86-90. doi:10.1038/47056
Enright, A. J., & Ouzounis, C. A. (2000). GeneRAGE: a robust algorithm for sequence clustering and domain detection. Bioinformatics, 16(5), 451-457.
Frazer, K. A., Pachter, L., Poliakov, A., Rubin, E. M., & Dubchak, I. (2004). VISTA: computational tools for comparative genomics. Nucleic Acids Res, 32(Web Server issue), W273-279. doi:10.1093/nar/gkh458
Ge, H., Liu, K., Juan, T., Fang, F., Newman, M., & Hoeck, W. (2011). FusionMap: detecting fusion genes from next-generation sequencing data at base-pair resolution. Bioinformatics, 27(14), 1922-1928. doi:10.1093/bioinformatics/btr310
Gibson, D. G., Glass, J. I., Lartigue, C., Noskov, V. N., Chuang, R. Y., Algire, M. A., ... Venter, J. C. (2010). Creation of a bacterial cell controlled by a chemically synthesized genome. Science, 329(5987), 52-56. doi:10.1126/science.1190719
Harbison, C. T., Gordon, D. B., Lee, T. I., Rinaldi, N. J., Macisaac, K. D., Danford, T. W., ... Young, R. A. (2004). Transcriptional regulatory code of a eukaryotic genome. Nature, 431(7004), 99-104. doi:10.1038/nature02800
Jachiet, P. A., Pogorelcnik, R., Berry, A., Lopez, P., & Bapteste, E. (2013). MosaicFinder: identification of fused gene families in sequence similarity networks. Bioinformatics, 29(7), 837-844. doi:10.1093/bioinformatics/btt049
Kreil, D. P., & Ouzounis, C. A. (2001). Identification of thermophilic species by the amino acid compositions deduced from their genomes. Nucleic Acids Res, 29(7), 1608-1615.
Li, W. (2011). On parameters of the human genome. J Theor Biol, 288, 92-104. doi:10.1016/j.jtbi.2011.07.021
Ouzounis, C. A., Kunin, V., Darzentas, N., & Goldovsky, L. (2006). A minimal estimate for the gene content of the last universal common ancestor--exobiology from a terrestrial perspective. Res Microbiol, 157(1), 57-68. doi:10.1016/j.resmic.2005.06.015
Papadimitriou, K., Anastasiou, R., Mavrogonatou, E., Blom, J., Papandreou, N. C., Hamodrakas, S. J., ... Pot, B. (2014). Comparative genomics of the dairy isolate Streptococcus macedonicus ACA-DC 198 against related members of the Streptococcus bovis/Streptococcus equinus complex. BMC Genomics, 15(1), 272.
Picardi, E., & Pesole, G. (2010). Computational methods for ab initio and comparative gene finding. Methods Mol Biol, 609, 269-284. doi:10.1007/978-1-60327-241-4_16
Poliakov, A., Foong, J., Brudno, M., & Dubchak, I. (2014). GenomeVISTA--an integrated software package for whole-genome alignment and visualization. Bioinformatics, 30(18), 2654-2655. doi:10.1093/bioinformatics/btu355
Quax, T. E., Claassens, N. J., Soll, D., & van der Oost, J. (2015). Codon Bias as a Means to Fine-Tune Gene Expression. Mol Cell, 59(2), 149-161. doi:10.1016/j.molcel.2015.05.035
Rigoutsos, I. (2010). Short RNAs: how big is this iceberg? Curr Biol, 20(3), R110-113. doi:10.1016/j.cub.2009.12.036
Shimizu, T., Mitsuke, H., Noto, K., & Arai, M. (2004). Internal gene duplication in the evolution of prokaryotic transmembrane proteins. J Mol Biol, 339(1), 1-15. doi:10.1016/j.jmb.2004.03.048
Skovgaard, M., Jensen, L. J., Brunak, S., Ussery, D., & Krogh, A. (2001). On the total number of genes and their length distribution in complete microbial genomes. Trends Genet, 17(8), 425-428. PII: S0168-9525(01)02372-1
Soucy, S. M., Huang, J., & Gogarten, J. P. (2015). Horizontal gene transfer: building the web of life. Nat Rev Genet, 16(8), 472-482. doi:10.1038/nrg3962
Tsoka, S., & Ouzounis, C. A. (2000). Recent developments and future directions in computational genomics. FEBS Lett, 480(1), 42-48.
Uchiyama, I., Higuchi, T., & Kobayashi, I. (2006). CGAT: a comparative genome analysis tool for visualizing alignments in the analysis of complex evolutionary changes between closely related genomes. BMC Bioinformatics, 7, 472. doi:10.1186/1471-2105-7-472
Vlachos, I. S., & Hatzigeorgiou, A. G. (2013). Online resources for miRNA analysis. Clin Biochem, 46(10-11), 879-900. doi:10.1016/j.clinbiochem.2013.03.006
Zerbino, D. R., & Birney, E. (2008). Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res, 18(5), 821-829. doi:10.1101/gr.074492.107
12: Perl
H Perl
. Perl scripting
UNIX , ,
..., ,
, .
( ) .
/.
12.
PerlUNIXLinux.
Practical Extraction and Report Language
(). ,
, Perl Pathologically
Eclectic Rubbish Lister .
, Larry Wall. H Perl
. Perl scripting. Scripts ,
.
(interpret)(compile),
.
UNIX scripting,
,...
,
.Perl,
.,Perlawk,sed,
bash shell, C. ,
,Perl.
Perl,
,.,
,
(
C, C++ Java). , "" , Perl
. ,
TIMTOWTDI,There Is More Than One Way To Do
It ( Tim Toady). ,
.
()
, .
, ,
,
.Perl
( ),
( oneliners),
....,
Python,Perl,
There should be one-- and preferably only one --obvious way to do it ( ,
,!).
Unix/LinuxMacOSXPerl,
Windows.
, Perl
http://www.perl.org/get.html ,
.
Perl,,Windows
(shell)
UNIX.
Cygwin (www.cygwin.com), native ()
bashWindows,UnixUtils(http://unxutils.sourceforge.net/)MinGW
(http://www.mingw.org/). (compiler),
Perl,.
12.1. Perl
Perl
.,
,(blocks)
. ,
.pl compiler
Windows (
).,
,Perl:
perl file_name.pl
,
().
.
,
.
Perl#.Perl:
#!/usr/bin/perl
#
print "Hello world\n";
print \n
. print, Perl
, (;).
,Perl.
, Linux/UNIX ,
shellscript,:
./program.pl
./,(directory)
.,,
(Path),
.
,Perl
(C).,
,
. ,
C ( ,
,
Perl). , Perl
(interpreter),-,,
.
Perl.Python
, Java,
bytecode.
12.2.
Perl,(scalars)($)
. ,
($1, $2, $_, $/ ...).
, ,
,,
.
,
,
.
:
$var = ___;
$var
(,).,
(scalar) .
().
(variable interpolation)
,
(.. \n). (\)
(\n,\s
, \t ...). ,
,
.,$Perl
,\$().
()
.(``), Perl
shell,
,.
:
$var=3;
....
$var="hello";
.
$var . $var
.,Perl
.
Perl.
$number=5;
$name="George";
$exp=3*$number+($number+1);
$a+=5;
$b*=3;
++$a; $a++;
$a--;
12.2.1.
(Operators),
.
, 12.1
(string),12.2.
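Operator   Operation
+          addition
-          subtraction
*          multiplication
/          division
%          modulus (remainder of the division)
**         exponentiation
Table 12.1: Arithmetic operators in Perl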
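Numeric    String    Operation
+          .         addition / concatenation
*          x         multiplication / repetition
==         eq        equal to
!=         ne        not equal to
>          gt        greater than
<          lt        less than
>=         ge        greater than or equal to
<=         le        less than or equal to
Table 12.2: Numeric operators and their string counterparts in Perl.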
.
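As a brief illustration of Tables 12.1 and 12.2, the sketch below contrasts the numeric operators with their string counterparts. The variable names and values are chosen only for this example, and `use strict`, `use warnings` and `my` are optional safeguards rather than something the language requires.

```perl
use strict;
use warnings;

my $n = 7;
my $m = 2;

# Arithmetic operators
my $sum   = $n + $m;     # 9
my $mod   = $n % $m;     # 1 (remainder of 7 / 2)
my $power = $n ** $m;    # 49

# String operators: "." concatenates, "x" repeats
my $codon  = "AT" . "G";     # "ATG"
my $repeat = "AT" x 3;       # "ATATAT"

# Numeric comparison uses ==, string comparison uses eq
my $num_equal = ("1.0" == 1)   ? "yes" : "no";   # equal as numbers
my $str_equal = ("1.0" eq "1") ? "yes" : "no";   # different as strings

print "$sum $mod $power $codon $repeat $num_equal $str_equal\n";
```

Choosing `==` or `eq` decides whether the operands are treated as numbers or as text, which is why the last two tests give different answers for the same values.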
12.2.2. Perl
12.3:
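Function   Example and description
chomp      removes the trailing newline (\n) from a string, if there is one
chop       removes the last character of a string, whatever it is
substr     $x=substr($name,0,1,"L");
           replaces 1 character of $name, starting at position 0, with "L"; $x receives the replaced character and $name is modified.
           $x=substr($name,0,1);
           extracts 1 character starting at position 0 of $name and stores it in $x.
index      index($name,"k");
           returns the position of the first occurrence of k in $name
rindex     as index, but returns the position of the last occurrence
Table 12.3: String functions in Perl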
,:
$name="Takis";
$x=substr($name,0,1);
$name ,
$name().
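The string functions of Table 12.3 can be tried together in a short sketch. The example string is the one used above; the remaining variable names are illustrative assumptions.

```perl
use strict;
use warnings;

my $line = "Takis\n";
chomp $line;                          # removes the trailing newline -> "Takis"

my $first = substr($line, 0, 1);      # "T"; $line itself is unchanged
my $pos   = index($line, "k");        # 2 (positions are counted from 0)
my $last  = rindex($line, "i");       # 3

my $copy = $line;
my $old  = substr($copy, 0, 1, "L");  # returns "T"; $copy becomes "Lakis"
chop $copy;                           # removes the last character -> "Laki"

print "$line $first $pos $last $old $copy\n";
```

Note that the three-argument form of `substr` only reads the string, whereas the four-argument form also modifies it.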
12.2.3. (<STDIN>)
<STDIN>,Perl
(, )
<STDIN>.:
$a=<STDIN>; #
$a
print $a;
:
chomp ($a=<STDIN>);
12.3.
(list),.
.,:
(1, 2, 3)
("perl", 3, 15)
($x, 3, $x+2, "$y$x")
(1..10)
($a..$b)
.
,
(swap):
($a,$b)=($b,$a);
.
(Array).
.,
.
:
@array = (
);
, @ .
, @name $name.
, , (@_, @ARGV ...).
,:
,,0,
.
:
$table [0] = 1;
$table [1] = 2;
$table [2] = 3;
,(
$ ) .
.
, (. )
.
:
$x=$table[0];
$table[1]++;
($table[0], $table[1])= ($table[1], $table[0]);
@table[0,1,2]=@table[1,1,1];
,
(
):
@table=(1, 2, @name, 7)
12.4.
,push:
push @table,$scalar;
,:
@table=(@table, $scalar);
,unshift:
unshift(@table, $scalar);
@table=($scalar, @table );
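Function   Description
push       adds one or more elements to the end of the array
pop        removes and returns the last element of the array
shift      removes and returns the first element of the array
unshift    adds one or more elements to the beginning of the array
reverse    returns the elements of the array in reverse order
sort       returns the elements sorted, by default in ASCII (string) order
splice     removes (or replaces) elements; splice(@table,2,1); removes from @table 1 element starting at position 2.
Table 12.4: Array functions in Perl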
,,$#.
@name,$#name(
) . ,
..@table=(1,2,3),$#table,2.$#table
,,..$table[$#table]
3. :
, ,
. $table[5]=12,
(index)34,12
5. $#table 5,
6.
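The array functions of Table 12.4 and the behaviour of `$#table` can be followed step by step in the sketch below. The values are arbitrary and the comments show the state of the array after each statement.

```perl
use strict;
use warnings;

my @table = (1, 2, 3);
my $last_index = $#table;            # 2
my $size       = scalar @table;      # 3

push @table, 4;                      # (1, 2, 3, 4)
unshift @table, 0;                   # (0, 1, 2, 3, 4)
my $popped  = pop @table;            # 4, leaving (0, 1, 2, 3)
my $shifted = shift @table;          # 0, leaving (1, 2, 3)

splice(@table, 1, 1);                # removes 1 element at index 1 -> (1, 3)

my @sorted = sort { $a <=> $b } (10, 9, 100);   # numeric order: (9, 10, 100)
my @ascii  = sort (10, 9, 100);                 # default string order: (10, 100, 9)

$table[5] = 12;                      # indices 2..4 are created as undefined
my $new_size = scalar @table;        # 6

print "$last_index $size $popped $shifted $new_size\n";
print "@sorted | @ascii\n";
```

The default `sort` compares elements as strings, so numbers need the explicit `{ $a <=> $b }` comparison to be ordered numerically.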
12.4.
(Hash),,
. , (keys),
,
(values).,,
, . ,
.,%.
,.
(,).
.
.
,
. ,
:
,
()-.
:
%day = ( "Sun" => "Sunday", "Mon" => "Monday", "Tue" => "Tuesday", "Wed" =>
"Wednesday", "Thu" => "Thursday", "Fri" => "Friday", "Sat" => "Saturday" );
,:
%table=@table;
,
(1,3,5...),
(!).,
,:
@table=%table;
,(values)
,(keys).
,({})
$.:
$hash{key}=value;
,-,
., $hash{key}
,
.,
,,
()(),
().
,,(,
,.12.5)
each
,,
(
)
.
.
(
)
delete
reverse
12.5: Perl
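A minimal sketch of common hash operations follows, using a shortened version of the `%day` hash defined earlier. Besides `delete` and `reverse`, it relies on the built-in functions `keys` and `exists`; the variable names are illustrative.

```perl
use strict;
use warnings;

my %day = ( "Sun" => "Sunday", "Mon" => "Monday", "Tue" => "Tuesday" );

$day{"Wed"} = "Wednesday";            # add a key-value pair
delete $day{"Sun"};                   # remove a pair by its key

my $has_mon = exists $day{"Mon"} ? 1 : 0;   # 1
my $count   = scalar keys %day;             # 3

# keys returns the keys in no particular order, so sort them for stable output
foreach my $key (sort keys %day) {
    print "$key => $day{$key}\n";
}

# reverse swaps keys and values (safe only when the values are unique)
my %abbrev = reverse %day;
print $abbrev{"Monday"}, "\n";        # Mon
```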
12.5.
, Perl
..
12.5.1 If/else/elsif
if/else . if
else .
,
.if/elsif/else.if,
, , elsif
( )
, else
.:
if (condition)
{
...
}
elsif (condition)
{
...
}
else
{
...
}
,elsifelseif
else if. , Perl (block) ,
({}).Perl,
, (false), 0
.
12.5.2. While/until
while (condition)
{
    ...
}
until (condition)
{
    ...
}
12.5.3. Do while/until
Perldo/while.
,
.Perlwhile
until.:
do
{
    ...
} while (condition);
do
{
    ...
} until (condition);
12.5.4. For
for .
.,
.:
,,
(.. $x+$y<10 ...),
, () (
foreach).
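A minimal sketch of the `for` loop, assuming the usual three-part header (initialisation; condition; step). The second loop uses a condition that involves two variables, in the spirit of the `$x+$y<10` example mentioned above; all names and values are illustrative.

```perl
use strict;
use warnings;

# for (initialisation; condition; step) { ... }
my $total = 0;
for (my $i = 1; $i <= 10; $i++) {
    $total += $i;                  # accumulate 1 + 2 + ... + 10
}
print "$total\n";

# The condition can be any expression, here one involving two variables
my ($x, $y);
for ($x = 0, $y = 0; $x + $y < 10; $x++, $y += 2) {
    # nothing to do in the body; the header does all the work
}
print "$x $y\n";
```

The loop ends as soon as the condition becomes false, so the second loop stops at the first pair of values whose sum reaches 10.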
12.5.5. Foreach
for
(,).
,.
:
@a = (1,2,3,4,5);
foreach $i (@a)
{
    print "$i\n";
}
,,
(, $i),
().
12.5.6. Last/next/redo
.
Last: last break C,
.
:
for
foreach
while
until
while (condition 1)
{
if(condition 2)
{
last;
}
}
Next: continue C.
,,
.:
while (condition 1)
{
    if (condition 2)
    {
        next;
    }
}
Redo: redo
.
:
while (condition 1)
{
#
if(condition 2)
{
redo;
}
}
12.6. (Filehandles) /
Perl, .
Perl,
,
(filehandles). .
( , ),
.
,:
open.
.
, .
.
,,
. , ,
12.6:
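open(IN,"filename")      opens filename for reading through the filehandle IN
open(OUT,">filename")    opens filename for writing through the filehandle OUT (existing contents are overwritten)
open(OUT,">>filename")   opens filename for appending through the filehandle OUT
Table 12.6: Uses of open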
,
:
open IN, "/etc/passwd";
$x=<IN>;
print $x;
close IN;
open OUT, ">tempfile";
print OUT "bla bla bla\n";
close OUT;
,/etc/passwd(Linux),
, , , , tempfile
blablabla.<>
(STDIN).
,
, .
:
while(<>)
{
print $_;
}
,catUnix(,
), End Of File (EOF)
$_.while,
<>,.<>()
Perl.,
.
:
, Perl file.txt,
<>,-().$_
Perl ,
,.
. ,
. ( ..
),(..
),
. ,
,@ARGV.,
$ARGV[0],$ARGV[1]...open.,
,
().
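As a sketch of opening files whose names arrive in `@ARGV`, the example below counts the lines of every file given on the command line. The subroutine name `count_lines` and the script name in the usage comment are hypothetical; the three-argument form of `open` with a lexical filehandle and the `or die` check are a common defensive style, not the only possible one.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the number of lines in the file named by $path.
sub count_lines {
    my ($path) = @_;
    # $! holds the reason reported by the system if open fails
    open(my $in, "<", $path) or die "Cannot open $path: $!";
    my $lines = 0;
    while (my $line = <$in>) {
        $lines++;
    }
    close $in;
    return $lines;
}

# Usage (hypothetical script name): perl count.pl file1 file2 ...
foreach my $file (@ARGV) {
    print "$file\t", count_lines($file), "\n";
}
```

Checking the result of `open` matters because a failed `open` otherwise goes unnoticed and the loop simply reads nothing.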
12.7.
(regularexpressions)
,.
4,
. (/)
.
:
$dna=~/GAATTC/;
$dnaGAATTC
. /GAATTC/,
.
,
/GAATTC|CTTAAG/ (GAATTC or CTTAAG).
Perl
..,
(.) , \d
.12.7.
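\w   a word character (letter, digit or underscore)
\W   any character that is not a word character
\s   a whitespace character (space, tab, newline)
\S   any character that is not whitespace
\d   a digit
\D   any character that is not a digit
Table 12.7: Character classes in regular expressions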
,
.
,.
(Quantifier),Perl
,.
{n},n.:
$sequence=~/AAT{5}CCG/;
$sequence CCG (,
PROSITE---(5)-C-C-G).,
(true)(false),,if.,
4, PROSITE , ,
.12.8.
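?       0 or 1 times
+       1 or more times
*       0 or more times
{n,m}   between n and m times
{n,}    at least n times
{,m}    at most m times (accepted from Perl 5.34 onwards; write {0,m} in older versions)
Table 12.8: Quantifiers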
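The quantifiers and character classes above can be checked on a short test string. The sequence and the text line below are made up for the illustration.

```perl
use strict;
use warnings;

my $sequence = "GGAATTTTTCCGAA";

# T{5}: exactly five consecutive T between AA and CCG
my $exact = ($sequence =~ /AAT{5}CCG/) ? 1 : 0;      # 1

# T{2,4}: two to four T; fails here because five T separate AA from CCG
my $range = ($sequence =~ /AAT{2,4}CCG/) ? 1 : 0;    # 0

# T+: one or more T; the parentheses capture the matched run into $1
my $run = "";
if ($sequence =~ /AA(T+)CCG/) {
    $run = $1;                                       # "TTTTT"
}

# \w+ word characters, \s+ whitespace, \d+ digits
my ($id, $len) = ("P81928   261 AA" =~ /^(\w+)\s+(\d+)\s+AA$/);

print "$exact $range $run $id $len\n";
```

Because a match returns true or false, each test can be used directly as the condition of an `if`, and in list context the captured groups are returned as a list.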
12.8. Perl
,-,Perl,
,
.
,.
,
Uniprot
, fasta. ,
Uniprot.fasta
, AC ID.
Uniprot(..AC),
.,3
.(uniprot2fasta.pl),:
while (<>)
{
if ($_=~/^AC\s{3}(.*?)\;/)
{
print ">$1\n";
}
if ($_=~/^\s{5}(.*)/)
{
$sequence=$1;
$sequence=~s/\s//g;
print "$sequence\n";
}
}
, <>. -
$_. if, AC.
AC,3,
(;)AC().
?(non-greedy),
. $1 ,
.,
(,AC).,,$2,$3...
, 5
(, ) (
). Uniprot
, $sequence . ,
~s/\s//g(g,
,globally-,).
, ,
.if
. ,
,.
,
,fasta,
..
(uniprot2line.pl).
$/="\/\/\n";
while (<>)
{
if ($_=~/^AC\s{3}(.*?)\;/m)
{
print ">$1\n";
}
while ($_=~/^\s{5}(.*)/mg)
{
$sequence=$1;
$sequence=~s/\s//g;
print "$sequence";
}
print "\n";
}
,
. $/="\/\/\n"
(-)
//\n. ,
Uniprot$_.,
m .
($_)(multiline).,if
while,g(global),
.
,
fastafasta.(fasta2line.pl):
$/=">";
while (<>)
{
    $entry=$_;
    chomp $entry;                  # strip the trailing ">" record separator
    $entry=">".$entry;
    if ($entry=~/>(.+?)\n(.*)/s)   # header line, then the rest of the record
    {
        $name=$1;
        $sequence=$2;
        $sequence=~s/\n//g;
        print ">$name\n$sequence\n";
    }
}
$/="\n";
,-(>).
, ,
.
12.8.2.
,
(, ...). 500
200:
@aa = qw(A C D E F G H I K L M N P Q R S T V W Y);
for ( $i=0; $i<500; $i++ ) {
print '>Random', "$i\n";
for( $j=0; $j<200; $j++ ) {
$r = $aa[ int (rand 20)];
print $r;
print "\n" if ($j+1)%60 == 0 and $j;
}
print "\n";
}
(
).,rand
0-20,(index)
.
()60
fasta. , ..
,
. .
.
12.8.3.
,
,DNA/RNA.
,
. FASTA
.
@aa = qw(A C D E F G H I K L M N P Q R S T V W Y);
while (<>)
{
    if ($_=~/^>/)
    {
        $id=$_;
        chomp $id;
        print $id."\t";
        $seq=<>;                      # the sequence is expected on a single line
        chomp $seq;
        $length=length($seq);
        next if $length == 0;         # avoid division by zero
        foreach $z (@aa)
        {
            $count = ($seq =~s/$z//g) || 0;   # s///g returns the number of substitutions
            $diairesi = $count/$length;
            $pososto=sprintf( "%.3f", $diairesi );
            print $z."\t".$count.'/'.$length."\t[".$pososto."]\n";
        }
        print "\n";
    }
}
20
-.>header
( @seq=<>). , ,
foreach - .
$count=$seq=~s/$z//g.,
( ) ,
(length
- $seq =~s/$z//g; $count=length($seq);). ,
sprintf.
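An alternative way to obtain the same counts, sketched here under the assumption that the whole sequence is already held in one string, is to walk over the residues once and accumulate them in a hash. The subroutine name `composition` and the test sequence are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a hash residue => count for one sequence.
sub composition {
    my ($seq) = @_;
    my %count;
    foreach my $residue (split //, uc $seq) {   # one pass over the residues
        $count{$residue}++;
    }
    return %count;
}

my $seq  = "MKKLLAAC";
my %comp = composition($seq);
foreach my $aa (sort keys %comp) {
    printf "%s\t%d/%d\t[%.3f]\n", $aa, $comp{$aa}, length($seq), $comp{$aa} / length($seq);
}
```

This version leaves the sequence intact and reports only the residues that actually occur, whereas the substitution approach empties the string as it counts.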
12.8.4. DNA
(openreadingframes)DNA.
.
DNA . ,
26
.
%genetic_code = (
'GCA'=>'A', #Alanine
'GCC'=>'A', #Alanine
'GCG'=>'A', #Alanine
'GCT'=>'A', #Alanine
'AGA'=>'R', #Arginine
'AGG'=>'R', #Arginine
'CGA'=>'R', #Arginine
'CGC'=>'R', #Arginine
'CGG'=>'R', #Arginine
'CGT'=>'R', #Arginine
'AAC'=>'N', #Asparagine
'AAT'=>'N', #Asparagine
'GAC'=>'D', #Aspartic acid
'GAT'=>'D', #Aspartic acid
'TGC'=>'C', #Cysteine
'TGT'=>'C', #Cysteine
'GAA'=>'E', #Glutamic acid
'GAG'=>'E', #Glutamic acid
'CAA'=>'Q', #Glutamine
'CAG'=>'Q', #Glutamine
'GGA'=>'G', #Glycine
'GGC'=>'G', #Glycine
'GGG'=>'G', #Glycine
'GGT'=>'G', #Glycine
'CAC'=>'H', #Histidine
'CAT'=>'H', #Histidine
'ATA'=>'I', #Isoleucine
'ATC'=>'I', #Isoleucine
'ATT'=>'I', #Isoleucine
'TTA'=>'L', #Leucine
'TTG'=>'L', #Leucine
'CTA'=>'L', #Leucine
'CTC'=>'L', #Leucine
'CTG'=>'L', #Leucine
'CTT'=>'L', #Leucine
'AAA'=>'K', #Lysine
'AAG'=>'K', #Lysine
'ATG'=>'M', #Methionine
'TTC'=>'F', #Phenylalanine
'TTT'=>'F', #Phenylalanine
'CCA'=>'P', #Proline
'CCC'=>'P', #Proline
'CCG'=>'P', #Proline
'CCT'=>'P', #Proline
'AGC'=>'S', #Serine
'AGT'=>'S', #Serine
'TCA'=>'S', #Serine
'TCC'=>'S', #Serine
'TCG'=>'S', #Serine
'TCT'=>'S', #Serine
'ACA'=>'T', #Threonine
'ACC'=>'T', #Threonine
'ACG'=>'T', #Threonine
'ACT'=>'T', #Threonine
'TGG'=>'W', #Tryptophan
'TAC'=>'Y', #Tyrosine
'TAT'=>'Y', #Tyrosine
'GTA'=>'V', #Valine
'GTC'=>'V', #Valine
'GTG'=>'V', #Valine
'GTT'=>'V', #Valine
'TAA'=>'-', #STOP
'TAG'=>'-', #STOP
'TGA'=>'-', #STOP
);
$seq="AAAAAAATTAATAGATGAACATATATATAGATTTTCTATATAGACCCTCTACCCGATAAGGCTAC";
$seq2=$seq;
$seq2=~tr/ATCG/TAGC/;
$seq2=reverse($seq2);
print "Forward Strand\n";
Translate($seq);
print "Reverse Strand\n";
Translate($seq2);
sub Translate{
$sub_seq=$_[0];
for($i=0;$i<=length($sub_seq)-3;$i++)
{
$x=substr($sub_seq,$i,3);
if ($x eq 'ATG')
{
$pos=$i+1;
print "position $pos\n";
for($j=$i;$j<=length($sub_seq)-3;$j=$j+3)
{
$y=substr($sub_seq,$j,3);
$k=$genetic_code{$y};
if($k eq '-')
{
print"\n";
last;
}
print "$k";
}
}
}
}
,
. , .
(,).
tr,reverse,
(5').
,
(sub-routine).
(,)
(,).
,
( ).
,.
Perl.
@_ . ,
$_[0],()$_[1]...
Perl "" (global),
. ,
.(private),
(),
my(my$seq=...).
, .
3 (, ) ,
(ATG).,substr
.,
,,,
(last).
,
.
.
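The use of `@_`, `my` and `return` can be seen in isolation in the following sketch, which wraps the `tr` and `reverse` steps of the program above in a subroutine. The name `reverse_complement` and the test sequence are illustrative assumptions.

```perl
use strict;
use warnings;

# Returns the reverse complement of a DNA sequence.
sub reverse_complement {
    my ($dna) = @_;                 # first argument, copied into a private variable
    $dna =~ tr/ATCG/TAGC/;          # complement each base
    return scalar reverse $dna;     # reverse the string (scalar context)
}

my $seq = "ATGCCGTAA";
my $rc  = reverse_complement($seq);
print "$seq\n$rc\n";                # $seq is unchanged by the call
```

Because `$dna` is declared with `my`, the changes made inside the subroutine do not leak into the variables of the main program.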
12.8.5.
5, (signal
peptide),
,
. ,
( 17-30 , ),
.
1980,PS00013,
PROSITE. ,
, [LVI]-[ASTVI]-[GAS]-C
(DOLOP).2002,SutcliffeHarrington
Gram,,
(Sutcliffe&Harrington,2002).
63
(Junckeretal.,2003).
FASTA,
(,
PROSITE).
$a=0;                                # number of sequences matching the pattern
while (<>)
{
    if ($_=~/^>(.*)/)
    {
        $name=$1;
        $seq=<>;
        if ($seq=~/(.*LA[GA]C)/)
        {
            $x=length($1);
            $a=$a+1;
        }
    }
}
print "$a LIPOPROTEINS FOUND\n";
ifPerl:
if($seq=~/(.*[LVI][ASTG][GA]C)/)
if($seq=~/(.*[^DERK]{6}[LIVMFWSTAG]{2}[LIVMFYSTAGCQ][AGS]C)/)
if($seq=~/^([MV].{0,13}[RK][^DERK]{6,20}[LIVMFESTAG][LVIAM][IVMSTAFG][AG]C)/)
().,
,
,
. ,
,
("")
, .
WebLogo (http://weblogo.berkeley.edu/)
(Crooks,Hon,Chandonia,&Brenner,2004).
while (<>)
{
if ($_=~/^>(.*)/)
{
$name=$1;
$seq=<>;
if ($seq=~/(.*[^DERK]{6}[LIVMFWSTAG]{2}[LIVMFYSTAGCQ][AGS]C)/)
{
$x=length($1);
push @table,$1;
}
}
if($x>$max)
{
$max=$x;
}
}
foreach $signal(@table)
{
$i="-" x ($max-length($signal));
$signal=$i.$signal;
print "$signal\n";
}
-.
>header
( @seq=<>).
,(
). ,
,"-"
.
,.
------------------MRRCMPLVAASVAALMLAGC
-----------------MKLKQLFAITAIASALVLTGC
------------------MKLLSKIMIIALAASMLQAC
-------------MNKNRGFTPLAVVLMLSGSLALTGC
-------------------MKRQALAAMIASLFALAAC
-------------------MRLLPLVAAATAAFLVVAC
-----------------------MRIVIFILGILLTSC
---------------------MFKRFIFITLSLLVFAC
---------------------MLKKVYYFLIFLFIVAC
---------------------MKKILLTVSLGLALSAC
-----------------MVKKAIVTAMAVISLFTLMGC
-----------------MKQLIVNSVATVALASLVAGC
-------------------MKLKTLALSLLAAGVLAGC
--------------------MKAYLALISAAVIGLAAC
-------------------MKLKATLTLAAATLVLAAC
------------------MQKTPKKLTALCHQQSTASC
----------------MPLPDFRLIRLLPLAALVLTAC
----------------MKNQVKKILGMSVVAAMVIVGC
---------------------MKKFLPLSISITVLAAC
-----------------MKRLFLSFVALALLAGSIAAC
--------------------MCGKILLILFFIMTLSAC
---------------------MSKRLLSLASLALLFGC
-------------------MFKRRYVTLLPLFVLLAAC
------------------MKKIIKLSLLSLSIAGLASC
-----------------MGRSKIVLGAVVLASALLAGC
------------------MKAKIVLGAVILASGLLAGC
------------------MNNVLKFSALALAAVLATGC
--------------MKLTTHHLRTGAALLLAGILLAGC
-------------MAYSVQKSRLAKVAGVSLVLLLAAC
------------MSAGSPKFTVRRIAALSLVSLWLAGC
MDKGEGLRLAATLRQWTRLYGGCHLLLGAVVCSLLAAC
-------------------MKPFLRWCFVATALTLAGC
------------------MNIATKLMASLVASVVLTAC
----------------MQNAKLMLTCLAFAGLAALAGC
---------------------MKKYLLGIGLILALIAC
----------------------MRLLIGFALALALIGC
--------------MFVTSKKMTAAVLAITLAMSLSAC
-----------------MNKNMAGILSAAAVLTMLAGC
-----------------MHVSSLKVVLFGVCCLSLAAC
----------------MYKNGFFKNYLSLFLIFLVIAC
------------------MNKFVKSLLVAGSVAALAAC
-------------------MKKTNMALALLVAFSVTGC
----------------MSLTHYSGLAAAVSMSLILTAC
------------------MLRYTRNALVLGSLVLLSGC
--------------------MRNFILFPMMAVVLLSGC
--------------------MRKQWLGICIAAGMLAAC
-------------------MRYLATLLLSLAVLITAGC
-------------------MNMTKGALILSLSFLLAAC
----------------MNKKIFTLFLVVAASAIFAVSC
--------------------MVKRGRFALCLAVLLGAC
------------------MKVKYALLSAGALQLLVVGC
-----------------MNNPLVNQAAMVLPVFLLSAC
----------------MNAHTLVYSGVALACAAMLGSC
----------------MKLKSLVFSLSALFLVLGFTGC
-----------------MREKWVRAFAGVFCAMLLIGC
---------------MKHNVKLMAMTAVLSSVLVLSGC
--------------------MKLRLSALALGTTLLVGC
-----------MRKRISAIINKLNISIIIMTVVLMIGC
-------------------MRKRISAIIMTLFMVLVSC
-----------MRKRISAIINKLNISIMMMIVVLMIGC
, ,
(open).,
shell ( Windows)
. ,
:
ls | wc
lswc.Perl
, shell.
:
,
.,
Standard Input (
<STDIN>). , ,
,shell
.
, (
),WebLogo,12.1.
12.1:
WebLogo (http://weblogo.berkeley.edu/)
12.9.
,
Perl, ,
.
,Programming Perl,
(Wall & Schwartz, 1991), Learning Perl ,
(Schwartz & Phoenix, 2001), . ,
,
(Moorhouse&Barry,2005;Tisdall,2001,2003),(Orwant,Hietaniemi,&Macdonald,
1999) (Castro, 2001). , tutorials
online ebooks Picking Up Perl,
http://www.ebb.org/PickingUpPerl/ (Kuhn, 2002),
https://www.perl.org/books/library.htmlhttp://www.perlmonks.org/index.pl/Tutorials.
BioPerl
(http://www.bioperl.org/wiki/Main_Page),
(modules)Perl.BioPerl
,
(Stajich et al.,2002). ,
(object-orientedPerl),.
BioPerl,,
:http://www.bioperl.org/wiki/HOWTO:Beginners.
Castro, Elizabeth. (2001). Perl and CGI for the world wide web: Visual quickstart guide: Peachpit Press.
Crooks, G. E., Hon, G., Chandonia, J. M., & Brenner, S. E. (2004). WebLogo: a sequence logo generator. Genome Res, 14(6), 1188-1190. doi:10.1101/gr.849004
Juncker, A. S., Willenbrock, H., Von Heijne, G., Brunak, S., Nielsen, H., & Krogh, A. (2003). Prediction of lipoprotein signal peptides in Gram-negative bacteria. Protein Sci, 12(8), 1652-1662.
Kuhn, Bradley M. (2002). Picking Up Perl: B. Kuhn.
Moorhouse, Michael, & Barry, Paul. (2005). Bioinformatics biocomputing and Perl: an introduction to bioinformatics computing skills and practice: John Wiley & Sons.
Orwant, Jon, Hietaniemi, Jarkko, & Macdonald, John. (1999). Mastering algorithms with Perl: O'Reilly Media, Inc.
Schwartz, Randal L, & Phoenix, Tom. (2001). Learning Perl: O'Reilly & Associates, Inc.
Stajich, Jason E, Block, David, Boulez, Kris, Brenner, Steven E, Chervitz, Stephen A, Dagdigian, Chris, ... Lapp, Hilmar. (2002). The Bioperl toolkit: Perl modules for the life sciences. Genome Research, 12(10), 1611-1618.
Sutcliffe, I. C., & Harrington, D. J. (2002). Pattern searches for the identification of putative lipoprotein genes in Gram-positive bacterial genomes. Microbiology, 148(Pt 7), 2065-2077.
Tisdall, James. (2001). Beginning Perl for bioinformatics: O'Reilly Media, Inc.
Tisdall, James. (2003). Mastering Perl for bioinformatics: O'Reilly Media, Inc.
Wall, Larry, & Schwartz, Randal L. (1991). Programming Perl: O'Reilly & Associates, Sebastopol, CA.
1)
(, ). ,
, Uniprot
(
,A,C,...):
2) DNA
3.(..1000)
,
.,1000
ErdosRenyi.
(..1000,5000,10000...).
3)
Uniprot. , FT
TRANSMEM . ,
fasta
oneline,.:
ID
AC
FT
FT
FT
FT
FT
FT
SQ
140U_DROME
P81928; Q9VFM8;
CHAIN
1
TRANSMEM
TRANSMEM
TRANSMEM
CONFLICT
SEQUENCE
MNFLWKGRRF
GSISSELNSV
TVNFAKGGFK
MAAGGIIGGF
PELFKAHDEK
Reviewed;
261 AA.
261
>P81928
MNFLWKGRRFLIAGILPTFEGAADEIVDKENKTYKAFLASKPPEETGLERLKQMFTIDEFGSISSELNSVYQAGFLGFL
-------------------------------------------------------------------MMMMMMMMMMM
, ,
,Kyte-Doolittle:
'M' => -1.590,
'N' => 0.480,
'P' => 0.730,
'Q' => 0.950,
'R' => 1.910,
'S' => 0.520,
'T' => 0.070,
'V' => -1.270,
'W' => -0.510,
'Y' => -0.210
);
, (
*).,
,,.. 3.1
3,.
4) DNA
.
.
(.. 1000),
(>10.000),
.
,(.1).
5)
FASTA ( ), ,
sparseencoding.