
2015


... 1
1: ... 7
1. ... 7
1.1. ... 8
1.2. ... 16
1.3. ... 25
1.4. ... 31
... 45
2: ... 47
2. ... 47
2.1. ... 48
2.2. ... 54
2.3. ... 76
... 78
... 83
... 108
3: ... 117
3. ... 117
3.1. ... 117
3.2. Erdos Renyi ... 120
3.3. Erdos Renyi ... 121
3.4. (EVD) ... 122
3.5. (Maximal Segment Score) ... 125
3.6. ... 128
3.7. ... 130
3.8. ... 132
3.9. O Needleman Wunsch ... 133
3.10. ... 134
3.11. Smith Waterman ... 135
3.12. O Erdos Renyi ... 137
3.13. local similarity score ... 138
3.14. ... 140
3.15. BLAST FASTA ... 142
... 147
... 149
4: ... 151
4. ... 151
4.1. ... 152
4.2. ... 155
4.3. ... 159
4.4. ... 162
4.5. ... 164
... 169
... 171
5: ... 173
5. ... 173
5.1. ... 173
5.2. Weight Matrices, Profiles PSSMs ... 180
5.3. ... 184
... 189
6: ... 191
6. ... 191
6.1. ... 191
6.2. ... 195
6.3. ... 199
6.4. ... 203
6.5. ... 207
6.6. ... 209
6.7. ... 210
... 213
7: ... 217
7. ... 217
7.1. ... 218
7.2. ... 223
7.3. ... 227
7.4. ... 229
7.5. ... 231
7.6. ... 236
7.7. DNA/RNA ... 259
... 264
8: ... 271
8. ... 271
8.1. Markov ... 271
8.2. Hidden Markov Models ... 276
8.3. Class Hidden Markov Model ... 291
8.4. ... 298
8.5. Profile Hidden Markov Models ... 300
8.6. profile HMM ... 302
8.7. HMMER ... 304
... 306
... 309
9: ... 313
9. ... 313
9.1. ... 314
9.2. ... 318
9.3. ... 321
9.4. ... 326
9.5. ... 335
... 339
10: ... 343
10. ... 343
10.1. Chomsky ... 343
10.2. ... 345
10.3. RNA ... 348
10.4. ... 355
... 358
11: ... 361
11. ... 361
11.1. ... 361
11.2. ... 366
11.3. ... 373
... 379
12: Perl ... 381
12. ... 381
12.1. Perl ... 382
12.2. ... 383
12.3. ... 385
12.4. ... 387
12.5. ... 388
12.6. (Filehandles) / ... 391
12.7. ... 393
12.8. Perl ... 394
12.9. ... 403
... 404
... 405



,
,
, .
,,

. , (
,,,...),
(, )
,,
( , ).
, , ,
.
,,
,
, , .
, , .
, ,
( , ),
(),
, . ,
,
(,
, ...), /. ,
,
,,,.
' , ,
,.
,
,.,
, ,
.,
:
, ,
/.
,,
, ,
. ( , ) ,
,
.,
,
. ,
, ,
,,,

( ).
( , ),
,
.,
,.,
,

( ),
( ) . ,
1.
,
, ,
,,.
"" " ",
,,,
( ,
, ). ,
Creative Commons, ,
, ,
.,,
" " (open courses),
,," "
" ",,
.
, ,
. ,
, ,
,,

(,
)
.
(400 ),
.,,
""(,).1
, ,
,,,
,
,,,.2
9 , ,
, , ,
. Hidden Markov Models, ,
, ,
,8.1011
( , ). ,
12 Perl,
( ,
, , ...). ,
, ,
.
HiddenMarkovModels(8),
,(7),
,
DNA/RNA.,
.
, , ,
,.
,
,,.,2
9 ( 12,
).

,
," "2-8
( ),
" " 9-12,
( ) , 1-2
(
). , , ,

(,,).
,,
,.
, Hidden Markov Models
,,
, Position Specific
ScoringMatrices,,.
,,"-"
. , ,
, 7(
). ,
RNA. , 7, .
',
,,
10. "" ,
.,12,,
,,
.
(DNA, RNA, ,
, , , - , , ,
,,...),,
. ,
. ,
,
( ),
.,
1, /
, ,
.,,
,,,
,
(,
, , ). ,
,,
/,
. ,
,,
.
, ,
,.,
, ,
,,,
,.
,(),
, , .
, , ,

. , , ,
,,
epub. ,
,
.,,
(,). -, ,
, .

, , , ,
, , ,
,,,
,
.,,
. ,
,
.9,
,,
10, .
,,,,
( ). ,
,
, , : www.compgen.org/books/bioinformatics,
email:books@compgen.org.,,
,,
.
, , , ,
, . ,
(1990,
"" ). ,
- ,
, ,
.
(),
,.
, (,
), , , .
,,, ,
,,.
, , , ,
(3)
,
. ,
,.,
,
,.,
,,,
(,,),.,
,
.

,17/11/2015

-
BLAST   Basic Local Alignment Search Tool
DNA     Deoxyribonucleic Acid
RNA     Ribonucleic Acid
MSA     Multiple Sequence Alignment
HMM     Hidden Markov Model
PSSM    Position-Specific Scoring Matrix
EBI     European Bioinformatics Institute
NCBI    National Center for Biotechnology Information
NLM     National Library of Medicine
NIH     National Institutes of Health
PDB     Protein Data Bank
CABIOS  Computer Applications in the Biosciences
SRS     Sequence Retrieval System
ISCB    International Society for Computational Biology
ISMB    Intelligent Systems for Molecular Biology
CAFASP  Critical Assessment of Fully Automated Structure Prediction
CAPRI   Critical Assessment of Prediction of Interactions
CASP    Critical Assessment of protein Structure Prediction

1:

.
,
,
. ,
.
( , ), ,
, ,
. ,

.

.

1.

,
,
,
,.
,,
. , ,
,
.
,
.
(,
,...).,
, ,
,:
( , )
.,,
.
,.
.,
,
(..
,
),
,
.
, '
, .
, (
) ( ).
, / ,
, .
( )
.,,
(, ,
/,...).,,


.,
,.,
,
ISI,
(Mathematical and Computational Biology),

.,,
.

1.1: google trends (https://www.google.com/trends/) bioinformatics


computational biology, .

,
.
,,,
.

1.1.

, (bioinformatics),
1990.,
,,informatics
.,,
(
)1990,
.,20,
,

,
.
, (Hagen, 2000; Ouzounis & Valencia, 2003;
Roberts, 2000; Searls, 2010; Trifonov, 2000),.
, ,
,
.

1.1.1. 1950 1960

,
, Hardy, Weinberg Fisher, Wright,
,
1950 1960 (1.2).,
Chargaff
DNA,
. ,
Watson Crick DNA
(Wilkins,
Franklin ). 1960
Jacob Monod ().
,(),
Perutz Kendrew 1962, 1960
(,,...),
.,
1951, RNA 1967.
,
1960.

1.2:
1960.

,19501960
,,Shannon,

Turing, von Neumann, (strings),
, Chomsky.,
, ,
,1960
.,
,.
,,
6420,
1960,
.
,
.
,
Zuckerkandl Pauling, Fitch and
Margoliash Kimura Nei.
Ramachandran
,
Ramachandran
, helical wheel plots.

1.1.2. 1970

,(1.3).
,,
,Kimura.,

, .
Fitch
().,
RNA ,
Crick 1970. ,
Sanger Maxam-Gilbert,
,
.
,Anfinsen
.
,
,
Chou Fasman 1975 (
).,
RNA.,
, .
, RNA
Nussinov.

1970,
(), Needleman
Wunsch 1970.
, 1970 (dot-plot).
,. PDB
1972 (10), Dayhoff 1978
,
PIR. , /
( ,


, ...).
,1970
.,
, ,
.

1.3: 1970.

1.1.3. 1980

1980
,
.
(Science, Nature, Nucleic Acids Research),
(Computer Applications in the Biosciences).,


(1.4).
,
.
Smith Waterman 1981,
,
Arratia, Waterman Karlin,
(FASTA).
,
CLUSTAL.
(sequence profiles)
,
.,
.
DNA,PCR,
,



(,
).,1986
(GenBank EMBL Data
Library), SwissProt, 1987.

(EMBnet ),
(LiMB). , NIH EMBL
.

1.4: 1980.

,
.,NMR

. ,
,
(homology modelling).
(),
.,

.,
( ),

(,, positive-inside rule)
.
,RNA.
, Felsenstein
,
(
).


(
).
(PAM),,
,
(..
, , , ...). ,
rRNA,
,.

1.1.4. 1990

1990(
). ,

, / ( 1.5).
,
1990. , , Bioinformatics, 1995
Computer Applications in the Biosciences (CABIOS).

1.5: 1990

.
,, BLAST (Basic Local
Alignment Search Tool), NCBI 1990. BLAST
(score) ( Karlin-Altschul)

,
,
.,,
,PSI-BLAST.,


(CLUSTAL)
/,.

.
, RasMol Kinemage,
(threading), (docking).
,
,70%

( , , ...). ,
/CASP.
,
,,
(genefinders).
, DNA
, , ,
,
.
,
.
(SCOP CATH),
(patterns) , PROSITE, PFAM INTERPRO.
,, EBI (European Bioinformatics
Institute), ,
(Hinxton) 1992 EMBL Wellcome Trust.
EMBL, EMBL-Bank SwissProt-TrEMBL

, TrEMBL., 1993 ISMB
ISCB.
,, Krogh, Eddy,
Hughey , ,
Hidden Markov Model (HMM),
,.
HMMER profile HMM,
PFAM,,
, TMHMM
, SignalP .
,
,,

( profile HMM,
, ). ,
,,.

1.1.5.

2000,
.,
,
...,
( )
(1.6). ,
, .
,



.,
(Next Generation Sequencing), (RNA-seq),

.
,
.
, (web-servers),
. , -(meta-genomics)
.
, 2000,
, ,
SwissProt PIR
, UniProt ( EBI).
, ,
,
EBI ELIXIR , ,
.

1.6: 2000.

, (SNPs)
GWAS (Genome-Wide Association Studies),
, ,
DNA.,
, , (HapMap
project),-(data


integration). ,
. ,
,(Proteomics),
,,RNA(ncRNA),
,
(Copy Number Variations- CNVs). ,
.
, Support Vector Machines (SVMs)
,
. , -,
HMMER
BLAST.,
ab initio.,,
,
, ,
, .
, , . ,
,
.
,
,
. ,
, , ,
,
(Ouzounis,2012).

1.2.

,,

.,

NCBI
(http://www.ncbi.nlm.nih.gov/Class/MLACourse/Modules/MolBioReview/bioinformatics.html):

Bioinformatics is the field of science in which biology, computer science, and information technology merge
into a single discipline. There are three important sub-disciplines within bioinformatics: the development of
new algorithms and statistics with which to assess relationships among members of large data sets; the
analysis and interpretation of various types of data including nucleotide and amino acid sequences, protein
domains, and protein structures; and the development and implementation of tools that enable efficient access
and management of different types of information

, Luscombe (Luscombe, Greenbaum, & Gerstein, 2001):

Bioinformatics is conceptualizing biology in terms of macromolecules (in the sense of physical-chemistry)


and then applying "informatics" techniques (derived from disciplines such as applied maths, computer
science, and statistics) to understand and organize the information associated with these molecules, on a
large-scale
Fredj Tekaia:

The mathematical, statistical and computing methods that aim to solve biological problems using DNA and
amino acid sequences and related information.


,
( )
(). Richard Durbin:

I do not think all biological computing is bioinformatics, e.g. mathematical modelling is not bioinformatics,
even when connected with biology-related problems. In my opinion, bioinformatics has to do with
management and the subsequent use of biological information, particular genetic information.
, International Society for Computational Biology (ISCB)
,,,:

a scholarly society dedicated to advancing the scientific understanding of living systems through
computation.

,
,
, . ,
(),
(),,.
, ( )
:
,
(1.7).
,,,
. (
),/,
, .

DNA,
.,,
(GWAS),
(Molenberghs, 2005).
, , .. ,
. ,
,.
,
,
( ), ,
,
. Eddy (Eddy, 2005),
,
:
.
Eddy,
(
),
, (
).
, , ,
-
.,,
,
.
,


,
. ,
( ,.) Eddy
"I'm sure my union card has expired" (Eddy, 2005).
,
, : ,
,
,ECDL(,
ECDL).

1.7: .


, ,
, ,
,
,.
,
( ,,
Eddy).
. ,
(.. residues) , ...
, , .
; ; (.. residues residuals
,
,
). ,
,,
( ,


10 ). ,
,
(,,
). Venn Anthony Fejes,
(1.8).

1.8: http://blog.fejes.ca/?p=2418

,
(Ouzounis, 2002). ,
, .

,.;
. , ,
, .
, ,
,
.,,
. , ,
,,
50 . ,
(, , ), ,
, , ,
,
. , ,
, (
),.,,
..,(
),
.
,
,
,


.,
(,
, , , ).,
,
,
.,
,
,
.
,
,
(Chalmers,1999).

1.9: (, , & , 2014)

,
.
(),
,,...
,
, , ,
... ,
( , PCR, ...).
, .
(..
Watson Crick DNA
).,


.,
.
(.. ) , ,
(.. )
.
, ,
(
),,
.,,
. ,
,UUU(Phe).
, , (
)
. 1960,
,,
(
,).
;
;..,
,
,--,---.,
,
, (
).
,
(
),,,
.
,
.,
,,
- -
. , : )
)
.
,
.,99%
,
,

. , ,
,,
.
( )
,,
, .
,
(,,
).,
,,
.
,
,(Ouzounis,2000).


, .,
, ,
. , ,
. ,
, , , pyrosequencing,
.,
,.
,
.,,
.
, , ,
.
,
,/,
. ,

,.
, ,
, ... , ,
()(..,,
,,...).
, Edgar
Wingender: Thus, scientific articles publishing experimental findings which have been evaluated using
computational tools, very often give credit to them in the Methods or Results sections with phrases such as
"Computer analysis revealed that ...", without any appropriate reference. In contrast, any experimental
methodology used is extensively explained in these papers, down to the detailed listing of buffer systems,
voltage/current conditions of the electrophoresis systems etc. (Wingender, 1998).
,
,,
,
(paper).,,
,
,,,
,
, .
,,
,.,
,
. ,
, ,
.,,...,
.,
,
,
.
,
.,,
, . , ,
.


1.10: , . ,

,
. , ,
,...

. , ,
,
.
,,
(
BLAST).
.,
(, , ,
...).
,
,
().,

.,
, ...
.
,.,
.


,
.

, 2 3 ( 1 ). ,
bioinformatician 2 bioinformaticist
3,.
bioinformatics scientist bioinformatics engineer 2
3 (Welch et al., 2014),,
(),
, . ,
. ,
,(
), ,
.
(
), ,
...,
,.,
,,
,

.,(
), ,
, .
3 , ,
, ,

.

,
. ,
,
,
,
(Altman, 1998; Ditty et al., 2010;
Floriano, 2008; Honts, 2003; Searls, 2012; Welch, et al., 2014; Yan, Ban, & Tan, 2014).
, 1990 ,
. ()
(.),
(,,,,,),
Altman:
, , ,
(,,,...).
, , ,
. ,

,
/. ,
,
,
,,,,
,,.


1.3.


.
,
,
,
, .
.

-(keywords MESH terms),
, .
,
.
, Patra Mishra (Patra & Mishra, 2006)
MESH terms "Bioinformatics" OR "Bioinformatics" OR "Computational Biology" OR
"Computational Molecular Biology" OR "Biology Computational" OR "Molecular Biology; Computational"
OR "Genomics" PUBMED 16,178
, 2004, 1806
. , 2000 .
, 1990 12, 2000 1000.
(98%)
(97%).(42%)
(10%),(6%)(4%).

Bioinformatics
Nucleic Acids Research
Genome Research
Science
Nature
Proceedings of the National Academy of Sciences USA
Proteomics
Genome Biology
Journal of Molecular Biology
Proteins
Nature Biotechnology
BMC Bioinformatics
Pacific Symposium on Biocomputing
Journal of Computational Biology
Tanpakushitsu Kakusan Koso
Journal of Biological Chemistry
Drug Discovery Today
Trends in Biotechnology
Genomics
Briefings in Bioinformatics
1.1: 20 Patra Mishra (Patra &
Mishra, 2006).

,20
1/3.1.1.
,
(Bioinformatics, BMC Bioinformatics, Pacific Symposium on Biocomputing, Journal of Computational
Biology...), (Nucleic Acids Research,
Genome Research, Genome Biology, Journal of Molecular Biology) (Science,
Nature, Proceedings of the National Academy of Sciences USA).,


(Impact Factor) 2001 ,


,
(.. Computer
Applications in the Biosciences Bioinformatics 1998, PCR Methods and Its
Applications Genome Research 1994).
,23%
,23%.
4067.2000
consortia
., 39,435 16,178 (2.43/).
, 73.58% , 14.34%
5.30% . Lotka
n, 1/n².
6% 10 .
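
As a rough illustration of the Lotka relation mentioned above, the following Python sketch compares the reported author shares with what an inverse-square law would predict. It assumes that 73.58%, 14.34% and 5.30% are the shares of authors with one, two and three papers respectively; the function name `lotka_expected_share` and the default exponent of 2 are illustrative choices, not taken from the cited study.

```python
def lotka_expected_share(share_with_one: float, n: int, exponent: float = 2.0) -> float:
    """Expected share of authors with exactly n papers under Lotka's law.

    Lotka's law states that the number of authors with n papers is roughly
    the number of single-paper authors divided by n**exponent (classically 2).
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    return share_with_one / n ** exponent


# Reported shares (percent of authors); the mapping to 1, 2 and 3 papers
# is an assumption made for this example.
observed = {1: 73.58, 2: 14.34, 3: 5.30}

# Prediction anchored on the observed share of single-paper authors.
predicted = {n: lotka_expected_share(observed[1], n) for n in observed}

for n in observed:
    print(f"n={n}: observed {observed[n]:.2f}%, inverse-square prediction {predicted[n]:.2f}%")
```

Under these assumptions the inverse-square prediction (about 18.4% and 8.2%) lies above the reported shares for two and three papers, which is the kind of deviation a fitted exponent somewhat larger than 2 would absorb.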

,
, . ,
(,
, ),
.
2006, Perez-Iratxeta, Andrade-Navarro Wren (Perez-Iratxeta,
Andrade-Navarro, & Wren, 2007) PUBMED
3 (,
). 1996 2005 (MESH
terms): comput*, *informatic*,
algorithm*, software database internet,
online, world wide web, web-based, http:* ftp:*.

1.6% 1975, 10% 2005.


0.05% 0.87%
1990 /. ,
,

.,
,
,.,
MESH terms
2000-2003 ,
( 1.11).
(..)
().



1.11: Perez-Iratxeta (Perez-Iratxeta,
Andrade-Navarro, & Wren, 2007)

2007,
(SYMBIOMATICS), (bigrams,
) ,
(2000-2005)(1990-2000)
, ,
(Rebholz-Schuhmann et al., 2007).
1.2.


Bioinformatics | AMIA Annu Symp Proc
Biosystems | Artif Intell Med
BMC Bioinformatics | BMC Med Inform Decis Mak
Brief Bioinform | Int J Med Inform
Comput Methods Programs Biomed | J Am Med Inform Assoc
IEEE Trans Inf Technol Biomed | Medinfo
J Bioinform Comput Biol | Methods Inf Med
J Biomed Inform | Proc AMIA Symp
J Comput Aided Mol Des
J Comput Biol
Pac Symp Biocomput

1.2:
SYMBIOMATICS.

2000-2005,
1990-2000. (1990-2005)


gene expression, amino acid, protein sequence,
information system, health care decision support.
1990-2000
2000-2005
.1.12,,
, Support Vector Machines
.
:,
,,(..),
( ,
).,
(.. support vector machines).,

(..,
,...).

1.12:
, SYMBIOMATICS

,2014
(Song, Kim, Zhang, Ding, & Chambers, 2014).
,
, ,
.
,(
), 73% WoS
(PUBMED).
, ,
PubMed Central, PUBMED
(open access)
.
,
.


,,
,.
1.3
,.
2000-2003,
. 2004-2007
, , . 2008-2011,
.
, mutation RNA.,
protein binding, algorithm/method, cell/model,
network/interaction, genome sequence, immune/virus, gene expression, genetic/evolution, database/software,
gene transcription, DNA/chromosome, ontology/mining, gene/genomics cancer/cell.
,
,
..,
202000-2003,2009-20116,
9 7.
, ,
20().
,
,
(King, 2004),
(http://www.natureindex.com/).

BMC Bioinformatics | Source Code for Biology and Medicine
BMC Genomics | Advanced Bioinformatics
PLoS Biology | BioData Mining
Genome Biology | Journal of Computational Neuroscience
PLoS Genetics | Journal of Proteome Research
PLoS Computational Biology | Journal of Biomedical Semantics
BMC Research Notes | Journal of Computer-Aided Molecular Design
Bioinformatics | Genome Integration
Molecular Systems Biology | Journal of Molecular Modeling
BMC Systems Biology | Bulletin of Mathematical Biology
Comparative and Functional Genomics | Pharmacogenetics and Genomics
Bioinformation | Statistical Methods in Medical Research
Theoretical Biology and Medical Modeling | Neuroinformatics
Human Molecular Genetics | Genomics
The EMBO Journal | Protein Science
Cancer Informatics | Physiological Genomics
Genome Medicine | Trends in Genetics
Evolutionary Bioinformatics | Journal of Proteomics
Biochemistry | Proteomics
Algorithms for Molecular Biology | Trends in Biochemical Sciences
EURASIP Journal on Bioinformatics and Systems Biology | Journal of Biotechnology
Journal of Molecular Biology | Trends in Biotechnology
Molecular & Cellular Proteomics | Briefings in Functional Genomics & Proteomics
Mammalian Genome | Journal of Theoretical Biology
1.3: (Song, Kim, Zhang, Ding, & Chambers,
2014).

Stanford University, Harvard University
University of Washington . Stanford 3
2000-200312004.Harvard6,232000-2003,
2004-2007 2009-2011. , University of Washington 5 2
., University of Cambridge (11, 8 5), University College London (11, 11 10), University of Oxford
2000-2003,10
2004-2008 6 2009-2011. , Brandeis University
12000-2003,122004-2007
2008-2011. University of California, Berkeley
2714.

.
.
2000-2003 "Gene ontology: tool for the
unification of biology" Nature Genetics Gene Ontology Consortium
20 .
(D. Botstein, G. Rubin, G. Sherlock, M. Ashburner, J. Cherry, C.
Ball, J. Matese, H. Butler). ,
("Initial sequencing and analysis of the human genome") Nature. 249
48 . 3 "Significance
analysis of microarrays applied to the ionizing radiation response" V. Tusher, R. Tibshirani, G.
Chu, Stanford. R. Tibshirani 12.
2004-2007, "Bioconductor: open
software development for computational biology and bioinformatics" 25
19 . 4
. R,
"R: A language and environment for statistical computing" 3 "Transcriptional regulatory
code of a eukaryotic genome" 20 4 . ,
2008-2011,
PFAM "The Pfam protein families database" 133 (
A. Bateman R. Durbin,
). KEGG, "KEGG for
linking genomes to life and the environment" 113,
"Mapping short DNA sequencing reads and calling variants using mapping quality
scores" H. Li, J. Ruan R. Durbin.
10
(,
,
).

,
Impact Factor., Proceedings of the National
Academy of Sciences, Nucleic Acids Research, Nature, Bioinformatics Science
. BMC Bioinformatics
6 2004-2007 5 2008-2011. BMC Bioinformatics,
2004-2007 PLoS Biology, BMC Genomics, Nature Reviews
Genetics.
20 2008-2009 PLoS One, PLoS Genetics, PLoS Computational Biology,
Nature Biotechnology Nature Methods.
( Bioinformatics, BMC Bioinformatics PLoS Computational Biology)

.
,
.
,
10,
( ,
,,...).



.,
2000-2003, 2004
. ,
,

,
.

1.4.

.
,
,.

1.4.1.

,,
(,.).

(),
,

( 1.13) (..
).,
(,,&,2012;
,,,&,2013;,,,&,2014)
(, , , &
, 2012).
,
,.
,
(
) . , Nature,
(http://www.natureindex.com)
(
).(1.14),
(WFC),,32
, ()
2004().



1.13: , 27.



1.14: www.natureindex.com

;
:
,
( , , 43
,http://data.worldbank.org/,2012-2013).,
WFC,
, 95%. ,
. . , 33
Nature(-),
.



1.15: WFC .

1.16: WFC .



,
.
, .
,,
(http://www.scimagojr.com, 2011-2013).
,
,
,~70%.,
.
,
WFC

.,,
.,29%
()
48% ,
(48%).
,-15%,19
.

1.17: WFC/ .

,
.,
,
,
.,
.:



.
, .
(),
.
1/32.1%(0.69%)(&,2014).

1.4.2.


, . ,
.
(), http://www.hscbb.gr
,
.
(),
2009 (affiliated) (International Society
for Computational Biology, . http://www.iscb.org/iscb-affiliates-europe#hellenic),
GOBLET (http://www.mygoblet.org) ELIXIR
(https://www.elixir-europe.org/).
, (..
http://hscbb11.hscbb.gr, http://hscbb12.hscbb.gr ...)
, .
,..,
, , . ,
, 120 ,
.(,
,)40,
.
30
,
.
,
2010(Bagos,2010).

,,
.
:
(
), ,
.
ISI WoS, MATHEMATICAL & COMPUTATIONAL
BIOLOGY .
,
. ,

Nucleic Acids Research
(web-server database issues). ,
PUBMED ( WoS).
1.4. ,
,
(JMB, PLoS Biology, Protein
Engineering...), (Science, Nature),


.
(Machine Learning, Pattern Recognition...).,
-.,
,
,
. -
(GREECE CYPRUS), SQL
Yahoo, Term
Extraction web service (http://developer.yahoo.com/search/content/V1/termExtraction.html),
- (
KEYWORDS WoS).

PLOS Comput Biol | Journal of Computer-Aided Molecular Design
Bioinformatics | Nucleic Acids Research (web-server and database issues)
BMC Syst Biol | The Open Bioinformatics Journal
BMC Bioinformatics | Statistical Applications in Genetics and Molecular Biology
Biostatistics | Source Code for Biology and Medicine
J Theor Biol | Online Journal of Bioinformatics
Stat Method Med Res | Journal of Integrative Bioinformatics
IET Syst Biol | Journal of Bioinformatics and Computational Biology
J Comput Neurosci | International Journal of Data Mining and Bioinformatics
J Mol Graph Model | International Journal of Computational Biology and Drug Design
Stat Med | International Journal of Bioinformatics Research and Applications
Biometrika | In Silico Biology
Evol Bioinform | Genomics, Proteomics & Bioinformatics
B Math Biol | Genome Informatics
Biometrics | EURASIP Journal on Bioinformatics and Systems Biology
Algorithm Mol Biol | Current Bioinformatics
Med Biol Eng Comput | BioData Mining
J Math Biol | Advances and Applications in Bioinformatics and Chemistry
IEEE T Inf Technol B | Applied Bioinformatics
J Comput Biol | International Journal of Bioinformatics
Curr Bioinform | Bioinformation
SAR QSAR Environ Res | Pac Symp Biocomput
Math Biosci | Database
Comput Biol Med | Genome Res
Biometrical J | BMC Genomics
Math Med Biol | Int J Data Min Bioin
J Agr Biol Envir St | J Biol Syst
1.4: .


[Chart: TOTAL_PAPERS per YEAR, 1978-2010]
1.18: .

[Figure: bar chart per journal (y-axis ticks 10 to 80); recoverable x-axis labels: BIOMETRIKA, NUCLEIC ACID RES, STAT MED, BMC BIOINFORMATICS, BIOINFORMATICS, J THEOR BIOL, COMPUT BIOL MED, IEEE T INF TECHNOL B]
1.19:
.

405 1976
2010.,,1999,


5(1.18).,405
681 (1,68 ).
( ),
, ,
, . 681 , 636 3
45,18
9 .
(
). , 63
,.
,
,,ImpactFactor.
,
23
Computer Applications in Biosciences,
, Bioinformatics. , A
protein secondary structure prediction scheme for the IBM PC and compatibles 1988 PBM: a
software package to create, display and manipulate interactively models of small molecules and proteins on
IBM-compatible PCs 1995 ( Perrakis A, Constantinides C, Athanasiades A),
,
.
2000 Fickett JW, Hatzigeorgiou AC.
Eukaryotic promoter recognition Genome Res. 2 Promponas VJ, Enright AJ,
Tsoka S, Kreil DP, Leroy C, Hamodrakas S, Sander C, Ouzounis CA CAST: an iterative algorithm
for the complexity analysis of sequence tracts Bioinformatics 3 Pavlou S,
Kevrekidis IG Microbial predation in a periodically operated chemostat - a global study of the
interaction between natural and externally imposed frequencies, Math Biosci.
2001-2005, Carninci P, Waki K, Shiraki T, Konno H, Shibata
K, Itoh M, Aizawa K, Arakawa T, Ishii Y, Sasaki D et al. (V Aidinis)
Targeting a complex transcriptome: The construction of the mouse full-length cDNA encyclopedia
Genome Res, 2 Patrinos GP, Giardine B, Riemer C, Miller W, Chui DHK,
Anagnou NP, Wajcman H, Hardison RC Improvements in the HbVar database of human
hemoglobin variants and thalassemia mutations for population and sequence variation studies Nucleic
Acids Res 3 Bagos PG, Liakopoulos TD, Spyropoulos IC, Hamodrakas SJ
PRED-TMBB: a web server for predicting the topology of beta-barrel outer membrane proteins,
Nucleic Acids Res. , 2006-2010, Liolios K,
Tavernarakis N, Hugenholtz P, Kyrpides NC The Genomes On Line Database (GOLD) v.2: a
monitor of genome projects worldwide Nucleic Acids Res, 2
(Liolios K, Mavromatis K, Tavernarakis N, Kyrpides NC) The Genomes On Line Database
(GOLD) in 2007: status of genomic and metagenomic projects and their associated metadata
Nucleic Acids Res, 3 Ulrich EL, Akutsu H, Doreleijers JF, Harano Y,
Ioannidis YE, Lin J, Livny M, Mading S, Maziuk D, Miller Z et al. BioMagResBank
(Nucleic Acids Res).
20 ,
74web-servers
. , 20 /
89webservers .
(ImpactFactor,
),(,),

.
,
(..


).,
2005,.
,
, Impact Factor,
,,
(1.5).,
(6),
(8)(9).
. (
18),
1.5.,,
, ,
... ,
(3),2012
, , .
(6)(9).

Institution                                                 Impact Factor
University of Athens                 53        706          159.86
University of Ioannina               41        192          65.53
University of Central Greece         34        498          97.39
Natl Tech Univ Athens                28        139          54.21
University of Patras                 21        66           48.17
BSRC Alexander Fleming               18        362          103.11
Aristotle Univ Thessaloniki          16        49           25.66
Natl Ctr Sci Res Demokritos          15        110          35.9
Tech Educ Inst Athens                14        46           28.73
Acad Athens | Biomed Res Fdn         12        13           35.33
Ctr Res & Technol Hellas CERTH       9         116          32.87
University of Thessaly               9         70           17.46
University of Cyprus                 8         139          32.53
Tech Educ Inst Lamia                 6         42           11.47
Democritus Univ Thrace               6         31           15.44
1.5: , ,
.

,
, 15

. ,
,
. , ,
,
.



1.20: .

1.4.3.

,
,(
).
(
),,,/,
18 ( 1.6).
, /,
.,
. ,
, / . 11 18
, (..
), 3 18
3 , (
. . ,
).
,
, , ,
.
,3
(, , , ,
).,
4
, .
, , ,
,
.
.

,
,/(
). ,


,
.
,
,,.

(*)
(*)
(*)

(*)

(*)

(*)
(*)

(**)
(**)
/
(**)

,
,

,,
(*)
,

(*)

(*)

(*)

(*)

1.6: . (*)
.


1.21:
.

, 6
() ()
(
).1.7.

(
)
,

(
)

,,
,
(

)
/

1.7: .

,
,()
, , , /
.2003,
(
)
. ,
. ,
,
.,
,,
,
()


,.,

. ,

. ,
,
.

,
.,

,
.


Altman, R. B. (1998). A curriculum for bioinformatics: the time is ripe. Bioinformatics, 14(7), 549-550.
Bagos, P. G. (2010). Bioinformatics and Computational Biology in Greece: a bibliometric study. Paper presented at the 5th Conference of HSCBB (HSCBB10), Alexandroupolis.
Chalmers, A. (1999). What Is This Thing Called Science? (3rd revised edition ed.). Hackett: University of Queensland Press, Open University press.
Ditty, J. L., Kvaal, C. A., Goodner, B., Freyermuth, S. K., Bailey, C., Britton, R. A., ... Kerfeld, C. A. (2010). Incorporating genomics and bioinformatics across the life sciences curriculum. PLoS Biol, 8(8), e1000448.
Eddy, S. R. (2005). "Antedisciplinary" science. PLoS Comput Biol, 1(1), e6.
Floriano, W. B. (2008). A portable bioinformatics course for upper-division undergraduate curriculum in sciences. Biochem Mol Biol Educ, 36(5), 325-335.
Hagen, J. B. (2000). The origins of bioinformatics. Nat Rev Genet, 1(3), 231-236.
Honts, J. E. (2003). Evolving strategies for the incorporation of bioinformatics within the undergraduate cell biology curriculum. Cell Biol Educ, 2(4), 233-247.
King, D. A. (2004). The scientific impact of nations. Nature, 430(6997), 311-316.
Luscombe, N. M., Greenbaum, D., & Gerstein, M. (2001). What is bioinformatics? A proposed definition and overview of the field. Methods Inf Med, 40(4), 346-358.
Molenberghs, G. (2005). Biometry, biometrics, biostatistics, bioinformatics, ..., bio-X. Biometrics, 61(1), 1-9.
Ouzounis, C. A. (2000). Two or three myths about bioinformatics. Bioinformatics, 16(3), 187-189.
Ouzounis, C. A. (2002). Bioinformatics and the theoretical foundations of molecular biology. Bioinformatics, 18(3), 377-378.
Ouzounis, C. A. (2012). Rise and demise of bioinformatics? Promise and progress. PLoS Comput Biol, 8(4), e1002487.
Ouzounis, C. A., & Valencia, A. (2003). Early bioinformatics: the birth of a discipline--a personal view. Bioinformatics, 19(17), 2176-2190.
Patra, S. K., & Mishra, S. (2006). Bibliometric study of bioinformatics literature. Scientometrics, 67(3), 477-489.
Perez-Iratxeta, C., Andrade-Navarro, M. A., & Wren, J. D. (2007). Evolving research trends in bioinformatics. Brief Bioinform, 8(2), 88-95.
Rebholz-Schuhman, D., Cameron, G., Clark, D., van Mulligen, E., Coatrieux, J. L., Del Hoyo Barbolla, E., ... Van der Lei, J. (2007). SYMBIOmatics: synergies in Medical Informatics and Bioinformatics - exploring current scientific literature for emerging topics. BMC Bioinformatics, 8 Suppl 1, S18.
Roberts, R. J. (2000). The early days of bioinformatics publishing. Bioinformatics, 16(1), 2-4.
Searls, D. B. (2010). The roots of bioinformatics. PLoS Comput Biol, 6(6), e1000809.
Searls, D. B. (2012). An online bioinformatics curriculum. PLoS Comput Biol, 8(9), e1002632.
Song, M., Kim, S., Zhang, G., Ding, Y., & Chambers, T. (2014). Productivity and influence in bioinformatics: A bibliometric analysis using PubMed central. Journal of the Association for Information Science and Technology, 65(2), 352-371.
Trifonov, E. N. (2000). Earliest pages of bioinformatics. Bioinformatics, 16(1), 5-9.
Welch, L., Lewitter, F., Schwartz, R., Brooksbank, C., Radivojac, P., Gaeta, B., & Schneider, M. V. (2014). Bioinformatics curriculum guidelines: toward a definition of core competencies. PLoS Comput Biol, 10(3), e1003496.
Wingender, E. (1998). ISB: Just Another Journal? In Silico Biol, 1(1), 1-4.
Yan, B., Ban, K. H., & Tan, T. W. (2014). Integrating translational bioinformatics into the medical curriculum. Int J Med Educ, 5, 132-134.
,.,,.,&,.(2014). .:
,.,&,.(2014,25/11/2014)..
.
,.,,.,&,.(2012).1996-2010:
Retrieved
fromhttp://reports.metrics.ekt.gr/
,.,,.,,.,&,.(2012).
2000-2010- Retrievedfromhttp://metrics.ekt.gr/el/node/15
,.,,.,,.,&,.(2013).
1996-2010:-
Scopus Retrievedfromhttp://report03.metrics.ekt.gr
,.,,.,,.,&,.(2014).
1998-2012:-
WebofScience Retrievedfromhttp://report04.metrics.ekt.gr/


2:

,
, ,
(, , , ,
...). ,
.
, ( ) ,

.

,
(DNA, RNA, ).

2.

, ,

.,

,.
,
.
,
,.

, ,
.

.
(annotation)
.
UniprotKB/SWISS-PROT,
547.599 (Rel. 2015_02 2015)
EMBL Nucleotide Sequence Database 510.014.239
(Rel.122-2014).
. ,
.


.
, , 2 ,
,.
, ,
:






, ,
,,:
()
()

2.1.

,
, . ,
,,
, ,
.

2.1.1

(2.1),
. (sequencing),
DNA RNA,
(.. )
.

GENBANK (NCBI), DNA Data Bank of Japan (DDBJ) EMBL


Nucleotide Sequence Database (). , ,
,
.
International Nucleotide Sequence Database Collaboration.
International Nucleotide Sequence
DatabaseCollaboration:
GENBANK: GENBANK (http://www.ncbi.nlm.nih.gov/Genbank/index.html)
(Benson et al., 2014),
...(NationalInstitutesofHealth).

.
. ,
(annotate).

.GENBANK
14(Rel.206,2015)
181.336.445187.893.826.750.
EMBL-Bank: EMBL Nucleotide Sequence Database (http://www.ebi.ac.uk/embl/)
,
(EMBL)
(EBI) Cambridge, UK. EMBL-Bank
,
.GENBANK,
,
.Fasta
BLAST.EMBL-Bank(Rel.122-2014)510.014.239.
1.094.969.877.589.



2.1: GenBank, 1982
2004.

DDBJ: DNA Data Bank of Japan (DDBJ - http://www.ddbj.nig.ac.jp/)


.1986
(NIG) ,
..
DDBJ .
DDBJ (Rel. 99, 2014) 178.825.615 184.410.381.191
.
DNA,Genbank,
DDBJEMBLDataBank,InternationalNucleotide
Sequence Collaboration, ,
,(2.2).
, , 3
, 3
.

2.2: 3
(INSDC; International Nucleotide Sequence Database Collaboration,
http://www.ddbj.nig.ac.jp/insdc/insdc-e.html).


2.1.2 .
,
( DNA),
,
.,
,
,
.
UniprotKB (Uniprot Knowledgebase, http://www.uniprot.org/), ,
(UniProt, 2014).
,Uniprot/SwissProt
Uniprot/TrEMBL
() . UniprotKB/SwissProt 547.599
(Rel. 2015_02 2015)
,
, ,
( ), . H Uniprot/TrEMBL
(Rel. 2015_02 2015) 92.124.243
. , Uniprot
,
""Uniprot/TrEMBL
Uniprot/SwissProt.,
( ,
,,,
...), , Uniprot/SwissProt ,
( ),
. ,
. Uniprot
,.

2.3: Swiss-Prot,
1986 2004.


, Uniprot 2002
, SwissProt PIR. H SwissProt 1986
(Swiss Instituteof Bioinformatics)
(EuropeanBioinformaticsInstitute).Protein Information
Resource (PIR - http://pir.georgetown.edu/) .
Georgetown
(NBRF) ... PIR-International Protein Sequence Database
(PSD),PIRoMunichInformationCenter
forProteinSequences(MIPS)JapaneseInternationalProteinInformationDatabase(JIPID).2002,
PIREBI(EuropeanBioinformaticsInstitute)SIB(SwissInstituteof
Bioinformatics)UniProtconsortium.PIR-PSD
,UniProtKnowledgebase.
UniProtPIR-PSD
PIR-PSD. PIR-PSD

UniProt.

2.1.3 .


.
(, , ...),
,,
NMR.,,,

. , PDB,
.
Protein Data Bank:HProteinDataBank(PDB,www.rcsb.org)
(Kouranovetal.,2006).
1971 Brookhaven National Laboratories (BNL) . 7

'70.'80

,PDB
(NMR). ( 2015) PDB 106.858 .
PDB
,
.
.
.
, PDB , .
,PDB
( ) ,
.,
,
,,
. ,
,
.
Uniprot PDB. ,
,
, . PDB
, . , ,
(MMDB)NCBI,
PDB.



2.4: PDB,
1977 2004.

2.1.4
,
.
,NextGeneration
Sequencing,
.
.
,
.,
,,

.,,
, ""
.,
.
,

(MIAME: Minimum Information About a Microarray Experiment).


, ""
. ,
,
( ).
:
Gene Expression Omnibus (GEO): NCBI
, (next generation sequencing) (Barrett &
Edgar, 2006) http://www.ncbi.nlm.nih.gov/geo/,



. (raw) (
...). ( 2015), 14.031
, 1.357.732 "", (
,,-),
55.725 "" (series) 3.848 " " (datasets).
.
Array Express:
,,http://www.ebi.ac.uk/arrayexpress/(Brazmaet
al.,2003).GEO,
. ,
tutorials. 2015,
57.009 (experiments, series GEO) 1.689.237
(assays,).
Stanford Microarray Database (SMD):
Stanford,
, http://smd.stanford.edu/ (Demeter et al., 2007).
,
84.051631.

2.1.5
, DNA,
, .
(),
( , ...).
dbSNP,
,HapMap.
dbSNP: dbSNP
http://www.ncbi.nlm.nih.gov/snp (Sherry et al., 2001). (single
nucleotidepolymorphisms-SNPs),
(deletioninsertionpolymorphisms-DIPs),
(short tandem repeats - STRs). dbSNP
(),
,,
. dbSNP
,.
NCBI
http://www.ncbi.nlm.nih.gov/books/NBK3848/. 129 (2008) 14
,.
HapMap: International HapMap Project (http://hapmap.ncbi.nlm.nih.gov/)

(HapMap,2003).
()
(,
), . ,
4 , .
, ,
.,,
,
(LinkageDisequilibrium),,.
,


.,
.

2.1.6
,,
, .,
(, ,
...). , PubMed (http://www.ncbi.nlm.nih.gov/pubmed)
NCBI 24
(MEDLINE,
online ).
, PubMed Central (
),.
PubMed ,
, . ,
tutorials(http://www.nlm.nih.gov/bsd/disted/pubmed.html).
,,SCOPUS(http://www.scopus.com/)Web
of Science (http://webofknowledge.com/). , ,
(citations).
( ),
( , ).

.
,
,,(text
mining), ,
(),
(Ananiadou,
Kell,&Tsujii,2006;Scherf,Epple,&Werner,2005).

2.2.


, ,
.

2.2.1
,
(domains), . ,
.
.
,
.
( ),

.
,
.,
,
,
.,)


(,pattern,...),)
.CATHSCOP,
PROSITE,PFAM,INTERPRO.,
,.
, ,
.

2.5: HCK (Uniprot: P08631, PDB: 2HCK_A).


, (domains) . ,
PFAM PROSITE .
, SCOP. ,
PFAM PROSITE, ,
.



PROSITE (http://www.expasy.ch/prosite/)
(sequencedomains)(Sigristet
al., 2010).
.
.

.
,,
'' ,
.
''.
""(regularexpressions),
, (profiles),
,
.,.
PROSITE '' 1716
.
,1308(patterns),11071105""(

). ,
(,).
Uniprot
""()
. ,
,
"",
.
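The PROSITE entry above mentions patterns written as regular expressions. As a minimal sketch of that idea (not PROSITE's own software), the function below translates the commonly documented PROSITE pattern syntax into a Python regular expression: `x` for any residue, `[...]` for allowed residues, `{...}` for excluded residues, `(n)` or `(n,m)` for repeats, and `<` / `>` for the sequence termini. The function name `prosite_to_regex` and the test sequence are hypothetical; the pattern `N-{P}-[ST]-{P}` is used purely as an illustration.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    regex_parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # '<' anchors the pattern at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # '>' anchors the pattern at the C-terminus
            suffix, element = "$", element[:-1]
        # Split "core(n)" or "core(n,m)" into the residue spec and its repeat count.
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core == "x":                  # 'x' stands for any residue
            core = "."
        elif core.startswith("{"):       # {ABC} means any residue except A, B or C
            core = "[^" + core[1:-1] + "]"
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        regex_parts.append(prefix + core + repeat + suffix)
    return "".join(regex_parts)


# Illustrative use on a made-up sequence.
glyco = prosite_to_regex("N-{P}-[ST]-{P}")
hits = [m.start() for m in re.finditer(glyco, "MKNGSAANPTQ")]
print(glyco, hits)
```

Short patterns of this kind match by chance quite often, which is one reason a database would also offer profiles for more sensitive detection.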
PFAM: Pfam (http://pfam.xfam.org/)
(Finnetal.,2014). PROSITE(
),
hidden Markov model (HMM), ,
.(2013),
14.83180%
UNIPROT.
PFAM,PFAM-A,PFAM-B.PFAM-A
() , ,
. PFAM-B
,

PFAM-A. PFAM-B , ,
PFAM-A.
PFAM , (
HMMER,.),
,
( ). ,
,
,-(clan).
CATH: CATH (http://www.biochem.ucl.ac.uk/bsm/cath_new/index.html)
PDB
(domains) (Knudsen & Wiuf, 2010). CATH
3Angstroms
.
.:
1) (Class), 2) (Architecture), 3) ( )


(Topology (fold family)) 4) (Homologous superfamily).


(domains),
.
53%.
.
.
CATH:
C-(Class):4
: 1) mainly-alpha,
-, 2) mainly-beta,
- , 3) alpha-beta, / +
4) .
90%10%
.
A-(Architecture):
(domain),
..(barrels).
T-(Topology):
.
H-(Homologous superfamily):
35%
.
S-(Sequencefamily):35%
.
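The CATH levels above refer to a 35% sequence identity threshold. The sketch below shows one simple way to compute percent identity from two already aligned sequences and compare it with such a cut-off. It is an illustration only: the choice of denominator (columns where neither sequence has a gap) is an assumption, since several conventions exist, and the function names and example sequences are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over alignment columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = identical = 0
    for res_a, res_b in zip(aligned_a, aligned_b):
        if res_a == "-" or res_b == "-":
            continue                     # skip gapped columns
        compared += 1
        if res_a == res_b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def same_sequence_family(aligned_a: str, aligned_b: str, threshold: float = 35.0) -> bool:
    """True if the pair reaches the identity threshold (35% by default)."""
    return percent_identity(aligned_a, aligned_b) >= threshold


# Made-up alignment: 4 identical residues out of 5 ungapped columns.
print(percent_identity("ACDE-F", "ACDQGF"))
print(same_sequence_family("ACDE-F", "ACDQGF"))
```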
SCOP:SCOP(http://scop.mrc-lmb.cam.ac.uk/scop/index.html)

PDB(Andreeva,etal.,2004).
. : 1)
(Family),2)-(Superfamily),3)(Fold)4)(Class).
(Family):.
30%.

30%(,15%).
-(Superfamily):-

.
(Fold):
. , ,
,
.

.
.
(Class):
:1)ll-,-,2)
all-, - , 3) /,
- - 4) +,
--.


.
.


2.6: SCOP (, , , ).

2.2.2

,
,
.,,(
,,
Uniprot), ,
.


. , ,
.
11-12 2014 Wellcome Trust,
Hinxton ,

(SpecializedProteinResourcesNetwork).


: (1)
,(2),(3),
(4) ,(5).
(Hollidayet
al.,2015).

(,
)(
Pfam, RefSeq, Swiss-Prot, UniProt).
, ,
,,.

.
,

(strains),
(Uniprotgi).

,
.

,
.,
:
;
.
.ESTHER,
-,GPCRDB
GPCRs, UniProtKB,
.
2009 (Schnoes, Brown, Dodevski, & Babbitt, 2009)
/(misannotation)80%.SwissProt UniProtKB
, 0%.
.
, .
,
.
. 50
.
(Kirby, 2001).
: ;
;
; ,
(/ )
,,.



2.7: " " , . Gert Vriend

,
(modular) .
(Module)
(carbohydrate-binding module, CBM)
.
Blast,CBM,
. ,
(singledomainprotein)
(..
(UniProtKB: B8NM72)). ,
-, (NRPS),
NRPS. ,
(Umemuraetal.,2014),
NRPS .

. ,
(SpecialistProteinResources-SPRs).
"
".
/ (.. )
(..).,

,
.BLASTnon-redundantNCBI,

.
.
, .
, (gene assembly errors)


, ,
.
,
SPRs.
;,
UniProtKB,
. , PDB, ,
PDB_REDO(Joosten,Long,Murshudov,&Perrakis,2014)
.
,
(Nagyetal.,2008;Wong,Maurer-Stroh,&Eisenhaber,2010).
,
; .
UniProtKB,
; SFLD, : SFLD
;(..UniProtKB)ECO(EvidenceCode
Ontology)(Chibucosetal.,2014),.
,
.

ECO.
,
; (
) .
,

.,.,
(
) .

.,
.
.
( )
. ,
,
.
. , ,
EnzymeCommission(EC).
, (WikiPedia).
Rfam.
Wikipedia,Rfam.,
; Swiss-Prot
;
;
; ,
,SPRs,
.
,
. EMBRACE
(Pettiferetal.,2010).superfamily
.,SFLD,
.TIGRFAM,,
.SFLD,PANTHER,
(SFLDPANTHER).


, .
Enzyme Commission (EC).
(,,...)(
,).
MACiE EzCatDB, ,
,
. MACiE , EzCatDB
.
, EC (
).
EC50.
EC ., EC
EC,
.,
,
EC ,
. , EC
,

(.. -)
EC (.. ).
EC
ECBLAST.
,
. ;
,CAZy,
.,MACiE
EzCatDB, EC ,
,
.

(,
),.
.(GO)
,
.PubMed""
1.500 .
, , (.. )
.
, ,
,,BioPortal(Grosjean,Soualmia,Bouarech,
Jonquet, & Darmoni, 2014) OBO Foundry (Smith et al., 2007),
.


2.8: Protein Bioinformatics and Community Resources Retreat.


.. :
David Landsman (Histone database), Dan Haft (TIGRFAMS), Bernard Henrissat (CAZy), Rob Finn (InterPro and
Pfam), David Craik (ConoServer and CyBASE), Arnaud Chatonnet (ESTHER), Neil Rawlings (MEROPS);
: Amos Bairoch (neXtProt), Gerard Manning (Kinase.com), Michael Spedding (IUPHAR), Gert Vriend (GPCRDB),
Milton Saier (TCDB), Pantelis Bagos (OMPdb); : Narayanaswamy Srinivasan (KinG), Ramanathan
Sowdhamini (PASS2), Alex Bateman (Pfam & UniProt), Patsy Babbitt (SFLD), Kim Pruitt (RefSeq), Claire ODonovan
(UniProt), Gemma Holliday (MACiE), Nozomi Nagano (EzCatDB).

Nucleic Acids Research (Fernández-Suárez, Rigden, & Galperin,
2014) 2014 1.552 58 123
..

..2008
40%URL
(Wren, 2008). ,
URL.
.
/,
.
;
. Interpro,
.
Interpro.
/(,
REFSEQUniProtKB)SPRs.SPRs,
, , (.. SFLD
), (), (.. KEGG (Kanehisa et al., 2014)
), ,
.



.
. UniProtKB 2014 86
,
.,
,.

.,,
Structural Genomics Consortium,

.
, , putative
glycosidehydrolase,
. , (
)(
), (glycoside hydrolases).
(
) (
)
(Martinetal.,1998).
(moonlighting proteins) ,
, . ,

. , , ,
,
,
,
.,SPR
,

.,
.
,
Protein Bioinformatics and Community
ResourcesRetreat(2.8).
TCDB: ,
(Stein, 2013)
, .
,
.(structuredquerylanguage-SQL)

(Jamison, 2003). ,
Transporter Classification Database (TCDB, www.tcdb.org (Saier,
Reddy, Tamang, & Västermark, 2014)). TCDB HTML. 1998
(OracleMySQL)PHP
San Diego Supercomputer Center (www.sdsc.edu).

,,,-.
(multi-component), 100
. TCDB
. 7 , 56 , 937 , 9098
,1180612086.
, .


TCEC(EnzymeCommission)(Bairoch,1999),
( )
(, - -).
(International Union of Biochemistry and Molecular Biology,
IUBMB) (Saier,2000).
TCDB:,
,
TC-BLAST
,.,
PFAMOMPdb.

OMPdb: OMPdb (Tsirigos, Bagos, & Hamodrakas, 2011),


http://www.ompdb.org, -
Gram . 2011
70.000 . 3 , 500.000
. OMPdb 91 ,
. Hidden Markov Model
(pHMM), .
Pfam(Finn,etal.,2014),
Pfamclan(clan-CL0193),
(DomainsofUnknownfunction-DUFs).,
15,PfamPfam-.
,
.,
, .
pHMM
Pfam, - .
,,
pHMM Pfam, ,
.
-,
.,OMPdb
-,
,,.
-,
, ,
. OMPdb TCDB
Pfam ( -
Gram ),

(,...)..
MySQL,
Apache-PHP .
( , ),
MySQL,
. OMPdb
,
.
-
,
.
OMPdb ,
UniProt
-, .
,OMPdb,


.
(Pfam TCDB),
, ,
OMPdb.

CAZy: CAZy(www.cazy.org)
(carbohydrate-binding modules) , ,
, (Cantarel et al., 2009; Lombard, Ramulu, Drula,
Coutinho, & Henrissat, 2014). CAZy 1991,
(Henrissat,1991)
. 90
(Campbell,Davies,Bulone,&
Henrissat, 1997). 1998,
SQL ( 1999),
. ,
CAZy ,
, 50%
. ,
,EC
,
CAZy-.,,

.
CAZypedia(http://www.cazypedia.org),
CAZy.CAZypediaHarryBrumer
BritishColumbia
,

. ,
CAZy.,CAZymes,
,CAZy,
.,
, ,
CAZy.
, CAZy .
/
CAZy.

( ) ,
,
,
.
, , ,
.
CAZy.

MEROPS: Merops(http://merops.sanger.ac.uk)

(Rawlings,Waller,Barrett,&Bateman,2014).
,,,
,
, ( ) , ,
,,
. ( ,
, , II, Alzheimer),


( , ,
, ,
...).1996
400.000 . /
(clan).
. 61 , 251
4.236(377Enzyme Nomenclature).
28.000
, 1.200 .
53.000
,UniProt,PfamInterpro.
,
.
.
,,,
,
.
,
.
:,,
,
.
neXtProt: neXtProt (http://www.nextprot.org/)
(Lane et al., 2012).
UniProtKB/Swiss-Prot
,
.
neXtProt.
. neXtProt
:(>5%),(15% ) ( 1% ).
neXtProt.
,
repositories
.,
,
.
neXtProt

. neXtProt

.
. ,
.
.
10
.

PASS2: PASS2 -
(Protein sequence Alignments of Structural Superfamilies).
CAMPASS (Sowdhamini et al., 1998). -
,
, ,
.PASS2HTML,


(PASS2.4)(Gandhimathi,Nair,&Sowdhamini,2012),MYSQL
PHP.,SCOP1.75(Murzin,Brenner,
Hubbard, & Chothia, 1995) -,
1961.
, , ,
,.
, ,
outliers (Gandhimathi, et al., 2012),

-.

,
outliers.

KinG: Kinases in Genomes (KinG) Ser/Thr/Tyr


,
(Krupa,Abhinandan,&Srinivasan,2004).
Garuda India
http://megha.garudaindia.in/king/. Ser/Thr/yr
, .
-
.

.KinG2004(Krupa,etal.,
2004), 40 . KinG .
, . ,
,
. KinG 12200
131.921 .
,
. KinG Hanks
Hunters(Hanks&Hunter,1995)
(multiple position-specific scoring matrices - PSSM) (Gowri, Krishnadev, Swamy, & Srinivasan, 2006).
- .,
, .
-,
,.
(Deshmukh, Anamika, & Srinivasan, 2010; Rakshambikai, Gnanavel, & Srinivasan, 2014)
(hybrid)
-,,
-
.o,
- . ,

- -.
-.
rogue (Deshmukh, et al., 2010; Rakshambikai, et al., 2014). O

. .
(Bhaskara et al., 2014; Gnanavel et al., 2014; Martin, Anamika, &


Srinivasan,2010),KinG.
KinG NetBeans IDE Java,JSP, Servlets,AJAX, Jquery,
XML, HTML CSS ,
.
EzCatDB: EzCatDB (http://ezcatdb.cbrc.jp/EzCatDB/) 2004
,,
. (Nagano,
2005; Nagano et al., 2014) Enzyme Commission (E.C.) (NC-IUBMB;
http://www.chem.qmul.ac.uk/iubmb/enzyme/)
(Fleischmann et al., 2004; McDonald, Boyce, & Tipton, 2009; Tipton,
1994).(RLCP)EzCatDB
(nucleophilic substitution reactions),
, , ,
,(Nagano,etal.,2014).
EzCatDBProteinDataBank(PDB)(Roseet
al., 2013) UniProt,
CATH(Cuffetal.,2011).,
KEGG(Kanehisa,etal.,2014).EzCatDB
PostgreSQL.
(Nagano, 2005)
E.C., IDs ,
(Nagano,
2005).,
PDB,
, , ,
.
(Nagano,2005).
.
,
.,

, , .
EzCatDB871,1.610UniProtKB6.704
PDB..
300 ,
.

MACiE: MACiE(Mechanism,AnnotationandClassificationinEnzymes,http://www.ebi.ac.uk/thorntonsrv/databases/MACiE),(Holliday
et al., 2012). MACiE ,
, ,
. MACiE
PDB .
MACiE,
.
,

( ).
,
.

. MACiE
MySQL.


.
MACiE(Hollidayetal.,2007).
,
. MACiE
-
. (1)
, (2) , (3),
, (4) , (5)
, , (6)
, EC, UniProtKB, CATH,
PDB.
MACiE
.
MACiE
. ,
, PDB.
, ..,
,.
,.

ESTHER: ESTHER (ESTerases and alpha/beta-Hydrolase Enzymes and Relatives)


- /-
(www.bioweb.supagro.inra.fr/esther). /
- .
800.000 ( 42.000
)175-(Lenfant,Hotelier,Bourne,Marchot,&Chatonnet,2013).
- HMM (Lenfant et al., 2013).
-
, ,
.,,

(Lenfant,Hotelier,Bourne,Marchot,&Chatonnet,2014;Marchot&Chatonnet,2012).
:
,,..
-,
.ESTHER1994
Gopher WWW (Cousin, Hotelier, Lievin, Toutant, & Chatonnet,
1996).ACeDB.
,
(Chatonnet, Cousin, & Robinson, 2001).
. R,

/
.

ConoServer: Cone

(Akondietal.,2014;Terlau&Olivera,
2004). (Davis, Jones, & Lewis, 2009)
(Biggset al.,2010;Chang&Duda,2012;Puillandre,Koua,Favreau,Olivera,&Stocklin,2012)
(Duda,Chang,Lewis,&Lee,2009;Duda&Lee,2009).2014,
ConoServer(Kaas,Yu,Jin,Dutertre,&Craik,2012)2000,
,(Kaas,Westermann,&Craik,2010).


ConoServer (www.conoserver.org) ( )
,.
.-
MySQL PHP.
ConoServer GenBank (Benson, et al., 2014), UniProt-
(UniProt, 2014) PDB (Berman, Henrick, Nakamura, & Markley, 2007),
. ,
, MySQL,
.
,
,
.

. ConoPrec (Kaas, et al., 2012)


,
ConoServer ,
ConoServer.
CyBase: ,
(Craik, 2006; Kedarisetti, Mizianty, Kaas, Craik, & Kurgan, 2014).

(Trabi &
Craik, 2002).
(Poth,
Chan, & Craik, 2013). CyBase (www.cybase.org.au)
,
(Wang, Kaas, Chiche, & Craik, 2008). 2014 CyBase
420160.
,
cyclotide, 282 . CyBase
ConoServer. CyBase
,

(Wang, et al., 2008). CyBase
.

GPCRDB: G- (G protein-coupled receptors, GPCRs)


.
,
,(Bockaert&Pin,1999;Lagerstrom&Schioth,
2008). 30% ,
(Garland, 2013; Overington, Al-Lazikani, &
Hopkins, 2006). GPCR, GPCRDB (http://gpcrdb.org), 1993.
, ,
, GPCRDB
,
.,GPCRDB
(Horn et al., 2003; Horn et al., 1998; Vroling et al., 2011). 2013, GPCRDB
Gloriam , EU
COST GPCR Action GLISTEN. , GPCRDB
.
, ,
(Isbergetal.,2014).
GPCRDBinvitro
.


. ,
.GPCRDB
GPCR,
, (Katritch, Cherezov, & Stevens,
2013).
- .
, -
.
GPCRDB .
HTML5 ( CSS3) JavaScript
.
ScalableVectorGraphics(SVG),
.
MySQL,Apache.
GPCRDBSOAP,
/.

IUPHAR/BPS Guide to PHARMACOLOGY: IUPHAR/BPS Guide to PHARMACOLOGY


(GtoPdb, (Pawson et al., 2014))
(Union of Basic and Clinical Pharmacology, IUPHAR)
(BPS).
GtoPdb.
GtoPdb (http://www.guidetopharmacology.org/)
IUPHAR
(IUPHAR-DB, (Harmar et al., 2009) (Guide to Receptors and
Channels,GRAC),BPS,BritishJournalofPharmacology
(..,(Alexander,Mathie,&Peters,2011)).
GtoPdb:

/




-, ncRNAs ( ,
HGNC), (Tough, Lewis, Rioja, Lindon, & Prinjha, 2014),
(Bonner,2014),allostery(Christopoulosetal.,2014),(),
(Spedding, 2011).
.
GtoPdb 2.700
( G-
GPCRs, , , , , ,
),,
,
. ,
,.
, ,
.
GtoPdbChEMBL(Gaultonetal.,2012),


.
. ,
,..UniProt,Ensembl,EntrezGene,KEGG,.
, ,
, , PubChem,
DrugBank ChEMBL. , Concise Guide to
PHARMACOLOGY,GtoPdb,
. British Journal of Pharmacology
GRAC(Alexanderetal.,2013).

Kinase.com: Kinase.com ,

(Manning,Whyte,Martinez,Hunter,&Sudarsanam,2002).
"kinome", . , KinBase,
7.000
, 14 (Bradham et al., 2006;
Caenepeel, Charydczak, Sudarsanam, Hunter, & Manning, 2004; Eisen et al., 2006; Goldberg et al., 2006;
Srivastava et al., 2010; Stajich et al., 2010). 10 , 287
356 . KinBase
,,.,
BLAST
.
,,
,,,
HMM , .
(,)
,HMM
. KinBase, Kinase.com wiki,
WiKinome,.
wiki-.
Kinase.com 1999
Sugen Caenorhabditis elegans (Bingham,
Plowman,&Sudarsanam,2000;Manning,2005;Plowman,Sudarsanam,Bingham,Whyte,&Hunter,1999)
Saccharomycescerevisiae(Hunter&Plowman,1997). KinBase
2002(Manning,Whyte,et
al.,2002)(Manning,Plowman,Hunter,&Sudarsanam,2002).
MySQL Perl.

Modelviewcontrollerwebframework,HTML5,CSS5JavaScript.
.
6.000,Kinase.com
kinomes 15 .
, ,
/,kinomes15.
,
kinomes.
Structure-Function Linkage Database: Structure-Function Linkage Database (SFLD;
http://sfld.rbvi.ucsf.edu/django/), (Akiva et al., 2014; Pegg et al., 2006),
-
.
- (Gerlt &
Babbitt,2001).SFLD
, " -
"(Babbitt&Gerlt,1997).


-
-(Almonacid&Babbitt,2011).
-,
,
. ( -), SFLD
.
, - ,
-
,
,(Babbittetal.,
1996;Gerlt,Babbitt,&Rayment,2005).

-,-
(CoreSFLD),-.
-,,
(user interface).
- , s,
,3D
Chimera(Pettersenetal.,2004),
. Extended SFLD
.
,SFLDDjango,
PythonWeb,.
,

(GUI),.
,
- 100.000
.
SFLD-CoreExtended
SFLD,(Atkinson,Morris,Ferrin,&Babbitt,2009).

Histone Database:Histone(http://research.nhgri.nih.gov/histones/)1996(Baxevanis&
Landsman, 1996), ,
DNA (Baxevanis, Arents, Moudrianakis, & Landsman, 1995).
,
PSI-BLAST (Altschul et al., 1997) HMMER
(Eddy,2009).,
.
.
,
.
(1, 2, 2, 3, 4) ,

. (Baxevanis,et
al.,1995),(MarinoRamirezetal.,2011).HistoneDatabase
.HistoneDatabase
ProteinDataBank(PDB)(Roseetal.,2014),
-.

-
- (.. (CENP-A CSE4), 3.3, H2A.B , H2A.Z, H2B.Z,
macroH2A(136),,-H1).



, ,
,
.

(
nextProt

).

SubtiList
(http://genolist.pasteur.fr/SubtiList/) (Moszer, Jones, Moreira, Fabry, & Danchin, 2002) Bacillus
subtilisEcoCyc(http://ecocyc.org/)Escherichia coliK-12(Karpetal.,2002).
, Genome Online Database (GOLD),
(https://gold.jgi-psf.org/)(Reddyetal.,2015).
,.
ONCOMINE,
.
,http://www.oncomine.org/, (Rhodeset
al.,2004).RNA-Seq Atlas (http://medicalgenomics.org/rna_seq_atlas)

RNA
(RNA-Seq). 11 .

(Kruppetal.,2012).Next Generation Sequencing Catalog (NGS
Catalog)

(http://bioinfo.mc.vanderbilt.edu/NGS/index.html). ,

(Xiaetal.,2012).
,
.
( ) . , OMIM
(Online Mendelian Inheritance in Man),
(http://www.ncbi.nlm.nih.gov/omim). ,
,
,
. GAD (Genetic Association Database,
http://geneticassociationdb.nih.gov/) PubMed
(Becker, Barnes, Bright, & Wang, 2004), o Catalog of
Published Genome-Wide Association Studies (http://www.genome.gov/gwastudies/) GWASdb
(http://jjwanglab.org/gwasdb) (genomewide association
studies),
DNA. , ,
, Epilepsy Genetic
Association Database (epiGAD)(Tan&Berkovic,2010),Cancer GAMAdb(Schullyet
al., 2011) , AlzGene Alzheimer (Bertram, McQueen, Mullin, Blacker, &
Tanzi,2007).
,
SPRN,
.
, Database Collection Nucleic Acids Research,

(http://www.oxfordjournals.org/our_journals/nar/database/cap/).
PDBTM(http://pdbtm.enzim.hu/)

(Kozma,
Simon,
&
Tusnady,
2013),

ExTopoDB
(http://bioinformatics.biol.uoa.gr/ExTopoDB)
(Tsaousis et al., 2010) gpDB


(http://bioinformatics.biol.uoa.gr/gpDB),GPCRs
G- (Theodoropoulou, Bagos, Spyropoulos, & Hamodrakas, 2008). DBPTM
(http://dbptm.mbc.nctu.edu.tw/) (Lu et al., 2013), DIP (http://dip.doembi.ucla.edu/dip/Main.cgi) , ,
- (Xenarios et al., 2002). , bioGrid (http://thebiogrid.org/)
(Starketal.,2006).
(DNA RNA), .

miRNA , MiRBase (http://www.mirbase.org/) (Griffiths-Jones,


Grocock,vanDongen,Bateman,&Enright,2006),MirTarBase(http://mirtarbase.mbc.nctu.edu.tw/)(Hsu
et al., 2011) TarBase (http://diana.imis.athena-innovation.gr/DianaTools/index.php?r=tarbase/index/)
(Sethupathy,Corda,&Hatzigeorgiou,2006).,
- EID (http://bpg.utoledo.edu/~afedorov/lab/eid.html) (Shepelev &
Fedorov, 2006), EPD
(http://epd.vital-it.ch/) (Dreos, Ambrosini, Cavin Perier, & Bucher, 2013) MMPROMdb
(http://mpromdb.wistar.upenn.edu/) (Sun et al., 2006). , ,
,
.

2.3.

SRS,LIONBioscience
. ,
.
400
. SRS
,

. ,
,
.

.
SRSEBI.
, Uniprot

.
Entrez
NCBI(NationalCenterforBiotechnology
Information) . Entrez SRS
,
.,,
PUBMED.
NCBI
. ,
NCBI,
.
,SPRN,
,,
MySQL PHP. ,
,,
(
), .


, SQL ,
(
).
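The section above mentions specialised databases built on relational systems such as MySQL and queried with SQL. As a rough, self-contained illustration of that design (not the schema of any database named in the text), the sketch below uses Python's built-in `sqlite3` module in place of MySQL. The table names, column names and accession values are all hypothetical.

```python
import sqlite3

# In-memory database standing in for a MySQL back end (assumption for the sketch).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE family (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE protein (
        accession TEXT PRIMARY KEY,
        organism  TEXT NOT NULL,
        family_id INTEGER NOT NULL REFERENCES family(id)
    );
    """
)

# Hypothetical rows; real resources would load these from curated sources.
conn.executemany("INSERT INTO family VALUES (?, ?)",
                 [(1, "example family A"), (2, "example family B")])
conn.executemany("INSERT INTO protein VALUES (?, ?, ?)", [
    ("ACC0001", "Escherichia coli", 1),
    ("ACC0002", "Escherichia coli", 1),
    ("ACC0003", "Escherichia coli", 2),
    ("ACC0004", "Homo sapiens", 2),
])

# Count proteins per family for one organism, using a parameterised query.
rows = conn.execute(
    """
    SELECT f.name, COUNT(*)
    FROM protein AS p
    JOIN family AS f ON f.id = p.family_id
    WHERE p.organism = ?
    GROUP BY f.name
    ORDER BY f.name
    """,
    ("Escherichia coli",),
).fetchall()
print(rows)
```

Keeping families and proteins in separate tables joined by a key is what lets a web front end answer such combined questions with a single query.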

2.9: NCBI
Entrez. , . NCBI
,
. , Conserved Domains Database PROSITE Structure (MMDB)
PDB.



1. ompA Escherichia coli
UNIPROT, Gene Name UNIPROT.

2.10: Uniprot

2.11:

2.12:



2.13:

2. Uniprot
(outer membrane) .
( : taxonomy:"Bacteria [2]" existence:"evidence at protein level"
database:(type:pdb) locations:(location:"Cell outer membrane [SL-0040]") keyword:"Cell outer membrane
[KW-0998]")
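The query above can also be assembled programmatically. The following is a minimal sketch that joins the individual clauses with explicit `AND` operators and URL-encodes the result; it only builds the URL and does not contact any server. The endpoint `https://rest.uniprot.org/uniprotkb/search` and the `format` and `size` parameters are assumptions about the current UniProt REST interface, and the field names are copied from the text as written, so they may need adjusting to the query syntax of the live site.

```python
from urllib.parse import urlencode

# Assumed endpoint; confirm against the current UniProt documentation.
UNIPROT_SEARCH = "https://rest.uniprot.org/uniprotkb/search"


def build_query(clauses):
    """Join individual field:value clauses with explicit AND operators."""
    return " AND ".join(clauses)


def build_search_url(query, fmt="tsv", size=25):
    """Return a URL-encoded search URL for the given query string."""
    params = {"query": query, "format": fmt, "size": size}
    return UNIPROT_SEARCH + "?" + urlencode(params)


# Clauses taken from the query shown in the text.
clauses = [
    'taxonomy:"Bacteria [2]"',
    'existence:"evidence at protein level"',
    'database:(type:pdb)',
    'locations:(location:"Cell outer membrane [SL-0040]")',
    'keyword:"Cell outer membrane [KW-0998]"',
]
query = build_query(clauses)
url = build_search_url(query)
print(url)
```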

2.14: Uniprot ( ),



2.15: Uniprot ( ),

2.16:

2.17: . ,



2.18: . ,

2.19:



2.20: ( )

3. Uniprot
G- () :

:
keyword:"G-protein coupled receptor [KW-0297]" AND organism:"Human [9606]" AND
existence:"evidence at protein level" AND database:(type:pdb)

PDB;,

,

.


( )
1. GENBANK Outer membrane protein A (ompA)
Escherichia coli.
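In the record for this entry, the ACCESSION line reads `NC_000913 REGION: complement(1019013..1020053)` and the LOCUS line reports 1041 bp. The sketch below shows how the two agree: a region is inclusive at both ends, so its length is end - start + 1, and `complement(...)` marks the reverse strand. The function name `parse_region` is hypothetical and only the two simple location forms shown are handled.

```python
import re


def parse_region(location: str):
    """Parse 'start..end' or 'complement(start..end)' into (strand, start, end, length)."""
    text = location.strip()
    match = re.fullmatch(r"complement\((\d+)\.\.(\d+)\)", text)
    strand = -1                          # complement(...) means the reverse strand
    if match is None:
        match = re.fullmatch(r"(\d+)\.\.(\d+)", text)
        strand = 1
    if match is None:
        raise ValueError("unsupported location: " + location)
    start, end = int(match.group(1)), int(match.group(2))
    # Coordinates are 1-based and inclusive, hence the +1.
    return strand, start, end, end - start + 1


print(parse_region("complement(1019013..1020053)"))
```

A length of 1041 bases corresponds to 347 codons including the stop codon, since 1041 / 3 = 347.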
LOCUS

NC_000913

DEFINITION

Escherichia coli str. K-12 substr. MG1655, complete genome.

ACCESSION
VERSION
DBLINK

NC_000913 REGION: complement(1019013..1020053)


NC_000913.3 GI:556503834
BioProject: PRJNA57779
BioSample: SAMN02604091
RefSeq.
Escherichia coli str. K-12 substr. MG1655
Escherichia coli str. K-12 substr. MG1655
Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales;
Enterobacteriaceae; Escherichia.
1 (bases 1 to 1041)
Riley,M., Abe,T., Arnaud,M.B., Berlyn,M.K., Blattner,F.R.,
Chaudhuri,R.R., Glasner,J.D., Horiuchi,T., Keseler,I.M., Kosuge,T.,
Mori,H., Perna,N.T., Plunkett,G. III, Rudd,K.E., Serres,M.H.,
Thomas,G.H., Thomson,N.R., Wishart,D. and Wanner,B.L.
Escherichia coli K-12: a cooperatively developed annotation
snapshot--2005
Nucleic Acids Res. 34 (1), 1-9 (2006)
16397293
Publication Status: Online-Only
2 (bases 1 to 1041)
Hayashi,K., Morooka,N., Yamamoto,Y., Fujita,K., Isono,K., Choi,S.,
Ohtsubo,E., Baba,T., Wanner,B.L., Mori,H. and Horiuchi,T.
Highly accurate genome sequences of Escherichia coli K-12 strains
MG1655 and W3110
Mol. Syst. Biol. 2, 2006 (2006)
16738553
3 (bases 1 to 1041)
Blattner,F.R., Plunkett,G. III, Bloch,C.A., Perna,N.T., Burland,V.,
Riley,M., Collado-Vides,J., Glasner,J.D., Rode,C.K., Mayhew,G.F.,
Gregor,J., Davis,N.W., Kirkpatrick,H.A., Goeden,M.A., Rose,D.J.,
Mau,B. and Shao,Y.
The complete genome sequence of Escherichia coli K-12
Science 277 (5331), 1453-1462 (1997)
9278503
4 (bases 1 to 1041)
Arnaud,M., Berlyn,M.K.B., Blattner,F.R., Galperin,M.Y.,
Glasner,J.D., Horiuchi,T., Kosuge,T., Mori,H., Perna,N.T.,
Plunkett,G. III, Riley,M., Rudd,K.E., Serres,M.H., Thomas,G.H. and
Wanner,B.L.
Workshop on Annotation of Escherichia coli K-12
Unpublished
Woods Hole, Mass., on 14-18 November 2003 (sequence corrections)
5 (bases 1 to 1041)
Glasner,J.D., Perna,N.T., Plunkett,G. III, Anderson,B.D.,
Bockhorst,J., Hu,J.C., Riley,M., Rudd,K.E. and Serres,M.H.
ASAP: Escherichia coli K-12 strain MG1655 version m56
Unpublished
ASAP download 10 June 2004 (annotation updates)
6 (bases 1 to 1041)
Hayashi,K., Morooka,N., Mori,H. and Horiuchi,T.
A more accurate sequence comparison between genomes of Escherichia
coli K12 W3110 and MG1655 strains

KEYWORDS
SOURCE
ORGANISM

REFERENCE
AUTHORS

TITLE
JOURNAL
PUBMED
REMARK
REFERENCE
AUTHORS
TITLE
JOURNAL
PUBMED
REFERENCE
AUTHORS

TITLE
JOURNAL
PUBMED
REFERENCE
AUTHORS

TITLE
JOURNAL
REMARK
REFERENCE
AUTHORS
TITLE
JOURNAL
REMARK
REFERENCE
AUTHORS
TITLE

1041 bp

83

DNA

linear

CON 16-DEC-2014

JOURNAL
REMARK
REFERENCE
AUTHORS
TITLE
JOURNAL
REMARK
REFERENCE
AUTHORS
TITLE
JOURNAL
REFERENCE
CONSRTM
TITLE
JOURNAL
REFERENCE
AUTHORS
TITLE
JOURNAL
REMARK
REFERENCE
AUTHORS
TITLE
JOURNAL
REMARK
REFERENCE
AUTHORS
TITLE
JOURNAL
REMARK
REFERENCE
AUTHORS
TITLE
JOURNAL

REMARK
REFERENCE
AUTHORS
TITLE
JOURNAL

REMARK
REFERENCE
AUTHORS
TITLE
JOURNAL
REMARK
REFERENCE
AUTHORS
TITLE
JOURNAL

Unpublished
GenBank accessions AG613214 to AG613378 (sequence corrections)
7 (bases 1 to 1041)
Perna,N.T.
  TITLE     Escherichia coli K-12 MG1655 yqiK-rfaE intergenic region, genomic
            sequence correction
  JOURNAL   Unpublished
  REMARK    GenBank accession AY605712 (sequence corrections)
REFERENCE   8  (bases 1 to 1041)
  AUTHORS   Rudd,K.E.
  TITLE     A manual approach to accurate translation start site annotation: an
            E. coli K-12 case study
  JOURNAL   Unpublished
REFERENCE   9  (bases 1 to 1041)
  CONSRTM   NCBI Genome Project
  TITLE     Direct Submission
  JOURNAL   Submitted (26-AUG-2014) National Center for Biotechnology
            Information, NIH, Bethesda, MD 20894, USA
REFERENCE   10 (bases 1 to 1041)
  AUTHORS   Blattner,F.R. and Plunkett,G. III.
  TITLE     Direct Submission
  JOURNAL   Submitted (30-JUL-2014) Laboratory of Genetics, University of
            Wisconsin, 425G Henry Mall, Madison, WI 53706-1580, USA
  REMARK    Protein update by submitter
REFERENCE   11 (bases 1 to 1041)
  AUTHORS   Blattner,F.R. and Plunkett,G. III.
  TITLE     Direct Submission
  JOURNAL   Submitted (15-NOV-2013) Laboratory of Genetics, University of
            Wisconsin, 425G Henry Mall, Madison, WI 53706-1580, USA
  REMARK    Protein update by submitter
REFERENCE   12 (bases 1 to 1041)
  AUTHORS   Blattner,F.R. and Plunkett,G. III.
  TITLE     Direct Submission
  JOURNAL   Submitted (26-SEP-2013) Laboratory of Genetics, University of
            Wisconsin, 425G Henry Mall, Madison, WI 53706-1580, USA
  REMARK    Sequence update by submitter
REFERENCE   13 (bases 1 to 1041)
  AUTHORS   Rudd,K.E.
  TITLE     Direct Submission
  JOURNAL   Submitted (06-FEB-2013) Department of Biochemistry and Molecular
            Biology, University of Miami Miller School of Medicine, 118 Gautier
            Bldg., Miami, FL 33136, USA
  REMARK    Sequence update by submitter
REFERENCE   14 (bases 1 to 1041)
  AUTHORS   Rudd,K.E.
  TITLE     Direct Submission
  JOURNAL   Submitted (24-APR-2007) Department of Biochemistry and Molecular
            Biology, University of Miami Miller School of Medicine, 118 Gautier
            Bldg., Miami, FL 33136, USA
  REMARK    Annotation update from ecogene.org as a multi-database
            collaboration
REFERENCE   15 (bases 1 to 1041)
  AUTHORS   Plunkett,G. III.
  TITLE     Direct Submission
  JOURNAL   Submitted (07-FEB-2006) Laboratory of Genetics, University of
            Wisconsin, 425G Henry Mall, Madison, WI 53706-1580, USA
  REMARK    Protein updates by submitter
REFERENCE   16 (bases 1 to 1041)
  AUTHORS   Plunkett,G. III.
  TITLE     Direct Submission
  JOURNAL   Submitted (10-JUN-2004) Laboratory of Genetics, University of
            Wisconsin, 425G Henry Mall, Madison, WI 53706-1580, USA
  REMARK    Sequence update by submitter
REFERENCE   17 (bases 1 to 1041)
  AUTHORS   Plunkett,G. III.
  TITLE     Direct Submission
  JOURNAL   Submitted (13-OCT-1998) Laboratory of Genetics, University of
            Wisconsin, 425G Henry Mall, Madison, WI 53706-1580, USA
REFERENCE   18 (bases 1 to 1041)
  AUTHORS   Blattner,F.R. and Plunkett,G. III.
  TITLE     Direct Submission
  JOURNAL   Submitted (02-SEP-1997) Laboratory of Genetics, University of
            Wisconsin, 425G Henry Mall, Madison, WI 53706-1580, USA
REFERENCE   19 (bases 1 to 1041)
  AUTHORS   Blattner,F.R. and Plunkett,G. III.
  TITLE     Direct Submission
  JOURNAL   Submitted (16-JAN-1997) Laboratory of Genetics, University of
            Wisconsin, 425G Henry Mall, Madison, WI 53706-1580, USA
COMMENT     REVIEWED REFSEQ: This record has been curated by NCBI staff. The
            reference sequence is identical to U00096.
            On Nov 3, 2013 this sequence version replaced gi:49175990.
            RefSeq Category: Reference Genome
                        FGS: First Genome sequenced
                        MOD: Model Organism
                        PHY: Based on Phylogenetics
                        UPR: UniProt Genome
            Current U00096 annotation updates are derived from EcoGene
            http://ecogene.org. Suggestions for updates can be sent to Dr.
            Kenneth Rudd (krudd@miami.edu). These updates are being generated
            from a collaboration that also includes ASAP/ERIC, the Coli Genetic
            Stock Center, EcoliHub, EcoCyc, RegulonDB and UniProtKB/Swiss-Prot.
            COMPLETENESS: full length.
FEATURES             Location/Qualifiers
     source          1..1041
                     /organism="Escherichia coli str. K-12 substr. MG1655"
                     /mol_type="genomic DNA"
                     /strain="K-12"
                     /sub_strain="MG1655"
                     /db_xref="taxon:511145"
     gene            1..1041
                     /gene="ompA"
                     /locus_tag="b0957"
                     /gene_synonym="con; ECK0948; JW0940; tolG; tut"
                     /db_xref="EcoGene:EG10669"
                     /db_xref="GeneID:945571"
     CDS             1..1041
                     /gene="ompA"
                     /locus_tag="b0957"
                     /gene_synonym="con; ECK0948; JW0940; tolG; tut"
                     /function="membrane; Outer membrane constituents"
                     /GO_component="GO:0009279 - cell outer membrane;
                     GO:0009274 - peptidoglycan-based cell wall"
                     /note="outer membrane protein 3a (II*;G;d)"
                     /codon_start=1
                     /transl_table=11
                     /product="outer membrane protein A (3a;II*;G;d)"
                     /protein_id="NP_415477.1"
                     /db_xref="GI:16128924"
                     /db_xref="ASAP:ABE-0003240"
                     /db_xref="UniProtKB/Swiss-Prot:P0A910"
                     /db_xref="EcoGene:EG10669"
                     /db_xref="GeneID:945571"
                     /translation="MKKTAIAIAVALAGFATVAQAAPKDNTWYTGAKLGWSQYHDTGF
                     INNNGPTHENQLGAGAFGGYQVNPYVGFEMGYDWLGRMPYKGSVENGAYKAQGVQLTA
                     KLGYPITDDLDIYTRLGGMVWRADTKSNVYGKNHDTGVSPVFAGGVEYAITPEIATRL
                     EYQWTNNIGDAHTIGTRPDNGMLSLGVSYRFGQGEAAPVVAPAPAPAPEVQTKHFTLK
                     SDVLFNFNKATLKPEGQAALDQLYSQLSNLDPKDGSVVVLGYTDRIGSDAYNQGLSER
                     RAQSVVDYLISKGIPADKISARGMGESNPVTGNTCDNVKQRAALIDCLAPDRRVEIEV
                     KGIKDVVTQPQA"
ORIGIN
        1 atgaaaaaga cagctatcgc gattgcagtg gcactggctg gtttcgctac cgtagcgcag
       61 gccgctccga aagataacac ctggtacact ggtgctaaac tgggctggtc ccagtaccat
      121 gacactggtt tcatcaacaa caatggcccg acccatgaaa accaactggg cgctggtgct
      181 tttggtggtt accaggttaa cccgtatgtt ggctttgaaa tgggttacga ctggttaggt
      241 cgtatgccgt acaaaggcag cgttgaaaac ggtgcataca aagctcaggg cgttcaactg
      301 accgctaaac tgggttaccc aatcactgac gacctggaca tctacactcg tctgggtggc
      361 atggtatggc gtgcagacac taaatccaac gtttatggta aaaaccacga caccggcgtt
      421 tctccggtct tcgctggcgg tgttgagtac gcgatcactc ctgaaatcgc tacccgtctg
      481 gaataccagt ggaccaacaa catcggtgac gcacacacca tcggcactcg tccggacaac
      541 ggcatgctga gcctgggtgt ttcctaccgt ttcggtcagg gcgaagcagc tccagtagtt
      601 gctccggctc cagctccggc accggaagta cagaccaagc acttcactct gaagtctgac
      661 gttctgttca acttcaacaa agcaaccctg aaaccggaag gtcaggctgc tctggatcag
      721 ctgtacagcc agctgagcaa cctggatccg aaagacggtt ccgtagttgt tctgggttac
      781 accgaccgca tcggttctga cgcttacaac cagggtctgt ccgagcgccg tgctcagtct
      841 gttgttgatt acctgatctc caaaggtatc ccggcagaca agatctccgc acgtggtatg
      901 ggcgaatcca acccggttac tggcaacacc tgtgacaacg tgaaacagcg tgctgcactg
      961 atcgactgcc tggctccgga tcgtcgcgta gagatcgaag ttaaaggtat caaagacgtt
     1021 gtaactcagc cgcaggctta a
//
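Fields of a GENBANK record
LOCUS: name of the record, followed by the sequence length, the molecule type, the GENBANK division and the date of last modification.
DEFINITION: brief description of the sequence.
ACCESSION: unique identifier of the record in GENBANK.
VERSION: AccessionNumber followed by a version number, which increases each time the sequence changes.
KEYWORDS: words or phrases describing the record; a single period means that none were assigned.
SOURCE: name of the source organism (common name, where one exists).
ORGANISM: scientific name of the organism, followed by its taxonomic lineage.
REFERENCE: publications or direct submissions related to the sequence.
AUTHORS: authors of the reference.
TITLE: title of the reference.
JOURNAL: journal name, volume, pages and year.
MEDLINE: MEDLINE identifier of the reference.
COMMENT: free-text remarks, such as updates and cross-references.
FEATURES: annotated regions of the sequence, such as genes, coding regions (CDS) and RNA, each with its location and qualifiers.
BASE COUNT: number of occurrences of each base (a, c, g, t) in the sequence.
ORIGIN: marks the beginning of the sequence data.
For example: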
ORIGIN
        1 atgaaaaaga cagctatcgc gattgcagtg gcactggctg gtttcgctac cgtagcgcag
       61 gccgctccga aagataacac ctggtacact ggtgctaaac tgggctggtc ccagtaccat
      121 gacactggtt tcatcaacaa caatggcccg acccatgaaa accaactggg cgctggtgct
      181 tttggtggtt accaggttaa cccgtatgtt ggctttgaaa tgggttacga ctggttaggt
      241 cgtatgccgt acaaaggcag cgttgaaaac ggtgcataca aagctcaggg cgttcaactg
      301 accgctaaac tgggttaccc aatcactgac gacctggaca tctacactcg tctgggtggc
      361 atggtatggc gtgcagacac taaatccaac gtttatggta aaaaccacga caccggcgtt
      421 tctccggtct tcgctggcgg tgttgagtac gcgatcactc ctgaaatcgc tacccgtctg
      481 gaataccagt ggaccaacaa catcggtgac gcacacacca tcggcactcg tccggacaac
      541 ggcatgctga gcctgggtgt ttcctaccgt ttcggtcagg gcgaagcagc tccagtagtt
      601 gctccggctc cagctccggc accggaagta cagaccaagc acttcactct gaagtctgac
      661 gttctgttca acttcaacaa agcaaccctg aaaccggaag gtcaggctgc tctggatcag
      721 ctgtacagcc agctgagcaa cctggatccg aaagacggtt ccgtagttgt tctgggttac
      781 accgaccgca tcggttctga cgcttacaac cagggtctgt ccgagcgccg tgctcagtct
      841 gttgttgatt acctgatctc caaaggtatc ccggcagaca agatctccgc acgtggtatg
      901 ggcgaatcca acccggttac tggcaacacc tgtgacaacg tgaaacagcg tgctgcactg
      961 atcgactgcc tggctccgga tcgtcgcgta gagatcgaag ttaaaggtat caaagacgtt
     1021 gtaactcagc cgcaggctta a
//
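- The sequence is written in lowercase letters.
- Each line holds up to 60 bases, starting at column 11, in groups of 10 separated by a space.
- Each line begins with the position of its first base, right-justified in the first 9 columns.
//: end of the record.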
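
The layout rules above make the ORIGIN block easy to read by program. The following Python sketch is only an illustration and is not part of the GENBANK record; the function name `origin_to_sequence` is hypothetical. It assumes that every sequence line starts with a position number followed by groups of bases, and it rebuilds the sequence as a single string.

```python
def origin_to_sequence(origin_lines):
    """Join the base groups of a GenBank ORIGIN block into one string."""
    groups = []
    for line in origin_lines:
        line = line.strip()
        # Skip empty lines, the ORIGIN keyword and the end-of-record marker.
        if not line or line == "//" or line.upper().startswith("ORIGIN"):
            continue
        fields = line.split()
        # fields[0] is the position of the first base on the line;
        # the remaining fields are the groups of (up to) 10 bases.
        groups.extend(fields[1:])
    return "".join(groups)


# First and last sequence lines of the ompA record shown above.
record_lines = [
    "ORIGIN",
    "        1 atgaaaaaga cagctatcgc gattgcagtg gcactggctg gtttcgctac cgtagcgcag",
    "     1021 gtaactcagc cgcaggctta a",
    "//",
]
sequence = origin_to_sequence(record_lines)
print(len(sequence), sequence[:3])
```

Applied to all 18 sequence lines of the record, the same function would return the complete 1041-base sequence, whose length can then be compared with the range given in the FEATURES table (1..1041).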

2. UniProt entry for Outer membrane protein A (ompA) of Escherichia coli.

ID   OMPA_ECOLI              Reviewed;         346 AA.
AC   P0A910; P02934;
DT   21-JUL-1986, integrated into UniProtKB/Swiss-Prot.
DT   21-JUL-1986, sequence version 1.
DT   07-JAN-2015, entry version 99.
DE   RecName: Full=Outer membrane protein A;
DE   AltName: Full=Outer membrane protein II*;
DE   Flags: Precursor;
GN   Name=ompA; Synonyms=con, tolG, tut; OrderedLocusNames=b0957, JW0940;
OS   Escherichia coli (strain K12).
OC   Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales;
OC   Enterobacteriaceae; Escherichia.
OX   NCBI_TaxID=83333;
RN   [1]
RP   NUCLEOTIDE SEQUENCE [GENOMIC DNA].
RC   STRAIN=K12;
RX   PubMed=6253901; DOI=10.1093/nar/8.13.3011;
RA   Beck E., Bremer E.;
RT   "Nucleotide sequence of the gene ompA coding the outer membrane
RT   protein II of Escherichia coli K-12.";
RL   Nucleic Acids Res. 8:3011-3027(1980).
RN   [2]
RP   NUCLEOTIDE SEQUENCE [GENOMIC DNA].
RC   STRAIN=K12;
RX   PubMed=6260961; DOI=10.1016/0022-2836(80)90193-X;
RA   Movva N.R., Nakamura K., Inouye M.;
RT   "Gene structure of the OmpA protein, a major surface protein of
RT   Escherichia coli required for cell-cell interaction.";
RL   J. Mol. Biol. 143:317-328(1980).
RN   [3]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RC   STRAIN=K12 / W3110 / ATCC 27325 / DSM 5911;
RX   PubMed=8905232; DOI=10.1093/dnares/3.3.137;
RA   Oshima T., Aiba H., Baba T., Fujita K., Hayashi K., Honjo A.,
RA   Ikemoto K., Inada T., Itoh T., Kajihara M., Kanai K., Kashimoto K.,
RA   Kimura S., Kitagawa M., Makino K., Masuda S., Miki T., Mizobuchi K.,
RA   Mori H., Motomura K., Nakamura Y., Nashimoto H., Nishio Y., Saito N.,
RA   Sampei G., Seki Y., Tagami H., Takemoto K., Wada C., Yamamoto Y.,
RA   Yano M., Horiuchi T.;
RT   "A 718-kb DNA sequence of the Escherichia coli K-12 genome
RT   corresponding to the 12.7-28.0 min region on the linkage map.";
RL   DNA Res. 3:137-155(1996).
RN   [4]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RC   STRAIN=K12 / MG1655 / ATCC 47076;
RX   PubMed=9278503; DOI=10.1126/science.277.5331.1453;
RA   Blattner F.R., Plunkett G. III, Bloch C.A., Perna N.T., Burland V.,
RA   Riley M., Collado-Vides J., Glasner J.D., Rode C.K., Mayhew G.F.,
RA   Gregor J., Davis N.W., Kirkpatrick H.A., Goeden M.A., Rose D.J.,
RA   Mau B., Shao Y.;
RT   "The complete genome sequence of Escherichia coli K-12.";
RL   Science 277:1453-1462(1997).
RN   [5]
RP   NUCLEOTIDE SEQUENCE [LARGE SCALE GENOMIC DNA].
RC   STRAIN=K12 / W3110 / ATCC 27325 / DSM 5911;
RX   PubMed=16738553; DOI=10.1038/msb4100049;
RA   Hayashi K., Morooka N., Yamamoto Y., Fujita K., Isono K., Choi S.,
RA   Ohtsubo E., Baba T., Wanner B.L., Mori H., Horiuchi T.;
RT   "Highly accurate genome sequences of Escherichia coli K-12 strains
RT   MG1655 and W3110.";
RL   Mol. Syst. Biol. 2:E1-E5(2006).

RN   [6]
RP   PROTEIN SEQUENCE OF 22-346.
RC   STRAIN=K12;
RX   PubMed=7001461; DOI=10.1073/pnas.77.8.4592;
RA   Chen R., Schmidmayr W., Kramer C., Chen-Schmeisser U., Henning U.;
RT   "Primary structure of major outer membrane protein II (ompA protein)
RT   of Escherichia coli K-12.";
RL   Proc. Natl. Acad. Sci. U.S.A. 77:4592-4596(1980).
RN   [7]
RP   PROTEIN SEQUENCE OF 22-34.
RC   STRAIN=K12 / EMG2;
RX   PubMed=9298646; DOI=10.1002/elps.1150180807;
RA   Link A.J., Robison K., Church G.M.;
RT   "Comparing the predicted and observed properties of proteins encoded
RT   in the genome of Escherichia coli K-12.";
RL   Electrophoresis 18:1259-1313(1997).
RN   [8]
RP   PROTEIN SEQUENCE OF 22-32.
RC   STRAIN=K12 / W3110 / ATCC 27325 / DSM 5911;
RA   Pasquali C., Sanchez J.-C., Ravier F., Golaz O., Hughes G.J.,
RA   Frutiger S., Paquet N., Wilkins M., Appel R.D., Bairoch A.,
RA   Hochstrasser D.F.;
RL   Submitted (AUG-1994) to UniProtKB.
RN   [9]
RP   PROTEIN SEQUENCE OF 22-26.
RC   STRAIN=K12 / W3110 / ATCC 27325 / DSM 5911;
RX   PubMed=9629924; DOI=10.1002/elps.1150190539;
RA   Molloy M.P., Herbert B.R., Walsh B.J., Tyler M.I., Traini M.,
RA   Sanchez J.-C., Hochstrasser D.F., Williams K.L., Gooley A.A.;
RT   "Extraction of membrane proteins by differential solubilization for
RT   separation using two-dimensional gel electrophoresis.";
RL   Electrophoresis 19:837-844(1998).
RN   [10]
RP   MUTANTS RESISTANT TO PHAGE ENTRY.
RX   PubMed=6086577;
RA   Morona R., Klose M., Henning U.;
RT   "Escherichia coli K-12 outer membrane protein (OmpA) as a
RT   bacteriophage receptor: analysis of mutant genes expressing altered
RT   proteins.";
RL   J. Bacteriol. 159:570-578(1984).
RN   [11]
RP   MUTANTS RESISTANT TO PHAGE ENTRY.
RX   PubMed=3902787;
RA   Morona R., Kramer C., Henning U.;
RT   "Bacteriophage receptor area of outer membrane protein OmpA of
RT   Escherichia coli K-12.";
RL   J. Bacteriol. 164:539-543(1985).
RN   [12]
RP   PORIN ACTIVITY.
RC   STRAIN=K12;
RX   PubMed=1370823;
RA   Sugawara E., Nikaido H.;
RT   "Pore-forming activity of OmpA protein of Escherichia coli.";
RL   J. Biol. Chem. 267:2507-2511(1992).
RN   [13]
RP   SUBCELLULAR LOCATION.
RX   PubMed=7813480; DOI=10.1111/j.1432-1033.1994.00891.x;
RA   Kuhn A., Kiefer D., Koehne C., Zhu H.-Y., Tschantz W.R., Dalbey R.E.;
RT   "Evidence for a loop-like insertion mechanism of pro-Omp A into the
RT   inner membrane of Escherichia coli.";
RL   Eur. J. Biochem. 226:891-897(1994).

RN   [14]
RP   TOPOLOGY.
RX   PubMed=8106193;
RA   Gromiha M.M., Ponnuswamy P.K.;
RT   "Prediction of transmembrane beta-strands from hydrophobic
RT   characteristics of proteins.";
RL   Int. J. Pept. Protein Res. 42:420-431(1993).
RN   [15]
RP   IDENTIFICATION BY 2D-GEL.
RX   PubMed=9298644; DOI=10.1002/elps.1150180805;
RA   VanBogelen R.A., Abshire K.Z., Moldover B., Olson E.R.,
RA   Neidhardt F.C.;
RT   "Escherichia coli proteome analysis using the gene-protein database.";
RL   Electrophoresis 18:1243-1251(1997).
RN   [16]
RP   TOPOLOGY.
RX   PubMed=10368142;
RA   Koebnik R.;
RT   "Structural and functional roles of the surface-exposed loops of the
RT   beta-barrel membrane protein OmpA from Escherichia coli.";
RL   J. Bacteriol. 181:3688-3694(1999).
RN   [17]
RP   DIMERIZATION, AND SUBCELLULAR LOCATION.
RC   STRAIN=BL21-DE3;
RX   PubMed=16079137; DOI=10.1074/jbc.M506479200;
RA   Stenberg F., Chovanec P., Maslen S.L., Robinson C.V., Ilag L.,
RA   von Heijne G., Daley D.O.;
RT   "Protein complexes of the Escherichia coli cell envelope.";
RL   J. Biol. Chem. 280:34409-34419(2005).
RN   [18]
RP   SUBCELLULAR LOCATION.
RC   STRAIN=K12 / MG1655 / ATCC 47076;
RX   PubMed=21778229; DOI=10.1074/jbc.M111.245696;
RA   Fontaine F., Fuchs R.T., Storz G.;
RT   "Membrane localization of small proteins in Escherichia coli.";
RL   J. Biol. Chem. 286:32464-32474(2011).
RN   [19]
RP   X-RAY CRYSTALLOGRAPHY (2.5 ANGSTROMS) OF 22-192.
RX   PubMed=9808047; DOI=10.1038/2983;
RA   Pautsch A., Schulz G.E.;
RT   "Structure of the outer membrane protein A transmembrane domain.";
RL   Nat. Struct. Biol. 5:1013-1017(1998).
RN   [20]
RP   X-RAY CRYSTALLOGRAPHY (1.65 ANGSTROMS).
RX   PubMed=10764596; DOI=10.1006/jmbi.2000.3671;
RA   Pautsch A., Schulz G.E.;
RT   "High-resolution structure of the OmpA membrane domain.";
RL   J. Mol. Biol. 298:273-282(2000).
RN   [21]
RP   STRUCTURE BY NMR OF 22-197.
RX   PubMed=11276254; DOI=10.1038/86214;
RA   Arora A., Abildgaard F., Bushweller J.H., Tamm L.K.;
RT   "Structure of outer membrane protein A transmembrane domain by NMR
RT   spectroscopy.";
RL   Nat. Struct. Biol. 8:334-338(2001).
RN   [22]
RP   MASS SPECTROMETRY.
RX   PubMed=10757971; DOI=10.1021/bi000150m;
RA   le Coutre J., Whitelegge J.P., Gross A., Turk E., Wright E.M.,
RA   Kaback H.R., Faull K.F.;
RT   "Proteomics on full-length membrane proteins using mass

RT   spectrometry.";
RL   Biochemistry 39:4237-4242(2000).
CC   -!- FUNCTION: Required for the action of colicins K and L and for the
CC       stabilization of mating aggregates in conjugation. Serves as a
CC       receptor for a number of T-even like phages. Also acts as a porin
CC       with low permeability that allows slow penetration of small
CC       solutes.
CC   -!- SUBUNIT: Homodimer.
CC   -!- INTERACTION:
CC       P0C0V0:degP; NbExp=5; IntAct=EBI-371347, EBI-547165;
CC       P0A850:tig; NbExp=3; IntAct=EBI-371347, EBI-544862;
CC   -!- SUBCELLULAR LOCATION: Cell outer membrane
CC       {ECO:0000269|PubMed:16079137, ECO:0000269|PubMed:21778229,
CC       ECO:0000269|PubMed:7813480}; Multi-pass membrane protein
CC       {ECO:0000269|PubMed:16079137, ECO:0000269|PubMed:21778229,
CC       ECO:0000269|PubMed:7813480}.
CC   -!- MASS SPECTROMETRY: Mass=35177; Method=Electrospray; Range=22-346;
CC       Evidence={ECO:0000269|PubMed:10757971};
CC   -!- SIMILARITY: Belongs to the OmpA family. {ECO:0000305}.
CC   -!- SIMILARITY: Contains 1 OmpA-like domain. {ECO:0000255|PROSITE-
CC       ProRule:PRU00473}.
CC   ----------------------------------------------------------------------
CC   Copyrighted by the UniProt Consortium, see http://www.uniprot.org/terms
CC   Distributed under the Creative Commons Attribution-NoDerivs License
CC   ----------------------------------------------------------------------
DR   EMBL; V00307; CAA23588.1; -; Genomic_DNA.
DR   EMBL; U00096; AAC74043.1; -; Genomic_DNA.
DR   EMBL; AP009048; BAA35715.1; -; Genomic_DNA.
DR   PIR; A93707; MMECA.
DR   RefSeq; NP_415477.1; NC_000913.3.
DR   RefSeq; YP_489229.1; NC_007779.1.
DR   PDB; 1BXW; X-ray; 2.50 A; A=22-192.
DR   PDB; 1G90; NMR; -; A=22-197.
DR   PDB; 1QJP; X-ray; 1.65 A; A=22-192.
DR   PDB; 2GE4; NMR; -; A=22-197.
DR   PDB; 2JMM; NMR; -; A=23-197.
DR   PDB; 3NB3; EM; -; A/B/C=1-346.
DR   PDBsum; 1BXW; -.
DR   PDBsum; 1G90; -.
DR   PDBsum; 1QJP; -.
DR   PDBsum; 2GE4; -.
DR   PDBsum; 2JMM; -.
DR   PDBsum; 3NB3; -.
DR   ProteinModelPortal; P0A910; -.
DR   SMR; P0A910; 22-192, 209-346.
DR   DIP; DIP-31879N; -.
DR   IntAct; P0A910; 11.
DR   MINT; MINT-1308131; -.
DR   STRING; 511145.b0957; -.
DR   TCDB; 1.B.6.1.1; the ompa-ompf porin (oop) family.
DR   SWISS-2DPAGE; P0A910; -.
DR   PaxDb; P0A910; -.
DR   PRIDE; P0A910; -.
DR   EnsemblBacteria; AAC74043; AAC74043; b0957.
DR   EnsemblBacteria; BAA35715; BAA35715; BAA35715.
DR   GeneID; 12931038; -.
DR   GeneID; 945571; -.
DR   KEGG; ecj:Y75_p0929; -.
DR   KEGG; eco:b0957; -.
DR   PATRIC; 32117133; VBIEscCol129921_0991.
DR   EchoBASE; EB0663; -.

DR   EcoGene; EG10669; ompA.
DR   eggNOG; COG2885; -.
DR   HOGENOM; HOG000274199; -.
DR   InParanoid; P0A910; -.
DR   KO; K03286; -.
DR   OMA; EYALTKN; -.
DR   OrthoDB; EOG6PP9QB; -.
DR   BioCyc; EcoCyc:EG10669-MONOMER; -.
DR   BioCyc; ECOL316407:JW0940-MONOMER; -.
DR   EvolutionaryTrace; P0A910; -.
DR   PRO; PR:P0A910; -.
DR   Proteomes; UP000000318; Chromosome.
DR   Proteomes; UP000000625; Chromosome.
DR   Genevestigator; P0A910; -.
DR   GO; GO:0009279; C:cell outer membrane; IDA:EcoliWiki.
DR   GO; GO:0016021; C:integral component of membrane; IDA:EcoliWiki.
DR   GO; GO:0016020; C:membrane; IDA:EcoliWiki.
DR   GO; GO:0019867; C:outer membrane; IDA:EcoliWiki.
DR   GO; GO:0046930; C:pore complex; IEA:UniProtKB-KW.
DR   GO; GO:0015288; F:porin activity; IDA:EcoCyc.
DR   GO; GO:0005198; F:structural molecule activity; IEA:InterPro.
DR   GO; GO:0006974; P:cellular response to DNA damage stimulus; IEP:EcoliWiki.
DR   GO; GO:0000746; P:conjugation; IMP:EcoliWiki.
DR   GO; GO:0009597; P:detection of virus; IMP:EcoliWiki.
DR   GO; GO:0034220; P:ion transmembrane transport; IDA:EcoCyc.
DR   GO; GO:0006811; P:ion transport; IDA:EcoliWiki.
DR   GO; GO:0006810; P:transport; IDA:EcoliWiki.
DR   GO; GO:0046718; P:viral entry into host cell; IMP:EcoliWiki.
DR   Gene3D; 2.40.160.20; -; 1.
DR   Gene3D; 3.30.1330.60; -; 1.
DR   InterPro; IPR011250; OMP/PagP_b-brl.
DR   InterPro; IPR006664; OMP_bac.
DR   InterPro; IPR002368; OmpA.
DR   InterPro; IPR006690; OMPA-like_CS.
DR   InterPro; IPR000498; OmpA-like_TM_dom.
DR   InterPro; IPR006665; OmpA/MotB_C.
DR   Pfam; PF00691; OmpA; 1.
DR   Pfam; PF01389; OmpA_membrane; 1.
DR   PRINTS; PR01021; OMPADOMAIN.
DR   PRINTS; PR01022; OUTRMMBRANEA.
DR   SUPFAM; SSF103088; SSF103088; 1.
DR   SUPFAM; SSF56925; SSF56925; 1.
DR   PROSITE; PS01068; OMPA_1; 1.
DR   PROSITE; PS51123; OMPA_2; 1.
PE   1: Evidence at protein level;
KW   3D-structure; Cell outer membrane; Complete proteome; Conjugation;
KW   Direct protein sequencing; Disulfide bond; Ion transport; Membrane;
KW   Porin; Reference proteome; Repeat; Signal; Transmembrane;
KW   Transmembrane beta strand; Transport.
FT   SIGNAL        1     21       {ECO:0000269|PubMed:7001461,
FT                                ECO:0000269|PubMed:9298646,
FT                                ECO:0000269|PubMed:9629924,
FT                                ECO:0000269|Ref.8}.
FT   CHAIN        22    346       Outer membrane protein A.
FT                                /FTId=PRO_0000020094.
FT   TOPO_DOM     22     26       Periplasmic.
FT   TRANSMEM     27     37       Beta stranded.
FT   TOPO_DOM     38     54       Extracellular.
FT   TRANSMEM     55     66       Beta stranded.
FT   TOPO_DOM     67     69       Periplasmic.
FT   TRANSMEM     70     78       Beta stranded.

FT   TOPO_DOM     79     95       Extracellular.
FT   TRANSMEM     96    107       Beta stranded.
FT   TOPO_DOM    108    111       Periplasmic.
FT   TRANSMEM    112    124       Beta stranded.
FT   TOPO_DOM    125    137       Extracellular.
FT   TRANSMEM    138    151       Beta stranded.
FT   TOPO_DOM    152    155       Periplasmic.
FT   TRANSMEM    156    163       Beta stranded.
FT   TOPO_DOM    164    181       Extracellular.
FT   TRANSMEM    182    190       Beta stranded.
FT   TOPO_DOM    191    346       Periplasmic.
FT   REPEAT      201    202       1.
FT   REPEAT      203    204       2.
FT   REPEAT      205    206       3.
FT   REPEAT      207    208       4.
FT   DOMAIN      210    338       OmpA-like. {ECO:0000255|PROSITE-
FT                                ProRule:PRU00473}.
FT   REGION      197    208       Hinge-like.
FT   REGION      201    208       4 X 2 AA tandem repeats of A-P.
FT   DISULFID    311    323
FT   STRAND       27     37       {ECO:0000244|PDB:1QJP}.
FT   STRAND       41     43       {ECO:0000244|PDB:1G90}.
FT   STRAND       46     48       {ECO:0000244|PDB:1G90}.
FT   STRAND       50     53       {ECO:0000244|PDB:2GE4}.
FT   STRAND       55     67       {ECO:0000244|PDB:1QJP}.
FT   STRAND       70     81       {ECO:0000244|PDB:1QJP}.
FT   STRAND       93    128       {ECO:0000244|PDB:1QJP}.
FT   STRAND      130    132       {ECO:0000244|PDB:1QJP}.
FT   STRAND      134    153       {ECO:0000244|PDB:1QJP}.
FT   STRAND      156    165       {ECO:0000244|PDB:1QJP}.
FT   TURN        172    175       {ECO:0000244|PDB:1G90}.
FT   STRAND      182    190       {ECO:0000244|PDB:1QJP}.
SQ   SEQUENCE   346 AA;  37201 MW;  195147734CDF8B04 CRC64;
     MKKTAIAIAV ALAGFATVAQ AAPKDNTWYT GAKLGWSQYH DTGFINNNGP THENQLGAGA
     FGGYQVNPYV GFEMGYDWLG RMPYKGSVEN GAYKAQGVQL TAKLGYPITD DLDIYTRLGG
     MVWRADTKSN VYGKNHDTGV SPVFAGGVEY AITPEIATRL EYQWTNNIGD AHTIGTRPDN
     GMLSLGVSYR FGQGEAAPVV APAPAPAPEV QTKHFTLKSD VLFNFNKATL KPEGQAALDQ
     LYSQLSNLDP KDGSVVVLGY TDRIGSDAYN QGLSERRAQS VVDYLISKGI PADKISARGM
     GESNPVTGNT CDNVKQRAAL IDCLAPDRRV EIEVKGIKDV VTQPQA
//

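Fields of a UNIPROT entry
ID (Identification): the first line of the entry, with the form
Entry_name data_class; molecule_type; sequence length
Entry_name: name of the entry in UNIPROT, e.g. OMPA_ECOLI. It consists of a protein code of up to 4 characters and a species code of up to 5 characters, joined by an underscore.
data_class: status of the entry in UNIPROT.
molecule_type: type of molecule. In UNIPROT it is PRT (Protein).
sequence length: length of the sequence in amino acids (AA).
AC (Accession number): unique, stable identifier of the entry. When several are listed, the first is the primary one.
DT (Date): dates of creation, of the last sequence update and of the last annotation update.
DE (Description): descriptive name of the protein.
GN (Gene name): name of the gene that encodes the protein.
OS (Organism Species): name of the source organism.
OG (Organelle): origin of the gene when it is not chromosomal, for example an organelle or a plasmid.
OC (Organism Classification): taxonomic classification of the organism.
OX (Organism taxonomy cross-reference): identifier of the organism in a taxonomy database.

RN, RP, RC, RX, RA, RT, RL: together these lines form one bibliographic reference.

RN (Reference number): sequential number of the reference.
RP (Reference Position): the part of the work that the reference supports.
RX (Reference cross-reference): identifier of the reference in a bibliographic database, e.g. PUBMED.
RA (Reference author): authors of the reference.
RT (Reference title): title of the reference.
RL (Reference Location): journal, volume, pages and year.
CC (Comments): free-text comments, organized in topics such as:
- CATALYTIC ACTIVITY: the reaction catalyzed by the protein.
- ALTERNATIVE PRODUCTS: the different proteins produced from the same gene.
- FUNCTION: the function of the protein.
- SUBCELLULAR LOCATION: the location of the protein in the cell.
- SUBUNIT: the quaternary structure of the protein and its interactions.
DR (Database cross-reference): pointers to entries in other databases, e.g. PDB, EMBL.
KW (Keyword): keywords that summarize the content of the entry.
FT (Feature Table): regions or sites of interest in the sequence, for example signal peptides, transmembrane segments, domains, binding sites (e.g. Receptor-Ligand) and disulfide bonds.
SQ (Sequence): header of the sequence data, giving the length (AA), the molecular weight (MW) in Daltons and a checksum. The sequence follows:
- in the IUPAC one-letter code.
- with 60 amino acids per line, in 6 groups of 10.
//: end of the entry.
For example: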
SQ   SEQUENCE   346 AA;  37201 MW;  195147734CDF8B04 CRC64;
     MKKTAIAIAV ALAGFATVAQ AAPKDNTWYT GAKLGWSQYH DTGFINNNGP THENQLGAGA
     FGGYQVNPYV GFEMGYDWLG RMPYKGSVEN GAYKAQGVQL TAKLGYPITD DLDIYTRLGG
     MVWRADTKSN VYGKNHDTGV SPVFAGGVEY AITPEIATRL EYQWTNNIGD AHTIGTRPDN
     GMLSLGVSYR FGQGEAAPVV APAPAPAPEV QTKHFTLKSD VLFNFNKATL KPEGQAALDQ
     LYSQLSNLDP KDGSVVVLGY TDRIGSDAYN QGLSERRAQS VVDYLISKGI PADKISARGM
     GESNPVTGNT CDNVKQRAAL IDCLAPDRRV EIEVKGIKDV VTQPQA
//
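
The SQ header states the sequence length, so a program can check a parsed sequence against it. The Python sketch below is illustrative only; the function name `parse_sq_block` is hypothetical, and it assumes the header has the form shown in the OMPA_ECOLI entry (`SQ   SEQUENCE   <n> AA;  <mass> MW;  <checksum> CRC64;`).

```python
import re


def parse_sq_block(lines):
    """Return (declared_length, declared_mass, sequence) from an SQ block."""
    header = re.match(r"SQ\s+SEQUENCE\s+(\d+)\s+AA;\s+(\d+)\s+MW;", lines[0])
    if header is None:
        raise ValueError("first line is not an SQ header")
    residues = []
    for line in lines[1:]:
        if line.strip() == "//":  # end-of-entry marker
            break
        # Remove the spaces that separate the groups of 10 residues.
        residues.append("".join(line.split()))
    return int(header.group(1)), int(header.group(2)), "".join(residues)


sq_lines = """SQ   SEQUENCE   346 AA;  37201 MW;  195147734CDF8B04 CRC64;
     MKKTAIAIAV ALAGFATVAQ AAPKDNTWYT GAKLGWSQYH DTGFINNNGP THENQLGAGA
     FGGYQVNPYV GFEMGYDWLG RMPYKGSVEN GAYKAQGVQL TAKLGYPITD DLDIYTRLGG
     MVWRADTKSN VYGKNHDTGV SPVFAGGVEY AITPEIATRL EYQWTNNIGD AHTIGTRPDN
     GMLSLGVSYR FGQGEAAPVV APAPAPAPEV QTKHFTLKSD VLFNFNKATL KPEGQAALDQ
     LYSQLSNLDP KDGSVVVLGY TDRIGSDAYN QGLSERRAQS VVDYLISKGI PADKISARGM
     GESNPVTGNT CDNVKQRAAL IDCLAPDRRV EIEVKGIKDV VTQPQA
//""".splitlines()

declared_length, declared_mass, protein = parse_sq_block(sq_lines)
print(declared_length == len(protein))
```

The comparison confirms that the number of residues read from the six sequence lines agrees with the length declared in the header.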

3. PROSITE entry for Outer membrane protein A (ompA).

ID   OMPA_1; PATTERN.
AC   PS01068;
DT   NOV-1995 (CREATED); DEC-2004 (DATA UPDATE); FEB-2015 (INFO UPDATE).
DE   OmpA-like domain.
PA   [LIVMA]-x-[GT]-x-[TA]-[DAN]-x(2,3)-[DG]-[GSTPNKQ]-x(2)-[LFYDEPAVI]-[NQS]-
PA   x(2)-[LI]-[SG]-[QEA]-[KRQENAD]-R-A-x(2)-[LVAIT]-x(3)-[LIVMF]-x(4,5)-
PA   [LIVMF]-x(4)-[LIVM]-x(3)-[SGW]-x-G.
NR   /RELEASE=2015_04,548208;
NR   /TOTAL=55(55); /POSITIVE=55(55); /UNKNOWN=0(0); /FALSE_POS=0(0);
NR   /FALSE_NEG=10; /PARTIAL=2;
CC   /TAXO-RANGE=???P?; /MAX-REPEAT=1;
CC   /VERSION=1;
DR   P65594, ARFA_MYCBO , T; A1KH31, ARFA_MYCBP , T; P9WIU4, ARFA_MYCTO , T;
DR   P9WIU5, ARFA_MYCTU , T; Q9S3P9, MOTY_VIBAN , T; P46233, MOTY_VIBPA , T;
DR   Q8U9L5, OMP16_AGRT5, T; P0A3S9, OMP16_BRUAB, T; P0A3S7, OMP16_BRUME, T;
DR   P0A3S8, OMP16_BRUSU, T; Q98F85, OMP16_RHILO, T; Q926C3, OMP16_RHIME, T;
DR   P07050, OMP3_NEIGO , T; Q9S3R8, OMP40_PORGI, T; Q9S3R9, OMP41_PORGI, T;
DR   P0A0V2, OMP4_NEIMA , T; P0A0V3, OMP4_NEIMB , T; P43840, OMP51_HAEIN, T;
DR   P38368, OMP52_HAEIF, T; P45996, OMP53_HAEIF, T; Q05146, OMPA_BORAV , T;
DR   P57414, OMPA_BUCAI , T; Q8K9L4, OMPA_BUCAP , T; P24016, OMPA_CITFR , T;
DR   P0A911, OMPA_ECO57 , T; P0A910, OMPA_ECOLI , T; P09146, OMPA_ENTAE , T;
DR   B7LNW7, OMPA_ESCF3 , T; P0C8Z2, OMPA_ESCFE , T; P24754, OMPA_ESCHE , T;
DR   P24017, OMPA_KLEPN , T; Q8Z7S0, OMPA_SALTI , T; P02936, OMPA_SALTY , T;
DR   P04845, OMPA_SERMA , T; P24755, OMPA_SEROD , T; I2BAK7, OMPA_SHIBC , T;
DR   P0DJO6, OMPA_SHIBL , T; P02935, OMPA_SHIDY , T; Q8ZG77, OMPA_YERPE , T;
DR   P38399, OMPA_YERPS , T; Q89AJ5, PAL_BUCBP , T; P0A913, PAL_ECO57 , T;
DR   P0A912, PAL_ECOLI , T; P10324, PAL_HAEIN , T; P26493, PAL_LEGPN , T;
DR   Q51886, PAL_PASMU , T; Q9I4Z4, PAL_PSEAE , T; P0A138, PAL_PSEPK , T;
DR   P0A139, PAL_PSEPU , T; P0A914, PAL_SHIFL , T; P13794, PORF_PSEAE , T;
DR   P37726, PORF_PSEFL , T; P22263, PORF_PSESY , T; P38369, TPN50_TREPA, T;
DR   P37665, YIAD_ECOLI , T;
DR   P85410, OMP5_HAEPR , P; P80444, OMPA_ACTLI , P;
DR   D3GSC3, LAFU_ECO44 , N; Q47154, LAFU_ECOLI , N; Q6RYW5, OMP38_ACIBA, N;
DR   A3M8K2, OMP38_ACIBT, N; P84838, OMPC_GLUDA , N; P07021, YFIB_ECOLI , N;
DR   P0C536, YN58_BRUAB , N; Q2YJ83, YP57_BRUA2 , N; Q8YDY8, YU36_BRUME , N;
DR   Q9RPX3, YU58_BRUSU , N;
3D   1OAP; 1R1M; 2AIZ; 2HQS; 2K1S; 2KGW; 2L26; 2LBT; 2LCA; 2W8B;
DO   PDOC00819;
//

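Fields of a PROSITE entry
ID (Identification): has the form
ID ENTRY_NAME; ENTRY_TYPE
where ENTRY_NAME is the name of the entry in PROSITE and ENTRY_TYPE is its type, here PATTERN.
AC (ACcession number): unique identifier of the entry in PROSITE. PROSITE accession numbers begin with PS.
DT (DaTe): dates of creation and of the last update(s) of the entry.
DE (DEscription): short description of the entry.
PA (PAttern): the pattern (pattern) that characterizes the family.

Rules of the pattern syntax:
1. Amino acids are written in the IUPAC one-letter code.
2. The symbol x stands for any amino acid.
3. Alternative amino acids at one position are listed in square brackets [...]. For example, [ALT] stands for Ala or Leu or Thr.
4. Amino acids that are not accepted at a position are listed in curly brackets.
5. The elements of a pattern are separated by a hyphen (-).
6. Repetition of an element is written as a number in parentheses, e.g. x(3) stands for x-x-x.
7. A range of repetitions is written as two numbers, e.g. x(2,4) stands for 2, 3 or 4 amino acids.
8. The symbols '<' and '>' restrict the pattern to the N-terminal or the C-terminal end of the sequence. A period ends the pattern.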
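
The syntax rules above map directly onto regular expressions. The Python sketch below is an illustration under that assumption and is not part of PROSITE; the function name `prosite_to_regex` is hypothetical. It translates a pattern element by element and then searches a fragment of the OMPA_ECOLI sequence (residues 241-310 of the entry shown earlier) with the PA pattern of PS01068.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern into a Python regular expression."""
    pattern = pattern.strip().rstrip(".")  # a period ends the pattern
    parts = []
    for element in pattern.split("-"):  # elements are separated by '-'
        prefix = suffix = ""
        if element.startswith("<"):  # anchored at the N-terminal end
            prefix, element = "^", element[1:]
        if element.endswith(">"):  # anchored at the C-terminal end
            suffix, element = "$", element[:-1]
        # Split off an optional repetition such as (3) or (2,4).
        m = re.fullmatch(r"(.+?)(?:\((\d+(?:,\d+)?)\))?", element)
        core, repeat = m.group(1), m.group(2)
        if core == "x":  # any amino acid
            core = "."
        elif core.startswith("{") and core.endswith("}"):  # excluded residues
            core = "[^" + core[1:-1] + "]"
        parts.append(prefix + core + ("{" + repeat + "}" if repeat else "") + suffix)
    return "".join(parts)


ompa_pattern = (
    "[LIVMA]-x-[GT]-x-[TA]-[DAN]-x(2,3)-[DG]-[GSTPNKQ]-x(2)-[LFYDEPAVI]-[NQS]-"
    "x(2)-[LI]-[SG]-[QEA]-[KRQENAD]-R-A-x(2)-[LVAIT]-x(3)-[LIVMF]-x(4,5)-"
    "[LIVMF]-x(4)-[LIVM]-x(3)-[SGW]-x-G."
)
fragment = (
    "LYSQLSNLDPKDGSVVVLGYTDRIGSDAYNQGLSERRAQS"
    "VVDYLISKGIPADKISARGMGESNPVTGNT"
)
match = re.search(prosite_to_regex(ompa_pattern), fragment)
print(match.group(0) if match else "no match")
```

Bracketed alternatives need no translation because PROSITE and regular expressions use the same notation for them; only x, the curly brackets, the repetition counts and the anchors are rewritten.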

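NR (Numerical Results): results of scanning (patternscan) SWISS-PROT with the pattern of the PROSITE entry.
The qualifiers are:
/RELEASE: the UNIPROT release that was scanned and the number of entries it contains.
/TOTAL: total number of hits in UNIPROT.
/POSITIVE: number of hits on sequences that match the pattern and do belong to the family described by the PROSITE entry.
/UNKNOWN: number of hits for which PROSITE cannot yet decide.
/FALSE_POS: number of hits on UNIPROT sequences that match the pattern but do not belong to the family.
/FALSE_NEG: number of UNIPROT sequences that belong to the family but are not found by the pattern.
/PARTIAL: number of UNIPROT sequences (fragments) that belong to the family but are too incomplete to be found by the PROSITE pattern.
CC (Comments): comments about the PROSITE entry.
DR (Database Reference): the UNIPROT entries related to the pattern.
3D (3D Structure): the ProteinDataBank entries with a known three-dimensional structure.
DO (Documentation): identifier of the documentation entry.
//: end of the entry.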

4. PDB entry for Outer membrane protein A (ompA) of Escherichia coli.


HEADER    MEMBRANE PROTEIN                        03-OCT-98   1BXW
TITLE     OUTER MEMBRANE PROTEIN A (OMPA) TRANSMEMBRANE DOMAIN
COMPND    MOL_ID: 1;
COMPND   2 MOLECULE: PROTEIN (OUTER MEMBRANE PROTEIN A);
COMPND   3 CHAIN: A;
COMPND   4 FRAGMENT: TRANSMEMBRANE DOMAIN;
COMPND   5 ENGINEERED: YES;
COMPND   6 MUTATION: YES
SOURCE    MOL_ID: 1;
SOURCE   2 ORGANISM_SCIENTIFIC: ESCHERICHIA COLI BL21(DE3);
SOURCE   3 ORGANISM_TAXID: 469008;
SOURCE   4 STRAIN: BL21DE3;
SOURCE   5 GENE: OMPA;
SOURCE   6 EXPRESSION_SYSTEM: ESCHERICHIA COLI BL21(DE3);
SOURCE   7 EXPRESSION_SYSTEM_TAXID: 469008;
SOURCE   8 EXPRESSION_SYSTEM_STRAIN: BL21DE3;
SOURCE   9 EXPRESSION_SYSTEM_PLASMID: PET3B-171
KEYWDS    OUTER MEMBRANE, TRANSMEMBRANE PROTEIN
EXPDTA    X-RAY DIFFRACTION
AUTHOR    G.E.SCHULZ,A.PAUTSCH
REVDAT   3   24-FEB-09 1BXW    1       VERSN
REVDAT   2   22-DEC-99 1BXW    4       HEADER COMPND REMARK JRNL
REVDAT   2 2                   4       ATOM   SOURCE SEQRES
REVDAT   1   14-OCT-98 1BXW    0
JRNL        AUTH   A.PAUTSCH,G.E.SCHULZ
JRNL        TITL   STRUCTURE OF THE OUTER MEMBRANE PROTEIN A
JRNL        TITL 2 TRANSMEMBRANE DOMAIN.
JRNL        REF    NAT.STRUCT.BIOL.              V.   5  1013 1998
JRNL        REFN                   ISSN 1072-8368
JRNL        PMID   9808047
JRNL        DOI    10.1038/2983
REMARK   1
REMARK   2
REMARK   2 RESOLUTION.    2.50 ANGSTROMS.
REMARK   3
REMARK   3 REFINEMENT.
REMARK   3   PROGRAM     : REFMAC
REMARK   3   AUTHORS     : MURSHUDOV,VAGIN,DODSON
REMARK   3
REMARK   3  DATA USED IN REFINEMENT.
REMARK   3   RESOLUTION RANGE HIGH (ANGSTROMS) : 2.50
REMARK   3   RESOLUTION RANGE LOW  (ANGSTROMS) : 50.00
REMARK   3   DATA CUTOFF            (SIGMA(F)) : 0.000
REMARK   3   COMPLETENESS FOR RANGE        (%) : 89.0
REMARK   3   NUMBER OF REFLECTIONS             : 8328
REMARK   3
REMARK   3  FIT TO DATA USED IN REFINEMENT.
REMARK   3   CROSS-VALIDATION METHOD          : THROUGHOUT
REMARK   3   FREE R VALUE TEST SET SELECTION  : RANDOM
REMARK   3   R VALUE     (WORKING + TEST SET) : NULL
REMARK   3   R VALUE            (WORKING SET) : 0.189
REMARK   3   FREE R VALUE                     : 0.235
REMARK   3   FREE R VALUE TEST SET SIZE   (%) : 5.000
REMARK   3   FREE R VALUE TEST SET COUNT      : 404
REMARK   3
REMARK   3  NUMBER OF NON-HYDROGEN ATOMS USED IN REFINEMENT.
REMARK   3   PROTEIN ATOMS            : 1330
REMARK   3   NUCLEIC ACID ATOMS       : 0
REMARK   3   HETEROGEN ATOMS          : 21
REMARK   3   SOLVENT ATOMS            : 39

REMARK   3
REMARK   3  B VALUES.
REMARK   3   FROM WILSON PLOT           (A**2) : 49.20
REMARK   3   MEAN B VALUE      (OVERALL, A**2) : 60.40
REMARK   3   OVERALL ANISOTROPIC B VALUE.
REMARK   3    B11 (A**2) : NULL
REMARK   3    B22 (A**2) : NULL
REMARK   3    B33 (A**2) : NULL
REMARK   3    B12 (A**2) : NULL
REMARK   3    B13 (A**2) : NULL
REMARK   3    B23 (A**2) : NULL
REMARK   3
REMARK   3  ESTIMATED OVERALL COORDINATE ERROR.
REMARK   3   ESU BASED ON R VALUE                            (A): NULL
REMARK   3   ESU BASED ON FREE R VALUE                       (A): NULL
REMARK   3   ESU BASED ON MAXIMUM LIKELIHOOD                 (A): NULL
REMARK   3   ESU FOR B VALUES BASED ON MAXIMUM LIKELIHOOD (A**2): 3.640
REMARK   3
REMARK   3  RMS DEVIATIONS FROM IDEAL VALUES.
REMARK   3   DISTANCE RESTRAINTS.                    RMS    SIGMA
REMARK   3    BOND LENGTH                     (A) : 0.015 ; NULL
REMARK   3    ANGLE DISTANCE                  (A) : 0.030 ; NULL
REMARK   3    INTRAPLANAR 1-4 DISTANCE        (A) : NULL  ; NULL
REMARK   3    H-BOND OR METAL COORDINATION    (A) : NULL  ; NULL
REMARK   3
REMARK   3   PLANE RESTRAINT                  (A) : NULL  ; NULL
REMARK   3   CHIRAL-CENTER RESTRAINT       (A**3) : NULL  ; NULL
REMARK   3
REMARK   3   NON-BONDED CONTACT RESTRAINTS.
REMARK   3    SINGLE TORSION                  (A) : NULL  ; NULL
REMARK   3    MULTIPLE TORSION                (A) : NULL  ; NULL
REMARK   3    H-BOND (X...Y)                  (A) : NULL  ; NULL
REMARK   3    H-BOND (X-H...Y)                (A) : NULL  ; NULL
REMARK   3
REMARK   3   CONFORMATIONAL TORSION ANGLE RESTRAINTS.
REMARK   3    SPECIFIED                 (DEGREES) : NULL  ; NULL
REMARK   3    PLANAR                    (DEGREES) : NULL  ; NULL
REMARK   3    STAGGERED                 (DEGREES) : NULL  ; NULL
REMARK   3    TRANSVERSE                (DEGREES) : NULL  ; NULL
REMARK   3
REMARK   3  ISOTROPIC THERMAL FACTOR RESTRAINTS.     RMS    SIGMA
REMARK   3   MAIN-CHAIN BOND               (A**2) : NULL  ; NULL
REMARK   3   MAIN-CHAIN ANGLE              (A**2) : NULL  ; NULL
REMARK   3   SIDE-CHAIN BOND               (A**2) : NULL  ; NULL
REMARK   3   SIDE-CHAIN ANGLE              (A**2) : NULL  ; NULL
REMARK   3
REMARK   3  OTHER REFINEMENT REMARKS: DISORDERED REGIONS ARE FROM
REMARK   3  GLY22-GLY28, GLY65-GLU68 AND ILE147-PRO157 WERE MODELED
REMARK   3  STEREOCHEMICALLY
REMARK   4
REMARK   4 1BXW COMPLIES WITH FORMAT V. 3.15, 01-DEC-08
REMARK 100
REMARK 100 THIS ENTRY HAS BEEN PROCESSED BY RCSB ON 19-AUG-99.
REMARK 100 THE RCSB ID CODE IS RCSB008140.
REMARK 200
REMARK 200 EXPERIMENTAL DETAILS
REMARK 200  EXPERIMENT TYPE                : X-RAY DIFFRACTION
REMARK 200  DATE OF DATA COLLECTION        : 15-JAN-98
REMARK 200  TEMPERATURE           (KELVIN) : 298
REMARK 200  PH                             : 5.0
REMARK 200  NUMBER OF CRYSTALS USED        : 1
REMARK 200
REMARK 200  SYNCHROTRON              (Y/N) : N
REMARK 200  RADIATION SOURCE               : ROTATING ANODE
REMARK 200  BEAMLINE                       : NULL
REMARK 200  X-RAY GENERATOR MODEL          : RIGAKU RU200
REMARK 200  MONOCHROMATIC OR LAUE    (M/L) : M
REMARK 200  WAVELENGTH OR RANGE        (A) : 1.5418
REMARK 200  MONOCHROMATOR                  : NI FILTER
REMARK 200  OPTICS                         : NULL
REMARK 200
REMARK 200  DETECTOR TYPE                  : AREA DETECTOR
REMARK 200  DETECTOR MANUFACTURER          : SIEMENS
REMARK 200  INTENSITY-INTEGRATION SOFTWARE : XDS
REMARK 200  DATA SCALING SOFTWARE          : CCP4 (SCALA)
REMARK 200
REMARK 200  NUMBER OF UNIQUE REFLECTIONS   : 8328
REMARK 200  RESOLUTION RANGE HIGH      (A) : 2.500
REMARK 200  RESOLUTION RANGE LOW       (A) : 50.000
REMARK 200  REJECTION CRITERIA  (SIGMA(I)) : NULL
REMARK 200
REMARK 200 OVERALL.
REMARK 200  COMPLETENESS FOR RANGE     (%) : 89.0
REMARK 200  DATA REDUNDANCY                : 2.100
REMARK 200  R MERGE                    (I) : NULL
REMARK 200  R SYM                      (I) : 0.02800
REMARK 200  <I/SIGMA(I)> FOR THE DATA SET  : 16.8000
REMARK 200
REMARK 200 IN THE HIGHEST RESOLUTION SHELL.
REMARK 200  HIGHEST RESOLUTION SHELL, RANGE HIGH (A) : 2.50
REMARK 200  HIGHEST RESOLUTION SHELL, RANGE LOW  (A) : 2.64
REMARK 200  COMPLETENESS FOR SHELL     (%) : 53.0
REMARK 200  DATA REDUNDANCY IN SHELL       : 1.20
REMARK 200  R MERGE FOR SHELL          (I) : NULL
REMARK 200  R SYM FOR SHELL            (I) : 0.11000
REMARK 200  <I/SIGMA(I)> FOR SHELL         : 6.600
REMARK 200
REMARK 200 DIFFRACTION PROTOCOL: SINGLE WAVELENGTH
REMARK 200 METHOD USED TO DETERMINE THE STRUCTURE: MIRAS
REMARK 200 SOFTWARE USED: SHARP
REMARK 200 STARTING MODEL: NULL
REMARK 200
REMARK 200 REMARK: NULL
REMARK 280
REMARK 280 CRYSTAL
REMARK 280 SOLVENT CONTENT, VS   (%): 66.70
REMARK 280 MATTHEWS COEFFICIENT, VM (ANGSTROMS**3/DA): 3.70
REMARK 280
REMARK 280 CRYSTALLIZATION CONDITIONS: 10 % PEG-8000 10 % MPD 0.05 M
REMARK 280  POTASSIUM PHOSPHATE PH 5.0
REMARK 290
REMARK 290 CRYSTALLOGRAPHIC SYMMETRY
REMARK 290 SYMMETRY OPERATORS FOR SPACE GROUP: C 1 2 1
REMARK 290
REMARK 290      SYMOP   SYMMETRY
REMARK 290     NNNMMM   OPERATOR
REMARK 290       1555   X,Y,Z
REMARK 290       2555   -X,Y,-Z
REMARK 290       3555   X+1/2,Y+1/2,Z
REMARK 290       4555   -X+1/2,Y+1/2,-Z
REMARK 290
REMARK 290     WHERE NNN -> OPERATOR NUMBER

REMARK 290           MMM -> TRANSLATION VECTOR
REMARK 290
REMARK 290 CRYSTALLOGRAPHIC SYMMETRY TRANSFORMATIONS
REMARK 290 THE FOLLOWING TRANSFORMATIONS OPERATE ON THE ATOM/HETATM
REMARK 290 RECORDS IN THIS ENTRY TO PRODUCE CRYSTALLOGRAPHICALLY
REMARK 290 RELATED MOLECULES.
REMARK 290   SMTRY1   1  1.000000  0.000000  0.000000        0.00000
REMARK 290   SMTRY2   1  0.000000  1.000000  0.000000        0.00000
REMARK 290   SMTRY3   1  0.000000  0.000000  1.000000        0.00000
REMARK 290   SMTRY1   2 -1.000000  0.000000  0.000000        0.00000
REMARK 290   SMTRY2   2  0.000000  1.000000  0.000000        0.00000
REMARK 290   SMTRY3   2  0.000000  0.000000 -1.000000        0.00000
REMARK 290   SMTRY1   3  1.000000  0.000000  0.000000       34.59000
REMARK 290   SMTRY2   3  0.000000  1.000000  0.000000       38.97500
REMARK 290   SMTRY3   3  0.000000  0.000000  1.000000        0.00000
REMARK 290   SMTRY1   4 -1.000000  0.000000  0.000000       34.59000
REMARK 290   SMTRY2   4  0.000000  1.000000  0.000000       38.97500
REMARK 290   SMTRY3   4  0.000000  0.000000 -1.000000        0.00000
REMARK 290
REMARK 290 REMARK: NULL
REMARK 300
REMARK 300 BIOMOLECULE: 1
REMARK 300 SEE REMARK 350 FOR THE AUTHOR PROVIDED AND/OR PROGRAM
REMARK 300 GENERATED ASSEMBLY INFORMATION FOR THE STRUCTURE IN
REMARK 300 THIS ENTRY. THE REMARK MAY ALSO PROVIDE INFORMATION ON
REMARK 300 BURIED SURFACE AREA.
REMARK 350
REMARK 350 COORDINATES FOR A COMPLETE MULTIMER REPRESENTING THE KNOWN
REMARK 350 BIOLOGICALLY SIGNIFICANT OLIGOMERIZATION STATE OF THE
REMARK 350 MOLECULE CAN BE GENERATED BY APPLYING BIOMT TRANSFORMATIONS
REMARK 350 GIVEN BELOW. BOTH NON-CRYSTALLOGRAPHIC AND
REMARK 350 CRYSTALLOGRAPHIC OPERATIONS ARE GIVEN.
REMARK 350
REMARK 350 BIOMOLECULE: 1
REMARK 350 AUTHOR DETERMINED BIOLOGICAL UNIT: MONOMERIC
REMARK 350 APPLY THE FOLLOWING TO CHAINS: A
REMARK 350   BIOMT1   1  1.000000  0.000000  0.000000        0.00000
REMARK 350   BIOMT2   1  0.000000  1.000000  0.000000        0.00000
REMARK 350   BIOMT3   1  0.000000  0.000000  1.000000        0.00000
REMARK 470
REMARK 470 MISSING ATOM
REMARK 470 THE FOLLOWING RESIDUES HAVE MISSING ATOMS(M=MODEL NUMBER;
REMARK 470 RES=RESIDUE NAME; C=CHAIN IDENTIFIER; SSEQ=SEQUENCE NUMBER;
REMARK 470 I=INSERTION CODE):
REMARK 470   M RES CSSEQI  ATOMS
REMARK 470     HIS A  31    CG   ND1  CD2  CE1  NE2
REMARK 475
REMARK 475 ZERO OCCUPANCY RESIDUES
REMARK 475 THE FOLLOWING RESIDUES WERE MODELED WITH ZERO OCCUPANCY.
REMARK 475 THE LOCATION AND PROPERTIES OF THESE RESIDUES MAY NOT
REMARK 475 BE RELIABLE. (M=MODEL NUMBER; RES=RESIDUE NAME;
REMARK 475 C=CHAIN IDENTIFIER; SSEQ=SEQUENCE NUMBER; I=INSERTION CODE)
REMARK 475   M RES C  SSEQI
REMARK 475     GLY A    22
REMARK 475     LEU A    23
REMARK 475     ILE A    24
REMARK 475     ASN A    25
REMARK 475     ASN A    26
REMARK 475     ASN A    27
REMARK 475     GLY A    28
REMARK 475     GLY A    65

101

REMARK 475     SER A    66
REMARK 475     VAL A    67
REMARK 475     GLU A    68
REMARK 475     ILE A   147
REMARK 475     GLY A   148
REMARK 475     ASP A   149
REMARK 475     ALA A   150
REMARK 475     HIS A   151
REMARK 475     THR A   152
REMARK 475     ILE A   153
REMARK 475     GLY A   154
REMARK 475     THR A   155
REMARK 475     ARG A   156
REMARK 475     PRO A   157
REMARK 480
REMARK 480 ZERO OCCUPANCY ATOM
REMARK 480 THE FOLLOWING RESIDUES HAVE ATOMS MODELED WITH ZERO
REMARK 480 OCCUPANCY. THE LOCATION AND PROPERTIES OF THESE ATOMS
REMARK 480 MAY NOT BE RELIABLE. (M=MODEL NUMBER; RES=RESIDUE NAME;
REMARK 480 C=CHAIN IDENTIFIER; SSEQ=SEQUENCE NUMBER; I=INSERTION CODE):
REMARK 480   M RES C SSEQI  ATOMS
REMARK 480     LYS A   64    CB   CG   CD   CE   NZ
REMARK 500
REMARK 500 GEOMETRY AND STEREOCHEMISTRY
REMARK 500 SUBTOPIC: CLOSE CONTACTS
REMARK 500
REMARK 500 THE FOLLOWING ATOMS THAT ARE RELATED BY CRYSTALLOGRAPHIC
REMARK 500 SYMMETRY ARE IN CLOSE CONTACT. AN ATOM LOCATED WITHIN 0.15
REMARK 500 ANGSTROMS OF A SYMMETRY RELATED ATOM IS ASSUMED TO BE ON A
REMARK 500 SPECIAL POSITION AND IS, THEREFORE, LISTED IN REMARK 375
REMARK 500 INSTEAD OF REMARK 500. ATOMS WITH NON-BLANK ALTERNATE
REMARK 500 LOCATION INDICATORS ARE NOT INCLUDED IN THE CALCULATIONS.
REMARK 500
REMARK 500 DISTANCE CUTOFF:
REMARK 500 2.2 ANGSTROMS FOR CONTACTS NOT INVOLVING HYDROGEN ATOMS
REMARK 500 1.6 ANGSTROMS FOR CONTACTS INVOLVING HYDROGEN ATOMS
REMARK 500
REMARK 500  ATM1  RES C  SSEQI   ATM2  RES C  SSEQI  SSYMOP   DISTANCE
REMARK 500   OD1  ASN A    26     CA   PRO A    29     2556     1.44
REMARK 500   OD1  ASN A    26     C    PRO A    29     2556     1.68
REMARK 500   OD1  ASN A    26     N    PRO A    29     2556     1.72
REMARK 500   OD1  ASN A     5     CD1  ILE A   147     2657     2.03
REMARK 500   OD1  ASN A    26     O    PRO A    29     2556     2.08
REMARK 500   CG   ASN A    26     N    PRO A    29     2556     2.11
REMARK 500
REMARK 500 REMARK: NULL
REMARK 500
REMARK 500 GEOMETRY AND STEREOCHEMISTRY
REMARK 500 SUBTOPIC: COVALENT BOND LENGTHS
REMARK 500
REMARK 500 THE STEREOCHEMICAL PARAMETERS OF THE FOLLOWING RESIDUES
REMARK 500 HAVE VALUES WHICH DEVIATE FROM EXPECTED VALUES BY MORE
REMARK 500 THAN 6*RMSD (M=MODEL NUMBER; RES=RESIDUE NAME; C=CHAIN
REMARK 500 IDENTIFIER; SSEQ=SEQUENCE NUMBER; I=INSERTION CODE).
REMARK 500
REMARK 500 STANDARD TABLE:
REMARK 500 FORMAT: (10X,I3,1X,2(A3,1X,A1,I4,A1,1X,A4,3X),1X,F6.3)
REMARK 500
REMARK 500 EXPECTED VALUES PROTEIN: ENGH AND HUBER, 1999
REMARK 500 EXPECTED VALUES NUCLEIC ACID: CLOWNEY ET AL 1996
REMARK 500
REMARK 500   M RES CSSEQI ATM1   RES CSSEQI ATM2   DEVIATION
REMARK 500     GLY A  28   C      PRO A  29   N        0.125
REMARK 500     GLY A 148   N      GLY A 148   CA       0.090
REMARK 500     ARG A 156   CA     ARG A 156   C        0.206
REMARK 500     PRO A 157   N      PRO A 157   CA      -0.251
REMARK 500     PRO A 157   CD     PRO A 157   N       -0.368
REMARK 500     PRO A 157   CA     PRO A 157   C       -0.164
REMARK 500
REMARK 500 REMARK: NULL
REMARK 500
REMARK 500 GEOMETRY AND STEREOCHEMISTRY
REMARK 500 SUBTOPIC: COVALENT BOND ANGLES
REMARK 500
REMARK 500 THE STEREOCHEMICAL PARAMETERS OF THE FOLLOWING RESIDUES
REMARK 500 HAVE VALUES WHICH DEVIATE FROM EXPECTED VALUES BY MORE
REMARK 500 THAN 6*RMSD (M=MODEL NUMBER; RES=RESIDUE NAME; C=CHAIN
REMARK 500 IDENTIFIER; SSEQ=SEQUENCE NUMBER; I=INSERTION CODE).
REMARK 500
REMARK 500 STANDARD TABLE:
REMARK 500 FORMAT: (10X,I3,1X,A3,1X,A1,I4,A1,3(1X,A4,2X),12X,F5.1)
REMARK 500
REMARK 500 EXPECTED VALUES PROTEIN: ENGH AND HUBER, 1999
REMARK 500 EXPECTED VALUES NUCLEIC ACID: CLOWNEY ET AL 1996
REMARK 500
REMARK 500   M RES CSSEQI ATM1   ATM2   ATM3
REMARK 500     ASP A   4   CA  - C   - N     ANGL. DEV. =  18.3 DEGREES
REMARK 500     GLY A  22   O   - C   - N     ANGL. DEV. =  11.3 DEGREES
REMARK 500     ASN A  25   C   - N   - CA    ANGL. DEV. = -16.5 DEGREES
REMARK 500     ASN A  25   CA  - C   - N     ANGL. DEV. = -14.2 DEGREES
REMARK 500     ASN A  25   O   - C   - N     ANGL. DEV. =  14.8 DEGREES
REMARK 500     GLY A  28   O   - C   - N     ANGL. DEV. = -11.6 DEGREES
REMARK 500     ARG A  60   CD  - NE  - CZ    ANGL. DEV. =   9.8 DEGREES
REMARK 500     ARG A  60   NE  - CZ  - NH1   ANGL. DEV. =   3.7 DEGREES
REMARK 500     ARG A  60   NE  - CZ  - NH2   ANGL. DEV. =  -3.3 DEGREES
REMARK 500     GLU A  68   C   - N   - CA    ANGL. DEV. = -16.3 DEGREES
REMARK 500     GLN A  75   CB  - CA  - C     ANGL. DEV. =  13.1 DEGREES
REMARK 500     ASP A  90   CB  - CG  - OD1   ANGL. DEV. =   5.7 DEGREES
REMARK 500     SER A 120   N   - CA  - CB    ANGL. DEV. =  -9.2 DEGREES
REMARK 500     VAL A 122   CB  - CA  - C     ANGL. DEV. = -12.2 DEGREES
REMARK 500     ILE A 135   CA  - CB  - CG2   ANGL. DEV. =  15.6 DEGREES
REMARK 500     ARG A 138   CA  - CB  - CG    ANGL. DEV. =  14.8 DEGREES
REMARK 500     ARG A 138   CD  - NE  - CZ    ANGL. DEV. =  10.7 DEGREES
REMARK 500     ARG A 138   NE  - CZ  - NH2   ANGL. DEV. =  -3.3 DEGREES
REMARK 500     HIS A 151   CB  - CA  - C     ANGL. DEV. = -38.2 DEGREES
REMARK 500     ALA A 150   CA  - C   - N     ANGL. DEV. = -16.2 DEGREES
REMARK 500     ALA A 150   O   - C   - N     ANGL. DEV. =  17.6 DEGREES
REMARK 500     HIS A 151   CA  - C   - N     ANGL. DEV. = -38.2 DEGREES
REMARK 500     HIS A 151   O   - C   - N     ANGL. DEV. =  43.4 DEGREES
REMARK 500     THR A 152   C   - N   - CA    ANGL. DEV. =  30.0 DEGREES
REMARK 500     ARG A 156   CB  - CA  - C     ANGL. DEV. =  12.4 DEGREES
REMARK 500     ARG A 156   N   - CA  - CB    ANGL. DEV. = -15.9 DEGREES
REMARK 500     ARG A 156   NH1 - CZ  - NH2   ANGL. DEV. =  -6.6 DEGREES
REMARK 500     ARG A 156   NE  - CZ  - NH2   ANGL. DEV. =   3.7 DEGREES
REMARK 500     THR A 155   O   - C   - N     ANGL. DEV. = -14.6 DEGREES
REMARK 500     ARG A 156   C   - N   - CA    ANGL. DEV. = -16.4 DEGREES
REMARK 500     PRO A 157   CA  - N   - CD    ANGL. DEV. = -25.7 DEGREES
REMARK 500     PRO A 157   N   - CA  - CB    ANGL. DEV. = -25.4 DEGREES
REMARK 500     PRO A 157   CB  - CG  - CD    ANGL. DEV. = -24.6 DEGREES
REMARK 500     PRO A 157   N   - CD  - CG    ANGL. DEV. = -33.9 DEGREES
REMARK 500     PRO A 157   N   - CA  - C     ANGL. DEV. =  19.1 DEGREES
REMARK 500     PRO A 157   CA  - C   - O     ANGL. DEV. = -17.2 DEGREES
REMARK 500     PRO A 157   CA                ANGL. DEV. = -16.9 DEGREES
REMARK 500
REMARK 500 REMARK: NULL
REMARK 500
REMARK 500 GEOMETRY AND STEREOCHEMISTRY
REMARK 500 SUBTOPIC: TORSION ANGLES
REMARK 500
REMARK 500 TORSION ANGLES OUTSIDE THE EXPECTED RAMACHANDRAN REGIONS:
REMARK 500 (M=MODEL NUMBER; RES=RESIDUE NAME; C=CHAIN IDENTIFIER;
REMARK 500 SSEQ=SEQUENCE NUMBER; I=INSERTION CODE).
REMARK 500
REMARK 500 STANDARD TABLE:
REMARK 500 FORMAT:(10X,I3,1X,A3,1X,A1,I4,A1,4X,F7.2,3X,F7.2)
REMARK 500
REMARK 500 EXPECTED VALUES: GJ KLEYWEGT AND TA JONES (1996). PHI/PSI-
REMARK 500 CHOLOGY: RAMACHANDRAN REVISITED. STRUCTURE 4, 1395 - 1400
REMARK 500
REMARK 500   M RES CSSEQI        PSI       PHI
REMARK 500     ASN A   5       57.62   -113.00
REMARK 500     TYR A  18      120.18    166.90
REMARK 500     ASP A  20     -140.62   -156.38
REMARK 500     LEU A  23      150.39     68.74
REMARK 500     ASN A  25      -90.55     -9.26
REMARK 500     ASN A  26      121.49    -29.94
REMARK 500     HIS A  31      175.58    173.40
REMARK 500     TYR A  63      102.76   -169.58
REMARK 500     SER A  66       52.30    128.49
REMARK 500     VAL A  67       90.53     49.93
REMARK 500     VAL A 110      -72.06    -67.54
REMARK 500     ALA A 150     -164.72    173.09
REMARK 500     HIS A 151      -97.21     35.47
REMARK 500     THR A 152     -139.61   -128.76
REMARK 500     THR A 155     -137.22   -149.83
REMARK 500     ARG A 156     -162.43   -178.66
REMARK 500
REMARK 500 REMARK: NULL
REMARK 500
REMARK 500 GEOMETRY AND STEREOCHEMISTRY
REMARK 500 SUBTOPIC: NON-CIS, NON-TRANS
REMARK 500
REMARK 500 THE FOLLOWING PEPTIDE BONDS DEVIATE SIGNIFICANTLY FROM BOTH
REMARK 500 CIS AND TRANS CONFORMATION. CIS BONDS, IF ANY, ARE LISTED
REMARK 500 ON CISPEP RECORDS. TRANS IS DEFINED AS 180 +/- 30 AND
REMARK 500 CIS IS DEFINED AS 0 +/- 30 DEGREES.
REMARK 500                                 MODEL     OMEGA
REMARK 500 ARG A  156     PRO A  157                 -147.47
REMARK 500
REMARK 500 REMARK: NULL
REMARK 500
REMARK 500 GEOMETRY AND STEREOCHEMISTRY
REMARK 500 SUBTOPIC: MAIN CHAIN PLANARITY
REMARK 500
REMARK 500 THE FOLLOWING RESIDUES HAVE A PSEUDO PLANARITY
REMARK 500 TORSION, C(I) - CA(I) - N(I+1) - O(I), GREATER
REMARK 500 10.0 DEGREES. (M=MODEL NUMBER; RES=RESIDUE NAME;
REMARK 500 C=CHAIN IDENTIFIER; SSEQ=SEQUENCE NUMBER;
REMARK 500 I=INSERTION CODE).
REMARK 500
REMARK 500   M RES CSSEQI        ANGLE
REMARK 500     ARG A 156      -11.76
REMARK 500
REMARK 500 REMARK: NULL
REMARK 800
REMARK 800 SITE
REMARK 800 SITE_IDENTIFIER: AC1
REMARK 800 EVIDENCE_CODE: SOFTWARE
REMARK 800 SITE_DESCRIPTION: BINDING SITE FOR RESIDUE C8E A 172
DBREF  1BXW A    0   171  UNP    P0A910   OMPA_ECOLI      21    192
SEQADV 1BXW MET A    0  UNP  P0A910    ALA    21 SEE REMARK 999
SEQADV 1BXW LEU A   23  UNP  P0A910    PHE    44 MUTATION
SEQADV 1BXW LYS A   34  UNP  P0A910    GLN    55 MUTATION
SEQADV 1BXW TYR A  107  UNP  P0A910    LYS   128 MUTATION
SEQRES   1 A  172  MET ALA PRO LYS ASP ASN THR TRP TYR THR GLY ALA LYS
SEQRES   2 A  172  LEU GLY TRP SER GLN TYR HIS ASP THR GLY LEU ILE ASN
SEQRES   3 A  172  ASN ASN GLY PRO THR HIS GLU ASN LYS LEU GLY ALA GLY
SEQRES   4 A  172  ALA PHE GLY GLY TYR GLN VAL ASN PRO TYR VAL GLY PHE
SEQRES   5 A  172  GLU MET GLY TYR ASP TRP LEU GLY ARG MET PRO TYR LYS
SEQRES   6 A  172  GLY SER VAL GLU ASN GLY ALA TYR LYS ALA GLN GLY VAL
SEQRES   7 A  172  GLN LEU THR ALA LYS LEU GLY TYR PRO ILE THR ASP ASP
SEQRES   8 A  172  LEU ASP ILE TYR THR ARG LEU GLY GLY MET VAL TRP ARG
SEQRES   9 A  172  ALA ASP THR TYR SER ASN VAL TYR GLY LYS ASN HIS ASP
SEQRES  10 A  172  THR GLY VAL SER PRO VAL PHE ALA GLY GLY VAL GLU TYR
SEQRES  11 A  172  ALA ILE THR PRO GLU ILE ALA THR ARG LEU GLU TYR GLN
SEQRES  12 A  172  TRP THR ASN ASN ILE GLY ASP ALA HIS THR ILE GLY THR
SEQRES  13 A  172  ARG PRO ASP ASN GLY MET LEU SER LEU GLY VAL SER TYR
SEQRES  14 A  172  ARG PHE GLY
HET    C8E  A 172      21
HETNAM     C8E (HYDROXYETHYLOXY)TRI(ETHYLOXY)OCTANE
FORMUL   2  C8E    C16 H34 O5
FORMUL   3  HOH   *39(H2 O)
SHEET    1  S1 1 THR A   6  SER A  16  0
SHEET    1  S2 1 LYS A  34  VAL A  45  0
SHEET    1  S3 1 VAL A  49  ARG A  60  0
SHEET    1  S4 1 TYR A  72  PRO A  86  0
SHEET    1  S5 1 LEU A  91  THR A 106  0
SHEET    1  S6 1 ASN A 114  ALA A 130  0
SHEET    1  S7 1 ILE A 135  TRP A 143  0
SHEET    1  S8 1 MET A 161  PHE A 170  0
LINK         OD2 ASP A 149                 C17 C8E A 172     2657   1555  1.24
LINK         CB  ASP A 149                 C17 C8E A 172     2657   1555  1.88
LINK         OD1 ASP A 149                 C17 C8E A 172     2657   1555  1.59
LINK         OD2 ASP A 149                 O18 C8E A 172     2657   1555  1.96
LINK         CA  ASP A 149                 O18 C8E A 172     2657   1555  1.92
LINK         CB  ASP A 149                 O18 C8E A 172     2657   1555  1.18
LINK         OD1 ASP A 149                 O18 C8E A 172     2657   1555  1.38
LINK         N   ASP A 149                 C19 C8E A 172     2657   1555  1.68
LINK         C   ASP A 149                 C19 C8E A 172     2657   1555  1.87
LINK         CA  ASP A 149                 C20 C8E A 172     2657   1555  1.45
LINK         C   ASP A 149                 C20 C8E A 172     2657   1555  1.22
LINK         CB  ASP A 149                 C20 C8E A 172     2657   1555  2.02
LINK         N   ALA A 150                 O21 C8E A 172     2657   1555  1.70
LINK         N   ALA A 150                 C20 C8E A 172     2657   1555  1.34
SITE     1 AC1  4 TYR A  43  PHE A  51  LEU A  79  GLY A  99
CRYST1   69.180   77.950   50.930  90.00  91.52  90.00 C 1 2 1       4
ORIGX1      1.000000  0.000000  0.000000        0.00000
ORIGX2      0.000000  1.000000  0.000000        0.00000
ORIGX3      0.000000  0.000000  1.000000        0.00000
SCALE1      0.014455  0.000000  0.000383        0.00000
SCALE2      0.000000  0.012829  0.000000        0.00000
SCALE3      0.000000  0.000000  0.019642        0.00000

PDB

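HEADER: the classification of the molecule, the deposition date, and the PDB identifier of the entry in the Protein Data Bank.
TITLE: a title for the entry, describing the experiment or analysis it represents.
COMPND: the macromolecular contents (compound) of the entry, such as molecule names and chains.
SOURCE: the biological or chemical source of each molecule in the entry.
KEYWDS: keywords that characterize the entry.
EXPDTA: the experimental technique used to obtain the structure (X-Ray Crystallography/NMR/Theoretical Model).
AUTHOR: the names of the people responsible for the contents of the entry.
JRNL: the primary literature citation that describes the structure.
REMARK: general annotations that do not fit any other record type. Each REMARK carries a number that identifies its topic; in the entry shown here, for example, they cover symmetry operations, the biological assembly, missing atoms, zero-occupancy residues, and geometry outliers.
SEQRES: the sequence of residues in each chain, written in 3-letter codes.
HET: the non-standard residues (heterogens) in the entry, such as ligands, with the chain and residue number of each group and its number of atoms.
HETNAM: the chemical name of each HET group.
FORMUL: the chemical formula of each HET group.
HELIX: the location of each helix in the structure.
SHEET: the location of each strand of the beta sheets in the structure.
CRYST1: the unit cell parameters and the space group of the crystal.
ORIGXn (n=1..3): the transformation from the orthogonal coordinates in the PDB entry to the coordinates originally submitted.
SCALEn: the transformation from orthogonal coordinates to fractional crystallographic coordinates.
ATOM: the name, position, and related properties of one atom of a standard residue. Each ATOM record is laid out in fixed columns (character positions), as follows: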




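Columns   Contents
--------------------------------------------------------------------------------
 1 -  6   Record name, "ATOM  ".
 7 - 11   Atom serial number.
13 - 16   Atom name.
18 - 20   Residue name, as a 3-letter code.
22        Chain identifier (chainID), which distinguishes the chains in the entry.
23 - 26   Residue sequence number.
31 - 38   Orthogonal x coordinate (in Angstroms), along the X axis.
39 - 46   Orthogonal y coordinate (in Angstroms), along the Y axis.
47 - 54   Orthogonal z coordinate (in Angstroms), along the Z axis.
55 - 60   Occupancy.
61 - 66   Temperature factor.
77 - 78   Element symbol.
79 - 80   Charge on the atom (if any).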
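The fixed-column layout above can be read by slicing each line at the listed character positions. The following is a minimal sketch in Python, not a complete PDB parser: the names AtomRecord and parse_atom_line are illustrative assumptions, and the example line is hypothetical (it is not taken from entry 1BXW). The line is built with a format string so that every field lands in its column.

```python
from dataclasses import dataclass


@dataclass
class AtomRecord:
    serial: int
    name: str
    res_name: str
    chain_id: str
    res_seq: int
    x: float
    y: float
    z: float
    occupancy: float
    temp_factor: float
    element: str
    charge: str


def parse_atom_line(line: str) -> AtomRecord:
    """Slice one ATOM/HETATM line by the 1-based column ranges of the table."""
    if line[:6] not in ("ATOM  ", "HETATM"):
        raise ValueError("not an ATOM or HETATM record")
    line = line.rstrip("\n").ljust(80)  # pad so that short lines still slice safely
    return AtomRecord(
        serial=int(line[6:11]),            # columns  7-11
        name=line[12:16].strip(),          # columns 13-16
        res_name=line[17:20].strip(),      # columns 18-20
        chain_id=line[21],                 # column  22
        res_seq=int(line[22:26]),          # columns 23-26
        x=float(line[30:38]),              # columns 31-38
        y=float(line[38:46]),              # columns 39-46
        z=float(line[46:54]),              # columns 47-54
        occupancy=float(line[54:60]),      # columns 55-60
        temp_factor=float(line[60:66]),    # columns 61-66
        element=line[76:78].strip(),       # columns 77-78
        charge=line[78:80].strip(),        # columns 79-80
    )


# Hypothetical record: serial 1, atom CA of residue ALA 1 in chain A.
EXAMPLE = (
    "ATOM  {:>5d} {:<4s} {:>3s} {:1s}{:>4d}    "
    "{:8.3f}{:8.3f}{:8.3f}{:6.2f}{:6.2f}          {:>2s}{:<2s}"
).format(1, " CA", "ALA", "A", 1, 11.104, 6.134, -6.504, 1.0, 20.0, "C", "")

atom = parse_atom_line(EXAMPLE)
print(atom)
```

Slicing by position rather than splitting on whitespace matters here, because adjacent fields in a PDB line may touch (for example, a four-digit residue number directly after the chain identifier).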

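TER: marks the end of a chain.
HETATM: the coordinates of an atom that belongs to a non-standard residue (HET group). The format is the same as that of ATOM.
CONECT: the connectivity (bonds) between atoms; each CONECT record lists the atoms bonded to a given atom.
MASTER: a control record that holds counts of the records in the file, used for bookkeeping.
END: marks the end of the file.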
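The CRYST1, SCALEn, and SMTRY values listed for this entry can be checked against one another with a few lines of arithmetic. The sketch below is an illustration, not part of the PDB format itself: the function names are assumptions, and the scale formula is written only for a cell with alpha = gamma = 90 degrees (as in the CRYST1 record of this entry), in the usual orthogonal frame with a along x and b along y. It rebuilds the SCALE matrix from the cell parameters and applies symmetry operator 4 of REMARK 290 to an arbitrary test point.

```python
import math


def scale_matrix(a, b, c, beta_deg):
    """Orthogonal -> fractional matrix, assuming alpha = gamma = 90 degrees."""
    beta = math.radians(beta_deg)
    return [
        [1.0 / a, 0.0, -math.cos(beta) / (a * math.sin(beta))],
        [0.0, 1.0 / b, 0.0],
        [0.0, 0.0, 1.0 / (c * math.sin(beta))],
    ]


def transform(matrix, translation, xyz):
    """Apply x' = M.x + t, the form shared by SMTRY, BIOMT, ORIGX and SCALE."""
    return tuple(
        sum(matrix[i][j] * xyz[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


# Values copied from the CRYST1, SCALEn and REMARK 290 records of the entry.
CELL = (69.180, 77.950, 50.930, 91.52)  # a, b, c, beta
SCALE_FROM_FILE = [
    [0.014455, 0.000000, 0.000383],
    [0.000000, 0.012829, 0.000000],
    [0.000000, 0.000000, 0.019642],
]
SMTRY_4 = (
    [[-1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, -1.0]],
    [34.59, 38.975, 0.0],
)

computed = scale_matrix(*CELL)
max_diff = max(
    abs(computed[i][j] - SCALE_FROM_FILE[i][j])
    for i in range(3)
    for j in range(3)
)
print("largest difference from the SCALE records:", max_diff)

# Symmetry image of an arbitrary test point (not an atom of the entry),
# followed by its fractional coordinates.
image = transform(*SMTRY_4, (1.0, 2.0, 3.0))
fractional = transform(SCALE_FROM_FILE, [0.0, 0.0, 0.0], image)
print(image, fractional)
```

The translation of operators 3 and 4 (34.59, 38.975, 0) is half of a and half of b, which is the centering translation expected for the C 1 2 1 space group named in CRYST1.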

107

Akiva,E.,Brown,S.,Almonacid,D.E.,Barber,A.E.,2nd,Custer,A.F.,Hicks,M.A.,...Babbitt,P.C.
(2014).TheStructure-FunctionLinkageDatabase.Nucleic acids research, 42(Databaseissue),D521530.
Akondi,K.B.,Muttenthaler,M.,Dutertre,S.,Kaas,Q.,Craik,D.J.,Lewis,R.J.,&Alewood,P.F.(2014).
Discovery,synthesis,andstructure-activityrelationshipsofconotoxins.Chemical reviews, 114(11),
5815-5847.
Alexander,S.P.,Benson,H.E.,Faccenda,E.,Pawson,A.J.,Sharman,J.L.,McGrath,J.C.,...
Zimmermann,M.(2013).TheConciseGuidetoPHARMACOLOGY2013/14:overview.British
journal of pharmacology, 170(8),1449-1458.
Alexander,S.P.,Mathie,A.,&Peters,J.A.(2011).GuidetoReceptorsandChannels(GRAC),5thedition.
British journal of pharmacology, 164 Suppl 1,S1-324.
Almonacid,D.E.,&Babbitt,P.C.(2011).Towardmechanisticclassificationofenzymefunctions.Current
opinion in chemical biology, 15(3),435-442.
Altschul,S.F.,Madden,T.L.,Schaffer,A.A.,Zhang,J.,Zhang,Z.,Miller,W.,&Lipman,D.J.(1997).
GappedBLASTandPSI-BLAST:anewgenerationofproteindatabasesearchprograms.Nucleic
acids research, 25(17),3389-3402.
Ananiadou,S.,Kell,D.B.,&Tsujii,J.(2006).Textmininganditspotentialapplicationsinsystemsbiology.
Trends Biotechnol, 24(12),571-579.
Andreeva,A.,Howorth,D.,Brenner,S.E.,Hubbard,T.J.,Chothia,C.,&Murzin,A.G.(2004).SCOP
databasein2004:refinementsintegratestructureandsequencefamilydata.Nucleic Acids Res,
32(Databaseissue),D226-229.
Atkinson,H.J.,Morris,J.H.,Ferrin,T.E.,&Babbitt,P.C.(2009).Usingsequencesimilaritynetworksfor
visualizationofrelationshipsacrossdiverseproteinsuperfamilies.PloS one, 4(2),e4345.
Babbitt,P.C.,&Gerlt,J.A.(1997).Understandingenzymesuperfamilies.ChemistryAsthefundamental
determinantintheevolutionofnewcatalyticactivities.The Journal of biological chemistry, 272(49),
30591-30594.
Babbitt,P.C.,Hasson,M.S.,Wedekind,J.E.,Palmer,D.R.,Barrett,W.C.,Reed,G.H.,...Gerlt,J.A.
(1996).Theenolasesuperfamily:ageneralstrategyforenzyme-catalyzedabstractionofthealphaprotonsofcarboxylicacids.Biochemistry, 35(51),16489-16501.
Bairoch,A.(1999).TheENZYMEdatabankin1999.Nucleic acids research, 27(1),310-311.
Barrett,T.,&Edgar,R.(2006).MiningmicroarraydataatNCBI'sGeneExpressionOmnibus(GEO)*.
Methods Mol Biol, 338,175-190.
Baxevanis,A.D.,Arents,G.,Moudrianakis,E.N.,&Landsman,D.(1995).AvarietyofDNA-bindingand
multimericproteinscontainthehistonefoldmotif.Nucleic acids research, 23(14),2685-2691.
Baxevanis,A.D.,&Landsman,D.(1996).HistoneSequenceDatabase:acompilationofhighly-conserved
nucleoproteinsequences.Nucleic acids research, 24(1),245-247.
Becker,K.G.,Barnes,K.C.,Bright,T.J.,&Wang,S.A.(2004).Thegeneticassociationdatabase.Nat
Genet, 36(5),431-432.
Benson,D.A.,Clark,K.,Karsch-Mizrachi,I.,Lipman,D.J.,Ostell,J.,&Sayers,E.W.(2014).GenBank.
Nucleic acids research.
Berman,H.,Henrick,K.,Nakamura,H.,&Markley,J.L.(2007).TheworldwideProteinDataBank
(wwPDB):ensuringasingle,uniformarchiveofPDBdata.Nucleic acids research, 35(Database
issue),D301-303.

108

Bertram,L.,McQueen,M.B.,Mullin,K.,Blacker,D.,&Tanzi,R.E.(2007).Systematicmeta-analysesof
Alzheimerdiseasegeneticassociationstudies:theAlzGenedatabase.Nat Genet, 39(1),17-23.
Bhaskara,R.M.,Mehrotra,P.,Rakshambikai,R.,Gnanavel,M.,Martin,J.,&Srinivasan,N.(2014).The
relationshipbetweenclassificationofmulti-domainproteinsusinganalignment-freeapproachand
theirfunctions:acasestudywithimmunoglobulins.Molecular BioSystems, 10(5),1082-1093.
Biggs,J.S.,Watkins,M.,Puillandre,N.,Ownby,J.P.,Lopez-Vera,E.,Christensen,S.,...Olivera,B.M.
(2010).EvolutionofConuspeptidetoxins:analysisofConuscalifornicusReeve,1844.Molecular
phylogenetics and evolution, 56(1),1-12.
Bingham,J.,Plowman,G.D.,&Sudarsanam,S.(2000).Informaticsissuesinlarge-scalesequenceanalysis:
elucidatingtheproteinkinasesofC.elegans.J Cell Biochem, 80(2),181-186.
Bockaert,J.,&Pin,J.P.(1999).MoleculartinkeringofGprotein-coupledreceptors:anevolutionarysuccess.
Embo J, 18(7),1723-1729.
Bonner,T.I.(2014).Shouldpharmacologistscareaboutalternativesplicing?IUPHARReview4.British
journal of pharmacology, 171(5),1231-1240.
Bradham,C.A.,Foltz,K.R.,Beane,W.S.,Arnone,M.I.,Rizzo,F.,Coffman,J.A.,...Manning,G.(2006).
Theseaurchinkinome:afirstlook.Dev Biol, 300(1),180-193.
Brazma,A.,Parkinson,H.,Sarkans,U.,Shojatalab,M.,Vilo,J.,Abeygunawardena,N.,...Sansone,S.A.
(2003).ArrayExpress--apublicrepositoryformicroarraygeneexpressiondataattheEBI.Nucleic
Acids Res, 31(1),68-71.
Caenepeel,S.,Charydczak,G.,Sudarsanam,S.,Hunter,T.,&Manning,G.(2004).Themousekinome:
discoveryandcomparativegenomicsofallmouseproteinkinases.Proc Natl Acad Sci U S A,
101(32),11707-11712.
Campbell,J.A.,Davies,G.J.,Bulone,V.,&Henrissat,B.(1997).Aclassificationofnucleotide-diphosphosugarglycosyltransferasesbasedonaminoacidsequencesimilarities.The Biochemical journal, 326 (
Pt 3),929-939.
Cantarel,B.L.,Coutinho,P.M.,Rancurel,C.,Bernard,T.,Lombard,V.,&Henrissat,B.(2009).The
Carbohydrate-ActiveEnZymesdatabase(CAZy):anexpertresourceforGlycogenomics.Nucleic
acids research, 37(Databaseissue),D233-238.
Chang,D.,&Duda,T.F.,Jr.(2012).Extensiveandcontinuousduplicationfacilitatesrapidevolutionand
diversificationofgenefamilies.Molecular biology and evolution, 29(8),2019-2029.
Chatonnet,A.,Cousin,X.,&Robinson,A.(2001).Linksbetweenkineticdataandsequencesinthe
alpha/beta-hydrolasesfolddatabase.Briefings in bioinformatics, 2(1),30-37.
Chibucos,M.C.,Mungall,C.J.,Balakrishnan,R.,Christie,K.R.,Huntley,R.P.,White,O.,...Giglio,M.
(2014).StandardizeddescriptionofscientificevidenceusingtheEvidenceOntology(ECO).
Database (Oxford), 2014.
Christopoulos,A.,Changeux,J.P.,Catterall,W.A.,Fabbro,D.,Burris,T.P.,Cidlowski,J.A.,...
Langmead,C.J.(2014).Internationalunionofbasicandclinicalpharmacology.XC.multisite
pharmacology:recommendationsforthenomenclatureofreceptorallosterismandallostericligands.
Pharmacological reviews, 66(4),918-947.
Cousin,X.,Hotelier,T.,Lievin,P.,Toutant,J.P.,&Chatonnet,A.(1996).Acholinesterasegenesserver
(ESTHER):adatabaseofcholinesterase-relatedsequencesformultiplealignments,phylogenetic
relationships,mutationsandstructuraldataretrieval.Nucleic acids research, 24(1),132-136.
Craik,D.J.(2006).Chemistry.Seamlessproteinstieuptheirlooseends.science, 311(5767),1563-1564.
Cuff,A.L.,Sillitoe,I.,Lewis,T.,Clegg,A.B.,Rentzsch,R.,Furnham,N.,...Orengo,C.A.(2011).
ExtendingCATH:increasingcoverageoftheproteinstructureuniverseandlinkingstructurewith
function.Nucleic acids research, 39(Databaseissue),D420-426.

109

Davis,J.,Jones,A.,&Lewis,R.J.(2009).Remarkableinter-andintra-speciescomplexityofconotoxins
revealedbyLC/MS.Peptides, 30(7),1222-1227.
Demeter,J.,Beauheim,C.,Gollub,J.,Hernandez-Boussard,T.,Jin,H.,Maier,D.,...Ball,C.A.(2007).The
StanfordMicroarrayDatabase:implementationofnewanalysistoolsandopensourcereleaseof
software.Nucleic Acids Res, 35(Databaseissue),D766-770.
Deshmukh,K.,Anamika,K.,&Srinivasan,N.(2010).Evolutionofdomaincombinationsinproteinkinases
anditsimplicationsforfunctionaldiversity.Progress in biophysics and molecular biology, 102(1),115.
Dreos,R.,Ambrosini,G.,CavinPerier,R.,&Bucher,P.(2013).EPDandEPDnew,high-qualitypromoter
resourcesinthenext-generationsequencingera.Nucleic Acids Res, 41(Databaseissue),D157-164.
Duda,T.F.,Jr.,Chang,D.,Lewis,B.D.,&Lee,T.(2009).Geographicvariationinvenomallelic
compositionanddietsofthewidespreadpredatorymarinegastropodConusebraeus.PloS one, 4(7),
e6245.
Duda,T.F.,Jr.,&Lee,T.(2009).Ecologicalreleaseandvenomevolutionofapredatorymarinesnailat
EasterIsland.PloS one, 4(5),e5558.
Eddy,S.R.(2009).Anewgenerationofhomologysearchtoolsbasedonprobabilisticinference.Genome
informatics. International Conference on Genome Informatics, 23(1),205-211.
Eisen,J.A.,Coyne,R.S.,Wu,M.,Wu,D.,Thiagarajan,M.,Wortman,J.R.,...Orias,E.(2006).
MacronucleargenomesequenceoftheciliateTetrahymenathermophila,amodeleukaryote.PLoS
Biol, 4(9),e286.
Fernndez-Surez,X.M.,Rigden,D.J.,&Galperin,M.Y.(2014).The2014nucleicacidsresearchdatabase
issueandanupdatedNARonlinemolecularbiologydatabasecollection.Nucleic acids research,
42(D1),D1-D6.
Finn,R.D.,Bateman,A.,Clements,J.,Coggill,P.,Eberhardt,R.Y.,Eddy,S.R.,...Punta,M.(2014).Pfam:
theproteinfamiliesdatabase.Nucleic acids research, 42(Databaseissue),D222-D230.
Fleischmann,A.,Darsow,M.,Degtyarenko,K.,Fleischmann,W.,Boyce,S.,Axelsen,K.B.,...Apweiler,R.
(2004).IntEnz,theintegratedrelationalenzymedatabase.Nucleic acids research, 32(Databaseissue),
D434-437.
Gandhimathi,A.,Nair,A.G.,&Sowdhamini,R.(2012).PASS2version4:anupdatetothedatabaseof
structure-basedsequencealignmentsofstructuraldomainsuperfamilies.Nucleic acids research,
40(Databaseissue),D531-534.
Garland,S.L.(2013).AreGPCRsStillaSourceofNewTargets?Journal of Biomolecular Screening, 18(9),
947-966.
Gaulton,A.,Bellis,L.J.,Bento,A.P.,Chambers,J.,Davies,M.,Hersey,A.,...Overington,J.P.(2012).
ChEMBL:alarge-scalebioactivitydatabasefordrugdiscovery.Nucleic acids research, 40(Database
issue),D1100-1107.
Gerlt,J.A.,&Babbitt,P.C.(2001).Divergentevolutionofenzymaticfunction:mechanisticallydiverse
superfamiliesandfunctionallydistinctsuprafamilies.Annual review of biochemistry, 70,209-246.
Gerlt,J.A.,Babbitt,P.C.,&Rayment,I.(2005).Divergentevolutionintheenolasesuperfamily:the
interplayofmechanismandspecificity.Archives of biochemistry and biophysics, 433(1),59-70.
Gnanavel,M.,Mehrotra,P.,Rakshambikai,R.,Martin,J.,Srinivasan,N.,&Bhaskara,R.M.(2014).CLAP:a
web-serverforautomaticclassificationofproteinswithspecialreferencetomulti-domainproteins.
BMC bioinformatics, 15,343.
Thedictyosteliumkinome--analysisoftheproteinkinasesfromasimplemodelorganism,3,2Cong.Rec.
e38(2006).

110

Gowri,V.S.,Krishnadev,O.,Swamy,C.S.,&Srinivasan,N.(2006).MulPSSM:adatabaseofmultiple
position-specificscoringmatricesofproteindomainfamilies.Nucleic acids research, 34(Database
issue),D243-246.
Griffiths-Jones,S.,Grocock,R.J.,vanDongen,S.,Bateman,A.,&Enright,A.J.(2006).miRBase:
microRNAsequences,targetsandgenenomenclature.Nucleic Acids Res, 34(Databaseissue),D140144.
Grosjean,J.,Soualmia,L.,Bouarech,K.,Jonquet,C.,&Darmoni,S.(2014).An Approach to Compare BioOntologies Portals.PaperpresentedattheMIE'2014:26thInternationalConferenceoftheEuropean
FederationforMedicalInformatics.
Hanks,S.K.,&Hunter,T.(1995).Proteinkinases6.Theeukaryoticproteinkinasesuperfamily:kinase
(catalytic)domainstructureandclassification.FASEB journal : official publication of the Federation
of American Societies for Experimental Biology, 9(8),576-596.
HapMapConsortium.(2003).TheInternationalHapMapProject.Nature, 426(6968),789-796.
Harmar,A.J.,Hills,R.A.,Rosser,E.M.,Jones,M.,Buneman,O.P.,Dunbar,D.R.,...Spedding,M.
(2009).IUPHAR-DB:theIUPHARdatabaseofGprotein-coupledreceptorsandionchannels.
Nucleic acids research, 37(Databaseissue),D680-685.
Henrissat,B.(1991).Aclassificationofglycosylhydrolasesbasedonaminoacidsequencesimilarities.The
Biochemical journal, 280 ( Pt 2),309-316.
Holliday,G.L.,Almonacid,D.E.,Bartlett,G.J.,O'Boyle,N.M.,Torrance,J.W.,Murray-Rust,P.,...
Thornton,J.M.(2007).MACiE(Mechanism,AnnotationandClassificationinEnzymes):noveltools
forsearchingcatalyticmechanisms.Nucleic acids research, 35(Databaseissue),D515-520.
Holliday,G.L.,Andreini,C.,Fischer,J.D.,Rahman,S.A.,Almonacid,D.E.,Williams,S.T.,&Pearson,W.
R.(2012).MACiE:exploringthediversityofbiochemicalreactions.Nucleic acids research,
40(Databaseissue),D783-789.
Holliday,G.L.,Bairoch,A.,Bagos,P.G.,Chatonnet,A.,Craik,D.J.,Flinn,R.D.,...Bateman,A.(2015).
Keychallengesforthecreationandmaintenanceofspecialistproteinresources.Proteins.
Horn,F.,Bettler,E.,Oliveira,L.,Campagne,F.,Cohen,F.E.,&Vriend,G.(2003).GPCRDBinformation
systemforGprotein-coupledreceptors.Nucleic Acids Res, 31(1),294-297.
Horn,F.,Weare,J.,Beukers,M.W.,Horsch,S.,Bairoch,A.,Chen,W.,...Vriend,G.(1998).GPCRDB:an
informationsystemforGprotein-coupledreceptors.Nucleic Acids Res, 26(1),275-279.
Hsu,S.D.,Lin,F.M.,Wu,W.Y.,Liang,C.,Huang,W.C.,Chan,W.L.,...Huang,H.D.(2011).
miRTarBase:adatabasecuratesexperimentallyvalidatedmicroRNA-targetinteractions.Nucleic
Acids Res, 39(Databaseissue),D163-169.
Hunter,T.,&Plowman,G.D.(1997).Theproteinkinasesofbuddingyeast:sixscoreandmore.Trends
Biochem Sci, 22(1),18-22.
Isberg,V.,Vroling,B.,vanderKant,R.,Li,K.,Vriend,G.,&Gloriam,D.(2014).GPCRDB:aninformation
systemforGprotein-coupledreceptors.Nucleic Acids Res, 42(1),D422-425.
Jamison,D.C.(2003).StructuredQueryLanguage(SQL)fundamentals.Curr Protoc Bioinformatics, Chapter
9,Unit92.
Joosten,R.P.,Long,F.,Murshudov,G.N.,&Perrakis,A.(2014).ThePDB_REDOserverfor
macromolecularstructuremodeloptimization.IUCrJ, 1(4),0-0.
Kaas,Q.,Westermann,J.C.,&Craik,D.J.(2010).Conopeptidecharacterizationandclassifications:an
analysisusingConoServer.Toxicon : official journal of the International Society on Toxinology,
55(8),1491-1509.
Kaas,Q.,Yu,R.,Jin,A.H.,Dutertre,S.,&Craik,D.J.(2012).ConoServer:updatedcontent,knowledge,and
discoverytoolsintheconopeptidedatabase.Nucleic acids research, 40(Databaseissue),D325-330.

111

Kanehisa,M.,Goto,S.,Sato,Y.,Kawashima,M.,Furumichi,M.,&Tanabe,M.(2014).Data,information,
knowledgeandprinciple:backtometabolisminKEGG.Nucleic acids research, 42(D1),D199-D205.
Karp,P.D.,Riley,M.,Saier,M.,Paulsen,I.T.,Collado-Vides,J.,Paley,S.M.,...Gama-Castro,S.(2002).
TheEcoCycDatabase.Nucleic Acids Res, 30(1),56-58.
Katritch,V.,Cherezov,V.,&Stevens,R.C.(2013).Structure-functionoftheGprotein-coupledreceptor
superfamily.Annu Rev Pharmacol Toxicol, 53,531-556.
Kedarisetti,P.,Mizianty,M.J.,Kaas,Q.,Craik,D.J.,&Kurgan,L.(2014).Predictionandcharacterizationof
cyclicproteinsfromsequencesinthreedomainsoflife.Biochimica et biophysica acta, 1844(1PtB),
181-190.
Kirby,A.J.(2001).Thelysozymemechanismsortedafter50years.Nature Structural Biology, 8,737-739.
Knudsen,M.,&Wiuf,C.(2010).TheCATHdatabase.Hum Genomics, 4(3),207-212.
Kouranov,A.,Xie,L.,delaCruz,J.,Chen,L.,Westbrook,J.,Bourne,P.E.,&Berman,H.M.(2006).The
RCSBPDBinformationportalforstructuralgenomics.Nucleic Acids Res, 34(Databaseissue),D302305.
Kozma,D.,Simon,I.,&Tusnady,G.E.(2013).PDBTM:ProteinDataBankoftransmembraneproteinsafter
8years.Nucleic Acids Res, 41(Databaseissue),D524-529.
Krupa,A.,Abhinandan,K.,&Srinivasan,N.(2004).KinG:adatabaseofproteinkinasesingenomes.Nucleic
acids research, 32(suppl1),D153-D155.
Krupp,M.,Marquardt,J.U.,Sahin,U.,Galle,P.R.,Castle,J.,&Teufel,A.(2012).RNA-SeqAtlas--a
referencedatabaseforgeneexpressionprofilinginnormaltissuebynext-generationsequencing.
Bioinformatics, 28(8),1184-1185.
Lagerstrom,M.C.,&Schioth,H.B.(2008).StructuraldiversityofGprotein-coupledreceptorsand
significancefordrugdiscovery.Nat Rev Drug Discov, 7(4),339-357.
Lane,L.,Argoud-Puy,G.,Britan,A.,Cusin,I.,Duek,P.D.,Evalet,O.,...Bairoch,A.(2012).neXtProt:a
knowledgeplatformforhumanproteins.Nucleic acids research, 40(Databaseissue),D76-83.
Lenfant,N.,Hotelier,T.,Bourne,Y.,Marchot,P.,&Chatonnet,A.(2013).Proteinswithanalpha/beta
hydrolasefold:Relationshipsbetweensubfamiliesinanever-growingsuperfamily.Chemicobiological interactions, 203(1),266-268.
Lenfant,N.,Hotelier,T.,Bourne,Y.,Marchot,P.,&Chatonnet,A.(2014).Trackingtheoriginand
divergenceofcholinesterasesandneuroligins:theevolutionofsynapticproteins.Journal of
molecular neuroscience : MN, 53(3),362-369.
Lenfant,N.,Hotelier,T.,Velluet,E.,Bourne,Y.,Marchot,P.,&Chatonnet,A.(2013).ESTHER,the
databaseofthealpha/beta-hydrolasefoldsuperfamilyofproteins:toolstoexplorediversityof
functions.Nucleic acids research, 41(Databaseissue),D423-429.
Lombard,V.,Ramulu,H.G.,Drula,E.,Coutinho,P.M.,&Henrissat,B.(2014).Thecarbohydrate-active
enzymesdatabase(CAZy)in2013.Nucleic acids research, 42(D1),D490-D495.
Lu,C.T.,Huang,K.Y.,Su,M.G.,Lee,T.Y.,Bretana,N.A.,Chang,W.C.,...Huang,H.D.(2013).
DbPTM3.0:aninformativeresourceforinvestigatingsubstratesitespecificityandfunctional
associationofproteinpost-translationalmodifications.Nucleic Acids Res, 41(Databaseissue),D295305.
Manning,G.(2005).Genomicoverviewofproteinkinases.WormBook,1-19.
---Evolutionofproteinkinasesignalingfromyeasttoman,10,27Cong.Rec.514-520(2002).
---Theproteinkinasecomplementofthehumangenome,5600,298Cong.Rec.1912-1934(2002).
Marchot,P.,&Chatonnet,A.(2012).Enzymaticactivityandproteininteractionsinalpha/betahydrolasefold
proteins:moonlightingversuspromiscuity.Protein and peptide letters, 19(2),132-143.

112

Marino-Ramirez,L.,Levine,K.M.,Morales,M.,Zhang,S.,Moreland,R.T.,Baxevanis,A.D.,&Landsman,
D.(2011).TheHistoneDatabase:anintegratedresourceforhistonesandhistonefold-containing
proteins.Database (Oxford), 2011,bar048.
Martin,A.C.,Orengo,C.A.,Hutchinson,E.G.,Jones,S.,Karmirantzou,M.,Laskowski,R.A.,...
Thornton,J.M.(1998).Proteinfoldsandfunctions.Structure, 6(7),875-884.
Martin,J.,Anamika,K.,&Srinivasan,N.(2010).Classificationofproteinkinasesonthebasisofbothkinase
andnon-kinaseregions.PloS one, 5(9),e12460.
McDonald,A.G.,Boyce,S.,&Tipton,K.F.(2009).ExplorEnz:theprimarysourceoftheIUBMBenzyme
list.Nucleic acids research, 37(Databaseissue),D593-597.
Moszer,I.,Jones,L.M.,Moreira,S.,Fabry,C.,&Danchin,A.(2002).SubtiList:thereferencedatabasefor
theBacillussubtilisgenome.Nucleic Acids Res, 30(1),62-65.
Murzin,A.G.,Brenner,S.E.,Hubbard,T.,&Chothia,C.(1995).SCOP:astructuralclassificationofproteins
databasefortheinvestigationofsequencesandstructures.Journal of molecular biology, 247(4),536540.
Nagano,N.(2005).EzCatDB:theEnzymeCatalytic-mechanismDatabase.Nucleic acids research,
33(Databaseissue),D407-412.
Nagano,N.,Nakayama,N.,Ikeda,K.,Fukuie,M.,Yokota,K.,Doi,T.,...Tomii,K.(2014).EzCatDB:the
enzymereactiondatabase,2015update.Nucleic acids research.
Nagy,A.,Hegyi,H.,Farkas,K.,Tordai,H.,Kozma,E.,Banyai,L.,&Patthy,L.(2008).Identificationand
correctionofabnormal,incompleteandmispredictedproteinsinpublicdatabases.BMC
bioinformatics, 9,353.
Overington,J.P.,Al-Lazikani,B.,&Hopkins,A.L.(2006).Howmanydrugtargetsarethere?Nat Rev Drug
Discov, 5(12),993-996.
Pawson,A.J.,Sharman,J.L.,Benson,H.E.,Faccenda,E.,Alexander,S.P.,Buneman,O.P.,...Harmar,A.
J.(2014).TheIUPHAR/BPSGuidetoPHARMACOLOGY:anexpert-drivenknowledgebaseofdrug
targetsandtheirligands.Nucleic acids research, 42(Databaseissue),D1098-1106.
Pegg,S.C.,Brown,S.D.,Ojha,S.,Seffernick,J.,Meng,E.C.,Morris,J.H.,...Babbitt,P.C.(2006).
Leveragingenzymestructure-functionrelationshipsforfunctionalinferenceandexperimentaldesign:
thestructure-functionlinkagedatabase.Biochemistry, 45(8),2545-2555.
Pettersen,E.F.,Goddard,T.D.,Huang,C.C.,Couch,G.S.,Greenblatt,D.M.,Meng,E.C.,&Ferrin,T.E.
(2004).UCSFChimera--avisualizationsystemforexploratoryresearchandanalysis.Journal of
computational chemistry, 25(13),1605-1612.
Pettifer,S.,Ison,J.,Kalas,M.,Thorne,D.,McDermott,P.,Jonassen,I.,...Vriend,G.(2010).The
EMBRACEwebservicecollection.Nucleic Acids Res, 38(WebServerissue),W683-688.
Plowman,G.D.,Sudarsanam,S.,Bingham,J.,Whyte,D.,&Hunter,T.(1999).Theproteinkinasesof
Caenorhabditiselegans:amodelforsignaltransductioninmulticellularorganisms.Proc Natl Acad
Sci U S A, 96(24),13603-13610.
Poth,A.G.,Chan,L.Y.,&Craik,D.J.(2013).Cyclotidesasgraftingframeworksforproteinengineeringand
drugdesignapplications.Biopolymers, 100(5),480-491.
Puillandre,N.,Koua,D.,Favreau,P.,Olivera,B.M.,&Stocklin,R.(2012).Molecularphylogeny,
classificationandevolutionofconopeptides.Journal of molecular evolution, 74(5-6),297-309.
Rakshambikai,R.,Gnanavel,M.,&Srinivasan,N.(2014).Hybridandroguekinasesencodedinthegenomes
ofmodeleukaryotes.PloS one, 9(9),e107956.
Rawlings,N.D.,Waller,M.,Barrett,A.J.,&Bateman,A.(2014).MEROPS:thedatabaseofproteolytic
enzymes,theirsubstratesandinhibitors.Nucleic acids research, 42(Databaseissue),D503-D509.

113

Reddy,T.B.,Thomas,A.D.,Stamatis,D.,Bertsch,J.,Isbandi,M.,Jansson,J.,...Kyrpides,N.C.(2015).
TheGenomesOnLineDatabase(GOLD)v.5:ametadatamanagementsystembasedonafourlevel
(meta)genomeprojectclassification.Nucleic Acids Res, 43(Databaseissue),D1099-1106.
Rhodes,D.R.,Yu,J.,Shanker,K.,Deshpande,N.,Varambally,R.,Ghosh,D.,...Chinnaiyan,A.M.(2004).
ONCOMINE:acancermicroarraydatabaseandintegrateddata-miningplatform.Neoplasia, 6(1),16.
Rose,P.W.,Bi,C.,Bluhm,W.F.,Christie,C.H.,Dimitropoulos,D.,Dutta,S.,...Bourne,P.E.(2013).The
RCSBProteinDataBank:newresourcesforresearchandeducation.Nucleic acids research,
41(Databaseissue),D475-482.
Rose,P.W.,Prlic,A.,Bi,C.,Bluhm,W.F.,Christie,C.H.,Dutta,S.,...Burley,S.K.(2014).TheRCSB
ProteinDataBank:viewsofstructuralbiologyforbasicandappliedresearchandeducation.Nucleic
acids research.
Saier,M.H.,Jr.(2000).Afunctional-phylogeneticclassificationsystemfortransmembranesolute
transporters.Microbiol Mol Biol Rev, 64(2),354-411.
Saier,M.H.,Reddy,V.S.,Tamang,D.G.,&Vstermark,.(2014).TheTransporterClassification
Database..Nucleic acids research, 42(Databaseissue),D251-D258.
Scherf,M.,Epple,A.,&Werner,T.(2005).Thenextgenerationofliteratureanalysis:integrationofgenomic
analysisintotextmining.Brief Bioinform, 6(3),287-297.
Schnoes,A.M.,Brown,S.D.,Dodevski,I.,&Babbitt,P.C.(2009).AnnotationErrorinPublicDatabases:
MisannotationofMolecularFunctioninEnzymeSuperfamilies.PLoS computational biology, 5(12),
e1000605.
Schully,S.D.,Yu,W.,McCallum,V.,Benedicto,C.B.,Dong,L.M.,Wulf,A.,...Khoury,M.J.(2011).
CancerGAMAdb:databaseofcancergeneticassociationsfrommeta-analysesandgenome-wide
associationstudies.Eur J Hum Genet, 19(8),928-930.
Sethupathy,P.,Corda,B.,&Hatzigeorgiou,A.G.(2006).TarBase:Acomprehensivedatabaseof
experimentallysupportedanimalmicroRNAtargets.RNA, 12(2),192-197.
Shepelev,V.,&Fedorov,A.(2006).AdvancesintheExon-IntronDatabase(EID).Brief Bioinform, 7(2),178-185.
Sherry,S.T.,Ward,M.H.,Kholodov,M.,Baker,J.,Phan,L.,Smigielski,E.M.,&Sirotkin,K.(2001).
dbSNP:theNCBIdatabaseofgeneticvariation.Nucleic Acids Res, 29(1),308-311.
Sigrist,C.J.,Cerutti,L.,deCastro,E.,Langendijk-Genevaux,P.S.,Bulliard,V.,Bairoch,A.,&Hulo,N.
(2010).PROSITE,aproteindomaindatabaseforfunctionalcharacterizationandannotation.Nucleic
Acids Res, 38(Databaseissue),D161-166.
Smith,B.,Ashburner,M.,Rosse,C.,Bard,J.,Bug,W.,Ceusters,W.,...Lewis,S.(2007).TheOBO
Foundry:coordinatedevolutionofontologiestosupportbiomedicaldataintegration.Nature
biotechnology, 25(11).
Sowdhamini,R.,Burke,D.F.,Huang,J.F.,Mizuguchi,K.,Nagarajaram,H.A.,Srinivasan,N.,...Blundell,
T.L.(1998).CAMPASS:adatabaseofstructurallyalignedproteinsuperfamilies.Structure, 6(9),
1087-1094.
Spedding,M.(2011).Resolutionofcontroversiesindrug/receptorinteractionsbyproteinstructure.
Limitationsandpharmacologicalsolutions.Neuropharmacology, 60(1),3-6.
Srivastava,M.,Simakov,O.,Chapman,J.,Fahey,B.,Gauthier,M.E.,Mitros,T.,...Rokhsar,D.S.(2010).
TheAmphimedonqueenslandicagenomeandtheevolutionofanimalcomplexity.Nature, 466(7307),
720-726.
Stajich,J.E.,Wilke,S.K.,Ahren,D.,Au,C.H.,Birren,B.W.,Borodovsky,M.,...Pukkila,P.J.(2010).
Insightsintoevolutionofmulticellularfungifromtheassembledchromosomesofthemushroom
Coprinopsiscinerea(Coprinuscinereus).Proc Natl Acad Sci U S A, 107(26),11889-11894.


Stark,C.,Breitkreutz,B.J.,Reguly,T.,Boucher,L.,Breitkreutz,A.,&Tyers,M.(2006).BioGRID:ageneral
repositoryforinteractiondatasets.Nucleic Acids Res, 34(Databaseissue),D535-539.
Stein,L.(2013).Creatingdatabasesforbiologicalinformation:anintroduction.Curr Protoc Bioinformatics,
Chapter 9,Unit91.
Sun,H.,Palaniswamy,S.K.,Pohar,T.T.,Jin,V.X.,Huang,T.H.,&Davuluri,R.V.(2006).MPromDb:an
integratedresourceforannotationandvisualizationofmammaliangenepromotersandChIP-chip
experimentaldata.Nucleic Acids Res, 34(Databaseissue),D98-103.
Tan,N.C.,&Berkovic,S.F.(2010).TheEpilepsyGeneticAssociationDatabase(epiGAD):analysisof165
geneticassociationstudies,1996-2008.Epilepsia, 51(4),686-689.
Terlau,H.,&Olivera,B.M.(2004).Conusvenoms:arichsourceofnovelionchannel-targetedpeptides.
Physiological reviews, 84(1),41-68.
Theodoropoulou,M.C.,Bagos,P.G.,Spyropoulos,I.C.,&Hamodrakas,S.J.(2008).gpDB:adatabaseof
GPCRs,G-proteins,effectorsandtheirinteractions.Bioinformatics, 24(12),1471-1472.
Tipton,K.F.(1994).NomenclatureCommitteeoftheInternationalUnionofBiochemistryandMolecular
Biology(NC-IUBMB).Enzymenomenclature.Recommendations1992.Supplement:correctionsand
additions.European journal of biochemistry / FEBS, 223(1),1-5.
Tough,D.F.,Lewis,H.D.,Rioja,I.,Lindon,M.J.,&Prinjha,R.K.(2014).Epigeneticpathwaytargetsfor
thetreatmentofdisease:acceleratingprogressinthedevelopmentofpharmacologicaltools:IUPHAR
Review11.British journal of pharmacology, 171(22),4981-5010.
Trabi,M.,&Craik,D.J.(2002).Circularproteins--noendinsight.Trends Biochem Sci, 27(3),132-138.
Tsaousis,G.N.,Tsirigos,K.D.,Andrianou,X.D.,Liakopoulos,T.D.,Bagos,P.G.,&Hamodrakas,S.J.
(2010).ExTopoDB:adatabaseofexperimentallyderivedtopologicalmodelsoftransmembrane
proteins.Bioinformatics, 26(19),2490-2492.
Tsirigos,K.D.,Bagos,P.G.,&Hamodrakas,S.J.(2011).OMPdb:adatabaseof{beta}-barrelouter
membraneproteinsfromGram-negativebacteria.Nucleic acids research, 39(Databaseissue),D324-331.
Umemura,M.,Nagano,N.,Koike,H.,Kawano,J.,Ishii,T.,Miyamura,Y.,...Machida,M.(2014).
Characterizationofthebiosyntheticgeneclusterfortheribosomallysynthesizedcyclicpeptide
ustiloxinBinAspergillusflavus.Fungal Genet Biol, 68,23-30.
UniProt.(2014).ActivitiesattheUniversalProteinResource(UniProt).Nucleic acids research, 42(Database
issue),D191-198.
Vroling,B.,Sanders,M.,Baakman,C.,Borrmann,A.,Verhoeven,S.,Klomp,J.,...Vriend,G.(2011).
GPCRDB:informationsystemforGprotein-coupledreceptors.Nucleic acids research, 39(Database
issue),D309-D319.
Wang,C.K.,Kaas,Q.,Chiche,L.,&Craik,D.J.(2008).CyBase:adatabaseofcyclicproteinsequencesand
structures,withapplicationsinproteindiscoveryandengineering.Nucleic acids research, 36(suppl
1),D206-D210.
Wong,W.C.,Maurer-Stroh,S.,&Eisenhaber,F.(2010).Morethan1,001problemswithproteindomain
databases:transmembraneregions,signalpeptidesandtheissueofsequencehomology.PLoS Comput
Biol, 6(7),e1000867.
Wren,J.D.(2008).URLdecayinMEDLINEa4-yearfollow-upstudy.Bioinformatics, 24(11),1381-1385.
Xenarios,I.,Salwinski,L.,Duan,X.J.,Higney,P.,Kim,S.M.,&Eisenberg,D.(2002).DIP,theDatabaseof
InteractingProteins:aresearchtoolforstudyingcellularnetworksofproteininteractions.Nucleic
Acids Res, 30(1),303-305.
Xia,J.,Wang,Q.,Jia,P.,Wang,B.,Pao,W.,&Zhao,Z.(2012).NGScatalog:Adatabaseofnextgeneration
sequencingstudiesinhumans.Hum Mutat, 33(6),E2341-2355.


3:

,
.

. , ,
,
. ,
(FASTA, BLAST), .

,
..

3.
,

, , ,
.
(,),(),
.
, ,
. (
); ;
;,;
, ,
..99%,
(
).80%
.
;
30% 80 ,
, ,
.
,
,
. ,
,
,,
,
.
,
,.

3.1.

(DNA,RNA)
,,-
DNA -, n
.4(A,T,G,C):


$$p_A,\ p_T,\ p_G,\ p_C, \qquad p_k \ge 0, \qquad \sum_{k \in \{A,T,G,C\}} p_k = 1.$$

, , 20 20
. DNA, x=x1, x2,...,xn xi { A, T , G , C}
:
$$P(x) = \prod_{i=1}^{n} p_{x_i}$$

(
4n4n)1:

$$\sum_{j} P(x_j) = 1.$$

x n DNA. 4
,:
$$P(n_A, n_T, n_G, n_C) = \frac{n!}{n_A!\,n_T!\,n_G!\,n_C!}\; p_A^{n_A}\, p_T^{n_T}\, p_G^{n_G}\, p_C^{n_C} \tag{3.1}$$

,
,:
$$P(X = x) = \binom{n}{x}\, p_A^{x}\,(1 - p_A)^{n-x} \tag{3.2}$$
3(T,G,C).DNA
Bernoullip=pAq=1-pA.
3 . DNA (
)
.()
4 ( 20 ),
.,
(0)
--
,,,(Durbin, Eddy, Krogh, & Mitchison, 1998).
, (
,
)
,:
$$p_A = p_T = p_G = p_C = \frac{1}{4}$$

(Information Theory).
.D,
,Shannon:

$$H(x) = -\sum_{i} P(x_i) \log P(x_i) \tag{3.3}$$

,pA=pG=pT=pC=1/4
$H(x) = -\sum_{i=1}^{4} \frac{1}{4}\log\frac{1}{4} = \log 4$.
2,bit.:
$$I(x) = H_{max} - H_{obs} \tag{3.4}$$

2bits,
0.
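Equations (3.3) and (3.4) can be checked numerically. The following is a minimal sketch in Python; the function names `shannon_entropy` and `information_content` are illustrative and not part of the original text. It estimates the residue frequencies from the observed counts in a sequence and uses base-2 logarithms, so the result is in bits.

```python
import math
from collections import Counter

def shannon_entropy(seq, base=2):
    """H = -sum_i p_i log p_i, with p_i estimated from residue counts."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log(c / n, base) for c in counts.values())

def information_content(seq, alphabet_size=4):
    """I = H_max - H_obs, where H_max = log2(alphabet_size)."""
    return math.log2(alphabet_size) - shannon_entropy(seq)

print(shannon_entropy("ACGT"))       # uniform composition: maximal entropy
print(information_content("AAAA"))   # single-letter sequence: maximal information
```

For a DNA sequence with uniform composition the entropy reaches its maximum of 2 bits and the information content is 0; a sequence made of a single letter gives the opposite extreme.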


,
(Wootton&Federhen,1993).,
""(compositionalcomplexity),,
k,:

$$K = \frac{1}{k} \log_N \left( \frac{k!}{\prod_{s=1}^{N} n_s!} \right) \tag{3.5}$$

,ns s
(4,20).,

(),.,4
,
$$K = \frac{1}{4}\log_4 \frac{4!}{4!\,0!\,0!\,0!} = \frac{1}{4}\log_4 1 = 0.$$
,,
. ,
ATGC
$$K = \frac{1}{4}\log_4 \frac{4!}{1!\,1!\,1!\,1!} = \frac{1}{4}\log_4 24 = 0.573$$
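The two worked values of the compositional complexity (3.5) can be reproduced with a short script. This is a sketch under the assumption that K is computed over a single window of length k with an alphabet of N letters; the function name `complexity` is hypothetical. Log-factorials are evaluated with `math.lgamma` to avoid overflow on long windows.

```python
import math
from collections import Counter

def complexity(window, alphabet_size=4):
    """K = (1/k) * log_N( k! / prod_s n_s! ) for one window of length k."""
    k = len(window)
    counts = Counter(window)
    # ln(k!) - sum_s ln(n_s!), using lgamma(n + 1) = ln(n!)
    log_omega = math.lgamma(k + 1) - sum(math.lgamma(c + 1) for c in counts.values())
    return log_omega / (k * math.log(alphabet_size))

print(round(complexity("AAAA"), 3))
print(round(complexity("ATGC"), 3))
```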
WoottonFederhen,
:

$$H_k = -\sum_{s} \frac{n_s}{k} \log_2 \frac{n_s}{k} \tag{3.6}$$

,,
(,),
, Boltzman (
Shannon).
, ,
,.
WoottonFederhen,
.
,
(. ),
DNA.,
,
.
,(RelativeEntropy).
P, Q ( Kullback-Leibler)
,,:
$$H(P, Q) = \sum_{i} P(x_i) \log \frac{P(x_i)}{Q(x_i)} \tag{3.7}$$

P(xi)(A,T,G,C)i
, Q(xi)
.
,,
.Q(xi)=1/4()(P,Q)=I(P)

(MutualInformation)..,
:


$$M(X, Y) = \sum_{i,j} P(x_i, y_j) \log \frac{P(x_i, y_j)}{P(x_i)\,P(y_j)} \tag{3.8}$$

, , x y.
..P(xi,yi),
P(xi,yi)=P(xi)P(yi).P(xi)P(yi)
..,.,
. ,

3.2.

- Erdos Renyi

(longest run of heads).


--
n ( Bernoulli).
, ,
.

A G G C G A T A A A A A A A A A A A A A A A A C G G A T G C A T C G
3.1: 16DNA

log(n),
Erdos Renyi (Erdos & Renyi, 1970). n
Bernoullip,0p1,Rn
,log1/p(n):

$$P\left( \lim_{n \to \infty} \frac{R_n}{\log_{1/p} n} = 1 \right) = 1. \tag{3.9}$$

(Waterman, 1995): p k
pk.n(n+)n

$$E(x) = n p^{k}$$
,,,Rn,1=npk,:
$$R_n = \log_{1/p} n$$

,,
.

3.2.1
n=10000DNA,(p=1/4),

.,
:
$$R_n = \log_{1/p} n \;\Rightarrow\; R_n = \log_4(10000) \;\Rightarrow\; R_n = \frac{\log_{10} 10000}{\log_{10} 4} = \frac{4}{0.60205} = 6.64$$
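The estimate of Example 3.2.1 can be compared with a small simulation. The sketch below is illustrative only: the helper names, the choice of 200 replicates and the fixed random seed are assumptions, not part of the original example. It draws random DNA sequences with equal base frequencies and records the longest run of the letter A.

```python
import math
import random

def expected_longest_run(n, p):
    """Erdos-Renyi estimate R_n = log_{1/p}(n)."""
    return math.log(n) / math.log(1 / p)

def longest_run(seq, symbol):
    """Length of the longest uninterrupted run of `symbol` in `seq`."""
    best = cur = 0
    for s in seq:
        cur = cur + 1 if s == symbol else 0
        best = max(best, cur)
    return best

random.seed(0)
n = 10_000
runs = [longest_run(random.choices("ACGT", k=n), "A") for _ in range(200)]
mean_run = sum(runs) / len(runs)
print(expected_longest_run(n, 0.25), mean_run)
```

The simulated mean should land close to, though typically slightly below, the logarithmic estimate. The simple law (3.9) only gives the leading term.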

n=100050000p=0.1,0.25,0.4(1000)
(3.9).

[Figure 3.2: E(max) plotted against log_{1/p}(n), for p = 0.1, 0.25 and 0.4, with n from 1,000 to 50,000.]

3.3.

Erdos Renyi

.
,
..90% .,
.
,
( ,
1-2 ). ,
,
.

(Large Deviation Theory)
. 3.3 20
16.(
),
.

A G G C G A T A A A A A A A A T A A G A C C A A A A A C G G A T G C A T
3.3: 20 80% .

, p ,
,:
$$H(a, p) = a \log \frac{a}{p} + (1 - a) \log \frac{1 - a}{1 - p} = \log \frac{a^{a} (1 - a)^{1 - a}}{p^{a} (1 - p)^{1 - a}} \tag{3.10}$$

(k,p)
(DNA
p),,,(k,)
-n,


=s/k. , 0p,a1.
,(,p)
. ,
. ,
0pa1Y~B(k,p),P(Yak),
:
$$P(Y \ge ak) \approx e^{-kH(a,p)} \tag{3.11}$$

(3.9).,n
Bernoulli p, 0pa1, Rna
100%,(Erdos&Renyi,1970;Erdos&Revesz,1975):
$$\frac{R_n^{a}}{\log(n) / H(a, p)} \to 1 \tag{3.12}$$
:
(LargeDeviations)(3.11)k100%
, e-kH(a,p).n-k+1n
:
$$1 = n e^{-kH(a,p)} \;\Rightarrow\; R_n^{a} = \frac{\log(n)}{H(a, p)}$$
, De L Hospital, =1(,p)=log(1/p)
(3.9).

3.3.1
10.00.000 DNA, Rna
80%():
$$R_n^{a} = \frac{\log(n)}{H(a, p)} = \frac{\log(1000000)}{0.666} = 20.744$$
(loge)
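The numbers of Example 3.3.1 follow directly from (3.10) and (3.12). The sketch below uses natural logarithms, as the example does; the function names are illustrative assumptions.

```python
import math

def relative_entropy(a, p):
    """H(a, p) = a log(a/p) + (1 - a) log((1 - a)/(1 - p)), for 0 < p <= a <= 1."""
    h = a * math.log(a / p)
    if a < 1:  # the second term vanishes in the limit a -> 1
        h += (1 - a) * math.log((1 - a) / (1 - p))
    return h

def longest_quality_run(n, a, p):
    """Expected length of the longest region with a fraction a of matches."""
    return math.log(n) / relative_entropy(a, p)

print(round(relative_entropy(0.8, 0.25), 3))
print(round(longest_quality_run(1_000_000, 0.8, 0.25), 2))
```

For a = 1 the function returns log(1/p), which recovers the simple law (3.9).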

3.4.

- (EVD)

,
.,
,,,
( ,
..). ( )
(Extreme ValueDistributions).
,1, 2,..., n,(iid)
:
$$M_n = a_n \left[ \max(X_1, X_2, \ldots, X_n) - b_n \right], \qquad n \to \infty$$
n, bn
. (cdf)
n,bn,(Davison,1998):
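1. $F(y) = \exp(-e^{-y}), \quad -\infty < y < \infty$ (Gumbel)
2. $F(y) = \begin{cases} 0, & y \le 0 \\ \exp(-y^{-a}), & y > 0,\ a > 0 \end{cases}$ (Frechet)
3. $F(y) = \begin{cases} \exp(-(-y)^{a}), & y \le 0,\ a > 0 \\ 1, & y > 0 \end{cases}$ (Weibull)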


Gumbel,
(GeneralizedExtremeValueDistributionGEVD):
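$$H(y) = \exp\left\{ -\left[ 1 + k\,\frac{y - a}{b} \right]^{-1/k} \right\}, \qquad -\infty < k < \infty,\ b > 0 \tag{3.13}$$

where $1 + k\,\frac{y - a}{b} > 0$ and $k \ne 0$ (k>0 (Frechet), k<0 (Weibull)). Setting $z = \frac{y - a}{b}$ and $t = \frac{1}{k}$ gives

$$H(y) = \exp\left\{ -\left[ 1 + \frac{z}{t} \right]^{-t} \right\}$$

As $k \to 0$, $t \to \infty$, and using

$$\lim_{n \to \infty} \left( 1 + \frac{a}{n} \right)^{n} = e^{a}$$

we obtain

$$\lim_{t \to \infty} H(y) = \lim_{t \to \infty} \exp\left\{ -\left[ 1 + \frac{z}{t} \right]^{-t} \right\} = \exp\left( -e^{-z} \right) = \exp\left( -e^{-\frac{y - a}{b}} \right)$$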

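$Y_n = \max(X_1, X_2, \ldots, X_n)$ (Gumbel, 1958):

$$F(y) = \exp\left( -e^{-\frac{y - a}{b}} \right), \qquad -\infty < y < \infty \tag{3.14}$$

$$E(Y) = a - b\,\Gamma'(1), \qquad V(Y) = \frac{b^2 \pi^2}{6} \tag{3.15}$$

(Arratia, Gordon, and Waterman, 1986; Arratia, Gordon, and Waterman, 1990; Waterman, 1995), (DNA) $a_n = \frac{\log(qn)}{\lambda}$, $b_n = \frac{1}{\lambda}$, $\lambda = \log\frac{1}{p}$:

$$\lim_{n \to \infty} P\left( R_n - \frac{\log(nq)}{\lambda} \le y \right) = \exp\left( -e^{-\lambda y} \right)$$

$R_n$:

$$F(y) = P(R_n \le y) = \exp\left( -\exp\left( -\lambda \left( y - \frac{\log(nq)}{\lambda} \right) \right) \right) \tag{3.16}$$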

,Gumbel.
,(3.16)(3.14).,:
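$$E(R_n) = \frac{\log(n)}{\lambda} + \frac{\log(q)}{\lambda} + \frac{\gamma}{\lambda} - \frac{1}{2} \;\Leftrightarrow\; E(R_n) = \log_{1/p}(n) + \log_{1/p}(q) + \frac{\gamma}{\lambda} - \frac{1}{2} \tag{3.17}$$

$$\mathrm{var}(R_n) = \frac{\pi^2}{6 \lambda^2} + \frac{1}{12} \tag{3.18}$$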
$\gamma = -\Gamma'(1) = 0.5772$ (Euler-Mascheroni).
1/12Sheppard,
...

[Figure 3.4: Gumbel (EVD) distribution, P(X<x) = exp(-exp(-x)).]

3.4.1
(3.17)(3.18)3.2.1:
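$$E(R_n) = \log_{1/p}(n) + \log_{1/p}(q) + \frac{\gamma}{\lambda} - \frac{1}{2}$$

$$E(R_n) = \log_4(10000) + \log_4\left(\tfrac{3}{4}\right) + \frac{0.5772}{\log(4)} - \frac{1}{2} \approx 6.3527$$

$$\mathrm{var}(R_n) = \frac{\pi^2}{6 \lambda^2} + \frac{1}{12} = 0.939$$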
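The mean and variance in (3.17) and (3.18) are simple to evaluate. The following sketch (the function name `longest_run_moments` is an assumption) plugs in the values of Example 3.2.1, n = 10000 and p = 1/4, using natural logarithms for λ = log(1/p).

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant, -Gamma'(1)

def longest_run_moments(n, p):
    """Mean and variance of the longest run R_n under the Gumbel approximation."""
    lam = math.log(1 / p)
    q = 1 - p
    mean = math.log(n * q) / lam + EULER_GAMMA / lam - 0.5
    var = math.pi ** 2 / (6 * lam ** 2) + 1 / 12
    return mean, var

mean, var = longest_run_moments(10_000, 0.25)
print(round(mean, 2), round(var, 3))
```

The mean comes out at about 6.35, somewhat below the value 6.64 given by the simple law (3.9), because of the log(q) term and the constants.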

,(3.17),
, ,
(3.9),.,
log(n), log(q) /
.,n,.,
RnGumbel(EVD)
(3.17)(3.18).

[Figure 3.5: panels for n = 1000, 5000, 10000, 15000, 20000, 25000, 30000, 35000, 40000, 45000 and 50000; p = 0.1, 0.25 and 0.4, with n from 1,000 to 50,000.]

[Figure 3.6: E(max) plotted against log_{1/p}(qn), for p = 0.1, 0.25 and 0.4, with n from 1000 to 50000.]

3.5.

(Maximal Segment Score)


Rna100%,
(score), . ,
,
.
, ..
80%(Karlin&Altschul,1990;Karlin&Brendel,1992),
.,Rna100%
,:

$$s_k = \log \frac{a_k}{p_k} \tag{3.19}$$
pk (.. p=1/4) k
(targetfrequency)


(,),.p,
.i
(segment score)
100% , (maximal
segmentscore)n(n).
( =0.8, p=0.25): (3.19)
,sA=log(0.8/0.25)=1.163

sN=log(0.2/0.25)=-0.223.2010,
s=16*1.163-4*0.223=17.716.



(k)    (ak)     (pk)     log(ak/pk)
A      0.109    0.071     0.429
C      0.019    0.020    -0.051
D      0.007    0.053    -2.024
E      0.007    0.062    -2.181
F      0.090    0.039     0.836
G      0.082    0.070     0.158
H      0.008    0.023    -1.056
I      0.120    0.046     0.959
K      0.005    0.055    -2.398
L      0.168    0.087     0.658
M      0.040    0.025     0.470
N      0.016    0.048    -1.099
P      0.028    0.055    -0.675
Q      0.009    0.043    -1.564
R      0.005    0.061    -2.501
S      0.053    0.071    -0.292
T      0.050    0.060    -0.182
V      0.115    0.061     0.634
W      0.027    0.017     0.463
Y      0.040    0.034     0.163

Table 3.1: , 160
(Krogh, Larsson, von Heijne, & Sonnhammer, 2001). ak=0
( 0.0001).
( -9 ).
- , Uniprot
(http://web.expasy.org/docs/relnotes/relstat.html).
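The scores in the last column of Table 3.1 are the log-odds values of (3.19), computed with natural logarithms. The sketch below recomputes them from the two frequency columns and also evaluates the expected score per residue under the background frequencies, which should be negative for a usable scoring scheme. The dictionary `FREQS` is copied from the table; the function `segment_score` and the example segment are illustrative assumptions.

```python
import math

# residue: (a_k, p_k), copied from Table 3.1
FREQS = {
    "A": (0.109, 0.071), "C": (0.019, 0.020), "D": (0.007, 0.053), "E": (0.007, 0.062),
    "F": (0.090, 0.039), "G": (0.082, 0.070), "H": (0.008, 0.023), "I": (0.120, 0.046),
    "K": (0.005, 0.055), "L": (0.168, 0.087), "M": (0.040, 0.025), "N": (0.016, 0.048),
    "P": (0.028, 0.055), "Q": (0.009, 0.043), "R": (0.005, 0.061), "S": (0.053, 0.071),
    "T": (0.050, 0.060), "V": (0.115, 0.061), "W": (0.027, 0.017), "Y": (0.040, 0.034),
}

# s_k = log(a_k / p_k) for every residue
SCORES = {k: math.log(a / p) for k, (a, p) in FREQS.items()}

def segment_score(segment):
    """Sum of per-residue log-odds scores over a segment."""
    return sum(SCORES[r] for r in segment)

# expected score per residue under the background distribution
expected = sum(p * SCORES[k] for k, (a, p) in FREQS.items())
print(round(SCORES["I"], 3), round(SCORES["R"], 3), round(expected, 3))
print(round(segment_score("LIVLAF"), 3))  # a hydrophobic stretch scores positive
```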

Rna,=1()
, - (
)... k=10=10
( =-). ,
=1,,
, sA=log(1/0.25)=1.386, sN=log(0/0.25)= - (,

,-10.000).
16,16*1.386=22.176.
, . 3.1
160 (Krogh, et al., 2001).
,
( ) . , (3.19)

.(
I, F, L, V, M W),
- (


). , (Q, D, E, K, R),
. , .
, ,
.

3. 7: 20 .
, .

()
.:

,
$$E(s_k) = \sum_{k} p_k s_k = \sum_{k} p_k \log \frac{a_k}{p_k} < 0 \tag{3.20}$$

,(sk)
(,p).
score n ,
(Karlin & Altschul, 1990) n (
score):

$$P\left( M - \frac{\log(n)}{\lambda} > x \right) \approx 1 - \exp\left( -K e^{-\lambda x} \right) \tag{3.21}$$

Gumbel,
.
:

$$\sum_{k} p_k \exp(\lambda s_k) = 1 \tag{3.22}$$


,n,
$\sum_{k} p_k \exp(\lambda s_k) = 1$.,:

$$a_k = p_k \exp(\lambda s_k) \tag{3.23}$$
( x)
(rare events), Poisson , =Knexp(-x)
(3.21):

$$P(M_n > x) \approx 1 - e^{-E} \tag{3.24}$$
-(E-value):

$$1 - \exp(-\exp(-t)) \approx 1 - (1 - \exp(-t)) = \exp(-t) \tag{3.25}$$
P-value E-value. , Poisson,
n,mS(m) x
:

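$$P(S_{(m)} \ge x) \approx 1 - \exp\left( -Kne^{-\lambda x} \right) \sum_{i=0}^{m-1} \frac{\left( Kne^{-\lambda x} \right)^{i}}{i!} \tag{3.26}$$

Rn,,
:

$$K = 1 - p = q \tag{3.27}$$

$$\lambda = \log \frac{1}{p} \tag{3.28}$$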

3.6.

,
.
,,,
.
x=x1,x2,,xn y=y1,y2,,ym
.:
(
)
, (alignment)
()
,,
,
()
, ,
(dot plot). ,
. ,
(),
, , . , 100%
,.
, () .
,,
.,

().



3.8: (dot plot). ,
, .
, 1-2 . ,
. ,
( 300 )

,,
.,
. DNA x=x1,x2,,xn y=y1,y2,,ym
(),:

$$P(x, y \mid R) = \prod_{i} q_{x_i} \prod_{j} q_{y_j}, \qquad i = 1, 2, \ldots, n, \quad j = 1, 2, \ldots, m \tag{3.29}$$

,
:

$$P(x, y \mid M) = \prod_{i} p_{x_i y_i}, \qquad i = 1, 2, \ldots, n \tag{3.30}$$

(-),
n=m.(likelihoodratio)oddsratio
:

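$$\frac{P(x, y \mid M)}{P(x, y \mid R)} = \frac{\prod_{i} p_{x_i y_i}}{\prod_{i} q_{x_i} \prod_{j} q_{y_j}} = \prod_{i} \frac{p_{x_i y_i}}{q_{x_i} q_{y_i}} \tag{3.31}$$

$$S = \sum_{i} \log \frac{p_{x_i y_i}}{q_{x_i} q_{y_i}} = \sum_{i} s(x_i, y_i) \tag{3.32}$$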

, score
. 4 DNA 4x4
score16
:
$$s(x_i, y_i) = \begin{cases} 1, & x_i = y_i \\ -1, & x_i \ne y_i \end{cases} \tag{3.33}$$
score 1 (match) 1 (mismatch).
(match matrix) -
(..10000)
, (identitymatrix).
(..)


.
(-)
.
, .
(substitution matrices)
score
(mismatches)
.Altschul(Altschul,1991),
.PAM(Dayhoff,
Schwartz,&Orcutt,1978),BLOSUM(Henikoff&Henikoff,1992),GONNET(Gonnet,Cohen,&Benner,
1992),.
, .
,
(insertion)(deletion)
. ( ), (gap)
(),
.
score
.()
score (g), g ,
:

$$\gamma(g) = -g d \tag{3.34}$$
:

$$\gamma(g) = -d - (g - 1) e \tag{3.35}$$

d(gapopenpenalty)e(gap
extension penalty).
(Vingron&Waterman,1994).

3.7.

DNA,
(3.33).,
1 -3 ,
,5-4.
,
(substitutionmatrices)score
(mismatches)
.Altschul(Altschul,1991),
.
PAM(Dayhoff,etal.,1978),BLOSUM(Henikoff&Henikoff,1992),GONNET(Gonnet,etal.,
1992).
BLOSUM(Henikoff&Henikoff,1992),
.
(blocks),
( ).
:

$$s_{ij} = \frac{1}{\lambda} \log \frac{q_{ij}}{p_i p_j} \tag{3.36}$$

qij,ij(targetfrequencies),pi,
pj(backgroundfrequencies)
.


(3.32) .
.
,PointAccepted
Mutations (PAM) (Dayhoff, et al., 1978). ,
(PAM)
, . PAM1
85%.
(
),PAM30,PAM250...
=(1).
(),
.
(..
) , PAM250,
20-25%.

3.9: BLOSUM62. (..


, , ...),

,
. , PAM, BLOSUM
, ,
,.,PAM,
BLOSUM , ,
,(3.2).

PAM       BLOSUM
PAM100    BLOSUM90
PAM120    BLOSUM80
PAM160    BLOSUM60
PAM200    BLOSUM52
PAM250    BLOSUM45

Table 3.2: PAM / BLOSUM

, ,
.
,


,BLOSUM62.
( ),
BLOSUM90. , ,
,
BLOSUM45. ,
, ( )
.
, , :
(,,...),
. ,
,.
,
(,),
score.,
,()

. BLOSUM62, (C)
(W),,(9
11,),(),,
(4). , , .
, ,
, PHAT (Ng, Henikoff, & Henikoff,
2000) SLIM (Muller, Rahmann, & Rehmsmeier, 2001). , ,
().

3.8.

,
. ,
,,
, . ,
. ,
,
. ,
.,,.
,
n m
,
n=m,
n
Stirling(Durbin,etal.,1998):
$$\binom{2n}{n} = \frac{(2n)!}{(n!)^2} \approx \frac{2^{2n}}{\sqrt{\pi n}} \tag{3.37}$$
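A quick numerical check shows how fast the count in (3.37) grows and how well the Stirling approximation tracks it. The sketch below is illustrative; the function names are assumptions.

```python
import math

def n_alignments(n):
    """Exact value of the binomial coefficient C(2n, n)."""
    return math.comb(2 * n, n)

def stirling_estimate(n):
    """Stirling approximation 2^(2n) / sqrt(pi * n)."""
    return 2 ** (2 * n) / math.sqrt(math.pi * n)

for n in (10, 100, 300):
    exact = n_alignments(n)
    print(n, f"{exact:.3e}", f"{exact / stirling_estimate(n):.4f}")
```

Even for two sequences of 100 residues the count is of the order of 10^59, which is why exhaustive enumeration is not an option.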

.
.
,
.,
( ),
.,,
,.


3.10: , F(i,j) 3
F(i-1,j) F(i,j-1) F(i-1,j-1).

(Gonnet, et al., 1992)


: x=x1,x2,,xn y=y1,y2,,ym
nm F(i,j)
xi, yj. ,
,.
, s(xi,yi)
,F(i-1,j),F(i,j-1)F(i-1,j-1)
F(i,j) 3.10
F(i-1,j-1) F(i,j).
.

.

3.9.

- O Needleman Wunsch

(global
alignment).

(..)
.
Needleman-Wunsch(Needleman
&Wunsch,1970).
:

$$F(i, j) = \max \left\{ F(i-1, j-1) + s(x_i, y_j),\; F(i-1, j) - d,\; F(i, j-1) - d \right\} \tag{3.38}$$

,
,F(i,0)=-idF(0,j)=-jd.
, (recursion) ,
,.


3.9.1 (Waterman, 1995)


DNA y=CAGTATCGCA x=AAGTTAGCAG.
:
$$s(x_i, y_i) = \begin{cases} 1, & x_i = y_i \\ -1, & x_i \ne y_i \end{cases} \tag{3.39}$$
d=1,:

:
AAGTTAGCAG
CAGTATCGCA
3().
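The recursion (3.38) with the boundary conditions F(i,0) = -id and F(0,j) = -jd can be sketched in a few lines of Python. This is a minimal illustration, not an optimized implementation; the function name and the traceback tie-breaking order (diagonal first) are assumptions, so the alignment it prints is one of possibly several optimal alignments and need not coincide with the one shown in the example.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, d=1):
    """Global alignment with linear gap penalty d; returns (score, aligned_x, aligned_y)."""
    n, m = len(x), len(y)
    s = lambda a, b: match if a == b else mismatch
    # F[i][j] = best score for aligning x[:i] with y[:j]
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = -i * d
    for j in range(1, m + 1):
        F[0][j] = -j * d
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
    # Traceback from the bottom-right corner
    ax, ay, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i, j = i - 1, j - 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))

score, ax, ay = needleman_wunsch("AAGTTAGCAG", "CAGTATCGCA")
print(score)
print(ax)
print(ay)
```

With the sequences and parameters of Example 3.9.1 the optimal score is 3, reached by an alignment with one gap in each sequence.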

3.10.

(fit)
,
.(3.38).
(Galas,Eggert,&Waterman,1985):
$$F(i, j) = \max \left\{ F(i-1, j-1) + s(x_i, y_j),\; F(i-1, j) - d,\; F(i, j-1) - d \right\}, \qquad F(i, 0) = -id, \quad F(0, j) = 0 \tag{3.40}$$

3.10.1 (Waterman, 1995)


lacI E.coli
(promoter).:
x=TCGCGGTATGGCATGATAGCGCCCGGAA,
:
y=TATAAT
$$s(x_i, y_i) = \begin{cases} 1, & x_i = y_i \\ -1, & x_i \ne y_i \end{cases}$$
d=2, F:


CATGAT
2(2,
).

3.11. Smith Waterman

,,

- . (local alignment)

().

() (Pearson&
Wood, 2001). ,
, (domains),
.
Smith Waterman (Smith &
Waterman,1981):
$$F(i, j) = \max \left\{ F(i-1, j-1) + s(x_i, y_j),\; F(i-1, j) - d,\; F(i, j-1) - d,\; 0 \right\}, \qquad F(i, 0) = 0, \quad F(0, j) = 0 \tag{3.41}$$

.,,



3.11: .
( ).
, , .

3.11.1 (Waterman, 1995)


3.10.1
$$s(x_i, y_i) = \begin{cases} 1, & x_i = y_i \\ -1, & x_i \ne y_i \end{cases}$$
d=1, F:

,:
AGTATCGCA
AGTTAGCA
5().
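The local recursion (3.41) can be sketched in the same style as the global one. In this minimal illustration the function name and the tie-breaking order in the traceback are assumptions; when several local alignments share the best score, the code reports the first one found while filling the matrix row by row.

```python
def smith_waterman(x, y, match=1, mismatch=-1, d=1):
    """Local alignment with linear gap penalty d; returns (score, aligned_x, aligned_y)."""
    n, m = len(x), len(y)
    s = lambda a, b: match if a == b else mismatch
    F = [[0] * (m + 1) for _ in range(n + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(0,
                          F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
            if F[i][j] > best:
                best, bi, bj = F[i][j], i, j
    # Traceback from the maximum cell until a zero is reached
    ax, ay, i, j = [], [], bi, bj
    while i > 0 and j > 0 and F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i, j = i - 1, j - 1
        elif F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))

score, ax, ay = smith_waterman("AAGTTAGCAG", "CAGTATCGCA")
print(score)
print(ax)
print(ay)
```

For the two sequences AAGTTAGCAG and CAGTATCGCA with d = 1 the best local score is 5, in agreement with the example.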

(mn). (mn) (big-O notation)
,
nm,.,
. ,
,(2n).


, ,
.,
. ,
,(nm2+mn2)
(open)(extension).
:

$$F(i, j) = \max \begin{cases} F(i-1, j-1) + s(x_i, y_j), \\ F(k, j) + \gamma(i - k), & k = 0, \ldots, i-1 \\ F(i, k) + \gamma(j - k), & k = 0, \ldots, j-1 \end{cases} \tag{3.42}$$

. ,
,(mn),
, . ,
(3.35).
3,:
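$$F(i, j) = \max \begin{cases} F(i-1, j-1) + s(x_i, y_j) \\ I_x(i-1, j-1) + s(x_i, y_j) \\ I_y(i-1, j-1) + s(x_i, y_j) \end{cases} \tag{3.43}$$

$$I_x(i, j) = \max \left\{ F(i-1, j) - d,\; I_x(i-1, j) - e \right\} \tag{3.44}$$

$$I_y(i, j) = \max \left\{ F(i, j-1) - d,\; I_y(i, j-1) - e \right\} \tag{3.45}$$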
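The three-state recursion can be sketched as follows for the global case. This is a minimal illustration under stated assumptions: the match state follows the standard form in which all three predecessors are taken at (i-1, j-1), the boundary rows and columns are initialised with the affine penalty -d - (g-1)e, and a gap in one sequence is not allowed to be followed directly by a gap in the other. The function name is hypothetical.

```python
NEG_INF = float("-inf")

def affine_global_score(x, y, match=1, mismatch=-1, d=2, e=1):
    """Global alignment score with gap-open penalty d and gap-extension penalty e."""
    n, m = len(x), len(y)
    F = [[NEG_INF] * (m + 1) for _ in range(n + 1)]   # x_i aligned to y_j
    Ix = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # x_i aligned to a gap
    Iy = [[NEG_INF] * (m + 1) for _ in range(n + 1)]  # y_j aligned to a gap
    F[0][0] = 0
    for i in range(1, n + 1):
        Ix[i][0] = -d - (i - 1) * e
    for j in range(1, m + 1):
        Iy[0][j] = -d - (j - 1) * e
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1]) + s
            Ix[i][j] = max(F[i - 1][j] - d, Ix[i - 1][j] - e)
            Iy[i][j] = max(F[i][j - 1] - d, Iy[i][j - 1] - e)
    return max(F[n][m], Ix[n][m], Iy[n][m])

print(affine_global_score("ACGT", "AT"))                            # one gap of length 2
print(affine_global_score("AAGTTAGCAG", "CAGTATCGCA", d=1, e=1))    # reduces to the linear case
```

Setting e = d makes the affine penalty linear, so the second call reproduces the score of the global alignment with d = 1.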

3.12. O Erdos Renyi

,
.,.
.
P-value ( ) ,
, , (
,).

, . ,
Gumbel.
, , .
.,Erdos
Renyi , (Waterman, 1995).
x=x1,x2,,xn y=y1,y2,,ym.
M n log1 p mn :

$$P\left( \lim_{n \to \infty} \frac{M_n}{\log_{1/p}(mn)} = 1 \right) = 1 \tag{3.46}$$

, $p = P(x_i = y_j) = p_A^2 + p_T^2 + p_G^2 + p_C^2$


.
,
mn.
.x=x1,x2,,xny=y1,y2,,ym
0p<1.100%:


$$\frac{M_n}{\log(mn)} \to \frac{1}{H(a, p)} \tag{3.47}$$

(,p).,
(local) .
Local Similarity Score
,
.,(Arratia,GordonandWaterman,1986;Arratia,
Gordon and Waterman, 1990)
.,Arratia(1990)
,:
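$$E(M_n) = \frac{\log(mn)}{\lambda} + \frac{\log(q)}{\lambda} + \frac{\gamma}{\lambda} - \frac{1}{2} \tag{3.48}$$

$q = 1 - p$, $\gamma = -\Gamma'(1) = 0.5772$ (Euler-Mascheroni), $\lambda = \log(1/p)$.,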
:

$$\mathrm{Var}(M_n) = \frac{\pi^2}{6 \lambda^2} + \frac{1}{12} \tag{3.49}$$

,(3.17)(3.18)
.(3.48)
(3.47),log(mn),
log(q) / . , m n ,
. Arratia Waterman (Arratia & Waterman, 1989),
,k
(mismatches).,x=x1,x2,,xn
y=y1,y2,,ym,
k(kmismatches):

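$$E(M_n) = \log_{1/p}(qn^2) + k \log_{1/p} \log_{1/p}(qn^2) + k \log_{1/p}(q) - \log_{1/p}(k!) + k + \frac{\gamma}{\lambda} - \frac{1}{2} \tag{3.50}$$

:

$$\mathrm{var}(M_n) = \frac{\pi^2}{6 \lambda^2} + \frac{1}{12} \tag{3.51}$$

, $q = 1 - p$, $\gamma = -\Gamma'(1) = 0.5772$ (Euler-Mascheroni), $\lambda = \log(1/p)$.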

3.13. local similarity score

LocalSimilarityScore
-,
. ,
,
. ,
.,
30%(similarity)80.
,,
, ,
,
.

Local Similarity Score without gaps,


,
..,
$$s(x_i, y_i) = \begin{cases} 1, & x_i = y_i \\ 0, & x_i \ne y_i \end{cases}, \qquad d = 0$$


,,s(xi,yj)~cn
,

.
$$s(x_i, y_i) = \begin{cases} 1, & x_i = y_i \\ -\infty, & x_i \ne y_i \end{cases}, \qquad d = \infty$$
,,s(xi,yj)~klogn
(3.46).
(3.47) n.
,
, . (phase transition)
Arratia,GordonWaterman(Arratia&Waterman,1994;Waterman,1995;Waterman,Gordon,&Arratia,
1987),m(mismatch)
d(gap),.
.
,-,.
,Poisson,:

$$E(S \ge x) = K m n e^{-\lambda x} = K m n p^{x} \tag{3.52}$$
<1 mn,
qiqjeS =1.
,:

$$K = 1 - p = q \tag{3.53}$$

$$\lambda = \log \frac{1}{p} \tag{3.54}$$
, ,
. ,
KarlinAltschul(Karlin&Altschul,1990)
x=x1,x2,,xny=y1,y2,,ymS(3.32),
x(,p-value),:

$$P(S \ge x) \approx 1 - \exp\left( -K m n e^{-\lambda x} \right) \tag{3.55}$$
(localsimilarityscore)
,maximalsegmentscore.:

,:
$$E(s_{ij}) = \sum_{i,j} q_i q_j s_{ij} = -\sum_{i,j} q_i q_j \log \frac{q_i q_j}{p_{ij}} < 0 \tag{3.56}$$

, qiqjeS =1. ,

,.,:

$$P(S < x) \approx \exp\left( -K m n e^{-\lambda x} \right) \tag{3.57}$$
,,...
Gumbel(EVD).,:

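$$P(S \le x) = \exp\left( -e^{-\frac{x - a}{b}} \right), \qquad -\infty < x < \infty \tag{3.58}$$

$$E(x) = a - b\,\Gamma'(1), \qquad V(x) = \frac{b^2 \pi^2}{6} \tag{3.59}$$

, $a = \frac{\log(Kmn)}{\lambda}$, $b = \frac{1}{\lambda}$, $K = 1 - p = q$, $\lambda = \log \frac{1}{p}$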

.p-value
. ,
(Pearson,1998;Pearson&Wood,2001):

$$P(Z > z) = 1 - \exp\left( -\exp\left( -\frac{\pi}{\sqrt{6}}\, z + \Gamma'(1) \right) \right) \tag{3.60}$$

, (3.55)
(z).
(z)
().
p-value
,p-value
.
. p-value
10-4,

100.00010.
, D
, p-value ( - pmatch) , Poisson. (Pearson & Wood,
2001):P =Pr(1Sx)=1-e-Dp Dp(<0.01):PDp.

Sx,D.E-value(expectationvalue)
E(Sx)=DP(Sx)D
.
,

.
()
n ( ). m=N/D
,x,
P(S>x)=1-e-E(S) =1-exp(-Kne-x)(E-value)E(Sx)=Kne-x =DKmne-x.
BLAST(Altschul,Gish,Miller,Myers,&Lipman,1990),pvalue, (output) ,
, E-value ,
(Waterman,1995):1-exp(-exp(-t))1-(1-exp(-t))=exp(-t),p-valueEvalue.,,
p-value-value.

3.14.


. (Altschul et al.,
1997;Clote&Backofen,2000;Mott,2000),
Gumbel:

$$P(S < x) \approx \exp\left( -K m n e^{-\lambda x} \right) \tag{3.61}$$
,.
Gumbel
Mott,1992(Mott,1992).(3.58)


A a0

a1

2 log mn

,B

b1

.0,1,2b1.

q q e
i

1 .

,
(direct estimation) (Waterman, 1995; Waterman & Vingron, 1994).
( ). ,
, (
1000)
( shuffling, ). K,
... (e.c.d.f.) (log[log[cdf]])log[-log[cdf]]S.(slope)
(constant)log(mn).

3.12: - DNA 10000


. , log[-log[cdf]]
.

, ,
,
. ,
(z>7)(z<-3)
,(systematicerrorbias)(Pearson,1998).
Waterman Vingron (Waterman & Vingron, 1994;
Waterman & Vingron, 1994), Poisson (Poisson Approximation -
(Arratia, Goldstein, & Gordon, 1989; Chen, 1975)) de-clumping estimation.
S(1) S(2) ... S(k) .

,.
S(i)Poisson,:

$$E(S \ge x) = K m n e^{-\lambda x} \tag{3.62}$$
,k,x:

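$$P(S_{(k)} \ge x) \approx 1 - \exp\left( -K m n e^{-\lambda x} \right) \sum_{i=0}^{k-1} \frac{\left( K m n e^{-\lambda x} \right)^{i}}{i!} \tag{3.63}$$

(3.52).,
log[data] ( )
Kmne-x()
,. Poisson
,
,,.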



.
PoissonnmSmith-Waterman,
k(sub-optimalalignments).WatermanVingron,
10 , sub-optimal scores
,.

,Pearson(Pearson,1995,1998;Pearson&Wood,2001),
.
. , k
n1,n2,,nk,
, 10%. ,
S,,
(weightedlinearregression):

$$S = a + b \log(n_i) \tag{3.64}$$
, ni, i , log(ni)
(1/var),
. 2 ,
(residualvariance)z-score:

$$z = \frac{S - \left( a + b \log(n_i) \right)}{\sqrt{\mathrm{var}}} \tag{3.65}$$

,
,.
5 , (
).z-scores,,:

$$P(Z > z) = 1 - \exp\left( -\exp\left( -\frac{\pi}{\sqrt{6}}\, z + \Gamma'(1) \right) \right) \tag{3.66}$$

(p-value, -value),
.
,.
OMott(Mott,2000),,,
Gumbel .
Smith-Waterman,,
,.,
,
.
.
,
,
,.,
,
.

3.15. - BLAST FASTA

,SmithWaterman
,.
, , . ,
,
.
,
,


. , 2,
.,
,
. ,
1980,,
.
(heuristic),,
. ,
.
, BLAST (Altschul, et al., 1990; Altschul, et al., 1997)
FASTA(Lipman&Pearson,1985;Wilbur&Lipman,1983).

3.13: FASTA

FASTA (www.ebi.ac.uk/fasta33/),
,
.:
k-tuples(
k, 1 2)
.

,k-tuples.


,
,
(
),
.

3.14: BLAST

BLAST(www.ncbi.nlm.nih.gov/BLAST/),
FASTA,

:

(=13).
,

,.

(HSPs,highscoringpairs).
HSPs
S


,
,
KarlinAltschul.
BLAST,
.2
,BLAST
.BLAST,
,NCBI,
.
.
BLAST(K,),
,FASTA
.
BLAST
(PSI-BLAST) (Altschul, et al.,
1997),.
BLAST,
, Karlin Altschul. ,

(effectivelength).
$$m' = m - \frac{\log(Kmn)}{H}, \qquad n' = n - \frac{\log(Kmn)}{H}$$

(substitutionmatrix)
.,.,
,,
.
.
$$H = \sum_{i=1}^{20} \sum_{j=1}^{20} q_{ij}\, s_{ij} \tag{3.67}$$


,
.
. ,
(expected substitution score per position),
(3.56),
(expectedperpositionalignmentscore),.
,
(gap penalties, mismatches),
. bit
(Altschul,etal.,1990;Altschul,etal.,1997):
$$S_{bit} = \frac{\lambda S_{raw} - \log K}{\log 2} \tag{3.68}$$
Sraw, .
(3.47):
$$E(S_{bit}) = m n\, 2^{-S_{bit}} \tag{3.69}$$
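The conversions in (3.68) and (3.69) can be sketched in a few lines. The function names are illustrative assumptions. The values λ = 0.267 and K = 0.0410 used in the example call are the gapped parameters printed in the BLAST output of the exercises at the end of this chapter, where raw scores of 77 and 189 are reported as 34.3 and 77.4 bits.

```python
import math

def bit_score(raw_score, lam, K):
    """Normalised score S_bit = (lambda * S_raw - ln K) / ln 2."""
    return (lam * raw_score - math.log(K)) / math.log(2)

def e_value(bits, m, n):
    """Expected number of chance hits, E = m * n * 2^(-S_bit)."""
    return m * n * 2.0 ** (-bits)

def p_value(e):
    """P = 1 - exp(-E); for small E, P is approximately equal to E."""
    return 1.0 - math.exp(-e)

print(round(bit_score(77, 0.267, 0.0410), 1))
print(round(bit_score(189, 0.267, 0.0410), 1))
print(e_value(10, 32, 32))
```

Here m and n stand for the (effective) lengths of the query and the database, so their product is the search space.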
,(3.61),
bitScore.
FASTA
(Pearson, 1998). ,
BLAST FASTA ,
.


BLAST ,
.
, ,
DNA DNA, , ,
(DNA)(),
DNA,DNADNA
( DNA-DNA ).
BLAST FASTA

(),
(, ,
).


Altschul, S. F. (1991). Amino acid substitution matrices from an information theoretic perspective. J Mol Biol, 219(3), 555-565.
Altschul, S. F., Gish, W., Miller, W., Myers, E. W., & Lipman, D. J. (1990). Basic local alignment search tool. J Mol Biol, 215(3), 403-410.
Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W., & Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res, 25(17), 3389-3402.
Arratia, R., Goldstein, L., & Gordon, L. (1989). Two moments suffice for Poisson approximation: The Chen-Stein method. Ann. Probab., 17, 9-25.
Arratia, R., Gordon, L. and Waterman, M. S. (1986). An extreme value theory for sequence matching. Ann. Statist., 14, 971-993.
Arratia, R., Gordon, L. and Waterman, M. S. (1990). The Erdos-Renyi law in distribution, for coin tossing and sequence matching. Ann. Statist., 18, 539-570.
Arratia, R., & Waterman, M. S. (1989). The Erdos-Renyi strong law for pattern matching with a given proportion of mismatches. Ann. Probab., 17, 1152-1169.
Arratia, R., & Waterman, M. S. (1994). A phase transition for the score in matching random sequences allowing deletions. Ann. Appl. Probab., 4, 200-225.
Chen, L. H. Y. (1975). Poisson approximation for dependent trials. Ann. Probab., 3, 534-545.
Clote, P., & Backofen, R. (2000). Computational Molecular Biology, an Introduction. John Wiley and Sons, Ltd. USA.
Davison, A. C. (1998). Extreme Values. Encyclopedia of Biostatistics: John Wiley & Sons, Ltd.
Dayhoff, M. O., Schwartz, R. M., & Orcutt, B. C. (1978). A model of evolutionary change in proteins. In M. Dayhoff (Ed.), Atlas of protein sequence and structure (Vol. 5, Suppl. 3, pp. 345-352): National Biomedical Research Foundation, Silver Spring, MD.
Durbin, R., Eddy, S., Krogh, A., & Mitchison, G. (1998). Biological sequence analysis, probabilistic models of proteins and nucleic acids. Cambridge University Press.
Erdos, P., & Renyi, A. (1970). On a new law of large numbers. J. Anal. Math., 22, 103-111.
Erdos, P., & Revesz, P. (1975). On the length of the longest head-run. Topics in Information Theory. Colloquia Math. Soc. J. Bolyai, 16, 219-228.
Galas, D. J., Eggert, M., & Waterman, M. S. (1985). Rigorous pattern-recognition methods for DNA sequences. Analysis of promoter sequences from Escherichia coli. J Mol Biol, 186(1), 117-128.
Gonnet, G. H., Cohen, M. A., & Benner, S. A. (1992). Exhaustive matching of the entire protein sequence database. Science, 256(5062), 1443-1445.
Henikoff, S., & Henikoff, J. G. (1992). Amino acid substitution matrices from protein blocks. Proceedings of the National Academy of Sciences (USA), 89, 10915-10919.
Karlin, S., & Altschul, S. F. (1990). Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proceedings of the National Academy of Sciences of the USA, 87, 2264-2268.
Karlin, S., & Brendel, V. (1992). Chance and statistical significance in protein and DNA sequence analysis. Science, 257(5066), 39-49.
Krogh, A., Larsson, B., von Heijne, G., & Sonnhammer, E. L. (2001). Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol, 305(3), 567-580.


Lipman, D. J., & Pearson, W. R. (1985). Rapid and sensitive protein similarity searches. Science, 227(4693), 1435-1441.
Mott, R. (1992). Maximum likelihood estimation of the statistical distribution of Smith-Waterman local sequence similarity scores. Bulletin of Mathematical Biology, 54, 59-75.
Mott, R. (2000). Accurate formula for P-values of gapped local sequence and profile alignments. J Mol Biol, 300(3), 649-659.
Muller, T., Rahmann, S., & Rehmsmeier, M. (2001). Non-symmetric score matrices and the detection of homologous transmembrane proteins. Bioinformatics, 17 Suppl 1, S182-189.
Needleman, S. B., & Wunsch, C. D. (1970). A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol, 48(3), 443-453.
Ng, P. C., Henikoff, J. G., & Henikoff, S. (2000). PHAT: a transmembrane-specific substitution matrix. Predicted hydrophobic and transmembrane. Bioinformatics, 16(9), 760-766.
Pearson, W. R. (1995). Comparison of methods for searching protein sequence databases. Protein Science, 4, 1145-1160.
Pearson, W. R. (1998). Empirical statistical estimates for sequence similarity searches. J Mol Biol, 276(1), 71-84.
Pearson, W. R., & Wood, T. C. (2001). Statistical significance in biological sequence comparison. In D. J. Balding, M. Bishop & C. Cannings (Eds.), Handbook of statistical genetics (pp. 39-65): John Wiley and Sons, Ltd. England.
Smith, T. F., & Waterman, M. S. (1981). Identification of common molecular subsequences. J Mol Biol, 147(1), 195-197.
Vingron, M., & Waterman, M. S. (1994). Sequence alignment and penalty choice. Review of concepts, case studies and implications. J Mol Biol, 235(1), 1-12.
Waterman, M. S. (1995). Introduction to Computational Biology: Chapman and Hall, London.
Waterman, M. S., Gordon, L., & Arratia, R. (1987). Phase transitions in sequence matches and nucleic acid structure. Proceedings of the National Academy of Sciences of the USA, 84, 1239-1243.
Waterman, M. S., & Vingron, M. (1994). Rapid and accurate estimates of statistical significance for sequence database searches. Proceedings of the National Academy of Sciences of the USA, 91, 4625-4628.
Waterman, M. S., & Vingron, M. (1994). Sequence comparison significance and Poisson approximation. Statistical Science, 2, 367-381.
Wilbur, W. J., & Lipman, D. J. (1983). Rapid similarity searches of nucleic acid and protein data banks. Proceedings of the National Academy of Sciences of the USA, 80, 726-730.
Wootton, J. C., & Federhen, S. (1993). Statistics of local complexity in amino acid sequences and sequence databases. Computers & chemistry, 17(2), 149-163.


1)


$$H(a, p) = a \log \frac{a}{p} + (1 - a) \log \frac{1 - a}{1 - p}$$
=1.
(3.11)(3.12);

2)
3.1,:
$$E(s_k) = \sum_{k} p_k s_k = \sum_{k} p_k \log \frac{a_k}{p_k} < 0$$
;

3)

(%).

);;
),;

4)
250
PAM250,:
F W L E V E G N S M T A P T G
F W L D V Q G D S M T A P A G

, K=0.09, =0.229.
bit-score;

5)
300
550 . (similar residues) 61 166
,BitScore39.
)E-value;
);
;
;


)-value
500.000;

6)
BLAST
NRNCBI.

Score = 34.3 bits (77), Expect =
Identities = 28/85 (32%), Positives = 44/85 (51%), Gaps = 11/85 (12%)

Query  96   INDWASIYGVVGVGYGKFQTTEYPTY---KHDTSDYGFSYGAGLQ--FNPMENVALDFSY  150
            I++   I+G +G  YG+ +T+  P +     D S +G SYGAG++  FNP     L+  +
Sbjct  118  ISEQFDIFGKLGTTYGRTKTSGNPGFGVATGDDSGFGLSYGAGVRWAFNPQWAAVLE--W  175

Query  151  EQSRIR----SVDVGTWIAGVGYRF  171
            E+ R+       DV     GV YR+
Sbjct  176  ERHRLHFADGKSDVDMTTIGVQYRY  200


Score = 77.4 bits (189), Expect = XXXXX
Identities = 62/201 (30%), Positives = 101/201 (50%), Gaps = 32/201 (15%)

Query  1    MKKIACLSALAAVLAFTAGTSVAAT---STVTGGY--AQSDAQGQMNKMGGFNLKYRYEE  55
            M+K+      AA+    +G    A+   ST++ GY    ++  G  +++ G N+KYRYE
Sbjct  1    MRKLYAAILSAAICLAVSGAPAWASEHQSTLSAGYLHVSTNVPGS-DELNGINVKYRYEF  59

Query  56   DNSPLGVIGSFTY--------TEKSRTASSGDYNKNQYYGITAGPAYRINDWASIYGVVG  107
             ++ LG++ SF+Y        T  S T    D  +N+++ + AGP+ R+N+W S Y + G
Sbjct  60   TDT-LGMVTSFSYAGDKNRQLTHYSDTRWHEDSVRNRWFSVMAGPSVRVNEWFSAYAMAG  118

Query  108  VGYGKFQT--------TEYPTYKHDT---------SDYGFSYGAGLQFNPMENVALDFSY  150
            + Y +  T        T+     HD          S+   ++GAG+Q NP E+VA+D +Y
Sbjct  119  MAYSRVSTFSGDYLRVTDNKGKTHDVLTGSDDGRHSNTSLAWGAGVQVNPTESVAIDIAY  178

Query  151  EQSRIRSVDVGTWIAGVGYRF  171
            E S         +I GVGY+F
Sbjct  179  ECSGSGDWRTDGFIVGVGYKF  199

Gapped
Lambda     K        H
0.267      0.0410   0.140

Number of Sequences: 4496249
Length of query: 171
Length of database: 1544746084
Length adjustment: 122
Effective length of query: 49
Effective length of database: 996203706
Effective search space: 48813981594
Effective search space used: 48813981594

)E-value(Expectation).
;
);
)
;


4:

.

, ,
.
,
, .
.
, ,
, .

3 ( ).

4.

,,
2.
.
, .
, ,
,.

()
.,""
.,
(..
,
).,
,
,
. ,
(patterns),HiddenMarkovModels
. 2
,(PROSITE,PFAM..).
,
, '
.
,
, ,
. ,
(
6),,
.
, (
).
,
.
,,
,

151

, .
,.
,
"" , , ,

,(profile).
,
.
,.

4.1.

,
.r:

$$\begin{aligned} x^1 &= x_{11} x_{12} \ldots x_{1n} \\ x^2 &= x_{21} x_{22} \ldots x_{2n} \\ &\ \ \vdots \\ x^r &= x_{r1} x_{r2} \ldots x_{rn} \end{aligned} \tag{4.1}$$


3, r . 3,

. ' , ,

.

4.1: , 3

.

$$S(m) = G + \sum_{i} S(m_i) \tag{4.2}$$

miim,S(mi)G (
).,5(-),
,
.,,
,
.,:
$$S(m) = \sum_{i} S(m_i) \tag{4.3}$$

,.
,logoddsr:

$$S(m) = \sum_{i} S(m_i) = \sum_{i} \log \frac{p_{x_{1i} x_{2i} \ldots x_{ri}}}{q_{x_{1i}}\, q_{x_{2i}} \cdots q_{x_{ri}}} = \sum_{i} s(x_{1i}, x_{2i}, \ldots, x_{ri}) \tag{4.4}$$

, ,
(PAM, BLOSUM ),
.,
3 , 4
,...,.
,
.mij i j nb(i)
bi,,:
$$P(m_i) = \prod_{b} p_b(i)^{n_b(i)} \tag{4.5}$$

ps(i)si,:

$$p_b(i) = \frac{n_b(i)}{\sum_{b'} n_{b'}(i)} \tag{4.6}$$

b '

,,:

$$S(m_i) = -\sum_{b} n_b(i) \log p_b(i) \tag{4.7}$$
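Equations (4.5) to (4.7) can be illustrated with a short sketch that scores a single alignment column. The function name is an assumption, and the probabilities p_b(i) are estimated from the counts in the column itself, as in (4.6). Treating a gap character like any other symbol is a simplification made only for this example.

```python
import math
from collections import Counter

def column_entropy_score(column):
    """S(m_i) = -sum_b n_b(i) log p_b(i), with p_b(i) = n_b(i) / sum_b' n_b'(i)."""
    counts = Counter(column)
    total = len(column)
    return -sum(c * math.log(c / total) for c in counts.values())

print(column_entropy_score("AAAA"))   # fully conserved column
print(column_entropy_score("AACC"))   # two residues, equally frequent
print(column_entropy_score("ACGT"))   # maximally variable column
```

A fully conserved column gets the minimum score of 0, and the score grows with the variability of the column, so good alignments are those that minimise the sum over columns.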

, , ,
,
. , : 100%
,0,,,
. , ,
(,,
).,,,
,.,r,-
( ).
,,
.,
. ,

.
, SP (Sum of
Pairs).:
$$SP(m_i) = \sum_{j < j'} s(m_i^{j}, m_i^{j'}) \tag{4.8}$$

s
,.,:

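$$SP(m) = \sum_{i} \sum_{j < j'} s(m_i^{j}, m_i^{j'}) = \sum_{i} \left[ \log \frac{p_{x_{1i} x_{2i}}}{q_{x_{1i}} q_{x_{2i}}} + \ldots + \log \frac{p_{x_{(r-1)i} x_{ri}}}{q_{x_{(r-1)i}} q_{x_{ri}}} \right] \tag{4.9}$$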
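The sum-of-pairs score of (4.8) and (4.9) can be computed by looping over the columns and over all pairs of rows. In the sketch below the pairwise scoring function is a deliberately simple assumption (match +1, mismatch -1, residue against gap -1, gap against gap 0); in practice a substitution matrix such as BLOSUM62 would take its place.

```python
from itertools import combinations

def pair_score(a, b, match=1, mismatch=-1, gap=-1):
    """Toy pairwise score; a gap aligned to a gap contributes nothing."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch

def sp_score(alignment):
    """Sum over columns of the scores of all pairs of rows in the column."""
    return sum(pair_score(a, b)
               for column in zip(*alignment)
               for a, b in combinations(column, 2))

msa = ["ACG", "ACG", "A-G"]
print(sp_score(msa))
```

For r sequences each column contributes r(r-1)/2 pairwise terms, which is the source of the overcounting of evolutionary events discussed in the text.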

2r.,
,.
, , . ,
,
(..1020),,
(Durbin, Eddy, Krogh, & Mitchison, 1998).
,,..
19/20 , 9/10,

.
,,

. ,
,(..>50)
100%.,
,
. , ,
.,
(
),
.
,
.ai1,i2,...,iN
xi11 , xi22 , ..., xiNN ,
3,:
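$$a_{i_1, i_2, \ldots, i_n} = \max \begin{cases} a_{i_1 - 1,\, i_2 - 1,\, \ldots,\, i_n - 1} + S(x^1_{i_1}, x^2_{i_2}, \ldots, x^n_{i_n}) \\ a_{i_1,\, i_2 - 1,\, \ldots,\, i_n - 1} + S(-, x^2_{i_2}, \ldots, x^n_{i_n}) \\ \quad \vdots \\ a_{i_1 - 1,\, i_2 - 1,\, \ldots,\, i_n} + S(x^1_{i_1}, x^2_{i_2}, \ldots, -) \\ a_{i_1,\, i_2,\, i_3 - 1,\, \ldots,\, i_n - 1} + S(-, -, \ldots, x^n_{i_n}) \\ \quad \vdots \end{cases} \tag{4.10}$$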

,().(4.10),
(2N-1 ).
,(Durbin,etal.,1998;Waterman,1995):
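$$a_{i_1, i_2, \ldots, i_n} = \max_{\Delta_1 + \ldots + \Delta_n > 0} \left\{ a_{i_1 - \Delta_1,\, i_2 - \Delta_2,\, \ldots,\, i_n - \Delta_n} + S(\Delta_1 \cdot x^1_{i_1}, \Delta_2 \cdot x^2_{i_2}, \ldots, \Delta_n \cdot x^n_{i_n}) \right\} \tag{4.11}$$

:

$$\Delta_i \cdot x = \begin{cases} x, & \Delta_i = 1 \\ -, & \Delta_i = 0 \end{cases} \tag{4.12}$$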
,,rn(,
), (nr2r)
(nr). ,
,
,


. Carrilo Lipman (Carrillo & Lipman, 1988).


,,
,
. , .
,
(
). CarriloLipman MSA(Lipman,
Altschul, & Kececioglu, 1989), http://xylian.igh.cnrs.fr/msa/msa.html
,,.
, ,
.
,
,.

4.2.

(heuristic) ,
progressivemultiplealignmentmethod().
(
), -, .
1980,
Feng Doolittle 1987 (Feng & Doolittle, 1987)
:

,
(guidetree)
,
,,.
,
(BLAST, FASTA).
,
(clustering). ,
,,.
, ,
,.Feng
Doolittle(Feng&Doolittle,1987),,
( ). ,
,:

$$D = -\log S_{eff} = -\log \frac{S_{obs} - S_{rand}}{S_{max} - S_{rand}} \tag{4.13}$$
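The distance of (4.13) is straightforward to compute once the three scores are available. The sketch below takes them as plain numbers; how S_max and S_rand are obtained (for example from self-alignments and from shuffled sequences) is left outside the function, and the function name and the example values are illustrative assumptions.

```python
import math

def feng_doolittle_distance(s_obs, s_max, s_rand):
    """D = -log( (S_obs - S_rand) / (S_max - S_rand) )."""
    s_eff = (s_obs - s_rand) / (s_max - s_rand)
    if s_eff <= 0:
        raise ValueError("observed score is not above the random expectation")
    return -math.log(s_eff)

print(feng_doolittle_distance(100, 100, 10))  # identical sequences: distance 0
print(feng_doolittle_distance(55, 100, 10))   # halfway between random and maximal
```

The normalisation maps identical sequences to distance 0 and lets the distance grow without bound as the observed score approaches the random expectation.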

Sobs,.
Smax ,
,Srand
.
( shuffling), Feng Doolittle
. ,
, log
().
,
Fitch Margoliash (Fitch& Margoliash, 1967).
, .
, - ,
6,,
,.


,
.

4.2: 5 (Duret & Abdeddaim, 2000). (*)


.
x1 x2 , x3 x4 , ,
,
x5. , ,
(x1-x3). To ,
x3-x5.

, ,
,.,()
,
().,,,
. ,
:,
. 4.2
5.
, ,
(bias) ,
. ,
,
.,
, (consensus),


.
, .
2,1,1G,1C.,
,60%
,.

4.3: 4.2.
x1 x2 , x3 x4 . ,
, (x1-x3).
x3 x4 , x1 x2,
.

, profile alignment ( ),

. ,
,,(-),
. , SP ,
.,
1n,,n+1.
(4.8):

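$$\sum_{i} SP(m_i) = \sum_{i} \sum_{j < j'} s(m_i^{j}, m_i^{j'}) = \sum_{i} \sum_{j < j' \le n} s(m_i^{j}, m_i^{j'}) + \sum_{i} \sum_{n < j < j' \le N} s(m_i^{j}, m_i^{j'}) + \sum_{i} \sum_{j \le n,\ n < j' \le N} s(m_i^{j}, m_i^{j'}) \tag{4.14}$$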

(4.14)
, ,
.
,
3(4.4).,
. ,
,(Edgar
&Sjolander,2004;Wang&Dunbrack,2004).
,
CLUSTALW(Thompson,Higgins,&Gibson,1994)profilealignment
.clustalCLUSTALV(Higgins,Bleasby,&
Fuchs, 1992) CLUSTALW
,CLUSTALX(Thompson,Gibson,&Higgins,2002).,

.
www.ebi.ac.uk/clustalw/.,:
,
(FASTA), .


, ,
,.
x
(D=1-x/100) -
,Neighbor-Joining()(Saitou&Nei,1987).
.
6.
( )
,profilealignment.

.,
(weight)
.
,
.,,,
( ,
, ,
5).,
,
.
. ,
CLUSTAL
,
.
,
, Kalign (Lassmann & Sonnhammer, 2005), (
http://msa.sbc.su.se/cgi-bin/msa.cgi). Kalign,
.

, Lassmann Sonnhammer
,,WuManber,
(Wu&Manber,1992).
Kalignk-tuple,.,
-UPGMA(
).
. profile alignment,
,,(
,).,
,
,,
(BLOSUM50, PAM250 GONNET250). ,
GONNET250(Gonnet,Cohen,&Benner,1992),
. , Kalign
CLUSTAL 10
.,,
.
,.



4.4 4.2.
profile alignment. ,
x1 x2 , x3 x4 .
(4.14) . ,
x3 x4 , x1 x2,
. , , .
, , .

, , .
,
. ,
, ,
(once a gap, always a gap). ,
,.
, ,
.

4.3.

,
,
, .
,,
,


. , CLUSTALW 6% (Wallace,
O'Sullivan,&Higgins,2005).
, Barton
Sternberg (Barton & Sternberg, 1987). ,

.,profile
alignment.,-
,.
Corpet (Corpet, 1988),
MULTALIN(http://prodes.toulouse.inra.fr/multalin/multalin.html).
.MULTALIN,,,
,
.
MUSCLE
(Edgar, 2004) ( http://www.drive5.com/muscle).
, MUSCLE k-mers ( -
k),
UPGMA, ( MSA1).
,Kimura(
6),
profile alignment ( MSA2).
(refinement),,
.

.MSA3,
MSA2(MUSCLE-p)
,
. MUSCLE-p O(N2L+NL2)
O(N2+NL+L2), O(N3L)
. MUSCLE
profilealignment,log-expectationscore.MUSCLE
, profiles
,.
, Gotoh
(Gotoh, 1996) PRRP/PRRN (http://www.genome.ist.i.kyotou.ac.jp/~aln_user/prrn/index.html).
, SP(weighted sums-of-pairs score)
.,
. SP
, ,
.
PRALINE, (Simossis & Heringa, 2005)
http://ibivu.cs.vu.nl/programs/pralinewww/,
. profile
PSI-BLAST .
, profile .
,
profiles.
, PRALINE
. ,
, profile,
profiles.,

,.,
. , (consistency)


profiles.
.
To Dialign (Morgenstern, 2014), ( http://bibiserv.techfak.unibielefeld.de/dialign/), ,
,
(,,,
).
. Dialign
(diagonal alignments in a dot plot). ,
. ,
,,
.,
, ,
. , :
(),
, . , Dialign
,
(..).
, PRALINE
Dialign, COBALT (Papadopoulos & Agarwala, 2007),
NCBI(ftp://ftp.ncbi.nlm.nih.gov/pub/agarwala/cobalt).COBALTBLAST
RPS-BLAST,
Neighbour-Joining,
(dij=1-(Sij/2)(1/Sii-1/Sij)).,
profile alignment ,
. ,
BLAST,RPS-BLAST,
NCBI
(CDD). ,
Dialign.
,
,T-Coffee (Magisetal.,2014),(
http://www.ch.embnet.org/software/TCoffee.html). To T-Coffee CLUSTALW (
,):,
,
profile alignment.
,
,
. , ,
,, ,
,.
CLUSTALW ,
,(
CLUSTALW), (
LALIGN FASTA). , T-Coffee
.
, .
,.,

.
,,
,

.,simulated
annealing (Kim, Pramanik, & Chung, 1994),


.,
SAGA (Notredame & Higgins, 1996),
(Hidden Markov
Models), ProbCons ProbAlign (Roshan, 2014) (
http://probalign.njit.edu/standalone.html). ,
,
, ,
.,
.

4.4.

,
.
;
; ,
,
. , . ,
,
,.

4.5: .

, , ,
(structural alignment). ,
,
.,
,.


,,
.,
,
,.,
,,
.,

, . ,
(gold standard),
.
,
,.
BAliBASE (Thompson, Plewniak, & Poch, 1999)
.
/4.1.



Database    URL                                                         Reference
BAliBASE    http://www-igbmc.u-strasbg.fr/BioInfo/BAliBASE/index.html   (Thompson et al., 1999)
OxBench     http://www.compbio.dundee.ac.uk/                            (Raghava, Searle, Audley, Barber, & Barton, 2003)
SABmark     http://bioinformatics.vub.ac.be/databases/databases.html    (Van Walle, Lasters, & Wyns, 2005)
PREFAB      http://drive5.com/muscle/prefab.htm                         (Edgar, 2004)
4.1:

,
, . ,
100%.,
,
,..
.,
,
(Thompson, Linard, Lecompte, & Poch, 2011),
(blocks) (Raghava, et al., 2003). ,
,
.:)
(
,
),)(
, ), )
-(fragments),
.
,

,
(Raghava,etal.,2003).APDB(O'Sullivanet
al.,2003).APDB,,
PDB,.,

, APDB
,.


,
,
(Pais, Ruy Pde, Oliveira, & Coimbra, 2014; Thompson, et al., 2011;
Thompson,etal.,1999).,,
, .
, . ,
,
,50%,
20%. T-Coffee, ProbCons ProbAlign
,
( ). ClustalW MUSCLE,
,.Prrp/Prrn
,.Kalign,,
10CLUSTALW(),
.,
, ,
,
(,).T-Coffee
,,,
Dialign (
).
,

. ,
(
). ,
.. ,
(,
)..
,
.
,.,
.
(CLUSTALW, T-Coffee, Dialign, Kalign, MUSCLE, ProbAlign, Prrp/Prrn),

.,(Windows,
Linux,Mac),COBALTPRALINE,(PSIBLAST),(,PRALINE
).

, ,
.

4.5.

,
, .
,(FASTA)
.
,Multi-FASTA,MSFCLUSTAL.Multi-FASTA,
FASTA.,
(>),
.,(-)
.,


.MSF,,
, . 3
,350
(),350,...,
.,
, ( //)
. CLUSTAL (
),(,
,),
60.
.
, *. : .
(,),
()().,
PHYLIP STOCKHOLM,

.
,

READSEQ
(http://www.ebi.ac.uk/Tools/sfc/readseq/),modulesBioPerl,BioPythonBioJava,
.
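As a minimal sketch of the Multi-FASTA layout described in this section (a header line starting with `>`, followed by the sequence lines, with `-` marking gaps), the following Python reads an aligned Multi-FASTA string and marks the fully conserved columns with `*`, in the spirit of the CLUSTAL consensus line. The function names `parse_fasta` and `conservation_line` are illustrative and do not belong to any of the packages mentioned above; the `:` and `.` symbols for similar residues are deliberately left out.

```python
def parse_fasta(text):
    """Parse (Multi-)FASTA text into {identifier: sequence}, keeping input order."""
    records = {}
    name = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # tolerate blank lines between records
        if line.startswith(">"):
            # the identifier is the first word after '>'; the rest is a description
            words = line[1:].split()
            name = words[0] if words else ""
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {key: "".join(chunks) for key, chunks in records.items()}


def conservation_line(alignment):
    """Return a string with '*' under every column whose residues are all identical."""
    seqs = list(alignment.values())
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("aligned sequences must have equal length")
    marks = []
    for column in zip(*seqs):
        identical = len(set(column)) == 1 and column[0] != "-"
        marks.append("*" if identical else " ")
    return "".join(marks)


# Hypothetical toy alignment (not taken from the text)
example = """>seq1 toy example
MGT-V
LS
>seq2
MGTAV
LA
"""
alignment = parse_fasta(example)
for name, seq in alignment.items():
    print(name, seq)
print("    ", conservation_line(alignment))
```

For real work, the conversion tools and libraries named above are the safer choice, since they also handle the MSF, PHYLIP and STOCKHOLM formats.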

4.6: , Strap.
, 3 . ,
JNET.

,,
, , .
,
,,
.,
,
. ,
(,
...).,.
(editors),,
( , ) ,
( ).
:
Jalview(http://www.jalview.org/)
Strap(http://www.bioinformatics.org/strap/)
Seqpup(http://iubio.bio.indiana.edu/soft/molbio/seqpup/java/seqpup-doc.html)
Seaview(http://pbil.univ-lyon1.fr/software/seaview.html)


Cinema(http://aig.cs.man.ac.uk/research/utopia/cinema/cinema.php)
Boxshade(http://www.ch.embnet.org/software/BOX_form.html)
Bioedit(http://www.mbio.ncsu.edu/BioEdit/bioedit.html)

,.
Desktop (
,applets).
.,,
. ,
,
.,,
,
. , ,
.
, , .
,
,.,,
(- ), .
,
,,
.,
PDB,
,
( , ). , , ,
,
, ,
,
.
,,
. ,
,1-2
.
. ,
.
.,
, ( ,
).,
. ,
(PFAM, PROSITE ), ,
( BLAST
).



4.7: Jalview.
RNA. , Jmol
RNA.

4.8: BioEdit. To BioEdit Editor,


: ( ),
, , ,
(BLAST, dot plot ), , ,
, .


,,
. /

. ,
, .
, . ,
, (signal peptide),
,(
). ,
,
- (, ...). ,
,,
, .
,,(
!).
, ,
. ,
profilealignment,
.


Barton, G. J., & Sternberg, M. J. (1987). A strategy for the rapid multiple alignment of protein sequences. Confidence levels from tertiary structure comparisons. J Mol Biol, 198(2), 327-337.
Carrillo, H., & Lipman, D. (1988). The multiple sequence alignment problem in biology. SIAM Journal on Applied Mathematics, 48(5), 1073-1082.
Corpet, F. (1988). Multiple sequence alignment with hierarchical clustering. Nucleic Acids Res, 16(22), 10881-10890.
Durbin, R., Eddy, S., Krogh, A., & Mitchison, G. (1998). Biological sequence analysis, probabilistic models of proteins and nucleic acids. Cambridge University Press.
Duret, L., & Abdeddaim, S. (2000). Multiple alignment for structural, functional, or phylogenetic analyses of homologous sequences. Bioinformatics: Sequence, Structure, and Databanks, 51-76.
Edgar, R. C. (2004). MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics, 5, 113.
Edgar, R. C., & Sjolander, K. (2004). A comparison of scoring functions for protein sequence profile alignment. Bioinformatics, 20(8), 1301-1308.
Feng, D.-F., & Doolittle, R. F. (1987). Progressive sequence alignment as a prerequisite to correct phylogenetic trees. Journal of molecular evolution, 25(4), 351-360.
Fitch, W. M., & Margoliash, E. (1967). Construction of phylogenetic trees. Science, 155(3760), 279-284.
Gonnet, G. H., Cohen, M. A., & Benner, S. A. (1992). Exhaustive matching of the entire protein sequence database. Science, 256(5062), 1443-1445.
Gotoh, O. (1996). Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments. J Mol Biol, 264(4), 823-838.
Higgins, D. G., Bleasby, A. J., & Fuchs, R. (1992). CLUSTAL V: improved software for multiple sequence alignment. Computer applications in the biosciences: CABIOS, 8(2), 189-191.
Kim, J., Pramanik, S., & Chung, M. J. (1994). Multiple sequence alignment using simulated annealing. Comput Appl Biosci, 10(4), 419-426.
Lassmann, T., & Sonnhammer, E. L. (2005). Kalign--an accurate and fast multiple sequence alignment algorithm. BMC Bioinformatics, 6, 298.
Lipman, D. J., Altschul, S. F., & Kececioglu, J. D. (1989). A tool for multiple sequence alignment. Proceedings of the National Academy of Sciences, 86(12), 4412-4415.
Magis, C., Taly, J. F., Bussotti, G., Chang, J. M., Di Tommaso, P., Erb, I., ... Notredame, C. (2014). T-Coffee: Tree-based consistency objective function for alignment evaluation. Methods Mol Biol, 1079, 117-129.
Morgenstern, B. (2014). Multiple sequence alignment with DIALIGN. Methods Mol Biol, 1079, 191-202.
Notredame, C., & Higgins, D. G. (1996). SAGA: sequence alignment by genetic algorithm. Nucleic Acids Res, 24(8), 1515-1524.
O'Sullivan, O., Zehnder, M., Higgins, D., Bucher, P., Grosdidier, A., & Notredame, C. (2003). APDB: a novel measure for benchmarking sequence alignment methods without reference alignments. Bioinformatics, 19 Suppl 1, i215-221.
Pais, F. S., Ruy Pde, C., Oliveira, G., & Coimbra, R. S. (2014). Assessing the efficiency of multiple sequence alignment programs. Algorithms Mol Biol, 9(1), 4.
Papadopoulos, J. S., & Agarwala, R. (2007). COBALT: constraint-based alignment tool for multiple protein sequences. Bioinformatics, 23(9), 1073-1079.


Raghava, G. P., Searle, S. M., Audley, P. C., Barber, J. D., & Barton, G. J. (2003). OXBench: a benchmark for evaluation of protein multiple sequence alignment accuracy. BMC Bioinformatics, 4, 47.
Roshan, U. (2014). Multiple sequence alignment using Probcons and Probalign. Methods Mol Biol, 1079, 147-153.
Saitou, N., & Nei, M. (1987). The neighbor-joining method: a new method for reconstructing phylogenetic trees. Molecular biology and evolution, 4(4), 406-425.
Simossis, V. A., & Heringa, J. (2005). PRALINE: a multiple sequence alignment toolbox that integrates homology-extended and secondary structure information. Nucleic Acids Res, 33(Web Server issue), W289-294.
Thompson, J. D., Gibson, T. J., & Higgins, D. G. (2002). Multiple sequence alignment using ClustalW and ClustalX. Curr Protoc Bioinformatics, Chapter 2, Unit 2.3.
Thompson, J. D., Higgins, D. G., & Gibson, T. J. (1994). CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic acids research, 22(22), 4673-4680.
Thompson, J. D., Linard, B., Lecompte, O., & Poch, O. (2011). A comprehensive benchmark study of multiple sequence alignment methods: current challenges and future perspectives. PLoS One, 6(3), e18093.
Thompson, J. D., Plewniak, F., & Poch, O. (1999). A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res, 27(13), 2682-2690.
Van Walle, I., Lasters, I., & Wyns, L. (2005). SABmark--a benchmark for sequence alignment that covers the entire known fold space. Bioinformatics, 21(7), 1267-1268.
Wallace, I. M., O'Sullivan, O., & Higgins, D. G. (2005). Evaluation of iterative alignment algorithms for multiple alignment. Bioinformatics, 21(8), 1408-1414.
Wang, G., & Dunbrack, R. L., Jr. (2004). Scoring profile-to-profile sequence alignments. Protein Sci, 13(6), 1612-1626.
Waterman, M. S. (1995). Introduction to Computational Biology. Chapman and Hall, London.
Wu, S., & Manber, U. (1992). Fast text searching allowing errors. Communications of the ACM, 35(10), 83-91.


Multi-FASTA

>sw:CD5R_BOVIN Q28199 Cyclin-dependent kinase 5 activator 1 precursor


MGTVLSLSPSYRKATLFEDGAATVGHYTAVQNSKNAKDKNLKRHSIISVLPWKRIVAVSA
KKKNSKKVQPNSSYQNNITHLNNENLKKSLSCANLSTFAQPPPAQPPAPPASQLSGSQTG
VSSSVKKAPHPAVSSAGTPKRVIVQASTSELLRCLGEFLCRRCYRLKHLSPTDPVLWLRS
VDRSLLLQGWQDQGFITPANVVFLYMLCRDVISSEVGSDHELQAVLLTCLYLSYSYMGNE
ISYPLKPFLVESCKEAFWDRCLSVINLMSSKMLQINADPHYFTQVFSDLKNESGQEDKKR
LLLGLDR
>sw:CD5R_HUMAN Q15078 Cyclin-dependent kinase 5 activator 1 precursor
MGTVLSLSPSYRKATLFEDGAATVGHYTAVQNSKNAKDKNLKRHSIISVLPWKRIVAVSA
KKKNSKKVQPNSSYQNNITHLNNENLKKSLSCANLSTFAQPPPAQPPAPPASQLSGSQTG
GSSSVKKAPHPAVTSAGTPKRVIVQASTSELLRCLGEFLCRRCYRLKHLSPTDPVLWLRS
VDRSLLLQGWQDQGFITPANVVFLYMLCRDVISSEVGSDHELQAVLLTCLYLSYSYMGNE
ISYPLKPFLVESCKEAFWDRCLSVINLMSSKMLQINADPHYFTQVFSDLKNESGQEDKKR
LLLGLDR
>sw:CD5R_MOUSE Q62938 Cyclin-dependent kinase 5 activator 1 precursor
MGTVLSLSPSYRKATLFEDGAATVGHYTAVQNSKNAKDKNLKRHSIISVLPWKRIVAVSA
KKKNSKKAQPNSSYQSNIAHLNNENLKKSLSCANLSTFAQPPPAQPPAPPASQLSGSQTG
VSSSVKKAPHPAITSAGTPKRVIVQASTSELLRCLGEFLCRRCYRLKHLSPTDPVLWLRS
VDRSLLLQGWQDQGFITPANVVFLYMLCRDVISSEVGSDHELQAVLLTCLYLSYSYMGNE
ISYPLKPFLVESCKEAFWDRCLSVINLMSSKMLQINADPHYFTQVFSDLKNESGQEDKKR
LLLGLDR

MSF
MSF: 307 Type: P Check: 4977 ..
Name: CD5R_BOVIN oo Len: 307 Check: 5281 Weight: 33.3
Name: CD5R_HUMAN oo Len: 307 Check: 5196 Weight: 33.3
Name: CD5R_MOUSE oo Len: 307 Check: 4500 Weight: 33.3
//
CD5R_BOVIN MGTVLSLSPS YRKATLFEDG AATVGHYTAV QNSKNAKDKN LKRHSIISVL
CD5R_HUMAN MGTVLSLSPS YRKATLFEDG AATVGHYTAV QNSKNAKDKN LKRHSIISVL
CD5R_MOUSE MGTVLSLSPS YRKATLFEDG AATVGHYTAV QNSKNAKDKN LKRHSIISVL

CD5R_BOVIN PWKRIVAVSA KKKNSKKVQP NSSYQNNITH LNNENLKKSL SCANLSTFAQ
CD5R_HUMAN PWKRIVAVSA KKKNSKKVQP NSSYQNNITH LNNENLKKSL SCANLSTFAQ
CD5R_MOUSE PWKRIVAVSA KKKNSKKAQP NSSYQSNIAH LNNENLKKSL SCANLSTFAQ

CD5R_BOVIN PPPAQPPAPP ASQLSGSQTG VSSSVKKAPH PAVSSAGTPK RVIVQASTSE
CD5R_HUMAN PPPAQPPAPP ASQLSGSQTG GSSSVKKAPH PAVTSAGTPK RVIVQASTSE
CD5R_MOUSE PPPAQPPAPP ASQLSGSQTG VSSSVKKAPH PAITSAGTPK RVIVQASTSE

CD5R_BOVIN LLRCLGEFLC RRCYRLKHLS PTDPVLWLRS VDRSLLLQGW QDQGFITPAN
CD5R_HUMAN LLRCLGEFLC RRCYRLKHLS PTDPVLWLRS VDRSLLLQGW QDQGFITPAN
CD5R_MOUSE LLRCLGEFLC RRCYRLKHLS PTDPVLWLRS VDRSLLLQGW QDQGFITPAN

CD5R_BOVIN VVFLYMLCRD VISSEVGSDH ELQAVLLTCL YLSYSYMGNE ISYPLKPFLV
CD5R_HUMAN VVFLYMLCRD VISSEVGSDH ELQAVLLTCL YLSYSYMGNE ISYPLKPFLV
CD5R_MOUSE VVFLYMLCRD VISSEVGSDH ELQAVLLTCL YLSYSYMGNE ISYPLKPFLV

CD5R_BOVIN ESCKEAFWDR CLSVINLMSS KMLQINADPH YFTQVFSDLK NESGQEDKKR
CD5R_HUMAN ESCKEAFWDR CLSVINLMSS KMLQINADPH YFTQVFSDLK NESGQEDKKR
CD5R_MOUSE ESCKEAFWDR CLSVINLMSS KMLQINADPH YFTQVFSDLK NESGQEDKKR

CD5R_BOVIN LLLGLDR
CD5R_HUMAN LLLGLDR
CD5R_MOUSE LLLGLDR

CLUSTAL
CLUSTAL W (1.82) multiple sequence alignment
CD5R_BOVIN MGTVLSLSPSYRKATLFEDGAATVGHYTAVQNSKNAKDKNLKRHSIISVLPWKRIVAVSA
CD5R_HUMAN MGTVLSLSPSYRKATLFEDGAATVGHYTAVQNSKNAKDKNLKRHSIISVLPWKRIVAVSA
CD5R_MOUSE MGTVLSLSPSYRKATLFEDGAATVGHYTAVQNSKNAKDKNLKRHSIISVLPWKRIVAVSA
************************************************************
CD5R_BOVIN KKKNSKKVQPNSSYQNNITHLNNENLKKSLSCANLSTFAQPPPAQPPAPPASQLSGSQTG
CD5R_HUMAN KKKNSKKVQPNSSYQNNITHLNNENLKKSLSCANLSTFAQPPPAQPPAPPASQLSGSQTG
CD5R_MOUSE KKKNSKKAQPNSSYQSNIAHLNNENLKKSLSCANLSTFAQPPPAQPPAPPASQLSGSQTG
*******.*******.**:*****************************************
CD5R_BOVIN VSSSVKKAPHPAVSSAGTPKRVIVQASTSELLRCLGEFLCRRCYRLKHLSPTDPVLWLRS
CD5R_HUMAN GSSSVKKAPHPAVTSAGTPKRVIVQASTSELLRCLGEFLCRRCYRLKHLSPTDPVLWLRS
CD5R_MOUSE VSSSVKKAPHPAITSAGTPKRVIVQASTSELLRCLGEFLCRRCYRLKHLSPTDPVLWLRS
***********::***********************************************
CD5R_BOVIN VDRSLLLQGWQDQGFITPANVVFLYMLCRDVISSEVGSDHELQAVLLTCLYLSYSYMGNE
CD5R_HUMAN VDRSLLLQGWQDQGFITPANVVFLYMLCRDVISSEVGSDHELQAVLLTCLYLSYSYMGNE
CD5R_MOUSE VDRSLLLQGWQDQGFITPANVVFLYMLCRDVISSEVGSDHELQAVLLTCLYLSYSYMGNE
************************************************************
CD5R_BOVIN ISYPLKPFLVESCKEAFWDRCLSVINLMSSKMLQINADPHYFTQVFSDLKNESGQEDKKR
CD5R_HUMAN ISYPLKPFLVESCKEAFWDRCLSVINLMSSKMLQINADPHYFTQVFSDLKNESGQEDKKR
CD5R_MOUSE ISYPLKPFLVESCKEAFWDRCLSVINLMSSKMLQINADPHYFTQVFSDLKNESGQEDKKR
************************************************************
CD5R_BOVIN LLLGLDR
CD5R_HUMAN LLLGLDR
CD5R_MOUSE LLLGLDR
*******


5:

.
PROSITE
. ,
(PSSMs)
(profiles),
. ,
.

3 4.

5.

, ,
:
,
.,(patterns)
PROSITE,
(regular expressions) UNIX.
, ,
. ,
(profiles) (Position Specific Scoring Matrices). ,
,
,.
,
( ),
.

5.1.

5.1.1

( ,
),.
,,,
,
. ,
,,

. , ,
,
,.
,(patterns)
(
regularexpressions).,

, ( )
, ( ).
PROSITE,
UNIX
.


(5.1), ,
,.

,,
().PROSITE:

IUPAC.
,
(-).
.
,
(..,...)

,[],[ACG]
A,G,C.
,
x.
/,
{}. ,
{} DNA
[CGT]. ,
.
().(3)
--,x(3)x-x-x(3).,
.,x(2,4)x-x,x-x-x,
x-x-x-x.
< > .
,
<A-x
'>'
. , P-R-L-[G>]
P-R-L-GP-R-L>.

5.1: . .
.
.

, , ,
, ..
( 5.2). , ,



(5.3).

5.2: DNA.

2, PROSITE (http://www.expasy.ch/prosite/)
(sequence domains)
(Sigrist et al., 2010). ,

,,
().PROSITE
1700 . , 1308 , 1107 1105 ""
(
).,
(, ).
Uniprot,
""()
. ,
,
"",
.
,(regularexpressions)PROSITE
.:
(-).
(.)x
^ ,
{}.
,PROSITE:
[RK]-G-{EDRKHPCG}-[AGSCI]-[FY]-[LIVA]-x-[FYM]
:
[RK]G[^EDRKHPCG][AGSCI][FY][LIVA].[FYM]
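The translation from PROSITE syntax to a regular expression can be sketched in a few lines of Python. This is an illustrative converter, not the one used by PROSITE or ScanProsite; the name `prosite_to_regex` is hypothetical, and only the constructs described in this section are handled (`x`, `[...]`, `{...}`, repetitions such as `x(3)` and `x(2,4)`, the anchors `<` and `>`, and `>` inside square brackets).

```python
import re

# one pattern element: a bracket group, an exclusion group or a single symbol,
# optionally followed by a repetition such as (3) or (2,4)
_ELEMENT = re.compile(
    r"^(?P<core>\[[A-Z<>]+\]|\{[A-Z]+\}|[A-Za-z])"
    r"(?:\((?P<lo>\d+)(?:,(?P<hi>\d+))?\))?$"
)


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern into a Python regular expression string."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        match = _ELEMENT.match(element)
        if match is None:
            raise ValueError("cannot parse pattern element: %r" % element)
        core = match.group("core")
        if core in ("x", "X"):           # any residue
            body = "."
        elif core.startswith("{"):       # any residue except those listed
            body = "[^" + core[1:-1] + "]"
        elif core.startswith("[") and ">" in core:
            # e.g. [G>]: either one of the listed residues or the end of the sequence
            body = "(?:[" + core[1:-1].replace(">", "") + "]|$)"
        else:                            # a single residue or a [..] group
            body = core
        if match.group("lo"):
            if match.group("hi"):
                body += "{%s,%s}" % (match.group("lo"), match.group("hi"))
            else:
                body += "{%s}" % match.group("lo")
        parts.append(prefix + body + suffix)
    return "".join(parts)


regex = prosite_to_regex("[RK]-G-{EDRKHPCG}-[AGSCI]-[FY]-[LIVA]-x-[FYM]")
print(regex)
```

The resulting string can be passed directly to `re.search` or `re.finditer` to scan a protein sequence.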
,
1980 , PROSITE
,

, . 5.3
.,
5.4 , 15
.
,(nuclearlocalizationsignals-NLSs)
, , ,
. Cokol, Nair, and Rost,


214(
91 ) (Cokol, Nair, & Rost, 2000). ,
.
(peroxisomaltargetingsignal-PTS1)(S-K-L).2004,

Petriv,PTS2(Petriv,Tang,
Titorenko,&Rachubinski,2004).

5.3: .

,.
(signal peptide)
,
,
. ,
(17-30,
), .
1980,PS00013,
PROSITE ( 5.3). ' ,
,[LVI]-[ASTVI]-[GAS]-C
(DOLOP).2002,
SutcliffeHarringtonGram,
,
(Sutcliffe&Harrington,2002).()
5.4.
,
(twin-arginine translocation). ,
SEC.
, ,
.,
(),
2(R-Rx-[FGAVML]-[LITMVF])..
,
, ,
, . ,
SEC. , 2010 Shruthi, Babu
Sankaran,
(R-R-x-[FGAVML]-[LITMVF][LVI]-[ASTVI]-[GAS]-C),
-
(Shruthi,Babu,&Sankaran,2010).


,,
- Gram . 2004, Berven ,
- (Berven,
Flikka,Jensen,&Eidhammer,2004).,
,
.
, ,
,
.

5.4: ,
15 .

5.1.2

, . , .
, . ,
,
.
.,.
PROSITE(regularexpressions),

.12
Perl,
UNIX(grep,egrep).
,
,.
. 5.1 3
[], ,
,()
().100,
6535.(
),G,
;([AGT]),
.
,
. ,


PROSITE100%.
,
,

(,...).,
, ,
. , (sequence profiles)
(PSSMs),.
,
.5.1
, ()
.,1,4,56
23...8
(C); , .
,
(),
HiddenMarkovModels(HMMs)8.

5.5: 2 .

, ,
.,
. 5.5
(,)3
7.,350%G50%A,
7 50% 50% C. PROSITE
G(3)C(7),G(3)T(7).
,G(3)C(7),(3)
T(7).,,
,
.
,
(8),(7),
(10).

5.1.3

:,
(5.6). ,

.,


,
,
..
, 3 (Brazma, Jonassen, Eidhammer, &
Gilbert,1998):
:
.PROSITE,
(..
,
).
:
. ,
,
.
:
, NP-complete, (heuristic)
(greedy),(..
).,
, (Expectation-Maximization) Gibbs
sampler.

5.6: .

,PRATT
(http://web.expasy.org/pratt/). PRATT ,
PROSITE(Jonassen,Collins,&
Higgins, 1995).
, .. ,


,.PRATT
,
.
MEME(http://meme-suite.org/tools/meme)
(MultipleEMForMotifElicitation).,
().,

(
).

(Bailey&Elkan,1994).
,Gibbs Motif Sampler,
Gibbssampler(http://ccmbweb.ccv.brown.edu/gibbs/gibbs.html).
,
,
(Thompson,Rouchka,&Lawrence,2003).
, TEIRESIAS
, (Rigoutsos &
Floratos,1998).(combinatorial)
,
., ,

.TEIRESIAShttps://cm.jefferson.edu/Teiresias/,
DNA,

5.2.

Weight Matrices, Profiles PSSMs

.,
,
. ,

,
. , (weight
matrices)(profiles).,kxp,k
p (
).,i
pb(i) ( 5.7). nb(i)
bi,pb(i)bi,
:

\[ p_b(i) = \frac{n_b(i)}{\sum_{b'} n_{b'}(i)} \tag{5.1} \]

.
,
.5.1,3,
ATTGAACTA:
\[ P(x) = \prod_{i=1}^{p} p_{x_i}(i) = P(x_1 = A)\,P(x_2 = T)\,P(x_3 = T) \cdots P(x_{10} = A) = 0.074 \tag{5.2} \]

, ATGCA 0.00155 (
ATGAACTA0,1).,



. ,

3.,
:

\[ s_b(i) = \log \frac{p_b(i)}{p_b} \tag{5.3} \]
pb ( ) (
) pb(i)
.
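Equations (5.1) to (5.3) can be illustrated with a short Python sketch. It builds the per-column frequencies p_b(i) from a set of aligned sites, multiplies them to obtain the probability of a candidate sequence, and sums the log-odds scores against a background distribution. The sites below are a hypothetical toy example, the function names are illustrative, and the optional `pseudocount` argument is a plain add-a-constant correction that is assumed here for demonstration and need not coincide with the pseudocount scheme discussed in the text.

```python
import math


def frequency_matrix(sites, alphabet="ACGT", pseudocount=0.0):
    """Column-wise frequencies p_b(i), equation (5.1), from equal-length sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    matrix = []
    for i in range(length):
        column = [s[i] for s in sites]
        total = len(column) + pseudocount * len(alphabet)
        matrix.append({b: (column.count(b) + pseudocount) / total for b in alphabet})
    return matrix


def sequence_probability(matrix, seq):
    """Product of the column probabilities, equation (5.2)."""
    prob = 1.0
    for i, b in enumerate(seq):
        prob *= matrix[i][b]
    return prob


def log_odds_score(matrix, seq, background=None):
    """Sum of s_b(i) = log(p_b(i) / p_b), equation (5.3); uniform background by default."""
    score = 0.0
    for i, b in enumerate(seq):
        p = matrix[i][b]
        q = background[b] if background else 1.0 / len(matrix[i])
        if p == 0.0:
            return float("-inf")  # a residue never observed at this position
        score += math.log(p / q)
    return score


sites = ["ATG", "ATG", "ACG", "TTG"]  # hypothetical aligned sites
pwm = frequency_matrix(sites)
print(sequence_probability(pwm, "ATG"))
print(log_odds_score(pwm, "ATG"))
print(log_odds_score(pwm, "CTG"))
```

The last call shows the problem of zero frequencies: a single unseen residue drives the score to minus infinity, which is what pseudocounts are meant to avoid.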
,
. ,
100%
().,

,(3.19)-.,.
,
(..-10,000),
. ,
.,(3.19):

p i z
b
b
p k z
b
s 1 s

sb i log

(5.4)


.,(5.1)
. , ,
. ,
,.



5.7: (weight matrix)
(PSSM), .

,
. ,
.,

(Staden,1990).
(
,..
).,
(,
)(Barton&Sternberg,1990).
.
,
. ,
,,.


5.8: .

,
,.profileanalysis
(Gribskov,
McLachlan,&Eisenberg,1987).(position
specific scoring matrix, PSSM),
(PAM,BLOSUM) .,
. ,
.
,,
.
, . ,

. ,
,
(
).
, ,

.,:
\[ s_b(i) = \sum_{j=1}^{k} p_j(i)\, S_{bj} \tag{5.5} \]

pj(i) j i
( ), Sbj
(BLOSUM62)bj.
,
,
,
.,
PAM, BLOSUM45 ,
,,
(Lüthy, Xenarios, & Bucher, 1994).
,Gumbel,
. ,


. ,
, (..
60%).

5.9: PSSM.
20 . , ..
,
. 7 8, ,
, .

5.3.

,
PSSMs,.
ScanProsite (http://prosite.expasy.org/scanprosite/). ScanProsite
PROSITE,,,
.
PROSITE,
...(DeCastroetal.,2006)
PFTOOLS (http://web.expasy.org/pftools/)
(Bucher,Karplus,Moeri,&Hofmann,1996).
PFTOOLS ,
(,weightmatrices,PSSMs),
(generalized profile).
,HiddenMarkovModel
8. PFTOOLS ,
,:
pfmake:
pfscale: Gumbel

pfw:
-.
pfsearch:
DNA.
pfscan: DNA
.


,
,
HMMER 8 (psa2msa, gtop, htop, ptoh)
DNA(ptof,2ft,6ft).
PSSM PSI-BLAST
(Position-specific-iterated BLAST) (Altschul et al., 1997).
BLAST.(5.10):
BLASTvalue .
PSSM,
.
, ,
-value.,
,
(34).
(),
. , ,
,

,(contamination).
PSI-BLAST DELTA-BLAST (domain enhanced lookup
timeacceleratedBLAST),PSSM,

.,ConservedDomainDatabase(CDD)NCBI,
, PSIBLAST,
(Boratynetal.,2012).
PHI-BLAST(pattern-hitinitiatedBLAST)BLAST,
(Zhang et al., 1998).
,
, . ,

(5.11).

5.10: PSI-BLAST.



5.11: PHI-BLAST.

,
, WebLogo (http://weblogo.berkeley.edu/) (Crooks, Hon,
Chandonia, & Brenner, 2004). WebLogo
(Sequence Logo) Schneider Stephens (Schneider & Stephens, 1990)
,
.
,:

(5.6)
R S max Sobs log 2 k nb i log pb i
b

, Smax Sobs
3. k ,
(2.2bitsDNA/RNA~4.32).
, .
,
,.
,(5.6).
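A minimal Python sketch of the quantity behind a sequence logo follows. It assumes the standard Schneider and Stephens definition, in which the information content of a column is the maximum entropy log2(k) minus the observed entropy computed from the residue frequencies; the small-sample correction used by some logo programs is omitted, and the function names are illustrative.

```python
import math
from collections import Counter


def information_content(column, alphabet_size):
    """Information content (bits) of one alignment column: log2(k) minus the entropy."""
    counts = Counter(column)
    total = sum(counts.values())
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(alphabet_size) - entropy


def letter_heights(column, alphabet_size):
    """Height of each letter in the logo: its frequency times the column information."""
    counts = Counter(column)
    total = sum(counts.values())
    info = information_content(column, alphabet_size)
    return {b: (n / total) * info for b, n in counts.items()}


print(information_content("AAAA", 4))   # a perfectly conserved DNA column
print(information_content("ACGT", 4))   # a column with no preference at all
print(letter_heights("AAGG", 4))
print(math.log2(20))                    # upper bound for a protein column
```

The upper bound is 2 bits for DNA or RNA and about 4.32 bits for proteins, which matches the values quoted above.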

5.12: To Sequence Logo 5.1.


5.13: Sequence Logo DNA. ,


-, EID (Exon-Intron
database). , 350 E. coli.
, ,
.



5.14: Sequence Logo .
(PTS1). ,
.
PROSITE,
.


Altschul, S. F., Madden, T. L., Schaffer, A. A., Zhang, J., Zhang, Z., Miller, W., & Lipman, D. J. (1997). Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic acids research, 25(17), 3389-3402.
Bailey, T. L., & Elkan, C. (1994). Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol, 2, 28-36.
Barton, G. J., & Sternberg, M. J. (1990). Flexible protein sequence patterns: A sensitive method to detect weak structural similarities. Journal of molecular biology, 212(2), 389-402.
Berven, F. S., Flikka, K., Jensen, H. B., & Eidhammer, I. (2004). BOMP: a program to predict integral beta-barrel outer membrane proteins encoded within genomes of Gram-negative bacteria. Nucleic Acids Res, 32(Web Server Issue), W394-W399.
Boratyn, G. M., Schaffer, A. A., Agarwala, R., Altschul, S. F., Lipman, D. J., & Madden, T. L. (2012). Domain enhanced lookup time accelerated BLAST. Biol Direct, 7(1), 12.
Brazma, A., Jonassen, I., Eidhammer, I., & Gilbert, D. (1998). Approaches to the automatic discovery of patterns in biosequences. Journal of computational biology, 5(2), 279-305.
Bucher, P., Karplus, K., Moeri, N., & Hofmann, K. (1996). A flexible motif search technique based on generalized profiles. Computers & chemistry, 20(1), 3-23.
Cokol, M., Nair, R., & Rost, B. (2000). Finding nuclear localization signals. EMBO reports, 1(5), 411-415.
Crooks, G. E., Hon, G., Chandonia, J. M., & Brenner, S. E. (2004). WebLogo: a sequence logo generator. Genome Res, 14(6), 1188-1190.
De Castro, E., Sigrist, C. J., Gattiker, A., Bulliard, V., Langendijk-Genevaux, P. S., Gasteiger, E., ... Hulo, N. (2006). ScanProsite: detection of PROSITE signature matches and ProRule-associated functional and structural residues in proteins. Nucleic Acids Research, 34(suppl 2), W362-W365.
Gribskov, M., McLachlan, A. D., & Eisenberg, D. (1987). Profile analysis: detection of distantly related proteins. Proc Natl Acad Sci U S A, 84(13), 4355-4358.
Jonassen, I., Collins, J. F., & Higgins, D. G. (1995). Finding flexible patterns in unaligned protein sequences. Protein Science, 4(8), 1587-1595.
Lüthy, R., Xenarios, I., & Bucher, P. (1994). Improving the sensitivity of the sequence profile method. Protein Science, 3(1), 139-146.
Petriv, I., Tang, L., Titorenko, V. I., & Rachubinski, R. A. (2004). A new definition for the consensus sequence of the peroxisome targeting signal type 2. Journal of molecular biology, 341(1), 119-134.
Rigoutsos, I., & Floratos, A. (1998). Combinatorial pattern discovery in biological sequences: The TEIRESIAS algorithm. Bioinformatics, 14(1), 55-67.
Schneider, T. D., & Stephens, R. M. (1990). Sequence logos: a new way to display consensus sequences. Nucleic Acids Res, 18(20), 6097-6100.
Shruthi, H., Babu, M. M., & Sankaran, K. (2010). TAT-pathway-dependent lipoproteins as a niche-based adaptation in prokaryotes. Journal of Molecular Evolution, 70(4), 359-370.
Sigrist, C. J., Cerutti, L., de Castro, E., Langendijk-Genevaux, P. S., Bulliard, V., Bairoch, A., & Hulo, N. (2010). PROSITE, a protein domain database for functional characterization and annotation. Nucleic Acids Res, 38(Database issue), D161-166.
Staden, R. (1990). Searching for patterns in protein and nucleic acid sequences. Methods in enzymology, 183, 193-211.
Sutcliffe, I. C., & Harrington, D. J. (2002). Pattern searches for the identification of putative lipoprotein genes in Gram-positive bacterial genomes. Microbiology, 148(Pt 7), 2065-2077.


Thompson, W., Rouchka, E. C., & Lawrence, C. E. (2003). Gibbs Recursive Sampler: finding transcription factor binding sites. Nucleic Acids Research, 31(13), 3580-3585.
Zhang, Z., Miller, W., Schäffer, A. A., Madden, T. L., Lipman, D. J., Koonin, E. V., & Altschul, S. F. (1998). Protein sequence similarity searches using patterns as seeds. Nucleic Acids Research, 26(17), 3986-3990.


6:

, ,
,
.
. ,
, . ,
,
, .

. .
3 4.

6.


,,
( , ),
,.
(, )
Darwin .
()
(), Theodosius Dobzhansky: Nothing
in Biology makes sense except in the light of evolution. ,


,

.
,
.Darwin,

, . ,
(, ),
(
). , ,
,
, ,
.

6.1.

.
, ,
. ( ),
.
(),
.
(orthologues),
. , , , ,


(paralogues),
.-(..
, , ..),
(..),,,,,,,
., (xenologues),

().

6.1: (), 8 (). .

,
.
.,
,
,,
. ,
,(taxa).,
, . ,
,.,
,(
,
).6.128
.
,
(Brinkman&Leipe,2001):
( )
. ( ) ,
.
.
taxon ,
.

.


6.2:

http://commons.wikimedia.org/wiki/File:Phylogenetic_Tree_of_Life.png)

(:

6.3: ,
. 6.2
. , ,
, (:
http://creationwiki.org/Macroevolution)


(rooted).
, (
, ).
(unrooted),
. (
).(6.4)
(),(),5.

6.4: 5 . () , ()

.
(,),
,
(-outgroup).
L ,
2L-1(1,2,LLL+1,L+2,,2L-1
),2L-3.N
L,
\[ N_{rooted}(L) = \frac{(2L-3)!}{2^{L-2}\,(L-2)!} \]
,:
\[ N_{unrooted}(L) = \frac{(2L-5)!}{2^{L-3}\,(L-3)!} \]
,L=10, 35.
2..2L-3
.
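The growth of the number of possible topologies is easy to check numerically. The sketch below evaluates the two closed-form expressions for rooted and unrooted binary trees with L labelled leaves; the function names are illustrative.

```python
from math import factorial


def count_rooted_trees(n_leaves):
    """Number of rooted binary trees: (2L-3)! / (2^(L-2) (L-2)!)."""
    if n_leaves < 2:
        raise ValueError("at least 2 leaves are required")
    return factorial(2 * n_leaves - 3) // (2 ** (n_leaves - 2) * factorial(n_leaves - 2))


def count_unrooted_trees(n_leaves):
    """Number of unrooted binary trees: (2L-5)! / (2^(L-3) (L-3)!)."""
    if n_leaves < 3:
        raise ValueError("at least 3 leaves are required")
    return factorial(2 * n_leaves - 5) // (2 ** (n_leaves - 3) * factorial(n_leaves - 3))


for n in (3, 4, 5, 10, 20):
    print(n, count_unrooted_trees(n), count_rooted_trees(n))
```

For L = 10 the rooted count is 34,459,425, which is the roughly 35 million quoted above; each unrooted tree with L leaves has 2L-3 branches on which a root can be placed, which is why the rooted count equals the unrooted count times 2L-3.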
,
:
.,.
, . ,
.
,
. (
), ,
.
.,
. , ,
... , ,



.
. ,
. ,
,
.
(,
), . ,
.

6.2.

,
,,
.

Markov.
8,
,
.,(Durbin,Eddy,Krogh,&Mitchison,1998):

\[ p_{ab}^{t} = P(x_i = b \mid x_i = a, t) \tag{6.1} \]
bix
t.,Markov.
DNA x x1 , x2 ,..., xn y y1 , y 2 ,..., y n , x
yt:
\[ P(x \mid y, t) = \prod_{i=1}^{n} P(x_i \mid y_i, t) \tag{6.2} \]

, 4x4
t,

\[ S(t) = \begin{pmatrix}
P(A \mid A, t) & P(T \mid A, t) & P(G \mid A, t) & P(C \mid A, t) \\
P(A \mid T, t) & P(T \mid T, t) & P(G \mid T, t) & P(C \mid T, t) \\
P(A \mid G, t) & P(T \mid G, t) & P(G \mid G, t) & P(C \mid G, t) \\
P(A \mid C, t) & P(T \mid C, t) & P(G \mid C, t) & P(C \mid C, t)
\end{pmatrix} \tag{6.3} \]

\[ p_{i,j} \ge 0 \quad \forall\, i,j = 1,2,3,4, \qquad \sum_{j=1}^{4} p_{i,j} = 1 \quad \forall\, i \tag{6.4} \]

.
Markov,,
(Lio & Goldman, 1998). , Markov
(homogeneity),.
(equilibrium).,Markov
(stationary),
. , (reversibility)
, .
3 ,
.
,Chapman-Kolmogorov:


\[ S(t)\,S(s) = S(t+s) \tag{6.5} \]
,
(SubstitutionRateMatrix)R:

(6.6)

3,:
=-(a++)
0.:

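\[ S(t) = \exp(Rt) = I + Rt + \frac{(Rt)^2}{2!} + \frac{(Rt)^3}{3!} + \cdots \]
(spectral decomposition):
\[ S(t) = U\,\mathrm{diag}\left(e^{\lambda_1 t}, \ldots, e^{\lambda_4 t}\right) U^{-1} \]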

i(eigenvalues)R,U.O
:
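\[ S(\varepsilon) \simeq I + R\varepsilon \;\Rightarrow\; S(t+\varepsilon) = S(t)\,S(\varepsilon) \simeq S(t)\,(I + R\varepsilon) \]
\[ \frac{S(t+\varepsilon) - S(t)}{\varepsilon} \simeq \frac{S(t)\,(I + R\varepsilon) - S(t)}{\varepsilon} = S(t)\,R \]
\varepsilon \to 0 :
\[ S'(t) = S(t)\,R \tag{6.7} \]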
.
(6.6)a==:


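\[ R = \begin{pmatrix}
-3\alpha & \alpha & \alpha & \alpha \\
\alpha & -3\alpha & \alpha & \alpha \\
\alpha & \alpha & -3\alpha & \alpha \\
\alpha & \alpha & \alpha & -3\alpha
\end{pmatrix}, \qquad
S(t) = \begin{pmatrix}
r_t & s_t & s_t & s_t \\
s_t & r_t & s_t & s_t \\
s_t & s_t & r_t & s_t \\
s_t & s_t & s_t & r_t
\end{pmatrix} \]
\[ r_t = \tfrac{1}{4}\left(1 + 3e^{-4\alpha t}\right), \qquad s_t = \tfrac{1}{4}\left(1 - e^{-4\alpha t}\right) \tag{6.8} \]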

JukesCantor(Jukes&
Cantor,1969)Poisson(,JC69).
,
.
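The closed-form JC69 probabilities of equation (6.8) can be checked numerically against the properties required of a substitution matrix. The following pure-Python sketch (the names `jc69_matrix` and `matmul` are illustrative) verifies that every row sums to one, that S(0) is the identity, that the matrices are multiplicative as in equation (6.5), and that for long times every entry tends to the equilibrium frequency 1/4.

```python
import math


def jc69_matrix(alpha, t):
    """4x4 JC69 substitution matrix S(t) with r_t on the diagonal and s_t elsewhere."""
    decay = math.exp(-4.0 * alpha * t)
    r_t = 0.25 * (1.0 + 3.0 * decay)
    s_t = 0.25 * (1.0 - decay)
    return [[r_t if i == j else s_t for j in range(4)] for i in range(4)]


def matmul(a, b):
    """Plain matrix product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


alpha = 1.0
product = matmul(jc69_matrix(alpha, 0.1), jc69_matrix(alpha, 0.2))
direct = jc69_matrix(alpha, 0.3)
max_diff = max(abs(product[i][j] - direct[i][j]) for i in range(4) for j in range(4))
print("row sums:", [sum(row) for row in direct])
print("largest deviation from S(t)S(s) = S(t+s):", max_diff)
print("long-time limit:", jc69_matrix(alpha, 50.0)[0])
```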
()(
).,Kimura(Kimura,
1980)K2P:


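\[ R = \begin{pmatrix}
-2\beta-\alpha & \beta & \alpha & \beta \\
\beta & -2\beta-\alpha & \beta & \alpha \\
\alpha & \beta & -2\beta-\alpha & \beta \\
\beta & \alpha & \beta & -2\beta-\alpha
\end{pmatrix}, \qquad
S(t) = \begin{pmatrix}
r_t & s_t & u_t & s_t \\
s_t & r_t & s_t & u_t \\
u_t & s_t & r_t & s_t \\
s_t & u_t & s_t & r_t
\end{pmatrix} \]
\[ s_t = \tfrac{1}{4}\left(1 - e^{-4\beta t}\right), \qquad
u_t = \tfrac{1}{4}\left(1 + e^{-4\beta t} - 2e^{-2(\alpha+\beta)t}\right), \qquad
r_t = 1 - 2s_t - u_t \tag{6.9} \]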

6.5: y,
(, , ), t,
x.

-
(..AG,TC)(..AT,GC).
(JC69, K2P) t
qA=qT=qG=qC=1/4,
DNA.
(A+T)/(G+C)
(Felsenstein,1981; Lio&Goldman,1998;
Penny&Hendy,2001). ,(6.6)
(, G, C, T),


,
.,F81Felsenstein(Felsenstein,
1981), JC69, ()
,
(,G,C,T):
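\[ R = \begin{pmatrix}
-(\pi_C+\pi_G+\pi_T) & \pi_C & \pi_G & \pi_T \\
\pi_A & -(\pi_A+\pi_G+\pi_T) & \pi_G & \pi_T \\
\pi_A & \pi_C & -(\pi_C+\pi_A+\pi_T) & \pi_T \\
\pi_A & \pi_C & \pi_G & -(\pi_C+\pi_G+\pi_A)
\end{pmatrix} \tag{6.10} \]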
,
,( ):
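\[ R = \begin{pmatrix}
-(\beta\pi_C+\alpha\pi_G+\beta\pi_T) & \beta\pi_C & \alpha\pi_G & \beta\pi_T \\
\beta\pi_A & -(\beta\pi_A+\beta\pi_G+\alpha\pi_T) & \beta\pi_G & \alpha\pi_T \\
\alpha\pi_A & \beta\pi_C & -(\beta\pi_C+\alpha\pi_A+\beta\pi_T) & \beta\pi_T \\
\beta\pi_A & \alpha\pi_C & \beta\pi_G & -(\alpha\pi_C+\beta\pi_G+\beta\pi_A)
\end{pmatrix} \tag{6.11} \]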

Hasegawa-Kishino-Yano (Hasegawa, Kishino, & Yano, 1985) (HKY85),


Felsenstein, F84,
.
, ,
(general time reversible model), GTR (Tavare, 1986)
:
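\[ R = \begin{pmatrix}
-(a\pi_C+b\pi_G+c\pi_T) & a\pi_C & b\pi_G & c\pi_T \\
a\pi_A & -(a\pi_A+d\pi_G+e\pi_T) & d\pi_G & e\pi_T \\
b\pi_A & d\pi_C & -(d\pi_C+b\pi_A+f\pi_T) & f\pi_T \\
c\pi_A & e\pi_C & f\pi_G & -(e\pi_C+f\pi_G+c\pi_A)
\end{pmatrix} \tag{6.12} \]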
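The structure of the general time reversible matrix of equation (6.12) can be made concrete with a small Python sketch. The function name `gtr_rate_matrix`, the dictionary layout and the numerical values are illustrative assumptions; the six exchangeability parameters a to f and the equilibrium frequencies follow the layout of (6.12) with the order A, C, G, T. The checks confirm that each row sums to zero and that the matrix is time reversible, that is, that the flow from x to y at equilibrium equals the flow from y to x.

```python
from itertools import combinations


def gtr_rate_matrix(pi, rates):
    """Build the GTR rate matrix as a nested dict R[x][y] (order A, C, G, T).

    pi    : equilibrium frequencies, e.g. {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}
    rates : exchangeabilities a (A-C), b (A-G), c (A-T), d (C-G), e (C-T), f (G-T)
    """
    order = "ACGT"
    exchange = {
        frozenset("AC"): rates["a"], frozenset("AG"): rates["b"],
        frozenset("AT"): rates["c"], frozenset("CG"): rates["d"],
        frozenset("CT"): rates["e"], frozenset("GT"): rates["f"],
    }
    matrix = {}
    for x in order:
        # off-diagonal entry: exchangeability times the frequency of the target base
        row = {y: exchange[frozenset((x, y))] * pi[y] for y in order if y != x}
        row[x] = -sum(row.values())  # the diagonal makes the row sum to zero
        matrix[x] = row
    return matrix


pi = {"A": 0.1, "C": 0.2, "G": 0.3, "T": 0.4}            # hypothetical frequencies
rates = {"a": 1, "b": 2, "c": 3, "d": 4, "e": 5, "f": 6}  # hypothetical rates
R = gtr_rate_matrix(pi, rates)
for x in "ACGT":
    print(x, [round(R[x][y], 3) for y in "ACGT"])
reversible = all(abs(pi[x] * R[x][y] - pi[y] * R[y][x]) < 1e-12
                 for x, y in combinations("ACGT", 2))
print("time reversible:", reversible)
```

Setting all six rates equal and all frequencies to 1/4 recovers a JC69-type matrix, which illustrates how the simpler models arise as special cases.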

,
. , HKY85 F81, K2P JC69,
F81K2P()JC69.,
,,
,
.,,
,

,GTR.
,,,

(Yang, 1994).

(randomeffectsmodel)(Yang,1993). ,
,
,JC69+,GTR+,...
(molecularclock).
,
.,
PAM (Dayhoff, Schwartz, & Orcutt, 1978) ( )
, (6.3).
()
.

3 ,


(Lio&Goldman,1998)
,t(
,).
Markov ,
,

.(Lio&
Goldman,1998).

6.3.

, (distancebased methods).,,

(Durbin,etal.,1998).(dij)i,j,
:
\[ d_{ii} = 0, \qquad d_{ij} = d_{ji} \ge 0 \;\; (i \ne j), \qquad d_{ij} \le d_{ik} + d_{kj} \tag{6.13} \]
,
(clustering),,,
.
,4.
, dij, i j,
.fu,
xuixuj,.,
,.,

.,JC69:

\[ d_{ij} = -\frac{3}{4} \log\left(1 - \frac{4}{3} f\right) \tag{6.14} \]

K2P,:
\[ d_{ij} = -\frac{1}{2} \log\left(1 - 2f - g\right) - \frac{1}{4} \log\left(1 - 2g\right) \tag{6.15} \]
fg
.,,
.
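The two corrected distances can be computed directly from a pair of aligned DNA sequences. In the sketch below (illustrative function names, hypothetical toy sequences), columns containing gaps or ambiguous symbols are skipped, which is one possible convention and is assumed here. For JC69 the quantity f is the fraction of differing sites; for K2P, f and g are the fractions of transitions and transversions respectively.

```python
import math

TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def mismatch_fractions(x, y):
    """Fractions of transition and transversion differences between aligned sequences."""
    if len(x) != len(y):
        raise ValueError("sequences must be aligned (equal length)")
    pairs = [(a, b) for a, b in zip(x.upper(), y.upper()) if a in "ACGT" and b in "ACGT"]
    n = len(pairs)
    transitions = sum(1 for p in pairs if p in TRANSITIONS)
    transversions = sum(1 for a, b in pairs if a != b and (a, b) not in TRANSITIONS)
    return transitions / n, transversions / n


def jc69_distance(x, y):
    """Jukes-Cantor distance, equation (6.14)."""
    ts, tv = mismatch_fractions(x, y)
    f = ts + tv
    # math.log raises ValueError when f >= 3/4 (saturated sequences)
    return -0.75 * math.log(1.0 - 4.0 * f / 3.0)


def k2p_distance(x, y):
    """Kimura two-parameter distance, equation (6.15)."""
    f, g = mismatch_fractions(x, y)
    return -0.5 * math.log(1.0 - 2.0 * f - g) - 0.25 * math.log(1.0 - 2.0 * g)


seq1 = "ACGTACGTAC"  # hypothetical sequences with a single transition (C/T)
seq2 = "ACGTACGTAT"
print(mismatch_fractions(seq1, seq2))
print(jc69_distance(seq1, seq2))
print(k2p_distance(seq1, seq2))
```

Both corrected distances are larger than the raw fraction of differences, because they account for multiple substitutions at the same site.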
, ,
. , Sokal
Michener,(Sokal&Michener,1958)UPGMA(Unweighted Pair Group using Arithmetic
Mean),(Clusters)
:
\[ d_{ij} = \frac{1}{|C_i|\,|C_j|} \sum_{p \in C_i,\; q \in C_j} d_{pq} \tag{6.16} \]

,|Ci||Cj|i j.
ki jl :

\[ d_{kl} = \frac{d_{il}\,|C_i| + d_{jl}\,|C_j|}{|C_i| + |C_j|} \tag{6.17} \]
,
: ,


,.
dij/2(6.6).O
O(n2). , ,
.,
,
,,.,
,averagelinkage.
linkage (complete linkage, single linkage), ,
UPGMA.
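A compact UPGMA sketch in Python is given below. It follows the steps described above: the closest pair of clusters is merged, the new node is placed at height d_ij/2, and the distances to the remaining clusters are updated with the size-weighted average of equation (6.17). The data structure (a dictionary keyed by `frozenset` pairs), the function name and the toy distances are illustrative assumptions, and the naive search for the minimum makes this version slower than an optimised implementation.

```python
from itertools import combinations


def upgma(dist):
    """Cluster leaves with UPGMA.

    dist maps frozenset({a, b}) to the distance between leaves a and b.
    Returns the tree as a nested string and a dict with the height of every node.
    """
    leaves = sorted({name for pair in dist for name in pair})
    size = {leaf: 1 for leaf in leaves}
    height = {leaf: 0.0 for leaf in leaves}
    d = dict(dist)
    active = list(leaves)
    while len(active) > 1:
        # pick the closest pair of current clusters
        i, j = min(combinations(active, 2), key=lambda pair: d[frozenset(pair)])
        new = "(%s,%s)" % (i, j)
        height[new] = d[frozenset((i, j))] / 2.0
        size[new] = size[i] + size[j]
        for other in active:
            if other not in (i, j):
                # size-weighted average of the two merged clusters, equation (6.17)
                d[frozenset((new, other))] = (
                    d[frozenset((i, other))] * size[i] + d[frozenset((j, other))] * size[j]
                ) / size[new]
        active = [c for c in active if c not in (i, j)] + [new]
    return active[0], height


# Hypothetical ultrametric distances between four sequences
distances = {
    frozenset(("A", "B")): 2.0, frozenset(("A", "C")): 6.0, frozenset(("B", "C")): 6.0,
    frozenset(("A", "D")): 10.0, frozenset(("B", "D")): 10.0, frozenset(("C", "D")): 10.0,
}
tree, heights = upgma(distances)
print(tree)
print(heights)
```

Because the toy distances are ultrametric (they satisfy the molecular clock), UPGMA recovers the expected nesting; with non-clocklike data the same procedure can return a wrong topology.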

6.6: UPGMA. ,
( ).
(A,B,C,D) ,

UPGMA, , ,
.,
,.
,,
.,
. , .
,i,jk,


,k
,m,:

\[ d_{km} = \frac{1}{2}\left(d_{im} + d_{jm} - d_{ij}\right) \tag{6.18} \]

,,.,
- ,
. ,
,.,
6.7.

6.7: ,
. (A-C -D) 4, (A-B
C-D) ( 3). ,
Neighbour-Joining.

(neighbour joining,NJ)(Saitou&Nei,1987),
,
:
\[ D_{ij} = d_{ij} - \frac{1}{|L| - 2} \sum_{k} \left(d_{ik} + d_{jk}\right) \tag{6.19} \]
L,
.,
6.7 (, Dij ).
,,..(6.8).,
L,L-3,Dij
LxL. , (L3),
.
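The effect of the neighbour-joining correction can be seen on a small additive example. In the Python sketch below, the distances come from a hypothetical four-leaf tree ((A,B),(C,D)) with branch lengths A = 1, B = 4, C = 1, D = 4 and an internal branch of length 1. The smallest raw distance joins A with C, which are not neighbours, whereas the corrected quantity D_ij of equation (6.19) is minimal for the true neighbour pairs. The function names are illustrative.

```python
from itertools import combinations


def nj_criterion(dist, leaves):
    """Corrected distances D_ij = d_ij - (1/(|L|-2)) * sum_k (d_ik + d_jk)."""
    n = len(leaves)

    def d(a, b):
        return 0.0 if a == b else dist[frozenset((a, b))]

    totals = {a: sum(d(a, k) for k in leaves) for a in leaves}
    return {
        (a, b): d(a, b) - (totals[a] + totals[b]) / (n - 2)
        for a, b in combinations(leaves, 2)
    }


leaves = ["A", "B", "C", "D"]
dist = {
    frozenset(("A", "B")): 5.0, frozenset(("A", "C")): 3.0, frozenset(("A", "D")): 6.0,
    frozenset(("B", "C")): 6.0, frozenset(("B", "D")): 9.0, frozenset(("C", "D")): 5.0,
}
closest_raw = tuple(sorted(min(dist, key=dist.get)))
corrected = nj_criterion(dist, leaves)
closest_nj = min(corrected, key=corrected.get)
print("closest by raw distance:", closest_raw)
print("corrected distances:", corrected)
print("pair chosen by neighbour joining:", closest_nj)
```

After the chosen pair is merged into a new node k, the distances from k to the remaining leaves follow from equation (6.18) and the procedure is repeated on the reduced set.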
NJ, ,
. ,
, .
,,,
bootstrap. UPGMA,,
. ,



()
. ,,
, ,
. , ,
CLUSTAL,-.

6.8: Neighbour-Joining. 1 2
() , (Y). ,

.

, . ,
Fitch-Margoliash (Fitch & Margoliash, 1967),
,,.,

(dij), ( dij ). , ,
:
\[ Q = \sum_{i=1}^{L} \sum_{j=1}^{L} w_{ij} \left(d_{ij} - \hat{d}_{ij}\right)^2 \tag{6.20} \]


(y=a+bx).wij .
,(Cavalli-Sforza&Edwards,1967)
wij=1 ( ), (Fitch & Margoliash, 1967),
( $w_{ij} = 1/d_{ij}^2$ ,
).,,
,
.,,
.,
(Q) . ,

. , ,
.


6.4.

(character-based methods),
, ,
, , :
(Durbin, et al., 1998). :
,
,,(,)
..

6.4.1

(maximum parsimony)
, ,
( ) .
(
)
. , ,
(Occam's razor),
.Pluralitas
non est ponenda sine necessitate,
, .
,
,
.
Fitch(Fitch,1971),,
+1 ,
(weightedparsimony),
. 6.9, 4
,.

(3)(15).

6.9: .
(3 )
( 15 ).

, ,
,.
,branch and bound,
.
. ,


(),
.,
,

.
,
,,
.,
,

(Yang, 1996).
(Edwards & Cavalli-Sforza, 1963)
.

6.4.2
,
(Maximum Likelihood).
L n.:
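\[ \begin{aligned}
x^{1} &= x_{11}\, x_{12} \ldots x_{1n} \\
x^{2} &= x_{21}\, x_{22} \ldots x_{2n} \\
&\;\;\vdots \\
x^{L} &= x_{L1}\, x_{L2} \ldots x_{Ln}
\end{aligned} \]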
,
:
\[ X_1 = (x_{11}, x_{21}, \ldots, x_{L1}), \quad X_2 = (x_{12}, x_{22}, \ldots, x_{L2}), \quad \ldots, \quad X_n = (x_{1n}, x_{2n}, \ldots, x_{Ln}) \]

(Maximum Likelihood) ..

.
,
(
).
.

(.JC69
).

.

,
.:
\[ x^{1} = x_{11}, x_{12}, \ldots, x_{1n}, \qquad x^{2} = x_{21}, x_{22}, \ldots, x_{2n} \]

i:

\[ P(x_{1i}, x_{2i}, a \mid T, t_1, t_2) = q_a\, P(x_{1i} \mid a, t_1)\, P(x_{2i} \mid a, t_2) \tag{6.21} \]
, , t1,t2
( ).
,:


\[ P(x_{1i}, x_{2i} \mid T, t_1, t_2) = \sum_{a} q_a\, P(x_{1i} \mid a, t_1)\, P(x_{2i} \mid a, t_2) \tag{6.22} \]

n
:
\[ P(x^{1}, x^{2} \mid T, t_1, t_2) = \prod_{i=1}^{n} P(x_{1i}, x_{2i} \mid T, t_1, t_2) \tag{6.23} \]

(likelihood).
(log-likelihood):
\[ \log P(x^{1}, x^{2} \mid T, t_1, t_2) = \sum_{i=1}^{n} \log P(x_{1i}, x_{2i} \mid T, t_1, t_2) \tag{6.24} \]

(6.24),
,JC69GTR,
.
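For two sequences, equations (6.21) to (6.24) can be evaluated in a few lines. The Python sketch below assumes the JC69 model with equal base frequencies q_a = 1/4; the function names and the toy sequences are illustrative. It sums over the unknown ancestral base at every site, multiplies over sites by adding logarithms, and shows that under a reversible model the likelihood depends only on the total branch length t1 + t2, so the position of the root on the branch cannot be determined from the data.

```python
import math

BASES = "ACGT"


def jc69_prob(ancestor, descendant, t, alpha=1.0):
    """P(descendant | ancestor, t) under JC69."""
    decay = math.exp(-4.0 * alpha * t)
    if ancestor == descendant:
        return 0.25 * (1.0 + 3.0 * decay)
    return 0.25 * (1.0 - decay)


def site_likelihood(x, y, t1, t2, alpha=1.0):
    """Equation (6.22): sum over the ancestral base a with prior q_a = 1/4."""
    return sum(0.25 * jc69_prob(a, x, t1, alpha) * jc69_prob(a, y, t2, alpha)
               for a in BASES)


def log_likelihood(seq1, seq2, t1, t2, alpha=1.0):
    """Equation (6.24): sum of the log site likelihoods over all aligned positions."""
    return sum(math.log(site_likelihood(x, y, t1, t2, alpha))
               for x, y in zip(seq1, seq2))


seq1 = "ACGTACGTAC"  # hypothetical sequences differing at one site out of ten
seq2 = "ACGTACGTAT"
print(log_likelihood(seq1, seq2, 0.1, 0.3))
print(log_likelihood(seq1, seq2, 0.2, 0.2))

# crude grid search for the total branch length that maximises the likelihood
grid = [k / 1000.0 for k in range(1, 301)]
best_total = max(grid, key=lambda total: log_likelihood(seq1, seq2, total / 2, total / 2))
print("best total branch length on the grid:", best_total)
```

In practice the grid search is replaced by a numerical optimiser, and for more than two sequences the sum over ancestral states is organised recursively along the tree.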
L,:

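\[ P(x_{1i}, x_{2i}, \ldots, x_{Li} \mid T, t) = \sum_{a^{L+1}, \ldots, a^{2L-1}} q_{a^{2L-1}} \prod_{k=L+1}^{2L-2} P\left(a^{k} \mid a^{\alpha(k)}, t_k\right) \prod_{k=1}^{L} P\left(x_{ki} \mid a^{\alpha(k)}, t_k\right) \tag{6.25} \]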

(k)
. L 2L-1
2L-2 , . L
(),L+12L2, .
,
L+1 2L-1 (
). r
n:
\[ P(x^{1}, x^{2}, \ldots, x^{L} \mid T, t) = \prod_{i=1}^{n} P(x_{1i}, x_{2i}, \ldots, x_{Li} \mid T, t) \tag{6.26} \]

(log-likelihood),:
\[ \log P(x^{1}, x^{2}, \ldots, x^{L} \mid T, t) = \sum_{i=1}^{n} \log P(x_{1i}, x_{2i}, \ldots, x_{Li} \mid T, t) \tag{6.27} \]

,
.,
(Yang, 1993).
(molecular clock)
UPGMA,(
,).
,,
,,
(time reversibility). ,
,,
, ,
.
, ,
.
(),(6.27),
,,
, Gradient Descent, Newton-Raphson
(Ypma,1995),(Dempster,Laird,&Rubin,1977).,
:
,,.


,
, Felsenstein (Felsenstein, 1981).
,
:
.
,,.

(Yang&Rannala,2012).,
.
, , .
,
,(likelihoodratio
test),.
, ,
,
,
.,,

.,,
,
,
,GPUFPGA.
,
(Bayesian methods) (Huelsenbeck, Ronquist, Nielsen, & Bollback, 2001).
, ,
.
, ,
,P(x|T,),(6.26).(Bland&Altman,1998),,
Bayes,
(posteriordistribution):
\[ P(T, \theta \mid x) = \frac{P(T, \theta)\, P(x \mid T, \theta)}{P(x)} \tag{6.28} \]
(6.28), P(T, |x) , P(x|T,) ,
P(T,).,P(x),

(,).,
, ,
. , ,


MCMC (Markov Chain Monte Carlo) (Gilks, Richardson, & Spiegelhalter,
1996).,.
,
,(,
;). ,
,
(Bland&Altman,1998).
. ,
-
. ,
( ),
. ,


,
.

6.5.

, ,
.
, ,
,
.

6.10: : (permutation). ,
( 1 2 5 6),
.
, ,
( ). :
(randomised) . ,
. ,
.

, ,
. ,
()
2.

,
, bootstrap (Goldman,
1993;Posada&Crandall,1998).,,
(
).
,
,
,().


permutationtests,,
, ,
,
.

.

6.11: bootstrap. ,
.
. ,
.

, bootstrap.
,(Efron&Tibshirani,
1993).,:1)
()
,2).
,
p-values.
(Hillis & Bull, 1993; Soltis & Soltis, 2003),,
,
.
( ),

(Durbin,etal.,1998).,
, bootstrap
(,
).,
(consensustree)
.
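The resampling step of the bootstrap is simple to write down. The Python sketch below draws alignment columns with replacement, so that each replicate has the same length as the original alignment, and estimates the support of a grouping as the fraction of replicates in which it is recovered. The function names are illustrative, and `recovers_group` stands for any user-supplied function that builds a tree from a replicate and reports whether the grouping of interest is present; here it is replaced by a deliberately trivial, hypothetical criterion just to make the example runnable.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample the columns of an alignment with replacement (one replicate)."""
    names = list(alignment)
    length = len(alignment[names[0]])
    columns = [rng.randrange(length) for _ in range(length)]
    # the same sampled column indices are used for every sequence
    return {name: "".join(alignment[name][c] for c in columns) for name in names}


def bootstrap_support(alignment, recovers_group, replicates=100, seed=0):
    """Fraction of bootstrap replicates for which recovers_group(replicate) is True."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(replicates)
               if recovers_group(bootstrap_alignment(alignment, rng)))
    return hits / replicates


def a_closer_to_b_than_to_c(replicate):
    """Toy stand-in for tree building: compare raw mismatch counts."""
    def mismatches(x, y):
        return sum(1 for p, q in zip(x, y) if p != q)
    return mismatches(replicate["a"], replicate["b"]) < mismatches(replicate["a"], replicate["c"])


alignment = {"a": "ACGTACGT", "b": "ACGTACGA", "c": "TCGAACTT"}  # hypothetical data
replicate = bootstrap_alignment(alignment, random.Random(42))
print(replicate)
print(bootstrap_support(alignment, a_closer_to_b_than_to_c, replicates=200))
```

In a real analysis the stand-in criterion is replaced by a full tree reconstruction on every replicate, and the support values are reported on the branches of the original or of the consensus tree.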


, - bootstrap
.
, bootstrap,

(Goldman,1993;Wollenberg&Atchley,2000).
,
-bootstrap.,


,
(model misspecification). , ,
,
.
,
, . ,
, bootstrap ,
. ,
bootstrap
, .
, Kishino-Hasegawa (KH), Shimodaira-Hasegawa (SH), (weighted) Shimodaira-Hasegawa (WSH),
ApproximatelyUnbiased(AU)Shimodaira,.
CONSEL,,
bootstrap.,

(Shimodaira&Hasegawa,2001).

6.6.

,()
, ,
, .
,
,
().
,
,,

.Felsenstein(Felsenstein,1973,1996)
,
.
.
(),
,
(Yang,1996).
, 100%
,
. , ,
, ,
,
.,
. , NJ
.
3(Penny&Hendy,2001;Steel&


Penny,2000),:1)
, 2) , 3)
,
,
. , bootstrap
, ( JC69, K2P
), NJ (UPGMA),
(,!).Felsenstein,
,
(maximumlikelihooddistance),(Felsenstein,1996).
,(NJ,UPGMA)
.NJML,
Neighbour Joining Maximum Likelihood.
NJ
bootstrap.NJML
NJ
(Ota&Li,2000).
,
(Brinkman&Leipe,2001):
.
, ,
,
(:garbage in, garbage out)

,
( ). ,
,
,.
. ,
( )
.,
, ,
.

.
(
),

.
,GC%,
(
).

6.7.

,
, .
,PAUPPHYLIP,

(MEGA,RAxML).
, ,
.
PAUP(Phylogeneticanalysisusingparsimony*andothermethods),
(Wilgenbusch & Swofford, 2003).


,
.
(http://www.sinauer.com/detail.php?id=8060).
PHYLIP(PHYLogenyInferencePackage)
Joe Felsenstein (Retief, 2000).
,
(http://evolution.gs.washington.edu/phylip.html). MEGA
(Molecular evolutionary genetic analysis) , ,
(Kumar, Nei, Dudley, & Tamura, 2008). ,
,
Windows
(http://www.megasoftware.net).
, ,
,-
-,
/.,,
HYPHY,PAML,PhyMLRAxML.HYPHY(Hypothesistestingusingphylogenies),
.


(http://www.hyphy.org). PAML (Phylogenetic analysis by maximum
likelihood),.
ZihengYang(Yang,2007),
,
,
(http://abacus.gene.ucl.ac.uk/software/paml.html). To PhyML
,
DNA http://www.atgc-montpellier.fr/phyml/binaries.php, (Bazinet,
Zwickl, & Cummings, 2014). , RAxML,
(Stamatakis,2014),
(GTR),
.,
http://scoh-its.org/exelixis/software.html.
,,
. ,
. ,


MCMC (Markov Chain Monte Carlo). ,
. MrBayes ,
MCMC (Huelsenbeck & Ronquist, 2001).

(http://mrbayes.net). BEAST (Bayesian evolutionary analysis sampling tree),
MCMC(Drummond,Suchard,Xie,&Rambaut,2012).
,
.
, ( ). ,
TracerFigTree,
(http://beast.bio.ed.ac.uk).
To GARLI (Genetic Algorithm for Rapid Likelihood Inference),
(Bazinet, et al., 2014).
( )
. GTR ,
,
.


http://code.google.com/p/garli. TNT (Tree analysis using new technology) (Goloboff, Farris, & Nixon,
2008)
http://www.lillo.org.ar/phylogeny/tnt/.,
TreeView (http://taxonomy.zoology.gla.ac.uk/rod/treeview.html)
.
(NEXUS,PHYLIP,Hennig86,
NONA,MEGA,ClustalW/X)TrueTypeandPostscript
PICT(Macintosh)Windowsmetafile(Windows)
.(Windows,Unix/Linux,Macintosh),
editor.


Bazinet, A. L., Zwickl, D. J., & Cummings, M. P. (2014). A gateway for phylogenetic analysis powered by grid computing featuring GARLI 2.0. Syst Biol, 63(5), 812-818.
Bland, J. M., & Altman, D. G. (1998). Bayesians and frequentists. BMJ, 317(7166), 1151-1160.
Brinkman, F. S., & Leipe, D. D. (2001). Phylogenetic Analysis. In A. D. Baxevanis & B. F. Ouellette (Eds.), Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins (pp. 323-358). John Wiley & Sons, Inc.
Cavalli-Sforza, L. L., & Edwards, A. W. (1967). Phylogenetic analysis. Models and estimation procedures. Am J Hum Genet, 19(3 Pt 1), 233-257.
Dayhoff, M. O., Schwartz, R. M., & Orcutt, B. C. (1978). A model of evolutionary change in Proteins. In M. Dayhoff (Ed.), Atlas of protein sequence and structure (Vol. 5, Suppl. 3, pp. 345-352). National biomedical research foundation, Silver Spring, MD.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J Royal Stat Soc B, 39, 1-38.
Drummond, A. J., Suchard, M. A., Xie, D., & Rambaut, A. (2012). Bayesian phylogenetics with BEAUti and the BEAST 1.7. Molecular biology and evolution, 29(8), 1969-1973.
Durbin, R., Eddy, S. R., Krogh, A., & Mitchison, G. J. (1998). Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids.
Edwards, A. W., & Cavalli-Sforza, L. L. (1963). The reconstruction of evolution. Annals of Human Genetics, 27, 105.
Efron, B., & Tibshirani, R. (1993). An Introduction to the Bootstrap. Boca Raton, FL: Chapman & Hall/CRC.
Felsenstein, J. (1973). Maximum-likelihood estimation of evolutionary trees from continuous characters. American journal of human genetics, 25(5), 471.
Felsenstein, J. (1981). Evolutionary trees from DNA sequences: a maximum likelihood approach. Journal of molecular evolution, 17(6), 368-376.
Felsenstein, J. (1996). Inferring phylogenies from protein sequences by parsimony, distance, and likelihood methods. Methods Enzymol, 266, 418-427.
Fitch, W. M. (1971). Toward defining the course of evolution: minimum change for a specific tree topology. Systematic Biology, 20(4), 406-416.
Fitch, W. M., & Margoliash, E. (1967). Construction of phylogenetic trees. Science, 155(3760), 279-284.
Gilks, W. R., Richardson, S., & Spiegelhalter, D. (Eds.). (1996). Markov Chain Monte Carlo in Practice. Chapman & Hall/CRC.
Goldman, N. (1993). Statistical tests of models of DNA substitution. Journal of Molecular Evolution, 36(2), 182-198.
Goloboff, P. A., Farris, J. S., & Nixon, K. C. (2008). TNT, a free program for phylogenetic analysis. Cladistics, 24(5), 774-786.
Hasegawa, M., Kishino, H., & Yano, T. (1985). Dating of the human-ape splitting by a molecular clock of mitochondrial DNA. Journal of molecular evolution, 22(2), 160-174.
Hillis, D. M., & Bull, J. J. (1993). An empirical test of bootstrapping as a method for assessing confidence in phylogenetic analysis. Systematic biology, 42(2), 182-192.
Huelsenbeck, J. P., & Ronquist, F. (2001). MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics, 17(8), 754-755.

213

Huelsenbeck,J.P.,Ronquist,F.,Nielsen,R.,&Bollback,J.P.(2001).Bayesianinferenceofphylogenyand
itsimpactonevolutionarybiology.Science, 294(5550),2310-2314.
Jukes,T.,&Cantor,C.(1969).EvolutionofproteinmoleculesPp.21132inHNMunro,ed.Mammalian
proteinmetabolism:AcademicPress,NewYork.
Kimura,M.(1980).Asimplemethodforestimatingevolutionaryratesofbasesubstitutionsthrough
comparativestudiesofnucleotidesequences.Journal of molecular evolution, 16(2),111-120.
Kumar,S.,Nei,M.,Dudley,J.,&Tamura,K.(2008).MEGA:abiologist-centricsoftwareforevolutionary
analysisofDNAandproteinsequences.Brief Bioinform, 9(4),299-306.
Lio,P.,&Goldman,N.(1998).Modelsofmolecularevolutionandphylogeny.Genome research, 8(12),
1233-1244.
Ota,S.,&Li,W.-H.(2000).NJML:ahybridalgorithmfortheneighbor-joiningandmaximum-likelihood
methods.Molecular Biology and Evolution, 17(9),1401-1409.
Penny,D.,&Hendy,M.(2001).Phylogenetics:parsimonyanddistancemethods.InD.J.Balding,M.Bishop
&C.Cannings(Eds.),Handbook of Statistical Genetics(pp.445-484):JohnWileyandSons,Ltd.
Posada,D.,&Crandall,K.A.(1998).Modeltest:testingthemodelofDNAsubstitution.Bioinformatics,
14(9),817-818.
Retief,J.D.(2000).PhylogeneticanalysisusingPHYLIP.Methods Mol Biol, 132,243-258.
Saitou,N.,&Nei,M.(1987).Theneighbor-joiningmethod:anewmethodforreconstructingphylogenetic
trees.Molecular biology and evolution, 4(4),406-425.
Shimodaira,H.,&Hasegawa,M.(2001).CONSEL:forassessingtheconfidenceofphylogenetictree
selection.Bioinformatics, 17(12),1246-1247.
Sokal,R.R.,&Michener,C.D.(1958).AStatisticalMethodforEvaluatingSystematicRelationships.
University of Kansas Science Bulletin, 38,1409-1438.
Solitis,P.S.,&Solitis,D.E.(2003).ApplyingtheBootstrapinPhylogenyReconstruction.Stat Sci, 18(2),
256-267.
Stamatakis,A.(2014).RAxMLversion8:atoolforphylogeneticanalysisandpost-analysisoflarge
phylogenies.Bioinformatics, 30(9),1312-1313.
Steel,M.,&Penny,D.(2000).Parsimony,likelihood,andtheroleofmodelsinmolecularphylogenetics.
Molecular Biology and evolution, 17(6),839-850.
Tavare,S.(1986).SomeProbabilisticandStatisticalProblemsintheAnalysisofDNASequences.Lectures
on Mathematics in the Life Sciences (American Mathematical Society) 17,5786.
Wilgenbusch,J.C.,&Swofford,D.(2003).InferringevolutionarytreeswithPAUP*.Curr Protoc
Bioinformatics, Chapter 6,Unit64.
Wollenberg,K.R.,&Atchley,W.R.(2000).Separationofphylogeneticandfunctionalassociationsin
biologicalsequencesbyusingtheparametricbootstrap.Proceedings of the National Academy of
Sciences, 97(7),3288-3291.
Yang,Z.(1993).Maximum-likelihoodestimationofphylogenyfromDNAsequenceswhensubstitutionrates
differoversites.Molecular Biology and Evolution, 10(6),1396-1401.
Yang,Z.(1994).Estimatingthepatternofnucleotidesubstitution.Journal of Molecular Evolution, 39(1),
105-111.
Yang,Z.(1996).Phylogeneticanalysisusingparsimonyandlikelihoodmethods.Journal of Molecular
Evolution, 42(2),294-307.
Yang,Z.(2007).PAML4:phylogeneticanalysisbymaximumlikelihood.Molecular biology and evolution,
24(8),1586-1591.

214

Yang,Z.,&Rannala,B.(2012).Molecularphylogenetics:principlesandpractice.Nat Rev Genet, 13(5),303314.


Ypma,T.J.(1995).HistoricaldevelopmentoftheNewton-Raphsonmethod.SIAM Review, 37(4),531-551.

215

216

7:

,
DNA RNA.

.
,
. ,
, ,
- .
DNA , (
/, ...), RNA
micro RNA .

2, 3, 4 5.

7.

DNA/RNA.

(
...),
.
.
,
,
( / )
.
(
),,
:
()
.,,
.

,
,.,
,20-30%
.
, ,
,
.,
(remote homology)
(threading),.
,
DNA,.

.
(, , ...) ,


.,
,
(
). ,
.
.
,
. ,
, ,

, .
,
,,
- ,
... ,
,
.

. , ,
.
,,
.
,
( 1970)
().,
,
(- , , )
. , ,
.. ,
DNA , ... DNA,
(gene finding),
,
. , ,
, -,
RNA ... ,
microRNA.
,
.
,
.

7.1.

.
,-
,,
, , ,
--,
GolgiN-X-[ST].DNA
A-U-G,-,
A-G G-T , ... ,

. , ,
PROSITE.,


,,
, , (
),.,
( ) ,
,.
, .. ,
,
,
.
,
.
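Sequence patterns such as the N-glycosylation consensus N-X-[ST] quoted in this section can be searched for with an ordinary regular expression. The following is a minimal sketch, not the PROSITE scanning software: the helper names `prosite_to_regex` and `find_motif` are hypothetical, only single residues, X (any residue) and [..] sets are handled, and the test sequence is a toy string rather than a real protein.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE-style pattern (e.g. 'N-X-[ST]') to a regex.

    Supported elements: a single residue letter, 'X'/'x' for any residue,
    and '[..]' for a set of allowed residues. Repeats such as X(2,4) are
    deliberately not handled in this sketch.
    """
    parts = []
    for element in pattern.split("-"):
        parts.append("." if element.upper() == "X" else element)
    return "".join(parts)


def find_motif(sequence, pattern):
    """Return 1-based start positions of all matches, overlapping ones included."""
    # A zero-width lookahead lets overlapping occurrences be reported.
    regex = re.compile("(?=" + prosite_to_regex(pattern) + ")")
    return [match.start() + 1 for match in regex.finditer(sequence)]


toy_sequence = "MKNGSAANVTLLNPS"  # toy example, not a real protein
print(find_motif(toy_sequence, "N-X-[ST]"))
```

Because short patterns like this match by chance in many sequences, such a scan yields candidates only; it says nothing about whether a site is actually modified.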

7.1: . ,
. ,
.

,(7.1).
.
.
,
(,,/,
...). , (
), (
). ,
. ,
, GPCR
,(..,,/...),
.
,
(.. ),
(..-),

.,


, . ,
.
,
(),
(
).
,
( 7.2). ,
( ),
. ,

( ). (smoothing),
.
,
( 15 ,
).
( ).
,..,
(,
).

7.2: , (
). ,
.

,
.,
. ( , ,
) 10-20 ,
.


, .. ,
. ,
,
.,

30 . .
(..
95,...).
,..
( ),
.
,

.,,
3. ,

( , ).

, .
http://web.expasy.org/protscale/
,,,,
.
(,)
. k,
p,pk,L
, (L-k+1) pk(L-k+1) .
, , ..
weight
matrices ( ).
,3(),5(,...)
8(),.
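The sliding-window averaging of a residue scale described above (a window of length k over a sequence of length L gives L-k+1 values) can be sketched as follows. This is an illustrative sketch under assumptions: the function name `window_profile` is hypothetical, and the scale used is the Kyte and Doolittle (1982) hydropathy scale as commonly tabulated; any other scale from ProtScale could be substituted.

```python
# Kyte & Doolittle (1982) hydropathy values, as commonly tabulated.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_profile(sequence, scale, k):
    """Average a per-residue scale over every window of length k.

    A sequence of length L yields L - k + 1 window averages; the i-th
    value (0-based) summarises residues i .. i + k - 1.
    """
    if k < 1 or k > len(sequence):
        raise ValueError("window length must be between 1 and len(sequence)")
    values = [scale[residue] for residue in sequence]
    return [sum(values[i:i + k]) / k for i in range(len(values) - k + 1)]


profile = window_profile("MKTLLILAVVAAALA", KYTE_DOOLITTLE, 7)
print(len(profile), [round(v, 2) for v in profile])
```

Larger windows give smoother curves at the cost of blurring segment boundaries, which is why the window length is usually matched to the expected length of the feature being sought.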

,
.
,
,
. ,
,
,sparseencoding()
2041
0(7.3).,dummyvariables,

(
).,
( ) 20 .
k20k,L,
(L-k+1)20k(L-k+1)().
,
,
.,
(,,,
...)79,
.,
BLOSUM62 ( ),



. , , PSSM,

.,
(,...).

7.3: sparse encoding ( )


20 1 19 0.
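The sparse encoding of Figure 7.3 (each residue becomes a vector with one 1 and nineteen 0s) and the resulting input matrix of (L-k+1) rows by 20k columns can be sketched as below. The residue ordering and the function names `one_hot` and `encode_windows` are assumptions made for illustration; any fixed ordering of the 20 amino acids works equally well.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed fixed ordering of the 20 residues


def one_hot(residue):
    """Return a 20-element vector with a single 1 at the residue's position."""
    return [1 if residue == aa else 0 for aa in AMINO_ACIDS]


def encode_windows(sequence, k):
    """Encode every window of length k as one row of 20 * k binary inputs.

    A sequence of length L produces (L - k + 1) rows.
    """
    rows = []
    for i in range(len(sequence) - k + 1):
        row = []
        for residue in sequence[i:i + k]:
            row.extend(one_hot(residue))
        rows.append(row)
    return rows


matrix = encode_windows("MKTAYIAK", 3)
print(len(matrix), len(matrix[0]))  # rows, columns
```

The representation is wasteful in size but makes no assumption about similarity between residues, in contrast to encodings based on physicochemical scales or substitution matrices.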

,
, .
,
.,
20
.(Reinhardt&Hubbard,1998)
, ,
() , .
,,,

( ) ,
(..
).,
(
,
,).
,
, . ,
- -,
(4008000).,
, ,


,,,
... (
!).
(.. Fourier),
(pseudoaminoacidcomposition)
Chou,()
. , (
), ( i i+1),
( i i+2), (i+3). ,
.
, http://www.csbio.sjtu.edu.cn/bioinf/PseAAC/.
,
,
,.

7.2.

, ,
(Baldi & Brunak, 2001).

. ( , )

(Bishop,1998).
().,
, ()
.,
,

.,
.,
,
(
),
,.
,
, (feed forward)
,
( 7.4). ,

( ).
, . ,
.

(

, , ,
...). ( )
-

-. XOR,
,,
.
. ,


(,
). ,
(interaction),
.
,
.

7.4: 3 , 4 ( ) 2
. .

,(7.5).
( )
. , (weights)
.,
.

(activationfunction).
().
,
0.
(bias)
+1.

(,
). , (
/),i:

$$g(h_i) = \frac{1}{1 + e^{-h_i}}$$

,0
1,
. (logistic
regression). , ,
.


, , .
c(..-,
-,),
softmax:

$$g(h_i) = \frac{e^{h_i}}{\sum_{j=1}^{c} e^{h_j}}$$

7.5: m .
( bias)
.
.

, c ,
01,.
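The logistic (sigmoid) and softmax activation functions given above can be sketched in a few lines. This is a minimal illustration rather than a neural-network library: the names `sigmoid`, `softmax` and `neuron_output` are hypothetical, and the bias is treated as a weight attached to a constant input of +1, as in the text.

```python
import math


def sigmoid(h):
    """Logistic function 1 / (1 + exp(-h)), written to avoid overflow."""
    if h >= 0:
        return 1.0 / (1.0 + math.exp(-h))
    e = math.exp(h)          # h < 0, so exp(h) cannot overflow
    return e / (1.0 + e)


def softmax(hs):
    """Softmax over c activations; the outputs are positive and sum to 1."""
    m = max(hs)              # subtracting the maximum leaves the result unchanged
    exps = [math.exp(h - m) for h in hs]
    total = sum(exps)
    return [e / total for e in exps]


def neuron_output(inputs, weights, bias):
    """Weighted sum of the inputs plus bias, passed through the sigmoid."""
    h = bias + sum(w * x for w, x in zip(weights, inputs))
    return sigmoid(h)


print(sigmoid(0.0))
print(softmax([1.0, 2.0, 3.0]))
print(neuron_output([1.0, 0.0], [0.5, -0.5], bias=0.1))
```

With a single sigmoid output and no hidden layer, this unit computes exactly the logistic regression model mentioned in the text.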

.GPCRG-,
.
,.
( , ...)
.
.
,
(),
.

,-(tanh):

$$g(h_i) = \tanh(h_i) = \frac{1 - e^{-2h_i}}{1 + e^{-2h_i}}$$



-1+1.
, ,
.,
.(),
,.,
back-propagation(Rumelhart,Hinton,&Williams,1988).
gradientdescent
. ,
. ,
,,
2 (
).,
:
(
)
,
.
,

,,
.gradientdescent

, ,
.,
,

, ,

.(
...)

, ,
gradientdescent,.,,
, .
,
(.. , ...). ,
.
( )

. ,
cross-validation(.),,
,
.
,
.

MATLAB (http://www.mathworks.com/products/neural-network/) R
(https://cran.r-project.org/web/packages/neuralnet/index.html).,
,
, FANN C
(http://leenissen.dk/fann/wp/)

JOONE,

JAVA
(http://sourceforge.net/projects/joone/). ,
(simulators),


. () BILLNET
(http://www.nongnu.org/billnet/) NevProp (http://www.cse.unr.edu/brain/nevprop), SNNS
(http://www.ra.cs.uni-tuebingen.de/SNNS/) . ,

Weka(http://www.cs.waikato.ac.nz/ml/weka/).

7.3.

.
,
(
).
.
.
,

. , ,
,
. ,
.

.

7.6: .

,,
.,
( non-redundantset).
(..
30%, ,
). ,
,
,
(.).


, .
(
),
.
( , ,
)(
,
). , ,
( )
.,
, ,
, , (
),...

7.7: crossvalidation.

,
. , (.
) .
, .
- (self-consistency)
-(over-fitting).
,
.
,
( .. ) . ,
(independent
test) ( 7.6).
,
,
, .
, ,
(..


, 30% ). ,
.
,
, ,
cross-validation(7.7).,
k (k-fold cross-validation). ,
,
. k
(unbiased)
,
.,

( )
k.,
,Jackknifek
,
. , , Jackknife
,.

, (
).
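The k-fold cross-validation scheme discussed above can be sketched as a generator of training and test index sets. The function name `k_fold_indices` and the `seed` parameter are hypothetical. Setting k equal to the number of examples gives the leave-one-out (jackknife) case. The sketch does not perform redundancy reduction; it assumes the data set has already been made non-redundant.

```python
import random


def k_fold_indices(n, k, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation.

    Every example appears in exactly one test fold. With k == n each
    test fold holds a single example (leave-one-out / jackknife).
    """
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= n")
    indices = list(range(n))
    random.Random(seed).shuffle(indices)       # reproducible shuffling
    folds = [indices[i::k] for i in range(k)]  # k nearly equal parts
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


for train, test in k_fold_indices(10, 5):
    print(sorted(test), len(train))
```

The performance estimate is obtained by training on each `train` set, evaluating on the matching `test` set, and pooling or averaging the k results.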

7.4.


,.,
(per-residue prediction) (perprotein classification). ,
2x2 (). , TP (True Positives)
, (True Negatives)
, FN (False Negatives)
FP(FalsePositives)
. , ,
.
,
(Q),:
$$Q = \frac{TP + TN}{TP + TN + FP + FN} \times 100\%$$

,(sensitivity)
, (
), (specificity)
(
).
,,
.,
,
, . ,
(5%),
95%, .
, , ,
.



7.8: .
( ), (Vihinen, 2012)

,Matthews
(C)(Baldi,Brunak,Chauvin,Andersen,&Nielsen,2000):
$$C = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP + FN)(TP + FP)(TN + FP)(TN + FN)}}$$
, Pearson
-1(),+1(
),0.
.
, , .
(k),
Q kxkC,
.(H,E
C),Q(Qa, Qb)
C(Ca,Cb).
,
, .
,,
(TP,TN,Q,C),
.,
..
. , ,
(measure of the segments overlap-SOV),
,
0-1(Zemla,Venclovas,Fidelis,&Rost,1999).
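The measures Q and C defined above can be computed directly from the four cells of the 2x2 table. The sketch below is illustrative only. The function name `evaluate` is hypothetical. Specificity is taken here as TN/(TN+FP), which is an assumption: parts of the prediction literature use TP/(TP+FP) under the same name. Returning C = 0 when the denominator vanishes is a common convention, not part of the definition.

```python
import math


def evaluate(tp, tn, fp, fn):
    """Return Q (percent), sensitivity, specificity and Matthews C."""
    total = tp + tn + fp + fn
    q = 100.0 * (tp + tn) / total
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    # Assumption: specificity = TN / (TN + FP); some papers use TP / (TP + FP).
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    denominator = math.sqrt((tp + fn) * (tp + fp) * (tn + fp) * (tn + fn))
    # Convention: C is reported as 0 when the denominator is zero.
    c = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {"Q": q, "sensitivity": sensitivity,
            "specificity": specificity, "C": c}


# 5% positives, and a "predictor" that calls everything negative:
print(evaluate(tp=0, tn=95, fp=0, fn=5))
```

The example reproduces the situation described in the text: with 5% positives, a method that never predicts the positive class reaches Q = 95% while C stays at 0, which is why Q alone is misleading on unbalanced data.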



7.9: SOV. , ,
, ,
. , ,
, SOV.

7.5.

,
,
. ,
,
. ,
,

(),
.
,
,.
,
.

7.5.1.

/ , ,
.
, (majority vote, consensus)
(ensemble learning, meta-algorithms ...). (
),
(weak classifiers),
(.. >0, Q>0.5),
.,
(.. >0.95 Q>0.99),
.


,
, ( )
.,
(),
.,7.10.
, ,
.
.,
,
.,,
.,
.
,01(
). , ,
c
(0<c<1). , ,
,
.

7.10: 3 .
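The simplest way to combine several predictors, the majority vote mentioned in this section, can be sketched per residue as follows. The function name `majority_vote` is hypothetical, and the tie-breaking rule (take the tied label that appears first in predictor order) is an assumption made for the example; weighted schemes are equally possible.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine equal-length per-residue predictions by majority vote.

    `predictions` is a list of strings, one per predictor, for instance
    three-state secondary structure strings over H, E and C.
    """
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("need at least one prediction, all of equal length")
    consensus = []
    for column in zip(*predictions):
        counts = Counter(column)
        best = max(counts.values())
        tied = {label for label, n in counts.items() if n == best}
        # Assumed tie-break: first tied label in predictor order.
        consensus.append(next(label for label in column if label in tied))
    return "".join(consensus)


print(majority_vote(["HHHCC", "HHECC", "HEECC"]))
```

A consensus of this kind can only help when the individual methods are better than random and make partly independent errors.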

, ,
.,

. ,
(.. ,
).
ensemble learning, .
,
,


.,,
,c,
(0.8).
,
(refinement). ,
,
(..
). ,
(, ),
.
,.
ad-hoc ( )
.
,
.

7.5.2.

,
. ,
,.

.,
. ,


(,
).,7.11.
,
,,
.
, ,
,
.,,
,
.
,
. ,
BLAST
CLUSTAL,
KALIGN.,HMMER3.0,
HMM.
,,

6-8%.
,
. , (
).
,
.,.
,
,
.,
.


7.11:
.

,
PSI-BLAST. ,

,(PSSM).
,
55000(7.12).PSI-BLAST
(),

.
, .

,,
,
. ,
,
.
, .
PSI-BLAST
.,
(
), , -,


,
(Przybylski&Rost,2007).

7.12: PSSM. 1, 7 8,
.
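The idea of replacing a single sequence by a position-specific profile can be illustrated with a deliberately simplified calculation. The sketch below is not the PSI-BLAST algorithm: it only counts residues per alignment column, adds a flat pseudocount, and converts the frequencies to log-odds against a uniform background of 1/20. The function names, the pseudocount value and the uniform background are all assumptions made for the example.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def column_profile(alignment, pseudocount=1.0):
    """Per-column residue frequencies of a multiple alignment.

    Gap characters and unknown symbols are ignored. Each column is
    returned as a dict mapping residue -> frequency (summing to 1).
    """
    if not alignment or len({len(s) for s in alignment}) != 1:
        raise ValueError("alignment must contain equal-length sequences")
    profile = []
    for col in range(len(alignment[0])):
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for seq in alignment:
            if seq[col] in counts:
                counts[seq[col]] += 1
        total = sum(counts.values())
        profile.append({aa: n / total for aa, n in counts.items()})
    return profile


def log_odds(profile, background=1.0 / 20.0):
    """Convert frequencies to log2 odds against a uniform background."""
    return [{aa: math.log2(f / background) for aa, f in column.items()}
            for column in profile]


toy_alignment = ["ACD", "ACE", "A-D"]   # toy alignment, not real data
scores = log_odds(column_profile(toy_alignment))
print(round(scores[0]["A"], 2), round(scores[0]["W"], 2))
```

Each row of such a matrix (20 numbers per position) can then be fed to a predictor in place of the 20 binary inputs of the sparse encoding.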

, (
), (7.13).,
,,
. ,
,
.

,(
).


7.13:
.

7.6.

7.6.1

,
,1970,
.
, ,
,,
. , ,
, 3
- (), - () (C),
.


7.14: .
.


.
,-
,,.,Chou
Fasman (Chou & Fasman, 1978) ( ) 29
,
3(H,E,C),(P).
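: f_j(i) is the frequency with which residue i is found in conformation j (helix, sheet, turn), and <f_j> is the
average of f_j over all residues. The propensity P_j(i) of residue i for conformation j is then
P_j(i) = f_j(i) / <f_j>. The Chou and Fasman parameters are given in Table 7.1. As a worked example, take
an amino acid that occurs 228 times in the data set (119 times in α-helix, 38 in β-sheet and 71 in coil).
Its frequencies are f_α = 0.522, f_β = 0.167 and f_C = 0.311. Over all 2473 residues, the average
α-helix frequency is <f_α> = 890/2473 = 0.360, the average β-sheet frequency is <f_β> = 424/2473 = 0.171,
and <f_C> = 1159/2473 = 0.469. Hence P_α = 0.522/0.360 = 1.45, P_β = 0.167/0.171 = 0.97 and
P_C = 0.311/0.469 = 0.66.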
Pj(i)>1.0
.,
. , , 4
PH(i)>1 3 5 P(i)>1.
,4
Pj(i)<1.,-
- ,
(, , , ). , ( ),
5 P(i)>1.05, P(i)> P(i) .
,
(, turn), C (coil,
).
,log-oddsscore
.


log-odds score,
,,Chou-Fasman
. , log-odds score
, Chou-Fasman
. ,
.
(~60%)55%.
(,
). ,
.,
http://cho-fas.sourceforge.net/
Amino acid   P(helix)   P(sheet)   P(coil)
A (Ala)      1.420      0.830      0.660
R (Arg)      0.980      0.930      0.950
N (Asn)      0.670      0.890      1.560
D (Asp)      1.010      0.540      1.460
C (Cys)      0.700      1.190      1.190
Q (Gln)      1.110      1.100      0.980
E (Glu)      1.510      0.370      0.740
G (Gly)      0.570      0.750      1.560
H (His)      1.000      0.870      0.950
I (Ile)      1.080      1.600      0.470
L (Leu)      1.210      1.300      0.590
K (Lys)      1.160      0.740      1.010
M (Met)      1.450      1.050      0.600
F (Phe)      1.130      1.380      0.600
P (Pro)      0.570      0.550      1.520
S (Ser)      0.770      0.750      1.430
T (Thr)      0.830      1.190      0.960
W (Trp)      1.080      1.370      0.960
Y (Tyr)      0.690      1.470      1.140
V (Val)      1.060      1.700      0.500

7.1: () Chou Fasman
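The propensity calculation behind Table 7.1 is simple enough to check by hand or in code. The sketch below reproduces the worked example from the text (228 occurrences of one amino acid, 2473 residues in total); the function name `propensities` and the dictionary keys are hypothetical and chosen only for the illustration.

```python
def propensities(residue_counts, total_counts):
    """Chou-Fasman style propensities P_j(i) = f_j(i) / <f_j>.

    residue_counts: occurrences of one amino acid in each conformation.
    total_counts:   occurrences of all residues in each conformation.
    """
    n_residue = sum(residue_counts.values())
    n_total = sum(total_counts.values())
    result = {}
    for state, count in residue_counts.items():
        f = count / n_residue                  # f_j(i)
        f_avg = total_counts[state] / n_total  # <f_j>
        result[state] = f / f_avg
    return result


p = propensities({"helix": 119, "sheet": 38, "coil": 71},
                 {"helix": 890, "sheet": 424, "coil": 1159})
print({state: round(value, 2) for state, value in p.items()})
```

A value above 1 means the residue is found in that conformation more often than an average residue, which is how the table entries are read.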

,
.,
(
). GOR (Garnier-Osguthorpe-Robson).
log-oddsscore
17 (Garnier, Osguthorpe, & Robson, 1978).

,
(
). , , ,
.,
, , GOR IV (https://npsa-prabi.ibcp.fr/cgibin/npsa_automat.pl?page=npsa_gor4.html), ,
64%,GOR V(http://gor.bb.iastate.edu/),
PSI-BLAST,
74%.
1987
QianSejnowski


,68%(Qian&Sejnowski,1988).,
70%1992RostSanderPHD(Rost&Sander,
1993).
(juryofnetworks),(130),
(structure-to-structurenetwork)
.
6-8%
70% . PSI-PRED
(http://bioinf.cs.ucl.ac.uk/psipred/) profiles PSI-BLAST (Jones, 1999) ( 76%).
PSI-PRED
(513 187
).,PHD,PROFphd
PSI-BLAST (~75-76%).
, JNET (Cuff & Barton, 2000).
,,PSIBLAST,Support
VectorMachines(SVM)RecurrentNeuralNetworks(RNN).
,(
),
(
) (
).
(cross-validation)
,
,,(Bagos,Tsaousis,&
Hamodrakas,2009).

7.15:
.
(Bagos, Tsaousis, et al., 2009).


,
70% ( )
( 500 ). ,

80%(
2000).,80%
,,
.
/
. 1988
Hamodrakas (Hamodrakas, 1988)
(Chou-Fasman,GOR,Lim,Dufton-Hider,Burgess,Nagano).
2-3%
SecStr (http://athina.biol.uoa.gr/SecStr/). ,
SecStr,
70%. JPRED (http://www.compbio.dundee.ac.uk/jpred/)

1998(Cuff,Clamp,Siddiqui,Finlay,&Barton,1998).JNET
(NNSSP,DSC,PREDATOR,MULPRED,PHD,ZPRED)
. , 4 (JPRED4)
,
,
.
NPS@ (https://npsa-prabi.ibcp.fr/cgi-bin/npsa_automat.pl?page=/NPSA/npsa_seccons.html)
SOPM,SOPMA,HNN,MLRC,DPM,DSC,
GORI,GORIII,GORIV,PHD,PREDATOR,SIMPA96
. CONCORD
(http://helios.princeton.edu/CONCORD/)PSIPRED,DSC,GORIV,Predator,Prof,
PROFphd, SSpro, SYMPRED (http://www.ibi.vu.nl/programs/sympredwww/)
PHDpsi,PROFsec,SSPro,Predator,YASPIN,JNetPSIPRED.
,
( ,
,...).,B.Rost
PREDICTPROTEIN (www.predictprotein.org/), PSIPRED (http://bioinf.cs.ucl.ac.uk/psipred/)
, SCRATCH
(http://scratch.proteomics.ics.uci.edu/index.html).
, .

. ,

. , 1990 CASP (Critical Assessment of Structure Predictions, http://predictioncenter.org/).,

.
, .

.
,Rost.
EVA(Kohetal.,2003)PDB
,.
,,


7.6.2.

,
, ,
. ,
(Singer
&Nicolson,1972),
(7.16).

7.16: ,
(https://en.wikipedia.org/wiki/Cell_membrane)

(,,,)

( ),
,
. ,
...
(, , , )

.
,
. ,
, ( )
, , (Alberts et al., 1994).

. ,
-,
. ,
,
,
(
) ( ).
, ,
,
. ,
,
,
,



(Marsh, Horvath, Swamy, Mantripragada, & Kleinschmidt, 2002).
,,,
- (.. ),
.
- ,
, , ,
(, , , Golgi
). , ( ),
, .
,.
, ,
,.
, ,
.

7.17: . ,
Natronomonas pharaonis. , NspA, Neisseria meningitidis. ,
.

,
.
-()
-
( 7.17). ,
,
.

,

.
-
(von Heijne, 1999),
Gram (Schulz,
2003). ,


,
( 7.18). Gram
,,(7.19),
. , (.. Mycobacterium),
,
().

7.18: Gram .
( ), ,
.
Gram , , , .
, .


7.19: Gram .
, , ,
, .

-,
. ,
-,
,
(Alberts et al.,
1994). ,
multi-spanning,
.,
, G (G-Protein Coupled Receptors-GPCRs),
(Kristiansen, 2004) , (
) (
G ). -, ,
, ,
-
,.
-
20 (Eisenberg, Weiss, &
Terwilliger, 1984; Kyte & Doolittle, 1982), -
, . , 15-25 ,

. positive inside rule,

(von Heijne, 1992),
. (, positive inside rule),
- ,


(Rojo, Guiard, Neupert, & Stuart, 1999)


(Houben, de Gier, & van Wijk, 1999). ,
,
(Claros&vonHeijne,1994),
(Pasquier, Promponas, Palaios, Hamodrakas, &
Hamodrakas, 1999), Hidden Markov Models
(Krogh, Larsson, von Heijne, & Sonnhammer, 2001; Tusnady & Simon, 1998)
(Pasquier&Hamodrakas,1999;Rost,Casadio,Fariselli,&Sander,1995).
,
.
,
.

, ,

(Kyogoku et al., 2003;
Loll,2003;Walian,Cross,&Jap,2004).,
25-30%(Chen&
Rost,2002;Pasquier,Promponas,&Hamodrakas,2001),
500, (<1%)
(Tusnady,Dosztanyi,&Simon,2004).
,
, .
,
,.,
,
40
(White,2004).,,
,
20
,
. ,
,
.
-
.
positive-inside.
, TopPred (Claros & von Heijne,
1994)( http://mobyle.pasteur.fr/cgi-bin/portal.py#forms::toppred). To TMpred
(http://www.ch.embnet.org/software/TMPRED_form.html)
. ,
MEMSAT ( ), log-odds score

(http://bioinf.cs.ucl.ac.uk/?id=756). PRED-TMR
,(Pasquieretal.,1999)(
http://athina.biol.uoa.gr/PRED-TMR/) ,
,, orienTM (http://athina.biol.uoa.gr/orienTM/),

(Liakopoulos, Pasquier, & Hamodrakas, 2001).


,,
1996 PHDtm (www.predictprotein.org),
.HMM1998(Sonnhammer,vonHeijne,&
Krogh, 1998), TMHMM (http://www.cbs.dtu.dk/services/TMHMM/)
,(


).,
HMMTOP (http://www.enzim.hu/hmmtop/) (Tusnady & Simon, 2001).
, , CoPreThi
(http://athina.biol.uoa.gr/CoPreTHi/)
SOSUI, Tmpred, ISREC, DAS, TopPred, PHDtm PRED-TMR (Promponas,
Palaios,Pasquier,Hamodrakas,&Hamodrakas,1999).
,
,
.
(.)
-,
.Phobius (Kall,Krogh,&
Sonnhammer,2004)(http://phobius.sbc.su.se/),
SPOCTOPUS (http://octopus.cbr.su.se/index.php?about=SPOCTOPUS).
,
- . ,
, . ,

.,
.

HMMpTM
(http://bioinformatics.biol.uoa.gr/HMMpTM),
,
(Tsaousis,Bagos,&Hamodrakas,2014).
-,
.
,,
,
.
(gene fusions),
,()

(Drew et al., 2002; Melen, Krogh, & von Heijne, 2003).

(, ),
,
E. coli(Rappetal.,2004)S. cerevisiae(Kim,Melen,&vonHeijne,2003).
,HMMTOP(Tusnady&Simon,
2001), ,
.,
-,Phobius(Kalletal.,2004).
HMM-TM(http://bioinformatics.biol.uoa.gr/HMM-TM/),

,.

, , -
,.,
,
,.,
, TMHMM,
(PRO-TMHMM, PRODIV-TMHMM, S-TMHMM).
, Phobius PolyPhobius
(http://phobius.sbc.su.se/poly.html). ,


. ,
TOPCONS (http://topcons.net), PolyPhobius, OCTOPUS,
SPOCTOPUSSCAMPI(,),Philius
,

.,TOPCONS,.
- , - ( )
,
.-
, , .
,-
,-
(-hairpin).,-
,n8n26S, 8S24
(Schulz,2003).,
, 30-60. ,
,,-
--
,19-.

7.20: - .
- -.

,
- ( 7.20).
(),
-
.
. ()
, OmpA (Morona, Kramer, & Henning, 1985)
-,OmpX (Vogt&Schulz,


1999),.
(OmpA,OmpX),8,
.
OmpX,OmpA,

(Ringler&Schulz,2002),
(Sugawara&Nikaido,1992,1994).
NspA (Vandeputte-Rutten, Bos, Tommassen, & Gros, 2003) 8 , OpcA
(Prince, Achtman, & Derrick, 2002) 10 ,
.

7.21: - -,
. -, TolC, MspA.

-,
- .
-
, -,
, 6 22
.,79
, - (

)-.
,
,
,(..).

- , (Zhai&Saier,2002),
-.,
, (
), , ,
. , -,


,
.
, ,
. ,
-,-,
( ) .
,
,
(Schulz,2002,2003).
:
(1) - ,
- .
,
.
(2)
,
.
(3),
( ). ,
,100,
.
(4) , ,
()
().,
, 12
30.
-.
(5)-
6 22 . ,
, ,
,
.
(6)-
, - .
,.
,
,
.
(7)-,
.
-, ,
, .
,-
. ,
,
.
-,Diederichs(Diederichs,
Freigang,Umhau,Zeth,&Breed,1998).B2TMPRED


(Jacoboni, Martelli, Fariselli, De Pinto, & Casadio, 2001)
http://gpcr.biocomp.unibo.it/cgi/predictors/outer/pred_outercgi.cgi.
TBBpred (http://www.imtech.res.in/raghava/tbbpred/) TMBETA-NET
(http://psfs.cbrc.jp/tmbeta-net/) ,


TMBETAPRED-RBF (http://rbf.bioinfo.tw/~sachen/BARRELpredict/TMBETAPRED-RBF.php)
TMBpro (http://tmbpro.ics.uci.edu/)
.
HiddenMarkovModel(HMM)
2000, (Bagos, Liakopoulos,
Spyropoulos, & Hamodrakas, 2004a, 2004b; Bigelow, Petrey, Liu, Przybylski, & Rost, 2004; Hayat &
Elofsson,2012;Liu,Zhu,Wang,&Li,2003;Martelli,Fariselli,Krogh,&Casadio,2002;Savojardo,Fariselli,
& Casadio, 2013; Singh, Goodman, Walter, Helms, & Hayat, 2011). HMMB2TMR,
(http://gpcr.biocomp.unibo.it/predictors/),
, BetAware (http://www.biocomp.unibo.it/~savojard/betawarecl). PRED-TMBB
(http://bioinformatics.biol.uoa.gr/PRED-TMBB/),
, ,
,
.
PROFtmb (https://www.predictprotein.org/)
, TMBHMM TMBhunt.
,BOCTOPUS(http://boctopus.cbr.su.se/),
SupportVectorMachinesHMMs.

7.22: PRED-TMBB OMPG_ECOLI.


,
- .

-,
,
- ,
.,

. , 2005,

-,

ConBBPRED
(http://bioinformatics.biol.uoa.gr/ConBBPRED/). To ConBBPRED

. ,
.

(Bagos, Liakopoulos, &


Hamodrakas,2005).
-
, ,
10-20 . ,
BOCTOPUS,


, . ,
PRED-TMBB
.PRED-TMBBBOCTOPUS,
PROFtmb, BetAware HMM-B2TMR.
PRED-TMBB,
.,PRED-TMBB2
(www.compgen.org/tools/PRED-TMBB2), ,
.

7.23: Omp85 (Neisseria meningitidis),


ConBBPRED (http://bioinformatics.biol.uoa.gr/ConBBPRED). :
. :
- (0-1). : ( ),
.

,PRED-TMBB,BetAware
TMBETA-NET,
-.,
,-
. , -
(~2% )
>90% . ,
(),
.
Michael Gromiha TMBETADISCRBF(http://rbf.bioinfo.tw/~sachen/OMPpredict/TMBETADISC-RBF.php),
(94%),(85%).BOMP(http://services.cbu.uib.no/tools/bomp)

(99%), (~68%). PSORTb


(http://www.psort.org/psortb/)
,.
,
. ,
(~99.5%) (~50%). -barrel analyzer (http://betabarrel.tulane.edu/FW_analysis.php) Freeman-Wimley,
(86%) (95%). , HHomp
(http://toolkit.tuebingen.mpg.de/hhomp),
-(
).
,,
.


7.6.3.

,
( ). ,
,
.
(, ),
,
( ,
).,
(n-region),(h-region)(c-region)
,
( --),
, (von Heijne, 1990).
,,(Driessen
& Nouwen, 2007), (Rapoport, Matlack, Plath, Misselwitz, &
Staeck, 1999), (Pohlschroder, Gimenez, & Jarrell, 2005).
, ,
(Tuteja, 2005; van Roosmalen et al., 2004).
,
( ), ,
,,
(Habib, Neupert, &
Rapaport, 2007; G. von Heijne, Steppuhn, & Herrmann, 1989). ,
, .

,,
(PTS1), ,
(PTS2).
,,
(Sec),
(Twin-Argininetranslocase-Tat).Tat
, (RR)
n-region (Berks, Palmer, & Sargent, 2005; Lee, Tullman-Ercek, & Georgiou, 2006; Teter & Klionsky,
1999). Sec Tat,
,
,
(Teter & Klionsky, 1999). (
Sec,Tat),(Spase
I), .
,,,
. ,
(Spase II or Lsp), .
,
, c-region (lipobox),
C,
.
[LVI]-[AST]-[GA]-C,
.,
Tat. , ,
.
,
, , 1980.
weight matrices Gunnar von Heijne (von Heijne, 1986),


, SigCleave,
(http://emboss.sourceforge.net/apps/release/6.6/emboss/apps/sigcleave.html).
weightmatrices,
, PrediSi (http://www.predisi.de/). ,
,(,
Gram , Gram ). ,
.
,SignalP(http://www.cbs.dtu.dk/services/SignalP),
4.1,
(),
,,
(Bendtsen, Nielsen, von Heijne, & Brunak, 2004).
,
. Phobius,
http://phobius.sbc.su.se/ (Kall et al., 2004; Kall, Krogh, & Sonnhammer, 2007) Philius (Reynolds,
Kall, Riffle, Bilmes, & Noble, 2008),
http://noble.gs.washington.edu/proj/philius/,(Bayesian
network,
),

SPOCTOPUS
(http://octopus.cbr.su.se/index.php?about=SPOCTOPUS).
,,
,
. ,
(
). ,
, ,
,Uniprot(Bagos,Tsirigos,Plessas,
Liakopoulos, & Hamodrakas, 2009). ,
Gram ,
, PREDSIGNAL(http://www.compgen.org/tools/PRED-SIGNAL).

PROSITE,5(..PS00013).,
. LipoP
(http://www.cbs.dtu.dk/services/LipoP), HMM
Gram(Junckeretal.,2003).LipoP
,
97%
Gram , (,
), 0.3%. ,
Gram , 90-92%. ,
,
Gram , PRED-LIPO
(http://www.compgen.org/tools/PRED-LIPO),
,
,(Bagos,Tsirigos,Liakopoulos,&Hamodrakas,2008).
,,
(GPI-anchor).
, ,
.
, PredGPI (http://gpcr.biocomp.unibo.it/predgpi/pred.htm), big-PI
(http://mendel.imp.ac.at/gpi/gpi_server.html),

FragAnchor
(http://navet.ics.hawaii.edu/~fraganchor/NNHMM/NNHMM.html) GPI-SOM (http://gpi.unibe.ch/).
,
.


LPXTG(
,),Gram(
Gram
). , , CW-PRED
(http://bioinformatics.biol.uoa.gr/CW-PRED/),,
().
, (
) Tat (
),
. TATFIND
(http://signalfind.org/tatfind.html),
(Rose, Bruser, Kissinger, & Pohlschroder, 2002). TatP
(http://www.cbs.dtu.dk/services/TatP/),
RR(Bendtsen,Nielsen,Widdick,Palmer,&Brunak,2005).TatP
,SignalP,TATFIND
RR, .
, PRED-TAT (http://www.compgen.org/tools/PRED-TAT/),
HMMs,(SecTat),
, . ,
,Tat,(Sec)
,SignalP(Bagosetal.,
2010).

,.,

ChloroP
(http://www.cbs.dtu.dk/services/ChloroP),

TargetP
(http://www.cbs.dtu.dk/services/TargetP),
, .
iPSORT (http://ipsort.hgc.jp/how.html).

MitoProt
(https://ihg.gsf.de/ihg/mitoprot.html),Predotar(http://urgi.versailles.inra.fr/predotar/predotar.html),
Tppred2 (http://tppred2.biocomp.unibo.it). , PTS1
predictor (http://mendel.imp.ac.at/mendeljsp/sat/pts1/PTS1predictor.jsp),
cNLS Mapper (http://nls-mapper.iab.keio.ac.jp/cgibin/NLS_Mapper_form.cgi), NLStradamus (http://www.moseslab.csb.utoronto.ca/NLStradamus/),
NucPred
(http://www.sbc.su.se/~maccallr/nucpred/)

PredictNLS
(https://rostlab.org/owiki/index.php/PredictNLS).
,,
. ,
, (..
), (.. ,
Golgi ...). ,
.,,
,
(,,...).
,WoLF PSORT(http://wolfpsort.org/),

( PSORT PSORT II).
, PSORTb (http://www.psort.org/psortb/index.html).

LOCtree
(http://cubic.bioc.columbia.edu/cgi-bin/var/nair/loctree/query),

ESLPred2
(http://www.imtech.res.in/raghava/eslpred2/),

LOCSVMPSI
(http://bioinformatics.ustc.edu.cn/locsvmpsi/locsvmpsi.php), CELLO (http://cello.life.nctu.edu.tw/),
BaCELLO (http://gpcr.biocomp.unibo.it/bacello/), Protein Prowler (http://pprowler.imb.uq.edu.au/),
Hum-Ploc2 (http://www.csbio.sjtu.edu.cn/bioinf/hum-multi-2/), AAIndexLoc (http://aaindexloc.bii.a-star.edu.sg/)
and SecretP (http://cic.scu.edu.cn/bioinformatics/secretp/index.htm).


, ( PSORTb), iLoc-Gneg
(http://icpr.jci.edu.cn/bioinfo/iLoc-Gneg), Gpos-mPloc (http://www.csbio.sjtu.edu.cn/bioinf/Gpos-multi/)
Gneg-mPloc (http://www.csbio.sjtu.edu.cn/bioinf/Gneg-multi/) Gram

SOSUI-GramN
(http://bp.nuap.nagoyau.ac.jp/sosui/sosuigramn/sosuigramn_submit.html) Gram , PSLPred
(http://www.imtech.res.in/raghava/pslpred/), Augur (http://bioinfo.mikrobio.med.uni-giessen.de/augur),
SubLoc(http://www.bioinfo.tsinghua.edu.cn/SubLoc/).
,
-,
,.,
SecretomeP (http://www.cbs.dtu.dk/services/SecretomeP),
NclassG+(http://www.biolisi.unal.edu.co/web-servers/nclassgpositive/).

7.6.4.

,
,
. ,
,
, (coiled coil).
, COILS (http://www.ch.embnet.org/software/COILS_form.html),
PAIRCOIL (http://paircoil2.csail.mit.edu/),
,MULTICOIL(http://multicoil2.csail.mit.edu/cgi-bin/multicoil2.cgi),
, CCHMM (http://gpcr.biocomp.unibo.it/cgi/predictors/cc/pred_cchmm.cgi)
MARCOIL (http://bcf.isb-sib.ch/webmarcoil/webmarcoilINFOC1.html),
.
,.
,
,
.
DisEMBL (http://dis.embl.de/), PrDOS (http://prdos.hgc.jp/cgi-bin/top.cgi), DISpro
(http://www.ics.uci.edu/~baldig/dispro.html),DISOPRED(http://bioinf.cs.ucl.ac.uk/psipred/?disopred=1),
MeDor (http://www.vazymolo.org/MeDor/index.html),
MetaDisorder(http://genesilico.pl/metadisorder/)DisProt(http://www.disprot.org/pondr-fit.php).
,
,
. ,
, . DIpro
(http://download.igb.uci.edu/bridge.html),EDBCP(http://biomedical.ctust.edu.tw/edbcp/),CYSPRED
(http://gpcr.biocomp.unibo.it/cgi/predictors/cyspred/pred_cyspredcgi.cgi),

DiANNA
(http://clavius.bc.edu/~clotelab/DiANNA/),Dinosolve(http://hpcr.cs.odu.edu/dinosolve/),DISULFIND
(http://disulfind.dsi.unifi.it/)CysCON(http://www.csbio.sjtu.edu.cn/bioinf/Cyscon/).
, . -
,
. , -
(
).,
,-,
. , ,
(, )
. , , , , ,
...



Golgi. - ( ), ( ) C- (
). - NetNGlyc
(http://www.cbs.dtu.dk/services/NetNGlyc/), -
NetOGlyc (http://www.cbs.dtu.dk/services/NetOGlyc/) C- NetCGlyc
(http://www.cbs.dtu.dk/services/NetCGlyc/),YinOYang(http://www.cbs.dtu.dk/services/YinOYang/)
. GlycoEP
(http://www.imtech.res.in/raghava/glycoep/submit.html),
, GPP (http://comp.chem.nottingham.ac.uk/glyco/).
,Oglyc(http://www.biosino.org/Oglyc/),ISOGlyP(http://isoglyp.utep.edu/)
CKSSAP_OGlySite(http://bioinformatics.cau.edu.cn/zzd_lab/CKSAAP_OGlySite/).

7.24: .
.

,
,,.

.
NetPhos(http://www.cbs.dtu.dk/services/NetPhos/)
,NetPhosK(http://www.cbs.dtu.dk/services/NetPhosK/)
.GPS (http://gps.biocuckoo.org/)
( ). KinasePhos2 (http://kinasephos2.mbc.nctu.edu.tw/)

HMM. PhosphoSVM (http://sysbio.unl.edu/PhosphoSVM/),
DISPHOS (http://www.dabi.temple.edu/disphos/), pkaPS (http://mendel.imp.ac.at/sat/pkaPS/)


Predikin (http://predikin.biosci.uq.edu.au/).
(
, ), ,
.

MetaPredPS
(http://c1.accurascience.com/MetaPred/MetaPredPS_091201/). ,
, ,
NetPhosYeast (http://www.cbs.dtu.dk/services/NetPhosYeast/) NetPhosBac
(http://www.cbs.dtu.dk/services/NetPhosBac-1.0/).
-

.

Myristoylator
(http://web.expasy.org/myristoylator/)

(http://mendel.imp.ac.at/myristate/SUPLpredictor.htm) ,
, NetAcet (http://www.cbs.dtu.dk/services/NetAcet/)
TermiNator (http://www.isv.cnrsgif.fr/terminator3/index.html),.
, ,
(),.,CSSPalm(http://csspalm.biocuckoo.org/),GPSTSP (http://tsp.biocuckoo.org/) Sulfinator (http://web.expasy.org/sulfinator/)
, PAIL (http://bdmpail.biocuckoo.org/) KAT
(http://bioinfo.bjmu.edu.cn/huac/) . ,
.
,
.(Ubiquitin)
,
, SUMO (Small
Ubiquitin-like Modifier). UbPred
(http://www.ubpred.org/), BDM-PUB (http://bdmpub.biocuckoo.org/), CKSAAP_UbSite
(http://protein.cau.edu.cn/cksaap_ubsite/), iUbiq-Lys (http://www.jci-bioinfo.cn/iUbiq-Lys)
UbiProber (http://bioinfo.ncu.edu.cn/UbiProber.aspx). , SUMO
SUMOplot (http://www.abgent.com/sumoplot) and GPS-SUMO (http://sumosp.biocuckoo.org/).

7.25: PRED-CLASS

,
.
.,PRED-CLASS
(http://athina.biol.uoa.gr/PRED-CLASS/) , ,


( ,
..)(Pasquieretal.,2001).3,
. ,
()30.
(
),,.
,
.,,
. ,
,
. ,
(20) (10 ,
,).,
( 30 ),
Fourier(FFT),
.
, , ( ).
(>95% )
30%
. ,
, ..
,
.

7.26: PRED-COUPLE.
,
pHMM. ,
.

,
GPCR G-.
G- (G protein-coupled receptors-GPCRs),
.
-,



.,GPCRs,,

G-, . PRED-COUPLE
(http://athina.biol.uoa.gr/bioinformatics/PRED-COUPLE/),
profileHiddenMarkovModels,
GPCR G- (Sgourakis, Bagos, Papasaikas, & Hamodrakas,
2005).,
,G.,GPCRs(Gi/o,GsGq/11)
cross-validation(
5),89.7%.30
,
25(83.3%).

PRED-COUPLE2
(http://athina.biol.uoa.gr/bioinformatics/PRED-COUPLE2/),
, , ,
G-(Sgourakis,Bagos,&Hamodrakas,2005).
, pHMM
. ,
(~95%) --,
.
.,
.

7.27: PRED-COUPLE2.
pHMM PRED-COUPLE.

7.7.

DNA/RNA

,,.

DNA RNA. DNA
(genefinding),


- (Math, Sagot, Schiex, & Rouze, 2002).


, ,
,,
.
, (
),
(
3)., ()
, . ,
,

. ,
,.

7.28:


:
(exon/intronsplicesite),(promoterrecognition),
(translationinitiationsiteprediction)(Saeys,Abeel,Degroeve,&Van
de Peer, 2007), mRNA (polyadenylation prediction)
(Changetal.,2011),,,.,
.
,DNA,
(Saeysetal.,2007).,
,(
), . ,

. ,
1980, 1990
.
,:,weightmatrices,,
Hidden Markov Models.
ab initio gene finders,

homology-based gene finders.
,:
FrameD (http://tata.toulouse.inra.fr/apps/FrameD/FD)
GeneMark (http://exon.gatech.edu/GeneMark/gmchoice.html)
Glimmer (http://www.ncbi.nlm.nih.gov/genomes/MICROBES/glimmer_3.cgi)
EasyGene (http://www.cbs.dtu.dk/services/EasyGene/)


FGENESB (http://linux1.softberry.com/berry.phtml?topic=fgenesb&group=help&subgroup=gfindb)
Prodigal (http://prodigal.ornl.gov/)
,,:
FGENESH (http://linux1.softberry.com/berry.phtml?topic=fgenesh&group=programs&subgroup=gfind)
GlimmerHMM (https://ccb.jhu.edu/software/glimmerhmm/)
HMMgene (http://www.cbs.dtu.dk/services/HMMgene/)
GeneMark.hmm (http://exon.gatech.edu/GeneMark/hmmchoice.html)
GeneID (http://genome.crg.es/software/geneid/geneid.html)
GENSCAN (http://genes.mit.edu/GENSCAN.html)
mGene (http://raetschlab.org/suppl/mgene)
Grail (http://compbio.ornl.gov/grailexp/)
(translationinitiation):
ATGpr (http://atgpr.dbcls.jp/)
NetStart (http://www.cbs.dtu.dk/services/NetStart/)
TIS Miner (http://dnafsminer.bic.nus.edu.sg/Tis.html)
StartScan (http://bioinformatics.psb.ugent.be/webtools/startscan/)
mRNA:
Poly(A) Signal Miner (http://dnafsminer.bic.nus.edu.sg/)
PolyAPred (http://www.imtech.res.in/raghava/polyapred/help.html)
POLYAH (http://www.softberry.com/berry.phtml?topic=polyah&group=programs&subgroup=promoter)
PolyApredict (http://cub.comsats.edu.pk/polyapredict.htm)
,/
,:
Human Splice Finder (http://www.umd.be/HSF3/)
NetGene (http://www.cbs.dtu.dk/services/NetGene2/)
NetPlant (http://www.cbs.dtu.dk/services/NetPGene/)
GeneSplicer (https://ccb.jhu.edu/software/genesplicer/)
SpliceView (http://bioinfo4.itb.cnr.it/~webgene/wwwspliceview_ex.html)
SplicePredictor (http://bioservices.usd.edu/splicepredictor/)
,DNA
. ,
Methylator (http://bio.dfci.harvard.edu/Methylator/)
epigram (http://wanglab.ucsd.edu/star/epigram/),

NuPoP
(http://nucleosome.stats.northwestern.edu/)

Segal
(http://genie.weizmann.ac.il/software/nucleo_prediction.html),
DNA uMELT (https://www.dna.utah.edu/umelt/umelt.html),
DNADNAshape(http://rohslab.cmb.usc.edu/DNAshape/)
DNAtools (http://hydra.icgeb.trieste.it/dna/),

(SNPs),

SNAP
(https://rostlab.org/services/snap/), FuncPred (http://snpinfo.niehs.nih.gov/snpinfo/snpfunc.htm)
PredictSNP(http://loschmidt.chemi.muni.cz/predictsnp/).
RNA,
, .
,
RNA,.,RNA
.
,


RNA
10.
RNA,
. microRNA
(miRNA) - RNA ( 21-22 ,
, pri-miRNA)

mRNA-(Cai,Yu,Hu,&
Yu, 2009).
mRNA. , mRNA ,
,
. miRNAs small interfering RNA (siRNA)
miRNAsRNA
siRNAs RNA.
1000 miRNA
60%,.

7.29: miRNA.
.

miRNA,
.
.
miRNA mRNA ,
mRNA.,miRNA
mRNA 68
5' miRNA.
mRNA.,


miRNA,,mRNA,mRNA
miRNA.
miRNA:
,.
, , , ,
,
RNA. miRNA
:
CID miRNA (http://melb.agrf.org.au:8888/cidmirna/)
MiRPara (https://code.google.com/p/mirpara/)
HeteroMirPred (http://ncrna-pred.com/premiRNA.html)
HHMMiR (http://biodev.hgen.pitt.edu/kadriAPBC2009.html)
HuntMi (http://adaa.polsl.pl/agudys/huntmi/huntmi.htm)
MaturePred (http://nclab.hit.edu.cn/maturepred/)
microPred (http://www.cs.ox.ac.uk/people/manohara.rukshan.batuwita/microPred.htm)
MiPred (http://www.bioinf.seu.edu.cn/miRNA/)
miRAbela (http://www.mirz.unibas.ch/cgi/pred_miRNA_genes.cgi)
MiRAlign (http://bioinfo.au.tsinghua.edu.cn/miralign/)
miRBoost (http://evryrna.ibisc.univ-evry.fr/miRBoost/index.html)
mirnaDetect (http://datamining.xmu.edu.cn/main/~leyiwei/mirnaDetect.html)
miRNAFold (http://evryrna.ibisc.univ-evry.fr/miRNAFold/)
MiRscan (http://genes.mit.edu/mirscan/)
novoMIR (http://www.biophys.uni-duesseldorf.de/novomir/)
ProMiR (http://bi.snu.ac.kr/Research/ProMiR/ProMiR.html)
RNAmicro (http://www.tbi.univie.ac.at/~jana/software/RNAmicro.html)
tripletSVM (http://bioinfo.au.tsinghua.edu.cn/mirnasvm/)
SplamiR (http://www.uni-jena.de/SplamiR.html)
SSCprofiler (http://mirna.imbb.forth.gr/SSCprofiler.html)
EumiR (http://miracle.igib.res.in/eumir/)
,miRNA,:
Diana Micro-T (http://diana.cslab.ece.ntua.gr/microT/)
PicTar (http://pictar.mdc-berlin.de/)
TargetScan (http://www.targetscan.org/)
miRTar (http://mirtar.mbc.nctu.edu.tw/human/)
miRanda (http://www.microrna.org/microrna/home.do)
MaMi (http://mami.med.harvard.edu/)
ComiR (http://www.benoslab.pitt.edu/comir/) ()
PITA (http://genie.weizmann.ac.il/pubs/mir07/mir07_prediction.html)
MirMap (http://mirmap.ezlab.org/)
STarMir (http://sfold.wadsworth.org/starmir.html)
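Most of the target-prediction tools listed above start from complementarity between the mRNA and a short stretch at the 5' end of the miRNA. A minimal sketch of that first step is given below. It is not the algorithm of any listed tool: the function names are hypothetical, the seed is assumed to span miRNA positions 2-8, only perfect Watson-Crick matches are reported (G:U wobble pairs, site accessibility and conservation are ignored), and the UTR is a toy string.

```python
COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna):
    """Reverse complement of an RNA string over A, C, G, U."""
    return "".join(COMPLEMENT[base] for base in reversed(rna))


def seed_sites(mirna, utr, seed_start=2, seed_end=8):
    """Return 1-based UTR positions where the site matching the seed begins.

    The seed is taken as miRNA positions seed_start..seed_end (1-based,
    inclusive); the default 2-8 window is an assumption of this sketch.
    """
    seed = mirna[seed_start - 1:seed_end]
    site = reverse_complement(seed)   # sequence expected on the mRNA
    positions = []
    start = utr.find(site)
    while start != -1:
        positions.append(start + 1)
        start = utr.find(site, start + 1)
    return positions


mirna = "UGAGGUAGUAGGUUGUAUAGUU"   # example miRNA sequence
toy_utr = "AAACUACCUCAGG"          # toy 3' UTR fragment
print(seed_sites(mirna, toy_utr))
```

Seed matches alone produce many false positives, which is why the tools above combine them with further evidence.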


Alberts, B., Bray, D., Lewis, J., Raff, M., Roberts, K., & Watson, J. (1994). Molecular Biology of the Cell
(3rd ed.): Garland Publishing, Inc.
Bagos, P. G., Liakopoulos, T. D., & Hamodrakas, S. J. (2005). Evaluation of methods for predicting the
topology of beta-barrel outer membrane proteins and a consensus prediction method. BMC
Bioinformatics, 6, 7. doi:10.1186/1471-2105-6-7
Bagos, P. G., Liakopoulos, T. D., Spyropoulos, I. C., & Hamodrakas, S. J. (2004a). A Hidden Markov Model
method, capable of predicting and discriminating beta-barrel outer membrane proteins. BMC
Bioinformatics, 5, 29. doi:10.1186/1471-2105-5-29
Bagos, P. G., Liakopoulos, T. D., Spyropoulos, I. C., & Hamodrakas, S. J. (2004b). PRED-TMBB: a web
server for predicting the topology of beta-barrel outer membrane proteins. Nucleic Acids Res, 32(Web
Server issue), W400-404. doi:10.1093/nar/gkh417
Bagos, P. G., Tsaousis, G. N., & Hamodrakas, S. J. (2009). How many 3D structures do we need to train a
predictor? Genomics Proteomics Bioinformatics, 7(3), 128-137. doi:10.1016/S1672-0229(08)60041-8
Bagos, P. G., Tsirigos, K. D., Liakopoulos, T. D., & Hamodrakas, S. J. (2008). Prediction of lipoprotein signal
peptides in Gram-positive bacteria with a Hidden Markov Model. J Proteome Res, 7(12), 5082-5093.
Bagos, P. G., Tsirigos, K. D., Plessas, S. K., Liakopoulos, T. D., & Hamodrakas, S. J. (2009). Prediction of
signal peptides in archaea. Protein Eng Des Sel, 22(1), 27-35. doi:10.1093/protein/gzn064
Bagos, P. G., Nikolaou, E. P., Liakopoulos, T. D., & Tsirigos, K. D. (2010). Combined prediction of Tat and
Sec signal peptides with hidden Markov models. Bioinformatics, 26(22), 2811-2817.
doi:10.1093/bioinformatics/btq530
Baldi, P., & Brunak, S. (2001). Bioinformatics: the machine learning approach: MIT press.
Baldi, P., Brunak, S., Chauvin, Y., Andersen, C. A., & Nielsen, H. (2000). Assessing the accuracy of
prediction algorithms for classification: an overview. Bioinformatics, 16(5), 412-424.
Bendtsen, J. D., Nielsen, H., von Heijne, G., & Brunak, S. (2004). Improved prediction of signal peptides:
SignalP 3.0. J Mol Biol, 340(4), 783-795. doi:10.1016/j.jmb.2004.05.028
Bendtsen, J. D., Nielsen, H., Widdick, D., Palmer, T., & Brunak, S. (2005). Prediction of twin-arginine signal
peptides. BMC Bioinformatics, 6, 167. doi:10.1186/1471-2105-6-167
Berks, B. C., Palmer, T., & Sargent, F. (2005). Protein targeting by the bacterial twin-arginine translocation
(Tat) pathway. Curr Opin Microbiol, 8(2), 174-181. doi:10.1016/j.mib.2005.02.010
Bigelow, H. R., Petrey, D. S., Liu, J., Przybylski, D., & Rost, B. (2004). Predicting transmembrane
beta-barrels in proteomes. Nucleic Acids Res, 32(8), 2566-2577.
Bishop, C. M. (1998). Neural Networks for Pattern Recognition: Oxford University Press.
Cai, Y., Yu, X., Hu, S., & Yu, J. (2009). A brief review on the mechanisms of miRNA regulation. Genomics,
proteomics & bioinformatics, 7(4), 147-154.
Chang, T.-H., Wu, L.-C., Chen, Y.-T., Huang, H.-D., Liu, B.-J., Cheng, K.-F., & Horng, J.-T. (2011).
Characterization and prediction of mRNA polyadenylation sites in human genes. Medical &
biological engineering & computing, 49(4), 463-472.
Chen, C. P., & Rost, B. (2002). State-of-the-art in membrane protein prediction. Appl Bioinformatics, 1(1),
21-35.


Chou,P.Y.,&Fasman,G.D.(1978).Predictionofthesecondarystructureofproteinsfromtheiraminoacid
sequence.Adv Enzymol Relat Areas Mol Biol, 47,45-148.
Claros,M.G.,&vonHeijne,G.(1994).TopPredII:animprovedsoftwareformembraneproteinstructure
predictions.Comput Appl Biosci, 10(6),685-686.
Cuff,J.A.,&Barton,G.J.(2000).Applicationofmultiplesequencealignmentprofilestoimproveprotein
secondarystructureprediction.Proteins, 40(3),502-511.
Cuff,J.A.,Clamp,M.E.,Siddiqui,A.S.,Finlay,M.,&Barton,G.J.(1998).JPred:aconsensussecondary
structurepredictionserver.Bioinformatics, 14(10),892-893.doi:btb130[pii]
Diederichs,K.,Freigang,J.,Umhau,S.,Zeth,K.,&Breed,J.(1998).Predictionbyaneuralnetworkofouter
membranebeta-strandproteintopology.Protein Sci, 7(11),2413-2420.
Drew,D.,Sjostrand,D.,Nilsson,J.,Urbig,T.,Chin,C.N.,deGier,J.W.,&vonHeijne,G.(2002).Rapid
topologymappingofEscherichiacoliinner-membraneproteinsbypredictionandPhoA/GFPfusion
analysis.Proc Natl Acad Sci U S A, 99(5),2690-2695.
Driessen,A.J.,&Nouwen,N.(2007).ProteinTranslocationAcrosstheBacterialCytoplasmicMembrane.
Annu Rev Biochem.doi:10.1146/annurev.biochem.77.061606.160747
Eisenberg,D.,Weiss,R.M.,&Terwilliger,T.C.(1984).Thehydrophobicmomentdetectsperiodicityin
proteinhydrophobicity.Proc Natl Acad Sci U S A, 81(1),140-144.
Garnier,J.,Osguthorpe,D.J.,&Robson,B.(1978).Analysisoftheaccuracyandimplicationsofsimple
methodsforpredictingthesecondarystructureofglobularproteins.J Mol Biol, 120(1),97-120.
Habib,S.J.,Neupert,W.,&Rapaport,D.(2007).Analysisandpredictionofmitochondrialtargetingsignals.
Methods Cell Biol, 80,761-781.doi:S0091-679X(06)80035-X[pii]10.1016/S0091-679X(06)80035X
Hamodrakas,S.J.(1988).AproteinsecondarystructurepredictionschemefortheIBMPCandcompatibles.
Comput Appl Biosci, 4(4),473-477.
Hayat,S.,&Elofsson,A.(2012).BOCTOPUS:improvedtopologypredictionoftransmembranebetabarrel
proteins.Bioinformatics, 28(4),516-522.doi:10.1093/bioinformatics/btr710
Houben,E.,deGier,J.W.,&vanWijk,K.J.(1999).Insertionofleaderpeptidaseintothethylakoid
membraneduringsynthesisinachloroplasttranslationsystem.Plant Cell, 11(8),1553-1564.
Jacoboni,I.,Martelli,P.L.,Fariselli,P.,DePinto,V.,&Casadio,R.(2001).Predictionofthetransmembrane
regionsofbeta-barrelmembraneproteinswithaneuralnetwork-basedpredictor.Protein Sci, 10(4),
779-787.
Jones,D.T.(1999).Proteinsecondarystructurepredictionbasedonposition-specificscoringmatrices.J Mol
Biol, 292(2),195-202.
Juncker,A.S.,Willenbrock,H.,VonHeijne,G.,Brunak,S.,Nielsen,H.,&Krogh,A.(2003).Predictionof
lipoproteinsignalpeptidesinGram-negativebacteria.Protein Sci, 12(8),1652-1662.
Kall,L.,Krogh,A.,&Sonnhammer,E.L.(2004).Acombinedtransmembranetopologyandsignalpeptide
predictionmethod.J Mol Biol, 338(5),1027-1036.doi:10.1016/j.jmb.2004.03.016
S0022283604002943[pii]
Kall,L.,Krogh,A.,&Sonnhammer,E.L.(2007).Advantagesofcombinedtransmembranetopologyand
signalpeptideprediction--thePhobiuswebserver.Nucleic Acids Res, 35(WebServerissue),W429432.doi:gkm256[pii]10.1093/nar/gkm256
Kim,H.,Melen,K.,&vonHeijne,G.(2003).Topologymodelsfor37Saccharomycescerevisiaemembrane
proteinsbasedonC-terminalreporterfusionsandpredictions.J Biol Chem, 278(12),10208-10213.


Koh,I.Y.,Eyrich,V.A.,Marti-Renom,M.A.,Przybylski,D.,Madhusudhan,M.S.,Eswar,N.,...Sali,A.
(2003).EVA:evaluationofproteinstructurepredictionservers.Nucleic Acids Research, 31(13),
3311-3315.
Kristiansen,K.(2004).Molecularmechanismsofligandbinding,signaling,andregulationwithinthe
superfamilyofG-protein-coupledreceptors:molecularmodelingandmutagenesisapproachesto
receptorstructureandfunction.Pharmacol Ther, 103(1),21-80.
Krogh,A.,Larsson,B.,vonHeijne,G.,&Sonnhammer,E.L.(2001).Predictingtransmembraneprotein
topologywithahiddenMarkovmodel:applicationtocompletegenomes.J Mol Biol, 305(3),567580.
Kyogoku,Y.,Fujiyoshi,Y.,Shimada,I.,Nakamura,H.,Tsukihara,T.,Akutsu,H.,...Nomura,N.(2003).
Structuralgenomicsofmembraneproteins.Acc Chem Res, 36(3),199-206.
Kyte,J.,&Doolittle,R.F.(1982).Asimplemethodfordisplayingthehydropathiccharacterofaprotein.J
Mol Biol, 157(1),105-132.
Lee,P.A.,Tullman-Ercek,D.,&Georgiou,G.(2006).Thebacterialtwin-argininetranslocationpathway.
Annu Rev Microbiol, 60,373-395.doi:10.1146/annurev.micro.60.080805.142212
Liakopoulos,T.D.,Pasquier,C.,&Hamodrakas,S.J.(2001).Anoveltoolforthepredictionof
transmembraneproteintopologybasedonastatisticalanalysisoftheSwissProtdatabase:the
OrienTMalgorithm.Protein Eng, 14(6),387-390.
Liu,Q.,Zhu,Y.S.,Wang,B.H.,&Li,Y.X.(2003).AHMM-basedmethodtopredictthetransmembrane
regionsofbeta-barrelmembraneproteins.Comput Biol Chem, 27(1),69-76.
Loll,P.J.(2003).Membraneproteinstructuralbiology:thehighthroughputchallenge.J Struct Biol, 142(1),
144-153.
Marsh,D.,Horvath,L.I.,Swamy,M.J.,Mantripragada,S.,&Kleinschmidt,J.H.(2002).Interactionof
membrane-spanningproteinswithperipheralandlipid-anchoredmembraneproteins:perspectives
fromprotein-lipidinteractions(Review).Mol Membr Biol, 19(4),247-255.
Martelli,P.L.,Fariselli,P.,Krogh,A.,&Casadio,R.(2002).Asequence-profile-basedHMMforpredicting
anddiscriminatingbetabarrelmembraneproteins.Bioinformatics, 18 Suppl 1,S46-53.
Math,C.,Sagot,M.F.,Schiex,T.,&Rouze,P.(2002).Currentmethodsofgeneprediction,theirstrengths
andweaknesses.Nucleic Acids Research, 30(19),4103-4117.
Melen,K.,Krogh,A.,&vonHeijne,G.(2003).Reliabilitymeasuresformembraneproteintopology
predictionalgorithms.J Mol Biol, 327(3),735-744.doi:S0022283603001827[pii]
Morona,R.,Kramer,C.,&Henning,U.(1985).Bacteriophagereceptorareaofoutermembraneprotein
OmpAofEscherichiacoliK-12.J Bacteriol, 164(2),539-543.
Pasquier,C.,&Hamodrakas,S.J.(1999).Anhierarchicalartificialneuralnetworksystemforthe
classificationoftransmembraneproteins.Protein Eng, 12(8),631-634.
Pasquier,C.,Promponas,V.J.,&Hamodrakas,S.J.(2001).PRED-CLASS:cascadingneuralnetworksfor
generalizedproteinclassificationandgenome-wideapplications.Proteins, 44(3),361-369.
Pasquier,C.,Promponas,V.J.,Palaios,G.A.,Hamodrakas,J.S.,&Hamodrakas,S.J.(1999).Anovel
methodforpredictingtransmembranesegmentsinproteinsbasedonastatisticalanalysisofthe
SwissProtdatabase:thePRED-TMRalgorithm.Protein Eng, 12(5),381-385.
Pohlschroder,M.,Gimenez,M.I.,&Jarrell,K.F.(2005).ProteintransportinArchaea:Secandtwinarginine
translocationpathways.Curr Opin Microbiol, 8(6),713-719.doi:S1369-5274(05)00162-1[pii]
10.1016/j.mib.2005.10.006
Prince,S.M.,Achtman,M.,&Derrick,J.P.(2002).CrystalstructureoftheOpcAintegralmembraneadhesin
fromNeisseriameningitidis.Proc Natl Acad Sci U S A, 99(6),3417-3421.


Promponas,V.J.,Palaios,G.A.,Pasquier,C.M.,Hamodrakas,J.S.,&Hamodrakas,S.J.(1999).CoPreTHi:
aWebtoolwhichcombinestransmembraneproteinsegmentpredictionmethods.In Silico Biol, 1(3),
159-162.doi:1998010014[pii]
Przybylski,D.,&Rost,B.(2007).ConsensussequencesimprovePSI-BLASTthroughmimickingprofile
profilealignments.Nucleic Acids Research, 35(7),2238-2246.
Qian,N.,&Sejnowski,T.J.(1988).Predictingthesecondarystructureofglobularproteinsusingneural
networkmodels.J Mol Biol, 202(4),865-884.
Rapoport,T.A.,Matlack,K.E.,Plath,K.,Misselwitz,B.,&Staeck,O.(1999).Posttranslationalprotein
translocationacrossthemembraneoftheendoplasmicreticulum.Biol Chem, 380(10),1143-1150.
Rapp,M.,Drew,D.,Daley,D.O.,Nilsson,J.,Carvalho,T.,Melen,K.,VonHeijne,G.(2004).
ExperimentallybasedtopologymodelsforE.coliinnermembraneproteins.Protein Sci, 13(4),937945.
Reinhardt,A.,&Hubbard,T.(1998).Usingneuralnetworksforpredictionofthesubcellularlocationof
proteins.Nucleic Acids Res, 26(9),2230-2236.
Reynolds,S.M.,Kall,L.,Riffle,M.E.,Bilmes,J.A.,&Noble,W.S.(2008).Transmembranetopologyand
signalpeptidepredictionusingdynamicbayesiannetworks.PLoS Comput Biol, 4(11),e1000213.doi:
10.1371/journal.pcbi.1000213
Ringler,P.,&Schulz,G.E.(2002).OmpAmembranedomainasatight-bindinganchorforlipidbilayers.
Chembiochem, 3(5),463-466.
Rojo,E.E.,Guiard,B.,Neupert,W.,&Stuart,R.A.(1999).N-terminaltailexportfromthemitochondrial
matrix.Adherencetotheprokaryotic"positive-inside"ruleofmembraneproteintopology.J Biol
Chem, 274(28),19617-19622.
Rose,R.W.,Bruser,T.,Kissinger,J.C.,&Pohlschroder,M.(2002).Adaptationofproteinsecretionto
extremelyhigh-saltconditionsbyextensiveuseofthetwin-argininetranslocationpathway.Mol
Microbiol, 45(4),943-950.doi:3090[pii]
Rost,B.,Casadio,R.,Fariselli,P.,&Sander,C.(1995).Transmembranehelicespredictedat95%accuracy.
Protein Sci, 4(3),521-533.
Rost,B.,&Sander,C.(1993).Predictionofproteinsecondarystructureatbetterthan70%accuracy.J Mol
Biol, 232(2),584-599.
Rumelhart,D.E.,Hinton,G.E.,&Williams,R.J.(1988).Learningrepresentationsbyback-propagating
errors.Cognitive modeling, 5,3.
Saeys,Y.,Abeel,T.,Degroeve,S.,&VandePeer,Y.(2007).Translationinitiationsitepredictionona
genomicscale:beautyinsimplicity.Bioinformatics, 23(13),i418-i423.
Savojardo,C.,Fariselli,P.,&Casadio,R.(2013).BETAWARE:amachine-learningtooltodetectandpredict
transmembranebeta-barrelproteinsinprokaryotes.Bioinformatics, 29(4),504-505.doi:
10.1093/bioinformatics/bts728
Schulz,G.E.(2002).Thestructureofbacterialoutermembraneproteins.Biochim Biophys Acta, 1565(2),
308-317.
Schulz,G.E.(2003).Transmembranebeta-barrelproteins.Adv Protein Chem, 63,47-70.
Sgourakis,N.G.,Bagos,P.G.,&Hamodrakas,S.J.(2005).PredictionofthecouplingspecificityofGPCRs
tofourfamiliesofG-proteinsusinghiddenMarkovmodelsandartificialneuralnetworks.
Bioinformatics, 21(22),4101-4106.doi:bti679[pii]10.1093/bioinformatics/bti679
Sgourakis,N.G.,Bagos,P.G.,Papasaikas,P.K.,&Hamodrakas,S.J.(2005).Amethodforthepredictionof
GPCRscouplingspecificitytoG-proteinsusingrefinedprofileHiddenMarkovModels.BMC
Bioinformatics, 6,104.doi:1471-2105-6-104[pii]10.1186/1471-2105-6-104


Singer,S.J.,&Nicolson,G.L.(1972).Thefluidmosaicmodelofthestructureofcellmembranes.Science,
175(23),720-731.
Singh,N.K.,Goodman,A.,Walter,P.,Helms,V.,&Hayat,S.(2011).TMBHMM:afrequencyprofilebased
HMMforpredictingthetopologyoftransmembranebetabarrelproteinsandtheexposurestatusof
transmembraneresidues.Biochim Biophys Acta, 1814(5),664-670.doi:10.1016/j.bbapap.2011.03.004
Sonnhammer,E.L.,vonHeijne,G.,&Krogh,A.(1998).AhiddenMarkovmodelforpredicting
transmembranehelicesinproteinsequences.Proc Int Conf Intell Syst Mol Biol, 6,175-182.
Sugawara,E.,&Nikaido,H.(1992).Pore-formingactivityofOmpAproteinofEscherichiacoli.J Biol Chem,
267(4),2507-2511.
Sugawara,E.,&Nikaido,H.(1994).OmpAproteinofEscherichiacolioutermembraneoccursinopenand
closedchannelforms.J Biol Chem, 269(27),17981-17987.
Teter,S.A.,&Klionsky,D.J.(1999).Howtogetafoldedproteinacrossamembrane.Trends Cell Biol,
9(11),428-431.doi:S0962-8924(99)01652-9[pii]
Tsaousis,G.N.,Bagos,P.G.,&Hamodrakas,S.J.(2014).HMMpTM:improvingtransmembraneprotein
topologypredictionusingphosphorylationandglycosylationsiteprediction.Biochim Biophys Acta,
1844(2),316-322.doi:10.1016/j.bbapap.2013.11.001S1570-9639(13)00376-2[pii]
Tusnady,G.E.,Dosztanyi,Z.,&Simon,I.(2004).Transmembraneproteinsinproteindatabank:
identificationandclassification.Bioinformatics.
Tusnady,G.E.,&Simon,I.(1998).Principlesgoverningaminoacidcompositionofintegralmembrane
proteins:applicationtotopologyprediction.J Mol Biol, 283(2),489-506.
Tusnady,G.E.,&Simon,I.(2001).TheHMMTOPtransmembranetopologypredictionserver.
Bioinformatics, 17(9),849-850.
Tuteja,R.(2005).TypeIsignalpeptidase:anoverview.Arch Biochem Biophys, 441(2),107-111.doi:S00039861(05)00305-X[pii]10.1016/j.abb.2005.07.013
vanRoosmalen,M.L.,Geukens,N.,Jongbloed,J.D.,Tjalsma,H.,Dubois,J.Y.,Bron,S.,...Anne,J.
(2004).TypeIsignalpeptidasesofGram-positivebacteria.Biochim Biophys Acta, 1694(1-3),279297.doi:S0167488904001235[pii]10.1016/j.bbamcr.2004.05.006
Vandeputte-Rutten,L.,Bos,M.P.,Tommassen,J.,&Gros,P.(2003).CrystalstructureofNeisserialsurface
proteinA(NspA),aconservedoutermembraneproteinwithvaccinepotential.J Biol Chem, 278(27),
24825-24830.
Vihinen,M.(2012).Howtoevaluateperformanceofpredictionmethods?Measuresandtheirinterpretationin
variationeffectanalysis.BMC Genomics, 13(Suppl4),S2.
Vogt,J.,&Schulz,G.E.(1999).ThestructureoftheoutermembraneproteinOmpXfromEscherichiacoli
revealspossiblemechanismsofvirulence.Structure Fold Des, 7(10),1301-1309.
vonHeijne,G.(1986).Anewmethodforpredictingsignalsequencecleavagesites.Nucleic Acids Res,
14(11),4683-4690.
vonHeijne,G.(1990).Thesignalpeptide.J Membr Biol, 115(3),195-201.
vonHeijne,G.(1992).Membraneproteinstructureprediction.Hydrophobicityanalysisandthepositiveinsiderule.J Mol Biol, 225(2),487-494.
vonHeijne,G.(1999).Recentadvancesintheunderstandingofmembraneproteinassemblyandfunction.
Quart Rev Biophys, 32(4),285-307.
vonHeijne,G.,Steppuhn,J.,&Herrmann,R.G.(1989).Domainstructureofmitochondrialandchloroplast
targetingpeptides.Eur J Biochem, 180(3),535-545.
Walian,P.,Cross,T.A.,&Jap,B.K.(2004).Structuralgenomicsofmembraneproteins.Genome Biol, 5(4),
215.


White,S.H.(2004).Theprogressofmembraneproteinstructuredetermination.Protein Sci, 13(7),19481949.


Zemla,A.,Venclovas,C.,Fidelis,K.,&Rost,B.(1999).AmodifieddefinitionofSov,asegment-based
measureforproteinsecondarystructurepredictionassessment.Proteins, 34(2),220-223.
Zhai,Y.,&Saier,M.H.,Jr.(2002).Thebeta-barrelfinder(BBF)program,allowingidentificationofouter
membranebeta-barrelproteinsencodedwithinprokaryoticgenomes.Protein Sci, 11(9),2196-2207.



8:

, ,
(Hidden Markov Models)
.
,
,
. , ,
,
( , ...). ,
profile HMM ,
, .

.
3 4.

8.


-,Markov
. Markov (Markov Chain),
DNA
.Markov,,

.
2,3,...,kMarkov2,3, ...,k.
,Markov
DNA ,

. 1970
,..
.
,
o . ,
Q U, U
Q.,
AndreyMarkov(1856-1922),
Pushkin (Markov, 1913).
Markov(MarkovModel-)
Markov(HiddenMarkovModel-).

8.1.

Markov

8.1.1.

Markov 1
.,
,(DNA


20 ). L
,x,:

$$x = x_1, x_2, \ldots, x_{L-1}, x_L$$
i
,Markov,
.(),
x,
. , xi
i. ( )
xi+1,xi+2, xi+3x1,x2,
...xi-1, xi, xi. ,
.
:

$$P(x_i \mid x_{i-1}, \ldots, x_1) = P(x_i \mid x_{i-1}) \tag{8.1}$$
Markov
(transitionprobabilities),.,
:

$$a_{st} = P(x_i = t \mid x_{i-1} = s) = a_{x_{i-1} x_i} \tag{8.2}$$
,ti,
(i-1)s.k
, Markov 1 .
:

$$P(x) = P(x_1, x_2, \ldots, x_{L-1}, x_L) = P(x_L \mid x_{L-1}, \ldots, x_1)\, P(x_{L-1} \mid x_{L-2}, \ldots, x_1) \cdots P(x_1)$$


(8.1),:
$$P(x) = P(x_L \mid x_{L-1})\, P(x_{L-1} \mid x_{L-2}) \cdots P(x_2 \mid x_1)\, P(x_1) = P(x_1) \prod_{i=2}^{L} P(x_i \mid x_{i-1}) = P(x_1) \prod_{i=2}^{L} a_{x_{i-1} x_i} \tag{8.3}$$

P(x1) . ,
,:
$$p_{ab}(n-1, n) = P(x_i = b \mid x_{i-1} = a) = p_{ab}, \qquad n = 1, 2, \ldots, L$$
, , ,
.(
, ),
,,
. , ,
1 k
:

$$p_{a,b} \ge 0, \qquad a, b = 1, 2, \ldots, k$$

$$\sum_{b=1}^{k} p_{a,b} = 1, \qquad a = 1, 2, \ldots, k$$

, .
, Markov,
(B=Begin).
:
$$P(x_1 = a) = p_{Ba} \tag{8.4}$$
()(=End)
:
$$P(E \mid x_n = b) = p_{bE} \tag{8.5}$$
Markov 8.1
. ,


.
,.
$$P(E \mid x_n = b) = p_{bE} = q$$
(8.3)L:

$$p_L = q (1 - q)^{L-1} \tag{8.6}$$
.
(Durbin,Eddy,Krogh,&Mithison,1998):
$$\sum_{\{x\}} P(x) = \sum_{x_1} \sum_{x_2} \cdots \sum_{x_n} P(x_1) \prod_{i=2}^{n} P(x_i \mid x_{i-1}) = 1 \tag{8.7}$$

8.1: Markov, 4 DNA.


. , , .
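Equation (8.3) translates directly into code. The sketch below scores a sequence under a first-order Markov chain, working in log space to avoid multiplying many small numbers. The function name and the uniform initial and transition tables are illustrative assumptions, not values taken from the text.

```python
import math

def markov_chain_log_prob(seq, initial, transitions):
    """log P(x) = log P(x_1) + sum_{i=2..L} log a_{x_{i-1} x_i}, as in (8.3)."""
    log_p = math.log(initial[seq[0]])
    for prev, curr in zip(seq, seq[1:]):
        log_p += math.log(transitions[prev][curr])
    return log_p

# Hypothetical uniform DNA chain: every initial and transition probability is 1/4.
alphabet = "ACGT"
initial = {a: 0.25 for a in alphabet}
transitions = {a: {b: 0.25 for b in alphabet} for a in alphabet}
log_p = markov_chain_log_prob("ACGT", initial, transitions)
```

With a uniform chain every sequence of length L gets probability (1/4)^L, which makes the function easy to check by hand.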

8.1.2.

(Maximum Likelihood Estimates-MLEs)


,:
$$a_{st} = \frac{n_{st}}{\sum_{t'} n_{st'}} \tag{8.8}$$
nst s t
, 20
( DNA).,
( , +
), log-odds score, S(x)
,:


$$S(x) = \log \frac{P(x \mid +)}{P(x \mid -)} = \sum_{i=1}^{L} \log \frac{a^{+}_{x_{i-1} x_i}}{a^{-}_{x_{i-1} x_i}} = \sum_{i=1}^{L} \beta_{x_{i-1} x_i} \tag{8.9}$$

xi1 xi , log-odds xi-1 xi,


.
,2,
,. xi1xi
0, (+),
0(-).
,,L
log-oddsscore.
$$S_{norm}(x) = \frac{S(x)}{L} = \frac{1}{L} \sum_{i=1}^{L} \beta_{x_{i-1} x_i} \tag{8.10}$$

, 1 ,
CG(Durbin,etal.,1998).

8.2: CG .
-, (8.8) (8.9)
.
. bits , 2.
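The estimation and scoring steps of (8.8) to (8.10) can be sketched as follows. The two toy training sets, the function names and the pseudocount are assumptions made for illustration; the pseudocount is not part of (8.8) and is added here only so that unseen transitions do not produce a division by zero or a logarithm of zero. The sum runs over adjacent pairs only (no begin transition), and logarithms are base 2 so that scores are in bits.

```python
import math

def estimate_transitions(sequences, alphabet, pseudocount=1.0):
    """a_st = n_st / sum_t' n_st', with an added pseudocount (assumption)."""
    counts = {s: {t: pseudocount for t in alphabet} for s in alphabet}
    for seq in sequences:
        for s, t in zip(seq, seq[1:]):
            counts[s][t] += 1
    return {s: {t: counts[s][t] / sum(counts[s].values()) for t in alphabet}
            for s in alphabet}

def log_odds_score(seq, a_plus, a_minus, normalize=False):
    """Sum of log2(a+_st / a-_st) over adjacent pairs; optionally divided by L."""
    score = sum(math.log2(a_plus[s][t] / a_minus[s][t])
                for s, t in zip(seq, seq[1:]))
    return score / len(seq) if normalize else score

# Toy "+" and "-" training sets (hypothetical data).
a_plus = estimate_transitions(["CGCGCGCG"], "ACGT")
a_minus = estimate_transitions(["ATATATAT"], "ACGT")
score_cg = log_odds_score("CGCG", a_plus, a_minus)
score_at = log_odds_score("ATAT", a_plus, a_minus)
```

A positive score favours the "+" model and a negative score the "-" model; dividing by the length makes scores of sequences of different lengths comparable.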

8.1.3.

kMarkov,
(8.1).,k
:
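$$P(x_i \mid x_{i-1}, \ldots, x_1) = P(x_i \mid x_{i-1}, x_{i-2}, \ldots, x_{i-k}) = a_{x_{i-k} \ldots x_{i-1} x_i} \tag{8.11}$$

$$P(x_i \mid x_{i-1}, x_{i-2}, \ldots, x_{i-k}) = P(x_i, x_{i-1}, \ldots, x_{i-k+1} \mid x_{i-1}, x_{i-2}, \ldots, x_{i-k}) \tag{8.12}$$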

k Markov,1,
20k. , 20kx20k. ,


, 1 202=400
, 2 203=8000 ,

.
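The growth in the number of parameters quoted above (20^2 = 400 for order 1, 20^3 = 8000 for order 2) follows from counting one probability per context of length k and per next symbol. A minimal sketch, with a hypothetical function name:

```python
def n_transition_probabilities(alphabet_size, order):
    """Number of transition probabilities of an order-k chain: |alphabet|^(k+1)."""
    return alphabet_size ** (order + 1)

protein_order_1 = n_transition_probabilities(20, 1)
protein_order_2 = n_transition_probabilities(20, 2)
dna_order_5 = n_transition_probabilities(4, 5)
```

Because each context row must sum to one, the number of free parameters is somewhat smaller, `alphabet_size ** order * (alphabet_size - 1)`, but the exponential growth with the order is the same.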

, (Ellrott, Yang,
Sladek,&Jiang,2002;Phillips,Arnold,&Ivarie,1987).,AudicClaverie
(Audic & Claverie, 1998),

. , , ,
,
,90%(Audic&Claverie,1998).
,
. , k
(Yuan,1999),:
$$P(x_i \mid x_{i-1}, x_{i-2}, \ldots, x_{i-k}) \approx \prod_{j=1}^{k} P(x_i \mid x_{i-j}) \tag{8.13}$$

(8.13) (Yuan, 1999)


, . ,
,
(Yuan, 1999). ,
(>6), , - (overfitting),,-
(Ellrott,etal.,2002;Phillips,etal.,1987;Yuan,1999).,
, - (non-homogenous
Markovchains),,

(Borodovsky&Peresetsky,1994).
, , Markov
(VariablelengthMarkovModels-VMM),Bejerano
(Bejerano, 2004)
(ProbabilisticFiniteAutomata-PFA),Ron(Ron,Singer,&Tishby,1996).
, C n,
k , C * C ,

, , . ,
,:

P xi | xi 1 , xi 2 ,..., xi k P xi | max xi k ,..., xi 1 C


ki 0

(8.14)

(),
Bejerano (Bejerano, Seldin, Margalit, & Tishby, 2001; Bejerano & Yona, 2001),

PFAM (Bateman et al., 2004).
, ,
,HiddenMarkovModels
(.).
, Mixture Transition Distribution (MTD) model
(Raftery, 1985a), (8.11)
:
$$a_{s_k \ldots s_1 s_0} = P(x_i \mid x_{i-1}, x_{i-2}, \ldots, x_{i-k}) = \sum_{j=1}^{k} \lambda_j\, a_{s_j s_0} \tag{8.15}$$

, (j=1,2k)


. (MTDg),
(Raftery,1985b),j,j:
k

a sk ... s1s0 P x i | x i 1 , x i 2 ,..., xi k

j
s j s0

(8.16)

j 1

,:
k

0 j sjj s0 1

(8.17)

j 1

j
j s j s0 1 ,:
s k ,..., s1 , s0 , s0 Q j 1

1, j 0 j 1, 2, ..., k

(8.18)

j 1

Raftery,
Markov. ,
,
. Raftery, (Raftery, 1985a),
(NAG).Bercthold,
gradient (Berchtold,2001).,
Expectation-Maximization (Lebre & Bourguignon,
2008).
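As an illustration of the Mixture Transition Distribution idea, the sketch below computes a high-order transition probability as a weighted sum of first-order transition probabilities, one term per lag. The two-letter alphabet, the transition table and the lag weights are hypothetical values chosen so the result can be checked by hand; the weights are assumed non-negative and to sum to one.

```python
def mtd_probability(symbol, context, lambdas, transition):
    """P(x_i = symbol | context) = sum_j lambda_j * a[context_j][symbol].

    context[0] is x_{i-1}, context[1] is x_{i-2}, and so on (one entry per lag).
    """
    return sum(lam * transition[prev][symbol]
               for lam, prev in zip(lambdas, context))

# Hypothetical order-2 example on a two-letter alphabet.
transition = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.2, "B": 0.8}}
lambdas = [0.7, 0.3]
p_a = mtd_probability("A", ("A", "B"), lambdas, transition)
p_b = mtd_probability("B", ("A", "B"), lambdas, transition)
```

An order-k MTD model of this kind needs only one first-order table plus k weights, instead of a full table with one row per context of length k.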
(, ...), ,
.
(gene
finding),(Borodovsky&McIninch,1993;Borodovsky&Peresetsky,1994) (Audic& Claverie,1998). ,
interpolated Markov chains ,
(Salzberg, Delcher,Kasif,&White,1998)(Ohler,
Harbeck, Niemann, Noth, & Reese, 1999; Salzberg, Pertea, Delcher, Gardner, & Tettelin, 1999),
(Barash,Elidan,Friedman,&Kaplan,2003),
(Dalevi,Dubhashi,&Hermansson,2006),
(Yuan, 1999), (Bejerano, et al., 2001),
(Eronen,Geerts,& Toivonen,2004)
(Browning,2006).

8.2.

Hidden Markov Models

8.2.1.

Hidden Markov Model (), ,


,
(emissions). x L
:

$$x = x_1, x_2, \ldots, x_{L-1}, x_L \tag{8.19}$$
xi20(,
), .
i
,i,(path).,k, l
kl,Markov1.
,:

$$a_{kl} = P(\pi_i = l \mid \pi_{i-1} = k) \tag{8.20}$$
,k()
l.Markov,
,(Begin)


$$a_{Bk} = P(\pi_1 = k \mid B) \tag{8.21}$$

E(End)

$$a_{kE} = P(E \mid \pi_i = k) \tag{8.22}$$
,
:

$$e_k(b) = P(x_i = b \mid \pi_i = k) \tag{8.23}$$
,i,
b,k.x
,:
$$P(x, \pi) = P(x_L, x_{L-1}, \ldots, x_1, \pi) = a_{B \pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}} \tag{8.24}$$

,(Durbin,etal.,1998)
(dishonest casino). ,
,(..0.05
) ,
.,
,.
( 8.3)
0.05 ( )
,0.1().(hidden)
.
()(),
Markov.()Markov(MM)

. (. 4),
.

8.3: .
(-), . ,
.
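The joint probability (8.24) of a sequence and a path can be sketched for a dishonest-casino style model as follows. The switching probabilities 0.05 and 0.1 appear in the text; the loaded-die emissions, the uniform start distribution and the omission of an explicit End state are assumptions made for this illustration.

```python
import math

def joint_log_prob(x, path, start, trans, emit):
    """log P(x, pi): start term, then one emission and one transition per position."""
    log_p = math.log(start[path[0]])
    for i, (symbol, state) in enumerate(zip(x, path)):
        log_p += math.log(emit[state][symbol])
        if i + 1 < len(path):
            log_p += math.log(trans[state][path[i + 1]])
    return log_p

start = {"F": 0.5, "L": 0.5}                        # assumed start distribution
trans = {"F": {"F": 0.95, "L": 0.05},               # fair -> loaded with 0.05
         "L": {"F": 0.10, "L": 0.90}}               # loaded -> fair with 0.1
emit = {"F": {r: 1 / 6 for r in range(1, 7)},       # fair die
        "L": {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5}}  # assumed loaded die

log_p = joint_log_prob([6, 6], "LL", start, trans, emit)
```

For two sixes emitted from the loaded state the product is 0.5 x 0.5 x 0.9 x 0.5 = 0.1125.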


8.2.2. 3

3,
Rabiner(Rabiner,1989).
,
x ; ,
P(x|);
x,
, , ;
,,: max arg max P(x, ) ;

,
,;,
. ,
, ML arg max P x | .

8.4: . .
. . 1 . . 3 . . Hidden
Markov Model. -, . (
) 1 ,
.


8.2.3.

(8.24),,x,

.
x ,
, ,
.
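$$P(x \mid \theta) = \sum_{\pi} P(x, \pi \mid \theta) = \sum_{\pi} a_{B \pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}} \tag{8.25}$$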

,,
.,..
50 300 , 50300,
. ,
3, (dynamic programming).
, ,
.
,Forward(8.4),(Durbin,etal.,1998;
Rabiner,1989).

Forward

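$$\begin{aligned}
&\forall k \ne B,\ i = 0: && f_B(0) = 1,\quad f_k(0) = 0 \\
&\forall\, 1 \le i \le L: && f_l(i) = e_l(x_i) \sum_{k} f_k(i-1)\, a_{kl} \\
& && P(x \mid \theta) = \sum_{k} f_k(L)\, a_{kE}
\end{aligned} \tag{8.26}$$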

, (L+1), N
L,fk(i)i
k.,
i,k.:

$$f_k(i) = P(x_1, x_2, \ldots, x_i, \pi_i = k) \tag{8.27}$$

8.5: Forward, 12 (states)


8 . (.. f1(2) ),
1 ().
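The Forward recursion (8.26) can be sketched in a few lines. The model below is the two-state DNA model whose emission and transition probabilities are given in the worked example of this chapter; the uniform start distribution and the absence of an End state (so that P(x) is simply the sum of the last column) are assumptions of this sketch.

```python
STATES = ("0", "1")
START = {"0": 0.5, "1": 0.5}  # assumed start distribution
TRANS = {"0": {"0": 0.9, "1": 0.1}, "1": {"0": 0.1, "1": 0.9}}
EMIT = {"0": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
        "1": {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}}

def forward(x, states, start, trans, emit):
    """Return the forward table f and P(x); f[i][k] = P(x_1..x_{i+1}, state k)."""
    f = [{k: start[k] * emit[k][x[0]] for k in states}]
    for symbol in x[1:]:
        prev = f[-1]
        f.append({l: emit[l][symbol] * sum(prev[k] * trans[k][l] for k in states)
                  for l in states})
    return f, sum(f[-1].values())

_, p_single = forward("A", STATES, START, TRANS, EMIT)
# Summing P(x) over all 16 dinucleotides must give 1.
total = sum(forward(a + b, STATES, START, TRANS, EMIT)[1]
            for a in "ACGT" for b in "ACGT")
```

The table has one column per position and one row per state, so the cost grows linearly with the sequence length instead of exponentially with the number of paths. For long sequences the raw products underflow, so a practical implementation would use logarithms or scaling.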


, fk(i) ,
,
. ,
.L,
(L).
Backward (Durbin, et al., 1998; Rabiner, 1989),
.
,bk(i),i
i+1,i k.:

$$b_k(i) = P(x_{i+1}, \ldots, x_L \mid \pi_i = k) \tag{8.28}$$
,:

Backward
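$$\begin{aligned}
&\forall k,\ i = L: && b_k(L) = a_{kE} \\
&\forall\, 1 \le i < L: && b_k(i) = \sum_{l} a_{kl}\, e_l(x_{i+1})\, b_l(i+1) \\
& && P(x \mid \theta) = \sum_{l} a_{Bl}\, e_l(x_1)\, b_l(1)
\end{aligned} \tag{8.29}$$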

,,,
1.,Forward.

8.2.4.

,
. , ,
(decoding).,
Viterbi(Durbin,etal.,1998;Rabiner,1989).

Viterbi
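$$\begin{aligned}
&\forall k \ne B,\ i = 0: && u_B(0) = 1,\quad u_k(0) = 0 \\
&\forall\, 1 \le i \le L: && u_l(i) = e_l(x_i) \max_{k} \{ u_k(i-1)\, a_{kl} \} \\
& && P(x, \pi^{max} \mid \theta) = \max_{k} \{ u_k(L)\, a_{kE} \}
\end{aligned} \tag{8.30}$$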

Viterbi, Forward,
.
max,
P(x, max|). , P(x, max|) P(x|).
,,
(pointers),()
.(back-tracking),,,
.
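A minimal sketch of the Viterbi recursion (8.30) with back-pointers, written in log space. It uses the two-state DNA model from the worked example of this chapter; the uniform start distribution and the missing End state are assumptions, and the input sequence is a made-up test case.

```python
import math

STATES = ("0", "1")
START = {"0": 0.5, "1": 0.5}  # assumed start distribution
TRANS = {"0": {"0": 0.9, "1": 0.1}, "1": {"0": 0.1, "1": 0.9}}
EMIT = {"0": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
        "1": {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}}

def viterbi(x, states, start, trans, emit):
    """Return the most probable state path and its log joint probability."""
    u = [{k: math.log(start[k]) + math.log(emit[k][x[0]]) for k in states}]
    pointers = []
    for symbol in x[1:]:
        prev, column, ptr = u[-1], {}, {}
        for l in states:
            best = max(states, key=lambda k: prev[k] + math.log(trans[k][l]))
            ptr[l] = best  # remember which predecessor gave the maximum
            column[l] = prev[best] + math.log(trans[best][l]) + math.log(emit[l][symbol])
        u.append(column)
        pointers.append(ptr)
    last = max(states, key=lambda k: u[-1][k])
    path = [last]
    for ptr in reversed(pointers):  # back-tracking from the last position
        path.append(ptr[path[-1]])
    return "".join(reversed(path)), u[-1][last]

path, log_p = viterbi("AAAAAAAAAAGCGCGCGCGC", STATES, START, TRANS, EMIT)
```

The A-rich half is assigned to state 1 and the rest to state 0, because the gain in emission probability outweighs the cost of a single state switch.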


.
,
xi,xi
k, x. ,
P(i=k|x).xi
k.:


$$P(x, \pi_i = k) = P(x_1, x_2, \ldots, x_i, \pi_i = k)\, P(x_{i+1}, \ldots, x_n \mid x_1, \ldots, x_i, \pi_i = k)$$

$$= P(x_1, x_2, \ldots, x_i, \pi_i = k)\, P(x_{i+1}, \ldots, x_n \mid \pi_i = k)$$
,(8.27),
(8.28).,:
$$P(x, \pi_i = k) = f_k(i)\, b_k(i)$$

,Bayes:

$$P(\pi_i = k \mid x) = \frac{f_k(i)\, b_k(i)}{P(x)} \tag{8.31}$$
,
.,
:

$$\hat{\pi}_i = \arg\max_{k} P(\pi_i = k \mid x) \tag{8.32}$$

. ,
.
,
.,,
. ,
, , (..
- ), g(k)
,
$$g(k) = \begin{cases} 1, & k \in C^{TM} \\ 0, & k \in C^{NTM} \end{cases}$$

$$G(i \mid x) = \sum_{k} P(\pi_i = k \mid x)\, g(k) \tag{8.33}$$

i
.
,
(posterior decoding),
. ,
, (
).,,

,.
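Posterior decoding, equations (8.31) and (8.32), combines the forward and backward tables. The sketch below again uses the two-state DNA model from the worked example of this chapter, with an assumed uniform start distribution and no End state (so the backward table is initialised to 1 at the last position). It is unscaled and therefore only suitable for short sequences.

```python
STATES = ("0", "1")
START = {"0": 0.5, "1": 0.5}  # assumed start distribution
TRANS = {"0": {"0": 0.9, "1": 0.1}, "1": {"0": 0.1, "1": 0.9}}
EMIT = {"0": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
        "1": {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}}

def posterior(x, states, start, trans, emit):
    """Return a list with P(state k at position i | x) for every position i."""
    L = len(x)
    f = [{k: start[k] * emit[k][x[0]] for k in states}]
    for i in range(1, L):
        f.append({l: emit[l][x[i]] * sum(f[i - 1][k] * trans[k][l] for k in states)
                  for l in states})
    b = [{k: 1.0 for k in states} for _ in range(L)]
    for i in range(L - 2, -1, -1):
        b[i] = {k: sum(trans[k][l] * emit[l][x[i + 1]] * b[i + 1][l] for l in states)
                for k in states}
    px = sum(f[-1].values())
    return [{k: f[i][k] * b[i][k] / px for k in states} for i in range(L)]

post = posterior("AAAAAAAAAA", STATES, START, TRANS, EMIT)
decoded = "".join(max(STATES, key=lambda k: column[k]) for column in post)
```

Taking the most probable state independently at each position maximises the expected number of correctly labelled positions, but the resulting path is not guaranteed to be allowed by the model's transitions.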

8.2.4.1
DNA,
( ): 2
,
4(emissionprobabilities).1:
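$$\begin{aligned}
e_1(A) &= P(x_i = A \mid \pi_i = 1) = 0.7 \\
e_1(T) &= P(x_i = T \mid \pi_i = 1) = 0.1 \\
e_1(G) &= P(x_i = G \mid \pi_i = 1) = 0.1 \\
e_1(C) &= P(x_i = C \mid \pi_i = 1) = 0.1
\end{aligned}$$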
0:

$$\begin{aligned}
e_0(A) &= P(x_i = A \mid \pi_i = 0) = 0.25 \\
e_0(T) &= P(x_i = T \mid \pi_i = 0) = 0.25 \\
e_0(G) &= P(x_i = G \mid \pi_i = 0) = 0.25 \\
e_0(C) &= P(x_i = C \mid \pi_i = 0) = 0.25
\end{aligned}$$
():
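$$\begin{aligned}
a_{11} &= P(\pi_i = 1 \mid \pi_{i-1} = 1) = 0.9 \\
a_{10} &= P(\pi_i = 0 \mid \pi_{i-1} = 1) = 0.1 \\
a_{00} &= P(\pi_i = 0 \mid \pi_{i-1} = 0) = 0.9 \\
a_{01} &= P(\pi_i = 1 \mid \pi_{i-1} = 0) = 0.1
\end{aligned}$$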
DNA200.
DNA,
(1/0). 8.7
.

AAACAAGAATGCGCACACTACGCAAAAACAATTAGTCGCACTCACGATGAAACAAATTACCACGGTGAA
111111111100000000000001111111111100000000000000111111110000000000001
AACGAATAAACCTCAGAGGCCCAGCGTATATAAACAAGATAAAAACCTAGTCAGCACTCTGACCAGACG
111111111100000000000000000000011111111111111100000000000000000000000
AGCTCACGACTTGAGGATAAGAAAAAAACAACAGCTCACGACTTGAGGATAAGAAAAAAACA
00000000000000001111111111111100000000000000000011111111111111
8.6: DNA .
(0/1)

( 8.7)
(Viterbi).


8.7: Viterbi
8.6. ,
DNA. ,
7 1, Viterbi
.


,:
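$$\begin{aligned}
a_{11} &= P(\pi_i = 1 \mid \pi_{i-1} = 1) = 0.98 \\
a_{10} &= P(\pi_i = 0 \mid \pi_{i-1} = 1) = 0.02 \\
a_{00} &= P(\pi_i = 0 \mid \pi_{i-1} = 0) = 0.97 \\
a_{01} &= P(\pi_i = 1 \mid \pi_{i-1} = 0) = 0.03
\end{aligned}$$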
,,
:
GCGCACACTAGCGCACACTACGCCTACGCAATTAGTCGCACTCACGAAGAAACAAATTACCACGGTGAG
000000000000000000000000000000000000000000000011111111110000000000001
AACGAATAAAAATCAGAGGCCCAGCGTATATCAGCACTCTGACCACCTAGTCAGCACTCTGACCAGACG
111111111111000000000000000000000000000000000000000000000000000000000
AGCTCACGACTTGAGGATAAGAATAGAAAAACAGCTCACGACTTGAGGCACGACTAGCTCAG
00000000000000001111111111111100000000000000000000000000000000
8.8: DNA , .
(0/1)


8.9: Viterbi
. Viterbi,
1, 1.
.
, (
) ( ) . ,
, (cut-off value)
1. () 0.5
( ), , .


.,,1,
2 , 3 ,
,.
,,,
1-3-2, .
,
. ,
(Viterbi)
.


,
, (..
)
.Jones(Jones,Taylor,&
Thornton,1994).
n
m , n
.,sil(
),l
i., S ij i :1, 2,..., n; j :1, 2,..., m ,
:

S ij max

l lmin lmax

il
j

max

k 1 l An

S
k
j 1

(8.34)

j , lmin lmax
,.
(Farisellietal., 2003),
-.

Posterior-Viterbi
, Fariselli ,
Viterbi (Fariselli,
Martelli,&Casadio,2005).,PV:
$$\pi^{PV} = \arg\max_{\pi \in A_p} \prod_{i=1}^{L} P(\pi_i \mid x)$$

p , P(i=k|x)
,(8.31).,
10.:
$$\delta(k, l) = \begin{cases} 1, & \text{if } a_{kl} \ne 0 \\ 0, & \text{otherwise} \end{cases}$$
,PV,:
$$\pi^{PV} = \arg\max_{\pi} \prod_{i=1}^{L} \delta(\pi_i, \pi_{i+1})\, P(\pi_i \mid x)$$

, ,
Viterbi,
.


Posterior-Viterbi
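$$\begin{aligned}
&\forall k \ne B,\ i = 0: && u_B(0) = 1,\quad u_k(0) = 0 \\
&\forall\, 1 \le i \le L: && u_l(i) = P(\pi_i = l \mid x) \max_{k} \{ u_k(i-1)\, \delta(k, l) \} \\
& && P(x, \pi^{PV} \mid \theta) = \max_{k} \{ u_k(L)\, \delta(k, E) \}
\end{aligned}$$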

Forward
,
Forward.,ForwardDecodingmethod,
( 8.10).
,,
().
(8.26), ,
.
(8.10).,:
log P x|
S (x| )

(8.35)
L
L,.,
,(),
. , (Eddy, 1998),
,
,
0,,(
null).
log P x|

(8.36)
S ( x| )
log P x| 0

8.10: . Forward,
.

,,
.
,
.


8.2.5.

- ,
(Maximum Likelihood).
(), ML ,
. ,
,
.:

$$\theta^{ML} = \arg\max_{\theta} P(x \mid \theta) \tag{8.37}$$

, l(x|),
.
$$l(x \mid \theta) = \log P(x \mid \theta)$$
,
( )
. ,
. , , ,
. ,
, .
, x,
,.

(),
,.,
,
.,:

$$a_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}} \tag{8.38}$$

$$e_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')} \tag{8.39}$$

,.
, ,
, . ,
(pseudo-counts),:

$$a_{kl} = \frac{A_{kl} + r_{kl}}{\sum_{l'} A_{kl'} + \sum_{l'} r_{kl'}} \tag{8.40}$$

$$e_k(b) = \frac{E_k(b) + r_k(b)}{\sum_{b'} E_k(b') + \sum_{b'} r_k(b')} \tag{8.41}$$

,
rkl,rk(b)Dirichlet.
(prior),
.
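When the state path of each training sequence is known, the estimates (8.40) and (8.41) reduce to counting and normalising. The sketch below assumes a constant pseudocount for every transition and emission (the text allows arbitrary values r_kl and r_k(b)); the function name and the single toy training pair are hypothetical.

```python
def estimate_from_paths(pairs, states, alphabet, r_trans=1.0, r_emit=1.0):
    """Count transitions A_kl and emissions E_k(b), add pseudocounts, normalise."""
    A = {k: {l: r_trans for l in states} for k in states}
    E = {k: {b: r_emit for b in alphabet} for k in states}
    for x, path in pairs:
        for i, (symbol, k) in enumerate(zip(x, path)):
            E[k][symbol] += 1
            if i + 1 < len(path):
                A[k][path[i + 1]] += 1
    a = {k: {l: A[k][l] / sum(A[k].values()) for l in states} for k in states}
    e = {k: {b: E[k][b] / sum(E[k].values()) for b in alphabet} for k in states}
    return a, e

# One toy sequence with a known path (hypothetical data).
a, e = estimate_from_paths([("AAGC", "1100")], states="01", alphabet="ACGT")
```

Setting both pseudocounts to zero gives the plain estimates (8.38) and (8.39), at the risk of zero probabilities (or a division by zero) for states and symbols that never occur in the training data.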

Baum-Welch
,
,
.,1970Baum,


Baum-Welch (Baum, 1972). , ,


,Expectation-Maximisation()(Dempster,Laird,&
Rubin,1977),
(missingvalues).,Baum,
,(Dempster,etal.,
1977).,
.

(missingvalues)..
, t,t+1 ,
(iteration)tx,.:

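$$l(x \mid \theta) = \log P(x \mid \theta) = \log \sum_{\pi} P(x, \pi \mid \theta)$$

From Bayes' theorem:

$$P(x, \pi \mid \theta) = P(\pi \mid x, \theta)\, P(x \mid \theta)$$

so that:

$$\log P(x \mid \theta) = \log P(x, \pi \mid \theta) - \log P(\pi \mid x, \theta)$$

Multiplying by $P(\pi \mid x, \theta^t)$ and summing over all paths $\pi$ gives:

$$\log P(x \mid \theta) = \sum_{\pi} P(\pi \mid x, \theta^t) \log P(x, \pi \mid \theta) - \sum_{\pi} P(\pi \mid x, \theta^t) \log P(\pi \mid x, \theta)$$

Define:

$$Q(\theta \mid \theta^t) = \sum_{\pi} P(\pi \mid x, \theta^t) \log P(x, \pi \mid \theta) \tag{8.42}$$

The aim is:

$$\log P(x \mid \theta) \ge \log P(x \mid \theta^t)$$

The difference can be written as:

$$\log P(x \mid \theta) - \log P(x \mid \theta^t) = Q(\theta \mid \theta^t) - Q(\theta^t \mid \theta^t) + \sum_{\pi} P(\pi \mid x, \theta^t) \log \frac{P(\pi \mid x, \theta^t)}{P(\pi \mid x, \theta)}$$

The last term is a relative entropy, so it is non-negative and vanishes for $\theta = \theta^t$:

$$\log P(x \mid \theta) - \log P(x \mid \theta^t) \ge Q(\theta \mid \theta^t) - Q(\theta^t \mid \theta^t)$$

Maximizing $Q$ therefore never decreases the likelihood:

$$\theta^{t+1} = \arg\max_{\theta} Q(\theta \mid \theta^t)$$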

.:
(8.24):
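$$P(x, \pi \mid \theta) = \prod_{k=1} \prod_{b} \left[ e_k(b) \right]^{E_k(b, \pi)} \prod_{k=0} \prod_{l=1} \left[ a_{kl} \right]^{A_{kl}(\pi)} \tag{8.43}$$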

, Ek(b,), , Akl() b,
l , k, . (8.42)
:

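$$Q(\theta \mid \theta^t) = \sum_{\pi} P(\pi \mid x, \theta^t) \left[ \sum_{k=1} \sum_{b} E_k(b, \pi) \log e_k(b) + \sum_{k=0} \sum_{l=1} A_{kl}(\pi) \log a_{kl} \right] \tag{8.44}$$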

,Ek(b,)Akl(),
, fk(i), bk(i),
forwardbackward.,

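$$P(x, \pi_i = k, \pi_{i+1} = l \mid \theta) = P(x_1, x_2, \ldots, x_L, \pi_i = k, \pi_{i+1} = l \mid \theta)$$

$$= P(x_1, x_2, \ldots, x_i, \pi_i = k \mid \theta)\, P(x_{i+1}, x_{i+2}, \ldots, x_L, \pi_{i+1} = l \mid x_1, x_2, \ldots, x_i, \pi_i = k, \theta)$$

By the Markov property:

$$P(x, \pi_i = k, \pi_{i+1} = l \mid \theta) = P(x_1, x_2, \ldots, x_i, \pi_i = k \mid \theta)\, P(x_{i+1}, x_{i+2}, \ldots, x_L, \pi_{i+1} = l \mid \pi_i = k, \theta)$$

From (8.27):

$$f_k(i) = P(x_1, x_2, \ldots, x_i, \pi_i = k)$$

The second factor is:

$$P(x_{i+1}, x_{i+2}, \ldots, x_L, \pi_{i+1} = l \mid \pi_i = k, \theta) = P(x_{i+1}, \pi_{i+1} = l \mid \pi_i = k, \theta)\, P(x_{i+2}, \ldots, x_L \mid x_{i+1}, \pi_{i+1} = l, \pi_i = k, \theta) \tag{8.45}$$

where:

$$P(x_{i+1}, \pi_{i+1} = l \mid \pi_i = k, \theta) = P(\pi_{i+1} = l \mid \pi_i = k)\, P(x_{i+1} \mid \pi_{i+1} = l) = a_{kl}\, e_l(x_{i+1}) \tag{8.46}$$

and, from (8.28):

$$P(x_{i+2}, \ldots, x_L \mid x_{i+1}, \pi_{i+1} = l, \pi_i = k, \theta) = P(x_{i+2}, \ldots, x_L \mid \pi_{i+1} = l) = b_l(i+1) \tag{8.47}$$

Substituting (8.46) and (8.47) into (8.45):

$$P(x_{i+1}, x_{i+2}, \ldots, x_L, \pi_{i+1} = l \mid \pi_i = k, \theta) = a_{kl}\, e_l(x_{i+1})\, b_l(i+1)$$

Hence:

$$P(x, \pi_i = k, \pi_{i+1} = l \mid \theta) = f_k(i)\, a_{kl}\, e_l(x_{i+1})\, b_l(i+1) \tag{8.48}$$

and, from Bayes' theorem:

$$P(\pi_i = k, \pi_{i+1} = l \mid x, \theta) = \frac{f_k(i)\, a_{kl}\, e_l(x_{i+1})\, b_l(i+1)}{P(x)} \tag{8.49}$$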
fk(i), bk(i), forward backward.


,(8.31):

$$P(\pi_i = k \mid x) = \frac{f_k(i)\, b_k(i)}{P(x)}$$

,,:

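$$A_{kl} = \sum_{\pi} P(\pi \mid x, \theta^t)\, A_{kl}(\pi) = \frac{1}{P(x)} \sum_{i} f_k(i)\, a_{kl}\, e_l(x_{i+1})\, b_l(i+1) \tag{8.50}$$

$$E_k(b) = \sum_{\pi} P(\pi \mid x, \theta^t)\, E_k(b, \pi) = \frac{1}{P(x)} \sum_{i \mid x_i = b} f_k(i)\, b_k(i) \tag{8.51}$$

(8.41):

$$Q(\theta \mid \theta^t) = \sum_{k=1} \sum_{b} E_k(b) \log e_k(b) + \sum_{k=0} \sum_{l=1} A_{kl} \log a_{kl} \tag{8.52}$$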

(8.38)(8.39).
Baum-Welch - (Expectation)
fk(i),bk(i),forwardbackward
Akl,Ek(b)(8.50)(8.51).
Q. - (Maximization)
Akl Ek(b) (8.38) (8.39), ...
. (loglikelihood) Q
(threshold).
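One Baum-Welch iteration, with the expected counts (8.50) and (8.51) followed by the re-estimation (8.38) and (8.39), can be sketched as below. This is a simplified illustration for a single short sequence: no scaling, no pseudocounts, no End state and a fixed start distribution are all assumptions of the sketch, and the training sequence is made up. The starting parameters are those of the two-state DNA model from the worked example of this chapter.

```python
import math

def forward_backward(x, states, start, trans, emit):
    """Unscaled forward and backward tables plus P(x)."""
    L = len(x)
    f = [{k: start[k] * emit[k][x[0]] for k in states}]
    for i in range(1, L):
        f.append({l: emit[l][x[i]] * sum(f[i - 1][k] * trans[k][l] for k in states)
                  for l in states})
    b = [{k: 1.0 for k in states} for _ in range(L)]
    for i in range(L - 2, -1, -1):
        b[i] = {k: sum(trans[k][l] * emit[l][x[i + 1]] * b[i + 1][l] for l in states)
                for k in states}
    return f, b, sum(f[-1].values())

def baum_welch_step(x, states, alphabet, start, trans, emit):
    """E-step: expected counts A, E. M-step: normalise them. Returns log P(x) too."""
    f, b, px = forward_backward(x, states, start, trans, emit)
    L = len(x)
    A = {k: {l: sum(f[i][k] * trans[k][l] * emit[l][x[i + 1]] * b[i + 1][l]
                    for i in range(L - 1)) / px for l in states} for k in states}
    E = {k: {c: sum(f[i][k] * b[i][k] for i in range(L) if x[i] == c) / px
             for c in alphabet} for k in states}
    new_trans = {k: {l: A[k][l] / sum(A[k].values()) for l in states} for k in states}
    new_emit = {k: {c: E[k][c] / sum(E[k].values()) for c in alphabet} for k in states}
    return new_trans, new_emit, math.log(px)

states, alphabet = ("0", "1"), "ACGT"
start = {"0": 0.5, "1": 0.5}  # assumed, kept fixed during training
trans = {"0": {"0": 0.9, "1": 0.1}, "1": {"0": 0.1, "1": 0.9}}
emit = {"0": {c: 0.25 for c in alphabet},
        "1": {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}}
x = "AAAAAAGCGCTAGCAAAAAAAACGT"  # made-up training sequence

log_likelihoods = []
for _ in range(5):
    trans, emit, ll = baum_welch_step(x, states, alphabet, start, trans, emit)
    log_likelihoods.append(ll)  # log P(x) under the parameters before the update
```

The recorded log-likelihoods should never decrease from one iteration to the next, which is the property derived above from the function Q.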


Gradient-Descent
Baum-Welch, . ,
, ,
. , (update) ,
(batch mode of learning). ,
,,
.,BaldiChauvin
(Baldi&Chauvin,1994).
Gradient-Descent, .
,fn:
$$f(x) = f(x_1, x_2, \ldots, x_n)$$
,,

$$x^0 = (x_1^0, x_2^0, \ldots, x_n^0)$$

,:

f ( xt 1 ) f ( xt ) f ( x)
,
(learningrate).,
(negative log-likelihood),
.,:
$$\theta^{t+1} = \theta^t - \eta \frac{\partial \ell(x \mid \theta)}{\partial \theta} \tag{8.53}$$

,
.

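$$\frac{\partial \log P(x \mid \theta)}{\partial \theta} = \frac{1}{P(x \mid \theta)} \frac{\partial P(x \mid \theta)}{\partial \theta} = \frac{1}{P(x \mid \theta)} \sum_{\pi} \frac{\partial P(x, \pi \mid \theta)}{\partial \theta}$$

$$= \sum_{\pi} \frac{P(x, \pi \mid \theta)}{P(x \mid \theta)} \frac{\partial \log P(x, \pi \mid \theta)}{\partial \theta} = \sum_{\pi} P(\pi \mid x, \theta) \frac{\partial \log P(x, \pi \mid \theta)}{\partial \theta} \tag{8.54}$$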

(8.43), ,
:

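$$\frac{\partial \log P(x, \pi \mid \theta)}{\partial a_{kl}} = \frac{\partial}{\partial a_{kl}} \sum_{k'=0} \sum_{l'=1} A_{k'l'}(\pi) \log a_{k'l'} = A_{kl}(\pi) \frac{\partial \log a_{kl}}{\partial a_{kl}} = \frac{A_{kl}(\pi)}{a_{kl}} \tag{8.55}$$

Substituting into (8.54):

$$\frac{\partial \log P(x \mid \theta)}{\partial a_{kl}} = \sum_{\pi} P(\pi \mid x, \theta) \frac{A_{kl}(\pi)}{a_{kl}} = \frac{A_{kl}}{a_{kl}} \tag{8.56}$$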
,
:

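$$\frac{\partial \log P(x \mid \theta)}{\partial e_k(b)} = \sum_{\pi} P(\pi \mid x, \theta) \frac{E_k(b, \pi)}{e_k(b)} = \frac{E_k(b)}{e_k(b)} \tag{8.57}$$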

,
,Q(8.52).
,
,
(Baldi&Chauvin,1994).
,.,
,.
(8.53), Gradient Descent,
, .
,,
0 1,
.,KroghRiss(Krogh&Riis,
1999), soft-max . ,
kl,zkl, :
$$a_{kl} = \frac{\exp(z_{kl})}{\sum_{l'} \exp(z_{kl'})} \tag{8.58}$$

,GradientDecent,kl,zkl:
$$z_{kl}^{t+1} = z_{kl}^{t} - \eta \frac{\partial \ell}{\partial z_{kl}} \tag{8.59}$$
:

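$$a_{kl}^{t+1} = \frac{\exp\left( z_{kl}^{t} - \eta \dfrac{\partial \ell}{\partial z_{kl}} \right)}{\sum_{l'} \exp\left( z_{kl'}^{t} - \eta \dfrac{\partial \ell}{\partial z_{kl'}} \right)} \tag{8.60}$$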

(8.56),
zkl:

$$\frac{\partial \ell}{\partial z_{kl}} = -\left[ A_{kl} - a_{kl} \sum_{l'} A_{kl'} \right] \tag{8.61}$$

,(8.61)(8.60),
:

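$$a_{kl}^{t+1} = \frac{a_{kl}^{t} \exp\left( \eta \left[ A_{kl} - a_{kl} \sum_{l'} A_{kl'} \right] \right)}{\sum_{l'} a_{kl'}^{t} \exp\left( \eta \left[ A_{kl'} - a_{kl'} \sum_{l''} A_{kl''} \right] \right)} \tag{8.62}$$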

,
.,
. ,
(smooth) ,
Baum-Welch, online training,
-,


. , ,
(Conditional Maximum
Likelihood-CML). , ,
(),
.
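The multiplicative update (8.62) for one row of the transition matrix can be sketched as follows. The function name, the learning rate and the toy expected counts are assumptions for illustration; in a real training run the expected counts would come from the Forward and Backward algorithms.

```python
import math

def softmax_update(a_row, A_row, eta):
    """One gradient step on the soft-max parameters, expressed directly on a_kl.

    a_row: current probabilities a_kl for a fixed state k
    A_row: expected counts A_kl for the same state
    """
    total = sum(A_row.values())
    unnormalised = {l: a_row[l] * math.exp(eta * (A_row[l] - a_row[l] * total))
                    for l in a_row}
    z = sum(unnormalised.values())
    return {l: value / z for l, value in unnormalised.items()}

# At the Baum-Welch fixed point (a_kl proportional to A_kl) nothing changes.
fixed = softmax_update({"x": 0.25, "y": 0.75}, {"x": 1.0, "y": 3.0}, eta=0.1)
# Otherwise the probability moves towards the transition with the larger count.
moved = softmax_update({"x": 0.5, "y": 0.5}, {"x": 3.0, "y": 1.0}, eta=0.1)
```

Because the update is normalised, the new row is a valid probability distribution whatever the learning rate, which is the point of the soft-max parameterisation.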

Viterbi training
,,,
. Viterbi training
Segmental k-means algorithm Juang Rabiner 1990 (Juang & Rabiner,
1990). , : )
Viterbi,)
(8.34)(8.35).
,
(clustering),1968
k-meansalgorithm(MacQueen,1967).
, ,
(8.20),
.JuangRabiner
, ,
(state-optimized likelihood).
, , ,
.
, ,
.,
(
Baum-Welch gradient).
, Baum-Welch.
,
.
, , , ,
Viterbi BaumWelch Forward Backward ( ,
,/).

8.3.

Class Hidden Markov Model

8.3.1.

,
(unsupervisedmethods).
,-,
. ,
,
()
.



8.11: CHMM. . HMM
. ( ) 1 ,
. CHMM
,
. ( ),
. , 1 .

, .
, ,
,
,
. ,
,3
. ,
(supervised learning). ,
(labeled sequences), Krogh
Class Hidden Markov Model (Anders. Krogh, 1994). , ,

$$x = x_1, x_2, \ldots, x_{L-1}, x_L$$
,(labels)
$$y = y_1, y_2, \ldots, y_{L-1}, y_L$$
,3:
(), () ().
,


.,
, ...
,k(c)kc.
,,
, (delta function) 1
0 . ,
.

8.3.2.

, ,
, P(x,y|)
xy,:
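$$P(x, y \mid \theta) = \sum_{\pi} P(x, y, \pi \mid \theta) = \sum_{\pi \in \Pi_y} P(x, \pi \mid \theta) = \sum_{\pi \in \Pi_y} a_{B \pi_1} \prod_{i=1}^{L} e_{\pi_i}(x_i)\, a_{\pi_i \pi_{i+1}}$$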

,(8.25),
y,.
,
Forward Backward,
.,,
(0,1), . ,
Forward,:

Forward
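$$\begin{aligned}
&\forall k \ne B,\ i = 0: && f_B(0) = 1,\quad f_k(0) = 0 \\
&\forall\, 1 \le i \le L: && f_l(i) = e_l(x_i)\, \delta_l(y_i) \sum_{k} f_k(i-1)\, a_{kl} \\
& && P(x, y \mid \theta) = \sum_{k} f_k(L)\, a_{kE}
\end{aligned} \tag{8.63}$$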

,Backward,:

Backward
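$$\begin{aligned}
&\forall k,\ i = L: && b_k(L) = a_{kE} \\
&\forall\, 1 \le i < L: && b_k(i) = \sum_{l} a_{kl}\, e_l(x_{i+1})\, \delta_l(y_{i+1})\, b_l(i+1) \\
& && P(x, y \mid \theta) = \sum_{l} a_{Bl}\, e_l(x_1)\, b_l(1)
\end{aligned} \tag{8.64}$$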

, Forward
Backward,(8.12).
y, ,
P(x,y|)P(x|).
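The labelled Forward recursion (8.63) differs from the ordinary one only by the delta factor, which removes every state whose label disagrees with the observed label at that position. The sketch below uses the two-state DNA model from the worked example of this chapter, with each state carrying its own name as label; the label assignment, the uniform start distribution and the missing End state are assumptions. Passing `y=None` disables the delta factor and gives the ordinary P(x).

```python
STATES = ("0", "1")
LABEL_OF = {"0": "0", "1": "1"}  # assumed: one label per state
START = {"0": 0.5, "1": 0.5}     # assumed start distribution
TRANS = {"0": {"0": 0.9, "1": 0.1}, "1": {"0": 0.1, "1": 0.9}}
EMIT = {"0": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
        "1": {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}}

def labeled_forward(x, y, states, label_of, start, trans, emit):
    """P(x, y) if labels y are given, otherwise P(x)."""
    def delta(state, i):
        return 1.0 if y is None or label_of[state] == y[i] else 0.0
    f = {k: start[k] * emit[k][x[0]] * delta(k, 0) for k in states}
    for i in range(1, len(x)):
        f = {l: emit[l][x[i]] * delta(l, i) * sum(f[k] * trans[k][l] for k in states)
             for l in states}
    return sum(f.values())

x = "ACA"
p_x = labeled_forward(x, None, STATES, LABEL_OF, START, TRANS, EMIT)
p_xy = labeled_forward(x, "101", STATES, LABEL_OF, START, TRANS, EMIT)
# Summing P(x, y) over every possible labelling recovers P(x).
total = sum(labeled_forward(x, a + b + c, STATES, LABEL_OF, START, TRANS, EMIT)
            for a in "01" for b in "01" for c in "01")
```

In a realistic model several states share a label, so a labelling restricts the allowed paths without fixing them completely; the same code covers that case through `label_of`.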


8.12: Forward . 12
, 8 (labels).
,
.

8.3.3.


,
:
$$\theta^{ML} = \arg\max_{\theta} P(x, y \mid \theta)$$

,
.,,(8.65)
(8.66):

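$$A_{kl} = \frac{1}{P(x, y \mid \theta)} \sum_{i} f_k(i)\, a_{kl}\, e_l(x_{i+1})\, \delta_l(y_{i+1})\, b_l(i+1) \tag{8.67}$$

$$E_k(b) = \frac{1}{P(x, y \mid \theta)} \sum_{i \mid x_i = b} f_k(i)\, b_k(i) \tag{8.68}$$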

P(x,y|),fi(i)bk(i),
. ,
,Baum-WelchGradientDescent.


,Krogh(Anders.Krogh,1994)
, (Conditional Maximum
Likelihood).,,
:
$$\theta^{CML} = \arg\max_{\theta} P(y \mid x, \theta) = \arg\max_{\theta} \frac{P(x, y \mid \theta)}{P(x \mid \theta)}$$

Krogh, (Anders. Krogh,1994),

(MaximumMutualInformation)(Rabiner,1989).
,:
$$\ell = -\log P(y \mid x, \theta) = \ell_c - \ell_f$$

where:

$$\ell_c = -\log P(x, y \mid \theta)$$

$$\ell_f = -\log P(x \mid \theta)$$

c f,
(clampedphase),(freerunningphase).(8.56)(8.57),
:
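$$\frac{\partial \ell}{\partial a_{kl}} = \frac{\partial \ell_c}{\partial a_{kl}} - \frac{\partial \ell_f}{\partial a_{kl}} = -\frac{A_{kl}^{c} - A_{kl}^{f}}{a_{kl}} \tag{8.69}$$

$$\frac{\partial \ell}{\partial e_k(b)} = \frac{\partial \ell_c}{\partial e_k(b)} - \frac{\partial \ell_f}{\partial e_k(b)} = -\frac{E_k^{c}(b) - E_k^{f}(b)}{e_k(b)} \tag{8.70}$$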
,Baum-Welch,
.,
Gradient Descent.

,,:

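$$\frac{\partial \ell}{\partial z_{kl}} = -\left[ A_{kl}^{c} - A_{kl}^{f} - a_{kl} \sum_{l'} \left( A_{kl'}^{c} - A_{kl'}^{f} \right) \right] \tag{8.71}$$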

t,,
:

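$$a_{kl}^{t+1} = \frac{a_{kl}^{t} \exp\left( \eta \left[ A_{kl}^{c} - A_{kl}^{f} - a_{kl} \sum_{l'} \left( A_{kl'}^{c} - A_{kl'}^{f} \right) \right] \right)}{\sum_{l'} a_{kl'}^{t} \exp\left( \eta \left[ A_{kl'}^{c} - A_{kl'}^{f} - a_{kl'} \sum_{l''} \left( A_{kl''}^{c} - A_{kl''}^{f} \right) \right] \right)} \tag{8.72}$$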

,
(Discriminative Training),
,
. ,
, .
,
( ). ,
,
(Bagos,
Liakopoulos,&Hamodrakas,2004).

8.3.4.

1-best Decoding
,
,
()
.,
,

. ,
,.
1-best (Krogh, 1997), N-best,
(Schwartz & Chow, 1990). ,
,
ymax.,


i , hi-1 ,
. ,
l yi
..
Viterbi, 1-best
.
1-best
i 1: l ( h1 ) a Bl el ( x1 )
1 i L : l (hi y i ) el ( xi ) k ( hi 1 ) akl

(8.73)

P x, y max | k (hL )akE


k

,
,
,:
$$P(x, \pi^{max} \mid \theta) \le P(x, y^{max} \mid \theta) \le P(x \mid \theta)$$

Optimal Accuracy Posterior Decoder


, Kall ,
OptimalAccuracyPosteriorDecoder(Kall,Krogh,&Sonnhammer,2005).
Posterior-Viterbi,
CHMM ( ,
). , : ,
(posterior label probabilities -PLPs),
(8.29), ,
Viterbi:
L

(8.74)
OAPD arg max i , i 1 P i | x k c

i 1

Optimal Accuracy Posterior Decoder


k B , i 0 : AB (0) 0, Ak (0)

1 i L : Al (i ) P yi c l | x, max Ak (i 1) k , l (8.75)
k

P ( x, OAPD | ) max Ak ( L ) k , E
k

,Posterior-Viterbi
,),)
,.,
Posterior-Viterbi,OptimalAccuracyPosteriorDecoder
,
.,,
.

8.3.5.

,
,



.
(genefusions),
,

(Drewetal.,2002;Melen,Krogh,&vonHeijne,2003).

(,)
,E. coli(Rappetal.,2004)
S. cerevisiae(Kim,Melen,&vonHeijne,2003).
,HMMTOP(Tusnady&Simon,
2001),
, ,
Phobius (Kall,
Krogh, & Sonnhammer, 2004). ,
,(Bagos,Liakopoulos,
&Hamodrakas,2006).

8.13: Forward,
. () 12 , x 8 .
x, 3,4
, 1 8
.

,
,
. , ,

, .

,,
,.
(Information), 1rL,
, - , 1,
2, ...r,
.,
:


0,if k i 0andi r

d k i =
1,otherwise
forward,
forward f i k
( 8.13).
, y
y, . ,
,
. forward backward
,
. ,
.
,,.
,(Bagos,etal.,2006).

8.3.6.

,

, (underflow error).
,
.
,
.,
:
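$$\log(a + b) = \log\left[ a \left( 1 + \frac{b}{a} \right) \right] = \log a + \log\left( 1 + \frac{b}{a} \right) = \log a + \log\left( 1 + \exp(\log b - \log a) \right) \tag{8.76}$$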

|log()-log(b)| 37,
, 0,
(8.76).
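The identity (8.76) lets the sums of the Forward and Backward algorithms be computed entirely from logarithms. A minimal sketch follows; the function name is hypothetical, and the larger argument is factored out so that the exponential never overflows.

```python
import math

def log_add(log_a, log_b):
    """Return log(a + b) given log(a) and log(b), without leaving log space."""
    if log_a < log_b:
        log_a, log_b = log_b, log_a  # make log_a the larger of the two
    if log_b == -math.inf:           # adding a zero probability changes nothing
        return log_a
    return log_a + math.log1p(math.exp(log_b - log_a))

small = log_add(math.log(0.2), math.log(0.3))
tiny = log_add(-1000.0, -1000.0)  # exp(-1000) underflows, but the sum in logs is fine
```

When the two logarithms differ by several tens of units, the correction term is below double precision and the result is numerically equal to the larger argument, which is why a cut-off on the difference can be used to skip the computation.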

8.4.


, .

(Ostendorf & Singer, 1997; Vasko, El-Jaroudi, & Boston, 1996 ),
(Won,Prugel-Bennett,&Krogh,2004;Yada,Ishikawa,H.,&Asai,1994),,

.,
.
,
,
.CHMM,,
-.
,(,,
- ...) :
.,,
.,
. ,


15 35 . ,
parametertyingsharing.,
.
forward backward.
, (
10,2010*20).

8.14: . (left to right). ,


. . . ( )
.

, ,
.,,
(8.14).
,.,
,
.,.
,
.,
,.,
, .
3k(8.15).

. , profile
, (
,)(-).
,,
. ,
, .
, . ,

, . ,
,
.
.


8.15: . .
1, 2,3 4. . 4 ,
. 4
( ). . .
1 4 .


(silent states).
,
.
.
silent states ( ) ,
,
.

8.5.

Profile Hidden Markov Models

Hidden Markov Models, (profile) Hidden Markov


Models (Eddy, 1998), (Gribskov, Luthy, &
Eisenberg, 1990; Gribskov, McLachlan, & Eisenberg, 1987)
.ProfileHiddenMarkovModel(pHMM),
.
profile,
().,
(positionspecific)
. , left-to-right

.


, HMM ,
,
(insert/deletion).
,,
,.

8.16: profile Hidden Markov Model

,
,(silentstates).
,
.
,

. ,
k ,
.,
. pHMM,
8.16.
(
)3:
(Matchstates)
Mk

(Insertionstates)
Ik

(Deletionstates)
Dk

, .
.

, .
,
.:
,
,
.,,
.
,
,,
.
(Viterbi,backward,forward,Baum-Welch),
.


8.6.

profile HMM

,
,
.,
(gap penalties),
, . ,
,
(remote homologies),
.

8.17: .
, .

pHMM,,
(Eddy, 1995)
(Krogh,
Brown, Mian, Sjolander,&Haussler, 1994).
,
PFAM(Bateman,etal.,2004),
, .. .
,TIGRFAM,PFAM
.,,
,
, . ,


,
, profile (Eddy, 2011).
:
,

(BLAST,PSI-BLAST,HMMER)
,
()
, ( ,
)
(HMMER)
,
.

8.18: profile .

profileHMM,
, ,
, ,
. , profile HMM
,(Eddy,1995).
, 8.18 8.19. ,
, , ,
.,Viterbi
.,
i,(m,d,i),,.
i ,
(mi).



8.19: profile .

8.7.

HMMER

ProfileHiddenMarkovModels,HMMER(Eddy,
2000).,,
GPL (GNU Public License), ,

.

HMMER,3,:
hmmbuild: ,
,.
hmmalign:
, . ,
.
hmmsearch:,
.
phmmer:
(BLASTP)
jackhmmer:
( PSIBLAST)
hmmscan:
.,
,
.,,
,
.


nhmmer: DNA,
pHMM,DNA.(BLASTN)
nhmmscan: DNA
DNAprofileHM.
hmmconvert:
HMMER3.
hmmemit: , ( )
.
hmmpress: HMMhmmscan.
hmmstat: .

,
. ,
,
(null)
, ,
.,,
, .
, ,
,.,
Profile Hidden Markov Model, ,
() (D) .
,:http://hmmer.janelia.org/.
,HMMER.
1.8 .
2.0 , ,
. ,
. ,
(Zhang &
Wood,2003).,,,
(discriminative training) (Srivastava, Desai, Nandi, & Lynn,
2007). 3.0 , ,
, (Eddy,2011). , . ,
Gumbel,
., ,
, -
BLAST PSI-BLAST-,
. ,
,.


Audic,S.,&Claverie,J.M.(1998).Self-identificationofprotein-codingregionsinmicrobialgenomes.Proc
Natl Acad Sci U S A, 95(17),10026-10031.
Bagos,P.G.,Liakopoulos,T.D.,&Hamodrakas,S.J.(2004).FasterGradientDescentConditional
MaximumLikelihoodTrainingofHiddenMarkovModels,UsingIndividualLearningRate
Adaptation.InG.Paliouras&Y.Sakakibara(Eds.),Grammatical Inference: Algorithms and
Applications(Vol.3264,pp.40-52):SpingerBerlin/Heidelberg.
Bagos,P.G.,Liakopoulos,T.D.,&Hamodrakas,S.J.(2006).Algorithmsforincorporatingpriortopological
informationinHMMs:applicationtotransmembraneproteins.BMC Bioinformatics, 7,189.
Baldi,P.,&Chauvin,Y.(1994).SmoothOn-LineLearningAlgorithmsforHiddenMarkovModels.Neural
Comput, 6(2),305-316.
Barash,Y.,Elidan,G.,Friedman,N.,&Kaplan,T.(2003).Modeling dependencies in protein-DNA binding
sites.PaperpresentedattheProceedingsoftheseventhannualinternationalconferenceon
Computationalmolecularbiology.RECOMB'03.,NewYork,NY,USA.
Bateman,A.,Coin,L.,Durbin,R.,Finn,R.D.,Hollich,V.,Griffiths-Jones,S.,...Eddy,S.R.(2004).The
Pfamproteinfamiliesdatabase.Nucleic Acids Res, 32(Databaseissue),D138-141.
Baum,L.(1972).Aninequalityandassociatedmaximizationtechniqueinstatisticalestimationfor
probabilisticfunctionsofMarkovprocesses.Inequalities, 3,1-8.
Bejerano,G.(2004).AlgorithmsforvariablelengthMarkovchainmodeling.Bioinformatics, 20(5),788-789.
Bejerano,G.,Seldin,Y.,Margalit,H.,&Tishby,N.(2001).Markoviandomainfingerprinting:statistical
segmentationofproteinsequences.Bioinformatics, 17(10),927-934.
Bejerano,G.,&Yona,G.(2001).Variationsonprobabilisticsuffixtrees:statisticalmodelingandprediction
ofproteinfamilies.Bioinformatics, 17(1),23-43.
Berchtold,A.(2001).EstimationintheMixtureTransitionDistributionModel.Journal of Time Series
Analysis, 22(4),379-397.
Borodovsky,M.,&McIninch,J.(1993).GeneMark:parallelgenerecognitionforbothDNAstrands.Comput
Chem, 17(19),123-133.
Borodovsky,M.,&Peresetsky,A.(1994).Derivingnon-homogeneousDNAMarkovchainmodelsbycluster
analysisalgorithmminimizingmultiplealignmententropy.Comput Chem, 18(3),259-267.
Browning,S.R.(2006).Multilocusassociationmappingusingvariable-lengthMarkovchains.Am J Hum
Genet, 78(6),903-913.
Dalevi,D.,Dubhashi,D.,&Hermansson,M.(2006).BayesianclassifiersfordetectingHGTusingfixedand
variableordermarkovmodelsofgenomicsignatures.Bioinformatics, 22(5),517-522.
Dempster,A.P.,Laird,N.M.,&Rubin,D.B.(1977).MaximumlikelihoodfromincompletedataviatheEM
algorithm.J Royal Stat Soc B, 39,1-38.
Drew,D.,Sjostrand,D.,Nilsson,J.,Urbig,T.,Chin,C.N.,deGier,J.W.,&vonHeijne,G.(2002).Rapid
topologymappingofEscherichiacoliinner-membraneproteinsbypredictionandPhoA/GFPfusion
analysis.Proc Natl Acad Sci U S A, 99(5),2690-2695.
Durbin, R., Eddy, S. R., Krogh, A., & Mitchison, G. (1998). Biological sequence analysis: probabilistic models of proteins and nucleic acids. Cambridge University Press.
Eddy,S.R.(1995).MultiplealignmentusinghiddenMarkovmodels.Proc Int Conf Intell Syst Mol Biol, 3,
114-120.
Eddy,S.R.(1998).ProfilehiddenMarkovmodels.Bioinformatics, 14(9),755-763.


Eddy,S.R.(2000).HMMER: profile hidden Markov models for biological sequence analysis.StLouis,MO:


WashingtonUniversityschoolofmedicine.
Eddy,S.R.(2011).AcceleratedProfileHMMSearches.PLoS Comput Biol, 7(10),e1002195.
Ellrott,K.,Yang,C.,Sladek,F.M.,&Jiang,T.(2002).Identifyingtranscriptionfactorbindingsitesthrough
Markovchainoptimization.Bioinformatics, 18 Suppl 2,S100-S109.
Eronen,L.,Geerts,F.,&Toivonen,H.(2004).AMarkovchainapproachtoreconstructionoflong
haplotypes.Pac Symp Biocomput,104-115.
Fariselli,P.,Finelli,M.,Marchignoli,D.,Martelli,P.L.,Rossi,I.,&Casadio,R.(2003).MaxSubSeq:an
algorithmforsegment-lengthoptimization.Thecasestudyofthetransmembranespanningsegments.
Bioinformatics, 19(4),500-505.
Fariselli,P.,Martelli,P.L.,&Casadio,R.(2005).AnewdecodingalgorithmforhiddenMarkovmodels
improvesthepredictionofthetopologyofall-betamembraneproteins.BMC Bioinformatics, 6 Suppl
4,S12.
Gribskov,M.,Luthy,R.,&Eisenberg,D.(1990).Profileanalysis.Methods Enzymol, 183,146-159.
Gribskov,M.,McLachlan,A.D.,&Eisenberg,D.(1987).Profileanalysis:detectionofdistantlyrelated
proteins.Proc Natl Acad Sci U S A, 84(13),4355-4358.
Jones,D.T.,Taylor,W.R.,&Thornton,J.M.(1994).Amodelrecognitionapproachtothepredictionofallhelicalmembraneproteinstructureandtopology.Biochemistry, 33(10),3038-3049.
Juang,B.H.,&Rabiner,L.R.(1990).TheSegmentalK-MeansAlgorithmforEstimatingParametersof
HiddenMarkovModels.IEEE Transactions on Acoustics, Speech and Signal Processing, 38(9),
1639-1641.
Kall,L.,Krogh,A.,&Sonnhammer,E.L.(2004).Acombinedtransmembranetopologyandsignalpeptide
predictionmethod.J Mol Biol, 338(5),1027-1036.
Kall,L.,Krogh,A.,&Sonnhammer,E.L.(2005).AnHMMposteriordecoderforsequencefeatureprediction
thatincludeshomologyinformation.Bioinformatics, 21 Suppl 1,i251-i257.
Kim,H.,Melen,K.,&vonHeijne,G.(2003).Topologymodelsfor37Saccharomycescerevisiaemembrane
proteinsbasedonC-terminalreporterfusionsandpredictions.J Biol Chem, 278(12),10208-10213.
Krogh,A.(1994).HiddenMarkovmodelsforlabelledsequences.Proceedings of the12th IAPR International
Conference on Pattern Recognition,140-144.
Krogh,A.(1997).TwomethodsforimprovingperformanceofanHMMandtheirapplicationforgene
finding.Proc Int Conf Intell Syst Mol Biol, 5,179-186.
Krogh,A.,Brown,M.,Mian,I.S.,Sjolander,K.,&Haussler,D.(1994).HiddenMarkovmodelsin
computationalbiology.Applicationstoproteinmodeling.J Mol Biol, 235(5),1501-1531.
Krogh,A.,&Riis,S.K.(1999).Hiddenneuralnetworks.Neural Comput, 11(2),541-563.
Lebre,S.,&Bourguignon,P.Y.(2008).AnEMalgorithmforestimationinthemixturetransitiondistribution
model.Journal of Statistical Computation and Simulation, 78(8),713-729.
MacQueen,B.(1967).Some Methods for classification and Analysis of Multivariate Observations.Paper
presentedattheProceedingsof5-thBerkeleySymposiumonMathematicalStatisticsandProbability.
Markov,A.A.(1913).AnexampleofstatisticalstudyontextofEugenyOneginillustratingthelinkingof
eventstoachain.Izvestija Imp. Akad. nauk, 6(3),153-162.
Melen,K.,Krogh,A.,&vonHeijne,G.(2003).Reliabilitymeasuresformembraneproteintopology
predictionalgorithms.J Mol Biol, 327(3),735-744.
Ohler,U.,Harbeck,S.,Niemann,H.,Noth,E.,&Reese,M.G.(1999).Interpolatedmarkovchainsfor
eukaryoticpromoterrecognition.Bioinformatics, 15(5),362-369.


Ostendorf,M.,&Singer,H.(1997).HMMtopologydesignusingmaximumlikelihoodsuccessivestate
splitting.Computer Speech & Language, 11(1),17-41.
Phillips,G.J.,Arnold,J.,&Ivarie,R.(1987).Mono-throughhexanucleotidecompositionoftheEscherichia
coligenome:aMarkovchainanalysis.Nucleic Acids Res, 15(6),2611-2626.
Rabiner,L.(1989).AtutorialonhiddenMarkovmodelsandselectedapplicationsinspeechrecognition.
Proc. IEEE, 77(2),257-286.
Raftery,A.E.(1985a).Amodelforhigh-orderMarkovchains.Journal of the Royal Statistical Society. Series
B (Methodological), 47(3),528-539.
Raftery,A.E.(1985b).Anewmodelfordiscrete-valuedtimeseries:Autocorrelationsandextensions.
Rassegna di Metodi Statistici ed Applicazioni, 3-4 149-162.
Rapp,M.,Drew,D.,Daley,D.O.,Nilsson,J.,Carvalho,T.,Melen,K.,...VonHeijne,G.(2004).
ExperimentallybasedtopologymodelsforE.coliinnermembraneproteins.Protein Sci, 13(4),937945.
Ron,D.,Singer,Y.,&Tishby,N.(1996).Thepowerofamnesia:learningprobabilisticautomatawith
variablememorylength.Machine Learning, 25,117-149.
Salzberg,S.L.,Delcher,A.L.,Kasif,S.,&White,O.(1998).Microbialgeneidentificationusinginterpolated
Markovmodels.Nucleic Acids Res, 26(2),544-548.
Salzberg,S.L.,Pertea,M.,Delcher,A.L.,Gardner,M.J.,&Tettelin,H.(1999).InterpolatedMarkovmodels
foreukaryoticgenefinding.Genomics, 59(1),24-31.
Schwartz,R.,&Chow,Y.L.(1990).TheN-BestAlgorithm:AnEfficientandExactProcedureforFinding
theNMostLikelySentenceHypotheses.Proc IEEE Int Conf Acoust, Speech, Sig Proc, 1,81-84.
Srivastava,P.K.,Desai,D.K.,Nandi,S.,&Lynn,A.M.(2007).HMM-ModE--improvedclassificationusing
profilehiddenMarkovmodelsbyoptimisingthediscriminationthresholdandmodifyingemission
probabilitieswithnegativetrainingsequences.BMC Bioinformatics, 8,104.
Tusnady,G.E.,&Simon,I.(2001).TheHMMTOPtransmembranetopologypredictionserver.
Bioinformatics, 17(9),849-850.
Vasko,R.C.J.,El-Jaroudi,A.,&Boston,J.R.(1996).An algorithm to determine hidden Markov model
topology.PaperpresentedattheIEEEInternationalConferenceonAcoustics,Speech,andSignal
Processing,ICASSP-96.
Won,K.J.,Prugel-Bennett,A.,&Krogh,A.(2004).TrainingHMMstructurewithgeneticalgorithmfor
biologicalsequenceanalysis.Bioinformatics, 20(18),3613-3619.
Yada,T.,Ishikawa,M.,H.,T.,&Asai,K.(1994).DNASequenceAnalysisusingHiddenMarkovModeland
GeneticAlgorithm.Genome Informatics, 5,178-179.
Yuan,Z.(1999).PredictionofproteinsubcellularlocationsusingMarkovchainmodels.FEBS Lett, 451(1),
23-26.
Zhang,Z.,&Wood,W.I.(2003).AprofilehiddenMarkovmodelforsignalpeptidesgeneratedbyHMMER.
Bioinformatics, 19(2),307-308.


1) 8.3.1,
x = 214526436636561666232145, π = --------++++++++++------.
(8.34)(8.35).

2) Hidden Markov Model (HMM) (a,


transitions),(e,emissions),:

0.7 0.3 0 x1
0.2 0.1 x4 0.5
0

0 0.8 x2
, e 0.5 x5 0.1 0.1
a
0
0.1 0.2 0.1 0.6
0 0.9 0.1

0
0
x3 0
0.4 0.4 0.1 0.1
)
;.
)x1,x2,x3,x4x5..
).;
)(Begin)
(End);

3) (regular expression) (splicing)


:
[AC]AGGT[AG]AGT
1 2 3 4 5 6 7 8 9,
1-934.
)()HiddenMarkovModel,
.
)1,(A)
10%,6(G)
(A),.
)DNA

x:AAACAGGTGAGTAAA

y:TTTAAGGTAAGTGGG

,Forward?.
)(),

(),regularexpression?
:()
(A,C,T,G).

4) :

GSAPSRKFFVGGNWKMNGRKQSLGELIGTLNAAKVPADTEVVCAPPTAYI
CCCCCCEEEEEEEECCCCCHHHHHHHHHHHHCCCCCCEEEEEEECCCCCH
51
DFARQKLDPKIAVAAQNCYKVTNGAFTGEISPGMIKDCGATWVVLGHSER
HHHHHHCCCCEEEEEEEECCCCCCCCCCCHHHHHHHCCCCEEEEECHHHH
101
RHVFGESDELIGQKVAHALAEGLG
HCCCCCCHHHHHHHHHHHHHCCCC
)
SCOP?
)HiddenMarkovModel


3(,,C).
?
),
(
).
:

5) DNA
(=, =, N=- , 5=5-
,3=3-)

GCATGCGTAGTCTGACTGCCAAGATATAAAGTTATAATCTATATACGATCGCTGTCAATGCT
NNEEEEEEEEEEEEEEEEEEEE5IIIIIIIIIIIIIIIIIIIIIII3EEEEEEEEEEEEEEE

)();
.
)HiddenMarkovModel
5(,,5,3,).
;
),
().
:
. , .

6) .
)(,,
).
):

x1=CGGCCG
x2=AT
x3=ACCGAT
x4=TCGAT
x5=GGCCG
.
?
)
?


7) HMM,22.
a e.
, ,
. ,
3, 4, ...
;

8) ,:
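\[ P(\pi_i = k \mid x, \theta) = \frac{f_k(i)\, b_k(i)}{P(x \mid \theta)} \]

: $P(x, \pi_i = k \mid \theta)$

\[ f_k(i) = P(x_1, x_2, \ldots, x_i, \pi_i = k \mid \theta) \]
\[ b_k(i) = P(x_{i+1}, \ldots, x_L \mid \pi_i = k, \theta) \]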

9) profile?
:
xa=WAYDDR,
xb=WDAYPDDR
(paths)Viterbi:
pa=m0m1m2m3d4d5m6m7m8m9,pb=m0m1i1m2m3d4m5m6m7m8m9
.


9:

,
, , DNA, RNA.
,
, , ,
, .
.

.


, .

9.

,
,,DNARNA.

1950 1960
, ,
.
,,
. ,
,.
,
. ,
.
,
. ,
. , ,
.
.
,
. ,
,
. , ,
(homologymodelling),(threading)abinitio.,
,,

, ,
, (docking)
(
/DNA ( RNA), / ,
).,,
,,
, ,
,,
.


,
.

9.1.

,
.1950
1960,,
.
, , ..
(.. 40%-20%-)
. ,
(far-UV, 170-250 nm) ,
(IR) (NMR)(Meiler
&Baker,2003;Pelton&McLean,2000).,
(-, - ) ,

(
).

9.1: ( IR CD)


2 , (
),1950,DNA.
, Sir John Kendrew and Max Perutz 1962,

.
.
, ,
.
,
(
). ,

.
1990 , ,
( ),
CCD,
, .
PDB(,'


,,
).
,
NMR10%()
.(2013-2015)
,2-3,

(,terabytegigabyte).

(single-crystal X-ray diffraction),

..
,
(Shi,2014;
Yaffe,2005).

9.2:
(https://en.wikipedia.org/wiki/X-ray_crystallography)


.
( 10-20m),
(, ...). ,
, .

. , ,
.
( ).

1m.
, ,
, .

.
. ,
,
.
, ,
. ,

.
,
.
,
(
).
(ab initio phasing), (molecular
replacement), (anomalous X-ray scattering)
(heavyatommethods),
. ,
(refinement).
/ .
,/,
,
.,
,
.
.InternationalUnionofCrystallography
,
(http://www.iucr.org/resources/other-directories/software).,
,,CCP4,
http://www.ccp4.ac.uk/(Winnetal.,2011),PHENIX,
https://www.phenix-online.org/ (Adams et al., 2010) X-PLOR (Güntert,
2011),,
NIH, Xplor-NIH,
http://nmr.cit.nih.gov/xplor-nih/(Schwieters,Kuszewski,&Clore,2006)
,,
.
(PDB),
. ,

.PDB
,


(http://www.rcsb.org/pdb/static.do?p=software/software_links/analysis_and_verification.html).

PROCHECK (Laskowski, MacArthur, Moss, &


Thornton, 1993).
MolProbity http://molprobity.biochem.duke.edu/ (Chenet
al., 2010) WHATCHECK http://swift.cmbi.ru.nl/gv/whatcheck/
(Hooft, Vriend, Sander, & Abola, 1996).
(Read et al., 2011).
,

PDB(Joostenetal.,2009).
,
(NMR). NMR ,

,,.
(Magnetic
ResonanceImaging-MRI),,mm
MRI.,
,
. ,
, .
,
.(,),

.,
,
().,
,
(,),
(,...),
.
,
.,
-,-...
,
,
,
(,
). ,
,,
-,.
DSSP (Define Secondary Structure of Proteins),
http://swift.cmbi.ru.nl/gv/dssp/,
(Kabsch & Sander, 1983). DSSP
,
. , DSSP
8.310-,-,-(G,H
) 3, 4, 5
.--()-(),
S . ,
. , ..

. - , -
(coil)C.
(G,I),,C.2002


DSSP (continuous DSSP)


(Andersen,Palmer,Brunak,&Rost,2002).
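The reduction of the eight DSSP states to three classes (H, E, C) can be sketched as follows. The default mapping is the common convention (G, H, I to H; E, B to E; everything else to C). The `strict` flag covers the variant in which only H and E are kept and all other states, including G and I, become coil; treating B as coil in that mode is an assumption. The function name `reduce_dssp` is illustrative.

```python
# Common 8-to-3 reduction of DSSP states (assumed convention).
DSSP_TO_3 = {
    "H": "H", "G": "H", "I": "H",   # alpha, 3-10 and pi helices
    "E": "E", "B": "E",             # extended strand and isolated bridge
    "T": "C", "S": "C",             # turn and bend
    " ": "C", "-": "C",             # no assignment
}


def reduce_dssp(states, strict=False):
    """Map a string of DSSP states to the three classes H, E, C.

    With strict=True only H and E keep their class and every other
    state is written as C.
    """
    if strict:
        return "".join(s if s in "HE" else "C" for s in states)
    return "".join(DSSP_TO_3.get(s, "C") for s in states)


print(reduce_dssp("HHHGGG EEB TS"))
print(reduce_dssp("HHHGGG EEB TS", strict=True))
```

The two conventions give different helix and strand contents for the same structure, which matters when prediction accuracies from different studies are compared.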
STRIDE (STRuctural IDEntification),
http://webclu.bio.wzw.tum.de/stride/
(Frishman&Argos,1995).STRIDE
DSSP,
( Lennard-Jones),
. , DSSP,
,
.DSSP,STRIDE
DSSP
.
, STRIDE PDB ( DSSP),
(Heinig&Frishman,2004).

9.2.

,
.,
19501960.,

. ,
.
, ball and stick.
,(wireframe)
,().

. C (
, )
.

9.3: , ball-and-stick (https://en.wikipedia.org/wiki/Molecular_model). ,


(NH3CH2CH2C(OH)(PO3H)(PO3H)-), Jmol
(https://en.wikipedia.org/wiki/Molecular_graphics)

, (space-filling
model),,,
van der Waals
. CPK models Corey, Pauling, Koltun,
,
.
,,



. ,
.
.
,..
.

()().
,-
/-, (Ribbon diagram) ,
.

. , -
(ribbon) , -
.,,
.,
. ,
,
.
.

9.4: , ATP. ,
2 , (PDB code 2RH1, https://en.wikipedia.org/wiki/Space-filling_model)

,.
,
.,
PDB.,

( , ,
...).
, .
,
, , ,

.,.


,
,ribbon.,
,
,
wireframe ball and stick . ,
,
,(.).

9.5:
. Planctomyces limnophilus (PDB code
3TVA). PV.

PDB

(http://www.rcsb.org/pdb/static.do?p=software/software_links/molecular_graphics.html). PDB


. RCSB Simple Viewer
(http://biojava.org/wiki/RCSB_Viewers:About), JavaWebStart
, Jmol (http://jmol.sourceforge.net/),
Java Applet, Jsmol
JavaScriptHTML5(http://sourceforge.net/projects/jsmol/).
.
PV WebGL
.
RasMol
(http://www.bernstein-plus-sons.com/software/rasmol/),
,


OpenRasMol (http://www.openrasmol.org/). CCP4 ,


, CCP4mg (http://www.ccp4.ac.uk/MG/),

, Swiss-PDBviewer (http://spdbv.vital-it.ch/),

SWISS-MODEL
(.
)

Cn3D
(http://www.ncbi.nlm.nih.gov/Structure/CN3D/cn3d.shtml)
NCBIEntrez,alignmenteditor.,
,
PyMol (http://www.pymol.org/pymol) Python
OpenGL Extension Wrangler Library (GLEW). PyMol
,,

.,
WHAT IF(.)3D
(
SGIUnix).

9.3.

( )
(
). ,
, ,
.,
,
,
.,
,
.,(
).
,
4,
,
/
().
,
.,
( ).
, ..
. ,
(11,22...),
(structural superposition).
,,

.
().
( 9.6),
()
.,,
.,
,RMSD(RootMeanSquareDeviation):


Nii
.
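The RMSD of N paired atoms is the square root of the mean squared distance between the members of each pair. A minimal sketch in plain Python follows. It assumes the two coordinate sets are already superposed and listed in the same order; it does not search for the optimal rotation and translation. The function name `rmsd` is illustrative.

```python
import math


def rmsd(coords_a, coords_b):
    """Root Mean Square Deviation between two equally long lists of (x, y, z).

    RMSD = sqrt( (1/N) * sum_i d_i^2 ), where d_i is the distance between
    the i-th atom of the first structure and the i-th atom of the second.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        total += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(total / len(coords_a))


a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
b = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(a, b))  # every atom is displaced by exactly 1 unit
```

Because the deviations are squared, a few badly fitting atoms dominate the value, which is one motivation for the more robust superposition methods.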

9.6: .

, (C)

. ,
-
.,,,
.
(least squares method)
,
(maximum likelihood) (Theobald & Wuttke, 2006a, 2006b) (robust)
,leastmediansquaresregression(LMS)(Liu,Fang,&Ramani,2009).

LSQMAN
(http://xray.bmc.uu.se/usf/lsqman_man.html), THESEUS (http://www.theseus3d.org)

LMSfit
(https://engineering.purdue.edu/PRECISE/LMSfit).

Profit
(http://www.bioinf.org.uk/software/profit/)
McLachlan(McLachlan,1982),
3dSS(http://cluster.physics.iisc.ernet.in/3dss/)
RasMol Profit,
, (Sumathi, Ananthalakshmi,
Roshan,&Sekar,2006).


,
, .
,
.
; ,

.,
,
/(9.7).

9.7: .

,,,
,.
.
,
(:H,E,C),
,
.
(threading).,

. SuperPose (http://wishart.biology.ualberta.ca/SuperPose/)

(Maiti, Van Domselaar, Zhang, & Wishart, 2004). . ,
,
.


, SuperPose

.,
,,
,
,

9.8: ( ).
.

,

, . ,
, , NP-complete (Lathrop, 1994), ,
,(Poleksic,2009).,
(
),
. , ,
,.
, DALI, (distance alignment matrix
method), http://ekhidna.biocenter.helsinki.fi/dali_server/start
,(DALIlite).(1993),
().


(contacts)
.
.
,.
6x6
(Holm & Rosenström, 2010). DALI FSSP (Families of
Structurally Similar Proteins)
.
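The distance matrix on which this family of methods relies can be sketched in a few lines. Intramolecular distances do not change when a structure is rotated or translated, so two such matrices can be compared without superposing the structures first. The function names, the 8 Å contact cutoff and the toy coordinates are illustrative assumptions; the 6x6 window corresponds to the submatrix size mentioned above.

```python
import math


def distance_matrix(ca_coords):
    """All-against-all distances between C-alpha atoms given as (x, y, z)."""
    n = len(ca_coords)
    return [[math.dist(ca_coords[i], ca_coords[j]) for j in range(n)]
            for i in range(n)]


def contact_map(dm, cutoff=8.0):
    """Boolean matrix: True where two residues are closer than `cutoff`."""
    return [[d <= cutoff for d in row] for row in dm]


def submatrix(dm, i, j, size=6):
    """The size x size block of `dm` starting at row i and column j."""
    return [row[j:j + size] for row in dm[i:i + size]]


coords = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 10.0)]
dm = distance_matrix(coords)
print(dm[0][1], dm[0][2])
print(contact_map(dm)[0])
```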
CE (Combinatorial extension),
http://source.rcsb.org/jfatcatserver/ceHome.jsp PDB.
,
(CE-MC). CE DALI
, .
(aligned fragment pairs AFPs)

AFP.
,
,,...(Shindyalov&Bourne,1998).
To SSAP (Sequential Structure Alignment Program) ,

(http://www.biochem.ucl.ac.uk/~orengo/ssap.html). ,
C,
.
.,
.
,

(Taylor&Orengo,1989).,
CATH.
SSM (http://www.ebi.ac.uk/msd-srv/ssm/),
PDB.,
(Krissinel&
Henrick, 2004). , ,
.
,
.,

.
MASS (http://bioinfo3d.cs.tau.ac.il/MASS/),
(Dror,Benyamini, Nussinov,&
Wolfson,2003).MASS,
( ,),

, ,
.
To ,
(http://ub.cbm.uam.es/software/online/mamothmult.php). MAMMOTH
,,unit-vectorrootmeansquare
(URMS), ,

.
(Ortiz,Strauss,&Olmea,2002).
MUSTANG (multiple structural alignment algorithm)
(Konagurthu, Whisstock, Stuckey, & Lesk, 2006).
,.



, . ,
RMSD C,

(http://www.csse.monash.edu.au/~karun/Site/mustang.html).
,TM-align(http://zhanglab.ccmb.med.umich.edu/TM-align/)

(Zhang & Skolnick, 2005).
,
(C).,
(about 4 times faster than CE and 20 times faster than DALI),
.
,
. Wikipedia
(https://en.wikipedia.org/wiki/Structural_alignment_software), ,
' ,
. ,

(Kolodny,Koehl,
&Levitt,2005;Mayr,Domingues,&Lackner,2007;Singh&Brutlag,2000).,
,
, ,
.,DALI,CETM-align,
.',
, '

.

9.4.

,
. ,
,
.in
silico(,,...),

.,
( ...) ,
.,
: ) , )
. ,
,
, ( ,
).
,
, , (
).
,
,(target)
(template).,
..,

(,
).,
( ,


...),(9.9).,

;,
(),
: ,
,
.
, .
. , 30% ,
20%
(twilightzone).

9.9: % .

,,
.
,
(,80%,50%,40%...),
(<30%) .
,
.,,
,
(homology modelling). ,
(threading), (ab initio modelling),
.
, ,
, ,
(,
).abinitio,


, ,
,
.

9.10: .

, 3
. ,
,(>70%)
(RMSD < 1 Å),
().,

2-4 Å., ab initio,
RMSD 4-10 Å,
.

9.4.1.

,(),

( 9.10). ,
,
.
:
.
BLASTFASTA,,
. ,
.
,.



().
.
.,
.
(C, C, N, ).
,
.
.
.,
.
,
,(loop),
. ,
,
.,,

, PDB
,abinitio,
.,
C-C.

, (>35%)
.
PDB.
.
,C
( ).

(),
(moleculardynamics).
,
.
.,
. ,

(
,,...).
.
,
WHAT IF (http://swift.cmbi.ru.nl/whatif/), 1987 Gert
Vriend (Vriend, 1990). ,

(, , 3D , , ,
...),,
. MODELLER (https://salilab.org/modeller/)
(Eswar et al., 2006). MODELLER

( , ,
) . MODELLER
Šali and Blundell (Šali & Blundell, 1993),
de novo , ,
,,,...,
/ (Unix/Linux, Windows, Mac)


, EasyModeller
(http://modellergui.blogspot.gr/). , SWISS-MODEL (http://swissmodel.expasy.org/)
.
,
,,
(Biasinietal.,2014).

9.4.2.

,.,

().
, ,
. ,
, ,
().
,
1000-2000 ( 1300 ). ,
,
.
, ,
(..
).

9.11: .

,
.
(1D), ,



. ,
...,
,
, . ,
(3D)
.,
(),
.,
,.
Bowie, Lüthy and Eisenberg 1991 (Bowie,
Luthy, & Eisenberg, 1991) (threading)
Jones, Taylor and Thornton 1992 (Jones, Taylor, & Thornton, 1992)
. ' ,
.
,THREADERDavidJones
( http://bioinf.cs.ucl.ac.uk/?id=747),
19923.5
(Jones, 1998).
,PHDthreaderBurkhartRost(Rost,Schneider,&Sander,1997),
server Predict Protein (www.predictprotein.org).
To PHDthreader PHD
( , ).
genTHREADER (Jones, 1999),
PSI-PRED,
,
server (http://bioinf.cs.ucl.ac.uk/psipred/). genTHREADER
MODELLER ,
.
, HHpred (Söding, Biegert, & Lupas,
2005). To HHpred profile
HMM ( HHsearch),
(Söding, 2005).
(PDB, SCOP, PFAM, SMART ...),
,
MODELLER (http://toolkit.tuebingen.mpg.de/hhpred). To
Phyre2(http://www.sbg.bio.ic.ac.uk/phyre2).
,profile-profile,PSSM,
HHsearch(Kelley,Mezulis,Yates,Wass,&Sternberg,2015).To
Phyre2 PSI-PRED,
MEMSAT, -
DISOPRED, ,
abinitio.,,
, , ( ,
, ab initio ),
.
,

RaptorX
(http://raptorx.uchicago.edu/)

MUSTER
(http://zhang.bioinformatics.ku.edu/MUSTER)
,
. RaptorX
,
(multiple template threading)
(Peng & Xu, 2011). MUSTER ,
(,,,
...),


(Wu & Zhang, 2008). ,


LOMETS(http://zhanglab.ccmb.med.umich.edu/LOMETS/)
meta-server9(FFAS-3D,HHsearch,MUSTER,
pGenTHREADER, PPAS, PRC, PROSPECT2, SP3, SPARKS-X)
(Wu & Zhang, 2007). LOMETS, I-TASSER
(http://zhanglab.ccmb.med.umich.edu/I-TASSER/about.html),
,
CASP. To I-TASSER
replicaexchangeMonteCarlosimulations
ab initio. ,
abinitio.
I-TASSER,
, ,
...
(Roy,Kucukural,&Zhang,2010).

9.4.3. Ab initio de novo

,
.,
, (protein folding problem)
. ,
(,
NP-complete).,
, ,
. , ab initio ,
(
). , deNovo,
.
,
.

9.12: ab initio
.



. Anfinsen
,
Levinthal
. , 100 (
), 99 198
. 3 (
),3198,
.
, ...
. ,
microsecond millisecond, (
, ,
...). , , ,
abinitiodenovo.
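The size of the conformational space in Levinthal's argument is easy to reproduce: 100 residues give 99 peptide bonds and therefore 198 backbone dihedral angles, and three allowed values per angle give 3^198 conformations. The sketch below does the arithmetic. The sampling rate of 10^13 conformations per second is a purely hypothetical figure chosen for illustration, as is the function name `levinthal`.

```python
def levinthal(n_residues=100, states_per_angle=3, rate_per_second=1e13):
    """Count backbone conformations and the time needed to enumerate them.

    rate_per_second is an assumed sampling speed, not a measured quantity.
    """
    n_angles = 2 * (n_residues - 1)           # one phi and one psi per peptide bond
    n_conformations = states_per_angle ** n_angles
    seconds = n_conformations / rate_per_second
    years = seconds / (3600 * 24 * 365)
    return n_angles, n_conformations, years


angles, conformations, years = levinthal()
print(angles, "angles")
print(f"{conformations:.3e} conformations")
print(f"{years:.3e} years of exhaustive sampling")
```

Even with this generous rate the search would take vastly longer than the age of the universe, whereas real proteins fold in microseconds to milliseconds, so folding cannot be a random exhaustive search.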
,abinitio,(
),
( 50 ),
.,
,,
().:
.
,
.,
:C,C,

.
. ,
, 6-7
.
.
.
.,
,
.
AMBER, CHARMM, UNRES and ASTRO-FOLD.

(
ROSETTA and TASSER/I-TASSER).
.
.
, Monte Carlo,
Simulated Annealing ( ),
.
(Molecular Dynamics),

. ,
, .
,
.

.
, CHARMM
(http://www.charmm.org/) AMBER (http://ambermd.org/)


. CHARMM
(Chemistry at HARvard Macromolecular Mechanics)
, Martin Karplus Harvard (Brooks et al., 2009).

(,,
).AMBER(AssistedModelBuildingwithEnergyRefinement)
PeterKollmanUniversity
of California (Case et al.,
2005)..
,GROMACS (http://www.gromacs.org).GROMACS

,Windows(Pronket
al., 2013). ,
ab initio ,
,
.
abinitio/denovo,
ROSETTA (http://robetta.bakerlab.org/). To ROSETTA ,
abinitio
( 9) PDB. Bowie
Eisenberg1994.
C ( )
,MonteCarlo
(Rohl,Strauss,Misura,&Baker,2004).
, I-TASSER (http://zhanglab.ccmb.med.umich.edu/I-TASSER/about.html)
ab initio
.I-TASSER,
,
CASP(Helles,2008).
,,.
I-TASSER Rosetta Monte Carlo (
), ,
.
ePROPAINOR (http://www.math.iitb.ac.in/epropainor) PROTinfo
(http://ram.org/compbio/protinfo/), ( I-TASSER ROSETTA). , QUARK
(http://zhanglab.ccmb.med.umich.edu/QUARK/),CABSfold(http://biocomp.chem.uw.edu.pl/CABSfold/),
PEP-FOLD (http://bioserv.rpbs.univ-paris-diderot.fr/services/PEP-FOLD/) BHAGEERATH (
http://www.scfbio-iitd.res.in/bhageerath/index.jsp),
(<100 ),
.
, (distributed) ab initio .
, ROSETTA@home (http://boinc.bakerlab.org/rosetta/)
Folding@home(http://folding.stanford.edu/).,
,
.
ROSETTA@home,

.
, .
FOLDit (http://fold.it/portal/) /
(,
).
,
.


9.5.

(docking)

.,
.
(Bonvin,2006;Gray,2006;Sternberg,
Gabb,&Jackson,1998),
, , , ... (Taylor, Jewsbury, & Essex, 2002). ,
DNA- DNA- . ,
,
,,
-, -,
.
,,
,,
(Alvarez,2004).

9.13: .

.,
-,
(,
...).,,
(rigid docking),
(


),.,
(
) ( ).
,
(flexible docking) ( )
.
,,,
. ,
. ,

.
,
.,
,.
,
.
, .

abinitio.,
- - ,
,
, .

. , ,
abinitio,
,
.

9.14: ( https://en.wikipedia.org/wiki/Docking_%28molecular%29).

, ab initio
,
.,
: (Halperin, Ma, Wolfson, &
Nussinov,2002;Moreira,Fernandes,&Ramos,2010).,
,.,
.
,,
.
(computer-aided drug discovery), . Swiss
Institute of Bioinformatics



(http://www.click2drug.org/directory_Docking.html).
,
(Rodrigues & Bonvin, 2014). ,
, ) , )
, ) -, )
)
.

GRAMM (http://vakser.bioinformatics.ku.edu/resources/gramm/gramm1/). GRAMM (
Global RAnge Molecular Matching)

, ,
. ,
.
. ,
,..
, .
, GRAMM-X
(http://vakser.compbio.ku.edu/resources/gramm/grammx/),
.
AutoDock(http://autodock.scripps.edu/).

: AutoDock 4 AutoDock Vina.


,
.
. , AutoDockTools,

.
HADDOCK (High Ambiguity Driven protein-protein DOCKing)
.
HADDOCKabinitio
(http://haddocking.org/).
( )
.
,
HADDOCK.
FTDock(http://www.sbg.bio.ic.ac.uk/docking/ftdock.html)
.
Fourier
.
DOT(http://www.sdsc.edu/CCMS/DOT/)
- .
DOT
.
vanderWaals,Fourier.
ZDOCK (http://www.umassmed.edu/zlab/)
.
,
. ZDOCK
.
To ClusPro (http://cluspro.bu.edu/),
.
Fourier.


RMSDCHARMM.
.
SwissDock(http://www.swissdock.ch/)
.SwissDock
EADock DSS,
(localdocking)
(blind docking). , CHARMM
.
rDock (http://rdock.sourceforge.net/)
.
High Throughput Virtual Screening (HTVS).
, Linux,
cluster
CPUs.
, RosettaDock(http://rosie.rosettacommons.org/docking2)
Rosetta, . Monte Carlo (MC)
:
(),

.,
.

(Rodrigues&Bonvin,2014;R.D.Taylor,etal.,2002).
CASP CAFASP , ,
, CAPRI (Critical Assessment of
PRediction of Interactions). CAPRI
,
,
.
,,
(http://www.ebi.ac.uk/msd-srv/capri/).
,
.,
,
. ,
.,-
,
(,,
...).



Adams,P.D.,Afonine,P.V.,Bunkoczi,G.,Chen,V.B.,Davis,I.W.,Echols,N.,...Zwart,P.H.(2010).
PHENIX:acomprehensivePython-basedsystemformacromolecularstructuresolution.Acta
Crystallogr D Biol Crystallogr, 66(Pt2),213-221.
Alvarez,J.C.(2004).High-throughputdockingasasourceofnoveldrugleads.Current opinion in chemical
biology, 8(4),365-370.
Andersen,C.A.,Palmer,A.G.,Brunak,S.,&Rost,B.(2002).Continuumsecondarystructurecaptures
proteinflexibility.Structure, 10(2),175-184.
Biasini,M.,Bienert,S.,Waterhouse,A.,Arnold,K.,Studer,G.,Schmidt,T.,...Bordoli,L.(2014).SWISSMODEL:modellingproteintertiaryandquaternarystructureusingevolutionaryinformation.Nucleic
Acids Research,gku340.
Bonvin,A.M.(2006).Flexibleproteinproteindocking.Current Opinion in Structural Biology, 16(2),194200.
Bowie,J.U.,Luthy,R.,&Eisenberg,D.(1991).Amethodtoidentifyproteinsequencesthatfoldintoa
knownthree-dimensionalstructure.Science, 253(5016),164-170.
Brooks,B.R.,Brooks,C.L.,MacKerell,A.D.,Nilsson,L.,Petrella,R.J.,Roux,B.,...Boresch,S.(2009).
CHARMM:thebiomolecularsimulationprogram.Journal of computational chemistry, 30(10),15451614.
Case,D.A.,Cheatham,T.E.,Darden,T.,Gohlke,H.,Luo,R.,Merz,K.M.,...Woods,R.J.(2005).The
Amberbiomolecularsimulationprograms.Journal of computational chemistry, 26(16),1668-1688.
Chen,V.B.,Arendall,W.B.,3rd,Headd,J.J.,Keedy,D.A.,Immormino,R.M.,Kapral,G.J.,...
Richardson,D.C.(2010).MolProbity:all-atomstructurevalidationformacromolecular
crystallography.Acta Crystallogr D Biol Crystallogr, 66(Pt1),12-21.
Dror,O.,Benyamini,H.,Nussinov,R.,&Wolfson,H.J.(2003).Multiplestructuralalignmentbysecondary
structures:algorithmandapplications.Protein Science, 12(11),2492-2507.
Eswar,N.,Marti-Renom,M.A.,Webb,B.,Madhusudhan,M.S.,Eramian,D.,Shen,M.,...Sali,A.(2006).
ComparativeProteinStructureModelingWithMODELLERCurrentProtocolsinBioinformatics(Vol.
5.6.1-5.6.30):JohnWiley&Sons,Inc.
Frishman,D.,&Argos,P.(1995).Knowledge-basedproteinsecondarystructureassignment.Proteins, 23(4),
566-579.
Güntert, P. (2011). Automated protein structure determination from NMR data. In A. J. Dingley & S. M. Pascal (Eds.), Biomolecular NMR spectroscopy (pp. 341). Amsterdam: IOS Press.
Gray,J.J.(2006).High-resolutionproteinproteindocking.Current Opinion in Structural Biology, 16(2),
183-193.
Halperin,I.,Ma,B.,Wolfson,H.,&Nussinov,R.(2002).Principlesofdocking:Anoverviewofsearch
algorithmsandaguidetoscoringfunctions.Proteins: Structure, Function, and Bioinformatics, 47(4),
409-443.
Heinig,M.,&Frishman,D.(2004).STRIDE:awebserverforsecondarystructureassignmentfromknown
atomiccoordinatesofproteins.Nucleic Acids Res, 32(WebServerissue),W500-502.
Helles,G.(2008).Acomparativestudyofthereportedperformanceofabinitioproteinstructureprediction
algorithms.Journal of the Royal Society Interface, 5(21),387-396.
Holm, L., & Rosenström, P. (2010). Dali server: conservation mapping in 3D. Nucleic Acids Research, 38(suppl 2), W545-W549.


Hooft,R.W.,Vriend,G.,Sander,C.,&Abola,E.E.(1996).Errorsinproteinstructures.Nature, 381(6580),
272.
Jones,D.T.(1998).THREADER:ProteinSequenceThreadingbyDoubleDynamicProgramming.InS.
Salzberg,D.Searls&S.Kasif(Eds.),ComputationalMethodsinMolecularBiology:Elsevier
Science.
Jones,D.T.(1999).GenTHREADER:anefficientandreliableproteinfoldrecognitionmethodforgenomic
sequences.Journal of molecular biology, 287(4),797-815.
Jones, D. T., Taylor, W. R., & Thornton, J. M. (1992). A new approach to protein fold recognition.
Joosten,R.P.,Salzemann,J.,Bloch,V.,Stockinger,H.,Berglund,A.C.,Blanchet,C.,...Vriend,G.(2009).
PDB_REDO:automatedre-refinementofX-raystructuremodelsinthePDB.J Appl Crystallogr,
42(Pt3),376-384.
Kabsch,W.,&Sander,C.(1983).Dictionaryofproteinsecondarystructure:patternrecognitionofhydrogenbondedandgeometricalfeatures.Biopolymers, 22(12),2577-2637.
Kelley,L.A.,Mezulis,S.,Yates,C.M.,Wass,M.N.,&Sternberg,M.J.E.(2015).ThePhyre2webportal
forproteinmodeling,predictionandanalysis.Nat. Protocols, 10(6),845-858.
Kolodny,R.,Koehl,P.,&Levitt,M.(2005).Comprehensiveevaluationofproteinstructurealignment
methods:scoringbygeometricmeasures.Journal of molecular biology, 346(4),1173-1188.
Konagurthu,A.S.,Whisstock,J.C.,Stuckey,P.J.,&Lesk,A.M.(2006).MUSTANG:amultiplestructural
alignmentalgorithm.Proteins: Structure, Function, and Bioinformatics, 64(3),559-574.
Krissinel,E.,&Henrick,K.(2004).Secondary-structurematching(SSM),anewtoolforfastproteinstructure
alignmentinthreedimensions.Acta Crystallographica Section D: Biological Crystallography,
60(12),2256-2268.
Laskowski,R.A.,MacArthur,M.W.,Moss,D.S.,&Thornton,J.M.(1993).PROCHECK-aprogramto
checkthestereochemicalqualityofproteinstructures.J. App. Cryst, 26,283-291.
Lathrop,R.H.(1994).TheproteinthreadingproblemwithsequenceaminoacidinteractionpreferencesisNPcomplete.Protein engineering, 7(9),1059-1068.
Liu,Y.-S.,Fang,Y.,&Ramani,K.(2009).Usingleastmedianofsquaresforstructuralsuperpositionof
flexibleproteins.BMC Bioinformatics, 10(1),29.
Maiti,R.,VanDomselaar,G.H.,Zhang,H.,&Wishart,D.S.(2004).SuperPose:asimpleserverfor
sophisticatedstructuralsuperposition.Nucleic Acids Research, 32(suppl2),W590-W594.
Mayr,G.,Domingues,F.S.,&Lackner,P.(2007).Comparativeanalysisofproteinstructurealignments.
BMC Structural Biology, 7(1),50.
McLachlan,A.D.(1982).Rapidcomparisonofproteinstructures.Acta Crystallogr D Biol Crystallogr, A38,
871-873
Meiler,J.,&Baker,D.(2003).RapidproteinfolddeterminationusingunassignedNMRdata.Proc Natl Acad
Sci U S A, 100(26),15404-15409.
Moreira,I.S.,Fernandes,P.A.,&Ramos,M.J.(2010).Proteinproteindockingdealingwiththeunknown.
Journal of computational chemistry, 31(2),317-342.
Ortiz,A.R.,Strauss,C.E.,&Olmea,O.(2002).MAMMOTH(matchingmolecularmodelsobtainedfrom
theory):anautomatedmethodformodelcomparison.Protein Science, 11(11),2606-2621.
Pelton,J.T.,&McLean,L.R.(2000).Spectroscopicmethodsforanalysisofproteinsecondarystructure.
Anal Biochem, 277(2),167-176.
Peng,J.,&Xu,J.(2011).RaptorX:exploitingstructureinformationforproteinalignmentbystatistical
inference.Proteins: Structure, Function, and Bioinformatics, 79(S10),161-171.
Poleksic,A.(2009).Algorithmsforoptimalproteinstructurealignment.Bioinformatics, 25(21),2751-2756.


Pronk, S., Páll, S., Schulz, R., Larsson, P., Bjelkmar, P., Apostolov, R., ... van der Spoel, D. (2013). GROMACS 4.5: a high-throughput and highly parallel open source molecular simulation toolkit. Bioinformatics, btt055.
Read,R.J.,Adams,P.D.,Arendall,W.B.,3rd,Brunger,A.T.,Emsley,P.,Joosten,R.P.,...Zwart,P.H.
(2011).Anewgenerationofcrystallographicvalidationtoolsfortheproteindatabank.Structure,
19(10),1395-1412.
Rodrigues,J.P.,&Bonvin,A.M.(2014).Integrativecomputationalmodelingofproteininteractions.FEBS
Journal, 281(8),1988-2003.
Rohl,C.A.,Strauss,C.E.,Misura,K.M.,&Baker,D.(2004).ProteinstructurepredictionusingRosetta.
Methods in enzymology, 383,66-93.
Rost,B.,Schneider,R.,&Sander,C.(1997).Proteinfoldrecognitionbyprediction-basedthreading.Journal
of molecular biology, 270(3),471-480.
Roy,A.,Kucukural,A.,&Zhang,Y.(2010).I-TASSER:aunifiedplatformforautomatedproteinstructure
andfunctionprediction.Nature protocols, 5(4),725-738.
Šali, A., & Blundell, T. L. (1993). Comparative protein modelling by satisfaction of spatial restraints. Journal of molecular biology, 234(3), 779-815.
Schwieters,C.D.,Kuszewski,J.J.,&Clore,G.M.(2006).UsingXplor-NIHforNMRmolecularstructure
determination.Progr. NMR Spectroscopy 48,47-62
Shi,Y.(2014).AglimpseofstructuralbiologythroughX-raycrystallography.Cell, 159(5),995-1014.
Shindyalov,I.N.,&Bourne,P.E.(1998).Proteinstructurealignmentbyincrementalcombinatorialextension
(CE)oftheoptimalpath.Protein engineering, 11(9),739-747.
Singh,A.P.,&Brutlag,D.L.(2000).ProteinStructureAlignment:Acomparisonofmethods.
Bioinformatics.
Söding, J. (2005). Protein homology detection by HMM-HMM comparison. Bioinformatics, 21(7), 951-960.
Söding, J., Biegert, A., & Lupas, A. N. (2005). The HHpred interactive server for protein homology detection and structure prediction. Nucleic Acids Research, 33(suppl 2), W244-W248.
Sternberg,M.J.,Gabb,H.A.,&Jackson,R.M.(1998).Predictivedockingofproteinproteinandprotein
DNAcomplexes.Current Opinion in Structural Biology, 8(2),250-256.
Sumathi,K.,Ananthalakshmi,P.,Roshan,M.M.,&Sekar,K.(2006).3dSS:3Dstructuralsuperposition.
Nucleic Acids Research, 34(suppl2),W128-W132.
Taylor,R.D.,Jewsbury,P.J.,&Essex,J.W.(2002).Areviewofprotein-smallmoleculedockingmethods.
Journal of computer-aided molecular design, 16(3),151-166.
Taylor,W.R.,&Orengo,C.A.(1989).Proteinstructurealignment.Journal of molecular biology, 208(1),122.
Theobald,D.L.,&Wuttke,D.S.(2006a).EmpiricalBayeshierarchicalmodelsforregularizingmaximum
likelihoodestimationinthematrixGaussianProcrustesproblem.Proceedings of the National
Academy of Sciences, 103(49),18521-18527.
Theobald,D.L.,&Wuttke,D.S.(2006b).THESEUS:maximumlikelihoodsuperpositioningandanalysisof
macromolecularstructures.Bioinformatics, 22(17),2171-2172.
Vriend,G.(1990).WHATIF:amolecularmodelinganddrugdesignprogram.Journal of molecular graphics,
8(1),52-56.
Winn,M.D.,Ballard,C.C.,Cowtan,K.D.,Dodson,E.J.,Emsley,P.,Evans,P.R.,...Wilson,K.S.(2011).
OverviewoftheCCP4suiteandcurrentdevelopments.Acta Crystallogr D Biol Crystallogr, 67(Pt4),
235-242.


Wu,S.,&Zhang,Y.(2007).LOMETS:alocalmeta-threading-serverforproteinstructureprediction.Nucleic
Acids Research, 35(10),3375-3382.
Wu,S.,&Zhang,Y.(2008).MUSTER:improvingproteinsequenceprofileprofilealignmentsbyusing
multiplesourcesofstructureinformation.Proteins: Structure, Function, and Bioinformatics, 72(2),
547-556.
Yaffe,M.B.(2005).X-raycrystallographyandstructuralbiology.Crit Care Med, 33(12Suppl),S435-440.
Zhang,Y.,&Skolnick,J.(2005).TM-align:aproteinstructurealignmentalgorithmbasedontheTM-score.
Nucleic Acids Research, 33(7),2302-2309.


10:

, ,
(,
),
.
RNA
.
, ,
.


( 4), ( 8), ( 7).

10.

,(regularexpressions),
(patterns) , (profiles) Hidden Markov Models (HMMs),
.
. ,
,
.
(formal language theory),
.,,
.(rewriting rules) A → xB,
, , - (nonterminal
symbols),,,
(terminalsymbols),.
-x.,-S,
, -
.
,
, Noam Chomsky.

, ,
(),,
. ,
(Durbin, Eddy, Krogh, & Mitchison, 1998)
,..(Searls,2002).

10.1. Chomsky

To 1956 Noam Chomsky


(Chomsky, 1956). G
:
V
T
P
S


,G(V, T, P, S). ,
, , ,
,.
(regulargrammars),3,
(right-linear) (left-linear).
,:
W1 → aW2    W → a
,:
W1 → W2a    W → a
,
,
,
,.(Finite
StateAutomata).,,(regular
expressions),.
, ,
.
(context free grammar),
2,:
W
, (string) -
( ).
(Push Down Automata).
,
.
(context sensitive grammar),
1, (monotonic grammar).
:
1W2 a1a2
,a,
- ,
- .
, , .
.
(LinearlyBoundedAutomata).
, (unrestricted grammars),
0,:

1W2
-
.,
.
( ) ( ) .

.,
( )
.
(recursively enumerable
languages).(TuringMachines).



10.1: .

(
https://en.wikipedia.org/wiki/Chomsky_hierarchy).

10.2.

, , .
xy, S → xS
S → y. x
y. S ⇒ xS ⇒ xxS ⇒ xxxS ⇒ xxxy,
(
).
,
,.

, , (parsing)
(Finite State Automata).
(regularexpressions)
.,10.2,
PROSITE ( ,
PROSITE).

10.2: .

,PROSITE:
[RK]-G-{EDRKHPCG}-[AGSCI]-[FY]-[LIVA]-x-[FYM]

:
[RK]G[^EDRKHPCG][AGSCI][FY][LIVA].[FYM]
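
The translation from PROSITE notation to a regular expression shown above is mechanical, so it can be automated. The following is a minimal sketch, assuming only the PROSITE elements used in this chapter (`[..]`, `{..}`, `x` and numeric repeats such as `x(2)` or `x(2,4)`); the function name `prosite2regex` is hypothetical and the full PROSITE syntax (anchors `<` and `>`, for example) is not covered.

```perl
use strict;
use warnings;

# Convert a PROSITE pattern into a Perl regular expression string.
# Only [..], {..}, x and (n) / (n,m) repeats are handled in this sketch.
sub prosite2regex {
    my ($pattern) = @_;
    $pattern =~ s/\.$//;                       # drop an optional trailing period
    my $regex = '';
    foreach my $element (split /-/, $pattern) {
        my $repeat = '';
        if ($element =~ s/\((\d+(?:,\d+)?)\)$//) {
            $repeat = '{' . $1 . '}';          # x(2) becomes .{2}
        }
        if ($element eq 'x') {
            $element = '.';                    # x means any residue
        }
        elsif ($element =~ /^\{([A-Z]+)\}$/) {
            $element = '[^' . $1 . ']';        # {..} means none of these residues
        }
        $regex .= $element . $repeat;
    }
    return $regex;
}

my $regex = prosite2regex('[RK]-G-{EDRKHPCG}-[AGSCI]-[FY]-[LIVA]-x-[FYM]');
print "$regex\n";
print "match\n" if 'RGACFVKY' =~ /$regex/;
```

The returned string can be used directly inside a match operator, as in the last line of the sketch.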

,
.,


(S)R,
G. ,
RG.,...
,,
,rR,
...,,:

S → rW1 | kW1
W1 → gW2
W2 → [afilmnqstvwy] W3
W3 → [agsci] W4
W4 → fW5 | yW5
W5 → lW6 | iW6 | vW6 | aW6
W6 → [acdefghiklmnpqrstvwy] W7
W7 → f | y | m

,
,:
S ⇒ rW1
⇒ rgW2
⇒ rgaW3
⇒ rgacW4
⇒ rgacfW5
⇒ rgacfvW6
⇒ rgacfvkW7
⇒ rgacfvky
, PROSITE
. ,

. ,
;rg;.,
,()
.,rgacfvkykgacfvky,
,agacfvky.

,,,
.,
,.
8, ( + -) 4
(DNA).10.3,

,.,
,-,(
),.
,
,-(
):
B → + | - | E
+ → + | - | E
- → + | - | E


, ,
:
+: a|c|g|t
-: a|c|g|t
, .
,:
B → a+ | c+ | g+ | t+ | a- | c- | g- | t- | E
+ → a+ | c+ | g+ | t+ | a- | c- | g- | t- | E
- → a+ | c+ | g+ | t+ | a- | c- | g- | t- | E
, ,
. , B → a+
P(B → a+) = P(+|B) P(a|+), + → a-, P(+ → a-) =
P(-|+) P(a|-),...

10.3: 8.

( )
,:
B a aa+
aac+
aact+
aactg aactgc aactgcaE
(Finite State Automata)
Mealy Moore,
, ( Moore
Mealy, Mealy
). ,
,
.


10.3. RNA

,
().,
().
-
,
Chomsky. ,
(context-free grammars) , ,

(..xxxxyyyy).(,
) ,

.
,.

S → xSy
( ) x y. ,

(),,
(),
, .
, ( 10.4). ,

.,

: , .
,BletchleyPark,
AlanTuring()ENIGMA,

. , ,
, , Peter Hilton:
Doc, note. I dissent. A fast never prevents a fatness. I diet on cod. (
,...!).
, aabaabaa

S → aSa | bSb | aa | bb
: S ⇒ aSa ⇒ aaSaa ⇒ aabSbaa ⇒ aabaabaa. , a
a(b).
- (
,).
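
The palindrome grammar S → aSa | bSb | aa | bb can be turned into a small recursive recognizer. This is only an illustrative sketch (the function name `derives_from_S` is hypothetical): each recursive call undoes one application of S → aSa or S → bSb, which is exactly the nested dependency that a regular grammar cannot express. The last lines check the Peter Hilton sentence after removing spaces and punctuation.

```perl
use strict;
use warnings;

# Recursive recognizer for the grammar  S -> aSa | bSb | aa | bb
sub derives_from_S {
    my ($string) = @_;
    return 1 if $string eq 'aa' or $string eq 'bb';      # S -> aa | bb
    # S -> aSa | bSb: identical outer symbols, inner part derived from S again
    if ($string =~ /^([ab])(.+)\1$/) {
        return derives_from_S($2);
    }
    return 0;
}

foreach my $s ('aabaabaa', 'abba', 'aabab') {
    printf "%-10s %s\n", $s, derives_from_S($s) ? 'accepted' : 'rejected';
}

# A natural-language palindrome: keep letters only, then compare with the reverse.
my $sentence = 'Doc, note. I dissent. A fast never prevents a fatness. I diet on cod.';
(my $letters = lc $sentence) =~ s/[^a-z]//g;
print "palindrome\n" if $letters eq scalar reverse $letters;
```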



10.4: , .

,,
RNA. , DNA,
( ),
(A-U, G-C). ,
.,
.
, ,
RNA
.10.5
(loop)
.,
:
acgugccacgauucaacguggcacag
..((((((((......))))))))..

10.5: RNA ( https://en.wikipedia.org/wiki/Stem-loop).

,,
325,3
24...,
(.).
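
The dot-bracket notation above can be decoded with a stack: every opening parenthesis is pushed, and every closing parenthesis is paired with the most recently pushed position. The sketch below is illustrative (the function name `dotbracket_pairs` is hypothetical); it lists the paired positions, 1-based, together with the two bases of each pair.

```perl
use strict;
use warnings;

my $seq    = 'acgugccacgauucaacguggcacag';
my $struct = '..((((((((......))))))))..';

# Return the base pairs [i, j] (1-based) encoded by a dot-bracket string.
sub dotbracket_pairs {
    my ($structure) = @_;
    my (@stack, @pairs);
    my @symbols = split //, $structure;
    for my $i (0 .. $#symbols) {
        if ($symbols[$i] eq '(') {
            push @stack, $i;                      # remember the open position
        }
        elsif ($symbols[$i] eq ')') {
            die "unbalanced structure\n" unless @stack;
            my $open = pop @stack;                # partner is the latest open one
            push @pairs, [ $open + 1, $i + 1 ];
        }
    }
    die "unbalanced structure\n" if @stack;
    return @pairs;
}

foreach my $pair (dotbracket_pairs($struct)) {
    my ($i, $j) = @$pair;
    printf "%2d-%2d  %s-%s\n", $i, $j, substr($seq, $i - 1, 1), substr($seq, $j - 1, 1);
}
```

The pairs are reported from the innermost (closest to the loop) to the outermost.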


10.6: () () tRNAPhe (PDB code 6TNA) E. coli,. (Ito et al., 2012)

,
,
:
S → gSc | cSg | aSu | uSa
, ,
,
(..
S → gS | Sg),(
S → S1S2)
(S → g).
S1 → S2g
S2 → aS3
S3 → S4a
S4 → aS5
S5 → gS6c
S6 → uS7a
S7 → gS8c
S8 → cS9g
S9 → cS10g
S10 → aS11u
S11 → cS12g
S12 → gS13c
S13 → aS14
S14 → uS15
S15 → uS16
S16 → cS17
S17 → aS18
S18 → a

S1 ⇒ S2g
⇒ aS3g
⇒ aS4ag
⇒ aaS5ag
⇒ aagS6cag
⇒ aaguS7acag
⇒ aagugS8cacag
⇒ aagugcS9gcacag
⇒ aagugccS10ggcacag
⇒ aagugccaS11uggcacag
⇒ aagugccacS12guggcacag
⇒ aagugccacgS13cguggcacag
⇒ aagugccacgaS14cguggcacag
⇒ aagugccacgauS15cguggcacag
⇒ aagugccacgauuS16cguggcacag
⇒ aagugccacgauucS17cguggcacag
⇒ aagugccacgauucaS18cguggcacag
⇒ aagugccacgauucaacguggcacag

10.7: RNA .

,
, RNA.
,
, , ,
(tRNA).
, , .
, ,
.
(stochastic context-free grammars). ,
,
:
( ).
RNA()
,G-U,C-A,
.
,,
:
(alignmentparsingproblem)
(scoring
problem)

(trainingproblem)


,
.,
Viterbi.
Cocke-Younger-Kasami (CYK algorithm) Younger 1967 (Younger,
1967). , Inside (outside algorithm)
Forward ( outside Backward). ,
, Inside-Outside
Baum-Welch(Forward-Backward)1979(Baker,1979).,
,
,,
(inside/outside).,
(10.1).
, Chomsky
(Chomsky Normal Form). Chomsky 1959 (Chomsky, 1959)
,:
W1 → W2W3
W1 → a
, , ,
.
Chomsky ,
(
). ,
(Lange & Leiß,
2009).,,
,
.

                      HMM           SCFG
Optimal alignment     Viterbi       CYK
P(x|θ)                Forward       Inside
EM algorithm          Baum-Welch    Inside-Outside
Memory complexity     O(LM)         O(L²M)
Time complexity       O(LM²)        O(L³M³)

10.1: SCFG.

RNA
1990Sakakibara(Sakakibaraetal.,1994).,
RNA Nussinov Zuker.
Nussinov 1978
(Nussinov,Pieczenik,
Griggs, & Kleitman, 1978). Zuker
, (G).
,
,G-U,C-A,(Zuker&
Stiegler,1981).,
SCFG,
,
.,Eddy(SeanR
Eddy, 2004). Zuker MFOLD
(http://unafold.rna.albany.edu/?q=mfold),
. RNAfold (http://rna.tbi.univie.ac.at/cgi-bin/RNAfold.cgi)
,,Zuker
RNA (Lorenz et al., 2011). PFOLD (http://www.daimi.au.dk/~compbio/pfold)
RNA
(Knudsen & Hein, 2003). Dowell Eddy (Dowell & Eddy, 2004)


,
RNA.
.
,
,PFOLD(
).,
CONUS, ,
(http://selab.janelia.org/software/conus/).
TORNADO
(http://selab.janelia.org/software/tornado/tornado.tar.gz)(Rivas,Lang,&Eddy,2012).
RNA,
EddyDurbin,CovarianceModels(
)(Eddy&Durbin,1994).,
, RNA.
SCFG,,profileHMM.,
.
,
.
. ,
,,
.
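
The base-pair maximization idea behind the Nussinov algorithm mentioned above can be sketched in a few lines of dynamic programming. This is an illustrative version under stated assumptions, not the implementation of any of the programs cited: only A-U, G-C and G-U pairs are allowed, a hairpin loop must contain at least `$min_loop` unpaired bases (3 by default), and only the number of pairs is returned, not the structure. The function name `nussinov_max_pairs` is hypothetical.

```perl
use strict;
use warnings;

# Allowed pairs in this sketch: Watson-Crick plus the G-U wobble pair.
my %can_pair = map { $_ => 1 } qw(au ua gc cg gu ug);

# Maximum number of nested base pairs of a sequence (Nussinov-style recursion).
sub nussinov_max_pairs {
    my ($seq, $min_loop) = @_;
    $min_loop = 3 unless defined $min_loop;
    my @s = split //, lc $seq;
    my $n = scalar @s;
    my @N;                                    # $N[$i][$j]: best score for s[i..j]
    for my $i (0 .. $n - 1) { for my $j (0 .. $n - 1) { $N[$i][$j] = 0 } }
    for my $span (1 .. $n - 1) {
        for my $i (0 .. $n - 1 - $span) {
            my $j    = $i + $span;
            my $best = $N[$i + 1][$j];                          # i unpaired
            $best = $N[$i][$j - 1] if $N[$i][$j - 1] > $best;   # j unpaired
            if ($span > $min_loop and $can_pair{ $s[$i] . $s[$j] }) {
                my $inner = $N[$i + 1][$j - 1] + 1;             # i pairs with j
                $best = $inner if $inner > $best;
            }
            for my $k ($i + 1 .. $j - 1) {                      # bifurcation
                my $split = $N[$i][$k] + $N[$k + 1][$j];
                $best = $split if $split > $best;
            }
            $N[$i][$j] = $best;
        }
    }
    return $n ? $N[0][$n - 1] : 0;
}

print nussinov_max_pairs('acgugccacgauucaacguggcacag'), "\n";
```

The three nested loops make the cubic time and quadratic memory requirements of this family of algorithms visible directly in the code.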

10.8: ()
.

(, )
37,10.9.,
350%G50%A,750%50%C.
PROSITE ( A-C-[AG]-A-x-T-[CT]-A),
G(3)C(7),G(3)T(7).
,G(3)C(7),
(3)T(7).,covariancemodel,

ACGATTCA,
ACGATTA. ()
PROSITE.
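
One common way to quantify this kind of covariation between two alignment columns is their mutual information. The sketch below uses a small hypothetical alignment (the four sequences are invented for illustration and are not taken from the figure) in which positions 3 and 7 vary together. Mutual information is used here only as an illustration and is not claimed to be the score computed by covariance models.

```perl
use strict;
use warnings;

# Hypothetical toy alignment: G at position 3 always goes with C at position 7,
# and A at position 3 always goes with T at position 7.
my @alignment = qw(ACGATTCA ACAATTTA ACGACTCA ACAAGTTA);

sub log2 { return log($_[0]) / log(2) }

# Mutual information (in bits) between two 1-based alignment columns.
sub mutual_information {
    my ($col_i, $col_j, @seqs) = @_;
    my (%fi, %fj, %fij);
    foreach my $s (@seqs) {
        my $x = substr($s, $col_i - 1, 1);
        my $y = substr($s, $col_j - 1, 1);
        $fi{$x}++;
        $fj{$y}++;
        $fij{"$x$y"}++;
    }
    my $n  = scalar @seqs;
    my $mi = 0;
    foreach my $xy (keys %fij) {
        my ($x, $y) = split //, $xy;
        my $pxy = $fij{$xy} / $n;
        $mi += $pxy * log2($pxy / (($fi{$x} / $n) * ($fj{$y} / $n)));
    }
    return $mi;
}

printf "MI(3,7) = %.2f bits\n", mutual_information(3, 7, @alignment);
printf "MI(3,6) = %.2f bits\n", mutual_information(3, 6, @alignment);
```

Columns 3 and 7 each look like a 50/50 mixture when inspected separately, which is all a pattern or a profile can record; the pairwise statistic is what exposes their dependence.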
, INFERNAL,
http://infernal.wustl.edu/ Sean Eddy (Nawrocki, Kolbe, & Eddy,
2009), HMMER .


,INFERNAL,RNA
,.
, . ,
, RNA.
PFAM,,INFERNAL
RFAM,RNA,http://rfam.xfam.org/(Gardneretal.,2011).
To EvoFold, http://users.soe.ucsc.edu/~jsp/EvoFold/,
,
(phylo-SCFG).
RNA microRNA (Pedersen et al., 2006). To RNAz,
http://www.tbi.univie.ac.at/~wash/RNAz/
(Washietl, Hofacker, & Stadler, 2005).
, CONTRAfold, http://contra.stanford.edu/contrafold,
SCFG
, conditional log-linear model (CLLM).
CONTRAfold
(Do,Woods,&Batzoglou,2006).
,RNA
.
RNA (pseudoknots). ( 10.10)

().
10.10,
,
( ). ,
AAUUCCGG (nested) ,
AACCUUGG (crossing) ,
.
, ,
.
, . ,
(
Nussinov),.
, NP-complete (Lyngsø &
Pedersen,2000).,
,
. , PKNOTS
http://selab.janelia.org/software/pknots/pknots.tar.gz (Rivas & Eddy, 1999). To CYLOFOLD
http://cylofold.abcc.ncifcrf.gov/ (Bindewald, Kluth, & Shapiro,
2010),KineFOLDhttp://kinefold.curie.fr/cgi-bin/form.pl(Isambert,2009),
IPknot https://github.com/satoken/ipknot, (Sato, Kato, Hamada, Akutsu, & Asai, 2011). ,
SimulFold, http://www.cs.ubc.ca/~irmtraud/simulfold/,
(Meyer & Miklós, 2007).
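
The difference between nested and crossing dependencies can be tested directly on a list of base pairs: two pairs (i,j) and (k,l) cross when i < k < j < l. The sketch below is illustrative; the function name `has_crossing_pairs` is hypothetical, and the pairings assigned to the two example strings (A with U, C with G) are an assumption about how those strings are meant to be read.

```perl
use strict;
use warnings;

# Return 1 if any two base pairs cross (i < k < j < l), 0 if all are nested
# or side by side. Each pair is an array reference [i, j] with i < j.
sub has_crossing_pairs {
    my @pairs = sort { $a->[0] <=> $b->[0] } @_;
    for my $p (0 .. $#pairs) {
        for my $q ($p + 1 .. $#pairs) {
            my ($i, $j) = @{ $pairs[$p] };
            my ($k, $l) = @{ $pairs[$q] };
            return 1 if $i < $k and $k < $j and $j < $l;
        }
    }
    return 0;
}

# Assumed pairings for the two example strings (1-based positions):
my @nested   = ([1, 4], [2, 3], [5, 8], [6, 7]);   # AAUUCCGG
my @crossing = ([1, 6], [2, 5], [3, 8], [4, 7]);   # AACCUUGG
print has_crossing_pairs(@nested)   ? "crossing\n" : "nested\n";
print has_crossing_pairs(@crossing) ? "crossing\n" : "nested\n";
```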



10.9: (https://en.wikipedia.org/wiki/Pseudoknot).


.
,xxxyyyzzz,x,yz.
,
(context-sensitive grammar)
(copylanguage).,
, .
' ,
.

10.4.

RNA,
(
).,,

.
,:
(204)
( ,
,,VanderWaals)

,

RNA,-(10.11). , -
,Greek-keymotif
-,-,-,--.
,-C=O
.,
,
. ,
(-),
(C=O).



10.10:
.

, ,
.,

,
,.

1994 Mamitsuka Abe (Mamitsuka & Abe, 1994). ,


- , ,
, Stochastic Ranked Node
Rewriting Grammars (SRNRG). (
, inside-outside,
), -
HSSP25%.
,-
.
,
. 2006, Searls

, , (Chiang, Joshi, & Searls, 2006). 2005
Waldispuhl
- (Waldispühl & Steyaert, 2005). multi-tape S-attributed grammar
classHMM,-
,.
(TMMTSAG),
-, ,
().
.


10.11: - ( ). .
. . helical wheel plot,
. ,
. ,
, .

-.
,

transFold
(http://bioinformatics.bc.edu/clotelab/transFold) ,
, -
- (Waldispuhl,Berger,Clote, &
Steyaert,2006).,
( ...).
, ,
Partifold http://partiFold.csail.mit.edu/
(Waldispuhl,O'Donnell,Devadas,Clote,&Berger,2008).
, Dyrka Nebel,

,.,
,

(Dyrka&Nebel,2009).,,
(Dyrka,Nebel,&Kotulska,2013),
(Dyrka &Nebel, 2007).
,
, ,
,
.


Baker, J. K. (1979). Trainable grammars for speech recognition. The Journal of the Acoustical Society of America, 65(S1), S132-S132.
Bindewald, E., Kluth, T., & Shapiro, B. A. (2010). CyloFold: secondary structure prediction including pseudoknots. Nucleic Acids Research, 38(suppl 2), W368-W372.
Chiang, D., Joshi, A. K., & Searls, D. B. (2006). Grammatical representations of macromolecular structure. Journal of computational biology, 13(5), 1077-1100.
Chomsky, N. (1956). Three models for the description of language. IRE Transactions on Information Theory, 2(3), 113-124.
Chomsky, N. (1959). On Certain Formal Properties of Grammars. Information and Control, 2(2), 137-167.
Do, C. B., Woods, D. A., & Batzoglou, S. (2006). CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics, 22(14), e90-e98.
Dowell, R. D., & Eddy, S. R. (2004). Evaluation of several lightweight stochastic context-free grammars for RNA secondary structure prediction. BMC Bioinformatics, 5(1), 71.
Durbin, R., Eddy, S. R., Krogh, A., & Mitchison, G. (1998). Biological sequence analysis, probabilistic models of proteins and nucleic acids: Cambridge University Press.
Dyrka, W., & Nebel, J.-C. (2007). A probabilistic context-free grammar for the detection of binding sites from a protein sequence. BMC Systems Biology, 1(Suppl 1), P78.
Dyrka, W., & Nebel, J.-C. (2009). A stochastic context free grammar based framework for analysis of protein sequences. BMC Bioinformatics, 10(1), 323.
Dyrka, W., Nebel, J.-C., & Kotulska, M. (2013). Probabilistic grammatical model for helix-helix contact site classification. Algorithms for Molecular Biology, 8(1), 31.
Eddy, S. R. (2004). How do RNA folding algorithms work? Nature biotechnology, 22(11), 1457-1458.
Eddy, S. R., & Durbin, R. (1994). RNA sequence analysis using covariance models. Nucleic Acids Res, 22(11), 2079-2088.
Gardner, P. P., Daub, J., Tate, J., Moore, B. L., Osuch, I. H., Griffiths-Jones, S., ... Bateman, A. (2011). Rfam: Wikipedia, clans and the "decimal" release. Nucleic Acids Res, 39(Database issue), D141-145.
Isambert, H. (2009). The jerky and knotty dynamics of RNA. Methods, 49(2), 189-196.
Ito, K., Murakami, R., Mochizuki, M., Qi, H., Shimizu, Y., Miura, K.-i., ... Uchiumi, T. (2012). Structural basis for the substrate recognition and catalysis of peptidyl-tRNA hydrolase. Nucleic Acids Research, gks790.
Knudsen, B., & Hein, J. (2003). Pfold: RNA secondary structure prediction using stochastic context-free grammars. Nucleic Acids Research, 31(13), 3423-3428.
Lange, M., & Leiß, H. (2009). To CNF or not to CNF? An efficient yet presentable version of the CYK algorithm. Informatica Didactica, 8, 2008-2010.
Lorenz, R., Bernhart, S. H., Zu Siederdissen, C. H., Tafer, H., Flamm, C., Stadler, P. F., & Hofacker, I. L. (2011). ViennaRNA Package 2.0. Algorithms for Molecular Biology, 6(1), 26.
Lyngsø, R. B., & Pedersen, C. N. (2000). RNA pseudoknot prediction in energy-based models. Journal of computational biology, 7(3-4), 409-427.
Mamitsuka, H., & Abe, N. (1994). Predicting location and structure of beta-sheet regions using stochastic tree grammars. Proc Int Conf Intell Syst Mol Biol, 2, 276-284.
Meyer, I. M., & Miklós, I. (2007). SimulFold: simultaneously inferring RNA structures including pseudoknots, alignments, and trees using a Bayesian MCMC framework.


Nawrocki, E. P., Kolbe, D. L., & Eddy, S. R. (2009). Infernal 1.0: inference of RNA alignments. Bioinformatics, 25(10), 1335-1337.
Nussinov, R., Pieczenik, G., Griggs, J. R., & Kleitman, D. J. (1978). Algorithms for Loop Matchings. SIAM Journal on Applied Mathematics, 35(1), 68-82.
Pedersen, J. S., Bejerano, G., Siepel, A., Rosenbloom, K., Lindblad-Toh, K., Lander, E. S., ... Haussler, D. (2006). Identification and classification of conserved RNA secondary structures in the human genome. PLoS Comput Biol, 2(4), e33.
Rivas, E., & Eddy, S. R. (1999). A dynamic programming algorithm for RNA structure prediction including pseudoknots. Journal of molecular biology, 285(5), 2053-2068.
Rivas, E., Lang, R., & Eddy, S. R. (2012). A range of complex probabilistic models for RNA secondary structure prediction that includes the nearest-neighbor model and more. RNA, 18(2), 193-212.
Sakakibara, Y., Brown, M., Hughey, R., Mian, I. S., Sjolander, K., Underwood, R. C., & Haussler, D. (1994). Stochastic context-free grammars for tRNA modeling. Nucleic Acids Res, 22(23), 5112-5120.
Sato, K., Kato, Y., Hamada, M., Akutsu, T., & Asai, K. (2011). IPknot: fast and accurate prediction of RNA secondary structures with pseudoknots using integer programming. Bioinformatics, 27(13), i85-i93.
Searls, D. B. (2002). The language of genes. Nature, 420(6912), 211-217.
Waldispuhl, J., Berger, B., Clote, P., & Steyaert, J. M. (2006). transFold: a web server for predicting the structure and residue contacts of transmembrane beta-barrels. Nucleic Acids Res, 34(Web Server issue), W189-193.
Waldispuhl, J., O'Donnell, C. W., Devadas, S., Clote, P., & Berger, B. (2008). Modeling ensembles of transmembrane beta-barrel proteins. Proteins, 71(3), 1097-1112.
Washietl, S., Hofacker, I. L., & Stadler, P. F. (2005). Fast and reliable prediction of noncoding RNAs. Proceedings of the National Academy of Sciences of the United States of America, 102(7), 2454-2459.
Younger, D. H. (1967). Recognition and parsing of context-free languages in time n³. Information and Control, 10(2), 189-208.
Zuker, M., & Stiegler, P. (1981). Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Res, 9(1), 133-148.



11:


.
, .

,
( )
.

3, 4, 6 7.

11.
,
,,
, ,
.
,
.,
,
(,RNA).
,
.,,
,
(Zerbino&Birney,2008),(Picardi
&Pesole,2010),(Harbisonetal.,2004),RNA(Rigoutsos,
2010; Vlachos & Hatzigeorgiou, 2013) (Soucy,
Huang, & Gogarten, 2015) . ,
,
(.. ), ,
.
.
,,
:,
,
,()
.
, ,
.

11.1.

,,
.,

(..
, ...).
RNA
. ,


(..GC
...).,
(, )
, () .
,,

. ,
.
,,
, -
20-30% , -
1-2% .
GPCRs ( ,
2% ) 15%
. ,
, (
),
. ,

(Arai, Ikeda, & Shimizu, 2003). ,


(,
),38,17487
, 377
8,10 12 (Shimizu, Mitsuke,Noto,& Arai,
2004).
,

2.PFAMTCDB,
CAZy...

11.1: .

,
.GC


(Li, 2011).
GC ( GC% content). ,
,
GC
.
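
The GC content discussed above is simple to compute. The following minimal sketch (the function name `gc_content` is hypothetical) counts G and C against all unambiguous bases; ambiguous symbols such as N are simply ignored, which is an assumption of this example rather than a convention stated in the text.

```perl
use strict;
use warnings;

# GC content (percent) of a DNA sequence; symbols other than A, C, G, T are ignored.
sub gc_content {
    my ($dna) = @_;
    my $gc = ($dna =~ tr/GCgc//);     # tr without replacement only counts
    my $at = ($dna =~ tr/ATat//);
    return unless $gc + $at;          # no usable bases: return undef
    return 100 * $gc / ($gc + $at);
}

printf "%.1f%%\n", gc_content('ATGCGCGTTA');
```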
, codon bias ( )
,
(..),.
- .
,
codonbias
. ,
, ,
(Quax,Claassens,Soll,&vanderOost,2015).
,,Ouzounis
Kreil,6
(), 2 , 17 2
. GC
(principalcomponentsanalysis).
GC,
(Kreil & Ouzounis, 2001). ,
(Gln)(Glu)
. , (Val) (Thr)
.(His),(Ser)(Asn)
.
, GC
. ,
,
(conceptual translation). ,
,
.
,:64(613).
(2),
21 . ,
,.
(, TGA, TAG). ,
, GC, .
GC
.
,ORF
(34 ). , (Redundancy Reduction)
(nonhypothetical) SwissProt (E-value<10-6).
, (over-annotation),
(Skovgaard,Jensen,Brunak,Ussery,&Krogh,2001).



11.2:
SwissProt ( ), . ( ).

.
Swissprot,
. , ,
,
(11.2).
(mixture of distributions)
. ,
, . ,
(..100)(..500),
,200
300,.',
.
,GC(,).



11.3: - (
) GC .

( 11.3), ,
GC .
,
, ( outlier ),
A. pernix,
,,
. ,
100% ( ,
).:
, ,
, . ,
, ,
,
.

GC, . -
, ,
. GC . ,
GC.,
11.4, GC ,
GC , .
,,
.,
--
GC.,GC,



(,).

Amino acid            Codons             min(A+T)   min(G+C)
Alanine (Ala)         G-C-X              0          2
Glycine (Gly)         G-G-X              0          2
Valine (Val)          G-T-X              1          1
Leucine (Leu)         C-T-X, T-T-[AG]    1          1
Isoleucine (Ile)      A-T-[ACT]          2          0
Phenylalanine (Phe)   T-T-[CT]           2          0
11.4: .

,
(
),(VL)
(FI).,
. , .
,(VL/FI)GC/AT.,
GC,
.
,

GC.,,
()
.

11.5: , .
, GC/AT.
, .

11.2.

,.
,..GC
,,

.,
,,,
.
,
/ ,


,
,().
(Tsoka&Ouzounis,2000)

11.6: . (),
(), ()
().

,
.
,
,
,
,
,
(),
(domains).

(
).,
.
,,.

11.2.1

, , ,
. , ,


(,).
,

().
()
(..
).
,Venn

,
.

11.7: Venn. : Xanthomonas oryzae EDGAR


(Blom et al., 2009) : Streptococcus R (Papadimitriou et al., 2014).

,
(LastUniversalCommonAncestor-LUCA).
,Mycoplasma
genitalium 468
.,..Haemophilus influenzae
(1703 ) 240 M. genitalium H.
influenzae. LUCA ( ..
Mycoplasma),,()

. , ( ..
),.
,,
1006 1189,
, 1344 1529, -
( Mycoplasma) (Ouzounis, Kunin, Darzentas, &
Goldovsky,2006).
,,

, . ,
,


1Mbp
. , DNA (Mycoplasma
mycoides JCVI-syn1.0),
(Gibsonetal.,2010).

11.2.2

(
).
.
,..DNA
,
.,
2
12 13 ,
.

11.8: .

,.
(dot-plot)
( BLAST)
. ,
(),
..,
. ,
, ,
.

11.9: . (-),
(-).

. ,



.,
.,2
(
DNA).

11.10: .
, .

11.2.3


.
, .
-,(..
).,
,,
.

.,,
(lacZ)(lacY),
,
.
, ,
, 12

.
,

.



11.11: .

11.2.4
,,
(domains). ,

,.
, ( )
,,

(Enright, Iliopoulos, Kyrpides, & Ouzounis, 1999). ,
,
,
,.

,
. , (DHFR)
,
(TS)
( )
.



11.12: .

,
,,

(, ).
,
,,
,
.,

(
,).
(
BLAST SmithWaterman),(E-value
...). ,
,
. ,
,
.
(interaction
),
( ). ,
, ,
. ,

. ,
,
. ,
(hot-spots),



11.13: .

11.3.

,

,
(Edwards&Holt,2013).
,
. ,
.
ACT (https://www.sanger.ac.uk/resources/software/act/) Java
.
BLAST.BLAST
ACT . ,
.

, . .
ACT
(zoom in zoom out)
, ,
(Carveretal.,
2005).
MAUVE(http://darlinglab.org/mauve/mauve.html)Java
.
.MAUVE
,
contigs.
.
. ,


.,
. , MAUVE (
)(SNPs)
,(Darling,Mau,
&Perna,2010).
To EDGAR (http://edgar.cebitec.uni-bielefeld.de)
. EDGAR
.
NCBI,
.,
(synteny plots)
Venn(Blom,etal.,2009).
CGAT (http://mbgd.genome.ad.jp/CGAT/)
. CGAT
client-server, client AlignmentViewer ( Java)
DataServer(Perl).
.
,
.
, CGAT
(Uchiyama,Higuchi,&Kobayashi,2006).
BRIG (BLAST Ring Image Generator, http://sourceforge.net/projects/brig/)
Java,
.,
(),,
. BRIG
,
.
.,
,
.
(Alikhan,Petty,BenZakour,&Beatson,2011).
VISTA(http://genome.lbl.gov/vista/index.shtml)
2000. ,
.
,
,-
().(VISTABrowser)
(VISTA
servers, rVista, mVISTA, phyloVISTA, gVISTA ...)
, ,
,...(Frazer,Pachter,
Poliakov, Rubin, & Dubchak, 2004). VISTA
(standalone),GenomeVISTA,

,(Poliakov,
Foong,Brudno,&Dubchak,2014).



11.14: 8 Yersinia MAUVE (Darling, Miklos, & Ragan, 2008).

,
.,
.
MUMMER (http://mummer.sourceforge.net/), MEGA-BLAST
(http://www.ncbi.nlm.nih.gov/BLAST/), LAGAN (http://bioperl.org/wiki/LAGAN) MGA
(http://bibiserv.techfak.uni-bielefeld.de/mga/).,
(),
,
,
(Carrara et al., 2013). , Ouzounis ,
GeneRAGE (Enright & Ouzounis, 2000), FusionMap
(http://www.omicsoft.com/fusionmap)
(Ge
et
al.,
2011)

MosaicFinder
(http://sourceforge.net/projects/mosaicfinder)(Jachiet,Pogorelcnik,Berry,Lopez,&Bapteste,2013).



11.15: E.coli O157:H7 str. Sakai
27 BRIG.



11.16: (a) KIF3A
(chr5:131949456132139102) VISTA
. (b) VISTA
KIF3A (c) KIF3A,
(Frazer, et al., 2004).



11.17: ACT, E. coliO104:H4 , E.
coli Ec55989 , E. coliEDL933 (Edwards & Holt, 2013).


Alikhan, N. F., Petty, N. K., Ben Zakour, N. L., & Beatson, S. A. (2011). BLAST Ring Image Generator (BRIG): simple prokaryote genome comparisons. BMC Genomics, 12, 402. doi:10.1186/1471-2164-12-402
Arai, M., Ikeda, M., & Shimizu, T. (2003). Comprehensive analysis of transmembrane topologies in prokaryotic genomes. Gene, 304, 77-86.
Blom, J., Albaum, S. P., Doppmeier, D., Pühler, A., Vorhölter, F.-J., Zakrzewski, M., & Goesmann, A. (2009). EDGAR: a software framework for the comparative analysis of prokaryotic genomes. BMC Bioinformatics, 10(1), 154.
Carrara, M., Beccuti, M., Lazzarato, F., Cavallo, F., Cordero, F., Donatelli, S., & Calogero, R. A. (2013). State-of-the-art fusion-finder algorithms sensitivity and specificity. Biomed Res Int, 2013, 340620. doi:10.1155/2013/340620
Carver, T. J., Rutherford, K. M., Berriman, M., Rajandream, M. A., Barrell, B. G., & Parkhill, J. (2005). ACT: the Artemis Comparison Tool. Bioinformatics, 21(16), 3422-3423. doi:10.1093/bioinformatics/bti553
Darling, A. E., Mau, B., & Perna, N. T. (2010). progressiveMauve: multiple genome alignment with gene gain, loss and rearrangement. PLoS One, 5(6), e11147. doi:10.1371/journal.pone.0011147
Darling, A. E., Miklos, I., & Ragan, M. A. (2008). Dynamics of genome rearrangement in bacterial populations. PLoS Genet, 4(7), e1000128. doi:10.1371/journal.pgen.1000128
Edwards, D. J., & Holt, K. E. (2013). Beginner's guide to comparative bacterial genome analysis using next-generation sequence data. Microb Inform Exp, 3(1), 2. doi:10.1186/2042-5783-3-2
Enright, A. J., Iliopoulos, I., Kyrpides, N. C., & Ouzounis, C. A. (1999). Protein interaction maps for complete genomes based on gene fusion events. Nature, 402(6757), 86-90. doi:10.1038/47056
Enright, A. J., & Ouzounis, C. A. (2000). GeneRAGE: a robust algorithm for sequence clustering and domain detection. Bioinformatics, 16(5), 451-457.
Frazer, K. A., Pachter, L., Poliakov, A., Rubin, E. M., & Dubchak, I. (2004). VISTA: computational tools for comparative genomics. Nucleic Acids Res, 32(Web Server issue), W273-279. doi:10.1093/nar/gkh458
Ge, H., Liu, K., Juan, T., Fang, F., Newman, M., & Hoeck, W. (2011). FusionMap: detecting fusion genes from next-generation sequencing data at base-pair resolution. Bioinformatics, 27(14), 1922-1928. doi:10.1093/bioinformatics/btr310
Gibson, D. G., Glass, J. I., Lartigue, C., Noskov, V. N., Chuang, R. Y., Algire, M. A., ... Venter, J. C. (2010). Creation of a bacterial cell controlled by a chemically synthesized genome. Science, 329(5987), 52-56. doi:10.1126/science.1190719
Harbison, C. T., Gordon, D. B., Lee, T. I., Rinaldi, N. J., Macisaac, K. D., Danford, T. W., ... Young, R. A. (2004). Transcriptional regulatory code of a eukaryotic genome. Nature, 431(7004), 99-104. doi:10.1038/nature02800
Jachiet, P. A., Pogorelcnik, R., Berry, A., Lopez, P., & Bapteste, E. (2013). MosaicFinder: identification of fused gene families in sequence similarity networks. Bioinformatics, 29(7), 837-844. doi:10.1093/bioinformatics/btt049
Kreil, D. P., & Ouzounis, C. A. (2001). Identification of thermophilic species by the amino acid compositions deduced from their genomes. Nucleic Acids Res, 29(7), 1608-1615.
Li, W. (2011). On parameters of the human genome. J Theor Biol, 288, 92-104. doi:10.1016/j.jtbi.2011.07.021


Ouzounis, C. A., Kunin, V., Darzentas, N., & Goldovsky, L. (2006). A minimal estimate for the gene content of the last universal common ancestor--exobiology from a terrestrial perspective. Res Microbiol, 157(1), 57-68. doi:10.1016/j.resmic.2005.06.015
Papadimitriou, K., Anastasiou, R., Mavrogonatou, E., Blom, J., Papandreou, N. C., Hamodrakas, S. J., ... Pot, B. (2014). Comparative genomics of the dairy isolate Streptococcus macedonicus ACA-DC 198 against related members of the Streptococcus bovis/Streptococcus equinus complex. BMC Genomics, 15(1), 272.
Picardi, E., & Pesole, G. (2010). Computational methods for ab initio and comparative gene finding. Methods Mol Biol, 609, 269-284. doi:10.1007/978-1-60327-241-4_16
Poliakov, A., Foong, J., Brudno, M., & Dubchak, I. (2014). GenomeVISTA--an integrated software package for whole-genome alignment and visualization. Bioinformatics, 30(18), 2654-2655. doi:10.1093/bioinformatics/btu355
Quax, T. E., Claassens, N. J., Soll, D., & van der Oost, J. (2015). Codon Bias as a Means to Fine-Tune Gene Expression. Mol Cell, 59(2), 149-161. doi:10.1016/j.molcel.2015.05.035
Rigoutsos, I. (2010). Short RNAs: how big is this iceberg? Curr Biol, 20(3), R110-113. doi:10.1016/j.cub.2009.12.036
Shimizu, T., Mitsuke, H., Noto, K., & Arai, M. (2004). Internal gene duplication in the evolution of prokaryotic transmembrane proteins. J Mol Biol, 339(1), 1-15. doi:10.1016/j.jmb.2004.03.048
Skovgaard, M., Jensen, L. J., Brunak, S., Ussery, D., & Krogh, A. (2001). On the total number of genes and their length distribution in complete microbial genomes. Trends Genet, 17(8), 425-428. doi:S0168-9525(01)02372-1 [pii]
Soucy, S. M., Huang, J., & Gogarten, J. P. (2015). Horizontal gene transfer: building the web of life. Nat Rev Genet, 16(8), 472-482. doi:10.1038/nrg3962
Tsoka, S., & Ouzounis, C. A. (2000). Recent developments and future directions in computational genomics. FEBS Lett, 480(1), 42-48.
Uchiyama, I., Higuchi, T., & Kobayashi, I. (2006). CGAT: a comparative genome analysis tool for visualizing alignments in the analysis of complex evolutionary changes between closely related genomes. BMC Bioinformatics, 7, 472. doi:10.1186/1471-2105-7-472
Vlachos, I. S., & Hatzigeorgiou, A. G. (2013). Online resources for miRNA analysis. Clin Biochem, 46(10-11), 879-900. doi:10.1016/j.clinbiochem.2013.03.006
Zerbino, D. R., & Birney, E. (2008). Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res, 18(5), 821-829. doi:10.1101/gr.074492.107


12: Perl

H Perl
. Perl scripting
UNIX , ,
..., ,
, .

( ) .


/.

12.

PerlUNIXLinux.
Practical Extraction and Report Language
(). ,
, Perl Pathologically
Eclectic Rubbish Lister .
, Larry Wall. H Perl


. Perl scripting. Scripts ,
.
(interpret)(compile),
.
UNIX scripting,
,...
,
.Perl,
.,Perlawk,sed,
bash shell, C. ,
,Perl.
Perl,
,.,
,
(
C, C++ Java). , "" , Perl
. ,
TIMTOWTDI,There Is More Than One Way To Do
It ( Tim Toady). ,
.
()
, .
, ,
,
.Perl
( ),
( oneliners),


....,
Python,Perl,
There should be one-- and preferably only one --obvious way to do it ( ,
,!).
Unix/LinuxMacOSXPerl,
Windows.
, Perl
http://www.perl.org/get.html ,
.
Perl,,Windows
(shell)
UNIX.
Cygwin (www.cygwin.com), native ()
bashWindows,UnixUtils(http://unxutils.sourceforge.net/)MinGW
(http://www.mingw.org/). (compiler),
Perl,.

12.1. Perl

Perl
.,
,(blocks)
. ,
.pl compiler
Windows (
).,
,Perl:

perl file_name.pl

,
().
.

,
.
Perl#.Perl:

#!/usr/bin/perl
#
print "Hello world \n";

print \n
. print, Perl
, (;).
,Perl.
, Linux/UNIX ,
shellscript,:

./program.pl

./,(directory)
.,,


(Path),
.
,Perl
(C).,
,
. ,
C ( ,
,
Perl). , Perl
(interpreter),-,,
.
Perl.Python
, Java,
bytecode.

12.2.

Perl,(scalars)($)
. ,
($1, $2, $_, $/ ...).
, ,
,,
.
,
,
.
:

$var = ___;

$var
(,).,
(scalar) .
().
(variable interpolation)
,
(.. \n). (\)
(\n,\s
, \t ...). ,
,
.,$Perl
,\$().
()
.(``), Perl
shell,
,.
:

$var=3;
....
$var="hello";
.


$var . $var
.,Perl
.
Perl.

$number=5;
$name="George";
$exp=3*$number+($number+1);
$a+=5;
$b*=3;
++$a; $a++;
$a--;

12.2.1.

(Operators),
.
, 12.1
(string),12.2.


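Operator   Operation
+          addition
-          subtraction
*          multiplication
/          division
%          modulus (remainder of the division)
**         exponentiation
12.1: Perl

Numeric    String     Meaning
+          .          addition / concatenation
*          x          multiplication / repetition
==         eq         equal to
!=         ne         not equal to
>          gt         greater than
<          lt         less than
>=         ge         greater than or equal to
<=         le         less than or equal to
12.2: Perl.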
.

12.2.2. Perl


12.3:


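chomp    Removes the trailing newline (\n) from a string.
chop     Removes the last character of a string, whatever it is.
substr   $x=substr($name,0,1,"L");
         Replaces in $name 1 character, starting at position 0, with "L".
         The replaced character is stored in $x and $name itself is modified.
         $x=substr($name,0,1);
         Copies 1 character, starting at position 0 of $name, into $x.
index    index($name,"k");
         Returns the position of the first occurrence of "k" in $name.
rindex   Like index, but returns the position of the last occurrence.
12.3: Perl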

,:

$name="Takis";
$x=substr($name,0,1);

$name ,
$name().

12.2.3. (<STDIN>)

<STDIN>,Perl
(, )
<STDIN>.:
$a=<STDIN>; # read one line from standard input into $a
print $a;

:
chomp ($a=<STDIN>);

12.3.

(list),.
.,:

(1, 2, 3)
("perl", 3, 15)
($x, 3, $x+2, $y-$x)
(1..10)
($a..$b)

.
,
(swap):

($a,$b)=($b,$a);

($a, $b, $c)=(1,2,3);


.
(Array).
.,
.
:

@array = (

);

, @ .
, @name $name.
, , (@_, @ARGV ...).

,:

@name=("John", "George", "Mike");


@name=(1..10);
@table=1;
@table=(1, 2, @name, 7);
@table=(1,2,3);

,,0,

.
:
$table [0] = 1;
$table [1] = 2;
$table [2] = 3;

,(
$ ) .
.
, (. )
.
:

$x=$table[0];
$table[1]++;
($table[0], $table[1])= ($table[1], $table[0]);
@table[0,1,2]=@table[1,1,1];

,
(
):

@table=(1, 2, @name, 7)

12.4.
,push:

push @table,$scalar;

,:

@table=(@table, $scalar);


,unshift:

unshift(@table, $scalar);

@table=($scalar, @table );

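push      Adds one or more elements to the end of an array.
pop       Removes and returns the last element of an array.
shift     Removes and returns the first element of an array.
unshift   Adds one or more elements to the beginning of an array.
reverse   Returns the elements of an array in reverse order.
sort      Returns the elements sorted, by default in ASCII (string) order.
splice    splice(@table,2,1); removes from @table 1 element, starting at index 2.

12.4: Perl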

,,$#.
@name,$#name(
) . ,
..@table=(1,2,3),$#table,2.$#table
,,..$table[$#table]
3. :
, ,
. $table[5]=12,
(index)34,12
5. $#table 5,
6.
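
The behaviour of `$#table` and the automatic growth of arrays described above can be verified with a short sketch (variable names are illustrative):

```perl
use strict;
use warnings;

my @table = (1, 2, 3);
print "last index: $#table\n";             # index of the last element
print "last element: $table[$#table]\n";   # the last element itself
$table[5] = 12;                            # indices 3 and 4 are created, undefined
print "last index: $#table\n";
print "elements: ", scalar(@table), "\n";  # number of elements = last index + 1
my @shown = map { defined $_ ? $_ : 'undef' } @table;
print "@shown\n";
```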

12.4.

(Hash),,
. , (keys),
,
(values).,,
, . ,
.,%.

,.
(,).
.
.
,
. ,
:

%day = ("Sun", "Sunday", "Mon", "Monday", "Tue", "Tuesday", "Wed",
"Wednesday", "Thu", "Thursday", "Fri", "Friday", "Sat", "Saturday");


,
()-.
:

%day = ( "Sun" => "Sunday", "Mon" => "Monday", "Tue" => "Tuesday", "Wed" =>
"Wednesday", "Thu" => "Thursday", "Fri" => "Friday", "Sat" => "Saturday" );

,:

%table=@table;

,
(1,3,5...),
(!).,
,:

@table=%table;

,(values)
,(keys).
,({})
$.:

$hash{"key"}="value";

,-,
., $hash{key}
,
.,
,,
()(),
().
,,(,
,.12.5)

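each      Returns the next (key, value) pair of a hash, so that all pairs can be visited in a loop.
delete    Removes a key, together with its value, from a hash.
reverse   Returns the hash with keys and values swapped (the values become keys).
12.5: Perl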
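
A short sketch of the most common hash operations follows. It uses part of the `%day` example from this section; `keys` and `exists` are standard Perl functions, and the variable names are illustrative.

```perl
use strict;
use warnings;

my %day = (Sun => 'Sunday', Mon => 'Monday', Tue => 'Tuesday');

$day{Wed} = 'Wednesday';                 # add a new key-value pair
print "Mon is $day{Mon}\n";

foreach my $key (sort keys %day) {       # keys come in no fixed order, so sort them
    print "$key => $day{$key}\n";
}

delete $day{Sun};                        # remove a key together with its value
print exists $day{Sun} ? "Sun present\n" : "Sun removed\n";

my %abbrev = reverse %day;               # swap keys and values
print "$abbrev{Monday}\n";
```

Swapping with `reverse` is only safe when the values are unique; otherwise some pairs are lost.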

12.5.

, Perl
..

12.5.1 If/else/elsif

if/else . if
else .


,
.if/elsif/else.if,
, , elsif
( )
, else
.:
if (condition1)
{
    ...
}
elsif (condition2)
{
    ...
}
else
{
    ...
}

,elsifelseif
else if. , Perl (block) ,
({}).Perl,
, (false), 0
.

12.5.2. While/until

Perl while until.



,.:

while (condition)
{
    ...
}
until (condition)
{
    ...
}

12.5.3. Do while/until

Perldo/while.
,
.Perlwhile
until.:

do
{
    ...
} while (condition);

do
{
    ...
} until (condition);

12.5.4. For

for .
.,

.:

for ($i = 1; $i <= 10; $i++)
{
    print "$i\n";
}

,,
(.. $x+$y<10 ...),
, () (
foreach).

12.5.5. Foreach

for
(,).
,.
:

@a = (1,2,3,4,5);
foreach $i (@a)
{
    print "$i\n";
}

,,
(, $i),
().

12.5.6. Last/next/redo


.
Last: last break C,
.
:
for
foreach
while
until

while (condition 1)
{
    ...
    if (condition 2)
    {
        ...
        last;
    }
}

Next: continue C.
,,
.:

while (condition 1)
{
    ...
    if (condition 2)
    {
        ...
        next;
    }
}

Redo: redo
.
:

while (condition 1)
{
#

if(condition 2)
{

redo;
}
}

12.6. (Filehandles) /

Perl, .
Perl,
,
(filehandles). .

( , ),
.
,:

open MYFILE, "file.txt";

open.
.
, .
.
,,
. , ,
12.6:


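open(IN,"filename")      Opens the file filename for reading, through the filehandle IN.
open(OUT,">filename")    Opens the file filename for writing, through the filehandle OUT. Any existing contents are overwritten.
open(OUT,">>filename")   Opens the file filename for appending, through the filehandle OUT.
12.6: open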

,
:
open IN, "/etc/passwd";
$x=<IN>;
print $x;
close IN;
open OUT, ">tempfile";
print OUT "bla bla bla\n";

,/etc/passwd(Linux),
, , , , tempfile
blablabla.<>
(STDIN).
,
, .
:
while(<>)
{
print $_;
}

,catUnix(,
), End Of File (EOF)
$_.while,
<>,.<>()
Perl.,
.
:

perl programme.pl file.txt

, Perl file.txt,
<>,-().$_
Perl ,
,.
. ,
. ( ..
),(..
),
. ,
,@ARGV.,
$ARGV[0],$ARGV[1]...open.,


,
().
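
As an alternative to the `<>` operator, the file names given on the command line can be taken from `@ARGV` and opened explicitly. The sketch below is illustrative (the function name `count_lines` is hypothetical). It uses the three-argument form of `open` with a lexical filehandle, a more defensive style than the bareword filehandles shown above, and it stops with a message if a file cannot be opened.

```perl
use strict;
use warnings;

# Count the lines of a file that is opened explicitly instead of through <>.
sub count_lines {
    my ($filename) = @_;
    open(my $in, '<', $filename) or die "Cannot open $filename: $!\n";
    my $lines = 0;
    $lines++ while <$in>;
    close $in;
    return $lines;
}

# Every command-line argument is treated as a file name.
foreach my $file (@ARGV) {
    print "$file: ", count_lines($file), " lines\n";
}
```

Assuming the script is saved as count.pl (a hypothetical name), it would be run as `perl count.pl file1.txt file2.txt`.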

12.7.

(regularexpressions)
,.
4,
. (/)
.
:

$dna=~/GAATTC/;

$dnaGAATTC
. /GAATTC/,
.
,
/GAATTC|CTTAAG/ (GAATTC or CTTAAG).
Perl

..,
(.) , \d
.12.7.

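\w   A word character (letter, digit or underscore).
\W   Any character that is not a word character.
\s   A whitespace character (space, tab, newline).
\S   Any character that is not whitespace.
\d   A digit.
\D   Any character that is not a digit.

12.7: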

,
.
,.
(Quantifier),Perl
,.
{n},n.:

$sequence=~/AAT{5}CCG/;

$sequence AATTTTTCCG (,
PROSITE A-A-T(5)-C-C-G).,
(true)(false),,if.,
4, PROSITE , ,

.12.8.


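?       0 or 1 times (that is, the element is optional)
+       1 or more times
*       0 or more times
{n,m}   between n and m times
{n,}    at least n times
{0,m}   at most m times (from Perl 5.34 this can also be written {,m})
12.8: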
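
The following sketch combines the elements of this section (alternation, quantifiers, capturing parentheses) on a short, hypothetical DNA string. The counting idiom in the last assignment forces list context so that every match found with the `g` modifier is counted.

```perl
use strict;
use warnings;

my $dna = 'GGAATTTTTCCGAATTCTT';     # hypothetical example sequence

my $has_site = ($dna =~ /GAATTC|CTTAAG/) ? 1 : 0;   # alternation: either word
my $has_run  = ($dna =~ /AAT{5}CCG/)     ? 1 : 0;   # T repeated exactly 5 times
my ($t_run)  = $dna =~ /(T{3,})/;                   # first run of 3 or more T
my $n_sites  = () = $dna =~ /GAATTC/g;              # count all occurrences

print "site: $has_site, AAT{5}CCG: $has_run, T-run: $t_run, sites: $n_sites\n";
```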

12.8. Perl

,-,Perl,
,
.

12.8.1 Uniprot Fasta

,.
,
Uniprot
, fasta. ,
Uniprot.fasta
, AC ID.
Uniprot(..AC),
.,3
.(uniprot2fasta.pl),:

while (<>)
{
if ($_=~/^AC\s{3}(.*?)\;/)
{
print ">$1\n";
}
if ($_=~/^\s{5}(.*)/)
{
$sequence=$1;
$sequence=~s/\s//g;
print "$sequence\n";
}
}

, <>. -
$_. if, AC.
AC,3,
(;)AC().
?(non-greedy),
. $1 ,

.,
(,AC).,,$2,$3...
, 5
(, ) (
). Uniprot


, $sequence . ,
~s/\s//g(g,
,globally-,).
, ,
.if
. ,
,.
,
,fasta,
..
(uniprot2line.pl).

$/="\/\/\n";
while (<>)
{
if ($_=~/^AC\s{3}(.*?)\;/m)
{
print ">$1\n";
}
while ($_=~/^\s{5}(.*)/mg)
{
$sequence=$1;
$sequence=~s/\s//g;
print "$sequence";
}
print "\n";
}

,
. $/="\/\/\n"
(-)
//\n. ,
Uniprot$_.,
m .
($_)(multilne).,if
while,g(global),
.
,
fastafasta.(fasta2line.pl):

$/=">";                         # read one fasta record at a time
while (<>)
{
    $entry=$_;
    chomp $entry;               # remove the trailing ">" (the record separator)
    if ($entry=~/^(.+?)\n(.*)/s)
    {
        $name=$1;
        $sequence=$2;
        $sequence=~s/\s//g;     # join the sequence lines into one
        print ">$name\n$sequence\n";
    }
}
$/="\n";

,-(>).
, ,
.

12.8.2.

,
(, ...). 500
200:

@aa = qw(A C D E F G H I K L M N P Q R S T V W Y);
for ( $i=0; $i<500; $i++ ) {
print '>Random', "$i\n";
for( $j=0; $j<200; $j++ ) {
$r = $aa[ int (rand 20)];
print $r;
print "\n" if ($j+1)%60 == 0 and $j;
}
print "\n";
}

(
).,rand
0-20,(index)
.
()60
fasta. , ..
,
. .
.

12.8.3.

,
,DNA/RNA.
,
. FASTA
.

@aa = qw(A C D E F G H I K L M N P Q R S T V W Y);
while (<>)
{
    if ($_=~/^>/)
    {
        $id=$_;
        chomp $id;
        print $id."\t";
        $seq=<>;                    # the sequence is expected on the next line
        chomp $seq;
        $length=length($seq);       # total length, measured before any deletion
        next if $length==0;
        foreach $z (@aa)
        {
            $count = ($seq =~s/$z//g) || 0;   # number of residues removed
            $diairesi = $count/$length;
            $pososto=sprintf( "%.3f", $diairesi );
            print $z."\t".$count.'/'.$length."\t[".$pososto."]\n";
        }
        print "\n";
    }
}

20
-.>header

( @seq=<>). , ,
foreach - .
$count=$seq=~s/$z//g.,
( ) ,
(length
- $seq =~s/$z//g; $count=length($seq);). ,
sprintf.

12.8.4. DNA


(openreadingframes)DNA.
.
DNA . ,
26
.

%genetic_code = (
'GCA'=>'A', #Alanine
'GCC'=>'A', #Alanine
'GCG'=>'A', #Alanine
'GCT'=>'A', #Alanine
'AGA'=>'R', #Arginine
'AGG'=>'R', #Arginine
'CGA'=>'R', #Arginine
'CGC'=>'R', #Arginine
'CGG'=>'R', #Arginine
'CGT'=>'R', #Arginine
'AAC'=>'N', #Asparagine
'AAT'=>'N', #Asparagine
'GAC'=>'D', #Aspartic acid
'GAT'=>'D', #Aspartic acid
'TGC'=>'C', #Cysteine
'TGT'=>'C', #Cysteine
'GAA'=>'E', #Glutamic acid
'GAG'=>'E', #Glutamic acid
'CAA'=>'Q', #Glutamine
'CAG'=>'Q', #Glutamine
'GGA'=>'G', #Glycine
'GGC'=>'G', #Glycine
'GGG'=>'G', #Glycine
'GGT'=>'G', #Glycine
'CAC'=>'H', #Histidine
'CAT'=>'H', #Histidine
'ATA'=>'I', #Isoleucine
'ATC'=>'I', #Isoleucine
'ATT'=>'I', #Isoleucine
'TTA'=>'L', #Leucine
'TTG'=>'L', #Leucine
'CTA'=>'L', #Leucine
'CTC'=>'L', #Leucine
'CTG'=>'L', #Leucine
'CTT'=>'L', #Leucine
'AAA'=>'K', #Lysine
'AAG'=>'K', #Lysine
'ATG'=>'M', #Methionine
'TTC'=>'F', #Phenylalanine
'TTT'=>'F', #Phenylalanine
'CCA'=>'P', #Proline
'CCC'=>'P', #Proline
'CCG'=>'P', #Proline
'CCT'=>'P', #Proline
'AGC'=>'S', #Serine
'AGT'=>'S', #Serine
'TCA'=>'S', #Serine
'TCC'=>'S', #Serine
'TCG'=>'S', #Serine
'TCT'=>'S', #Serine
'ACA'=>'T', #Threonine
'ACC'=>'T', #Threonine
'ACG'=>'T', #Threonine
'ACT'=>'T', #Threonine
'TGG'=>'W', #Tryptophan
'TAC'=>'Y', #Tyrosine
'TAT'=>'Y', #Tyrosine
'GTA'=>'V', #Valine
'GTC'=>'V', #Valine
'GTG'=>'V', #Valine
'GTT'=>'V', #Valine
'TAA'=>'-', #STOP
'TAG'=>'-', #STOP
'TGA'=>'-', #STOP
);

$seq="AAAAAAATTAATAGATGAACATATATATAGATTTTCTATATAGACCCTCTACCCGATAAGGCTAC";
$seq2=$seq;
$seq2=~tr/ATCG/TAGC/;
$seq2=reverse($seq2);
print "Forward Strand\n";
Translate($seq);
print "Reverse Strand\n";
Translate($seq2);
sub Translate{
    $sub_seq=$_[0];
    for($i=0;$i<=length($sub_seq)-3;$i++)
    {
        $x=substr($sub_seq,$i,3);
        if ($x eq 'ATG')                     # start codon found
        {
            $pos=$i+1;
            print "position $pos\n";
            for($j=$i;$j<=length($sub_seq)-3;$j=$j+3)
            {
                $y=substr($sub_seq,$j,3);
                $k=$genetic_code{$y};
                last if ($k eq '-');         # stop codon: end of this frame
                print "$k";
            }
            print "\n";                      # also when no stop codon was reached
        }
    }
}

,
. , .

(,).
tr,reverse,
(5').
,
(sub-routine).
(,)
(,).
,
( ).
,.
Perl.
@_ . ,
$_[0],()$_[1]...
Perl "" (global),
. ,
.(private),
(),
my(my$seq=...).
, .
3 (, ) ,
(ATG).,substr
.,
,,,
(last).
,
.
.
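
The difference between a global variable and a private variable declared with `my` can be seen in the following sketch. The function names are hypothetical; `our` declares the global variable explicitly so that the example also runs under `use strict`.

```perl
use strict;
use warnings;

our $seq = 'ATGAAATAG';          # global (package) variable

sub lower_global {
    $seq = lc $seq;              # changes the global $seq seen by every caller
    return $seq;
}

sub lower_private {
    my ($seq) = @_;              # private copy, visible only inside this sub
    $seq = lc $seq;
    return $seq;
}

my $dna = 'ATGAAATAG';
my $private_result = lower_private($dna);
print "$private_result $dna\n";  # the caller's $dna is untouched
my $global_result = lower_global();
print "$global_result $seq\n";   # the global $seq is now lowercase
```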

12.8.5.

5, (signal
peptide),
,

. ,
( 17-30 , ),
.
1980,PS00013,
PROSITE. ,
, [LVI]-[ASTVI]-[GAS]-C
(DOLOP).2002,SutcliffeHarrington


Gram,,
(Sutcliffe&Harrington,2002).
63
(Junckeretal.,2003).
FASTA,
(,
PROSITE).

$count=0;
while (<>)
{
    if ($_=~/^>(.*)/)
    {
        $name=$1;
        $seq=<>;                    # the sequence is expected on the next line
        if ($seq=~/(.*LA[GA]C)/)
        {
            $x=length($1);
            $count=$count+1;
        }
    }
}
print "$count LIPOPROTEINS FOUND\n";

ifPerl:

if($seq=~/(.*[LVI][ASTVI][GAS]C)/)

if($seq=~/(.*[^DERK]{6}[LIVMFWSTAG]{2}[LIVMFYSTAGCQ][AGS]C)/)

if($seq=~/^([MV].{0,13}[RK][^DERK]{6,20}[LIVMFESTAG][LVIAM][IVMSTAFG][AG]C)/)


().,
,
,
. ,
,
("")
, .
WebLogo (http://weblogo.berkeley.edu/)
(Crooks,Hon,Chandonia,&Brenner,2004).

$max=0;
while (<>)
{
    if ($_=~/^>(.*)/)
    {
        $name=$1;
        $seq=<>;
        if ($seq=~/(.*[^DERK]{6}[LIVMFWSTAG]{2}[LIVMFYSTAGCQ][AGS]C)/)
        {
            $x=length($1);
            push @table,$1;
            if ($x>$max)            # keep the length of the longest peptide
            {
                $max=$x;
            }
        }
    }
}
foreach $signal (@table)
{
    $i="-" x ($max-length($signal));   # pad on the left so that the C residues align
    $signal=$i.$signal;
    print "$signal\n";
}

-.
>header
( @seq=<>).
,(
). ,
,"-"
.
,.

------------------MRRCMPLVAASVAALMLAGC
-----------------MKLKQLFAITAIASALVLTGC
------------------MKLLSKIMIIALAASMLQAC
-------------MNKNRGFTPLAVVLMLSGSLALTGC
-------------------MKRQALAAMIASLFALAAC
-------------------MRLLPLVAAATAAFLVVAC
-----------------------MRIVIFILGILLTSC
---------------------MFKRFIFITLSLLVFAC
---------------------MLKKVYYFLIFLFIVAC
---------------------MKKILLTVSLGLALSAC
-----------------MVKKAIVTAMAVISLFTLMGC
-----------------MKQLIVNSVATVALASLVAGC
-------------------MKLKTLALSLLAAGVLAGC
--------------------MKAYLALISAAVIGLAAC
-------------------MKLKATLTLAAATLVLAAC
------------------MQKTPKKLTALCHQQSTASC
----------------MPLPDFRLIRLLPLAALVLTAC
----------------MKNQVKKILGMSVVAAMVIVGC
---------------------MKKFLPLSISITVLAAC
-----------------MKRLFLSFVALALLAGSIAAC
--------------------MCGKILLILFFIMTLSAC
---------------------MSKRLLSLASLALLFGC
-------------------MFKRRYVTLLPLFVLLAAC
------------------MKKIIKLSLLSLSIAGLASC
-----------------MGRSKIVLGAVVLASALLAGC
------------------MKAKIVLGAVILASGLLAGC
------------------MNNVLKFSALALAAVLATGC
--------------MKLTTHHLRTGAALLLAGILLAGC
-------------MAYSVQKSRLAKVAGVSLVLLLAAC
------------MSAGSPKFTVRRIAALSLVSLWLAGC
MDKGEGLRLAATLRQWTRLYGGCHLLLGAVVCSLLAAC
-------------------MKPFLRWCFVATALTLAGC
------------------MNIATKLMASLVASVVLTAC
----------------MQNAKLMLTCLAFAGLAALAGC
---------------------MKKYLLGIGLILALIAC
----------------------MRLLIGFALALALIGC
--------------MFVTSKKMTAAVLAITLAMSLSAC

-----------------MNKNMAGILSAAAVLTMLAGC
-----------------MHVSSLKVVLFGVCCLSLAAC
----------------MYKNGFFKNYLSLFLIFLVIAC
------------------MNKFVKSLLVAGSVAALAAC
-------------------MKKTNMALALLVAFSVTGC
----------------MSLTHYSGLAAAVSMSLILTAC
------------------MLRYTRNALVLGSLVLLSGC
--------------------MRNFILFPMMAVVLLSGC
--------------------MRKQWLGICIAAGMLAAC
-------------------MRYLATLLLSLAVLITAGC
-------------------MNMTKGALILSLSFLLAAC
----------------MNKKIFTLFLVVAASAIFAVSC
--------------------MVKRGRFALCLAVLLGAC
------------------MKVKYALLSAGALQLLVVGC
-----------------MNNPLVNQAAMVLPVFLLSAC
----------------MNAHTLVYSGVALACAAMLGSC
----------------MKLKSLVFSLSALFLVLGFTGC
-----------------MREKWVRAFAGVFCAMLLIGC
---------------MKHNVKLMAMTAVLSSVLVLSGC
--------------------MKLRLSALALGTTLLVGC
-----------MRKRISAIINKLNISIIIMTVVLMIGC
-------------------MRKRISAIIMTLFMVLVSC
-----------MRKRISAIINKLNISIMMMIVVLMIGC

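Note that none of the programs above opens a file explicitly (open). Reading the input and saving the output are left to the shell (or to the command line in Windows). For example, to read from one file and write the result to another: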

perl program.pl input.file > output.file

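The shell of Linux (and of UNIX in general) also offers "pipes". With a pipe, the output of one program becomes the input of the next. For example: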

ls | wc

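Here the output of ls is passed to wc. Perl programs can be combined in the same way, directly from the shell. For example: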

./program.pl input.file | ./program2.pl

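In this case the second program does not need a file name: it reads whatever the first program prints from Standard Input (for example with <STDIN>). In this way small programs, each of which does one job, can be chained together by the shell. Finally, the result of the alignment above (the aligned signal peptides), given to WebLogo, is shown in Figure 12.1.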

Figure 12.1: Sequence logo of the aligned lipoprotein signal peptides, created with WebLogo (http://weblogo.berkeley.edu/)
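As an illustration of a program that sits at the receiving end of a pipe, the sketch below tallies the residues in every column of the right-aligned peptides, which is roughly the information a sequence logo summarises. It is an assumption-laden example, not the book's code: the name program2.pl and the subroutine column_counts are hypothetical, and the input is taken to be the dash-padded lines printed by the alignment program.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical second program of a pipe (call it program2.pl): counts which
# residues occur in each column of the right-aligned peptides.
sub column_counts {
    my @rows = @_;
    my @counts;                       # one hash (residue => count) per column
    for my $row (@rows) {
        chomp $row;
        my @residues = split //, $row;
        for my $col (0 .. $#residues) {
            next if $residues[$col] eq '-';    # skip the padding
            $counts[$col]{ $residues[$col] }++;
        }
    }
    return @counts;
}

# Read from standard input only when something is piped or redirected in.
if (!-t STDIN) {
    my @counts = column_counts(<STDIN>);
    for my $col (0 .. $#counts) {
        my $c   = $counts[$col] || {};
        my @top = sort { $c->{$b} <=> $c->{$a} || $a cmp $b } keys %$c;
        printf "%d\t%s\n", $col + 1, join(" ", map { "$_=$c->{$_}" } @top);
    }
}
```

It could then be used as ./program.pl input.file | ./program2.pl, with the last printed column expected to contain only C, since every peptide ends at the conserved cysteine.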

12.9.

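For the reader who wants to go deeper into Perl, there are many books and online resources.

Classic books are Programming Perl (Wall & Schwartz, 1991) and Learning Perl (Schwartz & Phoenix, 2001). There are also books that present Perl specifically for bioinformatics (Moorhouse & Barry, 2005; Tisdall, 2001, 2003), for algorithms (Orwant, Hietaniemi, & Macdonald, 1999) and for the web (Castro, 2001). In addition, there are tutorials and online ebooks such as Picking Up Perl, http://www.ebb.org/PickingUpPerl/ (Kuhn, 2002), as well as the collections at https://www.perl.org/books/library.html and http://www.perlmonks.org/index.pl/Tutorials.
Special mention should be made of BioPerl (http://www.bioperl.org/wiki/Main_Page), a collection of modules (modules) for Perl aimed at the life sciences (Stajich et al., 2002). BioPerl relies on object-oriented programming in Perl (object-oriented Perl). For those who want to start with BioPerl, a good starting point is: http://www.bioperl.org/wiki/HOWTO:Beginners.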

Castro, Elizabeth. (2001). Perl and CGI for the World Wide Web: Visual QuickStart Guide: Peachpit Press.
Crooks, G. E., Hon, G., Chandonia, J. M., & Brenner, S. E. (2004). WebLogo: a sequence logo generator. Genome Res, 14(6), 1188-1190. doi:10.1101/gr.849004
Juncker, A. S., Willenbrock, H., Von Heijne, G., Brunak, S., Nielsen, H., & Krogh, A. (2003). Prediction of lipoprotein signal peptides in Gram-negative bacteria. Protein Sci, 12(8), 1652-1662.
Kuhn, Bradley M. (2002). Picking Up Perl: B. Kuhn.
Moorhouse, Michael, & Barry, Paul. (2005). Bioinformatics biocomputing and Perl: an introduction to bioinformatics computing skills and practice: John Wiley & Sons.
Orwant, Jon, Hietaniemi, Jarkko, & Macdonald, John. (1999). Mastering algorithms with Perl: O'Reilly Media, Inc.
Schwartz, Randal L, & Phoenix, Tom. (2001). Learning Perl: O'Reilly & Associates, Inc.
Stajich, Jason E, Block, David, Boulez, Kris, Brenner, Steven E, Chervitz, Stephen A, Dagdigian, Chris, ... Lapp, Hilmar. (2002). The Bioperl toolkit: Perl modules for the life sciences. Genome Research, 12(10), 1611-1618.
Sutcliffe, I. C., & Harrington, D. J. (2002). Pattern searches for the identification of putative lipoprotein genes in Gram-positive bacterial genomes. Microbiology, 148(Pt 7), 2065-2077.
Tisdall, James. (2001). Beginning Perl for bioinformatics: O'Reilly Media, Inc.
Tisdall, James. (2003). Mastering Perl for bioinformatics: O'Reilly Media, Inc.
Wall, Larry, & Schwartz, Randal L. (1991). Programming Perl: O'Reilly & Associates, Sebastopol, CA.



1)
(, ). ,
, Uniprot
(
,A,C,...):

@counts = ('0.077', '0.016', '0.053', '0.065', '0.041', '0.069', '0.023', '0.059',
           '0.059', '0.095', '0.024', '0.043', '0.049', '0.039', '0.052', '0.070',
           '0.055', '0.066', '0.012', '0.031');

2) DNA
3.(..1000)
,
.,1000
ErdosRenyi.
(..1000,5000,10000...).

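3) Write a program that reads entries in Uniprot format and extracts the transmembrane segments, which are annotated in the FT lines with the key TRANSMEM. For each entry the program should print, in fasta-like format with the sequence on one line (oneline), the sequence and, below it, a line of the same length that marks the transmembrane residues. For example, the entry below should give the output that follows it: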

ID   140U_DROME              Reviewed;         261 AA.
AC   P81928; Q9VFM8;
FT   CHAIN         1    261       RPII140-upstream gene protein.
FT                                /FTId=PRO_0000064352.
FT   TRANSMEM     67     87       Potential.
FT   TRANSMEM    131    151       Potential.
FT   TRANSMEM    183    203       Potential.
FT   CONFLICT     64     64       S -> F (in Ref. 1).
SQ   SEQUENCE   261 AA;  29182 MW;  5DB78CF6CFC4435A CRC64;
     MNFLWKGRRF LIAGILPTFE GAADEIVDKE NKTYKAFLAS KPPEETGLER LKQMFTIDEF
     GSISSELNSV YQAGFLGFLI GAIYGGVTQS RVAYMNFMEN NQATAFKSHF DAKKKLQDQF
     TVNFAKGGFK WGWRVGLFTT SYFGIITCMS VYRGKSSIYE YLAAGSITGS LYKVSLGLRG
     MAAGGIIGGF LGGVAGVTSL LLMKASGTSM EEVRYWQYKW RLDRDENIQQ AFKKLTEDEN
     PELFKAHDEK TSEHVSLDTI K

>P81928
MNFLWKGRRFLIAGILPTFEGAADEIVDKENKTYKAFLASKPPEETGLERLKQMFTIDEFGSISSELNSVYQAGFLGFL
-------------------------------------------------------------------MMMMMMMMMMM

, ,
,Kyte-Doolittle:

%hyd =('A' => 0.100,
       'C' => -1.420,
       'D' => 0.780,
       'E' => 0.830,
       'F' => -2.120,
       'G' => 0.330,
       'H' => -0.500,
       'I' => -1.130,
       'K' => 1.400,
       'L' => -1.180,
       'M' => -1.590,
       'N' => 0.480,
       'P' => 0.730,
       'Q' => 0.950,
       'R' => 1.910,
       'S' => 0.520,
       'T' => 0.070,
       'V' => -1.270,
       'W' => -0.510,
       'Y' => -0.210
);


, (
*).,
,,.. 3.1
3,.

4) DNA
.
.
(.. 1000),
(>10.000),
.
,(.1).

5)
FASTA ( ), ,
sparseencoding.


