Basics: Representations of protein sequences and their applications
Jong Park
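use strict;
use warnings;
use Bio::Tools::CodonTable;

# defaults to ID 1 "Standard"
my $myCodonTable  = Bio::Tools::CodonTable->new();
# ID 3 selects the yeast mitochondrial table
my $myCodonTable2 = Bio::Tools::CodonTable->new( -id => 3 );

# translate a codon (RNA or DNA letters, upper or lower case)
my $aa;
$aa = $myCodonTable->translate('ACU');   # ACU codes for Thr
$aa = $myCodonTable->translate('act');   # same codon written as lowercase DNA
$aa = $myCodonTable->translate('ytr');   # ambiguous: Y = C/T, R = A/G; every match codes for Leu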
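For readers without BioPerl, the lookup behind the three `translate` calls can be sketched in plain Python. This is an illustration under stated assumptions, not BioPerl's implementation: the function name `translate_codon` is hypothetical, and returning "X" when an ambiguous codon matches more than one amino acid is a convention chosen for this sketch.

```python
from itertools import product

# Standard genetic code (NCBI table 1), codons enumerated with bases in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# IUPAC nucleotide codes; U is treated as T so RNA codons work too.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def translate_codon(codon):
    """Translate one codon; hypothetical helper, not part of any library.

    Ambiguous codons are expanded to every concrete codon they match.
    If all of them code for the same amino acid, that letter is returned;
    otherwise "X" is returned (an assumption of this sketch).
    """
    codon = codon.upper()
    if len(codon) != 3 or any(base not in IUPAC for base in codon):
        raise ValueError(f"not a valid codon: {codon!r}")
    expansions = product(*(IUPAC[base] for base in codon))
    amino_acids = {CODON_TABLE["".join(c)] for c in expansions}
    return amino_acids.pop() if len(amino_acids) == 1 else "X"


for example in ("ACU", "act", "ytr"):
    print(example, translate_codon(example))
```

The table is built from the 64-letter amino acid string rather than typed out codon by codon, which keeps the sketch short and makes it easy to swap in another NCBI table string.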
PIR format
A sequence in PIR format consists of:
1. One line starting with
   a. a ">" (greater-than) sign, followed by
   b. a two-letter code describing the sequence type
      (P1, F1, DL, DC, RL, RC, or XX), followed by
   c. a semicolon, followed by
   d. the sequence identification code (the database ID code).
2. One line containing a textual description of the sequence.
3. One or more lines containing the sequence itself. The
   end of the sequence is marked by a "*" (asterisk) character.
A file in PIR format may comprise more than one sequence.
The PIR format is also often referred to as the NBRF format.
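The three-part layout described above can be read with a few lines of Python. The following is a minimal sketch, assuming well-formed input; the function name `parse_pir` and the two example entries (IDs `EXAMPLE1` and `EXAMPLE2`) are made up for illustration and do not come from any database.

```python
def parse_pir(text):
    """Return a list of dicts, one per sequence in PIR/NBRF-formatted text."""
    records = []
    lines = iter(text.splitlines())
    for line in lines:
        if not line.startswith(">"):
            continue  # ignore blank lines or anything outside a record
        # Header line: ">" + two-letter type code + ";" + identification code
        seq_type, _, ident = line[1:].partition(";")
        # The line after the header is the free-text description
        description = next(lines, "").strip()
        # Sequence lines follow until the terminating asterisk
        residues = []
        for seq_line in lines:
            chunk = "".join(seq_line.split())  # drop spaces inside the line
            if "*" in chunk:
                residues.append(chunk.split("*")[0])
                break
            residues.append(chunk)
        records.append({
            "type": seq_type.strip(),
            "id": ident.strip(),
            "description": description,
            "sequence": "".join(residues),
        })
    return records


# Hypothetical two-entry file, for illustration only.
EXAMPLE = """>P1;EXAMPLE1
Hypothetical example protein
MSASQKEGKL STATISVDGK
SAEMPVLSGT*
>DL;EXAMPLE2
Hypothetical example DNA
acttagtttt gcgctggagc*
"""

for record in parse_pir(EXAMPLE):
    print(record["type"], record["id"], len(record["sequence"]))
```

Because the asterisk ends each sequence, the parser does not need to look ahead for the next ">" line, which is what lets one file hold several sequences.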
1-------10--------20--------30--------40--------50--------60--------70------78
LOCUS       ABCAARAA_1
DEFINITION  A.aceti acetic acid resistance protein (aarA) gene, complete cds;
            acetic acid resistance protein (aarA).
DATE        15-SEP-1990
ACCESSION   M34830
ORGANISM    Acetobacter aceti
            Eubacteria; Proteobacteria; alpha subdivision; Acetobacteraceae;
            Acetobacter.
COMMENT     CDS 185..1495
            /db_xref="PID:g141730"
WEIGHT      48238
LENGTH      436
ORIGIN      Translated using phase 1
        1 MSASQKEGKL STATISVDGK SAEMPVLSGT LGPDVIDIRK LPAQLGVFTF DPGYGETAAC
       61 NSKITFIDGD KGVLLHRGYP IAQLDENASY EEVIYLLLNG ELPNKVQYDT FTNTLTNHTL
      121 LHEQIRNFFN GFRRDAHPMA ILCGTVGALS AFYPDANDIA IPANRDLAAM RLIAKIPTIA
      181 AWAYKYTQGE AFIYPRNDLN YAENFLSMMF ARMSEPYKVN PVLARAMNRI LILHADHEQN
      241 ASTSTVRLAG STGANPFACI AAGIAALWGP AHGGANEAVL KMLARIGKKE NIPAFIAQVK
      301 DKNSGVKLMG FGHRVYKNFD PRAKIMQQTC HEVLTELGIK DDPLLDLAVE LEKIALSDDY
      361 FVQRKLYPNV DFYSGIILKA MGIPTSMFTV LFAVARTTGW VSQWKEMIEE PGQRISRPRQ
      421 LYIGAPQRDY VPLAKR
//
EMBL style
ID   CM23SRIBR
XX
AC   X80636;
XX
DT   22-MAR-1995
XX
DE   C.mucosalis gene for 23S ribosomal RNA (fragment)
XX
OS   Campylobacter mucosalis
XX
CC   SEQIO retrieval from EMBL-format entry.  07-Feb-1996
XX
SQ
     acttagtttt ...
     gcgctggagc ...
     cccaacgccg ...
     cctaagcaaa ...
     ttattgtgcg ...
     aaaggtgtag ...
     ttgaagttct ...
     agtttagata ...
     ggcgcgtgga ...
     ggtgtgccta ...
     cctcccgact ...
     gtgacgcctg ...
     gatcgaagcc ...
//
[Only the first 10 bases of each sequence line survive; the line-end position
counters run 60, 120, ... 780, 805.]
Swissprot style
ID   104K_THEPA     CONVERTED;      PRT;   924 AA.
AC   P15711;
DT   01-AUG-1992
DE   104 KD MICRONEME-RHOPTRY ANTIGEN.
OS   THEILERIA PARVA.
CC   -!- DEVELOPMENTAL STAGE: SPOROZOITE ANTIGEN.
CC   -!- SUBCELLULAR LOCATION: IN MICRONEME/RHOPTRY COMPLEXES.
CC   SEQIO retrieval from Swiss-Prot database entry.
CC   07-Feb-1996
SQ   SEQUENCE   924 AA;
     MKFLILLFNI LCLFPVLAAD NHGVGPQGAS GVDPITFDIN SNQTGPAFLT AVEMAGVKYL
     QVQHGSNVNI HRLVEGNVVI WENASTPLYT GAIVTNNDGP YMAYVEVLGD PNLQFFIKSG
     DAWVTLSEHE YLAKLQEIRQ AVHIESVFSL NMAFQLENNK YEVETHAKNG ANMVTFIPRN
     GHICKMVYHK NVRIYKATGN DTVTSVVGFF RGLRLLLINV FSIDDNGMMS NRYFQHVDDK
     YVPISQKNYE TGIVKLKDYK HAYHPVDLDI KDIDYTMFHL ADATYHEPCF KIIPNTGFCI
     TKLFDGDQVL YESFNPLIHC INEVHIYDRN NGSIICLHLN YSPPSYKAYL VLKDTGWEAT
     THPLLEEKIE ELQDQRACEL DVNFISDKDL YVAALTNADL NYTMVTPRPH RDVIRVSDGS
     EVLWYYEGLD NFLVCAWIYV SDGVASLVHL RIKDRIPANN DIYVLKGDLY WTRITKIQFT
     QEIKRLVKKS KKKLAPITEE DSDKHDEPPE GPGASGLPPK APGDKEGSEG HKGPSKGSDS
     SKEGKKPGSG KKPGPAREHK PSKIPTLSKK PSGPKDPKHP RDPKEPRKSK SPRTASPTRR
     PSPKLPQLSK LPKSTSPRSP PPPTRPSSPE RPEGTKIIKT SKPPSPKPPF DPSFKEKFYD
     DYSKAASRSK ETKTTVVLDE SFESILKETL PETPGTPFTT PRPVPPKRPR TPESPFEPPK
     DPDSPSTSPS EFFTPPESKR TRFHETPADT PLPDVTAELF KEPDVTAETK SPDEAMKRPR
     SPSEYEDTSP GDYPSLPMKR HRLERLRLTT TEMETDPGRM AKDASGKPVK LKRSKSFDDL
     TTVELAPEPK ASRIVVDDEG TEADDEETHP PEERQKTEVR RRRPPKKPSK SPRPSKPKKP
     KKPDSAYIPS ILAILVVSLI VGIL
//
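Both the EMBL and the Swiss-Prot styles put a two-letter line code at the start of every line and end the entry with "//". A minimal Python sketch of a reader for that layout is shown below. The function name `parse_line_coded_entry` is hypothetical, the sketch assumes the sequence follows the SQ line, and the sample entry is the Swiss-Prot record above cut down to its first 20 residues to keep the example short.

```python
from collections import defaultdict


def parse_line_coded_entry(text):
    """Split one EMBL/Swiss-Prot style entry into (fields, sequence).

    fields maps each two-letter line code to the list of its line contents;
    sequence is the concatenated residue or base letters after the SQ line.
    """
    fields = defaultdict(list)
    sequence = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):
            break  # end-of-entry marker
        if in_sequence:
            # Keep letters only: drops spaces and any position counters
            sequence.append("".join(ch for ch in line if ch.isalpha()))
            continue
        code, value = line[:2], line[2:].strip()
        if code == "XX" or not code.strip():
            continue  # XX lines are spacers; blank codes carry no field
        fields[code].append(value)
        if code == "SQ":
            in_sequence = True
    return dict(fields), "".join(sequence)


# Sample shortened for illustration: only the first 20 residues are included.
SAMPLE_ENTRY = """ID   104K_THEPA     CONVERTED;      PRT;   924 AA.
AC   P15711;
DE   104 KD MICRONEME-RHOPTRY ANTIGEN.
OS   THEILERIA PARVA.
CC   -!- DEVELOPMENTAL STAGE: SPOROZOITE ANTIGEN.
CC   -!- SUBCELLULAR LOCATION: IN MICRONEME/RHOPTRY COMPLEXES.
SQ   SEQUENCE   924 AA;
     MKFLILLFNI LCLFPVLAAD
//
"""

entry_fields, entry_sequence = parse_line_coded_entry(SAMPLE_ENTRY)
print(entry_fields["AC"], entry_fields["OS"])
print(entry_sequence)
```

Storing each code as a list keeps repeated lines such as CC in their original order instead of overwriting them.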
Sequence profile/model representations
Models: Hidden Markov Models
Profiles: a propensity mapping of multiple sequences.
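The simplest form of a profile is a table of per-column residue frequencies taken from a set of aligned sequences. The sketch below illustrates that idea only; real profile methods usually add sequence weighting, pseudocounts and log-odds scoring, none of which are shown here. The function names `build_profile` and `consensus` and the four-sequence toy alignment are hypothetical.

```python
from collections import Counter


def build_profile(aligned_sequences):
    """Return one {residue: frequency} dict per alignment column.

    Assumes every sequence has the same length (already aligned).
    """
    if not aligned_sequences:
        raise ValueError("need at least one sequence")
    length = len(aligned_sequences[0])
    if any(len(seq) != length for seq in aligned_sequences):
        raise ValueError("aligned sequences must have equal length")
    n = len(aligned_sequences)
    profile = []
    for column in zip(*aligned_sequences):
        counts = Counter(column)
        profile.append({res: count / n for res, count in counts.items()})
    return profile


def consensus(profile):
    """Pick the most frequent residue in each column."""
    return "".join(max(col, key=col.get) for col in profile)


# Toy alignment, made up for illustration.
TOY_ALIGNMENT = ["MKLV", "MKIV", "MRLV", "MKLI"]

toy_profile = build_profile(TOY_ALIGNMENT)
for position, column in enumerate(toy_profile, start=1):
    print(position, column)
print(consensus(toy_profile))
```

A column where one residue has frequency 1.0 is fully conserved, while a column with several residues at lower frequencies tolerates substitution; this is the position-specific information that a single sequence cannot carry.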
HMM
2. Accessibility
Knowing which parts of a protein are
accessible or buried gives us
additional information.
Various accessibility prediction
algorithms exist.
This is the hydropathy plot for the acetylcholine receptor protein. More positive
values indicate increasingly hydrophobic character. The plot was calculated using the
Kyte and Doolittle (1982) hydropathy value of each amino acid
and a running average over 17 neighbouring amino acids. The large window helps to
smooth the curve. Amino acid number 1 is at the amino terminus. The
segment labeled LSS is the leader sequence, and M1, M2, M3 and M4 are putative
membrane-spanning domains. Other structural studies have shown that ED is an
extracellular domain and CL is a loop on the cytoplasmic side of the membrane.
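The running-average calculation described in the caption can be sketched in Python as follows. The scale is the published Kyte and Doolittle (1982) hydropathy index; the function name `hydropathy_profile` and the choice to report each average at the 1-based centre of its window are assumptions of this sketch, not details taken from the plot.

```python
# Kyte and Doolittle (1982) hydropathy index, one-letter amino acid codes.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=17):
    """Return (centre_position, mean_hydropathy) for every full window.

    Positions are 1-based, so residue 1 is the amino terminus.
    Residues outside the 20 standard amino acids raise KeyError.
    """
    seq = sequence.upper()
    if window < 1 or window > len(seq):
        raise ValueError("window must be between 1 and the sequence length")
    values = [KYTE_DOOLITTLE[aa] for aa in seq]
    half = window // 2
    profile = []
    for start in range(len(values) - window + 1):
        mean = sum(values[start:start + window]) / window
        profile.append((start + half + 1, mean))
    return profile


# First 60 residues of the Swiss-Prot example entry shown earlier.
EXAMPLE_SEQUENCE = (
    "MKFLILLFNILCLFPVLAADNHGVGPQGASGVDPITFDINSNQTGPAFLTAVEMAGVKYL"
)

for position, mean in hydropathy_profile(EXAMPLE_SEQUENCE, window=17):
    print(f"{position:3d} {mean:6.2f}")
```

A wider window smooths the curve, as the caption notes, at the cost of losing the first and last `window // 2` positions. Runs of strongly positive averages are the usual candidates for membrane-spanning segments.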
Leu: 3.800
Lys: -3.900
Met: 1.900
Phe: 2.800
Pro: -1.600
Ser: -0.800
Thr: -0.700
Trp: -0.900
Tyr: -1.300
Val: 4.200
Leu: 1.470
Lys: 0.090
Met: 1.420
Phe: 1.570
Pro: 0.540
Ser: 0.970
Thr: 1.080
Trp: 1.000
Tyr: 0.830
Val: 1.370
Janin (1979) [1]; Wolfenden et al. [2]; Kyte and Doolittle [3]; and Rose et al. [4].
1. J. Janin, Surface and Inside Volumes in Globular Proteins, Nature 277 (1979) 491-492.
2. R. Wolfenden, L. Andersson, P. Cullis and C. Southgate, Affinities of Amino Acid Side Chains for Solvent Water, Biochemistry 20 (1981) 849-855.
3. J. Kyte and R. Doolittle, A Simple Method for Displaying the Hydropathic Character of a Protein, J. Mol. Biol. 157 (1982) 105-132.
4. G. Rose, A. Geselowitz, G. Lesser, R. Lee and M. Zehfus, Hydrophobicity of Amino Acid Residues in Globular Proteins, Science 229 (1985) 834-838.
5. J. Cornette, K. B. Cease, H. Margalit, J. L. Spouge, J. A. Berzofsky and C. DeLisi, Hydrophobicity Scales and Computational Techniques for Detecting Amphipathic Structures in Proteins, J. Mol. Biol. 195 (1987) 659-685.
6. M. Charton and B. I. Charton, The Structural Dependence of Amino Acid Hydrophobicity Parameters, J. Theor. Biol. 99 (1982) 629-644.
Transmembrane prediction
http://marqusee9.berkeley.edu/kael/helical.htm
Accessibility
Accessibility: PHD
What else?
Secondary Structure
PHD (neural network by Burkhard Rost)
DSC (King and Sternberg)
SSP / Segment-oriented SS prediction [H]
NNSSP / Nearest-neighbor SS prediction [H]
SSPAL / Nearest-neighbor with local alignments SS and
Accessibility prediction [H]
PSITE / Search for Prosite patterns with statistics [H]
What else?