Basics: Representations of protein sequences and their applications

Jong Park

Amino Acids Representation


Ala alanine                    Met methionine
Asp aspartate                  Phe phenylalanine
Arg arginine                   Pro proline
Asn asparagine                 Ser serine
Cys cysteine                   Thr threonine
Glu glutamate                  Trp tryptophan
Gln glutamine                  Tyr tyrosine
Gly glycine                    Val valine
Glx glutamate or glutamine     *** any
His histidine                  --- gap of indeterminate length
Ile isoleucine                 TGA translation stop
Lys lysine                     TAG translation stop
Leu leucine                    TAA translation stop

Single Sequence representations


There are several commonly used pure sequence representation formats in flat files:
FASTA (most commonly used for raw sequence data)
PIR

Representations in Databases (such as MySQL)


As columns and rows
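
As a rough illustration of the "columns and rows" idea, the sketch below stores one sequence per row. It uses Python's built-in sqlite3 module as a stand-in for MySQL; the table name, column names, and the shortened residue string are assumptions made for this example only.

```python
import sqlite3

# In-memory SQLite database, used here only as a stand-in for a MySQL server.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per sequence, one column per attribute.
conn.execute(
    """
    CREATE TABLE protein_sequence (
        accession   TEXT PRIMARY KEY,
        description TEXT,
        residues    TEXT NOT NULL
    )
    """
)

# The residues are a shortened fragment, not a complete database entry.
conn.execute(
    "INSERT INTO protein_sequence VALUES (?, ?, ?)",
    ("CRAB_ANAPL", "ALPHA CRYSTALLIN B CHAIN", "MDITIHNPLIRRPLFSWLAP"),
)

# Query the stored row: accession and sequence length.
row = conn.execute(
    "SELECT accession, LENGTH(residues) FROM protein_sequence"
).fetchone()
print(row)
```

Keeping the residues in a single text column is the simplest layout; a real schema would probably add organism, source database, and cross-reference columns.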

Representations in programs or objects


@codons = $myCodonTable->revtranslate('A');

Flat file FASTA format


>gi|532319|pir|TVFV2E|TVFV2E envelope protein
CCTCTCGGAGCTGGAAATGCAGCTATTGAGATCTTCGAATGCTGC
AGCTGGAGGCGGAGGCAGCTGGGGAGGTCCGAGCGATGTGACC
GGCCGCCATCGCTCGTCTCTTCCTCTCTCCTGCCGCCTCCTGTGT
CGAAAATAACTTTTTTAGTCTAAAGAAAGAAAG

>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTLLL
SYSENRTAPTEVRRYTGGHERQKRVPFVXXXXXXXXXXXXXX

Accessing Bioperl CodonTable (from object oriented module)

use strict;
use warnings;
use Bio::Tools::CodonTable;

# defaults to ID 1 "Standard"
my $myCodonTable  = Bio::Tools::CodonTable->new();
my $myCodonTable2 = Bio::Tools::CodonTable->new( -id => 3 );

# change codon table
$myCodonTable->id(5);

# examine codon table (id(4) switches to table 4 before the name is read)
print join(' ', "The name of the codon table no.", $myCodonTable->id(4),
    "is:", $myCodonTable->name(), "\n");

# translate a codon (RNA, lower case and IUPAC ambiguity codes are accepted)
my $aa = $myCodonTable->translate('ACU');
$aa = $myCodonTable->translate('act');
$aa = $myCodonTable->translate('ytr');

# reverse translate an amino acid (one- or three-letter code, any case)
my @codons = $myCodonTable->revtranslate('A');
@codons = $myCodonTable->revtranslate('Ser');
@codons = $myCodonTable->revtranslate('Glx');
@codons = $myCodonTable->revtranslate('cYS', 'rna');

FASTA (flat file)


Sequences are expected to be represented in the standard IUB/IUPAC (International Union of Pure and Applied Chemistry) amino acid and nucleic acid codes.
Lower-case letters are accepted.
A single hyphen or dash can be used to represent a gap of indeterminate length.
In amino acid sequences, U and * are acceptable letters.
Numerical digits in the query sequence should either be removed or replaced by appropriate letter codes (e.g., N for an unknown nucleic acid residue or X for an unknown amino acid residue).
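
The rules above can be sketched in a few lines of Python. This is a minimal reader, not a full FASTA implementation: the function names `parse_fasta` and `clean_protein` are made up for this example, and replacing every unrecognised character (including digits) with X is one possible reading of the last rule.

```python
import re

def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

# One-letter protein codes listed in this deck, plus "*" (stop) and "-" (gap).
PROTEIN_CODES = set("ABCDEFGHIKLMNPQRSTUVWXYZ*-")

def clean_protein(seq):
    """Upper-case a protein sequence; digits and unknown symbols become X."""
    seq = re.sub(r"\s+", "", seq).upper()
    return "".join(c if c in PROTEIN_CODES else "X" for c in seq)

example = """>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTLLL
SYSENRTAPTEVRRYTGGHERQKRVPFVXXXXXXXXXXXXXX
"""

header, seq = parse_fasta(example)[0]
print(header)
print(seq)
print(clean_protein("mka1l-*"))  # lower case accepted, the digit becomes X
```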

Nucleic Acids FASTA


A --> adenosine          M --> A C (amino)
C --> cytidine           S --> G C (strong)
G --> guanosine          W --> A T (weak)
T --> thymidine          B --> G T C
U --> uridine            D --> G A T
R --> G A (purine)       H --> A C T
Y --> T C (pyrimidine)   V --> G C A
K --> G T (keto)         N --> A G C T (any)
X --> unknown            -  gap of indeterminate length
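
To make the ambiguity codes concrete, the following sketch expands an IUPAC-coded nucleotide string into every unambiguous sequence it can stand for. The dictionary mirrors the table above; treating X like N (any base) is an assumption of this example, and the function name `expand` is hypothetical.

```python
from itertools import product

# Ambiguity codes from the table above; X is assumed to behave like N.
IUPAC_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "GA", "Y": "TC", "K": "GT", "M": "AC", "S": "GC", "W": "AT",
    "B": "GTC", "D": "GAT", "H": "ACT", "V": "GCA",
    "N": "AGCT", "X": "AGCT",
}

def expand(seq):
    """List every unambiguous sequence matched by an IUPAC-coded sequence."""
    choices = [IUPAC_NUCLEOTIDES[base] for base in seq.upper()]
    return ["".join(combo) for combo in product(*choices)]

print(expand("ytr"))  # ['TTG', 'TTA', 'CTG', 'CTA']
```

All four expansions of YTR are leucine codons in the standard genetic code, which is why an ambiguous codon such as 'ytr' can still be translated to a single amino acid.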

Protein sequences in FASTA


A alanine                      P proline
B aspartate or asparagine      Q glutamine
C cysteine                     R arginine
D aspartate                    S serine
E glutamate                    T threonine
F phenylalanine                U selenocysteine
G glycine                      V valine
H histidine                    W tryptophan
I isoleucine                   Y tyrosine
K lysine                       Z glutamate or glutamine
L leucine                      X any
M methionine                   * translation stop
N asparagine                   - gap of indeterminate length

PIR (NBRF) sequence format


>P1;CRAB_ANAPL
ALPHA CRYSTALLIN B CHAIN (ALPHA(B)-CRYSTALLIN).
MDITIHNPLIRRPLFSWLAPSRIFDQIFGEHLQESELLPASP
LSPFLMRSPIFRMPSWL ETGLSEMRLE KDKFSVNLDV
KHFSPEELKVKVLGDMVEIHGKHEERQDEHGFIAREFNR
KYRIPADVDPL TITSSLSLDG VLTVSAPRKQ SDVPERSIP
TREEKPAIAG AQRK*

PIR format
A sequence in PIR format consists of:
1. One line starting with
   a. a ">" (greater-than) sign, followed by
   b. a two-letter code describing the sequence type (P1, F1, DL, DC, RL, RC, or XX), followed by
   c. a semicolon, followed by
   d. the sequence identification code (the database ID code).
2. One line containing a textual description of the sequence.
3. One or more lines containing the sequence itself. The end of the sequence is marked by a "*" (asterisk) character.
A file in PIR format may comprise more than one sequence.
The PIR format is also often referred to as the NBRF format.
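
The three-part layout described above maps directly onto a small reader. The sketch below is one possible way to do it; `parse_pir` is a name chosen for this example, and the sample entry is a shortened, illustrative fragment rather than the full database record.

```python
def parse_pir(text):
    """Parse PIR/NBRF text into a list of dicts, one per sequence."""
    lines = [ln.rstrip() for ln in text.splitlines()]
    entries = []
    i = 0
    while i < len(lines):
        if not lines[i].startswith(">"):
            i += 1
            continue
        # Line 1: ">", two-letter type code, ";", identification code.
        seq_type, _, seq_id = lines[i][1:].partition(";")
        # Line 2: free-text description.
        description = lines[i + 1].strip() if i + 1 < len(lines) else ""
        i += 2
        # Remaining lines: residues up to the terminating "*".
        residues = []
        while i < len(lines) and not lines[i].startswith(">"):
            chunk = "".join(lines[i].split())  # drop spaces between blocks
            i += 1
            if "*" in chunk:
                residues.append(chunk.split("*", 1)[0])
                break
            residues.append(chunk)
        entries.append({
            "type": seq_type,
            "id": seq_id,
            "description": description,
            "sequence": "".join(residues),
        })
    return entries

example = """>P1;CRAB_ANAPL
ALPHA CRYSTALLIN B CHAIN.
MDITIHNPLI RRPLFSWLAP
TREEKPAIAG AQRK*
"""

entry = parse_pir(example)[0]
print(entry["type"], entry["id"], entry["sequence"])
```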

GenBank style (flat file)

1-------10--------20--------30--------40--------50--------60--------70------78

LOCUS       ABCAARAA_1
DEFINITION  A.aceti acetic acid resistance protein (aarA) gene, complete cds;
            acetic acid resistance protein (aarA).
DATE        15-SEP-1990
ACCESSION   M34830
ORGANISM    Acetobacter aceti
            Eubacteria; Proteobacteria; alpha subdivision; Acetobacteraceae;
            Acetobacter.
COMMENT     CDS 185..1495
            /db_xref="PID:g141730"
WEIGHT      48238
LENGTH      436
ORIGIN      Translated using phase 1
        1 MSASQKEGKL STATISVDGK SAEMPVLSGT LGPDVIDIRK LPAQLGVFTF DPGYGETAAC
       61 NSKITFIDGD KGVLLHRGYP IAQLDENASY EEVIYLLLNG ELPNKVQYDT FTNTLTNHTL
      121 LHEQIRNFFN GFRRDAHPMA ILCGTVGALS AFYPDANDIA IPANRDLAAM RLIAKIPTIA
      181 AWAYKYTQGE AFIYPRNDLN YAENFLSMMF ARMSEPYKVN PVLARAMNRI LILHADHEQN
      241 ASTSTVRLAG STGANPFACI AAGIAALWGP AHGGANEAVL KMLARIGKKE NIPAFIAQVK
      301 DKNSGVKLMG FGHRVYKNFD PRAKIMQQTC HEVLTELGIK DDPLLDLAVE LEKIALSDDY
      361 FVQRKLYPNV DFYSGIILKA MGIPTSMFTV LFAVARTTGW VSQWKEMIEE PGQRISRPRQ
      421 LYIGAPQRDY VPLAKR
//

EMBL style

ID   CM23SRIBR  converted; DNA; UNC; 805 BP.
XX
AC   X80636;
XX
DT   22-MAR-1995
XX
DE   C.mucosalis gene for 23S ribosomal RNA (fragment)
XX
OS   Campylobacter mucosalis
XX
CC   SEQIO retrieval from EMBL-format entry.  07-Feb-1996
XX
SQ   Sequence 805 BP; 226 A; 158 C; 224 G; 194 T; 3 other;
     gattctgcgc ggaaaatata acggggctaa aatgagtacc gaagctttag acttagtttt        60
     actaagtggt aggagcgttc tattcagcgt tgaaggtgta ccggtaagga gcgctggagc       120
     ggatagaagt gagcatgcag gcatgagtag cgataattgg ggtgagaatc cccaacgccg       180
     taarcccaag gtttcctacg cgatgctcgt catcgtaggg ttagccgggt cctaagcaaa       240
     gtccgaaagg ggtatgcgat ggaaaattgg ttaatattcc aatgccaaca ttattgtgcg       300
     atggaaggac gcttagagtt aaaggagcca gctgatggaa gtgctggtcg aaaggtgtag       360
     gttgagttac aggcaaatcc gtaactcttt atccgagacc ccacaggcgt ttgaagttct       420
     tcggaatgga tgacgaatcc ttgatactgt cgagccaaga aaagtttcta agtttagata       480
     atgttgcccg taccgtaaac cgacacaggt gggtgggatg agtattctaa ggcgcgtgga       540
     agaactctct tcaaggaact ctgcaaaata gcaccgtatc ttcggtataa ggtgtgccta       600
     actttgtgaa ggatttactc cgtaagcatt gaaggttaca acaaagagtc cctcccgact       660
     gtttaccaaa aacacagcac tctgctaact cgtaagagga tgtatagggt gtgacgcctg       720
     cccggtgctc gaaggttaat tgatggggty agcagyaatg cgaagctctt gatcgaagcc       780
     cgagtaaacg gccgccgtaa ctata                                             805
//

Swissprot style

ID   104K_THEPA CONVERTED; PRT; 924 AA.
AC   P15711;
DT   01-AUG-1992
DE   104 KD MICRONEME-RHOPTRY ANTIGEN.
OS   THEILERIA PARVA.
CC   -!- DEVELOPMENTAL STAGE: SPOROZOITE ANTIGEN.
CC   -!- SUBCELLULAR LOCATION: IN MICRONEME/RHOPTRY COMPLEXES.
CC   SEQIO retrieval from Swiss-Prot database entry.
CC   07-Feb-1996
SQ   SEQUENCE 924 AA;
     MKFLILLFNI LCLFPVLAAD NHGVGPQGAS GVDPITFDIN SNQTGPAFLT AVEMAGVKYL
     QVQHGSNVNI HRLVEGNVVI WENASTPLYT GAIVTNNDGP YMAYVEVLGD PNLQFFIKSG
     DAWVTLSEHE YLAKLQEIRQ AVHIESVFSL NMAFQLENNK YEVETHAKNG ANMVTFIPRN
     GHICKMVYHK NVRIYKATGN DTVTSVVGFF RGLRLLLINV FSIDDNGMMS NRYFQHVDDK
     YVPISQKNYE TGIVKLKDYK HAYHPVDLDI KDIDYTMFHL ADATYHEPCF KIIPNTGFCI
     TKLFDGDQVL YESFNPLIHC INEVHIYDRN NGSIICLHLN YSPPSYKAYL VLKDTGWEAT
     THPLLEEKIE ELQDQRACEL DVNFISDKDL YVAALTNADL NYTMVTPRPH RDVIRVSDGS
     EVLWYYEGLD NFLVCAWIYV SDGVASLVHL RIKDRIPANN DIYVLKGDLY WTRITKIQFT
     QEIKRLVKKS KKKLAPITEE DSDKHDEPPE GPGASGLPPK APGDKEGSEG HKGPSKGSDS
     SKEGKKPGSG KKPGPAREHK PSKIPTLSKK PSGPKDPKHP RDPKEPRKSK SPRTASPTRR
     PSPKLPQLSK LPKSTSPRSP PPPTRPSSPE RPEGTKIIKT SKPPSPKPPF DPSFKEKFYD
     DYSKAASRSK ETKTTVVLDE SFESILKETL PETPGTPFTT PRPVPPKRPR TPESPFEPPK
     DPDSPSTSPS EFFTPPESKR TRFHETPADT PLPDVTAELF KEPDVTAETK SPDEAMKRPR
     SPSEYEDTSP GDYPSLPMKR HRLERLRLTT TEMETDPGRM AKDASGKPVK LKRSKSFDDL
     TTVELAPEPK ASRIVVDDEG TEADDEETHP PEERQKTEVR RRRPPKKPSK SPRPSKPKKP
     KKPDSAYIPS ILAILVVSLI VGIL
//

Sequence profile/model representations
Models: Hidden Markov Models
Profiles: A propensity mapping of multiple sequences.
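
One simple way to picture a profile is as a table of residue frequencies per alignment column. The sketch below builds such a table from a toy alignment; the sequences are invented for illustration, `build_profile` is a hypothetical name, and real profile methods typically add sequence weighting, pseudocounts, and log-odds scoring, none of which is shown here.

```python
from collections import Counter

def build_profile(aligned):
    """Return per-column residue frequencies for equal-length aligned sequences."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("aligned sequences must have the same length")
    profile = []
    for column in zip(*aligned):
        counts = Counter(column)
        profile.append({res: n / len(aligned) for res, n in counts.items()})
    return profile

# Invented toy alignment; "-" marks a gap.
alignment = ["MKVL", "MRVL", "MKIL", "MKV-"]
profile = build_profile(alignment)

for position, frequencies in enumerate(profile, start=1):
    print(position, frequencies)
```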

HMM

Some other simple representations of protein sequences

1. Hydrophobicity (Hydropathy plot)
Hydrophobicity is a well-known and
important general characteristic of proteins.
This simple characteristic lets us predict
a relatively large number of protein features.
Transmembrane regions, the core of the protein,
and secondary structural information can be
inferred from it. For secondary structure,
alternating hydrophobic and hydrophilic
residues indicate beta sheets. In some cases,
by looking at the hydrophobicity, it is possible
to predict the orientation of the surfaces of
transmembrane helices.

1. Hydrophobic and Hydrophilic regions
http://www.bmb.psu.edu/nixon/webtools/hydro/default.htm
PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT

2. Accessibility
Knowing which parts of a protein are
accessible or buried gives us
additional information.
Various accessibility prediction
algorithms exist.

3. Secondary Structure representation

Kyte Doolittle Hydropathy plot

This is the hydropathy plot for the acetylcholine receptor protein. More positive
values are increasingly hydrophobic in character. The plot was calculated using the
Kyte and Doolittle (1982) hydropathy values for each amino acid
and a running average over 17 neighbouring amino acids. The large window helps to
make the curve a little smoother. Amino acid number 1 is at the amino terminus. The
segment labeled LSS is the leader sequence, and M1, M2, M3 and M4 are putative
membrane-spanning domains. Other structural studies have shown that ED is an
extracellular domain and CL is a loop on the cytoplasmic side of the membrane.

Kyte and Doolittle H.P. Scale


Amino acid scale: Hydropathicity.
Author(s): Kyte J., Doolittle R.F.
Reference: J. Mol. Biol. 157:105-132(1982).
Amino acid scale values:
Ala: 1.800
Arg: -4.500
Asn: -3.500
Asp: -3.500
Cys: 2.500
Gln: -3.500
Glu: -3.500
Gly: -0.400
His: -3.200
Ile: 4.500

Leu: 3.800
Lys: -3.900
Met: 1.900
Phe: 2.800
Pro: -1.600
Ser: -0.800
Thr: -0.700
Trp: -0.900
Tyr: -1.300
Val: 4.200
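
A hydropathy plot of the kind described earlier is a running average of these scale values along the sequence. The sketch below computes that average with the Kyte and Doolittle values listed above, keyed by one-letter code. The function name `hydropathy_profile`, the default window of 17, and reporting each average at the window centre are choices made for this example; the test sequence is the one shown on the "Hydrophobic and Hydrophilic regions" slide.

```python
# Kyte and Doolittle (1982) hydropathy values, keyed by one-letter code.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def hydropathy_profile(seq, window=17):
    """Return (centre position, mean hydropathy) for each full window."""
    seq = seq.upper()
    if window < 1 or window > len(seq):
        raise ValueError("window must be between 1 and the sequence length")
    scores = [KYTE_DOOLITTLE[residue] for residue in seq]
    half = window // 2
    profile = []
    for start in range(len(seq) - window + 1):
        mean = sum(scores[start:start + window]) / window
        profile.append((start + half + 1, mean))  # 1-based centre position
    return profile

seq = "PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT"
profile = hydropathy_profile(seq, window=17)
for position, mean in profile[:5]:
    print(position, round(mean, 2))
```

Smaller windows give a noisier curve that follows local features more closely, while larger windows smooth it; positions closer to either end than half a window get no value in this sketch.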

Another way to calculate hydrophobicity


ProtScale Tool
Amino acid scale: Membrane buried helix parameter.
Author(s): Rao M.J.K., Argos P.
Reference: Biochim. Biophys. Acta 869:197-214(1986).
Amino acid scale values:
Ala: 1.360
Arg: 0.150
Asn: 0.330
Asp: 0.110
Cys: 1.270
Gln: 0.330
Glu: 0.250
Gly: 1.090
His: 0.680
Ile: 1.440

Leu: 1.470
Lys: 0.090
Met: 1.420
Phe: 1.570
Pro: 0.540
Ser: 0.970
Thr: 1.080
Trp: 1.000
Tyr: 0.830
Val: 1.370

Major Hydrophobic scales


(1) Janin (1979) [1];
(2) Wolfenden et al. [2];
(3) Kyte and Doolittle [3]; and
(4) Rose et al. [4].

1. J. Janin, Surface and Inside Volumes in Globular Proteins, Nature 277(1979)491-492.
2. R. Wolfenden, L. Andersson, P. Cullis and C. Southgate, Affinities of Amino Acid Side Chains for Solvent Water, Biochemistry 20(1981)849-855.
3. J. Kyte and R. Doolittle, A Simple Method for Displaying the Hydropathic Character of a Protein, J. Mol. Biol. 157(1982)105-132.
4. G. Rose, A. Geselowitz, G. Lesser, R. Lee and M. Zehfus, Hydrophobicity of Amino Acid Residues in Globular Proteins, Science 229(1985)834-838.
5. J. Cornette, K. B. Cease, H. Margalit, J. L. Spouge, J. A. Berzofsky and C. DeLisi, Hydrophobicity Scales and Computational Techniques for Detecting Amphipathic Structures in Proteins, J. Mol. Biol. 195(1987)659-685.
6. M. Charton and B. I. Charton, The Structural Dependence of Amino Acid Hydrophobicity Parameters, J. Theor. Biol. 99(1982)629-644.

Why do the H.P. scales differ between researchers?
Mainly because proteins are very
different from each other.
Ideally, we should make protein-family-specific
H.P. scales.

Transmembrane prediction

If we add some secondary structural information,
the prediction of transmembrane regions becomes more powerful.

Let's add some other simple ways of representing proteins
2. Accessibility and secondary structural information are
very useful for many bioinformatic analyses.
Helical wheels are often convenient for visualization.

http://marqusee9.berkeley.edu/kael/helical.htm
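
A helical wheel places each residue around a circle as seen down the helix axis. The sketch below computes those angular positions, assuming an ideal alpha helix with 3.6 residues per turn, i.e. 100 degrees between consecutive residues. The function name `helical_wheel` and the short peptide are made up for illustration, and no drawing is attempted.

```python
def helical_wheel(seq, degrees_per_residue=100.0):
    """Return (position, residue, angle in degrees) for an ideal alpha helix."""
    return [
        (index + 1, residue, (index * degrees_per_residue) % 360.0)
        for index, residue in enumerate(seq)
    ]

# Invented seven-residue peptide, for illustration only.
wheel = helical_wheel("LKELLKK")
for position, residue, angle in wheel:
    print(position, residue, angle)
```

Residues whose angles cluster on one side of the circle share a face of the helix, which is what makes the wheel useful for spotting a hydrophobic or hydrophilic surface.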

Accessibility
PHD
?? What else ??
Secondary Structure
PHD (neural network by Burkhard Rost)
DSC (King and Sternberg)
SSP / Segment-oriented SS prediction
NNSSP / Nearest-neighbor SS prediction
SSPAL / Nearest-neighbor with local alignments SS and accessibility prediction
PSITE / Search for Prosite patterns with statistics
?? What else ??

More than one residue representation of amino acids.
