
Note on
The COG database

S. Saengamnatdej
August 15, 2009

COG, which stands for Clusters of Orthologous Groups
of proteins, is a database that classifies predicted
proteins into groups of orthologues: homologous genes
that are derived by vertical descent from a single
ancestral gene in the last common ancestor of the
compared species and that typically have the same
function and domain architecture. It can be used to
annotate proteins in a new genome.

VERSIONS
Current version (Figure 1)
Initial version (enter through the link on the
current version page, Figure 2.)

Figure 1 Current version of the COG database



Figure 2 Initial version of the COG database

FEATURES

1. 4873 COGs as reported [1] (starting from 720, then
860, then 2091 [2]; currently 3307 COGs [3]),
including groups with known function(s) such as
chemotaxis proteins, predicted functions (e.g.
predicted extracellular nucleases), and
uncharacterized functions.
2. 138,458 proteins [1] (75% of the 185,505
predicted proteins); currently 192,987 proteins [3].
3. 66 genomes: prokaryotes (starting from 5, then 6,
then 43 and, now, 63 genomes) and unicellular
eukaryotes (only S. cerevisiae at the beginning and,
now, 3). COG coverage of most genomes is
approaching saturation (~80% of the genes of most
free-living prokaryotes belong to COGs).
4. 1% conserved phyletic patterns
5. Many new microbial genomes are being added
(See Figure 3).

Figure 3 The upcoming microbial genomes.

6. KOGs (eukaryotic orthologous groups)
6.1) Predicted orthologs for seven eukaryotic
genomes, including 3 animals (nematode,
Caenorhabditis elegans; fruit fly, Drosophila
melanogaster; and human, Homo sapiens), 1 plant
(thale cress, Arabidopsis thaliana), 2 fungi
(Saccharomyces cerevisiae; Schizosaccharomyces
pombe), and 1 parasite (Encephalitozoon cuniculi).
6.2) 4852 clusters
6.3) 59,838 proteins (54% of 110,655 gene products)
6.4) 20% conserved phyletic patterns (probably
due to the small number of included eukaryotic
genomes)
6.5) KOG coverage is still far from saturation.
6.6) The upcoming eukaryote genomes are Oryza
sativa (rice), Anopheles gambiae (mosquito), Pan
troglodytes (chimpanzee), Canis familiaris (dog), Mus
musculus (mouse), Rattus norvegicus (rat), and
Ascomycota genomes including Magnaporthe grisea and
Neurospora crassa.
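
The coverage percentages quoted in the feature list can
be checked with simple arithmetic. The following is a
minimal sketch that uses only the protein counts cited
from reference [1]; the function name coverage_percent
is illustrative and not part of any COG tool.

```python
def coverage_percent(in_groups: int, total: int) -> float:
    """Percentage of predicted proteins that fall into orthologous groups."""
    return 100.0 * in_groups / total


# Counts quoted in the FEATURES list (reference [1]).
cog_coverage = coverage_percent(138_458, 185_505)
kog_coverage = coverage_percent(59_838, 110_655)

print(f"COG coverage: {cog_coverage:.1f}%")  # rounds to the quoted 75%
print(f"KOG coverage: {kog_coverage:.1f}%")  # rounds to the quoted 54%
```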
CONSTRUCTION

1. Mask the low-complexity and predicted coiled-coil
regions of the proteins.
2. Use the gapped BLAST programme for all-against-all
protein sequence comparison.
3. Detect the proteins with consistent BeTs (genome-
specific best hits).
4. Group the proteins to form COGs. (It's required
that each COG includes proteins from at least three
sufficiently distant species.)
5. Manually split multidomain proteins (identified by
RPS-BLAST) into the component domains.
6. Manually curate and annotate the proteins.
7. Other groups
1) Fuzzy Orthologous Groups (FOGs): proteins that
consist solely of widespread, "promiscuous" domains
(e.g., SH2, SH3, WD40 repeats or TPR repeats) and do
not show clear-cut orthologous relationships.
2) TWOGs: candidate groups represented in only two
genomes (a preliminary group).
3) Lineage-specific expansions (LSEs) of paralogs
(a preliminary group).
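
Steps 3 and 4 can be illustrated with a toy example. The
sketch below is not the actual COG pipeline: it assumes
that "consistent BeTs" means reciprocal best hits among
three proteins from three different genomes (a
triangle), and that triangles sharing a side are merged
into one group. The function names, protein ids and
genome labels are all hypothetical, and the brute-force
search is only suitable for very small inputs.

```python
from itertools import combinations


def find_triangles(genome_of, best_hit):
    """Return sets of 3 proteins from 3 genomes that are all reciprocal BeTs.

    genome_of: maps protein id -> genome label
    best_hit:  maps (protein id, target genome) -> best-hit protein id
    """
    def is_bet(a, b):
        # True if b is a's best hit in b's genome.
        return best_hit.get((a, genome_of[b])) == b

    triangles = []
    for trio in combinations(sorted(genome_of), 3):
        if len({genome_of[p] for p in trio}) < 3:
            continue  # the three proteins must come from three genomes
        if all(is_bet(a, b) and is_bet(b, a) for a, b in combinations(trio, 2)):
            triangles.append(frozenset(trio))
    return triangles


def merge_triangles(triangles):
    """Merge triangles that share a side (two proteins) into larger groups."""
    groups = [set(t) for t in triangles]
    merged = True
    while merged:
        merged = False
        for i, j in combinations(range(len(groups)), 2):
            if len(groups[i] & groups[j]) >= 2:
                groups[i] |= groups[j]
                del groups[j]
                merged = True
                break  # restart the scan after every merge
    return groups


# Toy data: four mutually best-hitting proteins plus one outsider (a2).
genome_of = {"a1": "A", "b1": "B", "c1": "C", "d1": "D", "a2": "A"}
best_hit = {}
for p, q in combinations(["a1", "b1", "c1", "d1"], 2):
    best_hit[(p, genome_of[q])] = q
    best_hit[(q, genome_of[p])] = p
best_hit[("a2", "B")] = "b1"  # one-way hit only, so a2 forms no triangle

groups = merge_triangles(find_triangles(genome_of, best_hit))
print(groups)
```

In this toy run the four reciprocal proteins end up in a
single group, while a2 is left out because its hit to b1
is not returned.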

ASSIGNMENT OF A PROTEIN TO A GROUP.

1. Take the annotated protein from a new species and
run a BLAST search against the COG database.
2. Use the COGNITOR programme to assign the protein to
a COG.
3. If the protein cannot be assigned to any existing
group, form a new COG.
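
The assignment logic can be sketched as a simple vote
over genome-specific best hits. This is an illustration
only and not the COGNITOR implementation: it assumes
the three-BeT rule stated under APPLICATIONS, and the
function name assign_to_cog, the group ids and the
protein ids are hypothetical.

```python
from collections import Counter


def assign_to_cog(bets, cog_of, min_bets=3):
    """Return the group supported by at least `min_bets` best hits, else None.

    bets:   maps genome label -> the query's best-hit protein in that genome
    cog_of: maps protein id -> group id (proteins in no group are absent)
    """
    votes = Counter(cog_of[hit] for hit in bets.values() if hit in cog_of)
    if not votes:
        return None
    group, count = votes.most_common(1)[0]
    return group if count >= min_bets else None


# Hypothetical memberships and best hits for one query protein.
cog_of = {"e1": "COG_X", "b1": "COG_X", "h1": "COG_X", "m1": "COG_Y"}
bets = {"genome1": "e1", "genome2": "b1", "genome3": "h1", "genome4": "m1"}

print(assign_to_cog(bets, cog_of))  # three hits agree on COG_X
print(assign_to_cog({"genome1": "e1", "genome2": "b1"}, cog_of))  # too few hits
```

A return value of None corresponds to step 3: the protein
fits no existing group and is a candidate for a new one.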
APPLICATIONS

1. Functional annotation of newly sequenced genomes.
The COGNITOR programme identifies orthologues
without a phylogenetic analysis of the entire set of
homologous proteins, which is time-consuming and
error-prone. Instead, it finds the protein in a target
genome that is most similar to a given protein from
the query genome. Three BeTs (genome-specific best
hits) are used to assign the protein to a COG.

2. Identify the protein(s) that are found in one
species but not in others.
In some cases, an investigator may want to
analyse the differences between two species to
determine which proteins or gene products confer
an organism's characteristics.
This can be done with a phyletic pattern search
over user-selected species.
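
A phyletic pattern search amounts to filtering groups by
the set of species they contain. The sketch below is a
minimal illustration and assumes that each group is
stored with the set of species represented in it; the
function name phyletic_search, the group ids and the
species labels are hypothetical.

```python
def phyletic_search(patterns, present, absent):
    """Return ids of groups found in every `present` species and no `absent` one.

    patterns: maps group id -> set of species represented in that group
    """
    present, absent = set(present), set(absent)
    return sorted(
        group
        for group, species in patterns.items()
        if present <= species and not (absent & species)
    )


# Hypothetical phyletic patterns for three groups.
patterns = {
    "G1": {"sp_a", "sp_b", "sp_c"},
    "G2": {"sp_a", "sp_c"},
    "G3": {"sp_b"},
}

# Groups present in sp_a but missing from sp_b.
print(phyletic_search(patterns, present=["sp_a"], absent=["sp_b"]))
```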

EXAMPLE

After clicking on the linked eukaryotic clusters,
you are presented with a new page. On this page,
click on the programme "Kognitor", then paste your
protein sequence (of any length, with a name) in
FASTA format and click the Compare button. The BLAST
result page will appear. Then click on the linked KOG
entry (KOG3378) to see a page like the one shown in
Figure 4.
Figure 4 The result page for KOG entry (KOG3378)

This dendrogram should not be construed as a
phylogenetic tree because of its crudeness. The
result shows that there are 11 human paralogues, 24
worm paralogues, and 1 fruit fly paralogue. In
addition, the vertebrate and nematode globins
comprise co-orthologous sets.

REFERENCES
1. Tatusov, R. L. et al. (2003) The COG database: an
updated version includes eukaryotes. BMC
Bioinformatics, 4:41.
2. Tatusov, R. L. et al. (2000) The COG database: a
tool for genome-scale analysis of protein functions
and evolution. Nucleic Acids Research, Vol. 28,
No. 1, 33–36.
3. http://www.ncbi.nlm.nih.gov/COG accessed August
2009.
