Note On COGs

Note on the COG database
S. Saengamnatdej
August 15, 2009
COG, which stands for Clusters of Orthologous Groups of
proteins, is a database that classifies predicted
proteins into groups of orthologues, i.e. homologous
genes that are derived by vertical descent from a single
ancestral gene in the last common ancestor of the
compared species and that typically have the same
function and domain architecture. It can be used to
annotate proteins in a new genome.
VERSIONS
Current version (Figure 1)
Initial version (entered through the link on the
current version page; Figure 2)
Figure 1 Current version of the COG database

Figure 2 Initial version of the COG database
FEATURES
1. 4873 COGs as reported in [1] (the database began with
720 COGs, then grew to 860 and to 2091 [2]; currently
3307 COGs [3]), including groups with known function(s)
such as chemotaxis proteins, predicted functions (e.g.
predicted extracellular nucleases), and uncharacterized
functions.
2. 138,458 proteins [1] (75% of the 185,505 predicted
proteins); currently 192,987 proteins [3].
3. 66 genomes of prokaryotes (starting from 5, then 6,
then 43 and, now, 63 genomes) and unicellular
eukaryotes (only S. cerevisiae at the beginning; now
3). COG coverage of most genomes is approaching
saturation (~80% of the genes of most free-living
prokaryotes belong to COGs).
4. 1% conserved phyletic patterns
5. Many new microbial genomes are being added
(See Figure 3).
Figure 3 The upcoming microbial genomes.
6. KOGs (eukaryotic orthologous groups)
6.1) Predicted orthologs for seven eukaryotic genomes:
3 animals (nematode, Caenorhabditis elegans; fruit fly,
Drosophila melanogaster; and human, Homo sapiens), 1
plant (thale cress, Arabidopsis thaliana), 2 fungi
(Saccharomyces cerevisiae and Schizosaccharomyces
pombe), and 1 parasite (Encephalitozoon cuniculi).
6.2) 4852 clusters
6.3) 59,838 proteins (54% of 110,655 gene products)
6.4) 20% conserved phyletic patterns (probably due to
the small number of included eukaryotic genomes)
6.5) KOG coverage is still far from saturation.
6.6) The upcoming eukaryote genomes are Oryza sativa
(rice), Anopheles gambiae (mosquito), Pan troglodytes
(chimpanzee), Canis familiaris (dog), Mus musculus
(mouse), Rattus norvegicus (rat), and Ascomycota
genomes including Magnaporthe grisea and Neurospora
crassa.
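
The coverage percentages quoted in the features above follow directly from the protein counts. A small Python check, using only the figures given in this note (the function name coverage_percent is just illustrative):

```python
def coverage_percent(clustered: int, total: int) -> float:
    """Percentage of predicted proteins that fall into orthologous groups."""
    return 100.0 * clustered / total

# Protein counts as quoted in this note.
cog_coverage = coverage_percent(138_458, 185_505)  # proteins in COGs
kog_coverage = coverage_percent(59_838, 110_655)   # proteins in KOGs

print(f"COG coverage: {cog_coverage:.1f}%")
print(f"KOG coverage: {kog_coverage:.1f}%")
```

Rounded to whole numbers, these give the 75% and 54% stated above.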
CONSTRUCTION
1. Mask the low-complexity and predicted coiled-coil
regions of the proteins.
2. Use the gapped BLAST programme for all-against-all
protein sequence comparison.
3. Detect the proteins with consistent BeTs (genome-
specific best hits).
4. Group the proteins to form COGs. (Each COG is
required to include proteins from at least three
sufficiently distant species.)
5. Manually split multidomain proteins (identified by
RPS-BLAST) into their component domains.
6. Manually curate and annotate the proteins.
7. Other groups
1) Proteins that consist solely of widespread,
"promiscuous" domains (e.g., SH2, SH3, WD40 repeats or
TPR repeats) and do not show clear-cut orthologous
relationships are assigned to Fuzzy Orthologous Groups
(FOGs).
2) Groups of proteins from only two genomes (TWOGs), as
a preliminary group.
3) Lineage-specific expansions (LSEs) of paralogs, as a
preliminary group.
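
Steps 3 and 4 can be illustrated with a toy sketch. In the published COG procedure, consistent BeTs form triangles (three proteins from three different genomes linked by best hits), and triangles that share a side are merged into one COG. The code below is a simplified illustration, not the actual pipeline: it assumes the BeTs have already been extracted from the all-against-all BLAST results, it requires every best hit to be reciprocal (an assumption made here for simplicity), and the protein names, the layout of the bet dictionary and the helper names are all hypothetical.

```python
from itertools import combinations

def genome_of(protein: str) -> str:
    """Proteins are named 'genome:id' in this toy example."""
    return protein.split(":")[0]

def find_triangles(bet):
    """Find sets of three proteins from three different genomes in which
    each protein is the genome-specific best hit (BeT) of the other two."""
    def mutual(a, b):
        return (bet.get(a, {}).get(genome_of(b)) == b
                and bet.get(b, {}).get(genome_of(a)) == a)

    triangles = []
    for a, b, c in combinations(sorted(bet), 3):
        three_genomes = len({genome_of(a), genome_of(b), genome_of(c)}) == 3
        if three_genomes and mutual(a, b) and mutual(b, c) and mutual(a, c):
            triangles.append(frozenset((a, b, c)))
    return triangles

def merge_triangles(triangles):
    """Merge triangles that share a side (two proteins) into clusters."""
    parent = list(range(len(triangles)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i, j in combinations(range(len(triangles)), 2):
        if len(triangles[i] & triangles[j]) >= 2:
            parent[find(i)] = find(j)

    clusters = {}
    for i, tri in enumerate(triangles):
        clusters.setdefault(find(i), set()).update(tri)
    return sorted(sorted(members) for members in clusters.values())

# Toy BeT table: bet[protein][other_genome] = best hit in that genome.
bet = {
    "A:p1": {"B": "B:p1", "C": "C:p1", "D": "D:p1"},
    "B:p1": {"A": "A:p1", "C": "C:p1", "D": "D:p1"},
    "C:p1": {"A": "A:p1", "B": "B:p1"},
    "D:p1": {"A": "A:p1", "B": "B:p1"},
    "A:p2": {"B": "B:p2"},  # reciprocal pair from two genomes only
    "B:p2": {"A": "A:p2"},
}

cogs = merge_triangles(find_triangles(bet))
print(cogs)  # [['A:p1', 'B:p1', 'C:p1', 'D:p1']]
```

Here the triangles A-B-C and A-B-D share the side A:p1/B:p1 and merge into a single four-member cluster, while the pair A:p2/B:p2 spans only two genomes and so forms no triangle.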
ASSIGNMENT OF A PROTEIN TO A GROUP.
1. With the annotated protein from a new species, run a
BLAST search against the COG database.
2. Use the COGNITOR programme to assign the protein to
one of the COGs.
3. If the protein cannot be assigned to any of the
groups, form a new COG.
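
The assignment step can be sketched as follows. This is a minimal illustration, not the COGNITOR programme itself: it assumes the genome-specific best hits (BeTs) of the query protein have already been found by BLAST, one per genome, and that the COG membership of each hit is known. The threshold of three BeTs is the one stated in the applications section of this note; the function name, the data layout and all identifiers are hypothetical.

```python
from collections import Counter

def assign_to_cog(bets, cog_of, min_bets=3):
    """Assign a query protein to a COG if enough of its BeTs agree.

    bets     : dict mapping genome -> best-hit protein of the query there
    cog_of   : dict mapping protein -> COG identifier (proteins that are
               not in any COG are simply absent)
    min_bets : number of BeTs that must fall into the same COG
               (three, as stated in this note)
    Returns the COG identifier, or None if no COG collects enough BeTs.
    """
    votes = Counter(cog_of[hit] for hit in bets.values() if hit in cog_of)
    if not votes:
        return None
    cog, count = votes.most_common(1)[0]
    return cog if count >= min_bets else None

# Hypothetical COG membership of database proteins.
cog_of = {"eco:1": "COG_X", "bsu:1": "COG_X", "mja:1": "COG_X", "sce:9": "COG_Y"}

# Query A has three BeTs in COG_X; query B has too few consistent BeTs.
bets_a = {"eco": "eco:1", "bsu": "bsu:1", "mja": "mja:1", "sce": "sce:9"}
bets_b = {"eco": "eco:1", "sce": "sce:9"}

print(assign_to_cog(bets_a, cog_of))  # COG_X
print(assign_to_cog(bets_b, cog_of))  # None
```

A result of None corresponds to step 3, where the protein is a candidate for a new COG.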
APPLICATIONS
1. Functional annotation of newly sequenced genomes.
The COGNITOR programme identifies orthologues without
a phylogenetic analysis of the entire set of
homologous proteins, which is time-consuming and
error-prone. Instead, it finds the protein in a target
genome that is most similar to a given protein from
the query genome (the BeT, or genome-specific best
hit). Three BeTs are used to assign the protein to a
COG.
2. Identify the protein(s) which are found in one
species but not in the others.
In some cases, an investigator may want to analyse the
differences between two species to determine which
proteins or gene products confer an organism's
characteristics. This can be done with a phyletic
pattern search over user-selected species.
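
A phyletic pattern search of this kind amounts to filtering groups by the presence or absence of member proteins in chosen species. The sketch below assumes a hypothetical mapping from group identifier to the set of species represented in that group; the identifiers, the species codes and the function name are made up for illustration and do not reflect the database's actual interface.

```python
def phyletic_search(groups, present, absent):
    """Return the ids of groups that have members in every species in
    `present` and in none of the species in `absent`."""
    present, absent = set(present), set(absent)
    return sorted(
        gid for gid, species in groups.items()
        if present <= species and not (absent & species)
    )

# Hypothetical groups and the species represented in each of them.
groups = {
    "G1": {"hsa", "dme", "cel"},
    "G2": {"hsa", "cel"},
    "G3": {"dme", "cel", "sce"},
    "G4": {"hsa"},
}

# Groups found in human ("hsa") but missing from fruit fly ("dme").
human_not_fly = phyletic_search(groups, present=["hsa"], absent=["dme"])
print(human_not_fly)  # ['G2', 'G4']
```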
EXAMPLE
After clicking on the linked eukaryotic clusters, you
are presented with a new page. On this page, click on
the programme "Kognitor", then paste in your protein
sequence (of any length, with a name) in FASTA format
and click the Compare button. The BLAST result page
will appear. Then click on the linked KOG entry
(KOG3378) to reach a page like the one shown in
Figure 4.
Figure 4 The result page for KOG entry (KOG3378)
This dendrogram should not be construed as a
phylogenetic tree because of its crudeness. The
result shows that there are 11 human paralogues, 24
worm paralogues, and 1 fruit fly paralogue. In
addition, the vertebrate and nematode globins form
co-orthologous sets.
REFERENCES
1. Tatusov, R. L. et al. (2003) The COG database: an
updated version includes eukaryotes. BMC
Bioinformatics, 4:41.
2. Tatusov, R. L. et al. (2000) The COG database: a
tool for genome-scale analysis of protein functions
and evolution. Nucleic Acids Research, 28(1), 33–36.
3. http://www.ncbi.nlm.nih.gov/COG (accessed August
2009).
