
For Project X, we get reads/assemblies (contigs/genome scaffolds) from GenBank/Trace.

CDS and Proteins and ncRNAs are predicted from all these datasets.

For download from the publication page, we ONLY provide these GenBank-derived
datasets AND user-provided annotated data (if any), and other custom datasets provided
by the authors.

For BLAST db, we produce

• (1) Project X: Reads
• (2) Project X: CDS from reads
• (3) Project X: Predicted proteins from reads
• (4) Project X: Assemblies
• (5) Project X: CDS from assemblies
• (6) Project X: Predicted proteins from assemblies
• (7) Project X: ncRNAs
• We UPDATE: All Metagenomic reads, All Metagenomic CDS, All Metagenomic
predicted proteins, and All Metagenomic ncRNAs. (NOTE: I'm assuming that
the new annotation pipeline does something more sophisticated and generates
CDS and predicted proteins, as opposed to the 6-frame translation ORFs and
peptides we have now for GOS and HOT. If so, we'll have to run the GOS and
HOT data through this pipeline as well and replace those datasets with
CDS and proteins.)
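
For reference, the 6-frame translation mentioned in the note above can be sketched as follows. This is only a minimal illustration of the general technique, assuming the standard genetic code; it is not the code used to build the existing GOS/HOT ORF and peptide sets, and the function names are made up for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption:
# the standard table applies; "*" marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons (e.g. with N) become X."""
    seq = seq.upper()
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq):
    """Translate all three offsets on both strands, keyed +1..+3 and -1..-3."""
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames
```

A gene-prediction pipeline would instead call specific CDS on one strand and frame, which is why the existing ORF/peptide sets would need replacing rather than merging.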

Annotation
• All reads available (via either GenBank or the Trace Archive) that are longer than 250 bp
on average should be annotated. How well does the pipeline work for such short sequences?
How well does it work for Sanger sequences, for that matter? What is the relative utility?
What to do with ESTs? GSS?
• If annotation is available in GenBank, it should be retrieved and discussed. Currently not
available, with one exception: the Leptospirillum assembly from AMD, where 8 genome scaffold
sequences with predicted proteins have been deposited as a separate project from the
metagenomic projects.
• If an environmental dataset has scaffolds for organisms deposited in GenBank, should we
treat it like an organism? Like an environmental set? As an environmental set, I think.
Depends on whether this data makes it into GenBank NR (and hence CAMERA nIAA) or not.
• Can we annotate contigs/scaffolds via our metagenomic annotation pipeline? Or should
we be using the prok pipeline?
• Which predicted proteins should be included in clusters?
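
The 250 bp rule in the first bullet above could be checked with something like the sketch below. It assumes the rule is applied per dataset (mean read length over the whole FASTA file) rather than per read, which the notes do not spell out. The function names and the `min_mean_length` parameter are hypothetical.

```python
def read_lengths(fasta_lines):
    """Return the length of every record in an iterable of FASTA lines."""
    lengths = []
    current = None  # length of the record being read, None before first header
    for line in fasta_lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def should_annotate(fasta_lines, min_mean_length=250):
    """True if the dataset's mean read length exceeds the threshold (in bp)."""
    lengths = read_lengths(fasta_lines)
    if not lengths:
        return False
    return sum(lengths) / len(lengths) > min_mean_length
```

For a file on disk, `should_annotate(open(path))` would work, since a file object iterates over its lines.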

Blastable Datasets
• Reads should be added to All Metagenomic Reads; same for ORFs, peptides, ncRNAs
• Contigs and assemblies should be added to "All Metagenomic Assemblies". No such db
exists at present, since GOS is the only project with assemblies. When/if we DO provide
this, it should only contain "site-specific" assemblies.
• If available, mapping between reads and contigs should be absorbed
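
The routing described in this list could be written down as a small lookup, sketched below. The aggregate db names follow the wording of these notes, but the exact strings, the data-type keys, and the `site_specific` flag are assumptions for illustration only.

```python
# Hypothetical mapping from per-project data type to the aggregate BLAST db
# it should be appended to (names follow the notes, not an actual config).
AGGREGATE_DB = {
    "reads": "All Metagenomic Reads",
    "orfs": "All Metagenomic ORFs",
    "peptides": "All Metagenomic Peptides",
    "ncrnas": "All Metagenomic ncRNAs",
}


def aggregate_db_for(data_type, site_specific=True):
    """Return the aggregate db name for a data type, or None if it has none.

    Contigs/assemblies only go into "All Metagenomic Assemblies" when they
    are site-specific, per the rule in the list above.
    """
    key = data_type.lower()
    if key in ("contigs", "assemblies"):
        return "All Metagenomic Assemblies" if site_specific else None
    return AGGREGATE_DB.get(key)
```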

New datasets:
Only one new (CAMERA-relevant-maybe) project is available for update: Termite gut
metagenome - No traces deposited, data is 1337 fosmid clone seqs, 1 WGS
entry (55,108 contigs), and 48 glycoside hydrolase family genes. Contacted JGI about
the traces.
http://www.ncbi.nlm.nih.gov/sites/entrez?db=genomeprj&cmd=Retrieve&dopt=Overview&list_uids=19107?
