Abstract
2. Fundamentals of the Topic
To create and describe this data model, NCBI chose ASN.1 (ISO 8824, 8825), Abstract Syntax Notation One, which defines how to describe objects and the functions that operate on them. The main structure of the NCBI model is a biological sequence (Bioseq), which carries various markers and indicators.
When sequences are defined, they aggregate data such as locus, definitions, origin, the publications in which they appear, and features, among other fields that describe the sequence as a whole; some of these are optional. We will look in detail at the mandatory fields of a sequence, what they mean, and what information they contain.
1. Introduction
"There are two main reasons to put [biological] data into a computer: retrieval and discovery" (BAXEVANIS, 2001) [1]. Large centralized data repositories such as SWISSPROT and GenBank, managed through data-maintenance techniques, exemplify the challenge laboratories face in handling data at large scale. The avalanche of data generated, particularly biological sequences and, more recently, structural, transcriptional, interaction, and genetic data, led to the adoption of unsupervised, automated tools for analysing biological data during the 1990s [3, 4].
2.2. BIOSEQ
The main structure of the NCBI model is a biological sequence, or Bioseq. It represents a single, continuous molecule of nucleic acid or protein. It also records the physical type of the molecule (DNA, RNA, or protein), and it may be fully instantiated (that is, we have the data for every residue) or only partially instantiated (for example, we know the fragment is 10 kilobases long but have the data for only 1 kilobase). In ASN.1 notation this structure is defined as follows:
    Bioseq ::= SEQUENCE {
        id SET OF Seq-id ,
        descr Seq-descr OPTIONAL ,
        inst Seq-inst ,
        annot SET OF Seq-annot OPTIONAL }
Here only the seq-id and seq-inst fields are mandatory. The former holds the more biological and control-related information, while the latter holds information about the representation of the molecule and its physical properties, such as structure and length.
2.2.1 Seq-Id
2.2.2 Seq-inst
Representation
The forms of representation available in the model are divided into:
Raw. The traditional representation we are familiar with: we know the molecule type (DNA), the strandedness (double strand), the length, and the sequence itself.
Virtual. The representation used to describe a sequence about which we know details such as the molecule type (DNA) or the length, but without having the actual sequence.
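To make the distinction between mandatory and optional fields, and between the raw and virtual representations, more concrete, the following is a minimal Python sketch. All class and field names (`Bioseq`, `SeqInst`, `Repr`, `ids`, `representation`) are assumptions made for this illustration and do not come from the NCBI toolkit; identifiers, descriptors, and annotations are reduced to plain strings.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Repr(Enum):
    """Only the two representations discussed in the text (hypothetical subset)."""
    RAW = "raw"          # the residues themselves are present
    VIRTUAL = "virtual"  # molecule type and/or length known, no residues


@dataclass
class SeqInst:
    """Illustrative stand-in for Seq-inst: how the molecule is represented."""
    representation: Repr
    mol: str                        # "dna", "rna" or "aa"
    length: Optional[int] = None    # known length, if any
    seq_data: Optional[str] = None  # residues, only allowed for RAW

    def __post_init__(self) -> None:
        if self.representation is Repr.RAW:
            if not self.seq_data:
                raise ValueError("a raw sequence needs its residues")
            if self.length is None:
                self.length = len(self.seq_data)
        elif self.seq_data is not None:
            raise ValueError("a virtual sequence carries no residues")


@dataclass
class Bioseq:
    """Illustrative stand-in for Bioseq: ids and inst are mandatory."""
    ids: List[str]
    inst: SeqInst
    descr: List[str] = field(default_factory=list)  # optional descriptors
    annot: List[str] = field(default_factory=list)  # optional annotations

    def __post_init__(self) -> None:
        if not self.ids:
            raise ValueError("a Bioseq needs at least one identifier")


raw = Bioseq(ids=["U49845"], inst=SeqInst(Repr.RAW, "dna", seq_data="gatcctccat"))
virtual = Bioseq(ids=["local-1"], inst=SeqInst(Repr.VIRTUAL, "dna", length=10_000))
```

In this sketch a raw sequence derives its length from its residues when none is given, while a virtual sequence keeps a length but rejects residue data; other validation choices would be equally reasonable.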
2.3. Bioseq-set
A Bioseq-set is a collection of sequences together with descriptors for the set. It is a convenient way to package all the information about a collection of sequences without requiring stable identifiers, as a single sequence does. After the first fields, its structure is quite similar to that of a simple Bioseq.
There are descriptors that define aspects of the collection and of the Bioseqs within it. The general rule is that descriptors apply to everything below them; in other words, the first descriptor fields act as a top level for the sequences that belong to the collection. The definition of Bioseq-set follows.
    Bioseq-set ::= SEQUENCE {
        id Object-id OPTIONAL ,
        coll Dbtag OPTIONAL ,
        level INTEGER OPTIONAL ,
        class ENUMERATED {
            not-set (0) ,
            nuc-prot (1) ,
            segset (2) ,
            conset (3) ,
            parts (4) ,
            gibb (5) ,
            gi (6) ,
            genbank (7) ,
            pir (8) ,
            pub-set (9) ,
            equiv (10) ,
            swissprot (11) ,
            pdb-entry (12) ,
            other (255) } DEFAULT not-set ,
        release VisibleString OPTIONAL ,
        date Date OPTIONAL ,
        descr Seq-descr OPTIONAL ,
        seq-set SEQUENCE OF Seq-entry ,
        annot SET OF Seq-annot OPTIONAL }
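The rule that descriptors apply to everything below them can be sketched as a walk over a nested collection. This is an illustration only: the names `Bioseq`, `BioseqSet`, and `effective_descriptors` are hypothetical, descriptors are reduced to a plain dictionary, and letting a lower-level descriptor override a higher-level one with the same key is an assumption of this sketch, not something the text states.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional, Tuple, Union


@dataclass
class Bioseq:
    ids: List[str]
    descr: Dict[str, str] = field(default_factory=dict)


@dataclass
class BioseqSet:
    set_class: str = "not-set"
    descr: Dict[str, str] = field(default_factory=dict)
    seq_set: List["SeqEntry"] = field(default_factory=list)


# A Seq-entry is either a single Bioseq or a (possibly nested) Bioseq-set.
SeqEntry = Union[Bioseq, BioseqSet]


def effective_descriptors(
    entry: SeqEntry, inherited: Optional[Dict[str, str]] = None
) -> Iterator[Tuple[str, Dict[str, str]]]:
    """Yield (first id, merged descriptors) for every Bioseq at or below entry."""
    merged = {**(inherited or {}), **entry.descr}
    if isinstance(entry, Bioseq):
        yield entry.ids[0], merged
    else:
        for child in entry.seq_set:
            yield from effective_descriptors(child, merged)


nuc = Bioseq(["U49845"], {"title": "TCP1-beta, AXL2 and REV7 genes"})
prot = Bioseq(["AAA98665.1"])
entry = BioseqSet("nuc-prot", {"source": "Saccharomyces cerevisiae"}, [nuc, prot])
result = dict(effective_descriptors(entry))
```

Here the `source` descriptor placed on the set reaches both the nucleotide and the protein sequence, while the `title` placed on the nucleotide sequence stays local to it.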
References
[1] A. D. Baxevanis and B. F. F. Ouellette. Bioinformatics: A Practical Guide to the Analysis of Genes and Proteins. John Wiley & Sons, Inc., second edition, 2001.
[2] E. Bornberg-Bauer and N. W. Paton. Conceptual data modelling for bioinformatics. Briefings in Bioinformatics, 3(2). Henry Stewart Publications, June 2002.
[3] E. Merelli et al. Agents in bioinformatics, computational and systems biology. Briefings in Bioinformatics, 8(1). Oxford University Press, May 2006.
[4] T. Gaasterland and C. W. Sensen. Fully automated genome analysis that reflects user needs and preferences. A detailed introduction to the MAGPIE system architecture. Biochimie, 1996.
[5] NCBI, ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt.
[6] NCBI, www.ncbi.nlm.nih.gov/IEB/DATA.HTML.
Appendix I
ASN.1 specification of a Bioseq-set (seqset.asn)
--$Revision: 2.1 $
--**********************************************************************
--
--  NCBI Sequence Collections
--  by James Ostell, 1990
--
--**********************************************************************
NCBI-Seqset DEFINITIONS ::=
BEGIN
EXPORTS Bioseq-set, Seq-entry;
IMPORTS Bioseq, Seq-annot, Seq-descr FROM NCBI-Sequence
        Object-id, Dbtag, Date FROM NCBI-General;
--*** Sequence Collections ********************************
--*
Bioseq-set ::= SEQUENCE {      -- just a collection
    id Object-id OPTIONAL ,
    coll Dbtag OPTIONAL ,          -- to identify a collection
    level INTEGER OPTIONAL ,       -- nesting level
    class ENUMERATED {
        not-set (0) ,
        nuc-prot (1) ,              -- nuc acid and coded proteins
        segset (2) ,                -- segmented sequence + parts
        conset (3) ,                -- constructed sequence + parts
        parts (4) ,                 -- parts for 2 or 3
        gibb (5) ,                  -- geninfo backbone
        gi (6) ,                    -- geninfo
        genbank (7) ,               -- converted genbank
        pir (8) ,                   -- converted pir
        pub-set (9) ,               -- all the seqs from a single publication
        equiv (10) ,                -- a set of equivalent maps or seqs
        swissprot (11) ,            -- converted SWISSPROT
        pdb-entry (12) ,            -- a complete PDB entry
        other (255) } DEFAULT not-set ,
    release VisibleString OPTIONAL ,
    date Date OPTIONAL ,
    descr Seq-descr OPTIONAL ,
    seq-set SEQUENCE OF Seq-entry ,
    annot SET OF Seq-annot OPTIONAL }
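As a small illustration of how the `class` enumeration and its `DEFAULT not-set` clause might be handled by a program that reads such records, the sketch below mirrors the numeric values in a Python `IntEnum`. The names `BioseqSetClass` and `decode_class` are hypothetical and are not part of any NCBI library.

```python
from enum import IntEnum
from typing import Optional


class BioseqSetClass(IntEnum):
    """Numeric values copied from the ENUMERATED type in the specification."""
    NOT_SET = 0
    NUC_PROT = 1
    SEGSET = 2
    CONSET = 3
    PARTS = 4
    GIBB = 5
    GI = 6
    GENBANK = 7
    PIR = 8
    PUB_SET = 9
    EQUIV = 10
    SWISSPROT = 11
    PDB_ENTRY = 12
    OTHER = 255


def decode_class(value: Optional[int]) -> BioseqSetClass:
    """Map an encoded value to its class; an absent value means the DEFAULT."""
    if value is None:
        return BioseqSetClass.NOT_SET
    return BioseqSetClass(value)  # raises ValueError for undefined numbers
```

For example, `decode_class(None)` yields `BioseqSetClass.NOT_SET`, matching the default declared in the specification, and `decode_class(1)` yields `BioseqSetClass.NUC_PROT`.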
/*
 * ==========================================================================
 */
#ifndef _NCBI_Seqset_
#define _NCBI_Seqset_
#ifndef _ASNTOOL_
#include <asn.h>
#endif
#ifndef _NCBI_General_
#include <objgen.h>
#endif
#ifndef _NCBI_Seq_
#include <objseq.h>
#endif
#ifdef __cplusplus
extern "C" {
#endif
typedef ValNodePtr SeqEntryPtr;
/*****************************************************************************
*
*   loader
*
*****************************************************************************/
extern Boolean SeqSetAsnLoad PROTO((void));
/*****************************************************************************
*
*   internal structures for NCBI-Seqset objects
*
*****************************************************************************/
/*****************************************************************************
*
*   BioseqSet - a collection of sequences
*
*****************************************************************************/
typedef struct seqset {
    ObjectIdPtr id;
    DbtagPtr coll;
    Int2 level;            /* set to INT2_MIN (ncbilcl.h) for not used */
    Uint1 _class;
    CharPtr release;
    DatePtr date;
    ValNodePtr descr;
    SeqEntryPtr seq_set;
    SeqAnnotPtr annot;
} BioseqSet, PNTR BioseqSetPtr;
BioseqSetPtr BioseqSetNew PROTO((void));
Boolean BioseqSetAsnWrite PROTO((BioseqSetPtr bsp, AsnIoPtr aip, AsnTypePtr atp));
BioseqSetPtr BioseqSetAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp));
BioseqSetPtr BioseqSetFree PROTO((BioseqSetPtr bsp));
/*****************************************************************************
*
*   SeqEntry - implemented as a ValNode
*     choice:
*       1 = Bioseq
*       2 = Bioseq-set
*
*****************************************************************************/
SeqEntryPtr SeqEntryNew PROTO((void));
Boolean SeqEntryAsnWrite PROTO((SeqEntryPtr sep, AsnIoPtr aip, AsnTypePtr atp));
SeqEntryPtr SeqEntryAsnRead PROTO((AsnIoPtr aip, AsnTypePtr atp));
SeqEntryPtr SeqEntryFree PROTO((SeqEntryPtr sep));
SeqEntryPtr PNTR SeqEntryInMem PROTO((Int2Ptr numptr));
/*****************************************************************************
*
*   Options for SeqEntryAsnRead()
*
*****************************************************************************/
SeqEntryPtr SeqEntryAsnGet PROTO((AsnIoPtr aip, AsnTypePtr atp, SeqIdPtr sip,
                                  Int2 retcode));
#define SEQENTRY_OPTION_MAX_COMPLEX 1
/* seq-id to find */
/* type of set/seq to return */
/* 2, if in first set of retcode type */
/* 1, if found Bioseq, but not right set */
#ifdef __cplusplus
}
#endif
#endif
Appendix II
Illustration of the different forms of representation of a sequence or collection of sequences
Appendix III
Example of a Bioseq-set from Entrez: U49845
LOCUS       SCU49845                5028 bp    DNA     linear   PLN 23-MAR-2010
DEFINITION  Saccharomyces cerevisiae TCP1-beta gene, partial cds; and Axl2p
            (AXL2) and Rev7p (REV7) genes, complete cds.
ACCESSION   U49845
VERSION     U49845.1  GI:1293613
KEYWORDS    .
SOURCE      Saccharomyces cerevisiae (baker's yeast)
  ORGANISM  Saccharomyces cerevisiae
            Eukaryota; Fungi; Dikarya; Ascomycota; Saccharomycotina;
            Saccharomycetes; Saccharomycetales; Saccharomycetaceae;
            Saccharomyces.
REFERENCE   1  (bases 1 to 5028)
  AUTHORS   Roemer,T., Madden,K., Chang,J. and Snyder,M.
  TITLE     Selection of axial growth sites in yeast requires Axl2p, a novel
            plasma membrane glycoprotein
  JOURNAL   Genes Dev. 10 (7), 777-793 (1996)
   PUBMED   8846915
REFERENCE   2  (bases 1 to 5028)
  AUTHORS   Roemer,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (22-FEB-1996) Biology, Yale University, New Haven, CT
            06520, USA
FEATURES             Location/Qualifiers
     source          1..5028
                     /organism="Saccharomyces cerevisiae"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:4932"
                     /chromosome="IX"
     mRNA            <1..>206
                     /product="TCP1-beta"
     CDS             <1..206
                     /codon_start=3
                     /product="TCP1-beta"
                     /protein_id="AAA98665.1"
                     /db_xref="GI:1293614"
                     /translation="SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVSSASEA
                     AEVLLRVDNIIRARPRTANRQHM"
     gene            <687..>3158
                     /gene="AXL2"
     mRNA            <687..>3158
                     /gene="AXL2"
                     /product="Axl2p"
     CDS             687..3158
                     /gene="AXL2"
ORIGIN
        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg
       61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct
      121 ctgcatctga agccgctgaa gttctactaa gggtggataa catcatccgt gcaagaccaa
      181 gaaccgccaa tagacaacat atgtaacata tttaggatat acctcgaaaa taataaaccg
      241 ccacactgtc attattataa ttagaaacag aacgcaaaaa ttatccacta tataattcaa
      301 agacgcgaaa aaaaaagaac aacgcgtcat agaacttttg gcaattcgcg tcacaaataa
      361 attttggcaa cttatgtttc ctcttcgagc agtactcgag ccctgtctca agaatgtaat
      421 aatacccatc gtaggtatgg ttaaagatag catctccaca acctcaaagc tccttgccga
      481 gagtcgccct cctttgtcga gtaattttca cttttcatat gagaacttat tttcttattc
      541 tttactctca catcctgtag tgattgacac tgcaacagcc accatcacta gaagaacaga
      601 acaattactt aatagaaaaa ttatatcttc ctcgaaacga tttcctgctt ccaacatcta
      661 cgtatatcaa gaagcattca cttaccatga cacagcttca gatttcatta ttgctgacag
      721 ctactatatc actactccat ctagtagtgg ccacgcccta tgaggcatat cctatcggaa
      781 aacaataccc cccagtggca agagtcaatg aatcgtttac atttcaaatt tccaatgata
      841 cctataaatc gtctgtagac aagacagctc aaataacata caattgcttc gacttaccga
      901 gctggctttc gtttgactct agttctagaa cgttctcagg tgaaccttct tctgacttac
      961 tatctgatgc gaacaccacg ttgtatttca atgtaatact cgagggtacg gactctgccg
     1021 acagcacgtc tttgaacaat acataccaat ttgttgttac aaaccgtcca tccatctcgc
     1081 tatcgtcaga tttcaatcta ttggcgttgt taaaaaacta tggttatact aacggcaaaa
     1141 acgctctgaa actagatcct aatgaagtct tcaacgtgac ttttgaccgt tcaatgttca
     1201 ctaacgaaga atccattgtg tcgtattacg gacgttctca gttgtataat gcgccgttac
     1261 ccaattggct gttcttcgat tctggcgagt tgaagtttac tgggacggca ccggtgataa
     1321 actcggcgat tgctccagaa acaagctaca gttttgtcat catcgctaca gacattgaag
     1381 gattttctgc cgttgaggta gaattcgaat tagtcatcgg ggctcaccag ttaactacct
     1441 ctattcaaaa tagtttgata atcaacgtta ctgacacagg taacgtttca tatgacttac
     1501 ctctaaacta tgtttatctc gatgacgatc ctatttcttc tgataaattg ggttctataa
     1561 acttattgga tgctccagac tgggtggcat tagataatgc taccatttcc gggtctgtcc
     1621 cagatgaatt actcggtaag aactccaatc ctgccaattt ttctgtgtcc atttatgata
     1681 cttatggtga tgtgatttat ttcaacttcg aagttgtctc cacaacggat ttgtttgcca
     1741 ttagttctct tcccaatatt aacgctacaa ggggtgaatg gttctcctac tattttttgc
     1801 cttctcagtt tacagactac gtgaatacaa acgtttcatt agagtttact aattcaagcc
     1861 aagaccatga ctgggtgaaa ttccaatcat ctaatttaac attagctgga gaagtgccca
     1921 agaatttcga caagctttca ttaggtttga aagcgaacca aggttcacaa tctcaagagc
     1981 tatattttaa catcattggc atggattcaa agataactca ctcaaaccac agtgcgaatg
     2041 caacgtccac aagaagttct caccactcca cctcaacaag ttcttacaca tcttctactt
     2101 acactgcaaa aatttcttct acctccgctg ctgctacttc ttctgctcca gcagcgctgc
     2161 cagcagccaa taaaacttca tctcacaata aaaaagcagt agcaattgcg tgcggtgttg
     2221 ctatcccatt aggcgttatc ctagtagctc tcatttgctt cctaatattc tggagacgca
     2281 gaagggaaaa tccagacgat gaaaacttac cgcatgctat tagtggacct gatttgaata
     2341 atcctgcaaa taaaccaaat caagaaaacg ctacaccttt gaacaacccc tttgatgatg
     2401 atgcttcctc gtacgatgat acttcaatag caagaagatt ggctgctttg aacactttga
     2461 aattggataa ccactctgcc actgaatctg atatttccag cgtggatgaa aagagagatt
     2521 ctctatcagg tatgaataca tacaatgatc agttccaatc ccaaagtaaa gaagaattat
     2581 tagcaaaacc cccagtacag cctccagaga gcccgttctt tgacccacag aataggtctt
     2641 cttctgtgta tatggatagt gaaccagcag taaataaatc ctggcgatat actggcaacc
     2701 tgtcaccagt ctctgatatt gtcagagaca gttacggatc acaaaaaact gttgatacag
     2761 aaaaactttt cgatttagaa gcaccagaga aggaaaaacg tacgtcaagg gatgtcacta
     2821 tgtcttcact ggacccttgg aacagcaata ttagcccttc tcccgtaaga aaatcagtaa
     2881 caccatcacc atataacgta acgaagcatc gtaaccgcca cttacaaaat attcaagact
     2941 ctcaaagcgg taaaaacgga atcactccca caacaatgtc aacttcatct tctgacgatt
     3001 ttgttccggt taaagatggt gaaaattttt gctgggtcca tagcatggaa ccagacagaa
     3061 gaccaagtaa gaaaaggtta gtagattttt caaataagag taatgtcaat gttggtcaag
     3121 ttaaggacat tcacggacgc atcccagaaa tgctgtgatt atacgcaacg atattttgct
     3181 taattttatt ttcctgtttt attttttatt agtggtttac agatacccta tattttattt
     3241 agtttttata cttagagaca tttaatttta attccattct tcaaatttca tttttgcact
     3301 taaaacaaag atccaaaaat gctctcgccc tcttcatatt gagaatacac tccattcaaa
     3361 attttgtcgt caccgctgat taatttttca ctaaactgat gaataatcaa aggccccacg
     3421 tcagaaccga ctaaagaagt gagttttatt ttaggaggtt gaaaaccatt attgtctggt
     3481 aaattttcat cttcttgaca tttaacccag tttgaatccc tttcaatttc tgctttttcc
     3541 tccaaactat cgaccctcct gtttctgtcc aacttatgtc ctagttccaa ttcgatcgca
     3601 ttaataactg cttcaaatgt tattgtgtca tcgttgactt taggtaattt ctccaaatgc
     3661 ataatcaaac tatttaagga agatcggaat tcgtcgaaca cttcagtttc cgtaatgatc
     3721 tgatcgtctt tatccacatg ttgtaattca ctaaaatcta aaacgtattt ttcaatgcat
     3781 aaatcgttct ttttattaat aatgcagatg gaaaatctgt aaacgtgcgt taatttagaa
     3841 agaacatcca gtataagttc ttctatatag tcaattaaag caggatgcct attaatggga
     3901 acgaactgcg gcaagttgaa tgactggtaa gtagtgtagt cgaatgactg aggtgggtat
     3961 acatttctat aaaataaaat caaattaatg tagcatttta agtataccct cagccacttc
     4021 tctacccatc tattcataaa gctgacgcaa cgattactat tttttttttc ttcttggatc
     4081 tcagtcgtcg caaaaacgta taccttcttt ttccgacctt ttttttagct ttctggaaaa
     4141 gtttatatta gttaaacagg gtctagtctt agtgtgaaag ctagtggttt cgattgactg
     4201 atattaagaa agtggaaatt aaattagtag tgtagacgta tatgcatatg tatttctcgc
     4261 ctgtttatgt ttctacgtac ttttgattta tagcaagggg aaaagaaata catactattt
     4321 tttggtaaag gtgaaagcat aatgtaaaag ctagaataaa atggacgaaa taaagagagg
     4381 cttagttcat cttttttcca aaaagcaccc aatgataata actaaaatga aaaggatttg
     4441 ccatctgtca gcaacatcag ttgtgtgagc aataataaaa tcatcacctc cgttgccttt
     4501 agcgcgtttg tcgtttgtat cttccgtaat tttagtctta tcaatgggaa tcataaattt
     4561 tccaatgaat tagcaatttc gtccaattct ttttgagctt cttcatattt gctttggaat
     4621 tcttcgcact tcttttccca ttcatctctt tcttcttcca aagcaacgat ccttctaccc
     4681 atttgctcag agttcaaatc ggcctctttc agtttatcca ttgcttcctt cagtttggct
     4741 tcactgtctt ctagctgttg ttctagatcc tggtttttct tggtgtagtt ctcattatta
     4801 gatctcaagt tattggagtc ttcagccaat tgctttgtat cagacaattg actctctaac
     4861 ttctccactt cactgtcgag ttgctcgttt ttagcggaca aagatttaat ctcgttttct
     4921 ttttcagtgt tagattgctc taattctttg agctgttctc tcagctcctc atatttttct
     4981 tgccatgact cagattctaa ttttaagcta ttcaatttct ctttgatc
//
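In a GenBank flat file, each line of the ORIGIN block starts with the position of its first residue, followed by blocks of ten residues, and the record ends with `//`. The following minimal Python sketch turns such a block back into one continuous sequence, which corresponds to the raw representation of a Bioseq. The function name `parse_origin` is an assumption of this sketch, and the sample contains only the first two lines of U49845.

```python
from typing import Iterable, List


def parse_origin(lines: Iterable[str]) -> str:
    """Concatenate the residues of an ORIGIN block, checking line positions."""
    residues: List[str] = []
    total = 0
    for line in lines:
        line = line.strip()
        if not line or line == "ORIGIN":
            continue
        if line == "//":  # end-of-record marker
            break
        position, *blocks = line.split()
        if int(position) != total + 1:
            raise ValueError(f"expected position {total + 1}, found {position}")
        chunk = "".join(blocks)
        residues.append(chunk)
        total += len(chunk)
    return "".join(residues)


sample = """ORIGIN
        1 gatcctccat atacaacggt atctccacct caggtttaga tctcaacaac ggaaccattg
       61 ccgacatgag acagttaggt atcgtcgaga gttacaagct aaaacgagca gtagtcagct
//"""
sequence = parse_origin(sample.splitlines())
```

The position check is a cheap safeguard against lines that were lost or reordered: a line whose leading number does not equal the count of residues read so far plus one is rejected instead of being silently concatenated.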