
Understanding the Basics

This section will help you understand common terms and tools of the Internet.

Internet
The Internet is a network of computers spanning the globe. This communication structure connects more than fifty million people in countries around the world. A global web of computers, the Internet allows individuals to communicate with each other. Often loosely called the World Wide Web (strictly, the Web is one service that runs on the Internet), the Internet provides a quick and easy exchange of information and is recognized as the central tool in this Information Age.

Internet Browser
An Internet browser is a software program that enables you to access and navigate the Internet by viewing Web pages on your computer. It provides a graphical interface that allows you to connect to the Internet and "surf the Web."

Netscape Navigator and Internet Explorer (IE) are the two browsers most commonly used for viewing the Internet. They share many of the same functions, and it is possible to use both. Microsoft is the creator of Internet Explorer; Netscape Navigator, originally developed by Netscape, is now owned by America Online/Time Warner. Other browsers are available as well. It does not take many users long to develop a preference and "adopt" a browser. You may have already made the choice. Which are you using?

Not only will you need to be familiar with your browser "brand," but you should also know the version of the browser you are using. New versions of browsers are released frequently and can normally be downloaded from the Internet at no charge.

Notable browsers
In order of release:

WorldWideWeb, February 26, 1991
Mosaic, April 22, 1993
Netscape Navigator and Netscape Communicator, October 13, 1994
Internet Explorer 1, August 16, 1995
Opera, 1996
Mozilla Navigator, June 5, 2002
Safari, January 7, 2003
Mozilla Firefox, November 9, 2004
Google Chrome, September 2, 2008

Web Site
A site or area on the World Wide Web that is accessed by its own Internet address is called a Web site. A Web site can be a collection of related Web pages. Each Web site contains a home page and may also contain additional pages. Each Web site is owned and updated by an individual, company, or organization. Because the Web is a dynamically moving and changing entity, many Web sites change on a daily or even hourly basis.

Web Page
A Web page can be explained as one area of the World Wide Web. Comparable to a page in a book, the basic unit of every Web site or document on the Web is a page. A Web page can be an article, an ordering page, or a single paragraph, and it is usually a combination of text and graphics.

Home Page
The term home page has two meanings. First, it is the Web page that appears every time you open your browser. Clicking the home page icon on your browser screen will take you to the specific page you have set as your browser's home page. Second, home page refers to the main Web page out of a collection of Web pages. On many sites you will see Home as a choice on a menu bar. Clicking on the word Home on a Web page will take you to the home or main page of that particular Web site.

URL (Uniform Resource Locator)
The unusual string of characters at the top of your browser window appears in what is known as the locator box or address box. Each Web page has a unique address called a Uniform Resource Locator or URL. The URL (pronounced U-R-L) is the specific address of a Web page. There is a special system for addressing Internet sites. The URL or Web address is typically composed of four parts:

A protocol name (a protocol is a set of rules and standards that enable computers to exchange information)
The location of the site
The name of the organization that maintains the site
A suffix that identifies the kind of organization it is

For example, the address http://www.basicsbee.com is made up of the following parts:

http://  This Web server uses Hypertext Transfer Protocol (HTTP). This is the most common protocol on the Internet.
www  This site is on the World Wide Web.
basicsbee  The Web server is at basicsbee.
com  This is a common extension.

Some common extensions are:

.com (commercial)
.edu (educational institution)
.gov (government)
.mil (military)
.net (network)
.org (organization)

You might also see foreign addresses that add a two-letter country code at the end of the address, such as:

.au (Australia)
.ca (Canada)
.fr (France)
.it (Italy)
.us (United States of America)
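The four-part breakdown of a Web address described above can be sketched in a few lines of Python. This is only an illustration: the function name split_web_address is hypothetical, and it assumes the simple location.organization.suffix host layout used in the basicsbee example, which many real addresses do not follow.

```python
from urllib.parse import urlparse

def split_web_address(url):
    """Split a simple Web address into the four parts named in the text.

    Assumption: the host looks like location.organization.suffix
    (for example www.basicsbee.com). Real host names can be more complex.
    """
    parsed = urlparse(url)
    labels = parsed.hostname.split(".")
    return {
        "protocol": parsed.scheme,                           # e.g. http
        "location": labels[0] if len(labels) > 2 else "",    # e.g. www
        "organization": labels[-2],                          # e.g. basicsbee
        "suffix": labels[-1],                                # e.g. com
    }

parts = split_web_address("http://www.basicsbee.com")
print(parts)
```

The standard library's urlparse does the low-level splitting; only the labeling of the host-name pieces is specific to this sketch.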

Connect to the Internet


Published: August 15, 2006

The Internet can turn your computer into a powerful tool for research, communications, and sharing. If you have kids, the Internet is a great place for them to find homework help and chat with friends. Connecting to the Internet is easy, and your service provider will do most of the technical work for you.

To connect to the Internet
1. Find an Internet service provider, and choose a connection type.
2. If necessary, transfer an existing ISP account.
3. Set up your home network.
4. Install a network adapter.
5. Connect your computer to the network.

Find an Internet service provider


The first step in connecting to the Internet is to find an Internet service provider (ISP). The most important services an ISP offers are:

Internet access. Access any Web site, send instant messages to your friends, play online games, or use any other Internet service.

E-mail. You can access your e-mail with Microsoft Outlook Express or your Web browser. Most ISPs offer multiple e-mail addresses, so everyone in your family can have an account. ISPs typically provide spam filtering that reduces, but does not eliminate, unwanted messages.

Depending on your location, you might have several different choices for Internet access. Starting with the most attractive technologies for home Internet access, common Internet connection types are:

Cable modems. The best performing and most affordable option available to customers, most cable TV providers offer broadband Internet access.

DSL. An excellent choice for businesses, DSL typically offers better reliability than cable modems. However, DSL tends to be more expensive than cable modems for similar levels of service.

Dial-up. The slowest method of connecting to the Internet, dial-up enables you to connect to the Internet using your existing phone lines. Dial-up is convenient because it is available to any location with a phone. However, slow performance makes using the Internet frustrating.

Satellite. Satellite broadband services provide high-speed Internet access to any location with a clear view of the sky (currently available in North America and certain other locations). Satellite services may be the only broadband option for people living in rural areas. The cost of satellite services is significantly higher than other services. While you can transfer large files quickly with satellite, browsing the Web or playing online games can seem slower than with dial-up because of the delay caused by sending signals to and from satellites.

Additionally, ISPs are beginning to offer wireless or fiber broadband Internet access in limited areas. To find an ISP, you should contact your cable television provider for cable modem service or your telephone company for DSL. Almost all cable and telephone companies offer broadband Internet access, and they typically offer a discount if you purchase multiple services from them.

Bandwidth
Internet bandwidth (the speed at which your computer can send and receive information) is measured in either Kbps (kilobits per second) or Mbps (megabits per second). If you are lucky enough to have multiple broadband options in your area, compare these factors:

Downstream bandwidth. This is the speed with which your computer can receive information from the Internet. The higher the downstream bandwidth, the faster your computer can display Web pages, transfer music, and download files. For most people, downstream bandwidth is more important than upstream bandwidth, so the downstream speeds offered tend to be much higher. For example, a cable modem service might offer 6,000 Kbps downstream and only 768 Kbps upstream.

Upstream bandwidth. This is the speed with which your computer can send information to the Internet. This isn't important if you just plan to read e-mail and surf the Web, because your computer only needs to send a small request in order to receive a large Web page or e-mail. However, if you're into online gaming or you want to send large files to people, then higher upstream bandwidth is important, and you should choose the highest upstream bandwidth available.

Reliability and customer service. ISP reliability has increased significantly in recent years; however, it is still not as reliable as your phone or television service. There is no objective way to measure reliability and customer service, so you should talk to your neighbors about their experiences and search the Web for reviews of ISPs in your area.
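To get a feel for what the downstream and upstream figures mean in practice, the short Python sketch below estimates how long a file takes to transfer at the example speeds of 6,000 Kbps and 768 Kbps. The 5-megabyte file size is a made-up example, and the calculation assumes 1 megabyte = 1,000,000 bytes and ignores protocol overhead, so real transfers will be somewhat slower.

```python
def transfer_seconds(size_megabytes, bandwidth_kbps):
    """Estimate the transfer time in seconds for a file of the given size.

    Assumptions: 1 megabyte = 1,000,000 bytes, 1 Kbps = 1,000 bits per second,
    and no protocol overhead or congestion.
    """
    bits = size_megabytes * 1_000_000 * 8
    return bits / (bandwidth_kbps * 1_000)

download = transfer_seconds(5, 6000)   # receiving a 5 MB file downstream
upload = transfer_seconds(5, 768)      # sending the same file upstream

print(f"download: {download:.1f} s, upload: {upload:.1f} s")
```

The same file takes several times longer to send than to receive, which is why upstream bandwidth matters mainly to people who send large files.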

Transfer an existing account


No matter which ISP you use, you'll connect to the same Internet. You'll be able to reach the same Web sites and send instant messages to all of your friends. However, if you use the e-mail account provided by your current ISP, you will lose that when you switch ISPs.

If possible, activate your new ISP account a week before you close your old ISP account. Notify everyone you want to exchange e-mail with that you have a new e-mail address, and ask them to update your address in their contact list. Then, if anyone sends an e-mail to your old account before you close it, reply from your new address.

Tip: Your new ISP will provide you with an e-mail address. However, you do not have to use it. Consider using a free Hotmail account instead. If you use Hotmail, you can keep the same e-mail address no matter which ISP you use.

If you've registered your e-mail account with Web sites that you want to receive e-mail from, you'll need to let those Web sites know that your address has changed. Examples might include online stores, forums, communities, financial services, and Microsoft Passport. It's important to change your account while your old e-mail address is still available, because Web sites often send a confirmation e-mail to your existing address before allowing you to make a change.

Set up your home network


Your ISP brings the Internet to your home, but it's your home network that connects your computer to the Internet. If you have a single desktop computer and it's in the same room as your Internet connection, your home network will simply consist of a single cable (provided by your ISP) that runs from your modem or router to your computer. Your ISP will either set it up for you or provide you with instructions for connecting your computer. If you have more than one computer, a portable computer, or devices such as an Xbox 360 or Media Center Extender that you want to connect to the network, you can connect all the devices to the Internet simultaneously. For stationary devices, such as desktop computers and Xbox 360, you can connect them to a wired network. For mobile devices, such as a portable computer, you will need to install a wireless network. To help determine which type of network to choose, read Choosing the type of network to install.

Connect your computer to the network


Once your network is set up, you need to connect your computers and other network devices. For computers, first verify that they have a compatible network adapter, and install one if necessary. Then follow these instructions for the devices you want to connect:

Windows XP computers
Windows 2000 computers
Windows 98 or Windows Me computers
Xbox or Xbox 360

Now that you're connected, enjoy all the Internet has to offer.

File Transfer Protocol (FTP) is a standard network protocol used to exchange and manipulate files over a TCP/IP based network, such as the Internet. FTP is built on a client-server architecture and utilizes separate control and data connections between the client and server applications. FTP applications were originally interactive command-line tools with a standardized command syntax, but graphical user interfaces have been developed for all desktop operating systems in use today. FTP is also often used as an application component to automatically transfer files for program internal functions. FTP can be used with user-based password authentication or with anonymous user access. The Trivial File Transfer Protocol (TFTP) is a similar, but simplified, not interoperable, and unauthenticated version of FTP.

Use
As outlined by its RFC, FTP is used to:

Promote sharing of files (computer programs and/or data).
Encourage indirect or implicit use of remote computers.
Shield a user from variations in file storage systems among different hosts.
Transfer data reliably and efficiently.

FTP can be run in active mode, passive mode and extended passive mode. Extended passive mode was added by RFC 2428 in September 1998. While transferring data over the network, several data representations can be used. The two most common transfer modes are:

ASCII mode: only for plain text. (Any other form of data will be corrupted.)
Binary mode: the sending machine sends each file byte for byte, and the recipient stores the bytestream as it receives it. (The FTP standard calls this "IMAGE" or "I" mode.)

FTP server return codes indicate their status by the digits within them.
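As an illustration of how the digits carry meaning, the Python sketch below decodes the first two digits of a three-digit reply code, using the groupings given in RFC 959. The function name describe_reply is hypothetical, and the sketch does not interpret the third digit or the text that follows the code.

```python
# Meaning of the first digit of an FTP reply code (RFC 959).
FIRST_DIGIT = {
    "1": "positive preliminary reply",
    "2": "positive completion reply",
    "3": "positive intermediate reply",
    "4": "transient negative completion reply",
    "5": "permanent negative completion reply",
}

# Meaning of the second digit of an FTP reply code (RFC 959).
SECOND_DIGIT = {
    "0": "syntax",
    "1": "information",
    "2": "connections",
    "3": "authentication and accounting",
    "4": "unspecified",
    "5": "file system",
}

def describe_reply(code):
    """Return (outcome, category) for a three-digit FTP reply code."""
    text = str(code)
    if len(text) != 3 or not text.isdigit():
        raise ValueError("an FTP reply code has exactly three digits")
    return (FIRST_DIGIT.get(text[0], "unknown"),
            SECOND_DIGIT.get(text[1], "unknown"))

print(describe_reply(226))
print(describe_reply(530))
```

A client can therefore decide whether to continue, retry, or give up from the first digit alone, without knowing every individual code.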

Anonymous FTP
A host that provides an FTP service may additionally provide anonymous FTP access. Users typically log in to the service with an 'anonymous' account when prompted for user name. Although users are commonly asked to send their email address in lieu of a password, little to no verification is actually performed on the supplied data. As modern FTP clients typically hide the anonymous login process from the user, the FTP client will supply dummy data as the password (since the user's email address may not be known to the application). The Gopher protocol has been suggested as an alternative to anonymous FTP, as well as Trivial File Transfer Protocol and File Service Protocol.

Remote FTP or FTP Mail


Where FTP access is restricted, a remote FTP (or FTP Mail) service can be used to circumvent the problem. An e-mail containing the FTP commands to be performed is sent to a remote FTP server, which is a mail server that parses the incoming e-mail, executes the FTP commands, and sends back an e-mail with any downloaded files as an attachment. Obviously this is less flexible than an ftp client, as it is not possible to view directories interactively or to modify commands, and there can also be problems with large file attachments in the response not getting through mail servers. As most internet users these days have ready access to FTP, this procedure is no longer in everyday use.

FTP and web browsers


Most recent web browsers and file managers can connect to FTP servers, although they may lack the support for protocol extensions such as FTPS. This allows manipulation of remote files over FTP through an interface similar to that used for local files. This is done via an FTP URL, which takes the form ftp(s)://<ftpserveraddress> (e.g., ftp://ftp.gimp.org/). A password can optionally be given in the URL, e.g.: ftp(s)://<login>:<password>@<ftpserveraddress>:<port>. Most web-browsers require the use of passive mode FTP, which not all FTP servers are capable of handling. Some browsers allow only the downloading of files, but offer no way to upload files to the server.
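The FTP URL form described above can be taken apart with Python's standard urllib.parse module. The sketch below is only an illustration: the function name parse_ftp_url, the host ftp.example.org and the credentials are made up, and falling back to "anonymous" and port 21 when the URL omits them is an assumption of this sketch rather than something a URL parser does for you.

```python
from urllib.parse import urlparse

def parse_ftp_url(url):
    """Split an ftp:// or ftps:// URL into login, password, server, port and path."""
    parsed = urlparse(url)
    if parsed.scheme not in ("ftp", "ftps"):
        raise ValueError("not an FTP URL: " + url)
    return {
        # Assumption: no login in the URL means anonymous access.
        "login": parsed.username or "anonymous",
        "password": parsed.password,
        "server": parsed.hostname,
        # Assumption: use the conventional FTP control port when none is given.
        "port": parsed.port if parsed.port is not None else 21,
        "path": parsed.path or "/",
    }

print(parse_ftp_url("ftp://ftp.gimp.org/"))
print(parse_ftp_url("ftp://alice:secret@ftp.example.org:2121/pub"))
```

Note that a password embedded in a URL is visible to anyone who can see the URL, so this form is best kept for throwaway or anonymous accounts.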

FTP and NAT devices


The representation of the IP addresses and port numbers in the PORT command and PASV reply poses another challenge for network address translation (NAT) devices in handling FTP. The NAT device must alter these values, so that they contain the externally visible IP address for the NAT-ed client, and a port chosen by the NAT device for the data connection. The new address and port will probably differ in length in their decimal representation from the original address and port. This means that altering the values on the control connection by the NAT device must be done carefully, changing the TCP Sequence and Acknowledgment fields for all subsequent packets. Such translation is not usually performed in most NAT devices, but special application layer gateways exist for this purpose.
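The length problem is easy to see with a small Python sketch. In a PORT command the address and port are written as six comma-separated decimal numbers: four address bytes followed by the high and low bytes of the port. The addresses and ports below are invented for illustration, and the helper names are hypothetical; a real NAT device does this rewriting on packets in flight, which this sketch does not attempt.

```python
def parse_port_argument(argument):
    """Decode 'h1,h2,h3,h4,p1,p2' into (dotted address, port number)."""
    fields = [int(f) for f in argument.split(",")]
    if len(fields) != 6 or any(not 0 <= f <= 255 for f in fields):
        raise ValueError("expected six numbers between 0 and 255")
    address = ".".join(str(f) for f in fields[:4])
    port = fields[4] * 256 + fields[5]
    return address, port

def format_port_argument(address, port):
    """Encode a dotted address and port number as 'h1,h2,h3,h4,p1,p2'."""
    return ",".join(address.split(".") + [str(port // 256), str(port % 256)])

# What a client behind the NAT might send (private address, example port).
original = "PORT 192,168,1,2,7,138"
address, port = parse_port_argument(original.split(" ", 1)[1])

# What the NAT device would need to put on the wire instead
# (example external address and a port chosen by the NAT device).
rewritten = "PORT " + format_port_argument("203.0.113.5", 40000)

length_delta = len(rewritten) - len(original)
print(address, port, rewritten, length_delta)
```

Because the rewritten command is one byte longer than the original in this example, every later byte on the control connection is shifted, which is why the TCP sequence and acknowledgment numbers also have to be adjusted.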

The NCBI Data Model is defined in ASN.1
NCBI has made considerable investment of time and energy to ensure that our data and software tools are not too tightly bound to any particular computer platform or database technology. However, we also wish to embrace the intellectual rigor imposed by describing our data within a formal system and in a machine readable and checkable way. For this reason we have chosen to describe our data in Abstract Syntax Notation 1 (ASN.1; ISO 8824, 8825). ASN.1 is a formal language specifically designed to specify complex data structures in a manner independent of machine, DBMS, and programming language.

ASN.1 (Abstract Syntax Notation One) is a standard way to describe a message (a unit of application data) that can be sent or received in a network. ASN.1 is divided into two parts: (1) the rules of syntax for describing the contents of a message in terms of data type and content sequence or structure and (2) the rules for actually encoding each data item in a message. ASN.1 is defined in two ISO standards for applications intended for the Open Systems Interconnection (OSI) framework:

ISO 8824/ITU X.208 specifies the syntax (for example, which data item comes first in the message and what its data type is)
ISO 8825/ITU X.209 specifies the basic encoding rules for ASN.1 (for example, how to state how long a data item is)

Here's an example of a message definition specified with ASN.1 notation:


    Report ::= SEQUENCE {
        author  OCTET STRING,
        title   OCTET STRING,
        body    OCTET STRING,
        biblio  Bibliography
    }

In this very simple example, "Report" is the name of this type of message. SEQUENCE indicates that the message is a sequence of data items. Three of the data items have the data type of OCTET STRING, meaning each is a string of eight-bit bytes (the term OCTET was used rather than BYTE because it can't be assumed that all computers will have eight bits in a byte). The bibliography data item is another definition named "Bibliography" that is used within this one. It might look like this:
    Bibliography ::= SEQUENCE {
        author     OCTET STRING,
        title      OCTET STRING,
        publisher  OCTET STRING,
        year       OCTET STRING
    }

Other data types that can be specified include: INTEGER, BOOLEAN, REAL, and BIT STRING. An ENUMERATED data type is one that takes one of several possible values. Data items can be specified as OPTIONAL (not necessarily present).
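To make the second half of ASN.1 (the encoding rules) more concrete, here is a minimal Python sketch of how the basic encoding rules write an OCTET STRING and a SEQUENCE as tag, length, and value bytes. It is a simplified illustration rather than a complete encoder: the helper names are hypothetical, only definite-length encoding is shown, and the two-field message at the end is a made-up cut-down version of the Report example.

```python
OCTET_STRING = 0x04   # universal tag for OCTET STRING
SEQUENCE = 0x30       # universal tag for a constructed SEQUENCE

def encode_length(n):
    """Encode a length: one byte below 128, otherwise a count byte plus the length bytes."""
    if n < 0x80:
        return bytes([n])
    body = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return bytes([0x80 | len(body)]) + body

def encode_tlv(tag, content):
    """Concatenate tag, length, and value."""
    return bytes([tag]) + encode_length(len(content)) + content

def encode_octet_string(value):
    return encode_tlv(OCTET_STRING, value)

def encode_sequence(*items):
    """A SEQUENCE is the concatenation of its already-encoded items, in order."""
    return encode_tlv(SEQUENCE, b"".join(items))

title = encode_octet_string(b"hi")
report = encode_sequence(encode_octet_string(b"Ann"), title)

print(title.hex())
print(report.hex())
```

Each item states its own type and length, so a receiver can walk through the message without knowing in advance how long any field is.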

Biological Sequences
A Bioseq is a single continuous biological sequence. It can be nucleic acid or protein. It can be fully instantiated (i.e. we have data for every residue) or only partially instantiated (e.g. we know a fragment is 10 kilobases long, but we only have sequence data over 1 kilobase). A Bioseq is defined in ASN.1 as follows:
    Bioseq ::= SEQUENCE {
        id SET OF Seq-id ,                -- equivalent identifiers
        descr Seq-descr OPTIONAL ,        -- descriptors
        inst Seq-inst ,                   -- the sequence data
        annot SET OF Seq-annot OPTIONAL }

In ASN.1 a named datatype begins with a capital letter (e.g. Bioseq). The symbol "::=" means "is defined as". A primitive type is all capitals (e.g. SEQUENCE). A field within a named datatype begins with a lower case letter (e.g. descr). A structured datatype is bounded by curly brackets ({}). We can now read the definition above: a Bioseq is defined as a SEQUENCE (i.e. a structure where the elements must come in order; the mathematical notion of SEQUENCE, not the biological one). The first element of Bioseq is called "id" and is a SET OF (i.e. an unordered collection of repeating elements of the same type) a named datatype called "Seq-id". Seq-id would have its own definition elsewhere. The second element is called "descr" and is a named type called "Seq-descr", which is OPTIONAL. In this text, when we wish to refer to the id element of the named type Bioseq, we will use the notation "Bioseq.id". A Bioseq has two OPTIONAL elements, which both have descriptive information ABOUT the sequence. Seq-descr is a collection of types of information about the context of the sequence. It may set biological context (e.g. define the organism sequenced), or bibliographic context (e.g. the paper it was published in), among other things. Seq-annot is information that is explicitly tied to locations on the sequence. This could be feature tables, alignments, or graphs, at the present time. A Bioseq can have more than one feature table, perhaps coming from different sources, or a feature table and a graph, etc. A Bioseq is only REQUIRED to have two elements, id and inst. Bioseq.id is one or more identifiers for this Bioseq. An identifier is a key which allows us to retrieve this object from a database or identify it uniquely. It is not a name, which is a human compatible description, but not necessarily a unique identifier. The name "Jane Doe" does not uniquely identify a person in the United States, while the identifier, social security number, does. 
Each Seq-id is a CHOICE of one of a number of identifier types from different databases, which may have different structures. All Bioseqs MUST have at least one identifier.
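For readers more used to programming languages than to ASN.1, the shape of a Bioseq can be mirrored loosely with Python dataclasses. This is a rough sketch and not NCBI's own software: identifiers are reduced to plain strings, Seq-inst is cut down to three fields, and descriptors and annotations are stand-in dictionaries.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class SeqInst:
    """Cut-down stand-in for Seq-inst: how the sequence is instantiated."""
    representation: str            # e.g. "raw", "virtual", "seg", "map"
    mol: str                       # e.g. "dna", "rna", "aa"
    length: Optional[int] = None   # OPTIONAL in the ASN.1 definition

@dataclass
class Bioseq:
    """Stand-in for Bioseq: id and inst are required, descr and annot are optional."""
    id: List[str]
    inst: SeqInst
    descr: Optional[dict] = None
    annot: List[dict] = field(default_factory=list)

    def __post_init__(self):
        # All Bioseqs MUST have at least one identifier.
        if not self.id:
            raise ValueError("a Bioseq must have at least one identifier")

demo = Bioseq(id=["lcl|demo"], inst=SeqInst("raw", "dna", 7))
print(demo)
```

The required and optional elements of the ASN.1 definition map onto constructor arguments with and without defaults, and the rule about identifiers becomes a simple check at construction time.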

Classes of Biological Sequences

[Figure: There are many classes of Bioseq. A Bioseq may be DNA, RNA, or protein. A Bioseq may be represented many ways (virtual: no residues; raw: AGCCTTT; seg: parts by pointer; map: landmarks). A Bioseq may have a history (Seq-hist).]
The required element of a Bioseq is a Seq-inst. This element instantiates the sequence itself. It answers questions such as: Is it DNA, RNA, or protein? Circular or linear? Double-stranded or single-stranded? How long is it?
    Seq-inst ::= SEQUENCE {
        repr ENUMERATED {
            not-set (0), virtual (1), raw (2), seg (3), const (4),
            ref (5), consen (6), map (7), other (255) },
        mol ENUMERATED {
            not-set (0), dna (1), rna (2), aa (3), na (4), other (255) },
        length INTEGER OPTIONAL,
        fuzz Int-fuzz OPTIONAL,
        topology ENUMERATED {
            not-set (0), linear (1), circular (2), tandem (3),
            other (255) } DEFAULT linear,
        strand ENUMERATED {
            not-set (0), ss (1), ds (2), mixed (3), other (255) } OPTIONAL,
        seq-data Seq-data OPTIONAL,
        ext Seq-ext OPTIONAL,
        hist Seq-hist OPTIONAL }

Seq-inst is the parent class of a sequence representation class hierarchy. There are two major branches to the hierarchy. The molecule type branch is indicated by Seq-inst.mol. This could be a nucleic acid, or further subclassified as RNA or DNA. The nucleic acid may be circular, linear, or one repeat of a tandem repeat structure. It can be double, single, or of a mixed strandedness. It could also be a protein, in which case topology and strandedness are not relevant. There is also a representation branch, which is independent of the molecule type branch. This class hierarchy involves the particular data structure used to represent the knowledge we have about the molecule, no matter which part of the molecule type branch it may be in. The repr element indicates the type of representation used. The aim of such a set of representation classes is to support the information needed to express different views of sequence based objects, from chromosomes to restriction fragments, from genetic maps to proteins, within a single overall model. The ability to do this confers profound advantages for software tools, data storage and retrieval, and traversal of related sequence and map data from different scientific domains.

A virtual representation is used to describe a sequence about which we may know that it is DNA and that it is double stranded, and may even know its length, but for which we do not yet have the actual sequence itself. Most fields of the Seq-inst are filled in, but Seq-inst.seq-data is empty. An example would be a band on a restriction map.

A raw representation is used for what we traditionally consider a sequence. We know it is DNA, it is double stranded, we know its length exactly, and we have the sequence data itself. In this case, Seq-inst.seq-data contains the sequence data.

A segmented representation is closely analogous to a virtual representation. We posit that a continuous double stranded DNA sequence of a certain length exists, and pieces of it exist in other Bioseqs, but there is no data in Seq-inst.seq-data. Such a case would be when we have cloned and mapped a DNA fragment containing a large protein coding region, but have only actually sequenced the regions immediately around the exons. The sequence of each exon is an individual raw Bioseq in its own right. The regions between exons are virtual Bioseqs. The segmented Bioseq uses Seq-inst.ext to hold a SEQUENCE OF Seq-loc. That is, the extension is an ordered series of locations on OTHER Bioseqs, in this case the raw and virtual Bioseqs representing the exons and introns. The segmented Bioseq contains data only by reference to other Bioseqs. In order to retrieve the base at the first position in the segmented Bioseq, one would go to the first Seq-loc in the extension, and return the appropriate base from the Bioseq it points to.

A constructed Bioseq is used to describe an assembly or merge of other Bioseqs. It is analogous to the raw representation. In fact, most raw Bioseqs were actually constructed from an assembly of gel readings. However, the constructed representation class is really meant for tracking higher level merging, such as when an expert in a particular organism or gene region may construct a "typical" sequence from that region by merging available sequence data, often published by different groups, using domain knowledge to resolve discrepancies between reports or to select a typical allele. Seq-inst contains an optional Seq-hist object. Seq-hist contains a field called "assembly" which is a SET OF Seq-align, or sequence alignments. The alignments are used to record the history of how the various component Bioseqs used for the merge are related to the final product. A constructed sequence DOES contain sequence data in Seq-inst.seq-data, unlike a segmented sequence, because the component sequences may overlap, or expert knowledge may have been used to determine the "correct" residue at any position that is not captured in the original components. So Seq-hist.assembly is used to simply record the relationship of the merge to the old Bioseqs, but does NOT describe how to generate it from them.

A map is akin to a virtual Bioseq. For example, for a genetic map of E.coli, we might posit that the E.coli chromosome is about 5 million base pairs long, DNA, double stranded, circular, but we do not have the sequence data for it.
However, we do know the positions of some genes on this putative sequence. In this case, the Seq-inst.ext is a SEQUENCE OF Seq-feat, that is, a feature table. For a genetic map, the feature table contains Gene-ref features. An ordered restriction map would have a feature table containing Rsite-ref features. The feature table is part of Seq-inst because, for a map, it is an essential part of instantiating the map Bioseq, not merely annotation on a known sequence. In a sense, for a map, the annotation IS part of the sequence.

As an aside, note that we have given gene positions on the E.coli genetic map in base pairs, while the standard E.coli map is numbered from 0.0 to 100.0 map units. Numbering systems can be applied to a Bioseq as a descriptor or a feature. For E.coli, we would simply apply the 0.0 - 100.0 floating point numbering system to the map Bioseq. Gene positions can then be shown to the scientists in familiar map units, while the underlying software still treats positions as large integers, just the same as with any other Bioseq.

Coordinates on ANY class of Bioseq are ALWAYS integer offsets. So the first residue in any Bioseq is at position 0. The last residue of any Bioseq is in position (length - 1).

The consequence of this design is that one uses EXACTLY the same data object to describe the location of a gene on an unsequenced restriction fragment, a fully sequenced piece of DNA, a partially sequenced piece of DNA, a putative overview of a large genetic region, or a genetic or physical map. Software to display, manipulate, or compare gene locations can work without change on the full range of possible representations. Sequence and physical map data can be easily integrated into a single, dynamically assembled view by creating a segmented sequence which points alternately to raw or constructed Bioseqs and parts of a map Bioseq. The relationship between a genetic and physical map is simply an alignment between two Bioseqs of representation class map, no different than the alignment between two sequences of class raw generated by a database search program like BLAST or FASTA.
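The way a segmented Bioseq resolves a position by reference, using integer offsets that start at 0, can be sketched in Python as follows. The sequences, identifiers, and the Interval class are invented for illustration; the sketch assumes each location is an interval with inclusive start and stop positions on a raw Bioseq, and it does not handle virtual segments that carry no data.

```python
from dataclasses import dataclass

@dataclass
class Interval:
    """Stand-in for a Seq-loc interval on another Bioseq (0-based, inclusive)."""
    seq_id: str
    start: int
    stop: int

# Two made-up raw Bioseqs, keyed by identifier.
raw_bioseqs = {"exon1": "ATGGC", "exon2": "TTAAG"}

# The segmented Bioseq holds no sequence data, only an ordered list of locations.
segments = [Interval("exon1", 0, 4), Interval("exon2", 1, 3)]

def segmented_length(segments):
    return sum(seg.stop - seg.start + 1 for seg in segments)

def base_at(position, segments, bioseqs):
    """Return the residue at a 0-based position of the segmented Bioseq."""
    if position < 0:
        raise IndexError("positions are integer offsets starting at 0")
    offset = position
    for seg in segments:
        seg_length = seg.stop - seg.start + 1
        if offset < seg_length:
            # Follow the reference into the Bioseq the location points to.
            return bioseqs[seg.seq_id][seg.start + offset]
        offset -= seg_length
    raise IndexError("position is beyond the last residue (length - 1)")

print(segmented_length(segments))
print(base_at(0, segments, raw_bioseqs), base_at(5, segments, raw_bioseqs))
```

Because the caller only ever passes an integer offset, the same lookup interface would serve whether the underlying data came from a raw sequence, a merge, or a map.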

Locations on Biological Sequences


A Seq-loc is an object which defines a location on a Bioseq. The smooth class hierarchy for Seq-inst makes it possible to use the same Seq-loc to describe an interval on a genetic map as that used to describe an interval on a sequenced molecule. Seq-loc is itself a class hierarchy. A valid Seq-loc can be an interval, a point, a whole sequence, a series of intervals, and so on.
    Seq-loc ::= CHOICE {
        null NULL,
        empty Seq-id,
        whole Seq-id,
        int Seq-interval,
        packed-int Packed-seqint,
        pnt Seq-point,
        packed-pnt Packed-seqpnt,
        mix Seq-loc-mix,
        equiv Seq-loc-equiv,
        bond Seq-bond,
        feat Feat-id }

Seq-loc.null indicates a region of unknown length for which no data exists. Such a location may be used in a segmented sequence for the region between two sequenced fragments about which nothing, not even length, is known. All other Seq-loc types, except Seq-loc.feat, contain a Seq-id, so they are independent of context. As a result, data objects describing information ABOUT Bioseqs can be created and exchanged independently of the Bioseq itself. This encourages the development and exchange of structured knowledge about sequence data from many directions and is an essential goal of the data model.

Associating Annotation With Locations On Biological Sequences


Seq-annot, or sequence annotation, is a collection of information ABOUT a sequence, tied to specific regions of Bioseqs through the use of Seq-locs. A Bioseq can have many Seq-annots associated with it. This allows knowledge from a variety of sources to be collected in a single place but still be attributed to the original sources. Currently there are three kinds of Seq-annot: feature tables, alignments, and graphs.

Feature Tables
A feature table is a collection of Seq-feat, or sequence features. A Seq-feat is designed to tie a Seq-loc together with a datablock, a block of specific data. Datablocks are defined objects themselves, many of which are objects used in their own right in some other context, such as publications (Pub) or references to organisms (Org-ref) or genes (Gene-ref). Some datablocks, such as coding regions (CdRegion), make sense only in the context of a Seq-loc. However, since by design there is no intention that one datablock need have anything in common with any other datablock, each can be tailored exactly to do a particular job. If a change or addition is required to one datablock, no others are affected. In those cases where a pre-existing object from another context is used as a datablock, any software that can use that object can now operate on the feature as well. For example, a piece of code to display a publication can operate on a publication from a bibliographic database or one used as a sequence feature with no change. Since the Seq-feat data structure itself and the Seq-loc used to attach it to the sequence are common to all features, it is also possible to support a class of operations over all features without regard to the different types of datablocks attached to them. So a function to determine all features in a particular region of a Bioseq need not care what type of features they are.
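The last point can be illustrated with a short Python sketch of a region query that looks only at the location part of each feature and never inspects the datablock. The Feature class, the identifier chr1, and the example datablocks are all hypothetical, and locations are simplified to a single inclusive interval with 0-based positions.

```python
from dataclasses import dataclass
from typing import Any, List

@dataclass
class Feature:
    """Stand-in for a Seq-feat: a location plus an arbitrary datablock."""
    seq_id: str
    start: int    # 0-based, inclusive
    stop: int     # 0-based, inclusive
    data: Any     # the datablock; its type is irrelevant to region queries

def features_in_region(features: List[Feature], seq_id: str, start: int, stop: int):
    """Return every feature on seq_id that overlaps the interval start..stop."""
    return [f for f in features
            if f.seq_id == seq_id and f.start <= stop and f.stop >= start]

table = [
    Feature("chr1", 100, 400, {"type": "gene", "name": "demoA"}),
    Feature("chr1", 350, 360, "a publication datablock"),
    Feature("chr1", 900, 950, {"type": "coding region"}),
]

hits = features_in_region(table, "chr1", 300, 500)
print(len(hits))
```

Adding a new kind of datablock would not require any change to this function, which is the design benefit the paragraph above describes.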

A Seq-feat is bipolar in that it contains up to two Seq-loc's. Seq-feat.location indicates the "source" and is similar to the single location in common feature table implementations. Seq-feat.product is the "sink". A CdRegion feature would have its Seq-feat.location on the DNA and its Seq-feat.product on the protein sequence produced. Used this way, it defines the process of translating a DNA sequence to a protein sequence. This establishes in an explicit way the important relationship between nucleic acid and protein sequence databases. The presence of two Seq-loc's also allows a more complete representation of data conflicts or exceptional biological circumstances. If an author presents a DNA sequence and its protein product in a figure in a paper, it is possible to enter the DNA and protein sequences independently, and then confirm through the CdRegion feature that the DNA in fact translates to that protein sequence. In an unfortunate number of published papers, the DNA presented does not translate to the protein presented. This may be a signal that the database has made an error of some sort, which can be caught early and corrected. Or the original paper may be in error. In this case, the "conflict" flag can be set in CdRegion, but the protein sequence is not lost, and retroactive work can be done to determine the source of the problem. It may also be the case that a genomic sequence cannot be translated to a protein for a known biological reason, such as RNA editing or suppressor tRNAs. In this case the "exception" flag can be set in Seq-feat to indicate that the data are correct, but will not behave in the expected way.
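The consistency check described above, translating the DNA at Seq-feat.location and comparing it with the protein at Seq-feat.product, might look like the following Python sketch. It rests on simplifying assumptions: the standard genetic code, a single forward-strand interval, and translation that stops at the first stop codon. The names translate, CdRegionCheck, and check_cdregion are hypothetical.

```python
from dataclasses import dataclass

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna: str) -> str:
    """Translate complete codons, stopping at the first stop codon."""
    protein = []
    for pos in range(0, len(dna) - len(dna) % 3, 3):
        aa = CODON_TABLE[dna[pos:pos + 3].upper()]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


@dataclass
class CdRegionCheck:
    """Outcome of the check; 'conflict' mirrors the flag named in the text."""
    conflict: bool
    translated: str


def check_cdregion(dna: str, loc_start: int, loc_stop: int,
                   protein: str) -> CdRegionCheck:
    """Compare the translation of dna[loc_start..loc_stop] with the product."""
    translated = translate(dna[loc_start:loc_stop + 1])
    return CdRegionCheck(conflict=(translated != protein), translated=translated)


dna = "GGATGGCCAAATAAGG"  # made-up sequence; coding region at 2..13
print(check_cdregion(dna, 2, 13, "MAK"))  # translation agrees
print(check_cdregion(dna, 2, 13, "MAR"))  # disagreement sets conflict
```

Both sequences are kept when the check fails; only the flag records the disagreement, so the source of the problem can be investigated later.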

Sequence Alignments
A sequence alignment is essentially a correlation between Seq-locs, often associated with some score. An alignment is most commonly between two sequences, but it may be among many at once. In an alignment between two raw Bioseqs, a certain amount of optimization can be done in the data structure based on the knowledge that there is a one-to-one mapping between the residues of the sequences. So instead of recording the start and stop in Bioseq A and the start and stop in Bioseq B, it is enough to record the start in A, the start in B, and the length of the aligned region. However, if one is aligning a genetic map Bioseq with a physical map Bioseq, then one will wish to allow the aligned regions to distort relative to one another to account for the differences between the mapping techniques. To accommodate this most general case, there is a Seq-align type which is purely correlations between Seq-locs of any type, with no constraint that they cover exactly the same number of residues. A Seq-align is considered to be a SEQUENCE OF segments. Each segment is an unbroken interval on a defined Bioseq, or a gap in that Bioseq. For example, let us look at the following three-dimensional alignment with 6 segments:
Seq-ids
id=100    AAGGCCTTTTAGAGATGATGATGATGATGA
id=200    AAGGCCTaTTAG.......GATGATGATGA
id=300    ....CCTTTTAGAGATGATGAT....ATGA
          |1  |2      |3     |4 |5  |6  |  Segments

The example above is a global alignment; that is, each segment sequentially maps a region of each Bioseq to a region of the others. An alignment can also be of type "diags", which is just a collection of segments with no implication about the logic of joining one segment to the next. This is equivalent to the diagonal lines that are shown on a dot-matrix plot. The example above illustrates the most general form of a Seq-align, Std-seg, where each segment is purely a correlated set of Seq-locs. Two other forms of Seq-align allow denser packing of data when only raw Bioseqs are aligned. These are Dense-seg, for global alignments, and Dense-diag, for "diags" collections. The basic underlying model for these denser types is very similar to that shown above, but the data structure itself is somewhat different.
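To make the segment model concrete, the following Python sketch derives the six segments from the three gapped rows of the example alignment. Each segment is recorded as a length plus one start offset per Bioseq, in the spirit of the start-plus-length packing described for raw Bioseqs. The function name segments and the output layout are assumptions for illustration and do not reproduce the actual Dense-seg structure; the rows are assumed to be of equal width with "." marking a gap.

```python
from typing import Dict, List, Optional, Tuple

rows = {
    "id=100": "AAGGCCTTTTAGAGATGATGATGATGATGA",
    "id=200": "AAGGCCTaTTAG.......GATGATGATGA",
    "id=300": "....CCTTTTAGAGATGATGAT....ATGA",
}

Segment = Tuple[int, Dict[str, Optional[int]]]


def segments(rows: Dict[str, str], gap: str = ".") -> List[Segment]:
    """Split gapped rows into segments.

    Each segment is (length, starts), where starts maps a Seq-id to the
    0-based residue offset in that Bioseq, or None where it is gapped.
    """
    ids = list(rows)
    width = len(rows[ids[0]])
    consumed = {i: 0 for i in ids}  # residues already used per Bioseq
    result: List[Segment] = []
    col = 0
    while col < width:
        # A segment runs for as long as the gap pattern stays the same.
        pattern = tuple(rows[i][col] == gap for i in ids)
        end = col
        while end < width and tuple(rows[i][end] == gap for i in ids) == pattern:
            end += 1
        length = end - col
        starts: Dict[str, Optional[int]] = {}
        for i, is_gap in zip(ids, pattern):
            if is_gap:
                starts[i] = None
            else:
                starts[i] = consumed[i]
                consumed[i] += length
        result.append((length, starts))
        col = end
    return result


segs = segments(rows)
for number, (length, starts) in enumerate(segs, 1):
    print(number, length, starts)
```

For this input the segment lengths come out as 4, 8, 7, 3, 4, and 4, and a start of None marks a Bioseq that is gapped in that segment (id=300 in segments 1 and 5, id=200 in segment 3).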

Sequence Graph
The third annotation type is a graph on a sequence, Seq-graph. It is basically a Seq-loc, over which to apply the graph, and a series of numbers representing values of the graph along the sequence. A software tool which calculates base composition or hydrophobic tendency might generate a Seq-graph. Additional fields in Seq-graph allow specification of axis labels, setting of ranges covered, compression of the data relative to the sequence, and so on.
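As a rough illustration, a tool that computes base composition could produce something like the structure below. This Python sketch is not the Seq-graph definition: the class SeqGraphSketch and its field names are hypothetical, and the "compression" is simply the number of residues summarised by each value.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SeqGraphSketch:
    """Hypothetical, simplified stand-in for a Seq-graph."""
    title: str
    seq_id: str
    start: int           # first residue covered (0-based)
    stop: int            # last residue covered (0-based, inclusive)
    comp: int            # residues summarised by each value
    values: List[float]  # one value per window along the sequence


def gc_graph(seq_id: str, seq: str, comp: int = 5) -> SeqGraphSketch:
    """Compute the G+C fraction in consecutive windows of 'comp' residues."""
    values = []
    for pos in range(0, len(seq), comp):
        window = seq[pos:pos + comp].upper()
        gc = sum(base in "GC" for base in window)
        values.append(gc / len(window))
    return SeqGraphSketch("GC fraction", seq_id, 0, len(seq) - 1, comp, values)


g = gc_graph("id=100", "AAGGCCTTTTAGAGATGATGATGATGATGA", comp=10)
print(g.values)  # [0.4, 0.4, 0.3]
```

The graph carries its own location (seq_id, start, stop), so like other annotations it can be exchanged separately from the sequence it describes.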

Collections of Related Biological Sequences


It is often useful, even "natural", to package a group of sequences together. Some examples are a segmented Bioseq and the Bioseqs that make up its parts, a DNA sequence and its translated proteins, the separate chains of a multi-chain molecule, and so on. A Bioseq-set is such a collection of Bioseqs.
Bioseq-set ::= SEQUENCE {
    id Object-id OPTIONAL ,
    coll Dbtag OPTIONAL ,
    level INTEGER OPTIONAL ,
    class ENUMERATED {
        not-set (0) ,
        nuc-prot (1) ,
        seg-set (2) ,
        con-set (3) ,
        parts (4) ,
        gibb (5) ,
        gi (6) ,
        genbank (7) ,
        pir (8) ,
        pub-set (9) ,
        equiv (10) ,
        swissprot (11) ,
        pdb-entry (12) ,
        other (255) } DEFAULT not-set ,
    release VisibleString OPTIONAL ,
    date Date OPTIONAL ,
    descr Seq-descr OPTIONAL ,
    seq-set SEQUENCE OF Seq-entry ,
    annot SET OF Seq-annot OPTIONAL }

The basic structure of a Bioseq-set is very similar to that of a Bioseq. Instead of Bioseq.id, there is a series of identifier and descriptive fields for the set. A Bioseq-set is only a convenient way of packaging sequences, so controlled, stable identifiers are less important for it than they are for Bioseqs. After the first few fields the structure is exactly parallel to a Bioseq. There are descriptors which describe aspects of the collection and the Bioseqs within the collection. The general rule for descriptors in a Bioseq-set is that they apply to "all of everything below". That is, a Bioseq-set of human sequences need have only one Org-ref descriptor for "human" at the top level of the set, and it is applied to all Bioseqs within the set. Then follows the equivalent of Seq-inst, that is, the instantiation of the data. In this case, the data is the chain of contained Bioseqs or Bioseq-sets. A Seq-entry is either a Bioseq or a Bioseq-set. Seq-entry's are very often used as arguments to display and analysis functions, since one can move around either a single Bioseq or a collection of related Bioseqs in context just as easily. This also makes a Bioseq-set recursive. That is, it may consist of collections of collections.
Seq-entry ::= CHOICE {
    seq Bioseq ,
    set Bioseq-set }

Finally, a Bioseq-set may contain Seq-annot's. Generally one would put the Seq-annot's which apply to more than one Bioseq in the Bioseq-set at this level. Examples would be CdRegion features that point to DNA and protein Bioseqs, or Seq-aligns which align more than one Bioseq with each other. However, since Seq-annot's always explicitly cite a Seq-id, it does not matter, in terms of meaning, at what level they are put. This is in contrast to descriptors, where context does matter.
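The recursive Seq-entry structure and the "all of everything below" rule for descriptors can be sketched in Python as follows. The classes and the walk function are hypothetical simplifications: descriptors are reduced to a dictionary, and letting a lower-level descriptor of the same kind override a higher-level one is an assumption of this sketch, not something stated above.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Optional, Tuple, Union


@dataclass
class Bioseq:
    seq_id: str
    descr: Dict[str, str] = field(default_factory=dict)


@dataclass
class BioseqSet:
    descr: Dict[str, str] = field(default_factory=dict)
    seq_set: List["SeqEntry"] = field(default_factory=list)


# A Seq-entry is either a Bioseq or a Bioseq-set, so sets can nest.
SeqEntry = Union[Bioseq, BioseqSet]


def walk(entry: SeqEntry,
         inherited: Optional[Dict[str, str]] = None
         ) -> Iterator[Tuple[str, Dict[str, str]]]:
    """Yield (seq_id, effective descriptors) for every Bioseq under entry."""
    # Assumption: a descriptor set lower down overrides the same key from above.
    effective = {**(inherited or {}), **entry.descr}
    if isinstance(entry, Bioseq):
        yield entry.seq_id, effective
    else:
        for child in entry.seq_set:
            yield from walk(child, effective)


# Illustrative data only: ids and titles are made up.
top = BioseqSet(
    descr={"org": "human"},
    seq_set=[
        Bioseq("dna-1"),
        BioseqSet(
            descr={"title": "example nuc-prot set"},
            seq_set=[Bioseq("dna-2"), Bioseq("prot-2", {"title": "example protein"})],
        ),
    ],
)
result = dict(walk(top))
print(result["prot-2"])  # {'org': 'human', 'title': 'example protein'}
```

Because walk accepts any Seq-entry, the same call works on a single Bioseq or on a set of sets, which is the convenience the text attributes to Seq-entry.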
