GENOMICS FILE

SUBMITTED TO:- SUBMITTED BY:-

MS INDU GAUR K.PUNIT PUSHKAR

IMT/07/8037

SECTION S
INDEX

EXPERIMENTS

• To study different databases and websites that help in genomics research

• To study different tools for genomic research

• To study DNA sequencing methods

• To visualize macromolecular structure of a protein using RASMOL

• To perform gene prediction method using GENSCAN

• To perform multiple sequence alignment using CLUSTALW algorithm

• To perform local alignment search of a sequence from sequence databases using BLAST

EXPERIMENT NO.1

AIM: To study different websites and databases related to genomic research

NCBI - The National Center for Biotechnology Information (NCBI) is part of the United
States National Library of Medicine (NLM), a branch of the National Institutes of Health. The
NCBI is located in Bethesda, Maryland and was founded in 1988 through legislation sponsored
by Senator Claude Pepper. The NCBI houses genome sequencing data in GenBank and an index
of biomedical research articles in PubMed Central and PubMed, as well as other information
relevant to biotechnology. All these databases are available online through the Entrez search
engine. The Entrez Global Query Cross-Database Search System is a federated search engine,
or web portal, that allows users to search many discrete health sciences databases at the NCBI
website. The NCBI has been responsible for making the GenBank DNA sequence database
available since 1992. GenBank coordinates with individual laboratories and other sequence
databases such as those of the European Molecular Biology Laboratory (EMBL) and the DNA
Data Bank of Japan (DDBJ). The two major roles of NCBI are to conduct research in
computational biology and to create public databases. GenBank is part of the International
Nucleotide Sequence Database Collaboration, along with Europe's EMBL and Japan's DDBJ.
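
As an illustration of how the Entrez databases can be queried programmatically, the sketch below
builds request URLs for NCBI's E-utilities web service (esearch and efetch). It only constructs
the URLs and does not contact the server. The helper names, the example accession NM_000518 and
the search term are assumptions chosen for illustration, not part of the original experiment.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities service
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=5):
    """Build a URL that searches one Entrez database for a query term."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def efetch_url(db, record_id, rettype="fasta", retmode="text"):
    """Build a URL that retrieves one record (FASTA text by default)."""
    params = {"db": db, "id": record_id, "rettype": rettype, "retmode": retmode}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


# Example (hypothetical inputs): a PubMed search and a nucleotide record fetch
print(esearch_url("pubmed", "hemoglobin[Title]"))
print(efetch_url("nuccore", "NM_000518"))
```

Opening either printed URL in a browser returns the same data that the Entrez web pages display.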

1. GenBank - The GenBank sequence database is an open access, annotated collection of all
publicly available nucleotide sequences and their protein translations. This database is
produced and maintained by the National Center for Biotechnology Information (NCBI)
as part of the International Nucleotide Sequence Database Collaboration (INSDC). Direct
submissions are made to GenBank using BankIt, which is a web-based form, or the
stand-alone submission program, Sequin. Upon receipt of a sequence submission, the
GenBank staff assigns an accession number to the sequence and performs quality
assurance checks. The submissions are then released to the public database, where the
entries are retrievable by Entrez or downloadable by FTP.
SEQUENCE SUBMISSION TOOLS include BankIt and Sequin. BankIt is a web-based
submission tool used for a single sequence, a simple set of sequences or a small batch of
different sequences. Sequin is a stand-alone software tool developed by the
NCBI for submitting and updating entries to the GenBank, EMBL, or DDBJ sequence
databases. It is capable of handling simple submissions that contain a single short mRNA
sequence, and complex submissions containing long sequences, multiple annotations,
segmented sets of DNA, or phylogenetic and population studies.

2. EMBL - The European Molecular Biology Laboratory (EMBL) is a molecular biology
research institution supported by 20 European countries, with Australia as an associate
member state. EMBL was created in 1974 and is an intergovernmental organisation funded
by public research money from its member states. The cornerstones of EMBL's mission
are: basic research in molecular biology and molecular medicine; training of scientists,
students and visitors at all levels; vital services to scientists in the member states;
development of new instruments and methods in the life sciences; and active engagement
in technology transfer. The European Bioinformatics Institute (EBI), one of EMBL's
sites, maintains the EMBL nucleotide sequence database.

3. DDBJ - The DNA Data Bank of Japan (DDBJ) is a DNA data bank located at
the National Institute of Genetics (NIG) in the Shizuoka prefecture of Japan. It is also a
member of the International Nucleotide Sequence Database Collaboration. It exchanges
its data with the European Molecular Biology Laboratory at the European Bioinformatics
Institute and with GenBank at the National Center for Biotechnology Information on a
daily basis. Thus these three databanks contain the same data at any given time.
TYPES OF DATABASES:

NUCLEOTIDE DATABASES

dbEST is a division of GenBank, established in 1992, that holds expressed sequence tag (EST)
data. As with GenBank, data in dbEST are submitted directly by laboratories worldwide and are
not curated.

dbSTS is an NCBI resource that contains sequence data for short genomic landmark sequences
or Sequence Tagged Sites (STS). STS sequences are incorporated into the STS Division
of GenBank. The dbSTS database offers a route for submission of STS sequences to GenBank. It
is designed especially for the submission of large batches of STS sequences.

The Single Nucleotide Polymorphism Database (dbSNP) is a free public archive for genetic
variation within and across different species developed and hosted by the National Center for
Biotechnology Information (NCBI) in collaboration with the National Human Genome Research
Institute (NHGRI). Although the name of the database implies a collection of one class of
polymorphisms only (i.e., single nucleotide polymorphisms (SNPs)), it in fact contains a range of
molecular variation: (1) SNPs, (2) short deletion and insertion polymorphisms (indels/DIPs),
(3) microsatellite markers or short tandem repeats (STRs), (4) multinucleotide polymorphisms
(MNPs), (5) heterozygous sequences, and (6) named variants.

The Reference Sequence (RefSeq) database is an open access, annotated and curated collection
of publicly available nucleotide sequences (DNA, RNA) and their protein products. This
database is built by NCBI and, unlike GenBank, provides only a single record for each natural
biological molecule (i.e. DNA, RNA or protein) for major organisms ranging from viruses to
bacteria to eukaryotes. For each model organism, RefSeq aims to provide separate and linked
records for the genomic DNA, the gene transcripts, and the proteins arising from those
transcripts. RefSeq is limited to major organisms for which sufficient data are available.

PROTEIN DATABASES
The Protein Data Bank (PDB) is a repository for the 3-D structural data of large biological
molecules, such as proteins and nucleic acids. The data, typically obtained by X-ray
crystallography or NMR spectroscopy and submitted by biologists and biochemists from around
the world, are freely accessible on the Internet via the websites of its member organisations
(PDBe, PDBj, and RCSB). The PDB is overseen by an organization called the Worldwide Protein
Data Bank (wwPDB).

The Protein Clusters database provides easy access to annotation information, publications,
domains, structures, external links and analysis tools, including multiple alignments,
phylogenetic trees, and genomic neighborhoods (ProtMap). Protein Clusters can be searched like
any other Entrez database.

STRUCTURAL DATABASES

The Conserved Domain Database (CDD) brings together several collections of multiple
sequence alignments representing conserved domains, including NCBI-curated domains, which
use 3D-structure information to explicitly define domain boundaries and provide insights into
sequence/structure/function relationships, as well as domain models imported from a number
of external source databases. The data are then used for putative functional annotation of protein
query sequences based on matches to specific hits.

The Structural Classification of RNA (SCOR) database provides a survey of the
three-dimensional motifs contained in 259 NMR and X-ray RNA structures. In one classification,
the structures are grouped according to function. The RNA motifs, including internal and external
loops, are also organized in a hierarchical classification. The 259 database entries contain 223
internal and 203 external loops; 52 entries consist of fully complementary duplexes.

GENOME DATABASES

• The NCBI Entrez Genome database is a collection of complete large-scale sequencing,
assembly, annotation, and mapping projects for cellular organisms. It contains genomic
sequences at different stages of finishing from both the public domain sequencing effort and
Celera Genomics, along with protein function data and gene structure. It helps in understanding
the genomic organization of genes, mapping a gene, understanding the exon/intron structure of a
gene, searching for genetic and physical markers, and accessing comprehensive information
about a gene, its transcript(s) and protein(s), structure, activity, and location.

CHEMICAL DATABASES

PubChem is a database of chemical molecules and their activities against biological assays.
The system is maintained by the NCBI, a component of the National Library of Medicine,
which is part of the United States National Institutes of Health (NIH). PubChem can be
accessed for free through a web user interface.

Chemical Entities of Biological Interest, also known as ChEBI, is a database and ontology of
molecular entities focused on 'small' chemical compounds, and is part of the Open Biomedical
Ontologies effort. The term "molecular entity" refers to any "constitutionally or isotopically
distinct atom, molecule, ion, ion pair, radical, radical ion, complex, conformer, etc.,
identifiable as a separately distinguishable entity". ChEBI uses nomenclature, symbolism and
terminology endorsed by the International Union of Pure and Applied Chemistry (IUPAC) and the
Nomenclature Committee of the International Union of Biochemistry and Molecular Biology.

METABOLIC OR PATHWAY DATABASES

KEGG (Kyoto Encyclopedia of Genes and Genomes) is a collection of online databases
dealing with genomes, enzymatic pathways, and biological chemicals. The PATHWAY
database records networks of molecular interactions in the cells, and variants of them specific
to particular organisms.

LITERATURE DATABASES

PubMed is a free database accessing primarily the MEDLINE database of references and
abstracts on life sciences and biomedical topics. In addition to MEDLINE, it also provides
access to OLDMEDLINE for pre-1966 records and citations to articles from MEDLINE
journals. Citations may include links to full-text content from PubMed Central and publisher
web sites.

MEDLINE (Medical Literature Analysis and Retrieval System Online) is a
bibliographic database of life sciences and biomedical information. It includes bibliographic
information for articles from academic journals covering medicine, nursing, pharmacy,
dentistry, veterinary medicine, and health care. MEDLINE also covers much of the literature
in biology and biochemistry, as well as fields such as molecular evolution. Compiled by the
United States National Library of Medicine (NLM), MEDLINE is freely available on the
Internet and searchable via PubMed and NLM's National Center for Biotechnology
Information's Entrez system.

DISEASE DATABASES

OMIM is a comprehensive, authoritative, and timely compendium of human genes and genetic
phenotypes. The full-text, referenced overviews in OMIM contain information on all known
Mendelian disorders and over 12,000 genes. OMIM focuses on the relationship between
phenotype and genotype. It is updated daily, and the entries contain copious links to other
genetics resources.
EXPERIMENT NO.2

AIM: To study different tools used for genomic research

Phred: Better Base Calling

Phred is a base-calling program for DNA sequence traces. The program was developed by Drs.
Phil Green and Brent Ewing, and is copyrighted by the University of Washington. It is widely
used by the largest academic and commercial sequencing laboratories. It offers high base-calling
accuracy, with reported error rates 40-50% lower than those of earlier base-calling software. The
highly accurate error probabilities that Phred calculates for each base enable increased
automation of the sequencing process: for example, drastically lower false-positive error rates
in mutation detection, effective quality control immediately after sequence production, and
quantitative benchmarking of different sequencing methods and protocol changes. Phred was
developed for the Human Genome Project, where large amounts of sequence data were processed by
automated scripts; therefore, Phred's processing options are set by command line parameters. For
Windows and OS X users who would like to use Phred through an easy-to-use graphical user
interface, the sequence analysis software CodonCode Aligner is available. CodonCode Aligner
greatly simplifies using Phred for base calling and Phrap for sequence assembly, and also offers
a number of additional functions often needed in DNA sequencing projects, for example contig
alignment and editing, reference sequence alignments, and mutation detection.
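
The per-base error probabilities mentioned above are usually reported as Phred quality scores,
defined as Q = -10 log10(P), where P is the estimated probability that the base call is wrong.
The short sketch below converts between the two; the function names are illustrative only.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred quality score Q into an error probability P."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability P (0 < P <= 1) into a Phred quality score Q."""
    return -10 * math.log10(p)


# Q20 corresponds to a 1-in-100 chance of error, Q30 to 1-in-1000
for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```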

Phrap: Better Sequence Assemblies

Phrap is a leading program for DNA sequence assembly. It is routinely used in some of the
largest sequencing projects, including the Human Genome Project, and in the biotech
industry. Some of Phrap's features include:
Fast assemblies- Assemblies of cosmid- to BAC sized projects with several hundred to two
thousand reads typically take only minutes to complete on high-powered workstations or
personal computers.

Accurate consensus sequences- Phrap uses Phred's quality scores to determine highly accurate
consensus sequences. Phrap examines all individual sequences at a given position, and generally
uses the highest quality sequence to build the consensus. Compared to the simple majority rule
used in older sequence assembly programs, Phrap's approach can give significantly more accurate
consensus sequences.

Consensus quality estimates- Phrap uses the quality information of individual sequences to
estimate the quality of the consensus sequence. In addition, Phrap uses available information
about sequencing chemistry (dye terminator or dye primer) and confirmation by "other strand"
reads in estimating the consensus quality.

Ability to assemble very large projects- Phrap has been used routinely to assemble bacterial
genomes sequenced by the "shotgun" approach, where each project contained tens of thousands
of reads. Smaller bacterial genomes (2 million bases or less) could often be assembled in less
than three hours.

Improved identification and handling of repeats- Phrap uses quality scores to estimate whether
discrepancies between two overlapping sequences are more likely to arise from random errors, or
from different copies of a repeated sequence. For repeats with 95 to 98% identity (like human
Alu sequences) and high quality sequence data, this typically yields correct assemblies.
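
The difference between a majority-rule consensus and a quality-based consensus, as described
under "Accurate consensus sequences", can be illustrated with a toy example. This is a
simplified sketch and not Phrap's actual algorithm; the function names and the example column
are assumptions made for illustration.

```python
from collections import Counter


def majority_base(column):
    """Pick the base that occurs most often in an alignment column.

    column is a list of (base, quality) pairs, one per read.
    """
    counts = Counter(base for base, _ in column)
    return counts.most_common(1)[0][0]


def highest_quality_base(column):
    """Pick the base call that carries the highest quality score."""
    return max(column, key=lambda call: call[1])[0]


# Two low-quality reads say A, one high-quality read says G
column = [("A", 12), ("A", 10), ("G", 45)]
print(majority_base(column))         # majority rule picks A
print(highest_quality_base(column))  # quality-based choice picks G
```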

Cross_match: Fast DNA Sequence Comparisons and Vector Screening

Cross_match is a program for fast comparisons of DNA sequences that uses the same algorithms
as Phrap. For example, the comparison of several hundred thousand bases of "raw" sequence to
the sequence of an entire BAC typically takes less than one minute. Within the Phred - Phrap
system, Cross_match is typically used for vector screening. In addition, it is used for the
identification of overlaps between contig ends after assembly with Phrap, identification of
potential repeat sequences in assemblies, generation of error summaries and lists after
completion of sequencing projects, and estimation of vector contamination in newly created
libraries.

Fgenesh

Fgenesh is a gene prediction program in the ab initio category. It is an HMM-based program
with parameters for finding genes in human, Drosophila, plant, yeast and nematode sequences.
The program predicts some genes that are not annotated as genes and fails to predict some
genes that do exist. A newer program, fgenesh+, recovers a set of these missed genes when
information about homologous protein sequences is supplied. It is better in terms of
sensitivity and specificity, suggesting that, while ignoring similarity to cDNAs, ESTs, and
protein sequences may be appropriate for evaluating the ab initio part of a predictor
algorithm, in real-life scenarios of predicting genes in newly sequenced eukaryotic genomes
more genes can be predicted by including these database sequences.

Glimmer

Glimmer is a system for finding genes in microbial DNA, especially the genomes of bacteria,
archaea, and viruses. Glimmer (Gene Locator and Interpolated Markov ModelER) uses
interpolated Markov models (IMMs) to identify the coding regions and distinguish them from
noncoding DNA. The IMM approach uses a combination of Markov models from 1st through
8th-order, weighting each model according to its predictive power. Glimmer uses 3-periodic
nonhomogenous Markov models in its IMMs.

Glimmer was the primary microbial gene finder used at The Institute for Genomic Research
(TIGR), where it was first developed, and has been used to annotate the complete genomes of
over 100 bacterial species from TIGR and other labs. Glimmer served as the basis for the design
of GlimmerM, which includes an algorithm for predicting splice sites. Further improvements to
GlimmerM for the purpose of eukaryotic gene prediction resulted in GlimmerHMM. GlimmerHMM
also adds splice site predictors adapted from the GeneSplicer program.

Grail

The goal of the GRAIL program is to utilize several algorithms detecting different features of a
protein coding gene to predict with high accuracy the position of a gene within a genome.
Originally, GRAIL examined the presence of these several features (discussed below) in a
sliding 99-nucleotide window; however, this biases the program towards prediction of longer
exons and misses a larger number of shorter exons. This bias was later removed by allowing the
program to examine all possible exons, rather than just those in a sliding window. In both cases,
GRAIL utilizes a neural network to combine predictions for all these gene features. GRAIL starts
by scoring a region as protein coding versus noncoding based on the frequency of 6-mers
that occur often in coding versus noncoding sequences. These coding regions are then scored for
the presence of a start codon with a stop codon downstream and in-frame. A higher score is
achieved by the presence of these features. The GRAIL algorithm can also identify frameshift
mutations (insertions or deletions) that may be introduced due to errors during sequencing, by
determining when a shift in frame occurs in a region with high coding potential, creating an
out-of-frame stop codon. Splice sites are also detected, by analysis of the coding region
surrounding splice donor and splice acceptor sequences. GRAIL also scores CpG islands, which
are underrepresented in the genome but enriched just 5' of coding regions, the presence of a
TATA box, and the polyadenylation signal.
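
One of the features described above, a start codon followed by an in-frame stop codon, can be
sketched in a few lines. This is only an illustration of that single feature on the forward
strand, not the GRAIL algorithm; the function name, the min_codons parameter and the test
sequence are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=2):
    """Return (start, end) index pairs of ATG...stop stretches, forward strand only.

    start is the index of the A in ATG; end is one past the last base of the stop codon.
    min_codons counts the codons before the stop, including the ATG.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        pos = frame
        while pos + 3 <= len(seq):
            if seq[pos:pos + 3] == "ATG":
                end = None
                # Walk codon by codon, staying in frame, until the first stop codon
                for j in range(pos + 3, len(seq) - 2, 3):
                    if seq[j:j + 3] in STOP_CODONS:
                        end = j + 3
                        break
                if end is not None:
                    if (end - pos) // 3 - 1 >= min_codons:
                        orfs.append((pos, end))
                    pos = end  # continue scanning after this stop codon
                    continue
            pos += 3
    return orfs


seq = "CCATGAAATTTTAGGG"
for start, end in find_orfs(seq):
    print(start, end, seq[start:end])
```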
EXPERIMENT NO.3

AIM: To Study DNA sequencing methods

DNA sequencing refers to methods for determining the order of the nucleotide bases (adenine,
guanine, cytosine, and thymine) in a molecule of DNA. Knowledge of DNA
sequences has become indispensable for basic biological research, other research branches
utilizing DNA sequencing, and numerous applied fields such as diagnostics, biotechnology,
forensic biology and biological systematics. The advent of DNA sequencing has significantly
accelerated biological research and discovery. The rapid speed of sequencing attained with
modern DNA sequencing technology has been instrumental in the sequencing of the human
genome in the Human Genome Project.

Maxam–Gilbert sequencing

In 1976-1977, Allan Maxam and Walter Gilbert developed a DNA sequencing method based on
chemical modification of DNA and subsequent cleavage at specific sites. The method requires
radioactive labeling at one 5' end of the DNA (typically by a kinase reaction using gamma-32P
ATP) and purification of the DNA fragment to be sequenced. Chemical treatment generates
breaks at a small proportion of one or two of the four nucleotide bases in each of four reactions
(G, A+G, C, C+T). For example, the purines (A+G) are depurinated using formic acid, the
guanines (and to some extent the adenines) are methylated by dimethyl sulfate, and the
pyrimidines (C+T) are modified using hydrazine. The addition of salt (sodium chloride) to the
hydrazine reaction inhibits the reaction of thymine for the C-only reaction. The modified
DNAs are then cleaved by hot piperidine at the position of the modified base. The concentration
of the modifying chemicals is controlled to introduce on average one modification per DNA
molecule. Thus a series of labeled fragments is generated, from the radiolabeled end to the first
"cut" site in each molecule. The fragments in the four reactions are electrophoresed side by side
in denaturing acrylamide gels for size separation. To visualize the fragments, the gel is exposed
to X-ray film for autoradiography, yielding a series of dark bands each corresponding to a
radiolabeled DNA fragment, from which the sequence may be inferred. Also sometimes known
as "chemical sequencing", this method led to the Methylation Interference Assay used to map
DNA-binding sites for DNA-binding proteins.
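
The way a sequence is inferred from the four reaction lanes (G, A+G, C, C+T) can be written as
a simple rule: a band in both G and A+G means G, a band in A+G only means A, a band in both C
and C+T means C, and a band in C+T only means T. The sketch below applies that rule. As a
simplifying assumption, band positions are taken to be already converted to base positions
(1-based) along the labeled strand; the function name and example data are illustrative.

```python
def read_maxam_gilbert(lanes, length):
    """Infer a sequence from the band positions seen in the four reaction lanes.

    lanes maps 'G', 'A+G', 'C' and 'C+T' to sets of 1-based base positions.
    Positions with no band in any lane are reported as 'N'.
    """
    bases = []
    for pos in range(1, length + 1):
        if pos in lanes["G"]:
            bases.append("G")   # band in the G lane (and normally in A+G too)
        elif pos in lanes["A+G"]:
            bases.append("A")   # band in A+G but not in G
        elif pos in lanes["C"]:
            bases.append("C")   # band in the C lane (and normally in C+T too)
        elif pos in lanes["C+T"]:
            bases.append("T")   # band in C+T but not in C
        else:
            bases.append("N")
    return "".join(bases)


lanes = {"G": {1}, "A+G": {1, 2}, "C": {4}, "C+T": {3, 4}}
print(read_maxam_gilbert(lanes, 4))
```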

Chain-termination method

The chain-termination (Sanger) method has become the method of choice for DNA sequencing
because it is more efficient and uses fewer toxic chemicals and lower amounts of radioactivity
than the method of Maxam and Gilbert. The key principle of the Sanger method is the use of
dideoxynucleotide triphosphates (ddNTPs) as DNA chain terminators. The classical
chain-termination method requires a single-stranded DNA template, a DNA primer, a DNA
polymerase, normal deoxynucleoside triphosphates (dNTPs), and modified nucleotides
(dideoxyNTPs) that terminate DNA strand elongation. These ddNTPs may also be radioactively
or fluorescently labelled for detection in automated sequencing machines. The DNA sample is
divided into four separate sequencing reactions, each containing all four of the standard
deoxynucleotides (dATP, dGTP, dCTP and dTTP) and the DNA polymerase. To each reaction is
added only one of the four dideoxynucleotides (ddATP, ddGTP, ddCTP, or ddTTP). These are
the chain-terminating nucleotides: they lack the 3'-OH group required for the formation of a
phosphodiester bond between two nucleotides, thus terminating DNA strand extension and
resulting in DNA fragments of varying length.

The newly synthesized and labelled DNA fragments are heat denatured and separated by size
(with a resolution of just one nucleotide) by gel electrophoresis on a denaturing
polyacrylamide-urea gel, with each of the four reactions run in one of four individual lanes
(lanes A, T, G, C). The DNA bands are then visualized by autoradiography or UV light, and the
DNA sequence can be read directly off the X-ray film or gel image. When X-ray film is exposed
to the gel, the dark bands correspond to DNA fragments of different lengths. A dark band in
a lane indicates a DNA fragment that is the result of chain termination after incorporation of a
dideoxynucleotide (ddATP, ddGTP, ddCTP, or ddTTP). The relative positions of the different
bands among the four lanes are then used to read (from bottom to top) the DNA sequence.
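
Reading the gel from bottom to top amounts to sorting the terminated fragments by length and
noting which lane each one is in. The sketch below simulates this under simplifying
assumptions: every possible termination product is present, and fragment lengths are counted
from the first nucleotide added after the primer. The function names are illustrative.

```python
def sanger_lanes(new_strand):
    """For each ddNTP lane, list the lengths of the fragments ending in that base."""
    lanes = {"A": [], "T": [], "G": [], "C": []}
    for length, base in enumerate(new_strand, start=1):
        lanes[base].append(length)
    return lanes


def read_gel(lanes):
    """Read the lanes from the shortest fragment (bottom) to the longest (top)."""
    bands = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in bands)


lanes = sanger_lanes("GATTC")
print(lanes)
print(read_gel(lanes))
```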

Automated DNA sequencing

Automated DNA-sequencing instruments (DNA sequencers) can sequence up to 384 DNA
samples in a single batch (run) in up to 24 runs a day. DNA sequencers carry out capillary
electrophoresis for size separation, detection and recording of dye fluorescence, and data output
as fluorescent peak trace chromatograms. Sequencing reactions by thermocycling, cleanup and
re-suspension in a buffer solution before loading onto the sequencer are performed separately. A
number of commercial and non-commercial software packages can trim low-quality DNA traces
automatically. These programs score the quality of each peak and remove low-quality base peaks
(generally located at the ends of the sequence).
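
A minimal sketch of end trimming is shown below. It simply removes bases from both ends until
a base at or above a quality threshold is reached. Real trimming programs typically use more
elaborate criteria; the function name and the threshold of 20 are assumptions for illustration.

```python
def trim_low_quality(bases, quals, threshold=20):
    """Strip bases from both ends while their quality is below the threshold."""
    start = 0
    end = len(bases)
    while start < end and quals[start] < threshold:
        start += 1
    while end > start and quals[end - 1] < threshold:
        end -= 1
    return bases[start:end], quals[start:end]


bases = "NNACGTAN"
quals = [5, 8, 30, 35, 40, 32, 25, 9]
print(trim_low_quality(bases, quals))
```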
EXPERIMENT NO.4

AIM: To visualize the macromolecular structure of proteins using RASMOL

RasMol is a computer program for molecular graphics visualization, intended and used
primarily for the depiction and exploration of biological macromolecule structures, such as
those found in the Protein Data Bank. It was originally developed by Roger Sayle in the
early 1990s. Maintenance of RasMol, much of the development, and integration of
modifications provided by the community is done at the ARCiB laboratory at Dowling
College. Work on RasMol has been supported in part by grants from the U.S. Department
of Energy, the U.S. National Science Foundation and the U.S. NIH National Institute of
General Medical Sciences. RasMol 2.7.5 runs on a wide range of architectures and operating
systems including Microsoft Windows, Apple Macintosh, UNIX and VMS systems. UNIX
and VMS versions require an 8, 24 or 32 bit colour X Windows display (X11R4 or later).
The X Windows version of RasMol provides optional support for a
hardware dials box and accelerated shared memory communication (via the XInput and
MIT-SHM extensions) if available on the current X Server.
The program reads in a molecule coordinate file and interactively displays the molecule on
the screen in a variety of colour schemes and molecule representations. Currently available
representations include depth-cued wireframes, 'Dreiding' sticks, spacefilling (CPK)
spheres, ball and stick, solid and strand molecular ribbons, atom labels and dot surfaces.
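
To give an idea of what a molecule coordinate file contains, the sketch below reads ATOM and
HETATM records from PDB-format text using the fixed column positions of that format. It is a
minimal illustration and not how RasMol itself is implemented; the function name is an
assumption and the coordinates in the sample record are made-up values.

```python
def parse_atoms(pdb_text):
    """Extract atom name, residue, chain and x/y/z coordinates from PDB-format text."""
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM  ", "HETATM")):
            atoms.append({
                "name": line[12:16].strip(),      # columns 13-16
                "res_name": line[17:20].strip(),  # columns 18-20
                "chain": line[21],                # column 22
                "res_seq": int(line[22:26]),      # columns 23-26
                "x": float(line[30:38]),          # columns 31-38
                "y": float(line[38:46]),          # columns 39-46
                "z": float(line[46:54]),          # columns 47-54
            })
    return atoms


# One sample record assembled field by field (illustrative values only)
sample = (
    "ATOM  "      # columns 1-6: record name
    "    1"       # columns 7-11: atom serial number
    " "           # column 12: blank
    " N  "        # columns 13-16: atom name
    " "           # column 17: alternate location indicator
    "VAL"         # columns 18-20: residue name
    " "           # column 21: blank
    "A"           # column 22: chain identifier
    "   1"        # columns 23-26: residue sequence number
    "    "        # columns 27-30: insertion code and blanks
    "   6.204"    # columns 31-38: x coordinate
    "  16.869"    # columns 39-46: y coordinate
    "   4.854"    # columns 47-54: z coordinate
)
print(parse_atoms(sample))
```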

PROCEDURE FOR VISUALISATION:

Search for the RasMol V 2.7.5 README using Google.
Download the RasMol V 2.7.5 Windows installer and save it.
Open the PDB website (www.pdb.org) and search by PDB ID or text for the complete structure
file of the protein of interest (haemoglobin in this case).
Download the file entitled “structure of human deoxy hemoglobin A in complex with
xenon”.
Open the structure file in RasMol and analyze its structure and sequence with the help of the
functions available in RasMol.
EXPERIMENT NO.5

AIM: To perform gene structure prediction using Genscan

Genscan is a bioinformatics program. Its main function is to take a DNA sequence and
find the open reading frames (ORFs) that correspond to genes. Genscan was developed by
Chris Burge as part of his doctoral thesis work. The program is not only used to predict
genes in a sequenced stretch of DNA; it also takes the percentage C+G content of the
sequence into account. There are two approaches to gene prediction.

Statistical pattern identification - this approach to gene prediction uses general knowledge
about gene structure, including the promoter region and the start and end sequences of
introns and exons.
Sequence similarity comparison - this approach takes advantage of the fact that if a sequence
is similar to the one with which it is being compared, it is likely to have the same
function. However, the structure of a gene cannot be predicted accurately based on
sequence information alone.
For large-scale analysis of gene function, the typical strategy is to completely inactivate
each gene or to overexpress it. In each case, however, the resulting phenotype may not be
informative. Genscan uses two types of signal models to model different functional units. A
weight matrix model is used for modeling promoters, polyadenylation signals, and
transcription initiation and termination signals. A modified version of the weighted array
model is used for modeling acceptor splice sites. After the prediction of gene structure, its
function and expression level can be investigated. Genscan can also identify disease severity.
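
The idea behind a weight matrix model can be sketched as follows: each position of a signal has
its own base probabilities, and a candidate site is scored by summing the log-odds of its bases
against a background frequency. The matrix below is a made-up, TATA-like toy example and is not
taken from Genscan; the names PWM, BACKGROUND, log_odds_score and best_site are hypothetical.

```python
import math

# Hypothetical 4-position weight matrix (probability of each base at each position)
PWM = [
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
]
BACKGROUND = 0.25  # assumed uniform background base frequency


def log_odds_score(site):
    """Sum of log2(position probability / background) over the bases of a site."""
    return sum(math.log2(column[base] / BACKGROUND) for column, base in zip(PWM, site))


def best_site(seq):
    """Return (score, start index) of the highest-scoring window in seq."""
    width = len(PWM)
    return max(
        (log_odds_score(seq[i:i + width]), i) for i in range(len(seq) - width + 1)
    )


print(best_site("GGCTATAGC"))
```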

PROCEDURE:

Search for Genscan on Google and select genes.mit.edu/GENSCAN.html.
Go to NCBI's homepage and search for chromosome 14 under the genome databases option.
Select the entire sequence or a part of it and paste it into the input box on the Genscan
homepage.
Fill in the entries according to the requirements of the experiment and click on RUN.
1) GENSCAN 1.0 ru31-1

EXPERIMENT NO.6

AIM: To perform multiple sequence alignment using CLUSTALW algorithm

The sensitivity of the commonly used progressive multiple sequence alignment method has been
greatly improved for the alignment of divergent protein sequences. Firstly, individual weights are
assigned to each sequence in a partial alignment in order to downweight near-duplicate
sequences and upweight the most divergent ones. Secondly, amino acid substitution matrices are
varied at different alignment stages according to the divergence of the sequences to be aligned.
Thirdly, residue specific gap penalties and locally reduced gap penalties in hydrophilic regions
encourage new gaps in potential loop regions rather than regular secondary structure. Fourthly,
positions in early alignments where gaps have been opened receive locally reduced gap penalties
to encourage the opening up of new gaps at these positions. These modifications are incorporated
into a new program, CLUSTALW. ClustalW2 is a general purpose multiple sequence alignment
program for DNA or proteins. The basic multiple alignment algorithm consists of three main
stages: 1) all pairs of sequences are aligned separately in order to calculate a distance matrix
giving the divergence of each pair of sequences; 2) a guide tree is calculated from the distance
matrix; 3) the sequences are progressively aligned according to the branching order in the guide
tree.
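
The first of the three stages, building a distance matrix from all pairwise alignments, can be
sketched as below. This is a simplified illustration that uses a basic global alignment with a
linear gap penalty and defines distance as 1 minus the fraction of identical aligned columns.
It is not ClustalW's actual scoring scheme; the function names and the scoring values are
assumptions.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align a and b; return (identical columns, total aligned columns)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)

    # Trace back from the bottom-right corner, counting identities and columns
    i, j = n, m
    identical = aligned = 0
    while i > 0 or j > 0:
        sub = match if (i > 0 and j > 0 and a[i - 1] == b[j - 1]) else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            identical += a[i - 1] == b[j - 1]
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1  # gap in b
        else:
            j -= 1  # gap in a
        aligned += 1
    return identical, aligned


def distance_matrix(seqs):
    """Pairwise distances: 1 - (identical columns / aligned columns)."""
    n = len(seqs)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            identical, aligned = global_align(seqs[i], seqs[j])
            dist[i][j] = dist[j][i] = 1 - identical / aligned
    return dist


for row in distance_matrix(["ACGT", "ACGT", "ACGA"]):
    print(row)
```

A guide tree would then be built from this matrix (stage 2) before the progressive alignment
(stage 3).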

PROCEDURE: Search for CLUSTALW on Google, select the required page and
follow the steps:

Step 1 - Sequence

Sequence Input Window

Three or more sequences to be aligned can be entered directly into this form. Sequences can
be in GCG, FASTA, EMBL, GenBank, PIR, NBRF, PHYLIP or UniProtKB/Swiss-Prot format.
Partially formatted sequences are not accepted. Adding a return to the end of the sequence may
help certain applications understand the input. Note that directly using data from word processors
may yield unpredictable results as hidden/control characters may be present.

Sequence File Upload

A file containing three or more valid sequences in any format (GCG, FASTA, EMBL, GenBank,
PIR, NBRF, PHYLIP or UniProtKB/Swiss-Prot) can be uploaded and used as input for the
multiple sequence alignment. Word processor files may yield unpredictable results as
hidden/control characters may be present in the files. It is best to save files with the Unix format
option to avoid hidden Windows characters.

Sequence Type

Indicates if the sequences to align are protein or nucleotide (DNA/RNA).

Type       Abbreviation
Protein    protein
DNA        dna

Default value is: Protein [protein]

Step 2 - Pairwise Alignment Options

Alignment Type

The alignment method used to perform the pairwise alignments used to generate the guide tree.

Alignment Type    Description              Abbreviation
slow              Slow, but accurate       slow
fast              Fast, but approximate    fast

Default value is: slow

Protein Weight Matrix (PW)

Slow pairwise alignment protein sequence comparison matrix series used to score alignment.

Matrix (Protein Only)    Abbreviation
BLOSUM                   blosum
PAM                      pam
Gonnet                   gonnet
ID                       id

Default value is: Gonnet [gonnet]

DNA Weight Matrix (PW)

Slow pairwise alignment nucleotide sequence comparison matrix used to score alignment.

Matrix (DNA Only)    Abbreviation
IUB                  iub
ClustalW             clustalw

Default value is: IUB [iub]

Gap Open (PW)

Slow pairwise alignment score for the first residue in a gap.

Default value is: 10

Gap Extension (PW)

Slow pairwise alignment score for each additional residue in a gap.

Default value is: 0.1
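
Read literally, the Gap Open and Gap Extension descriptions above give an affine gap score: one
value for the first residue in a gap and a smaller value for each additional residue. The sketch
below follows that wording with the default values listed (10 and 0.1). It is an interpretation
for illustration and may differ from the exact formula used inside ClustalW; the function name
is an assumption.

```python
def gap_penalty(length, gap_open=10.0, gap_ext=0.1):
    """Penalty for a gap of the given length under the wording above.

    The first residue costs gap_open; every additional residue costs gap_ext.
    """
    if length <= 0:
        return 0.0
    return gap_open + (length - 1) * gap_ext


# One long gap is much cheaper than several separate short gaps
print(gap_penalty(1), gap_penalty(5), 5 * gap_penalty(1))
```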

KTUP

Fast pairwise alignment word size used to find matches between the sequences. Decrease for
sensitivity; increase for speed.

Default value is: 1
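
The role of the word size can be seen in a small sketch that lists the k-tuples (words of
length k) two sequences have in common. Shorter words produce more matches (more sensitive,
slower), longer words fewer (faster). This illustrates the idea only and is not ClustalW's fast
alignment code; the function names are assumptions.

```python
def ktuples(seq, k):
    """Return the set of all words of length k found in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_ktuples(a, b, k=1):
    """Return the words of length k that occur in both sequences."""
    return ktuples(a, k) & ktuples(b, k)


print(sorted(shared_ktuples("ACGT", "CGTA", k=1)))
print(sorted(shared_ktuples("ACGT", "CGTA", k=2)))
```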

Window Length

Fast pairwise alignment window size for joining word matches. Decrease for speed; increase for
sensitivity.

Default value is: 5

Score Type

The type of score to output for the fast pairwise alignments.

Score Type    Description
percent       Percent
absolute      Absolute

Default value is: percent

Top Diags

The number of match regions used to create each fast pairwise alignment.
Decrease for speed; increase for sensitivity.

Default value is: 5

Pair Gap

Fast pairwise alignment gap penalty for each gap created.

Default value is: 3
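
The fast pairwise options above (KTUP and Top Diags) can be illustrated with a small word-matching sketch. It is an assumption-laden toy and not the ClustalW implementation: it indexes words of length `ktup`, counts exact word matches on each diagonal, and keeps the `top_diags` best diagonals. Window Length and Pair Gap are not modelled, and the function name `ktuple_diagonals` is hypothetical.

```python
from collections import Counter, defaultdict

def ktuple_diagonals(seq_a: str, seq_b: str, ktup: int = 1, top_diags: int = 5):
    """Count k-tuple (word) matches per diagonal and return the best diagonals
    as a list of (diagonal, match_count) pairs."""
    # Index every word of length ktup in seq_a by its start position.
    index = defaultdict(list)
    for i in range(len(seq_a) - ktup + 1):
        index[seq_a[i:i + ktup]].append(i)
    # A word match at positions (i, j) lies on diagonal i - j.
    diag_counts = Counter()
    for j in range(len(seq_b) - ktup + 1):
        for i in index.get(seq_b[j:j + ktup], ()):
            diag_counts[i - j] += 1
    return diag_counts.most_common(top_diags)

# "CGT" occurs in the first sequence at offset 3, so both of its
# 2-letter words fall on diagonal 3.
best = ktuple_diagonals("AAACGTAAA", "CGT", ktup=2)
print(best)
```

A larger word size gives fewer, more specific matches (faster, less sensitive), which matches the advice given for KTUP.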

Step 3 - Multiple Sequence Alignment Options

Protein Weight Matrix

The protein sequence comparison matrix series used to score the multiple alignment.

Matrix (Protein Only)    Abbreviation
BLOSUM                   blosum
PAM                      pam
Gonnet                   gonnet
ID                       id

Default value is: Gonnet [gonnet]

DNA Weight Matrix

The nucleotide sequence comparison matrix used to score the multiple alignment.

Matrix (DNA Only)    Abbreviation
IUB                  iub
ClustalW             clustalw

Default value is: IUB [iub]

Gap Open

Multiple alignment penalty for the first residue in a gap.

Default value is: 10

Gap Extension

Multiple alignment penalty for each additional residue in a gap.

Default value is: 0.20

Gap Distances

Multiple alignment gaps that are closer together than this distance are penalised.

Default value is: 5

No End Gaps

Disables the gap separation penalty when scoring gaps at the ends of the multiple
alignment.

Option    Abbreviation
no        false
yes       true

Default value is: no [false]

Iteration

The type of iteration used to improve the multiple alignment.

Iteration    Description                                    Abbreviation
none         No iteration                                   none
tree         Iteration at each step of alignment process    tree
alignment    Iteration only on final alignment              alignment

Default value is: none

Num Iter

Maximum number of iterations to perform

Default value is: 1

Clustering

Clustering type.

Clustering    Description                                Abbreviation
NJ            Neighbour-joining (Saitou and Nei 1987)    NJ
UPGMA         UPGMA clustering                           UPGMA

Default value is: NJ
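
To show what the UPGMA option does in principle, here is a minimal sketch of UPGMA clustering on a small distance table. It is a textbook-style illustration written for this file, not ClustalW's code; the function name `upgma`, the labels and the distances are all made up for the example.

```python
from itertools import combinations

def upgma(names, dist):
    """Build a tree (nested tuples) by UPGMA.
    names: list of labels; dist: dict mapping frozenset({a, b}) to a distance."""
    clusters = {name: 1 for name in names}   # cluster -> number of members
    d = dict(dist)
    while len(clusters) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        merged = (a, b)
        size_a, size_b = clusters.pop(a), clusters.pop(b)
        # Distance to the merged cluster is the size-weighted average.
        for c in clusters:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        clusters[merged] = size_a + size_b
    return next(iter(clusters))

# Hypothetical distances: A and B are close, C and D are close.
pairs = {("A", "B"): 2, ("C", "D"): 4, ("A", "C"): 8,
         ("A", "D"): 8, ("B", "C"): 8, ("B", "D"): 8}
distances = {frozenset(k): v for k, v in pairs.items()}
tree = upgma(["A", "B", "C", "D"], distances)
print(tree)
```

Neighbour-joining, the default, uses a different pair-selection criterion and is not shown here.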


Output

Format for the generated multiple sequence alignment.

Format            Description                                                 Abbreviation
Aln w/numbers     ClustalW alignment format with base/residue numbering       aln1
Aln wo/numbers    ClustalW alignment format without base/residue numbering    aln2
GCG MSF           GCG Multiple Sequence File (MSF) alignment format           gcg
PHYLIP            PHYLIP interleaved alignment format                         phylip
NEXUS             NEXUS alignment format                                      nexus
NBRF/PIR          NBRF or PIR sequence format                                 pir
GDE               GDE sequence format                                         gde
Pearson/FASTA     Pearson or FASTA sequence format                            fasta

Default value is: Aln w/numbers [aln1]

Order

The order in which the sequences appear in the final alignment

Order Description Abbreviation

aligned Determined by the guide tree aligned

input Same order as the input sequences input

Default value is: aligned
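
The options in Steps 1 to 3 can also be passed to a locally installed ClustalW2 program. The sketch below only builds the argument list and does not run anything. The option names (`-INFILE`, `-ALIGN`, `-TYPE`, `-PWGAPOPEN`, `-PWGAPEXT`, `-GAPOPEN`, `-GAPEXT`, `-OUTORDER`, `-CLUSTERING`) are written as I recall them from the ClustalW2 command-line help and should be checked against `clustalw2 -help`. The function name and the file name `sequences.fasta` are hypothetical.

```python
def build_clustalw2_command(infile, seq_type="PROTEIN", pw_gap_open=10,
                            pw_gap_ext=0.1, gap_open=10, gap_ext=0.20,
                            out_order="ALIGNED", clustering="NJ"):
    """Return an argument list for a ClustalW2 run (nothing is executed).
    Defaults mirror the web form values listed in this experiment."""
    return [
        "clustalw2",
        f"-INFILE={infile}",
        "-ALIGN",                       # request a full multiple alignment
        f"-TYPE={seq_type}",            # PROTEIN or DNA
        f"-PWGAPOPEN={pw_gap_open}",    # pairwise gap open
        f"-PWGAPEXT={pw_gap_ext}",      # pairwise gap extension
        f"-GAPOPEN={gap_open}",         # multiple alignment gap open
        f"-GAPEXT={gap_ext}",           # multiple alignment gap extension
        f"-OUTORDER={out_order}",       # ALIGNED or INPUT
        f"-CLUSTERING={clustering}",    # NJ or UPGMA
    ]

# Example for two nucleotide sequences stored in a hypothetical file.
cmd = build_clustalw2_command("sequences.fasta", seq_type="DNA")
print(" ".join(cmd))
```

The list could be handed to `subprocess.run` once the option names have been verified on the installed version.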


Step 4 - Submission

Job title

It is possible to identify the tool result by giving it a name. This name will be associated with
the results and might appear in some of the graphical representations of the results.

Email Notification

Running a tool is usually an interactive process: the results are delivered directly to the browser
when they become available. Depending on the tool and its input parameters, this may take quite
a long time. It is possible to be notified by email when the job is finished by ticking the
box "Be notified by email". An email with a link to the results will be sent to the email address
specified in the corresponding text box. Email notifications require valid email addresses.

Email Address

If email notification is requested, then a valid Internet email address in the form
joe@bio.med.org must be provided. This is not required when running the tool interactively
(the results will be delivered to the browser window when they are ready).
CLUSTAL 2.1 multiple sequence alignment

gi|166362739|ref|NM_001992.3|
AGAGACTCTCACTGCACGCCGGAGGGCGCCCTTCCTCGCTCGCGCCCGCG 50
gi|133892391|ref|NM_010169.3|
--------------------------------------------------

gi|166362739|ref|NM_001992.3|
CGACCGCGCGCCCCAGTCCCGCCCCGCCCCGCTAACCGCCCCAGACACAG 100
gi|133892391|ref|NM_010169.3| ------------------------------GCTA-----
CTCAGAAA--- 12
**** *
**** *

gi|166362739|ref|NM_001992.3|
CGCTCGCCGAGGGTCGCTTGGACCCTGATCTTACCCGTGGGCACCCTGCG 150
gi|133892391|ref|NM_010169.3| --------GAAG------TAGGC---GA------
CGGCGGGCGCC----- 34
** * * * * ** * * ****
**

gi|166362739|ref|NM_001992.3|
CTCTGCCTGCCGCGAAGACCGGCTCCCCGACCCGCAGAAGTCAGGAGAGA 200
gi|133892391|ref|NM_010169.3| ---------------GGGCCG-----------
CGC--------------- 43
* *** ***

gi|166362739|ref|NM_001992.3|
GGGTGAAGCGGAGCAGCCCGAGGCGGGGCAGCCTCCCGGAGCAGCGCCGC 250
gi|133892391|ref|NM_010169.3|
-------------------------GGGCAGCCTT--------------- 53
*********

gi|166362739|ref|NM_001992.3|
GCAGAGCCCGGGACAATGGGGCCGCGGCGGCTGCTGCTGGTGGCCGCCTG 300
gi|133892391|ref|NM_010169.3|
---------GGGACAATGGGGCCCCGGCGCTTGCTGATCGTCGCCCTCGG 94
************** ***** ***** * **
*** * *

gi|166362739|ref|NM_001992.3|
CTTCAGTCTGTGCGGCCCGCTGTTGTCTGCCCGCACCCGGGCCCGCAGGC 350
gi|133892391|ref|NM_010169.3|
CCTCAGCCTGTGCGGTCCCTTGCTGTCTTCCCGCGTCCCTATGAGCCAGC 144
* **** ******** ** ** ***** ***** **
** **

gi|166362739|ref|NM_001992.3|
CAGAATCAAAAGCAACAAATGCCACCTTAGATCCCCGGTCATTTCTTCTC 400
gi|133892391|ref|NM_010169.3|
CAGAATCAGAGAGGACAGATGCTACGGTGAACCCCCGCTCATTCTTTCTA 194
******** * *** **** ** * * *****
***** ****

gi|166362739|ref|NM_001992.3| AGGAACCCCAATGATAA---
ATATGAACCATTTT------------GGGA 435
gi|133892391|ref|NM_010169.3|
AGGAATCCCAGTGAAAATACATTTGAACTGGTCCCCCTGGGGGATGAGGA 244
***** **** *** ** ** ***** *
***

gi|166362739|ref|NM_001992.3|
GGATGAGGAGAAAAATGAAAGTGGGTTAACTGAATACAGATTAGTCTCCA 485
gi|133892391|ref|NM_010169.3|
GGAGGAGGAGAAAAATGAAAGCGTCCTGCTGGAGGGTAGGGCAGTCTACT 294
*** ***************** * * ** **
***** *

gi|166362739|ref|NM_001992.3|
TCAATAAAAGCAGTCCTCTTCAAAAACAACTTCCTGCATTCATCTCAGAA 535
gi|133892391|ref|NM_010169.3|
TAAATATAAGCCTCCCTCCTCACACGCCGCCTCCTCCCTTCATCTCCGAG 344
* **** **** **** *** * * * **** *
******** **

gi|166362739|ref|NM_001992.3|
GATGCCTCCGGATATTTGACCAGCTCCTGGCTGACACTCTTTGTCCCATC 585
gi|133892391|ref|NM_010169.3|
GACGCCTCCGGATATCTGACCAGCCCCTGGCTGACGCTCTTCATGCCCTC 394
** ************ ******** ********** *****
* ** **

gi|166362739|ref|NM_001992.3|
TGTGTACACCGGAGTGTTTGTAGTCAGCCTCCCACTAAACATCATGGCCA 635
gi|133892391|ref|NM_010169.3|
CGTGTACACGATTGTGTTCATTGTCAGCCTTCCTCTGAACGTCCTGGCCA 444
******** ***** * ******** ** ** ***
** ******

gi|166362739|ref|NM_001992.3|
TCGTTGTGTTCATCCTGAAAATGAAGGTCAAGAAGCCGGCGGTGGTGTAC 685
gi|133892391|ref|NM_010169.3|
TCGCAGTGTTCGTCTTGAGGATGAAGGTCAAGAAGCCGGCCGTGGTGTAC 494
*** ****** ** *** ********************
*********

gi|166362739|ref|NM_001992.3|
ATGCTGCACCTGGCCACGGCAGATGTGCTGTTTGTGTCTGTGCTCCCCTT 735
gi|133892391|ref|NM_010169.3|
ATGCTGCACCTGGCCATGGCCGACGTGCTCTTCGTGTCGGTGCTCCCCTT 544
**************** *** ** ***** ** *****
***********

gi|166362739|ref|NM_001992.3|
TAAGATCAGCTATTACTTTTCCGGCAGTGATTGGCAGTTTGGGTCTGAAT 785
gi|133892391|ref|NM_010169.3|
CAAGATCAGCTACTACTTCTCCGGCACTGATTGGCAGTTCGGGTCTGGAA 594
*********** ***** ******* ************
******* *

gi|166362739|ref|NM_001992.3|
TGTGTCGCTTCGTCACTGCAGCATTTTACTGTAACATGTACGCCTCTATC 835
gi|133892391|ref|NM_010169.3|
TGTGCCGCTTCGCCACCGCAGCGTTTTACGGGAACATGTACGCCTCCATC 644
**** ******* *** ***** ****** *
************** ***

gi|166362739|ref|NM_001992.3|
TTGCTCATGACAGTCATAAGCATTGACCGGTTTCTGGCTGTGGTGTATCC 885
gi|133892391|ref|NM_010169.3|
ATGCTCATGACGGTCATAAGCATTGACCGGTTCCTGGCGGTGGTGTATCC 694
********** ******************** *****
***********

gi|166362739|ref|NM_001992.3|
CATGCAGTCCCTCTCCTGGCGTACTCTGGGAAGGGCTTCCTTCACTTGTC 935
gi|133892391|ref|NM_010169.3|
GATCCAGTCCCTGTCCTGGCGCACTCTGGGCCGAGCCAACTTCACTTGCG 744
** ******** ******** ******** * **
*********

gi|166362739|ref|NM_001992.3|
TGGCCATCTGGGCTTTGGCCATCGCAGGGGTAGTGCCTCTGCTCCTCAAG 985
gi|133892391|ref|NM_010169.3|
TGGTCATTTGGGTGATGGCCATCATGGGGGTGGTGCCCCTTCTCCTCAAG 794
*** *** **** ******** ***** ***** **
*********

gi|166362739|ref|NM_001992.3|
GAGCAAACCATCCAGGTGCCCGGGCTCAACATCACTACCTGTCATGATGT 1035
gi|133892391|ref|NM_010169.3|
GAGCAGACCACCCGAGTTCCGGGACTCAACATCACCACCTGCCACGACGT 844
***** **** ** ** ** ** *********** *****
** ** **

gi|166362739|ref|NM_001992.3|
GCTCAATGAAACCCTGCTCGAAGGCTACTATGCCTACTACTTCTCAGCCT 1085
gi|133892391|ref|NM_010169.3|
CCTCAGTGAGAACCTGATGCAAGGCTTTTACTCGTACTACTTCTCGGCCT 894
**** *** * **** * ****** ** *
*********** ****

gi|166362739|ref|NM_001992.3|
TCTCTGCTGTCTTCTTTTTTGTGCCGCTGATCATTTCCACGGTCTGTTAT 1135
gi|133892391|ref|NM_010169.3|
TCTCCGCCATCTTCTTTCTTGTGCCGTTGATCGTTTCCACGGTCTGCTAC 944
**** ** ******** ******** *****
************* **

gi|166362739|ref|NM_001992.3|
GTGTCTATCATTCGATGTCTTAGCTCTTCCGCAGTTGCCAACCGCAGCAA 1185
gi|133892391|ref|NM_010169.3|
ACGTCCATCATCCGGTGCCTGAGCTCCTCCGCGGTTGCCAACCGGAGCAA 994
*** ***** ** ** ** ***** *****
*********** *****

gi|166362739|ref|NM_001992.3|
GAAGTCCCGGGCTTTGTTCCTGTCAGCTGCTGTTTTCTGCATCTTCATCA 1235
gi|133892391|ref|NM_010169.3|
GAAGTCGCGGGCTTTGTTCCTGTCTGCCGCGGTGTTCTGCATCTTCATCG 1044
****** ***************** ** ** **
***************

gi|166362739|ref|NM_001992.3|
TTTGCTTCGGACCCACAAACGTCCTCCTGATTGCGCATTACTCATTCCTT 1285
gi|133892391|ref|NM_010169.3|
TCTGCTTTGGGCCCACCAACGTCCTCCTGATTGTGCACTACCTTTTCCTC 1094
* ***** ** ***** **************** *** ***
*****

gi|166362739|ref|NM_001992.3|
TCTCACACTTCCACCACAGAGGCTGCCTACTTTGCCTACCTCCTCTGTGT 1335
gi|133892391|ref|NM_010169.3|
TCCGACAGTCCTGGTACAGAGGCAGCCTACTTTGCTTACCTCCTCTGCGT 1144
** *** * * ******** ***********
*********** **

gi|166362739|ref|NM_001992.3|
CTGTGTCAGCAGCATAAGCTGCTGCATCGACCCCCTAATTTACTATTACG 1385
gi|133892391|ref|NM_010169.3|
CTGTGTGAGCAGCGTGAGCTGCTGCATCGATCCGTTGATTTACTACTACG 1194
****** ****** * ************** ** *
******** ****

gi|166362739|ref|NM_001992.3|
CTTCCTCTGAGTGCCAGAGGTACGTCTACAGTATCTTATGCTGCAAAGAA 1435
gi|133892391|ref|NM_010169.3|
CCTCCTCCGAGTGCCAGAGGCACCTCTACAGCATCTTGTGCTGCAAAGAA 1244
* ***** ************ ** ******* *****
************

gi|166362739|ref|NM_001992.3|
AGTTCCGATCCCAGCAGTTATAACAGCAGTGGGCAGTTGATGGCAAGTAA 1485
gi|133892391|ref|NM_010169.3|
AGCTCTGATCCCAACAGTTGCAACAGCACCGGCCAGCTGATGCCGAGTAA 1294
** ** ******* ***** ******* ** *** *****
* *****

gi|166362739|ref|NM_001992.3|
AATGGATACCTGCTCTAGTAACCTGAATAACAGCATATACAAAAAGCTGT 1535
gi|133892391|ref|NM_010169.3|
AATGGATACCTGCTCTAGTCACCTGAATAACAGCATATACAAAAAGCTAT 1344
*******************
**************************** *

gi|166362739|ref|NM_001992.3| TAACTTAGGAAAAGGGACTGCTGGGAGGTTAAA-
AAGAAAAGTTTATAAA 1584
gi|133892391|ref|NM_010169.3| TAGCTTAGGGAAAGGG-
TTGCTGGAAGGTTCCATGAGAAAAGGTTG-GAA 1392
** ****** ****** ****** ***** * *******
** **

gi|166362739|ref|NM_001992.3| AGTGAATAACCTGAGGATTCTATTAGTCCCCACCCA-
AACTTTATTGA-T 1632
gi|133892391|ref|NM_010169.3| AGCCAACAGCG-
GGGAATCCCATTAGTCCCTGCAAAGAACTGTATTTACT 1441
** ** * * * * ** * ********* * * ****
**** * *

gi|166362739|ref|NM_001992.3| TCACCTCCTAAAA--
CAACAGATGTACGACTTGCATACCTGCTTTTTATG 1680
gi|133892391|ref|NM_010169.3|
TCGAAACCTAAAAAACAACCAATATCCGATATGCACGAATACTTCT---- 1487
** ******* **** ** * *** **** *
*** *

gi|166362739|ref|NM_001992.3|
GGAGCTGTCAAGCATGTATTTTTGTCAATTACCAGAAAGATAACAGGAC- 1729
gi|133892391|ref|NM_010169.3|
---GCTATCAAGAGTCTAGATTGGATAATTACCAGCAAGGTGACGGGAAC 1534
*** ***** * ** ** * ********* *** *
** ***

gi|166362739|ref|NM_001992.3|
-GAGATGACGGTGTTATTCCAAGGGAATATTGCCAATGCTACAGTAATAA 1778
gi|133892391|ref|NM_010169.3| GGAAATAAAGGTGT----CCAG-----
TGTTGCTAGTGCTATGATAGTAA 1575
** ** * ***** *** * **** * *****
** ***

gi|166362739|ref|NM_001992.3|
ATGAATGTCACTTCTGGATATAGCTAGGTGACATATACATACTTACATGT 1828
gi|133892391|ref|NM_010169.3| CTGGATGTCACTTCTT-ATATATCTAGGTGAC---------
TTTA----- 1610
** *********** ***** *********
***

gi|166362739|ref|NM_001992.3| GTGTATATGTAGATG-
TATGCACACACATATATTATTTGCAGTGCAGTAT 1877
gi|133892391|ref|NM_010169.3| ----ATATATAGATGGTATGCACACAC-----
TCATTTGTCATGCAGGAG 1651
**** ****** *********** * *****
***** *

gi|166362739|ref|NM_001992.3|
AGAATAGGCACTTTAAAACACTCTTTCCCCGCACCCCAGCAATTA---TG 1924
gi|133892391|ref|NM_010169.3| GGAATCTGCACTTTGACACA-
TTTTTGTTTATTCCCTGGCCGTTACTATG 1700
**** ******* * *** * *** *** **
*** **

gi|166362739|ref|NM_001992.3|
AAAATAATCTCTGATTCCCTGATTTAATATGCAAAGTCTAGGTTGGTAGA 1974
gi|133892391|ref|NM_010169.3| GAAATAATCT--
GATTCTCTGACTTAATAAACAAAGTCTGAGTTGGTGGG 1748
********* ***** **** ****** ********
****** *

gi|166362739|ref|NM_001992.3|
GTTTAGCCCTGAACATTTCATGGTGTTCATCAACAGTGAGAGACTCCATA 2024
gi|133892391|ref|NM_010169.3| TGTTAGCACTGGGCAGCTGGAGATCCTAAT-
GATAGGGGAGGAGTCCGTA 1797
***** *** ** * * * * ** * ** *
** *** **

gi|166362739|ref|NM_001992.3| GTTTGGGCTTG-
TACCACTTTTGCAAATAAGTGTATTTTGAAATTGTTTG 2073
gi|133892391|ref|NM_010169.3| GTTTAGACTTAACACAGCTTTTGCCTATA--
TTTTTTTTCAAATTATTTG 1845
**** * *** ** ******* *** * * ****
***** ****

gi|166362739|ref|NM_001992.3|
ACGGCAAGGTTTAAGTTATTAAGAGGTAAGACTTAGTACTATCTGTGC-G 2122
gi|133892391|ref|NM_010169.3| ATAATAATGGTTA-GTGATGGAAGGATGAGAC--
AGTATTACCTGTGTAG 1892
* ** * *** ** ** * * * **** **** **
***** *

gi|166362739|ref|NM_001992.3|
TAGAAGTTCTAGTGTTTTCAATTTTAAACATATCCAAGTTTGAATTCCTA 2172
gi|133892391|ref|NM_010169.3|
GGGAAGCTCTAATACTTTTCATCTTGAACATACCGTAGTTTTAA------ 1936
**** **** * *** ** ** ****** * *****
**

gi|166362739|ref|NM_001992.3|
AAATTATGGAAACAGATGAAAAGCCTCTGTTTTGATATGGGTAGTATTTT 2222
gi|133892391|ref|NM_010169.3| GAATTATCAAGGCTGTTGGAAAACCC--
GTTTTGATATGGGTAGCATTTT 1984
****** * * * ** *** **
**************** *****

gi|166362739|ref|NM_001992.3| TT---------
ACATTTTACACACTGTACACATAAGCCAAAACTGAGCAT 2263
gi|133892391|ref|NM_010169.3|
TTTTTTAACTTGCAATTTACTTACTGAATACATGGACCAAGACTGAGCAT 2034
** ** ***** **** * **** ****
*********

gi|166362739|ref|NM_001992.3| AAGTCCT-
CTAGTGAATGTAGGCTGGCTTTCAGAGTAGGCTATTCCTGAG 2312
gi|133892391|ref|NM_010169.3| AAGACTCACCAG-GACTGTAATAAACCTTACAAAGCAG-
CCAAGCCT--- 2079
*** * * ** ** **** *** ** ** ** * *
***

gi|166362739|ref|NM_001992.3|
AGCTGCATGTGTCCGCCCCCGATGGAGGACTCCAGGCAGCAGACACATGC 2362
gi|133892391|ref|NM_010169.3| AGACACAGCCATCTGC-----
ATGGAGGCCTCTGAGCACCAGGTACAT-- 2122
** ** ** ** ******* *** *** ***
****

gi|166362739|ref|NM_001992.3|
CAGGGCCATGTCAGACACAGATTGGCCAGAAACCTTCCTGCTGAGCCTCA 2412
gi|133892391|ref|NM_010169.3| CACACCCCT------------TCGGCTATG---
CCTCCCAGAGAGC---- 2153
** ** * * *** * * ***
****

gi|166362739|ref|NM_001992.3|
CAGCAGTGAGACTGGGGCCACTACATTTGCTCCATCCTCCTGGGATT--- 2459
gi|133892391|ref|NM_010169.3| -AGAGATG-
GATGGGAAGCACCAGGCCCACCCCATCCTGCTAGGATTCTC 2201
** ** ** ** *** * * ******* **
*****

gi|166362739|ref|NM_001992.3|
---GGCTGTGAACTGATCATGTTTATGAGAAACTGGCAAAGCAGAATGTG 2506
gi|133892391|ref|NM_010169.3|
ATTAGCTGTGAGCTGACTGTGTCTTTTAGAAATTGGCAAGGTAAGGTATG 2251
******* **** *** * * ***** ****** *
* * **

gi|166362739|ref|NM_001992.3|
ATATCCTAGGAGGTAATGACCATGAAAGACTTCTCTACCCATCTTAAAAA 2556
gi|133892391|ref|NM_010169.3|
CCATCTTGGGAGGCAGTAACTATGAAAGACT------------------- 2282
*** * ***** * * ** **********

gi|166362739|ref|NM_001992.3|
CAACGAAAGAAGGCATGGACTTCTGGATGCCCATCCACTGGGTGTAAACA 2606
gi|133892391|ref|NM_010169.3| -GACGAGAGGAGAAA-------------------------
GGTGTGTTTA 2306
**** ** ** *
***** *

gi|166362739|ref|NM_001992.3|
CATCTAGTAGTTGTTCTGAAATGTCAGTTCTGATATGGAAGCACCCATT- 2655
gi|133892391|ref|NM_010169.3|
CATCCAGTAGCTGTCCTGCAAGGCTGGCCCTTGCACAGACAGACACACCC 2356
**** ***** *** *** ** * * ** * **
** **

gi|166362739|ref|NM_001992.3| ATGCGCTGTGGCCACTCCAATAGGTGCTGAG---
TGTACAGAGT---GGA 2699
gi|133892391|ref|NM_010169.3|
ACATGCCCTGGTCACACTGTTGGATAGTGGGCCATAGACTGACTATAGGA 2406
* ** *** *** * * * * ** * * ** **
* ***

gi|166362739|ref|NM_001992.3| ATAAGACAGAGACCTGCCCTCAA--
GAGCAAAGTAGA------------- 2734
gi|133892391|ref|NM_010169.3|
GAATAACCGAGTCCTGTCCTTACTCAGGCAACGCAGAGAGCTGGCATGTG 2456
* ** *** **** *** * **** * ***
gi|166362739|ref|NM_001992.3| --------TCATGCATAGAG----TGT-----
GATGTATGTGTAATAAAT 2767
gi|133892391|ref|NM_010169.3|
GTCAGCTATGATGCACATAGAACTTGTCTTCAGCTGGATGTG-ACCAAGT 2505
* ***** * ** *** * ** *****
* ** *

gi|166362739|ref|NM_001992.3|
ATGTTTCACACAAACAAGGCCTGTCAGCTAAAGAAGTTTGAACATTTGGG 2817
gi|133892391|ref|NM_010169.3|
GTATTTCACATAAGCAAGGCCTATCAGCTAAACTGCTTTGCATATCTGAG 2555
* ******* ** ******** ********* **** *
** ** *

gi|166362739|ref|NM_001992.3|
TTACTATTTCTTGTGGTTATAACTTAATGAAAACAATGCAGTACAGGACA 2867
gi|133892391|ref|NM_010169.3| TTTCTGCTTCCAGTAGCTATAGATTAG-
GATAAAAACACAGTATAAGATG 2604
** ** *** ** * **** *** ** ** **
***** * **

gi|166362739|ref|NM_001992.3| TATATTTTTTAAA-ATAAGTCT---GATTTA----
ATTGGGCACTATTTA 2909
gi|133892391|ref|NM_010169.3|
TATATTTTTAATACATATGCCCTTCAGCCTACAAAATTACACACTATTTA 2654
********* * * *** * * ** ***
*********

gi|166362739|ref|NM_001992.3|
TTTACAAATGTTTTGCTCAATAGATTGCTCAAATCAGGTTTTCTTTTAAG 2959
gi|133892391|ref|NM_010169.3| TTTACAAATGTTTT-TTCAA-
AAATTACTCAAATCAG--------CCAGG 2694
************** **** * *** **********
* *

gi|166362739|ref|NM_001992.3|
AATCAATCATGTCAGTCTGCTTAGAAATAACAGAAGAAAATAGAATTGAC 3009
gi|133892391|ref|NM_010169.3| CAT----TATGGTATACACCTT-----
TAATCCCAGAACTTGGGA--GGC 2733
** *** * * *** *** **** *
* * * *

gi|166362739|ref|NM_001992.3|
ATTGAAATCTAGGAAAATTATTCTATAATTTCCATTTACTTAAGACTTAA 3059
gi|133892391|ref|NM_010169.3| A--GAGG--CAGGCAGATC-TTAAACAATTT---
TTTTTTTAAGAAACAA 2775
* ** *** * ** ** * ***** ***
****** **

gi|166362739|ref|NM_001992.3|
TGAGACTTTAAAAGCATTTTTTAACCTCCTAAGTATCAAGTATAGAAAAT 3109
gi|133892391|ref|NM_010169.3| GCAAACACAAAAAG----TTTTA----CTTAAGT-
CCAA----------- 2805
* ** ***** ***** * ***** ***

gi|166362739|ref|NM_001992.3|
CTTCATGGAATTCACAAAGTAATTTGGAAATTAGGTTGAAACATATCTCT 3159
gi|133892391|ref|NM_010169.3| TTTTAAGAAATATATAGGTCAGTTTGG---
TTA----------------- 2835
** * * *** * * * ***** ***

gi|166362739|ref|NM_001992.3|
TATCTTACGAAAAAATGGTAGCATTTTAAACAAAATAGAAAGTTGCAAGG 3209
gi|133892391|ref|NM_010169.3| -----------AAAATAATAGTA------ATGAA--
AGGAAATTTCA--- 2863
***** *** * * ** ** **
** **

gi|166362739|ref|NM_001992.3|
CAAATGTTTATTTAAAAGAGCAGGCCAGGCGCGGTGGCTCACGCCTGTAA 3259
gi|133892391|ref|NM_010169.3|
-------TTGATTGAAA----------------------------TTTAT 2878
** ** ***
* **

gi|166362739|ref|NM_001992.3|
TCCCAGCACTTTGGGAGGCTGAGGCGGGTGGATCACGAGGTCAGGAGATC 3309
gi|133892391|ref|NM_010169.3| TCT--GTATTTT--------------------
TCTTGAGTT------ATT 2900
** * * *** ** *** *
**

gi|166362739|ref|NM_001992.3|
GAGACCATCCTGGCTAACACGGTGAAACCCGTCTCTACTAAAAATGCAAA 3359
gi|133892391|ref|NM_010169.3| GAGATTATTT-----------GTAAAGC--ATTTTT------
AATGCCAC 2931
**** ** ** ** * * * *
***** *

gi|166362739|ref|NM_001992.3|
AAAAATTAGCCGGGCGTGGTGGCAGGCACCTGTAGTCCCAGCTACTCGGG 3409
gi|133892391|ref|NM_010169.3| AGTGACTA-------------ACAAGCATATAAAATCTTCA-
TAC----- 2962
* * ** ** *** * * **
***

gi|166362739|ref|NM_001992.3|
AGGCTGAGGCAGGAGACTGGCGTGAACCCAGGAGGCGGACCTTGTAGTGA 3459
gi|133892391|ref|NM_010169.3| ---CTTTGACAAAA---
TAATTTGAA-------------------AATTA 2987
** * ** * * ****
* * *

gi|166362739|ref|NM_001992.3|
GCCGAGATCGCGCCACTGTGCTCCAGCCTGGGCAACAGAGCAAGACTCCA 3509
gi|133892391|ref|NM_010169.3| ATTTAAAACATATCCTTTTTCT--------
GATGAAAAAATATGTTGGCA 3029
* * * * * * ** * * * * *
* **

gi|166362739|ref|NM_001992.3| TCTCAAA-
AAATAAAAATAAATAAAAAATAAAAAAATAAAAGAGCAAACT 3558
gi|133892391|ref|NM_010169.3| TTTTAAGCAAATAAGAGTAGA--
AAGGTTGTTTATTTAAGAGAACAAAGT 3077
* * ** ****** * ** * ** * * ***
*** **** *

gi|166362739|ref|NM_001992.3|
ATTTCCAAATACCATAGAATAACTTACATAAAAGTAATATAACTGTATTG 3608
gi|133892391|ref|NM_010169.3|
ATTTCCAAATACTGTAGAGTCGCTTCCACGAAAGTCCTATGGTTGTATGG 3127
************ **** * *** ** ***** ***
***** *

gi|166362739|ref|NM_001992.3|
TAAGTAGAAGCTAGCACTGGTTTTATTAATTTAGTGACTATTCATTTTAT 3658
gi|133892391|ref|NM_010169.3| TTAAC-----TTGGTTCCGGTGTT-----------
GGCTG--------AT 3153
* * * * * *** ** * **
**

gi|166362739|ref|NM_001992.3|
CTAAATCAGTGAAGATTTACTGTCATTGTTTATTAGTCTGTATATATTAA 3708
gi|133892391|ref|NM_010169.3| CTCAATTACTGA---CTCCCTGTC-CCGTGT-----
TCTGTCTGTGACTT 3194
** *** * *** * ***** ** * *****
* *

gi|166362739|ref|NM_001992.3| AATATGA-
TATCATTAATGTACTTACAAAATAGTATGTCACTGTTTTTAT 3757
gi|133892391|ref|NM_010169.3|
AATGTAACTGTTATCACCGCGCTTGTGACCTTTTACGTCATTGTTTT-GT 3243
*** * * * * ** * * *** * * ** ****
****** *

gi|166362739|ref|NM_001992.3| GTTCA-----
TTCTTAAAAACATAACCTGTATTAATAAATGTGAACATTT 3802
gi|133892391|ref|NM_010169.3| GTTCACCCTCTTTTTTAAAAAAAAA--TATATTAATAAAC-
TAAAACCAT 3290
***** ** ** **** * ** * ********** *
** *

gi|166362739|ref|NM_001992.3|
GCTTGGTAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA 3847
gi|133892391|ref|NM_010169.3|
GCTTGG--------------------------------------- 3296
******
RESULTS:

Result files
Input Sequences

clustalw2-I20110331-115850-0033-1317260-oy.input

Tool Output

clustalw2-I20110331-115850-0033-1317260-oy.output

Alignments in CLUSTALW format

clustalw2-I20110331-115850-0033-1317260-oy.clustalw

Guide Tree

clustalw2-I20110331-115850-0033-1317260-oy.dnd

Scores Table

SeqA    Name                             Length    SeqB    Name                             Length    Score
1       gi|166362739|ref|NM_001992.3|    3847      2       gi|133892391|ref|NM_010169.3|    3296      70.0
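
The Score column is a percentage. As far as I understand the ClustalW documentation, the pairwise score is the number of identical positions divided by the number of positions compared, with gap positions excluded. The sketch below assumes that definition and applies it to two short, made-up alignment rows; it does not reproduce the 70.0 value above, and the function name `percent_identity` is hypothetical.

```python
def percent_identity(row_a: str, row_b: str) -> float:
    """Percent identity of two aligned rows, ignoring columns with a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    identical = 0
    compared = 0
    for a, b in zip(row_a, row_b):
        if a == "-" or b == "-":
            continue            # gap positions are excluded
        compared += 1
        if a == b:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0

# Toy rows: 9 columns compared (one gap column skipped), 8 identical.
score = percent_identity("GGGACAATGG", "GGGAC-ATGC")
print(round(score, 1))
```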


EXPERIMENT NO.7

AIM: To perform pairwise sequence alignment for two retrieved sequences using BLAST

In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA, RNA, or
protein to identify regions of similarity that may be a consequence of functional, structural,
or evolutionary relationships between the sequences. Aligned sequences of nucleotide or
amino acid residues are typically represented as rows within a matrix. Gaps are inserted
between the residues so that identical or similar characters are aligned in successive
columns. Very short or very similar sequences can be aligned by hand. However, most
interesting problems require the alignment of lengthy, highly variable or extremely
numerous sequences that cannot be aligned solely by human effort. Instead, human
knowledge is applied in constructing algorithms to produce high-quality sequence
alignments, and occasionally in adjusting the final results to reflect patterns that are
difficult to represent algorithmically (especially in the case of nucleotide sequences).
Computational approaches to sequence alignment generally fall into two categories: global
alignments and local alignments. Calculating a global alignment is a form of global
optimization that "forces" the alignment to span the entire length of all query sequences.
By contrast, local alignments identify regions of similarity within long sequences that are
often widely divergent overall. In bioinformatics, local alignment is mainly performed
using the Basic Local Alignment Search Tool (BLAST). A BLAST search enables a
researcher to compare a query sequence with a library or database of sequences, and
identify library sequences that resemble the query sequence above a certain threshold.
BLAST is one of the most widely used bioinformatics programs[2], because it addresses a
fundamental problem and the algorithm emphasizes speed over sensitivity. This emphasis
on speed is vital to making the algorithm practical on the huge genome databases currently
available, although subsequent algorithms can be even faster. Input sequences in BLAST
are in FASTA format or GenBank format.
BLAST output can be delivered in a variety of formats. These formats include HTML, plain text,
and XML formatting. For NCBI’s web-page, the default format for output is HTML. When
performing a BLAST on NCBI, the results are given in a graphical format showing the hits
found, a table showing sequence identifiers for the hits with scoring related data, as well as
alignments for the sequence of interest and the hits received with corresponding BLAST
scores for these. Using a heuristic method, BLAST finds homologous sequences, not by
comparing either sequence in its entirety, but rather by locating short matches between the
two sequences. This process of finding initial words is called seeding. It is after this first
match that BLAST begins to make local alignments. While attempting to find homology in
sequences, sets of common letters, known as words, are very important. The heuristic
algorithm of BLAST locates all common words between the sequence of interest and the
hit sequence, or sequences, from the database. These results will then be used to build an
alignment. After making words for the sequence of interest, neighborhood words are also
assembled. These words must satisfy a requirement of having a score of at least the
threshold, T, when compared by using a scoring matrix. The threshold score T determines
whether a particular word will be included in the alignment or not. Once seeding has been
conducted, the alignment, which is initially only one word long (3 residues in a protein
search), is extended in both directions by the algorithm used by BLAST. Each extension
impacts the score of the alignment by either increasing or decreasing it. If the score of
the extended alignment exceeds a pre-determined cutoff, the alignment is included in the
results given by BLAST. If the score instead falls too far below the best value reached so
far, the extension stops, preventing areas of poor alignment from being included in the
BLAST results.
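
The seeding and extension steps described above can be sketched in a few lines. This is a simplified illustration and not the BLAST implementation: seeds are exact word matches only (neighborhood words and the threshold T are not modelled), extension is ungapped, and it stops once the score falls more than a hypothetical `x_drop` below the best score seen. The function names, scoring values and example sequences are assumptions made for this file.

```python
def find_seeds(query, subject, word_size=3):
    """Return (query_pos, subject_pos) for every exact word match."""
    index = {}
    for j in range(len(subject) - word_size + 1):
        index.setdefault(subject[j:j + word_size], []).append(j)
    return [(i, j)
            for i in range(len(query) - word_size + 1)
            for j in index.get(query[i:i + word_size], [])]

def extend_seed(query, subject, qpos, spos, word_size=3,
                match=1, mismatch=-2, x_drop=3):
    """Extend a seed without gaps in both directions.
    Returns (best_score, query_start, query_end) with query_end exclusive."""
    score = best = word_size * match
    best_right, k = word_size, word_size
    # Extend to the right of the seed.
    while qpos + k < len(query) and spos + k < len(subject):
        score += match if query[qpos + k] == subject[spos + k] else mismatch
        k += 1
        if score > best:
            best, best_right = score, k
        elif best - score > x_drop:
            break
    # Extend to the left, starting again from the best score so far.
    score, best_left, k = best, 0, 1
    while qpos - k >= 0 and spos - k >= 0:
        score += match if query[qpos - k] == subject[spos - k] else mismatch
        if score > best:
            best, best_left = score, k
        elif best - score > x_drop:
            break
        k += 1
    return best, qpos - best_left, qpos + best_right

query, subject = "ACGTGACCTA", "TTGTGACCGG"
seeds = find_seeds(query, subject)
hit = extend_seed(query, subject, *seeds[0])
print(seeds, hit, query[hit[1]:hit[2]])
```

In this toy case the first seed grows from the 3-letter word to the shared region GTGACC and then stops, because the flanking mismatches pull the score down.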

PROCEDURE:

Search for BLAST on the Google homepage and open
http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastHome
Select the BLAST type you want to perform, for instance nucleotide BLAST.
Submit the sequence to be searched, either in FASTA format or as an NCBI accession number.
Select the database to be searched.
Click on BLAST.
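
The same submission can in principle be prepared as a URL instead of through the web form. The sketch below only builds the request URL and does not contact NCBI. The parameter names (`CMD=Put`, `PROGRAM`, `DATABASE`, `QUERY`) are given as I recall them from NCBI's BLAST URL API and should be checked against its documentation before use. The function name is hypothetical, and the accession is the first sequence from Experiment 6.

```python
from urllib.parse import urlencode

BLAST_URL = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"

def build_blast_submission(query, program="blastn", database="nt"):
    """Return a submission URL for a BLAST search (no request is sent)."""
    params = {
        "CMD": "Put",           # assumed: submit a new search
        "PROGRAM": program,     # e.g. blastn for nucleotide BLAST
        "DATABASE": database,   # database to search
        "QUERY": query,         # FASTA text or an accession number
    }
    return f"{BLAST_URL}?{urlencode(params)}"

url = build_blast_submission("NM_001992.3")
print(url)
```

Results are not returned immediately by such a submission, so a real script would also need to poll for them, which is left out here.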
