
Algorithms for

Biological Sequence Analysis


─ Class Presentation

Human-Mouse Alignments with BLASTZ


Galaxy: A Platform for Interactive Large-scale Genome Analysis

許秉慧、陳怡靜、鄭智懷、宋建均
2005.11.30
S. Schwartz, W. J. Kent, A. Smit, Z. Zhang, R. Baertsch, R. C. Hardison, D. Haussler, and W. Miller, “Human-Mouse Alignments with BLASTZ,” Genome Research, 2003; 13: 103–107.
B. Giardine, C. Riemer, R. C. Hardison, R. Burhans, L. Elnitski, P. Shah, Y. Zhang, D. Blankenberg, I. Albert, W. Miller, W. J. Kent, and A. Nekrutenko, “Galaxy: A Platform for Interactive Large-scale Genome Analysis,” Genome Research, 2005; 15: 1451–1455.
Methods
S. Schwartz, W. J. Kent, A. Smit, Z. Zhang, R. Baertsch, R. C.
Hardison, D. Haussler, and W. Miller
Human-Mouse Alignments with BLASTZ
Genome Research, 2003; 13: 103–107

陳怡靜、許秉慧
2005.11.30
Outline
Motivation and results
BLASTZ and modified BLASTZ
Implementation issues and hardware environment
Software evaluation
Motivation and Results

陳怡靜
2005.11.30
Motivation
Several existing programs sacrifice sensitivity to attain very short running times.
A program called BLASTZ attains an appropriate level of sensitivity and specificity.
A modified BLASTZ is efficient enough to align entire mammalian genomes, and it has increased specificity.
Results
The BLASTZ alignment program, which is used by the PipMaker web server (Schwartz et al. 2000), was modified.
The modified BLASTZ was used to compare all of the human sequence with all of the mouse sequence efficiently.
BLASTZ and Modified BLASTZ

陳怡靜
2005.11.30
Homologous
Two proteins are orthologous if they belong to different species, evolved from a common ancestral gene by speciation, and retained the same function in the course of evolution.
Two proteins are paralogous if they arose by duplication within a genome and evolved new functions.
Human-Mouse Alignments
To find orthologous alignments
Natural consequence

[Figure: mouse sequence aligned to human sequence]

We obtain the single best alignment by applying a program called axtBest, which filters out all but the best alignment within a sliding window of 10,000 bases.

Step1
BLASTZ
BLASTZ follows the three-step strategy used by Gapped
BLAST.
1) Find short near-exact matches
2) Extend each short match without allowing gaps
3) Extend each gap-free match that exceeds a certain threshold by a
DP procedure that permits gaps
BLASTZ
Two differences between BLASTZ and Gapped BLAST
were exploited in the whole-genome alignments.
BLASTZ has an option to require that the matching regions it reports occur in the same order and orientation in both sequences.

Sequence 1

Sequence 2
BLASTZ
Two differences between BLASTZ and Gapped BLAST were exploited in the whole-genome alignments.
BLASTZ uses an alignment-scoring scheme derived and evaluated by Chiaromonte et al. (2000). Nucleotide substitutions are scored by the matrix
       A      C      G      T
A     91   –114    –31   –123
C   –114    100   –125    –31
G    –31   –125    100   –114
T   –123    –31   –114     91
and a gap of length k is penalized by subtracting 400 + 30k from the score.
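The scoring scheme above can be sketched in a few lines. This is an illustrative helper, not BLASTZ's own code: the names `SCORE`, `gap_penalty`, and `alignment_score` are made up here, and the input is assumed to be two equal-length alignment rows in uppercase with `-` marking gap columns.

```python
# Substitution scores copied from the matrix on the slide.
BASES = "ACGT"
MATRIX = [
    [91, -114, -31, -123],
    [-114, 100, -125, -31],
    [-31, -125, 100, -114],
    [-123, -31, -114, 91],
]
SCORE = {(a, b): MATRIX[i][j]
         for i, a in enumerate(BASES)
         for j, b in enumerate(BASES)}


def gap_penalty(k):
    """Penalty for one gap of length k: 400 + 30k."""
    return 400 + 30 * k


def alignment_score(row1, row2):
    """Score two equal-length alignment rows; '-' marks a gap column."""
    if len(row1) != len(row2):
        raise ValueError("alignment rows must have equal length")
    total = 0
    run_row, run_len = None, 0  # which row holds the current gap, and its length
    for a, b in zip(row1, row2):
        gap_row = 1 if a == "-" else 2 if b == "-" else None
        if gap_row != run_row and run_len:
            # The previous gap run just ended: charge it once.
            total -= gap_penalty(run_len)
            run_len = 0
        run_row = gap_row
        if gap_row is None:
            total += SCORE[(a, b)]
        else:
            run_len += 1
    if run_len:
        total -= gap_penalty(run_len)
    return total
```

For example, `alignment_score("ACGT", "ACGT")` adds the four diagonal entries, and a single gap of length 2 costs 400 + 30·2 = 460.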
Modified BLASTZ
The modified BLASTZ algorithm
1) Remove recent repeated elements
2) Run BLASTZ
3) Adjust positions in the alignment to refer to the original sequences
4) Filter the alignments
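Steps 1 and 3 of this list imply some bookkeeping: once repeats are removed, alignment positions refer to the shortened sequence and must be mapped back. The sketch below shows one way to do that, assuming the repeats are given as sorted, non-overlapping half-open intervals. The function names are hypothetical and this is not the authors' implementation.

```python
from bisect import bisect_right


def strip_intervals(seq, repeats):
    """Remove sorted, non-overlapping half-open intervals from seq.

    Returns the stripped sequence and a list of kept blocks, each recorded
    as (start_in_stripped, start_in_original).
    """
    pieces, blocks = [], []
    pos = 0   # current position in the original sequence
    out = 0   # current position in the stripped sequence
    for start, end in repeats:
        if start > pos:
            pieces.append(seq[pos:start])
            blocks.append((out, pos))
            out += start - pos
        pos = end
    if pos < len(seq):
        pieces.append(seq[pos:])
        blocks.append((out, pos))
    return "".join(pieces), blocks


def to_original(stripped_pos, blocks):
    """Map a position in the stripped sequence back to the original."""
    starts = [b[0] for b in blocks]
    i = bisect_right(starts, stripped_pos) - 1
    out_start, orig_start = blocks[i]
    return orig_start + (stripped_pos - out_start)
```

With `seq = "AAAACCCCGGGGTTTT"` and one repeat at `(4, 8)`, the stripped sequence is `"AAAAGGGGTTTT"`, and position 4 in it maps back to position 8 in the original.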
Modified BLASTZ
Step 1 (an addition from BLASTZ)
WHY?
Sequence 1

Sequence 2

I. Y. Lee, D. Westaway, A. F. Smit, K. Wang, J. Seto, L. Chen, C. Acharya, M. Ankener, D. Baskin, C. Cooper, et al., “Complete Genomic Sequence and Analysis of the Prion Protein Gene Region from Three Mammalian Species,” Genome Research, 1998; 8: 1022–1037.
Modified BLASTZ
Step 2 (a modification from BLASTZ)
Each 12-mer allows a transition (A-G, G-A, C-T, or T-C) in any one of the 12 positions.
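Transition-tolerant seeding can be sketched as follows. It follows the slide's description literally (contiguous 12-mers with at most one transition) and uses a plain dictionary index. `seed_variants` and `find_seeds` are illustrative names, and the real program's seeding data structures may well differ.

```python
TRANSITION = {"A": "G", "G": "A", "C": "T", "T": "C"}


def seed_variants(word):
    """The word itself plus every word reachable by one transition."""
    variants = {word}
    for i, base in enumerate(word):
        variants.add(word[:i] + TRANSITION[base] + word[i + 1:])
    return variants


def find_seeds(seq1, seq2, k=12):
    """Return sorted (i, j) pairs where seq1[i:i+k] and seq2[j:j+k]
    are identical or differ by exactly one transition."""
    index = {}
    for i in range(len(seq1) - k + 1):
        index.setdefault(seq1[i:i + k], []).append(i)
    hits = []
    for j in range(len(seq2) - k + 1):
        word = seq2[j:j + k]
        if any(base not in TRANSITION for base in word):
            continue  # skip words containing N or other symbols
        for variant in seed_variants(word):
            for i in index.get(variant, ()):
                hits.append((i, j))
    return sorted(hits)
```

Each 12-mer has 13 variants (itself plus one per position), so tolerating a transition costs 13 lookups per word instead of one.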

Sequence 1 12-mer 12-mer 12-mer

Sequence 2 12-mer 12-mer 12-mer

Extend the induced alignment in each direction, not allowing gaps.
Stop extending when the score decreases by more than some threshold.
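A gap-free extension with this stopping rule might look like the sketch below. The slide leaves the drop-off threshold unspecified, so `x_drop` is a caller-supplied assumption. The function names are illustrative, and the substitution matrix is the one given earlier in the deck.

```python
BASES = "ACGT"
MATRIX = [
    [91, -114, -31, -123],
    [-114, 100, -125, -31],
    [-31, -125, 100, -114],
    [-123, -31, -114, 91],
]
SCORE = {(a, b): MATRIX[i][j]
         for i, a in enumerate(BASES)
         for j, b in enumerate(BASES)}


def extend_one_way(seq1, seq2, i, j, step, x_drop):
    """Walk from (i, j) in direction step (+1 or -1) without gaps.

    Returns (best_score, best_length) for the best-scoring prefix of the walk.
    """
    score = best = 0
    length = best_len = 0
    while 0 <= i < len(seq1) and 0 <= j < len(seq2):
        score += SCORE[(seq1[i], seq2[j])]
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score > x_drop:
            break  # score fell too far below the best seen so far
        i += step
        j += step
    return best, best_len


def extend_ungapped(seq1, seq2, i, j, k, x_drop):
    """Extend a k-long seed at seq1[i:], seq2[j:] in both directions.

    Returns (score, start1, start2, length) of the gap-free segment.
    """
    seed = sum(SCORE[(a, b)] for a, b in zip(seq1[i:i + k], seq2[j:j + k]))
    right, right_len = extend_one_way(seq1, seq2, i + k, j + k, +1, x_drop)
    left, left_len = extend_one_way(seq1, seq2, i - 1, j - 1, -1, x_drop)
    return (seed + left + right, i - left_len, j - left_len,
            left_len + k + right_len)
```

The extension keeps only the best-scoring prefix in each direction, so trailing mismatches that triggered the stop are not included in the reported segment.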
Modified BLASTZ
Step 2 (a modification from BLASTZ)

Sequence 1 12-mer 12-mer 12-mer

Sequence 2 12-mer 12-mer 12-mer

If the gap-free alignment scores more than 3000, repeat the extension step, but allow for gaps.
Retain the alignment if it scores above 5000.
Modified BLASTZ
Step 3
l
Sequence 1

Sequence 2

If l ≤ 50 kb, repeat Step 2, but using a more sensitive seeding procedure (e.g., 7-mer exact matches) and lower score thresholds both for gap-free alignments (e.g., 2200 instead of 3000) and for gapped alignments (e.g., 2200 instead of 5000).
[Slide overlay values: 10; 2000; 2000]
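Picking out the regions to re-scan in Step 3 can be sketched as below. The sketch assumes the alignments are already in the same order and orientation in both sequences and that l is measured in both of them. The 50 kb default comes from the slide text; the `<=` comparison and all names are assumptions.

```python
from typing import NamedTuple


class Aln(NamedTuple):
    start1: int  # half-open coordinates in sequence 1
    end1: int
    start2: int  # half-open coordinates in sequence 2
    end2: int


def regions_to_rescan(alignments, max_sep=50_000):
    """Return (s1, e1, s2, e2) regions lying between adjacent alignments
    whose separation is positive and at most max_sep in both sequences."""
    ordered = sorted(alignments)
    regions = []
    for left, right in zip(ordered, ordered[1:]):
        gap1 = right.start1 - left.end1
        gap2 = right.start2 - left.end2
        if 0 < gap1 <= max_sep and 0 < gap2 <= max_sep:
            regions.append((left.end1, right.start1, left.end2, right.start2))
    return regions
```

Each returned region would then be searched again with the more sensitive seeds and the lower thresholds.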
Modified BLASTZ
Step 4: Adjust sequence positions in the resulting alignments to make them refer to the original sequences.
Step 5: Filter the alignments as appropriate for particular purposes.
Apply axtBest to find the best way to align each aligned human position.
[Figure: Sequence 1 and Sequence 2; choose the best one]
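The idea of keeping one best alignment per human position can be illustrated with a deliberately simplified sketch. It ranks alignments by their whole-alignment score, whereas axtBest is described as judging alignments within a 10,000-base sliding window, so this is not the axtBest algorithm itself. The function name and tuple layout are hypothetical.

```python
def best_per_position(alignments):
    """alignments: list of (start, end, score) on the human sequence,
    half-open coordinates.

    Returns (start, end, index) segments, where index identifies the
    highest-scoring alignment covering that stretch.
    """
    cuts = sorted({p for s, e, _ in alignments for p in (s, e)})
    segments = []
    for lo, hi in zip(cuts, cuts[1:]):
        covering = [i for i, (s, e, _) in enumerate(alignments)
                    if s <= lo and hi <= e]
        if not covering:
            continue  # no alignment covers this stretch
        best = max(covering, key=lambda i: alignments[i][2])
        if segments and segments[-1][2] == best and segments[-1][1] == lo:
            # Same winner as the previous stretch: grow that segment.
            segments[-1] = (segments[-1][0], hi, best)
        else:
            segments.append((lo, hi, best))
    return segments
```

Where two alignments overlap, the overlapping stretch is assigned to the higher-scoring one, and the lower-scoring one keeps only the part it covers alone.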
Modified BLASTZ
Two changes to BLASTZ significantly improved its execution speed for aligning entire genomes.
When the program realizes that many regions of the mouse genome align to the same human segment, that segment is dynamically masked. (Step 1 of the modified BLASTZ)
BLASTZ seeds alignments with 8-mers, whereas the modified BLASTZ seeds them with 12-mers. (Step 2 of the modified BLASTZ)
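Dynamic masking can be pictured as a per-position counter on the human sequence. In the sketch below the class name and the `limit` parameter are hypothetical. The slide does not say how many alignments count as "many", so that value is left to the caller.

```python
class DynamicMasker:
    """Count alignments per human position; mask positions hit too often."""

    def __init__(self, length, limit):
        self.counts = [0] * length
        self.masked = [False] * length
        self.limit = limit  # assumed cutoff; not given in the slides

    def record(self, start, end):
        """Register one alignment over [start, end).

        Returns the positions that became masked because of this alignment.
        """
        newly_masked = []
        for pos in range(start, end):
            self.counts[pos] += 1
            if self.counts[pos] >= self.limit and not self.masked[pos]:
                self.masked[pos] = True
                newly_masked.append(pos)
        return newly_masked
```

Seeds that fall in masked positions would then be skipped, which saves the time otherwise spent re-aligning the same human segment against many mouse regions.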
Implementation Issues and
Hardware Environment

許秉慧
2005.11.30
Implementation Issues

[Figure: human sequence vs. mouse sequence; labels: Base 1, Base 2, Base 3, 10 kb, 1.01 Mb, gap-free segment score > 3000]
Implementation Issues and
Hardware Environment
Input
2.8Gb human sequence vs. 2.5Gb mouse sequence
Hardware
A cluster of 1024 833-MHz Pentium III CPUs
Time
481 days of CPU time
Half a day of wall-clock time
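The two timing figures are consistent with each other, as a quick back-of-the-envelope check shows. It assumes perfect parallel efficiency, so it is an idealized lower bound rather than a measured value.

```python
cpu_days = 481   # total CPU time reported
cpus = 1024      # processors in the cluster

# With perfect load balancing, wall-clock time is CPU time divided by CPUs.
ideal_wall_days = cpu_days / cpus
ideal_wall_hours = ideal_wall_days * 24

print(f"{ideal_wall_days:.2f} days = {ideal_wall_hours:.1f} hours")
```

This gives roughly 0.47 days, or a little over 11 hours, which matches the "half a day" of wall-clock time.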
Software Evaluation

許秉慧
2005.11.30
Software Evaluation
Different classes of parameters and thresholds might be best tested in different ways.
Reverse the mouse sequence to measure specificity.
Reverse Mouse Sequence
[Figure: human sequence, mouse sequence, and reversed mouse sequence. A true match aligns human with mouse. A microsatellite sequence (3’ cacaca 5’, which reversed reads 5’ acacac 3’) yields a spurious match against the reversed mouse sequence.]
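The reversal control is easy to mimic on toy data. In the sketch below the sequences are invented and shared k-mers stand in for alignments. Reversing (not reverse-complementing) removes the shared words of an ordinary sequence, but a microsatellite such as cacaca still matches after reversal.

```python
def reverse(seq):
    """Reverse a sequence without complementing it."""
    return seq[::-1]


def kmers(seq, k):
    """All distinct words of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def shared_kmers(a, b, k):
    """Words of length k that occur in both sequences."""
    return kmers(a, k) & kmers(b, k)


ordinary = "acggtcatgc"         # made-up non-repetitive sequence
microsatellite = "cacacacaca"   # simple CA repeat

print(shared_kmers(ordinary, reverse(ordinary), 6))
print(shared_kmers(microsatellite, reverse(microsatellite), 6))
```

The first set is empty, while the second still contains the repeat words. This is why low-complexity sequence accounts for matches to the reversed mouse sequence.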
Coverage by Outer Alignment
Score    1 Mus     >1 Mus    1 Rev     >1 Rev
3000     36.814%   2.340%    0.084%    0.080%
4000     36.859%   2.230%    0.040%    0.074%
5000     36.958%   1.975%    0.016%    0.059%
6000     36.992%   1.829%    0.013%    0.051%
7000     36.997%   1.697%    0.011%    0.043%
8000     36.966%   1.586%    0.010%    0.037%
9000     36.911%   1.490%    0.008%    0.033%
10000    36.831%   1.405%    0.007%    0.030%

Totals at 3000: Mus 39.154%, Rev 0.164%
At 5000: Mus total changes by -0.221% relative to 3000; Rev total 0.075%
At 10000: Mus total changes by -0.918% relative to 3000; Rev total 0.037%
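The summary figures attached to the table follow from its rows by simple addition, as the short check below shows. The dictionary layout and function names are just for illustration.

```python
# score threshold -> (1 Mus, >1 Mus, 1 Rev, >1 Rev), all in percent
COVERAGE = {
    3000: (36.814, 2.340, 0.084, 0.080),
    4000: (36.859, 2.230, 0.040, 0.074),
    5000: (36.958, 1.975, 0.016, 0.059),
    6000: (36.992, 1.829, 0.013, 0.051),
    7000: (36.997, 1.697, 0.011, 0.043),
    8000: (36.966, 1.586, 0.010, 0.037),
    9000: (36.911, 1.490, 0.008, 0.033),
    10000: (36.831, 1.405, 0.007, 0.030),
}


def total_mus(score):
    """Total coverage by mouse alignments at a threshold."""
    one, multi, _, _ = COVERAGE[score]
    return one + multi


def total_rev(score):
    """Total coverage by reversed-mouse alignments at a threshold."""
    _, _, one, multi = COVERAGE[score]
    return one + multi


for score in (5000, 10000):
    lost = total_mus(score) - total_mus(3000)
    print(score, f"{lost:+.3f}%", f"{total_rev(score):.3f}%")
```

Raising the threshold from 3000 to 10000 gives up 0.918% of mouse coverage while cutting reversed-mouse coverage from 0.164% to 0.037%.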
Comparison of Genome Coverage

               chr20    CDS      3’UTR    5’UTR    upstream
Blastz all     40.5%    98.5%    87.1%    89.0%    87.2%
Blastz tight    5.6%    92.5%    26.0%    39.6%    28.3%
PH all         29.7%    95.5%    55.0%    59.3%    52.5%
PH tight        5.0%    91.2%    25.1%    36.3%    25.2%
Transl. BLAT    5.8%    90.3%    29.2%    38.4%    27.2%


Comparison of Covered Region

               All      Tight
Blastz only    54.1%    12.2%
PH only        10.2%     3.3%
Both           35.7%    85.5%
Resources
B. Giardine, C. Riemer, R. C. Hardison, R. Burhans, L. Elnitski, P. Shah, Y. Zhang, D. Blankenberg, I. Albert, W. Miller, W. J. Kent, and A. Nekrutenko
Galaxy: A Platform for Interactive Large-scale Genome Analysis
Genome Research, 2005; 15: 1451–1455

宋建均、鄭智懷
2005.11.30
What is Galaxy?
It is a tool that allows users to gather and manipulate data from existing resources in a variety of ways.
Galaxy contains three major classes of data manipulation:
Query operations
Sequence analysis tools
Output displays
Why do we need Galaxy?
1. Galaxy differs from existing systems in its specificity for access to, and comparative analysis of, genomic sequences and alignments.
2. Programming experience is not required.
3. Galaxy is web-based software that can handle large sequence data sets.
Query Operations
Complement: compiles a list of regions that do not
overlap with the current query (requires UCSC library).
Restrict: filters data based on chromosome name and
region size (requires UCSC library).
Merge overlapping regions: overlapping regions within a
single query are consolidated into fewer, larger regions.
(requires UCSC library).
Intersect: finds overlapping regions between two queries
(requires UCSC library).
Union: finds all regions that are covered by either of the queries, and returns either merged regions or the original regions from one of the queries (requires UCSC library).
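The region operations listed above reduce to a few interval manipulations. The sketch below works on half-open `(start, end)` tuples from a single chromosome. It is not Galaxy's code and does not use the UCSC library. The function names simply echo the operation names, and `chrom_length` is an assumed input.

```python
def merge(regions):
    """Consolidate overlapping (or touching) regions into larger ones."""
    merged = []
    for start, end in sorted(regions):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def intersect(a, b):
    """Stretches covered by both queries."""
    a, b = merge(a), merge(b)
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        end = min(a[i][1], b[j][1])
        if start < end:
            out.append((start, end))
        # Advance whichever region ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out


def union(a, b):
    """Stretches covered by either query, returned as merged regions."""
    return merge(list(a) + list(b))


def complement(regions, chrom_length):
    """Stretches of the chromosome not covered by the query."""
    out, pos = [], 0
    for start, end in merge(regions):
        if start > pos:
            out.append((pos, start))
        pos = max(pos, end)
    if pos < chrom_length:
        out.append((pos, chrom_length))
    return out
```

Real queries span several chromosomes, so in practice these functions would be applied per chromosome name.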
Query Operations
Join Lists: joins two queries side by side to allow
performing statistical analyses (requires UCSC library).
Cluster: finds clusters of regions within specified distance
of each other (requires UCSC library).
Proximity: finds regions of one query within a specified
distance of regions from another query (requires UCSC
library).
Subtract: subtracts regions of one query from another
query (requires UCSC library).
Join Same Coordinates Region: joins two queries, which
have the same coordinates, side by side to allow
performing statistical analyses (requires UCSC library).
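Subtract, Proximity, and Cluster can be sketched in the same style, again on half-open `(start, end)` tuples from one chromosome. The names and the `max_dist` parameter are illustrative, and this is not how Galaxy implements them.

```python
def subtract(a, b):
    """Parts of the regions in a that are not covered by any region in b."""
    out = []
    b = sorted(b)
    for start, end in sorted(a):
        pos = start
        for b_start, b_end in b:
            if b_end <= pos or b_start >= end:
                continue  # no overlap with what remains of this region
            if b_start > pos:
                out.append((pos, b_start))
            pos = max(pos, b_end)
            if pos >= end:
                break
        if pos < end:
            out.append((pos, end))
    return out


def distance(r, s):
    """Gap between two regions; 0 if they overlap or touch."""
    return max(0, max(r[0], s[0]) - min(r[1], s[1]))


def proximity(a, b, max_dist):
    """Regions of a lying within max_dist of some region of b."""
    return [r for r in a if any(distance(r, s) <= max_dist for s in b)]


def cluster(regions, max_dist):
    """Group regions so that each starts within max_dist of the
    furthest end reached by its cluster so far."""
    clusters, reach = [], None
    for region in sorted(regions):
        if clusters and region[0] - reach <= max_dist:
            clusters[-1].append(region)
            reach = max(reach, region[1])
        else:
            clusters.append([region])
            reach = region[1]
    return clusters
```

For instance, `cluster([(0, 10), (15, 20), (100, 110)], 10)` groups the first two regions together and leaves the third on its own.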
Sequence Analysis Tools
Extract sequences: uses a Perl wrapper written around fasta-subseq to extract sequences corresponding to BED file coordinates. Uses the alignseq.loc file to locate genomic sequences. Requires PATH to include the fasta-subseq location (requires Perl).
Extract blastZ alignments: uses a Perl wrapper for extractAxt (developed by Rico) to extract genomic alignments corresponding to BED file coordinates. Uses alignseq.loc to find axt files. Requires PATH to include the extractAxt location (requires Perl).
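What "extract sequences corresponding to BED file coordinates" means can be shown with a small stand-in. It does not call fasta-subseq or read alignseq.loc. It assumes the genome is already loaded as a dictionary from chromosome name to sequence, and relies only on the BED convention of 0-based starts with exclusive ends. The function name is hypothetical.

```python
def extract_bed_regions(genome, bed_lines):
    """genome: dict mapping chromosome name -> sequence string.
    bed_lines: iterable of BED lines (chrom, start, end, ...).

    Returns a list of (chrom, start, end, subsequence).
    """
    extracted = []
    for line in bed_lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue  # skip blanks, comments, and header lines
        chrom, start, end = line.split()[:3]
        start, end = int(start), int(end)
        # BED starts are 0-based and ends are exclusive,
        # which matches Python slicing directly.
        extracted.append((chrom, start, end, genome[chrom][start:end]))
    return extracted
```

With `genome = {"chr1": "ACGTACGTAC"}`, the BED line `chr1 2 6` yields the four bases `GTAC`.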
Output Displays
UCSC and Ensembl Genome Browsers
EncodeDB at NHGRI
EnsMart at Sanger Centre
Language

CGI: Perl
Core: C
Database: SQL
Other Features
Asynchronous query
User identity: cookies & assigning a sequential ID number
to each terminal
Demo
