
BIOINFORMATICS

Assignment 2 (Part 1):


Critique
Submitted By: Amina Asif

About the paper


Title: MAFFT: a novel method for rapid multiple sequence alignment based on fast Fourier
transform.
Authors:
Kazutaka Katoh and Kazuharu Misawa from the Department of Biophysics, Graduate School of
Science, Kyoto University, Kyoto 606-8502, Japan.
Kei-ichi Kuma and Takashi Miyata from the Institute of Molecular Evolutionary Genetics,
Pennsylvania State University, University Park, PA 16802, USA.
Publication Details:
The paper was published in the journal Nucleic Acids Research, Vol. 30, No. 14, pp. 3059-3066,
in July 2002.

Introduction
This paper proposes a new multiple sequence alignment approach based on the fast Fourier
transform (FFT). The authors claim that their approach is faster than existing techniques.
The proposed method combines two techniques: in the first, homologous regions are identified
using the FFT; in the second, a simplified scoring scheme is used. The performance of the
method is compared with that of the CLUSTALW and T-COFFEE tools.

Motivation
Multiple sequence alignment is an essential tool in many biological analyses. There have been
many efforts to find optimal alignments, including the Needleman-Wunsch dynamic programming
solution and various heuristic methods such as progressive and iterative refinement methods.
Because of its very large CPU time requirements, Needleman-Wunsch is not practical for a large
number of long sequences. The heuristic methods are much faster.
Even when a heuristic method finds the optimal-scoring alignment, that alignment is not
guaranteed to be the biologically correct one. The accuracy of the resulting alignment is
greatly affected by the scoring scheme used. According to the paper, no existing scoring scheme
can correctly handle global alignments for all types of problems, in particular those with
large terminal extensions or internal insertions.

Summary of the Proposed Methods


Group-to-Group Alignments by FFT
The frequency of amino acid substitutions strongly depends on the difference in physicochemical
properties, particularly volume and polarity, between the two amino acids involved in the
substitution. The paper therefore converts an amino acid sequence into a sequence of vectors,
one per residue. For an amino acid a, the corresponding vector has as its components the volume
value v(a) and the polarity value p(a). The method uses normalized values of both.
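The encoding step can be sketched as follows. This is only an illustration: the property
numbers below are placeholders for four residues, not the table used in the paper, and z-score
normalization (subtract the mean, divide by the standard deviation) is an assumption about what
"normalized" means here. The names TOY_VOLUME, TOY_POLARITY, normalize and encode are
hypothetical.

```python
from statistics import mean, pstdev

# Placeholder property values for a four-letter toy alphabet.
# These are illustrative numbers, NOT the table used in the paper.
TOY_VOLUME = {"G": 3.0, "A": 30.0, "L": 110.0, "W": 170.0}
TOY_POLARITY = {"G": 9.0, "A": 8.0, "L": 5.0, "W": 5.5}


def normalize(table):
    """Z-score a property table over the alphabet (assumed normalization)."""
    mu = mean(table.values())
    sigma = pstdev(table.values())
    return {aa: (x - mu) / sigma for aa, x in table.items()}


def encode(seq, volume, polarity):
    """Turn a sequence into a list of (volume, polarity) vectors, one per residue."""
    v = normalize(volume)
    p = normalize(polarity)
    return [(v[aa], p[aa]) for aa in seq]


vectors = encode("GALW", TOY_VOLUME, TOY_POLARITY)
```

With a real property table covering all 20 amino acids, the same two functions would apply
unchanged.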
First, the correlation between two sequences is computed. The overall correlation c(k) is the
sum of the correlation of the volume components, cv(k), and that of the polarity components,
cp(k), where k is the positional lag of one sequence relative to the other. Rather than
computing the correlation directly from the vectors, the method applies the FFT to them and
computes the correlation in the frequency domain, which reduces the time complexity from O(N^2)
to O(N log N).
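To make the complexity claim concrete, here is a minimal sketch of correlation through the FFT
for a single component (for example, volume), checked against the direct O(N^2) sum. The
convention c(k) = sum over n of a[n] * b[n + k] is an assumption about how the lag is defined,
and the small radix-2 FFT is written out only to keep the example free of external libraries.
All function names are hypothetical.

```python
import cmath


def fft(x, invert=False):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2], invert)
    odd = fft(x[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out


def correlate_fft(a, b):
    """Return {lag k: sum_n a[n] * b[n + k]} using O(N log N) transforms."""
    size = 1
    while size < len(a) + len(b):  # zero-pad so lags do not wrap around
        size *= 2
    fa = fft([complex(v) for v in a] + [0j] * (size - len(a)))
    fb = fft([complex(v) for v in b] + [0j] * (size - len(b)))
    prod = [x.conjugate() * y for x, y in zip(fa, fb)]
    c = [v.real / size for v in fft(prod, invert=True)]
    # Negative lags are stored at the end of the circular result.
    return {k: c[k % size] for k in range(-(len(a) - 1), len(b))}


def correlate_naive(a, b):
    """Same quantity by the direct O(N^2) double loop."""
    return {
        k: sum(a[n] * b[n + k] for n in range(len(a)) if 0 <= n + k < len(b))
        for k in range(-(len(a) - 1), len(b))
    }


fast = correlate_fft([1.0, 2.0, 3.0], [0.5, -1.0, 2.0, 4.0])
slow = correlate_naive([1.0, 2.0, 3.0], [0.5, -1.0, 2.0, 4.0])
```

Under these assumptions the two dictionaries agree up to floating-point rounding. For the
overall c(k), the same routine would be run once on the volume components and once on the
polarity components, and the two results added lag by lag.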
The next step is finding the homologous segments. A peak in c(k) indicates that the sequences
may share homologous regions at lag k, but the FFT yields only the positional lag, not the
exact positions of the regions. To locate them, a sliding window analysis of the sequences is
carried out at each candidate lag.
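A sliding window scan at one candidate lag might look like the sketch below. The window size,
the threshold and the simple match/mismatch score are all assumptions chosen for illustration;
the critique does not state the values or the similarity measure the paper uses. The function
name find_segments is hypothetical.

```python
def find_segments(seq1, seq2, lag, window=5, threshold=4,
                  score=lambda x, y: 1 if x == y else -1):
    """Scan one lag: pair seq1[n] with seq2[n + lag] and report windows
    whose summed score reaches the threshold.

    Returns a list of (start_in_seq1, start_in_seq2, window_score).
    """
    hits = []
    start = max(0, -lag)                    # first n with n + lag >= 0
    end = min(len(seq1), len(seq2) - lag)   # one past the last valid n
    for n in range(start, end - window + 1):
        s = sum(score(seq1[n + i], seq2[n + lag + i]) for i in range(window))
        if s >= threshold:
            hits.append((n, n + lag, s))
    return hits


# The motif ACDEFG starts at index 3 in the first sequence and at index 0
# in the second, so it lines up at lag -3.
hits = find_segments("XXXACDEFGXX", "ACDEFGYYYYY", lag=-3)
```

The scan costs time proportional to the sequence length for each lag examined, which is why
restricting it to the lags suggested by the peaks of c(k) matters.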
The homologous segments are then arranged in a homology matrix, and an optimal path through it
is computed by standard dynamic programming. The method can be extended from
sequence-to-sequence alignment to group-to-group alignment. In that case, the vector
representation uses a linear combination of the volume and polarity components of the sequences
belonging to each group.
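The idea of choosing an optimal path through the candidate segments can be illustrated with a
small chaining routine. This is a simplified stand-in rather than the paper's procedure: it
assumes each segment is given as (start in sequence 1, start in sequence 2, length, score) and
picks the highest-scoring set of segments that appear in the same order, without overlap, in
both sequences. The name best_chain is hypothetical.

```python
def best_chain(segments):
    """Pick the best-scoring collinear, non-overlapping chain of segments.

    segments: list of (start1, start2, length, score).
    Returns (chain_in_order, total_score).
    """
    if not segments:
        return [], 0
    segs = sorted(segments)            # by start1, so predecessors come first
    best = [s[3] for s in segs]        # best chain score ending at each segment
    prev = [-1] * len(segs)            # back-pointers for the traceback
    for j, (s1, s2, _, sc) in enumerate(segs):
        for i in range(j):
            p1, p2, plen, _ = segs[i]
            # Segment i may precede j only if it ends before j starts
            # in BOTH sequences.
            if p1 + plen <= s1 and p2 + plen <= s2 and best[i] + sc > best[j]:
                best[j] = best[i] + sc
                prev[j] = i
    j = max(range(len(segs)), key=best.__getitem__)
    total = best[j]
    chain = []
    while j != -1:
        chain.append(segs[j])
        j = prev[j]
    return chain[::-1], total


chain, total = best_chain(
    [(0, 0, 5, 10), (3, 20, 5, 8), (10, 8, 5, 9), (20, 30, 5, 7)]
)
```

In this toy input the segment starting at (3, 20) conflicts with the others and is left out.
The chosen segments would then act as fixed anchors, so that full dynamic programming is needed
only in the regions between them.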

New Proposed Scoring System


The method uses a normalized similarity matrix that has both positive and negative values. Gaps
are penalized so that a gap newly introduced at the same position as an existing gap (one
introduced earlier in the group) is not penalized again, because the new and the existing gaps
probably result from a single insertion or deletion event.
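The gap rule as summarized above can be written as a small function. This sketch implements
only the rule as stated in this critique (no penalty when the column already contains a gap);
the base penalty of 1.0 is an arbitrary placeholder, and the function name gap_open_penalty is
hypothetical.

```python
def gap_open_penalty(group, column, base_penalty=1.0):
    """Cost of opening a new gap at `column` of an already aligned group.

    group: list of equal-length aligned rows, with '-' marking a gap.
    A column that already holds a gap is treated as the same indel event,
    so opening another gap there costs nothing.
    """
    has_existing_gap = any(row[column] == "-" for row in group)
    return 0.0 if has_existing_gap else base_penalty


group = ["AC-GT", "ACAGT"]
free = gap_open_penalty(group, 2)     # column 2 already has a gap
charged = gap_open_penalty(group, 1)  # column 1 has none
```

A graded version, in which the penalty shrinks with the fraction of rows that already have a
gap in the column, would be a natural refinement, but that goes beyond what is summarized here.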

Results
The authors report that the proposed method is many times faster than CLUSTALW and T-COFFEE,
without sacrificing accuracy.

Analysis and Discussion


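Using the FFT to compute the correlation between sequences greatly reduces the time complexity
compared with computing the correlation directly. The FFT itself does not discard information:
it yields the same correlation values as the direct sum, up to floating-point rounding.
Information is lost only in the following step, where the dominant peaks of c(k) are followed
up and the lower peaks are ignored. Some candidate regions may be missed as a result, but the
time complexity improves considerably.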
The FFT analysis, although fast, gives only the positional lag at which a possible homologous
region exists. To find the exact position, the sliding window analysis is unavoidable, which in
my opinion may become cumbersome, especially for group-to-group alignments.
Also, the division of the homology matrix, and hence the computational cost of finding the
optimal alignment, seems to depend on the quality of the previously found homologous segments.
The method also uses a novel scoring scheme that adds to its computational speed. Other methods
such as CLUSTALW and T-COFFEE use complex, problem-dependent scoring schemes, including
position-dependent gap penalties, whereas MAFFT uses a relatively simple scoring system.
The reason the accuracy has not deteriorated much despite the increased computational speed is
that the homologous regions are identified first and the alignment is then built around them.
