Transitivity
• In mathematics: if A = B and B = C, then A = C
• In biology: given three sequences A, B and C, if A and B
share a common ancestor and B and C share a common ancestor,
then A and C share a common ancestor
Use of Transitivity
• The concept of transitivity can be used to detect remote
homologues.
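The idea can be sketched as a graph search: sequences are nodes, an edge joins two sequences whose pairwise similarity reaches a threshold, and a remote homologue is any sequence reachable through a chain of such edges. This is only an illustrative sketch, not the authors' algorithm. The names `Matrix` and `reachable`, the percent-valued similarity matrix, and the use of a plain breadth-first search are all assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// Hypothetical input: sim[i][j] is a pairwise similarity in percent
// (square matrix). An edge i-j exists when sim[i][j] >= threshold.
using Matrix = std::vector<std::vector<double>>;

// Marks every sequence reachable from `start` through a chain of
// above-threshold pairs, i.e. direct hits and transitive hits alike.
std::vector<bool> reachable(const Matrix& sim, std::size_t start,
                            double threshold) {
    assert(start < sim.size());
    std::vector<bool> seen(sim.size(), false);
    std::queue<std::size_t> todo;
    seen[start] = true;
    todo.push(start);
    while (!todo.empty()) {
        const std::size_t i = todo.front();
        todo.pop();
        for (std::size_t j = 0; j < sim.size(); ++j) {
            if (!seen[j] && sim[i][j] >= threshold) {
                seen[j] = true;  // j is linked to start via i
                todo.push(j);
            }
        }
    }
    return seen;
}
```

With three sequences where A-B and B-C score above the threshold but A-C does not, `reachable` started at A still marks C, which is the transitive link described above.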
However
- It is not fully understood whether transitivity
always holds or whether it can be extended
ad infinitum.
- Multi-Domain Problem
Multi-Domain Problem
[Figure: graphs G and Gd, threshold = 32%. No edge is present between nodes P03480 and P03475; transitivity is applied.]
Implementation and Evaluation
• The algorithm is implemented in C++
• Own implementation of the Smith-Waterman local alignment
algorithm for computing sequence similarity
• Substitution matrix: BLOSUM80
• Gap opening penalty (gop) = 90
• Gap extension penalty (gep) = 9
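For orientation, a Smith-Waterman score with separate gap opening and gap extension penalties can be computed with the standard three-matrix (Gotoh) recurrence. The sketch below is not the authors' implementation. It assumes a toy match/mismatch score instead of BLOSUM80, and it assumes the convention that the first residue of a gap costs `gap_open` and each further residue costs `gap_extend`. The function name and parameters are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Best local alignment score of a and b with affine gap penalties.
// Scoring is a toy match/mismatch scheme (an assumption, NOT BLOSUM80).
int smith_waterman(const std::string& a, const std::string& b,
                   int match, int mismatch, int gap_open, int gap_extend) {
    assert(gap_open >= 0 && gap_extend >= 0);
    const std::size_t n = a.size(), m = b.size();
    const int NEG = -1000000;  // stands in for minus infinity
    using Row = std::vector<int>;
    // H: best score of an alignment ending at (i, j).
    // E / F: best score ending at (i, j) with a gap in a / in b.
    std::vector<Row> H(n + 1, Row(m + 1, 0));
    std::vector<Row> E(n + 1, Row(m + 1, NEG));
    std::vector<Row> F(n + 1, Row(m + 1, NEG));
    int best = 0;
    for (std::size_t i = 1; i <= n; ++i) {
        for (std::size_t j = 1; j <= m; ++j) {
            // Either open a new gap or extend an existing one.
            E[i][j] = std::max(H[i][j - 1] - gap_open,
                               E[i][j - 1] - gap_extend);
            F[i][j] = std::max(H[i - 1][j] - gap_open,
                               F[i - 1][j] - gap_extend);
            const int diag = H[i - 1][j - 1] +
                             (a[i - 1] == b[j - 1] ? match : mismatch);
            // The 0 lets a local alignment restart anywhere.
            H[i][j] = std::max({0, diag, E[i][j], F[i][j]});
            best = std::max(best, H[i][j]);
        }
    }
    return best;
}
```

Replacing the match/mismatch test with a lookup into a substitution matrix is the only change needed to score with a matrix such as BLOSUM80; the matrix itself is omitted here to keep the sketch short.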
Data (1)
• SwissProt (SP), excluding all sequences shorter
than 40 amino acids (a.a.), resulting in a set of
86494 protein sequences
– The total running time for the pair-wise Smith-Waterman
alignment was on the order of 14000 CPU-days
• The evaluation data: SCOP database
– Three levels are used: family, super-family and fold
Data (2)
• SCOP1: set of 2692 sequences
– Contains all non-identical sequences from SCOP
– No sequences shorter than 40 a.a.
– No sequences from classes 8, 9, 10
– 65464 pairs of homologue sequences, i.e. pairs where both
sequences are in the same super-family, and 3556622 pairs
where the sequences are in distinct super-families
• SCOP1 + SP: 85961 sequences
– All sequences are from SCOP1 and SwissProt
• SCOP2: 609 randomly chosen sequences from
SCOP
– Including sequences shorter than 40 a.a.
– No sequences from classes 8, 9, 10
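The pair counts quoted for SCOP1 can be checked with a line of arithmetic: 2692 sequences give 2692 × 2691 / 2 = 3622086 unordered pairs, which equals 65464 + 3556622. A minimal sketch of that check follows; the function name `pair_count` is made up for the example.

```cpp
#include <cassert>
#include <cstdint>

// Number of unordered pairs among n sequences: n * (n - 1) / 2.
std::uint64_t pair_count(std::uint64_t n) {
    assert(n <= 0xFFFFFFFFu);  // keeps n * (n - 1) within 64 bits
    return n * (n - 1) / 2;
}
```

Calling `pair_count(2692)` returns 3622086, so the homologue and non-homologue pairs together account for every pair in SCOP1.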
Performance measure
• Sensitivity: the proportion of homologue pairs
that are identified
[Figure annotations: sens = 55.6%; spec = 100%; TP due to intermediate linking = 8%; absolute increase in sens = 2.3; relative increase in sens = 4.1%; absolute increase in indirect TP = 3.6%]
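As a small worked sketch of these measures: sensitivity is the number of identified homologue pairs divided by all homologue pairs, and a relative increase is the absolute increase divided by the baseline. The helper names below are hypothetical. Pairing the quoted 2.3 with a 55.6% baseline is also an assumption, though it does reproduce the quoted 4.1% (2.3 / 55.6 ≈ 0.041).

```cpp
#include <cassert>

// Sensitivity in percent: identified homologue pairs (true positives)
// over all homologue pairs (true positives + false negatives).
double sensitivity_percent(long long true_pos, long long false_neg) {
    assert(true_pos >= 0 && false_neg >= 0 && true_pos + false_neg > 0);
    return 100.0 * static_cast<double>(true_pos) /
           static_cast<double>(true_pos + false_neg);
}

// Relative increase in percent, given a baseline value and an
// absolute gain expressed in the same unit as the baseline.
double relative_increase_percent(double baseline, double absolute_gain) {
    assert(baseline > 0.0);
    return 100.0 * absolute_gain / baseline;
}
```

Under the stated assumption, `relative_increase_percent(55.6, 2.3)` evaluates to roughly 4.1.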
[Figure: total number of SCC clusters, and SCC clusters of sizes 1, 2-5, 6-10 etc., for varying thresholds from 25 to 50%. Labels: SCOP1, SCOP1 + SP, # of SCOP super-families.]
Discussion (4)
• Comparison with the algorithm by Arvestad: it employs only
pair-wise sequence comparisons, but their approach uses a more
involved scoring method, optimized substitution matrices, and
gap penalties to achieve a substantial improvement over
straightforward pair-wise sequence comparisons.