Three-Dimensional Shape-Structure
Comparison Method for Protein Classification
Petros Daras, Dimitrios Zarpalas, Apostolos Axenopoulos,
Dimitrios Tzovaras, and Michael Gerassimos Strintzis
Abstract—In this paper, a 3D shape-based approach is presented for the efficient search, retrieval, and classification of protein
molecules. The method relies primarily on the geometric 3D structure of the proteins, which is produced from the corresponding PDB
files and secondarily on their primary and secondary structure. After proper positioning of the 3D structures, in terms of translation and
scaling, the Spherical Trace Transform is applied to them so as to produce geometry-based descriptor vectors, which are completely
rotation invariant and perfectly describe their 3D shape. Additionally, characteristic attributes of the primary and secondary structure of
the protein molecules are extracted, forming attribute-based descriptor vectors. The descriptor vectors are weighted and an integrated
descriptor vector is produced. Three classification methods are tested. A part of the FSSP/DALI database, which provides a structural
classification of the proteins, is used as the ground truth in order to evaluate the classification accuracy of the proposed method. The
experimental results show that the proposed method achieves more than 99 percent classification accuracy while remaining much
simpler and faster than the DALI method.
1 INTRODUCTION
(based on secondary structure elements). There are four main structural classes of proteins according to the way their secondary structure elements fold:
1. all-α (consisting of α-helices),
2. all-β (consisting of β-sheets),
3. α/β (α-helices and β-sheets alternating in the protein structure), and
4. α+β (α-helices and β-sheets located in specific parts of the structure).
The CATH (Class, Architecture, Topology, and Homologous superfamily) database [5], which is held at University College London (UCL), contains hierarchically classified structural elements (domains) of the proteins stored in the PDB (Protein Data Bank) database [1]. The CATH system uses automatic methods for the classification of domains, as well as experts’ contribution where automatic methods fail to give reliable results. For the classification of structural elements, five main hierarchical levels are used:
- Class: The class is determined by the percentage of secondary structure elements and their packing.
- Architecture: Describes the organization of the secondary structure elements.
- Topology: Provides a complete description of the whole schema and the way the secondary structure elements are connected.
- Homologous Superfamily: Structural elements that have at least 35 percent amino-acid sequence identity belong to the same Homologous Superfamily.
- Sequence: At this last level of hierarchy, the structures of the same Homologous Superfamily are further classified according to the similarity of their amino-acid sequences.
The FSSP (Families of Structurally Similar Proteins) database, which was created according to the DALI classification method [6], [7] and is held at the European Bioinformatics Institute (EBI) [8], provides a sophisticated classification method. The similarity between two proteins is based on their secondary structure. The evaluation of a pair of proteins is a highly time-consuming task, so the comparison between a macromolecule and all the macromolecules of the database requires days. Therefore, one representative protein for each class is defined. Every new protein is compared only to the representative protein of each class. However, for an all-to-all comparison of the 385 representative proteins of the database, an entire day is needed [29].
The classification method of the DALI algorithm [6], [7] is based on the best alignment of protein structures. The 3D coordinates of every protein are used for the creation of distance matrices that contain the distances between amino acids (the distances between their Cα atoms). These matrices are first decomposed into elementary formats, e.g., hexapeptidic-hexapeptidic submatrices. Similar formats make pairs and the emerging formats create new coherent pairs. Finally, a Monte Carlo procedure is used for the optimization of the similarity measure concerning the intramolecular distances. The DALI method contains a definition of representatives, which are proteins with some special characteristics so that no two representatives have more than 25 percent amino-acid sequence identity. This method is very time-consuming due to the many different alignments performed, the optimization procedures, and the extremely high number of distances between amino acids since a protein may consist of thousands of amino acids.
The protein databases may contain either protein collections or proteins accompanied by annotation. An example of the latter is the SWISS-PROT database [9], with 195,000 entries, where, in addition to the protein sequences, information about their function and biological action is also available.
PROSITE [10], [11] is a database for the classification of proteins into families of protein sequences and sequence domains. It is based on the observation that, despite the vast number of different proteins, they can be classified into a small number of families according to their sequence similarities. Protein sequences or sequence domains that belong to the same family have the same functions and a common ancestor. It is obvious that proteins of the same family have parts of their sequence preserved during their evolution.
A lot of research has been performed in recent years on the classification of amino acid sequences using different approaches. In [13], a data-mining approach for motif-based classification of proteins is presented. Motifs are either short amino acid chains with a specific order or representations of multiple sequence alignments using Hidden Markov Models [14]. Motifs can be used for the prediction of proteins’ properties since the behavior of a protein is a function of many motifs. By using motifs stored in several databases, such as the PROSITE database, classification rules that associate motifs with protein classes are applied. The data to be processed are in the form of a prefix tree acceptor (PTA), a tree-shaped automaton. The method utilizes a Finite State Automata (FSA) algorithm to induce classification rules from a training data set. The rules are finally applied to a test data set.
As it is not feasible to study experimentally every protein in all genomes, the function and biological role of a newly sequenced protein is usually inferred from a characterized protein using sequence and/or structure comparison methods. In recent years, many methods for pairwise protein structure alignment have been proposed and are now available on the World Wide Web. In [24], a state-of-the-art survey of recently published methods for protein comparison is presented.
In [25], a method to measure structural similarity of proteins is presented. According to this method, a finite number of representative local feature (LF) patterns is extracted from the distance matrices of all protein fold families by medoid analysis. Then, each distance matrix of a protein structure is encoded by labeling all its submatrices by the index of the nearest representative LF pattern. Finally, the structure is represented by the frequency distribution of these indices, which forms the LF frequency (LFF) profile of the protein, which is, in fact, a vector of common length K. The fold similarity between a pair of
DARAS ET AL.: THREE-DIMENSIONAL SHAPE-STRUCTURE COMPARISON METHOD FOR PROTEIN CLASSIFICATION 195
proteins can be computed by the Euclidean distance between two corresponding LFF profile vectors.
The algorithm described in [26] aims to combine the results of several existing sequence and structure comparison tools in order to map domains within protein structures with their homologs in an existing classification scheme. The comparison tools incorporated in the algorithm each utilize a different methodology for identifying homologous domains and, consequently, these tools have different advantages and limitations. The algorithm has been developed to find the homologs already classified in the SCOP database and, thus, determine classification assignments, but it can be applied to any other evolutionary-based classification scheme as well.
In [27], an information theoretic model called “coherent subgraph” mining has been developed in order to find characteristic substructural patterns within protein structural families. Protein structures are represented by graphs where the nodes are residues and the edges connect residues found within a certain distance from each other. An experimental study has been conducted in which all coherent subgraphs were identified in several protein structural families annotated in the SCOP database and a Support Vector Machine algorithm was used to classify proteins from different families under the binary classification scheme.
In [28], an approach to the problem of automatically clustering protein sequences and discovering protein families, subfamilies, etc., based on the theory of infinite Gaussian mixture models is described. The method allows the data itself to dictate how many mixture components are required to model it and provides a measure of the probability that two proteins belong to the same cluster. Finally, a classification of sequences of known structure is obtained which both reflects and extends their SCOP classifications.
Considering that proteins with similar 3D structures have similar functions, a geometric filtering can lead biologists to the investigation of new protein functions. In [15], proteins are represented as 3D models on the surface of which sample points are defined. After a translation, scaling, and rotation normalization, the models are segmented into concentric spheres and sectors and the number of sampled points is calculated for each sector and each sphere. After this procedure, descriptor vectors are created and compared using a quadratic form distance function. The nearest neighbor indicates the class assigned to the query protein. In [16], geometric features based on geometric moments and the Fourier Transform [17] are extracted, after a translation, scaling, and rotation normalization. Descriptors are also extracted from PDB files based on primary and secondary structure characteristics. Both of the aforementioned methods use a portion of the FSSP database as ground truth and achieve around 90 percent classification accuracy, which is very satisfactory, considering that they are less complicated than the DALI algorithm.
Another method that utilizes the geometric properties of secondary structures is based on indexing [18]. Triplets (three linear segments) of secondary structures, extracted from the 3D structures of the PDB database, are used to index 3D hash tables. The hash tables are built after computation of the angles and distances of all triplets of linear segments. In [30], a fast computational framework for classification of proteins is developed, using a series of secondary structure geometric parameters represented by an unexplored dihedral angle of a protein sequence. The comparison of two such series of dihedral angles, each representing a different protein structure, is accomplished by a similarity-search mechanism based on a translation and scale invariant indexing schema. The method is tested over 25 randomly selected proteins belonging to five different families and achieves a classification accuracy of 88 percent.
Following the same concept, we propose a new combined structure-geometric comparison algorithm, based primarily on the 3D shape of a protein and secondarily on its structure characteristics (primary, secondary structure). The method was introduced in [19] and [33] and dealt with efficient 3D model content-based search and retrieval. In this paper, the method is adapted to protein classification. More specifically, a part of the Spherical Trace Transform presented in [19] is proposed in this paper for the extraction of a vector efficiently describing the 3D structure of each protein. Having as input the PDB files, the 3D coordinates of the main atoms composing the amino acids are taken into account in order to construct a 3D model that describes the protein. These 3D protein forms are further processed in a way to be applicable to the Spherical Trace Transform. This methodology leads to the creation of completely rotation invariant descriptor vectors that perfectly describe the 3D shape of the proteins. Additionally, from the PDB files, characteristics which describe the primary and secondary structure of the proteins are also extracted. The geometrical descriptors, along with the structural descriptors, form a compound descriptor vector. This compound descriptor vector serves as input to a classification method which is used to categorize unclassified protein molecules. The classification methods used are: 1) the Euclidean distance measure, 2) the Mean Euclidean distance measure, and 3) a variant of the Bayesian probability measure.
The paper is organized as follows: The necessary preprocessing steps are described in Section 2. The proposed method and the functionals used are described in detail in Section 3. Section 4 presents the classification schemes used in order to evaluate the classification accuracy of the method. Experimental results evaluating the proposed method are presented in Section 5. Finally, conclusions are drawn in Section 6.
2 PREPROCESSING
A protein P is mainly composed of Carbon (C), Nitrogen (N), Oxygen (O), Hydrogen (H), and Sulfur (S) atoms. In Fig. 1, the 3D representation of a protein is depicted. The colors used and the atomic radii are listed in Table 1. The atoms in HETATM fields are not depicted.
Since the exact 3D position of each atom and its radius are known, it may be represented by a sphere. Next, the surface of each sphere is triangulated by employing 3D modeling techniques. In this way, a sphere consists of
196 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 3, NO. 3, JULY-SEPTEMBER 2006
TABLE 1
Main Atoms of a Protein
1. “Lying inside P” means that the corresponding voxel lies in the region that is enclosed by a sphere, which represents the atom of one of the proteins.
Fig. 2. Planes tangential to concentric spheres.
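The “lying inside P” test of the footnote can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names `voxelize` and `occupied`, the `(x, y, z, radius)` atom layout, the grid size, and the voxel spacing are all hypothetical, and the grid is assumed to be centered at the origin (i.e., after translation normalization). In practice the coordinates would come from the PDB file and the radii from Table 1.

```python
from typing import List, Tuple

# Hypothetical atom layout: centre coordinates plus sphere radius.
Atom = Tuple[float, float, float, float]  # (x, y, z, radius)


def voxelize(atoms: List[Atom], n: int, spacing: float) -> List[List[List[int]]]:
    """Return an n x n x n binary grid centred at the origin.

    A voxel is set to 1 when its centre lies inside (or on) at least one
    atom sphere, which is the "lying inside P" test; otherwise it stays 0.
    """
    half = n * spacing / 2.0
    # Coordinates of the voxel centres along one axis.
    centres = [(i + 0.5) * spacing - half for i in range(n)]
    grid = [[[0] * n for _ in range(n)] for _ in range(n)]
    for ax, ay, az, radius in atoms:
        r2 = radius * radius
        for i, x in enumerate(centres):
            for j, y in enumerate(centres):
                for k, z in enumerate(centres):
                    if (x - ax) ** 2 + (y - ay) ** 2 + (z - az) ** 2 <= r2:
                        grid[i][j][k] = 1
    return grid


def occupied(grid: List[List[List[int]]]) -> int:
    """Count the voxels that lie inside the protein."""
    return sum(v for plane in grid for row in plane for v in row)


# Toy example: a single atom of radius 1.0 at the origin on a 4 x 4 x 4 grid.
toy_grid = voxelize([(0.0, 0.0, 0.0, 1.0)], n=4, spacing=1.0)
print(occupied(toy_grid))
```

The triple loop over all voxels for every atom is deliberately naive; restricting the loops to the bounding box of each sphere would be the obvious refinement for real proteins.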
Fig. 3. Rotation of $f(\mathbf{x})$ rotates the $F(\eta, \rho)$ without rotating the corresponding $\hat{f}(i, j)$ (upper left image). Thus, $F(\eta_2, \rho_1) = F(\eta'_2, \rho_1)$.
Fig. 4. Rotation of $f(\mathbf{x})$ rotates the $\hat{f}(i, j)$ (upper left image) without causing a rotation of the point $(\eta_1, \rho_1)$.
In the specific case where the points lie on the axis of rotation, the corresponding $\hat{f}(i, j)$ will be rotated (Fig. 4), i.e.,
\[
DFT_t(k, m) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \hat{f}_t(i, j) \exp\left(-\hat{j}\left(\frac{2\pi i k}{N} + \frac{2\pi j m}{N}\right)\right), \tag{7}
\]
where $i, j = 0, \ldots, N-1$. Then, (7) becomes:
\[
DFT_t(k, m) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \hat{f}_t(r_{ij}, \theta_{ij}) \exp\left(-\hat{j}(k r_{ij} + m \theta_{ij})\right) \tag{10}
\]
and rotation is converted to a circular translation of $\theta$. Then, the first $K \times M$ harmonic amplitudes $|DFT_t(k, m)|$, where $k = 0, \ldots, K-1$ and $m = 0, \ldots, M-1$, are considered for each $\hat{f}_t(i, j)$. Since $t$ refers to each plane, which is described in the 3D space by the couple $(\eta, \rho)$, $|DFT_t(k, m)|$ can be denoted as $F1_{km}(\eta, \rho)$ or $F1^{\rho}_{km}(\eta)$.
3.1.2 Krawtchouk Moments
Krawtchouk moments [20] are a set of moments formed by using Krawtchouk polynomials as the basis function set. The $n$th order classical Krawtchouk polynomials are defined as:
\[
K_n(x; p, N) = \sum_{\nu=0}^{N} a_{\nu,n,p}\, x^{\nu} = {}_2F_1\left(-n, -x; -N; \frac{1}{p}\right), \tag{11}
\]
where $x, n = 0, 1, 2, \ldots, N$, $N > 0$, $p \in (0, 1)$, and ${}_2F_1$ is the hypergeometric function defined as:
\[
{}_2F_1(a, b; c; z) = \sum_{\nu=0}^{\infty} \frac{(a)_{\nu} (b)_{\nu}}{(c)_{\nu}} \frac{z^{\nu}}{\nu!} \tag{12}
\]
and $(a)_{\nu}$ is the Pochhammer symbol.
Following the analysis described in [19], the rotation invariant Krawtchouk moments are computed for each $\hat{f}_t(i, j)$ with spatial dimension $N \times N$ by:
\[
\tilde{Q}_{km} = [\rho(k)\rho(m)]^{-(1/2)} \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} a_{i,k,p_1}\, a_{j,m,p_2}\, \tilde{\nu}_{ij}, \tag{13}
\]
the sphere, generally denoted by $Y_{lm}(\eta)$, where $l \geq 0$ and $|m| \leq l$ [22].
Since spherical harmonics form a complete orthonormal set on the unit sphere, if a function $\varpi$, parameterized by the spherical coordinates $(\eta)$, can be expanded as an infinite Fourier series of spherical harmonics:
\[
\varpi(\eta_i) = \sum_{l=0}^{\infty} \sum_{m=-l}^{l} \alpha_{lm} Y_{lm}(\eta_i), \quad i = 1, \ldots, N_s, \tag{14}
\]
then the expansion coefficients $\alpha_{lm}$ are uniquely determined by:
\[
\alpha_{lm} = \sum_{i=1}^{N_s} \varpi(\eta_i) Y_{lm}(\eta_i) \frac{4\pi}{N_s}. \tag{15}
\]
In our case:
\[
\varpi(\eta) = \begin{cases} F1^{\rho}_{km}(\eta) \\ F2^{\rho}_{km}(\eta). \end{cases} \tag{16}
\]
The expansions (14) are strictly convergent in the sense that the error of the expansion reduces monotonically as $l$ tends to infinity. Hence, the leading terms of the series are those with small values of $l$ and $m$, which implies that, upon truncating the series at a sufficiently large value of $l$, $L$, most of the detail of the function $\varpi(\eta)$ will be captured.
Further, if $\varpi(\eta)$ is rotated ($\varpi'(\eta)$ with expansion coefficients $\alpha'_{lm}$), then, as is easily proven [22], the overall vector length of the $\alpha'_{lm}$ coefficients with the same $l$ is preserved under rotation:
\[
A_l^2 = \sum_{m} \|\alpha'_{lm}\|^2 = \sum_{m} \|\alpha_{lm}\|^2, \tag{17}
\]
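The reason only the harmonic amplitudes $|DFT_t(k, m)|$ are kept once rotation has become a circular translation can be checked numerically. The sketch below is a 1D illustration under stated assumptions, not the paper's 2D computation: the sample values and the helper names `dft_magnitudes` and `circular_shift` are hypothetical. A circular shift by $s$ samples multiplies the $k$th DFT coefficient by a unit-modulus phase factor, so the magnitudes are unchanged.

```python
import cmath
from typing import List


def dft_magnitudes(samples: List[float]) -> List[float]:
    """Magnitudes |X_k| of the discrete Fourier transform of a real sequence."""
    n = len(samples)
    mags = []
    for k in range(n):
        coeff = sum(
            s * cmath.exp(-2j * cmath.pi * k * i / n)
            for i, s in enumerate(samples)
        )
        mags.append(abs(coeff))
    return mags


def circular_shift(samples: List[float], shift: int) -> List[float]:
    """Rotate the sequence to the right by `shift` positions (with wrap-around)."""
    shift %= len(samples)
    return samples[-shift:] + samples[:-shift]


# Hypothetical angular profile and a circularly translated copy of it.
profile = [0.0, 1.0, 3.0, 2.0, 0.5, 0.0, 0.0, 4.0]
shifted = circular_shift(profile, 3)

# Largest difference between the two amplitude spectra (floating-point noise only).
max_diff = max(
    abs(a - b) for a, b in zip(dft_magnitudes(profile), dft_magnitudes(shifted))
)
print(max_diff)
```

The per-degree norm $A_l$ of (17) plays the analogous role on the sphere: a rotation mixes the coefficients of a given $l$ among themselves by a unitary transformation, which leaves their overall vector length unchanged.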
4 CLASSIFICATION
4.1 Matching Algorithm
Let $A$, $B$ be two 3D models (proteins). Also, let
descriptor vector ($N_d = N_c(L + 4) + 34$). Also, let $C$ be a class with descriptor vectors:
\[
M_C = \begin{bmatrix}
D^{1}(1), & \ldots, & D^{1}(k), & \ldots, & D^{1}(S) \\
\ldots, & \ldots, & \ldots, & \ldots, & \ldots, \\
D^{i}(1), & \ldots, & D^{i}(k), & \ldots, & D^{i}(S) \\
\ldots & \ldots, & \ldots, & \ldots, & \ldots, \\
D^{N_C}(1), & \ldots, & D^{N_C}(k), & \ldots, & D^{N_C}(S)
\end{bmatrix},
\]
where $w_{U(j)} = 1$ when $U(j)$ satisfies (29) and $w_{U(j)} = 0$ otherwise. $U$ is assigned to the class $C_i$ with the maximum $B(C_i)$.
5 EXPERIMENTAL RESULTS
In order to evaluate the performance of the proposed method, a portion of the FSSP database [23] was used. This
TABLE 3
Protein Classes Used as Ground Truth Database
TABLE 4
Extraction Time Using Different Initial Functionals
and All Spherical Functionals
significantly increases the overall classification accuracy (Fig. 7).
The times needed for the computation of the overall classification accuracy for the entire database are shown in Table 5. These include the comparison of each query protein descriptor vector to all (3,731) descriptor vectors (all-to-all comparison). In other words, the time needed for approximately 3,731^2 comparisons is 395 sec if the “Kraw00&FT02&Struct” descriptor vector is used. This is very satisfactory if we consider that the DALI algorithm requires an entire day for an all-to-all comparison of all 385 representatives of the FSSP database [29].
The time needed for the complete preprocessing procedure, from the creation of the 3D structure up to the final normalization step, is approximately 3 min. Although this procedure, for a large database with thousands of proteins, may last for days, it takes place only once and the descriptor vectors are stored in the database along with the corresponding 3D structures.
Fig. 8. Missed proteins using the Euclidean distance method. The query proteins are depicted in the first column. The second column shows the nearest neighbors, which were retrieved using the proposed method but do not belong to the same class as the query, according to the FSSP/DALI classification. The third column shows the proteins closest to the query that do belong to the same class according to the FSSP/DALI classification. It is obvious that the visual similarity between the proteins of columns 1 and 2 is greater than the similarity between the proteins of columns 1 and 3.
The FSSP/DALI database has been constructed based in part on the premise that proteins with at least 25 percent similarity in their amino acid sequence should belong to the same class even if dissimilar geometrically. Since we do not use this criterion, we do not achieve 100 percent classification accuracy. In fact, the best overall classification accuracy achieved, using the proposed method (Fig. 7, column 6), is 99.62 percent. In other words, 14 out of 3,732 proteins are misclassified. Further analysis of the misclassified proteins showed that the proposed method, which is mainly based on geometrical features (90 percent) rather than structural features (10 percent), classifies the 3D proteins differently when compared to the DALI algorithm. However, there is
TABLE 6
Classification Precision, Classification Recall, and
Classification Accuracy for Each Class Using the
“Kraw00&FT02&Struct” Descriptor Vector
\[
C_{Rec} = \frac{TP}{TP + FN}, \tag{33}
\]
\[
C_{Acc} = \frac{TP + TN}{TP + FP + FN + TN}, \tag{34}
\]
where:
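As a hedged illustration of how (33) and (34) would be evaluated for one class, the sketch below assumes the usual reading of TP, FP, FN, and TN as per-class counts of true positives, false positives, false negatives, and true negatives. The precision formula TP / (TP + FP) is the standard definition and is an assumption here, since it does not appear in the text above; the function name `class_metrics` and the example counts are hypothetical and are not taken from Table 6.

```python
from typing import Tuple


def class_metrics(tp: int, fp: int, fn: int, tn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, accuracy) for one class.

    recall follows (33) and accuracy follows (34); precision uses the
    standard definition TP / (TP + FP), which is assumed here.
    Empty denominators yield 0.0 instead of raising an error.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    return precision, recall, accuracy


# Hypothetical counts for a single class (illustration only).
precision, recall, accuracy = class_metrics(tp=95, fp=5, fn=3, tn=897)

# Overall accuracy implied by "14 out of 3,732 proteins are misclassified".
overall_percent = round((1 - 14 / 3732) * 100, 2)
print(precision, recall, accuracy, overall_percent)
```

The last two lines simply redo the arithmetic behind the reported overall figure: 14 misclassified proteins out of 3,732 correspond to 99.62 percent.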
TABLE 7
The Times Needed for the Computation of the Overall Classification Accuracy with the Mean Euclidean Distance Measure Method
TABLE 8
The Times Needed for the Computation of the Overall Classification Accuracy with the Naive Bayesian Classifier
Fig. 12. (a) Precision-recall curve of classes 1a6m, 1l92, and 2cba by using Kraw00&FT02 descriptors. (b), (c), and (d) Comparison of the precision-recall curve for each class with the method presented in [16].
[32] S. Tiwari and S. Gallager, “Machine Learning and Multiscale Methods in the Identification of Bivalve Larvae,” Proc. Ninth IEEE Int’l Conf. Computer Vision (ICCV 2003), pp. 13-16, Oct. 2003.
[33] P. Daras, D. Zarpalas, D. Tzovaras, and M.G. Strintzis, “3D Model Search and Retrieval Based on the Spherical Trace Transform,” Proc. IEEE Int’l Workshop Multimedia Signal Processing (MMSP), 2004.
[34] S.F. Altschul, T.L. Madden, A.A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D.J. Lipman, “Gapped Blast and PSI-Blast: A New Generation of Protein Database Search Programs,” Nucleic Acids Research, vol. 25, no. 17, pp. 3389-3402, 1997.
Petros Daras received the Diploma in electrical and computer engineering, the MSc degree in medical informatics, and the PhD degree in electrical and computer engineering from the Aristotle University of Thessaloniki, Greece, in 1999, 2002, and 2005, respectively. He is an associate researcher at the Informatics and Telematics Institute. His main research interests include computer vision, search and retrieval of 3D objects, the MPEG-4 standard, peer-to-peer technologies, and medical informatics. He has been involved in more than 10 European and National research projects. Dr. Daras is a member of the Technical Chamber of Greece.
Dimitrios Zarpalas received the Diploma in electrical and computer engineering from the Aristotle University of Thessaloniki, Greece, in 2003. He is an associate researcher at the Informatics and Telematics Institute. His main research interests include search and retrieval of 3D objects and medical image processing. He is a member of the Technical Chamber of Greece.
Apostolos Axenopoulos received the Diploma in electrical and computer engineering from the Aristotle University of Thessaloniki, Greece, in 2003. Currently, he is pursuing the MSc degree in advanced computing systems at the Aristotle University of Thessaloniki. He is an associate researcher at the Informatics and Telematics Institute. His main research interests include 3D content-based search and retrieval. He is a member of the Technical Chamber of Greece.
Dimitrios Tzovaras received the Diploma in electrical engineering and the PhD degree in 2D and 3D image compression from Aristotle University of Thessaloniki, Thessaloniki, Greece, in 1992 and 1997, respectively. He is a senior researcher in the Informatics and Telematics Institute of Thessaloniki. Prior to his current position, he was a senior researcher on 3D imaging at the Aristotle University of Thessaloniki. His main research interests include virtual reality, assistive technologies, 3D data processing, medical image communication, 3D motion estimation, and stereo and multiview image sequence coding. His involvement with those research areas has led to the coauthoring of more than 35 papers in refereed journals and more than 80 papers in international conferences. He has served as a regular reviewer for a number of international journals and conferences. Since 1992, he has been involved in more than 40 projects in Greece funded by the EC and the Greek Secretariat of Research and Technology. He is an associate editor of the EURASIP Journal of Applied Signal Processing and a member of the Technical Chamber of Greece.
Michael Gerassimos Strintzis (M’70-SM’80-F’04) received the Diploma in electrical engineering from the National Technical University of Athens, Athens, Greece, in 1967, and the MA and PhD degrees in electrical engineering from Princeton University, Princeton, New Jersey, in 1969 and 1970, respectively. He then joined the Electrical Engineering Department at the University of Pittsburgh, where he served as an assistant professor (1970-1976) and an associate professor (1976-1980). Since 1980, he has been a professor of electrical and computer engineering at the University of Thessaloniki, Thessaloniki, Greece, and, since 1999, director of the Informatics and Telematics Research Institute, Thessaloniki. His current research interests include 2D and 3D image coding, image processing, biomedical signal and image processing, and DVD and Internet data authentication and copy protection. Dr. Strintzis has served as associate editor for the IEEE Transactions on Circuits and Systems for Video Technology since 1999. In 1984, he was awarded one of the Centennial Medals of the IEEE. He is a fellow of the IEEE.