
IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 3, NO. 3, JULY-SEPTEMBER 2006

Three-Dimensional Shape-Structure
Comparison Method for Protein Classification
Petros Daras, Dimitrios Zarpalas, Apostolos Axenopoulos,
Dimitrios Tzovaras, and Michael Gerassimos Strintzis

Abstract—In this paper, a 3D shape-based approach is presented for the efficient search, retrieval, and classification of protein
molecules. The method relies primarily on the geometric 3D structure of the proteins, which is produced from the corresponding PDB
files and secondarily on their primary and secondary structure. After proper positioning of the 3D structures, in terms of translation and
scaling, the Spherical Trace Transform is applied to them so as to produce geometry-based descriptor vectors, which are completely
rotation invariant and perfectly describe their 3D shape. Additionally, characteristic attributes of the primary and secondary structure of
the protein molecules are extracted, forming attribute-based descriptor vectors. The descriptor vectors are weighted and an integrated
descriptor vector is produced. Three classification methods are tested. A part of the FSSP/DALI database, which provides a structural
classification of the proteins, is used as the ground truth in order to evaluate the classification accuracy of the proposed method. The
experimental results show that the proposed method achieves more than 99 percent classification accuracy while remaining much
simpler and faster than the DALI method.

Index Terms—Information search and retrieval, classification, protein databases.

• P. Daras, D. Zarpalas, A. Axenopoulos, D. Tzovaras, and M.G. Strintzis are with the Informatics and Telematics Institute (ITI), 1st Km Thermi-Panorama Road, Thermi-Thessaloniki, PO Box 361, GR-57001, Greece. E-mail: {daras, zarpalas, axenop, tzovaras}@iti.gr.
• M.G. Strintzis is with the Electrical and Computer Engineering Department, Aristotle University of Thessaloniki, Thessaloniki, GR-54124, Greece. E-mail: daras@iti.gr, strintzi@eng.auth.gr.

Manuscript received 24 Nov. 2004; revised 23 Sept. 2005; accepted 27 Nov. 2005; published online 31 July 2006.
For information on obtaining reprints of this article, please send e-mail to: tcbb@computer.org, and reference IEEECS Log Number TCBB-0195-1104.
1545-5963/06/$20.00 © 2006 IEEE. Published by the IEEE CS, CI, and EMB Societies & the ACM.

1 INTRODUCTION

THE structure of a molecule in 3D space is the main factor which determines its chemical properties as well as its function. All information required for a protein to fold into its natural 3D structure is coded in its amino acid sequence. Therefore, the 3D representation of a residue sequence and the way this sequence folds in 3D space are very important for understanding the "logic" on which a function or biological action of a protein is based. With technological innovation and the rapid development of X-ray crystallography methods and NMR spectrum analysis techniques, a high number of new 3D structures of protein molecules is being determined [2]. The 3D structures are stored in the world-wide repository Protein Data Bank (PDB) [1]. The volume of 3D molecular structure data increases rapidly, since almost 200 new structures are stored per month in PDB. Today there are more than 24,000 3D protein and nucleic acid molecules in this repository.

The Protein Data Bank [1], [12] is the primary repository for experimentally determined 3D protein structures. It was created in 1971 at Brookhaven National Laboratories (BNL) in the USA and contained seven macromolecule structures. These structures were created using crystallography methods. During the 1970s, the increase rate of entries was relatively low. Since 1980, the increase rate has become dramatically high due to rapid technological development. In addition to the atom coordinates, PDB entries may contain additional information such as references, structure details, or other features. Every new structure undergoes a correctness control by using appropriate software. After its successful evaluation, the protein is given an ID (code number) and it becomes available for public use.

Since 1958, when the first 3D structure of the protein myoglobin was determined, the complexity and the variety of the protein structures have increased along with the number of newly determined macromolecules. Therefore, there is an obvious need for a classification of proteins, which may result in a better understanding of these complicated structures, their functions, and the deeper evolutionary procedures that led to their creation. In molecular biology, many classification schemata and databases are available. These are briefly reviewed below.

The SCOP (Structural Classification of Proteins) protein database, which is held at the Laboratory of Molecular Biology of the Medical Research Council (MRC) in Cambridge, England, describes the structural and evolutionary relationships between proteins of known structure [4]. Since the existing automatic tools for the comparison of secondary structure elements cannot guarantee 100 percent success in the identification of protein structures, SCOP uses experts' experience to carry out this task. This is not a simple task considering the complexity of protein structures, which vary from single structural elements to vast multidomain complexes.

Proteins are classified in a hierarchical manner that reflects their structural and evolutionary relationship. The main levels of the hierarchy are "Family" (based on the proteins' evolutionary relationships), "Superfamily" (based on some common structural characteristics), and "Fold"

(based on secondary structure elements). There are four main structural classes of proteins according to the way of folding their secondary structure elements:

1. all-α (consisting of α-helices),
2. all-β (consisting of β-sheets),
3. α/β (α-helices and β-sheets alternating in the protein structure), and
4. α+β (α-helices and β-sheets located in specific parts of the structure).

The CATH (Class, Architecture, Topology, and Homologous superfamily) database [5], which is held at University College London (UCL), contains hierarchically classified structural elements (domains) of the proteins stored in the PDB (Protein Data Bank) database [1]. The CATH system uses automatic methods for the classification of domains, as well as experts' contribution where automatic methods fail to give reliable results. For the classification of structural elements, five main hierarchical levels are used:

• Class: The class is determined by the percentage of secondary structure elements and their packing.
• Architecture: Describes the organization of the secondary structure elements.
• Topology: Provides a complete description of the whole schema and the way the secondary structure elements are connected.
• Homologous Superfamily: Structural elements that have at least 35 percent amino-acid sequence identity belong to the same Homologous Superfamily.
• Sequence: At this last level of hierarchy, the structures of the same Homologous Superfamily are further classified according to the similarity of their amino-acid sequences.

The FSSP (Families of Structurally Similar Proteins) database, which was created according to the DALI classification method [6], [7] and is held at the European Bioinformatics Institute (EBI) [8], provides a sophisticated classification method. The similarity between two proteins is based on their secondary structure. The evaluation of a pair of proteins is a highly time-consuming task, so the comparison between a macromolecule and all the macromolecules of the database requires days. Therefore, one representative protein for each class is defined. Every new protein is compared only to the representative protein of each class. However, for an all-to-all comparison of the 385 representative proteins of the database, an entire day is needed [29].

The classification method of the DALI algorithm [6], [7] is based on the best alignment of protein structures. The 3D coordinates of every protein are used for the creation of distance matrices that contain the distances between amino acids (the distances between their $C_\alpha$ atoms). These matrices are, first, decomposed into elementary formats, e.g., hexapeptidic-hexapeptidic submatrices. Similar formats make pairs and the emerging formats create new coherent pairs. Finally, a Monte Carlo procedure is used for the optimization of the similarity measure concerning the intramolecular distances. The DALI method contains a definition of representatives, which are proteins with some special characteristics so that no two representatives have more than 25 percent amino-acid sequence identity. This method is very time-consuming due to the many different alignments performed, the optimization procedures, and the extremely high number of distances between amino acids, since a protein may consist of thousands of amino acids.

The protein databases may contain either protein collections or proteins accompanied by annotation. An example of the latter is the SWISS-PROT database [9], with 195,000 entries, where, in addition to the protein sequences, information about their function and biological action is also available.

PROSITE [10], [11] is a database for the classification of proteins into families of protein sequences and sequence domains. It is based on the observation that, despite the vast number of different proteins, those can be classified into a small number of families, according to their sequence similarities. Protein sequences or sequence domains that belong to the same family have the same functions and a common ancestor. It is obvious that proteins of the same family have parts of their sequence preserved during their evolution.

A lot of research has been performed in recent years for the classification of amino acid sequences using different approaches. In [13], a data-mining approach for motif-based classification of proteins is presented. Motifs are either short amino acid chains with a specific order or representations of multiple sequence alignments using Hidden Markov Models [14]. Motifs can be used for the prediction of proteins' properties since the behavior of a protein is a function of many motifs. By using motifs stored in several databases, such as the PROSITE database, classification rules that associate motifs with protein classes are applied. The data to be processed are in the form of a prefix tree acceptor (PTA), a tree-shaped automaton. The method utilizes a Finite State Automata (FSA) algorithm to induce classification rules from a training data set. The rules are finally applied to a test data set.

As it is not feasible to study experimentally every protein in all genomes, the function and biological role of a newly sequenced protein is usually inferred from a characterized protein using sequence and/or structure comparison methods. In recent years, many methods for pairwise protein structure alignment have been proposed and are now available on the World Wide Web. In [24], a state-of-the-art survey of recently published methods for protein comparison is presented.

In [25], a method to measure structural similarity of proteins is presented. According to this method, a finite number of representative local feature (LF) patterns is extracted from the distance matrices of all protein fold families by medoid analysis. Then, each distance matrix of a protein structure is encoded by labeling all its submatrices by the index of the nearest representative LF patterns. Finally, the structure is represented by the frequency distribution of these indices, which forms the LF frequency (LFF) profile of the protein, which is, in fact, a vector of common length K. The fold similarity between a pair of

proteins can be computed by the Euclidean distance between two corresponding LFF profile vectors.

The algorithm described in [26] aims to combine the results of several existing sequence and structure comparison tools in order to map domains within protein structures with their homologs in an existing classification scheme. The comparison tools incorporated in the algorithm each utilize a different methodology for identifying homologous domains and, consequently, these tools have different advantages and limitations. The algorithm has been developed to find the homologs already classified in the SCOP database and, thus, determine classification assignments, but it can be applied to any other evolutionary-based classification scheme as well.

In [27], an information theoretic model called "coherent subgraph" mining has been developed in order to find characteristic substructural patterns within protein structural families. Protein structures are represented by graphs where the nodes are residues and the edges connect residues found within a certain distance from each other. An experimental study has been conducted in which all coherent subgraphs were identified in several protein structural families annotated in the SCOP database and a Support Vector Machine algorithm was used to classify proteins from different families under the binary classification scheme.

In [28], an approach to the problem of automatically clustering protein sequences and discovering protein families, subfamilies, etc., based on the theory of infinite Gaussian mixture models is described. The method allows the data itself to dictate how many mixture components are required to model it and provides a measure of the probability that two proteins belong to the same cluster. Finally, a classification of sequences of known structure is obtained which both reflects and extends their SCOP classifications.

Considering that proteins with similar 3D structures have similar functions, a geometric filtering can lead biologists to the investigation of new protein functions. In [15], proteins are represented as 3D models on the surface of which sample points are defined. After a translation, scaling, and rotation normalization, the models are segmented into concentric spheres and sectors and the number of sampled points is calculated for each sector and each sphere. After this procedure, descriptor vectors are created and compared using a quadratic form distance function. The nearest neighbor indicates the class assigned to the query protein. In [16], geometric features based on geometric moments and the Fourier Transform [17] are extracted, after a translation, scaling, and rotation normalization. Descriptors are also extracted from PDB files based on primary and secondary structure characteristics. Both of the aforementioned methods use a portion of the FSSP database as ground truth and achieve around 90 percent classification accuracy, which is very satisfactory, considering that they are less complicated than the DALI algorithm.

Another method that utilizes the geometric properties of secondary structures is based on indexing [18]. Triplets (three linear segments) of secondary structures, extracted from the 3D structures of the PDB database, are used to index 3D hash tables. The hash tables are built after computation of the angles and distances of all triplets of linear segments. In [30], a fast computational framework for classification of proteins is developed, using a series of secondary structure geometric parameters represented by an unexplored dihedral angle of a protein sequence. The comparison of two such series of dihedral angles, each representing a different protein structure, is accomplished by a similarity-search mechanism based on a translational and scale invariant indexing schema. The method is tested over 25 randomly selected proteins belonging to five different families and achieves a classification accuracy of 88 percent.

Following the same concept, we propose a new combined structure-geometric comparison algorithm, based primarily on the 3D shape of a protein and secondarily on its structure characteristics (primary, secondary structure). The method was introduced in [19] and [33] and dealt with efficient 3D model content-based search and retrieval. In this paper, the method is adapted to protein classification. More specifically, a part of the Spherical Trace Transform presented in [19] is proposed in this paper for the extraction of a vector efficiently describing the 3D structure of each protein. Having as input the PDB files, the 3D coordinates of the main atoms composing the amino acids are taken into account in order to construct a 3D model that describes the protein. These 3D protein forms are further processed so that the Spherical Trace Transform can be applied to them. This methodology leads to the creation of completely rotation invariant descriptor vectors that perfectly describe the 3D shape of the proteins. Additionally, from the PDB files, characteristics which describe the primary and secondary structure of the proteins are also extracted. The geometrical descriptors, along with the structural descriptors, form a compound descriptor vector. This compound descriptor vector serves as input to a classification method which is used to categorize unclassified protein molecules. The classification methods used are: 1) the Euclidean distance measure, 2) the Mean Euclidean distance measure, and 3) a variant of the Bayesian probability measure.

The paper is organized as follows: The necessary preprocessing steps are described in Section 2. The proposed method and the functionals used are described in detail in Section 3. Section 4 presents the classification schemes used in order to evaluate the classification accuracy of the method. Experimental results evaluating the proposed method are presented in Section 5. Finally, conclusions are drawn in Section 6.

2 PREPROCESSING

A protein P is mainly composed of Carbon (C), Nitrogen (N), Oxygen (O), Hydrogen (H), and Sulfur (S) atoms. In Fig. 1, the 3D representation of a protein is depicted. The colors used and the atomic radii are listed in Table 1. The atoms in HETATM fields are not depicted.

Since the exact 3D position of each atom and its radius are known, it may be represented by a sphere. Next, the surface of each sphere is triangulated by employing 3D modeling techniques. In this way, a sphere consists of

a small set of vertices and a set of connections between the vertices. Finally, a protein $P$ is comprised of a set of spheres, along with the corresponding vertices $V$ and the connections among them.

TABLE 1. Main Atoms of a Protein

Fig. 1. The protein 1DD5.

Then, the center of mass of $P$ is calculated and each $V$ is translated so that the new center of mass is at the origin. The distance $d_{max}$ between the new origin and the most distant vertex is computed and $P$ is scaled so that $d_{max} = 1$. The translated and scaled $P$ is then placed into a bounding cube, which is partitioned into $(2 \cdot N)^3$ equal cube-shaped voxels $u_i$ with centers $\mathbf{v}_i = [x_i, y_i, z_i]$, where $i = 1, \dots, (2 \cdot N)^3$. Let $U$ be the set of all voxels inside the bounding cube and $U_1 \subseteq U$ be the set of all voxels belonging to the bounding cube and lying inside $P$ (see footnote 1). Then, the discrete binary volume function $f_b(\mathbf{v}_i)$ of $P$ is defined as:

$$f_b(\mathbf{v}_i) = \begin{cases} 1, & \text{when } u_i \in U_1 \\ 0, & \text{otherwise.} \end{cases} \tag{1}$$

A coarser mesh is then constructed by combining every eight neighboring voxels $u_i$ to form a bigger voxel with center $\mathbf{x}_k$, $k = 1, \dots, N^3$. The discrete integer volume function $f(\mathbf{x}_k)$ of $P$ is defined as:

$$f(\mathbf{x}_k) = \sum_{n=1}^{8} f_b(\mathbf{v}_n), \tag{2}$$

where the sum runs over the eight voxels $u_n$ that make up the $k$-th bigger voxel. Thus, $f(\mathbf{x}_k)$ takes values in $[0, \dots, 8]$.

1. "Lying inside $P$" means that the corresponding voxel lies in the region that is enclosed by a sphere which represents one of the atoms of the protein.

3 THE PROPOSED METHOD

The method proposed in this paper is based on the "Spherical Trace Transform" introduced in [19], which is further exploited to extract descriptors to be used for classification purposes. It is presented in the sequel for the sake of completeness.

Let us define the plane $\Pi(\boldsymbol{\eta}, \rho) = \{\mathbf{x} \mid \mathbf{x}^T \cdot \boldsymbol{\eta} = \rho\}$ to be tangential to the sphere $S_\rho$ with radius $\rho$ and center at the origin, at the point $(\boldsymbol{\eta}, \rho)$, where $\boldsymbol{\eta} = [\cos\phi \sin\theta, \sin\phi \sin\theta, \cos\theta]$ is the unit vector in $R^3$, and $\rho$ is a real positive number (Fig. 2).

Fig. 2. Planes tangential to concentric spheres.

The intersection of $\Pi(\boldsymbol{\eta}, \rho)$ with $f(\mathbf{x})$ produces a 2D function $\hat{f}(a, b)$, $(a, b) \in \Pi(\boldsymbol{\eta}, \rho) \cap f(\mathbf{x})$, which is then sampled and its discrete form $\hat{f}(i, j)$, $i, j = 1, 2, \dots, N$, is produced. $N$ is the number of voxels into which the bounding cube is partitioned along each dimension.

The "Spherical Trace Transform" proposed in this paper can be described using the general formula:

$$SphTrace[T, F, \hat{f}] = T(F(\hat{f}(i, j))), \tag{3}$$

where $F(\boldsymbol{\eta}, \rho)$ denotes an "Initial Functional," which can be applied to each $\hat{f}(i, j)$, i.e., $F(\boldsymbol{\eta}, \rho) = F(\hat{f}(i, j))$. The set of $F(\boldsymbol{\eta}, \rho)$ is treated as a collection of spherical functions $\{F_\rho(\boldsymbol{\eta})\}$ parameterized by $\rho$. Then, a set of "Spherical Functionals" $T(\cdot)$ is applied to each $F_\rho(\boldsymbol{\eta})$, producing a descriptor vector $D1 = T(F_\rho(\boldsymbol{\eta}))$.

Let us now examine the conditions that must be satisfied by the functionals in order to produce rotation invariant descriptor vectors. Under a 3D object rotation governed by a 3D rotation matrix $R$, the points $\boldsymbol{\eta}$ will be rotated:

$$\boldsymbol{\eta}' = R \cdot \boldsymbol{\eta}, \tag{4}$$

therefore,

$$F(\boldsymbol{\eta}', \rho) = F(R \cdot \boldsymbol{\eta}, \rho), \tag{5}$$

and, thus, rotation invariant $T$ functionals must be applied so that $T(F(\boldsymbol{\eta}', \rho)) = T(F(\boldsymbol{\eta}, \rho))$ (Fig. 3).
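The preprocessing of Section 2 (translation, scaling, and the two volume functions of (1) and (2)) can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: atoms are passed directly as `(x, y, z, radius)` tuples instead of triangulated meshes, the center of mass is taken as the unweighted mean of the atom centers, the most distant point is approximated by the farthest sphere surface, the bounding cube is taken to be $[-1, 1]^3$, and the name `voxelize` and the toy input are hypothetical.

```python
import math


def voxelize(atoms, n):
    """Return the n x n x n integer volume f (values 0..8) for a list of
    (x, y, z, radius) atom spheres. Hypothetical helper for illustration."""
    # Translate: move the (unweighted) mean of the atom centers to the origin.
    count = len(atoms)
    cx = sum(a[0] for a in atoms) / count
    cy = sum(a[1] for a in atoms) / count
    cz = sum(a[2] for a in atoms) / count
    centered = [(x - cx, y - cy, z - cz, r) for x, y, z, r in atoms]

    # Scale: the farthest sphere-surface point ends up at distance 1 (d_max = 1).
    d_max = max(math.sqrt(x * x + y * y + z * z) + r for x, y, z, r in centered)
    scaled = [(x / d_max, y / d_max, z / d_max, r / d_max)
              for x, y, z, r in centered]

    m = 2 * n          # fine grid: (2N)^3 voxels over the assumed cube [-1, 1]^3
    step = 2.0 / m

    def center(i):
        # Coordinate of the center of the i-th fine voxel along one axis.
        return -1.0 + (i + 0.5) * step

    def inside(px, py, pz):
        # A voxel counts as "inside P" if its center falls within any atom sphere.
        return any(r * r >= (px - x) ** 2 + (py - y) ** 2 + (pz - z) ** 2
                   for x, y, z, r in scaled)

    # Binary volume f_b of Eq. (1).
    fb = [[[1 if inside(center(i), center(j), center(k)) else 0
            for k in range(m)] for j in range(m)] for i in range(m)]

    # Integer volume f of Eq. (2): sum each 2 x 2 x 2 block of fine voxels.
    return [[[sum(fb[2 * i + a][2 * j + b][2 * k + c]
                  for a in (0, 1) for b in (0, 1) for c in (0, 1))
              for k in range(n)] for j in range(n)] for i in range(n)]


# Toy input: two unit-radius "atoms" three units apart, coarse grid N = 8.
f = voxelize([(0.0, 0.0, 0.0, 1.0), (3.0, 0.0, 0.0, 1.0)], 8)
```

Coarse voxels deep inside an atom take the value 8, voxels far from every atom take 0, and voxels on the boundary take intermediate values, which is why the integer volume carries more shape detail than a binary grid of the same resolution.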

Fig. 3. Rotation of $f(\mathbf{x})$ rotates the $F(\boldsymbol{\eta}, \rho)$ without rotating the corresponding $\hat{f}(i, j)$ (upper left image). Thus, $F(\boldsymbol{\eta}_2, \rho_1) = F(\boldsymbol{\eta}_2', \rho_1)$.

Fig. 4. Rotation of $f(\mathbf{x})$ rotates the $\hat{f}(i, j)$ (upper left image) without causing a rotation of the point $(\boldsymbol{\eta}_1, \rho_1)$.

In the specific case where the points $\boldsymbol{\eta}$ lie on the axis of rotation, the corresponding $\hat{f}(i, j)$ will be rotated (Fig. 4), i.e.,

$$\hat{f}'(i, j) = \hat{f}(i', j'), \tag{6}$$

and, thus, 2D rotation invariant functionals must be applied so that $F(\hat{f}'(i, j)) = F(\hat{f}(i', j'))$. Therefore, a general solution is given using 2D rotation invariant functionals $F$ and rotation invariant spherical functionals $T$, producing completely rotation invariant descriptor vectors.

3.1 Initial Functionals F

The set of the Initial Functionals $F$ consists of several harmonics of the Polar-Fourier Transform and several of the Krawtchouk moments.

3.1.1 The Polar-Fourier Transform

The Discrete Fourier Transform (DFT) is computed for each $\hat{f}_t(i, j)$, where $t = 1, \dots, N_R$ and $N_R$ is the total number of planes:

$$DFT_t(k, m) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \hat{f}_t(i, j) \exp\left[-\hat{j}\left(\frac{2\pi i k}{N} + \frac{2\pi j m}{N}\right)\right], \tag{7}$$

where $k, m = 0, \dots, N-1$. In the DFT, shifts in the spatial domain cause corresponding linear shifts in the phase component:

$$DFT_t(k, m) \exp[\hat{j}(ak + bm)] \leftrightarrow f_t(i + a, j + b). \tag{8}$$

Thus, the DFT magnitude is invariant to circular translation. Therefore, using discrete polar coordinates:

$$r_{ij} = \sqrt{(c_1 i + c_2)^2 + (c_1 j + c_2)^2}, \quad \theta_{ij} = \tan^{-1}\left(\frac{c_1 j + c_2}{c_1 i + c_2}\right), \quad c_1 = \frac{\sqrt{2}}{N-1} \cdot r_{max}, \quad c_2 = -\frac{1}{\sqrt{2}} \cdot r_{max} \tag{9}$$
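The claim that the DFT magnitude is unchanged by a circular translation, which follows from the shift property (8), can be checked numerically. The snippet below is a minimal sketch in plain Python; it assumes the standard DFT sign convention with the $2\pi/N$ factor written explicitly, and the helper names `dft2`, `circular_shift`, and `magnitudes` as well as the 4 x 4 test array are hypothetical.

```python
import cmath


def dft2(f):
    """2D DFT of a square N x N array given as nested lists."""
    n = len(f)
    return [[sum(f[i][j] * cmath.exp(-2j * cmath.pi * (i * k + j * m) / n)
                 for i in range(n) for j in range(n))
             for m in range(n)]
            for k in range(n)]


def circular_shift(f, a, b):
    """Return g with g(i, j) = f((i + a) mod N, (j + b) mod N)."""
    n = len(f)
    return [[f[(i + a) % n][(j + b) % n] for j in range(n)] for i in range(n)]


def magnitudes(spectrum):
    """Element-wise magnitude of a complex spectrum."""
    return [[abs(v) for v in row] for row in spectrum]


# Small test image and a circularly shifted copy of it.
f = [[1, 2, 0, 0],
     [0, 3, 1, 0],
     [0, 0, 4, 0],
     [2, 0, 0, 1]]
mag = magnitudes(dft2(f))
mag_shifted = magnitudes(dft2(circular_shift(f, 1, 2)))

# Largest difference between the two magnitude spectra (numerically zero).
max_diff = max(abs(p - q)
               for row_a, row_b in zip(mag, mag_shifted)
               for p, q in zip(row_a, row_b))
```

Only the phase of each coefficient changes under the shift, so `max_diff` is zero up to floating-point rounding. This is the property the Polar-Fourier functional relies on once a rotation has been turned into a circular translation of the angular coordinate.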

In (9), $i, j = 0, \dots, N-1$. Then, (7) becomes:

$$DFT_t(k, m) = \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \hat{f}_t(r_{ij}, \theta_{ij}) \exp(-\hat{j}(k r_{ij} + m \theta_{ij})) \tag{10}$$

and rotation is converted to a circular translation of $\theta$. Then, the first $K \times M$ harmonic amplitudes $|DFT_t(k, m)|$, where $k = 0, \dots, K-1$ and $m = 0, \dots, M-1$, are considered for each $\hat{f}_t(i, j)$. Since $t$ refers to each plane, which is described in 3D space by the couple $(\boldsymbol{\eta}, \rho)$, $|DFT_t(k, m)|$ can be denoted as $F1_{km}(\boldsymbol{\eta}, \rho)$ or $F1^{\rho}_{km}(\boldsymbol{\eta})$.

3.1.2 Krawtchouk Moments

Krawtchouk moments [20] are a set of moments formed by using Krawtchouk polynomials as the basis function set. The $n$th order classical Krawtchouk polynomials are defined as:

$$K_n(x; p, N) = \sum_{\kappa=0}^{N} a_{\kappa,n,p} x^{\kappa} = {}_2F_1\left(-n, -x; -N; \frac{1}{p}\right), \tag{11}$$

where $x, n \in \{0, 1, 2, \dots, N\}$, $N > 0$, $p \in (0, 1)$, and ${}_2F_1$ is the hypergeometric function defined as:

$${}_2F_1(a, b; c; z) = \sum_{\kappa=0}^{\infty} \frac{(a)_\kappa (b)_\kappa}{(c)_\kappa} \frac{z^{\kappa}}{\kappa!} \tag{12}$$

and $(a)_\kappa$ is the Pochhammer symbol.

Following the analysis described in [19], the rotation invariant Krawtchouk moments are computed for each $\hat{f}_t(i, j)$ with spatial dimension $N \times N$ by:

$$\tilde{Q}_{km} = [\rho(k)\rho(m)]^{-(1/2)} \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} a_{i,k,p_1} a_{j,m,p_2} \nu_{ij}, \tag{13}$$

where the coefficients $a_{\kappa,n,p}$ can be determined by (11) and $\rho(k)$, $\rho(m)$ can be calculated from the orthogonality condition [20]. It should be noted that, in our experiments, the parameters $p_1, p_2$ were set to 0.5 [20].

Referring to each plane $(\boldsymbol{\eta}, \rho)$, the rotation invariant Krawtchouk moments can be denoted as $F2_{km}(\boldsymbol{\eta}, \rho)$ or $F2^{\rho}_{km}(\boldsymbol{\eta})$.

3.2 Spherical Functionals T

Then, the following set of spherical functionals $T$ is applied to each $F_\rho(\boldsymbol{\eta})$ in order to produce the descriptor vector:

1. $T_1(\omega) = \max\{\omega(\boldsymbol{\eta}_j)\}$,
2. $T_2(\omega) = \sum_{j=1}^{N_s} |\omega'(\boldsymbol{\eta}_j)|$,
3. $T_3(\omega) = \sum_{j=1}^{N_s} \omega(\boldsymbol{\eta}_j)$,
4. $T_4(\omega) = \max\{\omega(\boldsymbol{\eta}_j)\} - \min\{\omega(\boldsymbol{\eta}_j)\}$,
5. The amplitudes of the first $L$ harmonics of the Spherical Fourier Transform (SFT),

where $j = 1, \dots, N_s$, $\omega(\boldsymbol{\eta}_j) = F_\rho(\boldsymbol{\eta}_j)$, $\omega'$ is its derivative, and $N_s = N_R / N_c$. Here $N_c$ is the total number of concentric spheres, $N_s$ is the total number of sampled points on a sphere $S_\rho$ with radius $\rho$, and $N_R$ is the total number of sampled points.

The fifth $T$ functional above is generated using spherical harmonics. Spherical harmonics are special functions on the sphere, generally denoted by $Y_{lm}(\boldsymbol{\eta})$, where $l \ge 0$ and $|m| \le l$ [22].

Since spherical harmonics form a complete orthonormal set on the unit sphere, a function $\varpi(\boldsymbol{\eta})$, parameterized by the spherical coordinates $\boldsymbol{\eta}$, can be expanded as an infinite Fourier series of spherical harmonics:

$$\varpi(\boldsymbol{\eta}_i) = \sum_{l=0}^{\infty} \sum_{m=-l}^{l} a_{lm} Y_{lm}(\boldsymbol{\eta}_i), \quad i = 1, \dots, N_s, \tag{14}$$

and the expansion coefficients $a_{lm}$ are then uniquely determined by:

$$a_{lm} = \frac{4\pi}{N_s} \sum_{i=1}^{N_s} \varpi(\boldsymbol{\eta}_i) Y_{lm}(\boldsymbol{\eta}_i). \tag{15}$$

In our case:

$$\varpi(\boldsymbol{\eta}) = \begin{cases} F1^{\rho}_{km}(\boldsymbol{\eta}) \\ F2^{\rho}_{km}(\boldsymbol{\eta}). \end{cases} \tag{16}$$

The expansions (14) are strictly convergent in the sense that the error of the expansion reduces monotonically as $l$ tends to infinity. Hence, the leading terms of the series are those with small values of $l$ and $m$, which implies that, upon truncating the series at a sufficiently large value of $l$, $L$, most of the detail of the function $\varpi(\boldsymbol{\eta})$ will be captured.

Further, if $\varpi(\boldsymbol{\eta})$ is rotated ($\varpi'(\boldsymbol{\eta})$ with expansion coefficients $a'_{lm}$), then, as is easily proven [22], the overall vector length of the $a'_{lm}$ coefficients with the same $l$ is preserved under rotation:

$$A_l^2 = \sum_{m} a'^{2}_{lm} = \sum_{m} a^{2}_{lm}, \tag{17}$$

where the quantities $A_l$ are known as the rotationally invariant shape descriptors. In the proposed method, for each $l$, the corresponding $A_l$ is a spherical functional $T$. Therefore, the total number of spherical functionals $T$ used is $L + 4$ for each concentric sphere.

3.3 Descriptor Extraction

3.3.1 Geometrical Descriptor Extraction

In order to avoid possible sampling errors caused by using the lines of latitude and longitude (since they are concentrated too much toward the poles), each concentric sphere is simulated by an icosahedron where each of the 20 main triangles is iteratively subdivided into $q$ equal parts to form subtriangles. The vertices of the subtriangles are the sampled points $B_t$. Their total number $N_s$, for each concentric sphere (icosahedron) $C_s$ with radius $\rho_s$, $s = 1, \dots, N_c$, where $N_c$ is the total number of concentric spheres, is easily seen to be:

$$N_s = 10 \cdot q^2 + 2. \tag{18}$$

Then, following the procedure described earlier, for each functional $F$, the descriptor vectors $D1_F(l_1) = T(F_t(\boldsymbol{\eta}_t))$ are produced, where $l_1 = 1, \dots, (L + 4) \cdot N_c$.

3.3.2 Structural Descriptor Extraction

Besides the geometric descriptor vectors, features that characterize the primary and secondary structure of a

protein are also extracted [16]. More specifically, concerning the primary structure, the ratio of the amino acids' occurrences relative to the total number of amino acids (20 descriptors), the hydrophobic amino acids ratio (one descriptor), and the ratio of the helix types' occurrences (10 descriptors) contained in a protein are calculated. Concerning the secondary structure, the number of Helices (one descriptor), Sheets (one descriptor), and Turns (one descriptor) contained in a protein are also calculated. These features are listed in Table 2. All the aforementioned information is included in each PDB file. A part of a PDB file is depicted in Fig. 5.

TABLE 2. Structural Features and Their Weights

Fig. 5. A PDB file.

The descriptor vector, $D2$, is then produced, with length 34. Thus, the length of the compound descriptor vector $D = D1 \cup D2$ is $N_c \cdot (L + 4) + 34$.

Our experiments presented in the sequel were performed using the values: $N_s = 2{,}562$, $N_c = 20$, $L = 26$, and $N = 64$, where $N$ is the number of sampled points for each dimension of each tangential plane $\Pi(\boldsymbol{\eta}, \rho)$. The total number of sampled points on each tangential plane is $N \times N$.

4 CLASSIFICATION

4.1 Matching Algorithm

Let $A$, $B$ be two 3D models (proteins). Also, let

$$D_A(k) = [D_{A1}(k_1), D_{A2}(k_1), D_{A3}(k_2)]^T,$$

$$D_B(k) = [D_{B1}(k_1), D_{B2}(k_1), D_{B3}(k_2)]^T$$

be two descriptor vectors, where $A1$, $B1$ denote the descriptor vectors extracted using the Polar-Fourier Transform, $A2$, $B2$ denote the descriptor vectors extracted using Krawtchouk moments, $A3$, $B3$ denote the descriptor vectors extracted taking into account the primary and secondary structure of each protein, $k_1 = N_c \cdot (L + 4)$, and $k_2 = 34$. The geometrical descriptors are compared in pairs using their L1-distance:

$$D1_{similarity} = \sqrt{\sum_{k_1=1}^{N_c (L+4)} |D_{A1}(k_1) - D_{B1}(k_1)|} \tag{19}$$
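The descriptor bookkeeping and the pairwise measure (19) can be illustrated with a short worked example. This is a sketch under stated assumptions: the function names are hypothetical, and the subdivision value $q = 16$ is not given in the text but is inferred here from $N_s = 2{,}562$ and (18).

```python
import math

# Parameter values reported for the experiments.
N_C = 20        # number of concentric spheres, N_c
L = 26          # number of spherical harmonics kept
N_STRUCT = 34   # length of the structural descriptor vector D2
Q = 16          # assumption: subdivision count inferred from N_s = 2562


def icosahedron_points(q):
    """Number of sampled points on one subdivided icosahedron, Eq. (18)."""
    return 10 * q * q + 2


def geometric_length(n_c, l):
    """Length of one geometric descriptor vector: (L + 4) values per sphere."""
    return n_c * (l + 4)


def d_similarity(d_a, d_b):
    """Pairwise measure of Eq. (19): square root of the summed absolute
    differences between two descriptor vectors of equal length."""
    if len(d_a) != len(d_b):
        raise ValueError("descriptor vectors must have the same length")
    return math.sqrt(sum(abs(x - y) for x, y in zip(d_a, d_b)))


n_s = icosahedron_points(Q)                  # sampled points per sphere
k1 = geometric_length(N_C, L)                # geometric descriptor length
compound_length = k1 + N_STRUCT              # N_c * (L + 4) + 34
example = d_similarity([1.0, 2.0, 3.0], [1.0, 0.0, 1.0])
```

With these values each geometric descriptor vector has 600 entries and the compound vector has 634, and identical descriptor vectors give a measure of zero.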



and

$$D2_{\text{similarity}} = \sqrt{\sum_{k_2=1}^{N_c (L+4)} \left| D^{A}_{2}(k_2) - D^{B}_{2}(k_2) \right|}. \tag{20}$$

The overall geometrical similarity measure is determined by:

$$DG_{\text{similarity}} = a_1 \cdot D1_{\text{similarity}} + a_2 \cdot D2_{\text{similarity}}, \tag{21}$$

where $a_1$, $a_2$ are descriptor vector percentage factors, which are calculated as follows: Let us assume that $A$ belongs to a class $C$, which contains $N_C$ models. Also let $N_{total}$ be the total number of models contained in the database. Then, the factor $a_1$ is calculated as:

$$a_1 = \frac{\sum_{i=1}^{N_C} d_i}{\sum_{j=1}^{N_{total} - N_C} d_j}, \tag{22}$$

where $d_i$ is the L1-distance of the descriptor vector $D^{A}_{1}$ of each model $A$ from the descriptor vector $D^{A'}_{1}$ of a model $A'$ which also belongs to $C$, and $d_j$ is the L1-distance of the descriptor vector $D^{A}_{1}$ of the model $A$ from the descriptor vector $D^{A''}_{1}$ of a model $A''$ which does not belong to $C$. Descriptor vectors $D^{A}_{1}$ with small values of $d_i$ and large values of $d_j$ are clearly appropriate for class $C$, in terms of successfully retrieved results. The percentage factor $a_2$ is calculated similarly, taking into account the descriptor vector $D^{A}_{2}$. Then, $a_1$ and $a_2$ are normalized so that $1/a_1 + 1/a_2 = 100$.

Following the above approach, the discriminant power of each descriptor vector per different class is taken into account.

The structural similarity is evaluated using:

$$DS_{\text{similarity}} = \sqrt{\sum_{k_2=1}^{34} \left| D^{A}_{3}(k_2) - D^{B}_{3}(k_2) \right|}. \tag{23}$$

The overall similarity measure is determined by:

$$D_{\text{similarity}} = b_1 \cdot DG_{\text{similarity}} + b_2 \cdot DS_{\text{similarity}}. \tag{24}$$

The weights assigned to the different kinds of descriptors are $b_1 = 90\%$ for the geometrical descriptors and $b_2 = 10\%$ for the structural descriptors. The weight allocation for (24) is listed in Table 2.

4.2 Classification Methods

In order to evaluate the classification accuracy of the proposed method, three classification schemes were used. A description of these schemes is given below.

Let $D_i = [D_i(1), \ldots, D_i(N_d)]$ be a compound descriptor vector, where $i = 1, \ldots, N_{total}$. $N_{total}$ is the total number of proteins and $N_d$ is the total number of descriptors per descriptor vector ($N_d = N_c \cdot (L+4) + 34$). Also, let $C$ be a class with descriptor vectors:

$$M_C = \begin{bmatrix} D_1(1) & \cdots & D_1(k) & \cdots & D_1(S) \\ \vdots & & \vdots & & \vdots \\ D_i(1) & \cdots & D_i(k) & \cdots & D_i(S) \\ \vdots & & \vdots & & \vdots \\ D_{N_C}(1) & \cdots & D_{N_C}(k) & \cdots & D_{N_C}(S) \end{bmatrix},$$

where $N_C$ is the number of 3D models which belong to class $C$. Then, the feature vectors $f_{C1}, \ldots, f_{Ck}, \ldots, f_{CS}$ are formed, where $C = 1, \ldots, N_{class}$, $f_{Ck} = [D_1(k) \ldots D_i(k) \ldots D_{N_C}(k)]^T$, and $N_{class}$ is the total number of classes. For each $f_{Ck}$, the mean,

$$\mu_{f_{Ck}} = \frac{1}{N_C} \sum_{i=1}^{N_C} D_i(k), \tag{25}$$

and the variance,

$$\sigma^2_{f_{Ck}} = \frac{1}{N_C} \sum_{i=1}^{N_C} (D_i(k))^2 - (\mu_{f_{Ck}})^2, \tag{26}$$

are calculated. Finally, let $U = [U(1), \ldots, U(N_d)]$ be a descriptor vector of an unclassified protein $U$.

4.2.1 Euclidean Distance Measure

The first metric of “similarity” is based on the Euclidean distance between the descriptor vectors, which is defined as:

$$M_1(D, U) = \left[ \sum_{j=1}^{N_d} (D(j) - U(j))^2 \right]^{1/2}. \tag{27}$$

For an unclassified $U$, the pairwise Euclidean distances $M_1(D_i, U)$, $i = 1, 2, \ldots, N_{total}$, are rank ordered and $U$ is assigned to the class corresponding to the minimum distance.

4.2.2 Mean Euclidean Distance Measure

As a second metric, the Euclidean distances between the mean feature vector $\bar{X}_{C_i}$ of a class and an unclassified vector $U$ are used:

$$M_2(\bar{X}, U) = \left[ \sum_{j=1}^{N_d} (\bar{X}_{C_i}(j) - U(j))^2 \right]^{1/2}. \tag{28}$$

As before, the pairwise Euclidean distances $M_2(\bar{X}_{C_i}, U)$, $i = 1, 2, \ldots, N_{class}$, are rank ordered and the class with the minimum distance to $U$ is chosen.

4.2.3 Naive Bayesian Classifier

For each class $C_i$, $i = 1, \ldots, N_{class}$, the mean $\bar{X}_{C_i}(j)$ and the standard deviation $\sigma_{C_i}$ are calculated for each feature vector $f_{C_i j}$. For each descriptor $U(j)$ of the unclassified protein $U$, the validity of the following inequality is tested:

$$\bar{X}_{C_i}(j) - a \cdot \sigma_{C_i} \le U(j) \le \bar{X}_{C_i}(j) + a \cdot \sigma_{C_i}, \tag{29}$$

where $a \in [3, 4]$. For each class $C_i$, the following measure is calculated:

$$B(C_i) = \sum_{j=1}^{N_d} w_{U(j)}, \tag{30}$$

where $w_{U(j)} = 1$ when $U(j)$ satisfies (29) and $w_{U(j)} = 0$ otherwise. $U$ is assigned to the class $C_i$ with the maximum $B(C_i)$.

5 EXPERIMENTAL RESULTS

In order to evaluate the performance of the proposed method, a portion of the FSSP database [23] was used.
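The three classification schemes of Section 4.2, which the experiments evaluate, can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, the standard deviation is taken per descriptor, the value a = 3.0 is simply one point of the stated range [3, 4], and the toy vectors are invented.

```python
import math
from collections import defaultdict

def euclidean(d, u):
    """Euclidean distance between two descriptor vectors, as in (27)."""
    return math.sqrt(sum((dj - uj) ** 2 for dj, uj in zip(d, u)))

def class_stats(vectors, labels):
    """Per-class, per-descriptor mean and standard deviation, following (25)-(26)."""
    groups = defaultdict(list)
    for v, c in zip(vectors, labels):
        groups[c].append(v)
    stats = {}
    for c, rows in groups.items():
        n = len(rows)
        cols = list(zip(*rows))
        mean = [sum(col) / n for col in cols]
        # Variance as mean of squares minus squared mean; clamp rounding noise at 0.
        var = [max(sum(x * x for x in col) / n - m * m, 0.0)
               for col, m in zip(cols, mean)]
        stats[c] = (mean, [math.sqrt(v) for v in var])
    return stats

def classify_nearest(vectors, labels, u):
    """Section 4.2.1: label of the stored vector closest to u."""
    best = min(range(len(vectors)), key=lambda i: euclidean(vectors[i], u))
    return labels[best]

def classify_mean(stats, u):
    """Section 4.2.2: class whose mean vector is closest to u."""
    return min(stats, key=lambda c: euclidean(stats[c][0], u))

def classify_interval_vote(stats, u, a=3.0):
    """Section 4.2.3: class with most descriptors inside mean +/- a*std, (29)-(30)."""
    def votes(c):
        mean, std = stats[c]
        return sum(1 for m, s, uj in zip(mean, std, u)
                   if m - a * s <= uj <= m + a * s)
    return max(stats, key=votes)

# Invented toy data: two classes in a two-descriptor space.
vectors = [[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9]]
labels = ["A", "A", "B", "B"]
stats = class_stats(vectors, labels)
query = [0.15, 0.05]
print(classify_nearest(vectors, labels, query),
      classify_mean(stats, query),
      classify_interval_vote(stats, query))
```

The first scheme needs one distance evaluation per stored protein, whereas the other two need only one evaluation per class once the class statistics are stored, which is the source of the speed-up discussed in Sections 5.2 and 5.3.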

This database was constructed according to the DALI algorithm [6], [7] and consists of 3,732 proteins classified into 30 classes (Table 3). Care was taken to include classes with different cardinalities, varying from 2 to 561 proteins. In order to get reliable results, the 3,732 proteins were randomly selected. The database can be downloaded from: ftp://ftp.iti.gr/pub/incoming/proteins.zip.

TABLE 3
Protein Classes Used as Ground Truth Database

The performance of the method was evaluated in terms of overall classification accuracy [15]. More specifically, for each molecule in the database, one of the three classification methods described above is applied after removing that element from the database (“leave-one-out” experiment). A class label is then assigned to the query protein as the output of the classification method. The overall classification accuracy is the percentage of the correctly predicted class labels among all 3,732 proteins of the database and is given by:

$$\text{Overall Classification Accuracy} = \frac{\text{Number of correctly predicted proteins}}{\text{Total number of proteins in the database}}. \tag{31}$$

The overall classification accuracy can also be derived from the confusion matrix, which is widely used in classification problems [32]. The overall classification accuracy is the sum of the diagonal elements of the confusion matrix divided by the total number of classified objects.

Let $FT_{km}$ and $Kraw_{km}$ be the descriptor vectors produced after applying the spherical functionals $T$ to the initial functionals $F^{1}_{km}(\cdot)$ and $F^{2}_{km}(\cdot)$, respectively. All of the produced descriptor vectors were tested experimentally in terms of overall classification accuracy. However, only the following achieved high classification accuracy and are reported in this section:

$$FT = \{FT_{00}, FT_{01}, FT_{10}, FT_{02}\}$$

and

$$K = \{Kraw_{00}, Kraw_{01}, Kraw_{02}\}.$$

5.1 Evaluation of Overall Classification Accuracy Using the Euclidean Distance Measure

First, the simplest method was evaluated, which relies on the Euclidean Distance measure. The overall classification accuracy results were very satisfactory (Fig. 6 and Table 4). As seen in Fig. 6, the use of vectors Kraw00 and FT02 was found to be optimal since the percentage accuracy achieved was 98.9 percent (Fig. 6, last column). The time needed for the extraction of the descriptor vectors of the Initial Functionals used is shown in Table 4.

In addition to the geometrical descriptors, structural descriptors are extracted as well (Table 2), which refer to the proteins’ primary and secondary structure elements. The percentage of geometrical and structural features in the integrated descriptor vector was experimentally selected to be 90 percent and 10 percent, respectively.

Fig. 6. Overall classification accuracy using only geometrical characteristics with the Euclidean Distance Measure method.
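The leave-one-out protocol and the 90/10 weighting can be illustrated with a short sketch. It is a simplified reading of (24) and (31), not the authors' code: the function names are hypothetical, each protein is assumed to carry a single geometrical vector and a single structural vector (so the per-class factors $a_1$, $a_2$ of (21) are left out), and the five toy proteins are invented.

```python
import math

def l1_root(x, y):
    """Square root of the summed absolute differences, the form used in (19), (20), (23)."""
    return math.sqrt(sum(abs(a - b) for a, b in zip(x, y)))

def overall_dissimilarity(p, q, b1=0.9, b2=0.1):
    """Weighted sum of the geometrical and structural terms, as in (24).

    p and q are (geometrical_vector, structural_vector) pairs.
    """
    return b1 * l1_root(p[0], q[0]) + b2 * l1_root(p[1], q[1])

def leave_one_out_accuracy(items, labels, dissimilarity):
    """Classify each item by its nearest neighbour among all others, then apply (31)."""
    correct = 0
    for i, query in enumerate(items):
        others = [j for j in range(len(items)) if j != i]
        nearest = min(others, key=lambda j: dissimilarity(query, items[j]))
        correct += labels[nearest] == labels[i]
    return correct / len(items)

# Invented toy database: (geometrical vector, structural vector) per protein.
items = [
    ([0.0, 0.1], [1.0]),
    ([0.1, 0.0], [1.0]),
    ([0.9, 1.0], [3.0]),
    ([1.0, 1.0], [3.0]),
    ([0.5, 0.4], [1.0]),  # labelled "B" but lies closer to the "A" proteins
]
labels = ["A", "A", "B", "B", "B"]
print(leave_one_out_accuracy(items, labels, overall_dissimilarity))
```

In this toy case the last protein is assigned to the wrong class, so four of the five leave-one-out predictions are correct.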

Combining the geometrical and structural descriptors in this proportion significantly increases the overall classification accuracy (Fig. 7).

TABLE 4
Extraction Time Using Different Initial Functionals and All Spherical Functionals

Fig. 7. Overall classification accuracy using geometrical and structural characteristics with the Euclidean Distance Measure method.

TABLE 5
The Times Needed for the Computation of the Overall Classification Accuracy Using Geometrical and Structural Characteristics with the Euclidean Distance Measure Method

The times needed for the computation of the overall classification accuracy for the entire database are shown in Table 5. These include the comparison of each query protein descriptor vector to all (3,731) descriptor vectors (all-to-all comparison). In other words, the time needed for approximately $3{,}731^2$ comparisons is 395 sec if the “Kraw00&FT02&Struct” descriptor vector is used. This is very satisfactory if we consider that the Dali algorithm requires an entire day for an all-to-all comparison of all 385 representatives of the FSSP database [29].

The time needed for the complete preprocessing procedure, from the creation of the 3D structure up to the final normalization step, is approximately 3 min. Although this procedure, for a large database with thousands of proteins, may last for days, it takes place only once and the descriptor vectors are stored in the database along with the corresponding 3D structures.

Fig. 8. Missed proteins using the Euclidean distance method. The query proteins are depicted in the first column. The second column shows the nearest neighbors, which were retrieved using the proposed method but do not belong to the same class with the query, according to the FSSP/DALI classification. The third column shows the proteins closer to the query that do belong to the same class according to the FSSP/DALI classification. It is obvious that the visual similarity between the proteins of columns 1 and 2 is greater than the similarity between the proteins of columns 1 and 3.

The FSSP/DALI database has been constructed based in part on the premise that proteins with at least 25 percent similarity in their amino acid sequence should belong to the same class even if dissimilar geometrically. Since we do not use this criterion, we do not achieve 100 percent classification accuracy. In fact, the best overall classification accuracy achieved using the proposed method (Fig. 7, column 6) is 99.62 percent. In other words, 14 out of 3,732 proteins are misclassified. Further analysis of the misclassified proteins showed that the proposed method, which is mainly based on geometrical features (90 percent) rather than structural features (10 percent), classifies the 3D proteins differently when compared to the DALI algorithm.
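The reported figures can be cross-checked with a few lines of arithmetic. The sketch below only divides the totals quoted in the text (3,732 proteins, 14 misclassified, 395 sec for about $3{,}731^2$ comparisons); the per-comparison time is therefore a rough average that includes any overhead, and the variable names are chosen here for illustration.

```python
total_proteins = 3732      # database size reported in the text
misclassified = 14         # misclassified proteins reported in the text
accuracy = 100.0 * (total_proteins - misclassified) / total_proteins

comparisons = 3731 ** 2    # "approximately 3,731^2 comparisons"
seconds = 395.0            # reported time for the Kraw00&FT02&Struct vector
microseconds_per_comparison = 1e6 * seconds / comparisons

print(f"overall classification accuracy: {accuracy:.2f} percent")
print(f"average time per comparison: {microseconds_per_comparison:.1f} microseconds")
```

The first value agrees with the 99.62 percent quoted above, and the second suggests that a single descriptor-vector comparison costs on the order of tens of microseconds on the authors' hardware.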

However, there is no clear answer as to which method is “more” correct. Fig. 8 depicts five missed proteins (column 1), their nearest neighbors using the proposed method (column 2), and the proteins closest to the query that belong to the same class according to the FSSP classification (column 3). The structures in the first column are seen to be geometrically far more similar to those in the second column than to those in the third.

A more detailed view of the classification results demonstrates the high performance of the method in application to both small and large classes. In order to evaluate the classification performance of each class, the measures of Classification Precision ($C_{Pre}$), Classification Recall ($C_{Rec}$), and Classification Accuracy ($C_{Acc}$) were used [31]. These are given by the following equations:

$$C_{Pre} = \frac{TP}{TP + FP}, \tag{32}$$

$$C_{Rec} = \frac{TP}{TP + FN}, \tag{33}$$

$$C_{Acc} = \frac{TP + TN}{TP + FP + FN + TN}, \tag{34}$$

where:

- TP: The number of correctly included (True Positive) class objects.
- FP: The number of incorrectly included (False Positive) objects.
- TN: The number of correctly excluded (True Negative) objects.
- FN: The number of incorrectly excluded (False Negative) objects.

The values of TP, FP, FN, and TN, along with the values of $C_{Pre}$, $C_{Rec}$, $C_{Acc}$ for each class, when the “Kraw00&FT02&Struct” descriptor vector is used, are presented in Table 6.

Table 6 illustrates the effectiveness of the proposed method, showing its high performance in terms of Classification Precision, Classification Recall, and Classification Accuracy for each class.

As the protein database increases, the time needed for a one-to-all comparison and classification of an unknown protein increases dramatically. For such use, other faster classification methods, based on statistical features extraction, were evaluated. A detailed description of these methods was given in Section 4.

5.2 Evaluation of Overall Classification Accuracy Using the Mean Euclidean Distance Measure

In Fig. 9 and in Table 7, the results of the Mean Euclidean Distance method are presented: The first two columns depict the overall classification accuracy of the method with all classes included, with (Kraw00&FT02&Struct All column) or without (Kraw00&FT02 All column) structural features.

Fig. 9. Overall classification accuracy using geometrical and structural characteristics with the Mean Euclidean Distance Measure method.

TABLE 6
Classification Precision, Classification Recall, and Classification Accuracy for Each Class Using the “Kraw00&FT02&Struct” Descriptor Vector
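The per-class measures (32)-(34) reported in Table 6, and the overall accuracy read off the diagonal of the confusion matrix, can be computed as sketched below. The function names and the 3-class confusion matrix are invented for illustration; rows are assumed to hold the true class and columns the predicted class.

```python
def per_class_measures(confusion):
    """Return (precision, recall, accuracy) per class, following (32)-(34).

    confusion[t][p] is the number of objects of true class t predicted as class p.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    results = []
    for c in range(n):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                        # class-c objects assigned elsewhere
        fp = sum(confusion[t][c] for t in range(n)) - tp   # other objects assigned to c
        tn = total - tp - fn - fp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        accuracy = (tp + tn) / total
        results.append((precision, recall, accuracy))
    return results

def overall_accuracy(confusion):
    """Sum of the diagonal elements divided by the number of classified objects."""
    total = sum(sum(row) for row in confusion)
    return sum(confusion[i][i] for i in range(len(confusion))) / total

# Invented confusion matrix for three classes.
confusion = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 0, 5],
]
for c, (pre, rec, acc) in enumerate(per_class_measures(confusion)):
    print(c, round(pre, 3), round(rec, 3), round(acc, 3))
print(overall_accuracy(confusion))
```

Because TN counts every object outside the class, the per-class accuracy of (34) stays high for small classes even when precision or recall is modest, which is worth keeping in mind when reading a table of this kind.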

The next four columns of Fig. 9 present the results when the Mean Euclidean Distance method is applied only to classes with a relatively large number of proteins. The class that best fits the query protein is then included in the Euclidean Distance algorithm, which is applied to the remaining small classes. The key reason for this fused algorithm selection is that statistical measures are more reliable when applied to large classes (over 50 or 100 proteins) since the higher the number of proteins in a class, the more reliable the statistical measures. In the third and fourth columns, the Mean Euclidean method is applied to classes with a number of proteins larger than 50, while, in the last two columns, the number of proteins is larger than 100. Experiments showed that the overall classification accuracy in large classes with more than 100 proteins is very satisfactory, while the time needed for the classification procedure is four times shorter than that of the Euclidean Distance method.

TABLE 7
The Times Needed for the Computation of the Overall Classification Accuracy with the Mean Euclidean Distance Measure Method

5.3 Evaluation of Overall Classification Accuracy Using the Naive Bayesian Classifier

Finally, similar experiments, based on the Naive Bayesian Classifier (Section 4.2.3), were performed. The results are presented in Fig. 10 and in Table 8. It is obvious that, like the previous method, the Naive Bayesian Classifier achieves satisfactory classification results as well as low computational complexity without, however, outperforming the methods presented in the previous paragraphs.

Fig. 10. Overall classification accuracy using geometrical and structural characteristics with the Naive Bayesian Classifier.

TABLE 8
The Times Needed for the Computation of the Overall Classification Accuracy with the Naive Bayesian Classifier

5.4 Evaluation of Information Retrieval Performance

Apart from the classification performance, the efficiency of the proposed shape comparison method was evaluated in terms of information retrieval performance. In this case, each model of the database is used as query and the retrieved proteins are ranked in terms of shape similarity to the query. For the presentation of the results, the Information Retrieval Precision-Recall curve was used, where precision is the proportion of the retrieved models that are relevant to the query and recall is the proportion of relevant models in the entire database that are retrieved as a result of the query.

Fig. 11. Precision-recall curve for the geometrical descriptor vectors.

More precisely, precision and recall are defined as:

$$\text{Precision} = \frac{N_{detection}}{N_{detection} + N_{false}}, \tag{35}$$

$$\text{Recall} = \frac{N_{detection}}{N_{detection} + N_{miss}}, \tag{36}$$

where:

- $N_{detection}$ = number of relevant models retrieved,
- $N_{false}$ = number of irrelevant models retrieved,
- $N_{miss}$ = number of relevant models not retrieved.

Fig. 11 depicts the Information Retrieval Precision-Recall curve for all geometrical descriptor vectors used.

5.5 Comparison with Existing Methods

It must be emphasized that the goal of the proposed method is not to introduce a new classification scheme, but to provide a fast geometric filtering so as to achieve a first quick classification of a new protein sequence. Thus, comparison with classification schemes, such as DALI, SCOP, CATH, etc., or with methods that focus on finding biologically relevant sequence similarities, such as BLAST, PSI-BLAST [34], etc., is clearly not meaningful. However, the comparison with the methods presented in [16], [15], which are also based on the geometrical similarity of proteins, is fully meaningful and is presented below.

First, the proposed method is compared with the method of [16] in terms of retrieval performance. In [16], three classes are chosen from the Dali server, which are listed in Table 9. Then, the “precision versus recall” is calculated for each class.

TABLE 9
Protein Classes to Be Compared

Fig. 12a depicts the Information Retrieval Precision-Recall curve of the three classes by using Kraw00&FT02 descriptors. In the next three diagrams, the precision-recall curve of each class is compared with the respective curve of the method presented in [16]. It can be inferred that the proposed method demonstrates a slight improvement in the last values of recall, while it retains high performance in the first values of recall.

The proposed method is also compared with the one presented in [15] in terms of overall classification accuracy. Since the experiments in [15] were conducted on a different set of protein structures, an extra effort in developing this method for our protein data set was required. The results are presented in Fig. 13, where it is obvious that the proposed method outperforms the one presented in [15] when applied to single domain chains. For multidomain proteins, however, the experimental results are inconclusive.

Fig. 12. (a) Precision-recall curve of classes 1a6m, 1l92, and 2cba by using Kraw00&FT02 descriptors. (b), (c), and (d) Comparison of precision-recall curve for each class with the method presented in [16].
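A precision-recall curve of the kind shown in Figs. 11 and 12 can be traced by evaluating (35) and (36) after each position of a ranked result list. The sketch below is a minimal illustration under assumptions: the function name and the toy ranking are invented, relevance is taken to mean "same class as the query", and the number of relevant models is assumed to be known and nonzero.

```python
def precision_recall_curve(ranked_labels, query_label, n_relevant):
    """Precision (35) and recall (36) after each position of a ranked result list.

    ranked_labels: class labels of the retrieved models, most similar first.
    n_relevant: number of models in the database relevant to the query (> 0).
    """
    curve = []
    n_detection = 0
    for k, label in enumerate(ranked_labels, start=1):
        if label == query_label:
            n_detection += 1
        n_false = k - n_detection            # irrelevant models retrieved so far
        n_miss = n_relevant - n_detection    # relevant models not yet retrieved
        precision = n_detection / (n_detection + n_false)
        recall = n_detection / (n_detection + n_miss)
        curve.append((precision, recall))
    return curve

# Invented ranking for a query of class "A" with three relevant models.
ranked = ["A", "A", "B", "A", "B"]
curve = precision_recall_curve(ranked, "A", n_relevant=3)
for precision, recall in curve:
    print(round(precision, 3), round(recall, 3))
```

Averaging such curves over all queries of a class gives one curve per class, which is presumably how per-class plots like those of Fig. 12 are obtained; the source does not spell out the averaging step.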

Fig. 13. Comparison of the proposed method with the one presented in [15] in terms of overall classification accuracy.

6 CONCLUSIONS

In this paper, a novel approach for the comparison of 3D protein structures is proposed. The approach consists of an offline and an online step. In the offline step, the protein, which is taken from a PDB file, is preprocessed in terms of visualization and triangulation. Next, the protein is translated, scaled, and voxelized. A set of functionals is applied to the volume of the 3D structure, producing a new domain of concentric spheres. In this domain, a new set of functionals is applied, resulting in a completely rotation invariant descriptor vector. Additionally, descriptor vectors which correspond to the protein’s primary and secondary structure are extracted as well. All these descriptor vectors are stored, along with the corresponding proteins. In the online step, a classification algorithm is followed for the descriptor vectors.

Experiments were performed evaluating the efficiency of the proposed method using as ground truth a portion of the FSSP/DALI database, in terms of overall classification accuracy and precision-recall. The proposed method, far less complex than the DALI algorithm, was seen to produce results very close to the ground truth when applied to single domain chains. For multidomain proteins, however, the experimental results are inconclusive.

ACKNOWLEDGMENTS

This work was supported by the ALTAB23D project funded by the Greek Secretariat of Research and Technology and by the SIMILAR, CATER, and 3DTV EC IST projects.

REFERENCES

[1] H.M. Berman, J. Westbrook, Z. Feng, G. Gilliland, T.N. Bhat, H. Weissig, I.N. Shindyalov, and P.E. Bourne, “The Protein Data Bank,” Nucleic Acids Research, vol. 28, pp. 235-242, 2000.
[2] J.L. Sussman, D. Ling, J. Jiang, N.O. Manning, J. Prilusky, O. Ritter, and E.E. Abola, Acta Crystallogr., vol. 54, pp. 1078-1084, 1998.
[3] C.B. Anfinsen, “Principles that Govern the Folding of Protein Chains,” Science, vol. 181, pp. 223-230, 1973.
[4] A.G. Murzin, S.E. Brenner, T. Hubbard, and C. Chothia, “Scop: A Structural Classification of Proteins Database for the Investigation of Sequences and Structures,” J. Molecular Biology, vol. 247, pp. 536-540, 1995.
[5] C.A. Orengo, A.D. Michie, D.T. Jones, M.B. Swindells, and J.M. Thornton, “CATH - A Hierarchic Classification of Protein Domain Structures,” Structure, vol. 5, no. 8, pp. 1093-1108, 1997.
[6] L. Holm and C. Sander, “The FSSP Database: Fold Classification Based on Structure-Structure Alignment of Proteins,” Nucleic Acids Research, vol. 24, pp. 206-210, 1996.
[7] L. Holm and C. Sander, “Touring Protein Fold Space with Dali/FSSP,” Nucleic Acids Research, vol. 26, pp. 316-319, 1998.
[8] The European Bioinformatics Institute, http://www.ebi.ac.uk/, 2006.
[9] A. Bairoch and R. Apweiler, “The SWISS-PROT Protein Sequence Databank and Its Supplement TrEMBL in 1998,” Nucleic Acids Research, vol. 26, pp. 38-42, 1998.
[10] L. Falquet, M. Pagni, P. Bucher, N. Hulo, C.J. Sigrist, K. Hofmann, and A. Bairoch, “The PROSITE Database, Its Status in 2002,” Nucleic Acids Research, vol. 30, pp. 235-238, 2002.
[11] http://www.expasy.ch/prosite/, 2006.
[12] http://www.rcsb.org, 2006.
[13] F. Psomopoulos, S. Diplaris, and P.A. Mitkas, “A Finite State Automata Based Technique for Protein Classification Rules Induction,” Proc. Second European Workshop Data Mining and Text Mining in Bioinformatics, 2004.
[14] W.N. Grundy, T.L. Bailey, C.P. Elkan, and M.E. Baker, “Meta-MEME: Motif-Based Hidden Markov Models of Protein Families,” IEEE Trans. Computational and Applied Bioscience, vol. 13, no. 4, pp. 397-406, Aug. 1997.
[15] M. Ankerst, G. Kastenmuller, H.P. Kriegel, and T. Seidl, “Nearest Neighbor Classification in 3D Protein Databases,” Proc. Seventh Int’l Conf. Intelligent Systems for Molecular Biology (ISMB ’99), 1999.
[16] C. Zhang and T. Chen, “Retrieval of 3D Protein Structures,” Proc. Int’l Conf. Image Processing, Sept. 2002.
[17] C. Zhang and T. Chen, “Efficient Feature Extraction for 2D/3D Objects in Mesh Representation,” Proc. Int’l Conf. Image Processing, vol. 3, pp. 935-938, Oct. 2001.
[18] C. Guerra, S. Lonardi, and G. Zanotti, “Analysis of Secondary Structure Elements of Proteins Using Indexing Techniques,” Proc. First Int’l Symp. 3D Data Processing Visualization and Transmission (3DPVT ’02), 2002.
[19] D. Zarpalas, P. Daras, D. Tzovaras, and M.G. Strintzis, “3D Model Search and Retrieval Using the Spherical Trace Transform,” IEEE Trans. Multimedia, submitted.
[20] P.T. Yap, R. Paramesran, and S.H. Ong, “Image Analysis by Krawtchouk Moments,” IEEE Trans. Image Processing, vol. 12, no. 11, pp. 1367-1377, Nov. 2003.
[21] M.K. Hu, “Visual Pattern Recognition by Moment Invariants,” IRE Trans. Information Theory, vol. 8, pp. 179-197, 1962.
[22] D.W. Ritchie, “Parametric Protein Shape Recognition,” PhD thesis, Univ. of Aberdeen, 1998.
[23] http://www.bioinfo.biocenter.helsinki.fi:8080/dali/index.html, 2006.
[24] P. Koehl, “Protein Structure Similarities,” Current Opinion in Structural Biology, vol. 11, no. 3, pp. 348-353, June 2001.
[25] I.-G. Choi, J. Kwon, and S.-H. Kim, “Local Feature Frequency Profile: A Method to Measure Structural Similarity in Proteins,” Proc. Nat’l Academy of Science, vol. 101, no. 11, pp. 3797-3802, Mar. 2004.
[26] S. Cheek, Y. Qi, S. SriKrishna, L.N. Kinch, and N.V. Grishin, “SCOPmap: Automated Assignment of Protein Structures to Evolutionary Superfamilies,” BMC Bioinformatics, vol. 5, p. 197, 2004.
[27] J. Huan, W. Wang, A. Washington, J. Prins, R. Shah, and A. Tropsha, “Accurate Classification of Protein Structural Families Using Coherent Subgraph Analysis,” Proc. Pacific Symp. Biocomputing (PSB), 2004.
[28] A. Dubey, S. Hwang, C. Rangel, C.E. Rasmussen, Z. Ghahramani, and D.L. Wild, “Clustering Protein Sequence and Structure Space with Infinite Gaussian Mixture Models,” Proc. Pacific Symp. Biocomputing, 2004.
[29] L. Holm and C. Sander, “3-D Lookup: Fast Protein Structure Database Searches at 90% Reliability,” Proc. Third Int’l Conf. Intelligent Systems for Molecular Biology (ISMB), pp. 179-187, 1995.
[30] S. Dua and N. Kandiraju, “A Novel Computational Framework for Structural Classification of Proteins Using Local Geometric Parameter Matching,” Proc. 2004 IEEE Computational Systems Bioinformatics Conf. (CSB 2004), pp. 710-711, 2004.
[31] Y. Sun, M. Robinson, R. Adams, A.G. Rust, P. Kaye, and N. Davey, “Integrating Binding Site Predictions Using Meta Classification Methods,” Proc. Seventh Int’l Conf. Adaptive and Natural Computing Algorithms (ICANNGA 2005), Mar. 2005.

[32] S. Tiwari and S. Gallager, “Machine Learning and Multiscale Methods in the Identification of Bivalve Larvae,” Proc. Ninth IEEE Int’l Conf. Computer Vision (ICCV 2003), pp. 13-16, Oct. 2003.
[33] P. Daras, D. Zarpalas, D. Tzovaras, and M.G. Strintzis, “3D Model Search and Retrieval Based on the Spherical Trace Transform,” Proc. IEEE Int’l Workshop Multimedia Signal Processing (MMSP), 2004.
[34] S.F. Altschul, T.L. Madden, A.A. Schaffer, J. Zhang, Z. Zhang, W. Miller, and D.J. Lipman, “Gapped Blast and PSI-Blast: A New Generation of Protein Database Search Programs,” Nucleic Acids Research, vol. 25, no. 17, pp. 3389-3402, 1997.

Petros Daras received the Diploma in electrical and computer engineering, the MSc degree in medical informatics, and the PhD degree in electrical and computer engineering from the Aristotle University of Thessaloniki, Greece, in 1999, 2002, and 2005, respectively. He is an associate researcher at the Informatics and Telematics Institute. His main research interests include computer vision, search and retrieval of 3D objects, the MPEG-4 standard, peer-to-peer technologies, and medical informatics. He has been involved in more than 10 European and National research projects. Dr. Daras is a member of the Technical Chamber of Greece.

Dimitrios Zarpalas received the Diploma in electrical and computer engineering from the Aristotle University of Thessaloniki, Greece, in 2003. He is an associate researcher at the Informatics and Telematics Institute. His main research interests include search and retrieval of 3D objects and medical image processing. He is a member of the Technical Chamber of Greece.

Apostolos Axenopoulos received the Diploma in electrical and computer engineering from the Aristotle University of Thessaloniki, Greece, in 2003. Currently, he is pursuing the MSc degree in advanced computing systems at the Aristotle University of Thessaloniki. He is an associate researcher at the Informatics and Telematics Institute. His main research interests include 3D content-based search and retrieval. He is a member of the Technical Chamber of Greece.

Dimitrios Tzovaras received the Diploma in electrical engineering and the PhD degree in 2D and 3D image compression from Aristotle University of Thessaloniki, Thessaloniki, Greece, in 1992 and 1997, respectively. He is a senior researcher in the Informatics and Telematics Institute of Thessaloniki. Prior to his current position, he was a senior researcher on 3D imaging at the Aristotle University of Thessaloniki. His main research interests include virtual reality, assistive technologies, 3D data processing, medical image communication, 3D motion estimation, and stereo and multiview image sequence coding. His involvement with those research areas has led to the coauthoring of more than 35 papers in refereed journals and more than 80 papers in international conferences. He has served as a regular reviewer for a number of international journals and conferences. Since 1992, he has been involved in more than 40 projects in Greece funded by the EC and the Greek Secretariat of Research and Technology. He is an associate editor of the EURASIP Journal of Applied Signal Processing and a member of the Technical Chamber of Greece.

Michael Gerassimos Strintzis (M’70-SM’80-F’04) received the Diploma in electrical engineering from the National Technical University of Athens, Athens, Greece, in 1967, and the MA and PhD degrees in electrical engineering from Princeton University, Princeton, New Jersey, in 1969 and 1970, respectively. He then joined the Electrical Engineering Department at the University of Pittsburgh, where he served as an assistant professor (1970-1976) and an associate professor (1976-1980). Since 1980, he has been a professor of electrical and computer engineering at the University of Thessaloniki, Thessaloniki, Greece, and, since 1999, director of the Informatics and Telematics Research Institute, Thessaloniki. His current research interests include 2D and 3D image coding, image processing, biomedical signal and image processing, and DVD and Internet data authentication and copy protection. Dr. Strintzis has served as associate editor for the IEEE Transactions on Circuits and Systems for Video Technology since 1999. In 1984, he was awarded one of the Centennial Medals of the IEEE. He is a fellow of the IEEE.
