
Generalized Relevance LVQ (GRLVQ) with Correlation Measures for Gene Expression Analysis

Marc Strickert, Udo Seiffert


Pattern Recognition Group, Institute of Plant Genetics and Crop Plant Research (IPK) Gatersleben

Nese Sreenivasulu, Winfriede Weschke


Gene Expression Group, IPK Gatersleben

Thomas Villmann
University Leipzig, Clinic for Psychotherapy

Barbara Hammer
Institute of Computer Science, Technical University of Clausthal

Abstract. A correlation-based similarity measure is derived for generalized relevance learning vector quantization (GRLVQ). The resulting GRLVQ-C classifier makes Pearson correlation available in a classification cost framework where data prototypes and global attribute weighting terms are adapted into directions of minimum cost function values. In contrast to the Euclidean metric, the Pearson correlation measure makes input vector processing invariant to shifting and scaling transforms, which is a valuable feature for dealing with functional data and with intensity observations like gene expression patterns. Two types of data measures are derived from Pearson correlation in order to make its benefits for data processing available in compact prototype classification models. Fast convergence and high accuracies are demonstrated for cDNA-array gene expression data. Furthermore, the automatic attribute weighting of GRLVQ-C is successfully used to rate the functional relevance of analyzed genes.

Key words: Prototype-based learning, adaptive metrics, correlation measure, Learning Vector Quantization, GRLVQ, gene expression analysis.

Preprint submitted to Elsevier Science

11 October 2005

1 Introduction

Pattern classification is the key technology for solving tasks in diagnostics, automation, information fusion, and forecasting. The backbone of pattern classification is the underlying distance metric: it defines how data items are compared, and it controls the grouping of data. Thus, depending on the definition of the distance, a data set can be viewed and processed from different perspectives. Unsupervised clustering with a specific similarity measure, for example visualized as the result of a self-organizing map (SOM), provides first hints about the appropriateness of the chosen metric for meaningful data grouping [6]. In prototype-based models like the SOM, a data item can be compared with an average data prototype in various ways, for example, according to the Euclidean distance or the Manhattan block distance. Different physical and geometric interpretations are obtained then, because the former measures diagonally across the vector space, while the latter sums up distances along each dimension axis. In any case, the specific structure of the data space can and should be accounted for by selecting an appropriate metric. Once a suitable metric is identified, it can be further utilized for the design of good classifiers. In supervised scenarios, auxiliary class information can be used for adapting parameters improving the specificity of data metrics during data processing, as proposed by Kaski for (semi-)supervised extensions of the SOM [5]. Another metric-adapting classification architecture is the generalized relevance learning vector quantization (GRLVQ) developed by Hammer and Villmann [4].

Data metrics in the mathematical sense, however, might be too restrictive for some applications in which a relaxation to more general similarity measures would be useful. For example, in the biological sciences functional aspects of collected data often play an important role: general spatio-temporal patterns in time series, intensity fields, or observation sequences might be more inter-related than patterns that are just spatially close in the Euclidean sense. This applies to the aim of the present work, the analysis of gene expression patterns, for which the Pearson correlation is commonly used. Since recent technological achievements allow probing of thousands of gene expression levels in parallel, fast and accurate methods are required to deal with the resulting large data sets. Thereby, the definition of genetic similarity in terms of Pearson correlation should be possible, and the curse of dimensionality, related to only few available experiments in high-dimensional gene expression space, should be reduced to a minimum. Many commercial and freely available bioinformatics tools, such as ArrayMiner, GeneSpring, J-Express Pro, and Eisen's Gene Cluster, use Pearson correlation for analysis. The common goal of these programs is the identification of key regulators and clusters of coexpressed genes that determine metabolic functions in developing organisms. Usually, only the metric of algorithms that have been initially designed for processing Euclidean data is exchanged by a "1 minus correlation" term.

Here, GRLVQ-C is proposed, a classifier that is mathematically derived from scratch for correlation-based classification. Its foundations are the generic update rules of generalized relevance learning vector quantization (GRLVQ, [3,4]). This allows incorporation of auxiliary information for genetic distinction, such as the developmental stage of the probed tissues, or the stress factors applied to the growing organisms. Using the GRLVQ approach with its rigid classification cost function, a fast, prototype-based, and intuitive classification model with very good generalization properties is derived. Both data attribute relevances and prototype locations are obtained as a result of optimizing Pearson correlationships. The specific requirements of gene expression analysis are met in two ways: firstly, the implemented correlation measure accounts for the nature of gene expression experiments which, due to physico-chemical reasons, tend to differ in their overall intensities and in their dynamic ranges, but not in their general structure of expressed patterns. Secondly, automatic relevance weighting attenuates the curse of high dimensionality. The properties and benefits of the proposed GRLVQ-C classifier are demonstrated for real-life data sets.

Email addresses: {stricker,seiffert}@ipk-gatersleben.de (Marc Strickert, Udo Seiffert), {srinivas,weschke}@ipk-gatersleben.de (Nese Sreenivasulu, Winfriede Weschke), villmann@informatik.uni-leipzig.de (Thomas Villmann), hammer@in.tu-clausthal.de (Barbara Hammer).

2 Generalized Relevance LVQ (GRLVQ) and extensions

Let $X = \{(\mathbf{x}^i, y^i) \in \mathbb{R}^d \times \{1, \dots, c\} \mid i = 1, \dots, n\}$ be a training data set with $d$-dimensional elements $\mathbf{x}^k = (x^k_1, \dots, x^k_d)$ to be classified into $c$ classes. A set $W = \{\mathbf{w}^1, \dots, \mathbf{w}^K\}$ of prototypes in data space with class labels $y^i$ is used for data representation, $\mathbf{w}^i = (w^i_1, \dots, w^i_d, y^i) \in \mathbb{R}^d \times \{1, \dots, c\}$. The classification cost function to be minimized is given in the generic form [4]:

$$E_{\mathrm{GRLVQ}} := \sum_{i=1}^{n} g\bigl(q(\mathbf{x}^i)\bigr) \quad\text{with}\quad q(\mathbf{x}^i) = \frac{d^+(\mathbf{x}^i) - d^-(\mathbf{x}^i)}{d^+(\mathbf{x}^i) + d^-(\mathbf{x}^i)}\,, \qquad d(\mathbf{x}) := d(\mathbf{x}, \mathbf{w})\,.$$

The classification costs of all patterns are summed up, whereby $q(\mathbf{x}^i)$ serves as quality measure of the classification depending on the degree of fit of the presented pattern $\mathbf{x}^i$ and the two closest prototypes, $\mathbf{w}^{i+}$ representing the same label as $\mathbf{x}^i$ and $\mathbf{w}^{i-}$ a different label. A sigmoid transfer function $g(x) = \mathrm{sgd}(x) = 1/(1 + \exp(-x)) \in (0; 1)$ is used [9]. Implicit degrees of freedom of the cost minimization are the prototype locations in the weight space and a set of adaptive parameters connected to the measure $d(\mathbf{x}) = d(\mathbf{x}, \mathbf{w})$ comparing pattern and prototype. In prior work, $d(\mathbf{x})$ was supposed to be a

metric in the mathematical sense, i.e. taking only non-negative values, conforming to the triangle inequality, and with a distance of $d = 0$ only for $\mathbf{w} = \mathbf{x}$. These conditions enable intuitive interpretations of prototype relationships. However, if just a well-performing classifier invariant to certain features is wanted, the distance conditions might be relaxed to a mere similarity measure to be plugged into the algorithm. Overall similarity maximization can be expressed in the GRLVQ framework by flipping the sign of the measure and then just keeping the minimization of $E_{\mathrm{GRLVQ}}$. Since the iterative GRLVQ update implements a gradient descent on $E$, $d$ must be differentiable almost everywhere, no matter if acting as distance or as similarity measure. Partial derivatives of $E_{\mathrm{GRLVQ}}$ yield the generic update formulas for the closest correct and the closest wrong prototype and the metric weights:

$$\triangle \mathbf{w}^{i+} = -\gamma^+ \, \frac{\partial E_{\mathrm{GRLVQ}}}{\partial \mathbf{w}^{i+}} = -\gamma^+ \, g'\bigl(q(\mathbf{x}^i)\bigr)\, \frac{2\, d^-(\mathbf{x}^i)}{\bigl(d^+(\mathbf{x}^i) + d^-(\mathbf{x}^i)\bigr)^2}\, \frac{\partial d^+(\mathbf{x}^i)}{\partial \mathbf{w}^{i+}}$$

$$\triangle \mathbf{w}^{i-} = -\gamma^- \, \frac{\partial E_{\mathrm{GRLVQ}}}{\partial \mathbf{w}^{i-}} = \gamma^- \, g'\bigl(q(\mathbf{x}^i)\bigr)\, \frac{2\, d^+(\mathbf{x}^i)}{\bigl(d^+(\mathbf{x}^i) + d^-(\mathbf{x}^i)\bigr)^2}\, \frac{\partial d^-(\mathbf{x}^i)}{\partial \mathbf{w}^{i-}}$$

$$\triangle \lambda = -\gamma_\lambda \, \frac{\partial E_{\mathrm{GRLVQ}}}{\partial \lambda} = -\gamma_\lambda \, g'\bigl(q(\mathbf{x}^i)\bigr)\, \frac{\dfrac{\partial d^+(\mathbf{x}^i)}{\partial \lambda}\, 2\, d^-(\mathbf{x}^i) - \dfrac{\partial d^-(\mathbf{x}^i)}{\partial \lambda}\, 2\, d^+(\mathbf{x}^i)}{\bigl(d^+(\mathbf{x}^i) + d^-(\mathbf{x}^i)\bigr)^2}$$
The learning rate $\gamma_\lambda$ is for the metric parameters $\lambda_j$, all initialized equally by $\lambda_j = 1/d$, $j = 1 \dots d$; $\gamma^+$ and $\gamma^-$ describe the update amount for the prototypes. Their choice generally depends on the measure used; they should be chosen according to the relation $0 \le \gamma_\lambda \ll \gamma^- \le \gamma^+ \le 1$ and decreased within these constraints during training. Metric adaptation should be realized slowly, as a reaction to the quasi-stationary solutions for the prototype positions. The above set of equations is a convenient starting point for testing different concepts of similarity by just inserting the denoted partial derivatives of $d(\mathbf{x})$.
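To make the generic update concrete, the following is a minimal sketch of one learning step. It is not the authors' implementation: the function names, learning-rate defaults, and the clipping-based renormalization of the relevance factors are illustrative assumptions. Any distance-like measure d (smaller = better; similarity measures enter with flipped sign, as described above) can be plugged in together with its partial derivatives.

```python
import numpy as np

def sgd(v):
    """Sigmoid transfer function g(v) = 1 / (1 + exp(-v))."""
    return 1.0 / (1.0 + np.exp(-v))

def grlvq_step(x, y, protos, labels, lam, d, d_dw, d_dlam,
               eps_p=1e-3, eps_m=1e-3, eps_lam=1e-5):
    """One gradient-descent step on E_GRLVQ for pattern x with class y.

    protos: (K, d) array; labels: (K,) array of prototype classes.
    d(x, w, lam)      -> scalar measure between pattern and prototype
    d_dw(x, w, lam)   -> partial derivatives w.r.t. prototype components
    d_dlam(x, w, lam) -> partial derivatives w.r.t. relevance factors
    """
    dist = np.array([d(x, w, lam) for w in protos])
    correct = labels == y
    ip = np.flatnonzero(correct)[np.argmin(dist[correct])]    # closest correct
    im = np.flatnonzero(~correct)[np.argmin(dist[~correct])]  # closest wrong
    dp, dm = dist[ip], dist[im]
    q = (dp - dm) / (dp + dm)
    gq = sgd(q) * (1.0 - sgd(q))        # g'(q) for the logistic function
    denom = (dp + dm) ** 2
    # gradients of the measure, evaluated before any parameter changes
    gw_p, gw_m = d_dw(x, protos[ip], lam), d_dw(x, protos[im], lam)
    gl_p, gl_m = d_dlam(x, protos[ip], lam), d_dlam(x, protos[im], lam)
    protos[ip] -= eps_p * gq * (2.0 * dm / denom) * gw_p      # attract correct
    protos[im] += eps_m * gq * (2.0 * dp / denom) * gw_m      # repel wrong
    lam -= eps_lam * gq * (2.0 * dm * gl_p - 2.0 * dp * gl_m) / denom
    lam = np.clip(lam, 0.0, None)
    lam /= lam.sum()                    # keep sum(lam) = 1, lam >= 0
    return protos, lam
```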

3 Metrics and similarity measures

The missing ingredient for carrying out comparisons is either a distance metric or a more general similarity measure $d(\mathbf{x}, \mathbf{w})$. In contrast to metrics, similarity measures are sometimes also called dis-similarity measures, because they are maximum for the best match, which is opposed to the semantics of metrics. For reference, formulas for the weighted Euclidean distance will be revisited first. Then, by relaxing the conditions of metrics, two types of measures are derived from the Pearson correlation coefficient, which both inherit the invariance to component offsets and amplitude scaling. The feature of prototype invariance, implemented by the presented update dynamic, is desirable in situations when mainly frequency information and the matching of plotted curve shapes is accounted for. More details on functional shape representations and functional data processing with neural networks are given in Biau et al. [1], with a focus on prototype-based SOM by Rossi et al. [7,8], and for use with support vector machines by Villa and Rossi [12].

3.1 Weighted Euclidean metric

The weighted Euclidean metric yields the following set of equations [13]:

$$d_{\mathrm{Euc}}(\mathbf{x}, \mathbf{w}^i) = \sum_{j=1}^{d} \lambda_j^{b_\lambda}\, (x_j - w^i_j)^{b_w}\,, \quad \text{integers } b_\lambda, b_w \ge 0\,,\ b_w \text{ even}\,,$$

$$\frac{\partial d_{\mathrm{Euc}}(\mathbf{x}, \mathbf{w}^i)}{\partial w^i_j} = -b_w\, \lambda_j^{b_\lambda}\, (x_j - w^i_j)^{b_w - 1}\,, \qquad \frac{\partial d_{\mathrm{Euc}}(\mathbf{x}, \mathbf{w}^i)}{\partial \lambda_j} = b_\lambda\, \lambda_j^{b_\lambda - 1}\, (x_j - w^i_j)^{b_w}\,.$$

For simplicity, roots have been omitted. In the squared case with $b_w = 2$, the derivative for the prototype update, $-2\lambda_j^{b_\lambda}(x_j - w^i_j)$, contains the well-known Hebbian learning term. In other cases, large $b_w$ tend to focus on dimensions with large differences, and small $b_w$ focus on dimensions with small differences. Approved values for the exponents of the relevance factors are $b_\lambda \in \{1, 2\}$. Normalization to $\sum_{i=1}^{d} \lambda_i = 1$, $\lambda_i \ge 0$ is necessary after each update step to prevent the parameters from divergence and collapse.
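A direct transcription of these three formulas might look as follows, a sketch under the same naming assumptions as the step routine above; the three functions can be passed to grlvq_step as d, d_dw, and d_dlam:

```python
import numpy as np

def d_euc(x, w, lam, b_l=2, b_w=2):
    """Weighted Euclidean measure: sum_j lam_j^b_l * (x_j - w_j)^b_w."""
    return np.sum(lam ** b_l * (x - w) ** b_w)

def d_euc_dw(x, w, lam, b_l=2, b_w=2):
    """Derivative w.r.t. prototype: -b_w * lam_j^b_l * (x_j - w_j)^(b_w - 1)."""
    return -b_w * lam ** b_l * (x - w) ** (b_w - 1)

def d_euc_dlam(x, w, lam, b_l=2, b_w=2):
    """Derivative w.r.t. relevances: b_l * lam_j^(b_l - 1) * (x_j - w_j)^b_w."""
    return b_l * lam ** (b_l - 1) * (x - w) ** b_w
```

With the default b_w = 2 and b_l = 2, the prototype update reduces to the Hebbian term noted above.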

3.2 Correlation-based measures

In the following, a correlation-based classification is derived from the term

$$r = d_r(\mathbf{x}, \mathbf{w}^i) = \frac{\sum_{l=1}^{d} (w^i_l - \bar{w}^i)\,(x_l - \bar{x})}{\sqrt{\sum_{l=1}^{d} (w^i_l - \bar{w}^i)^2}\; \sqrt{\sum_{l=1}^{d} (x_l - \bar{x})^2}} \;\in\; [-1; 1] \qquad (1)$$

which is the Pearson correlation coefficient; therein, $\bar{y}$ denotes the mean value of vector $\mathbf{y}$. As illustrated in Fig. 1, this correlation possesses fundamentally different properties than the Euclidean distance: depending on the applied similarity function, the two patterns compared with a reference pattern yield opposite relations. Simple data preprocessing cannot transform a correlation-based classification problem into an equivalent one solvable with the Euclidean metric. As a rough rule of thumb: if a prototype with sufficient variance is similar to input points in the Euclidean sense, then it is very likely that it is also highly correlated to them.

[Figure 1 shows a reference signal (RS) and two patterns (P1, P2) plotted as expression value over attribute index.]

Fig. 1. Data patterns compared with different similarity functions. Relation characterizations for the squared Euclidean metric differ from those for Pearson correlation: $d_{\mathrm{Euc}}(\mathrm{RS}, \mathrm{P1}) = 0.82 < d_{\mathrm{Euc}}(\mathrm{RS}, \mathrm{P2}) = 1.81$, i.e. P1 closer to RS than P2; but $d_r(\mathrm{RS}, \mathrm{P1}) = -0.53 < d_r(\mathrm{RS}, \mathrm{P2}) = 0.89$, i.e. P2 more similar to RS (highly correlated) than P1 (anti-correlated).

The other direction is untrue: if high correlation exists, there might still be a large Euclidean distance. Thus, potentially fewer prototypes are necessary for representations based on correlation, and sparser data models can be realized.
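The invariance behind this rule of thumb is easy to verify numerically. The following toy snippet, with made-up values, shows that a shifted and rescaled copy of a reference pattern is distant in the Euclidean sense yet perfectly correlated:

```python
import numpy as np

rs = np.array([1.0, 3.0, 2.0, 5.0, 4.0])  # reference signal (made-up values)
p = 0.5 * rs + 10.0                        # scaled and shifted copy of rs

print(np.sum((rs - p) ** 2))               # squared Euclidean distance: large
print(np.corrcoef(rs, p)[0, 1])            # Pearson correlation: 1.0
```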

The straightforward definition of Pearson correlation by Eq. 1, however, is not suitable for being implemented in GRLVQ: firstly, the required cost function minimization conflicts with the desired maximization of Pearson similarity between data and prototypes; secondly, only the small range of values $-1 \le r \le 1$ is taken for expressing best match versus worst match, which yields sub-optimum convergence of $E_{\mathrm{GRLVQ}}$. As a first approach, one might think of a version of Fisher's Z, $Z = 0.5 \cdot (\ln(1 + r) - \ln(1 - r))$, as a standard transformation of Pearson correlation. This, however, leads to unstable behavior, because almost perfect (anti-)correlation is mapped to arbitrarily large absolute values. Therefore, inverse fractions of appropriately reshaped functions of $r$ are considered in the following. The derivations presented here unify and improve the transformations given in the authors' prior work [10].

Since metric adaptivity is a very beneficial property for rating individual data attributes, free parameters are added to the Pearson correlation in such a way that the meaning of correlation can be fully maintained. Then, paying attention to the current prototype $\mathbf{w} := \mathbf{w}^i$, the numerator of Eq. 1 becomes

$$H := \sum_{l=1}^{d} \lambda_l^2\, (w_l - \bar{w})\,(x_l - \bar{x})\,, \quad \text{focusing on component } j:$$
$$H = \lambda_j^2\, (w_j - \bar{w})\,(x_j - \bar{x}) + H_j(\mathbf{w})\,, \qquad H_j(\mathbf{w}) = \sum_{l \ne j} \lambda_l^2\, (w_l - \bar{w})\,(x_l - \bar{x})\,;$$
$$\bar{w} = \bar{w}(w_j) = \frac{1}{d}\, w_j + \frac{1}{d} \sum_{l \ne j} w_l\,.$$
The focus on component $j$ will be a convenient abbreviation for deriving the formulas for prototype update and relevance adaptation. Each of the mean-subtracted weight and pattern components, $(w_j - \bar{w})$ and $(x_j - \bar{x})$, has got its own relevance factor $\lambda_j$. This is reflected in both rewritten denominator factors of Eq. 1, again with a focus on weight vector components $j$:

$$W := \sum_{l=1}^{d} \lambda_l^2\, (w_l - \bar{w})^2\,, \qquad X := \sum_{l=1}^{d} \lambda_l^2\, (x_l - \bar{x})^2\,,$$
$$W(w_j, \lambda_j) = \lambda_j^2\, (w_j - \bar{w})^2 + W_j\,, \qquad W_j = \sum_{l \ne j} \lambda_l^2\, (w_l - \bar{w})^2\,.$$

Using the defined shortcuts, the adaptive Pearson correlation can be written as $r = H / \sqrt{W X}$. Two types of measures are obtained by a unified transform:

$$R = \frac{1}{\bigl(C + r(\mathbf{x}, \mathbf{w})\bigr)^k} - R_{\min} = \frac{1}{\Bigl(C + \frac{H}{\sqrt{W X}}\Bigr)^k} - R_{\min} =: \tilde{R} - R_{\min} \qquad (2)$$

The resulting classifiers are characterized as follows:

C0: One type of classifier is obtained for $C = 0$, even integer exponents $k \ge 2$, and minimum $R_{\min} = 1$. This classifier allows the separation of both correlation and anti-correlation from mere dis-correlation. The minimum value $R_{\min}$ is subtracted in order to obtain sharp zeros for perfect matches. In computer implementations, a special treatment of the unlikely case of extreme dis-correlation might be considered in order to avoid division by (near-)zero values. C0-prototypes match both correlated patterns and their inverted anti-correlated counterparts, which allows the realization of very compact classifiers. For specific data, however, this type of classification might lead to undesired intermingling of data profile shapes, such as occur in gene expression analysis.

C1: The other type of classifier separates correlated patterns from anti-correlated ones. The C1 model is realized by $C = 1$, an integer exponent $k \ge 1$, and $R_{\min} = 2^{-k}$; here, the rare occurrence of extreme anti-correlation might likewise require handling singular values in computer realizations. The C1 setup allows classification with intuitive Pearson correlationships, a feature that is well-suited for co-expression analysis of gene intensities.
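Both types are straightforward to express in code. The sketch below (illustrative naming; the near-zero guard is one possible way to realize the special treatment of singular cases mentioned above) computes the relevance-weighted correlation r = H/sqrt(W*X) and the transformed measure R of Eq. (2):

```python
import numpy as np

def adaptive_pearson(x, w, lam):
    """Relevance-weighted Pearson correlation r = H / sqrt(W * X)."""
    xc = lam * (x - x.mean())   # lam_l * (x_l - mean(x))
    wc = lam * (w - w.mean())   # lam_l * (w_l - mean(w))
    return np.dot(wc, xc) / (np.linalg.norm(wc) * np.linalg.norm(xc))

def R_measure(x, w, lam, C, k, eps=1e-12):
    """Transformed measure R = 1 / (C + r)^k - R_min, cf. Eq. (2).

    C0 classifier: C = 0, even k >= 2, R_min = 1.
    C1 classifier: C = 1, integer k >= 1, R_min = 2**(-k).
    """
    base = C + adaptive_pearson(x, w, lam)
    if abs(base) < eps:                      # extreme dis-/anti-correlation
        base = eps if base >= 0.0 else -eps  # guard the division (assumption)
    r_min = 1.0 if C == 0 else 2.0 ** (-k)
    return 1.0 / base ** k - r_min
```

With C = 0 and even k, an inverted pattern with r = -1 scores R = 0 like a perfect match, which is exactly the C0 behavior described above; with C = 1, only r = 1 yields R = 0. The measure is distance-like (0 = best match), so it plugs directly into the cost minimization.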

For calculating derivatives of $R$, the constant expression $R_{\min}$ can be omitted. Solutions can be obtained manually or by using computer algebra systems. In the latter case, after some rearrangements, the following equations are found:

$$\frac{\partial R}{\partial w_j} = \frac{\partial}{\partial w_j} \left( C + \frac{H(w_j, \lambda_j)}{\sqrt{W(w_j, \lambda_j)\, X(\lambda_j)}} \right)^{-k} = F \cdot \Bigl( H'(w_j)\, W X - \tfrac{1}{2}\, H\, W'(w_j)\, X \Bigr)$$

$$\frac{\partial R}{\partial \lambda_j} = F \cdot \Bigl( H'(\lambda_j)\, W X - \tfrac{1}{2}\, H\, W'(\lambda_j)\, X - \tfrac{1}{2}\, H\, W\, X'(\lambda_j) \Bigr)$$

using the factor $F = -k\, \tilde{R}^{(k+1)/k} / (W X)^{3/2}$. The missing derivatives are

$$H'(w_j) = \lambda_j^2\, (x_j - \bar{x}) - \frac{1}{d} \sum_{l=1}^{d} \lambda_l^2\, (x_l - \bar{x})\,, \qquad W'(w_j) = 2\, \lambda_j^2\, (w_j - \bar{w}) - \frac{2}{d} \sum_{l=1}^{d} \lambda_l^2\, (w_l - \bar{w})\,,$$
$$H'(\lambda_l) = 2\, \lambda_l\, (w_l - \bar{w})\,(x_l - \bar{x})\,, \qquad W'(\lambda_l) = 2\, \lambda_l\, (w_l - \bar{w})^2\,, \qquad X'(\lambda_l) = 2\, \lambda_l\, (x_l - \bar{x})^2\,.$$

These formulas contain plausible Hebb terms, adapting $w_j$ into the direction of $(x_j - \bar{x})$ and away from $(w_l - \bar{w})$ in case of correct classification, whereby further scaling factors come from the cost function. Similarly, $\lambda_l$ is adapted according to the correlation $(w_l - \bar{w})\,(x_l - \bar{x})$ in comparison to the variances of these terms. Note that very efficient computer implementations can be realized, because most calculations can already be done during the pattern presentation phase. Then, the similarity measure $R$ and all its constituents $H$, $W$, $X$, and some other terms are computed, and they can be stored for each prototype for later reuse in the update phase. Again, the normalization of the relevance factors to $\sum_{i=1}^{d} \lambda_i = 1$, $\lambda_i \ge 0$, is advised in order to avoid numerical instabilities and to make different training runs comparable.

Two finally discussed practical issues concern the handling of singular expressions and the choice of the exponent $k$. Singular states occur if the variance of one of the vectors is absent. If, for example, a pattern vector $\mathbf{x}$ is affected, then the simplified limit correlation

$$\lim_{x_l \to \bar{x}} \frac{\sum_{l=1}^{d} (x_l - \bar{x})\,(w_l - \bar{w})}{\sqrt{\sum_{l=1}^{d} (x_l - \bar{x})^2}\; \sqrt{\sum_{l=1}^{d} (w_l - \bar{w})^2}} = \lim_{x \to \bar{x}} \frac{(x - \bar{x}) \sum_{l=1}^{d} (w_l - \bar{w})}{\sqrt{d\,(x - \bar{x})^2}\; \sqrt{\sum_{l=1}^{d} (w_l - \bar{w})^2}} = \frac{\sum_{l=1}^{d} (w_l - \bar{w})}{\sqrt{d}\; \sqrt{\sum_{l=1}^{d} (w_l - \bar{w})^2}} = 0$$

is of interest. In this case, all prototypes would end up with zero correlations and impracticable terms $F$. Analogous reasoning holds in rare cases of equal prototype components, which, in practice, would occur by inappropriate initialization rather than by the update dynamic. Here, prototypes are assumed to be initialized by data instances. By skipping the degenerated constant patterns or prototypes, unpleasant situations can be effectively avoided. Alternatively, a single randomly picked component can be set to a different value, which, on average, produces the desired state of uncorrelatedness, even for two vectors with simultaneously equal components subject to that modification. The free parameter $k$ influences the speed of convergence and the generalization ability. Integer values in the range $1 \le k < 20$ have been found to be reasonable choices in experiments. Too high values lead to fast adaptation, but sometimes also to over-fitting or to unstable convergence, unless a very small learning rate is chosen. Good initial exponents are $k = 7$ or $k = 8$, odd and/or even according to the desire for a C1 or a C0 type classifier. For the presented experiments, training is only a matter of one or two minutes; therefore, systematic parameter searches can be realized.
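Because the chain of derivatives above is easy to get wrong, a finite-difference comparison is a useful safeguard. The following sketch, an illustrative check rather than part of the original method, evaluates the analytic partial derivative of R with respect to one prototype component via F, H'(w_j), and W'(w_j), and compares it against a central difference; R_min is omitted since it does not affect gradients:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, C = 10, 5, 1.0                       # C1 setup with exponent k = 5
x = rng.normal(size=d)
w = rng.normal(size=d)
lam = rng.uniform(0.5, 1.5, size=d)
lam /= lam.sum()                           # normalized relevance factors

def R_tilde(w):
    """R without the constant R_min (irrelevant for gradients)."""
    H = np.sum(lam**2 * (w - w.mean()) * (x - x.mean()))
    W = np.sum(lam**2 * (w - w.mean())**2)
    X = np.sum(lam**2 * (x - x.mean())**2)
    return (C + H / np.sqrt(W * X)) ** (-k)

j = 3                                      # component to check
H = np.sum(lam**2 * (w - w.mean()) * (x - x.mean()))
W = np.sum(lam**2 * (w - w.mean())**2)
X = np.sum(lam**2 * (x - x.mean())**2)
F = -k * (C + H / np.sqrt(W * X)) ** (-(k + 1)) / (W * X) ** 1.5
dH = lam[j]**2 * (x[j] - x.mean()) - np.sum(lam**2 * (x - x.mean())) / d
dW = 2 * lam[j]**2 * (w[j] - w.mean()) - 2 * np.sum(lam**2 * (w - w.mean())) / d
analytic = F * (dH * W * X - 0.5 * H * dW * X)

h = 1e-6                                   # central finite difference
wp, wm = w.copy(), w.copy()
wp[j] += h
wm[j] -= h
numeric = (R_tilde(wp) - R_tilde(wm)) / (2 * h)
print(analytic, numeric)                   # should agree to several digits
```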

4 Experiments

Three experiments underline the usefulness of correlation-based classification. First, a proof of concept is given for the Tecator benchmark data, a set of absorbance spectra. Then the focus is put on cDNA array classification: in the second experiment, a distinction between two types of leukemia from gene spotting intensities is sought; this involves the classification of 72 complete cDNA microarrays, i.e. 7129-dimensional gene expression intensity vectors, and the rating of these genes for their relevance to classification. The last experiment detects systematic differences between two series of gene expression records by analyzing two corresponding sets of 7-dimensional expression patterns with 1421 genes each.

4.1 Tecator spectral data

The first data set, publicly available at http://lib.stat.cmu.edu/datasets/tecator, contains 215 samples of 100-dimensional infrared absorbance spectra. The task is to predict the binary fat content, low or high, of meat by spectrum classification, thereby using random data partitions of 120 training patterns and 95 test patterns, as suggested in [12]. It is known that the Euclidean distance is not appropriate for raw spectrum comparison, and the question of interest is whether Pearson correlation yields any benefit.

[Figure 2 plots sample Tecator spectra (intensity over frequency channel, classes 0 and 1) and the corresponding relevance profiles with average and standard-deviation boundary.]

Fig. 2. Upper panel: sample spectra from the Tecator data set. Lower panel: GRLVQ-C1 relevance profiles.

Figure 2, top panel, shows some of the spectra with their corresponding classes. Apart from a tendency towards dints around channel 41 for high fat content, a substantial visual data overlap can be stated. This is reflected in the results for Euclidean-based classifiers: k-nearest neighbor (k-NN) reaches its best classification results of about 80% accuracy for k = 3, and GRLVQ with squared Euclidean metric reaches 88% on the test set at 94% training accuracy. However, focusing on correlation-based classification, the problem gets easier, and results above 97% are obtained. For comparison, k-NN with maximum correlation neighborhood is taken, k-NN-C for short. Table 1 contains the average numbers of misclassifications for each classifier, each trained 25 times. In contrast to k-NN-C, which utilizes all available training data for classification, GRLVQ-C training requires only 20 prototypes per class trained in 500 epochs. The learning rates for both types C0 and C1 without relevance learning are $\gamma_\lambda = 0$ and $\gamma^+ = \gamma^- = 10^{-8}$, and the exponents are $k = 6$ and $k = 5$, respectively. In additional runs, relevance learning is switched on by choosing a relatively large non-zero learning rate $\gamma_\lambda = 10^{-7}$ while the prototype adaptation rates are kept at $\gamma^+ = \gamma^- = 10^{-8}$.
Table 1. Tecator correlation-based classification results for the test set. Average numbers of misclassifications are shown for 25 runs of each classifier. k-NN-C utilizes maximum correlation neighborhood. Relevance utilization is indicated by "rel.", denoting metric adaptation for GRLVQ-C and application of relevance-rescaled data for k-NN-C.

          GRLVQ-C1  GRLVQ-C0  1-NN-C  3-NN-C  7-NN-C
no rel.   3.32      2.40      3.96    5.04    6.40
rel.      2.04      2.16      2.52    2.84    3.24


To summarize Tab. 1, GRLVQ-C classification is superior to k-NN-C. The differences between C0 and C1 accuracies become visible for relevance-based learning: while C0 does not profit much from relevance adaptation, C1 does account for it, as some of the runs ended up with no misclassification at all. The bottom panel of Fig. 2 shows individual and average relevance profiles for the C1 classifier. As an intuitive result, the apparent discriminators at particular channels of the data, such as channel 41, get amplified, while less important channels are suppressed. Although k-NN-C decreases in performance for larger k-neighborhoods, the results can be improved by transforming the input data according to the scaling factors shown in the relevance plots. The GRLVQ-C scaling weights can thus be used to boost k-NN-C classification accuracies, which underlines a more general validity of the found data scaling properties. To conclude, sparse and accurate GRLVQ-C classifiers are obtained for the Tecator data set without further data preprocessing. The built-in relevance detection yields highly interpretable results, which, in the following, helps to identify key regulators in gene expression experiments.

4.2 Leukemia cancer type detection

The second task is gene expression analysis, where the GRLVQ-C property of automatic attribute weighting is used for gene ranking. Data is taken from cDNA arrays, which are powerful tools for probing in parallel the expression levels of thousands of genes extracted from organic tissue cells. A very important issue in gene expression analysis is the identification of functionally relevant genes. Particularly, medical diagnostics and therapies profit from the isolation of small sets of candidate genes responsible for defective or mutative operations. In cancer research, many well-documented data sets and publications are available online. One of the discussed problems is the differentiation between two types of leukemia, the acute lymphoblastic leukemia (ALL) and the acute myeloid leukemia (AML). Background information is provided by Golub et al. [2]. The corresponding data sets and further online material are found at http://www.broad.mit.edu/. The available data contains real-valued expressions of 7129 genes (some redundant) for each of the 38 training samples (27 ALL, 11 AML) and of the 34 test samples (20 ALL, 14 AML). In order to distinguish correlation from anti-correlation, GRLVQ-C1 is considered. Training has been carried out for a minimalistic GRLVQ-C1 model, using only one prototype per class, which prevents over-fitting in the 7129-dimensional space. Learning rates are $\gamma^+ = \gamma^- = 2.5 \cdot 10^{-3}$, the exponent $k = 2$ is taken, and 1000 epochs are trained. Table 3 shows that the average results of 100 runs with only prototype adaptation are rather poor in contrast to the neighborhood analysis method of Golub et al.; however, allowing relevance adaptation at $\gamma_\lambda = 5 \cdot 10^{-8}$, the GRLVQ-C1 accuracies are drastically improved.

Table 2. Leukemia list of candidate genes for differentiating between types AML and ALL. The "Found" column indicates whether the specific gene is confirmed as identified by the team of Golub et al.

Gene-#  Found  ID      Name
1745    Yes    M16038  LYN Yamaguchi sarc. vir. rel. oncog. homolog
1834    Yes    M23197  CD33 antigen (differentiation antigen)
1882    Yes    M27891  CST3 Cystatin C (amyloid cerebral hemorrhage)
2354    Yes    M92287  CCND3 Cyclin D3
4190    No     X16706  FOS-related antigen 2
4211    No     X51521  VIL2 Villin 2 (ezrin)
4847    Yes    X95735  Zyxin
5954    No     Y00339  CA2 Carbonic anhydrase II
6277    No     M30703  Amphiregulin (AR) gene
6376    Yes    M83652  PFC Properdin P factor, complement

Thus, the obtained relevance factors must explain correlative differences between AML and ALL. Yet, this statement does not claim biological truth. For validation purposes, the genes have been ranked according to their relevance values for those 19 of 100 independently trained classifiers that showed perfect results on training and test set. A list of top-ten genes with the highest sum of their 19 ranks has been extracted and matched against a longer list of fifty genes given by Golub and his group. The results are given in Tab. 2. Remarkably, 6 of the identified genes are consistent with the reference list. This event, finding at least 6 matches, has got a vanishing probability of

$$P = \sum_{k=6}^{10} \binom{50}{k} \binom{7129 - 50}{10 - k} \Big/ \binom{7129}{10} \approx 1.8 \cdot 10^{-11}$$

for randomly selected genes. This means that, although only ten instead of fifty genes are considered for brevity here, these genes are confirmed as being of special importance.
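The stated tail probability is quick to recompute; the following snippet is an illustrative verification using only the Python standard library, not part of the original study:

```python
from math import comb

# P(at least 6 of 10 randomly drawn genes out of 7129 hit a 50-gene list)
p = sum(comb(50, k) * comb(7129 - 50, 10 - k)
        for k in range(6, 11)) / comb(7129, 10)
print(p)  # approx. 1.8e-11, matching the value stated above
```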
Table 3. Leukemia data set classification results. Average numbers of misclassifications are shown for 100 runs of each GRLVQ-C1 classifier. Relevance utilization is indicated by "rel.". The results for the neighborhood analysis (done a single time) are taken from Golub et al. [2].

Set    GRLVQ-C1 (rel.)  GRLVQ-C1 (no rel.)  Neigh.-Analysis
train  0.19             5.96                2
test   2.1              7.14                5


[Figure 3 shows two 2D scatter plots, the MDS embedding of the original and of the rescaled leukemia data, with ALL and AML samples.]

Fig. 3. Leukemia data embedded by correlation-based multi-dimensional scaling. Left: original data set. Right: data set after application of GRLVQ-C1 relevance factors. Class prototypes are indicated by large symbols.

The functional role of the other 4 identified genes might still be of interest from a biological point of view, but this remains open for investigation in extra studies. Finally, the grouping of all 72 data samples is visualized together with the GRLVQ-C prototypes by correlation-based multi-dimensional scaling, HiT-MDS [11], in Fig. 3. For the original data, shown in the left panel, there is a rough unsupervised separation of the 7129-dimensional gene expression vectors according to their type, AML or ALL. The corresponding GRLVQ-C1 prototypes define a tight data separation boundary which is still imperfect due to noisy data grouping. However, after the rescaling transform utilizing the GRLVQ-C1 relevance factors, much clearer data clusters are obtained, as shown in the right panel of Fig. 3. This visual aid re-emphasizes how the curse of dimension is effectively circumvented by using an adaptive metric that is driven by the available auxiliary class information.

4.3 Validation of gene expression experiments

The third study is connected to developmental gene expression series obtained from macroarrays. Expression patterns of 1421 genes were collected from filial tissues of barley seeds during 7 developmental stages between 0 and 12 days after flowering, in steps of two days. For control purposes, each experiment has been repeated using 2 sets of independently grown plant material. The question of interest is whether a systematic difference can be found in the gene expression profiles resulting from the two experimental series. Thus, 1421 data vectors in 7 dimensions are considered for each of the two classes related to series 1 and series 2.

Table 4. GRLVQ classification accuracies for differentiating between 2 series of macroarray experiments. Numbers of used prototypes are given in brackets.

Set    GRLVQ-C0 (3 pt.)  GRLVQ-C1 (5 pt.)  GRLVQ-Euc (4 pt.)  GRLVQ-Euc (40 pt.)
train  68.07%            66.91%            53.14%             68.38%
test   64.95%            66.44%            49.66%             58.32%

Random partitions into 50% training and 50% test sets are trained for 2500 epochs and 25 runs for each classification model: GRLVQ-C0 with $k = 8$, GRLVQ-C1 with $k = 7$, and GRLVQ-Euc. The exponents have been determined in a number of runs as a compromise between speed of convergence, related to small exponents, and over-fitting, observed for high values. Table 4 contains the average results of the classifiers with optimum model size. GRLVQ-C0 uses only 3 prototypes, 1 for series one and 2 for series two. This asymmetry has proved to be beneficial for classification. Likewise, GRLVQ-C1 makes use of 2 and 3 prototypes for series one and two, respectively. The squared Euclidean GRLVQ-Euc yields results at about guessing level for 2 prototypes per class; accuracies get better for 20 prototypes per class, but then the generalization is rather poor. All classifiers but the small Euclidean one indicate a detectable difference between the two series of experiments. However, the GRLVQ-C1 classifier that maintains the opposition of correlation and anti-correlation is a good choice with respect to model size and generalization ability. The relevance profiles for the three classifiers are given in Fig. 4. They show a rough correspondence in identifying relevant developmental stages within a range of 4-8 days. However, the details must be considered with care, because different configurations of the relevance profiles, especially for the C1 type, are found to lead to $E_{\mathrm{GRLVQ}}$ cost function minimization. Thus, the data space is filled too homogeneously to clearly emphasize specific dimensions. This is also reflected in the small variability of the relevance factors $\lambda_i$; in this case, larger relevance learning rates produce unstable solutions. Nevertheless, a biologically consistent interpretation of the relevance profiles has been found: further biological investigations have supported a slight shift in assigning developmental stages between the two sets of independent experiments. In the conducted gene expression experiments, a robust transcriptional reprogramming occurred during the intermediate stage related to days 4-8 of filial tissue development. Although overall expression data between the two sets of experiments are hardly distinguishable in practice, the slight systematic influence depending on an assignment of the developmental stages affects gene expression during this intermediate phase. These slight differences were detected and could be well exploited by the GRLVQ classifiers, which confirms their use for processing biological observations.

[Figure 4 shows three relevance-profile plots (relevance factor over developmental stage in days after flowering), each with individual profiles, average, and standard deviation.]

Fig. 4. GRLVQ relevance profiles characterizing the developmental stages that enhance the distinction of two experimental gene expression series. From top to bottom: Euclidean, C0, C1. Different characteristics occur depending on the underlying similarity measure.

5 Conclusions

Adaptive correlation-based similarity measures have been successfully derived and integrated into the existing mathematical framework of GRLVQ learning. The experiments with the GRLVQ-C classifiers show that there is much potential in using non-Euclidean similarity measures. GRLVQ-C1 maintains the opposition of correlation vs. anti-correlation, while GRLVQ-C0 opposes both characteristics against dis-correlation, which leads to structurally different classifiers. The GRLVQ-C0 pattern matching is somewhat analogous to the property of Hopfield networks, which do not distinguish a pattern from an inverted copy of it. By the utilization of Pearson correlation, no preprocessing is required for becoming independent of data contrast related to scaling and shifting. As a consequence, fewer prototypes are necessary to represent certain types of data in comparison to the Euclidean metric: it has been shown that the functional Tecator data and the gene expression classification profit from using correlation measures. High sensitivity to specific differences in the data is realized, and very good classification results are obtained. Many other areas of GRLVQ-C application can be thought of, ranging from image processing to mass spectroscopy, areas which profit from relaxed pattern matching in contrast to strict metric-based classification. A very important property of the proposed types of correlation measures, C0 and C1, is their adaptivity for enhanced data discrimination from a global data perspective. As has been shown by the experiments, relevance scaling helps to find interesting data attributes, and thereby the scaling drastically increases the classification accuracies for high-dimensional data. Even for standard methods, like the demonstrated k-NN-C and the MDS visualization technique, the expressiveness can be much improved if the data is subject to preprocessing by the GRLVQ-C scaling factors. Yet, an interesting theoretical problem remains apart from the practical benefits: to what extent can the large margin properties of GRLVQ-Euc be transferred to the new correlation measures of GRLVQ-C? This and a number of further issues will be addressed in future work. GRLVQ-C is available online as an instance of SRNG-GM at http://srng.webhop.net/.

Acknowledgments

Thanks to Dr. Volodymyr Radchuk for macroarray hybridization experiments. Gratitude to Prof. Wolfgang Stadje, University of Osnabrück, for his solution to the combinatorial probability of accidentally identified relevant genes. Many thanks also for the precise statements of the anonymous reviewers. The present work is supported by BMBF grant FKZ 0313115, GABI-SEED-II.

References

[1] G. Biau, F. Bunea, and M. Wegkamp. Functional Classification in Hilbert Spaces. IEEE Transactions on Information Theory, 51(6):2163-2172, 2005.
[2] T. Golub, D. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. Mesirov, H. Coller, M. Loh, J. Downing, M. Caligiuri, C. Bloomfield, and E. Lander. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286(5439):531-537, Oct 1999.
[3] B. Hammer, M. Strickert, and T. Villmann. On the Generalization Ability of GRLVQ Networks. Neural Processing Letters, 21:109-120, 2005.
[4] B. Hammer and T. Villmann. Generalized Relevance Learning Vector Quantization. Neural Networks, 15:1059-1068, 2002.
[5] S. Kaski. Bankruptcy analysis with self-organizing maps in learning metrics. IEEE Transactions on Neural Networks, 12:936-947, 2001.
[6] T. Kohonen. Self-Organizing Maps. Springer-Verlag, Berlin, 3rd edition, 2001.
[7] F. Rossi, B. Conan-Guez, and A. El Golli. Clustering functional data with the SOM algorithm. In Proceedings of ESANN 2004, pages 305-312, Bruges, Belgium, April 2004.
[8] F. Rossi, N. Delannay, B. Conan-Guez, and M. Verleysen. Representation of functional data in neural networks. Neurocomputing, 64:183-210, March 2005.
[9] A. Sato and K. Yamada. Generalized Learning Vector Quantization. In G. Tesauro, D. Touretzky, and T. Leen, editors, Advances in Neural Information Processing Systems 7 (NIPS), volume 7, pages 423-429. MIT Press, 1995.
[10] M. Strickert, N. Sreenivasulu, W. Weschke, U. Seiffert, and T. Villmann. Generalized Relevance LVQ with Correlation Measures for Biological Data. In M. Verleysen, editor, European Symposium on Artificial Neural Networks (ESANN), pages 331-338. D-side Publications, 2005.
[11] M. Strickert, S. Teichmann, N. Sreenivasulu, and U. Seiffert. High-Throughput Multi-Dimensional Scaling (HiT-MDS) for cDNA-Array Expression Data. In W. Duch, J. Kacprzyk, E. Oja, and S. Zadrożny, editors, Artificial Neural Networks: Biological Inspirations - ICANN 2005, Springer Lecture Notes in Computer Science, pages 625-633, 2005.
[12] N. Villa and F. Rossi. Support vector machine for functional data classification. In Proceedings of ESANN 2005, pages 467-472, Bruges, Belgium, April 2005.
[13] T. Villmann, F. Schleif, and B. Hammer. Supervised Neural Gas and Relevance Learning in Learning Vector Quantization. In T. Yamakawa, editor, Proc. of the Workshop on Self-Organizing Networks (WSOM), pages 47-52, Kyushu Institute of Technology, 2003.

