

Sandeep Kumar BEng (Hons) MEng


Address: C-317, Hari Marg, Malviya Nagar, Jaipur-302017, Rajasthan, INDIA
Telephone: +91-141-2521648
Mobile: +91-9660006262
Email: san_deep2262@yahoo.com

WORK EXPERIENCE: 3.4 YEARS

Ongoing       E-Com Systems Inc.
Nov. 2010     Cancer Research UK (Home Fundraising), UK.
Oct. 2007-10  AECOM, Graduate Consultant, Traffic Technology, UK.
Aug. 2007     Primark, Retail Operative, UK.
Jun-Oct 2006  Seedling Academy of Design, Technology & Management, Lecturer in Electronics and Electrical Engineering Department, India.
Aug-Dec 2004  Sterling Telecom & Net Systems Ltd., Information Communication Technology Engineer, India.
Aug. 2004     vCustomer, Technical Support Executive, India.
Jul. 2003     Best Solutions Ltd., Games Developer, India.

SOFTWARE SKILLS: MapInfo, AutoCAD, Matlab 7.1, Java, C++, C, HTML, 8085 & 8086 ASM, Visual Basic, Microsoft Office, Windows batch files, Flash 5.0/MX, Basic, Adobe Photoshop & Illustrator, Internet.

OTHER QUALIFICATIONS:

2007-10  Member of The IET.
2009-19  Full Clean UK Driving License (Valid until 24/03/19) & Pass Plus Certificate.
2008-10  Motorway Pass (Highways Agency).

SEMINARS:
2003  E10B Exchange, BSNL.
2003  E10B Exchange and Wireless Phones, India.

EDUCATION:

Jan 2008   M.Eng in Data Communications, The University of Sheffield (60.23%)
Aug. 2004  B.Eng in Electronics and Comm., The University of Rajasthan (Honors 75.25%)
May 2000   HSC - Science, CBSE (68.4%)
June 1998  SSC - CBSE (69.2%)

TRAINING COURSES:


2010  MySQL & PHP Course (Ongoing)
2010  City & Guilds Foundation Course in Electrotechnical Technology
2010  EAL Level 3 VRQ Certificate in Refrigeration Maintenance
2010  CITB F-Gas Safe Handling
2009  Transport Planning Society Certificate in Traffic Engineering & Planning
2008  ECS Health & Safety Training (JIB for the Electrical Contracting Industry)
2008  Motorway Awareness & Safety Training, Mouchel
2008  MapInfo, AECOM
2008  Introduction to System Design, AECOM
2007  Mesh4G, AECOM
2004  UMTS, BSNL
2004  High Aptitude for IT with 81 Percentile in Behavioral Traits at NIIT National IT Aptitude Test, India

PROJECT EXPERIENCE:

Lowestoft UTMC Design & Implementation: Produced CAD of lamp columns, MESH4G and power cables at feeder pillars & lamp columns. Determined the power consumption of the Urban Traffic and Management Control (UTMC) system. Wrote the UTMC report, Design and Check certificates and the equipment list. Operator of Variable Message Signs (VMS) on the CUTLAS UTMC system on behalf of Suffolk County Council.

CENTRO Framework: Performed site audits of the Real Time Passenger Information (RTPI) signs & bus shelters in Birmingham & Coventry on behalf of CENTRO.

ITS Strategy for Suffolk: Prepared strategy report in line with the 'Local Transport Plan' and 'Suffolk Bus Strategy'.

HA Data Fusion: Gathered road data for modelling input.

Walsall TRO: Prepared CAD of road networks using ParkMap and MapInfo.

ITS Radar Newswire: Wrote Highways Agency ITS Report

(http://217.118.128.202/downloads/general/ITS_Radar_International_ReportITS_World_Congress_2009_Final.pdf).

Suffolk CATS: Cycle Activated Traffic Warning Sign.

North West TechMac: Performed Statutory Enquiry Checks. Assisted in writing technical notes for the power cable specification of the ANPR and Ramp Metering equipment using AMTECH software.

Oakham Road, Tividale Pedestrian Crossing: Prepared CAD of the pedestrian crossing facility using AutoTrack. Produced a technical, safety and regulatory parameters report.

Hertfordshire LTP: Website: http://www.transportplans.co.uk/

TataIndicom BTS: Supervised commissioning of a mobile communication tower, cabins for communication equipment, a diesel generator set, and civil & electrical work.

Jaipur National University: Lecturer in Electronics & Electrical Engineering. Taught Circuit Analysis & Communication Theory as core subjects and supervised the Electronics Workshop Lab.

ACADEMIC PROJECTS:

DTMF Remote Control.
3GPP Cryptographic Algorithms.
Cell Phone Signal Detector.
Hierarchical Clustering of Microarray Data.
EXTRACURRICULAR ACTIVITIES:

2003  Third at National Level Dance, India.
2002  Bronze medal - Swimming, Rajasthan University, India.
2002  First at College in Annual Day Sports, India.
2002  Certificate, College Cultural Activities, India.
2000  CBSE Excellence Award in Physical Education, 94%, India.
1999  Rajasthan Colts Cricket Team, UK.
1999  National Level U-16&19 Cricket, India.
1998  Gold Medalist - District Karate, India.
1996  Silver Medal - District Gymnastics, India.
1991  Second at National Art, India.

LANGUAGES:

Hindi: Mother tongue. English: Fluent (IELTS: 6.5).

REFERENCES: Furnished upon request.

HIERARCHICAL CLUSTERING OF MICROARRAY DATA

by Sandeep Kumar Bairwa

Under the supervision of Dr Guido Sanguinetti

The University of Sheffield Department of Electronic and Electrical Engineering (2006-07)

This dissertation is submitted in part requirement for the degree of M.Sc. in Data Communications

Abstract
The advent of microarrays has opened unprecedented pathways in the research field of bioinformatics. With the possibility of studying thousands of genes together at the same time, automated learning tools are being developed rigorously. Biologists use the gene expression data from microarrays to reveal biologically similar groups of genes that are meaningful and informative. Many standard techniques have been developed to cluster similar genes together, but finding the optimal clusters and their number from a given data set remains an unsolved problem. One of the most prominent techniques employed in this regard is hierarchical clustering. In this thesis we begin with a gentle introduction to microarrays and the associated data analysis techniques. We then present two novel methods, following the agglomerative hierarchical clustering approach, for separating overlapping and closely spaced clusters. The novel methods proposed in this thesis are compared, on a case of overlapping clusters, with the clustering results of the Bioinformatics Toolbox provided by MathWorks in the Matlab 7.1 software. The novel methods give correct results in this case, clearly proving advantageous. The methods proposed are based on Gaussian mixture assumptions; cluster assignments are made using the empirical covariance and a Gaussian Bayes classifier. We present test results on synthetic as well as real data sets. While clustering the real data, a pruning decision is also made.

Acknowledgements
I want to thank the technical and administrative staff of the Electronics and Computer Science Departments of my University for their help and assistance throughout the preparation of this thesis. I am extremely thankful to my supervisor, Dr. Guido Sanguinetti, thanks to whom I have been able to raise the standard of my thesis, meet the deadlines and gain numerous skills I would not otherwise have gained; without his guidance it would not have been possible to keep my work going in the right direction. I want to thank Dr. Peter Rockett for the extended guidance he offered me on the write-up. I am thankful to my family members for their moral support. Finally, I want to thank the University for allowing me to be a part of such a wonderful institution.

Contents
1. Introduction
2. Microarrays: A biological background
   Microarray Technology
   Types of microarrays
   Normalization
   Applications of microarrays
3. Data Analysis
   Analysis Methodology
   Clustering
   Clustering the microarray data
4. Hierarchical clustering
   Hierarchical Clustering Techniques
   Advantages of hierarchical clustering
   Problems with hierarchical clustering
5. Practical Work
   A) Hardware & Software Employed
   B) Novel method for Hierarchical Clustering using Empirical Covariance
      Overview of the approach
      Critical Features of Implementation
      Algorithmic Description of Matlab implementation
      Results
   C) Novel method for Hierarchical Clustering using stereographic projection
      Overview of the approach
      Critical Features of Implementation
      Method used
      Algorithmic description of the Matlab Implementation
      Results
   D) Tests on Real Data
      Critical features of the Matlab code & the pruning decision
      About the real data set
      Verification of our result
   E) A naive method developed at the outset of project work
      Brief Description
      Test Results
6. Conclusion and Evaluation
7. Future Work Areas
8. References
9. Abbreviations
10. Diary of major milestones achieved

Declaration

I certify that all sentences, passages and figures/diagrams quoted in this thesis from other people's work have been specifically acknowledged by clear cross-referencing to the author(s), work and page(s). Furthermore, I have read and understood the definition of Unfair Means for assessed work produced in the M.Sc. Data Communications Student Handbook (pages 18/19) and have complied with its requirements. I understand that failure to comply with the above amounts to plagiarism and will be considered grounds for failure in this thesis and the degree examination as a whole.

Name : Sandeep Kumar Bairwa

Signature :

Date :

1. Introduction
Microarray technology is amongst the busiest fields in bioinformatics. Biologists use the gene expression data from microarrays to reveal biologically similar groups of genes that are meaningful and informative. With the possibility of analysing tens of thousands of genes together, compared to earlier times when only a single gene could be studied at a time, many research techniques are being developed. Microarrays have greatly reduced the time experiments take to complete while increasing the information gained. This has led to the birth of an interdisciplinary field named Bioinformatics [Lim, 2005], also known as computational biology, in which techniques from statistics, computer science, mathematics and physics are employed for enhanced learning in biology, immensely boosting the quality and standard of research and analysis in the field. Massive amounts of data can now be handled easily and promptly by machines programmed with automated learning procedures; this branch of science is known as machine learning. Another relevant field is data mining, which is specifically concerned with the study of large data sets.

With the provision for simultaneous study of huge amounts of genetic data, as well as the analysing infrastructure, the remaining problem lies in the research technique adopted to obtain correct results. Many standard techniques have been developed to cluster similar genes together so as to reveal information about unknown genes, but numerous questions about living organisms remain unanswered. Particularly relevant to this thesis, finding the optimal groups and their number from a given data set remains a tricky issue. One of the most prominent techniques employed in this regard is hierarchical clustering, which groups similar genes together and provides a simple and manageable visual aid. Various aspects of this clustering method have attracted researchers to its usage: its visual results showing the hierarchy in the data set and the choice of variable depth of clusters are some of its useful features [Eisen, 1998]. Some problems remain in its applicability, however, such as deciding how many clusters are present in a data set. Many advanced clustering techniques have also been developed, examples of which are bi-clustering and fuzzy clustering [Troyanskaya, 2003]; still, numerous papers currently employ the hierarchical clustering technique to learn from gene expression data.

In this project we have addressed the problem of determining correct clusters from a data set that comprises closely spaced and overlapping clusters. Noise was also added to the synthetic data sets in order to test the robustness of the proposed methods under varying conditions. This report begins with a gentle introduction to the field, leading up to the current analysis issues. First an essential introduction to the biology behind microarray technology is outlined, followed by the adjoining microarray features, issues and the data extraction process. In the data analysis section, the basic steps in analysing gene expression data are elaborated with a simple example.

Further, clustering is defined with particular focus on the hierarchical clustering technique and its types. In the following section all the practical work done is presented; each stage of the practical implementation is followed by its respective results, ending with a discussion of all stages together and an elaboration of future work areas.

2. Microarrays : A biological background

A concise biological background to microarrays is presented below. The material was produced using the references [Brazma, 2001] & [Schena, 2003]:

Cells : All living organisms are composed of at least one cell. Cells are extremely small in size and have their own development system.

Molecules : Within cells are molecules, which are in turn classified into: 1) small molecules, 2) proteins, 3) DNA and 4) RNA. Small molecules either have independent roles or are engaged in the upkeep of proteins, DNA and RNA. Nucleotides are the elementary building blocks that make up DNA and RNA.

Proteins act as the functional units of a cell. They are a few nanometres in size and are viewed with an electron microscope. Proteins are the actual carriers of the instructions issued by genes.

RNA is similar to DNA but differs in structure. It is an intermediary in the conversion process resulting in protein formation. RNA is categorised into mRNA, tRNA and rRNA.

DNA houses the genetic instructions used in the development and functioning of all living organisms. An organism's complete set of DNA, i.e. its complete hereditary information, is termed the genome, of which genes are the functional units. Each gene carries directive information for proteins. Results of the Human Genome Project show that, with current technology, highly repetitive DNA sequences are difficult to sequence; further, the sequenced information alone is not enough to understand the functional properties of genes. Genes are composed of DNA and are engaged in the biological functioning of a cell.

Gene Expression is the conversion process by which functional proteins are obtained from the initial DNA-level stage. We are primarily concerned with studying genes, which determine the characteristics of a living organism. The study of gene expression has become important since it has led to a greater understanding of diseases; for example, it was revealed recently that human cancer is related to changes in gene expression. The gene expression process is outlined as follows:

Genes -> (Transcription) -> mRNA -> (Translation) -> Proteins

Transcription is the initial step, synthesising mRNA from DNA. The process begins at a site identified by a promoter, where synthesis starts with the RNA polymerase unwinding the DNA helix. At completion the whole gene has been transcribed into RNA. This RNA detaches from the respective DNA and thus becomes independent. mRNA processing involves a chain of editing events named capping, polyadenylation and splicing, in which the mRNA is made stable, a continuous coding sequence is recovered and the efficiency of protein synthesis is increased. Translation is the process of deriving polypeptide chains from the mRNA after the transcription stage; amino acids are then obtained, resulting in a protein at the end.

An expression level is a scalar value indicating the cellular concentration of messenger ribonucleic acid (mRNA) molecules. These are further converted into proteins, which are used by the cells to perform their crucial functions. Scientists try to infer which genes are expressed, thereby gaining valuable information on how a cell responds to varying conditions and its own needs; in other words, the dynamics of the cell, how it responds and how it orients itself according to its needs, are revealed. There are various intricacies involved in the process itself, and it is beyond the scope of this report to go into the biological details. The general idea of the study is to explore all aspects of life in living organisms, which may involve their functioning, growth, ageing, diseases, immunities, survival needs, evolution and so on.

Microarray Technology : The following text has been referenced from [Schena, 2003].

There are many thousands of genes in a single cell; for instance, humans have about 30,000 genes [WWW1]. Microarrays are employed to study them simultaneously, quickly and efficiently. In his book, the father of microarray technology, Mark Schena, describes the advent of his innovation as similar to that of microprocessors in the electronic communication field: just as microprocessors evolved in a short time to offer high-speed computational capabilities, microarrays have made it possible to simultaneously study different genes, packed close together and many thousands in number, on a single chip. This has revolutionised the biotechnology world, which encompasses agriculture, medicine, chemistry and many more areas involving living organisms, by enabling rapid and quantitative analysis. Basically, in microarrays a continuous arrangement of fluorescent samples (prepared from RNA), affixed on a glass, silicon or plastic chip, is studied; the intensity levels of the samples are directly proportional to their expression. As the name suggests, a microarray is an array of very small size, but it holds enormous amounts of information. The array houses spots arranged in matrix form with fixed distances between them; these spots are always arranged in a uniform pattern, and the size of a single spot is 50-350 micrometres [Schena, 2003]. Just as an integrated circuit in electronics has a substrate as a base, a microarray has the chip as its substrate.

Figure 1. A microarray, Size <=1 sq. inch, Spot size ~ 0.1 mm, Spots/chip ~ 60,000 [Brazma, 2001].

In microarray technology the entire genome is presented on a single glass chip, known as a probe, which reveals the expression level of each gene.

Analysis Steps : The basic steps involved in microarray analysis are as follows:

Samples (cells, tumor samples etc.) -> (Extraction) -> mRNA -> (Conversion) -> cDNA -> (Labelling and Hybridisation) -> DNA microarray -> (Laser Scan) -> Scanned Image

These stages are depicted in the figure below :

Figure 2 [Renkwitz, 2000]

Out of the numerous ways to measure gene expression levels, the most common is to use two samples of a particular cell in two different states, healthy and diseased. Let the first sample, having healthy cells, be labelled with a fluorescent green dye and the second with a red dye. Together they are washed over a microarray slide. Hybridization takes place on the surface of the slide, in which complementary parts tend to come together. When excited by a laser, the amount of each particular RNA determines the color of the light given out by the spots: a larger amount of RNA from sample one on a spot radiates green light, whereas equal and nil (absent) quantities radiate yellow and black respectively. (The implication of the emitted light color varies with chip type; for instance, gray indicates missing data on some chips.)

Types of microarrays : Broadly they are divided into two categories:

1. One color microarrays, which employ oligonucleotides (short sequences of nucleotides). They are manufactured using numerous methods such as photolithographic arrays (Affymetrix), inkjet arrays (Agilent) and maskless arrays (NimbleGen).

2. Two color, or cDNA, microarrays.

Problems associated with the analysis steps:
1. Different amounts of sample are present at the beginning, which produce different amounts of RNA.
2. The labelling dyes have different labelling efficiencies, rendering irregular brightness levels even if the RNA amount is the same in the two samples.
3. Hybridization is in practice a non-uniform process, resulting in one sample being hybridized better than the other.
4. The signal from the scanner cannot be kept constant either; marginal variations are always present in the signal.

These problems result in slide preparations having different brightness levels, and at each stage a marginal error is induced during processing, finally amounting to incorrect gene expression values. To deal with this situation, an assumption is made that the average gene expression level across the chips does not vary, so that the expression levels of all genes on the chip fall within a common comparable range. This way of limiting the data within a bounded region is termed normalization [Gurette, 2001].

Applications of microarrays [WWW2] and [Schena, 2003]:
- All human diseases can be studied by microarrays in the search for treatments. Some examples of diseases currently being investigated on a large scale are mental illness, aging, hormonal imbalance, Alzheimer's disease, stroke, AIDS and cancer. It is worth mentioning that the technology stretches to the genomic study of bacteria, viruses, plants, fruits and animals, revealing information that was unknown before.
- Drug discovery (pharmacogenomics): reasoning as to why some drugs suit certain groups of patients better than others, and also the degree of toxicity exhibited on administration.
- Toxicological research (toxicogenomics): the relationships between toxic compounds and the responses they produce in the subjects under study.
- Gene discovery: complex genetic diseases, expression levels of thousands of genes, etc.

Broadly, studies relating to genes are well assisted by microarrays, and many new findings are constantly being made with this advance in analysis technology.


3. Data Analysis
The following text is referenced from [Schena, 2003]. TIFF (tagged image file format) images are recovered from microarrays through scanning. Quantification is the process through which these images are converted into numerical values representing the concentration levels of mRNA. Noise is induced by inaccurate labelling of the probe and by the scanning process. The numerical values are categorised into a signal value, representing the actual microarray data, and a background value indicating the noise level; numerous segmentation procedures are employed to separate the noise from the actual expression values. After segmentation, the numerical values obtained are arranged in a regular and systematic manner. The data is saved in text file formats with adjacent numeric values separated by a delimiter such as a comma, a space or a tab; such files are commonly known as delimited (e.g. tab-delimited) files and are easily transferable to different spreadsheet programs.

Data transformation : The machine-read microarray data obtained from the scanning process is in the form of 16-bit values. Handling thousands of such numbers can be cumbersome, so for convenient analysis the data is transformed into an equivalent form. The most common approach is to use log values: a large number such as 100000 is represented by 5 using log to the base 10 (a small Matlab illustration follows below).

Data arrangement : Genes are arranged row-wise and the columns represent the time variances of the experiment, resulting in the matrix form of data arrangement originally obtained from the microarray slides. For observations at times T1, T2, T3, ..., TN and genes G1, G2, G3, ..., GM, an N-dimensional space is chosen in which to plot each gene, with the columns as the dimensional coordinates and the rows as the indices. Researchers try to correlate these genes in N-dimensional graphs. This requires biologists to employ statistical techniques to establish relationships across the vast population of gene data. Also, due to the massive amounts of data generated in genomic study, researchers have to resort to machines to assist with the repetitive calculations; many computer scientists and statisticians have contributed to the development of novel and efficient methods of genetic data analysis, and many software packages are now widely available for preprocessing data, data mining and image analysis.
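As a small illustration of the log transformation described above, here is a minimal Matlab sketch; the matrix values and variable names are hypothetical:

    % Hypothetical raw expression values (rows = genes, columns = samples).
    raw = [100000 10 1;
           50000  20 2];

    % Base-10 log transform: 100000 becomes 5, making the numbers manageable.
    logData = log10(raw);

    disp(logData)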


Basically, for genetic data a similarity or dissimilarity relationship is established using various distance measurement formulae. The distances found are arranged in N-dimensional matrices, termed proximity matrices, which are measures of the similarity between genes. Parameters such as spread, orientation and direction in the genetic data are considered in order to group functionally similar genes [Eisen, 1998]. A brief outline of the practical analysis procedure for genetic data is presented in the next section.

Analysis Methodology : The expression data gained is analysed in the form of a matrix, as shown in Figure 3. A huge amount of data, as many as 15000 genes [WWW3], may be presented in a single matrix for many samples. Each element in the sample columns is a log ratio of the testing sample to the reference sample. The expressed data is normalised for meaningful statistical analysis [Shannon, 2003]. Sometimes the data set provided suffers from missing values for various reasons, such as human errors, experimental errors or unavailability of information [Jain, 1988]; this affects the results of further analysis, and some considerations might have to be made later in this regard.

Genes     Sample 1                    Sample 2
1         Numerical Data 1,1          Numerical Data 1,2
2         Numerical Data 2,1          Numerical Data 2,2
3         Numerical Data 3,1          Numerical Data 3,2
...       ...                         ...
15000     Numerical Data 15000,1      Numerical Data 15000,2

Figure 3

In order to study genes it is necessary to group the data based on similarity. In this way, many genes that are difficult to analyse individually can be studied with reference to other genes similar to them in functionality and regulatory operations. Grouping thus helps reveal information about unknown genes with the help of known genes in the same group; as elaborated in [Bryan, 2004], unexplored genes can be studied with the help of genes related to them via clustering.


Clustering : Basically, clustering is a process by which many individual items are brought together to form a single group of similar items. In this way a number of groups are produced, each containing items that are similar to each other but dissimilar to the items in other groups. With regard to clustering microarray data, it is the process in which genes are arranged such that similar genes are brought together in one group, away from unlike genes. Grouping, or clustering, is done on the basis of similarity, specifically the correlation between the genes' behaviour.

The general approach taken when clustering data:

a) Assimilation of test data : Gathering data for analysis is an important step before starting the clustering process. The type and format of the data decide which clustering methods can be employed on it. It should also be confirmed that the data gathered is well suited to the technological standards available to the researcher for analysis.

b) Pre-processing the data : The data is conditioned to make it suitable for coherent analysis. This may include transformations of the data into different domains; for example, the data obtained is normalised as discussed before. Visual help can be taken, in which the data is projected into a different dimension. Non-linear projections [Jain, 1988] are employed more than linear ones, since they ensure an accurate transformation of the data into lower dimensions; this is especially useful when the data is distributed in higher dimensions (>2).

c) Conversion into data models : For efficient analysis the data can be modelled by distributions with specific parameters. Such approaches help in determining the correct clustering result. Many different distributions are employed, such as the Gaussian distribution, the von Mises-Fisher distribution on the sphere [Banerjee, 2005] and the Kent (FB5) distribution on the sphere [Peel, 2001].

d) Validation : On rare occasions it is checked whether clustering can be done at all, or whether the expression data is thoroughly random. In the second case the researcher may choose analysis techniques other than clustering.

e) Clustering approach : At the outset the researcher chooses a method for clustering; for instance, it has to be decided whether hierarchical clustering or the partitional type is more appropriate.

f) Verification : Different techniques are employed to check the correctness of the clustering result. For example, noise may be added to the data to check the variation in the result, or multiple methods may be adopted and their results compared.

Advantages of model based clustering :


There has been extensive research in this area compared to other approaches, so it is backed by numerous proven results. Fitting the data set to a model helps in determining the localisation and intensity characteristics of the individual groupings, and a more realistic partitioning of the groups is possible with this approach. For example, if two Gaussian distributions are considered, the overlap does not force the tails to be rigidly cut off, since the distribution characteristics are maintained; thus overlapping clusters can be effectively separated.

Clustering is broadly divided into two types [Shannon, 2003] :
1) Supervised, or discrimination, or extrinsic.
2) Unsupervised, or clustering, or intrinsic.

In supervised classification two things are presented for analysis: the expressed measurement data and the characteristics of the sample as a priori knowledge. In unsupervised learning, only the measurement data is available. As little is known about the gene expression for a particular state, the latter method is generally more suitable. Amongst the numerous clustering methodologies, some of the most noticeable statistical methods are hierarchical and k-means clustering. In the k-means method the cluster divisions are set initially to a desired number; the only work to be done is to allocate the respective members of the groups based on a similarity index chosen by the researcher. The term k represents the finite number of groups defined before the assignment process begins (a short Matlab sketch follows at the end of this section). In hierarchical clustering, by contrast, successive clusters are derived in steps from the previous ones.

Clustering the microarray data : Since microarrays provide massive amounts of data to be analysed, various automated data mining tools are being developed. The gene expression data provided by microarrays can be clustered in two different ways:
1) Clustering the genes.
2) Clustering the time variances/samples.
Both approaches provide information, resulting in the grouping of similar genes in case 1) and the sectioning of similar groups in case 2). Biologists suggest that only some portion of the gene expression data is actively involved in cellular processes and functioning, so it is advantageous to study only the part of the genes that is involved. In the second case, clustering within the sample groups evaluates only some part of the gene groups; this particular way of clustering is termed sub-space clustering [Jiang, 2004].
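As promised above, a sketch of the k-means procedure, assuming Matlab's Statistics Toolbox 'kmeans' function; the data and the choice k = 2 are purely illustrative:

    % Two synthetic groups of two-dimensional points.
    data = [randn(50, 2); randn(50, 2) + 4];

    % k is chosen in advance; members are then allocated by similarity.
    k = 2;
    [idx, centres] = kmeans(data, k);

    % idx(i) is the group assigned to data(i,:); centres holds the k means.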


4. Hierarchical clustering
Hierarchical clustering of microarray data has become a common way to identify patterns, due to its simplicity and visual clarity [Eisen, 1998]. The unsupervised (clustering) approach is generally suited to gene expression data. It is known that genes are not involved in a single process at all times but may be involved partially in multiple processes, so it is not advisable to assign genes to a single group; this suggests the inaccuracy of partitional clustering approaches [Turner, 2005]. Most clustering methods require a distance or proximity matrix to be worked out from the given data. A distance matrix simply holds the distances between each row and all other rows in the data set. Some commonly used distance measures (used to find the similarity between genes) are summarised as follows.

Euclidean or straight-line distance : For two points D and E in an n-dimensional Euclidean space, the distance between them is

d(D, E) = \sqrt{ \sum_{i=1}^{n} (d_i - e_i)^2 }

Mahalanobis Distance : The distance of a data point x = (x_1, x_2, x_3, \ldots, x_n)^T from a group of points having mean \mu = (\mu_1, \mu_2, \mu_3, \ldots, \mu_n)^T and covariance matrix \Sigma is given as

D_M(x) = \sqrt{ (x - \mu)^T \Sigma^{-1} (x - \mu) }

Pearson's Correlation coefficient : Also known as the product-moment coefficient of correlation. Let two variables X and Y be Gaussian; the correlation coefficient is then given by

\rho(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sigma_X \, \sigma_Y}

This presents a numerical estimate of how much the two variables are correlated. It can take values from -1 to 1, representing perfect negative (inverse) and perfect positive correlation respectively, whereas 0 means no correlation. The joint distribution of X and Y is bivariate normal.

Generally, for normalised data the Euclidean distance measure is used. For a data set as shown in Figure 3, having 15000 genes, a distance matrix of order 15000 by 15000 will be generated. For example, take 3 genes whose expression values are measured in two samples:

Genes    Sample1   Sample2
Gene1    1.5        1.2
Gene2    0.5       -2.5
Gene3    2          1

The distances are calculated as follows :

distance(Gene1, Gene2) = \sqrt{ (1.5 - 0.5)^2 + (1.2 - (-2.5))^2 } = \sqrt{14.69} \approx 3.83

distance(Gene1, Gene3) = \sqrt{ (1.5 - 2)^2 + (1.2 - 1)^2 } = \sqrt{0.29} \approx 0.54

distance(Gene2, Gene3) = \sqrt{ (0.5 - 2)^2 + (-2.5 - 1)^2 } = \sqrt{14.50} \approx 3.81

Thus the corresponding distance matrix is:

         Gene1   Gene2   Gene3
Gene1    0       3.83    0.54
Gene2    3.83    0       3.81
Gene3    0.54    3.81    0

Some properties of the distance matrix worth noticing:
- Distances on the diagonal are all zeros.
- All distances are non-negative.
- It is always a symmetric matrix, i.e. distance(Gene1, Gene2) = distance(Gene2, Gene1).
- The distances of interest number N(N-1)/2 for N genes; thus the upper half or the lower half is sufficient to specify the complete matrix.
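The worked example above can be checked with Matlab's 'pdist' and 'squareform' functions (Statistics Toolbox); a minimal sketch:

    % Expression values of the three genes in the two samples.
    genes = [1.5  1.2;   % Gene1
             0.5 -2.5;   % Gene2
             2.0  1.0];  % Gene3

    % pdist returns the N(N-1)/2 distances of interest as a row vector.
    d = pdist(genes, 'euclidean');   % approximately [3.83 0.54 3.81]

    % squareform rebuilds the full symmetric matrix with a zero diagonal.
    D = squareform(d);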

Individual genes have now been associated with other genes in terms of distance. A group of points, or genes, lying close to each other is said to form a cluster. Figure 4 shows three well separated clusters.

Figure 4

In hierarchical clustering, similarity amongst the clusters can be determined by linkage methods such as single linkage (nearest neighbour), average linkage and complete linkage (furthest neighbour). The single linkage algorithm computes the minimum distance available between two genes in two clusters, best shown pictorially in Figure 5. For two clusters r and s of a data set S, the minimum distance between two genes is found as

d(r, s) = \min_{i, j} \, \mathrm{dist}(x_{ri}, x_{sj})

where x_{ri} and x_{sj} run over the genes of the two clusters.


Figure 5 [WWW4]

Whereas complete linkage takes the largest distance, average linkage takes the mean of all pairwise distances between the points of the two clusters. The mathematical interpretation is summarised below, taken from [WWW5]. Let n_r be the total number of data points of a grouping named r, and n_s the total number of data points of another group named s. Let x_{ri} be the i-th data point in the first group and x_{sj} the j-th data point in the second group. The measures can then be described as follows:

1. Single Linkage :

d(r, s) = \min \, \mathrm{dist}(x_{ri}, x_{sj}), \quad i \in (1, \ldots, n_r), \; j \in (1, \ldots, n_s)

2. Complete Linkage :

d(r, s) = \max \, \mathrm{dist}(x_{ri}, x_{sj}), \quad i \in (1, \ldots, n_r), \; j \in (1, \ldots, n_s)

3. Average Linkage :

d(r, s) = \frac{1}{n_r n_s} \sum_{i=1}^{n_r} \sum_{j=1}^{n_s} \mathrm{dist}(x_{ri}, x_{sj})

Discussion : The single linkage method can result in long clusters, some long enough to reach out and join a nearby cluster, so fewer clusters are generated. The complete linkage method results in tight, isolated clusters, thereby giving a larger number of clusters. Average linkage is mostly employed in clustering due to its intermediate results.

In hierarchical clustering the most common approach is to treat each gene as a cluster, then find the nearest gene using the distance matrix. The two genes found to be nearest (i.e. most similar) are joined to form a new cluster. The individual clusters of these two merged genes are replaced with the single new cluster formed using one of the linkage methods described previously.

This step thus reduces a data set of N clusters to N-1. It is not mandatory to merge only two genes at a time; a larger number of genes can be merged at the same instance. For example, the 5 closest genes may be merged together to give the first new cluster. This process results in a hierarchy of the genes in the data set according to their distance.

Hierarchical clustering is divided into two groups:
1. Divisive
2. Agglomerative

The common approach stated above is agglomerative, in which the whole data set finally converges into one big cluster. Divisive clustering is just the opposite: the whole data set is iteratively divided into fragments until each gene is singled out into a node of its own. The name hierarchy suggests a tree representation, which is an inherent advantage of the hierarchical clustering technique: to study the result a diagrammatic representation is used, since it is easier for a human being to comprehend a picture. The convenient tree structure drawn using the result of hierarchical clustering is named a dendrogram.

A basic example elaborates this theory. Given a set of genes/objects X = {X1, X2, X3, X4}, we choose the distance between the genes as the criterion for measuring their similarity, i.e. the genes with the least distance between them are the most similar, and vice versa. A general progressive hierarchical clustering then results in a dendrogram as shown in Figure 6. The genes X1 and X2 are closest to each other and are therefore merged into a single cluster. Since this resulting cluster is found to be closest to gene X3, they are merged together in the next step. The process continues until all the genes are clustered. The U-shaped lines joining the clustered genes are the most important part of the dendrogram: they reflect the distance between the joined genes, read on the vertical scale, and give a clear view of the clustering topology. From them it can easily be decided where the tree should be cut, or pruned, to produce a desired clustering. The final tree can be clustered as follows: X1, X2 & X3 in one group/cluster and X4 in the other group; it is also correct if X3 and X4 form individual groups, leaving X1 and X2 together in one group.


Either way, it can be noticed that if a horizontal line is drawn at different heights across the vertical lines joining the nodes in the dendrogram, a valid grouping is found; all clusters resulting from this operation are valid clusters.

Figure 6
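This pruning can also be done programmatically rather than by eye. A minimal Matlab sketch, assuming the Statistics Toolbox, with four one-dimensional points standing in for X1, X2, X3, X4:

    % X1 and X2 are closest, X3 nearby, X4 far away.
    X = [0; 0.5; 2; 6];

    Z = linkage(pdist(X), 'single');
    dendrogram(Z);                    % draw the tree of U-shaped links

    % Cut the tree into exactly two groups: {X1,X2,X3} and {X4}.
    two = cluster(Z, 'maxclust', 2);

    % Or cut at a fixed height: links longer than 1.0 are severed,
    % here giving {X1,X2}, {X3} and {X4}.
    byHeight = cluster(Z, 'cutoff', 1.0, 'criterion', 'distance');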

Hierarchical Clustering Techniques :

1) Standard Hierarchical Clustering : The basic algorithm for agglomerative hierarchical clustering, referenced from [Matteucci, 2003], is:

1. Begin with the data set of length N.
2. Compute the similarity matrix, i.e. a square matrix of size N.
3. Assume each data point from 1 to N to be a cluster.
4. Join the two closest clusters in the data set to form a single new cluster, thus reducing the size of the current data set by one.
5. Find the distances of all the clusters in the current data set.
6. Repeat from step 4 with the current data set until all data points lie within one cluster.
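A naive Matlab rendering of these steps follows (single linkage, stopped at k clusters rather than one; illustrative only, as Matlab's built-in 'linkage' is preferable in practice, and 'pdist' requires the Statistics Toolbox):

    function labels = naiveagglom(X, k)
    % Agglomerative single-linkage clustering following steps 1-6 above,
    % stopped when k clusters remain instead of one.
    N = size(X, 1);                    % step 1: data set of length N
    D = squareform(pdist(X));          % step 2: N-by-N distance matrix
    D(1:N+1:end) = Inf;                % ignore zero self-distances
    labels = (1:N)';                   % step 3: each point is its own cluster
    while numel(unique(labels)) > k
        [m, p] = min(D(:));            % step 4: globally closest pair of points
        [i, j] = ind2sub([N N], p);    %         (single link = nearest point pair)
        labels(labels == labels(j)) = labels(i);   % join the two clusters
        inC = (labels == labels(i));   % step 5: distances inside the merged
        D(inC, inC) = Inf;             %         cluster no longer compete
    end                                % step 6: repeat until k clusters remain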

An unavoidable problem associated with this routine is that, even in the best case, its time complexity is of order N^2 for a data set of length N.

2) Bayesian Hierarchical Clustering : The Bayesian hierarchical clustering used in [Heller, 2005] is outlined below. Marginal likelihoods are used to determine the suitable combination of clusters. The algorithm is similar to agglomerative hierarchical clustering, the difference being that the decision to join two clusters is based on a hypothesis test; a Dirichlet process mixture model is used. The paper discusses the limitations of traditional approaches, such as finding finite clusters, the pruning decision and the choice of similarity measure, and notes that, since the traditional approaches do not employ probability models, the quality of the clustering cannot be tested. Initially the data set is considered to contain N trees, namely all the points contained in it. At each step of grouping, all possible tree combinations are tested. The first hypothesis is that the data in the currently joined group belongs to a common probability distribution; this helps in determining the optimality of the formed group. The second hypothesis is that the current data set instead contains two or more subgroups. From the two hypotheses the posterior probability is determined, using which the final grouping is done. The attractive features are that the groupings are justified by the probabilistic approach, and that the method can be adapted to many other kinds of data sets by changing the adopted mixture model. It also indicates where to prune the hierarchical tree by using the posterior probability value; the algorithm for this is presented in [Mulajkar, 2006].


Advantages of hierarchical clustering :
- Genes/objects are ordered, resulting in easy and manageable comprehension.
- Unprecedented findings are favoured due to the small size of the clusters produced.
- For vast data such as microarray data it is difficult to pre-define partitions or supply prior classification information; for such data sets hierarchical clustering is well suited to generating an extensive result incorporating the whole data set.
- A visually immediate result is obtained.
- A wide variety of cluster types can be formulated by using the various linkage methods.
- A wide variety of clusterings, based on similarity, can be achieved by pruning the tree at a desired level.

Problems with hierarchical clustering :
- An incorrect assignment at an earlier stage cannot be corrected at later stages of the clustering process.
- The results of two or three samples/experiments are easy to study on a two- or three-dimensional graph, but it becomes difficult to analyse many experiments together in multiple dimensions.
- Since different metrics result in different clusterings, it becomes mandatory to compare all the results and find an optimum, which is a tedious process; this clearly indicates the unavailability of a standard clustering technique.
- The methods used are predisposed to produce clusters of specific shapes. For instance, the single linkage method churns out elongated clusters, whereas average linkage gives spherical clusters.
- There is little or no provision in the standard approaches to deal with overlapping clusters. If, say, the data is modelled as a mixture of two overlapping Gaussians, the tails of each distribution encroach into the other distribution's region; these are rigidly cut off after clustering and assigned to the cluster to which they lie closest in terms of distance.
- Problems arise in dealing with non-numerical data.
- Noise associated with the data is another problem.
- Re-evaluation of the clustering is not possible.

These deficiencies have given way to the development of novel clustering techniques, and this remains an open field demanding a standard approach.


5. Practical Work
A) Hardware & Software Employed
During software coding, some functions in the Bio-Informatics toolbox of the Matlab software, Version 7.1.0.246 of The MathWorks, Inc., were used. A brief description of the functions, as used in the coding, is presented below.

'pdist' : Using this function the distances between two points can be calculated with an appropriate distance metric such as Euclidean, Mahalanobis, Spearman or Cityblock. If a data set with more than two points is provided, the function computes the distances between all combinations; thus for a data set having m points, the distances are (m-1)m/2 in number. These distances are presented in a row, starting with the distances of all other points from the first point, then the distances of the second point from the rest of the points, and so on. This function is used in combination with others to perform clustering operations.

'linkage' : This function usually follows the pdist function to generate the clustering tree; if used independently, the format of its input should be the same as the pdist output. It determines the similar data points and their distances in order to cluster them together, based on the single link, average link, complete link etc. measure chosen by the user. It starts with all the data points considered as leaves in individual clusters, which are then joined to form the hierarchical tree. The result of this function is a matrix of dimension (d-1)-by-3, where d is the total number of data points; the first two columns hold the data point indexes and the third their distance. Each time a new cluster is formed it receives a new index, which is the sum of its row of occurrence and the total length of the original data set. Thus if two data points with indexes 1 and 9 are joined into a new cluster in the n-th row of the matrix returned by the linkage function, the new index is numbered (n+d) and can be used further to generate a new cluster.

'dendrogram' : This plots the hierarchical tree using the output of the linkage function. The tree structure consists of U-shaped lines that join two or more clusters; the height of these lines represents the distance between the joined clusters. The user can choose the number of nodes or leaves displayed in the tree, so when too many nodes are present (in Matlab, more than 30) one displayed node represents many nodes, giving a manageable visual result. The dendrogram function is employed to separate clusters in this thesis. Since the case of overlapping clusters was investigated exclusively during this work, it was easier to first separate the conspicuous clusters and then deal with the points that seemed difficult to group. The software implementation successfully separates overlapping clusters, as described in a later section, which were not separable using the functions of the Matlab software alone.
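A minimal sketch of the pipeline formed by these three functions, on illustrative synthetic data (Statistics Toolbox assumed):

    % Two well separated synthetic groups, 30 points each.
    data = [randn(30, 2); randn(30, 2) + 6];

    d = pdist(data, 'euclidean');  % (m-1)m/2 pairwise distances in a row vector
    Z = linkage(d, 'average');     % (m-1)-by-3 matrix: joined indexes + distance
    dendrogram(Z);                 % U-shaped tree (>30 leaves are grouped)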

'seed' : To generate normally distributed random numbers, Matlab's 'randn' function is used, which chooses a seed, based on the computer's clock time, from which to produce a series of random numbers through multiplicative operations [WWW6]. To plot a particular case of overlapping Gaussian clusters, a constant seed is given to the 'randn' function.

All coding has been done in the Matlab environment. Hardware requirements have been minimal; a personal computer of the following specification was used: Intel CPU T2300 @ 1.66 GHz (dual core processor). Due to the case-specific nature of the study, computation time has not been addressed in this work, although, to save processing time in the test runs, variables requiring large amounts of system memory were pre-initialised.
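For instance, a minimal sketch using the legacy generator syntax of the Matlab 7.x era (current releases would use rng instead; the seed value here is arbitrary):

    % Fix the generator state so the same 'random' data set is reproduced.
    randn('state', 42);
    A = randn(5, 2);

    randn('state', 42);
    B = randn(5, 2);

    isequal(A, B)   % 1 (true): the same seed gives the same sequence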

B) Novel method for Hierarchical Clustering using Empirical Covariance


Overview of the approach : Biologists look at groups of similar genes to analyse them and gain information. This essentially requires clearly separating the dissimilar genes, which can be done by hierarchical clustering. It was noticed, however, that in the case of closely spaced or overlapping clusters the dendrogram is difficult to prune, since there is no clear separation of groups; this is evident from Figure 9, which was generated using Matlab's Bioinformatics toolbox. A successful technique was developed to resolve this problem. The method is used on overlapping clusters in which the overlapping regions are scarcely populated, which minimises the errors that may arise from false allocations. First it is assumed that the overlapping clusters are normally distributed, so that these Gaussian mixture components have their individual characteristic parameters, mean and variance. The most interesting feature of modelling a group of points as normally distributed is that the density of points is highest at the focal point of the distribution and tends to zero towards its boundaries: the region near the mean is heavily populated and, further from the mean, the population decreases, giving scarcely populated regions. This is the key to the cluster separation program. First the dense regions near the means are found, setting aside the scarcely populated areas of the distributions; this separation of points results in well separated clusters. Since Matlab can easily detect well separated clusters, its 'dendrogram' function can be used, and with it densely populated, well separated clusters were obtained. The only remaining problem is assigning the points from the scarcely populated areas of the distributions, which lie away from the

mean. This is done using the Mahalanobis distance formula, which helps in estimating the probability of a point being part of a distribution, based on the mean and empirical covariance of the individual distribution. So the means and empirical covariances of the dense regions were found, and each remaining point was tested for its association with either region/cluster; the region giving the higher probability value according to the Mahalanobis distance formula was assigned the tested point. Using this approach the whole data set is finally divided into conspicuous clusters.

Critical Features of Implementation :
1) Synthetic Data : Generated using Matlab's 'randn' function.
2) Similarity measure : Euclidean distance.
3) Dense region separation : Matlab's 'dendrogram' function.
4) Probabilistic data point assignment : Mahalanobis distance.

Synthetic Data : The artificial bivariate normal data set generated using Matlab's command randn is stretched into desired elliptical shapes by an appropriate design matrix. Each data point is considered a cluster, so an agglomerative, bottom-up approach is followed. Based on the empirical covariance, the data points are grouped into clusters.

Covariance : The measure of dependence between two random variables X and Y, defined as

COV(X, Y) = E(XY) - E(X)E(Y)

If X and Y are independent, then COV(X, Y) = 0.

Empirical Covariance : For data points x_1, \ldots, x_N with sample mean \bar{x}, the empirical covariance matrix is

\hat{\Sigma} = \frac{1}{N-1} \sum_{i=1}^{N} (x_i - \bar{x})(x_i - \bar{x})^T

The distance matrix is generated using the Euclidean distance formula discussed earlier. To assign a point to one of, say, two clusters, the Mahalanobis distance is used, which compares, in terms of probability,

D_1(y) = (y - m_1)^T \Sigma_1^{-1} (y - m_1) \quad \text{against} \quad D_2(y) = (y - m_2)^T \Sigma_2^{-1} (y - m_2)

where y is the unassigned point, C_1 and C_2 are clusters one and two, m_1 and m_2 are the means of the two clusters, and \Sigma_1^{-1} and \Sigma_2^{-1} are the inverse covariance matrices. A smaller Mahalanobis distance corresponds to a higher probability of membership.

Thus the probabilities of a point y belonging to cluster 1 or cluster 2 are compared.

Algorithmic Description of the Matlab implementation :
1. Generate the distance/proximity matrix from the available data set.
2. Sort the distance matrix in ascending order, so that in each column the first row shows the distance of the point with itself (i.e. 0) and, going downward, the distance increases, showing ever more distant neighbours of the data point represented by the column number.
3. Sum the rows column-wise, starting from the topmost row, over a number of rows corresponding to 10% of the data length. This results in a row vector whose entries are the summed distances to a fixed number of nearest neighbours of each point, represented by its column number.
4. Sort this row vector in ascending order, keeping the column indexes, so that the nodes with the smallest nearest-neighbour summations are arranged at the beginning of the row. The row vector has length equal to that of the original data set.
5. To separate the dense regions, gather the indexes of the first S summation values, where S is equivalent to 60% of the data length.
6. Run Matlab's hierarchical clustering on the densely populated nodes found above, to obtain well separated clusters of dense regions.
7. Find the mean and inverse covariance of these clusters.
8. Assign each remaining point to one of these clusters according to the Mahalanobis distance between the point and the cluster.
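A condensed Matlab sketch of the steps above, assuming two clusters and the Statistics Toolbox; the data, seed and variable names are illustrative, and this is an interpretation of the algorithm rather than the original project code:

    % Synthetic overlapping data: two stretched Gaussian clouds.
    randn('state', 1);
    X = [randn(100, 2) * [2 0; 0 0.5]; randn(100, 2) * [2 0; 0 0.5] + 2.5];
    N = size(X, 1);

    % Steps 1-4: density score per point = summed distance to nearest 10%.
    D = sort(squareform(pdist(X)));          % column j: sorted distances from j
    score = sum(D(2:round(0.10 * N), :));    % small score = densely surrounded

    % Step 5: keep the densest 60% of points.
    [s, order] = sort(score);
    dense = order(1:round(0.60 * N));

    % Step 6: hierarchical clustering on the dense points only.
    T = cluster(linkage(pdist(X(dense, :)), 'average'), 'maxclust', 2);

    % Step 7: mean and inverse covariance of each dense region.
    m1 = mean(X(dense(T == 1), :));  S1 = inv(cov(X(dense(T == 1), :)));
    m2 = mean(X(dense(T == 2), :));  S2 = inv(cov(X(dense(T == 2), :)));

    % Step 8: assign the remaining points by smaller Mahalanobis distance.
    labels = zeros(N, 1);
    labels(dense) = T;
    for i = setdiff(1:N, dense)
        d1 = (X(i,:) - m1) * S1 * (X(i,:) - m1)';
        d2 = (X(i,:) - m2) * S2 * (X(i,:) - m2)';
        labels(i) = 1 + (d1 > d2);    % cluster with the higher probability
    end
    % labels now holds the final cluster index of every point.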

Screenshots of results : The synthetic data set is shown in Figure 7.

Figure 7 : Case of two overlapping clusters.


Result of the novel algorithm proposed on the above data set :

Figure 8 : Two overlapping clusters identified correctly (top) and the corresponding dendrogram (below).

The above results show the successful determination of two overlapping clusters. The dendrogram clearly shows two clusters far apart from each other, indicated by the long U-shaped line joining them; the dendrogram can now easily be cut to separate the two clusters. The same data was clustered using Matlab's Bioinformatics toolbox functions, with the following result:

Figure 9 : Using only Matlab's functions on the expression data, without first finding the dense regions as done in our novel method, the resulting tree cannot be pruned to obtain conspicuous clusters, showing the failure of the Matlab functions in this case of overlapping clusters.


C) Novel method for Hierarchical Clustering using stereographic projection.


Overview of the approach : Clustering based on the directionality of the gene expression data is another way of grouping similar genes. This approach was recently adopted in [Dortet-Bernadet, 2007], where expression data is distributed based on its directionality. The proposed method is advantageous in dealing with the different shapes that a data distribution may take. In the paper, expression data is distributed on the surface of a sphere using a density function whose parameters are derived by the EM algorithm, and various methods of dimensionality reduction are proposed for performing clustering and finding the number of clusters in a data set. The novel method proposed here uses the similar concept of realising the expression data on a sphere, together with a technique for dimensionality reduction by projecting the spherical data onto a plane, and a decision for cutting the dendrogram tree to obtain clusters.

Gene expression data is usually available with many samples for a single gene, which suggests that, dimensionally, the gene data is distributed on a hypersphere. The novel method proposed here, based on directionality, initially assumes the multidimensional gene expression data to be distributed on a sphere; the data is normalised so as to be spread on the surface of a sphere of radius one. Using the same normal distribution property as before, the dense regions lying on the surface of the sphere are found, the difference being that the similarity measure chosen to group the data is based on correlation coefficients rather than Euclidean distances. With the dense areas on the sphere surface well separated, Matlab's 'dendrogram' function can be used to separate each dense region found.

Stereographic projection maps the points lying on the curved surface of the sphere onto a two-dimensional plane surface, where clustering is easier to perform. The dense regions found are first projected onto a plane tangent to the sphere at a point, namely the mean of the points in the respective dense region; this procedure is described mathematically in a later section. In order to assign the remaining points to their respective clusters, all the remaining points are projected for each dense region, based on the region means and covariances. Having projected the remaining points onto a plane surface, the Gaussian Bayes Classifier [Moore, 2001] is used to determine the probability of a point's association with each of the dense regions; comparing the probabilities found for a test point across the dense regions, the assignments are made, completing the clustering operation.


Critical Features of Implementation : 1) Synthetic Data : Generated using Matlab's 'randn' function. 2) Similarity measure : Pearson's correlation coefficient. 3) Dense Region separation : Matlab's 'dendrogram' function. 4) Probabilistic Data Point assignment : Gaussian Bayes Classifier. Method used : Let the complete dataset be represented as a set of vectors :

[ e1 , e2 , e 3 ... eM ]
K where e i R , i = (1, 2, 3 ... M) and K is the size of data set or the dimension.

Amongst this data set a dense or highly similar region consists a subset of points that are :

[ e1 , e2 , e 3 ..... e N ]
Mean of the dense region is given by :

1/ N e i =
i =1

e Consider a sphere having a vector drawn from the origin 'o' to a point on its surface 'e' as shown in Figure 10. Consider a tangent plane to the point 'e' as shown in Figure 11. On this plane be located a point 'x'. A line joining the point 'e' to 'x' is named as y . This is shown if Figure 12.


Figure 10

Figure 11

Figure 12

Figure 13

From Figure 12 it can be stated that

$x = \bar{e} + y \qquad \text{(A)}$

Taking the inner product with $\bar{e}$,

$\bar{e}^T x = \bar{e}^T \bar{e} + \bar{e}^T y = \bar{e}^T \bar{e} = 1$

since $y$ lies in the tangent plane and is therefore perpendicular to $\bar{e}$, and $\bar{e}$ has unit length. Therefore

$\bar{e}^T x = 1 \qquad \text{(B)}$

Consider a point $e_1$ located on the sphere, and let $\tilde{e}_1$ be the projection of $e_1$ onto the tangent plane, as shown in Figure 13. Since the projection lies on the ray through $e_1$, set

$\tilde{e}_1 = \lambda e_1 \qquad \text{(C)}$

Using result (B), $\bar{e}^T \tilde{e}_1 = 1$. Multiplying (C) on both sides by $\bar{e}^T$ gives

$\bar{e}^T \tilde{e}_1 = \lambda\, \bar{e}^T e_1 = 1$

or

$\lambda = 1 / (\bar{e}^T e_1) \qquad \text{(D)}$

Using result (D) in (C) :

$\tilde{e}_1 = e_1 / (\bar{e}^T e_1) \qquad \text{(E)}$

Thus all the points of a dense region can be projected onto the plane using result (E), giving

$[\,e_1, e_2, e_3, \ldots, e_N\,] \rightarrow [\,\tilde{e}_1, \tilde{e}_2, \tilde{e}_3, \ldots, \tilde{e}_N\,] = [\,e_1/(\bar{e}^T e_1),\; e_2/(\bar{e}^T e_2),\; \ldots,\; e_N/(\bar{e}^T e_N)\,]$
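A minimal Matlab sketch of result (E) is given below; it assumes the dense-region points are stored row-wise in a matrix D (already normalised onto the unit sphere), and the variable names are illustrative only:

    % Project dense-region points (rows of D) onto the plane tangent to the
    % unit sphere at the normalised mean ebar, using result (E):
    % e_projected = e / (ebar^T e).
    ebar = mean(D, 1);
    ebar = ebar / norm(ebar);                 % renormalise the mean onto the sphere
    s = D * ebar';                            % inner products ebar^T e_i, one per point
    Dproj = D ./ repmat(s, 1, size(D, 2));    % projected points on the tangent plane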


Since there may be more than one dense region of closely spaced data points, after separating the well-spaced individual dense regions the following parameters can be computed for each :

Dense region 1 : $\bar{e}^1$, $\Sigma^1$, $[\,e^1_1, e^1_2, e^1_3, \ldots, e^1_N\,]$
Dense region 2 : $\bar{e}^2$, $\Sigma^2$, $[\,e^2_1, e^2_2, e^2_3, \ldots, e^2_N\,]$
. . .
Dense region S : $\bar{e}^S$, $\Sigma^S$, $[\,e^S_1, e^S_2, e^S_3, \ldots, e^S_N\,]$

where the $\bar{e}^s$ and $\Sigma^s$ are the respective means and covariances and $[\,e^s_1, e^s_2, \ldots, e^s_N\,]$ the sets of dense points.

Now, from the initial data set $[\,e_1, e_2, e_3, \ldots, e_M\,]$, let the remaining non-dense points be $f_1, f_2, f_3, \ldots, f_R$. To assign these remaining points to their most similar groups, they too are first projected onto the tangent plane of each dense region, giving

$[\,f_1, f_2, f_3, \ldots, f_R\,] \rightarrow [\,\tilde{f}_1, \tilde{f}_2, \tilde{f}_3, \ldots, \tilde{f}_R\,]$

For the $S$ different dense regions, compute :

$P(\tilde{f}_1 \mid \bar{e}^1, \Sigma^1),\; P(\tilde{f}_1 \mid \bar{e}^2, \Sigma^2),\; \ldots,\; P(\tilde{f}_1 \mid \bar{e}^S, \Sigma^S)$
$P(\tilde{f}_2 \mid \bar{e}^1, \Sigma^1),\; P(\tilde{f}_2 \mid \bar{e}^2, \Sigma^2),\; \ldots,\; P(\tilde{f}_2 \mid \bar{e}^S, \Sigma^S)$
. . .
$P(\tilde{f}_R \mid \bar{e}^1, \Sigma^1),\; P(\tilde{f}_R \mid \bar{e}^2, \Sigma^2),\; \ldots,\; P(\tilde{f}_R \mid \bar{e}^S, \Sigma^S)$

where

$P(f \mid \bar{e}, \Sigma) = \dfrac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(f - \bar{e})^T \Sigma^{-1} (f - \bar{e})\right)$

is the Gaussian density used by the Gaussian Bayes Classifier [Moore, 2001], with $d$ the dimension of the projected points. Comparing the probabilities computed by the above relation, the rest of the points can be assigned to their respective clusters.
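This assignment step can be sketched in Matlab using the Statistics Toolbox function 'mvnpdf', which evaluates the density above; Fproj, means{s} and covs{s} are assumed to hold the projected non-dense points and the per-region parameters (illustrative names, not those of the project code):

    % Assign each projected non-dense point to the dense region under which
    % its Gaussian density P(f | mean, covariance) is highest.
    R = size(Fproj, 1);
    S = numel(means);
    P = zeros(R, S);
    for s = 1:S
        P(:, s) = mvnpdf(Fproj, means{s}, covs{s});   % density of all points under region s
    end
    [maxP, labels] = max(P, [], 2);                   % cluster label of each remaining point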

Algorithmic description of the Matlab Implementation :


- Normalise the data so that the data distribution lies on the surface of a sphere.
- Generate the proximity matrix based on Pearson's correlation coefficients.
- Sort the proximity matrix in descending order so that in each column the first row shows the correlation of a vector with itself, which is maximal and equal to 1. Moving down a column of this sorted matrix, the correlation value decreases, from the most similar points at the top to the least similar at the bottom.
- Sum the rows column-wise, starting from the topmost row. The number of rows summed equals 5% of the total length of the data set. This yields a row vector containing, for each data point (indexed by its column number), the summed correlations of its fixed number of most similar neighbours (see the Matlab sketch after this list).
- Sort this row vector in decreasing order while recording the column indices, so that the nodes with the most similar neighbour summations appear at the beginning of the row. The row vector has length equal to that of the original data set.
- To separate the dense, most similar regions, gather the indices of the first 'S' summation values, where 'S' equals 60% of the complete data length.
- Run Matlab's hierarchical clustering on the densely populated nodes found above to obtain well-separated clusters of dense regions.
- Project the two dense regions onto planes tangent to the sphere at the mean of each cluster, and set aside the remaining points of the initial data set.
- Having projected the two dense regions, find their respective covariances and means, then project the remaining points onto the same plane.
- Compute the probability of each remaining point under each projection to test its association with either of the two clusters.
- Assign all the remaining points to their respective clusters by comparing the probabilities of the two cases.
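A compressed Matlab sketch of the dense-region search in the steps above is given below, with the 5% and 60% fractions as stated in the text (variable names are illustrative):

    % Find the densely populated nodes from the correlation-based proximity matrix.
    C = corrcoef(E');                            % Pearson correlations between genes (rows of E)
    Cs = sort(C, 1, 'descend');                  % per column: self-correlation (1) at the top
    k = ceil(0.05 * size(E, 1));                 % 5% nearest neighbours per point
    density = sum(Cs(1:k, :), 1);                % neighbour-similarity score per point
    [vals, idx] = sort(density, 'descend');      % most similar neighbourhoods first
    denseIdx = idx(1:ceil(0.60 * size(E, 1)));   % indices of the densest 60% of points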

Results :

Figure 14 : Data generated from two Gaussian distributions and stretched by an appropriate design matrix.

Figure 15 : Clustering result of the novel method proposed on the synthetic data. The two clusters are separated without any incorrect assignments.


Dendrogram of the resulting data :

Figure : The tree obtained using the proposed novel method can easily be cut into two individual clusters.

The dendrogram of the resultant clustering in the figure above shows conspicuous clusters that were separated from the original data set.

Figure : Matlab's result on the same data set. Here it is difficult to cut the tree into conspicuous clusters.

As discussed before, when correlations are used as the similarity measure the diagonal elements are all ones, each showing the correlation of a point with itself, and are therefore the highest values. The proximity matrix generated is shown in Figure 16.


Figure 16 : A part of the proximity matrix generated based on Pearson Correlation Coefficient.

D) Tests on Real Data :


The novel hierarchical clustering method based on stereographic projections was tested on a real data set available over the internet from [Botstein, 2000].

Critical features of the Matlab code & the pruning decision :
1. Matlab's 'abs' function was used to find the peak expression values for each gene in the cdc15 data set. The genes showing variance greater than 2 were assimilated to generate the initial data set.
2. The dendrogram is pruned based on the output of Matlab's 'linkage' function. To decide the total number of clusters, the following method is adopted. Let $d_1, d_2, d_3, \ldots, d_f$ be all the distances generated by the 'linkage' function.
- Set Clusters = 1.
- Find $d_n - d_{n-1}$ for $n = 2, 3, 4, \ldots, f$.
- Test condition : if $d_n - d_{n-1} > 10\%$ of $d_n$, then Clusters = Clusters + 1.
- Decision : cut the dendrogram into a number of clusters equal to the value of 'Clusters' found.
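A minimal Matlab sketch of this pruning rule, assuming Z is the output of the 'linkage' function (its third column holds the merge distances $d_1, \ldots, d_f$):

    % Decide the number of clusters from successive linkage distances:
    % increment the count whenever d_n - d_(n-1) exceeds 10% of d_n.
    d = Z(:, 3);                              % merge distances from 'linkage'
    clusters = 1;
    for n = 2:length(d)
        if d(n) - d(n-1) > 0.1 * d(n)
            clusters = clusters + 1;
        end
    end
    T = cluster(Z, 'maxclust', clusters);     % cut the dendrogram into 'clusters' groups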

About the real data set : The yeast Saccharomyces cerevisiae is a well-known commercial fungus used in making bread and alcohol [Hansen, 1883]. Its genes are studied using microarrays in [Spellman, 1998], which investigates cell cycle regulation at the mRNA level and the information that can help in predicting the cell cycle. In particular, the fluctuations of mRNA levels during the cell cycle and its regulation were studied using microarrays. A cell cycle is a chain of events which results in a 'daughter cell' [WWW7] through a four-stage complex division process. The division of the nucleus into two within the cell is termed mitosis. Mitosis starts with the DNA inside the nucleus grouping together to form coil-like structures referred to as chromosomes. Along with this process, the nuclear membrane disappears and protein fibres attach themselves to each chromosome. The chromosomes then separate, marking the end of mitosis, after which the cell divides into two cells. In our case we selected data with variance greater than 2 from the cdc15 experiment in order to reduce computation time.

Verification of our result : Our results in Figures 17 & 18 show clear peaks in the expression levels during three different time spans, in agreement with the result shown in [Spellman, 1998]. Since the results found by different methods agree, this indicates that correct clustering has been done and that the result is not a mere output of numerical computation [Eisen, 1998].

Figure 17 : Well-separated clusters resulting from the application of the novel algorithm to the cdc15 data [Botstein, 2000].


<Time Span 1> <Time Span 2> <Time Span 3>

Figure 18 : Heat map of cdc15 expression data with variance>2.

E) A naive method developed at the outset of project work :


At the outset of the project a clustering implementation was written in Matlab 7.1, which helped in realising the numerous shortcomings that needed to be addressed to implement efficient clustering code. Being written at the very beginning of the project, this code suffered from many weak areas, such as coding style, applicability and robustness, which were dealt with in the two later implementations.

Brief description : Euclidean distance was used as the similarity index to assign points to the individual clusters. In the case of closely placed clusters, the overlapping regions were not taken into account; instead, a rigid boundary was drawn between the two clusters, separating them. This implementation therefore does not give good results for overlapping clusters, though it works well for clusters that do not lie too close to each other. The result of clustering on a synthetic data set is shown in Figure 18.
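For illustration, the rigid boundary of this naive method amounts to a nearest-centroid rule, sketched below under the assumption that the cluster centroids are stored row-wise in Cen and the data points row-wise in X (illustrative names only):

    % Naive rigid assignment: each point goes to the cluster whose centroid is
    % nearest in Euclidean distance; overlapping regions are ignored entirely.
    M = size(X, 1);
    labels = zeros(M, 1);
    for i = 1:M
        diffs = Cen - repmat(X(i, :), size(Cen, 1), 1);   % offsets to each centroid
        [dmin, labels(i)] = min(sum(diffs.^2, 2));        % index of nearest centroid
    end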


Test Results :

Figure 18 : Clustering result of the naive implementation, showing correct determination of two well-separated clusters in a synthetic data set (left). A similar result obtained using Matlab's 'dendrogram' function on the same data set (right).

6) Conclusion and Evaluation


In the novel method based on empirical covariance, similarities were found using the Euclidean distance formula; the dense regions were therefore found by arranging the data in ascending order, so that the nearest neighbours appear at the beginning and the farthest at the end, and cutting a small percentage of data from the beginning of the sorted data gave the dense regions of most similar data points. In the clustering using stereographic projections, by contrast, Pearson's correlation measure is used, so the data has to be arranged in descending order to collect the most similar points, those having high correlation with each other, at the beginning of the sorted data; cutting a small percentage of the data from the beginning again gave the dense regions. The difference is that for distances, going from smaller to bigger corresponds to going from most similar to least similar, whereas for correlations a high value shows greater similarity than a low one, so the least similar data points have a correlation near or equal to zero.

The initial data sets taken are assumed to contain similar groups of points that are normally distributed. The regions where the clusters overlap are bound to have misassignments, since no information is available for the data set other than the expression values. This was noticeable in the results when two normally distributed clusters were made to overlap and then separated using the method proposed for 2D clustering. However, the wrong assignments of data points were fewer than 2% in all the different cases, demonstrating the effectiveness of the proposed method. Even if a small group of genes actively involved in a process needs to be

grouped together through clustering, this percentage is fairly low compared with the size of such a subgroup of genes, so the proposed method would not fail in clustering similar groups. Figure 19 elaborates this discussion.

Figure 19

7. Future Work Areas


In hierarchical clustering using stereographic projection we have assumed that the gene expression data lies on a single side of the sphere rather than being spread thoroughly around all sides. To deal with data spread uniformly over all sides of the unit sphere, a new projection method needs to be developed. The five-parameter Kent distribution could be employed to model the gene expression data; its attractive features are that it efficiently describes elliptically shaped clusters and that clusters which are rotationally symmetric can also be differentiated.


8. References
Banerjee, A., Dhillon, I.S., Ghosh, J., and Sra, S., 'Clustering on the Unit Hypersphere using von Mises-Fisher Distributions', Journal of Machine Learning Research 6, 1-39, 2005.

Biology Department Genomic Service, 'Microarray Catalogue', 2006. [WWW3] Available: www.transcriptome.ens.fr/sgdb/services

Botstein, D., Futcher, B., Brown, P., and Zhang, M., 'Yeast Cell Cycle Analysis Project', Stanford University, Molecular Biology of the Cell, 2000. [Online] Available: http://genome-www.stanford.edu/cellcycle/data/rawdata/combined.txt

Brazma, A., Parkinson, H., Schlitt, T., and Shojatalab, M., 'A quick introduction to elements of biology - cells, molecules, genes, functional genomics, microarrays', European Bioinformatics Institute, 2001. [Online] Available: www.ebi.ac.uk/microarray/biology_intro.html

Bryan, J., 'Problems in gene clustering based on gene expression data', Journal of Multivariate Analysis, doi:10.1016/j.jmva.2004.02.011, 44-66, 2004.

Dortet-Bernadet, J-L., and Wicker, N., 'Model-based clustering on the unit sphere with an illustration using gene expression profiles', to appear in Biostatistics, doi:10.1093/biostatistics/kxm012, 2007.

Eisen, M.B., Spellman, P.T., Brown, P.O., and Botstein, D., 'Cluster analysis and display of genome-wide expression patterns', Proc Natl Acad Sci USA, Vol. 95, pp. 14863-14868, 1998.

Goldstein, D.R., Ghosh, D., and Conlon, E.M., 'Statistical issues in the clustering of gene expression data', Statistica Sinica, 2002.

Gurette, A., 'Normalization Techniques for cDNA Microarray Data', McGill University, 2001. [Online] Available: www.cs.mcgill.ca/~aguere/308-761B/norm/summary_normalization.htm

Hansen, M., 'Saccharomyces spp.', Meyen ex Hansen, 1883. [Online] Available: http://www.doctorfungus.org/thefungi/Saccharomyces.htm

Hardin, J., Mitani, A., Hicks, L., and VanKoten, B., 'A robust measure of correlation between two genes on a microarray', BMC Bioinformatics, 8:220, doi:10.1186/1471-2105-8-220, 2007.

Heller, K.A., and Ghahramani, Z., 'Bayesian Hierarchical Clustering', Twenty-second International Conference on Machine Learning, 2005.

Introduction to Engineering Systems, 'Matlab Tips #2', University of Notre Dame. [WWW6] Available: www.nd.edu/~engintro/Project3/MATLABtips2_05.pdf

Jain, A.K., and Dubes, R.C., 'Algorithms for Clustering Data', Prentice Hall, 1988.

Jiang, D., Tang, C., and Zhang, A., 'Cluster Analysis for Gene Expression Data: A Survey', IEEE Transactions on Knowledge and Data Engineering, Vol. 16, 1370-1386, 2004.

Oak Ridge National Laboratory, 'How Many Genes Are in the Human Genome?', 2004. [WWW1] Available: www.ornl.gov/sci/techresources/Human_Genome/faq/genenumber.shtml

Lim, H., 'Father of Bioinformatics', University of Texas, 2005. [Online] Available: www.utdallas.edu/news/archive/2005/bioinformatics-lecture.html

Manchester Metropolitan University, 'Cluster Analysis', 2003. [WWW4] Available: http://obelia.jde.aca.mmu.ac.uk/multivar/hc.htm

MathWorks Inc., 'Statistics Toolbox: Linkage', 2007. [WWW5] Available: www.mathworks.com/access/helpdesk/help/toolbox

Matteucci, M., 'A Tutorial on Clustering Algorithms', Politecnico di Milano, 2003. [Online] Available: http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/hierarchical.html

Moore, A., 'Learning Gaussian Bayes Classifiers', Carnegie Mellon University, 2001. [Online] Available: http://www.autonlab.org/tutorials/gaussbc12.pdf

Mulajkar, T., 'Hierarchical Clustering of Microarray Data based on Empirical Covariance', M.Sc., University of Sheffield, 2006.

Peel, D., Whiten, W.J., and McLachlan, G.J., 'Fitting Mixtures of Kent Distributions to Aid in Joint Set Identification', Journal of the American Statistical Association, Vol. 96, 50-63, 2001.

Renkwitz, A., 'Microarrays: Chipping away at the mysteries of science and medicine', Biology 101, Chesapeake College, 2000. [Online] Available: www.fastol.com/~renkwitz/microarray_chips.htm

Schena, M., 'Microarray Analysis', John Wiley & Sons, 2003.

Shannon, W., Culverhouse, R., and Duncan, G., 'Analysing microarray data using cluster analysis', Pharmacogenomics, 2003.

Shi, L., 'DNA Microarray (Genome Chip)', 2002. [WWW2] Available: www.gene-chips.com/


Spellman, P.T., Sherlock, G., Zhang, M.Q., Iyer, V.R., Anders, K., Eisen, M.B., Brown, P.O., Botstein, D., and Futcher, B., 'Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization', Molecular Biology of the Cell, Vol. 9, 3273-3297, 1998.

The Cell Cycle Database, 'Cell Cycle Research', 2004. [WWW7] Available: http://www.cellcycles.org/

Troyanskaya, O., 'Microarray analysis', Ph.D., Stanford University, 2003. [Online] Available: http://stein.cshl.org/genome_informatics/troyanskya_microarrays.pdf

Turner, H., 'Biclustering Microarray Data: Some Extensions of the Plaid Model', Ph.D., 42-196, 2005. [Online] Available: www2.warwick.ac.uk/fac/sci/statistics/staff/research/turner/turnerchapter3.pdf


9. Abbreviations

2D - Two dimensional
3D - Three dimensional
cDNA - Complementary DNA
cdc15 - Cell division control protein
DNA - Deoxyribonucleic acid
mRNA - Messenger RNA
RNA - Ribonucleic acid
rRNA - Ribosomal RNA
tRNA - Transfer RNA
Trans. - Transactions

10. Diary of major milestones achieved

6th April, 2007 - Naive clustering code implemented to find rigid clusters.
28th April, 2007 - Novel method of hierarchical clustering using Empirical Covariance implemented in Matlab.
26th July, 2007 - Novel method of hierarchical clustering using Stereographic Projection implemented in Matlab.
25th August, 2007 - Pruning the dendrogram based on 'linkage' function output.

