
Chemometrics and Intelligent Laboratory Systems 104 (2010) 28–52


An introduction to DNA microarrays for gene expression analysis


Tobias K. Karakach a, Robert M. Flight b,c, Susan E. Douglas a, Peter D. Wentzell b,*

a Institute of Marine Biosciences, National Research Council of Canada, 1411 Oxford Street, Halifax, Nova Scotia, Canada B3H 3Z1
b Department of Chemistry, Dalhousie University, Halifax, Nova Scotia, Canada B3H 4J3
c Department of Neuroscience Training, University of Louisville, Louisville, Kentucky, 40203, USA

Abstract
This tutorial presents a basic introduction to DNA microarrays as employed for gene expression analysis, approaching the subject from a chemometrics perspective. The emphasis is on describing the nature of the measurement process, from the platforms used to a few of the standard higher-level data analysis tools employed. Topics include experimental design, detection, image processing, measurement errors, ratio calculation, background correction, normalization, and higher-level data processing. The objective is to present the chemometrician with as clear a picture as possible of an evolving technology so that the strengths and limitations of DNA microarrays are appreciated. Although the focus is primarily on spotted, two-color microarrays, a significant discussion of single-channel, lithographic arrays is also included. © 2010 Elsevier B.V. All rights reserved.

Article history: Received 24 November 2009; received in revised form 5 April 2010; accepted 6 April 2010; available online 29 April 2010.

Keywords: DNA microarray; GeneChip; Gene expression; Experimental design

1. Introduction

The rise of chemometrics as an important sub-discipline of analytical measurement science paralleled the rapid growth of analytical instrumentation capable of providing higher orders of multivariate data and the associated demand for new kinds of information. Some twenty years later, the biological sciences are undergoing a similar revolution resulting from new measurement technologies, and the need for effective data analysis tools is just as pressing. Since the beginning of the 1990s, molecular biology has moved toward high throughput measurements and data, much as analytical chemistry did in the early 1970s. The move toward high throughput technologies in molecular biology is concomitant with the advent of huge amounts of genome information and the need to utilize it in understanding complex molecular interactions in biological systems. This is a consequence of the recognition that even simple cellular activities are the result of well-orchestrated molecular networks that control the cell, and that these cannot be fully understood by studying one component at a time, but only through a comprehensive integration of the entire molecular machinery controlling the cell. Predictably, analysis of the data generated by high throughput measurements has necessitated more complex mathematical approaches that had not previously been available to molecular biologists. Chemometrics has an important role to play in this regard, since at their core these are analytical measurements and amenable to the tools that have been developed by chemometricians over many

* Corresponding author. E-mail address: peter.wentzell@dal.ca (P.D. Wentzell).
doi:10.1016/j.chemolab.2010.04.003

years. The application of those tools, however, requires a clear understanding of the nature of these new measurements and the challenges they pose. There are many different high throughput measurement technologies currently employed by molecular biologists, including DNA sequencing and LC–MS (and derivatives), but one of the more ubiquitous tools in use is the DNA microarray. DNA microarrays are popular due to their unique ability to query the mRNA expression levels of thousands of genes (potentially all of the genes in an organism) simultaneously with relatively high specificity, providing a snapshot in time of the overall gene expression of the system under study. However, there are some important considerations to take into account when one is using DNA microarrays or analyzing DNA microarray data. Although this topic has been previously reviewed in other fields [1–5], this tutorial provides an introduction, for an analytical chemistry audience, to this technology and various issues related to the analysis of the resultant data. It begins by providing the brief biological background necessary to appreciate the experimental underpinnings of the technology and an overview of the methods used in manufacturing DNA microarrays. Later sections provide a detailed introduction to the measurement process of DNA microarrays in the context of the DNA microarray experiment workflow, starting with the experimental design and following through data acquisition and processing. In addition, the pre-processing applied to the data before final analysis is discussed. Finally, the methods used to analyze the resultant data are briefly considered. The primary technological platform treated in this paper is the spotted DNA microarray, with a secondary focus on Affymetrix arrays (see Section 3 for a description of the microarray types). This is largely due to the fact that the authors have more extensive


experience working with data only from the former, and that much of the research available in the literature has been published on spotted microarrays. It is also important to note from the outset that the emphasis of this tutorial is on the nature of microarray measurements and the experimental procedures used to obtain them, rather than on the data analysis techniques applied to the final data sets. Chemometricians are well-versed in the tools of the trade, but less familiar with the strengths, limitations, and peculiarities of high throughput biological measurements. Readers looking for a primer on higher-level analysis of transcriptomics data are likely to be disappointed (they should visit [6] for a listing of papers describing DNA microarray analysis methods), but it is hoped that those who wish to gain a fundamental understanding of the measurement workflow will find what they need to venture into the field of microarray analysis with confidence.

2. Biological background and motivation

A simplified view of the flow of information in a cell would show information traversing from the genes (DNA) to messenger RNA (mRNA) to proteins, which can subsequently act on DNA, mRNA, metabolites, or other proteins. To produce the required proteins, the gene must be transcribed into mRNA by RNA polymerases, and the mRNA can then be translated by ribosomes into protein (see Fig. 1). Depending on the cell type and its biological state, specific proteins will be expressed at different levels. Therefore, if one can measure the complement of all expressed proteins, this will provide information about the current state of the cell. Given the explicit relationship between gene expression (transcription) and protein translation, knowledge of mRNA levels may provide an indirect route to this knowledge. For example, comparing the gene expression between diseased and healthy cells could allow the determination of the molecular basis of disease. Alternatively, measuring gene expression as a function of a serial process would allow the determination of molecular changes over time (cell cycle) or with changing dosage (drug/metabolite response). Consequently, three options are available for investigating the molecular dynamics of the cell: analyzing the variations of (1) the complete set of proteins in the cell (proteomics), (2) the complete set of mRNA transcripts that leads to the production of these proteins (transcriptomics), or (3) the complete set of metabolites generated by the proteins (metabolomics). Although research in proteomics and metabolomics has been ongoing for many years, both fields still suffer from a lack of standardized methodologies and poor reproducibility. This is partly a result of the heterogeneous properties of the molecules being measured.
In the case of proteomics, different amino acid sequences lead to a wide variety of protein types, making it difficult to design standard protocols for performing measurements on the entire protein complement. Metabolomics likewise suffers from the wide diversity of chemical properties of different metabolites. The

relatively homogeneous nature of mRNA, and the development of capture methods based on complementary base pairing, has led to the very mature field of transcriptomics using DNA microarrays. In addition, in many cases mRNA levels are a reasonable proxy for protein amounts, allowing one to make a rational inference regarding the level of protein expression based on the levels of mRNA expression. There are, however, exceptions where protein expression is controlled post-transcriptionally by other factors. Transcriptomics generally utilizes DNA microarrays, small slides to which hundreds to tens of thousands of DNA molecules are attached [7]. The DNA is able to bind complementary sequences created from mRNA transcripts, facilitating the quantitation of various mRNA transcripts in the cell. This process is illustrated schematically in Fig. 2. DNA microarrays allow molecular biologists to monitor the levels of mRNA transcripts for tens of thousands of genes simultaneously, thereby giving them a window into the inner workings of the genome at the transcriptional level. Microarrays have impacted the study of numerous diseases, the regulation of many biological mechanisms, as well as the cell cycle of various organisms [1]. The methods by which DNA microarrays are constructed and used, however, can take various forms.

3. DNA microarrays

A microarray consists of a series of miniaturized chemical recognition sites onto which binding reagents, capable of distinguishing complementary molecules, have been attached. Pirrung defined a microarray as a flat solid support that bears multiple probe sites containing distinct chemical reagents with the capacity to recognize matching molecules unambiguously [8]. Thus, in principle, if complementary molecules in a complex mixture were modified with fluorophores, for instance, and allowed to interact with the probes, the molecules could be interrogated simultaneously to determine their respective concentrations. This definition is similar to the classic definition of multianalyte chemical sensors, notwithstanding the different measurement environments and detection systems. It also restricts a microarray to be a miniaturized assay without specifying, explicitly, the chemical reagents that constitute the probes. In the case of DNA microarrays, the probes are DNA oligomers that are allowed to interact with labeled complementary DNA strands. This has led to a wide variety of DNA microarray types, although there are two general classes. The first category encompasses microarrays on which a single-stranded DNA (ssDNA) oligomer probe is synthesized directly on the substrate (in situ synthesis). The second category encompasses microarrays on which a ssDNA oligomer or dsDNA (double-stranded DNA) amplicon probe is deposited on the substrate, and these are commonly referred to as spotted arrays. These different methods of generating DNA microarrays lead to some important considerations in the data analysis, and

Fig. 1. Overview of the process of transcribing DNA to mRNA, which is translated into proteins that are then able to act on metabolites. It should be noted that this simple model ignores many of the complexities in the process, such as alternative splicing of mRNA, miRNA silencing and the effect of post-translational modifications on proteins.


Fig. 2. (a) Spotted microarray experimental set-up. mRNA extracts (targets) from cells under two distinct physiological conditions are reverse transcribed to cDNA and then labeled with different fluorescent dyes, e.g. Cy3 and Cy5. Equal amounts of the dye-labeled targets are combined and applied to a glass substrate onto which cDNA amplicons or oligomers (probes) are immobilized. (b) Scanned image of an Atlantic salmon cDNA microarray [7].

so a basic primer on the synthesis and detection of target binding for both is provided in the next sections.

3.1. In situ synthesis

Among the most popular arrays where the DNA oligomers are synthesized in situ are the Affymetrix arrays, known as GeneChips. In a GeneChip, a photolithographic mask is used to determine the probe position on the array at which photo-induced deprotection of a previously deposited functionalized nucleotide occurs, in order to attach the subsequent nucleotide to the growing oligomer [9]. Due to possible failure of photo-induced deprotection at each step of the synthesis, GeneChips contain short probes (25 nucleotides long), with multiple probe sequences for each target of interest. These make up what are known as probe sets, and contain both perfect matches (PM) for the sequence of interest, and also probes that contain a single base mismatch (MM) at the middle position to allow determination of non-specific target binding. The use of photolithographic techniques to produce the arrays leads to very reproducible, extremely regular probe regions on the array surface. However, this same strategy makes it more expensive to produce custom arrays, and Affymetrix has concentrated on producing arrays for widely used organisms, although the selection of organisms for which arrays exist has expanded considerably in recent years. Nimblegen arrays are similar to GeneChips in that they use photo-induced deprotection of previously deposited functionalized nucleotides to subsequently add to the growing oligomer. However, in the case of the Nimblegen arrays, a digital micromirror device (DMD) is used to direct the light that causes photo-induced deprotection [10]. This has the advantage of not requiring the fabrication of new photolithographic masks for new array designs, as in the case of the GeneChips. Another important difference in the Nimblegen technology is the use of longer oligonucleotides, 60mers in contrast to the 25mers used by Affymetrix.
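The PM/MM probe-set structure lends itself to a simple summary computation. Affymetrix's own algorithms (e.g. MAS 5.0) use a robust Tukey-biweight summary of log-transformed, background-adjusted PM intensities; the sketch below is a deliberately simplified illustration of the underlying idea only, with hypothetical intensities, and is not the vendor's actual method.

```python
import numpy as np

def probe_set_signal(pm, mm):
    """Summarize one probe set from perfect-match (PM) and
    mismatch (MM) probe intensities.

    Each MM intensity serves as a crude estimate of non-specific
    binding for its PM partner; differences are clipped at zero so
    that an MM exceeding its PM cannot yield a negative signal.
    """
    pm = np.asarray(pm, dtype=float)
    mm = np.asarray(mm, dtype=float)
    specific = np.clip(pm - mm, 0.0, None)  # per-pair specific signal
    return specific.mean()                  # average over the probe set

# Hypothetical intensities for an 11-probe-pair set
pm = [220, 340, 180, 400, 290, 310, 260, 150, 380, 330, 270]
mm = [90, 120, 70, 150, 300, 110, 95, 60, 140, 125, 100]
print(round(probe_set_signal(pm, mm), 1))  # one MM > PM pair is clipped to zero
```

Replacing the plain mean with a robust summary (median, biweight) is what protects real algorithms against outlying probe pairs.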
In theory, this allows for greater specificity of hybridization of the targets to the probes on the slide, with less chance of cross-hybridization between the target sequences. The use of the DMD allows Nimblegen to achieve high densities, while easily allowing one to create customized arrays. Another method of in situ oligomer synthesis uses addressable electrodes to cause deprotection of the previously deposited nucleotides via electrochemical methods. Each chemical reaction is confined to the activated electrode through the use of a buffering solution [11]. The number of probe sequences that can be synthesized on the array is limited by the lower limit of the size of the fabricated electrodes on the chip. The last method of note is the use of extremely accurate ink-jet systems to control the delivery of the various reagents used for oligomer synthesis [12]. In this situation, the growing oligomer is synthesized by changing the base added at each location via the ink-jet system. Agilent uses this method to produce their arrays. Although the Affymetrix system using photolithographic masks is the least flexible method of in situ oligomer array fabrication, it has the advantage of having been commercialized the longest, and has become a de facto standard in the industry. However, with the rise of requirements for experiments with non-model organisms or non-standard applications, the arrays generated using the other methods outlined above are becoming more popular.

3.2. Spotted arrays

In contrast to the in situ arrays mentioned above, spotted arrays are generated through the mechanical deposition of synthesized oligomers (generally 50–70mers) or cDNA amplicons on a functionalized substrate. cDNA amplicons can be generated via reverse transcription of all mRNAs from an organism under study to generate so-called expressed sequence tags (ESTs), which are then amplified by polymerase chain reaction (PCR), purified, and spotted on microarray slides. In the early days of microarrays, cDNA libraries were produced for many different organisms, and generating more cDNA is easily accomplished, providing a readily accessible source of probe material. This was an inherent advantage of spotted arrays over GeneChips at that time, but as genomic information becomes available for a wider range of organisms and the selection of commercial arrays becomes larger, the use of cDNA arrays is declining.
Long oligomer arrays, in contrast, use probes consisting of oligomers synthesized via traditional solid-phase methods. Like in situ synthesized arrays, they require knowledge of the genomic sequence for the organism under study, and, in theory, they provide more specificity than cDNA arrays. This is possible because 50–70 nucleotides allow discriminatory binding between several different


but closely related sequences (in contrast to the 25mers used for GeneChips), whereas cDNA amplicons would likely cross-hybridize to closely related sequences. Both long oligomer and cDNA arrays are used currently; however, the consensus appears to be that long oligo arrays are more advantageous due to the greater control they afford over the actual sequence in the microarray probe. The first reported instance of the modern spotted microarray used the relatively simple method of spotting the cDNA by a robotic arrayer, with a single pin picking up a solution of cDNA and depositing it on an appropriately functionalized glass slide [13]. This method of array construction is still in use today, with the modification of using multiple pins simultaneously to pick up and deposit the DNA (see Fig. 3 for a schematic of a robotic arrayer system). Since the initial report, a large body of literature has been generated on the best types of array surfaces, surface chemistries, different types of pins, and different DNA solution compositions. A discussion of these various issues is beyond the scope of this review, and the interested reader will easily find information in the literature. Most importantly, the materials required to build a robotic arrayer are relatively cheap, allowing almost anyone with the time and technical know-how to assemble their own arrayer and begin generating microarray slides to perform experiments. This led to a very large body of literature on the manufacture and use of spotted microarrays in the academic community. Although it is still possible to build one's own robotic microarrayer, due to the many sources of potential error in the manufacture of microarrays, printing is now generally performed by commercial suppliers or specialty academic microarray centers.
The primary method of attaching the probe DNA (cDNA amplicon or synthesized oligomer), as mentioned above, uses print-tips whereby the DNA of interest is picked up by a solid or capillary metal tip, and then placed on the array with or without contact of the print tip to the surface. There are many other methods of depositing the probe on the array surface, including ink-jet [14] and electrophoretically driven [15] deposition. The use of spotted microarrays does lead to lower probe densities on the array in comparison to synthesized arrays, especially the Affymetrix and Nimblegen arrays.

3.3. Impact of array type

Although the intricacies of detecting targets hybridized to the probes on the microarray are discussed in a later section, a brief comment regarding the influence of the method of array fabrication on the approach used for probe detection and on array design is appropriate, especially in the context of comparing spotted microarrays with GeneChips. The highly reproducible manufacture of GeneChips leads to very high technical reproducibility between two arrays measuring the same sample. This has resulted in the use of a single GeneChip for each sample of interest, while still being confident of making comparisons between two different samples hybridized to two different arrays. Spotted microarrays, in contrast, tend to have large variations in spot size and morphology between spots and between arrays, making it more difficult to make a valid comparison between two samples hybridized to two different arrays. Therefore, spotted arrays almost always hybridize two samples with different labels to the same array, thereby enabling comparison of the two samples. It is thus important to understand the limitations of the two different formats when undertaking data analysis of microarray data from different types of arrays. The next section examines the physical manipulations necessary to actually perform a microarray experiment, namely labeling of the sample and hybridization to the microarray.

4. Measurement and analysis

The process of acquiring and analyzing DNA microarray data can be regarded as a workflow of discrete steps, starting with the design of the experiment, following through extraction of appropriate samples, labeling, hybridization, scanning of the microarray, image processing, normalization, ratio calculation, and statistical analysis, and ending with the extraction of information and generation of knowledge from the results.
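Because the two co-hybridized samples share every spot on a two-color array, their comparison is naturally expressed per spot as a log ratio (conventionally M) together with an average log intensity (A). A minimal sketch with hypothetical background-corrected channel intensities (the names and values are illustrative):

```python
import math

def ma_values(red, green):
    """Per-spot M and A values for a two-color microarray.

    M = log2(red/green): differential expression between the two
        co-hybridized samples (0 means no change).
    A = mean of the log2 intensities: overall spot brightness, often
        plotted against M to diagnose intensity-dependent dye bias.
    """
    m = math.log2(red / green)
    a = 0.5 * (math.log2(red) + math.log2(green))
    return m, a

# A spot four-fold brighter in the red channel
m, a = ma_values(2000.0, 500.0)
print(m)  # 2.0 -> four-fold higher expression in the red-labeled sample
```

The M-versus-A plot built from these quantities underlies the intensity-dependent normalization discussed later in the workflow.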
Although this review is intended to highlight the steps of the workflow that result in the actual data generation and the analysis of the data, all of the steps are included here with at least a cursory overview to give the reader an

Fig. 3. Schematic of microarrayer with the arrows pointing to the direction of movement of components during printing. The print-head moves in the x- and z-directions while the base, on which the glass slides and microtitre plates sit, moves in the y-direction (courtesy of Gisli Sigtryggsson). Inset: photograph of actual microarrayer (courtesy of M. Werner-Washburne).


appreciation of the complexities involved in developing the actual data that are analyzed. Most of the subjects addressed in this review are applicable to all types of DNA microarray experiments (e.g. experimental design and transformations). However, in some areas there may be experimental considerations specific to either spotted microarrays or GeneChips. These will be addressed separately where appropriate.

4.1. Experimental design issues

One of the unfortunate consequences of the technical and conceptual simplicity of microarray technology is its capacity to yield data sets that are biased by inadequate design considerations. In the absence of well-established experimental designs for microarrays, poorly designed experiments continue to yield multiply-confounded data with which one is unable to answer the question for which the experiment was conducted. The general objective of experimental design is to curtail the effects of confounding factors by generating data that span rich and diverse sample spaces, minimize the effects of unwanted variation, and provide the potential for maximum efficiency in probing the hypotheses under investigation. Yet, with microarray experiments, there is often the false hope that, due to the volume of data generated per experiment, confounding factors and unwanted variation will be somewhat mitigated. Although many different types of experiments may be conducted using DNA microarrays, such as comparative genome hybridization (CGH) [16], single nucleotide polymorphism (SNP) analysis [17], alternative splicing [18], and microRNAs [19], the focus of interest in the majority of microarray studies is typically to discover genes that are differentially expressed in different subjects, different tissues, cells exposed to varying physical/biochemical conditions, or those undergoing growth, development, or degeneration.
Some of the common reasons for evaluating these variables are to discover the roles of genes in an organism, to group genes according to common functions, to understand the relationships among genes in a biological system (systems biology), to classify biological specimens (e.g. tumor cells) on the basis of gene expression, and to identify important biomarkers in disease progression. Thus, analysis of these experiments involves identification of genes that display uncharacteristic tendencies of increased or decreased expression, and achieving this goal must involve careful experimental design to avoid spurious observations confounded by unrelated experimental variables at multiple levels. Microarray experiments can be regarded as multilayered in the sense that they involve several nested levels at which variability may be introduced. Churchill [20] and Simon et al. [21] categorized the levels at which microarray experiments must be designed into three layers: (1) the selection of experimental units, (2) the design of mRNA extraction, labeling and hybridization, and (3) the arrangement of probes on the glass slides. Whereas the first layer controls the span of the biological design space, the second and third layers account for the analytical (technical) variability at the lower levels of the experimental process and will be the focus of this section.

4.1.1. Types of experiments

In a broad sense, most microarray experiments can be classified as either comparator experiments or serial experiments based on the nature and objectives of the procedures employed. In a comparator experiment, the objective is to compare gene expression among cells under several distinct conditions (e.g. different drug treatments, different tissue types and different tumor types) to identify differentially expressed genes. These experiments can be further categorized according to their objectives as class comparison, class prediction, and class discovery experiments [22,23].
In contrast, serial experiments are designed to follow the evolution of gene expression as a function of some ordinal variable in order to better understand the biological system under study [24–26]. Most often, the ordinal

variable is time and the experiment is referred to as a time-course, but it is also possible to examine other variables such as the dosage level of a drug or toxin. Serial experiments are less widely employed than comparator experiments, probably because they demand more resources, require synchronization, and are not as amenable to conventional cluster analysis and other techniques that are easy to implement and widely used. These two experimental categories are discussed in greater detail below. Comparator experiments can be carried out using controlled or uncontrolled design strategies. The former are controlled in the sense that cell populations are selected and partitioned as reference and test samples, after which the test cells may be treated in some way that differentiates them from the reference, such as by exposure to a toxin [27], a drug [28] or environmental stress [29]. RNAs are extracted from the two cell populations, labeled with different dyes, and hybridized to the same array for direct comparison of relative expression (see Fig. 2). Uncontrolled comparator experiments involve identification of subjects that may exhibit the conditions of interest (e.g. patients suffering from different forms of a cancer), extracting RNA from these candidates, and comparing their abundance to reference mRNA extracted from separate normal individuals. A comprehensive comparison of comparator designs has been reported elsewhere [30]. The section that follows briefly describes some of the common designs currently used. Time-course experiments include those that profile gene expression in response to cell cycle [25], development [31–33], and external stresses over time [29,34]. In these experiments, RNA is extracted from candidate cells at specified time intervals and co-hybridized with RNA extracted from a common reference.
For instance, the genetic profiles of yeast cells exiting from stationary phase have been obtained using, as reference, mRNA derived from cells in the exponential growth phase [35]. Other approaches have been discussed in reference [36]. Experimental design issues associated with time-course experiments have been discussed in detail elsewhere [24,37], but a brief mention of the most significant aspects will be made here. These include the frequency at which experimental mRNA samples are extracted (i.e. the number of samples per given time interval) and the synchronicity of the units in view of the homogeneity of cell populations. For example, in cell development and growth experiments, the sampling rate during the exponential growth phase is maximized in order to minimize temporal aggregation. The synchronization of the initial population is also important in these experiments, since it is impossible to follow changes in a mixed population for which the distribution of cellular states does not change. For dose experiments, it is important to ensure that all cells to which a chemical dose is administered have had similar prior treatments and exhibit the same population distribution. It should be noted that the design of experiments utilizing GeneChips will be very similar to those employing two-color spotted arrays, with the exception that individual samples are hybridized to separate arrays. This mitigates some of the concerns inherent in two-color experiments, especially the amount of sample required, but it does introduce novel complications for other types of experiments.

4.1.2. Experimental designs

The most widely used and easily interpreted experimental design employed in two-color microarray experiments is referred to as the reference design. In this design, the test samples, labeled with one dye, are hybridized against a relevant reference which has been labeled with the other dye.
For purposes of illustration in this section, we will consider a hypothetical example in which we are interested in the gene expression levels of three different types of tumors, A, B and C, extracted from test subjects. If the principal interest in this experiment is to examine the differences in gene expression between normal tissue and cancerous tissue of various types, a reference design could be used in which normal tissue serves as the reference, R. Even


with this specification, however, there are sub-classifications of designs based on how the reference is obtained. In a common reference design, the same reference material is used for all of the test samples; that is, the reference material is extracted from one source, or from multiple sources and homogenized. In our example, this would correspond to extracting the mRNA from the healthy tissue of one individual (Fig. 4a). An alternative, however, would be to extract both healthy and diseased tissue from the same subject and use these pairs for comparison (Fig. 4b). This is referred to as a direct comparison, and would be expected to reduce the variance in differential expression arising from different individuals. This approach relies on the availability of a natural biological internal standard, however, and may not always be possible. For example, if the goal of the experiment is to determine the effect of a drug treatment, direct comparison will not generally be possible. As an alternative to the common reference in such circumstances, an indirect comparison can be used, where individual (unrelated) reference samples are obtained for each test sample (Fig. 4c). Although this would be more robust than using a common reference, which may increase the likelihood of a false positive due to a few anomalous genes, it would also be expected to increase the variance in the observations.
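The arithmetic behind comparing two test samples through a shared reference is worth making explicit: each array yields a test-versus-reference log ratio, and the reference cancels when two such ratios are subtracted. A sketch with hypothetical spot intensities (all names and numbers are illustrative):

```python
import math

def log_ratio(test, ref):
    """Per-spot log2 ratio of test to reference intensity."""
    return math.log2(test / ref)

def indirect_contrast(m_a_vs_r, m_b_vs_r):
    """Compare samples A and B hybridized on separate arrays against
    the same reference R:
        log2(A/B) = log2(A/R) - log2(B/R)
    so the reference term cancels out of the contrast."""
    return m_a_vs_r - m_b_vs_r

# Hypothetical intensities for one gene on two reference-design arrays
m_ar = log_ratio(1200.0, 300.0)  # tumor A vs. reference R
m_br = log_ratio(450.0, 300.0)   # tumor B vs. reference R
print(indirect_contrast(m_ar, m_br))  # equals log2(1200/450)
```

The cancellation is exact only when the reference behaves identically on both arrays; any array-to-array variation in the reference channel propagates into the contrast, which is why the design's efficiency is debated.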

Fig. 4. Some possible experimental designs illustrated for a microarray experiment consisting of three treatments (A, B and C) plus a reference and two replicates: (a) common reference design, where R is the common reference; (b) reference design with direct comparison, where RA1 is a reference matched to A1; (c) reference design with indirect comparison, where R1 is not related to A1; (d) common reference design with dye-swap; (e) loop design (including a reference); (f) balanced incomplete block design (note that three replicates are used in this design).

Another issue related to reference designs is the use of dye-reversal (also known as dye-swap or fluoro-flip) experiments. Although it is natural to expect that there will be differences in the scale of intensities from the red and green channels due to factors such as dye labeling efficiency and laser power, these are usually compensated for through a process known as normalization (see Section 4.8). However, if there is preferential incorporation of one dye or preferential hybridization of one dye-labeled transcript over another and this varies across genes, a gene-specific bias is introduced. To compensate for this, the use of dye-swap experiments, in which an experiment is repeated with the red and green labels reversed, has been advocated (Fig. 4d). The use of these experiments is popular, although there are arguments that they may be unnecessary [21]. In addition, there are more efficient designs now available to account for these biases should they exist (see below). The reference design is appealing because of its simplicity and its compatibility with data analysis techniques such as cluster analysis, but it is not the most efficient design. Often, the choice of design strategies in microarray experiments is determined by factors such as the specific biological question, the availability of resources, and the proposed methods for validation of the results [36]. Reference designs have been argued to be inefficient when resources are limited since the reference is hybridized multiple times. Common alternatives to the reference design are the loop design [38] and the balanced block design. In the loop design, sample 1 is co-hybridized with sample 2, sample 2 with sample 3, sample 3 with sample 4, and so on until the last sample is co-hybridized with sample 1. Successive hybridizations are set up so that a dye-swap occurs for the common sample in consecutive experiments, resulting in a design that is able to include dye–gene interactions. This design is illustrated in Fig.
4e for the example presented earlier, including the reference (normal tissue) as one of the samples. The design for the example requires eight arrays, and would be equivalent to a reference design with dye-swaps, requiring 12 arrays. If one were only interested in comparing the test samples, only six experiments would be required. Thus, the loop design can be regarded as more efficient, but it suffers from a number of drawbacks. Since each sample connects to the next as a reference, one bad sample or array can disrupt the continuity, making the design sensitive to experimental problems. The indirect method of comparison also makes the method prone to inflated variance when two samples far apart are contrasted. Finally, data analysis is not as straightforward as for the reference design and standard methods for data clustering cannot be directly applied. The balanced block design is similar to the loop design in that it attempts to improve efficiency through co-hybridization of test samples. The requirement of this design is that each pair of sample classes appears together the same number of times. This is illustrated in Fig. 4f for the earlier example, where there are four sample classes or treatments (A, B, C and R), each with three replicates. Note that each treatment should appear labeled with each dye, ideally an equal number of times, although this may not be possible with an odd number of appearances, as in the current example. The minimum number of arrays required is equal to the number of combinations of treatments taken two at a time, which in this case is six (if using one replicate from each sample). Multiples of this minimum can also be employed. Also note that the complete utilization of biological replicates requires that the number of replicates for each treatment be an integer multiple of the number of treatments minus one, which is why the number of replicates was expanded to three in this example.
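The array counts quoted for this example can be tallied with a few lines of code. This is a sketch of the counting logic only; the treatment set and replicate number are those of the running example:

```python
from itertools import combinations

treatments = ["A", "B", "C", "R"]
replicates = 2

# Reference design with dye-swap: each test replicate is hybridized
# against R twice, once per dye orientation.
n_test_samples = (len(treatments) - 1) * replicates   # 6 test samples
reference_arrays = n_test_samples * 2                 # -> 12 arrays

# Loop design: every sample (reference replicates included) appears on
# two arrays, and every array carries two samples -> one array per sample.
loop_arrays = len(treatments) * replicates            # -> 8 arrays

# Balanced block design, minimum size: one array per pair of treatments
# (one replicate from each sample).
pairs = list(combinations(treatments, 2))
block_arrays = len(pairs)                             # C(4,2) -> 6 arrays

print(reference_arrays, loop_arrays, block_arrays)    # 12 8 6
```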
If two replicates had been used, as in previous designs, it would be necessary to replace the third replicate with technical replicates of the biological samples. The design in this example is a balanced incomplete block design, since all treatments do not appear in all blocks. A complete block design is only possible when there are only two treatments. The balanced block design is very efficient, but
suffers from some of the same drawbacks as the loop design in terms of interpretation and data analysis. The discussion above has been focused primarily on designs for comparator experiments, but some comments on serial experiments should also be made. By far the most common design used for time-course experiments is the reference design, but other designs are possible and some of these have been described by Yang and Speed [36]. Perhaps the most natural alternative design is the loop design, since there is an obvious relationship between sequential time points. In practice, however, this can lead to significant problems if there is one bad sample or array that breaks the chain of measurements. Another important consideration in serial experiments is the choice of reference. In comparator experiments, it is generally expected that the differential expression will occur in relatively few genes, but in serial experiments the changes in gene expression can be much more dramatic over the course of an experiment, making the choice of a natural reference difficult. Moreover, the reference in a serial experiment is primarily used as an internal standard for measurements and not so much as a measure of differential expression, since it is the change in expression from one time point to the next, as opposed to the absolute ratio, that is of greatest interest. Because of this, the reference for serial experiments should be chosen to represent as many genes as possible. The absence of a gene transcript in the reference will lead to an undefined ratio and the inability to measure changes in the expression of that gene, which may be important even if it is initially absent in the early samples.
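The role of the reference as an internal standard can be made concrete: when consecutive time points are contrasted, the reference cancels, leaving log2(T_t/T_{t-1}) plus only the technical variation in the reference channel. A small sketch with invented intensities:

```python
import math

# Invented intensities for one gene at four time points, each hybridized
# against the same reference pool R.
test = [500, 1000, 4000, 4000]
ref = [250, 260, 240, 255]     # reference channel varies only technically

logratio = [math.log2(t / r) for t, r in zip(test, ref)]

# Change between consecutive time points: the reference cancels, leaving
# log2(T_t / T_{t-1}) plus a small technical contribution from R.
delta = [b - a for a, b in zip(logratio, logratio[1:])]
print([round(d, 2) for d in delta])
```

Note that if a transcript were absent from the reference pool, the log-ratios above would be undefined at every time point, which is why the reference should represent as many genes as possible.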

4.1.3. Replication

One of the key questions posed by microarray researchers concerning experimental design relates to the number of times hybridizations must be repeated in order to gain accuracy in the estimates of variables of interest. Statistical inference of the significance of measured variables (usually log-ratios) is determined by the magnitude of the residual variance which, in microarrays, has contributions from the inherent biological variability in samples plus the analytical variability (referred to as technical variability). These sources of variability can occur on multiple levels and within both the test and reference samples. As with any designed experiment, the goal in microarray experiments is to either control sources of variability or include them as part of the model. Accurate determination of the residual variance can only be achieved through objective replication of experiments both at the biological and technical levels. In this section, some strategies for replication of microarray experiments are discussed and some models in current use are described. It has been reported that technical variability accounts for as little as 5–10% of the standard error [39], yet many microarray experiments place an emphasis on this source of variability, probably because it is easier to generate technical replicates than biological replicates. Within the category of technical replication, there are different levels of contribution that can be investigated [40]. At the lowest level is the spot-to-spot variability on a given slide. In principle, this can be estimated by multiple spotting of the same probe at different locations on the array. In practice, it is common for logistical reasons to place probe replicates side-by-side, which limits their utility in estimating this source of variance since they do not model effects associated with spatial or temporal distribution, or different pins.
However, this is only one component of the technical variance and, while it may be important in assessing the overall error structure, it is perhaps more useful to estimate the total variance from this source. The ideal technical replicate would begin with the replicate extractions of mRNA from the same biological source and carry these through the labeling and hybridization procedures. For practical reasons, this is not commonly done and technical replication is likely to be carried out at some downstream step, such as before or after labeling.
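One simple way to separate the two levels of variance is a one-way random-effects decomposition over nested replicates: the pooled within-sample variance estimates the technical component, and the variance of the biological-sample means, less its technical share, estimates the biological component. The sketch below uses invented log-ratios for a single gene; it illustrates the decomposition rather than a procedure from the text:

```python
from statistics import mean, variance

# Invented log-ratios for one gene: 4 biological samples, each measured
# with 3 technical replicates (same extract hybridized repeatedly).
data = [
    [0.9, 1.1, 1.0],
    [1.4, 1.6, 1.5],
    [0.5, 0.4, 0.6],
    [1.0, 1.2, 1.1],
]
n_tech = len(data[0])

# Technical component: pooled within-sample variance of technical replicates.
var_tech = mean(variance(reps) for reps in data)

# Biological component: variance of the sample means, less the technical
# share that leaks into each mean (one-way random-effects estimate).
sample_means = [mean(reps) for reps in data]
var_bio = max(variance(sample_means) - var_tech / n_tech, 0.0)

print(round(var_tech, 4), round(var_bio, 4))
```

With these numbers the biological component dominates, consistent with the reports cited above that technical variability is the smaller contribution.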

From a classical statistical standpoint, it is the variance that is introduced by biological replication (which also incorporates technical variance) that is of greatest importance in a comparison experiment. Some aspects of intrinsic biological variability are relatively simple to control, while others are impractical or impossible to eliminate, depending on the nature of the study. Variation resulting from gender, age, genotype and the interactions of these factors has been reported to account for upwards of 60% of the standard error [41]. It has also been argued that, if the biological samples are drawn from cell lines, biological variability will be smaller. However, even within these populations, some diversity is expected (e.g. with cell passage number) and biological replication is advisable to avoid the detection of false positives that arise from the anomalous behavior of a few genes within an individual sample of the population. To mitigate the high cost or difficulties (e.g. small sample size) associated with independent sample replication, the pooling of mRNA extracts has also been considered as a possible alternative to capturing variation due to the transcriptional diversity of samples [42], although cautious sentiments have been expressed [21,39,40]. Finally, it should be emphasized that randomization of the procedures used in the microarray trials (e.g. order of experiments, operators, etc.) is critical to yield meaningful results. Perhaps the most comprehensive work addressing factors that influence the overall standard error of measuring the fluorescence intensity of a spot on a microarray is reported by Kerr et al. [43]. Here, fluorescence intensity is modeled as a function of sample (V), array (A), dye (D), and gene (G) effects together with interactions between a gene and an array, and a gene and sample. This model is reproduced here as Eq.
(1) (as it appears in the reference), where μ is the overall mean, y_ijkg is the measured intensity for the gth gene, corresponding to the kth variety (which is equivalent to treatment), labeled with the jth dye and hybridized to the ith array, and ε_ijkg is the random error component.

y_ijkg = μ + A_i + D_j + V_k + G_g + (VG)_kg + ε_ijkg    (1)

Practically, this model calls for replication of arrays (ideally with independent biological samples), together with all the main effects in order to capture the relative fluorescence intensity that represents the unbiased differential gene expression. In essence, variability due to the interaction term (VG) measures the quantity of interest, while variability due to the main effects must be controlled. This model was later updated to include (array × gene) and (dye × gene) interactions [44], as shown in Eq. (2), where AG and DG are the two additional terms.

y_ijkg = μ + A_i + D_j + V_k + G_g + (VG)_kg + (AG)_ig + (DG)_jg + ε_ijkg    (2)

By including AG and DG interaction terms, this updated model ensures that technical replication accounts for spot-to-spot and gene-specific dye effects. It is perhaps due to this model that replication in most microarray experiments has focused on multiple spotting and the so-called dye-swap experiments. The purpose of multiple spotting is to estimate spatial variability in the measurements that may result from a variety of factors, such as variations in the amount of probe deposited on specific sites. As already noted, multiple spotting is mainly done via side-by-side deposition of probes on the microarray for operational simplicity. The extent to which this accounts for the desired variability has not been established. Other methods of multiple spotting have been reported. In reference [7] for instance, microarray glass slides were divided into two and probe material deposited as side-by-side duplicates on each of the two halves, whereas in reference [45], probes were spotted multiple times in various locations on the array.
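The role of the (VG) interaction as the quantity of interest can be illustrated by simulation. The sketch below uses a balanced, single-dye simplification of the ANOVA model (all effect values and the layout are invented); the usual two-way means contrast (cell mean - variety mean - gene mean + grand mean) recovers the variety × gene interaction while the array and main effects cancel:

```python
import random

random.seed(0)

# Invented effects for a balanced, single-dye simplification of the model:
# y = mu + A_i + V_k + G_g + (VG)_kg + error.
mu = 8.0
A = [0.5, -0.5, 0.2, -0.2]              # array effects (4 arrays)
V = [0.3, -0.3]                          # variety (treatment) effects
G = [1.0, -0.4, -0.6]                    # gene effects
VG = [[0.8, -0.5, -0.3],                 # variety x gene interaction:
      [-0.8, 0.5, 0.3]]                  # the differential expression of interest

# Balanced layout: every (variety, gene) cell is measured on every array.
y = [[[mu + A[i] + V[k] + G[g] + VG[k][g] + random.gauss(0, 0.05)
       for g in range(3)] for k in range(2)] for i in range(4)]

def mean(xs):
    xs = list(xs)
    return sum(xs) / len(xs)

# Cell means over arrays, then marginal and grand means.
ybar_kg = [[mean(y[i][k][g] for i in range(4)) for g in range(3)]
           for k in range(2)]
ybar_k = [mean(ybar_kg[k]) for k in range(2)]
ybar_g = [mean(ybar_kg[k][g] for k in range(2)) for g in range(3)]
ybar = mean(ybar_k)

# Interaction estimate: array and main effects cancel in this contrast.
VG_hat = [[ybar_kg[k][g] - ybar_k[k] - ybar_g[g] + ybar
           for g in range(3)] for k in range(2)]
print([[round(v, 2) for v in row] for row in VG_hat])
```

The estimates track the simulated VG values to within the noise of the cell means; in a real two-color experiment the dye effects and the confounding imposed by the hybridization design make the estimation less direct than this balanced sketch suggests.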

One aspect of Eq. (2) that has received widespread attention in the microarray literature on experimental design is the gene-specific dye biases modeled by the (DG)-term, which are mostly captured by the dye-swap experiments discussed in Section 4.1.2. Kerr et al. [38,43] argued that, in some instances, certain transcripts incorporate one dye better than another and that this effect will be confounded with the differential expression arising from true biological factors. Statistical analyses have been carried out to confirm the importance of this argument [46], but there is some skepticism regarding the overall contribution of this factor to the standard error [47]. General experimental design strategies encourage replication in order to estimate random variability in the measurement process and other strategies, such as dye-swaps, capture systematic variability as well. Systematic variations in microarray measurements can be hard to control in view of the layers at which these experiments must be designed. In principle, it is anticipated that the standard method of evaluating two-color microarrays using ratios will correct for systematic uncertainty related to the technical level of the experiment. In practice, systematic artifacts persist and sometimes threaten to obscure the true biological variability which an experiment is designed to investigate. One of these systematic artifacts arises from the fact that the ratio calculated for a given spot is a function of a variety of technical variables that are unrelated to gene expression, such as the PMT response. A number of methods have been developed to mathematically correct for these effects and are collectively referred to as normalization. These are considered in a later section.

4.2. Labeling

The rapid, simultaneous, and highly sensitive quantitative detection of transcripts from whole genomes remains the main objective of DNA microarray technology.
Fluorescent labeling of cDNA target molecules allows rapid detection while providing the high sensitivity desired without the inherent problems associated with radioisotopic labeling [48]. Increased detection sensitivity has allowed the technology to be applied to investigations where the quantitative amounts of starting material would otherwise be considered undetectable. For instance, most mRNA extracts yield less than 1 μg/g of tissue and even after amplification, one usually has only between 10 and 20 μg of cDNA, which is quite difficult to quantitatively detect for ordinary hybridization experiments. Although fluorescence is inherently sensitive as an analytical technique, signals emitted by the dye-labeled target often require enhancement through the incorporation of multiple fluorophores. The fluorescence intensity observed from the hybridized target will depend on a number of factors, including labeling density, fluorophore charge and linker length [49]. Trends in DNA microarray technology show continued advancements in the methodology for labeling cDNA [50,51], thus improving the detection of probe–target interactions. An understanding of the influence of labeling methods is important in the context of comparison of results from different experiments. The variability introduced by the labeling technique could be systematic and has been shown to influence expression patterns obtained in microarray experiments [51]. Labeling methods will influence the data in terms of sensitivity, reproducibility and dynamic range of the signal. For instance, the efficiency of dye incorporation of fluor-tagged bases into a target cDNA is argued to be less than that of incorporating functional nucleotides. Thus, as noted earlier, the variability in the measured intensity may reflect this dye incorporation efficiency and not differential expression, although this is still a subject of some debate.
Early methods of fluorescent labeling of cDNA involved either attachment of single fluorophores to the 5′ ends of DNA targets [51] or enzymatic incorporation of approximately 4% fluor-tagged bases [48,52] into DNA targets. In recent years, the demand for higher sensitivity in high throughput analyses has required increased
incorporation efficiency of fluorophores. Although such an increase comes with enhanced detection sensitivity [53], reports have now appeared in the literature demonstrating elevated fluorescence quenching and dwindling probe–target duplex stability resulting from bulky dyes [49]. Nonetheless, continued improvements in methodologies for preparation of fluorescent targets have given rise to a number of approaches for dye labeling, most of which address these issues. Several of these methods involve chemically coupling fluorophores to nucleotide substrates, the most common methods being the so-called direct and indirect labeling schemes. Whereas direct labeling methods incorporate nucleotides with covalently attached fluorescent tags into the targets, indirect methods affix the tags to incorporated modified bases via chemical coupling. Comparisons of these methods have been carried out in recent years, with mixed results concerning reproducibility, sensitivity and accuracy [50,51,54,55]. The conceptual simplicity of the direct labeling approach is perhaps its main advantage, in addition to the strength of signals obtained when nucleic acids are labeled by this approach. Dye-labeled nucleotides are synthesized by simple nucleophilic reactions between a succinimidyl ester on a fluorophore and a primary alkyl-amine modified nucleotide, usually deoxycytidine triphosphate (dCTP) [49,51]. Such a reaction scheme is illustrated in Fig. 5 using the most common dyes employed in two-channel microarray platforms, the cyanine dyes, so-called Cy3 and Cy5. These commercially available dye-modified bases are incorporated into the nucleotide sequences of cDNA targets during the reverse transcription of mRNA to cDNA. Direct labeling approaches carry the risk of unequal incorporation of dye-labeled nucleotides (Cy3 and Cy5) into the sequences of cDNA targets, perhaps due to the slight differences in the size of the two fluorophores, hence introducing a dye-bias.
This bias can give artefactual results that necessitate dye-swap experiments (see Section 4.1.2). There are two main indirect labeling approaches. In the first, nucleotides, usually deoxyuridine triphosphate (dUTP), are modified with a functional group such as aminoallyl [51,54], and these precursors are incorporated into a target cDNA sequence. This sequence is subsequently reacted with fluorophores to form covalent bonds between the modified bases and the fluorophores. The main benefit of this approach is increased efficiency of incorporation of the aminoallyl-modified nucleotides into a cDNA sequence owing to their relatively small sizes compared to dye-labeled nucleotides. This also eliminates signal bias resulting from differential incorporation efficiency of Cy3- and Cy5-labeled nucleotides. Another common indirect labeling approach is the so-called dendrimer indirect 3DNA labeling approach, originally described by Wang et al. [56] and later Stears et al. [57] in collaboration with Genisphere (Hatfield, PA), who currently market the technique. This approach is described in detail in references [55,57–59]. Although not as widely used, it is argued that indirect labeling approaches provide up to 300 times brighter signals than direct labeling methods and require much less mRNA for labeling [58]. Dendrimer labeling has an additional advantage that it yields targets with relatively high solubility in hybridization buffers, leading to low background fluorescence. Furthermore, since as little as 1–3 μg of total RNA can be labeled using indirect approaches, the need for amplification of starting material is diminished. (Amplification of starting material, if not performed carefully, may introduce experimental artifacts that can confound the results.) In contrast to the direct labeling approaches employed by two-color spotted arrays, GeneChips have generally employed an indirect labeling method. This uses nucleotides functionalized with a biotin moiety to generate the cRNA/cDNA.
Following hybridization to the array, streptavidin with a linked fluorescent dye is added to the array. The streptavidin binds extremely tightly to the biotin on the cRNA/cDNA, and excess dye can be washed away. More recently, indirect
Fig. 5. A typical nucleophilic reaction between a succinimidyl ester on a fluorophore (Cy3) and a primary alkyl-amine-modified deoxycytidine triphosphate (dCTP).

labeling using Genisphere's dendrimer technology has been used with GeneChips; however, in contrast to the method described above, biotin molecules are attached to the dendrimer, and then the avidin–dye is added, thereby increasing the number of binding sites for the fluorescent dye. A further note should be mentioned with regard to labeling, and that is the effect of ozone on signal intensities. Exposure of DNA microarray slides to ozone has been shown to affect signal quality, with a much greater effect on Cy5 (red) compared to Cy3 [60]. These effects have been observed with relatively low amounts of ozone exposure (5–10 ppb) and care should be taken to avoid exposing microarray slides labeled with Cy5 to environmental ozone, especially when wet. The Brown group has released plans for an enclosure to aid in eliminating ozone from a room or in a small enclosed environment [61], while Genisphere has released a product for coating arrays to prevent Cy5 degradation [62]. In 2008, GE Healthcare reported the development of an ozone-stable dye for DNA microarray applications [63].

4.3. Hybridization

A critical part of any microarray experiment is the hybridization of dye-labeled targets to surface-immobilized probes. The hybridization of complementary DNA strands on glass supports is relatively well-established in molecular biology [9,64,65] and is important to the quality of microarrays given that the specificity and affinity of probe–target interaction largely determines the quality of a microarray [58]. In this regard, factors that influence the efficiency and stability of hybridization will have a direct influence on the quality and amount of information that can be derived from a microarray study. These factors include hybridization time, the length and composition of probes and targets used for hybridization, and the hybridization temperature, as well as the pH, ionic strength and viscosity of the hybridization solution. The influence of these factors on the stability of the probe–target duplex has been discussed extensively in the literature [58,66,67] and will not be covered here. However, suffice it to say that several protocols have been developed to ensure that the experimental parameters that influence hybridization efficiency are optimized [55,67,68]. The signal obtained from a microarray is measured without reference to these hybridization conditions except where obvious aberrations are apparent, which often leaves only the option of repeating the experiments.

4.4. Detection

The use of DNA microarrays for monitoring transcriptional states of biological samples is generally accomplished by comparing the relative abundance of transcripts from two samples via hybridization to either a single array of DNA probes (spotted two-color arrays) or two different arrays (GeneChip). The simplicity of this concept is deceptive; complexities in the measurement process are often ignored in spite of their importance. In principle, differential gene expression is measured by determining the ratio of fluorescence intensities of the two dye-labeled targets emitting signals proportional to their concentration. In practice, however, acquisition of the data involves prior steps, including the scanning of the microarray with lasers set at different excitation wavelengths for Cy3 (green) and Cy5 (red) labeled targets and identification of spot locations on the microarray (a process referred to as gridding). The purpose of scanning the microarray is to excite the fluorophores tagged to the hybridized probes as well as to collect the emitted fluorescence and generate an image (for each wavelength) in which pixel intensities correspond to the level of localized fluorescence. For spotted arrays, these images are typically stored as pairs of unsigned 16-bit tiff files. To evaluate the fluorescence ratio, the location of the fluorescing spot is carefully determined to accurately relate the pixel intensity to fluorescence of a hybridized transcript. Due to some specific considerations when using GeneChips, the remainder of this section
will focus on analysis using spotted microarrays, and specific points relevant to GeneChips will be considered separately. Higher-level analyses of DNA microarray data generally pay little or no attention to the measurement processes mentioned above and dubious assumptions are often made regarding the variability in the data. Yet, even when all experimental aspects are held constant (the solid support system, the spotting procedure, the probe types, the labeling and so on), the process of acquiring data can still influence the variability of gene expression patterns observed in a microarray experiment. Conventional approaches for capturing uncertainty in microarrays focus on variance associated with the spatial position of the microarray spots, the sample preparation, and the biological sampling. In most cases, these sources of variability are captured by replicating the deposition of DNA probes in varied locations on the microarray, as well as by replication of hybridization procedures and biological samples (see Section 4.1.3). Whereas these approaches may control uncertainties from extrinsic sources, inherent ambiguities that arise during the measurement process persist. These can arise from two main sources, the scanning of the microarray and primary level processing of the images. It has been reported that uncertainty in microarray results may be associated, in part, with the fluctuations observed when independent scans of the same microarray are conducted [69]. Thus, when this source of variability is ignored, the observed gene expression profiles are likely to be confounded with scanner instabilities. This is especially true for cases where the microarray is scanned by running a single laser pass for each dye. Furthermore, primary level processing of the acquired images introduces a potential for severe aberration in cases where the processing methods are not robust.
In particular, this processing partitions the spots into regions called foreground and background and this could lead to severe uncertainties if the shapes of the spots are not well-defined (see Section 4.5). Although the individual sources of uncertainty in microarray measurement processes may be negligible, an understanding of their incremental contribution to the overall variability is essential for microarray technology to reach its full potential. This section describes the microarray data acquisition process, focusing on the scanning procedures and primary level processing in the context of measurement quality.

4.4.1. Scanning the microarray

Generally, the acquisition of fluorescence signals emitted by dye-labeled molecules on the microarray occurs by laser scanning confocal microscopy. In conjunction with photomultiplier tubes (PMT) or charge coupled device (CCD) cameras, microarray scanners detect and record the emitted fluorescence signals, which are stored as 16-bit tiff images for further analysis. Although the PMT is the most ubiquitous detector employed in microarray scanners due to its cost-effectiveness, portability and high sensitivity, CCDs play an important, albeit peripheral, role in microarray technology. This is because CCDs have excessive operational demands despite the desired high sensitivities they exhibit [70]. The general architecture of a microarray scanner consists of a light source(s), optical components (mirrors and lenses), a detector, and a data acquisition system. A basic optical architecture representing the general configuration of confocal laser scanning microscopes employed in most microarray scanners is shown in Fig. 6. Typically, the light sources consist of gas or solid-state lasers. Xenon lamps, which supply white light, are also a viable option, although their size and heat dissipation make their use less common.
In the detection process, laser light is directed through the dichromatic (dichroic) mirror that allows light of a desired frequency to excite the sample after passing through a set of microscope objective lenses. Excited fluorophores emit light of a different wavelength, which is transmitted back through the objective lenses to the detector via the dichroic mirror. The pinhole, conjugated to the

focal point of the objective lenses, eliminates out-of-focus fluorescence from reaching the PMT, where the true signal is amplified and detected (conjugation of the pinhole to the focal point of the objective lens is the key to confocal microscopy). Finally, the analogue signal from the PMT is digitized (by A/D converters) and recorded to depict a map of pixel intensities, which is stored as a 16-bit tiff image. It should be noted that the development of confocal microscopic measurements was a key component in enabling modern microarray technology, since it allowed fluorescent interferences outside the focal plane to be greatly reduced. Scanning the microarray is usually executed pixel-wise, at resolutions ranging from 5 μm to 10 μm, in a mechanism that involves x–y Cartesian translation of either the substrate or optical components. Most microarray scanners employ the former due to the advantages associated with a stationary optical path and increased durability of delicate scanner components [70]. Common strategies for exciting samples during the scanning process include simultaneous and sequential scanning mechanisms. The first approach employs two laser light sources in parallel, and yields two images (corresponding to the two dyes) in a single pass. This approach allows faster scanning rates but may exhibit a lower signal-to-noise ratio (S/N) and is prone to increased cross-talk [70]. Sequential scanning is designed to minimize cross-talk since independent scans are run for each dye. Regardless of the mechanism used, the laser power and the PMT gain normally need to be optimized independently for each channel before image acquisition. This process, often referred to as a pre-scan, ensures that each channel has adequate sensitivity to represent low-level signals without excessive saturation of high level signals.
In principle, either the laser power or PMT voltage could be used to adjust the signal amplitude, but factors such as photobleaching and S/N need to be considered. The most common errors affecting microarray scanner signals can be categorized based on their origins from either instrumental components, the substrate, or various contaminants. For instance, the quantized arrival of photons at the detector is governed by Poisson statistics, which leads to a measurement standard deviation equal to the square-root of the signal. Thus, this type of uncertainty (shot noise) is tied to the instrument detector and, although impossible to eliminate, it can be estimated by appropriate error models if it is suspected to be the dominant source of uncertainty in the measurement. Other examples of instrument noise include laser and PMT noise. Laser noise (source flicker noise or drift noise) arises due to intensity fluctuations over time and is typically characterized by a multiplicative effect on the signal. PMT noise (detector noise) may result from fluctuations in the amplification of the signal or the presence of dark current. Conceivably, one of the most important considerations in PMT noise is the effect introduced by increasing the voltage gain. Whereas this is intended to yield stronger signals, escalating PMT voltages enhance the background noise as well. For a given instrumental setup, it is difficult to predict which of these sources will dominate the instrumental noise. Noise arising from the substrate is predominantly due to the non-uniformity of the surface. One of the fundamental properties of laser scanning confocal microscopy is that the focal points are pre-set to depths of 2.5 μm in order to restrict the collected signals to those that originate only from the desired sample. In extreme cases where the surface is spatially heterogeneous, it is likely that undesired signals will be propagated to the pinhole.
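The square-root relationship for shot noise is easy to verify by simulation. The sketch below draws Poisson-distributed photon counts with a simple Knuth-style sampler (standard library only; the mean count is invented) and compares the sample standard deviation with the square root of the mean:

```python
import math
import random

random.seed(1)

def poisson(lam):
    # Knuth's multiplicative method; adequate for the modest mean used here.
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= random.random()
        if p <= limit:
            return k
        k += 1

mean_photons = 400.0   # invented mean photon count for one pixel
counts = [poisson(mean_photons) for _ in range(5000)]

m = sum(counts) / len(counts)
sd = (sum((c - m) ** 2 for c in counts) / (len(counts) - 1)) ** 0.5

# For Poisson counts the standard deviation tracks sqrt(mean):
print(round(m, 1), round(sd, 1), round(math.sqrt(mean_photons), 1))
```

The simulated standard deviation comes out close to sqrt(400) = 20, which is the behavior an error model for shot-noise-limited pixels would assume.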
Other sources of sample uncertainty include dust smudges on the slide and back-reflection of the laser light. In theory, the only reagents on the slide at the time of scanning should be cDNA and the dyes used to label the DNA. In practice, although great care is taken to remove as much of the excess reagents as possible, traces of the various chemicals remain on the slide. These chemicals often have spectral profiles that overlap with those of the dyes, contributing to noise in the


T.K. Karakach et al. / Chemometrics and Intelligent Laboratory Systems 104 (2010) 28–52

Fig. 6. General optical architecture of a confocal microscope.

background intensities. This type of noise can very rarely be removed, as discussed in Section 4.5.2.

4.5. Image processing: spotted microarrays

The key aspects of microarray image processing are the identification of the spotted probe locations after scanning and the quantitation of signal intensities/ratios at the probe sites. We refer to this as primary level processing since it is the first step in microarray image analysis. Most software packages for processing microarray images include routines for gridding, segmenting the spot pixels into foreground and background (for spotted arrays), and measurement of the signals corresponding to these regions. In this section, a brief mention will be made of these aspects of image analysis in order to define and outline the goals of each approach, and their possible limitations. Yang and coworkers [71] have reviewed many software packages that are designed to perform the segmentation and background adjustment aspects of these processes. One of the basic purposes of imaging microarray slides is to permit a global visualization and interpretation of the relative concentrations of hybridized transcripts when corresponding Cy3 and Cy5 images are overlaid. This rudimentary analysis of the data is designed to provide the investigator with a general overview of the hybridization success of the experiment based on an interpretation of the color codes. When the images are overlaid, an assessment can be made regarding the concentration of labeled transcripts from one sample relative to the other by examining the predominance of either Cy5 (red) or Cy3 (green) on the spot. Conventionally, a red spot is interpreted as resulting from preferential hybridization of the Cy5-labeled targets to the probe relative to the Cy3-labeled targets, and vice-versa. Preferential hybridization to a spot is assumed to be influenced only by higher concentrations of the particular target.
If the concentrations are in equal proportion then the spot is expected to be yellow and, if no target hybridized, the spot is expected to be black. An example of this type of image is shown in Fig. 7. In addition to providing a general overview of the hybridization, such images are also very useful in evaluating the intensities of external controls, which allow calibration of scanner settings during the scanning of the array. If external controls are spotted on the array in incremental concentrations, it is possible to calibrate a scanner's dynamic range by adjusting its PMT gain and laser power until a desired brightness is obtained from pre-scans. It is important to recognize that microarray images, although coded in terms of red and green contributions, are rendered through

software that is not intended to reflect the fluorescence spectra actually obtained. The images are false color representations of the intensities measured on the two fluorescence excitation channels, and the viewer's perception will be the convolution of several transformations, including the color mapping of the software, the representation of colors by the output device, and the processing of visual information by the eye and the brain. It would seem natural for software to represent the intensities of the two channels directly as the red and green components of the red–green–blue (RGB) triple used to encode colors on most video displays, but this turns out not to be visually satisfying and is limited by the fact that pixel intensities are restricted to 8 bits (0–255) while the fluorescence intensities are encoded as 16-bit values (0–65,535). Consequently, most commercial software applications use a technique known as color mapping, in which combinations of the two dye intensities define a particular RGB triple of pixel intensities. Although this makes the representation of the image more subjective, since it relies on the design of the mapping, it also allows for the use of more subtle hues and shades that can be more appealing to the viewer. In addition to color mapping, commercial software also tends to apply transformations to the measured intensities to make the images more visually informative, if less quantitative. The wide range of intensities on a microarray often results in an image that is dominated by a few spots of high magnitude, while the remaining spots are too faint to be seen. Not only is this relatively uninformative, but it also leads to difficulties when trying to grid the spots (see below), since a faint spot is indistinguishable from the background.
A common solution to this problem is to apply a square-root transform to the data to be displayed, which has the effect of suppressing large signals and amplifying small ones, as well as reducing the range from 16 bits to 8 bits. This gives a more complete picture, although it may present a distorted view of relative intensities. Because of the combination of data transformation and color mapping, care should be taken not to over-interpret visual images provided by microarrays. It should be noted that at least one instrument has been produced that allows scanning of fluorescence spectra at each pixel (hyperspectral imaging) [73], but such an instrument is currently impractical for routine use, so we must continue to rely on the quantitative information available from two channels. In order to quantify the results in the microarray image, specific pixel intensities of the fluorescing spot must be evaluated to provide a measure of the relative concentration of dye-labeled targets that have hybridized. To achieve this, the Cartesian coordinates of the spot on the image must be identified and separated from spurious signals


Fig. 7. Image of a sub-array from a typical microarray. This image is part of the CAMDA [72] data set depicting red, green and yellow spots as well as black holes. The interpretation of the color codes is as described in the text.

outside of the probe site. This is typically referred to as gridding or addressing. This process could, in principle, be automated since the basic structure of the microarray is determined by the arrayer. The number and arrangement of pins on the arrayer print-head provides the fundamental structure of rows and columns of grids (also referred to as sub-arrays or sub-grids). In addition, the number of rows and columns of spots printed in each sub-grid is pre-set. As a result, various gridding software applications use the known x–y displacements of spots per grid and the x–y separation of the pins, together with the initial location of the first spot, to automatically determine the address of each spot on the microarray. Unfortunately, this simplified approach to automatic gridding usually requires manual intervention to optimize the separation between grid rows and columns and to accommodate slight variations in individual spot positions resulting from shifts in print-tip positions and, sometimes, shifts in rows or columns of spots in a sub-array, including rotation of the grid axes relative to the image. An example of a grid is shown in Fig. 8. One major disadvantage of manual gridding is the time required and the associated monotony, which has the potential for introducing user bias and inaccuracy. The importance of accurate gridding stems from the reliance of most of the higher-level analysis methods for microarrays on reliable measurements of the pixel intensities comprising the spot. In this regard, addressed spots are classified into foreground and background regions through a process referred to as segmentation. Pixels within the foreground region are believed to represent the true signal corresponding to fluorescing dye-labeled target that hybridized to the spot. On the other hand, pixels in the background region correspond to spurious signals from the substrate that are unrelated to the hybridized targets.
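The nominal grid geometry described above lends itself to a simple calculation. The following sketch computes the nominal centre of any spot before manual refinement; the pin spacings, spot pitches and origin (in pixels) are hypothetical values of our own choosing:

```python
# Hypothetical grid geometry, in pixels; the values are illustrative only.
ORIGIN = (100.0, 120.0)       # centre of the first spot of the first sub-grid
PIN_SPACING = (450.0, 450.0)  # x-y separation between sub-grid origins
SPOT_PITCH = (40.0, 40.0)     # x-y separation between adjacent spots

def spot_centre(pin_col, pin_row, spot_col, spot_row):
    """Nominal (x, y) centre of one spot, before any manual adjustment."""
    x = ORIGIN[0] + pin_col * PIN_SPACING[0] + spot_col * SPOT_PITCH[0]
    y = ORIGIN[1] + pin_row * PIN_SPACING[1] + spot_row * SPOT_PITCH[1]
    return (x, y)

print(spot_centre(0, 0, 0, 0))  # (100.0, 120.0)
print(spot_centre(1, 0, 2, 3))  # (630.0, 240.0)
```

In practice these nominal centres serve only as starting points, which is why the manual adjustments described above remain necessary.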
Thus, for each spot, segmentation will lead to the identification of a region around the spot, referred to as a spot mask, which is comprised of pixels from either the foreground or the background. It has therefore been argued that, after addressing, segmentation is the most important step in microarray image processing [74].

The most common segmentation methods can be classified according to whether they place spot-shape restrictions on the estimation of the spot masks. Fixed circle and adaptive circle segmentation methods assume that the spots are circular, while the histogram segmentation and adaptive shape segmentation methods [74,75] place no restrictions on the shapes of the spots. The central difference between fixed circle and adaptive circle segmentation is that the former fits a circle with a fixed radius for all spots in the image, while the latter allows estimation of a different radius for each spot. In principle, if all the spots are of similar size, then the fixed circle segmentation method provides estimates of background and foreground regions similar to the adaptive circle approach. Unfortunately, spot sizes within a microarray vary due to unequal deposition of material on the spots by the pins, and thus the procedure is prone to inadequate segmentation of spots. On the other hand, it has been argued that the adaptive circle method can be overly time-consuming for an array with thousands of spots since it requires the user to adjust spot sizes. Furthermore, when the signal strength is low, it is hard to distinguish a transition between the foreground and background. Several automated software applications have been developed to address the drawbacks of the adaptive circle segmentation approach. Chen et al. partition the pixels into background and foreground by setting up a nonparametric test statistic that enables one to distinguish foreground pixels from background in the proximity of the probe site [76]. In particular, the Mann–Whitney test statistic is used to test the hypothesis that the intensity of a set of pixels chosen from outside the probe site is equal to that of a similar set of pixels chosen from the probe site. When the null hypothesis is rejected, the set of pixels causing the hypothesis to be rejected is assumed to correspond to the signal from a hybridized target.
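The hypothesis test at the heart of this approach can be illustrated with SciPy. This is a simplified sketch on synthetic pixel samples of our own construction; the iterative refinement of the published algorithm is omitted:

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Synthetic pixel samples: candidate foreground pixels from the probe site
# versus background pixels chosen from outside it.
rng = np.random.default_rng(1)
background = rng.normal(200.0, 20.0, size=64)
candidates = rng.normal(1500.0, 150.0, size=64)

# Rejecting the null hypothesis of equal intensity distributions means the
# candidate pixels are treated as signal from a hybridized target.
stat, p = mannwhitneyu(candidates, background, alternative="greater")
print(p < 0.01)  # True for this clearly separated synthetic spot
```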
Thus, all the pixels in the spot target mask that have intensities higher than the set that led to the rejection of the null hypothesis are classified as


Fig. 8. Section of typical grids on a microarray image. The circles delineate the areas segmented as belonging to the spots.

foreground. Unfortunately, the computational demands of this approach may have limited its utility, since microarray data contain several thousand probe sites. To alleviate the time constraints associated with user intervention in adaptive segmentation approaches, Buhler et al. developed Dapple [77], a spot-finding approach that places candidate spots (identified using the provided x–y displacements of spots per grid) into vignettes. Within the vignettes, spots are identified by looking for characteristic sharp edges (given that spot morphology profiles exhibit rising intensities at the edges where they meet the background) typified by high negative second derivatives of pixel intensities with respect to the x and y displacements. Thus, the brightest ring in the vignette is identified and used in segmenting the spot. Another approach, Matarray [78], amalgamates signal intensity and spatial information to determine spot locations and appropriate segmentation. Similar to Dapple, spots are identified from the initial estimates of spot location provided by the user. For each spot, patches (similar to the vignettes in Dapple) are defined and a circle is circumscribed around a tentative spot centre to provide foreground pixels, while pixel intensities outside the circle but enclosed in the patch are segmented as background. Grid locations are adjusted after identifying pixels within the circle that have intensities greater than the sum of the mean background intensity and twice its standard deviation. The locations of such pixels are determined, their centre re-calculated, and new patches defined. The process is repeated until some convergence criterion is satisfied. Although in principle most spot shapes are expected to be circular, in practice spots printed in-house rarely exhibit the perfect shapes anticipated, and instead descriptors such as comet tails, craters and donuts have been associated with aberrant microarray spot morphologies. Fig. 9 shows a sampling of the variety of spot shapes that can be obtained from a typical microarray. Accordingly, restricting spots to particular shapes could provide poor estimates of fluorescence intensities for hybridized targets when the spotted probes exhibit morphologies different from the prescribed ones. Advanced approaches, referred to as adaptive segmentation methods, such as watershed and seeded region growing, continue to be used for

microarrays, albeit with mixed success [79]. The most widely used method for segmenting spots without restricting them to particular shapes is the histogram method [75]. This method defines a target spot mask whose size is chosen to be bigger than any spot and evaluates a histogram of the pixels within this mask. From this histogram, the background intensity is calculated as the mean of the pixels between the 5th and 20th percentiles, while the foreground is the mean intensity of the pixels between the 80th and 95th percentiles. The segmentation methods discussed in this section are implemented in most software applications that perform primary level processing of microarray images. Table 1 reports the methods employed by some of these applications. Several recent developments [82–84] introduce more complex methods of spot segmentation in order to improve the determination of fluorescence ratios through minimized misidentification of spot masks. It should be noted that even with automated methods for spot addressing and segmentation, the resultant grids are often checked manually and, if necessary, adjusted to better segment the spots.

4.5.1. Ratio calculation

Statistical analysis of microarrays is based on the evaluation of the relative fluorescence intensities of two differentially labeled targets that are hybridized to a probe. Ratiometric methods [76,85] of analysis are preferred because absolute fluorescence intensities do not correspond directly to the absolute concentration of the mRNA obtained from each of the samples. Instead, the observed fluorescence is a function of the efficiency of dye incorporation and DNA hybridization, the length and amount of probe attached to the surface, the relative content of dye-modified bases in a transcript, and the scanning parameters (laser intensity, PMT gain, etc.). For single-channel microarrays (GeneChip), the test and reference scans are made on separate arrays (the analysis of GeneChips will be discussed in a later section).
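As a concrete illustration of the per-spot summary statistics used in ratio calculation (ratio of medians, ratio of means, median of ratios, mean of ratios), here is a minimal sketch on synthetic pixel intensities; the simulated values and variable names are our own:

```python
import numpy as np

# Synthetic foreground pixels for one spot, with a true Cy5/Cy3 ratio of 2.
rng = np.random.default_rng(2)
cy3 = rng.uniform(500.0, 1500.0, size=100)
cy5 = 2.0 * cy3 + rng.normal(0.0, 20.0, size=100)

ratio_of_medians = np.median(cy5) / np.median(cy3)
ratio_of_means = cy5.mean() / cy3.mean()
median_of_ratios = np.median(cy5 / cy3)
mean_of_ratios = (cy5 / cy3).mean()

# For a well-behaved spot all four statistics agree closely with the true
# ratio; they diverge for skewed or low-intensity pixel populations.
print([round(r, 2) for r in
       (ratio_of_medians, ratio_of_means, median_of_ratios, mean_of_ratios)])
```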
For spotted microarrays, this is not possible because direct comparison of intensities from separate arrays would be greatly affected by variations in spot morphology. The use of two-channel microarrays overcomes this limitation since the spot morphology for


Fig. 9. A three-dimensional view of typical spot morphologies alongside the spot images drawn from the CAMDA data set [72]. Each spot is indexed with an identifying spot number, grid (block) number and column and row numbers, as well as a corresponding gene ID. The morphologies of the spots are identified as follows: (a) comet tail, (b) normal high intensity, (c) crescent, (d) donut, (e) pointed and (f) normal low intensity.

the test and the reference channels will be the same for a given spot. For these arrays, the morphology will influence the calculation of the ratio, and several methods have been developed to estimate summary statistics for the pixels in the spot masks. These methods include the ratio of medians, ratio of means, median of ratios, mean of ratios, and regression ratios, which are discussed in more detail below. One of the most widely used methods for ratio calculation is the ratio of medians, whereby differential expression is measured as the ratio of the medians of the pixel intensities within the spot mask for the two channels. The median is intended to represent the centre of the distribution of pixel intensities in the spot mask. Perhaps one of the major advantages of this approach is that the measured ratios are robust to the influence of a few pixels with extreme values at either end of the distribution. Unfortunately, when spots are characterized by substantial regions (>50%) of low-intensity

pixels, as in the case of the donuts shown in Fig. 9d, it is anticipated that the low-intensity pixels will dominate the spot mask and result in ratios with a high uncertainty. Another common measure of differential expression involves evaluating the ratio of the means of the pixel intensities within the spot mask. Calculation of mean values is straightforward and less affected by extended regions of low-intensity fluorescence, but means are more susceptible to the influence of extreme values, i.e., outliers in the pixel population. For this reason, the ratio of means is generally less robust. A less frequently used approach to measuring the relative fluorescence is to calculate pixel-by-pixel ratios of intensities across the spot and then report the differential expression as the arithmetic mean or median of the ratios. This is referred to as the mean of ratios or median of ratios, respectively. A major drawback of this approach, especially when using means, is the high sensitivity of the summary


Table 1
Image segmentation methods used in some commercially and publicly available software packages.

Software                      Segmentation method
QuantArray (GSI Lumonics)     Histogram, fixed circle
ScanAlyze [80]                Fixed circle
GenePix Pro (Axon)            Adaptive circle
UCSF Spot [79]                Histogram
TIGR Spotfinder [81]          Histogram
Dapple [77]                   Adaptive circle
Matarray [78]                 Adaptive circle

statistic to pixels near the background level, since the ratio measurements can become very erratic in these cases. A potential advantage, however, is that the individual ratios across the spot provide a population from which dispersion measures can be used to estimate the uncertainty in the ratio, as has been described in the literature [86]. Such uncertainty estimates are often unreliable, however, because of inhomogeneity in the variances. Another infrequently used approach to ratio measurement is the regression ratio method. This method determines the ratio directly from the slope of a plot of Cy5 vs. Cy3 (or vice-versa) pixel intensities across a spot [87]. Although the regression can be influenced by outliers in a manner similar to the ratio of means, one of its potential advantages is that, for a range of pixel intensities across the spot, it should allow some compensation for the contribution of background fluorescence through the use of an intercept. In this way, segmentation of the image into foreground and background pixels is not critical. For the regression ratio to be calculated properly, however, orthogonal regression, rather than conventional regression, should be employed.

4.5.2. Background measurements

Typical microarray expression ratios are adjusted to eliminate the influence of the background signal, which can be the result of non-specific hybridization and auto-fluorescence from the glass slides. The ratio is adjusted by subtracting a measure of the estimated background signal from the foreground signal for the spot. Most microarray software applications estimate the background signal by measuring the intensity of pixels in the proximity of a spot mask, i.e., from the pixels segmented by the image analysis software as being background. However, there are some differences among the methods of background estimation used.
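A minimal sketch of this kind of local background correction, using median summaries and flagging the physically meaningless negative results discussed below (the function and variable names are our own, not from any particular package):

```python
import numpy as np

def corrected_signal(foreground, background):
    """Median foreground minus median local background, with exclude-flag."""
    value = float(np.median(foreground) - np.median(background))
    return value, value < 0.0

fg = np.array([900.0, 950.0, 1010.0, 880.0])  # spot-mask pixel intensities
bg = np.array([100.0, 120.0, 90.0, 110.0])    # nearby background pixels
signal, exclude = corrected_signal(fg, bg)
print(signal, exclude)  # 820.0 False
```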
GenePix Pro 3.0 (Axon) estimates the background from a circular region, centred on the spot, with three times the diameter of the foreground spot, excluding regions that define spots and maintaining a two-pixel buffer from the spot masks. The appropriate measure (median, mean) of the pixels drawn from this region is computed and subtracted from the corresponding measure of the pixel intensities computed from the spot mask. QuantArray (Packard BioScience) evaluates the background by calculating the mean or median of the pixels (in a histogram of all pixels) that lie below a given percentile. Matarray [78] draws background pixels from patches that define a neighborhood region for each spot and evaluates their mean, which is reported as the background intensity for the chosen spot. The presence of background fluorescence in the calculation of intensity ratios is undesirable because it introduces bias into the result, especially for low-intensity spots that are near the background level. However, errors in the estimation of background intensities can be just as damaging to the quality of the measurements. For example, one of the immediate potential effects of subtracting background intensity from the foreground is negative intensities, which are not meaningful from a physical perspective and indicate aberrant measurements. Spots exhibiting such characteristics are normally excluded from further analysis. Whether the common methods of

background calculation truly meet their objectives has been a subject of debate [88]. Arguments have emerged that naïve background subtraction is sometimes more detrimental to the final analysis of microarray data than not correcting for background at all [73,89]. In fact, fundamental questions have been raised in the literature recently regarding the legitimacy of background estimated this way. A general assumption in calculating background signals in microarrays, as outlined above, is that the background signal around a spot reflects the background signal at the spot location. This is not necessarily the case, however, since the surface chemistry at the spot is fundamentally different from that away from the spot. Moreover, differences in the spot-localized background are suggested by the presence of black holes on microarrays, where the spot appears darker than the surrounding background. These differences were confirmed by Timlin et al. [73], who used hyperspectral imaging of spot fluorescence combined with multivariate curve resolution to separate the fluorescence spectra of the dyes from glass and contaminant fluorescence. It was shown that background fluorescence in microarrays is variable, spot-localized, and channel-dependent. The likely ramification of this is a further deterioration in the quality of the data when standard background correction methods are used, especially for low-intensity signals. Unfortunately, the true background at the spot cannot be calculated using current dual-wavelength scanners, as they are incapable of distinguishing fluorescence due to a contaminant from the true fluorescence due to the dye-labeled target. There have been suggestions that spot-localized background fluorescence could be estimated from negative controls or blank spots (in the absence of the former) [75,90], but this is also problematic, especially since spatial variations in the background are commonly observed across a microarray.
It may also be possible to mitigate the effects of the spot-localized background by using the regression ratio method, which includes an intercept term, but this has not been demonstrated. In addition to leading to erroneous ratio estimates, background issues in microarrays can confound normalization methods that assume a linear relationship between the background-corrected intensities of the two channels, as will be discussed in Section 4.8.

4.6. Image processing: GeneChips

In contrast to spotted microarrays, the extremely precise and repeatable fabrication process used to manufacture GeneChips (discussed in Section 3.1), in conjunction with the large number of probes used per gene, substantially changes the methods employed to convert GeneChip images into intensities on a per gene basis. Representing the spot intensities is greatly simplified by the use of only one fluorophore on each array, requiring only a simple square-root transform to allow the full range of intensities to be interpreted by the user when viewing the raw image. Gridding the image is also simplified through the use of regular patterns of control probes at the corners of the array for grid placement, as well as in particular sections of the array to help correct for grid misalignment. The extremely precise method of manufacture also means that deviations from regularity are very rare, thereby simplifying the process of addressing the spots over the image. In addition, following the gridding process, spot segmentation into foreground and background regions is not required, as each spot almost completely fills its location on the array. However, due to the possibility of overlapped pixel intensities and the tendency for decreased signal intensities at the edges of the spot, the outermost pixels are discarded, and the 75th percentile of the remaining pixel intensities is reported [91]. These probe-level intensities and their associated standard deviations are stored in a file referred to as the CEL file.
In contrast to the two-color microarrays, GeneChips use a set of multiple probes to interrogate the expression level of each gene. Transforming the intensities from all the probes in a particular set into a single value for use in downstream analysis is not a trivial process, and many different methods have been


developed. The details of many of the methods used are beyond the scope of this review; however, a general description of four widely used methods is presented in the next section so that the reader can appreciate the nature of the procedure.

4.6.1. Probe summarization

The goal of probe summarization is to transform the intensities from a set of perfect match (PM) and mis-match (MM) probes for a particular gene into a single value that can be used in subsequent analyses. As previously mentioned, each gene has a set of 11 to 20 probe pairs of PM and MM probes, each 25 bases long, where the 13th base of the MM differs from the PM sequence. These MM probes are used to provide an estimate of the signal of the PM resulting from non-specific hybridization. The earliest methods to summarize the probe signals employed a simple average difference, whereby the average of the differences PM_ij − MM_ij (j = 1, …, J) was calculated for each array i. However, these techniques suffered from a number of weaknesses: (1) unaccounted-for variations in background signal; (2) MM intensities that are higher than the corresponding PM intensity; (3) multiplicative error in the measured signal intensities (see Section 4.7 for more details on errors); and (4) differential responses of probes to the same gene. Various solutions over time have attempted to correct different subsets of these problems. The most widely used methods, and how they address these weaknesses, are described below. Although the MM probes are designed to correct for background signal and signal in the PM due to non-specific hybridization, for many samples there is a large proportion of MM probes with higher signal than the corresponding PM probe. The MAS 5.0 algorithm developed by Affymetrix [92] calculates the signal as the anti-log of a robust average (Tukey biweight) of the values log(PM_ij − IM_ij), j = 1, …, J.
To avoid taking the log of negative numbers, IM is defined as a quantity equal to MM when MM < PM, but adjusted to be less than PM when MM ≥ PM. For more specifics on the adjustments used to define IM, the reader is referred to the Affymetrix documentation [92]. A different approach was used by Li and Wong [93] in their dChip software package. They theorized that the different probes in a set may have different affinities for the same gene, but that their behavior across arrays should remain constant (after normalization), and therefore the affinities can be modeled using a set of arrays. They employ the model

PM_ij − MM_ij = θ_i φ_j + ε_ij    (3)

This assumes that the probe affinities (φ_j) influence the final signal in a multiplicative manner and are the same across all of the arrays. Therefore, fitting this model using multiple arrays allows calculation of θ_i for each array i, giving a summary statistic for the probe set. The authors claim that fitting this model also allows detection of defective probes through the discovery of probes that do not fit the model well. The second probe affinity modeling approach is the robust multichip average (RMA) [94–96]. This method has been implemented in the Bioconductor [97] package affy, and has over 1000 citations. In contrast to the methods discussed thus far, RMA uses only the PM probe intensities on each array, due to the high proportion of MM probes with higher intensities than the corresponding PM probe. In comparison to the dChip method, RMA uses the log2-transformed PM values, leading to

Y_ij = μ_i + α_j + ε_ij    (4)

where Y_ij is the normalized, background-corrected log2(PM) value. Fitting the equation using the Y_ij values allows one to estimate μ_i, the summarization value for the probe set on array i. Finally, Affymetrix also developed a method very similar to dChip and RMA, probe logarithmic intensity error (PLIER) estimation [98]. Like the others, it fits the probe intensities across a series of multiple chips; however, it also applies a penalty function for those probes that are less informative. All of the model-based methods use probe intensities from multiple arrays, and only arrays that are expected to behave similarly (i.e., where most of the genes are not undergoing differential changes between samples) should be processed in the same batch. In addition, the model-based methods all use probe intensities that have been background-corrected (Section 4.6.2) and normalized (Section 4.8) prior to summarization. It should be noted that the above summarization methods are only the ones that the authors have encountered most often in the literature, and new methods are continuously being developed. Frequently, the raw data made available for analysis consist of the raw intensities for each probe in addition to the summarized values, although this is not required for submission to many of the microarray databases.
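The two model fits can be sketched in a few lines. This is our own illustration on synthetic data, not the published estimators: the multiplicative dChip-style model of Eq. (3) is a rank-one approximation, so a least-squares fit is given by the leading singular triple of the probe-difference matrix, while the additive model of Eq. (4) can be fitted robustly by median polish as in RMA. Both published methods include outlier handling that is omitted here:

```python
import numpy as np

rng = np.random.default_rng(0)

# --- multiplicative (dChip-style, Eq. 3) fit via SVD -------------------
theta = np.array([1.0, 2.0, 4.0])          # per-array expression levels
phi = np.array([0.5, 1.0, 1.5, 2.0, 3.0])  # per-probe affinities
D = np.outer(theta, phi) + rng.normal(0.0, 0.01, size=(3, 5))
U, s, Vt = np.linalg.svd(D, full_matrices=False)
theta_hat = U[:, 0] * s[0]                 # theta up to a shared scale/sign
print(np.round(theta_hat / theta_hat[0], 2))  # relative levels near [1, 2, 4]

# --- additive (RMA-style, Eq. 4) fit via median polish -----------------
def median_polish(Y, n_iter=10):
    """Alternately sweep out row (array) and column (probe) medians."""
    Y = Y.copy()
    row = np.zeros(Y.shape[0])             # mu_i (array effects)
    col = np.zeros(Y.shape[1])             # alpha_j (probe effects)
    for _ in range(n_iter):
        r = np.median(Y, axis=1); row += r; Y -= r[:, None]
        c = np.median(Y, axis=0); col += c; Y -= c[None, :]
    return row, col

mu = np.array([8.0, 9.0, 10.0])            # true array effects (log2 scale)
alpha = np.array([-1.0, 0.0, 0.5, 1.0])    # true probe effects
row, col = median_polish(mu[:, None] + alpha[None, :])
print(np.round(row - row[0], 2))           # array-effect differences [0, 1, 2]
```

Note that both models are identifiable only up to a constant (a shared scale for θ and φ, an additive offset for μ and α), so only relative expression levels across arrays are meaningful.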

4.6.2. Background correction for GeneChips

The MM probes on the arrays are designed to account for non-specific hybridization, with the signal from the PM probe comprised of both specific and non-specific signal, and that from the MM probe of non-specific signal only [99]. This assumption that the MM probes account for only non-specific signal led to the original practice of correcting the PM intensities by subtraction of the corresponding MM intensities. This does not account for all of the background signal, however, nor does it account for the many instances of MM probes with higher intensities than the corresponding PM probe. Methods to account for the global background (common to the whole chip) when using PM/MM measures were introduced in the MAS 5.0 algorithm by Affymetrix. The array is divided into zones, the probe values are sorted within each zone, and the lowest 2% are chosen as the background for that zone. In order to avoid discontinuities between zones, the background value to subtract from each probe is calculated as a sum of the background values from all zones, with each background value weighted by the distance of the probe from the centre of its zone. To avoid creating negative probe intensities in this process, a lower threshold is also computed based on the noise in the lowest 2% of values in each zone and the same weighting scheme as used for the background value. Unfortunately, the MAS 5.0 method uses a fairly arbitrary method of substituting the values used for MM when the MM intensity is higher than the PM intensity (see Section 4.6.1). There have been a range of proposed solutions to fix this. One is to ignore the MM intensities completely and use only the PM probes and associated intensities. RMA uses this approach, assuming that the PM intensities are a mixture of background and true signal, and that the background intensity is normally distributed while the true signal follows an exponential distribution [100].
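The zone-based background scheme described above can be sketched as follows. This is a simplified illustration in the spirit of MAS 5.0, on a synthetic chip with four zones and a simplified distance weighting; the smoothing constants and noise-based threshold of the real algorithm are omitted:

```python
import numpy as np

# Synthetic 64 x 64 chip of probe intensities, split into a 2 x 2 grid of
# zones; each zone background is the mean of the lowest 2% of its values.
rng = np.random.default_rng(4)
chip = rng.gamma(2.0, 50.0, size=(64, 64))

centres, zone_bg = [], []
for zi in range(2):
    for zj in range(2):
        block = chip[zi * 32:(zi + 1) * 32, zj * 32:(zj + 1) * 32].ravel()
        cutoff = np.quantile(block, 0.02)
        zone_bg.append(block[block <= cutoff].mean())
        centres.append(((zi + 0.5) * 32, (zj + 0.5) * 32))
centres, zone_bg = np.array(centres), np.array(zone_bg)

def background_at(i, j, eps=1.0):
    """Distance-weighted combination of zone backgrounds at probe (i, j)."""
    w = 1.0 / (((centres - (i, j)) ** 2).sum(axis=1) + eps)
    return float((w * zone_bg).sum() / w.sum())

print(round(background_at(10, 10), 1))
```

Because the weights vary smoothly with position, the subtracted background changes gradually across zone boundaries rather than in steps.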
Ignoring the MM probes has become more common, especially following the release of the actual sequences of the PM and MM probes for many GeneChip designs. This is because the probe annotations supplied by Affymetrix have often changed as new gene sequence information became available. With the probe sequence information, it becomes possible to reassign the probes to new genes [101]; however, the pairing of PM and MM probes is often lost in this process, necessitating that only the PM probes be used in the analysis. An alternative to both approaches was put forward by Wu et al. in 2004 [102], who used the sequence information in the PM and MM pairs to calculate an adjusted value for each MM based on its actual sequence and its calculated affinity for the sequence bound by the PM. The normalization and summarization procedures were based on the RMA method, and the method is known as GCRMA.


T.K. Karakach et al. / Chemometrics and Intelligent Laboratory Systems 104 (2010) 28–52

4.7. Measurement quality and transformations

In principle, most ratio calculation methods can be expected to yield similar measures of the relative concentration of mRNA in the two samples if spots have comparable amounts of sample hybridized at every pixel [103]. Unfortunately, it is hard to find such truly ideal spots in a population of up to several thousand spots and, therefore, the methods employed will only provide estimates of the true, albeit unknown, relative intensities of the pixels in the spot masks. Further, these methods are differentially affected by spot characteristics such as morphology and the shape of the pixel distribution. Thus, for a given microarray, some methods may work better than others. Depending on the symmetry and distribution of the pixels, it is quite possible to obtain widely varying ratios and, given that the true intensities are unknown, it is hard to assess which approach gives the closest estimate. Consider, for example, the spots drawn from the image shown in Fig. 9, which is part of the challenge data set (P. falciparum) of the 2004 Critical Assessment of Microarray Data Analysis (CAMDA) conference available from reference [72]. Cy5/Cy3 ratios calculated using various methods in MATLAB (The MathWorks) are presented in Table 2, which demonstrates the apparent differences among the methods. Even when working with the same ratio calculation method, uncertainties in the determined ratios can vary greatly from spot to spot. Unlike some analytical measurements where the uncertainty can be inferred from the magnitude of the signal, microarray expression ratios do not provide an implicit indication of their uncertainty, since the ratio gives no clue as to the magnitude of the signals defining it. In practice, observed ratio measurements are found to exhibit a high level of heteroscedasticity [104,105].
This can be interpreted as arising from a combination of two limiting error contributions, one related to spot morphology/intensity and the other characterized by a multiplicative component. For spots where one or both channels are close to the background level, or for spots that are substantially distorted, errors will be dominated by the uncertainties in the ratio calculation itself. In the former case, for example, even a small absolute uncertainty in pixel intensities near the background level can lead to large relative uncertainties in the ratio calculation, most likely tied to the estimation of the background level. These errors are typically quite large, with relative uncertainties often in excess of 100%, and the spots may be considered outliers. Unfortunately, there are no clearly defined quality measures to assess these spots, so the most widely used practice is to rely on quality control during primary processing to remove spots that demonstrate a potential for introducing fluctuations into the data, especially those exhibiting

low overall intensity. These spots are excluded from further analysis through a subjective process referred to as flagging. This all too common solution of ignoring potentially aberrant spots assumes that in such a large population it is possible to assign the quality of the spots to a binary classification. Although statistical approaches for flagging weak spots have been reported [106,107], they rely on setting thresholds for a cut-off. In reality, spots exhibit a continuum of quality, and the imposition of a cut-off based on their aesthetic appeal (or some other parameter) must be carefully evaluated, as this can be quite deceptive and potentially lead to a loss of valuable data. For spots that are well-defined and exhibit sufficient intensity, the uncertainty introduced by the ratio calculation is small and the error structure appears to be dominated by a multiplicative (proportional) component. In this limit, the relative standard deviation of the ratio measurement appears to be fixed, typically in the range of 10–40%, depending on the microarray under study [108]. Although several researchers have reported such behavior [76,105,109,110], the physical origin has not been clearly elucidated, but it does not seem to be limited by the optical measurement [104]. Such an error structure is also observed in the absolute fluorescence intensities, and it is easily shown that this would propagate directly to the ratios. The presence of this multiplicative error is one of the principal reasons that log-ratios are used to represent expression data as opposed to ratios: multiplicative errors revert to a uniform variance under such a transformation. For example, if the ratio X has associated standard deviation σX = γX, where γ is the constant relative standard deviation (RSD), propagation of error shows that the uncertainty associated with Y = log2X is σY = γ/ln 2 (note that base two logarithms are typically used with expression data to measure fold-changes).
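This variance-stabilizing effect of the log transformation is easy to check numerically. In the sketch below (synthetic values), a signal carrying a constant 20% RSD shows a standard deviation that grows with intensity on the raw scale but remains close to γ/ln 2 ≈ 0.29 after a log2 transformation:

```python
import numpy as np

rng = np.random.default_rng(2)
gamma = 0.20                 # constant relative standard deviation (20% RSD)
ratios_sd, log_sd = [], []
for true_ratio in [0.5, 1.0, 4.0, 16.0]:
    # multiplicative (log-normal) error: sd proportional to the true ratio
    x = true_ratio * np.exp(gamma * rng.standard_normal(200_000))
    ratios_sd.append(x.std())        # grows with the true ratio
    log_sd.append(np.log2(x).std())  # roughly constant at gamma / ln 2
```

On the raw scale the spread at a true ratio of 16 is over an order of magnitude larger than at 0.5, while the log2-scale spread is essentially the same at every level.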
Another benefit of this transformation is that it somewhat suppresses the range of outliers for a more informative display. For these reasons, log-transformed data are widely used for presentation and normalization (see Section 4.8). An alternative to log-transformation is the so-called variance stabilizing transformation [111,112]. Variance stabilization has been argued to be important from the point of view that probabilistic processes sometimes lead to intensity-dependent variance [85]. Rocke et al. [109] and Durbin et al. [111] proposed a method for dealing with this problem by expressing the measured dye intensity as a function of two components such that:

y = α + μe^η + ε

Table 2
Spot ratios calculated using various methods: ratio of medians, median of ratios, ratio of means, mean of ratios and regression ratio. Raw (non-log-transformed) ratio values are shown.

Spot no./location  Gene name     Ratio of medians  Ratio of means  Median of ratios  Mean of ratios  Regression ratio
3857/8-7-22        oPFrRNA0001   0.931             0.991           0.843             1.047           1.010
3313/7-13-19       oPFH0009      1.002             1.080           1.039             1.113           1.108
7647/16-13-18      Prps11        0.551             0.590           0.640             0.703           0.573
2398/5-22-21       Empty         0.610             0.680           0.620             0.663           0.712
7609/16-19-16      N155_35       0.753             0.908           0.730             0.858           0.957
80/1-14-4          D16785_6      0.746             0.851           0.771             0.809           1.065
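For illustration, the ratio measures compared in Table 2 can be computed from a spot's paired pixel intensities as follows; the pixel values here are simulated, not those of the CAMDA data, and the regression ratio is taken as the slope of a zero-intercept least-squares fit, one common definition:

```python
import numpy as np

def spot_ratios(cy5, cy3):
    """Five common spot-ratio estimates from paired pixel intensities."""
    cy5, cy3 = np.asarray(cy5, float), np.asarray(cy3, float)
    return {
        "ratio_of_medians": np.median(cy5) / np.median(cy3),
        "ratio_of_means": cy5.mean() / cy3.mean(),
        "median_of_ratios": np.median(cy5 / cy3),
        "mean_of_ratios": (cy5 / cy3).mean(),
        # slope of a zero-intercept least-squares fit of cy5 on cy3
        "regression_ratio": (cy5 * cy3).sum() / (cy3 * cy3).sum(),
    }

# Illustrative spot: true ratio 0.75 with pixel-level noise
rng = np.random.default_rng(3)
cy3 = rng.uniform(500, 5000, 120)
cy5 = 0.75 * cy3 + rng.normal(0, 50, 120)
est = spot_ratios(cy5, cy3)
```

For a well-behaved spot like this one, the five estimates agree closely; the spread among them widens, as Table 2 shows, for distorted or low-intensity spots.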

where y is the measured dye intensity for each spot, α is the background intensity, and μ is the true expression level of the gene, while η and ε are normally distributed random variables centered at zero, with ση² and σε² as their respective variances. This model was first developed for purposes of evaluating measurement errors in analytical chemistry instrument responses [113], and its application to microarrays has several implications. First, at very low expression levels, i.e. if μ is approximately zero, a measured spot intensity is expected to be dominated by the first term, meaning that y is normally distributed in the limit of many such measurements, i.e. y ~ N(α, σε²). Second, when μ is very large, the measured intensity is dominated by the second term, with an approximately log-normal distribution for y. Under such circumstances the model gives the variance as μ²S², where S² = exp(ση²)[exp(ση²) − 1], which implies that the variance of y is linearly related to μ², i.e. multiplicative noise. Third, at moderate expression levels, measured spot intensities are expected to lie between the two extremes mentioned above, and the distribution of y is anticipated to exhibit the characteristics of both normal and log-normal distributions. The approach provided for dealing with the implications mentioned above is to transform the data using a variant


of the logarithmic transformation that stabilizes the variance in all intensity regions, referred to as the generalized log-transformation [111] (glog), given as:

g(y) = ln[(y − α) + √((y − α)² + S²)]

where α is the background intensity, measured from the intensity of unexpressed genes on a microarray. This transformation ensures that the variance is constant over the dynamic range of measured intensities. In the literature, although sparse, several quality measures that provide somewhat objective methods for associating quality with the measured ratios have been reported. The reader is directed to Karakach and Wentzell [114] for a more in-depth review of ratio-quality measures. Regrettably, most downstream microarray data analysis methods do not use these quality measures in the analysis of the data. For instance, as early as 1997, Chen et al. [85] introduced an approach for associating confidence levels with calculated ratios, yet most microarray data analysis methods known to the authors do not employ this information. In recent years, there has been continued interest in issues related to the quality of spot ratios, with the goal of associating confidence with the measured ratios. Brown et al. [104] proposed an approach based on the relative standard deviation (RSD) of spot ratios to calculate variability due to spot morphology and used this measure, spot ratio variability (SRV), to assign a significance value to the ratios. Similarly, Wang et al. [78] developed a quality score depicting the overall quality of all the spots on a microarray based on spot size, signal-to-noise ratio, excessive and variable local background, and spot pixel intensity saturation. As noted by Newton et al. [105], expression measurement procedures that rely solely on the raw intensity ratios are unlikely to be efficient, since high errors accompany the reported ratios, especially at low signal intensities. The output from most microarray image analysis software includes a measure of the spread of the pixels in the spot mask.
Whether or not this provides a good estimate of the uncertainty associated with the ratios, especially for low-intensity spots, is not clear [106]. More recently, Karakach and coworkers developed techniques to estimate both the additive and multiplicative components of the ratio uncertainties [87], thereby providing a numerical value associated with the quality of the spot ratio. However, it is evident that more research on estimating and incorporating measurement uncertainty in microarray data analysis needs to be carried out.

4.8. Normalization

The goals of higher-level analyses of microarray data include the identification of genes whose expression strongly depends on the biological state of the cell and the partitioning of such genes, as well as the corresponding samples, into groups based on the similarity of their expression profiles. Microarray data analysis entails identification of genes that exhibit uncharacteristic patterns of expression, i.e. genes that are up-regulated or down-regulated with respect to some reference state. Unfortunately, the setup for these experiments exacerbates the potential for wide-ranging variability; hence access to the biological information that may be mirrored in the expression profiles is often impeded. This variability may be of a random or systematic nature. Some random variability, the sources of which have been discussed in preceding sections, is inevitable and can be addressed through proper statistical analysis. This section discusses methods for the removal of systematic biases in microarray data (normalization), and begins with a general summary of some of the sources of experimental uncertainty. This is particularly important since normalization is the final pre-processing step for microarray data.

The origins of the need for normalization are relatively simple to understand from an analytical perspective. In an ideal situation, a ratio of the intensity measurements on the test and reference channels would give a direct indication of up- or down-regulation, i.e. a ratio of unity would indicate no change. In practice, of course, the absolute intensities are a function of many variables, obvious ones including the response of the PMT and optical system, the laser power, the laser wavelength, the absorption spectra of the dyes and their quantum yields, and the efficiency with which each dye is incorporated into its respective sample. While the use of two-color arrays can solve the problem of variable spot morphologies, it cannot compensate for these other effects. An analytical solution to these differences might be to use some sort of internal standard whose concentration is the same in both the test and reference samples, and in fact this is one approach. However, the efficacy of this strategy is limited by another source of variability, namely the amount of mRNA extracted from each sample. Normally, the amounts of total RNA in the test and reference samples are adjusted to be the same by a spectrophotometric measurement, but the mRNA is only a few percent of the total RNA and this fraction can vary, leading to another source of variability. It is expected that the correction for these effects would be multiplicative in nature, such that it would involve simply scaling either the test or reference intensities by an appropriate constant, but the possibility of spatially dependent scaling factors or nonlinear behavior cannot be excluded. There are different approaches to the normalization of microarray data, most of which have been developed based on sound distributional assumptions about the data. The most common assumption made for comparator experiments is that the vast majority of genes in the test sample exhibit no change in expression from the reference.
Although most normalization techniques have been developed for comparator experiments, their application to time-course experiments is not uncommon. Some of the most widely used normalization methods include total intensity normalization [115], local regression [116] or local scatter smoothing (such as locally weighted scatter plot smoothing (Lowess) normalization [117,118]), and normalization by housekeeping genes [119] or external control genes. Only a brief description of some of these methods is presented in this section, since Quackenbush [115] has reviewed the most common normalization strategies and Park et al. [120] have compared various methods of normalization of microarray data.

4.8.1. Total intensity normalization

Total intensity normalization is the simplest strategy for normalization of microarrays and is based on some of the most straightforward assumptions. In this approach, it is assumed that the average amount of transcript representing each target is approximately constant for the two samples. In addition, it is assumed that the probes are randomly sampled from the population of genes in the genome (or constitute a complete genome). This implies that approximately equal amounts of target (from both samples) should hybridize to each spot, producing equal integrated intensities in both channels. The rationale for these assumptions is that, in a given living system, basic cellular maintenance must continue and perturbations to the current state are addressed by adjustments in the expression of only a few genes. Thus, normalization is performed by scaling the intensities of one of the channels by a factor calculated as the ratio of the total fluorescence intensity of channel one to channel two, such that:

G′i = βGi

and

R′i = Ri

where G′i is the normalized intensity of the ith probe hybridized to the green-labeled target, Gi is the respective raw intensity, while R′i is the normalized intensity of the ith probe hybridized to the red-labeled


target and Ri is the respective raw intensity for the same probe. β is the normalization factor, given as:

β = Σ(i=1→g) Ri / Σ(i=1→g) Gi          (8)

where the summation is over all g probes on an array. This adjusts the mean of the relative expression level for all the spots to unity. Alternatively, to provide better stabilization of the error variance, the geometric mean is often used, calculated using the log-ratios. This leads to:

log₂β = (1/g) Σ(i=1→g) log₂Ri − (1/g) Σ(i=1→g) log₂Gi          (9)

Note that this approach, as well as most other normalization strategies, can be applied to an entire array or to sub-grids of the array, where the terms global normalization and sub-grid normalization refer to the two approaches, respectively [115]. However, caution must be exercised in the use of these terms, since others [120,121] have used the term global normalization to refer to the total intensity normalization method, while approaches such as Lowess are referred to as intensity-dependent normalization methods.

4.8.2. Lowess normalization

In view of the assumptions made in the total intensity normalization method, it is anticipated that a plot of the red channel versus green channel intensities will yield a unity slope and a zero intercept when properly normalized, since, typically, only a small fraction of genes exhibit differential expression. Therefore, another approach to normalization might be to make such a plot, as shown in Fig. 10a, and use the slope as the normalization factor. However, such an approach is problematic for a number of reasons. First, the typical distribution of measurements generates significantly fewer points at high intensities, which would not in itself be a problem except for the proportional error structure of the intensity measurements discussed earlier. This will tend to give excessive weight to the high-intensity points. In addition, there are often a considerable number of outliers in the data. A logarithmic transformation is therefore used to transform the multiplicative errors to uniform errors and reduce the range of outliers. It is expected that a plot of the log-transformed measurements (Fig. 10b), i.e. log R vs. log G, will yield a unity slope and an intercept equal to the logarithm of the normalization factor in the untransformed space:

log₂R = log₂β + log₂G          (10)

Alternatively, the normalization can be performed by adding the intercept to the log G values. It is now standard practice to visualize microarray intensity measurements on the log scale in this way. Dudoit et al. [118] introduced a variation on this approach by incorporating a 45° clockwise rotation of the (log R vs. log G) coordinate system for ease of visualization. Such a rotation involves plotting the log-ratio of the intensities, log₂(R/G), designated M, versus the mean of their logarithmic intensities, ½log₂(RG), designated A. It is then anticipated that properly normalized plots of M vs. A will have zero slope, centered on the zero horizontal. This is shown in Fig. 10c. These so-called MA plots are considered useful by some since the horizontal axis can be viewed as a kind of average intensity, allowing intensity-dependent patterns to be observed. Unfortunately, both log R vs. log G plots and MA plots commonly exhibit nonlinear characteristics that likely arise due to differential background on the two channels. Such a nonlinear structure is shown in Fig. 10b and c. The curvature depicted in these figures introduces a complication to a problem that could otherwise be solved by simple linear regression. Yang et al. [121] observed that this nonlinearity is an intensity-dependent systematic bias in the log-ratio values, and

Fig. 10. Red vs. green channel intensity plots from Atlantic salmon microarrays [7]: (a) raw intensities, (b) log2 R vs. log2 G plots showing banana shaped curvature, (c) M vs. A plots showing deviation from zero for low intensities, and (d) Lowess-corrected data.


manifests itself as deviations from zero for the low-intensity signals, as seen in the M vs. A plots. This intensity-dependent bias renders the log R vs. log G plots banana-shaped. Thus, to correct this intensity-dependence, they introduced Lowess [122] normalization and, since its inception, it has become the de facto standard normalization technique. This approach corrects nonlinearity in the data via an nth-order locally weighted regression of every response variable on local predictor variables (for DNA microarrays, a first-order fit is generally used). Accordingly, M is regressed on A, point-by-point, through a weighting scheme such that for every point Mi in M, a local subset of points, Msub, which are closest to Mi, is identified and regressed on Asub, after being weighted by their respective Euclidean distances from Mi. Thus, in typical microarray data, the data are normalized such that M̂ = k(A) and M′ = M − k(A), where k(A) is the Lowess fit to the M vs. A plot. The size of Msub is based on a span chosen on the basis of the range of points from which the smoothest fit can be obtained. This is similar to the window in a moving average filter; a small window will compromise the smoothness of the filter, while a large window may escalate computation time without improving the smoothness. The application of Lowess to data normalization can be either global or local, where the former implies that normalization is applied to the entire data set and the latter entails dividing the data into physical subsets such as sub-arrays, where the elements of the sub-array consist of spots printed with one pin (these are often called print-tip groups). Local Lowess normalization is said to correct for systematic spatial biases in the array, possibly related to discrepancies in the print-tips used to make the microarray [121].
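A rough numerical sketch of the MA construction and the correction M − k(A) is given below. For simplicity, k(A) is obtained from an unweighted local linear fit over a fixed span rather than the tricube-weighted Lowess fit used in practice:

```python
import numpy as np

def ma_normalize(R, G, span=0.3):
    """Normalize log-ratios with a crude local-regression fit k(A).

    M = log2(R/G), A = (log2 R + log2 G)/2.  For each point, a straight
    line is fitted to the `span` fraction of points nearest in A, and its
    value there is subtracted from M.  A simplified stand-in for the
    tricube-weighted Lowess fit used in practice.
    """
    M = np.log2(R / G)
    A = 0.5 * np.log2(R * G)
    n = len(M)
    k = max(3, int(span * n))
    order = np.argsort(A)
    fit = np.empty(n)
    for idx in range(n):
        pos = np.searchsorted(A[order], A[idx])
        lo = max(0, min(pos - k // 2, n - k))
        window = order[lo:lo + k]
        slope, intercept = np.polyfit(A[window], M[window], 1)
        fit[idx] = slope * A[idx] + intercept
    return M - fit, A

# Synthetic two-channel data with an intensity-dependent dye bias
rng = np.random.default_rng(4)
G = 2.0 ** rng.uniform(6, 14, 800)
bias = 1.5 - 0.08 * np.log2(G)            # stronger red dye at low intensity
R = G * 2.0 ** (bias + rng.normal(0, 0.1, 800))
M_norm, A = ma_normalize(R, G)
```

After the correction, the M values are centred near zero at both low and high intensity, which is the flattening seen in plots like Fig. 10d.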
Although the application of the Lowess methodology is widespread, it is largely empirical and, to the authors' knowledge, no physical explanation of the observed curvature has been provided in the literature. There is some evidence to suggest, however, that the characteristic arises from spot-localized background effects for low-intensity spots.

4.8.3. External controls

Normalization in early microarray experiments was performed using external controls. For instance, in the pioneering work reported by Schena et al. [13], Arabidopsis thaliana mRNA was spiked with human acetylcholine receptor (AChR) mRNA controls, which were used for normalization. Current microarray experimental designs encourage the inclusion of control probes, generally derived from a non-homologous organism, deposited in every sub-array to control for the systematic variability within the print-tip group. These external controls are not expected to exhibit any differential expression since they are not subjected to the same biological stimulus as the experimental mRNA. Naturally, they act like internal standards against which the signal from experimental samples is calibrated. While such controls serve a useful purpose, there are drawbacks to their use for normalization. Although they can account for differences in instrumental response parameters, they are unable to account for differences in the amount of mRNA extracted for the test and the reference, as already noted. Moreover, reliance on a limited number of spots can be dangerous if the quality turns out to be poor or insufficient target is added.

4.8.4. Housekeeping genes

This approach to normalization assumes that, among a large number of genes, the expression level of a relatively large subset will remain unaltered under most biological stimuli (except death). Thus, if these genes, referred to as housekeeping genes, are identified on a microarray, they can be employed to find the normalization factor to be applied to the entire array.
The Harvard University HUGE 451 Index [123] is a list of housekeeping genes that are ubiquitously expressed in all cell types and conditions. In addition, DeRisi et al. [124] identified a set of 90 housekeeping genes whose intensities

were used to normalize over 1000 spots on the microarray. Nonetheless, this approach does not take into account nonlinearity in the data, and is viewed rather skeptically for this reason and because it is hard to establish whether the extracted amount of mRNA corresponding to these genes is constant. In addition, recent reports [125] have suggested that certain housekeeping genes may be affected by the treatment to which a test organism is subjected. Since this has been suspected in the past, robust methods for choosing a self-consistent set of genes whose expression levels remain unchanged have been introduced [116,126].

4.8.5. Other approaches

Perhaps owing to the significance of pre-processing of microarray data, several other normalization strategies have been developed. These include quantile normalization [121], ANOVA [43], and mixed model methods (MMM) [110], which are statistical approaches that adjust the means of the log-ratio of spot intensities to reflect expected distributional similarities between multiple arrays or, sometimes, within a single array given a mock array. The abundance of normalization methods and literature in this area is a testament to its importance in the broader picture of microarray data processing.

4.8.6. Normalization of GeneChip data

In the case of GeneChip data, many of the same considerations as were discussed for spotted two-color microarrays still hold, and the methods used for normalization are very similar. Commonly used methods include a linear scaling akin to total intensity normalization, which may be performed before or after probe summarization, and may or may not use an invariant set of probes across a set of arrays. Lowess is also used; however, the implementation for GeneChip data differs from two-color arrays due to the hybridization of a single sample to each array.
This necessitates that Lowess be carried out using two arrays, with one arbitrarily designated as R and the other as G (following the convention used for two-color arrays). If more than two arrays have been used in the study, then cyclic Lowess may be used, whereby each array is normalized against each of the other arrays in turn. Instead of assuming that the distributions of probe intensities among the various arrays are the same, quantile normalization forces the distributions of intensities to be the same. In this method the probe intensities for each array are first sorted, and each sorted value on each array is replaced by the average of the identically ranked values across all the arrays.

4.9. Missing values

In contrast to many other analytical measurements, microarray data are often characterized by a significant proportion of missing values. These arise as a consequence of several factors, including (1) the large dynamic range of the measured fluorescent signal and removal of spots based on signal measures, (2) spot artifacts such as non-uniform background, smudges and scratches that compromise the quality of a given spot, and (3) negative intensities or ratios arising from abnormally high background. Factors (1) and (2) tend to be random for a particular array, whereas (3) can be more systematic, depending on the experimental design. In the authors' experience, these factors lead to approximately 5% of measurements being considered missing on any given array. With few exceptions (see [127] and [128], where missing values are included by weighting them appropriately), downstream analysis methods require complete data sets with no missing values. To allow the use of these downstream methods without discarding potentially important genes, missing value imputation (MVI) methods have been regularly used with DNA microarray data. In 2001, Troyanskaya et al. introduced the now standard K-nearest neighbor (KNN) approach, evaluating it and two other imputation methods [129].
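A minimal version of KNN imputation can be sketched as follows; unlike the published method, which weights neighbours by inverse distance, this sketch uses a plain average of the k nearest genes:

```python
import numpy as np

def knn_impute(X, k=3):
    """Fill missing values (NaN) in a genes x samples matrix.

    For each gene with missing values, the k nearest genes (Euclidean
    distance over the columns observed in both) that have a value in
    the missing column are averaged.  A minimal, unweighted version of
    the KNN scheme of Troyanskaya et al.
    """
    X = X.copy()
    missing = np.isnan(X)
    for g in np.where(missing.any(axis=1))[0]:
        obs = ~missing[g]
        # distances to all other genes over co-observed columns
        d = np.full(X.shape[0], np.inf)
        for h in range(X.shape[0]):
            if h == g:
                continue
            both = obs & ~missing[h]
            if both.any():
                d[h] = np.sqrt(np.mean((X[g, both] - X[h, both]) ** 2))
        for c in np.where(missing[g])[0]:
            donors = np.where(~missing[:, c])[0]
            donors = donors[np.argsort(d[donors])][:k]
            X[g, c] = X[donors, c].mean()
    return X

# Toy matrix: rows 0-2 share one profile, rows 3-5 another
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [1.1, 2.1, 3.1, 4.1],
              [0.9, 1.9, np.nan, 3.9],
              [8.0, 6.0, 4.0, 2.0],
              [8.1, 6.1, 4.1, 2.1],
              [7.9, 5.9, 3.9, 1.9]])
Xi = knn_impute(X, k=2)
```

The missing value in the third gene is filled from its two nearest neighbours (the first two rows), giving a value consistent with its own profile.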
Different imputation methods have been developed


specifically for DNA microarray experiments (see [130–140] for examples). In recent work, there has been controversy over the impact of the imputation method on the outcome of subsequent data analyses. Brock and coworkers [141] investigated the accuracy of eight different MVI techniques, as well as methods to evaluate and select the most appropriate MVI approach. They concluded that in many cases there was very little difference in the performance of the best algorithms, as evaluated by their ability to reconstruct the original data. In contrast, Celton et al. examined 12 different imputation methods and their effect on the results of exploratory data analysis via clustering [142], demonstrating that the choice of imputation method and the number of imputed values can cause instabilities in the clustering results. Of particular concern for the chemometrician performing downstream analysis of the data is whether or not MVI is warranted, and which method to employ. Although KNN is the most commonly used MVI method, this appears to be primarily by virtue of its having been introduced first, as both studies above demonstrate that for many datasets it is not the best method. However, the best method is often dataset- or experimental design-specific, requiring specialized knowledge of the various methods available.

4.10. Higher-level processing

At the primary level of data analysis, which might be considered data pre-processing from a chemometrics perspective, the steps are largely the same from one application to another: gridding and segmentation, flagging, image processing, background subtraction, ratio calculation, transformation and normalization. Although the details of these steps may differ, in the end the usual result is a vector of ratios or, more typically, log-ratios and their associated gene identifiers for a series of samples, forming a two-way data matrix for further analysis.
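Once such a matrix is assembled, a common first step toward clustering genes is a gene-to-gene distance matrix. The sketch below (entirely synthetic data) uses 1 minus the Pearson correlation, a popular choice of distance for expression profiles:

```python
import numpy as np

rng = np.random.default_rng(5)
n_samples = 10
# Synthetic log-ratio matrix: two groups of co-expressed genes
base1 = rng.normal(0, 1, n_samples)
base2 = rng.normal(0, 1, n_samples)
genes = np.vstack([base1 + rng.normal(0, 0.2, (3, n_samples)),
                   base2 + rng.normal(0, 0.2, (3, n_samples))])

# 1 - Pearson correlation distance between gene expression profiles
Z = genes - genes.mean(axis=1, keepdims=True)
Z /= np.sqrt((Z ** 2).sum(axis=1, keepdims=True))
D = 1.0 - Z @ Z.T
```

Genes sharing a profile end up much closer to each other than to genes from the other group, which is what hierarchical or partitional clustering methods then exploit.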
At this stage, a variety of methods can be used to coax the desired information from the data, depending on the nature of the experiment. Typical goals include: (1) the identification of genes exhibiting differential expression (up- or down-regulation) relative to some reference state, (2) the clustering or classification of samples based on their gene expression profiles, (3) the clustering or classification of genes based on their expression across multiple samples, (4) the identification of genes that may be used as biological markers (e.g. for a mutation, a disease, or resistance to some medication), and (5) the elucidation of gene function and mechanisms of interaction, i.e. gene networks. In these studies, the term expression profile is generally used to describe the normalized ratio (test/reference) or log-ratio of signals across all genes for a sample represented on a particular microarray. From a chemometrics point of view, it could be considered a kind of genetic spectrum, except that there is no naturally contiguous ordering of channels. In other contexts, expression profile may also refer to changes in the expression of a particular gene across multiple samples, especially in a serial experiment. The application of higher-level data analysis methods to microarray measurements generally assumes that the data have been adequately pre-processed, such that poor-quality spots have been eliminated or flagged, background signals subtracted, and systematic variability accounted for through proper experimental design and normalization. Often these assumptions were not valid in the early days of microarrays, but the situation has improved somewhat in recent years. The earliest data analysis was performed by assigning differential expression cut-off values to genes based on fold-changes in their expression levels between two samples. For instance, Schena et al.
[13] declared a gene to be differentially expressed if its expression level in the two samples differed by a factor of 5, while DeRisi et al. [124] chose a cut-off of 3-fold up- or down-regulation. The standard approach at this stage was to compute log-ratios of the measured expression levels of genes in the test and reference samples, and to assign an ad hoc threshold for differential expression. A convention emerged soon after the technology was developed that a two-fold relative induction or repression of the measured dye intensities indicated a significant change in gene expression. It is not clear, however, how this convention was conceived, and over time it has received a great deal of criticism, mainly because such fold-changes do not take into account the reliability of the measurements. Moreover, most of the published data were quite elusive about measurement reproducibility, and hence it was difficult to assess the confidence levels of the reported fold-changes. In addition, some genes, such as those encoding transcription factors, may exhibit relatively small changes in expression yet have dramatic impacts on the cellular machinery. As time evolved, more replication was performed in microarray experiments and formal statistical testing was employed. Initially, simple t-tests were used, where a t-statistic could be calculated and evaluated for each gene, accounting for differences in variability among the genes. One method of displaying these results is the volcano plot, as shown in Fig. 11, where the −log10 of the p-value calculated for each gene is plotted against the log2 of its fold change. The plot clearly shows that, while there is a correlation between significance and fold change, it is not very strong. One of the problems with carrying out a t-test in this way is defining a p-value for significance. If a typical cut-off of p = 0.05 were used (−log10(0.05) = 1.3 in the figure), clearly a very large number of genes would be discovered. However, part of the difficulty here is the problem of multiple testing: with ca. 4000 genes, one would expect the cut-off to be exceeded (4000 × 0.05) = 200 times just by chance. Typically, in these cases, a Bonferroni correction would be applied, which would adjust the p-value to 0.05/4000 = 0.0000125 (or −log10 p = 4.9).
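The multiple-testing arithmetic above can be checked with a short simulation; the gene count, replicate number, and effect sizes below are arbitrary assumptions, and the (log2 fold change, −log10 p) pairs computed are exactly the coordinates of a volcano plot:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_genes, n_rep = 4000, 4

# Simulated replicate log2 intensities for the test and reference conditions
test = rng.normal(0.0, 0.5, (n_genes, n_rep))
ref = rng.normal(0.0, 0.5, (n_genes, n_rep))
test[:50] += 1.5                       # spike in 50 genuinely changed genes

# One two-sample t-test per gene (per row)
t_stat, p = stats.ttest_ind(test, ref, axis=1)
log2_fold = test.mean(axis=1) - ref.mean(axis=1)
volcano_y = -np.log10(p)               # y-coordinate of the volcano plot

n_naive = int(np.sum(p < 0.05))            # ~200 expected by chance alone
n_bonf = int(np.sum(p < 0.05 / n_genes))   # threshold 0.05/4000 = 1.25e-5
print(n_naive, n_bonf)
```

Running this shows the effect described in the text: the naive cut-off flags far more genes than were actually perturbed, while the Bonferroni threshold discards most discoveries, true ones included.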
While this reduces the number of genes discovered (often to zero), it has been argued that this correction is too conservative because it is based on an assumption of independence among the genes, which is not likely to be the case. The issue of significance levels of differential expression has been widely addressed in the literature, culminating in the development of methods designed specifically to address some of the shortfalls of techniques based on p-values, such as significance analysis of microarrays (SAM) [143]. SAM, which has now become a widely used tool for microarray data analysis, is essentially based on a standard t-statistic for replicate experiments, although a small additive term is used in the denominator to correct for the anomalous behavior of low-level signals. An important difference, however, is that SAM evaluates the false discovery rate (FDR), an estimate of the fraction of false positives, through a bootstrap method that uses random permutations of the samples. By adjusting the critical values accordingly, a biologist is able to control the FDR at an acceptable level and identify an appropriate number of genes for further investigation. More algorithmic details on this approach can be found in the original reference [143], as well as in a recent review of the validity of the SAM methodology [144]. For routine data analysis, the software is freely available on the internet for academic users [145].

Fig. 11. Volcano plot depicting statistical significance against fold change in Atlantic salmon microarray data [7]. The solid horizontal line depicts the nominal p-value of 0.05, while the solid vertical lines depict a 1.4-fold change. The region labeled X corresponds to the region of differential expression.

Beyond basic hypothesis testing, a wide range of other multivariate statistical analysis tools has been applied to microarray data sets. These consist of both conventional methods and more novel techniques specifically designed for microarrays. Although a comprehensive review of all of these methods is beyond the scope of this article, a number of analysis strategies are becoming somewhat standard in the field and deserve mention. These fall into a variety of categories that include analysis of variance (ANOVA), cluster analysis, exploratory data analysis, classification and time series analysis. Where appropriately designed experiments with proper blocking and randomization have been performed, ANOVA presents a more powerful alternative to simple t-tests, but it suffers from some of the same drawbacks associated with multiple testing. Where sufficient samples are available, clustering is a popular approach, with the advantage that it can be applied to observational studies as well as designed experiments. Methods such as hierarchical clustering and k-means clustering are commonly used.
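A two-way hierarchical clustering of the sort used to build such displays can be sketched with SciPy; the data matrix here is simulated, and average linkage on a correlation-based distance is used as one common choice (other metrics and linkages are equally defensible):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list

rng = np.random.default_rng(2)

# Simulated log-ratio matrix: 30 genes (rows) x 8 samples (columns),
# with a block of co-regulated genes in the first four samples
data = rng.normal(0.0, 0.3, (30, 8))
data[:15, :4] += 2.0

# Cluster genes and samples separately, as in a two-way heat map
gene_tree = linkage(data, method="average", metric="correlation")
sample_tree = linkage(data.T, method="average", metric="correlation")

# The leaf ordering of each dendrogram arranges the heat-map image
gene_order = leaves_list(gene_tree)
sample_order = leaves_list(sample_tree)
heatmap = data[np.ix_(gene_order, sample_order)]
print(heatmap.shape)  # prints (30, 8)
```

The two leaf orderings determine the arrangement of rows and columns in a clustered heat map such as the one shown in Fig. 12.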
Dendrograms are typically presented in a manner peculiar to this area, where the clustering results for both the samples and the genes are presented in the same figure. The sample dendrogram is normally shown at the top of the page, with the gene dendrogram rotated by 90° and shown along the side of the page. In the rectangular region defined by the bases of the two dendrograms, color-coded expression profiles are presented for each gene, with green indicating up-regulation and red indicating down-regulation (see Fig. 12 for an example). Alternatively, the ordering provided by the sample dendrogram on the horizontal axis can be replaced by some other grouping, such as time or sample class [146]. This permits a rapid visual assessment of the expression patterns characteristic of each gene and each sample that is more intuitive to biologists. For exploratory data analysis, principal components analysis and nonlinear mapping methods (a.k.a. multidimensional scaling) are also widely used so that the data can be visualized in a lower dimensional space. A variety of methods has also been used for classification purposes, including discriminant analysis, support vector machines, and artificial neural networks, although in early studies of microarrays some traditional classifiers were re-invented by those unfamiliar with existing techniques. Techniques for the analysis of serial microarray data are not yet as well established as those for comparator experiments, but some rudimentary time series approaches have been used, as well as other techniques, such as independent component analysis and hidden Markov models. Recently, an application of multivariate curve resolution has also been reported for spotted microarrays [128].

4.11. Caveats for chemometrics

As previously noted, the range of techniques that have been applied to microarray data is extensive and beyond the scope of this review. The objective of this work has rather been to provide insight into the nature of the microarray measurements themselves, since such a description of the various measurement aspects has been lacking, especially in the chemometrics literature. From a chemometrics

Fig. 12. Heat map resulting from a two-way hierarchical cluster analysis. Red indicates genes that are under-expressed, and green indicates genes that are over-expressed, in the test sample relative to the reference sample. The clustering according to genes is shown on the left, and the clustering according to samples is shown at the top. Figure modified from [147].


[11] K. Maurer, J. Cooper, M. Caraballo, J. Crye, D. Suciu, A. Ghindilis, J.A. Leonetti, W. Wang, F.M. Rossi, A.G. Støver, C. Larson, H. Gao, K. Dill, A. McShea, Electrochemically generated acid and its containment to 100 micron reaction areas for the production of DNA microarrays, PLoS ONE 1 (2006) e34.
[12] A.P. Blanchard, R.J. Kaiser, L.E. Hood, High-density oligonucleotide arrays, Biosens. Bioelectron. 11 (1996) 687–690.
[13] M. Schena, D. Shalon, R.W. Davis, P.O. Brown, Quantitative monitoring of gene expression patterns with a complementary DNA microarray, Science 270 (1995) 467–470.
[14] D.J. Hall, Inkjet technology for precise, high throughput manufacture of protein microarrays, Advances in Microarray Technology, [http://mms.technologynetworks.net/hall/player.html] (2005).
[15] R.G. Sosnowski, E. Tu, W.F. Butler, J.P. O'Connell, M.J. Heller, Rapid determination of single base mismatch mutations in DNA hybrids by direct field control, Proc. Natl. Acad. Sci. U. S. A. 94 (1997) 1119–1123.
[16] J.R. Pollack, C.M. Perou, A.A. Alizadeh, M.B. Eisen, A. Pergamenschikov, C.F. Williams, S.S. Jeffrey, D. Botstein, P.O. Brown, Genome-wide analysis of DNA copy-number changes using cDNA microarrays, Nat. Genet. 23 (1999) 41–46.
[17] N. Patil, N. Nouri, L. McAllister, H. Matsukaki, T. Ryder, Single-nucleotide polymorphism genotyping using microarrays, Curr. Protoc. Hum. Genet. (2001) Chapter 2, Unit 2.9.
[18] G.K. Hu, S.J. Madore, B. Moldover, T. Jatkoe, D. Balaban, J. Thomas, Y. Wang, Predicting splice variant from DNA chip expression data, Genome Res. 11 (2001) 1237–1245.
[19] J.M. Thomson, J. Parker, C.M. Perou, S.M. Hammond, A custom microarray platform for analysis of microRNA gene expression, Nat. Meth. 1 (2004) 47–53.
[20] G.A. Churchill, Fundamentals of experimental design for cDNA microarrays, Nat. Genet. 32 (Suppl.) (2002) 490–495.
[21] R.M. Simon, K. Dobbin, Experimental design of DNA microarray experiments, BioTechniques Suppl. (2003) 16–21.
[22] A.A. Alizadeh, M.B. Eisen, R.E. Davis, C. Ma, I.S. Lossos, A. Rosenwald, J.C. Boldrick, H. Sabet, T. Tran, X. Yu, J.I. Powell, L. Yang, G.E. Marti, T. Moore, J. Hudson, L. Lu, D.B. Lewis, R. Tibshirani, G. Sherlock, W.C. Chan, T.C. Greiner, D.D. Weisenburger, J.O. Armitage, R. Warnke, R. Levy, W. Wilson, M.R. Grever, J.C. Byrd, D. Botstein, P.O. Brown, L.M. Staudt, Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling, Nature 403 (2000) 503–511.
[23] T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, H. Coller, M.L. Loh, J.R. Downing, M.A. Caligiuri, C.D. Bloomfield, E.S. Lander, Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science 286 (1999) 531–537.
[24] Z. Bar-Joseph, Analyzing time series gene expression data, Bioinformatics 20 (2004) 2493–2503.
[25] P.T. Spellman, G. Sherlock, M.Q. Zhang, V.R. Iyer, K. Anders, M.B. Eisen, P.O. Brown, D. Botstein, B. Futcher, Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization, Mol. Biol. Cell 9 (1998) 3273–3297.
[26] M. Werner-Washburne, B. Wylie, K. Boyack, E. Fuge, J. Galbraith, J. Weber, G. Davidson, Comparative analysis of multiple genome-scale data sets, Genome Res. 12 (2002) 1564–1573.
[27] E.F. Nuwaysir, M. Bittner, J. Trent, J.C. Barrett, C.A. Afshari, Microarrays and toxicology: the advent of toxicogenomics, Mol. Carcinog. 24 (1999) 153–159.
[28] M.J. Marton, J.L. DeRisi, H.A. Bennett, V.R. Iyer, M.R. Meyer, C.J. Roberts, R. Stoughton, J. Burchard, D. Slade, H. Dai, D.E. Bassett, L.H. Hartwell, P.O. Brown, S.H. Friend, Drug target validation and identification of secondary drug target effects using DNA microarrays, Nat. Med. 4 (1998) 1293–1301.
[29] M. Shapira, E. Segal, D. Botstein, Disruption of yeast forkhead-associated cell cycle transcription by oxidative stress, Mol. Biol. Cell 15 (2004) 5659–5669.
[30] K. Dobbin, R. Simon, Comparison of microarray designs for class comparison and class discovery, Bioinformatics 18 (2002) 1438–1445.
[31] Z. Bozdech, M. Llinas, B.L. Pulliam, E.D. Wong, J.C. Zhu, J.L. DeRisi, The transcriptome of the intraerythrocytic developmental cycle of Plasmodium falciparum, PLoS Biol. 1 (2003) 85–100.
[32] M. Llinás, Z. Bozdech, E.D. Wong, A.T. Adai, J.L. DeRisi, Comparative whole genome transcriptome analysis of three Plasmodium falciparum strains, Nucleic Acids Res. 34 (2006) 1166–1173.
[33] M.N. Arbeitman, E.E.M. Furlong, F. Imam, E. Johnson, B.H. Null, B.S. Baker, M.A. Krasnow, M.P. Scott, R.W. Davis, K.P. White, Gene expression during the life cycle of Drosophila melanogaster, Science 297 (2002) 2270–2275.
[34] A.P. Gasch, P.T. Spellman, C.M. Kao, O. Carmel-Harel, M.B. Eisen, G. Storz, D. Botstein, P.O. Brown, Genomic expression programs in the response of yeast cells to environmental changes, Mol. Biol. Cell 11 (2000) 4241–4257.
[35] M.J. Martinez, S. Roy, A.B. Archuletta, P.D. Wentzell, S.S. Anna-Arriola, A.L. Rodriguez, A.D. Aragon, G. Quinones, C. Allen, M. Werner-Washburne, Genomic analysis of stationary phase and exit in Saccharomyces cerevisiae: gene expression and identification of novel essential genes, Mol. Biol. Cell 15 (2004) 5295–5305.
[36] Y.H. Yang, T. Speed, Design issues for cDNA microarray experiments, Nat. Rev. Genet. 3 (2002) 579–588.
[37] G.F.V. Glonek, P.J. Solomon, Factorial and time course designs for cDNA microarray experiments, Biostatistics 5 (2004) 89–111.
[38] M.K. Kerr, G.A. Churchill, Experimental design for gene expression microarrays, Biostatistics 2 (2001) 183–201.
[39] D.M. Rocke, Design and analysis of experiments with high throughput biological assay data, Semin. Cell Dev. Biol. 15 (2004) 703–713.
[40] M.K. Kerr, Design considerations for efficient and effective microarray studies, Biometrics 59 (2003) 822–828.

perspective, a number of aspects should be emphasized for anyone undertaking an analysis of this type of data. First, experimental designs, especially in early work, are replete with examples of confounded variables; caution should therefore be used in over-interpreting the results of any multivariate analysis performed on such data. Second, most microarray data analyzed in the literature use log-transformed ratios rather than ratios. This is done as a variance stabilization technique and is so commonplace that the transformation is often not even mentioned. Depending on the model being applied, this may have implications with respect to areas such as linearity, noise distribution, and scaling. Third, missing data and heteroscedasticity are common issues that need to be addressed in any multivariate analysis. Unlike conventional analytical instruments, microarray measurements can exhibit extreme non-uniformity in measurement variances with no particular structure. Although some may regard these aspects as insurmountable, the authors believe that these are in fact areas where chemometrics can make the greatest contribution, and they represent great opportunities for those interested in this area of research.

5. Summary

Transcriptomics in general, and DNA microarrays in particular, provide a window into the inner workings of the cell. From an analytical measurement perspective, however, DNA microarray measurements present numerous challenges to researchers. Although the past two decades have seen an incredible amount of work to reduce the many sources of variance in the resultant measurements, the technology platform itself constrains any proposed solution to mitigate variances of the final measurement, whether they are intensities or ratios.
Due to the conceptual simplicity of microarrays, their acceptance as a standard molecular biology technique, the rise of biological studies performed from a system-wide perspective, and the need for further multivariate analyses, we foresee chemometrics playing a larger role in the analysis of DNA microarray experiments. Therefore, we believe it is vitally important that researchers are aware of the nature and the inherent limitations of the measurements with which they are working. It is hoped that this article has contributed to achieving that end.

Declaration

This is NRCC publication number 51755.

References
[1] R.B. Stoughton, Applications of DNA microarrays in biology, Annu. Rev. Biochem. 74 (2005) 53–82.
[2] J. Quackenbush, Computational analysis of microarray data, Nat. Rev. Genet. 2 (2001) 418–427.
[3] F. Katagiri, J. Glazebrook, Overview of mRNA expression profiling using DNA microarrays, Curr. Protoc. Mol. Biol. 85 (2009) 22.4.1–22.4.13.
[4] D.V. Nguyen, A.B. Arpat, N. Wang, R.J. Carroll, DNA microarray experiments: biological and technological aspects, Biometrics 58 (2002) 701–717.
[5] J.S. Verducci, V.F. Melfi, S. Lin, Z. Wang, S. Roy, C.K. Sen, Microarray analysis of gene expression: considerations in data mining and statistical treatment, Physiol. Genomics 25 (2006) 355–363.
[6] Microarray data analysis, [http://www.nslij-genetics.org/microarray/], Date last accessed: March 29, 2010.
[7] K.V. Ewart, J.C. Belanger, J. Williams, T. Karakach, S. Penny, S.C.M. Tsoi, R.C. Richards, S.E. Douglas, Identification of genes differentially expressed in Atlantic salmon (Salmo salar) in response to infection by Aeromonas salmonicida using cDNA microarray technology, Dev. Comp. Immunol. 29 (2005) 333–347.
[8] M.C. Pirrung, How to make a DNA chip, Angew. Chem. Int. Ed. Engl. 41 (2002) 1276–1289.
[9] S.P. Fodor, J.L. Read, M.C. Pirrung, L. Stryer, A.T. Lu, D. Solas, Light-directed, spatially addressable parallel chemical synthesis, Science 251 (1991) 767–773.
[10] S. Singh-Gasson, R.D. Green, Y. Yue, C. Nelson, F. Blattner, M.R. Sussman, F. Cerrina, Maskless fabrication of light-directed oligonucleotide microarrays using a digital micromirror array, Nat. Biotechnol. 17 (1999) 974–978.

[41] W. Jin, R.M. Riley, R.D. Wolfinger, K.P. White, G. Passador-Gurgel, G. Gibson, The contributions of sex, genotype and age to transcriptional variance in Drosophila melanogaster, Nat. Genet. 29 (2001) 389–395.
[42] X. Peng, C. Wood, E. Blalock, K. Chen, P. Landfield, A. Stromberg, Statistical implications of pooling RNA samples for microarray experiments, BMC Bioinform. 4 (2003) 26.
[43] M.K. Kerr, M. Martin, G.A. Churchill, Analysis of variance for gene expression microarray data, J. Comput. Biol. 7 (2000) 819–837.
[44] M.K. Kerr, G.A. Churchill, Statistical design and the analysis of gene expression microarray data, Genet. Res. 77 (2001) 123–128.
[45] E.V. Thomas, K.H. Phillippy, B. Brahamsha, D.M. Haaland, J.A. Timlin, L.D.H. Elbourne, B. Palenik, I.T. Paulsen, Statistical analysis of microarray data with replicated spots: a case study with Synechococcus WH8102, Comp. Funct. Genomics 2009 (2009) 950171.
[46] M. Liang, A.G. Briggs, E. Rute, A.S. Greene, J. Cowley, Quantitative assessment of the importance of dye switching and biological replication in cDNA microarray studies, Physiol. Genomics 14 (2003) 199–207.
[47] K. Dobbin, J.H. Shih, R. Simon, Statistical design of reverse dye microarrays, Bioinformatics 19 (2003) 803–810.
[48] V. Folsom, M.J. Hunkeler, A. Haces, J.D. Harding, Detection of DNA targets with biotinylated and fluoresceinated RNA probes. Effects of the extent of derivitization on detection sensitivity, Anal. Biochem. 182 (1989) 309–314.
[49] J.B. Randolph, A.S. Waggoner, Stability, specificity and fluorescence brightness of multiply-labeled fluorescent DNA probes, Nucleic Acids Res. 25 (1997) 2923–2929.
[50] A. Badiee, H.G. Eiken, V.M. Steen, R. Løvlie, Evaluation of five different cDNA labeling methods for microarrays using spike controls, BMC Biotechnol. 3 (2003) 23.
[51] A. Richter, C. Schwager, S. Hentze, W. Ansorge, M.W. Hentze, M. Muckenthaler, Comparison of fluorescent tag DNA labeling methods used for expression analysis by DNA microarrays, Biotechniques 33 (2002) 620–628, 630.
[52] J. Haralambidis, M. Chai, G.W. Tregear, Preparation of base-modified nucleosides suitable for non-radioactive label attachment and their incorporation into synthetic oligodeoxyribonucleotides, Nucleic Acids Res. 15 (1987) 4857–4876.
[53] G. Wallner, R. Amann, W. Beisker, Optimizing fluorescent in situ hybridization with rRNA-targeted oligonucleotide probes for flow cytometric identification of microorganisms, Cytometry 14 (1993) 136–143.
[54] E. Manduchi, L.M. Scearce, J.E. Brestelli, G.R. Grant, K.H. Kaestner, C.J. Stoeckert, Comparison of different labeling methods for two-channel high-density microarray experiments, Physiol. Genomics 10 (2002) 169–179.
[55] J. Yu, M.I. Othman, R. Farjo, S. Zareparsi, S.P. MacNee, S. Yoshida, A. Swaroop, Evaluation and optimization of procedures for target labeling and hybridization of cDNA microarrays, Mol. Vis. 8 (2002) 130–137.
[56] J. Wang, M. Jiang, T.W. Nilsen, R.C. Getts, Dendritic nucleic acid probes for DNA biosensors, J. Am. Chem. Soc. 120 (1998) 8281–8282.
[57] R.L. Stears, R.C. Getts, S.R. Gullans, A novel, sensitive detection system for high-density microarrays using dendrimer technology, Physiol. Genomics 3 (2000) 93–99.
[58] M. Schena, Microarray Analysis, Wiley, Hoboken, NJ, 2003.
[59] S. Capaldi, R.C. Getts, S.D. Jayasena, Signal amplification through nucleotide extension and excision on a dendritic DNA platform, Nucleic Acids Res. 28 (2000) e21.
[60] T.L. Fare, E.M. Coffey, H. Dai, Y.D. He, D.A. Kessler, K.A. Kilian, J.E. Koch, E. LeProust, M.J. Marton, M.R. Meyer, R.B. Stoughton, G.Y. Tokiwa, Y. Wang, Effects of atmospheric ozone on microarray data quality, Anal. Chem. 75 (2003) 4672–4675.
[61] Ozone Prevention, [http://cmgm.stanford.edu/pbrown/protocols/Ozone_Prevention.pdf], Date last accessed: April 1, 2010.
[62] Genisphere 3DNA Array Detection DyeSaver2, [http://www.genisphere.com/array_detection_dyesaver.html], Date last accessed: April 1, 2010.
[63] M. Dar, T. Giesler, R. Richardson, C. Cai, M. Cooper, S. Lavasani, P. Kille, T. Voet, J. Vermeesch, Development of a novel ozone- and photo-stable HyPer5 red fluorescent dye for array CGH and microarray gene expression analysis with consistent performance irrespective of environmental conditions, BMC Biotechnol. 8 (2008) 86.
[64] U. Maskos, E.M. Southern, Oligonucleotide hybridizations on glass supports: a novel linker for oligonucleotide synthesis and hybridization properties of oligonucleotides synthesised in situ, Nucleic Acids Res. 20 (1992) 1679–1684.
[65] K.R. Khrapko, Yu.P. Lysov, A.A. Khorlyn, V.V. Shick, V.L. Florentiev, A.D. Mirzabekov, An oligonucleotide hybridization approach to DNA sequencing, FEBS Lett. 256 (1989) 118–122.
[66] V.G. Cheung, M. Morley, F. Aguilar, A. Massimi, R. Kucherlapati, G. Childs, Making and reading microarrays, Nat. Genet. 21 (1999) 15–19.
[67] T.D. Shalon, DNA micro arrays: a new tool for genetic analysis, Ph.D. dissertation, Stanford University, 1996.
[68] P. Hegde, R. Qi, K. Abernathy, C. Gay, S. Dharap, R. Gaspard, J.E. Hughes, E. Snesrud, N. Lee, J. Quackenbush, A concise guide to cDNA microarray analysis, BioTechniques 29 (2000) 548–550, 552–554, 556 passim.
[69] C. Romualdi, S. Trevisan, B. Celegato, G. Costa, G. Lanfranchi, Improved detection of differentially expressed genes in microarray experiments through multiple scanning and image integration, Nucleic Acids Res. 31 (2003) e149.
[70] G. Kamberova, S. Shah, DNA Array Image Analysis: Nuts and Bolts, DNA Press, New York, 2002.
[71] Y.H. Yang, M.J. Buckley, S. Dudoit, T.P. Speed, Comparison of methods for image analysis on cDNA microarray data, J. Comput. Graph. Statist. 11 (2002) 108–136.


[72] CAMDA 2004 Conference Contest Datasets, [http://www.camda.duke.edu/camda04/datasets/index.html], Date last accessed: April 1, 2010.
[73] J.A. Timlin, D.M. Haaland, M.B. Sinclair, A.D. Aragon, M.J. Martinez, M. Werner-Washburne, Hyperspectral microarray scanning: impact on the accuracy and reliability of gene expression data, BMC Genomics 6 (2005) 72.
[74] R.S.H. Istepanian, Microarray image processing: current status and future directions, IEEE Trans. Nanobioscience 2 (2003) 173–175.
[75] Y.H. Yang, M.J. Buckley, T.P. Speed, Analysis of cDNA microarray images, Brief. Bioinform. 2 (2001) 341–349.
[76] Y. Chen, V. Kamat, E.R. Dougherty, M.L. Bittner, P.S. Meltzer, J.M. Trent, Ratio statistics of gene expression levels and applications to microarray data analysis, Bioinformatics 18 (2002) 1207–1215.
[77] J. Buhler, T. Ideker, D. Haynor, Dapple: improved techniques for finding spots on DNA microarrays, University of Washington [http://www.cs.wustl.edu/jbuhler/dapple/dapple-tr.pdf] (2000), Date last accessed: April 1, 2010.
[78] X. Wang, S. Ghosh, S.W. Guo, Quantitative quality control in microarray image processing and data acquisition, Nucleic Acids Res. 29 (2001) E75-5.
[79] A.N. Jain, T.A. Tokuyasu, A.M. Snijders, R. Segraves, D.G. Albertson, D. Pinkel, Fully automatic quantification of microarray image data, Genome Res. 12 (2002) 325–332.
[80] M.B. Eisen, ScanAlyze user manual, [http://rana.lbl.gov/manuals/ScanAlyzeDoc.pdf], Date last accessed: April 1, 2010.
[81] TM4: Spotfinder, [http://www.tm4.org/spotfinder.html], Date last accessed: April 1, 2010.
[82] C.A. Glasbey, P. Ghazal, Combinatorial image analysis of DNA microarray features, Bioinformatics 19 (2003) 194–203.
[83] E.E. Schadt, C. Li, B. Ellis, W.H. Wong, Feature extraction and normalization algorithms for high-density oligonucleotide gene expression array data, J. Cell. Biochem. Suppl. 37 (2001) 120–125.
[84] Å. Gjerstad, Å. Aakra, L. Snipen, U. Indahl, Probabilistically assisted spot segmentation with application to DNA microarray images, Chemometr. Intell. Lab. Syst. 98 (2009) 1–9.
[85] Y. Chen, E.R. Dougherty, M.L. Bittner, Ratio-based decisions on the quantitative analysis of cDNA microarray images, J. Biomed. Opt. 2 (1997) 364–374.
[86] J.P. Brody, B.A. Williams, B.J. Wold, S.R. Quake, Significance and statistical errors in the analysis of DNA microarray data, Proc. Natl. Acad. Sci. U. S. A. 99 (2002) 12975–12978.
[87] T.K. Karakach, R.M. Flight, P.D. Wentzell, Bootstrap method for the estimation of measurement uncertainty in spotted dual-color DNA microarrays, Anal. Bioanal. Chem. 389 (2007) 2125–2141.
[88] L. Qin, K.F. Kerr, Empirical evaluation of data transformations and ranking statistics for microarray analysis, Nucleic Acids Res. 32 (2004) 5471–5479.
[89] Y. Fang, A. Brass, D.C. Hoyle, A. Hayes, A. Bashein, S.G. Oliver, D. Waddington, M. Rattray, A model-based analysis of microarray experimental error and normalisation, Nucleic Acids Res. 31 (2003) e96.
[90] P. Brzoska, Background analysis and cross hybridization, Agilent Technologies [http://www.chem.agilent.com/Library/technicaloverviews/Public/5988-2363%20Bknd%20Analys.pdf] (2001) Publication 5988-236EN, Date last accessed: May 11, 2010.
[91] J.M. Arteaga-Salas, H. Zuzan, W.B. Langdon, G.J.G. Upton, A.P. Harrison, An overview of image-processing methods for Affymetrix GeneChips, Brief. Bioinform. 9 (2008) 25–33.
[92] Statistical algorithms description document, Affymetrix [http://www.affymetrix.com/support/technical/whitepapers/sadd_whitepaper.pdf] (2002), Date last accessed: April 1, 2010.
[93] C. Li, W.H. Wong, Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection, Proc. Natl. Acad. Sci. U. S. A. 98 (2001) 31–36.
[94] R.A. Irizarry, B.M. Bolstad, F. Collin, L.M. Cope, B. Hobbs, T.P. Speed, Summaries of Affymetrix GeneChip probe level data, Nucleic Acids Res. 31 (2003) e15.
[95] R.A. Irizarry, B. Hobbs, F. Collin, Y.D. Beazer-Barclay, K.J. Antonellis, U. Scherf, T.P. Speed, Exploration, normalization, and summaries of high density oligonucleotide array probe level data, Biostatistics 4 (2003) 249–264.
[96] B. Bolstad, R. Irizarry, M. Astrand, T. Speed, A comparison of normalization methods for high density oligonucleotide array data based on variance and bias, Bioinformatics 19 (2003) 185–193.
[97] R. Gentleman, V. Carey, D. Bates, B. Bolstad, M. Dettling, S. Dudoit, B. Ellis, L. Gautier, Y. Ge, J. Gentry, K. Hornik, T. Hothorn, W. Huber, S. Iacus, R. Irizarry, F. Leisch, C. Li, M. Maechler, A. Rossini, G. Sawitzki, C. Smith, G. Smyth, L. Tierney, J. Yang, J. Zhang, Bioconductor: open software development for computational biology and bioinformatics, Genome Biol. 5 (2004) R80.
[98] Guide to probe logarithmic intensity error (PLIER) estimation, Affymetrix [http://www.affymetrix.com/support/technical/technotes/plier_technote.pdf] (2005), Date last accessed: April 1, 2010.
[99] D.J. Lockhart, H. Dong, M.C. Byrne, M.T. Follettie, M.V. Gallo, M.S. Chee, M. Mittmann, C. Wang, M. Kobayashi, H. Norton, E.L. Brown, Expression monitoring by hybridization to high-density oligonucleotide arrays, Nat. Biotechnol. 14 (1996) 1675–1680.
[100] B. Bolstad, Low level analysis of high-density oligonucleotide array data: background, normalization and summarization, Ph.D. dissertation, University of California, Berkeley, 2004.
[101] L. Gautier, M. Moller, L. Friis-Hansen, S. Knudsen, Alternative mapping of probes to genes for Affymetrix chips, BMC Bioinform. 5 (2004) 111.
[102] Z. Wu, R. Irizarry, R. Gentleman, F.M. Murillo, F. Spencer, A model based background adjustment for oligonucleotide expression arrays, Johns Hopkins University [http://www.bepress.com/jhubiostat/paper1/] (2004), Date last accessed: April 1, 2010.


[127] G. Wang, A.V. Kossenkov, M.F. Ochs, LS-NMF: a modified non-negative matrix factorization algorithm utilizing uncertainty estimates, BMC Bioinform. 7 (2006) 175.
[128] P.D. Wentzell, T.K. Karakach, S. Roy, M.J. Martinez, C.P. Allen, M. Werner-Washburne, Multivariate curve resolution of time course microarray data, BMC Bioinform. 7 (2006) 343.
[129] O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein, R.B. Altman, Missing value estimation methods for DNA microarrays, Bioinformatics 17 (2001) 520–525.
[130] S. Oba, M. Sato, I. Takemasa, M. Monden, K. Matsubara, S. Ishii, A Bayesian missing value estimation method for gene expression profile data, Bioinformatics 19 (2003) 2088–2096.
[131] Z. Bar-Joseph, G.K. Gerber, D.K. Gifford, T.S. Jaakkola, I. Simon, Continuous representations of time-series gene expression data, J. Comput. Biol. 10 (2003) 341–356.
[132] X. Zhou, X. Wang, E.R. Dougherty, Missing-value estimation using linear and non-linear regression with Bayesian gene selection, Bioinformatics 19 (2003) 2302–2307.
[133] M.S.B. Sehgal, I. Gondal, L.S. Dooley, Collateral missing value imputation: a new robust missing value estimation algorithm for microarray data, Bioinformatics 21 (2005) 2417–2423.
[134] H. Kim, G.H. Golub, H. Park, Missing value estimation for DNA microarray gene expression data: local least squares imputation, Bioinformatics 21 (2005) 187–198.
[135] R. Jörnsten, H. Wang, W.J. Welsh, M. Ouyang, DNA microarray data imputation and significance analysis of differential expression, Bioinformatics 21 (2005) 4155–4161.
[136] G. Feten, T. Almøy, A.H. Aastveit, Prediction of missing values in microarray and use of mixed models to evaluate the predictors, Stat. Appl. Genet. Mol. Biol. 4 (2005) Article 10.
[137] X. Gan, A.W. Liew, H. Yan, Microarray missing data imputation based on a set theoretic framework and biological knowledge, Nucleic Acids Res. 34 (2006) 1608–1619.
[138] J. Tuikkala, L. Elo, O.S. Nevalainen, T. Aittokallio, Improving missing value estimation in microarray data with gene ontology, Bioinformatics 22 (2006) 566–572.
[139] X. Wang, A. Li, Z. Jiang, H. Feng, Missing value estimation for DNA microarray gene expression data by Support Vector Regression imputation and orthogonal coding scheme, BMC Bioinform. 7 (2006) 32.
[140] P. Johansson, J. Häkkinen, Improving missing value imputation of microarray data by using spot quality weights, BMC Bioinform. 7 (2006) 306.
[141] G.N. Brock, J.R. Shaffer, R.E. Blakesley, M.J. Lotz, G.C. Tseng, Which missing value imputation method to use in expression profiles: a comparative study and two selection schemes, BMC Bioinform. 9 (2008) 12.
[142] M. Celton, A. Malpertuy, G. Lelandais, A. de Brevern, Comparative analysis of missing value imputation methods to improve clustering and interpretation of microarray experiments, BMC Genomics 11 (2010) 15.
[143] V.G. Tusher, R. Tibshirani, G. Chu, Significance analysis of microarrays applied to the ionizing radiation response, Proc. Natl. Acad. Sci. U. S. A. 98 (2001) 5116–5121.
[144] S. Zhang, A comprehensive evaluation of SAM, the SAM R-package and a simple modification to improve its performance, BMC Bioinform. 8 (2007) 230.
[145] Significance Analysis of Microarrays, [http://www-stat.stanford.edu/tibs/SAM/].
[146] M.B. Eisen, P.T. Spellman, P.O. Brown, D. Botstein, Cluster analysis and display of genome-wide expression patterns, Proc. Natl. Acad. Sci. U. S. A. 95 (1998) 14863–14868.
[147] C. Desert, M. Duclos, P. Blavy, F. Lecerf, F. Moreews, C. Klopp, M. Aubry, F. Herault, P. Le Roy, C. Berri, M. Douaire, C. Diot, S. Lagarrigue, Transcriptome profiling of the feeding-to-fasting transition in chicken liver, BMC Genomics 9 (2008) 611.

[103] J. Nuez-Garcia, V. Mersinias, K. Cho, C.P. Smith, O. Wolkenhauer, The statistical distribution of the intensity of pixels within spots of DNA microarrays: what is the appropriate single-value representative? Appl. Bioinform. 2 (2003) 229239. [104] C.S. Brown, P.C. Goodwin, P.K. Sorger, Image metrics in the statistical analysis of DNA microarray data, Proc. Natl. Acad. Sci. U. S. A. 98 (2001) 89448949. [105] M.A. Newton, C.M. Kendziorski, C.S. Richmond, F.R. Blattner, K.W. Tsui, On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data, J. Comput. Biol. 8 (2001) 3752. [106] P.H. Tran, D.A. Peiffer, Y. Shin, L.M. Meek, J.P. Brody, K.W.Y. Cho, Microarray optimizations: increasing spot accuracy and automated identication of true microarray signals, Nucleic Acids Res. 30 (2002) e54. [107] M.C.K. Yang, Q.G. Ruan, J.J. Yang, S. Eckenrode, S. Wu, R.A. McIndoe, J.X. She, A statistical method for agging weak spots improves normalization and ratio estimates in microarrays, Physiol. Genomics 7 (2001) 4553. [108] R.L. Stears, T. Martinsky, M. Schena, Trends in microarray analysis, Nat. Med. 9 (2003) 140145. [109] D.M. Rocke, B. Durbin, A model for measurement error for gene expression arrays, J. Comput. Biol. 8 (2001) 557569. [110] R.D. Wolnger, G. Gibson, E.D. Wolnger, L. Bennett, H. Hamadeh, P. Bushel, C. Afshari, R.S. Paules, Assessing gene signicance from cDNA microarray expression data via mixed models, J. Comput. Biol. 8 (2001) 625637. [111] B.P. Durbin, J.S. Hardin, D.M. Hawkins, D.M. Rocke, A variance-stabilizing transformation for gene-expression microarray data, Bioinformatics 18 (Suppl 1) (2002) S105S110. [112] W. Huber, A.V. Heydebreck, H. Sltmann, A. Poustka, M. Vingron, Variance stabilization applied to microarray data calibration and to quantication of differential expression, Bioinformatics 18 (2002) s96s104. [113] D.M. Rocke, S. 
Lorenzato, A two component model for measurement error in analytical chemistry, Technometrics 37 (1995) 176184. [114] T.K. Karakach, P.D. Wentzell, Methods for estimating and mitigating errors in spotted, dual-color DNA microarrays, OMICS 11 (2007) 186199. [115] J. Quackenbush, Microarray data normalization and transformation, Nat. Genet. 32 (2002) 496501. [116] T. Kepler, L. Crosby, K. Morgan, Normalization and analysis of DNA microarray data by self-consistency and local regression, Genome Biol. 3 (2002) research0037.1-research0037.12. [117] G.K. Smyth, T. Speed, Normalization of cDNA microarray data, Methods 31 (2003) 265273. [118] S. Dudoit, Y.H. Yang, M.J. Callow, T.P. Speed, Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments, Stat. Sinica 12 (2002) 111139. [119] T. Suzuki, P.J. Higgins, D.R. Crawford, Control selection for RNA quantitation, Biotechniques 29 (2000) 332337. [120] T. Park, S. Yi, S. Kang, S. Lee, Y. Lee, R. Simon, Evaluation of normalization methods for microarray data, BMC Bioinform. 4 (2003) 33. [121] Y.H. Yang, S. Dudoit, P. Luu, D.M. Lin, V. Peng, J. Ngai, T.P. Speed, Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation, Nucleic Acids Res. 30 (2002) e15. [122] W.S. Cleveland, Robust locally weighted regression and smoothing scatterplots, J. Am. Stat. Assoc. 74 (1979) 829836. [123] Human Genome Expression Index, [http://www.biotechnologycenter.org/hio/ databases/index.html], Date last accessed: April 1, 2010. [124] J. DeRisi, L. Penland, P.O. Brown, M.L. Bittner, P.S. Meltzer, M. Ray, Y. Chen, Y.A. Su, J.M. Trent, Use of a cDNA microarray to analyse gene expression patterns in human cancer, Nat. Genet. 14 (1996) 457460. [125] A.H. Khimani, A.M. Mhashilkar, A. Mikulskis, M. O'Malley, J. Liao, E.E. Golenko, P. Mayer, S. Chada, J.B. Killian, S.T. 
Lott, Housekeeping genes in cancer: normalization of array data, BioTechniques 38 (2005) 739745. [126] T.T. Ni, W.J. Lemon, Y. Shyr, T.P. Zhong, Use of normalization methods for analysis of microarrays containing a high degree of gene effects, BMC Bioinform. 9 (2008) 505.