
protocol

Web-based inference of biological patterns, functions and pathways from metabolomic data using MetaboAnalyst
Jianguo Xia1 & David S Wishart1–3

1Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada. 2Department of Biological Sciences, University of Alberta, Edmonton, Alberta, Canada. 3National Research Council, National Institute for Nanotechnology, Edmonton, Alberta, Canada. Correspondence should be addressed to D.S.W. (david.wishart@ualberta.ca).

Published online 5 May 2011; doi:10.1038/nprot.2011.319

2011 Nature America, Inc. All rights reserved.

MetaboAnalyst is an integrated web-based platform for comprehensive analysis of quantitative metabolomic data. It is designed to be used by biologists (with little or no background in statistics) to perform a variety of complex metabolomic data analysis tasks. These include data processing, data normalization, statistical analysis and high-level functional interpretation. This protocol provides a step-wise description of how to format and upload data to MetaboAnalyst, how to process and normalize data, how to identify significant features and patterns through univariate and multivariate statistical methods and, finally, how to use metabolite set enrichment analysis and metabolic pathway analysis to help elucidate possible biological mechanisms. The complete protocol can be executed in ~45 min.

INTRODUCTION
Metabolomics is primarily concerned with the comprehensive analysis of all small-molecule compounds that can be found in biological samples, such as cells, tissues or biofluids1. Because of its utility in identifying biomarkers of disease and in measuring biochemical phenotypes, the field of metabolomics has grown rapidly in recent years. This growth has also been aided by advances in analytical technologies, such as high-resolution nuclear magnetic resonance (NMR) spectroscopy, mass spectrometry (MS) and various compound separation techniques2,3. As with other omics technologies, bioinformatics has a key role in facilitating the storage, dissemination and interpretation of metabolomic data. In particular, bioinformaticians have developed a number of comprehensive spectral, compound and biofluid databases4–6, as well as a variety of software tools for data processing, compound identification and compound quantification7–12. With these basic bioinformatics tools now in place, the focus in metabolomic software development has gradually shifted away from basic compound identification and more toward functional interpretation and pathway analysis (i.e., systems biology).
There are two general approaches to performing a metabolomics study: chemometric approaches and quantitative approaches13. Chemometric approaches (also known as nontargeted or untargeted methods) use raw, unannotated peak lists, binned spectral data or aligned spectral profiles in combination with multivariate statistics to identify spectral features that are statistically different between two (or more) different sample populations. Those features (peaks, retention times, masses, chemical shifts) identified as being significant may or may not be identified in subsequent analysis steps. Because chemometric methods do not make compound identification a priority, a major challenge with this approach is the subsequent identification step and the handling and elimination of false positives or spectral noise.
In contrast, quantitative metabolomics (also known as targeted profiling) requires compound identification and quantification before any further analysis. Multivariate statistical methods are then applied to the resulting concentration data to identify metabolites that are statistically different between two (or more) different sample populations. In quantitative metabolomics, compound identification and quantification are usually achieved by comparing the MS or NMR spectra of the biological samples of interest with a set of chemical standards or a reference spectral library. A key limitation to quantitative metabolomics is the accurate identification and quantification of compounds, especially in complex mixtures.
Although still in use today, chemometric approaches were more widespread when compound identification was hampered by the lack of comprehensive spectral databases and appropriate compound identification/quantification software. However, as many metabolomics researchers learned, without a list of named compounds, it is extremely difficult to identify the affected pathways, to infer a mechanism of action or to develop any kind of biological understanding. It is also very difficult to patent an unknown peak or an unnamed spectral feature. With the availability of several comprehensive metabolomic databases and improved spectral analysis tools4–6,14,15, compound identification has become much easier, and now quantitative metabolomics is becoming much more widely used in the metabolomics community16–19. In response to this trend toward quantitative metabolomics, as well as the growing community shift toward using open-access, web-based tools in many omics applications, we have developed a web-based software tool called MetaboAnalyst20.
MetaboAnalyst was specifically designed to address a wide variety of common metabolomic research and educational needs, including conventional biomarker identification, the extraction of diagnostic or prognostic metabolite patterns, general metabolite annotation, putative pathway identification, functional or biological interpretation of metabolomic data, general data exploration, online class instruction for multivariate statistics, general data visualization, the creation of plots/figures for publications and presentations, MS and/or NMR data normalization and large-scale error-checking of MS and NMR metabolomic data. Although MetaboAnalyst is certainly capable of being used for standard chemometric applications, it is mainly designed to support quantitative metabolomics. MetaboAnalyst is particularly unique among metabolomic analysis tools, in that it provides comprehensive support for multiple data
nature protocols | VOL.6 NO.6 | 2011 | 743

Box 1 | GLOSSARY

ANOVA: analysis of variance
CSF: cerebrospinal fluid
CSV: comma separated values
EBAM: empirical Bayesian analysis of microarrays38
FAQ: frequently asked questions
FDR: false discovery rate
GSEA: gene set enrichment analysis
HSD: Tukey's honestly significant difference
LOOCV: leave-one-out cross-validation
LSD: Fisher's least significant difference
GC/LC-MS: gas chromatography/liquid chromatography-mass spectrometry
MSEA: metabolite set enrichment analysis21
NMR: nuclear magnetic resonance
ORA: over-representation analysis
PCA: principal component analysis
PLS-DA: partial least squares-discriminant analysis
OPLS: orthogonal projections to latent structures39
QEA: quantitative enrichment analysis21
SAM: significance analysis of microarrays30
SNP: single nucleotide polymorphism
SSP: single sample profiling21
VIP: variable importance in projection


types (NMR, gas chromatography-MS (GC-MS) and liquid chromatography-MS (LC-MS) data), multiple data processing procedures, a wide range of statistical and machine learning methods, and tools for high-level functional interpretation. MetaboAnalyst also provides a user-friendly interface that guides non-experts through the data analysis process. In addition, it offers intuitive visualization tools and generates a detailed analysis report at the end of each session.
Since its release in 2009, MetaboAnalyst has been heavily used by researchers in the metabolomics community. Currently, the server is being accessed by an average of ~50 unique users per day. This has necessitated multiple server upgrades and the development of a very extensive set of frequently asked questions and tutorials. On the basis of user feedback, MetaboAnalyst has also undergone several updates to improve its support for binary (two-group) analysis and to extend its support for multiple-group analysis. One of the most recent enhancements has been the incorporation of metabolite set enrichment analysis (MSEA)21 and metabolic pathway analysis22 into MetaboAnalyst to assist in the high-level functional interpretation of quantitative metabolomic data. These additions should make MetaboAnalyst a true one-stop shop for metabolomic data analysis.

Comparison with other available tools for metabolomic data analysis
Perhaps the most widely used tool in metabolomics data analysis today is SIMCA-P+ (Umetrics). SIMCA-P+ is a commercial desktop application with a nicely designed graphical user interface that supports a wide variety of data transformations and multivariate statistical analyses, including principal component analysis (PCA), partial least squares-discriminant analysis (PLS-DA) and orthogonal projections to latent structures (see Box 1 for glossary). SAS (Statistical Analysis System from the SAS Institute) is another

stand-alone commercial software package that is also commonly used in many metabolomics studies. Similar to SIMCA-P+, SAS supports a wide range of data transformations as well as sophisticated univariate and multivariate analyses. Unlike SIMCA-P+, SAS lacks a graphical interface and is generally accessed through application programming interfaces. Generally speaking, the normalization, clustering, multivariate statistics and many of the graphs generated by means of MetaboAnalyst (and the accompanying protocols) could be generated using SIMCA-P+ and/or SAS. However, neither SIMCA-P+ nor SAS supports metabolomic-specific data processing (for NMR and/or MS data), nor do they offer high-level functional interpretation through automated metabolite annotation, MSEA or metabolic pathway analysis. Furthermore, MetaboAnalyst is a freely available, web-based application with extensive graphical output and an easy-to-use graphical user interface. This makes it somewhat more accessible, easier to learn and far easier to use than either SIMCA-P+ or SAS.
To the best of our knowledge, there are only two other freely available web-based metabolomic data processing tools: MeltDB23 and the metaP-Server24. However, neither would be able to perform most of the data processing or interpretive steps described in this protocol. MeltDB was primarily built for MS-based metabolomics data storage, administration, analysis and annotation, whereas metaP-Server was designed to support exploratory metabolomic data analysis using mainly univariate summary statistics. A detailed feature comparison for these five tools is given in Table 1.

Limitations of the protocol and software
Because of space restrictions, the protocols/procedures outlined in this paper will not be able to illustrate all of the functional capabilities that can be found in MetaboAnalyst. In particular, the clustering, classification and machine learning tools for data processing will not be discussed here.
Similarly, some of the metabolite annotation

Table 1 | Comparison of different metabolomics data analysis/interpretation programs.

Tool | Software type | License | Data input
MetaboAnalyst | Web-based | Free | Data table, NMR, MS, GC-MS data, compound/peak lists
MeltDB | Web-based | Free (registry required) | Raw mass spectral files
metaP-Server | Web-based | Free | Data table
SIMCA-P+ | Stand-alone | Commercial | Data table
SAS | Stand-alone | Commercial | Data table

Rated features: Graphical interface, Normalization, Univariate analysis, Multivariate analysis, Clustering, Classification, Enrichment analysis, Pathway analysis, Pathway visualization, Integration with other omics data, Peak annotation.

Ratings (alignment to features lost): ++ | MetaboAnalyst +++ +++ +++ +++ +++ ++ ++ +++ ++ + +++ + | MeltDB ++ + ++ ++ ++ | metaP-Server ++ + +++ + +++ | SIMCA-P+ +++ ++ | SAS +/ ++ +++ +++ ++ ++

The level of support for a particular feature is rated by the number of + symbols, with +++ as the highest.

functions will not be presented either. Although it is important to note some of the limitations of this particular protocol, it is also important to note that the software itself has some shortcomings. In particular, MetaboAnalyst has relatively limited metabolite annotation capabilities, it does not support or integrate other kinds of omics data and it has limited capabilities for processing and visualizing raw MS spectral files. This limitation with MS spectral files is primarily the result of hardware restrictions, both with respect to the MetaboAnalyst server and with respect to the speed of Internet data transfers (bandwidth). Raw MS spectral files are often too large (greater than hundreds of Mb) to be routinely or rapidly uploaded to a remote server. Furthermore, spectral processing (including peak picking, alignment and annotation) is a computationally intensive exercise that usually requires multiple iterations and careful manual inspection to achieve optimal results. Consequently, we believe that these tasks are better handled by locally installed software packages rather than through a web-based application. Indeed, many freely available tools have been developed for MS spectral processing, including MetAlign11, MZmine25, Met-IDEA26, MSFACTS27, Tagfinder28 and XCMS8, to name just a few. By avoiding this data transfer bottleneck and by limiting its preferred input format to partially processed data, such as peak lists or concentrations, MetaboAnalyst is able to offer much more efficient data analysis and visualization services to a much wider user base.

Analysis overview
The procedure described here provides a step-by-step protocol for using MetaboAnalyst to fully analyze quantitative metabolomic

Figure 1 | Flowchart for MetaboAnalyst. MetaboAnalyst is composed of three main functional modules responsible for data processing, statistical analysis and high-level functional interpretation. Different data inputs are first processed to produce appropriate data matrices. A wide array of univariate and multivariate statistical analyses can then be performed on these data matrices. If compound identities are known, users can perform enrichment analysis or pathway analysis after compound name standardization. Corresponding PROCEDURE step numbers are indicated in the figure.

Figure 1 flowchart panels:
Inputs: concentration tables; other inputs (peak lists, spectral bins, MS spectra); compound name lists.
Data processing and normalization: data pre-processing (peak detection/alignment, retention time correction, noise filtering); data integrity check (Step 3); missing value imputation; outlier removal; data normalization (Steps 4–6; row-wise procedures, column-wise procedures); compound name standardization (Step 7).
Statistical analysis: univariate analysis (fold changes, t-tests, volcano plots, ANOVA (Step 8A)); correlations (Step 9); SAM (Step 8B); EBAM; multivariate analysis (PCA (Steps 11–14), PLS-DA (Steps 15–19)); clustering (hierarchical cluster, SOM, K-means); classification (random forests, SVM).
High-level functional interpretation: pathway analysis (Steps 21–31; 15 organisms, 1,173 pathways; pathway enrichment analysis, pathway topology analysis, interactive visualization); enrichment analysis (Steps 32–35; 6,295 metabolite sets in 7 categories; over-representation analysis, single sample profiling, quantitative enrichment analysis).
Result download: analysis report, images, processed data.


Figure 2 | Data upload view. This screenshot shows MetaboAnalyst's available data analysis modules, with the Statistical Analysis module selected for data upload. Clicking the tab labeled Enrichment Analysis or Pathway Analysis will allow users to upload data for the corresponding data analysis. The navigation tree is located on the left, with the current step (Upload) highlighted.

data. It begins with a general overview of the program, followed by a detailed description of how to format and upload data, how to cleanse the data, how to normalize it and how to identify significant features or generate lists of important metabolites. It concludes with a description of how to perform MSEA and how to perform metabolic pathway analysis. Although the protocol is specific to MetaboAnalyst, many of the early stage statistical steps can be readily adapted to other statistical analysis packages (such as SIMCA-P+ and SAS). As noted earlier, not all of MetaboAnalyst's options or data analysis paths can be discussed in detail. However, the protocol described here should be applicable to many common data analysis scenarios in metabolomics.
MetaboAnalyst consists of three main modules: (i) a data processing module; (ii) a statistics module; and (iii) a high-level functional interpretation module. The data processing module is responsible for data input, data processing and data normalization. The statistics module supports a number of statistical (univariate, multivariate) and machine learning methods for feature selection, clustering and classification. The high-level functional interpretation module includes enrichment analysis and pathway analysis. The enrichment analysis offers MSEA using several comprehensive metabolite set libraries. The pathway analysis offers pathway enrichment analysis and pathway topology analysis through a Google Maps-style interactive pathway visualization system. As illustrated in Figure 1, the data processing module is the entryway to access the other two modules. The statistics module, which is perhaps the


most important module in MetaboAnalyst, is designed for general-purpose metabolomic data analysis and can be used to analyze a number of different data types, including compound concentration data, peak lists or binned spectral data (i.e., both targeted and non-targeted data). For high-level functional interpretation, only quantitative metabolomic data (i.e., compound concentration data or a list of metabolite names) can be accepted.
It is important to note that MetaboAnalyst's high-level functional analysis is organism specific, as dictated by MetaboAnalyst's underlying knowledgebase. For enrichment analysis, the collection of ~6,300 metabolite sets was compiled primarily from human studies. Therefore, users need to provide their own custom metabolite sets if they wish to perform enrichment analysis for other organisms. MetaboAnalyst's pathway analysis currently supports 15 model organisms with ~1,200 precompiled Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways. Before using this option, users need to decide whether these predefined libraries are applicable to their organism(s) under study. To perform high-level functional analysis, one critical step is to match compound names between users' data and MetaboAnalyst's

Box 2 | DATA FORMATTING

Comma separated values (.csv)
Input data values must be numeric, such as concentrations, peak intensities or areas of spectral bins. Missing values should be left blank or marked as NA. Samples can be in rows or columns, with group labels immediately following the sample names. The group label can be binary or multigroup. For enrichment analysis or pathway analysis, users can also use continuous labels for regression analysis.

Peak list files
Each peak list file should be saved in CSV format, with the first row reserved for column labels. An NMR peak list file must contain two columns, with the first column for peak positions (p.p.m. or parts per million) and the second column for peak intensities; mass spectrometry (MS) peak lists can be saved in either two-column (mass and intensities) or three-column format (mass, retention time and intensities) but not as a mixture of both. These files should be organized into separate folders named by their group labels and then uploaded as a single ZIP file.

MS spectra
MS spectra must be saved in one of two open exchange formats (netCDF or mzXML), put into different folders named by their group labels and uploaded as a single ZIP file. Because of Internet bandwidth constraints, the maximum size (after compression) allowed for an upload is 50 Mb. Larger spectral data should be first processed into other formats such as spectral bins or peak lists. Many spectral processing tools are freely available. For MS spectra processing, users can use MetAlign11, MZmine25 or XCMS8; for NMR spectra processing, HiRes9 or Automics40 can be used.
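
As an illustration of the CSV layout described in Box 2, the following minimal Python sketch parses a small table with samples in rows, the group label in the column immediately after the sample name, and missing values left blank or marked as NA. The sample names, compound columns and the function name read_concentration_table are hypothetical, and the sketch is not part of MetaboAnalyst; it only shows one way to check a file against the stated layout before upload.

```python
import csv
import io

# Hypothetical example table: samples in rows, group label in the second column.
EXAMPLE_CSV = """Sample,Label,Acetate,Glucose,Lactate
S1,control,12.5,3.1,NA
S2,control,10.2,,4.4
S3,cachexic,20.1,2.8,6.0
"""


def read_concentration_table(text):
    """Parse a samples-in-rows CSV into (names, labels, features, matrix).

    Blank cells and "NA" become None; every other cell must be numeric,
    otherwise float() raises ValueError.
    """
    rows = list(csv.reader(io.StringIO(text)))
    header, body = rows[0], rows[1:]
    features = header[2:]
    names, labels, matrix = [], [], []
    for row in body:
        names.append(row[0])
        labels.append(row[1])
        values = []
        for cell in row[2:]:
            cell = cell.strip()
            values.append(None if cell in ("", "NA") else float(cell))
        matrix.append(values)
    return names, labels, features, matrix


names, labels, features, matrix = read_concentration_table(EXAMPLE_CSV)
print(features)   # column labels after the sample name and group label
print(matrix[0])  # first sample, with the NA cell read as None
```

A file that passes this kind of check still has to satisfy the other constraints in Box 2 (for example, a consistent two- or three-column format for peak lists).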


Box 3 | DATA PROCESSING

Handling missing values
Depending on the type of metabolomics experiment being analyzed, there may be a substantial number of missing values present in the data set. A variety of methods have been implemented to deal with missing values. By default, MetaboAnalyst treats missing values as being present but with low signal intensity (below the detection limit). Consequently, they are replaced by half of the minimum positive value detected in the user data. Users are also allowed to manually or automatically exclude samples with too many missing values. Alternatively, a user can choose from several computational methods, including replacement by mean/median, probabilistic principal component analysis (PCA), Bayesian PCA or singular value decomposition, to impute the missing values41.

Outlier identification and removal
Outliers are defined as those few (usually one or two) data points that stand out from the majority of the other data points. They can be caused by sample degradation, instrumental errors, changes in measurement conditions or faulty measurements due to human error. An outlier can be either a sample outlier or a feature outlier. Outliers can usually be visually identified based on some graphical summaries of the data. For instance, the PCA score plot is often used to identify sample outliers. With this kind of plot, the outlier(s) should be located far away from the main cluster. Alternatively, hierarchical clustering can also be used to identify outliers, as they usually form a distant branch that joins the main cluster at a very high level. Box plots or box-and-whisker plots are commonly used to help detect feature outliers. After being identified, outliers can be removed using the DataEditor interface provided by MetaboAnalyst. Please note that after missing value replacement or outlier removal, users should redo their normalization before moving to any further downstream analysis. Many data analysis methods are quite sensitive to outliers; therefore, the results may be quite different after these procedures.
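
The default missing-value rule described in Box 3 (replace each missing value by half of the minimum positive value detected in the data) can be sketched in a few lines of Python. This is an illustrative reading of that sentence, not MetaboAnalyst's code: the function name is hypothetical, missing values are represented as None, and the minimum is taken over the whole table, which is an assumption about what "the user data" means.

```python
def replace_missing_with_half_min(matrix):
    """Return a copy of matrix with None replaced by half the smallest positive value.

    Assumption: the minimum is taken over the entire table, not per feature.
    """
    positives = [v for row in matrix for v in row if v is not None and v > 0]
    if not positives:
        raise ValueError("no positive values to derive a replacement from")
    fill = min(positives) / 2.0
    return [[fill if v is None else v for v in row] for row in matrix]


# Hypothetical 3 x 3 concentration table (samples in rows) with two gaps.
data = [
    [12.5, 3.0, None],
    [10.0, None, 4.0],
    [20.0, 2.0, 6.0],
]
filled = replace_missing_with_half_min(data)
print(filled)
```

Here the smallest positive value is 2.0, so both gaps are filled with 1.0. As Box 3 notes, normalization should be repeated after this kind of replacement.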


knowledgebase. As there is currently no universally accepted set of metabolite names or IDs, we have implemented an automated compound disambiguator to convert various compound IDs and synonyms to Human Metabolome Database (HMDB) compound names for MSEA and to KEGG compound names for pathway analysis. In some cases, there will be redundancies and conflicts due to the different naming schemes adopted by different databases. Those compounds with name conflicts will be highlighted for subsequent manual inspection. We recommend that users try the recently released Chemical Translation Service29 (http://cts.fiehnlab.ucdavis.edu) to clarify these ambiguities before performing any kind of high-level analysis.
MetaboAnalyst uses a navigation tree to guide users through its different analysis procedures (Fig. 2). All the available functions

are represented as tree nodes and these nodes are organized into different branches or functional categories. Users may click the corresponding nodes to navigate among different MetaboAnalyst functions. Depending on the context, some tree nodes may be disabled when the required preliminary steps have not been performed by the user. The current node is always highlighted during the analysis, as shown in Figure 2. This protocol is organized into five sections: (i) data formatting, uploading and processing; (ii) identifying important features using univariate analysis; (iii) multivariate statistical analysis; (iv) MSEA; and (v) metabolic pathway analysis. Two compound concentration data sets are provided to demonstrate these procedures. The first data set contains metabolite concentrations of 39 bovine rumen samples measured by proton NMR. The rumen

Box 4 | DATA NORMALIZATION

Eight commonly used procedures have been implemented for data normalization in MetaboAnalyst. Depending on whether they are to be performed on samples (rows) or features (columns), these methods are organized into two categories as described below.

Row-wise normalization
This procedure aims to reduce systematic bias during sample collection. The normalization by sum method is often used for binned spectra data in which the total spectral area is assumed to be constant; the normalization by a feature can be used to adjust the feature values (i.e., concentrations) of each sample against a spike-in, an internal standard or a physiological constant (e.g., urinary creatinine); the normalization by a sample, also known as probabilistic quotient normalization42, is a robust method to account for different dilution effects during sample preparation. This method rescales each sample by the most probable dilution factor, which is calculated as the median of the quotients between all corresponding features of the sample and the reference. It is very useful as an alternative procedure for urinary dilution adjustment when creatinine is unsuitable (e.g., in kidney disease). The sample-specific normalization allows users to manually specify a normalization value for each sample (e.g., on the basis of tissue volume or dry weight).

Column-wise normalization
These procedures aim to reduce the impact of very large feature values and to make all features more comparable or normally distributed by using different centering, scaling and log transformations. One disadvantage associated with this approach is the inflation of measurement errors or noise (usually of small values) after this procedure. A detailed discussion of these different normalization procedures is available in the paper by van den Berg et al.43. Please note that row-wise and column-wise normalizations are usually performed sequentially. For data already normalized before upload, this step can be skipped.
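
The row-wise and column-wise procedures in Box 4 can be made concrete with a short Python sketch. It shows normalization by sum, normalization by a reference sample (the median-of-quotients rule described above) and, as one illustrative column-wise choice, a log transformation followed by autoscaling (mean-centering and division by the standard deviation of each feature). The function names, the choice of log base 2 and the use of autoscaling are assumptions made for the example; the box does not specify which of the eight procedures correspond to these exact formulas.

```python
import math
from statistics import mean, median, stdev


def normalize_by_sum(matrix):
    """Row-wise: divide each sample by its own total."""
    out = []
    for row in matrix:
        total = sum(row)
        out.append([v / total for v in row])
    return out


def normalize_by_reference_sample(matrix, reference):
    """Row-wise: divide each sample by the median of its quotients to the reference."""
    out = []
    for row in matrix:
        quotients = [v / r for v, r in zip(row, reference) if r != 0]
        factor = median(quotients)  # most probable dilution factor
        out.append([v / factor for v in row])
    return out


def log_then_autoscale(matrix):
    """Column-wise: log2-transform, then center and scale each feature.

    Requires positive values, at least two samples and non-constant features.
    """
    logged = [[math.log2(v) for v in row] for row in matrix]
    columns = list(zip(*logged))
    centers = [mean(c) for c in columns]
    spreads = [stdev(c) for c in columns]
    return [[(v - m) / s for v, m, s in zip(row, centers, spreads)] for row in logged]


# Hypothetical data: the second sample is roughly a twofold concentrate of the first.
samples = [[1.0, 2.0, 4.0], [2.0, 4.0, 8.5]]
pqn = normalize_by_reference_sample(samples, reference=samples[0])
print(pqn[1])
```

In this example the quotients of the second sample to the reference are 2, 2 and 2.125, so the median dilution factor is 2 and the feature that really differs (8.5 versus 4.0) remains visible after rescaling. In line with the note in Box 4, a row-wise step would normally be applied before a column-wise step.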


Figure 3 | Data normalization view. The graph summarizes the distribution of input data values before and after normalization. The box plots on the top show the concentration distributions of individual compounds, whereas the bottom plots show the overall concentration distribution based on kernel density estimation.
[Figure 3 panels: "Before normalization" (x axis: Concentration) and "After normalization" (x axis: Normalized concentration). Box plots: one row per compound (y axis: Compounds). Density plots: y axis: Density.]

samples were collected from dairy cows fed with different proportions of barley grain. The samples are labeled in four groups (0, 15, 30 and 45), indicating different percentages of barley in the diet. The second data set contains metabolite concentrations of 77 urine samples from cancer patients, also measured by proton NMR. The samples are divided into two groups: control or cachexic (significant muscle loss).

Data formatting, processing and normalization
This section describes how to upload various data types into MetaboAnalyst, followed by explanations on how to perform data cleansing and normalization. The basic idea is to transform any uploaded data into a matrix, with samples in rows and features in columns. Three basic data formats are supported by MetaboAnalyst (Box 2). The most common type is a data table containing compound concentrations, peak intensities or spectral bins. These kinds of data can be easily viewed and edited using any spreadsheet program. The second data type consists of multiple peak lists, as picked from multiple spectra (NMR, MS or GC-MS). These kinds of data can be obtained from most spectral processing programs. The third data type corresponds to raw MS spectra saved in open exchange formats, such as netCDF or mzXML. More detailed information regarding data input formats, including example data sets, is available on the MetaboAnalyst website under the Data Formats and FAQs links.

Box 5 | SIGNIFICANT FEATURE IDENTIFICATION

Identification of features similar to a known biomarker
In this case, researchers are looking for features (metabolites or peaks) showing similarities in their intensity or concentration changes to a feature of interest (co-expression). Users can directly perform correlation analysis against the target feature to identify those peaks or metabolites that are either positively or negatively correlated. Users can also use hierarchical clustering. Features located in the same cluster as the target feature are most similar in terms of intensity or concentration changes. MetaboAnalyst supports many similarity measures, including Euclidean distance, Pearson's correlation, Spearman's rank correlation and Kendall's τ-test.

Identification of features following a particular pattern
In this case, researchers are looking for features that have shown particular patterns of changes under multiple (>2) conditions or through a range (>2) of time points. MetaboAnalyst uses a template matching approach to address this situation, as described by Pavlidis44. Template matching is part of the correlation analysis suite of MetaboAnalyst. Users can either use a predefined pattern or specify a new pattern to perform template matching. The template patterns must be specified as a series of numbers corresponding to concentration levels expected in different groups or at different time points.

Identification of features significantly different between case and control
MetaboAnalyst offers a number of approaches, including classical univariate methods such as the t-test and ANOVA, which are commonly used to compare means or medians of one variable across two or more groups. Because of the multiple testing issue, false discovery rate (FDR) or Bonferroni-corrected P values are provided. Significance analysis of microarrays (SAM)30 and empirical Bayesian analysis of microarrays (EBAM)38 are also available for high-dimensional data based on moderated t-statistics. In addition, machine learning approaches such as Random forests45 also provide feature importance measures based on their contribution to classification performance. Feature selection and assessment using multivariate approaches are discussed in Box 6.
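
The template matching idea in Box 5 can be sketched as a correlation between each feature and a user-specified pattern. The Python example below is a simplified illustration, not the method of Pavlidis44 as implemented in MetaboAnalyst: it correlates per-group mean levels (rather than individual samples) with the template using Pearson's correlation, and the feature names, values and function names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length sequences.

    Raises ZeroDivisionError if either sequence is constant.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def match_template(group_means, template):
    """Rank features by their correlation with the template, highest first."""
    scores = {name: pearson(values, template) for name, values in group_means.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical mean levels of three features in four ordered groups
# (for example, 0, 15, 30 and 45% barley in the diet).
group_means = {
    "feature_A": [1.0, 2.0, 3.0, 4.0],  # steadily increasing
    "feature_B": [4.0, 3.0, 2.0, 1.0],  # steadily decreasing
    "feature_C": [2.0, 2.5, 2.0, 2.5],  # no clear trend
}
ranked = match_template(group_means, template=[1, 2, 3, 4])
for name, r in ranked:
    print(name, round(r, 3))
```

With the increasing template 1, 2, 3, 4, the increasing feature ranks first and the decreasing feature ranks last with a strongly negative correlation, which mirrors the positive and negative matches described in the box. Other patterns (for example 1, 2, 2, 1 for a transient change) can be tested by changing the template.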



protocol Box 6 | MULTIVARIATE STATISTICAL ANALYSIS


principal component analysis PCA is an unsupervised clustering or classification method. It projects complex, high-dimensional data to a new coordinate system with fewer dimensions. The projection direction is calculated to maximize the data variance in just the first few dimensions (called principal components). The values in the remaining dimensions may be ignored with minimal loss of information. PCA is very good at revealing the internal structure of a data set with respect to variance. The results of a PCA are usually discussed in terms of scores and loadings. The scores represent the original data in the new coordinate system and the loadings are the weights applied to the original data during the projection process. Note that in PCA there is no guarantee that the directions of maximum variance will contain the best features for discrimination. It is also important to remember that PCA is very sensitive to outliers. Therefore, data normalization and outlier removal are usually needed in order to obtain good PCA results. partial least squares-discriminant analysis PLS-DA is a supervised clustering or classification method. This means that previous knowledge about the class labels (Y) is used during the classification process. PLS-DA projects the data (X) into a low-dimensional space that maximizes the separation between different groups of data in the first few dimensions (also called latent variables). These latent variables are ranked by how well they explain the Y-variance. An important issue with PLS-DA is deciding on the number of latent variables to be used to build the model. MetaboAnalyst supports two common approaches: (i) assessing the sum of squares captured by the model (R2) or the cross-validated R2 (also known as Q2), and (ii) assessing the prediction accuracies based on cross-validation, with different numbers of components. By default, the optimal number determined by Q2 is used (Fig. 5b). 
A common problem with PLS-DA is its propensity to overfit the data. This occurs when the algorithm appears to achieve good separation but has done so by picking up random noise rather than real signals. It has been shown that this problem cannot always be detected through cross-validation, but it can be detected using permutation tests46. A permutation test involves randomly reassigning the class labels and performing PLS-DA on the newly relabeled data set. The process is repeated hundreds or thousands of times, and the performance measures are plotted on a histogram for visual assessment. From the resulting histogram, it is possible to determine whether the original class assignment is significantly different from, or a part of, the distribution based on the permuted class assignments (Fig. 4c). An empirical P value is often calculated by determining the number of times the permuted data yielded a better result than the one using the original labels. For example, if none of the permuted class assignments is better than the observed one in 2,000 permutations, the P value is reported as P < 0.0005 (less than 1/2,000). We have implemented both cross-validation and permutation tests, as suggested by Bijlsma et al.36.

PLS-DA also produces variable importance measures. Two variable importance measures are available in MetaboAnalyst. The first, variable importance in projection (VIP), is a weighted sum of squares of the PLS loadings that takes into account the amount of explained Y-variance of each component (Fig. 4d). The other importance measure is based on a weighted sum of the PLS-regression coefficients. The weights are a function of the reduction of the sums of squares across the number of PLS components.
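The permutation procedure described in this box can be sketched as follows. This is an illustrative stand-in rather than MetaboAnalyst's implementation: instead of refitting a PLS-DA model in each permutation, it scores a single feature by the ratio of between-group to within-group sums of squares. The function names and the toy data are assumptions.

```python
import random


def bw_ratio(values, labels):
    """Between-group / within-group sum of squares for one feature."""
    grand = sum(values) / len(values)
    groups = {}
    for v, g in zip(values, labels):
        groups.setdefault(g, []).append(v)
    between, within = 0.0, 0.0
    for vs in groups.values():
        mean = sum(vs) / len(vs)
        between += len(vs) * (mean - grand) ** 2
        within += sum((v - mean) ** 2 for v in vs)
    return between / within


def permutation_test(values, labels, n_perm=2000, seed=0):
    """Count how often shuffled labels score strictly better than the real ones."""
    rng = random.Random(seed)
    observed = bw_ratio(values, labels)
    shuffled = list(labels)
    better = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if bw_ratio(values, shuffled) > observed:
            better += 1
    return observed, better


def report(better, n_perm):
    """Format the empirical P value the way the text describes."""
    if better == 0:
        return f"P < {1 / n_perm:g}"
    return f"P = {better / n_perm:g}"


values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0]
labels = ["control"] * 6 + ["treated"] * 6
observed, better = permutation_test(values, labels)
print(report(better, 2000))
```

With these perfectly separated toy groups no permutation beats the observed statistic, so the result is reported as an upper bound (less than 1/2,000) rather than as zero.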


Depending on the type of uploaded data, different preprocessing procedures may be used to convert the raw data into an appropriate data matrix. Compound concentration data (measured by NMR, GC-MS or LC-MS) are usually of high quality and do not normally need a preprocessing step. Binned spectral data usually contain a great deal of baseline noise and often require baseline filtering. For NMR and MS peak lists, MetaboAnalyst will first align the peaks across all samples. For GC/LC-MS spectra, MetaboAnalyst performs peak detection, peak alignment and retention time correction sequentially using the XCMS package8. MetaboAnalyst also supports some limited peak annotation/identification from raw peak lists. This annotation function can be accessed by going to the Other Utilities tab and clicking on the NMR/MS peak search tool bar. This function uses the HMDB peak search tools to score and identify peaks. MetaboAnalyst's MS peak search also identifies common adducts. After the data have been converted into a data matrix, a data integrity check is performed to ensure that the data are valid and suitable for subsequent analysis. This check covers data values, sample size, group labels and other data features. Box 3 describes some of the approaches available in MetaboAnalyst for dealing with missing values and outliers.
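Users who wish to pre-screen a file before uploading it could run a check along the following lines. This is only a sketch of the kind of checks described above, not MetaboAnalyst's own routine. It assumes a CSV table with samples in rows, sample names in the first column and group labels in the second, and it assumes that missing values are encoded as empty cells or NA; the function name and the example table are hypothetical.

```python
import csv
import io
from collections import Counter


def check_integrity(csv_text):
    """Summarize a 'samples in rows' table: name, group label, then features."""
    rows = [r for r in csv.reader(io.StringIO(csv_text)) if r]
    header, body = rows[0], rows[1:]
    names = [r[0] for r in body]
    cells = [c.strip() for r in body for c in r[2:]]
    missing, non_numeric = 0, 0
    for c in cells:
        if c == "" or c.upper() == "NA":  # assumed missing-value encoding
            missing += 1
            continue
        try:
            float(c)
        except ValueError:
            non_numeric += 1
    return {
        "n_samples": len(body),
        "n_features": len(header) - 2,
        "group_sizes": dict(Counter(r[1] for r in body)),
        "duplicate_names": sorted(n for n, k in Counter(names).items() if k > 1),
        "missing": missing,
        "non_numeric": non_numeric,
    }


example = (
    "Sample,Label,Glucose,Alanine\n"
    "s1,0,1.2,3.4\n"
    "s2,0,,2.2\n"
    "s3,15,0.9,NA\n"
    "s4,15,1.1,abc\n"
)
print(check_integrity(example))
```

A non-zero count of non-numeric cells or duplicate sample names would normally be corrected in the source file, whereas missing values can be handled after upload.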

It is often necessary to normalize metabolomic data before starting any kind of statistical analysis, for several reasons. First, normalization can reduce systematic bias or technical variation. Second, metabolite concentrations or peak intensities usually span several orders of magnitude (sub-micromolar to millimolar). Consequently, the variance from the more abundant metabolites will tend to dominate the variance-covariance matrix and obscure small but potentially significant signals. This can lead to misidentification of significant changes or a failure to identify significant changes, particularly with conventional multivariate statistical approaches. In addition, many statistical methods assume that data values follow a Gaussian distribution. Therefore, it is important to perform appropriate data transformations to bring the data closer to a bell-shaped distribution. MetaboAnalyst provides many useful methods for data normalization (Box 4). The effect of these normalization procedures on users' data can be visualized with a diagnostic plot (Fig. 3).

Significant feature identification using univariate methods
This section provides the detailed steps on how to identify features of interest using classical univariate statistical methods, such as the Student's t-test, analysis of variance (ANOVA) or correlation analysis. It also describes how to use a method developed for
analyzing high-dimensional data, namely, significance analysis of microarrays (SAM)30. Metabolomic data sets are intrinsically high dimensional, with the number of features (peaks, metabolites) ranging from a few dozen to hundreds or even thousands. They represent snapshots of global biochemical profiles of individual organisms. Most of these features are expected to be within normal physiological variations, and only a few may be significantly associated with the conditions or phenotypes of interest. The identification of those key features is the first step toward finding useful biomarkers or explaining the underlying biological process. Depending on the specific questions being asked or the information already known, MetaboAnalyst offers a number of different strategies to perform feature identification and assessment (Box 5). MetaboAnalyst also supports feature (or peak) annotation after significant features (peaks or bins) have been identified. This utility can be accessed under the Peak search node located near the bottom of the navigation tree.

Figure 4 | Multivariate analysis using PLS-DA. (a) PLS-DA 3D score plot (axes: Component 1 (15.9%), Component 2 (6.6%) and Component 3 (12.1%); groups 0, 15, 30 and 45). (b) Bar plots showing the three performance measures (prediction accuracy, R2 and Q2) using different numbers of components. The red * indicates the best value of the currently selected measure (Q2). (c) The result of permutation tests summarized by a histogram (frequency versus permutation test statistics; observed statistic, P < 5e-04). (d) The top 15 compounds ranked by VIP scores.
Figure 5 | Results from metabolite set enrichment analysis. (a) The result table summarizing the matched metabolite sets ranked by their P values. (b) The detailed view of a matched metabolite set (accessed by clicking the corresponding bar icon on the last table column).

Multivariate data analysis
Multivariate statistics involves the simultaneous observation and analysis of more than two statistical variables. Because metabolomic data usually consist of dozens of features (compounds, peaks), many of which change as a function of time, phenotype or experimental conditions, multivariate data analysis is ideal for analyzing metabolomic data. Multivariate analysis includes a number of techniques, such as multivariate ANOVA, multivariate regression analysis, PCA, factor analysis and discriminant analysis. MetaboAnalyst supports two widely used multivariate methods: PCA and PLS-DA. These two methods are very useful for exploratory data analysis through dimensional reduction and data visualization (Box 6). MetaboAnalyst is also able to generate a variety of colorful, two- or three-dimensional graphs, such as score plots, loading plots and other kinds of


Box 7 | METABOLITE SET ENRICHMENT ANALYSIS


Types of enrichment analysis
Overrepresentation analysis (ORA). This algorithm requires a list of compound names as input. Such a list can be obtained by various feature selection or clustering analysis techniques, such as ANOVA, PCA or PLS-DA. A hypergeometric test is used to evaluate whether a particular metabolite set is represented more than expected by chance within the given compound list. The P value indicates the probability of observing at least a particular number of metabolites from a certain metabolite set in a given compound list.
Single-sample profiling (SSP). For this approach, the required input is a list of compound concentrations measured from common human biofluids, such as cerebral spinal fluid (CSF), blood and urine. The concentrations must be provided using standard concentration units (µmol for blood and CSF, and µmol per mmol creatinine for urine). The method first identifies those metabolites with concentrations deviating significantly from the reported normal reference ranges. These metabolites are then subject to overrepresentation analysis.
Quantitative enrichment analysis (QEA). For this algorithm, the required input is a compound concentration table. QEA is based on the globaltest47 algorithm, which uses a generalized linear model to estimate the association between concentration profiles of a matched metabolite set and the class label. The P value indicates the probability that none of the matched compounds in the metabolite set is associated with the class label.

Metabolite set libraries
We consider a group of metabolites as a metabolite set if there are established, empirically observed or theoretically predicted functional associations among them.
On the basis of these criteria, we have collected a total of 6,292 metabolite sets organized into seven categories: 83 pathway-associated metabolite sets; 742 disease-associated metabolite sets, which were further divided into three groups on the basis of the type of biofluid (CSF, blood or urine) from which they were reported; 4,501 single-nucleotide polymorphism (SNP)-associated metabolite sets; 921 model-predicted metabolite sets; and 57 metabolite sets on the basis of tissue or cellular co-localization.

Please note: Users should always be aware of the technological limitations of their metabolomics platform(s) when interpreting the results from overrepresentation analysis. Current analytical technologies offer only partial metabolome coverage, which makes metabolomic studies intrinsically biased toward metabolite sets containing compounds that are more abundant or more easily detected by a given technology platform. MetaboAnalyst supports the application of a platform-specific reference metabolome to correct for this potential bias.
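The hypergeometric test used for overrepresentation analysis can be written out in a few lines. The sketch below is illustrative only; the function and parameter names are assumptions, and the "universe" corresponds to the reference metabolome (all compounds that could have been detected).

```python
from math import comb


def ora_p_value(n_universe, n_set, n_list, n_hits):
    """P(X >= n_hits) when n_list compounds are drawn from a universe of
    n_universe compounds, n_set of which belong to the metabolite set."""
    total = comb(n_universe, n_list)
    tail = sum(
        comb(n_set, k) * comb(n_universe - n_set, n_list - k)
        for k in range(n_hits, min(n_set, n_list) + 1)
    )
    return tail / total


# Toy numbers: 10 detectable compounds, a set of 4, a list of 3 with 2 hits.
print(ora_p_value(10, 4, 3, 2))
```

Shrinking the universe to a platform-specific reference metabolome changes n_universe and n_set, which is how the coverage bias mentioned in the note above enters the calculation.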


diagnostic plots (Fig. 4). This section describes the detailed steps needed to perform PCA and PLS-DA using the example data sets and how to interpret the results.

Metabolite set enrichment analysis
This section describes the detailed steps to perform MSEA. MSEA is the metabolomic counterpart of gene set enrichment analysis (GSEA)31, which has been widely used in gene expression data analysis. The key idea behind GSEA is to investigate the enrichment of predefined groups of functionally related genes (or gene sets) instead of individual genes. This approach has been shown to be good at identifying significant as well as subtle but coordinated expression changes among a group of related genes. As groups of genes are usually associated with biological functions or biological pathways, GSEA also greatly facilitates higher-level functional interpretation. MSEA has been implemented in MetaboAnalyst, using the same concepts underlying GSEA (Fig. 5). Similar to GSEA, there are two essential components for MSEA: (i) the algorithms for enrichment analysis and (ii) the comprehensive libraries of functionally related metabolite sets. Box 7 provides more details about these two components.

Metabolic pathway analysis
This section describes the basic steps to perform metabolic pathway analysis and visualization of the results. Pathway analysis has proven to be an invaluable tool in understanding complex relationships among genes and proteins in genomics and proteomics studies32–35. Most pathway analysis tools focus on visually displaying and highlighting matched genes, proteins or metabolites and do not support more quantitative or statistical analysis. To address this issue, we have integrated two pathway analysis approaches: pathway enrichment analysis and pathway topology analysis. The results can be visualized intuitively using a Google Maps-style visualization system (Fig. 6). Box 8 provides additional details on the main features offered by MetaboAnalyst's pathway analysis utilities.

MATERIALS
EQUIPMENT SETUP
A PC with an Internet connection
Browser requirements: MetaboAnalyst has been tested on all modern web browsers that are JavaScript enabled, including Mozilla Firefox 3.0+, Safari 4.0+, Chrome 5.0+ (Google), Opera 10.0+ and Internet Explorer 8.0 (Microsoft).
Data files: MetaboAnalyst has a number of example data sets for format illustration purposes as well as for testing purposes. Users can directly select a testing data set in MetaboAnalyst's data upload page without actually downloading it. For this protocol, we will download a concentration data set and then re-upload it to better illustrate how local or user-generated data files may be handled. First, go to the MetaboAnalyst home page and then click the Data Formats link on the left menu bar. In the Data Formats page, under the Comma Separated Value (CSV) format, click and download the first concentration file, "Compound concentration data set - cow, four groups", and save it as cow_diet.csv. The second concentration file to be retrieved is "Compound concentration data set - human, two groups". Save this file as human_cachexia.csv.


Figure 6 | Metabolic pathway analysis and visualization. (a) The metabolome view showing all metabolic pathways arranged according to the scores from enrichment analysis (y axis) and from topology analysis (x axis). (b) The pathway view showing the corresponding metabolic pathway after clicking any node in the metabolome view. The matched metabolites are highlighted according to their P values. Users can zoom or drag the pathway map to view a subset of the compounds. (c) The compound view showing the concentration distribution of the corresponding metabolite after clicking any matched compound node. The P value and the node importance are indicated below.

PROCEDURE
Data upload, processing and normalization TIMING 5–10 min
1| Starting up: Go to the MetaboAnalyst home page and click the "click here to start" link to enter the data upload page.
CRITICAL STEP Although most browsers support multiple tabs, do not access MetaboAnalyst from more than one tab during an analysis. Opening multiple connections to MetaboAnalyst within the same browser will cause problems because the session data are overwritten.
? TROUBLESHOOTING
2| Data upload: Depending on the type of analysis that users wish to perform, they can upload their data using any of the three available tab options: Statistical Analysis, Enrichment Analysis or Pathway Analysis (Fig. 2). Here we show how to upload data from the Statistical Analysis tab, which is selected by default (data upload instructions for Enrichment Analysis are provided in Steps 21–24, and data upload directions for Pathway Analysis are given in Step 32). In the "Upload your data" section, users can upload either a comma-separated value (CSV) file or a compressed (ZIP) file (see Box 2 for more details). For the example we use here, choose Concentrations as the data type and "Samples in rows (unpaired)" as the data format. Click the Browse button to locate the cow_diet.csv file and click the Submit button.
CRITICAL STEP Users must specify the data type and data format that match their data. Failure to do so will result in MetaboAnalyst launching the wrong data processing procedure.
CRITICAL STEP Users can also easily perform paired analysis in MetaboAnalyst. For any kind of paired data comparison, there must be an even number (n) of samples. For CSV formatted data, the pairing information must be given by the class labels as integer values between -n/2 and -1 and between 1 and n/2. Samples whose class labels have the same absolute integer value are considered to be a pair (i.e., -18 is paired with +18). For ZIP formatted data, users need to upload a separate text file (.txt) to give the pair information. Each pair is specified as two sample names (without a suffix) separated by a colon, with one pair per row.
? TROUBLESHOOTING
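The paired-label convention for CSV files can be checked before upload with a short script. The following is a minimal sketch under the rules stated in Step 2 (an even number of samples n, labels from -n/2 to -1 and from 1 to n/2, pairs sharing the same absolute value); the function name and sample names are hypothetical, and treating a repeated label as an error is our assumption.

```python
def pair_samples(names, labels):
    """Return (negative-label sample, positive-label sample) tuples, pair 1 first."""
    n = len(labels)
    if n == 0 or n % 2 != 0:
        raise ValueError("paired data need an even, non-zero number of samples")
    half = n // 2
    negative, positive = {}, {}
    for name, label in zip(names, labels):
        if label == 0 or abs(label) > half:
            raise ValueError(f"label {label} is outside -{half}..-1 and 1..{half}")
        side = positive if label > 0 else negative
        if abs(label) in side:
            raise ValueError(f"label {label} is used more than once")
        side[abs(label)] = name
    # All n labels are valid and unique, so each side holds keys 1..half.
    return [(negative[k], positive[k]) for k in range(1, half + 1)]


names = ["pre_A", "post_A", "pre_B", "post_B"]
labels = [-1, 1, -2, 2]
print(pair_samples(names, labels))
```

Running such a check locally catches odd sample counts and mismatched labels before the file is submitted.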

Box 8 | METABOLIC PATHWAY ANALYSIS


Pathway enrichment analysis
Pathway enrichment analysis offers both overrepresentation analysis (Fisher's exact tests or hypergeometric tests) and quantitative enrichment analysis (globaltest47 and GlobalAncova48). The main characteristics of the two types of enrichment analysis are given in Box 7.

Pathway topology analysis
The importance of a compound within a given metabolic network can be estimated by its centrality measure. There are two commonly used centrality measures: degree centrality and betweenness centrality. The former measures the number of connections the node of interest has to other nodes and the latter measures the number of shortest paths going through the node of interest. As metabolic pathways are directed graphs, the relative betweenness centrality and out-degree centrality measures are used for calculating compound importance. For more information on graph-based methods, please refer to the paper by Aittokallio et al.49.

Pathway visualization
The metabolic pathways used by MetaboAnalyst are obtained from the KEGG database and presented as networks of chemical compounds, with metabolites as nodes and reactions as edges. MetaboAnalyst's pathway visualization system supports lossless zooming, dragging and linking operations. Relevant information can be obtained by clicking on the appropriate graphical elements. For instance, clicking a compound node on a metabolic pathway will display a more detailed view of the concentration distributions of the metabolite together with the node importance score and the P value calculated by t-test, ANOVA or linear regression, as determined by the analysis type (Fig. 6).
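The two centrality measures can be computed on a small directed graph as follows. This sketch uses Brandes' algorithm for betweenness on an unweighted graph; it is not taken from MetaboAnalyst. In particular, dividing each score by the pathway total to obtain a "relative" value is an assumption made for illustration, and the four-compound chain is a made-up pathway.

```python
from collections import deque


def out_degree(graph):
    """Number of outgoing edges per node; graph maps node -> list of successors."""
    return {v: len(graph[v]) for v in graph}


def betweenness(graph):
    """Brandes' algorithm for an unweighted directed graph."""
    score = {v: 0.0 for v in graph}
    for s in graph:
        stack = []
        preds = {v: [] for v in graph}
        sigma = {v: 0 for v in graph}  # number of shortest paths from s
        dist = {v: -1 for v in graph}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:  # breadth-first search from s
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in graph}
        while stack:  # accumulate dependencies in reverse BFS order
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    return score


def relative(scores):
    """Scale scores so that they sum to 1 within the pathway (assumed convention)."""
    total = sum(scores.values())
    return {v: (x / total if total else 0.0) for v, x in scores.items()}


# Hypothetical linear pathway: A -> B -> C -> D
pathway = {"A": ["B"], "B": ["C"], "C": ["D"], "D": []}
print(out_degree(pathway))
print(relative(betweenness(pathway)))
```

In this chain the two interior compounds carry all of the shortest paths, so they share the betweenness equally, whereas the terminal compound has an out-degree of zero.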


3| Data integrity checking: If the data have been uploaded successfully, a data integrity check is performed. After this check is completed, MetaboAnalyst will provide a summary of the data characteristics. Two common issues that often arise with metabolomic data are missing values and outliers (see Box 3 for more details). To handle missing values, users can click the Missing value imputation button to use a variety of options to either exclude or replace these values. Outlier identification and removal is an iterative process and is usually performed in combination with preliminary exploratory data analysis. See Step 28 for an example. For this particular data set, we accept the data as is, so we will click the Skip button to go to the normalization step.
4| Data normalization: There are two normalization procedures: row-wise normalization and column-wise normalization. The characteristics of the different normalization procedures are discussed in Box 4. In the data normalization page, choose normalization by a reference sample and then select the first sample name, 0-1-1, for row-wise normalization.
CRITICAL STEP The reference sample is generally the sample in the control group with the fewest missing values. Alternatively, users can choose to use a pseudo-reference sample created by averaging all samples in the control group. For high-quality data in which samples in the same groups are very homogeneous, the effects of either procedure should be very similar.
5| Select auto-scaling for column-wise normalization.
6| After the normalization steps have been completed, click Next to view a graphic summary of the normalization effects on the data (Fig. 3).
7| Compound name standardization (optional): This step is only applicable for compound concentration data. Click the Name check node under the Processing branch. The results of the name conversion process will be shown as a table. Compounds without an exact match in MetaboAnalyst's name library will be highlighted in either yellow (approximate match found) or red (no match found). Users should manually examine the compounds with approximate matches and choose the correct one. Otherwise, the first match in the candidate name list will be used. Click the Submit button to finish the name checking. Note that after this step, all three major nodes on the navigation tree (Statistics, Enrichment and Pathway) should be enabled. Note also that if the data are uploaded under the Enrichment Analysis or Pathway Analysis tab, the compound name mapping will be performed by default. The data are now processed, normalized and ready for a variety of downstream analysis procedures.

Identification of significant features with univariate methods TIMING ~10 min
8| Identification of significantly different features: MetaboAnalyst directly supports significant feature (metabolite) identification using several methods, including t-tests, ANOVA, volcano plots, SAM and others. Use option A for ANOVA-based feature selection or option B for SAM-based selection.
(A) ANOVA-based feature selection
(i) As the data in cow_diet.csv contain four groups, one can use ANOVA methods to select important features. Click the ANOVA node on the navigation tree to enter the "One-way ANOVA and post hoc analysis" page.
(ii) Significant features are identified with the default P value threshold of 0.05. As the ANOVA F-test only indicates whether at least two of the groups differ, post hoc analysis is then used to test which groups differ from each other. MetaboAnalyst offers two commonly used methods: Fisher's least significant difference (LSD) and Tukey's honestly significant difference (HSD). Tukey's HSD is generally more conservative than Fisher's LSD.
(iii) Click the view details link to see a data table from the ANOVA and post hoc tests using Fisher's LSD (the default). Users can click any compound name to view a box plot summary of its concentrations in different groups.
(B) SAM-based feature selection
(i) SAM is designed to control false positives when running multiple tests on high-dimensional data. To use the SAM method, click the SAM node on the MetaboAnalyst navigation tree.
(ii) The default view is the Step 1 tab, which contains two plots to help users select a suitable delta value. The left plot shows how the false discovery rate (FDR) changes with different delta values and the right plot shows the number of significant compounds identified given different delta values. For example, using the default delta value of 0.6 will identify ~25 compounds with an FDR of ~0.3; using a delta value of 1.0 will identify ~20 significant compounds with an FDR of less than 0.1. Enter 1.0 as the new delta value and click Submit.
(iii) The Step 2 tab shows a typical SAM plot with the delta value equal to 1.0. Click the View details link to see the SAM results table. A total of 21 compounds were identified above the chosen threshold. Note that the top ten compounds are almost exactly the same as those identified using ANOVA.
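For readers who want to see what option A computes for each compound, the one-way ANOVA F statistic can be sketched as below. This is a textbook formulation, not MetaboAnalyst's code; the function name and the toy concentrations are assumptions. Converting F to a P value requires the F distribution, which is not part of the Python standard library, so that step is left out.

```python
def anova_f(groups):
    """One-way ANOVA F statistic for a list of groups (each a list of values)."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    # Mean squares: between-group on k - 1 df, within-group on n - k df.
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Toy concentrations of one compound in three diet groups.
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
print(anova_f(groups))
```

A large F value says only that the group means are not all equal, which is why the post hoc comparisons in option A(ii) are needed to find which groups differ.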
9| Identification of other features with patterns of interest: This step allows users to investigate trends or patterns in metabolite concentration changes. Click the Correlations node on the navigation tree to enter the Correlation Analysis page. There are two types of correlation analysis that can be performed in MetaboAnalyst: correlation with a defined pattern (option A) or correlation with a specific feature (option B).
(A) Correlation with a defined pattern
(i) Here we will attempt to identify those metabolites whose concentrations increase with the percentage of grain in the diet. Choose the predefined pattern 1-2-3-4 from the "select a predefined pattern" drop-down list, which corresponds to a linear concentration increase across groups 0, 15, 30 and 45. Alternatively, users can specify their own patterns in the "define your own pattern" text field.
(ii) Click the Submit button beside the drop-down list used in the previous step. The result is shown in Figure 7a. The light blue bars show those metabolites with a negative correlation and the light pink bars show those with a positive correlation with the given pattern of change.
(iii) Click the View details link to see a table of all the compounds listed as well as their correlation coefficients. Clicking any compound name will generate a graphic summary of its concentration distribution within each group (Fig. 7b).
(B) Correlation with a specific feature
(i) On the basis of the above analysis and a review of the literature, we know that elevated levels of endotoxin are important for initiating certain inflammatory responses. We are interested in identifying other metabolites with patterns of change similar to endotoxin. We will use the default Pearson r as the distance measure and then select Endotoxin from the "Select a feature" drop-down list.
(ii) Click the Submit button. The resulting image shows a number of other features that are either positively or negatively correlated with endotoxin levels. The details can be obtained by following the view details link.

Figure 7 | Correlation analysis to identify compounds with a specific pattern. (a) Correlation plot showing the top 25 compounds that are significantly associated with a given pattern, 1-2-3-4 (a linear concentration increase under different conditions). The compounds are represented as horizontal bars, with light pink indicating positive correlations and light blue indicating negative correlations. Users can click the view details link to see a detailed table. (b) Box plots summarizing the concentration distributions of a selected compound.

10| Report generation and result download: Click the Download node on the navigation tree. MetaboAnalyst will generate a detailed analysis report based on the steps that the user has previously executed. The report contains a brief description of each method used, followed by the graphical and textual results based on the last parameter set. The normalized data, as well as any graphs generated during the analysis, are also available for download.
? TROUBLESHOOTING

Multivariate data analysis TIMING ~10 min
11| Data exploration and visualization with PCA: PCA summarizes data into a few components that explain most of the data variance. The main characteristics of PCA are discussed in Box 6. Click the PCA node on the navigation tree to enter the PCA page. This page shows six main output panels from MetaboAnalyst's PCA analysis. The default view is a pairwise score plot from the top five PCs, with the diagonal panels showing the explained variance.
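To make the terms loadings, scores and explained variance concrete, the first principal component can be computed by hand on a tiny data set. The sketch below swaps in power iteration on the covariance matrix, a simple technique chosen here because it needs only the standard library; PCA software generally uses a singular value or eigen decomposition instead, and nothing here reflects how MetaboAnalyst computes its PCA. The function name, starting vector and toy data are assumptions.

```python
import math


def first_pc(rows, n_iter=200):
    """Return (loadings, scores, explained variance fraction) for PC1."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    x = [[r[j] - means[j] for j in range(p)] for r in rows]  # mean-center
    cov = [
        [sum(x[i][a] * x[i][b] for i in range(n)) / (n - 1) for b in range(p)]
        for a in range(p)
    ]
    # Arbitrary start; the iteration fails if it is orthogonal to PC1.
    v = [float(j + 1) for j in range(p)]
    for _ in range(n_iter):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]
    eigenvalue = sum(
        v[a] * sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)
    )
    total_variance = sum(cov[a][a] for a in range(p))
    scores = [sum(xi[j] * v[j] for j in range(p)) for xi in x]
    return v, scores, eigenvalue / total_variance


# Two perfectly correlated features: one component explains everything.
rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
loadings, scores, explained = first_pc(rows)
print(loadings, explained)
```

The explained fraction is the quantity shown as a percentage on the axes of the score plots; real metabolomic data spread the variance over many components, so the first few percentages are usually far below 100%.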

12| Click the 2D score plot tab to see a detailed score plot using PC1 and PC2. The samples are labeled and colored according to their group memberships. In this view, users should look first for outliers; if there are obvious outliers, use the DataEditor under the Processing navigation tree to exclude them. Outlier removal should be carried out with considerable care, and outliers should be removed only if there is some clear justification (sample stability problems, sample collection issues, instrument problems, typographical errors and so on). Next, users should investigate sample dispersion; if the data points in the score plot are not well dispersed or show a high degree of skewing, this may be due to insufficient normalization. Click the Normalization node under the Processing branch to choose a different normalization procedure. In particular, auto-scaling or range scaling can be very effective for correcting severely skewed data.
13| In our case, no obvious outliers or skewed distribution can be detected. Furthermore, some modest separation or clustering is noticed among different groups. There are also some clusters that appear to overlap with each other. Users can click the 3D score plot to see whether a better separation can be identified with an extra dimension or an extra principal component.
14| Identification of influential or important features: If good separation patterns are seen in a score plot, users should go to the Loading plot as well as the Biplot views to identify those features that are most responsible for the separation. The loading plot can be viewed either as a scatter plot or as a bar plot, as specified by the user. In this particular case, as there are no clear separations, it is very difficult to identify the features that are important. We will use a supervised method, PLS-DA, for this purpose.
15| Data exploration and visualization with PLS-DA: PLS-DA can perform both classification and feature selection. The main characteristics of PLS-DA are discussed in Box 6. Click the PLS-DA node on the navigation tree to start this analysis. The default view is a pairwise summary of the score plots of the top five components.
16| Click the 2D Score plot for a detailed view of the separation patterns. A much better separation is obtained with PLS-DA compared with the PCA result obtained in Step 12. The 3D Score plot shows an almost perfect separation with the first three components (Fig. 4a).
17| Choosing the optimal number of components: MetaboAnalyst calculates R2 and Q2, which are two common performance measures in assessing PLS-DA models. R2 corresponds to the sum of squares captured by the model, whereas Q2 is the cross-validated R2. MetaboAnalyst also calculates prediction accuracies through cross-validation. Click the Cross Validation tab to start the process. Users can choose 10-fold cross-validation or leave-one-out cross-validation (LOOCV). In this case, we will choose LOOCV and click the Submit button. The result indicates that using the top two components gives the best performance based on Q2 measures (Fig. 4b). Click the view details link to get a detailed table of the calculated values.
? TROUBLESHOOTING
18| Result validation: As noted earlier, PLS-DA tends to overfit the data and this can often lead to false separations or incorrect classification. As a result, PLS-DA models need to be validated to see whether the separation is statistically significant or is due to random noise. This can be carried out using permutation tests. In each permutation, a PLS-DA model is built between the data (X) and the permuted class labels (Y) using the optimal number of components determined in the previous step. MetaboAnalyst provides two kinds of performance measures. The first is the separation distance, which is defined as
the ratio of the between-group sum of squares and the within-group sum of squares (B/W ratio), as suggested by Bijlsma et al.36. The second is the prediction accuracy. This is the default approach used by MetaboAnalyst. Click the Permutation button to view the results. The resulting histogram summarizes the distribution of the permutation test scores, with the red arrow indicating the performance based on the original labels. The further the arrow is to the right of the distribution, the more significant the separation between the groups. Figure 4c shows a typical permutation result based on separation distance. As seen in this figure, the original class assignment is very significant and not part of the distribution that we obtained using the permuted data. A P value < 0.0005 is reported on the basis of 2,000 permutations.
? TROUBLESHOOTING
19| Identification of important features: Click the Var. Importance tab to see a list of important features identified on the basis of the variable importance in projection (VIP) score (Fig. 4d). For multiple-group analysis, the VIP score is calculated for each component. The overall VIP score shown in the figure is the average across all the selected components. Users can also use the coefficient-based importance measure by clicking on the corresponding radio button and then pressing the Submit button. For multiple-group discriminant analysis, the same number of predictors will be built, with one for each group. The overall coefficient-based importance is the average of the feature coefficients in all predictors. Click the View details link to see the individual VIP scores in each selected component, or the coefficients in each group predictor if the coefficient-based importance is used.
? TROUBLESHOOTING
20| Report generation and result download: Click the Download node to download all the data, tables and figures produced from this particular analysis.
? TROUBLESHOOTING
Metabolite set enrichment analysis TIMING 5–10 min
21| In the Upload page, click the Enrichment Analysis tab.
22| There are three drop-down panels for three different types of enrichment analysis (see Box 7 for more details). Each method accepts a different data type: a list of compound names entered in a single-column format for over-representation analysis; a list of compound concentrations entered as a two-column table for single-sample profiling (SSP); and a concentration table (CSV) with samples in rows and metabolites in columns for quantitative enrichment analysis (QEA). The phenotype information must be placed in the second column and can be binary, multiclass or continuous. Click the third drop-down panel, A concentration table (quantitative enrichment analysis).
23| In the open page, click Browse to locate the human_cachexia.csv data file.
24| Ensure that the selected compound label type is compound names and the phenotype label is Discrete (Classification), and then click Submit.
? TROUBLESHOOTING
25| Compound name conversion: The purpose of this step is to compare and convert the compound names to the common compound names used in the HMDB. The compound identities can be specified by common names or major database IDs (e.g., KEGG, PubChem, HMDB, METLIN or BiGG). MetaboAnalyst's compound name/ID conversion is based on a name-mapping table from the HMDB. Each HMDB compound ID is associated with a common name, a set of synonyms and the compound IDs used in other major metabolomic databases. Any naming inconsistency is flagged and displayed to users for manual inspection and correction (see Step 7 for more details).
CRITICAL STEP Users must label compounds with either common compound names or common database IDs. Abbreviated names usually cannot be recognized. Unmatched or unidentified compounds will be excluded from downstream analyses.
26| Concentration comparison (optional): This step is only applicable when the uploaded data is a list of compound concentrations used for SSP. The basic idea behind SSP is to compare the measured concentration value of each compound with its normal reference ranges in the corresponding biofluid. For common human biofluids, such as blood, urine or cerebrospinal fluid, normal concentration ranges are known for many metabolites. In clinical metabolomic studies, it is often desirable to know whether certain metabolite concentrations in a given sample are higher or lower than their normal ranges. This procedure is designed to provide this kind of analysis. Click Conc. check to start the concentration comparison. By default, only compounds with concentrations above or below all the known or reported normal ranges will be selected for further investigation. Users should manually select or deselect compounds to override this default selection after inspecting the concentration comparison plots, as well as the original reports, by clicking the image icon in the Details column.
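The default selection rule described in Step 26 can be written down as a short Python sketch. It assumes that each compound has a list of reported (low, high) reference ranges and flags a compound only when its measured value lies above all of them or below all of them. The function name, compound names and numbers are made up for illustration and are not taken from the HMDB.

```python
def flag_out_of_range(measured, reference_ranges):
    """Return {compound: 'high' or 'low'} for values outside ALL reported ranges."""
    flags = {}
    for name, value in measured.items():
        ranges = reference_ranges.get(name)
        if not ranges:
            continue  # no reference data, so no comparison is possible
        if all(value > high for _, high in ranges):
            flags[name] = "high"
        elif all(value < low for low, _ in ranges):
            flags[name] = "low"
    return flags

# Hypothetical concentrations and reference ranges (arbitrary units)
measured = {"compound_a": 12.0, "compound_b": 3.0, "compound_c": 0.5}
reference_ranges = {
    "compound_a": [(2.0, 8.0), (3.0, 10.0)],   # above both ranges
    "compound_b": [(1.0, 2.5), (2.0, 4.0)],    # inside the second range
    "compound_c": [(1.0, 3.0)],                # below the only range
}
print(flag_out_of_range(measured, reference_ranges))
```

A compound that falls inside at least one reported range is left unflagged, which mirrors the conservative default; manual inspection of the plots remains the final check.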
27| Data normalization (optional): This step is only applicable when the uploaded data is a concentration table. In this case, we select Normalization by a reference sample, and then choose create a pooled average sample from the control group. Choose Autoscaling for column-wise normalization. See Box 4 for more details.
28| Data visualization and outlier detection (optional): The purpose of this step is to check whether the data values are relatively homogeneous and to detect outliers. Click the PCA node to open the PCA page. On the 2D score plot, a clear outlier, PIF_115, is noticeable, as it is far away from all other data points. This particular outlier is due to sample deterioration/contamination. Follow the route Processing → DataEditor and select PIF_115 under the Sample Editor tab, click Remove and then click Finish to go back to the normalization page. Perform the data normalization as done in Step 27. Recheck the PCA score plot. This time, no obvious outlier should be detected. Follow Enrichment → Set param. to specify the parameters for enrichment analysis.
29| Set parameters for enrichment analysis: In this step, users must specify a metabolite set library (or upload a custom metabolite set library) to start the analysis (see Box 7 for details). Users can also indicate whether a filter should be applied to exclude metabolite sets containing very few compounds. In this case, we use the default Pathway-associated metabolite sets and click the Next button to view the result.
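The two normalization choices made in Step 27 can be illustrated with a minimal Python sketch. It assumes that normalization by a reference sample means dividing each sample by the median of its feature-wise quotients to the reference (probabilistic quotient normalization, ref. 42) and that autoscaling means mean-centering each feature and dividing by its standard deviation. The function names and the small data table are hypothetical, and MetaboAnalyst's own implementation may differ in detail.

```python
from statistics import mean, median, stdev

def pooled_reference(rows):
    """Feature-wise average of the given samples (e.g., the control group)."""
    return [mean(col) for col in zip(*rows)]

def normalize_by_reference(rows, reference):
    """Row-wise: divide each sample by its median quotient to the reference."""
    normalized = []
    for row in rows:
        factor = median(x / r for x, r in zip(row, reference) if r != 0)
        normalized.append([x / factor for x in row])
    return normalized

def autoscale(rows):
    """Column-wise: mean-center each feature and divide by its sample SD."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [stdev(c) for c in cols]
    # A constant feature (SD of 0) is mapped to 0 instead of raising an error
    return [[(x - m) / s if s else 0.0 for x, m, s in zip(row, mus, sds)]
            for row in rows]

# Hypothetical table: two control and two case samples, three metabolites
controls = [[10.0, 20.0, 30.0], [14.0, 16.0, 34.0]]
cases = [[24.0, 40.0, 70.0], [6.0, 8.0, 12.0]]
reference = pooled_reference(controls)
scaled = autoscale(normalize_by_reference(controls + cases, reference))
print(scaled)
```

The row-wise step removes overall dilution differences between samples, and the column-wise step puts all metabolites on a comparable scale before PCA or enrichment analysis.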
30| View the MSEA results: The MSEA result is presented both graphically and in a detailed table (Fig. 5a). The horizontal bar graph summarizes the most significant metabolite sets identified during the analysis. The bars are colored on the basis of their P values, and the bar length is based on the fold enrichment, calculated as the actual matched number / expected number of matches (for over-representation analysis) or as the calculated statistic / expected statistic (for QEA). The Bonferroni-corrected P value and FDR are also provided. Users can click the image icon in the Details column of each matched metabolite set to view all its constituent metabolites, with matched ones highlighted in red (Fig. 5b), as well as SMPDB pathway images37 (when available).
31| Report generation and result download: Click the Download node to download the analysis report, images and the processed data.
? TROUBLESHOOTING
Metabolic pathway analysis TIMING ~10 min
32| Data upload and processing: In the Upload page, click the Pathway Analysis tab to get started with the human_cachexia.csv data. Users can enter either a list of compound names or a concentration table. The data upload and processing steps are similar to those involved in the enrichment analysis. Please see Steps 21–25 for more details.
? TROUBLESHOOTING
33| Set parameters for pathway analysis: Three parameters must be specified for pathway analysis. These include the pathway library, the algorithm for pathway enrichment analysis and the algorithm for topology analysis (see Box 8 for more details). Users can also supply a reference metabolome to correct for any potential bias in the enrichment analysis. The reference metabolome is specified as a list of KEGG compound IDs. In this case, we select the Homo sapiens library and use the default Global Test and Relative Betweenness Centrality for pathway enrichment analysis and pathway topology analysis, respectively.
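For the over-representation case, the fold enrichment defined in Step 30 (actual matches divided by expected matches) can be reproduced by hand. The Python sketch below assumes that the expected number of matches is (query size × set size) / reference size and that the P value comes from a one-sided hypergeometric test; both are common choices but are assumptions here, as are the function names and the toy compound list. The Bonferroni correction multiplies each P value by the number of metabolite sets tested.

```python
from math import comb

def over_representation(query, metabolite_set, universe_size):
    """Return (hits, expected hits, fold enrichment, hypergeometric P value)."""
    query, metabolite_set = set(query), set(metabolite_set)
    hits = len(query & metabolite_set)
    n, k_set = len(query), len(metabolite_set)
    expected = n * k_set / universe_size
    # P(X >= hits) when n compounds are drawn from the reference metabolome
    tail = sum(comb(k_set, k) * comb(universe_size - k_set, n - k)
               for k in range(hits, min(n, k_set) + 1))
    p_value = tail / comb(universe_size, n)
    return hits, expected, hits / expected, p_value

def bonferroni(p_values):
    """Multiply each P value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Hypothetical example: 5 query compounds, a 10-member set, 100 reference compounds
query = ["a", "b", "c", "d", "e"]
metabolite_set = ["a", "b", "c"] + [f"x{i}" for i in range(7)]
print(over_representation(query, metabolite_set, 100))
```

Supplying a reference metabolome, as mentioned in Step 33, corresponds to changing `universe_size` (and restricting the sets) to the compounds that the analytical platform can actually measure.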
34| Result visualization: The results from the pathway analysis are presented in two parts: a graphical output in the top section and a table containing all the numerical results at the bottom. Users can intuitively explore the results by pointing and clicking on various graphic elements. There are three types of views (Fig. 6). The left panel is the metabolome view, which displays all the matched pathways as circles (Fig. 6a). The color and size of each circle are based on P values and pathway impact values, respectively. Pointing the mouse over different nodes will show the corresponding pathway names. Clicking the nodes of interest will launch the corresponding pathway view on the right panel (Fig. 6b). Users can zoom or drag to focus on a particular section of the pathway. Clicking on any matched compound node (with highlighted background) will show the corresponding compound view, which contains a detailed summary of the compound concentrations, the importance measure and the P value (Fig. 6c).
35| Report generation and result download: Click the Download node to get the complete analysis report as well as the processed data and images produced during the analysis.
? TROUBLESHOOTING
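To give a feel for the pathway impact values shown in the metabolome view, the sketch below computes betweenness centrality on a toy directed pathway using Brandes' algorithm and then expresses the impact of the matched compounds as their share of the total centrality in the pathway. That definition of impact, the graph, the node names and the function names are assumptions made for illustration; see Box 8 for how MetaboAnalyst defines its topology measures.

```python
from collections import deque

def betweenness(graph):
    """Brandes' algorithm for an unweighted directed graph {node: [successors]}."""
    score = {v: 0.0 for v in graph}
    for source in graph:
        stack = []
        preds = {v: [] for v in graph}
        sigma = {v: 0 for v in graph}   # number of shortest paths from source
        dist = {v: -1 for v in graph}
        sigma[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:                     # breadth-first search from source
            v = queue.popleft()
            stack.append(v)
            for w in graph[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in graph}
        while stack:                     # accumulate dependencies backwards
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    return score

def pathway_impact(graph, matched):
    """Centrality of matched compounds relative to the whole pathway."""
    score = betweenness(graph)
    total = sum(score.values())
    if total == 0:
        return 0.0
    return sum(score[m] for m in matched if m in score) / total

# Hypothetical pathway: A -> B -> C -> D with a side branch B -> E
pathway = {"A": ["B"], "B": ["C", "E"], "C": ["D"], "D": [], "E": []}
print(pathway_impact(pathway, {"B"}))
```

In this toy graph a single matched compound at a branch point contributes more impact than two matched compounds near the end of the chain, which is the intuition behind weighting compounds by their position in the pathway.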

? TROUBLESHOOTING
Troubleshooting advice can be found in Table 2.
Table 2 | Troubleshooting table.

Step 1
Problem: The content of the home page does not show up
Possible reason: JavaScript is disabled in your browser
Possible solution: For Mozilla Firefox 3.0+, go to Tools → Options → Content, then select the checkbox beside Enable JavaScript. For Internet Explorer 8.0, go to Tools → Internet options → Security, then select Internet from the Zone icons. Click the Custom level button. From the list of available options, make sure the Disable radio button is not selected under the Active scripting item. For Safari 4.0+, go to Edit → Preferences → Security, then select the checkbox beside Enable JavaScript. Please check the documentation for other browsers on how to enable JavaScript

Steps 2, 24 and 32
Problem: Fail to upload data
Possible reason: Non-unique or unusual names; small sample size; wrong data formats; unrecognized zip format
Possible solution: Make sure sample or feature (peak/compound) names are unique and consist of a combination of English letters, underscores or numbers for naming purposes; the names should contain no space or other special characters; make sure there are at least three samples per group; make sure the selected data format matches your data; for Microsoft Excel users, choose CSV (Macintosh) to generate a .csv file; for WinZip (v12.0) users, choose the Legacy compression (Zip 2.0 compatible) for compression

Steps 17–19
Problem: No image is generated
Possible reason: The sample size is too small
Possible solution: These procedures require a minimum of five samples per group

Steps 10, 20, 31 and 35
Problem: No PDF report is generated
Possible reason: Some of the expected data were not generated
Possible solution: Set appropriate parameter values to make sure the resulting images are generated; make sure there are a minimum of five samples per group for PLS-DA analysis
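Most upload failures listed in Table 2 can be caught before submission with a small pre-flight check. The Python sketch below applies the strictest reading of the advice (unique names, only English letters, digits and underscores, and at least three samples per group). It is a hypothetical helper, not part of MetaboAnalyst, and the sample names other than PIF_115 are invented for the example.

```python
import re
from collections import Counter

NAME_OK = re.compile(r"^[A-Za-z0-9_]+$")  # letters, digits and underscores only

def check_upload(sample_names, group_labels, feature_names, min_per_group=3):
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    for kind, names in (("sample", sample_names), ("feature", feature_names)):
        dupes = sorted(n for n, c in Counter(names).items() if c > 1)
        if dupes:
            problems.append(f"duplicate {kind} names: {dupes}")
        bad = sorted({n for n in names if not NAME_OK.match(n)})
        if bad:
            problems.append(f"{kind} names with spaces or special characters: {bad}")
    for group, count in sorted(Counter(group_labels).items()):
        if count < min_per_group:
            problems.append(f"group {group!r} has only {count} sample(s)")
    return problems

# Hypothetical input with one duplicate, one bad name and two small groups
print(check_upload(
    ["PIF_115", "PIF_115", "NETL 019"],
    ["cachexic", "cachexic", "control"],
    ["Alanine", "Glucose"],
))
```

Setting `min_per_group=5` reproduces the stricter requirement that Table 2 gives for the PLS-DA steps.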

TIMING
The duration required to perform the steps described in the protocol depends on the data set size as well as on the number of active users connected to the web server. For the test data sets used for these protocols, most results should be returned within a few seconds after a user has selected the appropriate parameters. The most time-consuming computational step is probably the permutation test used by PLS-DA (15–20 s for 1,000 permutations). The most time-consuming non-computational step is typically the data visualization or data inspection step. Data upload, processing and normalization (Steps 2–7) should take about 5–10 min; feature selection using univariate analysis (Steps 8–10) usually takes around 3–5 min; and multivariate analysis (Steps 11–20) takes ~10 min. For high-level functional analysis, MSEA (Steps 21–31) should take 5–10 min, whereas metabolic pathway analysis (Steps 32–35) should take ~10 min. Once the data have been uploaded, a modestly experienced user should be able to execute the complete protocol in 30–40 min.
ANTICIPATED RESULTS
Graphical output

The graphical outputs produced during the analysis procedures are given in Figures 1–7. Some of the algorithms in MetaboAnalyst use time-dependent random number generators to calculate certain statistical values, so the results may vary slightly among runs.
Data processing results

The data integrity check for the data in cow_diet.csv will detect four groups with a total of 51 zero values and no missing values. The data integrity check for human_cachexia will yield two groups with no zero or missing values.
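The counts reported by the data integrity check can be reproduced locally before upload. The following Python sketch assumes the layout described in Step 22 (samples in rows, sample name in the first column, group label in the second column, metabolites in the remaining columns) and treats empty cells and the string NA as missing values; the NA convention, the function name and the inline data are assumptions for illustration.

```python
import csv
import io

def integrity_summary(csv_text):
    """Count groups, zero values and missing values in a concentration table."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    data = rows[1:]  # skip the header row
    groups = sorted({row[1] for row in data})
    zeros = missing = 0
    for row in data:
        for cell in row[2:]:
            cell = cell.strip()
            if cell == "" or cell.upper() == "NA":
                missing += 1
            elif float(cell) == 0:
                zeros += 1
    return {"groups": groups, "zeros": zeros, "missing": missing}

# Hypothetical four-sample table with two zero values and two missing values
table = (
    "Sample,Group,cmpd_a,cmpd_b\n"
    "s1,1,0,2.5\n"
    "s2,1,1.2,\n"
    "s3,2,0,NA\n"
    "s4,2,3.1,4.0\n"
)
print(integrity_summary(table))
```

To check a real file, its contents can be read with `open(path, newline="").read()` and passed to the same function.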
Feature selection using univariate methods

In MetaboAnalyst's ANOVA analysis of the cow_diet.csv data, the top five compounds identified with the default threshold should be endotoxin, 3-PP, glucose, isobutyrate and methylamine. The top five compounds identified using the SAM method
will be the same. In correlation analysis using the predefined 1-2-3-4 pattern, endotoxin and alanine are the top two compounds that will be positively correlated with this pattern, whereas 3-PP and aspartate are the top two compounds that will be negatively correlated with this pattern. The same compounds should be identified as being correlated/anticorrelated with endotoxin, using Pearson r. The top five compounds identified in SAM will be the same as those identified using the ANOVA test.
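Correlation against a predefined pattern such as 1-2-3-4 amounts to building a template vector from the group labels and ranking compounds by their Pearson r with that template. The Python sketch below shows the idea on invented data for eight samples in four groups; the compound names, values and function names are hypothetical and do not come from cow_diet.csv.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)

def rank_by_pattern(table, groups, pattern):
    """Sort compounds from most positively to most negatively correlated."""
    template = [pattern[g] for g in groups]  # one pattern value per sample
    scores = {name: pearson(values, template) for name, values in table.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical data: two samples in each of four groups
groups = [1, 1, 2, 2, 3, 3, 4, 4]
pattern = {1: 1, 2: 2, 3: 3, 4: 4}  # the increasing 1-2-3-4 pattern
table = {
    "compound_up": [1.0, 1.2, 2.1, 1.9, 3.0, 3.2, 4.1, 3.9],
    "compound_down": [3.9, 4.1, 3.2, 3.0, 1.9, 2.1, 1.2, 1.0],
    "compound_flat": [2.0, 3.0, 2.0, 3.0, 2.0, 3.0, 2.0, 3.0],
}
for name, r in rank_by_pattern(table, groups, pattern):
    print(name, round(r, 3))
```

Replacing the template with the measured values of a single compound gives the correlated/anticorrelated ranking against that compound instead of against a pattern.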
Multivariate data analysis

The score plot from the PCA analysis of the cow_diet.csv data should not show a clear separation, with groups 1 and 2 overlapping substantially and group 3 slightly overlapping with groups 2 and 4. A much better group separation will be achieved through PLS-DA. Using PLS-DA, the five most important compounds identified by VIP will be endotoxin, 3-PP, alanine, methylamine and glucose. The best PLS-DA model will use just the top two components, based on the Q2 score estimated from LOOCV (0.814). The permutation test based on 2,000 permutations should yield P < 5e-04, which is very significant.
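The relationship between R2 and the LOOCV-based Q2 can be made concrete with a deliberately small stand-in model. The Python sketch below uses a one-predictor least-squares line in place of PLS-DA, which is an assumption made purely to keep the example self-contained: R2 is computed from the residuals of the model fitted to all samples, and Q2 from the residuals of predictions made for each sample after leaving it out. The data and function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for one predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

def r2_q2_loocv(xs, ys):
    """Return (R2, Q2), with Q2 estimated by leave-one-out cross-validation."""
    my = sum(ys) / len(ys)
    tss = sum((y - my) ** 2 for y in ys)           # total sum of squares
    slope, intercept = fit_line(xs, ys)
    rss = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    press = 0.0                                     # predicted residual sum of squares
    for i in range(len(xs)):
        s, c = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (s * xs[i] + c)) ** 2
    return 1 - rss / tss, 1 - press / tss

# Hypothetical scores for eight samples; classes are coded as 0 and 1
xs = [-2.1, -1.8, -2.4, -1.9, 2.0, 2.3, 1.7, 2.2]
ys = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
r2, q2 = r2_q2_loocv(xs, ys)
print(r2, q2)
```

Because each left-out sample is predicted by a model that never saw it, Q2 cannot exceed R2 for this kind of model; a large gap between the two is a sign of overfitting, which is why the number of components is chosen on Q2 rather than on R2.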
Metabolite set enrichment analysis

All compound names from the human_cachexia.csv data set should be found to have an exact match during the name conversion step. The PCA score plot should not show a clear separation, although it should show PIF_115 as being a clear outlier. In the enrichment analysis using the pathway-based metabolite sets, the top five metabolic pathways that appear to be associated with cachexia will be pyrimidine metabolism, beta-alanine metabolism, ketone body metabolism, purine metabolism and glutamate metabolism.
Metabolic pathway analysis

The top five pathways from the human_cachexia.csv data set that should be identified by pathway enrichment analysis alone are pyrimidine metabolism, pantothenate and CoA biosynthesis, beta-alanine metabolism, synthesis and degradation of ketone bodies and propanoate metabolism. Note that three of these pathways are similar to those previously identified by MSEA. The top three pathways identified by topology analysis alone should be glycine, serine and threonine metabolism; pyruvate metabolism; and taurine and hypotaurine metabolism. Overall, three pathways appear to be perturbed as a consequence of cachexia: pantothenate and CoA biosynthesis; citrate cycle (TCA cycle); and alanine, aspartate and glutamate metabolism. These will be located in the diagonal area of the plot, with relatively good scores from both analyses.

ACKNOWLEDGMENTS We thank the Canadian Institutes for Health Research (CIHR) and the Alberta Ingenuity Fund (AIF; now part of Alberta Innovates Technology Futures) for financial support.
AUTHOR CONTRIBUTIONS J.X. and D.S.W. prepared and tested the protocol and wrote the article.
COMPETING FINANCIAL INTERESTS The authors declare no competing financial interests.
Published online at http://www.natureprotocols.com/. Reprints and permissions information is available online at http://npg.nature.com/reprintsandpermissions/.

1. Fiehn, O. Metabolomics – the link between genotypes and phenotypes. Plant Mol. Biol. 48, 155–171 (2002).
2. Wishart, D.S. Quantitative metabolomics using NMR. Trends Analyt. Chem. 27, 228–237 (2008).
3. Dunn, W.B. & Ellis, D.I. Metabolomics: current analytical platforms and methodologies. Trends Analyt. Chem. 24, 285–294 (2005).
4. Wishart, D.S. et al. HMDB: the human metabolome database. Nucleic Acids Res. 35, D521–D526 (2007).
5. Lundberg, P. et al. MDL – The Magnetic Resonance Metabolomics Database http://mdl.imv.liu.se (European Society for Magnetic Resonance in Medicine and Biology, ESMRMB, 2005).
6. Smith, C.A. et al. METLIN – a metabolite mass spectral database. Ther. Drug Monit. 27, 747–751 (2005).
7. Weljie, A.M., Newton, J., Mercier, P., Carlson, E. & Slupsky, C.M. Targeted profiling: quantitative analysis of 1H NMR metabolomics data. Anal. Chem. 78, 4430–4442 (2006).
8. Smith, C.A., Want, E.J., O'Maille, G., Abagyan, R. & Siuzdak, G. XCMS: processing mass spectrometry data for metabolite profiling using nonlinear peak alignment, matching, and identification. Anal. Chem. 78, 779–787 (2006).
9. Zhao, Q., Stoyanova, R., Du, S., Sajda, P. & Brown, T.R. HiRes – a tool for comprehensive assessment and interpretation of metabolomic data. Bioinformatics 22, 2562–2564 (2006).
10. Xia, J., Bjorndahl, T.C., Tang, P. & Wishart, D.S. MetaboMiner – semiautomated identification of metabolites from 2D NMR spectra of complex biofluids. BMC Bioinformatics 9, 507 (2008).
11. Lommen, A. MetAlign: interface-driven, versatile metabolomics tool for hyphenated full-scan mass spectrometry data preprocessing. Anal. Chem. 81, 3079–3086 (2009).
12. Katajamaa, M., Miettinen, J. & Oresic, M. MZmine: toolbox for processing and visualization of mass spectrometry based molecular profile data. Bioinformatics 22, 634–636 (2006).
13. Wishart, D.S. Current progress in computational metabolomics. Brief. Bioinform. 8, 279–293 (2007).
14. Cui, Q. et al. Metabolite identification via the Madison Metabolomics Consortium Database. Nat. Biotechnol. 26, 162–164 (2008).
15. Wishart, D.S. et al. HMDB: a knowledgebase for the human metabolome. Nucleic Acids Res. 37, D603–D610 (2009).
16. Henderson, J.P. et al. Quantitative metabolomics reveals an epigenetic blueprint for iron acquisition in uropathogenic Escherichia coli. PLoS Pathog. 5, e1000305 (2009).
17. Altmaier, E. et al. Variation in the human lipidome associated with coffee consumption as revealed by quantitative targeted metabolomics. Mol. Nutr. Food Res. 53, 1357–1365 (2009).
18. Ewald, J.C., Heux, S. & Zamboni, N. High-throughput quantitative metabolomics: workflow for cultivation, quenching, and analysis of yeast in a multiwell format. Anal. Chem. 81, 3623–3629 (2009).
19. Zulak, K.G., Weljie, A.M., Vogel, H.J. & Facchini, P.J. Quantitative 1H NMR metabolomics reveals extensive metabolic reprogramming of primary and secondary metabolism in elicitor-treated opium poppy cell cultures. BMC Plant Biol. 8, 5 (2008).

20. Xia, J., Psychogios, N., Young, N. & Wishart, D.S. MetaboAnalyst: a web server for metabolomic data analysis and interpretation. Nucleic Acids Res. 37, W652–W660 (2009).
21. Xia, J. & Wishart, D.S. MSEA: a web-based tool to identify biologically meaningful patterns in quantitative metabolomics data. Nucleic Acids Res. 38, W71–W77 (2010).
22. Xia, J. & Wishart, D.S. MetPA: a web-based metabolomics tool for pathway analysis and visualization. Bioinformatics 26, 2342–2344 (2010).
23. Neuweger, H. et al. MeltDB: a software platform for the analysis and integration of metabolomics experiment data. Bioinformatics 24, 2726–2732 (2008).
24. Kastenmuller, G., Romisch-Margl, W., Wagele, B., Altmaier, E. & Suhre, K. metaP-server: a web-based metabolomics data analysis tool. J. Biomed. Biotechnol. 2011, (2010).
25. Pluskal, T., Castillo, S., Villar-Briones, A. & Oresic, M. MZmine 2: modular framework for processing, visualizing, and analyzing mass spectrometry-based molecular profile data. BMC Bioinformatics 11, 395 (2010).
26. Broeckling, C.D., Reddy, I.R., Duran, A.L., Zhao, X. & Sumner, L.W. MET-IDEA: data extraction tool for mass spectrometry-based metabolomics. Anal. Chem. 78, 4334–4341 (2006).
27. Duran, A.L., Yang, J., Wang, L.J. & Sumner, L.W. Metabolomics spectral formatting, alignment and conversion tools (MSFACTs). Bioinformatics 19, 2283–2293 (2003).
28. Luedemann, A., Strassburg, K., Erban, A. & Kopka, J. TagFinder for the quantitative analysis of gas chromatography-mass spectrometry (GC-MS)-based metabolite profiling experiments. Bioinformatics 24, 732–737 (2008).
29. Wohlgemuth, G., Haldiya, P.K., Willighagen, E., Kind, T. & Fiehn, O. The Chemical Translation Service – a web-based tool to improve standardization of metabolomic reports. Bioinformatics 26, 2647–2648 (2010).
30. Tusher, V.G., Tibshirani, R. & Chu, G. Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl Acad. Sci. USA 98, 5116–5121 (2001).
31. Subramanian, A. et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl Acad. Sci. USA 102, 15545–15550 (2005).
32. Salomonis, N. et al. GenMAPP 2: new features and resources for pathway analysis. BMC Bioinformatics 8, 217 (2007).
33. Goffard, N., Frickey, T. & Weiller, G. PathExpress update: the enzyme neighbourhood method of associating gene-expression data with metabolic pathways. Nucleic Acids Res. 37, W335–W339 (2009).
34. Hu, Z. et al. VisANT 3.5: multi-scale network visualization, analysis and inference based on the gene ontology. Nucleic Acids Res. 37, W115–W121 (2009).
35. Goffard, N. & Weiller, G. PathExpress: a web-based tool to identify relevant pathways in gene expression data. Nucleic Acids Res. 35, W176–W181 (2007).
36. Bijlsma, S. et al. Large-scale human metabolomics studies: a strategy for data (pre-) processing and validation. Anal. Chem. 78, 567–574 (2006).
37. Frolkis, A. et al. SMPDB: the small molecule pathway database. Nucleic Acids Res. 38, D480–D487 (2010).
38. Efron, B., Tibshirani, R., Storey, J.D. & Tusher, V. Empirical Bayes analysis of a microarray experiment. J. Am. Stat. Assoc. 96, 1151–1160 (2001).
39. Trygg, J. & Wold, S. Orthogonal projections to latent structures (O-PLS). J. Chemom. 16, 119–128 (2002).
40. Wang, T. et al. Automics: an integrated platform for NMR-based metabonomics spectral processing and data analysis. BMC Bioinformatics 10, 83 (2009).
41. Stacklies, W., Redestig, H., Scholz, M., Walther, D. & Selbig, J. pcaMethods – a bioconductor package providing PCA methods for incomplete data. Bioinformatics 23, 1164–1167 (2007).
42. Dieterle, F., Ross, A., Schlotterbeck, G. & Senn, H. Probabilistic quotient normalization as robust method to account for dilution of complex biological mixtures. Application in 1H NMR metabonomics. Anal. Chem. 78, 4281–4290 (2006).
43. van den Berg, R.A., Hoefsloot, H.C., Westerhuis, J.A., Smilde, A.K. & van der Werf, M.J. Centering, scaling, and transformations: improving the biological information content of metabolomics data. BMC Genomics 7, 142 (2006).
44. Pavlidis, P. Using ANOVA for gene selection from microarray studies of the nervous system. Methods 31, 282–289 (2003).
45. Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).
46. Westerhuis, C.A. et al. Assessment of PLSDA cross validation. Metabolomics 4, 81–89 (2007).
47. Goeman, J.J., van de Geer, S.A., de Kort, F. & van Houwelingen, H.C. A global test for groups of genes: testing association with a clinical outcome. Bioinformatics 20, 93–99 (2004).
48. Hummel, M., Meister, R. & Mansmann, U. GlobalANCOVA: exploration and assessment of gene group effects. Bioinformatics 24, 78–85 (2008).
49. Aittokallio, T. & Schwikowski, B. Graph-based methods for analysing networks in cell biology. Brief. Bioinform. 7, 243–255 (2006).

