
technology feature

Visualizing the right path
Cloud genomics

Scientific software: seeing the SNPs between us


© 2008 Nature Publishing Group http://www.nature.com/naturemethods
nature methods | VOL.5 NO.10 | OCTOBER 2008

Steven David Buckingham

The results of large genome-wide association studies (GWASs) are being deposited in public databases
with increasing frequency. But the software to analyze and interpret GWAS datasets can be difficult to use.
Could a new generation of user-friendly programs fill the gap?

Genome-wide association studies (GWASs) are an exciting new approach in the hunt for genes affecting our health and wellbeing. GWASs look for associations between underlying genetic features, such as single-nucleotide polymorphisms (SNPs) and copy-number variations (CNVs), and phenotypes such as health, illness and even behavior. The last two years have witnessed an explosion in the number of published GWAS studies, revealing new genes involved in diseases such as late-onset Alzheimer's¹, type-2 diabetes²,³, schizophrenia⁴ and cancer⁵,⁶.

But if these datasets are to be of use to all biologists, not just GWAS experts, then new, user-friendly software is urgently needed. Happily, software designers are on the case, coming up with new ways of making GWAS data easy to explore and share among researchers, and designing analysis packages that deal with the increasing computational demands posed by these datasets.

Julie Williams of the Cardiff University School of Medicine in Cardiff, UK sees a clear need for new software. Last year, Williams was awarded £1.3 million (about $2.6 million) from the Wellcome Trust to lead one of the biggest GWAS studies of Alzheimer's disease to date. "One of the major challenges for GWAS software designers is keeping up with the changing demands of the research field," notes Williams. These demands can be statistical, such as incorporating SNP imputation, or bioinformatic, as with the integration of large-scale gene-expression data.

Williams' view is shared among GWAS researchers. Hakon Hakonarson of the Children's Hospital of Philadelphia in Pennsylvania, USA, who recently headed up a GWAS that identified a new type-1 diabetes gene, thinks there is a need to improve software and computational power on all fronts to gain more speed when exploring these larger datasets. Some of these analyses can take days to run, he notes.

Visualization tools worth a thousand words

There is little doubt that the currently available GWAS software tools present a barrier to the novice user. The associations that GWASs look for can be quite subtle, with several genes potentially acting together to exert an effect. This requires advanced statistical methods to tease associations out of datasets, without misleading researchers with false positives.

GWAS experts have traditionally used their own in-house scripts written in command-line environments, such as the R statistical package, a computational and graphics environment similar to the S language developed at Bell Laboratories in Murray Hill, New Jersey, USA. Although statisticians and computer programmers are familiar with these powerful toolkits, to most biologists they are like a foreign language. But researchers like Fredrik Pettersson from the Wellcome Trust in Cambridge, UK, who developed a GWAS program called Goldsurfer2, are creating a new generation of visually based, interactive GWAS programs. "With complex datasets, it is useful to use our superior cognitive abilities by interpreting pictures instead of text," says Pettersson. He notes that these genomic datasets, with their references to physical positions and the ability to link to annotations, are well suited to interactive environments. All sorts of filters, colors, shapes and sizes can be used to represent slices of the data.

[Figure: Trey Ideker is working to apply his Cytoscape pathway analysis program to GWAS data.]

The size of GWAS datasets means the visualization software must also be capable of doing two things at once: provide the user a broad overview of the data in a way that reveals trends as well as have the ability to focus on areas of interest once identified. Stephen Miller, European Director of Business Development for Progeny Genetics based in South Bend, Indiana, USA, says, "Visualization of targeted genes is really important, especially in family-based studies. For example, once you get to the gene level you might want to see patterns of haplotypes and how that reflects the disease." Certainly with gene chips capable of scanning a million SNPs and a typical GWAS study incorporating thousands to tens of thousands of individuals, standard tables and plots are of no use with datasets of this size. Goldsurfer2 overcomes this problem by arranging the data in a hierarchical
structure and linking it to plots and tables that update themselves as users focus in on nodes of interest. The raw data are presented in a tree structure, making it easy to create subsets based, for instance, on population stratification. The user can import several sets of samples, merge them into a common file and use principal component analysis (a statistical method for reducing multidimensional datasets with minimal information loss) to see whether stratification is skewing the results. Presenting data as tables side by side with interactive graphical plots is an approach also taken by Partek Inc. of St. Louis in their Genomics Suite, which offers the possibility of exploring data with biplots, heat maps and frequency plots.

[Figure: Stephen Miller says visualization of target genes is important when analyzing GWAS datasets. Credit: Progeny Genetics]

For biologists, such interactive visual environments can make standard GWAS analysis techniques quite intuitive. For example, Pettersson's Goldsurfer2 takes the traditional two-dimensional linkage disequilibrium plot into three dimensions, where the contours and coloring of the plot can be used to represent different measurements of linkage disequilibrium, and zooming in on the plot shows the calculated values for cases and controls. A similar approach was adopted for the Genomics Module of the JMP7 software package from SAS Institute Inc. of Cary, North Carolina, USA, which plots principal component analyses in three dimensions, thereby allowing the user to reduce a multidimensional dataset to three dimensions to see how the data group.

[Figure: The Goldsurfer2 software package allows three-dimensional visualization of linkage disequilibrium. Credit: Fredrik Pettersson]

Still, designing visualization approaches for GWAS data has lagged far behind other areas of GWAS software development. "There are two reasons for this reluctance to tackle the visualization challenge," says Kristin Tolle, Senior Research Program Manager for Biomedical Computing for External Research in Microsoft Research in Redmond, Washington, USA. First, it is not clear exactly how to represent multidimensional data in a way that will lead to discoveries. And in the second place there are immense computational barriers in visualizing such large datasets. In April 2008, Microsoft awarded over $850,000 to six teams of researchers to jumpstart the development of visualization tools.

Visualizing the right path

The complexities of the visualization challenge are not lost on Trey Ideker, associate professor of Bioengineering at the University of California, San Diego's Jacobs
School of Engineering. But Ideker, who developed a software program for the visualization of protein interaction networks called Cytoscape, believes that in the future GWAS datasets will be visualized in terms of pathways. "We think that GWAS needs to transcend SNP-based thinking and move on to pathways," he argues. His team was one of the six groups funded by Microsoft, and he envisions a day when a user will take a result from a GWAS study and use a program such as Cytoscape to query protein-interaction databases and identify pathways and subnetworks involved with the phenotype.

"The gold nuggets of genome-wide significant SNPs only explain a small fraction of heritability," says Christophe Lambert, chief executive officer of Golden Helix in Bozeman, Montana, USA, a data-analysis software company he founded 10 years ago in anticipation of the present need for user-friendly software. "We need to look at the less significant SNPs, the gold dust mixed in the dirt."

Some existing commercial packages specializing in pathway analysis can already be used for the analysis of GWAS datasets. Programs such as Metacore from Genego of St. Joseph, Michigan, USA and Ingenuity Pathway Analysis from Ingenuity of Redwood City, California, USA allow any large-scale dataset, including GWAS datasets, to be imported and analyzed with regard to protein interactions and other pathways, such as metabolic and signaling networks. Following a slightly different approach, the JMP Genomics Module uses clustering tools to identify pathways highly represented in datasets. Imaginative and computer-aware users could even use the current version of Ideker's Cytoscape to superimpose GWAS findings onto protein-protein interaction networks, although Ideker says future developments will incorporate this functionality into the main Cytoscape program.

But not every GWAS software developer thinks visualization is the answer to all our needs. "At the end of the day, the real value of analysis is in reducing the massive datasets to a small set of hypotheses that the investigator can interpret," says Hakonarson. While a user-friendly interface is appealing to most users, sometimes command-line-driven programs are better, such as for rapid and parallel computation of large data sets.

[Figure: Helix Tree, from Golden Helix, uses color to visualize linkage disequilibrium. Credit: Golden Helix]

Lambert also notes that although the human mind is hard-wired to see patterns, you still need good statistics to sort out whether those patterns arose by chance alone. "Visualization is an important trend, but won't get you anywhere if the underlying analysis isn't right," says Lambert. Indeed, with datasets comprising over a million SNPs, a standard P-value cut-off of 0.05 that does not account for multiple testing would show ~50,000 SNPs as significant just by sheer chance. But programs like Golden Helix's SNP and Variation Suite include a genome-wide permutation test that efficiently deals with the false discovery problem by examining the proportion of the time that the genome-wide minimum P-values
generated from tests on randomly shuffled phenotypes are at least as significant as the P-values generated from the original phenotype. This provides a model-free estimate of the P-values, more closely approximating the true significance of potential findings.

"The top hits in the first round of analysis from any genome screen are mostly artifacts that result from genotyping problems or very low minor allele frequencies," points out Bill Cookson of the National Heart and Lung Institute in London, who used a GWAS to discover genes involved in asthma. "The key tools at this stage are a strong sense of cynicism and the desire to root out false positives by examination of genotyping traces and raw allele counts." And he adds that only after this process does modern software help navigate through the biological complexities that underlie association signals.

Cloud genomics

Visualizing increasingly large datasets is not the only problem being addressed with the new GWAS software. Programmers are also facing up to the intense computational demands placed by the sheer size of these datasets. When interrogating a million SNPs per sample, a 10,000-sample study will be simply too big to load into the memory of a normal desktop computer. Goldsurfer2 solves this problem by using a swap file, and can comfortably load datasets containing 5,000 markers and 5,000 samples. But things can get even trickier when it comes to the analysis. Access times can become intolerable, and advanced statistical procedures that involve pair-wise comparisons can become prohibitively slow. However, by applying a battery of the latest computing techniques, some software packages can cope with these problems surprisingly well.

"Golden Helix has successfully performed both whole-genome SNP and CNV association studies on over 11,000 samples with over 500,000 markers apiece," claims Lambert. Starting from terabytes of raw data, final processed matrices of 40 GB or more can be efficiently analyzed with desktop computers. This includes principal component analysis for the correction of batch effects and of population stratification.

But the challenges are only going to get tougher. Up to now, most GWASs have explored simple yes or no phenotypes: does the patient have diabetes or not? Is the patient overweight? But there is increasing interest in applying GWAS to study more graded phenotypes, such as blood pressure or insulin levels. These expression quantitative trait loci will inevitably increase the size of datasets by an order of magnitude.

Another reason GWAS datasets are going to get bigger is the increasing interest in CNVs. "[CNVs] may account for more variation than SNPs," argues Lambert. CNVs are regions of repeated sequence in the genome that differ from one individual to the next. The latest SNP chips, including the SNP 6.0 chip from Affymetrix in Santa Clara, California, USA and the Infinium HD Bead Chips from Illumina of San Diego, have already begun to incorporate CNVs, with more on the way. Whole-genome sequencing will increase data sizes by several orders of magnitude, according to Lambert, who talks of a forthcoming "fire hose of information."

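As a rough illustration of the genome-wide permutation test described earlier (the proportion of phenotype shuffles whose best genome-wide result is at least as extreme as the observed one), the following Python sketch may help. It is not Golden Helix's implementation: the function names are hypothetical, the per-SNP statistic is a simple difference in mean allele count between cases and controls, and the largest statistic across SNPs stands in for the smallest P-value. The first function also reproduces the ~50,000 chance hits quoted for an uncorrected 0.05 cut-off.

```python
import random


def expected_false_positives(n_tests, alpha):
    """Expected number of null tests passing an uncorrected cut-off."""
    return n_tests * alpha


def max_statistic(genotypes, phenotype):
    """Largest |mean allele count in cases - mean in controls| over all SNPs.

    genotypes: one list per SNP holding a 0/1/2 allele count per sample.
    phenotype: one 0 (control) or 1 (case) label per sample.
    """
    cases = [i for i, label in enumerate(phenotype) if label == 1]
    controls = [i for i, label in enumerate(phenotype) if label == 0]
    best = 0.0
    for snp in genotypes:
        case_mean = sum(snp[i] for i in cases) / len(cases)
        control_mean = sum(snp[i] for i in controls) / len(controls)
        best = max(best, abs(case_mean - control_mean))
    return best


def genome_wide_permutation_p(genotypes, phenotype, n_perm=200, seed=0):
    """Share of shuffles whose best genome-wide statistic matches or beats the observed one."""
    rng = random.Random(seed)
    observed = max_statistic(genotypes, phenotype)
    shuffled = list(phenotype)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any genotype-phenotype link
        if max_statistic(genotypes, shuffled) >= observed:
            hits += 1
    # Counting the observed labelling as one permutation avoids a P-value of zero.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    print(expected_false_positives(1_000_000, 0.05))  # about 50,000 chance hits

    # Toy data (an assumption for illustration): 10 cases, 10 controls, 3 SNPs.
    phenotype = [1] * 10 + [0] * 10
    data_rng = random.Random(1)
    genotypes = [
        [2] * 10 + [0] * 10,  # SNP perfectly associated with case status
        [data_rng.choice([0, 1, 2]) for _ in range(20)],
        [data_rng.choice([0, 1, 2]) for _ in range(20)],
    ]
    print(genome_wide_permutation_p(genotypes, phenotype))
```

Because every shuffle is scored by its single best SNP, the resulting value already accounts for the number of SNPs tested, which is the point of the approach. A real analysis would use a proper association statistic and far more permutations than this toy example.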
[Figure: Goldsurfer2 presents users with linked tables and plots to facilitate interaction. Credit: Fredrik Pettersson]

Software companies are not that far behind in rising to this CNV challenge. Affymetrix's genotyping console software includes algorithms and visualization tools for analyzing CNVs, and Golden Helix also includes a CNV module. The Affymetrix Genotyping Console implements Canary, a new CNV-calling algorithm that was developed in collaboration with the Broad Institute in Cambridge, Massachusetts, USA. The Partek Genomics Suite is also suitable for exploring CNV data, allowing copy-number and loss-of-heterozygosity data to be explored in the same region and visualized with the original mapping data.

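To put the memory problem raised in this section in numbers, here is a back-of-envelope Python sketch of how large a genotype matrix becomes for a million SNPs and 10,000 samples. The storage widths are assumptions chosen for illustration (64 bits for a floating-point value, 8 bits for a one-byte code, 2 bits for a packed encoding of the kind used by some genotype file formats); none of them is taken from Goldsurfer2 or any other package named here, and the function names are hypothetical.

```python
def genotype_matrix_bytes(n_snps, n_samples, bits_per_genotype):
    """Bytes needed to hold one genotype per SNP per sample, rounded up."""
    total_bits = n_snps * n_samples * bits_per_genotype
    return (total_bits + 7) // 8


def as_gigabytes(n_bytes):
    """Convert bytes to decimal gigabytes (1 GB = 10**9 bytes)."""
    return n_bytes / 10**9


if __name__ == "__main__":
    n_snps, n_samples = 1_000_000, 10_000  # figures quoted in the article
    # Storage widths below are illustrative assumptions, not vendor figures.
    for label, bits in [("64-bit float", 64), ("1-byte code", 8), ("2-bit packed", 2)]:
        size = genotype_matrix_bytes(n_snps, n_samples, bits)
        print(f"{label:>12}: {as_gigabytes(size):6.1f} GB")
```

Under these assumptions the matrix needs roughly 80 GB, 10 GB or 2.5 GB respectively, so even the most compact of the three would strain a desktop with only a few gigabytes of RAM, which is consistent with the swap-file approach mentioned above.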
In the end, the increasing dataset size from GWAS is likely to outstrip the capacity of the standard laboratory computer, forcing some developers to look to the internet for solutions. Microsoft is planning a web-based, end-to-end virtual GWAS environment for researchers. Offered in conjunction with the British Library, the Research Information Centre, which Tolle likens to a "Facebook for geneticists", can simplify the process of information search, facilitate discovery, effectively manage research objects, and enable versioning and archiving. "The time is right for this," says Tolle, "all the pieces are there; it is just a matter of putting them all together."

The collaboration environment resides within a hosted Microsoft Office SharePoint Server 2007 platform, which is accessible from a web browser, and allows researchers to collaborate and share data all the way from the beginning of a GWAS study to the final publication. Users will be able to select and run workflows online, which will run Microsoft's own in-house analysis routines behind the scenes. This service is currently in beta testing.

User-friendly, graphical programs for analyzing GWASs are becoming a reality, and imaginative visualization approaches are making it ever easier for users to explore GWAS datasets. But it remains to be seen whether the exponential increase in computational power demanded by ever-growing datasets will ultimately drive analysis off the desktop and onto the web.

1. Grupe, A. et al. Hum. Mol. Genet. 16, 865–873 (2008).
2. Sladek, R. et al. Nature 445, 881–885 (2007).
3. Hakonarson, H. et al. Nature 448, 591–594 (2007).
4. Stefansson, H. et al. Nature advance online publication 30 July 2008 (doi:10.1038/nature07229).
5. Thorgeirsson, T.E. et al. Nature 452, 638–642 (2008).
6. Hung, R.J. et al. Nature 452, 633–637 (2008).

Steven David Buckingham is a researcher in the Functional Genetics Unit at the University of Oxford, UK (steven.buckingham@dpag.ox.ac.uk).



SUPPLIERS GUIDE: COMPANIES OFFERING CHEMICAL GENOMICS REAGENTS AND INSTRUMENTATION

Company                          Web address
Affymetrix                       http://www.affymetrix.com/
Agilent Technologies             http://www.chem.agilent.com/
Ariadne Genomics                 http://www.ariadnegenomics.com/
Array Genetics                   http://www.arraygenetics.com/
Attagene                         http://www.attagene.com/
Bio-Rad                          http://www.bio-rad.com/
deCODE genetics                  http://www.decode.com/
Enzo Life Sciences               http://www.enzo.com/
Expression Analysis              http://www.expressionanalysis.com/
Genego                           http://www.genego.com/
Geneservice                      http://www.geneservice.co.uk/
Genizon Biosciences              http://www.genizon.com/
GenoLogics                       http://www.genologics.com/
Genolyze                         http://www.genolyze.com/
Geospiza                         http://www.geospiza.com/
Golden Helix                     http://www.goldenhelix.com/
Illumina                         http://www.illumina.com/
Infoquant                        http://www.infoquant.com/
Ingenuity                        http://www.ingenuity.com/
Jivan Biologics                  http://www.jivanbio.com/
JMP software                     http://www.jmp.com/
Marligen Biosciences             http://www.marligen.com/
Miltenyi Biotec                  http://www.miltenyibiotec.com/
MiraiBio                         http://www.miraibio.com/
Molecular Devices                http://www.moleculardevices.com/
NimbleGen                        http://www.nimblegen.com/
Ocimum Biosolutions              http://www.ocimumbio.com/
Oxford Gene Technology           http://www.ogt.co.uk/
Oxford University                http://www.stats.ox.ac.uk/
Partek                           http://www.partek.com/
PerkinElmer                      http://www.perkinelmer.com/
Perlegen Sciences                http://www.perlegen.com/
Phalanx Biotech Group            http://www.phalanxbiotech.com/
Premier Biosoft                  http://www.premierbiosoft.com/
Progeny                          http://www.progenygenetics.com/
Quantiom                         http://www.quantiom.de/
Rosetta Biosoftware              http://www.rosettabio.com/
Sequenom                         http://www.sequenom.com/
Signature Genomic Laboratories   http://www.signaturegenomics.com/
Softgenetics                     http://www.softgenetics.com/
