The results of large genome-wide association studies (GWASs) are being deposited in public databases
with increasing frequency. But the software to analyze and interpret GWAS datasets can be difficult to use.
Could a new generation of user-friendly programs fill the gap?
Genome-wide association studies (GWASs) are an exciting new approach in the hunt for genes affecting our health and wellbeing. GWASs look for associations between underlying genetic features, such as single-nucleotide polymorphisms (SNPs) and copy-number variations (CNVs), and phenotypes such as health, illness and even behavior. The last two years have witnessed an explosion in the number of published GWAS studies, revealing new genes involved in diseases such as late-onset Alzheimer's¹, type-2 diabetes²,³, schizophrenia⁴ and cancer⁵,⁶.

But if these datasets are to be of use to all biologists, not just GWAS experts, then new, user-friendly software is urgently needed. Happily, software designers are on the case, coming up with new ways of making GWAS data easy to explore and share among researchers, and designing analysis packages that deal with the increasing computational demands posed by these datasets.

Julie Williams of the Cardiff University School of Medicine in Cardiff, UK sees a clear need for new software. Last year, Williams was awarded £1.3 million (about $2.6 million) from the Wellcome Trust to lead one of the biggest GWAS studies of Alzheimer's disease to date. "One of the major challenges for GWAS software designers is keeping up with the changing demands of the research field," notes Williams. These demands can be statistical, such as incorporating SNP imputation, or bioinformatic, as with the integration of large-scale gene-expression data.

Williams' view is shared among GWAS researchers. Hakon Hakonarson of the Children's Hospital of Philadelphia in Pennsylvania, USA, who recently headed up a GWAS that identified a new type-1 diabetes gene, thinks there is a need to improve software and computational power on all fronts to gain more speed when exploring these larger datasets. Some of these analyses can take days to run, he notes.

Visualization tools worth a thousand words

There is little doubt that the currently available GWAS software tools present a barrier to the novice user. The associations that GWASs look for can be quite subtle, with several genes potentially acting together to exert an effect. This requires advanced statistical methods to tease associations out of datasets, without misleading researchers with false positives.

GWAS experts have traditionally used their own in-house scripts written in command-line environments, such as the R statistical package, a computational and graphics environment developed at Bell Laboratories in Murray Hill, New Jersey, USA. Although statisticians and computer programmers are familiar with these powerful toolkits, to most biologists they are like a foreign language. But researchers like Fredrik Pettersson from the Wellcome Trust in Cambridge, UK, who developed a GWAS program called Goldsurfer2, are creating a new generation of visually based, interactive GWAS programs. "With complex datasets, it is useful to use our superior cognitive abilities by interpreting pictures instead of text," says Pettersson. He notes that these genomic datasets, with their references to physical positions and the ability to link to annotations, are well suited to interactive environments. All sorts of filters, colors, shapes and sizes can be used to represent slices of the data.

The size of GWAS datasets means the visualization software must also be capable of doing two things at once: provide the user a broad overview of the data in a way that reveals trends as well as have the ability to focus on areas of interest once identified. Stephen Miller, European Director of Business Development for Progeny Genetics based in South Bend, Indiana, USA, says, "Visualization of targeted genes is really important, especially in family-based studies. For example, once you get to the gene level you might want to see patterns of haplotypes and how that reflects the disease."

Certainly with gene chips capable of scanning a million SNPs and a typical GWAS study incorporating thousands to tens of thousands of individuals, standard tables and plots are of no use with datasets of this size. Goldsurfer2 overcomes this problem by arranging the data in a hierarchical file and use principal component analysis (a statistical method for reducing multidimensional datasets with minimal information

Trey Ideker is working to apply his Cytoscape pathway analysis program to GWAS data.

© 2008 Nature Publishing Group http://www.nature.com/naturemethods
from a GWAS study and use a program such as Cytoscape to query protein-interaction databases and identify pathways and subnetworks involved with the phenotype.

"The gold nuggets of genome-wide significant SNPs only explain a small fraction of heritability," says Christophe Lambert, chief executive officer of Golden Helix in Bozeman, Montana, USA, a data-analysis software company he founded 10 years ago in anticipation of the present need for user-friendly software. "We need to look at the less significant SNPs, the gold dust mixed in the dirt."

such as metabolic and signaling networks. Following a slightly different approach, the JMP Genomics Module uses clustering tools to identify pathways highly represented in datasets. Imaginative and computer-aware users could even use the current version of Ideker's Cytoscape to superimpose GWAS findings onto protein-protein interaction networks, although Ideker says future developments will incorporate this functionality into the main Cytoscape program.

But not every GWAS software developer thinks visualization is the answer to all our needs. At the end of the day, the real value of

out whether those patterns arose by chance alone. "Visualization is an important trend, but won't get you anywhere if the underlying analysis isn't right," says Lambert.

Indeed, with datasets comprising over a million SNPs, a standard P-value cut-off of 0.05 that does not account for multiple testing would show ~50,000 SNPs as significant just by sheer chance. But programs like Golden Helix's SNP and Variation Suite include a genome-wide permutation test that efficiently deals with the false discovery problem by examining the proportion of the time that the genome-wide minimum P-values generated from tests on randomly shuffled phenotypes are at least as significant as the P-values generated from the original phenotype. This provides a model-free estimate of the P-values, more closely approximating the true significance of potential findings.

"The top hits in the first round of analysis from any genome screen are mostly artifacts that result from genotyping problems or very low minor allele frequencies," points out Bill Cookson of the National Heart and Lung Institute in London, who used a GWAS to discover genes involved in asthma. "The key tools at this stage are a strong sense of cynicism and the desire to root out false positives by examination of genotyping traces and raw allele counts." And he adds that only after this process does modern software help navigate through the biological complexities that underlie association signals.

Cloud genomics

Visualizing increasingly large datasets is not the only problem being addressed with the new GWAS software. Programmers are also facing up to the intense computational demands placed by the sheer size of these datasets. When interrogating a million SNPs per sample, a 10,000-sample study will be simply too big to load into the memory of a normal desktop computer. Goldsurfer2 solves this problem by using a swap file, and can comfortably load datasets containing 5,000 markers and 5,000 samples. But things can get even trickier when it comes to the analysis. Access times can become intolerable, and advanced statistical procedures that involve pair-wise comparisons can become prohibitively slow. However, by applying a battery of the latest computing techniques, some software packages can cope with these problems surprisingly well.

"Golden Helix has successfully performed both whole-genome SNP and CNV association studies on over 11,000 samples with over 500,000 markers apiece," claims Lambert. Starting from terabytes of raw data, final processed matrices of 40 GB or more can be efficiently analyzed with desktop computers. This includes principal component analysis for the correction of batch effects and of population stratification.

But the challenges are only going to get tougher. Up to now, most GWASs have explored simple "yes or no" phenotypes: does the patient have diabetes or not? Is the patient overweight? But there is increasing interest in applying GWAS to study more graded phenotypes, such as blood pressure or insulin levels. These expression quantitative trait loci will inevitably increase the size of datasets by an order of magnitude.

Another reason GWAS datasets are going to get bigger is the increasing interest in CNVs. "[CNVs] may account for more variation than SNPs," argues Lambert. CNVs are regions of repeated sequence in the genome that differ from one individual to the next. The latest SNP chips, including the SNP 6.0 chip from Affymetrix in Santa Clara, California, USA and the Infinium HD Bead Chips from Illumina of San

Helix Tree, from Golden Helix, uses color to visualize linkage disequilibrium.
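As a rough illustration of the multiple-testing arithmetic and the genome-wide permutation test described earlier, the following Python sketch may help. It is not Golden Helix's implementation: the function names, the use of a simple trend statistic (n times the squared correlation between allele counts and phenotype), and the "+1" smoothing of the permutation proportion are all assumptions made for this example. Comparing the genome-wide maximum statistic is used here as a stand-in for comparing the genome-wide minimum P-value, which is equivalent when every SNP's P-value is the same decreasing function of its statistic.

```python
import math
import random


def expected_false_positives(n_tests, alpha):
    """Expected number of null SNPs passing an uncorrected P-value cut-off."""
    return n_tests * alpha


def trend_stat(genotypes, phenotype):
    """Trend statistic n * r^2 between allele counts (0/1/2) and a phenotype."""
    n = len(phenotype)
    mean_g = sum(genotypes) / n
    mean_p = sum(phenotype) / n
    s_gg = sum((g - mean_g) ** 2 for g in genotypes)
    s_pp = sum((p - mean_p) ** 2 for p in phenotype)
    if s_gg == 0 or s_pp == 0:
        return 0.0  # monomorphic SNP or constant phenotype carries no signal
    s_gp = sum((g - mean_g) * (p - mean_p) for g, p in zip(genotypes, phenotype))
    return n * s_gp * s_gp / (s_gg * s_pp)


def chi2_1df_pvalue(stat):
    """Upper-tail P-value of a chi-square statistic with one degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


def genome_wide_permutation(snps, phenotype, n_perm=200, seed=0):
    """Return (observed statistics, permutation-adjusted P-values) per SNP.

    For each shuffle of the phenotype, only the genome-wide maximum statistic
    is kept. A SNP's adjusted P-value is the share of shuffles whose maximum
    is at least as large as that SNP's observed statistic.
    """
    rng = random.Random(seed)
    observed = [trend_stat(g, phenotype) for g in snps]
    shuffled = list(phenotype)
    null_max = []
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any genotype-phenotype link
        null_max.append(max(trend_stat(g, shuffled) for g in snps))
    adjusted = [
        (1 + sum(m >= obs for m in null_max)) / (n_perm + 1)  # assumed smoothing
        for obs in observed
    ]
    return observed, adjusted


if __name__ == "__main__":
    # Uncorrected cut-off of 0.05 across a million SNPs, as in the text.
    print(expected_false_positives(1_000_000, 0.05))
    # Toy data (hypothetical): one SNP tracks the phenotype, 19 are random.
    rng = random.Random(1)
    phenotype = [0] * 20 + [1] * 20
    snps = [[0] * 20 + [2] * 20]
    snps += [[rng.randint(0, 2) for _ in range(40)] for _ in range(19)]
    observed, adjusted = genome_wide_permutation(snps, phenotype)
    for stat, adj in list(zip(observed, adjusted))[:3]:
        print(round(chi2_1df_pvalue(stat), 6), round(adj, 4))
```

On real data with hundreds of thousands of markers, a pure-Python loop like this would be far too slow; the sketch is only meant to show the logic of shuffling phenotypes and comparing against the genome-wide extreme.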
developed in collaboration with the Broad Institute in Cambridge, Massachusetts, USA. The Partek Genomics Suite is also suitable for exploring CNV data, allowing copy-number and loss-of-heterozygosity data to be explored in the same region and visualized with the original mapping data.

Goldsurfer2 presents users with linked tables and plots to facilitate interaction.

In the end, the increasing dataset size from GWAS is likely to outstrip the capacity of the standard laboratory computer, forcing some developers to look to the internet for solutions. Microsoft is planning a web-based, end-to-end virtual GWAS environment for researchers. Offered in conjunction with the British Library, the Research Information Centre (which Tolle likens to a Facebook for geneticists) can simplify the process of information search, facilitate discovery, effectively manage research objects, and enable versioning and archiving. "The time is right for this," says Tolle, "all the pieces are there; it is just a matter of putting them all together."

The collaboration environment resides within a hosted Microsoft Office SharePoint Server 2007 platform, which is accessible from a web browser, and allows researchers to collaborate and share data all the way from the beginning of a GWAS study to the final publication. Users will be able to select and run workflows online, which will run Microsoft's own in-house analysis routines behind the scenes. This service is currently in beta testing.

User-friendly, graphical programs for analyzing GWASs are becoming a reality, and imaginative visualization approaches are making it ever easier for users to explore GWAS datasets. But it remains to be seen whether the exponential increase in computational power demanded by ever-growing datasets will ultimately drive analysis off the desktop and onto the web.

1. Grupe, A. et al. Hum. Mol. Genet. 16, 865–873 (2008).
2. Sladek, R. et al. Nature 445, 881–885 (2007).
3. Hakonarson, H. et al. Nature 448, 591–594 (2007).
4. Stefansson, H. et al. Nature advance online publication 30 July 2008 (doi:10.1038/nature07229).
5. Thorgeirsson, T.E. et al. Nature 452, 638–642 (2008).
6. Hung, R.J. et al. Nature 452, 633–637 (2008).

Steven David Buckingham is a researcher in the Functional Genetics Unit at the University of Oxford, UK (steven.buckingham@dpag.ox.ac.uk).