Académique Documents
Professionnel Documents
Culture Documents
IEWSIS
Haplotype
The goal of population association studies is to identify will obtain a clearer picture of the statistical issues and
A combination of alleles at patterns of polymorphisms that vary systematically gain some ideas for new or modified approaches.
different loci on the same between individuals with different disease states and In this review I cover only population association
chromosome. could therefore represent the effects of risk-enhancing or studies in which unrelated individuals of different dis-
protective alleles (BOXES 1,2). That sounds easy enough: ease states are typed at a number of SNP markers. I do
Population stratification
Refers to a situation in which what could be difficult about spotting allele patterns that not address family-based association studies, admixture
the population of interest are overrepresented in cases relative to controls? mapping or linkage studies (BOX 3), which also have an
includes subgroups of One fundamental problem is that the genome is so important role in efforts to understand the effects of
individuals that are on average large that patterns that are suggestive of a causal poly- genes on disease2.
more related to each other
than to other members of the
morphism could well arise by chance. To help distinguish
wider population. causal from spurious signals, tight standards for statisti- Preliminary analyses
cal significance need to be established; another tactic is Data quality is of paramount importance, and data
Multiple-testing problem to consider only patterns of polymorphisms that could should be checked thoroughly, for example, for batch or
Refers to the problem that
plausibly have been generated by causal genetic variants, study-centre effects, or for unusual patterns of missing
arises when many null
hypotheses are tested; some given our current understanding of human genetic his- data. Testing for Hardy–Weinberg equilibrium (HWE) can
significant results are likely tory1 and evolutionary processes such as mutation and also be helpful, as can analyses to select a good subset
even if all the hypotheses recombination. Checking for systematic errors and of the available SNPs (‘tag’ SNPs) or to infer haplotypes
are false. dealing with missing values present further challenges. from genotypes.
Hardy–Weinberg
Upstream of the study itself, at the study design phase,
equilibrium several questions need to be considered, such as: How Hardy–Weinberg equilibrium. Deviations from HWE can
Holds at a locus in a population many individuals should be genotyped? At how many be due to inbreeding, population stratification or selec-
when the two alleles within an markers? And how should markers and individuals tion. They can also be a symptom of disease association3,
individual are not statistically
be chosen? the implications of which are often under-exploited4.
associated.
In this article I survey current approaches to such Apparent deviations from HWE can arise in the pres-
challenges. My goal is to give a broad-brush view of dif- ence of a common deletion polymorphism, because of
ferent statistical problems and how they relate to each a mutant PCR-primer site or because of a tendency to
other, and to suggest some solutions and sources of fur- miscall heterozygotes as homozygotes. So far, researchers
Department of Epidemiology ther information. I look first at statistical analyses that have tested for HWE primarily as a data quality check
and Public Health, Imperial precede association testing and then move on to the tests and have discarded loci that, for example, deviate from
College, St Mary’s Campus, of association, based on single SNPs, multiple SNPs and HWE among controls at significance level α = 10−3 or 10−4.
Norfolk Place, London haplotypes. I also briefly introduce adjustments to allow However, the possibility that a deviation from HWE is
W2 1PG, UK.
e-mail:
for possible population stratification (or population struc- due to a deletion polymorphism5 or a segmental duplica-
d.balding@imperial.ac.uk ture) and approaches to the problem of multiple testing. tion6 that could be important in disease causation should
doi:10.1038/nrg1916 My hope is that those handling genetic-association data now be considered before discarding loci.
Testing for deviations from HWE can be carried the χ2 approximation7–9. The open-source data-analysis
out using a Pearson goodness-of-fit test, often known software R (see online links box) has an R genetics
simply as ‘the χ 2 test’ because the test statistic has package that implements both Pearson and Fisher tests
approximately a χ2 null distribution. Be aware, how- of HWE, and PEDSTATS also implements exact tests9.
ever, that there are many different χ2 tests. The Pearson (All statistical genetics software cited in the article can
test is easy to compute, but the χ2 approximation can be found at the Genetic Analysis Software website,
be poor when there are low genotype counts, and it is which can be found in the online links box).
better to use a Fisher exact test, which does not rely on A useful tool for interpreting the results of HWE and
other tests on many SNPs is the log quantile–quantile
(QQ) P-value plot (FIG. 1): the negative logarithm of the
Box 1 | Rationale for association studies ith smallest P value is plotted against −log (i / (L + 1)),
where L is the number of SNPs. Deviations from the
Most recent common ancestor
y = x line correspond to loci that deviate from the null
Time Case hypothesis10.
Control
Missing genotype data. For single-SNP analyses, if a
few genotypes are missing there is not much problem.
For multipoint SNP analyses, missing data can be more
Ancestral
mutation problematic because many individuals might have one
or more missing genotypes. One convenient solution is
data imputation: replacing missing genotypes with pre-
dicted values that are based on the observed genotypes
at neighbouring SNPs. This sounds like cheating, but
for tightly linked markers data imputation can be reli-
able, can simplify analyses and allows better use of the
observed data. Imputation methods either seek a ‘best’
prediction of a missing genotype, such as a maximum-
likelihood estimate (single imputation), or randomly select
it from a probability distribution (multiple imputations).
Population association studies compare unrelated individuals, but ‘unrelated’ actually
The advantage of the latter approach is that repetitions
means that relationships are unknown and presumed to be distant. Therefore, we
cannot trace transmissions of phenotype over generations and must rely on
of the random selection can allow averaging of results or
correlations of current phenotype with current marker alleles. Such a correlation might investigation of the effects of the imputation on resulting
be generated by one or more groups of cases that share a relatively recent common analyses11.
ancestor at a causal locus. Recombinations that have occurred since the most recent Most software for phase assignment (see below) also
common ancestor of the group at the locus can break down associations of phenotype imputes missing alleles. There are also more general impu-
with all but the most tightly linked marker alleles, permitting fine mapping if marker tation methods: for example, ‘hot-deck’ approaches11,
density is sufficiently high (say, ≥1 marker per 10 kb, but this depends on local levels of in which the missing genotype is copied from another
linkage disequilibrium). individual whose genotype matches at neighbouring loci,
This principle is illustrated in the figure, in which for simplicity I assume haploidy, and regression models that are based on the genotypes
such as for X-linked loci in males. The coloured circles indicate observed alleles (or
of all individuals at several neighbouring loci12.
haplotypes), and the colours denote case or control status; marker information is not
shown. The alleles within the shaded oval all descend from a risk-enhancing mutant
These analyses typically rely on missingness being
allele that perhaps arose some hundreds of generations in the past (red star), and so independent of both the true genotype and the pheno-
there is an excess of cases within this group. Consequently, there is an excess of the type. This assumption is widely made, even though its
mutant allele among cases relative to controls, as well as of alleles that are tightly linked validity is often doubtful. For example, as noted above,
with it. The figure also shows a second, minor mutant allele at the same locus that might heterozygotes might be missing more often than homo-
not be detectable because it contributes to few cases. zygotes. What is worse, case samples are often collected
Although the SNP markers that are used in association studies can have up to four differently from controls, which can lead to differential
nucleotide alleles, because of their low mutation rate most are diallelic, and many rates of missingness even if genotyping is carried out
studies only include diallelic SNPs. With increasing interest in deletion polymorphisms5, blind to case–control status. The combination of these
triallelic analyses of SNP genotypes might become more common (treating deletion as a
two effects can lead to serious biases13. One simple way
third allele), but in this article I assume all SNPs to be diallelic.
Broadly speaking, association studies are sufficiently powerful only for common causal
to investigate differential missingness between cases
variants. The threshold for ‘common’ depends on sample and effect sizes as well as and controls is to code all observed genotypes as 1 and
marker frequencies90, but as a rough guide the minor-allele frequency might need to be missing genotypes as 0, and test for association of this
above 5%. Arguments for the common-disease common-variant (CDCV) hypothesis variable with case–control status.
essentially rest on the fact that human effective population sizes are small1. A related
argument is that many alleles that are now disease-predisposing might have been Haplotype and genotype data. Underlying an individ-
advantageous in the past (for example, those that favour fat storage). In addition, ual’s genotypes at multiple tightly linked SNPs are the
selection pressure is expected to be weak on late-onset diseases and on variants that two haplotypes, each containing alleles from one parent.
contribute only a small risk. Although some common variants that underlie complex I discuss below the merits of analyses that are based on
diseases have been identified91, we still do not have a clear idea of the extent to which
phased haplotype data rather than unphased genotypes,
the CDCV hypothesis holds.
and consider here only ways to obtain haplotype data.
Box 3 | Linkage and other approaches pairwise methods discard one (preferably that with most
missing values) of every pair of SNPs with, say, r2 > 0.9.
In all approaches to gene mapping, the key idea is that a disease-predisposing allele will More sophisticated methods can be more efficient32, but
pass from generation to generation together with variants at tightly linked loci. Linkage
the most efficient tagging strategy will depend on the
studies directly examine the transmission across generations of both disease phenotype
statistical analysis to be used. In practice, tagging is only
and marker alleles within a known pedigree, seeking correlations that suggest that the
marker is linked with a causal locus. In parametric linkage analysis62,95, disease and effective in capturing common variants.
marker transmission are evaluated under a specified disease model using likelihood- There are two principal uses for tagging. The first
based statistical analyses of extended pedigrees. In nonparametric linkage analysis96, is to select a ‘good’ subset of SNPs to be typed in all
excess allele sharing is sought in affected relatives, which avoids the need to posit a the study individuals from an extensive SNP set that
disease model. has been typed in just a few individuals. Until recently,
An important advantage of linkage methods is that information is combined across this was frequently a laborious step in study design,
families such that evidence for a causal role of a locus can accumulate even if different but the International HapMap Project33 and related
variants segregate at that locus in different families. Therefore, linkage analysis is projects now allow selection of tag SNPs on the basis of
appropriate when many rare variants at a locus each contribute to disease risk. However,
publicly available data. The population that underlies a
linkage approaches can require many and/or large families to achieve satisfactory power
particular study will typically differ from the popula-
and resolution.
There are various strategies for combining linkage with association analyses for family- tions for which public data are available, and a set of
based data sets. The best-known of the family-based association methods is the tag SNPs that have been selected in one population
transmission disequilibrium test (TDT) 97, which implements a matched-pair study design might perform poorly in another. However, recent
by comparing alleles that are transmitted to an affected child with the untransmitted studies indicate that tag SNPs often transfer well across
parental alleles. More general and more powerful family-based association tests are populations34,35.
available98,99. A secondary use for tagging is to select for analysis
Admixture mapping100,101 has some similarities with nonparametric linkage. It can use a subset of SNPs that have already been typed in all the
case-only samples from a population formed by recent admixture of two or more study individuals. Although it is undesirable to discard
populations with very different disease prevalences. An excess sharing among cases of
available information, the amount of information lost
an allele that is more common in the high-risk ancestral population could be a signal that
might be small, and reducing the SNP set in this way
the allele contributes to disease risk.
can simplify analyses and lead to more statistical power
by reducing the degrees of freedom (df) of a test29.
across alleles, and HWE holds, r2 between a marker and Tests of association: single SNP
a causal SNP gives the sample size that would have been I now come to testing for association, first dealing with
required to detect the disease association by directly typ- single-SNP analyses. I will discuss case–control, quan-
ing the causal SNP, relative to the sample size required titative (continuous) and categorical disease outcomes,
to achieve the same power when typing the marker. starting with the simplest tests and moving on to more
Both D′ and r2 are two-locus measures; however, with advanced regression-based tests36, and also the score
Regression models dense markers it is of interest to summarize LD over a procedure.
A class of statistical models
region. One approach is to compute local averages of
that relate an outcome variable
to one or more explanatory pairwise values of D′ and r2. Alternatively, values over a Case–control phenotype. Perhaps the most natural
variables. The goal might be region can be illustrated diagrammatically with colours analysis of SNP genotypes and case–control status at
to predict further values of encoding different values 18,19. LD maps 20,21 provide a single SNP is to test the null hypothesis of no asso-
the outcome variable given the another solution: these fit an exponential decay func- ciation between rows and columns of the 2 × 3 matrix
explanatory variables, or to
identify a minimal set of
tion to D′ values, and the decay parameter provides a that contains the counts of the three genotypes (the two
explanatory variables with measure of local LD. The resulting LD unit is usually homozygotes and the heterozygote) among cases and
good predictive power. strongly correlated with underlying recombination controls. Users have a choice between, among others, a
rate, but also reflects the history of the mutations that Pearson test (2 df) or a Fisher exact test. Again, the latter
Prospective study design
generated the SNPs. is preferred: it is computationally more demanding but
Studies in which individuals
are followed forward in time Fine-scale estimates of recombination rate might is implemented in R and other software.
and disease events are provide the most satisfactory solution to the problem of For complex traits, it is widely thought that contribu-
recorded as they arise. DNA summarizing LD in a region because recombination is tions to disease risk from individual SNPs will often be
and biomarker samples, and the most important biological phenomenon underlying roughly additive — that is, the heterozygote risk will be
data on environmental
exposures and lifestyle factors,
LD. PHASE provides estimates22, and other available soft- intermediate between the two homozygote risks. The
are usually obtained at the ware includes LDHAT23 and HOTSPOTTER24. Analyses general tests (Pearson 2 df and Fisher) have reason-
start of the study. that are based on such software, and empirical studies25,26, able power regardless of the underlying risks, but if the
have shown that recombination rates are highly variable genotype risks are additive they will not be as powerful
Retrospective study design
on fine scales. This is consistent with the observation that as tests that are tailored to this scenario. One way to
Studies in which individuals are
identified for inclusion in the much of the human genome is ‘block like’27,28, with little improve power to detect additive risks is to count alleles
study on the basis of their or no recombination within blocks but block boundaries rather than genotypes so that each individual contrib-
disease state. Data on previous that are often hotspots of intense recombination. utes twice to a 2 × 2 table and a Pearson 1-df test can be
environmental exposures and applied. However, this procedure is not recommended37
lifestyle factors are then
recorded, and samples for
SNP tagging. ‘Tagging’ refers to methods to select a because it requires an assumption of HWE in cases and
DNA and biomarker studies minimal number of SNPs that retain as much as possible controls combined and does not lead to interpretable
might be obtained. of the genetic variation of the full SNP set29–31. Simple risk estimates.
Phenotypic score
relatedness of individuals in a study so that a complete
3.5
solution to the problem of population stratification —
which has in the past been the cause of much concern
3.0
— is probably not far away.
2.5
Tests of association: multiple SNPs
Given L SNPs genotyped in cases and controls at a
candidate gene that is subject to little recombination, or 2.0
perhaps an LD block within a gene, we might want to 0 1 2
decide whether or not the gene is associated with the SNP genotype
disease and/or, given that there is association, find the Figure 3 | Linear regression test of single-SNP
SNP(s) that are closest to the causal polymorphism(s). associations with continuous outcomes. Values of a
Analysing SNPs one at a time can neglect information quantitative phenotype for three SNP genotypes, together
in their joint distribution. This is of little consequence in with least-squares regression line. Note that here the line
the two extreme cases: when SNPs are widely spaced so gives a predicted trait value for the rare homozygote (2)
as to have little or no LD between them or when almost that exceeds the observed values, suggesting some
deviation away from the assumption of linearity. Analysis of
all SNPs are typed so that any causal variant is likely
variance (ANOVA) does not require linearity of the trait
to be typed in the study. In practice, most studies have means, at the cost of one more degree of freedom. Both
SNP densities between these two extremes, in which case tests also require the trait variance to be the same for each
multipoint association analyses have substantial advan- genotype: the graph is suggestive of decreasing variance
tages over single-SNP analyses57. I first outline regression with increasing genotype score, but there is not enough
analyses of unphased SNP genotypes and then move on data to confirm this, and a mild deviation from this
to haplotype-based analyses. assumption is unlikely to have an important adverse effect
on the validity of the test.
SNP-based logistic regression. Logistic regression analyses
for L SNPs are a natural extension of the single-SNP anal-
yses that are discussed above: there is now a coefficient correlated. A popular strategy, suggested by the block-
(β0, β1 or β2) for each SNP, leading to a general test with like structure of the human genome, is to use haplotypes
2L df. By constraining the coefficients, tests with L df can to try to capture the correlation structure of SNPs in
be obtained. For example, a test for additive effects at each regions of little recombination. This approach can lead to
SNP is obtained by requiring that each β1 = (β0 + β2) / 2. analyses with fewer degrees of freedom, but this benefit
The corresponding score test, also with L df, is a generali- is minimized when SNPs are ascertained through a tag-
zation of the Armitage test, and is related to the Hotelling ging strategy. Perhaps more importantly, haplotypes can
T2 statistic56. Another test, with L+1 df, uses only 1 df to capture the combined effects of tightly linked cis-acting
capture gene-wide dominance effects29. causal variants61.
Covariates such as sex, age or environmental expo- An immediate problem is that haplotypes are not
Stepwise selection
procedure
sures are readily included. Similarly, interactions between observed; instead, they must be inferred and it can be
Describes a class of statistical SNPs can be included. This conveys little benefit, and can hard to account for the uncertainty that arises in phase
procedures that identify from reduce power to detect an association, if there is a single inference when assessing the overall significance of any
a large set of variables (such underlying causal variant and little or no recombination finding. However, when LD between markers is high, the
as SNPs) a subset that
between SNPs58, but it is potentially useful for investigating level of uncertainty is usually low.
provides a good fit to a chosen
statistical model (for example, epistatic effects. Given haplotype assignments, the simplest analysis
a regression model that If the number of SNPs is large, tagging to eliminate involves testing for independence of rows and columns
predicts case–control status) near-redundant SNPs often increases power despite in a 2 × k contingency table, where k denotes the number
by successively including or some loss of information. Alternatively, the problem of distinct haplotypes62. Alternative approaches can be
discarding terms from the
model.
of too many highly correlated SNPs in the model can based on the estimated haplotype proportions among
be addressed using a stepwise selection procedure59 or cases and controls, without an explicit haplotype assign-
Shrinkage methods Bayesian shrinkage methods60. However, problems can ment for individuals63: the test compares the product of
In this approach a prior arise in assessing the significance of any chosen model. separate multinomial likelihoods for cases and controls
distribution for regression
Essentially the same issues arise for a continuous with that obtained by combining cases and controls.
coefficients is concentrated at
zero, so that in the absence phenotype; the same sets of coefficients are appropriate One problem with both these approaches is reliance on
of a strong signal of but they are equated to the expected phenotype value assumptions of HWE and of near-additive disease risk.
association, the corresponding rather than the logit of disease risk. A different approach, which leads to a test with fewer
regression coefficient is degrees of freedom, is to look for an excess sharing of
‘shrunk’ to zero. This mitigates
the effects of too many
Haplotype-based methods. The multi-SNP analyses haplotypes among cases relative to controls64. More
variables (degrees of freedom) discussed above can suffer from problems that are sophisticated haplotype-based analyses treat haplotypes
in the statistical model. associated with many predictors, some of which are highly as categorical variables in regression analyses65 or
Multiple testing gives a practical prescription: if n SNPs are tested and the
Multiple testing is a thorny issue, the bane of statistical tests are approximately independent, the appropriate per-
genetics. The problem is not really the number of tests SNP significance level α′ should satisfy α = 1 − (1 − α′)n,
that are carried out: even if a researcher only tests one which leads to the Bonferroni correction α′ ≈ α / n. For
SNP for one phenotype, if many other researchers do example, to achieve α = 5% over 1 million independent
the same and the nominally significant associations are tests means that we must set α′ = 5 × 10–8. However, the
reported, there will be a problem of false positives. The effective number of independent tests in a genome-wide
genome is large and includes many polymorphic variants analysis depends on many factors, including sample size
and many possible disease models. Therefore, any given and the test that is carried out.
variant (or set of variants) is highly unlikely, a priori, to For tightly linked SNPs, the Bonferroni correction is
be causally associated with any given phenotype under conservative. A practical alternative is to approximate
the assumed model, so strong evidence is required to the type-1 error rate using a permutation procedure.
overcome the appropriate scepticism about an asso- Here, the genotype data are retained but the phenotype
ciation. Although this Bayesian language provides a labels are randomized over individuals to generate a data
convenient description of the problem, the Bayesian set that has the observed LD structure but that satisfies
remedy is complex because every possible disease model the null hypothesis of no association with phenotype. By
must be assigned a prior probability. The approach is analysing many such data sets, the false-positive rate can
appealing from the perspective of researchers who are be approximated. The method is conceptually simple but
engaged in the disinterested pursuit of knowledge, can be computationally demanding, particularly as it is
but less satisfactory in prescribing exacting standards specific to a particular data set and the whole procedure
for researchers who might be tempted to cut corners has to be repeated if the data set is altered.
or exaggerate the prior plausibility of a model that is Although the 5% global error rate is widely used
supported a posteriori. in science, it is inappropriately conservative for large-
The frequentist paradigm of controlling the overall scale SNP-association studies: most researchers would
type-1 error rate sets a significance level α (often 5%), accept a higher risk of a false positive in return for
Frequentist
and all the tests that the investigator plans to conduct greater power. The 5% value can of course be relaxed,
A name for the school of
statistical thought in which should together generate no more than probability α of but another approach is instead to monitor the false-
support for a hypothesis or a false positive. In complex study designs, which involve, discovery rate (FDR)86,87, which is the proportion of false
parameter value is assessed for example, multiple stages and interim analyses, this positive test results among all positives. Under the null
using the probability of the can be difficult to implement, in part because it was the hypothesis, P values should be uniformly distributed
observed data (or more
‘extreme’ datasets) given the
analysis that was planned by the investigator that mat- between 0 and 1; FDR methods typically consider the
hypothesis or value. Usually ters, not only the analyses that were actually conducted. actual distribution as a mixture of outcomes under the
contrasted with Bayesian. However, in simple settings the frequentist approach null (uniform distribution of P values) and alternative
(P-value distribution skewed towards zero) hypotheses. a European Bioinformatics Institute but there is no
Assumptions about the alternative hypothesis might be equivalent institute for statistical genetics, the practitio-
required for the most powerful methods, but the simplest ners of which tend to work in relatively small groups that
procedures avoid explicit assumptions. See BOX 5 for a are scattered across institutions. Nevertheless, organized
discussion of issues that are relevant to genome-wide collaborations across institutions can achieve much, as
study designs. shown by the achievements of the HapMap Analysis
The usual frequentist approach to multiple test- Group33.
ing has a serious drawback in that researchers might Progress is being made, but there is still much to be
be discouraged from carrying out additional analyses done. Ultimately, complex diseases will require complex
beyond single-SNP tests, even though these might reveal analyses in which many variants are assessed simulta-
interesting associations, because all their analyses would neously for their individual or joint contributions to
then suffer a multiple-testing penalty. It is a matter of disease risk. Fear of multiple-testing penalties should not
common sense that expensive and hard-won data should deter researchers from making thorough analyses, but
be investigated exhaustively for possible patterns of asso- they need to deal honestly with the problem of chance
ciation. Although the frequentist paradigm is convenient associations. Bayesian regression and variable selection
in simple settings, strict adherence to it can be detri- procedures are beginning to be developed for genome-
mental to science. Under the Bayesian approach, there wide microarray and genetic analyses88,89, and they hold
is no penalty for analysing data exhaustively because the out promise for large-scale, simultaneous analyses of
prior probability of an association should not be affected many SNPs in association studies.
by what tests the investigator chooses to carry out. To finish on an optimistic note, fear of the effects
of population stratification should soon be banished:
Conclusion with genome-wide data a near-complete solution to the
Of the vast public investment in genetic association problem should be achievable, focusing directly on relat-
studies in recent years, relatively little has been focused edness and not unreliable proxies such as geographical
on efficient analyses of the data. There is, for example, location or ethnic affiliation.
1. Jobling, M. A., Hurles, M. E. & Tyler-Smith, C. 15. Stephens, M., Smith, N. J. & Donnelly, P. A new 30. Stram, D. O. Tag SNP selection for association studies.
Human Evolutionary Genetics: Origins Peoples & statistical method for haplotype reconstruction from Genet. Epidem. 27, 365–374 (2004).
Disease (Garland Science, New York, 2004). population data. Am. J. Hum. Genet. 68, 978–989 31. Carlson, C. S. et al. Selecting a maximally informative
2. Thomas, D. C. Statistical Methods in Genetic (2001). set of single-nucleotide polymorphisms for association
Epidemiology (Oxford Univ. Press, 2004). 16. Scheet, P. & Stephens, M. A fast and flexible statistical analyses using linkage disequilibrium. Am. J. Hum.
The best general reference for statistical methods model for large-scale population genotype data: Genet. 74, 106–120 (2004).
in genetic epidemiology; for population applications to inferring missing genotypes and 32. Zeggini, E. et al. An evaluation of HapMap sample
association studies it discusses important general haplotypic phase. Am. J. Hum. Genet. 78, 629–644 size and tagging SNP performance in large-scale
issues without specific details on tests and other (2006). empirical and simulated data sets. Nature Genet. 37,
analyses. 17. Devlin, B. & Risch, N. A comparison of linkage 1320–1322 (2005).
3. Nielsen, D. M., Ehm, M. G. & Weir, B. S. Detecting disequilibrium measures for fine-scale mapping. 33. The International HapMap Consortium. A haplotype
marker–disease association by testing for Genomics 29, 311–322 (1995). map of the human genome. Nature 437, 1299–1320
Hardy–Weinberg disequilibrium at a marker locus. 18. Abecasis, G. R. & Cookson, W. O. C. GOLD — (2005).
Am. J. Hum. Genet. 63, 1531–1540 (1998). graphical overview of linkage disequilibrium. 34. Huang, W. et al. Linkage disequilibrium sharing
4. Wittke-Thompson, J. K., Pluzhnikov, A. & Cox, N. J. Bioinformatics 16, 182–183 (2000). and haplotype-tagged SNP portability between
Rational inferences about departures from 19. Barrett, J. C., Fry, B., Maller, J. & Daly, M. J. populations. Proc. Natl Acad. Sci. USA 103,
Hardy–Weinberg equilibrium. Am. J. Hum. Genet. Haploview: analysis and visualization of LD and 1418–1421 (2006).
76, 967–986 (2005). haplotype maps. Bioinformatics 21, 263–265 (2005). 35. Gonzalez-Neira, A. et al. The portability of tagSNPs
5. Conrad, D. F., Andrews, T. D., Carter, N. P., Hurles, M. E. 20. Maniatis, N. et al. The first linkage disequilibrium (LD) across populations: a worldwide survey. Genome Res.
& Pritchard, J. K. A high-resolution survey of deletion maps: delineation of hot and cold blocks by diplotype 16, 323–330 (2006).
polymorphism in the human genome. Nature Genet. analysis. Proc. Natl Acad. Sci. USA 99, 2228–2233 36. McCullagh, P. & Nelder, J. A. Generalized Linear
38, 75–81 (2006). (2002). Models 2nd edn (Chapman and Hall, London,
6. Bailey, J. A. & Eichler, E. E. Primate segmental 21. Tapper, W. et al. A map of the human genome in 1989).
duplications: crucibles of evolution, diversity linkage disequilibrium units. Proc. Natl Acad. Sci. USA Still the best general reference on generalized
and disease. Nature Rev. Genet. 7, 552–564 102, 11835–11839 (2005). linear models (includes linear, multinomial and
(2006). 22. Crawford, D. C. et al. Evidence for substantial fine- logistic regression as special cases); it is relatively
7. Guo, S. W. & Thompson, E. A. Performing the exact scale variation in recombination rates across the advanced and more gentle introductions are
test of Hardy–Weinberg proportion for multiple human genome. Nature Genet. 36, 700–706 (2004). available.
alleles. Biometrics 48, 361–372 (1992). 23. McVean, G. A. et al. The fine-scale structure of 37. Sasieni, P. D. From genotypes to genes: doubling
8. Maiste, P. J. & Weir, B. S. A comparison of tests for recombination rate variation in the human genome. the sample size. Biometrics 53, 1253–1261
independence in the FBI RFLP databases. Genetica Science 23, 581–584 (2004). (1997).
96, 125–138 (1995). 24. Li, N. & Stephens, M. Modelling LD and identifying A useful reference for comparison of different
9. Wigginton, J. E., Cutler, D. J. & Abecasis, G. R. recombination hotspots from SNP data. Genetics single-SNP tests of association.
A note on exact tests of Hardy–Weinberg equilibrium. 165, 2213–2233 (2003). 38. Armitage, P. Tests for linear trends in proportions and
Am. J. Hum. Genet. 76, 887–893 (2005). 25. Jeffreys. A. J., Kauppi, L. & Neumann, R. Intensely frequencies. Biometrics 11, 375–386 (1955).
10. Weir, B. S., Hill, W. G. & Cardon, L. R. Allelic punctate meiotic recombination in the class II region 39. Freidlin, B., Zheng, G., Li, Z. H. & Gastwirth, J. L.
association patterns for a dense SNP map. of the major histocompatability complex. Nature Trend tests for case–control studies of genetic
Genet. Epidemiol. 27, 442–450 (2004). Genet. 29, 217–222 (2001). markers: power, sample size and robustness.
11. Little, R. J. A. & Rubin, D. B. Statistical Analysis with 26. Jeffreys, A. J. & May, C. A. Intense and highly localized Hum. Hered. 53, 146–152 (2002).
Missing Data (Wiley, New York, 2002). gene conversion activity in human meiotic crossover 40. Lunn, D. J., Whittaker, J. C. & Best, N. A Bayesian
12. Souverein, O. W., Zwinderman, A. H. & Tanck, M. W. T. hot spots. Nature Genet. 36, 151–156 (2004). toolkit for genetic association studies. Genet.
Multiple imputation of missing genotype data for 27. Ardlie, K. G., Krugylak, L. & Sielstad, M. Patterns of Epidemiol. 30, 231–247 (2006).
unrelated individuals. Ann. Hum. Genet. 70, 372–381 linkage disequilibrium in the human genome. Nature 41. Prentice, R. L. & Pyke, R. Logistic disease incidence
(2006). Rev. Genet. 3, 299–309 (2002). models and case–control studies. Biometrika 66,
13. Clayton, D. G. et al. Population structure differential 28. Gabriel, S. B. et al. The structure of haplotype blocks in 403–411 (1979).
bias and genomic control in a large-scale case–control the human genome. Science 296, 2225–2229 (2002). 42. Seaman, S. R. & Richardson, S. Equivalence of
association study. Nature Genet. 37, 1243–1246 29. Chapman, J. M., Cooper, J. D., Todd, J. A. & prospective and retrospective models in the Bayesian
(2005). Clayton, D. G. Detecting disease associations due analysis of case–control studies. Biometrika 91,
14 Marchini, J. et al. A comparison of phasing algorithms to linkage disequilibrium using haplotype tags: a class 15–25 (2004).
for trios and unrelated individuals. Am. J. Hum. Genet. of tests and the determinants of statistical power. 43. Cox, D. R. & Hinkley, D. V. Theoretical statistics
78, 437–450 (2006). Hum. Hered. 56, 18–31 (2003). (Chapman and Hall, London, 1974).
44. Wallace, C., Chapman J. M. & Clayton, D. G. 69. Molitor, J., Marjoram, P. & Thomas, D. C. Fine-scale 94. Leschzinger, G. et al. Clinical factors and ABCB1
Improved power offered by a score test for linkage mapping of disease genes with multiple mutations via polymorphisms in prediction of antiepileptic drug
disequilibrium mapping of quantitative-trait loci by spatial clustering techniques. Am. J. Hum. Genet. 73, response: a prospective cohort study. Lancet Neurol.
selective genotyping. Am. J. Hum. Genet. 78, 1368–1384 (2003). 5, 668–676 (2006).
498–504 (2006). 70. Seltman, H., Roeder, K. & Devlin, B. Evolutionary- 95. Thompson, E. in Handbook of Statistical Genetics 2nd
45. Agresti, A. Categorical Data Analysis 2nd edn based association analysis using haplotype data. edn (eds Balding D. J., Bishop, M. & Cannings, C.)
(Wiley, New York, 2002). Genet. Epidemiol. 25, 48–58 (2003). 893–918 (Wiley, New York, 2003).
46. Devlin, B. & Roeder, K. Genomic control for association 71. Durrant, C. et al. Linkage disequilibrium mapping 96. Holmans, P. in Handbook of Statistical Genetics 2nd
studies. Biometrics 55, 997–1004 (1999). via cladistic analysis of single-nucleotide edn (eds Balding D. J., Bishop, M. & Cannings, C.)
47. Devlin, B. & Roeder, K. Genomic control a new polymorphism haplotypes. Am. J. Hum. Genet. 919–938 (Wiley, New York, 2003).
approach to genetic-based association studies. 75, 35–43 (2004). 97. Ewens, W. J. & Spielman, R. S. in Handbook of
Theor. Pop. Biol. 60, 155–166 (2001). 72. Morris, A. P. Direct analysis of unphased SNP Statistical Genetics 2nd edn (eds Balding D. J.,
48. Zheng, G., Freidlin, B. & Gastwirth. J. L. Robust genomic genotype data in population-based association studies Bishop, M. & Cannings, C.) 961–972 (Wiley, New
control. Am. J. Hum. Genet. 78, 350–356 (2006). via Bayesian partition modelling of haplotypes. Genet. York, 2003).
49. Marchini, J., Cardon, L. R., Phillips, M. S. & Epidemiol. 29, 91–107 (2005). 98. Abecasis, G. R., Cardon, L. R. & Cookson, W. O. C.
Donnelly, P. The effects of human population structure 73. Beckmann, L., Thomas, D. C., Fischer, C. & A general test of association for quantitative traits in
on large genetic association studies. Nature Genet. Chang-Claude J. Haplotype sharing analysis nuclear families. Am. J. Hum. Genet. 66, 279–292
36, 512–517 (2004). using Mantel statistics. Hum. Hered. 59, 67–78 (2000).
50. Setakis, E., Stirnadel, H. & Balding D. J. Logistic (2005). 99. Van Steen, K. et al. Genomic screening and replication
regression protects against population structure in 74. Templeton, A. R. et al. Tree scanning: a method for using the same data set in family-based association
genetic association studies. Genome Res. 16, using haplotype trees in phenotype/genotype testing. Nature Genet. 37, 683–691 (2005).
290–296 (2006). association studies. Genetics 169, 441–453 100. Smith, M. W. & O’Brien, S. J. Mapping by
51. Pritchard, J. K., Stephens, M., Rosenberg, N. A. & (2005). admixture linkage disequilibrium: advances,
Donnelly, P. Association mapping in structured 75. Tzeng, J. Y., Wang, C. H., Kao, J. T. & Hsiao, C. K. limitations and guidelines. Nature Rev. Genet. 6,
populations. Am. J. Hum. Genet. 67, 170–181 (2000). Regression-based association analysis with clustered 623–266 (2005).
52. Satten, G., Flanders, W. D. & Yang, Q. Accounting for haplotypes through use of genotypes. Am. J. Hum. 101. Reich, D. et al. A whole-genome admixture scan finds
unmeasured population structure in case–control Genet. 78, 231–242 (2006). a candidate locus for multiple sclerosis susceptibility.
studies of genetic association using a novel latent-class 76. Zollner, S. & Pritchard, J. K. Coalescent-based Nature Genet. 37, 1113–1118 (2005).
model. Am. J. Hum. Genet. 68, 466–477 (2001). association mapping and fine mapping of complex 102. Clayton, D. in Handbook of Statistical Genetics 2nd
53. Hoggart, C. J. et al. Control of confounding of genetic trait loci. Genetics 169, 1071–1092 (2005). edn (eds Balding D. J., Bishop, M. & Cannings, C.)
associations in stratified populations. Am. J. Hum. 77. Browning, S. R. Multilocus association mapping using 939–960 (Wiley, New York, 2003).
Genet. 72, 1492–1504 (2003). variable-length Markov chains. Am. J. Hum. Genet. 103. Cardon, L. R. & Palmer, L. J. Population stratification
54. Delrieu, O. & Bowman, C. Visualizing gene 78, 903–913 (2006). and spurious allelic association. Lancet 361,
determinants of disease in drug discovery. 78. Moore, J. H. The ubiquitous nature of epistasis in 598–604 (2003).
Pharmacogenomics 7, 311–329 (2006). determining susceptibility to common human 104. Berger, M. et al. Hidden population substructures in an
55. Price, A. L. et al. Principal components analysis diseases. Hum. Hered. 56, 73–82 (2003). apparently homogeneous population bias association
corrects for stratification in genome-wide association 79. Carlborg, O. & Haley, C. S. Epistasis: too often studies. Eur. J. Hum. Genet. 14, 236–244 (2006).
studies. Nature Genet. 38, 904–909 (2006). neglected in complex trait studies? Nature Rev. Genet. 105. Wang, H. S., Thomas, D. C., Pe’er I. & Stram, D. O.
56. Yu, J. M. et al. A unified mixed-model method for 5, 618–625 (2004). Optimal two-stage genotyping designs for genome-
association mapping that accounts for multiple levels 80. Todd, J. A. Statistical false positive or true disease wide association scans. Genet. Epidemiol. 30,
of relatedness. Nature Genet. 38, 203–208 (2006). pathway? Nature Genet. 38, 731–733 (2006). 356–368 (2006).
57. Waldron, E. R. B., Whittaker J. C. & Balding D. J. 81. Lake, S. L. et al. Estimation and tests of haplotype– 106. Skol, A. D., Scott, L. J., Abecasis, G. R. & Boehnke, M.
Fine mapping of disease genes via haplotype environment interaction when linkage phase is Joint analysis is more efficient than replication-based
clustering. Genet. Epidemiol. 30, 170–179 (2006). ambiguous. Hum. Hered. 55, 56–65 (2003). analysis for two-stage genome-wide association
58. Clayton, D., Chapman, J. & Cooper, J. The use of 82. Millstein, J., Conti, D. V., Gilliland, F. D. & studies. Nature Genet. 38, 209–213 (2006).
unphased multilocus genotype data in indirect Gauderman, W. J. A testing framework for identifying 107. Verzilli, C. J., Stallard, N. & Whittaker, J. C. Bayesian
association studies. Genet. Epidemiol. 27, 415–428 susceptibility genes in the presence of epistasis. graphical models for genomewide association studies.
(2004). Am. J. Hum. Genet. 78, 15–27 (2006). Am. J. Hum. Genet. 79, 100–112 (2006).
59. Cordell, H. J. & Clayton, D. G. A unified stepwise 83. Piegorsch, W. W., Weinberg, C. R. & Taylor, J. A. 108. Dudbridge, F. & Koeleman, P. C. Efficient computation
regression approach for evaluating the relative effects Non-hierarchical logistic models and case-only of significance levels for multiple associations in large
of polymorphisms within a gene using case/control or designs for assessing susceptibility in population- studies of correlated data, including genomewide
family data: application to HLA in type 1 diabetes. based case–control studies. Stat. Med. 13, 153–162 association studies. Am. J. Hum. Genet. 75, 424–435
Am. J. Hum. Genet. 70, 124–141 (2002). (1994). (2004).
60. Wang, H. et al. Bayesian shrinkage estimation of 84. Cordell, H. J. Epistasis: what it means what it doesn’t 109. Hoh, J. & Ott, J. Mathematical multi-locus approaches
quantitative trait loci parameters. Genetics 170, mean and statistical methods to detect it in humans. to localizing complex human trait genes. Nature Rev.
465–480 (2005). Hum. Mol. Genet. 11, 2463–2468 (2002). Genet. 4, 701–709 (2003).
61. Clark, A. G. The role of haplotypes in candidate-gene 85. Marchini, J., Donnelly, P. & Cardon, L. R. Genome-
studies. Genet. Epidemiol. 27, 321–333 (2004). wide strategies for detecting multiple loci that Acknowledgements
62. Sham, P. Statistics in Human Genetics (Arnold, influence complex diseases. Nature Genet. 37, I thank W. Astle and E. Waldron for help with drawing figures,
London, 1998). 413–417 (2005). and W. Astle, L. Cardon, A. Lewin, D. Lunn, A. Morris, D.
Still a useful reference for basic linkage and 86. Storey, J. D. & Tibshirani, R. Statistical significance Schaid, J. Whittaker and D. Zabaneh for discussions and com-
association analyses, but now a little out of date. for genome-wide studies. Proc. Natl Acad. Sci. USA ments on drafts of the manuscript. The author is supported
63. Schaid, D. J. Evaluating associations of haplotypes 100, 9440–9445 (2003). in part by the UK Medical Research Council.
with traits. Genet. Epidemiol. 27, 348–364 (2004). 87. Dudbridge, F., Gusnanto, A. & Koeleman, P. C.
64. Tzeng, J. Y., Devlin, B., Wasserman, L. & Roeder, K. Detecting multiple associations in genome-wide Competing interests statement
On the identification of disease mutations by the studies. Hum. Genomics 2, 310–317 (2006). The author declares no competing financial interests.
analysis of haplotype similarity and goodness of fit. 88. Ishwaran, H. & Rao, J. S. Detecting differentially
Am. J. Hum. Genet. 72, 891–902 (2003). expressed genes in microarrays using Bayesian
65. Lin, D. Y. & Zeng, D. Likelihood-based inference on model selection. J. Am. Stat. Assoc. 98, 438–455
FURTHER INFORMATION
European Bioinformatics Institute: http://www.ebi.ac.uk
haplotype effects in genetic association studies. (2003).
Genetic Analysis Software (includes almost all freely
J. Am. Stat. Assoc. 101, 89–104 (2006). 89. Yi, N. J. et al. Bayesian model selection for genome-
available software for statistical genetic analyses, regularly
66. Schaid, D. J., Rowland, C. M., Tines, D. E., wide epistatic quantitative trait loci analysis. Genetics
Jacobson, R. M. & Poland, G. A. Score tests for 170, 1333–1344 (2005). updated): http://linkage.rockefeller.edu/soft
association between traits and haplotypes when 90. Zondervan, K. T. & Cardon, L. R. The complex Genetic Power Calculator (a useful tool that calculates the
linkage phase is ambiguous. Am. J. Hum. Genet. 70, interplay among factors that influence allelic power of many simple study designs):
425–434 (2002). association. Nature Rev. Genet. 5, 238–238 http://pngu.mgh.harvard.edu/~purcell/gpc
67. Ke, X. Y. et al. The impact of SNP density on fine-scale (2004). International HapMap Project: http://www.hapmap.org
patterns of linkage disequilibrium. Hum. Mol. Genet. 91. Hirschhorn, J. N. & Daly, M. J. Genome-wide Nature Reviews Genetics audio supplement:
13, 577–588 (2004). association studies for common diseases and complex http://www.nature.com/nrg/focus/stats/audio
68. Templeton, A. R., Boerwinkle, E. & Sing C. F. A cladistic traits. Nature Rev. Genet. 6, 95–108 (2005). R genetics package: http://rgenetics.org
analysis of phenotypic associations with haplotypes 92. Bingham, S. & Riboli, E. Diet and cancer — the Wellcome Trust Case Control Consortium (a large,
inferred from restriction endonuclease mapping. I. European prospective investigation into cancer genome-wide association study for eight distinct diseases
Basic theory and an analysis of alcohol dehydrogenase and nutrition. Nature Rev. Cancer 4, 206–215 with common set of controls): http://www.wtccc.org.uk
activity in Drosophila. Genetics 117, 343–351 (1987). (2004). SUPPLEMENTARY INFORMATION
The first in a series of papers that initiated cladistic 93. Ollier, W., Sprosen, T. & Peakman, T. UK Biobank: See online article: S1 (box)
and more general clustering approaches to from concept to reality. Pharmacogenomics 6, Access to this links box is available online.
haplotype-based tests of association. 639–646 (2005).