POLICY FORUM

GENETICS

Genomic Research and Human Subject Privacy

Zhen Lin,1 Art B. Owen,2 Russ B. Altman1*

Interest in understanding how genetic variations influence heritable diseases and the response to medical treatments is intense. The academic community relies on the availability of public databases for the distribution of the DNA sequences and their variations. However, like other types of medical information, human genomic data are private, intimate, and sensitive. Genomic data have raised special concerns about discrimination, stigmatization, or loss of insurance or employment for individuals and their relatives (1, 2). Public dissemination of these data poses nonintuitive privacy challenges.

Unrelated persons differ in about 0.1% of the 3.2 billion bases in their genomes (3). Now, the most widely used forms of forensic identification rely on only 13 to 15 locations on the genome with variable repeats (4, 5). Single nucleotide polymorphisms (SNPs) contain information that can be used to identify individuals (5, 6). If someone has access to individual genetic data and performs matches to public SNP data, a small set of SNPs could lead to successful matching and identification of the individual. In such a case, the rest of the genotypic, phenotypic, and other information linked to that individual in public records would also become available.

The world population is roughly $10^{10}$. Specifying DNA sequence at only 30 to 80 statistically independent SNP positions will uniquely define a single person (7). Furthermore, if some of those positions have SNPs that are relatively rare, the number that need to be tested is much smaller. If information about kinship exists, a few positions will confirm it. Thus, the transition from private to identifiable is very rapid (see the figure).

[Figure: Trade-offs between SNPs and privacy. Vertical axis: privacy (high, medium, low). Horizontal axis: independent SNPs (ticks at 5, 75, 100, 125, 1000, 2000, 3000, 4000). Labeled regions: "Insufficient for privacy protection", "Needed to find genetic relationships", "Insufficient for future genomic research".]

Tension between the desire to protect privacy and the need to ensure access to scientific data has led to a search for new technologies. However, the hurdles may be greater than had been suspected. For example, one approach to protecting privacy is to limit the amount of high-quality data released and randomly to change a small percentage of SNPs for each subject in the database (8). Suppose that 10% of SNPs are randomly changed in a sequence of DNA, a fairly major obfuscation that would not please many genetics researchers. Our estimates (7) show that measuring as few as 75 statistically independent SNPs would define a small group that contained the real owner of the DNA. Disclosure control methods such as data suppression, data swapping, and adding noise would be unacceptable by similar arguments.

A second approach is to group SNPs into bins. Disregarding exact genomic locations of SNPs increases the number of records that share the same values, thus increasing confidentiality. Our calculations (7) show that such strategies do not protect privacy, because the pattern of binned values is unlikely to match anyone other than the owner of the DNA. Data analysis would be greatly complicated by binning, and the information content would be severely reduced or even eliminated.

Until technological innovations appear, solutions in policy and regulations must be found. We are building the Pharmacogenetics and Pharmacogenomics Knowledge Base (8, 9), which contains individual genotype data and associated phenotype information. No genetic data will be provided unless a user can demonstrate that he or she is associated with a bona fide academic, industrial, or governmental research unit and agrees to our usage policies (including audit of data access) (10). Although this does not prevent data abuse, it provides a way to monitor usage.

Social concerns about privacy are intricately connected to beliefs about benefits of research and trustworthiness of researchers and governmental agencies. In the United States, the Health Insurance Portability and Accountability Act of 1996 (HIPAA) and the associated Privacy Rules of 2003 (11) generally forbid sharing identifiable data without patient consent. However, they do not specifically address use or disclosure policies for human genetic data. Recent debates in Iceland, Estonia, Britain, and elsewhere (12-15) reveal a range of views on the threats posed by genetic information. The United States may be at one end of this spectrum, as its citizens seem to strongly desire health privacy. Whatever the setting, we recommend explicit clarifications to rules and legislation (such as HIPAA), so that they explicitly protect genetic privacy and set strong penalties for violations. These clarifications should define entities authorized to use and exchange human genetic data and for what purposes.

References and Notes

1. M. R. Anderlik, M. A. Rothstein, Annu. Rev. Genomics Hum. Genet. 2, 401 (2001).
2. P. Sankar, Annu. Rev. Med. 54, 393 (2003).
3. W. H. Li, L. A. Sadler, Genetics 129, 513 (1991).
4. L. Carey, L. Mitnik, Electrophoresis 23, 1386 (2002).
5. H. D. Cash et al., Pac. Symp. Biocomput. 2003, 638 (2003).
6. National Commission on the Future of DNA Evidence, The Future of Forensic DNA Testing: Predictions of the Research and Development Working Group (National Institute of Justice, U.S. Department of Justice, Washington, DC, 2000).
7. See supporting online material for further discussion.
8. L. C. R. J. Willenborg, T. D. Waal, Elements of Statistical Disclosure Control (Springer, New York, 2001).
9. T. E. Klein et al., Pharmacogenomics J. 1, 167 (2001).
10. www.pharmgkb.org/home/policies/index.jsp
11. Fed. Regist. 67, 53181 (2002).
12. R. Chadwick, BMJ 319, 441 (1999).
13. L. Frank, Science 290, 31 (2000).
14. M. A. Austin et al., Genet. Med. 5, 451 (2003).
15. V. Barbour, Lancet 361, 1734 (2003).
16. Supported in part by NIH/NLM Biomedical Informatics Training Grant LM007033 (Z.L.), NSF Grant DMS-0306612 (A.B.O.), and the NIH/NIGMS Pharmacogenetics Research Network and Database U01GM61374 (R.B.A). We thank J. T. Chang, B. T. Naughton, T. E. Klein, and reviewers.

Supporting Online Material
www.sciencemag.org/cgi/content/full/305/5681/183/DC1

1 Department of Genetics, Stanford University School of Medicine, CA 94305-5120, USA.
2 Department of Statistics, Stanford University, CA 94035-4065, USA.
*To whom correspondence should be addressed. E-mail: russ.altman@stanford.edu

www.sciencemag.org  SCIENCE  VOL 305  9 JULY 2004  183
Genomic research and human subject privacy (Supplement)

Zhen Lin,1 Art B. Owen,2 Russ B. Altman1*

1 Department of Genetics, Stanford University School of Medicine, CA 94305-5120.

2 Department of Statistics, Stanford University, CA 94035-4065.

*To whom correspondence should be addressed. E-mail: russ.altman@stanford.edu.

06/28/04 6:41 AM

I. On the identifiability of individuals based purely on DNA sequence information.

One might argue that the publication of genotypic and phenotypic information without a key that establishes the connection between an individual and that information is not fundamentally dangerous. By analogy, one could publish a set of fingerprints along with clinical phenotypes on the web, but these would be of limited use without knowledge of whose fingerprints they are. This argument is unacceptable for genomic data in a rapidly changing environment where 1) genotyping is becoming very accessible and inexpensive, 2) the ability to collect and share personal information is increasing, and 3) the ability to use genomic data to infer individual characteristics is largely unexplored and potentially powerful. Thus, we have focused our analysis on the problem of disclosing genomic information that is sufficient to be uniquely associated with an individual, even without a key.

II. Evaluating the chance of a random match between two individuals

To evaluate the chance of a match using SNPs, we can construct a model such that the diploid SNP data from autosomes are arranged into a matrix $X$ where row $i = 1, \ldots, n$ is the index of the subject, and column $j = 1, \ldots, M$ is the index of the SNP. A SNP $j$ has $k_j$ possible values for $X_{ij}$. These values have probabilities (genotype frequencies) $\pi_{j1}, \ldots, \pi_{jk_j}$. Most SNPs have three dominant values of $X_{ij}$ corresponding to genotypes aa, Aa, and AA. These have Hardy-Weinberg probabilities [a] $\pi_{j1} = (1 - p_j)^2$, $\pi_{j2} = 2 p_j (1 - p_j)$, and $\pi_{j3} = p_j^2$ respectively, where $p_j \in (0, 1)$ is the probability (allele frequency) of allele A. Then $k_j = 3$, or 4 if we add another category for rare exceptions.

[a] Hardy-Weinberg equilibrium is a statistical assumption for population genetics that gene frequencies and genotype ratios in a randomly-breeding population remain constant from generation to generation.

The SNP data come from a population, such as a county, state, or country, of $N$ people. Suppose that a small set of SNPs for a person from this population become available, and are found to match $X_{i1}, \ldots, X_{iM}$ for some subject $i$ between 1 and $n$. Such a coincidence provides overwhelming evidence, as described below, that the person is in fact the individual $i$ from the database. As a consequence, the rest of the genotypic, phenotypic, and other data linked to the subject $i$ become known to anybody with access to both the database and the SNPs for the person in question.

Let the number of SNPs in the human genome with minor-allele frequencies greater than 10% be $M$, which is thought to be close to 5,000,000 (1). For two random people to match all SNPs is quite unlikely, because $M$ is so large. The chance of two unrelated people matching at a single SNP $j$ is

$$\theta_j = \sum_{l=1}^{k_j} \pi_{jl}^2 .$$

In the Hardy-Weinberg setup, if $0.1 \le p_j \le 0.9$, where the SNP $j$ occurs in at least 10% of the studied population, then $0.375 \le \theta_j \le 0.689$. The lower bound for $\theta_j$ corresponds to $p_j = 0.5$. Similarly, if $0.01 \le p_j \le 0.99$, where the SNP $j$ occurs in at least 1% of the studied population, then $0.375 \le \theta_j \le 0.961$.
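The bounds quoted above can be checked numerically. The following R lines are a minimal sketch, assuming exactly three genotypes in Hardy-Weinberg proportions; the helper name `hw_match_prob` is illustrative and does not appear in the original supplement.

```r
# Chance that two unrelated people share the same genotype at one SNP,
# assuming Hardy-Weinberg proportions for genotypes aa, Aa, and AA.
# hw_match_prob is a hypothetical helper name, not part of the supplement.
hw_match_prob <- function(p) {
  # One column of genotype frequencies per allele frequency in p
  geno <- rbind((1 - p)^2, 2 * p * (1 - p), p^2)
  colSums(geno^2)  # sum of squared genotype frequencies
}

p <- c(0.01, 0.1, 0.5, 0.9, 0.99)
round(hw_match_prob(p), 3)
# 0.961 0.689 0.375 0.689 0.961
```

The minimum at an allele frequency of 0.5 and the symmetry around it correspond to the lower and upper bounds stated in the text.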

The chance that two random unrelated people match on a group of SNPs is harder to assess because the SNPs are not all independent. For a list of $M' < M$ independent SNPs with $\theta_j \le 0.689$, the probability of a match is $\prod_{j=1}^{M'} \theta_j \le 0.689^{M'}$. It does not take a large value of $M'$ to make this probability very small.

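We can subsequently evaluate the posterior probability of a match, and calculate the required $M'$, via Bayes' theorem. Suppose that the nefarious person assumes a conservative prior model that research subjects are uniformly sampled from a population of $N$ people. The probability that a person is subject $i$, given that they share a set of $M'$ SNPs, is

$$P(\text{same} \mid \text{match}) = \frac{P(\text{match} \mid \text{same})\, P(\text{same})}{P(\text{match} \mid \text{same})\, P(\text{same}) + P(\text{match} \mid \lnot\text{same})\, P(\lnot\text{same})} = \frac{1 \cdot \frac{1}{N}}{1 \cdot \frac{1}{N} + \left(\prod_{j=1}^{M'} \theta_j\right)\left(1 - \frac{1}{N}\right)} = \frac{1}{1 + (N - 1) \prod_{j=1}^{M'} \theta_j} .$$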
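To see how quickly this posterior probability approaches 1 as SNPs are added, one can evaluate it directly. The R lines below are a sketch under the same uniform-prior assumption; `posterior_same` is a hypothetical helper name, and the choice of a common per-SNP match probability of 0.689 (the upper bound for common SNPs given earlier) is only for illustration.

```r
# Posterior probability that a person whose SNPs match is database subject i,
# under a uniform prior over a population of N people.
# posterior_same is a hypothetical helper name; theta holds the per-SNP
# chance-match probabilities of the independent SNPs that were compared.
posterior_same <- function(theta, N) {
  1 / (1 + (N - 1) * prod(theta))
}

N <- 1e10  # rough world population used in the article
for (m in c(20, 40, 80)) {
  cat(m, "SNPs, each matching by chance with probability 0.689:",
      posterior_same(rep(0.689, m), N), "\n")
}
```

With no SNPs compared the function returns the prior 1/N, and each additional independent SNP multiplies the odds in favor of identity by the reciprocal of its chance-match probability.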

III. The number of independent SNPs contained in the genome.

We have mentioned the need for our SNP positions to be independent, since correlated positions contribute less independent power to distinguish people. It is fair, therefore, to consider whether the human genome contains large numbers of independent SNPs. Current projects will answer this question definitively. On Chromosome 21, one of the shortest chromosomes, a recent analysis shows that 24,047 SNPs can be summarized effectively using 4,563 SNPs (2), giving us a sense that the number of independent SNPs on each of the 23 chromosomes will likely be in the low thousands. Thus the genome is likely to have far more than 80 or even 8,000 independent SNPs, plenty to allow identification of individuals not only from their entire DNA sequence but from relatively short fragments. As various genetic databases are beginning to accumulate very large stretches of DNA (3, 4), they are likely to have DNA sequences that are essentially genetic fingerprints. Matching these sequences to other data resources, using published methods, could destroy any guarantees of confidentiality (5).

IV. A model to understand why small numbers of accurate SNPs will spoil most data
obfuscation strategies.

A simplified coin-tossing model illustrates the problem. Suppose that only $M' \ll M$ independent SNPs exist among all of the $M$ SNPs, where $M$ is close to 5,000,000. We retain the essential feature that the SNPs act as a large number of independent random variables, even though that large number may be a small fraction of 5,000,000. To further simplify, we suppose that the probability of matching at any single SNP is 0.5, and that each SNP has only two equally probable versions, which we label 0 and 1. Even if $M'$ is only as small as 100, a match provides strong evidence. The probability of two unrelated individuals matching is $2^{-100} \approx 0.79 \times 10^{-30}$. Had we used a probability of matching at a single SNP $j$ as high as $\theta_j = 0.9$, a comparably strong result still requires only about $M' = 658$ independent SNPs.
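The figure of 658 follows from asking how many independent SNPs are needed before the joint chance-match probability drops to the coin-tossing level. A short R sketch of that arithmetic is given here; `snps_needed` is a hypothetical helper name, and the small tolerance is an implementation detail added to guard against floating-point rounding when the ratio is a whole number.

```r
# Smallest number of independent SNPs whose joint chance-match probability
# is at most `target`, when each SNP matches by chance with probability theta.
# snps_needed is a hypothetical helper name, not from the supplement.
snps_needed <- function(theta, target = 2^-100) {
  # Subtract a tiny tolerance so an exact integer ratio is not rounded up
  ceiling(log(target) / log(theta) - 1e-9)
}

2^-100            # about 7.9e-31, the coin-tossing figure for M' = 100
snps_needed(0.5)  # 100
snps_needed(0.9)  # 658
```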

V. Evaluating the chances of a match given a randomization of 10% of SNP data

We first consider an obfuscation strategy in which each SNP is either left unchanged, or changed to the opposite value with probability 0.1. Each SNP is treated independently. Then the actual individual matches his or her data record in a random number of SNPs having a Binomial($M'$, 0.9) distribution, where the probability of a match at a single SNP is $\theta_j = 0.9$, while a stranger matches in a number of SNPs with a Binomial($M'$, 0.5) distribution, where the probability of a match at a single SNP is $\theta_j = 0.5$.

These distributions are very well separated. With $M' = 100$ independent SNPs, the chance that a stranger matches 74 or more of a subject's SNPs is about $0.8 \times 10^{-6}$, from the binomial probability

$$\sum_{j=74}^{100} \binom{100}{j}\, 0.5^{100} .$$

The chance that the actual individual matches 73 or fewer SNPs is about $0.1 \times 10^{-5}$, from the binomial probability

$$\sum_{j=0}^{73} \binom{100}{j}\, 0.9^{j}\, 0.1^{100-j} .$$

In this example, a threshold of 74 virtually guarantees that the identity of the individual will be compromised, while producing only about 0.00008% false positives. Higher thresholds reduce the false positive rate while still leaving the individual very identifiable (Figure 1; Appendix 1-2).
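The two tail probabilities in this example can be evaluated with R's binomial distribution functions. The lines below are a minimal sketch for the case of 100 independent SNPs and a cutoff of 74 matches; the variable names are illustrative only.

```r
# Tail probabilities for the 10% random-change example with 100 SNPs.
# The stranger's match count is Binomial(100, 0.5); the true owner's is
# Binomial(100, 0.9). Variable names here are illustrative only.
M <- 100
threshold <- 74  # declare an identification at 74 or more matching SNPs

# P(stranger matches 74 or more): upper tail beyond threshold - 1
false_ident <- pbinom(threshold - 1, M, 0.5, lower.tail = FALSE)
# P(owner matches 73 or fewer): lower tail up to threshold - 1
false_miss <- pbinom(threshold - 1, M, 0.9)

signif(c(FalseIdent = false_ident, FalseMiss = false_miss), 3)
```

These two values should agree with the row for 100 SNPs in the first table of Appendix 1.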

An alternative obfuscation strategy that selects and changes exactly 10% of the full set of $M$ SNPs at random requires a more complicated analysis. But the result will be essentially the same, because the changes to $M' \ll M$ independent SNPs will be very nearly independent.

VI. Evaluating the utility of SNP binning as a strategy for protecting subject privacy.

Suppose that there are 5,000,000 SNPs. We can lump the SNPs into 50,000 bins of 100 SNPs each, so one bin contains at most 100 independent SNPs. Then the probability of a match is the product of 50,000 conditional probabilities. While those conditional probabilities are hard to assess, we can expect their product to follow an exponential decay and become very small.

Suppose that we only release a set of 100 independent SNPs, group the data into 10 bins of 10 SNPs, and report the number of 1's in each bin. That number has the Binomial(10, 0.5) distribution. A stranger will match such a binomial count with the probability

$$\theta = \sum_{j=0}^{10} \binom{10}{j}^{2} 2^{-20} = 0.176 .$$

The chance that a stranger matches all 10 bins is $\theta^{10} = 0.3 \times 10^{-7}$. Hence, this approach is also insufficient to maintain confidentiality.
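The binned calculation above can be reproduced in a few lines of R. This is a sketch under the same coin-tossing assumptions (independent SNPs, each value 0 or 1 with probability 0.5); `bin_match_prob` is a hypothetical helper name.

```r
# Chance that a stranger reproduces the reported count of 1's in one bin,
# assuming independent fair-coin SNPs.
# bin_match_prob is a hypothetical helper name, not from the supplement.
bin_match_prob <- function(bin_size) {
  counts <- dbinom(0:bin_size, bin_size, 0.5)  # distribution of the count in a bin
  sum(counts^2)                                # two independent counts agree
}

theta_bin <- bin_match_prob(10)
theta_bin     # about 0.176, equal to choose(20, 10) / 2^20
theta_bin^10  # about 0.3e-7: the stranger matches all 10 bins
```

Calling `bin_match_prob` with other bin sizes shows how slowly the per-bin match probability grows as more information is discarded.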

Alternatively, we can lump SNPs into a smaller number of large bins, for
example, 100 bins of 50,000 SNPs, a strategy producing data useless for research. Then
the probability of a match becomes a product of 100 probabilities. Because of the large
number of SNPs in each bin, we can anticipate that these probabilities are also small.

References:

1. L. Kruglyak, D. A. Nickerson, Nat Genet 27, 234-6 (Mar, 2001).
2. N. Patil et al., Science 294, 1719-23 (Nov 23, 2001).
3. J. C. Murray et al., Science 265, 2049-54 (Sep 30, 1994).
4. Nature 426, 789-96 (Dec 18, 2003).
5. B. Malin, L. Sweeney, Proc AMIA Symp, 537-41 (2000).

Figures:

Figure 1: Plot of false positive identification vs. false negative identification.

Appendixes:

1. Numerical examples of thresholds that minimize the larger of the false positive identification and false negative identification probabilities, denoted FalseIdent and FalseMiss. For example, an individual is identified from >= 74 of 100 matches on SNPs, but not from <= 73 matches.

M'    p-stranger  p-individual  Threshold  FalseIdent      FalseMiss
50    0.5         0.9           36.5       4.68 x 10^-4    2.85 x 10^-4
100   0.5         0.9           73.5       8.34 x 10^-7    1.22 x 10^-6
150   0.5         0.9           109.5      4.78 x 10^-9    1.70 x 10^-9
200   0.5         0.9           146.5      1.03 x 10^-11   8.56 x 10^-12
250   0.5         0.9           183.5      2.28 x 10^-14   4.40 x 10^-14
300   0.5         0.9           219.5      1.11 x 10^-16   6.87 x 10^-17

For the world case we could minimize the larger of 10^10 x FalseIdent and FalseMiss. In this table and the next, the FalseIdent column holds the scaled value, as returned by the R function cutpt in Appendix 2.

M'    p-stranger  p-individual  Threshold  FalseIdent      FalseMiss
50    0.5         0.9           46.5       1.8 x 10^-1     7.50 x 10^-1
100   0.5         0.9           83.5       1.30 x 10^-2    2.06 x 10^-2
150   0.5         0.9           120.5      7.33 x 10^-5    1.79 x 10^-4
200   0.5         0.9           156.5      1.11 x 10^-6    4.65 x 10^-7
250   0.5         0.9           189.5      0.00            4.23 x 10^-11
300   0.5         0.9           220.5      0.00            2.29 x 10^-16

For the US case we could minimize the larger of 3 x 10^8 x FalseIdent and FalseMiss.

M'    p-stranger  p-individual  Threshold  FalseIdent      FalseMiss
50    0.5         0.9           45.5       6.69 x 10^-2    5.69 x 10^-1
100   0.5         0.9           81.5       9.22 x 10^-3    4.58 x 10^-3
150   0.5         0.9           118.5      3.52 x 10^-5    3.02 x 10^-5
200   0.5         0.9           155.5      9.99 x 10^-8    1.76 x 10^-7
250   0.5         0.9           189.5      0.00            4.23 x 10^-11
300   0.5         0.9           220.5      0.00            2.29 x 10^-16

2. R functions for numerical examples.


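cutpt <- function(M = 100, p0 = .5, p1 = .9, ratio = 1) {
  # Choose the cutoff m that minimizes the larger of the scaled false
  # identification probability and the false miss probability.
  # Rule: group is 0 (stranger) if X <= m; group is 1 (individual) if X > m
  m <- 1:(M - 1)
  # Note: 1 - pbinom() cannot resolve tail probabilities much below 1e-16;
  # pbinom(m, M, p0, lower.tail = FALSE) is the more precise form.
  falsepos <- 1 - pbinom(m, M, p0)  # P(stranger matches more than m SNPs)
  falseneg <- pbinom(m, M, p1)      # P(individual matches m or fewer SNPs)
  best <- which.min(pmax(ratio * falsepos, falseneg))
  # Because m starts at 1, the index best is also the chosen cutoff m
  c(M, p0, p1, best + .5, ratio * falsepos[best], falseneg[best])
}

cuttable <- function(ratio = 1) {
  # Generate the result table for M = 50, 100, ..., 300
  ans <- rbind(
    cutpt(50, ratio = ratio),
    cutpt(100, ratio = ratio),
    cutpt(150, ratio = ratio),
    cutpt(200, ratio = ratio),
    cutpt(250, ratio = ratio),
    cutpt(300, ratio = ratio))
  colnames(ans) <-
    c("M", "p-stranger", "p-individual", "Threshold", "FalseIdent", "FalseMiss")
  ans
}

cuttable(1)         # equal weight on both error types
cuttable(10^10)     # world case
cuttable(3 * 10^8)  # US case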
