GENETICS
Privacy
www.sciencemag.org
SCIENCE
VOL 305
9 JULY 2004
183
06/28/04 6:41 AM
One might argue that the publication of genotypic and phenotypic information without a
key that establishes the connection between an individual and that information is not
fundamentally dangerous. For example, one could publish
fingerprints along with clinical phenotypes on the web, but these would be of limited use
without knowledge of whose fingerprints they are. This argument is unacceptable for
genomic data in a rapidly changing environment where 1) genotyping is becoming very
accessible and inexpensive, 2) the ability to collect and share personal information is
increasing, and 3) the ability to use genomic data to infer individual characteristics is
largely unexplored and potentially powerful. Thus, we have focused our analysis on
the problem of disclosing genomic information that is sufficient to be uniquely associated
with an individual, even without a key.
To evaluate the chance of a match using SNPs, we can construct a model such that the
diploid SNP data from autosomes are arranged into a matrix $X$ where row $i = 1, \ldots, n$ is the
index of the subject, and column $j = 1, \ldots, M$ is the index of the SNP. A SNP $j$ has $k_j$
possible values for $X_{ij}$. These values have probabilities (genotype
frequencies) $\pi_{j1}, \ldots, \pi_{jk_j}$. Most SNPs have three dominant values of $X_{ij}$ corresponding to
genotypes aa, Aa, and AA. These have Hardy-Weinberg probabilities^a
$\pi_{j1} = (1 - p_j)^2$, $\pi_{j2} = 2 p_j (1 - p_j)$, and $\pi_{j3} = p_j^2$
respectively, where $p_j$ is the frequency of allele A at SNP $j$.
The SNP data come from a population, such as a county, state, or country, of $N$
people. Suppose that a small set of SNPs for a person from this population becomes
available, and is found to match $X_{i1}, \ldots, X_{iM}$ for some subject $i$ between 1 and $n$. Such a
coincidence provides overwhelming evidence, as described below, that the person is in
fact the individual $i$ from the database. As a result, the genotypic,
phenotypic, and other data linked to the subject $i$ become known to anybody with access
to both the database and the SNPs for the person in question.
Let the number of SNPs in the human genome with minor-allele frequencies
greater than 10% be M, which is thought to be close to 5,000,000 (1). For two random
people to match all SNPs is quite unlikely, because M is so large. The chance of two
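unrelated people matching at a single SNP $j$ is
$$\theta_j = \sum_{l=1}^{k_j} \pi_{jl}^2 ,$$
where $\pi_{jl}$ is the frequency of the $l$-th genotype at SNP $j$.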
In the Hardy-Weinberg setup, if $0.1 \le p_j \le 0.9$, that is, each allele of SNP $j$ occurs in at least 10%
of the studied population, then the single-SNP match probability satisfies
$0.375 \le \theta_j \le 0.689$. The lower bound for $\theta_j$ corresponds
to $p_j = 0.5$. Similarly, if $0.01 \le p_j \le 0.99$, that is, each allele of SNP $j$ occurs in at least 1% of the
studied population, then $0.375 \le \theta_j \le 0.961$.

a. Hardy-Weinberg equilibrium is a statistical assumption in population genetics that gene frequencies and
genotype ratios in a randomly breeding population remain constant from generation to generation.
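The bounds quoted for the single-SNP match probability can be checked numerically. The following R sketch assumes Hardy-Weinberg genotype frequencies; the helper name hw_match_prob is illustrative and does not appear in the original analysis.

```r
# Single-SNP match probability under Hardy-Weinberg equilibrium.
# hw_match_prob is a hypothetical helper name used only for this illustration.
hw_match_prob = function(p){
  geno = c((1 - p)^2, 2 * p * (1 - p), p^2)  # frequencies of genotypes aa, Aa, AA
  sum(geno^2)                                # chance two unrelated people share a genotype
}
round(hw_match_prob(0.5), 3)   # lower bound, attained at p = 0.5
round(hw_match_prob(0.1), 3)   # upper bound when 0.1 <= p <= 0.9
round(hw_match_prob(0.01), 3)  # upper bound when 0.01 <= p <= 0.99
```

Because the match probability is symmetric in p and 1 - p, evaluating it at the lower end of each allele-frequency range is enough to obtain the upper bound.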
The chance that two random unrelated people match on a group of SNPs is harder
to assess because the SNPs are not all independent. For a list of $M' < M$ independent
SNPs with single-SNP match probabilities $\theta_j \le 0.689$, the probability of a match is
$\prod_{j=1}^{M'} \theta_j \le 0.689^{M'}$. It does not take a
large value of $M'$ to make this probability very small. Consider a
conservative prior model in which research subjects are uniformly sampled from a population
of $N$ people. The probability that a person is subject $i$, given that they share a set of $M'$
SNPs, is
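$$
P(\text{same} \mid \text{match})
= \frac{P(\text{match} \mid \text{same})\, P(\text{same})}
       {P(\text{match} \mid \text{same})\, P(\text{same}) + P(\text{match} \mid \text{not same})\, P(\text{not same})}
= \frac{1 \cdot \frac{1}{N}}
       {1 \cdot \frac{1}{N} + \prod_{j=1}^{M'} \theta_j \left(1 - \frac{1}{N}\right)}
= \frac{1}{1 + (N - 1) \prod_{j=1}^{M'} \theta_j},
$$
where $\theta_j$ is the probability that two unrelated people match at SNP $j$.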
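To see how quickly this posterior probability approaches 1, it can be evaluated for a few values of M'. The R sketch below is illustrative only: the function name posterior_same, the population size of 10^10, the match probability of 0.689 per SNP, and the chosen values of M' are all assumptions made for this example.

```r
# Posterior probability that a matching person is subject i, under a
# uniform prior over N people. posterior_same is a hypothetical name.
posterior_same = function(theta, N){
  # theta: vector of single-SNP match probabilities for M' independent SNPs
  1 / (1 + (N - 1) * prod(theta))
}
# Assumed inputs: a world-sized population and a match probability of 0.689 per SNP.
N = 10^10
for (Mprime in c(20, 40, 80)) {
  cat(Mprime, posterior_same(rep(0.689, Mprime), N), "\n")
}
```

With no SNPs at all the function returns the prior 1/N, which is a quick sanity check on the formula.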
We have mentioned the need for our SNP positions to be independent, since correlated
positions provide less independent information for distinguishing people. It is fair, therefore,
to consider whether the human genome contains large numbers of independent SNPs.
Current projects will answer this question definitively. On Chromosome 21, one of the
shortest chromosomes, a recent analysis shows that 24,047 SNPs can be summarized
effectively using 4,563 SNPs (2), giving us a sense that the number of independent SNPs
on each of the 23 chromosomes will likely number in the low thousands. Thus the
genome is likely to have far more than 80 or even 8,000 independent SNPs, plenty to
allow identification of individuals not only from their entire DNA sequence but from
relatively short fragments. As various genetic databases are beginning to accumulate
very large stretches of DNA (3, 4), they are likely to have DNA sequences that are
essentially genetic fingerprints. Matching these sequences to other data resources, using
published methods, could destroy any guarantees of confidentiality (5).
IV. A model to understand why small numbers of accurate SNPs will spoil most data
obfuscation strategies.
Suppose that $M' \ll M$ independent SNPs exist among all of the $M$ SNPs, where $M$ is close to
5,000,000. We retain the essential feature that the SNPs act as a large number of
independent random variables, even though that large number may be a small fraction of
5,000,000. To simplify further, we suppose that the probability of matching at any single
SNP is 0.5, and that each SNP has only two equally probable versions, which we label 0
and 1. Even if $M'$ is only as small as 100, a match provides strong evidence. The
probability of two unrelated individuals matching is $2^{-100} \approx 0.79 \times 10^{-30}$. Had we used a
probability of matching at a single SNP as high as 0.9, a comparably strong result
still requires only about $M' = 658$ independent SNPs.
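Both figures in this paragraph follow from short calculations, sketched here in R. The variable names p100 and Mprime are illustrative.

```r
# Probability that two unrelated people match at all of M' = 100 SNPs
# when each SNP matches with probability 0.5.
p100 = 0.5^100
p100
# Smallest M' for which 0.9^M' is at least as small as 0.5^100,
# obtained by solving M' * log(0.9) <= 100 * log(0.5).
Mprime = ceiling(100 * log(0.5) / log(0.9))
Mprime
```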
We first consider an obfuscation strategy in which each SNP is either left unchanged or
changed to the opposite value with probability 0.1. Suppose that these changes are made
independently. Then the actual individual matches his or her data record in a random
number of SNPs having a Binomial($M'$, 0.9) distribution, where the probability of a
match at a single SNP is 0.9, while a stranger matches in a number of SNPs with a
Binomial($M'$, 0.5) distribution, where the probability of a match at a single SNP is
0.5.
These distributions are very well separated. The chance that a stranger matches
74 or more of a subject's SNPs is about $0.8 \times 10^{-6}$ (with the binomial probability
$\sum_{j=74}^{100} \binom{100}{j} 0.5^{100}$, where the number of independent SNPs $M' = 100$). The chance
that the actual individual matches 73 or fewer SNPs is about $0.1 \times 10^{-5}$ (with the
binomial probability $\sum_{j=0}^{73} \binom{100}{j} 0.9^{j}\, 0.1^{100-j}$, where $M' = 100$).
In this example, a
An alternative obfuscation strategy that selects and changes exactly 10% of the
full set of $M$ SNPs at random requires a more complicated analysis. But the result will be
essentially the same because the changes to $M' \ll M$ independent SNPs will be very
nearly independent.
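The two binomial tail probabilities for the independent-flip strategy with M' = 100 can be recomputed with R's pbinom. This is a sketch for checking the arithmetic; the names false_ident and false_miss are chosen here to echo the FalseIdent and FalseMiss columns of the appendix, and the results should agree with the M' = 100 row of its first table.

```r
Mprime = 100
# Chance that a stranger (match probability 0.5 per SNP) matches 74 or more SNPs.
false_ident = pbinom(73, Mprime, 0.5, lower.tail = FALSE)
# Chance that the true individual (match probability 0.9 per SNP) matches 73 or fewer.
false_miss = pbinom(73, Mprime, 0.9)
c(false_ident, false_miss)
```

Using lower.tail = FALSE rather than 1 - pbinom(...) avoids losing precision when the upper tail is far below machine epsilon, which matters for larger M'.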
VI. Evaluating the utility of SNP binning as a strategy for protecting subject privacy.
Suppose that there are 5,000,000 SNPs. We can lump the SNPs into 50,000 bins of 100
SNPs each, so one bin contains at most 100 independent SNPs. Then the probability of a
match is the product of 50,000 conditional probabilities. Although these conditional
probabilities are hard to assess, we can expect their product to follow an exponential
decay and become very small.
Suppose that we only release a set of 100 independent SNPs, group the data into
10 bins of 10 SNPs, and report the number of 1's in each bin. That number has the
Binomial(10, 0.5) distribution. A stranger will match such a Binomial count with the
probability $\sum_{j=0}^{10} \binom{10}{j}^2 2^{-20} = 0.176$. The chance that a stranger matches all 10 bins is
$0.176^{10} \approx 2.9 \times 10^{-8}$.
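The per-bin match probability for the binned release can be verified directly. The R sketch below assumes, as in the text, that the stranger's and the subject's counts in each bin are independent Binomial(10, 0.5) variables and that the 10 bins are independent of one another; the names bin_match and all_bins are illustrative.

```r
# Probability that a stranger's count of 1's in a bin of 10 SNPs equals the
# subject's count: the sum over j of P(count = j)^2.
bin_match = sum(dbinom(0:10, 10, 0.5)^2)
bin_match
# Chance of matching in all 10 bins, assuming the bins are independent.
all_bins = bin_match^10
all_bins
```

The sum of squared binomial coefficients equals choose(20, 10), so bin_match can also be written as choose(20, 10) / 2^20.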
Alternatively, we can lump SNPs into a smaller number of large bins, for
example, 100 bins of 50,000 SNPs, a strategy producing data useless for research. Then
the probability of a match becomes a product of 100 probabilities. Because of the large
number of SNPs in each bin, we can anticipate that these probabilities are also small.
References:
1.
2.
3.
4.
5.
Figures:
Appendixes:
1. Numerical examples of thresholds that minimize the larger of the false positive
identification and false negative identification probabilities (FalseIdent and
FalseMiss). For example, with M' = 100, an individual is identified from >= 74 of 100
matches on SNPs, but not from <= 73 matches.

M'    p-stranger  p-individual  Threshold  FalseIdent    FalseMiss
50    0.5         0.9           36.5       4.68x10^-4    2.85x10^-4
100   0.5         0.9           73.5       8.34x10^-7    1.22x10^-6
150   0.5         0.9           109.5      4.78x10^-9    1.70x10^-9
200   0.5         0.9           146.5      1.03x10^-11   8.56x10^-12
250   0.5         0.9           183.5      2.28x10^-14   4.40x10^-14
300   0.5         0.9           219.5      1.11x10^-16   6.87x10^-17

For the world case we could minimize the larger of 10^10 x FalseIdent and FalseMiss
(the FalseIdent column shows the scaled value):

M'    p-stranger  p-individual  Threshold  FalseIdent    FalseMiss
50    0.5         0.9           46.5       1.8x10^-1     7.50x10^-1
100   0.5         0.9           83.5       1.30x10^-2    2.06x10^-2
150   0.5         0.9           120.5      7.33x10^-5    1.79x10^-4
200   0.5         0.9           156.5      1.11x10^-6    4.65x10^-7
250   0.5         0.9           189.5      0.00          4.23x10^-11
300   0.5         0.9           220.5      0.00          2.29x10^-16

For the US case we could minimize the larger of 3x10^8 x FalseIdent and FalseMiss
(the FalseIdent column shows the scaled value):

M'    p-stranger  p-individual  Threshold  FalseIdent    FalseMiss
50    0.5         0.9           45.5       6.69x10^-2    5.69x10^-1
100   0.5         0.9           81.5       9.22x10^-3    4.58x10^-3
150   0.5         0.9           118.5      3.52x10^-5    3.02x10^-5
200   0.5         0.9           155.5      9.99x10^-8    1.76x10^-7
250   0.5         0.9           189.5      0.00          4.23x10^-11
300   0.5         0.9           220.5      0.00          2.29x10^-16
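cutpt = function(M, p0=0.5, p1=0.9, ratio=1){
  # NOTE: only the final line of cutpt survives in the source. The argument
  # list and the four assignments below are an assumed reconstruction,
  # chosen to be consistent with the tables above.
  k = 1:M                            # candidate cutoffs: identify when matches > k
  falsepos = 1 - pbinom(k, M, p0)    # stranger matches more than k SNPs
  falseneg = pbinom(k, M, p1)        # true individual matches k or fewer SNPs
  # pick the cutoff that minimizes the larger of the two (scaled) error rates
  best = which.min(pmax(ratio*falsepos, falseneg))
  c(M,p0,p1,best+.5,ratio*falsepos[best],falseneg[best])
}
cuttable = function(ratio=1){
  # generate the result table
  ans = rbind(
    cutpt(50,ratio=ratio),
    cutpt(100,ratio=ratio),
    cutpt(150,ratio=ratio),
    cutpt(200,ratio=ratio),
    cutpt(250,ratio=ratio),
    cutpt(300,ratio=ratio))
  colnames(ans) =
    c("M","p-stranger","p-individual","Threshold","FalseIdent","FalseMiss")
  ans
}
cuttable(1)        # unweighted case
cuttable(10^10)    # world case
cuttable(3*10^8)   # US case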