ARI
ANNUAL
REVIEWS
12 December 2011
11:46
Human Genome Sequencing Center, Baylor College of Medicine, Houston, Texas 77030
Keywords
whole-genome sequencing (WGS), exome sequencing, simple
nucleotide variation (SNV), structural variation, personal genomics
Abstract
Following the finished, euchromatic, haploid human reference
genome sequence, the rapid development of novel, faster, and cheaper
sequencing technologies is making possible the era of personalized human genomics. Personal diploid human genome sequences have been
generated, and each has contributed to our better understanding of variation in the human genome. We have consequently begun to appreciate the vastness of individual genetic variation from single nucleotide
to structural variants. Translation of genome-scale variation into medically useful information is, however, in its infancy. This review summarizes the initial steps undertaken in clinical implementation of personal
genome information, and describes the application of whole-genome
and exome sequencing to identify the cause of genetic diseases and to
suggest adjuvant therapies. Better analysis tools and a deeper understanding of the biology of our genome are necessary in order to decipher, interpret, and optimize clinical utility of what the variation in the
human genome can teach us. Personal genome sequencing may eventually become an instrument of common medical practice, providing
information that assists in the formulation of a differential diagnosis.
We outline herein some of the remaining challenges.
ME63CH03-Lupski
ARI
12 December 2011
11:46
Inversion: a genomic segment that differs in orientation compared with the human reference genome
Copy-number variant (CNV): a genomic segment that deviates from the expected locus-specific diploid state, either through deletions or amplifications
NGS: next-generation DNA sequencing
Gonzaga-Jauregui
Lupski
Gibbs
Individual | Ploidy | Technology | Av depth | Total SNPs [M] | Known SNPs [M] (%) | Novel SNPs [M] (%) | Heterozygous SNPs [M] (%) | Homozygous SNPs [M] (%) | cSNPs | nsSNPs | InDels | CNVs (100 bp)
Venter | 2n | Sanger | 7.5 | 3.21 | 2.80 (87.22%) | 0.41 (12.77%) | 1.76 (54.85%) | 1.45 (45.15%) | 21,152 | 6,114 | 214,691 | 6,485
Watson | 2n | Roche 454 | 7.4 | 3.32 | 2.71 (81.73%) | 0.61 (18.27%) | 1.67 (50.53%) | 1.64 (49.47%) | 22,041 | 10,659 | 222,718 | 1,674
Chinese (YH) | 2n | Illumina | 36.0 | 3.07 | 2.65 (87.13%) | 0.41 (12.87%) | 1.72 (56.03%) | 1.35 (43.97%) | 15,759 | 7,062 | 135,262 | 2,682
African (NA18507) | 2n | Illumina | 40.6 | 3.61 | 2.72 (75.50%) | 0.88 (24.50%) | 2.28 (63.21%) | 1.32 (36.79%) | 26,140 | 5,361 | 404,416 | 8,470
African (NA18507) | 2n | AB SOLiD | 17.9 | 3.86 | 3.13 (81.00%) | 0.73 (19.00%) | 2.33 (60.30%) | 1.53 (39.70%) | 68,624 | 9,902 | 226,529 | 6,714
Korean (SJK) | 2n | Illumina | 28.9 | 3.43 | 3.01 (87.79%) | 0.42 (12.21%) | 2.00 (58.21%) | 1.43 (41.79%) | 27,118 | 9,334 | 342,965 | 3,303
Korean (AK1) | 2n | Illumina | 27.8 | 3.45 | 2.86 (83.30%) | 0.59 (16.70%) | 2.11 (61.11%) | 1.34 (38.89%) | 21,606 | 10,162 | 170,202 | 414
Khoisan (KB1) | 2n | Roche 454 | 10.2 | 4.05 | 3.31 (81.65%) | 0.74 (18.35%) | 2.39 (59.00%) | 1.66 (41.00%) | 22,119 | na | 463,788 | na
D. Tutu (ABT) | 2n | AB SOLiD | 30.0 | 3.62 | 3.21 (88.61%) | 0.41 (11.39%) | 2.17 (60.00%) | 1.44 (40.00%) | 17,342 | na | 3,395 | na
Lupski | 2n | AB SOLiD | 29.6 | 3.42 | 2.85 (83.58%) | 0.56 (16.42%) | 2.00 (58.72%) | 1.41 (41.28%) | 18,406 | 9,069 | na | 530
African (NA18507): Same HapMap sample was independently sequenced and reported using two different technologies.
Abbreviations: cSNPs, coding SNPs; nsSNPs, nonsynonymous SNPs; CNVs, copy-number variants; na, data not available.
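The known and novel percentages in the personal-genome table can be cross-checked against the rounded SNP totals. The sketch below is illustrative only: it uses three rows of the table, and the pairing of genome names with values is an assumption about the table layout. Because the table reports counts rounded to 0.01 million, the recomputed percentages agree with the reported ones only to within a fraction of a percentage point.

```python
# (total SNPs [M], known SNPs [M], reported known %), taken from the table.
# The name-to-value pairing is an assumption about the table layout.
ROWS = {
    "Venter": (3.21, 2.80, 87.22),
    "Khoisan (KB1)": (4.05, 3.31, 81.65),
    "Lupski": (3.42, 2.85, 83.58),
}


def known_and_novel_percent(total_m, known_m):
    """Return (known %, novel %) given total and known SNP counts in millions."""
    known_pct = 100.0 * known_m / total_m
    return known_pct, 100.0 - known_pct


for name, (total_m, known_m, reported_pct) in ROWS.items():
    known_pct, novel_pct = known_and_novel_percent(total_m, known_m)
    print(f"{name}: known {known_pct:.1f}% (reported {reported_pct}%), "
          f"novel {novel_pct:.1f}%")
```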
several variants in genes associated with coronary artery disease, hypertension, and myocardial infarction. There is a family history of cardiovascular disease; nevertheless, JCV is also
heterozygous for some cardiac-protective variants. Thus, it remains unknown whether or how
the different variants might account for cardiovascular disease risk in this particular individual, so clinical utility could not be documented.
He is also heterozygous for a null allele in the
GSTM1 gene, which is important for detoxification and the metabolism of xenobiotics. Carriers of null alleles for this gene have increased
susceptibility to environmental toxins and possibly increased risk of developing a variety of
cancers (11). This variant might be relevant to
this individual's history of skin cancer.
African Genomes
Later in 2008, the genomic sequence of
a Yoruban individual, HapMap sample
NA18507, was determined (13). This same
genome was sequenced again in 2009 using a
different NGS technology (14); a comparison
demonstrates the importance of the NGS
method used and the annotation approach
applied for analysis. An enrichment of heterozygous (versus homozygous) SNPs was
observed in this genome. Some homozygous
SNPs were found to be associated with pharmacogenetic traits or susceptibility to cancer
or some other complex disease. Indels affected
exons of 2,241 genes, with an enrichment of
indels in the first and last exons. Some of the
deletions found were also observed in other
published personal genomes, suggesting that
they are actually insertions in the reference
human genome assembly.
More recently, the complete genomes of
two Khoisan and Bantu individuals from southern Africa were sequenced in addition to the
exomes of three other individuals originating
from different indigenous groups across the
Kalahari Desert (KB1, ABT, NB1, TK1, and
Asian Genomes
The first Asian genome sequence to be published was the YH genome, derived from a
Han Chinese individual (18). Complete or partial deletions of 33 genes were detected in
this genome. A heterozygous mutation in the
GJB2 gene responsible for autosomal recessive
deafness was identified among the nonsynonymous SNPs. In addition, several alleles associated with tobacco addiction and increased risk
for Alzheimer disease were found in this self-reported heavy smoker. A familial history of
Alzheimer disease could not be assessed.
Following this first Asian genome, the
genomes of two Korean individuals were published, SJK (19) and AK1 (20). Nonsynonymous
SNPs in the SJK genome were distributed
throughout 5,365 genes, and 1,348 of these
nonsynonymous SNPs were novel. The majority of indels detected were in introns, while
only 49 were found in coding regions; however,
these alter the reading frames of 27 different
genes, probably affecting their function. Some
mitochondrial genome variants were found,
including 44 novel SNPs, of which 6 were nonsynonymous, plus 3 insertions and 1 deletion.
In the AK1 genome, 70 of the indels detected were homozygous and 26 were in OMIM
genes, of which 13 have been associated with
some disease. Seven hundred seventy-three
SNPs were predicted to be potentially associated with a clinical phenotype, including 269
known SNPs having associations with variable risk of developing certain types of cancer, diabetes, or Alzheimer disease. There were
504 nonsynonymous SNPs identified in genes
associated with Mendelian diseases or traits,
of which 22 were nonsense mutations and
5 were homozygous in the AK1 individual.
Among these traits were dry earwax, apparently
common in the Korean population, and drug
metabolism variants that are of pharmacogenetic relevance. The CNVs detected might affect the function or expression of the 106 genes
they encompass in this individual.
Insights from these personal genomes led
to recognition of the tremendous variation that
the human genome harbors and the importance
of sequencing more genomes in order to get
a more comprehensive catalogue of that variation, especially low-frequency and rare variants
(Figure 1). Collaborative projects that aimed
to catalogue human variation in different populations and include variants with minor allele
frequency (MAF) 1% in addition to common
SNP variants (MAF 5%) were therefore initiated, including The 1000 Genomes Project
(TGP).
SNPs and 1.33 million indels were identified, of which 54% and 57% were novel,
respectively.
2. Deep sequencing of a Caucasian and a
Yoruban trio at 40× coverage.
3. Deep resequencing of 8,140 exons in 697
samples in order to capture most of the
normal coding variation.
This third part of the pilot identified 12,758
SNPs but only 96 indels; interestingly, 70% of
the SNPs identified were novel. Included in
these were 68,300 nonsynonymous SNPs, of
which 50% were novel; some of these were
validated to be polymorphic across several samples. Interestingly, nonsense mutations, splice-site variants, and frameshifting variants were
biased toward lower allele frequencies and even
private to some populations or individuals. Six
hundred seventy-one apparent disease-causing
mutations included in the Human Gene Mutation Database were identified in this resequencing project.
De novo assembly was performed using the
data for the low-coverage pilot, and pooling all
samples together, this process identified 3.7 Mb
of sequence not present in the reference assembly. Of this sequence, 87% matched other
known human or primate sequences, while
79% matched sequence present in the Venter
genome. It was observed that variation across
the genome was not evenly distributed; some
highly polymorphic regions such as the HLA
locus showed more variation, whereas other
gene-rich, highly conserved regions showed
less variation.
Overall, the initial pilot phase of TGP identified 16.78 million variants in 742 samples
from different populations. The final goal of the
project is to sequence 2,500 additional individuals of diverse populations from five geographical areas at 4× average depth of coverage
in an attempt to identify most of the variation
that occurs at 0.1% frequency in the population. The most recent data report that TGP
has identified 38.9 million SNP sites (G. Marth,
personal communication). Interestingly, many
of these variants are private, i.e., present in very
few individuals or just one individual.
[Figure 1 bar chart: number of SNPs (axis 0 to 14,000,000); bars labeled "All SNPs (nonredundant)", "All SNPs", and "Unique SNPs by ethnic group" (African, Asian, Caucasian); genomes AF1/AF2, KB1, ABT, YH, SJK, AK1, JCV, JDW, JRL.]
Figure 1
Comparison of single nucleotide polymorphisms (SNPs) in 10 personal genomes. All SNPs in any of 10
sequenced personal genomes were compared with the other 9 genomes. Altogether, the 10 genomes
contribute 14,608,404 nonredundant SNPs (first bar). The second bar pictures all SNPs that are unique to
each of the personal genomes; the third bar represents all the SNPs that are unique in a given personal
genome but also novel; the fourth bar shows the SNPs shared by individuals of the same ethnic group.
Abbreviations: AF1, NA18507(1) Illumina; AF2, NA18507(2) SOLiD; KB1, Khoisan genome; ABT,
Archbishop Desmond Tutu; YH, Chinese genome; SJK, Korean genome 1; AK1, Korean genome 2; JCV,
J. Craig Venter; JDW, James D. Watson; JRL, James R. Lupski.
STRUCTURAL VARIATION
IN HUMAN GENOMES
Resequencing of personal human genomes has
provided further insight into the extent of an
important, but until recently unappreciated,
form of variation; i.e., structural variation (SV).
In 2004, two articles (21, 22) marked a
turning point in our appreciation and understanding of human genetic variation. Until
then it was well recognized that rearrangements in the human genome could occur,
but these were regarded as rare events that
Segmental duplication: a region of the human genome >1 kb in size that is duplicated, sharing at least 90% identity with its other copy located elsewhere in the genome
WGS: whole-genome sequencing
[Figure 2 histogram: counts of indels and CNVs (axis 0 to 3,500) by size bin (axis "Size distribution", from <100 bp to >1 Mb) for the Venter, Watson, Lupski, YH, SJK, AK1, NA18507(1), and NA18507(2) genomes.]
Figure 2
Size distribution of large indels (100 bp–1 kb) and copy-number variants (CNVs) (>1 kb) in sequenced personal human genomes.
Distribution of large indels and CNVs in 8 personal genomes is shown by size. We can observe peaks between 300 and 400 bp,
consistent with Alu indel polymorphisms, and at 12 kb. Few polymorphic CNVs are larger than 200 kb. Abbreviations: AF1,
NA18507(1) Illumina; AF2, NA18507(2) SOLiD; KB1, Khoisan genome; ABT, Archbishop Desmond Tutu; YH, Chinese genome;
SJK, Korean genome 1; AK1, Korean genome 2; JCV, J. Craig Venter; JDW, James D. Watson; JRL, James R. Lupski.
dimorphisms in diploid genomes. In the Venter genome, a total of 1,316 Alu indels were
identified, 53% of which were inserted in this
genome as compared to the reference genome
and were not present in any of the databases for
SINE or retrotransposon insertion polymorphisms. Likewise, 900 Alu indel differences
were identified in the Watson genome when
compared to the reference assembly.
More recently, several studies have interrogated retrotransposed insertional polymorphisms in the human genome. Surprisingly,
retrotransposed elements in the human genome
are far from dormant. These elements show
considerable unanticipated activity; thus, each
personal human genome is far more diverse
structurally than we had appreciated initially.
Alu and L1 elements are the most common retrotransposed sequences in the human
genome, believed to have been the most recent to be active. One study shows that L1 elements can transpose at frequencies higher than
expected both somatically and in the germline
(34). Another study (35) surveyed 68 L1s that
were not present in the reference assembly
in different individuals and found that 54%
of them are actually complete and active L1
elements that can transpose, as documented by
a functional assay in vitro. This is consistent
with observations made to determine the insertion sites of most of the human-specific L1
family of retrotransposons in several individuals
(36). It was observed that L1 elements are dimorphic in the human genome and that any two
individuals differ on average at 285 insertion
sites. It is estimated that the total number of dimorphic L1 elements in the population ranges
between 3,000 and 10,000.
Although no current sequencing technology can accurately detect and specifically assay
all the structural variants in a given genome,
increases in the length of reads produced by
NGS technologies and improvements in the algorithms for CNV calling and de novo assembly
of personal genomes may eventually allow the
unbiased detection of structural variants of all
sizes and types, with sequence-level resolution of breakpoints.
EXOME SEQUENCING IN
MEDICAL GENETICS
The exome comprises the coding sequences of
all annotated protein-coding genes (approximately 23,000)
and is equivalent to approximately 1% of the total haploid
genomic sequence (approximately 30 Mb). To sequence this
subset of the genome through massively parallel sequencing, targeted-capture methodologies were developed using arrays (37), and later
beads in solution (38), that hybridize to the exonic sequences to be captured.
The first targeted attempt to sequence only
this whole-exome fraction of the genome, as
validation for a disease-gene identification approach, was reported in 2009 (39). The exomes
of eight HapMap individuals were examined
and the accuracy of the approach validated
using HapMap data from these individuals. In
addition, the exomes of four other individuals
affected by Freeman-Sheldon syndrome (MIM
#193700), which is known to be caused by
mutations in the MYH3 gene, were also
sequenced. A total of 56,240 nonredundant
coding SNPs were identified across the 12
exomes, most of which were already present in
dbSNP; 23% were novel variants. Combining
the variants in the Freeman-Sheldon syndrome
patients to search for mutations in a common
gene among them, and subsequent filtering
of these variants for those known in dbSNP
or found in the HapMap samples, narrowed
the list of candidate genes to just one: MYH3.
This experiment demonstrated that it was
possible to capture most of the variation
contained in the exome and that by applying
bioinformatic filtering steps it was possible to
identify the pathogenic variants for a genetic
disease.
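The filtering logic described above (require a variant in the same gene in every affected exome, after discarding variants already known from dbSNP or control exomes) can be sketched as a set intersection. This is a minimal illustration, not the published pipeline: the function name, the variant identifiers, and the gene names other than MYH3 are hypothetical, and real analyses would also weigh variant class and genotype.

```python
def candidate_genes(affected_exomes, known_variant_ids):
    """Return genes that carry at least one novel variant in every affected exome.

    affected_exomes: list of exomes, each a list of (gene, variant_id) pairs.
    known_variant_ids: variant IDs to discard (e.g. dbSNP entries or
    variants seen in unaffected control exomes).
    """
    genes_per_exome = []
    for variants in affected_exomes:
        novel_genes = {gene for gene, variant_id in variants
                       if variant_id not in known_variant_ids}
        genes_per_exome.append(novel_genes)
    if not genes_per_exome:
        return set()
    # A candidate gene must be hit in all affected individuals.
    return set.intersection(*genes_per_exome)


# Toy data: every identifier below is invented for illustration.
known = {"var_common_1", "var_common_2"}
exomes = [
    [("MYH3", "var_a"), ("GENE_X", "var_common_1"), ("GENE_Y", "var_b")],
    [("MYH3", "var_c"), ("GENE_X", "var_common_1")],
    [("MYH3", "var_a"), ("GENE_Y", "var_common_2")],
    [("MYH3", "var_d"), ("GENE_Z", "var_e")],
]
shared = candidate_genes(exomes, known)
print(sorted(shared))
```

Intersecting at the gene level rather than the variant level matters here: unrelated patients with the same disorder may carry different mutations in the same gene.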
Choi et al. showed that it was possible to
reach a more accurate diagnosis of a patient after exome sequencing (40). They applied this
approach to diagnose a patient referred for possible Bartter syndrome, a disease of impairment
in salt reabsorption. The patient was born from
healthy parents who were first cousins. Thus,
absence of heterozygosity was an additional filter to narrow the list of candidate genes in which
WHOLE-GENOME SEQUENCING
FOR GENETIC DIAGNOSIS AND
PATIENT MANAGEMENT
Initial personal genome projects delivered a
number of individual diploid human genomes,
but all of them were from individuals with no
explicit clinical phenotype.
Although exome sequencing has successfully identified the causative mutations of
selected highly penetrant Mendelian diseases,
it interrogates SNVs for only the coding fraction of the genome that we have annotated as
#600176
#613610
#239300
Severe brain
malformations
Sensenbrenner syndrome/
cranioectodermal
dysplasia (CED)
Mabry syndrome/
hyperphosphatasia with
mental retardation
AR
AR
AR
AD
AD
#147920
Kabuki syndrome
AR
Mental retardation
#233400
Perrault syndrome
AR
AD
#613557
Nonsyndromic hearing
loss DFNB82
AD
AR
AR
Inheritance
Autosomal-dominant
spinocerebellar ataxia
#269150
Schinzel-Giedion
syndrome
#263750
#214700
MIM #
Agilent
SureSelect
Roche
NimbleGen
Agilent
SureSelect
Agilent
SureSelect
Roche
NimbleGen
Agilent
SureSelect
Agilent
SureSelect
Agilent
SureSelect
Agilent
SureSelect
Agilent
SureSelect
Roche
NimbleGen
Capture
platform
SOLiD
Illumina
SOLiD
SOLiD
Illumina
Illumina
Illumina
Illumina
SOLiD
Illumina
Illumina
Sequencing
Technology
Samples
10 parent-case
trios
4 related affecteds
3 affected siblings
2 unrelated
affecteds
1 affected in family
10 unrelated
affecteds
1 affected in family
1 affected in family
4 unrelated
affecteds
4 affecteds
(1 sib-pair)
1 affected
Identified gene
Reference
47
43
48
49
50
51
HSD17B4
MLL2
WDR62
WDR35
PIGV
TGM6
52
46
GPSM2
DYNC1H1, ZNF599 ,
RAB39B, YY1, BPIL3 ,
PGA5 , DEAF1, CIC,
SYNGAP1, JARID1C
42
41
40
SETBP1
DHODH
SLC26A3
Miller syndrome
Congenital chloride
diarrhea
Disease
#302800
CMT1X
AD
Hereditary hypotrichosis
simplex (HHS)
AD
Hajdu-Cheney syndrome
(HCS)
AR
AR
Osteogenesis imperfecta
(OI)
XL
XL
Inflammatory bowel disease/X-linked inhibitor of apoptosis deficiency
#102500
#210600
Seckel syndrome
Roche
NimbleGen
Agilent
SureSelect
Agilent
SureSelect
Roche
NimbleGen
Roche
NimbleGen
Agilent
SureSelect
Agilent
SureSelect
Agilent
SureSelect
Agilent
SureSelect
Agilent
SureSelect
Agilent
SureSelect
Illumina
SOLiD
Illumina
Illumina
Roche 454
Illumina
Illumina
Illumina
Illumina
Illumina
SOLiD
1 affected in family
1 affected in family
3 unrelated
affecteds
2 affecteds and
parents
1 affected
2 related affecteds
1 affected
1 affected
2 related affecteds
2 related affecteds
1 affected
55
56
57
VCP
FADD
CEP152
59
60
61
62
POP1
NOTCH2
SERPINF1
RPL21
(Continued )
44
XIAP
58
54
ANGPTL3
GJB1
53
ACAD9
AD
AR
#601859
Autoimmune
lymphoproliferative
syndrome (ALPS)
AR
AD
#605019
Familial combined
hypolipidemia
AR
Amyotrophic lateral
sclerosis
#611126
Mitochondrial complex I deficiency
AR
AR
Hypertrophic
mitochondrial
cardiomyopathy
Mosaic variegated
aneuploidy syndrome
(MVA)
AD
AR
Amelogenesis imperfecta
and gingival hyperplasia
syndrome
AR
Chondrodysplasia and
abnormal joint
development
#257300
AR
Hereditary progeroid
syndrome
AR
AR
#162400
AD
AD
Inheritance
Hereditary spastic
paraparesis (HSP)
Hereditary sensory
neuropathy with
dementia and hearing loss
(HSAN1)
#142690
MIM #
Roche
NimbleGen
Agilent
SureSelect
Agilent
SureSelect
Agilent
SureSelect
Agilent
SureSelect
Agilent
SureSelect
Agilent
SureSelect
Agilent
SureSelect/
Nimblegen
Agilent
SureSelect
Agilent
SureSelect
Capture
platform
Illumina
Illumina
Illumina
SOLiD
SOLiD
Illumina
Illumina
Illumina/Roche
454
Illumina
Illumina
Sequencing
Technology
Samples
20 parent-case
trios
2 affected siblings
1 affected
1 affected
3 affecteds
2 affecteds
1 parent-case trio
4 kindreds
1 affected in family
2 affecteds in
family
Identified gene
Reference
67
68
69
70
71
BANF1
IMPAD1
FAM20A
AARS2
CEP57
72
66
KIF1A
65
64
63
DNMT1
GJC2
NCSTN
Primary lymphoedema
Acne inversa/hidradenitis
suppurativa
Disease
Table 2 (Continued )
#148050
#171300
#605039
#231050
#102500
KBG syndrome
Hereditary
pheochromocytoma
(PCC)
Bohring-Opitz syndrome
Acromicric and
geleophysic dysplasias
Hajdu-Cheney syndrome
(HCS)
Proteus syndrome
#176920
#139090
Mitochondrial
cardiomyopathy
#204000
somatic
AR
AD
AR
AD
AD
AD
AR
AR
Agilent
SureSelect
Agilent
SureSelect
Agilent
SureSelect
Agilent
SureSelect
Agilent
SureSelect
Agilent
SureSelect
Roche
NimbleGen
Agilent
SureSelect
Agilent
SureSelect
Agilent
SureSelect &
Roche
Nimblegen
Agilent
SureSelect
Roche
NimbleGen
Roche
NimbleGen
Illumina
Illumina
Illumina
SOLiD
SOLiD
Illumina
Illumina
Illumina
Illumina
Illumina
SOLiD
Illumina
Illumina
17 samples from
12 affecteds
1 affected
6 affecteds
2 affecteds
3 affecteds
3 affecteds
3 affecteds
4 affecteds
1 affected
2 affecteds & 2
affecteds
3 affecteds
2 affecteds
1 affected
45
85
MRPL3
AKT1
84
82
ASXL1
NOTCH2
81
MAX
83
80
ANKRD11
FBN1
79
76, 77
VPS35
NBEAL2
75
CCDC8
78
74
ZNF644
KCNJ13
73
ZBTB24
Leber congenital
amaurosis (LCA)
AD
Late-onset Parkinson
disease
#168600
AR
3-M syndrome
AR
AD
#614069
High myopia
Immunodeficiency–centromeric instability–facial anomalies syndrome type 2 (ICF2)
[Figure 3 diagram labels: Whole-genome sequencing (WGS); Exome sequencing; ~20,000 variants; Cost; Technical; Variation; Disease.]
Figure 3
A comparison of the weaknesses and strengths of whole-genome sequencing (WGS) and exome sequencing
approaches for disease-gene identification. Abbreviations: CNVs, copy-number variants; SNVs, simple
nucleotide variants.
unbalanced because of insufficient BH4. Supplemental therapy with adjuvant 5-hydroxytryptophan (5HTP) was shown to compensate
the serotonin-production pathway, resulting in
documented clinical improvement in these patients. Adjuvant therapy for this form of DRD
can also include selective serotonin reuptake
inhibitors (SSRIs). Interestingly, the heterozygous nonsense variant was, as expected,
identified in the obligate carrier mother, but
furthermore found in the maternal grandmother, both of whom had been diagnosed previously with fibromyalgia, a condition that can
respond to SSRIs. WGS led to unanticipated
insights that provided new therapeutic avenues
based on the medically actionable variants
identified, and the applied medical treatment
resulted in amelioration of symptoms, marking
a true landmark in personal medical genomics.
Probably the most immediate applicability
of genomic sequencing in clinical practice, in
addition to reaching an accurate genetic diagnosis of a given disease, is in the field of pharmacogenomics (89). It is now possible to identify
the genome-wide totality of potentially clinically relevant pharmacogenomic variants and
Learned from Personal Genomes). The foundation provided by the HGP with the human
genome reference sequence enabled these gains
in knowledge.
CHALLENGES
A substantial and not yet entirely appreciated problem for personal genome sequence
analysis, especially for medical diagnosis and
applications, is the veracity of the current mutation databases. If we are to use WGS in clinical practice, it is of the utmost importance
that mutation databases recognize potentially
pathogenic variants of clinical significance, i.e.,
distinguish medically actionable variants from
benign variation.
FUTURE ISSUES
Large-scale human genome sequencing
projects and other disease-focused sequencing projects will add more variants to the
databases. The challenge that remains is the
analysis of this information and the knowledge to be gained concerning the biology
of our own genome. Structural variation is
still challenging to assess using only NGS
platforms. Comparison and standardization
of sequencing technologies and improvement
in mapping and de novo assembly algorithms
will eventually allow the accurate prediction of
indels, CNVs, and inversions at the nucleotide
level of resolution.
Other technical challenges are storing and
accessing the vast amounts of data that genomic
projects produce. Should we store the data files
produced by the sequencing machines or just
the processed and analyzed data? Will access
be public or restricted for research purposes
only? In addition, the bioinformatic analysis
still remains a bottleneck; even with automated
pipelines, processing these vast amounts of
information still requires extensive computational power and time. Furthermore, current
algorithms are suboptimal for some purposes;
improvement of current and development of
novel algorithms are necessary. A goal for
WGS should be the ability to provide all of the
variants in an individual's genome in a highly
reliable manner. Nevertheless, clinical utility
can be achieved for many patients long before
that laudable goal.
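The variant-prioritization workflow summarized in Figure 4 (quality filtering, removal of known variants, ranking of coding variants by class, and examination under a dominant or recessive model) can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the authors' pipeline: the `Variant` fields, the quality threshold of 30, and the gene names are hypothetical, and the compound-heterozygous test ignores phasing, which a real analysis would need to establish.

```python
from collections import defaultdict
from dataclasses import dataclass

# Priority order for coding variants as given in the Figure 4 caption.
CONSEQUENCE_RANK = {"nonsense": 0, "frameshift": 1, "splice_site": 2, "missense": 3}


@dataclass(frozen=True)
class Variant:
    gene: str
    consequence: str   # e.g. "nonsense", "missense", "synonymous"
    quality: float     # caller-specific quality score (hypothetical scale)
    known: bool        # already present in a variation database
    genotype: str      # "het" or "hom"


def prioritize(variants, min_quality=30.0, model="recessive"):
    """Filter and rank candidate variants following the Figure 4 steps."""
    passing = [v for v in variants if v.quality >= min_quality]   # quality filter
    novel = [v for v in passing if not v.known]                   # drop known variants
    coding = [v for v in novel if v.consequence in CONSEQUENCE_RANK]

    by_gene = defaultdict(list)
    for v in coding:
        by_gene[v.gene].append(v)

    selected = []
    for hits in by_gene.values():
        if model == "dominant":
            keep = any(v.genotype == "het" for v in hits)
        else:
            # Recessive: homozygous, or two heterozygous hits in one gene
            # (possible compound heterozygote; phase is not checked here).
            n_het = sum(v.genotype == "het" for v in hits)
            keep = any(v.genotype == "hom" for v in hits) or n_het >= 2
        if keep:
            selected.extend(hits)
    return sorted(selected, key=lambda v: (CONSEQUENCE_RANK[v.consequence], v.gene))


calls = [
    Variant("GENE_A", "missense", 50, False, "het"),
    Variant("GENE_A", "nonsense", 45, False, "het"),
    Variant("GENE_B", "missense", 60, True, "hom"),      # known: excluded
    Variant("GENE_C", "frameshift", 10, False, "hom"),   # low quality: excluded
    Variant("GENE_D", "synonymous", 70, False, "hom"),   # not a prioritized class
    Variant("GENE_E", "splice_site", 40, False, "hom"),
    Variant("GENE_F", "missense", 55, False, "het"),     # single het allele
]
recessive = [(v.gene, v.consequence) for v in prioritize(calls)]
dominant = [(v.gene, v.consequence) for v in prioritize(calls, model="dominant")]
print(recessive)
print(dominant)
```

The output of such a filter is only a ranked candidate list; as the caption stresses, genetic and functional confirmation remains the final step.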
Next-generation resequencing of personal genomes will in the very near future
[Figure 4 flowchart boxes, in order: Whole-genome shotgun sequencing; De novo assembly; Discard all highly repetitive and unmapping reads; Remapping; Variant calling; Quality filtering; Database annotation (dbSNP, 1000 Genomes); Known variants; Functional annotation; Novel variants; Coding variants; Noncoding variants; Nonsense, missense, splicing sites, frameshifting indels; Dominant model; Recessive model; De novo variants; Homozygous, compound heterozygous, hemizygous; Computational prediction of functional impact; Other evidence: conservation, literature, expression, databases; Variant(s) confirmation and validation.]
Figure 4
Schematic workflow of whole-genome/exome sequencing data analysis. After sequencing, the sequence reads
are mapped and aligned against the human reference genome assembly in order to obtain a list of variants at
every position that does not match the reference. Quality filters are applied to obtain high-quality variant
calls. Various filtering criteria are applied to prioritize the candidate variants. Most variants will be excluded
because they are known, meaning that they are already in variation databases, such as the database of single
nucleotide polymorphisms (dbSNP), The 1000 Genomes Project database, etc. The focus is mainly on novel
variants, which can be tiered in functional classes according to their annotation. For coding variants, priority
is given to nonsense, frameshifting, splice-site, and then missense mutations. Computational prediction of
the functional impact of these variants can also help prioritize candidate mutations. Based on the
characteristics of the trait or disease of interest, variants can be examined under a dominant or recessive
model. Additional confirmation through other resources can strengthen the hypotheses of the functional
significance of identified variants. Genetic and functional confirmation of the candidate disease-causing
variants is the final, most important step.
become common practice. An initial clinical application may be to assay for genetic
susceptibilities and factors that may contribute to a disease state for a genetically
heterogeneous condition if current panel
testing is prohibitively expensive. Complete
genome sequencing is leading the way toward making personalized genomic medicine
possible in the near future. Legal and ethical
DISCLOSURE STATEMENT
J.R.L. is a consultant for Athena Diagnostics and Ion Torrent Systems Inc., and holds multiple U.S. and European patents for DNA diagnostics. R.A.G. has co-investments with Life
Technologies. The Department of Molecular and Human Genetics at Baylor College of
Medicine derives revenue from molecular diagnostic testing (Medical Genetics Laboratories:
http://www.bcm.edu/geneticlabs/).
ACKNOWLEDGMENTS
J.R.L. receives support from the National Institute of Neurological Disorders and Stroke (R01
NS058529) and the Baylor College of Medicine Intellectual and Developmental Disabilities Research Center (HD024064). R.A.G. receives support from the National Human Genome Research
Institute (5 U54 HG003273).
LITERATURE CITED
1. International Human Genome Sequencing Consortium. 2001. Initial sequencing and analysis of the
human genome. Nature 409:860–921
2. International Human Genome Sequencing Consortium. 2004. Finishing the euchromatic sequence of
the human genome. Nature 431:931–45
3. Bailey JA, Yavor AM, Massa HF, et al. 2001. Segmental duplications: organization and impact within
the current Human Genome Project assembly. Genome Res. 11:1005–17
4. Lupski JR. 1998. Genomic disorders: structural features of the genome can lead to DNA rearrangements
and human disease traits. Trends Genet. 14:417–22
5. The International HapMap Consortium. 2003. The International HapMap Project. Nature 426:789–96
6. The International HapMap Consortium. 2005. A haplotype map of the human genome. Nature 437:1299–1320
7. The International HapMap Consortium. 2007. A second generation human haplotype map of over
3.1 million SNPs. Nature 449:851–61
8. The International HapMap Consortium. 2010. Integrating common and rare genetic variation in diverse
human populations. Nature 467:52–58
9. Lupski J. 2009. Genomic disorders ten years on. Genome Med. 1:42.242.11
10. Levy S, Sutton G, Ng PC, et al. 2007. The diploid genome sequence of an individual human. PLoS Biol.
5:e254
11. Ginsberg G, Smolenski S, Neafsey P, et al. 2009. The influence of genetic polymorphisms on population
variability in six xenobiotic-metabolizing enzymes. J. Toxicol. Environ. Health Part B: Crit. Rev. 12:307–33
12. Wheeler DA, Srinivasan M, Egholm M, et al. 2008. The complete genome of an individual by massively
parallel DNA sequencing. Nature 452:872–76
13. Bentley DR, Balasubramanian S, Swerdlow HP, et al. 2008. Accurate whole human genome sequencing
using reversible terminator chemistry. Nature 456:53–59
14. McKernan KJ, Peckham HE, Costa GL, et al. 2009. Sequence and structural variation in a human genome
uncovered by short-read, massively parallel ligation sequencing using two-base encoding. Genome Res.
19:1527–41
15. Schuster SC, Miller W, Ratan A, et al. 2010. Complete Khoisan and Bantu genomes from southern
Africa. Nature 463:943–47
16. Perry GH, Dominy NJ, Claw KG, et al. 2007. Diet and the evolution of human amylase gene copy
number variation. Nat. Genet. 39:1256–60
17. Aldred PMR, Hollox EJ, Armour JAL. 2005. Copy number polymorphism and expression level variation
of the human alpha-defensin genes DEFA1 and DEFA3. Hum. Mol. Genet. 14:2045–52
18. Wang J, Wang W, Li R, et al. 2008. The diploid genome sequence of an Asian individual. Nature
456:60–65
19. Ahn S-M, Kim T-H, Lee S, et al. 2009. The first Korean genome sequence and analysis: full genome
sequencing for a socio-ethnic group. Genome Res. 19:1622–29
20. Kim J-I, Ju YS, Park H, et al. 2009. A highly annotated whole-genome sequence of a Korean individual.
Nature 460:1011–15
21. Sebat J, Lakshmi B, Troge J, et al. 2004. Large-scale copy number polymorphism in the human genome.
Science 305:525–28
22. Iafrate AJ, Feuk L, Rivera MN, et al. 2004. Detection of large-scale variation in the human genome.
Nat. Genet. 36:949–51
23. Stankiewicz P, Lupski JR. 2002. Genome architecture, rearrangements and genomic disorders. Trends
Genet. 18:74–82
24. Redon R, Ishikawa S, Fitch KR, et al. 2006. Global variation in copy number in the human genome.
Nature 444:444–54
25. Conrad DF, Pinto D, Redon R, et al. 2009. Origins and functional impact of copy number variation in
the human genome. Nature 464:704–12
26. Flores M, Morales L, Gonzaga-Jauregui C, et al. 2007. Recurrent DNA inversion rearrangements in the
human genome. Proc. Natl. Acad. Sci. 104:6099–106
27. Lakich D, Kazazian HH, Antonarakis SE, et al. 1993. Inversions disrupting the factor VIII gene are a
common cause of severe haemophilia A. Nat. Genet. 5:236–41
28. Stefansson H, Helgason A, Thorleifsson G, et al. 2005. A common inversion under selection in Europeans. Nat. Genet. 37:129–37
29. Koolen DA, Sharp AJ, Hurst JA, et al. 2008. Clinical and molecular delineation of the 17q21.31 microdeletion syndrome. J. Med. Genet. 45:710–20
30. Zhang F, Gu W, Hurles ME, Lupski JR. 2009. Copy number variation in human health, disease, and
evolution. Annu. Rev. Genomics Hum. Genet. 10:451–81
31. Stankiewicz P, Lupski JR. 2010. Structural variation in the human genome and its role in disease. Annu.
Rev. Med. 61:437–55
32. Conrad DF, Bird C, Blackburne B, et al. 2010. Mutation spectrum revealed by breakpoint sequencing
of human germline CNVs. Nat. Genet. 42:385–91
33. Lupski JR. 2010. Retrotransposition and structural variation in the human genome. Cell 141:1110–12
34. Iskow RC, McCabe MT, Mills RE, et al. 2010. Natural mutagenesis of human genomes by endogenous
retrotransposons. Cell 141:1253–61
35. Beck CR, Collier P, Macfarlane C, et al. 2010. LINE-1 retrotransposition activity in human genomes.
Cell 141:1159–70
36. Ewing AD, Kazazian HH Jr. 2011. Whole-genome resequencing allows detection of many rare LINE-1
insertion alleles in humans. Genome Res. 21(6):985–90
37. Albert TJ, Molla MN, Muzny DM, et al. 2007. Direct selection of human genomic loci by microarray
hybridization. Nat. Methods 4:903–5
38. Bainbridge M, Wang M, Burgess D, et al. 2010. Whole exome capture in solution with 3 Gbp of data.
Genome Biol. 11:R62
39. Ng SB, Turner EH, Robertson PD, et al. 2009. Targeted capture and massively parallel sequencing of
12 human exomes. Nature 461:272–76
40. Choi M, Scholl UI, Ji W, et al. 2009. Genetic diagnosis by whole exome capture and massively parallel
DNA sequencing. Proc. Natl. Acad. Sci. 106:19096–101
41. Ng SB, Buckingham KJ, Lee C, et al. 2010. Exome sequencing identifies the cause of a Mendelian
disorder. Nat. Genet. 42:30–35
42. Hoischen A, van Bon BWM, Gilissen C, et al. 2010. De novo mutations of SETBP1 cause Schinzel-Giedion syndrome. Nat. Genet. 42:483–85
43. Ng SB, Bigham AW, Buckingham KJ, et al. 2010. Exome sequencing identifies MLL2 mutations as a
cause of Kabuki syndrome. Nat. Genet. 42:790–93
44. Worthey EA, Mayer AN, Syverson GD, et al. 2011. Making a definitive diagnosis: successful clinical
application of whole exome sequencing in a child with intractable inflammatory bowel disease. Genet.
Med. 13:255–62
45. Lindhurst MJ, Sapp JC, Teer JK, et al. 2011. A mosaic activating mutation in AKT1 associated with the
Proteus syndrome. N. Engl. J. Med. 365:611–19
46. Walsh T, Shahin H, Elkan-Miller T, et al. 2010. Whole exome sequencing and homozygosity mapping
identify mutation in the cell polarity protein GPSM2 as the cause of nonsyndromic hearing loss DFNB82.
Am. J. Hum. Genet. 87:90–94
47. Pierce SB, Walsh T, Chisholm KM, et al. 2010. Mutations in the DBP-deficiency protein HSD17B4
cause ovarian dysgenesis, hearing loss, and ataxia of Perrault syndrome. Am. J. Hum. Genet. 87:282–88
48. Bilguvar K, Ozturk ...
... data identifies PIGV mutations in hyperphosphatasia mental retardation syndrome. Nat. Genet. 42:827–29
51. Wang JL, Yang X, Xia K, et al. 2010. TGM6 identified as a novel causative gene of spinocerebellar
ataxias using exome sequencing. Brain 133:3510–18
52. Vissers LELM, de Ligt J, Gilissen C, et al. 2010. A de novo paradigm for mental retardation. Nat. Genet.
42:1109–12
53. Haack TB, Danhauser K, Haberberger B, et al. 2010. Exome sequencing identifies ACAD9 mutations
as a cause of complex I deficiency. Nat. Genet. 42:1131–34
54. Musunuru K, Pirruccello JP, Do R, et al. 2010. Exome sequencing, ANGPTL3 mutations, and familial
combined hypolipidemia. N. Engl. J. Med. 363:2220–27
55. Johnson JO, Mandrioli J, Benatar M, et al. 2010. Exome sequencing reveals VCP mutations as a cause
of familial ALS. Neuron 68:857–64
56. Bolze A, Byun M, McDonald D, et al. 2010. Whole-exome-sequencing-based discovery of human FADD
deficiency. Am. J. Hum. Genet. 87:873–81
57. Kalay E, Yigit G, Aslan Y, et al. 2010. CEP152 is a genome maintenance protein disrupted in Seckel
syndrome. Nat. Genet. 43:23–26
58. Montenegro G, Powell E, Huang J, et al. 2011. Exome sequencing allows for rapid gene identification
in a Charcot-Marie-Tooth family. Ann. Neurol. 69:464–70
59. Glazov EA, Zankl A, Donskoi M, et al. 2011. Whole-exome re-sequencing in a family quartet identifies
POP1 mutations as the cause of a novel skeletal dysplasia. PLoS Genet. 7:e1002027
60. Simpson MA, Irving MD, Asilmaz E, et al. 2011. Mutations in NOTCH2 cause Hajdu-Cheney syndrome, a disorder of severe and progressive bone loss. Nat. Genet. 43:303–5
61. Becker J, Semler O, Gilissen C, et al. 2011. Exome sequencing identifies truncating mutations in human
SERPINF1 in autosomal-recessive osteogenesis imperfecta. Am. J. Hum. Genet. 88:362–71
62. Zhou C, Zang D, Jin Y, et al. 2011. Mutation in ribosomal protein L21 underlies hereditary hypotrichosis
simplex. Hum. Mutat. 32:710–14
63. Liu Y, Gao M, Lv Y-m, et al. 2011. Confirmation by exome sequencing of the pathogenic role of NCSTN
mutations in acne inversa (hidradenitis suppurativa). J. Invest. Dermatol. 131:1570–72
64. Ostergaard P, Simpson MA, Brice G, et al. 2010. Rapid identification of mutations in GJC2 in primary
lymphoedema using whole exome sequencing combined with linkage analysis with delineation of the
phenotype. J. Med. Genet. 48:251–55
65. Klein CJ, Botuyan M-V, Wu Y, et al. 2011. Mutations in DNMT1 cause hereditary sensory neuropathy
with dementia and hearing loss. Nat. Genet. 43:595–600
66. Erlich Y, Edvardson S, Hodges E, et al. 2011. Exome sequencing and disease-network analysis of a single
family implicate a mutation in KIF1A in hereditary spastic paraparesis. Genome Res. 21:658–64
67. Puente XS, Quesada V, Osorio FG, et al. 2011. Exome sequencing and functional analysis
identifies BANF1 mutation as the cause of a hereditary progeroid syndrome. Am. J. Hum. Genet. 88:650–56
68. Vissers LELM, Lausch E, Unger S, et al. 2011. Chondrodysplasia and abnormal joint development
associated with mutations in IMPAD1, encoding the Golgi-resident nucleotide phosphatase, gPAPP.
Am. J. Hum. Genet. 88:608–15
69. O'Sullivan J, Bitu CC, Daly SB, et al. 2011. Whole-exome sequencing identifies FAM20A mutations as
a cause of amelogenesis imperfecta and gingival hyperplasia syndrome. Am. J. Hum. Genet. 88:616–20
70. Götz A, Tyynismaa H, Euro L, et al. 2011. Exome sequencing identifies mitochondrial alanyl-tRNA
synthetase mutations in infantile mitochondrial cardiomyopathy. Am. J. Hum. Genet. 88:635–42
71. Snape K, Hanks S, Ruark E, et al. 2011. Mutations in CEP57 cause mosaic variegated aneuploidy
syndrome. Nat. Genet. 43:527–29
72. O'Roak BJ, Deriziotis P, Lee C, et al. 2011. Exome sequencing in sporadic autism spectrum disorders
identifies severe de novo mutations. Nat. Genet. 43:585–89
73. de Greef JC, Wang J, Balog J, et al. 2011. Mutations in ZBTB24 are associated with immunodeficiency,
centromeric instability, and facial anomalies syndrome type 2. Am. J. Hum. Genet. 88:796–804
74. Shi Y, Li Y, Zhang D, Zhang H, et al. 2011. Exome sequencing identifies ZNF644 mutations in high
myopia. PLoS Genet. 7:e1002084
75. Hanson D, Murray PG, O'Sullivan J, et al. 2011. Exome sequencing identifies CCDC8 mutations in
3-M syndrome, suggesting that CCDC8 contributes in a pathway with CUL7 and OBSL1 to control
human growth. Am. J. Hum. Genet. 89:148–53
76. Zimprich A, Benet-Pagès A, Struhal W, et al. 2011. A mutation in VPS35, encoding a subunit of the
retromer complex, causes late-onset Parkinson disease. Am. J. Hum. Genet. 89:168–75
77. Vilariño-Güell C, Wider C, Ross OA, et al. 2011. VPS35 mutations in Parkinson disease. Am. J. Hum.
Genet. 89:162–67
78. Sergouniotis PI, Davidson AE, Mackay DS, et al. 2011. Recessive mutations in KCNJ13, encoding an
inwardly rectifying potassium channel subunit, cause Leber congenital amaurosis. Am. J. Hum. Genet.
89:183–90
79. Albers CA, Cvejic A, Favier R, et al. 2011. Exome sequencing identifies NBEAL2 as the causative gene
for gray platelet syndrome. Nat. Genet. 43:735–37
80. Sirmaci A, Spiliopoulos M, Brancati F, et al. 2011. Mutations in ANKRD11 cause KBG syndrome,
characterized by intellectual disability, skeletal malformations, and macrodontia. Am. J. Hum. Genet.
89:289–94
81. Comino-Méndez I, Gracia-Aznárez FJ, Schiavi F, et al. 2011. Exome sequencing identifies MAX mutations as a cause of hereditary pheochromocytoma. Nat. Genet. 43:663–67
82. Hoischen A, van Bon BW, Rodríguez-Santiago B, et al. 2011. De novo nonsense mutations in ASXL1
cause Bohring-Opitz syndrome. Nat. Genet. 43:729–31
83. Le Goff C, Mahaut C, Wang LW, et al. 2011. Mutations in the TGFβ binding-protein-like domain 5
of FBN1 are responsible for acromicric and geleophysic dysplasias. Am. J. Hum. Genet. 89:7–14
84. Majewski J, Schwartzentruber JA, Caqueret A, et al. 2011. Mutations in NOTCH2 in families with
Hajdu-Cheney syndrome. Hum. Mutat. 32:1114–17
85. Galmiche L, Serre V, Beinat M, et al. 2011. Exome sequencing identifies MRPL3 mutation in mitochondrial cardiomyopathy. Hum. Mutat. 32:1225–31
86. Lupski JR, Reid JG, Gonzaga-Jauregui C, et al. 2010. Whole-genome sequencing in a patient with
Charcot-Marie-Tooth neuropathy. N. Engl. J. Med. 362:1181–91
87. Rios J, Stein E, Shendure J, et al. 2010. Identification by whole-genome resequencing of gene defect
responsible for severe hypercholesterolemia. Hum. Mol. Genet. 19:4313–18
88. Bainbridge MN, Wiszniewski W, Murdock DR, et al. 2011. Whole genome sequencing enables optimized patient management. Sci. Translational Med. 87re3
89. Wang L, McLeod HL, Weinshilboum RM. 2011. Genomics and drug response. N. Engl. J. Med.
364:1144–53
90. The 1000 Genomes Project Consortium. 2010. A map of human genome variation from population-scale
sequencing. Nature 467:1061–73
91. Hasin-Brumshtein Y, Lancet D, Olender T. 2009. Human olfaction: from genomic variation to phenotypic diversity. Trends Genet. 25:178–84
92. Hasin Y, Olender T, Khen M, et al. 2008. High-resolution copy-number variation map reflects human
olfactory receptor diversity and evolution. PLoS Genet. 4:e1000249
93. Hicks S, Wheeler DA, Plon S, et al. 2011. Prediction of missense mutation functionality depends on
both the algorithm and sequence alignment employed. Hum. Mutat. 32:661–68
94. Yan H, Yuan W, Velculescu VE, et al. 2002. Allelic variation in human gene expression. Science 297:1143
95. Cheung VG, Spielman RS. 2009. Genetics of human gene expression: mapping DNA variants that
influence gene expression. Nat. Rev. Genet. 10:595–604
96. Montgomery SB, Sammeth M, Gutierrez-Arcelus M, et al. 2010. Transcriptome genetics using second
generation sequencing in a Caucasian population. Nature 464:773–77
97. Emilsson V, Thorleifsson G, Zhang B, et al. 2008. Genetics of gene expression and its effect on disease.
Nature 452:423–28
98. Berg JS, Khoury MJ, Evans JP. 2011. Deploying whole genome sequencing in clinical practice and public
health: meeting the challenge one bin at a time. Genet. Med. 13:499–504
99. McGuire AL, Lupski JR. 2010. Personal genome research: what should the participant be told? Trends
Genet. 26:199–201
100. Mardis E. 2010. The $1,000 genome, the $100,000 analysis? Genome Med. 2:84
Annual Review of
Medicine
Volume 63, 2012