Workshop:
Developing the Genome Profile of the U.S.
Population and Assessing the Role of Genetic
Variation in Health and Disease
Beyond Gene Discovery
March 3, 2008
8:30 a.m. to 5:00 p.m.
The workshop will convene CDC programs, federal partners, academia, and the private
sector to review the Beyond Gene Discovery (BGD) plans, discuss analytic issues and
develop solutions regarding models of access to research datasets that maintain
human subjects protections.
1. Produce the first comprehensive report of the Genome Profile of the United States
population, using data from the National Health and Nutrition Examination Survey
(NHANES).
2. Support the development of a CDC searchable online information system of
human genome variation: allele, genotype and haplotype frequencies at individual
and multiple genetic loci readily accessible to researchers, healthcare providers and
policy makers.
3. Develop and disseminate a comprehensive agenda for population research that
will help fill the gaps between gene discoveries and health benefits of genomic
information.
4. Enhance informatics and analytic capacity to analyze complex data as well as
develop datasets for access by researchers that link relevant genetic test results and
NHANES interview, examination and laboratory measurements.
Session 2: NHANES in the genomics era--current practices and future options for Beyond
Gene Discovery
Moderator, Kathleen Toomey
11:00-12:00 Discussion of the analytic challenges of NHANES genomic data, in the context of
the proposed access options, and development of a plan to address these
challenges
Issues to include: genotyping quality control; detection of and adjustment for population
stratification; assessment of structural variants; statistical methods for evaluating the relationship
between the variations in genomic structure and function; statistical analysis of genetic
associations, gene-gene and gene-environment interactions; statistical methods appropriate for
analysis of weighted sample survey data.
Session 4: Policy options to promote access to NHANES genomic data while protecting
privacy and confidentiality - Co-moderators: Ellen Clayton & William Lowrance
Issues to include: identifying and meeting analytic needs while meeting privacy and
confidentiality requirements; the advantages and disadvantages of different access mechanisms;
alternative data access models and options to consider.
CDC Foundation:
Charles Stokes
Julie Rodgers
Chloe Tonney
Prepared by:
The Data Access Subgroup of CDC’s Beyond Gene Discovery Working Group
Appendices
Appendix A. The Beyond Gene Discovery Initiative
Appendix B. The National Health and Nutrition Examination Survey
Appendix C. Statutory and Policy Considerations Related to the Release of and
Access to NHANES Data Including Genetic Information
Appendix D. Considerations Related to Re-consent and Changes to Future
Informed Consent in Order to Achieve Broader Access to NHANES Genetic
Data
Appendix E. NIH Database of Genotypes and Phenotypes (dbGaP)
Appendix F. The Data Access Subgroup of CDC’s Beyond Gene Discovery
Working Group: Membership List
With the completion of the Human Genome Project and the availability of technologies
to measure human genetic variation on a genome-wide scale, the Centers for Disease
Control and Prevention (CDC) and the CDC Foundation are launching the Beyond
Gene Discovery (BGD) initiative, in collaboration with public, private, and academic
partners. BGD will assess population genetic variation in the United States in relation to
health and disease and develop strategies for using genetic information to impact health
and eliminate health disparities among population groups.
The National Health and Nutrition Examination Survey (NHANES), a major program of
the National Center for Health Statistics (NCHS), provides a unique national resource
for investigating the effects of genetic variation on health and will serve as the initial
focus of BGD. Nationally representative probability samples from two NHANES data
collections include approximately 15,000 persons (about 7,000 participants from
NHANES III and 8,000 participants from NHANES 1999-2002), with oversampling of the
two largest race/ethnic minority groups, non-Hispanic blacks and Mexican Americans,
along with other subgroups of the population.
The success of the BGD initiative relies on the development of new and enhanced data
access procedures that will facilitate state-of-the-art analytic methods, including those
for genome-wide association studies (GWAS), while protecting the confidentiality of
NHANES participants’ identifiable private information. New and enhanced methods are
under consideration to facilitate these types of studies; however, any mechanisms for
access to NHANES data must be consistent with applicable Federal confidentiality and
statistical statutes and guidance.
NCHS collected the data and samples for the NHANES under the authority of section
306 of the Public Health Service Act (PHSA) (42 U.S.C. 242k). They are therefore
subject to protection by the confidentiality provisions of section 308(d) of the PHSA (42
U.S.C. 242m(d)). Under this provision, NCHS may not release identifiable information
unless the participant has consented to the release. Genomic data, alone or in
combination with other NHANES data, are considered to be potentially identifiable, and
the NHANES III and NHANES 1999-2002 consents do not allow for release of
identifiable data. Appendix C provides a more detailed review of the statutory authority,
regulations and policies that have governed the collection of and access to NHANES
data.
It is currently unclear whether NCHS may use the designated agent authority for data
collected prior to the passage of CIPSEA or without specific reference to such agents in
the informed consent, as is the case for NHANES III and NHANES 1999-2002. NCHS
has requested an Office of Management and Budget (OMB) determination regarding
applying the CIPSEA designated agent provision to these data.
The analytic and statistical requirements for NHANES genomic data will inform the type
of data access that is needed to maximize the scientific value of these data. Analytic
methods for genomic data are evolving rapidly, and would need to be custom-tailored to
accommodate the complex survey design of NHANES. Current NHANES data release
policies could present significant challenges for conducting state-of-the-art genomic
analyses, including those for GWAS. While identifiable data can be analyzed using a
remote access system, data that could identify the individual can only be viewed inside
a Research Data Center (RDC) and cannot be removed from the RDC in an identifiable
form.
The input of experts in both NHANES and genomic data analysis is needed to formally
assess the analytic and statistical requirements for conducting genomic/GWAS
Please do not distribute or cite this briefing document.
Prepared for the March 3, 2008 Beyond Gene Discovery Workshop.
• What are the unique opportunities, from an analytic perspective, of the proposed
NHANES genomic dataset? (e.g., What are some of the standard and higher-
level analyses that end users might want to perform using these data?)
• What are the unique challenges, from an analytic perspective, of the proposed
NHANES genomic dataset for standard and higher-level genomic analyses?
(e.g., challenges associated with the complex survey design, data accessibility,
etc.)
• What existing analytic and data access models for conducting standard and
higher-level genomic analyses might be applied to or modified for NHANES
data?
4.0 Current Access Options and Future Considerations for NHANES Genomic
Data
The genetic data collected for NHANES III and NHANES 1999-2002 must be kept
strictly confidential consistent with the informed consent provided by the participant
under the NCHS confidentiality statute. The existing data (about 200 genetic variants)
are currently made available to researchers using the same system as other sensitive
datasets housed at NCHS through two methods of access. The first is the Research
Data Center (RDC) at NCHS where individual-level data can be viewed and analyzed in
a secure environment and results vetted for confidentiality risks before they are
removed from the facility. The second method of access is the NCHS Analytical Data
Research by Email (ANDRE) remote access system, which allows broader access to
the data. The remote system is email-based and allows researchers to submit statistical
analysis code and receive results from any location.
The increase in the availability of NHANES genomic data by four orders of magnitude
through the genome-wide studies planned as part of BGD will change the demands on
the RDC and the ANDRE system. Similarly, the future of genomic data analyses is
clearly moving in the direction of more complex analyses as millions of genetic
variations are analyzed in an attempt to untangle biological pathways, as in GWAS.
These analyses pose real challenges in terms of the analytic methods needed to
perform these statistical tests and the computing resources needed to process them.
While modifying some current practices is likely feasible, other requirements of genomic
analyses may require new solutions.
To address both the increasing demands for analytic complexity and the
anticipated high volume of users for analysis of NHANES genomic data, the Data
Access Subgroup of CDC’s BGD Working Group has identified four possible
approaches for consideration. These options could be employed, independently or in a
combined strategy, to maximize access to NHANES genomic data while protecting the
privacy of the participants and the confidentiality of their information, and working within
the statutory framework for statistical agencies.
1. Remote access to the full NHANES genotypic and phenotypic database that is
centrally-located in the RDC, via electronic submission of code, automated
disclosure review of program code and output, and return of output that has passed
a confidentiality review. Researchers can perform analyses on individual-level data
but cannot see identifiable data.
3. Establish and operate additional RDCs outside Hyattsville, with the first in Atlanta.
Remote access to the full NHANES genotypic and phenotypic database that is centrally-
located in the RDC, via email submission of code, automated disclosure review of code
and output, and return of output that has passed automated confidentiality review
conducted by the ANDRE system. Researchers can perform analyses on individual-
level data but cannot see identifiable data.
NCHS has provided researchers remote access to sensitive data through the ANDRE
remote system since April 1998. In the past nine years, ANDRE has served hundreds
of data analysts and executed tens of thousands of SAS programs. ANDRE has
multiple levels of disclosure risk prevention strategies built into the system.
While the remote access system provides a convenient solution for researchers who do
not wish to travel to NCHS, it does have some limitations: primarily that statistical
software is limited to SAS and SAS-callable SUDAAN due to technical and resource
constraints and individual-level data are accessible but not viewable. Yet, even with
these restrictions in place, the remote system has successfully been used to process
thousands of data analyses for hundreds of users on various sensitive data sets housed
at NCHS.
Beta testing of the ANDRE system for use with current NHANES genetic data (currently
including approximately 200 genetic variants) is underway. The results to date show
that the system can be successfully used to conduct standard genotype-phenotype
association analyses; however, it is not designed to conduct GWAS analysis. ANDRE
will be available for all users with NCHS Ethics Review Board (ERB)-approved
proposals in the first quarter of 2008. Future plans for development of the remote
system include the creation of a library of macros for more complex genetic analyses.
Clarifying the analytic and statistical needs of the end user will help to determine
whether mechanisms to provide the level of information needed for GWAS analyses can
be developed in an ANDRE-like system. Early attempts to produce regression results
for one million genetic variations have shown that p-values can be generated in under
24 hours using a system similar to ANDRE. However, further investigations are
underway to determine if additional modifications to these p-values will need to be done
in the remote access environment to complete the analyses, or if these p-values can be
further processed by the end user outside of the remote system. Further simulation
studies are underway to examine the capacity of the remote access system for GWAS
data analysis and the impact on the analyses of the requirement to suppress any
analytic products that would jeopardize confidentiality (e.g., individual-level data,
outliers, logs).
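As an illustration of the kind of processing an end user might apply to returned p-values outside the remote system, the sketch below applies a Bonferroni multiple-testing threshold. This document does not say which adjustments are under investigation; the choice of Bonferroni correction, the function name, and the variant identifiers are assumptions made only for illustration.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Return the variants whose p-value passes a Bonferroni threshold.

    p_values: dict mapping a variant identifier to its p-value.
    alpha: family-wise error rate (0.05 is an assumed default).
    """
    threshold = alpha / len(p_values)  # one test per variant
    return {variant: p for variant, p in p_values.items() if p < threshold}


# Hypothetical variant identifiers and p-values, for illustration only.
example = {"variant_1": 1e-9, "variant_2": 0.01, "variant_3": 0.2}
print(sorted(bonferroni_significant(example)))
```

With one million tests and alpha of 0.05, the same rule gives a per-test threshold of about 5 x 10^-8. Because the rule needs only the p-values and not individual-level data, it is the sort of step that could in principle be run after results leave the remote system.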
Until the passage of CIPSEA, NCHS did not have the authority to designate agents that
could access data not released to the public due to confidentiality requirements.
Designated agents are subject to penalties, including fines and imprisonment, if data
use agreements are violated and must be under the strict and substantial supervision of
NCHS. With the passage of CIPSEA, NCHS began developing policies for
implementing the designated agent authority in conjunction with the implementation
guidelines being developed by OMB and an interagency committee. NCHS decided to
adopt a stepwise process whereby designated agent status would be considered first
for users and data that presented the least risk (e.g., federal employees operating under
approved IT security plans accessing the least sensitive data). Proposal requirements
have included a clear and detailed description of the purpose of the access, a clear
justification of the need for the confidential data requested, a description of how the data
will be used, a description of how NCHS will benefit from granting the requested access,
At this time, NCHS does not have off-site NHANES data use agreements with non-
governmental entities; however, mechanisms to expand use of the designated agent
authority are under consideration, including access to genetic data by
researchers who are not federal employees (further work is subject to guidance from
OMB).
Given the anticipated costs associated with the oversight of these proposed
agreements, the number of designated agent agreements that could be supported
would likely be limited, and criteria for their selection would need to be developed. The
designated agent option would be intended only for those instances when the RDC and
remote access system are insufficient to meet the needs of the proposed research.
Establish and operate additional RDCs outside Hyattsville, with the first in Atlanta.
NCHS provides qualified researchers on-site access to confidential data collections for
statistical purposes, under strict supervision, through the Hyattsville Research Data
Center (RDC). Data from virtually all of the NCHS data collection systems may be
made available through the RDC; also available are data from other data collection
systems, and RDC users may supply their own data to be merged with NCHS datasets.
In 2007, certain NCHS confidential data collections, including NHANES I, II and III, were
made available through the U.S. Census Bureau RDC network which includes several
locations around the country. The RDC provides a mechanism whereby researchers
can access detailed data files in a secure environment, without jeopardizing the
confidentiality of respondents.
To apply for RDC access to NCHS confidential data collections, the researcher submits
a proposal which includes key study questions or hypotheses, the analytic strategy and
statistical methods to be used, software requirements, curriculum vitae for each person
participating in the research activity, and a summary of the data requirements for the
proposed research, which is used by RDC staff to construct the necessary data files.
Additionally, the proposals must include the "Agreement Regarding Conditions of
Access to Confidential Data in the Research Data Center for the National Center for
Health Statistics'' and "Affidavit of Confidentiality" signed by all participating
researchers. Research proposals are reviewed by a Proposal Review Committee which
consists of (at a minimum) the director of the NCHS RDC, the RDC staff liaison, the
NCHS Confidentiality Officer, and the director (or designee) of the NCHS data division
whose data are requested in the proposal. Approval of research proposals does not
constitute endorsement by NCHS of the substantive, methodological, theoretical, or
policy relevance or merit of the proposed research, but rather constitutes a judgment
that the research, as described in the application, is not an illegal use of the requested
data file and that there is a high probability that the project can be successfully done in
the RDC.
The RDC computers have no electronic link to the NCHS network, the CDC-NCHS
mainframe, or the internet. The computers are configured such that researchers are
given read-only access to requested data files and can write only onto the local
workstation's hard disk—removable media such as floppy disks are inaccessible. All
printed output is routed to a central printer which is monitored by RDC staff.
Researchers may take the results of their analyses off-site only after disclosure review
by NCHS RDC staff. Disclosure review consists of looking for tabular cells less than
five, tables with geographic variables in any dimension, models with geographic
variables (or variables tantamount to geographic variables) as outcome variables, or
case listings.
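A minimal sketch of how the first two disclosure checks described above (small tabular cells and geographic variables in a table dimension) might be automated. It is not the NCHS procedure: the class, the function, and the list of geographic variable names are hypothetical, and the checks on model outcome variables and case listings are not covered.

```python
from __future__ import annotations

from dataclasses import dataclass

MIN_CELL_COUNT = 5  # cells with counts below five are flagged
# Hypothetical variable names; a real review would use the survey's own list.
GEOGRAPHIC_VARS = frozenset({"state", "county", "zip_code"})


@dataclass
class TableOutput:
    dimensions: list[str]                # variable names defining the table
    cells: dict[tuple[str, ...], int]    # cell key -> unweighted count


def review_table(table: TableOutput) -> list[str]:
    """Return a list of disclosure concerns; an empty list means none found."""
    problems = []
    geo = [d for d in table.dimensions if d in GEOGRAPHIC_VARS]
    if geo:
        problems.append("geographic variable in table: " + ", ".join(geo))
    for key in sorted(table.cells):
        count = table.cells[key]
        if count < MIN_CELL_COUNT:
            problems.append(f"cell {key} has count {count}, below {MIN_CELL_COUNT}")
    return problems


# Illustrative counts only, not NHANES data.
table = TableOutput(
    dimensions=["genotype", "diabetes"],
    cells={("AA", "yes"): 120, ("AG", "yes"): 3, ("GG", "yes"): 40},
)
for problem in review_table(table):
    print(problem)
```

A table that returns an empty list here would still need human review under the process described above; the sketch only shows that the tabular rules are mechanical enough to pre-screen.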
Researchers using the NCHS RDC are charged for space and equipment rental and
staff time necessary for supervision, disclosure limitation review, maintenance of
computer facilities (including both hardware and software), and the creation and
maintenance of data files required by the researcher. The cost per project includes a
daily rate of $200 plus a $500 charge for new file creation.
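The fee structure above can be sketched as a simple calculation. The function name is hypothetical, and treating the $500 charge as applying once per newly created file is an assumption; the text states only the two amounts.

```python
DAILY_RATE_USD = 200        # daily rate stated above
NEW_FILE_CHARGE_USD = 500   # charge for new file creation stated above


def rdc_project_cost(days_on_site, new_files=1):
    """Estimate the RDC charge for a project.

    Assumes the new-file charge applies once per newly created data file.
    """
    return DAILY_RATE_USD * days_on_site + NEW_FILE_CHARGE_USD * new_files


# Five days on site with one new data file: 5 * 200 + 500 = 1500
print(rdc_project_cost(5))
```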
Informed consent changes to allow data to be shared more broadly, potentially through
dbGaP:
4a. Consider a new model for informed consent for future NHANES, and/or
4b. Consider re-consenting NHANES III and NHANES 1999-2002 participants.
In 1998, the National Human Genome Research Institute (NHGRI) requested that
NHANES III cell lines be added to the NIH-CDC DNA Polymorphism Discovery
Resource for the purpose of discovering polymorphisms in human DNA.
The NHGRI proposal involved a request to contact NHANES III participants for the
purposes of explaining the Polymorphism Discovery Resource and obtaining
informed consent for this research purpose. NHANES III was divided into two
phases: phase 1 (conducted from 1988-1991) and phase 2 (conducted from 1991-
1994). Each phase was a representative sample of the U.S. population; however,
since cell lines were not available for all phase 1 participants, and the Resource did
not require a representative sample, only phase 1 participants were re-contacted to
preserve the phase 2 samples for future studies. Participants were re-contacted and
re-consent was obtained by the NHANES III data collection contractor. Cell line
samples from participants who agreed to the research were sent to the Coriell
Institute. A randomly assigned ID was attached to the samples and no information
other than race/ethnicity was released.
Appendices
The NHANES III genotyping for this project was performed largely by medium-
throughput TaqMan and MGB Eclipse assays by the Core Genotyping Facility at
the National Cancer Institute (NCI) and the Division of Laboratory Services of the
National Center for Environmental Health at CDC. Quality assurance and quality
control criteria were established and implemented by NCHS, and a total of 90
polymorphisms in 50 genes were available for analyses. A manuscript detailing
the population-based allele frequencies and genotype prevalence of
polymorphisms in the U.S. population has been submitted for publication.
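To illustrate what population-based genotype prevalence and allele frequency estimates involve for a weighted sample, the sketch below computes point estimates from per-participant sample weights. The function names and the example records are hypothetical. It assumes one biallelic variant with two-letter genotype codes, and it does not produce variance estimates, which would require the survey's strata and primary sampling unit information and design-based software.

```python
from collections import defaultdict


def weighted_genotype_prevalence(records):
    """records: iterable of (genotype, sample_weight), e.g. ("AG", 1250.0).

    Returns the weighted share of the population carrying each genotype.
    """
    totals = defaultdict(float)
    for genotype, weight in records:
        # Treat "AG" and "GA" as the same unordered genotype.
        totals["".join(sorted(genotype))] += weight
    total_weight = sum(totals.values())
    return {g: w / total_weight for g, w in totals.items()}


def allele_frequencies(genotype_prevalence):
    """Each genotype contributes half of its share to each of its two alleles."""
    freqs = defaultdict(float)
    for genotype, share in genotype_prevalence.items():
        for allele in genotype:
            freqs[allele] += share / 2
    return dict(freqs)


# Illustrative records only, not NHANES data.
records = [("AA", 1000.0), ("AG", 500.0), ("GA", 500.0), ("GG", 2000.0)]
prevalence = weighted_genotype_prevalence(records)
print(prevalence)
print(allele_frequencies(prevalence))
```

Using the weights rather than raw counts is what makes the estimates refer to the U.S. population instead of to the sample, which matters here because some groups are oversampled.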
To fulfill the second goal of the project, investigators from the working group
developed and submitted research proposals to examine associations between
genetic variants of interest and numerous phenotypic data available in the
NHANES III public-use datasets. Statistical analyses for over 35 research
proposals are underway, and several investigators have presented their work at
scientific conferences and are preparing manuscripts and reports.
Throughout this process, the role of NOPHG has been to provide overall
leadership for and coordination of the project, develop collaborative research
proposals for the prevalence analysis and for genotype-phenotype analyses, and
conduct the bulk of statistical analysis for the proposed studies. Statistical
analyses of the genomic data have taken place within the Research Data Center
(RDC) of NCHS in Hyattsville, Maryland.
Figure A.1. NHANES data provide a unique resource for dissecting gene-disease
associations by facilitating analyses of the associations between genetic
variants, environmental factors, and endophenotypes/intermediate outcomes, such
as known markers or risk factors for common diseases.
[Diagram labels: Genes; Disease; Endophenotypes; Intermediate Outcomes;
Environmental Variables]
An initiative of the scope of BGD requires engaging internal and external partners
in a collaborative effort. The necessary scientific, technical, strategic, and
financial resources will be brought together through a public-private partnership
established by the CDC Foundation (CDCF). The CDCF will work to forge this
partnership from interested government, academia, industry, and non-profit
sector organizations and parties.
The BGD Working Group, under the leadership of NOPHG and with membership
from across CDC’s National Centers and the Office of the Director, has been
established to develop CDC activities and to ensure that all activities performed
by CDC under this initiative are in the best interest of the public’s health and
consistent with CDC’s authorities. To this end, four subgroups have been
established to address various issues of project implementation (Figure A.2):
2) Analysis and Statistics: The charge of the group is to evaluate the analytic
challenges of NHANES genomic data and to develop an analytic plan for
BGD, in the context of the proposed access options, and development of a
plan to address these challenges. Issues under consideration include:
evaluation of genotyping quality control; detection of and adjustment for
population stratification; assessment of structural variants; statistical
methods for evaluating the relationship between the variations in genomic
structure and function; statistical analysis of genetic associations, and
gene-gene and gene-environment interactions. Appropriate analysis of
genome-wide data from NHANES will require additional statistical
methods development given that all data analyses must account for the
clustered complex survey design.
3) Data Access: The group is actively exploring and developing options for
access to data collected by statistical agencies while protecting privacy
and confidentiality. Issues for discussion and resolution include:
identifying and meeting analytic needs while meeting privacy and
confidentiality requirements; the advantages and disadvantages of the
different access mechanisms; and alternative data access models and
options to consider.
Introduction
The origin of the NHANES survey was the National Health Survey Act of 1956.
This act formulated the need for national population surveys to assess the extent
of illness and disability in the U.S. population. The National Health Interview
Survey (NHIS) was created to respond to this public health need, and the first
NHIS was fielded in July 1957. The need for additional data on health status that
could best, or only, be assessed using direct physical measures was recognized,
and in 1958 planning for the first National Health Examination Survey (NHES)
was begun. An early decision was to collect these measures in a standardized
environment, and this led to the construction of mobile examination centers to
collect this information. These mobile examination centers could be moved from
one location to another so that all data collected in the survey would utilize
standard procedures and equipment. A limited set of biological specimens was
included in the first NHES survey and in the subsequent surveys on children and
adolescents (NHES II and NHES III). In 1971, a nutrition component was added
to the NHES and the survey became known as the National Health and Nutrition
Examination Survey (NHANES). The number of biomarkers collected in the
NHANES I survey (1971-1975) was much greater than the number collected in
any of the three NHES surveys in the 1960s. In addition, the survey covered a
much wider age range, 1-74 years, than any of the previously conducted NHES
surveys. Tests were completed on whole blood, serum and urine samples.
Some of the new biomarkers were added to address specific nutrition issues, but
others were added to provide national reference data for selected immunization
and infectious disease assessments. There was a further expansion of the
number of biomarkers collected in NHANES II (1976-1980). NHANES II included
the first environmental assessments (blood lead and selected pesticides). During
NHANES III (1988-1994) the number of blood and urine biomarkers increased
significantly; and for the first time, blood lymphocytes were collected in
anticipation of advances in genetic research.
Survey Content
NHANES collects data on the prevalence of conditions in the population.
Estimates for previously undiagnosed conditions, as well as those known to and
reported by survey respondents, are produced. Risk factors, lifestyle, heredity, or
environmental factors are examined. Smoking, alcohol consumption, sexual
practices, drug use, physical fitness and activity, weight, and dietary intake are
also included in the survey content. Data on certain aspects of reproductive
health, such as use of oral contraceptives and breastfeeding practices, are also
collected. The diseases, medical conditions, and health indicators studied in the
current NHANES include:
• Anemia
• Cardiovascular disease
• Diabetes
• Environmental exposures
• Hearing loss
• Infectious diseases
• Kidney disease
• Nutrition
• Obesity
• Oral health
• Osteoporosis
• Physical fitness and physical functioning
• Reproductive history and sexual behavior
• Respiratory disease (asthma, chronic bronchitis, emphysema)
• Sexually transmitted diseases
• Vision and eye diseases
The sample for the survey is selected to represent the U.S. population of all ages. To
produce reliable statistics, NHANES currently over-samples persons 60 and older,
African Americans, and Mexican Americans.
During the examination all participants have their pulse and/or blood pressure
measured. Dietary interviews and body measurements are included for everyone.
Participants age one and older have a blood sample collected. DNA samples are
collected from consenting adults aged 20 years or older. Depending upon the age of the
participant, the rest of the examination includes tests and procedures to assess the
various aspects of health listed above. In general, the examinations become more
extensive with participant age.
Survey Operations
In each location, local health and government officials are notified of the upcoming
survey. Households in the survey receive a letter from the NCHS Director to introduce
the survey.
program activities. The U.S. Department of Agriculture and NCHS cooperate in planning
and reporting dietary and nutrition information from the survey. NHANES’ partnership
with the U.S. Environmental Protection Agency allows continued study of the many
important environmental influences on our health.
• Past surveys have provided data to create the growth charts used nationally by
pediatricians to evaluate children’s growth. The charts have been adapted and adopted
worldwide as a reference standard and have recently been updated using the latest
NHANES figures.
• Blood lead data were instrumental in developing policy to eliminate lead from gasoline
and from food and soft drink cans. Recent survey data indicate the policy has been even
more effective than originally envisioned, with a decline in elevated blood lead levels of
more than 70% since the 1970s.
• Information collected in the survey assists the Food and Drug Administration in
deciding if there is a need to change vitamin and mineral fortification regulations for the
nation’s food supply.
• New measures of lung function assist in the understanding of respiratory disease and
better describe the burden of asthma in the United States.
to information collected from previous surveys to assess trends. This allows health
planners to detect the extent to which various health problems and risk factors have changed in
the U.S. population over time. By identifying the health care needs of the population,
government agencies and private sector organizations can establish policies and plan
research, education, and health promotion programs that help improve present health
status and prevent future health problems.
For a statistical data collection agency such as NCHS, data release and access policies
are developed at the organization level and for specific surveys or data collections.
Human subjects protection requires that participants sign informed consent statements,
and good research practice dictates that data collection systems include a written data
release policy as part of the data collection protocol. Data release and access policies
must reflect the informed consent protocols and any legislation that governs the data
collection. Practices must then be developed to implement the policies, and these practices
will be judged on how well they meet the requirements of the legislation and the informed
consent. They must be consistent with current best practices, including data
stewardship and the adoption of mechanisms that minimize risk, such as limiting the number
of people with access to information that increases risk and limiting the risky
information released to only what is needed for the task. Policies and practices
should be publicly available.
The policies and practices that relate to NHANES are closely tied to the informed
consent statements used to obtain participation from sample persons, NCHS’
authorizing legislation, and the Principles and Practices of a Federal Statistical Agency.
NCHS, as a federal statistical agency designated by OMB and governed by the Federal
Statistics Confidentiality Order issued by OMB in 1997 (62 FR 35044), must abide by
the stated requirements for data stewardship and confidentiality protection
(http://www.whitehouse.gov/omb/inforeg/conf-order.pdf). Any change to the informed
consent document that requests permission from the participant for broader data sharing
needs to abide by the authorizing legislation and the data stewardship requirements
mentioned above.
Additional NCHS policies and practices that support stewardship and confidentiality
protection are outlined in “How NCHS Protects Your Privacy”
(http://www.cdc.gov/nchs/about/policy/confiden.htm) and the NCHS staff manual
(http://www.cdc.gov/nchs/data/misc/staffmanual2004.pdf). All information on NHANES
operations, including the informed consent protocol and data release policies, is
available to the public, with the most recent version available on the NCHS web site.
Plans for data release and the protection of confidentiality undergo IRB and OMB
review.
The section of NCHS’s authorizing legislation that deals with confidentiality and data
release is Section 308(d) of the Public Health Service Act (42 U.S.C. 242m).
According to Section 308(d), consent to release identifiable data must be obtained from
the survey participant. This would be true even if Section 308(d) were not mentioned in
the consent, since all NCHS data collections are covered by 308(d). Confidentiality must
be protected even if no promise to do so is included in the consent. In order to release
identifiable data, consent must be obtained from the participant.
NCHS data collections are also covered by the Privacy Act, which requires that
identifiable data be stored securely and restricts access to personal information. In
2002, Congress passed the Confidential Information Protection and Statistical Efficiency
Act (CIPSEA), which deals directly with the protection of statistical data collected under a
pledge of confidentiality. In 2007, OMB released guidance for the adoption of CIPSEA
provisions. Of note, willful disclosure of confidential statistical information is a
“class E” felony under this Act, subject to imprisonment for up to 5 years, a fine of
up to $250,000, or both.
Similarly, NCHS and about a dozen other federal statistical agencies are governed by
the statistical confidentiality order issued by OMB. This order establishes a floor of
confidentiality protections similar to those established by law in CIPSEA. In effect, this
OMB order directs statistical agencies to provide survey participants with an informed
consent statement and to then limit disclosure of identifiable information to the
conditions and uses specified in the consent statement.
DNA was collected during the second phase of NHANES III (1991-1994) and during NHANES
1999-2002, and is currently being collected during NHANES 2007-2008. Different
informed consent statements have been used in each survey.
Informed Consent for NHANES 1991-1994 (NHANES III Phase 2): There was no
explicit mention of DNA testing, but the consent does state that blood would be stored
for future laboratory tests. There is a general consent for participation in the survey.
Information about the confidentiality of the data is found in several places in the consent
materials:
“We respect your privacy. The confidentiality of all the information you
give us is protected by public law.”
From the informed consent signature page and all data collection documents:
“Information contained on this form which would permit identification of
any individual or establishment has been collected with a guarantee that it
will be held in strict confidence, will be used only for purposes stated for
this study and will not be disclosed or released to others without the
From the NHANES 2000 consent for specimen storage and further studies:
We will keep strictly private all health data and samples that we collect in
NHANES. Our staff is not allowed to discuss that any person is part of
this survey under penalty of Federal laws: Section 308(d) of the Public
Health Service Act (42 USC 242m) and the Privacy Act of 1974 (5 USC
552A).
“Q - What genetic studies will be done and what part will my DNA
sample play? (DNA samples will be collected only on those ages 20
or over.)
A - Genetic studies look at the DNA found in cells. We will store part of the
blood and saliva sample that we collect in the exam center for future
genetic studies. We will keep this material for an unlimited time. Studies of
human genes are helping us learn about many diseases and health
conditions. The information from people who are part of NHANES could
help that effort.
If you wish to have your samples used for future genetic studies, you will
have a chance to say so when you sign this consent form.”
“Genetic testing studies may be done with DNA samples collected only on
those ages 20 or over. If you wish to have your samples used for future
genetic studies, check the box below:
[Note: the NHANES 1999 consent used similar, but not identical, language to
that of NHANES 2000 for future genetic studies.]
“We will hold all data we collect in the strictest confidence. We gather and
protect all data in keeping with the requirements of Federal Laws: the
Public Health Service Act (42 USC 242k) authorizes collection and
Section 308(d) of that law (42 USC 242m) and the Privacy Act of 1974 (5
USC 552A) prohibit us from giving out information that identifies you or
your family without your consent. This means that we cannot give out any
fact about you, even if a court of law asks for it. However, if we find signs
of child abuse during an exam, we will report it to the local department of
social services or appropriate law enforcement agency. We will keep all
survey data safe and secure. When we allow researchers to use survey
data, we protect your privacy. We assign code numbers in place of
names or other facts that could identify you.”
In summary, the consent materials for NHANES 1999-2002 promise that no information
that could identify a participant will be released.
Informed Consent for NHANES 2007-2008: There was a separate consent for
collection and storage of DNA from those age 20 years or more, and general
information about the confidentiality of the data is found in several places in the consent
materials:
From the consent for collection and storage of DNA from those age 20 years or
more:
Q - Why will a sample of my DNA be kept for future health studies?
A - We will store part of the blood sample that we collect in the exam
center for future genetic studies. These samples will be frozen and kept in
a specimen bank for as long as they last. Your participation is voluntary
and no loss of benefits will result if you refuse.
A - Genes are the “instruction book” for people. Genes are made out of
DNA. The DNA of a person is about 99.9% the same as the DNA of
another person, but no two people have the same DNA except identical
twins. Differences in DNA are called genetic variations and explain
differences such as eye color and partly explain why some people get
certain diseases. To look at these variations many genetic tests may be
done on your blood sample. We will keep the DNA for an unlimited time.
Studies of human genes are helping us learn about many diseases and
health conditions. The information from people who are part of NHANES
could help that effort.
People conducting these studies will not contact NHANES participants for
any additional information.
We will keep strictly confidential all health data and samples that we
collect in NHANES, as required by Federal law. By confidential we mean
that the information that we release to the public can not be used to
identify you. Our staff is not allowed to discuss that any person is part of
this survey under penalty of Federal laws: Section 308(d) of the Public
Health Service Act (42 USC 242m) and the Privacy Act of 1974 (5 USC
552A).
Q - Who can use the stored DNA samples for further study?
A - Most studies using DNA samples will simply add to our knowledge
of health and disease. Therefore, we do not plan to contact you with
individual results from these studies. Periodically we will announce on our
web site general results from the studies being conducted,
(http://www.cdc.gov/nchs/nhanes.htm). To get more general information
about a particular study, you can call our toll-free number, 1-800-452-6115.
Q - What are the benefits and risks for giving a blood sample for
future genetic studies?
A - You will not directly benefit but these studies may eventually
help the health of people in the future. The risk of giving a sample
includes the minor risk associated with taking the blood sample. There
may also be a risk that some people may use the information from the
genetic studies to exaggerate or downplay differences among people. The
ethics board that will review all studies using these samples will attempt to
prevent any misuse of the information gained from the NHANES DNA
samples.
information that identifies you or your family without your consent. Any
NHANES employee who violates the law may be convicted of a class E
felony and imprisoned for up to 5 years, or fined as much as $250,000.”
We respect your privacy. Public laws keep all information you give private.
These laws do not allow us to give out data that identifies you or your
family without your permission. This means that we cannot give out
any facts about you, even if a court of law asks for it. However, if we
find signs of child abuse during an exam, we will report it to the local
department of social services or the police.
We will keep all survey data safe and secure. When we share data
with our partners, we do so in a way that protects your privacy as
required and guaranteed by law. Our interviewer can provide you a list
of our partners if you wish to learn more.”
Public-use files: NCHS releases volumes of NHANES data within the confines of
these confidentiality authorities by taking steps to de-identify information collected from
individual participants. Since these public use files do not contain information
considered to directly or indirectly identify individuals, they are available freely to
researchers and do not need to be accessed in controlled situations.
Potentially identifiable data files: Beyond these public use files, data files that do
include variables that would make the data identifiable are made available to
researchers, but under more carefully controlled circumstances. NCHS has developed
a Research Data Center and remote access system which provide data access while
protecting confidentiality.
The first critical issue that needs to be addressed in order to determine the most
appropriate release strategies for databases that include genetic variation data is
whether the addition of the genetic testing results to existing public-use files produces a
combined file that could identify individuals. The existence of a growing number of DNA
collections through which a match could be made is discussed in the recent editorial in
If the release of genetic data makes the file identifiable, then existing confidentiality
requirements would require that a number of special steps be taken when releasing the
NHANES genetic data to a wider audience. Specifically, these steps could include any
or all of the four options for controlled data access outlined in the main document: a
reengineered remote access system; additional Research Data Centers; designated
agent agreements; or informed consent changes.
The primary difference between the current NHANES model for data accessibility and
those used by other organizations, such as the dbGaP model used by NIH (see
Appendix E), is that in other models, the responsibility for protecting confidentiality is
transferred to the research analyst, with limited responsibility for oversight by the data
steward or the data collectors, once a researcher goes through the ‘front end’ process.
The NHANES model is to have the data steward (NCHS) retain this responsibility
through a variety of mechanisms, including maintaining a separation between the
researcher and the information that could be used to identify a participant. The
researcher rarely needs to see the ‘risky’ information but the mechanisms needed to
maintain the separation can limit the researcher’s analytic flexibility.
1 Lowrance WW, Collins FS. (2007) Ethics. Identifiability in genomic research. Science. Aug 3;317(5838):600-2.
NCHS data release policies are determined by both participant consent and federal law.
Past and current NHANES data collections promised that no personally identifiable data
would be released. In order to release identifiable data for GWAS, new
approaches to the consent process for future NHANES data collections and re-consent
of past participants would be required. It will be necessary to carefully evaluate the
impact of any such changes to the consent on survey operations, response rates and
data quality before moving in that direction.
Re-consent of the 7,157 NHANES III participants who were age 12 or more years in
1991-1994 and the 7,962 NHANES 1999-2002 participants age 20 or more years would
require tracking and re-contact by NHANES interviewers. New consent documents that
address the sharing of genetic data with the research community through unsupervised
data release mechanisms (such as dbGaP) will need to be developed, cognitively
tested, and pilot tested to assure that participants understand the benefits and
consequences of release of their genetic data. Different consent documents may be
needed for NHANES III participants, who never consented to genetic research, and for
NHANES 1999-2002 participants, who consented but would now be asked to allow the
release of potentially identifiable data. The cost of re-consent would be substantial. In 1998 it
cost $988,500 to re-consent 545 phase 1 NHANES III participants for NIH’s
Polymorphism Discovery Resource project. These participants were not a random
subsample but were selected because of their race-ethnicity and because their location limited
travel for the interviewers. Only their DNA results were placed in the database; no other
NHANES data were included. It is reasonable to estimate that the process to contact
and re-consent all NHANES III and NHANES 1999-2002 participants would cost millions of dollars.
Currently, NCHS is in the process of obtaining cost estimates for potential re-consent
activities.
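The extrapolation behind the "millions of dollars" estimate can be sketched as simple arithmetic. The Python snippet below is illustrative only and is not an NCHS cost model: it scales the 1998 per-participant cost to all participants, assuming (purely for illustration) an unchanged per-participant cost, no inflation adjustment, and an attempt to re-consent every participant. All function and constant names are hypothetical. Because the 545 participants were selected in part to limit interviewer travel, the actual per-participant cost could differ.

```python
# Figures quoted in the text above.
PILOT_COST_USD = 988_500               # 1998 cost of the earlier re-consent effort
PILOT_PARTICIPANTS = 545               # participants re-consented in that effort
NHANES_III_PARTICIPANTS = 7_157        # NHANES III participants to re-consent
NHANES_1999_2002_PARTICIPANTS = 7_962  # NHANES 1999-2002 participants to re-consent


def cost_per_participant(total_cost_usd: float, participants: int) -> float:
    """Average cost of re-consenting one participant (hypothetical helper)."""
    return total_cost_usd / participants


def naive_reconsent_estimate(unit_cost_usd: float, participants: int) -> float:
    """Linear extrapolation: ASSUMES the same unit cost for every participant."""
    return unit_cost_usd * participants


unit_cost = cost_per_participant(PILOT_COST_USD, PILOT_PARTICIPANTS)
total_participants = NHANES_III_PARTICIPANTS + NHANES_1999_2002_PARTICIPANTS
estimate = naive_reconsent_estimate(unit_cost, total_participants)

print(f"Cost per participant (1998 dollars): ${unit_cost:,.2f}")
print(f"Participants to re-consent: {total_participants:,}")
print(f"Naive total estimate (1998 dollars): ${estimate:,.0f}")
```

Under these assumptions the per-participant cost is roughly $1,800 and the total is on the order of $27 million, which is consistent with the "millions of dollars" statement. It does not replace the cost estimates that NCHS is obtaining.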
The success of a re-consent initiative will depend on the number of participants that are
successfully contacted and who consent to the new language. There are obvious
challenges in re-consenting participants with samples in the NHANES DNA bank. In
some cases, it has been up to sixteen years since the participant’s last contact with the
survey staff. During this time many participants may have changed addresses several
times and some may be deceased. For those who are contacted, response rates cannot
be predicted, but there will certainly be some that do not wish to have their genetic data,
which could be used to identify them, shared publicly. Therefore, it is entirely possible
that the re-consent process will yield a selection of participants who are no longer
representative of the U.S. population, in which case the major benefit of the NHANES
DNA samples may be lost.
There are other potential effects of re-consent and/or changes to future consent that
should be considered because of their broader implications for the NHANES data collections as
well as for NCHS. Releasing NCHS data to researchers under the dbGaP model would
mean that NCHS could no longer monitor sensitive and potentially identifiable data for
appropriate use. Participants could potentially be identified through the matching of
NHANES genetic data to future databases that contain personally identifiable
information. Such a match would put all NHANES data collected on that participant at risk of
disclosure, which could harm the participant. A public breach of confidentiality could have
a negative impact on current and future NHANES data collections and on all data
collection activities at NCHS and other parts of CDC.
In 2006, the National Institutes of Health (NIH) launched the database of Genotypes
and Phenotypes (dbGaP), which was designed to archive and distribute data from
genome-wide association studies (GWAS). dbGaP provides controlled access to
individual-level data, and open access to summary data and study documentation,
including summaries of the measured variables in an organized and searchable web
format.
The NIH described the reasoning behind dbGaP as follows: “The NIH has concluded
that the full value of GWAS can be realized only if the genotype and phenotype datasets
derived from GWAS are made available as rapidly as possible to a wide range of
scientific investigators. The NIH recognizes that GWAS data release practices must be
consistent with the informed consent provided by individual participants. The NIH
considers broad access to data to be particularly important to GWAS because of the
significant resources involved, the serious analytical challenges involved in such large
datasets, and the powerful opportunities that will be provided by the ability to make
comparisons across multiple studies.”2
In 2007, the NIH issued its “Policy for Sharing of Data Obtained in NIH Supported or
Conducted Genome-wide Association Studies (GWAS)”3 which includes the following
data submission and data access guidance intended to promote access while protecting
privacy and confidentiality:
2 NIH, Genome-wide Association Studies (GWAS) Policy Background. http://grants.nih.gov/grants/gwas/background.htm. Accessed 10/03/07.
3 NIH Policy for Sharing of Data Obtained in NIH Supported or Conducted Genome-wide Association Studies (GWAS). http://edocket.access.gpo.gov/2007/pdf/E7-17030.pdf. Accessed 10/03/07.
o The data submission is consistent with all applicable laws, regulations and
institutional policies;
o The appropriate research uses and exclusions are delineated;
o The identities of the research participants will not be disclosed to the NIH
data repository [Note: A significant difference in comparison to NHANES,
where a federal agency has access to participant identities]; and
o An IRB and/or Privacy Board has reviewed and verified that: the
submission and sharing of data is consistent with the informed consent;
the investigator’s plan for deidentifying the data is consistent with NIH
policy; risks to individuals, families, and groups associated with the
submitted data have been considered; and that the submitted data were
collected in a manner consistent with human subjects regulations.
• Submitting investigators may request removal of data on individual participants
upon withdrawal of consent.
The consequences of failure to comply with the NIH guidance for data access are
unclear. Once released to individual researchers, the dataset can no longer be
protected. For the dbGaP data, the responsibility for protecting the research subject’s
confidentiality seems to fall on the individual researcher; however, for NHANES data,
that responsibility falls on the federal government. Because misuse could undermine
confidence in federal data collections, its implications are potentially much greater for
the federal government than for an individual researcher.
The three-generation Framingham Heart Study serves as a case study for local access
to individually identifiable GWAS data with parallels to NHANES, including similarity in
size with more than 15,000 participants and 13,000 variables4. Requests for access to
Framingham data are submitted through the standard dbGaP request system6, and
include a Research Use Statement specifying the hypotheses or questions to be
addressed in the proposed data analysis, the phenotypes and covariates on which the
analysis will focus, the clinical events that may be needed, any exclusions that are
expected to be part of the analytic approach, a description of the adequacy of the
computing facilities to complete the proposed analyses, and a detailed description of the
proposed analytic methods so that reviewers can determine whether the proposed key
personnel have the qualifications to complete the proposed research. The request also
includes a list of all collaborating investigators in the organization; collaborators at
different organizations must complete their own request for data use because
organizations are accountable for the actions of individuals. Since Framingham
research is always considered to be human subjects research due to the small, defined
population5, the request for data access must also include supplemental information
including documentation of IRB approval and human subjects training, a data security
plan, completed confidentiality awareness forms for all staff with access to the data, and
key personnel biosketches for determining qualifications to complete the proposed
research.6
4 NCBI Database of Genotypes and Phenotypes (dbGaP). http://www.ncbi.nlm.nih.gov/sites/entrez?db=gap. Accessed 10/10/07.
5 Participant Protection Policy FAQ: The Framingham Heart Study. http://0-www.ncbi.nlm.nih.gov.catalog.llu.edu/projects/gap/cgi-bin/GetPdf.cgi?id=phd000317. Accessed 10/03/07.
6 Instructions to apply for NHLBI authorized access datasets in dbGaP. http://dbgap.ncbi.nlm.nih.gov/aa/wga.cgi?view_pdf&stacc=phs000007.v1.p1. Accessed 10/03/07.