HUMAN GENOME PROJECT
Human Genome Project Overview
The Human Genome Project (HGP) was an international, collaborative research program whose goal was the complete mapping and understanding of all the genes of human beings. The 13-year effort, formally begun in October 1990 and completed in 2003, set out to discover all of the estimated 20,000-25,000 human genes and make them accessible for further biological study. Another project goal was to determine the complete sequence of the 3 billion DNA subunits (bases) in the human genome. As part of the HGP, parallel studies were carried out on selected model organisms, such as the bacterium E. coli and the mouse, to help develop the technology and interpret human gene function. The DOE Human Genome Program and the NIH National Human Genome Research Institute (NHGRI) together sponsored the U.S. Human Genome Project.
The HGP was one of the great feats of exploration in history: an inward voyage of discovery rather than an outward exploration of the planet or the cosmos, and an international research effort to sequence and map all of the genes, together known as the genome, of members of our species, Homo sapiens. Completed in April 2003, the HGP gave us the ability, for the first time, to read nature's complete genetic blueprint for building a human being.
In April 2000, the Human Genome Project international consortium announced that 2 billion of the 3 billion “letters” that constitute the genetic instruction book of humans had been deciphered and deposited into GenBank. GenBank, the public database of DNA sequence operated by the National Institutes of Health, is accessible freely and without restrictions to all scientists in industry and academia. At that time the project was assembling 12,000 bases every minute, and 15 billion raw base pairs had been sequenced to reach the two billion. Each area of a chromosome was sequenced at least four to five times (the “depth of coverage”) to ensure that the deposited data were accurate.
Scientists have been quick to mine this new trove of genomic data, as well as to utilize the genomic tools and technologies developed by the Human Genome Project. For example, when the Human Genome Project began in 1990, scientists had discovered fewer than 100 human disease genes. Today, more than 1,400 disease genes have been identified.
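The sequencing figures quoted above (15 billion raw base pairs to reach two billion deposited bases, at 12,000 bases per minute) imply an average redundancy that can be checked with simple arithmetic. The sketch below is illustrative only: the function and constant names are assumptions introduced here, and it treats the raw reads as spread evenly over the deposited sequence, which real coverage is not.

```python
# Illustrative arithmetic only; names are hypothetical, figures come from the text.

RAW_BASES_SEQUENCED = 15_000_000_000   # raw base pairs read
BASES_DEPOSITED = 2_000_000_000        # bases deposited in GenBank by April 2000
BASES_PER_MINUTE = 12_000              # quoted rate


def average_coverage(raw_bases: int, assembled_bases: int) -> float:
    """Return the average number of times each assembled base was read."""
    if assembled_bases <= 0:
        raise ValueError("assembled_bases must be positive")
    return raw_bases / assembled_bases


coverage = average_coverage(RAW_BASES_SEQUENCED, BASES_DEPOSITED)
bases_per_day = BASES_PER_MINUTE * 60 * 24  # minutes per hour * hours per day

print(f"Average depth of coverage: {coverage:.1f}x")
print(f"Bases per day at the quoted rate: {bases_per_day:,}")
```

An average of 7.5 reads per base is consistent with the stated minimum of four to five reads for each area, since an average sits above the minimum.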
In 1988 DOE and NIH signed a Memorandum of Understanding in which the agencies agreed to work together, coordinate technical research and activities, and share results. The two agencies assumed a joint systematic approach toward establishing goals to satisfy both short- and long-term project needs.
Early guidelines projected three 5-year phases, and the first plan was presented to Congress in 1990. The 1990 plan emphasized the creation of chromosome maps, software, and automated technologies to enable sequencing. By 1993, unexpectedly rapid progress in chromosome mapping required updating the goals, which then projected through 1998. The plan was revised again in anticipation of the high-throughput sequencing phase of the project, and the transition to that phase came early as many more genome sequencing projects were funded. The second and third phases of the project were to optimize resources, refine sequencing strategies, and, finally, completely determine the sequence of all base pairs in the genome.
The International Human Genome Sequencing Consortium included hundreds of scientists at 20 sequencing centers in China, France, Germany, Great Britain, Japan and the United States. The five institutions that generated the most sequence were: Baylor College of Medicine, Houston; Washington University School of Medicine, St. Louis; Whitehead Institute/MIT Center for Genome Research, Cambridge, Mass.; DOE’s Joint Genome Institute, Walnut Creek, Calif.; and The Wellcome Trust Sanger Institute near Cambridge, England.
Another area of DOE and NIH cooperation was in exploring the ethical, legal, and social issues (ELSI) arising from increased availability of genetic data and growing genetic-testing capabilities. The two agencies established a joint working group to confront these ELSI challenges and cosponsored joint projects and workshops.
From October 1990, the project was supported jointly by DOE and the National Institutes of Health (NIH) National Human Genome Research Institute (formerly the National Center for Human Genome Research). Together, the DOE and NIH components made up the largest centrally coordinated biology research project ever undertaken.
By 1985, progress in genetic and DNA technologies had led to serious discussions in the scientific community about initiating a major project to analyze the structure of the human genome. After concluding that a DNA sequence would offer the most useful approach for detecting inherited mutations, DOE in 1986 announced its Human Genome Initiative. The initiative emphasized development of resources and technologies for genome mapping, sequencing, computation, and infrastructure support that would culminate in a complete sequence of the human genome.
The flagship effort of the Human Genome Project was producing the reference sequence of the human genome. The international consortium announced the first draft of the human sequence in June 2000. From then on, researchers worked to convert the “draft” sequence into a “finished” sequence. Finished sequence is a technical term meaning that the sequence is highly accurate (with fewer than one error per 10,000 letters) and highly contiguous (with the only remaining gaps corresponding to regions whose sequence cannot be reliably resolved with current technology). That standard was first achieved for a human chromosome when a team of British, Japanese and U.S. researchers produced a finished sequence for human chromosome 22 in 1999. The finished sequence produced by the Human Genome Project covers about 99 percent of the human genome's gene-containing regions, and it has been sequenced to an accuracy of 99.99 percent.
In addition, to help researchers better understand the meaning of the human genetic instruction book, the project took on a wide range of other goals, from sequencing the genomes of model organisms to developing new technologies to study whole genomes. As of April 14, 2003, all of the Human Genome Project’s ambitious goals had been met or surpassed.
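The “finished” standard described above (fewer than one error per 10,000 letters, stated elsewhere as 99.99 percent accuracy) can be restated numerically. This is a minimal sketch, assuming the round genome size of 3 billion bases used in the overview and treating the error threshold as an upper bound; the function and constant names are hypothetical.

```python
# Illustrative arithmetic only; names are hypothetical.

GENOME_BASES = 3_000_000_000   # approximate genome size used in the text
BASES_PER_ERROR = 10_000       # finished standard: fewer than 1 error per 10,000 bases


def accuracy_percent(bases_per_error: int) -> float:
    """Convert 'one error per N bases' into a percent accuracy."""
    return 100.0 * (1 - 1 / bases_per_error)


def max_errors(total_bases: int, bases_per_error: int) -> int:
    """Upper bound on erroneous bases if the whole sequence just meets the standard."""
    return total_bases // bases_per_error


print(f"Accuracy: {accuracy_percent(BASES_PER_ERROR):.2f}%")
print(f"Errors across the genome: fewer than {max_errors(GENOME_BASES, BASES_PER_ERROR):,}")
```

The two phrasings of the standard are therefore equivalent: one error per 10,000 bases is the same threshold as 99.99 percent accuracy.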
Human Genome Project Goals
Begun formally in 1990, the U.S. Human Genome Project was a 13-year effort coordinated by the U.S. Department of Energy and the National Institutes of Health. The project originally was planned to last 15 years, but rapid technological advances accelerated the completion date to 2003. Project goals were to
1. identify all the approximately 20,000-25,000 genes in human DNA,
2. determine the sequences of the 3 billion chemical base pairs that make up human DNA,
3. store this information in databases,
4. improve tools for data analysis,
5. transfer related technologies to the private sector, and
6. address the ethical, legal, and social issues (ELSI) that may arise from the project.
To help achieve these goals, researchers also studied the genetic makeup of several nonhuman organisms. These include the common human gut bacterium Escherichia coli, the fruit fly, and the laboratory mouse. Besides delivering on the stated goals, the international network of researchers has produced an amazing array of advances that most scientists had not expected until much later. These "bonus" accomplishments include: an advanced draft of the mouse genome sequence, published in December 2002; an initial draft of the rat genome sequence, produced in November 2002; the identification of more than 3 million human genetic variations, called single nucleotide polymorphisms (SNPs); and the generation of full-length complementary DNAs (cDNAs) for more than 70 percent of known human and mouse genes.
Timeline & Cost
When the Human Genome Project was launched in 1990, many in the scientific community were deeply skeptical about whether the project’s audacious goals could be achieved, particularly given its hard-charging timeline and relatively tight spending levels. At the outset, the U.S. Congress was told the project would cost about $3 billion in FY 1991 dollars and would be completed by the end of 2005. In actuality, the Human Genome Project was finished two and a half years ahead of time and, at $2.7 billion in FY 1991 dollars, significantly under original spending projections.
The completion of the human DNA sequence in the spring of 2003 coincided with the 50th anniversary of Watson and Crick's description of the fundamental structure of DNA. The analytical power arising from the reference DNA sequences of entire genomes and other genomics resources has jump-started what some call the "biology century."
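The schedule and budget comparison above can be made concrete with a short calculation. This is a sketch under stated assumptions: "the end of 2005" is taken as 31 December 2005, and completion is taken as 14 April 2003, the date this document gives for the goals being met. The variable names are introduced here for illustration.

```python
from datetime import date

# Costs in millions of FY 1991 dollars, as quoted in the text.
PROJECTED_COST_MILLIONS = 3_000
ACTUAL_COST_MILLIONS = 2_700

# Assumed calendar dates (see the lead-in); the text gives only
# "the end of 2005" and "the spring of 2003".
PROJECTED_END = date(2005, 12, 31)
ACTUAL_END = date(2003, 4, 14)

savings_millions = PROJECTED_COST_MILLIONS - ACTUAL_COST_MILLIONS
savings_fraction = savings_millions / PROJECTED_COST_MILLIONS
days_ahead = (PROJECTED_END - ACTUAL_END).days
years_ahead = days_ahead / 365.25

print(f"Under budget by ${savings_millions} million ({savings_fraction:.0%})")
print(f"Ahead of schedule by {days_ahead} days (about {years_ahead:.1f} years)")
```

Under these assumed dates the lead over the original schedule comes out slightly above the "two and a half years" quoted in the text; the exact figure depends on which completion date is used.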
Human Genome Project Completion Dates
Genetic Map
Goal: 2- to 5-cM resolution map (600 to 1,500 markers)
Achieved: 1-cM resolution map (3,000 markers)
Date achieved: September 1994
Physical Map
Goal: 30,000 STSs
Achieved: 52,000 STSs
Date achieved: October 1998
DNA Sequence
Goal: 95% of gene-containing part of human sequence finished to 99.99% accuracy
Achieved: 99% of gene-containing part of human sequence finished to 99.99% accuracy
Date achieved: April 2003
Capacity and Cost of Finished Sequence
Goal: Sequence 500 Mb/year at <$0.25 per finished base
Achieved: Sequence >1,400 Mb/year at <$0.09 per finished base
Date achieved: November 2002
Human Sequence Variation
Goal: 100,000 human SNPs
Achieved: 3.7 million mapped human SNPs
Gene Identification
Goal: Full-length human cDNAs
Achieved: 15,000 full-length human cDNAs
Model Organisms
Goal: Complete genome sequences of E. coli, S. cerevisiae, C. elegans, D. melanogaster
Achieved: Finished genome sequences of E. coli, S. cerevisiae, C. elegans, D. melanogaster, plus whole-genome drafts of several others, including C. briggsae, D. pseudoobscura, mouse and rat
Date achieved: April 2003
Functional Analysis
Goal: Develop genomic-scale technologies
Achieved: High-throughput oligonucleotide synthesis (1994); DNA microarrays (1996); eukaryotic, whole-genome knockouts (yeast) (1999); scale-up of two-hybrid system for protein-protein interaction (2002)
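The capacity and cost entries listed above (a goal of 500 Mb/year at under $0.25 per finished base, against an achieved capacity of more than 1,400 Mb/year at under $0.09 per finished base) can be translated into rough whole-genome figures. The sketch below is illustrative arithmetic only: it assumes a 3,000 Mb genome finished in a single pass, and the function names and the `scenarios` dictionary are hypothetical.

```python
# Illustrative arithmetic only; names are hypothetical, rates come from the text.

GENOME_MB = 3_000  # approximate genome size in megabases (3 billion bases)


def years_to_sequence(genome_mb: int, capacity_mb_per_year: int) -> float:
    """Years needed to finish the genome once at a given yearly capacity."""
    return genome_mb / capacity_mb_per_year


def cost_ceiling_dollars(genome_mb: int, cents_per_base: int) -> int:
    """Upper bound on cost in dollars, using integer cents to avoid rounding error."""
    return genome_mb * 1_000_000 * cents_per_base // 100


# (capacity in Mb/year, cost ceiling in cents per finished base)
scenarios = {"goal": (500, 25), "achieved": (1_400, 9)}

for label, (capacity, cents) in scenarios.items():
    years = years_to_sequence(GENOME_MB, capacity)
    cost = cost_ceiling_dollars(GENOME_MB, cents)
    print(f"{label}: about {years:.1f} years, under ${cost:,}")
```

Because the capacity is quoted as "more than" and the cost as "less than", both achieved figures are bounds rather than exact values.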
In the public-sector Human Genome Project (HGP) of the International Human Genome Sequencing Consortium (IHGSC), researchers collected blood (female) or sperm (male) samples from a large number of donors. Only a few of the many collected samples were processed as DNA resources, and donor identities were protected so that neither donors nor scientists could know whose DNA was sequenced. DNA clones from many different libraries were used in the overall project, with most of those libraries being created by Dr. Pieter J. de Jong. It has been informally reported, and is well known in the genomics community, that much of the DNA for the public HGP came from a single anonymous male donor from Buffalo, New York (code name RP11). HGP scientists used white blood cells from the blood of two male and two female donors (randomly selected from 20 of each), with each donor yielding a separate DNA library. One of these libraries (RP11) was used considerably more than the others because of its quality.
One minor technical issue is that male samples contain only half as much DNA from the X and Y chromosomes as from the other 22 chromosomes (the autosomes). This happens because each male cell contains only one X and one Y chromosome, rather than two copies as for each autosome.
Although the main sequencing phase of the HGP has been completed, studies of DNA variation continue in the International HapMap Project, whose goal is to identify patterns of single nucleotide polymorphism (SNP) groups (called haplotypes, or “haps”). The DNA samples for the HapMap came from a total of 270 individuals: Yoruba people in Ibadan, Nigeria; Japanese people in Tokyo; Han Chinese in Beijing; and the French Centre d’Etude du Polymorphisme Humain (CEPH) resource, which consisted of residents of the United States having ancestry from Western and Northern Europe.
In the Celera Genomics private-sector project, DNA from five different individuals was used for sequencing. The lead scientist of Celera Genomics at that time, Craig Venter, later acknowledged (in a public letter to the journal Science) that his DNA was one of 21 samples in the pool, five of which were selected for use. On September 4, 2007, a team led by Craig Venter published his complete DNA sequence, unveiling the six-billion-nucleotide genome of a single individual for the first time.
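The earlier point about male samples yielding only half as much X and Y chromosome DNA as autosomal DNA follows directly from copy number per cell. The sketch below simply encodes that reasoning; the dictionary layout and function name are assumptions made for illustration.

```python
# Illustrative only: relative DNA amount per chromosome type, by copies per cell.

AUTOSOME_COPIES = 2  # each autosome is present in two copies per cell

# Copies of each sex chromosome per cell.
SEX_CHROMOSOME_COPIES = {
    "male": {"X": 1, "Y": 1},
    "female": {"X": 2, "Y": 0},
}


def relative_dna_amount(copies: int, autosome_copies: int = AUTOSOME_COPIES) -> float:
    """DNA contributed by a chromosome relative to an autosome in the same cell."""
    return copies / autosome_copies


for sex, chromosomes in SEX_CHROMOSOME_COPIES.items():
    for name, copies in chromosomes.items():
        ratio = relative_dna_amount(copies)
        print(f"{sex} sample, chromosome {name}: {ratio:.1f}x the autosomal amount")
```

In a male sample both sex chromosomes come out at half the autosomal amount, which is why reads from X and Y are proportionally scarcer when male DNA is sequenced.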
What We've Learned So Far From Human Genome Project
What Does the Draft Human Genome Sequence Tell Us?
By the Numbers
• The human genome contains 3164.7 million chemical nucleotide bases (A, C, T, and G).
• The average gene consists of 3000 bases, but sizes vary greatly, with the largest known human gene being dystrophin at 2.4 million bases.
• The total number of genes is estimated at 30,000, much lower than previous estimates of 80,000 to 140,000 that had been based on extrapolations from gene-rich areas as opposed to a composite of gene-rich and gene-poor areas.
• Almost all (99.9%) nucleotide bases are exactly the same in all people.
• The functions are unknown for over 50% of discovered genes.
The Wheat from the Chaff
• Less than 2% of the genome codes for proteins.
• Repeated sequences that do not code for proteins ("junk DNA") make up at least 50% of the human genome.
• Repetitive sequences are thought to have no direct functions, but they shed light on chromosome structure and dynamics. Over time, these repeats reshape the genome by rearranging it, creating entirely new genes, and modifying and reshuffling existing genes.
• During the past 50 million years, a dramatic decrease seems to have occurred in the rate of accumulation of repeats in the human genome.
How It's Arranged
• The human genome's gene-dense "urban centers" are predominantly composed of the DNA building blocks G and C.
• In contrast, the gene-poor "deserts" are rich in the DNA building blocks A and T. GC- and AT-rich regions usually can be seen through a microscope as light and dark bands on chromosomes.
• Genes appear to be concentrated in random areas along the genome, with vast expanses of noncoding DNA between.
• Stretches of up to 30,000 C and G bases repeating over and over often occur adjacent to gene-rich areas, forming a barrier between the genes and the "junk DNA." These CpG islands are believed to help regulate gene activity.
• Chromosome 1 has the most genes (2968), and the Y chromosome has the fewest (231).
How the Human Compares with Other Organisms
• Unlike the human's seemingly random distribution of gene-rich areas, many other organisms' genomes are more uniform, with genes evenly spaced throughout.
• Humans have on average three times as many kinds of proteins as the fly or worm because of mRNA transcript "alternative splicing" and chemical modifications to the proteins. This process can yield different protein products from the same gene.
• Humans share most of the same protein families with worms, flies, and plants, but the number of gene family members has expanded in humans, especially in proteins involved in development and immunity.
• The human genome has a much greater portion (50%) of repeat sequences than the mustard weed (11%), the worm (7%), and the fly (3%).
• Although humans appear to have stopped accumulating repeated DNA over 50 million years ago, there seems to be no such decline in rodents. This may account for some of the fundamental differences between hominids and rodents, although gene estimates are similar in these species. Scientists have proposed many theories to explain evolutionary contrasts between humans and other organisms, including those of life span, litter sizes, inbreeding, and genetic drift.
Variations and Mutations
• Scientists have identified about 1.4 million locations where single-base DNA differences (SNPs) occur in humans. This information promises to revolutionize the processes of finding chromosomal locations for disease-associated sequences and tracing human history.
• The ratio of germline (sperm or egg cell) mutations is 2:1 in males vs females. Researchers point to several reasons for the higher mutation rate in the male germline, including the greater number of cell divisions required for sperm formation than for eggs.
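Several of the percentages quoted in this section become easier to grasp as absolute numbers of bases. The sketch below is illustrative arithmetic only, using the draft-sequence figures given in the text (3164.7 million bases, 99.9% identity between people, under 2% protein-coding, about 1.4 million known SNP locations). The constant names are hypothetical, and the SNP spacing is an average that assumes the locations are evenly spread, which they are not in practice.

```python
# Illustrative arithmetic only; names are hypothetical, figures come from the text.

GENOME_BASES = 3_164_700_000     # 3164.7 million bases
IDENTICAL_PER_THOUSAND = 999     # 99.9% of bases are the same in all people
CODING_PERCENT_CEILING = 2       # less than 2% of the genome codes for proteins
KNOWN_SNP_LOCATIONS = 1_400_000  # about 1.4 million SNP locations identified

# Bases that may differ between people (the remaining 0.1%).
differing_bases = GENOME_BASES * (1000 - IDENTICAL_PER_THOUSAND) // 1000

# Upper bound on the number of protein-coding bases.
coding_bases_ceiling = GENOME_BASES * CODING_PERCENT_CEILING // 100

# Average spacing between known SNP locations.
bases_per_snp = GENOME_BASES / KNOWN_SNP_LOCATIONS

print(f"Bases that may differ between people: about {differing_bases:,}")
print(f"Protein-coding bases: fewer than {coding_bases_ceiling:,}")
print(f"Average spacing of known SNPs: one per {bases_per_snp:,.1f} bases")
```

Seen this way, a 0.1% difference still amounts to roughly three million bases, which is the same order of magnitude as the SNP counts reported by the project.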
Benefits of Human Genome Project Research
Rapid progress in genome science and a glimpse into its potential applications have spurred observers to predict that biology will be the foremost science of the 21st century. Technology and resources generated by the Human Genome Project and other genomics research are already having a major impact on research across the life sciences. The potential for commercial development of genomics research presents U.S. industry with a wealth of opportunities, and sales of DNA-based products and technologies in the biotechnology industry are projected to exceed $45 billion by 2009.
• Improved diagnosis of disease
• Earlier detection of genetic predispositions to disease
• Rational drug design
• Gene therapy and control systems for drugs
• Pharmacogenomics "custom drugs"
Technology and resources promoted by the Human Genome Project are starting to have profound impacts on biomedical research and promise to revolutionize the wider spectrum of biological research and clinical medicine. Increasingly detailed genome maps have aided researchers seeking genes associated with dozens of genetic conditions, including myotonic dystrophy, fragile X syndrome, neurofibromatosis types 1 and 2, inherited colon cancer, Alzheimer's disease, and familial breast cancer.
Energy and Environmental Applications
• Use microbial genomics research to create new energy sources (biofuels)
• Use microbial genomics research to develop environmental monitoring techniques to detect pollutants
• Use microbial genomics research for safe, efficient environmental remediation
• Use microbial genomics research for carbon sequestration
• Assess health damage and risks caused by radiation exposure, including low-dose exposures
• Assess health damage and risks caused by exposure to mutagenic chemicals and cancer-causing toxins
• Reduce the likelihood of heritable mutations
Bioarchaeology, Anthropology, Evolution, and Human Migration
• Study evolution through germline mutations in lineages
• Study migration of different population groups based on female genetic inheritance
• Study mutations on the Y chromosome to trace lineage and migration of males
• Compare breakpoints in the evolution of mutations with ages of populations and historical events
DNA Forensics (Identification)
• Identify potential suspects whose DNA may match evidence left at crime scenes
• Exonerate persons wrongly accused of crimes
• Identify crime and catastrophe victims
• Establish paternity and other family relationships
• Identify endangered and protected species as an aid to wildlife officials
• Detect bacteria and other organisms that may pollute air, water, soil, and food
• Match organ donors with recipients in transplant programs
• Determine pedigree for seed or livestock breeds
• Authenticate consumables such as caviar and wine
Agriculture, Livestock Breeding, and Bioprocessing
• Disease-, insect-, and drought-resistant crops
• Healthier, more productive, disease-resistant farm animals
• More nutritious produce
• Biopesticides
• Edible vaccines incorporated into food products
• New environmental cleanup uses for plants like tobacco
Implications of the Human Genome Project
The effects of the Human Genome Project will be far-reaching. The best minds in commerce and industry will take up issues related to patents and licenses. The insurance industry will be revolutionized by the effect of genetic information on future actuarial tables. Ultimately, our predisposition to health and disease will be known. Genetic mutations will no longer be regarded simply as defects but will be used to understand the etiology of disease at the most basic level. We may incorporate the new genetics into our lifestyle choices. Cloning, a current controversy, may solve the shortage of organs for transplantation.
Finally, health professionals need to become more comfortable and conversant with the concepts of the new genetics, especially when these concepts relate to how genetic predisposition affects the risk for developing disease. The National Coalition for Health Professionals Education in Genetics was formed to address these issues and to assist the education of health professionals in this area. The medical industry is building upon the knowledge, resources, and technologies emanating from the HGP to further understanding of genetic contributions to human health. As a result of this expansion of genomics into human health applications, the field of genomic medicine was born. Genetics is playing an increasingly important role in the diagnosis, monitoring, and treatment of diseases.
The Human Genome Project and the Future of Drug Development
The pharmaceutical industry is anticipating how information from the Human Genome Project will affect drug development and what benefits the new genetics will have for drug therapy. For example, in the future it may be possible to readily identify patients who rapidly metabolize a drug so that a higher dose of the drug can be used. On the other hand, a person who metabolizes a drug slowly or not at all will not be given the drug. At present, pharmacologic approaches block tissue receptors or inhibit specific enzymes; in the future, specific genes will be either turned on or off.
Consider the example of hemochromatosis. Today, hemochromatosis is detected when complications such as diabetes, heart failure, or liver damage occur. Ninety percent of those affected have 1 or 2 mutations that can be detected by genetic screening. Early detection and therapy can prevent the complications associated with hemochromatosis. In the future, pharmacologic agents might be developed to prevent the accumulation of iron that causes tissue damage and eliminate the need for the cumbersome phlebotomies that are the mainstay of current therapy. Gene-based therapies may be directed either at correcting gene mutations caused by exposure to injurious substances or at regulating the expression of cancer-causing genes that are not responsive to lifestyle modifications.