What is the Human Genome Project?

Begun formally in 1990, the U.S. Human Genome Project was a 13-year effort coordinated by the U.S. Department of Energy and the National Institutes of Health. The project originally was planned to last 15 years, but rapid technological advances accelerated the completion date to 2003. Project goals were to

identify all the approximately 20,000-25,000 genes in human DNA, determine the sequences of the 3 billion chemical base pairs that make up human DNA, store this information in databases, improve tools for data analysis, transfer related technologies to the private sector, and address the ethical, legal, and social issues (ELSI) that may arise from the project.

To help achieve these goals, researchers also studied the genetic makeup of several nonhuman organisms. These include the common human gut bacterium Escherichia coli, the fruit fly, and the laboratory mouse. A unique aspect of the U.S. Human Genome Project is that it was the first large scientific undertaking to address potential ELSI implications arising from project data. Another important feature of the project was the federal government's long-standing dedication to the transfer of technology to the private sector. By licensing technologies to private companies and awarding grants for innovative research, the project catalyzed the multibillion-dollar U.S. biotechnology industry and fostered the development of new medical applications. Landmark papers detailing sequence and analysis of the human genome were published in February 2001 and April 2003 issues of Nature and Science. See an index of these papers and learn more about the insights gained from them. For more background information on the U.S. Human Genome Project, see the following
- HGP Goals
- HGP History
- HGP Timeline
- Human Genome News

What's a genome? And why is it important?


A genome is all the DNA in an organism, including its genes. Genes carry information for making all the proteins required by all organisms. These proteins determine, among other things, how the organism looks, how well its body metabolizes food or fights infection, and sometimes even how it behaves. DNA is made up of four similar chemicals (called bases and abbreviated A, T, C, and G) that are repeated millions or billions of times throughout a genome. The human genome, for example, has 3 billion pairs of bases.


The particular order of As, Ts, Cs, and Gs is extremely important. The order underlies all of life's diversity, even dictating whether an organism is human or another species such as yeast, rice, or fruit fly, all of which have their own genomes and are themselves the focus of genome projects. Because all organisms are related through similarities in DNA sequences, insights gained from nonhuman genomes often lead to new knowledge about human biology.

From the Genome to the Proteome
Cells are the fundamental working units of every living system. All the instructions needed to direct their activities are contained within the chemical DNA (deoxyribonucleic acid). DNA from all organisms is made up of the same chemical and physical components. The DNA sequence is the particular side-by-side arrangement of bases along the DNA strand (e.g., ATTCCGGA). This order spells out the exact instructions required to create a particular organism with its own unique traits.

The genome is an organism's complete set of DNA. Genomes vary widely in size: the smallest known genome for a free-living organism (a bacterium) contains about 600,000 DNA base pairs, while human and mouse genomes have some 3 billion. Except for mature red blood cells, all human cells contain a complete genome.

DNA in the human genome is arranged into 24 distinct chromosomes--physically separate molecules that range in length from about 50 million to 250 million base pairs. A few types of major chromosomal abnormalities, including missing or extra copies or gross breaks and rejoinings (translocations), can be detected by microscopic examination. Most changes in DNA, however, are more subtle and require a closer analysis of the DNA molecule to find perhaps single-base differences.

Each chromosome contains many genes, the basic physical and functional units of heredity. Genes are specific sequences of bases that encode instructions on how to make proteins. Genes comprise only about 2% of the human genome; the remainder consists of noncoding regions, whose functions may include providing chromosomal structural integrity and regulating where, when, and in what quantity proteins are made. The human genome is estimated to contain 20,000-25,000 genes.

Although genes get a lot of attention, it's the proteins that perform most life functions and even make up the majority of cellular structures. Proteins are large, complex molecules made up of smaller subunits called amino acids. Chemical properties that distinguish the 20 different amino acids cause the protein chains to fold up into specific three-dimensional structures that define their particular functions in the cell.

The constellation of all proteins in a cell is called its proteome. Unlike the relatively unchanging genome, the dynamic proteome changes from minute to minute in response to tens of thousands of intra- and extracellular environmental signals. A protein's chemistry and behavior are specified by the gene sequence and by the number and identities of other proteins made in the same cell at the same time and with which it associates and reacts. Studies to explore protein structure and activities, known as proteomics, will be the focus of much research for decades to come and will help elucidate the molecular basis of health and disease.

How is genome sequencing done?

Chromosomes, which range in size from 50 million to 250 million bases, must first be broken into much shorter pieces (subcloning step). Each short piece is used as a template to generate a set of fragments that differ in length from each other by a single base that will be identified in a later step (template preparation and sequencing reaction steps).

The fragments in a set are separated by gel electrophoresis (separation step). New fluorescent dyes allow separation of all four fragments in a single lane on the gel.


The final base at the end of each fragment is identified (base-calling step). This process recreates the original sequence of As, Ts, Cs, and Gs for each short piece generated in the first step. Current electrophoresis limits are about 500 to 700 bases sequenced per read. Automated sequencers analyze the resulting electropherograms and the output is a four-color chromatogram showing peaks that represent each of the four DNA bases.
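
The base-calling idea in the steps above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: each fragment is represented as a hypothetical (length, final base) pair, and the function name `call_bases` is invented for illustration. Sorting the fragments by length and reading off their final bases recovers the sequence of the short piece.

```python
def call_bases(fragments):
    """Reconstruct a read from (length, terminal_base) pairs.

    Each fragment differs in length from the next by a single base, so
    sorting by length and reading the terminal bases in order recovers
    the sequence. This is a toy model, not a real base caller.
    """
    ordered = sorted(fragments, key=lambda f: f[0])
    lengths = [length for length, _ in ordered]
    # Sanity check: lengths must form an unbroken ladder 1, 2, 3, ...
    if lengths != list(range(1, len(ordered) + 1)):
        raise ValueError("fragment ladder has gaps or duplicates")
    return "".join(base for _, base in ordered)


# Hypothetical fragments in no particular order, as (length, final base).
fragments = [(3, "T"), (1, "A"), (5, "C"), (2, "T"), (4, "C")]
read = call_bases(fragments)
print(read)  # ATTCC
```

Real sequencers infer the bases from fluorescence peaks rather than from clean pairs like these; the sketch only shows why a ladder of fragments that differ by one base is enough to read a sequence.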

After the bases are "read," computers are used to assemble the short sequences (in blocks of about 500 bases each, called the read length) into long continuous stretches that are analyzed for errors, gene-coding regions, and other characteristics. Researchers then go to considerable trouble to "finish" this raw sequence from automated sequencers. Finished sequence is submitted to major public sequence databases, such as GenBank. Human Genome Project sequence data are thus made freely available to anyone around the world. For more on genome sequencing, see the Sequencing Fact Sheet.
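
To make the assembly step more concrete, here is a minimal Python sketch of one simple approach, greedy overlap merging. It is an illustration under stated assumptions, not the method used by the Human Genome Project: the function names `overlap` and `assemble`, the minimum overlap of 3 bases, and the three short reads are all hypothetical, and real reads are hundreds of bases long and contain errors.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def assemble(reads, min_len=3):
    """Greedily merge the pair of reads with the largest overlap until none overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # no remaining overlaps; leave the pieces as separate stretches
        merged = reads[best_i] + reads[best_j][best_size:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Three hypothetical overlapping reads.
contigs = assemble(["ATTCCGGA", "CGGATTAC", "TTACGGCA"])
print(contigs)  # ['ATTCCGGATTACGGCA']
```

Greedy merging is easy to follow but can be misled by repeated sequences, which is one reason finishing a genome takes so much additional work.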

What We've Learned So Far
What Does the Draft Human Genome Sequence Tell Us?
By the Numbers

The human genome contains 3164.7 million chemical nucleotide bases (A, C, T, and G). The average gene consists of 3000 bases, but sizes vary greatly, with the largest known human gene being dystrophin at 2.4 million bases. The total number of genes is estimated at 30,000--much lower than previous estimates of 80,000 to 140,000 that had been based on extrapolations from gene-rich areas as opposed to a composite of gene-rich and gene-poor areas. Almost all (99.9%) nucleotide bases are exactly the same in all people. The functions are unknown for over 50% of discovered genes.
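
As a rough illustration of the scale these figures imply, the short Python sketch below works out how many bases the remaining 0.1% corresponds to. The variable names are illustrative, and the result is a back-of-the-envelope estimate derived only from the two numbers quoted above, not a measured value.

```python
# Figures quoted in the text above.
GENOME_BASES = 3_164_700_000   # 3164.7 million nucleotide bases
SHARED_PER_THOUSAND = 999      # 99.9% of bases are the same in all people

# Integer arithmetic avoids floating-point rounding in the estimate.
differing_bases = GENOME_BASES * (1000 - SHARED_PER_THOUSAND) // 1000

print(f"{differing_bases:,} bases may differ")  # 3,164,700 bases may differ
```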

The Wheat from the Chaff


Less than 2% of the genome codes for proteins. Repeated sequences that do not code for proteins ("junk DNA") make up at least 50% of the human genome. Repetitive sequences are thought to have no direct functions, but they shed light on chromosome structure and dynamics. Over time, these repeats reshape the genome by rearranging it, creating entirely new genes, and modifying and reshuffling existing genes. During the past 50 million years, a dramatic decrease seems to have occurred in the rate of accumulation of repeats in the human genome.

How It's Arranged

The human genome's gene-dense "urban centers" are predominantly composed of the DNA building blocks G and C. In contrast, the gene-poor "deserts" are rich in the DNA building blocks A and T. GC- and AT-rich regions usually can be seen through a microscope as light and dark bands on chromosomes. Genes appear to be concentrated in random areas along the genome, with vast expanses of noncoding DNA between. Stretches of up to 30,000 C and G bases repeating over and over often occur adjacent to gene-rich areas, forming a barrier between the genes and the "junk DNA." These CpG islands are believed to help regulate gene activity. Chromosome 1 has the most genes (2968), and the Y chromosome has the fewest (231).
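
The contrast between GC-rich and AT-rich regions can be measured directly from sequence. The Python sketch below is a minimal example, assuming a hypothetical 16-base sequence and an arbitrary window size of 8; the function names `gc_content` and `windowed_gc` are invented for illustration.

```python
def gc_content(seq):
    """Fraction of bases in `seq` that are G or C (0.0 for an empty string)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def windowed_gc(seq, window):
    """GC fraction for consecutive non-overlapping windows of length `window`."""
    return [gc_content(seq[i:i + window]) for i in range(0, len(seq), window)]


# Hypothetical sequence: a GC-rich stretch followed by an AT-rich stretch.
sequence = "GCGCGGCCATATTAAT"
print(windowed_gc(sequence, 8))  # [1.0, 0.0]
```

On real chromosomes the windows would span thousands of bases, and the GC fraction would vary far less sharply than in this toy example.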

How the Human Compares with Other Organisms

Unlike the human's seemingly random distribution of gene-rich areas, many other organisms' genomes are more uniform, with genes evenly spaced throughout. Humans have on average three times as many kinds of proteins as the fly or worm because of mRNA transcript "alternative splicing" and chemical modifications to the proteins. This process can yield different protein products from the same gene. Humans share most of the same protein families with worms, flies, and plants, but the number of gene family members has expanded in humans, especially in proteins involved in development and immunity. The human genome has a much greater portion (50%) of repeat sequences than the mustard weed (11%), the worm (7%), and the fly (3%). Although humans appear to have stopped accumulating repeated DNA over 50 million years ago, there seems to be no such decline in rodents. This may account for some of the fundamental differences between hominids and rodents, although gene estimates are similar in these species. Scientists have proposed many theories to explain evolutionary contrasts between humans and other organisms, including those of life span, litter sizes, inbreeding, and genetic drift.

Variations and Mutations


Scientists have identified about 1.4 million locations where single-base DNA differences (SNPs) occur in humans. This information promises to revolutionize the processes of finding chromosomal locations for disease-associated sequences and tracing human history. The ratio of germline (sperm or egg cell) mutations is 2:1 in males vs females. Researchers point to several reasons for the higher mutation rate in the male germline, including the greater number of cell divisions required for sperm formation than for eggs.
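
A single-base difference is simple to detect once two sequences are lined up. The Python sketch below is a minimal illustration that assumes two already-aligned sequences of equal length; the function name `find_snps` and both sequences are hypothetical, and real SNP discovery must also handle alignment, insertions, deletions, and sequencing errors.

```python
def find_snps(reference, sample):
    """Return (position, reference_base, sample_base) for each single-base difference.

    Positions are 1-based. Both sequences must already be aligned and of
    equal length; this sketch does not perform alignment itself.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be the same length")
    return [
        (pos, ref_base, sample_base)
        for pos, (ref_base, sample_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != sample_base
    ]


# Hypothetical sequences that differ at one position.
snps = find_snps("ATTCCGGA", "ATTCTGGA")
print(snps)  # [(5, 'C', 'T')]
```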

Applications, Future Challenges
Deriving meaningful knowledge from the DNA sequence will define research through the coming decades to inform our understanding of biological systems. This enormous task will require the expertise and creativity of tens of thousands of scientists from varied disciplines in both the public and private sectors worldwide.

The draft sequence already is having an impact on finding genes associated with disease. A number of genes have been pinpointed and associated with breast cancer, muscle disease, deafness, and blindness. Additionally, finding the DNA sequences underlying such common diseases as cardiovascular disease, diabetes, arthritis, and cancers is being aided by the human variation maps (SNPs) generated in the HGP in cooperation with the private sector. These genes and SNPs provide focused targets for the development of effective new therapies. One of the greatest impacts of having the sequence may well be in enabling an entirely new approach to biological research. In the past, researchers studied one or a few genes at a time. With whole-genome sequences and new high-throughput technologies, they can approach questions systematically and on a grand scale. They can study all the genes in a genome, for example, or all the transcripts in a particular tissue or organ or tumor, or how tens of thousands of genes and proteins work together in interconnected networks to orchestrate the chemistry of life.

The Next Step: Functional Genomics
The words of Winston Churchill, spoken in 1942 after 3 years of war, capture well the HGP era: "Now this is not the end. It is not even the beginning of the end. But it is, perhaps, the end of the beginning." The avalanche of genome data grows daily. The new challenge will be to use this vast reservoir of data to explore how DNA and proteins work with each other and the environment to create complex, dynamic living systems. Systematic studies of function on a grand scale--functional genomics--will be the focus of biological explorations in this century and beyond. These explorations will encompass studies in transcriptomics, proteomics, structural genomics, new experimental methodologies, and comparative genomics.

Transcriptomics involves large-scale analysis of messenger RNAs transcribed from active genes to follow when, where, and under what conditions genes are expressed. Studying protein expression and function--or proteomics--can bring researchers closer to what's actually happening in the cell than gene-expression studies. This capability has applications to drug design. Structural genomics initiatives are being launched worldwide to generate the 3-D structures of one or more proteins from each protein family, thus offering clues to function and biological targets for drug design. Experimental methods for understanding the function of DNA sequences and the proteins they encode include knockout studies to inactivate genes in living organisms and monitor any changes that could reveal their functions. Comparative genomics--analyzing DNA sequence patterns of humans and well-studied model organisms side-by-side--has become one of the most powerful strategies for identifying human genes and interpreting their function.

What are some practical benefits to learning about DNA?

Knowledge about the effects of DNA variations among individuals can lead to revolutionary new ways to diagnose, treat, and someday prevent the thousands of disorders that affect us. Besides providing clues to understanding human biology, learning about nonhuman organisms' DNA sequences can lead to an understanding of their natural capabilities that can be applied toward solving challenges in health care, agriculture, energy production, environmental remediation, and carbon sequestration.

Rapid progress in genome science and a glimpse into its potential applications have spurred observers to predict that biology will be the foremost science of the 21st century. Technology and resources generated by the Human Genome Project and other genomics research are already having a major impact on research across the life sciences. The potential for commercial development of genomics research presents U.S. industry with a wealth of opportunities, and sales of DNA-based products and technologies in the biotechnology industry are projected to exceed $45 billion by 2009 (Consulting Resources Corporation Newsletter, Spring 1999). Some current and potential applications of genome research include
- Molecular medicine
- Energy sources and environmental applications
- Risk assessment
- Bioarchaeology, anthropology, evolution, and human migration
- DNA forensics (identification)
- Agriculture, livestock breeding, and bioprocessing


For more details about these applications, see below.

Molecular Medicine
- Improved diagnosis of disease
- Earlier detection of genetic predispositions to disease
- Rational drug design
- Gene therapy and control systems for drugs
- Pharmacogenomics "custom drugs"

Technology and resources promoted by the Human Genome Project are starting to have profound impacts on biomedical research and promise to revolutionize the wider spectrum of biological research and clinical medicine. Increasingly detailed genome maps have aided researchers seeking genes associated with dozens of genetic conditions, including myotonic dystrophy, fragile X syndrome, neurofibromatosis types 1 and 2, inherited colon cancer, Alzheimer's disease, and familial breast cancer.

On the horizon is a new era of molecular medicine characterized less by treating symptoms and more by looking to the most fundamental causes of disease. Rapid and more specific diagnostic tests will make possible earlier treatment of countless maladies. Medical researchers also will be able to devise novel therapeutic regimens based on new classes of drugs, immunotherapy techniques, avoidance of environmental conditions that may trigger disease, and possible augmentation or even replacement of defective genes through gene therapy.

For more information, see Medicine and the New Genetics and Fast Forward to 2020: What to Expect in Molecular Medicine--an article for the online magazine TNTY Futures.

Energy and Environmental Applications
- Use microbial genomics research to create new energy sources (biofuels)
- Use microbial genomics research to develop environmental monitoring techniques to detect pollutants
- Use microbial genomics research for safe, efficient environmental remediation
- Use microbial genomics research for carbon sequestration

In 1994, taking advantage of new capabilities developed by the genome project, DOE initiated the Microbial Genome Program to sequence the genomes of bacteria useful in energy production, environmental remediation, toxic waste reduction, and industrial processing. A follow-on program, the Genomic Science Program (GSP), builds on data and resources from the Human Genome Project, the Microbial Genome Program, and systems biology. GSP will accelerate understanding of dynamic living systems for solutions to DOE mission challenges in energy and the environment.

Despite our reliance on the inhabitants of the microbial world, we know little of their number or their nature: estimates are that less than 0.01% of all microbes have been cultivated and characterized. Microbial genome sequencing will help lay a foundation for knowledge that will ultimately benefit human health and the environment. The economy will benefit from further industrial applications of microbial capabilities.

Information gleaned from the characterization of complete microbial genomes will lead to insights into the development of such new energy-related biotechnologies as photosynthetic systems, microbial systems that function in extreme environments, and organisms that can metabolize readily available renewable resources and waste material with equal facility. Expected benefits also include development of diverse new products, processes, and test methods that will open the door to a cleaner environment. Biomanufacturing will use nontoxic chemicals and enzymes to reduce the cost and improve the efficiency of industrial processes. Microbial enzymes have been used to bleach paper pulp, stone wash denim, remove lipstick from glassware, break down starch in brewing, and coagulate milk protein for cheese production.

In the health arena, microbial sequences may help researchers find new human genes and shed light on the disease-producing properties of pathogens. Microbial genomics will also help pharmaceutical researchers gain a better understanding of how pathogenic microbes cause disease. Sequencing these microbes will help reveal vulnerabilities and identify new drug targets.

Gaining a deeper understanding of the microbial world also will provide insights into the strategies and limits of life on this planet. Data generated in this young program have helped scientists identify the minimum number of genes necessary for life and confirm the existence of a third major kingdom of life. Additionally, the new genetic techniques now allow us to establish more precisely the diversity of microorganisms and identify those critical to maintaining or restoring the function and integrity of large and small ecosystems; this knowledge also can be useful in monitoring and predicting environmental change. Finally, studies on microbial communities provide models for understanding biological interactions and evolutionary history.

For more information, see:
- DOE Genomic Science Program
- DOE Microbial Genome Program
- MicrobeWorld

Risk Assessment
- Assess health damage and risks caused by radiation exposure, including low-dose exposures
- Assess health damage and risks caused by exposure to mutagenic chemicals and cancer-causing toxins
- Reduce the likelihood of heritable mutations

Understanding the human genome will have an enormous impact on the ability to assess risks posed to individuals by exposure to toxic agents. Scientists know that genetic differences make some people more susceptible and others more resistant to such agents. Far more work must be done to determine the genetic basis of such variability. This knowledge will directly address DOE's long-term mission to understand the effects of low-level exposures to radiation and other energy-related agents, especially in terms of cancer risk.

Bioarchaeology, Anthropology, Evolution, and Human Migration
- Study evolution through germline mutations in lineages
- Study migration of different population groups based on female genetic inheritance
- Study mutations on the Y chromosome to trace lineage and migration of males
- Compare breakpoints in the evolution of mutations with ages of populations and historical events

Understanding genomics will help us understand human evolution and the common biology we share with all of life. Comparative genomics between humans and other organisms such as mice has already led to the identification of similar genes associated with diseases and traits. Further comparative studies will help determine the yet-unknown function of thousands of other genes. Comparing the DNA sequences of entire genomes of different microbes will provide new insights about relationships among the three kingdoms of life: archaea (archaebacteria), bacteria, and eukaryotes.

DNA Forensics (Identification)
- Identify potential suspects whose DNA may match evidence left at crime scenes
- Exonerate persons wrongly accused of crimes
- Identify crime and catastrophe victims
- Establish paternity and other family relationships
- Identify endangered and protected species as an aid to wildlife officials (could be used for prosecuting poachers)
- Detect bacteria and other organisms that may pollute air, water, soil, and food
- Match organ donors with recipients in transplant programs
- Determine pedigree for seed or livestock breeds
- Authenticate consumables such as caviar and wine

Any type of organism can be identified by examination of DNA sequences unique to that species. Identifying individuals is less precise, although when DNA sequencing technologies progress further, direct characterization of very large DNA segments, and possibly even whole genomes, will become feasible and practical and will allow precise individual identification. To identify individuals, forensic scientists scan about 10 DNA regions that vary from person to person and use the data to create a DNA profile of that individual (sometimes called a DNA fingerprint). There is an extremely small chance that another person has the same DNA profile for a particular set of regions. For more information, see the DNA Forensics site.
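
To see why a profile built from about 10 regions is so unlikely to be shared, consider the following Python sketch. It rests on two assumptions that are not stated in the text: that the regions vary independently of one another, and that each region's pattern is shared by 1 person in 10. Both the frequencies and the function name `random_match_probability` are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
from fractions import Fraction
from math import prod


def random_match_probability(region_frequencies):
    """Multiply per-region match frequencies, assuming the regions are independent."""
    return prod(region_frequencies, start=Fraction(1))


# Hypothetical values: each of 10 regions matches 1 person in 10.
frequencies = [Fraction(1, 10)] * 10
p = random_match_probability(frequencies)
print(p)  # 1/10000000000
```

Under these made-up numbers the chance of a coincidental match is 1 in 10 billion; actual figures depend on measured population frequencies for each region.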

Agriculture, Livestock Breeding, and Bioprocessing
- Disease-, insect-, and drought-resistant crops
- Healthier, more productive, disease-resistant farm animals
- More nutritious produce
- Biopesticides
- Edible vaccines incorporated into food products
- New environmental cleanup uses for plants like tobacco

Understanding plant and animal genomes will allow us to create stronger, more disease-resistant plants and animals--reducing the costs of agriculture and providing consumers with more nutritious, pesticide-free foods. Already growers are using bioengineered seeds to grow insect- and drought-resistant crops that require little or no pesticide. Farmers have been able to increase outputs and reduce waste because their crops and herds are healthier. Alternate uses for crops such as tobacco have been found. One researcher has genetically engineered tobacco plants in his laboratory to produce a bacterial enzyme that breaks down explosives such as TNT and dinitroglycerin. Waste that would take centuries to break down in the soil can be cleaned up by simply growing these special plants in the polluted area. For more information, see the Access Excellence Website's Biotech Applied page.

Chromosome Mapping

The Human Genome Project was completed in 2003. One of the key research areas was Chromosome Mapping. This page details that research.

Mapping is the construction of a series of chromosome descriptions that depict the position and spacing of unique, identifiable biochemical landmarks, including some genes, that occur on the DNA of chromosomes. In 1990, DOE initiated projects to enrich the developing chromosome maps with markers for genes. In 1993 this effort led to the establishment of the Integrated Molecular Analysis of Gene Expression (I.M.A.G.E.) Consortium. I.M.A.G.E. members develop and array DNA clones (representing the gene coding regions of the genome) and make them available worldwide.

Area | HGP Goal | Standard Achieved | Date Achieved
Genetic Map | 2- to 5-cM resolution map (600-1,500 markers) | 1-cM resolution map (3,000 markers) | September 1994
Physical Map | 30,000 STSs | 52,000 STSs | October 1998
Gene Identification | Full-length human cDNAs | 15,000 full-length human cDNAs | March 2003

For a more detailed explanation of mapping, see the U.S. DOE Primer on Molecular Genetics.
