January 6, 2004
• Read a complete genome such as the human DNA. (DNA Sequencing & Assembly)
• Identify parts such as the genes encoded by the DNA sequence. (Gene Finding)
• Figure out the connections between parts such as how genes interact with each other.
• Gene Expression: The process by which genetic code is translated into structures present and
functioning in the cell. Expressed genes are transcribed into different types of RNA, of which
mRNA is the only type that is translated into proteins. Gene expression provides information
about how a gene functions and how it is different from other genes. DNA microarrays can be
used to compare gene expression in different populations of cells. Cells have different gene
expression patterns and levels. (Microarrays & Regulation)
Computer science plays an essential role in biology. With biology becoming an information science, new
high-throughput technology is needed. The shift to high throughput technologies in biology has led to an
explosion of genomic data.
• DNA Sequencing: The process of determining the exact order of a long string of bases (A, T, C,
G) that makes up the DNA of an organism. The genomes of several organisms, including human,
have been completely sequenced.
Paradigms in Biology
DNA is transcribed into RNA (rRNA, tRNA, snRNA, mRNA) through a process known as RNA
transcription. mRNA is then translated into polypeptides through a mechanism called protein
translation, and the polypeptides fold into 3-D protein structures. An organism consists of
different types of proteins.
Structures of Biomolecules
• The cell is composed of DNA in the nucleus and proteins in the cytoplasm, all of which is
encapsulated in a lipid membrane.
• The nucleic acids (DNA and RNA) form the genetic material of all living organisms. They are
found mainly in the nucleus of the cell.
Two nucleotides are linked together by attaching the phosphate group on the 5’ carbon of one
nucleotide to the 3’ carbon atom of the sugar of the other nucleotide.
[Figure: base pairing. In DNA, A pairs with T (A=T) and G pairs with C (G=C); in RNA, T is replaced by U (T→U).]
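The pairing rules in the figure can be sketched in a few lines of Python. This is a minimal illustration, not part of the original notes: the names `DNA_COMPLEMENT`, `complement`, and `transcribe` are hypothetical, the example strand is made up, and strand orientation (5’ to 3’) is ignored for simplicity.

```python
# Watson-Crick pairing rules for DNA: A pairs with T, G pairs with C.
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand):
    """Return the base-by-base complement of a DNA strand (orientation ignored)."""
    return "".join(DNA_COMPLEMENT[base] for base in strand)


def transcribe(coding_strand):
    """Return the RNA copy of the coding strand: same letters, with T replaced by U."""
    return coding_strand.replace("T", "U")


coding = "ACGACTG"             # made-up example strand
template = complement(coding)  # the strand that pairs with it
rna = transcribe(coding)       # RNA carries the coding sequence with U for T
print(coding, template, rna)
```

The RNA is complementary to the template strand, which is why it reads the same as the coding strand apart from the T→U substitution.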
• Three nucleotides of an mRNA strand form a codon that specifies one amino acid. This makes
sense because a codon made from only one or two nucleotides would not produce enough
combinations (codons) to code for all 20 of the known amino acids.
Since a three-nucleotide codon produces 64 possible combinations and there are only 20 known
amino acids, this implies redundancy or degeneracy in the genetic code where several different
codons specify the same amino acid. The parsimony principle – that the simplest solution is often
right – rules out a four-nucleotide codon.
• In the cell, DNA provides all the information needed to function. This raises questions about
how DNA acts as the carrier of genetic information.
• Ribosomes are the sites of protein synthesis. Since DNA is mainly found in the nucleus and
ribosomes are found in the cytoplasm, how does information flow from DNA to protein? There
is a need for an intermediary: ribonucleic acid (RNA). Three types of RNA are involved in
protein synthesis (mRNA, tRNA, rRNA).
• In 1968, Nirenberg and Khorana received a Nobel Prize in medicine for cracking the universal
genetic code, which mapped each triplet (codon) to an amino acid. It shows how the nucleotide
language of mRNA is translated into the amino acid language of proteins.
• In 1965, Robert Holley solved the structure of tRNA. Although tRNA is a single-stranded molecule,
stretches of complementary nucleotides hydrogen bond to form short double-stranded regions,
which bend the tRNA into a cloverleaf shape. All tRNAs have a similar cloverleaf structure. At
a position on one of the leaves, a sequence of three nucleotides forms an anti-codon, which base
pairs with a specific mRNA codon. This anti-codon/codon binding is crucial because it determines
which amino acid is added for each codon. There is a different tRNA molecule corresponding to
each mRNA codon.
• rRNA serves as part of the structure of the ribosome, the protein/RNA complex that synthesizes
proteins according to the information carried by the mRNA.
• So, to put this all together: The DNA code is transcribed into a complementary mRNA molecule
within the nucleus. The mRNA enters the cytoplasm, where it associates with a ribosome. The
mRNA code is then translated into a polypeptide chain. The codon AUG signals the start of
translation. An activated tRNA ferries the first amino acid, methionine, to the ribosome. The
tRNA anti-codon binds to the AUG codon on the mRNA. The whole complex shifts and the next
codon is read by another tRNA. As the two amino acids are held in position, a peptide bond is
formed between them. The second tRNA accepts the growing protein chain and the methionine
tRNA is released. The process continues until a stop codon is encountered. When the stop codon
is reached, translation is finished. The ribosome disassembles to be reused for translating another
mRNA and one complete peptide chain is released.
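The codon arithmetic and the translation steps in the list above can be sketched in Python. This is a simplified illustration under stated assumptions: it uses the standard genetic code with one-letter amino acid codes, starts at the first AUG it finds, and ignores tRNA and ribosome mechanics entirely. The names `CODON_TABLE` and `translate` are hypothetical, and the input is assumed to contain only the letters U, C, A, and G.

```python
from itertools import product

# With 4 bases, codons of length n give 4**n combinations:
# 4 and 16 are too few for 20 amino acids, 64 is enough.
for n in (1, 2, 3):
    print(n, 4 ** n)

# Standard genetic code, listed in UCAG order for each codon position.
# '*' marks the three stop codons (UAA, UAG, UGA).
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna):
    """Translate from the first AUG until a stop codon or the end of the sequence."""
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, nothing is translated
    protein = []
    # Step through complete codons only; a trailing partial codon is ignored.
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break  # stop codon: the peptide chain is released
        protein.append(amino_acid)
    return "".join(protein)


# Degeneracy: 64 codons map onto only 20 amino acids plus the stop signal.
print(len(CODON_TABLE), len(set(CODON_TABLE.values()) - {"*"}))
print(translate("GGAUGGCCUAAGG"))  # AUG GCC UAA
```

In the last call, reading starts at AUG (methionine, M), continues with GCC (alanine, A), and ends at the stop codon UAA, so the made-up example sequence yields the two-residue peptide `MA`.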
What is a gene?
• A genome is the entire DNA content of an organism: the set of all its genes plus non-coding (“junk”) DNA.
[Figure: zooming in on a gene. DNA is transcribed into tRNA, rRNA, snRNA, and mRNA; mRNA is translated into a polypeptide.]
DNA is transcribed into different types of RNA (tRNA, rRNA, snRNA, mRNA). Transcription
consists of three key steps: initiation, elongation, and termination. The transcripts (mRNA
molecules) contain the information to be translated into polypeptides that form proteins.
• Each gene has its own promoter(s). Promoters are sequences in the DNA just upstream of the
mRNA transcripts that define the sites of initiation. The role of the promoter is to attract RNA
polymerase to the correct start site so transcription can be initiated.
• The mRNA transcripts are sometimes edited before they serve as a blueprint for a protein. This
processing removes the intervening non-coding sequences (introns) from the transcript. The
remaining segments (exons), whose codons will be expressed, are spliced together to form the
mature mRNA.
• An adult multi-cellular organism contains a wide variety of cell types, such as muscle, nerve,
and blood cells, even though the different cell types contain the same DNA. This
differentiation arises because different cell types express different genes. Hence, genes can be
switched on and off.
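The splicing step described in the list above (introns removed, exons joined) can be illustrated with a toy sketch. The function name `splice`, the representation of exons as half-open `(start, end)` index pairs, and the sequences themselves are assumptions made for this example, not data from a real gene.

```python
def splice(pre_mrna, exons):
    """Join the exon segments of a transcript, dropping everything in between.

    exons is a list of (start, end) half-open index pairs, in order.
    """
    return "".join(pre_mrna[start:end] for start, end in exons)


# Made-up transcript: two exons separated by one intron.
exon1, intron, exon2 = "AUGGCC", "GUAAGUUUCAG", "GAAUAA"
pre_mrna = exon1 + intron + exon2
exons = [
    (0, len(exon1)),                            # first exon
    (len(exon1) + len(intron), len(pre_mrna)),  # second exon
]

mature_mrna = splice(pre_mrna, exons)
print(mature_mrna)
```

Here the mature mRNA is simply `exon1 + exon2`; in a real gene the exon boundaries would have to be determined experimentally or predicted by a gene-finding method.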
There has been an explosion of genomic data. Complete genomes of some organisms have been
sequenced (human, pig, dog, rat, mouse, etc.). DNA in these different organisms has been compared to
study evolution occurring at the DNA level, resulting from sequence edits (insertion, deletion, mutation)
and rearrangements (inversion, translocation, duplication). Similarity between DNA sequences has
suggested that all organisms come from a common ancestor, connected by an evolutionary tree (evolution
paradigm).
The evolutionary process occurs at different rates. If DNA mutations occur in non-critical regions, they
are incorporated into the next generation. If the mutations occur in critical regions, they are unlikely to be
propagated onward. However, some mutations have positive effects, and thus are conserved in
subsequent generations, such as in the case of the highly conserved Interleukin regions found in human
and mouse. Sequence conservation implies functionality. The fact that evolution did not modify a region
of the sequence suggests that it is functionally important to the organism.
Pairwise sequence alignment can be used to find sequences conserved between organisms. It can reveal if
sequences are related or not. This information can help to determine their functional and structural roles
and provide clues to the common ancestor.
Sequence Alignment
Given two strings, x = x1x2…xM and y = y1y2…yN, and a scoring function that rewards matched letters
and penalizes mismatches and gaps, an alignment is an assignment of gaps to positions 0,…,M in x and
0,…,N in y, so as to line up each letter in one sequence with either a letter or a gap in the other sequence.
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Optimal Alignment
What is a good alignment? It is the “best” way to match the letters of one sequence with those of the
other. The problem is how to define “best”. If an alignment is a hypothesis that two sequences
come from a common ancestor through sequence edits, then finding the optimal alignment means finding
the least-cost transformation of one sequence into the other using operations such as sequence edits,
inversions, translocations, and duplications. The least-cost transformation is measured as the edit
distance between the two sequences, which is defined as the minimum number of edit operations needed
to transform the first string into the second. Since most DNA changes during evolution are due to
insertion, deletion, and substitution, the edit distance can be used as a rough measure of how much
change accumulated over the DNA replications separating two sequences. Although the edit distance is
not an accurate model of the underlying evolutionary process, it serves as an approximation that is
easy to optimize algorithmically.
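To make the definition concrete, the edit distance with insertion, deletion, and substitution as unit-cost operations can be computed with the standard dynamic-programming recurrence. The sketch below is one common way to do this, not necessarily the method used later in these notes; the function name `edit_distance` is hypothetical, and inversions, translocations, and duplications are not modeled.

```python
def edit_distance(x, y):
    """Minimum number of insertions, deletions, and substitutions turning x into y."""
    m, n = len(x), len(y)
    # prev[j] holds the distance between the first i-1 letters of x
    # and the first j letters of y.
    prev = list(range(n + 1))  # turning "" into y[:j] takes j insertions
    for i in range(1, m + 1):
        curr = [i] + [0] * n   # turning x[:i] into "" takes i deletions
        for j in range(1, n + 1):
            substitution = 0 if x[i - 1] == y[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,                 # delete x[i-1]
                curr[j - 1] + 1,             # insert y[j-1]
                prev[j - 1] + substitution,  # match or substitute
            )
        prev = curr
    return prev[n]


print(edit_distance("AGGCTA", "AGCTA"))  # one deletion is enough
```

The table has (M+1)(N+1) entries and each is filled in constant time, so the running time is proportional to M times N; keeping only two rows reduces the memory to the length of y.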
Likewise, optimal alignment is the pairing of sequences that retains the order of letters in each sequence,
introducing gaps if necessary, such that the scoring function returns an optimal score.
Scoring Function
Match: +m
Mismatch: –s
Gap: –d
The quality of an alignment is measured by its score under the scoring function. The total
score of an alignment is the sum of terms for each pair of aligned letters and terms for each gap. A match
contributes +m, a mismatch contributes –s, and a gap contributes –d.
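As a quick check of the scoring function, the sketch below scores a given alignment column by column. The function name `score_alignment` and the default values m = s = d = 1 are assumptions chosen for illustration; the notes leave m, s, and d as parameters. Gaps are written as `-`, and both rows are assumed to have the same length.

```python
def score_alignment(row_x, row_y, m=1, s=1, d=1):
    """Sum +m per match, -s per mismatch, and -d per gap over all columns."""
    if len(row_x) != len(row_y):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(row_x, row_y):
        if a == "-" or b == "-":
            total -= d  # a letter lined up with a gap
        elif a == b:
            total += m  # matching letters
        else:
            total -= s  # mismatching letters
    return total


# The example alignment from the Sequence Alignment section.
row_x = "-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---"
row_y = "TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC"
print(score_alignment(row_x, row_y))
```

That alignment has 24 matches, 2 mismatches, and 11 gap columns, so with m = s = d = 1 its score is 24 – 2 – 11 = 11. Different choices of m, s, and d change which alignment of the same two sequences scores best.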