Wenxiu Ma
CS374 Presentation
11/03/2005
Outline
Regulation of genes
Regulatory Motifs
Motif Representation
Current Motif Discovery Methods
Regulation of Genes
What turns genes on (producing a protein) and off?
When is a gene turned on or off?
Where (in which cells) is a gene turned on?
How many copies of the gene product are produced?
Overview of Gene Control
The mechanisms that control the expression of genes operate at many levels.
Regulation of Genes
[Figure: transcription factors (proteins) and RNA polymerase (protein) bound to DNA]
What is a motif?
A subsequence (substring) of biological importance that occurs in multiple sequences.
Motifs can be totally constant or have variable elements.
Protein motifs often result from structural features.
DNA Motifs (regulatory elements)
Binding sites for proteins
Short sequences (5-25 bp)
Up to 1000 bp (or farther) from gene
Inexactly repeating patterns
daf-19 Binding Sites in C. elegans
GTTGTCATGGTGAC
GTTTCCATGGAAAC
GCTACCATGGCAAC
GTTACCATAGTAAC
GTTTCCATGGTAAC
[Figure: binding sites located between positions -150 and -1 upstream of che-2, daf-19, osm-1, osm-6 and F02D8.3]
Source: Peter Swoboda
Motif Representation
Consensus sequence: a single string with the most likely sequence (+/- wildcards)
Regular expression: a string with wildcards, constrained selection
Profile: a list of the letter frequencies at each position
Sequence Logo:
graphical depiction of a profile
conservation of elements in a motif.
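The first three representations can be derived mechanically from a set of aligned sites. The sketch below does so for the five daf-19 binding sites listed earlier in this deck. The function names are illustrative, and breaking ties alphabetically in the consensus is an assumption rather than a convention stated in the slides.

```python
from collections import Counter

# The five daf-19 binding sites listed earlier in the deck (already aligned).
SITES = [
    "GTTGTCATGGTGAC",
    "GTTTCCATGGAAAC",
    "GCTACCATGGCAAC",
    "GTTACCATAGTAAC",
    "GTTTCCATGGTAAC",
]

def profile(sites):
    """Profile: letter frequencies at each position (one dict per column)."""
    n = len(sites)
    return [
        {base: count / n for base, count in Counter(column).items()}
        for column in zip(*sites)
    ]

def consensus(sites):
    """Consensus: most frequent letter per column.

    Ties are broken alphabetically (an assumption made for this sketch).
    """
    letters = []
    for column in zip(*sites):
        counts = Counter(column)
        letters.append(max(sorted(counts), key=lambda base: counts[base]))
    return "".join(letters)

def regular_expression(sites):
    """Regular expression: one character class per variable column."""
    parts = []
    for column in zip(*sites):
        seen = "".join(sorted(set(column)))
        parts.append(seen if len(seen) == 1 else "[" + seen + "]")
    return "".join(parts)

if __name__ == "__main__":
    print(consensus(SITES))
    print(regular_expression(SITES))
    print(profile(SITES)[1])
```

The regular expression accepts every listed site but also many strings that were never observed, which is one reason a profile is often the more informative representation.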
Motif Logos: an Example
(http://www-lmmb.ncifcrf.gov/~toms/sequencelogo.html)
Measure of Conservation
Relative heights of letters reflect their abundance in the alignment.
Total height = entropy-based measurement of conservation.
$$\mathrm{Entropy}(i) = -\sum_{b \in \{A,C,G,T\}} f(b,i)\,\log_2 f(b,i)$$
$$\mathrm{Conservation}(i) = 2 - \mathrm{Entropy}(i)$$
Units of conservation = bits of information (hence the base-2 logarithm; the maximum entropy for four bases is 2 bits)
Entropy measures variability/disorder.
Highly conserved = low entropy = tall stack
Very variable = high entropy = low stack
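As a rough illustration of the stack heights, the sketch below computes the conservation of a single alignment column in bits, using a base-2 logarithm so that a fully conserved DNA column scores 2. It applies no small-sample correction, which logo software may add; that omission and the function names are assumptions of this sketch.

```python
import math
from collections import Counter

def entropy(column):
    """Shannon entropy (bits) of the letters in one alignment column."""
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in Counter(column).values())

def conservation(column):
    """Total stack height for a DNA column: 2 bits minus the entropy."""
    return 2.0 - entropy(column)

def letter_heights(column):
    """Height of each letter: its frequency times the total stack height."""
    n = len(column)
    total = conservation(column)
    return {base: (c / n) * total for base, c in Counter(column).items()}

if __name__ == "__main__":
    # Column 2 of the daf-19 sites shown earlier in the deck: T, T, C, T, T.
    print(round(conservation("TTCTT"), 3))
    print(letter_heights("TTCTT"))
```

A column with all four bases at equal frequency has entropy 2 and therefore height 0, which is why unconserved positions nearly vanish from a logo.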
Finding Regulatory Motifs
Current Motif Discovery Methods
GOAL: comprehensive identification of all the regulatory motifs in genomes.
by overrepresentation
MEME, Gibbs sampling
by phylogenetic footprinting
Footprinter
Motif Discovery Procedure
Alignment of promoters & 3’ UTRs
Motif conservation score (MCS)
Measure the extent of excess conservation
“Highly conserved motifs”
MCS>6
Clustering
Alignment of promoters & 3’ UTRs
Construct a whole-genome alignment for the four mammalian genomes
Blastz and Multiz
Extract the aligned promoter and 3’ UTR portions, respectively.
Coordinates: the annotation of NCBI reference sequences (RefSeq)
Motif Conservation Score (MCS)
Consensus sequence representation
Alphabet size: 11
(A,C,G,T,[AC], [AG], [AT], [CG], [CT], [GT], [ACGT])
A conserved occurrence of a motif m is an instance in which an exact match to this motif is found in all four species.
Conservation rate p = ratio of conserved occurrences to total occurrences in human
Expected conservation rate p0 = avg. conservation rate of 100 random motifs, given same length and redundancy.
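The counting behind p can be sketched as follows, assuming a toy, gap-free alignment in which every species' sequence has the same length and column j lines up across species. The species keys other than "human", the sequences, and the function name are hypothetical. Motifs use the bracket notation of the 11-letter alphabet above, which is also valid regular-expression syntax.

```python
import re

# Toy gap-free alignment (hypothetical data, not taken from the study).
ALIGNMENT = {
    "human":     "ACTGACCTTGAGGTGACCTTGTA",
    "species_b": "ACTGACCTTGAGGTGAACTTGTA",  # differs at one motif position
    "species_c": "GCTGACCTTGAGGTGACCTTGTA",  # differs outside the motif
    "species_d": "ACTGACCTTGAGGTGACCTTGTA",
}

def conserved_and_total(motif, alignment, reference="human"):
    """Return (conserved occurrences, total occurrences in the reference).

    An occurrence is conserved when the aligned window matches the motif
    exactly in every species of the alignment.
    """
    exact = re.compile(motif)
    # A lookahead lets finditer report overlapping occurrences as well.
    scan = re.compile("(?=(" + motif + "))")
    total = conserved = 0
    for hit in scan.finditer(alignment[reference]):
        start, end = hit.span(1)
        total += 1
        if all(exact.fullmatch(seq[start:end]) for seq in alignment.values()):
            conserved += 1
    return conserved, total

if __name__ == "__main__":
    k, n = conserved_and_total("TGACCTTG", ALIGNMENT)
    print(k, n, k / n)
```

A real whole-genome alignment contains gaps, so a practical version would have to map reference coordinates onto alignment columns first; that step is left out here.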
MCS
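MCS = number of standard deviations (s.d.) by which the observed conservation rate p of a motif exceeds the expected conservation rate p0.
p = k/n
Binomial probability of observing k conserved occurrences out of n:
$$P(k \text{ out of } n) = \frac{n!}{k!\,(n-k)!}\; p_0^{k}\,(1-p_0)^{n-k}$$
Estimated by way of the normal approximation to the binomial distribution:
$$z = \frac{k-\mu}{\sigma}, \quad \text{where } \mu = n p_0,\ \sigma = \sqrt{n p_0 (1-p_0)}$$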
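The z-score is a short calculation. The sketch below evaluates it with the figures this deck quotes for the known 8-mer TGACCTTG (162 conserved out of 434 occurrences, random rate 6.8%); the function name is illustrative.

```python
import math

def motif_conservation_score(k, n, p0):
    """z-score of k conserved occurrences out of n, given expected rate p0.

    Uses the normal approximation to the binomial distribution:
    mean n*p0 and standard deviation sqrt(n*p0*(1-p0)).
    """
    mu = n * p0
    sigma = math.sqrt(n * p0 * (1.0 - p0))
    return (k - mu) / sigma

if __name__ == "__main__":
    # Figures quoted in this deck for TGACCTTG.
    print(round(motif_conservation_score(162, 434, 0.068), 1))
```

With these inputs the score comes out a little above 25, in line with the 25.2 s.d. reported on the slides; the small difference is presumably due to rounding of the quoted rates.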
Conservation Properties of
Regulatory Motifs
Known 8-mer TGACCTTG
Conservation rate 37% (162 out of 434)
random rate 6.8%
MCS = 25.2 s.d.
Promoter Region
TRANSFAC: 446 motifs
MCS>3: 63%
MCS>5: ~50%
3’ UTR
no database analogous to TRANSFAC
some known motifs
Motif Discovery Procedure
Alignment of promoters & 3’ UTRs
Motif conservation score (MCS)
“Highly conserved motifs”
MCS>6
Clustering
Results: motifs in promoters
174 highly conserved motifs
59 strong matches to known motifs, 10 weaker matches.
105 potential new regulatory motifs
Property 1: strand specificity
Xie, X. et al., Nature, 2005
Property 2
The microRNA pathway
[Figure: pri-miRNA (5’ m7G(5’)ppp(5’)G cap, 3’ poly(A) tail) -> Drosha/Pasha -> pre-miRNA -> Dicer -> miR/miR* duplex -> mature miRNA -> miRNP]
Adapted from Tomari & Zamore, Curr Biol 2004
Relationship with miRNA
72 highly conserved 8-mer motifs
Contiguous, non-degenerate
~46% of all 3’-UTR motifs
207 distinct human miRNAs
From current registry
Complementary matches
Exact match: ~43.5%
One mismatch: ~50%
95% of matches begin at nucleotide 1 or 2 of the miRNA gene
8-mer motifs represent target sites for miRNA
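The complementary-match test can be sketched as a reverse-complement comparison against the first bases of the miRNA. The sequence below is only an illustrative input, the function names are hypothetical, and the sketch considers Watson-Crick pairs and exact matches only (no G:U wobble, no one-mismatch case).

```python
# Watson-Crick complement for RNA bases.
RNA_COMPLEMENT = str.maketrans("ACGU", "UGCA")

def target_site(mirna, start, length=8):
    """DNA motif complementary to `length` bases of the miRNA.

    `start` is the 1-based nucleotide of the miRNA where the match begins.
    """
    segment = mirna[start - 1 : start - 1 + length]
    reverse_complement = segment.translate(RNA_COMPLEMENT)[::-1]
    return reverse_complement.replace("U", "T")

def match_start(motif, mirna):
    """Return 1 or 2 if the motif pairs with the miRNA from that nucleotide,
    otherwise None."""
    for start in (1, 2):
        if target_site(mirna, start, len(motif)) == motif:
            return start
    return None

if __name__ == "__main__":
    mirna = "UGAGGUAGUAGGUUGUAUAGUU"  # illustrative input sequence
    print(target_site(mirna, 1))
    print(match_start("ACTACCTC", mirna))
```

Restricting the search to start positions 1 and 2 mirrors the observation above that nearly all matches begin there.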
8-mer motifs -> new miRNA genes
RNAfold program
242 conserved and stable stem-loop sequences
113 known, 129 potential new miRNAs
Biological validation
12 selected new miRNA genes
6 (50%) show clear expression in tissues.
Prevalence of miRNA regulation
20% of 3’ UTRs may be targets for conserved miRNA-based regulation at the 8-mer motifs.
Unbiased assessment of the relative importance of miRNA-based regulation in the human genome
Summary:
comparative genome analysis
4 mammalian species
an initial systematic catalogue
Promoters
3’ UTRs
Importance of the new miRNA regulatory mechanism
Future directions:
genome-wide discovery
more genome alignments: the primates
Now…
Motif Finding Methods
Cross species comparative analysis
Combine structure information
Motif Finding: Structural Knowledge
Ab initio prediction of transcription factor targets using structural knowledge,
Kaplan T, et al., PLoS Comput Biol (2005)
Proposes a general framework for predicting DNA binding-site sequences of novel TFs from a known family
Structure-based approach
No prior binding data or known target genes needed for the novel TF
Family-wise probabilistic model
Context-specific amino acid-nucleotide recognition preferences
Structure-based approach
Family-wise probabilistic model
Input:
pairs of TFs and their target DNA sequences
structural information
Output: Context-specific amino acid-nucleotide recognition preferences
Position specificity
Then, discover TFBSs of other TFs from the same family
Cys2His2 Zinc Finger protein family
largest known DNA-binding family in multicellular organisms
common, strict binding models
Profile HMM
“Silent” deletion states
Insertion states
Match states
Example: full profile HMM
Structure-based approach
Input: set of pairs of TFs and their target DNA sequences
Output: Context-specific amino acid-nucleotide recognition preferences
Iterative Expectation Maximization (EM) algorithm
Cys2His2 Zinc Finger:
Probabilistic Model
The set of interacting residues in 4 different
positions of the k fingers
Estimate DNA-recognition
preferences
Compare with other preferences
Discussion
Tradeoff between complexity and accuracy
Canonical model
Extension to other DNA-binding domains
Restrictions: enough binding data, common and strict binding model…
Provide a promising way to predict target
genes of novel proteins and to understand
their function and activity
Thank you!
Any questions?