Regulatory Motif Finding: Wenxiu Ma CS374 Presentation 11/03/2005

Regulatory Motif Finding
Wenxiu Ma
CS374 Presentation
11/03/2005
Outline
 Regulation of genes
 Regulatory Motifs
 Motif Representation
 Current Motif Discovery Methods
Regulation of Genes
 What turns genes on (producing a protein) and off?
 When is a gene turned on or off?
 Where (in which cells) is a gene turned on?
 How many copies of the gene product are produced?
Overview of Gene Control
 The mechanisms that control the expression of genes operate at many levels.
source: Molecular Biology of the Cell (4th ed.), A. Johnson, et al.

Transcriptional Regulation
 The transcription of each gene is controlled by a regulatory region of DNA relatively near the transcription start site (TSS).
 Two types of fundamental components:
 short DNA regulatory elements
 gene regulatory proteins that recognize and bind to them
Regulation of Genes
[Figure sequence over three slides. Labels: transcription factor (protein), RNA polymerase (protein), DNA carrying a regulatory element and a gene; the last slide adds the new protein.]
source: M. Tompa, U. of Washington

Outline
What is a motif?
 A subsequence (substring) that occurs in multiple sequences and has biological importance.
 Motifs can be totally constant or have variable elements.
 Protein motifs often result from structural features.
 DNA motifs (regulatory elements)
 Binding sites for proteins
 Short sequences (5-25 bp)
 Up to 1000 bp (or farther) from the gene
 Inexactly repeating patterns
daf-19 Binding Sites in C. elegans
GTTGTCATGGTGAC
GTTTCCATGGAAAC
GCTACCATGGCAAC
GTTACCATAGTAAC
GTTTCCATGGTAAC
[Figure: location of the sites within positions -150 to -1 for the genes che-2, daf-19, osm-1, osm-6 and F02D8.3]
source: Peter Swoboda
Motif Representation
 Consensus sequence: a single string with the most likely base at each position (with or without wildcards)
 Regular expression: a string with wildcards and constrained choices at variable positions
 Profile: a list of the letter frequencies at each position
 Sequence logo:
 graphical depiction of a profile
 shows the conservation of elements in a motif
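As a rough illustration of the profile and consensus representations, the sketch below builds both from the five daf-19 binding sites listed on the earlier slide. The function names, the `N` wildcard and the 50% threshold are assumptions made for this example; they do not come from the slides.

```python
from collections import Counter

# The five daf-19 binding sites from the C. elegans slide.
SITES = [
    "GTTGTCATGGTGAC",
    "GTTTCCATGGAAAC",
    "GCTACCATGGCAAC",
    "GTTACCATAGTAAC",
    "GTTTCCATGGTAAC",
]

def profile(sites):
    """Return one dict of base frequencies per alignment column."""
    columns = zip(*sites)  # transpose: one tuple of bases per position
    return [
        {base: count / len(sites) for base, count in Counter(col).items()}
        for col in columns
    ]

def consensus(prof, wildcard="N", min_freq=0.5):
    """Most frequent base per column; the wildcard when no base exceeds min_freq."""
    out = []
    for freqs in prof:
        base, freq = max(freqs.items(), key=lambda kv: kv[1])
        out.append(base if freq > min_freq else wildcard)
    return "".join(out)

print(consensus(profile(SITES)))  # GTTNCCATGGTAAC
```

Position 4 becomes a wildcard because A and T are tied there at 40% each; a regular-expression representation would write that position as a bracketed choice instead.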
Motif Logos: an Example
(http://www-lmmb.ncifcrf.gov/~toms/sequencelogo.html)
Measure of Conservation
 Relative heights of letters reflect their abundance in the alignment.
 Total height = entropy-based measurement of conservation.
 $\mathrm{Entropy}(i) = -\sum_{b \in \{A,C,G,T\}} f(b,i)\,\log_2 f(b,i)$ (base-2 logarithm, so that the maximum for four bases is 2 bits)
 Conservation(i) = 2 - Entropy(i)
 Units of conservation = bits of information
 Entropy measures variability/disorder.
 Highly conserved = low entropy = tall stack
 Very variable = high entropy = short stack
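To make the conservation measure concrete, here is a minimal sketch that computes the per-column entropy in bits and the resulting stack height for the daf-19 sites shown earlier. It uses base-2 logarithms so that the result is in bits, and it applies no small-sample correction; both choices, and the function name, are assumptions of this example.

```python
import math

# The five daf-19 binding sites from the C. elegans slide.
SITES = [
    "GTTGTCATGGTGAC",
    "GTTTCCATGGAAAC",
    "GCTACCATGGCAAC",
    "GTTACCATAGTAAC",
    "GTTTCCATGGTAAC",
]

def column_conservation(column, alphabet_size=4):
    """Conservation in bits: log2(alphabet size) minus the column entropy."""
    n = len(column)
    entropy = 0.0
    for base in set(column):
        f = column.count(base) / n
        entropy -= f * math.log2(f)
    return math.log2(alphabet_size) - entropy

conservation = [column_conservation(col) for col in zip(*SITES)]
for position, bits in enumerate(conservation, start=1):
    print(f"position {position:2d}: {bits:.2f} bits")
```

A fully conserved column such as position 1 (all G) reaches the maximum of 2 bits, while position 4, which mixes A, T and G, falls below 0.5 bits.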
Outline
Finding Regulatory Motifs
Given a collection of genes with common expression,
find the (TF-binding) motif in common.
Identifying Motifs: Complications
 We do not know the motif sequence.
 We do not know where it is located relative to the gene's start.
 Motifs can differ slightly from one gene to another.
 How do we distinguish a real motif from “random” motifs?
Current Motif Discovery Methods
 GOAL: comprehensive identification of all the regulatory motifs in genomes.
 By overrepresentation
 MEME, Gibbs sampling
 By phylogenetic footprinting
 Footprinter
 Cross-species comparative analysis
 Combining structure information
Motif Finding: Comparative Analysis
 Systematic discovery of regulatory motifs in human promoters and 3' UTRs by comparison of several mammals.
 Xie, X. et al., Nature (2005).
 Identify motifs based on comparative analysis of the human, mouse, rat and dog genomes
 A systematic catalogue of human gene regulatory motifs
 Short, functional sequences (6-10 bp) used many times in a genome
 Focus regions
 Promoters
 3’ untranslated regions (3’ UTRs)
 microRNAs (miRNAs)
 post-transcriptional regulation
Motif Discovery Procedure
 Alignment of promoters & 3’ UTRs
 Motif conservation score (MCS)
 Measures the extent of excess conservation
 “Highly conserved motifs”
 MCS > 6
 Clustering
Alignment of promoters & 3’ UTRs
 Construct a whole-genome alignment for the four mammalian genomes
 BLASTZ and MULTIZ
 Extract the aligned promoter and 3’ UTR portions, respectively.
 Coordinates: taken from the annotation of NCBI reference sequences (RefSeq)
Motif Conservation Score (MCS)
 Consensus sequence representation
 Alphabet size: 11
(A, C, G, T, [AC], [AG], [AT], [CG], [CT], [GT], [ACGT])
 A conserved occurrence of a motif m is an instance in which an exact match to the motif is found in all four species.
 Conservation rate p = ratio of conserved occurrences to total occurrences in human
 Expected conservation rate p0 = average conservation rate of 100 random motifs of the same length and redundancy.
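The counting behind the conservation rate can be sketched as follows. This is only an illustration under strong assumptions: the alignment is a toy, gap-free block in which column i of every species corresponds to column i in human, the sequences are invented, and the bracket notation of the 11-letter alphabet is passed directly to Python's `re` module. Estimating p0 from 100 random motifs is not shown.

```python
import re

# Toy gap-free alignment block (invented sequences, for illustration only).
ALIGNMENT = {
    "human": "AATGACCTTGAATGACCTTGCC",
    "mouse": "AATGACCTTGAATGACCTTGCC",
    "rat":   "CATGACCTTGAATGACCTTGCC",
    "dog":   "AATGACCTTGTATGACATTGCC",
}

def conservation_counts(motif, alignment, reference="human"):
    """Return (conserved, total) occurrences of a bracket-style motif.

    An occurrence in the reference is conserved when the same alignment
    columns match the motif in every species.
    """
    # A lookahead lets finditer report overlapping occurrences too.
    scanner = re.compile(f"(?=({motif}))")
    conserved = total = 0
    for hit in scanner.finditer(alignment[reference]):
        start, end = hit.start(1), hit.end(1)
        total += 1
        if all(re.fullmatch(motif, seq[start:end]) for seq in alignment.values()):
            conserved += 1
    return conserved, total

k, n = conservation_counts("TGACCTTG", ALIGNMENT)
print(k, n, k / n)  # 1 2 0.5
```

With the degenerate motif `TGAC[AC]TTG` both human occurrences count as conserved in this toy block, because the dog sequence differs only at the position that the motif allows to vary.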
MCS
 MCS = number of standard deviations (s.d.) by which the observed conservation rate p of a motif exceeds the expected conservation rate p0.
 p = k/n
 Binomial probability of observing k conserved occurrences out of n:
$$P(k \text{ out of } n) = \frac{n!}{k!\,(n-k)!}\; p_0^{\,k}\,(1-p_0)^{\,n-k}$$
 Estimated by way of the normal approximation to the binomial distribution:
$$z = \frac{k-\mu}{\sigma}, \quad \text{where } \mu = n p_0, \quad \sigma = \sqrt{n p_0 (1-p_0)}$$
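As a quick check of the normal approximation, the sketch below computes the z-score from k, n and p0, using the numbers quoted on the next slide for the known 8-mer TGACCTTG (162 conserved out of 434 occurrences, random rate 6.8%). The function name is an assumption of this example.

```python
import math

def motif_conservation_score(k, n, p0):
    """z-score of k conserved occurrences out of n under Binomial(n, p0)."""
    mu = n * p0
    sigma = math.sqrt(n * p0 * (1.0 - p0))
    return (k - mu) / sigma

mcs = motif_conservation_score(k=162, n=434, p0=0.068)
print(f"MCS = {mcs:.1f} s.d.")
```

The result is roughly 25 standard deviations, in line with the 25.2 s.d. reported for this motif; the small difference is what one would expect from rounding p0 to 6.8%.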

Conservation Properties of Regulatory Motifs
 Known 8-mer TGACCTTG
 Conservation rate 37% (162 out of 434)
 Random rate 6.8%
 MCS = 25.2 s.d.
 Promoter region
 TRANSFAC: 446 motifs
 MCS > 3: 63%
 MCS > 5: ~50%
 3’ UTR
 no database analogous to TRANSFAC
 some known motifs
Results: motifs in promoters
 174 highly conserved motifs
 59 strongly match known motifs, 10 match weakly.
 105 potential new regulatory motifs
Xie, X. et al., Nature, 2005

Results: motifs in 3’ UTRs
 106 highly conserved motifs
 Two unusual properties
 Strand specificity
 Unusual length distribution
Property 1: strand specificity
[Figure: Xie, X. et al., Nature, 2005]
Property 2: unusual length distribution
[Figure: Xie, X. et al., Nature, 2005]

Properties => miRNA
 Strand specificity
 3’-UTR motifs act at the level of RNA rather than DNA
 and have a role in post-transcriptional regulation
 Length distribution
 Many mature miRNAs start with a U followed by a 7-base “seed” complementary to a site in the 3’ UTR of target mRNAs.
 Hypothesis: many of the highly conserved 8-mer motifs might be binding sites for conserved miRNAs.
The microRNA pathway
[Figure: capped (7mG(5’)ppp(5’)G) and polyadenylated pri-miRNA -> (Drosha, Pasha) -> pre-miRNA -> (Dicer) -> miR/miR* duplex -> mature miRNA in the miRNP]
Adapted from Tomari & Zamore, Curr Biol 2004
Relationship with miRNA
 72 highly conserved 8-mer motifs
 Contiguous, non-degenerate
 ~46% of all 3’-UTR motifs
 207 distinct human miRNAs
 From the current registry
 Complementary matches
 Exact match: ~43.5%
 One mismatch: ~50%
 95% of matches begin at nucleotide 1 or 2 of the miRNA gene
 8-mer motifs represent target sites for miRNAs
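The complementarity test described above can be sketched in a few lines: reverse-complement the 3’-UTR 8-mer, convert it to RNA, and look for it at the start of a mature miRNA. The let-7a sequence and the 8-mer `CTACCTCA` are chosen here purely for illustration and are not taken from the slides; the function names are likewise assumptions, and only exact matches are handled.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Reverse complement of a DNA string over A, C, G, T."""
    return dna.translate(COMPLEMENT)[::-1]

def seed_match_position(motif, mirna):
    """1-based miRNA position where the complement of the UTR motif starts.

    Returns None when the miRNA contains no exact complementary match.
    """
    target = reverse_complement(motif).replace("T", "U")
    index = mirna.find(target)
    return index + 1 if index != -1 else None

LET_7A = "UGAGGUAGUAGGUUGUAUAGUU"  # mature let-7a, used as an example miRNA

print(seed_match_position("CTACCTCA", LET_7A))  # 1
```

Allowing one mismatch, as in the second figure quoted above, would require comparing the target against each window of the miRNA instead of using `str.find`.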
8-mer motifs -> new miRNA genes
 RNAfold program
 242 conserved and stable stem-loop sequences
 113 known, 129 potential new miRNAs
 Biological validation
 12 selected new miRNA genes
 6 (50%) show clear expression in tissues.
Prevalence of miRNA regulation
 20% of 3’ UTRs may be targets for conserved miRNA-based regulation at the 8-mer motifs.
 Unbiased assessment of the relative importance of miRNA-based regulation in the human genome
Summary: comparative genome analysis
 4 mammalian species
 An initial systematic catalogue
 Promoters
 3’ UTRs
 Importance of the new miRNA regulatory mechanism
 Future directions:
 genome-wide discovery
 alignments of more genomes, e.g. the primates
Now…
 Motif Finding Methods
 Cross species comparative analysis
 Combine structure information
Motif Finding: Structural Knowledge
 Ab initio prediction of transcription factor targets using structural knowledge
 Kaplan T, et al., PLoS Comput Biol (2005)
 Proposes a general framework for predicting the DNA binding-site (BS) sequences of novel TFs from a known family
 Structure-based approach
 No prior TF binding data or target genes required
 Family-wise probabilistic model
 Context-specific amino acid-nucleotide recognition preferences
Structure-based approach
 Family-wise probabilistic model
 Input:
 pairs of TFs and their target DNA sequences
 structural information
 Output: context-specific amino acid-nucleotide recognition preferences
 Position specificity
 Then, discover TFBSs of other TFs from the same family
Cys2His2 Zinc Finger protein family
 largest known DNA-binding family in multicellular organisms
 common, strict binding model
source: Molecular Biology of the Cell (4th ed.), A. Johnson, et al.

Cys2His2 Zinc Finger: Canonical DNA binding model
Residues at positions 6, 3, 2, and -1 (relative to the beginning of the α-helix) of each finger interact with adjacent nucleotides in the DNA molecule (interactions shown with arrows).
Kaplan et al., PLoS Comput Biol, 2005

DNA Binding Model
source: Molecular Biology of the Cell (4th ed.), A. Johnson, et al.

Compiling dataset
 Goal: DNA-recognition preferences for each of the four key positions
 every amino acid (AA) vs. every nucleotide (NT)
 too few solved protein-DNA complexes
 Known protein sequence data and their DNA targets
 TRANSFAC: 455 protein-DNA pairs
 Non-canonical model
 Profile HMM
 No exact binding locations
 CX(2-4)CX(11-13)HX(3-5)H
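The Cys2His2 signature `CX(2-4)CX(11-13)HX(3-5)H` translates directly into a regular expression, where `X(a-b)` becomes "any residue, a to b times". The sketch below scans a protein sequence for candidate fingers; the test sequence is invented for this example, and a plain regex is only a crude stand-in for the profile HMM mentioned on the slide.

```python
import re

# C, 2-4 residues, C, 11-13 residues, H, 3-5 residues, H
C2H2_PATTERN = re.compile(r"C.{2,4}C.{11,13}H.{3,5}H")

def find_zinc_fingers(protein):
    """Return (start, matched_text) for each non-overlapping candidate finger."""
    return [(m.start(), m.group()) for m in C2H2_PATTERN.finditer(protein)]

# Invented sequence containing one finger-like segment.
PROTEIN = "MACPVESCDRRFSRSDELTRHIRIHTG"

for start, finger in find_zinc_fingers(PROTEIN):
    print(start, finger)  # 2 CPVESCDRRFSRSDELTRHIRIH
```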
Profile HMM
[Figure labels: “silent” deletion states, insertion states, match states]
 Build a model representing the consensus sequence for a family, rather than the sequence of any particular member
 Find potential alignments for new sequences
Example: full profile HMM
[Figure: example of a full profile HMM]
Structure-based approach
 Input: set of pairs of TFs and their target DNA sequences
 Output: context-specific amino acid-nucleotide recognition preferences
 Iterative Expectation Maximization (EM) algorithm
Probabilistic Model
 The set of interacting residues in the 4 key positions of the k fingers
 Let N1, …, NL be a target DNA sequence
 The probability of an interaction starting from the jth position in the DNA [equation shown as an image on the slide]
 where P_p(N|A) is the conditional probability of nucleotide N given amino acid A at position p.

EM algorithm
 Iterative EM algorithm, estimating both
 the exact binding locations for all protein-DNA pairs
 the recognition preferences P_p(N|A)
 E-step
 Compute the posterior probability of each binding location, based on the current preferences
 M-step
 Update the DNA-recognition preferences to maximize the expected likelihood under the distribution over binding locations computed in the preceding E-step
 Local optima
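The E-step and M-step can be illustrated with a deliberately simplified sketch. It is not the model of Kaplan et al.: here each protein is reduced to one recognition residue per contacted nucleotide, the prior over binding start positions is uniform, a pseudocount keeps probabilities non-zero, and the number of iterations is fixed. All names and the toy protein-DNA pairs are invented for this example.

```python
from collections import defaultdict

NUCLEOTIDES = "ACGT"

def site_probability(prefs, residues, dna, start):
    """Product over contact positions p of P_p(N | A) for a site at `start`."""
    prob = 1.0
    for p, residue in enumerate(residues):
        prob *= prefs[p][residue][dna[start + p]]
    return prob

def e_step(prefs, pairs):
    """Posterior over binding start positions for every (residues, dna) pair."""
    posteriors = []
    for residues, dna in pairs:
        n_starts = len(dna) - len(residues) + 1
        weights = [site_probability(prefs, residues, dna, j) for j in range(n_starts)]
        total = sum(weights)
        posteriors.append([w / total for w in weights])
    return posteriors

def m_step(pairs, posteriors, width, pseudocount=0.1):
    """Re-estimate P_p(N | A) from posterior-weighted nucleotide counts."""
    counts = [defaultdict(lambda: dict.fromkeys(NUCLEOTIDES, pseudocount))
              for _ in range(width)]
    for (residues, dna), posterior in zip(pairs, posteriors):
        for j, weight in enumerate(posterior):
            for p, residue in enumerate(residues):
                counts[p][residue][dna[j + p]] += weight
    return [
        {residue: {n: c / sum(row.values()) for n, c in row.items()}
         for residue, row in counts[p].items()}
        for p in range(width)
    ]

def run_em(pairs, width, iterations=25):
    """Alternate E and M steps from uniform initial preferences."""
    prefs = [
        {residues[p]: dict.fromkeys(NUCLEOTIDES, 0.25) for residues, _ in pairs}
        for p in range(width)
    ]
    for _ in range(iterations):
        posteriors = e_step(prefs, pairs)
        prefs = m_step(pairs, posteriors, width)
    return prefs, e_step(prefs, pairs)

# Invented toy data: (recognition residues, target DNA sequence).
PAIRS = [
    ("RER", "TTGCGTTA"),
    ("RER", "AAGCGCAT"),
    ("RDR", "CTGAGTCA"),
    ("QER", "TAACGATT"),
]

prefs, posteriors = run_em(PAIRS, width=3)
```

Because EM only reaches a local optimum, the preferences found depend on the starting point; the uniform initialization used here is one arbitrary choice among many.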
Estimate DNA-recognition preferences
[Figure: Kaplan et al., PLoS Comput Biol, 2005]

Apply to TFs from the same family
[Figure: Kaplan et al., PLoS Comput Biol, 2005]

Evaluation
 Compatible with experimental results
 10-fold cross-validation
 Genome-wide scan of Drosophila melanogaster
 29 canonical Cys2His2 TFs
 GO enrichment of predicted target genes
 21 enriched for at least one GO term.
 mRNA expression profiles of target genes
 21 showed significant associations in at least one embryogenesis experiment.
Compare with other preferences
[Figure: Kaplan et al., PLoS Comput Biol, 2005]

Summary
 Family-wise approach
 Combine structure information with sequence data
 Learn context-specific AA-NT recognition preferences
 Predict binding preferences of new proteins
 Identify TFBSs and target genes
Discussion
 Tradeoff between complexity and accuracy
 Canonical model
 Extension to other DNA-binding domains
 Restrictions: enough binding data, a common and strict binding model…
 Provides a promising way to predict target genes of novel proteins and to understand their function and activity
Thank you!
 Any questions?
