Regulatory Motif Finding

Wenxiu Ma
CS374 Presentation
11/03/2005
Outline
 Regulation of genes
 Regulatory Motifs
 Motif Representation
 Current Motif Discovery Methods

Regulation of Genes
 What turns genes on (producing a protein) and off?
 When is a gene turned on or off?
 Where (in which cells) is a gene turned on?
 How many copies of the gene product are produced?

Overview of Gene Control
 The mechanisms that control the expression of genes operate at many levels.

source: Molecular Biology of the Cell (4th ed.), A. Johnson, et al.


Transcriptional Regulation
 The transcription of each gene is controlled by a regulatory region of DNA relatively near the transcription start site (TSS).
 Two types of fundamental components
 short DNA regulatory elements
 gene regulatory proteins that recognize and bind to them.

Regulation of Genes
[Figure, three frames: a transcription factor (protein) and RNA polymerase (protein) act at a regulatory element on the DNA next to a gene; in the final frame a new protein is produced.]

source: M. Tompa, U. of Washington


What is a motif?
 A subsequence (substring) that occurs in multiple sequences and has biological importance.
 Motifs can be totally constant or have variable elements.
 Protein motifs often result from structural features.
 DNA motifs (regulatory elements)
 Binding sites for proteins
 Short sequences (5-25 bp)
 Up to 1000 bp (or farther) from the gene
 Inexactly repeating patterns

daf-19 Binding Sites in C. elegans
Axis: -150 to -1 (upstream of each gene)
Gene      Binding site
che-2     GTTGTCATGGTGAC
daf-19    GTTTCCATGGAAAC
osm-1     GCTACCATGGCAAC
osm-6     GTTACCATAGTAAC
F02D8.3   GTTTCCATGGTAAC
source: Peter Swoboda
Motif Representation
 Consensus sequence: a single string giving the most likely letter at each position (with or without wildcards)
 Regular expression: a string with wildcards that constrain the letters allowed at each position
 Profile: a list of the letter frequencies at each position
 Sequence logo:
 graphical depiction of a profile
 shows the conservation of elements in a motif
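
The three textual representations can be sketched in a few lines of Python, using the five daf-19 binding sites from the earlier slide as input. The function names and the tie-breaking rule (alphabetical) are assumptions made for this illustration, not part of the presentation.

```python
from collections import Counter

# The five daf-19 binding sites listed on the earlier slide.
sites = [
    "GTTGTCATGGTGAC",
    "GTTTCCATGGAAAC",
    "GCTACCATGGCAAC",
    "GTTACCATAGTAAC",
    "GTTTCCATGGTAAC",
]

def profile(seqs):
    """Profile: per-position letter frequencies (one dict per column)."""
    n = len(seqs)
    return [{b: c / n for b, c in Counter(col).items()} for col in zip(*seqs)]

def consensus(seqs):
    """Consensus: most frequent letter per column (ties broken alphabetically)."""
    out = []
    for col in zip(*seqs):
        counts = Counter(col)
        out.append(max(sorted(counts), key=counts.get))
    return "".join(out)

def regex(seqs):
    """Regular expression allowing every letter observed in each column."""
    parts = []
    for col in zip(*seqs):
        letters = "".join(sorted(set(col)))
        parts.append(letters if len(letters) == 1 else f"[{letters}]")
    return "".join(parts)

print(consensus(sites))
print(regex(sites))
print(profile(sites)[1])
```

The regular expression built this way matches every input site by construction, but it also accepts combinations never observed together, which is one reason a profile is usually the more informative representation.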

Motif Logos: an Example

(http://www-lmmb.ncifcrf.gov/~toms/sequencelogo.html)
Measure of Conservation
 Relative heights of letters reflect their abundance in the alignment.
 Total height = entropy-based measurement of conservation.
 Entropy at position i, where f(b, i) is the frequency of base b at position i and the logarithm is base 2 so that the result is in bits:
$$\mathrm{Entropy}(i) = -\sum_{b \in \{A,C,G,T\}} f(b,i)\,\log_2 f(b,i)$$
 Conservation(i) = 2 - Entropy(i)
 Units of conservation = bits of information
 Entropy measures variability/disorder.
 Highly conserved = low entropy = tall stack
 Very variable = high entropy = low stack
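
As a rough illustration of the stack heights, the sketch below computes Conservation(i) = 2 - Entropy(i) in bits for each column of the daf-19 sites from the earlier slide. It assumes a four-letter DNA alphabet and applies no small-sample correction, which logo software may add; the function name is hypothetical.

```python
import math
from collections import Counter

def conservation(column, alphabet_size=4):
    """Information content in bits: log2(alphabet size) minus Shannon entropy."""
    n = len(column)
    entropy = -sum((c / n) * math.log2(c / n) for c in Counter(column).values())
    return math.log2(alphabet_size) - entropy

# The five daf-19 binding sites listed on the earlier slide.
sites = [
    "GTTGTCATGGTGAC",
    "GTTTCCATGGAAAC",
    "GCTACCATGGCAAC",
    "GTTACCATAGTAAC",
    "GTTTCCATGGTAAC",
]

bits = [conservation(col) for col in zip(*sites)]
for position, value in enumerate(bits, start=1):
    print(position, round(value, 2))
```

A column where all five sites agree reaches the maximum of 2 bits, while a column such as the fourth (G, T, A, A, T) yields a much shorter stack.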

Finding Regulatory Motifs
Given a collection of genes with common expression,
find the (TF-binding) motif in common.
Identifying Motifs: Complications
 We do not know the motif sequence
 We do not know where it is located relative to the gene's start
 Motifs can differ slightly from one gene to another
 How can it be discerned from “random” motifs?

Current Motif Discovery Methods
 GOAL: comprehensive identification of all the regulatory motifs in genomes.
 by overrepresentation
 MEME, Gibbs sampling
 by phylogenetic footprinting
 FootPrinter
 Cross-species comparative analysis
 Combine structure information
Motif Finding: Comparative Analysis
 Systematic discovery of regulatory motifs in human promoters and 3' UTRs by comparison of several mammals.
 Xie, X. et al., Nature (2005).
 Identify motifs based on comparative analysis of the human, mouse, rat and dog genomes
 A systematic catalogue of human gene regulatory motifs
 Short, functional sequences (6-10 bp) used many times in a genome
 Focus regions
 Promoters
 3’ untranslated regions (3’ UTRs)
 microRNAs (miRNAs)
 post-transcriptional regulation

Motif Discovery Procedure
 Alignment of promoters & 3’ UTRs
 Motif conservation score (MCS)
 Measure the extent of excess conservation
 “Highly conserved motifs”
 MCS>6
 Clustering

Alignment of promoters & 3’ UTRs
 Construct a whole-genome alignment for the four mammalian genomes
 BLASTZ and MULTIZ
 Extract the aligned promoter and 3’ UTR portions, respectively.
 Coordinates: taken from the annotation of NCBI reference sequences (RefSeq)

Motif Conservation Score (MCS)
 Consensus sequence representation
 Alphabet size: 11
(A, C, G, T, [AC], [AG], [AT], [CG], [CT], [GT], [ACGT])
 A conserved occurrence of a motif m is an instance in which an exact match to this motif is found in all four species.
 Conservation rate p = ratio of conserved occurrences to total occurrences in human
 Expected conservation rate p0 = average conservation rate of 100 random motifs of the same length and redundancy.
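
The counting behind the conservation rate can be sketched as follows. This is a toy under stated assumptions: the four aligned sequences are invented, gap-free and of equal length, a "conserved occurrence" is taken to mean a match at the same aligned position in every species, and the names `parse_motif`, `matches` and `conservation_rate` are hypothetical.

```python
import re

def parse_motif(motif):
    """Turn text such as 'TGA[CT]G' into one set of allowed bases per position."""
    tokens = re.findall(r"\[([ACGT]+)\]|([ACGT])", motif)
    return [set(group or single) for group, single in tokens]

def matches(positions, window):
    """True if every base of the window is allowed at its position."""
    return len(window) == len(positions) and all(
        base in allowed for base, allowed in zip(window, positions)
    )

def conservation_rate(motif, aligned):
    """Return (k, n): conserved and total occurrences of the motif in human."""
    positions = parse_motif(motif)
    width = len(positions)
    human = aligned["human"]
    total = conserved = 0
    for i in range(len(human) - width + 1):
        if matches(positions, human[i:i + width]):
            total += 1
            if all(matches(positions, seq[i:i + width]) for seq in aligned.values()):
                conserved += 1
    return conserved, total

# Invented toy alignment (not real genomic sequence).
aligned = {
    "human": "AATGACCTTGAATGACCTTGAA",
    "mouse": "AATGACCTTGAATGATCTTGAA",
    "rat":   "AATGACCTTGACTGACCTTGAA",
    "dog":   "ATTGACCTTGAATGACCTTGAA",
}

print(conservation_rate("TGACCTTG", aligned))
print(conservation_rate("TGA[CT][CT]TTG", aligned))
```

In the toy data the second human occurrence is not conserved for the exact 8-mer because the mouse sequence differs at one base, whereas the degenerate motif tolerates that substitution.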
MCS
 MCS = number of standard deviations (s.d.) by which the observed conservation rate p of a motif exceeds the expected conservation rate p0.
 p = k/n
 Binomial probability of observing k conserved occurrences out of n:
$$P(k \text{ out of } n) = \frac{n!}{k!\,(n-k)!}\; p_0^{k}\,(1-p_0)^{n-k}$$
 Estimated by way of the normal approximation to the binomial distribution:
$$z = \frac{k-\mu}{\sigma}, \quad \text{where } \mu = n p_0,\ \sigma = \sqrt{n p_0 (1-p_0)}$$

Conservation Properties of
Regulatory Motifs
 Known 8-mer TGACCTTG
 Conservation rate 37% (162 out of 434)
 random rate 6.8%
 MCS = 25.2 s.d.
 Promoter Region
 TRANSFAC: 446 motifs
 MCS>3: 63%
 MCS>5: ~50%
 3’ UTR
 no database analogous to TRANSFAC
 some known motifs
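
As a quick check of the MCS definition, the sketch below applies the normal approximation z = (k - n·p0) / sqrt(n·p0·(1 - p0)) to the TGACCTTG figures quoted above (162 conserved out of 434 occurrences, random rate 6.8%). The function names are hypothetical, and the exact binomial term is included only for comparison.

```python
import math

def mcs(k, n, p0):
    """Motif conservation score: z-score of k conserved out of n occurrences."""
    mu = n * p0
    sigma = math.sqrt(n * p0 * (1 - p0))
    return (k - mu) / sigma

def binomial_pmf(k, n, p0):
    """Exact binomial probability of k conserved occurrences out of n."""
    return math.comb(n, k) * p0**k * (1 - p0) ** (n - k)

score = mcs(k=162, n=434, p0=0.068)
print(f"MCS for TGACCTTG: {score:.2f} s.d.")
```

The result is close to the 25.2 s.d. reported on the slide; a small difference is to be expected because the random rate is given only to two significant figures.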

Motif Discovery Procedure
 Alignment of promoters & 3’ UTRs
 Motif conservation score (MCS)
 “Highly conserved motifs”
 MCS>6
 Clustering

Results: motifs in promoters
 174 highly conserved motifs
 59 strongly match known motifs; 10 match more weakly.
 105 potential new regulatory motifs


Xie, X. et al., Nature, 2005


Results: motifs in 3’ UTRs
 106 highly conserved motifs
 Two unusual properties
 Strand specificity
 Unusual length distribution

Property 1: strand specificity
[Figure: Xie, X. et al., Nature, 2005]

Property 2: unusual length distribution
[Figure: Xie, X. et al., Nature, 2005]


Properties => miRNA
 Strand specificity
 3’-UTR motifs act at the level of RNA rather than DNA
 They have a role in post-transcriptional regulation
 Length distribution
 Many mature miRNAs start with a U followed by a 7-base “seed” complementary to a site in the 3’ UTR of target mRNAs.
 Hypothesis: many of the highly conserved 8-mer motifs might be binding sites for conserved miRNAs.

The microRNA pathway
[Figure: pri-miRNA (7mG(5’)ppp(5’)G cap, 3’ poly(A) tail) -> Drosha/Pasha -> pre-miRNA -> Dicer -> miR/miR* duplex -> mature miRNA -> miRNP]
Adapted from Tomari & Zamore, Curr Biol 2004
Relationship with miRNA
 72 highly conserved 8-mer motifs
 Contiguous, non-degenerate
 ~46% of all 3’-UTR motifs
 207 distinct human miRNAs
 From the current registry
 Complementary matches
 Exact match: ~43.5%
 One mismatch: ~50%
 95% of matches begin at nucleotide 1 or 2 of the miRNA gene
 8-mer motifs represent target sites for miRNAs
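
The complementarity test described above can be sketched as follows: an 8-mer motif (written in the DNA alphabet) is compared with the reverse complement of the miRNA segment starting at nucleotide 1 or 2, allowing zero or one mismatch. The let-7a sequence is used only as a familiar example and does not come from the slides; the function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "U": "A", "T": "A"}

def target_site(mirna, start, length=8):
    """Site (DNA alphabet) complementary to mirna[start:start+length], 0-based."""
    segment = mirna[start:start + length]
    return "".join(COMPLEMENT[base] for base in reversed(segment))

def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def explains(motif, mirna, max_mismatches=0):
    """True if the motif is complementary to the miRNA starting at nt 1 or 2."""
    return any(
        mismatches(motif, target_site(mirna, start, len(motif))) <= max_mismatches
        for start in (0, 1)
    )

# Mature let-7a, an illustrative input that is not taken from the slides.
let7a = "UGAGGUAGUAGGUUGUAUAGUU"

print(target_site(let7a, 0))
print(explains("CTACCTCA", let7a))
print(explains("CTACCTCT", let7a, max_mismatches=1))
```

Because the miRNA starts with U, the complementary site ends with A, which fits the observation that many mature miRNAs begin with U followed by the 7-base seed.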
8-mer motifs -> new miRNA genes
 RNAfold program
 242 conserved and stable stem-loop sequences
 113 known, 129 potential new miRNAs
 Biological validation
 12 selected new miRNA genes
 6 (50%) show clear expression in tissues.

Prevalence of miRNA regulation
 20% of 3’ UTRs may be targets for conserved miRNA-based regulation at the 8-mer motifs.
 Unbiased assessment of the relative importance of miRNA-based regulation in the human genome

Summary:
comparative genome analysis
 4 mammalian species
 an initial systematic catalogue
 Promoters
 3’ UTRs
 Importance of the new miRNA regulatory mechanism
 Future directions:
 genome-wide discovery
 more genome alignments: the primates
Now…
 Motif Finding Methods
 Cross species comparative analysis
 Combine structure information

Motif Finding: Structural Knowledge
 Ab initio prediction of transcription factor targets using structural knowledge
 Kaplan T, et al., PLoS Comput Biol (2005)
 Proposes a general framework for predicting the DNA binding-site (BS) sequences of novel TFs from a known family
 Structure-based approach
 No prior binding data or target genes needed for the novel TF
 Family-wise probabilistic model
 Context-specific amino acid-nucleotide recognition preferences

Structure-based approach
 Family-wise probabilistic model
 Input:
 pairs of TFs and their target DNA sequences
 structural information
 Output: Context-specific amino acid-nucleotide recognition preferences
 Position specificity
 Then, discover TFBSs of other TFs from the
same family

Cys2His2 Zinc Finger protein family
 largest known DNA-binding family in multicellular organisms
 common, strict binding models


source: Molecular Biology of the Cell (4th ed.), A. Johnson, et al.


Cys2His2 Zinc Finger:
Canonical DNA binding model

Residues at positions 6, 3, 2, and -1 (relative to the beginning of the α-helix) of each finger interact with adjacent nucleotides in the DNA molecule (interactions shown with arrows).
Kaplan et al., PLoS Comput Biol, 2005
Cys2His2 Zinc Finger:

DNA Binding Model

source: Molecular Biology of the Cell (4th ed.), A. Johnson, et al.


Cys2His2 Zinc Finger:
Compiling dataset
 Goal: DNA-recognition preferences for each of the four key positions
 every amino acid (AA) vs. every nucleotide (NT)
 Insufficient solved protein-DNA complexes
 Known protein sequence data and their DNA targets
 TRANSFAC: 455 protein-DNA pairs
 Non-canonical model
 Profile HMM
 No exact binding locations
 CX(2-4)CX(11-13)HX(3-5)H
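
The spacing pattern CX(2-4)CX(11-13)HX(3-5)H translates directly into a regular expression. The sketch below is only an illustration of that pattern, not the profile HMM named on the slide, and the protein sequence is a synthetic toy built for the example.

```python
import re

# C, 2-4 residues, C, 11-13 residues, H, 3-5 residues, H
ZINC_FINGER = re.compile(r"C.{2,4}C.{11,13}H.{3,5}H")

def find_fingers(protein):
    """Return (start index, matched text) for each non-overlapping finger."""
    return [(m.start(), m.group()) for m in ZINC_FINGER.finditer(protein)]

# Synthetic toy protein with two fingers joined by a short linker (invented).
finger1 = "C" + "PE" + "C" + "GKSFSQSSNLQK" + "H" + "QRT" + "H"
finger2 = "C" + "AIAS" + "C" + "GRAFTQKSNLVRR" + "H" + "LKTA" + "H"
toy_protein = "MK" + finger1 + "TGEKP" + finger2 + "GS"

for start, text in find_fingers(toy_protein):
    print(start, text)
```

A plain pattern like this reports where candidate fingers lie but says nothing about how well each one fits the family, which is the kind of information a profile HMM adds.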

Profile HMM
[Figure: profile HMM with match states, insertion states and “silent” deletion states]
 Build a model representing the consensus sequence for a family, rather than the sequence of any particular member
 Find potential alignments for new sequences

Example: full profile HMM

Structure-based approach
 Input: set of pairs of TFs and their target DNA sequences
 Output: Context-specific amino acid-nucleotide recognition preferences
 Iterative Expectation Maximization (EM) algorithm

Cys2His2 Zinc Finger:
Probabilistic Model
 Let A be the set of interacting residues in the 4 key positions of the k fingers
 Let N1, …, NL be a target DNA sequence
 The probability of an interaction starting from the jth position in the DNA: [equation not preserved in this copy]
 where Pp(N|A) is the conditional probability of nucleotide N given amino acid A at position p.

Kaplan et al., PLoS Comput Biol, 2005


EM algorithm
 Iterative EM algorithm estimates two unknowns
 Exact binding locations for all protein-DNA pairs
 Recognition preferences: Pp(N|A)
 E-step
 Compute the expected posterior probability of binding locations, based on the current preferences
 M-step
 Update the DNA-recognition preferences to maximize the likelihood, weighting each possible binding location by its posterior probability from the preceding E-step
 May converge to a local optimum
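
The E-step and M-step can be illustrated with a deliberately simplified sketch. It is not the model of Kaplan et al.: it assumes that each nucleotide of a binding site is contacted by exactly one residue, identified by a key (position p, amino acid A), that all start positions are equally likely a priori, and that a small pseudocount smooths the estimates. All names and the toy data are hypothetical.

```python
NUCS = "ACGT"

def posteriors(theta, contacts, dna):
    """E-step for one TF-DNA pair: P(site starts at j), uniform prior over j."""
    width = len(contacts)
    likes = []
    for j in range(len(dna) - width + 1):
        like = 1.0
        for offset, key in enumerate(contacts):
            like *= theta[key][dna[j + offset]]  # P_p(N | A) for this contact
        likes.append(like)
    total = sum(likes)
    return [like / total for like in likes]

def update(pairs, all_posteriors, pseudocount=0.1):
    """M-step: re-estimate P_p(N | A) from posterior-weighted nucleotide counts."""
    counts = {}
    for (contacts, dna), post in zip(pairs, all_posteriors):
        for j, weight in enumerate(post):
            for offset, key in enumerate(contacts):
                row = counts.setdefault(key, {n: pseudocount for n in NUCS})
                row[dna[j + offset]] += weight
    return {
        key: {n: c / sum(row.values()) for n, c in row.items()}
        for key, row in counts.items()
    }

def run_em(pairs, iterations=50):
    """Alternate E- and M-steps, starting from uniform preferences."""
    keys = {key for contacts, _ in pairs for key in contacts}
    theta = {key: {n: 0.25 for n in NUCS} for key in keys}
    post = []
    for _ in range(iterations):
        post = [posteriors(theta, contacts, dna) for contacts, dna in pairs]
        theta = update(pairs, post)
    return theta, post

# Toy data: one contact type (position -1, arginine) repeated three times,
# with a GGG site hidden in three different invented backgrounds.
R = (-1, "R")
pairs = [
    ([R, R, R], "AAAGGGCCC"),
    ([R, R, R], "TTTGGGAAA"),
    ([R, R, R], "CCCGGGTTT"),
]

theta, post = run_em(pairs)
print({n: round(p, 3) for n, p in theta[R].items()})
print([round(w, 3) for w in post[0]])
```

On this toy input the preference for G sharpens over the iterations and the posterior mass concentrates on the GGG window of each sequence. With less clear-cut data the outcome depends on the starting point, which is the local-optimum caveat noted above.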

Estimate DNA-recognition
preferences

Kaplan et al., PLoS Comput Biol, 2005


Apply on TFs from the same family

Kaplan et al., PLoS Comput Biol, 2005


Evaluation
 compatible with experimental results
 10-fold cross validation
 genome-wide scan of Drosophila melanogaster
 29 canonical Cys2His2 TFs
 GO enrichment of predicted target genes
 21 enriched with at least one GO term.
 mRNA expression profile of target genes
 21 showed significant associations in at least one embryogenesis experiment.

Compare with other preferences

Kaplan et al., PLoS Comput Biol, 2005


Summary
 Family-wise approach
 Combine structure information with sequence data
 Learn context-specific AA-NT recognition preferences
 Predict binding preferences of new proteins
 Identify TFBSs and target genes

Discussion
 Tradeoff between complexity and accuracy
 Canonical model
 Extension to other DNA-binding domains
 Restrictions: enough binding data, a common and strict binding model…
 Provide a promising way to predict target
genes of novel proteins and to understand
their function and activity

Thank you!
 Any questions?
