Audrey WK 5 Lecture B

When comparing two different RNA samples, the signal from the two samples needs to be normalized.
On spotted arrays:
The red and green channels are scanned and detected separately, with independent scan parameters.
Example: Imagine the red signal is detected with much higher laser intensity & PMT settings …
The array would look artifactually red.
On Affymetrix-type arrays:
Same principle: one sample, one array.
Need to adjust for overall array intensity.
Normalization: ‘center the distribution’ (Yang et al. 2002 NAR)
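As a rough illustration of 'centering the distribution', the sketch below subtracts the array-wide median from every log ratio so that the bulk of the spots sits at 0. The slide does not say which centering statistic is used, so the median is an assumption here, and the function name and input values are made up for the example.

```python
from statistics import median

def center_log_ratios(log_ratios):
    """Subtract the array-wide median so the log-ratio distribution is centered on 0.

    Missing spots are passed as None and stay None.
    """
    present = [v for v in log_ratios if v is not None]
    shift = median(present)
    return [None if v is None else v - shift for v in log_ratios]

# Hypothetical log ratios from an array that looks artifactually red
raw = [1.5, 2.0, 2.5, None, 3.5]
centered = center_log_ratios(raw)
print(centered)
```

A single global shift only corrects an overall channel imbalance; it does not address intensity-dependent or spatial effects.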

Now each array = list of bg-corrected, normalized relative transcript values
Array 1 Array 2
ID Log ratio ID Log Ratio (635/532)
YPL187W 6.36 YPL187W -0.072
YGR043C 1.82 YGR043C -0.228
YGL089C 6.439 YGL089C
YCR040W 1.012 YCR040W 0.694
YCR039C 1.147 YCR039C -0.487
YCL001W 1.934 YCL001W -0.536
YJR004C 2.76 YJR004C 0.026
YLL005C 2.395 YLL005C -0.008
YGL101W 2.22 YGL101W 0
YLR040C 2.073 YLR040C -0.659
upgrade plate 1.863 upgrade plate -0.408
EMPTY 1.755 EMPTY -0.008
upgrade plate 1.573 upgrade plate 0.109
YBL051C 1.419 YBL051C -0.054
YLR349W 1.382 YLR349W -0.457
YCL066W 1.338 YCL066W
YLR227W-A 1.335 YLR227W-A -0.419
upgrade plate 1.314 upgrade plate -0.401
YDL186W 1.246 YDL186W 0.959
YDR536W 1.183 YDR536W -0.58
upgrade plate 1.165 upgrade plate 0.543
YHR124W 1.163 YHR124W -0.465
YAL065C 1.091 YAL065C -1.133
YBR012W-A 1.078 YBR012W-A 0.676
YCL026C-A 1.046 YCL026C-A -0.468
YJL078C 1.045 YJL078C -0.889
YHR161C 1.033 YHR161C -0.033
YBR244W 1.028 YBR244W
YGR237C 1 YGR237C -0.754
YGL189C 0.997 YGL189C -0.11
YCL009C 0.989 YCL009C 0.014
YKL185W 0.968 YKL185W
YDR285W 0.95 YDR285W -0.435
YMR057C 0.949 YMR057C 0.672
Q0250 0.942 Q0250 -0.219
YOR235W 0.924 YOR235W 1.166
YDR415C 0.922 YDR415C -0.334
YER072W 0.906 YER072W -0.509
YDL013W 0.877 YDL013W
YLR206W 0.874 YLR206W
YML047C 0.874 YML047C -0.819
YDR306C 0.858 YDR306C
YDR528W 0.823 YDR528W 0.276
YGL088W 0.8 YGL088W
YBL097W 0.787 YBL097W
YBR013C 0.782 YBR013C -0.896
YIR019C 0.779 YIR019C
YDR361C 0.772 YDR361C -1.017
YLR267W 0.769 YLR267W -0.457
YAL008W 0.746 YAL008W 1.465
YGL128C 0.741 YGL128C 0.027
YDR530C 0.739 YDR530C 2.083
Assessing replicates: how well do the data agree overall?
linear regression
Example of good replicates: y = 0.978x + 0.0095, R² = 0.8332
[Scatter plot: Array 2 values vs. Array 1 values for DES460 + 0.2% MMS - 45 min, with linear fit]
Example of bad replicates: y = 0.1104x - 0.0358, R² = 0.0205
[Scatter plot: Array 2 values vs. Array 1 values, with linear fit]
Where does the noise come from?
-- can be biological variation
-- can be array artifacts
… should define both types of variation …
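The replicate comparison above can be sketched as an ordinary least-squares fit of Array 2 values against Array 1 values, reporting the slope, intercept and R². This is a minimal pure-Python sketch; the function name and the example values are hypothetical, and it assumes genes with a missing value on either array have already been dropped.

```python
def linear_fit(x, y):
    """Least-squares fit y = slope * x + intercept, plus R^2, for paired values."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared

# Hypothetical log ratios for the same genes on two replicate arrays
array1 = [0.0, 1.0, 2.0, 3.0]
array2 = [1.0, 3.0, 5.0, 7.0]
slope, intercept, r_squared = linear_fit(array1, array2)
print(slope, intercept, r_squared)
```

For good replicates the slope should be close to 1 and R² close to 1, as in the first example plot.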
Now you have your data, in the form of background-subtracted expression ratios, extracted from the arrays.
Now what?
Select differentially expressed genes to focus on
Methods of gene selection:
-- arbitrary fold-expression-change cutoff
example: genes that change >3X in expression between samples
-- statistically significant change in expression
requires replicates
[Figure: Gene X expression under condition 1; expression difference]
Use statistics to compare the mean & variation of 2 (or more) populations
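A fold-change cutoff can be sketched as a simple filter on log ratios. The example assumes the ratios are in log2 (the table earlier does not state the base, so this is an assumption), which turns a 3X cutoff into |log2 ratio| > log2(3), about 1.58. The function name is hypothetical; the four values are Array 2 entries from the table above.

```python
import math

def select_by_fold_change(log2_ratios, fold=3.0):
    """Keep genes whose expression changes by more than `fold` in either direction."""
    cutoff = math.log2(fold)
    return {gene: r for gene, r in log2_ratios.items() if abs(r) > cutoff}

# Array 2 log ratios from the table above (log base 2 is assumed)
array2 = {"YDR530C": 2.083, "YAL008W": 1.465, "YGL128C": 0.027, "YAL065C": -1.133}
selected = select_by_fold_change(array2, fold=3.0)
print(selected)
```

The cutoff is arbitrary and ignores variability, which is why the statistical approach that follows needs replicates.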
Test if the means of 2 (or more) groups are the same or statistically different
The ‘null hypothesis’ H0 says that the two groups are statistically the same
-- you will either reject or fail to reject (loosely, ‘accept’) the null hypothesis
Choosing the right test:
parametric test if your data are normally distributed with equal variance
nonparametric test if either of those assumptions does not hold
[Figure: Normal data vs. Not normal data]
Test if the means of 2 groups are the same or statistically different
If your two samples are normally distributed with equal variance, use the t-test
$T = \frac{\bar{X}_1 - \bar{X}_2}{\mathrm{SED}}$
where $\bar{X}_1 - \bar{X}_2$ is the difference in the means and SED is the standard error of the difference in the means
If T > Tc (or |T| > Tc for a two-tailed test), where Tc is the critical value for the degrees of freedom & confidence level,
then reject H0
Notice that if the data aren’t normally distributed, the mean and standard deviation are poor summaries of the data.
[Figure: one-tailed t-test vs. two-tailed t-test]
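The t statistic above can be sketched in a few lines using only the standard library. This assumes the equal-variance (pooled) form of SED, which matches the 'equal variance' condition on the slide; the function name and the replicate values are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance

def two_sample_t(sample1, sample2):
    """Equal-variance two-sample t statistic and its degrees of freedom."""
    n1, n2 = len(sample1), len(sample2)
    df = n1 + n2 - 2
    # Pooled variance: weighted average of the two sample variances
    pooled_var = ((n1 - 1) * variance(sample1) + (n2 - 1) * variance(sample2)) / df
    # SED: standard error of the difference in the means
    sed = sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean(sample1) - mean(sample2)) / sed, df

# Hypothetical expression values for Gene X, three replicates per condition
condition1 = [1.0, 2.0, 3.0]
condition2 = [4.0, 5.0, 6.0]
t, df = two_sample_t(condition1, condition2)

t_crit = 2.776  # two-tailed critical value for df = 4, alpha = 0.05, from a t table
reject_h0 = abs(t) > t_crit
print(t, df, reject_h0)
```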
If your two samples are NOT normally distributed with equal variance, use the Mann-Whitney test
(Wilcoxon Rank Sum test)
1. Combine data from sample 1 and sample 2
2. Rank each data point in the pooled dataset
3. Compare the average rank for sample 1 and sample 2 values
4. Calculate U:
$U = n_1 n_2 + \frac{n_1 (n_1 + 1)}{2} - R_1$
where n1 and n2 are the 2 sample sizes and R1 is the sum of the rank scores for sample 1
If U > Uc, where Uc is the critical value from the U table, then reject H0
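The ranking steps and the U formula can be sketched as follows. Tied values are given the average of the ranks they span, which is a common convention but an assumption here since the slide does not mention ties; the function names and sample values are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def mann_whitney_u(sample1, sample2):
    """U = n1*n2 + n1*(n1+1)/2 - R1, with R1 the rank sum of sample 1."""
    n1, n2 = len(sample1), len(sample2)
    ranks = average_ranks(list(sample1) + list(sample2))
    r1 = sum(ranks[:n1])
    return n1 * n2 + n1 * (n1 + 1) / 2 - r1

u_low = mann_whitney_u([1, 2, 3], [4, 5, 6])   # sample 1 holds the lowest ranks
u_high = mann_whitney_u([4, 5, 6], [1, 2, 3])  # sample 1 holds the highest ranks
print(u_low, u_high)
```

The two U values always add up to n1*n2, so which one is compared against the table depends on the table's convention.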
The paired t-test for gene expression ratios
If your two samples are normally distributed with equal variance AND
your data were paired before collection, use the paired t-test
Example: Tumor sample before and after treatment
Gene expression differences expressed as ratios
e.g. mutant vs. wt log2 [ratio]: 5.0 4.3 6.7
$T = \frac{\bar{D}}{\mathrm{SEM}}$
where $\bar{D}$ is the average difference in expression and SEM is the standard error of the mean difference
If T > Tc where Tc is the critical value for the degrees of freedom (n-1) & confidence level,
then reject H0
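Applied to the three log2 ratios in the example above, the paired statistic works out as in this sketch. Because each log ratio is already a paired difference (mutant minus wt on the log scale), the test reduces to asking whether their mean differs from 0. The function name is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(differences):
    """T = mean difference / standard error of the mean difference; df = n - 1."""
    n = len(differences)
    sem = stdev(differences) / sqrt(n)
    return mean(differences) / sem, n - 1

# log2 ratios from the example above: mutant vs. wt
t, df = paired_t([5.0, 4.3, 6.7])
print(round(t, 2), df)
```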
Test if the means of 2 (or more) groups are the same or statistically different
ANOVA (ANalysis Of Variance): for comparing 2 or more means
$F = \frac{\text{variation between samples}}{\text{variation within samples}}$
If F > Fc, where Fc is the critical value for the degrees of freedom (k-1 between samples and N-k within samples, for k samples and N total observations) & confidence level,
then reject H0
ANOVA only tells you that at least one of your samples is different … may need
to identify which is different for >2 sample comparisons
Example uses:
You have 6 patients and 5 replicate liver biopsies from each patient.
The F-statistic (and corresponding p-value) will tell you
which genes are differentially expressed in any of the 6 patients
(but won’t tell you which patient)
There is also a two-way ANOVA for multiple variables:
You have 6 patients, half of whom smoke, and 5 replicate liver biopsies from each patient.
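For the one-way case, the F ratio for a single gene can be sketched as the between-group mean square divided by the within-group mean square. The function name and the three small groups are made up for illustration; in the patient example there would be 6 groups of 5 biopsies each.

```python
from statistics import mean

def one_way_anova_f(groups):
    """One-way ANOVA F statistic and its (between, within) degrees of freedom."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean([v for g in groups for v in g])
    # Variation between samples: group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation within samples: values around their own group mean
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between, df_within = k - 1, n_total - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, (df_between, df_within)

# Hypothetical expression values for one gene in three groups
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]]
f, dfs = one_way_anova_f(groups)
print(f, dfs)
```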
Assessing & minimizing error in calls
Type I error = false positives
Type II error = false negatives
Balance between minimizing false positives vs. false negatives
Assessing false positives vs. false negatives: sensitivity vs. specificity
Sensitivity (how well did you find what you want):
$\mathrm{Sensitivity} = \frac{\text{true positives}}{\text{total positives}} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}$
Specificity (how well did you discriminate):
$\mathrm{Specificity} = \frac{\text{true negatives}}{\text{total negatives}} = \frac{\text{true negatives}}{\text{true negatives} + \text{false positives}}$
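The two ratios above translate directly into code. The counts in the example are made up for illustration, and the function names are hypothetical.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of all real positives that were called positive."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """Fraction of all real negatives that were called negative."""
    return true_neg / (true_neg + false_pos)

# Hypothetical calls: 100 truly changed genes, 100 truly unchanged genes
sens = sensitivity(true_pos=80, false_neg=20)
spec = specificity(true_neg=90, false_pos=10)
print(sens, spec)
```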
When working with many genes, you must correct for multiple testing …
p < 0.01 means that, if H0 is true, there is less than a 1 in 100 chance of seeing a difference this large
But if you test 30,000 genes, each with a 0.01 chance of a false call,
then you expect about 300 false positives, even if no gene truly changes!
The Bonferroni correction is a simple way to deal with this.
Adjust the p-value cutoff such that there is a 1 in 100 chance of even one false
identification across all genes:
p = 0.01 / 30,000 ‘trials’, so p < 3.3 x 10^-7 is significant
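The arithmetic on this slide can be checked with a short sketch. The function names are hypothetical; the numbers are the ones used on the slide.

```python
def expected_false_positives(alpha, n_tests):
    """Expected number of false calls if every tested gene is truly null."""
    return alpha * n_tests

def bonferroni_cutoff(alpha, n_tests):
    """Per-gene p-value cutoff that keeps the overall error rate at alpha."""
    return alpha / n_tests

n_genes = 30_000
expected = expected_false_positives(0.01, n_genes)
cutoff = bonferroni_cutoff(0.01, n_genes)
print(expected)          # about 300
print(f"{cutoff:.2e}")   # about 3.33e-07
```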
But, this turns out to be too stringent …
why divide by ALL trials, when only some are significant anyway?
A newer, better way of dealing with this is FDR correction
FDR: false discovery rate
How many of the called positives are false?
5% FDR means 5% of calls are false positives
This is different from the false positive rate:
The rate at which true negatives are called significant
A 5% false positive rate means 5% of true negatives are incorrectly called significant
“The p-value cutoff [and false positive rate] says little about the content of the
features actually called significant” (Storey and Tibshirani 2003)
Storey and Tibshirani 2003: q-value to represent FDR
FDR = expected ratio of false positives vs all positives (Expected [F/S])
q value: for a given region of data space, what fraction of genes in that region are false?
e.g. Gene X has q = 0.04 … this means that, of all genes in that region
of data space (genes at least as significant as Gene X), 4% are falsely called positive.
“The q-value for a particular feature is the expected proportion of false positives incurred
when calling that feature significant.”
FDR = expected ratio of false positives vs all positives:
Expected [F/S] ~ Expected[F] / Expected [S]
-- can initially estimate S based on a simple p-value cutoff
We need to estimate π0 = m0 / m = fraction of all features that are truly negative
Genes with p > 0.5 show a relatively flat density … because we expect that p-values of null genes are randomly (uniformly) distributed, we assume that most of these genes are true nulls …
(The tuning parameter λ is the p cutoff above which nulls are assumed)
The density for genes with p>0.5 allows us to estimate the # of true negatives and thus π0
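The π0 estimate and the resulting q-values can be sketched as below. This is a simplified version with a single fixed λ = 0.5, as on the slide; Storey and Tibshirani smooth the estimate over a range of λ values, which is not done here. The function names and p-values are hypothetical.

```python
def estimate_pi0(p_values, lam=0.5):
    """Estimate the fraction of true nulls from the flat part of the p-value density.

    Null p-values are uniform, so about (1 - lam) of them fall above lam.
    """
    m = len(p_values)
    n_above = sum(1 for p in p_values if p > lam)
    return min(1.0, n_above / (m * (1.0 - lam)))

def q_values(p_values, lam=0.5):
    """q-value per feature: smallest estimated FDR at which it is called significant."""
    m = len(p_values)
    pi0 = estimate_pi0(p_values, lam)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each q is the minimum over looser cutoffs
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pi0 * m * p_values[i] / rank)
        q[i] = running_min
    return q

# Hypothetical p-values for 10 genes
p = [0.001, 0.002, 0.003, 0.004, 0.01, 0.2, 0.6, 0.7, 0.8, 0.9]
pi0 = estimate_pi0(p)
q = q_values(p)
print(pi0, [round(v, 3) for v in q])
```

With only 10 genes the estimate is very noisy; the approach is meant for the thousands of p-values an array produces.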
