
When comparing two different RNA samples, the signal from the two samples needs to be normalized

On spotted arrays:

The red and green channels are scanned and detected separately, with independent scan parameters.

Example: Imagine the red signal is detected with much higher laser intensity & PMT settings … Array would look artifactually red.

On Affymetrix-type arrays:

Same principle: one sample-one array

Need to adjust for overall array intensity
‘center the distribution’

Yang et al. 2002 NAR
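One simple way to "center the distribution" is to subtract the array-wide median log ratio from every spot, so that a channel scanned too brightly no longer shifts all ratios in one direction. The sketch below assumes this global median centering; the slides do not say which normalization variant was used, and the function name and example values are hypothetical.

```python
from statistics import median

def median_center(log_ratios):
    """Subtract the array-wide median so the log ratios are centered on 0.

    Missing spots are passed as None and are left as None.
    """
    observed = [v for v in log_ratios if v is not None]
    shift = median(observed)
    return [None if v is None else v - shift for v in log_ratios]

# Hypothetical log ratios from an array that was scanned "too red":
# every spot is shifted upward by roughly the same amount.
raw = [1.2, 0.9, 1.0, None, 1.5, 0.7]
centered = median_center(raw)
```

After centering, the median of the observed values is 0, which is what an array with no overall color bias would be expected to show.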


Now each array = list of bg-corrected, normalized relative transcript values
Array 1 Array 2
ID Log ratio ID Log Ratio (635/532)
YPL187W 6.36 YPL187W -0.072
YGR043C 1.82 YGR043C -0.228
YGL089C 6.439 YGL089C
YCR040W 1.012 YCR040W 0.694
YCR039C 1.147 YCR039C -0.487
YCL001W 1.934 YCL001W -0.536
YJR004C 2.76 YJR004C 0.026
YLL005C 2.395 YLL005C -0.008
YGL101W 2.22 YGL101W 0
YLR040C 2.073 YLR040C -0.659
upgrade plate 1.863 upgrade plate -0.408
EMPTY 1.755 EMPTY -0.008
upgrade plate 1.573 upgrade plate 0.109
EMPTY 1.529 EMPTY -0.866
YBL051C 1.419 YBL051C -0.054
YLR349W 1.382 YLR349W -0.457
YCL066W 1.338 YCL066W
YLR227W-A 1.335 YLR227W-A -0.419
upgrade plate 1.314 upgrade plate -0.401
YDL186W 1.246 YDL186W 0.959
YDR536W 1.183 YDR536W -0.58
upgrade plate 1.165 upgrade plate 0.543
YHR124W 1.163 YHR124W -0.465
EMPTY 1.127 EMPTY -0.715
YAL065C 1.091 YAL065C -1.133
YBR012W-A 1.078 YBR012W-A 0.676
YCL026C-A 1.046 YCL026C-A -0.468
YJL078C 1.045 YJL078C -0.889
YHR161C 1.033 YHR161C -0.033
YBR244W 1.028 YBR244W
YGR237C 1 YGR237C -0.754
YGL189C 0.997 YGL189C -0.11
YCL009C 0.989 YCL009C 0.014
YKL185W 0.968 YKL185W
YDR285W 0.95 YDR285W -0.435
YMR057C 0.949 YMR057C 0.672
Q0250 0.942 Q0250 -0.219
YOR235W 0.924 YOR235W 1.166
YDR415C 0.922 YDR415C -0.334
YER072W 0.906 YER072W -0.509
EMPTY 0.892 EMPTY -1.174
EMPTY 0.89 EMPTY -0.818
YDL013W 0.877 YDL013W
YLR206W 0.874 YLR206W
YML047C 0.874 YML047C -0.819
YDR306C 0.858 YDR306C
YDR528W 0.823 YDR528W 0.276
YGL088W 0.8 YGL088W
YBL097W 0.787 YBL097W
YBR013C 0.782 YBR013C -0.896
YIR019C 0.779 YIR019C
YDR361C 0.772 YDR361C -1.017
YLR267W 0.769 YLR267W -0.457
YAL008W 0.746 YAL008W 1.465
YGL128C 0.741 YGL128C 0.027
YDR530C 0.739 YDR530C 2.083
Assessing replicates: how well do the data agree overall?
linear regression

Example of good replicates

y = 0.978x + 0.0095
$R^2 = 0.8332$
[Figure: scatter plot of Array 2 values vs. Array 1 values for DES460 + 0.2% MMS - 45 min, with linear fit]

Example of bad replicates

y = 0.1104x - 0.0358
$R^2 = 0.0205$
[Figure: scatter plot of Array 2 values vs. Array 1 values, with linear fit]

Where does the noise come from?

-- can be biological variation

-- can be array artifacts

… should define both types of variation …
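The replicate comparison above fits a straight line to Array 2 values against Array 1 values and reports the slope, the intercept and R². A minimal least-squares sketch is shown here; the function name and the example values are hypothetical and are not the data behind the plots.

```python
def linear_fit(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept, plus R^2."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)  # squared Pearson correlation
    return slope, intercept, r_squared

# Hypothetical log ratios for the same genes on two replicate arrays
array1 = [-2.0, -1.0, 0.0, 1.0, 2.0]
array2 = [-1.8, -1.1, 0.2, 0.9, 2.1]
slope, intercept, r_squared = linear_fit(array1, array2)
```

Good replicates should give a slope near 1, an intercept near 0 and an R² near 1, as in the first example above.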
Now you have your data, in the form of
background-subtracted expression ratios,
extracted from the arrays.

Now what?

Select differentially expressed genes to focus on

Methods of gene selection:

-- arbitrary fold-expression-change cutoff
   example: genes that change >3X in expression between samples

-- statistically significant change in expression
   requires replicates

[Figure: Gene X expression under condition 1 and under condition 2, plotted along an expression-difference axis]
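The fold-change cutoff above can be applied directly to log ratios: a >3X change in either direction corresponds to an absolute log2 ratio greater than log2(3), about 1.585. The sketch below assumes the ratios are log2 values (the table does not state the base); the function name is hypothetical, and the five genes are taken from the Array 2 column shown earlier.

```python
import math

def select_by_fold_change(log2_ratios, fold=3.0):
    """Return gene IDs whose expression changes by more than `fold` up or down."""
    cutoff = math.log2(fold)
    return [gene for gene, value in log2_ratios.items() if abs(value) > cutoff]

# A few Array 2 values from the table, assumed to be log2 ratios
array2 = {
    "YPL187W": -0.072,
    "YAL065C": -1.133,
    "YOR235W": 1.166,
    "YAL008W": 1.465,
    "YDR530C": 2.083,
}
selected = select_by_fold_change(array2)
```

Under these assumptions only YDR530C passes the 3X cutoff. The cutoff itself is arbitrary, which is the limitation the slide points out.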
Use statistics to compare the mean & variation of 2 (or more) populations

[Figure: x-axis "Expression difference"]
Test if the means of 2 (or more) groups are the same or statistically different
The ‘null hypothesis’ H0 says that the two groups are statistically the same
-- you will either reject or fail to reject the null hypothesis

Choosing the right test:

parametric test if your data are normally distributed with equal variance

nonparametric test if either of the above is not true

[Figure: normal data vs. not normal data]
If your two samples are normally distributed with equal variance, use the t-test

$T = \frac{\bar{X}_1 - \bar{X}_2}{\mathrm{SED}}$

where $\bar{X}_1 - \bar{X}_2$ is the difference in the means and SED is the standard error of the difference in the means

If T > Tc where Tc is the critical value for the degrees of freedom & confidence level (|T| > Tc for a two-tailed test),
then reject H0

Notice that if the data aren’t normally distributed, mean and standard deviation are not meaningful.
[Figure: one-tailed t-test vs. two-tailed t-test]
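As a rough illustration of the t-test above, the sketch below computes T as the difference in means divided by the standard error of that difference, using a pooled variance (which is what the equal-variance assumption implies). The function name and the expression values are hypothetical; the critical value 2.776 is the standard two-tailed table value for 4 degrees of freedom at the 0.05 level.

```python
import math
from statistics import mean, variance

def pooled_t(sample1, sample2):
    """Two-sample t statistic assuming equal variances; returns (T, degrees of freedom)."""
    n1, n2 = len(sample1), len(sample2)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * variance(sample1) + (n2 - 1) * variance(sample2)) / df
    sed = math.sqrt(pooled_var * (1 / n1 + 1 / n2))  # standard error of the difference
    return (mean(sample1) - mean(sample2)) / sed, df

# Hypothetical replicate measurements of one gene under two conditions
condition1 = [2.1, 2.5, 1.9]
condition2 = [0.2, -0.1, 0.4]
t_stat, df = pooled_t(condition1, condition2)

T_CRIT = 2.776  # two-tailed, alpha = 0.05, df = 4 (from a t table)
reject_h0 = abs(t_stat) > T_CRIT
```

With three replicates per condition there are only 4 degrees of freedom, so the critical value is large; this is one reason the statistical approach requires replicates.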
If your two samples are NOT normally distributed with equal variance, use Mann-Whitney test
(Wilcoxon Rank Sum test)

1. Combine data from sample 1 and sample 2
2. Rank each data point in the pooled dataset
3. Compare the average rank for sample 1 and sample 2 values
4. Calculate U:

$U = n_1 n_2 + \frac{n_1 (n_1 + 1)}{2} - R_1$

Where n1 and n2 are the 2 sample sizes and R1 is the sum of the rank scores for sample 1

Compare U with Uc, the critical value from the U table, to decide whether to reject H0
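The ranking steps above can be sketched as follows. The function pools both samples, assigns average ranks to ties, sums the ranks of sample 1 and applies the U formula from the slide. It also returns the complementary value n1*n2 - U, since U tables are usually entered with the smaller of the two. The function name is hypothetical.

```python
def mann_whitney_u(sample1, sample2):
    """Return (U, n1*n2 - U), with U = n1*n2 + n1*(n1+1)/2 - R1."""
    pooled = sorted(list(sample1) + list(sample2))
    rank_of = {}
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j] == pooled[i]:
            j += 1
        # Tied values occupy 1-based ranks i+1 .. j; give each the average rank.
        rank_of[pooled[i]] = (i + 1 + j) / 2
        i = j
    n1, n2 = len(sample1), len(sample2)
    r1 = sum(rank_of[v] for v in sample1)  # rank sum for sample 1
    u = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    return u, n1 * n2 - u

# Completely separated samples give the most extreme possible U values
u, u_other = mann_whitney_u([1, 2, 3], [4, 5, 6])
```

Because only ranks are used, the result does not change if the data are log-transformed or contain a few extreme outliers.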
The paired t-test for gene expression ratios

If your two samples are normally distributed with equal variance AND
your data were paired before collection, use the paired t-test

Example: Tumor sample before and after treatment

Gene expression differences expressed as ratios
e.g. mutant vs. wt log2 [ratio]: 5.0 4.3 6.7

$T = \frac{\bar{D}}{\mathrm{SEM}}$

where $\bar{D}$ is the average difference in expression and SEM is the standard error of the mean difference

If T > Tc where Tc is the critical value for the degrees of freedom (n-1) & confidence level,
then reject H0
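Applied to the three log2 ratios in the example above (5.0, 4.3, 6.7), the paired t statistic can be worked out as below. This is a sketch that assumes H0 is "the mean log2 ratio is 0"; the function name is hypothetical, and 4.303 is the standard two-tailed table value for 2 degrees of freedom at the 0.05 level.

```python
import math
from statistics import mean, stdev

def paired_t(differences):
    """Paired t statistic: mean difference divided by its standard error."""
    n = len(differences)
    sem = stdev(differences) / math.sqrt(n)  # standard error of the mean difference
    return mean(differences) / sem, n - 1

log2_ratios = [5.0, 4.3, 6.7]  # mutant vs. wt, one value per paired replicate
t_stat, df = paired_t(log2_ratios)

T_CRIT = 4.303  # two-tailed, alpha = 0.05, df = 2 (from a t table)
reject_h0 = abs(t_stat) > T_CRIT
```

Here T is about 7.48 with 2 degrees of freedom, which exceeds the critical value, so H0 would be rejected for this gene.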
ANOVA (ANalysis Of Variance): for comparing 2 or more means

$F = \frac{\text{variation between samples}}{\text{variation within samples}}$

If F > Fc where Fc is the critical value for the degrees of freedom (k-1 between and N-k within, for k groups and N total observations) & confidence level,
then reject H0

ANOVA only tells you that at least one of your samples is different … may need
to identify which is different for >2 sample comparisons
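A minimal sketch of the one-way ANOVA F statistic described above: the between-group mean square divided by the within-group mean square. The function name and the example groups are hypothetical.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of measurements."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean(v for g in groups for v in g)
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of each measurement around its own group mean
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between, df_within = k - 1, n_total - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Hypothetical expression of one gene in three groups of replicates
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df_between, df_within = one_way_anova_f(groups)
```

The F statistic is then compared with the critical value for (df_between, df_within) from an F table. A large F only says that some group differs, not which one.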
Example uses:

You have 6 patients and 5 replicate liver biopsies from each patient.
The F-statistic (and corresponding p-value) will tell you
which genes are differentially expressed in any of the 6 patients
(but won’t tell you which patient)

There is also a two-way ANOVA for multiple variables:

You have 6 patients, half of whom smoke, and 5 replicate liver biopsies from each patient.
Assessing & minimizing error in calls

Type I error = false positives

Type II error = false negatives

Balance between minimizing false positives vs. false negatives

Assessing false positives vs. false negatives: sensitivity vs. specificity

Sensitivity (how well did you find what you want):

$\text{Sensitivity} = \frac{\text{true positives}}{\text{total positives}} = \frac{\text{true positives}}{\text{true positives} + \text{false negatives}}$

Specificity (how well did you discriminate):

$\text{Specificity} = \frac{\text{true negatives}}{\text{total negatives}} = \frac{\text{true negatives}}{\text{true negatives} + \text{false positives}}$
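The two ratios above translate directly into code. The counts in this sketch are hypothetical and only illustrate the arithmetic.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of real positives that were called positive."""
    return true_pos / (true_pos + false_neg)

def specificity(true_neg, false_pos):
    """Fraction of real negatives that were called negative."""
    return true_neg / (true_neg + false_pos)

# Hypothetical outcome: 100 genes truly change, 1,000 do not
sens = sensitivity(true_pos=80, false_neg=20)    # 80 of 100 found
spec = specificity(true_neg=900, false_pos=100)  # 900 of 1,000 correctly excluded
```

Loosening the significance cutoff usually raises sensitivity and lowers specificity, which is the balance the slide refers to.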
When working with many genes must correct for multiple testing …

p < 0.01 means that, if H0 is true, there is less than a 1 in 100 chance of an observation at least this extreme

But if you have 30,000 genes, with a 0.01 chance that each conclusion is wrong,
then you can expect about 300 false positives!

The Bonferroni correction is a simple way to deal with this.

Adjust the p-value cutoff such that there is a 1 in 100 chance of even one false
identification across all genes:

p = 0.01 / 30,000 ‘trials’, so $p < 3.3 \times 10^{-7}$ is significant
But, this turns out to be too stringent …

why divide by ALL trials, when only some are significant anyway?
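The Bonferroni arithmetic above, written out as a short sketch. The function name is hypothetical, and the p-values at the end are made up for illustration.

```python
def bonferroni_cutoff(alpha, n_tests):
    """Per-test p-value cutoff that keeps the family-wise error rate at alpha."""
    return alpha / n_tests

n_genes = 30_000
alpha = 0.01

# Without correction: expected false positives if every gene were truly null
expected_false_positives = alpha * n_genes

# With correction: each gene must pass a much stricter cutoff
cutoff = bonferroni_cutoff(alpha, n_genes)

# Hypothetical p-values for three genes
p_values = {"geneA": 1e-9, "geneB": 2e-5, "geneC": 0.004}
significant = [g for g, p in p_values.items() if p < cutoff]
```

In this example geneB and geneC would pass an uncorrected 0.01 cutoff but fail the corrected one, which illustrates why the correction is often considered too stringent.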
Newer, better way of dealing with this is FDR correction

FDR: false discovery rate


How many of the called positives are false?
5% FDR means 5% of calls are false positive

This is different from the false positive rate:


The rate at which true negatives are called significant
5% false positives means 5% of true negatives are incorrectly called significant

“The p-value cutoff [and false positive rate] says little about the content of the
features actually called significant” (Storey and Tibshirani 2003)

Storey and Tibshirani 2003: q-value to represent FDR

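The difference between the false discovery rate and the false positive rate is easiest to see with counts. The numbers in this sketch are hypothetical: 30,000 genes, of which 1,000 truly change, tested at a cutoff that lets through 1% of the true negatives.

```python
def false_discovery_rate(false_pos, true_pos):
    """Fraction of the genes called significant that are actually null."""
    return false_pos / (false_pos + true_pos)

def false_positive_rate(false_pos, true_neg):
    """Fraction of the truly null genes that were called significant."""
    return false_pos / (false_pos + true_neg)

# Hypothetical outcome of testing 30,000 genes
true_pos = 800       # truly changed genes that were called significant
false_pos = 290      # null genes that were called significant
true_neg = 28_710    # null genes that were correctly not called

fpr = false_positive_rate(false_pos, true_neg)
fdr = false_discovery_rate(false_pos, true_pos)
```

With these counts the false positive rate is 1%, yet about 27% of the genes on the significant list are false. This is the sense in which the p-value cutoff says little about the content of the list.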
FDR = expected ratio of false positives vs all positives: $\mathrm{FDR} = E[F/S]$

q value: for a given region of data space, what fraction of genes in that region are false?
e.g. Gene X has a q = 0.04 … this means that for all genes that are in that region
of data space, 4% are falsely called positive.

“The q-value for a particular feature is the expected proportion of false positives incurred
when calling that feature significant.”
FDR = expected ratio of false positives vs all positives:
$E[F/S] \approx E[F] / E[S]$

-- can initially estimate S based on a simple p-value cutoff

We need to estimate $\pi_0 = m_0 / m$ = fraction of all features that are truly negative

Genes with p > 0.5 show a relatively flat density … because we expect that p-values of null genes are randomly (uniformly) distributed, we assume that most of these genes are true nulls …
(The tuning parameter λ is the p cutoff above which nulls are assumed)

The density for genes with p > 0.5 allows us to estimate the # of true negatives and thus π0
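The π0 estimate described above can be sketched with a single fixed λ: count the p-values above λ and divide by the number expected there if every feature were null, m(1 - λ). The q-values then follow by scaling each ranked p-value by π0·m/rank and enforcing monotonicity from the largest p-value down. This is a simplified version that assumes one fixed λ = 0.5 (Storey and Tibshirani smooth the estimate over a range of λ); the function names and p-values are hypothetical.

```python
def estimate_pi0(p_values, lam=0.5):
    """Estimate the fraction of truly null features from p-values above lam."""
    m = len(p_values)
    above = sum(1 for p in p_values if p > lam)
    return min(1.0, above / (m * (1.0 - lam)))

def q_values(p_values, lam=0.5):
    """q-value for each feature, returned in the same order as the input."""
    m = len(p_values)
    pi0 = estimate_pi0(p_values, lam)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so q never decreases with p.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pi0 * m * p_values[i] / rank)
        q[i] = running_min
    return q

# Hypothetical p-values for ten genes
p_values = [0.001, 0.004, 0.01, 0.02, 0.1, 0.3, 0.6, 0.7, 0.8, 0.9]
pi0 = estimate_pi0(p_values)
q = q_values(p_values)
```

Here four of the ten p-values lie above 0.5, so π0 is estimated as 4 / (10 × 0.5) = 0.8, and the smallest p-value (0.001) gets q = 0.8 × 10 × 0.001 / 1 = 0.008.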
