The red and green channels are scanned and detected separately, with independent scan parameters.
Same principle: one sample, one array
Need to adjust for overall array intensity
‘Center the distribution’
[Figure: plot with axes ‘Array 1 values’ and ‘Array 2 values’]
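One way to ‘center the distribution’ can be sketched as follows. This is a minimal illustration, assuming log2-transformed intensities and median centering; the slides do not say which centering statistic was used, and the function name and values are made up.

```python
import statistics

def center_array(log_values):
    """Subtract the array's median so the centered values have median 0."""
    med = statistics.median(log_values)
    return [v - med for v in log_values]

# Hypothetical log2 intensities for the same five genes on two arrays.
array1 = [8.0, 9.5, 10.0, 11.0, 12.5]
array2 = [9.0, 10.5, 11.0, 12.0, 13.5]  # same pattern, brighter overall

centered1 = center_array(array1)
centered2 = center_array(array2)
print(centered1)
print(centered2)
```

After centering, the overall brightness difference between the two arrays is gone and only the gene-to-gene pattern remains.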
Now what?
Select differentially expressed genes to focus on
Expression difference
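Selecting genes by expression difference might look like the sketch below. The two-fold cutoff (1.0 on a log2 scale), the function name, and the gene values are assumptions for illustration, not taken from the slides.

```python
def select_by_difference(log_ratios, threshold=1.0):
    """Return gene names whose absolute log2 expression difference
    meets or exceeds the threshold (1.0 = two-fold, an assumed cutoff)."""
    return [gene for gene, diff in log_ratios.items() if abs(diff) >= threshold]

# Hypothetical log2(sample / reference) differences.
log_ratios = {"geneA": 2.3, "geneB": -0.2, "geneC": -1.7, "geneD": 0.4}
selected = select_by_difference(log_ratios)
print(selected)
```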
Test if the means of 2 (or more) groups are the same or statistically different
The ‘null hypothesis’ H0 says that the two groups are statistically the same
-- you will either reject or fail to reject the null hypothesis
The t-test is a parametric test: use it if your two samples are normally distributed with equal variance
If |T| > Tc, where Tc is the critical value for the degrees of freedom & confidence level,
then reject H0
Notice that if the data aren’t normally distributed, the mean and standard deviation may not summarize them well.
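The t-test decision rule can be sketched in a few lines. This assumes the equal-variance (pooled) two-sample form; the data and function name are hypothetical, and the critical value is read from a standard t table rather than computed.

```python
import math
import statistics

def pooled_t_statistic(group1, group2):
    """Two-sample t statistic with pooled variance, plus degrees of freedom."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = statistics.mean(group1), statistics.mean(group2)
    v1, v2 = statistics.variance(group1), statistics.variance(group2)
    pooled_var = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    t = (m1 - m2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2

# Hypothetical expression values for one gene in two groups.
control = [5.0, 6.0, 7.0, 8.0, 9.0]
treated = [8.0, 9.0, 10.0, 11.0, 12.0]

t_stat, df = pooled_t_statistic(treated, control)
T_CRITICAL = 2.306  # two-sided, alpha = 0.05, df = 8 (from a t table)
reject_h0 = abs(t_stat) > T_CRITICAL
print(t_stat, df, reject_h0)
```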
If your two samples are NOT normally distributed with equal variance, use the Mann-Whitney test (also called the Wilcoxon rank-sum test)
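The rank-based idea behind the Mann-Whitney test can be sketched as below. This computes only the U statistic, with average ranks for ties; the function name and data are hypothetical, and the critical value would still need to come from a Mann-Whitney table (H0 is rejected when U is at or below it).

```python
def mann_whitney_u(group1, group2):
    """Return the smaller of the two U statistics, using average ranks for ties."""
    combined = sorted(group1 + group2)
    rank_of = {}
    i = 0
    while i < len(combined):
        j = i
        while j < len(combined) and combined[j] == combined[i]:
            j += 1
        # Values at indices i..j-1 occupy 1-based ranks i+1..j; use their mean.
        rank_of[combined[i]] = (i + 1 + j) / 2
        i = j
    n1, n2 = len(group1), len(group2)
    rank_sum1 = sum(rank_of[v] for v in group1)
    u1 = rank_sum1 - n1 * (n1 + 1) / 2
    u2 = n1 * n2 - u1
    return min(u1, u2)

# Hypothetical values; the outlier 20.0 affects only its rank, not the result.
group_a = [1.1, 2.3, 2.9, 3.0]
group_b = [3.5, 4.8, 5.2, 20.0]
u_stat = mann_whitney_u(group_a, group_b)
print(u_stat)
```

Because only ranks are used, the test does not depend on the data being normally distributed.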
If your two samples are normally distributed with equal variance AND
your data were paired before collection, use the paired t-test
If |T| > Tc, where Tc is the critical value for the degrees of freedom (n-1, with n the number of pairs) & confidence level,
then reject H0
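For paired data the test works on the per-pair differences, which a short sketch makes concrete. The before/after values and the function name are hypothetical, and the critical value is taken from a standard t table.

```python
import math
import statistics

def paired_t_statistic(before, after):
    """Paired t statistic on per-pair differences, plus degrees of freedom (n - 1)."""
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    t = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t, n - 1

# Hypothetical expression of one gene in 5 patients, before and after treatment.
before = [10.0, 12.0, 9.0, 11.0, 13.0]
after = [12.0, 15.0, 10.0, 14.0, 15.0]

t_paired, df_paired = paired_t_statistic(before, after)
T_CRITICAL_PAIRED = 2.776  # two-sided, alpha = 0.05, df = 4 (from a t table)
reject_h0_paired = abs(t_paired) > T_CRITICAL_PAIRED
print(t_paired, df_paired, reject_h0_paired)
```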
For more than 2 groups, use ANOVA (the F-test); here H0 says that all group means are statistically the same
If F > Fc, where Fc is the critical value for the degrees of freedom (k-1 between groups and N-k within groups, for k groups and N total observations) & confidence level,
then reject H0
ANOVA only tells you that at least one of your samples is different … may need
to identify which is different for >2 sample comparisons
Example uses:
You have 6 patients and 5 replicate liver biopsies from each patient.
For each gene, the F-statistic (and corresponding p-value) will tell you
whether expression differs in at least one of the 6 patients
(but won’t tell you which patient)
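The 6-patient example can be sketched as a one-way ANOVA for a single gene. The expression values and function name are made up for illustration, and the critical value is read from a standard F table for 5 and 24 degrees of freedom.

```python
import statistics

def one_way_anova_f(groups):
    """Return the F statistic and its (between, within) degrees of freedom."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum(sum((x - statistics.mean(g)) ** 2 for x in g) for g in groups)
    df_between = k - 1
    df_within = n_total - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Hypothetical data: 6 patients, 5 replicate biopsies each, for one gene.
base = [4.0, 5.0, 6.0, 5.0, 5.0]
patients = [list(base) for _ in range(5)] + [[v + 3.0 for v in base]]

f_stat, df1, df2 = one_way_anova_f(patients)
F_CRITICAL = 2.62  # alpha = 0.05, df = (5, 24), from an F table
reject_h0_anova = f_stat > F_CRITICAL
print(f_stat, df1, df2, reject_h0_anova)
```

The result says that some patient differs for this gene, but a follow-up comparison would be needed to show that it is the sixth one.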
Assessing & minimizing error in calls
When working with many genes, you must correct for multiple testing …
p < 0.01 means that, if H0 is true, there is less than a 1 in 100 chance of seeing a difference this large
But if you have 30,000 genes, with a 0.01 chance that each conclusion is wrong,
then you can expect about 300 false positives!
Adjust the p-value cutoff for the number of genes tested, so that the chance of a false identification stays at 1 in 100 across all genes.
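The arithmetic behind the 300 false positives, and one common way to adjust the cutoff, can be sketched as follows. The Bonferroni correction is used here as an assumption, since the slide's own formula did not survive extraction; the variable names are illustrative.

```python
n_genes = 30_000
alpha = 0.01

# Expected false positives if H0 is true for every gene and each is tested at alpha.
expected_false_positives = n_genes * alpha

# Bonferroni correction (assumed here): divide the cutoff by the number of tests.
bonferroni_cutoff = alpha / n_genes

print(expected_false_positives)
print(bonferroni_cutoff)
```

Under this correction a gene is called significant only if its p-value falls below roughly 3.3e-7, which is very strict and motivates the false discovery rate approach.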
“The p-value cutoff [and false positive rate] says little about the content of the
features actually called significant” (Storey and Tibshirani 2003)
FDR = expected ratio of false positives vs all positives (Expected [F/S])
q value: for a given region of data space, what fraction of genes in that region are false?
e.g., Gene X has q = 0.04 … this means that of all the genes in that region
of data space, an expected 4% are falsely called positive.
“The q-value for a particular feature is the expected proportion of false positives incurred
when calling that feature significant.”
FDR = expected ratio of false positives (F) vs all positives (S):
Expected[F/S] ≈ Expected[F] / Expected[S]
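A q-value style calculation can be sketched with the Benjamini-Hochberg adjustment. This is a swapped-in simplification: it matches the Storey and Tibshirani q-value only under the conservative assumption that essentially all genes are truly null. The function name and p-values are hypothetical.

```python
def bh_q_values(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that q-values
    # never increase as p-values decrease.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q

# Hypothetical p-values for five genes.
p_values = [0.001, 0.008, 0.039, 0.041, 0.60]
q_values = bh_q_values(p_values)
significant = [i for i, q in enumerate(q_values) if q <= 0.05]
print(q_values)
print(significant)
```

With these made-up numbers, calling genes significant at q ≤ 0.05 keeps the first two, even though four have p < 0.05.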