
LEARNING MODULE 2

KEY STATISTICAL CONCEPTS

• Probability distributions
• Discrete probability distributions
• Continuous probability distributions
• Binomial distribution
• Normal distribution

This work is licensed under a Creative Commons Attribution 4.0 International License.
Random Variables
Definition:
• A random variable, usually written X, is a variable whose
possible values are numerical outcomes of a random
phenomenon. These values can be associated with
probabilities. There are two types of random variables,
discrete and continuous.

• Discrete random variables have a countable number of
outcomes, e.g., the result of a die roll
• Continuous random variables have an infinite continuum
of possible values, e.g., blood pressure

Probability Functions/Distribution

• A probability distribution or function is a function that
describes the probability of a random variable taking
certain values
• A probability function maps the possible values of x to
their respective probabilities of occurrence, p(x)

Note:
• p(x) is a number between 0 and 1
• The probabilities of all possible values sum to 1 (for continuous
variables, the area under the probability function is 1)

Distributions

[Figure: overview of common probability distributions]
Mean and Variance
• If we understand the underlying probability distribution
of a certain phenomenon, we know how x is expected to
behave on average
• The expected value E[X] is the weighted average or
mean (µ) of random variable X
• If a random variable X takes value x1 with probability p1,
x2 with p2, …, and xn with pn, the expected value or mean
is given by

  E[X] = µ = x1·p1 + x2·p2 + … + xn·pn = Σi xi·pi
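As a quick check of this definition, the weighted sum can be evaluated directly; a minimal Python sketch for a fair six-sided die (variable names are illustrative):

```python
# Expected value E[X] = sum over all outcomes of x * p(x),
# illustrated with a fair six-sided die (each face has probability 1/6).
values = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6

expected_value = sum(x * p for x, p in zip(values, probs))
print(expected_value)  # ≈ 3.5
```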

Mean and Variance
• The variance describes how far the values of a random
variable deviate from the mean
• Variance Var[X] of a random variable X with expected
value µ = E[X] is given by

  Var[X] = E[(X − µ)²] = Σi (xi − µ)²·pi

• Variance is often also denoted as σ², where σ is the
standard deviation of the random variable X
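The same die example extends to the variance; a minimal sketch, assuming the fair-die probabilities from the expected-value slide (the exact value is 35/12):

```python
# Variance Var[X] = sum over outcomes of (x - mu)^2 * p(x),
# again for a fair six-sided die.
values = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6

mu = sum(x * p for x, p in zip(values, probs))  # mean, 3.5
variance = sum((x - mu) ** 2 * p for x, p in zip(values, probs))
sigma = variance ** 0.5  # standard deviation
print(variance)  # ≈ 2.9167
```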

Questions:
• How can you relate the concepts of accuracy and precision to
the measure of variance?
• What about reproducibility?
Discrete Example: Roll of a Die
• There are six possible outcomes for a die roll: numbers from one
through six
• Assume the die is fair, i.e. all numbers have the same probability
of showing
• If all outcomes are equally likely, then the probabilities are equal
as well – and since the sum over all probabilities has to be one,
they are all 1/6
• The histogram below shows the probability of each number
showing for every single roll of the die

[Figure: uniform histogram, p(x) = 1/6 for x = 1, 2, 3, 4, 5, 6]
http://s522.photobucket.com/user/poka-dot-pocky/media/Gaia/Decorated%20images/dice_zps0d0b23cc.png.html
Discrete Example: Roll of a Die

• Probabilities are equal (uniformly distributed)
• Each roll of the die has to come up with some result
• The probability of having any result is thus one
• Summing up all probabilities of a discrete random variable
will thus give one

x    P
1    P(x = 1) = 1/6
2    P(x = 2) = 1/6
3    P(x = 3) = 1/6
4    P(x = 4) = 1/6
5    P(x = 5) = 1/6
6    P(x = 6) = 1/6
Cumulative Distribution Function

Definition:
The cumulative distribution function (CDF) or just
distribution function, describes the probability that random
variable X with a given probability distribution will be found
to have a value less than or equal to x.

For a discrete random variable X the CDF is computed by
summing up the probabilities of all possible values xi up to x:

  FX(x) = P(X ≤ x) = Σ{xi ≤ x} p(xi)

Note: For continuous random variables the summation is
replaced by an integral.
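The running-sum construction of a discrete CDF can be sketched with Python's standard library; the fair-die probabilities are used as the example:

```python
from itertools import accumulate

# CDF of a discrete random variable: running sum of the probabilities
# p(x_i) for all values x_i <= x, shown for a fair six-sided die.
probs = [1 / 6] * 6
cdf = list(accumulate(probs))  # [1/6, 2/6, ..., 6/6]

print(cdf[2])   # P(X <= 3) ≈ 0.5
print(cdf[-1])  # total probability ≈ 1.0
```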
Cumulative Distribution Function

Example: six-sided die
• Outcomes are mutually exclusive (only one side can be up)
• Probabilities can just be summed up to obtain the CDF

x    P(X ≤ x)
1    1/6
2    2/6
3    3/6
4    4/6
5    5/6
6    6/6

[Figure: step plot of the CDF FX(x), rising from 1/6 to 1 in steps of 1/6]
Examples
1. What is the probability of rolling a 3 or less?

2. What is the probability of rolling a 5 or higher?

3. Is the probability of rolling 10 or higher on a 20-sided
die higher than the probability of rolling a 6 or higher
on a 12-sided die?
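These questions can be checked with exact fractions; a sketch, where `p_at_most` and `p_at_least` are illustrative helper names:

```python
from fractions import Fraction

# For a fair s-sided die: P(X <= k) = k/s and P(X >= k) = (s - k + 1)/s.
def p_at_most(k, sides):
    return Fraction(k, sides)

def p_at_least(k, sides):
    return Fraction(sides - k + 1, sides)

q1 = p_at_most(3, 6)         # 1/2
q2 = p_at_least(5, 6)        # 1/3
q3_d20 = p_at_least(10, 20)  # 11/20
q3_d12 = p_at_least(6, 12)   # 7/12
print(q3_d20 > q3_d12)       # False: 7/12 > 11/20
```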

Important Discrete Distributions

Binomial Distribution                      Poisson Distribution

[Figure: binomial probability mass functions for p=0.5 and n=20,
p=0.7 and n=20, and p=0.5 and n=40; Poisson probability mass functions]

Example application:                       Example application:
isotope distributions                      peptide identification
(Learning Unit 1C)                         (Learning Unit 7C/D)

http://upload.wikimedia.org/wikipedia/commons/7/75/Binomial_distribution_pmf.svg
http://upload.wikimedia.org/wikipedia/commons/1/16/Poisson_pmf.svg
Bernoulli Experiment
• Bernoulli experiment is a
random experiment where
the random variable can
take only two values
• Success (1)
• Failure (0)
• Given a probability of
success p, the probability
of failure is q = 1 - p
Example:
Roll a (six-sided) die, aiming for a six
• 6: success, p = 1/6
• 1, 2, 3, 4, 5: failure, q = 1 – p = 5/6

[Image: Jakob Bernoulli (1655-1705, Swiss mathematician)]
http://en.wikipedia.org/wiki/Jacob_Bernoulli#mediaviewer/File:Jakob_Bernoulli.jpg
Binomial Distribution
• Independent Bernoulli experiments build the basis for binomial
distributions
• Experiments with two possible outcomes (e.g., flipping a coin)
• n independent (repeated) experiments are performed
• Probability of success p is the same in every experiment

• N marbles in a jar
• r black and N-r white
• What is the probability
of drawing k black marbles
if n are drawn with
replacement?

Binomial Distribution
• The binomial distribution B(k;n,p) describes the probability for an
n-trial binomial experiment to result in exactly k successes:

  B(k; n, p) = (n choose k) · p^k · q^(n−k)

where
• k: the number of successes that result from the binomial experiment
• n: the number of trials in the binomial experiment
• p: the probability of success in an individual trial
• q: the probability of failure (q = 1 − p)
• (n choose k): the binomial coefficient (read: “n choose k”), the number
of different ways to choose k things out of n things
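The formula can be written directly with `math.comb` (Python ≥ 3.8); a minimal sketch, where the example parameters are illustrative:

```python
from math import comb

# Binomial probability B(k; n, p) = C(n, k) * p^k * q^(n-k), with q = 1 - p.
def binom_pmf(k, n, p):
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Probability of exactly 3 sixes in 10 die rolls:
print(binom_pmf(3, 10, 1 / 6))  # ≈ 0.155

# The probabilities over all possible k sum to 1:
print(sum(binom_pmf(k, 10, 1 / 6) for k in range(11)))  # ≈ 1.0
```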

Example: Drawing Marbles
Experiment: Draw two marbles from a jar containing 10
white and 10 black marbles (with replacement)
• The probability of drawing k black marbles is:

  P(x = k) = (2 choose k) · 0.5^k · 0.5^(2−k)

# of black marbles    probability
0                     0.25
1                     0.5
2                     0.25

• Mean and variance of the probability distribution are
given by:

  µ = n·p = 2 · 0.5 = 1
  σ² = n·p·q = 2 · 0.5 · 0.5 = 0.5
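A sketch verifying the table above and the mean/variance formulas µ = np and σ² = npq for this experiment (helper name is illustrative):

```python
from math import comb

# Drawing n = 2 marbles with replacement from a jar with equal numbers
# of black and white marbles: success probability p = 0.5 per draw.
def binom_pmf(k, n, p):
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 2, 0.5
pmf = [binom_pmf(k, n, p) for k in range(n + 1)]
mean = n * p                 # 1.0
variance = n * p * (1 - p)   # 0.5
print(pmf)  # [0.25, 0.5, 0.25]
```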

Example: Throwing a Coin
Experiment: Throw a fair coin ten times
[Figure: binomial distribution of the number of heads in ten coin
throws; x-axis: number of heads, y-axis: probability]
Poisson Approximation of the
Binomial Distribution
• Let P(x=k) denote the binomial distribution

  P(x = k) = (n choose k) · p^k · (1 − p)^(n−k)

and let p = λ/n
• We then obtain for the limit for very large n:

  lim(n→∞) P(x = k) = λ^k · e^(−λ) / k!
Poisson Approximation of the
Binomial Distribution
We thus obtain

  P(x = k) = λ^k · e^(−λ) / k!

the well-known Poisson distribution

• The Poisson distribution describes a Bernoulli experiment
with a high number of repeats and a low success
probability (i.e., if p is small and n is large)
• Therefore it is also called the Poisson law of small numbers
• The mean as well as the variance of the Poisson
distribution is λ
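The limit can be checked numerically; a sketch comparing the two probability mass functions for a large n (the parameter values are illustrative):

```python
from math import comb, exp, factorial

def binom_pmf(k, n, p):
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    return lam ** k * exp(-lam) / factorial(k)

# For large n and p = lam / n, the binomial pmf approaches the Poisson pmf.
lam, n, k = 3.0, 10_000, 2
print(binom_pmf(k, n, lam / n))  # ≈ 0.2240
print(poisson_pmf(k, lam))       # ≈ 0.2240
```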
Binomial Distributions
[Figure: binomial distributions for n=100, p=0.5; n=100, p=0.1;
n=1,000, p=0.009; and n=5,000, p=0.0009]

Poisson Distribution

[Figure: Poisson distributions for n = 1000 with λ = 4.5 and λ = 2.5]

Continuous Random Variables
• The probability density function fX for a continuous random
variable X is a non-negative, continuous function that
integrates to 1:

  ∫(−∞ to ∞) fX(x) dx = 1

• Cumulative distribution functions for continuous
random variables are computed equivalently to those of
discrete random variables
• Rather than summing over all suitable outcomes, we
need to integrate, though:

  FX(x) = P(X ≤ x) = ∫(−∞ to x) fX(t) dt
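The normalization requirement can be checked numerically; a sketch using simple trapezoidal integration of the standard normal density (the grid parameters are illustrative):

```python
from math import exp, pi, sqrt

# Standard normal density; any valid continuous density must integrate to 1.
def gauss_pdf(x, mu=0.0, sigma=1.0):
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

# Trapezoidal integration over [-8, 8]; the tails beyond are negligible.
a, b, n = -8.0, 8.0, 4000
h = (b - a) / n
area = (0.5 * (gauss_pdf(a) + gauss_pdf(b))
        + sum(gauss_pdf(a + i * h) for i in range(1, n))) * h
print(area)  # ≈ 1.0
```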

Gaussian Distribution
• The probability density function is given by

  f(x) = 1/(σ·√(2π)) · exp(−(x − µ)² / (2σ²))

• By definition

  ∫(−∞ to ∞) f(x) dx = 1

• The probability density function results in the well-known
bell-shaped Gaussian curve

Gaussian Mean and Variance
• The expected value is calculated as follows:

  E[X] = ∫(−∞ to ∞) x·f(x) dx = µ

• Furthermore

  E[X²] = ∫(−∞ to ∞) x²·f(x) dx = µ² + σ²

• Resulting in the general variance of Gaussian
distributions:

  Var[X] = E[X²] − (E[X])² = σ²

Standard Normal Distribution
• The standard normal distribution corresponds to the
general form of the Gaussian distribution with µ = 0 and
σ2 = 1 (centered, unit variance)
• An arbitrary normal distribution can be converted to a
standard normal distribution via Z-transformation:

  z = (x − µ) / σ

resulting in

  φ(z) = 1/√(2π) · exp(−z²/2)
Gaussian Distribution

[Figure: Gaussian density curves for different values of µ and σ²]
Error Function

• Computing the CDF of a normal
distribution is related to the
error function (or Gauss error
function) erf:

  erf(x) = 2/√π · ∫(0 to x) e^(−t²) dt

• Note that this integral cannot
be evaluated in closed form in
terms of elementary functions
• It can be approximated with
elementary functions, though,
or evaluated numerically

Note:
The error function is an odd function, i.e. erf(−x) = −erf(x)

[Figure: plot of the error function]
http://en.wikipedia.org/wiki/Error_function#mediaviewer/File:Error_Function.svg
Error Function
• We can compute the CDF of a normal distribution and
thus obtain

  P{x ≤ r} = 1/(σ·√(2π)) · ∫(−∞ to r) exp(−(t − µ)²/(2σ²)) dt

• With the Gaussian error function, the CDF P{x ≤ r}
simplifies to

  P{x ≤ r} = 1/2 · (1 + erf((r − µ)/(σ·√2)))

• This allows the evaluation of the probability that a
Gaussian random variable Y lies in an interval of size r
around the mean value µ

[Figure: shaded area under a Gaussian curve in an interval around µ]
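The simplified CDF can be evaluated with `math.erf`; a minimal sketch (the helper name is illustrative):

```python
from math import erf, sqrt

# CDF of a normal distribution expressed via the error function:
# P{x <= r} = 0.5 * (1 + erf((r - mu) / (sigma * sqrt(2))))
def normal_cdf(r, mu=0.0, sigma=1.0):
    return 0.5 * (1.0 + erf((r - mu) / (sigma * sqrt(2))))

print(normal_cdf(0.0))   # 0.5: half the mass lies below the mean
print(normal_cdf(1.96))  # ≈ 0.975
```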
Error Function
• What is the probability that a Gaussian random variable
lies within twice the standard deviation of the mean?

  P{µ − 2σ ≤ x ≤ µ + 2σ} = erf(2/√2) ≈ 0.955

• The probability that a Gaussian random variable lies within
the interval [µ − 2σ, µ + 2σ] is thus 95.5%
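The same calculation works for any number of standard deviations; a minimal sketch using `math.erf`:

```python
from math import erf, sqrt

# Probability that a Gaussian random variable lies within k standard
# deviations of its mean: P{|x - mu| <= k*sigma} = erf(k / sqrt(2)).
def prob_within(k):
    return erf(k / sqrt(2))

print(prob_within(1))  # ≈ 0.683
print(prob_within(2))  # ≈ 0.954, the ~95.5% quoted above
print(prob_within(3))  # ≈ 0.997
```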

p-Value
• Widely used as a measure of statistical significance
• p-value in the measurement context:
the probability of observing an incorrect event with a
given score or better
• Hence, a low p-value implies a low probability that the
observed measurement is incorrect
• The p-value can be derived from the false positive rate
(FPR), the fraction of incorrect measurements among a
set of measurements (i.e., all measurements above a
given threshold)
• Problems associated with p-value calculations
• The FPR is usually unknown
• p-values should be corrected for multiple hypothesis testing
p-Values in Statistical Testing
• p-values are used to judge the significance of a test for the
null-hypothesis
• Null hypothesis: corresponds to the default position, e.g.,
peptide identification by random chance, or the mean values of two
independent measurements are not different
• Alternative hypothesis: the opposite position, e.g., non-random
peptide identification
• Usually, the null hypothesis cannot be formally proven, but
statistical testing can accept or reject the null-hypothesis
• The null hypothesis is rejected if the p-value is less than a
significance level α (e.g., 0.05 or 0.01)

What is False?

• A general problem for any statistical assessment
is that we usually do not know what is really true or
false (ground truth)
• All applied methods need to make assumptions
about false positive assignments

A Note on the Weibull Distribution

In the 3-parameter Weibull model, the scale parameter, η, defines
where the bulk of the distribution lies. The shape parameter, β, defines the
shape of the distribution, and the location parameter, γ, defines the location of
the distribution in time.

Understanding Poisson

Example: If the average number of babies born in a hospital every hour
is 3 (λ = 3), the probability that 4 babies are born within 2 hours can be
calculated as follows: over a 2-hour window the rate is λ = 3 · 2 = 6, so

  P(x = 4) = 6⁴ · e⁻⁶ / 4! ≈ 0.134
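The same calculation in code; a minimal sketch:

```python
from math import exp, factorial

# Poisson probability P(x = k) = lam^k * e^(-lam) / k!
def poisson_pmf(k, lam):
    return lam ** k * exp(-lam) / factorial(k)

# A rate of 3 births per hour over a 2-hour window gives lambda = 6.
lam = 3 * 2
print(poisson_pmf(4, lam))  # ≈ 0.134
```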

