Maximum Likelihood Estimation

April 4, 2016

The probability of observing a particular set of outcomes from a probability distribution depends on the parameters of that distribution. For example, given an unbiased coin with equal probability of landing heads and tails, what is the probability of observing the sequence "HHTH"? Since the tosses are independent Bernoulli trials, the probability of the observation is 0.5^4 = 0.0625. But what if the coin were biased, with probability 0.7 of landing heads (and 0.3 of landing tails)? The answer would be 0.7^3 × 0.3 ≈ 0.103.

Thus the probability of the observation is a function of the probability of heads (or tails). Let's denote that probability as P(O; p), where 'p' is the probability of heads:

P(O = HHTH; p = 0.5) = 0.0625 and P(O = HHTH; p = 0.7) ≈ 0.103.
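
To make this concrete, here is a minimal Python sketch (the function name is my own, not from the post) that evaluates P(O; p) for the sequence and any value of p:

```python
def sequence_probability(sequence, p):
    """Probability of an observed H/T sequence when P(heads) = p."""
    prob = 1.0
    for toss in sequence:
        prob *= p if toss == 'H' else (1 - p)
    return prob

print(sequence_probability("HHTH", 0.5))  # 0.0625
print(sequence_probability("HHTH", 0.7))  # ~0.1029
```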

But in the more common scenario, we are given the observation and need to estimate the probabilities. The aim of maximum likelihood estimation is to find the parameter value(s) that make the observed data most likely; in other words, given the observation 'O', what should the value of 'p' be so that P(O; p) is maximized?

Continuing with our coin tossing example, let's suppose we do not know the value of 'p' and have to estimate it from the observation 'HHTH'. Since the probability of heads is 'p', the probability of tails is (1 − p). The likelihood of the observed sequence as a function of 'p' is:

$$L(p;\, O = \text{HHTH}) = p^3 (1 - p)$$

Our goal is to maximize this quantity.

[Figure: probability of the observation as a function of the unknown parameter 'p']

There are many possible ways to maximize the above function. If the function is too complex, we normally use numerical algorithms to estimate the values of the parameters. But for simple functions like the one above, we can find a solution by setting the first derivative of the likelihood w.r.t. the unknown parameters to zero and solving.
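
As an illustration of the numerical route (not part of the original post), one could hand the negative likelihood to a generic optimizer such as scipy.optimize.minimize_scalar and let it locate the maximizing value of p:

```python
from scipy.optimize import minimize_scalar

# Negative likelihood of the sequence HHTH as a function of p;
# minimizing it is equivalent to maximizing the likelihood.
def neg_likelihood(p):
    return -(p ** 3) * (1 - p)

result = minimize_scalar(neg_likelihood, bounds=(0.0, 1.0), method='bounded')
print(result.x)  # close to 0.75, the analytic answer derived below
```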

Sometimes the function may have local minima and maxima, and hence the solution obtained may not be the global optimum.

[Figure: a complex curve demonstrating local maxima and minima]

Since we are dealing with probabilities (real numbers between 0 and 1), raising them to exponents and taking products will result in floating-point underflow. We can overcome that by working with the log likelihood instead, i.e.

$$\log L(p;\, O = \text{HHTH}) = 3 \log(p) + \log(1 - p)$$
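
To see why this matters, here is a tiny illustration (my own, with arbitrary numbers): multiplying a few thousand probabilities underflows to zero in floating point, while summing their logarithms stays perfectly usable.

```python
import math

p, n = 0.5, 2000

product = 1.0
for _ in range(n):
    product *= p
print(product)          # 0.0 -- the product has underflowed

print(n * math.log(p))  # about -1386.29 -- the log likelihood is fine
```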

Taking the derivative of the above quantity w.r.t. 'p' gives 3/p − 1/(1 − p); setting it to zero gives 3(1 − p) = p, and solving for p gives p = 3/4 = 0.75. The answer seems pretty intuitive, and one might wonder why we went through the pain of computing derivatives and roots. Fair enough, but the point was to illustrate that the approach is generic.
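
The derivative-and-root calculation can also be checked symbolically, for example with sympy (purely an illustration, not part of the original derivation):

```python
import sympy as sp

p = sp.symbols('p', positive=True)
log_likelihood = 3 * sp.log(p) + sp.log(1 - p)

# Differentiate w.r.t. p and solve dLogL/dp = 0
derivative = sp.diff(log_likelihood, p)
print(sp.solve(sp.Eq(derivative, 0), p))  # [3/4]
```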

Let's look at a more complex example with the normal distribution.

Given a sequence of independent and identically distributed (I.I.D.) observations X1, X2, ..., Xn from a normal distribution with unknown parameters μ and σ, estimate the parameters so as to fit the distribution to the observations.

The Gaussian distribution is defined as


$$N(x;\, \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
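
Written out in code (a small sketch, checked against scipy.stats.norm.pdf rather than anything in the post), the density looks like this:

```python
import math
from scipy.stats import norm

def gaussian_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and std dev sigma."""
    coeff = 1.0 / math.sqrt(2 * math.pi * sigma ** 2)
    return coeff * math.exp(-((x - mu) ** 2) / (2 * sigma ** 2))

print(gaussian_pdf(1.0, 0.0, 2.0))        # hand-rolled density
print(norm.pdf(1.0, loc=0.0, scale=2.0))  # same value from scipy
```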

Hence the probability of the sequence of observations is the product of the probabilities of the individual observations (since the observations are I.I.D.), thus:
$$P(O \mid \mu, \sigma) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(X_i - \mu)^2}{2\sigma^2}}$$

Taking the log likelihood of the above, we get:

$$\log L(\mu, \sigma \mid O) = \sum_{i=1}^{n} \left[ \log\!\left(\frac{1}{\sqrt{2\pi}}\right) - \log(\sigma) - \frac{(X_i - \mu)^2}{2\sigma^2} \right]$$
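
As a sketch (variable and function names are mine), the log likelihood of a sample under given values of μ and σ could be computed with numpy as follows:

```python
import numpy as np

def normal_log_likelihood(x, mu, sigma):
    """Log likelihood of I.I.D. observations x under N(mu, sigma)."""
    x = np.asarray(x, dtype=float)
    return np.sum(np.log(1.0 / np.sqrt(2 * np.pi))
                  - np.log(sigma)
                  - (x - mu) ** 2 / (2 * sigma ** 2))

sample = [1.2, 0.7, -0.3, 2.1, 0.9]
print(normal_log_likelihood(sample, mu=1.0, sigma=1.0))
```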

To maximize the above function, we will use the same technique as before, i.e. take the derivatives w.r.t. the unknown parameters and find their roots. Taking the derivative w.r.t. μ and setting it to zero, we get:
$$\sum_{i=1}^{n} (X_i - \mu) = 0,$$

solving for μ we get

$$\mu = \frac{\sum_{i=1}^{n} X_i}{n},$$

i.e. the mean of the observations. Similarly, taking the derivative w.r.t. σ and setting it to zero, we get the following:
$$-\frac{n}{\sigma} + \frac{\sum_{i=1}^{n} (X_i - \mu)^2}{\sigma^3} = 0,$$

solving for σ, we get

$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \mu)^2}{n}},$$


which is the standard deviation of the


observations.
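
Putting the two closed-form estimates together in code (a minimal sketch on a synthetic sample, so the true parameters are known):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=10_000)  # true mu=5, sigma=2

mu_hat = np.sum(data) / len(data)
sigma_hat = np.sqrt(np.sum((data - mu_hat) ** 2) / len(data))

print(mu_hat, sigma_hat)        # should be close to 5.0 and 2.0
print(data.mean(), data.std())  # numpy's mean/std (ddof=0) match the MLE
```

Note that the MLE of σ divides by n (not n − 1), which is exactly what numpy's default std (ddof=0) computes.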
