
Python Analysis

PHYS 224
October 1/2, 2015
Goals
Two things to teach in this lecture:
1. How to use Python to fit data
2. How to interpret what Python gives you
Some references:
http://nbviewer.ipython.org/url/media.usm.maine.edu/~pauln/ScipyScriptRepo/CurveFitting.ipynb
http://www.physics.utoronto.ca/~phy326/python/curve_fit_to_data.py

2
Fitting Experimental Data
The goal of the lab experiments is to determine a physical quantity y (dependent variable) as a function of x (independent variable)
How?
Measure the pair (x_i, y_i) a number of times (N)
Find a fit function y = y(x) that describes the relationship between these two quantities

3
The Linear Case
The simplest function relating the two variables is the linear function
f(x) = y = a*x + b
This is valid for any (x_i, y_i) combination
If a and b are known, the true value of y_i can be calculated for any x_i:
y_i,true = a*x_i + b

4
Linear Regression

Linear regression calculates the most probable values of a and b such that the linear equation is valid:
y_i,true = a*x_i + b
When taking measurements of y_i, the measured values usually obey a Gaussian distribution

5
An Example
Ideal Gas Law: P*V = n*R*T
Pressure * Volume = n * R * Temperature
P = [(n*R)/V]*T
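The data file used in the code below is not distributed with these notes, so here is a minimal, hypothetical sketch of how an equivalent ideal_gas_law.txt could be generated (assuming n = 1 mol and V = 1 m^3, so the slope is nR/V ≈ 8.314 Pa/K, with a made-up 20 Pa scatter):

import numpy

#temperatures from 270 K to 345 K in 5 K steps (16 points)
temperature = numpy.arange(270., 350., 5.)
#ideal gas law with n = 1 mol, V = 1 m^3, plus made-up Gaussian scatter of 20 Pa
pressure = 8.314*temperature + numpy.random.normal(0., 20., len(temperature))
#write two columns so numpy.loadtxt(..., unpack=True) can read them back
#(the lecture code reads the second column back as vol_data)
numpy.savetxt('ideal_gas_law.txt', numpy.column_stack((temperature, pressure)))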

6
Fitting in Python
We're going to use the curve_fit function, which is part of the scipy.optimize package
The usage is as follows:
fit_parameters, fit_covariance = scipy.optimize.curve_fit(fit_function, x_data, y_data, p0=guess, sigma=sigma)

#fit_parameters - an array of the output fit parameters
#fit_covariance - an array of the covariance of the output fit parameters
#fit_function - the function used to do the fit
#sigma - the uncertainty associated with the data
#guess (p0) - the initial guess input to the fit

7
Fitting with curve_fit
import numpy
import scipy.optimize
from matplotlib import pyplot

#define the function to be used in the fitting
def linearFit(x,*p):
    return p[0]+p[1]*x

#read in the data (currently only located on my hard drive...)
temp_data, vol_data = numpy.loadtxt('ideal_gas_law.txt',unpack=True)

#add an uncertainty to each measurement point
uncertainty = numpy.empty(len(vol_data))
uncertainty.fill(20.)

#do the fit
fit_parameters,fit_covariance = scipy.optimize.curve_fit(linearFit, temp_data, vol_data,
                                                         p0=(1.0,8.0), sigma=uncertainty)
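To see the numbers quoted on the Results slide a couple of pages on, you can simply print the two outputs (a small addition, not on the original slide):

#print the best-fit parameters and their covariance matrix
print(fit_parameters)
print(fit_covariance)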

8
Fitting with curve_fit
The same call again, with each argument of curve_fit labelled:

fit_parameters,fit_covariance = scipy.optimize.curve_fit(linearFit,          # function
                                                         temp_data,          # x data
                                                         vol_data,           # y data
                                                         p0=(1.0,8.0),       # initial guess for parameters
                                                         sigma=uncertainty)  # uncertainty on data
9
Results
fit_parameters = [0.21617647 8.33058824]
fit_covariance = [[ 2.16490542e+04 -6.89053501e+01]
                  [-6.89053501e+01  2.20507375e-01]]

So what does this mean?
We set up the function for the fit to be:
y = p[0] + p[1]*x
So with the fit parameters, the function is:
y = 0.216 + 8.33*x

10
How did it do this?
The function curve_fit uses a minimizer
This varies the fit parameters (p[0] and p[1]) to find the values that are most likely to describe the data properly
This depends on the residuals, which are the difference between the result of the fit and the data at each point
We'll discuss the quantity which is minimized in a bit

11
Probability
The probability of any one point being from the fit is
$$P_{a,b}(y) = \frac{1}{\sqrt{2\pi}\,\sigma_y}\, e^{-\frac{(y - a - b x)^2}{2\sigma_y^2}}$$
where y is the measured data, a and b are from the fit, and $\sigma_y$ is the uncertainty on y

12
Full Probability
For a set of N measurements of the dependent variable y:
y_1, y_2, y_3, ..., y_N
The probability of obtaining these values is the product of the individual probabilities:
$$P_{a,b}(y_1, y_2, y_3, ..., y_N) = P_{a,b}(y_1)\,P_{a,b}(y_2)\,P_{a,b}(y_3)\cdots P_{a,b}(y_N) = \frac{1}{(\sqrt{2\pi}\,\sigma_y)^N}\, e^{-\frac{1}{2}\sum_{i=1}^{N}\frac{(y_i - a - b x_i)^2}{\sigma_y^2}}$$

13
Full Probability
$$P_{a,b}(y_1, ..., y_N) = \frac{1}{(\sqrt{2\pi}\,\sigma_y)^N}\, e^{-\frac{1}{2}\sum_{i=1}^{N}\frac{(y_i - a - b x_i)^2}{\sigma_y^2}}$$
The sum in the exponent is called the chi-squared ($\chi^2$)

14
Chi-Squared
$$\chi^2 = \sum_{i=1}^{N} \frac{(y_i - a - b x_i)^2}{\sigma_y^2}$$
The numerator contains the definition of the residuals, i.e. the measured data (y_i) minus the fit prediction (a + b*x_i)
Dividing this by the standard deviation ($\sigma_y$) tells us how many standard deviations the measured data point is away from the fit at that x
The square ensures each term is always positive
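To make the connection to the minimizer concrete, here is a minimal sketch that minimizes this $\chi^2$ directly with scipy.optimize.minimize, using made-up data; this illustrates the idea, and is not what curve_fit actually does internally, but the result should be close to what curve_fit returns.

import numpy
import scipy.optimize

#made-up data purely for illustration
x_data = numpy.arange(270., 350., 5.)
y_data = 8.3*x_data + numpy.random.normal(0., 20., len(x_data))
sigma_y = numpy.full(len(y_data), 20.)

#the chi-squared of a linear model, exactly as defined above
def chiSquared(params, x, y, sigma):
    model = params[0] + params[1]*x
    return numpy.sum(((y - model)/sigma)**2)

#vary the parameters a and b to make the chi-squared as small as possible
result = scipy.optimize.minimize(chiSquared, x0=[1.0, 8.0], args=(x_data, y_data, sigma_y))
print(result.x)   #best-fit intercept and slope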

15
Plotting the Residuals
#read in the data (currently only located on my hard drive...)
temp_data,vol_data = numpy.loadtxt('/Users/kclark/Desktop/Teaching/phys224/weather_data/ideal_gas_law.txt',unpack=True)

#add an uncertainty to each measurement point
uncertainty = numpy.empty(len(vol_data))
uncertainty.fill(20.)

#do the fit
fit_parameters,fit_covariance = scipy.optimize.curve_fit(linearFit,temp_data,vol_data,p0=(1.0,8.0),sigma=uncertainty)

#now generate the line of the best fit
#set up the temperature points for the full array
fit_temp = numpy.arange(270,355,5)
#make the data for the best fit values
fit_answer = linearFit(fit_temp,*fit_parameters)
#calculate the residuals
fit_resid = vol_data-linearFit(temp_data,*fit_parameters)
#make a line at zero
zero_line = numpy.zeros(len(vol_data))
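The plotting calls themselves are not shown on this slide; a minimal sketch, reusing the variables computed above (and assuming pyplot was imported as on the earlier slides):

#plot the residuals with their uncertainties and a reference line at zero
pyplot.errorbar(temp_data, fit_resid, yerr=uncertainty, marker='o', ls='none')
pyplot.plot(temp_data, zero_line, 'r--')
pyplot.xlabel("Temperature (K)")
pyplot.ylabel("Residual (data - fit)")
pyplot.show()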

16
How do the Residuals Look?
The residuals are obviously a large component of the $\chi^2$ value used by the minimizer
They can be plotted to look for trends and to see if the fit function is appropriate
17
Other Results from curve_fit
curve_fit returns not only the best values for the parameters p[0] and p[1]
The fit covariance matrix is also returned
One strength of curve_fit is the ease of use of the fit covariance matrix

18
Interpreting the Covariance Matrix
fit_parameters = [0.21617647 8.33058824]
fit_covariance = [[ 2.16490542e+04 -6.89053501e+01]
                  [-6.89053501e+01  2.20507375e-01]]

The diagonal elements are the square of the standard deviation (the variance) for that parameter
The off-diagonal elements show the relationship between the parameters:
$$\mathrm{cov}(x, y) = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})$$
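As a short illustration, these elements can be turned into quotable numbers (reusing fit_covariance from the fit above; the correlation coefficient is an addition, not part of the original slide):

#standard deviations of the two parameters from the diagonal
sigma_intercept = numpy.sqrt(fit_covariance[0,0])
sigma_slope = numpy.sqrt(fit_covariance[1,1])
#correlation between the parameters, between -1 and +1, from the off-diagonal
correlation = fit_covariance[0,1]/(sigma_intercept*sigma_slope)
print(sigma_intercept, sigma_slope, correlation)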

19
Fit Results
import numpy
import scipy.optimize
from matplotlib import pyplot

#define the function to be used in the fitting, which is linear in this case
def linearFit(x,*p):
    return p[0]+p[1]*x

#read in the data (currently only located on my hard drive...)
temp_data,vol_data = numpy.loadtxt('/Users/kclark/Desktop/Teaching/phys224/weather_data/ideal_gas_law.txt',unpack=True)

#add an uncertainty to each measurement point
uncertainty = numpy.empty(len(vol_data))
uncertainty.fill(20.)

#do the fit
fit_parameters,fit_covariance = scipy.optimize.curve_fit(linearFit,temp_data,vol_data,p0=(1.0,8.0),sigma=uncertainty)

#determine the standard deviations for each parameter
sigma0 = numpy.sqrt(fit_covariance[0,0])
sigma1 = numpy.sqrt(fit_covariance[1,1])

20
Fit Results
#do the fit
fit_parameters,fit_covariance = scipy.optimize.curve_fit(linearFit,temp_data,vol_data,p0=(1.0,8.0),sigma=uncertainty)

#determine the standard deviations for each parameter
sigma0 = numpy.sqrt(fit_covariance[0,0])
sigma1 = numpy.sqrt(fit_covariance[1,1])

#calculate the mean fit result to plot the line
fit_line = linearFit(temp_data,*fit_parameters)

#calculate the residuals
fit_residuals = vol_data - fit_line

#calculate the data for the best fit plus one sigma in parameter #1
params_plus1sigma = numpy.array([fit_parameters[0],fit_parameters[1]+sigma1])
data_plus1sigma = linearFit(temp_data,*params_plus1sigma)

#calculate the data for the best fit minus one sigma in parameter #1
params_minus1sigma = numpy.array([fit_parameters[0],fit_parameters[1]-sigma1])
data_minus1sigma = linearFit(temp_data,*params_minus1sigma)

#do some plotting of the results
pyplot.errorbar(temp_data,vol_data,yerr=uncertainty,marker='o',ls='none')
pyplot.plot(temp_data,fit_line,'b--')
pyplot.plot(temp_data,data_plus1sigma,'g--',temp_data,data_minus1sigma,'g--')
pyplot.title("Ideal Gas Law Example")
pyplot.xlabel("Temperature (K)")
pyplot.ylabel("Pressure (Pa)")

21
Fit Results
fit_parameters = [0.21617647 8.33058824]
fit_covariance = [[ 2.16490542e+04 -6.89053501e+01]
                  [-6.89053501e+01  2.20507375e-01]]

Calculate the standard deviation on the slope (p[1])
This is the square root of the [1,1] entry of the covariance matrix

22
Fit Results
fit_parameters = [0.21617647 8.33058824]
fit_covariance = [[ 2.16490542e+04 -6.89053501e+01]
                  [-6.89053501e+01  2.20507375e-01]]

Show the p[1] parameter with its standard deviation:
p[1] = 8.33 ± 0.47

23
Comparison to Accepted Values
We obtained the result p[1] = 8.33 ± 0.47
We assume that there is 1 mole in a 1 m³ volume, so that n = V = 1
The accepted value (currently) is 8.3144621 ± 0.0000075
The accepted value IS contained within our uncertainty (our one-sigma range is from 7.86 to 8.80)
These values agree within their error
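As a quick worked check (not on the original slide), the difference between the fitted slope and the accepted value, in units of the fit uncertainty, is
$$\frac{|8.33 - 8.3145|}{0.47} \approx 0.03\,\sigma,$$
so the agreement is comfortably within one standard deviation.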

24
Application to Non-linear Examples
This method can also be applied to other examples
Powers: y = b*x² can be linearized by fitting y against x² (i.e. y = b*u with u = x²)
Polynomials: y = a + b*x + c*x² + d*x³
This is just a case of using multiple regression, since the equation is linear in the coefficients
Exponentials: y = a*e^(b*x) can be linearized as ln(y) = ln(a) + b*x
There are many other examples; a sketch of an exponential fit follows below
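As a sketch of the exponential case, curve_fit can also fit the form y = a*e^(b*x) directly rather than linearizing it; the data here are made up purely for illustration:

import numpy
import scipy.optimize

#define the exponential fit function y = a*exp(b*x)
def expFit(x,*p):
    return p[0]*numpy.exp(p[1]*x)

#made-up data purely for illustration
x_data = numpy.linspace(0., 4., 20)
y_data = 2.5*numpy.exp(0.8*x_data) + numpy.random.normal(0., 1., len(x_data))

#do the fit, with an initial guess for (a, b)
fit_parameters,fit_covariance = scipy.optimize.curve_fit(expFit, x_data, y_data, p0=(1.0, 1.0))
print(fit_parameters)   #should come out close to (2.5, 0.8)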

25
Return to Chi-Squared
$$\chi^2 = \sum_{i=1}^{N} \frac{(y_i - y(x_i))^2}{\sigma_y^2}$$
Here the definition of the residual has changed
Instead of y_i - a - b*x_i, a more general term has been used:
y_i is still the data
y(x_i) is the fit function evaluated at x_i

26
Gauss Distribution

The probability is described by
$$P(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x - \bar{x})^2}{2\sigma^2}}$$
where the average (mean) value is $\bar{x}$ and the spread in values is $\sigma$
27
Gauss Distribution

We use the probabilities shown above to determine how probable a value is in this distribution
When we take a measurement, we expect that 68.2% of the time it will be within 1σ of the mean value
Another way of phrasing this: we expect a value to be more than 3σ above the mean value only about 0.1% of the time
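These percentages can be checked directly with the normal distribution in scipy (a small addition, not on the original slide):

from scipy.stats import norm

#probability of a measurement landing within 1 sigma of the mean
print(norm.cdf(1) - norm.cdf(-1))   #about 0.683
#probability of a measurement landing more than 3 sigma above the mean
print(1 - norm.cdf(3))              #about 0.0013, i.e. roughly 0.1%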

28
Another example

29
Fitting the Gaussian
import numpy
import scipy.optimize
import matplotlib.pyplot as pyplot
import pylab as py

#define the function to be used in the fitting, which is a Gaussian plus a constant offset in this case
def gaussFit(x,*p):
    return p[0]+p[1]*numpy.exp(-1*(x-p[2])**2/(2*p[3]**2))

#read in the data (currently only located on my hard drive...)
day_num,rain_data = numpy.loadtxt('/Users/kclark/Desktop/Teaching/phys224/weather_data/precip_2013.txt', unpack=True)

#get some (pretty good) guesses for the fitting parameters
data_mean = rain_data.mean()
data_std = rain_data.std()

#set up the histogram so that it can be fit
data_plot = py.hist(rain_data,range=(0.1,90),bins=100)
histx = [0.5 * (data_plot[1][i] + data_plot[1][i + 1]) for i in range(100)]
histy = data_plot[0]

#actually do the fitting
fit_parameters,fit_covariance = scipy.optimize.curve_fit(gaussFit,histx,histy,p0=(5.0,10.0,data_mean,data_std))
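A small addition (not on the original slide) to report the fitted mean and width that are quoted on the next slide:

#p[2] is the fitted mean and p[3] the fitted standard deviation of the Gaussian
print("Fit mean: %.2f mm" % fit_parameters[2])
print("Fit standard deviation: %.2f mm" % abs(fit_parameters[3]))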

30
Another example
[Histogram of the daily rainfall data with the Gaussian fit; the fitted mean and standard deviation are indicated on the plot.]
Fit mean: 7.06 mm
Fit standard deviation: 10.13 mm
Rainfall of 85.5 mm is 7.74 standard deviations above the mean (from this data), which is extremely unlikely
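As a quick worked check of that number (not on the original slide):
$$\frac{85.5\,\mathrm{mm} - 7.06\,\mathrm{mm}}{10.13\,\mathrm{mm}} \approx 7.74$$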
34
Chi-Squared and Goodness of Fit
$$\chi^2 = \sum_{i=1}^{N} \frac{(y_i - y(x_i))^2}{\sigma_y^2}$$
This can then be used as a goodness-of-fit test
If the function is a good approximation, then each residual will typically be within about one standard deviation, so the sum will be approximately N

35
Chi-Squared
$$\chi^2 = \sum_{i=1}^{N} \frac{(y_i - y(x_i))^2}{\sigma_y^2}$$
We normally use the number of degrees of freedom (DOF) of the experiment to determine the fit quality
The number of DOF is the number of data points in the sample minus the number of parameters in the fit
For a sample with 20 data points and a linear fit (2 parameters), DOF = 18
This is used as the goodness of fit, since χ²/DOF ≈ 1 for a good fit
36
Revisit the First Example
import numpy
import scipy.optimize
from matplotlib import pyplot

#define the function to be used in the fitting, which is linear in this case
def linearFit(x,*p):
    return p[0]+p[1]*x

#read in the data (currently only located on my hard drive...)
temp_data,vol_data = numpy.loadtxt('/Users/kclark/Desktop/Teaching/phys224/weather_data/ideal_gas_law.txt',unpack=True)

#add an uncertainty to each measurement point
uncertainty = numpy.empty(len(vol_data))
uncertainty.fill(20.)

#do the fit
fit_parameters,fit_covariance = scipy.optimize.curve_fit(linearFit,temp_data,vol_data,p0=(1.0,8.0),sigma=uncertainty)

#calculate the chi-squared value
chisq = sum(((vol_data-linearFit(temp_data,*fit_parameters))/uncertainty)**2)
print(chisq)

#calculate the number of degrees of freedom
dof = len(temp_data)-len(fit_parameters)
print(dof)
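A small addition (not on the original slide) to print the reduced chi-squared used on the next slide:

#reduced chi-squared; a value near 1 indicates a good fit
print(chisq/dof)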
37
Revisit the First Example
Is this a good fit?
$$\chi^2 = \sum_{i=1}^{16} \left(\frac{\mathrm{presData}_i - \mathrm{fit}_i}{\mathrm{uncertainty}}\right)^2 = 65.6$$
Divide this by the DOF
We have 16 data points and 2 parameters:
$$\frac{\chi^2}{\mathrm{DOF}} = \frac{65.6}{16 - 2} = 4.68$$
This may not be a great fit...
38
Goodness of Fit
The previous statements are only mostly true
More accurately:
χ²/DOF ≫ 1 indicates a very poor fit, maybe even a fit model which doesn't match the data
χ²/DOF > 1 indicates a fit that is not good, or an uncertainty that is underestimated
χ²/DOF ≪ 1 means the uncertainty could be overestimated

39
Summary

You should now be well prepared to use Python to fit data
Your practice with this starts with the next pendulum exercise, which you can begin now!

40
