
Python Analysis

PHYS 224
October 1/2, 2015
Goals
Two things to teach in this lecture:
1. How to use Python to fit data
2. How to interpret what Python gives you
Some references:
http://nbviewer.ipython.org/url/media.usm.maine.edu/~pauln/ScipyScriptRepo/CurveFitting.ipynb
http://www.physics.utoronto.ca/~phy326/python/curve_fit_to_data.py

2
Fitting Experimental Data
The goal of the lab experiments is to determine a physical quantity y (dependent variable) as a function of x (independent variable)
How?
Measure the pair (x_i, y_i) a number of times (N)
Find a fit function y = y(x) that describes the relationship between these two quantities

3
The Linear Case
The simplest function relating the two variables is the linear function
f(x) = y = a*x + b
This is valid for any (x_i, y_i) combination
If a and b are known, the true value of y_i can be calculated for any x_i:
y_i,true = a*x_i + b

4
Linear Regression

Linear regression calculates the most probable values of a and b such that the linear equation is valid:
y_i,true = a*x_i + b
When taking measurements of y_i, the measured values usually obey a Gaussian distribution

5
An Example
Ideal Gas Law: P*V = n*R*T
Pressure * Volume = n * R * Temperature
P = [(n*R)/V]*T
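The data file used in the code below is not distributed with these notes, so here is a minimal, hypothetical sketch of how an equivalent ideal_gas_law.txt could be generated (assuming n = 1 mol and V = 1 m^3, so the slope is nR/V ≈ 8.314 Pa/K, with a made-up 20 Pa scatter):

import numpy

#temperatures from 270 K to 345 K in 5 K steps (16 points)
temperature = numpy.arange(270., 350., 5.)
#ideal gas law with n = 1 mol, V = 1 m^3, plus made-up Gaussian scatter of 20 Pa
pressure = 8.314*temperature + numpy.random.normal(0., 20., len(temperature))
#write two columns so numpy.loadtxt(..., unpack=True) can read them back
#(the lecture code reads the second column back as vol_data)
numpy.savetxt('ideal_gas_law.txt', numpy.column_stack((temperature, pressure)))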

6
Fitting in Python
We're going to use the curve_fit function, which is part of the scipy.optimize package
The usage is as follows:
fit_parameters, fit_covariance = scipy.optimize.curve_fit(fit_function, x_data, y_data, p0=guess, sigma=sigma)

#fit_parameters - an array of the output fit parameters
#fit_covariance - an array of the covariance of the output fit parameters
#fit_function - the function used to do the fit
#sigma - the uncertainty associated with the data
#guess (p0) - the initial guess input to the fit

7
Fitting with curve_fit
import numpy
import scipy.optimize
from matplotlib import pyplot

#define the function to be used in the fitting
def linearFit(x,*p):
    return p[0]+p[1]*x

#read in the data (currently only located on my hard drive...)
temp_data, vol_data = numpy.loadtxt('ideal_gas_law.txt',unpack=True)

#add an uncertainty to each measurement point
uncertainty = numpy.empty(len(vol_data))
uncertainty.fill(20.)

#do the fit
fit_parameters,fit_covariance = scipy.optimize.curve_fit(linearFit, temp_data, vol_data,
                                                         p0=(1.0,8.0), sigma=uncertainty)
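To see the numbers quoted on the Results slide a couple of pages on, you can simply print the two outputs (a small addition, not on the original slide):

#print the best-fit parameters and their covariance matrix
print(fit_parameters)
print(fit_covariance)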

8
Fitting with curve_fit
The same call again, with each argument of curve_fit labelled:

fit_parameters,fit_covariance = scipy.optimize.curve_fit(linearFit,          # function
                                                         temp_data,          # x data
                                                         vol_data,           # y data
                                                         p0=(1.0,8.0),       # initial guess for parameters
                                                         sigma=uncertainty)  # uncertainty on data
9
Results
fit_parameters = [0.21617647 8.33058824]
fit_covariance = [[ 2.16490542e+04 -6.89053501e+01]
                  [-6.89053501e+01  2.20507375e-01]]

So what does this mean?
We set up the function for the fit to be:
y = p[0] + p[1]*x
So with the fit parameters, the function is:
y = 0.216 + 8.33*x

10
How did it do this?
The function curve_fit uses a minimizer
This varies the fit parameters (p[0] and p[1]) to find the values that are most likely to describe the data properly
This depends on the residuals, which are the difference between the result of the fit and the data at each point
We'll discuss the quantity which is minimized in a bit

11
Probability
The probability of any one point being from the fit is
$$P_{a,b}(y) = \frac{1}{\sqrt{2\pi}\,\sigma_y}\, e^{-\frac{(y - a - b x)^2}{2\sigma_y^2}}$$
where y is the measured data, a and b are from the fit, and $\sigma_y$ is the uncertainty on y

12
Full Probability
For a set of N measurements of the dependent variable y:
y_1, y_2, y_3, ..., y_N
The probability of obtaining these values is the product of the individual probabilities:
$$P_{a,b}(y_1, y_2, y_3, ..., y_N) = P_{a,b}(y_1)\,P_{a,b}(y_2)\,P_{a,b}(y_3)\cdots P_{a,b}(y_N) = \frac{1}{(\sqrt{2\pi}\,\sigma_y)^N}\, e^{-\frac{1}{2}\sum_{i=1}^{N}\frac{(y_i - a - b x_i)^2}{\sigma_y^2}}$$

13
Full Probability
$$P_{a,b}(y_1, ..., y_N) = \frac{1}{(\sqrt{2\pi}\,\sigma_y)^N}\, e^{-\frac{1}{2}\sum_{i=1}^{N}\frac{(y_i - a - b x_i)^2}{\sigma_y^2}}$$
The sum in the exponent is called the chi-squared ($\chi^2$)

14
Chi-Squared
$$\chi^2 = \sum_{i=1}^{N} \frac{(y_i - a - b x_i)^2}{\sigma_y^2}$$
The numerator contains the definition of the residuals, i.e. the measured data (y_i) minus the fit prediction (a + b*x_i)
Dividing this by the standard deviation ($\sigma_y$) tells us how many standard deviations the measured data point is away from the fit at that x
The square ensures each term is always positive
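To make the connection to the minimizer concrete, here is a minimal sketch that minimizes this $\chi^2$ directly with scipy.optimize.minimize, using made-up data; this illustrates the idea, and is not what curve_fit actually does internally, but the result should be close to what curve_fit returns.

import numpy
import scipy.optimize

#made-up data purely for illustration
x_data = numpy.arange(270., 350., 5.)
y_data = 8.3*x_data + numpy.random.normal(0., 20., len(x_data))
sigma_y = numpy.full(len(y_data), 20.)

#the chi-squared of a linear model, exactly as defined above
def chiSquared(params, x, y, sigma):
    model = params[0] + params[1]*x
    return numpy.sum(((y - model)/sigma)**2)

#vary the parameters a and b to make the chi-squared as small as possible
result = scipy.optimize.minimize(chiSquared, x0=[1.0, 8.0], args=(x_data, y_data, sigma_y))
print(result.x)   #best-fit intercept and slope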

15
Plotting the Residuals
#read in the data (currently only located on my hard drive...)
temp_data,vol_data = numpy.loadtxt('/Users/kclark/Desktop/Teaching/phys224/weather_data/ideal_gas_law.txt',unpack=True)

#add an uncertainty to each measurement point
uncertainty = numpy.empty(len(vol_data))
uncertainty.fill(20.)

#do the fit
fit_parameters,fit_covariance = scipy.optimize.curve_fit(linearFit,temp_data,vol_data,p0=(1.0,8.0),sigma=uncertainty)

#now generate the line of the best fit
#set up the temperature points for the full array
fit_temp = numpy.arange(270,355,5)
#make the data for the best fit values
fit_answer = linearFit(fit_temp,*fit_parameters)
#calculate the residuals
fit_resid = vol_data-linearFit(temp_data,*fit_parameters)
#make a line at zero
zero_line = numpy.zeros(len(vol_data))
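The plotting calls themselves are not shown on this slide; a minimal sketch, reusing the variables computed above (and assuming pyplot was imported as on the earlier slides):

#plot the residuals with their uncertainties and a reference line at zero
pyplot.errorbar(temp_data, fit_resid, yerr=uncertainty, marker='o', ls='none')
pyplot.plot(temp_data, zero_line, 'r--')
pyplot.xlabel("Temperature (K)")
pyplot.ylabel("Residual (data - fit)")
pyplot.show()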

16
How do the Residuals Look?
The residuals are obviously a large component of the $\chi^2$ value used by the minimizer
They can be plotted to look for trends and to see if the fit function is appropriate
17
Other Results from curve_fit
curve_fit returns not only the best values for the parameters p[0] and p[1]
The fit covariance matrix is also returned
One strength of curve_fit is the ease of use of the fit covariance matrix

18
Interpreting the Covariance Matrix
fit_parameters = [0.21617647 8.33058824]
fit_covariance = [[ 2.16490542e+04 -6.89053501e+01]
                  [-6.89053501e+01  2.20507375e-01]]

The diagonal elements are the square of the standard deviation (the variance) for that parameter
The off-diagonal elements show the relationship between the parameters:
$$\mathrm{cov}(x, y) = \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})$$
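As a short illustration, these elements can be turned into quotable numbers (reusing fit_covariance from the fit above; the correlation coefficient is an addition, not part of the original slide):

#standard deviations of the two parameters from the diagonal
sigma_intercept = numpy.sqrt(fit_covariance[0,0])
sigma_slope = numpy.sqrt(fit_covariance[1,1])
#correlation between the parameters, between -1 and +1, from the off-diagonal
correlation = fit_covariance[0,1]/(sigma_intercept*sigma_slope)
print(sigma_intercept, sigma_slope, correlation)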

19
Fit Results
import numpy
import scipy.optimize
from matplotlib import pyplot

#define the function to be used in the fitting, which is linear in this case
def linearFit(x,*p):
    return p[0]+p[1]*x

#read in the data (currently only located on my hard drive...)
temp_data,vol_data = numpy.loadtxt('/Users/kclark/Desktop/Teaching/phys224/weather_data/ideal_gas_law.txt',unpack=True)

#add an uncertainty to each measurement point
uncertainty = numpy.empty(len(vol_data))
uncertainty.fill(20.)

#do the fit
fit_parameters,fit_covariance = scipy.optimize.curve_fit(linearFit,temp_data,vol_data,p0=(1.0,8.0),sigma=uncertainty)

#determine the standard deviations for each parameter
sigma0 = numpy.sqrt(fit_covariance[0,0])
sigma1 = numpy.sqrt(fit_covariance[1,1])

20
Fit Results
#do the fit
fit_parameters,fit_covariance = scipy.optimize.curve_fit(linearFit,temp_data,vol_data,p0=(1.0,8.0),sigma=uncertainty)

#determine the standard deviations for each parameter
sigma0 = numpy.sqrt(fit_covariance[0,0])
sigma1 = numpy.sqrt(fit_covariance[1,1])

#calculate the mean fit result to plot the line
fit_line = linearFit(temp_data,*fit_parameters)

#calculate the residuals
fit_residuals = vol_data - fit_line

#calculate the data for the best fit plus one sigma in parameter #1
params_plus1sigma = numpy.array([fit_parameters[0],fit_parameters[1]+sigma1])
data_plus1sigma = linearFit(temp_data,*params_plus1sigma)

#calculate the data for the best fit minus one sigma in parameter #1
params_minus1sigma = numpy.array([fit_parameters[0],fit_parameters[1]-sigma1])
data_minus1sigma = linearFit(temp_data,*params_minus1sigma)

#do some plotting of the results
pyplot.errorbar(temp_data,vol_data,yerr=uncertainty,marker='o',ls='none')
pyplot.plot(temp_data,fit_line,'b--')
pyplot.plot(temp_data,data_plus1sigma,'g--',temp_data,data_minus1sigma,'g--')
pyplot.title("Ideal Gas Law Example")
pyplot.xlabel("Temperature (K)")
pyplot.ylabel("Pressure (Pa)")

21
Fit Results
fit_parameters = [0.21617647 8.33058824]
fit_covariance = [[ 2.16490542e+04 -6.89053501e+01]
                  [-6.89053501e+01  2.20507375e-01]]

Calculate the standard deviation on the slope (p[1])
This is the square root of the [1,1] entry of the covariance matrix

22
Fit Results
fit_parameters = [0.21617647 8.33058824]
fit_covariance = [[ 2.16490542e+04 -6.89053501e+01]
                  [-6.89053501e+01  2.20507375e-01]]

Show the p[1] parameter with its standard deviation:
p[1] = 8.33 ± 0.47

23
Comparison to Accepted Values
We obtained the result p[1] = 8.33 ± 0.47
We assume that there is 1 mole in a 1 m³ volume, so that n = V = 1
The accepted value (currently) is 8.3144621 ± 0.0000075
The accepted value IS contained within our uncertainty (our one-sigma range is from 7.86 to 8.80)
These values agree within their error
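As a quick worked check (not on the original slide), the difference between the fitted slope and the accepted value, in units of the fit uncertainty, is
$$\frac{|8.33 - 8.3145|}{0.47} \approx 0.03\,\sigma,$$
so the agreement is comfortably within one standard deviation.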

24
Application to Non-linear Examples
This method can also be applied to other examples
Powers: y = b*x² can be linearized by fitting y against x² (i.e. y = b*u with u = x²)
Polynomials: y = a + b*x + c*x² + d*x³
This is just a case of using multiple regression, since the equation is linear in the coefficients
Exponentials: y = a*e^(b*x) can be linearized as ln(y) = ln(a) + b*x
There are many other examples; a sketch of an exponential fit follows below
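As a sketch of the exponential case, curve_fit can also fit the form y = a*e^(b*x) directly rather than linearizing it; the data here are made up purely for illustration:

import numpy
import scipy.optimize

#define the exponential fit function y = a*exp(b*x)
def expFit(x,*p):
    return p[0]*numpy.exp(p[1]*x)

#made-up data purely for illustration
x_data = numpy.linspace(0., 4., 20)
y_data = 2.5*numpy.exp(0.8*x_data) + numpy.random.normal(0., 1., len(x_data))

#do the fit, with an initial guess for (a, b)
fit_parameters,fit_covariance = scipy.optimize.curve_fit(expFit, x_data, y_data, p0=(1.0, 1.0))
print(fit_parameters)   #should come out close to (2.5, 0.8)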

25
Return to Chi-Squared
$$\chi^2 = \sum_{i=1}^{N} \frac{(y_i - y(x_i))^2}{\sigma_y^2}$$
Here the definition of the residual has changed
Instead of y_i - a - b*x_i, a more general term has been used:
y_i is still the data
y(x_i) is the fit function evaluated at x_i

26
Gauss Distribution

The probability is described by
$$P(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x - \bar{x})^2}{2\sigma^2}}$$
where the average (mean) value is $\bar{x}$ and the spread in values is $\sigma$
27
Gauss Distribution

We use the probabilities shown above to determine how probable a value is in this distribution
When we take a measurement, we expect that 68.2% of the time it will be within 1σ of the mean value
Another way of phrasing this: we expect a value to be more than 3σ above the mean value only about 0.1% of the time
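These percentages can be checked directly with the normal distribution in scipy (a small addition, not on the original slide):

from scipy.stats import norm

#probability of a measurement landing within 1 sigma of the mean
print(norm.cdf(1) - norm.cdf(-1))   #about 0.683
#probability of a measurement landing more than 3 sigma above the mean
print(1 - norm.cdf(3))              #about 0.0013, i.e. roughly 0.1%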

28
Another example

29
Fitting the Gaussian
import numpy
import scipy.optimize
import matplotlib.pyplot as pyplot
import pylab as py

#define the function to be used in the fitting, which is a Gaussian plus a constant offset in this case
def gaussFit(x,*p):
    return p[0]+p[1]*numpy.exp(-1*(x-p[2])**2/(2*p[3]**2))

#read in the data (currently only located on my hard drive...)
day_num,rain_data = numpy.loadtxt('/Users/kclark/Desktop/Teaching/phys224/weather_data/precip_2013.txt', unpack=True)

#get some (pretty good) guesses for the fitting parameters
data_mean = rain_data.mean()
data_std = rain_data.std()

#set up the histogram so that it can be fit
data_plot = py.hist(rain_data,range=(0.1,90),bins=100)
histx = [0.5 * (data_plot[1][i] + data_plot[1][i + 1]) for i in range(100)]
histy = data_plot[0]

#actually do the fitting
fit_parameters,fit_covariance = scipy.optimize.curve_fit(gaussFit,histx,histy,p0=(5.0,10.0,data_mean,data_std))
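A small addition (not on the original slide) to report the fitted mean and width that are quoted on the next slide:

#p[2] is the fitted mean and p[3] the fitted standard deviation of the Gaussian
print("Fit mean: %.2f mm" % fit_parameters[2])
print("Fit standard deviation: %.2f mm" % abs(fit_parameters[3]))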

30
Another example
[Histogram of the daily rainfall data with the Gaussian fit; the fitted mean and standard deviation are indicated on the plot.]
Fit mean: 7.06 mm
Fit standard deviation: 10.13 mm
Rainfall of 85.5 mm is 7.74 standard deviations above the mean (from this data), which is extremely unlikely
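As a quick worked check of that number (not on the original slide):
$$\frac{85.5\,\mathrm{mm} - 7.06\,\mathrm{mm}}{10.13\,\mathrm{mm}} \approx 7.74$$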
34
Chi-Squared and Goodness of Fit
$$\chi^2 = \sum_{i=1}^{N} \frac{(y_i - y(x_i))^2}{\sigma_y^2}$$
This can then be used as a goodness-of-fit test
If the function is a good approximation, then each residual will typically be within about one standard deviation, so the sum will be approximately N

35
Chi-Squared
$$\chi^2 = \sum_{i=1}^{N} \frac{(y_i - y(x_i))^2}{\sigma_y^2}$$
We normally use the number of degrees of freedom (DOF) of the experiment to determine the fit quality
The number of DOF is the number of data points in the sample minus the number of parameters in the fit
For a sample with 20 data points and a linear fit (2 parameters), DOF = 18
This is used as the goodness of fit, since χ²/DOF ≈ 1 for a good fit
36
Revisit the First Example
import numpy
import scipy.optimize
from matplotlib import pyplot

#define the function to be used in the fitting, which is linear in this case
def linearFit(x,*p):
    return p[0]+p[1]*x

#read in the data (currently only located on my hard drive...)
temp_data,vol_data = numpy.loadtxt('/Users/kclark/Desktop/Teaching/phys224/weather_data/ideal_gas_law.txt',unpack=True)

#add an uncertainty to each measurement point
uncertainty = numpy.empty(len(vol_data))
uncertainty.fill(20.)

#do the fit
fit_parameters,fit_covariance = scipy.optimize.curve_fit(linearFit,temp_data,vol_data,p0=(1.0,8.0),sigma=uncertainty)

#calculate the chi-squared value
chisq = sum(((vol_data-linearFit(temp_data,*fit_parameters))/uncertainty)**2)
print(chisq)

#calculate the number of degrees of freedom
dof = len(temp_data)-len(fit_parameters)
print(dof)
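A small addition (not on the original slide) to print the reduced chi-squared used on the next slide:

#reduced chi-squared; a value near 1 indicates a good fit
print(chisq/dof)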
37
Revisit the First Example
Is this a good fit?
$$\chi^2 = \sum_{i=1}^{16} \left(\frac{\mathrm{presData}_i - \mathrm{fit}_i}{\mathrm{uncertainty}}\right)^2 = 65.6$$
Divide this by the DOF
We have 16 data points and 2 parameters:
$$\frac{\chi^2}{\mathrm{DOF}} = \frac{65.6}{16 - 2} = 4.68$$
This may not be a great fit...
38
Goodness of Fit
The previous statements are only mostly true
More accurately:
χ²/DOF ≫ 1 indicates a very poor fit, maybe even a fit model which doesn't match the data
χ²/DOF > 1 indicates a fit that is not good, or an uncertainty that is underestimated
χ²/DOF ≪ 1 means the uncertainty could be overestimated

39
Summary

You should now be well prepared to use Python to fit data
Your practice with this starts with the next pendulum exercise, which you can begin now!

40
