
STAT 231 Final

Outline
Chapter 1
  Data types (discrete, continuous, categorical)
  Problem (3 different aspects)
  Populations (target, study, sample)
  Representations of data
    Graphical: histograms, CDFs, boxplots
    Numerical: mean, standard deviation, IQR
  Bivariate data
    Relative risk
    Correlation coefficient

Chapter 2
  Review of probability distributions
  Random PPDAC examples

Chapter 3
  Binomial Model
  Response Model
  Regression Model
  Maximum Likelihood Estimation

Chapter 4
  Sampling distributions for estimators
  Introduction to new distributions
    Gaussian
    Chi-squared
    t
  Confidence Interval
  Hypothesis Testing
  Confidence Intervals and Hypothesis Testing with the likelihood function

Chapter 5
  Testing for independence with categorical variates
  Model checking and assessment for assumptions

Chapter 6
  Comparison
    2-sample t-tests
    Paired t-test
  Causality
    Testing for association
    Blocking
    Randomization and repetition
    Matching
  Prediction
    Prediction intervals for response
    Prediction intervals for regression

Confidence Intervals using the Relative Likelihood Function
Define the likelihood function:

$L(\theta) = \prod_{i=1}^{n} f(x_i; \theta)$

Define the relative likelihood function as:

$R(\theta) = \frac{L(\theta)}{L(\hat\theta)}$

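Graph the relative likelihood function $R(\theta)$ against $\theta$.

Draw a horizontal line at height 0.15. The $\theta$-coordinates of its two intersections with the curve bound an approximate 95% confidence interval, because $-2\ln(0.15) \approx 3.79$ is close to 3.84, the 95th percentile of the $\chi^2_1$ distribution. A line at 0.1 gives the 10% likelihood interval, which is an approximate 97% confidence interval.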
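
The graphical step can also be done numerically. The sketch below assumes the density $f(x; \theta) = (\theta + 1)x^{\theta}$ on 0 < x < 1, which matches the log-likelihood used in the example later in these notes; the data values and all function names are made up for illustration.

```python
import math

def log_lik(theta, xs):
    # l(theta) = n ln(theta + 1) + theta * sum(ln x_i), defined for theta > -1
    return len(xs) * math.log(theta + 1) + theta * sum(math.log(x) for x in xs)

def mle(xs):
    # Solving dl/dtheta = n / (theta + 1) + sum(ln x_i) = 0 for theta
    return -len(xs) / sum(math.log(x) for x in xs) - 1

def relative_likelihood(theta, xs):
    # R(theta) = L(theta) / L(theta_hat), computed on the log scale
    return math.exp(log_lik(theta, xs) - log_lik(mle(xs), xs))

def likelihood_interval(xs, level, lo=-0.99, hi=20.0, steps=20000):
    # Scan a grid of theta values and keep those with R(theta) >= level
    grid = [lo + (hi - lo) * k / steps for k in range(steps + 1)]
    inside = [t for t in grid if relative_likelihood(t, xs) >= level]
    return min(inside), max(inside)

xs = [0.62, 0.85, 0.41, 0.93, 0.77, 0.58, 0.89, 0.70]  # hypothetical data
theta_hat = mle(xs)
low, high = likelihood_interval(xs, level=0.15)
print(f"theta_hat = {theta_hat:.3f}; 15% likelihood interval = ({low:.3f}, {high:.3f})")
```

A grid scan is the crudest possible root finder; it is used here only because it mirrors reading the two crossing points off a graph.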

Hypothesis Testing using the Likelihood Function
1) Define the null hypothesis and the alternative hypothesis
2) Define the test statistic, identify its distribution, calculate the observed value
3) Calculate the p-value

The test statistic: $D = 2[l(\tilde\theta) - l(\theta_0)]$
Distribution of D: $D \sim \chi^2_{n-p}$
Observed value of D: $d = 2[l(\hat\theta) - l(\theta_0)]$
p-value: $P(D \ge d)$

Example
The observed value of the test statistic is

$d = 2[l(\hat\theta) - l(\theta_0)]$

where the log-likelihood is

$l(\theta) = n\ln(\theta + 1) + \theta\sum_{i=1}^{n}\ln x_i$
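
A minimal sketch of this calculation follows. It assumes the log-likelihood $l(\theta) = n\ln(\theta + 1) + \theta\sum \ln x_i$, a single parameter under test (so D is compared with a chi-squared distribution on 1 degree of freedom), and made-up data; `lrt` is a hypothetical helper name.

```python
import math

def log_lik(theta, xs):
    # l(theta) = n ln(theta + 1) + theta * sum(ln x_i)
    return len(xs) * math.log(theta + 1) + theta * sum(math.log(x) for x in xs)

def lrt(xs, theta0):
    # Maximum likelihood estimate from dl/dtheta = 0
    theta_hat = -len(xs) / sum(math.log(x) for x in xs) - 1
    # Observed value d = 2[l(theta_hat) - l(theta0)]
    d = 2 * (log_lik(theta_hat, xs) - log_lik(theta0, xs))
    d = max(d, 0.0)  # guard against tiny negative rounding error
    # For a chi-squared variable on 1 degree of freedom,
    # P(D >= d) = P(|Z| >= sqrt(d)) = erfc(sqrt(d / 2))
    p_value = math.erfc(math.sqrt(d / 2))
    return theta_hat, d, p_value

xs = [0.62, 0.85, 0.41, 0.93, 0.77, 0.58, 0.89, 0.70]  # hypothetical data
theta_hat, d, p_value = lrt(xs, theta0=1.0)
print(f"theta_hat = {theta_hat:.3f}, d = {d:.3f}, p-value = {p_value:.3f}")
```

Using `math.erfc` keeps the sketch free of third-party libraries; it only works because the reference distribution has 1 degree of freedom.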

Model Assessment
We have been assuming that the data we collect fit a specific model (Binomial, Response, etc.)
With these models come many assumptions, including independence
In this chapter, we analyze our data to see whether we are actually able to use these models to fit our data

Independence with Binary Variates
We want to see if we can assume that two binary variates (represented by 2 random variables X and Y) are independent
This is essentially another type of hypothesis test
Since a binary variate is just a categorical variate with 2 categories, this test can be extended to two categorical variates

Define:
Let X represent the binary variate gender (Male = 0, Female = 1)
Let Y represent the binary variate smoker (Non-Smoker = 0, Smoker = 1)
Let n be the sample size
Let us collect our observed data and present them in the following frequency table:

                    Male (X=0)   Female (X=1)   Total
Non-Smoker (Y=0)    a            b              a+b
Smoker (Y=1)        c            d              c+d
Total               a+c          b+d            n=a+b+c+d

If X and Y are independent, then:
Expected frequency of male smokers is $n \cdot P(X=0) \cdot P(Y=1)$
Expected frequency of male non-smokers is $n \cdot P(X=0) \cdot P(Y=0)$
Expected frequency of female smokers is $n \cdot P(X=1) \cdot P(Y=1)$
Expected frequency of female non-smokers is $n \cdot P(X=1) \cdot P(Y=0)$

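Using the observed frequency table

                    Male (X=0)   Female (X=1)   Total
Non-Smoker (Y=0)    a            b              a+b
Smoker (Y=1)        c            d              c+d
Total               a+c          b+d            n=a+b+c+d

we estimate the marginal probabilities by the observed proportions:

$P(X=0) \approx \frac{a+c}{n}$
$P(Y=0) \approx \frac{a+b}{n}$
$P(X=1) \approx \frac{b+d}{n}$
$P(Y=1) \approx \frac{c+d}{n}$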

Creating our expected frequency table

                    Male (X=0)                          Female (X=1)                        Total
Non-Smoker (Y=0)    $e_1 = n \cdot P(X=0) \cdot P(Y=0)$   $e_2 = n \cdot P(X=1) \cdot P(Y=0)$   a+b
Smoker (Y=1)        $e_3 = n \cdot P(X=0) \cdot P(Y=1)$   $e_4 = n \cdot P(X=1) \cdot P(Y=1)$   c+d
Total               a+c                                 b+d                                 n=a+b+c+d

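As with any other hypothesis testing question, we need to define the test statistic.
Test statistic:

$S = \sum_{i} \frac{(o_i - e_i)^2}{e_i}$

where the sum runs over the cells of the table (4 cells for two binary variates), $o_i$ is the observed frequency and $e_i$ is the expected frequency in cell i.
Distribution of the test statistic: $S \sim \chi^2_{(r-1)(c-1)}$, where r and c are the numbers of rows and columns of the table (1 degree of freedom for a 2 x 2 table).
Observed value:

$s = \sum_{i} \frac{(o_i - e_i)^2}{e_i}$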

p-value $= P(S \ge s)$
Make your conclusion:
Reject $H_0$ (small p-value): there is evidence that X and Y are not independent
Do not reject $H_0$: the data are consistent with X and Y being independent
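
The whole procedure for a 2 x 2 table can be sketched as below. The counts are hypothetical, `independence_test` is an illustrative name, and the p-value shortcut via `math.erfc` is valid only because a 2 x 2 table gives 1 degree of freedom.

```python
import math

def independence_test(a, b, c, d):
    # Layout follows the frequency table: row 1 = non-smokers (a, b),
    # row 2 = smokers (c, d); column 1 = male, column 2 = female
    n = a + b + c + d
    p_x0, p_x1 = (a + c) / n, (b + d) / n   # estimated P(X=0), P(X=1)
    p_y0, p_y1 = (a + b) / n, (c + d) / n   # estimated P(Y=0), P(Y=1)
    observed = [a, b, c, d]
    expected = [n * p_x0 * p_y0, n * p_x1 * p_y0,
                n * p_x0 * p_y1, n * p_x1 * p_y1]   # e1, e2, e3, e4
    s = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Chi-squared on (2-1)(2-1) = 1 degree of freedom: P(S >= s) = erfc(sqrt(s/2))
    p_value = math.erfc(math.sqrt(s / 2))
    return expected, s, p_value

expected, s, p_value = independence_test(a=40, b=50, c=20, d=10)  # hypothetical counts
print(expected, round(s, 3), round(p_value, 3))
```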

Example
Observed value:

$s = \sum_{i} \frac{(o_i - e_i)^2}{e_i}$

p-value: $P(S \ge s)$

Model Assessment
For the regression model, we make the following assumptions when fitting our data:
1) The expectation of Y is a linear function of the explanatory variate
2) The model used is Gaussian
3) The $Y_i$'s are independent
4) The model has a constant variance

The expectation of Y is a linear function of the explanatory variate
The model assumes that $E[Y_i]$ is a linear function of $x_i$
If we plot $y_i$ vs. $x_i$, we should see a linear relationship

The model used is Gaussian
In the model, we assume $R \sim G(0, \sigma)$ and thus $Y \sim G(\alpha + \beta x, \sigma)$
How do we check if this assumption is reasonable? Residuals
Rearranging the model, $R = Y - (\alpha + \beta x)$
A realization of R becomes $r_i = y_i - (\alpha + \beta x_i)$
An estimated residual is $\hat r_i = y_i - (\hat\alpha + \hat\beta x_i) = y_i - \hat y_i$
Graphically, $\hat r_i$ is the vertical distance from the line of best fit to our observed response variate

We can check the Gaussian assumption with a QQ plot
Plot the sample quantiles of the estimated residuals against the theoretical quantiles of the Gaussian distribution; if the points fall close to a straight line, the Gaussian assumption is reasonable
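
The coordinates of such a QQ plot can be computed as sketched below, using the height/weight sample from the regression example in these notes. The plotting positions (i - 0.5)/n are one common convention and are an assumption here, as are the function names.

```python
from statistics import NormalDist, mean

def fit_line(xs, ys):
    # Least-squares estimates for the model Y = alpha + beta * x + R
    x_bar, y_bar = mean(xs), mean(ys)
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    beta_hat = s_xy / s_xx
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat

def qq_points(xs, ys):
    alpha_hat, beta_hat = fit_line(xs, ys)
    # Sorted estimated residuals are the sample quantiles
    residuals = sorted(y - (alpha_hat + beta_hat * x) for x, y in zip(xs, ys))
    n = len(residuals)
    # Standard Gaussian quantiles at plotting positions (i - 0.5) / n
    theoretical = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, residuals))

heights = [172, 162, 180, 170, 174]
weights = [60, 54, 72, 65, 64]
for q, r in qq_points(heights, weights):
    print(f"theoretical {q:+.3f}   residual {r:+.3f}")
```

Plotting the printed pairs and judging straightness by eye is the check described above; with only 5 points the plot carries very little information.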

The $Y_i$'s are independent
We will check these assumptions by plotting the fitted response $\hat y_i = \hat\alpha + \hat\beta x_i$ against the estimated residuals $\hat r_i$
If our assumptions are true, we should see a random pattern centered around 0

The $Y_i$'s have constant variance
If the $Y_i$'s have constant variance, we should see residuals evenly distributed around zero
Non-constant variance: the residuals are funnel shaped

Comparison
Recall that in Chapter 1 we learned there were three different aspects (types of problem):
Descriptive
Causative
Predictive
Chapter 6 looks at techniques for solving each of the 3 problems

The descriptive aspect of the problem could involve comparing two different populations
In this section, we will learn how to conduct hypothesis tests that allow us to conclude whether there is a difference between 2 populations
The question asked is "Is there a difference between the mean values of the 2 populations?"
Essentially, the hypothesis tested is whether the parameter for each population is equal: $H_0: \mu_1 = \mu_2$

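2-sample t-tests (Response Model)
Two populations:

$Y_{1j} = \mu_1 + R_{1j}$
$Y_{2j} = \mu_2 + R_{2j}$

The estimator for each population is

$\tilde\mu_1 = \frac{1}{n_1}\sum_{j=1}^{n_1} Y_{1j}$
$\tilde\mu_2 = \frac{1}{n_2}\sum_{j=1}^{n_2} Y_{2j}$

The sampling distribution for each estimator is

$\tilde\mu_1 \sim G(\mu_1, \sigma/\sqrt{n_1})$
$\tilde\mu_2 \sim G(\mu_2, \sigma/\sqrt{n_2})$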

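In the hypothesis test, we want to see if the two parameters $\mu_1$ and $\mu_2$ are equal, so let's look at the r.v. $\tilde\mu_1 - \tilde\mu_2$
What is the sampling distribution of $\tilde\mu_1 - \tilde\mu_2$ under the assumption $\mu_1 = \mu_2$?

$\tilde\mu_1 - \tilde\mu_2 \sim G\left(0, \sigma\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}\right)$

Standardize:

$\frac{\tilde\mu_1 - \tilde\mu_2}{\sigma\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \sim G(0, 1)$

Replace $\sigma$ with its estimator:

$\frac{\tilde\mu_1 - \tilde\mu_2}{\tilde\sigma\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \sim t_{n_1 + n_2 - 2}$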

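The estimate of the common variance is

$\hat\sigma^2 = \frac{(n_1 - 1)\hat\sigma_1^2 + (n_2 - 1)\hat\sigma_2^2}{n_1 + n_2 - 2}$

and the test statistic is

$T = \frac{\tilde\mu_1 - \tilde\mu_2}{\tilde\sigma\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \sim t_{n_1 + n_2 - 2}$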

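Example
$\hat\mu_1 = 71.3$, $\hat\mu_2 = 68.7$
$\hat\sigma_1 = 10.2$, $\hat\sigma_2 = 11.3$
$n_1 = 47$, $n_2 = 36$

$\hat\sigma^2 = \frac{(n_1 - 1)\hat\sigma_1^2 + (n_2 - 1)\hat\sigma_2^2}{n_1 + n_2 - 2} = \frac{(47 - 1)10.2^2 + (36 - 1)11.3^2}{47 + 36 - 2} = 10.6892^2$

$t = \frac{71.3 - 68.7}{10.6892\sqrt{\frac{1}{47} + \frac{1}{36}}} \approx 1.098$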
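
The arithmetic in this example can be checked with a short sketch. The summary statistics are the ones given above; the function name `pooled_t` is illustrative only.

```python
import math

def pooled_t(mu1, mu2, s1, s2, n1, n2):
    # Pooled variance estimate: weighted average of the two sample variances
    pooled_var = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
    sigma_hat = math.sqrt(pooled_var)
    # Observed value of the test statistic under H0: mu1 = mu2
    t = (mu1 - mu2) / (sigma_hat * math.sqrt(1 / n1 + 1 / n2))
    return sigma_hat, t, n1 + n2 - 2   # degrees of freedom for the t distribution

sigma_hat, t, df = pooled_t(71.3, 68.7, 10.2, 11.3, 47, 36)
print(f"sigma_hat = {sigma_hat:.4f}, t = {t:.3f}, df = {df}")
```

The p-value is then $2P(T \ge |t|)$ for $T \sim t_{81}$, read from a t table or statistical software.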

Paired t-tests
In the prior pages, we looked at two-sample t-tests
A stronger test is called the paired t-test
This test only works if the two samples we collect are actually data for the same group of n units, but at different times
The paired t-test simplifies the two data sets into one by finding the difference within each pair of data and working with this single data set
Then we conduct a usual t-test / hypothesis test on this single data set of differences

Causation
The causative aspect of a problem looks at the relationship between the explanatory and response variates
Recall that in Chapter 1 we looked at 2 concepts that describe the relationship between X and Y:
Relative risk
Association

Association involves calculating the correlation coefficient

$r = \frac{S_{XY}}{\sqrt{S_{XX} S_{YY}}} = \frac{\sum_{i=1}^{n}(x_i - \bar x)(y_i - \bar y)}{\sqrt{\sum_{i=1}^{n}(x_i - \bar x)^2 \sum_{i=1}^{n}(y_i - \bar y)^2}}$

In this course, we only have the skills to test for association
This involves testing the hypothesis $H_0: \beta = 0$ in the regression model
If $\beta = 0$, then we can say there is no association between X and Y

Example

$t = \frac{\hat\beta - 0}{SE(\hat\beta)}$
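
A sketch of the association test, using the height/weight sample from the regression example in these notes. It assumes $SE(\hat\beta) = \hat\sigma/\sqrt{S_{XX}}$ with $\hat\sigma^2$ equal to the residual sum of squares divided by n - 2, which is consistent with the sampling distribution of the slope estimator and the $t_{n-2}$ reference distribution used elsewhere in the notes; `slope_test` is a hypothetical name.

```python
import math
from statistics import mean

def slope_test(xs, ys):
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    r = s_xy / math.sqrt(s_xx * s_yy)        # correlation coefficient
    beta_hat = s_xy / s_xx                   # estimated slope
    # Residual sum of squares equals S_YY - beta_hat * S_XY
    sigma_hat = math.sqrt((s_yy - beta_hat * s_xy) / (n - 2))
    se_beta = sigma_hat / math.sqrt(s_xx)    # assumed standard error of the slope
    t = (beta_hat - 0) / se_beta             # observed value under H0: beta = 0
    return r, beta_hat, se_beta, t

heights = [172, 162, 180, 170, 174]
weights = [60, 54, 72, 65, 64]
r, beta_hat, se_beta, t = slope_test(heights, weights)
print(f"r = {r:.3f}, beta_hat = {beta_hat:.4f}, SE = {se_beta:.4f}, t = {t:.3f}")
```

The observed t is compared with a t distribution on n - 2 = 3 degrees of freedom.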

Association does NOT imply causation
The course notes talk about why this is the case and how we can avoid drawing the wrong conclusion using three techniques:
Blocking
Repetition and Randomization
Matching

Confounding
Association does not imply causation
There could be a third, hidden variate that is related to both the explanatory and response variates and produces the apparent relationship between them: this is called confounding
The difficulty with confounding variates is identifying them in the first place; if we miss them, we will draw a wrong conclusion about the relationship between the explanatory and response variates
If we can identify the confounding variates, there are tools we can use when designing experimental plans to account for them

Blocking
If we have identified the confounding variate, we neutralize its effect by collecting samples in which the units have the same value for the confounding variate
The Chicken Example:
  Response variate: growth rate of chickens
  Explanatory variate: protein in diet
  Confounding variate: gender of the chickens
  Blocking: look at samples of only male chickens and samples of only female chickens
  This eliminates the gender effect, and the experimenter is able to look at the effect of protein in the diet on the growth rate of chickens

Replication and Randomization
If we cannot identify or control the confounding variate, we can still try to neutralize its effects by randomly allocating our controlled variate in the experimental plan
The Medicine Example:
  Response variate: survival rate
  Explanatory variate: type of treatment
  Confounding variates: medical history/health of the patient
  Using randomization and replication to assign the treatment type to each unit will tend to give two groups that are well balanced in terms of their health/medical history
  This will eliminate the effect of the confounding variates as much as possible

Matching and Observational Plans
In observational plans, the experimenter cannot control the variates
In the method of matching, each unit being observed is compared with a control unit that has very similar characteristics (this is similar to blocking)
Thus, if there is a difference in the value observed between the sampled unit and the control unit, the difference cannot be explained by the matched characteristics

Prediction
The predictive aspect of a problem involves using our collected data to estimate a value for a unit to be randomly selected from the population
We will look at prediction intervals for
Response
Regression

Prediction
The model:

$Y = \mu + R$
$Y \sim G(\mu, \sigma)$

The predicted unit: $Y_0$
Since $Y_0$ follows the response model, $Y_0 \sim G(\mu, \sigma)$

What would be a logical choice to use as our predicted value? The average
We need the estimator for the mean parameter (from MLE):

$\tilde\mu = \frac{1}{n}\sum_{i=1}^{n} Y_i$

Sampling distribution:

$\tilde\mu \sim G(\mu, \sigma/\sqrt{n})$

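If we look at the difference between the unit to be predicted and our estimator of the population average, we have the random variable $Y_0 - \tilde\mu$

$Y_0 \sim G(\mu, \sigma)$
$\tilde\mu \sim G(\mu, \sigma/\sqrt{n})$

Since $Y_0$ is independent of the sample, the variances add:

$Y_0 - \tilde\mu \sim G\left(0, \sigma\sqrt{1 + \frac{1}{n}}\right)$

Standardizing gives

$\frac{Y_0 - \tilde\mu}{\sigma\sqrt{1 + \frac{1}{n}}} \sim G(0, 1)$

Replacing $\sigma$ with an estimator gives

$\frac{Y_0 - \tilde\mu}{\tilde\sigma\sqrt{1 + \frac{1}{n}}} \sim t_{n-1}$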

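Constructing a 95% Prediction Interval for $Y_0$ ($\sigma$ unknown)
Our ultimate goal: $a \le Y_0 \le b$
Since $\frac{Y_0 - \tilde\mu}{\tilde\sigma\sqrt{1 + \frac{1}{n}}} \sim t_{n-1}$, we can make the probability statement

$P\left(-c \le \frac{Y_0 - \tilde\mu}{\tilde\sigma\sqrt{1 + \frac{1}{n}}} \le c\right) = 0.95$

where c is read from the $t_{n-1}$ table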

Example
Let Y be the response variate representing body weight (kg). The following sample is collected:
60 54 72 65 64
Construct a 95% prediction interval for the body weight of someone we randomly select from the population.

$\hat\mu \pm c\,\hat\sigma\sqrt{1 + \frac{1}{n}}$
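
A minimal sketch of this calculation, assuming $\hat\sigma$ is the sample standard deviation (divisor n - 1) and taking c = 2.776 from a t table with n - 1 = 4 degrees of freedom. The constant and function names are illustrative.

```python
import math
from statistics import mean, stdev

T_975_DF4 = 2.776  # t-table value: P(-c <= T <= c) = 0.95 for 4 degrees of freedom

def prediction_interval(ys, c):
    n = len(ys)
    mu_hat = mean(ys)
    sigma_hat = stdev(ys)  # sample standard deviation, divisor n - 1
    # Extra "1 +" under the root accounts for the variability of the new unit Y0
    margin = c * sigma_hat * math.sqrt(1 + 1 / n)
    return mu_hat - margin, mu_hat + margin

low, high = prediction_interval([60, 54, 72, 65, 64], T_975_DF4)
print(f"95% prediction interval: ({low:.2f}, {high:.2f})")
```

Dropping the leading 1 under the square root would turn this into a confidence interval for $\mu$ rather than a prediction interval for $Y_0$.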

Prediction
The model:

$Y = \alpha + \beta x_i + R$

But for our purposes, we will use a shifted version of the model:

$Y = \alpha + \beta(x_i - \bar x) + R$

The predicted unit: $Y_0$
We want to predict $Y_0$ given the subgroup $x_i = x_0$
Since $Y_0$ follows the regression model, $Y_0 \sim G(\alpha + \beta(x_0 - \bar x), \sigma)$

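What would be a logical choice to use as our predicted value?
The average given the subgroup $x_i = x_0$, whose estimator we will denote $\tilde\mu(x_0)$

Regression model: $Y = \alpha + \beta(x_i - \bar x) + R$
Average of the subgroup $x_i = x_0$: $E[Y \mid x_0] = \alpha + \beta(x_0 - \bar x)$, estimated by $\tilde\mu(x_0) = \tilde\alpha + \tilde\beta(x_0 - \bar x)$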

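Using Maximum Likelihood Estimation, we obtain the estimators

$\tilde\alpha = \frac{1}{n}\sum_{i=1}^{n} Y_i = \bar Y$
$\tilde\beta = \frac{\sum_{i=1}^{n}(Y_i - \bar Y)(x_i - \bar x)}{\sum_{i=1}^{n}(x_i - \bar x)^2} = \frac{S_{XY}}{S_{XX}}$

The sampling distributions of these two estimators are

$\tilde\alpha \sim G(\alpha, \sigma/\sqrt{n})$
$\tilde\beta \sim G(\beta, \sigma/\sqrt{S_{XX}})$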

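What is the sampling distribution of $\tilde\mu(x_0) = \tilde\alpha + \tilde\beta(x_0 - \bar x)$?

$\tilde\alpha \sim G(\alpha, \sigma/\sqrt{n})$
$\tilde\beta \sim G(\beta, \sigma/\sqrt{S_{XX}})$

$\tilde\mu(x_0) \sim G\left(\alpha + \beta(x_0 - \bar x), \sigma\sqrt{\frac{1}{n} + \frac{(x_0 - \bar x)^2}{S_{XX}}}\right)$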

If we look at the difference between the unit to be predicted and our estimator of the subgroup average, we have the random variable $Y_0 - \tilde\mu(x_0)$

$Y_0 \sim G(\alpha + \beta(x_0 - \bar x), \sigma)$
$\tilde\mu(x_0) \sim G\left(\alpha + \beta(x_0 - \bar x), \sigma\sqrt{\frac{1}{n} + \frac{(x_0 - \bar x)^2}{S_{XX}}}\right)$

The obvious next step is to determine the sampling distribution of $Y_0 - \tilde\mu(x_0)$:

$Y_0 - \tilde\mu(x_0) \sim G\left(0, \sigma\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar x)^2}{S_{XX}}}\right)$

Standardizing gives

$\frac{Y_0 - \tilde\mu(x_0)}{\sigma\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar x)^2}{S_{XX}}}} \sim G(0, 1)$

Estimating sigma gives

$\frac{Y_0 - \tilde\mu(x_0)}{\tilde\sigma\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar x)^2}{S_{XX}}}} \sim t_{n-2}$

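Constructing a 95% Prediction Interval for $Y_0$ ($\sigma$ unknown)
Our ultimate goal: $a \le Y_0 \le b$
Since $\frac{Y_0 - \tilde\mu(x_0)}{\tilde\sigma\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar x)^2}{S_{XX}}}} \sim t_{n-2}$, we can make the probability statement

$P\left(-c \le \frac{Y_0 - \tilde\mu(x_0)}{\tilde\sigma\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar x)^2}{S_{XX}}}} \le c\right) = 0.95$

Rearranging:

$P\left(-c\,\tilde\sigma\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar x)^2}{S_{XX}}} \le Y_0 - \tilde\mu(x_0) \le c\,\tilde\sigma\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar x)^2}{S_{XX}}}\right) = 0.95$

$P\left(\tilde\mu(x_0) - c\,\tilde\sigma\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar x)^2}{S_{XX}}} \le Y_0 \le \tilde\mu(x_0) + c\,\tilde\sigma\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar x)^2}{S_{XX}}}\right) = 0.95$

Upper and lower bounds of a regression prediction interval:

$\hat\alpha + \hat\beta(x_0 - \bar x) \pm c\,\hat\sigma\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar x)^2}{S_{XX}}}$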

Example
Let Y be the response variate representing body weight (kg) and X be the explanatory variate representing body height (cm). The following sample is collected:

i      x_i    y_i
1      172    60
2      162    54
3      180    72
4      170    65
5      174    64

Construct a 95% prediction interval for the body weight of someone we randomly select from the population whose height is 175 cm. Use $\hat\sigma = 2.97$.

$\hat\alpha + \hat\beta(x_0 - \bar x) \pm c\,\hat\sigma\sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar x)^2}{S_{XX}}}$
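
The interval can be computed as sketched below. The code recomputes $\hat\sigma$ from the residuals with divisor n - 2 (which reproduces the given 2.97) and takes c = 3.182 from a t table with n - 2 = 3 degrees of freedom; the names are illustrative.

```python
import math
from statistics import mean

T_975_DF3 = 3.182  # t-table value: P(-c <= T <= c) = 0.95 for 3 degrees of freedom

def regression_prediction_interval(xs, ys, x0, c):
    n = len(xs)
    x_bar = mean(xs)
    alpha_hat = mean(ys)  # in the shifted model, alpha_hat is the mean of the y_i
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - alpha_hat) for x, y in zip(xs, ys))
    beta_hat = s_xy / s_xx
    fitted = [alpha_hat + beta_hat * (x - x_bar) for x in xs]
    # Estimate of sigma from the residuals, divisor n - 2
    sigma_hat = math.sqrt(sum((y - f) ** 2 for y, f in zip(ys, fitted)) / (n - 2))
    centre = alpha_hat + beta_hat * (x0 - x_bar)
    margin = c * sigma_hat * math.sqrt(1 + 1 / n + (x0 - x_bar) ** 2 / s_xx)
    return sigma_hat, centre - margin, centre + margin

heights = [172, 162, 180, 170, 174]
weights = [60, 54, 72, 65, 64]
sigma_hat, low, high = regression_prediction_interval(heights, weights, 175, T_975_DF3)
print(f"sigma_hat = {sigma_hat:.2f}; 95% prediction interval: ({low:.2f}, {high:.2f})")
```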

Outline
Chapter 1
  Data types (discrete, continuous, categorical)
  Problem (3 different aspects)
  Populations (target, study, sample)
  Representations of data
    Graphical: histograms, CDFs, boxplots
    Numerical: mean, standard deviation, IQR
  Bivariate data
    Relative risk
    Correlation coefficient

Chapter 2
  Review of probability distributions
  Random PPDAC examples

PPDAC
Draw a frequency histogram of the Flash data, with bins given by the intervals (45 - 49.9), (50 - 54.9), etc.
First make a frequency table with the bin widths:

Interval       Frequency
(45 - 49.9)    1
(50 - 54.9)    1
(55 - 59.9)    2
(60 - 64.9)    5
(65 - 69.9)    5
(70 - 74.9)    1
(75 - 79.9)    1
(80 - 84.9)    1
(85 - 89.9)    2
(90 - 94.9)    1
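
As a rough illustration, the frequency table above can be turned into a text histogram. The raw Flash data are not reproduced in these notes, so the sketch starts from the tabulated frequencies; `text_histogram` is a hypothetical helper.

```python
# Bin intervals (45 - 49.9), (50 - 54.9), ..., (90 - 94.9) and their frequencies
intervals = [(lo, lo + 4.9) for lo in range(45, 95, 5)]
frequencies = [1, 1, 2, 5, 5, 1, 1, 1, 2, 1]

def text_histogram(intervals, frequencies):
    # One line per bin; the bar length equals the frequency because bins have equal width
    lines = []
    for (lo, hi), count in zip(intervals, frequencies):
        lines.append(f"({lo} - {hi:.1f}) | {'#' * count} {count}")
    return lines

for line in text_histogram(intervals, frequencies):
    print(line)
```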

Concept Review
From the previous example:
Target population, study population, sample, unit
Response vs. explanatory variates
Aspects
  Descriptive
  Causative
  Predictive
Histograms
  Bin width
  Frequency histogram

Outline
Chapter 3
  Binomial Model
  Response Model
  Regression Model
  Maximum Likelihood Estimation

MLE

$L(\theta) = \prod_{i=1}^{n} f(x_i; \theta)$

$l(\theta) = n\ln(\theta + 1) + \theta\sum_{i=1}^{n}\ln(x_i)$

Concept Review
From the previous example:
Maximum Likelihood Estimation method
  1) Define the likelihood function
  2) Define the log-likelihood function
  3) Differentiate with respect to the parameter
  4) Set to zero
  5) Solve for the parameter
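
As a worked instance of these steps, assume the density $f(x; \theta) = (\theta + 1)x^{\theta}$ for 0 < x < 1. This density is an assumption: the example's problem statement is not reproduced in these notes, but it is consistent with a log-likelihood of the form $n\ln(\theta + 1) + \theta\sum\ln x_i$.

```latex
\begin{align*}
L(\theta) &= \prod_{i=1}^{n} (\theta + 1)\,x_i^{\theta}
  && \text{likelihood function} \\
l(\theta) &= n\ln(\theta + 1) + \theta\sum_{i=1}^{n}\ln x_i
  && \text{log-likelihood function} \\
\frac{dl}{d\theta} &= \frac{n}{\theta + 1} + \sum_{i=1}^{n}\ln x_i
  && \text{differentiate} \\
0 &= \frac{n}{\hat\theta + 1} + \sum_{i=1}^{n}\ln x_i
  && \text{set to zero} \\
\hat\theta &= -\frac{n}{\sum_{i=1}^{n}\ln x_i} - 1
  && \text{solve for the parameter}
\end{align*}
```

Because every $x_i$ lies in (0, 1), the sum of logarithms is negative, so $\hat\theta > -1$ as the model requires.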

Outline
Chapter 4
  Sampling distributions for estimators
  Introduction to new distributions
    Gaussian
    Chi-squared
    t
  Confidence Interval
  Hypothesis Testing
  Confidence Intervals and Hypothesis Testing with the likelihood function

Confidence Intervals

Concepts Review
From the previous example:
Confidence intervals for the response model, sigma unknown
Structure of a symmetric confidence interval

Hypothesis Testing
For a paired t-test, we create a new set of data:

i      Diff      i      Diff
1      0.48      9      0.46
2      0.53      10     0.76
3      0.52      11     3.09
4      0.21      12     0.26
5      -0.05     13     0.34
6      0.44      14     0.32
7      0.41      15     -0.07
8      0.68      16     0.33

Test statistic:

$T = \frac{\tilde\mu_D - 0}{\tilde\sigma_D/\sqrt{n}} \sim t_{n-1}$
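
The observed value of this statistic for the 16 differences listed above can be computed as in the sketch below. It assumes $\hat\sigma_D$ is the sample standard deviation of the differences (divisor n - 1); `paired_t` is an illustrative name.

```python
import math
from statistics import mean, stdev

# The 16 differences from the table in these notes
diffs = [0.48, 0.53, 0.52, 0.21, -0.05, 0.44, 0.41, 0.68,
         0.46, 0.76, 3.09, 0.26, 0.34, 0.32, -0.07, 0.33]

def paired_t(diffs, mu0=0.0):
    n = len(diffs)
    mu_hat = mean(diffs)       # estimate of the mean difference
    sigma_hat = stdev(diffs)   # sample standard deviation, divisor n - 1
    t = (mu_hat - mu0) / (sigma_hat / math.sqrt(n))
    return mu_hat, sigma_hat, t, n - 1   # last value: degrees of freedom

mu_hat, sigma_hat, t, df = paired_t(diffs)
print(f"mean diff = {mu_hat:.4f}, sd = {sigma_hat:.4f}, t = {t:.3f}, df = {df}")
```

The value 3.09 is far from the other differences and inflates the standard deviation, so it is worth checking before trusting the Gaussian assumption behind the test.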

p-value

For a 2-sample t-test, we have two populations, with 2 sets of data
Test statistic:

$T = \frac{\tilde\mu_1 - \tilde\mu_2}{\tilde\sigma\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \sim t_{n_1 + n_2 - 2}$

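$\hat\sigma^2 = \frac{(n_1 - 1)\hat\sigma_1^2 + (n_2 - 1)\hat\sigma_2^2}{n_1 + n_2 - 2} = \frac{(16 - 1)2.48^2 + (16 - 1)2.91^2}{16 + 16 - 2} = 2.704^2$

Observed value of the test statistic:

$t = \frac{\hat\mu_1 - \hat\mu_2}{\hat\sigma\sqrt{\frac{1}{n_1} + \frac{1}{n_2}}}$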

p-value

Concepts Review
From the previous example:
Hypothesis testing
  Define the null hypothesis
  Define the test statistic, identify the distribution, calculate the observed value of the test statistic
  Calculate the p-value
2-sample t-test
Paired t-test
