
19/03/2015

Linear Regression
Fat Versus Protein: An Example
The following is a scatterplot of total fat versus protein for 30
items on the Burger King menu:

SIT191 (week 3)
Linear Regression

The Linear Model
Examining the scatterplot shows that there seems to be a
linear relationship between these two variables.
We can say more about the linear relationship between two
quantitative variables with a model/equation.
A model simplifies real data to help us understand underlying
patterns and relationships.
The equation can be used to predict values of the response (y)
variable for given values of the explanatory (x) variable.

Residuals
The model won't be perfect, regardless of the line we draw:
some points will be above the line and some will be below.
The estimate made from a model is the predicted value
(denoted $\hat{y}$).
The difference between the observed value and its associated
predicted value is called the residual.
To find the residuals, we always subtract the predicted value
from the observed one:

The Linear Model (cont.)
The linear model is just an equation of the straight line that
best fits the data.
The points in the scatterplot don't all line up perfectly, but a
straight line can summarise the general pattern.
The linear model can help us better understand how the values
are associated numerically.

Residuals (cont.)
A negative residual means the predicted value (the line) is
above the observation.
A positive residual means the predicted value (the line) lies
below the observation.

$\text{residual} = \text{observed} - \text{predicted} = y - \hat{y}$
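The subtraction rule above can be sketched in a few lines of Python. The function name `residuals` and the three data values are hypothetical, chosen only to show one negative, one positive, and one zero residual.

```python
def residuals(observed, predicted):
    """Return observed minus predicted for each paired value."""
    return [y - y_hat for y, y_hat in zip(observed, predicted)]

# Hypothetical values for illustration only (not taken from the BK data).
observed = [10.0, 25.0, 31.0]
predicted = [12.0, 22.0, 31.0]

res = residuals(observed, predicted)
print(res)  # [-2.0, 3.0, 0.0]
```

The first point lies below the line (negative residual), the second lies above it (positive residual), and the third sits exactly on it.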


Best Fit Means Least Squares
Some residuals are positive, others are negative, and, on
average, they cancel each other out.
So, we can't assess how well the line fits by adding up all the
residuals.
Similar to what we did with standard deviations, we square
the residuals and add the squares.
The smaller the sum, the better the fit.
The line of best fit is the line for which the sum of the squared
residuals is smallest.
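As a rough illustration of "the smaller the sum, the better the fit", the sketch below compares two candidate lines on a small made-up data set. The data, the helper name `sum_squared_residuals`, and both candidate lines are assumptions for illustration; neither candidate is claimed to be the least squares line.

```python
def sum_squared_residuals(xs, ys, b0, b1):
    """Sum of squared residuals for the line y_hat = b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data for illustration only.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 5.0, 8.0]

ssr_a = sum_squared_residuals(xs, ys, b0=0.0, b1=2.0)  # candidate line A
ssr_b = sum_squared_residuals(xs, ys, b0=1.0, b1=1.5)  # candidate line B

print(ssr_a, ssr_b)  # 1.0 1.5
```

By the least squares criterion, candidate A fits these points better than candidate B because its sum of squared residuals is smaller.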

The Least Squares Line
We write the linear model as $\hat{y} = b_0 + b_1 x$
The model/equation has a slope $b_1$.
The slope is built from the correlation and the standard
deviations:

$b_1 = r \frac{s_y}{s_x}$

The slope is always in units of y per unit of x (the change in y for
a unit change in x).
The equation also has an intercept $b_0$.
The intercept is built from the means and the slope:

$b_0 = \bar{y} - b_1 \bar{x}$

The intercept is always in units of y.
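The two formulas above can be turned into a short calculation. This is a minimal sketch using only the Python standard library; the function names and the four data points are hypothetical and are not part of the course material.

```python
from statistics import mean, stdev

def correlation(xs, ys):
    """Pearson correlation r, computed as the average product of z-scores."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    n = len(xs)
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)

def least_squares_line(xs, ys):
    """Return (b0, b1) using b1 = r * s_y / s_x and b0 = y_bar - b1 * x_bar."""
    r = correlation(xs, ys)
    b1 = r * stdev(ys) / stdev(xs)
    b0 = mean(ys) - b1 * mean(xs)
    return b0, b1

# Hypothetical data for illustration only.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 5.0, 8.0]
b0, b1 = least_squares_line(xs, ys)
print(round(b1, 3))  # slope, in units of y per unit of x
```

For these four points the slope works out to about 1.9 and the intercept to about 0, up to floating-point rounding.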

The Least Squares Line (cont.)
Since regression and correlation are closely related, we need
to check the same conditions for regressions as we did for
correlations:
Quantitative Variables Condition
Straight Enough Condition
Outlier Condition

Fat Versus Protein Example
The regression line for the Burger King data fits the data
well:
The equation is $\widehat{\text{fat}} = 6.8 + 0.97 \times \text{protein}$
The predicted fat content for a BK Broiler chicken sandwich
(30 grams of protein) is
6.8 + 0.97(30) = 35.9 grams of fat.
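The prediction above is a direct substitution into the fitted line. A minimal sketch, assuming the intercept 6.8 and slope 0.97 from the slide; the function name `predict_fat` is hypothetical.

```python
def predict_fat(protein_g, intercept=6.8, slope=0.97):
    """Predicted total fat (grams) for a given protein content (grams)."""
    return intercept + slope * protein_g

bk_broiler = predict_fat(30)
print(round(bk_broiler, 1))  # 35.9
```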

Residuals Revisited
The linear model assumes that the relationship between the
two variables is a perfect straight line. The residuals are the
part of the data that hasn't been modeled.
Data = Model + Residual
or
Residual = Data - Model
Or, in symbols,

$e = y - \hat{y}$

Residuals Revisited (cont.)
Residuals help us to see whether the model makes sense.
When a regression model is appropriate, nothing interesting
should be left behind.
After we fit a regression model, we usually plot the residuals
in the hope of finding nothing.
The residual plot should not show any patterns or trends.
The plot should show a random cloud of points.


Residuals Revisited (cont.)
The residuals for the BK menu regression look appropriately
boring:

$R^2$: The Variation Accounted For
The variation in the residuals is the key to assessing how well
the model fits.
In the BK menu items, total fat has a standard deviation
of 16.4 grams. The standard deviation
of the residuals is 9.2 grams.

$R^2$: The Variation Accounted For (cont.)
If the correlation were 1.0 and the model predicted the fat
values perfectly, the residuals would all be zero and have no
variation.
As it is, the correlation is 0.83, which is not perfect.
However, we did see that the model residuals had less
variation than total fat alone.
We can determine how much of the variation is accounted
for by the model and how much is left in the residuals.

$R^2$: The Variation Accounted For (cont.)
The squared correlation, $r^2$, gives the fraction of the data's
variance accounted for by the model.
Thus, $1 - r^2$ is the fraction of the original variance left in the
residuals.
For the BK model, $r^2 = 0.83^2 = 0.69$, so 31% of the variability in
total fat has been left in the residuals.
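The arithmetic above can be cross-checked against the two standard deviations quoted for the BK data (16.4 grams for total fat, 9.2 grams for the residuals). This is a sketch of that check; the variable names are my own, and the two routes agree only approximately because the inputs are rounded.

```python
r = 0.83          # correlation reported for the BK data
sd_fat = 16.4     # standard deviation of total fat (grams)
sd_resid = 9.2    # standard deviation of the residuals (grams)

r_squared = r ** 2                 # fraction of variance accounted for
left_over = 1 - r_squared          # fraction left in the residuals
variance_ratio = sd_resid ** 2 / sd_fat ** 2  # same fraction, from the SDs

print(round(r_squared, 2), round(left_over, 2), round(variance_ratio, 2))
# 0.69 0.31 0.31
```

Both routes give roughly 31% of the variance left in the residuals, which is why the residual standard deviation is smaller than that of total fat alone.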

$R^2$: The Variation Accounted For (cont.)
All regression analyses include this statistic, although by
tradition, it is written $R^2$ (pronounced "R-squared"). An $R^2$ of
0 means that none of the variance in the data is in the model;
all of it is still in the residuals.
When interpreting a regression model you need to interpret
what $R^2$ means.
In the BK example, 69% of the variation in total fat is accounted
for by the model.

How Big Should $R^2$ Be?
$R^2$ is always between 0% and 100%. What makes a good $R^2$
value depends on the kind of data you are analysing and on
what you want to do with it.
The standard deviation of the residuals can give us more
information about the usefulness of the regression by telling
us how much scatter there is around the line.


How Big Should $R^2$ Be? (cont.)
Along with the slope and intercept for a regression, you
should always report $R^2$ so that readers can judge for
themselves how successful the regression is at fitting the
data.
Statistics is about variation, and $R^2$ measures the success of
the regression model in terms of the fraction of the variation
of y accounted for by the regression.

Regression Assumptions and Conditions
Quantitative Variables Condition:
Regression can only be done on two quantitative variables, so
make sure to check this condition.

Straight Enough Condition:
The linear model assumes that the relationship between the
variables is linear.
A scatterplot will let you check that the assumption is
reasonable.

Regression Assumptions and Conditions (cont.)
Outlier Condition:
Watch out for outliers.
Outlying points can dramatically change a regression model.
Outliers can even change the sign of the slope, misleading us
about the underlying relationship between the variables.

Cautions
Don't fit a straight line to a nonlinear relationship.
Beware of extraordinary points (y-values that stand off from
the linear pattern or extreme x-values).
Don't extrapolate beyond the data: the linear model may no
longer hold outside of the range of the data.
Don't infer that x causes y just because there is a good linear
model for their relationship: association is not causation.
Don't choose a model based on $R^2$ alone.

Summary
When the relationship between two quantitative variables is
fairly straight, a linear model can help summarise that
relationship.
Regression models are nearly always calculated using a
calculator or software such as SPSS.
The correlation tells us how strong the relationship is.
$R^2$ gives us the fraction of the variation in the response
accounted for by the regression model.

Extrapolation: Reaching Beyond the Data
Linear models give a predicted value for each case in the data.
We cannot assume that a linear relationship in the data exists
beyond the range of the data.
Once we venture into new x territory, such a prediction is
called an extrapolation.
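One way to stay alert to "new x territory" is to flag any prediction whose x-value falls outside the range used to fit the model. The sketch below assumes the BK line from earlier; the function name and the protein range of 0 to 70 grams are hypothetical placeholders, since the slides do not state the observed range.

```python
def predict_with_range_check(x, b0, b1, x_min, x_max):
    """Return (prediction, is_extrapolation) for the line b0 + b1 * x."""
    y_hat = b0 + b1 * x
    is_extrapolation = not (x_min <= x <= x_max)
    return y_hat, is_extrapolation

# Slope and intercept from the BK example; the range is an assumed placeholder.
B0, B1 = 6.8, 0.97
X_MIN, X_MAX = 0.0, 70.0  # hypothetical observed protein range (grams)

inside = predict_with_range_check(30.0, B0, B1, X_MIN, X_MAX)
outside = predict_with_range_check(100.0, B0, B1, X_MIN, X_MAX)
print(inside[1], outside[1])  # False True
```

The flag does not make an extrapolated prediction any more trustworthy; it only marks it as one.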


Extrapolation (cont.)
Extrapolations are dubious because they require the
additional and very questionable assumption that nothing
about the relationship between x and y changes even at
extreme values of x.
Extrapolations can get you into deep trouble. You're better off
not making extrapolations.

Extrapolation (cont.)
A regression of mean age at first marriage for men vs. year fit to the
years from 1890 to 1998 does not hold for later years:
After 1950, linearity did not hold.

Outliers, Leverage, and Influence
Outlying points can strongly influence a regression. Even a
single point far from the body of the data can dominate the
analysis.
Any point that stands away from the others can be called an
outlier and deserves special attention.

Outliers, Leverage, and Influence (cont.)
The following scatterplot shows that something was awry in Palm
Beach County, Florida, during the 2000 presidential election:

Outliers, Leverage, and Influence (cont.)
The red line shows the effects that one unusual point can
have on a regression:

Outliers, Leverage, and Influence (cont.)
A data point can also be unusual if its x-value is far from the
mean of the x-values. Such points are said to have high
leverage.
A point with high leverage has the potential to change the
regression line.
We say that a point is influential if omitting it from the
analysis gives a very different model.


Outliers, Leverage, and Influence (cont.)
Warning:
Influential points can hide in plots of residuals.
Points with high leverage pull the line close to them, so
they often have small residuals.
You'll see influential points more easily in scatterplots of
the original data or by finding a regression model with and
without the points.
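Fitting the model with and without a suspect point can be sketched as follows. The data are made up for illustration: five points that lie exactly on a line, plus one hypothetical high-leverage point at x = 15. The helper `fit_line` computes the slope as $S_{xy}/S_{xx}$, which is algebraically the same as $r \, s_y / s_x$.

```python
def fit_line(xs, ys):
    """Least squares intercept and slope, with slope = Sxy / Sxx."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical data: five points on the line y = x, plus one unusual point.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 2.0, 3.0, 4.0, 5.0]
unusual_x, unusual_y = 15.0, 0.0

b0_without, b1_without = fit_line(xs, ys)
b0_with, b1_with = fit_line(xs + [unusual_x], ys + [unusual_y])

# Residuals from the line fitted WITH the unusual point included.
residual_unusual = unusual_y - (b0_with + b1_with * unusual_x)
residual_at_5 = 5.0 - (b0_with + b1_with * 5.0)

print(b1_without, round(b1_with, 3))
print(round(residual_unusual, 2), round(residual_at_5, 2))
```

In this example the single added point flips the slope from positive to negative, yet its own residual is smaller in magnitude than the residual at x = 5, which is how an influential point can hide in a residual plot.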

Lurking Variables and Causation
No matter how strong the association, no matter how large
the $R^2$ value, no matter how straight the line, there is no way
to conclude from a regression alone that one variable causes
the other.
There's always the possibility that some third variable is
driving both of the variables you have observed.
With observational data, as opposed to data from a designed
experiment, there is no way to be sure that a lurking variable
is not the cause of any apparent association.

Lurking Variables and Causation (cont.)
The following scatterplot shows that the average life
expectancy for a country is related to the number of doctors
per person in that country:

Lurking Variables and Causation (cont.)
This new scatterplot shows that the average life expectancy
for a country is related to the number of televisions per
person in that country:

Lurking Variables and Causation (cont.)
Since televisions are cheaper than doctors, send TVs to
countries with low life expectancies in order to extend
lifetimes. Right? No!
How about considering a lurking variable? That makes more
sense.
Countries with higher standards of living have both longer
life expectancies and more doctors (and TVs!).
If higher living standards cause changes in these other
variables, improving living standards might be expected to
prolong lives and increase the numbers of doctors and
TVs.


Working With Summary Values
Scatterplots of statistics summarised over groups tend to show
less variability than we would see if we measured the same
variable on individuals.
This is because the summary statistics themselves vary less
than the data on the individuals do.

Working With Summary Values (cont.)
There is a strong, positive, linear association between weight
(in pounds) and height (in inches) for men:

Working With Summary Values (cont.)
If instead of data on individuals we only had the mean weight
for each height value, we would see an even stronger
association:

Working With Summary Values (cont.)
Means vary less than individual values.
Scatterplots of summary statistics show less scatter than
the baseline data on individuals.
This can give a false impression of how well a line
summarises the data.

There is no simple correction for this phenomenon.
Once we have summary data, there's no simple way to get
the original values back.
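The effect of summarising can be seen on a small constructed data set. The sketch below uses hypothetical heights and weights (three men per height, spread 20 pounds either side of the group mean); none of these numbers come from the slides. It compares the correlation computed on individuals with the correlation computed on the group means.

```python
import math

def correlation(xs, ys):
    """Pearson correlation r = Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: mean weight (pounds) at each height (inches).
heights = [64.0, 66.0, 68.0, 70.0]
mean_weights = [140.0, 150.0, 160.0, 170.0]
offsets = [-20.0, 0.0, 20.0]  # three individuals around each group mean

ind_heights = [h for h in heights for _ in offsets]
ind_weights = [w + d for w in mean_weights for d in offsets]

r_individual = correlation(ind_heights, ind_weights)
r_means = correlation(heights, mean_weights)
print(round(r_individual, 2), round(r_means, 2))
```

Here the group means fall on a straight line, so their correlation is essentially 1, while the correlation among individuals is only moderate. The summary plot looks far tidier than the underlying data.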

Cautions
Make sure the relationship is straight.
Check the Straight Enough Condition.

Beware of extrapolating.
Look for unusual points.
Beware of lurking variables and don't assume that
association is causation.
Watch out when dealing with data that are summaries.
