The Ins and Outs of Histograms
In this issue, we will tackle the probability distribution inference for a random variable.
Why do we care? As a start, no matter how good a stochastic model you have, you will always end up with an error term (aka shock or innovation), and the uncertainty (e.g. risk, forecast error) of the model is solely determined by this random variable. Second, uncertainty is commonly expressed as a probability distribution, so there is no escape!
One of the main problems in practical applications is that the needed probability distribution is usually not readily available. This distribution must be derived from other existing information (e.g. sample data).
What we mean by probability distribution analysis is essentially the selection process of a distribution function (parametric or non-parametric).
In this paper, we'll start with the non-parametric distribution functions: (1) the empirical (cumulative) distribution function and (2) the histogram. In a later issue, we'll also go over the kernel density estimate (KDE).
Background
1. Empirical Distribution Function (EDF)
The empirical distribution function (EDF), or empirical CDF, is a step function that jumps by 1/N at the occurrence of each observation.
$$\mathrm{EDF}(x) = F_N(x) = \frac{1}{N}\sum_{i=1}^{N} I\{x_i \le x\}$$
Where:
$I\{A\}$ is the indicator function of an event $A$:
$$I\{A\} = \begin{cases} 1 & A = \text{True} \\ 0 & A = \text{False} \end{cases}$$
The EDF estimates the true underlying cumulative distribution function of the points in the sample; it is virtually guaranteed to converge to the true distribution as the sample size gets sufficiently large.
To obtain the probability density function (PDF), one needs to take the derivative of the CDF, but the EDF is a step function and differentiation is a noise-amplifying operation. As a result, the resulting PDF is very jagged and needs considerable smoothing for many areas of application.
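The EDF definition above can be sketched in a few lines of Python. This is a minimal illustration, assuming NumPy is available; the function name `edf` and the five-point sample are hypothetical and are not taken from the original analysis.

```python
import numpy as np

def edf(sample, x):
    """Empirical distribution function: fraction of observations <= x."""
    sample = np.sort(np.asarray(sample, dtype=float))
    # On sorted data, searchsorted(side="right") counts the values <= x
    counts = np.searchsorted(sample, x, side="right")
    return counts / sample.size

data = [0.3, -1.2, 0.8, 0.3, 2.1]      # hypothetical sample, N = 5
print(edf(data, 0.3))                  # 3 of the 5 observations are <= 0.3 -> 0.6
print(edf(data, [-2.0, 0.0, 5.0]))     # 0 below the minimum, 1 above the maximum
```

Each observation contributes a jump of 1/N, so a value that occurs twice (0.3 here) produces a jump of 2/N at that point.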
NumXL Tips and Hints: Histogram, Spider Financial Corp, 2012
2. Histogram
The (frequency) histogram is probably the most familiar and intuitive distribution function, and it approximates the PDF fairly well.
In statistics, a histogram is a graphical representation showing a visual impression of the distribution of data. Histograms are used to plot the density of data, and often for density estimation, i.e. estimating the probability density function of the underlying variable.
In mathematical terms, a histogram is a function $m_i$ that counts the number of observations whose values fall into one of the disjoint intervals (aka bins).
$$N = \sum_{i=1}^{k} m_i$$
Where:
$N$ is the total number of observations in the sample data
$k$ is the number of bins
$m_i$ is the histogram value for the i-th bin
And a cumulative histogram is defined as follows:
$$M_j = \sum_{i=1}^{j} m_i, \qquad 1 \le j \le k$$
The frequency function ($f_i$) (aka relative histogram) is computed simply by dividing the histogram value by the total number of observations:
$$f_i = \frac{m_i}{N}$$
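As a small worked example of the three quantities just defined (bin counts, cumulative histogram, and relative histogram), the following sketch uses NumPy's `numpy.histogram`. The eight-point sample and the choice of three bins are made up for illustration.

```python
import numpy as np

data = np.array([1.0, 2.0, 2.5, 3.0, 4.5, 5.0, 7.0, 9.5])  # hypothetical sample
k = 3                                    # number of bins, chosen arbitrarily here

# m[i] is the number of observations in the i-th bin; edges has k + 1 entries
m, edges = np.histogram(data, bins=k)

N = m.sum()          # the bin counts add up to the sample size
M = np.cumsum(m)     # cumulative histogram M_j
f = m / N            # relative histogram (frequency function) f_i

print(m)   # [4 2 2]
print(M)   # [4 6 8]
print(f)   # [0.5  0.25 0.25]
```

Note that `numpy.histogram` uses equal-width bins spanning the sample minimum to the sample maximum by default, which is one of the arbitrary "bar position" choices discussed in this section.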
One of the major drawbacks of the histogram is that its construction requires an arbitrary assignment of bar width (or number of bins) and bar positions, which means that unless one has access to a very large amount of data, the shape of the distribution function varies heavily as the bar width (or number of bins) and positions are altered.
Furthermore, for a large sample size, the outliers are difficult or perhaps impossible to see in the histogram, except when they cause the x-axis to expand.
Having said that, there are a few methods for inferring the number of histogram bins, but care must be taken to understand the assumptions made behind their formulation.
Sturges' formula
The Sturges method assumes the sample data follow an approximately normal distribution (i.e. bell shape).
$$k = \lceil \log_2 N + 1 \rceil$$
Where:
$\lceil \cdot \rceil$ is the ceiling operator
Square-root formula
This method is used by Excel and other statistical packages. It does not assume any shape of the distribution:
$$k = \sqrt{N}$$
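Both rules above depend only on the sample size, so they are easy to evaluate directly. The sketch below uses the sample size n = 498 from the application later in this issue; the function names are hypothetical, and rounding the square root up to the next integer is an assumption, since the formula itself does not specify a rounding rule.

```python
import math

def sturges_bins(n):
    """Sturges' formula: ceil(log2(n) + 1)."""
    return math.ceil(math.log2(n) + 1)

def sqrt_bins(n):
    """Square-root formula, rounded up to a whole number of bins (assumption)."""
    return math.ceil(math.sqrt(n))

n = 498
print(sturges_bins(n))   # 10
print(sqrt_bins(n))      # 23
```

The two rules already disagree by more than a factor of two at this sample size, which is why the assumptions behind each formula matter.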
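Scott's (normal reference) choice
Scott's choice is optimal for a random sample from a normal distribution. It sets the bin size $h$:
$$h = \frac{3.5\,\sigma}{\sqrt[3]{N}}$$
Where:
$\sigma$ is the estimated sample standard deviation
and the number of bins follows as $k = \lceil (x_{\max} - x_{\min}) / h \rceil$.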
Freedman-Diaconis's choice
$$h = \frac{2\,\mathrm{IQR}}{\sqrt[3]{N}}$$
Where:
$h$ is the bin size
$\mathrm{IQR}$ is the interquartile range
And
$$k = \left\lceil \frac{x_{\max} - x_{\min}}{h} \right\rceil$$
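The two data-dependent rules can be sketched as follows. This is an illustration under stated assumptions: Scott's expression is treated as a bin width (the usual reading of that rule) and converted to a bin count with the same ceiling formula used for Freedman-Diaconis; the function names are hypothetical; and the data are synthetic standard-normal draws, not the EUR/USD returns.

```python
import math
import numpy as np

def bins_from_width(data, h):
    """Number of bins k = ceil((x_max - x_min) / h)."""
    return math.ceil((data.max() - data.min()) / h)

def scott_width(data):
    """Scott's bin width: 3.5 * sigma / N^(1/3), sigma = sample std. deviation."""
    return 3.5 * data.std(ddof=1) / data.size ** (1 / 3)

def freedman_diaconis_width(data):
    """Freedman-Diaconis bin width: 2 * IQR / N^(1/3)."""
    q75, q25 = np.percentile(data, [75, 25])
    return 2 * (q75 - q25) / data.size ** (1 / 3)

rng = np.random.default_rng(0)
data = rng.normal(size=498)          # synthetic stand-in sample

for name, width in (("Scott", scott_width(data)),
                    ("Freedman-Diaconis", freedman_diaconis_width(data))):
    print(name, "h =", width, "k =", bins_from_width(data, width))
```

For roughly normal data the IQR is about 1.35 standard deviations, so the Freedman-Diaconis width (about 2.7 sigma over the cube root of N) tends to be narrower than Scott's and yields more bins.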
Decision based on minimization of the risk function ($L^2$)
$$\min_h L^2 = \min_h \left( \frac{2\bar{m} - v}{h^2} \right)$$
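Where:
$$\bar{m} = \frac{1}{k}\sum_{i=1}^{k} m_i = \frac{N}{k}$$
$$v = \frac{1}{k}\sum_{i=1}^{k} (m_i - \bar{m})^2 = \frac{1}{k}\sum_{i=1}^{k} m_i^2 - \bar{m}^2 = \frac{1}{k}\sum_{i=1}^{k} m_i^2 - \frac{N^2}{k^2}$$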
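The risk-minimization approach can be sketched as a simple search over candidate bin counts: for each k, build the histogram, compute the mean and (biased) variance of the bin counts, and evaluate the cost (2m̄ − v)/h². This is a minimal sketch, not the original implementation; the function names, the candidate range of 2 to 50 bins, and the synthetic data are all assumptions. The cost has the same form as the one proposed by Shimazaki and Shinomoto for bin-width selection.

```python
import numpy as np

def bin_cost(data, k):
    """Cost (2 * mean - variance) / h^2 for k equal-width bins."""
    m, edges = np.histogram(data, bins=k)
    h = edges[1] - edges[0]                 # common bin width
    mean = m.mean()                         # m-bar = N / k
    var = np.mean((m - mean) ** 2)          # biased variance, divides by k
    return (2 * mean - var) / h ** 2

def best_bins(data, candidates):
    """Return the candidate bin count with the smallest cost."""
    candidates = list(candidates)
    costs = [bin_cost(data, k) for k in candidates]
    return candidates[int(np.argmin(costs))]

rng = np.random.default_rng(0)
data = rng.normal(size=498)                 # synthetic stand-in sample
print(best_bins(data, range(2, 51)))
```

Unlike the closed-form rules, this approach makes no normality assumption, but its answer depends on the candidate range searched and on the bin positions implied by the sample minimum and maximum.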
3. Kernel Density Estimate (KDE)
An alternative to the histogram is kernel density estimation (KDE), which uses a kernel to smooth the samples. This constructs a smooth probability density function, which will in general reflect the underlying variable more accurately. We mention the KDE for the sake of completeness, but we will postpone its discussion to a later issue.
EUR/USD Application:
Let's consider the daily log returns of the EUR/USD exchange rate sample data. In our earlier analysis (ref: NumXL Tips and Hints: Price this), the data was shown to follow a Gaussian white-noise distribution.
The EDF function for those returns (n=498) is shown below:
For the histogram, we calculated the number of bins using the 4 methods:
Next, we plot the relative histogram using those different numbers of bins. We overlay the normal probability density function (red curve) for comparison.
Although we have a relatively large data set (n=498) and the EDF and statistical tests indicate Gaussian-distributed data, the choice of bin size can distort the density function.
Scott's choice (k=15 bins) describes the density function best, followed by Sturges' formula.
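The comparison described above, a density-scaled histogram against a fitted normal PDF for several bin counts, can be approximated numerically. The sketch below is only an illustration: it uses synthetic normal draws of the same size (n = 498) rather than the EUR/USD returns, the helper names are hypothetical, and the bin counts tried are arbitrary, so its output says nothing about the results reported in this issue.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Normal probability density function."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def density_gap(data, k):
    """Largest absolute gap between the density histogram (k bins)
    and a normal PDF fitted to the sample, evaluated at bin centers."""
    dens, edges = np.histogram(data, bins=k, density=True)
    centers = 0.5 * (edges[:-1] + edges[1:])
    fitted = normal_pdf(centers, data.mean(), data.std(ddof=1))
    return np.max(np.abs(dens - fitted))

rng = np.random.default_rng(0)
data = rng.normal(size=498)           # synthetic sample, not the EUR/USD series

for k in (10, 15, 23, 40):
    print(k, density_gap(data, k))
```

With `density=True` the bar heights are the relative frequencies divided by the bin width, so the bars integrate to one and are directly comparable to a PDF.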
Conclusion
In this issue, we attempted to derive an approximation of the underlying probability density using a sample data histogram and the empirical (cumulative) distribution function.
Although the data sample is relatively large (n=498), the histogram is still a fairly crude approximation and very sensitive to the number of bins used.
Using rules of thumb (e.g. Sturges' rule, Scott's choice, etc.) can improve the process of finding a better number of bins, but they make their own assumptions about the shape of the distribution, and an experienced (manual) examination (or eyeballing) is needed to ensure proper histogram generation.