
NumXL Tips and Hints: Histogram, Spider Financial Corp, 2012

The Ins and Outs of Histograms
In this issue, we will tackle the problem of inferring the probability distribution of a random variable.
Why do we care? First, no matter how good a stochastic model you have, you will always end up with an error term (aka shock or innovation), and the uncertainty (e.g. risk, forecast error) of the model is solely determined by this random variable. Second, uncertainty is commonly expressed as a probability distribution, so there is no escape!
One of the main problems in practical applications is that the needed probability distribution is usually not readily available. This distribution must be derived from other existing information (e.g. sample data).
By probability distribution analysis, we essentially mean the process of selecting a distribution function (parametric or non-parametric).
In this paper, we'll start with the non-parametric distribution functions: (1) the empirical (cumulative) distribution function and (2) the histogram. In a later issue, we'll also go over the kernel density estimate (KDE).
Background
1. Empirical Distribution Function (EDF)
The empirical distribution function (EDF), or empirical CDF, is a step function that jumps by 1/N at the occurrence of each observation.

$$\mathrm{EDF}(x) = F_N(x) = \frac{1}{N}\sum_{i=1}^{N} I\{x_i \le x\}$$

Where:
$I\{A\}$ is the indicator function of the event $A$:

$$I\{A\} = \begin{cases} 1 & A \text{ is true} \\ 0 & A \text{ is false} \end{cases}$$


The EDF estimates the true underlying cumulative distribution function of the points in the sample; for independent, identically distributed observations, it converges to the true distribution as the sample size grows (the Glivenko-Cantelli theorem).
To obtain the probability density function (PDF), one needs to take the derivative of the CDF, but the EDF is a step function and differentiation is a noise-amplifying operation. As a result, the resulting PDF is very jagged and needs considerable smoothing for many areas of application.
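As a rough illustration of the EDF definition, the following Python sketch counts the fraction of observations that are less than or equal to a given point. The function name `edf` and the sample values are made up for this example, and NumPy is assumed to be available; it is not part of NumXL.

```python
import numpy as np

def edf(sample, x):
    """Empirical distribution function: fraction of observations <= x."""
    sample = np.sort(np.asarray(sample, dtype=float))
    # searchsorted with side="right" counts the observations that are <= x
    counts = np.searchsorted(sample, x, side="right")
    return counts / sample.size

data = [0.3, -1.2, 0.8, 0.3, 2.1]    # made-up illustrative values
print(edf(data, 0.3))                # 3 of the 5 observations are <= 0.3
print(edf(data, [-2.0, 0.0, 5.0]))   # below the minimum, in between, above the maximum
```

Each observation contributes a jump of 1/N, so tied values (0.3 appears twice here) produce a jump of 2/N at that point.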


2. Histogram
The (frequency) histogram is probably the most familiar and intuitive distribution function which fairly approximates the PDF.
In statistics, a histogram is a graphical representation showing a visual impression of the distribution of data. Histograms are used to plot the density of data, and often for density estimation, i.e. estimating the probability density function of the underlying variable.
In mathematical terms, a histogram is a function $m_i$ that counts the number of observations whose values fall into each of the disjoint intervals (aka bins).

$$N = \sum_{i=1}^{k} m_i$$

Where:
$N$ is the total number of observations in the sample data
$k$ is the number of bins
$m_i$ is the histogram value for the i-th bin
And a cumulative histogram is defined as follows:

$$M_j = \sum_{i=1}^{j} m_i, \qquad 1 \le j \le k$$


The frequency function ($f_i$) (aka relative histogram) is computed simply by dividing the histogram value by the total number of observations:

$$f_i = \frac{m_i}{N}$$
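The three quantities defined above (bin counts, cumulative histogram, and relative frequency) can be sketched in a few lines of Python. The data values, the number of bins, and the bin range below are made up for illustration, and NumPy is assumed to be available.

```python
import numpy as np

data = np.array([0.1, 0.4, 0.5, 1.2, 1.9, 2.3, 2.4, 2.8, 3.6, 3.9])  # made-up values
k = 4                                     # number of bins (chosen by hand here)

m, edges = np.histogram(data, bins=k, range=(0.0, 4.0))  # m_i: count per bin
M = np.cumsum(m)                          # M_j: cumulative histogram
f = m / data.size                         # f_i: relative frequency m_i / N

assert m.sum() == data.size               # N equals the sum of the bin counts
print(m, M, f)
```

Note that `np.histogram` treats every bin as half-open on the right except the last one, which also includes its right edge.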
One of the major drawbacks of the histogram is that its construction requires an arbitrary choice of bar width (or number of bins) and bar positions. Unless one has access to a very large amount of data, the shape of the distribution function varies heavily as the bar width (or number of bins) and positions are altered.
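To make the sensitivity to bar position concrete, here is a small Python sketch that bins the same sample twice with the same bar width but a different starting point. The helper `counts_with_origin` and the data values are hypothetical and chosen only to make the effect visible; NumPy is assumed.

```python
import numpy as np

data = np.array([0.9, 1.0, 1.1, 1.9, 2.0, 2.1, 2.9, 3.0, 3.1])  # made-up values
width = 1.0

def counts_with_origin(sample, origin, width, n_bins):
    """Histogram counts for equal-width bins that start at `origin`."""
    edges = origin + width * np.arange(n_bins + 1)
    counts, _ = np.histogram(sample, bins=edges)
    return counts

print(counts_with_origin(data, 0.5, width, 3))  # bins centred on the three clusters
print(counts_with_origin(data, 0.0, width, 4))  # same width, origin shifted by half a bin
```

With the first origin the sample looks flat across three bins; with the second, the same nine observations spread unevenly over four bins.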
Furthermore, for a large sample size, the outliers are difficult or perhaps impossible to see in the histogram, except when they cause the x-axis to expand.
Having said that, there are a few methods for inferring the number of histogram bins, but care must be taken to understand the assumptions made behind their formulation.


Sturges' formula
The Sturges method assumes the sample data follow an approximately normal distribution (i.e. bell shape).

$$k = \lceil \log_2 N + 1 \rceil$$

Where:
$\lceil \cdot \rceil$ is the ceiling operator
Square-root formula
This method is used by Excel and other statistical packages. It does not assume any shape of the distribution:

$$k = \sqrt{N}$$
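Both of these rules depend only on the sample size, so they are easy to sketch in Python. The function names are made up for this example, and rounding the square root up to a whole number of bins is an assumption; the text does not say how a fractional result is handled.

```python
import math

def sturges_bins(n):
    """Sturges' formula: k = ceil(log2(n) + 1)."""
    return math.ceil(math.log2(n) + 1)

def sqrt_bins(n):
    """Square-root rule: k = sqrt(n), rounded up here (rounding is an assumption)."""
    return math.ceil(math.sqrt(n))

n = 498  # sample size of the EUR/USD returns discussed in this issue
print(sturges_bins(n), sqrt_bins(n))
```

Because Sturges' count grows with the logarithm of N while the square-root count grows much faster, the two rules drift further apart as the sample gets larger.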
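Scott's (normal reference) choice
Scott's choice is optimal for a random sample from a normal distribution. It gives the bin size rather than the number of bins:

$$h = \frac{3.5\,\sigma}{\sqrt[3]{N}}$$

Where:
$h$ is the bin size, from which the number of bins follows
$\sigma$ is the estimated sample standard deviation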
Freedman-Diaconis's choice

$$h = \frac{2\,\mathrm{IQR}}{\sqrt[3]{N}}$$

Where:
$h$ is the bin size
$\mathrm{IQR}$ is the interquartile range
And

$$k = \left\lceil \frac{x_{\max} - x_{\min}}{h} \right\rceil$$
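The two width-based rules can be sketched as follows. This is a minimal illustration, not the NumXL implementation: the function names are hypothetical, the sample is synthetic (drawn from a standard normal, not the EUR/USD series), and the quartiles use NumPy's default linear interpolation, which is only one of several conventions.

```python
import math
import numpy as np

def scott_width(sample):
    """Scott's normal-reference bin width: h = 3.5 * sigma / N^(1/3)."""
    sample = np.asarray(sample, dtype=float)
    sigma = sample.std(ddof=1)            # sample standard deviation
    return 3.5 * sigma / sample.size ** (1.0 / 3.0)

def freedman_diaconis_width(sample):
    """Freedman-Diaconis bin width: h = 2 * IQR / N^(1/3)."""
    sample = np.asarray(sample, dtype=float)
    q75, q25 = np.percentile(sample, [75, 25])
    return 2.0 * (q75 - q25) / sample.size ** (1.0 / 3.0)

def bins_from_width(sample, h):
    """Number of bins: k = ceil((x_max - x_min) / h)."""
    sample = np.asarray(sample, dtype=float)
    return math.ceil((sample.max() - sample.min()) / h)

rng = np.random.default_rng(0)            # synthetic data, not the EUR/USD series
sample = rng.normal(loc=0.0, scale=1.0, size=498)
for rule in (scott_width, freedman_diaconis_width):
    h = rule(sample)
    print(rule.__name__, round(h, 4), bins_from_width(sample, h))
```

Because the IQR ignores the tails, the Freedman-Diaconis width is less affected by outliers than Scott's, which relies on the standard deviation.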

Decision based on minimization of the risk function ($L^2$)

$$\min_h L^2 = \min_h \left( \frac{2\bar{m} - v}{h^2} \right)$$


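Where:
$\bar{m}$ is the mean bin count and $v$ is the (biased) variance of the bin counts:

$$\bar{m} = \frac{1}{k}\sum_{i=1}^{k} m_i = \frac{N}{k}$$

$$v = \frac{1}{k}\sum_{i=1}^{k} (m_i - \bar{m})^2 = \frac{1}{k}\sum_{i=1}^{k} m_i^2 - \bar{m}^2 = \frac{1}{k}\sum_{i=1}^{k} m_i^2 - \frac{N^2}{k^2}$$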
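A minimal Python sketch of this selection procedure is given below. It searches over candidate bin counts k rather than over h directly, taking h = (x_max - x_min) / k for each candidate; that parametrisation, the function names, the candidate range, and the synthetic sample are all assumptions made for illustration.

```python
import numpy as np

def l2_cost(sample, k):
    """Cost (2*m_bar - v) / h^2 for a histogram with k equal-width bins."""
    sample = np.asarray(sample, dtype=float)
    h = (sample.max() - sample.min()) / k   # bin width implied by k
    m, _ = np.histogram(sample, bins=k)     # m_i: counts per bin
    m_bar = m.mean()                        # equals N / k
    v = np.mean((m - m_bar) ** 2)           # biased variance (divides by k)
    return (2.0 * m_bar - v) / h ** 2

def best_bins(sample, k_candidates):
    """Return the candidate k with the smallest cost."""
    costs = [l2_cost(sample, k) for k in k_candidates]
    return k_candidates[int(np.argmin(costs))]

rng = np.random.default_rng(0)              # synthetic data for illustration only
sample = rng.normal(size=498)
print(best_bins(sample, list(range(2, 51))))
```

Unlike the closed-form rules, this approach makes no assumption about the shape of the distribution, at the price of evaluating the histogram once per candidate.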



3. Kernel Density Estimate (KDE)
An alternative to the histogram is kernel density estimation (KDE), which uses a kernel to smooth the samples. This constructs a smooth probability density function, which will in general reflect the underlying variable more accurately. We mention the KDE for the sake of completeness, but we will postpone its discussion to a later issue.

EUR/USD Application:
Let's consider the daily log returns of the EUR/USD exchange rate sample data. In our earlier analysis (ref: NumXL Tips and Hints, "Price this"), the data was shown to be Gaussian white noise.
The EDF function for those returns (n=498) is shown below:


For a histogram, we calculated the number of bins using the four methods:

Next, we plot the relative histogram using those different numbers of bins. We overlay the normal probability density function (red curve) for comparison.

Although we have a relatively large data set (n=498), and the EDF and statistical tests indicate Gaussian-distributed data, the choice of bin size can distort the density function.
Scott's choice (15 bins) describes the density function best, followed by Sturges' formula.
Conclusion
In this issue, we attempted to derive an approximation of the underlying probability density using a sample data histogram and the empirical (cumulative) distribution function.


Although the data sample is relatively large (n=498), the histogram is still a fairly crude approximation and very sensitive to the number of bins used.
Using rules of thumb (e.g. Sturges' rule, Scott's choice, etc.) can help in finding a better number of bins, but they make their own assumptions about the shape of the distribution, and an experienced (manual) examination (or eyeballing) is needed to ensure proper histogram generation.
