
PROBLEM STATEMENT-

For a given data set, find a threshold value for each of several statistical metrics that can be used to
tag outliers in the data set.

Statistical metrics

Median
Mean
Standard Deviation
Skewness
Kurtosis

MEDIAN:

First calculate the median.
Then calculate Q1 (first quartile), which is the median of all the points that lie below the median.
Then calculate Q3 (third quartile), which is the median of all the points that lie above the median.
K = Q3 - Q1; this range covers the middle 50% of the data.
25% of the data lies above Q3 and another 25% lies below Q1.
For clean data, a range of 2K below Q1 and 2K above Q3 is expected to cover the 25% of the data on each side.

So any point above Q3 + 2K or below Q1 - 2K is tagged as an outlier.
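
The quartile rule above can be sketched as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: the names `quartile_fences` and `tag_outliers`, the parameter `k_mult`, and the sample values are hypothetical, and the quartiles are taken as the medians of the points strictly below and strictly above the median, as the text describes.

```python
from statistics import median

def quartile_fences(values, k_mult=2.0):
    """Return (lower, upper) = (Q1 - k_mult*K, Q3 + k_mult*K) with K = Q3 - Q1."""
    data = sorted(values)
    med = median(data)
    # Q1 and Q3 are the medians of the points below and above the median.
    # Note: median() raises StatisticsError if either half is empty
    # (for example when all values are equal).
    q1 = median([x for x in data if x < med])
    q3 = median([x for x in data if x > med])
    k = q3 - q1
    return q1 - k_mult * k, q3 + k_mult * k

def tag_outliers(values, k_mult=2.0):
    """Return the points that fall outside the two fences."""
    lower, upper = quartile_fences(values, k_mult)
    return [x for x in values if x < lower or x > upper]

# Made-up sample: median 14.5, Q1 = 12, Q3 = 17, K = 5, fences at 2 and 27.
sample = [10, 11, 12, 13, 14, 15, 16, 17, 18, 100]
print(tag_outliers(sample))  # -> [100]
```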

A similar method can be applied using the mean. But there is a problem: the data itself contains outliers, which
can distort the statistical metrics, so the result of the above method might be misleading.

If we can obtain a metric that is closer to the metric of the clean data, rather than using the metric of the
overall data, then we can use the above method.

METHOD TO GET MEAN WHICH IS CLOSER TO CLEAN DATA MEAN-

Calculate the mean of the given data set, then calculate the deviation of each point from that mean.
Sort the data in increasing order of absolute deviation from the mean.
Start with a set that contains only the one element closest to the mean. Add the data points one by one
in increasing order of their deviation from the mean, and calculate the mean of the set after every
addition.
Then plot the means of all the sets created by adding the points one by one.
[Figure: two example plots of the mean of each set against the number of points in the set (x-axis 0 to 300); series Series1.]

Towards the end, the graph shows a sudden rise or fall, because the last few sets contain the data points with
high deviation from the mean.
Taking the average of all these means gives a value closer to the clean data mean.

S1, S2, S3, S4, ..., Si, ..., Sc, ..., Sn

Si is the set created by adding the ith element of the sorted data (sorted by deviation from the mean) to the
set S(i-1).
Assume Sc is the clean data and Sn is the whole data set given.
If we take the average of the means of all the batches (S1 to Sn), the effect of the outliers on the obtained
mean is reduced, because the many batches from S1 to Sc do not contain any outliers.

| Mean(Sn) - Mean(Sc) | > | Average(Mean(S1), Mean(S2), ..., Mean(Sn)) - Mean(Sc) |
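
A small sketch of this averaging step, assuming the sets are built exactly as described (points added in increasing order of absolute deviation from the overall mean). The function names `nested_set_means` and `averaged_set_mean` and the sample numbers are hypothetical, chosen only for illustration.

```python
from statistics import fmean

def nested_set_means(values):
    """Means of S1..Sn, where Si holds the i points closest to the overall mean."""
    overall = fmean(values)
    ordered = sorted(values, key=lambda x: abs(x - overall))
    means, running_sum = [], 0.0
    for i, x in enumerate(ordered, start=1):
        running_sum += x
        means.append(running_sum / i)  # Mean(Si)
    return means

def averaged_set_mean(values):
    """Average of Mean(S1), ..., Mean(Sn)."""
    return fmean(nested_set_means(values))

# Made-up sample: eight clean points with mean 10, plus one outlier.
clean = [9, 10, 11, 10, 10, 9, 11, 10]
data = clean + [50]
print("overall mean :", fmean(data))
print("averaged mean:", averaged_set_mean(data))
print("clean mean   :", fmean(clean))
```

On this sample the averaged value lands closer to the clean mean than the overall mean does. That is an observation about this one example, not a guarantee for every data set.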


GENERAL METHOD
M: any statistical summary
Calculate M for every set Si, from S1 to Sn.
Now calculate the change diff = |M(Si) - M(S(i-1))|. The higher the value, the higher the probability
that the ith point is an outlier.
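
As a rough sketch, the diff series can be computed for any summary M passed in as a function. The names `order_by_deviation` and `diff_series` and the `start` parameter (the size of the first set, useful for summaries that need several points) are assumptions made for this example. The summary is recomputed for every set, which is simple but quadratic in the number of points.

```python
from statistics import fmean, median

def order_by_deviation(values):
    """Sort points by absolute deviation from the overall mean."""
    center = fmean(values)
    return sorted(values, key=lambda x: abs(x - center))

def diff_series(values, metric=fmean, start=1):
    """Return |M(Si) - M(S(i-1))| for i = start+1 .. n."""
    ordered = order_by_deviation(values)
    stats = [metric(ordered[:i]) for i in range(start, len(ordered) + 1)]
    # Element j belongs to the point that turned S(start+j) into S(start+j+1).
    return [abs(b - a) for a, b in zip(stats, stats[1:])]

# Made-up sample: four identical points and one far-away point.
sample = [10, 10, 10, 10, 30]
print(diff_series(sample))                 # -> [0.0, 0.0, 0.0, 4.0]
print(diff_series(sample, metric=median))  # the median does not move here
```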

[Figure: Diff against the number of points in the set (x-axis 0 to 250); series Series1.]

Even for clean data, diff shows large changes when there are only a small number of points
in the set.
To account for that, define p as the fraction of the data points present in a set Si.
Multiply diff by p, assuming a linear relation between batch size and diff values.

Diff_p = Diff * p

If Diff_p(i) > max(Diff_p over the starting stages), then the ith point is an outlier.


M(Si) = f(i), where i is the number of data points present in set Si.
Diff_p(Si) = g(i), g(i) = |p*f(i) - p*f(i-1)|
Threshold: th = max{ g(i) }, where i is in (0, 0.5d) (assuming at least 50% of the data is good),
and d is the total number of data points.
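
The weighting and the threshold can be sketched as below, taking a list of diff values as input. The names `diff_p_series` and `threshold`, the parameters `first_size` (the set size i that the first diff value belongs to) and `clean_fraction`, and the sample numbers are hypothetical. The interval (0, 0.5d) is read here as i < 0.5 * d.

```python
def diff_p_series(diffs, total_points, first_size=2):
    """g(i) = p * diff(i), where p = i / d is the fraction of points in Si."""
    return [(i / total_points) * diff
            for i, diff in enumerate(diffs, start=first_size)]

def threshold(g, total_points, first_size=2, clean_fraction=0.5):
    """th = max g(i) over the sets with i < clean_fraction * d."""
    limit = clean_fraction * total_points
    early = [v for i, v in enumerate(g, start=first_size) if i < limit]
    return max(early)  # raises ValueError if no set falls below the limit

# Made-up diff values for d = 10 points (set sizes i = 2 .. 10).
diffs = [1.0, 0.5, 0.3, 0.2, 0.1, 0.1, 0.1, 0.1, 2.0]
g = diff_p_series(diffs, total_points=10)
th = threshold(g, total_points=10)
print(g)
print("threshold:", th)
```

In this sample the large early diff values are scaled down by p, while the jump caused by the last point keeps its full size.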

METHOD 1:

|M(Si) - M(S(i-1))| > max{g(i)} : outlier
|M(Si) - M(S(i-1))| < max{g(i)} : good point

METHOD 2:

Search for the minimum value of i for which this condition is satisfied:

|M(Si) - M(S(i-1))| > max{g(i)}

Points i to d (d = number of points): outliers
Points 0 to i: good points
This works because the data is arranged in increasing order of absolute deviation from the mean.
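
The difference between the two methods can be sketched as follows. One assumption needs flagging: the text writes the comparison both with the weighted change Diff_p and with the raw change |M(Si) - M(S(i-1))|. This sketch compares the weighted values g(i) with the threshold th. The names `method1` and `method2`, the `first_size` parameter, and the sample values are hypothetical.

```python
def method1(g, th, first_size=2):
    """Check every point on its own: return each i with g(i) > th."""
    return [i for i, v in enumerate(g, start=first_size) if v > th]

def method2(g, th, first_size=2):
    """Return the smallest i with g(i) > th, or None if there is none.

    Every point from that i up to d is then treated as an outlier.
    """
    for i, v in enumerate(g, start=first_size):
        if v > th:
            return i
    return None

# Made-up weighted changes for set sizes i = 2 .. 10, with th = 0.2.
g = [0.2, 0.15, 0.12, 0.1, 0.06, 0.07, 0.5, 0.09, 2.0]
print(method1(g, th=0.2))  # -> [8, 10]
print(method2(g, th=0.2))  # -> 8, so points 8, 9 and 10 are outliers
```

Here point 9 is a good point under Method 1 but an outlier under Method 2, because it deviates from the mean more than point 8 does.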

Demo using Mean


[Figure: Mean(Si) against number of points.]
[Figure: diff against number of points (x-axis 0 to 10000); series Series1.]

SKEWNESS-

Build all the batches S1 to Sn in the same way as before, and calculate the skewness of each
batch.
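
A minimal sketch of the per-batch skewness. The text does not say which skewness formula is used, so the population form m3 / m2^1.5 is assumed here. The names `skewness` and `batch_skewness`, the `start` parameter, and the sample values are hypothetical.

```python
from statistics import fmean

def skewness(xs):
    """Population skewness m3 / m2**1.5 (assumed formula); 0.0 if variance is 0."""
    m = fmean(xs)
    m2 = fmean([(x - m) ** 2 for x in xs])
    m3 = fmean([(x - m) ** 3 for x in xs])
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0

def batch_skewness(values, start=3):
    """Skewness of S(start)..Sn, with points added by deviation from the mean."""
    center = fmean(values)
    ordered = sorted(values, key=lambda x: abs(x - center))
    return [skewness(ordered[:i]) for i in range(start, len(ordered) + 1)]

# Made-up sample: seven points spread evenly around 10, plus one outlier.
sample = [9, 10, 11, 10, 9, 11, 10, 40]
series = batch_skewness(sample)
print(series)
```

In this sample the skewness stays near zero until the last batch, where the outlier joins the set and the value jumps.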
[Figure: Skewness against data point added to set S (x-axis 0 to 250); series Series1.]
[Figure: Diff of skewness (x-axis 0 to 250); series Series1.]
[Figure: Diff_p of skewness (x-axis 0 to 250); series Series1.]

KURTOSIS-

[Figure: Kurtosis (x-axis 0 to 250); series Series1.]
[Figure: Diff of kurtosis (x-axis 0 to 250); series Series1.]
[Figure: Diff_p of kurtosis (x-axis 0 to 250); series Series1.]

Demo of validation for statistical summaries with some clean groups and some suspicious
groups

Clean data   Outlier 1    Outlier 2    Outlier 3
63.00339     96.01125     -5           195.2617
62.9021      101.1552     -9           198.2156
63.01951     105.1542     -10          200.1544
63.28277     104.632      -15          201.7464
63.32419     92.846                    203.154
63.33604     110.2369                  206.226
63.43936     107.952
62.47158
62.42757
62.27351
63.65194
...

[Figure: Mean (x-axis 0 to 250); series Series1.]
[Figure: diff (x-axis 0 to 250), with regions labelled Clean points, Outlier 1, Outlier 2 and Outlier 3.]

Here we are able to identify the clean points and the other suspicious groups.
This method, using the mean, is able to identify all the outliers.

Now using Skewness

[Figure: skew (x-axis 0 to 250), with the outliers labelled; series Series1.]
[Figure: diff of skew (x-axis 0 to 250); series Series1.]
[Figure: Diff_p of skew.]

Here we are able to identify the clean points, but not the different groups of outliers.