Given a data set, find a threshold value for each of several statistical metrics that can be used to tag
outliers in the data set.
Statistical metrics:
Median
Mean
Standard Deviation
Skewness
Kurtosis
MEDIAN:
So any point above Q3 + 2K or below Q1 - 2K is tagged as an outlier.
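The quartile rule above can be sketched as follows. This is a minimal illustration, not the author's implementation: the function name `tag_outliers_by_quartiles` is hypothetical, and because the text does not define K, it is passed in as a parameter. The usage lines set K to the interquartile range purely as an assumption.

```python
import numpy as np

def tag_outliers_by_quartiles(values, k):
    """Return a boolean mask: True where a point lies above Q3 + 2k or below Q1 - 2k.

    k is an input because the source text does not say how K is obtained.
    """
    x = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    lower, upper = q1 - 2 * k, q3 + 2 * k
    return (x < lower) | (x > upper)

# Usage with made-up numbers; K = Q3 - Q1 is an assumption, not taken from the text.
data = [10, 11, 12, 13, 14, 15, 16, 100]
q1, q3 = np.percentile(data, [25, 75])
mask = tag_outliers_by_quartiles(data, k=q3 - q1)
outliers = np.asarray(data)[mask]
```

With these made-up values only the point 100 falls outside the fences.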
A similar method can be applied to the mean. There is a problem, however: the data itself contains
outliers, which affect the statistical metrics, so the result of the above method may be misleading.
If we can obtain a metric that is closer to the clean-data metric than the overall-data metric is,
then we can use the above method.
1. Calculate the mean of the given data set, then calculate the deviation of each point from that
   mean.
2. Sort the data in increasing order of deviation from the mean.
3. Start with a set that contains only the one element closest to the mean. Add the remaining data
   points one by one, in increasing order of their deviation from the mean, and calculate the mean
   of the set after every addition.
4. Plot the means of all the sets created by adding the points one by one.
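The steps above can be sketched in a few lines. This is an illustrative sketch only: the function name `running_set_means` and the sample values are made up, and the plotting step is left out so that the block depends on NumPy alone.

```python
import numpy as np

def running_set_means(values):
    """Sort points by |x - overall mean| and return the mean of each growing set.

    means[i - 1] is the mean of the set holding the i points closest to the mean.
    """
    x = np.asarray(values, dtype=float)
    deviation = np.abs(x - x.mean())
    order = np.argsort(deviation, kind="stable")   # closest to the mean first
    sorted_x = x[order]
    counts = np.arange(1, len(sorted_x) + 1)
    means = np.cumsum(sorted_x) / counts           # mean after each addition
    return sorted_x, means

# Made-up data: five similar values and one extreme value.
sorted_x, means = running_set_means([60, 61, 62, 63, 64, 200])
```

Plotting `means` against the set size would give a curve of the kind described in the text, with the extreme point entering last.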
[Figure: two plots of the mean of each set (Series1) as points are added; x-axis 0 to 300.]
Towards the end, the graph shows a sudden rise or fall, because the last few sets contain the data
points with a high deviation from the mean.
Taking the average of all these means gives a value closer to the clean-data mean.
S1, S2, S3, S4, ..., Si, ..., Sc, ..., Sn
Si is the set created by adding the i-th element of the sorted data (sorted by deviation from the
mean) to the set Si-1.
Assume Sc is the clean data and Sn is the whole given data set.
If we take the average over all the batches (S1 to Sn), the effect of the outliers on the resulting
mean is reduced, because the many batches from S1 to Sc do not contain any outliers.
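As a rough illustration of this averaging idea, the sketch below computes the mean of every batch S1 to Sn and then averages those means. The function name `averaged_batch_mean` and the sample values are hypothetical; the data is made up so that the clean part (60 to 64) has mean 62.

```python
import numpy as np

def averaged_batch_mean(values):
    """Average the means of the batches S1..Sn built in order of deviation from the mean."""
    x = np.asarray(values, dtype=float)
    order = np.argsort(np.abs(x - x.mean()), kind="stable")
    batch_means = np.cumsum(x[order]) / np.arange(1, len(x) + 1)
    return batch_means.mean()

data = [60, 61, 62, 63, 64, 200]          # made-up data with one extreme point
overall_mean = float(np.mean(data))       # pulled upwards by the extreme point
estimate = averaged_batch_mean(data)      # average over all batch means
```

On this toy input the averaged value lies much closer to the clean mean than the overall mean does, although it is still shifted somewhat, since the last batch contains the extreme point.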
[Figure: plot of Diff (Series1); x-axis 0 to 250.]
For clean data, Diff changes sharply when the set contains only a few points.
To account for this, define p as the fraction of the data points that are present in a set Si.
Assuming a linear relation between batch size and Diff, multiply Diff by p:
Diff_p = Diff * p
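A minimal sketch of the weighting follows. It assumes that Diff for set Si is the absolute change in the mean between consecutive sets, |M(Si) - M(Si-1)|, and that p = i / n, where n is the total number of points; the function name `diff_and_diff_p` is made up for illustration.

```python
import numpy as np

def diff_and_diff_p(values):
    """Return Diff and Diff_p for the sets S2..Sn (S1 has no predecessor)."""
    x = np.asarray(values, dtype=float)
    n = len(x)
    order = np.argsort(np.abs(x - x.mean()), kind="stable")
    means = np.cumsum(x[order]) / np.arange(1, n + 1)
    diff = np.abs(np.diff(means))        # diff[j] compares S(j+2) with S(j+1)
    p = np.arange(2, n + 1) / n          # fraction of all points present in S(j+2)
    return diff, diff * p

diff, diff_p = diff_and_diff_p([60, 61, 62, 63, 64, 200])
```

The weighting damps the early, small-set values of Diff while leaving the value for the full set unchanged, because p reaches 1 there.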
METHOD 1:
We search for the minimum value of i that satisfies the condition
|M(Si) - M(Si-1)| > max{g(i)}
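The search can be sketched as below. Since g(i) is not defined in the text, the right-hand side of the condition is replaced here by a plain `threshold` argument; that substitution, the function name `first_jump_index`, and the sample values are all assumptions made for illustration.

```python
import numpy as np

def first_jump_index(values, threshold):
    """Find the smallest i (1-based set size) with |M(Si) - M(Si-1)| > threshold.

    Returns (i, sorted_values); i is None when no jump exceeds the threshold.
    """
    x = np.asarray(values, dtype=float)
    order = np.argsort(np.abs(x - x.mean()), kind="stable")
    sorted_x = x[order]
    means = np.cumsum(sorted_x) / np.arange(1, len(x) + 1)
    diff = np.abs(np.diff(means))
    hits = np.flatnonzero(diff > threshold)
    if hits.size == 0:
        return None, sorted_x
    return int(hits[0]) + 2, sorted_x    # diff[j] belongs to set S(j+2)

i, sorted_x = first_jump_index([60, 61, 62, 63, 64, 200], threshold=1.0)
suspects = sorted_x[i - 1:] if i is not None else sorted_x[:0]
```

Under this reading, the points from position i onwards in the sorted order are the candidates for outliers.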
[Figure: plot of diff (Series1) against the number of points; x-axis 0 to 10000.]
SKEWNESS:
Build all the batches S1 to Sn in the same way as before, then calculate the skewness of each
batch.
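A small sketch of the per-batch skewness is given below. It uses the moment-based definition m3 / m2^1.5 computed with NumPy; the text does not say which skewness estimator was used, so that choice is an assumption, as are the function name `batch_skewness` and the `min_size` parameter, which skips batches too small to have a meaningful skewness.

```python
import numpy as np

def batch_skewness(values, min_size=3):
    """Skewness of each batch S(min_size)..Sn, built in order of deviation from the mean."""
    x = np.asarray(values, dtype=float)
    order = np.argsort(np.abs(x - x.mean()), kind="stable")
    sorted_x = x[order]
    skews = []
    for i in range(min_size, len(sorted_x) + 1):
        centered = sorted_x[:i] - sorted_x[:i].mean()
        m2 = np.mean(centered ** 2)      # second central moment
        m3 = np.mean(centered ** 3)      # third central moment
        skews.append(m3 / m2 ** 1.5 if m2 > 0 else 0.0)
    return np.array(skews)

skews = batch_skewness([60, 61, 62, 63, 64, 200])
```

On this made-up input the skewness stays at zero for the symmetric clean batches and jumps once the extreme point is added. The same loop with the fourth central moment would give a per-batch kurtosis.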
[Figure: three plots of Skewness, Diff and Diff_p (Series1) against the data points added to set S; x-axis 0 to 250.]
KURTOSIS:
[Figure: three plots of Kurtosis, Diff and Diff_p (Series1); x-axis 0 to 250.]
Demonstration of the validation of the statistical summaries, using some clean groups and some
suspicious groups
[Figure: plot of Mean (Series1); x-axis 0 to 250.]
[Figure: plot of diff (Series1), x-axis 0 to 250, with regions labelled Clean points, Outlier 1, Outlier 2 and Outlier 3.]
Here we are able to identify the clean points and several other suspicious groups.
The method based on the mean is able to identify all the outliers.
[Figure: plots of skew (with a region labelled Outliers), diff and Diff_p (Series1); x-axis 0 to 250.]
Here we are able to identify the clean points, but not the separate groups of outliers.