
Exploratory Data Analysis

Dr. Rohit Vishal Kumar


Reader
Department of Marketing
Xavier Institute of Social Service
P.O. Box No: 7, Purulia Road
Ranchi – 834 001 (Jharkhand)
Phone: (91-651) 2200-873 Ext. 308
Email: rohitvishalkumar@yahoo.com
Exploratory Data Analysis
• EDA is an approach or a philosophy which
uses a variety of techniques to:
– Maximise insight into a data set
– Uncover underlying structure of the data set
– Extract important variables
– Detect outliers and anomalies
– Test underlying assumptions
– Develop models
– Determine optimal factor settings
• It should NOT be confused with statistical
graphics
Graphical Examination
• Examining single variable
– Histogram
– Stem and Leaf Diagram
• Examining two variables at a time
– Scatter Plot Matrix
– Box Plot
• Examining multiple variables at a time
– Chernoff Faces
– Box Plots
The Dataset to be used
• We shall use the following dataset (35 cases per
variable) to illustrate exploratory data analysis
• International Marketing (IM) Marks
35 35 33 32 33 22 28 32 30 30 33 31 34 29 31 22 30
30 29 37 22 31 25 27 32 36 35 31 31 27 27 30 34 32 32
Mean: 30.51 Standard Deviation: 3.776
• Marketing Research (MR) Marks
19 34 27 18 28 20 31 36 32 33 23 34 34 24 38 26 22
31 29 36 30 28 25 29 44 38 39 22 27 17 24 32 40 36 27
Mean: 29.51 Standard Deviation: 6.714
• Sales Management (SM) Marks
31 36 33 16 27 22 40 31 27 23 11 39 44 25 48 12 29
25 22 46 39 39 10 46 32 12 36 26 42 27 18 26 14 13 13
Mean: 28.00 Standard Deviation: 11.293
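The summary figures quoted with the dataset can be recomputed with a short script. This is a minimal sketch using only the Python standard library; the helper name `describe` is an illustrative choice, not something from the slides. It assumes the standard deviation is the sample (n - 1) form, which is the form that reproduces the 3.776 quoted for IM.

```python
from statistics import mean, stdev

# Marks copied from the dataset slide (35 cases per subject).
IM = [35, 35, 33, 32, 33, 22, 28, 32, 30, 30, 33, 31, 34, 29, 31, 22, 30,
      30, 29, 37, 22, 31, 25, 27, 32, 36, 35, 31, 31, 27, 27, 30, 34, 32, 32]
MR = [19, 34, 27, 18, 28, 20, 31, 36, 32, 33, 23, 34, 34, 24, 38, 26, 22,
      31, 29, 36, 30, 28, 25, 29, 44, 38, 39, 22, 27, 17, 24, 32, 40, 36, 27]
SM = [31, 36, 33, 16, 27, 22, 40, 31, 27, 23, 11, 39, 44, 25, 48, 12, 29,
      25, 22, 46, 39, 39, 10, 46, 32, 12, 36, 26, 42, 27, 18, 26, 14, 13, 13]


def describe(marks):
    """Return (mean, sample standard deviation) for a list of marks."""
    # stdev() divides by n - 1; this is an assumption about how the
    # slide's figures were produced.
    return mean(marks), stdev(marks)


for name, marks in (("IM", IM), ("MR", MR), ("SM", SM)):
    m, s = describe(marks)
    print(f"{name}: n={len(marks)}  mean={m:.2f}  sd={s:.3f}")
```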
The Histogram
[Figure: Histogram of MR Variable (EDA.STA, 4v*35c). X-axis: MR in bins <= 10, (10,20], (20,30], (30,40], > 40. Y-axis: No of obs.]

ADVANTAGES
1. Provides powerful insight into how the data are distributed
2. A normal curve can be superimposed to see how closely the data follow a
normal distribution.
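The bar heights of such a histogram are simply counts per bin. The sketch below tallies the MR marks into the same right-closed bins used on the plot axis (<= 10, (10,20], (20,30], (30,40], > 40) and prints a rough text histogram. The function name `bin_counts` and the text rendering are illustrative assumptions, not part of the original slides.

```python
# MR marks copied from the dataset slide.
MR = [19, 34, 27, 18, 28, 20, 31, 36, 32, 33, 23, 34, 34, 24, 38, 26, 22,
      31, 29, 36, 30, 28, 25, 29, 44, 38, 39, 22, 27, 17, 24, 32, 40, 36, 27]


def bin_counts(marks, edges=(10, 20, 30, 40)):
    """Count marks in right-closed bins: <= 10, (10,20], ..., > 40."""
    labels = ([f"<= {edges[0]}"]
              + [f"({lo},{hi}]" for lo, hi in zip(edges, edges[1:])]
              + [f"> {edges[-1]}"])
    counts = [0] * len(labels)
    for x in marks:
        # The number of edges strictly below x is the bin index, so a
        # value equal to an edge falls in the lower (right-closed) bin.
        counts[sum(x > e for e in edges)] += 1
    return dict(zip(labels, counts))


for label, count in bin_counts(MR).items():
    print(f"{label:>8} | {'#' * count} {count}")
```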
The Histogram
[Figure: Histogram of IM Variable (EDA.STA, 5v*35c). X-axis: IM in bins <= 10, (10,20], (20,30], (30,40], > 40. Y-axis: No of obs.]
[Figure: Histogram of MR Variable (EDA.STA, 5v*35c). X-axis: MR, same bins. Y-axis: No of obs.]
[Figure: Histogram of SM Variable (EDA.STA, 5v*35c). X-axis: SM, same bins. Y-axis: No of obs.]
Stem & Leaf Plot
• Breaks the data into “STEM” and “LEAVES”
– STEM: is the root value of a data point
– LEAF: is the residual value of a data point
– Consider the following: 37, 32 and 39
– Stem is “3” and leaves are 7, 2 and 9
• It gives the histogram (or a close
approximation) of the variable
• Each value in the dataset is represented and
is visible
• Allows visual identification of Mode
• However, the choice of stem may produce
different histograms.
Stem & Leaf Plot
• The plot for MR data looks as follows

Frequency   Stem   Leaf
 3.00       1 *    789
15.00       2 *    022344567778899
15.00       3 *    011223444666889
 2.00       4 *    04

Stem Width : 1.0
Each Leaf : 1 Case(s)
Valid Cases: 35.00
Missing Cases: 0.00
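The stem and leaf split described above can be sketched in a few lines. The code below treats the tens digit as the stem and the units digit as the leaf, which reproduces the MR display for this data. The function name `stem_and_leaf` and the `stem_unit` parameter are illustrative assumptions. Stems with no observations are simply omitted in this sketch.

```python
from collections import defaultdict

# MR marks copied from the dataset slide.
MR = [19, 34, 27, 18, 28, 20, 31, 36, 32, 33, 23, 34, 34, 24, 38, 26, 22,
      31, 29, 36, 30, 28, 25, 29, 44, 38, 39, 22, 27, 17, 24, 32, 40, 36, 27]


def stem_and_leaf(values, stem_unit=10):
    """Map each stem to its sorted leaves, joined into one string."""
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, stem_unit)  # e.g. 37 -> stem 3, leaf 7
        rows[stem].append(leaf)
    return {stem: "".join(str(leaf) for leaf in leaves)
            for stem, leaves in sorted(rows.items())}


for stem, leaves in stem_and_leaf(MR).items():
    print(f"{len(leaves):5.2f}  {stem} *  {leaves}")
```

Changing `stem_unit` (for single-digit leaves, to a smaller divisor such as 5) regroups the same values under different stems, which is why the choice of stem can change the shape of the display.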
Scatter Plot
• Used to examine the relationship of two
variables (bivariate) at a time
• Allows visual inspection of the correlations
between the variables
• Helps identify the presence of non-linear
relationships which may need to be controlled
for proper application of MVA
• Correlation Matrix for the data:
        IM     MR     SM
IM     1.00    -      -
MR     0.41   1.00    -
SM     0.15   0.26   1.00
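A correlation matrix of this kind can be computed directly from the marks. The sketch below assumes the Pearson product-moment coefficient is the measure intended (the slides do not name it) and that the marks are paired case by case in the order listed. The function name `pearson` is an illustrative choice. The printed lower triangle can be compared with the matrix above.

```python
from math import sqrt

# Marks copied from the dataset slide, assumed paired by position.
IM = [35, 35, 33, 32, 33, 22, 28, 32, 30, 30, 33, 31, 34, 29, 31, 22, 30,
      30, 29, 37, 22, 31, 25, 27, 32, 36, 35, 31, 31, 27, 27, 30, 34, 32, 32]
MR = [19, 34, 27, 18, 28, 20, 31, 36, 32, 33, 23, 34, 34, 24, 38, 26, 22,
      31, 29, 36, 30, 28, 25, 29, 44, 38, 39, 22, 27, 17, 24, 32, 40, 36, 27]
SM = [31, 36, 33, 16, 27, 22, 40, 31, 27, 23, 11, 39, 44, 25, 48, 12, 29,
      25, 22, 46, 39, 39, 10, 46, 32, 12, 36, 26, 42, 27, 18, 26, 14, 13, 13]


def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


variables = {"IM": IM, "MR": MR, "SM": SM}
names = list(variables)
for i, row in enumerate(names):
    # Print only the lower triangle, as on the slide.
    cells = [f"{pearson(variables[row], variables[col]):5.2f}"
             for col in names[: i + 1]]
    print(row, " ".join(cells))
```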
Scatter Plot
[Figure: Scatterplot of MR against IM (EDA.STA, 5v*35c). X-axis: IM. Y-axis: MR.]
Scatter Plot
[Figure: Matrix Plot of IM, MR and SM (EDA.STA, 4v*35c).]
Box & Whisker Plot
• Used to examine group differences between
two or more variables. Can also be used on a
single variable
• Provides insight into
– Spread & Skewness of the data
– Extreme values and outliers
• Construction:
– Calculate Q1, Median, Q3 and the Interquartile
Range (IQ = Q3 - Q1)
– Inner fences: L1 = Q1 - 1.5 × IQ, U1 = Q3 + 1.5 × IQ
– Outer fences: L2 = Q1 - 3.0 × IQ, U2 = Q3 + 3.0 × IQ
– Data beyond L2 or U2 are “Extreme Values”; data
between L1 and L2 or between U1 and U2 are “Outliers”
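The construction steps above can be sketched as follows. Quartile conventions differ between packages; this sketch assumes the "inclusive" interpolation method of Python's `statistics.quantiles`, which may not match the software used for the slides. The function names `fences` and `classify` and the small sample are hypothetical, chosen only to show one outlier and one extreme value.

```python
from statistics import quantiles


def fences(values):
    """Quartiles, interquartile range and the 1.5 / 3.0 IQ fences."""
    # method="inclusive" is an assumed quartile convention.
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iq = q3 - q1
    return {"Q1": q1, "Median": median, "Q3": q3, "IQ": iq,
            "L1": q1 - 1.5 * iq, "U1": q3 + 1.5 * iq,
            "L2": q1 - 3.0 * iq, "U2": q3 + 3.0 * iq}


def classify(values):
    """Return (outliers, extreme_values) following the fence rules."""
    f = fences(values)
    extremes = [v for v in values if v < f["L2"] or v > f["U2"]]
    outliers = [v for v in values
                if f["L2"] <= v < f["L1"] or f["U1"] < v <= f["U2"]]
    return outliers, extremes


# Hypothetical sample, not taken from the slides.
sample = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20, 30]
print(fences(sample))
print(classify(sample))
```

For this sample the quartiles are 3.5, 6 and 8.5, so IQ = 5, U1 = 16 and U2 = 23.5: the value 20 lies between the upper fences and 30 lies beyond the outer one.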
Box & Whisker Plot
[Figure: Box Whisker Plot from Selected Block, Cases 1 through 35, for IM, MR and SM. Legend: Max, Min, 75th %, 25th %, Median.]

Note: The above Box Plot is drawn on Median/Range Basis


Multivariate Examination
• Used when we want to examine more than
two variables together
• Three distinct styles of MV data presentation
– Metroglyphs: a picture is drawn to represent
the various variables, e.g. Chernoff Faces
– MV Profile: uses a histogram-like display, e.g.
Bar Plots
– MV Display: uses a plotted mathematical
transformation, e.g. Fourier Transformation
• Not very robust, especially when a large number of
variables need to be studied simultaneously
• Suffers from “perception” error
• Commonly referred to as “Icon Plots”
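As an illustration of the third style, one widely used Fourier-based display is the Andrews curve, in which each case x = (x1, x2, x3, ...) is drawn as the function f(t) = x1/√2 + x2 sin(t) + x3 cos(t) + x4 sin(2t) + ... for t between -π and π. Reading the "Fourier Transformation" bullet as Andrews curves is an assumption; the slides do not name the method. The function name `andrews_curve` is illustrative, and the example case uses the first IM, MR and SM marks from the dataset.

```python
from math import sin, cos, sqrt, pi


def andrews_curve(x, t):
    """Evaluate the Andrews curve of one case x at angle t (radians)."""
    total = x[0] / sqrt(2)
    for i, xi in enumerate(x[1:], start=1):
        k = (i + 1) // 2  # harmonic number: 1, 1, 2, 2, 3, 3, ...
        # Odd positions take sine terms, even positions cosine terms.
        total += xi * (sin(k * t) if i % 2 == 1 else cos(k * t))
    return total


# Case 1 of the dataset: (IM, MR, SM) = (35, 19, 31).
case_1 = (35, 19, 31)
steps = 8
for j in range(steps + 1):
    t = -pi + 2 * pi * j / steps
    print(f"t={t:6.3f}  f(t)={andrews_curve(case_1, t):8.3f}")
```

Plotting one such curve per case on common axes gives the display: cases with similar marks produce curves that stay close together.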
Chernoff Face
[Figure: Chernoff Faces for Case 1 to Case 35 (Data: EDA.STA, 5v*35c). LEGEND: face/w = IM, ear/lev = MR, halfface/h = SM]

1. Chernoff Faces are always drawn complete. As such, some components will always remain
constant

2. It can handle a maximum of 20 variables at a time


Bar Plot
[Figure: Bar Plots for Case 1 to Case 35 (Data: EDA.STA, 5v*35c). LEGEND (left to right): IM, MR, SM]


In Sum
• EDA uses the data to peer into the process
by which the data has been generated
• It provides powerful insight into the data
structure
• Allows us to validate the assumptions of MVA
graphically:
– Assumption of Normality
– Assumption of Homoscedasticity (Equal Variance)
– Assumption of Linearity
– Assumption of presence or absence of correlation
• It should be remembered that EDA is a tool,
not a confirmatory outcome
References
• Multivariate Data Analysis
J.F. Hair, R.E. Anderson, R.L. Tatham & W.C. Black,
Chapter 2, pp. 35-50, 5/e, Pearson Education Asia
• Applied Multivariate Statistical Analysis
R.A. Johnson & D.W. Wichern, Chapter 1, pp. 11-38,
5/e, Pearson Education Asia
• Exploratory Data Analysis
J. Tukey, Addison Wesley, 1977 (out of print)
• http://www.itl.nist.gov/div898/handbook/eda/eda.htm
