Chocolate Cake Seminar

Series on Statistical Applications


Today's Talk:

Be an Explorer with Exploratory Data Analysis!
By David Ramirez

Outline of Presentation
Exploratory v. Confirmatory Data Analyses
Exploratory Data Analysis Techniques

Examples of Graphical Techniques


Examples of Non-graphical Techniques

What is Exploratory Data Analysis (EDA)?


John Tukey (1915-2000), American statistician
It is important to understand what
you CAN DO before you learn to
measure how WELL you seem to
have DONE it.

Definition
EDA consists of methods of discovering unanticipated
patterns and relationships in a data set, by summarizing
data quantitatively or presenting them visually.

Exploratory v. Confirmatory
Exploratory Data Analysis
Descriptive Statistics - Inductive Approach
Look for flexible ways to examine data without preconceptions
Heavy reliance on graphical displays
Let data suggest questions

Advantages
Flexible ways to generate hypotheses
Does not require more than data can support
Promotes deeper understanding of processes

Disadvantages
Usually does not provide definitive answers
Requires judgment - cannot be cookbooked

Exploratory v. Confirmatory

Confirmatory Data Analysis


Inferential Statistics - Deductive Approach

Hypothesis tests and formal confidence interval estimation


Hypotheses determined at outset
Heavy reliance on probability models
Look for definite answers to specific questions
Emphasis on numerical calculations

Advantages
Provide precise information in the right circumstances
Well-established theory and methods

Disadvantages
Misleading impression of precision in less than ideal circumstances
Analysis driven by preconceived ideas
Difficult to notice unexpected results

EDA Techniques
Graphical presentation of distribution

- Continuous variables (stem-and-leaf plot, box plot,
histogram, bivariate scatterplot)
- Categorical variables (bar graph, pie chart)

Non-graphical summary of distribution


- Continuous variables (mean, median, mode, variance,
standard deviation, range, correlation coefficient, linear
regression)
- Categorical variables (frequency table, cross-tabulation)

Stem-and-Leaf Plot
What is it?
A plot where each data value is split into a "leaf"
(usually the last digit) and a "stem" (the other digits).

Useful for describing distributions in terms of


-- Symmetry or skewness (right-skewed=long right tail or
left-skewed=long left tail)
-- Unimodality, bimodality or multimodality (one, two,
or more peaks)
-- Presence of outliers (a few very large or very small
observations)
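
The splitting rule described above can be sketched outside SPSS. This is a minimal Python illustration, not the SPSS algorithm: the function name `stem_and_leaf` and the sample values are made up, and it assumes non-negative integers with the last digit as the leaf.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Build stem-and-leaf rows: leaf = last digit, stem = the other digits.

    Assumes non-negative integers (an assumption of this sketch).
    """
    leaves = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        leaves[stem].append(leaf)
    rows = []
    # Include empty stems so gaps in the distribution stay visible.
    for stem in range(min(leaves), max(leaves) + 1):
        digits = "".join(str(d) for d in leaves.get(stem, []))
        rows.append(f"{stem:>2} | {digits}")
    return rows


rainfall = [18, 25, 28, 30, 31, 31, 35, 36, 42, 43, 45, 52]  # made-up values
stem_rows = stem_and_leaf(rainfall)
print("\n".join(stem_rows))
```

Each printed row is one stem, so the length of a row shows how many observations fall in that range.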

How To Create Stem-and-Leaf Plot


Syntax
EXAMINE VARIABLES=Rain
/PLOT BOXPLOT STEMLEAF.

By Mouse
Analyze -> Descriptive Statistics -> Explore -> Plots ->
Stem-and-leaf

Example: Stem-and-leaf Plot


We use SPSS to construct a stem-and-leaf plot for
rainfall in the US in metropolitan areas.
Frequency    Stem &  Leaf
     4.00 Extremes    (=<15)
     1.00        1 .  8
      .00        2 .
     2.00        2 .  58
    10.00        3 .  0001111234
    15.00        3 .  555556666677889
    16.00        4 .  0011222223333344
     7.00        4 .  5555566
     4.00        5 .  0234
     1.00 Extremes    (>=60)

Box Plot
What is it?
A way of graphically depicting groups of numerical data
through their five-number summaries: the smallest
observation (sample minimum), lower quartile (Q1),
median (Q2), upper quartile (Q3), and largest observation
(sample maximum). A box plot may also indicate which
observations, if any, might be considered outliers.

Useful in visualizing the following:

Location
Spread
Skewness
Outliers
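
As a rough illustration of the five-number summary outside SPSS, the Python sketch below uses the standard `statistics` module. The data are made up, and two things are assumptions of this sketch rather than statements from the slides: the "inclusive" quartile convention (packages differ in how they compute quartiles) and the common 1.5 x IQR rule for flagging possible outliers.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum)."""
    data = sorted(values)
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, q2, q3, data[-1]


def possible_outliers(values):
    """Flag values more than 1.5 * IQR outside the quartiles (assumed rule)."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


rain = [10, 30, 32, 35, 38, 40, 42, 45, 70]  # made-up values
summary = five_number_summary(rain)
flagged = possible_outliers(rain)
print(summary)   # location: median; spread: Q3 - Q1
print(flagged)   # candidates to inspect, not automatic errors
```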

How To Create Box Plot


Syntax
EXAMINE VARIABLES=Rain
/PLOT=BOXPLOT.

By mouse
Graphs -> Legacy Dialogs -> Boxplot -> Summaries of
separate variables -> Define -> scale variable ->
optional: Label Cases by -> OK

Example: Box Plot


Using the previous data on precipitation, we
would like to understand the distribution of
the rain and check for any outliers.


Example: Multiple Box Plots


Side-by-side box plots below display the
distribution of large-city populations in 1960,
one box per country.

How To Create Box Plots


Syntax
EXAMINE VARIABLES=Population BY Country
/PLOT=BOXPLOT
/ID=City.

By mouse
Graphs -> Legacy Dialogs -> Boxplot -> Summaries for
groups of cases -> Define -> Variable (scale) ->
Category Axis (the grouping variable) -> optional:
Label Cases by (ID or name) -> OK

Histogram
What is it?
A diagram consisting of rectangles whose area is
proportional to the frequency of a continuous variable
and whose width is equal to the class interval (bin).

Useful for describing distributions in terms of


-- Symmetry or skewness

-- Unimodality, bimodality or multimodality


-- Presence of outliers
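
To make the role of the class interval concrete, here is a small Python sketch (not SPSS) that counts observations per bin for two different bin widths. The function `bin_counts` and the data are hypothetical; the sketch assumes every value is at or above the chosen starting point.

```python
def bin_counts(values, bin_width, start=0):
    """Count values in consecutive bins [start + k*w, start + (k+1)*w).

    Assumes every value is >= start (an assumption of this sketch).
    """
    n_bins = int((max(values) - start) // bin_width) + 1
    counts = [0] * n_bins
    for v in values:
        counts[int((v - start) // bin_width)] += 1
    return [(start + k * bin_width, c) for k, c in enumerate(counts)]


populations = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]  # made-up values
narrow = bin_counts(populations, bin_width=2)
wide = bin_counts(populations, bin_width=5)

# Text histogram for the narrow bins: one '#' per observation.
for left, count in narrow:
    print(f"{left:>2}-{left + 2:<2} {'#' * count}")
```

With the narrow bins the isolated large value and the empty bin before it are visible; with the wide bins they are merged into a single bar, which is why the bin choice matters.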

How To Create Histogram


Automatically chosen Bins
Syntax
GRAPH
/HISTOGRAM(NORMAL)=Population.

By Mouse
Graphs -> Legacy Dialogs -> Histogram -> Variable (scale) -> check Display normal curve -> OK

How To Create Histogram


User-selected number of bins
Syntax
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=Population MISSING=LISTWISE
REPORTMISSING=NO
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: Population=col(source(s), name("Population"))
GUIDE: axis(dim(1), label("Population"))
GUIDE: axis(dim(2), label("Frequency"))
ELEMENT: interval(position(summary.count(bin.rect(Population, binCount(5)))),
shape.interior(shape.square))
END GPL.

By Mouse
Graphs -> Chart Builder -> Histogram -> drag the scale variable to the x-axis -> Set Parameters -> Custom -> Number of intervals -> Continue -> OK

How To Create Histogram


User-selected bin width
Syntax
* Chart Builder.
GGRAPH
/GRAPHDATASET NAME="graphdataset" VARIABLES=Population MISSING=LISTWISE
REPORTMISSING=NO
/GRAPHSPEC SOURCE=INLINE.
BEGIN GPL
SOURCE: s=userSource(id("graphdataset"))
DATA: Population=col(source(s), name("Population"))
GUIDE: axis(dim(1), label("Population"))
GUIDE: axis(dim(2), label("Frequency"))
ELEMENT: interval(position(summary.count(bin.rect(Population, binWidth(1)))),
shape.interior(shape.square))
END GPL.

By Mouse
Graphs -> Chart Builder -> Histogram -> drag the scale variable to the x-axis -> Set Parameters -> Custom -> Interval width -> Continue -> OK

Example: Histogram
Choosing the bins by hand can give a clearer
picture of the distribution and help identify its
shape (for example, symmetric or skewed).

Scatterplot
What is it?
A scatterplot is a plot of data points in the xy-plane
that displays the strength, direction and shape of
the relationship between the two variables.

Used for
Analyzing relationships between two variables
Looking to see if there are any outliers in the data


How To Create Scatterplot


Syntax
GRAPH
/SCATTERPLOT(BIVAR)=Height WITH Wieght
/MISSING=LISTWISE.

By Mouse
Graphs -> Legacy Dialogs -> Scatter/Dot -> Simple
Scatter -> Define -> Y Axis (outcome) -> X Axis
(predictor) -> OK

Example: Scatterplot
Researchers wanted to see if there is a link
between Height and Weight.


Bar Graph
What is it?
-- A diagram consisting of rectangles whose area is
proportional to the frequency of each level of a
categorical variable.
-- A bar graph is similar to a histogram, but for
categorical variables.
Used for
-- comparison of frequencies for different levels

How To Create Bar Graph


Syntax
GRAPH
/BAR(SIMPLE)=COUNT BY Gender.
By Mouse
Graphs -> Legacy Dialogs -> Bar -> categorical
variable -> Category Axis -> OK

Example: Bar Graph


Experimenters wanted to make sure they had
a roughly equal number of males and females
in a study.

Pie chart
What is it?
A type of graph in which a circle is divided into
sectors, one for each level of a categorical
variable, with each sector showing the proportion
for that level.

Used for
-- comparison of proportions for different levels


How To Create Pie Chart


Syntax
GRAPH
/PIE=COUNT BY Bindedage.

By Mouse
Graphs -> Legacy Dialogs -> Pie ->
Summaries for groups of cases -> Define ->
categorical variable -> Define Slices by -> OK

Example: Pie Chart


A researcher wants to partition the age
variable into a categorical variable in terms of
mental development (College Age, Older
Young Adult, Young Middle age, Middle
Middle Age and up).


Non-Graphical Techniques
Measures of Central Tendency
Central tendency describes where the middle of a
distribution lies.
Mean=sum of all data values divided by the
number of values (arithmetic average).


Measures of Central Tendency


Median=the middle value after all the values are
put in an ordered list (50% observations lie below
and 50% above the median).
If there are two middle observations, the median is the average of
the two.

Mode=most likely or frequently occurring value.


Measures of Spread
Spread is how far observations lie from each
other.
-- Variance=average of the squared distances from
the mean (the sample variance reported by SPSS divides
by n-1 rather than n).

-- Standard deviation=square root of the variance.


-- Range=maximum-minimum.
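
The definitions above can be checked by hand on a tiny data set. The Python sketch below (standard `statistics` module, made-up values) is only an illustration alongside the SPSS workflow. It shows both the population variance (divide by n) and the sample variance (divide by n-1), since statistical packages usually report the latter.

```python
import statistics

values = [2, 4, 4, 5, 7, 8]  # made-up data, n = 6

mean = statistics.mean(values)        # (2+4+4+5+7+8) / 6
median = statistics.median(values)    # average of the two middle values, 4 and 5
mode = statistics.mode(values)        # most frequent value

# Squared distances from the mean sum to 24.
pop_variance = statistics.pvariance(values)    # 24 / 6
sample_variance = statistics.variance(values)  # 24 / 5
sample_sd = statistics.stdev(values)           # square root of the sample variance
value_range = max(values) - min(values)

print(mean, median, mode)
print(pop_variance, sample_variance, round(sample_sd, 4), value_range)
```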

How to Compute Measures of Central Tendency and Spread
Syntax
FREQUENCIES VARIABLES=MORT
/STATISTICS=STDDEV VARIANCE RANGE MEAN MEDIAN MODE
/ORDER=ANALYSIS.

By Mouse
Analyze -> Descriptive Statistics -> Frequencies ->
select a scale variable -> Statistics -> select Mean,
Median, Mode, Std. deviation, Variance, Range ->
Continue -> OK

Example: Central Tendency and Spread


We use SPSS to compute the central tendency
and spread of the mortality rates in
the 1960s.
Statistics: MORT
N Valid               60
N Missing              0
Mean            940.3650
Median          943.7000
Mode              790.70 a
Std. Deviation  62.20482
Variance        3869.439
Range             322.30

Correlation Coefficient
What is it?
-- A numeric measure of linear relationship between two continuous
variables.

Properties of correlation coefficient:


-- Ranges between -1 and 1
-- The closer it is to -1 or 1, the stronger the linear relationship is
-- If r=0, the two variables are not linearly correlated
-- If r is positive, relationship is described as positive (larger values of one
variable tend to accompany larger values of the other variable)
-- If r is negative, relationship is described as negative (larger values of one
variable tend to accompany smaller values of the other variable)
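
The properties listed above follow from how the Pearson correlation coefficient is computed. The sketch below is a plain-Python illustration with made-up data; the helper name `pearson_r` is hypothetical and this is not how SPSS is invoked.

```python
import math


def pearson_r(x, y):
    """Pearson correlation: co-deviation of x and y, scaled to lie in [-1, 1]."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


x = [1, 2, 3, 4, 5]
r_positive = pearson_r(x, [2, 4, 5, 4, 5])   # larger x tends to go with larger y
r_negative = pearson_r(x, [10, 8, 6, 4, 2])  # points lie exactly on a falling line

print(round(r_positive, 4), round(r_negative, 4))
```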


Correlation

Slight warning:
The correlation coefficient measures only linear relationships;
a strong curved (nonlinear) relationship can exist even when r is close to 0.
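
A small made-up example makes this warning concrete: below, y is completely determined by x (y = x squared), yet the correlation coefficient is exactly zero because the relationship is a symmetric curve rather than a line. The helper `pearson_r` is a hypothetical plain-Python function, not an SPSS feature.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


x_vals = [-2, -1, 0, 1, 2]
y_vals = [v ** 2 for v in x_vals]  # perfect, but curved, relationship

r_curve = pearson_r(x_vals, y_vals)
print(r_curve)
```

This is one reason to look at the scatterplot before relying on r.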


Linear Regression
What is it?
-- Statistical technique of fitting a linear function to
data points in attempt to describe a relationship
between two variables.

Used for
-- prediction
-- interpretation of coefficients (change in y for a
unit increase in x)
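
For a single predictor, the fitted line can be computed directly from the least-squares formulas. The Python sketch below uses made-up data and a hypothetical helper `fit_line`; it is meant only to show what "fitting a linear function" and "change in y for a unit increase in x" mean, not to replace the SPSS procedure.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    slope = sxy / sxx                    # change in y per unit increase in x
    intercept = mean_y - slope * mean_x  # the line passes through the means
    return intercept, slope


x = [1, 2, 3, 4, 5]  # made-up predictor values
y = [2, 4, 5, 4, 5]  # made-up outcome values
intercept, slope = fit_line(x, y)
prediction_at_6 = intercept + slope * 6  # prediction for a new x value

print(round(intercept, 3), round(slope, 3), round(prediction_at_6, 3))
```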

How To Find Correlation and Fitted Regression Line
Syntax
REGRESSION
/DESCRIPTIVES MEAN STDDEV CORR SIG N
/MISSING LISTWISE
/STATISTICS COEFF OUTS R ANOVA
/CRITERIA=PIN(.05) POUT(.10)
/NOORIGIN
/DEPENDENT Wieght
/METHOD=ENTER Height.

By mouse
Analyze -> Regression -> Linear -> move Y (the variable
we want to predict) to Dependent -> move X (the variable
we are using to predict Y) to Independent(s) -> OK

Example: Correlation
Referring to our weight and height scatterplot,
the researchers want to check how strongly related
these two variables are.
Correlations
                               Wieght    Hieght
Pearson Correlation   Wieght    1.000      .717
                      Hieght     .717     1.000
Sig. (1-tailed)       Wieght               .000
                      Hieght     .000
N                     Wieght      507       507
                      Hieght      507       507

Example: Regression
Researchers want to create a linear model
using the height as an independent variable
(predictor) and weight as a dependent variable
(outcome or response).
The fitted line can be written as
Weight= -105.011+1.018 (Height)
Coefficients (a)
              Unstandardized Coefficients   Standardized Coefficients
Model 1       B           Std. Error        Beta                         t          Sig.
(Constant)    -105.011    7.539                                          -13.928    .000
Hieght        1.018       .044              .717                         23.135     .000
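
The numbers in the coefficient table can be tied together with a few lines of arithmetic. The Python sketch below uses only values reported above. The height of 170 is a hypothetical input chosen for illustration (the slides do not state the measurement units), and the t statistics are recomputed from rounded table entries, so they match the table only approximately.

```python
# Values copied from the coefficient table.
intercept, slope = -105.011, 1.018
se_intercept, se_slope = 7.539, 0.044


def predict_weight(height):
    """Predicted weight from the fitted line: Weight = -105.011 + 1.018 * Height."""
    return intercept + slope * height


predicted = predict_weight(170)  # hypothetical height, units as in the data set

# Each t statistic is the coefficient divided by its standard error.
t_intercept = intercept / se_intercept
t_slope = slope / se_slope

print(round(predicted, 3), round(t_intercept, 2), round(t_slope, 2))
```

In a regression with one predictor, the standardized coefficient (Beta = .717) equals the Pearson correlation reported in the correlation example.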


Frequency Table
What is it?
-- A table that shows frequency (count) for each
level of a categorical variable.

Used for
-- comparison of frequencies for different levels
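
A frequency table is simple to build by counting. The Python sketch below is an illustration outside SPSS. It rebuilds a list of observations from the counts shown in the "Example: Frequency Table" output (9, 19, 20, 12) and then derives the percent and cumulative percent columns. The helper name `frequency_table` is hypothetical.

```python
from collections import Counter


def frequency_table(values, order):
    """Return (level, count, percent, cumulative percent) rows in the given order."""
    counts = Counter(values)
    n = len(values)
    rows, running = [], 0
    for level in order:
        running += counts[level]
        rows.append((level,
                     counts[level],
                     round(100 * counts[level] / n, 1),
                     round(100 * running / n, 1)))
    return rows


levels = ["9th Grade", "10th Grade", "11th Grade", "12th grade and up"]
# One entry per case, reconstructed from the reported counts.
observations = ([levels[0]] * 9 + [levels[1]] * 19
                + [levels[2]] * 20 + [levels[3]] * 12)

table = frequency_table(observations, levels)
for row in table:
    print(row)
```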


How To Find Frequency Table


Syntax
FREQUENCIES VARIABLES=EDUbins
/ORDER=ANALYSIS.

By mouse
Analyze -> Descriptive Statistics -> Frequencies -> select the
variable -> check Display frequency tables -> OK

Example: Frequency Table


We want to know the frequencies of different educational levels
in US metropolitan areas in the 1960s. We first use visual binning
to define the bins. Based on the range of the data, we create four
bins: 9th grade, 10th grade, 11th grade, and 12th grade and up.
Syntax
* Visual Binning.
*EDU.
RECODE EDU (MISSING=COPY) (12 THRU HI=4) (11 THRU HI=3) (10 THRU HI=2) (LO THRU
HI=1) (ELSE=SYSMIS) INTO EDUbins.
VARIABLE LABELS EDUbins 'EDU (Binned)'.
FORMATS EDUbins (F5.0).
VALUE LABELS EDUbins 1 '9th Grade' 2 '10th Grade' 3 '11th Grade' 4 '12th grade and up'.
VARIABLE LEVEL EDUbins (ORDINAL).
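
The RECODE command above applies the first matching rule to each value, so the cut points are tested from the highest down. The Python sketch below mirrors that logic for illustration only: the function name `bin_edu` and the sample values are made up, and `None` stands in for a missing value.

```python
LABELS = {1: "9th Grade", 2: "10th Grade", 3: "11th Grade", 4: "12th grade and up"}


def bin_edu(edu):
    """Map a years-of-education value to bin 1-4; missing values stay missing."""
    if edu is None:
        return None
    if edu >= 12:   # 12 THRU HI = 4
        return 4
    if edu >= 11:   # 11 THRU HI = 3 (values of 12 and above were already handled)
        return 3
    if edu >= 10:   # 10 THRU HI = 2
        return 2
    return 1        # everything else (LO THRU HI) = 1


sample = [9.0, 10.3, 11.5, 12.0, 12.9, None]  # made-up values
bins = [bin_edu(v) for v in sample]
print(bins)
print([LABELS.get(b) for b in bins])
```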

By Mouse
Transform -> Visual Binning -> select the variable to turn into an ordinal variable ->
Continue -> Make Cutpoints -> enter the number of cutpoints and the width -> Apply -> OK

Example: Frequency Table


EDU (Binned)
                            Frequency   Percent   Valid Percent   Cumulative Percent
Valid   9th Grade                   9      15.0            15.0                 15.0
        10th Grade                 19      31.7            31.7                 46.7
        11th Grade                 20      33.3            33.3                 80.0
        12th grade and up          12      20.0            20.0                100.0
        Total                      60     100.0           100.0

Cross-tabulation
What is it?
A two-way table containing frequencies (counts)
for different levels of the column and row
variables.

Used for
Comparison of frequencies across the levels of
the two variables (for example, with a chi-squared test)
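
To show what the two-way table and the chi-squared statistic contain, here is a minimal Python sketch with made-up data. The helper names `crosstab` and `chi_square` are hypothetical. The sketch computes the Pearson chi-squared statistic and its degrees of freedom but not the p-value, which needs the chi-squared distribution and is left to SPSS.

```python
from collections import Counter


def crosstab(rows, cols):
    """Count co-occurrences of row and column levels (levels sorted alphabetically)."""
    counts = Counter(zip(rows, cols))
    row_levels, col_levels = sorted(set(rows)), sorted(set(cols))
    table = [[counts[(r, c)] for c in col_levels] for r in row_levels]
    return row_levels, col_levels, table


def chi_square(table):
    """Pearson chi-squared statistic: sum of (observed - expected)^2 / expected."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


# Made-up paired observations: education level and region for 60 cases.
edu = ["low"] * 30 + ["high"] * 30
region = ["A"] * 10 + ["B"] * 20 + ["A"] * 20 + ["B"] * 10

row_levels, col_levels, table = crosstab(edu, region)
stat, df = chi_square(table)
print(row_levels, col_levels, table)
print(round(stat, 3), df)
```

Here every expected count is 15, so the statistic is 4 x (25 / 15). With 1 degree of freedom this exceeds the usual 5% critical value of 3.841.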


How To Find Cross-tabulation


Syntax:
CROSSTABS
/TABLES=EDUbins BY US
/FORMAT=AVALUE TABLES
/STATISTICS=CHISQ
/CELLS=COUNT
/COUNT ROUND CELL.

By Mouse
Analyze -> Descriptive Statistics -> Crosstabs -> select
variable for row -> select variable for column ->
Statistics -> Chi-square -> Continue -> OK

Example: Cross-tabulation
Researchers wish to know whether the
educational levels in the SMSA data are
distributed equally across regions of the US.
The small p-value of the Pearson chi-square
test (.002) suggests that the distribution of
educational levels differs among the regions
of the US.
EDU (Binned) * US Crosstabulation (Count)
Rows: EDU (Binned) (9th Grade, 10th Grade, 11th Grade, 12th grade and up)
Columns: US (1.00, 2.00, 3.00, 4.00)
[Cell counts not recoverable from the source; total N = 60]

Chi-Square Tests
                                 Value      Asymp. Sig. (2-sided)
Pearson Chi-Square               26.078a    .002
Likelihood Ratio                 25.377     .003
Linear-by-Linear Association      9.893     .002
N of Valid Cases                 60
[df column not recoverable from the source]

Recommended Readings/Citations
Hartwig, F., & Dearing, B. E. (1979). Exploratory Data
Analysis. Beverly Hills: Sage Publications.
Hoaglin, D. C., Mosteller, F., & Tukey, J. W. (1983).
Understanding Robust and Exploratory Data Analysis. New
York: John Wiley & Sons Inc.
Pampel, F. C. (2004). Exploratory Data Analysis. In M. S.
Lewis-Beck, A. Bryman, & T. F. Liao, The SAGE
Encyclopedia of Social Science Research Methods (pp. 359-360). Thousand Oaks, California: Sage Publications.
Vogt, W. P. (1999). Exploratory Data Analysis. In W. P. Vogt,
Dictionary of Statistics & Methodology: A Nontechnical
Guide for the Social Sciences (pp. 104-105). Thousand Oaks,
California: SAGE Publications, Inc.