
Lecture 2

Introduction to Data Analysis

Professor David Richardson


IIT Stuart School of Business
Agenda
• Decision Pro Software and Cases

• R

• Intro to Data Analysis


– Variable Types
– Descriptive Statistics
– Hypothesis Testing
Case Assignments
• Northern Aero Nabanita
• Connector PDA Vidya
• ABB Sonal
• G20 Adrian
• Bookbinder Rutika
• Forte Karan
• Xuan, Shaoshan, Divya – will assign later
Random Variables
• An Outcome is the result of an
experiment (natural or designed)
• A set of outcomes is an Event
• The set of all possible outcomes is a
Sample Space
• A Random Variable is a function that
assigns values to the outcomes of an
experiment
Variable Types
• Nominal/categorical
– e.g. gender

• Ordinal
– e.g. Likert scale rating, ranking, education

• Interval & ratio


– e.g. age, income, purchase frequency
Variable Types
• Nominal / Categorical
– Numbers are assigned to names or
categories for the purpose of classification.
Numbers are not comparable in any way;
just for labeling.
– Example: 1 = Chicago; 2 = New York; 3 =
Los Angeles; 4 = Dallas
Variable Types (cont’)
• Ordinal / Rank Order
– Numbers are assigned on the basis of relative
liking or standing. Numbers can be compared to
see which is higher on the list, but #4 isn’t “twice
as much” anything as #2, nor is the distance
between #2 and #3 the same as between #3 and
#4 (i.e. neither additive nor multiplicative
comparisons are possible)
– Example: please rank the following from most
pleasurable to least: homework, plague, root
canal, bluegrass, in-laws
Variable Types (cont’)
• Interval
– Scale has the following property: distance
between pairs of adjacent points is equal, but
there are no absolute comparisons; distance
between #2 and #3 the same as between #3 and
#4 (i.e. additive comparisons OK; multiplicative
comparisons not OK)
– Example: temperature scales (10° to 20° is the
same interval as 70° to 80°, but 10° isn’t half as
warm as 20°).
Variable Types (cont’)
• Ratio
– Numbers are comparable both additively and
multiplicatively; there is an objective zero point
– Examples: age, income, distance to work, unit
sales, number of missing teeth

• Desirable property of interval and ratio data:
all statistical measures (means, variances)
and tests are appropriate
Coin Flipping
• Binary Random variable – what are
outcomes? scale? sample space?
• If we flip 3 times and count heads?
• Binomial Distribution
• Normal Distribution
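
The coin-flipping questions above can be worked through by brute force. The sketch below enumerates the sample space for three flips, defines the random variable "number of heads", and compares its distribution with the binomial formula. It assumes a fair coin, which the slide does not state; the names `sample_space`, `count_heads` and `binomial_pmf` are illustrative only.

```python
from collections import Counter
from fractions import Fraction
from itertools import product
from math import comb

# Sample space: every sequence of 3 flips, each flip H or T (8 outcomes).
sample_space = list(product("HT", repeat=3))


def count_heads(outcome):
    """Random variable: maps an outcome to its number of heads."""
    return outcome.count("H")


# Distribution of the random variable. ASSUMPTION: the coin is fair,
# so all outcomes in the sample space are equally likely.
counts = Counter(count_heads(o) for o in sample_space)
distribution = {
    k: Fraction(v, len(sample_space)) for k, v in sorted(counts.items())
}


def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


for k, prob in distribution.items():
    # Enumerated probability next to the binomial formula's value.
    print(k, prob, binomial_pmf(k, 3, Fraction(1, 2)))
```

The enumerated probabilities and the binomial formula agree for every count of heads, which is the sense in which the head count "follows" a binomial distribution.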
Statistics
• A Sample is a subset of a population or
a set of outcomes to a repeated
experiment.
• A statistic is a function of a (real
valued) random variable over a sample
Describing a Single Variable
• Descriptive Statistics
– Statistics for Central Tendency
• Mean
• Median
• Mode
– Statistics for Dispersion
• Variance or standard deviation
• Range: min, max, frequency

• Pictorial representation
– Histogram (or bar chart, pie chart)
– Scatter plot
Histogram
• Histogram shows the frequency
distribution of the data
• In Excel, only quantitative numeric data
can be displayed with histograms
Histogram
[Figure: histogram. Y-axis: Frequency (0 to 30). X-axis: bins labeled 15000, 24900, 34900, ..., 104900, and "More".]
Statistics for Central Tendency
• Central tendency: where are the bulk of the
data?
– Mean: average value; used for interval & ratio data
– Median: middle value of the variable (i.e. 50% of the data
are larger than the median and 50% are smaller);
used for ordinal, interval & ratio data
• More robust than the mean, i.e., resistant to large changes from
a few outliers
– Mode: the category of a variable that occurs most
often; used for nominal & ordinal data; a variable can
have multiple modes
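
As a rough illustration of the three measures, and of why the median is called robust, the sketch below uses Python's standard `statistics` module. The income figures and city labels are made up for the example.

```python
import statistics

# Hypothetical incomes in dollars (ratio data).
incomes = [32000, 35000, 35000, 41000, 47000]
# The same data with one extreme outlier added.
incomes_with_outlier = incomes + [900000]

mean_before = statistics.mean(incomes)
median_before = statistics.median(incomes)
mean_after = statistics.mean(incomes_with_outlier)
median_after = statistics.median(incomes_with_outlier)

print("mean:  ", mean_before, "->", round(mean_after, 2))
print("median:", median_before, "->", median_after)

# The mode also works for nominal data; multimode returns every
# category tied for most frequent, so multiple modes are visible.
cities = ["Chicago", "Dallas", "Chicago", "New York", "Dallas"]
modes = statistics.multimode(cities)
print("modes: ", modes)
```

One outlier moves the mean by a large amount while the median barely changes.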
Statistics for Dispersion
• Dispersion: how spread out is the data?
– Variance & standard deviation: used for
interval & ratio data
– Minimum & maximum: used for interval &
ratio data
– Absolute frequencies: number of observations
in each category of the variable; used for
ordinal & nominal data; one-way tabulation
– Relative frequencies: proportion of
observations in each category of the variable,
used for ordinal & nominal data
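
A one-way tabulation of absolute and relative frequencies might look like the following sketch. The Likert ratings are invented for illustration.

```python
from collections import Counter

# Hypothetical responses on a 5-point Likert scale (ordinal data).
ratings = [4, 5, 3, 4, 4, 2, 5, 4, 3, 5]

# Absolute frequencies: number of observations in each category.
absolute = Counter(ratings)

# Relative frequencies: proportion of observations in each category.
n = len(ratings)
relative = {category: count / n for category, count in absolute.items()}

for category in sorted(absolute):
    print(category, absolute[category], relative[category])
```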
Example: a ratio variable
• Variable X refers to Age
• Number of observations n is 10

Observation #   Age
1               21
2               22
3               22
4               22
5               21
6               22
7               22
8               21
9               22
10              21

Mean: \bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{21 + 22 + \cdots + 22 + 21}{10} = 21.6

Variance: \sigma^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} = 0.26667

Standard Deviation: S = \sqrt{\sigma^2} = 0.516398
Measures of Dispersion
Obs   Age    Age - 21   Deviation   Abs. Dev   Squared Deviation
1     21     0          -0.6        0.6        0.36
2     22     1           0.4        0.4        0.16
3     22     1           0.4        0.4        0.16
4     22     1           0.4        0.4        0.16
5     21     0          -0.6        0.6        0.36
6     22     1           0.4        0.4        0.16
7     22     1           0.4        0.4        0.16
8     21     0          -0.6        0.6        0.36
9     22     1           0.4        0.4        0.16
10    21     0          -0.6        0.6        0.36
sum   216    6           0.00       4.80       2.40
avg   21.6   0.6         0.00       0.48       0.24
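
The figures in the age example can be reproduced with a few lines of Python. The sketch also shows why the table's "avg" of the squared deviations (0.24, dividing by n) differs from the variance quoted earlier (0.26667, dividing by n - 1). Variable names are illustrative.

```python
import math
import statistics

# Ages of the 10 observations from the example.
ages = [21, 22, 22, 22, 21, 22, 22, 21, 22, 21]
n = len(ages)

mean = sum(ages) / n                                 # 216 / 10
deviations = [x - mean for x in ages]                # "Deviation" column
sum_sq = sum(d**2 for d in deviations)               # sum of squared deviations
mean_abs_dev = sum(abs(d) for d in deviations) / n   # avg of "Abs. Dev"

sample_variance = sum_sq / (n - 1)      # divides by n - 1
population_variance = sum_sq / n        # divides by n (the table's "avg" row)
sample_sd = math.sqrt(sample_variance)

print(mean, sample_variance, population_variance, sample_sd)

# The standard library's variance() also divides by n - 1.
assert math.isclose(statistics.variance(ages), sample_variance)
```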
Normal Distribution

• PDF

• Cumulative Distribution has no closed-form
solution – we can’t calculate it analytically
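
For reference, the standard forms of the normal density and cumulative distribution are given below. The symbols \mu (mean) and \sigma (standard deviation) are chosen here; the slide's own formulas did not survive extraction.

```latex
f(x) = \frac{1}{\sigma\sqrt{2\pi}}
       \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),
\qquad
F(x) = \int_{-\infty}^{x} f(t)\,dt
     = \frac{1}{2}\left[1 + \operatorname{erf}\!\left(\frac{x-\mu}{\sigma\sqrt{2}}\right)\right]
```

The integral cannot be written in terms of elementary functions, which is why its values are read from a normal table or computed numerically.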
Normal Table
Normal Distribution
When data is
normally distributed,
about 2/3 (68%) of observations
will fall within 1 standard
deviation of the mean,
about 95% within 2, and nearly all
(99.7%) within 3 standard deviations
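
These proportions can be checked numerically rather than looked up in a table. The sketch below uses `statistics.NormalDist` from the Python standard library; `prob_within` is an illustrative helper name.

```python
from statistics import NormalDist

# Standard normal: mean 0, standard deviation 1.
standard_normal = NormalDist(mu=0.0, sigma=1.0)


def prob_within(k):
    """P(|Z| <= k): share of observations within k standard deviations."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} sd: {prob_within(k):.4f}")
```

The same result holds for any mean and standard deviation, because the rule is stated in units of standard deviations from the mean.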
Hypothesis Testing
• A hypothesis test is a statistical procedure
used to “accept” or “reject” the hypothesis
based on sample information.
– Based on the hypothesis, decide on the appropriate
statistical test, i.e. decide on the sampling distribution
of the test statistic
– Null hypothesis vs. alternative hypothesis
– Perform the test using statistical software
– Check the p-value. If the p-value is smaller than the
significance level (commonly 0.05), the null hypothesis is
rejected; the alternative hypothesis is supported
Confidence Interval
• A confidence interval expresses the degree of
accuracy desired by the researcher.
It is stipulated as a range with a
lower boundary and an upper
boundary around the estimate.
– Confidence intervals are usually set at the 95%
level
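
As a minimal sketch of how the two boundaries are obtained, the code below computes a 95% interval for a mean. It assumes the population standard deviation is known (`sigma` is an assumed value) and uses made-up satisfaction scores; with an unknown variance the normal critical value would be replaced by one from the t distribution.

```python
import math
from statistics import NormalDist, mean

# Hypothetical satisfaction scores on a 1-to-5 scale.
scores = [4, 3, 5, 4, 2, 4, 3, 5, 4, 4, 3, 4]
sigma = 1.0        # ASSUMPTION: known population standard deviation
confidence = 0.95

# Two-sided critical value (about 1.96 for 95%).
z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)

# Half-width of the interval, then lower and upper boundaries.
margin = z_crit * sigma / math.sqrt(len(scores))
lower = mean(scores) - margin
upper = mean(scores) + margin

print(f"{confidence:.0%} CI: [{lower:.3f}, {upper:.3f}]")
```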
Hypothesis Testing (cont’d)
• Test a hypothesis for a mean
– If variance known, one sample z-test
– If variance unknown, one sample t-test,
used most often
– Two-sided test: is average customer
satisfaction different from 3 on a scale
from 1 to 5?
– One-sided test: is average customer
satisfaction greater than 3 on a scale from
1 to 5?
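
The two customer-satisfaction questions can be sketched as a one-sample z-test. This uses the "variance known" case only because the Python standard library has a normal distribution but no t distribution; the scores and `sigma` are invented, and in practice the t-test would usually be run in statistical software.

```python
import math
from statistics import NormalDist, mean

# Hypothetical satisfaction scores on a 1-to-5 scale.
scores = [4, 3, 5, 4, 2, 4, 3, 5, 4, 4, 3, 4]
mu0 = 3.0      # mean under the null hypothesis
sigma = 1.0    # ASSUMPTION: known population standard deviation

n = len(scores)
# Test statistic: distance of the sample mean from mu0 in standard errors.
z = (mean(scores) - mu0) / (sigma / math.sqrt(n))

std_normal = NormalDist()
# Two-sided: is average satisfaction different from 3?
p_two_sided = 2 * (1 - std_normal.cdf(abs(z)))
# One-sided: is average satisfaction greater than 3?
p_one_sided = 1 - std_normal.cdf(z)

print(f"z = {z:.3f}")
print(f"two-sided p = {p_two_sided:.4f}, one-sided p = {p_one_sided:.4f}")
```

When the sample mean lies on the side named by the one-sided alternative, the one-sided p-value is half the two-sided one.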
Hypothesis Testing (cont’d)
• Test differences between groups
– Widely used in marketing
– E.g. are male consumers as satisfied
with us as female consumers?
– If variance known, two sample z-test
– If variance unknown, two sample t-test,
used more often
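
A sketch of the test statistic for a two-sample comparison follows. It uses Welch's form of the t statistic, which does not assume equal variances; the slide does not say which variant is intended, and the ratings are invented. The p-value would then come from a t table or statistical software.

```python
import math
from statistics import mean, variance

# Hypothetical satisfaction ratings for two groups of consumers.
male = [4, 3, 5, 4, 2, 4, 3, 4]
female = [5, 4, 4, 5, 3, 5, 4, 5]


def welch_t(a, b):
    """Return (t statistic, approximate degrees of freedom) for two samples."""
    # Squared standard error of each sample mean.
    va = variance(a) / len(a)
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (va + vb) ** 2 / (va**2 / (len(a) - 1) + vb**2 / (len(b) - 1))
    return t, df


t_stat, df = welch_t(male, female)
print(f"t = {t_stat:.3f}, df = {df:.1f}")
```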
Scatter Plot
Relationship between Two
Variables
• Correlation coefficient (r_{XY})
– Measure of the linear relationship between
X and Y:

r_{XY} = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Cov}(X, X)\,\mathrm{Cov}(Y, Y)}}
       = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2 \, \sum_{i=1}^{n} (Y_i - \bar{Y})^2}}
       = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{(n - 1)\, s_x s_y}
       = \frac{1}{n - 1} \sum_{i=1}^{n} \left(\frac{X_i - \bar{X}}{s_x}\right) \left(\frac{Y_i - \bar{Y}}{s_y}\right)

– If r = 0, there is no linear relationship
between the variables
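
The correlation coefficient can be computed directly from its definition. The sketch below implements two equivalent forms (sums of cross-products, and the average product of standardized deviations) and checks that they agree. The data and function names are made up for illustration.

```python
import math
from statistics import mean, stdev

# Hypothetical data: advertising spend (x) and unit sales (y).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]


def correlation(xs, ys):
    """Pearson r: cross-products over the root of the sums of squares."""
    x_bar, y_bar = mean(xs), mean(ys)
    cross = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
    ss_x = sum((a - x_bar) ** 2 for a in xs)
    ss_y = sum((b - y_bar) ** 2 for b in ys)
    return cross / math.sqrt(ss_x * ss_y)


def correlation_standardized(xs, ys):
    """Same r, as the sum of products of standardized deviations over n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (n - 1)
    total = sum(((a - x_bar) / s_x) * ((b - y_bar) / s_y) for a, b in zip(xs, ys))
    return total / (n - 1)


r = correlation(x, y)
print(f"r = {r:.4f}")
assert math.isclose(r, correlation_standardized(x, y))
```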
What is Regression?
• A method that determines whether there are
any associations between a dependent variable
and a set of independent variables
– E.g. the relationship between customer
satisfaction and marketing efforts

• A method that predicts/forecasts the dependent
variable with independent variables
– E.g. forecasting sales given advertising and
sales promotion efforts
Regression Model
Theoretical Model:
Y = β0 + β1 X1 + β2 X2 + ... + βk Xk + ε
(ε follows a normal distribution)

Estimated Model:
Y = b0 + b1 X1 + b2 X2 + ... + bk Xk + e
= “Fit” + “Error”
•Dependent variable: Y, must be interval/ratio variable
•Independent variables: X1 , X2 , … , Xk
•Coefficients: b0 , b1 , b2 , … , bk
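
To make the "Fit" + "Error" split concrete, here is a minimal sketch of least-squares estimation for the simplest case, one independent variable. The data are invented and `fit_simple_ols` is an illustrative name; models with several independent variables would normally be estimated with statistical software.

```python
from statistics import mean

# Hypothetical data: advertising spend (x) and sales (y).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]


def fit_simple_ols(xs, ys):
    """Least-squares estimates (b0, b1) for y = b0 + b1 * x + e."""
    x_bar, y_bar = mean(xs), mean(ys)
    cross = sum((a - x_bar) * (b - y_bar) for a, b in zip(xs, ys))
    ss_x = sum((a - x_bar) ** 2 for a in xs)
    b1 = cross / ss_x            # slope
    b0 = y_bar - b1 * x_bar      # intercept
    return b0, b1


b0, b1 = fit_simple_ols(x, y)
fitted = [b0 + b1 * a for a in x]                 # the "Fit" part
residuals = [b - f for b, f in zip(y, fitted)]    # the "Error" part

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}")
```

Each observed Y equals its fitted value plus its residual, and with an intercept in the model the residuals sum to zero.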
Regression – Log 12 Pack Volume with Carry Over

Coefficients (a)
                            Unstandardized        Standardized                     Collinearity
                            Coefficients          Coefficients                     Statistics
Model 1                     B         Std. Error  Beta          t        Sig.     Tolerance  VIF
(Constant)                  11.787    .654                      18.030   .000
Pepsi 12 pack promotion     .283      .104        .255          2.708    .008     .129       7.756
Memorial Day                .339      .133        .087          2.541    .013     .976       1.024
Ln Pepsi 12 pack price      -2.363    .285        -.656         -8.299   .000     .184       5.447
Ln Pepsi 2 liter price      .189      .091        .077          2.076    .041     .836       1.196
Ln Coke 12 price            .313      .165        .103          1.897    .061     .391       2.557
Fall                        -.076     .044        -.061         -1.728   .087     .933       1.072
Ln Pepsi 12 pack vol t-1    .138      .039        .137          3.514    .001     .752       1.330
a. Dependent Variable: Ln Pepsi 12 pack volume

Model Summary (b)
Model   R          R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .944 (a)   .891       .883                .18455                       2.164
a. Predictors: (Constant), Ln Pepsi 12 pack vol t-1, Memorial Day, Fall, Ln
Pepsi 2 liter price, Ln Pepsi 12 pack price, Ln Coke 12 price, Pepsi 12
pack promotion
b. Dependent Variable: Ln Pepsi 12 pack volume
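
Two columns of the coefficients output can be recomputed from the others, which helps when reading it: t is B divided by Std. Error, and VIF is the reciprocal of Tolerance. The sketch below copies the predictor rows from the table and recomputes both; small differences from the reported values are expected because the displayed figures are rounded. It also lists the predictors whose Sig. value is not below 0.05.

```python
# (B, Std. Error, reported t, Sig., Tolerance, reported VIF), from the table.
rows = {
    "Pepsi 12 pack promotion": (0.283, 0.104, 2.708, 0.008, 0.129, 7.756),
    "Memorial Day": (0.339, 0.133, 2.541, 0.013, 0.976, 1.024),
    "Ln Pepsi 12 pack price": (-2.363, 0.285, -8.299, 0.000, 0.184, 5.447),
    "Ln Pepsi 2 liter price": (0.189, 0.091, 2.076, 0.041, 0.836, 1.196),
    "Ln Coke 12 price": (0.313, 0.165, 1.897, 0.061, 0.391, 2.557),
    "Fall": (-0.076, 0.044, -1.728, 0.087, 0.933, 1.072),
    "Ln Pepsi 12 pack vol t-1": (0.138, 0.039, 3.514, 0.001, 0.752, 1.330),
}

recomputed = {}
for name, (b, se, t_reported, sig, tolerance, vif_reported) in rows.items():
    t = b / se              # t statistic = coefficient / standard error
    vif = 1 / tolerance     # variance inflation factor = 1 / tolerance
    recomputed[name] = (t, vif)
    print(f"{name:26s} t = {t:7.3f} ({t_reported:7.3f})  "
          f"VIF = {vif:6.3f} ({vif_reported:6.3f})")

# Predictors not significant at the 0.05 level.
not_significant = [name for name, row in rows.items() if row[3] >= 0.05]
print("not significant at 0.05:", not_significant)
```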
Next Class
• Review: Notes on Data, Correlation and
Regression
• Homework 1 due before class
• Get Cases and software from Decision
Pro.
