
Statistical Methods of Analysis

Monday, May 13, 2019


Agenda

 Descriptive Statistics
– Mean
– Range
– Standard Deviation
– Normal Curve
 Inferential Statistics
– T-test
– Chi-square Test
– Correlation
– ANOVA (One Way)



Descriptive Statistics

 Refers to statistics used to describe the population under study.
 When using descriptive statistics, every member of a group or population is measured.
 Example: India’s Census, in which all members of the population are counted.

 Purpose:
– Summarizing collection of data in a clear and understandable way
– Constructing appropriate graphs to visualize the patterns in data



Types of Descriptive Statistics

 Measures of Central Tendency
– Used to determine the average score of a group
 Measures of Dispersion or Variability
– Describe how spread out a group of scores is
 Measures of Relative Position
– Describe a subject’s performance compared to the performance of all other subjects
 Measures of Relationship
– Indicate to what degree two sets of scores are related



Measures of Central Tendency

 A convenient way of describing a set of data with a single number.
 The three most frequently encountered indices of central tendency are:
– Mode
– Median
– Mean
Mode
– Most frequently occurring value
– Example:
(45, 49, 50, 53, 60, 62, 63, 65, 66, 67, 69, 71, 73, 74, 74, 78,
81, 85, 87, 100)
Mode: 74

– Problems associated with the mode:
 A set of scores may have two modes (bimodal) or more than two (multimodal).
 It is unstable: equal-sized samples randomly selected from the accessible population are likely to have different modes.

– Appropriate for nominal data.


Median

 The middle value in a sorted data set. Half the values are greater than the median and half are less.
 For an even number of scores, the median is the average of the two middle values:
(45, 49, 50, 53, 60, 62, 63, 65, 66, 67, 69, 71, 73,
74, 74, 78, 81, 85, 87, 100) Median = (67 + 69)/2 = 68
 For an odd number of scores:
(50, 52, 55, 57, 59) Median = 55

 Appropriate when data is ordinal.


Mean
– Arithmetic average of the scores.
– Sum of all values divided by the number of values in the data
set.

– Example
(73+66+69+67+49+60+81+71+78+62+53+87+74+65+74+50+
85+45+63+100)/20 = 68.6

– A more precise and stable index than both the median and the mode
(the means of randomly selected samples will be more similar
to each other than either the medians or the modes).
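The three indices can be checked on the 20 example scores with a short Python sketch. It assumes Python 3.8 or later and uses only the standard `statistics` module; the variable names are illustrative and not part of the original slides.

```python
import statistics

# The 20 scores used in the mode, median, and mean examples.
scores = [73, 66, 69, 67, 49, 60, 81, 71, 78, 62,
          53, 87, 74, 65, 74, 50, 85, 45, 63, 100]

mode = statistics.mode(scores)        # most frequently occurring value
modes = statistics.multimode(scores)  # list of all modes, in case of ties
median = statistics.median(scores)    # averages the two middle values when n is even
mean = statistics.mean(scores)        # sum of the values divided by their count

print("mode:", mode)
print("all modes:", modes)
print("median:", median)
print("mean:", mean)
```

`multimode` is included because a data set can have more than one mode, which is one of the problems listed for that index.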
Measures of Variability
 Measures of central tendency are very useful statistics for
describing a set of data, but not sufficient.

 Two sets of data that are very different can have identical means
or medians. Example:
set A: 79 79 79 80 81 81 81
set B: 50 60 70 80 90 100 110

 The mean and median of both sets of scores are 80, but set A is very different from set B.
 In set A the scores are all very close together and clustered
around the mean whereas in set B the scores are much more
spread out.
 Measures of variability are used to measure this variability in
scores
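As a quick check of the set A / set B comparison, the sketch below computes the mean and median of each set, plus the difference between the largest and smallest value as a first rough look at spread. It is a minimal illustration using Python's standard `statistics` module; the dictionary layout is an assumption made for the example.

```python
import statistics

set_a = [79, 79, 79, 80, 81, 81, 81]
set_b = [50, 60, 70, 80, 90, 100, 110]

summary = {}
for name, data in (("A", set_a), ("B", set_b)):
    summary[name] = {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "spread": max(data) - min(data),  # largest value minus smallest value
    }
    print(name, summary[name])
```

Both sets report the same mean and median, while the spread differs widely, which is the gap that measures of variability are meant to fill.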
Measures of Variability

 The three most frequently encountered measures of variability are:
– Range
– Quartile Deviation
– Standard Deviation
Range

– Difference between the maximum and minimum values in a data set.
– A larger range usually (but not always) indicates a large spread or deviation in the values of the data set.
(73, 66, 69, 67, 49, 60, 81, 71, 78, 62, 53, 87, 74,
65, 74, 50, 85, 45, 63, 100) Range = 100 − 45 = 55
– Not a very stable measure of variability, but it gives a quick, rough estimate of variability.
Quartile Deviation

 The interquartile range is the difference between the 75th and 25th percentiles; the quartile deviation (semi-interquartile range) is half of that difference.
 75th percentile is that point below which are
75% of the scores and 25th percentile is that
point below which are 25% of the scores.
 More stable measure of variability than
Range.
 Appropriate whenever the median is
appropriate.
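A minimal sketch of the percentile-based spread for the 20 example scores follows. It assumes Python 3.8 or later and uses `statistics.quantiles` with its default "exclusive" interpolation; other percentile conventions give slightly different quartiles, so the exact numbers depend on that assumption.

```python
import statistics

scores = [73, 66, 69, 67, 49, 60, 81, 71, 78, 62,
          53, 87, 74, 65, 74, 50, 85, 45, 63, 100]

# n=4 splits the data into quarters and returns the three cut points:
# the 25th percentile, the median, and the 75th percentile.
q1, q2, q3 = statistics.quantiles(scores, n=4)

iqr = q3 - q1  # difference between the 75th and 25th percentiles

print("25th percentile:", q1)
print("75th percentile:", q3)
print("interquartile range:", iqr)
```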
Standard Deviation
 Most stable measure of variability.
 The variance is the average of the squared deviations from
the mean. The standard deviation is the square root of the
variance.

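$$\text{Variance} = \frac{1}{N}\sum_{i=1}^{N}\left(m_i - \bar{m}\right)^2$$

where $m_i$ is the $i$-th value, $N$ is the number of values, and $\bar{m}$ is the average value of the data set.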


Example

(45, 49, 50, 53, 60, 62, 63, 65, 66, 67, 69, 71, 73,
74, 74, 78, 81, 85, 87, 100) Mean=68.6

Variance = [(45 – 68.6)^2 + (49 – 68.6)^2 + (50 – 68.6)^2 + (53 – 68.6)^2 + …]/20 = 181
Standard Deviation = (181)^(1/2) ≈ 13.5

– The larger the variance, the greater the average deviation of each datum from the average value, which means the data are more spread out.
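The range, variance, and standard deviation of the example scores can be reproduced with the sketch below. It uses the population forms (`pvariance`, `pstdev`) from Python's standard `statistics` module because the worked example divides by N = 20; the variable names are illustrative.

```python
import statistics

scores = [73, 66, 69, 67, 49, 60, 81, 71, 78, 62,
          53, 87, 74, 65, 74, 50, 85, 45, 63, 100]

data_range = max(scores) - min(scores)   # maximum minus minimum
variance = statistics.pvariance(scores)  # average squared deviation from the mean (divides by N)
std_dev = statistics.pstdev(scores)      # square root of the population variance

print("range:", data_range)
print("variance:", round(variance, 2))
print("standard deviation:", round(std_dev, 1))
```

If the scores were treated as a sample rather than a whole population, `statistics.variance` and `statistics.stdev` (which divide by N − 1) would be the counterparts to use.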
Tabular and Graphical Procedures
Data
 Qualitative Data
– Tabular Methods: Frequency Distribution; Rel. Freq. Dist.; % Freq. Dist.
– Graphical Methods: Bar Graph; Pie Chart
 Quantitative Data
– Tabular Methods: Frequency Distribution; Rel. Freq. Dist.; Cum. Freq. Dist.; Cum. Rel. Freq. Distribution
– Graphical Methods: Dot Plot; Histogram; Ogive; Frequency Polygon
Tabular and Graphical Methods for Qualitative Data

 Frequency Distribution
 Relative Frequency
 Percent Frequency Distribution
 Bar Graph
 Pie Chart
Frequency Distribution

 A frequency distribution is a tabular summary of data showing the frequency (or number) of items in each of several non-overlapping classes.
 The objective is to provide insights about the data that cannot be quickly obtained by looking only at the original data.
Example: Guest House

Guests staying at PU Guest House were asked to rate the quality of their accommodations as excellent, above average, average, below average, or poor. The ratings provided by a sample of 20 guests are:

Below Average   Average         Above Average
Above Average   Above Average   Above Average
Above Average   Below Average   Below Average
Average         Poor            Poor
Above Average   Excellent       Above Average
Average         Above Average   Average
Above Average   Average
Example

 Frequency Distribution

Rating Frequency
Poor 2
Below Average 3
Average 5
Above Average 9
Excellent 1
Total 20
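The tally in this table can be reproduced from the 20 raw ratings with the sketch below. It uses `collections.Counter` from the Python standard library; the `order` list is an assumption added only to print the classes from worst to best.

```python
from collections import Counter

# The 20 guest ratings, row by row as listed in the example.
ratings = [
    "Below Average", "Average", "Above Average",
    "Above Average", "Above Average", "Above Average",
    "Above Average", "Below Average", "Below Average",
    "Average", "Poor", "Poor",
    "Above Average", "Excellent", "Above Average",
    "Average", "Above Average", "Average",
    "Above Average", "Average",
]

order = ["Poor", "Below Average", "Average", "Above Average", "Excellent"]
frequency = Counter(ratings)  # counts the items falling in each class

for rating in order:
    print(f"{rating:<15}{frequency[rating]:>3}")
print(f"{'Total':<15}{sum(frequency.values()):>3}")
```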
Relative Frequency Distribution

 The relative frequency of a class is the fraction or proportion of the total number of data items belonging to the class.
 A relative frequency distribution is a tabular summary of a set of data showing the relative frequency for each class.
Percent Frequency Distribution

 The percent frequency of a class is the relative frequency multiplied by 100.
 A percent frequency distribution is a tabular summary of a set of data showing the percent frequency for each class.
Example

 Relative and Percent Frequency Distributions


Rating          Relative Frequency   Percent Frequency
Poor            .10                  10
Below Average   .15                  15
Average         .25                  25
Above Average   .45                  45
Excellent       .05                  5
Total           1.00                 100
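Relative and percent frequencies follow directly from the class counts. The sketch below starts from the frequency distribution of the guest ratings and derives both columns; the dictionary names are illustrative.

```python
# Frequency distribution of the 20 guest ratings.
frequency = {
    "Poor": 2,
    "Below Average": 3,
    "Average": 5,
    "Above Average": 9,
    "Excellent": 1,
}

total = sum(frequency.values())

# Relative frequency: fraction of all items that fall in the class.
relative = {rating: count / total for rating, count in frequency.items()}
# Percent frequency: relative frequency multiplied by 100.
percent = {rating: 100 * count / total for rating, count in frequency.items()}

for rating in frequency:
    print(f"{rating:<15}{relative[rating]:>6.2f}{percent[rating]:>6.0f}")
```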
Bar Graph

 A bar graph is a graphical device for depicting qualitative data.
 On the horizontal axis we specify the labels
that are used for each of the classes.
 A frequency, relative frequency, or percent
frequency scale can be used for vertical axis.
 Using a bar of fixed width drawn above each
class label, we extend height appropriately.
 Bars are separated to emphasize the fact that
each class is a separate category.
Example

[Figure: bar graph of the guest ratings. Horizontal axis: Rating (Poor, Below Average, Average, Above Average, Excellent); vertical axis: Frequency.]
Pie Chart

 A pie chart is a commonly used graphical device for presenting relative frequency distributions for qualitative data.
 Draw a circle and subdivide it into sectors whose sizes correspond to the relative frequency of each class.
 Since there are 360 degrees in a circle, a class with a relative frequency of .25 would take up .25 × 360 = 90 degrees of the circle.
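The sector sizes for the guest ratings can be worked out the same way. This is a small sketch of the arithmetic only (no plotting); it multiplies before dividing so the angles come out as exact values.

```python
# Frequency distribution of the 20 guest ratings.
frequency = {
    "Poor": 2,
    "Below Average": 3,
    "Average": 5,
    "Above Average": 9,
    "Excellent": 1,
}

total = sum(frequency.values())

# Sector angle = relative frequency * 360 degrees.
angles = {rating: count * 360 / total for rating, count in frequency.items()}

for rating, angle in angles.items():
    print(f"{rating:<15}{angle:>6.1f} degrees")
```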
Example

[Figure: pie chart titled "Quality Ratings". Sectors: Poor 10%, Below Average 15%, Average 25%, Above Average 45%, Excellent 5%.]
Tabular & Graphical Methods for Quantitative Data

 Frequency Distribution
 Relative Frequency and Percent Frequency
Distributions
 Dot Plot
 Histogram
 Cumulative Distributions
 Ogive
 Frequency Polygon
Example: Auto Repair

An auto repair manager would like to get a better picture of the distribution of costs for engine tune-up parts. A sample of 50 customer invoices has been taken, and the costs of parts, rounded to the nearest rupee, are listed below.
91 78 93 57 75 52 99 80 97 62
71 69 72 89 66 75 79 75 72 76
104 74 62 68 97 105 77 65 80 109
85 97 88 68 83 68 71 69 67 74
62 82 98 101 79 105 79 69 62 73
Frequency Distribution

 Guidelines for Selecting Number of Classes


– Use between 5 and 20 classes.
– Data sets with a larger number of elements
usually require a larger number of classes.
– Smaller data sets usually require fewer classes.
– Use classes of equal width.
– Approximate class width:
$$\text{Approximate class width} = \frac{\text{Largest data value} - \text{Smallest data value}}{\text{Number of classes}}$$
Example: Frequency Distribution
If we choose six classes, approximate class width = (109 − 52)/6 = 9.5 ≈ 10
Cost Frequency
50-59 2
60-69 13
70-79 16
80-89 7
90-99 7
100-109 5
Total 50
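The class-width calculation and the resulting counts can be sketched as follows. Rounding the width up with `math.ceil` and starting the first class at 50 are assumptions chosen to match the table; they are not the only reasonable choices.

```python
import math

# Parts costs from the 50 sampled invoices.
costs = [
    91, 78, 93, 57, 75, 52, 99, 80, 97, 62,
    71, 69, 72, 89, 66, 75, 79, 75, 72, 76,
    104, 74, 62, 68, 97, 105, 77, 65, 80, 109,
    85, 97, 88, 68, 83, 68, 71, 69, 67, 74,
    62, 82, 98, 101, 79, 105, 79, 69, 62, 73,
]

num_classes = 6
raw_width = (max(costs) - min(costs)) / num_classes  # (109 - 52) / 6
class_width = math.ceil(raw_width)                   # rounded up to a convenient width (assumption)
first_lower = 50                                     # lower limit of the first class, as in the table

# Count how many costs fall in each class: 50-59, 60-69, ..., 100-109.
counts = [0] * num_classes
for cost in costs:
    counts[(cost - first_lower) // class_width] += 1

for i, count in enumerate(counts):
    low = first_lower + i * class_width
    print(f"{low}-{low + class_width - 1:<5}{count:>3}")
print("Total", sum(counts))
```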
Example: Relative & % Frequency Distributions

Cost      Relative Frequency   Percent Frequency
50-59     .04                  4
60-69     .26                  26
70-79     .32                  32
80-89     .14                  14
90-99     .14                  14
100-109   .10                  10
Total     1.00                 100
Example:

 Insights Gained from the Percent Frequency Distribution
– Only 4% of the parts costs are in the 50-59 class.
– 30% of the parts costs are under 70.
– The greatest percentage (32% or almost one-third) of the parts costs are in the 70-79 class.
– 10% of the parts costs are 100 or more.
Dot Plot

 One of the simplest graphical summaries of data is a dot plot.
 A horizontal axis shows the range of data
values.
 Then each data value is represented by a dot
placed above the axis.
Example

[Figure: dot plot of the 50 parts costs. Horizontal axis: Cost, from 50 to 110.]
Histogram

 Another common graphical presentation of quantitative data.
 The variable of interest is placed on the horizontal axis.
 A rectangle is drawn above each class
interval with its height corresponding to the
interval’s frequency, relative frequency, or
percent frequency.
 Unlike a bar graph, a histogram has no
natural separation between rectangles of
adjacent classes.
Example:

[Figure: histogram of the parts costs. Horizontal axis: Parts Cost, from 50 to 110; vertical axis: Frequency.]
Cumulative Distributions

 Cumulative frequency distribution: number of items with values less than or equal to the upper limit of each class.
 Cumulative relative frequency distribution: proportion of items with values less than or equal to the upper limit of each class.
 Cumulative percent frequency distribution: percentage of items with values less than or equal to the upper limit of each class.
Example

Cost    Cumulative Frequency   Cumulative Relative Frequency   Cumulative Percent Frequency
≤ 59    2                      .04                             4
≤ 69    15                     .30                             30
≤ 79    31                     .62                             62
≤ 89    38                     .76                             76
≤ 99    45                     .90                             90
≤ 109   50                     1.00                            100
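The cumulative columns are running totals of the class frequencies. A minimal sketch using `itertools.accumulate` from the Python standard library is shown below; the list names are illustrative.

```python
from itertools import accumulate

upper_limits = [59, 69, 79, 89, 99, 109]  # upper limit of each class
frequencies = [2, 13, 16, 7, 7, 5]        # class frequencies for the parts costs

total = sum(frequencies)
cumulative = list(accumulate(frequencies))               # running total of frequencies
cumulative_relative = [c / total for c in cumulative]    # proportion at or below each limit
cumulative_percent = [100 * c / total for c in cumulative]

for limit, c, rel, pct in zip(upper_limits, cumulative,
                              cumulative_relative, cumulative_percent):
    print(f"<= {limit:<5}{c:>4}{rel:>7.2f}{pct:>6.0f}")
```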
Ogive

 Graph of a cumulative distribution.
 Data values are shown on the horizontal axis.
 Shown on the vertical axis are the:
– cumulative frequencies, or
– cumulative relative frequencies, or
– cumulative percent frequencies
 The cumulative frequency (one of the above) of each class is plotted as a point.
 Plotted points are connected by straight lines.
Example:

– Because the class limits for the parts-cost data are 50-59, 60-69, and so on, there appear to be one-unit gaps from 59 to 60, 69 to 70, and so on.
– These gaps are eliminated by plotting points
halfway between the class limits.
– Thus, 59.5 is used for the 50-59 class, 69.5 is
used for the 60-69 class, and so on.
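The points to plot for the ogive can be listed with the sketch below. It pairs each halfway value (upper class limit plus 0.5) with the cumulative percent frequency of that class; it only computes the coordinates and does not draw the graph.

```python
from itertools import accumulate

upper_limits = [59, 69, 79, 89, 99, 109]  # upper limit of each class
frequencies = [2, 13, 16, 7, 7, 5]        # class frequencies for the parts costs

total = sum(frequencies)
cumulative_percent = [100 * c / total for c in accumulate(frequencies)]

# Plot each point halfway between adjacent class limits: 59.5, 69.5, ...
ogive_points = [(limit + 0.5, pct)
                for limit, pct in zip(upper_limits, cumulative_percent)]

for x, y in ogive_points:
    print(f"x = {x:>6}, cumulative % = {y:>5.0f}")
```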
Example: Ogive with Cumulative % Frequencies

[Figure: ogive of the parts costs. Horizontal axis: Parts Cost, from 50 to 110; vertical axis: Cumulative Percent Frequency, from 0 to 100.]
Frequency Polygon
Normal Curve

 The normal curve is a type of frequency curve that occurs when scores are randomly distributed around the mean (normal distribution).
 It is a bell-shaped curve.

[Figure: bell-shaped normal curve, labeled Mean, Median, Mode.]
Normal Curve

 Characteristics
– Fifty percent of the scores fall above the mean and fifty percent fall below the mean
– The mean, median, and mode have the same value
– Most participants score near the mean; the further a score is from the mean, the fewer the participants who attained that score
– Specific percentages of scores fall between ±1 SD, ±2 SD, etc.
The Normal Curve

 Properties: Proportions under the curve
– Mean ±1 SD = 68%
– Mean ±1.96 SD = 95%
– Mean ±2.58 SD = 99%
– Mean ±3 SD = 99.7%
 Significance of the normal curve: irrespective of the actual values of the mean and standard deviation, we can pinpoint the percentage of values that will fall into specific ranges.
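These proportions can be checked numerically with `statistics.NormalDist` (Python 3.8 or later). The helper `proportion_within` is a hypothetical name introduced for this sketch; the second call uses an arbitrary mean and SD to illustrate that the proportions do not depend on them.

```python
from statistics import NormalDist


def proportion_within(k, mu=0.0, sigma=1.0):
    """Proportion of a normal distribution lying within mean ± k·SD."""
    dist = NormalDist(mu=mu, sigma=sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)


for k in (1, 1.96, 2.58, 3):
    print(f"mean ± {k} SD: {100 * proportion_within(k):.1f}%")

# Same proportion for a different (arbitrary) mean and standard deviation.
print(f"{100 * proportion_within(1, mu=100, sigma=15):.1f}%")
```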
68-95-99.7 Rule

[Figure: normal curve with nested bands around the mean labeled 68% of the data, 95% of the data, and 99.7% of the data.]

More words about normal curve

[Figure: normal curve with the areas on each side of the mean labeled 34%, 47%, and 49%.]
Skewed Distributions

 When a distribution is not symmetrical, it is said to be skewed.
 In a skewed distribution the values of the mean, median, and mode are different.
 There are more extreme scores at one end than the other.
 Positively Skewed – many low scores and few high scores
 Negatively Skewed – few low scores and many high
scores
 Relationships between the mean, median, and mode
– Positively skewed – mode is lowest, median is in the
middle, and mean is highest
– Negatively skewed – mean is lowest, median is in the
middle, and mode is highest
Skewed Distributions

 Skewness
sk = 3(mean − median)/SD
If |sk| > 1, the distribution is non-symmetrical.
 Negatively skewed
– Mean < Median < Mode
– sk is negative
[Figure: negatively skewed frequency curve.]
 Positively skewed
– Mean > Median > Mode
– sk is positive
[Figure: positively skewed frequency curve.]
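As an illustration, the skewness coefficient can be computed for the 20 example scores used earlier. This sketch assumes the population standard deviation (dividing by N), matching the earlier worked example; the variable names are illustrative.

```python
import statistics

scores = [73, 66, 69, 67, 49, 60, 81, 71, 78, 62,
          53, 87, 74, 65, 74, 50, 85, 45, 63, 100]

mean = statistics.mean(scores)
median = statistics.median(scores)
sd = statistics.pstdev(scores)  # population SD, as in the variance example

sk = 3 * (mean - median) / sd   # sk = 3(mean - median)/SD

print("sk:", round(sk, 3))
print("direction:", "positive" if sk > 0 else "negative" if sk < 0 else "none")
print("magnitude exceeds 1:", abs(sk) > 1)
```

For these scores the mean is slightly above the median, so the coefficient is positive but small in magnitude.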