
Soc 2003

Statistical Methods and Computer Applications in Social Sciences
18/19 (1)
Lecture 3-4
Dr. Cemre Erciyes
Lecture 1 – Topics covered
 Key concepts: What is statistics? Data?
How is data collected?
 What is population and sample?
Parameter and statistic?
 What methods does statistics provide?
Design, Description, Inference
 Sampling and Measurement: Variable,
Types of Variables- Categorical (Nominal,
Ordinal); Quantitative-Interval Scale
(Continuous, Discrete)
Lecture 1-2- Topics covered
 Frequency distribution
 Frequency, Proportion, Percentage
 Cumulative Frequency, Cumulative
Percentage
 Histogram, Bar chart, Pie chart
 Stem and Leaf Graph
Lecture 2 – Topics Covered
 Measures of Central Tendency
 Mean
 Median (50th percentile, 2nd quartile or Q2)
 Percentile
 Lower and upper quartile (Q1 and Q3 or 25th
and 75th percentile)
 Mode
Assignment 1 and 2
• Collect data from 10 of your classmates for these variables:
 Sex, Year of Birth, Their CGPA
 Prepare a Frequency Table, Histogram or
Bar Graph where appropriate
 Prepare a stem and leaf graph for each
 Calculate the mean where appropriate
 Find the median, lower and upper
quartiles, 30th percentile, 60th percentile
Your data

No Sex Year of Birth CGPA

1
2
3
4
5
6
7
8
9
10
Frequency table of sex

Sex | Frequency (F) | Relative Frequency | Percentage (%)
Female | | |
Male | | |
Total | 10 | 1 | 100
Frequency table of Year of birth
Year of birth | Frequency | Relative Frequency | Cumulative Frequency | Percentage
Frequency table of Grouped CGPA
Grouped CGPA | Frequency | Relative Frequency | Cumulative Frequency | Percentage | Cumulative Percentage
> 2.50
2.01-2.50
1.51-2.00
1.01-1.50
< 1.01
Total
Histogram or Bar chart of sex?
Histogram or Bar chart of Year of
Birth
Histogram or bar chart of
CGPA?
Stem and Leaf Graph of Year of
Birth
Stem and leaf Graph of CGPA
Descriptive statistics for sex
Mean
Median
Mode
Lower quartile
Upper quartile
30th percentile
60th percentile
The range
Descriptive statistics for year of
birth
Mean
Median
Mode
Lower quartile
Upper quartile
30th percentile
60th percentile
The range
Descriptive statistics for CGPA
Mean
Median
Mode
Lower quartile
Upper quartile
30th percentile
60th percentile
The range
Today
 When to use which measure of central
tendency?
 Normal distribution (introduction)
 Measures of variability (Range, Variance, Std. Dev., Empirical Rule)
 Properties of distributions
 Weighted mean
Measures of Central Tendency
 The best way to reduce a set of data and
still retain its information is to summarize it
with a single value.
 Measures of central tendency—mean,
median, and mode—can help you capture,
with a single number, what is typical of a
data set.
Properties of the mean
 The formula for the mean uses numerical values for the observations.
 Mean is appropriate only for quantitative variables.
 It is not sensible to compute the mean for observations on a nominal scale.
 For instance, for religion measured with categories such as (Protestant, Catholic,
Jewish, Other), the mean religion does not make sense, even though these levels
may sometimes be coded by numbers for convenience.
 Similarly, we cannot find the mean of observations on an ordinal rating such as
excellent, good, fair, and poor, unless we assign numbers such as 4, 3, 2,1 to the
ordered levels, treating it as quantitative.
 The mean can be highly influenced by an observation that falls well above or well
below the bulk of the data, called an outlier.
 The mean is pulled in the direction of the longer tail of a skewed distribution,
relative to most of the data.
 The mean is the point of balance on the number line when an equal weight is at
each observation point.
Properties of the median
 If there is an even number of scores, you can either find the arithmetic average of the
two middle scores or calculate the 50th percentile.
 The median, like the mean, is appropriate for quantitative variables. Since it requires
only ordered observations to compute it, it is also valid for ordinal-scale data, as the
previous example showed. It is not appropriate for nominal-scale data, since the
observations cannot be ordered.
 For symmetric distributions, the median and the mean are identical.
 For skewed distributions, the mean lies toward the direction of skew (the longer
tail) relative to the median.
 The median is insensitive to the distances of the observations from the middle, since
it uses only the ordinal characteristics of the data. For example, the following four
sets of observations all have medians of 10:
Set 1: 8, 9, 10, 11, 12
Set 2: 8, 9, 10, 11, 100
Set 3: 0, 9, 10, 10, 10
Set 4: 8, 9, 10, 100, 100
 The median is not affected by outliers. For instance, the incomes of the seven
employees in Example 3.5 have a median of $12,200 whether the largest observation
is $20,000, $215,000, or $2,000,000.
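The four sets above can be checked quickly. The following is a minimal sketch using Python's standard `statistics` module; it only illustrates the point that the median ignores how far the extreme values lie from the middle, while the mean does not.

```python
from statistics import mean, median

# The four sets of observations listed on the slide.
data_sets = {
    "Set 1": [8, 9, 10, 11, 12],
    "Set 2": [8, 9, 10, 11, 100],
    "Set 3": [0, 9, 10, 10, 10],
    "Set 4": [8, 9, 10, 100, 100],
}

medians = {name: median(values) for name, values in data_sets.items()}
means = {name: mean(values) for name, values in data_sets.items()}

for name in data_sets:
    # The median is 10 every time; the mean moves with the extreme values.
    print(f"{name}: median = {medians[name]}, mean = {means[name]}")
```

The medians all equal 10, while the means range from 7.8 (Set 3) to 45.4 (Set 4).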
Normal distribution

In a normal distribution, mean, median and mode are identical in value.
Normal distribution
 Often just called the bell-curve or bell-shaped curve.
Most of the scores in this graph accumulate around the
middle.
 The mean, median and mode are all equal, and the
scores at either end of the distribution occur less often.
 For example, a curve representing the results of an
intelligence test would have the largest number of people
in the middle, around the 'average' intelligence range.
The number of people decreases as the scores get farther
away on either side of the average, giving the curve its
shape and name.
Measures of Variability
Measures of variability describe the spread of the data
Measures of Variability
 The range of a set of data is the difference between the highest and
lowest values in the set. To find the range, first order the data from
least to greatest. Then subtract the smallest value from the largest
value in the set.
 The variance averages the squared deviations about the mean.
 Its square root, the standard deviation, is easier to interpret,
describing a typical distance from the mean.
 The Empirical Rule states that for a bell-shaped distribution, about
68% of the observations fall within one standard deviation of the
mean, about 95% fall within two standard deviations, and nearly all,
if not all, fall within three standard deviations.
Range
 Range = difference between highest and lowest observed values
 The range value of a data set is greatly influenced by the presence of just
one unusually large or small value (outlier).
 The range can be expressed as an interval such as 4–10, where 4 is the
lowest value and 10 is highest. Often, it is expressed as interval width. For
example, the range of 4–10 can also be expressed as a range of 6.
 The disadvantage of using range is that it does not measure the spread of
the majority of values in a data set—it only measures the spread between
highest and lowest values.
 Other measures are required in order to give a better picture of the data
spread.
 The range is an informative tool used as a supplement to other measures
such as the standard deviation or semi-interquartile range, but it should
rarely be used as the only measure of spread.
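A short sketch of how one outlier drives the range. The helper name `value_range` is an illustrative choice, and the two data sets are Set 1 and Set 2 from the median slide, reused here as an assumption about what makes a good example.

```python
def value_range(values):
    """Return the range: largest value minus smallest value."""
    return max(values) - min(values)

without_outlier = [8, 9, 10, 11, 12]
with_outlier = [8, 9, 10, 11, 100]   # same data, one unusually large value

range_without = value_range(without_outlier)
range_with = value_range(with_outlier)

print(range_without)  # 4
print(range_with)     # 92
```

Four of the five values are identical in both sets, yet the range jumps from 4 to 92, which is why the range should be read alongside another measure of spread.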
Variance



Variance
 The variance averages the squared
deviations about the mean.
 The variance of a data set tells you how
spread out the data points are. The closer
the variance is to zero, the more closely
the data points are clustered together.
The expression ∑ (xi - x̄)² in these formulas is called a sum of squares.
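Why do we subtract one when calculating sample variance?
 Remember that a sample is only part of the population and isn't actually the whole picture.
 Deviations are measured from the sample mean x̄ rather than from the unknown population mean, so the squared deviations tend to come out too small.
 Statisticians compensate by dividing the sum of squares by n - 1 instead of n, that is, by subtracting one from the total number of observations in the data set.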
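One way to see the effect of subtracting one is to list every possible sample from a tiny population and average the sample variances. The sketch below assumes a hypothetical population of four values (not from the lecture) and samples of size 2 drawn with replacement; the helper name `variance` is also illustrative.

```python
from fractions import Fraction
from itertools import product

def variance(values, divisor):
    """Sum of squared deviations about the mean of `values`, divided by `divisor`."""
    m = Fraction(sum(values), len(values))
    return sum((x - m) ** 2 for x in values) / divisor

population = [2, 4, 6, 8]   # hypothetical population, chosen only for illustration
n = 2                       # sample size

# Population variance: divide by the population size.
population_variance = variance(population, len(population))

# All 16 ordered samples of size 2, drawn with replacement.
samples = list(product(population, repeat=n))

avg_dividing_by_n = sum(variance(s, n) for s in samples) / len(samples)
avg_dividing_by_n_minus_1 = sum(variance(s, n - 1) for s in samples) / len(samples)

print(population_variance)        # 5
print(avg_dividing_by_n)          # 5/2
print(avg_dividing_by_n_minus_1)  # 5
```

Averaged over all samples, dividing by n gives 2.5, half the population variance of 5, while dividing by n - 1 recovers 5 exactly in this setup.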
Variance - let's calculate
 You want to analyze the number of muffins sold each day at a cafeteria. You sample
six days at random and get these results: 7, 15, 23, 7, 9, 13.
 What can you say about this data immediately?
Example
 7,15, 23, 7, 9, 13
 Median: 11 (the average of the two middle values, 9 and 13)
 Mode: 7
 Range: 16
Which one to use?
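These quick summaries can be reproduced with the standard `statistics` module. This is a minimal sketch; the variable names are illustrative.

```python
from statistics import median, mode

muffins = [7, 15, 23, 7, 9, 13]

ordered = sorted(muffins)                      # [7, 7, 9, 13, 15, 23]
muffin_median = median(muffins)                # average of the two middle values
muffin_mode = mode(muffins)                    # most frequent value
muffin_range = max(muffins) - min(muffins)     # highest minus lowest

print(ordered)
print(muffin_median)  # 11.0
print(muffin_mode)    # 7
print(muffin_range)   # 16
```

With six observations there is no single middle value, so the median is the average of the third and fourth ordered values.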
Sample variance 1
 xi => x̄=
 x1=
 x2=
 x3=
 x4=
 x5=
 x6=

 n= n-1=
Sample variance 1 - solution:
 xi => x̄ = (7 + 7 + 9 + 13 + 15 + 23)/6 = 74/6 ≈ 12.33
 x1 = 7
 x2 = 7
 x3 = 9
 x4 = 13
 x5 = 15
 x6 = 23

 n = 6, n - 1 = 5
Sample variance 2:
 (xi - x̄) => (xi - x̄)² =>
 x1 - x̄ =        (x1 - x̄)² =
 x2 - x̄ =        (x2 - x̄)² =
 x3 - x̄ =        (x3 - x̄)² =
 x4 - x̄ =        (x4 - x̄)² =
 x5 - x̄ =        (x5 - x̄)² =
 x6 - x̄ =        (x6 - x̄)² =

∑ (xi - x̄)² =
Sample variance 2 - solution:
 (xi - x̄) => (xi - x̄)² =>
 x1 - x̄ = 7 - 12.33 = -5.33        (x1 - x̄)² = 28.44
 x2 - x̄ = 7 - 12.33 = -5.33        (x2 - x̄)² = 28.44
 x3 - x̄ = 9 - 12.33 = -3.33        (x3 - x̄)² = 11.11
 x4 - x̄ = 13 - 12.33 = 0.67        (x4 - x̄)² = 0.44
 x5 - x̄ = 15 - 12.33 = 2.67        (x5 - x̄)² = 7.11
 x6 - x̄ = 23 - 12.33 = 10.67       (x6 - x̄)² = 113.78

 ∑ (xi - x̄)² ≈ 189.33 (computed from the unrounded deviations)
 Sum of squares represents squaring each deviation and then adding those squares. It is incorrect to first add the deviations and then square that sum; this gives a value of 0.
 The larger the deviations, the larger the sum of squares and the larger s tends to be.
Sample variance 3:
 s² = ∑ (xi - x̄)² / (n - 1)
 s² =
Sample variance 3 - solution:
 s² = ∑ (xi - x̄)² / (n - 1)
 s² = 189.33/5 ≈ 37.87
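The three steps (mean, squared deviations, division by n - 1) can be recomputed directly from the six observations. The sketch below follows the hand calculation step by step and then cross-checks against `statistics.variance`, which also divides by n - 1; the variable names are illustrative.

```python
from statistics import variance

muffins = [7, 15, 23, 7, 9, 13]

# Step 1: sample mean.
n = len(muffins)
x_bar = sum(muffins) / n

# Step 2: deviations about the mean, then the sum of their squares.
deviations = [x - x_bar for x in muffins]
sum_of_squares = sum(d ** 2 for d in deviations)

# Step 3: divide the sum of squares by n - 1.
s2 = sum_of_squares / (n - 1)

print(round(x_bar, 2))
print(round(sum_of_squares, 2))
print(round(s2, 2))
print(round(variance(muffins), 2))   # library result, for comparison
print(round(sum(deviations), 10))    # the deviations themselves always sum to zero
```

The last line shows why the deviations must be squared before they are added: unsquared, the negative and positive deviations cancel.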
Solve yourself:
 A random sample of 10 American college
students reported sleeping 7, 6, 8, 4, 2, 7,
6, 7, 6, 5 hours, respectively. What is the
sample variance?
Answer:
 s² = 3.067
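To check your own hand calculation, the same figure can be obtained with `statistics.variance`, which uses the n - 1 divisor. A minimal sketch:

```python
from statistics import mean, variance

hours = [7, 6, 8, 4, 2, 7, 6, 7, 6, 5]

sleep_mean = mean(hours)
sleep_s2 = variance(hours)   # sample variance, divides by n - 1

print(round(sleep_mean, 1))  # 5.8
print(round(sleep_s2, 3))    # 3.067
```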
Spread of data: compare distributions with different s²
Standard deviation



Standard deviation
 The square root of the variance, the standard
deviation, is easier to interpret, describing a
typical distance of an observation from the
mean.
 The larger the standard deviation s, the
greater the spread of the data.
Quiz
 Quiz scores for two small groups of
students are as such:
Group 1: 0, 4, 4, 5, 7,10
Group 2: 0,0,1,9,10,10

 Compare the distribution of quiz scores of each group.
What will you calculate and compare? Standard deviation
 The variance equals 11.2 and standard
deviation equals s = 3.3 for sample 1.
 For sample 2, variance is 26.4 and
standard deviation is 5.1.
 Since 3.3 < 5.1 (s for sample 1 is smaller
than s for sample 2) the standard
deviations tell us that sample 1 is less
variable than sample 2.
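The comparison can be reproduced with the standard `statistics` module. In this sketch the dictionary `summary` and the variable names are illustrative; `variance` and `stdev` both use the n - 1 divisor.

```python
from statistics import mean, stdev, variance

group_1 = [0, 4, 4, 5, 7, 10]
group_2 = [0, 0, 1, 9, 10, 10]

summary = {}
for name, scores in (("Group 1", group_1), ("Group 2", group_2)):
    summary[name] = {
        "mean": mean(scores),
        "variance": round(variance(scores), 1),
        "std_dev": round(stdev(scores), 1),
    }
    print(name, summary[name])
```

Both groups have the same mean of 5, so the difference between them shows up only in the spread: s = 3.3 for Group 1 against s = 5.1 for Group 2.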
Properties of the Standard Deviation

 s ≥ 0.
 s = 0 only when all observations have the same value.
 For instance, if the ages for a sample of five students are
19,19,19,19,19, then the sample mean equals 19, each of the
five deviations equals 0, and s = 0. This is the minimum possible
variability.
 The greater the variability about the mean, the larger is
the value of s.
 If the data are rescaled, the standard deviation is also
rescaled.
 For instance, if we change annual incomes from dollars (such as
34,000) to thousands of dollars (such as 34.0), the standard
deviation is also divided by 1000 (such as from 11,800
to 11.8).
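Two of these properties are easy to check numerically. The sketch below uses the five identical ages from the slide and a hypothetical list of incomes (the income values are made up for illustration only).

```python
from statistics import stdev

# Identical observations: every deviation is 0, so s = 0.
ages = [19, 19, 19, 19, 19]
s_ages = stdev(ages)

# Hypothetical annual incomes in dollars, invented for this example.
incomes_dollars = [22000, 34000, 41000, 29000, 50000]
incomes_thousands = [x / 1000 for x in incomes_dollars]

s_dollars = stdev(incomes_dollars)
s_thousands = stdev(incomes_thousands)

print(s_ages)
print(round(s_dollars, 1), round(s_thousands, 4))
print(round(s_dollars / s_thousands, 6))   # ratio of the two standard deviations
```

Dividing every observation by 1000 divides s by 1000 as well, so the ratio printed on the last line is 1000.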
Lecture 4
Same mean different variability
Same variability different mean
 Variability provides a quantitative
measure of the degree to which
scores in a distribution are spread out
or clustered together.
In other words, variability refers to the
degree of “differentness” of the scores in
the distribution.
Empirical rule
Empirical rule
 A distribution with s = 5.1 has greater variability than one with s =
3.3, but how do we interpret how large s = 5.1 is?
 We’ve seen that a rough answer is that s is a typical distance of an
observation from the mean.
 To illustrate, suppose the first exam in your course, graded on a
scale of 0 to 100, has a sample mean of 77. A value of s = 0 is
unlikely, since every student must then score 77. A value such as s =
50 seems implausibly large for a typical distance from the mean.
Values of s such as 8 or 12 seem much more realistic.
 More precise ways to interpret s require further knowledge of the
shape of the frequency distribution.
 The following rule provides an interpretation for many data sets.
Empirical rule
 If the histogram of the data is approximately
bell shaped, then
 1. About 68% of the observations fall between x̄ - s and x̄ + s.
 2. About 95% of the observations fall between x̄ - 2s and x̄ + 2s.
 3. All or nearly all observations fall between x̄ - 3s and x̄ + 3s.
 The rule is called the Empirical Rule because
many distributions seen in practice (that is,
empirically) are approximately bell shaped.
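For an exactly normal (bell-shaped) curve, the three percentages can be computed rather than memorized. The sketch below assumes a standard normal distribution and uses `statistics.NormalDist` from the Python standard library; the helper name `share_within` is illustrative. Real data that are only approximately bell shaped will give percentages close to, not equal to, these.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

shares = {k: share_within(k) for k in (1, 2, 3)}

for k, share in shares.items():
    print(f"within {k} standard deviation(s): {share * 100:.1f}%")
```

The results are about 68.3%, 95.4% and 99.7%, which the Empirical Rule rounds to 68%, 95% and "all or nearly all".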
Empirical rule visual
 The Empirical Rule applies only to distributions that are
approximately bell shaped.
 For other shapes, the percentage falling within two standard
deviations of the mean need not be near 95%.
 It could be as low as 75% or as high as 100%.
 The Empirical Rule may not work well if the distribution is highly skewed
or if it is highly discrete, with the variable taking few values.
 The exact percentages depend on the form of the distribution.
Quiz - 2

Properties of distributions
Properties of distributions
 Distributions are typically described with
three properties:
 Center: mean, median, mode
 Spread (variability): standard deviation,
variance
 Shape: unimodal, symmetric, skewed, etc.
Shapes of Frequency Distributions

 Unimodal, bimodal, and rectangular
Shapes of Frequency Distributions
 Symmetrical and skewed distributions

 Normal and kurtotic distributions


What can you say about these?
Variability of a distribution

• High variability means that the scores differ by a lot
• Low variability means that the scores are all similar
Which center when?
 Depends on a number of factors, like scale of
measurement and shape.
 The mean is usually the preferred measure, and it is
closely related to measures of variability
 However, there are times when the mean isn’t the
appropriate measure.
 Use the median if:
 The distribution is skewed
 The distribution is ‘open-ended’
 (e.g. your top answer on your questionnaire is ‘5 or more’)
 Data are on an ordinal scale (rankings)
 Use the mode if the data are on a nominal scale
Weighted mean

Denote the sample means for two sets of data with sample sizes n1 and n2 by x̄1 and x̄2. The overall sample mean for the combined set of (n1 + n2) observations is the weighted average
x̄ = (n1 x̄1 + n2 x̄2) / (n1 + n2)
The numerator n1 x̄1 + n2 x̄2 is the sum of all the observations, since n x̄ = ∑x for each set of observations.
The denominator is the total sample size.
Example of two groups of students asked to state their number of shoes.
 Number of shoes:
 30, 11, 12, 20, 14, 12, 15, 8, 6, 8, 10, 15, 25, 6, 35, 20, 20, 20, 5,
7, 5, 5, 5, 25, 15
Group 1 (n = 5): x̄1 = ∑x / n = (5 + 7 + 5 + 5 + 5) / 5 = 27 / 5 = 5.4
Group 2 (n = 20): x̄2 = ∑x / n = 327 / 20 = 16.35
 Suppose we want the mean of the entire group. Can we simply add the two means together and divide by 2?
 NO. Why not?
The Weighted Mean
 Number of shoes:
 30, 11, 12, 20, 14, 12, 15, 8, 6, 8, 10, 15, 25, 6, 35, 20, 20, 20, 5,
7, 5, 5, 5, 25, 15
x̄1 = 5.4 (n1 = 5), x̄2 = 16.35 (n2 = 20)
x̄ = (x̄1 n1 + x̄2 n2) / (n1 + n2) = (5.4 * 5 + 16.35 * 20) / (5 + 20) = 14.16
 Suppose we want the mean of the entire group. Can we simply add the two means together and divide by 2?
 NO. Why not? Need to take into account the number of scores in each mean.
The Weighted Mean
 Number of shoes:
 30, 11, 12, 20, 14, 12, 15, 8, 6, 8, 10, 15, 25, 6, 35, 20, 20, 20, 5,
7, 5, 5, 5, 25, 15
x̄ = (x̄1 n1 + x̄2 n2) / (n1 + n2) = (5.4 * 5 + 16.35 * 20) / (5 + 20) = 14.16
Let's check: x̄ = ∑x / n = 354 / 25 = 14.16
 Both ways give the same answer
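The check can also be run in code. In this sketch the split into groups is an assumption inferred from the group sums on the slides: the five values 5, 7, 5, 5, 5 form the first group (sum 27) and the remaining twenty values form the second (sum 327). Variable names are illustrative.

```python
shoes = [30, 11, 12, 20, 14, 12, 15, 8, 6, 8, 10, 15, 25, 6, 35,
         20, 20, 20, 5, 7, 5, 5, 5, 25, 15]

# Assumed grouping: the run 5, 7, 5, 5, 5 is group 1, everything else is group 2.
group_1 = shoes[18:23]
group_2 = shoes[:18] + shoes[23:]

n1, n2 = len(group_1), len(group_2)
mean_1 = sum(group_1) / n1
mean_2 = sum(group_2) / n2

# Weighted mean: each group mean counts in proportion to its sample size.
weighted_mean = (n1 * mean_1 + n2 * mean_2) / (n1 + n2)

# Direct check: the mean of all 25 observations.
overall_mean = sum(shoes) / len(shoes)

# The tempting shortcut, which ignores the group sizes.
simple_average = (mean_1 + mean_2) / 2

print(mean_1, mean_2)
print(round(weighted_mean, 2), round(overall_mean, 2))
print(round(simple_average, 3))
```

The weighted mean and the overall mean agree at 14.16, while the simple average of the two means is 10.875, because it gives the 5 students in the first group the same influence as the 20 students in the second.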



Assignment 3
 With your own data
 Calculate the measures of variability where
appropriate
 Describe the properties of the distribution
 Divide your data into groups of 4 and 6
respectively, and compare means of each group
 Calculate the weighted mean for the two groups
