
Statistics Week 1

Basic Definitions
Descriptive vs. Inferential Statistics
Descriptive Statistics describe, display, and summarize data. You describe data you already have.

• The goal is to turn a bunch of confusing data into useful information.

• It involves summary statistics like means and medians, graphical displays like charts and graphs, and good organization.

Inferential Statistics is when you use data you do have to make educated guesses about data you don’t have. You are making inferences about that missing data.

• Using past data to predict future data is common, as is using one group to make predictions about another. You can use present data to “predict” the past, too.

Populations vs. Samples


The population is the entire group you wish to study. Note that you get to decide what the population is: you can study small populations or large ones.

• E.g. All women in Canada, all Ryerson Students, all NHL teams.

A parameter is a measure describing the population.

• E.g. The average goals scored per game for all NHL teams was 2.73.

A sample is a subset of the population that you are collecting data from. You can then use inferential statistics to talk about the population.

• E.g. You interview 200 randomly selected Ryerson students.

A statistic is a measure describing the sample.

• E.g. The average height of students in your sample was 5’9”.

Quantitative vs. Qualitative Data


Quantitative data uses numbers.

• E.g. I have 3 cats.

Qualitative data does not.

• E.g. My cats’ names are Mats, Bindi, and Twiggy.

Continuous vs. Discrete Data


A continuous variable can take on any value in its range.

• E.g. Height: a person can be 180 cm, 180.5 cm, 180.2867 cm, and so on.

A discrete variable can only take on certain values (usually whole numbers).

• E.g. Shoe Size: You can buy a size 10 shoe, or a size 10 ½, but you cannot buy a size 10.35.
An Example of Statistics
Source: Coleman, Michael. “Journal Poll: Clinton Still Ahead in NM.” Albuquerque Journal 5 Nov. 2016

https://www.abqjournal.com/883092/clinton-still-ahead-in-new-mexico.html

The Journal Poll of likely New Mexico voters, conducted Nov. 1-3, showed
Clinton leading Trump 45 percent to 40 percent.

Johnson, the Libertarian Party candidate and former two-term New Mexico
governor, pulled 11 percent support in the new Journal Poll, compared with 24
percent of New Mexicans who supported him in the newspaper’s late September
poll.

Green Party candidate Jill Stein polled at 3 percent among likely New Mexico voters in the new poll. Only 2 percent of New Mexico voters were undecided or unsure for whom they would vote.

[…]

The Journal Poll is based on a scientific, statewide sample of 504 voters who said they had already voted this year or planned to vote. Most voters surveyed cast ballots in either the 2012 or 2014 general elections; a small portion of newly registered voters were also included in the sample.

The poll was conducted Nov. 1 through Nov. 3. The full voter sample has a margin of error of plus or minus 4.4 percentage points. The margin of error grows for subsamples.
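
As a rough check on the quoted margin of error, the sketch below applies the usual textbook formula for a proportion to a sample of 504. It assumes a simple random sample, a 95% confidence level (z ≈ 1.96), and the worst-case proportion of 50%. The article does not say how the Journal Poll computed its figure, so these are assumptions, not the pollster's method, and `margin_of_error` is a hypothetical helper name.

```python
from math import sqrt


def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a sample proportion.

    Assumptions (not stated in the article): simple random sample,
    95% confidence (z = 1.96), worst-case proportion p = 0.5.
    """
    return z * sqrt(p * (1 - p) / n)


moe = margin_of_error(504)
print(f"{moe * 100:.1f} percentage points")  # 4.4 percentage points

# A subsample is smaller, so its margin of error is larger.
print(margin_of_error(100) > margin_of_error(504))  # True
```

Under these assumptions the result matches the reported plus or minus 4.4 points, and the second line shows why the margin "grows for subsamples": n is in the denominator.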
Graphical Presentation of Data

Describing Qualitative Data

To describe qualitative data, you can use tables or charts.

If you can collect the frequency of particular responses, you can illustrate those frequencies using pie charts and bar charts.

• Bar charts make it easier to compare the size of one group to the size of another.

• Pie charts make it easier to compare the size of one group to the size of the whole.

To compare two variables, you can use contingency tables or you can use bar charts with multiple bars per group.

Describing Quantitative Data

There are many ways of describing quantitative data. We will cover a few.

• To describe time series data, line charts work well.

• E.g. Viewing a stock price over time

• To look for correlations and see how one variable influences another, use scatter charts.

• Display comparisons between groups using pie charts and bar charts.

• To illustrate the distribution of a set of data, you can group the data using frequency tables and use those to make histograms, a special kind of bar chart.
Frequency Tables

The Procedure

1. First, decide how many groups to collect your data into. Usually 5-10. If you have a lot of data, you can use more.

• These groups will be called classes.

• Depending on the rounding later on, this number might change.

2. Next, find the maximum and minimum values and subtract to get the range.

3. Divide the range by the number of classes and round to a “nice number” to get the class width. “Nice numbers” are 1, 2, 2.5, and 5, multiplied or divided by 10 as many times as you need.

• E.g. 10, 20, 25, 50, 100, 200, 250, 500, 0.1, 0.2, 0.25, 0.5, 0.0001, 0.0002, 0.00025, 0.0005.

4. Then, define your class boundaries. Start at a multiple of the class width at or immediately below the minimum value, then repeatedly add the class width until you get above the maximum value.

5. Use those class boundaries to define the classes.

• E.g. 20 to 30, 30 to 40, 40 to 50, etc.

• When working in Excel, classes always include their upper boundary and not their lower. (The example here does the opposite: “165 to under 170” includes 165 but not 170.)

6. Count the number of data points in each class. These are the frequencies. Put these into a table.

An Example

Start with the following height data (in cm):

182, 187, 194, 168, 181, 174, 168, 174, 178, 179, 190, 165.

1. Since we don’t have a lot of data, let’s aim for 5 classes.

2. The max is 194 and the min is 165, so the range is 29.

3. Dividing 29 by 5 gets us 5.8, so we could round down and use a class width of 5 (we may end up with more than 5 classes) or round up to 10 (and may end up with fewer than 5 classes). Let’s use 5.

4. So, starting at 165, we repeatedly add 5 to get class boundaries of 165, 170, 175, 180, 185, 190, 195.

5. This means we end up with the following classes:

165 to under 170, 170 to under 175, 175 to under 180, 180 to under 185, 185 to under 190, 190 to under 195.

6. Let’s do some counting and make our chart!

Height in cm.       Frequency
165 to under 170    3
170 to under 175    2
175 to under 180    2
180 to under 185    2
185 to under 190    1
190 to under 195    2
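
The boundary and counting steps can be sketched in a few lines of Python. This is only an illustration using the class width of 5 chosen in the example. `frequency_table` is a hypothetical helper, and it follows the "lower boundary included, upper boundary excluded" convention of the handout's classes rather than Excel's.

```python
heights = [182, 187, 194, 168, 181, 174, 168, 174, 178, 179, 190, 165]


def frequency_table(data, width):
    """Return a list of (lower, upper, frequency) for classes of equal width.

    Each class includes its lower boundary and excludes its upper one,
    i.e. "lower to under upper".
    """
    # Start at the multiple of the class width at or just below the minimum.
    start = (min(data) // width) * width
    # Keep adding the class width until we get above the maximum.
    boundaries = [start]
    while boundaries[-1] <= max(data):
        boundaries.append(boundaries[-1] + width)
    # Count the data points that fall in each class.
    return [
        (lower, upper, sum(lower <= x < upper for x in data))
        for lower, upper in zip(boundaries, boundaries[1:])
    ]


for lower, upper, count in frequency_table(heights, 5):
    print(f"{lower} to under {upper}: {count}")
```

Choosing the class width is left to the caller, because rounding to a "nice number" is a judgment call rather than a fixed rule.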
Expanding the Frequency Table
We can expand frequency tables to include relative frequencies (the frequency as a % of the whole),
cumulative frequencies (the running total of the frequencies), or even relative cumulative frequencies
(the running total of the relative frequencies).

Height in cm.       Frequency   Relative Frequency   Cumulative Frequency   Relative Cumulative Frequency
165 to under 170    3           25% (3/12)           3                      25% (3/12)
170 to under 175    2           16.67% (2/12)        5                      41.67% (5/12)
175 to under 180    2           16.67% (2/12)        7                      58.33% (7/12)
180 to under 185    2           16.67% (2/12)        9                      75% (9/12)
185 to under 190    1           8.33% (1/12)         10                     83.33% (10/12)
190 to under 195    2           16.67% (2/12)        12                     100% (12/12)
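
The extra columns follow mechanically from the frequencies. As a minimal sketch, using only the Python standard library and the counts from the height table (the variable names are illustrative):

```python
from itertools import accumulate

classes = [
    "165 to under 170", "170 to under 175", "175 to under 180",
    "180 to under 185", "185 to under 190", "190 to under 195",
]
frequencies = [3, 2, 2, 2, 1, 2]

total = sum(frequencies)                                # 12 data points
relative = [f / total for f in frequencies]             # share of the whole
cumulative = list(accumulate(frequencies))              # running total
relative_cumulative = [c / total for c in cumulative]   # running share

for label, f, r, c, rc in zip(
    classes, frequencies, relative, cumulative, relative_cumulative
):
    print(f"{label:<18}{f:>3}{r:>9.2%}{c:>4}{rc:>9.2%}")
```

The last cumulative frequency always equals the number of data points, and the last relative cumulative frequency is always 100%, which makes a handy check on the arithmetic.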

In-Class Exercise
We’ll perform a quick survey in class to gather data on how many hours of sleep Ryerson students get
per night. This is not a randomized sample, so it won’t be representative and we couldn’t use it for
inferential statistics. Luckily for us, frequency tables are descriptive statistics so we’re in the clear. Follow
the six steps to making a frequency table, then fill in the other columns.

Sleep Last Night (in hrs)   Frequency   Relative Frequency   Cumulative Frequency   Relative Cumulative Frequency
Making Charts from Frequency Tables
A Histogram
A histogram is what you get when you convert a frequency table to a bar chart.

An Ogive
An ogive is what you get when you convert a cumulative frequency table or a relative cumulative
frequency table (or both) to a line chart.

[Figure: Ogive of Height. Horizontal axis: Height in cm. (165 to 195). Left vertical axis: Cumulative Frequency (0 to 12). Right vertical axis: Cumulative Relative Frequency (0% to 100%).]
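
Both charts can be derived from the height frequency table. The sketch below prints a text version of the histogram and computes the points an ogive would connect. It assumes the ogive starts at 0 at the first lower boundary and plots each cumulative frequency at the upper boundary of its class, which is a common convention but not one the handout spells out.

```python
from itertools import accumulate

boundaries = [165, 170, 175, 180, 185, 190, 195]
frequencies = [3, 2, 2, 2, 1, 2]

# Histogram: one bar per class, with bar length equal to the frequency.
for lower, upper, freq in zip(boundaries, boundaries[1:], frequencies):
    print(f"{lower} to under {upper}: {'#' * freq}")

# Ogive: (boundary, cumulative frequency) pairs, starting from zero
# at the first lower boundary (an assumed plotting convention).
cumulative = [0] + list(accumulate(frequencies))
ogive_points = list(zip(boundaries, cumulative))
print(ogive_points)
```

Passing `ogive_points` to any line-chart tool, or the class counts to any bar-chart tool, would produce the graphical versions.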
Summary Statistics
A summary statistic is a single number that summarizes something about the distribution of your data.
The two major sorts of summary statistics that we need to look at are measures of central tendency and
measures of variability (aka measures of dispersion).

Measures of Central Tendency

A measure of central tendency answers the question “What does the typical example of this look like?”

The mean: add up all the data and divide by however many numbers there are.

The median: find the middle value.

• If there are an even number of data points, take the mean of the middle two.

The mode: the most common value.

• Sometimes there is more than one mode if there is a “tie,” or even no mode if there are no repeated values.

The mid-range: the value halfway between the largest and smallest values. This is almost never used.

Examples

Again, start with the following height data (in cm):

182, 187, 194, 168, 181, 174, 168, 174, 178, 179, 190, 165.

To get the mean, we calculate:

(182 + 187 + 194 + 168 + 181 + 174 + 168 + 174 + 178 + 179 + 190 + 165)/12 = 178.33

To get the median, we put the numbers in order:

165, 168, 168, 174, 174, 178, 179, 181, 182, 187, 190, 194

Then take the mean of the middle two:

(178 + 179)/2 = 178.5

To get the mode(s), we simply find the most common value(s): 168 and 174

To get the mid-range, we take the mean of the largest and smallest values: (165 + 194)/2 = 179.5
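
The same four measures can be checked with Python's standard `statistics` module. This is a sketch on the height data above; the mid-range has no built-in function, so it is computed by hand.

```python
from statistics import mean, median, multimode

heights = [182, 187, 194, 168, 181, 174, 168, 174, 178, 179, 190, 165]

mean_height = mean(heights)                      # sum divided by count
median_height = median(heights)                  # mean of the middle two here
modes = multimode(heights)                       # every most-common value
mid_range = (min(heights) + max(heights)) / 2    # halfway between extremes

print(round(mean_height, 2))  # 178.33
print(median_height)          # 178.5
print(modes)                  # [168, 174]
print(mid_range)              # 179.5
```

One difference in convention: when no value repeats, `multimode` returns every value (they all tie), whereas the notes above would say there is no mode.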

In-Class Exercise
Given the five ages of my family, find the mean, median, mode, and mid-range: 31, 29, 7, 6, and 2.

Mean: Median: Mode: Mid-Range:


Measures of Variability
A measure of variability answers the question “How spread out is the data?”

• The range is the difference between the largest value in a set of data and the smallest.

• The variance and standard deviation provide more informative measures of variability, because they use every data point rather than just the two extremes.

• We can compute a sample standard deviation (𝑠) or a population standard deviation (𝜎). Similarly, we can find a sample variance (𝑠²) or a population variance (𝜎²). The process is the same except for one step.

The Procedure

The process to calculate the (population or sample) variance takes five steps. To calculate the (population or sample) standard deviation, we add a sixth step.

1. Find the mean, 𝑋̅.

2. Find the deviations by subtracting the mean from each value: (𝑋 − 𝑋̅).

3. Square the deviations: (𝑋 − 𝑋̅)².

4. Sum the squared deviations: ∑(𝑋 − 𝑋̅)².

5. Divide by the size of the population, 𝑁, to get a population variance (𝜎²), or by the sample size minus one, 𝑛 − 1, to get a sample variance (𝑠²).

6. Take the square root of the variance to get the standard deviation (𝜎 or 𝑠).

An Example

Starting with a set of data: 16, 19, 20, 22, 23. We will compute a sample standard deviation.

Step 1: We find the mean 𝑋̅ = 20.

Steps 2-4: Fill in the following table:

𝑋        𝑋 − 𝑋̅     (𝑋 − 𝑋̅)²
16       -4        16
19       -1        1
20       0         0
22       2         4
23       3         9
Total    0         30

Step 5: Take that total, 30, and divide by 𝑛 − 1. Here we had 5 data points, so 𝑛 − 1 = 4, and 30/4 = 7.5. This is the sample variance 𝑠².

Step 6: Take the square root of 𝑠² to get the sample standard deviation 𝑠.

𝑠 = √7.5 ≈ 2.74
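
The six steps translate directly into code. The sketch below reproduces the worked example, and also shows the population version to highlight the one step that differs (the divisor in step 5). The variable names are illustrative.

```python
from math import sqrt

data = [16, 19, 20, 22, 23]

mean = sum(data) / len(data)                     # Step 1: the mean
deviations = [x - mean for x in data]            # Step 2: X minus the mean
squared = [d ** 2 for d in deviations]           # Step 3: square each deviation
total = sum(squared)                             # Step 4: sum of squares

sample_variance = total / (len(data) - 1)        # Step 5: divide by n - 1
population_variance = total / len(data)          # Step 5: divide by N instead

sample_sd = sqrt(sample_variance)                # Step 6: square root
population_sd = sqrt(population_variance)

print(sample_variance, round(sample_sd, 2))      # 7.5 2.74
print(population_variance)                       # 6.0
```

Python's `statistics` module offers the same results through `variance`, `stdev`, `pvariance`, and `pstdev`, which is a convenient way to check hand calculations.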

One Final In-Class Exercise


Given the five ages of my family, find the population variance and population standard deviation:

𝑋        𝑋 − 𝑋̅     (𝑋 − 𝑋̅)²
31
29
7
6
2
Total

Population Variance (𝜎²):

Population Standard Deviation (𝜎):
