
MSc in Computing (Data Analytics)

Probability & Statistical Inference


Lecture 1
Lecture Outline
Introduction
General Info
Questionnaire

Introduction to Statistics
Statistics at work
The Analytics Process
Descriptive Statistics & Distributions
Graphs and Visualisation



Introduction
Name: Aoife D'Arcy
Email: aoife@theanalyticsstore.com
Bio: Managing Director and Chief Consultant at the Analytics
Store, with degrees in statistics, computer science, and financial
& industrial mathematics. With over 12 years of experience in
analytics consultancy with major national and international
companies in banking, finance, insurance, manufacturing and
gaming, I have developed particular expertise in risk
analytics, fraud analytics, and customer insight analytics.

Lecture Notes: Will be available online on
www.comp.dit.ie/bmacnamee and later on webcourses
Course Outline
Week Topic
1 Introduction to Statistics
2 & 3 Probability Theory
4 Introduction to SAS Enterprise Guide
5 Probability Distributions
6 Confidence Intervals
7 & 8 Hypothesis testing
9 Assignment
10 - 12 Regression Analysis
13 Revision
Exam & Assignment
Exam
The end of term exam accounts for 60% of the
overall mark

Assignment
The assignment is worth 40% of the overall mark.
The assignment will be handed out in week 5
Week 9's class will be dedicated to working on the
assignment.

Software
SAS Enterprise Guide will be used throughout the
course.
Recommended Reading
Applied Statistics and Probability for Engineers, Douglas C. Montgomery (John Wiley & Sons)
Probability and Statistics for Engineers and Scientists, R.E. Walpole, R.H. Myers, S.L. Myers, K. Ye (Pearson Education)
Modelling Binary Data, David Collett (Chapman & Hall)
Probability and Random Processes, G. Grimmett & D. Stirzaker (Oxford University Press)
Statistical Inference, George Casella (Brooks/Cole)
Questionnaire

Section 1: Statistics are everywhere
We are bombarded with Statistics
http://www.irishtimes.com/newspaper/frontpage/2012/0918/1224324122326.html
http://www.irishtimes.com/newspaper/world/2012/0914/1224324008884.html
http://www.independent.ie/business/world/survey-names-oslo-the-worlds-priciest-city-ireland-ranks-27th-3229426.html
The internet is full of interesting statistics
http://www.usatoday.com/news/politics/twitter-election-meter
Statistics can be misleading
An ad claimed:
9 Out of 10 Dentists prefer Colgate
What is wrong with this statement?

Consider these complaints about airlines published in US
News and World Report on February 5, 2001





Can we conclude that
United Airlines has the
worst customer service?
Statistics in Everyday Life
With the increase in the amount of data
available and advancements in the
power of computers, statistics are being
used more and more frequently

Question: Is it good that statistics are
used so much and what happens when
statistics are misused?
Statistics can be misleading
Misinterpreted Statistics can be
Devastating
In 1999 Sally Clark was wrongly convicted of the
murder of two of her sons. The case was widely
criticised because of the way statistical evidence
was misrepresented in the original trial, particularly
by paediatrician Professor Sir Roy Meadow.
He claimed that, for an affluent non-smoking family
like the Clarks, the probability of a single cot death
was 1 in 8,543, so the probability of two cot deaths
in the same family was around "1 in 73 million"
(8,543 × 8,543).
What is wrong with this assumption?
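The arithmetic behind the quoted figure can be checked directly. The short sketch below only reproduces the multiplication; the variable names are illustrative. Multiplying two probabilities like this is valid only when the two events are independent, which is the assumption the question above asks you to examine.

```python
# Odds of a single cot death quoted at the trial: 1 in 8,543
single_odds = 8543

# Squaring treats the two deaths as independent events:
# P(A and B) = P(A) * P(B) holds only under independence.
double_odds = single_odds * single_odds

print(f"1 in {double_odds:,}")  # 1 in 72,982,849 (around 1 in 73 million)
```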
Video
http://www.youtube.com/watch?v=4TKbIidbyhk&feature=fvwrel
Challenges
As an Analytics practitioner you will face a
number of challenges:

Create insight from all available data (and
there is lots of it)
Interpret statistics correctly
Communicate statistically driven insight in a
way that is clearly understood

Objective of this course
Give you a set of statistical skills to allow you,
as an analytics practitioner, to turn data into
insight!
The Analytics Process & Statistics
Section Overview
Statistics and Analytics
Introduction to CRISP

Data Analytics Is Multidisciplinary
Databases
Statistics
Pattern Recognition
KDD
Machine Learning
AI
Neurocomputing
Predictive Analytics
Data Warehousing
Analytics Process
Data → Insight → Business Decision
Analytics Is A Lot Of Things
[Figure: levels of analytics, plotted by competitive advantage (vertical axis) against degree of intelligence (horizontal axis). Listed from highest to lowest:]
Predictive Analytics
Optimization: What's the best that can happen?
Predictive modelling: What will happen next?
Forecasting/extrapolation: What if these trends continue?
Statistical analysis: Why is this happening?
Access & reporting
Alerts: What actions are needed?
Query/drill down: Where exactly is the problem?
Ad hoc reports: How many, how often, where?
Standard reports: What happened?
For this course we will concentrate on
Statistical Analysis
CRISP-DM Evolution
Over 200 members of the CRISP-DM SIG
worldwide
DM Vendors: SPSS, NCR, IBM, SAS, SGI, Data
Distilleries, Syllogic, etc
System Suppliers/Consultants: Cap Gemini, ICL Retail,
Deloitte & Touche, etc
End Users: BT, ABB, Lloyds Bank, AirTouch, Experian, etc
CRISP-DM 2.0 is due
Complete information on CRISP-DM is available at:
http://www.crisp-dm.org/
CRISP-DM
Features of CRISP-DM:
Non-proprietary
Application/Industry neutral
Tool neutral
Focus on business issues as well as technical analysis
Framework for guidance
Experience base
Templates for Analysis
[Figure: the CRISP-DM cycle, centred on the data: Business Understanding → Data Understanding → Data Preparation → Modelling → Evaluation → Deployment]
Phases & Generic Tasks
Phases: Business Understanding, Data Understanding, Data Preparation, Modelling, Evaluation, Deployment
Generic tasks for Business Understanding: Determine Business Objectives; Assess Situation; Determine Data Mining Goals; Produce Project Plan
Business Understanding

This initial phase focuses on understanding the project
objectives and requirements from a business perspective,
then converting this knowledge into a data mining
problem definition and a preliminary plan designed to
achieve the objectives
Phases & Generic Tasks
Generic tasks for Data Understanding: Collect Initial Data; Describe Data; Explore Data; Verify Data Quality
Data Understanding

The data understanding phase starts with an initial data
collection and proceeds with activities in order to get
familiar with the data, to identify data quality problems,
to discover first insights into the data or to detect
interesting subsets to form hypotheses for hidden
information.
Phases & Generic Tasks
Generic tasks for Data Preparation: Select Data; Clean Data; Construct Data; Integrate Data; Format Data
Data Preparation

The data preparation phase covers all activities to
construct the data that will be fed into the modelling tools
from the initial raw data. Data preparation tasks are
likely to be performed multiple times and not in any
prescribed order. Tasks include table, record and
attribute selection as well as transformation and cleaning
of data for modelling tools.
Phases & Generic Tasks
Generic tasks for Modelling: Select Modelling Technique; Generate Test Design; Build Model; Assess Model
Modelling

In this phase, various modelling techniques are selected
and applied and their parameters are calibrated to
optimal values. Typically, there are several techniques for
the same data mining problem type. Some techniques
have specific requirements on the form of data.
Therefore, stepping back to the data preparation phase
is often necessary.
Phases & Generic Tasks
Generic tasks for Evaluation: Evaluate Results; Review Process; Determine Next Steps
Evaluation

Before proceeding to final deployment of a model, it is
important to thoroughly evaluate it and review the steps
executed to construct it to be certain it properly achieves
the business objectives. A key objective is to determine if
there is some important business issue that has not been
sufficiently considered. At the end of this phase, a
decision on the use of the data mining results should be
reached.
Phases & Generic Tasks
Generic tasks for Deployment: Plan Deployment; Plan Monitoring & Maintenance; Produce Final Report; Review Project
Deployment

Creation of a model is generally not the end of the
project. Even if the purpose of the model is to increase
knowledge of the data, the knowledge gained will need
to be organized and presented in a way that the
customer can use it. Depending on the requirements, the
deployment phase can be as simple as generating a
report or as complex as implementing a repeatable data
mining process across the enterprise.
CRISP-DM
Business Understanding
Data Understanding
Data Preparation
Modelling
Evaluation
Deployment

CRISP-DM: Areas covered in this course
Business Understanding
Data Understanding
Data Preparation
Modelling
Evaluation
Deployment

Section 2: Descriptive Statistics & Distributions

Topics
1. Introduction to Statistics
2. The Basics
3. Measures of location: Mean, Median & Mode.
4. Measures of location & Skew.
5. Measures of dispersion: range, standard deviation
(variance) & interquartile range.


Introduction to Statistics
According to The Random House College Dictionary,
statistics is the science that deals with the collection,
classification, analysis and interpretation of numerical
facts or data. In short, statistics is the science of data.
There are two main branches of Statistics:
The branch of statistics devoted to the organisation,
summarisation and description of data sets is called
Descriptive Statistics.
The branch of statistics concerned with using sample data to
make an inference about a larger set of data is called
Inferential Statistics.

Process of Data Analysis
[Figure: Population → Representative Sample → Sample Statistic, with the steps "Describe" and "Make Inference"]
A statistical population is a data set that is our
target of interest.

A sample is a subset of data selected from the
target population.

If your sample is not representative then it is
referred to as being biased.
Types of Data: Numeric Data
Numeric data can be of two types:
Continuous Data: Data is continuous if it can take any
value in an interval of real numbers
Example: The number of centimetres of rain that fell in March
Discrete Data: Data is discrete if it can take only a
finite (or countable) set of values
Example: The number of correct answers in a 10 question quiz
Types of Data: Categorical Data
Data that is broken into discrete categories is referred
to as categorical data
Categorical data has two main types:
Nominal: A nominal variable has a discrete number of
categories or levels with no logical order
Gender: Male, Female
Working Status: Employed, Unemployed, Home-maker, Student,
Retired
Ordinal: An ordinal variable has a discrete number of
categories or levels with a logical order
Income Level: Low, Medium, High
Places in a race: 1st, 2nd, 3rd, 4th, 5th, 6th
Class Task
Task: Classify the type of data in each of the
following examples:
The profit margin made from customers of an online
clothing company
The type of interest rate you can be charged on a
mortgage, i.e. fixed rate or adjustable rate
The number of dependents associated with a loan
applicant

Let's Start at the Very Beginning
When learning to read and write we start with A-B-
C, when starting to count we start with 1-2-3 and of
course the von Trapp Family Singers started with
Do-Re-Mi!
When learning statistics
you start with the
arithmetic mean or a
simple average
The Arithmetic Mean
The table below shows the total medals won and gold
medals won by each country in the last 5 Olympic
Games. Means are rounded to the nearest whole number.

Year   Canada        China         Germany*      Russia**      United Kingdom   United States
       Total  Gold   Total  Gold   Total  Gold   Total  Gold   Total  Gold      Total  Gold
1992   18     7      54     16     82     33     112    45     20     5         108    37
1996   22     3      50     16     65     20     63     26     15     1         101    44
2000   14     3      59     28     56     13     88     32     28     11        92     37
2004   12     3      63     32     49     13     92     27     30     9         103    36
2008   18     3      100    51     41     16     72     23     47     19        110    36
Mean   17     4      65     29     59     19     85     31     28     9         103    38

* Germany combines East and West Germany prior to reunification
** Russia or the Soviet Union
Data source: http://www.databaseolympics.com/index.htm
Arithmetic Mean: The Formula
The formula for calculating the sample arithmetic
mean of n data points $x_1, x_2, \ldots, x_n$:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

$\bar{x}$ is referred to as "x-bar"
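As a quick check of the formula, the sketch below computes x-bar for the United Kingdom's total medals in the table above (20, 15, 28, 30, 47). The variable names are illustrative.

```python
# Total medals won by the United Kingdom, 1992-2008 (from the medal table)
uk_totals = [20, 15, 28, 30, 47]

# x-bar = (sum of the x_i) / n
n = len(uk_totals)
x_bar = sum(uk_totals) / n

print(x_bar)  # 28.0
```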
Attributes of the Arithmetic Mean
It is straight-forward to calculate
It is easy to interpret the mean
It gives us a good estimate of where a set of
numbers is centred
This is referred to as the central tendency of
a sample
It is sensitive to outliers
Other Measures of Central Tendency

Median: The middle value of an ordered set of
values, i.e. 50% higher and 50% lower

Mode: The most commonly occurring value in a
distribution

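Calculating the Median
Year Medals
1964 90
1968 107
1972 94
1976 94
1980 0
1984 174
1988 94
1992 108
1996 101
2000 92
2004 103
2008 110
Sort the data:
Medals (Sorted): 174, 110, 108, 107, 103, 101, 94, 94, 94, 92, 90, 0
With 12 values there is no single middle value, so the median is the average of the two middle values: (101 + 94) / 2 = 97.5
Median = 97.5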
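The sort-then-pick-the-middle procedure can be sketched in a few lines of Python using the medal counts from the slide. The variable names are illustrative, and the standard library's `statistics.median` is used only as a cross-check.

```python
from statistics import median

# Medal counts for 1964-2008, in year order
medals = [90, 107, 94, 94, 0, 174, 94, 108, 101, 92, 103, 110]

ordered = sorted(medals)
n = len(ordered)

if n % 2 == 1:
    # Odd number of values: take the single middle value
    med = ordered[n // 2]
else:
    # Even number of values: average the two middle values
    med = (ordered[n // 2 - 1] + ordered[n // 2]) / 2

print(med)             # 97.5
print(median(medals))  # 97.5
```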
Calculating the Mode
Count the frequencies of each value in the same Year/Medals data used for the median:
Medals Count
174 1
110 1
108 1
107 1
103 1
101 1
94 3
92 1
90 1
0 1
Mode = 94
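Counting frequencies is easy to automate. The sketch below uses `collections.Counter` from the Python standard library on the same medal counts; the variable names are illustrative.

```python
from collections import Counter

# Medal counts for 1964-2008, in year order
medals = [90, 107, 94, 94, 0, 174, 94, 108, 101, 92, 103, 110]

# Build the frequency table: value -> number of occurrences
counts = Counter(medals)

# The mode is the value with the highest frequency
mode_value, mode_count = counts.most_common(1)[0]

print(mode_value, mode_count)  # 94 3
```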
When to Use Each Central Tendency
Value?

Question: When and why would you use the median
over the mean?
Let's Look at the Variation in our Data
[Figure: histogram, "Distribution of the Total Olympic Medals won by any Country from 1964 - 2008"; vertical axis: Count (0 to 20); annotated with Central Tendency / Location and Spread / Variation]
Measures of Spread or Variation
Range
Variance
Standard Deviation
Inter-quartile Range
Calculating the Range
The range is calculated by subtracting the
minimum value in a data set from the maximum
value
The main advantage to using the range is the
ease with which it is calculated
The major disadvantage of the range is that it
is highly sensitive to outliers
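A small sketch, using the medal counts from the median example, shows both the calculation and the sensitivity to outliers: removing the single zero changes the range substantially. The variable names are illustrative.

```python
# Medal counts for 1964-2008, in year order
medals = [90, 107, 94, 94, 0, 174, 94, 108, 101, 92, 103, 110]

# Range = maximum - minimum
full_range = max(medals) - min(medals)
print(full_range)  # 174

# Dropping the one extreme low value (0) roughly halves the range
without_zero = [m for m in medals if m != 0]
trimmed_range = max(without_zero) - min(without_zero)
print(trimmed_range)  # 84
```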
Calculating the Variance
As an example of variance, consider the following
data:

OBS    Data   Mean   Deviation   (Deviation)^2
1      3      5      -2          4
2      4      5      -1          1
3      8      5       3          9
Sum    15     15      0          14
Mean   5      5       0          4.67
Variance: The Formula
Square the deviations around the mean before
summing. For n data points $x_1, x_2, \ldots, x_n$:

$$\sum_{i=1}^{n} (x_i - \bar{x})^2$$

Divide by n - 1 (?) to get the average of squared
deviations:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
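The worked example (3, 4, 8) can be run through the formula as a check. The sketch below shows both divisors: dividing the sum of squared deviations by n gives the 4.67 in the worked table, while dividing by n - 1, as in the formula for s², gives 7. The standard library functions `pvariance` (divisor n) and `variance` (divisor n - 1) are used only as a cross-check.

```python
from statistics import pvariance, variance

data = [3, 4, 8]
n = len(data)
x_bar = sum(data) / n  # 5.0

# Squared deviations around the mean: 4, 1, 9
squared_devs = [(x - x_bar) ** 2 for x in data]
sum_sq = sum(squared_devs)  # 14.0

mean_sq_dev = sum_sq / n      # about 4.67 (divisor n)
s_squared = sum_sq / (n - 1)  # 7.0 (divisor n - 1, the sample variance)

print(round(mean_sq_dev, 2), s_squared)  # 4.67 7.0
```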
Standard Deviation: The Formula
Take the square root of the variance. The value
is in the original units of the data

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
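Continuing with the small data set 3, 4, 8 from the variance example, a minimal sketch of the standard deviation follows. `statistics.stdev` uses the same n - 1 divisor and is included only as a cross-check.

```python
import math
from statistics import stdev

data = [3, 4, 8]
n = len(data)
x_bar = sum(data) / n

# Sample variance: sum of squared deviations divided by n - 1
s_squared = sum((x - x_bar) ** 2 for x in data) / (n - 1)

# Standard deviation: square root of the variance
s = math.sqrt(s_squared)

print(round(s, 2))  # 2.65
```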
Standard Deviation

Question: Why might it be useful to have the
value in the original units?
Percentiles
The nth percentile is a value such that a proportion $\frac{n}{100}$
of the sample takes values at or below it,
and a proportion $\frac{100 - n}{100}$ takes values larger than it

Example: if your grade in an industrial engineering
class was located at the 84th percentile, then 84%
of the grades were equal to or lower than your
grade and 16% were higher
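To make the definition concrete, the sketch below computes the proportion of a sample at or below a given value. The grades are invented purely for illustration (25 evenly spaced marks), and the function name is hypothetical.

```python
def proportion_at_or_below(values, x):
    """Fraction of the sample taking values at or below x."""
    return sum(1 for v in values if v <= x) / len(values)

# Hypothetical class grades, invented for illustration: 40, 42, ..., 88
grades = list(range(40, 90, 2))
my_grade = 80

at_or_below = proportion_at_or_below(grades, my_grade)
above = sum(1 for g in grades if g > my_grade) / len(grades)

# 21 of the 25 grades are at or below 80, so 80 sits at the 84th percentile
print(at_or_below, above)  # 0.84 0.16
```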
Inter-quartile Range
The median is the 50th percentile
The 25th percentile and the 75th percentile are
called the lower quartile and upper quartile
respectively (or the 1st and 3rd quartiles)
The difference between the lower and upper
quartile is called the inter-quartile range
Quartiles Example
Sort the data:
Medals (Sorted): 174, 110, 108, 107, 103, 101, 94, 94, 94, 92, 90, 0
25th Percentile = 1st Quartile = 93
50th Percentile = Median = 97.5
75th Percentile = 3rd Quartile = 107.5
Inter-quartile Range = 107.5 - 93 = 14.5
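There are several conventions for computing quartiles, and different software packages can give slightly different answers. The sketch below assumes the "median of each half" convention, which reproduces the values on this slide: split the sorted data into a lower and an upper half and take the median of each. The variable names are illustrative.

```python
from statistics import median

# Medal counts for 1964-2008, in year order
medals = [90, 107, 94, 94, 0, 174, 94, 108, 101, 92, 103, 110]

ordered = sorted(medals)
half = len(ordered) // 2

# Lower and upper halves of the sorted data
# (for an odd number of values the middle value is left out of both)
lower_half = ordered[:half]
upper_half = ordered[-half:]

q1 = median(lower_half)  # 1st quartile (25th percentile)
q2 = median(ordered)     # median (50th percentile)
q3 = median(upper_half)  # 3rd quartile (75th percentile)
iqr = q3 - q1

print(q1, q2, q3, iqr)  # 93.0 97.5 107.5 14.5
```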
Proportions
A proportion, p, is the fraction of items in a population that belong
to a certain class, for example:
The proportion of your customers that are female
The proportion of voters that will vote for Labour in the
next election
A proportion is calculated as:

$$p = \frac{C}{N}$$

where C is the number of items in a population of size N
that belong to the class of interest
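A minimal sketch of the calculation follows. The customer records are invented for illustration only; C and N follow the notation of the formula.

```python
# Hypothetical customer records, invented for illustration ("F" = female)
customers = ["F", "M", "F", "F", "M", "F", "M", "F"]

C = sum(1 for c in customers if c == "F")  # items in the class of interest
N = len(customers)                         # population size

p = C / N
print(C, N, p)  # 5 8 0.625
```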
Skew: The Shape of a Distribution
There are a number of ways of describing the
shape of a distribution.

We will consider only one: skew.

Skew is a measure of how asymmetric a
distribution is.



Symmetric Distributions
Skew is zero.

Positive Skew
There are a few very large data points which create a
'tail' going to the right (i.e. up the number line)

Note: No axis of symmetry here - skew > 0 (i.e. it is positive)

Examples: house prices, reaction times for drivers

Negative Skew
There are a few very small data points which create
a 'tail' going to the left (i.e. down the number line)

Note: No axis of symmetry here - skew < 0 (i.e. it is negative)

Examples: examination scores, lifetimes of people
Skew & Measures of Location - Symmetry
Mean, Median & Mode are
the same and are found in the
middle
6
6
5 6 7
4 5 6 7 8
3 4 5 6 7 8 9

Mean = 102/17 = 6
Median = 6
Mode = 6

Positive Skew
6
6
5 6 7
5 6 7 8 9
5 6 7 8 9 10 11

Mean = 121/17 = 7.12
Median = 7
Mode = 6

In general: Mode < Median < Mean

Negative Skew
6
6
5 6 7
3 4 5 6 7
1 2 3 4 5 6 7

Mean = 83/17 = 4.88
Median = 5
Mode = 6

In general: Mode > Median > Mean
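The three small data sets on these slides can be checked with the Python standard library. The sketch below also computes a simple skewness measure to confirm the sign of the skew; this particular measure (the mean cubed deviation divided by the cubed population standard deviation) is an assumption of this example, since the slides do not give a formula, and the function name is illustrative.

```python
from statistics import mean, median, mode

# The 17 values from each slide, read row by row
symmetric = [6, 6, 5, 6, 7, 4, 5, 6, 7, 8, 3, 4, 5, 6, 7, 8, 9]
positive = [6, 6, 5, 6, 7, 5, 6, 7, 8, 9, 5, 6, 7, 8, 9, 10, 11]
negative = [6, 6, 5, 6, 7, 3, 4, 5, 6, 7, 1, 2, 3, 4, 5, 6, 7]

def skewness(values):
    """Mean cubed deviation divided by the cubed population standard deviation."""
    m = mean(values)
    n = len(values)
    second = sum((x - m) ** 2 for x in values) / n
    third = sum((x - m) ** 3 for x in values) / n
    return third / second ** 1.5

for name, data in [("symmetric", symmetric),
                   ("positive", positive),
                   ("negative", negative)]:
    # Prints name, mean, median, mode and the skewness measure
    print(name, round(mean(data), 2), median(data), mode(data),
          round(skewness(data), 2))
```

The symmetric set gives a skewness of zero, while the other two give values of equal size and opposite sign, because the negative-skew set is a mirror image of the positive-skew set.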
Section 3: Graphs and Visualisation


Graphical Displays
A way of letting people get a 'picture' of
relationships in the data set.

"The simpler the better" should be the rule in graphical
displays.

People can remember pictures better.

A good graph should show something that is not
easy to see using tables.

Bar Charts
Used to display categorical data or discrete
data with a modest number of values.
A Bar is drawn to represent each category.
The Bar height represents the frequency or % in
each category.
Allows for visual comparison of relative
frequencies.
Need to draw up a frequency distribution table
first.

Core Statistical Plots
[Figure: "Points Scored by any Team in Six Nations Championship 2000 - 2011"; axis ticks from 0 to 25]
Comparisons: Column Charts, Box Plots
Correlations: Scatter Plots
Trends (time): Line Charts
Proportions: Pie Chart, Column Chart
Some Hans Inspiration to Finish Up
http://www.youtube.com/watch?v=fTznEIZRkLg
