No part of this book may be referenced or copied without the prior permission of the
company.
A FEW WORDS TO THE STUDENTS
Analytics is becoming a popular tool for managerial decision making. It is still not so
widespread in countries like India, but in the West it has become a standard practice.
Previously, studying analytics required an in-depth knowledge of statistics and pro-
gramming languages, but the widespread availability of statistical software packages
has changed that reality to some extent. Now more emphasis is placed on applying
the techniques to solve business problems, so there is a need to understand the
meaning of the statistical procedures. This book has been written to cater to that need.
In this book, all the necessary concepts have been explained with the business
problem in mind. Also, to remove the apathy for statistics, the use of mathematical
expressions has been kept limited. That does not imply that we do not have to study
the mathematics; the intention is to put substance first. As students get accustomed
to these statistical concepts, they can pursue further investigation using various
mathematical and statistical techniques. A list of suggested books and links has
been given in the appendix.
In this book, the statistical procedures have been implemented in SAS. The expla-
nations of the code are written from the perspective of a data modeler. For a pro-
grammer's perspective, students are advised to go through the documentation
of the procedures on the SAS website.
In short, statistical concepts are a way of thinking. The more you recognize the think-
ing pattern, the quicker you will learn.
Best of Luck!
Team OTG
Business analytics (BA) refers to the skills, technologies, applications and practices
for continuous iterative exploration and investigation of past business performance
to gain insight and drive business planning.
Exploratory Data Analysis (EDA) makes few assumptions, and its purpose is to suggest
hypotheses. For example, an OEM manufacturer was experiencing customer
complaints, and a team wanted to identify and remove the causes of these
complaints. They asked customers for usage data so the team could calculate
defect rates; this started an Exploratory Data Analysis. The investigation established
that a supplier had used the wrong raw material. Discussions with the supplier and
team members motivated further analysis of the raw material and its composition.
This decision to analyze the raw material completed the Exploratory Data Analysis.
Properties of Measurement
Scales of Measurement
Nominal Scale: The nominal scale of measurement only satisfies the identity property
of measurement. Values assigned to variables represent a descriptive category, but
have no inherent numerical value with respect to magnitude. Gender is an example
of a variable that is measured on a nominal scale. Individuals may be classified as
"male" or "female", but neither value represents more or less "gender" than the other.
Religion and political affiliation are other examples of variables that are normally
measured on a nominal scale.
Interval Scale: The interval scale of measurement has the properties of identity, magni-
tude, and equal intervals. A perfect example of an interval scale is the Fahrenheit
scale to measure temperature. The scale is made up of equal temperature units, so
that the difference between 40 and 50 degrees Fahrenheit is equal to the difference
between 50 and 60 degrees Fahrenheit. With an interval scale, you know not only
whether different values are bigger or smaller, you also know how much bigger or
smaller they are. For example, suppose it is 60 degrees Fahrenheit on Monday and 70
degrees on Tuesday. You know not only that it was hotter on Tuesday; you also know
that it was 10 degrees hotter.
Ratio Scale: The ratio scale of measurement satisfies all four of the properties of meas-
urement: identity, magnitude, equal intervals, and an absolute zero. The weight of an
object would be an example of a ratio scale. Each value on the weight scale has a
unique meaning, weights can be rank ordered, units along the weight scale are equal
to one another, and there is an absolute zero. Absolute zero is a property of the weight
scale because objects at rest can be weightless, but they cannot have negative
weight.
Tabular Presentation
[Table: Subcategory | Frequency | Percent | Cumulative Frequency | Cumulative Percent]
Graphical Presentation
Pie Chart
The vice president of marketing of a fast food chain is studying the sales perfor-
mance of the 100 stores in the eastern part of the country. He has constructed the fol-
lowing frequency distribution of annual sales:
Sales (000s)    Frequency      Sales (000s)    Frequency
700 - 799           4          1300 - 1399        13
800 - 899           7          1400 - 1499        10
900 - 999           8          1500 - 1599         9
1000 - 1099        10          1600 - 1699         7
1100 - 1199        12          1700 - 1799         2
1200 - 1299        17          1800 - 1899         1
He would look at the distribution with an eye toward getting information about the
central tendency, to compare the eastern part with other parts of the country. Central
tendency is basically the centermost value of a distribution. Now how do we know
which one is the centermost value? There are three common measures of the
central value: the arithmetic mean, the median and the mode.
The arithmetic mean is the simple average of the data. The problem with the
arithmetic mean is that it is influenced by extreme values. Suppose you take a
sample of 10 persons whose monthly incomes are 10k, 12k, 14k, 12.5k, 14.2k, 11k,
12.3k, 13k, 11k and 10k. The average income turns out to be 12k, which is a good
representation of the data. Now if you replace the last value with 100k, the average
turns out to be 21k, which is quite absurd, as 9 out of 10 people earn way below
that figure.
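As a quick illustration, the following sketch enters the hypothetical income values above (with the 100k outlier) by hand and compares the mean and the median; the median barely moves while the mean jumps to 21k.

data incomes;
   input income @@;   /* monthly income in thousands */
   datalines;
10 12 14 12.5 14.2 11 12.3 13 11 100
;
run;

proc means data=incomes mean median;
   var income;   /* mean = 21, median = 12.4 */
run;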
Measures of Dispersion
As the name says, here we are trying to assess how dispersed the data is. A measure
of central tendency without any idea of the measures of dispersion does not make
much sense. Why is that? Look at the following charts.
The horizontal line is the central value in both cases. In the first case, where the data
is less dispersed, the data is really clustered around the central line. In the second
case, the data is so dispersed that the central value is not that meaningful: you
cannot say that the horizontal line is a true representative of the data. So there is a
need to measure the dispersion in the data.
Broadly, there are two kinds of measures of dispersion: absolute measures like the
range or the variance, and relative measures like the coefficient of variation.
The range is the simplest measure; it is basically the difference between the maximum
and the minimum value in the data. The other absolute measure, the variance, is a bit
complicated to express in plain words: it basically comes from the sum of the squared
differences between each data point and the mean, averaged over the number of
observations. The standard deviation is the square root of the variance. We will
discuss population and sample in the coming chapters.
Apart from understanding the dispersion in the data, the standard deviation can be
used for transforming the data. Suppose we want to compare two variables, like the
amount of money persons earn and the number of pairs of shoes their wives have;
then it is better to express those data in terms of standard deviations. That is, we
simply divide the data by their respective standard deviations. So here the standard
deviation acts as a unit, or rather, we make the data unit-free.
Now if you want to understand which data is more volatile, personal income or pairs
of shoes, you had better use the coefficient of variation. As mentioned earlier, it is a
relative measure of dispersion and is expressed as the standard deviation per unit of
the central value, i.e. the mean. If you have income in dollar terms and income in
rupee terms, and the first data set has a lower coefficient of variation than the second
one, use the first data set for analysis; you will find more meaningful information.
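A minimal sketch of how these measures could be obtained in SAS, reusing the incomes data set from the earlier sketch:

proc means data=incomes mean std cv;
   var income;   /* prints mean, standard deviation and coefficient of variation */
run;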
Measures of Location
Using Measures of Location, we can get a birds eye view of the data. Measures of
Central Tendency also comes under the Measures of Location. Minimum and maxi-
mum are also measures of
location. Other measures
are Percentiles, Deciles,
and Quartiles. For example,
if 90 percentile denotes the
number 86, the it is implied
that 90% of the students
have got marks which are
less than 86. Now the 90
percentile is the 9th Deciles.
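A sketch of how such percentiles might be requested in SAS, on a hypothetical data set of exam marks:

proc means data=work.marks min p25 median p75 p90 max;
   var score;   /* quartiles, the 90th percentile, and the extremes */
run;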
As we look at the shape of the histogram of numeric data, we develop various
understandings about the distribution of the data. We have two statistics that relate
to the shape of the distribution: skewness and kurtosis. If the distribution has a longer
left tail, the data is negatively skewed; the opposite holds for positively skewed data.
So we are basically detecting whether the data is symmetric about the central value
of the distribution. In options markets, the difference in implied volatility at different
strike prices represents the market's view of skew, and is called volatility skew. (In
pure Black-Scholes, implied volatility is constant with respect to strike and time to
maturity.) Skewness gives rise to skewness risk in statistical models that are built out of
variables assumed to be symmetrically distributed.
Kurtosis, on the other hand, measures the peakedness of the distribution as well as
the heaviness of its tails. Extremely heavy-tailed distributions may not even have a
finite variance; in other words, we cannot calculate the variance for such distributions.
Now if we assume that the distribution is not heavy-tailed and build a model on this
assumption, it can lead to kurtosis risk in the model. For instance, Long-Term Capital
Management, a hedge fund cofounded by Myron Scholes, ignored kurtosis risk to its
detriment. After four successful years, this hedge fund had to be bailed out by major
investment banks in the late 90s because it understated the kurtosis of many financial
securities underlying the fund's own trading positions.
There can be several situations, as shown in the chart. The excess kurtosis of a
mesokurtic distribution is zero; for a platykurtic distribution it is negative, and for a
leptokurtic distribution it is positive. Kurtosis is sometimes referred to as the volatility
of volatility, or the risk within risk.
An outlier is a score very different from the rest of the data. When we analyze data
we have to be aware of such values because they bias the model we fit to the data.
A good example of this bias can be seen by looking at a simple statistical model such
as the mean. Suppose a film gets ratings from 1 to 5. Seven people saw the film and
rated it 2, 5, 4, 5, 5, 5, and 5. All but one of these ratings are fairly similar (mainly 5
and 4), but the first rating, a 2, was quite different from the rest. This is an example of
an outlier.
Boxplots tell us something about the distribution of scores. A boxplot shows the lowest
score (the bottom horizontal line) and the highest (the top horizontal line). The
distance between the lowest horizontal line and the lower edge of the tinted box is
the range within which the lowest 25% of scores fall (the bottom quartile). The box
(the tinted area) shows the middle 50% of scores (known as the interquartile range);
i.e. 50% of the scores are bigger than the lowest part of the tinted area but smaller
than its top part. The distance between the top edge of the tinted box and the top
horizontal line shows the range within which the top 25% of scores fall (the top
quartile). In the middle of the tinted box is a slightly thicker horizontal line; this
represents the value of the median. Like histograms, boxplots also tell us whether
the distribution is symmetrical or skewed: for a symmetrical distribution, the whiskers
on either side of the box are of equal length. Finally, you will notice some small
circles above each boxplot. These are the cases that are deemed to be outliers.
Each circle has a number next to it that tells us in which row of the data editor to
find that case.
Generally we find problems related to the distribution or outliers while exploring the
data. Suppose you detect outliers in the data. There are several options for reducing
the impact of these values. However, before you do any of these things, it is worth
checking whether the data you have entered are correct. If the data are correct,
then the main options you have include:
Remove the Case: This entails deleting the data from the person who contributed
the outlier.
BAR CHART
gchart is the procedure to generate bar charts. The data set we use here is
candy_sales_summary. The bar chart is generated using the keyword vbar. This
presentation is used to represent the qualitative variable subcategory. The code
generates a bar graph showing the frequency of occurrence of the different
subcategories.
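The code block itself is not reproduced in the text; a minimal sketch of what it would look like, assuming the data set sits in the work library:

proc gchart data=work.candy_sales_summary;
   vbar subcategory;   /* frequency bar chart of the qualitative variable */
run;
quit;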
This code generates a 3D bar graph for subcategory, which is a better form of
representing qualitative data. vbar3d is the keyword for generating a three-
dimensional bar graph.
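A sketch of the corresponding statement, under the same assumptions:

proc gchart data=work.candy_sales_summary;
   vbar3d subcategory;   /* same frequency chart, rendered in 3D */
run;
quit;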
This code generates a horizontal 3D bar graph for the variable subcategory.
hbar3d is the keyword for generating the horizontal 3D bar graph. This form of
representing the data is useful when we are representing spatial data.
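Again as a sketch:

proc gchart data=work.candy_sales_summary;
   hbar3d subcategory;   /* horizontal 3D frequency bars */
run;
quit;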
This code generates a 3D vertical bar graph for the variable subcategory, but
corresponding to each vertical bar it also gives the total sale amount on top of
the bar.
The next variant results in the same output but does not display the sum at the top
of each bar; the sum keyword is responsible for that display.
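A sketch covering both variants, assuming the analysis variable is sale_amount:

proc gchart data=work.candy_sales_summary;
   vbar3d subcategory / sumvar=sale_amount sum;   /* totals printed above the bars */
   /* omit the SUM option to suppress the printed totals:
      vbar3d subcategory / sumvar=sale_amount;    */
run;
quit;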
This code generates a sub-divided multiple bar diagram. The group option generates
a bar diagram for each fiscal year and shows the sales corresponding to each
subcategory within a given fiscal year.
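A sketch, assuming the fiscal-year variable is named fiscal_year:

proc gchart data=work.candy_sales_summary;
   vbar3d subcategory / sumvar=sale_amount group=fiscal_year;   /* one cluster of bars per fiscal year */
run;
quit;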
This code is run according to the dimensions specified using the goptions keyword.
goptions is a global statement which holds throughout the rest of the session: every
graph constructed by the software thereafter would have these dimensions.
The multiple bar diagram generated by the code above appears very shabby on
screen. To make the bars look better, we need to space them out, and this is done
through the goptions statement: we specify the dimensions for the vertical and
horizontal axes. Being a global statement, any graphical representation from here
onwards would take these dimensions as given.
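A sketch of such a global statement; the specific sizes here are illustrative guesses, not the original values:

goptions hsize=8in vsize=6in;   /* hypothetical dimensions applied to all subsequent graphs */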
PIE-CHART
This code generates a three-dimensional pie chart using the keyword pie3d within
the gchart procedure. The pie chart represents each subcategory as a slice, i.e.
as a percentage of 360 degrees.
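Sketch:

proc gchart data=work.candy_sales_summary;
   pie3d subcategory;   /* one slice per subcategory */
run;
quit;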
This is a variation of the previous pie chart. It generates a pie chart where the
discrete value of each subcategory is placed in its slice: value=inside keeps the
frequency values in the slices along with the names of the subcategories. Each
subcategory is shown in a slice of a different color.
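Sketch:

proc gchart data=work.candy_sales_summary;
   pie3d subcategory / value=inside;   /* frequency printed inside each slice */
run;
quit;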
The next code generates the pie chart such that the frequency value and the
percentage frequency of each subcategory appear inside the slice, while the
name of the subcategory appears outside the slice.
A further variant puts out the frequency of sales corresponding to each sale
subcategory: the percentage frequency and the discrete value of the sales are
shown outside, and the name of the variable is shown outside the slice as well.
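A sketch of these placement variations, using the slice=, value= and percent= options named above:

proc gchart data=work.candy_sales_summary;
   pie3d subcategory / value=inside percent=inside slice=outside;
   /* alternative placement: statistics outside the slices */
   pie3d subcategory / sumvar=sale_amount value=outside percent=outside slice=outside;
run;
quit;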
HISTOGRAM
This is the representation of quantitative data. The univariate keyword is used to
generate all the key descriptive statistics related to a particular variable; here, the
variable under consideration is sale_amount. The keyword to generate a histogram
is histogram. If no dimension is mentioned, then by default it is a two-dimensional
diagram.
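Sketch:

proc univariate data=work.candy_sales_summary;
   var sale_amount;
   histogram sale_amount;   /* descriptive statistics plus a histogram */
run;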
The univariate keyword in the code generates all the descriptive statistics associated
with the variable sale_amount in the data set candy_sales_summary. Another
objective of the code is to construct a histogram for the same variable using the
keyword histogram. The total amount of sales is generated for each of the
subcategories.
SCATTER PLOT
gplot is the procedure to generate a plot of two quantitative variables. The scatter
plot for the two variables sale_amount and units is generated using the keyword
plot. The variable on the left-hand side of the * goes on the y-axis and the variable
on the right-hand side goes on the x-axis.
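Sketch:

proc gplot data=work.candy_sales_summary;
   plot sale_amount*units;   /* sale_amount on the y-axis, units on the x-axis */
run;
quit;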
NORMALITY CHECK
The univariate keyword generates all the descriptive statistics associated with the
variable height in the data set Class. The descriptive statistics associated with a
distribution help in identifying the normality of the distribution; normality implies an
element of symmetry in the distribution. In this data set the mean, median and mode
are all approximately 62. The standard deviation is pretty low (5) compared to the
mean. The skewness and kurtosis of the data lie in the neighborhood of zero. A basic
analysis yields the result that the variable height is approximately normally distributed
in the data set Class.
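A sketch; the Class data set here is assumed to be the standard sashelp.class sample that ships with SAS:

proc univariate data=sashelp.class;
   var height;   /* mean, median, mode, skewness, kurtosis, etc. */
run;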
This is the same code executed for a different data set: candy_sales_summary. The
mean of the variable sale_amount (4951.97) is significantly different from its median
(4040.525) and mode (0.00). Also, the average fluctuation in the data, represented
by the standard deviation, is very high (3986). This means that the mean is not a
good representative value for the data set, as there is very high fluctuation in the
data. It is easy to conclude that the variable sale_amount is not normally distributed.
The quality of the measures of central tendency and dispersion is affected adversely
in the presence of outliers. The boxplot is widely used to examine the existence of
outliers in a data set. Our reference data set is a hypothetical one consisting of
students' marks and the name of the subject. Two important facts must be kept in
mind for a boxplot:
The number of observations in the data set must be at least five.
If there is more than one category, the data set must be sorted according to
the category.
A data set containing the marks of 5 students in the subjects English and Maths exists
in CSV format. The file is imported into the SAS library using proc import. The logic of
this code is to import the file in its existing format, convert it to SAS format, and
replace any previously imported file of the same name.
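A sketch of such an import, with a hypothetical file path:

proc import datafile='C:\data\marks.csv'   /* hypothetical location of the CSV file */
    out=work.marks
    dbms=csv
    replace;   /* overwrite an existing data set of the same name */
run;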
boxplot is the keyword for generating a boxplot. The plot is done between the marks
obtained by the students and the subject. The existence of outliers in the data set
shows up as points outside the box. boxstyle is a keyword to request a particular
format of the boxplot.
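A sketch, assuming the imported data set work.marks with variables marks and subject (sorted by subject, as required above):

proc sort data=work.marks;
   by subject;
run;

proc boxplot data=work.marks;
   plot marks*subject / boxstyle=schematic;   /* schematic style marks outliers as points */
run;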
Future events are far from certain in the business world. Most managers who use
probabilities are concerned with two conditions:
The case when one event or another will occur
The situation where two or more events will both occur
We are interested in the first case when we ask, "What is the probability that today's
demand will exceed our inventory?" To illustrate the second situation, we could ask,
"What is the probability that today's demand will exceed our inventory and that more
than 10% of our sales force will not report for work?" Probability is used throughout
business to evaluate financial and decision-making risks. Every decision made by
management carries some chance of failure, so probability analysis is conducted
both formally ("math") and informally ("I hope").
Consider, for example, a company thinking of entering a new business line. If the
company needs to generate $500,000 in revenue in order to break even, and its
probability distribution tells it that there is a 10 percent chance that revenues will
be less than $500,000, the company knows roughly what level of risk it is facing if
it decides to pursue that new business line.
Classical Approach:
Probability of an event = (Number of outcomes where the event occurs) / (Total number of possible outcomes)
Relative Frequency Approach:
Suppose we are tossing a coin. Initially, the ratio of the number of heads to the
number of trials will remain volatile. As the number of trials increases, the ratio
converges to a fixed number (say 0.5), so the probability of getting a head is 0.5.
This concept is shown in the following chart.
[Chart: the ratio of heads to trials fluctuates at first and settles near 0.5 as the number of trials grows.]
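This convergence is easy to reproduce; a small simulation sketch (the seed and the trial count are arbitrary choices):

data coin_sim;
   call streaminit(2024);                      /* fix the seed for reproducibility */
   heads = 0;
   do n = 1 to 1000;
      heads = heads + rand('bernoulli', 0.5);  /* 1 with probability 0.5, else 0 */
      ratio = heads / n;                       /* running proportion of heads */
      output;
   end;
run;

proc gplot data=coin_sim;
   plot ratio*n;   /* the curve flattens out near 0.5 */
run;
quit;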
Apart from all these, there is the concept of subjective probability. It is basically
based on an individual's past experience and intuition. Most higher-level social and
managerial decisions are concerned with specific, unique situations, and decision
makers at that level make considerable use of subjective probability.
Probability mass function (pmf) is a function that gives the probability that a discrete
random variable is exactly equal to some value. The probability mass function is often
the primary means of defining a discrete probability distribution.
Suppose that S is the sample space of all outcomes of a single toss of a fair coin, and X
is the random variable defined on S assigning 0 to "tails" and 1 to "heads". Since the
coin is fair, the probability mass function is given by:
p(X = 0) = p(X = 1) = 1/2
The probability mass function of a fair die is shown in the chart: all the numbers on
the die have an equal chance (1/6) of appearing on top when the die is rolled.
If you put this concept into a chart, then the probability that a continuous random
variable falls between a and b is the area under the probability density function
curve between a and b.
[Chart: the shaded area under the density curve f(x) between the points a and b.]
Suppose you toss a coin 10 times and get 7 heads. Hmm, strange, you say. You then
ask a friend to try tossing the coin 20 times; she gets 15 heads and 5 tails. So you
have, in all, 22 heads and 8 tails out of 30 tosses. What did you expect? Was it
something close to 15 heads and 15 tails (half and half)? Now suppose you turn the
tossing over to a machine and get 792 heads and 208 tails out of 1000 tosses of the
same coin. Then you might be suspicious of the coin, because it didn't live up to
what you expected.
To obtain the expected value of a discrete random variable, we multiply each value
that the random variable can assume by the probability of occurrence of that value
and sum these products, as in the clinic example below:

Number of Patients (1)   Probability (2)   (1) x (2)
100                           0.01             1.00
101                           0.02             2.02
102                           0.03             3.06
103                           0.05             5.15
104                           0.06             6.24
105                           0.07             7.35
106                           0.09             9.54
107                           0.10            10.70
108                           0.12            12.96
109                           0.11            11.99
110                           0.09             9.90
111                           0.08             8.88
112                           0.06             6.72
113                           0.05             5.65
114                           0.04             4.56
115                           0.02             2.30
Expected Number of Patients                  108.02

Again, remember that an expected value of 108.02 doesn't imply that tomorrow
exactly 108.02 patients will visit the clinic.
As the random variable is of two types, the probability distributions are likewise of two
types, namely discrete and continuous. The probability distribution for the sum of the
points on two rolled dice is as follows:

Sum          2     3     4     5     6     7     8     9     10    11    12
Probability  1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36
Binomial Distribution
The binomial distribution describes discrete data resulting from an experiment known
as a Bernoulli process. The tossing of a fair coin a fixed number of times is a Bernoulli
process, and the outcomes of such tosses can be represented by the binomial
probability distribution. The success or failure of interviewees on an aptitude test may
also be described by a Bernoulli process. On the other hand, the frequency
distribution of the lives of fluorescent lights in a factory would be measured on a
continuous scale of hours and would not qualify as a binomial distribution. The
probability mass function, the mean and the variance are as follows:
P(X = r) = nCr p^r (1 - p)^(n - r), mean = np, variance = np(1 - p),
where n is the number of trials, r the number of successes and p the probability of
success in a single trial.
There can be only two possible outcomes: heads or tails, yes or no, success or
failure
Each Bernoulli process has its own characteristic probability. Take the situation in
which, historically, seven-tenths of all people who applied for a certain type of job
passed the job test. We would say that the characteristic probability here is 0.7, but
we could describe our testing results as Bernoulli only if we felt certain that the
proportion of those passing the test (0.7) remained constant over time.
At the same time, the outcome of one test must not affect the outcome of the
other tests.
Poisson Distribution
The Poisson distribution is used to describe a number of processes, including the
distribution of telephone calls going through a switchboard system, the demand of
patients for service at a health institution, the arrivals of trucks and cars at a tollbooth,
and the number of accidents at an intersection. These examples all have a common
element: they can be described by a discrete random variable that takes on integer
values (0, 1, 2, 3, 4, and so on). The number of patients who arrive at a physician's
office in a given interval of time will be 0, 1, 2, 3, 4, 5, or some other whole number.
Similarly, if you count the number of cars arriving at a tollbooth on a highway during
a 10-minute period, the number will be 0, 1, 2, 3, 4, 5, and so on. The probability
mass function, the mean and the variance are as follows:
P(X = k) = e^(-λ) λ^k / k!, mean = variance = λ,
where λ is the average number of occurrences per interval.
If we consider the example of the number of cars, then the average number of
vehicles that arrive per rush hour can be estimated from past traffic data.
If we divide the rush hour into intervals of one second each, we will find the following
statements to be true:
The probability that exactly one vehicle will arrive at the single booth per second is
a very small number and is constant for every one-second interval.
The probability that two or more vehicles will arrive within a one-second interval is
so small that it can be assigned a value of zero.
Normal Distribution
The normal distribution has applications in many areas of business administration. For
example:
Modern portfolio theory commonly assumes that the returns of a diversified asset
portfolio follow a normal distribution.
In operations management, process variations often are normally distributed.
In human resource management, employee performance sometimes is consid-
ered to be normally distributed.
data binom;
   /* P(X = 50) for a binomial variable with p = 0.6 and n = 100 trials */
   binom_prob = pdf('binomial', 50, 0.6, 100);
run;
This code computes the probability of getting fifty successes in 100 trials of a binomial
experiment where the probability of success in a single trial is 0.6. The code creates a
new data set named binom in the work library; the data set could also be created in
a permanent library by assigning a library name. binom_prob is the variable that
stores the probability associated with the above-mentioned outcome. pdf stands for
probability density function: it generates the probability associated with a given
outcome, given the parameters of the distribution. pdf is the general command for
calculating the probabilities associated with various points of a distribution (be it
discrete or continuous), since SAS does not have a separate pmf (probability mass
function) command.
data binom_plot;
   do x = 0 to 20;
      binom_prob = pdf('binomial', x, 0.5, 20);   /* P(X = x) for n = 20, p = 0.5 */
      output;   /* write one observation per iteration */
   end;
run;
This code generates a schedule of probabilities associated with the various numbers
of successes in the binomial distribution. A loop is created for generating the
schedule: the number of successes is kept variable within the loop, while the
parameters of the distribution, namely the number of trials (20) and the probability
of success (0.5), are fixed. The loop is terminated using the end keyword. The
keyword output writes the result of each iteration.
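The plotting command itself is not reproduced in the text; a sketch of what it presumably looks like:

proc gplot data=binom_plot;
   plot binom_prob*x;   /* probability on the y-axis, number of successes on the x-axis */
run;
quit;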
This command directly plots the binomial probability distribution with probabilities on
the vertical axis and the number of successes on the horizontal axis.
data binom_plot;
   do x = 0 to 20;
      binom_prob = pdf('binomial', x, 0.3, 20);   /* lower probability of success */
      output;
   end;
run;
This is the command to generate the binomial probability distribution for 0 to 20
successes with a much lower probability of success. Examining the nature of the
distribution over changing values of the probability of success gives us a fair idea of
the skewness of the distribution. If the probability of obtaining a success in a particular
trial is low, then the chance of getting very many successes is low and that of getting
few successes is very high. The distribution, given this specification of the parameters,
is a positively skewed distribution.
The plotting command plots the binomial probability distribution for the newly
specified parameters, with the values of the probability on the vertical axis and the
number of successes on the horizontal axis. The graphical representation displays
the varying nature of the skewness in the distribution very distinctly.
POISSON DISTRIBUTION
data day1.poisson;
   /* P(X = 12) for a Poisson variable with mean 10 */
   pois_prob = pdf('poisson', 12, 10);
run;
This is a data step where a data set named poisson is created in the permanent
library day1. The syntax of the pdf function here is: new variable = pdf(name of the
distribution, value of x, mean of the distribution). The code calculates the probability
of obtaining exactly 12 occurrences in a Poisson experiment whose mean (the
parameter of the distribution) is 10.
data day1.pois_plot;
   do x = 0 to 25;
      pois_prob = pdf('poisson', x, 10);   /* P(X = x) for mean 10 */
      output;
   end;
run;
This data step generates the schedule of Poisson probabilities for the numbers of
occurrences 0 through 25. The output keyword writes the result of each iteration.
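Again, the plotting command is not reproduced in the text; a sketch:

proc gplot data=day1.pois_plot;
   plot pois_prob*x;   /* probability against number of occurrences */
run;
quit;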
This command directly plots the Poisson probability distribution with probabilities on
the vertical axis and the number of occurrences on the horizontal axis. The following
code can be used for analyzing the skewness associated with the Poisson
distribution:
data day1.pois_plot;
   do x = 0 to 25;
      pois_prob = pdf('poisson', x, 10.5);   /* slightly different mean */
      output;
   end;
run;
The last two code blocks can be used to analyse the nature of the skewness of the
Poisson distribution: the skewness can be examined by changing the parameter of
the distribution.
NORMAL DISTRIBUTION
data day1.normal;
   do x = -12 to 18 by 0.05;
      normal_prob = pdf('normal', x, 3, 8);   /* density at x for mean 3, standard deviation 8 */
      output;
   end;
run;
This is a data step which creates a new data set normal in the user-defined library
day1. The command generates the normal probability density; the values of the
respective densities are stored in the variable normal_prob. The syntax of this
function is: name of the variable = pdf(distribution name, value of x, mean, standard
deviation). (Note that SAS expects the standard deviation here, not the variance.)
The mean and standard deviation must be specified for a proper characterization of
the normal distribution. The schedule of densities corresponding to the different values
of x is generated using the do loop. Since the normal distribution is continuous, it
assumes continuous values; by default the loop variable increases in steps of 1, which
makes the grid coarse, so to approximate a continuous curve we increase x in steps
of 0.05. The result of each iteration is written out using the output keyword.
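A sketch of the plotting command:

proc gplot data=day1.normal;
   plot normal_prob*x;   /* bell-shaped density curve */
run;
quit;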
This command directly plots the normal probability density with the density values on
the vertical axis and the values of x on the horizontal axis. The graph obtained from
the data set normal is symmetric in nature.
proc univariate is the procedure for listing all the descriptive statistics associated with
the variable height, which is our analysis variable. The keyword histogram is used for
generating a histogram over which a normal curve is superimposed. The normal
curve, drawn here in green, is specified by the estimated mean and the estimated
standard deviation. Superimposing the normal curve on the histogram gives us an
idea of whether the variable is normally distributed: if the normal curve fits the
histogram nicely, we say that the variable is normally distributed. The variable height
in the data set class fits a normal curve well; the normality of the variable can be
clearly observed in the diagram below.
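A sketch, again assuming the standard sashelp.class data; the normal option requests the fitted curve:

proc univariate data=sashelp.class;
   var height;
   histogram height / normal;   /* histogram with a fitted normal curve overlaid */
run;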
This is the same code applied to a different variable in a different data set. The
variable sale_amount is not normally distributed, and the normal curve does not fit
the histogram symmetrically.
Sampling is concerned with the selection of a subset of individuals from within a
population to estimate characteristics of the whole population. Researchers rarely
survey the entire population because the cost of a census is too high. The three main
advantages of sampling are that the cost is lower, data collection is faster, and since
the data set is smaller it is possible to ensure homogeneity and to improve the
accuracy and quality of the data.
Concept of Population
Techniques of Sampling
Non-Probability Sampling
In non-probability sampling, we cannot assign a probability of selection to any
member of the population. Non-probability sampling techniques cannot be used
to infer from the sample to the general population.
Sampling Bias
Sampling Distribution
Types of Estimator
Properties of Estimator
Statistical hypotheses are statements about real relationships; and like all hypotheses,
statistical hypotheses may match the reality, or they may fail to do so. Statistical hy-
potheses have the special characteristic that one ordinarily attempts to test them (i.e.,
to reach a decision about whether or not one believes the statement is correct, in the
sense of corresponding to the reality) by observing facts relevant to the hypothesis in a
sample. This procedure, of course, introduces the difficulty that the sample may or
may not represent well the population from which it was drawn.
Types of Hypotheses
Null Hypothesis (H0): Hypothesis testing works by collecting data and measuring how
likely the particular set of data is, assuming the null hypothesis is true. If the data set is
very unlikely, defined as belonging to a class of data sets that will only rarely be
observed, the experimenter rejects the null hypothesis, concluding that it is (probably)
false. The null hypothesis can never be proven; the only thing we can do is reject it or
fail to reject it.
Alternative Hypothesis (H1 or HA): The alternative hypothesis (or maintained
hypothesis, or research hypothesis) is the statement that is accepted if the sample
data provide enough evidence to reject the null hypothesis.
Both types of errors are problems for individuals, corporations, and data analysis.
Based on the real-life consequences of an error, one type may be more serious than
the other. For example, NASA engineers would prefer to throw out an electronic
circuit that is really fine (null hypothesis H0: not broken; reality: not broken; action:
thrown out; error: type I, false positive) than to use one on a spacecraft that is
actually broken (null hypothesis H0: not broken; reality: broken; action: use it; error:
type II, false negative). In that situation a type I error raises the budget, but a type II
error would risk the entire mission.
Confidence Interval
surveyselect is the procedure for executing a sampling procedure. The data set that
we consider here is employee_satisfaction. The method of sampling specified here is
simple random sampling without replacement (srs), and we have pre-specified the
sample size to be 50. This is a proc step which generates a report. Some important
concepts generated in the report are:
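A sketch of the statement being described (the output data set name is an assumption):

proc surveyselect data=employee_satisfaction
      method=srs      /* simple random sampling without replacement */
      n=50            /* pre-specified sample size */
      out=sample_srs; /* hypothetical name for the sampled data set */
run;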
Random Number Seed: An integer used to set the starting point for generating a
series of random numbers. The seed sets the generator to a random starting point;
a unique seed returns a unique random number sequence. Given the seed, a series
of random numbers is generated. If no random number seed is specified, then the
numerical value of the system time is used for generating the subsequent random
numbers.
Selection Probability: This shows the probability of selecting a sample of n obser-
vations from a total of N observations (N > n). Each observation is equally likely to
be drawn from the population, and a sample observation, once drawn, is not
returned to the population.
Sampling Weight: A sampling weight is a statistical correction factor that compen-
sates for a sample design that tends to over- or under-represent various segments
within a population. In some samples, small subsets of the population, such as reli-
gious, ethnic, or racial minorities, may be oversampled in order to have enough
cases to analyze. When these subsamples are combined with the larger sample,
their disproportionately large numbers must be diluted by a sampling weight. This is
just the reciprocal of the selection probability of a sample.
This code describes an alternate technique of sampling. The method urs, or
unrestricted random sampling, refers to the type of random sampling where the
sample points are returned to the population once the observations are recorded.
This process is also called simple random sampling with replacement. The final data
set might not contain 50 unique observations, since repetition may occur in the
selection of the sample observations. In this form of sampling, the report generated
contains, in addition to the concepts introduced for srs, another concept called the
expected number of hits.
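Sketch:

proc surveyselect data=employee_satisfaction
      method=urs      /* unrestricted random sampling, i.e. with replacement */
      n=50
      out=sample_urs; /* hypothetical output name */
run;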
The expected number of hits is the with-replacement analogue of the selection
probability in simple random sampling without replacement. This measure represents
the average number of times a particular observation is selected in the process of
random sampling with replacement. The sampling weight, in this context, is the
reciprocal of the expected number of hits.
The sample, in this technique, is drawn from the population based on a particular
order. For example, if a departmental store wants to know about the level of
customer satisfaction, it needs to survey its customers. If in a day the store expects
a footfall of 1000 customers and the required sample size is 100, then the store can
question every 10th person walking in through the door.
The option method=sys is used to execute the systematic sampling process. The
sampling interval is calculated as K = N/n, where n is the size of the sample and N
the size of the population. So, to get a sample of size 30 from a population of 1500
observations, every 50th observation should be surveyed.
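Sketch:

proc surveyselect data=employee_satisfaction
      method=sys      /* systematic sampling: every K-th observation */
      n=30            /* sample size used in the example above */
      out=sample_sys; /* hypothetical output name */
run;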
A parametric test is one that requires data from one of the large catalogue of
distributions that statisticians have described, and for data to be parametric certain
assumptions must be true. If you use a parametric test when your data is not
parametric, the results are likely to be inaccurate. Therefore, it is very important that
we check the assumptions before deciding which statistical test is appropriate.
Normally Distributed Data: It is assumed that the data are from one or more nor-
mally distributed populations. The rationale behind hypothesis testing relies on
normally distributed populations, and so if this assumption is not met then the logic
behind hypothesis testing is flawed. Most researchers eyeball their sample data using
a histogram; if the sample data look roughly normal, then the researchers assume
that the populations are normal as well.
Homogeneity of Variance: The assumption means that the variance should be the
same throughout the data. In designs in which you test several groups of partici-
pants, this assumption means that each of these samples comes from populations
with the same variance.
Interval Data: Data should be measured at least at the interval level. This means
that the distance between points of your scale should be equal at all parts along
the scale. For example, if you had a 10 point anxiety scale, then the difference in
anxiety represented by a change in score from 2 to 3 should be the same as that
represented by a change in score from 9 to 10.
Independence: This assumption is that data from different participants are inde-
pendent, which means that the behavior of one participant does not influence the
behavior of another.
The assumptions of interval data and independent measurement are tested only by
common sense. The assumption of homogeneity of variance is tested in different
ways for different procedures.
Z Test
A Z-test is any statistical test for which the distribution of the test statistic under the null
hypothesis can be approximated by a normal distribution.
Assumptions
The parent population from which the sample is drawn should be normal
The sample observations are independent, i.e., the given sample is random
The population standard deviation is known
T Test
A t-test is any statistical hypothesis test in which the test statistic follows a Student's t dis-
tribution if the null hypothesis is supported. Among the most frequently used t-tests are:
A one-sample location test of whether the mean of a normally distributed popula-
tion has a value specified in a null hypothesis.
A two sample location test of the null hypothesis that the means of two normally
distributed populations are equal.
A test of the null hypothesis that the difference between two responses measured
on the same statistical unit has a mean value of zero.
A test of whether the slope of a regression line differs significantly from zero.
Assumptions
Consider that you have conducted a survey that studied the commitment to change
in your organization. Now you need to find out if there are any differences in the
commitment to change between male and female staff members. Or, for instance,
a researcher wants to find out if middle-level employees are more satisfied than
top-level employees; in this case the researcher needs the satisfaction scores for
middle and top management. Here again we can see that one variable (satisfaction)
is divided into two groups (middle and top level). So, in summary, when we need to
compare the means of one variable across two groups, we use a two-sample t-test.
A company markets an eight-week weight loss program and claims that at the end
of the program, on average, a participant will have lost 5 pounds. On the other hand,
you have studied the program and you believe that it is scientifically unsound and
shouldn't work at all. You want to test the hypothesis that the weight loss program
does not help people lose weight. Your plan is to get a random sample of people
and put them on the program. You will measure their weight at the beginning of the
program and then measure their weight again at the end of the program. Based on
some previous research, you believe that the standard deviation of the weight
difference over eight weeks will be 5 pounds.
Assumptions
The assumptions underlying the paired samples t-test are similar to the one-sample t-
test but refer to the set of difference scores.
The observations are independent of each other
The dependent variable is measured on an interval scale
The differences are normally distributed in the population
Hypotheses of Paired Sample t Test:
H0: The two population means are equal
H1: The two population means are not equal
In summary, a paired sample t-test tries to assess whether an action is effective or
not.
The case study on a single-variable t-test pertains to a leading hospital in the city. The
baseline blood pressures of 60 patients belonging to different age groups were
recorded. The data set contains three variables, namely: subject (id variable), age
(numeric variable) and baseline bp (numeric variable).
The objective of the case study is to check whether there has been a statistically
significant change in the average blood pressure over a span of 45 days. We use the
t-test in this case. However, before using the test we need to check the assumption
of normality.
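The actual code is not reproduced in the text; a hedged sketch of a one-sample t-test against a reference value (both the data set name and the benchmark of 120 are hypothetical):

proc ttest data=hospital h0=120;   /* hypothetical data set and null value */
   var baseline_bp;
run;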
The two-independent-sample t-test is useful for examining significant differences in
the means of two data sets. The present case study considers two renowned pizza
companies, ABC and XYZ. The manager of the XYZ company is apprehensive about
falling sales compared to its competitor ABC. The absolute delivery time of pizza
company ABC is less than that of XYZ, but this would be considered a crucial factor
in explaining the declining sales of XYZ only if the mean delivery time of company
ABC is significantly less than the mean delivery time of XYZ.
The t-test is executed using the procedure ttest. Since this is a t-test to check the
difference of means between two groups, we introduce the class keyword to identify
the two pizza companies. The variable on which the t-test is to be carried out is
waiting_time__in_minutes_. Three important tables are generated once the code is
executed:
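A sketch of the statement (the data set and class variable names are assumptions):

proc ttest data=pizza;
   class company;                    /* identifies the two pizza companies */
   var waiting_time__in_minutes_;    /* delivery time to be compared */
run;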
STATISTICS: This table describes the vital statistics associated with the two pizza
companies. It gives us a clear idea that the delivery time of pizza company ABC is
distinctly less than the delivery time of company XYZ. How can we say so? From the
confidence intervals within which the sample means of the two companies lie.
EQUALITY OF VARIANCES: To compare the means of two different sets it is necessary
to check the variances of the two sets. The population variances of the two data
sets must be identical in nature: the mean-difference test is executed under the
assumption that the variance remains constant across the two data sets. The
equality of variances is tested using the folded F-test, defined as
F = max(s1^2, s2^2) / min(s1^2, s2^2), where s1^2 and s2^2 are the variances of
category 1 and category 2.
The decision rule used is the p-value rule, whereby the null hypothesis is not rejected
if the exact probability of committing a type I error exceeds the benchmark
probability prescribed by the level of significance. Here, the p-value associated with
the folded F-statistic is 0.38. This is much greater than the level of significance, so
rejecting the null hypothesis would carry too high a chance of a type I error; we
therefore do not reject it. It is safe to conclude that the population variances of the
two pizza companies are not significantly different.
T-TESTS: This table displays the results of the t-test corresponding to the difference in
the mean delivery times of pizzas. The results are displayed under two sub-headings:
pooled variance and unequal variance. Since the variances were found to be equal,
we consider the results corresponding to the pooled variance. The p-value
corresponding to the t-statistic is 0.0003, which is less than the prescribed level of
significance. Therefore, it is easy to conclude that the mean delivery times of the
pizza companies ABC and XYZ are significantly different from one another.
To analyze the impact of e-learning on students, the Ministry of Human Resource
Development of the Government of India performed an exploratory study on a
sample of 50 students. The students were first taught in the traditional method of
teaching and then through the method of e-learning, without the presence of any
teachers. The students' marks were recorded before the e-learning and after the
e-learning, and the two sets of marks were then compared to analyze the impact of
e-learning on the students' performance.
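Only a fragment of the code survives in the text; a sketch of the full paired t-test statement (the data set and variable names are assumptions):

proc ttest data=elearning;
   paired marks_after*marks_before;   /* tests whether the mean difference is zero */
run;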
The keyword paired is used to execute the paired t-test between the marks before
and marks after. The hypothesis set up is:
H0: The ex-post and ex-ante means are not significantly different
v/s
HA: The ex-post and ex-ante means are significantly different
The results for this t-test are displayed through the following tables:
STATISTICS: The statistic table shows that the mean marks of students have in-
creased after incorporating the e-learning process. The question that arises from
the table above is: Is the rise in the mean marks post the e-learning a significant
rise? To test the significance of the change we use the t-test table.
T-TEST: The t-test table details the significance of the difference of the paired means.
The p-value rule is used for deciding whether the null hypothesis should be accepted
or not. The p-value generated by the model (0.4539) is greater than the level of
significance. This means that the difference in the means is not statistically significant.
Therefore, the analysis shows that the mean of the performance of the students post
the e-learning process did not change significantly. Hence, e-learning employed by
the ministry of education did not prove to be effective as a strategy.
Consider the following questions:
Is there any association between income level and brand preference?
Is there any association between family size and the size of washing machine
bought?
Are the attributes educational background and type of job chosen independent?
The solutions to the above questions need the help of the Chi-Square test of
independence in a contingency table. Please note that the variables involved in
Chi-Square analysis are nominally scaled. Nominal data are also known by two
other names: categorical data and attribute data.
Contingency Table: Is there any relation between age and investment?

Age         Stock   Bond   Cash   Total
25 - 34        30     10      1      41
35 - 44        35     25      2      62

Assumptions
Remember, the Chi-square test of independence only checks whether there is any
association between the attributes; it does not tell us the nature of the association.
Correlation Analysis
The simplest way to look at whether two variables are associated is to look at whether
they covary. To understand what covariance is, we first need to think back to the con-
cept of variance.
Variance = Σ(xi - mx)^2 / (N - 1) = Σ(xi - mx)(xi - mx) / (N - 1)
The mean of the sample is represented by mx, xi is the data point in question and N is
the number of observations. If we are interested in whether two variables are related,
then we are interested in whether changes in one variable are met with similar
changes in the other variable. When there are two variables, rather than squaring
each difference, we can multiply the difference for one variable by the
corresponding difference for the second variable. As with the variance, if we want
an average value of the combined differences for the two variables, we must divide
by the number of observations (we actually divide by N - 1). This averaged sum of
combined differences is known as the covariance:
Cov(x,y) = Σ(xi - mx)(yi - my) / (N - 1)
There is, however, one problem with covariance as a measure of the relationship
between variables: it depends upon the scales of measurement used. So covariance
is not a standardized measure. To overcome the problem of dependence on the
measurement scale, we need to convert the covariance into a standard set of units.
This process is known as standardization. Therefore, we need a unit of measurement
into which any scale of measurement can be converted; the unit of measurement
we use is the standard deviation.
The standardized covariance is known as the correlation coefficient:
r = cov(x,y) / (sx sy) = Σ(xi - mx)(yi - my) / [(N - 1) sx sy]
which always lies between -1 and +1.
For pairs from an uncorrelated bivariate normal distribution, the sampling distribution
of Pearson's correlation coefficient follows Student's t-distribution with n - 2 degrees of
freedom. Specifically, if the underlying variables have a bivariate normal distribution,
the variable
t = r √(n - 2) / √(1 - r^2)
follows a t-distribution with n - 2 degrees of freedom under the null hypothesis of
zero correlation.
Partial Correlation
A correlation between two variables in which the effects of other variables are held
constant is known as partial correlation. The partial correlation of variables 1 and 2,
controlling for variable 3, is given by:
r12.3 = (r12 - r13 r23) / √[(1 - r13^2)(1 - r23^2)]
For example, we might find that the ordinary correlation between blood pressure and
blood cholesterol is a strong positive correlation. We could potentially find a very
small partial correlation between these two variables after we have taken into
account the age of the subject. If this were the case, it would suggest that both
variables are related to age, and the observed correlation is only due to their
common relationship with age.
proc corr is used to calculate the correlation between two or more quantitative
variables. The var statement identifies the variables whose correlation coefficients are
to be quantified. The output of this code is a 4x4 correlation matrix; each element in
this matrix shows the correlation coefficient between two variables. Associated with
each correlation coefficient is a p-value which shows the statistical significance of
the correlation coefficient.
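A sketch, using the library, data set and variable names that appear in the matrix-plot code below:

proc corr data=day1.correlation;
   var Education Experience Wage_dollars_per_hour_ Age;   /* produces a 4x4 correlation matrix */
run;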
PARTIAL CORRELATION
This code produces the correlation between the two variables Education and
Experience. The partial statement is used to adjust the correlation coefficient
between Education and Experience for the impact of the variable Age. This
adjustment is important for finding out the exact extent to which Education and
Experience are correlated.
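Sketch:

proc corr data=day1.correlation;
   var Education Experience;
   partial Age;   /* hold the effect of Age constant */
run;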
MATRIX PLOT
ods html;
ods graphics on;
proc corr data=day1.correlation noprint plots=matrix;
var Education Experience Wage_dollars_per_hour_ Age;
run;
ods graphics off;
ods html close;
For a matrix view of the correlations, we first set the ODS (Output Delivery System)
destination to HTML, then turn on the graphics mode. In proc corr we use the noprint
option to suppress the output in the output window, and at the same time we set
the plot type to matrix. After running the code, we turn off the graphics mode and
reset the output delivery system.
Here we are trying to find out whether there is any association between
Frequency_of_Readership and Level_of_Educational_Achievement. This test is done
with the freq procedure, and we request a chi-square test in the tables statement.
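A sketch (the data set name is an assumption):

proc freq data=readership;
   tables Frequency_of_Readership*Level_of_Educational_Achievement / chisq;   /* chi-square test of independence */
run;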
A manager wants to raise productivity at his company by increasing the speed at
which his employees can use a particular spreadsheet program. As he does not have
the skills in-house, he employs an external agency which provides training in this
spreadsheet program. They offer 3 packages: a beginner, an intermediate and an
advanced course. He is unsure which course is needed for the type of work they do
at his company, so he sends 10 employees on the beginner course, 10 on the
intermediate and 10 on the advanced course. When they all return from the training
he gives them a problem to solve using the spreadsheet program and times how long
it takes them to complete it. He then wishes to compare the three courses (beginner,
intermediate, advanced) to see if there are any differences in the average time it
took to complete the problem.
The two-way analysis of variance (ANOVA) is an extension of the one-way ANOVA
that examines the influence of two or more categorical independent variables on
one dependent variable. While the one-way ANOVA measures the significant effect
of one independent variable (IV), the two-way ANOVA is used when there is more
than one IV and there are multiple observations for each IV. The two-way ANOVA
can not only determine the main effect of each IV but also identify whether there is
a significant interaction effect between the IVs.
Example
What is Interaction?
When gender and level of education interact, we get 6 different groups, namely:
Male-School, Female-School, Male-College, Female-College, Male-University and
Female-University. Using two-way ANOVA, we are trying to understand whether any
of these groups is significantly different from the rest. If the interaction levels don't
show any significant differences, neither will the main factors across their levels.
Assumptions
As with other parametric tests, we make the following assumptions when using two-
way ANOVA:
The populations from which the samples are obtained must be normally distributed
Sampling is done correctly. Observations for within and between groups must be
independent
The variances among populations must be equal (homogeneity)
The dependent variable is measured on an interval scale, and the independent
variables are categorical (nominal)
The hypotheses for the test are:
For each factor and interaction,
H0: The means of all groups are equal
H1: At least one group mean is different from the rest
We demonstrate one-way ANOVA through a case study. The case that we consider
is that of three production plants: Maruti, Hyundai and Tata. The processing times of
cars in each of these plants are recorded. The objective of the analyst is to find out
whether there exists a significant difference between the mean processing times of
the plants.
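The code itself is not reproduced in the text; a sketch with hypothetical data set and variable names:

proc anova data=plants;
   class plant;                     /* Maruti, Hyundai, Tata */
   model processing_time = plant;   /* dependent = processing time, independent = plant */
run;
quit;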
anova is the procedure used for analysis of variance when the data is balanced.
class is the keyword for specifying the different groups in the problem; in this case,
the class variable is the production plant of each company. The model keyword
specifies the relationship between a dependent and an independent variable: the
left-hand side of the equality is the dependent variable and the right-hand side
represents the independent variable. The code generates the following tables:
The first table shows the statistics associated with the overall goodness of the model. It displays the variation across the groups (Mean Model Sum of Squares) and within the groups (Mean Squares of Errors). The F-statistic is calculated as the ratio of the explained variation in the model to the unexplained variation, and the p-value rule is employed to check the significance of the F-value. The p-value for the F-statistic in this study is 0.1447, which is greater than the level of significance (0.05). Thus it can be concluded that there is no significant difference in the processing times of cars in the three plants.
The second table generates all the descriptive statistics corresponding to the varia-
ble mean_processing_time_of_plant.
The mean processing times of plants of the three companies are not significantly differ-
ent from each other. One problem with the one-way anova is that it does not include
any interaction effect between the independent variables. This problem is addressed
by two-way anova.
TWO-WAY ANOVA
A survey examined the weight gained by men as a result of two factors, viz. the amount of food consumed and the type of diet. Ten representative men were randomly selected and each of them was fed each type of diet in the two specified diet amounts (high and low). The weight gained was measured in grams. There are three variables with a total of 60 observations.
The numeric variable Weight Gain denotes the weight gained by the men. Two separate samples of pre- and post-treatment weights are not taken; rather, a single sample of actual weight gain is considered. The variable Diet Amount denotes the amount of diet; it is a categorical variable recording two responses: 1 for high and 2 for low. The variable Diet Type denotes the type of diet consumed, also a categorical variable, recording three responses: 1 for a vegetarian diet, 2 for a non-vegetarian diet and 3 for a mixed diet.
The objective of the study is to locate the factors which most significantly affect the
weight gain in individuals. The code for two-way anova is:
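A sketch, assuming the data sit in a hypothetical data set day1.diet_study with variables diet_amount, diet_type and weight_gain:

proc glm data=day1.diet_study;
   class diet_amount diet_type;
   model weight_gain = diet_amount diet_type diet_amount*diet_type; /* main effects plus interaction */
run;
quit;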
This can also be done using proc anova, but anova works well only when the data is balanced, i.e. when the interaction groups are equal in size. Also, we are more interested in the Type III sum of squares. So we prefer proc glm over proc anova.
Suppose we are interested in consumers' evaluation of a brand of coffee. We take a random sample of consumers, each of whom was given a cup of coffee. They were not told which brand of coffee they were given. After they had drunk the coffee, they were asked to rate it on 14 semantic differential scales. The 14 attributes which were investigated are shown below:
1. Pleasant flavor – Unpleasant flavor
2. Stagnant, muggy taste – Sparkling, refreshing taste
3. Mellow taste – Bitter taste
4. Cheap taste – Expensive taste
5. Comforting, harmonious – Irritating, discordant
6. Smooth, friendly taste – Rough, hostile taste
7. Dead, lifeless, dull taste – Alive, lively, peppy taste
8. Tastes artificial – Tastes like real coffee
9. Deep distinct flavor – Shallow indistinct flavor
10. Tastes warmed over – Tastes just brewed
11. Hearty, full-bodied, full flavor – Warm, thin, empty flavor
12. Pure, clear taste – Muddy, swampy taste
13. Raw taste – Stale taste
14. Overall preference: Excellent quality – Very poor quality

A factor analysis of the ratings given by consumers indicated that four factors could summarize the 14 attributes. These factors were: comforting quality, heartiness, genuineness and freshness.

Factor                 Attributes
A. Comforting Quality  1. Pleasant flavor; 3. Mellow taste; 5. Comforting taste; 12. Pure, clear taste
B. Heartiness          9. Deep distinct flavor; 11. Hearty, full-bodied, full flavor
C. Genuineness         2. Sparkling taste; 4. Expensive taste; 6. Smooth, friendly taste; 7. Alive, lively, peppy taste; 8. Tastes like real coffee; 14. Overall preference
D. Freshness           10. Tastes just brewed; 13. Raw taste
Here we are only exploring the factors, but we cannot confirm whether these are the
only factors, hence the name Exploratory Factor Analysis.
Principal component analysis was developed by Pearson and adapted for factor
analysis by Hotelling. A goal for the user of PCA is to summarize the interrelationships
among a set of original variables in terms of a smaller set of uncorrelated principal
components that are linear combinations of the original variables.
PCA assumes that there is as much variance to be analyzed as the number of ob-
served variables and that all of the variance in an item can be explained by the ex-
tracted factors. Communality means the variance that the items and factors share in
common.
PCA has been described as eigenanalysis, i.e. seeking the solution to the characteristic equation of the correlation matrix. An eigenvalue represents the amount of variance in all of the items that can be explained by a given principal component or factor. An eigenvector of a correlation matrix is a column of weights.
Factor Loadings
So, PCA explains the entire variance while EFA explains only a part of it: in EFA we are basically concerned with the common variance shared by the variables.

[Scree plot: eigenvalues plotted against the number of factors]
Factor Analysis is an Interdependence technique. In interdependence techniques the
variables are not classified as dependent or independent; rather, the whole set of in-
terdependence relationships is examined.
Initially, the weights are distributed across all the variables, so it is not possible to understand the underlying factor behind one or more variables. To remove this problem, we apply rotation to the axes.
We mainly deal with two types of rotation:
Orthogonal Rotation: Varimax
Oblique Rotation: Promax
The problem with oblique rotation is that it leaves the factors correlated. Varimax rotation is used in principal component analysis to rotate the axes to a position in which the sum of the variances of the squared loadings is the maximum possible.
Here we are concerned with the underlying factors of employee satisfaction. The name of the data set is employee_satisfaction. Let's first look at the variables in the data set.
So apart from the variable employee which is basically the identification of the em-
ployee, all the variables contribute to the satisfaction of the employee. Using factor
analysis we are going to find out the underlying factors of the employee satisfaction
and see which variable belongs to which factor.
The corr option in the proc factor statement produces the correlation matrix of the variables mentioned in the var statement. If the correlations between the variables are very near to zero (say within +/- 0.2), then the variables are independent, so they themselves are the factors. The msa option produces the KMO MSA check, and the scree option produces a scree plot.
The rotate option specifies the type of rotation that we give. Here we have assigned
Varimax rotation.
If we want to calculate all the scoring coefficients, we mention the option score. The mineigen=0.5 option implies that we want to retain only those factors that have eigenvalues greater than 0.5.
For individual factor scores, we specify the option out=day1.factor_scores.
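Putting these options together, a sketch of the call might look like the following; the variables in the var statement are hypothetical:

proc factor data=day1.employee_satisfaction
            corr msa scree rotate=varimax score mineigen=0.5
            out=day1.factor_scores;
   var pay promotion supervision coworkers work_itself; /* hypothetical variables */
run;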
CLUSTER ANALYSIS
Cluster analysis groups individuals or objects into clusters so that objects in the same cluster are more similar to one another than they are to objects in other clusters. The attempt is to maximize the homogeneity of objects within the clusters while also maximizing the heterogeneity between the clusters. Like factor analysis, cluster analysis is also an interdependence technique.
A Simple Example
Suppose you have done a pilot marketing of a candy on a randomly selected sample of consumers. Each consumer was given a candy and was asked whether they liked it and whether they would buy it. The respondents were then grouped into the following four clusters:

[Chart: respondents grouped by LIKED / NOT LIKED and WILL BUY / WILL NOT BUY]

The NOT LIKED, WILL BUY group is a bit unusual, but people can buy for others. From a strategy point of view, the LIKED, WILL NOT BUY group is important, because its members are potential customers: a possible change in the pricing policy may change their purchasing decision.
From the example, it is very clear that we must have some objective on the basis of which we want to create clusters. The following questions need to be answered:

What kind of similarity are we looking for? Is it pattern or proximity?
How do we form the groups?
How many groups should we form?
What's the interpretation of each cluster?
What's the strategy related to each of these clusters?
Customer Profiling
Cluster analysis does not have a theoretical statistical basis, so no inference can be made from the sample to the population. It's only an exploratory technique; nothing guarantees unique solutions.
Cluster analysis will always create clusters, regardless of the actual existence of any structure in the data. Just because clusters can be found doesn't validate their existence.
The cluster solution cannot be generalized, because it is totally dependent upon the variables used as the basis for the similarity measure. This criticism can be made against any statistical technique. With the cluster variate completely specified by the researcher, the addition of spurious variables or the deletion of relevant variables can have a substantial impact on the resulting solution. As a result, the researcher must be especially cognizant of the variables used in the analysis, ensuring that they have strong conceptual support.
Hierarchical clustering methods fall into two types: agglomerative and divisive.
Related Statistics
Semi Partial R Squared: The semi-partial R-squared (SPR) measures the loss of homoge-
neity due to merging two clusters to form a new cluster at a given step. If the value is
small, then it suggests that the cluster solution obtained at a given step is formed by
merging two very homogeneous clusters.
R Square: R-Square (RS) measures the heterogeneity of the cluster solution formed at a
given step. A large value represents that the clusters obtained at a given step are
quite different (i.e. heterogeneous) from each other, whereas a small value would sig-
nify that the clusters formed at a given step are not very different from each other.
Related Charts
Dendrogram: It's a chart showing which two clusters merge at which distance.
Icicle: It's a chart showing which case is merged into a cluster at which level.
The original data set IPL_cluster is in csv format and therefore we need to import it. We use proc import to bring this file into the SAS library. The data set contains a variety of variables which need to be used repeatedly in our analysis, and proc contents helps us get the list of all the variables in their creation order.
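A sketch of the two steps; the file path is modeled on the one shown later for the German bank data, so treat it as an assumption:

proc import datafile="C:\Documents and Settings\OrangeTree\Desktop\Analytics data sets and case studies\IPL_cluster.csv"
     out=day1.iplcluster
     dbms=csv replace;
run;

proc contents data=day1.iplcluster position short;
run;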
In cluster analysis, the idea is to club together related observations, and in order to do that we need some sort of composite weight. In this data set, it does not make any sense to add up the runs scored with the number of not-outs or the number of sixes hit. So the first step towards segmenting the IPL data set is to standardize it, so that all the variables become free of units. The following code creates a standardized data set free of units.
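One way to do this is proc standard, which rescales each variable to mean 0 and standard deviation 1 (a sketch; the var list is hypothetical):

proc standard data=day1.iplcluster mean=0 std=1 out=day1.iplstandard;
   var runs not_outs sixes; /* hypothetical performance variables */
run;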
In this code we apply the cluster procedure to the data set iplstandard. We use the outtree= option to generate a data set named cluster_tree, which can be used to generate the dendrogram. The method=ward option stands for Ward's minimum variance method: it merges those observations which would induce the minimum increase in the error sum of squares, or within-group variation. The statement id player retains the player variable in the data set without performing any mathematical function on it.
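A sketch of the clustering step just described:

proc cluster data=day1.iplstandard method=ward outtree=day1.cluster_tree;
   id player; /* carry the player name along without analyzing it */
run;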
This code generates a cluster history. Besides the semi-partial R-squared (SPR) and R-squared (RS) statistics defined above, its important components include the following:
TIED: It implies that the performance of the two observations clustered together is not unique; other pairs have performed in a similar manner. Note that the pair we choose is going to affect the cluster formation, so it is at our discretion to choose the pair we want. By default, SAS applies a tie-breaking rule to pick the pair to cluster.
This code generates the dendrogram; proc tree is the procedure that draws it. Now, suppose we want to retain 4 clusters for our analysis. The following code is used to do so:
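A sketch of the two proc tree calls: the first draws the dendrogram, the second cuts the tree at 4 clusters and writes the assignments to cluster_result:

proc tree data=day1.cluster_tree;
   id player;
run;

proc tree data=day1.cluster_tree out=day1.cluster_result nclusters=4 noprint;
   id player;
run;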
The nclusters= option specifies the number of clusters to be retained. The data set cluster_result has two new variables named cluster and clusname. The column cluster has values like 1, 2, 3 and 4, representing which player belongs to which cluster. The variable clusname has values such as CL4, CL5, CL7 and CL8.
/* keep only the player and cluster variables */
data day1.cluster1;
set day1.cluster_result (keep=player cluster);
run;
We use the keep= option to retain only player and cluster in the data set. The newly created data set is cluster1. The next three steps arrange the data so that we can make meaningful decisions: first we sort the data set cluster1 by player and output the result to cluster2, then we sort the iplcluster data set by player and output the result to cluster3. We sort both data sets because we want to merge cluster2 and cluster3 together, in order to add the cluster assignment as a new column to the original data set iplcluster.
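A sketch of the sort-and-merge sequence just described; cluster_final is a hypothetical name for the merged data set:

proc sort data=day1.cluster1 out=day1.cluster2;
   by player;
run;

proc sort data=day1.iplcluster out=day1.cluster3;
   by player;
run;

data day1.cluster_final; /* original variables plus the cluster assignment */
   merge day1.cluster2 day1.cluster3;
   by player;
run;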
This code shows the properties of the clusters. By examining the descriptive statistics associated with each of the clusters, we can explain which cluster is the best. The objective of the final code is to print the four major clusters, from which we can form our decision about the best cluster.
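A sketch of the profiling step, assuming the merged data set from above and hypothetical performance variables:

proc sort data=day1.cluster_final;
   by cluster;
run;

proc means data=day1.cluster_final; /* descriptive statistics per cluster */
   class cluster;
   var runs not_outs sixes; /* hypothetical */
run;

proc print data=day1.cluster_final; /* list the players cluster by cluster */
   by cluster;
run;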
LINEAR REGRESSION
In regression analysis we fit a predictive model to our data and use that model to
predict values of the dependent variable from one or more independent variables.
Simple regression seeks to predict an outcome variable from a single predictor vari-
able whereas multiple regression seeks to predict an outcome from several predic-
tors. We can predict any data using the following general equation:
Outcomei = (Model)i + Errori
The model that we fit here is a linear model.
Linear model just means a model based on
a straight line. One can imagine it as trying
to summarize a data set with a straight line.
[Charts: lines with the same intercept but different slopes, and lines with the same slope but different intercepts]
Like other statistical methods, using regression we are trying to discover a relationship
between the dependent and independent variable(s) from a sample and try to draw
inference on the population. So there comes the tests of significance in linear regres-
sion.
The equation of the estimated line is:

Ŷ = α + βX

Here α and β are the estimated values of the intercept and the slope respectively. The tests of significance are related to these two estimates.
Global Test
H0: All the parameters are equal to zero simultaneously
H1: At least one is non-zero
This test is conducted using an F statistic similar to the one we saw in ANOVA.
Local Test
For each individual parameter,
H0: The parameter value is zero
H1: The value is non-zero
This test is conducted using a t statistic similar to a one-sample t test.
For simple linear regression there is no difference between the global and local tests, as there is only one independent variable.
Multiple Linear Regression (MLR) is basically an extension of simple linear regression. Simple linear regression has a single explanatory variable, whereas multiple regression considers more than one independent variable to explain the dependent variable. So from a realistic point of view, MLR is more attractive than simple linear regression. For example,

(Salary)i = a + b1(Education)i + b2(Experience)i + b3(Productivity)i + b4(Work Experience)i + ei
Assumptions
The relationship between the dependent and the independent variables is linear.
Scatter plots should be checked as an exploratory step in regression to identify pos-
sible departures from linearity.
The errors are uncorrelated with the independent variables. This assumption is
checked in residuals analysis with scatter plots of the residuals against individual
predictors.
The expected value of residuals is zero. This is not a problem because the least
squares method of estimating regression equations guarantees that the mean is
zero.
The variance of the residuals is constant. An example of violation is a pattern of residuals
whose scatter (variance) increases over time. Another aspect of this assumption is
that the error variance should not change systematically with the size of the pre-
dicted values. For example, the variance of errors should not be greater when the
predicted value is large than when the predicted value is small.
The residuals are random or uncorrelated in time.
The error term is normally distributed. This assumption must be satisfied for conven-
tional tests of significance of coefficients and other statistics of the regression equa-
tion to be valid.
Concept of Multicollinearity
Signs of Multicollinearity
What is VIF?
The Variance Inflation Factor (VIF) is a statistic that can be used to identify multicollinearity in a matrix of predictor variables. Variance inflation refers here to the effect of multicollinearity on the variance of the estimated regression coefficients. Multicollinearity depends not just on the bivariate correlations between pairs of predictors, but on the multivariate predictability of any one predictor from the other predictors. Accordingly, the VIF is based on the multiple coefficient of determination of the regression of each predictor on all the other predictors:

VIFi = 1 / (1 − Ri²)
where Ri² is the multiple coefficient of determination in a regression of the i-th predictor on all other predictors, and VIFi is the variance inflation factor associated with the i-th predictor. Note that if the i-th predictor is independent of the other predictors, the variance inflation factor is one, while if the i-th predictor can be almost perfectly predicted from the other predictors, the variance inflation factor approaches infinity. In that case the variance of the estimated regression coefficients is unbounded. Multicollinearity is said to be a problem when the variance inflation factors of one or more predictors become large. How large is too large is a subjective judgment: some researchers use a VIF of 5 and others a VIF of 10 as the critical threshold. The VIF is closely related to a statistic called the tolerance, which is 1/VIF.
Analysis of residuals consists of examining graphs and statistics of the regression residu-
als to check that model assumptions are satisfied. Some frequently used residuals tests
are listed below. All these are to check whether the error terms are identically inde-
pendently distributed.
Time series plot of residuals: The time series plot of residuals can indicate such problems as non-constant variance of residuals, and trend or autocorrelation in the residuals.
The Durbin-Watson (D-W) statistic tests for autocorrelation of residuals, specifically lag-1
autocorrelation. The D-W statistic tests the null hypothesis of no first-order autocorrela-
tion against the alternative hypothesis of positive first-order autocorrelation. The alter-
native hypothesis might also be negative first-order autocorrelation. Assume the residu-
als follow a first-order autoregressive process
et = ρ et-1 + νt
where νt is random and ρ is the first-order autocorrelation coefficient of the residuals. If the test is for positive autocorrelation of residuals, the hypotheses for the D-W test can be written as H0: ρ = 0 against H1: ρ > 0.
The D-W statistic is given by d = Σ(ei − ei-1)² / Σei²
It can be shown that if the residuals follow a first-order autoregressive process, d is related to the first-order autocorrelation coefficient ρ as d = 2(1 − ρ). The above equation implies that
d = 2 if there is no autocorrelation (ρ = 0)
d = 0 if the first-order autocorrelation is 1 (ρ = 1)
d = 4 if the first-order autocorrelation is -1 (ρ = -1)
The simple linear regression model explains the causality relationship between the de-
pendent variable and a single independent variable. We use the Walmart case study
to explain the different important characteristics of the model. The case study analyses
various factors which explain customer satisfaction for the retail giant, Walmart. This is
basically a rating data set where the customers have rated various departments of
Walmart. Based on this data Walmart will try to understand how it can improve its cus-
tomer satisfaction.
This code imports the file Walmart from the folder Analytics data sets and case studies. The file is initially in csv format and is imported in order to convert it to the SAS format. After importing, we make a list of the variables in the data set, using the position and short options of proc contents, so that we can use them as and when necessary in our analysis.
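A sketch of the import and variable listing; the exact file name is an assumption modeled on the German bank import shown later:

proc import datafile="C:\Documents and Settings\OrangeTree\Desktop\Analytics data sets and case studies\Walmart.csv"
     out=day1.walmart
     dbms=csv replace;
run;

proc contents data=day1.walmart position short;
run;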
reg is the procedure used to execute regression analysis. The keyword model is used for creating a model with customer_satisfaction as the dependent variable and product_quality as the independent variable.
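A sketch of the simple regression just described:

proc reg data=day1.walmart;
   model customer_satisfaction = product_quality;
run;
quit;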
This is a standard model we follow when we would like to apply the regression tech-
nique and there are certain assumptions, which need to be satisfied if the Classical Lin-
ear Regression Model (CLRM) is to be valid. However, before going into the assump-
tions we would split the data set into two parts: Training and Validation data sets. The
objective of splitting the data set into two parts is to check for the robustness of the re-
sult obtained. The selection of the observations in the data set must be random and
therefore we use the ranuni function to break the data set into the two parts. The
code below shows how to break the data set into two parts:
data day1.walmart1;
set day1.walmart;
rannum=ranuni(0); /* attach a uniform random number to each observation */
run;
This is a data step whereby we create a new data set walmart1 from the set Walmart. The newly created data set walmart1 has a random number attached to every observation. These random numbers are generated using the ranuni function, which draws from a uniform distribution. Given these random numbers, we break the data set into two parts using:
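A sketch of the split, using the 70/30 rule described below; the data set names are the ones used in the subsequent code:

data day1.waltraining day1.walvalidation;
   set day1.walmart1;
   if rannum <= 0.7 then output day1.waltraining; /* about 70% of observations */
   else output day1.walvalidation;                /* the remaining 30% */
run;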
Instead of conducting the entire regression analysis on the original Walmart data set, we break it into a Training data set containing 70% of the observations and a Validation data set containing the remaining 30%. This technique of breaking the data set into two parts is called the robust regression technique, since the regression is initially conducted on the training data set and the result obtained is then validated on the validation data set to check for robustness. The entire purpose of this check is to ensure the reliability of the model. In this method of creating the data sets, each and every observation in the original data set has an equal probability of being selected into the training or the validation data set. The observations with random numbers greater than 0.7 are assigned to the validation data set and those with numbers less than or equal to 0.7 are assigned to the training data set.
Before proceeding with any predictions using the CLRM we must examine whether the
assumptions to the model are satisfied. The three most important assumptions are:
There should not be any multicollinearity among the explanatory variables
There should not be autocorrelation among the error terms
The variance of the error terms should be constant
The first check that we perform is for checking multicollinearity:
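A sketch of the check; the text names only ten of the thirteen original regressors, so the model below is illustrative rather than complete:

proc reg data=day1.waltraining;
   model customer_satisfaction = Competitive_Pricing Complaint_Resolution
         Delivery_Speed E_Commerce Packaging Price_Flexibility Product_Line
         Product_Quality Salesforce_Image Warranty_Claims / vif;
run;
quit;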
We use the Variance Inflation Factor to check for the existence of multicollinearity; it is requested through the vif option. The VIF is measured using the auxiliary regression, i.e. the regression of one independent variable on the other independent variables. If VIF is greater than 10, then there is severe multicollinearity, and the variable with the highest value above 10 is dropped. In this case it is the variable Delivery_Speed, which had a VIF of approximately 65. All other variables except Delivery_Speed are then included.
The next objective is to choose the best model. This is done by choosing the model with the highest adjusted R-square, since the adjusted R² takes into account the number of independent variables in the model and adjusts for the loss of degrees of freedom. The code for it is as follows:
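A sketch, listing the nine regressors named in the text (the full candidate set after dropping Delivery_Speed had twelve):

proc reg data=day1.waltraining;
   model customer_satisfaction = Competitive_Pricing Complaint_Resolution
         E_Commerce Packaging Price_Flexibility Product_Line Product_Quality
         Salesforce_Image Warranty_Claims / selection=adjrsq;
run;
quit;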
The option selection=adjrsq is used for selecting the model with the highest adjusted R-square, which is considered to be the model with the greatest explanatory capacity. The models are listed in descending order of goodness of fit, i.e. the model with the highest goodness of fit is listed first and the one with the lowest goodness of fit is last in line. A model here comprises a combination of variables yielding an adjusted R-square measure. The adjrsq table displays the number of variables in each model along with its adjusted R². From now on we use the variables prescribed by the model with the highest adjusted R-square.
The original model contained 13 variables. Among them the variable Delivery_Speed
was dropped to solve the problem of multicollinearity. The model with the highest
adjrsq contains nine variables. This means three more variables have been dropped
to reach the model with the best fit. So, a part of the explanatory capacity of the
model is foregone which might enter the error or the unexplained part. This might cre-
ate a systematic behavior among the error terms. So, we need to check whether the
error terms are identically independently distributed or not. This check requires us to
check the existence of: (a) Autocorrelation (b) Heteroscedasticity. To check for auto-
correlation we use the Durbin-Watson test statistic. The code for checking the autocor-
relation is as follows:
proc reg data=day1.waltraining;
   model customer_satisfaction = Competitive_Pricing Complaint_Resolution
         E_Commerce Packaging Price_Flexibility Product_Line Product_Quality
         Salesforce_Image Warranty_Claims / dw;
run;
quit;
dw is the option for generating the Durbin-Watson test statistic associated with this model. The DW test measures the extent of correlation between the error terms. However, the statistic only reveals the value of the autocorrelation between the error terms; it says nothing about its significance. For that we need the DW test tables, and since SAS does not supply the critical values from those tables, no concrete conclusions can be formed about the nature of autocorrelation between the error terms. An alternative is to use a technique which tests simultaneously for the existence of autocorrelation and heteroscedasticity. This test is called the Spec test, or the Specification test.
The Specification test is executed using the keyword spec. This test aims to check the
following hypothesis:
H0: The error terms are identically and independently distributed
V/s
HA: The error terms are not identically and independently distributed
The null hypothesis is accepted if the p-value associated with the test is greater than the level of significance. Since SAS, by default, considers the 5% level of significance, if the p-value associated with this test is greater than 0.05 we accept the null hypothesis that the error terms are random. Once we have confirmed that the assumptions of the classical linear regression model are satisfied, we next obtain the predicted values of customer satisfaction. The following code is used for this purpose:
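A sketch combining the spec option and the output statement; pred_sat and error are the names the text assigns, while day1.walpredicted is a hypothetical output data set name:

proc reg data=day1.waltraining;
   model customer_satisfaction = Competitive_Pricing Complaint_Resolution
         E_Commerce Packaging Price_Flexibility Product_Line Product_Quality
         Salesforce_Image Warranty_Claims / spec;
   output out=day1.walpredicted predicted=pred_sat residual=error;
run;
quit;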
The statement output out= creates a new data set where the predicted satisfaction and residual values are added to the existing variables. The keyword predicted= generates the predicted customer_satisfaction values, which are saved in pred_sat. Similarly, the keyword residual= calculates the residual values, which are stored in the variable error.
In this step, we check the correlation between the predicted and the actual value of
customer satisfaction. The higher the correlation between the two values, the better is
the prediction of customer satisfaction.
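A sketch of the check, assuming the output data set from the previous step:

proc corr data=day1.walpredicted;
   var customer_satisfaction pred_sat;
run;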
Now, our next task is to score the validation data set using the parameter estimates from the training data set. Using these estimates we compute the predicted values, estimate the correlation between the predicted and actual values, and compare this with the result obtained in the training data set. If the difference between the correlation coefficients of the training and validation data sets is within about 5%-6%, then we know that the results we have obtained are robust.
data day1.wal_valid;
set day1.walvalidation;
pred_sat=-3.16541- 0.30252*E_Commerce + 0.34468*Price_Flexibility +
0.45464*Product_Quality + 0.47807*Product_Line + 0.63257*Salesforce_Image;
run;
The numeric estimates in this scoring equation have been obtained from the parameter estimates table of the training run.
LOGISTIC REGRESSION
In a nutshell, logistic regression is multiple regression but with an outcome variable that is a categorical dichotomy and predictor variables that are continuous or categorical. In plain English, this simply means that we can predict which of two categories a person is likely to belong to, given certain other information.
This example is related to the telecom industry. The market is saturated, so acquiring new customers is a tough job. A study of the European market shows that acquiring a new customer is five times costlier than retaining an existing customer. In such a situation, companies need to take proactive measures to maintain the existing customer base. Using logistic regression, we can predict which customers are going to leave the network. Based on the findings, the company can give some lucrative offers to those customers. All this is a part of Churn Analysis.
Non-Performing Assets are big problems for banks, so banks as lenders try to assess the capacity of borrowers to honor their commitments of interest payments and principal repayments. Using a logistic regression model, the managers can get an idea of whether a prospective customer will default on payment. All this is a part of Credit Scoring.
This is a key question in sales practice. The conventional salesman literally runs after everybody everywhere, which leads to a wastage of precious resources like time and money. Using logistic regression, we can narrow down our search by finding those leads who have a higher probability of becoming customers.
Employee retention is a key strategy for HR managers and is important for the sustainable growth of a company. But in some industries, like Information Technology, the employee attrition rate is very high. Using logistic regression we can build models which will predict the probability of an employee leaving the organization within a given span of time, say one year. The technique can be applied to existing employees, and it can also be applied in the recruitment process.
In simple linear regression, we saw that the outcome variable Y is predicted from the equation of a straight line: Yi = b0 + b1X1 + εi, in which b0 is the intercept, b1 is the slope of the straight line, X1 is the value of the predictor variable and εi is the residual term. In multiple regression, where there are several predictors, a similar equation is derived in which each predictor has its own coefficient.
In logistic regression, instead of predicting the value of a variable Y from predictor vari-
ables, we calculate the probability of Y = Yes given known values of the predictors.
The logistic regression equation bears many similarities to the linear regression equa-
tion. In its simplest form, when there is only one predictor variable, the logistic regres-
sion equation from which the probability of Y is predicted is given by:
P(Y = Yes) = 1 / [1 + exp{−(b0 + b1X1 + εi)}]
One of the assumptions of linear regression is that the relationship between variables is linear. When the outcome variable is dichotomous, this assumption is usually violated. The logistic regression equation described above expresses the multiple linear regression equation in logarithmic terms and thus overcomes the problem of violating the assumption of linearity. On the other hand, the resulting value from the equation is a probability value that varies between 0 and 1. A value close to 0 means that Y is very unlikely to have occurred, and a value close to 1 means that Y is very likely to have occurred.
Look at the data points in the following charts. The first one is for Linear Regression and
the second one for Logistic Regression.
In the case of linear regression, we used the ordinary least squares method to estimate the model. In logistic regression, we use a technique called Maximum Likelihood Estimation to estimate the parameters. This method chooses the coefficients that make the observed values highly probable, i.e. the probability of getting the observed values becomes very high.
The simplest binary choice model is the linear probability model, where as the name
implies, the probability of the event occurring is assumed to be a linear function of a
set of explanatory variables as follows:
P(Y = Yes) = b0 + b1X1 + εi
whereas the equation of logistic regression is as follows:
P(Y = Yes) = 1 / [1 + exp{−(b0 + b1X1 + εi)}]
You may find a great resemblance between linear regression and the Linear Probability Model. From expectation theory, it can be shown that if you have two outcomes like yes or no, and we regress those values on an independent variable X, we get an LPM. In this case, we code yes and no as 1 and 0 respectively. The reason we cannot use the LPM is the same as why we cannot use linear regression for a dichotomous outcome variable, as discussed above. Moreover, you may find some negative probabilities and some probabilities greater than 1! And the error term will make you crazy. So we have to study Logistic Regression. NO CHOICE!
If we code yes as 1 and no as 0, the logistic regression equation can be written as follows:

P(Y = 1) = 1 / [1 + exp{−(b0 + b1X1)}] and P(Y = 0) = 1 − P(Y = 1)

Now if we divide the probability of yes by the probability of no, we get a new measure called ODDS. Odds shouldn't be confused with probability: odds is simply the ratio of the probability of success to the probability of failure. We may ask, what are the odds of India winning against Pakistan? Then we are basically comparing the probability of India winning to the probability of Pakistan winning.
Change In Odds
Writing the odds at X+1 and at X and dividing the second relation by the first, we get e^b1. So if we change X by 1 unit, the odds change by a multiple of e^b1, and the expression (e^b1 − 1) × 100% gives the percentage change in odds. Remember, this kind of interpretation is valid only when X is continuous. When X is categorical, we refer to the Odds Ratio.
Odds Ratio
Suppose we are comparing the odds for a Poor Vision Person getting hit by a car to
the odds for a Good Vision Person getting hit by a car.
In order to estimate the logistic regression model, the likelihood maximization algorithm
must converge. The term infinite parameters refers to the situation when the
likelihood equation does not have a finite solution (or in other words, the maximum
likelihood estimate does not exist). The existence of maximum likelihood estimates for
the logistic model depends on the configurations of the sample points in the observa-
tion space. There are three mutually exclusive and exhaustive categories: complete
separation, quasi-complete separation, and overlap.
We saw in multiple regression that if we want to assess whether a model fits the data,
we can compare the observed and the predicted values of the outcome by using R2 .
Likewise, in logistic regression, we can use the observed and predicted values to assess
the fit of the model. The measure we use is the log likelihood.
The log-likelihood is therefore based on summing the probabilities associated with the predicted and actual outcomes. The log-likelihood statistic is analogous to the residual sum of squares in multiple regression, in the sense that it indicates how much unexplained information remains after the model has been fitted. It's possible to calculate a log-likelihood for different models and to compare these models by looking at the difference between their log-likelihoods.
Now, what should we do with the rest of the variables, which are not in the equation? For that we have a statistic called the Residual Chi-Square Statistic. This statistic tells us whether the coefficients for the variables not in the model are significantly different from zero; in other words, whether the addition of one or more of these variables to the model would significantly affect its predictive power.
The Wald statistic is chi-square distributed, with 1 degree of freedom if the variable is metric and with the number of categories minus 1 if the variable is non-metric.
The Hosmer and Lemeshow goodness of fit (GOF) test is a way to assess whether there
is evidence for lack of fit in a logistic regression model. Simply put, the test compares
the expected and observed number of events in bins defined by the predicted proba-
bility of the outcome. The null hypothesis is that the data are generated by the model
developed by the researcher.
Hosmer-Lemeshow test statistic:

HL = Σ (Oi − Ni·π̄i)² / [Ni·π̄i(1 − π̄i)]

where Oi is the observed frequency of the i-th bin, Ni is the total frequency of the i-th bin, and π̄i is the average estimated probability of the i-th bin.
AIC (Akaike Information Criterion) = −2 log L + 2(k + s), where k is the total number of response levels minus 1 and s is the number of explanatory variables.
Cox and Snell (1989, pp. 208-209) propose the following generalization of the coefficient of determination:

R²(CS) = 1 − [L(0) / L(β̂)]^(2/n)

where n is the sample size, L(0) is the likelihood of the intercept-only model and L(β̂) is the likelihood of the fitted model. So the maximum attainable value of R²(CS) is 1 − L(0)^(2/n). Nagelkerke (1991) proposes the following adjusted coefficient, which can achieve a maximum value of one:

R²(N) = R²(CS) / [1 − L(0)^(2/n)]
All these statistics have similar interpretation as the R2 in Linear Regression. So, in this
part we are trying to assess how much information is reflected through the model.
It is very important to understand the relation between the observed and predicted
outcome. The performance of the model can be benchmarked against this relation.
Simple Concepts
In this table, we are working with unique observations. The model was developed for Y
= Yes. So it should show high probability for the observation where the real outcome
has been Yes and a low probability for the observation where the real outcome has
been No.
Consider observations 1 and 2, where the real outcomes are Yes and No respectively, and the predicted probability for the Yes observation is greater than that for the No observation. Such a pair of observations is called a Concordant Pair. Contrast this with observations 1 and 4: here the predicted probability for the No observation is greater than that for the Yes observation, even though the data were modeled for P(Y = Yes). Such a pair is called a Discordant Pair. Now consider the pair 1 and 3: the probability values are equal although we have opposite outcomes. This type of pair is called a Tied Pair. For a good model, we would expect the number of concordant pairs to be fairly high.
Related Measures
Let nc, nd and t be the number of concordant pairs, the number of discordant pairs and the total number of pairs with different responses in the data set of N observations. Then (t − nc − nd) is the number of tied pairs.
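For reference, the usual association measures can be written in terms of these counts (a sketch of the standard formulas):

Percent Concordant = nc / t
Percent Discordant = nd / t
Somers' D = (nc − nd) / t
Gamma = (nc − nd) / (nc + nd)
Tau-a = 2(nc − nd) / [N(N − 1)]
c = [nc + 0.5(t − nc − nd)] / t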
It is a very subjective issue to decide on the cut-point probability level, i.e. the proba-
bility level above which the predicted outcome is an Event i.e., Yes. A Classification
Table can help the researcher in deciding the cutoff level.
A Classification table has several key concepts.
Event: our targeted outcome, e.g. "Will the customer churn?" Yes is the event.
Non-Event: the opposite of the event. In the previous example, No is the non-event.
Correct Event: for a probability level, the prediction is an event and the observed outcome is also an event.
Correct Non-Event: for a probability level, the prediction is a non-event and the observed outcome is also a non-event.
Incorrect Event: for a probability level, the prediction is an event but the observed outcome is a non-event.
Incorrect Non-Event: for a probability level, the prediction is a non-event but the observed outcome is an event.
Correct: the percentage of correct predictions out of total predictions.
This case study aims to study the credit risk faced by a German bank while extending loans to borrowers. The credibility of a borrower is entirely the private information of the borrower himself, so the bank needs to design measures to screen the good, credible borrowers from the bad defaulters. The objective of the analyst is to design a model so that, given any customer who approaches the bank, the bank is able to predict whether the customer is good or bad. Since the type of customer is dichotomous in nature, the best predictive technique that we can use is logistic regression. The dichotomous situation can be represented as:
Y = 1 if the customer is a good customer
  = 0 if the customer is a bad customer
Since the data set is in csv format, we import it using the code below:
proc import datafile="C:\Documents and Settings\OrangeTree\Desktop\Analytics data
sets and case studies\German Bank data.csv"
out=day1.german_bank
dbms=csv replace;
run;
The new data set in the SAS format is stored in the library day1 by the name
german_bank. The data set is a combination of categorical, character and numeri-
cal variables which describe the status of the customers in some way or the other. Let
us explain some of the variables in this data set:
CHK_ACCNT: This is a categorical variable which shows the amount of money that a customer has with the bank in the checking account, and it indicates the credibility of the customer. If a customer belongs to category 0, it speaks adversely about the credibility of the customer, as a customer in this category has a negative CHK_ACCNT balance. A customer in category 1 is relatively more reliable than one in category 0; the variable is therefore categorized in ascending order of reliability. The highest category (3) implies that the customer does not have any checking account with the bank.
HISTORY: This is a categorical variable classified into five categories (0 to 4). Category 0 represents those customers who have a very clean credit history, in that they have never taken any loans. Category 4 represents the worst possible case: a customer in this category has a critical account.
DURATION: This is a numerical variable which describes the time period for which a
loan has been taken.
Each of these variables explains one or the other side of a borrower's credibility. We again use the robust regression technique for executing the logistic regression. The german_bank data set is decomposed into training and validation data sets using the following data step:
data day1.german_bank1;
set day1.german_bank;
rannum= ranuni (0);
run;
This data step creates a new data set german_bank1 containing an additional column of random numbers generated by the ranuni function and stored in the variable rannum. These random numbers are now used for splitting the data set into the training and the validation data sets using the following code:
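A sketch of the split, assuming the same 70/30 rule used in the Walmart case; the data set names match those used below:

data day1.gertraining day1.gervalidation;
   set day1.german_bank1;
   if rannum <= 0.7 then output day1.gertraining;
   else output day1.gervalidation;
run;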
The two data sets formed by splitting the original data set are gertraining, the training data set where all statistical and empirical techniques are applied to derive the primary results, and gervalidation, where the results obtained in the training data set are validated.
The position and short options of proc contents allow us to make a list of the variables that we need to use repeatedly in our analysis.
The main objective of the next step is to check for the existence of multicollinearity among the independent variables. The explanation and operation of this step are similar to the multicollinearity check in the linear regression case study.
proc logistic is used to run the logistic regression. The model keyword generates the logistic regression model with response treated as the dependent variable. The desc keyword makes SAS model the probability of Y=1, since by default SAS models the lowest value (here zero). selection=stepwise is the technique used for entering the variables: at the first step the intercept is entered, acting as the baseline or reference model, and at each subsequent step an additional variable is added. This allows us to capture the impact of each variable independently. This code also generates the Maximum Likelihood Estimates table, which gives us the log of the odds ratio. This table does not provide many insights by itself, but it is used to form an idea about the parameter estimates when the results are validated on the validation data set. The more important table from the interpretation point of view is the odds ratio table.
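A sketch of the call, using the regressors that appear in the validation equation later in the text; the pprob= grid is an assumption:

proc logistic data=day1.gertraining desc;
   model response = chk_acct duration history new_car education amount
         sav_acct install_rate male_single guarantor age other_install
         num_credits
         / selection=stepwise ctable pprob=(0.1 to 0.9 by 0.1)
           lackfit outroc=day1.roc;
run;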
Estimate table: This table gives us the value e^b, the multiplicative change in the odds of the dependent variable for a one-unit change in an independent variable, and it clearly explains the responsiveness of the dependent variable to the independent variables. For example, if the duration of credit increases by 1 unit, then the log of the odds of getting a good credit rating changes by -0.026. The odds ratio for duration is 0.974, so if the duration increases by 1 unit, the odds of getting a good credit rating fall by 2.6%.
The option ctable generates the classification table. The keyword pprob= specifies the cut-off probability levels for a customer being a good customer. For every cut-off probability level, we have a certain number of correctly and incorrectly classified events and non-events. For example, if the cut-off probability level is 0.1, then any customer with a predicted probability greater than 0.1 of being a good customer will be identified as a good customer. So, as the cut-off probability level changes, the numbers of correctly and incorrectly classified events and non-events change accordingly.
lackfit is the keyword to generate the goodness-of-fit measure; the statistic used here is the Hosmer-Lemeshow test statistic. outroc= is the keyword to generate the ROC (Receiver Operating Characteristic) measures, which are used to assess the accuracy of our predictions. The ROC measures include Sensitivity, Specificity, False Positives and False Negatives. The two measures that we use extensively are Sensitivity and 1-Specificity. Sensitivity measures the goodness or accuracy of the model, while 1-Specificity reflects the weakness of the model: it tells us, out of the total number of non-events, how many were not identified as non-events by the model.
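The plot described next can be produced from the outroc data set; _SENSIT_ and _1MSPEC_ are the variable names SAS writes into that data set:

proc gplot data=day1.roc;
   plot _sensit_*_1mspec_; /* Sensitivity against 1-Specificity */
run;
quit;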
In this code, we plot the ROC measures to understand the goodness of fit of the model. The plot of these two measures gives us a concave curve: as 1-Specificity increases, Sensitivity increases, but at a diminishing rate. Given the concave curve, the area under it will be greater than 0.5. The c-value, or the value of the concordance index, gives the measure of the area under the ROC curve. If c = 0.5, the model cannot discriminate between the 0 and 1 responses at all, i.e. it cannot tell who is a good customer and who is a bad customer. If c > 0.5, the model discriminates between 0 and 1 better than chance. The c-value for our model is 0.81, which is far greater than the benchmark value of 0.5, so we can safely regard our model as a good model.
The next step in our model construction is to find the predicted probability of a customer being a good customer. We use the following code:
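A sketch; day1.logistic and the variable predicted are the names used in the next data step:

proc logistic data=day1.gertraining desc;
   model response = chk_acct duration history new_car education amount
         sav_acct install_rate male_single guarantor age other_install
         num_credits;
   output out=day1.logistic p=predicted; /* predicted probability of Y=1 */
run;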
output out= is the keyword that generates a data set from the proc step, and p= gives the predicted probability of a borrower being a good borrower. In the resulting data set the last column contains the predicted probability values. However, we need a cut-off probability to decide on the really good customers. The next code sets that limit:
data day1.logistic1;
set day1.logistic;
status=(predicted>0.5); /* 1 if the predicted probability exceeds 0.5, else 0 */
run;
In essence, this code states that if the predicted probability is greater than 0.5 then status is 1, otherwise it is 0. Now we want to understand the percentage of predictions which match the observed outcomes in the data set. We use a contingency (frequency) table for this analysis.
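A sketch of the table; it mirrors the validation version shown later:

proc freq data=day1.logistic1;
   tables response*status / norow nocol;
run;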
proc freq is used to generate the contingency table. Here we compare the (1-1) and (0-0) pairs: the total percentage of frequency in these two cells represents the percentage of correct predictions made by the model. This model shows that approximately 77.1% of the predictions are correct. Now we need to validate the results obtained in the training data set. The codes in (a) and (b) are the procedures we follow for validating the data sets:
The numeric values used below are the parameter estimates from the training data set. A regression equation is constructed so that, given the values of the variables in the data set, we can obtain a predicted value of the probability that Y=1.
(a)
data day1.gervalidation1;
set day1.gervalidation;
z=0.3626+0.5486*CHK_ACCT
-0.0260*DURATION+0.4938*HISTORY
-0.9548*NEW_CAR-1.1585*EDUCATION
-0.00012*AMOUNT+0.3377*SAV_ACCT
-0.3124*INSTALL_RATE+0.4367*MALE_SINGLE
+1.2503*GUARANTOR+0.0269*AGE
-0.6323*OTHER_INSTALL-0.4558*NUM_CREDITS;
predicted=exp (z)/ (1+exp (z));
status= (predicted>0.5);
run;
The predicted value of the probability of Y=1 is obtained using the standard formula
from the logistic regression literature, which has already been discussed.
(b)
proc freq data=day1.gervalidation1;
table Response*Status/norow nocol;
run;
This step creates a contingency table for checking the correctness of prediction. The (1-1) and (0-0) pairs are checked to see the extent of correctness of the model relative to the observations in the data set. This percentage correctness is tallied with that in the training data set; a percentage difference between the two sets of around 5%-6% is considered acceptable. This code is the final check of the goodness of fit of the obtained model.
Let's consider the following table, which shows the quarterly sales of a company. Our purpose is to predict the sales figure of the next quarter. So, in a time series we are trying to express the dependent variable (Sales) as a function of the time period.
Year   Quarter   Sales
2008   I         10.2
2008   II        12.4
2008   III       14.8
2008   IV        15.0
2009   I         11.2
2009   II        14.3
2009   III       18.4
2009   IV        18.0

[Chart of the data: Sales plotted against time; note the scales on the X and Y axes]
Formal Definition
The trend is the long term pattern of a time series. A trend can be positive or negative
depending on whether the time series exhibits an increasing long term pattern or a de-
creasing long term pattern. If a time series does not show an increasing or decreasing
pattern then the series is stationary in the mean. For example, population increases
over a period of time, price increases over a period of years, production of goods of
the country increases over a period of years. These are the examples of upward trend.
The sales of a commodity may decrease over a period of time because of better
products coming to the market. This is an example of declining trend or downward
trend.
Any pattern showing an up-and-down movement around a given trend is identified as a cyclical pattern. The duration of a cycle depends on the type of business or industry being analyzed. A business cycle showing these oscillatory movements has to pass through four phases: prosperity, recession, depression and recovery.
Seasonality occurs when the time series exhibits regular fluctuations during the same month (or months) every year, or during the same quarter every year; this continues to repeat year after year. The major factors responsible for this repetitive pattern of seasonal variation are weather conditions and the customs of people. More woolen clothes are sold in winter than in summer. Regardless of the trend, we can observe that in each year more ice creams are sold in summer and very few in winter. Sales in departmental stores are higher during festive seasons than on normal days.
This component is unpredictable. Every time series has some unpredictable component that makes it a random variable. These variations are fluctuations in the time series that are short in duration, erratic in nature and follow no regularity in their occurrence pattern. They are also referred to as residual variations since, by definition, they represent what is left in a time series after trend, cyclical and seasonal variations have been accounted for. Irregular fluctuations result from the occurrence of unforeseen events like floods, earthquakes, wars and famines.
Formula
For the Exponential Moving Average, the smoothed value is St = αYt + (1 − α)St-1, where α is the smoothing constant (0 < α < 1). A small α indicates that we are giving less emphasis to recent periods and more to previous periods; as a result, we get a slower moving average.
The overall idea is that we extract a trend part, adjust the trend for the seasonal component, and make the forecast. There can be two variations:
Y = T + C + S + e, or Y = T × C × S × e
Where T = Trend Component
C = Cyclical Component
S = Seasonal Component
and e is the random part
These two variations are respectively known as Additive and Multiplicative Models.
Various Trends
Different Approaches
EXPONENTIAL
Exponential smoothing forecasts are forecasts for an integrated moving-average pro-
cess; however, the weighting parameter is speci-
fied by the user rather than estimated from the
data. As a general rule, smaller smoothing weights
are appropriate for series with a slowly changing
trend, while larger weights are appropriate for vol-
atile series with a rapidly changing trend.
WINTERS METHOD
The WINTERS method uses updating equations similar to exponential smoothing to fit the parameters of the model

xt = (a + bt)·s(t) + εt

where a and b are the trend parameters and the function s(t) selects the seasonal parameter.
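In SAS, one way to produce such forecasts is proc forecast; the data set, variable and settings below are hypothetical illustrations of the Winters method on quarterly data:

proc forecast data=day1.sales out=day1.sales_forecast
              method=winters seasons=4 lead=4; /* forecast 4 quarters ahead */
   var sales;
run;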
Consider for instance the GDP of $2872.8 billion for 1970I. In theory, the GDP figure for
the first quarter of 1970 could have been any number, depending on the economic
and political climate then prevailing. The figure of 2872.8 is a particular realization of all
such possibilities. You can think of the value of $2872.8 billion as the mean value of all
possible values of GDP for the first quarter of 1970. Just as we use sample data to draw
inferences about a population, in time series we use the realization to draw inferences
about the underlying stochastic process.
A stochastic process is said to be stationary if its mean and variance are constant over time and the value of the covariance between two time periods depends only on the distance, gap or lag between the two periods and not on the actual time at which the covariance is computed. Such a time series will tend to return to its mean (called mean reversion), and fluctuations around this mean (measured by its variance) will have a broadly constant amplitude. If a time series is not stationary in the sense just defined, it is called a non-stationary time series.
If the time series is non-stationary, then each set of time series data will have its own characteristics, so we cannot generalize the behavior of one set to other sets.
This is a type of Non-Stationary Process. The term random walk is often compared with a drunkard's walk. Leaving a bar, the drunkard moves a random distance ut at time t, and, continuing to walk indefinitely, will eventually drift farther and farther away from the bar. The same is said about stock prices: today's stock price is equal to yesterday's stock price plus a random shock. There are two types of Random Walk: the random walk without drift, Yt = Y(t-1) + ut, and the random walk with drift, Yt = δ + Y(t-1) + ut, where δ is the drift parameter.
A stochastic process expressed by Yt = ρY(t-1) + ut, where −1 ≤ ρ ≤ 1, is called a unit root stochastic process. If ρ is in fact 1, we face what is known as the unit root problem; that is, the process is non-stationary, and we already know that in this case the variance of Yt is not stationary. The name unit root is due to the fact that ρ = 1. Thus the terms non-stationarity, random walk, and unit root can be treated as synonymous. If, however, |ρ| < 1, that is, if the absolute value of ρ is less than one, then it can be shown that the time series Yt is stationary. In practice, then, it is important to find out if a time series possesses a unit root.
The distinction between stationary and non-stationary stochastic processes (or time
series) has a crucial bearing on whether the trend (the slow long-run evolution of the
time series under consideration) observed in the constructed time series is deterministic
or stochastic. Broadly speaking, if the trend in a time series is completely predictable
and not variable, we call it a deterministic trend, whereas if it is not predictable, we
call it a stochastic trend. To make the definition more formal, consider the following model:
Yt = β1 + β2t + β3Y(t-1) + ut
If β1 = 0, β2 = 0 and β3 = 1, we get Yt = Y(t-1) + ut, which is nothing but a RWM (random walk model) without drift and is therefore non-stationary. But ΔYt = (Yt − Y(t-1)) = ut, which is stationary. Hence, a RWM without drift is a difference stationary process.
If β1 ≠ 0, β2 = 0 and β3 = 1, we get Yt = β1 + Y(t-1) + ut, which is a RWM with drift and is therefore non-stationary. Here ΔYt = (Yt − Y(t-1)) = β1 + ut, which means that Yt will exhibit a positive (β1 > 0) or negative (β1 < 0) trend. Such a trend is called a Stochastic Trend. Again this is a difference stationary process, as ΔYt is stationary.
If β1 ≠ 0, β2 ≠ 0 and β3 = 0, we get Yt = β1 + β2t + ut, which is called a Trend Stationary Process. Although the mean of the process, β1 + β2t, is not constant, its variance is. Once the values of β1 and β2 are known, the mean can be forecasted perfectly. Therefore, if we subtract the mean from Yt, the resulting series will be stationary, hence the name trend stationary. This procedure of removing the (deterministic) trend is called detrending.
If β1 ≠ 0, β2 ≠ 0 and β3 = 1, we get Yt = β1 + β2t + Y(t-1) + ut, which is a random walk with drift and a deterministic trend. Here ΔYt = (Yt − Y(t-1)) = β1 + β2t + ut, which implies that ΔYt is non-stationary.
If β1 ≠ 0, β2 ≠ 0 and β3 < 1, we get Yt = β1 + β2t + β3Y(t-1) + ut, which is stationary around a deterministic trend.
By now, all of us probably have a good idea about the nature of stationary stochastic
processes and their importance. There are broadly three ways to find out whether the
Time Series under consideration is stationary or not. The first way is to plot the time series
and inspect the chart. In the last few topics, we have seen what a non-stationary series looks like. For example, if the chart shows an upward trend, it suggests that the mean of the data is changing over time, which may indicate that the data is non-stationary. Such an intuitive feel is the starting point of more formal tests of stationarity. The other methods of checking stationarity are the Autocorrelation Function (or Correlogram) and the Unit Root Test.
Autocorrelation refers to the correlation of a time series with its own past and future
values. The first-order autocorrelation coefficient is the simple correlation coefficient of the first N − 1 observations, x1, x2, x3, …, x(N-1), and the next N − 1 observations, x2, x3, x4, …, xN. Similarly, we can define higher-order autocorrelation coefficients, so for different orders, or lags, we get different autocorrelation coefficients. As a result, we can define the autocorrelation coefficient as a function of the lag. This function is known as the Autocorrelation Function (ACF) and its graphical presentation is known as the Correlogram. A rule of thumb is to compute the ACF up to one-third to one-quarter the length of the time series. The statistical significance of any autocorrelation coefficient can be judged by its standard error: Bartlett has shown that if a time series is purely random, that is, it exhibits white noise, the sample autocorrelation coefficients follow a normal distribution with mean = 0 and variance = 1/(sample size).
[Figure: Correlogram of a Stationary Process]
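In SAS, the ACF and the correlogram can be produced with the identify statement of PROC ARIMA. A minimal sketch, with hypothetical data set and variable names:

proc arima data=day1.sales;
   identify var=sales nlag=24;  /* prints ACF and PACF up to lag 24 */
run;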
Apart from the ADF (Augmented Dickey-Fuller) test, there are other unit root tests, and all of them, including the ADF, have limitations. Most tests of the Dickey-Fuller type tend to accept the null of a unit root more frequently than is warranted; that is, these tests may find a unit root even when none exists.
Difference-Stationary Process (DSP): If a time series has a unit root, the first differences of such a time series are stationary.
Trend-Stationary Process (TSP): A Trend Stationary Process is stationary around the trend. Hence, the simplest way to make such a time series stationary is to regress it on time; the residuals from this regression will then be stationary.
It should be pointed out that, if a time series is DSP but we treat it as TSP, this is called
Under-Differencing. On the other hand, if a time series is TSP but we treat it as DSP, this
is called Over-Differencing. The consequences of these types of specification errors
can be serious.
If a time series has to be differenced once to make it stationary, we call such a time series integrated of order 1. Similarly, if a time series has to be differenced twice (i.e., we take the first difference of the first differences) to make it stationary, we call such a time series integrated of order 2. For example, a pure random walk is non-stationary, but its first difference is stationary; so we call a random walk without drift integrated of order 1. In general, if a (non-stationary) time series has to be differenced d times to make it stationary, it is said to be integrated of order d.
Before forecasting we need to model the time series. If a time series is stationary, we can model it in a variety of ways. Suppose, for instance, we model Y as
Yt = μ + β0ut + β1u(t-1)
where μ is a constant and u, as before, is the white noise stochastic error term. Here Y at time t is equal to a constant plus a moving average of the current and past error terms. Thus, in the present case, we say that Y follows a first-order moving average, or an MA(1), process. More generally, an MA(q) process is expressed as:
Yt = μ + β0ut + β1u(t-1) + β2u(t-2) + … + βqu(t-q)
In short, a moving average process is simply a linear combination of white noise error terms.
Of course, it is quite likely that Y has characteristics of both AR and MA and is therefore ARMA. Thus, Yt follows an ARMA(1, 1) process if it can be written as
Yt = θ + α1Y(t-1) + β0ut + β1u(t-1)
where θ represents a constant term. Again, this expression can be generalized for an ARMA(p, q) process, with p autoregressive and q moving average terms.
Now, in the last segment we learnt about the Integrated Stochastic Process of order d, which implies that we have to difference a time series d times to make it stationary. So, given a time series, we first difference it d times and then apply an ARMA(p, q) model to it; we then say the original time series is ARIMA(p, d, q). Thus, an ARIMA(2, 1, 2) time series has to be differenced once (d = 1) before it becomes stationary, and the (first-differenced) stationary time series can be modeled as an ARMA(2, 2) process, that is, one with two AR and two MA terms. Of course, if d = 0 (i.e., the series is stationary to begin with), ARIMA(p, d = 0, q) = ARMA(p, q). Note that an ARIMA(p, 0, 0) process means a purely AR(p) stationary process, and an ARIMA(0, 0, q) means a purely MA(q) stationary process. Given the values of p, d, and q, one can tell what process is being modeled.
The most important question in modeling a time series is: how does one know whether it follows a purely AR process (and if so, what is the value of p), a purely MA process (and if so, what is the value of q), an ARMA process (and if so, what are the values of p and q), or an ARIMA process, in which case we must know the values of p, d, and q? The Box-Jenkins (BJ) methodology comes in handy in answering this question. The method consists of four steps:
Identification: Find out the appropriate values of p, d, and q.
Estimation: Having identified the appropriate p and q values, estimate the parameters of the autoregressive and moving average terms included in the model.
Diagnostic Checking: Having chosen a particular ARIMA model and estimated its parameters, check whether the chosen model fits the data reasonably well.
Forecasting
The chief tools in identification are the autocorrelation function (ACF), the partial autocorrelation function (PACF), and the resulting correlograms, which are simply the plots of ACFs and PACFs against the lag length.
Identifying d
If the series has positive autocorrelations out to a high number of lags, then it probably needs a higher order of differencing.
Identifying AR(p)
The ACF decays exponentially or with a damped sine-wave pattern (or both), and the PACF has significant spikes through lag p.
Identifying MA(q)
The PACF decays exponentially and the ACF has significant spikes through lag q.
Estimating the parameters for Box-Jenkins models is a quite complicated non-linear estimation problem. For this reason, the parameter estimation should be left to a high-quality software program that fits Box-Jenkins models. The main approaches to fitting Box-Jenkins models are non-linear least squares and maximum likelihood estimation. Maximum likelihood estimation is generally the preferred technique.
One simple diagnostic is to obtain the residuals from the fitted model and compute the ACF and PACF of these residuals, say, up to lag 25. If the correlograms of both the autocorrelation and the partial autocorrelation functions give the impression that the residuals are purely random, the chosen model can be considered an adequate fit.
The following code converts the variable date from a character variable to a numeric SAS date variable. The input function is used to make this conversion.
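A minimal sketch of such a conversion, assuming the character dates look like JUL1989; the data set names and the monyy7. informat are assumptions:

data day1.sales1;
   set day1.sales;
   date1 = input(date, monyy7.);  /* character 'JUL1989' -> numeric SAS date */
   format date1 monyy7.;          /* display the numeric date as JUL1989 */
run;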
The procedure FORECAST is used to generate the forecasted output for the following ten periods, so date1 holds the upcoming 10 time points. _TYPE_ in sales11 indicates that a value is a forecasted value, because here we have generated the forecasted values of Sales. _LEAD_ refers to the 10 lead points, and sales holds the corresponding point predictions of Sales. However, we want an interval rather than a point estimate. To obtain the interval estimation we use the following code:
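A plausible version of that code, with the data set and variable names reconstructed from the surrounding description:

proc forecast data=day1.sales1 interval=month lead=10
              alpha=0.01 out=day1.sales11 outlimit;
   var sales;
   id date1;
run;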
In this code the confidence intervals are computed at the 99% level, i.e. the level of significance is 0.01, which has been specified using the alpha option. The outlimit option restricts the sales11 data set to show only the forecasted values along with their limits. But if we use outfull instead of outlimit, then this output will be accompanied by the past actual values and their forecasted values from the model.
We use the outfull keyword to generate the forecasted values for the given time period. This helps us compare the actual and forecasted sales figures for the given time periods, i.e. from July 1989 to July 1991.
Here we split the forecast output into an actual data set and a forecasted data set; a minimal sketch of the split follows. Next we create a data set merge_time by merging the two:
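The split, assuming the _TYPE_ values written by PROC FORECAST with the outfull option (these value names are assumptions):

data day1.actual day1.forecast;
   set day1.sales11;
   if _TYPE_ = 'ACTUAL' then output day1.actual;           /* past observed values */
   else if _TYPE_ = 'FORECAST' then output day1.forecast;  /* predicted values */
run;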
data day1.merge_time;
   merge day1.actual   (rename=(sales=actual_sales))
         day1.forecast (rename=(sales=forecasted_sales));
   by date1;
run;
This step creates a list of the variables in the data set. From this data set we want to drop _TYPE_, and since we want to keep the lead points, we drop the observations where _LEAD_ = 0. Now, to plot the actual and forecasted sales, we use the Big Legend Code:
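A plausible reconstruction of the legend, symbol and axis statements that the next paragraph describes; the exact label texts are assumptions:

legend1 label=('Sales') value=('original' 'forecast');
symbol1 color=blue  value=dot h=0.7 i=join l=1;    /* actual sales */
symbol2 color=green value=dot h=0.7 i=join l=1;    /* forecasted sales */
axis1 label=(angle=90 'actual versus forecasted'); /* vertical axis */
axis2 label=('date');                              /* horizontal axis */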
This is a global statement, meaning that for the remainder of the SAS session these settings will be applied to other graphs as well. Here, legend1 is the name of the legend that we construct, and label refers to the label name that will be applied to the legend. The value=('original' 'forecast') option assigns the labels shown in the legend. Next, symbol1 refers to the symbol assigned to the first graph, i.e. the graph for actual sales. Here the colour given is blue, and the graph is shown in terms of dots or bubbles. The size of the bubbles is 0.7, given by h; i=join means that the bubbles are joined by straight lines, and l=1 specifies the line type. The same goes for symbol2, except that the colour of the line is green. axis1 refers to the vertical axis, whose label is 'actual versus forecasted' at an angle of 90 degrees, and axis2 refers to the horizontal axis, whose label is 'date'.
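A plausible reconstruction of the plotting step that the next paragraph explains; the names follow the merge step above:

proc gplot data=day1.merge_time;
   plot actual_sales*date1 forecasted_sales*date1
        / overlay legend=legend1 vaxis=axis1 haxis=axis2;
run;
quit;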
The data set given here is merge_time, and we use proc gplot since we are plotting numeric variables. Here date1 is plotted on the horizontal axis and the actual and forecasted sales values are plotted on the vertical axis. The overlay keyword has been used to superimpose one graph on the other. axis1 refers to the vertical axis and axis2 refers to the horizontal axis. legend=legend1 means that we are applying the pre-specified legend, and the quit statement is given to end the procedure, since gplot supports run-group processing.
At the time point of June 1990, the original sales exceed the forecasted sales. At the time point of April 1991 the forecasted sales are greater than the actual sales, and at the time point of August 1990 the actual and forecasted sales are almost equal to each other. So, whenever there is any difference between the actual and forecasted sales we call it a prediction error, which can be either positive or negative.
The case study involves the study of a data set timeseriesairline where there are 144
observations on the number of passengers for different dates. Now, we would like to
see if the time series data on passengers is stationary.
The position and short options of proc contents are used to list the variables in their creation order, in compact form.
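A minimal sketch of that step:

proc contents data=day1.timeseriesairline position short;
run;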
The gplot procedure is used to plot the variables passengers and date; gplot is used since both variables being plotted are numeric. The plot shows that the mean and the variance exhibit an upward rising trend, or in other words the series exhibits clear non-stationarity (a sketch of this plotting step is given below). So now, let's see what adjustment can be made to make the data stationary. Our main objective is to make the data set stationary in terms of mean and variance, and we apply the differencing technique to solve this problem. The techniques of simple differencing and log differencing are used to restore stationarity.
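A minimal sketch of the plot referred to above, assuming the date variable is simply named date:

proc gplot data=day1.timeseriesairline;
   plot passengers*date;  /* number of passengers against time */
run;
quit;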
data day1.timeseriesairline1;
   set day1.timeseriesairline;
   passlog = log(Passengers);  /* log transform to stabilize the variance */
   pass1   = dif1(Passengers); /* first difference of Passengers */
   pass2   = dif1(passlog);    /* first difference of the log series */
run;
This data step creates a new data set with the variables passlog, pass1 and pass2 along with the other existing variables. passlog is the log of the passenger value. pass1 is the first difference of the number of passengers; that is, pass1 equals (passengers in February 1949 − passengers in January 1949). pass2 is the first-order difference applied to passlog; therefore, pass2 equals (passlog in February 1949 − passlog in January 1949). Now we plot passlog, pass1 and pass2 against the variable time to see which of them is stationary.
In the graph where we plot passlog against the date, we still get a non-stationary picture, i.e. the mean and the variance fluctuate over time and are not constant.
In the graph where we plot pass1 against the date, we get a mean-stationary process: the mean is constant, but the variance is increasing over time. Thus, simple first-order differencing is not sufficient to restore stationarity of the time series, so we try the log differencing.
In the graph where we plot pass2 against the date, we get a process that is stationary in both mean and variance, i.e. neither changes over time. After checking for stationarity, we come to the time series modeling part.
To solve the problem associated with stationarity, the technique of simple first-order differencing is used on the variable passengers. The (1,1) shows the differencing applied in the two models to make them stationary. When the first model, where the passengers at time t are assumed to depend on the number of passengers at time t−1, is differenced once, it becomes stationary; whereas in the second model, where there are a total of 13 lags, one-period differencing does not remove the seasonal component from the data, and therefore this model remains non-stationary.
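The original code for these two identification runs is not shown; one plausible coding in PROC ARIMA, with the differencing and lag specifications treated as assumptions reconstructed from the description above, would be:

proc arima data=day1.timeseriesairline1;
   identify var=passengers(1);          /* single-lag model: first difference */
   identify var=passengers(1) nlag=13;  /* seasonal model examined through 13 lags */
run;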
The question here is: can we estimate the time series model in the presence of non-stationarity? To answer this, our first job is to obtain the optimum number of lags in the model using the Bayesian Minimum Information Criterion (BIC). This gives us the minimum number of lags that are significantly required to forecast the dependent time series variable. The keyword minic is used to obtain the optimum number of lags from all the existent lags. Here the BIC calculation suggests that the optimum lag for the AR part should be 5 and that of the MA part should be 3. The main idea is to check whether using the optimum lags is sufficient for a proper estimation of the time series model.
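A minimal sketch of how the minic option can be specified; the search ranges p=(0:5) and q=(0:5) are assumptions:

proc arima data=day1.timeseriesairline1;
   identify var=passengers minic p=(0:5) q=(0:5);  /* BIC-based search over (p, q) */
run;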
The estimation code uses lags p=5 and q=3. The forecast keyword is used to forecast the value of the time series variable passengers; lead is the keyword for assigning the number of periods in advance for which the forecast is made, and interval is the keyword which specifies the time interval at which forecasting is to be made. This code tries to forecast the number of passengers that would board the plane over the next twelve months. The estimation results generate a report in which we can see that the parameter estimates calculated by the model are unstable. Therefore, it can be inferred that model estimation is not possible in the absence of stationarity, and the foremost job is to restore stationarity to the model.
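A plausible reconstruction of the estimation and forecasting step described above; the id and out names are assumptions:

proc arima data=day1.timeseriesairline1;
   identify var=passengers;  /* no differencing: the series is still non-stationary */
   estimate p=5 q=3;         /* lag orders suggested by the BIC */
   forecast lead=12 interval=month id=date out=day1.airline_fc;
run;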
Stationarity in this model can be restored by applying a span-12 (seasonal) difference to the model with 13 lags, so that it reduces to a single-lag AR process. The optimum number of lags is again calculated using the minic keyword.
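A sketch of that final adjustment, with the exact differencing specification again treated as an assumption:

proc arima data=day1.timeseriesairline1;
   identify var=passengers(12) minic p=(0:5) q=(0:5);  /* span-12 seasonal difference */
run;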
Call Us:
09051563222
Mail Us:
info@orangetreeglobal.com
Website:
www.orangetreeglobal.com