
Linear and Logistic Regression Analysis
By: Nurita Andayani

Introduction
Difference between chi-square and regression: the chi-square test of independence determines whether a statistical relationship exists between two variables. The chi-square test tells us whether such a relationship exists, but it does not tell us what that relationship is. Regression and correlation analyses, however, show how to determine both the nature and the strength of a relationship between two variables.
Regression analysis is a body of statistical methods dealing with the formulation of mathematical models that depict relationships among variables, and with the use of these modeled relationships for prediction and other statistical inferences.
The word regression was first used in its present technical context by Sir Francis Galton, who analyzed the heights of sons and the average heights of their parents.


Models
The independent or controlled variable is also called the predictor variable and is denoted by x. The effect or response variable is denoted by y.
If the relation between y and x is exactly a straight line, then the variables are connected by the formula

$y = \alpha + \beta x$

where $\alpha$ indicates the intercept of the line with the y axis and $\beta$ represents the slope of the line, or the change in y per unit change in x.

[Figure: the line $y = \alpha + \beta x$ in the (x, y) plane, with the fitted value $\alpha + \beta x_i$ marked above $x_i$ and the observed value $y_i$ nearby.]

Statistical Model
$Y_i = \alpha + \beta x_i + e_i$, $i = 1, \ldots, n$
where:
a) $x_1, x_2, \ldots, x_n$ are the values of the controlled variable x that the experimenter has selected for the study.
b) $e_1, e_2, \ldots, e_n$ are the unknown error components that are superimposed on the true linear relation. These are unobservable random variables, which we assume are independently and normally distributed with mean zero and unknown variance $\sigma^2$.
c) The parameters $\alpha$ and $\beta$, which together locate the straight line, are unknown.


Basic Notations
$\bar{x} = \frac{1}{n} \sum x_i, \qquad \bar{y} = \frac{1}{n} \sum y_i$
$S_x^2 = \sum (x_i - \bar{x})^2 = \sum x_i^2 - n\bar{x}^2$
$S_y^2 = \sum (y_i - \bar{y})^2 = \sum y_i^2 - n\bar{y}^2$
$S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - n\bar{x}\bar{y}$

Example
Zippy Cola is studying the effect of its
latest advertising campaign. People
chosen at random were called and asked
how many cans of Zippy Cola they had
bought in the past week and how many
Zippy Cola advertisements they had either
read or seen in the past week.
X (number of ads):   3   7   4   2   0   4   1   2
Y (cans purchased): 11  18   9   4   7   6   3   8
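As a quick illustration (not part of the original slides), here is a minimal Python sketch that computes the quantities from the Basic Notations slide for this data; the variable names are my own.

```python
# Zippy Cola data from the table above.
x = [3, 7, 4, 2, 0, 4, 1, 2]    # number of ads seen
y = [11, 18, 9, 4, 7, 6, 3, 8]  # cans purchased

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of squares and the cross-product sum, as defined in Basic Notations.
Sxx = sum((xi - x_bar) ** 2 for xi in x)
Syy = sum((yi - y_bar) ** 2 for yi in y)
Sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

print(x_bar, y_bar, Sxx, Syy, Sxy)
```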


Least Squares Regression Line
Least squares regression line:
$\hat{y} = \hat{\alpha} + \hat{\beta} x$
Least squares estimate of $\alpha$:
$\hat{\alpha} = \bar{y} - \hat{\beta} \bar{x}$
Least squares estimate of $\beta$:
$\hat{\beta} = \frac{S_{xy}}{S_x^2}$
The residual sum of squares, or the sum of squares due to error, is:
$SSE = \sum_{i=1}^{n} (y_i - \hat{\alpha} - \hat{\beta} x_i)^2 = S_y^2 - \hat{\beta}^2 S_x^2$
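A minimal sketch of the fit itself, applied to the Zippy Cola data (plain Python again; the names are mine):

```python
# Self-contained least squares fit for the Zippy Cola data.
x = [3, 7, 4, 2, 0, 4, 1, 2]
y = [11, 18, 9, 4, 7, 6, 3, 8]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
Sxx = sum((xi - x_bar) ** 2 for xi in x)
Syy = sum((yi - y_bar) ** 2 for yi in y)
Sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b = Sxy / Sxx          # beta-hat, the least squares slope
a = y_bar - b * x_bar  # alpha-hat, the least squares intercept

# SSE computed directly from the residuals and via the shortcut formula;
# the two agree up to floating-point rounding.
SSE = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
assert abs(SSE - (Syy - b ** 2 * Sxx)) < 1e-9
print(a, b, SSE)
```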

Properties of the Least Squares Estimators
a) The least squares estimators are unbiased; that is, $E(\hat{\alpha}) = \alpha$ and $E(\hat{\beta}) = \beta$.
b) $\mathrm{Var}(\hat{\beta}) = \frac{\sigma^2}{S_x^2}$ and $\mathrm{Var}(\hat{\alpha}) = \sigma^2 \left( \frac{1}{n} + \frac{\bar{x}^2}{S_x^2} \right)$.
c) The distributions of $\hat{\alpha}$ and $\hat{\beta}$ are normal with means $\alpha$ and $\beta$, respectively; the standard deviations are the square roots of the variances given in b).
d) $s^2 = SSE/(n-2)$ is an unbiased estimator of $\sigma^2$. Also, $(n-2)s^2/\sigma^2$ is distributed as $\chi^2$ with d.f. = $n - 2$, and it is independent of $\hat{\alpha}$ and $\hat{\beta}$.


e) Replacing $\sigma^2$ in b) with its sample estimate $s^2$ and taking the square roots of the variances, we obtain the estimated standard errors of $\hat{\alpha}$ and $\hat{\beta}$:
estimated standard error of $\hat{\alpha}$: $s \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_x^2}}$
estimated standard error of $\hat{\beta}$: $\frac{s}{S_x}$
f) $\frac{S_x (\hat{\beta} - \beta)}{s}$ has a t distribution with d.f. = $n - 2$, and
$\frac{\hat{\alpha} - \alpha}{s \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_x^2}}}$
has a t distribution with d.f. = $n - 2$.

Inference Concerning the Slope
A test of $H_0: \beta = \beta_0$ vs $H_1: \beta \neq \beta_0$ is based on
$t = \frac{S_x (\hat{\beta} - \beta_0)}{s}$, d.f. = $n - 2$
p% confidence interval for $\beta$:
$\hat{\beta} \pm t_{(1 - CI)/2} \cdot \frac{s}{S_x}$


Inference about α
A test of $H_0: \alpha = \alpha_0$ vs $H_1: \alpha \neq \alpha_0$ is based on
$t = \frac{\hat{\alpha} - \alpha_0}{s \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_x^2}}}$, d.f. = $n - 2$
p% confidence interval for $\alpha$:
$\hat{\alpha} \pm t_{(1 - CI)/2} \cdot s \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_x^2}}$
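A hedged Python sketch of both tests and the 90% confidence intervals for the Zippy Cola fit; it assumes scipy is available (scipy.stats.t.ppf is a real function; everything else is my own naming).

```python
from math import sqrt
from scipy import stats  # used only for the t quantile

# Zippy Cola fit, recomputed so the sketch is self-contained.
x = [3, 7, 4, 2, 0, 4, 1, 2]
y = [11, 18, 9, 4, 7, 6, 3, 8]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
Sxx = sum((xi - x_bar) ** 2 for xi in x)
Sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
b = Sxy / Sxx
a = y_bar - b * x_bar
SSE = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))

s = sqrt(SSE / (n - 2))                    # estimate of sigma
se_b = s / sqrt(Sxx)                       # estimated standard error of beta-hat
se_a = s * sqrt(1 / n + x_bar ** 2 / Sxx)  # estimated standard error of alpha-hat

# t statistics for H0: beta = 0 and H0: alpha = 0, each with d.f. = n - 2.
t_b = b / se_b
t_a = a / se_a

# 90% confidence intervals: CI = 0.90, so each tail holds (1 - CI)/2 = 0.05.
t_crit = stats.t.ppf(1 - 0.05, df=n - 2)
ci_beta = (b - t_crit * se_b, b + t_crit * se_b)
ci_alpha = (a - t_crit * se_a, a + t_crit * se_a)
print(t_b, t_a, ci_beta, ci_alpha)
```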

Checks on the Straight Line Model
$y_i = (\hat{\alpha} + \hat{\beta} x_i) + (y_i - \hat{\alpha} - \hat{\beta} x_i)$
observed y value = explained by the linear relation + residual (deviation from the linear relation)
$S_y^2 = \hat{\beta}^2 S_x^2 + SSE$
total SS of y = SS explained by the linear relation + residual SS (unexplained)


ANOVA for Checking the Regression Model

Source      | Sum of Squares | d.f.  | Mean Squares      | F
------------|----------------|-------|-------------------|--------
Regression  | SSR            | 1     | MSR = SSR/1       | MSR/MSE
Error       | SSE            | n − 2 | MSE = SSE/(n − 2) |
Total       | SST            | n − 1 |                   |

Inference for the Regression Model
$H_0: \beta = 0$ vs $H_1: \beta \neq 0$
Rejection region (with significance level $\alpha$):
$R: F \geq F_{\alpha}(1, n - 2)$

The Coefficient of Determination
The sample coefficient of determination is developed from the relationship between two kinds of variation of the Y values in a data set: variation around the fitted regression line, and variation around their own mean.
$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$
$0 \leq R^2 \leq 1$, or $0\% \leq R^2 \leq 100\%$
$R^2 = 1$ corresponds to a perfectly fitted regression line; $R^2 = 0$ to an unfitted regression model.

The Correlation Coefficient
The correlation coefficient r indicates the direction of the relationship between the two variables X and Y.
If an inverse relationship exists (that is, if Y decreases as X increases), then r falls between 0 and −1.
If there is a direct relationship (if Y increases as X increases), then r is a value within the range 0 to 1.
$r = \frac{S_{xy}}{\sqrt{S_x^2 \cdot S_y^2}}$
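A short sketch computing r and R² from the same sums (for simple linear regression, R² is just r squared):

```python
from math import sqrt

x = [3, 7, 4, 2, 0, 4, 1, 2]
y = [11, 18, 9, 4, 7, 6, 3, 8]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
Sxx = sum((xi - x_bar) ** 2 for xi in x)
Syy = sum((yi - y_bar) ** 2 for yi in y)
Sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

r = Sxy / sqrt(Sxx * Syy)  # the sign of r gives the direction of the relationship
R2 = r ** 2                # equals SSR/SST in simple linear regression
print(r, R2)
```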


Exercise
PUSKESMAS PANCORAN MAS (a community health center) wants to know the relationship between patients' ages and their blood pressure. Ten patients were sampled, and the results were as follows:

Age:            38  36  72  42  68  63  49  56  60  55
Blood pressure: 115 118 160 140 152 149 145 147 155 150

a) Build the regression model.
b) If a patient's age is 40, predict their blood pressure.
c) Test the regression model you have built.
d) Test whether the parameters $\alpha = 0$ and $\beta = 0$.
e) Construct 90% confidence intervals for $\alpha$ and $\beta$.
f) Compute the coefficient of determination and the correlation coefficient, and explain what they mean.
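Not a worked solution, but a hedged sketch of how the answers could be cross-checked in Python; scipy.stats.linregress is a real function, and the variable names are mine.

```python
from scipy.stats import linregress

age = [38, 36, 72, 42, 68, 63, 49, 56, 60, 55]
bp = [115, 118, 160, 140, 152, 149, 145, 147, 155, 150]

# Least squares fit: slope, intercept, correlation, two-sided p-value
# for H0: slope = 0, and the slope's standard error.
fit = linregress(age, bp)

predicted_bp_at_40 = fit.intercept + fit.slope * 40  # part b)
r, R2 = fit.rvalue, fit.rvalue ** 2                  # part f)
print(fit.slope, fit.intercept, predicted_bp_at_40, r, R2)
```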

What is Logistic Regression?
A form of regression that allows the prediction of discrete variables from a mix of continuous and discrete predictors.
It addresses the same questions that discriminant function analysis and multiple regression do, but with no distributional assumptions on the predictors (the predictors do not have to be normally distributed, linearly related, or have equal variance in each group).


What is Logistic Regression?
Logistic regression is often used because the relationship between a discrete variable and a predictor is non-linear.
Example from the text: the probability of heart disease changes very little with a ten-point difference in blood pressure among people with low blood pressure, but a ten-point change can mean a drastic change in the probability of heart disease among people with high blood pressure.

Assumptions
Absence of multicollinearity
No outliers
Independence of errors: this assumes a between-subjects design. There are other forms if the design is within subjects.


Background
Odds are like probability. Odds are usually written in a form such as "4 to 1 against", which is equivalent to 1 success out of 5, a .20 probability, or a 20% chance, and so on.
The problem with probabilities is that they are non-linear: going from .10 to .20 doubles the probability, but going from .80 to .90 barely increases the probability.

Background
Odds ratio: the ratio of the odds over 1 minus the odds, i.e., the probability of winning over the probability of losing. Odds of 4 to 1 against (a .20 probability) equate to an odds ratio of .20/.80 = .25.


Background
Logit: the natural log of an odds ratio; often called a log odds even though it really is a log odds ratio. The logit scale is linear and functions much like a z-score scale.

Background
LOGITS ARE CONTINUOUS, LIKE Z-SCORES
p = 0.50, then logit = 0
p = 0.70, then logit = 0.84
p = 0.30, then logit = −0.84
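These values follow directly from the definition; a small Python sketch (the function name is mine):

```python
from math import log

def logit(p: float) -> float:
    """Natural log of the odds p / (1 - p)."""
    return log(p / (1 - p))

print(logit(0.50))  # 0.0
print(logit(0.70))  # ~0.847, the 0.84 quoted above
print(logit(0.30))  # ~-0.847
```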


Plain old regression
Y = a binary response (DV):
1 = positive response (success), with probability P
0 = negative response (failure), with probability Q = 1 − P
MEAN(Y) = P, the observed proportion of successes
VAR(Y) = PQ, maximized when P = .50; the variance depends on the mean (P)
Xj = any type of predictor: continuous, dichotomous, polytomous

Plain old regression
$Y | X = B_0 + B_1 X_1$
and it is assumed that the errors are normally distributed, with mean = 0 and constant variance (i.e., homogeneity of variance).


Plain old regression
$E(Y | X) = B_0 + B_1 X_1$
An expected value is a mean, so
$E(Y) = P(Y = 1 | X)$
The predicted value equals the proportion of observations for which Y = 1 at that X; P is the probability of Y = 1 (a success) given X, and Q = 1 − P (a failure) given X.

An alternative: the ogive function
An ogive function is a curved S-shaped function, and the most common is the logistic function, which looks like:


The logistic function
$\hat{Y}_i = \frac{e^u}{1 + e^u}$
where $\hat{Y}_i$ is the estimated probability that the i-th case is in a category and u is the regular linear regression equation:
$u = A + B_1 X_1 + B_2 X_2 + \cdots + B_K X_K$


The logistic function
$\hat{\pi}_i = \frac{e^{b_0 + b_1 X_1}}{1 + e^{b_0 + b_1 X_1}}$

The logistic function
The change in probability is not constant (linear) with constant changes in X. This means that the probability of a success (Y = 1) given the predictor variable (X) is a non-linear function, specifically a logistic function.
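A minimal Python sketch of this function for a single predictor (the names are mine):

```python
from math import exp

def logistic(x: float, b0: float, b1: float) -> float:
    """Estimated probability of success: e^u / (1 + e^u), with u = b0 + b1*x."""
    u = b0 + b1 * x
    return exp(u) / (1 + exp(u))

# The probability is exactly .50 where b0 + b1*x = 0, i.e. at x = -b0/b1.
print(logistic(80, -4.00, 0.05))  # 0.5
```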


The logistic function
It is not obvious how the regression coefficients for X are related to changes in the dependent variable (Y) when the model is written this way. The change in Y (in probability units) for a given X depends on the value of X: look at the S-shaped function.

The logistic function
The values in the regression equation, b0 and b1, take on slightly different meanings:
b0 <- the regression constant (moves the curve left and right)
b1 <- the regression slope (steepness of the curve)
−b0/b1 <- the threshold, where the probability of success = .50


Logistic Function
Constant regression constant, different slopes:
v2: b0 = −4.00, b1 = 0.05 (middle)
v3: b0 = −4.00, b1 = 0.15 (top)
v4: b0 = −4.00, b1 = 0.025 (bottom)
[Figure: three S-shaped logistic curves for X from 30 to 100, probabilities from 0.0 to 1.0.]

Logistic Function
Constant slopes with different regression constants:
v2: b0 = −3.00, b1 = 0.05 (top)
v3: b0 = −4.00, b1 = 0.05 (middle)
v4: b0 = −5.00, b1 = 0.05 (bottom)
[Figure: three parallel S-shaped logistic curves for X from 30 to 100, probabilities from 0.0 to 1.0.]
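A sketch that would reproduce both figures from the parameters stated in their captions; it assumes numpy and matplotlib are available.

```python
import numpy as np
import matplotlib.pyplot as plt

def logistic(x, b0, b1):
    """Vectorized logistic curve e^u / (1 + e^u) with u = b0 + b1*x."""
    u = b0 + b1 * x
    return np.exp(u) / (1 + np.exp(u))

x = np.linspace(30, 100, 200)
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# First figure: same constant (b0 = -4.00), different slopes.
for b1, label in [(0.15, "v3 (top)"), (0.05, "v2 (middle)"), (0.025, "v4 (bottom)")]:
    ax1.plot(x, logistic(x, -4.00, b1), label=label)
ax1.set_title("Constant b0, different slopes")
ax1.legend()

# Second figure: same slope (b1 = 0.05), different constants.
for b0, label in [(-3.00, "v2 (top)"), (-4.00, "v3 (middle)"), (-5.00, "v4 (bottom)")]:
    ax2.plot(x, logistic(x, b0, 0.05), label=label)
ax2.set_title("Constant slope, different b0")
ax2.legend()

plt.show()
```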


The Logit
By algebraic manipulation, the logistic regression equation can be written in terms of an odds ratio for success:
$\frac{P(Y = 1 | X_i)}{1 - P(Y = 1 | X_i)} = \frac{\hat{\pi}}{1 - \hat{\pi}} = \exp(b_0 + b_1 X_{1i})$

The Logit
Odds ratios range from 0 to positive infinity.
Odds ratio: P/Q is an odds ratio; less than 1 means less than a .50 probability, and greater than 1 means greater than a .50 probability.


The Logit
Finally, taking the natural log of both sides, we can write the equation in terms of logits (log-odds):
$\ln \frac{P(Y = 1 | X)}{1 - P(Y = 1 | X)} = \ln \frac{\hat{\pi}}{1 - \hat{\pi}} = b_0 + b_1 X_1$
for a single predictor.

The Logit
$\ln \frac{\hat{\pi}}{1 - \hat{\pi}} = b_0 + b_1 X_1 + b_2 X_2 + \cdots + b_k X_k$
for multiple predictors.


The Logit
Log-odds are a linear function of the predictors, and the regression coefficients go back to their old interpretation (kind of):
b0 <- the expected value of the logit (log-odds) when X = 0
b1 <- called a logit difference: the amount the logit (log-odds) changes with a one-unit change in X; the amount the logit changes in going from X to X + 1

Conversion
EXP(logit) = odds ratio
Probability = odds ratio / (1 + odds ratio)
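The conversion as a small Python sketch (the inverse of the logit function sketched earlier; the names are mine):

```python
from math import exp

def prob_from_logit(logit_value: float) -> float:
    """EXP(logit) gives the odds ratio; odds / (1 + odds) recovers the probability."""
    odds = exp(logit_value)
    return odds / (1 + odds)

print(prob_from_logit(0.0))    # 0.5
print(prob_from_logit(0.847))  # ~0.70
```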


THANK YOU
GOOD LUCK
