
LINEAR AND LOGISTIC REGRESSION PROOFS

HAROLD VALDIVIA GARCIA

Contents
1. Linear Regression
2. Logistic Regression
2.1. The Cost Function J(\theta)

1. Linear Regression

2. Logistic Regression

For logistic regression, we use $h_\theta(x^{(i)})$ as the estimated probability that the training example $x^{(i)}$ is in class $y = 1$ (or is labeled as $y = 1$). Here, we assume that the response variables $y^{(1)}, y^{(2)}, \ldots, y^{(m)} \sim \mathrm{Bern}(p = \phi_i)$. The hypothesis $h_\theta(x)$ is the logistic function:

\[ h_\theta(x) = g(\theta^T x) = \frac{1}{1 + e^{-\theta^T x}} \]

The derivative of the sigmoid function has the following nice property (the proof is very easy, so we will not prove it):

\[ g'(z) = g(z)\,(1 - g(z)) \]
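As an illustration, here is a minimal Python/NumPy sketch (the names sigmoid and sigmoid_grad are my own, not from the text) that evaluates $g(z)$ and checks the identity $g'(z) = g(z)(1 - g(z))$ against a finite-difference estimate:

    import numpy as np

    def sigmoid(z):
        # g(z) = 1 / (1 + exp(-z))
        return 1.0 / (1.0 + np.exp(-z))

    def sigmoid_grad(z):
        # The identity g'(z) = g(z) * (1 - g(z))
        return sigmoid(z) * (1.0 - sigmoid(z))

    z = 0.7
    eps = 1e-6
    finite_diff = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
    print(sigmoid_grad(z), finite_diff)  # the two values agree to about 1e-9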

Let's consider $X \in \mathbb{R}^{m \times (n+1)}$ and $Y \in \mathbb{R}^{m \times 1}$ as our dataset. Now, for the parameters $\theta \in \mathbb{R}^{n+1}$, we have that $X\theta$ is a linear combination of the features of $X$:

\[
X\theta =
\begin{bmatrix}
x^{(1)T} \\ x^{(2)T} \\ x^{(3)T} \\ \vdots \\ x^{(m)T}
\end{bmatrix}
\theta
=
\begin{bmatrix}
x^{(1)T}\theta \\ x^{(2)T}\theta \\ x^{(3)T}\theta \\ \vdots \\ x^{(m)T}\theta
\end{bmatrix}_{m \times 1}
\]

Let's define the vector $h \in \mathbb{R}^{m \times 1}$ such that:

\[ [h]_i = g(\theta^T x^{(i)}) = g(x^{(i)T}\theta) \quad \text{($g$ is the sigmoid function)} \]
\[ h = \bar{g}(X\theta) \quad \text{($\bar{g}$ is the matrix version of the sigmoid function, applied element-wise)} \]

\[
h =
\begin{bmatrix}
g(x^{(1)T}\theta) \\ \vdots \\ g(x^{(i)T}\theta) \\ \vdots \\ g(x^{(m)T}\theta)
\end{bmatrix}_{m \times 1}
\]
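As a sketch (assuming NumPy, with $X$ already containing the intercept column of ones; the data values are made up for the example), $h = \bar{g}(X\theta)$ is one matrix-vector product followed by the element-wise sigmoid:

    import numpy as np

    def hypothesis(X, theta):
        # h = g(X @ theta), sigmoid applied element-wise;
        # X is m x (n+1) with a first column of ones, theta has length n+1.
        return 1.0 / (1.0 + np.exp(-X @ theta))

    X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])  # m = 3, n = 1 plus intercept
    theta = np.array([0.1, 0.4])
    h = hypothesis(X, theta)   # vector of m estimated probabilities
    print(h)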

2.1. The Cost Function J(\theta). We can get the cost function by using the maximum likelihood $L(\theta)$ over the joint distribution of the dataset, or by constructing a cost function that penalizes misclassification. We present the following similar cost functions:

\[ J_1(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right] \]
\[ J_2(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right] \]
\[ J_3(\theta) = \sum_{i=1}^{m} \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right] \]


The last cost function $J_3(\theta)$ is the log-likelihood $\ell(\theta) = \log(L(\theta))$ of the parameters $\theta$. It is easy to demonstrate that minimizing $J_1(\theta)$ is the same as maximizing $J_2(\theta)$ and $J_3(\theta)$:

\[ \arg\min_\theta J_1(\theta) = \arg\max_\theta J_2(\theta) = \arg\max_\theta J_3(\theta) \]
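A small sketch of the three costs (the function names are mine; y is a 0/1 vector and h a vector of probabilities with made-up values) makes the relationships $J_1 = -J_2$ and $J_3 = m \cdot J_2$ easy to check numerically:

    import numpy as np

    def cross_entropy_terms(h, y):
        # Per-example terms y*log(h) + (1-y)*log(1-h).
        return y * np.log(h) + (1.0 - y) * np.log(1.0 - h)

    def cost_J1(h, y):
        return -np.mean(cross_entropy_terms(h, y))

    def cost_J2(h, y):
        return np.mean(cross_entropy_terms(h, y))

    def cost_J3(h, y):
        # The log-likelihood l(theta) = log L(theta).
        return np.sum(cross_entropy_terms(h, y))

    y = np.array([1.0, 0.0, 1.0])
    h = np.array([0.8, 0.3, 0.6])            # made-up probabilities
    m = y.size
    print(cost_J1(h, y), -cost_J2(h, y))     # equal
    print(cost_J3(h, y), m * cost_J2(h, y))  # equal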

2.1.1. Matrix notation for J(\theta). Let's consider the matrix notation for $J_1(\theta)$:

\[ J_1(\theta) = -\frac{1}{m} \left[ Y^T \log(h) + (1 - Y)^T \log(1 - h) \right] \]

(with $\log$ applied element-wise).
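The matrix form translates directly to one line of NumPy; the sketch below (names and data are illustrative choices) computes it:

    import numpy as np

    def cost_J1_matrix(X, y, theta):
        # J1 = -(1/m) * [ y^T log(h) + (1 - y)^T log(1 - h) ]
        m = y.size
        h = 1.0 / (1.0 + np.exp(-X @ theta))
        return -(y @ np.log(h) + (1.0 - y) @ np.log(1.0 - h)) / m

    X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])
    y = np.array([1.0, 0.0, 1.0])
    theta = np.array([0.1, 0.4])
    # Same value as the element-wise sum -(1/m) * sum_i [y_i log h_i + (1-y_i) log(1-h_i)].
    print(cost_J1_matrix(X, y, theta))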

2.1.2. Gradient Descent for minimizing J_1(\theta). The update rule is:

\[ \theta := \theta - \alpha \nabla_\theta J_1(\theta) \]

\[ \frac{\partial}{\partial \theta_j} J_1(\theta) = \; ? \]

\begin{align*}
\frac{\partial}{\partial \theta_j} J_1(\theta)
&= -\frac{1}{m} \sum_{i=1}^{m} \frac{\partial}{\partial \theta_j} \left[ y^{(i)} \log(h_\theta(x^{(i)})) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right] \\
&= -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \frac{1}{h_\theta(x^{(i)})} \frac{\partial}{\partial \theta_j} h_\theta(x^{(i)}) + (1 - y^{(i)}) \frac{1}{1 - h_\theta(x^{(i)})} \frac{\partial}{\partial \theta_j} \left(1 - h_\theta(x^{(i)})\right) \right] \\
&= -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \frac{1}{h_\theta(x^{(i)})} - (1 - y^{(i)}) \frac{1}{1 - h_\theta(x^{(i)})} \right] \frac{\partial}{\partial \theta_j} h_\theta(x^{(i)})
\end{align*}

The partial derivative of the hypothesis $h_\theta(x)$ is:


\begin{align*}
\frac{\partial}{\partial \theta_j} h_\theta(x^{(i)}) &= \frac{\partial}{\partial \theta_j} g(\theta^T x^{(i)}) \\
&= g(\theta^T x^{(i)}) \left( 1 - g(\theta^T x^{(i)}) \right) \frac{\partial}{\partial \theta_j} (\theta^T x^{(i)}) \\
&= g(\theta^T x^{(i)}) \left( 1 - g(\theta^T x^{(i)}) \right) x_j^{(i)} \\
&= h_\theta(x^{(i)}) \left( 1 - h_\theta(x^{(i)}) \right) x_j^{(i)}
\end{align*}
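A quick numerical sanity check of this identity (a sketch with made-up values; sigmoid as defined earlier):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    x = np.array([1.0, 2.0, -0.5])        # one training example (intercept term included)
    theta = np.array([0.3, -0.1, 0.7])
    j, eps = 1, 1e-6

    h = sigmoid(theta @ x)
    analytic = h * (1.0 - h) * x[j]       # h(x) (1 - h(x)) x_j

    theta_plus, theta_minus = theta.copy(), theta.copy()
    theta_plus[j] += eps
    theta_minus[j] -= eps
    numeric = (sigmoid(theta_plus @ x) - sigmoid(theta_minus @ x)) / (2 * eps)
    print(analytic, numeric)              # the two values agree to about 1e-9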

Then, the partial derivative of $J_1(\theta)$ can be written as:

\begin{align*}
\frac{\partial}{\partial \theta_j} J_1(\theta)
&= -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \frac{1}{h_\theta(x^{(i)})} - (1 - y^{(i)}) \frac{1}{1 - h_\theta(x^{(i)})} \right] h_\theta(x^{(i)}) \left( 1 - h_\theta(x^{(i)}) \right) x_j^{(i)} \\
&= -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \left( 1 - h_\theta(x^{(i)}) \right) - (1 - y^{(i)}) h_\theta(x^{(i)}) \right] x_j^{(i)} \\
&= -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} - y^{(i)} h_\theta(x^{(i)}) - h_\theta(x^{(i)}) + y^{(i)} h_\theta(x^{(i)}) \right] x_j^{(i)} \\
&= -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} - h_\theta(x^{(i)}) \right] x_j^{(i)} \\
&= \frac{1}{m} \sum_{i=1}^{m} \left[ h_\theta(x^{(i)}) - y^{(i)} \right] x_j^{(i)}
\end{align*}


The expression above in vector notation is:

\[
\frac{\partial}{\partial \theta_j} J_1(\theta)
= \frac{1}{m}
\begin{bmatrix} x_j^{(1)} & x_j^{(2)} & \cdots & x_j^{(i)} & \cdots & x_j^{(m)} \end{bmatrix}_{1 \times m}
\begin{bmatrix}
h_\theta(x^{(1)}) - y^{(1)} \\
h_\theta(x^{(2)}) - y^{(2)} \\
\vdots \\
h_\theta(x^{(i)}) - y^{(i)} \\
\vdots \\
h_\theta(x^{(m)}) - y^{(m)}
\end{bmatrix}_{m \times 1}
\]

\[ \frac{\partial}{\partial \theta_j} J_1(\theta) = \frac{1}{m} x_j^T (h - y) \]

\[ \nabla_\theta J_1(\theta) = \frac{1}{m} X^T (h - y) \]

(where $x_j \in \mathbb{R}^{m \times 1}$ is the $j$-th column of $X$).
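The vectorized gradient is then one line of NumPy; the sketch below (illustrative names and data) checks $\frac{1}{m} X^T (h - y)$ against the per-coordinate sum:

    import numpy as np

    def gradient_J1(X, y, theta):
        # grad J1 = (1/m) * X^T (h - y)
        m = y.size
        h = 1.0 / (1.0 + np.exp(-X @ theta))
        return X.T @ (h - y) / m

    X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])
    y = np.array([1.0, 0.0, 1.0])
    theta = np.array([0.1, 0.4])

    # Per-coordinate version: (1/m) * sum_i (h_i - y_i) * x_j^(i)
    h = 1.0 / (1.0 + np.exp(-X @ theta))
    loop_grad = np.array([np.sum((h - y) * X[:, j]) for j in range(X.shape[1])]) / y.size
    print(gradient_J1(X, y, theta), loop_grad)   # identical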

The vector notation for the gradient descent rule is as follows:

\[ \theta := \theta - \alpha \frac{1}{m} X^T (h - y) \]
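Putting the pieces together, a minimal batch gradient descent loop might look like this (the learning rate, iteration count, and data are illustrative choices, not prescribed by the text):

    import numpy as np

    def gradient_descent(X, y, alpha=0.1, iters=1000):
        # Repeats theta := theta - alpha * (1/m) * X^T (h - y).
        m, n1 = X.shape
        theta = np.zeros(n1)
        for _ in range(iters):
            h = 1.0 / (1.0 + np.exp(-X @ theta))
            theta -= alpha * X.T @ (h - y) / m
        return theta

    X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5], [1.0, 3.0]])
    y = np.array([1.0, 0.0, 0.0, 1.0])
    theta = gradient_descent(X, y)
    print(theta)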

2.1.3. Gradient Ascent for maximizing J_2(\theta). The update rule is:

\[ \theta := \theta + \alpha \nabla_\theta J_2(\theta) \]

\[ \nabla_\theta J_2(\theta) = \; ? \]

Since $J_2(\theta) = -J_1(\theta)$:

\[ \frac{\partial}{\partial \theta_j} J_2(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ h_\theta(x^{(i)}) - y^{(i)} \right] x_j^{(i)} \]

\[ \frac{\partial}{\partial \theta_j} J_2(\theta) = -\frac{1}{m} x_j^T (h - y) \]

\[ \nabla_\theta J_2(\theta) = -\frac{1}{m} X^T (h - y) \]


The vector notation for the gradient ascent rule is as follows:

\[ \theta := \theta + \alpha \nabla_\theta J_2(\theta) = \theta - \alpha \frac{1}{m} X^T (h - y) \]

which is the same update as gradient descent on $J_1(\theta)$.
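Since $\nabla_\theta J_2(\theta) = -\nabla_\theta J_1(\theta)$, gradient ascent on $J_2$ performs exactly the same parameter updates as gradient descent on $J_1$; an illustrative sketch (same naming conventions as the earlier snippets):

    import numpy as np

    def gradient_ascent(X, y, alpha=0.1, iters=1000):
        # theta := theta + alpha * grad J2(theta)
        #        = theta - alpha * (1/m) * X^T (h - y)
        m, n1 = X.shape
        theta = np.zeros(n1)
        for _ in range(iters):
            h = 1.0 / (1.0 + np.exp(-X @ theta))
            grad_J2 = -X.T @ (h - y) / m
            theta += alpha * grad_J2
        return theta

    # With the same data and settings, this returns the same theta as the
    # gradient_descent sketch above.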
