
Optimization for Machine Learning

Lecture 1: Introduction to Convexity

S.V. N. (vishy) Vishwanathan


Purdue University
vishy@purdue.edu

July 12, 2012


Regularized Risk Minimization

Machine Learning

We want to build a model which predicts well on data
A model's performance is quantified by a loss function
(a sophisticated discrepancy score)
Our model must generalize to unseen data
Avoid over-fitting by penalizing complex models (regularization)

More Formally

Training data: {x_1, ..., x_m}
Labels: {y_1, ..., y_m}
Learn a vector: w

    minimize_w  J(w) := Ω(w) + (1/m) Σ_{i=1}^m l(x_i, y_i, w)

where Ω(w) is the regularizer and (1/m) Σ_i l(x_i, y_i, w) is the empirical risk R_emp
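The regularized risk objective above can be made concrete in code. A minimal sketch, assuming the binary hinge loss l(x, y, w) = max(0, 1 − y⟨w, x⟩) and the squared-norm regularizer Ω(w) = (λ/2)‖w‖² with λ = 0.1; the slides leave both choices generic, so these are illustrative assumptions:

```python
import numpy as np

def hinge_loss(x, y, w):
    # l(x, y, w) = max(0, 1 - y * <w, x>)  (an assumed loss choice)
    return max(0.0, 1.0 - y * np.dot(w, x))

def regularized_risk(X, y, w, lam=0.1):
    # J(w) = (lam/2) ||w||^2  (regularizer)  +  (1/m) sum_i l(x_i, y_i, w)  (empirical risk)
    m = len(y)
    reg = 0.5 * lam * np.dot(w, w)
    remp = sum(hinge_loss(X[i], y[i], w) for i in range(m)) / m
    return reg + remp

X = np.array([[1.0, 2.0], [-1.0, 1.0]])
y = np.array([1.0, -1.0])
w = np.array([0.5, -0.5])
print(regularized_risk(X, y, w))   # → 0.775
```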


Convex Functions and Sets

Outline

Convex Functions and Sets

Operations Which Preserve Convexity

First Order Properties

Subgradients

Constraints

Warmup: Minimizing a 1-d Convex Function

Warmup: Coordinate Descent


Convex Functions and Sets

Focus of my Lectures

[figure: surface plot of a convex function]

Convex Functions and Sets

Disclaimer

My focus is on showing connections between various methods


I will sacrifice mathematical rigor and focus on intuition


Convex Functions and Sets

Convex Function

A function f is convex if, and only if, for all x, x′ and λ ∈ (0, 1)

    f(λx + (1 − λ)x′) ≤ λ f(x) + (1 − λ) f(x′)

Convex Functions and Sets

Convex Function

A function f is strictly convex if, and only if, for all x, x′ and λ ∈ (0, 1)

    f(λx + (1 − λ)x′) < λ f(x) + (1 − λ) f(x′)

Convex Functions and Sets

Convex Function

A function f is σ-strongly convex if, and only if, f(·) − (σ/2)‖·‖² is convex

That is, for all x, x′ and λ ∈ (0, 1)

    f(λx + (1 − λ)x′) ≤ λ f(x) + (1 − λ) f(x′) − (σ/2) λ(1 − λ) ‖x − x′‖²

Convex Functions and Sets

Exercise: Jensen's Inequality

Extend the definition of convexity to show that if f is convex, then for
all λ_i ≥ 0 such that Σ_i λ_i = 1 we have

    f(Σ_i λ_i x_i) ≤ Σ_i λ_i f(x_i)
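A quick numeric sanity check of Jensen's inequality for the convex square function; the points and weights below are arbitrary choices, not from the slides:

```python
import numpy as np

f = lambda x: x ** 2                 # a convex function
xs = np.array([-1.0, 0.5, 2.0])      # arbitrary points
lams = np.array([0.2, 0.3, 0.5])     # nonnegative weights summing to 1

lhs = f(np.dot(lams, xs))            # f(sum_i lam_i x_i)
rhs = np.dot(lams, f(xs))            # sum_i lam_i f(x_i)
print(lhs <= rhs)                    # → True
```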

Convex Functions and Sets

Some Familiar Examples

f(x) = (1/2) x²  (Square norm)

Convex Functions and Sets

Some Familiar Examples

f(x, y) = (1/2) ⟨(x, y), [10 1; 2 1] (x, y)⟩

Convex Functions and Sets

Some Familiar Examples

f(x) = x log x + (1 − x) log(1 − x)  (Negative entropy)

Convex Functions and Sets

Some Familiar Examples

f(x, y) = x log x + y log y − x − y  (Un-normalized negative entropy)

Convex Functions and Sets

Some Familiar Examples

f(x) = max(0, 1 − x)  (Hinge Loss)

Convex Functions and Sets

Some Other Important Examples

Linear functions: f(x) = ax + b
Softmax: f(x) = log Σ_i exp(x_i)
Norms: for example the 2-norm f(x) = √(Σ_i x_i²)
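The softmax function above is easy to evaluate badly: exp overflows for large inputs. A minimal sketch using the standard max-shift trick (the trick itself is a well-known numerical device, not something stated on the slides):

```python
import numpy as np

def softmax_value(x):
    # f(x) = log sum_i exp(x_i), computed stably by shifting by max(x)
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

x = np.array([1000.0, 1000.0])
# A naive np.log(np.sum(np.exp(x))) would overflow to inf here.
print(softmax_value(x))   # 1000 + log 2
```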

Convex Functions and Sets

Convex Sets

A set C is convex if, and only if, for all x, x′ ∈ C and λ ∈ (0, 1) we have

    λx + (1 − λ)x′ ∈ C

Convex Functions and Sets

Convex Sets and Convex Functions

A function f is convex if, and only if, its epigraph is a convex set

Convex Functions and Sets

Convex Sets and Convex Functions

Indicator functions of convex sets are convex

    I_C(x) = 0 if x ∈ C, ∞ otherwise.

Convex Functions and Sets

Below sets of Convex Functions

f(x, y) = x² + y²

Convex Functions and Sets

Below sets of Convex Functions

f(x, y) = x log x + y log y − x − y

Convex Functions and Sets

Below sets of Convex Functions

If f is convex, then all its below sets (sublevel sets) are convex

Is the converse true? (Exercise: construct a counter-example)

Convex Functions and Sets

Minima on Convex Sets

Set of minima of a convex function is a convex set


Proof: Consider the set {x : f(x) ≤ f*}, where f* is the minimum value

Convex Functions and Sets

Minima on Convex Sets

Set of minima of a strictly convex function is a singleton


Proof: try this at home!


Operations Which Preserve Convexity

Set Operations

Intersection of convex sets is convex


Image of a convex set under a linear transformation is convex
Inverse image of a convex set under a linear transformation is convex


Operations Which Preserve Convexity

Function Operations

Linear combination with non-negative weights: f(x) = Σ_i w_i f_i(x) s.t. w_i ≥ 0
Pointwise maximum: f(x) = max_i f_i(x)
Composition with an affine function: f(x) = g(Ax + b)
Projection along a direction: f(λ) = g(x_0 + λd)
Restricting the domain to a convex set: f(x) s.t. x ∈ C

Operations Which Preserve Convexity

One Quick Example

The piecewise linear function f(x) := max_i ⟨u_i, x⟩ is convex


First Order Properties

First Order Taylor Expansion

The first order Taylor approximation globally lower bounds the function:
for any x and x′ we have

    f(x) ≥ f(x′) + ⟨x − x′, ∇f(x′)⟩

First Order Properties

Bregman Divergence

For any x and x′, the Bregman divergence defined by f is given by

    Δ_f(x, x′) = f(x) − f(x′) − ⟨x − x′, ∇f(x′)⟩.

First Order Properties

Euclidean Distance Squared

Bregman Divergence
For any x and x′, the Bregman divergence defined by f is given by

    Δ_f(x, x′) = f(x) − f(x′) − ⟨x − x′, ∇f(x′)⟩.

Use f(x) = (1/2)‖x‖² and verify that

    Δ_f(x, x′) = (1/2)‖x − x′‖²

First Order Properties

Unnormalized Relative Entropy

Bregman Divergence
For any x and x′, the Bregman divergence defined by f is given by

    Δ_f(x, x′) = f(x) − f(x′) − ⟨x − x′, ∇f(x′)⟩.

Use f(x) = Σ_i x_i log x_i − x_i and verify that

    Δ_f(x, x′) = Σ_i x_i log(x_i / x_i′) − x_i + x_i′
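Both verification exercises can be checked numerically straight from the definition; a sketch with hand-coded gradients and arbitrary test points:

```python
import numpy as np

def bregman(f, grad_f, x, xp):
    # Delta_f(x, x') = f(x) - f(x') - <x - x', grad f(x')>
    return f(x) - f(xp) - np.dot(x - xp, grad_f(xp))

x = np.array([0.5, 2.0])
xp = np.array([1.0, 1.0])

# f(x) = 1/2 ||x||^2  ->  Delta_f(x, x') = 1/2 ||x - x'||^2
f_sq = lambda v: 0.5 * np.dot(v, v)
g_sq = lambda v: v
print(np.isclose(bregman(f_sq, g_sq, x, xp), 0.5 * np.sum((x - xp) ** 2)))  # → True

# f(x) = sum_i x_i log x_i - x_i  ->  unnormalized relative entropy
f_ent = lambda v: np.sum(v * np.log(v) - v)
g_ent = lambda v: np.log(v)
expected = np.sum(x * np.log(x / xp) - x + xp)
print(np.isclose(bregman(f_ent, g_ent, x, xp), expected))  # → True
```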

First Order Properties

Identifying the Minimum

Let f : X → R be a differentiable convex function. Then x is a
minimizer of f if, and only if,

    ⟨x′ − x, ∇f(x)⟩ ≥ 0 for all x′.

One way to ensure this is to set ∇f(x) = 0
Minimizing a smooth convex function is the same as finding an x
such that ∇f(x) = 0


Subgradients

What if the Function is Non-Smooth?

The piecewise linear function

    f(x) := max_i ⟨u_i, x⟩

is convex but not differentiable at the kinks!

Subgradients

Subgradients to the Rescue

A subgradient at x′ is any vector s which satisfies

    f(x) ≥ f(x′) + ⟨x − x′, s⟩ for all x

The set of all subgradients is denoted ∂f(x′)

Subgradients

Example

f(x) = |x| and ∂f(0) = [−1, 1]

Subgradients

Identifying the Minimum

Let f : X → R be a convex function. Then x is a minimizer of f if,
and only if, there exists μ ∈ ∂f(x) such that

    ⟨x′ − x, μ⟩ ≥ 0 for all x′.

One way to ensure this is to ensure that 0 ∈ ∂f(x)


Constraints

A Simple Example

Minimize (1/2) x² s.t. 1 ≤ x ≤ 2

Constraints

Projection

    P_C(x′) := argmin_{x ∈ C} ‖x − x′‖²

Assignment: Compute P_C(x′) when C = {x s.t. l ≤ x_i ≤ u}
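For the box constraint set the projection separates across coordinates, so it reduces to clipping each coordinate into [l, u]; a minimal sketch of that closed form (which partly answers the assignment, so try deriving it first):

```python
import numpy as np

def project_box(xp, l, u):
    # P_C(x') = argmin_{l <= x_i <= u} ||x - x'||^2, solved coordinate-wise:
    # each x_i is independently the closest point in [l, u] to x'_i.
    return np.clip(xp, l, u)

print(project_box(np.array([-1.5, 0.3, 4.0]), l=0.0, u=1.0))
```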

Constraints

First Order Conditions for Constrained Problems

    x = P_C(x − ∇f(x))

If x − ∇f(x) ∈ C then P_C(x − ∇f(x)) = x implies that ∇f(x) = 0
Otherwise, it shows that the constraints are preventing further
progress in the direction of descent


Warmup: Minimizing a 1-d Convex Function

Problem Statement

Given a black-box which can compute J : R → R and J′ : R → R, find
the minimum value of J

Warmup: Minimizing a 1-d Convex Function

Increasing Gradients

From the first order conditions

    J(w) ≥ J(w′) + (w − w′) J′(w′)
and
    J(w′) ≥ J(w) + (w′ − w) J′(w)

Add the two:

    (w − w′) (J′(w) − J′(w′)) ≥ 0

so w ≥ w′ implies that J′(w) ≥ J′(w′)


Warmup: Minimizing a 1-d Convex Function

Problem Restatement

Identify the point where the increasing function J′ crosses zero

Warmup: Minimizing a 1-d Convex Function

Bisection Algorithm

[figure: J′(w) with the interval [L, U] repeatedly halved around the zero crossing]

Warmup: Minimizing a 1-d Convex Function

Interval Bisection

Require: L, U, ε
 1: maxgrad ← J′(U)
 2: while (U − L) · maxgrad > ε do
 3:   M ← (U + L)/2
 4:   if J′(M) > 0 then
 5:     U ← M
 6:   else
 7:     L ← M
 8:   end if
 9: end while
10: return (U + L)/2
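The pseudocode above transcribes directly to Python. The test function J(w) = (1/2)(w − 3)², whose derivative J′(w) = w − 3 crosses zero at w = 3, is my own example, not from the slides:

```python
def interval_bisection(Jprime, L, U, eps=1e-10):
    # Find where the increasing function J' crosses zero on [L, U].
    maxgrad = Jprime(U)
    while (U - L) * maxgrad > eps:
        M = (U + L) / 2
        if Jprime(M) > 0:
            U = M          # zero crossing is to the left of M
        else:
            L = M          # zero crossing is to the right of M
    return (U + L) / 2

# J(w) = 1/2 (w - 3)^2, so J'(w) = w - 3 and the minimizer is w = 3
w_star = interval_bisection(lambda w: w - 3.0, L=0.0, U=10.0)
print(round(w_star, 6))   # → 3.0
```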


Warmup: Coordinate Descent

Problem Statement

Given a black-box which can compute J : Rⁿ → R and J′ : Rⁿ → Rⁿ,
find the minimum value of J

Warmup: Coordinate Descent

Concrete Example

f(x, y) = (1/2) ⟨(x, y), [10 1; 2 1] (x, y)⟩

Warmup: Coordinate Descent

Concrete Example

f(x, 3) = (1/2) ⟨(x, 3), [10 1; 2 1] (x, 3)⟩

Warmup: Coordinate Descent

Concrete Example

f(x, 3) = 5x² + (9/2) x + 9/2

Warmup: Coordinate Descent

Concrete Example

f(x, 3) = 5x² + (9/2) x + 9/2

Minimum: x = −9/20


Warmup: Coordinate Descent

Concrete Example

f(−9/20, y) = (1/2) y² − (27/40) y + 81/80

Warmup: Coordinate Descent

Concrete Example

f(−9/20, y) = (1/2) y² − (27/40) y + 81/80

Minimum: y = 27/40

Warmup: Coordinate Descent

Concrete Example

[figure: f(x, 27/40) as a function of x]

Are we done?

Warmup: Coordinate Descent

Concrete Example

[figure: f(x, 27/40) with the current iterate x = −9/20 marked]

Are we done?
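The coordinate steps above can be scripted. Expanding f(x, y) = (1/2)⟨(x, y), [10 1; 2 1](x, y)⟩ gives f(x, y) = 5x² + (3/2)xy + (1/2)y², so each one-dimensional minimization has a closed form: 10x + (3/2)y = 0 over x, and (3/2)x + y = 0 over y. A minimal sketch; the starting point y = 3 follows the slides, while the sweep count is an arbitrary choice:

```python
def coordinate_descent(y=3.0, sweeps=50):
    # Minimize f(x, y) = 5x^2 + 1.5xy + 0.5y^2 one coordinate at a time,
    # solving each 1-d subproblem exactly.
    x = 0.0
    for _ in range(sweeps):
        x = -1.5 * y / 10.0   # df/dx = 10x + 1.5y = 0
        y = -1.5 * x          # df/dy = 1.5x + y = 0
    return x, y

print(-1.5 * 3.0 / 10.0)      # → -0.45, i.e. x = -9/20 as on the slides
print(-1.5 * (-0.45))         # y = 27/40 = 0.675, the next slide's update
print(coordinate_descent())   # the iterates contract toward the global minimum (0, 0)
```

Each full sweep multiplies the iterate by a fixed contraction factor here, which is why the alternating one-dimensional minimizations converge rather than cycle forever.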