
Recap

So far we have covered several topics: (1) the concept of convex optimization problems; (2) a categorization of such problems, as summarized in Fig. 1; (3) why convex optimization problems are tractable; (4) a number of important examples that can be translated into LPs or solved via LP relaxation; (5) the simplex algorithm for LPs; (6) the CVX implementation for LPs.

Fig. 1: Categorization of convex optimization problems, together with when each class appeared: Least-squares (1800s), Linear Program (LP, 1939), Quadratic Program (QP, 1956), Second-Order Cone Program (SOCP, 1994), Semi-Definite Program (SDP, 1994), and so on. This lecture embarks on the least-squares problem.

Today’s lecture

Today we are going to move on to the second instance of convex optimization problems: the least-squares problem. Specifically, in this lecture we will cover three things: (1) we will review what the least-squares problem is; (2) we will provide a geometric insight that helps us understand what the least-squares solution means and therefore why the problem is important; (3) we will study one very important application in machine learning: the classification problem.

Let us start by reviewing the least-squares problem that we studied in Lecture 1. The problem is formulated as:

min_x ‖Ax − b‖².    (1)

As mentioned earlier, one of the most important benefits of this problem is that it admits the closed-form solution:

x∗ = (AᵀA)⁻¹Aᵀb.    (2)

Actually, in Lecture 1 we obtained this solution simply by looking for a point that makes the gradient w.r.t. x equal to 0:

∇‖Ax∗ − b‖² = 0.    (3)


But I never explained why (3) enables us to obtain the solution. Now remember what we learned about unconstrained convex minimization in Lecture 3: if f(x) is convex and differentiable at every point in dom f, then ∇f(x∗) = 0 is the optimality condition (i.e., the necessary and sufficient condition for x∗ to be optimal). Applying this to the least-squares problem, we see that (3) is indeed the optimality condition. This is because the objective function ‖Ax − b‖² is convex in x (why? check this in PS3).
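To make this concrete, here is a minimal numerical sketch in Python with NumPy (a tool not used elsewhere in these notes); the matrix A and vector b are arbitrary illustrative values, and the sketch checks that the closed form (2) matches a library least-squares solver.

    import numpy as np

    # A tall matrix (m >= d) and a target vector b; the values are made up.
    A = np.array([[1.0, 0.0],
                  [1.0, 1.0],
                  [1.0, 2.0]])          # m = 3, d = 2
    b = np.array([1.0, 2.0, 2.0])

    # Closed-form solution (2): solve the normal equations A^T A x = A^T b.
    x_closed = np.linalg.solve(A.T @ A, A.T @ b)

    # Library routine that minimizes ||Ax - b||^2 directly.
    x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

    print(x_closed, x_lstsq)            # the two agree up to numerical error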

Dimensions of (x, A, b)

What about the dimensions of (x, A, b)? Let d be the dimension of the optimization variable x. Then x ∈ R^d. Let A := [a_1 · · · a_d] ∈ R^{m×d}. Then b ∈ R^m. We now have two cases depending on the values of m and d.

One case is m < d (A is a wide matrix¹). Suppose all the row vectors in A are linearly independent, i.e., rank(A) = m, which is a typical case in practice. In this case, the number d of unknowns is larger than the number m of linear equations in Ax − b = 0. This implies that there are infinitely many solutions satisfying Ax − b = 0. So in this case the optimal value is p∗ = 0, which is definitely not an interesting scenario.

The second case is m ≥ d (A is a tall matrix). Suppose that b does not lie in the range space of A, range(A) (the space spanned by all the column vectors of A), which is a typical case in practice. In this case, obviously there is no solution that satisfies Ax − b = 0. What we can say, however, is that there is a solution that minimizes ‖Ax − b‖², and this forms the basic idea behind the least-squares problem (which Gauss came up with in the 1800s). So what we are interested in is the second case: m ≥ d.
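As a quick numerical illustration of the two cases (a Python/NumPy sketch with made-up matrices), a wide full-row-rank A attains zero residual, while a tall A with b ∉ range(A) leaves a strictly positive one:

    import numpy as np

    # Wide case: m = 2 < d = 3 with rank(A) = m, so Ax = b has exact solutions.
    A_wide = np.array([[1.0, 0.0, 1.0],
                       [0.0, 1.0, 1.0]])
    b_wide = np.array([1.0, 2.0])
    x_wide, *_ = np.linalg.lstsq(A_wide, b_wide, rcond=None)
    print(np.linalg.norm(A_wide @ x_wide - b_wide))   # ~0: optimal value p* = 0

    # Tall case: m = 3 > d = 2 and b is not in range(A), so the residual cannot vanish.
    A_tall = np.array([[1.0, 0.0],
                       [1.0, 1.0],
                       [1.0, 2.0]])
    b_tall = np.array([0.0, 1.0, 0.0])
    x_tall, *_ = np.linalg.lstsq(A_tall, b_tall, rcond=None)
    print(np.linalg.norm(A_tall @ x_tall - b_tall))   # > 0: the best we can do is minimize it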

Geometric insight

Now let me give you a geometric insight into the least-squares problem. From this, you will see what the least-squares solution means, and accordingly why the problem is important.

Fig. 2: (a) the case d = 1; (b) the case d = 2. In each panel, the blue thick line marks the distance from b to range(A).

Let us first consider the simplest setting, in which d = 1. In this case, A is simply a column vector and x is a scalar. Suppose we have two vectors a_1 and b, as illustrated in Fig. 2(a), in which b is not aligned with a_1 due to our assumption made earlier: b ∉ range(A). Notice that ‖a_1 x − b‖² is minimized when the vector a_1 x − b (marked with the blue thick line) is perpendicular to the direction of a_1. So from this, one can interpret the least-squares solution as the distance-minimizing solution. The distance-minimizing solution is obviously what we want, so it is sort of a good solution that matches well with our natural demand.

¹ In fact, many people use the terminology short matrix instead. But Prof. Stephen Boyd at Stanford strongly recommended that I use a different terminology: a wide matrix. His rationale was that a wide matrix carries a somewhat positive nuance, while a short matrix sounds negative. I respect his attitude of advocating positive aspects, so I use the "wide matrix" terminology.

We can have the same interpretation in a slightly more general case, say d = 2. In this case, A = [a_1 a_2] ∈ R^{m×2}. The vector Ax now lies in a plane, which is the range space of A: Ax ∈ range(A); see Fig. 2(b). Similarly, ‖Ax − b‖² is minimized when the vector Ax − b (marked with the blue thick line) is perpendicular to the plane, as illustrated in Fig. 2(b). So again one can interpret Ax∗ as the distance-minimizing point, i.e., the projection of b onto range(A).
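The perpendicularity in the picture is exactly what the optimality condition (3) says algebraically. A short standard derivation (not spelled out in the notes):

∇‖Ax − b‖² = 2Aᵀ(Ax − b), so (3) reads Aᵀ(Ax∗ − b) = 0,

i.e., the residual Ax∗ − b is orthogonal to every column of A, and hence to range(A). Rearranging gives the normal equations AᵀAx∗ = Aᵀb, which yield (2) whenever AᵀA is invertible.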

As mentioned earlier, the least-squares problem is a very popular and powerful problem which

has played a significant role in the optimization field since the birth of the problem in the 1800s.

It has been employed for addressing many important problems that arise in a wide variety of

applications.

In this lecture, I would like to put a particular emphasis on one important application that arises in machine learning and that we have already investigated earlier: the classification problem.

Remember the classification example that we studied in Lecture 5: legitimate-vs-spam email classification, in which we are given m data points {(x_i, y_i)}_{i=1}^{m}. Here x_i indicates a feature vector. In the example, we considered a two-dimensional case where x_i := (x_{i1}, x_{i2}), and (x_{i1}, x_{i2}) denote the frequencies of keywords 1 and 2 that appear in the ith email, respectively. Here y_i is a label indicating the identity of the ith email: y_i = +1 (legitimate email), y_i = −1 (spam email).

For the above setting, we considered linear classifiers. Specifically, for the separable case, we formulated an LP that finds a line separating the two datasets (legitimate vs. spam). For the non-separable case (which is the typical case in practice), we formulated a slightly different LP that finds a line minimizing the aggregated margin.

In this lecture, we will consider a different classifier which is based on the least-squares problem and is therefore called the least-squares classifier. The idea of the least-squares classifier is to find a linear projection that minimizes the aggregated squared error.

Least-squares classifier

Fig. 3: Block diagram of the least-squares classifier: the input x_i is linearly projected onto the weight vector w to produce x_iᵀw, which is compared against the label y_i.

To see what the idea means, let us consider a block diagram for the classifier, illustrated in Fig. 3. The least-squares classifier is parameterized by a weight vector, say w ∈ R^d. Given input x_i, it computes a linear projection onto the space spanned by the weight vector w; hence, it outputs x_iᵀw. The way to design w is as follows. Using the corresponding label y_i, we first compute the squared error ‖x_iᵀw − y_i‖². Next, we compute the aggregated squared error over all of the m given data points. Finally, we formulate an optimization problem which minimizes the aggregation:

min_{w ∈ R^d} Σ_{i=1}^{m} ‖x_iᵀw − y_i‖².    (4)

Notice that the objective function is very similar to the one that we saw in Lecture 1. Yes, that is the objective function that Gauss came up with while addressing the astronomy problem. So we can use the same simplification trick that Gauss did, thus obtaining:

Σ_{i=1}^{m} ‖x_iᵀw − y_i‖² = ‖[x_1ᵀw − y_1; … ; x_mᵀw − y_m]‖²
                           = ‖[x_1ᵀ; … ; x_mᵀ] w − [y_1; … ; y_m]‖².    (5)

Letting A := [x_1ᵀ; · · · ; x_mᵀ] and b := [y_1; · · · ; y_m], we can then rewrite the optimization problem as:

min_{w ∈ R^d} ‖Aw − b‖²,    (6)

whose closed-form solution, as in (2), is:

w∗ = (AᵀA)⁻¹Aᵀb.    (7)
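A minimal sketch of this design step in Python/NumPy (the feature matrix and ±1 labels below are synthetic, made-up data, and all variable names are mine):

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic training data: m points with d features, labels in {+1, -1}.
    m, d = 200, 2
    X = rng.normal(size=(m, d))                              # rows are x_i^T, i.e., the matrix A
    y = np.where(X @ np.array([1.0, -2.0]) > 0, 1.0, -1.0)   # made-up ground-truth labeling rule

    # Least-squares classifier (7): w* = (A^T A)^{-1} A^T b with A = X, b = y.
    w_star = np.linalg.solve(X.T @ X, X.T @ y)

    # Predictions on the training data: take the sign of the linear projection.
    y_hat = np.sign(X @ w_star)
    print("training error:", np.mean(y_hat != y))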

Here one natural question arises: can the least-squares classifier (7) perform better than the linear classifier that we developed earlier using LP? To answer this question, first of all, we need to know what a proper performance measure is. One very popular and conventional approach in machine learning is to employ the so-called test error. The test error is computed by testing a classifier on new, unseen data called the test data.

To see what it means, consider the following setup. Given the classifier w∗ designed as per (7), we feed in unseen test data, say x_test. The output is then x_testᵀw∗. Next, we declare a legitimate email if x_testᵀw∗ ≥ 0; otherwise, we declare a spam email. That is, we take the sign of the output to obtain ŷ_test = sign(x_testᵀw∗). Comparing this to the ground-truth test label y_test, we check whether the classifier gives a correct answer (0) or not (1). Considering many such test data points, say m_test of them, we compute the test error as the average of these measures:

TestError = (1/m_test) Σ_{i=1}^{m_test} 1{ŷ_test,i ≠ y_test,i},    (8)

where 1{·} denotes an indicator function which outputs 1 when the event {·} is true, and 0 otherwise.
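Continuing the earlier synthetic-data sketch (same assumptions; names like w_star are mine), the test error (8) can be computed as:

    # Fresh, unseen test data drawn from the same made-up distribution.
    m_test = 500
    X_test = rng.normal(size=(m_test, d))
    y_test = np.where(X_test @ np.array([1.0, -2.0]) > 0, 1.0, -1.0)

    # Declare legitimate (+1) if x^T w* >= 0, spam (-1) otherwise, i.e., take the sign.
    y_hat_test = np.where(X_test @ w_star >= 0, 1.0, -1.0)

    # Test error (8): fraction of test points where the prediction disagrees with the label.
    print("test error:", np.mean(y_hat_test != y_test))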

On a side note, the seen data employed for training a classifier are called the training data. In this case, the training data are:

A := [x_1ᵀ; · · · ; x_mᵀ], b := [y_1; · · · ; y_m].


Example: Evaluation on a test dataset (m_test = 1139 + 127)

Here is an example that demonstrates the test error performance of a trained classifier on a test dataset. Suppose that the test dataset consists of two parts: (1) a legitimate dataset containing 1139 test data points; (2) a spam dataset containing 127 test data points. Applying the trained classifier, we get 1120 correct and 19 wrong answers for the 1139 legitimate emails. For the 127 spam emails, we get 95 correct and 32 wrong answers. See Fig. 4.

Fig. 4: Test results. Legitimate emails (1139): 1120 true positives, 19 misdetections. Spam emails (127): 95 correctly detected, 32 false positives (false alarms).

TestError = (19 + 32) / (1139 + 127) ≈ 4%.    (9)

Actually, this error can be split into two types depending on what the ground truth is, and a different emphasis should be put on the two types of errors depending on the application. To explain what this means, let us first introduce some terminology needed to understand the two types of errors. The first is the true positive case, which indicates the event {ŷ = +1 | y = +1}. The second is the misdetection case, indicating {ŷ = −1 | y = +1}. The third is the false positive (or false alarm) case, which refers to {ŷ = +1 | y = −1}.

Now the first type of error concerns the misdetection case and is therefore called the misdetection error: Pr{ŷ = −1 | y = +1}. The second type of error refers to the false positive case, so it is called the false positive error: Pr{ŷ = +1 | y = −1}. In the example of Fig. 4, the two types of error can be computed as:

MisdetectionError = 19 / (19 + 1120) ≈ 1.7%;
FalsePositiveError = 32 / (32 + 95) ≈ 25%.
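As a small sanity check of these numbers (plain Python arithmetic using only the counts from Fig. 4):

    # Counts from the example in Fig. 4.
    true_positive, misdetection = 1120, 19      # legitimate emails (y = +1)
    false_positive, true_negative = 32, 95      # spam emails (y = -1)

    test_error = (misdetection + false_positive) / (1139 + 127)               # (9)
    misdetection_error = misdetection / (misdetection + true_positive)        # Pr{yhat = -1 | y = +1}
    false_positive_error = false_positive / (false_positive + true_negative)  # Pr{yhat = +1 | y = -1}

    print(round(test_error, 3), round(misdetection_error, 3), round(false_positive_error, 3))
    # -> 0.04 0.017 0.252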

Notice that the two errors are highly imbalanced; one is much smaller than the other. Actually, if you think about it, this is sort of a desired situation. In reality, it is crucial not to miss legitimate emails. In other words, we should reliably declare legitimate emails as legitimate, meaning that we should reduce the misdetection error as much as possible. In this case it is around 1.7%, which is more or less okay.

On the other hand, we are more or less fine with occasionally declaring a spam email legitimate, meaning that a moderate value of the false positive error is acceptable in practice. In this case it is around 25%, which is more or less okay.


Now that we know about the test error, which is a proper performance measure, we are ready to compare the performance of the two classifiers: the linear classifier and the least-squares classifier. To this end, we first gather a training dataset {(x_i, y_i)}_{i=1}^{m}. We then use it to design the margin-based linear classifier (using LP) as well as the least-squares classifier (using (7)). Next, we test the classifiers on a test dataset {(x_test,i, y_test,i)}_{i=1}^{m_test} to compute TestError_linear and TestError_LeastSquares. This is how we compare performance. Now you may wonder which is better in terms of the test error measure. Don't worry. You will have a chance to check this in PS.

Regularization technique

Actually, there is an issue in applying the least-squares classifier without any modification. In reality, a data point, say x, contains some noise. Data points are usually obtained from measurements made by humans or sensors. But humans and sensors are not perfect, so x definitely contains some error. This error raises an issue: large values of w∗ (designed as per (7)) can amplify such noise.

To avoid this, we somehow want to make those values small. One way to implement this is to minimize ‖w‖². But obviously, at the same time, we want to make ‖Aw − b‖² small; otherwise, w∗ would always be 0, which is obviously what we do not want. This motivates a natural idea: balance the two objectives at the same time:

min_{w ∈ R^d} ‖Aw − b‖² + λ‖w‖²,    (10)

where λ ≥ 0. Notice that in one extreme case, λ = 0, we obtain the conventional least-squares solution, while in the other extreme case, λ = ∞, we get w∗ = 0, in which we declare spam emails randomly with probability 0.5. This technique is called regularization, and λ is called the regularization factor.
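A minimal sketch of the regularized design (10), continuing the earlier Python/NumPy example (the data and all names are still made up for illustration):

    def ridge_ls(A, b, lam):
        # Solve min_w ||Aw - b||^2 + lam * ||w||^2 via its normal equations:
        # (A^T A + lam * I) w = A^T b.
        return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ b)

    w_reg = ridge_ls(X, y, lam=1.0)   # lam = 0 recovers the plain least-squares solution
    print(w_reg)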

Fig. 5: Training error (blue curve) and test error (red curve) as functions of the regularization factor λ.

Now you may wonder how the performance varies with the regularization factor λ. First consider the training error, which is defined as:

TrainError = (1/m) Σ_{i=1}^{m} 1{ŷ_i ≠ y_i},    (11)

where ŷ_i = sign(x_iᵀw∗). Obviously, the training error is minimized at λ = 0, because in that case we focus only on the error induced by the training dataset. And it monotonically increases as λ increases, since the higher λ is, the more we regularize, putting less weight on fitting the training data. So we get something like the blue curve plotted in Fig. 5.

On the other hand, the situation is different for the test error, defined as:

TestError = (1/m_test) Σ_{i=1}^{m_test} 1{ŷ_test,i ≠ y_test,i}.    (12)

When λ = 0, the test error would be small but larger than the training error, because the λ = 0 case focuses only on the training error. Increasing λ, we would have a regularization effect, so we expect a smaller test error. But if λ is too big, then the classifier would be close to w∗ = 0, in which case the test error would obviously be very large. So one can expect that there is a sweet spot for λ that minimizes the test error. Hence, we may obtain something like the red curve plotted in Fig. 5. Actually, this is indeed the case. Again, you will have a chance to check this in PS.
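One way to look for that sweet spot, as a sketch under the same synthetic-data assumptions as before (reusing ridge_ls, X, y, X_test, y_test from the earlier snippets), is simply to sweep λ and record both errors:

    def error_rate(A, b, w):
        # Fraction of points whose sign prediction disagrees with the label.
        return np.mean(np.where(A @ w >= 0, 1.0, -1.0) != b)

    for lam in [0.0, 0.1, 1.0, 10.0, 100.0]:
        w_lam = ridge_ls(X, y, lam)
        # columns: lambda, training error (11), test error (12)
        print(lam, error_rate(X, y, w_lam), error_rate(X_test, y_test, w_lam))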

Going back to the regularized least-squares problem (10), how can we now solve it? If you think about it, it is nothing but another least-squares problem. Why? Applying the same simplification trick of Gauss once more, we obtain:

‖Aw − b‖² + λ‖w‖² = ‖[A; √λ·I] w − [b; 0]‖² = ‖A′w − b′‖²,

where A′ := [A; √λ·I] (A stacked on top of √λ times the identity) and b′ := [b; 0]. Hence (10) is equivalent to the plain least-squares problem min_{w ∈ R^d} ‖A′w − b′‖².
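A quick numerical check of this equivalence, in the same illustrative Python/NumPy setting (A′ and b′ built with np.vstack and np.concatenate):

    lam = 1.0
    A_prime = np.vstack([X, np.sqrt(lam) * np.eye(d)])    # A' = [A; sqrt(lam) * I]
    b_prime = np.concatenate([y, np.zeros(d)])            # b' = [b; 0]

    w_aug, *_ = np.linalg.lstsq(A_prime, b_prime, rcond=None)
    print(np.allclose(w_aug, ridge_ls(X, y, lam)))        # True: the two solutions coincide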

Look ahead

Next time, we will study another application in which the least-squares problem has played a crucial role, this time in the medical field: Computed Tomography (CT).
