
MA208 Optimization Theory

Lecture Notes

Bernhard von Stengel

Department of Mathematics, London School of Economics,


Houghton St, London WC2A 2AE, United Kingdom
email: b.von-stengel@lse.ac.uk

January 5, 2018
Contents

1 Basic definitions
1.1 The real numbers and their order
1.2 Infimum and supremum
1.3 Constructing the real numbers
1.4 Maximization and minimization
1.5 Sequences, convergence and limits

2 Graph algorithms
2.1 Graphs, digraphs, networks
2.2 Walks, paths, tours, cycles, strong connectivity
2.3 Shortest walks in networks
2.4 Introduction to algorithms
2.5 Single-source shortest paths: Bellman–Ford
2.6 O-notation and running time analysis
2.7 Single-source shortest paths: Dijkstra's algorithm

3 Continuous optimization
3.1 Euclidean norm and maximum norm
3.2 Sequences and convergence in Rⁿ
3.3 Open and closed sets
3.4 Bounded and compact sets
3.5 Continuity
3.6 Proving continuity
3.7 The Theorem of Weierstrass
3.8 Using the Theorem of Weierstrass

4 First-order conditions
4.1 Introductory example
4.2 Differentiability in Rⁿ
4.3 Partial derivatives and C¹ functions
4.4 Taylor's theorem
4.5 Unconstrained optimization
4.6 Equality constraints and the Theorem of Lagrange
4.7 Inequality constraints and the KKT conditions

5 Linear optimization
5.1 Linear functions, hyperplanes, and halfspaces
5.2 Linear programming: introduction
5.3 Linear programs and duality
5.4 Lemma of Farkas and proof of strong LP duality
5.5 Boundedness and dual feasibility
5.6 General LP duality
5.7 Complementary slackness
5.8 Convex sets and functions
5.9 LP duality and the KKT theorem
5.10 The simplex algorithm: example
5.11 The simplex algorithm: general description

Chapter 1

Basic definitions

1.1 The real numbers and their order


Optimization is concerned with finding a “best” solution to a mathematically
expressed problem. This may be a shortest path between two nodes in a network,
or a cylindrical beer can with minimal surface for a given volume. “Better” is here
measured by a real number, like the length of a path, or the required surface area.
This real number is minimized (or, in other contexts, maximized).
This minimization uses the basic ordering of the real numbers, in which one
number x can be declared as greater than another number y, written as x > y or,
equivalently, y < x, that is, y is smaller than x. Alongside the strict order > we
use ≥, where x ≥ y is to be read as "x is greater than or equal to y", or
equivalently y ≤ x, to be read as "y is less than or equal to x".
An inequality such as x > 0 is called a strict inequality, whereas x ≥ 0 is called
a weak inequality. It is important not to confuse the two, because one can find a
solution to the problem “minimize x subject to x ≥ 0” (which has the obvious
solution x = 0) but not to the problem “minimize x subject to x > 0” because
this problem has no solution, which is easily seen by contradiction: If there were a
smallest number x such that x > 0, then x/2 would also fulfill x/2 > 0, but x/2
is smaller than x. Hence, when you write x ≥ 0, be careful to observe that this
means “x is nonnegative” and not “x is positive”, because x could also be zero
(which is not a positive number). Every real number is either positive, zero, or
negative.
Comparing real numbers by size rests on their ordering, which we imagine
to be along the real line. We consider a straight line on which we mark one point
as 0, and a second point as 1 (normally to the right of 0 if the line is drawn horizon-
tally, or above 0 if the line is drawn vertically). Then the distance between 0 and 1
is the "unit length". Any other point x is then a point on this line where x is to
the right of 0 if x > 0, to the left of 0 if x < 0, at distance |x| from 0 assuming


that the distance between 0 and 1 is 1. The set of reals R is thought to be exactly
the set of these points on the line.
The fact that real numbers are ordered makes them one of the most useful
mathematical objects in practical applications. For example, the complex num-
bers cannot be ordered in such a useful way. The complex numbers allow us to
solve arbitrary polynomial equations such as x² = −1, for which no real solution
x exists, because x² ≥ 0 holds for every real number x. This, as we show shortly,
is a property of the order relation ≥, and so we cannot have a system of numbers
that we can order and thus minimize and maximize, and that at the same time
allows us to solve arbitrary polynomial equations.
We state a few properties of the order relation ≥ that imply the inequality
x² ≥ 0 for all x. Most importantly, the order relation x ≥ y should be compatible
with addition in the sense that we can add any number z to both sides and pre-
serve the property (which is obvious from our picture of the real line). That is, for
any reals x, y, z
x ≤ y ⇒ x+z ≤ y+z (1.1)
which for z = − x − y implies

x ≤ y ⇒ − y ≤ −x . (1.2)

Because −x = (−1) · x, the implication (1.2) states the well-known rule that mul-
tiplication with −1 reverses an inequality. If you forget why this is the case, simply
subtract the sum of the two sides from the inequality (that is, add −x − y), as we have
done. Another condition concerning multiplication of real numbers is
done. Another condition concerning multiplication of real numbers is

x, y ≥ 0 ⇒ x · y ≥ 0 . (1.3)

In terms of the real number line, this means that y is “stretched” (if x > 1) or
“shrunk” (if 0 ≤ x < 1) or stays the same (if x = 1) when multiplied with the
nonnegative number x, but stays on the same side of 0 (this holds for any real
number y; here y is also assumed to be nonnegative). Condition (1.3) holds for
a positive integer x as a consequence of (1.1), because y ≥ 0 implies y + y ≥ y
and hence 2y = y + y ≥ y ≥ 0, and similarly for any repeated addition of y.
Extending this from positive integers x to real numbers x gives (1.3).
We now show that x · x ≥ 0 for any real number x. This holds if x ≥ 0 by (1.3).
If x ≤ 0 then −x ≥ 0 by (1.2), and so x · x = (−1) · (−1) · x · x = (−x)(−x) ≥ 0
again by (1.3), where we have used that (−1) · (−1) = 1. This, in turn, follows
from something we have already used, namely that (−1) · y = −y for any y,
because −y is the unique negative of y so that y + (−y) = 0: Namely, we also
have y + (−1) · y = 1 · y + (−1) · y = (1 − 1) · y = 0 · y = 0, so (−1) · y is indeed
−y. Similarly, −(−y) = y (because (−y) + y = 0) and in particular −(−1) = 1,
which means (−1) · (−1) = 1 as claimed.
A systematic derivation of all properties of the order ≤ of the reals in com-
bination with the arithmetic operations addition and multiplication is laborious,

and so we appeal to the intuition of the real number line. Here we note the
following "axioms" for ≤, each of which is worth understanding in its own right.
The first is transitivity, which says that for all x, y, z ∈ R

x ≤ y, y ≤ z ⇒ x ≤ z. (1.4)

The next condition is antisymmetry: for all x, y ∈ R

x ≤ y, y ≤ x ⇒ x = y. (1.5)

Another condition is reflexivity: for all x ∈ R

x ≤ x. (1.6)

The corresponding strict order < is then defined by

x < y :⇔ x ≤ y and x ≠ y (1.7)

(and correspondingly the relations ≥ and >). Transitivity (1.4), antisymmetry
(1.5) and reflexivity (1.6) define what is called a partial order. The term "partial"
means that there can be elements x and y that are incomparable in the sense that
neither x ≤ y nor y ≤ x holds. One of the most important partial orders is the
inclusion relation ⊆ between sets, where A ⊆ B means that A is a subset of B.
For the order ≤ on the set R of reals, incomparability does not occur. This
order is therefore called total in the sense that for all x, y ∈ R

x ≤ y or y ≤ x . (1.8)

It is useful to study properties of an order relation ≤ based on these abstract
properties. That is, we consider a set, say S, with a binary relation ≤ where x ≤ y
says that this relation holds for two particular elements x and y of S. The symbol
< is then defined according to (1.7), and the symbols ≥ and > are understood as the
usual shorthands, such as x ≥ y for y ≤ x. Then S together with ≤ is called an
"ordered" set.

Definition 1.1 An ordered set is a set S together with a binary relation ≤ that is
transitive, antisymmetric, and reflexive, that is, (1.4), (1.5), (1.6) hold for all x, y, z
in S. The order is called total and the set totally ordered if (1.8) holds for all x, y
in S.

1.2 Infimum and supremum


The real numbers R allow addition and multiplication, and comparison with the
order ≤. The same applies to the rational numbers Q, but the real numbers have
an additional property of completeness, which can be expressed in terms of the
order ≤, that the rational numbers lack. Because there is no rational number x so
that x² = 2 (in other words, √2 is irrational), the set { x ∈ Q | x² < 2} has no least
upper bound in Q (defined shortly), but it does in R. Intuitively, the parabola that
consists of all pairs (x, y) so that y = x² − 2 is a "continuous curve" in R² that
should intersect the "x-axis" where y = 0 at two points (x, y) where x = ±√2
and y = 0, in agreement with the intuition that x can take all values on the real
number line.
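
For illustration, here is a small Python sketch (the function name approx_sup and the names lo, hi, tol are ours, chosen for this example) that brackets the least upper bound of { x ∈ Q | x² < 2} by bisection. Every midpoint it tests is a rational number, while the value the interval closes in on, √2, is not:

    # Bracket sup{x in Q | x^2 < 2} by bisection (illustrative sketch only).
    # Every midpoint is rational, yet the value the interval shrinks towards
    # is sqrt(2): the set has upper bounds in Q but no least upper bound in Q.

    def approx_sup(tol=1e-12):
        lo, hi = 1.0, 2.0        # 1 is in the set (1 < 2), 2 is an upper bound (4 > 2)
        while hi - lo > tol:
            mid = (lo + hi) / 2
            if mid * mid < 2:    # mid is in the set, so the supremum is at least mid
                lo = mid
            else:                # mid is an upper bound of the set
                hi = mid
        return lo, hi

    print(approx_sup())          # both values ~ 1.4142135623..., i.e. sqrt(2)
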
The following definition of an upper or lower bound of a set applies to any
ordered set; the order need not be total. The applications we have in mind occur
when the ordered set S is R or Q, but there are other interesting cases as well that
will be considered in the exercises.

Definition 1.2 Let S be an ordered set and let A ⊆ S. An upper bound of A is an
element z of S so that x ≤ z for all x ∈ A. A lower bound of A is an element z of S
so that x ≥ z for all x ∈ A. A greatest element or maximum of A is an upper bound
of A that belongs to A. A smallest element or minimum of A is a lower bound of A
that belongs to A.

Note that a maximum of a set A, if it exists, is unique, so that we can call it
the maximum or say that the maximum is attained: Namely, if x and y are two
maxima of A, then x ≤ y because x ∈ A and y is an upper bound of A, and y ≤ x
because y ∈ A and x is an upper bound of A, so that x = y by antisymmetry (1.5).
Similarly, a minimum of a set, if it exists, is unique.

Definition 1.3 Let S be an ordered set and let A ⊆ S. We say A is bounded from
above if A has an upper bound, and bounded from below if A has a lower bound,
and just bounded if A is bounded from above and below. The least upper bound
or supremum of a set A, denoted sup A, is the least element of the set of upper
bounds of A (if it exists). The greatest lower bound or infimum of a set A, denoted
inf A, is the greatest element of the set of lower bounds of A (if it exists).

Clearly, for a set to have a least or greatest element, it needs to be nonempty.
Applied to the set of upper or lower bounds of A, this means that A can only have
a least upper bound if A is bounded from above, and A can only have a greatest
lower bound if A is bounded from below. What if A itself is empty? Then, by
definition, every real number z is trivially an upper (or lower) bound for A, and
the set of upper (or lower) bounds of A itself is not bounded from below or above,
and so cannot have a least or greatest element. On the other hand, if A is
nonempty, then any element x of A is a lower bound for the set of upper bounds z
of A (because x ≤ z). Similarly, x is an upper bound for the set of lower bounds
of A. In summary, if a set A is nonempty, then A can possibly have a least upper
bound or supremum if A is bounded from above, and A can possibly have a
greatest lower bound or infimum if A is bounded from below.
The supremum of A is the same as a maximum of A if (and only if) it belongs
to A. This is stated in the following proposition, analogously also for infimum
and minimum.

Proposition 1.4 Let S be an ordered set and let A ⊆ S. Then A has a maximum if and
only if sup A exists and belongs to A. Similarly, A has a minimum if and only if inf A
exists and belongs to A.

Proof. Suppose A has a maximum a. Then by definition a is an upper bound of
A, and for any upper bound z of A we have a ≤ z because a ∈ A. So a is the
least upper bound sup A. Conversely, if sup A exists and belongs to A then it is
an upper bound of A and hence a maximum. The argument for minimum and
infimum is similar.

The mentioned order completeness of R asserts that any nonempty set of real
numbers with an upper bound has a least upper bound (supremum), and any
nonempty set of real numbers with a lower bound has a greatest lower bound
(infimum). It is a basic property of the real numbers.

Axiom 1.5 Let A be a nonempty set of real numbers. Then if A is bounded from above
then A has a supremum sup A, and if A is bounded from below then A has an infimum
inf A.

We have stated condition 1.5 as an axiom about the real numbers rather than
a theorem. That is, we assume this condition for all sets of real numbers, according
to our intuition about the real number line.
In an exercise, you will be asked to prove that one of the conditions in Ax-
iom 1.5 (existence of a supremum for any nonempty set that is bounded from
above) implies the other (existence of an infimum for any nonempty set that is
bounded from below). Interestingly, this holds already for any ordered set and
does not require considering −x for x ∈ A as in (1.2).

1.3 Constructing the real numbers


It is also possible to prove Axiom 1.5 as a theorem when one has “constructed” the
real numbers. There are several ways to do this, which we outline in this section,
as an excursion and general background.
The standard approach is to consider the set R as a "system" of numbers
in a sequence of increasingly powerful systems N, Z, Q, R. We first consider the
set N of natural numbers (positive integers) as used for counting, then introduce
zero, and negative integers, which gives us the set of all integers Z. We also
have a way of writing down these integers, namely as finite sequences of decimal
digits (elements of {0, 1, . . . , 9}), preceded by a minus sign for negative integers.
This representation of an integer is unique if the first digit is not 0; all integers can
be written in this way except for the integer zero itself which is written as 0.
Next, the set of integers Z is extended to fractions p/q of integers p and q
where q is positive. This is not a unique representation because ap/aq (for any

positive integer a) represents the same fraction as p/q. The set of all fractions
defines the set Q of rational numbers.
The most familiar way to define the real numbers is as infinite decimal frac-
tions. A decimal fraction starts with the representation of an integer in decimal
notation, followed by a decimal point, followed by an infinite sequence of deci-
mal digits. The decimal fraction that represents a real number is unique except
when the fraction is finite, that is, after some time all digits are 0, for example 1.25
(which represents the fraction 5/4 in decimal). As an infinite sequence of decimal
digits, this can be written as either 1.25000 . . . or 1.24999 . . .. Here one typically
chooses the finite sequence that ends in all 0’s rather than the sequence that ends
in all 9’s. For example, 1/3 is represented as 0.333 . . .. Multiplied by 3, this gives
1 or the equivalent representation 0.999 . . .. It can be shown that any rational
number is represented by a decimal fraction that is eventually periodic, that is, it
becomes an infinitely repeated finite sequence of digits. For example, 1/7 has the
decimal fraction 0.142857142857142857 . . ., which is written as 0.142857 with a bar
over the repeating block 142857. Another example is 1/12 = 0.08333 . . . = 0.083,
with a bar over the repeated digit 3. That is, any rational number, either as
p/q with a pair of integers p and q, or as an eventually periodic decimal fraction,
has a finite description.
In contrast, an arbitrary real number (which can be irrational, that is, it is not
an element of Q) is a general decimal fraction. It requires an infinite description
with an infinite sequence of digits after the decimal point which in general has no
predictable pattern. For example, the ratio π of the circumference of a circle to its
diameter starts with 3.1415926 . . . but has no discernible pattern in its digits (of
which billions have been computed). A single real number is therefore already
“described” by an infinite object. In practice, we are typically content to assume
that a finite prefix of the sequence of digits suffices to describe the real num-
ber “sufficiently accurately”, where we can extend this accuracy as much as we
like. The intuition is that finite (truncated) decimal fractions (which are rational
numbers) approximate the represented real number more and more accurately,
depending on the considered length of the truncated sequence.
One complication of infinite decimal fractions is that the arithmetic opera-
tions, such as addition and multiplication, are hard to describe using these in-
finite descriptions. Essentially, they are performed on the finite approximations
themselves. A way to do this more generally is to define real numbers as “Cauchy
sequences” of rational numbers.
A sequence of numbers (which themselves can be rationals or reals) is written
as x1 , x2 , . . . with elements xk of the sequence for each k ∈ N (where xk ∈ Q for a
sequence of rational numbers, and xk ∈ R for a sequence of real numbers). The
entire sequence is denoted by { xk }k∈N or just { xk } with the understanding that k
goes through all natural numbers.
A Cauchy sequence { xk } has the property that any two of its elements are
eventually arbitrarily close together, that is,

∀ε > 0 ∃K ∈ N ∀i, j ∈ N : i, j ≥ K ⇒ | xi − x j | < ε . (1.9)

That is, for any positive ε, which can be as small as one likes, there is a subscript
K so that for all i and j that are at least K, the sequence elements xi and x j differ
by less than ε. Note that, in particular, we could choose i = K and j arbitrarily
larger than i, and yet xi and x j would differ by less than ε.
In the Cauchy condition (1.9), all elements x1 , x2 , . . . and ε can be rational
numbers. An example of such a Cauchy sequence is the sequence of finite decimal
fractions xk obtained from an infinite decimal fraction up to the kth place past the
decimal point. For example, if the infinite decimal fraction is 3.1415926 . . ., then
this sequence of rational numbers is given by x1 = 3.1, x2 = 3.14, x3 = 3.141,
x4 = 3.1415, x5 = 3.14159, x6 = 3.141592, x7 = 3.1415926, and so on, which is
easily seen to be a Cauchy sequence.
So, more generally, we can define a real number to be a Cauchy sequence
of rational numbers. Two sequences { xk } and {yk } are equivalent if | xk − yk | is
arbitrarily small for sufficiently large k, and if one of these two sequences is a
Cauchy sequence then so is the other (which is easily seen). Any two equivalent
Cauchy sequences define the same real number. Note that a real number is an
infinite object (in fact an entire equivalence class of Cauchy sequences of rational
numbers), similar to an infinite decimal fraction.
With real numbers defined as Cauchy sequences of rational numbers, it is
possible to prove Axiom 1.5 as a theorem. This requires showing the existence
of limits of sequences of real numbers, and the construction of a supremum as a
suitable limit; see R. K. Sundaram (1996), A First Course in Optimization Theory
(Cambridge University Press), Appendix B and Section 1.2.4.
We mention a second possible construction of the real numbers where Axiom
1.5 is much easier to prove. A Dedekind cut is a partition of Q into two nonempty
sets L and U so that a < b for every a ∈ L and b ∈ U, and so that L has no
maximal element. The idea is that each real number x defines uniquely such a cut
of the rational numbers into L and U given by

L = { a ∈ Q | a < x }, U = {b ∈ Q | b ≥ x } . (1.10)

If x is itself a rational number, then x belongs to the upper set U for the Dedekind
cut L, U for x (which is why we require that L has no maximal element, to make
this a unique choice). If x is irrational, then x belongs to neither L nor U and is
“between” L and U. Hence, we can see this cut as a definition of x. The Dedekind
cut L, U in (1.10) that represents x is unique. This holds because any two different
real numbers x and y define different cuts (because a suitable rational number c
with x < c < y will belong to the “upper” set of the cut for x but to the “lower”
set of the cut for y).
In the construction of R via Dedekind cuts, that is, via the described partitions
L, U of Q, each real number has a unique description as such a cut. Each such cut is uniquely

determined by its “lower” set L, by taking the “upper” set U as the set of upper
bounds of L (if we start just with the set L, then we have to require that L is
nonempty, is bounded from above, contains with each a any rational number
smaller than a, and has no maximal element). So, similar to the representation as
a Cauchy sequence, a real number x has an infinite description as a set of rational
numbers. If x and x′ have in this description lower cut sets L and L′, respectively,
then we can define x ≤ x′ by the inclusion relation L ⊆ L′ (as seen from (1.10)).
Now Axiom 1.5 is very easy to prove: Given a nonempty set A, bounded
above, of real numbers x represented by their lower cut sets L in (1.10), the supre-
mum of A is represented by the union of these sets L. This union is a set of rational
numbers, which can be easily shown to fulfill the properties of a lower cut set of
a Dedekind cut, and thus defines a real number, which can be shown to be the
supremum sup A.
Dedekind cuts are an elegant construction of the real numbers from the ratio-
nal numbers. It is slightly more complicated to define arithmetic operations of ad-
dition and, in particular, multiplication of real numbers via the rational numbers
in the respective cut sets than using Cauchy sequences, but the order property is
very accessible.
However, Dedekind cuts are an abstraction that “defines” a point x on the
real line via all the rational numbers a to the left of x, which defines the lower cut
set L in (1.10). This infinite set L is mathematically “simpler” than x because it
contains only rational numbers a. We “know” these rational numbers via their fi-
nite descriptions as fractions, but as points on the line they do not provide a good
intuition about the reals. In our reasoning about the real numbers, we therefore
refer usually to our intuition of the real number line.
Useful literature on the material of this section:
https://en.wikipedia.org/wiki/Construction_of_the_real_numbers
https://en.wikipedia.org/wiki/Dedekind_cut

1.4 Maximization and minimization


Real numbers are the values taken by the function that we want to optimize
(maximize or minimize). The domain of the considered function can be rather
general, and will often be denoted by X, which is always a nonempty set.
Consider a function f : X → R, where X is a nonempty set. The function f is
called the objective function. The domain X of f is sometimes called the constraint
set (typically when X is described by certain “constraints”, for example x ≥ 0 and
x ≤ 1 if X is the interval [0, 1]).
Our basic optimization problems are:
(a) maximize f ( x ) subject to x ∈ X,

(b) minimize f ( x ) subject to x ∈ X.


A solution to the maximization problem (a) is an element x ∗ ∈ X such that

f (x∗ ) ≥ f (x) for all x ∈ X.

If such an element x∗ exists we say that f attains the (global) maximum on X at
x∗. We refer to x∗ as a (global) maximizer of f on X, and to f(x∗) as the (global)
maximum of f on X. (The adjective "global" is used in distinction to a "local"
maximum, defined later.)
The set of all solutions to (a) is denoted by

arg max{ f ( x ) | x ∈ X } := { x ∗ ∈ X | f ( x ∗ ) ≥ f ( x ) for all x ∈ X }.

Analogously, a solution to the minimization problem (b) is an element x∗∗
of X such that

f(x∗∗) ≤ f(x) for all x ∈ X.
If such an element x ∗∗ exists we say that f attains the (global) minimum on X at
x ∗∗ , with x ∗∗ as a (global) minimizer of f on X, and f ( x ∗∗ ) as the (global) minimum
of f on X. The set of all solutions to (b) is denoted by

arg min{ f ( x ) | x ∈ X } := { x ∗∗ ∈ X | f ( x ∗∗ ) ≤ f ( x ) for all x ∈ X }.

Note the difference between "a solution" to one of the optimization problems
above and "the solutions" (or "the set of solutions"), which are all solutions. Fur-
thermore, if a global maximum exists, then it is unique, but the global maximizer
is not necessarily unique. (The same is true for the global minimum.) Quite often
there will be no solution at all; in that case, the set of solutions is the empty set.
Here are some examples that illustrate, in particular, the importance of the
domain X.

Example 1.6 X = [0, ∞), f : X → R, f(x) = x. Then the maximization problem
has no solution, that is, arg max{ f(x) | x ∈ X } = ∅, whereas the minimization
problem has a unique solution, arg min{ f(x) | x ∈ X } = {0}, and the global
minimum is f(0) = 0.

Example 1.7 X = [−1, 1], f : X → R, f(x) = x². Then arg max{ f(x) | x ∈
X } = {−1, 1} and the global maximum is f(−1) = f(1) = 1. The minimization
problem has a unique solution, arg min{ f(x) | x ∈ X } = {0}, and the global
minimum is f(0) = 0.

Example 1.8 X = R, f : X → R, f(x) = 5. This is a constant function, where
the set of solutions of both optimization problems is the entire domain, that is,
arg max{ f(x) | x ∈ X } = arg min{ f(x) | x ∈ X } = X, and the global maximum
and minimum is 5. Only for constant functions (that is, f(x) = f(y) for all x, y ∈
X) do maximum and minimum coincide.
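
For a finite domain X, the sets arg max and arg min can be computed by plain enumeration. The following Python sketch (ours) does this for f(x) = x² on a finite grid standing in for the interval X = [−1, 1] of Example 1.7:

    # arg max and arg min of f over a finite set X by enumeration
    # (illustrative sketch; the grid stands in for X = [-1, 1]).

    def argopt(f, X):
        fmax = max(f(x) for x in X)
        fmin = min(f(x) for x in X)
        return ({x for x in X if f(x) == fmax},   # all global maximizers
                {x for x in X if f(x) == fmin})   # all global minimizers

    X = [i / 10 for i in range(-10, 11)]          # -1.0, -0.9, ..., 1.0
    maximizers, minimizers = argopt(lambda x: x * x, X)
    print(maximizers, minimizers)                 # {1.0, -1.0} {0.0}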

The following is an easy but useful observation, proved with the help of (1.2).
We use it to consider only maximization problems, rather than repeating very
similar considerations for minimization problems.

Proposition 1.9 Consider a function f : X → R and let x ∈ X. Then x is a maximizer
of f on X if and only if x is a minimizer of − f on X.

The following theorem is useful in applications.

Theorem 1.10 Suppose X = X1 ∪ X2 (the two sets X1 and X2 need not be disjoint),
that there exists an element y in X1 so that f(y) ≥ f(x) for all x ∈ X2, and that f attains
a maximum on X1. Then f attains a maximum on X.

Proof. Any element x of X belongs to X1 or X2. Consider a maximizer x∗ of f
on X1. If x ∈ X1 then f(x∗) ≥ f(x). If x ∈ X2 then we have f(y) ≥ f(x), and
f(x∗) ≥ f(y), and hence also f(x∗) ≥ f(x), so x∗ is indeed a maximizer of f
on X.

1.5 Sequences, convergence and limits

Analysis and the study of continuity require the use of sequences and limits. For
the moment we limit ourselves to sequences of real numbers. Recall that a se-
quence x1, x2, . . . is denoted by { xk }k∈N or just { xk }. The limit of such a sequence,
if it exists, is a real number L so that the elements xk of the sequence are eventually
arbitrarily close to L. This closeness is described by a maximum distance from L,
often called ε, that can be an arbitrarily small positive real number.

Definition 1.11 A sequence { xk }k∈N of real numbers converges to L, or has limit
L, if

∀ε > 0 ∃K ∈ N ∀k ∈ N : k ≥ K ⇒ | xk − L| < ε . (1.11)

Then we also write limk→∞ xk = L, or xk → L as k → ∞ (read as "xk tends to L as
k tends to infinity").

In words, (1.11) says that for every (arbitrarily small) positive ε there is some
index K so that from K onwards (k ≥ K) all sequence elements xk differ in abso-
lute value by less than ε from L.
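
For a concrete example (ours, not from the text): the sequence xk = 1/k converges to L = 0, and a suitable K for a given ε can be written down explicitly, since 1/k < ε as soon as k > 1/ε. In Python:

    # For the sequence x_k = 1/k with limit L = 0: given eps > 0,
    # any integer K > 1/eps witnesses condition (1.11).

    def K_for(eps):
        return int(1 / eps) + 1      # smallest integer K with K > 1/eps

    for eps in (0.1, 0.01, 0.001):
        K = K_for(eps)
        # spot-check the condition k >= K  =>  |x_k - 0| < eps:
        assert all(abs(1 / k) < eps for k in range(K, K + 1000))
        print(eps, K)                # 0.1 11, 0.01 101, 0.001 1001
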
The next proposition asserts that if a sequence has a limit, that limit is unique
(something you should remember from Real Analysis – try proving it yourself
before you read the proof).

Proposition 1.12 A sequence { xk }k∈N can have at most one limit.



Proof. Suppose there are two limits L and L′ of the sequence { xk }k∈N with L ≠ L′.
We arrive at a contradiction as follows. Let ε = | L − L′ |/2 and consider K and K′
so that k ≥ K implies | xk − L| = | L − xk | < ε and k ≥ K′ implies | xk − L′ | < ε.
Consider some k so that k ≥ K and k ≥ K′. Then

2ε = | L − L′ | = | L − xk + xk − L′ | ≤ | L − xk | + | xk − L′ | < ε + ε = 2ε, (1.12)

which is a contradiction.

The symbol ∞ for (positive) infinity can be thought of as an extra element
that is larger than any real number. Similarly, −∞ is an additional element that
is smaller than any real number. In terms of the order ≤, it is unproblematic
to extend the set R with the elements ∞ and −∞. However, when used with
arithmetic operations these infinite elements are in general not useful and should
not be treated as "numbers"; for example, ∞ − ∞ cannot be meaningfully defined.
We say that a sequence is bounded (from above or below, or just bounded) if
this holds for the set of its elements. If a sequence converges, it is necessarily
bounded: Fix some ε (for example ε = 1), so that with K in (1.11) we have L − ε <
xk < L + ε for all k ≥ K. If K = 1 then the sequence is bounded by L − ε from
below and L + ε from above. If K > 1, then the set { xi | 1 ≤ i < K } is nonempty
and finite, and thus has a maximum a and minimum b, so that the larger number
of L + ε and a is an upper bound and the smaller number of L − ε and b is a lower
bound for the sequence.
An unbounded sequence may nevertheless show the “limiting behaviour”
that eventually its elements become arbitrarily large. We then say that the se-
quence tends to infinity.

Definition 1.13 A sequence { xk }k∈N of real numbers tends to infinity, written
xk → ∞ as k → ∞, or limk→∞ xk = ∞, if

∀ M ∈ R ∃K ∈ N ∀k ∈ N : k ≥ K ⇒ xk > M . (1.13)

Instead of "tends to infinity" the sequence is also said to "diverge" to infinity.
In general, a sequence is said to diverge if it does not converge. Even a bounded
sequence can diverge, such as (exercise) the sequence {yk } defined by yk = (−1)ᵏ.
To conclude this section, we will show a connection between boundedness
and convergence, namely that every bounded sequence has a convergent subse-
quence. This makes crucial use of the order completeness of the reals (see Propo-
sition 1.14 below).
We need a few more definitions. A subsequence of a sequence { xk }k∈N is a
sequence { xkn }n∈N where k1 , k2 , . . . is an increasing sequence of natural numbers.
For example, these may be the even numbers given by k n = 2n, but they do
not need to be defined explicitly. The subsequence just considers some (infinite)

subset of the elements of the original sequence in increasing order. Exercise: If a
sequence converges, then every subsequence converges to the same limit.
A sequence { xk }k∈N is called increasing if (always for all k ∈ N) xk < xk+1,
nondecreasing if xk ≤ xk+1, decreasing if xk > xk+1, nonincreasing if xk ≥ xk+1, and
monotonic if it is nondecreasing or nonincreasing.

Proposition 1.14 A nondecreasing sequence that is bounded from above converges to
the supremum of the set of its elements. A nonincreasing sequence that is bounded from
below converges to the infimum of the set of its elements.

Proof. Let the sequence be { xk } and let A = { xk | k ∈ N} be the set of elements
of the sequence. Assume A is bounded from above, so A has a supremum L =
sup A by Axiom 1.5. Let ε > 0. We want to show that for some K ∈ N we have
| xk − L| < ε for all k ≥ K. Because L is an upper bound of A, we have L ≥ xk and
thus | xk − L| = L − xk for all k, so we want to show L − xk < ε or equivalently
L − ε < xk . It suffices to find K with L − ε < xK because xK ≤ xK +1 ≤ xK +2 ≤
· · · ≤ xk for all k ≥ K because the sequence is nondecreasing. Now, if for all
K ∈ N we had L − ε ≥ xK then L − ε would be an upper bound of A which is
less than L, but L is the least upper bound of A. So the desired K with L − ε < xK
exists. The claim for the infimum is proved similarly, or obtained by considering
the sequence {− xk } and its supremum instead of the original sequence.

Proposition 1.15 Every sequence has a monotonic subsequence.

Proof. The following argument has a nice visualization in terms of “hotels that
have a view of the sea”. Suppose the real numbers x1 , x2 , . . . are the heights of
hotels. From the top of each hotel with height xk you can look beyond the subse-
quent hotels with heights xk+1 , xk+2 , . . . if they have lower height, and see the sea
at infinity if these are all lower. In other words, a hotel has “seaview” if it belongs
to the set S given by

S = {k ∈ N | xk > x j for all j > k}

(presumably, these are very expensive hotels). If S is infinite, then we take the
elements of S, in ascending order as the subscripts k1 , k2 , k3 , . . . that give our sub-
sequence xk1 , xk2 , xk3 , . . ., which is clearly decreasing. If, however, S is finite with
maximal element K (take K = 0 if S is empty), then for each k > K we have
k ∉ S and hence for xk there exists some j > k with xk ≤ xj. Starting with xk1
for k1 = K + 1 we let k2 = j > k1 with xk1 ≤ xk2 . Then find another k3 > k2
so that xk2 ≤ xk3 , and so on, which gives a nondecreasing subsequence { xkn }n∈N
with xk1 ≤ xk2 ≤ xk3 ≤ · · · . In either case, the original sequence has a monotonic
subsequence.
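
On a finite prefix of a sequence, the "seaview" construction can be carried out by a single scan from the right, as in the following Python sketch (ours). For a finite list the seaview indices always yield a strictly decreasing run of values, since the last element has seaview vacuously:

    # "Hotels with seaview": indices k with x[k] > x[j] for all j > k.
    # For a finite list these indices form a strictly decreasing subsequence.

    def seaview(xs):
        best = float('-inf')        # maximum seen so far, scanning from the right
        idx = []
        for k in range(len(xs) - 1, -1, -1):
            if xs[k] > best:        # xs[k] exceeds everything to its right
                idx.append(k)
                best = xs[k]
        return idx[::-1]            # ascending indices; xs at these is decreasing

    xs = [3, 1, 4, 1, 5, 9, 2, 6, 5]
    ks = seaview(xs)
    print(ks, [xs[k] for k in ks])  # [5, 7, 8] -> [9, 6, 5], strictly decreasing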

The last two propositions together give a famous theorem known as the “Bol-
zano–Weierstrass” theorem.

Theorem 1.16 (Bolzano–Weierstrass) Every bounded sequence has a convergent sub-
sequence.

Proof. The sequence has a monotonic subsequence by Proposition 1.15. Because
the sequence is bounded, so is the subsequence, whose set of elements therefore
has a supremum and an infimum. If the subsequence is nondecreasing, it converges
to the supremum of its elements by Proposition 1.14; if the subsequence is nonin-
creasing, it converges to the infimum.
Chapter 2

Graph algorithms

2.1 Graphs, digraphs, networks


This chapter is about combinatorial optimization. That is, the set of possibilities
that can be optimized over is usually finite, and these possibilities consist of com-
binations of choices. These combinations will be explored by algorithms that are
executed by a computer. Computers are very fast, but they will not work for large
problems with a “brute force” approach that tries out all combinations.
The following example, called the marriage problem, illustrates why a “brute
force” approach is not feasible. Suppose there are n women and n men, and for
every pair ( a, b) of a woman a and a man b it is known if they would both con-
sider marrying each other, called a “possible couple”. An example is n = 3 with
women a1 , a2 , a3 and men b1 , b2 , b3 , and possible couples ( a1 , b1 ), ( a2 , b1 ), ( a3 , b2 ),
and ( a3 , b3 ). The problem is to find n possible couples so that every woman and
every man has a partner. In this example, this is not possible, even though there
is a possible partner for every woman and man, because the sole possible partner
for a1 and a2 is b1 , but b1 can marry only one of them (similarly, a3 can marry only
one of the men b2 and b3 who have only a3 as a possible partner).
For the marriage problem with n women and n men, a "brute force" approach
is to try out all possible ways to order the n men so as to marry them to the n
women a1, a2, . . . , an and to check if the resulting couples are all possible. There
are n! = n · (n − 1) · · · 2 · 1 such orderings (namely n choices for the partner of
a1, then n − 1 for the partner of a2, and so on). Already for n = 32, say, this is
about 3 × 10³⁵. Even if a billion (10⁹) possibilities could be checked per second,
this would take about 10¹⁹ years, roughly a billion times the age of the universe. So such
an approach is clearly impractical, even taking future improvements in comput-
ing technology into account. Although we will not present it here, there is an
algorithm that can solve the marriage problem in a number of steps proportional to n³,
which for n = 32 a modern computer (such as your smartphone) can perform in
a fraction of a second, and even for n = 1,000 in less than a minute. So it is clearly
useful to look for clever algorithms.
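
To see the brute-force approach in code, the following Python sketch (ours) enumerates all n! orderings for the n = 3 example above with itertools.permutations and confirms that none consists of possible couples only:

    from itertools import permutations

    # Brute force for the n = 3 marriage example in the text:
    # women a1, a2, a3; men b1, b2, b3; the four possible couples as given.
    possible = {('a1', 'b1'), ('a2', 'b1'), ('a3', 'b2'), ('a3', 'b3')}
    women = ('a1', 'a2', 'a3')

    # Try all 3! = 6 orderings of the men as partners of a1, a2, a3.
    matchings = [men for men in permutations(('b1', 'b2', 'b3'))
                 if all(couple in possible for couple in zip(women, men))]
    print(matchings)    # [] -- no complete matching of possible couples exists

    # For n = 32 the same loop would run 32! (about 2.6 * 10**35) times.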


In an abstract setting, the women and men in this example define the nodes
of a graph, with the possible couples called edges. The graph in the marriage
problem has the special property of being bipartite (meaning that edges always
connect two nodes a and b that come from two disjoint sets). We first define the
concept of a general graph (or undirected graph), and for reasons of space will in
fact not consider further the marriage problem or bipartite graphs.

Definition 2.1 A graph is given by (V, E) with a finite set V of vertices or nodes
and a set E of edges which are unordered pairs {u, v} of nodes u, v.

The terminology of “vertex” and “edge” is derived from a geometric view,


for example of a three-dimensional cube, whose vertices (corners) are indeed con-
nected by edges of the cube. The following picture on the left shows a distorted,
“flattened” view of such a cube. According to Definition 2.1, the graph is a combi-
natorial structure, that is, it only matters which nodes are connected by edges, not
how these edges are drawn. To emphasize the nodes (which may ambiguously
appear when edges cross in the drawing), they are typically drawn as small disks,
as in the middle picture. It may also be necessary to draw crossing edges, as in
the right picture which has an additional edge from u to v.
[Pictures: a distorted "flattened" drawing of the cube graph; the same drawing with the nodes shown as small disks; and a drawing with an additional, crossing edge from u to v.]

The following is an example with V = { a, b, c, d}, and edges in E given by the
unordered pairs { a, b}, {b, c}, {b, d}, and {c, d}:

[Picture (2.1): the graph with nodes a, b, c, d and the edges { a, b}, {b, c}, {b, d}, {c, d}.]

We will normally not assume that connections between two nodes u and v are
symmetric (even though this may apply in many cases). The concept of a directed
graph allows us to distinguish between getting from u to v and getting from v to u.

Definition 2.2 A digraph (directed graph) is given by (V, A) with a finite set V of
vertices or nodes and a set A of arcs which are ordered pairs (u, v) of nodes u, v.

The following is an example with V = { a, b, c, d}, and arcs in A given by the
ordered pairs ( a, b), (b, c), (c, d), (b, d), and (d, b), where an arc (u, v) is drawn as

an arrow that points from u to v :

[Picture (2.2): the digraph with nodes a, b, c, d and arcs ( a, b), (b, c), (c, d), (b, d), (d, b), each drawn as an arrow.]

An arc (u, v) is also called an arc from u to v. In a digraph, we do not allow arcs
(u, u) from a node u to itself (such arcs are called “loops”). We do allow arcs in
reverse directions such as (u, v) and (v, u), as in the example (2.2) for u, v = b, d.
Because A is a set, it cannot record multiple or “parallel” arcs between the same
nodes, that is, two or more arcs of the form (u, v), so these are automatically
excluded.

Definition 2.3 Let D = (V, A) be a digraph. A network is given by ( D, w) with a
weight function w : A → R that assigns a weight w(u, v) to every arc (u, v) in A.

The following is an example of a network with the digraph from (2.2). Next
to each arc is written its weight (in many examples we use integers as weights,
but this need not be the case, like here where w( a, b) = 1.2).

[Picture: the digraph from (2.2), with the weight of each arc written next to it; w( a, b) = 1.2, and the remaining arcs carry the weights 3, 2, −7, and −9.]

2.2 Walks, paths, tours, cycles, strong connectivity

The underlying structure in our study will always be a digraph, typically with a
weight function so that this becomes a network. The arcs in the digraph represent
connections between nodes that can be followed. A sequence of such connections
defines a walk in the network, as well as special cases of walks according to the
following terminology.

Definition 2.4 Let D = (V, A) be a digraph. A walk in D is a sequence of nodes
u0, u1, . . . , uk for some k ≥ 0 so that (ui, ui+1) ∈ A for 0 ≤ i < k, which are the k
arcs of the walk, and k is called the length of the walk. This is also called a u0, uk-
walk, or a walk from u0 to uk, or a walk with startpoint u0 and endpoint uk. The
walk is called a path if the nodes u0, u1, . . . , uk are all distinct, a tour if u0 = uk,
and a cycle if it is a tour and the nodes u1, . . . , uk are all distinct.

The visited nodes on a walk are just the nodes of the walk. In a walk, but not
in a path, a node may be revisited. A tour starts and ends at the same node. A
cycle also has the same startpoint and endpoint, but otherwise no node may be
revisited.
In (2.2), the sequence a, b, c, d, b is a walk but not a path, a, b, c, d is a path,
d, b, d is a tour and a cycle, and c, d, b, d, b, c is a tour but not a cycle.
A walk, path, tour, or cycle is sometimes called a directed walk, directed
path, directed tour, or directed cycle to emphasize that arcs cannot be tra-
versed backwards. Because we only consider a digraph (and not, say, the graph
that results when ignoring the direction of each arc), we do not need to add the
adjective "directed".
One calls a graph “connected” if any two nodes are connected by a path; for
example, the graph (2.1) is connected. For a digraph, strong connectivity requires
the existence of a walk between any two nodes u, v, where “strong” emphasizes
that the digraph has a walk from u to v and a walk from v to u (which is implied
by the following definition where v, u is just another pair of nodes).

Definition 2.5 A digraph D = (V, A) is called strongly connected if for every two
nodes u, v there is a walk from u to v.

The digraph in (2.2) is not strongly connected because there is no walk from
b to a.
Paths are of particular interest in digraphs because there are only finitely
many of them, since each visited node on a path must be new. (As discussed
at the beginning of this chapter, the number of possible paths may still be huge.)
The following is a simple but crucial observation.

Proposition 2.6 Consider a digraph and two nodes u, v. If there is a walk from u to v,
then there is a path from u to v.

Proof. Consider a walk u0, u1, . . . , uk with u0 = u and uk = v. If this is not a
path, then ui = uj for some i, j with 0 ≤ i < j ≤ k, so the sequence of nodes
ui, ui+1, . . . , uj is a tour T with j − i arcs. By removing the nodes ui+1, . . . , uj from
the walk (effectively, removing the "detour" T) we obtain a walk with fewer arcs.
By continuing in this manner, we eventually obtain a walk with no repetition of
nodes, which is a path. (This may result in a path of length zero when u = v.)
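
The proof is constructive and translates directly into code. In the following Python sketch (ours), a walk is given as a list of nodes; applied to the walk a, b, c, d, b from (2.2) it returns the path a, b:

    # Turn a walk into a path by repeatedly cutting out the "detour"
    # between two visits of the same node, as in Proposition 2.6.

    def walk_to_path(walk):
        path = []
        for node in walk:
            if node in path:                  # repeated node: remove the detour
                path = path[:path.index(node)]
            path.append(node)
        return path

    print(walk_to_path(['a', 'b', 'c', 'd', 'b']))   # ['a', 'b']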

In a similar way, one can show the following (the qualifier “positive length”
is added to exclude the trivial cycle of length zero).

Proposition 2.7 Consider a digraph and a node u. If there is a tour of positive length
that starts and ends at u, then there is a cycle of positive length that starts and ends at u.

2.3 Shortest walks in networks

In a network, the weights typically represent costs of some sort associated with
the respective arcs. Weights for walks (and similarly of paths, tours, cycles) are
defined by summing the weights of their arcs.

Definition 2.8 Let ( D, w) be a network and let W = u0, u1, . . . , uk be a walk in
D. The weight w(W) of the walk W is the sum of the weights of the k arcs of that
walk:

w(W) = w(u0, u1) + w(u1, u2) + · · · + w(uk−1, uk).

Note the difference between length and weight of a walk: length counts the
number of arcs in the walk, whereas weight is the sum of the weights of these
arcs (only if every arc has weight one is the length the same as the weight).
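
In code, the weight of a walk is a sum over its consecutive node pairs. In the following Python sketch (ours), the weight function is a dictionary on the arcs of (2.2), with weights of our own choosing for illustration:

    # Weight of a walk: the sum of the weights of its arcs (Definition 2.8).

    def walk_weight(w, walk):
        """w maps arcs (u, v) to weights; walk is a list of nodes."""
        return sum(w[walk[i], walk[i + 1]] for i in range(len(walk) - 1))

    # An example weight function on the arcs of (2.2) (weights our own choice):
    w = {('a', 'b'): 1.2, ('b', 'c'): 3, ('c', 'd'): 2, ('b', 'd'): -7, ('d', 'b'): -9}
    print(walk_weight(w, ['a', 'b', 'c', 'd']))       # length 3, weight 1.2 + 3 + 2 = 6.2
    print(walk_weight(w, ['a', 'b', 'd', 'b', 'd']))  # length 4, weight 1.2 - 7 - 9 - 7 = -21.8
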
Given a network and two nodes u and v, we are interested in the u, v-walk
of minimum weight, often called shortest walk from u to v (remember throughout
that “shortest” means “least weighty”). Because there may be infinitely many
walks between two nodes if there is a possibility to revisit some nodes on the
way, “minimum weight” may not be a real number, but be equal to plus or minus
infinity.

Definition 2.9 Let u, v be two nodes in a network ( D, w). Let

Y (u, v) = {W | W is a walk from u to v}

and y(u, v) = {w(W ) | W ∈ Y (u, v)}. The distance from u to v, denoted by
dist(u, v), is defined as follows:

dist(u, v) =   +∞             if Y (u, v) = ∅,
               −∞             if y(u, v) has no lower bound,
               inf y(u, v)    otherwise.

So dist(u, v) is +∞ (also written ∞) if there is no walk from u to v, and −∞
if there are walks of arbitrarily negative weight from u to v. We will be able to
identify the latter with a finite condition, namely the existence of a cycle of negative
weight that can be inserted into a walk from u to v.

Theorem 2.10 Let u, v be two nodes in a network ( D, w). Then dist(u, v) = −∞ if and
only if there is a cycle C with w(C ) < 0 that starts and ends at some node on a walk from
u to v. If dist(u, v) ≠ ±∞, then

dist(u, v) = min{w( P) | P is a path from u to v }. (2.3)



Proof. We make repeated use of the proof of Proposition 2.6. Suppose there is a
walk P = u0, u1, . . . , uk with u0 = u and uk = v and a cycle C = ui, v1, . . . , vℓ−1, ui
that starts and ends at some node ui on that walk, 0 ≤ i ≤ k, with w(C ) < 0. Let
n ∈ N. We insert n repetitions of C into P to obtain a walk W that we write (in an
obvious notation) as

W = u0, u1, . . . , ui, [v1, . . . , vℓ−1, ui, ]ⁿ ui+1, . . . , uk.

The first i arcs together with the last k − i arcs of W are those of P, with n copies of
C in the middle, so W has weight w( P) + n · w(C ). For larger n this is arbitrarily
negative because w(C ) < 0, and W belongs to the set Y (u, v). Hence dist(u, v) =
−∞.
Conversely, let dist(u, v) = −∞. Consider a path P from u to v of minimum
weight w( P) as given by the minimum in (2.3). There is a u, v-walk W with
w(W ) < w( P), which exists because dist(u, v) = −∞ (otherwise w( P) would
be a lower bound of y(u, v)). Because W is then clearly not a path, it contains a tour T
as in the proof of Proposition 2.6. If w( T ) ≥ 0 then we could remove T from W
and obtain a walk W′ with weight w(W′) = w(W ) − w( T ) ≤ w(W ) and thus
eventually a path of weight less than w( P), in contradiction to the definition of P. So
W contains a tour T with w( T ) < 0, which starts and ends at some node x, say.
We now claim that T contains a cycle C with w(C ) < 0. If T is itself a cycle, that
is clearly the case. Otherwise, T either contains a "subtour" T′ with w( T′) < 0
(and in general some other startpoint y) which we can consider instead of T, or
else every subtour T′ of T fulfills w( T′) ≥ 0 in which case we can remove T′ from
T without increasing w( T ); an example of these two possibilities is shown in the
following picture with T = x, y, z, y, x and T′ = y, z, y.

[Picture (2.4): two example networks on the nodes u, x, v, y, z, each containing the tour T = x, y, z, y, x with subtour T′ = y, z, y. In the left network w( T′) < 0, so T′ can be considered instead of T; in the right network w( T′) ≥ 0, so T′ can be removed from T.]

In either case, T is eventually reduced to a cycle C with w(C ) < 0 which is part of
W (where W is modified alongside T when removing subtours T′ of nonnegative
weight). This shows the first claim of the theorem.
This implies that if dist(u, v) ≠ ±∞, then there is a walk and hence a path
from u to v, and no u, v-walk contains a cycle or tour of negative weight, and hence
(2.3) holds according to the preceding reasoning.

Note that the left picture in (2.4) shows that we can have dist(u, v) = −∞
even though no cycle of negative weight can be inserted into a path from u to v.
In this example it is only possible to insert a tour of negative weight into a path
from u to v, or to insert a cycle of negative weight into a walk from u to v.

Theorem 2.10 would be simpler to prove if it just stated the existence of a
negative-weight tour that can be inserted into a walk from u to v as an equivalent
statement to dist(u, v) = −∞. However, it has been common to call this condition
the existence of negative-weight (or just “negative”) cycles. For that reason, we
proved the stronger statement.
Consider again the network in the left picture in (2.4) but with the arc (y, x )
removed:
[Picture: the left network of (2.4) with the arc (y, x ) removed.]
In that case dist(u, v) = 2 because the only walk from u to v is the path u, x, v, and
the negative (-weight) cycle y, z, y can be reached from u but cannot be extended
to a walk to v. Nevertheless, we will in the following consider all negative cycles
that can be reached from a given node u as “bad” for the computation of distances
dist(u, v) for nodes v.

2.4 Introduction to algorithms

Before we describe two algorithms for computing shortest paths in networks,
we explain some notation for algorithms. An algorithm is a sequence of pre-
cise instructions to solve a computational problem in finitely many steps. These
instructions are normally executed by a computer, but can also be performed
"manually".
As a first example, we consider an algorithm that finds the minimum m of a
finite nonempty set S of real numbers.

Algorithm 2.11 (Finding the minimum of a nonempty finite set)

Input : finite nonempty set S ⊂ R.


Output : minimum m of S, that is, m ∈ S and m ≤ x for all x ∈ S.

1. m ← some element of S
2. remove m from S
3. while S ≠ ∅ :
4. x ← some element of S
5. m ← min{m, x }
6. remove x from S

In this algorithm, we first specify its behaviour in terms of its input and out-
put. Here the input is a nonempty finite set S of real numbers, and the output is
the minimum of that set, denoted by m. The algorithm is described by a sequence
of instructions. These instructions are numbered, but solely for reference; the line
numbers are irrelevant for the operation of the algorithm (some descriptions of
algorithms make use of them, such as “go to step 3”, but we do not).
Line 1 says that the variable m (which here stores a real number) will be set
to the result of the right-hand side of the arrow “ ← ”. Such a statement is also
called an assignment. Here, this right-hand side is a function that produces some
element of S (which for the moment we assume is implemented somehow). Line 2
states to remove m from S. Line 3 is the beginning of a repeated sequence of in-
structions, and says that as long as the condition S ≠ ∅ is true, the instructions
in the subsequent lines 4–6 will be executed. The fact that the repeated set of
instructions are these three lines (and not just line 4, say) is indicated in this no-
tation by the indentation of lines 4–6, that is, they all start further to the right with
a fixed blank space inserted on the left (see also the discussion that follows Al-
gorithm 2.12 below). This convention makes such a “computer program” very
readable, and is in fact adopted by the programming language Python. Other
programming languages, such as C, C++, or Java, require that several instruc-
tions that are to be executed together are put between delimiters { and } (so the
opening brace { would appear between lines 3 and 4 and the corresponding clos-
ing brace } after line 6); we use indentation instead.
Consider now what happens in lines 3–6 which are executed repeatedly. First,
if S = ∅, then the computation in these lines finishes, and because there are no
further instructions, the algorithm terminates altogether. This happens immedi-
ately if the original set S contains only a single element. Otherwise, S is not empty
and in line 4 another element x is found in S. In line 5, the right-hand side is the
minimum of two numbers, here the current values of m and x, and the result will
be assigned again to m. The effect is that if x < m, then m will assume the new
smaller value x, otherwise m is unchanged. In line 6, the element x is removed
from S. Because S loses one element in each iteration and is originally finite, S
will eventually become empty, and the algorithm terminates. It can be seen that
m will then be the smallest of all the elements of the original set S, as required.
Several observations are in order. First, S will in practice not contain a set of
arbitrary real numbers but only of rational numbers, which have a finite represen-
tation; even that representation is typically limited to a certain number of digits.
Nevertheless, the algorithm works also in an idealized setting where S contains
arbitrary reals. Second, the instruction in line 5 seems circuitous because it asks
to compute the minimum of two numbers m and x, but we are meant to define
an algorithm that computes a minimum of a finite set. However, computing the
minimum of two numbers is a simpler task, and in fact one of the numbers, m,
will become the result. That is, line 5 can be replaced by the more basic conditional
instruction that uses an if statement:

5. if x < m : m ← x
where the assignment m ← x will not happen if x < m is false, that is, if x ≥ m,
in which case m is unchanged. We have chosen the description in Algorithm 2.11
because it is more readable.
A further observation is that the algorithm can be made more “elegant” by
avoiding the repetition of the similar instructions in lines 2 and 6. Namely, we
omit line 2 altogether and replace line 1 with the assignment
1. m ← ∞
under the assumption that an element ∞ that is larger than all real numbers exists
and can be stored in the computer. In that case, the first element that is found in
the set S is x in line 4, which when compared in line 5 with m (which currently
has value ∞) will certainly fulfill x < m and thus m takes in the first iteration
the value of x, which is then removed from S in line 6. So then the first iteration
of the “loop” in lines 3–6 performs what happened in lines 1–2 in the original
Algorithm 2.11. This variant of the algorithm is not only shorter but also more
general because it can also be applied to an empty set S. It is reasonable to define
min ∅ = ∞ because ∞ is the neutral element of min in the sense that min{ x, ∞} =
x for all reals x, just as an empty sum is 0 (the neutral element of addition) or an
empty product is 1 (the neutral element of multiplication). For example, this
would apply to the case that dist(u, v) = ∞ in (2.3) when there is no path from u
to v.
When Algorithm 2.11 terminates, the set S will be empty and therefore no
longer be the original set. If this is undesired, one may instead create a copy of
S on which the algorithm operates that can be “destroyed” in this way while the
original S is preserved.
This raises the question of how a set S is represented in a computer. The best
way to think of this is as a table of a fixed length n, say, that stores the elements of S
which will be denoted as S[1], S[2], . . . , S[n]. Each table element S[i ] for 1 ≤ i ≤ n
is a “real” number in a given limited precision just as the variables m and x. In
programming terminology, S is then also called an array of numbers, with a given
array index i in a specified range (here 1 ≤ i ≤ n) to access the array element S[i ].
In the computer, the array corresponds to a consecutive sequence of memory
cells, each of which stores an array element. The only difference to a set S is that
in that way, repetitions of elements may occur if the numbers S[1], S[2], . . . , S[n]
are not all distinct. Computing the minimum of these n (not necessarily distinct)
numbers is possible just as before. Algorithm 2.12, shown below, is close to an
actual implementation in a programming language such as Python. We just say
“numbers” which are real (in fact, rational) numbers as they can be represented
in a computer.
In Algorithm 2.12, the indentation (white space at the left) in lines 6–7 means
that these two statements are executed if the condition S[k] < m of the if state-
ment in line 5 is true. Line 8 has the same indentation as line 5 so the statement
Algorithm 2.12 (Finding the minimum in an array of numbers)

Input: n numbers S[1], S[2], . . . , S[n], n ≥ 1.

Output: their minimum m and its index i, that is, m = S[i] and m ≤ S[k] for all k.

1. m ← S[1]
2. i ← 1
3. k ← 2
4. while k ≤ n :
5.     if S[k] < m :
6.         m ← S[k]
7.         i ← k
8.     k ← k + 1
k ← k + 1 is always executed inside the “loop” in lines 4–8. This is important because if line 8 were indented like lines 6 and 7, then if S[k] ≥ m the value of k
would stay the same instead of being incremented by 1, so that the loop in lines
4–8 would from then on be executed forever. The result would be a faulty al-
gorithm that normally never terminates (unless the elements in the array are in
strictly decreasing order).
Lines 2 and 7, together with their respective preceding lines, make sure that the
index i will be such that always m = S[i ] holds. If one is not interested in this
index i of the minimum in the array, then lines 2 and 7 can be omitted.
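As an illustration, the following is a minimal Python sketch of Algorithm 2.12 (the function name find_min and the example list are our own choices). Python lists are indexed from 0, so all indices are shifted by one compared with the pseudocode.

    def find_min(S):
        # Returns (m, i) with m = S[i] the minimum of the nonempty list S.
        m = S[0]            # line 1
        i = 0               # line 2
        k = 1               # line 3
        while k < len(S):   # line 4
            if S[k] < m:    # line 5
                m = S[k]    # line 6
                i = k       # line 7
            k = k + 1       # line 8: always executed, indented like line 5
        return m, i

    print(find_min([4, 2, 7, 2]))   # prints (2, 1)

Because the comparison in line 5 is strict, the index of the first occurrence of the minimum is returned.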
Algorithm 2.12 is very detailed and shows how to iterate with an index vari-
able k through array elements S[k ] that represent the elements of a set. Moreover,
the array itself is not modified by this operation, unlike the description in Algo-
rithm 2.11. We normally aim for the most concise description. The following is a
short version that uses the loop description for all x ∈ S which means a suitable
iteration through the elements x of S, where S has some representation such as
an array.

Algorithm 2.13 (Finding the minimum in a set of numbers using “for all”)

Input: finite set of numbers S

Output: minimum m of S, where m = ∞ if S is empty.

1. m ← ∞
2. for all x ∈ S :
3.     m ← min{m, x}
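The following is one possible Python transcription of Algorithm 2.13, where the built-in float('inf') plays the role of the element ∞ (the function name set_min is our own choice):

    def set_min(S):
        # Minimum of an iterable of numbers; float('inf') plays the
        # role of ∞, so the minimum of the empty set is ∞.
        m = float('inf')     # line 1
        for x in S:          # line 2
            m = min(m, x)    # line 3
        return m

    print(set_min({3, -1, 2}))   # prints -1
    print(set_min(set()))        # prints inf, that is, min ∅ = ∞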
To represent a digraph D = (V, A) we need representations of V and A. For the example (2.2), the following table lists in the top row all vertices of D, and each column u contains all vertices v so that (u, v) ∈ A.
a b c d
b c d b (2.5)
d
The columns in this table are also called adjacency lists (which in general have dif-
ferent lengths). Such a table is easily stored in a computer. Often the vertices are
represented by the integers 1, . . . , |V | (which makes it easy to find the adjacency
list for each vertex).
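One simple way to store such adjacency lists, sketched here in Python (the variable name adj is our own choice), is a dictionary that maps each vertex to the list of heads of its outgoing arcs:

    # Adjacency lists of the digraph in (2.5): each vertex u maps to
    # the list of vertices v with (u, v) ∈ A.
    adj = {
        'a': ['b', 'd'],
        'b': ['c'],
        'c': ['d'],
        'd': ['b'],
    }

    # Iterating through all arcs (u, v):
    for u in adj:
        for v in adj[u]:
            print(u, '->', v)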
2.5 Single-source shortest paths: Bellman–Ford
We resume the discussion from Section 2.3. That is, we will in the following
study algorithms for single-source shortest paths, where some node s (for “source”,
or “start node”) is specified and the task is to compute dist(s, v) for all nodes
v, or to find out that some node v can be reached from s so that dist(s, v) = −∞,
in which case the algorithm will stop. In short, meaningful distances will only
be computed under the assumption that there is no negative cycle that can be
reached from s. The reasoning behind computing dist(s, v) for all nodes v is that
even if one is only interested in computing dist(s, t) for a specific pair s, t, there is
essentially no other way than to compute dist(s, v) for all other nodes v because v
could be the last node before t on a shortest path from s to t.
Algorithm 2.14 below, generally known as the Bellman–Ford Algorithm, finds
shortest paths from a single source node s to all other nodes v, or terminates with
a warning message that a negative (-weight) cycle can be reached from s so that
all nodes v in that cycle (and possibly others) fulfill dist(s, v) = −∞.
The algorithm uses an internal table d[v, i ] where v is a vertex and i takes
values between 0 and |V | − 1. Possible values for d[v, i ] are real numbers as
well as ∞, where it is assumed that ∞ + x = ∞ and min{∞, x} = x for any real number x, and also for x = ∞. The algorithm is presented in a first version that
is easier to understand than a second version (Algorithm 2.19 below) where the
two-dimensional table with entries d[v, i ] will be replaced by a one-dimensional
array with entries d[v].
We explain Algorithm 2.14 alongside the following example of a network
with four nodes s, x, y, z.

[Figure: network with arcs (s, x), (x, s), (x, y), (x, z), (z, y) of weights 1, 2, 2, −1, 1, respectively.]

      v        s    x    y    z
      d[v, 0]  0    ∞    ∞    ∞
      d[v, 1]  0   [1]   ∞    ∞                          (2.6)
      d[v, 2]  0    1   [3]  [0]
      d[v, 3]  0    1   [1]   0

(Boxed entries, shown here in square brackets, mark updates.)
Algorithm 2.14 (Bellman–Ford, first version)

Input: network (V, A, w) and source (start node) s.

Output: dist(s, v) for all nodes v if no such distance is −∞.

1. d[s, 0] ← 0
2. for all v ∈ V − {s} : d[v, 0] ← ∞
3. i ← 0
4. while i < |V| − 1 :
5.     for all v ∈ V : d[v, i + 1] ← d[v, i]
6.     for all (u, v) ∈ A :
7.         d[v, i + 1] ← min{ d[v, i + 1], d[u, i] + w(u, v) }
8.     i ← i + 1
9. for all (u, v) ∈ A :
10.    if d[u, |V| − 1] + w(u, v) < d[v, |V| − 1] :
11.        print “Negative cycle!” and stop immediately
12. for all v ∈ V : dist(s, v) ← d[v, |V| − 1]
The right side of (2.6) shows d[v, i ] as rows of a table for i = 0, 1, 2, 3, with the
vertices v as columns. In lines 1–2 of Algorithm 2.14, these values are initialized
(initially set) to d[s, 0] = 0 and d[v, 0] = ∞ for v ≠ s. Lines 4–8 represent the main loop of the algorithm, where i takes successively the values 0, 1, . . . , |V| − 2,
and the entries in row d[v, i + 1] are computed from those in row d[v, i ]. The
important property of these numbers, which we prove shortly, is the following.

Theorem 2.15 In Algorithm 2.14, at the beginning of each iteration of the main loop (lines 4–8), d[v, i] is the smallest weight of any walk from s to v that has at most i arcs.
The main loop begins with i = 0 after line 3. In line 5, the entries d[v, i + 1]
are copied from d[v, i ], and will subsequently be updated. In the example (2.6),
d[v, 1] first contains 0, ∞, ∞, ∞. Lines 6–7 describe a second “inner” loop that
considers all arcs (u, v). Whenever d[u, i ] + w(u, v) is smaller than d[v, i + 1],
the assignment d[v, i + 1] ← d[u, i ] + w(u, v) takes place. This will not happen
if d[u, i ] = ∞ because then also d[u, i ] + w(u, v) = ∞. For i = 0, the only arc
(u, v) where this is not the case is (u, v) = (s, x ), in which case d[u, i ] + w(u, v) =
d[s, 0] + 1 = 1, which is less than ∞, resulting in the assignment d[ x, 1] ← 1. In
(2.6), this assignment is shown by the new entry 1 for d[ x, 1] surrounded by a box.
This is the only assignment of this sort. After all arcs have been considered, it can
be verified that the entries 0, 1, ∞, ∞ in row d[v, 1] represent indeed the shortest
weights of walks from s to v that use at most one arc, as asserted by Theorem 2.15.
After i is increased from 0 to 1 in line 8, the second iteration of the main loop
starts with i = 1. Then arcs (u, v) where d[u, i ] < ∞ are those where u = s or
u = x, which are the arcs (s, x ), ( x, s), ( x, y), and ( x, z). The last two produce the
updates d[y, 2] ← d[ x, 1] + w( x, y) = 1 + 2 = 3 and d[z, 2] ← d[ x, 1] + w( x, z) =
1 − 1 = 0, shown by the boxed entries in row d[v, 2] of the table. Again, it can be
verified that these are the weights of shortest walks from s to v with at most two
arcs.
The last iteration of the main loop is for i = 2, which produces only a single
update, namely when the arc (z, y) is considered in line 7, where d[y, 3] is updated
from its current value 3 to d[y, 3] ← d[z, 2] + w(z, y) = 0 + 1 = 1. Row d[v, 3]
then has the weights of shortest walks from s to v that use at most three arcs. The
main loop terminates when i = 3 (in general, when i = |V | − 1).
Because the network in (2.6) has only four nodes, any walk with more than three arcs (in general, more than |V| − 1 arcs) cannot be a path. In fact, if there is a walk with |V| arcs that is shorter than any found so far, it must contain a tour
of negative weight, as will be proved in Theorem 2.17 below. In Algorithm 2.14,
lines 9–11 test for a possible improvement of the current values in d[v, |V | − 1] (the
last row in the table), much in the same way as in the previous updates in lines 6–
7, by considering all arcs (u, v). However, unlike the assignment in line 7, such a
possible improvement is now taken to terminate the algorithm immediately with
the notification that there must be a negative cycle that can be reached from s.
The normal case is that no such improvement is possible. In that case, line 12
produces the desired output of the distances dist(s, v). In the example (2.6), these
are the entries 0, 1, 1, 0 in the last row d[v, 3].
Before we prove Theorem 2.15, we note that any prefix of a shortest walk is a shortest walk.

Lemma 2.16 Consider two nodes s and v in a network, and suppose W = u0, u1, . . . , uk is a shortest walk from s to v, with u0, uk = s, v. Then for 0 ≤ i < k, the walk u0, u1, . . . , ui (called a prefix of W) is a shortest walk from s to ui.

Proof. Suppose there were a shorter walk W′ from s to ui than the prefix u0, u1, . . . , ui of W. Then W′ followed by ui+1, . . . , uk would be a shorter walk from s to v than W, which contradicts the definition of W.
Proof of Theorem 2.15. This is proved by induction. It clearly holds for i = 0 because the only walk from s with zero arcs is from s to s, so that d[s, 0] = 0 and d[v, 0] = ∞ for v ≠ s.

Suppose that the assertion holds for some i ≥ 0. We show that it holds for i + 1 when line 8 is reached, and hence at the beginning of the next iteration of the loop in line 4. Consider a shortest walk W from s to v that has at most i + 1 arcs (if such a walk exists; if not, then clearly d[v, i + 1] = d[v, i] = ∞ as set in line 5, and d[u, i] = ∞ for all arcs (u, v) that end in v, so d[v, i + 1] remains equal to ∞ in line 7). Either W has at most i arcs or exactly i + 1 arcs. In the former case, w(W) = d[v, i] by induction hypothesis. In the latter case, W is given by some walk of i arcs from s to some node u, followed by an arc (u, v), where the walk from s to u has minimum weight d[u, i] by Lemma 2.16. In either case, w(W) = d[v, i + 1] = min{d[v, i], d[u, i] + w(u, v)} as computed (via line 5) in line 7, which was to be shown.
Theorem 2.15 presents one part of the correctness of Algorithm 2.14. A second part is the correct detection of negative (-weight) cycles. We first consider an example, which is the same network as in (2.6) with an additional arc (y, s) of weight −5, which creates two negative cycles, namely s, x, y, s and s, x, y, z, s.

[Figure: the network of (2.6) with the additional arc (y, s) of weight −5.]

      v            s     x    y    z
      d[v, 0]      0     ∞    ∞    ∞
      d[v, 1]      0    [1]   ∞    ∞                     (2.7)
      d[v, 2]      0     1   [3]  [0]
      d[v, 3]     [−2]   1   [1]   0
      neg. cycle? [−4]  [−1]
In this network, any walk from s to y has two or more arcs, so by Theorem 2.15 the
rows d[v, 0], d[v, 1], d[v, 2] are the same as in (2.6) and only d[v, 3] has the different
entries −2, 1, 1, 0, when the main loop terminates. In the additional row in the
table in (2.7), there are two possible improvements of d[v, 3], namely of d[s, 3],
indicated by −4 , when the arc (y, s) is considered as (u, v) in line 10, or of d[ x, 3],
indicated by −1 , when the arc (s, x ) is considered. Whichever improvement
is discovered first (depending on the order of arcs (u, v) in line 9), it leads to
the immediate stop of the algorithm in line 11. Both improvements reveal the
existence of a walk with four arcs that is shorter than the current shortest walk
with at most three arcs. For the first improvement, this four-arc walk is s, x, z, y, s,
for the second it is s, x, y, s, x.

Theorem 2.17 Consider a network (V, A, w) with a source node s. Then there is a negative cycle that starts at some node which can be reached from s if and only if Algorithm 2.14 stops in line 11.

Proof. The existence of such a cycle corresponds to the condition dist(s, v) = −∞ by Theorem 2.10, for any node v on the cycle. Suppose there is no such cycle.
Then the weight of a shortest walk from s to v for any node v is that of a short-
est path. Because a path has at most |V | − 1 arcs, this weight is computed as
d[v, |V | − 1] according to Theorem 2.15. Furthermore, by the same reasoning,
no improvement is possible in line 10 (that is, the condition of the if statement is
never true), and the algorithm terminates with the output of all distances dist(s, v)
in line 12, none of which is −∞.
Conversely, consider any cycle C = v0, v1, . . . , vk−1, v0 that starts and ends
at some node v0 which is reachable from s, which implies d[vi ] < ∞ for 0 ≤
i < k, where we let d[v] := d[v, |V | − 1] for brevity. Suppose that there is no
improvement in line 10, that is, for all arcs (u, v) we have
d[u] + w(u, v) ≥ d[v] . (2.8)
Applied to the arcs in the cycle, this gives
      d[v0] + w(v0, v1) ≥ d[v1]
      d[v1] + w(v1, v2) ≥ d[v2]
            ⋮                                            (2.9)
      d[vk−2] + w(vk−2, vk−1) ≥ d[vk−1]
      d[vk−1] + w(vk−1, v0) ≥ d[v0] .

Summation of these k inequalities and subtracting the sum ∑_{i=0}^{k−1} d[vi] on both
sides gives w(C ) ≥ 0. That is, if the algorithm does not terminate in line 11,
then all cycles C that can be reached from s have nonnegative weight, as claimed.
Note that the condition only applies to reachable cycles C, because otherwise the
inequalities in (2.9) only imply ∞ + w(C ) ≥ ∞ which also holds if w(C ) < 0.
Theorems 2.15 and 2.17 demonstrate the correctness of Algorithm 2.14, that is,
it fulfills its stated input-output behaviour.
A small extension of the algorithm not only computes the distances from s
to v, but also a shortest path itself. Here we make use of Lemma 2.16, where it
suffices to know for each node v only the last arc (u, v) on such a shortest path.
That is, we only need to store the predecessor u of v on a shortest path from s to v; following the predecessors repeatedly, we arrive back at s. The network formed by the shortest paths is then a tree with root s. A tree is a digraph with a distinguished node, the root, from which there is a unique path to every other node. It can be stored with a single
“predecessor” array which contains the predecessor pred[v] on the path from the
root to v (unless v = s, in which case pred[s] is not given, often written NIL). The
following is an example of such a shortest path tree in a network with six nodes,
with the corresponding pred array and the distances from s. The arcs of the tree
(which are also part of the original network) are indicated as dashed arrows.

[Figure: a network with six nodes s, x, y, z, a, b; the arcs of the shortest-path tree with root s are drawn as dashed arrows.]

      v          s     x    y    z    a    b
      dist(s,v)  0     1    1    0    3    1             (2.10)
      pred[v]    NIL   s    z    x    x    z
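Given such a pred array, the path from s to any node v in the tree is recovered by following predecessors backwards from v until s is reached, and then reversing the result. The following Python sketch (the function name path_from_pred is our own, with None playing the role of NIL) assumes that v is reachable from s, since otherwise the loop would not terminate:

    def path_from_pred(pred, s, v):
        # Follow predecessors back from v to the root s.
        path = [v]
        while v != s:
            v = pred[v]
            path.append(v)
        path.reverse()
        return path

    # The pred array of the shortest-path tree in (2.10):
    pred = {'s': None, 'x': 's', 'y': 'z', 'z': 'x', 'a': 'x', 'b': 'z'}
    print(path_from_pred(pred, 's', 'b'))   # prints ['s', 'x', 'z', 'b']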
The following algorithm is an extension of Algorithm 2.14 that also computes
the shortest-path predecessors.
Algorithm 2.18 (Bellman–Ford, first version, with shortest-path tree)

Input: network (V, A, w) and source (start node) s.

Output: dist(s, v) for all nodes v if no such distance is −∞, and predecessor pred[v] of v on a shortest path from s.

1. for all v ∈ V : d[v, 0] ← +∞ ; pred[v] ← NIL
2. d[s, 0] ← 0
3. i ← 0
4. while i < |V| − 1 :
5.     for all v ∈ V : d[v, i + 1] ← d[v, i]
6.     for all (u, v) ∈ A :
7.         if d[u, i] + w(u, v) < d[v, i + 1] :
7a.            d[v, i + 1] ← d[u, i] + w(u, v)
7b.            pred[v] ← u
8.     i ← i + 1
9. for all (u, v) ∈ A :
10.    if d[u, |V| − 1] + w(u, v) < d[v, |V| − 1] :
11.        print “Negative cycle!” and stop immediately
12. for all v ∈ V : dist(s, v) ← d[v, |V| − 1]
Line 1 of this algorithm not only initializes d[v, 0] to ∞ but also pred[v] to
NIL, for all nodes v. Line 2 then sets d[s, 0] to 0. In fact, without the assign-
ment pred[v] ← NIL, lines 1–2 would be faster than the initialization lines 1–2
of Algorithm 2.14, because the latter has a loop in line 2 where every node v has
to be checked if it is not equal to s, assuming V is represented in an array and
s is not known to be the first element of that array (which in general it is not).
This (not very important) speedup compensates for the double assignment of
first d[s, 0] ← ∞ and then d[s, 0] ← 0 in Algorithm 2.18.
Lines 7 and 7a represent the previous update of d[v, i + 1] in line 7 of Algorithm 2.14, and line 7b the new assignment of the predecessor pred[v] on the new shortest walk to v from s. This is the predecessor as computed by the algorithm. In general, a shortest path is not necessarily unique. For example, in (2.10) a shortest path from s to z could also go via node a. Algorithm 2.18 will only compute pred[z] as x, as shown in (2.10), irrespective of the order in which the arcs in A are traversed in line 6 (why? – hint: Theorem 2.15).
The following demonstrates the updating of the pred array in Algorithm 2.18
for our familiar example (2.6). Its main purpose is a notation to record the progress
of the algorithm with the additional information of the assignment pred[v] ← u
in line 7b, by simply writing u as a subscript next to the box which indicates the
update of d[v, i + 1] in line 7a. There are four such updates, and the most recent
one gives the final value of pred[v] as shown in the last row of the table.

[Figure: the network of (2.6) once more.]

      v        s     x      y      z
      d[v, 0]  0     ∞      ∞      ∞
      d[v, 1]  0    [1]s    ∞      ∞                     (2.11)
      d[v, 2]  0     1     [3]x   [0]x
      d[v, 3]  0     1     [1]z    0
      pred[v]  NIL   s      z      x
Note that the notation with updated values of d[v, i + 1] shown by boxes with a subscript u for the corresponding arc (u, v) is completely ad-hoc, and chosen to document the progress of the algorithm in a compact and unambiguous way. Furthermore, a case that has not yet occurred is that d[v, i + 1] is updated more than once for the same value of i, in case there are several arcs (u, v) where this occurs, depending on the order in which these arcs are traversed in line 6. An example is (2.10) above for i = 2 and v = b, where the update of d[b, 3] occurs from ∞ to 5 via the arc (a, b), and then to 1 via the arc (z, b). One may record these updates of d[b, 3] in the table as [5]a [1]z, or by only listing the last update [1]z (which is the only update in case arc (z, b) is considered before (a, b)).
Algorithm 2.19 (Bellman–Ford, second version)

Input: network (V, A, w) and source (start node) s.

Output: dist(s, v) for all nodes v if no such distance is −∞, and predecessor pred[v] of v on a shortest path from s.

1. d[s] ← 0
2. for all v ∈ V − {s} : d[v] ← ∞
4. repeat |V| − 1 times :
6.     for all (u, v) ∈ A :
7.         d[v] ← min{ d[v], d[u] + w(u, v) }
9. for all (u, v) ∈ A :
10.    if d[u] + w(u, v) < d[v] :
11.        print “Negative cycle!” and stop immediately
12. for all v ∈ V : dist(s, v) ← d[v]
Algorithm 2.14 is the first version of the Bellman–Ford algorithm. The prog-
ress of the algorithm is nicely described by Theorem 2.15. Algorithm 2.19 is a
second, simpler version of the algorithm. Instead of storing the current distances
from s for walks that use at most i arcs in a separate table row d[v, i ], the second
version of the algorithm uses just a single array with entries d[v]. The new algo-
rithm has fewer instructions, which we have numbered with some line numbers
omitted for easier comparison with the first version in Algorithm 2.14.
The main difference between this algorithm and the first version is the update
rule in line 7. The first version compared d[v, i + 1] with d[u, i ] + w(u, v) where
d[u, i ] was always the value of the previous iteration, whereas the second version
compares d[v] with d[u] + w(u, v) where d[u] may already have improved in the
current iteration. The following simple example illustrates the difference.

[Figure: path network with arcs (s, x), (x, y), (y, z), each of weight 1.]

      v        s    x    y    z
      d[v, 0]  0    ∞    ∞    ∞
      d[v, 1]  0    1    ∞    ∞                          (2.12)
      d[v, 2]  0    1    2    ∞
      d[v, 3]  0    1    2    3
The table on the right in (2.12) shows the progress of the first version of the algo-
rithm. Suppose that in the second version in line 6, the arcs are considered in the
order (s, x ), ( x, y), and (y, z). Then the assignments in the inner loop in line 7 are
d[ x ] ← 1, d[y] ← 2, d[z] ← 3, so the complete array is already found in the first
iteration of the main loop in lines 4–7, without any further improvements in the
second and third iteration of the main loop. However, if the order of arcs in line 6
is (y, z), ( x, y), and (s, x ), then the only update in the main loop in line 7 in the
first iteration is d[ x ] ← 1, with d[y] ← 2 in the second iteration, and d[z] ← 3
in the last iteration. In general, the main loop does need |V | − 1 iterations for the
algorithm to work correctly, as asserted by the following theorem.

Theorem 2.20 In Algorithm 2.19, at the beginning of the ith iteration of the main loop (lines 4–7), 1 ≤ i ≤ |V| − 1, we have d[v] ≤ w(W) for any node v and any s, v-walk W that has at most i − 1 arcs. Moreover, if d[v] < ∞, then there is some s, v-walk of weight d[v]. If the algorithm terminates without stopping in line 11, then d[v] = dist(s, v) as claimed.
Proof. The algorithm performs at least the updates of Algorithm 2.14 (possibly
more quickly), which shows that d[v] ≤ w(W ) for any s, v-walk W with at most
i − 1 arcs, as in Theorem 2.15. If d[v] < ∞, then d[v] = w(W 0 ) for some s, v-
walk W 0 because of the way d[v] is computed in line 7. Furthermore, dist(s, v) =
d[v] 6= −∞ if and only if d[u] + w(u, v) ≥ d[v] for all arcs (u, v), as proved for
Theorem 2.17.
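For concreteness, the following Python sketch implements the second version, Algorithm 2.19 (the function name bellman_ford is our own; the weight 2 of the arc (x, s) in (2.6) is read off the picture as an assumption, and any nonnegative value would give the same output):

    def bellman_ford(V, A, w, s):
        # V: list of nodes, A: list of arcs (u, v), w: dict of arc weights.
        d = {v: float('inf') for v in V}   # lines 1-2
        d[s] = 0
        for _ in range(len(V) - 1):        # line 4: repeat |V| - 1 times
            for (u, v) in A:               # lines 6-7
                d[v] = min(d[v], d[u] + w[(u, v)])
        for (u, v) in A:                   # lines 9-11
            if d[u] + w[(u, v)] < d[v]:
                raise ValueError('Negative cycle!')
        return d                           # line 12

    # The example network (2.6):
    V = ['s', 'x', 'y', 'z']
    A = [('s', 'x'), ('x', 's'), ('x', 'y'), ('x', 'z'), ('z', 'y')]
    w = {('s', 'x'): 1, ('x', 's'): 2, ('x', 'y'): 2, ('x', 'z'): -1, ('z', 'y'): 1}
    print(bellman_ford(V, A, w, 's'))   # {'s': 0, 'x': 1, 'y': 1, 'z': 0}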
2.6 O-notation and running time analysis

In this section, we introduce the O-notation used to describe the running time of algorithms, and apply it to the analysis of the Bellman–Ford algorithm.
The time needed to execute an algorithm depends on the size of its input,
and on the machine that performs the instructions of the algorithm. The size of
the input can be very accurately measured in terms of the number of bits (binary
digits) to represent the input. If the input is a network, then the input size is
normally measured more coarsely by the number of nodes and arcs, assuming
that each piece of associated information (such as the endpoints of an arc, and
the weight of an arc) can be stored in some fixed number of bits (which is realistic
in practice).
The execution time of an instruction depends on the computer, and on the
way that the instruction is represented in terms of more primitive instructions,
for example how an assignment translates to the evaluation of the right-hand side
and to the storage of the computed value in memory (and in which location) for
the assigned variable. Because computing technology is constantly improving, it
is normally assumed that a basic instruction, such as an assignment or a test of a
condition like x < m, takes a certain constant amount of time, without specifying
what that constant is.
The functions to measure running times take nonnegative values. Let

      R≥ = { x ∈ R | x ≥ 0 } .                           (2.13)

Suppose f : N → R≥ is a function where f (n) measures, say, the number of mi-
croseconds needed to run the Bellman–Ford Algorithm 2.14 for a network with
n nodes on a specific computer. Changing the computer, or changing microsec-
onds to nanoseconds, would result in changing f (n) by a constant factor. It makes
sense to specify running times “up to a constant factor” as a function of the input
size. The O-notation, or “order of” notation, is designed to capture this, as well
as the asymptotic behaviour of a function (that is, of f (n) for sufficiently large n).

Definition 2.21 Consider two functions f, g : N → R≥. Then we say f (n) ∈ O(g(n)), or f (n) is of order g(n), if there is some C > 0 and K ∈ N so that f (n) ≤ C · g(n) for all n ≥ K.
As an example, 1000 + 10n + 3n2 ∈ O(n2 ), for the following reason: choose
K = 10. Then n ≥ K implies 1000 ≤ 10n2 , 10n ≤ n2 , and thus 1000 + 10n + 3n2 ≤
(10 + 1 + 3)n2 so that the claim is true for C = 14. If we are interested in a
smaller constant C when n is large, we can choose K = 100 and C = 3.2. It is
clear that asymptotically (for large n) the quadratic term dominates the growth
of the function, which is captured by the notation O(n2 ). As a running time,
this particular function may, for example, represent 1000 time units to load the
program, 10n time units for initialization, and 3n2 time units to run the main
algorithm.
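These constants are easily checked empirically; the following Python snippet (a sanity check, not a proof) verifies both choices for a large range of n:

    f = lambda n: 1000 + 10*n + 3*n**2
    assert all(f(n) <= 14 * n**2 for n in range(10, 10**5))     # K = 10, C = 14
    assert all(f(n) <= 3.2 * n**2 for n in range(100, 10**5))   # K = 100, C = 3.2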
The notation f (n) ∈ O( g(n)) is very commonly written as “ f (n) = O( g(n))”.
However, this is imprecise, because f (n) represents a function of n, whereas
O( g(n)) is a set of functions, as stated in Definition 2.21.
In many ways, O-notation is a shorthand, for example by allowing us to omit constants: For a positive constant c, it is immediate from Definition 2.21 that

      O(c · g(n)) = O(g(n)) .                            (2.14)

In addition, O-notation allows us to describe upper bounds of asymptotic growth in a convenient way. We have for functions f, g, h : N → R≥

      f (n) ∈ O(g(n)),  g(n) ∈ O(h(n))  ⇒  f (n) ∈ O(h(n))    (2.15)

because if there are C, D > 0 with f (n) ≤ C · g(n) for n ≥ K and g(n) ≤ D · h(n) for n ≥ L, then f (n) ≤ C · D · h(n) for n ≥ max{K, L}. Note that (2.15) is equivalent to the statement

      g(n) ∈ O(h(n))  ⇔  O(g(n)) ⊆ O(h(n))               (2.16)
for the following reason: Suppose g(n) ∈ O(h(n)). Then (2.15) says that any
function f (n) in O( g(n)) is also in O(h(n)), which shows O( g(n)) ⊆ O(h(n))
and thus “⇒” in (2.16). Conversely, if O( g(n)) ⊆ O(h(n)), then we have clearly
g(n) ∈ O( g(n)) and thus g(n) ∈ O(h(n)), which shows “⇐” in (2.16).
What is O(1)? This is the set of functions f (n) that fulfill f (n) ≤ C for all
n ≥ K, for some constants C and K. Because the finitely many numbers n with
n < K are bounded, we can if necessary increase C to obtain that f (n) ≤ C for all
n ∈ N. In other words, O(1) is the set of functions that are bounded by a constant.
In addition to (2.15), the following rules are useful and easy to prove:

      f (n) ∈ O(g(n))  ⇒  f (n) + g(n) ∈ O(g(n))         (2.17)

which shows that a sum of two functions can be “absorbed” into the function with higher growth rate. With the definition

      O(g(n)) + O(h(n)) = { f (n) + f ′(n) | f (n) ∈ O(g(n)), f ′(n) ∈ O(h(n)) }   (2.18)

one can similarly see

      O(g(n)) + O(h(n)) = O(g(n) + h(n)) .               (2.19)

In addition,

      f (n) ∈ O(g(n))  ⇒  n · f (n) ∈ O(n · g(n)) .      (2.20)
We now apply this notation to analyze the running time of the Bellman–Ford
algorithm, where we consider Algorithm 2.18 because it is slightly more detailed
than Algorithm 2.14. Suppose the input to the algorithm is a network (V, A, w)
with n = |V | and m = | A|. Line 1 takes running time O(n) (we say “time O(n)”
rather than “time in O(n)”) because in that line all nodes are considered, each
with two assignments that take constant time. Lines 2 and 3 take constant time
O(1). The main loop in lines 4–8 is executed n − 1 times. Testing the condition
i < |V | − 1 in line 4 takes time O(1). Line 5 takes time O(n). The “inner loop”
in lines 6–7b takes time O(m) because the evaluation of the if condition in line 7
and the assignments in lines 7a–7b take constant time (which is shorter when they
are not executed because the if condition is false, but bounded in either case).
Line 8 takes time O(1). So the time to perform one iteration of the main loop
in lines 4–8 is O(1) + O(n) + O(m) + O(1), which by (2.17) we can shorten to
O(n + m) because we can assume n > 0. The main loop is performed n − 1 times,
where in view of the constants this can be simplified to multiplication with n,
that is, it takes together time n · O(n + m) = O(n2 + nm). The test for negative
cycles in lines 9–11 takes time O(m), and the final assignment of distances in line 12
time O(n). So the overall running time from lines 1–3, 4–8, 9–11, 12 is O(n) +
O(n2 + nm) + O(m) + O(n) where the second term absorbs the others according
to (2.17). So the overall running time is O(n2 + nm).
The number m of arcs of a digraph with n nodes is at most n · (n − 1), that is,
m ∈ O(n2 ), so that O(n2 + nm) ⊆ O(n2 + n3 ) = O(n3 ). That is, for a network with
n nodes, the running time of the Bellman–Ford algorithm is O(n3 ). (It is therefore also
called a cubic algorithm.)
The above analysis shows a more accurate running time of O(n2 + nm) that
depends on the number m of arcs in the network. The algorithm works for any
number of arcs (even if m = 0). Normally the number of arcs is at least n − 1
because otherwise some nodes cannot be reached from the source node s (this
can be seen by induction on n by adding nodes one at a time to the network,
starting with s: every new node v requires at least one new arc (u, v) in order to
be reachable from the nodes u that are currently reachable from s). In that case
n ∈ O(m) and thus O(n2 + nm) = O(nm), so that the running time is O(nm).
(We have to be careful here: when we say the digraph has at least n − 1 arcs, we
cannot write this as m ∈ O(n), because this would mean an upper bound on m;
the correct way to say this is n ∈ O(m), which translates to n ≤ Cm and thus to
m ≥ n/C, meaning that the number of arcs is at least proportional to the number
of nodes. An upper bound for m is given by m ∈ O(n2 ).) In short, for a network
with n nodes and m arcs, where n ∈ O(m), the running time of the Bellman–Ford
algorithm is O(nm).
The second version of the Bellman–Ford algorithm has the same running time
O(n3 ). Algorithm 2.19 is faster than the first version but, in the worst case, only
by a constant factor, because the main loop in lines 4–7 is still performed n − 1
times, and the algorithm would in general be incorrect with fewer iterations, as
the example in (2.12) shows (which can be generalized to an arbitrary path).

2.7 Single-source shortest paths: Dijkstra’s algorithm
Dijkstra’s Algorithm 2.22, shown below, is a single-source shortest path algorithm
that works in networks where all weights are nonnegative. We use the example in
Figure 2.1 to explain the algorithm, with a suitable table that demonstrates its progress.

Algorithm 2.22 (Dijkstra’s algorithm for single-source shortest paths)

Input: network (V, A, w), w(u, v) ≥ 0 for all (u, v) ∈ A, source s.

Output: dist(s, v) for all nodes v.

1. for all v ∈ V : d[v] ← ∞ ; colour[v] ← black
2. d[s] ← 0 ; colour[s] ← grey
3. while there are grey nodes :
4.     u ← grey node with smallest d[u]
5.     colour[u] ← white
6.     for all v ∈ V so that colour[v] ≠ white and (u, v) ∈ A :
7.         colour[v] ← grey
8.         d[v] ← min{ d[v], d[u] + w(u, v) }
9. for all v ∈ V : dist(s, v) ← d[v]
[Figure 2.1: a network with nodes s, a, b, c, x, y and nonnegative arc weights, together with the table below. Superscripts B, G, W indicate a node that is (or stays) black, grey, or white; entries in square brackets mark updates of d[v], with the predecessor u as a subscript; the node u chosen in line 4 is shown on the left of each row.]

      u     s     a       b       c       x       y
            0^G   ∞^B     ∞^B     ∞^B     ∞^B     ∞^B
      s     0^W  [1]s^G  [4]s^G   ∞       ∞       ∞
      a     0     1^W    [3]a^G  [6]a^G   ∞       ∞
      b     0     1       3^W    [5]b^G  [4]b^G   ∞
      x     0     1       3      [4]x     4^W    [5]x^G
      c     0     1       3       4^W     4       5^G
      y     0     1       3       4       4       5^W

      dist(s, u)   0     1     3     4     4     5
      pred[v]      NIL   s     a     x     b     x

Figure 2.1 Example for Dijkstra’s Algorithm 2.22.
The algorithm uses an array that defines for each node v an array entry
colour[v] which is either black, grey, or white (so this “colour” may internally be
stored by one of three distinct integers such as 0, 1, 2). In the course of the com-
putation, each node will change its colour from black to grey to white, unless there
is no path from the source s to the node, in which case the node will stay black. In
addition, the array entries d[v] represent preliminary distances from s, according
to the following theorem that we prove later and which will be used to show that the algorithm is correct.

Theorem 2.23 In Algorithm 2.22, at the beginning of each iteration of the main loop (lines 3–8), d[v] is the smallest weight of a path u0, u1, . . . , uk from s to v (so u0, uk = s, v) that with the exception of v consists exclusively of white nodes, that is, colour[ui] = white for 0 ≤ i < k. When u is made white in line 5, we have d[u] = dist(s, u).
In line 1, all nodes v are initially black with d[v] ← ∞. In line 2, the source
s becomes grey and d[s] ← 0. This is also shown as the first row in the table in
Figure 2.1, where we use superscripts B, G, W for a newly assigned colour black,
grey, or white. Because grey nodes are of special interest, we will indicate their
colour all the time, even if it has not been updated in that particular iteration.
The main loop in lines 3–8 operates as long as the set of grey nodes is not
empty, in which case it selects in line 4 a particular grey node u with smallest
value d[u]. Because of the initialization in line 2, the only grey node is s, which is
therefore chosen in the first iteration. Each row in the table in Figure 2.1 repre-
sents one iteration of the main loop, where the node u that is chosen in line 4 is
displayed on the left of that row. The row entries are the values d[v] for all nodes
v, as they become updated or stay unchanged in that iteration. The chosen node
u changes its colour to white in line 5, indicated by the superscript W in the table,
where that node is also underlined.
Lines 6–8 are a second inner loop of the algorithm. It traverses all non-white neighbours v of u, that is, all nodes v so that colour[v] is not white and so that (u, v) is an arc. All these non-white neighbours v of u are set to grey in line 7 (indicated by a superscript G), and their distance d[v] is updated to d[u] + w(u, v) in case this is smaller than the previous value of d[v] (which always happens if
v is black and therefore d[v] = ∞). If such an update happens, this means that
there is an all-white path from s to u followed by an arc (u, v) that connects u
to the grey node v, and in that case we can also set pred[v] ← u to indicate that
u is the predecessor of v on the current path from s to v (we have omitted that
update of pred[v] to keep Algorithm 2.22 short; it is the same as in lines 7–7b of
Algorithm 2.18). As in example (2.11) for the Bellman–Ford algorithm, the update
of pred[v] with u is shown with the subscript u, and the update of d[v] is shown
by surrounding the new value with a box.
In the first iteration in Figure 2.1 where u = s, the updated neighbours v of u are a and b. These are also the grey nodes in the next iteration of the main
loop, where node a is selected because d[ a] < d[b], and a is made white. The two
neighbours of a are b and c. Both are non-white and become grey (b is already
grey). The value of d[b] is updated from 4 to d[ a] + w( a, b) = 3, with pred[b] ← a.
The value of d[c] is updated from ∞ to d[ a] + w( a, c) = 6, with pred[c] ← a. The
current row shows that the grey nodes are b and c, where d[b] < d[c].
In the next iteration therefore u is chosen to be b, which gives the next row of
the table. The neighbours of b are a, c, x. Here a is white and is ignored, c is non-
white and gets the update d[c] ← d[b] + w(b, c) = 5 because this is smaller than
the current value 6, and pred[c] ← b. Node x changes colour from black to grey
and d[ x ] ← d[b] + w(b, x ) = 4. In the next iteration, x is the grey node u with
smallest d[u], creating updates for c and y. The next and penultimate iteration
chooses c among two remaining grey nodes, where the neighbour y of c creates
no update (other than being set to grey, which is already its colour). The final iter-
ation chooses y. Because all nodes are now white, the algorithm terminates with
the output of distances in line 9, as shown in the table in Figure 2.1 in addition to
the predecessors in the shortest-path tree with root s.
Proof of Theorem 2.23. First, we note that because all weights are nonnegative,
the shortest walk from s to any node v can always be chosen as a path by Proposition 2.6: any tour that the walk contains has nonnegative weight, so removing it will not increase the weight of the walk.
We prove the theorem by induction. Before the main loop in lines 3–8 is
executed for the first time, there are no white nodes. Hence, the only path from
s where all but the last node are white is a path with no arcs that consists of the
single node s, and its weight is zero, where d[s] = 0 as claimed. Furthermore, this
is the only (and shortest) path from s to s, so dist(s, s) = d[s] = 0.
Suppose now that at the beginning of the main loop the condition is true for the current set of white nodes. If there are no grey nodes, then the main loop will no longer be performed and the algorithm proceeds to line 9. If there are grey nodes, then the main loop will be executed, and we will show that the condition holds again afterwards. Let u be the node that is chosen in line 4, which is made white in line 5. We prove, as claimed in the theorem, that just before this assignment we have d[u] = dist(s, u). This has already been shown when u = s. There is a path from s to u with weight d[u], namely the assumed path (by the induction hypothesis) where all nodes except u are white. Consider any shortest path P from s to u; we will show d[u] ≤ w(P) which implies d[u] = w(P) = dist(s, u). Let y be the first node on the path P which is not white. Let P′ be the prefix of P given by the path from s to y, which is a shortest path from s to y by Lemma 2.16. Moreover, y is grey: there are no arcs (x, y) where x is white (such as the previous node x on P before y) and y is black, because after x has been made white in line 5, all its black neighbours y are made grey in line 7. So P′ is a shortest path from s to y and certainly a shortest path among those where all but the last node are white, so by the induction hypothesis, d[y] = w(P′). By the choice of u in line 4 we have d[u] ≤ d[y] = w(P′) ≤ w(P), where the latter inequality holds because all weights are nonnegative. That is, d[u] ≤ w(P) as claimed.
We now show that updating the non-white neighbours v of u in lines 7–8 will complete the induction step, that is, any shortest path P from s to v where all nodes but v are white has weight d[v]. If the last arc of such a shortest path is not (u, v), then this is true by the induction hypothesis. If the last arc of P is (u, v), then P without its last node v defines a shortest path from s to u (where all nodes are white), where we just proved d[u] = dist(s, u), and hence w(P) = d[u] + w(u, v) = d[v] because that is how d[v] has been updated in line 8. This completes the induction.

Corollary 2.24 Dijkstra’s Algorithm 2.22 works correctly.
Proof. When the algorithm terminates, every node is either white or black. As
shown in the preceding proof, at the end of each iteration of the main loop there
are no arcs ( x, y) where x is white and y is black. Hence, the white nodes are exactly
the nodes u that can be reached from s by a path, with dist(s, u) = d[u] < ∞ by
Theorem 2.23. The black nodes v are those that cannot be reached from s, where
dist(s, v) = ∞ = d[v] as set at initialization in line 1.

In Dijkstra’s algorithm, a grey node u with minimal d[u] already has its final distance dist(s, u) given by d[u], so that u can be made white. There can be no shorter “detour” to reach u via nodes that at that time are grey or black, because the first grey node y on such a path from s to u would fulfill d[u] ≤ d[y] (see the proof of Theorem 2.23), and the remaining part of that path from y to u has nonnegative weight by assumption. This argument fails for negative weights. In the following network,
[Figure: network with nodes s, y, u and arcs (s, u) of weight 1, (s, y) of weight 4, (y, u) of weight −5.]
the next node made white after s is u and is recorded with distance 1, and after
that node y with distance 4. However, the path s, y, u has weight −1 which is less
than the computed weight 1 of the path s, u. So the output of Dijkstra’s algorithm
is incorrect, here because of the negative weight of the arc (y, u). It may happen
that the output of Dijkstra’s algorithm is correct (as in the preceding example if
w(s, u) = 5), but in general this is not guaranteed.
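The following Python sketch implements Algorithm 2.22 (the function name dijkstra is our own; the colours are encoded as the integers 0, 1, 2 as suggested above, and the grey nodes are additionally kept in a set, which is a representation choice of ours). It is shown here on the path network of (2.12), where all weights are nonnegative:

    def dijkstra(V, adj, w, s):
        # Sketch of Algorithm 2.22; adj[u] is the adjacency list of u,
        # w maps arcs (u, v) to their nonnegative weights.
        BLACK, GREY, WHITE = 0, 1, 2
        d = {v: float('inf') for v in V}        # line 1
        colour = {v: BLACK for v in V}
        d[s] = 0                                # line 2
        colour[s] = GREY
        grey = {s}
        while grey:                             # line 3
            u = min(grey, key=lambda v: d[v])   # line 4
            grey.remove(u)
            colour[u] = WHITE                   # line 5
            for v in adj[u]:                    # line 6
                if colour[v] != WHITE:
                    colour[v] = GREY            # line 7
                    grey.add(v)
                    d[v] = min(d[v], d[u] + w[(u, v)])  # line 8
        return d                                # line 9

    adj = {'s': ['x'], 'x': ['y'], 'y': ['z'], 'z': []}
    w = {('s', 'x'): 1, ('x', 'y'): 1, ('y', 'z'): 1}
    print(dijkstra(list(adj), adj, w, 's'))     # {'s': 0, 'x': 1, 'y': 2, 'z': 3}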
We now analyse the running time of Dijkstra’s algorithm. Let n = |V | and
m = | A|. The initialization in lines 1–2 takes time O(n), and so does the final
output (if d[v] is not taken directly as the output) in line 9. In each iteration of the
main loop in lines 3–8, exactly one node becomes (and stays) white. Hence, the
loop is performed n times, assuming (which in general is the case) that all nodes
are eventually white, that is, are reachable by a path from the source node s. We
assume that the colour of a node v is represented by the array entry colour[v],
and that nodes themselves are just represented by the numbers 1, . . . , n. By iterating
through the colour array, identifying the grey nodes in line 3 and finding the node
u with minimal d[u] in line 4 takes time O(n). (Even if the number of grey nodes
were somehow represented in an array of shrinking size, it is possible that they
are at least a constant fraction, if not all, of the nodes that are not white, and their
number is initially n, then n − 1, and so on, so that the number of nodes checked in
line 4 counted over the n iterations of the main loop is n + (n − 1) + · · · + 2 + 1 = O(n²), which is the same as in our current analysis.)
The inner loop in lines 6–8 of Algorithm 2.22 iterates through the nodes v,
and checks if they are not white and if they are neighbours of u, that is, (u, v) ∈ A.
The time this takes depends on the representation of the digraph (V, A). If the
neighbours of every node u are stored in an adjacency list as in (2.5), then this is
as efficient as possible, that is, over the entire iterations of the main loop each arc
(u, v) is considered exactly once, namely after u has just become white. So over
all iterations of the main loop the steps in lines 6–8 take time O(m). Alternatively,
the digraph may be represented by an adjacency table which has Boolean entries
a[u, v] which have value true if and only if (u, v) ∈ A, otherwise false. In that case,
all nodes v have to be checked in line 6, which takes time O(n) for each iteration
of the main loop.
Taken together, lines 4, 5, and 6–8 take time O(n) + O(1) + O(n), and because
they are performed up to n times in total time O(n2 ), which dominates the overall
running time compared to lines 1–2 and 9. If the digraph is represented by adja-
cency lists, the running time is O(n2 ) for lines 3–4 plus O(m) for lines 6–8 over all
iterations of the main loop, which is O(n2 ) + O(m) = O(n2 ) because m ∈ O(n2 ),
by (2.17). In summary, for a network with n nodes, the running time of Dijkstra’s
algorithm is O(n2 ). This is better by a factor of n than the Bellman–Ford algorithm,
but requires the assumption of nonnegative arc weights.
Chapter 3

Continuous optimization
This chapter studies optimization problems for continuous functions. It identifies suitable assumptions about the domain of a continuous function so that it is guaranteed to have a maximum and minimum. The correct condition is that of compactness, which leads to Theorem 3.21, due to Weierstrass, which says: A continuous real-valued function on a compact domain assumes its maximum and minimum.

3.1 Euclidean norm and maximum norm


Recall that if X and Y are sets and f : X → Y is a function, then X is called the
domain and Y the range of f . The range should be distinguished from the image of
f , denoted by
f (X ) = { f (x) | x ∈ X } , (3.1)
which is the set of possible values of f . The image f ( X ) is always a subset of the
range Y. When f ( X ) = Y then f is called a surjective function. Because we want
to maximize or minimize the function, the range will always be R, that is, f is
real-valued.
The domain X of f will in the following be assumed to be a subset of Rn ,
which is the set of n-tuples of real numbers,

Rn = { ( x1 , . . . , xn ) | xi ∈ R for 1 ≤ i ≤ n } (3.2)

(n is some positive integer). The components xi of an n-tuple ( x1 , . . . , xn ) are also


called its coordinates. Often Rn is called the n-dimensional Euclidean space and its
elements are called points, or sometimes vectors in order to emphasize that they
have typical several components. This generalizes the familiar geometric cases
for n = 1, 2, 3, where R1 is the real line, R2 the plane, and R3 three-dimensional
space. In these cases, we will often use coordinates x, y, and z, writing ( x, y) for
the elements of R2 and ( x, y, z) for the elements of R3 , but this notation may vary.

43
44 CHAPTER 3. CONTINUOUS OPTIMIZATION

It is always useful to consider simple cases as examples, where often R1 or R2


suffices.
We will soon define formally what it means for a function to be continu-
ous. Intuitively, a continuous function maps nearby points to nearby points. Here,
“nearby” means “of arbitrarily small distance”. The distance between two real
numbers x and y (that is, points on the real line) is simply | x − y|. There are
several ways to generalize this when x and y are points in Rn . The standard “Eu-
clidean” distance is well known from measuring geometric distances in R2 , for
example. In order to deal with continuity, a distance function defined in terms of
the “maximum norm” will often be simpler to use.

Definition 3.1 Let x = (x1, . . . , xn) ∈ Rn. Then the (Euclidean) norm ‖x‖ of x is defined by

      ‖x‖ = √( ∑_{i=1}^{n} xi² ) .                       (3.3)

The maximum norm ‖x‖max of x is defined by

      ‖x‖max = max{ |x1|, . . . , |xn| } .                (3.4)

For x, y ∈ Rn, the (Euclidean) distance between x and y is defined by

      d(x, y) = ‖x − y‖ ,                                (3.5)

and their maximum-norm distance by

      dmax(x, y) = ‖x − y‖max .                          (3.6)
In Definition 3.1, the distance of two elements x and y of Rn is defined in terms of a norm as ‖x − y‖. For any arbitrary set instead of Rn, a distance can be considered as any function that assigns a real number d(x, y) to any two elements x, y of the set, provided it fulfills the following axioms. These axioms state: nonnegativity of distance, and positive distance of distinct elements, according to

      d(x, x) = 0,  and  x ≠ y ⇒ d(x, y) > 0 ,           (3.7)

symmetry

      d(x, y) = d(y, x) ,                                (3.8)

and the triangle inequality

      d(x, z) ≤ d(x, y) + d(y, z) .                      (3.9)

It can be shown that the maximum-norm and the Euclidean distance fulfill these axioms (see Exercises 5.1 and 5.2).

The triangle inequality is then often stated as

      ‖x + y‖ ≤ ‖x‖ + ‖y‖                                (3.10)
which implies (3.9) using x − y and y − z instead of x and y. For an arbitrary set, a trivially defined distance function that also fulfills axioms (3.7)–(3.9) is given by d(x, x) = 0 and d(x, y) = 1 for x ≠ y.
Let ε > 0 and x ∈ Rn. The set of all points y that have distance less than ε is called the ε-ball around x, defined as

      B(x, ε) = { y ∈ Rn | ‖y − x‖ < ε } .               (3.11)

It is also called the open ball because the inequality in (3.11) is strict. That is, B(x, ε) does not include its “boundary”, called a sphere, which consists of all points y whose distance to x is equal to ε.

Similarly, the maximum-norm ε-ball around a point x in Rn is defined as the set

      Bmax(x, ε) = { y ∈ Rn | ‖y − x‖max < ε } .         (3.12)
The following is elementary but extremely useful:

      ∀ y ∈ Rn : ( y ∈ Bmax(x, ε) ⇔ ∀ i ∈ {1, . . . , n} : |yi − xi| < ε ) .   (3.13)

In other words, y is in the maximum-norm ε-ball around x if and only if y differs from x in each component by less than ε. This follows immediately from (3.4).
The following picture shows the ε-ball and the maximum-norm ε-ball for
ε = 1 around the origin 0 in R2 . The latter, Bmax (0, 1), is the set of all points
( x1 , x2 ) so that −1 < x1 < 1 and −1 < x2 < 1, which is the open square shown
on the right.

[Figure: left, the Euclidean ball B(0, 1), an open disk; right, the maximum-norm ball Bmax(0, 1), the open square with corners (±1, ±1).]

The following picture on the left
[Figure: left, the disk B(0, 1) contained in the square Bmax(0, 1); right, the square Bmax(0, 1) contained in the disk B(0, √2).]
illustrates for x = 0 and ε = 1 that

      B(x, ε) ⊆ Bmax(x, ε) .                             (3.14)

For general x and ε this can be seen as follows. Assume y ∈ B(x, ε), that is, ‖y − x‖ < ε. We have to show that then y ∈ Bmax(x, ε), that is, ‖y − x‖max < ε, that is, |yk − xk| < ε for all k = 1, . . . , n. But this holds because |yk − xk| = √((yk − xk)²) ≤ √( ∑_{i=1}^{n} (yi − xi)² ) = ‖y − x‖ < ε, which shows (3.14).
The above picture on the right shows for n = 2 that

      Bmax(x, ε) ⊆ B(x, ε√n) .                           (3.15)

The reason is that the corner point (1, 1) of the square has farthest Euclidean distance from (0, 0). So we can put the square into a disk of radius √2. In general, (3.15) is seen as follows: Let y ∈ Bmax(x, ε), that is, ‖y − x‖max < ε and thus |yk − xk| < ε for 1 ≤ k ≤ n. Then ‖y − x‖ = √( ∑_{i=1}^{n} (yi − xi)² ) = √( ∑_{i=1}^{n} |yi − xi|² ) < √( ∑_{i=1}^{n} ε² ) = ε√n as claimed.
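In Python, both norms are one-liners, and the inclusions (3.14) and (3.15) can be spot-checked numerically (a small illustration of our own, not a proof):

    from math import sqrt

    def norm(x):        # Euclidean norm (3.3)
        return sqrt(sum(xi**2 for xi in x))

    def norm_max(x):    # maximum norm (3.4)
        return max(abs(xi) for xi in x)

    x = (0.9, 0.5)
    print(norm_max(x))   # 0.9
    print(norm(x))       # about 1.03
    # (3.14) and (3.15) combined say norm_max(x) <= norm(x) <= norm_max(x)*sqrt(n):
    print(norm_max(x) <= norm(x) <= norm_max(x) * sqrt(2))   # True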
3.2 Sequences and convergence in Rn
In Section 1.5, we considered sequences of real numbers and their limits, if they
exist. In this section we give analogous definitions and results for sequences of
points in Rn . Because xi denotes a component of a point x = ( x1 , . . . , xn ), we
will write sequence elements, which are now elements of Rn , in the form x (k) for
k ∈ N.
Analogous to (1.11), the sequence {x^(k)}k∈N has limit x ∈ Rn, or x^(k) → x as k → ∞, if

      ∀ ε > 0 ∃ K ∈ N ∀ k ∈ N : k ≥ K ⇒ ‖x^(k) − x‖ < ε .    (3.16)

The sequence is bounded if there is some M ∈ R so that

      ∀ k ∈ N : ‖x^(k)‖ ≤ M .                            (3.17)
Analogously to Proposition 1.12, a sequence can have at most one limit. This is
proved in the same way, where the contradiction (1.12) is proved with the help of
the triangle inequality (3.10), using the norm instead of the absolute value.

In the definitions (3.16) and (3.17), we have used the Euclidean norm, but we could have used in the same way the maximum norm as defined in (3.4) instead, as asserted by the following lemma.

Lemma 3.2 The sequence {x^(k)}k∈N in Rn has limit x ∈ Rn if and only if

      ∀ ε > 0 ∃ K ∈ N ∀ k ∈ N : k ≥ K ⇒ ‖x^(k) − x‖max < ε ,    (3.18)

and it is bounded if and only if for some M ∈ R

      ∀ k ∈ N : ‖x^(k)‖max ≤ M .                         (3.19)
Proof. Suppose {x^(k)} converges to x in the Euclidean norm. Let ε > 0, and choose K in (3.16) so that k ≥ K implies ‖x^(k) − x‖ < ε, that is, x^(k) ∈ B(x, ε). Because B(x, ε) ⊆ Bmax(x, ε) by (3.14), this also means x^(k) ∈ Bmax(x, ε), which shows (3.18).

Conversely, assume (3.18) holds and let ε > 0. Choose K so that k ≥ K implies x^(k) ∈ Bmax(x, ε/√n). Then Bmax(x, ε/√n) ⊆ B(x, ε) by (3.15), which shows (3.16).

The equivalence of (3.17) and (3.19) is shown similarly.
According to (3.16), a sequence {x^(k)} converges to x when for every ε > 0 eventually (that is, for sufficiently large k) all elements of the sequence are in the open ball B(x, ε) of radius ε around x. The same applies to (3.18), using the (square- or cubical-looking) ball Bmax(x, ε). Another useful view of (3.18) is that the sequence {x^(k)} converges to x if for each component i = 1, . . . , n of these n-tuples, we have xi^(k) → xi as k → ∞, because the condition

      |xi^(k) − xi| < ε                                  (3.20)

for all i = 1, . . . , n, if it holds for all k ≥ K, is equivalent to (3.18), as shown in (3.13). Conversely, if we have convergence in each component, that is, (3.20) holds for k ≥ Ki, for all i, then we can simply take K = max{K1, . . . , Kn} to obtain (3.18). To repeat, a sequence in Rn converges if and only if it converges in each of its n components.

Similarly (prove this!), a sequence is bounded if and only if it is bounded in each component, according to (3.19).
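As a small numerical illustration (an example of our own), the sequence x^(k) = (1 + 1/k, (−1)^k/k) in R2 converges to (1, 0) because each component converges, and its maximum-norm distance to the limit is exactly 1/k:

    def x(k):
        return (1 + 1/k, (-1)**k / k)

    limit = (1, 0)
    for k in [1, 10, 100, 1000]:
        d_max = max(abs(x(k)[i] - limit[i]) for i in range(2))
        print(k, d_max)   # prints 1.0, 0.1, 0.01, 0.001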
3.3 Open and closed sets
We are concerned with the behaviour of a function f “near a point a”, that is,
how the function value f ( x ) behaves when x is near a, where x and a are points
in some subset S of Rn. For that purpose, it is of interest whether a can be approached with x from “all sides”, which is the case if there is an ε-ball around a that is fully
contained in S. If that is the case, then the set S will be called open according to
the following definition.

Definition 3.3 Let S ⊆ Rn. Then S is called open if

      ∀ a ∈ S ∃ ε > 0 : B(a, ε) ⊆ S .                    (3.21)
By (3.14) and (3.15), we could use the maximum-norm ball instead of the
Euclidean-norm ball in (3.21), that is, S is open if and only if

      ∀ a ∈ S ∃ ε > 0 : Bmax(a, ε) ⊆ S .                 (3.22)
It is a useful exercise to prove that the open balls B( a, ε) and Bmax ( a, ε) are them-
selves open subsets of Rn .
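One possible argument for B(a, ε), using the triangle inequality (3.10), is sketched below (the same works for Bmax, using (3.4)):

    \textit{Sketch:} let $y \in B(a,\varepsilon)$ and set
    $\delta = \varepsilon - \|y-a\| > 0$. For any $z \in B(y,\delta)$,
    \[
      \|z-a\| \le \|z-y\| + \|y-a\| < \delta + \|y-a\| = \varepsilon ,
    \]
    so $B(y,\delta) \subseteq B(a,\varepsilon)$, that is, $y$ is an
    interior point of $B(a,\varepsilon)$.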
Definition 3.4 An interior point of a subset S of Rn is a point a so that B(a, ε) ⊆ S for some ε > 0.

Hence, a set S is open if all its elements are interior points of S.

Related to the concept of an open set is the concept of a closed set.
Definition 3.5 Let S ⊆ Rn . Then S is called closed if for all a ∈ Rn and all se-
quences {x^(k)} in S (that is, x^(k) ∈ S for all k ∈ N) with limit a we have a ∈ S.

Definition 3.6 A limit point of a subset S of Rn is a point a ∈ Rn so that there is a sequence {x^(k)} in S with limit a. Note that a need not be an element of S.
Another common term for limit point is accumulation point. Clearly, a set is
closed if and only if it contains all its limit points. Trivially, every element a
of S is a limit point of S, by taking the constant sequence given by x (k) = a in
Definition 3.6.
The next lemma is important to show the connection between open and closed
sets.

Lemma 3.7 Let S ⊆ Rn and a ∈ Rn. Then a is a limit point of S if and only if

      ∀ ε > 0 : B(a, ε) ∩ S ≠ ∅ .                        (3.23)
Proof. Suppose a is a limit point of S according to Definition 3.6, so that there is a sequence {x^(k)} in S with limit a. Let ε > 0. By (3.16), there exists K ∈ N so that ‖x^(k) − a‖ < ε for all k ≥ K, that is, x^(k) ∈ B(a, ε), in particular for k = K, so B(a, ε) and S contain the common element x^(K), which shows (3.23).

Conversely, suppose (3.23) holds. We want to find a sequence {x^(k)} in S with limit a. For that purpose, we choose a sequence of smaller and smaller distances ε, such as 1/k for k ∈ N. By assumption, B(a, 1/k) ∩ S ≠ ∅, so let x^(k) be such an element of S that also belongs to B(a, 1/k), that is, ‖x^(k) − a‖ < 1/k. This defines our sequence {x^(k)} in S. This sequence has limit a. Namely, in order to prove (3.16) for x = a, let ε > 0. Let K be an integer so that K ≥ 1/ε. Then k ≥ K implies k ≥ 1/ε and hence ‖x^(k) − a‖ < 1/k ≤ ε as required.

The next theorem states the connection between open and closed sets: A set is closed if and only if its set-theoretic complement is open.

Theorem 3.8 Let S ⊆ Rn and let T = Rn \ S = { x ∈ Rn | x ∉ S }. Then S is closed if and only if T is open.
Proof. Suppose S is closed, so it contains all its limit points. We want to show that T is open, so let a ∈ T. We want to show that B(a, ε) ⊆ T for some ε > 0. If that were not the case, then for every ε > 0 there would be some element of B(a, ε) that does not belong to T and hence belongs to S, so that B(a, ε) ∩ S ≠ ∅. But then a is a limit point of S according to Lemma 3.7, hence a ∈ S because S is closed, contrary to the assumption that a ∈ T.

Conversely, assume T is open, so for all a ∈ T we have B(a, ε) ⊆ T for some ε > 0. But then B(a, ε) ∩ S = ∅, and thus a is not a limit point of S. Hence S contains all its limit points (if not, such a point would belong to T), so S is closed.
It is possible that a set is both open and closed, which applies to the full set
Rn and to the empty set ∅. (For any “connected space” such as Rn , these are
the only possibilities.) A set may also be neither open nor closed, such as the
half-open interval [0, 1) as a subset of R1 . This set does not contain its limit point
1 and is therefore not closed. It is also not open, because its element 0 does not
have a ball B(0, ε) = (−ε, ε) around it that is fully contained in [0, 1). Another
example of a set which is neither open nor closed is the set {1/n | n ∈ N} which
is missing its limit point 0.
The following theorem states that the intersection of any two open sets S and S′ is open, and the arbitrary union ∪_{i∈I} Si of any open sets Si is open. Similarly, the union of any two closed sets S and S′ is closed, and the arbitrary intersection ∩_{i∈I} Si of any closed sets Si is closed. Here I is any (possibly infinite) nonempty set of subscripts i for the sets Si, and

      ∪_{i∈I} Si = { x | ∃ i ∈ I : x ∈ Si }   and   ∩_{i∈I} Si = { x | ∀ i ∈ I : x ∈ Si } .   (3.24)
Theorem 3.9 Let S, S′ ⊆ Rn, and let Si ⊆ Rn for i ∈ I for some arbitrary nonempty set I. Then

(a) If S and S′ are both open, then S ∩ S′ is open.
(b) If S and S′ are both closed, then S ∪ S′ is closed.
(c) If Si is open for i ∈ I, then ∪_{i∈I} Si is open.
(d) If Si is closed for i ∈ I, then ∩_{i∈I} Si is closed.
Proof. Assume both S and S′ are open, and let a ∈ S ∩ S′. Then B(a, ε) ⊆ S and
B(a, ε′) ⊆ S′ for suitable positive ε and ε′. The smaller of the two balls B(a, ε) and
B(a, ε′) is therefore a subset of both sets S and S′ and therefore of their intersection.
So S ∩ S′ is open, which shows (a).

Condition (b) holds because if S and S′ are closed, then T = Rn \ S and
T′ = Rn \ S′ are open, and so is T ∩ T′ by (a); hence S ∪ S′ = Rn \ (T ∩ T′) is
closed by Theorem 3.8.
To see (c), let S_i be open for all i ∈ I, and let a ∈ ⋃_{i∈I} S_i, that is, a ∈ S_j for some
j ∈ I. Then there is some ε > 0 so that B(a, ε) is a subset of S_j, which is a subset
of the set ⋃_{i∈I} S_i, so this union is open.

We obtain (d) from (c) because the intersection of complements of sets
is the complement of their union, that is, ⋂_{i∈I}(Rn \ S_i) = Rn \ ⋃_{i∈I} S_i, which we
consider here for closed sets S_i and hence open sets Rn \ S_i.


Note that by induction, Theorem 3.9(a) extends to the statement that the intersection
of any finite number of open sets is open. However, this is no longer
true for arbitrary intersections. For example, each of the intervals S_n = (−1/n, 1/n)
for n ∈ N is open, but their intersection ⋂_{n∈N} S_n is the singleton {0}, which is not
an open set. Similarly, arbitrary unions of closed sets are not necessarily closed,
for example the closed intervals [1/n, 1] for n ∈ N, whose union is the half-open
interval (0, 1], which is not closed. However, (c) and (d) do allow arbitrary unions
of open sets and arbitrary intersections of closed sets.

3.4 Bounded and compact sets


Condition (3.17) states what it means for a sequence {x_k} to be bounded. The
same definition applies to a set.

Definition 3.10 Let S ⊆ Rn. Then S is called bounded if there is some M ∈ R so
that ‖x‖ < M for all x ∈ S.

In other words, S is bounded if it is contained in some M-ball around the origin 0
in Rn. Because of (3.14) and (3.15), S is bounded if and only if S is contained
in some maximum-norm M-ball around the origin 0, that is, S ⊆ Bmax(0, M) or

∀(x_1, . . . , x_n) ∈ S  ∀i = 1, . . . , n :  |x_i| < M.   (3.25)

That is, S is bounded if and only if the components x_i of the points x in S are
bounded.

Theorem 1.16 states that a bounded sequence in R has a convergent subsequence.
The same holds for Rn instead of R.

Theorem 3.11 (Bolzano–Weierstrass in Rn) Every bounded sequence in Rn has a
convergent subsequence.
Proof. Consider a bounded sequence {x^(k)} in Rn, where x^(k) = (x_1^(k), . . . , x_n^(k)) for
each k. Because the sequence is bounded, the sequence of ith components {x_i^(k)}
is bounded in R, for each i = 1, . . . , n. In particular, the sequence {x_1^(k)} given
by the first components is bounded, and because it is a bounded sequence of real
numbers, it has a convergent subsequence by Theorem 1.16, call it {x_1^(k_j)}, where
k_j for j = 1, 2, . . . indicates the subsequence. That is, the sequence {x^(k_j)}_{j∈N} of
points in Rn converges in its first component. We now consider the sequence of
real numbers {x_2^(k_j)}_{j∈N} given by the second components of the elements of that
subsequence. Again, by Theorem 1.16, this sequence has a convergent subsequence
for suitable values j_ℓ for ℓ = 1, 2, . . ., so that the resulting sequence
{x^(k_{j_ℓ})}_{ℓ∈N} of points in Rn converges in its second component. Because the
subscripts k_{j_ℓ} for ℓ ∈ N define a subsequence of k_j for j ∈ N, the first components
x_1^(k_{j_ℓ}) of these vectors form a subsequence of the convergent sequence x_1^(k_j), which is
therefore also convergent with the same limit.

So the sequence of vectors {x^(k_{j_ℓ})}_{ℓ∈N} converges in its first and second
components. We now proceed in the same manner by considering the sequence
of third components {x_3^(k_{j_ℓ})}_{ℓ∈N} of these vectors, which again has a convergent
subsequence since these are bounded real numbers, and that subsequence now
defines a sequence of vectors in Rn that converges in its first, second, and
third components. By continuing in this manner, we obtain eventually, after n
applications of Theorem 1.16, a subsequence of the original sequence {x^(k)}_{k∈N}
that converges in each component. As mentioned at the end of Section 3.2, convergence
in each component means overall convergence. So we have found the
required subsequence.
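The iterated extraction of subsequences in this proof can be mimicked computationally.
The following Python sketch (an illustration, not part of the proof; the helper
monotone_indices and the sample sequence are our own choices) realizes Theorem 1.16
on a finite sample by selecting a monotone, hence convergent, subsequence, first for
the first components and then, within that selection, for the second components:

    # Sketch: a finite-sample analogue of the proof of Theorem 3.11.
    import math

    def monotone_indices(a):
        """Indices i with a[i] >= a[j] for all later j ("suffix maxima");
        the selected values form a nonincreasing subsequence of a."""
        idx, best = [], -math.inf
        for i in range(len(a) - 1, -1, -1):   # scan from the right
            if a[i] >= best:
                best = a[i]
                idx.append(i)
        return idx[::-1]

    # A bounded sequence x^(k) in R^2 (k = 1, 2, ...).
    xs = [((-1)**k * (1 + 1/k), math.sin(k)) for k in range(1, 2001)]

    i1 = monotone_indices([p[0] for p in xs])      # refine by first components
    sub1 = [xs[i] for i in i1]
    i2 = monotone_indices([p[1] for p in sub1])    # refine by second components
    sub2 = [sub1[i] for i in i2]
    print(sub2[-3:])   # the tail of the doubly refined subsequence clusters

A nonincreasing sequence that is bounded below converges, so each refinement step
plays the role of one application of Theorem 1.16 in the proof.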

By the previous theorem, every sequence of points in a bounded subset S of Rn
has a convergent subsequence. If such a subsequence can always be chosen so that
its limit also belongs to S, then S is called compact, which is a very important
concept.

Definition 3.12 Let S ⊆ Rn . Then S is called compact if and only if every sequence
of points in S has a convergent subsequence whose limit belongs to S.

The following characterization of compact sets in Rn is very useful.

Theorem 3.13 Let S ⊆ Rn . Then S is compact if and only if S is closed and bounded.

Proof. Assume first that S is closed and bounded, and consider an arbitrary se-
quence of points in S. Then by Theorem 3.11, this sequence has a convergent
subsequence with limit x, say, which belongs to S because S is closed. So S is
compact according to Definition 3.12.
Conversely, assume S is compact. Consider any convergent sequence of points
in S with limit x. Because S is compact, that sequence has a convergent subsequence
whose limit belongs to S; but every subsequence of a convergent sequence has the
same limit x, so x ∈ S. So every limit point of S belongs to S, which means that
S is closed.
In order to show that S is bounded, assume this is not the case, so that for ev-
ery k ∈ N there is a point x (k) in S with k x (k) k ≥ k. This defines an unbounded se-
quence { x (k) }k∈N in S where clearly every subsequence is also unbounded which
therefore cannot converge (every convergent sequence is bounded), in contra-
diction to the assumption that S is compact. This proves that S is bounded, as
claimed.

Because a subset of a bounded set is clearly bounded, we immediately obtain
the following consequence of Theorem 3.13.

Corollary 3.14 A closed subset of a compact set is compact.

3.5 Continuity
We now consider functions that are defined on a subset S of Rn. The concepts
of being open, closed, or bounded apply to such sets S. These are “topological”
properties of S, which means that they refer to the way points in S can be
“approached”, for example by sequences. If S is open, then any point in S has an
ε-ball around it that belongs to S, for sufficiently small ε. If S is closed, then S
contains all its limit points. If S is bounded, then every sequence in S is bounded,
and moreover boundedness is necessary for compactness.

The central notion of “topology” is continuity, which refers to a function f
and means that f preserves “nearness”. That is, a function f is continuous if
it maps nearby points to nearby points. Here, “nearby” means “arbitrarily close”.
“Closeness” is defined in terms of the distance between two points, according to
the Euclidean norm or the maximum norm, as discussed in the first two sections
of this chapter.
of this chapter. Basically, we can say that a function is continuous if it preserves
limits in the sense that
lim f ( xk ) = f ( x ) (3.26)
xk → x

where { xk } is an arbitrary sequence that converges to x, or

lim f ( xk ) = f ( lim xk )
k→∞ k→∞

assuming that lim xk exists, which is called x in (3.26).


k→∞

We now give a formal definition of continuity. For now, no particular topological
property is assumed about the domain S of the function. Moreover, we
define this for functions that may take values in some Rm, because it can be stated
in the same manner for m = 1 or any other positive integer m.

Definition 3.15 Let S ⊆ Rn, let f : S → Rm be a function defined on S, and let
x̄ ∈ S. Then f is called continuous at x̄ if

∀ε > 0 ∃δ > 0 ∀x ∈ S : ‖x − x̄‖ < δ ⇒ ‖f(x) − f(x̄)‖ < ε.   (3.27)

The function f is called continuous on S if it is continuous at all points of S.

The condition ‖f(x) − f(x̄)‖ < ε in (3.27) states that f(x) belongs to the ε-ball
around f(x̄) (in Rm), which says that the function values f(x) are “close” to f(x̄).
This is required to hold for all points x provided these belong to S (so that f(x)
is defined) and are within a δ-ball around x̄. Here δ can be chosen as small as
required but must be positive. This captures the intuition that f maps points near
x̄ to points near f(x̄).
A simple example of a function that is not continuous is the function f : R → R
defined by

f(x) = 0 if x ≠ 0,   f(x) = 1 if x = 0,   (3.28)

which is not continuous at x̄ = 0. Namely, if we choose ε = 1/2, for example, then
for any δ > 0 we will always find a point x near x̄ so that ‖x − x̄‖ < δ but
‖f(x) − f(x̄)‖ ≥ ε, for example x = δ/2, where ‖f(x) − f(x̄)‖ = 1,
which contradicts the requirement (3.27).
The following function g : R2 → R is a much more sophisticated example (a
plot of this function is shown in Figure 3.1):

g(x, y) = xy / (x² + y²)  if (x, y) ≠ (0, 0),   g(x, y) = 0  if (x, y) = (0, 0).   (3.29)

What is interesting about the function g is that it is separately continuous in each
variable, which means the following: Consider g(x, y) as a function of x for fixed y,
that is, consider the function g_y : R → R given by g_y(x) = g(x, y). If y = 0, then
we clearly have g_y(x) = 0 for all x, which is certainly a continuous function. If
y ≠ 0, then y² > 0, and g_y given by g_y(x) = xy/(x² + y²) is also continuous. Because
g(x, y) is symmetric in x and y, the function g is also separately continuous in y.
However, g is not a continuous function when its arguments are allowed to vary
jointly. Namely, for x = y ≠ 0 we have g(x, y) = g(x, x) = x²/(x² + x²) = 1/2, so
along the diagonal the function is constant with value 1/2, which differs from g(0, 0),
which is zero. That is, g is not continuous at (0, 0).
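This failure of joint continuity is easy to see numerically. The following Python
sketch (an illustration, not part of the notes) evaluates g along the axes, where it
is 0, and along the diagonal, where it is 1/2 arbitrarily close to (0, 0):

    # Sketch: evaluate g from (3.29) along the axes and along the diagonal.
    def g(x, y):
        return 0.0 if (x, y) == (0.0, 0.0) else x * y / (x**2 + y**2)

    for k in range(1, 6):
        t = 10.0 ** (-k)
        # g is 0 on the axes but 1/2 on the diagonal, however small t is:
        print(t, g(t, 0.0), g(0.0, t), g(t, t))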
A useful criterion to prove that a function is continuous relates to our initial
consideration of this section in terms of sequences.

Figure 3.1 Plot of the function g(x, y) in (3.29).

Proposition 3.16 Let S ⊆ Rn, let f : S → Rm be a function defined on S, and let
x̄ ∈ S. Then f is continuous at x̄ if and only if for all sequences {x^(k)} in S that
converge to x̄, we have

lim_{k→∞} f(x^(k)) = f(x̄).   (3.30)

Proof. Suppose f is continuous according to Definition 3.15, and let {x^(k)} be a
sequence in S that converges to x̄. We want to show that f(x^(k)) → f(x̄) as k → ∞.
For given ε > 0, we choose δ > 0 so that ‖x − x̄‖ < δ ⇒ ‖f(x) − f(x̄)‖ < ε for
all x ∈ S according to (3.27). Because x^(k) → x̄, there is some natural number K
so that ‖x^(k) − x̄‖ < δ whenever k ≥ K, according to (3.16) (with δ instead of ε).
Then k ≥ K implies ‖f(x^(k)) − f(x̄)‖ < ε, as required.

Conversely, assume that (3.30) holds for all sequences {x^(k)} in S that converge
to x̄. Suppose f is not continuous according to Definition 3.15. That is, there
is some ε > 0 so that for all δ > 0 there is some point x in S with ‖x − x̄‖ < δ but
‖f(x) − f(x̄)‖ ≥ ε. Similarly to the proof of Lemma 3.7, we let x^(k) be such a point
for δ = 1/k, that is, ‖x^(k) − x̄‖ < 1/k but ‖f(x^(k)) − f(x̄)‖ ≥ ε, for all k ∈ N.
This clearly defines a sequence {x^(k)} in S that converges to x̄ but where the
corresponding images f(x^(k)) under f do not converge to f(x̄), a contradiction.

With the help of this proposition, we see that g as defined in (3.29) is not continuous
at (0, 0) by considering the sequence (x, y)^(k) = (1/k, 1/k), which converges
to (0, 0), but where g((x, y)^(k)) = g(1/k, 1/k) = 1/2, so these function values of g
do not converge to g(0, 0) = 0.

3.6 Proving continuity


Intuitively, a function f from Rn to R is continuous if its graph has no “disruptions”.
This graph is a subset of Rn+1 given by the points (x_1, . . . , x_n, f(x_1, . . . , x_n)).
The function f in (3.28) is not continuous at 0, and the function (x, y) ↦ g(x, y) in
(3.29) is not continuous because of a similar disruption at x = 0 when considered
for its arguments (x, x), for example.

The graph of a function f from R1 to R has no disruptions if it can be drawn as a
continuous line. In order to prove this formally, note that condition (3.27) states
that if the argument x of f(x) is at most distance δ away from x̄, then f(x) is less
than ε away from f(x̄). So the challenge is to identify how small δ needs to be for a
given ε. We demonstrate this with some familiar but nontrivial examples.
First, let S = R \ {0} and consider the function f : S → R given by f(x) =
1/x. The domain S of f is not all of R, but it is an open set, which means that for
each x̄ ∈ S there is some δ-ball around x̄ that is fully contained in S. In R1, such a
δ-ball is the open interval (x̄ − δ, x̄ + δ). In the present case, x ∈ (x̄ − δ, x̄ + δ) ⊆ S
implies that x and x̄ have the same sign, which will be useful to prove (3.27).

We now work backwards from the condition ‖f(x) − f(x̄)‖ < ε in order to
obtain a suitable constraint on ‖x − x̄‖, which is here equal to |x − x̄|:

‖f(x) − f(x̄)‖ < ε
⇔ |1/x − 1/x̄| < ε
⇔ |x − x̄| / |x x̄| < ε   (3.31)
⇔ |x − x̄| < ε x x̄

where the last equivalence holds because x and x̄ have the same sign, which we
ensure by choosing δ small enough so that (x̄ − δ, x̄ + δ) ⊆ S as described (any δ
with δ ≤ |x̄| will do), and thus x x̄ > 0. The last inequality in (3.31) is a condition
on |x − x̄|, but we cannot choose δ = ε x x̄ because this expression does not depend
solely on ε and x̄ but also on x. However, all we want is that |x − x̄| < δ implies
|x − x̄| < ε x x̄, so it suffices that δ is anything smaller than ε x x̄ (but still δ > 0).
Also, recall that we can force x to be close to x̄ with a sufficiently small δ. So if δ
is |x̄|/2 or less, then |x − x̄| < δ, or equivalently x ∈ (x̄ − δ, x̄ + δ), clearly implies
|x| ∈ (|x̄|/2, 3|x̄|/2) and thus in particular |x| > |x̄|/2 (as well as x ∈ S). With that
consideration, we let

δ = min{ |x̄|/2 , ε|x̄|²/2 }.   (3.32)

Then |x − x̄| < δ implies |x̄|/2 < |x| and thus |x − x̄| < δ ≤ ε|x̄|²/2 < ε|x̄| |x| = ε x x̄,
and therefore ‖f(x) − f(x̄)‖ < ε according to (3.31), as intended. (These considerations
had the additional challenge of writing them neatly so as to simultaneously
cover the cases x̄ > 0 and x̄ < 0.)
In the preceding proof, the function f : x ↦ 1/x was shown to be continuous
at x̄ by choosing δ = ε|x̄|²/2 (which for small enough ε also satisfies δ ≤ |x̄|/2 as
required in (3.32)). We see here that we have to choose δ as a function not only
of ε but also of the point x̄ at which we want to prove continuity.
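To make the choice (3.32) concrete, the following Python sketch (an illustration;
the sample points and parameter ranges are arbitrary choices) computes δ from
(3.32) and confirms |1/x − 1/x̄| < ε for sampled x with |x − x̄| < δ:

    # Sketch: numerically check the delta from (3.32) for f(x) = 1/x.
    def delta_for(xbar, eps):
        return min(abs(xbar) / 2, eps * xbar**2 / 2)

    for xbar in (2.0, -0.3, 0.01):
        for eps in (1.0, 0.1, 0.001):
            d = delta_for(xbar, eps)
            # sample points x with |x - xbar| strictly less than d
            pts = [xbar + d * (j / 1000.0 - 0.5) * 1.999 for j in range(1001)]
            assert all(abs(1 / x - 1 / xbar) < eps for x in pts)
    print("delta choice (3.32) verified on sampled points")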
As an aside, the concept of uniform continuity means that δ can be chosen as a
function of ε only. That is, a function f : S → R is called uniformly continuous if

∀ε > 0 ∃δ > 0 ∀x̄ ∈ S ∀x ∈ S : ‖x − x̄‖ < δ ⇒ ‖f(x) − f(x̄)‖ < ε.   (3.33)

In contrast, the function f is just continuous if (3.27) holds prefixed with the
quantification ∀x̄ ∈ S, so that δ can be chosen depending on ε and x̄, as in (3.32). It can
be shown that a continuous function on a compact domain is uniformly continuous.
This is not the case for the function R \ {0} → R, x ↦ 1/x, whose domain is
not compact.
We consider as a second example for which we prove continuity the function
f : [0, ∞) → R, x ↦ √x. For x > 0, the function f has the derivative f′(x) =
(1/2) x^(−1/2). At x = 0, the function f has no derivative because that derivative would
have to be arbitrarily steep. The graph of f is a flipped parabola arc and f is
clearly continuous. We prove this using the definition of continuity (3.27), similar
to the equivalences in (3.31):

‖f(x) − f(x̄)‖ < ε
⇔ |√x − √x̄| < ε
⇔ (√x − √x̄)² < ε²   (3.34)
⇔ x + x̄ < ε² + 2√(x x̄).

We now consider (3.34) separately for the two cases x ≥ x̄ and x < x̄. If x ≥ x̄,
then we rewrite x + x̄ < ε² + 2√(x x̄) equivalently as

|x − x̄| = x − x̄ < ε² + 2(√(x x̄) − x̄),   (3.35)

where this inequality is implied by |x − x̄| < ε² because √(x x̄) − x̄ ≥ √(x̄ x̄) − x̄ = 0.
Similarly, if x < x̄, then we rewrite x + x̄ < ε² + 2√(x x̄) equivalently as

|x − x̄| = x̄ − x < ε² + 2(√(x x̄) − x),   (3.36)

where this inequality is again implied by |x − x̄| < ε² because √(x x̄) − x ≥ √(x x) −
x = 0. In both cases, if we choose δ = ε², then (3.27) holds, which proves that f
is continuous, and in fact uniformly continuous on S = [0, ∞) according to (3.33).
(This is an example of a function that is uniformly continuous even though its
domain S is not compact.) Note that we do not need to worry whether |x − x̄| < δ
implies x ∈ S, because that condition is also imposed in (3.27) and (3.33). For the
function x ↦ 1/x, we also had to make sure that x and x̄ have the same sign
because otherwise 1/x and 1/x̄ would be very far apart.
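The uniform choice δ = ε² can be checked in the same spirit; this Python sketch
(again only an illustration, with arbitrary sample ranges) draws random pairs
x, x̄ in the domain S = [0, ∞) with |x − x̄| < ε²:

    # Sketch: check that |x - xbar| < eps**2 implies |sqrt(x) - sqrt(xbar)| < eps.
    import math, random

    random.seed(0)
    for _ in range(100_000):
        eps = 10 ** random.uniform(-4, 1)
        xbar = random.uniform(0.0, 100.0)
        x = xbar + random.uniform(-1.0, 1.0) * eps**2 * 0.999
        if x < 0.0:
            continue                      # stay in the domain S = [0, oo)
        assert abs(math.sqrt(x) - math.sqrt(xbar)) < eps
    print("delta = eps**2 verified on random samples")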
We now prove the continuity of some functions on Rn . The following lemma
states that we can replace the Euclidean norm in (3.27) with the maximum norm,
which in many situations is more convenient to use.

Lemma 3.17 Let S ⊆ Rn, let f : S → Rm be a function defined on S, and let x̄ ∈ S.
Then f is continuous at x̄ if and only if

∀ε > 0 ∃δ > 0 ∀x ∈ S : ‖x − x̄‖max < δ ⇒ ‖f(x) − f(x̄)‖max < ε.   (3.37)

Proof. Suppose first f is continuous at x̄. Let ε > 0 and choose δ > 0 so that (3.27)
holds, and let δ′ = δ/√n. Then x ∈ Bmax(x̄, δ′) ⊆ B(x̄, δ) by (3.15), which implies
f(x) ∈ B(f(x̄), ε) by choice of δ and thus f(x) ∈ Bmax(f(x̄), ε) by (3.14), which
implies (3.37) (with δ′ instead of δ) as claimed.

Conversely, given (3.37) and ε > 0, we choose δ > 0 so that x ∈ Bmax(x̄, δ)
implies f(x) ∈ Bmax(f(x̄), ε/√m). Then x ∈ B(x̄, δ) implies x ∈ Bmax(x̄, δ) by
(3.14) and thus f(x) ∈ Bmax(f(x̄), ε/√m) ⊆ B(f(x̄), ε) by (3.15), which proves
(3.27).

We now prove that the multiplication of real numbers is a continuous operation,
that is, that the function f : R2 → R, f(x, y) = x · y, is continuous. This is
intuitive from the graph of f, which has no disruptions, but we prove it formally.
As before, let ε > 0, let (x̄, ȳ) ∈ R2, and we want to find how close (x, y) needs to
be to (x̄, ȳ) to prove that f(x, y) is close to f(x̄, ȳ), that is,

|xy − x̄ȳ| < ε.   (3.38)

Using the triangle inequality, we have

|xy − x̄ȳ| = |xy − x̄y + x̄y − x̄ȳ| ≤ |xy − x̄y| + |x̄y − x̄ȳ| = |x − x̄| |y| + |x̄| |y − ȳ|   (3.39)

so that we have proved (3.38) if we can prove

|x − x̄| |y| < ε/2,   |x̄| |y − ȳ| < ε/2.   (3.40)

The second inequality in (3.40) produces an easy constraint on |y − ȳ|: Let

δ_y = ε / (2(|x̄| + 1))   (3.41)

(the denominator is chosen to avoid complications if x̄ = 0), so that |y − ȳ| < δ_y
implies |x̄| |y − ȳ| < ε/2. The first inequality in (3.40) is (if y ≠ 0) equivalent to
|x − x̄| < ε/(2|y|), but y is not fixed, so we use that y is close to ȳ. Assume that δ_y ≤ 1
(if, as defined in (3.41), δ_y > 1, then we just set δ_y = 1). Then

|y − ȳ| < δ_y ⇒ |y| = |y − ȳ + ȳ| ≤ |y − ȳ| + |ȳ| < 1 + |ȳ|.   (3.42)

Define

δ_x = ε / (2(|ȳ| + 1)).   (3.43)

Then |x − x̄| < δ_x and |y − ȳ| < δ_y imply |x − x̄| |y| < ε/2, that is, the first
inequality in (3.40). Now let δ = min{δ_x, δ_y}. Then ‖(x, y) − (x̄, ȳ)‖max < δ implies

|x − x̄| < δ_x and |y − ȳ| < δ_y, which in turn imply (3.40) and therefore (3.38). With
Lemma 3.17, this shows the continuity of the function (x, y) ↦ xy.
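As with the earlier examples, the construction (3.41)–(3.43) lends itself to a
numerical sanity check (a sketch, with arbitrary parameter ranges; δ_y is capped
at 1 as in the text):

    # Sketch: check the delta construction (3.41)/(3.43) for f(x, y) = x*y.
    import random

    random.seed(1)
    for _ in range(100_000):
        eps = 10 ** random.uniform(-3, 1)
        xb, yb = random.uniform(-10, 10), random.uniform(-10, 10)
        dy = min(1.0, eps / (2 * (abs(xb) + 1)))   # (3.41), capped at 1
        dx = eps / (2 * (abs(yb) + 1))             # (3.43)
        d = min(dx, dy)
        x = xb + random.uniform(-1, 1) * d * 0.999
        y = yb + random.uniform(-1, 1) * d * 0.999
        assert abs(x * y - xb * yb) < eps
    print("delta = min(dx, dy) verified for multiplication")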
This is an important observation: the arithmetic operation of multiplication is
continuous, and it is also easy to prove that addition, that is, the function (x, y) ↦
x + y, is continuous. Similarly, the function x ↦ −x is continuous, which is
nearly trivial compared to proving that x ↦ 1/x (for x ≠ 0) is continuous.

The following lemma exploits that we have defined continuity for functions
that take values in Rm and not just in R1. It states that the composition of
continuous functions is continuous. Recall that f(S) is the image of f as defined
in (3.1).

Lemma 3.18 Let S ⊆ Rn, T ⊆ Rm, and let f : S → Rm and g : T → Rℓ be functions so
that f(S) ⊆ T. Then if f and g are continuous, their composition g ∘ f : S → Rℓ, given
by (g ∘ f)(x) = g(f(x)) for x ∈ S, is also continuous.

Proof. Assume that f and g are continuous. Let x̄ ∈ S and ε > 0. We want to
show that there is some δ > 0 so that ‖x − x̄‖ < δ and x ∈ S imply ‖g(f(x)) −
g(f(x̄))‖ < ε. Because g is continuous at f(x̄), there is some γ > 0 so that for any
y ∈ T with ‖y − f(x̄)‖ < γ we have ‖g(y) − g(f(x̄))‖ < ε. Now choose δ > 0
so that, by continuity of f at x̄, we have for any x ∈ S that ‖x − x̄‖ < δ implies
‖f(x) − f(x̄)‖ < γ. Then (for y = f(x)) this implies ‖g(f(x)) − g(f(x̄))‖ < ε as
required.

In principle, this lemma should allow us to prove that the function (x, y) ↦
x/y is continuous on R × (R \ {0}) (that is, for y ≠ 0), by considering it as the
composition of the functions (x, y) ↦ (x, 1/y) and (x, z) ↦ xz. We have just
proved that (x, z) ↦ xz is continuous, and earlier that y ↦ 1/y is continuous,
but what about (x, y) ↦ (x, 1/y), where the function y ↦ 1/y affects only the
second component of its input? Clearly, the function (x, y) ↦ (x, 1/y) should
also be continuous, but we need one further simple observation to prove this.
Whereas Lemma 3.18 considers the sequential composition of two functions,
the following lemma refers to a “parallel” composition of functions.

Lemma 3.19 Let S ⊆ Rn, and let f : S → Rm and g : S → Rℓ be two functions defined
on S, and consider the function h : S → Rm+ℓ defined by h(x) = (f(x), g(x)) for
x ∈ S. Then h is continuous if and only if f and g are continuous.

Proof. Let x̄ ∈ S and ε > 0. Suppose f and g are continuous at x̄. Then according
to Lemma 3.17 there is some δ > 0 so that ‖x − x̄‖max < δ and x ∈ S
imply ‖f(x) − f(x̄)‖max < ε and ‖g(x) − g(x̄)‖max < ε. But then also ‖h(x) −
h(x̄)‖max < ε because each of the m + ℓ components of h(x) − h(x̄) is either the
corresponding component of f(x) − f(x̄) or of g(x) − g(x̄).

Conversely, if h is continuous at x̄, there is some δ > 0 so that ‖x − x̄‖max < δ
and x ∈ S imply ‖h(x) − h(x̄)‖max < ε. Because ‖f(x) − f(x̄)‖max ≤ ‖h(x) −
h(x̄)‖max and ‖g(x) − g(x̄)‖max ≤ ‖h(x) − h(x̄)‖max by the definition of h and
of the maximum norm (3.4), this implies ‖f(x) − f(x̄)‖max < ε and ‖g(x) −
g(x̄)‖max < ε. This proves the claim.

Corollary 3.20 Let S ⊆ Rn and f : S → Rm, where f(x) = (f_1(x), . . . , f_m(x)) for
x ∈ S, that is, f_i : S → R is the ith component function of f, for 1 ≤ i ≤ m. Then f is
continuous if and only if f_i is continuous for 1 ≤ i ≤ m.

Proof. This follows by induction on m from Lemma 3.19.

Corollary 3.20 states that a function f that takes values in Rm is continuous if
and only if all its component functions f_1, . . . , f_m are continuous. These functions
take values in R. Note that in comparison, separate continuity in each variable is
not sufficient for continuity of a function defined on Rn, as example (3.29) shows.
So for functions Rn → Rm, where both n and m are positive integers, it is the
higher dimensionality of the domain Rn that requires additional considerations
for continuity, not of the range Rm. In order to check continuity, we can focus on
the case m = 1.
We now use the results of this section to show that the following function
h : R2 → R, not unlike g in (3.29), is continuous:

h(x, y) = xy / √(x² + y²)  if (x, y) ≠ (0, 0),   h(x, y) = 0  if (x, y) = (0, 0).   (3.44)

We want to show that h is continuous at every point (x̄, ȳ). If (x̄, ȳ) ≠ (0, 0), then
h is near that point a composition of continuous functions, as follows. The numerator
xy is a continuous function of (x, y), as we proved earlier. The function x ↦ x² is the
composition of the functions x ↦ (x, x) (continuous by Lemma 3.19 because x ↦ x is
continuous) and (x, y) ↦ xy. Therefore also (x, y) ↦ (x², y²) is continuous (again
by Lemma 3.19) and thus (because addition is continuous) (x, y) ↦ x² + y²,
then (x, y) ↦ √(x² + y²) because the square root function is continuous, then
(x, y) ↦ 1/√(x² + y²) because z ↦ 1/z is continuous, and here z ≠ 0 because
(x, y) ≠ (0, 0), and finally the function (x, y) ↦ h(x, y) itself is continuous as
the product of xy and 1/√(x² + y²). In short, for (x, y) ≠ (0, 0) we have h(x, y)
defined as a composition of continuous functions, and so h is continuous there. (Note
that all considerations of this section were developed carefully just to prove this
paragraph.)
The question is whether h(x, y) is continuous at (0, 0), so for given ε > 0 we have to
find δ > 0 so that

‖(x, y) − (0, 0)‖ < δ ⇒ |h(x, y) − h(0, 0)| < ε,   (3.45)

which is (trivially) the case if (x, y) = (0, 0), so assume (x, y) ≠ (0, 0). Then
|x| = √(x²) ≤ √(x² + y²) and |y| = √(y²) ≤ √(x² + y²), and therefore

|h(x, y) − h(0, 0)| = |h(x, y)| = |x| |y| / √(x² + y²) ≤ √(x² + y²) = ‖(x, y)‖.   (3.46)

So if we choose δ = ε, then ‖(x, y)‖ < δ implies |h(x, y)| ≤ ‖(x, y)‖ < ε and thus
(3.45) as required.
In this section, we have seen how continuity can be proved for functions that
are defined on Rn . The maximum norm is particularly useful for these proofs.

3.7 The Theorem of Weierstrass


The following theorem of Weierstrass is the central theorem of this chapter.

Theorem 3.21 (Weierstrass) A continuous real-valued function on a nonempty compact
domain assumes its maximum and minimum.

Recall the notions used in Theorem 3.21, which is about a function f : X → R.
The function f is assumed to be continuous and the domain X to be compact (and
nonempty). The theorem says that under these conditions, there are x∗ and x∗∗
in X so that f(x∗) is the maximum and f(x∗∗) the minimum of f(X) (see
Section 1.4).
The proof of the Theorem of Weierstrass is based on the following two lem-
mas. The first lemma refers to a subset of R.

Lemma 3.22 Let A be a nonempty compact subset of R. Then A has a maximum and a
minimum.

Proof. We only show that A has a maximum. By Theorem 3.13, A is closed and
bounded, so sup A exists because A is nonempty and bounded. We show that sup A
is a limit point of A. Otherwise, B(sup A, ε) ∩ A = ∅ for some ε > 0 by Lemma 3.7.
But then there is no t ∈ A with t > sup A − ε, so sup A − ε is an upper bound of A,
but sup A is the least upper bound, a contradiction. So sup A is a limit point of A
and therefore belongs to A because A is closed, and hence sup A is also the
maximum of A.

The second lemma says that the image of a compact set under a continuous
function is compact.

Lemma 3.23 Let ∅ ≠ X ⊆ Rn and f : X → R. If f is continuous and X is compact,
then f(X) is compact.

Proof. Let {y_k}_{k∈N} be any sequence in f(X). We show that there exists a
subsequence {y_{k_n}}_{n∈N} and a y ∈ f(X) such that lim_{n→∞} y_{k_n} = y, which
will show that f(X) is compact. For that purpose, for each k choose x^(k) ∈ X with
f(x^(k)) = y_k, which exists by the definition of f(X). Then {x^(k)} is a sequence in X,
which has a convergent subsequence {x^(k_n)}_{n∈N} with limit x in X because X is
compact. Then, because f is continuous,

f(x) = f( lim_{n→∞} x^(k_n) ) = lim_{n→∞} f(x^(k_n)) = lim_{n→∞} y_{k_n}

where f(x) ∈ f(X) because x ∈ X. This proves the claim.

The Theorem of Weierstrass is a simple corollary to these two lemmas.

Proof of Theorem 3.21. Consider a continuous function f : X → R on a nonempty
compact domain X. By Lemma 3.23, f(X) is a compact subset of R, which by
Lemma 3.22 has a maximum, which is the maximum value f(x∗) of f on X (for
some x∗ in X), and a minimum, which is the minimum value f(x∗∗) of f on X (for
some x∗∗ in X).

3.8 Using the Theorem of Weierstrass


The Theorem 3.21 of Weierstrass is about a continuous function, say f : X → R,
on a compact domain X, where typically X ⊆ Rn . In Section 3.6 we have given
a number of examples that show how to prove that a function is continuous. In
order to prove that a subset X of Rn is compact, we normally use the characteri-
zation in Theorem 3.13 that compact means “closed and bounded”. Boundedness
is typically most easily proved using the maximum norm, that is, boundedness
in each component, as in (3.25).
For closedness, the following observation is most helpful: sets that are pre-
images of closed sets under continuous functions are closed. We state and prove
this via the equivalent statement that pre-images of open sets under continuous
functions are open.

Lemma 3.24 Let f : Rn → Rk be a continuous function, let T ⊆ Rk, and let S be the
pre-image of T under f, that is,

S = f^(−1)(T) = {x ∈ Rn | f(x) ∈ T}.   (3.47)

Then S is open if T is open, and S is closed if T is closed.

Proof. We first prove that if T is open, then S is also open. Let x ∈ S, so that
f(x) ∈ T by (3.47). Because T is open, there is some ε > 0 so that B(f(x), ε) ⊆ T.
Because f is continuous, there is some δ > 0 so that for all y ∈ B(x, δ) we have
f(y) ∈ B(f(x), ε) and therefore f(y) ∈ T, that is, y ∈ S. This shows B(x, δ) ⊆ S,
which proves that S is open, as required.

In order to obtain the same property for closed sets, note that a set is closed
if and only if its set-theoretic complement is open, and that the pre-image of the
complement is the complement of the pre-image. Namely, let T′ ⊆ Rk and suppose
that T′ is closed, that is, T given by T = Rk \ T′ is open. Let S = f^(−1)(T),
where S is open as just shown, and let S′ = Rn \ S, which is therefore closed. But
then S′ = {x ∈ Rn | f(x) ∉ T} = f^(−1)(T′), so this pre-image of the closed set T′ is
closed, as claimed.

Note that Lemma 3.24 concerns the pre-image under a continuous function. In
contrast, Lemma 3.23 concerns the image of a continuous function f. The statement
in Lemma 3.24 is not true for images, that is, if S is closed, then f(S) is
not necessarily closed; for a counterexample, S has to be unbounded, since otherwise
S would be compact. An example is the function f : R → R given by
f(x) = 1/(1 + x²) and S = R, where f(S) = (0, 1]. That is, f(S) is neither closed
nor open even though S is both closed and open. A simpler example of a continuous
function f so that f(S) is not open for open sets S is a constant function f,
where f(S) is a singleton if S is nonempty.
In a more abstract setting, Lemma 3.24 can also be used to define that a func-
tion f : X → Y is continuous. In that case, X and Y are so-called “topological
spaces”. A topological space is a set X together with a set (called a “topology”) of
subsets of X which are called open sets, which have to fulfill the following condi-
tions: The empty set ∅ and the entire set X are open; the intersection of any two
open sets is open; and arbitrary unions of open sets are open. These conditions
hold for the open sets as defined in Definition 3.3 (which define the standard
topology on Rn ) according to Theorem 3.9. Given that X and Y are topological
spaces, a function f : X → Y is called continuous if and only if the pre-image of
any open set (in Y) is open (in X). This characterisation of continuous functions
is important enough to state it as a theorem.

Theorem 3.25 Let f be a function Rn → Rk. Then f is continuous if and only if the
pre-image f^(−1)(T), as defined in (3.47), of any open subset T of Rk is an open subset
of Rn.

Proof. If f is continuous according to Definition 3.15, then any pre-image of an
open set under f is open by Lemma 3.24.

Conversely, suppose that for any open set T its pre-image f^(−1)(T) is open. We
want to show that f is continuous at x̄ for any x̄ ∈ Rn. Let ε > 0 and consider
the ε-ball B(f(x̄), ε) around f(x̄). The key observation is that this ε-ball is itself
an open set. Namely, let y ∈ B(f(x̄), ε) and let ε′ = ε − ‖y − f(x̄)‖ > 0. Consider
any y′ ∈ B(y, ε′), so that ‖y′ − y‖ < ε′ = ε − ‖y − f(x̄)‖. By the triangle inequality
(3.10), ‖y′ − f(x̄)‖ ≤ ‖y′ − y‖ + ‖y − f(x̄)‖ < ε, that is, y′ ∈ B(f(x̄), ε), which
shows B(y, ε′) ⊆ B(f(x̄), ε). That is, B(f(x̄), ε) is open as claimed.

Let S = f^(−1)(B(f(x̄), ε)) = {x ∈ Rn | f(x) ∈ B(f(x̄), ε)} = {x ∈ Rn |
‖f(x) − f(x̄)‖ < ε}. By assumption, S is open, and clearly x̄ ∈ S. So for some
δ > 0 there is some δ-ball around x̄ that is contained in S, that is, B(x̄, δ) ⊆ S.
Then f(B(x̄, δ)) ⊆ f(S) ⊆ B(f(x̄), ε) (understand this carefully!). But this means
that ‖x − x̄‖ < δ implies ‖f(x) − f(x̄)‖ < ε as in (3.27), so f is continuous.

For our purposes, we only need the part of Theorem 3.25 that is stated in
Lemma 3.24. It implies the following observation, which is most useful to identify
certain subsets of Rn as closed or open.

Lemma 3.26 Let f : Rn → R be continuous and let a ∈ R. Then the sets {x ∈ Rn |
f(x) ≤ a} and {x ∈ Rn | f(x) ≥ a} are closed, the sets {x ∈ Rn | f(x) > a} and
{x ∈ Rn | f(x) < a} are open, and the set {x ∈ Rn | f(x) = a} is also closed.

Proof. We have {x ∈ Rn | f(x) ≤ a} = f^(−1)((−∞, a]) and {x ∈ Rn | f(x) ≥ a} =
f^(−1)([a, ∞)), so these are closed sets by Lemma 3.24 because the intervals (−∞, a]
and [a, ∞) are closed. The sets {x ∈ Rn | f(x) > a} and {x ∈ Rn | f(x) < a}
are the complements of these closed sets and therefore open. The set {x ∈ Rn |
f(x) = a} is the intersection of the two closed sets above and therefore closed, and
also equal to the pre-image f^(−1)({a}) of the closed set {a}.

We now consider, as an example, the following set X,

X = {(x, y) ∈ R2 | x ≥ 0, y ≥ 0, xy ≥ 1}.   (3.48)

Here X is the intersection of the closed sets {(x, y) ∈ R2 | x ≥ 0}, {(x, y) ∈
R2 | y ≥ 0}, and {(x, y) ∈ R2 | xy ≥ 1} given by the three conditions x ≥ 0,
y ≥ 0, and xy ≥ 1. These are all closed sets by Lemma 3.26 because the functions
(x, y) ↦ x, (x, y) ↦ y, and (x, y) ↦ xy are all continuous (see Section 3.6), and
the intersection of any closed sets is closed. In words, X is the set of pairs (x, y)
in the positive quadrant of R2 (defined by x ≥ 0 and y ≥ 0) bounded by the
hyperbola given by xy = 1. While closed, X is clearly not bounded because, for
example, ‖(x, x)‖ can become arbitrarily large.
We now consider the problem of maximizing or minimizing the function f :
X → R given by

f(x, y) = 1/(x + y).   (3.49)

The function f is well-defined because x + y cannot be zero on X. Clearly, minimizing
f(x, y) is equivalent to maximizing x + y, and that has no solution because
x + y can become arbitrarily large on X.
However, the problem of maximizing f ( x, y) (which is equivalent to mini-
mizing x + y) has a solution on X. With the aim to use the Theorem of Weierstrass,
we let X = X1 ∪ X2 with

X1 = {( x, y) ∈ X | x + y ≤ 3}, X2 = {( x, y) ∈ X | x + y ≥ 3}, (3.50)

as shown in Figure 3.2. By definition, x + y is bounded from above by 3 on the
set X1, and the set X1 is closed because it is the intersection of X with the set
{(x, y) ∈ R2 | x + y ≤ 3}, which is closed because the function (x, y) ↦ x + y is
continuous. Moreover, X1 is bounded, because |x| ≤ 3 and |y| ≤ 3 for (x, y) ∈ X1.
So X1 is compact and therefore f(x, y) has a maximum on X1, by the Theorem of

Figure 3.2 Decomposition of the set X in (3.48) into X = X1 ∪ X2 as in (3.50).

Weierstrass. We also need that X1 is not empty: it contains for example the point
(2, 1). Now, the maximum of f on X1 is also the maximum of f on X. Namely, for
(x, y) ∈ X2 we have x + y ≥ 3 and therefore f(x, y) = 1/(x + y) ≤ 1/3 = f(2, 1), where
(2, 1) ∈ X1, so Theorem 1.10 applies.
In this example of maximizing the function f in (3.49) on the domain X in
(3.48), X is closed but not compact. However, we have applied Theorem 1.10
with X = X1 ∪ X2 as in (3.50) in order to obtain a compact domain X1 where we
know the maximization of f has a solution, which then applies to all of X. This is
an important example which we will consider further in some exercises.

The Theorem of Weierstrass only gives us the existence of a maximum of
f on X1 (and thereby on X), but it does not show how to find it. It seems rather
clear that the maximum of f(x, y) on X is obtained for (x, y) = (1, 1), but proving
(and finding) this maximum is the topic of the next chapter.
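As a computational illustration of this claim (a sketch, not part of the notes; the
grid resolution is an arbitrary choice), a simple grid search over the compact
piece X1 places the maximum of f near (1, 1):

    # Sketch: grid search for the maximum of f(x,y) = 1/(x+y) on
    # X1 = {(x,y) : x >= 0, y >= 0, xy >= 1, x + y <= 3}.
    best, arg = -1.0, None
    n = 600
    for i in range(n + 1):
        x = 3.0 * i / n
        for j in range(n + 1):
            y = 3.0 * j / n
            if x * y >= 1.0 and x + y <= 3.0:   # stay inside X1
                v = 1.0 / (x + y)
                if v > best:
                    best, arg = v, (x, y)
    print(arg, best)   # approximately (1.0, 1.0) and 0.5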
Chapter 4

First-order conditions

This chapter studies optimization problems for differentiable functions. It
discusses how the concept of differentiability can be used to find necessary
conditions for a (local) maximum of a function. If the function is subject to equality
constraints, this leads to the Theorem of Lagrange, which asserts that at a local
maximum, the derivative of the function is a linear combination of the derivatives
of the constraint functions. If the function is subject to inequality constraints,
the corresponding KKT Theorem (due to Karush, Kuhn, and Tucker) states, similarly,
that the derivative of the function is a nonnegative linear combination of the
derivatives of those constraint functions where the inequalities hold tight, that is,
hold as equalities. Both the Lagrange and the KKT theorem require an additional
constraint qualification that the respective derivatives of the constraint functions
have to be linearly independent.

First-order conditions refer to derivatives. Second-order conditions refer to
second derivatives. They are much less important and are omitted from our
introduction to these methods.

4.1 Introductory example

We consider an introductory example similar to the example considered at the
end of the last chapter. We want to

minimize x + 4y subject to x ≥ 0, y ≥ 0, xy = 1.   (4.1)

The objective function to be minimized is here the linear function f defined by

f ( x, y) = x + 4y (4.2)

on the domain X = {(x, y) ∈ R2 | x ≥ 0, y ≥ 0, xy = 1}, which is a closed
set by Lemma 3.26. Note that the domain X is here just the hyperbola (in the


positive quadrant of R2) defined by the equality xy = 1. An analogue of
Theorem 1.10 shows that this minimum exists, by restricting f to a suitable compact
subset X1 of X where f has a minimum by the Theorem of Weierstrass. For that
purpose, consider any point in X, for example (1, 1), and its function value of f,
here f(1, 1) = 5. We then define X1 to be the subset of X where f(x, y) is at
most 5. That is, similar to (3.50), we write X = X1 ∪ X2 with

X1 = {(x, y) ∈ X | x + 4y ≤ 5},   X2 = {(x, y) ∈ X | x + 4y ≥ 5},   (4.3)

and because X1 is nonempty we can restrict the minimization of f to the compact
set X1, which then gives the minimum of f on X.

Figure 4.1 Plot of the function f(x, y) = x + 4y for 0 ≤ x ≤ 5 and 0 ≤ y ≤ 3,
with the blue curve showing the restriction xy = 1.

Figure 4.1 shows a perspective drawing of a three-dimensional plot of f(x, y)
which shows the linearity of this function. The function value (drawn vertically)
increases by 1/4 when x is increased by 1/4, and by 1 when y is increased by 1/4;
each small rectangle represents such a step size of 1/4 for both x and y. In addition,
the blue curve in the picture shows the restriction to those pairs (x, y) where
xy = 1.

While a plot as in Figure 4.1 is useful to get an understanding of the behaviour
of f, it does not show the minimum of f exactly, and it is also hard to draw without
computer tools.

Figure 4.2 Contour lines and gradient (1, 4) of the function f(x, y) = x + 4y for
x ≥ 0, y ≥ 0.

A more instructive picture that can be drawn in two dimensions uses contour
lines of the function f in (4.2), shown as the dashed lines in Figure 4.2. Such a
contour line for f(x, y) is the set of points (x, y) where f(x, y) = c for some constant
c, that is, where f(x, y) takes a fixed value. One could also say that a contour
line is the pre-image f^(−1)({c}) under f of one of its possible values c. Clearly, for
different values of c any two such contour lines are disjoint. Here, because f
is linear, these contour lines are parallel lines. For (x, y) ∈ R2, such a contour
line corresponds to the equation x + 4y = c or equivalently y = c/4 − x/4 (we also
only consider nonnegative values for x and y). Contour lines are known from
topographical maps of, say, mountain regions, where each line corresponds to a
particular height above sea level; the two-dimensional picture of these lines conveys
information about the three-dimensional terrain. Here, they indicate how
the function should be minimized, by choosing the smallest function value c that
is possible.
Figure 4.2 also shows the gradient of the function f. We will define this gradient,
called Df, later formally. It is given by the derivatives of f with respect to x
and to y, that is, the pair (d/dx f(x, y), d/dy f(x, y)), which is here (1, 4) for every (x, y)
because f is the linear function (4.2). This vector (1, 4) is drawn in Figure 4.2.
The gradient (1, 4) shows in which direction the function increases (which is
discussed in further detail in the introductory Section 5.1 of the next chapter), and
can be interpreted as the direction of “steepest ascent”. Correspondingly, the
opposite direction of the gradient is the direction of “steepest descent”. In addition,
the gradient is orthogonal to the contour line, because along the contour line the
function neither increases nor decreases.

Figure 4.3 Minimum of the function f(x, y) = x + 4y subject to the constraints
x ≥ 0, y ≥ 0, and g(x, y) = xy − 1 = 0, at the point (x, y) = (2, 1/2)
where the gradient Dg(x, y) = (y, x) = (1/2, 2) of the constraint
function g is co-linear with the gradient Df(x, y) = (1, 4) of f.

Moving along any direction which is not orthogonal to the gradient means either
moving partly in the same direction as the gradient (increasing the function
value), or away from it (decreasing the function value). Consider now Figure 4.3
where we have drawn the hyperbola which represents the constraint xy = 1
in (4.1). A point on this hyperbola is, for example, (1, 1). At that point, the contour
lines show that the function value can still be lowered by moving towards
(1 + ε, 1/(1 + ε)). But at the point (x, y) = (2, 1/2) the contour line just touches the
hyperbola, and so the function value of f(x, y) cannot be reduced further.

The way to compute this point is the method of so-called Lagrange multipliers,
here a single multiplier that corresponds to the single constraint xy = 1. We
write this constraint in the form g(x, y) = 0 where g(x, y) = xy − 1. This constraint
function g(x, y) has itself a gradient, which depends on (x, y) and is given
by Dg(x, y) = (d/dx g(x, y), d/dy g(x, y)) = (y, x). The Lagrange multiplier method
says that only when there is a scalar λ so that Df(x, y) = λ Dg(x, y), that is, when
the gradients of the objective function f and of the constraint function g are co-linear,
is no further improvement of f(x, y) possible. The reason is that only
in this case do the contour lines of f and of g, which are orthogonal to the gradients
Df and Dg, touch as required, so moving along the contour line of g (that is,
maintaining the constraint) also neither increases nor decreases the value of f.

Here Df(x, y) = λ Dg(x, y) is the equation (1, 4) = λ · (y, x) = (λy, λx), and
thus λ = 1/y = 4/x and thus y = x/4, which together with the constraint xy = 1
means x²/4 = 1 or x = 2 (because x = −2 violates x ≥ 0) and y = 1/2, which is
indeed the optimum (x, y) = (2, 1/2).
Of course, this simple example (4.1) has a solution that can be found directly
using one-variable calculus. Namely, the constraint xy = 1 translates to y = 1/x,
so that we can consider the problem of minimizing x + 4/x (for x ≥ 0, in fact
x > 0 because x = 0 is excluded by the condition xy = 1). We differentiate and
set the derivative to zero. That is, d/dx (x + 4/x) = 1 − 4/x² = 0 gives the same
solution x = 2, y = 1/2.
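The three equations Df = λ Dg and g = 0 can also be solved symbolically. The
following sketch (an illustration using the sympy library, which is an assumption
of this example and not something used in the notes) reproduces the solution:

    # Sketch: solve the Lagrange conditions for (4.1) with sympy.
    import sympy as sp

    x, y, lam = sp.symbols("x y lam", positive=True)
    f = x + 4 * y
    g = x * y - 1
    # Df = lam * Dg, together with the constraint g = 0:
    equations = [sp.diff(f, x) - lam * sp.diff(g, x),
                 sp.diff(f, y) - lam * sp.diff(g, y),
                 g]
    print(sp.solve(equations, [x, y, lam]))   # [(2, 1/2, 2)]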
The point of this introductory section was to illustrate the geometric under-
standing of contour lines and co-linear gradients of objective and constraint func-
tions in an optimum.

4.2 Differentiability in Rn
The idea of differentiability is to approximate a function locally with a linear,
more precisely affine, function. If f : Rn → R, then f is called affine if there are
reals c0 , c1 , . . . , cn so that for all x = ( x1 , . . . , xn ) ∈ Rn we have

f ( x1 , . . . , x n ) = c0 + c1 x1 + · · · + c n x n . (4.4)

If thereby c0 = 0, then f is called linear (which is equivalent to f (0, . . . , 0) = 0).

Figure 4.4 Approximating a function f(x) for x near x̄ with an affine function
with slope or “gradient” G that represents the tangent line at f(x̄).

In order to see what it means that a function f is differentiable, we consider
first the familiar case that f(x) depends on a single variable x in R1. Figure 4.4
shows the graph of a function f which is “smooth” (differentiable), which we
understand to mean that near a point x̄, the function f(x) (by “zooming in” to
the function graph near x̄) becomes less and less distinguishable from an affine
function which has a “gradient” G that defines the slope of the “tangent” at f(x̄),
as in

f(x) ≈ f(x̄) + G · (x − x̄).   (4.5)

But this is an imprecise statement, which should clearly mean more than that the
left and right side in (4.5) approach the same value as x tends to x̄, because this
would be true for any G as long as f is continuous. What we mean is that G should
represent the “rate of change” of f(x) near f(x̄). (Then G will be the gradient or
derivative of f at x̄.)

Figure 4.5 Two points (x, f(x)) and (x̄, f(x̄)) on the graph of f define a “secant”
line, given by an affine function t ↦ c0 + c1 t. If c1 has a limit as
x → x̄, then this limit defines G in Figure 4.4.

In order to define G as a suitable limit, we consider, as in Figure 4.5, a point
x near (but distinct from) x̄ and the unique affine function t ↦ c0 + c1 t that
coincides with f for t = x and t = x̄, that is,

f(x) = c0 + c1 x,   f(x̄) = c0 + c1 x̄.   (4.6)

Then c1 is defined by f(x) − f(x̄) = c1 (x − x̄), that is,

c1 = ( f(x) − f(x̄) ) / ( x − x̄ )   (4.7)

so that (by definition, with c1 defined as a function of x)

f(x) = f(x̄) + c1 (x − x̄).   (4.8)

In the representation (4.8), we are interested in c1 as the “rate of change” of f;
we do not care about the constant c0 in (4.6), which would be given by c0 =
f(x̄) − c1 x̄.

If c1 as defined in (4.7) has a limit as x → x̄, then this limit will take the role
of G in (4.5) and will be called the gradient or derivative of f at x̄ and denoted
by Df(x̄). Geometrically, the “secant” line defined by two points (x, f(x)) and
(x̄, f(x̄)) on the curve in Figure 4.5 becomes the tangent line when x → x̄.
For f : Rn → R, the aim will be to replicate (4.8) where the single coefficient
c1 will be replaced by an n-vector of coefficients (c1 , . . . , cn ). One problem is that
this vector can no longer be represented as a quotient as in (4.7). The following
is the central definition of this section, which we explain in detail afterwards.
Interior points are defined in Definition 3.4.

Definition 4.1 Let X ⊆ Rn, f : X → R, and let x̄ be an interior point of X. Then
f is called differentiable at x̄ if for some G ∈ R^(1×n) and all ways to approach x̄
with x we have

lim_{x→x̄} ( f(x) − f(x̄) − G · (x − x̄) ) / ‖x − x̄‖ = 0.   (4.9)

Any such G is called a gradient or derivative of f at x̄ and is denoted by Df(x̄).
The function f is called differentiable on X (or just differentiable if X = Rn) if f
is differentiable at all interior points x̄ of X.

In order to understand (4.9) better, we rewrite it as

( f(x) − f(x̄) ) / ‖x − x̄‖  ≈  G · (x − x̄) · 1/‖x − x̄‖   (4.10)

where “≈” is here meant to say “holds as equality in the limit as x → x̄”. The
right-hand side in (4.10) is a product of three terms: a row vector G in R^(1×n), a
column vector x − x̄ in Rn = R^(n×1), and a scalar 1/‖x − x̄‖ in R. Written in
this way, each such product is the familiar matrix product. Recall that the matrix
product of two matrices A and B is defined if and only if A is of dimension m × k
and B is of dimension k × n; the result is an m × n matrix that in row i and column
j has entry ∑_{s=1}^{k} a_{is} b_{sj}, where a_{is} and b_{sj} are the respective entries of A and B. We
normally consider elements of Rn as column vectors, that is, as n × 1 matrices.
A scalar is a 1 × 1 matrix, and so we multiply a vector z in Rn with a scalar α
as the matrix product zα. In contrast, a row vector such as G in R^(1×n) has to be
multiplied with a scalar α from the left as in αG. Otherwise these products would
not be defined as matrix products. It is very useful to keep all such products
as matrix products, not least in order to check that the product is written down
correctly. If a and b are two (column) vectors in Rn, then the matrix product a⊤b
denotes their scalar product ∑_{i=1}^{n} a_i b_i in R, where a⊤ is the row vector in R^(1×n) that
is obtained from a by matrix transposition. In (4.9) and (4.10), the vector G is
directly given as a row vector, so such a transposition is not needed.
Both sides of (4.10) are real numbers, because f is real-valued and the term
f(x) − f(x̄) is divided by the Euclidean norm ‖x − x̄‖ of the vector x − x̄. Because
we have ‖zα‖ = ‖z‖ |α| for any z ∈ Rn and α ∈ R, the vector y = z · 1/‖z‖ (for
z ≠ 0) has unit length, that is, ‖y‖ = 1. Therefore, on the right-hand side of (4.10)
the vector (x − x̄) · 1/‖x − x̄‖ has unit length. This vector is a scalar multiple of
x − x̄ and can thus be interpreted as the direction of x − x̄ (a vector normalized
to have length one). If (4.10) holds, then the scalar product of G with this vector,
given by G · (x − x̄) · 1/‖x − x̄‖, shows the “growth rate” of f(x) − f(x̄) in the
direction x − x̄.
Consider now the case n = 1, where ‖z‖ = |z| for any z ∈ R1, so that z · 1/‖z‖
is either 1 (if z > 0) or −1 (if z < 0). Then the right-hand side in (4.10) is G if x > x̄
and −G if x < x̄. Similarly, the left-hand side of (4.10) is (f(x) − f(x̄))/(x − x̄) if
x > x̄ and −(f(x) − f(x̄))/(x − x̄) if x < x̄. Hence for x ≠ x̄ these two conditions
state

lim_{x→x̄} ( f(x) − f(x̄) ) / ( x − x̄ ) = G,   (4.11)

which is exactly the familiar notion of differentiability of a function defined on
R1. The case distinction x > x̄ and x < x̄ that we just made emphasizes that the
limit of the quotient in (4.11) has to exist for any possible approach of x to x̄, which
is also stated in Definition 4.1. For example, consider the function f(x) = |x|,
which is well known not to be differentiable at 0. Namely, if we restrict x to be
positive (that is, x > x̄ = 0), then (f(x) − f(x̄))/(x − x̄) = |x|/x = 1, whereas for
x < x̄ = 0 we have (f(x) − f(x̄))/(x − x̄) = |x|/x = −1. Therefore, there is no
common limit of these quotients as x → x̄, for example if we approach x̄ with the
sequence {x_k} defined by x_k = (−1/2)^k, which converges to 0 but with alternating
signs of x_k. In Definition 4.1, the limit has to exist for any possible approach of x
to x̄ (for example, by letting x be the points of a sequence that converges to x̄).
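A tiny numerical sketch (an illustration, not from the notes) makes the failing
limit visible: along the alternating sequence x_k = (−1/2)^k, the difference quotients
of f(x) = |x| at x̄ = 0 keep flipping between +1 and −1:

    # Sketch: difference quotients of |x| at xbar = 0 along x_k = (-1/2)**k.
    for k in range(1, 9):
        xk = (-0.5) ** k
        print(k, abs(xk) / xk)    # alternates between -1.0 and +1.0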
Next, we show that the gradient G of a differentiable function is unique and
nicely described by the row vector of “partial derivatives” of the function, and
then describe again how the derivative represents a local linear approximation of
the function in “Taylor’s theorem”, in (4.15) below.

4.3 Partial derivatives and C1 functions


Next we show that if a function is differentiable at x̄ then its gradient Df(x̄) is
unique. We will show that the components of Df(x̄) are the n partial derivatives
of f, defined as follows.

Definition 4.2 Let X ⊆ Rn, f : X → R, and let x̄ be an interior point of X. Let
e_j ∈ Rn for 1 ≤ j ≤ n be the j-th unit vector, i.e., e_j is 1 in the j-th coordinate and
0 everywhere else. Then the j-th partial derivative of f at x̄ is the real number
∂f(x̄)/∂x_j (or (∂/∂x_j) f(x̄)) such that

lim_{t→0} ( f(x̄ + e_j t) − f(x̄) ) / t = ∂f(x̄)/∂x_j.   (4.12)

We have earlier (in our introductory Section 4.1) used the notation (d/dx_j) f(x)
rather than (∂/∂x_j) f(x), which means the same, namely differentiating f(x_1, . . . , x_n)
as a function of x_j only, while keeping the values of all other variables x_1, . . . , x_{j−1},
x_{j+1}, . . . , x_n fixed. For example, if f(x_1, x_2) = x_1 x_2 + x_1, then (∂/∂x_1) f(x_1, x_2) = x_2 + 1
and (∂/∂x_2) f(x_1, x_2) = x_1.

Next we show that the gradient of a differentiable function is the vector of
partial derivatives.
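Partial derivatives as in Definition 4.2 are easy to approximate numerically. The
following Python sketch (an illustration; the central difference quotient and step
size h are ad hoc choices) checks the example ∂f/∂x_1 = x_2 + 1 and ∂f/∂x_2 = x_1
above:

    # Sketch: central difference quotients for f(x1, x2) = x1*x2 + x1.
    def f(x1, x2):
        return x1 * x2 + x1

    def partial(f, x, j, h=1e-6):
        """Approximate the j-th partial derivative of f at the point x."""
        xp, xm = list(x), list(x)
        xp[j] += h
        xm[j] -= h
        return (f(*xp) - f(*xm)) / (2 * h)

    x = (2.0, 3.0)
    print(partial(f, x, 0))   # approximately x2 + 1 = 4
    print(partial(f, x, 1))   # approximately x1 = 2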

Proposition 4.3 Let X ⊆ Rn, f : X → R, and let x̄ be an interior point of X. If f is
differentiable at x̄, then

Df(x̄) = ( ∂f(x̄)/∂x_1 , ∂f(x̄)/∂x_2 , . . . , ∂f(x̄)/∂x_n ).   (4.13)

Proof. Consider some j for 1 ≤ j ≤ n and consider the points x = x̄ + e_j t for t ∈ R,
where x → x̄ if t → 0. Then x − x̄ = e_j t and ‖x − x̄‖ = ‖e_j t‖ = |t|. Hence, by
(4.9),

lim_{t→0} ( ( f(x̄ + e_j t) − f(x̄) ) / |t| − G · e_j · t / |t| ) = 0

or equivalently, as in the consideration for (4.11) for the two cases t > 0 and t < 0,

∂f(x̄)/∂x_j = lim_{t→0} ( f(x̄ + e_j t) − f(x̄) ) / t = G · e_j,

so G · e_j, which is the jth component of G, is ∂f(x̄)/∂x_j as claimed.

Hence, if f has a gradient Df, it is uniquely given by the vector of partial
derivatives. However, the existence of these partial derivatives is not enough to
guarantee that the function is differentiable. In fact, the function may not even
be continuous. As an example, consider the function g : R2 → R defined in
(3.29). Since g(x, 0) = g(0, y) = 0 for all x, y, we clearly have (∂/∂x) g(x, 0) = 0 and
(∂/∂y) g(0, y) = 0, and for y ≠ 0 we have

(∂/∂x) g(x, y) = ( y(x² + y²) − 2x(xy) ) / (x² + y²)² = ( y³ − yx² ) / (x² + y²)²   (4.14)

and for x ≠ 0 we have (∂/∂y) g(x, y) = ( x³ − xy² ) / (x² + y²)² because g(x, y) is
symmetric in x and y. So the partial derivatives of g exist everywhere. However,
g(x, y) is not even continuous, let alone differentiable.

It can be shown that the continuous function h(x, y) defined in (3.44) is not
differentiable at (0, 0).

Nevertheless, the partial derivatives of a function are very useful if they are
continuous, which is often the case.

Definition 4.4 Let X be an open subset of Rn and f : X → R. Then f is called
continuously differentiable on X, and we say f is C1(X), if f is differentiable on X
and its gradient Df(x) is a continuous function of x ∈ X.

Theorem 4.5 Let X be an open subset of Rn and f : X → R. Then f is C1(X) if and
only if all partial derivatives (∂/∂x_j) f(x) exist and are continuous functions of x on X,
for 1 ≤ j ≤ n.

Theorem 4.5 is not a trivial observation, because the existence of partial derivatives
does not imply differentiability. However, if the partial derivatives exist
and are continuous, then the function is differentiable and continuously differentiable.
As with many further results in this chapter, we will not prove Theorem 4.5
for reasons of space. A proof is given in W. Rudin (1976), Principles of Mathematical
Analysis (3rd edition, McGraw-Hill, New York), Theorem 9.21, page 219. What is
important is that the partial derivatives have to be jointly continuous in all variables,
not just separately continuous. For example, the function g(x, y) defined
in (3.29) is separately continuous in x and y, respectively (when the other variable
is fixed), and so is in fact each partial derivative, such as (∂/∂x) g(x, y) =
(y³ − yx²)/(x² + y²)² in (4.14). However, this partial derivative is not jointly
continuous at (0, 0), because for y = 2x, say, we have (for x ≠ 0)
(∂/∂x) g(x, y) = (8x³ − 2x³)/(x² + 4x²)² = 6x³/(25x⁴) = 6/(25x), which does not tend
to (∂/∂x) g(0, 0) = 0 when x → 0.
By definition, a C1 function is differentiable. Many functions (in particular
if they are not defined by case distinctions) are easily seen to have continuous
partial derivatives, so this is often assumed for simplicity rather than the more
general differentiability.

4.4 Taylor’s theorem

The following theorem expresses, once more, that differentiability means local
approximation by a linear function. It will also be used to prove (sometimes only
with a heuristic argument) first-order conditions for optimality that we consider
later.

Theorem 4.6 (Taylor) Let X ⊆ Rn, f : X → R, and let x̄ be an interior point of X.
Then f is differentiable at x̄ with gradient G in R^(1×n) if and only if there exists a function
R : X → R (called the “remainder term”) so that

f(x) = f(x̄) + G · (x − x̄) + R(x) · ‖x − x̄‖   (4.15)

where lim_{x→x̄} R(x) = R(x̄) = 0.

Proof. By (4.9), the differentiability of f at x̄ is equivalent to the condition that the
function R : X → R defined by R(x̄) = 0, and for x ≠ x̄ by

R(x) = ( f(x) − f(x̄) − G · (x − x̄) ) / ‖x − x̄‖   (4.16)

fulfills lim_{x→x̄} R(x) = 0. Multiplication of both sides in (4.16) with ‖x − x̄‖ gives
(4.15), and division of both sides of (4.15) by ‖x − x̄‖ gives (4.16), so the two
equations are equivalent.

The important part in (4.15) is that the remainder term R(x) tends to zero as
x → x̄. The norm ‖x − x̄‖ also tends to zero as x → x̄, so the product R(x) · ‖x − x̄‖
becomes negligible in comparison to G · (x − x̄), which is therefore the
dominant linear term. Condition (4.15) is perhaps the best way to understand
differentiability as "local approximation by a linear function".
Taylor’s theorem is familiar from the case n = 1, where D f is typically written


as f 0 , and the second derivative (the derivative of
f 0 ) as f 00 , if it exists. In that case
the theorem can be applied again to R( x ) which itself has a Taylor approximation,
which gives rise to an expression like

f 00 ( x )
f ( x ) = f ( x ) + f 0 ( x )( x − x ) + ( x − x )2 + R̂( x ) · | x − x |2 (4.17)
2

where limx→ x R̂( x ) = 0. By iterating this process for functions that are differ-
entiable sufficiently many times, one obtains a “Taylor expansion” that approxi-
mates the function not just linearly but by a higher-degree polynomial. Secondly,
the expression (4.17) for a function that is twice differentiable is more informa-
tive than the expression (4.15), with the following additional observation: one
can show that it allows to represent the original remainder term R( x ) to be of the
form f 00 (z)/2 · ( x − x ) for some “intermediate value” z that is between x and x;
hence bounds on f 00 (z) translate to bounds on R( x ). These variations of Taylor’s
theorem are often stated in the literature. We do not consider them here, only the
simple version of Theorem 4.6.

We illustrate (4.15) with a specific remainder term for some differentiable
function f : R² → R with gradient D f (x, y). Fix (x̄, ȳ) and let (x, y) = (x̄, ȳ) +
(∆x, ∆y), so that (4.15) becomes

    f(x̄ + ∆x, ȳ + ∆y) = f(x̄, ȳ) + D f (x̄, ȳ) · (∆x, ∆y)⊤ + R(∆x, ∆y) · ‖(∆x, ∆y)‖ .   (4.18)

Consider now the function f(x, y) = x · y which has gradient D f (x, y) = (y, x),
which is a continuous function of (x, y), so that f is continuously differentiable
by Theorem 4.5. Then

    f(x, y) = f(x̄ + ∆x, ȳ + ∆y) = (x̄ + ∆x) · (ȳ + ∆y)
            = x̄ ȳ + ȳ ∆x + x̄ ∆y + ∆x ∆y                              (4.19)
            = f(x̄, ȳ) + D f (x̄, ȳ) · (∆x, ∆y)⊤ + ∆x ∆y

which is of the form (4.18) if we can find a remainder term R(∆x, ∆y) so that
R(∆x, ∆y) · ‖(∆x, ∆y)‖ = ∆x ∆y. This holds if R(∆x, ∆y) = ∆x ∆y / ‖(∆x, ∆y)‖, and
then

    |R(∆x, ∆y)| = |∆x ∆y| / √(∆x² + ∆y²) = √( ∆x² ∆y² / (∆x² + ∆y²) ) = 1 / √( 1/∆x² + 1/∆y² )   (4.20)

which indeed goes to zero as (∆x, ∆y) → (0, 0).
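A quick Python check of this (an added sketch, not part of the original notes;
the function name remainder is ours) computes R(∆x, ∆y) = ∆x∆y/‖(∆x, ∆y)‖
from (4.20) for shrinking increments:

    import math

    def remainder(dx, dy):
        # R from (4.20): R(dx, dy) = dx*dy / ||(dx, dy)||
        return dx * dy / math.hypot(dx, dy)

    for t in [1.0, 0.1, 0.01, 0.001]:
        print(t, remainder(t, 2 * t))  # tends to 0 as (dx, dy) -> (0, 0)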

4.5 Unconstrained optimization
We give some definitions of global and local maximum and minimum.

Definition 4.7 Let ∅ ≠ X ⊆ Rⁿ, f : X → R, and x̄ ∈ X. Then

• f attains a global maximum on X at x̄ if f(x) ≤ f(x̄) for all x ∈ X. Then x̄ is
called a global maximizer of f on X.

• f attains a local maximum on X at x̄ if there exists ε > 0 so that f(x) ≤ f(x̄) for
all x ∈ X ∩ B(x̄, ε). Then x̄ is called a local maximizer of f on X.

• f attains an unconstrained local maximum on X at x̄ if there exists ε > 0 so that
B(x̄, ε) ⊆ X and f(x) ≤ f(x̄) for all x ∈ B(x̄, ε). Then x̄ is called an unconstrained
local maximizer of f on X.

The analogous definitions hold for "minimum" instead of "maximum" by replacing
"f(x) ≤ f(x̄)" with "f(x) ≥ f(x̄)".

In Figure 4.6, a, c, and e are local maximizers and b and d are local minimizers
of the function shown, where b, c, d are unconstrained. The function attains its
global minimum at b and its global maximum at e.

Lemma 4.8 Let X ⊆ Rⁿ, f : X → R, and x̄ ∈ X. Then x̄ is an unconstrained local
maximizer of f on X ⇔ x̄ is a local maximizer of f on X and an interior point of X.

Proof. The direction "⇒" is immediate from Definition 4.7. To see the converse
direction "⇐", if x̄ is a local maximizer of f on X then there is some ε₁ > 0 so that
f(x) ≤ f(x̄) for all x ∈ X ∩ B(x̄, ε₁), and x̄ is an interior point of X if B(x̄, ε₂) ⊆ X
for some ε₂ > 0. With ε = min{ε₁, ε₂} we obtain B(x̄, ε) ⊆ X and f(x) ≤ f(x̄) for
all x ∈ B(x̄, ε), that is, x̄ is an unconstrained local maximizer of f on X. □
Figure 4.6 Illustration of Definition 4.7 for a function defined on the interval [a, e].

For a differentiable (for example, C¹) function, the following lemma shows
that at an unconstrained local maximizer or minimizer the function has a zero gradient.
The lemma is a corollary to Taylor's Theorem 4.6.

Lemma 4.9 Let X ⊆ Rⁿ and let f : X → R be differentiable at x̄ ∈ X. If x̄ is an
unconstrained local maximizer or minimizer of f , then D f (x̄) = 0.

Proof. Suppose x̄ is an unconstrained local maximizer of f . Then B(x̄, ε) ⊆ X for some
ε > 0, and f(x) ≤ f(x̄) for all x ∈ B(x̄, ε).

We apply Taylor's Theorem 4.6, so that by (4.15), with ∆x = x − x̄, we have
for all ∆x so that ‖∆x‖ < ε

    f(x̄ + ∆x) = f(x̄) + D f (x̄) · ∆x + R(∆x) · ‖∆x‖                   (4.21)

where |R(∆x)| → 0 as ∆x → 0. Because the local maximization is unconstrained,
we can choose any ∆x with ‖∆x‖ < ε in (4.21). Suppose that G = D f (x̄) ≠ 0.
We will show that the value of f increases in the direction of the gradient G, a
contradiction. That is, let ∆x = G⊤t for any t > 0 so that ‖∆x‖ = ‖G‖t < ε (we
transpose G to obtain a column vector G⊤, so that G · G⊤ = ‖G‖²). Then by (4.21),

    f(x̄ + ∆x) = f(x̄) + D f (x̄) · ∆x + R(∆x) · ‖∆x‖
              = f(x̄) + G · G⊤t + R(G⊤t) · ‖G‖t
              = f(x̄) + ‖G‖t · ( ‖G‖ + R(G⊤t) ) .

Because ‖G‖ > 0, t > 0, and R(G⊤t) → 0 as t → 0, the term ‖G‖ + R(G⊤t) is
positive for sufficiently small positive t, and therefore f(x̄ + ∆x) > f(x̄), which
contradicts the local maximality of f(x̄). So D f (x̄) = 0 as claimed.
If x̄ is an unconstrained local minimizer of f , then x̄ is an unconstrained local
maximizer of − f , so that −D f (x̄) = 0, which is equivalent to D f (x̄) = 0. □

In Lemma 4.9, it is important that the local maximum is unconstrained. For
example, points a and e in Figure 4.6 are local maximizers where the derivative
of the function is not zero. Moreover, a zero gradient is only a necessary
condition. It may indicate a stationary point that is neither a local maximizer nor
minimizer, such as x = 0 of the function f : R → R defined by f(x) = x³.
Another example is the function f : R² → R defined by f(x, y) = x · y where
(0, 0) has gradient zero, but is neither a local maximizer nor minimizer because
f(0, 0) = 0, and f(x, y) takes positive and negative values nearby.

For a differentiable function f : Rⁿ → R, its gradient D f (x) for x ∈ Rⁿ
is a row vector with n components, so the condition D f (x) = 0 amounts to n
equations for n unknowns (the n components of x). Often these n equations have
only a finite number of solutions, which can then be checked as to whether they
represent local maxima or minima of f .

We show how to use Lemma 4.9 with two examples. First, consider the function
f : R² → R,

    f(x, y) = (x − y) / (2 + x² + y²) ,                               (4.22)

where D f (x, y) = (0, 0) means

    ( 2 + x² + y² − (x − y) · 2x ) / (2 + x² + y²)² = 0 ,    ( −(2 + x² + y²) − (x − y) · 2y ) / (2 + x² + y²)² = 0

or equivalently

    2 − x² + y² + 2xy = 0 ,
    −2 − x² + y² − 2xy = 0 .                                          (4.23)

The simultaneous solution of two nonlinear equations as in (4.23) is in general not
easy. Here, these equations are very similar, so that sometimes a simpler equation
results by, for example, adding them (in other cases another manipulation may
be useful, such as taking their difference). Adding the two equations in (4.23)
gives −2x² + 2y² = 0 or equivalently x² = y², which means that either x = y or
x = −y. Furthermore, x² = y² in either equation (4.23) gives 2 + 2xy = 0 and
thus xy = −1, which has no solution if x = y but the solutions (x, y) = (1, −1) or
(x, y) = (−1, 1) if x = −y. It seems that (x, y) = (1, −1) is a local or even global
maximizer and (x, y) = (−1, 1) a local or global minimizer of f(x, y). It can be
verified directly that f(1, −1) is the global maximum of f , because the following
are equivalent for all (x, y):
    f(x, y) ≤ f(1, −1) = 2 / (2 + 1 + 1) = 1/2
    (x − y) / (2 + x² + y²) ≤ 1/2
    2x − 2y ≤ 2 + x² + y²                                             (4.24)
    0 ≤ 1 − 2x + x² + 1 + 2y + y²
    0 ≤ (1 − x)² + (1 + y)²

which is true (with equality for (x, y) = (1, −1), a useful check). The inequality
f(x, y) ≥ f(−1, 1) = −1/2 is shown very similarly, which shows that f(−1, 1) is
the global minimum of f .
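As an added numerical sanity check (assuming NumPy is available; the names
f and grad_f are ours), the following sketch verifies that the gradient vanishes at
(1, −1) and (−1, 1) and samples f on a grid to confirm the extreme values ±1/2:

    import numpy as np

    def f(x, y):
        return (x - y) / (2 + x**2 + y**2)

    def grad_f(x, y):
        # the two partial derivatives whose numerators give (4.23)
        d = (2 + x**2 + y**2) ** 2
        return ((2 + x**2 + y**2 - (x - y) * 2 * x) / d,
                (-(2 + x**2 + y**2) - (x - y) * 2 * y) / d)

    print(grad_f(1, -1), grad_f(-1, 1))   # both (0.0, 0.0): stationary points

    xs = np.linspace(-10, 10, 401)
    X, Y = np.meshgrid(xs, xs)
    print(f(X, Y).max(), f(X, Y).min())   # 0.5 and -0.5, at (1, -1) and (-1, 1)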
In the following example, the first-order condition of a zero derivative also gives
useful information, although of a different kind. Consider the function g : R² → R,

    g(x, y) = xy / (1 + x² + y²) ,                                    (4.25)

where Dg(x, y) = (0, 0) means

    ( y(1 + x² + y²) − (xy) · 2x ) / (1 + x² + y²)² = 0 ,    ( x(1 + x² + y²) − (xy) · 2y ) / (1 + x² + y²)² = 0

or equivalently

    y − yx² + y³ = 0 ,
    x + x³ − xy² = 0 .                                                (4.26)
An obvious solution to (4.26) is (x, y) = (0, 0), but this is only a stationary point
of g and neither maximum nor minimum (not even locally), because g(0, 0) = 0
but g(x, y) takes positive as well as negative values (also near (0, 0)). Similarly,
when x = 0 or y = 0, then g(x, y) = 0 but this is not a maximum or minimum, so
that we can assume x ≠ 0 and y ≠ 0. Then the equations (4.26) are equivalent to

    1 − x² + y² = 0
    1 + x² − y² = 0                                                   (4.27)

which when added give 2 = 0, which is a contradiction. This shows that there
is no solution to (4.26) where x ≠ 0 and y ≠ 0, and thus g(x, y) has no local
and therefore also no global maximum or minimum. This is possible because the
domain R² of g is not compact. For x = y and large x, for example, we have

    g(x, x) = x² / (1 + 2x²) = 1 / (1/x² + 2)

which tends to 1/2 as x → ∞, but this seems to be an upper bound for g(x, y). We
can prove this for all (x, y) via the following equivalences:
    g(x, y) = xy / (1 + x² + y²) < 1/2
    2xy < 1 + x² + y²
    0 < 1 + x² − 2xy + y²
    0 < 1 + (x − y)²

which is true. We can show similarly that g(x, y) > −1/2 and that g(x, −x) gets
arbitrarily close to −1/2. This shows that the image of g is the interval (−1/2, 1/2).
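An added numerical illustration of this (assuming NumPy is available) samples g
along the diagonal and on a grid, showing values that approach but never reach ±1/2:

    import numpy as np

    def g(x, y):
        return x * y / (1 + x**2 + y**2)

    for x in [1.0, 10.0, 100.0]:
        print(x, g(x, x))                    # approaches 1/2 from below

    xs = np.linspace(-50, 50, 501)
    X, Y = np.meshgrid(xs, xs)
    print(g(X, Y).max(), g(X, Y).min())      # strictly inside (-1/2, 1/2)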

4.6 Equality constraints and the Theorem of Lagrange
The following central Theorem of Lagrange gives conditions for a constrained lo-
cal maximum or minimum of a continuously differentiable function f . The con-
straints are given as k equations g1 ( x ) = 0, . . . , gk ( x ) = 0 with continuously
differentiable functions g1 , . . . , gk . The theorem states that at a local maximum or
minimum, the optimized function f ( x ) has a gradient that is a linear combina-
tion of the gradients of these constraint functions, provided these gradients are
linearly independent. The latter condition is called the constraint qualification. The
corresponding coefficients λ1 , . . . , λk are known as Lagrange multipliers.

Theorem 4.10 (Lagrange) Let f : U → R be a C¹(U) function on an open subset
U of Rⁿ, and let g₁, . . . , g_k : Rⁿ → R be C¹(Rⁿ) functions. Let x̄ be a local maximizer
or minimizer of f on X = U ∩ {x ∈ Rⁿ | g_i(x) = 0, 1 ≤ i ≤ k}. Let
Dg₁(x̄), . . . , Dg_k(x̄) be linearly independent ("constraint qualification"). Then there
exist λ₁, . . . , λ_k ∈ R ("Lagrange multipliers") so that

    D f (x̄) = ∑_{i=1}^k λ_i Dg_i(x̄) .                                (4.28)

To understand this theorem, consider first the case k = 1, that is, a single
constraint g(x) = 0. Then (4.28) states D f (x̄) = λ Dg(x̄), which means that the
gradient of f (a row vector) is a scalar multiple of the gradient of g. The two gradients
have the n partial derivatives of f and g as components, and each partial
derivative of g is multiplied with the same λ to equal the respective partial derivative
of f . These are n equations for the n components of x̄ and λ as unknowns,
and an additional equation is g(x̄) = 0, so these are n + 1 equations for n + 1 unknowns
in total. If there are k constraints g_i(x) = 0 for 1 ≤ i ≤ k, then (4.28) and
these constraints are n + k equations for n + k unknowns x̄ and λ₁, . . . , λ_k. Often
these equations have only finitely many solutions that can then be investigated
further.
As an example with a single constraint, consider the functions f , g : R² → R,

    f(x, y) = x · y ,    g(x, y) = x² + y² − 2 ,                      (4.29)

where f(x, y) is to be maximized or minimized subject to g(x, y) = 0, that is, on
the set X = {(x, y) ∈ R² | x² + y² − 2 = 0}, which is a circle of radius √2 around
the origin (0, 0). The contour lines of f and g are shown in Figure 4.7. Because X
is compact and f is continuous, f assumes its maximum and minimum on X by
the theorem of Weierstrass.

Figure 4.7 Illustration of the Theorem of Lagrange for f and g in (4.29). The
arrows indicate the gradients of f and g, which are orthogonal to
the contour lines. These gradients have to be co-linear in order to
find a local maximum or minimum of f ( x, y) subject to the constraint
g( x, y) = 0.

For (4.29), D f (x, y) = (y, x) and Dg(x, y) = (2x, 2y). Here Dg(x, y) is linearly
dependent only if (x, y) = (0, 0), which however does not fulfill g(x, y) = 0, so
the constraint qualification always holds. The Lagrange multiplier λ has to fulfill
D f (x, y) = λ Dg(x, y), that is, (y, x) = λ(2x, 2y). Here x = 0 would imply y = 0
and vice versa, so we have x ≠ 0 and y ≠ 0, and the first equation y = 2λx
implies λ = y/(2x), which when substituted into the second equation gives x =
2λy = 2y²/(2x), and thus x² = y² or |x| = |y|. The constraint g(x, y) = 0 then
implies x² + y² − 2 = 2x² − 2 = 0 and therefore |x| = 1, which gives the four
solutions (1, 1), (−1, −1), (−1, 1), and (1, −1). For the first two solutions, f takes
the value 1, and for the last two the value −1, so these are the local and in fact
global maxima and minima of f on the circle X.
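A simple numerical corroboration (an added sketch using NumPy, not part of
the original notes) parametrizes the circle as (x, y) = (√2 cos t, √2 sin t) and
evaluates f along it:

    import numpy as np

    t = np.linspace(0.0, 2.0 * np.pi, 100001)
    x, y = np.sqrt(2.0) * np.cos(t), np.sqrt(2.0) * np.sin(t)
    values = x * y                     # f(x, y) = x*y on the circle x^2 + y^2 = 2

    i, j = values.argmax(), values.argmin()
    print(values[i], (x[i], y[i]))     # about  1 at (1, 1) or (-1, -1)
    print(values[j], (x[j], y[j]))     # about -1 at (1, -1) or (-1, 1)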
The following functions illustrate why the constraint qualification is needed
in Theorem 4.10. Let

    f(x, y) = −y ,    g(x, y) = x² − y³ ,                             (4.30)

where f(x, y) is maximized subject to g(x, y) = 0, that is, on X = {(x, y) ∈
R² | y³ = x²}. The set X is shown in Figure 4.8 as the two mirrored arcs of
the function y = |x|^(2/3), which end in a cusp (pointed end) at (x, y) = (0, 0).
Figure 4.8 Horizontal contour lines of f in (4.30), and of g for g(x, y) = 0 given
by y = |x|^(2/3). The arrows, orthogonal to the contour lines, are the
gradients of f and g, which are nowhere co-linear. The maximum of
f(x, y) is at (x, y) = (0, 0) where the constraint qualification fails.

Here D f (x, y) = (0, −1) and Dg(x, y) = (2x, −3y²). However, the equation
D f (x, y) = λ Dg(x, y), that is, (0, −1) = λ(2x, −3y²), has no solution at all, since
0 = 2λx implies λ = 0 or x = 0, and in either case the remaining equation fails
(on X, x = 0 forces y = 0). However, the unique maximizer of f(x, y) on X is
clearly (0, 0). The equation D f (x, y) = λ Dg(x, y) fails to hold because the constraint
qualification is not fulfilled: at (0, 0) the gradient Dg(0, 0) equals (0, 0),
which is not a linearly independent vector.
An example that gives a geometric justification for Theorem 4.10 was given
in Figure 4.3. We do not prove Theorem 4.10, but give a more general plausibility
argument for the case k = 1, with the help of Taylor's Theorem 4.6. Consider x̄ so
that f(x̄) is a local maximum of f on X = {x ∈ U | g(x) = 0}, and thus g(x̄) = 0.
Any variation ∆x around x̄ so that x̄ + ∆x ∈ X requires

    0 = g(x̄) = g(x̄ + ∆x) ≈ g(x̄) + Dg(x̄) · ∆x                        (4.31)

where "≈" means that we neglect the remainder term because we assume ∆x to
be sufficiently small. By (4.31), Dg(x̄) · ∆x = 0, and the set of these ∆x's is a subspace
of Rⁿ of dimension n − 1 provided Dg(x̄) ≠ 0, which holds by the constraint
qualification (this just says that the gradient of g at the point x̄ is orthogonal to
the "contour set" {x ∈ Rⁿ | g(x) = 0}). Similarly, a local maximum f(x̄) requires

    f(x̄ + ∆x) ≈ f(x̄) + D f (x̄) · ∆x ≤ f(x̄)                          (4.32)

and therefore

    Dg(x̄) · ∆x = 0   ⇒   D f (x̄) · ∆x = 0                            (4.33)

for the following reason: Condition Dg(x̄) · ∆x = 0 states that x̄ + ∆x ∈ X by
(4.31). If for such a ∆x we had D f (x̄) · ∆x ≠ 0, then either D f (x̄) · ∆x > 0 or
D f (x̄) · (−∆x) > 0, where also Dg(x̄) · (−∆x) = 0 and then x̄ − ∆x ∈ X because
X = U ∩ {x ∈ Rⁿ | g(x) = 0} and U is an open set. But this contradicts (4.32).
In turn, because Dg(x̄) ≠ 0, (4.33) holds only if D f (x̄) = λ Dg(x̄) as claimed.
This is a heuristic argument where we have assumed that the functions f and g
behave like affine functions near x̄, which is approximately true because they are
differentiable.
Consider the maximization problem of f(x) subject to g_i(x) = 0 for 1 ≤
i ≤ k, for x ∈ Rⁿ as in Theorem 4.10 with U = Rⁿ; all functions are C¹. Then
the Lagrangian for this problem is the function F : Rⁿ × Rᵏ → R where λ =
(λ₁, . . . , λ_k) ∈ Rᵏ,

    F(x, λ) = f(x) − ∑_{i=1}^k λ_i g_i(x) .                           (4.34)

Then the stationary points of F are by definition the points (x, λ) ∈ Rⁿ × Rᵏ with
zero derivative, that is,

    DF(x, λ) = 0 .                                                    (4.35)

These are n + k equations for the partial derivatives of F with n + k unknowns,
the components of (x, λ). These equations define exactly the problem of finding
the Lagrange multipliers in (4.28) and of solving the given equality constraints.
Namely, the first n equations in (4.35) are for the n partial derivatives of F with
respect to x_j, that is, by (4.34),

    ∂F(x, λ)/∂x_j = ∂ f (x)/∂x_j − ∑_{i=1}^k λ_i ∂g_i(x)/∂x_j = 0    (1 ≤ j ≤ n).   (4.36)

These n equations can be written as

    D f (x) − ∑_{i=1}^k λ_i Dg_i(x) = 0                               (4.37)

which is equivalent to (4.28). The last k equations in (4.35) are for the k partial
derivatives of F with respect to λ_i, that is,

    ∂F(x, λ)/∂λ_i = −g_i(x) = 0    (1 ≤ i ≤ k)                        (4.38)

which is equivalent to g_i(x) = 0 for 1 ≤ i ≤ k, which are the given equality
constraints. Note that it does not make sense to maximize the Lagrangian F(x, λ)
without these constraints, because for any x so that g_i(x) ≠ 0 it is unbounded in
λ_i (let λ_i → ∞ if g_i(x) < 0, or λ_i → −∞ if g_i(x) > 0).
The Lagrangian is often defined as F(x, λ) = f(x) + ∑_{i=1}^k λ_i g_i(x), which is
(4.34) but with a plus sign instead of a minus sign, which accordingly gives (4.37)
with a plus sign instead of a minus sign. This is also equivalent to (4.28) except
for the sign change of each λ_i. We prefer (4.28), which states directly that D f (x̄)
is a linear combination of the gradients Dg_i(x̄), and which avoids using the zero
vector.
Lagrange multipliers can be interpreted as shadow prices in certain economic
settings. In such a setting, x may represent an allocation of the variables x₁, . . . , x_n
according to some production schedule which results in profit f(x) for the firm,
subject to the constraints g_i(x) = 0 for 1 ≤ i ≤ k. The profit f(x) is maximized
at x̄, with Lagrange multipliers λ₁, . . . , λ_k as in (4.28). Suppose that ĝ_j(x) is the
amount of some resource j needed for production schedule x, for example manpower,
of which amount a_j is available, so that g_j(x) = ĝ_j(x) − a_j = 0 (assuming
all manpower is used; we could more generally assume g_j(x) ≤ 0, but here just
assume that for x = x̄ this inequality is tight, g_j(x̄) = 0).

Now suppose the amount of manpower can be increased from a_j to a_j + ε for
some small amount ε > 0, which results in the new constraint g_j(x) = ε, where
all other constraints are kept fixed. Assume that the new constraint results in a
new optimal solution x̄(ε), that is, g_j(x̄(ε)) = ε and g_i(x̄(ε)) = 0 for i ≠ j. We
claim that then

    f(x̄(ε)) ≈ f(x̄) + λ_j ε .                                         (4.39)

Namely, with x̄(ε) = x̄ + ∆x we have Dg_i(x̄) · ∆x = 0 for i ≠ j in order to keep
the condition g_i(x̄ + ∆x) = 0 (see (4.31) above), but

    ε = g_j(x̄ + ∆x) ≈ g_j(x̄) + Dg_j(x̄) · ∆x = Dg_j(x̄) · ∆x ,         (4.40)

and thus

    f(x̄(ε)) = f(x̄ + ∆x) ≈ f(x̄) + D f (x̄) · ∆x
             = f(x̄) + ∑_{i=1}^k λ_i Dg_i(x̄) · ∆x                     (4.41)
             = f(x̄) + λ_j Dg_j(x̄) · ∆x
             = f(x̄) + λ_j ε

which shows (4.39). The interpretation of (4.39) is that adding ε more manpower
(amount of resource j) so that the constraint g_j(x) = 0 is changed to g_j(x) = ε
increases the firm's profit by λ_j ε. Hence, λ_j is the price per extra unit of manpower
that the firm should be willing to pay, given the current maximizer x̄ and
associated Lagrange multipliers λ₁, . . . , λ_k in (4.28).
The following is a typical problem that can be solved with the help of Lagrange's
Theorem 4.10. A manufacturer of rectangular milk cartons wants to minimize
the material used to obtain a carton of a given volume. A carton is x cm
high, y cm wide and z cm deep, and is folded according to the layout shown on
the right in Figure 4.9 (which is used twice, for front and back). Each of the four
squares in a corner of the layout with length z/2 is (together with its counterpart
on the back) folded into a triangle as shown on the left (the triangles at the bottom
are folded underneath the carton). We ignore any overlapping material used
for gluing. What are the optimal dimensions x, y, z for a carton with volume 500
cm³?
Figure 4.9 Optimization of a milk carton using the Theorem of Lagrange.

minimization), subject to g( x, y, z) = xyz − 500 = 0. We have

D f ( x, y, z) = (y + z, x + z, x + y + 2z) , Dg( x, y, z) = (yz, xz, xy) .

Because clearly x, y, z > 0, the derivative Dg( x, y, z) is never zero and therefore
linearly independent. By Lagrange’s theorem, there is some λ so that

y + z = λyz ,
x + z = λxz , (4.42)
x + y + 2z = λxy .
These equations are nonlinear, but simpler equations can be found by exploiting
their symmetry. Multiplying the first, second, and third equation in (4.42) by
x, y, z, respectively (all of which are nonzero), these equations are equivalent to

x (y + z) = λxyz ,
y( x + z) = λxyz , (4.43)
z( x + y + 2z) = λxyz ,
that is, they all have the same right-hand side. The first two equations in (4.43)
imply xz = yz and thus x = y. With x = y, the second and third equation give

x ( x + z) = z(2x + 2z) = 2z( x + z)

and thus x = 2z. That is, the only optimal solution is of the form (2z, 2z, z).
Applied to the volume equation this gives 4z3 = 500 or z3 = 125, that is, x = y =
10 cm and z = 5 cm.
The area of material used is 2(x + z)(y + z) = 2 × 15² = 450 cm². In comparison,
the surface area of the carton without the extra folded triangles is 2(xy +
xz + yz) = 2(100 + 50 + 50) = 400 cm². The extra material is from the eight
squares of size 2.5 × 2.5 for the folded triangles which do not contribute to the
surface of the carton, which have area 8 × 2.5² = 2 × 5² = 50 cm².
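Assuming SciPy is available, the following added sketch (the function names are
ours) solves the carton problem numerically with an equality-constrained minimizer,
and also checks the shadow-price interpretation (4.39): from the first equation in
(4.42), λ = (y + z)/(yz) = 15/50 = 0.3 at the optimum, which should be the rate at
which the minimal one-sided area grows per extra cm³ of volume.

    import numpy as np
    from scipy.optimize import minimize

    def area(v):
        x, y, z = v
        return (x + z) * (y + z)       # one side of the carton; factor 2 omitted

    def solve(volume):
        cons = [{"type": "eq", "fun": lambda v: v[0] * v[1] * v[2] - volume}]
        res = minimize(area, x0=[8.0, 8.0, 8.0], constraints=cons,
                       bounds=[(0.1, None)] * 3, method="SLSQP")
        return res.x, res.fun

    v, a = solve(500.0)
    print(v, a)                        # about (10, 10, 5) with area 225

    eps = 1.0                          # shadow price check as in (4.39)
    _, a_eps = solve(500.0 + eps)
    print((a_eps - a) / eps)           # close to lambda = 0.3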

4.7 Inequality constraints and the KKT conditions
In this section, we discuss conditions for local optimality of a function subject to
inequalities, as in the problem

    maximize f(x) subject to h_i(x) ≤ 0    (1 ≤ i ≤ ℓ),               (4.44)

where f and h_i are continuously differentiable functions on Rⁿ. The inequalities
in (4.44) are always weak (allowing for equality) so that f is maximized on a closed
set, which if bounded ensures existence of a maximum by the Theorem of Weierstrass.

The inequalities are written as in (4.44), that is, they require h_i(x) to be nonpositive,
because for a maximization problem it is natural to think of the constraints
as upper bounds on limited resources. For example, if f(x) is the profit
from a production schedule x, then h_i(x) ≤ 0 means that, in order to implement x,
the use h_i(x) of resource i cannot exceed a certain bound (which is set
to 0 by a suitable choice of the function h_i). In contrast, in a minimization problem
it is often more natural to write lower bounds as in h_i(x) ≥ 0, which express
certain minimum conditions that have to be met, such as producing at least
a certain quantity of each good i, where the goal is to do so with minimum overall
cost f(x). The direction of the inequality is in principle arbitrary because h_i(x)
can be replaced by −h_i(x). We consider maximization problems with the convention
in (4.44).
Suppose ℓ = 1 in (4.44), that is, we have a single inequality constraint that we
write as h(x) ≤ 0. At a local maximizer x̄ of f , the inequality is either not tight,
h(x̄) < 0, or tight, h(x̄) = 0. The distinction between these two cases is central to
the first-order conditions for optimization under inequality constraints, and illustrated
in Figures 4.10 and 4.11. These can be summarized as follows: A constraint
that is not tight can be treated as if it is absent, and a tight constraint can be treated
like an equality constraint where the corresponding Lagrange multiplier has to
be nonnegative.

Consider the case that h(x̄) < 0, that is, the constraint is not tight, shown
in Figure 4.10. This means that there is an ε-ball B(x̄, ε) around x̄ where (by
continuity of h) we also have h(x) < 0, and so f(x̄) is an unconstrained local
maximum by Lemma 4.8, where by Lemma 4.9 we have D f (x̄) = 0. Hence, as
long as the constraint h(x) ≤ 0 holds as a strict inequality, it has no effect on the
gradient D f (x̄), just as if the constraint was absent.
Figure 4.10 Maximization of f(x) subject to h(x) ≤ 0 (grey region). At the
maximizer x̄, the inequality is not tight (h(x̄) < 0), which requires
D f (x̄) = 0. The ellipses are contour lines of f .

Figure 4.11 Maximization of f(x) subject to h(x) ≤ 0. At the maximizer x̄, the
inequality is tight (h(x̄) = 0), which requires D f (x̄) = µ Dh(x̄) for
some µ ≥ 0.

The case h(x̄) = 0 is illustrated in Figure 4.11. The grey region {x | h(x) ≤ 0}
shown there has the contour line {x | h(x) = 0} as a boundary. If h(x) denotes
the "height of a terrain" at location x, then this grey region can be seen as a "lake"
with the surface of the water at height 0. The gradient Dh(x) points outwards,
orthogonally to the boundary, for getting out of the lake. The function f(x) may
denote the height of a different terrain, and maximizing f(x) is achieved at f(x̄)
where the contour line of f touches the contour line of h. This is exactly the same
situation as in the Lagrange multiplier problem, meaning D f (x̄) = µ Dh(x̄) for
some Lagrange multiplier µ, with the additional constraint that the gradients of f
and h have to be not only co-linear, but point in the same direction, that is, µ ≥ 0.
(We use a different Greek letter µ instead of the usual λ to emphasize this.) The
reason is that at the point x̄ both h(x) and f(x) are maximized, by "getting out of
the lake", and by maximizing f , in the direction of the gradients.
The following theorem is also known as the Kuhn–Tucker Theorem, published
in 1951. Later Kuhn found out that it had already been shown in 1939 in
an unpublished Master's thesis by Karush, and it is now also called the KKT or
Karush–Kuhn–Tucker Theorem, or simply the "first-order conditions" for maximization
under inequality constraints.

Theorem 4.11 (KKT, Karush–Kuhn–Tucker) Let f : U → R be a C¹(U) function
on an open subset U of Rⁿ, and let h₁, . . . , h_ℓ : Rⁿ → R be C¹(Rⁿ) functions. Let x̄ be
a local maximizer of f on

    X = U ∩ {x ∈ Rⁿ | h_i(x) ≤ 0 , 1 ≤ i ≤ ℓ} .                       (4.45)

Let the set of vectors {Dh_i(x̄) | h_i(x̄) = 0} be linearly independent ("constraint qualification"
for the tight constraints). Then there exist µ₁, . . . , µ_ℓ ∈ R so that for 1 ≤ i ≤ ℓ

    µ_i ≥ 0 ,    µ_i h_i(x̄) = 0 ,    D f (x̄) = ∑_{i=1}^ℓ µ_i Dh_i(x̄) .   (4.46)

In (4.46), it is important to understand the middle condition µ_i h_i(x̄) = 0,
which is equivalent to

    h_i(x̄) < 0   ⇒   µ_i = 0 .                                        (4.47)

The purpose of this condition is to disregard all non-tight constraints h_i(x̄) < 0,
where the multiplier µ_i must be zero, so that the corresponding gradients Dh_i(x̄)
do not even appear in the linear combination that represents D f (x̄) in (4.46).
That is, D f (x̄) is a nonnegative linear combination of the gradients Dh_i(x̄) for
the tight constraints only (assuming they are linearly independent as stated in the
constraint qualification). That is, (4.46) can also be written as follows: Let E be
the set of tight (or "effective") constraints,

    E = {i | 1 ≤ i ≤ ℓ , h_i(x̄) = 0} .                                (4.48)

Then there are reals µ_i ≥ 0 for i ∈ E so that

    D f (x̄) = ∑_{i∈E} µ_i Dh_i(x̄) .                                   (4.49)

Proof of Theorem 4.11. We prove the KKT theorem with the help of Theorem
4.10 of Lagrange. Let x̄ be a local maximizer of f on X, and let E be the set of
effective constraints as in (4.48). Because the functions h_i for i ∉ E are continuous,
the set V defined by

    V = U ∩ {x ∈ Rⁿ | h_i(x) < 0, 1 ≤ i ≤ ℓ, i ∉ E}                   (4.50)

is open, and we can consider f as a function V → R. We apply Theorem 4.10 to
this function subject to k = |E| equations on the set

    V ∩ {x ∈ Rⁿ | h_i(x) = 0, i ∈ E}                                  (4.51)

where it has the local maximizer x̄. Because the constraint qualification holds for
the gradients Dh_i(x̄) for i ∈ E, there are Lagrange multipliers µ_i for i ∈ E so that
(4.49) holds. It remains to show that they are nonnegative.

Suppose µ_j < 0 for some j ∈ E. Because x̄ is in the interior of V, for sufficiently
small ε > 0 we can find ∆x ∈ Rⁿ so that x̄ + ∆x ∈ V and, as in (4.40),

    h_j(x̄ + ∆x) = −ε                                                  (4.52)

and h_i(x̄ + ∆x) = 0 for i ∈ E − {j}, so that x̄ + ∆x ∈ X in (4.45), that is, all
inequality constraints are fulfilled. Then, as in (4.41),

    f(x̄ + ∆x) ≈ f(x̄) − µ_j ε > f(x̄)                                  (4.53)

because µ_j < 0. This is a contradiction because f(x̄) is a local maximum on the
set X (see also our explanation with Figure 4.11 above). This proves that µ_i ≥ 0
for all i ∈ E as claimed. □

Figure 4.12 Illustration of the KKT theorem for n = 1, ℓ = 1. The condition
h(x) ≤ 0 holds for x ∈ [a, b] ∪ [c, d] (where both functions are shown
as bold curves), and is tight for x ∈ {a, b, c, d}. For x = d the constraint
qualification fails because Dh(d) = 0.

The sign conditions in the KKT theorem are most easily remembered (or reconstructed)
for a single constraint in dimension n = 1, as shown in Figure 4.12.
There the condition h(x) ≤ 0 holds on the two intervals [a, b] and [c, d] and is tight
at either end of each interval. For x = a both f and h have a negative derivative,
and hence D f (x) = µ Dh(x) for some µ ≥ 0, and indeed x is a local maximizer
of f . For x ∈ {b, c} the derivatives of f and h have opposite sign, and in each
case D f (x) = λ Dh(x) for some λ but λ < 0, so these are not maximizers of f .
However, in that case −D f (x) = −λ Dh(x) and hence both b and c are local maximizers
of − f and hence local minimizers of f , in agreement with the picture. For
x = d we have a local maximum f(x) with D f (x) > 0 but Dh(x) = 0 and hence
no µ with D f (x) = µ Dh(x), because the constraint qualification fails. Moreover,
there are two points x in the interior of [c, d] where f has zero derivative, which
is a necessary condition for a local maximum of f because h(x) ≤ 0 is not tight
there. One of these points is indeed a local maximum.

Method 4.12 The following is a "cookbook procedure" to use the KKT Theorem
4.11 in order to find the optimum of a function f : Rⁿ → R.

1. Write all inequality constraints in the form h_i(x) ≤ 0 for 1 ≤ i ≤ ℓ. In
particular, write a constraint such as g(x) ≥ 0 in the form −g(x) ≤ 0.

2. Assert that the functions f , h₁, . . . , h_ℓ are C¹ functions on Rⁿ. If the function f
is to be minimized, replace it by − f to obtain a maximization problem.

3. Check if the set

    S = {x ∈ Rⁿ | h_i(x) ≤ 0, 1 ≤ i ≤ ℓ}                              (4.54)

is bounded and hence compact, which ensures the existence of a (global) maximum
of f on S by the Theorem of Weierstrass.

3a. If not, check if the set

    T = S ∩ {x ∈ Rⁿ | f(x) ≥ c}                                       (4.55)

is non-empty and bounded for some c ∈ R, so that f(x) has a maximum on
T, also by Weierstrass, which is also a maximum of f on S because any other
point x in S − T fulfills f(x) < c. If you cannot easily assert these conditions,
f may be unbounded on S, which you should find out. (In fact, if the set T in
(4.55) is non-empty for all c ∈ R, then f is indeed unbounded on S, but you
are not meant to perform a search for the largest such c in order to solve the
maximization problem.)

4. For all 2^ℓ subsets E of {1, . . . , ℓ} as possible "effective" constraints, consider
the set of solutions x so that h_i(x) = 0 for i ∈ E, and h_i(x) < 0 for i ∉ E. For
every such E, do the following.

4a. Determine the gradients Dh_i(x) for i ∈ E and check if they are linearly independent.
For any critical point x where they are not linearly independent the constraint
qualification fails and we have to evaluate f(x) as a possible maximum.

4b. Find solutions x and µ_i for i ∈ E to (4.49) and to the equations h_i(x) = 0 for
i ∈ E. If µ_i ≥ 0 for all i ∈ E, then x is a candidate for a local maximum of f ,
otherwise it is not a local maximizer.
5. Compare the function values of f(x) found in 4b, and of f(x) for the critical
points x in 4a, to determine the global maximum (which may occur for more
than one maximizer).

The main step in this method is Step 4. As an example, we apply Method 4.12
to the problem

    maximize x + y subject to y/2 ≤ x/2 ,  y ≤ 5/4 − x²/4 ,  y ≥ 0 .   (4.56)

In the standard form required by Step 1, this is a problem with n = 2, ℓ = 3,
where f , h₁, h₂, h₃ : R² → R are defined by

    f(x, y) = x + y
    h₁(x, y) = −x/2 + y/2                                             (4.57)
    h₂(x, y) = x²/4 + y − 5/4
    h₃(x, y) = −y

and f(x, y) is to be maximized subject to h_i(x, y) ≤ 0 for i = 1, 2, 3. All functions
are in C¹(R²) as required by Step 2.
This defines the set S in (4.54) shown in Figure 4.13. The conditions (4.56) are
in a more familiar format for drawing in the (x, y)-plane, because they show the
constraints on (x, y) with y as constrained by a function of x (the first inequality
is clearly equivalent to y ≤ x). As the picture shows, the set S is bounded and
hence compact (Step 3; we do not need Step 3a).
From (4.57) we obtain the following gradients which are required in Step 4:

    D f (x, y) = (1, 1)
    Dh₁(x, y) = (−1/2, 1/2)                                           (4.58)
    Dh₂(x, y) = (x/2, 1)
    Dh₃(x, y) = (0, −1) .

There are eight possible subsets E of {1, 2, 3}. If E = ∅, then (4.49) holds if
D f (x, y) = (0, 0), which is never the case. Next, consider the three "corners"
of the set S which are defined when two inequalities are tight, where E has two
elements.

If E = {1, 2} then h₁(x, y) = 0 and h₂(x, y) = 0 hold if x = y and x²/4 + x −
5/4 = 0 or x² + 4x − 5 = 0, that is, (x − 1)(x + 5) = 0 or x ∈ {1, −5}, where only
x = y = 1 fulfills y ≥ 0, so this is the point a = (1, 1) shown in Figure 4.13. In
this case h₃(1, 1) = −1 < 0, so the third inequality is indeed not tight (if it was
then this would correspond to the case E = {1, 2, 3}). Then the two gradients
are Dh₁(1, 1) = (−1/2, 1/2) and Dh₂(1, 1) = (1/2, 1), which are not scalar multiples
of each other and therefore linearly independent, so the constraint qualification
Figure 4.13 The set S defined by the constraints in (4.57). The triple short lines
next to each line defined by h_i(x, y) = 0 (for i = 1, 2, 3) show the side
where h_i(x, y) ≤ 0 holds, abbreviated as h_i ≤ 0. The (infinite) set
C is the cone of all nonnegative linear combinations of the gradients
Dh₁(x, y) and Dh₂(x, y) for the point a = (x, y) = (1, 1), which does
not contain D f (x, y), so f(a) is not a local maximum. At the point
b = (2, 1/4) we have D f (b) = µ₂ Dh₂(b) for the (only) tight constraint
h₂(b) = 0, and µ₂ ≥ 0, so f(b) is a local maximum.

holds. Because these are two linearly independent vectors in R², any vector, in
particular D f (1, 1), can be represented as a linear combination of them. That is,
there are µ₁ and µ₂ with

    D f (1, 1) = (1, 1) = µ₁ Dh₁(1, 1) + µ₂ Dh₂(1, 1) = µ₁(−1/2, 1/2) + µ₂(1/2, 1) ,   (4.59)

which are uniquely given by µ₁ = −2/3, µ₂ = 4/3. Because µ₁ < 0, we do not have
a local maximum of f . We can also see this in the picture: By allowing the constraint
h₁(x, y) ≤ 0 to be non-tight and keeping h₂(x, y) ≤ 0 tight, (x, y) can move along
the line h₂(x, y) = 0 and increase f(x, y) in the direction D f (x, y) (exactly as in
(4.53) in the proof of Theorem 4.11 for a negative µ_j). Figure 4.13 shows the cone C
spanned by Dh₁(a) and Dh₂(a), that is, C = {µ₁ Dh₁(a) + µ₂ Dh₂(a) | µ₁, µ₂ ≥ 0}.
Only gradients D f (a) in that cone "push" the function values of f in such a way
that f is maximized at a. This is not the case here.
If E = {1, 3} then h₁(x, y) = 0 and h₃(x, y) = 0 require x = y and y = 0,
which is the point (0, 0). The two gradients Dh₁(0, 0) and Dh₃(0, 0) in (4.58) are
linearly independent. The unique solution to D f (0, 0) = (1, 1) = µ₁ Dh₁(0, 0) +
µ₃ Dh₃(0, 0) = µ₁(−1/2, 1/2) + µ₃(0, −1) is µ₁ = −2, µ₃ = −2. Here both multipliers
are negative, so then −D f (0, 0) is a positive combination of Dh₁(0, 0) and
Dh₃(0, 0), which shows that f(0, 0) is a local minimum of f (it is easily seen to be the
global minimum).
If E = {2, 3} then h₂(x, y) = 0 and h₃(x, y) = 0 require y = 0 and x²/4 − 5/4 = 0,
or x = √5 (for x = −√5 we would have h₁(x, 0) > 0). Then Dh₂(√5, 0) =
(√5/2, 1) and Dh₃(√5, 0) = (0, −1), which are also linearly independent. The
unique solution to D f (√5, 0) = µ₂ Dh₂(√5, 0) + µ₃ Dh₃(√5, 0), that is, to (1, 1) =
µ₂(√5/2, 1) + µ₃(0, −1), is µ₂ = 2/√5, µ₃ = 2/√5 − 1 < 0, so this point is
also not a local maximum, even though it has the highest value √5 of the three
"corners" of the set S considered so far.
If E = {1, 2, 3} then there is no common solution to the three equations
h_i(x, y) = 0 for i ∈ E because any two of them already have different unique
solutions.

When E is a singleton {i}, the gradient Dh_i(x, y) is always a nonzero vector
and so the constraint qualification holds. If E = {1} then (4.49) requires
D f (x, y) = (1, 1) = µ₁ Dh₁(x, y) = µ₁(−1/2, 1/2), which has no solution µ₁ because
the two gradients are not scalar multiples of each other. The same applies when
E = {3} where D f (x, y) = (1, 1) = µ₃ Dh₃(x, y) = µ₃(0, −1) has no solution µ₃.
However, for E = {2} we do have a solution to Step 4b in Method 4.12. The
equation D f (x, y) = (1, 1) = µ₂ Dh₂(x, y) = µ₂(x/2, 1) has the unique solution
µ₂ = 1 ≥ 0 and x = 2, and h₂(x, y) = 0 is solved for 4/4 + y − 5/4 = 0 or y = 1/4. This
is the point b = (2, 1/4) shown in Figure 4.13. Hence, f(b) is a local maximum of f .
Because it is the only local maximum found, it is also the global maximum of f ,
which exists as confirmed in Step 3.
A rather laborious part of Method 4.12 in this example has been to compute
the multipliers µ_i for i ∈ E in Step 4b and to check their signs. We have done
this in detail to explain the interpretation of these signs, with the cone C shown
in Figure 4.13 at the candidate point a, which fulfills all conditions except for the
sign of µ₁. However, there is a shortcut which avoids computing the multipliers,
by going directly to Step 5, and simply comparing the function values at these
points. In our example, the candidate points were (1, 1), (0, 0), (√5, 0), and
(2, 1/4), with corresponding function values 2, 0, √5, and 9/4, of which the latter is
the largest (we have 9/4 > √5 because (9/4)² = 81/16 > 5).
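Assuming SciPy is available, this added sketch confirms the result numerically;
note that SciPy's inequality constraints use the convention fun(x) ≥ 0, so we
pass −h_i from (4.57):

    from scipy.optimize import minimize

    # maximize x + y  <=>  minimize -(x + y), with -h_i(x, y) >= 0
    cons = [{"type": "ineq", "fun": lambda v: v[0] / 2 - v[1] / 2},        # -h1
            {"type": "ineq", "fun": lambda v: 5/4 - v[0]**2 / 4 - v[1]},   # -h2
            {"type": "ineq", "fun": lambda v: v[1]}]                       # -h3

    res = minimize(lambda v: -(v[0] + v[1]), x0=[0.5, 0.5],
                   constraints=cons, method="SLSQP")
    print(res.x, -res.fun)                 # about (2, 0.25), value 9/4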

We consider a second example that has two local maxima, where one local
maximum has a tight constraint that has a zero multiplier. The problem is again
about a function f : R² → R and says

    maximize f(x, y) = x² − y subject to x ≥ 0 , x² + y² ≤ 1 .         (4.60)

We write the constraints with h₁, h₂ : R² → R as

    h₁(x, y) = −x ≤ 0 ,    h₂(x, y) = x² + y² − 1 ≤ 0 .               (4.61)
Figure 4.14 The maximization problem (4.60). The parabolas are contour lines
of f . The points a = (0, −1) and b = (√3/2, −1/2) are local maximizers
of f , and f(b) is the global maximum. The gradients are
displayed shorter to save space.

The corresponding set S is shown in Figure 4.14 and compact, so f has a maximum.
We have

    D f (x, y) = (2x, −1) ,    Dh₁(x, y) = (−1, 0) ,    Dh₂(x, y) = (2x, 2y) .   (4.62)

There are four possible subsets E of {1, 2} of tight constraints in (4.48). If
E = ∅ then we need D f (x, y) = (0, 0), which is never the case. If E = {1, 2} then
h₁(x, y) = h₂(x, y) = 0 in (4.61) has two solutions (x, y) = (0, 1) and (x, y) =
(0, −1). Then the two gradients Dh₁(x, y) and Dh₂(x, y) are linearly independent
and the constraint qualification holds. For (x, y) = (0, 1) we have D f (0, 1) =
(0, −1) = µ₁ Dh₁(0, 1) + µ₂ Dh₂(0, 1) = µ₁(−1, 0) + µ₂(0, 2), so µ₁ = 0 and µ₂ =
−1/2, so this is a local (and in fact global) minimum of f . For (x, y) = a = (0, −1)
we have D f (0, −1) = (0, −1) = µ₁ Dh₁(0, −1) + µ₂ Dh₂(0, −1) = µ₁(−1, 0) +
µ₂(0, −2), so µ₁ = 0 and µ₂ = 1/2, so f(a) is a local maximum of f . Note that
h₁(a) = 0 but the corresponding multiplier µ₁ for this tight inequality is zero,
which is possible and allowed.
If E = {1} then h₁(x, y) = 0 and D f (x, y) = (2x, −1) = µ₁ Dh₁(x, y) =
µ₁(−1, 0) has no solution µ₁. If E = {2} then h₂(x, y) = 0 and D f (x, y) =
(2x, −1) = µ₂ Dh₂(x, y) = µ₂(2x, 2y) has the following solution: We have x > 0
because the first constraint h₁(x, y) ≤ 0 is not tight, which requires µ₂ = 1 and
hence 2y = −1, that is, y = −1/2. Then h₂(x, y) = 0 means x² + 1/4 − 1 = 0 or
x = √3/2. This is the point b = (√3/2, −1/2) with f(b) = 3/4 + 1/2 = 5/4, which is
larger than f(a) = f(0, −1) = 1, so f(b) is the global maximum of f .
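Here is a small added check of the two steps above with plain NumPy (gradients
taken from (4.62); the names are ours):

    import numpy as np

    def Df(x, y):  return np.array([2 * x, -1.0])
    def Dh2(x, y): return np.array([2 * x, 2 * y])

    b = (np.sqrt(3.0) / 2.0, -0.5)
    print(Df(*b), Dh2(*b))            # equal, so Df(b) = 1 * Dh2(b), mu2 = 1 >= 0

    f = lambda x, y: x**2 - y
    print(f(*b), f(0.0, -1.0))        # 1.25 at b beats 1 at a = (0, -1)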
Finally, we consider an example of the KKT theorem where the constraint
qualification fails, although, unusually for such failures, the set over which f is
optimized does not have a cusp. This example also provides a motivation for the
next chapter on linear optimization. The problem says:

    maximize 2x + y subject to x ≥ 0 , y ≥ 0 , y · (x + y − 1) ≤ 0 .   (4.63)

It is of the form considered in Theorem 4.11 for n = 2, ℓ = 3, where the objective
function f and constraint functions h₁, h₂, h₃, and their gradients, are given as
follows:

    f(x, y) = 2x + y ,                 D f (x, y) = (2, 1)
    h₁(x, y) = −x ≤ 0 ,                Dh₁(x, y) = (−1, 0)
    h₂(x, y) = −y ≤ 0 ,                Dh₂(x, y) = (0, −1)            (4.64)
    h₃(x, y) = y · (x + y − 1) ≤ 0 ,   Dh₃(x, y) = (y, x + 2y − 1) .

The constraint h₃(x, y) ≤ 0 can be understood as follows: It holds as a tight
constraint if h₃(x, y) = 0, that is, y = 0 or x + y − 1 = 0, the latter equation
given by the line y = 1 − x. If y > 0 then h₃(x, y) ≤ 0 only if x + y − 1 ≤ 0,
that is, y ≤ 1 − x, and if y < 0 then h₃(x, y) ≤ 0 only if x + y − 1 ≥ 0, that is,
y ≥ 1 − x, as shown in Figure 4.15. However, the case y < 0 is excluded by the
second constraint y ≥ 0 in (4.63). This means that the set S in (4.54) is given by
the triangle with corners (0, 0), (1, 0), and (0, 1), shown in Figure 4.16.

Figure 4.15 The set of points (x, y) so that h₃(x, y) = y · (x + y − 1) ≤ 0, shown
as the dark area. It stretches infinitely in both directions and is therefore
shown as "torn off" on the left and right.

We first consider the corners of the triangle as possible solutions to the KKT
conditions. (It is obviously much easier to just evaluate the function on those
Figure 4.16 The feasible set and the objective function for the example (4.63),
and the optimal point (1, 0).

corners as in Step 5 of Method 4.12, but we want to check if the KKT theorem
can be applied.) Let (x, y) = (0, 0). Then all three constraints are tight, with E =
{1, 2, 3} in (4.48). The three vectors of derivatives Dh₁(0, 0), Dh₂(0, 0), Dh₃(0, 0)
in R² are necessarily linearly dependent, so the constraint qualification fails
and we have to investigate this critical point as a possible maximum.
For ( x, y) = (1, 0), we have h2 (1, 0) = 0 and h3 (1, 0) = 0 but h1 (1, 0) < 0, so
E = {2, 3} in (4.48). By (4.64), Dh2 (1, 0) = (0, −1) and Dh3 (1, 0) = (0, 0), which
are linearly dependent vectors, so this is another critical point where the constraint
qualification fails.
For (x, y) = (0, 1), the tight constraints are given by h₁(0, 1) = h₃(0, 1) = 0
whereas h₂(0, 1) < 0, so E = {1, 3} in (4.48). By (4.64), Dh₁(0, 1) = (−1, 0) and
Dh₃(0, 1) = (1, 1), which are linearly independent vectors. We want to find µ₁
and µ₃ that are nonnegative so that

    D f (0, 1) = µ₁ Dh₁(0, 1) + µ₃ Dh₃(0, 1) ,

that is,

    (2, 1) = µ₁(−1, 0) + µ₃(1, 1)

which has the unique solution µ₃ = 1 and µ₁ = −1. Because µ₁ < 0, the KKT
conditions (4.46) fail and (x, y) = (0, 1) cannot be a local maximum of f (nor, for
that matter, a minimum, because µ₃ > 0: for a minimum of f , that is, a
maximum of − f , we would need µ₁ ≤ 0 and µ₃ ≤ 0).
For completeness, we consider the cases of fewer tight constraints. E = ∅
would require D f (x, y) = (0, 0), which is never the case. If E = {1} then D f (x, y)
would have to be a scalar multiple of Dh₁(x, y) but it is not, and neither is it a scalar
multiple of Dh₂(x, y) when E = {2}. Consider E = {3}, so h₃(x, y) = 0 is the
only tight constraint, that is, x > 0 and y > 0. Then h₃(x, y) = 0 is equivalent to
x + y − 1 = 0. Then we need µ₃ ≥ 0 so that D f (x, y) = µ₃ Dh₃(x, y), that is,

    (2, 1) = µ₃ (y, x + 2y − 1)

or 2 = µ₃y and 1 = µ₃(x + 2y − 1). The first of these equations means µ₃ = 2/y
(note that y > 0) and the second 1 = (2/y) · (x + 2y − 1), that is, y = 2x + 4y − 2,
or −y = 2(x + y − 1). Because x + y − 1 = 0 due to the constraint h₃(x, y) = 0,
this implies y = 0, a contradiction. Hence, the KKT conditions cannot be fulfilled
when only a single constraint is tight.
It thus remains to investigate the critical points (0, 0) and (1, 0). Clearly, (1, 0)
produces the maximum of f (and (0, 0) the minimum).
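An added snippet illustrating why (1, 0) escapes the KKT conditions: the gradient
Dh₃ from (4.64) vanishes there, so the constraint qualification fails, and we fall
back on comparing the values of f at the critical corners:

    import numpy as np

    def Dh3(x, y):
        return np.array([y, x + 2 * y - 1])   # gradient of h3 from (4.64)

    print(Dh3(1, 0))                  # (0, 0): constraint qualification fails

    f = lambda x, y: 2 * x + y
    for corner in [(0, 0), (1, 0), (0, 1)]:
        print(corner, f(*corner))     # values 0, 2, 1: the corner (1, 0) wins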
Chapter 5

Linear optimization
We now consider certain optimization problems on Rⁿ where the objective function
and all constraints are linear.

So far we have used the letters x, y, z to name variables in small-dimensional
examples, but will from now on use x₁, x₂, x₃ in such examples instead. The reason
is that in our theory x and y will be certain vectors of variables, with y called
"dual variables", so that a different use of x and y in examples would be too confusing.
The examples are therefore slightly harder to read and to write down, but
their correspondence with the general notation will be clearer.

5.1 Linear functions, hyperplanes, and halfspaces
In the example (4.63), we have made our life unnecessarily hard because we could
also have written the third constraint simply as x₁ + x₂ ≤ 1, so that the entire
problem could be written as

    maximize 2x₁ + x₂ subject to x₁ ≥ 0 , x₂ ≥ 0 , x₁ + x₂ ≤ 1 .       (5.1)

In this form, the problem is in the standard inequality form of a linear optimiza-
tion problem, also called linear programming problem, or just linear program. (The
term “programming” was popular in the middle of the 20th century when opti-
mization problems started to be solved with computer programs, with electronic
computers also being developed around the same time.)
A linear function f : Rⁿ → R is of the form

    f(x₁, . . . , x_n) = c₁x₁ + · · · + c_n x_n                        (5.2)

for suitable real coefficients c₁, . . . , c_n. These coefficients define an n-tuple c =
(c₁, . . . , c_n) in Rⁿ. We then write (5.2) as f(x) = c⊤x, which means that both c
and x are considered as column vectors in Rⁿ, which we consider as n × 1 matrices.
Then c⊤ is the corresponding row vector (a 1 × n matrix) and c⊤x is just a
matrix product, which in this case produces a 1 × 1 matrix, which is a real number
that represents the usual scalar product of the vectors c and x in (5.2). For
that reason, we write the multiplication of a vector x with a scalar λ in the form
xλ (rather than as λx) because it is the product of an n × 1 with a 1 × 1 matrix.
This consistency is very helpful in re-grouping products of several matrices and
vectors.

Recall that we write the derivative Dh(x) of a function h as a row vector, so
that we multiply it with a scalar like λ from the left as in λ Dh(x). Also, when we
write x = (x₁, . . . , x_n), say, then this is just meant to define x as an n-tuple of real
numbers and not as a row vector, because otherwise we would always have to
introduce x tediously as x = (x₁, . . . , x_n)⊤. The thing to remember is that when
we use matrix multiplication, then a vector like x is always a column vector and
x⊤ is a row vector.
Let c ≠ 0 (where 0 is the vector with all components zero, in any dimension),
and let f be the linear function defined by f(x) = c⊤x as in (5.2). The set {x |
f(x) = 0} where f takes value 0 is a linear subspace of Rⁿ. By definition, it
consists of all vectors x that are orthogonal to c, that is, have scalar product 0
with c. If n = 2, then this "nullspace" of f is a line, but in general it will be a
"hyperplane" in Rⁿ, a space of dimension n − 1.

More generally, let u ∈ R and consider the set

    H = {x ∈ Rⁿ | f(x) = u} = {x ∈ Rⁿ | c⊤x = u}                       (5.3)

where f takes value u, which we have earlier called a contour set or level set for f .
Then for any two x and x̂ on this level set H, that is, so that f(x) = f(x̂) = u,
we have c⊤(x − x̂) = 0, so that the vector x − x̂ is orthogonal to c. Then H is also
called a hyperplane through the point x (which does not contain the origin 0 unless
u = 0) with normal vector c. To repeat, such a hyperplane H is of the form (5.3)
for some c ∈ Rⁿ, c ≠ 0, and u ∈ R. The different contour sets for f are therefore
parallel hyperplanes, all with the same normal vector c.

Figure 5.1 shows an example of such level sets, where these "hyperplanes"
are contour lines because n = 2. The vector c, here c = (2, −1), is orthogonal to
any such level set. Moreover, c points in the direction in which the function value
of f(x) increases, because if we replace x by x + c then f(x) changes from c⊤x to
f(x + c) = c⊤x + c⊤c, which is larger than c⊤x because c⊤c = ∑_{i=1}^n c_i² > 0 because
c ≠ 0. Note that c may have negative components (as in the figure). Only the
direction of c matters to find out where f(x) gets larger.
Similar to a hyperplane H defined by c and u in (5.3), a halfspace S is defined
by an inequality according to

    S = {x ∈ Rⁿ | c⊤x ≤ u}                                             (5.4)

which consists of all points x that are on the hyperplane H or "below" it, that
is, with smaller values of c⊤x than the points on H. Figure 5.1 shows such a
Figure 5.1 Left: Contour lines (level sets) of the function f : R² → R defined
by f(x) = c⊤x for c = (2, −1). Right: Halfspace S in (5.4) given by
c⊤x ≤ 5.

halfspace S for c = (2, −1) and u = 5, which contains, for example, the point
x = (2.5, 0). It is customary to “shade” the side of the hyperplane H that defines
S with a few small parallel strokes as shown in the picture, and then it is not
needed to indicate c which is the orthogonal vector to H that points away from S.

Figure 5.2 The feasible set and the objective function for the example (5.1), and
the optimal point (1, 0).

With these conventions, Figure 5.2 gives a graphical description of the problem
in (5.1), where the feasible set on which all inequalities hold is the intersection of
the three halfspaces defined by x₁ ≥ 0, x₂ ≥ 0, and x₁ + x₂ ≤ 1. This is the shaded
triangle. In this graphical way, the optimal solution is nearly obvious.

5.2 Linear programming: introduction

The example in this section is taken from J. Matoušek and B. Gärtner (2007),
Understanding and Using Linear Programming (Springer Verlag, Berlin).
Consider the following linear program:

    maximize    x₁ + x₂
    subject to  x₁        ≥ 0
                x₂        ≥ 0
                −x₁ + x₂  ≤ 1                                          (5.5)
                x₁ + 6x₂  ≤ 15
                4x₁ − x₂  ≤ 10 .

The set of points (x₁, x₂) in R² that fulfill these inequalities is called the feasible set
and is shown in Figure 5.3.

Figure 5.3 Feasible set and objective function vector (1, 1) for the LP (5.5), with
optimum at (3, 2) and objective function value 5.

The contour lines of the objective function f(x₁, x₂) = x₁ + x₂ are parallel lines,
where the maximum is clearly at the top-right corner (3, 2) of the feasible set,
where the constraints h₄(x₁, x₂) = x₁ + 6x₂ ≤ 15 and h₅(x₁, x₂) = 4x₁ − x₂ ≤
10 are tight. The fact that this is a local maximum can be seen with the KKT
Theorem 4.11 because there are nonnegative µ₄ and µ₅ so that D f = (1, 1) =
µ₄ Dh₄ + µ₅ Dh₅ = µ₄(1, 6) + µ₅(4, −1), namely µ₄ = 1/5, µ₅ = 1/5. We can write D f
instead of D f (x₁, x₂) because the gradient of a linear function is constant. The
picture shows that (3, 2) is in fact also the global maximum of f . We will see that
the KKT theorem has a simpler version for linear programming, which is called
the duality theorem, where the multipliers µ_i are called dual variables. Moreover,
there will be better ways of finding a maximum than testing all combinations of
possible tight constraints.
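Assuming SciPy is available, the LP (5.5) can also be solved in a few lines (an
added sketch; scipy.optimize.linprog minimizes, so we negate the objective, and
its default bounds already impose x ≥ 0):

    from scipy.optimize import linprog

    # maximize x1 + x2 subject to (5.5): minimize -(x1 + x2), A x <= b, x >= 0
    A = [[-1, 1], [1, 6], [4, -1]]
    b = [1, 15, 10]
    res = linprog(c=[-1, -1], A_ub=A, b_ub=b)  # default bounds give x >= 0
    print(res.x, -res.fun)                     # (3, 2) with objective value 5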
Figure 5.4 Feasible set and objective function vector (1/6, 1) with non-unique
maximum along the side where the constraint x₁ + 6x₂ ≤ 15 is tight.

It can be shown that if the feasible set is bounded, then an optimum of a
linear program can always be found at a corner of the feasible set. However, there
may be more than one corner where the optimum is obtained, and then any point on
the line segment that connects these optimal corners is also optimal. Figure 5.4
shows this with the same constraints as in (5.5), but a different objective function
to be maximized, namely f(x₁, x₂) = (1/6)x₁ + x₂. The corner (3, 2) is also optimal
here, but so is the entire line where f(x₁, x₂) = (1/6)x₁ + x₂ = 5/2 (intersected with the
feasible set), which coincides with the tight constraint x₁ + 6x₂ = 15.
Figure 5.5 shows an example where the feasible set is empty, which is called
an infeasible linear program. This occurs, for example, by reversing two of the
inequalities in (5.5) to obtain the following constraints:

    x₁        ≥ 0
    x₂        ≥ 0
    −x₁ + x₂  ≥ 1                                                      (5.6)
    x₁ + 6x₂  ≤ 15
    4x₁ − x₂  ≥ 10 .

Finally, an optimal solution need not exist even when there are feasible solutions.
This happens when the objective function can attain arbitrarily large values;
such a linear program is called unbounded. This is the case when we remove
the constraints 4x₁ − x₂ ≤ 10 and x₁ + 6x₂ ≤ 15 from the initial example (5.5), as
shown in Figure 5.6.
Figure 5.5 Example of an infeasible set, for the constraints (5.6). Recall that the
little strokes indicate the side where the inequality is valid, and here
there is no point (x₁, x₂) where all inequalities are valid. This would
be the case even without the constraints x₁ ≥ 0 and x₂ ≥ 0.

Figure 5.6 Example of a feasible set with an unbounded objective function.

The pictures shown in this section provide a good intuition of how linear pro-
grams look in principle. However, this graphical method hardly extends beyond
R2 or R3 . Our development of the theory of linear programming will proceed
largely algebraically, with some geometric intuition for the important Lemma 5.5
of Farkas.

5.3 Linear programs and duality

We use the following notation. For positive integers m, n, the set of m × n matrices
is denoted by Rm×n . An n-vector is an element of Rn . Unless stated otherwise, all
vectors are column vectors, so a vector x in Rn is considered as an n × 1 matrix.
Its transpose x > is the corresponding row vector in R1×n . The components of an
n-vector x are x1 , . . . , xn . The vectors 0 and 1 have all components equal to zero
and one, respectively, and have suitable dimension, which may vary with each
use of 0 or 1. An inequality between vectors like x ≥ 0 holds for all components.
The identity matrix, of any dimension, is denoted by I.
A linear optimization problem or linear program (LP) says: optimize (max-
imize or minimize) a linear objective function subject to linear constraints (in-
equalities or equalities).
The standard inequality form of an LP is given by an m × n matrix A, an m-
vector b and an n-vector c and says:

maximize c>x
subject to Ax ≤ b , (5.7)
x ≥ 0.

Here x is the n-vector ( x1 , . . . , xn )> of variables. The vector c of coefficients
determines the objective function. The matrix A and the vector b determine the m linear
constraints, which are here only inequalities. Furthermore, all variables x1 , . . . , xn
are constrained to be nonnegative; this is stated as x ≥ 0 separately from Ax ≤ b
because nonnegative variables are a standard case.
The LP (5.7) states maximization subject to “upper bounds” (inequalities “≤”).
A way to remember this is to assume that the components of A, b and c are pos-
itive (they do not have to be), so that maximizing c>x is restricted by the “upper
bounds” imposed by Ax ≤ b. If A has only positive entries, then Ax ≤ b can only
be fulfilled for nonnegative x if b ≥ 0.
In general, the LP (5.7) is called feasible if Ax ≤ b and x ≥ 0 hold for some x,
otherwise infeasible. If c>x can be arbitrarily large for suitable x subject to these
constraints, then the LP is called unbounded, otherwise bounded. The LP can have
an optimal solution only if it is feasible and bounded. We will see that in that case
it has an optimal solution even though the feasible set { x ∈ Rn | Ax ≤ b, x ≥ 0}
is in general not compact.

Example 5.1 Consider (5.7) with


" # " #
3 4 2 7 h i
A= , b= , c> = 8 10 5 ,
1 1 1 2

which can be stated explicitly as: for x1 , x2 , x3 ≥ 0 subject to

3x1 + 4x2 + 2x3 ≤ 7
 x1 +  x2 +  x3 ≤ 2        (5.8)
------------------------
maximize 8x1 + 10x2 + 5x3 .

The horizontal line is often written to separate the objective function from the
constraints.

One feasible solution to (5.8) is x1 = 0, x2 = 1, x3 = 1 with objective function
value 15. Another is x1 = 1, x2 = 1, x3 = 0 with objective function value 18,
which is better. (We often choose integers in our examples, but coefficients and
variables are allowed to assume any real values.) How do we know when we
have an optimal value?
The dual of an LP can be motivated by finding an upper bound to the objective
function of the given LP (which is called the primal LP). The dual LP results by
reading the constraint matrix vertically rather than horizontally, exchanging the
roles of objective function and right-hand side, as follows.
In the example (5.8), we multiply each of the two inequalities by some non-
negative number, for example the first inequality by y1 = 1 and the second by
y2 = 6, and add the inequalities up, which yields

(3 + 6) x1 + (4 + 6) x2 + (2 + 6) x3 ≤ 7 + 6 · 2

or
9x1 + 10x2 + 8x3 ≤ 19.
In this inequality, which holds for any feasible solution, all coefficients of the non-
negative variables x j are at least as large as in the primal objective function, so the
right-hand side 19 is certainly an upper bound for this objective function. In fact,
we can obtain an even better bound by multiplying the two primal inequalities
by y1 = 2 and y2 = 2, getting

(3 · 2 + 2) x1 + (4 · 2 + 2) x2 + (2 · 2 + 2) x3 ≤ 2 · 7 + 2 · 2

or
8x1 + 10x2 + 6x3 ≤ 18.
Again, all coefficients are at least as large as in the primal objective function. Thus,
it cannot be larger than 18, which was achieved by the above solution x1 = 1,
x2 = 1, x3 = 0, which is therefore optimal.
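This value can also be confirmed with an LP solver. The following sketch (an illustration only; it assumes the SciPy library is installed) hands (5.8) to scipy.optimize.linprog, which minimizes, so the objective vector is negated:

    import numpy as np
    from scipy.optimize import linprog

    # LP (5.8): maximize 8x1 + 10x2 + 5x3 subject to
    # 3x1 + 4x2 + 2x3 <= 7, x1 + x2 + x3 <= 2, x >= 0.
    A = np.array([[3.0, 4.0, 2.0],
                  [1.0, 1.0, 1.0]])
    b = np.array([7.0, 2.0])
    c = np.array([8.0, 10.0, 5.0])

    # linprog minimizes, so maximize c>x by minimizing (-c)>x;
    # the bounds x >= 0 (the solver's default) are stated explicitly.
    res = linprog(-c, A_ub=A, b_ub=b, bounds=[(0, None)] * 3)
    print(res.x, -res.fun)  # expected: x = (1, 1, 0) with value 18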
In general, the dual LP for the primal LP (5.7) is obtained as follows:

• Multiply each primal inequality by some nonnegative number yi (so as not
  to reverse the inequality).

• Sum the resulting entries of each of the n columns and require that the re-
sulting coefficient of x j for j = 1, . . . , n is at least as large as the coefficient
c j of the objective function. (Because x j ≥ 0, this will at most increase the
objective function.)
• Minimize the resulting right-hand side y1 b1 + · · · + ym bm (because it is an
upper bound for the primal objective function).

So the dual of (5.7) says:


minimize   y>b
subject to y>A ≥ c> ,    y ≥ 0 .        (5.9)
Clearly, (5.9) is also an LP in standard inequality form because it can be written
as: maximize −b>y subject to − A>y ≤ −c, y ≥ 0 . In that way, it is easy to see
that the dual LP of the dual LP (5.9) is again the primal LP (5.7).
A good way to simultaneously picture the primal and dual LP (which are
defined by the same data A, b, c) is the following “Tucker diagram”:
                    x ≥ 0

     y ≥ 0     [      A      ]   ≤   b                 (5.10)
                      ∨                   ,→ min
                      c>         → max
The diagram (5.10) shows the m × n matrix A with the m-vector b on the right and
the row vector c> at the bottom. The top shows the primal variables x with their
constraints x ≥ 0. The left-hand side shows the dual variables y with their con-
straints y ≥ 0. The primal LP is to be read horizontally, with constraints Ax ≤ b,
and the objective function c>x that is to be maximized. The dual LP is to be read
vertically, with constraints y>A ≥ c> (where in the diagram (5.10) ≥ is written
vertically as ∨ ), and the objective function y>b that is to be minimized. A way
to remember the direction of the inequalities is to see that one inequality Ax ≤ b
points “towards” A and the other, y>A ≥ c>, “away from” A, where maximiza-
tion is subject to upper bounds and minimization subject to lower bounds, apart
from the nonnegativity constraints for x and y.
The fact that the primal and dual objective functions are mutual bounds is
known as the “weak duality” theorem, which is very easy to prove – essentially
in the way we have motivated the dual LP above.

Theorem 5.2 (Weak LP duality) For a pair x, y of feasible solutions of the primal LP
(5.7) and its dual LP (5.9), the objective functions are mutual bounds:
c>x ≤ y>b .
If thereby c>x = y>b (equality holds), then these two solutions are optimal for both LPs.

Proof. In general, if u, v, w are vectors of the same dimension, then


u ≥ 0, v ≤ w ⇒ u>v ≤ u>w , (5.11)
because v ≤ w is equivalent to (w − v) ≥ 0 which with u ≥ 0 implies u>(w − v) ≥
0 and hence u>v ≤ u>w; note that this is an inequality between scalars which can
also be written as v>u ≤ w>u.
Feasibility of x for (5.7) and of y for (5.9) means Ax ≤ b, x ≥ 0, y>A ≥ c>,
y ≥ 0. Using (5.11), this implies
c>x ≤ (y>A) x = y>( Ax ) ≤ y>b
as claimed.
If c>x ∗ = (y∗ )>b for some primal feasible x ∗ and dual feasible y∗ , then c>x ≤
(y∗ )>b = c>x ∗ for any primal feasible x, and y>b ≥ c>x ∗ = (y∗ )>b for any dual
feasible y, so equality of the objective functions implies optimality.

The following “strong duality” theorem is the central theorem of linear pro-
gramming.

Theorem 5.3 (Strong LP duality) Whenever both the primal LP (5.7) and its dual LP
(5.9) are feasible, they have optimal solutions with equal value of their objective functions.

We will prove this theorem in Section 5.4. Its proof is not trivial. In fact, many
theorems in economics have a hidden LP duality so that they can be proved by
writing down a suitable LP and interpreting its dual LP. For that reason, Theo-
rem 5.3 is extremely useful.

5.4 Lemma of Farkas and proof of strong LP duality


In this section, we state and prove Farkas’s Lemma, also known as the theorem of
the separating hyperplane, and use it to prove Theorem 5.3.
The Lemma of Farkas is concerned with the question of finding a nonnegative
solution x to a system Ax = b of linear equations. If A = [ A1 · · · An ], this means
that b is a nonnegative linear combination A1 x1 + · · · + An xn of the columns of A.
We first observe that such linear combinations can always be obtained by taking
suitable linearly independent columns of A.

Lemma 5.4 Let A = [ A1 · · · An ] ∈ Rm×n , let b ∈ Rm , and let


C = { Ax | x ∈ Rn , x ≥ 0}. (5.12)
Then if b ∈ C, there is a set J ⊆ {1, . . . , n} so that the vectors A j for j ∈ J are linearly
independent, and there are unique positive reals x j for j ∈ J so that

∑j∈ J A j x j = b .        (5.13)

Proof. Because b ∈ C, we have Ax = b for some x ≥ 0, so that (5.13) holds
with J = { j | x j > 0}. If the vectors A j for j ∈ J are linearly independent (in
which case we simply call J independent), we are done. Otherwise, we change
the coefficients x j by keeping them nonnegative but so that at least one of them
becomes zero, which gives a smaller set J.
Suppose J is not independent, that is, there are scalars z j for j ∈ J, not all zero,
so that
∑j∈ J A j z j = 0 ,

where we can assume that the set S = { j ∈ J | z j > 0} is not empty (otherwise
replace z by −z). Then
∑j∈ J A j ( x j − z j α) = b

for any α. We choose the largest α so that x j − z j α ≥ 0 for all j ∈ J. If z j ≤ 0 this
imposes no constraint on α, but for z j > 0 (that is, j ∈ S) this means x j /z j ≥ α, so
the largest α fulfilling all these constraints is given by
α = min{ x j /z j | j ∈ S } =: xi /zi        (5.14)
which implies xi − zi α = 0 so that we can remove any i that achieves the mini-
mum in (5.14) from J. By replacing x j with x j − αz j , we thus obtain a smaller set
J to represent b as in (5.13). By continuing in this manner, we eventually obtain
an independent set J as claimed (if b = 0, then J is the empty set). Because the
vectors A j for j ∈ J are linearly independent, the scalars x j for j ∈ J are unique.
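The proof is constructive and translates directly into code. The following Python sketch (our illustration with a hypothetical helper name, assuming numpy) repeatedly finds a linear dependence among the used columns and applies the step size α from (5.14) until the support of x is independent:

    import numpy as np

    def reduce_to_independent(A, x, tol=1e-9):
        """Given A @ x = b with x >= 0, shrink the support of x until the
        used columns of A are linearly independent (proof of Lemma 5.4).
        Returns a new x >= 0 with the same value of A @ x."""
        x = x.astype(float).copy()
        while True:
            J = [j for j in range(A.shape[1]) if x[j] > tol]
            AJ = A[:, J]
            if len(J) == 0 or np.linalg.matrix_rank(AJ) == len(J):
                return x  # columns with positive coefficients are independent
            # Find z != 0 with AJ @ z = 0: a right-singular vector of AJ
            # for the smallest singular value spans the null space.
            z = np.linalg.svd(AJ)[2][-1]
            if z.max() <= tol:            # make S = {j | z_j > 0} nonempty
                z = -z
            S = [k for k in range(len(J)) if z[k] > tol]
            alpha = min(x[J[k]] / z[k] for k in S)   # step size as in (5.14)
            for k, j in enumerate(J):
                x[j] -= alpha * z[k]
            x[np.abs(x) <= tol] = 0.0     # the minimizing index leaves J

    A = np.array([[1.0, 0.0, 1.0],
                  [0.0, 1.0, 1.0]])       # third column = sum of first two
    print(reduce_to_independent(A, np.array([1.0, 1.0, 1.0])))
    # prints [0. 0. 2.] or [2. 2. 0.], depending on the sign of z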
The set C in (5.12) of nonnegative linear combinations of the column vectors
of A is also called the cone generated by these vectors. Figure 5.7 gives an exam-
ple with A ∈ R2×4 and a vector b ∈ R2 that is not in the cone C generated by
A1 , A2 , A3 , A4 .

Figure 5.7 Left: Vectors A1 , A2 , A3 , A4 , the cone C generated by them (which
           extends to infinity between the two “rays” that extend A3 and A2 ),
           and a vector b not in C. Right: A separating hyperplane H for b with
           normal vector y = v − b.

The right diagram in Figure 5.7 shows a vector y so that y>A j ≥ 0 for all j with
1 ≤ j ≤ n, and y>b < 0. The set H = {z ∈ Rm | y>z = 0} is called a separating
hyperplane with normal vector y because all vectors A j are on one side of H (they
fulfill y>A j ≥ 0, which includes the case y>A j = 0 where A j belongs to H, like
A2 in Figure 5.7), whereas b is strictly on the other side of H because y>b < 0.
Farkas’s Lemma asserts that such a separating hyperplane exists for any b that
does not belong to C.

Lemma 5.5 (Farkas) Let A ∈ Rm×n and b ∈ Rm . Then exactly one of the following
statements holds:
(a) ∃ x ∈ Rn : x ≥ 0, Ax = b, or
(b) ∃y ∈ Rm : y>A ≥ 0>, y>b < 0.

In Lemma 5.5, it is clear that (a) and (b) cannot both hold because if (a) holds,
then y>A ≥ 0> implies y>b = y>( Ax ) = (y>A) x ≥ 0.
If (a) is false, that is, b does not belong to the cone C in (5.12), then y can be
constructed by the following intuitive geometric argument: Take a vector v in C
that is closest to b (see Figure 5.7), and let y = v − b. We will show that y fulfills
the conditions in (b).
Apart from this geometric argument, an important part of the proof is to show
that the cone C is closed, that is, it contains any point nearby. Otherwise, b could be
a point near C but not in C which would mean that the distance kv − bk for any v
in C can be arbitrarily small, where kzk denotes the Euclidean norm, kzk = √(z>z).
In that case, one could not define y as described. We first show, as a separate
property, that C is closed.

Figure 5.8 Illustration of the proof of Lemma 5.6 where J = {2} since v(k) for
           large k is a positive linear combination of A2 only.

Lemma 5.6 For an m × n matrix A = [ A1 · · · An ], the cone C in (5.12) is a closed set.

Proof. Let b be a point in Rm near C, that is, for all ε > 0 there is a v in C so
that kv − bk < ε. Consider a sequence v(k) (for k = 1, 2, . . .) of elements of C that
converges to b. By Lemma 5.4, there exists for each k a subset J (k) of {1, . . . , n} and
unique positive real numbers xj(k) for j ∈ J (k) so that the columns A j for j ∈ J (k)
are linearly independent and

v(k) = ∑j∈ J (k) A j xj(k) .

There are only finitely many different sets J (k) , so there is a set J that appears
infinitely often among them (see Figure 5.8 for an example). We consider the
subsequence of the vectors v(k) that use this set, that is,

v(k) = ∑j∈ J A j xj(k) = A J xJ(k)        (5.15)

where A J is the matrix with columns A j for j ∈ J and xJ(k) is the vector with
components xj(k) for j ∈ J. Now, xJ(k) in (5.15) is a continuous function of v(k) : In
order to see this, consider a set I of | J | linearly independent rows of A J , let A I J be
the square submatrix of A J with these rows and let vI(k) be the subvector of v(k)
with these rows, so that xJ(k) = ( A I J )−1 vI(k) in (5.15). Hence, as v(k) converges to b,
the | J |-vector xJ(k) converges to some xJ* with b = A J xJ* , where xJ(k) > 0 implies
xJ* ≥ 0, which shows that b ∈ C. So C is closed.

Remark 5.7 In Lemma 5.6, it is important that C is the cone generated by a finite set
A1 , . . . , An of vectors. The cone generated from an infinite set may not be closed. For
example, let C be the set of nonnegative linear combinations of the vectors (n, 1) in R2 ,
for n = 0, 1, 2, . . .. Then (1, 0) is a vector near C that does not belong to C.

⇒ It is an exercise to prove Remark 5.7, by giving an exact description of C.

Proof of Lemma 5.5. Let A = [ A1 · · · An ] and C as in (5.12). Assume that (a) is
false, that is, b ∉ C; we have to prove that (b) holds. Choose some v ∈ C so that
kv − bk is minimal (v happens to be unique but this is irrelevant). This minimum
exists for the following reason: Let X = {v ∈ C | kv − bk ≤ kbk}. The set X
(shown in Figure 5.9) is nonempty (it contains 0) and compact because it is the
intersection of C with the compact ball with radius kbk around b, where C is a
closed set by Lemma 5.6. The function v 7→ kv − bk on X is continuous and
assumes its minimum by the Weierstrass Theorem, which is also the minimum of
kv − bk when taken over all v in C.
Then kv − bk > 0 because b ∉ C and thus v ≠ b, and thus kv − bk2 =
(v − b)>(v − b) > 0, that is,
(v − b)>v > (v − b)>b . (5.16)
Let y = v − b. We show that for all c in C we have
y>c ≥ y>v , that is, (v − b)>c ≥ (v − b)>v . (5.17)

Figure 5.9 Proof why a point c with y>(c − v) < 0, that is, (5.18), creates a point
           cε closer to b than v, where cε is a convex combination of v and c. This
           implies c ∉ C.

If (5.17) were false, then
(v − b)>(c − v) < 0 (5.18)
for some c in C. Consider the convex combination of c and v given by cε =
v + (c − v)ε = v(1 − ε) + cε which belongs to C, because C is clearly a convex set
(see Section 5.8 below for more details on convexity). Then
 
kcε − bk2 = kv − bk2 + ε ( 2(v − b)>(c − v) + ε kc − vk2 )

which by (5.18) is less than kv − bk2 for sufficiently small positive ε, which con-
tradicts the minimality of kv − bk2 for v ∈ C (see also Figure 5.9). So (5.17) holds.
In particular, for c = A j + v we have y>( A j + v) ≥ y>v and thus y>A j ≥ 0,
for 1 ≤ j ≤ n, that is, y>A ≥ 0.
Equations (5.17) and (5.16) imply (v − b)>c > (v − b)>b, that is, y>c > y>b,
for all c ∈ C. For c = 0 this shows 0 > y>b.

Lemma 5.5 is concerned with finding nonnegative solutions x to a system of
equations Ax = b. We will use a closely related “inequality form” of this lemma
that concerns nonnegative solutions x to a system of inequalities Ax ≤ b.

Lemma 5.8 (Farkas with inequalities) Let A ∈ Rm×n and b ∈ Rm . Then exactly
one of the following statements holds:
(a) ∃ x ∈ Rn : x ≥ 0, Ax ≤ b, or
(b) ∃y ∈ Rm : y ≥ 0, y>A ≥ 0>, y>b < 0 .

Proof. Clearly, there is a vector x so that Ax ≤ b and x ≥ 0 if and only if there are
x ∈ Rn and s ∈ Rm with

Ax + s = b, x ≥ 0, s ≥ 0. (5.19)

The system (5.19) is a system of equations as in Lemma 5.5 with the matrix [ A I ]
instead of A, where I is the m × m identity matrix, and vector ( x, s) instead of x.
The condition y>[ A I ] ≥ 0> in Lemma 5.5(b) is then simply y>A ≥ 0>, y ≥ 0 as
stated here in (b).
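As an illustration, an LP solver can be used to produce one of the two alternatives of Lemma 5.8. In the following sketch (our own construction, assuming SciPy), the box 0 ≤ y ≤ 1 is added only to keep the certificate LP bounded; scaling a certificate shows that one exists in the box if and only if one exists at all:

    import numpy as np
    from scipy.optimize import linprog

    def farkas_alternative(A, b):
        """Return ('a', x) with A @ x <= b, x >= 0, or
        ('b', y) with y >= 0, y.T @ A >= 0, y.T @ b < 0 (Lemma 5.8)."""
        m, n = A.shape
        # Case (a): plain feasibility test with a zero objective.
        res = linprog(np.zeros(n), A_ub=A, b_ub=b, bounds=[(0, None)] * n)
        if res.status == 0:
            return 'a', res.x
        # Case (b): minimize y.T @ b subject to A.T @ y >= 0, 0 <= y <= 1.
        # If (a) is infeasible, the optimal value is negative and the
        # optimal y is a separating certificate.
        res = linprog(b, A_ub=-A.T, b_ub=np.zeros(n), bounds=[(0, 1)] * m)
        return 'b', res.x

    # The infeasible constraints (5.6), with the two ">=" rows negated:
    A = np.array([[1.0, -1.0], [1.0, 6.0], [-4.0, 1.0]])
    b = np.array([-1.0, 15.0, -10.0])
    print(farkas_alternative(A, b))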

We can now prove the strong duality theorem.


Proof of Theorem 5.3. We assume that (5.7) and (5.9) are feasible, and want to show
that there are feasible x and y so that c>x ≥ y>b, which by Theorem 5.2 implies
c>x = y>b. Suppose, to the contrary, that there are x, y so that

x ≥ 0, Ax ≤ b, y ≥ 0, y>A ≥ c> (5.20)

but no solution x, y to the system of n + m + 1 inequalities

− A>y ≤ − c
Ax ≤ b (5.21)
−c>x + b>y ≤ 0

and x, y ≥ 0. By Lemma 5.8, this means there are u ∈ Rm , v ∈ Rn , and t ∈ R so
that

u ≥ 0, v ≥ 0, t ≥ 0, v>A − tc> ≥ 0>, −u>A> + tb> ≥ 0>, (5.22)

but
−u>c + v>b < 0 . (5.23)

We derive a contradiction as follows: If t = 0, this means that already the first
n + m inequalities in (5.21) have no nonnegative solution x, y, contrary to our
assumption. To show this directly, t = 0 implies v>A ≥ 0, u>A> ≤ 0, which with
(5.20) implies
v>b ≥ v>Ax ≥ 0 ≥ u>A>y ≥ u>c

in contradiction to (5.23).
If t > 0, then u and v are essentially primal and dual feasible solutions that
violate weak LP duality, because then by (5.22) bt ≥ Au and v>A ≥ tc> and
therefore
v>b t ≥ v>Au ≥ tc>u

which after division by t gives v>b ≥ c>u, again contradicting (5.23).


So if the first n + m inequalities in (5.21) have a solution x, y ≥ 0, then there
is also a solution that fulfills the last inequality, as claimed by strong LP duality.

5.5 Boundedness and dual feasibility

So far, the strong duality Theorem 5.3 makes only a statement when both primal
and dual LP are feasible. In principle, it could be the case that the primal LP has
an optimal solution while its dual is not feasible. The following theorem excludes
this possibility. Its proof is a typical application of Theorem 5.3 itself.

Theorem 5.9 (Boundedness implies dual feasibility) Suppose the primal LP (5.7)
is feasible. Then its objective function is bounded if and only if the dual LP (5.9) is
feasible.

Proof. By weak duality (Theorem 5.2), if the dual LP has a feasible solution y, then
its objective function y>b provides an upper bound for the primal objective func-
tion c>x. Conversely, suppose that the dual LP (5.9) is infeasible, and consider
the following LP which uses an additional real variable t and the vector 1 which
has all components equal to 1:

minimize t
(5.24)
subject to y>A + t1> ≥ c>, y ≥ 0, t ≥ 0.

This LP is clearly feasible by setting y = 0 and t = max{0, c1 , . . . , cn }. Also, the
constraints y>A ≥ c>, y ≥ 0 of (5.9) have no solution if and only if the optimum
value of (5.24) has t > 0, which we assume to be the case. The LP (5.24) is the
dual LP to the following LP, which we write with variables z ∈ Rn :

maximize c>z
subject to Az ≤ 0 ,
(5.25)
1>z ≤ 1 ,
z≥0.

This LP is also feasible with z = 0. By strong duality, it has the same value as its
dual LP (5.24), which is positive, given by c>z = t > 0 for some z that fulfills the
constraints in (5.25). Consider now a feasible solution x to the original primal LP,
that is, Ax ≤ b, x ≥ 0, and let α ∈ R, α ≥ 0. Then A( x + zα) = Ax + Azα ≤
b + 0α = b and x + zα ≥ 0, so x + zα is also a feasible solution to (5.7) with
objective function value c>( x + zα) = c>x + (c>z)α which gets arbitrarily large
with growing α. So the original LP is unbounded. This proves the theorem.

An alternative way of stating the preceding theorem, for the dual LP, is as
follows.

Corollary 5.10 Suppose the dual LP (5.9) is feasible. Then the primal LP (5.7) is infea-
sible if and only if the objective function of the dual LP (5.9) is unbounded.

Proof. This is just an application of Theorem 5.9 with dual and primal exchanged:
Rewrite (5.9) as a primal LP in the form: maximize −b>y subject to − A>y ≤ −c,
y ≥ 0, so that its dual is: minimize − x >c subject to − x >A> ≥ −b>, x ≥ 0, which
is the same as (5.7), and apply Theorem 5.9.

On the other hand, the fact that one LP is infeasible does not imply that its
dual LP is unbounded, because both could be infeasible.

Remark 5.11 It is possible that both the primal LP (5.7) and its dual LP (5.9) are infea-
sible.

Proof. Consider the primal LP

maximize x2
subject to x1 ≤ −1
x1 − x2 ≤ 1
x1 , x2 ≥ 0,

which is clearly infeasible, as is its dual LP

minimize − y1 + y2
subject to y1 + y2 ≥ 0
− y2 ≥ 1
y1 , y2 ≥ 0 .
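Both failures can be confirmed mechanically; a sketch (assuming SciPy, whose status code 2 reports infeasibility):

    import numpy as np
    from scipy.optimize import linprog

    # Primal: maximize x2 s.t. x1 <= -1, x1 - x2 <= 1, x >= 0.
    primal = linprog([0.0, -1.0],
                     A_ub=np.array([[1.0, 0.0], [1.0, -1.0]]),
                     b_ub=np.array([-1.0, 1.0]),
                     bounds=[(0, None)] * 2)
    # Dual: minimize -y1 + y2 s.t. y1 + y2 >= 0, -y2 >= 1, y >= 0,
    # with both ">=" constraints negated into "<=" form.
    dual = linprog([-1.0, 1.0],
                   A_ub=np.array([[-1.0, -1.0], [0.0, 1.0]]),
                   b_ub=np.array([0.0, -1.0]),
                   bounds=[(0, None)] * 2)
    print(primal.status, dual.status)  # expected: 2 2 (both infeasible)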

                           primal
                 optimal    unbounded    infeasible
  dual
  optimal          yes         no            no
  unbounded        no          no            yes
  infeasible       no          yes           yes
Table 5.1 The possibilities for primal and dual LP, where “optimal” means the
LP is feasible and bounded and then has an optimal solution, and
“unbounded” means the LP is feasible but its objective function is
unbounded.

Table 5.1 shows the four possibilities that can occur for the primal LP and its
dual: both have optimal solutions, one is infeasible and the other unbounded, or
both are infeasible. If one LP is feasible, its dual cannot be unbounded by weak
duality (Theorem 5.2), and if it has an optimal solution then its dual cannot be
infeasible by Theorem 5.9.

Table 5.1 does not state the equality of primal and dual objective functions
when both have optimal solutions, but it does state Corollary 5.10. We show that
this implies Farkas’s Lemma 5.8 for inequalities that we have used to prove the
strong duality Theorem 5.3. Consider the LP

maximize 0
subject to Ax ≤ b , (5.26)
x ≥ 0.
with its Tucker diagram

                    x ≥ 0

     y ≥ 0     [      A      ]   ≤   b                 (5.27)
                      ∨                   ,→ min
                      0>         → max

Its dual LP: minimize y>b subject to y>A ≥ 0>, y ≥ 0, is feasible with y = 0.
The LP (5.26) is feasible if and only if there is a solution x to the inequalities
Ax ≤ b, x ≥ 0. By Corollary 5.10, there is no such solution if and only if the
dual is unbounded, that is, assumes an arbitrarily negative value of its objective
function y>b. This is equivalent to the existence of some y ≥ 0 with y>A ≥ 0>
and y>b < 0 which can then be made arbitrarily negative by replacing y with
yα for any α > 0. This proves Lemma 5.8, which can therefore be remembered
with the Tucker diagram (5.27) and Corollary 5.10. So the possibilities described
in Table 5.1 capture the important theorems of LP duality.

5.6 General LP duality


We have stated the strong duality Theorem 5.3 for the standard inequality form
of an LP where both primal and dual LP have inequalities with separately stated
nonnegativity constraints for the primal and dual variables. In this section, we
consider the general form of an LP, which offers greater flexibility in applying the
duality theorem in various contexts. A general LP allows both for linear inequali-
ties and linear equations as constraints, and for variables that can be nonnegative
or take arbitrary real values. These cases are closely related with respect to the du-
ality property. As we will see, a primal equality constraint corresponds to a dual
variable that is unrestricted in sign, and a primal variable that is unrestricted in
sign gives rise to a dual constraint that is an equality. The other case, which we
have already seen, is a primal inequality that corresponds to a dual variable that
is nonnegative, or a primal nonnegative variable where the corresponding dual
constraint is an inequality.

In the following, the matrix A and vectors b and c will always have dimen-
sions A ∈ Rm×n , b ∈ Rm , and c ∈ Rn . These data A, b, c will simultaneously
define a primal LP with variables x in Rn , and a dual LP with variables y in
Rm . In the primal LP, we compare Ax with the right-hand side b, and maximize
the objective function c>x. In the dual LP, we compare y>A with the right-hand
side c>, and minimize the objective function y>b. In a general LP, “compare” can
mean inequalities or equations (or a mixture of the two). The corresponding dual
or primal variable will be nonnegative or unconstrained.
We first consider a primal LP with nonnegative variables and equality con-
straints, which is often called an LP in equality form:

maximize c>x
subject to Ax = b , (5.28)
x ≥ 0.

The corresponding dual LP can be motivated, as in Section 5.3, by trying to find
an upper bound for the primal objective function. That is, each constraint in
Ax = b is multiplied with a dual variable yi . However, because the constraint is
an equality, the variable yi can be unrestricted in sign. The dual constraints are
inequalities because the primal objective function has nonnegative variables x j .
That is, the dual LP to (5.28) is

minimize y>b
(5.29)
subject to y>A ≥ c>.

This motivation is just the weak duality theorem, which is immediate: it says that
for feasible solutions x and y to (5.28) and (5.29) we have x ≥ 0 and y>A − c> ≥ 0
and thus
c>x ≤ y>A x = y>b. (5.30)

The dual LP (5.29) has unconstrained variables y subject to inequality con-
straints. As a primal LP, this is of the form

maximize c>x
(5.31)
subject to Ax ≤ b.

To find the dual LP to the primal LP (5.31), we can again multiply any of the in-
equalities in Ax ≤ b with a variable yi , with the aim of finding an upper bound
to the primal objective function c>x. The inequality is preserved when yi is non-
negative, but in order to obtain an upper bound on the primal objective function
c>x we have to require that y>A = c> because the sign of any variable x j is not
known. That is, the dual to (5.31) is

minimize y>b
(5.32)
subject to y>A = c>, y ≥ 0.

Observe that compared to (5.7) the LP (5.31) is missing the nonnegativity con-
straints x ≥ 0, and that compared to (5.9) the dual LP (5.32) states n equations
y>A = c> rather than inequalities.
Again, the choice of primal and dual LP is motivated by weak duality, which
states that for feasible solutions x to (5.31) and y to (5.32) the corresponding ob-
jective functions are mutual bounds. Including proof, it says

c>x = y>A x ≤ y>b. (5.33)

Hence, we have the following types of pairs of a primal LP and its dual LP,
including the original more symmetric situation of LPs in inequality form:

• a primal LP (5.28) with nonnegative variables and equality constraints, and
  its dual LP (5.29) with unrestricted variables and inequality constraints;

• a primal LP (5.31) with unrestricted variables and inequality constraints,
  and its dual LP (5.32) with nonnegative variables and equality constraints;

• a primal LP (5.7) and its dual LP (5.9), both with nonnegative variables sub-
ject to inequality constraints.

In all cases, by changing signs and transposing the matrix, we see that the dual
of the dual is again the primal.

Remark 5.12 The seemingly missing case of a primal and dual problem with unre-
stricted variables subject to equality constraints, which has no inequality constraints of
any kind, would have as the primal problem: maximize c>x subject to Ax = b, and as its
dual: minimize y>b subject to y>A = c>. Then if and only if both problems are feasible,
any x with Ax = b is primal optimal and any y with y>A = c> is dual optimal. Finding
such solutions is an easy problem of linear algebra and therefore not considered as a linear
optimization problem.

Proof. If Ax = b and y>A = c>, then c>x = y>Ax = y>b, which proves primal
optimality of x (because, given y, it holds for any x) and similarly dual optimality
of y. Such x and y exist if and only if b is in the “column space” of A and c>
is in the “row space” of A, both of which can be determined using Gaussian
elimination. This fails, for example, if Az = 0 but c>z ≠ 0 for some z ∈ Rn ;
then the primal problem is unbounded (similar to (5.25) above) and the dual is
infeasible.

Next, we consider how to convert an LP (5.7) in inequality form into equal-
ity form by means of nonnegative “slack variables” as we have seen it in (5.19),
which is a standard trick in linear programming. The inequality constraints Ax ≤
b can be converted to equality form by introducing a slack variable si for each in-
equality. These slack variables define a nonnegative vector s in Rm . Then (5.7) is
equivalent to:
maximize c>x
subject to Ax + s = b , (5.34)
x, s ≥ 0 .
This amounts to extending the original constraint matrix A to the right by an
m × m identity matrix and adding coefficients 0 in the objective function for the
slack variables, as shown in the following Tucker diagram:
                x ≥ 0          s ≥ 0

   y ∈ Rm   [     A      |       I      ]   =   b              (5.35)
                  ∨              ∨                   ,→ min
                  c>        0 0 · · · 0     → max

Note that converting the inequality form (5.7) to an LP in equality form (5.34)
defines a new dual LP with unrestricted variables y1 , . . . , ym , but the former in-
equalities yi ≥ 0 reappear now explicitly via the identity matrix and objective
function zeros introduced with the slack variables, as shown in (5.35). So the
resulting dual LP is exactly the same as in (5.9).
Even simpler, an LP in inequality form (5.7) can also be seen as the special
case of an LP with unrestricted variables x j as in (5.31) since the condition x ≥ 0
can be written in the form Ax ≤ b by explicitly listing the n inequalities − x j ≤ 0.
That is, Ax ≤ b and x ≥ 0 become with unrestricted x ∈ Rn the m + n inequalities

   [  A ]        [ b ]
   [    ]  x  ≤  [   ]
   [ −I ]        [ 0 ]

with an n × n identity matrix I. The corresponding
dual LP according to (5.32) (easily seen with a suitable Tucker diagram) has an
additional n-vector of slack variables r, say, with the dual constraints y>A − r > =
c>, y ≥ 0, r ≥ 0, which are equivalent to the inequalities y>A ≥ c>, y ≥ 0, again
exactly as in (5.9).
It is useful to consider all these forms of linear programs as a special case of
an LP in general form. Such an LP has inequalities and equalities as constraints, as
well as nonnegative and unrestricted variables. Let J be the subset of the column
set {1, . . . , n} where j ∈ J means that x j ≥ 0, and let K be the subset of the row set
{1, . . . , m} where i ∈ K means that row i is an inequality in the primal constraint
with corresponding dual variable yi ≥ 0 (the letter I denotes an identity matrix so
we use K instead). Let J̄ and K̄ be the sets of unconstrained primal and dual variables,
with corresponding dual and primal equality constraints,

J̄ = {1, . . . , n} − J ,    K̄ = {1, . . . , m} − K .        (5.36)



To define the LP in general form, we first draw the Tucker diagram, shown in
Figure 5.10. The diagram assumes that columns and rows are arranged so that
those in J and K come first. The big boxes contain the respective parts of the
constraint matrix A, the vertical boxes on the right the parts of the right-hand
side b, and the horizontal box at the bottom the parts of the primal objective
function c>.

              x j ≥ 0 ( j ∈ J )     x j ∈ R ( j ∈ J̄ )

   yi ≥ 0
   (i ∈ K )   [                                    ]   ≤
                              A                            b
   yi ∈ R     [                                    ]   =
   (i ∈ K̄ )
                              c>

Figure 5.10 Tucker diagram for an LP in general form.

In order to state the duality theorem concisely, we define the feasible sets X
and Y for the primal and dual LP. The entries of the m × n matrix A are aij in
row i and column j. Let
X = { x ∈ Rn |  ∑nj=1 aij x j ≤ bi ,  i ∈ K,
                ∑nj=1 aij x j = bi ,  i ∈ K̄,        (5.37)
                x j ≥ 0 ,  j ∈ J } .

Any x belonging to X is called primal feasible, and the primal LP is called feasible
if X is not the empty set ∅. The primal LP is the problem

maximize c>x subject to x ∈ X . (5.38)

(This results when reading the Tucker diagram in Figure 5.10 horizontally.) The
corresponding dual LP has the feasible set
Y = { y ∈ Rm |  ∑mi=1 yi aij ≥ c j ,  j ∈ J,
                ∑mi=1 yi aij = c j ,  j ∈ J̄,        (5.39)
                yi ≥ 0 ,  i ∈ K }

                 primal LP                               dual LP

                 constraint                              variable
row i ∈ K       inequality   ∑nj=1 aij x j ≤ bi         nonnegative    yi ≥ 0
row i ∈ K̄       equation     ∑nj=1 aij x j = bi         unconstrained  yi ∈ R

                                                         objective function:
                                                         minimize ∑mi=1 yi bi

                 variable                                constraint
column j ∈ J    nonnegative    x j ≥ 0                  inequality   ∑mi=1 yi aij ≥ c j
column j ∈ J̄    unconstrained  x j ∈ R                  equation     ∑mi=1 yi aij = c j

                 objective function:
                 maximize ∑nj=1 c j x j

Table 5.2   Relationship between inequalities and nonnegative variables, and
            equations and unconstrained variables, for a primal and dual LP in
            general form.

and is the problem

minimize y>b subject to y ∈ Y . (5.40)

(This results when reading the Tucker diagram in Figure 5.10 vertically.) By re-
versing signs, one can verify that the dual of the dual LP is again the primal.
Table 5.2 shows the roles of the sets K, K̄, J, J̄.
For an LP in general form, the strong duality theorem states that (a) for any
primal and dual feasible solutions, the corresponding objective functions are mu-
tual bounds, (b) if the primal and the dual LP both have feasible solutions, then
they have optimal solutions with the same value of their objective functions, (c)
if the primal or dual LP is bounded, the other LP is feasible. This implies the
possibilities shown in Table 5.1.

Theorem 5.13 (General LP duality) For the primal LP (5.38) and its dual LP (5.40),

(a) (Weak duality) c>x ≤ y>b for all x ∈ X and y ∈ Y.

(b) (Strong duality) If X ≠ ∅ and Y ≠ ∅ then c>x = y>b for some x ∈ X and y ∈ Y,
so that both x and y are optimal.

(c) (Boundedness implies dual feasibility) If X ≠ ∅ and c>x for x ∈ X is bounded
above, then Y ≠ ∅. If Y ≠ ∅ and y>b for y ∈ Y is bounded below, then X ≠ ∅.

Proof. We show that an LP in general form can be represented as an LP in inequal-
ity form with nonnegative variables, so that the claims (a), (b), and (c) follow from
Theorems 5.2, 5.3, and 5.9. In order to keep the notation simple, we demonstrate
this first for the special case of an LP (5.28) in equality form and its dual (5.29),
that is, with J = {1, . . . , n} and K̄ = {1, . . . , m}. This LP with constraints Ax = b,
x ≥ 0 is equivalent to
maximize c>x
subject to Ax ≤ b,
(5.41)
− Ax ≤ − b ,
x≥ 0.
The corresponding dual LP uses two m-vectors ŷ and y̲ and says

minimize   ŷ>b − y̲>b
subject to ŷ>A − y̲>A ≥ c>,        (5.42)
           ŷ, y̲ ≥ 0,

or equivalently

minimize   (ŷ − y̲)>b
subject to (ŷ − y̲)>A ≥ c>,        (5.43)
           ŷ, y̲ ≥ 0.

Any solution y to the dual LP (5.29) with unconstrained dual variables y can be
written in the form (5.43) where ŷ represents the “positive” part of y and y̲ the
negated “negative” part of y according to

ŷi = max{yi , 0},    y̲i = max{−yi , 0},    i ∈ K̄,        (5.44)

so that evidently ŷ ≥ 0 and y̲ ≥ 0 and y = ŷ − y̲. That is, any vector y without
sign restrictions can be written as the difference of two nonnegative vectors ŷ and
y̲. These nonnegative vectors are not unique, because their components ŷi and y̲i
when defined by (5.44) (where at least one of them is zero) can be replaced by
ŷi + zi and y̲i + zi for any zi ≥ 0, which leaves ŷi − y̲i unchanged.
That is, any solution to (5.43) and thus (5.42) is a solution to (5.29) and vice
versa. This proves the three claims (a), (b), (c) because they are known for the
primal LP (5.41) and its dual (5.42) which are in inequality form.
The same argument applies in the general case. First, if the set K̄ of equality
constraints is only a subset of {1, . . . , m}, then we write each such row as a pair
of two inequalities as in (5.41), which gives rise to a pair of nonnegative dual
variables ŷi and y̲i for i ∈ K̄. Their difference ŷi − y̲i defines a variable yi without
sign restrictions, and ŷi and y̲i can be obtained from yi according to (5.44).

Suppose this is done so that all rows are inequalities. In a similar way, we
then write an unrestricted primal variable x j for j ∈ J̄ as the difference x̂ j − x̲ j
of two new primal variables that are nonnegative. The jth column A j of A and
the jth component c j of c> are then replaced by a pair A j , − A j and c j , −c j with
coefficients x̂ j , x̲ j , so that aij x j is written as aij x̂ j − aij x̲ j and c j x j as c j x̂ j − c j x̲ j .
For these columns, the resulting pair of inequalities in the dual LP

∑mi=1 yi aij ≥ c j ,    ∑mi=1 (−yi aij ) ≥ −c j

is then equivalent to a dual equation for j ∈ J̄, as stated in (5.39) and Table 5.2.
The claim then follows as before for the known statements for an LP in inequality
form.
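The two rewriting steps used in this proof are entirely mechanical. The following Python sketch (a hypothetical helper, assuming numpy) converts general-form data into the standard inequality form (5.7); solving the converted LP and taking differences of the split components recovers a solution of the general-form LP:

    import numpy as np

    def to_inequality_form(A, b, c, eq_rows, free_cols):
        """Convert a general-form LP (rows in eq_rows are equations,
        columns in free_cols are sign-unrestricted) into the standard
        inequality form (5.7): maximize c>x, Ax <= b, x >= 0."""
        A = np.asarray(A, float)
        b = np.asarray(b, float)
        c = np.asarray(c, float)
        # Each equation row a>x = b_i becomes the pair a>x <= b_i, -a>x <= -b_i.
        A2 = np.vstack([A, -A[list(eq_rows)]])
        b2 = np.concatenate([b, -b[list(eq_rows)]])
        # Each free variable x_j is split as x_j = x_j_plus - x_j_minus,
        # which appends the negated columns and objective entries.
        A2 = np.hstack([A2, -A2[:, list(free_cols)]])
        c2 = np.concatenate([c, -c[list(free_cols)]])
        return A2, b2, c2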

5.7 Complementary slackness

The optimality condition c>x = y>b, already stated in the weak duality Theo-
rem 5.2, is equivalent to a combinatorial condition known as “complementary
slackness”. It states that in each column j and row i at least one of the associated
inequalities in the dual or primal LP has to be tight, that is, hold as an equality. In
a general LP, this is only relevant for the inequality constraints, that is, for j ∈ J
and i ∈ K (see the Tucker diagram in Figure 5.10).

Theorem 5.14 (Complementary slackness) A pair x, y of feasible solutions to the
primal LP (5.7) and its dual LP (5.9) is optimal if and only if

(y>A − c>) x = 0, y>(b − Ax ) = 0, (5.45)

or equivalently for all j = 1, . . . , n and i = 1, . . . , m


( ∑mi=1 yi aij − c j ) x j = 0 ,    yi ( bi − ∑nj=1 aij x j ) = 0 ,        (5.46)

that is,
x j > 0 ⇒ ∑mi=1 yi aij = c j ,    yi > 0 ⇒ ∑nj=1 aij x j = bi .        (5.47)

For an LP in general form (5.38) and its dual (5.40), a feasible pair x ∈ X, y ∈ Y is also
optimal if and only if (5.45) holds, or equivalently (5.46) or (5.47).

Proof. Suppose x and y are feasible for (5.7) and (5.9), so Ax ≤ b, x ≥ 0, y>A ≥ c>,
y ≥ 0. They are both optimal if and only if their objective functions are equal,
c>x = y>b. This means that the two inequalities c>x ≤ y>A x ≤ y>b used to
prove weak duality hold as equalities c>x = y>A x and y>A x = y>b, which are
equivalent to (5.45).
The left equation in (5.45) says
0 = (y>A − c>) x = ∑nj=1 ( ∑mi=1 yi aij − c j ) x j .        (5.48)

Then y>A ≥ c> and x ≥ 0 imply that the sum over j on the right-hand side of
(5.48) is a sum of nonnegative terms, which is zero only if each of them is zero,
as stated on the left in (5.46). Similarly, the second equation y>(b − Ax ) = 0 in
(5.45) holds only if the equations on the right of (5.46) hold for all i. Clearly, (5.47)
is equivalent to (5.46).
For an LP in general form, the feasibility conditions y ∈ Y and x ∈ X with
(5.39) and (5.37) imply
∑mi=1 yi aij = c j for j ∈ J̄,    ∑nj=1 aij x j = bi for i ∈ K̄,        (5.49)

so that (5.46) holds for j ∈ J̄ and i ∈ K̄. Hence, the respective terms in (5.46) are
zero in the scalar products (y>A − c>) x and y>(b − Ax ). These scalar products
are nonnegative because ∑mi=1 yi aij ≥ c j and x j ≥ 0 for j ∈ J, and yi ≥ 0 and
bi ≥ ∑nj=1 aij x j for i ∈ K. So the weak duality proof c>x ≤ y>A x ≤ y>b applies
as well. As before, optimality c>x = y>b is equivalent to (5.45) and thus to (5.46)
and (5.47), which for j ∈ J̄ and i ∈ K̄ hold trivially by (5.49) irrespective of the
sign of x j or yi .

Consider the standard LP in inequality form (5.7) and its dual LP (5.9). The
dual feasibility constraints imply nonnegativity of y>A − c>, which is the n-vector
of “slacks”, that is, of differences in the inequalities y>A ≥ c>; such a slack is
zero in some column if the inequality is tight. The condition (y>A − c>) x = 0
in (5.45) says that this nonnegative slack vector is orthogonal to the nonnegative
vector x, because the scalar product of these two vectors is zero. The conditions
(5.46) and (5.47) state that this orthogonality can hold only if the two vectors are
complementary in the sense that in each component at least one of them is zero.
Similarly, the nonnegative m-vector y and the m-vector of primal slacks b − Ax
are orthogonal in the second equation y>(b − Ax ) = 0 in (5.45). In a compact
way, we can write
y>A ≥ c> ⊥ x≥0
(5.50)
y≥0 ⊥ Ax ≤ b
to state the following:

• all the inequalities in (5.50) have to hold, where those on the left state dual
feasibility and those on the right state primal feasibility, and

• the orthogonality signs ⊥ in the two rows in (5.50) say that the n- and m-
vectors of slacks (differences in these inequalities) have to be orthogonal as
in (5.45) for x and y to be optimal.

In the Tucker diagram (5.10), the first orthogonality in (5.50) refers to the n columns
and the second orthogonality to the m rows. By (5.46) or (5.47), orthogonality
means complementarity in the sense that for each column j or row i at least one
inequality is tight.
We demonstrate with Example 5.1 how complementary slackness is useful in
the search for optimal solutions. The dual to (5.8) (which we normally see directly
from the Tucker diagram) says explicitly: for y1 , y2 ≥ 0 subject to

3y1 + y2 ≥ 8
4y1 + y2 ≥ 10        (5.51)
2y1 + y2 ≥ 5
------------------------
minimize 7y1 + 2y2 .

One feasible primal solution is x = ( x1 , x2 , x3 ) = (0, 1, 1). Then the first inequality
in (5.8) is not tight so by (5.47) we need y1 = 0 in an optimal solution. Because
x2 > 0 and x3 > 0 the second and third inequality in (5.51) have to be tight, which
implies y2 = 10 and y2 = 5 which is impossible. So this x is not optimal.
Another feasible primal solution is ( x1 , x2 , x3 ) = (0, 1.75, 0), where y2 = 0
because the second primal inequality is not tight. Only the second inequality in
(5.51) has to be tight, that is, 4y1 = 10 or y1 = 2.5. However, this violates the first
dual inequality in (5.51).
Finally, for the primal solution x = ( x1 , x2 , x3 ) = (1, 1, 0) both primal in-
equalities are tight, which allows for y1 > 0 and y2 > 0. Then the first two dual
inequalities in (5.51) have to be tight, which determines y as (y1 , y2 ) = (2, 2),
which also fulfills the third dual inequality (which is allowed to have positive
slack because x3 = 0). So here x and y are optimal.
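This search can be mechanized; the following sketch (a hypothetical checker, assuming numpy) tests feasibility and the conditions (5.47) for a proposed pair:

    import numpy as np

    def complementary_slack(A, b, c, x, y, tol=1e-9):
        """Check primal/dual feasibility and the complementary slackness
        conditions (5.47) for the inequality-form pair (5.7) and (5.9)."""
        A, b, c, x, y = (np.asarray(v, float) for v in (A, b, c, x, y))
        primal_ok = np.all(A @ x <= b + tol) and np.all(x >= -tol)
        dual_ok = np.all(y @ A >= c - tol) and np.all(y >= -tol)
        cols_ok = np.all((x <= tol) | (np.abs(y @ A - c) <= tol))
        rows_ok = np.all((y <= tol) | (np.abs(b - A @ x) <= tol))
        return bool(primal_ok and dual_ok and cols_ok and rows_ok)

    A = np.array([[3.0, 4.0, 2.0], [1.0, 1.0, 1.0]])
    b = np.array([7.0, 2.0])
    c = np.array([8.0, 10.0, 5.0])
    print(complementary_slack(A, b, c, [1, 1, 0], [2, 2]))  # True: optimal
    print(complementary_slack(A, b, c, [0, 1, 1], [0, 5]))  # False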
The complementary slackness condition is a good way to verify that a con-
jectured primal solution is optimal, because the resulting equations for the dual
variables typically determine the values of the dual variables which can then be
checked for dual feasibility (or for equality of primal and dual objective function).
As stated in Theorem 5.14, the complementary slackness conditions charac-
terize optimality of a primal-dual pair x, y also for an LP in general form. How-
ever, in such an LP they only impose constraints for the primal or dual inequali-
ties, that is, for the columns j ∈ J and rows i ∈ K in the Tucker diagram in Figure 5.10.
The other columns and rows already define dual or primal equations which by
definition have zero slack. This is also the case if such an equality is converted to
a pair of inequalities. For example, for i ∈ K̄, the primal equation ∑nj=1 aij x j = bi
with unrestricted dual variable yi can be rewritten as a pair of two inequalities
∑nj=1 aij x j ≤ bi and − ∑nj=1 aij x j ≤ −bi with associated nonnegative dual vari-
ables ŷi and y̲i so that yi = ŷi − y̲i . As noted in the proof of Theorem 5.13, we
can add a constant zi to the two variables ŷi and y̲i in any dual feasible solution,
so that they are both positive when zi > 0. By complementary slackness, the two
primal inequalities then have to be tight, but they can anyhow only be fulfilled if
they both hold as an equation. This confirms that for a general LP complementary
slackness is informative only for the inequality constraints.

5.8 Convex sets and functions

The purpose of this and the next section is to connect linear programming with
what we know about constrained optimization (which because of its greater gen-
erality is sometimes called “nonlinear programming”) and the Karush-Kuhn-
Tucker (KKT) conditions. Because the KKT conditions concern local optima, we
show in this section that for so-called convex optimization problems, which in-
clude linear programs, a local minimum is a global minimum.

Figure 5.11 The line through the points x and y consists of points written as
            x + (y − x ) p where p ∈ R. Examples are point a for p = 0.6, point
            b for p = 1.5, and point c when p = −0.4. The line segment that
            connects x and y (drawn as a solid line) results when p is restricted
            to 0 ≤ p ≤ 1.

Let x and y be two vectors in Rm . Figure 5.11 shows two points x and y in
the plane, but the picture may also be regarded as a suitable view of the situation
in a higher-dimensional space. The line that goes through the points x and y is
obtained by adding to the point x, regarded as a vector, any scalar multiple of the
difference y − x. The resulting vector x + (y − x ) p, for p ∈ R, gives x when p = 0,
and y when p = 1. Figure 5.11 gives some examples a, b, c of other points. When
0 ≤ p ≤ 1, as for point a, the resulting points define the line segment that joins x
and y. If p > 1, then one obtains points on the line through x and y on the other
side of y relative to x, like the point b in Figure 5.11. For p < 0, the corresponding
point, like c in Figure 5.11, is on that line but on the other side of x relative to y.

The expression x + (y − x ) p can be re-written as x (1 − p) + yp, where the
given points x and y appear only once. This special linear combination of x and
y with nonnegative coefficients that sum to one is called a convex combination of
x and y. It is useful to remember the expression x (1 − p) + yp in this order with
1 − p as the coefficient of the first vector and p of the second vector, because then
the line segment that joins x to y corresponds to the real interval [0, 1] for the
possible values of p, with the endpoints 0 and 1 of the interval corresponding to
the respective endpoints x and y of the line segment.
In general, a convex combination of points z1 , z2 , . . . , zk in some space is given
as any linear combination z1 p1 + z2 p2 + · · · + zk pk where the linear coefficients
p1 , . . . , pk are non-negative and sum to one. The previously discussed case corre-
sponds to z1 = x, z2 = y, p1 = 1 − p, and p2 = p ∈ [0, 1].

Figure 5.12 Examples of sets that are convex (left) and not convex (right).

A set of points is called convex if it contains with any points z1 , z2 , . . . , zk also
every convex combination of these points. Equivalently, one can show that a set
is convex if it contains with any two points also the line segment joining these
two points (see Figure 5.12). One can then obtain combinations of k points for
k > 2 by iterating convex combinations of only two points.

Figure 5.13 Illustration of Theorem 5.15 for m = 2. Any point in the pentagon
            belongs to one of the three shown triangles (which are not unique;
            there are other ways to “triangulate” the pentagon). A triangle is
            the set of convex combinations of its corners.

If a point is the convex combination of some points in the plane, then it is
easy to see that it is already the convex combination of suitably chosen three of
those points (see Figure 5.13). This is the case m = 2 of the following theorem,
which we mention as an easy consequence of Lemma 5.4.

Theorem 5.15 (Carathéodory) If a point b is the convex combination of some vectors
in Rm , then it is the convex combination of suitable m + 1 of these vectors.

Proof. If b is the convex combination of n vectors A1 , . . . , An in Rm , then b =
A1 x1 + · · · + An xn with x1 , . . . , xn ≥ 0, x1 + · · · + xn = 1. This is equivalent to
saying that there are x1 , . . . , xn ≥ 0 so that

( b, 1 )> = ( A1 , 1 )> x1 + · · · + ( An , 1 )> xn  ∈ Rm+1 .        (5.52)

By Lemma 5.4, a linearly independent subset of the vectors ( A j , 1 )> suffices to
represent ( b, 1 )>, and such a subset has size at most m + 1. This proves the claim.

A function f can also be called convex, according to the following definition.

Definition 5.16 Let X ⊆ Rn and f : X → R. The function f is called convex if X
is convex and the set

{ ( x, u) ∈ Rn × R | x ∈ X, f ( x ) ≤ u } (5.53)

is convex. The set in (5.53) is called the epigraph of f . Equivalently, f is called


convex if X is convex and for all x, y ∈ X and p ∈ [0, 1] we have

f ( x (1 − p) + yp) ≤ f ( x )(1 − p) + f (y) p . (5.54)

Figure 5.14 illustrates condition (5.54). This condition says that if one con-
nects x and y in the domain X of f with a line segment, the function value should
be on or below the line segment that connects the endpoints f ( x ) and f (y). That
is, let p ∈ [0, 1] and z = x (1 − p) + yp. The white dot in Figure 5.14 shows the
convex combination in Rn+1 of ( x, f ( x )) and (y, f (y)) with 1 − p and p given by
( x, f ( x ))(1 − p) + (y, f (y)) p, which is at least as high as the point (z, f (z)), ac-
cording to the condition f (z) ≤ f ( x )(1 − p) + f (y) p in (5.54). Note that z ∈ X
requires that X itself is convex. It is easy to see that then (5.54) is equivalent to
the convexity of the epigraph of f .
The important property of convex functions is the following theorem. Its
proof (by way of contradiction) is illustrated in Figure 5.15.

Theorem 5.17 Let X ⊆ Rn and let f : X → R be a convex function. Then any local
minimum of f is a global minimum of f .

Figure 5.14 Illustration of Definition 5.16. The shaded set is the epigraph of f
            in (5.53), that is, the set of points ( x, u) above the graph of f , with u
            (unbounded above) drawn in vertical direction.
Figure 5.15 Proof of Theorem 5.17 with z as in (5.55).

Proof. Let x ∈ X be a local minimum of f , so there is some ε > 0 so that
f ( x ) ≤ f (z) for all z in X with kz − x k < ε. Suppose x is not a global minimum
of f , that is, f ( x ) > f (y) for some y ∈ X. Let p > 0 so that ky − x k p < ε, and

z = x (1 − p) + yp = x + (y − x ) p ∈ X , hence kz − x k < ε . (5.55)

Then by (5.54), f (z) ≤ f ( x )(1 − p) + f (y) p = f ( x ) − ( f ( x ) − f (y)) p < f ( x ) (see
Figure 5.15), which contradicts the assumption that x is a local minimum of f .

Corollary 5.18 Any local optimum (maximum or minimum) of an LP (in general form)
is a global optimum.

Proof. The inequality and equality constraints of an LP in general form are pre-
served under taking convex combinations, so the primal feasible set X in (5.37) is
convex. Any linear function f , such as x 7→ −c>x or x 7→ c>x, fulfills (5.54) (even
with “=” instead of “≤”) and is therefore convex, so that Theorem 5.17 applies to
minimizing −c>x (that is, maximizing c>x) for x ∈ X. (It also applies to the dual
LP.)

5.9 LP duality and the KKT theorem

In this section we show that the KKT Theorem 4.11 applied to a linear program
is essentially the strong LP duality theorem, applied to an LP (5.31) in inequality
form where any inequalities such as x ≥ 0 would have to be written as part of
Ax ≤ b, so the variables x ∈ Rn are unrestricted. In order to match the notation
in Theorem 4.11, let the number of rows of A be `. That is, (5.31) states: maximize
f ( x ) = c>x subject to hi ( x ) = ∑nj=1 aij x j − bi ≤ 0 for 1 ≤ i ≤ `. The functions f
and hi are affine functions that have constant derivatives, with D f ( x ) = c> and
Dhi ( x ) = ( ai1 , . . . , ain ) for 1 ≤ i ≤ `. The open set U in Theorem 4.11 is Rn .

Suppose that this LP is feasible and that c>x has a local maximum at x = x̄,
which by Corollary 5.18 is also a global maximum. By the duality Theorem 5.13
for an LP in general form, there exists an optimal dual vector y ∈ R` with y ≥ 0
and y>A = c> (see also (5.32)), which is equivalent to D f ( x̄ ) = c> = y>A =
∑`i=1 yi Dhi ( x̄ ), which is the last equation in (4.46) with µi = yi for 1 ≤ i ≤ `.
Moreover, the optimality condition c>x̄ = y>b is equivalent to the complemen-
tary slackness conditions (5.46). In (5.46), the first set of equations hold automat-
ically because y>A = c>, and the second equations yi (bi − ∑nj=1 aij x̄ j ) = 0 are
equivalent to yi (−hi ( x̄ )) = 0 and therefore to µi hi ( x̄ ) = 0 as stated in (4.46). So
Theorem 4.11 is a consequence of the strong duality theorem, in fact in a stronger
form because it does not require the constraint qualification that the gradients in
(4.46) for the tight constraints are linearly independent.

Conversely, the strong duality theorem for an LP with unrestricted variables
(5.31) can also be seen as a special case of the KKT Theorem 4.11, where one can
argue separately that the constraint qualification is not needed.

However, although we do not go into details here, mathematically it is not the
right approach to prove LP duality as a special case of the KKT theorem. Rather,
LP duality is a more basic observation (which we have proved here in full). The
KKT theorem applies to the derivatives of differentiable functions. The funda-
mental purpose of derivatives is to describe local approximations of the given
functions by linear functions. These linear approximations can be studied with
the help of linear programming, which can be used to prove the KKT theorem.
However, this has to be done carefully, and requires separate treatment. For ex-
ample, the constraint qualifications are needed to make sure that this approach
works properly.

5.10 The simplex algorithm: example

The simplex algorithm is a method to find a solution to a linear program. It
has been successfully applied to very large LPs with hundreds of thousands of
variables. The algorithm applies to LPs in equality form, to which a standard LP
in inequality form is converted with the help of slack variables as in (5.34).
In this section, we describe the simplex algorithm for Example 5.1, written
with slack variables s1 and s2 as the problem: for x1 , x2 , x3 , s1 , s2 ≥ 0 subject to

3x1 + 4x2 + 2x3 + s1      = 7
 x1 +  x2 +  x3      + s2 = 2        (5.56)
------------------------
maximize 8x1 + 10x2 + 5x3 .

Because b ≥ 0, this system Ax + s = b has an easy first solution where x = 0
and s = b. These equations are now rewritten with s1 and s2 as functions of the
remaining variables x1 , x2 , x3 . In addition, an extra variable z denotes the current
value of the objective function. Then (5.56) is equivalent to: maximize z subject
to x1 , x2 , x3 , s1 , s2 ≥ 0 and

s1 = 7 − 3x1 − 4x2 − 2x3
s2 = 2 −  x1 −  x2 −  x3        (5.57)
------------------------
z  = 0 + 8x1 + 10x2 + 5x3

A system such as (5.57) is called a dictionary and is defined as follows. Assume
the original system has m equality constraints. Then m of the variables (here s1
and s2 ), called basic variables, are expressed in terms of the remaining nonbasic
variables (here x1 , x2 , x3 ). This system of equations is always equivalent to the original
equality constraints. The objective function z is also expressed as a function of the
nonbasic variables, and written below the horizontal line.
The basic solution that corresponds to a dictionary is obtained by setting all
nonbasic variables to zero. It is called a basic feasible solution when the resulting
basic variables are nonnegative. In the dictionary, the values for the basic vari-
ables in this basic feasible solution are just the constants that follow the equality
signs, with the corresponding value for z beneath the horizontal line. In (5.57)
these values are s1 = 7, s2 = 2, z = 0.
Starting with an initial basic feasible solution such as (5.57), the simplex al-
gorithm proceeds in steps that rewrite the dictionary. In the following, we only
record the changes of the dictionary, and keep in mind that z should be maxi-
mized and all variables should stay nonnegative. In one step, one nonbasic vari-
able becomes basic (it is said to enter the basis) and a basic variable becomes non-
basic (this variable is said to leave the basis).

For a given dictionary, the entering variable is chosen so as to improve the value
of the objective function when that variable is increased from zero in the current ba-
sic feasible solution. In (5.57), this will happen by increasing any of x1 , x2 , x3
because they all have a positive coefficient in the linear equation for z. Suppose
x2 is chosen as the entering variable (for example, because it has the largest co-
efficient). Suppose the other nonbasic variables x1 and x3 stay at zero and x2
increases. Then z = 0 + 10x2 (the desired increase), s1 = 7 − 4x2 , and s2 = 2 − x2 .
In order to maintain feasibility, we need s1 = 7 − 4x2 ≥ 0 and s2 = 2 − x2 ≥ 0,
where these two constraints are equivalent to 7/4 = 1.75 ≥ x2 and 2 ≥ x2 . The first
of these is the stronger constraint: when x2 is increased from 0 to 1.75, then s1 = 0
and s2 = 0.25 > 0. For that reason, s1 is chosen as the leaving variable, and we
rewrite the first equation in (5.57) so that x2 is on the left and s1 is on the right,
giving
4x2 = 7 − 3x1 − s1 − 2x3
s2 = 2 − x1 − x2 − x3 (5.58)
z = 0 + 8x1 + 10x2 + 5x3

However, this is not a dictionary because x2 is still on the right-hand side of the
second and third equation, but should appear only on the left. To remedy this,
we first rewrite the first equation so that x2 has coefficient 1, and then substitute
this equation into the other two equations:

    x2 = 1.75 − 0.75x1 − 0.25s1 − 0.5x3
    s2 = 2 − x1 − x3 − (1.75 − 0.75x1 − 0.25s1 − 0.5x3)            (5.59)
    z = 0 + 8x1 + 5x3 + 10(1.75 − 0.75x1 − 0.25s1 − 0.5x3)

which gives the new dictionary with basic variables x2 and s2 and nonbasic vari-
ables x1 , s1 , x3 :
    x2 = 1.75 − 0.75x1 − 0.25s1 − 0.5x3
    s2 = 0.25 − 0.25x1 + 0.25s1 − 0.5x3                            (5.60)
    ------------------------------
    z = 17.5 + 0.5x1 − 2.5s1 + 0x3

The basic feasible solution corresponding to (5.60) is x2 = 1.75, s2 = 0.25 and has
objective function value z = 17.5. The latter can still be improved by increasing
x1, which is now the unique choice for the entering variable because neither s1 nor x3 has a positive coefficient in this representation of z. Increasing (only) x1 from
zero imposes the constraints x2 = 1.75 − 0.75x1 ≥ 0 and s2 = 0.25 − 0.25x1 ≥ 0,
where the second is stronger, since s2 becomes zero when x1 = 1 while x2 is still
positive. So x1 enters and s2 leaves the basis. Similar to the step from (5.57) to
(5.58), we bring x1 to the left and s2 to the right side of the equation,

    x2 = 1.75 − 0.75x1 − 0.25s1 − 0.5x3
    0.25x1 = 0.25 − s2 + 0.25s1 − 0.5x3                            (5.61)
    z = 17.5 + 0.5x1 − 2.5s1 + 0x3

and substitute the resulting equation x1 = 1 − 4s2 + s1 − 2x3 for x1 into the other
two equations:
    x2 = 1.75 − 0.25s1 − 0.5x3 − 0.75(1 − 4s2 + s1 − 2x3)
    x1 = 1 − 4s2 + s1 − 2x3                                        (5.62)
    z = 17.5 − 2.5s1 + 0x3 + 0.5(1 − 4s2 + s1 − 2x3)
which gives the next dictionary with x2 and x1 as basic variables and s2 , s1 , x3 as
nonbasic variables:
    x2 = 1 + 3s2 − s1 + x3
    x1 = 1 − 4s2 + s1 − 2x3                                        (5.63)
    ------------------------------
    z = 18 − 2s2 − 2s1 − x3
As always, this dictionary is equivalent to the original system of equations (5.56),
with basic feasible solution x1 = 1, x2 = 1, and corresponding objective func-
tion value z = 18. In the last line in (5.63), no nonbasic variable has a positive
coefficient. This means that no increase from zero of a nonbasic variable can im-
prove the objective function. Hence this basic feasible solution is optimal, and the
algorithm terminates.
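
This optimal solution can be cross-checked with an off-the-shelf solver. The following is a minimal sketch, assuming SciPy is installed; scipy.optimize.linprog minimizes, so the objective is negated:

    import numpy as np
    from scipy.optimize import linprog

    # Example 5.1 in inequality form; linprog minimizes, so we pass -c.
    c = np.array([8.0, 10.0, 5.0])
    A_ub = np.array([[3.0, 4.0, 2.0],
                     [1.0, 1.0, 1.0]])
    b_ub = np.array([7.0, 2.0])

    res = linprog(-c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 3)
    print(res.x, -res.fun)            # approximately [1. 1. 0.] and 18.0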
Converting one dictionary to another by exchanging a nonbasic (entering)
variable with a basic (leaving) variable is commonly referred to as pivoting. The
column of the entering variable and the row of the leaving variable define a
nonzero coefficient of the entering variable known as a pivot element. Pivoting
amounts to a manipulation of the matrix of coefficients of all variables and of the
right-hand side (often called a tableau). This manipulation involves row operations
that represent the substitutions as performed in (5.59) and (5.62). In such a row
operation, the pivot row is divided by the pivot element, and suitable multiples
of the resulting new row are subtracted from the other rows.
That is, we can express the variable substitution that leads to the new dictionary in terms of suitable row operations on the system of equations. This is most easily seen by keeping all variables on one side, similar to (5.56). We rewrite (5.57) as

    3x1 + 4x2 + 2x3 + s1 = 7
    x1 + x2 + x3 + s2 = 2                                          (5.64)
    z − 8x1 − 10x2 − 5x3 = 0

where we now have to remember that in the expression for z a potential entering variable is identified by a negative coefficient. In (5.64) the basic variables are s1 and s2, which each have a unit vector as their column of coefficients, with entry 1 in the row of the basic variable and entry 0 elsewhere.
With x2 as the entering and s1 as the leaving variable in (5.64), pivoting amounts to creating a unit vector in the column for x2. This means dividing the first (pivot) row by 4 so that x2 has coefficient 1 in that row. The new first row is then subtracted from the second row, and 10 times the new first row is added to the third row, so that the coefficient of x2 in those rows becomes zero:

    0.75x1 + x2 + 0.5x3 + 0.25s1 = 1.75
    0.25x1 + 0.5x3 − 0.25s1 + s2 = 0.25                            (5.65)
    z − 0.5x1 + 0x3 + 2.5s1 = 17.5

These row operations have the same effect as the substitutions in (5.59). The
system (5.65) is equivalent to the second dictionary (5.60). The basic variables x2
and s2 are identified by their unit-vector columns. Note that z is expressed only
in terms of the nonbasic variables.
The entering variable in (5.65) is x1 and the leaving variable is s2 , so that the
unit vector for that second row should now appear in the column for x1 rather
than s2 . The second row is divided by the pivot element 0.25 (i.e., multiplied
by 4) to give x1 coefficient 1, and the coefficients of x1 in the other rows give the
suitable multiplier to subtract the new second row from the other rows, namely
0.75 for the first and −0.5 for the third row. This gives

x2 − x3 + s1 − 3s2 = 1
x1 + 2x3 − s1 + 4s2 = 1 (5.66)
z + x3 + 2s1 + 2s2 = 18

which is equivalent to the final dictionary (5.63).


The only difference between the “dictionaries” (5.57), (5.60), (5.63) and the “tableaus” (5.64), (5.65), (5.66) is the sign of the nonbasic variables, and that the columns of all variables stay in place in a tableau, which requires identifying the basic variables: they are the ones whose columns are unit vectors.
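
These row operations can be collected into a short routine. The following sketch (Python with NumPy; the helper name pivot is our own) reproduces the steps from (5.64) via (5.65) to (5.66); the columns are ordered x1, x2, x3, s1, s2, the last column is the right-hand side, and the z-row is kept last:

    import numpy as np

    def pivot(T, r, c):
        """Make column c of the tableau T a unit vector with entry 1 in row r."""
        T[r] /= T[r, c]                  # divide the pivot row by the pivot element
        for i in range(T.shape[0]):
            if i != r:
                T[i] -= T[i, c] * T[r]   # clear column c in every other row

    # Tableau (5.64): columns x1, x2, x3, s1, s2 | rhs; last row is the z-row.
    T = np.array([[ 3.0,   4.0,  2.0, 1.0, 0.0,  7.0],
                  [ 1.0,   1.0,  1.0, 0.0, 1.0,  2.0],
                  [-8.0, -10.0, -5.0, 0.0, 0.0,  0.0]])

    pivot(T, 0, 1)   # x2 enters, s1 leaves: gives tableau (5.65)
    pivot(T, 1, 0)   # x1 enters, s2 leaves: gives tableau (5.66)
    print(T)         # the right-hand side of the z-row now reads 18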

5.11 The simplex algorithm: general description

This section describes the simplex algorithm in general. We also define the relevant terms in full generality; many of them have already been introduced in the previous section.

The simplex algorithm applies to an LP (5.28) in equality form: maximize c^T x subject to Ax = b, x ≥ 0, for given A ∈ R^{m×n}, b ∈ R^m, c ∈ R^n. We assume that the m rows of the matrix A are linearly independent. This is automatically the case if A has been obtained from an LP in inequality form by adding an identity matrix for the slack variables as in (5.35). In general, if the row vectors of A are linearly dependent, then some row of A is a linear combination of the other rows. The respective equation in Ax = b is then either also the same linear combination of the other equations, so that it can be omitted, or it contradicts that linear combination and Ax = b has no solution. Therefore, it is no restriction to assume that A has full row rank m.
Let A_1, . . . , A_n be the n columns of A. A basis (of A) is an m-element subset B of the column indices {1, . . . , n} so that the vectors A_j for j ∈ B are linearly independent (sometimes “basis” also refers to such a set of vectors, that is, to a basis of the column space of A). For a basis B of A, a feasible solution x to (5.28) (that is, Ax = b and x ≥ 0) where x_j > 0 implies j ∈ B is called a basic feasible solution. The components x_j of x for j ∈ B are called basic variables.
For a basis B, let N = {1, . . . , n} − B, where j ∈ N means x_j is a nonbasic variable. Let A_B denote the m × m submatrix of A that consists of the basic columns A_j for j ∈ B, and let A_N be the submatrix that consists of the nonbasic columns. Similarly, let x_B and x_N be the subvectors of x with components x_j for j ∈ B and j ∈ N, respectively. Because the column vectors of A_B are linearly independent, this matrix is invertible, and the basic solution x to Ax = A_B x_B + A_N x_N = b that corresponds to the basis B is uniquely given by x_N = 0 and x_B = A_B^{-1} b. By definition, this is a basic feasible solution if it is nonnegative, that is, if x_B ≥ 0.
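
In this notation, computing the basic solution for a given basis is a single linear solve. A minimal sketch (Python with NumPy; data from (5.56) with columns ordered x1, x2, x3, s1, s2):

    import numpy as np

    A = np.array([[3.0, 4.0, 2.0, 1.0, 0.0],
                  [1.0, 1.0, 1.0, 0.0, 1.0]])
    b = np.array([7.0, 2.0])

    B = [1, 0]                            # the final basis of the example: x2 and x1
    x_B = np.linalg.solve(A[:, B], b)     # x_B = A_B^{-1} b, here (x2, x1) = (1, 1)
    print(x_B, (x_B >= 0).all())          # nonnegative, hence a basic feasible solution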
A basic feasible solution is uniquely specified by a basis B. The converse does not hold because x_B may have zero components x_i for some i ∈ B. Such a basis B and its corresponding basic feasible solution are called degenerate. In that case, B can equally well be replaced by a basis B − {i} ∪ {j} (for suitable j ∈ N) that has the same basic feasible solution x. This lack of uniqueness requires certain precautions when defining the simplex algorithm; for the moment, we assume that no feasible basis is degenerate.
By Lemma 5.4, if the LP (5.28) has a feasible solution, it also has a basic feasible solution, because any solution to Ax = b can be iteratively modified until b is written as a positive linear combination of linearly independent columns of A. If these are fewer than m columns, they can be extended with suitable further columns A_j to form a basis, with corresponding coefficients x_j = 0; the corresponding basic feasible solution is then degenerate.
The simplex algorithm works exclusively with basic feasible solutions, which are iteratively changed to improve the objective function. In each iteration, it suffices to change only one basic variable, which is called pivoting.
Assume that a basic feasible solution for the LP (5.28) has been found. In
general, this requires an initialization phase of the simplex algorithm, which fails
if the LP is infeasible, a case that is discovered at that point. We will describe this
initializing “first phase” later.
Consider a basic feasible solution with basis B, and let N denote the index set of the nonbasic columns as above. The following equations are equivalent for any x ∈ R^n:

    Ax = b
    A_B x_B + A_N x_N = b
    A_B^{-1} A_B x_B + A_B^{-1} A_N x_N = A_B^{-1} b
    x_B = A_B^{-1} b − A_B^{-1} A_N x_N                            (5.67)
    x_B = A_B^{-1} b − ∑_{j∈N} A_B^{-1} A_j x_j
    x_B = b̄ − ∑_{j∈N} Ā_j x_j

with the notation b̄ = A_B^{-1} b and Ā_j = A_B^{-1} A_j for j ∈ N. The last three equations in (5.67) (which are the same equation in different notation) represent a dictionary as written above the horizontal line in the examples (5.57), (5.60), (5.63). A dictionary expresses the basic variables x_B in terms of the nonbasic variables x_N, and defines the basic solution for the basis B by setting x_N = 0, which is feasible when b̄ ≥ 0.
The value c^T x of the objective function is represented as follows. Let c_B and c_N denote the subvectors of c with the components c_j for j ∈ B and j ∈ N, respectively. Then

    c^T x = c_B^T x_B + c_N^T x_N
          = c_B^T (A_B^{-1} b − A_B^{-1} A_N x_N) + c_N^T x_N
          = c_B^T A_B^{-1} b + (c_N^T − c_B^T A_B^{-1} A_N) x_N    (5.68)
          = c_B^T A_B^{-1} b + ∑_{j∈N} (c_j − c_B^T A_B^{-1} A_j) x_j

which expresses the objective function c^T x in terms of the nonbasic variables, as in the equation below the horizontal line in the examples (5.57), (5.60), (5.63). In (5.68), c_B^T A_B^{-1} b is the value of the objective function for the basic feasible solution where x_N = 0. This is an optimal solution if

    c_j − c_B^T A_B^{-1} A_j ≤ 0   for all j ∈ N,                  (5.69)

because x_j ≥ 0 in any feasible solution, so that by (5.68) c^T x is maximal if (5.69) holds. Condition (5.69) is the criterion for optimality used by the simplex algorithm. If this condition is fulfilled, we have also obtained an optimal solution to the dual LP (5.29) of (5.28), namely

    y^T = c_B^T A_B^{-1},                                          (5.70)

which is feasible for the dual LP (5.29) because y^T A_B = c_B^T by (5.70) and y^T A_j ≥ c_j for j ∈ N by (5.69), that is, y^T A_N ≥ c_N^T, so altogether y^T A ≥ c^T. It is optimal because y^T b = c_B^T A_B^{-1} b = c_B^T x_B = c^T x when x_N = 0, that is, the dual and primal objective functions have the same value.
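
The optimality criterion (5.69) and the dual solution (5.70) are easily checked numerically for the final basis of the example. A sketch in the same hypothetical notation as the previous snippet:

    import numpy as np

    A = np.array([[3.0, 4.0, 2.0, 1.0, 0.0],
                  [1.0, 1.0, 1.0, 0.0, 1.0]])
    b = np.array([7.0, 2.0])
    c = np.array([8.0, 10.0, 5.0, 0.0, 0.0])
    B, N = [1, 0], [2, 3, 4]              # basis x2, x1; nonbasic x3, s1, s2

    y = np.linalg.solve(A[:, B].T, c[B])  # A_B^T y = c_B, i.e. y^T = c_B^T A_B^{-1}
    reduced = c[N] - y @ A[:, N]          # c_j - c_B^T A_B^{-1} A_j for j in N
    print(reduced)                        # [-1. -2. -2.]: criterion (5.69) holds
    print(y, y @ b)                       # y = [2. 2.], dual objective 18 = primal value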
The optimality criterion (5.69) fails if

    c_j − c_B^T A_B^{-1} A_j > 0   for some j ∈ N.                 (5.71)

In that case, the value of the objective function will be increased if x_j can assume a positive value. The simplex algorithm therefore looks for such a j in (5.71) and makes x_j a new basic variable, called the entering variable. The index j is said to enter the basis. This has to be done while preserving feasibility, and so that there are again m basic variables. In the process, some element i of B leaves the basis, where x_i is called the leaving variable.
To demonstrate this change of basis, consider the last equation in (5.67), which expresses the variables x_B in terms of the nonbasic variables x_N. Assume that all components of x_N are kept zero except x_j. Then (5.67) has the form

    x_B = b̄ − Ā_j x_j,                                            (5.72)

where by (5.68) and (5.71)

    c^T x = c_B^T b̄ + (c_j − c_B^T Ā_j) x_j   with   c_j − c_B^T Ā_j > 0.    (5.73)

For x_j = 0, (5.72) represents the current basic feasible solution x_B = b̄. How long does x_B stay nonnegative if x_j is gradually increased? If Ā_j has no positive components, then x_j can be made arbitrarily large, in which case by (5.73) c^T x increases arbitrarily and the LP is unbounded.
Hence, suppose that some components of Ā_j are positive. It is useful to consider the m rows of equation (5.72) as numbered with the elements of B, because the left-hand side of that equation is x_B (in practice, one would record for each row 1, . . . , m of the dictionary represented by the last equation in (5.67) the respective element of B, as for example in (5.63)). So let the components of Ā_j be ā_ij for i ∈ B. At least one of them is positive, and any of these positive elements imposes an upper bound on the choice of x_j in (5.72) so that x_B stays nonnegative, by the condition b̄_i − ā_ij x_j ≥ 0 or equivalently b̄_i / ā_ij ≥ x_j (because ā_ij > 0). This defines the maximum choice of x_j by the following so-called minimum ratio test (which we have encountered in similar form before in (5.14)):

    x_j = min { b̄_ℓ / ā_ℓj | ā_ℓj > 0, ℓ ∈ B } = b̄_i / ā_ij   for some i ∈ B, ā_ij > 0.    (5.74)

For at least one i ∈ B, the minimum ratio is achieved as stated in (5.74). The corresponding variable x_i is made the leaving variable and becomes nonbasic. This defines the pivoting step: the entering variable x_j is made basic and the leaving variable x_i is made nonbasic, and the basis B is replaced by B − {i} ∪ {j}.
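
A direct transcription of the minimum ratio test might look as follows (Python sketch; the function name and the convention of returning None for an unbounded direction are our own):

    import numpy as np

    def min_ratio_test(b_bar, A_bar_j, basis):
        """Return (largest feasible value of the entering variable, leaving index),
        or None if no entry of A_bar_j is positive (the LP is unbounded)."""
        ratios = [(b_bar[k] / A_bar_j[k], basis[k])
                  for k in range(len(basis)) if A_bar_j[k] > 0]
        return min(ratios) if ratios else None

    # First pivot of the example: x2 enters, b_bar = (7, 2), column of x2 is (4, 1).
    print(min_ratio_test(np.array([7.0, 2.0]), np.array([4.0, 1.0]), ['s1', 's2']))
    # prints (1.75, 's1'): x2 can increase to 1.75, and s1 leaves the basis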

We show that the column vectors A_k for k ∈ B − {i} ∪ {j} are linearly independent. Consider a linear combination of these vectors that represents the zero vector, ∑_k A_k t_k = 0, which implies ∑_k A_B^{-1} A_k t_k = 0. The vectors A_B^{-1} A_k for k ∈ B − {i} are unit vectors with zeros in all rows except row k, so their linear combination has a zero in row i. On the other hand, A_B^{-1} A_k for k = j is the vector Ā_j, which in row i has entry ā_ij > 0. This implies t_j = 0. For all k ∈ B − {i}, the vectors A_k are linearly independent, so that t_k = 0 for all k. Thus, B − {i} ∪ {j} is indeed a new basis.
We have described an iteration of the simplex algorithm. In summary, it consists of the following steps (a code sketch follows the list).

1. Given a basic feasible solution as in (5.67) with basis B, choose some entering variable x_j according to (5.71). If no such variable exists, stop: the current solution is optimal.

2. With b̄ = A_B^{-1} b and Ā_j = A_B^{-1} A_j, determine the maximum value of x_j so that b̄ − Ā_j x_j ≥ 0. If there is no such maximum because Ā_j ≤ 0, then stop: the LP is unbounded. Otherwise, set x_j to the minimum ratio in (5.74).

3. Replace the current basic feasible solution x_B = b̄ by x_B = b̄ − Ā_j x_j. At least one component x_i of this vector is zero; x_i is made the leaving variable and is replaced by the entering variable x_j. Replace the basis B by B − {i} ∪ {j}. Go back to Step 1.
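
Under the assumptions made here (a feasible starting basis is given, and no degeneracy, which is discussed below), these steps assemble into a compact loop. This is only an illustrative sketch, not an efficient implementation: it recomputes the inverse of A_B in every iteration instead of updating it, and it always picks the entering variable with the smallest index:

    import numpy as np

    def simplex(A, b, c, basis):
        """Maximize c^T x subject to Ax = b, x >= 0, from a feasible basis.
        Returns (x, value), or None if the LP is unbounded."""
        m, n = A.shape
        basis = list(basis)
        while True:
            A_B_inv = np.linalg.inv(A[:, basis])
            b_bar = A_B_inv @ b                     # current values of x_B
            y = c[basis] @ A_B_inv                  # y^T = c_B^T A_B^{-1}
            reduced = c - y @ A                     # reduced costs, zero on the basis
            entering = [j for j in range(n)
                        if j not in basis and reduced[j] > 1e-9]
            if not entering:                        # criterion (5.69): optimal
                x = np.zeros(n)
                x[basis] = b_bar
                return x, c @ x
            j = entering[0]                         # Step 1: entering variable
            A_bar_j = A_B_inv @ A[:, j]
            ratios = [(b_bar[k] / A_bar_j[k], k)
                      for k in range(m) if A_bar_j[k] > 1e-9]
            if not ratios:                          # Step 2: column <= 0, unbounded
                return None
            _, k = min(ratios)                      # minimum ratio test (5.74)
            basis[k] = j                            # Step 3: x_j enters, the k-th
                                                    # basic variable leaves

    # Example (5.56) with slack columns; the initial basis is the slacks s1, s2.
    A = np.array([[3.0, 4.0, 2.0, 1.0, 0.0],
                  [1.0, 1.0, 1.0, 0.0, 1.0]])
    b = np.array([7.0, 2.0])
    c = np.array([8.0, 10.0, 5.0, 0.0, 0.0])
    print(simplex(A, b, c, [3, 4]))   # x = (1, 1, 0, 0, 0) with value 18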
A given basis B determines a unique basic feasible solution x_B. By increasing the value of the entering variable x_j, the feasible solution changes according to (5.72). During this continuous change, this feasible solution is not a basic solution: it has m + 1 positive variables, namely x_ℓ for ℓ ∈ B and x_j. Unless the LP is unbounded, Ā_j has positive components ā_ℓj, so the respective variables x_ℓ decrease while x_j increases. The smallest value of x_j where at least one component x_i of x_B becomes zero is given by the minimum ratio in (5.74). At this point, again only m (or fewer) variables are nonzero, and the leaving variable x_i is replaced by x_j, so that indeed a new basic feasible solution and corresponding basis B − {i} ∪ {j} is obtained.
However, the simplex algorithm does not require a continuous change of the values of m + 1 variables. Instead, the value of the entering variable x_j can directly “jump” to the minimum ratio in (5.74). What is important is the next basis. The change of the basis requires an update of the inverse A_B^{-1} of the basis matrix in order to obtain the new dictionary in (5.67). (As demonstrated in the previous section, this update can be implemented by suitable row operations on the “tableau” representation of the dictionary, which also determines the new basic feasible solution.) In this view, the simplex algorithm is a combinatorial method that computes a sequence of bases, which are certain finite subsets of the set {1, . . . , n} that represents the columns of the original system Ax = b.
We have made an important assumption, namely that no feasible basis is degenerate, that is, all basic variables in a basic feasible solution have positive values. This implies b̄_i > 0 in (5.74), so that the entering variable x_j takes on a positive value and the objective function value of the basic feasible solution increases with each iteration by (5.73). Hence, no basis is revisited, and the simplex algorithm terminates because there are only finitely many bases. Furthermore, Step 3 of the above summary shows that in the absence of degeneracy the leaving variable x_i, and thus the minimum in the minimum-ratio test (5.74), is unique: if two variables became zero at the same time, then only one of them would leave while the other remained basic with value zero, so the new basic feasible solution would be degenerate.
If there are degenerate basic feasible solutions, then the minimum in (5.74) may be zero because b̄_i = 0 for some i with ā_ij > 0. Then the entering variable x_j, which was zero as a nonbasic variable, enters the basis but stays at zero in the new basic feasible solution. In that case, only the basis has changed, but not the feasible solution and not the value of the objective function. In fact, it is possible that this results in a cycle of the simplex algorithm (when the same basis is revisited) and thus in a failure to terminate. This behaviour is rare, and degeneracy is itself an “accident” that only occurs when there are special relationships between the entries of A and b. Nevertheless, degeneracy can be dealt with in a systematic manner, which we do not treat in these notes that are already long enough.
We also need to find an initial feasible solution to start the simplex algorithm. For that purpose, we use a “first phase” with a different objective function that establishes whether the LP (5.28) is feasible, similar to the approach in (5.24). First, choose an arbitrary basis B and let b̄ = A_B^{-1} b. If b̄ ≥ 0, then x_B = b̄ is already a basic feasible solution and nothing needs to be done. Otherwise, b̄ has at least one negative component. Define the m-vector h = A_B 1 where 1 is the all-one vector. That is, h is just the sum of the columns of A_B. We add −h as an extra column to the system Ax = b with a new variable t and consider the following LP:

    maximize −t
    subject to Ax − h t = b                                        (5.75)
               x ≥ 0, t ≥ 0
We find a basic feasible solution to this LP with a single pivoting step from the (infeasible) basis B. Namely, the following are equivalent, similar to (5.67):

    Ax − h t = b
    A_B x_B + A_N x_N − h t = b
    x_B = A_B^{-1} b − A_B^{-1} A_N x_N + A_B^{-1} h t             (5.76)
    x_B = b̄ − A_B^{-1} A_N x_N + 1 t

(for the last line, note that A_B^{-1} h = A_B^{-1} A_B 1 = 1),

where we now let t enter the basis and increase t such that b̄ + 1 t ≥ 0. For the smallest such value of t, at least one component x_i of x_B is zero and becomes
the leaving variable. After the pivot with x_i leaving and t entering the basis, one obtains a basic feasible solution to (5.75).
The LP (5.75) is therefore feasible, and its objective function is bounded from above by zero. The original system Ax = b, x ≥ 0 is feasible if and only if the optimum of (5.75) is zero. Suppose this is the case, which will be found out by solving the LP (5.75) with the simplex algorithm. Then this “first phase” terminates with a basic feasible solution to (5.75) where t = 0, which is then also a feasible solution to Ax = b, x ≥ 0. The simplex algorithm can then proceed with maximizing the original objective function c^T x as described earlier.
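
The construction of the auxiliary LP (5.75) is mechanical; only the single pivot with t entering is special. A minimal sketch of the setup (Python with NumPy; the function name and the small infeasible-basis example are hypothetical):

    import numpy as np

    def phase_one_data(A, b, basis):
        """Build the auxiliary LP (5.75): maximize -t subject to Ax - h t = b
        and x, t >= 0, where h = A_B 1 is the sum of the basic columns."""
        m, n = A.shape
        h = A[:, basis].sum(axis=1)           # h = A_B 1
        A_aux = np.hstack([A, -h.reshape(m, 1)])
        c_aux = np.zeros(n + 1)
        c_aux[n] = -1.0                       # objective: maximize -t
        return A_aux, c_aux

    # Hypothetical data where the slack basis [2, 3] has b_bar = (2, -1), so it
    # is infeasible and the first phase is needed:
    A = np.array([[1.0,  1.0, 1.0, 0.0],
                  [1.0, -1.0, 0.0, 1.0]])
    b = np.array([2.0, -1.0])
    A_aux, c_aux = phase_one_data(A, b, basis=[2, 3])
    # After the single pivot that lets t enter (the smallest t with b_bar + 1 t >= 0),
    # the auxiliary LP has a basic feasible solution; solving it with the simplex
    # algorithm decides feasibility: the original LP is feasible iff the optimum is 0.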
