
Sparse Optimization

Lecture: Basic Sparse Optimization Models

Instructor: Wotao Yin


July 2013

online discussions on piazza.com

Those who complete this lecture will know

basic $\ell_1$, $\ell_{2,1}$, and nuclear-norm models


some applications of these models
how to reformulate them into standard conic programs
which conic programming solvers to use

1 / 33

Examples of Sparse Optimization Applications

See online seminar at piazza.com

2 / 33

Basis pursuit
$\min_x\{\|x\|_1 : Ax = b\}$

find the least $\ell_1$-norm point on the affine plane $\{x : Ax = b\}$


tends to return a sparse point (sometimes, the sparsest)

(figure: the $\ell_1$ ball)
3 / 33

Basis pursuit
$\min_x\{\|x\|_1 : Ax = b\}$
find the least $\ell_1$-norm point on the affine plane $\{x : Ax = b\}$
tends to return a sparse point (sometimes, the sparsest)

(figure: the $\ell_1$ ball touches the affine plane)
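A minimal numerical sketch of basis pursuit, assuming CVXPY is installed and using randomly generated data purely for illustration:

```python
import cvxpy as cp
import numpy as np

# toy problem: a sparse x_true, far fewer measurements than unknowns
rng = np.random.default_rng(0)
n, m, k = 200, 60, 8
A = rng.standard_normal((m, n))
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = rng.standard_normal(k)
b = A @ x_true

# basis pursuit: min ||x||_1  s.t.  Ax = b
x = cp.Variable(n)
prob = cp.Problem(cp.Minimize(cp.norm1(x)), [A @ x == b])
prob.solve()

print("recovery error:", np.linalg.norm(x.value - x_true))
```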


4 / 33

Basis pursuit denoising, LASSO


$\min_x\{\|Ax - b\|_2 : \|x\|_1 \le \tau\},$   (1a)

$\min_x \|x\|_1 + \frac{\mu}{2}\|Ax - b\|_2^2,$   (1b)

$\min_x\{\|x\|_1 : \|Ax - b\|_2 \le \sigma\}.$   (1c)

all models allow $Ax \ne b$


5 / 33

Basis pursuit denoising, LASSO

$\min_x\{\|Ax - b\|_2 : \|x\|_1 \le \tau\},$   (2a)

$\min_x \|x\|_1 + \frac{\mu}{2}\|Ax - b\|_2^2,$   (2b)

$\min_x\{\|x\|_1 : \|Ax - b\|_2 \le \sigma\}.$   (2c)

$\|\cdot\|_2$ is the most common error measure but can be generalized to a loss function $L$


(2a) seeks a least-squares solution with bounded sparsity (an $\ell_1$-norm bound)
(2b) is known as LASSO (least absolute shrinkage and selection operator);
it seeks a balance between sparsity and fitting
(2c) is referred to as BPDN (basis pursuit denoising), seeking a sparse
solution from the tube-like set $\{x : \|Ax - b\|_2 \le \sigma\}$


they are equivalent (see later slides)
in terms of regression, they select a (sparse) set of features (i.e., columns

of A) to linearly express the observation b
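A CVXPY sketch of (2b) and (2c) on random toy data; the weight `mu` and the noise level `sigma` below are illustrative values only, not recommendations:

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(1)
m, n = 60, 200
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)     # stand-in observation

x = cp.Variable(n)

# (2b) LASSO / penalized form: min ||x||_1 + (mu/2) ||Ax - b||_2^2
mu = 10.0                      # illustrative weight
lasso = cp.Problem(cp.Minimize(cp.norm1(x) + (mu / 2) * cp.sum_squares(A @ x - b)))
lasso.solve()

# (2c) BPDN / constrained form: min ||x||_1  s.t.  ||Ax - b||_2 <= sigma
sigma = 0.1                    # illustrative noise level
bpdn = cp.Problem(cp.Minimize(cp.norm1(x)), [cp.norm(A @ x - b, 2) <= sigma])
bpdn.solve()
```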


6 / 33

Sparse under basis / $\ell_1$-synthesis model


$\min_s\{\|s\|_1 : A\Phi s = b\}$   (3)

signal $x$ is sparsely synthesized by atoms from $\Phi$ (i.e., $x = \Phi s$), so the vector $s$ is sparse


$\Phi$ is referred to as the dictionary
commonly used dictionaries include both analytic and trained ones
analytic examples: identity, DCT, wavelets, curvelets, Gabor, etc., as well as their
combinations; they have analytic properties and are often fast to apply (for
example, multiplying $\Phi$ by a vector takes $O(n \log n)$ instead of $O(n^2)$ operations)
$\Phi$ can also be numerically learned from training data or a partial signal
$\Phi$ can be orthogonal, a frame, or a general matrix
7 / 33

Sparse under basis / $\ell_1$-synthesis model

If $\Phi$ is orthogonal, problem (3) is equivalent to


$\min_x\{\|\Phi^T x\|_1 : Ax = b\}$   (4)

by the change of variable $x = \Phi s$, or equivalently $s = \Phi^T x$.


Related models for noise and approximate sparsity:
$\min_x\{\|Ax - b\|_2 : \|\Phi^T x\|_1 \le \tau\},$

$\min_x \|\Phi^T x\|_1 + \frac{\mu}{2}\|Ax - b\|_2^2,$

$\min_x\{\|\Phi^T x\|_1 : \|Ax - b\|_2 \le \sigma\}.$

8 / 33

Sparse after transform / $\ell_1$-analysis model

$\min_x\{\|\Psi x\|_1 : Ax = b\}$   (5)

Signal $x$ becomes sparse under the transform $\Psi$ (which may not be orthogonal)


Examples of $\Psi$:
DCT, wavelets, curvelets, ridgelets, ....
tight frames, Gabor, ...
(weighted) total variation

When $\Psi$ is not orthogonal, the analysis is more difficult
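As an illustration of the analysis model, here is a small CVXPY sketch where $\Psi$ is the one-dimensional finite-difference (total-variation) operator; the piecewise-constant test signal and random measurements are assumptions made only for this example:

```python
import cvxpy as cp
import numpy as np

# piecewise-constant signal observed through a random matrix
rng = np.random.default_rng(2)
n, m = 100, 40
x_true = np.concatenate([np.zeros(40), 2 * np.ones(30), -np.ones(30)])
A = rng.standard_normal((m, n))
b = A @ x_true

# l1-analysis with Psi = finite differences: min ||Psi x||_1  s.t.  Ax = b
x = cp.Variable(n)
prob = cp.Problem(cp.Minimize(cp.tv(x)), [A @ x == b])   # cp.tv plays the role of ||Psi x||_1
prob.solve()

print("recovery error:", np.linalg.norm(x.value - x_true))
```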

9 / 33

Example: sparsify an image

10 / 33

(figure, panels (a)-(f): (a) DCT coefficients, (b) Haar wavelets, (c) local variation;
(d)-(f) the corresponding sorted coefficient magnitudes, showing their fast decay)

Figure: the DCT and wavelet coefficients are scaled for better visibility.

11 / 33

Questions

1. Can we trust these models to return the intended sparse solutions?
2. When will the solution be unique?
3. Will the solution be robust to noise in b?
4. Are the constrained and unconstrained models equivalent? In what sense?
Questions 1-4 will be addressed in the next lecture.
5. How to choose the parameters?
$\tau$ (sparsity), $\mu$ (weight), and $\sigma$ (noise level) have different meanings
applications determine which one is easier to set
generality: use a test data set, then scale the parameters for real data
cross validation: reserve a subset of the data to test the solution

12 / 33

Joint/group sparsity

Joint sparse recovery model:


$\min_X\{\|X\|_{2,1} : \mathcal{A}(X) = b\}$   (6)

where
$\|X\|_{2,1} := \sum_{i=1}^{m} \big\|[x_{i1}\ x_{i2}\ \cdots\ x_{in}]\big\|_2.$

the $\ell_2$-norm is applied to each row of $X$

the $\ell_{2,1}$-norm ball has sharp boundaries across different rows, which tend to
be touched by $\{X : \mathcal{A}(X) = b\}$, so the solution tends to be row-sparse

one can also use $\|X\|_{p,q}$ for $1 < p \le \infty$; the choice of $p$ affects the magnitudes of entries on the same row
complex-valued signals are a special case
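A small CVXPY sketch of the joint-sparsity model (6) with a row-sparse unknown; the plain matrix observation $\mathcal{A}(X) = AX$ is assumed here for simplicity:

```python
import cvxpy as cp
import numpy as np

# row-sparse ground truth: only a few rows of X are nonzero
rng = np.random.default_rng(3)
m_rows, n_cols, m_meas = 80, 5, 30
X_true = np.zeros((m_rows, n_cols))
X_true[rng.choice(m_rows, 6, replace=False), :] = rng.standard_normal((6, n_cols))
A = rng.standard_normal((m_meas, m_rows))
B = A @ X_true                       # linear measurements, A(X) = A X here

# min ||X||_{2,1}  s.t.  A X = B,  with ||X||_{2,1} = sum of row l2-norms
X = cp.Variable((m_rows, n_cols))
row_norms = cp.norm(X, 2, axis=1)    # l2-norm of each row
prob = cp.Problem(cp.Minimize(cp.sum(row_norms)), [A @ X == B])
prob.solve()

print("recovery error:", np.linalg.norm(X.value - X_true))
```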

13 / 33

Joint/group sparsity

Decompose $\{1, \ldots, n\} = \mathcal{G}_1 \cup \mathcal{G}_2 \cup \cdots \cup \mathcal{G}_S$.
non-overlapping groups: $\mathcal{G}_i \cap \mathcal{G}_j = \emptyset$ for $i \ne j$.
otherwise, groups may overlap (modeling many interesting structures).

Group-sparse recovery model:


$\min_x\{\|x\|_{\mathcal{G},2,1} : Ax = b\}$   (7)

where
$\|x\|_{\mathcal{G},2,1} = \sum_{s=1}^{S} w_s \|x_{\mathcal{G}_s}\|_2.$
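A CVXPY sketch of the group-sparse model (7) with hypothetical non-overlapping groups of equal size and unit weights $w_s = 1$:

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(4)
n, m, gsz = 100, 40, 5              # 20 non-overlapping groups of size 5
starts = list(range(0, n, gsz))

# ground truth supported on a few whole groups
x_true = np.zeros(n)
for s in rng.choice(starts, 3, replace=False):
    x_true[s:s + gsz] = rng.standard_normal(gsz)
A = rng.standard_normal((m, n))
b = A @ x_true

# min sum_s w_s ||x_{G_s}||_2  s.t.  Ax = b   (unit weights here)
x = cp.Variable(n)
obj = sum(cp.norm(x[s:s + gsz], 2) for s in starts)
prob = cp.Problem(cp.Minimize(obj), [A @ x == b])
prob.solve()

print("recovery error:", np.linalg.norm(x.value - x_true))
```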

14 / 33

Auxiliary constraints

Auxiliary constraints introduce additional structure of the underlying signal
into its recovery, which sometimes significantly improves recovery quality

nonnegativity: $x \ge 0$
bound (box) constraints: $l \le x \le u$
general inequalities: $Qx \le q$

They can be very effective in practice. They also generate corners.

15 / 33

Reduce to conic programs


Sparse optimization often has nonsmooth objectives.
Classic conic programming solvers do not handle nonsmooth functions.
Basic idea: model nonsmoothness by inequality constraints.
Example: for given x, we have
$\|x\|_1 = \min_{x_1, x_2}\{\mathbf{1}^T(x_1 + x_2) : x_1 - x_2 = x,\ x_1 \ge 0,\ x_2 \ge 0\}.$   (8)

Therefore,
$\min_x\{\|x\|_1 : Ax = b\}$ reduces to a linear program (LP); see the sketch below
$\min_x \|x\|_1 + \frac{\mu}{2}\|Ax - b\|_2^2$ reduces to a bound-constrained quadratic
program (QP)
$\min_x\{\|Ax - b\|_2 : \|x\|_1 \le \tau\}$ reduces to a bound-constrained QP
$\min_x\{\|x\|_1 : \|Ax - b\|_2 \le \sigma\}$ reduces to a second-order cone program (SOCP)
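A sketch of the LP reduction (8) applied to basis pursuit, using SciPy's generic LP solver on a small random instance (variable splitting $x = x_1 - x_2$ with $x_1, x_2 \ge 0$):

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(5)
m, n = 20, 60
A = rng.standard_normal((m, n))
x_true = np.zeros(n)
x_true[rng.choice(n, 4, replace=False)] = rng.standard_normal(4)
b = A @ x_true

# variables z = [x1; x2] >= 0, objective 1^T(x1 + x2),
# equality constraint A x1 - A x2 = b
c = np.ones(2 * n)
A_eq = np.hstack([A, -A])
res = linprog(c, A_eq=A_eq, b_eq=b, bounds=(0, None))

x = res.x[:n] - res.x[n:]
print("l1 norm:", np.abs(x).sum(), "  recovery error:", np.linalg.norm(x - x_true))
```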
16 / 33

Conic programming
Basic form:
$\min_x\{c^T x : Fx + g \succeq_K 0,\ Ax = b\}.$

$a \succeq_K b$ stands for $a - b \in K$, where $K$ is a convex, closed, pointed cone.


Examples:
first orthant (cone): $\mathbb{R}^n_+ = \{x \in \mathbb{R}^n : x \ge 0\}$
norm cone (second-order cone): $\mathcal{Q} = \{(x, t) : \|x\| \le t\}$
polyhedral cone: $\mathcal{P} = \{x : Ax \ge 0\}$
positive semidefinite cone: $\mathcal{S}_+ = \{X : X \succeq 0,\ X^T = X\}$

Example: $\left\{(x, y, z) : \begin{bmatrix} x & y \\ y & z \end{bmatrix} \in \mathcal{S}_+\right\}$

(figure: the boundary of this cone plotted over $(x, y, z)$)
17 / 33

Linear program
Model
$\min_x\{c^T x : Ax = b,\ x \succeq_K 0\}$
where $K$ is the nonnegative cone (first orthant):
$x \succeq_K 0 \iff x \ge 0$.
Algorithms
the Simplex method (move between vertices)
interior-point methods (IPMs) (move inside the polyhedron)
decomposition approaches (divide and conquer)

In the primal IPM, $x \ge 0$ is replaced by its logarithmic barrier
$\phi(y) = \sum_i \log(y_i)$

log-barrier formulation:
$\min_x\Big\{c^T x - (1/t)\sum_i \log(x_i) : Ax = b\Big\}$
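A toy numerical sketch of the log-barrier idea for an LP, using SciPy's SLSQP to handle the equality constraint; this only illustrates the formulation above (production IPMs take Newton steps on the barrier problem instead), and the data are generated at random:

```python
import numpy as np
from scipy.optimize import minimize, linprog

rng = np.random.default_rng(6)
m, n = 3, 8
A = rng.standard_normal((m, n))
x0 = np.full(n, 1.0)                 # strictly positive, feasible by construction
b = A @ x0
c = rng.uniform(0.5, 2.0, n)         # positive costs keep the LP bounded

def barrier_obj(x, t):
    # c^T x - (1/t) * sum(log x_i); larger t weights the LP objective more
    return c @ x - np.log(np.maximum(x, 1e-12)).sum() / t

x = x0
for t in [1.0, 10.0, 100.0, 1000.0]:
    res = minimize(barrier_obj, x, args=(t,),
                   constraints=[{"type": "eq", "fun": lambda v: A @ v - b}],
                   bounds=[(1e-9, None)] * n, method="SLSQP")
    x = res.x

ref = linprog(c, A_eq=A, b_eq=b, bounds=(0, None))   # reference LP solution
print("barrier endpoint:", c @ x, "  LP optimum:", ref.fun)
```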
18 / 33

Second-order cone program

Model
$\min_x\{c^T x : Ax = b,\ x \succeq_K 0\}$
where $K = K_1 \times \cdots \times K_K$; each $K_k$ is the second-order cone
$K_k = \Big\{ y \in \mathbb{R}^{n_k} : y_{n_k} \ge \sqrt{y_1^2 + \cdots + y_{n_k-1}^2} \Big\}.$

IPM is the standard solver (though other options also exist)
Log-barrier of $K_k$:
$\phi(y) = \log\big(y_{n_k}^2 - (y_1^2 + \cdots + y_{n_k-1}^2)\big)$

19 / 33

Semi-definite program

Model
$\min_X\{C \bullet X : \mathcal{A}(X) = b,\ X \succeq_K 0\}$
where $K = K_1 \times \cdots \times K_K$; each $K_k = \mathcal{S}_+^{n_k}$.

IPM is the standard solver (though other options also exist)

Log-barrier of $\mathcal{S}_+^{n_k}$ (still a concave function):
$\phi(Y) = \log\det(Y).$

20 / 33

(from Boyd & Vandenberghe, Convex Optimization)


properties (without proof): for $y \succ_K 0$,
$\nabla\psi(y) \succeq_{K^*} 0, \qquad y^T \nabla\psi(y) = \theta$

nonnegative orthant $\mathbb{R}^n_+$: $\psi(y) = \sum_{i=1}^{n} \log y_i$
$\nabla\psi(y) = (1/y_1, \ldots, 1/y_n), \qquad y^T \nabla\psi(y) = n$

positive semidefinite cone $\mathcal{S}^n_+$: $\psi(Y) = \log\det Y$
$\nabla\psi(Y) = Y^{-1}, \qquad \operatorname{tr}(Y\,\nabla\psi(Y)) = n$

second-order cone $K = \{y \in \mathbb{R}^{n+1} \mid (y_1^2 + \cdots + y_n^2)^{1/2} \le y_{n+1}\}$:
$\nabla\psi(y) = \frac{2}{y_{n+1}^2 - y_1^2 - \cdots - y_n^2}\begin{bmatrix} -y_1 \\ \vdots \\ -y_n \\ y_{n+1} \end{bmatrix}, \qquad y^T \nabla\psi(y) = 2$
21 / 33

(from Boyd & Vandenberghe, Convex Optimization)

Central path
for $t > 0$, define $x^\star(t)$ as the solution of
    minimize   $t f_0(x) + \phi(x)$
    subject to $Ax = b$
(for now, assume $x^\star(t)$ exists and is unique for each $t > 0$)
central path is $\{x^\star(t) \mid t > 0\}$

example: central path for an LP
    minimize   $c^T x$
    subject to $a_i^T x \le b_i, \quad i = 1, \ldots, 6$
the hyperplane $c^T x = c^T x^\star(t)$ is tangent to the level curve of $\phi$ through $x^\star(t)$

(figure: the feasible polyhedron with $x^\star$ and $x^\star(10)$ on the central path)
22 / 33

Log-barrier formulation:
$\min_x\{t f_0(x) + \phi(x) : Ax = b\}$

Complexity of the log-barrier interior-point method (number of centering steps):
$\left\lceil \frac{\log\big((\sum_i \theta_i)/(t^{(0)}\epsilon)\big)}{\log \mu} \right\rceil$
23 / 33

Primal-dual interior-point methods


more efficient than barrier method when high accuracy is needed
update primal and dual variables at each iteration; no distinction
between inner and outer iterations
often exhibit superlinear asymptotic convergence
search directions can be interpreted as Newton directions for modified
KKT conditions
can start at infeasible points
cost per iteration same as barrier method

24 / 33

$\ell_1$ minimization by interior-point method


Model
$\min_x\{\|x\|_1 : \|Ax - b\|_2 \le \sigma\}$

$= \min_x \min_{x_1, x_2}\{\mathbf{1}^T(x_1 + x_2) : x_1 \ge 0,\ x_2 \ge 0,\ x_1 - x_2 = x,\ \|Ax - b\|_2 \le \sigma\}$

$= \min_{x_1, x_2}\{\mathbf{1}^T(x_1 + x_2) : x_1 \ge 0,\ x_2 \ge 0,\ \|A(x_1 - x_2) - b\|_2 \le \sigma\}$

$= \min_{x_1, x_2}\{\mathbf{1}^T x_1 + \mathbf{1}^T x_2 : Ax_1 - Ax_2 + y = b,\ z = \sigma,\ (x_1, x_2, z, y) \succeq_K 0\}$

where $(x_1, x_2, z, y) \succeq_K 0$ means

$x_1, x_2 \in \mathbb{R}^n_+, \qquad (z, y) \in \mathcal{Q}^{m+1}.$

Solvers: Mosek, SDPT3, Gurobi.

Also, modeling languages CVX and YALMIP.
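A quick CVXPY check of the splitting above on a toy instance: the split form and the direct $\ell_1$ form should give the same optimal value (the random data and $\sigma$ are only illustrative):

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(7)
m, n = 30, 80
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)
sigma = 0.5                          # illustrative noise level

# direct form: min ||x||_1  s.t.  ||Ax - b||_2 <= sigma
x = cp.Variable(n)
direct = cp.Problem(cp.Minimize(cp.norm1(x)), [cp.norm(A @ x - b, 2) <= sigma])
direct.solve()

# split form: x = x1 - x2 with x1, x2 >= 0 and an explicit second-order cone constraint
x1, x2 = cp.Variable(n, nonneg=True), cp.Variable(n, nonneg=True)
split = cp.Problem(cp.Minimize(cp.sum(x1) + cp.sum(x2)),
                   [cp.norm(A @ (x1 - x2) - b, 2) <= sigma])
split.solve()

print(direct.value, split.value)     # the two optimal values should agree
```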
25 / 33

Nuclear-norm minimization by interior-point method

If we can model
$\min_X\{\|X\|_* : \mathcal{A}(X) = b\}$   (9)
as an SDP ... (how? see next slide) ...

then, we can also model
$\min_X\{\|X\|_* : \|\mathcal{A}(X) - b\|_F \le \sigma\}$
$\min_X\{\|\mathcal{A}(X) - b\|_F : \|X\|_* \le \tau\}$
$\min_X \|X\|_* + \frac{1}{2\mu}\|\mathcal{A}(X) - b\|_F^2$

as well as problems involving $\operatorname{tr}(X)$ and the spectral norm $\|X\|$.


$\|X\| \le \mu \;\Longleftrightarrow\; \begin{bmatrix} \mu I & X \\ X^T & \mu I \end{bmatrix} \succeq 0.$
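A small CVXPY sketch of (9) for low-rank matrix recovery, using the built-in nuclear-norm atom; entrywise sampling is assumed here as the linear operator $\mathcal{A}$ purely for illustration:

```python
import cvxpy as cp
import numpy as np

# rank-2 ground truth, observed on a random subset of entries
rng = np.random.default_rng(8)
p, q, r = 20, 20, 2
M = rng.standard_normal((p, r)) @ rng.standard_normal((r, q))
mask = (rng.random((p, q)) < 0.5).astype(float)   # A(X): sample ~50% of the entries

# min ||X||_*  s.t.  X agrees with M on the observed entries
X = cp.Variable((p, q))
prob = cp.Problem(cp.Minimize(cp.normNuc(X)),
                  [cp.multiply(mask, X - M) == 0])
prob.solve()

print("recovery error:", np.linalg.norm(X.value - M))
```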

26 / 33

Sparse calculus for $\ell_1$


inspect $|x|$ to get some ideas:

$y, z \ge 0$ and $yz \ge x^2 \;\Longrightarrow\; \tfrac12(y+z) \ge \sqrt{yz} \ge |x|$.

moreover, $\tfrac12(y+z) = \sqrt{yz} = |x|$ if $y = z = |x|$.

observe

$y, z \ge 0$ and $yz \ge x^2 \;\Longleftrightarrow\; \begin{bmatrix} y & x \\ x & z \end{bmatrix} \succeq 0.$

So,

$\begin{bmatrix} y & x \\ x & z \end{bmatrix} \succeq 0 \;\Longrightarrow\; \tfrac12(y+z) \ge |x|,$

and we attain $\tfrac12(y+z) = |x|$ if $y = z = |x|$.

Therefore, given $x$, we have

$|x| = \min_M \Big\{ \tfrac12 \operatorname{tr}(M) : M_{12} = x,\ M = M^T,\ M \succeq 0 \Big\}$,  where $M$ is $2 \times 2$.
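A quick numerical sanity check of this characterization with CVXPY (an SDP-capable solver such as SCS is assumed to be installed):

```python
import cvxpy as cp

def abs_via_sdp(x_val):
    # |x| = min { 0.5*tr(M) : M_{12} = x, M symmetric, M PSD }
    M = cp.Variable((2, 2), symmetric=True)
    prob = cp.Problem(cp.Minimize(0.5 * cp.trace(M)),
                      [M[0, 1] == x_val, M >> 0])
    prob.solve(solver=cp.SCS)
    return prob.value

for v in [-3.0, 0.7, 2.5]:
    print(v, abs_via_sdp(v), abs(v))   # the last two columns should (nearly) match
```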

27 / 33

Generalization to nuclear norm

Consider $X \in \mathbb{R}^{m \times n}$ (w.l.o.g., assume $m \le n$) and let's try imposing

$\begin{bmatrix} Y & X \\ X^T & Z \end{bmatrix} \succeq 0.$

Diagonalize $X = U\Sigma V^T$, $\Sigma = \operatorname{diag}(\sigma_1, \ldots, \sigma_m)$, $\|X\|_* = \sum_i \sigma_i$.

$\begin{bmatrix} U^T & -V^T \end{bmatrix} \begin{bmatrix} Y & X \\ X^T & Z \end{bmatrix} \begin{bmatrix} U \\ -V \end{bmatrix} = U^T Y U + V^T Z V - U^T X V - V^T X^T U$
$\qquad = U^T Y U + V^T Z V - 2\Sigma \succeq 0.$

So, $\operatorname{tr}(Y) + \operatorname{tr}(Z) \ge \operatorname{tr}(U^T Y U + V^T Z V) \ge 2\operatorname{tr}(\Sigma) = 2\|X\|_*$.

To attain equality, we can let $Y = U\Sigma U^T$ and $Z = V\Sigma_{n \times n} V^T$ (with $\Sigma$ zero-padded to $n \times n$).

28 / 33

Therefore,

$\|X\|_* = \min_{Y,Z}\left\{ \tfrac12(\operatorname{tr}(Y) + \operatorname{tr}(Z)) : \begin{bmatrix} Y & X \\ X^T & Z \end{bmatrix} \succeq 0 \right\}$   (10)

$\qquad\ = \min_{M}\left\{ \tfrac12 \operatorname{tr}(M) : \text{the upper-right block of } M \text{ equals } X,\ M = M^T,\ M \succeq 0 \right\}.$   (11)
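A numerical check of (10)-(11) on a random matrix, comparing the SDP value with NumPy's nuclear norm (a CVXPY sketch; an SDP-capable solver such as SCS is assumed):

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(9)
m, n = 6, 8
Xmat = rng.standard_normal((m, n))

# (11): ||X||_* = min { 0.5*tr(M) : upper-right m x n block of M equals X,
#                       M symmetric, M PSD }
M = cp.Variable((m + n, m + n), symmetric=True)
prob = cp.Problem(cp.Minimize(0.5 * cp.trace(M)),
                  [M[:m, m:] == Xmat, M >> 0])
prob.solve(solver=cp.SCS)

print(prob.value, np.linalg.norm(Xmat, "nuc"))   # should (nearly) agree
```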

Exercise: express the following problems as SDPs


$\min_X\{\|X\|_* : \mathcal{A}(X) = b\}$

$\min_X \|X\|_* + \frac{1}{2\mu}\|\mathcal{A}(X) - b\|_F^2$

$\min_{L,S}\{\|L\|_* + \lambda\|S\|_1 : \mathcal{A}(L + S) = b\}$

29 / 33

Practice of interior-point methods (IPMs)


LP, SOCP, SDP are well known and have reliable (commercial,

off-the-shelf) solvers
Yet, the most reliable solvers cannot handle large-scale problems (e.g.,

images, video, manifold learning, distributed stuff, ...)


Example: to recover a still image, there can be 10M variables and 1M

constraints. Even worse, the constraint coefficients are dense. Result: Out
of memory.

Simplex and active-set methods: matrix containing A must be inverted or

factorized to compute the next point (unless A is very sparse).


IPMs approximately solve a Newton system and thus also factorize a

matrix involving A.
Even when large and dense matrices can be handled, for sparse optimization
one should take advantage of the solution sparsity.


Some compressive sensing problems have A with structures friendly for

operations like $Ax$ and $A^T y$.

30 / 33

Practice of interior-point methods (IPMs)

The simplex, active-set, and interior-point methods have reliable solvers; they make
good benchmarks.
They have nice interfaces (including CVX and YALMIP, which save you

time.)
CVX and YALMIP are not solvers; they translate problems and then call
solvers; see http://goo.gl/zUlMK and http://goo.gl/1u0xP.
They can return highly accurate solutions; some first-order algorithms
(coming later in this course) do not always do so.


There are other remedies; see next slide.

31 / 33

Papers of large-scale SDPs

Low-rank factorizations:
S. Burer and R. D. C. Monteiro, A nonlinear programming algorithm for solving
semidefinite programs via low-rank factorization, Math. Program., 95:329-357, 2003.
LMaFit, http://lmafit.blogs.rice.edu/

First-order methods for conic programming:


Z. Wen, D. Goldfarb, and W. Yin. Alternating direction augmented Lagrangian methods
for semidefinite programming. Math. Program. Comput., 2(3-4):203-230, 2010.

Matrix-free IPMs:
K. Fountoulakis, J. Gondzio, P. Zhlobich. Matrix-free interior point method for compressed
sensing problems, 2012. http://www.maths.ed.ac.uk/~gondzio/reports/mfCS.html

32 / 33

Subgradient methods

Sparse optimization is typically nonsmooth, so it is natural to consider
subgradient methods.

apply subgradient descent to, say, $\min_x \|x\|_1 + \frac{\mu}{2}\|Ax - b\|_2^2$ (see the sketch at the end of this slide).

apply projected subgradient descent to, say, $\min_x\{\|x\|_1 : Ax = b\}$.

Good: subgradients are easy to derive, methods are simple to implement.


Bad: convergence requires carefully chosen step sizes (classical ones require
diminishing step sizes). Convergence rate is weak on paper (and in practice,
too?)
Further readings: http://arxiv.org/pdf/1001.4387.pdf,
http://goo.gl/qFVA6, http://goo.gl/vC21a.
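A minimal NumPy sketch of subgradient descent for $\min_x \|x\|_1 + \frac{\mu}{2}\|Ax - b\|_2^2$ with a diminishing, normalized step size; the data, the weight $\mu$, and the step-size rule are all illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(10)
m, n, mu = 40, 100, 1.0
A = rng.standard_normal((m, n))
x_true = np.zeros(n); x_true[:5] = rng.standard_normal(5)
b = A @ x_true

def f(x):
    return np.abs(x).sum() + 0.5 * mu * np.linalg.norm(A @ x - b) ** 2

x = np.zeros(n)
x_best, f_best = x.copy(), f(x)
for k in range(1, 5001):
    # a subgradient of ||x||_1 + (mu/2)||Ax - b||_2^2:
    # sign(x) (any value in [-1, 1] at zeros) plus mu * A^T (Ax - b)
    g = np.sign(x) + mu * A.T @ (A @ x - b)
    step = 0.5 / np.sqrt(k)                      # diminishing step size ~ 1/sqrt(k)
    x = x - step * g / (np.linalg.norm(g) + 1e-12)
    if f(x) < f_best:                            # keep the best iterate, as is standard
        x_best, f_best = x.copy(), f(x)

print("best objective found:", f_best, "  f(x_true):", f(x_true))
```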

33 / 33
