
1

The lasso:
some novel algorithms and
applications
Robert Tibshirani
Stanford University

Dept. of Statistics, Purdue, February 2011

Collaborations with Trevor Hastie, Jerome Friedman, Holger Hoefling, Rahul Mazumder, Ryan Tibshirani
Email: [email protected]
http://www-stat.stanford.edu/~tibs
2-5

[Slides 2-5: photographs of collaborators Jerome Friedman and Trevor Hastie, including look-alike images from MyHeritage.com]
6

Linear regression via the Lasso (Tibshirani, 1995)

• Outcome variable yi, for cases i = 1, 2, . . . , n; features xij, j = 1, 2, . . . , p
• Minimize
  $$\sum_{i=1}^{n}\Big(y_i - \sum_{j} x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|$$
• Equivalent to minimizing the sum of squares subject to the constraint $\sum_j |\beta_j| \le s$.
• Similar to ridge regression, which uses the constraint $\sum_j \beta_j^2 \le t$.
• Lasso does variable selection and shrinkage; ridge only shrinks (illustrated in the sketch below).
• See also “Basis Pursuit” (Chen, Donoho and Saunders, 1998).
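To make the selection-versus-shrinkage contrast concrete, here is a minimal sketch (added here for illustration, not part of the original slides) that fits lasso and ridge on synthetic sparse data with scikit-learn; the data dimensions and penalty values are arbitrary assumptions.

```python
# Sketch: lasso selects variables (exact zeros), ridge only shrinks.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:3] = [3.0, -2.0, 1.5]          # sparse truth: only 3 active features
y = X @ beta_true + rng.standard_normal(n)

lasso = Lasso(alpha=0.1).fit(X, y)        # alpha plays the role of lambda
ridge = Ridge(alpha=10.0).fit(X, y)

print("lasso nonzero coefficients:", np.sum(lasso.coef_ != 0))  # a handful
print("ridge nonzero coefficients:", np.sum(ridge.coef_ != 0))  # all 20, just shrunk
```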
7

Picture of Lasso and Ridge regression

[Figure: two panels in the (β1, β2) plane showing the lasso constraint region (a diamond) and the ridge constraint region (a disk), with elliptical contours of the residual sum of squares centered at the full least squares estimate β̂.]
8

Example: Prostate Cancer Data

yi = log(PSA); the xij are measurements on a man and his prostate.

[Figure: lasso coefficient profiles for the eight predictors (lcavol, svi, lweight, pgg45, lbph, gleason, age, lcp) plotted against the shrinkage factor s from 0 to 1.]
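A profile plot like this can be reproduced in spirit with the short sketch below (an illustration added here, not the original analysis); the data are synthetic stand-ins for the eight standardized prostate predictors, and scikit-learn's lasso_path is used rather than the talk's software.

```python
# Sketch: compute lasso coefficient profiles over a grid of penalties and
# plot them against the shrinkage factor s = ||beta||_1 / max ||beta||_1.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(1)
X = rng.standard_normal((97, 8))                      # stand-in for 8 standardized predictors
y = 0.7 * X[:, 0] + 0.3 * X[:, 1] + 0.5 * rng.standard_normal(97)   # stand-in for log PSA

alphas, coefs, _ = lasso_path(X, y, n_alphas=100)     # coefs: (n_features, n_alphas)

l1_norms = np.abs(coefs).sum(axis=0)
s = l1_norms / l1_norms.max()

for j in range(coefs.shape[0]):
    plt.plot(s, coefs[j])
plt.xlabel("Shrinkage factor s")
plt.ylabel("Coefficients")
plt.show()
```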
9

Emerging themes

• Lasso (ℓ1) penalties have powerful statistical and computational advantages.
• ℓ1 penalties provide a natural way to encourage or enforce sparsity and simplicity in the solution.
• "Bet on sparsity" principle (in The Elements of Statistical Learning): assume that the underlying truth is sparse and use an ℓ1 penalty to try to recover it. If you're right, you will do well. If you're wrong (the underlying truth is not sparse), then no method can do well. [Bickel, Buhlmann, Candes, Donoho, Johnstone, Yu ...]
• ℓ1 penalties are convex, and the assumed sparsity can lead to significant computational advantages.
10

Outline

• New fast algorithm for the lasso: pathwise coordinate descent
• Three examples of applications/generalizations of the lasso:
• Logistic/multinomial regression for classification, with a later example of classification from microarray data
• Near-isotonic regression: a modern take on an old idea
• The matrix completion problem
• Not covering: sparse multivariate methods (principal components, canonical correlation, clustering; Daniela Witten's thesis). Google 'Daniela Witten' → "Penalized matrix decomposition"
11

Algorithms for the lasso

• Standard convex optimizers
• Least angle regression (LAR), Efron et al. (2004): computes the entire path of solutions. State of the art until 2008.
• Pathwise coordinate descent: new
12

Pathwise coordinate descent for the lasso

• Coordinate descent: optimize one parameter (coordinate) at a time.
• How? Suppose we had only one predictor. The problem is to minimize
  $$\sum_i (y_i - x_i\beta)^2 + \lambda|\beta|$$
• The solution is the soft-thresholded estimate
  $$\mathrm{sign}(\hat\beta)\,(|\hat\beta| - \lambda)_+$$
  where $\hat\beta$ is the usual least squares estimate.
• Idea: with multiple predictors, cycle through each predictor k in turn. Compute the partial residuals $r_i = y_i - \sum_{j \neq k} x_{ij}\hat\beta_j$ and apply univariate soft-thresholding, pretending that our data is $(x_{ik}, r_i)$ (see the sketch below).
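A minimal numerical sketch of this update (added here for concreteness; it is not the glmnet Fortran code): cyclic coordinate descent for the lasso at a fixed λ, assuming the objective (1/2)Σ(yi − Σ xij βj)² + λΣ|βj| and columns of X scaled to unit norm, so that each univariate update is exactly the soft-threshold shown above.

```python
# Sketch of cyclic coordinate descent for the lasso at a fixed lambda.
# Assumes unit-norm columns of X, so each coordinate update is a soft-threshold
# of the inner product of that column with the partial residual.
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=100, tol=1e-8, beta=None):
    n, p = X.shape
    beta = np.zeros(p) if beta is None else beta.copy()
    r = y - X @ beta                         # full residual, kept up to date
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            r_j = r + X[:, j] * beta[j]      # partial residual: leave predictor j out
            b_new = soft_threshold(X[:, j] @ r_j, lam)   # univariate soft-threshold
            r -= X[:, j] * (b_new - beta[j])
            max_change = max(max_change, abs(b_new - beta[j]))
            beta[j] = b_new
        if max_change < tol:
            break
    return beta

# Toy check
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 5))
X /= np.linalg.norm(X, axis=0)               # unit-norm columns
y = 2.0 * X[:, 0] + 0.1 * rng.standard_normal(50)
print(lasso_cd(X, y, lam=0.05))
```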
13

Soft-thresholding

[Figure: soft-thresholding of coefficients β1, . . . , β4; estimates with magnitude below λ are set exactly to zero, the others are shrunk toward zero by λ.]
14

• It turns out that this is coordinate descent for the lasso criterion
  $$\sum_i \Big(y_i - \sum_j x_{ij}\beta_j\Big)^2 + \lambda \sum_j |\beta_j|$$
• Like skiing to the bottom of a hill, going north-south, east-west, north-south, etc. [Show movie]
• Too simple?!
15

A brief history of coordinate descent for the lasso

• 1997: Tibshirani's student Wenjiang Fu at the University of Toronto develops the "shooting algorithm" for the lasso. Tibshirani doesn't fully appreciate it.
• 2002: Ingrid Daubechies gives a talk at Stanford and describes a one-at-a-time algorithm for the lasso. Hastie implements it, makes an error, and Hastie + Tibshirani conclude that the method doesn't work.
• 2006: Friedman is the external examiner at the PhD oral of Anita van der Kooij (Leiden), who uses the coordinate descent idea for the elastic net. Friedman wonders whether it works for the lasso. Friedman, Hastie + Tibshirani start working on this problem. See also Wu and Lange (2008)!
16

Pathwise coordinate descent for the lasso

• Start with a large value of λ (a very sparse model) and slowly decrease it, using each solution as a warm start for the next λ (a sketch of this loop appears below)
• Most coordinates that are zero never become non-zero
• The coordinate descent code for the lasso is just 73 lines of Fortran!
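A sketch of the pathwise strategy (again illustrative, not the actual Fortran): it reuses the lasso_cd function from the earlier sketch and warm-starts each solve from the previous solution.

```python
# Pathwise coordinate descent sketch: solve over a decreasing grid of lambdas,
# warm-starting each fit from the previous solution. Assumes lasso_cd() from
# the earlier sketch is in scope and that X has unit-norm columns.
import numpy as np

def lasso_path_cd(X, y, n_lambdas=50, eps=1e-3):
    lam_max = np.max(np.abs(X.T @ y))            # beta = 0 is optimal for lam >= lam_max
    lams = lam_max * np.logspace(0, np.log10(eps), n_lambdas)
    betas = []
    beta = np.zeros(X.shape[1])
    for lam in lams:
        beta = lasso_cd(X, y, lam, beta=beta)    # warm start from the previous lambda
        betas.append(beta.copy())
    return lams, np.array(betas)
```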
17

Extensions

• Pathwise coordinate descent can be generalized to many other models: logistic/multinomial for classification, graphical lasso for undirected graphs, fused lasso for signals.
• Its speed and simplicity are quite remarkable.
• glmnet R package now available on CRAN
18-19

Logistic regression

• Outcome Y = 0 or 1; logistic regression model
  $$\log\frac{\Pr(Y=1)}{1-\Pr(Y=1)} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots$$
• Criterion is the binomial log-likelihood plus an absolute value (ℓ1) penalty (a toy sketch follows below)
• Example: sparse data with N = 50,000, p = 700,000
• State-of-the-art interior point algorithm (Stephen Boyd, Stanford), exploiting sparsity of the features: 3.5 hours for 100 values along the path
• Pathwise coordinate descent: 1 minute
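For concreteness, here is a small sketch of ℓ1-penalized logistic regression (not the talk's software; glmnet in R or its Matlab port would be the natural choice). The data are synthetic and far smaller than the N = 50,000, p = 700,000 example above.

```python
# Sketch: L1-penalized logistic regression on a small synthetic problem.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 500, 200
X = rng.standard_normal((n, p))
logits = X[:, :5] @ np.array([2.0, -1.5, 1.0, -1.0, 0.5])
y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

# C is the inverse of the penalty strength (roughly 1/lambda)
fit = LogisticRegression(penalty="l1", solver="saga", C=0.1, max_iter=5000).fit(X, y)
print("nonzero coefficients:", np.sum(fit.coef_ != 0), "of", p)
```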
20

Multiclass classification

Microarray classification: 16,000 genes, 144 training samples, 54 test samples, 14 cancer classes. Multinomial regression model.

Method                                   CV errors (out of 144)   Test errors (out of 54)   # genes used
1. Nearest shrunken centroids            35 (5)                   17                        6520
2. L2-penalized discriminant analysis    25 (4.1)                 12                        16063
3. Support vector classifier             26 (4.2)                 14                        16063
4. Lasso regression (one vs. all)        30.7 (1.8)               12.5                      1429
5. K-nearest neighbors                   41 (4.6)                 26                        16063
6. L2-penalized multinomial              26 (4.2)                 15                        16063
7. Lasso-penalized multinomial           17 (2.8)                 13                        269
8. Elastic-net penalized multinomial     22 (3.7)                 11.8                      384
21

Near Isotonic regression

Ryan Tibshirani, Holger Hoefling, Rob Tibshirani (2010)

• A generalization of isotonic regression: given a data sequence y1, y2, . . . , yn,
  $$\text{minimize } \sum_i (y_i - \hat y_i)^2 \quad \text{subject to } \hat y_1 \le \hat y_2 \le \cdots$$
  Solved by the Pool Adjacent Violators algorithm (see the sketch below).
• Near-isotonic regression:
  $$\beta_\lambda = \operatorname*{argmin}_{\beta \in \mathbb{R}^n} \; \frac{1}{2}\sum_{i=1}^{n} (y_i - \beta_i)^2 + \lambda \sum_{i=1}^{n-1} (\beta_i - \beta_{i+1})_+,$$
  with $x_+$ denoting the positive part, $x_+ = x \cdot 1(x > 0)$.
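A compact sketch of the Pool Adjacent Violators algorithm for plain isotonic regression (the λ → ∞ limit of the near-isotonic problem); this is an illustrative implementation added here, not the authors' path algorithm.

```python
# Pool Adjacent Violators: isotonic (non-decreasing) least squares fit.
# Maintains a stack of blocks (mean, weight) and merges adjacent blocks
# whenever a monotonicity violation appears.
import numpy as np

def pava(y):
    blocks = []                               # each block: [mean, weight]
    for value in y:
        blocks.append([float(value), 1.0])
        # merge while the last block's mean drops below the previous one
        while len(blocks) > 1 and blocks[-1][0] < blocks[-2][0]:
            m2, w2 = blocks.pop()
            m1, w1 = blocks.pop()
            blocks.append([(w1 * m1 + w2 * m2) / (w1 + w2), w1 + w2])
    return np.concatenate([np.full(int(w), m) for m, w in blocks])

y = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 4.5])
print(pava(y))   # [1.0, 2.5, 2.5, 4.5, 4.5, 4.5]
```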


22

Near-isotonic regression, continued

• A convex problem. The solution path starts at β̂i = yi when λ = 0 and culminates in the usual isotonic regression as λ → ∞. Along the way it gives nearly monotone approximations.
23

Numerical approach

How about using coordinate descent?


• Surprise! Although the criterion is convex, it is not differentiable, and coordinate descent can get stuck in the "cusps"
24

[Figure: at a cusp, neither coordinate-wise move gives an improvement ("no improvement" in each coordinate direction), yet a joint move in both coordinates still gives an improvement.]
25
26

When does coordinate descent work?

Paul Tseng (1988, 2001)

If
$$f(\beta_1, \ldots, \beta_p) = g(\beta_1, \ldots, \beta_p) + \sum_j h_j(\beta_j)$$
where $g(\cdot)$ is convex and differentiable and each $h_j(\cdot)$ is convex, then coordinate descent converges to a minimizer of $f$.

The non-differentiable part of the loss function must be separable.
27

Solution: devise a path algorithm

• A simple algorithm that computes the entire path of solutions: a modified version of the well-known pool adjacent violators algorithm
• Analogous to the LARS algorithm for the lasso in regression
• Bonus: we show that the degrees of freedom equals the number of "plateaus" in the solution, using results from Ryan Tibshirani's PhD work with Jonathan Taylor
28

Toy example

[Figure: near-isotonic fits to a toy sequence at λ = 0, 0.25, 0.7 and 0.77.]
29

Global warming data


30

[Figure: near-isotonic fits to global temperature anomaly data, plotted against Year (1850-2000), y-axis Temperature anomalies. Four panels labelled (lam = 0, ss = 0, viol = 5.6), (lam = 0.3, ss = 0.68, viol = 0.5), (lam = 0.6, ss = 0.88, viol = 0.3) and (lam = 1.8, ss = 1.39, viol = 0).]
31

The matrix completion problem

• Data $X_{m\times n}$, for which only a relatively small number of entries are observed. The problem is to "complete" or impute the matrix based on the observed entries, e.g. the Netflix database (see next slide).
• For a matrix $X_{m\times n}$ let $\Omega \subset \{1,\ldots,m\} \times \{1,\ldots,n\}$ denote the indices of the observed entries. Consider the following optimization problem:
  $$\text{minimize } \operatorname{rank}(Z) \quad \text{subject to } Z_{ij} = X_{ij} \;\; \forall (i,j)\in\Omega \qquad (1)$$
  Not convex!
32

[Slide 32: a toy Netflix-style ratings matrix. Rows are raters (Daniela, Genevera, Larry, Jim, Andy); columns are movies with rotated labels (Lord of the Rings, Pretty Woman, Harry Potter, Pulp Fiction, Kill Bill, Blue Velvet); entries are ratings from 1 to 5 with several missing values marked "?".]
33

• The following seemingly small modification to (1)
  $$\text{minimize } \|Z\|_* \quad \text{subject to } Z_{ij} = X_{ij} \;\; \forall (i,j)\in\Omega \qquad (2)$$
  makes the problem convex [Faz02]. Here $\|Z\|_*$ is the nuclear norm, the sum of the singular values of Z.
• This criterion is used by [CT09, CCS08, CR08]. Fascinating work! See figure.
• But this criterion requires the training error to be zero. This is too harsh and can overfit!
• Instead we use the criterion
  $$\text{minimize } \|Z\|_* \quad \text{subject to } \sum_{(i,j)\in\Omega} (Z_{ij} - X_{ij})^2 \le \delta \qquad (3)$$
34

Nuclear norm is like L1 norm for matrices


35

Idea of Algorithm

1. Impute the missing data with some initial values
2. Compute the SVD of the current matrix, and soft-threshold the singular values
3. Reconstruct from the thresholded SVD and hence obtain new imputations for the missing values
4. Repeat steps 2-3 until convergence
36

Notation

• Define a matrix $P_\Omega(X)$ (with dimension $m \times n$) by
  $$P_\Omega(X)(i,j) = \begin{cases} X_{ij} & \text{if } (i,j)\in\Omega \\ 0 & \text{if } (i,j)\notin\Omega, \end{cases} \qquad (4)$$
  which is the projection of the matrix X onto the observed entries.
• Let
  $$S_\lambda(W) \equiv U D_\lambda V' \quad \text{with } D_\lambda = \operatorname{diag}\big[(d_1-\lambda)_+, \ldots, (d_r-\lambda)_+\big], \qquad (5)$$
  where $UDV'$ is the singular value decomposition of W.
37

Algorithm

1. Initialize $Z^{\text{old}} = 0$ and create a decreasing grid Λ of values $\lambda_1 > \ldots > \lambda_K$.
2. For each fixed $\lambda = \lambda_1, \lambda_2, \ldots \in \Lambda$, iterate until convergence: compute $Z^{\text{new}} \leftarrow S_\lambda\big(P_\Omega(X) + P_\Omega^\perp(Z^{\text{old}})\big)$ and set $Z^{\text{old}} \leftarrow Z^{\text{new}}$.
3. Output the sequence of solutions $\hat Z_{\lambda_1}, \ldots, \hat Z_{\lambda_K}$.

If X is sparse, then at each step the non-sparse matrix has the structure
$$X = X_{SP} \text{ (sparse)} + X_{LR} \text{ (low rank)} \qquad (6)$$
so Lanczos methods can be applied to compute the SVD efficiently (a numerical sketch follows).
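A compact numerical sketch of this iteration, under the notation above (an illustration added here, not the authors' optimized code, and without the sparse plus low-rank Lanczos tricks):

```python
# Soft-impute-style sketch: fill in the missing entries, soft-threshold the
# singular values, re-impute, and repeat. Dense SVD only.
import numpy as np

def soft_impute(X, observed, lam, n_iter=200, tol=1e-6):
    """X: m x n array; observed: boolean mask of the same shape (Omega)."""
    Z = np.zeros_like(X)
    for _ in range(n_iter):
        # P_Omega(X) + P_Omega^perp(Z): observed entries from X, the rest from Z
        filled = np.where(observed, X, Z)
        U, d, Vt = np.linalg.svd(filled, full_matrices=False)
        d_thresh = np.maximum(d - lam, 0.0)          # soft-threshold singular values
        Z_new = (U * d_thresh) @ Vt
        if np.linalg.norm(Z_new - Z) <= tol * max(1.0, np.linalg.norm(Z)):
            return Z_new
        Z = Z_new
    return Z

# Toy example: a rank-2 matrix with roughly half the entries missing
rng = np.random.default_rng(0)
m, n, r = 40, 30, 2
A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
mask = rng.random((m, n)) < 0.5
Z_hat = soft_impute(A, mask, lam=1.0)
print("rank of fit:", np.linalg.matrix_rank(Z_hat, tol=1e-3))
```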


38

Properties of Algorithm

We show that this iterative algorithm converges to the solution of
$$\underset{Z}{\text{minimize}} \;\; \frac{1}{2}\|P_\Omega(X) - P_\Omega(Z)\|_F^2 + \lambda\|Z\|_*, \qquad (7)$$
which is equivalent to the bound version (3).
39

Timings

(m, n)                    |Ω|     true rank   SNR   effective rank    time (s)
(3 × 10^4, 10^4)          10^4    15          1     (13, 47, 80)      (41.9, 124.7, 305.8)
(10^5, 10^5)              10^4    15          10    (5, 14, 32, 62)   (37, 74.5, 199.8, 653)
(10^5, 10^5)              10^5    15          10    (18, 80)          (202, 1840)
(5 × 10^5, 5 × 10^5)      10^4    15          10    11                628.14
(5 × 10^5, 5 × 10^5)      10^5    15          1     (3, 11, 52)       (341.9, 823.4, 4810.75)
(10^6, 10^6)              10^5    15          1     80                8906


40

Accuracy

50% missing entries with SNR = 1, true rank = 10

[Figure: test error (left panel) and training error (right panel) plotted against the nuclear norm of the solution, comparing methods labelled L1, L1-U, L1-L0 and C.]
41

Discussion

• Lasso penalties are useful for fitting a wide variety of models to large datasets; pathwise coordinate descent makes it possible to fit these models at this scale for the first time
• On CRAN: coordinate descent in R via glmnet (linear regression, logistic, multinomial, Cox model, Poisson)
• Also: LARS, nearIso, cghFLasso, glasso
• Matlab software for glmnet and matrix completion:
http://www-stat.stanford.edu/~tibs/glmnet-matlab/
http://www-stat.stanford.edu/~rahulm/SoftShrink
42

Ongoing work in lasso/sparsity

• Grouped lasso (Yuan and Lin) and many variations (Peng, Zhu, ..., Wang: "RemMap")
• Sparse multivariate methods: principal components, canonical correlation, clustering (Witten and others)
• Matrix-variate normal (Genevera Allen)
• Graphical models, graphical lasso (Yuan + Lin; Friedman, Hastie + Tibshirani; Peng, Wang et al.: "SPACE")
• Compressed sensing (Candes and co-authors)
• "Strong rules" (Tibshirani et al. 2010) provide a 5-80 fold speedup in computation, with no loss in accuracy
43

Some challenges

• Develop tools and theory that allow these methods to be used in statistical practice: standard errors, p-values and confidence intervals that account for the adaptive nature of the estimation.
• While it's fun to develop these methods, as statisticians our ultimate goal is to provide better answers to scientific questions.
43-1

References
[CCS08] Jian-Feng Cai, Emmanuel J. Candès, and Zuowei Shen. A singular value thresholding algorithm for matrix completion, 2008.
[CR08] Emmanuel Candès and Benjamin Recht. Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 2008.
[CT09] Emmanuel J. Candès and Terence Tao. The power of convex relaxation: Near-optimal matrix completion, 2009.
[Faz02] M. Fazel. Matrix Rank Minimization with Applications. PhD thesis, Stanford University, 2002.
