Predictive Analytics – Week 7: Linear Methods for Regression I


Semester 2, 2018

QBUS2820 content structure

1. Statistical and Machine Learning foundations and applications.

3. Classification methods.

4. Time series forecasting.


Week 7: Linear Methods for Regression I

1. Introduction

2. Variable selection

3. Regularisation methods

4. Discussion

Reading: Chapters 6.1 and 6.2 of ISL.

Exercise questions: Chapter 6.8 of ISL, Q1, Q2, Q3, and Q4.


Introduction

Linear Methods for Regression

In this lecture we focus again on the linear regression model for
prediction. We move beyond OLS to consider other estimation
methods.

The motivation for studying these methods is that using many
predictors in a linear regression model typically leads to overfitting.
We will therefore accept some bias in order to reduce variance.


Linear regression (review)

Y = f(x) + ε.

The linear regression model is a special case based on a regression
function of the form

$$f(x) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_p x_p$$


OLS (review)

In the OLS method, we select the coefficient values that minimise
the residual sum of squares

$$\hat{\beta}_{\text{ols}} = \underset{\beta}{\operatorname{argmin}} \; \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2$$

We obtain the formula

$$\hat{\beta}_{\text{ols}} = (X^T X)^{-1} X^T y.$$
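As a minimal sketch of the OLS formula, the snippet below solves the normal equations with NumPy. The function name `ols_fit` and the toy data are made up for illustration, and the intercept is handled by adding a column of ones to X, which is one common convention rather than something the slides prescribe.

```python
import numpy as np

def ols_fit(X, y):
    """Return OLS coefficients (intercept first) for predictors X and response y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so that beta[0] plays the role of the intercept.
    X1 = np.column_stack([np.ones(len(y)), X])
    # Solve the normal equations (X^T X) beta = X^T y without forming the inverse.
    return np.linalg.solve(X1.T @ X1, X1.T @ y)

# Toy data (hypothetical): y = 1 + 2*x1 - 3*x2 exactly, with no noise.
X = np.array([[0.0, 1.0], [1.0, 0.0], [2.0, 1.0], [3.0, 5.0], [4.0, 2.0]])
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]
beta_hat = ols_fit(X, y)
```

Because the toy response is noise free, the fitted coefficients should reproduce the intercept and slopes used to generate it. Solving the linear system is usually preferred to computing the inverse explicitly.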


MLR model (review)

1. Linearity: if X = x, then

$$Y = \beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p + \varepsilon$$

for some population parameters β0, β1, . . . , βp and a random
error ε.

2. The conditional mean of ε given X is zero, E(ε|X) = 0.

3. Constant error variance: Var(ε|X) = σ².

4. Independence: the observations are independent.

5. The distribution of X1, . . . , Xp is arbitrary.

6. There is no perfect multicollinearity (no column of X is a
linear combination of other columns).


OLS properties (review)

Under Assumptions 1 (the regression function is correctly
specified) and 2 (there are no omitted variables that are correlated
with the predictors), the OLS estimator is unbiased

E(β̂ols) = β.


Why are we not satisfied with OLS?

Prediction accuracy. Low bias (if the linearity assumption is
approximately correct), but potentially high variance. We can
improve performance by setting some coefficients to zero or
shrinking them.

Interpretability. A regression estimated with too many predictors
and high variance is hard or impossible to interpret. In order to
understand the big picture, we are willing to sacrifice some of the
small details.


Linear model selection and regularisation

------------------------------------------------------------------------------
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     26.1369     14.287      1.829      0.069      -2.064      54.337
dp             0.3280      8.247      0.040      0.968     -15.951      16.607
dy             3.3442      7.941      0.421      0.674     -12.330      19.019
ep             0.3133      2.345      0.134      0.894      -4.315       4.942
bm            -3.2443      6.719     -0.483      0.630     -16.507      10.018
ntis         -46.9566     38.911     -1.207      0.229    -123.762      29.848
tbl           -2.8651     20.922     -0.137      0.891     -44.162      38.432
ltr           10.2432     14.468      0.708      0.480     -18.314      38.800
tms           13.1083     11.129      1.178      0.240      -8.859      35.076
dfy         -156.8202    213.943     -0.733      0.465    -579.111     265.471
dfr           71.0710     29.099      2.442      0.016      13.634     128.508
infl         -36.9489     82.870     -0.446      0.656    -200.521     126.623
ik          -208.4868    242.844     -0.859      0.392    -687.824     270.851
==============================================================================


We select the following models in the equity premium dataset
based on the AIC:

Best subset selection: (dy, bm, tms, dfr)

Forward selection: (ik, tms, dfr)

Backward selection: (dy, tms, dfr)
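To make the forward selection idea concrete, here is a rough sketch on simulated data. The AIC definition is not part of this extract, so the code assumes the Gaussian form n·log(RSS/n) + 2·(number of coefficients), up to an additive constant. The function names and the stopping rule (stop as soon as no candidate lowers the AIC) are illustrative simplifications, not necessarily the procedure used for the equity premium results.

```python
import numpy as np

def gaussian_aic(X, y, cols):
    """AIC (up to an additive constant) of an OLS fit on the chosen columns plus an intercept."""
    n = len(y)
    X1 = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    rss = float(np.sum((y - X1 @ beta) ** 2))
    # Assumed form: n * log(RSS / n) + 2 * (number of estimated coefficients).
    return n * np.log(rss / n) + 2 * X1.shape[1]

def forward_selection(X, y):
    """Greedily add the predictor that lowers the AIC most; stop when none does."""
    selected, remaining = [], list(range(X.shape[1]))
    best_aic = gaussian_aic(X, y, selected)
    while remaining:
        aic, j = min((gaussian_aic(X, y, selected + [j]), j) for j in remaining)
        if aic >= best_aic:
            break  # no candidate improves the criterion
        selected.append(j)
        remaining.remove(j)
        best_aic = aic
    return selected, best_aic

# Simulated data (hypothetical): only predictors 0 and 2 matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=200)
selected, aic = forward_selection(X, y)
```

Backward selection would follow the same pattern in reverse, starting from the full model and removing one predictor at a time.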


Table 1: Equity Premium Prediction Results

              Train R²   Test R²
OLS           0.108      0.014
Best Subset   0.095      0.038
Forward       0.083      0.042
Backward      0.084      0.060


Wrong ways to do variable selection

selection criterion. It does not sufficiently penalise additional
predictors.

Removing statistically insignificant predictors. A statistically
significant coefficient means we can reliably say that it is not
exactly zero. This has almost nothing to do with prediction (see
the regression output slide). Furthermore, there are multiple
testing issues.


Regularisation methods

Regularisation methods (key concept)

Regularisation or shrinkage methods for linear regression follow
the general framework of empirical risk minimisation:

$$\hat{\theta} = \underset{\theta}{\operatorname{argmin}} \left[ \sum_{i=1}^{N} L(y_i, f(x_i; \theta)) \right] + \lambda C(\theta).$$

Here, the loss function is the squared loss and the complexity
function will be the norm of the vector of regression coefficients β.
The choice of norm leads to different regularisation properties.
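The penalised criterion can be written as a small function. This is only a sketch: the name `penalised_rss` and its arguments are hypothetical, and the two options correspond to the squared ℓ2 norm (ridge) and the ℓ1 norm (lasso) used in these slides. The intercept is left unpenalised, in line with the practical details for ridge.

```python
import numpy as np

def penalised_rss(beta0, beta, X, y, lam, penalty="l2"):
    """Squared-error loss plus lam * C(beta); the intercept beta0 is not penalised."""
    resid = y - beta0 - X @ beta
    if penalty == "l2":
        complexity = np.sum(beta ** 2)       # squared l2 norm (ridge)
    elif penalty == "l1":
        complexity = np.sum(np.abs(beta))    # l1 norm (lasso)
    else:
        raise ValueError("penalty must be 'l1' or 'l2'")
    return float(np.sum(resid ** 2) + lam * complexity)

# Tiny hand-checkable example (hypothetical numbers).
X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, 2.0])
beta = np.array([2.0, -1.0])
value_l2 = penalised_rss(0.0, beta, X, y, lam=0.5, penalty="l2")
value_l1 = penalised_rss(0.0, beta, X, y, lam=0.5, penalty="l1")
```

Setting `lam=0` recovers the residual sum of squares that OLS minimises.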


Ridge regression (key concept)

The ridge regression method solves the penalised estimation
problem

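$$\hat{\beta}_{\text{ridge}} = \underset{\beta}{\operatorname{argmin}} \left\{ \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\},$$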

for a tuning parameter λ.

The penalty term $\lambda \lVert \beta \rVert_2^2$ has the effect of shrinking the coefficients
relative to OLS. We refer to this procedure as $\ell_2$ regularisation.


Ridge regression

The ridge estimator has an equivalent formulation as a constrained
minimisation problem

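$$\hat{\beta}_{\text{ridge}} = \underset{\beta}{\operatorname{argmin}} \; \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \le t,$$

for some t > 0.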


Practical details

1. The hyperparameters λ and t control the amount of shrinkage.
There is a one-to-one correspondence between them.

2. We do not penalise the intercept. In practice, we center the
response and the predictors before computing the solution and
estimate the intercept as β̂0 = ȳ.

3. The method is not invariant to the scale of the inputs, so we
standardise the predictors before solving the minimisation
problem.


Ridge regression

We can write the minimisation problem in matrix form as

$$\min_{\beta} \; (y - X\beta)^T (y - X\beta) + \lambda \beta^T \beta.$$

Relying on the same techniques that we used to derive the OLS
estimator, we can show the ridge estimator has the formula

$$\hat{\beta}_{\text{ridge}} = (X^T X + \lambda I)^{-1} X^T y$$
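The closed-form ridge estimator can be sketched in a few lines of NumPy. The function name `ridge_fit` and the simulated data are hypothetical. Following the practical details given earlier, the sketch standardises the predictors, centers the response, and takes the sample mean of y as the intercept.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge coefficients on standardised predictors, plus the unpenalised intercept."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Center and scale each predictor; center the response.
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    yc = y - y.mean()
    p = Xs.shape[1]
    # Solve (X^T X + lam * I) beta = X^T y rather than inverting explicitly.
    beta = np.linalg.solve(Xs.T @ Xs + lam * np.eye(p), Xs.T @ yc)
    return y.mean(), beta

# Simulated data (hypothetical).
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=50)
b0, beta_ols = ridge_fit(X, y, lam=0.0)     # lam = 0 reproduces OLS
_, beta_ridge = ridge_fit(X, y, lam=10.0)   # lam > 0 shrinks the coefficients
```

The returned coefficients are on the standardised scale, so predictions for new data need the same centering and scaling applied to the inputs.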


Orthonormal vectors

We say that two vectors u and v are orthonormal when

$$\lVert u \rVert = \sqrt{u^T u} = 1, \quad \lVert v \rVert = \sqrt{v^T v} = 1, \quad \text{and} \quad u^T v = 0.$$

We say that the design matrix X is orthonormal when all its
columns are orthonormal.


Ridge regression: shrinkage (key concept)

If the design matrix X were orthonormal, the ridge estimate would
just be a scaled version of the OLS estimate:

$$\hat{\beta}_{\text{ridge}} = (I + \lambda I)^{-1} X^T y = \frac{1}{1 + \lambda} \hat{\beta}_{\text{ols}}.$$

In a more general situation, we can say that the ridge regression
method will shrink together the coefficients of correlated predictors.
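The orthonormal special case is easy to check numerically. In this sketch (random data, hypothetical variable names) a QR decomposition provides a design matrix with orthonormal columns, so that XᵀX = I and the ridge solution should equal the OLS solution divided by 1 + λ.

```python
import numpy as np

rng = np.random.default_rng(2)
# The Q factor of a reduced QR decomposition has orthonormal columns (X^T X = I).
X, _ = np.linalg.qr(rng.normal(size=(30, 4)))
y = rng.normal(size=30)
lam = 3.0

beta_ols = X.T @ y  # (X^T X)^{-1} X^T y simplifies because X^T X = I
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(4), X.T @ y)
scaled_ols = beta_ols / (1.0 + lam)
```

Comparing `beta_ridge` with `scaled_ols` (for example with `np.allclose`) confirms that every coefficient is shrunk by the same factor in this case.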


Ridge regression

We define the ridge shrinkage factor as

$$s(\lambda) = \frac{\lVert \hat{\beta}_{\text{ridge}} \rVert_2}{\lVert \hat{\beta}_{\text{ols}} \rVert_2},$$

for a given λ or t.

The next slide illustrates the effect of varying the shrinkage factor
on the estimated parameters.
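As a rough illustration of the shrinkage factor, the sketch below computes s(λ) on simulated data for a grid of λ values. The helper names and the data are hypothetical, and the predictors are standardised and the response centered beforehand so that no intercept is needed.

```python
import numpy as np

def ridge_coefficients(X, y, lam):
    """Closed-form ridge solution for centered data (lam = 0 gives OLS)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def shrinkage_factor(X, y, lam):
    """Ratio of the l2 norms of the ridge and OLS coefficient vectors."""
    ridge_norm = np.linalg.norm(ridge_coefficients(X, y, lam))
    ols_norm = np.linalg.norm(ridge_coefficients(X, y, 0.0))
    return ridge_norm / ols_norm

# Simulated data (hypothetical).
rng = np.random.default_rng(3)
X = rng.normal(size=(40, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = X @ np.array([1.0, 0.5, 0.0, -0.5, 2.0]) + rng.normal(size=40)
y = y - y.mean()

lams = [0.0, 1.0, 10.0, 100.0, 1000.0]
factors = [shrinkage_factor(X, y, lam) for lam in lams]
```

The factor equals one at λ = 0 and falls towards zero as λ grows, which is why it is a convenient horizontal axis for coefficient profile plots.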


Ridge coefficient profiles (equity premium data)


Selecting λ

The ridge regression method leads to a range of models for
different values of λ. We select λ by cross validation or generalised
cross validation.

GCV is computationally convenient for this model.
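The slides do not spell out the GCV criterion, so the sketch below assumes the standard form: the mean of squared residuals, each divided by (1 − tr(S)/N)², where S is the matrix that maps y to the ridge fitted values. The function names, the grid, and the simulated data are hypothetical.

```python
import numpy as np

def ridge_gcv(X, y, lam):
    """Generalised cross validation score for ridge at a given lam (X and y centered)."""
    n, p = X.shape
    # Smoother matrix S such that the fitted values are S @ y.
    S = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    resid = y - S @ y
    return float(np.mean((resid / (1.0 - np.trace(S) / n)) ** 2))

def select_lambda(X, y, grid):
    """Return the grid value with the smallest GCV score, plus all scores."""
    scores = [ridge_gcv(X, y, lam) for lam in grid]
    return grid[int(np.argmin(scores))], scores

# Simulated data (hypothetical).
rng = np.random.default_rng(5)
X = rng.normal(size=(60, 8))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = X[:, 0] - 0.5 * X[:, 1] + rng.normal(size=60)
y = y - y.mean()

grid = [0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]
best_lam, scores = select_lambda(X, y, grid)
```

The convenience comes from needing only one fit per λ, whereas K-fold cross validation refits the model K times for every candidate value.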


The Lasso

The Lasso (least absolute shrinkage and selection operator)
method solves the penalised estimation problem

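$$\hat{\beta}_{\text{lasso}} = \underset{\beta}{\operatorname{argmin}} \left\{ \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\},$$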

for a tuning parameter λ.

The Lasso therefore performs $\ell_1$ regularisation.


The Lasso

The equivalent formulation of the lasso as a constrained
minimisation problem is

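$$\hat{\beta}_{\text{lasso}} = \underset{\beta}{\operatorname{argmin}} \; \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t,$$

for some t > 0.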


The Lasso: shrinkage and variable selection (key concept)

Shrinkage. As with ridge regression, the lasso shrinks the
coefficients towards zero. However, the nature of this shrinkage is
different, as we will see below.

Variable selection. In addition to shrinkage, the lasso also
performs variable selection. With λ sufficiently large, some
estimated coefficients will be exactly zero, leading to sparse
models. This is a key difference from ridge.


The Lasso: variable selection property

Estimation picture for the lasso (left) and ridge regression (right):


Practical details

1. We select the tuning parameter λ by cross validation.

2. As with ridge, we center and standardise the predictors before
computing the solution.

3. There is no closed form solution for the lasso coefficients.
Computing the lasso solution is a quadratic programming
problem.

4. There are efficient algorithms for computing an entire path of
solutions for a range of λ values.
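The slides do not name a particular algorithm. One widely used option is cyclic coordinate descent, sketched below for the criterion RSS + λ Σ|βj| with centered data and no intercept. The function names, the fixed iteration count, and the simulated data are assumptions made for illustration. Because the RSS carries no factor of 1/2, each coordinate update soft-thresholds at λ/2.

```python
import numpy as np

def soft_threshold(z, gamma):
    """Move z towards zero by gamma, truncating at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    """Minimise ||y - X b||^2 + lam * sum|b_j| by cyclic coordinate descent."""
    n, p = X.shape
    beta = np.zeros(p)
    col_ss = np.sum(X ** 2, axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual: remove the fit of every predictor except j.
            r_j = y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(X[:, j] @ r_j, lam / 2.0) / col_ss[j]
    return beta

# Simulated data (hypothetical): only predictors 0 and 3 matter.
rng = np.random.default_rng(4)
X = rng.normal(size=(100, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(size=100)
y = y - y.mean()

# Smallest lam at which every coefficient is exactly zero.
lam_max = 2.0 * np.max(np.abs(X.T @ y))
path = {lam: lasso_cd(X, y, lam) for lam in (0.0, 0.2 * lam_max, lam_max)}
```

A practical implementation would check convergence instead of using a fixed number of sweeps, and would reuse the solution at one λ as the starting point for the next when tracing out a path.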


The Lasso

We define the shrinkage factor for a given value of λ (or t) as

$$s(\lambda) = \frac{\sum_{j=1}^{p} \big| \hat{\beta}_j^{\text{lasso}} \big|}{\sum_{j=1}^{p} \big| \hat{\beta}_j^{\text{ols}} \big|}.$$

The next slide illustrates the effect of varying the shrinkage factor
on the estimated parameters.


Lasso coefficient profiles (equity premium data)


Model selection for the equity premium data


Discussion

Subset selection, ridge, and lasso: comparison in the orthonormal case (optional)

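Estimator              Formula

Best subset (size k)   $\hat{\beta}_j \cdot I(|\hat{\beta}_j| \ge |\hat{\beta}_{(k)}|)$
Ridge                  $\hat{\beta}_j / (1 + \lambda)$
Lasso                  $\operatorname{sign}(\hat{\beta}_j)\,(|\hat{\beta}_j| - \lambda/2)_+$

Estimators of βj in the case of orthonormal columns of X. Here β̂j is
the OLS estimate, β̂(k) is the k-th largest OLS coefficient in absolute
value, and (z)+ = max(z, 0). The lasso threshold is λ/2 because the
lasso criterion adds λ Σ|βj| to the residual sum of squares without a
factor of 1/2.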
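The three rules in the table can be written as one-line transformations of the OLS coefficients. This is a sketch with hypothetical function names and made-up coefficients. The lasso function takes the threshold amount directly, so it applies whichever scaling of λ is used in the lasso row.

```python
import numpy as np

def best_subset(beta_ols, k):
    """Keep the k coefficients that are largest in absolute value; zero out the rest."""
    cutoff = np.sort(np.abs(beta_ols))[-k]
    return beta_ols * (np.abs(beta_ols) >= cutoff)

def ridge_shrink(beta_ols, lam):
    """Proportional shrinkage: every coefficient is divided by 1 + lam."""
    return beta_ols / (1.0 + lam)

def lasso_shrink(beta_ols, threshold):
    """Soft thresholding: move each coefficient towards zero, truncating at zero."""
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - threshold, 0.0)

# Made-up OLS coefficients for illustration.
beta_ols = np.array([3.0, -1.5, 0.4, -0.2])
subset_est = best_subset(beta_ols, k=2)
ridge_est = ridge_shrink(beta_ols, lam=1.0)
lasso_est = lasso_shrink(beta_ols, threshold=0.5)
```

Best subset keeps the large coefficients unchanged and drops the rest (hard thresholding), ridge scales everything proportionally, and the lasso translates each coefficient towards zero by a constant amount and sets the small ones exactly to zero.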


Ridge and Lasso: comparison in the orthonormal case (optional)

[Figure: two panels plotting the coefficient estimate against y_j. Left panel: ridge and least squares. Right panel: lasso and least squares.]

Which method to use?

• Recall the no free lunch theorem: neither ridge regression nor
the lasso universally outperforms the other. The choice of
method should be data driven.

• In general terms, we can expect the lasso to perform better
when a small subset of predictors have important coefficients,
while the remaining predictors have small or zero
coefficients (sparse problems).

• Ridge regression will tend to perform better when the
predictors all have similar importance.

• The lasso may have better interpretability since it can lead to
a sparse solution.


Elastic Net

The elastic net is a compromise between ridge regression and the
lasso:

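$$\hat{\beta}_{\text{EN}} = \underset{\beta}{\operatorname{argmin}} \; \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} \big( \alpha \beta_j^2 + (1 - \alpha) |\beta_j| \big),$$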

for λ ≥ 0 and 0 < α < 1.

The elastic net performs variable selection like the lasso, and shrinks
together the coefficients of correlated predictors like ridge regression.

Illustration: equity premium data

Estimated coefficients (tuning parameters selected by leave-one-out CV)

        OLS      Ridge    Lasso    EN
dp      0.566    0.159    0.000    0.111
dy      0.602    0.197    0.000    0.153
ep      0.942    0.116    0.000    0.048
bm     -1.055    0.033    0.000    0.000
ntis   -0.276   -0.067   -0.000   -0.000
tbl    -0.489   -0.248   -0.000   -0.178
ltr     0.597    0.186    0.000    0.124
tms     0.762    0.286    0.161    0.239
dfy     0.145    0.031    0.000    0.000
dfr     1.570    0.377    0.131    0.294
infl   -0.202   -0.214   -0.000   -0.150
ik     -0.408   -0.318   -0.422   -0.282

Illustration: equity premium data

Prediction results

              Train R²   Test R²
OLS           0.108      0.014
Ridge         0.054      0.033
Lasso         0.033      0.011
Elastic Net   0.050      0.029

Comparison with variable selection

Regularisation methods have two important advantages over variable
selection.

1. They are continuous procedures, generally leading to lower variance.

2. The computational cost is not much larger than OLS.

Review questions

• What is best subset selection?

• What are stepwise methods?

• What are the advantages and disadvantages of variable selection?

• What are the penalty terms in the ridge and Lasso methods?

• What are the key differences in type of shrinkage between the ridge
and Lasso methods?

• In what situations would we expect the ridge or lasso methods to
perform better?
