# Predictive Analytics – Week 7: Linear Methods for Regression I


Semester 2, 2018

Discipline of Business Analytics, The University of Sydney Business School

QBUS2820 content structure

1. Statistical and Machine Learning foundations and applications.

2. Advanced regression methods.

3. Classification methods.

4. Time series forecasting.


Week 7: Linear Methods for Regression I

1. Introduction

2. Variable selection

3. Regularisation methods

4. Discussion

Reading: Chapters 6.1 and 6.2 of ISL.

Exercise questions: Chapter 6.8 of ISL, Q1, Q2, Q3, and Q4.


Introduction

Linear Methods for Regression

In this lecture we focus again on the linear regression model for

prediction. We move beyond OLS to consider other estimation

methods.

The motivation for studying these methods is that using many

predictors in a linear regression model typically leads to overfitting.

We will therefore accept some bias in order to reduce variance.


Linear regression (review)

Consider the additive error model

Y = f(x) + ε.

The linear regression model is a special case based on a regression

function of the form

f(x) = β0 + β1x1 + β2x2 + . . . + βpxp


OLS (review)

In the OLS method, we select the coefficient values that minimise

the residual sum of squares

β̂ols = argmin_β Σ_{i=1}^{N} ( yi − β0 − Σ_{j=1}^{p} βj xij )²

We obtain the formula

β̂ols = (XTX)−1XTy.

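As a sanity check on the closed form, the normal-equations estimate can be compared against a numerical least squares solver. A minimal sketch; the simulated data and variable names are illustrative assumptions, not from the lecture:

```python
import numpy as np

# Simulate a small regression problem (illustrative data only).
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(50), rng.normal(size=(50, 2))])  # intercept + 2 predictors
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + 0.1 * rng.normal(size=50)

# Closed-form OLS: beta_hat = (X'X)^{-1} X'y, computed via a linear solve
# rather than an explicit matrix inverse, for numerical stability.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

The solve-based estimate agrees with `np.linalg.lstsq` and recovers the true coefficients up to sampling noise.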

MLR model (review)

1. Linearity: if X = x, then

Y = β0 + β1x1 + . . . + βpxp + ε

for some population parameters β0, β1, . . . , βp and a random

error ε.

2. The conditional mean of ε given X is zero, E(ε|X) = 0.

3. Constant error variance: Var(ε|X) = σ2.

4. Independence: the observations are independent.

5. The distribution of X1, . . . , Xp is arbitrary.

6. There is no perfect multicollinearity (no column of X is a

linear combination of other columns).


OLS properties (review)

Under Assumptions 1 (the regression function is correctly

specified) and 2 (there are no omitted variables that are correlated

with the predictors), the OLS estimator is unbiased

E(β̂ols) = β.


Why are we not satisfied with OLS?

Prediction accuracy. Low bias (if the linearity assumption is

approximately correct), but potentially high variance. We can

improve performance by setting some coefficients to zero or

shrinking them.

Interpretability. A regression estimated with too many predictors

and high variance is hard or impossible to interpret. In order to

understand the big picture, we are willing to sacrifice some of the

small details.


Linear model selection and regularisation

Variable selection. Identify a subset of k < p predictors to use. Estimate the model by OLS on the reduced set of variables.

Regularisation (shrinkage). Fit a model involving all p predictors, but shrink the coefficients towards zero relative to OLS. Depending on the type of shrinkage, some estimated coefficients may be exactly zero, in which case the method also performs variable selection.

Dimension reduction. Construct a set of m < p predictors which are linear combinations of the original predictors. Fit the model by OLS on these new predictors.
Variable selection
Best subset selection (key concept)
The best subset selection method estimates all possible models and selects the best one according to a model selection criterion (AIC, BIC, or cross validation).

Given p predictors, there are 2^p possible models to choose from.
Best subset selection
For example, if p = 3 we would estimate 2³ = 8 models:

k = 0 : Y = β0 + ε
k = 1 : Y = β0 + β1x1 + ε
        Y = β0 + β2x2 + ε
        Y = β0 + β3x3 + ε
k = 2 : Y = β0 + β1x1 + β2x2 + ε
        Y = β0 + β1x1 + β3x3 + ε
        Y = β0 + β2x2 + β3x3 + ε
k = 3 : Y = β0 + β1x1 + β2x2 + β3x3 + ε
Best subset selection
Algorithm: Best subset selection

1. Estimate the null model M0, which contains only the constant.
2. For k = 1, 2, . . . , p:
3.   Fit all C(p, k) possible models with exactly k predictors.
4.   Pick the model with the lowest RSS and call it Mk.
5. End for.
6. Select the best model among M0, M1, . . . , Mp according to cross validation, AIC, or BIC.
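The algorithm above can be sketched in a few lines of Python. The function names and simulated data below are assumptions for illustration, and the final choice among M_0, . . . , M_p (by AIC, BIC, or cross validation) is left to the caller:

```python
import itertools
import numpy as np

def rss_of(X, y, cols):
    """RSS of an OLS fit on an intercept plus the given predictor columns."""
    n = len(y)
    Xk = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
    r = y - Xk @ beta
    return float(r @ r)

def best_subset(X, y):
    """For each size k, return the lowest-RSS model M_k (exhaustive search)."""
    p = X.shape[1]
    return {k: min(itertools.combinations(range(p), k),
                   key=lambda cols: rss_of(X, y, cols))
            for k in range(p + 1)}
```

With p predictors the dictionary holds p + 1 candidate models; a selection criterion then picks among them. The exhaustive search over 2^p subsets is exactly why the method becomes infeasible for large p.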
Computational considerations
The best subset method suffers from a problem of combinatorial explosion, since it requires the estimation of 2^p different models. The computational requirement is therefore very high, except in low dimensions.

For example, for p = 30 we would need to fit a little over 1 billion models! Best subset selection has a very high computational cost and is infeasible in practice for p larger than around 40.
Stepwise selection
Stepwise selection methods are a family of search algorithms that find promising subsets by sequentially adding or removing regressors, dramatically reducing the computational cost compared to estimating all possible specifications.

Conceptually, they are an approximation to best subset selection, not fundamentally different methods.
Forward selection
Algorithm: Forward selection

1. Estimate the null model M0, which contains only the constant.
2. For k = 1, 2, . . . , p:
3.   Fit all the p − k + 1 models that add one predictor to Mk−1.
4.   Choose the best of the p − k + 1 models in terms of RSS and call it Mk.
5. End for.
6. Select the best model among M0, M1, . . . , Mp according to cross validation, AIC, or BIC.
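The greedy loop above can be sketched as follows (a hedged illustration; `rss_of` refits OLS for each candidate, and selecting among M_0, . . . , M_p by a criterion is again left to the caller):

```python
import numpy as np

def rss_of(X, y, cols):
    """RSS of an OLS fit on an intercept plus the given predictor columns."""
    n = len(y)
    Xk = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Xk, y, rcond=None)
    r = y - Xk @ beta
    return float(r @ r)

def forward_selection(X, y):
    """Greedily add the predictor that most reduces RSS; return M_0, ..., M_p."""
    p = X.shape[1]
    selected, remaining = [], list(range(p))
    path = [tuple(selected)]
    while remaining:
        j_best = min(remaining, key=lambda j: rss_of(X, y, selected + [j]))
        selected.append(j_best)
        remaining.remove(j_best)
        path.append(tuple(selected))
    return path
```

Backward selection is the mirror image: start from the full model and greedily delete the predictor whose removal increases RSS the least.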
Backward selection
Algorithm: Backward selection

1. Estimate the full model Mp by OLS.
2. For k = p − 1, . . . , 1, 0:
3.   Fit all the k + 1 models that delete one predictor from Mk+1.
4.   Choose the best of the k + 1 models in terms of RSS and call it Mk.
5. End for.
6. Select the best model among M0, M1, . . . , Mp according to cross validation, AIC, or BIC.
Stepwise selection
• Compared to best subset selection, the forward and backward stepwise algorithms reduce the number of estimations from 2^p to 1 + p(p + 1)/2. For example, for p = 30 the number of fitted models is 466.

• The disadvantage is that the final model selected by stepwise selection is not guaranteed to optimise any selection criterion among the 2^p possible models.
Variable selection
Advantages

• Accuracy relative to OLS. It tends to lead to better predictions than estimating a model with all predictors.
• Interpretability. The final model is a linear regression model based on a reduced set of predictors.

Disadvantages

• Computational cost.
• By making binary decisions to include or exclude particular variables, variable selection may exhibit higher variance than regularisation and dimension reduction approaches.
Illustration: Equity Premium Prediction (OLS)
Quarterly data from Goyal and Welch (2008).

Response: quarterly S&P 500 returns minus the treasury bill rate.

Predictors (lagged by one quarter):

1. dp: Dividend to price ratio
2. dy: Dividend yield
3. ep: Earnings per share
4. bm: Book-to-market ratio
5. ntis: Net equity expansion
6. tbl: Treasury bill rate
7. ltr: Long term rate of return on US bonds
8. tms: Term spread
9. dfy: Default yield spread
10. dfr: Default return spread
11. infl: Inflation
12. ik: Investment to capital ratio
Illustration: Equity Premium Prediction
```
                          OLS Regression Results
==============================================================================
Dep. Variable: ret         R-squared:            0.108
Model: OLS                 Adj. R-squared:       0.051
Method: Least Squares      F-statistic:          1.901
Date:                      Prob (F-statistic):   0.0421
Time:                      Log-Likelihood:       -629.21
No. Observations: 184      AIC:                  1282.
Df Residuals: 172          BIC:                  1321.
Df Model: 11
Covariance Type: nonrobust
==============================================================================
               coef      std err        t      P>|t|   [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept    26.1369      14.287     1.829     0.069     -2.064    54.337
dp            0.3280       8.247     0.040     0.968    -15.951    16.607
dy            3.3442       7.941     0.421     0.674    -12.330    19.019
ep            0.3133       2.345     0.134     0.894     -4.315     4.942
bm           -3.2443       6.719    -0.483     0.630    -16.507    10.018
ntis        -46.9566      38.911    -1.207     0.229   -123.762    29.848
tbl          -2.8651      20.922    -0.137     0.891    -44.162    38.432
ltr          10.2432      14.468     0.708     0.480    -18.314    38.800
tms          13.1083      11.129     1.178     0.240     -8.859    35.076
dfy        -156.8202     213.943    -0.733     0.465   -579.111   265.471
dfr          71.0710      29.099     2.442     0.016     13.634   128.508
infl        -36.9489      82.870    -0.446     0.656   -200.521   126.623
ik         -208.4868     242.844    -0.859     0.392   -687.824   270.851
==============================================================================
```

Illustration: Equity Premium Prediction

We select the following models in the equity premium dataset

based on the AIC:

Best subset selection: (dy, bm, tms, dfr)

Forward selection: (ik, tms, dfr)

Backward selection: (dy, tms, dfr)


Illustration: Equity Premium Prediction

Table 1: Equity Premium Prediction Results

| Model       | Train R2 | Test R2 |
|-------------|----------|---------|
| OLS         | 0.108    | 0.014   |
| Best Subset | 0.095    | 0.038   |
| Forward     | 0.083    | 0.042   |
| Backward    | 0.084    | 0.060   |

Illustration: Equity Premium Prediction (OLS)


Wrong ways to do variable selection

Adjusted R2. The adjusted R2 has no justification as a model

selection criterion. It does not sufficiently penalise additional

predictors.

Removing statistically insignificant predictors. A statistically

significant coefficient means we can reliably say that it is not

exactly zero. This has almost nothing to do with prediction (see

the regression output slide). Furthermore, there are multiple

testing issues.


Regularisation methods

Regularisation methods (key concept)

Regularisation or shrinkage methods for linear regression follow

the general framework of empirical risk minimisation:

θ̂ = argmin_θ [ Σ_{i=1}^{N} L(yi, f(xi; θ)) ] + λ C(θ),

Here, the loss function is the squared loss and the complexity function will be a norm of the vector of regression coefficients β. The choice of norm leads to different regularisation properties.

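Specialised to squared loss with a norm penalty, the criterion above can be written down directly. A minimal sketch; the function name and option labels are assumptions:

```python
import numpy as np

def penalised_rss(beta, X, y, lam, norm="l2"):
    """Empirical risk (residual sum of squares) plus lambda times a complexity
    term: the squared l2 norm (ridge-style) or the l1 norm (lasso-style)."""
    resid = y - X @ beta
    risk = float(resid @ resid)
    penalty = float(beta @ beta) if norm == "l2" else float(np.abs(beta).sum())
    return risk + lam * penalty
```

Minimising this objective over β, for a fixed λ, is exactly what the ridge and lasso methods on the following slides do.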

Ridge regression (key concept)

The ridge regression method solves the penalised estimation

problem

β̂ridge = argmin_β { Σ_{i=1}^{N} ( yi − β0 − Σ_{j=1}^{p} βj xij )² + λ Σ_{j=1}^{p} βj² },

for a tuning parameter λ.

The penalty term λ‖β‖₂² has the effect of shrinking the coefficients relative to OLS. We refer to this procedure as ℓ2 regularisation.

Ridge regression

The ridge estimator has an equivalent formulation as a constrained minimisation problem

β̂ridge = argmin_β Σ_{i=1}^{N} ( yi − β0 − Σ_{j=1}^{p} βj xij )²

subject to Σ_{j=1}^{p} βj² ≤ t,

for some t > 0.

Practical details

1. The hyperparameters λ and t control the amount of shrinkage. There is a one-to-one connection between them.

2. We do not penalise the intercept. In practice, we centre the response and the predictors before computing the solution and estimate the intercept as β̂0 = ȳ.

3. The method is not invariant to the scale of the inputs. We standardise the predictors before solving the minimisation problem.

Ridge regression

We can write the minimisation problem in matrix form as

min_β (y − Xβ)ᵀ(y − Xβ) + λβᵀβ.

Relying on the same techniques that we used to derive the OLS estimator, we can show that the ridge estimator has the formula

β̂ridge = (XᵀX + λI)⁻¹Xᵀy.
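The closed form can be implemented directly, following the practical details above (centred response, standardised predictors, unpenalised intercept). A sketch with assumed function names, not a production implementation:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Ridge regression via (Z'Z + lam*I)^{-1} Z'y on standardised predictors.
    Returns the intercept (the response mean) and the shrunk coefficients."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)   # standardise predictors
    yc = y - y.mean()                          # centre the response
    p = Z.shape[1]
    beta = np.linalg.solve(Z.T @ Z + lam * np.eye(p), Z.T @ yc)
    return y.mean(), beta
```

At λ = 0 this reproduces OLS on the standardised data; increasing λ shrinks the coefficient norm towards zero.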

Orthonormal vectors

We say that two vectors u and v are orthonormal when

‖u‖ = √(uᵀu) = 1,  ‖v‖ = √(vᵀv) = 1,  and uᵀv = 0.

We say that the design matrix X is orthonormal when all its columns are orthonormal.

Ridge regression: shrinkage (key concept)

If the design matrix X were orthonormal, the ridge estimate would just be a scaled version of the OLS estimate:

β̂ridge = (I + λI)⁻¹Xᵀy = β̂ols / (1 + λ).

In the more general situation, we can say that the ridge regression method will shrink together the coefficients of correlated predictors.
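The orthonormal-case result is easy to verify numerically; here an orthonormal design is built from a QR decomposition (illustrative data, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(4)
Q, _ = np.linalg.qr(rng.normal(size=(40, 3)))   # columns of Q are orthonormal
y = rng.normal(size=40)
lam = 2.0

beta_ols = Q.T @ y                              # OLS, since Q'Q = I
beta_ridge = np.linalg.solve(Q.T @ Q + lam * np.eye(3), Q.T @ y)
```

With λ = 2, each ridge coefficient is exactly one third of the corresponding OLS coefficient.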

Ridge regression

We define the ridge shrinkage factor as

s(λ) = ‖β̂ridge‖₂ / ‖β̂ols‖₂,

for a given λ or t.

The next slide illustrates the effect of varying the shrinkage factor on the estimated parameters.

Ridge coefficient profiles (equity premium data)


Selecting λ

The ridge regression method leads to a range of models for different values of λ. We select λ by cross validation or generalised cross validation (GCV).

GCV is computationally convenient for this model.
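A sketch of GCV for ridge regression, assuming centred y and standardised X; the hat-matrix expression is the standard one, but the function name is an assumption:

```python
import numpy as np

def ridge_gcv(X, y, lam):
    """GCV(lam) = (1/N) * RSS / (1 - trace(S)/N)^2, where S is the ridge hat
    matrix X (X'X + lam*I)^{-1} X' and trace(S) is the effective df."""
    n, p = X.shape
    S = X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)
    resid = y - S @ y
    return float(resid @ resid) / n / (1.0 - np.trace(S) / n) ** 2
```

In practice we would evaluate this over a grid of λ values and keep the minimiser; no refitting over held-out folds is needed, which is what makes GCV convenient here.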

Selecting λ (equity premium data)


The Lasso

The Lasso (least absolute shrinkage and selection operator) method solves the penalised estimation problem

β̂lasso = argmin_β { Σ_{i=1}^{N} ( yi − β0 − Σ_{j=1}^{p} βj xij )² + λ Σ_{j=1}^{p} |βj| },

for a tuning parameter λ.

The Lasso therefore performs ℓ1 regularisation.

The Lasso

The equivalent formulation of the lasso as a constrained minimisation problem is

β̂lasso = argmin_β Σ_{i=1}^{N} ( yi − β0 − Σ_{j=1}^{p} βj xij )²

subject to Σ_{j=1}^{p} |βj| ≤ t,

for some t > 0.

The Lasso: shrinkage and variable selection (key concept)

Shrinkage. As with ridge regression, the lasso shrinks the coefficients towards zero. However, the nature of this shrinkage is different, as we will see below.

Variable selection. In addition to shrinkage, the lasso also performs variable selection: with λ sufficiently large, some estimated coefficients will be exactly zero, leading to sparse models. This is a key difference from ridge.

The Lasso: variable selection property

Estimation picture for the lasso (left) and ridge regression (right):


Practical details

1. We select the tuning parameter λ by cross validation.

2. As with ridge, we centre and standardise the predictors before computing the solution.

3. There is no closed form solution for the lasso coefficients. Computing the lasso solution is a quadratic programming problem.

4. There are efficient algorithms for computing an entire path of solutions for a range of λ values.
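Although there is no closed form, a minimal cyclic coordinate descent (one standard approach, sketched here for the objective Σᵢ(yᵢ − xᵢᵀβ)² + λΣⱼ|βⱼ| on centred, standardised data) makes the soft-thresholding structure explicit; all names are illustrative:

```python
import numpy as np

def soft_threshold(z, t):
    """The soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=500):
    """Cyclic coordinate descent for sum (y - X beta)^2 + lam * sum |beta_j|."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]   # partial residual excluding j
            rho = float(X[:, j] @ r_j)
            beta[j] = soft_threshold(rho, lam / 2.0) / float(X[:, j] @ X[:, j])
    return beta
```

At λ = 0 the iterations converge to the OLS fit; for λ large enough, every coefficient is thresholded to exactly zero, which is the variable selection property in action.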

The Lasso

We define the shrinkage factor for a given value of λ (or t) as

s(λ) = Σ_{j=1}^{p} |β̂lasso_j| / Σ_{j=1}^{p} |β̂ols_j|.

The next slide illustrates the effect of varying the shrinkage factor on the estimated parameters.

Lasso coefficient profiles (equity premium data)


Model selection for the equity premium data


Discussion

Subset selection, ridge, and lasso: comparison in the orthonormal case (optional)

Estimators of βj in the case of orthonormal columns of X:

Estimator               Formula
Best subset (size k)    β̂j · I(|β̂j| > |β̂(k)|)
Ridge                   β̂j / (1 + λ)
Lasso                   sign(β̂j) (|β̂j| − λ)₊
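The three rules in the table can be written out as functions of the OLS coefficients. A sketch of the formulas only; the function names are assumptions:

```python
import numpy as np

def ridge_orth(b, lam):
    """Proportional shrinkage: every coefficient is scaled by 1/(1 + lam)."""
    return np.asarray(b, dtype=float) / (1.0 + lam)

def lasso_orth(b, lam):
    """Soft thresholding: translate towards zero by lam, truncating at zero."""
    b = np.asarray(b, dtype=float)
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

def subset_orth(b, k):
    """Hard thresholding: keep the k largest |b_j|, set the rest to zero."""
    b = np.asarray(b, dtype=float)
    out = np.zeros_like(b)
    if k > 0:
        keep = np.argsort(np.abs(b))[-k:]
        out[keep] = b[keep]
    return out
```

Ridge shrinks every coefficient by the same proportion; the lasso moves each towards zero by a fixed amount and can hit zero exactly; best subset keeps or kills coefficients outright.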

Ridge and Lasso: comparison in the orthonormal case (optional)

[Figure: coefficient estimates plotted against the least squares estimate in the orthonormal case. One panel shows Ridge against Least Squares (proportional shrinkage); the other shows Lasso against Least Squares (soft thresholding).]

Which method to use?

• Recall the no free lunch theorem: neither ridge regression nor the lasso universally outperforms the other. The choice of method should be data driven.

• In general terms, we can expect the lasso to perform better when a small subset of predictors have important coefficients, while the remaining predictors have small or zero coefficients (sparse problems).

• Ridge regression will tend to perform better when the predictors all have similar importance.

• The lasso may have better interpretability since it can lead to a sparse solution.

Elastic Net

The elastic net is a compromise between ridge regression and the lasso:

β̂EN = argmin_β { Σ_{i=1}^{N} ( yi − β0 − Σ_{j=1}^{p} βj xij )² + λ Σ_{j=1}^{p} ( α βj² + (1 − α)|βj| ) },

for λ ≥ 0 and 0 < α < 1.

The elastic net performs variable selection like the lasso, and shrinks together the coefficients of correlated predictors like ridge regression.

Illustration: equity premium data

Estimated coefficients (tuning parameters selected by leave-one-out CV):

| Predictor | OLS    | Ridge  | Lasso  | EN     |
|-----------|--------|--------|--------|--------|
| dp        | 0.566  | 0.159  | 0.000  | 0.111  |
| dy        | 0.602  | 0.197  | 0.000  | 0.153  |
| ep        | 0.942  | 0.116  | 0.000  | 0.048  |
| bm        | -1.055 | 0.033  | 0.000  | 0.000  |
| ntis      | -0.276 | -0.067 | -0.000 | -0.000 |
| tbl       | -0.489 | -0.248 | -0.000 | -0.178 |
| ltr       | 0.597  | 0.186  | 0.000  | 0.124  |
| tms       | 0.762  | 0.286  | 0.161  | 0.239  |
| dfy       | 0.145  | 0.031  | 0.000  | 0.000  |
| dfr       | 1.570  | 0.377  | 0.131  | 0.294  |
| infl      | -0.202 | -0.214 | -0.000 | -0.150 |
| ik        | -0.408 | -0.318 | -0.422 | -0.282 |

Illustration: equity premium data

Prediction results:

| Model       | Train R2 | Test R2 |
|-------------|----------|---------|
| OLS         | 0.108    | 0.014   |
| Ridge       | 0.054    | 0.033   |
| Lasso       | 0.033    | 0.011   |
| Elastic Net | 0.050    | 0.029   |

Comparison with variable selection

Regularisation methods have two important advantages over variable selection.

1. They are continuous procedures, generally leading to lower variance.
2. The computational cost is not much larger than OLS.

Review questions

• What is best subset selection?
• What are stepwise methods?
• What are the advantages and disadvantages of variable selection?
• What are the penalty terms in the ridge and Lasso methods?
• What are the key differences in the type of shrinkage between the ridge and Lasso methods?
• In what situations would we expect the ridge or lasso methods to perform better?