Predictive Analytics
Week 7: Linear Methods for Regression I
Semester 2, 2018
Discipline of Business Analytics, The University of Sydney Business School
QBUS2820 content structure
1. Statistical and Machine Learning foundations and applications.
2. Advanced regression methods.
3. Classification methods.
4. Time series forecasting.
Week 7: Linear Methods for Regression I
1. Introduction
2. Variable selection
3. Regularisation methods
4. Discussion
Reading: Chapters 6.1 and 6.2 of ISL.
Exercise questions: Chapter 6.8 of ISL, Q1, Q2, Q3, and Q4.
Introduction
Linear Methods for Regression
In this lecture we focus again on the linear regression model for
prediction. We move beyond OLS to consider other estimation
methods.
The motivation for studying these methods is that using many
predictors in a linear regression model typically leads to overfitting.
We will therefore accept some bias in order to reduce variance.
Linear regression (review)
Consider the additive error model
Y = f(x) + ε.
The linear regression model is a special case based on a regression
function of the form
$$f(x) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p.$$
OLS (review)
In the OLS method, we select the coefficient values that minimise
the residual sum of squares
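$$\hat{\beta}_{\text{ols}} = \underset{\beta}{\operatorname{argmin}} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2.$$
We obtain the formula
$$\hat{\beta}_{\text{ols}} = (X^T X)^{-1} X^T y.$$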
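The closed-form expression can be checked numerically. The following is a minimal sketch on simulated data (the sample size, coefficients, and noise level are assumptions made purely for illustration); it solves the normal equations and compares the result with NumPy's least-squares solver.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (hypothetical values): N observations, p predictors.
N, p = 200, 3
X_raw = rng.normal(size=(N, p))
beta_true = np.array([1.0, 2.0, -1.0, 0.5])   # intercept first
X = np.column_stack([np.ones(N), X_raw])      # add the constant column
y = X @ beta_true + rng.normal(scale=0.5, size=N)

# Normal equations: solve (X'X) b = X'y instead of forming the inverse.
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver, usually preferred when X is ill-conditioned.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_ols)
print(beta_lstsq)
```

Both approaches give the same coefficients here; solving the linear system avoids computing the inverse of X'X explicitly.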
MLR model (review)
1. Linearity: if X = x, then
$$Y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \varepsilon$$
for some population parameters β0, β1, . . . , βp and a random
error ε.
2. The conditional mean of ε given X is zero, E(ε|X) = 0.
3. Constant error variance: $\mathrm{Var}(\varepsilon \mid X) = \sigma^2$.
4. Independence: the observations are independent.
5. The distribution of X1, . . . , Xp is arbitrary.
6. There is no perfect multicollinearity (no column of X is a
linear combination of other columns).
OLS properties (review)
Under Assumptions 1 (the regression function is correctly
specified) and 2 (there are no omitted variables that are correlated
with the predictors), the OLS estimator is unbiased
E(β̂ols) = β.
Why are we not satisfied with OLS?
Prediction accuracy. Low bias (if the linearity assumption is
approximately correct), but potentially high variance. We can
improve performance by setting some coefficients to zero or
shrinking them.
Interpretability. A regression estimated with too many predictors
and high variance is hard or impossible to interpret. In order to
understand the big picture, we are willing to sacrifice some of the
small details.
Linear model selection and regularisation
Variable selection. Identify a subset of k < p predictors to use.
Estimate the model by using OLS on the reduced set of variables.
Regularisation (shrinkage). Fit a model involving all the p
predictors, but shrink the coefficients towards zero relative to OLS.
Depending on the type of shrinkage, some estimated coefficients
may be zero, in which case the method also performs variable
selection.
Dimension reduction. Construct a set of m < p predictors which
are linear combinations of the original predictors. Fit the model
by OLS on these new predictors.
Variable selection
Best subset selection (key concept)
The best subset selection method estimates all possible models
and selects the best one according to a model selection criterion
(AIC, BIC, or cross validation).
Given p predictors, there are $2^p$ possible models to choose from.
Best subset selection
For example, if p = 3 we would estimate $2^3 = 8$ models:
k = 0 : Y = β0 + ε
k = 1 : Y = β0 + β1x1 + ε
        Y = β0 + β2x2 + ε
        Y = β0 + β3x3 + ε
k = 2 : Y = β0 + β1x1 + β2x2 + ε
        Y = β0 + β1x1 + β3x3 + ε
        Y = β0 + β2x2 + β3x3 + ε
k = 3 : Y = β0 + β1x1 + β2x2 + β3x3 + ε
Best subset selection
Algorithm Best subset selection
1: Estimate the null model $M_0$, which contains only the constant.
2: for k = 1, 2, . . . , p do
3:   Fit all $\binom{p}{k}$ possible models with exactly k predictors.
4:   Pick the model with the lowest RSS and call it $M_k$.
5: end for
6: Select the best model among $M_0, M_1, \dots, M_p$ according to
   cross validation, AIC, or BIC.
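The algorithm above can be sketched in a few lines of Python. This is an illustrative sketch rather than a reference implementation: the helper names (`rss`, `best_subset`, `aic`) and the simulated data are assumptions, and the AIC is written in one common Gaussian form, n log(RSS/n) + 2(k + 1), which is defined only up to an additive constant.

```python
from itertools import combinations
import numpy as np

def rss(X, y, cols):
    """RSS of an OLS fit on a constant plus the predictors in `cols`."""
    Z = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ coef
    return float(resid @ resid)

def best_subset(X, y):
    """Return {k: (best columns of size k, RSS)} for k = 0, ..., p."""
    p = X.shape[1]
    best = {}
    for k in range(p + 1):
        # Fit every model with exactly k predictors and keep the lowest RSS.
        fits = [(rss(X, y, cols), cols) for cols in combinations(range(p), k)]
        best_rss, best_cols = min(fits)
        best[k] = (best_cols, best_rss)
    return best

def aic(rss_value, n, k):
    """Gaussian AIC up to a constant; k + 1 counts the slopes and the intercept."""
    return n * np.log(rss_value / n) + 2 * (k + 1)

# Simulated example (hypothetical): only predictors 0 and 2 matter.
rng = np.random.default_rng(1)
n = 100
X = rng.normal(size=(n, 4))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(size=n)

models = best_subset(X, y)                                   # M_0, ..., M_p
k_star = min(models, key=lambda k: aic(models[k][1], n, k))  # final choice by AIC
selected = models[k_star][0]
print(k_star, selected)
```

Within a given size k, comparing models by RSS is enough because they all have the same number of parameters; the criterion is only needed to compare across sizes.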
Computational considerations
The best subset method suffers from a problem of combinatorial
explosion, since it requires the estimation of $2^p$ different models.
The computational requirement is therefore very high, except in
low dimensions.
For example, for p = 30 we would need to fit a little over 1 billion
models! Best subset selection has a very high computational cost
and is infeasible in practice for p larger than around 40.
Stepwise selection
Stepwise selection methods are a family of search algorithms that
find promising subsets by sequentially adding or removing
regressors, dramatically reducing the computational cost compared
to estimating all possible specifications.
Conceptually, they are an approximation to best subset selection,
not different methods.
Forward selection
Algorithm Forward selection
1: Estimate the null model $M_0$, which contains only the constant.
2: for k = 1, 2, . . . , p do
3:   Fit all the p − k + 1 models that add one predictor to $M_{k-1}$.
4:   Choose the best of the p − k + 1 models in terms of RSS and call
     it $M_k$.
5: end for
6: Select the best model among $M_0, M_1, \dots, M_p$ according to
   cross validation, AIC, or BIC.
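A minimal sketch of the forward search follows, assuming simulated data and a hypothetical helper `rss`; it also counts the fitted models so that the count can be compared with 1 + p(p + 1)/2. The final choice among the models on the path (cross validation, AIC, or BIC) is left out.

```python
import numpy as np

def rss(X, y, cols):
    """RSS of an OLS fit on a constant plus the predictors in `cols`."""
    Z = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ coef
    return float(resid @ resid)

def forward_selection(X, y):
    """Return the path M_0, ..., M_p as (columns, RSS) pairs and the number of fits."""
    p = X.shape[1]
    current = []
    path = [((), rss(X, y, current))]   # null model M_0
    n_fits = 1
    for _ in range(p):
        remaining = [j for j in range(p) if j not in current]
        # Try each one-predictor addition to the current model.
        candidates = [(rss(X, y, current + [j]), j) for j in remaining]
        n_fits += len(candidates)
        best_rss, best_j = min(candidates)
        current = current + [best_j]
        path.append((tuple(current), best_rss))
    return path, n_fits

# Simulated example (hypothetical): only predictors 0 and 2 matter.
rng = np.random.default_rng(2)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(size=n)

path, n_fits = forward_selection(X, y)
for cols, value in path:
    print(cols, round(value, 1))
print("models fitted:", n_fits)
```

Each model on the path contains the previous one, which is what keeps the search cheap and also what prevents it from revisiting earlier choices.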
Backward selection
Algorithm Backward selection
1: Estimate the full model $M_p$ by OLS.
2: for k = p − 1, . . . , 1, 0 do
3:   Fit all the k + 1 models that delete one predictor from $M_{k+1}$.
4:   Choose the best of the k + 1 models in terms of RSS and call
     it $M_k$.
5: end for
6: Select the best model among $M_0, M_1, \dots, M_p$ according to
   cross validation, AIC, or BIC.
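The backward search mirrors the forward one. The sketch below uses the same assumptions (simulated data, a hypothetical `rss` helper) and requires more observations than predictors so that the full model can be estimated by OLS.

```python
import numpy as np

def rss(X, y, cols):
    """RSS of an OLS fit on a constant plus the predictors in `cols`."""
    Z = np.column_stack([np.ones(len(y))] + [X[:, j] for j in cols])
    coef, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ coef
    return float(resid @ resid)

def backward_selection(X, y):
    """Return the path M_0, ..., M_p as a list of (columns, RSS) pairs."""
    p = X.shape[1]
    current = list(range(p))
    path = [(tuple(current), rss(X, y, current))]   # full model M_p
    while current:
        # Try deleting each predictor; keep the deletion with the lowest RSS.
        best_rss, drop = min(
            (rss(X, y, [j for j in current if j != d]), d) for d in current
        )
        current = [j for j in current if j != drop]
        path.append((tuple(current), best_rss))
    return path[::-1]   # reorder from M_0 up to M_p

# Simulated example (hypothetical): only predictors 0 and 2 matter.
rng = np.random.default_rng(2)
n, p = 100, 5
X = rng.normal(size=(n, p))
y = 1.0 + 2.0 * X[:, 0] - 1.5 * X[:, 2] + rng.normal(size=n)

path = backward_selection(X, y)
for cols, value in path:
    print(cols, round(value, 1))
```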
Stepwise selection
• Compared to best subset selection, the forward and backward
stepwise algorithms reduce the number of estimations from $2^p$
to 1 + p(p + 1)/2. For example, for p = 30 the number of
fitted models is 466.
• The disadvantage is that the final model selected by stepwise
selection is not guaranteed to optimise any selection criterion
among the $2^p$ possible models.
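The two counts can be verified directly by summing the number of models fitted at each step of the algorithms (the function names below are illustrative).

```python
from math import comb

def n_models_best_subset(p):
    """Sum of C(p, k) over k = 0, ..., p, which equals 2**p."""
    return sum(comb(p, k) for k in range(p + 1))

def n_models_stepwise(p):
    """Null (or full) model plus p - k + 1 fits at step k, i.e. 1 + p(p + 1)/2."""
    return 1 + sum(p - k + 1 for k in range(1, p + 1))

for p in (3, 10, 30):
    print(p, n_models_best_subset(p), n_models_stepwise(p))
```

For p = 30 this gives 1,073,741,824 models for best subset selection against 466 for a stepwise search.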
Variable selection
Advantages
• Accuracy relative to OLS. It tends to lead to better predictions
compared to estimating a model with all predictors.
• Interpretability. The final model is a linear regression model
based on a reduced set of predictors.
Disadvantages
• Computational cost.
• By making binary decisions to include or exclude particular
variables, variable selection may exhibit higher variance than
regularisation and dimension reduction approaches.
Illustration: Equity Premium Prediction (OLS)
Quarterly data from Goyal and Welch (2008).
Response: quarterly S&P 500 returns minus treasury bill rate
Predictors (lagged by one quarter):
1. dp Dividend to price ratio
2. dy Dividend yield
3. ep Earnings to price ratio
4. bm Book-to-market ratio
5. ntis Net equity expansion
6. tbl Treasury bill rate
7. ltr Long term rate of return on US bonds
8. tms Term spread
9. dfy Default yield spread
10. dfr Default return spread
11. infl Inflation
12. ik Investment to capital ratio
Illustration: Equity Premium Prediction
OLS Regression Results
==============================================================================
Dep. Variable:                    ret   R-squared:                       0.108
Model:                            OLS   Adj. R-squared:                  0.051
Method:                 Least Squares   F-statistic:                     1.901
Date:                                   Prob (F-statistic):             0.0421
Time:                                   Log-Likelihood:                -629.21
No. Observations:                 184   AIC:                             1282.
Df Residuals:                     172   BIC:                             1321.
Df Model:                          11
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
Intercept     26.1369     14.287      1.829      0.069        -2.064    54.337
dp             0.3280      8.247      0.040      0.968       -15.951    16.607
dy             3.3442      7.941      0.421      0.674       -12.330    19.019
ep             0.3133      2.345      0.134      0.894        -4.315     4.942
bm            -3.2443      6.719     -0.483      0.630       -16.507    10.018
ntis         -46.9566     38.911     -1.207      0.229      -123.762    29.848
tbl           -2.8651     20.922     -0.137      0.891       -44.162    38.432
ltr           10.2432     14.468      0.708      0.480       -18.314    38.800
tms           13.1083     11.129      1.178      0.240        -8.859    35.076
dfy         -156.8202    213.943     -0.733      0.465      -579.111   265.471
dfr           71.0710     29.099      2.442      0.016        13.634   128.508
infl         -36.9489     82.870     -0.446      0.656      -200.521   126.623
ik          -208.4868    242.844     -0.859      0.392      -687.824   270.851
==============================================================================
Illustration: Equity Premium Prediction
We select the following models in the equity premium dataset
based on the AIC:
Best subset selection: (dy, bm, tms, dfr)
Forward selection: (ik, tms, dfr)
Backward selection: (dy, tms, dfr)
Illustration: Equity Premium Prediction
Table 1: Equity Premium Prediction Results
              Train R^2   Test R^2
OLS             0.108      0.014
Best Subset     0.095      0.038
Forward         0.083      0.042
Backward        0.084      0.060
Illustration: Equity Premium Prediction (OLS)
Wrong ways to do variable selection
Adjusted $R^2$. The adjusted $R^2$ has no justification as a model
selection criterion. It does not sufficiently penalise additional
predictors.
Removing statistically insignificant predictors. A statistically
significant coefficient means we can reliably say that it is not
exactly zero. This has almost nothing to do with prediction (see
the regression output slide). Furthermore, there are multiple
testing issues.
Regularisation methods
Regularisation methods (key concept)
Regularisation or shrinkage methods for linear regression follow
the general framework of empirical risk minimisation:
$$\hat{\theta} = \underset{\theta}{\operatorname{argmin}} \left[ \sum_{i=1}^{N} L(y_i, f(x_i; \theta)) \right] + \lambda C(\theta).$$
Here, the loss function is the squared loss and the complexity
function will be the norm of the vector of regression coefficients β.
The choice of norm leads to different regularisation properties.
Ridge regression (key concept)
The ridge regression method solves the penalised estimation
problem
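$$\hat{\beta}_{\text{ridge}} = \underset{\beta}{\operatorname{argmin}} \left\{ \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \right\},$$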
for a tuning parameter λ.
The penalty term $\lambda \|\beta\|_2^2$ has the effect of shrinking the coefficients
relative to OLS. We refer to this procedure as $\ell_2$ regularisation.
Ridge regression
The ridge estimator has an equivalent formulation as a constrained
minimisation problem
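$$\hat{\beta}_{\text{ridge}} = \underset{\beta}{\operatorname{argmin}} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \le t,$$
for some t > 0.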
Practical details
1. The hyperparameters λ or t control the amount of shrinkage.
There is a one-to-one connection between them.
2. We do not penalise the intercept. In practice, we center the
response and the predictors before computing the solution and
estimate the intercept as $\hat{\beta}_0 = \bar{y}$.
3. The method is not invariant to the scale of the inputs. We
standardise the predictors before solving the minimisation
problem.
Ridge regression
We can write the minimisation problem in matrix form as
$$\min_{\beta} \; (y - X\beta)^T (y - X\beta) + \lambda \beta^T \beta.$$
Relying on the same techniques that we used to derive the OLS
estimator, we can show the ridge estimator has the formula
$$\hat{\beta}_{\text{ridge}} = (X^T X + \lambda I)^{-1} X^T y.$$
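The ridge formula is straightforward to compute. The sketch below assumes simulated data with two highly correlated predictors, and the function names (`standardise`, `ridge`) are illustrative. Following the practical details above, it standardises the predictors, centres the response, and estimates the intercept by the sample mean of y.

```python
import numpy as np

def standardise(X):
    """Centre each column and scale it to unit standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def ridge(X, y, lam):
    """Return (intercept, coefficients on the standardised predictors)."""
    Xs = standardise(X)
    yc = y - y.mean()
    p = Xs.shape[1]
    # Solve (X'X + lam I) b = X'y; the intercept is not penalised.
    beta = np.linalg.solve(Xs.T @ Xs + lam * np.eye(p), Xs.T @ yc)
    return y.mean(), beta

# Simulated example (hypothetical): the first two predictors are highly correlated.
rng = np.random.default_rng(4)
N = 120
z = rng.normal(size=N)
X = np.column_stack([z + 0.1 * rng.normal(size=N),
                     z + 0.1 * rng.normal(size=N),
                     rng.normal(size=N)])
y = 2.0 + X @ np.array([1.0, 1.0, -0.5]) + rng.normal(size=N)

lams = [0.0, 1.0, 10.0, 100.0]
fits = {lam: ridge(X, y, lam)[1] for lam in lams}
norms = [float(np.linalg.norm(fits[lam])) for lam in lams]
for lam, norm in zip(lams, norms):
    print(lam, fits[lam].round(3), round(norm, 3))
```

With λ = 0 the result coincides with OLS on the standardised predictors, and the Euclidean norm of the coefficient vector decreases as λ grows.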
Orthonormal vectors
We say that two vectors u and v are orthonormal when
$$\|u\| = \sqrt{u^T u} = 1, \quad \|v\| = \sqrt{v^T v} = 1, \quad \text{and} \quad u^T v = 0.$$
We say that the design matrix X is orthonormal when all its
columns are orthonormal, in which case $X^T X = I$.
Ridge regression: shrinkage (key concept)
If the design matrix X were orthonormal, the ridge estimate would
be just a scaled version of the OLS estimate
$$\hat{\beta}_{\text{ridge}} = (I + \lambda I)^{-1} X^T y = \frac{1}{1 + \lambda} \hat{\beta}_{\text{ols}}.$$
In a more general situation, we can say that the ridge regression
method will shrink together the coefficients of correlated predictors.
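The orthonormal special case can be checked numerically. In this sketch, a design matrix with orthonormal columns is obtained from a reduced QR factorisation of a random matrix; the data and the value of λ are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
N, p, lam = 50, 4, 2.0

# Q has orthonormal columns, so Q'Q = I (reduced QR factorisation).
Q, _ = np.linalg.qr(rng.normal(size=(N, p)))
y = Q @ np.array([3.0, -2.0, 0.5, 0.0]) + rng.normal(scale=0.1, size=N)

beta_ols = Q.T @ y   # (Q'Q)^{-1} Q'y simplifies because Q'Q = I
beta_ridge = np.linalg.solve(Q.T @ Q + lam * np.eye(p), Q.T @ y)

print(beta_ols.round(3))
print(beta_ridge.round(3))
print((beta_ols / (1 + lam)).round(3))   # same as the ridge estimate
```

Every coefficient is scaled by the same factor 1/(1 + λ), so ridge shrinks proportionally and never sets a coefficient exactly to zero in this case.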
Ridge regression
We define the ridge shrinkage factor as
$$s(\lambda) = \frac{\|\hat{\beta}_{\text{ridge}}\|_2}{\|\hat{\beta}_{\text{ols}}\|_2},$$
for a given λ or t.
The next slide illustrates the effect of varying the shrinkage factor
on the estimated parameters.
Ridge coefficient profiles (equity premium data)
Selecting λ
The ridge regression method leads to a range of models for
different values of λ. We select λ by cross validation or generalised
cross validation.
GCV is computationally convenient for this model.
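One reason GCV is convenient for ridge regression is that the fitted values are linear in y, ŷ = Hy with H = X(XᵀX + λI)⁻¹Xᵀ, so both leave-one-out CV and GCV can be computed from a single fit per λ. The sketch below uses one common definition of the GCV score, the mean of ((yᵢ − ŷᵢ)/(1 − tr(H)/N))², and assumes standardised predictors and a centred response with no intercept column. The simulated data and the grid of λ values are assumptions.

```python
import numpy as np

def ridge_hat(X, lam):
    """Smoother matrix H such that the ridge fitted values are H @ y."""
    p = X.shape[1]
    return X @ np.linalg.solve(X.T @ X + lam * np.eye(p), X.T)

def loocv(X, y, lam):
    """Leave-one-out CV error via the shortcut (y_i - yhat_i) / (1 - H_ii)."""
    H = ridge_hat(X, lam)
    resid = y - H @ y
    return float(np.mean((resid / (1.0 - np.diag(H))) ** 2))

def gcv(X, y, lam):
    """GCV replaces each H_ii by the average leverage trace(H) / N."""
    N = X.shape[0]
    H = ridge_hat(X, lam)
    resid = y - H @ y
    return float(np.mean((resid / (1.0 - np.trace(H) / N)) ** 2))

# Simulated example (hypothetical): many predictors with small effects.
rng = np.random.default_rng(7)
N, p = 80, 10
X = rng.normal(size=(N, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = X @ rng.normal(scale=0.3, size=p) + rng.normal(size=N)
y = y - y.mean()

grid = np.logspace(-2, 3, 30)
gcv_scores = np.array([gcv(X, y, lam) for lam in grid])
loo_scores = np.array([loocv(X, y, lam) for lam in grid])
best_lam_gcv = float(grid[gcv_scores.argmin()])
best_lam_loo = float(grid[loo_scores.argmin()])
print(best_lam_gcv, best_lam_loo)
```

The leave-one-out shortcut is exact here because the standardisation is treated as fixed; if the predictors were re-standardised within each fold, the two computations would differ slightly.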
Selecting λ (equity premium data)
The Lasso
The Lasso (least absolute shrinkage and selection operator)
method solves the penalised estimation problem
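$$\hat{\beta}_{\text{lasso}} = \underset{\beta}{\operatorname{argmin}} \left\{ \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \right\},$$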
for a tuning parameter λ.
The Lasso therefore performs $\ell_1$ regularisation.
The Lasso
The equivalent formulation of the lasso as a constrained
minimisation problem is
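$$\hat{\beta}_{\text{lasso}} = \underset{\beta}{\operatorname{argmin}} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t,$$
for some t > 0.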
The Lasso: shrinkage and variable selection (key concept)
Shrinkage. As with ridge regression, the lasso shrinks the
coefficients towards zero. However, the nature of this shrinkage is
different, as we will see below.
Variable selection. In addition to shrinkage, the lasso also
performs variable selection. With λ sufficiently large, some
estimated coefficients will be exactly zero, leading to sparse
models. This is a key difference from ridge.
The Lasso: variable selection property
Estimation picture for the lasso (left) and ridge regression (right):
Practical details
1. We select the tuning parameter λ by cross validation.
2. As with ridge, we center and standardise the predictors before
computing the solution.
3. There is no closed form solution for the lasso coefficients.
Computing the lasso solution is a quadratic programming
problem.
4. There are efficient algorithms for computing an entire path of
solutions for a range of λ values.
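As an illustration of how a lasso solution can be computed without a closed form, the sketch below uses cyclic coordinate descent, which is one widely used approach and is named here as a substitute, not as the method these slides rely on. It minimises RSS + λ Σ|βⱼ| as written in the lasso definition, for which each coordinate update is a soft-thresholding step with threshold λ/2. The data, the function names, and the iteration count are assumptions.

```python
import numpy as np

def soft_threshold(z, t):
    """sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=500):
    """Minimise RSS + lam * sum(|beta_j|) by cyclic coordinate descent.

    Assumes X is standardised and y is centred, so no intercept is needed.
    """
    N, p = X.shape
    beta = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual: remove the fit of every predictor except j.
            r_j = y - X @ beta + X[:, j] * beta[j]
            beta[j] = soft_threshold(X[:, j] @ r_j, lam / 2.0) / col_ss[j]
    return beta

# Simulated sparse example (hypothetical): only predictors 0 and 3 matter.
rng = np.random.default_rng(5)
N, p = 100, 6
X = rng.normal(size=(N, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = X @ np.array([3.0, 0.0, 0.0, -2.0, 0.0, 0.0]) + rng.normal(size=N)
y = y - y.mean()

lam_max = 2.0 * np.max(np.abs(X.T @ y))   # smallest lam giving all-zero coefficients
beta_mid = lasso_cd(X, y, 0.5 * lam_max)
for frac in (0.0, 0.1, 0.5, 1.01):
    beta = lasso_cd(X, y, frac * lam_max)
    print(frac, beta.round(3), "nonzero:", np.count_nonzero(beta))
```

Running the routine over a decreasing grid of λ values traces out a coefficient path, with predictors entering the model one at a time as the penalty weakens.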
The Lasso
We define the shrinkage factor for a given value of λ (or t) as
$$s(\lambda) = \frac{\sum_{j=1}^{p} \big|\hat{\beta}^{\text{lasso}}_j\big|}{\sum_{j=1}^{p} \big|\hat{\beta}^{\text{ols}}_j\big|}.$$
The next slide illustrates the effect of varying the shrinkage factor
on the estimated parameters.
Lasso coefficient profiles (equity premium data)
Model selection for the equity premium data
Discussion
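Subset selection, ridge, and lasso: comparison in the orthonormal case (optional)
Estimator               Formula
Best subset (size k)    $\hat{\beta}_j \cdot I(|\hat{\beta}_j| \ge |\hat{\beta}_{(k)}|)$
Ridge                   $\hat{\beta}_j / (1 + \lambda)$
Lasso                   $\operatorname{sign}(\hat{\beta}_j)(|\hat{\beta}_j| - \lambda)_+$
Estimators of βj in the case of orthonormal columns of X. Here $\hat{\beta}_j$ is the OLS estimate, $\hat{\beta}_{(k)}$ is the k-th largest OLS coefficient in absolute value, and $(\cdot)_+$ denotes the positive part.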
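The three rules in the table can be applied to a small vector of OLS estimates to see how they differ. The numbers below are made up for illustration, and the function names are hypothetical. The lasso rule is coded exactly as it appears in the table; note that the threshold constant depends on how the objective is scaled (under the RSS + λ Σ|βⱼ| objective without a factor of 1/2, the same rule holds with λ/2 in place of λ).

```python
import numpy as np

def best_subset_orth(b, k):
    """Hard thresholding: keep the k largest coefficients in absolute value (k >= 1)."""
    cutoff = np.sort(np.abs(b))[::-1][k - 1]
    return b * (np.abs(b) >= cutoff)

def ridge_orth(b, lam):
    """Proportional shrinkage of every coefficient."""
    return b / (1.0 + lam)

def lasso_orth(b, lam):
    """Soft thresholding: move each coefficient towards zero by lam, stopping at zero."""
    return np.sign(b) * np.maximum(np.abs(b) - lam, 0.0)

b_ols = np.array([3.0, -1.5, 0.4, -0.2])   # hypothetical OLS estimates
print(best_subset_orth(b_ols, k=2))
print(ridge_orth(b_ols, lam=0.5))
print(lasso_orth(b_ols, lam=0.5))
```

Best subset keeps the large coefficients untouched and drops the rest, ridge rescales all of them, and the lasso both shrinks the large coefficients and sets the small ones exactly to zero.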
Ridge and Lasso: comparison in the orthonormal case (optional)
[Figure: two panels plotting the coefficient estimate against y_j; one panel compares Ridge with Least Squares, the other compares Lasso with Least Squares.]
Which method to use?
• Recall the no free lunch theorem: neither ridge regression nor
the lasso universally outperforms the other. The choice of
method should be data driven.
• In general terms, we can expect the lasso to perform better
when a small subset of predictors have important coefficients,
while the remaining predictors have small or zero
coefficients (sparse problems).
• Ridge regression will tend to perform better when the
predictors all have similar importance.
• The lasso may have better interpretability since it can lead to
a sparse solution.
Elastic Net
The elastic net is a compromise between ridge regression and the
lasso:
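$$\hat{\beta}_{\text{EN}} = \underset{\beta}{\operatorname{argmin}} \left\{ \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} \beta_j x_{ij} \Big)^2 + \lambda \sum_{j=1}^{p} \big( \alpha \beta_j^2 + (1 - \alpha) |\beta_j| \big) \right\},$$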
for λ ≥ 0 and 0 < α < 1. The elastic net performs variable selection like the lasso,
and shrinks together the coefficients of correlated predictors like ridge regression.
Illustration: equity premium data
Estimated coefficients (tuning parameters selected by leave-one-out CV)
            OLS    Ridge    Lasso       EN
dp        0.566    0.159    0.000    0.111
dy        0.602    0.197    0.000    0.153
ep        0.942    0.116    0.000    0.048
bm       -1.055    0.033    0.000    0.000
ntis     -0.276   -0.067   -0.000   -0.000
tbl      -0.489   -0.248   -0.000   -0.178
ltr       0.597    0.186    0.000    0.124
tms       0.762    0.286    0.161    0.239
dfy       0.145    0.031    0.000    0.000
dfr       1.570    0.377    0.131    0.294
infl     -0.202   -0.214   -0.000   -0.150
ik       -0.408   -0.318   -0.422   -0.282
Illustration: equity premium data
Prediction results
              Train R^2   Test R^2
OLS             0.108      0.014
Ridge           0.054      0.033
Lasso           0.033      0.011
Elastic Net     0.050      0.029
Comparison with variable selection
Regularisation methods have two important advantages over variable
selection.
1. They are continuous procedures, generally leading to lower
variance.
2. The computational cost is not much larger than OLS.
Review questions
• What is best subset selection?
• What are stepwise methods?
• What are the advantages and disadvantages of variable selection?
• What are the penalty terms in the ridge and Lasso methods?
• What are the key differences in type of shrinkage between the
ridge and Lasso methods?
• In what situations would we expect the ridge or lasso methods to
perform better?
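Returning to the elastic net objective from the Discussion section, a coordinate descent sketch shows how the penalty interpolates between the lasso and ridge regression. This is an illustrative implementation under stated assumptions (standardised predictors, centred response, simulated data, hypothetical function names), not the procedure used to produce the tables above. It follows the parameterisation in these slides, where α weights the squared penalty, and evaluates the limiting cases α = 0 (lasso) and α = 1 (ridge) for comparison.

```python
import numpy as np

def soft_threshold(z, t):
    """sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def elastic_net_cd(X, y, lam, alpha, n_iter=500):
    """Minimise RSS + lam * sum(alpha * b_j**2 + (1 - alpha) * |b_j|)."""
    N, p = X.shape
    beta = np.zeros(p)
    col_ss = (X ** 2).sum(axis=0)
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual: remove the fit of every predictor except j.
            r_j = y - X @ beta + X[:, j] * beta[j]
            num = soft_threshold(X[:, j] @ r_j, lam * (1.0 - alpha) / 2.0)
            beta[j] = num / (col_ss[j] + lam * alpha)
    return beta

# Simulated example (hypothetical): predictors 0 and 1 are correlated and both matter.
rng = np.random.default_rng(6)
N = 150
z = rng.normal(size=N)
X = np.column_stack([z + 0.5 * rng.normal(size=N),
                     z + 0.5 * rng.normal(size=N),
                     rng.normal(size=N),
                     rng.normal(size=N)])
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = X @ np.array([1.0, 1.0, 0.0, 0.0]) + rng.normal(size=N)
y = y - y.mean()

lam = 40.0
beta_lasso = elastic_net_cd(X, y, lam, alpha=0.0)   # pure l1 penalty
beta_en = elastic_net_cd(X, y, lam, alpha=0.5)      # mixture of both penalties
beta_ridge = elastic_net_cd(X, y, lam, alpha=1.0)   # pure l2 penalty
print(beta_lasso.round(3))
print(beta_en.round(3))
print(beta_ridge.round(3))
```

In practice both λ and α would be chosen by cross validation over a grid, as in the equity premium illustration.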