# Prediction and Regularization

Chris Hansman

Empirical Finance: Methods and Applications Imperial College Business School

February 1st and 2nd


Overview

1. The prediction problem and an example of overfitting

2. The Bias-Variance Tradeoff

3. LASSO and RIDGE

4. Implementing LASSO and RIDGE via glmnet()

A Basic Prediction Model

Suppose y is given by:

$$y = f(X) + \varepsilon$$

X is a vector of attributes

$\varepsilon$ has mean 0, variance $\sigma^2$, and $\varepsilon \perp X$

Our goal is to find a model $\hat{f}(X)$ that approximates $f(X)$

Suppose We Are Given 100 Observations of y

[Figure: scatter plot of the 100 observed outcomes. x-axis: Observation (0 to 100); y-axis: Outcome (−20 to 20).]

How Well Can We Predict Out-of-Sample Outcomes ($y^{oos}$)

[Figure: scatter plot of out-of-sample outcomes. x-axis: Observation (0 to 100); y-axis: Outcome and Prediction (−20 to 20).]

Predicting Out of Sample Outcomes ($\hat{f}(X^{oos})$)

[Figure: out-of-sample outcomes overlaid with model predictions. x-axis: Observation (0 to 100); y-axis: Outcome and Prediction (−20 to 20).]

A Good Model Has Small Distance $(y^{oos} - \hat{f}(X^{oos}))^2$

[Figure: out-of-sample outcomes and predictions, showing the distance between each pair. x-axis: Observation (0 to 100); y-axis: Outcome and Prediction (−20 to 20).]

A Simple Metric of Fit is the MSE

A standard measure of fit is the mean squared error between $y$ and $\hat{f}$:

$$MSE = E[(y - \hat{f}(X))^2]$$

With N observations we can estimate this as:

$$MSE = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{f}(X_i))^2$$
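The sample version of the MSE can be sketched in a few lines of NumPy. The function name `mse` and the toy numbers are made up for illustration; nothing here comes from the course data.

```python
import numpy as np

def mse(y, y_hat):
    """Sample mean squared error: (1/N) * sum_i (y_i - y_hat_i)^2."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    return float(np.mean((y - y_hat) ** 2))

# Toy numbers (illustrative only): the prediction errors are 1, 0 and -2
y_obs = [3.0, -1.0, 2.0]
y_pred = [2.0, -1.0, 4.0]
print(mse(y_obs, y_pred))  # equals (1 + 0 + 4) / 3
```

The same function gives the in-sample MSE when applied to training data and the out-of-sample MSE when applied to held-out data.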

Overfitting

Key tradeoff in building a model:

A model that fits the data you have

A model that will perform well on new data

Easy to build a model that fits the data you have well

With enough parameters

$\hat{f}$ fit too closely to one dataset may perform poorly out-of-sample

This is called overfitting

An Example of Overfitting

On the hub you will find a dataset called polynomial.csv

Variables y and x

Split into test and training data

y is generated as a polynomial in x plus random noise:

$$y_i = \sum_{p=0}^{P} \theta_p x_i^p + \varepsilon_i$$

We don’t know the order P…

Exercise:

Fit a regression with a 2nd order polynomial in x

$$y_i = \sum_{p=0}^{2} \theta_p x_i^p + \varepsilon_i$$

What is the In-Sample MSE?

What is the Out-of-Sample MSE?

Exercise:

Fit a regression with a 25th order polynomial in x

$$y_i = \sum_{p=0}^{25} \theta_p x_i^p + \varepsilon_i$$

What is the In-Sample MSE?

What is the Out-of-Sample MSE?

Is the in-sample fit better or worse than the quadratic model?

Is the out-of-sample fit better or worse than the quadratic model?

If you finish early:

What order polynomial gives the best out-of-sample fit?
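A minimal sketch of this exercise on simulated data, since polynomial.csv is not reproduced here. The data-generating process (a cubic plus standard normal noise), the sample sizes and the helper names are all assumptions made for illustration. The fit uses NumPy's Chebyshev basis, which spans the same polynomials as $1, x, \ldots, x^p$ but is better conditioned numerically.

```python
import numpy as np
from numpy.polynomial import Chebyshev

rng = np.random.default_rng(0)

def simulate(n):
    # Hypothetical data-generating process: a cubic in x plus standard normal noise
    x = rng.uniform(-2.0, 2.0, size=n)
    y = 1.0 + 2.0 * x - 0.5 * x**3 + rng.normal(size=n)
    return x, y

x_train, y_train = simulate(60)
x_test, y_test = simulate(60)

def fit_and_score(order):
    # Least-squares polynomial fit of the given order on the training data
    f_hat = Chebyshev.fit(x_train, y_train, deg=order)
    mse_in = np.mean((y_train - f_hat(x_train)) ** 2)
    mse_out = np.mean((y_test - f_hat(x_test)) ** 2)
    return mse_in, mse_out

results = {order: fit_and_score(order) for order in (2, 25)}
for order, (mse_in, mse_out) in results.items():
    print(f"order {order:2d}: in-sample MSE = {mse_in:.3f}, out-of-sample MSE = {mse_out:.3f}")
```

Because the quadratic model is nested in the 25th order model, the higher order can never fit worse in-sample; whether it does better out-of-sample is the empirical question the exercise asks.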


Formalizing Overfitting: Bias-Variance Tradeoff

Consider an algorithm to build model $\hat{f}(X)$ given training data $D$

Could write $\hat{f}_D(X)$

Consider the MSE at some particular out-of-sample point $X_0$:

$$MSE(X_0) = E[(y_0 - \hat{f}(X_0))^2]$$

Here the expectation is taken over $y$ and all $D$

We may show that:

$$MSE(X_0) = \underbrace{(E[\hat{f}(X_0)] - f(X_0))^2}_{\text{Bias}^2} + \underbrace{E[(\hat{f}(X_0) - E[\hat{f}(X_0)])^2]}_{\text{Variance}} + \sigma^2$$

Formalizing Overfitting: Bias-Variance Tradeoff

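$$\begin{aligned}
MSE(X_0) &= E[(y_0 - \hat{f}(X_0))^2] \\
&= E[(f(X_0) - \hat{f}(X_0))^2] + E[\varepsilon^2] + 2E[f(X_0) - \hat{f}(X_0)]E[\varepsilon] \\
&= E[(f(X_0) - \hat{f}(X_0))^2] + \sigma_\varepsilon^2
\end{aligned}$$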

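$$\begin{aligned}
MSE(X_0) &= E[(f(X_0) - \hat{f}(X_0))^2] + \sigma_\varepsilon^2 \\
&= E[(f(X_0) - E[\hat{f}(X_0)] - \hat{f}(X_0) + E[\hat{f}(X_0)])^2] + \sigma_\varepsilon^2 \\
&= E[(f(X_0) - E[\hat{f}(X_0)])^2] + E[(\hat{f}(X_0) - E[\hat{f}(X_0)])^2] \\
&\quad - 2E[(f(X_0) - E[\hat{f}(X_0)])(\hat{f}(X_0) - E[\hat{f}(X_0)])] + \sigma_\varepsilon^2 \\
&= (f(X_0) - E[\hat{f}(X_0)])^2 + E[(\hat{f}(X_0) - E[\hat{f}(X_0)])^2] \\
&\quad - 2(f(X_0) - E[\hat{f}(X_0)])\underbrace{E[\hat{f}(X_0) - E[\hat{f}(X_0)]]}_{=0} + \sigma_\varepsilon^2 \\
&= (f(X_0) - E[\hat{f}(X_0)])^2 + E[(\hat{f}(X_0) - E[\hat{f}(X_0)])^2] + \sigma_\varepsilon^2
\end{aligned}$$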

The Bias-Variance Tradeoff

$$MSE(X_0) = \underbrace{(E[\hat{f}(X_0)] - f(X_0))^2}_{\text{Bias}^2} + \underbrace{E[(\hat{f}(X_0) - E[\hat{f}(X_0)])^2]}_{\text{Variance}} + \sigma^2$$

This is known as the Bias-Variance Tradeoff

More complex models can pick up subtle elements of true f (X )

Less bias

More complex models vary more across different training datasets

More variance

Introducing a little bias may allow substantial decrease in variance

And hence reduce MSE (or Prediction Error)
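The decomposition can be checked by simulation: redraw the training data many times, refit, and look at the spread of $\hat{f}(X_0)$. The sketch below is only an illustration under assumed ingredients: the true function $\sin(2x)$, the noise level, the point $X_0 = 0.8$ and the polynomial orders are all hypothetical choices, not taken from the slides.

```python
import numpy as np
from numpy.polynomial import Chebyshev

rng = np.random.default_rng(1)
sigma = 0.5   # assumed noise standard deviation
x0 = 0.8      # the out-of-sample point X0

def f(x):
    # Hypothetical true f(X)
    return np.sin(2.0 * x)

def draws_of_f_hat(order, n_datasets=500, n_obs=40):
    """Refit a polynomial of the given order on many training sets D; collect f_hat_D(x0)."""
    preds = np.empty(n_datasets)
    for d in range(n_datasets):
        x = rng.uniform(-2.0, 2.0, size=n_obs)
        y = f(x) + sigma * rng.normal(size=n_obs)
        preds[d] = Chebyshev.fit(x, y, deg=order, domain=[-2.0, 2.0])(x0)
    return preds

summary = {}
for order in (1, 3, 10):
    preds = draws_of_f_hat(order)
    bias_sq = (preds.mean() - f(x0)) ** 2   # (E[f_hat(X0)] - f(X0))^2
    variance = preds.var()                  # E[(f_hat(X0) - E[f_hat(X0)])^2]
    summary[order] = (bias_sq, variance, bias_sq + variance + sigma**2)
    print(f"order {order:2d}: bias^2 = {bias_sq:.4f}, variance = {variance:.4f}, "
          f"implied MSE(X0) = {summary[order][2]:.4f}")
```

In this setup the linear model cannot follow $\sin(2x)$, so its squared bias is large, while the higher-order fits trade that bias for more variance across training sets.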

Depicting the Bias Variance Tradeoff


Regularization and OLS

$$y_i = X_i'\beta + \varepsilon_i$$

Today: two tweaks on linear regression

RIDGE and LASSO

Both operate by regularizing or shrinking components of $\hat{\beta}$ toward 0

This introduces bias, but may reduce variance

Simple intuition:

Force $\hat{\beta}_k$ to be small: won’t vary so much across training data sets

Take the extreme case: $\beta_k = 0$ for all k… No variance

The RIDGE Objective

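OLS objective:

$$y_i = X_i'\beta + \varepsilon_i$$

$$\hat{\beta}^{OLS} = \arg\min_{\beta} \sum_{i=1}^{N} (y_i - X_i'\beta)^2$$

Ridge objective:

$$\hat{\beta}^{RIDGE} = \arg\min_{\beta} \sum_{i=1}^{N} (y_i - X_i'\beta)^2 \quad \text{subject to} \quad \sum_{k=1}^{K} \beta_k^2 \le c$$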

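Equivalent to minimizing the penalized residual sum of squares:

$$PRSS(\beta)_{\ell_2} = \sum_{i=1}^{N} (y_i - X_i'\beta)^2 + \lambda \sum_{k=1}^{K} \beta_k^2$$

$PRSS(\beta)_{\ell_2}$ is strictly convex when $\lambda > 0$ ⇒ unique solution

$\lambda$ is called the penalty

Penalizes large $\beta_k$

Different values of $\lambda$ provide different $\hat{\beta}_\lambda^{RIDGE}$

As $\lambda \to 0$ we have $\hat{\beta}_\lambda^{RIDGE} \to \hat{\beta}^{OLS}$

As $\lambda \to \infty$ we have $\hat{\beta}_\lambda^{RIDGE} \to 0$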

Aside: Standardization

By convention yi and Xi are assumed to be mean 0

Xi should also be standardized (unit variance)

All βk treated the same by penalty, don’t want different scaling
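A minimal sketch of standardization in NumPy. The function name `standardize` and the two made-up attributes (number of rooms, floor area) are illustrative assumptions. In practice the means and standard deviations would usually be computed on the training data and then reused for any test data.

```python
import numpy as np

def standardize(X):
    """Center each column at 0 and rescale it to unit variance."""
    X = np.asarray(X, dtype=float)
    return (X - X.mean(axis=0)) / X.std(axis=0)

# Made-up attributes on very different scales: number of rooms and floor area
X_raw = np.array([[2, 55.0], [3, 80.0], [4, 120.0], [5, 210.0]])
X_std = standardize(X_raw)
print(X_std.mean(axis=0))  # approximately 0 in every column
print(X_std.std(axis=0))   # 1 in every column
```

Without this step a predictor measured in large units would have a small coefficient and so be penalized less than one measured in small units.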


Closed Form Solution to Ridge Objective

The ridge solution is given by (you will prove this):

$$\hat{\beta}_\lambda^{RIDGE} = (X'X + \lambda I_K)^{-1} X'y$$

Here $X$ is the matrix with $X_i'$ as rows

$I_K$ is the $K \times K$ identity matrix

Note that $\lambda I_K$ (with $\lambda > 0$) makes $X'X + \lambda I_K$ invertible even if $X'X$ isn’t

For example if $K > N$

This was actually the original motivation for the problem
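The closed form is short enough to write directly in NumPy. This is a sketch on simulated data (dimensions, coefficients and seeds are arbitrary assumptions); it solves the linear system rather than forming the inverse explicitly, and assumes the data are already centered so no intercept is needed.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge estimate: solves (X'X + lam * I_K) beta = X'y."""
    K = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(K), X.T @ y)

rng = np.random.default_rng(0)

# Case 1: N > K. A tiny penalty reproduces OLS; a huge penalty shrinks toward 0.
X = rng.normal(size=(100, 5))
y = X @ np.array([5.0, 4.0, 3.0, 2.0, 1.0]) + rng.normal(size=100)
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.round(beta_ols, 3))
print(np.round(ridge(X, y, 1e-8), 3))
print(np.round(ridge(X, y, 1e8), 6))

# Case 2: K > N. X'X is singular, but X'X + lam * I_K is not.
X_wide = rng.normal(size=(10, 20))
y_wide = rng.normal(size=10)
print(np.linalg.matrix_rank(X_wide.T @ X_wide))  # at most N = 10, below K = 20
beta_wide = ridge(X_wide, y_wide, 1.0)
print(beta_wide.shape)
```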

RIDGE is Biased

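Define $A = X'X$ (assumed invertible)

$$\begin{aligned}
\hat{\beta}_\lambda^{RIDGE} &= (X'X + \lambda I_K)^{-1} X'y \\
&= (A + \lambda I_K)^{-1} A (A^{-1} X'y) \\
&= (A[I_K + \lambda A^{-1}])^{-1} A (A^{-1} X'y) \\
&= (I_K + \lambda A^{-1})^{-1} A^{-1} A ((X'X)^{-1} X'y) \\
&= (I_K + \lambda A^{-1})^{-1} \hat{\beta}^{OLS}
\end{aligned}$$

Therefore, if $\lambda \neq 0$:

$$E[\hat{\beta}_\lambda^{RIDGE}] = E[(I_K + \lambda A^{-1})^{-1} \hat{\beta}^{OLS}] \neq \beta$$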
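The identity $\hat{\beta}_\lambda^{RIDGE} = (I_K + \lambda A^{-1})^{-1}\hat{\beta}^{OLS}$ can be verified numerically. The simulated design, coefficients and penalty value below are arbitrary assumptions chosen only to make the check concrete.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([2.0, -1.0, 0.5, 0.0]) + rng.normal(size=200)
lam = 25.0

A = X.T @ X
I_K = np.eye(4)

beta_ols = np.linalg.solve(A, X.T @ y)                # (X'X)^{-1} X'y
beta_ridge = np.linalg.solve(A + lam * I_K, X.T @ y)  # (X'X + lam I_K)^{-1} X'y
# Ridge written as a linear transformation of the OLS estimate
beta_via_ols = np.linalg.solve(I_K + lam * np.linalg.inv(A), beta_ols)

print(np.allclose(beta_ridge, beta_via_ols))
print(np.linalg.norm(beta_ols), np.linalg.norm(beta_ridge))
```

Since the matrix $(I_K + \lambda A^{-1})^{-1}$ has all eigenvalues strictly between 0 and 1 when $\lambda > 0$, the ridge vector is always shorter than the OLS vector, which is the shrinkage that produces the bias.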

Pros and cons of RIDGE

Pros:

Simple, closed form solution

Can deal with K >> N and multicollinearity

Introduces bias but can improve out of sample fit

Cons:

Shrinks coefficients but will not simplify model by eliminating variables


LASSO Objective

RIDGE will include all K predictors in the final model

No simplification

LASSO is a relatively recent alternative that overcomes this:

$$\hat{\beta}^{LASSO} = \arg\min_{\beta} \sum_{i=1}^{N} (y_i - X_i'\beta)^2 \quad \text{subject to} \quad \sum_{k=1}^{K} |\beta_k| \le c$$

Can also write this as minimizing:

$$PRSS(\beta)_{\ell_1} = \sum_{i=1}^{N} (y_i - X_i'\beta)^2 + \lambda \sum_{k=1}^{K} |\beta_k|$$

LASSO and Sparse Models

Like RIDGE, LASSO will shrink the $\beta_k$s toward 0

However, the $\ell_1$ penalty will force some coefficient estimates to be exactly 0 if $\lambda$ is large enough

Sparse models: lets us ignore some features

Again different values of $\lambda$ provide different $\hat{\beta}_\lambda^{LASSO}$

Need to find a good choice of $\lambda$

Why Does LASSO Set Some $\beta_k$ to 0?

LASSO Details

Unlike RIDGE, LASSO has no closed form solution

Requires numerical methods

Neither LASSO nor RIDGE universally dominates
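One common numerical method is cyclic coordinate descent with soft-thresholding. The sketch below is an illustration, not the routine used by any particular package: it minimizes the unscaled objective $\sum_i (y_i - X_i'\beta)^2 + \lambda \sum_k |\beta_k|$ written above, runs a fixed number of sweeps instead of checking convergence, and uses simulated data. Because software may scale the objective differently, the $\lambda$ values here should not be compared with those from other implementations.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward 0 by t, returning exactly 0 when |z| <= t."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, y, lam, n_sweeps=200):
    """Minimize sum_i (y_i - X_i'b)^2 + lam * sum_k |b_k| by cyclic coordinate descent."""
    K = X.shape[1]
    beta = np.zeros(K)
    col_ss = (X ** 2).sum(axis=0)        # x_k'x_k for each column
    resid = y - X @ beta
    for _ in range(n_sweeps):
        for k in range(K):
            resid += X[:, k] * beta[k]   # partial residual that excludes predictor k
            rho = X[:, k] @ resid
            beta[k] = soft_threshold(rho, lam / 2.0) / col_ss[k]
            resid -= X[:, k] * beta[k]   # restore the full residual
    return beta

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
beta_true = np.array([5.0, 4.0, 3.0, 2.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ beta_true + rng.normal(size=100)

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
# Above 2 * max_k |x_k'y| every coordinate is thresholded to exactly 0
lam_all_zero = 2.0 * np.max(np.abs(X.T @ y)) + 1.0
for lam in (0.0, 50.0, lam_all_zero):
    beta = lasso_cd(X, y, lam)
    print(f"lambda = {lam:8.1f}: {np.count_nonzero(beta)} nonzero coefficients")
```

The soft-threshold step is where exact zeros come from: whenever a predictor's correlation with the partial residual falls below $\lambda/2$, its coefficient is set to 0 rather than merely shrunk.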

Elastic Net: Combining LASSO and RIDGE Penalties

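Simplest version of elastic net (nests LASSO and RIDGE):

$$\hat{\beta}^{elastic} = \arg\min_{\beta} \frac{1}{N} \sum_{i=1}^{N} (y_i - X_i'\beta)^2 + \lambda \left[ \alpha \sum_{k=1}^{K} |\beta_k| + (1 - \alpha) \sum_{k=1}^{K} \beta_k^2 \right]$$

$\alpha \in [0, 1]$ weights LASSO vs. RIDGE style penalties

$\alpha = 1$ is LASSO

$\alpha = 0$ is RIDGE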

Implementing LASSO

1. An example of a prediction problem

2. Elastic Net and LASSO in R

3. How to choose hyperparameter $\lambda$

4. Cross-validation in R

An Example of a Prediction Problem

Suppose we see 100 observations of some outcome $y_i$

Example: residential real estate prices in London (i.e. home prices)

We have 50 characteristics $x_{1i}, x_{2i}, \cdots, x_{50i}$ that might predict $y_i$

E.g. number of rooms, size, neighborhood dummy, etc.

Want to build a model that helps us predict $y_i$ out of sample

I.e. the price of some other house in London

We Are Given 100 Observations of $y_i$

[Figure: scatter plot of the 100 observed outcomes. x-axis: Observation (0 to 100); y-axis: Outcome (−20 to 20).]

How Well Can We Predict Out-of-Sample Outcomes ($y_i^{oos}$)

[Figure: scatter plot of out-of-sample outcomes. x-axis: Observation (0 to 100); y-axis: Outcome and Prediction (−20 to 20).]

Using $x_{1i}, x_{2i}, \cdots, x_{50i}$ to Predict $y_i$

The goal is to use $x_{1i}, x_{2i}, \cdots, x_{50i}$ to predict any $y_i^{oos}$

If you give us number of rooms, size, etc., we will tell you home price

Need to build a model $\hat{f}(\cdot)$:

$$\hat{y}_i = \hat{f}(x_{1i}, x_{2i}, \cdots, x_{50i})$$

A good model will give us predictions close to $y_i^{oos}$

We can accurately predict prices for other homes

A Good Model Has Small Distance $(y_i^{oos} - \hat{y}_i^{oos})^2$

[Figure: out-of-sample outcomes and predictions, showing the distance between each pair. x-axis: Observation (0 to 100); y-axis: Outcome and Prediction (−20 to 20).]

Our Working Example: Suppose Only a Few $x_{ki}$ Matter

We see $y_i$ and $x_{1i}, x_{2i}, \cdots, x_{50i}$

The true model (which we will pretend we don’t know) is:

$$y_i = 5 \cdot x_{1i} + 4 \cdot x_{2i} + 3 \cdot x_{3i} + 2 \cdot x_{4i} + 1 \cdot x_{5i} + \varepsilon_i$$

Only the first 5 attributes matter!

In other words $\beta_1 = 5, \beta_2 = 4, \beta_3 = 3, \beta_4 = 2, \beta_5 = 1$

$\beta_k = 0$ for $k = 6, 7, \cdots, 50$

Prediction Using OLS

A first approach would be to try to predict using OLS:

$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_3 x_{3i} + \beta_4 x_{4i} + \beta_5 x_{5i} + \beta_6 x_{6i} + \beta_7 x_{7i} + \cdots + \beta_{50} x_{50i} + v_i$$

Have to estimate 51 different parameters with only 100 data points

⇒ not going to get precise estimates

⇒ out of sample predictions will be bad
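A rough simulation of this setup shows the effect of estimating 51 parameters from 100 points. The slides do not state the distribution of the predictors or the noise, so independent standard normal predictors and a noise standard deviation of 4 are assumptions made here; the numbers printed will therefore not match the ones reported in the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 50
beta_true = np.zeros(K)
beta_true[:5] = [5.0, 4.0, 3.0, 2.0, 1.0]   # only the first 5 attributes matter

def simulate(n, noise_sd=4.0):
    # Assumed design: independent standard normal predictors, normal noise
    X = rng.normal(size=(n, K))
    y = X @ beta_true + noise_sd * rng.normal(size=n)
    return X, y

def ols_oos_mse(n_train, X_oos, y_oos):
    X, y = simulate(n_train)
    # Add a column of ones so the regression includes the intercept beta_0
    Z = np.column_stack([np.ones(n_train), X])
    coef = np.linalg.lstsq(Z, y, rcond=None)[0]
    y_hat = coef[0] + X_oos @ coef[1:]
    return float(np.mean((y_oos - y_hat) ** 2))

X_oos, y_oos = simulate(100)
results = {n_train: ols_oos_mse(n_train, X_oos, y_oos) for n_train in (100, 10_000)}
for n_train, value in results.items():
    print(f"{n_train:6d} training observations: out-of-sample MSE = {value:.2f}")
```

With many more training observations the coefficient estimates tighten around the truth and the out-of-sample MSE moves toward the noise variance, which no model can remove.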

OLS Coefficients With 100 Observations Aren’t Great

[Figure: OLS coefficient estimates by predictor, 100 training observations. x-axis: X Variable (ticks 5 to 50); y-axis: Coefficient Estimate (−1 to 5).]

OLS Doesn’t Give Close Predictions for New $y_i^{oos}$:

[Figure: out-of-sample outcomes and OLS predictions. x-axis: Observation (0 to 100); y-axis: Outcome and Prediction (−20 to 20).]

$$\text{Mean Squared Error} = \frac{1}{100} \sum_{i=1}^{100} (y_i^{oos} - \hat{y}_i^{oos})^2 = 39.33$$

Aside: OLS Does Much Better With 10,000 Observations

[Figure: OLS coefficient estimates by predictor, 10,000 training observations. x-axis: X Variable (ticks 5 to 50); y-axis: Coefficient Estimate (−1 to 5).]

OLS Out-of-Sample Predictions: 100 Training Obs.

[Figure: out-of-sample outcomes and OLS predictions from 100 training observations. x-axis: Observation (0 to 100); y-axis: Outcome and Prediction (−20 to 20).]

$$\text{Mean Squared Error} = \frac{1}{100} \sum_{i=1}^{100} (y_i^{oos} - \hat{y}_i^{oos})^2 = 39.33$$

OLS Out-of-Sample Predictions: 10,000 Training Obs.

[Figure: out-of-sample outcomes and OLS predictions from 10,000 training observations. x-axis: Observation (0 to 100); y-axis: Outcome and Prediction (−20 to 20).]

$$\text{Mean Squared Error} = \frac{1}{100} \sum_{i=1}^{100} (y_i^{oos} - \hat{y}_i^{oos})^2 = 18.92$$

Solution to The OLS Problem: Regularization

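With 100 observations OLS didn’t do very well

Solution: regularization

LASSO/RIDGE/Elastic Net

Simplest version of elastic net (nests LASSO and RIDGE):

$$\hat{\beta}^{elastic} = \arg\min_{\beta} \frac{1}{N} \sum_{i=1}^{N} (y_i - \beta_0 - \beta_1 x_{1i} \cdots - \beta_K x_{Ki})^2 + \lambda \left[ \alpha \sum_{k=1}^{K} |\beta_k| + (1 - \alpha) \sum_{k=1}^{K} \beta_k^2 \right]$$

For $\lambda = 0$ this is just OLS

For $\lambda > 0$, $\alpha = 1$ this is LASSO:

$$\hat{\beta}^{LASSO} = \arg\min_{\beta} \frac{1}{N} \sum_{i=1}^{N} (y_i - \beta_0 - \beta_1 x_{1i} \cdots - \beta_K x_{Ki})^2 + \lambda \sum_{k=1}^{K} |\beta_k|$$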

Implementing Elastic Net in R

The big question when running LASSO is the choice of $\lambda$

By default, glmnet(·) tries 100 different choices for $\lambda$

Starts with $\lambda$ just large enough that all $\beta_k = 0$

Proceeds with steps towards $\lambda = 0$

For each $\lambda$, we estimate corresponding coefficients $\beta^{lasso}(\lambda)$

How do we decide which one is best?

LASSO Coefficients With 100 Observations (λ=0.2)

[Figure: LASSO coefficient estimates by predictor, λ=0.2. x-axis: X Variable (ticks 5 to 50); y-axis: Coefficient Estimate (−1 to 5).]

LASSO Coefficients With 100 Observations (λ=1)

[Figure: LASSO coefficient estimates by predictor, λ=1. x-axis: X Variable (ticks 5 to 50); y-axis: Coefficient Estimate (−1 to 5).]

LASSO Coefficients With 100 Observations (λ=3)

[Figure: LASSO coefficient estimates by predictor, λ=3. x-axis: X Variable (ticks 5 to 50); y-axis: Coefficient Estimate (−1 to 5).]

LASSO Coefficients For All λ

[Figure: LASSO coefficient paths. x-axis: Log Lambda (−5 to 1); y-axis: Coefficients (−1 to 5); top axis: number of nonzero coefficients (49, 49, 46, 35, 21, 10, 3).]

How to Choose λ (tuning)

One option would be to look at performance out-of-sample

Compare out-of-sample mean squared error $MSE^{oos}(\lambda)$ for different values of $\lambda$

For example:

$MSE^{oos}(0.2) = 26.18$

$MSE^{oos}(1) = 22.59$

$MSE^{oos}(3) = 56.31$

Would like a way of choosing that does not require going out of sample…

How to Choose λ?

Need a disciplined way of choosing $\lambda$ without going out of sample

Could split training data into a training and validation sample

Estimate model on training data for many $\lambda$

Compute MSE on validation sample

Choose $\lambda$ that gives smallest MSE

What if there is something weird about your particular validation sample?

K-fold Cross Validation

Most common approach is K-fold cross validation

Partition training data into K separate subsets (folds) of equal size

Usually K is either 5 or 10

For any $k = 1, 2, \cdots, K$ exclude the kth fold and estimate the model for many $\lambda$ on the remaining data

For each $\lambda$, compute $MSE_{k,\lambda}^{cv}$ on the excluded fold

Do this for all K folds:

Now you have K estimates of $MSE_{k,\lambda}^{cv}$ for each $\lambda$

Compute mean of the MSEs for each $\lambda$:

$$\overline{MSE}_\lambda^{cv} = \frac{1}{K} \sum_{k=1}^{K} MSE_{k,\lambda}^{cv}$$

Can also compute standard deviations

Choose $\lambda$ that gives small $\overline{MSE}_\lambda^{cv}$

How to Choose λ: k-fold Cross Validation

Partition the sample into k equal folds

The default in R (cv.glmnet) is k=10

For our sample, this means 10 folds with 10 observations each

Cross-validation proceeds in several steps:

1. Choose k-1 folds (9 folds in our example, with 10 observations each)

2. Run LASSO on these 90 observations

3. Find $\beta^{lasso}(\lambda)$ for all 100 $\lambda$

4. Compute $MSE(\lambda)$ for all $\lambda$ using remaining fold (10 observations)

5. Repeat for all 10 possible combinations of k-1 folds

This provides 10 estimates of $MSE(\lambda)$ for each $\lambda$

Can construct means and standard deviations of $MSE(\lambda)$ for each $\lambda$

Choose $\lambda$ that gives small mean $MSE(\lambda)$
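The fold loop above can be sketched in NumPy. To keep the example short and exact, it swaps in RIDGE (closed form) for LASSO; the cross-validation logic is the same for either estimator. The simulated data, the $\lambda$ grid and the function names are assumptions for illustration, and the data are generated with mean zero so no intercept is fitted.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge estimate, standing in for LASSO in this sketch."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def cv_mse(X, y, lambdas, n_folds=10, seed=0):
    """Return an (n_folds, len(lambdas)) array of held-out MSEs."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    out = np.empty((n_folds, len(lambdas)))
    for k, held_out in enumerate(folds):
        train = np.setdiff1d(np.arange(len(y)), held_out)   # the other k-1 folds
        for j, lam in enumerate(lambdas):
            beta = ridge(X[train], y[train], lam)
            out[k, j] = np.mean((y[held_out] - X[held_out] @ beta) ** 2)
    return out

# Simulated data in the spirit of the working example (assumed normal X and noise)
rng = np.random.default_rng(1)
beta_true = np.zeros(50)
beta_true[:5] = [5.0, 4.0, 3.0, 2.0, 1.0]
X = rng.normal(size=(100, 50))
y = X @ beta_true + 4.0 * rng.normal(size=100)

lambdas = np.array([0.01, 0.1, 1.0, 10.0, 100.0, 1000.0])
fold_mse = cv_mse(X, y, lambdas)
mean_mse = fold_mse.mean(axis=0)
se_mse = fold_mse.std(axis=0, ddof=1) / np.sqrt(fold_mse.shape[0])

j_min = int(np.argmin(mean_mse))
lam_min = lambdas[j_min]                                 # smallest mean CV error
within_1se = mean_mse <= mean_mse[j_min] + se_mse[j_min]
lam_1se = lambdas[within_1se].max()                      # most regularized within 1 s.e.
print("lambda with minimum CV MSE:", lam_min)
print("most regularized lambda within 1 s.e.:", lam_1se)
```

Averaging over all 10 folds reduces the risk that one unusual validation sample drives the choice of $\lambda$, and the one-standard-error rule picks the simplest model whose error is statistically indistinguishable from the best.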

Cross Validated Mean Squared Errors for all λ

[Figure: cross-validated mean squared error for each λ. x-axis: log(Lambda) (−5 to 1); y-axis: Mean-Squared Error (30 to 90); top axis: number of nonzero coefficients (49, 48, 48, 49, 47, 46, 45, 34, 31, 21, 15, 11, 5, 4, 3, 1).]

λ = 0.50 Minimizes Cross-Validation MSE

[Figure: LASSO coefficient estimates by predictor, λ = 0.50. x-axis: X Variable (ticks 5 to 50); y-axis: Coefficient Estimate (−1 to 5).]

$$\text{OOS Mean Squared Error} = \frac{1}{100} \sum_{i=1}^{100} (y_i^{oos} - \hat{y}_i^{oos})^2 = 22.39$$

λ = 0.80 Is Most Regularized Within 1 S.E.

[Figure: LASSO coefficient estimates by predictor, λ = 0.80. x-axis: X Variable (ticks 5 to 50); y-axis: Coefficient Estimate (−1 to 5).]

$$\text{OOS Mean Squared Error} = \frac{1}{100} \sum_{i=1}^{100} (y_i^{oos} - \hat{y}_i^{oos})^2 = 21.79$$

Now Do LASSO Yourself

On the Hub you will find two files:

Training data: menti 200.csv

Testing data (out of sample): menti 200 test.csv

There are 200 observations and 200 predictors

Three questions:

1. Run LASSO on the training data: what is the out-of-sample MSE for the $\lambda$ that gives minimum mean cross-validated error?

2. How many coefficients (excluding intercept) are included in the most regularized model with error within 1 s.e. of the minimum?

3. Run a RIDGE regression. Is the out-of-sample MSE higher or lower than for LASSO?

Extra time: estimate an elastic net regression with $\alpha = 0.5$

How would you tune $\alpha$? Google caret or train…