
Review

Chris Hansman

Empirical Finance: Methods and Applications Imperial College Business School

March 8-9, 2021


Topic 1: OLS and the Conditional Expectation Function

Consider a random variable yi and (a vector of) regressors Xi. Which of the following is false?

(a) Xi′βOLS provides the best predictor of yi out of any function of Xi
(b) Xi′βOLS is the best linear approximation of E[yi|Xi]
(c) yi − E[yi|Xi] is uncorrelated with Xi

Topic 1: OLS and the Conditional Expectation Function

1. A review of the conditional expectation function and its properties
2. The relationship between OLS and the CEF

Topic 1 – Part 1: The Conditional Expectation Function (CEF)

We are often interested in the relationship between some outcome yi and a variable (or set of variables) Xi

A useful summary is the conditional expectation function E[yi|Xi]: it gives the mean of yi when Xi takes any particular value

Formally, if fy(·|Xi) is the conditional p.d.f. of yi|Xi:

E[yi|Xi] = ∫ z fy(z|Xi) dz

E[yi|Xi] is a random variable itself: a function of the random Xi

Can think of it as E[yi|Xi] = h(Xi)

Alternatively, evaluate it at particular values: for example Xi = 0.5

E [yi |Xi = 0.5] is just a number!
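Since the CEF is just a conditional mean, with a discrete Xi it can be estimated by averaging yi within each cell of Xi. A minimal simulated sketch (the data-generating process and variable names are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.integers(0, 2, n)                     # binary regressor
y = 2.0 + 3.0 * x + rng.normal(0.0, 1.0, n)   # true CEF: E[y|x] = 2 + 3x

# Estimate the CEF at each value of x by a conditional sample mean
cef_at_0 = y[x == 0].mean()   # estimates E[y|x = 0] = 2
cef_at_1 = y[x == 1].mean()   # estimates E[y|x = 1] = 5
```

With a binary regressor, the two cell means are exactly the two values the CEF can take.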


Topic 1 – Part 1: The Conditional Expectation Function: E[Y|X]

[Figure: conditional mean height E[H|Age] plotted against Age; height (inches, roughly 30 to 80) rises with age, with E[H|Age = a] marked at a = 5, 10, . . . , 40]

Topic 1 – Part 1: Three Useful Properties of E[Y|X]

(i) The law of iterated expectations (LIE): E[E[yi|Xi]] = E[yi]

(ii) The CEF Decomposition Property:

Any random variable yi can be broken down into two pieces

yi = E[yi|Xi] + εi

Where the residual εi has the following properties:

(a) E[εi|Xi] = 0 ("mean independence")
(b) εi is uncorrelated with any function of Xi

(iii) Out of any function of Xi, E[yi|Xi] is the best predictor of yi:

E[yi|Xi] = arg min_{m(Xi)} E[(yi − m(Xi))²]
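Properties (i) and (ii) are easy to verify numerically: with cell means standing in for E[yi|Xi], iterating expectations recovers E[yi], and the residual is uncorrelated with Xi. A sketch on simulated data (the data-generating process is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
x = rng.integers(0, 3, n)              # discrete regressor with values 0, 1, 2
y = x**2 + rng.normal(0.0, 1.0, n)     # true CEF: E[y|x] = x^2

# Build E[y|x] as a random variable: map each x to its conditional mean
cond_mean = {v: y[x == v].mean() for v in (0, 1, 2)}
cef = np.array([cond_mean[v] for v in x])

lie_gap = abs(cef.mean() - y.mean())             # LIE: E[E[y|x]] = E[y]
resid_corr = abs(np.corrcoef(y - cef, x)[0, 1])  # residual uncorrelated with x
```

Both quantities are zero up to floating-point error, because within-cell residuals average to zero in each cell by construction.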

Topic 1 – Part 1 summary: Why We Care About Conditional Expectation Functions

Useful tool for describing relationship between yi and Xi

Several nice properties

Most statistical tests come down to comparing E[yi|Xi] at certain values of Xi

Classic example: experiments

Topic 1 – Part 2: Ordinary Least Squares

Linear regression is arguably the most popular modeling approach across every field in the social sciences

Transparent, robust, relatively easy to understand

Provides a basis for more advanced empirical methods

Extremely useful when summarizing data

Plenty of focus on the technical aspects of OLS last term; here the focus is on an applied perspective

Topic 1 – Part 2: OLS Estimator Fits a Line Through the Data

[Figure: scatter plot of Y against X with the fitted OLS regression line]

Topic 1 – Part 2: Choosing the (Population) Regression Line

yi = β0 + β1xi + vi

An OLS regression simply chooses β0OLS, β1OLS to make vi as "small" as possible on average

How do we define "small"? We want to treat positive and negative deviations the same, so consider vi²

Choose β0OLS, β1OLS to minimize:

E[vi²] = E[(yi − β0 − β1xi)²]

Topic 1 – Part 2: Regression and The CEF

Given yi and Xi, the population regression coefficient is:

βOLS = E[XiXi′]−1E[Xiyi]

A useful time to note: you should remember the sample OLS estimator:

βˆOLS = (X′X)−1X′Y

With just one regressor xi:

βˆ1OLS = Ĉov(xi, yi) / V̂ar(xi)
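Both formulas can be checked against each other in a few lines of numpy (simulated data, illustrative only): the matrix formula (X′X)−1X′Y and the single-regressor covariance-over-variance ratio give the same slope.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5_000
x = rng.normal(0.0, 1.0, n)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, n)   # true intercept 1, slope 2

# beta_hat = (X'X)^{-1} X'y with a constant column
X = np.column_stack([np.ones(n), x])
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Single-regressor shortcut: slope = Cov(x, y) / Var(x)  (matching ddof)
slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
```

The two slope estimates agree exactly (up to floating point), since the ratio form is an algebraic simplification of the matrix formula.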

Topic 1 – Part 2: Regression and the Conditional Expectation Function

Why is linear regression so popular?

Simplest way to estimate (or approximate) conditional expectations!

Three simple results

OLS perfectly captures the CEF if the CEF is linear

OLS generates the best linear approximation to the CEF if not

OLS perfectly captures the CEF with binary (dummy) regressors

Topic 1 – Part 2: Regression captures CEF if CEF is Linear

Take the special case of a linear conditional expectation function:

E[yi|Xi] = Xi′β

Then OLS captures E[yi|Xi]

[Figure: scatter plot of Y against X with a linear CEF; the OLS line coincides with E[Y|X]]

Topic 1 – Part 2: OLS Provides Best Linear Approximation to CEF

[Figure: scatter plot of Y_nl against X with a nonlinear CEF; the OLS line is the best linear approximation to E[Y_nl|X]]

Topic 1 – Part 2: Implementing Regressions with Categorical Variables

What if we are interested in comparing all 11 GICS sectors?

Create dummy variables for each sector, omitting one; let's call them D1i, · · · , D10i

pricei = β0 + δ1D1i + · · · + δ10D10i + vi

In other words, Xi = [1 D1i · · · D10i]′, or

X =
1 0 ··· 1 0
1 1 ··· 0 0
1 0 ··· 0 0
1 0 ··· 1 0
1 0 ··· 0 1
⋮ ⋮ ⋱ ⋮ ⋮
1 1 ··· 0 0

Regress pricei on a constant and those 10 dummy variables

Topic 1 – Part 2: Average Share Price by Sector for Some S&P Stocks

[Figure: average share price by sector for a sample of S&P stocks]

Topic 1 – Part 2: Implementing Regressions with Dummy Variables

βˆ0OLS (the coefficient on the constant) is the mean for the omitted category:

In this case "Consumer Discretionary"

The coefficient on each dummy variable (e.g. δˆkOLS) is the difference between the conditional mean for that category and βˆ0OLS

Key point: if you are only interested in categorical variables, you can perfectly capture the full CEF in a single regression

For example:

E[pricei|sectori = consumer staples] = β0OLS + δ1OLS
E[pricei|sectori = energy] = β0OLS + δ2OLS
⋮
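The "dummy regression recovers conditional means" point is easy to confirm numerically. A sketch with three hypothetical categories (names and numbers are made up):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 600
sectors = rng.integers(0, 3, n)                    # 3 illustrative categories
means = np.array([10.0, 15.0, 12.0])               # true group means
price = means[sectors] + rng.normal(0.0, 1.0, n)

# Constant plus dummies for categories 1 and 2 (category 0 omitted)
X = np.column_stack([np.ones(n), sectors == 1, sectors == 2]).astype(float)
beta = np.linalg.lstsq(X, price, rcond=None)[0]

group_means = np.array([price[sectors == k].mean() for k in range(3)])
```

Because the dummies saturate the categories, beta[0] equals the omitted category's sample mean exactly, and beta[0] + beta[k] equals category k's sample mean exactly.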

Topic 1 – Part 2 Summary: Regression and the Conditional Expectation Function

Why is linear regression so popular?

Simplest way to estimate (or approximate) conditional expectations!

Three simple results

OLS perfectly captures the CEF if the CEF is linear

OLS generates the best linear approximation to the CEF if not

OLS perfectly captures the CEF with binary (dummy) regressors

Topic 1: OLS and the Conditional Expectation Function

1. A review of the conditional expectation function and its properties
2. The relationship between OLS and the CEF

Topic 2: Causality and Regression

1. The potential outcomes framework
2. Causal effects with OLS

Topic 2: Causality and Regression

Suppose wages (yi) are determined by:

yi = β0 + β1xi + γai + ei

and we see years of schooling (xi) but not ability (ai), with Corr(xi, ai) > 0 and Corr(yi, ai) > 0

We estimate:

yi = β0 + β1xi + vi

and recover:

β1OLS = β1 + γδ1OLS

where γδ1OLS is the bias term (δ1OLS is the coefficient from a regression of ai on xi)

Is our estimated β1OLS larger or smaller than β1?

Topic 2 – Part 1: The Potential Outcomes Framework

Ideally, how would we find the impact of candy on evaluations (yi)?

Imagine we had access to two parallel universes and could observe:

The exact same student (i)

At the exact same time

In one universe they receive candy—in the other they do not

And suppose we could see the student's evaluations in both worlds

Define the variables we would like to see for each individual i:

yi1 = evaluation with candy
yi0 = evaluation without candy

Topic 2 – Part 1: The Potential Outcomes Framework

If we could see both yi1 and yi0, the impact would be easy to find: the causal effect or treatment effect for individual i is defined as

yi1 −yi0

Would answer our question—but we never see both yi1 and yi0!

Some people call this the "fundamental problem of causal inference"

Intuition: there are two "potential" worlds out there

The treatment variable Di decides which one we see:

yi = yi1 if Di = 1, and yi = yi0 if Di = 0

Topic 2 – Part 1: So What Do Differences in Conditional Means Tell You?

The observed difference in conditional means decomposes as:

E[yi1|Di = 1] − E[yi0|Di = 0]
= E[yi1|Di = 1] − E[yi0|Di = 1] (Average Treatment Effect for the Treated Group)
+ E[yi0|Di = 1] − E[yi0|Di = 0] (Selection Effect)
̸= E[yi1] − E[yi0] (Average Treatment Effect)

So our estimate could differ from the average effect of treatment E[yi1] − E[yi0] for two reasons:

(1) The morning section might have given better reviews anyway:

E[yi0|Di = 1] − E[yi0|Di = 0] > 0 (Selection Effect)

(2) Candy matters more in the morning:

E[yi1|Di = 1] − E[yi0|Di = 1] ̸= E[yi1] − E[yi0] (ATT vs. Average Treatment Effect)

Topic 2 – Part 2: Causality and Regression

yi = β0 + β1xi + vi

The regression coefficient captures the causal effect (β1OLS = β1) if:

E[vi|xi] = E[vi]

This fails any time Corr(xi, vi) ̸= 0

An aside: we have used similar notation for 3 different things:

1. β1: the causal effect on yi of a 1 unit change in xi
2. β1OLS = Cov(xi, yi)/Var(xi): the population regression coefficient
3. βˆ1OLS = Ĉov(xi, yi)/V̂ar(xi): the sample regression coefficient

Topic 2 – Part 2: Omitted Variables Bias

So if we have:

yi = β0 + β1xi + vi

what will the regression of yi on xi give us?

Recall that the regression coefficient is β1OLS = Cov(yi, xi)/Var(xi):

β1OLS = Cov(yi, xi)/Var(xi) = β1 + Cov(vi, xi)/Var(xi)
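The omitted variables formula can be verified by simulation: the short regression slope equals β1 plus γ times the auxiliary coefficient of ai on xi. A hedged sketch with made-up parameter values:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200_000
a = rng.normal(0.0, 1.0, n)                  # unobserved "ability"
x = 0.5 * a + rng.normal(0.0, 1.0, n)        # schooling, correlated with ability
y = 1.0 + 2.0 * x + 3.0 * a + rng.normal(0.0, 1.0, n)   # true beta1 = 2, gamma = 3

# Short regression of y on x alone: the slope is biased upward
short_slope = np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# Predicted value: beta1 + gamma * delta, delta = Cov(a, x) / Var(x)
delta = np.cov(a, x)[0, 1] / np.var(x, ddof=1)
predicted = 2.0 + 3.0 * delta
```

Since Corr(xi, ai) > 0 and γ > 0 here, the bias is positive: the short regression overstates β1.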

Topic 2: Causality and Regression

1. The potential outcomes framework
2. Causal effects with OLS

Menti Break

The average coursework grade in the Morning class is 68

The average coursework grade in the Afternoon class is 75

Suppose we run the following regression:

Courseworki = β0 + β1Afternooni + vi

What is the value of β0?

(a) 68 (b) 75 (c) 7

Topic 3: Panel Data and Diff-in-Diff

1. Panel Data, First Difference Regression and Fixed Effects
2. Difference-in-Difference

Topic 3: Panel Data and Diff-in-Diff

Panel data consists of observations of the same n units in T different periods

If the data contain variables x and y, we write them (xit, yit)

for i = 1, · · · , N, where i denotes the unit, e.g. Microsoft or Apple

and t = 1, · · · , T, where t denotes the time period, e.g. September or October

Topic 3 – Part 1: Panel Data and Omitted Variables

Let's reconsider our omitted variables problem:

yit = β0 + β1xit + γai + eit

Suppose we see xit and yit but not ai

Suppose Corr(xit, eit) = 0 but Corr(ai, xit) ̸= 0

Note that we are assuming ai doesn't depend on t

Topic 3 – Part 1: First Difference Regression

yit = β0 + β1xit + vit, where vit = γai + eit

Suppose we see two time periods t ∈ {1, 2} for each i

We can write our two time periods as:

yi,1 = β0 +β1xi,1 +γai +ei,1

yi,2 = β0 +β1xi,2 +γai +ei,2

Taking changes (differences) gets rid of fixed omitted variables

∆yi,2−1 = β1∆xi,2−1 + ∆ei,2−1
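A quick simulation shows why differencing helps: the pooled levels regression is biased by the fixed omitted variable, while the first-difference slope recovers β1. All numbers here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 50_000
a = rng.normal(0.0, 1.0, n)                   # fixed unobserved effect a_i
x1 = a + rng.normal(0.0, 1.0, n)              # x correlated with a in both periods
x2 = a + rng.normal(0.0, 1.0, n)
y1 = 1.0 + 2.0 * x1 + 3.0 * a + rng.normal(0.0, 1.0, n)   # true beta1 = 2
y2 = 1.0 + 2.0 * x2 + 3.0 * a + rng.normal(0.0, 1.0, n)

# Pooled levels regression: biased because a_i sits in the error term
x_all, y_all = np.r_[x1, x2], np.r_[y1, y2]
pooled_slope = np.cov(x_all, y_all)[0, 1] / np.var(x_all, ddof=1)

# First differences remove a_i entirely
dy, dx = y2 - y1, x2 - x1
fd_slope = np.cov(dx, dy)[0, 1] / np.var(dx, ddof=1)
```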

Topic 3 – Part 1: Fixed Effects Regression

yit = β0 + β1xit + γai + eit

An alternative approach:

Let's define δi = γai and rewrite:

yit = β0 + β1xit + δi + eit

So yit is determined by:

(i) The baseline intercept β0
(ii) The effect of xit
(iii) An individual-specific shift in the intercept: δi

Intuition behind fixed effects: let's just estimate the δi directly

Topic 3 – Part 1: Fixed Effects Regression – Implementation

yit = β0 + β1xit + ∑i=1N−1 δiDi + eit

Note that we’ve left out DN

β0OLS is interpreted as the intercept for individual N:

β0OLS = E[yit|xit = 0, i = N]

and for all other i (e.g. i = 1):

δ1 = E[yit|xit = 0, i = 1] − β0

This should look familiar

Topic 3 – Part 2: Difference-in-Difference

We are interested in the impact of some treatment on outcome Yi

Suppose we have a treated group and a control group

Let Di be a dummy equal to 1 if i belongs to the treatment group

And suppose we see both groups before and after the treatment occurs

Let Aftert be a dummy equal to 1 if time t is after the treatment date

Yit = β0 + β1Di × Aftert + β2Di + β3Aftert + vit
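The interaction coefficient β1 in this regression is exactly the "difference in differences" of the four cell means, which a short simulation confirms (all parameter values are made up):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 4_000
d = rng.integers(0, 2, n)                 # treatment group dummy
after = rng.integers(0, 2, n)             # post-period dummy
# true effect beta1 = 1.5, group gap 0.8, common trend 0.4
y = 2.0 + 1.5 * d * after + 0.8 * d + 0.4 * after + rng.normal(0.0, 1.0, n)

X = np.column_stack([np.ones(n), d * after, d, after]).astype(float)
beta = np.linalg.lstsq(X, y, rcond=None)[0]

# beta[1] equals the double difference of the four cell means
cell = lambda g, t: y[(d == g) & (after == t)].mean()
dd = (cell(1, 1) - cell(1, 0)) - (cell(0, 1) - cell(0, 0))
```

The regression is saturated in the four group-period cells, so the equality between beta[1] and the double difference is exact, not approximate.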

Topic 3 – Part 2: Diff-in-Diff Graphically

[Figure: average outcomes for treatment and control groups, before and after treatment]

Topic 3 – Part 2: When Does Diff-in-Diff Identify a Causal Effect?

As usual, we need

E[vit|Di,Aftert] = E[vit]

What does this mean intuitively?

Parallel trends assumption: in the absence of any reform, the average change in leverage would have been the same in the treatment and control groups

In other words: trends in both groups are similar

37/102

Topic 3 – Part 2: Parallel Trends

Parallel trends does not require that there is no trend in leverage, just that the trend is the same between groups

It does not require that the levels be the same in the two groups

What does it look like when the parallel trends assumption fails?

Topic 3 – Part 2: When Parallel Trends Fails

[Figure: leverage over time for Treatment (Delaware) and Control (Non-Delaware) groups, before and after the reform; the two groups' trends differ, illustrating a failure of parallel trends]

Topic 3: Panel Data and Diff-in-Diff

1. Panel Data, First Difference Regression and Fixed Effects
2. Difference-in-Difference

Topic 4: Regularization

1. Basics of Ridge, LASSO and Elastic Net

2. How to choose hyperparameter λ: cross-validation


Topic 4 – Part 1: The Basics of Elastic Net

[Figure: simulated outcome variable for 100 observations]

Topic 4 – Part 1: How Well Can We Predict Out-of-Sample Outcomes (yioos)?

[Figure: out-of-sample outcomes and model predictions for 100 observations]

Topic 4 – Part 1: A Good Model Has Small Distance (yioos − yˆioos)²

[Figure: out-of-sample outcomes and predictions; a good model keeps the squared distance between them small]

Topic 4 – Part 1: Solution to OLS drawbacks – Regularization

With 100 observations, OLS doesn't do very well

Solution: regularization (LASSO / Ridge / Elastic Net)

Simplest version of the elastic net (nests LASSO and Ridge):

βˆelastic = arg minβ (1/N) ∑i=1N (yi − β0 − β1x1i − · · · − βK xKi)² + λ [ α ∑k=1K |βk| + (1 − α) ∑k=1K βk² ]

For α = 1 is this LASSO or Ridge?
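The LASSO part of the penalty has no closed form, but the pure ridge case (α = 0) does, which makes the role of λ concrete. A numpy sketch on simulated data (dimensions chosen to echo the 100-observation example, not the slides' actual simulation):

```python
import numpy as np

rng = np.random.default_rng(7)
n, k = 100, 50
X = rng.normal(0.0, 1.0, (n, k))
beta_true = np.zeros(k)
beta_true[:3] = [5.0, 4.0, 3.0]            # only 3 of 50 regressors matter
y = X @ beta_true + rng.normal(0.0, 1.0, n)

def ridge(X, y, lam):
    """Closed-form ridge estimate: (X'X + lam * I)^{-1} X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

b_ols = ridge(X, y, 0.0)     # lam = 0 recovers OLS
b_reg = ridge(X, y, 50.0)    # a positive penalty shrinks the whole vector
```

Increasing λ always shrinks the overall size of the coefficient vector; unlike LASSO, ridge shrinks toward zero without setting coefficients exactly to zero.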

Topic 4 – Part 1: LASSO Coefficients With 100 Observations (λ=0.2)

[Figure: LASSO coefficient estimates across the 50 X variables at λ = 0.2]

Topic 4 – Part 1: LASSO Coefficients With 100 Observations (λ=1)

[Figure: LASSO coefficient estimates across the 50 X variables at λ = 1]

Topic 4 – Part 1: LASSO Coefficients With 100 Observations (λ=3)

[Figure: LASSO coefficient estimates across the 50 X variables at λ = 3]

Topic 4 – Part 1: LASSO Coefficients For All λ

[Figure: LASSO coefficient paths against log λ; the top axis shows the number of nonzero coefficients falling from 49 to 3 as λ increases]

Topic 4 – Part 2: How to Choose λ – k-fold Cross Validation

Partition the sample into k equal folds (the default in R is k = 10)

For our sample, this means 10 folds with 10 observations each

Cross-validation proceeds in several steps:

1. Choose k−1 folds (9 folds in our example, with 10 observations each)
2. Run LASSO on these 90 observations
3. Find βˆlasso(λ) for all 100 values of λ
4. Compute MSE(λ) for all λ using the remaining fold (10 observations)
5. Repeat for all 10 possible combinations of k−1 folds

This provides 10 estimates of MSE(λ) for each λ

Can construct means and standard deviations of MSE(λ) for each λ

Choose the λ that gives a small mean MSE(λ)
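The steps above can be sketched directly in numpy. To stay self-contained this cross-validates the closed-form ridge estimator rather than LASSO; the fold construction and out-of-fold MSE computation work the same way:

```python
import numpy as np

rng = np.random.default_rng(8)
n, k_vars = 100, 50
X = rng.normal(0.0, 1.0, (n, k_vars))
beta_true = np.zeros(k_vars)
beta_true[:3] = [5.0, 4.0, 3.0]
y = X @ beta_true + rng.normal(0.0, 1.0, n)

def ridge(Xt, yt, lam):
    return np.linalg.solve(Xt.T @ Xt + lam * np.eye(Xt.shape[1]), Xt.T @ yt)

def cv_mse(lam, n_folds=10):
    """Mean out-of-fold MSE across a k-fold split."""
    idx = np.arange(n)
    folds = np.array_split(idx, n_folds)
    errs = []
    for fold in folds:
        train = np.setdiff1d(idx, fold)          # fit on the other k-1 folds
        b = ridge(X[train], y[train], lam)
        errs.append(np.mean((y[fold] - X[fold] @ b) ** 2))  # held-out MSE
    return np.mean(errs)

lambdas = [0.1, 1.0, 10.0, 100.0]
scores = {lam: cv_mse(lam) for lam in lambdas}
best = min(scores, key=scores.get)
```

Very small λ overfits this noisy 50-regressor problem, while very large λ over-shrinks the three real coefficients; cross-validation picks something in between.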

Topic 4 – Part 2: Cross-Validated Mean Squared Errors for all λ

[Figure: cross-validated mean-squared error (roughly 30 to 90) against log(λ); the top axis shows the number of nonzero coefficients falling from 49 to 1 as λ increases]

Topic 4: Regularization

1. Basics of Ridge, LASSO and Elastic Net

2. How to choose hyperparameter λ: cross-validation


Topic 5: Observed Factor Models

Suppose xt is a vector of asset returns, and B is a matrix of factor loadings

Which has higher dimension?

(a) B
(b) Σx = Cov(xt)

Topic 5: Observed Factor Models

1. General Framing of Linear Factor Models

2. Single Index Model and the CAPM

3. Multi-Factor Models: Fama-French, Macroeconomic Factors
4. Barra approach

Topic 5 – Part 1: Linear Factor Models

Assume that returns xi,t are driven by K common factors:

xi,t = αi + β1,if1,t + β2,if2,t + · · · + βK,ifK,t + εit

ft = (f1,t, f2,t, · · · , fK,t)′ is the set of common factors

These are the same for all assets (constant over i) but change over time (different for t, t + 1)

Each ft has dimension (K × 1)

βi = (β1,i,β2,i,··· ,βK,i)′ is the set of factor loadings

K different parameters for each asset

But constant over time (same for all t)

Fixed, specific relationship between asset i and factor k


Topic 5 – Part 1: Linear Factor Model

xt = α + Bft + εt

Summary of Parameters

α: (m×1) intercepts for m assets

B: (m × K) loadings (βik) on K factors for m assets

μf: (K × 1) vector of means for K factors

Ωf : (K × K ) variance covariance matrix of factors

Ψ: (m×m) diagonal matrix of asset specific variances

Given our assumptions, xt is m-variate covariance stationary with:

E[xt|ft] = α + Bft
Cov[xt|ft] = Ψ
E[xt] = μx = α + Bμf
Cov[xt] = Σx = BΩfB′ + Ψ
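The covariance identity Σx = BΩfB′ + Ψ can be checked by simulating from the model and comparing the implied covariance to the sample covariance of the simulated returns (all parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(9)
m, K, T = 4, 2, 200_000
B = rng.normal(0.0, 1.0, (m, K))              # factor loadings
omega = np.diag([2.0, 0.5])                   # factor covariance (diagonal here)
psi = np.diag(rng.uniform(0.1, 0.3, m))       # asset-specific variances

f = rng.multivariate_normal(np.zeros(K), omega, T)
eps = rng.multivariate_normal(np.zeros(m), psi, T)
x = f @ B.T + eps                             # returns: x_t = B f_t + eps_t (alpha = 0)

sigma_model = B @ omega @ B.T + psi           # Sigma_x = B Omega_f B' + Psi
sigma_sample = np.cov(x, rowvar=False)        # (m x m) sample covariance
```

With T large, the sample covariance matches the model-implied covariance entry by entry, up to sampling noise.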

Topic 5 – Part 2: The Index Model: First Pass

xi = αi1T + Rmβi + εi

Estimate OLS regression on time-series version of our factor specification

One regression for each asset i

Recover two parameters αˆi and βˆi for each asset i

Ωˆf is just the sample variance of the observed factor

Estimate residuals:

εˆi = xi − αˆi1T − Rmβˆi

Use these to estimate asset-specific variances (for each i):

σˆi² = εˆi′εˆi / (T − 2)

With these, can compute:

Σˆx = BˆΩˆfBˆ′ + Ψˆ

Topic 5 – Part 2: The Index Model/Testing CAPM: Second Pass

x̄i = γ0 + γ1βˆi + γ2σˆi + ηi

CAPM tests: expected excess return should be determined only by systematic risk (β)

1. γ0 = 0 (or the average α is 0)
2. γ2 = 0 (idiosyncratic risk shouldn't be priced)
3. γ1 = R̄m

Topic 5 – Part 3: Fama-French Three Factor Model

Recall our general linear factor model:

xi,t = αi +β1,if1,t +β2,if2,t +···+βK,ifK,t +εit

Fama-French is just another version of this with three particular factors:

xi,t = αi +β1,if1,t +β2,if2,t +β3,if3,t +εit

The factors are:

1. f1,t = Rm,t: proxy for the excess market return (same as before)
2. f2,t = SMBt: size factor
3. f3,t = HMLt: value factor

Can use same two-pass methodology to recover parameters

Should understand what the covariance matrix Σx looks like


Topic 5 – Part 3: Macroeconomic Factors

An alternative approach uses key macro variables as factors

For example, Chen, Roll, and Ross use:

IP: Growth rate in industrial production

EI: Changes in expected inflation

UI: Unexpected inflation

CG: Unexpected changes in risk premiums

GB: Unexpected changes in term premia

In this case, our general model becomes:

xi,t = αi + βR,iRm,t + βIP,iIPt + βEI,iEIt + βUI,iUIt + βCG,iCGt + βGB,iGBt + εit

Can use two-pass procedure to estimate βˆs, evaluate the model

Like before, can use estimated βˆs, asset specific variances, and factor covariances to estimate asset covariance matrix


Topic 5 – Part 4: BARRA approach

x̃t = Bft + εt

Assume that B is known

This looks just like the standard OLS matrix notation

And we can estimate our ft like always:

fˆtOLS = (B′B)−1B′x̃t

A bit weird conceptually because the role of the βs flips

But no technical difference… except heteroskedasticity: estimate with GLS
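A sketch of the cross-sectional estimate fˆt = (B′B)−1B′x̃t for a single period, with a made-up loading matrix treated as known (as in the Barra setup):

```python
import numpy as np

rng = np.random.default_rng(10)
m, K = 100, 3
B = rng.normal(0.0, 1.0, (m, K))          # characteristic loadings, treated as known
f_true = np.array([0.5, -0.2, 0.1])       # this period's factor returns
x = B @ f_true + rng.normal(0.0, 0.05, m) # one cross-section of demeaned returns

# Cross-sectional OLS: one regression across assets, not across time
f_hat = np.linalg.solve(B.T @ B, B.T @ x)
```

Note the flip: here the "regressors" are the known loadings B and the estimated coefficients are the factor realizations ft, one such regression per period.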

Topic 5: Observed Factor Models

1. General Framing of Linear Factor Models

2. Single Index Model and the CAPM

3. Multi-Factor Models: Fama-French, Macroeconomic Factors
4. Barra approach

Topic 6: Principal Components Analysis

Suppose the covariance matrix of two asset returns is given by:

Σx =
1 0
0 3

What fraction of the total variance in asset returns will be explained by the first principal component?

(a) 1/3
(b) 3/4
(c) 1/4

Topic 6: Principal Components Analysis

1. Basics of Eigenvectors and Eigenvalues
2. PCA

Topic 6 – Part 1: Basics of Eigenvalues and Eigenvectors

Consider the square n×n matrix A.

An eigenvalue λi of A is a (1×1) scalar:

The corresponding eigenvector v⃗i is an (n × 1) vector, where λi, v⃗i satisfy:

Av⃗i = λiv⃗i

v⃗i are the special vectors that A only stretches

λi is the stretching factor

Won’t ask you to compute these without giving you the formulas Except maybe in the diagonal case…

65/102

Topic 6 – Part 1: For Some Vectors v⃗, Matrix A Only Stretches

Let's say:

A =
5 0
2 3

and v⃗1 = (1 1)′ ⇒ Av⃗1 = (5 5)′ = 5v⃗1

[Figure: the vectors v1 = (1 1)′ and Av1 = (5 5)′ lie on the same ray through the origin]

Topic 6 – Part 1: Eigenvalues of Σx with Uncorrelated Data

Σx =
3 0
0 1

What are the eigenvalues and eigenvectors of Σx?

With uncorrelated assets, the eigenvalues are just the variances of each asset return!

Eigenvectors:

v1 = (1 0)′, v2 = (0 1)′

Note that the first eigenvector points in the direction of the largest variance

We sometimes write the eigenvectors together as a matrix:

Γ = (v1 v2) =
1 0
0 1

Topic 6 – Part 1: The Eigenvectors Γ = (v1 v2), Graphically

[Figure: scatter of xb against xa for data with Cov(x) = Σx = [[2, 1], [1, 2]]; the eigenvectors V1 = (1 1)′/√2 and V2 = (−1 1)′/√2 point along and across the long axis of the point cloud]

Topic 6 – Part 1: Eigenvectors of Σx with Correlated Data

Σx =
2 1
1 2

Γ = (v1 v2) = (1/√2) ×
1 −1
1 1

Just as with uncorrelated data, first eigenvector finds the direction with the most variability

Second eigenvector points in the direction that explains the maximum amount of the remaining variance

Note that the two are perpendicular (because Σx is symmetric)

This is the geometric implication of the fact that they are orthogonal:

vi′vj = 0 (for i ̸= j)

The fact that they are orthogonal also implies:

Γ′ = Γ−1


Topic 6 – Part 2: Principal Components Analysis

Let xt be an (m × 1) vector with E[xt] = α and Cov(xt) = Σx

Define the (m × 1) vector of principal components variables as:

p = Γ′(xt − α)

where Γ is the ordered matrix of eigenvectors

The proportion of the total variance of xt explained by the i-th principal component is simply:

λi / ∑i=1m λi
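The variance-share formula ties back to the quiz at the start of this topic: with Σx = diag(1, 3) the first principal component explains 3/4 of the total variance. A numpy check:

```python
import numpy as np

# Covariance from the earlier quiz: uncorrelated assets with variances 1 and 3
sigma = np.array([[1.0, 0.0],
                  [0.0, 3.0]])

# eigh handles symmetric matrices; sort eigenvalues into descending order
eigvals, eigvecs = np.linalg.eigh(sigma)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

frac_first = eigvals[0] / eigvals.sum()   # share of variance, first component
```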

Topic 6 – Part 2: Principal Components Analysis

Our principal components variables provide a transformation of the data into variables that are:

Uncorrelated (orthogonal)

Ordered by how much of the total variance they explain (size of the eigenvalue)

What if m is large, but the first few (2, 5, 20) principal components explain most of the variation?

Idea: use these as "factors"

Dimension reduction!

Topic 6 – Part 2: Principal Components Analysis

Note that because Γ′ = Γ−1:

xt = α + Γp

We can also partition Γ into the first K < m eigenvectors and the remaining m − K:

Γ = [Γ1 Γ2]

Partition p into its first K elements and the remaining m − K:

p = (p1′ p2′)′

We can then write:

xt = α + Γ1p1 + Γ2p2

This looks just like a factor model xt = α + Bft + εt, but with Cov(εt) = Ψ = Γ2Λ2Γ2′
Topic 6 - Part 2: Implementing Principal Components Analysis
xt = α + Γ1p1 + Γ2p2

Recall the sample covariance matrix:

Σˆx = (1/T) X̃X̃′

Calculate this, and perform the eigendecomposition (using a computer):

Σˆx = ΓΛΓ′

We now have everything we need to compute the (m × T) matrix of sample principal components at each t:

P = [p1 p2 · · · pT] = Γ′X̃
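The implementation steps above, in numpy, on a simulated panel with a two-factor structure (dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(11)
m, T, K = 5, 10_000, 2
B = rng.normal(0.0, 1.0, (m, K))
f = rng.normal(0.0, 1.0, (T, K))
X = f @ B.T + rng.normal(0.0, 0.1, (T, m))   # strong 2-factor structure

Xt = X - X.mean(axis=0)                      # demeaned data, rows are dates
sigma_hat = (Xt.T @ Xt) / T                  # sample covariance (m x m)

# Eigendecomposition, sorted so the largest eigenvalue comes first
eigvals, gamma = np.linalg.eigh(sigma_hat)
order = np.argsort(eigvals)[::-1]
eigvals, gamma = eigvals[order], gamma[:, order]

P = gamma.T @ Xt.T                           # principal components, (m x T)
share_first_K = eigvals[:K].sum() / eigvals.sum()
```

Because the simulated data have a genuine two-factor structure, the first two components account for almost all of the total variance.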
Topic 6 - Part 2: xˆt : Predicted Yields from First Two Components
[Figure: actual and predicted Treasury yields (CMT 3-month through 20-year maturities, roughly 0 to 7 percent) over time, with predictions built from the first two principal components]
Topic 6: Principal Components Analysis
1. Basics of Eigenvectors and Eigenvalues
2. PCA
Topic 7: Limited Dependent Variables
True or false: It is never ok to use an OLS regression when the outcome variable is binary
(a) True (b) False
Topic 7: Limited Dependent Variables
1. Binary dependent variables
2. Censoring and truncation
Topic 7 - Part 1: Linear Probability Models
P(Yi = 1|Xi) = β0 + β1Xi

The linear probability model has a bunch of advantages:

1. Just OLS with a binary Yi: estimation of βOLS is the same
2. Simple interpretation of βOLS
3. Can use all the techniques we've seen: IV, difference-in-difference, etc.

Because of this simplicity, lots of applied research just uses linear probability models

But a few downsides…

Predicted probabilities above one

Constant effects of xi
Topic 7 - Part 1: Two Common Alternatives to Linear Probability Models
P(yi = 1|xi) = G(β0 + β1xi)

In practice, mostly two choices of G(·) are used:

1. Probit: the standard normal CDF

G(z) = Φ(z) = ∫−∞z φ(v)dv, where φ(v) = (2π)−1/2 exp(−v²/2)

2. Logit: the logistic function

G(z) = Λ(z) = exp(z) / (1 + exp(z))
Topic 7 - Part 1: Probit
[Figure: fitted probit probability of passing (0 to 1) against Assignment 1 score (0 to 100)]
Topic 7 - Part 1: Why Does the Probit Approach Make Sense?
You should be able to derive the probit from a latent variable model:
yi∗ = β0 + β1xi + vi

yi = 1 if yi∗ ≥ 0, and yi = 0 if yi∗ < 0

P(yi = 1) = P(yi∗ ≥ 0)

Probit approach: assume vi is distributed standard normal:

vi|xi ∼ N(0, 1)

P(yi∗ ≥ 0|xi) = P(β0 + β1xi + vi ≥ 0|xi)
= P(vi ≥ −(β0 + β1xi)|xi)
= 1 − Φ(−(β0 + β1xi))
= Φ(β0 + β1xi)

Topic 7 – Part 1: The Effect of a Change in xi for the Probit

P(yi = 1|xi) = Φ(β0 + β1xi)

Taking derivatives:

∂P(yi = 1|xi)/∂xi = β1Φ′(β0 + β1xi)

The derivative of the standard normal CDF is just the PDF, Φ′(z) = φ(z), so:

∂P(yi = 1|xi)/∂xi = β1φ(β0 + β1xi)

Obviously you should be able to do this for more complicated functions

Topic 7 – Part 1: Deriving the Log-likelihood

Given data Y, define the likelihood function:

L(β0, β1) = f(Y|X; β0, β1) = ∏i=1n [Φ(β0 + β1xi)]yi [1 − Φ(β0 + β1xi)](1−yi)

Take the log-likelihood:

l(β0, β1) = log(L(β0, β1)) = log(f(Y|X; β0, β1)) = ∑i=1n yi log(Φ(β0 + β1xi)) + (1 − yi) log(1 − Φ(β0 + β1xi))
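The log-likelihood can be evaluated directly with nothing beyond the standard library's error function; on simulated data it is higher at the true parameters than at wrong ones. A hedged sketch (parameter values are made up):

```python
import numpy as np
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

rng = np.random.default_rng(12)
n = 5_000
x = rng.normal(0.0, 1.0, n)
ystar = 0.5 + 1.0 * x + rng.normal(0.0, 1.0, n)  # latent index; true (b0, b1) = (0.5, 1.0)
y = (ystar >= 0).astype(float)

def loglik(b0, b1):
    p = np.array([norm_cdf(b0 + b1 * xi) for xi in x])
    p = np.clip(p, 1e-12, 1.0 - 1e-12)           # guard the logs
    return float(np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

ll_true = loglik(0.5, 1.0)
ll_zero = loglik(0.0, 0.0)    # ignores x entirely
ll_flip = loglik(0.5, -1.0)   # wrong sign on the slope
```

In practice the MLE maximizes this surface numerically; the comparison here just illustrates that the likelihood rewards the correct parameters.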

Topic 7 – Part 2: Censoring and Truncation

An extremely common data issue is censoring:

We only observe Yi if it is below (or above) some threshold

We see Xi either way

Example: Income is often top-coded

That is, we might only see whether income is > £100,000

Formally, we might be interested in Yi , but see:

Wi = min(Yi, ci), where ci is a censoring value

Similar to censoring is truncation

We don’t observe anything if Yi is above some threshold

e.g.: we only have data for those with incomes below £100,000


Topic 7 – Part 2: Censored Regression

So in general:

f(yi =y|xi,ci)=1{y≥ci} 1−Φ

y i∗ = β 0 + β 1 x i + v i y =min(y∗,c)

iii vi|xi,ci ∼N(0,σ2)

ci −β0 −β1xi σ

1 y−β0−β1xi +1{y