# Limited Dependent Variables

Limited Dependent Variables

Chris Hansman

Empirical Finance: Methods and Applications Imperial College Business School

February 22-23, 2021


Today: Four Parts

1. Writing and minimizing functions in R

2. Binary dependent variables

3. Implementing a probit in R via maximum likelihood

4. Censoring and truncation

Part 1: Simple Functions in R

Often valuable to create our own functions in R:

May want to simplify code
Automate a common task / prevent mistakes
Plot or optimize a function

Simple syntax in R for user-written functions. Two key components:

Arguments
Body

Creating functions in R

function_name <- function(arguments){
  body
}
Write a simple function that adds two inputs x and y.

Write the function f(x) = (x − 1)^2.

What x minimizes this function? How do we find it in R?
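For instance, a minimal sketch (the names add_xy and f are illustrative):

# Add two inputs
add_xy <- function(x, y) {
  x + y
}
add_xy(2, 3)  # 5

# f(x) = (x - 1)^2
f <- function(x) {
  (x - 1)^2
}

# optimize() minimizes a one-argument function over an interval
optimize(f, interval = c(-10, 10))$minimum  # close to 1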
Rosenbrock’s Banana Function
The Rosenbrock banana function is given by:

f(x1, x2) = (1 − x1)^2 + 100(x2 − x1^2)^2
What values of x1 and x2 minimize this function?
Please find this using optim in R with starting values (−1.2,1)
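A minimal sketch using optim, which defaults to the Nelder-Mead method (the function name banana is illustrative):

# Rosenbrock banana function; par is the vector c(x1, x2)
banana <- function(par) {
  (1 - par[1])^2 + 100 * (par[2] - par[1]^2)^2
}

res <- optim(par = c(-1.2, 1), fn = banana)
res$par  # should be close to c(1, 1), the minimizer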
Part 2: Binary Dependent Variables
1. Review: Bernoulli distribution

2. Linear probability model and limitations

3. Introducing the probit and logit

4. Deriving the probit from a latent variable

5. Partial effects
Bernoulli Distribution
We are interested in an event that has two possible outcomes. Call them success and failure, but they could describe:

Heads vs. tails in a coin flip
Chelsea wins next match
Pound rises against dollar

Define:

Y = 1 if success, 0 if failure

Y is often called a Bernoulli trial.

Say the probability of success is p, so the probability of failure is (1 − p). The PMF of Y can then be written as:

P(Y = y) = p^y (1 − p)^(1−y)
Bernoulli Distribution
Say p = 0.2. Then we can write the probabilities of both values of y as:

P(Y = y) = (0.2)^y (0.8)^(1−y)

P(Y = 1) = (0.2)^1 (0.8)^(1−1) = 0.2

P(Y = 0) = (0.2)^0 (0.8)^(1−0) = 0.8
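As a quick check in R, the Bernoulli PMF is just the binomial PMF with size = 1:

dbinom(1, size = 1, prob = 0.2)  # P(Y = 1) = 0.2
dbinom(0, size = 1, prob = 0.2)  # P(Y = 0) = 0.8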
Binary Dependent Variables
yi = β0 + β1 xi + vi

So far, we have focused on cases in which yi is continuous.

What about when yi is binary? That is, yi is either 1 or 0.

For example: yi represents employment, passing vs. failing this course, etc.

Put any concerns about causality aside for a moment: assume E[vi|xi] = 0.

How do we interpret β1?
A Look at a Continuous Outcome

[Figure: scatter of Y against X, shown again with the fitted OLS line β0^OLS + β1^OLS X]
A Look at a Binary Outcome

[Figure: probability of passing (0 to 1) against Assignment 1 score (0 to 100)]
Binary Dependent Variables
yi = β0 + β1 xi + vi

With a continuous yi, we interpreted β1 as a slope:

The change in yi for a one-unit change in xi

This doesn't make much sense when yi is binary.

Say yi is employment, xi is years of schooling, and β1 = 0.1.

What does it mean for a year of schooling to increase your employment by 0.1?

Solution: think in probabilities.
Linear Probability Models
yi = β0 + β1 xi + vi

When E[vi|xi] = E[vi] = 0 we have:

E[yi|xi] = β0 + β1 xi

But if yi is binary:

E[yi|xi] = P(yi = 1|xi)

So we can think of our regression as:

P(yi = 1|xi) = β0 + β1 xi

β1 is the change in probability of "success" (yi = 1) for a one-unit change in xi.
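Estimating an LPM in R is just OLS via lm(); a minimal sketch, assuming a data frame dat with a binary column y and a regressor x (both hypothetical):

lpm <- lm(y ~ x, data = dat)  # OLS with a binary outcome
coef(lpm)  # the slope is the change in P(y = 1) per one-unit change in x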
Linear Probability Models
P(yi = 1|xi) = β0 + β1 xi

Basic idea: the probability of success is a linear function of xi.

Examples:

1. Bankruptcy:

P(Bankruptcyi = 1|Leveragei) = β0 + β1 Leveragei

The probability of bankruptcy increases linearly with leverage.

2. Mortgage denial:

P(MortgageDeniali = 1|Incomei) = β0 + β1 Incomei

The probability of denial decreases linearly in income.
Linear Probability Models

[Figure: probability of passing (0 to 1) against Assignment 1 score (0 to 100), with the fitted linear probability line]
Linear Probability Models
P(yi = 1|xi) = β0 + β1 xi

The linear probability model has a bunch of advantages:

1. Just OLS with a binary yi: estimation of βOLS is the same

2. Simple interpretation of βOLS

3. Can use all the techniques we've seen: IV, difference-in-differences, etc.

4. Can include many Xi:

P(yi = 1|Xi) = β0 + Xi′β

Because of this simplicity, lots of applied research just uses linear probability models.

But a few downsides...
Linear Probability Models: Downsides
Downside 1: Nonsense predictions
P(MortgageDeniali = 1|Incomei) = β0 + β1 Incomei

Suppose we estimate this and recover β0^OLS = 1, β1^OLS = −0.1

Income is measured in 10k

What is the predicted probability of denial for an individual with an income of 50k?

What is the predicted probability of denial for an individual with an income of 110k?

What about 1,000,000?
Linear Probability Models

[Figure: probability of passing against Assignment 1 score; the fitted line predicts values outside [0, 1]]
Linear Probability Models: Downsides
Downside 2: Constant Effects
MortgageDeniali = β0 + β1 Incomei + vi
β0^OLS = 1, β1^OLS = −0.1

Income is measured in 10k

The probability of denial declines by 0.1 when income increases from 50,000 to 60,000. Seems reasonable.

The probability of denial declines by 0.1 when income increases from 1,050,000 to 1,060,000. Probably less realistic.
Alternatives to Linear Probability Models
Simplest problem with P(yi = 1|xi) = β0 + β1 xi:

Predicts P(yi = 1|xi) > 1 for high values of β0 + β1 xi
Predicts P(yi = 1|xi) < 0 for low values of β0 + β1 xi

Solution:

P(yi = 1|xi) = G(β0 + β1 xi)

where 0 ≤ G(z) ≤ 1 for all z

Normal Density

[Figure: standard normal density]

Why Does the Probit Approach Make Sense?

yi = 1 if yi* ≥ 0, yi = 0 if yi* < 0

⇒ P(yi = 1) = P(yi* ≥ 0)

So let's plug in for yi* and figure out the probabilities:

P(yi* ≥ 0|xi) = P(β0 + β1 xi + vi ≥ 0)
= P(vi ≥ −(β0 + β1 xi))
= 1 − Φ(−(β0 + β1 xi))
= Φ(β0 + β1 xi)

where the last step uses the symmetry of the standard normal distribution.

Which is exactly the probit:

P(yi = 1|xi) = Φ(β0 + β1 xi)

What About the Logit?

yi* = β0 + β1 xi + vi

The logit can actually be derived the same way, assuming vi follows a standard logistic distribution instead of a standard normal distribution.

The logistic is a more awkward/uncommon distribution, but still symmetric around 0.

All the math/interpretation is the same, just using Λ(z) instead of Φ(z).

The primary benefit is computational/analytic convenience.

The Effect of a Change in Xi

In OLS/the linear probability model, interpreting coefficients was easy: β1 is the impact of a one-unit change in xi.

This interpretation checks out formally. Taking derivatives:

P(yi = 1|xi) = β0 + β1 xi

∂P(yi = 1|xi)/∂xi = β1

Things are a little less clean with probit/logit: we can't interpret β1 as the impact of a one-unit change in xi anymore!

The Effect of a Change in xi

P(yi = 1|xi) = G(β0 + β1 xi)

Taking derivatives:

∂P(yi = 1|xi)/∂xi = β1 G′(β0 + β1 xi)

The impact of xi is now non-linear.

Downside: harder to interpret.

Upside: we no longer have the same effect when, e.g., income goes from 50,000 to 60,000 as when income goes from 1,050,000 to 1,060,000.

For any set of values xi, β1 G′(β0 + β1 xi) is pretty easy to compute.

The Effect of a Change in xi for the Probit

P(yi = 1|xi) = Φ(β0 + β1 xi)

Taking derivatives:

∂P(yi = 1|xi)/∂xi = β1 Φ′(β0 + β1 xi)

The derivative of the standard normal CDF is just the PDF:

Φ′(z) = φ(z)

so

∂P(yi = 1|xi)/∂xi = β1 φ(β0 + β1 xi)

Practical Points: The Effect of a Change in xi

∂P(yi = 1|xi)/∂xi = β1 φ(β0 + β1 xi)

Because the impact of xi is non-linear, it can be tough to answer "what is the impact of xi on yi?" with a single number.

A few approaches:

1. Choose an important value of xi, e.g. the mean x̄:

∂P(yi = 1|x̄)/∂xi = β1 φ(β0 + β1 x̄)

This is called the partial effect at the average.

Practical Points: The Effect of a Change in xi

∂P(yi = 1|xi)/∂xi = β1 φ(β0 + β1 xi)

Because the impact of xi is non-linear, it can be tough to answer "what is the impact of xi on yi?" with a single number.

A few approaches:

2. Take the average over all observed values of xi:

(1/n) ∑_{i=1}^n ∂P(yi = 1|xi)/∂xi = (1/n) ∑_{i=1}^n β1 φ(β0 + β1 xi)

This is called (confusingly) the average partial effect.
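Both quantities are one-liners in R once the model is estimated; a minimal sketch, assuming a data vector x_i and estimated coefficients b0 and b1 already exist (φ(·) is dnorm(·)):

pea <- b1 * dnorm(b0 + b1 * mean(x_i))  # partial effect at the average
ape <- mean(b1 * dnorm(b0 + b1 * x_i))  # average partial effect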


Practical Points: The Effect of a Change in xi

P(yi = 1|xi) = Φ(β0 + β1 xi)

If xi is a dummy variable, it makes sense to avoid all the calculus and compute:

P(yi = 1|xi = 1) − P(yi = 1|xi = 0) = Φ(β0 + β1) − Φ(β0)
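In R this is a one-line calculation; a small sketch with made-up coefficient values (b0 and b1 are hypothetical, not estimates from the slides):

b0 <- -0.5
b1 <- 0.8
pnorm(b0 + b1) - pnorm(b0)  # change in P(y = 1) as the dummy flips from 0 to 1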


Practical Points: No Problem with Many Xi

So far we have only seen one xi:

P(yi = 1|xi) = Φ(β0 + β1 xi)

This can easily be extended to many Xi:

P(yi = 1|Xi) = Φ(β0 + β1 x1i + β2 x2i + ··· + βk xki)

The intuition behind the latent variable approach remains the same:

yi* = β0 + β1 x1i + β2 x2i + ··· + βk xki + vi

yi = 1 if yi* ≥ 0, yi = 0 if yi* < 0
Practical Points: No Problem with Many Xi
P(yi = 1|Xi) = Φ(β0 + β1 x1i + β2 x2i + ··· + βk xki)

However, this does make partial effects a bit more complicated:

∂P(yi = 1|Xi)/∂x2i = β2 φ(β0 + β1 x1i + β2 x2i + ··· + βk xki)

What about partial effects with a more complicated function of Xi?

P(yi = 1|xi) = Φ(β0 + β1 xi + β2 xi^2 + ··· + βk ln(xi))
Implementation: Flashback

[Figures]
Implementation of Probit by MLE
Suppose we have n independent observations of (yi, xi), where yi is binary.

And suppose we have a probit specification:

P(yi = 1|xi) = Φ(β0 + β1 xi)

This means that for each i:

P(yi = 0|xi) = 1 − Φ(β0 + β1 xi)

In other words, P(yi = y|xi) is Bernoulli!

P(yi = y|xi) = [Φ(β0 + β1 xi)]^y [1 − Φ(β0 + β1 xi)]^(1−y)
Implementation of Probit by MLE
Often we write this pdf as a function of the unknown parameters:

P(yi|xi; β0, β1) = [Φ(β0 + β1 xi)]^yi [1 − Φ(β0 + β1 xi)]^(1−yi)

What is the joint density of two independent observations i and j?

f(yi, yj|xi, xj; β0, β1) = P(yi|xi; β0, β1) × P(yj|xj; β0, β1)
= [Φ(β0 + β1 xi)]^yi [1 − Φ(β0 + β1 xi)]^(1−yi) × [Φ(β0 + β1 xj)]^yj [1 − Φ(β0 + β1 xj)]^(1−yj)
Implementation of Probit by MLE
Often we write this pdf as a function of the unknown parameters:

P(yi|xi; β0, β1) = [Φ(β0 + β1 xi)]^yi [1 − Φ(β0 + β1 xi)]^(1−yi)

And what is the joint density of all n independent observations?

f(Y|X; β0, β1) = ∏_{i=1}^n P(yi|xi; β0, β1) = ∏_{i=1}^n [Φ(β0 + β1 xi)]^yi [1 − Φ(β0 + β1 xi)]^(1−yi)
Implementation of Probit by MLE
f(Y = y|X; β0, β1) = ∏_{i=1}^n [Φ(β0 + β1 xi)]^yi [1 − Φ(β0 + β1 xi)]^(1−yi)
Implementation of Probit by MLE
Given data Y, define the likelihood function:

L(β0, β1) = f(Y|X; β0, β1) = ∏_{i=1}^n [Φ(β0 + β1 xi)]^yi [1 − Φ(β0 + β1 xi)]^(1−yi)

Take the log-likelihood:

l(β0, β1) = log(L(β0, β1)) = log(f(Y|X; β0, β1)) = ∑_{i=1}^n yi log(Φ(β0 + β1 xi)) + (1 − yi) log(1 − Φ(β0 + β1 xi))
Implementation of Probit by MLE
We then have:

(β̂0^MLE, β̂1^MLE) = arg max_{(β0, β1)} l(β0, β1)

Intuition: the values of β0, β1 that make the observed data most likely.
Log-likelihood is a Nice Concave Function

[Figure: the log-likelihood plotted against β1, a smooth concave curve]
Part 3: Implementation of Probit by MLE in R
It turns out this log-likelihood is globally concave in (β0, β1): a pretty easy problem for a computer.

Standard optimization packages will typically converge relatively easily. Let's try it in R.
Implementation of Probit in R
Let's start by simulating some example data. We will use the latent variable approach:

yi* = β0 + β1 xi + vi

yi = 1 if yi* ≥ 0, yi = 0 if yi* < 0

To start, let's define some parameters and choose n = 10000:

beta_0 <- 0.2
beta_1 <- 0.5
n <- 10000
Simulating Data in R
yi* = β0 + β1 xi + vi

To generate yi* we need to simulate xi and vi.

We will use the function rnorm(n), which simulates n draws from a standard normal random variable:

x_i <- rnorm(n)
v_i <- rnorm(n)

Aside: we've simulated both xi and vi as normal, but the probit only assumes vi is normal. We could have chosen xi to be uniform or some other distribution.
Simulating Data in R
With xi and vi in hand, we can generate yi* and yi:

yi* = β0 + β1 xi + vi

yi = 1 if yi* ≥ 0, yi = 0 if yi* < 0

To do this in R:

y_i_star <- beta_0 + beta_1 * x_i + v_i
y_i <- as.numeric(y_i_star >= 0)  # convert the logical to 0/1

Writing the Likelihood Function in R

Now, recall the log-likelihood:

l(β0, β1) = ∑_{i=1}^n yi log(Φ(β0 + β1 xi)) + (1 − yi) log(1 − Φ(β0 + β1 xi))

In R, the function Φ(·) is pnorm(·).

We will define beta = [β0, β1]:

beta[1] refers to β0
beta[2] refers to β1

Writing the Likelihood Function in R

l(β0, β1) = ∑_{i=1}^n yi log(Φ(β0 + β1 xi)) + (1 − yi) log(1 − Φ(β0 + β1 xi))

Easy to break into a few steps.

To capture log(Φ(β0 + β1 xi)):

l_1 <- log(pnorm(beta[1] + beta[2] * x_i))

To capture log(1 − Φ(β0 + β1 xi)):

l_2 <- log(1 - pnorm(beta[1] + beta[2] * x_i))

To capture the whole function l(β0, β1):

sum(y_i * l_1 + (1 - y_i) * l_2)
sum(y_i * l_1 + (1 - y_i) * l_2) is just a function of β0, β1

[Figure: the log-likelihood plotted against β1; here I've just plotted it holding β0 = 0.2]
Minimizing the Negative Likelihood
So we have our R function, which we will call probit_loglik. We just need to find the β0, β1 that maximize this.

Unfortunately, most software is set up to minimize. Easy solution:

(β̂0^MLE, β̂1^MLE) = arg max_{(β0, β1)} l(β0, β1) = arg min_{(β0, β1)} −l(β0, β1)

So we just define probit_loglik to be −l(β0, β1).
Maximum Likelihood Estimation
To find the β0, β1 that maximize the likelihood, use the function optim(par = ·, fn = ·).

This finds the parameters (par) that minimize a function (fn). It takes two arguments:

par: starting guesses for the parameters to estimate
fn: the function to minimize
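Putting the pieces together, a minimal sketch of the full estimation; the starting values c(0, 0) are an arbitrary choice, not from the slides:

probit_loglik <- function(beta) {
  l_1 <- log(pnorm(beta[1] + beta[2] * x_i))
  l_2 <- log(1 - pnorm(beta[1] + beta[2] * x_i))
  -sum(y_i * l_1 + (1 - y_i) * l_2)  # negative log-likelihood, since optim() minimizes
}

mle <- optim(par = c(0, 0), fn = probit_loglik)
mle$par  # should be close to the true (0.2, 0.5)

As a sanity check, R's built-in glm(y_i ~ x_i, family = binomial(link = "probit")) should recover essentially the same estimates.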
Menti: Estimate a Logit via Maximum Likelihood
On the hub, you'll find data: logit_data.csv

The data are identical to the yi, xi from before, except vi is simulated from a standard logistic distribution.

Everything in the likelihood is identical, except instead of Φ(z) we have:

Λ(z) = exp(z) / (1 + exp(z))
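A minimal sketch of the exercise, assuming the file contains columns named y_i and x_i (the column names are an assumption); in R the logistic CDF Λ(·) is plogis(·):

dat <- read.csv("logit_data.csv")

logit_loglik <- function(beta) {
  p <- plogis(beta[1] + beta[2] * dat$x_i)  # Lambda(beta0 + beta1 * x_i)
  -sum(dat$y_i * log(p) + (1 - dat$y_i) * log(1 - p))
}

optim(par = c(0, 0), fn = logit_loglik)$par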
Part 4: Censoring and Truncation
So far we have focused on binary dependent variables.

Two other common ways in which yi may be limited are censoring and truncation.

We'll cover the censored regression model and its likelihood.
Censoring
An extremely common data issue is censoring: we only observe yi* if it is below (or above) some threshold. We see xi either way.

Example: income is often top-coded. That is, we might only see whether income is > £100,000.

Formally, we might be interested in yi*, but see:

yi = min(yi*, ci)

where ci is a censoring value.
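A tiny illustration of top-coding in R (the income values here are made up):

y_star <- c(50000, 80000, 120000, 250000)  # true incomes
c_val <- 100000                            # censoring threshold
y <- pmin(y_star, c_val)                   # observed, top-coded: 50000 80000 100000 100000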


An Example of Censored Data

[Figure]

Truncation

Similar to censoring is truncation: we don't observe anything if yi* is above some threshold.

E.g.: we only have data for those with incomes below £100,000.

An Example of Truncated Data: No One over £100,000

[Figure]

Terminology: Left vs. Right

If we only see yi* when it is above some threshold, it is left censored. We still see the other variables xi regardless.

If we only see yi* when it is below some threshold, it is right censored. We still see the other variables xi regardless.

If we only see the observation when yi* is above some threshold ⇒ left truncated. We are not able to see the other variables xi in this case.

If we only see the observation when yi* is below some threshold ⇒ right truncated. We are not able to see the other variables xi in this case.

Censored Regression

Suppose there is some underlying outcome:

yi* = β0 + β1 xi + vi

Again, this depends on an observable xi and an unobservable vi.

We only see the continuous yi* if it is above/below some threshold:

Hours worked: yi = max(0, yi*)
Income: yi = min(£100,000, yi*)

And assume: vi ∼ N(0, σ²), with vi ⊥ xi.

Censored Regression

yi* = β0 + β1 xi + vi
yi = min(yi*, ci)
vi|xi, ci ∼ N(0, σ²)

What is P(yi = ci|xi)?

P(yi = ci|xi) = P(yi* ≥ ci|xi) = 1 − Φ((ci − β0 − β1 xi)/σ)

Censored Regression

yi* = β0 + β1 xi + vi
yi = min(yi*, ci)
vi|xi, ci ∼ N(0, σ²)

For yi < ci, what is f(yi|xi)?

f(yi|xi) = (1/σ) φ((yi − β0 − β1 xi)/σ)
Censored Regression
yi* = β0 + β1 xi + vi
yi = min(yi*, ci)
vi|xi, ci ∼ N(0, σ²)

So in general:

f(yi|xi, ci) = 1{yi ≥ ci} [1 − Φ((ci − β0 − β1 xi)/σ)] + 1{yi < ci} (1/σ) φ((yi − β0 − β1 xi)/σ)
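This density maps directly into a likelihood. A minimal sketch of the corresponding negative log-likelihood in R, assuming vectors y_i, x_i, and c_i exist and stacking the parameters as theta = c(beta0, beta1, sigma); this is one way to code the density above, not code from the slides:

cens_loglik <- function(theta) {
  b0 <- theta[1]; b1 <- theta[2]; sigma <- theta[3]
  xb <- b0 + b1 * x_i
  censored <- (y_i >= c_i)  # observations stuck at the threshold
  ll <- ifelse(censored,
               pnorm((c_i - xb) / sigma, lower.tail = FALSE, log.p = TRUE),  # log(1 - Phi)
               dnorm((y_i - xb) / sigma, log = TRUE) - log(sigma))           # log((1/sigma) phi)
  -sum(ll)
}

# e.g. optim(par = c(0, 0, 1), fn = cens_loglik)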