# Limited Dependent Variables
Chris Hansman
Empirical Finance: Methods and Applications
Imperial College Business School
February 22-23, 2021

Today: Four Parts
1. Writing and minimizing functions in R
2. Binary dependent variables
3. Implementing a probit in R via maximum likelihood
4. Censoring and truncation

Part 1: Simple Functions in R
- Often valuable to create our own functions in R
  - May want to simplify code
  - Automate a common task/prevent mistakes
  - Plot or optimize a function
- Simple syntax in R for user-written functions
- Two key components
  - Arguments
  - Body
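A minimal sketch of the two components, arguments and body, in a user-written function. The names `add_two`, `f` and `res` are illustrative and not from the slides; `optimize()` is base R's one-dimensional minimizer, and the search interval is an arbitrary assumption.

```r
# Illustrative helper: the arguments are x and y, the body is one expression.
add_two <- function(x, y) {
  x + y  # the value of the last expression is returned
}

# Illustrative quadratic f(x) = (x - 1)^2.
f <- function(x) {
  (x - 1)^2
}

add_two(2, 3)

# optimize() searches a bounded interval for the minimum of a
# one-argument function; the interval here is an arbitrary choice.
res <- optimize(f, interval = c(-10, 10))
res$minimum    # location of the minimum
res$objective  # value of f at that point
```

For functions of several parameters, `optim()` plays the same role and takes a vector of starting values.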

Creating functions in R
    function_name <- function(arguments) { body }
- Write a simple function that adds two inputs x and y
- Write the function $f(x) = (x - 1)^2$
- What x minimizes this function? How do we find it in R?

Rosenbrock's Banana Function
- The Rosenbrock Banana Function is given by:
$$f(x_1, x_2) = (1 - x_1)^2 + 100\,(x_2 - x_1^2)^2$$
- What values of $x_1$ and $x_2$ minimize this function?
- Please find this using optim in R with starting values $(-1.2, 1)$

Part 2: Binary Dependent Variables
1. Review: Bernoulli distribution
2. Linear probability model and limitations
3. Introducing the probit and logit
4. Deriving the probit from a latent variable
5. Partial effects

Bernoulli Distribution
- We are interested in an event that has two possible outcomes
- Call them success and failure, but could describe:
  - Heads vs. tails in a coin flip
  - Chelsea wins next match
  - Pound rises against dollar
- Define:
$$Y = \begin{cases} 1 & \text{if Success} \\ 0 & \text{if Failure} \end{cases}$$
- Y is often called a Bernoulli trial
- Say the probability of success is $p$, probability of failure is $(1 - p)$
- So the PMF of Y can be written as:
$$P(Y = y) = p^{y} (1 - p)^{(1 - y)}$$

Bernoulli Distribution
- Say $p = 0.2$
- So then we can write the probabilities of both values of y as:
$$P(Y = y) = (0.2)^{y} (0.8)^{(1 - y)}$$
$$P(Y = 1) = (0.2)^{1} (0.8)^{(1 - 1)} = 0.2$$
$$P(Y = 0) = (0.2)^{0} (0.8)^{(1 - 0)} = 0.8$$

Binary Dependent Variables
$$y_i = \beta_0 + \beta_1 x_i + v_i$$
- So far, focused on cases in which $y_i$ is continuous
- What about when $y_i$ is binary?
  - That is, $y_i$ is either 1 or 0
  - For example: $y_i$ represents employment, passing vs. failing this course, etc.
- Put any concerns about causality aside for a moment:
  - Assume $E[v_i \mid X_i] = 0$
- How do we interpret $\beta_1$?

A Look at a Continuous Outcome
[Figure: Y against X, with the line $\beta_0^{OLS} + \beta_1^{OLS} X$]

A Look at a Binary Outcome
[Figure: Probability of Passing (vertical axis, 0 to 1.5) against Assignment 1 Score (horizontal axis, 0 to 100)]

Binary Dependent Variables
$$y_i = \beta_0 + \beta_1 x_i + v_i$$
- With a continuous $y_i$, we interpreted $\beta_1$ as a slope:
  - Change in $y_i$ for a one unit change in $x_i$
- This doesn't make much sense when $y_i$ is binary
  - Say $y_i$ is employment, $x_i$ is years of schooling, $\beta_1 = 0.1$
  - What does it mean for a year of schooling to increase your employment by 0.1?
- Solution: think in probabilities

Linear Probability Models
$$y_i = \beta_0 + \beta_1 x_i + v_i$$
- When $E[v_i \mid x_i] = E[v_i] = 0$ we have:
$$E[y_i \mid x_i] = \beta_0 + \beta_1 x_i$$
- But if $y_i$ is binary:
$$E[y_i \mid x_i] = P(y_i = 1 \mid x_i)$$
- So we can think of our regression as:
$$P(y_i = 1 \mid x_i) = \beta_0 + \beta_1 x_i$$
- $\beta_1$ is the change in probability of "success" ($y_i = 1$) for a one unit change in $x_i$

Linear Probability Models
$$P(y_i = 1 \mid x_i) = \beta_0 + \beta_1 x_i$$
- Basic idea: probability of success is a linear function of $x_i$
- Examples:
1. Bankruptcy:
$$P(\text{Bankruptcy}_i = 1 \mid \text{Leverage}_i) = \beta_0 + \beta_1 \text{Leverage}_i$$
   - Probability of bankruptcy increases linearly with leverage
2. Mortgage Denial:
$$P(\text{MortgageDenial}_i = 1 \mid \text{Income}_i) = \beta_0 + \beta_1 \text{Income}_i$$
   - Probability of denial decreases linearly in income

Linear Probability Models
[Figure: Probability of Passing (vertical axis, 0 to 1.5) against Assignment 1 Score (horizontal axis, 0 to 100)]

Linear Probability Models
$$P(y_i = 1 \mid x_i) = \beta_0 + \beta_1 x_i$$
- The linear probability model has a bunch of advantages
1. Just OLS with a binary $Y_i$: estimation of $\beta^{OLS}$ is the same
2. Simple interpretation of $\beta^{OLS}$
3. Can use all the techniques we've seen: IV, difference-in-differences, etc.
4. Can include many $X_i$:
$$P(y_i = 1 \mid X_i) = \beta_0 + X_i'\beta$$
- Because of this simplicity, lots of applied research just uses linear probability models
- But a few downsides...
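Since the linear probability model is just OLS with a binary outcome, it can be estimated with `lm()`. The sketch below uses simulated data; the data-generating process, the parameter values and the names `x`, `y`, `lpm` are illustrative assumptions, not the example behind the slides' figures.

```r
set.seed(1)  # illustrative seed so the draws are reproducible

# Simulate a binary outcome whose success probability rises with x.
n <- 5000
x <- runif(n, 0, 100)             # e.g. a score between 0 and 100
p_true <- pnorm(-2 + 0.05 * x)    # true P(y = 1 | x), always inside (0, 1)
y <- rbinom(n, size = 1, prob = p_true)

# The linear probability model: OLS with the binary y on the left-hand side.
lpm <- lm(y ~ x)
coef(lpm)            # slope: estimated change in P(y = 1) per unit of x

# Fitted values are linear in x, so nothing keeps them inside [0, 1].
range(fitted(lpm))
```

In this setup the fitted line overshoots 1 at high values of `x`, even though every true probability lies strictly between 0 and 1.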

Linear Probability Models: Downsides
- Downside 1: Nonsense predictions
$$P(\text{MortgageDenial}_i = 1 \mid \text{Income}_i) = \beta_0 + \beta_1 \text{Income}_i$$
- Suppose we estimate this and recover $\beta_0^{OLS} = 1$, $\beta_1^{OLS} = -0.1$
  - Income is measured in 10k
- What is the predicted probability of denial for an individual with an income of 50k?
- What is the predicted probability of denial for an individual with an income of 110k?
- What about 1,000,000?

Linear Probability Models
[Figure: Probability of Passing (vertical axis, 0 to 1.5) against Assignment 1 Score (horizontal axis, 0 to 100)]

Linear Probability Models: Downsides
- Downside 2: Constant Effects
$$\text{MortgageDenial}_i = \beta_0 + \beta_1 \text{Income}_i + v_i$$
- $\beta_0^{OLS} = 1$, $\beta_1^{OLS} = -0.1$
  - Income is measured in 10k
- Probability of denial declines by 0.1 when income increases from 50,000 to 60,000
  - Seems reasonable
- Probability of denial declines by 0.1 when income increases from 1,050,000 to 1,060,000
  - Probably less realistic

Alternatives to Linear Probability Models
- Simplest problem with $P(y_i = 1 \mid x_i) = \beta_0 + \beta_1 x_i$:
  - Predicts $P(y_i = 1 \mid x_i) > 1$ for high values of $\beta_0 + \beta_1 x_i$
  - Predicts $P(y_i = 1 \mid x_i) < 0$ for low values of $\beta_0 + \beta_1 x_i$
- Solution:
$$P(y_i = 1 \mid x_i) = G(\beta_0 + \beta_1 x_i)$$
- Where $0 < G(z) < 1$

[...]

$$P(X > -z) = 1 - \Phi(-z) = \Phi(z)$$
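The symmetry property $1 - \Phi(-z) = \Phi(z)$ can be checked numerically with `pnorm()`, R's standard normal CDF. The grid of `z` values and the name `upper_tail` are arbitrary illustrative choices.

```r
z <- c(-1.5, 0, 0.7, 2)  # arbitrary evaluation points

# P(X > -z) = 1 - Phi(-z) for a standard normal X.
upper_tail <- 1 - pnorm(-z)

# By symmetry of the normal density around 0 this matches Phi(z).
all.equal(upper_tail, pnorm(z))

# The same upper-tail probability can be requested directly.
all.equal(pnorm(-z, lower.tail = FALSE), pnorm(z))
```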

Normal Density
[Figure: standard normal density, illustrating $P(X < z) = P(X > -z)$]

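Why Does the Probit Approach Make Sense?
$$y_i = \begin{cases} 1 & \text{if } y_i^* \ge 0 \\ 0 & \text{if } y_i^* < 0 \end{cases} \quad\Rightarrow\quad P(y_i = 1) = P(y_i^* \ge 0)$$
- So let's plug in for $y_i^*$ and figure out the probabilities:
$$\begin{aligned} P(y_i^* > 0 \mid x_i) &= P(\beta_0 + \beta_1 x_i + v_i > 0) \\ &= P(v_i > -(\beta_0 + \beta_1 x_i)) \\ &= 1 - \Phi(-(\beta_0 + \beta_1 x_i)) \\ &= \Phi(\beta_0 + \beta_1 x_i) \end{aligned}$$
- Which is exactly the probit:
$$P(y_i = 1 \mid x_i) = \Phi(\beta_0 + \beta_1 x_i)$$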
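The derivation can be sanity-checked by simulation: fix $x_i$ at one value, draw many standard normal $v_i$, and compare the share of $y_i = 1$ with $\Phi(\beta_0 + \beta_1 x_i)$. The parameter values, the point `x0` and the number of draws are illustrative assumptions.

```r
set.seed(42)  # illustrative seed

beta_0 <- 0.2   # illustrative parameter values
beta_1 <- 0.5
x0 <- 1         # hold x fixed at a single value

v <- rnorm(1e5)                       # standard normal latent errors
y_star <- beta_0 + beta_1 * x0 + v    # latent variable
y <- as.numeric(y_star >= 0)          # observed binary outcome

mean(y)                        # simulated P(y = 1 | x = x0)
pnorm(beta_0 + beta_1 * x0)    # probit formula, Phi(beta_0 + beta_1 * x0)
```

The two numbers should agree up to simulation noise.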

$$y_i^* = \beta_0 + \beta_1 x_i + v_i$$
- The logit can actually be derived the same way
  - Assuming $v_i$ follows a standard logistic distribution
  - Instead of a standard normal distribution
- More awkward/uncommon distribution but still symmetric around 0
- All the math/interpretation is the same, just using $\Lambda(z)$ instead of $\Phi(z)$
- Primary benefit is computational/analytic convenience
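The same latent-variable check works for the logit, swapping `rnorm()` for `rlogis()` (standard logistic draws) and `pnorm()` for `plogis()`, which is base R's logistic CDF $\Lambda(z) = \exp(z) / (1 + \exp(z))$. Parameter values and names are illustrative assumptions.

```r
set.seed(7)  # illustrative seed

beta_0 <- 0.2   # illustrative parameter values
beta_1 <- 0.5
x0 <- 1

v <- rlogis(1e5)                                 # standard logistic latent errors
y <- as.numeric(beta_0 + beta_1 * x0 + v >= 0)   # observed binary outcome

z <- beta_0 + beta_1 * x0
mean(y)                    # simulated P(y = 1 | x = x0)
plogis(z)                  # logistic CDF, Lambda(z)
exp(z) / (1 + exp(z))      # the same value from the explicit formula
```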

The Effect of a Change in xi
- In OLS/linear probability model, interpreting coefficients was easy:
  - $\beta_1$ is the impact of a one unit change in $x_i$
- This interpretation checks out formally:
$$P(y_i = 1 \mid x_i) = \beta_0 + \beta_1 x_i$$
  - Taking derivatives:
$$\frac{\partial P(y_i = 1 \mid x_i)}{\partial x_i} = \beta_1$$
- Things are a little less clean with probit/logit
  - Can't interpret $\beta_1$ as the impact of a one unit change in $x_i$ anymore!

The Effect of a Change in xi
$$P(y_i = 1 \mid x_i) = G(\beta_0 + \beta_1 x_i)$$
- Taking derivatives:
$$\frac{\partial P(y_i = 1 \mid x_i)}{\partial x_i} = \beta_1 G'(\beta_0 + \beta_1 x_i)$$
- The impact of $x_i$ is now non-linear
  - Downside: harder to interpret
  - Upside: no longer have the same effect when, e.g., income goes from 50,000 to 60,000 as when income goes from 1,050,000 to 1,060,000
- For any set of values $x_i$, $\beta_1 G'(\beta_0 + \beta_1 x_i)$ is pretty easy to compute

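The Effect of a Change in xi for the Probit
$$P(y_i = 1 \mid x_i) = \Phi(\beta_0 + \beta_1 x_i)$$
- Taking derivatives:
$$\frac{\partial P(y_i = 1 \mid x_i)}{\partial x_i} = \beta_1 \Phi'(\beta_0 + \beta_1 x_i)$$
- The derivative of the standard normal CDF is just the PDF:
$$\Phi'(z) = \phi(z)$$
- So:
$$\frac{\partial P(y_i = 1 \mid x_i)}{\partial x_i} = \beta_1 \phi(\beta_0 + \beta_1 x_i)$$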
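The probit partial effect can be computed with `dnorm()`, R's standard normal PDF, and compared with a finite-difference derivative of `pnorm()`. The parameter values, the evaluation point `x0`, the step size `h` and the helper name `p` are illustrative assumptions.

```r
beta_0 <- 0.2   # illustrative parameter values
beta_1 <- 0.5
x0 <- 1         # point at which the effect is evaluated

# Probit response probability as a function of x.
p <- function(x) pnorm(beta_0 + beta_1 * x)

# Analytic partial effect: beta_1 * phi(beta_0 + beta_1 * x).
analytic <- beta_1 * dnorm(beta_0 + beta_1 * x0)

# Central finite-difference approximation to dP/dx at x0.
h <- 1e-6
numeric <- (p(x0 + h) - p(x0 - h)) / (2 * h)

c(analytic = analytic, numeric = numeric)
```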

Practical Points: The Effect of a Change in xi
$$\frac{\partial P(y_i = 1 \mid x_i)}{\partial x_i} = \beta_1 \phi(\beta_0 + \beta_1 x_i)$$
- Because the impact of $x_i$ is non-linear, it can be tough to answer "what is the impact of $x_i$ on $y_i$?" with a single number
- A few approaches:
1. Choose an important value of $x_i$: e.g. the mean $\bar{x}$
$$\frac{\partial P(y_i = 1 \mid \bar{x})}{\partial x_i} = \beta_1 \phi(\beta_0 + \beta_1 \bar{x})$$
- This is called the partial effect at the average

Practical Points: The Effect of a Change in xi
2. Take the average for all observed values of $x_i$
$$\frac{1}{n} \sum_{i=1}^{n} \frac{\partial P(y_i = 1 \mid x_i)}{\partial x_i} = \frac{1}{n} \sum_{i=1}^{n} \beta_1 \phi(\beta_0 + \beta_1 x_i)$$
- This is called (confusingly) the average partial effect
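Both summaries are one line each in R. In the sketch below the coefficients are set by hand for illustration; in practice they would be estimates. The simulated `x` and the names `pea` and `ape` are assumptions.

```r
set.seed(3)  # illustrative seed

beta_0 <- 0.2      # illustrative values standing in for estimates
beta_1 <- 0.5
x <- rnorm(1000)   # illustrative regressor

# Partial effect at the average: evaluate the derivative at mean(x).
pea <- beta_1 * dnorm(beta_0 + beta_1 * mean(x))

# Average partial effect: evaluate at every x, then average.
ape <- mean(beta_1 * dnorm(beta_0 + beta_1 * x))

c(pea = pea, ape = ape)
```

The two numbers need not coincide, because the normal density is non-linear in its argument.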

Practical Points: The Effect of a Change in xi
$$P(y_i = 1 \mid x_i) = \Phi(\beta_0 + \beta_1 x_i)$$
- If $x_i$ is a dummy variable, makes sense to avoid all the calculus and compute:
$$P(y_i = 1 \mid x_i = 1) - P(y_i = 1 \mid x_i = 0) = \Phi(\beta_0 + \beta_1) - \Phi(\beta_0)$$
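For a dummy regressor the discrete change is a difference of two `pnorm()` calls. The coefficient values are illustrative assumptions, as is the name `effect`.

```r
beta_0 <- 0.2   # illustrative values standing in for estimates
beta_1 <- 0.5

# Change in P(y = 1) when the dummy switches from 0 to 1.
effect <- pnorm(beta_0 + beta_1) - pnorm(beta_0)
effect
```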

Practical Points: No Problem with Many Xi
- So far we have only seen one $x_i$
$$P(y_i = 1 \mid x_i) = \Phi(\beta_0 + \beta_1 x_i)$$
- This can easily be extended to many $X_i$:
$$P(y_i = 1 \mid X_i) = \Phi(\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki})$$
- Intuition behind the latent variable approach remains the same:
$$y_i^* = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki} + v_i$$
$$y_i = \begin{cases} 1 & \text{if } y_i^* \ge 0 \\ 0 & \text{if } y_i^* < 0 \end{cases}$$

Practical Points: No Problem with Many Xi
$$P(y_i = 1 \mid X_i) = \Phi(\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki})$$
- However, does make partial effects a bit more complicated:
$$\frac{\partial P(y_i = 1 \mid X_i)}{\partial x_{2i}} = \beta_2 \phi(\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki})$$
- What about partial effects with a more complicated function of $X_i$?
$$P(y_i = 1 \mid x_i) = \Phi(\beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_k \ln(x_i))$$

Implementation: Flashback

Implementation of Probit by MLE
- Suppose we have n independent observations of $(y_i, x_i)$
  - Where $y_i$ is binary
- And suppose we have a probit specification:
$$P(y_i = 1 \mid x_i) = \Phi(\beta_0 + \beta_1 x_i)$$
- This means that for each i:
$$P(y_i = 0 \mid x_i) = 1 - \Phi(\beta_0 + \beta_1 x_i)$$
- In other words, $P(y_i = y \mid x_i)$ is Bernoulli!
$$P(y_i = y \mid x_i) = [\Phi(\beta_0 + \beta_1 x_i)]^{y} [1 - \Phi(\beta_0 + \beta_1 x_i)]^{(1 - y)}$$

Implementation of Probit by MLE
- Often we write this pdf as a function of the unknown parameters:
$$P(y_i \mid x_i; \beta_0, \beta_1) = [\Phi(\beta_0 + \beta_1 x_i)]^{y_i} [1 - \Phi(\beta_0 + \beta_1 x_i)]^{(1 - y_i)}$$
- What is the joint density of two independent observations i and j?
$$\begin{aligned} f(y_i, y_j \mid x_i, x_j; \beta_0, \beta_1) &= P(y_i \mid x_i; \beta_0, \beta_1) \times P(y_j \mid x_j; \beta_0, \beta_1) \\ &= [\Phi(\beta_0 + \beta_1 x_i)]^{y_i} [1 - \Phi(\beta_0 + \beta_1 x_i)]^{(1 - y_i)} \\ &\quad \times [\Phi(\beta_0 + \beta_1 x_j)]^{y_j} [1 - \Phi(\beta_0 + \beta_1 x_j)]^{(1 - y_j)} \end{aligned}$$
- And what is the joint density of all n independent observations?
$$\begin{aligned} f(Y \mid X; \beta_0, \beta_1) &= \prod_{i=1}^{n} P(y_i \mid x_i; \beta_0, \beta_1) \\ &= \prod_{i=1}^{n} [\Phi(\beta_0 + \beta_1 x_i)]^{y_i} [1 - \Phi(\beta_0 + \beta_1 x_i)]^{(1 - y_i)} \end{aligned}$$

Implementation of Probit by MLE
- Given data Y, define the likelihood function:
$$L(\beta_0, \beta_1) = f(Y \mid X; \beta_0, \beta_1) = \prod_{i=1}^{n} [\Phi(\beta_0 + \beta_1 x_i)]^{y_i} [1 - \Phi(\beta_0 + \beta_1 x_i)]^{(1 - y_i)}$$
- Take the log-likelihood:
$$\begin{aligned} l(\beta_0, \beta_1) &= \log(L(\beta_0, \beta_1)) = \log(f(Y \mid X; \beta_0, \beta_1)) \\ &= \sum_{i=1}^{n} y_i \log(\Phi(\beta_0 + \beta_1 x_i)) + (1 - y_i) \log(1 - \Phi(\beta_0 + \beta_1 x_i)) \end{aligned}$$

Implementation of Probit by MLE
- We then have
$$(\hat{\beta}_0^{MLE}, \hat{\beta}_1^{MLE}) = \arg\max_{(\beta_0, \beta_1)} l(\beta_0, \beta_1)$$
- Intuition: values of $\beta_0, \beta_1$ that make the observed data most likely

Log-likelihood is a Nice Concave Function
[Figure: log-likelihood (vertical axis, about −6350 to −6200) against $\beta_1$ (horizontal axis, 0.3 to 0.7)]

Part 3: Implementation of Probit by MLE in R
- It turns out this log-likelihood is globally concave in $(\beta_0, \beta_1)$
  - Pretty easy problem for a computer
  - Standard optimization packages will typically converge relatively easily
- Let's try in R

Implementation of Probit in R
- Let's start by simulating some example data
- We will use the latent variable approach
$$y_i^* = \beta_0 + \beta_1 x_i + v_i$$
$$y_i = \begin{cases} 1 & \text{if } y_i^* \ge 0 \\ 0 & \text{if } y_i^* < 0 \end{cases}$$
- To start, let's define some parameters
    beta_0 <- 0.2
    beta_1 <- 0.5
- And choose n = 10000
    n <- 10000

Simulating Data in R
$$y_i^* = \beta_0 + \beta_1 x_i + v_i$$
- To generate $y_i^*$ we need to simulate $x_i$ and $v_i$
- We will use the function rnorm(n)
  - Simulates n draws from a standard normal random variable
    x_i <- rnorm(n)
    v_i <- rnorm(n)
- Aside: We've simulated both $x_i$ and $v_i$ as normal
  - Probit only assumes $v_i$ normal
  - Could have chosen $x_i$ to be uniform or some other distribution

Simulating Data in R
- With $x_i$ and $v_i$ in hand, we can generate $y_i^*$ and $y_i$
$$y_i^* = \beta_0 + \beta_1 x_i + v_i$$
$$y_i = \begin{cases} 1 & \text{if } y_i^* \ge 0 \\ 0 & \text{if } y_i^* < 0 \end{cases}$$
- To do this in R:
    y_i_star <- beta_0 + beta_1 * x_i + v_i
    y_i <- y_i_star > 0

Writing the Likelihood Function in R
- Now, recall the log-likelihood
$$l(\beta_0, \beta_1) = \sum_{i=1}^{n} y_i \log(\Phi(\beta_0 + \beta_1 x_i)) + (1 - y_i) \log(1 - \Phi(\beta_0 + \beta_1 x_i))$$
- In R, the function $\Phi(\cdot)$ is pnorm(·)
- We will define beta = $[\beta_0, \beta_1]$
  - beta[1] refers to $\beta_0$
  - beta[2] refers to $\beta_1$
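The log-likelihood recalled above can be written as a single R function of the vector `beta` and handed to `optim()`. This is a minimal sketch on simulated data, not the course's own script: the seed, the starting values `c(0, 0)` and the use of `log.p = TRUE` (a numerically safer way to get $\log \Phi$ and $\log(1 - \Phi)$) are assumptions. The function returns the negative log-likelihood because `optim()` minimizes by default. The `glm()` call is only a cross-check.

```r
set.seed(123)  # illustrative seed

# Simulate data from the latent variable model.
n <- 10000
beta_0 <- 0.2
beta_1 <- 0.5
x_i <- rnorm(n)
v_i <- rnorm(n)
y_i <- as.numeric(beta_0 + beta_1 * x_i + v_i > 0)

# Negative probit log-likelihood as a function of beta = c(beta_0, beta_1).
probit_loglik <- function(beta) {
  index <- beta[1] + beta[2] * x_i
  l_1 <- pnorm(index, log.p = TRUE)                      # log(Phi(index))
  l_2 <- pnorm(index, lower.tail = FALSE, log.p = TRUE)  # log(1 - Phi(index))
  -sum(y_i * l_1 + (1 - y_i) * l_2)
}

# Minimize the negative log-likelihood from starting guesses (0, 0).
fit <- optim(par = c(0, 0), fn = probit_loglik)
fit$par

# Cross-check against R's built-in probit estimator.
glm_coef <- coef(glm(y_i ~ x_i, family = binomial(link = "probit")))
glm_coef
```

The estimates in `fit$par` should land close to the true values 0.2 and 0.5 and close to the `glm()` coefficients.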

Writing the Likelihood Function in R
$$l(\beta_0, \beta_1) = \sum_{i=1}^{n} y_i \log(\Phi(\beta_0 + \beta_1 x_i)) + (1 - y_i) \log(1 - \Phi(\beta_0 + \beta_1 x_i))$$
- Easy to break into a few steps
- To capture $\log(\Phi(\beta_0 + \beta_1 x_i))$:
    l_1 <- log(pnorm(beta[1] + beta[2] * x_i))
- To capture $\log(1 - \Phi(\beta_0 + \beta_1 x_i))$:
    l_2 <- log(1 - pnorm(beta[1] + beta[2] * x_i))
- To capture the whole function $l(\beta_0, \beta_1)$:
    sum(y_i * l_1 + (1 - y_i) * l_2)

sum(y_i * l_1 + (1 - y_i) * l_2) is just a function of β0, β1
[Figure: log-likelihood (vertical axis, about −6350 to −6200) against $\beta_1$ (horizontal axis, 0.3 to 0.7)]
- Here I've just plotted it holding $\beta_0 = 0.2$

Minimizing the Negative Likelihood
- So we have our R function, which we will call probit_loglik
- We just need to find the $\beta_0, \beta_1$ that maximize this:
  - Unfortunately, most software is set up to minimize
  - Easy solution:
$$(\hat{\beta}_0^{MLE}, \hat{\beta}_1^{MLE}) = \arg\max_{(\beta_0, \beta_1)} l(\beta_0, \beta_1) = \arg\min_{(\beta_0, \beta_1)} -l(\beta_0, \beta_1)$$
- So we just define probit_loglik to be $-l(\beta_0, \beta_1)$

Maximum Likelihood Estimation
- To find the $\beta_0, \beta_1$ that maximize the likelihood use the function:
    optim(par = ·, fn = ·)
- This finds the parameters (par) that minimize a function (fn)
- Two key arguments:
  - par: starting guesses for the parameters to estimate
  - fn: what function to minimize

Menti: Estimate a Logit via Maximum Likelihood
- On the hub, you'll find data: logit_data.csv
- Data is identical to $y_i$, $x_i$ before, except $v_i$ simulated from standard logistic distribution
- Everything identical in likelihood, except instead of $\Phi(z)$, we have:
$$\Lambda(z) = \frac{\exp(z)}{1 + \exp(z)}$$

Part 4: Censoring and Truncation
- So far we have focused on binary dependent variables
- Two other common ways in which $y_i$ may be limited are
  - Censoring
  - Truncation
- The censored regression model and likelihood

Censoring
- An extremely common data issue is censoring:
  - We only observe $y_i^*$ if it is below (or above) some threshold
  - We see $x_i$ either way
- Example: Income is often top-coded
  - That is, we might only see whether income is > £100,000
- Formally, we might be interested in $y_i^*$, but see:
$$y_i = \min(y_i^*, c_i)$$
where $c_i$ is a censoring value
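Top-coding of this kind is a single `pmin()` call in R. The sketch below uses a made-up income distribution; the log-normal parameters, the sample size and the variable names are illustrative assumptions.

```r
set.seed(11)  # illustrative seed

# Illustrative latent incomes y*, in pounds.
y_star <- rlnorm(1000, meanlog = log(50000), sdlog = 0.6)

c_i <- 100000             # censoring value (top code)
y <- pmin(y_star, c_i)    # observed outcome: y = min(y*, c)

mean(y == c_i)   # share of observations sitting at the top code
max(y)           # never exceeds the censoring value
```

Every observation is still in the data; only the value of the outcome above the threshold is lost.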

An Example of Censored Data

Truncation
- Similar to censoring is truncation
  - We don't observe anything if $y_i^*$ is above some threshold
  - e.g.: we only have data for those with incomes below £100,000
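In R terms, truncation drops whole rows rather than capping the outcome. The data-generating values, the regressor `x` and the names `keep` and `truncated` are illustrative assumptions.

```r
set.seed(12)  # illustrative seed

n <- 1000
x <- rnorm(n)                                        # illustrative regressor
y_star <- 60000 + 20000 * x + rnorm(n, sd = 20000)   # illustrative incomes

# Truncation: observations at or above the threshold are never recorded,
# so both y and x are missing for them.
keep <- y_star < 100000
truncated <- data.frame(x = x[keep], y = y_star[keep])

nrow(truncated)   # fewer rows than n
```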

An Example of Truncated Data: No one over £100,000

Terminology: Left vs. Right
- If we only see $y_i^*$ when it is above some threshold it is left censored
  - We still see other variables $x_i$ regardless
- If we only see $y_i^*$ when it is below some threshold it is right censored
  - We still see other variables $x_i$ regardless
- If we only see the observation when $y_i^*$ is above some threshold ⇒ left truncated
  - Not able to see other variables $x_i$ in this case
- If we only see the observation when $y_i^*$ is below some threshold ⇒ right truncated
  - Not able to see other variables $x_i$ in this case

Censored Regression
- Suppose there is some underlying outcome
$$y_i^* = \beta_0 + \beta_1 x_i + v_i$$
  - Again depends on observable $x_i$
  - Unobservable $v_i$
- We only see continuous $y_i^*$ if it is above/below some threshold:
  - Hours worked: $y_i = \max(0, y_i^*)$
  - Income: $y_i = \min(\text{£}100{,}000, y_i^*)$
- And assume: $v_i \sim N(0, \sigma^2)$, with $v_i \perp x_i$

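Censored Regression
$$y_i^* = \beta_0 + \beta_1 x_i + v_i$$
$$y_i = \min(y_i^*, c_i)$$
$$v_i \mid x_i, c_i \sim N(0, \sigma^2)$$
- What is $P(y_i = c_i \mid x_i)$?
$$P(y_i = c_i \mid x_i) = P(y_i^* \ge c_i \mid x_i) = 1 - \Phi\left(\frac{c_i - \beta_0 - \beta_1 x_i}{\sigma}\right)$$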
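The censoring probability can be checked by simulation at a fixed $x_i$ and $c_i$. All numerical values below (`beta_0`, `beta_1`, `sigma`, `x0`, `c0`, the number of draws) are illustrative assumptions.

```r
set.seed(21)  # illustrative seed

beta_0 <- 1     # illustrative parameter values
beta_1 <- 0.5
sigma <- 2
x0 <- 1         # fixed regressor value
c0 <- 3         # fixed censoring value

v <- rnorm(1e5, sd = sigma)           # errors with standard deviation sigma
y_star <- beta_0 + beta_1 * x0 + v    # latent outcome
y <- pmin(y_star, c0)                 # censored outcome

mean(y == c0)                                       # simulated P(y = c | x)
1 - pnorm((c0 - beta_0 - beta_1 * x0) / sigma)      # formula from the slide
```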

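Censored Regression
$$y_i^* = \beta_0 + \beta_1 x_i + v_i$$
$$y_i = \min(y_i^*, c_i)$$
$$v_i \mid x_i, c_i \sim N(0, \sigma^2)$$
- For $y_i < c_i$ what is $f(y_i \mid x_i)$?
$$f(y_i \mid x_i) = \frac{1}{\sigma} \phi\left(\frac{y_i - \beta_0 - \beta_1 x_i}{\sigma}\right)$$

Censored Regression
$$y_i^* = \beta_0 + \beta_1 x_i + v_i$$
$$y_i = \min(y_i^*, c_i)$$
$$v_i \mid x_i, c_i \sim N(0, \sigma^2)$$
- So in general:
$$f(y_i \mid x_i, c_i) = 1\{y_i \ge c_i\} \left[1 - \Phi\left(\frac{c_i - \beta_0 - \beta_1 x_i}{\sigma}\right)\right] + 1\{y_i < c_i\} \frac{1}{\sigma} \phi\left(\frac{y_i - \beta_0 - \beta_1 x_i}{\sigma}\right)$$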