Bias Variance Trade-Off

Bias vs Variance Tradeoff

- In the previous lecture we saw what happens to the test MSE as model complexity increases. The U-shape that emerged can be derived theoretically.


- The expected test MSE at any point $X$ can be decomposed as

$$E\big[(y - \hat f(X|D))^2\big] = \mathrm{Var}[\hat f(X|D)] + \mathrm{Bias}[\hat f(X|D)]^2 + \mathrm{Var}(\epsilon)$$

where $\mathrm{Bias}[\hat f(X|D)] = E[f(X) - \hat f(X|D)]$.

- Here $D = \{(X_1, y_1), \dots, (X_n, y_n)\}$ denotes the training dataset on which $\hat f(X|D)$ has been learnt.


- Remember that we assume the true model can be written as $y = f(X) + \epsilon$

- with $E(\epsilon) = 0$ and $f(\cdot)$ being the true population model, which is non-stochastic and independent of the training data $D$; hence

- $E(f) = f$ and $E(y) = E(f + \epsilon) = E(f) + E(\epsilon) = f$

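- Similarly, recall the definition of the variance of a random variable $Z$:

$$\mathrm{Var}(Z) = E\big[(Z - E(Z))^2\big] = E(Z^2) - E(Z)^2$$

- This implies

$$\mathrm{Var}(y) = E\big[(y - E(y))^2\big] = E\big[(y - f)^2\big] = E\big[(f + \epsilon - f)^2\big] = E(\epsilon^2) = \mathrm{Var}(\epsilon)$$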

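- The trained $\hat f$ and the $\epsilon$ in the validation sample are independent. Independence implies zero covariance, and since $E(\epsilon) = 0$,

$$\mathrm{Cov}(\hat f, \epsilon) = E(\hat f \times \epsilon) - E(\hat f) \times E(\epsilon) = E(\hat f \times \epsilon) = 0$$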

- Using this, we can formally show that the expected MSE on a validation sample can be decomposed into variance, squared bias and irreducible error.

Bias vs Variance Tradeoff Proof [advanced]

Show that:

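$$E\big[(y - \hat f(X|D))^2\big] = \mathrm{Var}[\hat f(X|D)] + \mathrm{Bias}[\hat f(X|D)]^2 + \mathrm{Var}(\epsilon)$$

To get rid of the indices, let $f = f(X)$ and $\hat f = \hat f(X|D)$.

$$
\begin{aligned}
E\big[(y - \hat f)^2\big]
&= E[y^2 + \hat f^2 - 2 y \hat f] && (1) \\
&= E[y^2] + E[\hat f^2] - E[2 y \hat f] && (2) \\
&= \mathrm{Var}[y] + E[y]^2 + \mathrm{Var}[\hat f] + E[\hat f]^2 - E[2 y \hat f] && (3) \\
&= \mathrm{Var}[y] + E[y]^2 + \mathrm{Var}[\hat f] + E[\hat f]^2 - 2E[(f + \epsilon)\hat f] && (4) \\
&= \mathrm{Var}[y] + E[y]^2 + \mathrm{Var}[\hat f] + E[\hat f]^2 - 2\big(E[f \hat f] + E[\epsilon \hat f]\big) && (5) \\
&= \mathrm{Var}[y] + \mathrm{Var}[\hat f] + E[y]^2 + E[\hat f]^2 - 2 f E[\hat f] && (6) \\
&= \mathrm{Var}[y] + \mathrm{Var}[\hat f] + \big[f^2 - 2 f E[\hat f] + E[\hat f]^2\big] && (7) \\
&= \mathrm{Var}[\epsilon] + \mathrm{Var}[\hat f] + (f - E[\hat f])^2 && (8) \\
&= \underbrace{(f - E[\hat f])^2}_{\text{Bias}^2} + \underbrace{\mathrm{Var}[\hat f]}_{\text{Variance}} + \underbrace{\mathrm{Var}[\epsilon]}_{\text{Irreducible error}} && (9)
\end{aligned}
$$

Step (3) uses $E[Z^2] = \mathrm{Var}[Z] + E[Z]^2$. Step (6) uses $E[\epsilon \hat f] = 0$ and the fact that $f$ is non-stochastic, so $E[f \hat f] = f E[\hat f]$. Steps (7) and (8) use $E[y] = f$ and $\mathrm{Var}[y] = \mathrm{Var}[\epsilon]$.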
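The decomposition can also be checked numerically. The following is a minimal simulation sketch, not part of the lecture material: the true function `f`, the noise level `sigma`, the sample size, the polynomial degree and the evaluation point `x0` are all assumptions chosen for illustration. It draws many training sets, refits the model each time, and compares the average squared test error at `x0` with the sum of the three components.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Hypothetical true regression function (an assumption for this sketch).
    return np.sin(2 * x)

sigma = 0.5    # standard deviation of the noise epsilon (assumed)
n = 30         # training-set size (assumed)
degree = 3     # polynomial degree of the fitted model (assumed)
x0 = 1.0       # point at which the test MSE is decomposed
reps = 5000    # number of simulated training sets D

preds = np.empty(reps)
sq_errors = np.empty(reps)
for r in range(reps):
    # Draw a fresh training set D and fit f_hat on it.
    x = rng.uniform(0, 3, n)
    y = f(x) + rng.normal(0, sigma, n)
    coefs = np.polyfit(x, y, degree)
    preds[r] = np.polyval(coefs, x0)
    # Draw an independent test response at x0 and record the squared error.
    y0 = f(x0) + rng.normal(0, sigma)
    sq_errors[r] = (y0 - preds[r]) ** 2

mse = sq_errors.mean()                   # Monte Carlo estimate of E[(y - f_hat)^2]
bias_sq = (f(x0) - preds.mean()) ** 2    # (f - E[f_hat])^2
variance = preds.var()                   # Var[f_hat]
noise = sigma ** 2                       # Var[epsilon]

print(f"test MSE:                {mse:.4f}")
print(f"bias^2 + var + noise:    {bias_sq + variance + noise:.4f}")
```

Up to Monte Carlo error the two printed numbers should agree; the remaining gap shrinks as `reps` grows.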

Bias vs Variance Intuition

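$$E\big[(y - \hat f)^2\big] = \underbrace{(f - E[\hat f])^2}_{\text{Bias}^2} + \underbrace{\mathrm{Var}[\hat f]}_{\text{Variance}} + \underbrace{\mathrm{Var}[\epsilon]}_{\text{Irreducible error}}$$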

Look at the individual elements here: $\mathrm{Var}(\epsilon)$

… is a constant.

It remains unchanged for different $\hat f$'s.

It represents the lower bound on the attainable test error, since both other terms are non-negative.

Minimizing the test error therefore requires finding an $\hat f$ that minimizes the sum of squared bias and variance.


Look at the individual elements here: $\mathrm{Var}[\hat f]$

… refers to the amount by which $\hat f$ would change if we estimated it using a different training data set.

Since the training data are used to fit the statistical learning method, different training data sets produce different $\hat f$. Ideally the estimate for $f$ should not vary too much between training sets.

Different methods have different variances: more flexible methods have larger variances, while less flexible ones (e.g. linear regression) have lower variance.

This variance term pushes up the test MSE for highly flexible specifications.
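To make the link between flexibility and variance concrete, here is a small simulation sketch. The data-generating process (a sine curve plus Gaussian noise), the sample size and the polynomial degrees are assumptions made for illustration; the helper `prediction_variance` is hypothetical and not part of any library. It refits a polynomial on many simulated training sets and measures how much the fitted curve moves between them.

```python
import numpy as np

def prediction_variance(degree, reps=2000, n=30, sigma=0.5, seed=0):
    """Average Var[f_hat(x)] over a grid, estimated across simulated training sets."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(-0.9, 0.9, 19)
    preds = np.empty((reps, grid.size))
    for r in range(reps):
        # New training set: inputs and noise are redrawn each time.
        x = rng.uniform(-1, 1, n)
        y = np.sin(2 * x) + rng.normal(0, sigma, n)   # hypothetical true f
        preds[r] = np.polyval(np.polyfit(x, y, degree), grid)
    # Variance across training sets at each grid point, then averaged over the grid.
    return preds.var(axis=0).mean()

variances = {d: prediction_variance(d) for d in (1, 3, 7)}
for d, v in variances.items():
    print(f"degree {d}: average variance of f_hat = {v:.4f}")
```

Under these assumptions the average variance should increase with the polynomial degree, in line with the statement that more flexible methods have larger variance.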


Looking at the Bias:


… an approximate model that leaves out relevant factors systematically introduces errors, e.g. by not allowing for more complex interactions between the variables $X_i$.

For example, a linear model may be inadequate if the true relationship is non-linear, introducing significant bias.

This is akin to the idea of omitted variable bias in regression, which causes the true effect of some variable to be under- or overstated, thus distorting the predictive power of that variable.
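The linear-versus-non-linear example can be illustrated with a short sketch. The quadratic truth `true_f`, the noise level and the helper `squared_bias` are assumptions introduced here for illustration only. The code averages the fitted curve over many simulated training sets to approximate $E[\hat f]$ and then compares it with the true $f$.

```python
import numpy as np

def true_f(x):
    # Hypothetical non-linear truth, chosen for illustration only.
    return x ** 2

def squared_bias(degree, reps=2000, n=30, sigma=0.5, seed=0):
    """Average (f(x) - E[f_hat(x)])^2 over a grid, with E taken over training sets."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(-0.9, 0.9, 19)
    mean_pred = np.zeros(grid.size)
    for _ in range(reps):
        x = rng.uniform(-1, 1, n)
        y = true_f(x) + rng.normal(0, sigma, n)
        # Accumulate the running average of the fitted curve, i.e. E[f_hat].
        mean_pred += np.polyval(np.polyfit(x, y, degree), grid) / reps
    return np.mean((true_f(grid) - mean_pred) ** 2)

bias_sq = {d: squared_bias(d) for d in (1, 2)}
for d, b in bias_sq.items():
    print(f"degree {d}: average squared bias = {b:.4f}")
```

With this setup the straight line cannot capture the curvature, so its squared bias stays clearly above zero no matter how many training sets are averaged, while the quadratic fit is approximately unbiased.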


Figure: Bias-variance tradeoff illustrated: U-shape due to increasing variance at high levels of model flexibility. Taken from Hastie et al., 2013.


- It is important to recognize that test MSE and training MSE are themselves random variables, each with a mean and a variance.

- The training and validation set approach constructs a single realization of both test MSE and training MSE.

- This means that, on specific training and validation sets, the error curves may by chance not follow the theoretically expected U-shape.

- Cross-validation or resampling methods allow you to construct multiple training and test error curves.

- We will see this in the interactive visualization.
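The points above can be sketched in a few lines. This is an illustrative simulation, not the interactive visualization referred to in the lecture: the data set is simulated, and the true function, noise level, 50/50 split and range of polynomial degrees are all assumptions. Each random split yields one realization of the validation MSE curve; repeating the split shows how much these curves differ.

```python
import numpy as np

rng = np.random.default_rng(1)

# One observed data set (simulated here; the true f and noise level are assumptions).
n = 100
x = rng.uniform(-1, 1, n)
y = np.sin(2 * x) + rng.normal(0, 0.5, n)

degrees = np.arange(1, 9)   # model flexibility: polynomial degrees 1 to 8

def validation_curve(split_rng):
    """Validation MSE per polynomial degree for one random 50/50 split."""
    idx = split_rng.permutation(n)
    train, valid = idx[: n // 2], idx[n // 2:]
    mses = []
    for d in degrees:
        coefs = np.polyfit(x[train], y[train], d)
        resid = y[valid] - np.polyval(coefs, x[valid])
        mses.append(np.mean(resid ** 2))
    return np.array(mses)

# Twenty different splits give twenty realizations of the validation error curve.
curves = np.array([validation_curve(rng) for _ in range(20)])

print("best degree per split:", degrees[curves.argmin(axis=1)])
print("average curve:        ", curves.mean(axis=0).round(3))
```

Individual rows of `curves` may be jagged or pick different best degrees; averaging across splits gives a smoother estimate of the underlying curve.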
