# CS 189 (CDSS offering), Lecture 8: Linear regression (2)

2022/02/04

Today’s lecture


• Back to linear regression!

• Why? We have some new tools to play around with: MVGs and MAP estimation

• Remember: we already motivated several linear regression formulations, such as least squares, from the perspective of MLE

• Today, we will introduce two more formulations from the perspective of MAP estimation: ridge regression and LASSO regression

• Time permitting, we will introduce the concept of (polynomial) featurization

Recap: MAP estimation

what is the MAP estimate if p(θ) = N(θ; 0, σ²I)?

$$\hat{\theta}_{\text{MAP}} = \arg\max_\theta \sum_i \log p(y_i \mid x_i, \theta) + \log p(\theta) = \arg\min_\theta \sum_i \left[-\log p(y_i \mid x_i, \theta)\right] + \frac{1}{2\sigma^2}\|\theta\|_2^2$$

(dropping constants that do not depend on θ; in practice, we usually do not reason about σ² but instead set the coefficient on the regularization term directly)

this is an example of a regularizer — typically, something we add to the loss function that doesn’t depend on the data

regularizers are crucial for combating overfitting

Ridge regression

• We can apply this idea (p(θ) = N(θ; 0, σ²I), or equivalently, ℓ₂-regularization) to least squares linear regression

• What will be the resulting learning problem?

• Remember the least squares linear regression problem: arg min_w ‖Xw − y‖²₂

• Here, we have θ = w

• So adding ℓ₂-regularization results in: arg min_w ‖Xw − y‖²₂ + λ‖w‖²₂

• This is the ridge regression problem setup

Solving ridge regression

objective:

$$\arg\min_w \|Xw - y\|_2^2 + \lambda\|w\|_2^2$$

let's rewrite this slightly:

$$= \arg\min_w \; w^\top X^\top X w - 2 y^\top X w + \lambda w^\top w = \arg\min_w \; w^\top (X^\top X + \lambda I) w - 2 y^\top X w$$

now let's take the gradient and set it equal to zero:

$$\nabla_w = 2 (X^\top X + \lambda I) w_{\text{MAP}} - 2 X^\top y = 0 \quad\Longrightarrow\quad w_{\text{MAP}} = (X^\top X + \lambda I)^{-1} X^\top y$$
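A quick numerical check of this closed form (a sketch in NumPy on synthetic data; the sizes, seed, and λ are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=n)

lam = 0.1
# ridge solution: w_MAP = (X^T X + lam * I)^{-1} X^T y
w_map = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# verify it is a stationary point: the gradient 2 (X^T X + lam I) w - 2 X^T y
# from the derivation above should vanish at w_MAP
grad = 2 * (X.T @ X + lam * np.eye(d)) @ w_map - 2 * X.T @ y
print(np.allclose(grad, 0))  # True
```

Note the use of `np.linalg.solve` rather than forming the inverse explicitly, which is the standard numerically stable way to apply this formula.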

The ridge solution vs the least squares solution

• The ridge solution differs in adding a λI term inside the inverted expression

• What does this do?

• Intuitively, it makes the resulting solution smaller in magnitude, which makes sense

• Numerically, it can fix underdetermined (or ill-conditioned) problems!

• Recall that the least squares solution we found needs XᵀX to be invertible, and this is not the case when the columns of X are not linearly independent

• Adding λI for any λ > 0 guarantees invertibility, and even relatively small λ can generally make the problem better conditioned (easier for a computer to solve)
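To illustrate, here is a sketch with synthetic data whose columns are linearly dependent, so XᵀX is singular; adding even a small λI restores invertibility:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 20
x0 = rng.normal(size=n)
# second column is an exact multiple of the first, so X^T X is singular
X = np.column_stack([x0, 2 * x0])
y = rng.normal(size=n)

print(np.linalg.matrix_rank(X.T @ X))  # 1: rank-deficient, not invertible (d = 2)

lam = 1e-3
A = X.T @ X + lam * np.eye(2)
print(np.linalg.matrix_rank(A))        # 2: adding lam * I makes it invertible

w_ridge = np.linalg.solve(A, X.T @ y)  # now this solve succeeds
```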

Selecting λ

• Think for a minute: can we choose λ the same way we learn w? That is, can we do something like arg min_{w, λ≥0} ‖Xw − y‖²₂ + λ‖w‖²₂?

• We can't, because this would always set λ to 0!

• λ is an example of a hyperparameter — a parameter that is not learned, but instead we have to set it ourselves

• Learning hyperparameters with the same objective often leads to overfitting

• We will talk more about how to select hyperparameters in a few weeks

LASSO regression

what if we instead choose p(θᵢ) = Laplace(θᵢ; 0, b) for all i?

$$p(\theta_i) = \frac{1}{2b}\exp\left(-\frac{|\theta_i|}{b}\right) \quad\Longrightarrow\quad \log p(\theta) = \sum_i \log p(\theta_i) = -\frac{1}{b}\sum_i |\theta_i| + \text{const}$$

$$\hat{\theta}_{\text{MAP}} = \arg\max_\theta \sum_j \log p(y_j \mid x_j, \theta) + \log p(\theta) = \arg\min_w \|Xw - y\|_2^2 + \lambda\|w\|_1$$

where we replace the coefficient in front of the ℓ₁ term with λ (how did I get this?)

The LASSO regression solution

• LASSO corresponds to ℓ₁-regularization, which tends to induce sparse solutions

• LASSO does not have an analytical (“set the gradient to zero”) solution

• Most commonly, LASSO is solved with proximal gradient methods, which are covered in optimization courses
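A minimal sketch of one such method, ISTA (proximal gradient descent, alternating a gradient step on the squared loss with the soft-thresholding proximal operator of the ℓ₁ term), on synthetic data with a sparse ground-truth w; the step size, iteration count, and λ are illustrative choices:

```python
import numpy as np

def soft_threshold(v, t):
    # proximal operator of t * ||.||_1: shrink toward zero, clipping at zero
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

rng = np.random.default_rng(3)
n, d = 100, 10
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:2] = [3.0, -2.0]                 # only 2 of 10 features matter
y = X @ w_true + 0.1 * rng.normal(size=n)

lam = 5.0
step = 1.0 / np.linalg.norm(X.T @ X, 2)  # 1/L, with L the gradient's Lipschitz constant
w = np.zeros(d)
for _ in range(500):
    grad = X.T @ (X @ w - y)             # gradient of (1/2) ||Xw - y||^2
    w = soft_threshold(w - step * grad, step * lam)

print(np.round(w, 2))  # the l1 penalty drives most coordinates to exactly zero
```

Unlike ridge, the zeros here are exact, not merely small: that is the sparsity-inducing behavior mentioned above.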

How powerful are linear models anyway? Or, the importance of featurization

• Linear models, by themselves, might seem not that useful

• However, they can be, if we use the right set of features

• That is, instead of using x directly, we work with a featurization φ(x) that may be better suited to the data

• Everything else stays the same — just replace x with φ(x) everywhere

• We will talk extensively in this class about different featurizations, both hand-designed (when we talk about kernels) and learned (neural networks)
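As a sketch of the idea: fit quadratic data with a plain linear model versus a polynomial featurization φ(x) = (1, x, x²); the data and the specific featurization here are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-3, 3, size=80)
y = 1.0 - 2.0 * x + 0.5 * x**2 + 0.1 * rng.normal(size=80)  # quadratic ground truth

def fit_lstsq(X, y):
    # ordinary least squares via numpy's lstsq
    return np.linalg.lstsq(X, y, rcond=None)[0]

# plain linear model: features (1, x)
X_lin = np.column_stack([np.ones_like(x), x])
w_lin = fit_lstsq(X_lin, y)
err_lin = np.mean((X_lin @ w_lin - y) ** 2)

# polynomial featurization phi(x) = (1, x, x^2); everything else is unchanged
X_poly = np.column_stack([np.ones_like(x), x, x**2])
w_poly = fit_lstsq(X_poly, y)
err_poly = np.mean((X_poly @ w_poly - y) ** 2)

print(err_lin, err_poly)  # the featurized model fits far better
```

The model is still linear in its parameters w, which is why all the machinery above (least squares, ridge, LASSO) carries over unchanged.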

A classic example of featurization

(figure: a decision boundary that is nonlinear in the raw inputs x but linear in the features φ(x))

(this example is for linear classification, not regression)

Advanced Machine Learning Specialization, Coursera
