
# Lecture 8: Linear regression (2)

CS 189 (CDSS offering), 2022/02/04

## Today’s lecture

• Back to linear regression!
• Why? We have some new tools to play around with: MVGs and MAP estimation
• Remember: we already motivated several linear regression formulations, such as least squares, from the perspective of MLE
• Today, we will introduce two more formulations from the perspective of MAP estimation: ridge regression and LASSO regression
• Time permitting, we will introduce the concept of (polynomial) featurization

## Recap: MAP estimation
What is the MAP estimate if $p(\theta) = \mathcal{N}(\theta; 0, \sigma^2 I)$?

$$\hat\theta_{\text{MAP}} = \arg\max_\theta \sum_i \log p(y_i \mid x_i, \theta) + \log p(\theta) = \arg\min_\theta \sum_i -\log p(y_i \mid x_i, \theta) + \frac{1}{2\sigma^2} \|\theta\|_2^2$$

In practice, rather than deriving the coefficient $\frac{1}{2\sigma^2}$ from a prior, we treat it as a single coefficient and set it directly.
This is an example of a regularizer: typically, a term we add to the loss function that doesn’t depend on the data.

Regularizers are crucial for combating overfitting.

## Ridge regression

• We can apply this idea ($p(\theta) = \mathcal{N}(\theta; 0, \sigma^2 I)$, or equivalently, $\ell_2$-regularization) to least squares linear regression
• What will be the resulting learning problem?
• Remember the least squares linear regression problem: $\arg\min_w \|Xw - y\|_2^2$
• Here, we have $\theta = w$
• So adding $\ell_2$-regularization results in: $\arg\min_w \|Xw - y\|_2^2 + \lambda \|w\|_2^2$
• This is the ridge regression problem setup

## Solving ridge regression

Objective:

$$\arg\min_w \|Xw - y\|_2^2 + \lambda \|w\|_2^2$$

Let’s rewrite this slightly:

$$\arg\min_w \; w^\top X^\top X w - 2 y^\top X w + \lambda w^\top w = \arg\min_w \; w^\top (X^\top X + \lambda I) w - 2 y^\top X w$$

Now let’s take the gradient and set it equal to zero:

$$\nabla_w = 2 (X^\top X + \lambda I) w_{\text{MAP}} - 2 X^\top y = 0$$

or

$$w_{\text{MAP}} = (X^\top X + \lambda I)^{-1} X^\top y$$
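As a sketch (not part of the original lecture), the closed-form ridge solution can be computed directly with NumPy; `ridge_fit` is our own helper name:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution: w = (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    # Solve the linear system rather than forming the inverse explicitly,
    # which is cheaper and numerically more stable.
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```

With `lam = 0` this reduces to ordinary least squares (when $X^\top X$ is invertible); larger `lam` shrinks the solution toward zero.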

## The ridge solution vs the least squares solution

• The ridge solution differs in adding a $\lambda I$ term inside the inverted expression
• What does this do?
• Intuitively, it makes the resulting solution smaller in magnitude, which makes sense
• Numerically, it can fix underdetermined (or ill-conditioned) problems!
• Recall that the least squares solution we found needs $X^\top X$ to be invertible, and this is not the case when the columns of $X$ are not linearly independent
• Adding $\lambda I$ for any $\lambda > 0$ guarantees invertibility, and even relatively small $\lambda$ can generally make the problem better conditioned (easier for a computer to solve)
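A small numerical illustration of this point (our own example, not from the slides): with a duplicated column, $X^\top X$ is singular, but adding even a tiny $\lambda I$ restores full rank.

```python
import numpy as np

# Second column duplicates the first, so X^T X is rank-deficient.
X = np.array([[1.0, 1.0],
              [2.0, 2.0],
              [3.0, 3.0]])
gram = X.T @ X
rank_plain = np.linalg.matrix_rank(gram)                      # 1: not invertible
rank_ridge = np.linalg.matrix_rank(gram + 1e-3 * np.eye(2))   # 2: invertible
```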

## Selecting $\lambda$

• Think for a minute: can we choose $\lambda$ the same way we learn $w$? That is, can we do something like $\arg\min_{w, \lambda \ge 0} \|Xw - y\|_2^2 + \lambda \|w\|_2^2$?
• We can’t, because this would always set $\lambda$ to 0!
• $\lambda$ is an example of a hyperparameter — a parameter that is not learned, but instead we have to set it ourselves
• Learning hyperparameters with the same objective often leads to overfitting
• We will talk more about how to select hyperparameters in a few weeks
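As a preview (a sketch under our own assumptions, not the course's prescribed method): one common approach is to fit $w$ on training data for each candidate $\lambda$ and pick the $\lambda$ with the lowest error on held-out data.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution from the lecture."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Synthetic data, split once into train and validation
# (a single split for simplicity; cross-validation is more common).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.5 * rng.normal(size=60)
X_tr, y_tr, X_val, y_val = X[:40], y[:40], X[40:], y[40:]

lams = [0.0, 0.01, 0.1, 1.0, 10.0]
val_err = [np.mean((X_val @ ridge_fit(X_tr, y_tr, lam) - y_val) ** 2) for lam in lams]
best_lam = lams[int(np.argmin(val_err))]
```

Crucially, the validation data is never used to fit $w$, so setting $\lambda$ this way does not collapse to $\lambda = 0$.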

## LASSO regression

What if we instead choose $p(\theta_i) = \mathrm{Laplace}(\theta_i; 0, b)$ for all $i$?

$$p(\theta_i) = \frac{1}{2b} \exp\left(-\frac{|\theta_i|}{b}\right) \implies \log p(\theta_i) = -\frac{|\theta_i|}{b} + \text{const}$$

$$\hat\theta_{\text{MAP}} = \arg\max_\theta \sum_j \log p(y_j \mid x_j, \theta) + \sum_i \log p(\theta_i) = \arg\min_w \|Xw - y\|_2^2 + \lambda \sum_i |w_i| = \arg\min_w \|Xw - y\|_2^2 + \lambda \|w\|_1$$

As before, we can replace the coefficient in front of the penalty with a hyperparameter $\lambda$. (How did I get this?)

## The LASSO regression solution

• LASSO corresponds to $\ell_1$-regularization, which tends to induce sparse solutions
• LASSO does not have an analytical (“set the gradient to zero”) solution
• Most commonly, LASSO is solved with proximal gradient methods, which are covered in optimization courses
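A minimal proximal gradient (ISTA) sketch for the LASSO objective $\|Xw - y\|_2^2 + \lambda \|w\|_1$; the function names are ours, and step-size and stopping details are simplified:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (elementwise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iters=5000):
    """Minimize ||Xw - y||_2^2 + lam * ||w||_1 by proximal gradient (ISTA)."""
    w = np.zeros(X.shape[1])
    # Step size 1/L, where L = 2 * sigma_max(X)^2 is the Lipschitz
    # constant of the gradient of the smooth squared-error term.
    eta = 1.0 / (2.0 * np.linalg.norm(X, 2) ** 2)
    for _ in range(n_iters):
        grad = 2.0 * X.T @ (X @ w - y)   # gradient of ||Xw - y||^2
        # Gradient step on the smooth term, then prox step on lam * ||.||_1.
        w = soft_threshold(w - eta * grad, eta * lam)
    return w
```

The soft-thresholding step is what produces exact zeros in the solution, matching the sparsity-inducing behavior described above.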

## How powerful are linear models anyway? Or, the importance of featurization

• Linear models, by themselves, might seem not that useful
• However, they can be, if we use the right set of features
• That is, instead of using $x$ directly, we work with a featurization $\phi(x)$ that may be better suited to the data
• Everything else stays the same: just replace $x$ with $\phi(x)$ everywhere
• We will talk extensively in this class about different featurizations, both hand-designed (when we talk about kernels) and learned (neural networks)
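A quick sketch of polynomial featurization (our own illustration): a linear model fit on $\phi(x) = [1, x, x^2, \dots]$ can represent a nonlinear function of $x$.

```python
import numpy as np

def poly_features(x, degree):
    """Map scalar inputs x to features [1, x, x^2, ..., x^degree]."""
    x = np.asarray(x, dtype=float)
    return np.stack([x ** k for k in range(degree + 1)], axis=1)

# Fit a linear model on phi(x) to a quadratic (noiseless) target.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100)
y = 1.0 - 2.0 * x + 3.0 * x ** 2
Phi = poly_features(x, degree=2)
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
# w recovers the true coefficients [1, -2, 3]
```

The model is still linear in $w$, so everything from least squares and ridge applies unchanged; only the inputs were transformed.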

## A classic example of featurization

(figure: decision boundary)

(this example is for linear classification, not regression)