
# Lecture 16: Kernels (2) — CS 189 (CDSS offering)

2022/02/25

## Today’s lecture


• Last time, we saw how to kernelize the soft margin SVM, such that we can use featurizations φ(x) that are very high dimensional (or even infinite dimensional)

• Today, we will kernelize a couple more of the models we have already studied — the perceptron, and ridge regression

• Keep in mind what this allows us to do — we can “extend” these linear models into problems requiring nonlinear solutions!

• Also worth noting: a number of models that we will study later in the course can also be kernelized, so it really is quite a useful technique

## Recall: kernelization

• Kernelizing a model, at a high level, involves replacing all xᵀx′ with φ(x)ᵀφ(x′)

• All machine learning models must specify two things: how they are learned at training time, and how they are used to predict at test time

• If both of these can be specified using only inner products, we are in business

• The kernel function k(x, x′) = φ(x)ᵀφ(x′) specifies how to compute this inner product without computing the featurizations

• The kernel matrix Kᵢⱼ = k(xᵢ, xⱼ) stores the inner products between all the pairs of training points
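As a concrete sketch of this definition (the degree-2 polynomial kernel below is just an illustrative choice of k, not one prescribed by the lecture), the kernel matrix can be computed as:

```python
import numpy as np

def polynomial_kernel(x, z, degree=2):
    # k(x, z) = (x^T z + 1)^degree, a standard closed-form kernel
    return (x @ z + 1.0) ** degree

def kernel_matrix(X, kernel):
    """K[i, j] = k(x_i, x_j) for all pairs of training points."""
    n = X.shape[0]
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(X[i], X[j])
    return K

# Tiny made-up dataset, one row per training point
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
K = kernel_matrix(X, polynomial_kernel)
# K is symmetric, since k(x, z) = k(z, x)
```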

## Recall: the perceptron algorithm

The simplest version

• The perceptron is best understood as repeating the following update to w

• Pick a (xᵢ, yᵢ) — if sign(wᵀxᵢ) = yᵢ, move on; otherwise, w ← w + yᵢxᵢ

• We repeat this until sign(wᵀxᵢ) = yᵢ for all i

• For linearly separable datasets, this is guaranteed to eventually happen!

• If the data are not linearly separable, this algorithm will never terminate

• Turns out, kernelizing this algorithm and the linear model is pretty straightforward!
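A minimal sketch of this training loop on toy data (variable names and the epoch cap are mine, not from the lecture):

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Basic perceptron. X: (N, d) data matrix, y: labels in {-1, +1}.

    Repeats w <- w + y_i * x_i on misclassified points until every
    point satisfies sign(w^T x_i) = y_i (requires linear separability;
    max_epochs guards against the non-separable case, where the
    algorithm would never terminate).
    """
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if np.sign(w @ xi) != yi:  # sign(0) counts as a mistake
                w += yi * xi
                mistakes += 1
        if mistakes == 0:
            break
    return w

# Toy linearly separable dataset
X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -2.0]])
y = np.array([1, 1, -1, -1])
w = perceptron(X, y)
```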

## Characterizing w for the kernel perceptron

• If we initialize w = 0, then, since the update is w ← w + yᵢxᵢ, at any iteration during learning, we can write w = Σᵢ₌₁ᴺ αᵢyᵢxᵢ for appropriately chosen αᵢ

• Specifically, αᵢ is the number of times the i-th data point has been misclassified up to the current training iteration

• And, if the algorithm terminates, it will return some w* = Σᵢ₌₁ᴺ αᵢ*yᵢxᵢ

• An aside: this is an instance of the more general representer theorem

## Learning in the kernel perceptron

• Before: initialize w = 0

• Pick a (xᵢ, yᵢ)

• If sign(wᵀxᵢ) ≠ yᵢ, then w ← w + yᵢxᵢ

• Once all points are correctly classified: return w*

• Now: initialize α = 0

• Pick a (xᵢ, yᵢ)

• If sign(Σⱼ₌₁ᴺ αⱼyⱼxⱼᵀxᵢ) ≠ yᵢ, then αᵢ ← αᵢ + 1

• Once all points are correctly classified: return α*

## Predicting in the kernel perceptron

And, actually kernelizing the kernel perceptron

• Prediction on a new x was previously just sign(w*ᵀx)

• Now, it is sign(Σᵢ₌₁ᴺ αᵢ*yᵢxᵢᵀx)

• We are definitely in business, since we only have inner products everywhere!

• To switch from x to φ(x), replace all xᵢᵀxⱼ for pairs of training points with Kᵢⱼ, and replace all other xᵢᵀx with k(xᵢ, x), for the appropriate K and k
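Putting learning and prediction together, here is a sketch of the kernelized algorithm, parameterized by an arbitrary kernel function. The linear kernel at the bottom is only a sanity check (it recovers the ordinary perceptron); any valid kernel could be substituted.

```python
import numpy as np

def train_kernel_perceptron(K, y, max_epochs=100):
    """Kernel perceptron: alpha_i counts mistakes on point i.

    K is the kernel matrix K[i, j] = k(x_i, x_j); the weight
    vector w is never explicitly formed.
    """
    n = len(y)
    alpha = np.zeros(n)
    for _ in range(max_epochs):
        mistakes = 0
        for i in range(n):
            # sign(sum_j alpha_j y_j k(x_j, x_i)) vs. the true label
            if np.sign(np.sum(alpha * y * K[:, i])) != y[i]:
                alpha[i] += 1
                mistakes += 1
        if mistakes == 0:
            break
    return alpha

def predict(x, X, y, alpha, kernel):
    # sign(sum_i alpha_i y_i k(x_i, x))
    return np.sign(sum(a * yi * kernel(xi, x)
                       for a, yi, xi in zip(alpha, y, X)))

# Sanity check with the linear kernel k(x, z) = x^T z
linear = lambda a, b: a @ b
X = np.array([[1.0, 1.0], [2.0, 2.0], [-1.0, -1.0], [-2.0, -2.0]])
y = np.array([1, 1, -1, -1])
K = np.array([[linear(a, b) for b in X] for a in X])
alpha = train_kernel_perceptron(K, y)
```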

## Kernelizing linear classifiers in general

• sign(Σᵢ₌₁ᴺ αᵢ*yᵢxᵢᵀx) looks very similar to how the kernel SVM turned out

• And, in fact, it is also the same for kernel logistic regression

• This is no accident — it is a direct consequence of the representer theorem applied to linear models

• From this perspective, the big difference between these three methods for learning linear classifiers is that they will produce different αᵢ*

• SVM is quite nice, from this perspective, because most αᵢ* will be zero!

## Recall: ridge regression

• Ridge regression adds ℓ₂-regularization to least squares linear regression: arg min_w ‖Xw − y‖₂² + λ‖w‖₂²

• The solution we derived was given by w* = (XᵀX + λI)⁻¹Xᵀy

• If this doesn’t come back to you immediately, it would be good to review this

• At first glance, it may not be obvious how to kernelize this solution, but we will see how it becomes obvious using a cool math trick

## A cool math trick

• Start from the fact that XᵀXXᵀ + λXᵀ can be factored in two ways:

(XᵀX + λI_d)Xᵀ = XᵀXXᵀ + λXᵀ = Xᵀ(XXᵀ + λI_N)

• Left-multiplying both sides by (XᵀX + λI_d)⁻¹ and right-multiplying by (XXᵀ + λI_N)⁻¹ gives

Xᵀ(XXᵀ + λI_N)⁻¹ = (XᵀX + λI_d)⁻¹Xᵀ

## Kernel ridge regression

• So, we can rewrite w* = (XᵀX + λI)⁻¹Xᵀy = Xᵀ(XXᵀ + λI)⁻¹y, with an N × N identity matrix I instead of a d × d one

• What is XXᵀ? By inspection, we have (XXᵀ)ᵢⱼ = xᵢᵀxⱼ

• XXᵀ is called a Gram matrix and can be directly replaced with K!

• For predicting on a new x, we compute w*ᵀx = yᵀ(XXᵀ + λI)⁻¹Xx

• What is Xx? By inspection, we have (Xx)ᵢ = xᵢᵀx — replace with k(xᵢ, x)
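A sketch of kernel ridge regression following these formulas (variable names are mine). The linear-kernel check at the end verifies that the kernelized prediction agrees exactly with the primal solution w* = (XᵀX + λI)⁻¹Xᵀy, as the identity above guarantees:

```python
import numpy as np

def fit_kernel_ridge(K, y, lam):
    """Dual coefficients c = (K + lam*I)^{-1} y, with K_ij = k(x_i, x_j)."""
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def predict_kernel_ridge(x, X, c, kernel):
    # Prediction on a new x: sum_i c_i k(x_i, x)
    return c @ np.array([kernel(xi, x) for xi in X])

# Sanity check: with the linear kernel k(x, z) = x^T z, kernel ridge
# regression must agree with the primal ridge solution
linear = lambda a, b: a @ b
X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
y = np.array([1.0, 2.0, 3.0])
lam = 0.5
K = np.array([[linear(a, b) for b in X] for a in X])
c = fit_kernel_ridge(K, y, lam)
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
x_new = np.array([1.0, 1.0])
kernel_pred = predict_kernel_ridge(x_new, X, c, linear)
primal_pred = w_primal @ x_new
```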

## If time: the RBF kernel

• We gave the example last time of the radial basis function (RBF) kernel k(x, x′) = exp(−‖x − x′‖²/2), also called the Gaussian kernel

• It can be understood as an infinite-dimensional φ(x), but it is perhaps better understood as a “distance” between x and x′

• For linear regression, the prediction function becomes a linear combination of Gaussians!
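A small illustration of this last point, using made-up 1-D data and a made-up λ: the kernel ridge prediction function with the RBF kernel is literally a sum of Gaussian bumps centered on the training points.

```python
import numpy as np

def rbf_kernel(x, z):
    # k(x, z) = exp(-||x - z||^2 / 2)
    return np.exp(-np.sum((np.asarray(x) - np.asarray(z)) ** 2) / 2.0)

# 1-D kernel ridge regression: the learned prediction function is
# f(x) = sum_i c_i exp(-(x_i - x)^2 / 2), a linear combination of
# Gaussians centered on the training inputs
X = np.array([-2.0, 0.0, 2.0])
y = np.array([1.0, -1.0, 1.0])
lam = 0.1
K = np.array([[rbf_kernel(a, b) for b in X] for a in X])
c = np.linalg.solve(K + lam * np.eye(len(X)), y)

def f(x):
    return sum(ci * rbf_kernel(xi, x) for ci, xi in zip(c, X))
```

Evaluating f near each training point shows the fit tracking the labels: negative near x = 0, positive near x = ±2.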

