# 代写代考 EECS 445 — Introduction to Machine Learning Winter 2022 – cscodehelp代写

UNIVERSITY OF MICHIGAN

Department of Electrical Engineering and Computer Science EECS 445 — Introduction to Machine Learning Winter 2022

Homework 2 (50 points)

Due: Wednesday, February 16th at 10 pm

Copyright By cscodehelp代写 加微信 cscodehelp

1 Soft-Margin SVM Dual Derivation [14 pts]

Consider the primal formulation of a soft-margin SVM with regularization and offset, where C ≥ 0 is a regularization hyperparameter.

Submission: Please upload your completed assignment to Gradescope. Your submission can be either handwritten or typed. Assign all pages (including pages containing code) to their respective questions. Incorrectly assigned pages will not be graded.

θ ̄,b,ξ ̄ subject to

ξi ≥0,∀i=1,…,n. From this, we can obtain the dual formulation.

subject to

αi − αiαjy(i)y(j)x ̄(i) ·x ̄(j)

i=1 2 i=1 j=1 n

αiy(i) = 0, i=1

∥θ∥ +C ξi i=1

y(i)(θ ̄ · x ̄(i) + b) ≥ 1 − ξi,

0≤αi ≤C,∀i=1,…,n.

(a) (8 pt) Start with the primal form of a soft-margin SVM shown above and derive the dual formulation

shown above.

(i) (4 pt) To start, state the Lagrangian of the primal.

(ii) (4 pt) Next, swap the order of the optimization and find the optimal value of θ ̄ as well as the ̄

̄ ̄ ∥θ ̄∥2 n n (i) ̄ (i) n L(θ,b,ξ,α,β)= 2 +C i=1ξi+ i=1αi(1−ξi−y (θ·x ̄ +b))+ i=1βi(−ξi)

constraints imposed by the optimal values of b and ξ.

̄ ̄ ̄∗ ̄∗n (i)(i) ∇θ ̄L(θ,b,ξ,α,β) = 0 and solving for θ gives θ = i=1 αiy x ̄

̄ ̄ ̄ n (i)(i) ∇θ ̄L(θ,b,ξ,α,β)=θ+0+ i=1−αiy x ̄ +0

Set equal to 0,

n (i) (i) ∗ n (i) (i) θ ̄−i=1αiy x ̄ =0=⇒θ ̄=i=1αiy x ̄

∇bL(θ, b, ξ, α, β) = 0 gives rise to the constraint

̄ ̄ n(i) ∇bL(θ,b,ξ,α,β)=0+0+ i=1αiy +0

Set equal to 0,

ni=1 αiy(i) = 0, which is the constraint

n (i) i=1 αiy

∇ξ ̄L(θ,b,ξ,α,β) = 0 gives rise to the constraint 0 ≤ αi ≤ C

We can take the gradient w.r.t. ξi to see what each dimension of ξ ̄ should be.

̄ ̄ ∇ξiL(θ,b,ξ,α,β)=0+C−αi −βi

Set equal to 0,

0+C−αi −βi =0 =⇒ βi =C−αi foralli=1,2,…

Since we know βi ≥ 0 for all i = 1, 2, … by construction of the dual form, then we can say αi ≤ C. Since αi is also ≥ 0 by construction of the dual form, then we have 0 ≤ αi ≤ C for all i = 1,2,…

Using the above information, we can substitute the optimal value of θ ̄ into the Lagrangian with the swapped order of optimization and derive the final dual formulation.

subject to

αi − αiαjy(i)y(j)x ̄(i) ·x ̄(j)

i=1 2 i=1 j=1 n

αiy(i) = 0, i=1

0≤αi ≤C,∀i=1,…,n.

For the remaining problems, assume we trained an SVM classifier with C = 1 on a dataset consisting

of n = 6 points and learned the parameters listed in the table below.

1 2 3 4 5 6

[ -1.5 , -2.5 ]T

[ 1.6 , -2.5 ]T [ -1.9 , 0 ]T

[ -3.3 , -2.8 ]T [ 0.5 , 0.75 ]T [ 0.7 , 1.3 ]T

y(i) αi 1 0

1 0.175 1 0.322 1 0

-1 0.497 -1 0

(b) (2 pt) Based on the above dataset, state which points are support vectors.

(c) (4pt)Usethesamedatasetandstatetheoptimalvalueofbandθ ̄andusethemtoclassifythefollowing points: [-3,4] and [1,2].

Solution: Data points 2,3, and 5 are support vectors because they have an alpha value greater than 0.

Solution: From the derivation above, we know the optimal value of θ ̄ is θ ̄ = 6i=1 αiy(i)x ̄(i) = 0.175 ∗ 1 ∗ [1.6, −2.5] + 0.322 ∗ 1 ∗ [−1.9, 0] + 0.497 ∗ −1 ∗ [0.5, 0.75] = [-0.5803,-0.8103] Since all of the support vectors have an αi value less than C=1, we can conclude that they are all on the margin such that y(i)(θ ̄ · x ̄(i) + b) = 1. So, we can plug in any support vector for y(i) and x ̄(i) to solve for b.

b = y ( i ) − ( θ ̄ · x ̄ ( i ) )

Possible solutions:

b = 1 − ([−0.5803, −0.8103] · [1.6, −2.5]) = −0.09727

b = 1 − ([−0.5803, −0.8103] · [−1.9, 0]) = −0.10257

b = −1 − ([−0.5803, −0.8103] · [0.5, 0.75]) = −0.102125

Use the optimal value of θ ̄ and b to classify (any value of b from above gives same classification result):

sign([−0.5803, −0.8103] · [−3, 4] − 0.10257) = sign( − 1.60287), -1

sign([−0.5803, −0.8103] · [1, 2] − 0.10257) = sign(-2.30347), -1

**Solutions if alpha values of 0.05 used for i=2,3,5 used before new version was uploaded**

θ ̄=6i=1αiy(i)x ̄(i) =0.05∗1∗[1.6,−2.5]+0.05∗1∗[−1.9,0]+0.05∗−1∗[0.5,0.75]= [-0.04,-0.1625]

b = y ( i ) − ( θ ̄ · x ̄ ( i ) )

Possible solutions:

b = 1 − ([−0.04, −0.1625] · [1.6, −2.5]) = 0.65775

b = 1 − ([−0.04, −0.1625] · [−1.9, 0]) = 0.924

b = −1 − ([−0.04, −0.1625] · [0.5, 0.75]) = −0.858125

Use the optimal value of θ ̄ and b to classify (different values of b from above give different clas- sification results in this case):

sign([−0.04, −0.1625] · [−3, 4] − 0.924) = sign( − 1.454), -1

sign([−0.04, −0.1625] · [1, 2] − 0.924) = sign(-1.289), -1

b = 0.65775

sign([−0.04, −0.1625] · [−3, 4] − 0.65775) = sign( − 1.18775), -1 sign([−0.04, −0.1625] · [1, 2] − 0.65775) = sign(-1.02275), -1

b = −0.858125

sign([−0.04, −0.1625] · [−3, 4] + 0.858125) = sign(0.328125), 1 sign([−0.04, −0.1625] · [1, 2] + 0.858125) = sign(0.493125), 1

Kernels [10 pts]

(a) (3pt)LetK : Rd ×Rd → Rbeavalidkernelcorrespondingtothefeaturemappingφ : Rd → Rp andlettherebeascalarc≥0.DefineakernelK1 :Rd×Rd →Rasfollows:

K1(x ̄, z ̄) = cK(x ̄, z ̄)

What is the associated feature mapping φ′ for K1(·, ·) in terms of φ, for x ̄ ∈ Rd?

Therefore,

K1(x ̄, z ̄) = cK(x ̄, z ̄)

= c(φ(x ̄) · φ(z ̄)) √√

=( c∗φ(x ̄))·( c∗φ(z ̄))

φ′(x ̄) = √c ∗ φ(x ̄)

(b) (7 pt) Let x ̄, z ̄ ∈ R2 and K : R2 × R2 → R. Derive the underlying feature mapping φ for the function:

K(x ̄, z ̄) = 9(x ̄ · z ̄) + 4(x ̄ · z ̄ + r)2

Solution: We begin by expanding:

K(x ̄, z ̄) = 9(x ̄ · z ̄) + 4(x ̄ · z ̄ + r)2

= 9(x1z1 + x2z2) + 4(x1z1 + x2z2 + r)2

= 9x1z1 + 9x2z2 + 4(x21z12 + x2z2 + r2 + 2rx1z1 + 2rx2z2 + 2x1z1x2z2) = 9x1z1 + 9x2z2 + 4x21z12 + 4x2z2 + 4r2 + 8rx1z1 + 8rx2z2 + 8x1z1x2z2)

We can see that we can separate the terms into a dot product of the following vectors

3x ,3x ,2×2,2×2,2r,2√2rx ,2√2rx ,2√2x x T ·3z ,3z ,2z2,2z2,2r,2√2rz ,2√2rz ,2√2z z T 1212 12121212 1212

Therefore, our final feature mapping φ(x ̄) is:

3x ,3x ,2×2,2×2,2r,2√2rx ,2√2rx ,2√2x x T 12121212

3 Ensemble Methods [26 pts]

We will study the performance of two ensemble methods on the Optdigits dataset consisting of handwritten digits. This dataset contains 5620 samples (each an 8 × 8 grayscale image having 64 features) of digits 0 through 9, classified into 10 classes. In the following question we will use a subset of these classes as our dataset.

Within HW2 ensemble.py, the following functions have been implemented for you:

• load optdigits(classes)

• get median performance(X, y, m vals, nsplits=50)

• plot data(bagging scores, random forest scores, m range)

It also contains the following function declarations that you will implement:

• bagging ensemble(X train, y train, X test, y test, n clf=10)

• random forest(X train, y train, X test, y test, m, n clf=10)

a) (12pt) Implement random forest(X train, y train, X test, y test, m, n clf=10) based on the specification provided in the skeleton code. Random forest consists of n clf-many de- cision trees where each decision tree is trained independently on a bootstrap sample of the training data (for this problem, n clf=10). For each node, we randomly select m features as candidates for splitting on.

Here, the final prediction of the bagging classifier is determined by a majority vote of these n clf decision trees. In the case of ties, randomly sample among the plurality classes (i.e. the classes that are tied) to choose a label.

You should use the sklearn.tree.DecisionTreeClassifier class. Set criterion=‘entropy’ to avoid the default setting. Also, see the max features parameter within this class.

Note: Do not set the max depth parameter. Remember that you are free to use helper functions to keep your code organized.

Include a screenshot (or equivalent) of your implemented function as your solution to this ques- tion.

b) (6pt) Implement bagging ensemble(X train, y train, X test, y test, n clf=10) based on the specification provided in the skeleton code. Like random forest, a bagging ensemble clas- sifier consists of n clf-many decision trees where each decision tree is trained independently on a bootstrap sample of the training data. However, all features are considered at every split.

Again, the final prediction is determined by a majority vote of these n clf decision trees.

Include a screenshot (or equivalent) of your implemented function as your solution to this ques- tion.

Solution: Implementations will vary

Solution: Implementations will vary

c) (4pts) Now, we will compare the performance of these ensemble classifiers using get median performance(). Measure the median performance across 50 random splits of the digits dataset into training (80%) and

test (20%) sets, and plot the returned performance for each algorithm over the range of m values

specified in the skeleton code (use the plot data() function). See Figure 1 for an example of

the plot (yours may differ due to randomness, but the general trend should be the same). Include your generated plot in you solution. How does the average test performance of the two methods compare as we vary m (the size of the randomly selected subset of features)?

Figure 1: Plot of accuracy of random forest, bagging ensemble methods for m = 1, …, 64

Solution: See the plot in Figure 1. Random forest initially does better than bagging: the best performance for random forest is achieved around m = 15, and steadily declines as m increases. This makes intuitive sense: as m increases, it lets each tree choose more of the same features, because the feature choice is less random. Thus, each tree becomes less independent, and the performance decreases. When, in the limit, random forest uses all available features, it is no different from bagging, which is why random forest tends towards bagging in this plot.

d) (4pt) Suppose we change the number of classifiers used for the Random forest method to n = 1. We get the following plots for training and testing accuracy. Explain the differences between the train and test performances for n = 10 and n = 1.

Figure 2: Plot of random forest train and test accuracy for m = 1, 2, …, 30

The n = 1 case over fits to the training data which causes it to perform poorly on the test data. The n = 10 case on the other hand doesn’t so it has relatively the same performance between both the training and test data.

REMEMBER to submit your completed assignment to Gradescope by 10:00pm ET on Wednes- day, Feb. 16. Assign all pages (including pages containing code) to their respective questions. Incorrectly assigned pages will not be graded.

程序代写 CS代考 加微信: cscodehelp QQ: 2235208643 Email: kyit630461@163.com