# Data 100 & 200A Principles and Techniques of Data Science

Spring 2019

Midterm 2 Solutions

INSTRUCTIONS

• You have 70 minutes to complete the exam.

• The exam is closed book, closed notes, closed computer, closed calculator, except for two 8.5″ × 11″ crib sheets of your own creation.

• Mark your answers on the exam itself. We will not grade answers written on scratch paper.

Last name

First name

Student ID number

CalCentral email (_@berkeley.edu)

Exam room

Name of the person to your left

Name of the person to your right

All the work on this exam is my own.

(please sign)

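Terminology and Notation Reference:

| Term | Definition |
| --- | --- |
| $\exp(x)$ | $e^x$ |
| $\log(x)$ | $\log_e x$ |
| Linear regression model | $E[Y \mid X] = X^T \beta$ |
| Logistic (or sigmoid) function | $\sigma(t) = \frac{1}{1 + \exp(-t)}$ |
| Logistic regression model | $P(Y = 1 \mid X) = \sigma(X^T \beta)$ |
| Squared error loss | $L(y, \theta) = (y - \theta)^2$ |
| Absolute error loss | $L(y, \theta) = \lvert y - \theta \rvert$ |
| Cross-entropy loss | $L(y, \theta) = -y \log \theta - (1 - y) \log(1 - \theta)$ |
| Bias | $\text{Bias}[\hat\theta, \theta] = E[\hat\theta] - \theta$ |
| Variance | $\text{Var}[\hat\theta] = E[(\hat\theta - E[\hat\theta])^2]$ |
| Mean squared error | $\text{MSE}[\hat\theta, \theta] = E[(\hat\theta - \theta)^2]$ |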

1. (8 points) Feature Engineering

For each dataset depicted below in a scatterplot, fill in the squares next to all of the letters for the vector-valued functions $f$ that would make it possible to choose a column vector $\beta$ such that $y_i = f(x_i)^T \beta$ for all $(x_i, y_i)$ pairs in the dataset. The input to each $f$ is a scalar $x$ shown on the horizontal axis, and the corresponding $y$ value is shown on the vertical axis.

(A) $f(x) = [1 \;\; x]^T$

(B) $f(x) = [x \;\; 2x]^T$

(C) $f(x) = [1 \;\; x \;\; x^2]^T$

(D) $f(x) = [1 \;\; \lvert x \rvert]^T$

(E) None of the above

[Figure: four scatterplots, (i) through (iv), each plotting y against x, with x ranging from −1 to 5.]

(i) (2 pt) * A B * C D E

(ii) (2 pt) A B * C D E

(iii) (2 pt) A B C D * E

(iv) (2 pt) A B C D * E

Notes: In (D), x = −1 and x = 1 must have the same y value, so the V shape cannot be moved horizontally with a linear combination of those features. Dataset (ii) was intended to be parabolic, but the original printed version of the exam had an error in the parabola shape; sorry! Credit was given for C or E.
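The exact-fit condition $y_i = f(x_i)^T \beta$ can be checked numerically with least squares: a feature map works only if the best-fitting $\beta$ leaves no residual. The sketch below is illustrative only. The two datasets (`line` and `shifted_v`) are hypothetical stand-ins, not the data from the exam's scatterplots, and `fits_exactly` is a helper name introduced here.

```python
import numpy as np

def fits_exactly(feature_map, x, y, tol=1e-9):
    """Return True if some beta gives y_i = f(x_i)^T beta for every point."""
    F = np.array([feature_map(xi) for xi in x], dtype=float)
    beta, *_ = np.linalg.lstsq(F, y, rcond=None)  # best beta in the least-squares sense
    return bool(np.max(np.abs(F @ beta - y)) < tol)

# Feature maps (A) through (D) from the question.
feature_maps = {
    "A": lambda x: [1.0, x],
    "B": lambda x: [x, 2.0 * x],
    "C": lambda x: [1.0, x, x ** 2],
    "D": lambda x: [1.0, abs(x)],
}

x = np.linspace(-1.0, 5.0, 13)
line = 2.0 * x - 3.0         # hypothetical line with a nonzero intercept
shifted_v = np.abs(x - 2.0)  # hypothetical V shape with its corner at x = 2

line_fits = {k: fits_exactly(f, x, line) for k, f in feature_maps.items()}
v_fits = {k: fits_exactly(f, x, shifted_v) for k, f in feature_maps.items()}
print(line_fits)
print(v_fits)
```

For the hypothetical line, only (A) and (C) succeed; for the shifted V, none of the four maps succeeds, which mirrors the reasoning in the note about (D).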

2. (6 points) Estimation

A learning set $(x_1, y_1), \dots, (x_{10}, y_{10})$ is sampled from a population where $X$ and $Y$ are both binary. The learning set data are summarized by the following table of row counts:

| x | y | Count |
| --- | --- | --- |
| 0 | 0 | 2 |
| 0 | 1 | 3 |
| 1 | 0 | 1 |
| 1 | 1 | 4 |

(a) (4 pt) You decide to fit a constant model $P(Y = 1 \mid X = 0) = P(Y = 1 \mid X = 1) = \alpha$ using the cross-entropy loss function and no regularization. What is the formula for the empirical risk on this learning set for this model and loss function? What estimate of the model parameter $\alpha$ minimizes empirical risk? You must show your work for finding the estimate $\hat\alpha$ to receive full credit.

Recall: Since $Y$ is binary, $P(Y = 0 \mid X) + P(Y = 1 \mid X) = 1$ for any $X$.

Empirical Risk: $-\frac{7}{10} \log \alpha - \frac{3}{10} \log(1 - \alpha)$

Estimate $\hat\alpha$ (show your work): set the derivative of the empirical risk with respect to $\alpha$ to zero and solve.

$$0 = \frac{7}{10\alpha} - \frac{3}{10(1 - \alpha)}$$

$$0 = 7(1 - \alpha) - 3\alpha$$

$$10\alpha = 7$$

$$\hat\alpha = \frac{7}{10}$$
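As a numerical cross-check of this derivation, the following sketch rebuilds the ten outcomes from the table of counts and minimizes the empirical cross-entropy risk over a grid of candidate values. The grid search is an illustrative choice made here, not part of the exam solution.

```python
import numpy as np

# Row counts from the table: (x, y) -> count.
counts = {(0, 0): 2, (0, 1): 3, (1, 0): 1, (1, 1): 4}
y = np.array([yv for (_, yv), c in counts.items() for _ in range(c)])

def empirical_risk(alpha):
    """Average cross-entropy loss of the constant model P(Y = 1 | X) = alpha."""
    return float(np.mean(-y * np.log(alpha) - (1 - y) * np.log(1 - alpha)))

grid = np.linspace(0.001, 0.999, 999)  # candidate alphas in steps of 0.001
risks = np.array([empirical_risk(a) for a in grid])
alpha_hat = float(grid[np.argmin(risks)])

print(alpha_hat, y.mean())
```

The grid minimizer coincides with the sample proportion of $y = 1$, which is the closed-form answer $7/10$.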

(b) (2 pt) The true population probability $P(Y = 0 \mid X = 0)$ is $\frac{1}{3}$. Provide an expression in terms of $\hat\alpha$ for the bias of the estimator of $P(Y = 0 \mid X = 0)$ described in part (a) for the constant model. You may use $E[\dots]$ in your answer to denote an expectation under the data generating distribution of the learning set, but do not write $P(\dots)$ in your answer.

$\text{Bias}[\hat P(Y = 0 \mid X = 0), P(Y = 0 \mid X = 0)] = E[1 - \hat\alpha] - \frac{1}{3}$, or equivalently $\frac{2}{3} - E[\hat\alpha]$.

Note: the value for $\hat\alpha$ computed in part (a) is just for this particular learning set, which is just one sample among many possible samples. We don't know from this one dataset that $E[\hat\alpha] = \frac{7}{10}$. Bias does not describe a particular estimate from a particular dataset, but instead refers to the average of estimates obtained from repeated random sampling from the population, i.e., the average of $\hat\alpha$ from multiple learning sets.
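The distinction between one estimate and the average over repeated learning sets can be illustrated by simulation. The sketch below assumes a hypothetical population: only $P(Y = 0 \mid X = 0) = \frac{1}{3}$ comes from the exam, while `p_x1` and the second entry of `p_y1_given_x` are invented for illustration, so the resulting bias value applies to this made-up population only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical population. Only P(Y = 0 | X = 0) = 1/3 is given in the exam.
p_x1 = 0.5                             # assumed P(X = 1)
p_y1_given_x = np.array([2 / 3, 0.8])  # P(Y = 1 | X = 0), assumed P(Y = 1 | X = 1)

n, reps = 10, 20000
x = rng.random((reps, n)) < p_x1                         # True means X = 1
y = rng.random((reps, n)) < p_y1_given_x[x.astype(int)]  # True means Y = 1
alpha_hats = y.mean(axis=1)  # constant-model estimate for each learning set

estimates = 1 - alpha_hats   # estimates of P(Y = 0 | X = 0)
simulated_bias = float(estimates.mean() - 1 / 3)

# Exact bias for this population: 2/3 - E[alpha_hat], where E[alpha_hat] = P(Y = 1).
expected_alpha = p_x1 * p_y1_given_x[1] + (1 - p_x1) * p_y1_given_x[0]
exact_bias = 2 / 3 - expected_alpha

print(simulated_bias, exact_bias)
```

Each row of `y` plays the role of one learning set of size 10; the bias is a property of the whole collection of estimates, not of any single row.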

3. (6 points) Linear Regression

A learning set of size four is sampled from a population where X and Y are both quantitative:

(x1, y1) = (2.5, 3) (x2, y2) = (2, 5) (x3, y3) = (1, 3) (x4, y4) = (3, 5).

You fit a linear regression model $E[Y \mid X] = \beta_0 + X\beta_1$, where $\beta_0$ and $\beta_1$ are scalar parameters, by ridge regression, minimizing the following objective function:

$$\frac{1}{4} \sum_{i=1}^{4} \left(y_i - (\beta_0 + x_i \beta_1)\right)^2 + \frac{\beta_0^2 + \beta_1^2}{3}.$$

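(a) (4 pt) Fill in all blanks below to compute the parameter estimates that minimize this regularized empirical risk. (You do not need to compute their values; just fill in the matrices appropriately.)

$$X_n^T = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 2.5 & 2 & 1 & 3 \end{bmatrix}$$

$$Y_n^T = \begin{bmatrix} 3 & 5 & 3 & 5 \end{bmatrix}$$

$$\begin{bmatrix} \hat\beta_0 \\ \hat\beta_1 \end{bmatrix} = \left( X_n^T X_n + \begin{bmatrix} \frac{4}{3} & 0 \\ 0 & \frac{4}{3} \end{bmatrix} \right)^{-1} X_n^T Y_n$$

Note: the common answer of $\frac{1}{3}$ on the diagonal of the regularization term was given full credit.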
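Although the exam does not ask for numerical values, the closed form can be evaluated directly as a sanity check. This sketch uses the four learning set points from the question and confirms that the gradient of the regularized objective vanishes at the solution; the variable names are chosen here for illustration.

```python
import numpy as np

# Design matrix with an intercept column, and the outcome vector.
X = np.array([[1.0, 2.5], [1.0, 2.0], [1.0, 1.0], [1.0, 3.0]])
Y = np.array([3.0, 5.0, 3.0, 5.0])

# Minimizing (1/4)||Y - X b||^2 + (1/3)||b||^2 gives (X^T X + (4/3) I) b = X^T Y.
penalty = (4.0 / 3.0) * np.eye(2)
beta_hat = np.linalg.solve(X.T @ X + penalty, X.T @ Y)

def objective(beta):
    """Regularized empirical risk from the question."""
    return float(np.mean((Y - X @ beta) ** 2) + np.sum(beta ** 2) / 3.0)

# Gradient of the objective, which should vanish at the minimizer.
gradient = -0.5 * X.T @ (Y - X @ beta_hat) + (2.0 / 3.0) * beta_hat

print(beta_hat, gradient, objective(beta_hat))
```

Using `np.linalg.solve` rather than forming the inverse explicitly is a common numerical choice; both give the same estimate for this small system.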

(b) (2 pt) Without computing values for $\hat\beta_0$ and $\hat\beta_1$, write an expression for the squared error loss of the learning set observation $(x_4, y_4)$ in terms of $\hat\beta_0$ and $\hat\beta_1$ and any relevant numbers. Your solution should not contain any of $\hat y_4$, $x_4$, or $y_4$, but instead just numbers and $\hat\beta_0$ and $\hat\beta_1$.

$$L(y_4, \hat y_4) = \left(5 - (\hat\beta_0 + 3\hat\beta_1)\right)^2$$

4. (8 points) Model Selection

(a) (2 pt) You have a quantitative outcome Y and two quantitative covariates (X1, X2). You want to fit a linear regression model for the conditional expected value E[Y | X] of the outcome given the covariates, including an intercept. Bubble in the minimum dimension of the parameter vector β needed to express this linear regression model.

⃝ 1 ⃝ 2 ⃝ * 3 ⃝ 4 ⃝ 5 ⃝ 6 ⃝ 7 ⃝ None of these

Both quantitative features and the intercept are needed.

(b) (2 pt) You have a quantitative outcome Y and two qualitative covariates (X1, X2). X1 ∈ {a, b, c, d}, X2 ∈ {e, f, g}, and there is no ordering to the values for either variable. You want to fit a linear regression model for the conditional expected value E[Y | X] of the outcome given the covariates, including an intercept. Bubble in the minimum dimension of the parameter vector β needed to express this linear regression model.

⃝ 2 ⃝ 3 ⃝ 4 ⃝ 5 ⃝ * 6 ⃝ 7 ⃝ 8 ⃝ 9 ⃝ 10 ⃝ 11 ⃝ 12 ⃝ 13

Each categorical variable with k outcomes requires k − 1 features to encode, because an additional feature would be a linear combination of the others and the intercept feature. (4 − 1) + (3 − 1) + 1 = 6.
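The redundancy argument can be seen in the rank of the design matrix. The sketch below builds a hypothetical dataset containing every (X1, X2) combination once and compares a full one-hot encoding with one that drops a level from each variable; `one_hot` is a helper written here for illustration.

```python
import itertools
import numpy as np

levels_1 = ["a", "b", "c", "d"]
levels_2 = ["e", "f", "g"]
rows = list(itertools.product(levels_1, levels_2))  # hypothetical data: each combination once

def one_hot(values, levels, drop_first):
    """Indicator columns for each level, optionally omitting the first level."""
    kept = levels[1:] if drop_first else levels
    return np.array([[float(v == lev) for lev in kept] for v in values])

x1 = [r[0] for r in rows]
x2 = [r[1] for r in rows]
intercept = np.ones((len(rows), 1))

full = np.hstack([intercept, one_hot(x1, levels_1, False), one_hot(x2, levels_2, False)])
reduced = np.hstack([intercept, one_hot(x1, levels_1, True), one_hot(x2, levels_2, True)])

print(full.shape[1], np.linalg.matrix_rank(full))
print(reduced.shape[1], np.linalg.matrix_rank(reduced))
```

The full encoding has 8 columns but only rank 6, while the reduced encoding has 6 columns and rank 6, so six parameters suffice to express the same model.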

(c) (2 pt) Bubble all true statements: In ridge regression, when the assumptions of the linear model are satisfied, the larger the shrinkage/penalty parameter,

the larger the magnitude of the bias of the estimator of the regression coefficients β.

the smaller the magnitude of the bias of the estimator of the regression coefficients β.

the larger the variance of the estimator of the regression coefficients β.

the smaller the variance of the estimator of the regression coefficients β.

the smaller the true mean squared error of the estimator of the regression coefficients β.
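For a fixed design and a correctly specified linear model $Y = X\beta + \varepsilon$ with noise variance $\sigma^2$, the bias and covariance of the ridge estimator $(X^T X + \lambda I)^{-1} X^T Y$ have closed forms, so the effect of the penalty can be computed exactly. The sketch below is an illustration under assumed values: the design `X`, the true `beta`, and `sigma2` are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))       # hypothetical fixed design
beta = np.array([1.0, -2.0, 0.5])  # hypothetical true coefficients
sigma2 = 1.0                       # hypothetical noise variance

def ridge_bias_and_variance(lam):
    """Norm of the exact bias and trace of the exact covariance for penalty lam."""
    gram = X.T @ X
    inv = np.linalg.inv(gram + lam * np.eye(3))
    bias = inv @ gram @ beta - beta  # E[beta_hat] - beta
    cov = sigma2 * inv @ gram @ inv  # Var[beta_hat]
    return float(np.linalg.norm(bias)), float(np.trace(cov))

lams = [0.0, 1.0, 10.0, 100.0]
results = [ridge_bias_and_variance(lam) for lam in lams]
bias_norms = [b for b, _ in results]
total_vars = [v for _, v in results]

for lam, b, v in zip(lams, bias_norms, total_vars):
    print(lam, b, v)
```

In this setup the bias magnitude grows and the total variance shrinks as the penalty increases, while the mean squared error (squared bias plus variance) depends on how the two trade off.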

(d) (2 pt) Bubble all true statements: A good approach for selecting the shrinkage/penalty parameter in LASSO is to:

minimize the learning set risk for the squared error (L2) loss function.

minimize the learning set risk for the absolute error (L1) loss function.

minimize the cross-validated regularized risk for the squared error (L2) loss function.

* minimize the cross-validated risk for the squared error (L2) loss function.

minimize the variance of the estimator of the regression coefficients.

The cross-validated L2 risk is a good unbiased estimator for the L2 risk (average L2 loss) on unseen data, which is the quantity we care to minimize in the end. The L1 norm of the regression coefficients in LASSO is a regularization/penalty term that appears only for the purpose of estimating β on the training set.
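The selection procedure can be sketched end to end: fit LASSO with the penalty on each training fold, then score each candidate penalty by plain squared error on the held-out fold. Everything below is an illustrative assumption rather than the course's implementation: the data are simulated, there is no intercept, and `lasso_fit` is a minimal coordinate descent solver written for this example.

```python
import numpy as np

def soft_threshold(value, threshold):
    return np.sign(value) * max(abs(value) - threshold, 0.0)

def lasso_fit(X, y, lam, sweeps=200):
    """Coordinate descent for (1/2n)||y - X b||^2 + lam * ||b||_1 (no intercept)."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(sweeps):
        for j in range(p):
            partial_resid = y - X @ b + X[:, j] * b[j]  # residual ignoring feature j
            rho = X[:, j] @ partial_resid / n
            b[j] = soft_threshold(rho, lam) / col_sq[j]
    return b

def mse(X, y, b):
    return float(np.mean((y - X @ b) ** 2))

def cv_risk(X, y, lam, k=5):
    """Average held-out squared error over k folds; the penalty is used only for fitting."""
    folds = np.array_split(np.arange(len(y)), k)
    risks = []
    for held_out in folds:
        train = np.setdiff1d(np.arange(len(y)), held_out)
        b = lasso_fit(X[train], y[train], lam)
        risks.append(mse(X[held_out], y[held_out], b))
    return float(np.mean(risks))

# Simulated data: only the first two of eight features matter.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 8))
true_b = np.array([3.0, -2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ true_b + rng.normal(size=60)

lams = [0.0, 0.01, 0.03, 0.1, 0.3, 1.0, 10.0]
train_risks = [mse(X, y, lasso_fit(X, y, lam)) for lam in lams]
cv_risks = [cv_risk(X, y, lam) for lam in lams]
best_lam = lams[int(np.argmin(cv_risks))]

print(train_risks)
print(cv_risks)
print(best_lam)
```

The learning set risk is smallest with no penalty at all, which is why it cannot be used to choose the penalty, whereas the cross-validated risk penalizes both overfitting and excessive shrinkage.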


5. (12 points) Logistic Regression

(a) (2 pt) Bubble the expression that describes the odds ratio $\frac{P(Y = 1 \mid X)}{P(Y = 0 \mid X)}$ of a logistic regression model.

Recall: $P(Y = 0 \mid X) + P(Y = 1 \mid X) = 1$ for any $X$.

⃝ $X^T \beta$ ⃝ $-X^T \beta$ ⃝ * $\exp(X^T \beta)$ ⃝ $\sigma(X^T \beta)$ ⃝ None of these

(b) (2 pt) Bubble the expression that describes $P(Y = 0 \mid X)$ for a logistic regression model.

⃝ * $\sigma(-X^T \beta)$ ⃝ $1 - \log(1 + \exp(X^T \beta))$ ⃝ $1 + \log(1 + \exp(-X^T \beta))$ ⃝ None of these
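Both identities in parts (a) and (b) follow from the definition of the sigmoid and can be checked numerically. In this sketch, `t` stands for a handful of hypothetical values of $X^T \beta$.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

t = np.linspace(-4.0, 4.0, 9)  # hypothetical values of X^T beta
p1 = sigmoid(t)                # P(Y = 1 | X)
p0 = 1.0 - p1                  # P(Y = 0 | X)
odds = p1 / p0

print(np.allclose(odds, np.exp(t)))   # odds equal exp(X^T beta)
print(np.allclose(p0, sigmoid(-t)))   # P(Y = 0 | X) equals sigma(-X^T beta)
```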

(c) (2 pt) Bubble all of the following that are typical effects of adding an L1 regularization penalty to the loss function when fitting a logistic regression model with parameter vector β.

The magnitude of the elements of the estimator of β are increased.

The magnitude of the elements of the estimator of β are decreased.

All elements of the estimator of β are non-negative.

Some elements of the estimator of β are zero.

None of the above.

Note: The first two options were not specific enough, and so the credit for that part of the question was given to all answers. The total magnitude of the estimated β will decrease with an L1 penalty, but some individual elements of β may stay constant or increase.

(d) (3 pt) What would be the primary disadvantage of a regularization term of the form $\sum_{j=1}^{J} \beta_j^3$ rather than the more typical ridge penalty $\sum_{j=1}^{J} \beta_j^2$ for logistic regression? Answer in one sentence.

The minimum of $\beta^3$ is attained at $\beta = -\infty$, so minimizing empirical risk would always result in a degenerate solution.
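The degenerate behavior is easy to observe numerically: the cross-entropy risk grows at most linearly as a coefficient moves toward $-\infty$, while the cubic penalty falls cubically. The sketch below uses a hypothetical one-feature dataset and a hypothetical penalty weight `lam`, and compares the cubic penalty with the ridge penalty.

```python
import numpy as np

x = np.array([-2.0, -1.0, 0.5, 1.0, 2.0])  # hypothetical feature values
y = np.array([0.0, 0.0, 1.0, 1.0, 1.0])    # hypothetical labels

def cross_entropy_risk(beta):
    """Average cross-entropy loss for P(Y = 1 | x) = sigmoid(x * beta)."""
    z = x * beta
    # log(1 + exp(z)) - y * z, computed stably with logaddexp.
    return float(np.mean(np.logaddexp(0.0, z) - y * z))

def cubic_objective(beta, lam=0.1):
    return cross_entropy_risk(beta) + lam * beta ** 3

def ridge_objective(beta, lam=0.1):
    return cross_entropy_risk(beta) + lam * beta ** 2

betas = [-1.0, -10.0, -100.0, -1000.0]
cubic_values = [cubic_objective(b) for b in betas]
ridge_values = [ridge_objective(b) for b in betas]

print(cubic_values)
print(ridge_values)
```

The cubic objective keeps decreasing as the coefficient becomes more negative, so no finite minimizer exists, whereas the ridge objective increases along the same path.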

(e) (3 pt) For a logistic regression model $P(Y = 1 \mid X) = \sigma(-2 - 3X)$, where $X$ is a scalar random variable, what values of $x$ would give $P(Y = 0 \mid X = x) \ge \frac{3}{4}$? You must show your work for full credit.

$$P(Y = 0 \mid X = x) \ge \frac{3}{4}$$

$$1 - P(Y = 1 \mid X = x) \ge \frac{3}{4}$$

$$P(Y = 1 \mid X = x) \le \frac{1}{4}$$

$$\frac{1}{1 + \exp(2 + 3x)} \le \frac{1}{4}$$

$$1 + \exp(2 + 3x) \ge 4$$

$$2 + 3x \ge \log 3$$

$$x \ge \frac{\log 3 - 2}{3}$$
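The boundary value can be verified by plugging it back into the model. This short sketch evaluates $P(Y = 0 \mid X = x)$ at the threshold and slightly to either side; the offset of 0.1 is an arbitrary choice for illustration.

```python
import math

def prob_y0(x):
    """P(Y = 0 | X = x) for the model P(Y = 1 | X) = sigmoid(-2 - 3X)."""
    return 1.0 - 1.0 / (1.0 + math.exp(2.0 + 3.0 * x))

threshold = (math.log(3.0) - 2.0) / 3.0

at_threshold = prob_y0(threshold)
above = prob_y0(threshold + 0.1)
below = prob_y0(threshold - 0.1)

print(threshold, at_threshold, above, below)
```

At the threshold the probability equals $\frac{3}{4}$, and because $P(Y = 0 \mid X = x)$ increases with $x$ in this model, every larger $x$ satisfies the inequality.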