# Data 100 & 200A Principles and Techniques of Data Science

Spring 2019

Midterm 2

INSTRUCTIONS

• You have 70 minutes to complete the exam.

• The exam is closed book, closed notes, closed computer, closed calculator, except for two 8.5″ × 11″ crib sheets of your own creation.

• Mark your answers on the exam itself. We will not grade answers written on scratch paper.

Last name

First name

Student ID number

CalCentral email (_@berkeley.edu)

Exam room

Name of the person to your left

Name of the person to your right

All the work on this exam is my own.

(please sign)

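Terminology and Notation Reference:

| Term | Definition |
| --- | --- |
| $\exp(x)$ | $e^x$ |
| $\log(x)$ | $\log_e x$ |
| Linear regression model | $E[Y \mid X] = X^T \beta$ |
| Logistic (or sigmoid) function | $\sigma(t) = \frac{1}{1 + \exp(-t)}$ |
| Logistic regression model | $P(Y = 1 \mid X) = \sigma(X^T \beta)$ |
| Squared error loss | $L(y, \theta) = (y - \theta)^2$ |
| Absolute error loss | $L(y, \theta) = \lvert y - \theta \rvert$ |
| Cross-entropy loss | $L(y, \theta) = -y \log \theta - (1 - y) \log(1 - \theta)$ |
| Bias | $\text{Bias}[\hat{\theta}, \theta] = E[\hat{\theta}] - \theta$ |
| Variance | $\text{Var}[\hat{\theta}] = E[(\hat{\theta} - E[\hat{\theta}])^2]$ |
| Mean squared error | $\text{MSE}[\hat{\theta}, \theta] = E[(\hat{\theta} - \theta)^2]$ |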
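For readers who prefer code to notation, the reference definitions above can be sketched in Python. This is only an illustration, not part of the exam: the function names are hypothetical, and the expectations in the bias, variance, and MSE definitions are approximated by averages over a finite list of estimator values, which is an assumption made for the sketch.

```python
import math


def sigmoid(t):
    """Logistic function: 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + math.exp(-t))


def squared_error(y, theta):
    """Squared error loss: (y - theta)^2."""
    return (y - theta) ** 2


def absolute_error(y, theta):
    """Absolute error loss: |y - theta|."""
    return abs(y - theta)


def cross_entropy(y, theta):
    """Cross-entropy loss, defined for 0 < theta < 1."""
    return -y * math.log(theta) - (1 - y) * math.log(1 - theta)


# The three functions below replace each expectation E[...] with an average
# over a list of estimator values (an assumption for illustration only).
def bias(estimates, theta):
    return sum(estimates) / len(estimates) - theta


def variance(estimates):
    mean = sum(estimates) / len(estimates)
    return sum((e - mean) ** 2 for e in estimates) / len(estimates)


def mse(estimates, theta):
    return sum((e - theta) ** 2 for e in estimates) / len(estimates)


# Hypothetical estimator values, used to check MSE = Var + Bias^2.
estimates = [1.0, 2.0, 3.0]
theta = 1.0
print(mse(estimates, theta), variance(estimates) + bias(estimates, theta) ** 2)
```

The final line prints the two sides of the decomposition MSE = Variance + Bias², which agree up to floating-point rounding.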


1. (8 points) Feature Engineering

For each dataset depicted below in a scatterplot, fill in the squares next to all of the letters for the vector-valued functions $f$ that would make it possible to choose a column vector $\beta$ such that $y_i = f(x_i)^T \beta$ for all $(x_i, y_i)$ pairs in the dataset. The input to each $f$ is a scalar $x$ shown on the horizontal axis, and the corresponding $y$ value is shown on the vertical axis.

(A) $f(x) = [1 \;\; x]^T$

(B) $f(x) = [x \;\; 2x]^T$

(C) $f(x) = [1 \;\; x \;\; x^2]^T$

(D) $f(x) = [1 \;\; \lvert x \rvert]^T$

(E) None of the above

(i) (2 pt) A B C D E

[Scatterplot (i): y versus x. Figure not recoverable.]

(ii) (2 pt) A B C D E

[Scatterplot (ii): y versus x. Figure not recoverable.]

(iii) (2 pt) A B C D E

[Scatterplot (iii): y versus x. Figure not recoverable.]

(iv) (2 pt) A B C D E

[Scatterplot (iv): y versus x. Figure not recoverable.]
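The condition in this question, that $y_i = f(x_i)^T \beta$ holds for every pair, can be checked mechanically for a given feature function and candidate $\beta$. The sketch below is only an illustration: the data are hypothetical (generated from $y = 2\lvert x \rvert - 1$, not taken from any of the exam's plots), and the helper name `fits_exactly` is an assumption.

```python
def f_A(x):
    """Feature function (A): [1, x]."""
    return [1.0, x]


def f_D(x):
    """Feature function (D): [1, |x|]."""
    return [1.0, abs(x)]


def fits_exactly(f, beta, data, tol=1e-9):
    """Return True if y equals f(x)^T beta for every (x, y) pair in data."""
    return all(
        abs(sum(fj * bj for fj, bj in zip(f(x), beta)) - y) <= tol
        for x, y in data
    )


# Hypothetical data generated from y = 2|x| - 1 (not one of the exam's plots).
data = [(x, 2 * abs(x) - 1) for x in [-1, 0, 1, 2, 3]]

print(fits_exactly(f_D, [-1.0, 2.0], data))  # True
print(fits_exactly(f_A, [-1.0, 2.0], data))  # False
```

Note that a failed check for one particular $\beta$ does not by itself show that no $\beta$ works for that feature function; ruling a feature function out requires an argument about all possible $\beta$.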

Name:

2. (6 points) Estimation

A learning set $(x_1, y_1), \ldots, (x_{10}, y_{10})$ is sampled from a population where $X$ and $Y$ are both binary. The learning set data are summarized by the following table of row counts:

| x | y | Count |
| --- | --- | --- |
| 0 | 0 | 2 |
| 0 | 1 | 3 |
| 1 | 0 | 1 |
| 1 | 1 | 4 |

(a) (4 pt) You decide to fit a constant model $P(Y = 1 \mid X = 0) = P(Y = 1 \mid X = 1) = \alpha$ using the cross-entropy loss function and no regularization. What is the formula for the empirical risk on this learning set for this model and loss function? What estimate of the model parameter $\alpha$ minimizes empirical risk? You must show your work for finding the estimate $\hat{\alpha}$ to receive full credit.

Recall: Since $Y$ is binary, $P(Y = 0 \mid X) + P(Y = 1 \mid X) = 1$ for any $X$.

Empirical Risk:

Estimate $\hat{\alpha}$ (show your work):

(b) (2 pt) The true population probability $P(Y = 0 \mid X = 0)$ is $\frac{1}{3}$. Provide an expression in terms of $\hat{\alpha}$ for the bias of the estimator of $P(Y = 0 \mid X = 0)$ described in part (a) for the constant model. You may use $E[\ldots]$ in your answer to denote an expectation under the data generating distribution of the learning set, but do not write $P(\ldots)$ in your answer.

$\text{Bias}[\hat{P}(Y = 0 \mid X = 0), P(Y = 0 \mid X = 0)] =$


3. (6 points) Linear Regression

A learning set of size four is sampled from a population where X and Y are both quantitative:

$(x_1, y_1) = (2.5, 3) \quad (x_2, y_2) = (2, 5) \quad (x_3, y_3) = (1, 3) \quad (x_4, y_4) = (3, 5).$

You fit a linear regression model $E[Y \mid X] = \beta_0 + X \beta_1$, where $\beta_0$ and $\beta_1$ are scalar parameters, by ridge regression, minimizing the following objective function:

$$\frac{1}{4} \sum_{i=1}^{4} \bigl(y_i - (\beta_0 + x_i \beta_1)\bigr)^2 + \frac{\beta_0^2 + \beta_1^2}{3}.$$
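As an illustration of how this objective is evaluated, the sketch below computes it for arbitrary candidate values of $\beta_0$ and $\beta_1$ on the four learning set points. It assumes the objective reads as the mean squared error over the four observations plus the penalty $(\beta_0^2 + \beta_1^2)/3$; the function name `ridge_objective` is hypothetical, and the code does not compute the minimizer.

```python
# Learning set from the problem statement.
xs = [2.5, 2.0, 1.0, 3.0]
ys = [3.0, 5.0, 3.0, 5.0]


def ridge_objective(beta0, beta1):
    """Mean squared error on the learning set plus (beta0^2 + beta1^2) / 3."""
    n = len(xs)
    mean_sq_error = sum(
        (y - (beta0 + x * beta1)) ** 2 for x, y in zip(xs, ys)
    ) / n
    penalty = (beta0 ** 2 + beta1 ** 2) / 3
    return mean_sq_error + penalty


# With both parameters at zero the penalty vanishes and the objective is the
# mean of the squared y values: (9 + 25 + 9 + 25) / 4.
print(ridge_objective(0.0, 0.0))  # 17.0
```

Evaluating the function at a few candidate pairs shows the trade-off in the objective: moving the parameters away from zero can reduce the squared error term while increasing the penalty term.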

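(a) (4 pt) Fill in all blanks below to compute the parameter estimates that minimize this regularized empirical risk. (You do not need to compute their values; just fill in the matrices appropriately.)

$$X_n^T = \begin{bmatrix} \underline{\hspace{2em}} & \underline{\hspace{2em}} & \underline{\hspace{2em}} & \underline{\hspace{2em}} \\ 2.5 & 2 & 1 & 3 \end{bmatrix}$$

$$Y_n^T = \begin{bmatrix} \underline{\hspace{2em}} & \underline{\hspace{2em}} & \underline{\hspace{2em}} & \underline{\hspace{2em}} \end{bmatrix}$$

$$\begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \end{bmatrix} = \left( X_n^T X_n + \begin{bmatrix} \underline{\hspace{2em}} & \underline{\hspace{2em}} \\ \underline{\hspace{2em}} & \underline{\hspace{2em}} \end{bmatrix} \right)^{-1} X_n^T Y_n.$$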

(b) (2 pt) Without computing values for $\hat{\beta}_0$ and $\hat{\beta}_1$, write an expression for the squared error loss of the learning set observation $(x_4, y_4)$ in terms of $\hat{\beta}_0$ and $\hat{\beta}_1$ and any relevant numbers. Your solution should not contain any of $\hat{y}_4$, $x_4$, or $y_4$, but instead just numbers and $\hat{\beta}_0$ and $\hat{\beta}_1$.

$L(y_4, \hat{y}_4) =$

4. (8 points) Model Selection

(a) (2 pt) You have a quantitative outcome $Y$ and two quantitative covariates $(X_1, X_2)$. You want to fit a linear regression model for the conditional expected value $E[Y \mid X]$ of the outcome given the covariates, including an intercept. Bubble in the minimum dimension of the parameter vector $\beta$ needed to express this linear regression model.

⃝ 1 ⃝ 2 ⃝ 3 ⃝ 4 ⃝ 5 ⃝ 6 ⃝ 7 ⃝ None of these

(b) (2 pt) You have a quantitative outcome $Y$ and two qualitative covariates $(X_1, X_2)$. $X_1 \in \{a, b, c, d\}$, $X_2 \in \{e, f, g\}$, and there is no ordering to the values for either variable. You want to fit a linear regression model for the conditional expected value $E[Y \mid X]$ of the outcome given the covariates, including an intercept. Bubble in the minimum dimension of the parameter vector $\beta$ needed to express this linear regression model.

⃝ 2 ⃝ 3 ⃝ 4 ⃝ 5 ⃝ 6 ⃝ 7 ⃝ 8 ⃝ 9 ⃝ 10 ⃝ 11 ⃝ 12 ⃝ 13

(c) (2 pt) Bubble all true statements: In ridge regression, when the assumptions of the linear model are satisfied, the larger the shrinkage/penalty parameter,

□ the larger the magnitude of the bias of the estimator of the regression coefficients $\beta$.

□ the smaller the magnitude of the bias of the estimator of the regression coefficients $\beta$.

□ the larger the variance of the estimator of the regression coefficients $\beta$.

□ the smaller the variance of the estimator of the regression coefficients $\beta$.

□ the smaller the true mean squared error of the estimator of the regression coefficients $\beta$.

(d) (2 pt) Bubble all true statements: A good approach for selecting the shrinkage/penalty parameter in LASSO is to:

□ minimize the learning set risk for the squared error (L2) loss function.

□ minimize the learning set risk for the absolute error (L1) loss function.

□ minimize the cross-validated regularized risk for the squared error (L2) loss function.

□ minimize the cross-validated risk for the squared error (L2) loss function.

□ minimize the variance of the estimator of the regression coefficients.


5. (12 points) Logistic Regression

(a) (2 pt) Bubble the expression that describes the odds ratio $\frac{P(Y = 1 \mid X)}{P(Y = 0 \mid X)}$ of a logistic regression model.

Recall: $P(Y = 0 \mid X) + P(Y = 1 \mid X) = 1$ for any $X$.

⃝ $X^T \beta$ ⃝ $-X^T \beta$ ⃝ $\exp(X^T \beta)$ ⃝ $\sigma(X^T \beta)$ ⃝ None of these

(b) (2 pt) Bubble the expression that describes P(Y = 0|X) for a logistic regression model.

⃝ $\sigma(-X^T \beta)$ ⃝ $1 - \log(1 + \exp(X^T \beta))$ ⃝ $1 + \log(1 + \exp(-X^T \beta))$ ⃝ None of these

(c) (2 pt) Bubble all of the following that are typical effects of adding an L1 regularization penalty to the loss function when fitting a logistic regression model with parameter vector $\beta$.

□ The magnitudes of the elements of the estimator of $\beta$ are increased.

□ The magnitudes of the elements of the estimator of $\beta$ are decreased.

□ All elements of the estimator of $\beta$ are non-negative.

□ Some elements of the estimator of $\beta$ are zero.

□ None of the above.

(d) (3 pt) What would be the primary disadvantage of a regularization term of the form $\sum_{j=1}^{J} \beta_j^3$ rather than the more typical ridge penalty $\sum_{j=1}^{J} \beta_j^2$ for logistic regression? Answer in one sentence.

(e) (3 pt) For a logistic regression model $P(Y = 1 \mid X) = \sigma(-2 - 3X)$, where $X$ is a scalar random variable, what values of $x$ would give $P(Y = 0 \mid X = x) \geq \frac{3}{4}$? You must show your work for full credit.