# Data 100 Principles and Techniques of Data Science

Summer 2019

Final

INSTRUCTIONS

• You have 2 hours and 50 minutes to complete the exam.

• The exam is closed book, closed notes, closed computer, closed calculator, except for the provided midterm reference sheet and up to two 8.5″ × 11″ sheets of notes of your own creation.

• There are 12 pages on this exam and a total of 110 points possible.

• Write your name at the top of each sheet of the exam.

• Mark your answers on the exam itself. We will not grade answers written on scratch paper.

Last name

First name

Student ID number

CalCentral email (_@berkeley.edu)

Name of the person to your left

Name of the person to your right

All the work on this exam is my own.

(please sign)

Terminology and Notation Reference:

| Term | Definition |
|---|---|
| exp(x) | $e^x$ |
| log(x) | $\log_e x$ |
| Linear regression model | $f_\theta(x) = x \cdot \theta = \theta \cdot x$ |
| Logistic (or sigmoid) function | $\sigma(t) = \frac{1}{1 + \exp(-t)}$ |
| Logistic regression model | $f_\theta(x) = P(Y = 1 \mid X) = \sigma(\theta \cdot x)$ |
| Squared error loss function | $l(x, y, \theta) = (y - f_\theta(x))^2$ |
| Absolute error loss function | $l(x, y, \theta) = \lvert y - f_\theta(x) \rvert$ |
| Cross-entropy loss function | $l(x, y, \theta) = -y \log f_\theta(x) - (1 - y) \log(1 - f_\theta(x))$ |
| Bias | $\mathrm{Bias}(\hat\theta, \theta^*) = E[\hat\theta] - \theta^*$ |
| Variance | $\mathrm{Var}(\hat\theta) = E[(\hat\theta - E[\hat\theta])^2]$ |


This page is intentionally left blank, but feel free to use it as scratch paper.

1. (16 points) Flights

Fill in both the Python code and the SQL query to produce each result below, assuming that the following tables are stored both as Pandas DataFrames and SQLite tables. Only the first few rows are shown for each table. The acc table contains one row per user account registered at an airline company’s website. The fly table contains one row per flight that the airline offers. The fact table contains one row per flight booked by a user—its uid and fid columns are foreign keys to the corresponding columns in acc and fly. Some accounts have never booked a flight, and some flights have never been booked by a user.

Note: On this problem, some blanks should be filled in using more than one keyword or expression.

(a) (4 pt) Find the names of all users that did not book any flights.

Python: acc.loc[~acc[______________].isin(____________________________________________), __________] SQL: SELECT __________ FROM __________ WHERE __________ NOT IN (SELECT __________ FROM ____________);
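As an illustration (a sketch, not an official answer key), one plausible filling of the part (a) blanks can be checked on small copies of the acc and fact tables from this problem:

```python
import pandas as pd

# Small copies of the acc and fact tables shown in this problem.
acc = pd.DataFrame({'uid': [1, 4, 3, 6],
                    'name': ['Sam', 'Leo', 'Steph', 'Manana']})
fact = pd.DataFrame({'uid': [4, 4, 3, 2],
                     'fid': [1, 2, 4, 1]})

# Names of users that never booked a flight: keep the accounts whose
# uid never appears in the fact table's uid column.
no_bookings = acc.loc[~acc['uid'].isin(fact['uid']), 'name']
```

The SQL blank would follow the same shape, e.g. `SELECT name FROM acc WHERE uid NOT IN (SELECT uid FROM fact);`.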

(b) (6 pt) Find the average flight price for each unique pair of origin and destination cities.

Python: (fly[['orig', 'dest', 'price']]._________________(_________________________________________)

.______________().reset_index())

SQL: SELECT orig, dest, __________________________________ AS price FROM fly

____________________________________________________________________________________________;
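A sketch of the grouped-average pattern that part (b) is testing, run on a toy copy of the fly table (one plausible filling of the blanks, not an official answer):

```python
import pandas as pd

# Toy copy of the fly table from this problem.
fly = pd.DataFrame({'fid': [1, 2, 3, 4],
                    'orig': ['LA', 'LA', 'SF', 'SD'],
                    'dest': ['SF', 'SF', 'MN', 'NY'],
                    'price': [110, 90, 240, 370]})

# Average flight price for each unique (orig, dest) pair.
avg_price = (fly[['orig', 'dest', 'price']]
             .groupby(['orig', 'dest'])
             .mean()
             .reset_index())
```

The SQL blank would mirror this with `AVG(price)` and `GROUP BY orig, dest`.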

(c) (6 pt) Find all unique origin and destination city pairs that can be reached in a sequence of two consecutive flights. For example, LA → MN can be reached using fid 1 and 3. The result should have two columns: the first column holds origin cities (e.g. LA) and the second holds destination cities (e.g. MN).

Hint: The suffixes=(1, 2) argument to pd.merge appends a 1 to column labels in the first table and a 2 to column labels in the second table if the merged tables share column names.

Hint: In SQL, AS t1 and AS t2 enable retrieving a table’s columns using the aliases t1 and t2. For example, t1.fid gets the fid column of the table referenced by t1.

Python: m = pd.merge(__________, __________, left_on=__________, right_on=__________, suffixes=(1, 2)) m.loc[:, ________________________________________________________________].drop_duplicates()

SQL: SELECT DISTINCT _____________________________________________________________________________ FROM __________ AS t1 JOIN __________ AS t2 ON _____________________________________________;
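The two-flight self-join in part (c) can be sketched as follows. Note that current pandas expects string suffixes, so `('1', '2')` is used here where the hint writes `(1, 2)`:

```python
import pandas as pd

fly = pd.DataFrame({'fid': [1, 2, 3, 4],
                    'orig': ['LA', 'LA', 'SF', 'SD'],
                    'dest': ['SF', 'SF', 'MN', 'NY'],
                    'price': [110, 90, 240, 370]})

# Join each flight to every flight that departs from its destination city.
m = pd.merge(fly, fly, left_on='dest', right_on='orig',
             suffixes=('1', '2'))

# Origin of the first leg paired with destination of the second leg.
pairs = m.loc[:, ['orig1', 'dest2']].drop_duplicates()
```

On the toy table this yields the single pair LA → MN mentioned in the prompt.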

acc

| uid | name |
|---|---|
| 1 | Sam |
| 4 | Leo |
| 3 | Steph |
| 6 | Manana |

fact

| uid | fid |
|---|---|
| 4 | 1 |
| 4 | 2 |
| 3 | 4 |
| 2 | 1 |

fly

| fid | orig | dest | price |
|---|---|---|---|
| 1 | LA | SF | 110 |
| 2 | LA | SF | 90 |
| 3 | SF | MN | 240 |
| 4 | SD | NY | 370 |


2. (15 points) Lost Labels

During a data analysis on car attributes, Sam created several plots. However, he has lost the axis labels for all of his plots except for the scatter plot shown on the left. Determine whether the plots below were generated from the same data. If so, mark the axis label that makes each plot consistent with the data in the scatter plot. Assume that the KDE plots use the same bandwidth, the histograms use the same number of bins, and the point plots show the means of two columns and 95% confidence intervals. The axis limits for each plot were automatically chosen to display all plotted marks.

(a) (7 pt) Fill the missing axis labels of the 8 plots above using either age or mpg to make the plots consistent with the labeled scatter plot. For example, the first plot shows the distribution of age, so (a) should be filled in with age. If the plot cannot be generated from the data in either age or mpg, select Neither.

For each of (a), (b), (c), (d), (e.i), (e.ii), (f.i), (f.ii), (g.i), (g.ii), (h.i), and (h.ii), select one of: age, mpg, or Neither.

(b) (8 pt) After conducting PCA, Sam projected each point onto the two principal component axes. He stored the projections onto the first and second principal components in the columns pc1 and pc2, respectively. As in the previous part, fill in each of the missing axis labels using either pc1 or pc2 if the plots were generated using the points projected onto the first or second principal component, or select Neither.

For each of (a), (b), (c), (d), (e.i), (e.ii), (f.i), (f.ii), (g.i), (g.ii), (h.i), and (h.ii), select one of: pc1, pc2, or Neither.


3. (5 points) Parking Problems

A parking lot on campus has a $10 parking fee per day. You find out that every morning, a police officer flips three fair coins. If all three coins land heads, the officer will go to the parking lot and give a $64 parking ticket to all cars that did not pay the parking fee.

Let X be a random variable for the dollar amount you will pay on a particular day if you decide to never pay the parking fee. Note that all fractions shown in this problem are fully simplified.

(a) (2 pt) What is E(X)?

⃝ 0 ⃝ 1 ⃝ 5 ⃝ 8 ⃝ 24 ⃝ 32 ⃝ Other

(b) (2 pt) What is Var(X)?

⃝ 0 ⃝ 1 ⃝ 7 ⃝ 64 ⃝ 84 ⃝ 448 ⃝ 512 ⃝ Other

(c) (1 pt) Based on the calculations in (a) and (b), which parking strategy will save you the most money in the long run?

⃝ Never pay the parking fee.
⃝ Always pay the parking fee.
⃝ Flip a fair coin and pay the parking fee only if the coin lands heads.

4. (5 points) Derive It

To estimate a population parameter from a sample (x1, . . . , xn), we select the following empirical risk:

$$L(\theta) = -\frac{1}{n} \sum_{i=1}^{n} \log\left(\theta e^{-\theta x_i}\right)$$

Find the estimator θˆ that minimizes the empirical risk. Show all your work within the space provided below and draw a box around your final answer.
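Returning to the parking problem: X takes the value 64 with probability (1/2)³ = 1/8 and 0 otherwise, so its moments can be sanity-checked with exact arithmetic (a check sketch, not the official solution):

```python
from fractions import Fraction

p = Fraction(1, 8)   # probability that all three fair coins land heads
ticket = 64          # dollar amount of the parking ticket

# X = 64 with probability 1/8, and X = 0 otherwise.
e_x = p * ticket                   # E[X]
var_x = p * ticket**2 - e_x**2     # Var(X) = E[X^2] - E[X]^2
```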

5. (2 points) Modeling

Shade in the box for all the models that are appropriate for the modeling problems described below.

| Modeling problem | Linear Regression | Logistic Regression | Random Forest |
|---|---|---|---|
| Predict day of the week from the number of shoppers at a store. | ☐ | ☐ | ☐ |
| Predict total revenue today for a store from weather forecast of either sunny or rainy. | ☐ | ☐ | ☐ |
| Predict number of apples sold from number of chickens sold. | ☐ | ☐ | ☐ |
| Predict fastest checkout line (1, 2, . . . , 8) from number of people in each line. | ☐ | ☐ | ☐ |

6. (23 points) Grocery Associations

Every week, Manana goes to her local grocery store and buys a varying amount of vegetables but always buys exactly one pound of meat. We use a linear regression model to predict her total grocery bill. We’ve collected a dataset containing the pounds of vegetables bought, the type of meat bought (either beef, chicken, or fish), and the total bill. Below we display the first few rows of the dataset and two plots generated using the entire dataset.

(a) (8 pt) Suppose we fit the following linear regression models to predict total. Based on the data and visualizations shown above, determine whether the fitted model weights are positive (+), negative (−), or exactly 0. The notation meat=beef refers to the one-hot encoded meat column with value 1 if the original value in the meat column was beef and 0 otherwise.

| veg | meat | total |
|---|---|---|
| 1 | beef | 13 |
| 3 | fish | 19 |
| 2 | beef | 16 |
| 0 | chicken | 9 |

| Model | Weight | + | − | 0 | Not enough info |
|---|---|---|---|---|---|
| fθ(x) = θ0 | θ0 | ⃝ | ⃝ | ⃝ | ⃝ |
| fθ(x) = θ0 + θ1 · veg² | θ0 | ⃝ | ⃝ | ⃝ | ⃝ |
| | θ1 | ⃝ | ⃝ | ⃝ | ⃝ |
| fθ(x) = θ0 + θ1 · (meat=beef) + θ2 · (meat=chicken) | θ0 | ⃝ | ⃝ | ⃝ | ⃝ |
| | θ1 | ⃝ | ⃝ | ⃝ | ⃝ |
| | θ2 | ⃝ | ⃝ | ⃝ | ⃝ |
| fθ(x) = θ0 + θ1 · (meat=beef) + θ2 · (meat=chicken) + θ3 · (meat=fish) | θ0 | ⃝ | ⃝ | ⃝ | ⃝ |
| | θ1 | ⃝ | ⃝ | ⃝ | ⃝ |
| | θ2 | ⃝ | ⃝ | ⃝ | ⃝ |
| | θ3 | ⃝ | ⃝ | ⃝ | ⃝ |


The data and plots from the previous page are reproduced here for convenience:

Suppose we fit the model: fθ(x) = θ0 + θ1 · veg + θ2 · (meat=beef) + θ3 · (meat=fish). After fitting, we find that θˆ = [−3, 5, 8, 12]. Calculate:

(b) (1 pt) The prediction of this model on the first point in our dataset.

-3 2 5 10 13 22 25

(c) (2 pt) The loss of this model on the second point in our dataset using squared error loss.

0 1 5 6 8 24 25 169

(d) (2 pt) The loss on the third point in our dataset using absolute loss with L1 regularization and λ = 1.

0 1 15 24 26 27 28 29
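The arithmetic behind parts (b) through (d) can be checked with a short script. Whether the intercept is included in the L1 penalty is a convention; the version sketched below penalizes all four weights:

```python
import numpy as np

# Fitted weights [intercept, veg, meat=beef, meat=fish] from the problem.
theta = np.array([-3, 5, 8, 12])

# First three rows of the dataset, encoded as [1, veg, meat=beef, meat=fish].
X = np.array([[1, 1, 1, 0],    # veg=1, beef, total=13
              [1, 3, 0, 1],    # veg=3, fish, total=19
              [1, 2, 1, 0]])   # veg=2, beef, total=16
y = np.array([13, 19, 16])

preds = X @ theta                                  # model predictions

squared_loss_2nd = (y[1] - preds[1]) ** 2          # part (c)
lam = 1
abs_l1_loss_3rd = (abs(y[2] - preds[2])
                   + lam * np.abs(theta).sum())    # part (d), penalty on all weights
```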

| veg | meat | total |
|---|---|---|
| 1 | beef | 13 |
| 3 | fish | 19 |
| 2 | beef | 16 |
| 0 | chicken | 9 |

(e) (4 pt) Determine how each change below affects model bias and variance compared to the model described at the top of this page. Shade in all the boxes that apply.

| Change | Increase bias | Decrease bias | Increase variance | Decrease variance |
|---|---|---|---|---|
| Add degree 3 polynomial features | ☐ | ☐ | ☐ | ☐ |
| Add a feature of random numbers between 0 and 1 | ☐ | ☐ | ☐ | ☐ |
| Collect 100 more sample points | ☐ | ☐ | ☐ | ☐ |
| Remove the veg column | ☐ | ☐ | ☐ | ☐ |

(f) (4 pt) Suppose we predict total from veg using 8 models with different degree polynomial features (degrees 0 through 7). Which of the following plots display the training and validation errors of these models? Assume that we plot the degree of polynomial features on the x-axis, mean squared error loss on the y-axis, and the plots share y-axis limits.

Plot A Plot B Plot C Plot D

Training error: A B C D Validation error: A B C D

(g) (2 pt) Suppose that we fit 8 degree-4 polynomial models using ridge regression and the x-axis for the plots in the previous part show λ instead of polynomial degree. Which plots show the training and validation errors of the models?

Training error: A B C D Validation error: A B C D


7. (20 points) Logistic Regression

(a) (8 pt) Suppose we use the following regression model with a single model weight θ and loss function:

$$f_\theta(x) = \sigma(\theta - 2)$$

$$l(\theta, x, y) = -y \log f_\theta(x) - (1 - y) \log(1 - f_\theta(x)) + \frac{1}{2}\theta^2$$

Derive the stochastic gradient descent update rule for this model and loss function, assuming that the learning rate α = 1. Your answer may only use the following variables: θ(t+1), θ(t), y, and the sigmoid function σ. Show all your work within the space provided and draw a box around your final answer.
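As a sketch of how the chain-rule computation can proceed (an illustration, not the official solution), writing σ for σ(θ − 2) and using σ′ = σ(1 − σ):

```latex
\frac{\partial l}{\partial \theta}
  = -y\,(1 - \sigma) + (1 - y)\,\sigma + \theta
  = \sigma(\theta - 2) - y + \theta,
\qquad
\theta^{(t+1)} = \theta^{(t)} - \bigl(\sigma(\theta^{(t)} - 2) - y + \theta^{(t)}\bigr)
              = y - \sigma(\theta^{(t)} - 2).
```

The final form uses only the variables permitted by the problem statement.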


(b) (4 pt) Recall that in lecture we derived the following batch gradient descent (BGD) and stochastic gradient descent (SGD) update rules for logistic regression with L2 regularization. These expressions use the same notation used in class, where X is the (n × p) design matrix, Xi is a vector containing the values in the i’th row of X, y is the length-n vector of outcomes, yi is a single outcome (either 0 or 1), and θ is a vector containing the model weights.

$$\text{BGD:} \quad \theta^{(t+1)} = (1 - 2\lambda)\theta^{(t)} + \alpha \frac{1}{n} \sum_{i=1}^{n} \left(y_i - \sigma(X_i \cdot \theta^{(t)})\right) X_i$$

$$\text{SGD:} \quad \theta^{(t+1)} = (1 - 2\lambda)\theta^{(t)} + \alpha \left(y_i - \sigma(X_i \cdot \theta^{(t)})\right) X_i$$

What are the dimensions of the expressions below?

| Expression | Scalar | Length-n vector | Length-p vector | (n × p) matrix |
|---|---|---|---|---|
| $y_i - \sigma(X_i \cdot \theta^{(t)})$ | ⃝ | ⃝ | ⃝ | ⃝ |
| $(y_i - \sigma(X_i \cdot \theta^{(t)}))\,X_i$ | ⃝ | ⃝ | ⃝ | ⃝ |
| $\frac{1}{n}\sum_{i=1}^{n} (y_i - \sigma(X_i \cdot \theta^{(t)}))\,X_i$ | ⃝ | ⃝ | ⃝ | ⃝ |
| $X_i \cdot \theta^{(t)}$ | ⃝ | ⃝ | ⃝ | ⃝ |

(c) (8 pt) Suppose we use SGD to fit a logistic regression model with L2 regularization with one model weight and no intercept term. Combinations of θ(t), Xi, yi, λ, and α are listed below. Complete each combination such that θ(t+1) will be the most positive after one iteration of SGD among the choices provided. The notation λ ∈ R means that λ is a fixed, unknown real number (which can be either positive, zero, or negative). If more than one choice produces the most positive value, select Tie. If you need more information on unknown variables to solve the problem, select Need Info.

(c.i) θ(t) = 1, Xi = 5, yi = 0, λ = 1. Choose α: ⃝ −1 ⃝ 0 ⃝ 1 ⃝ Tie ⃝ Need Info
(c.ii) θ(t) = −1, Xi = 5, λ ∈ R, α = 1. Choose yi: ⃝ 0 ⃝ 1 ⃝ Tie ⃝ Need Info
(c.iii) θ(t) ∈ R, Xi = 0, λ ∈ R, α = 1. Choose yi: ⃝ 0 ⃝ 1 ⃝ Tie ⃝ Need Info
(c.iv) θ(t) = 0, Xi ∈ R, λ ∈ R, α = 1. Choose yi: ⃝ 0 ⃝ 1 ⃝ Tie ⃝ Need Info


8. (5 points) Classifiers

(a) (2 pt) Suppose we fit three classifiers which produce the following confusion matrices (shown as figures in the original exam):

Which model has the highest precision? ⃝ Model A ⃝ Model B ⃝ Model C
Which model has the highest recall? ⃝ Model A ⃝ Model B ⃝ Model C

(b) (3 pt) Suppose we fit three more classifiers and plot the ROC curves for each classifier on the test set. The test set contains 100 points: the first 50 points are labeled 0 and the second 50 points are labeled 1. Determine which model produces each ROC curve (ROC A, ROC B, or ROC C):

- Predicts P(Y = 1|X) using a random number between 0 and 1. ⃝ A ⃝ B ⃝ C
- Assigns P(Y = 1|X) = 0.3 to the first 50 points and P(Y = 1|X) = 0.4 to the second 50 points. ⃝ A ⃝ B ⃝ C
- Assigns P(Y = 1|X) = 0.8 to the first 50 points and P(Y = 1|X) = 0.6 to the second 50 points. ⃝ A ⃝ B ⃝ C

9. (12 points) If a Forest Falls…

(a) (8 pt) Suppose we fit decision trees of varying depths to predict y using x1 and x2. For this question, a decision tree with depth 0 is a tree with no splitting (all points in a single node).

| x1 | x2 | y |
|---|---|---|
| S | 1 | 0 |
| S | 2 | 1 |
| M | 3 | 0 |
| M | 4 | 1 |
| S | 1 | 0 |
| S | 2 | 1 |
| M | 3 | 0 |
| M | 4 | 1 |

What is the:

| | 0 | 0.5 | 1 | 2 |
|---|---|---|---|---|
| Lowest possible entropy of a node in a fitted tree with depth 0? | ⃝ | ⃝ | ⃝ | ⃝ |
| Lowest possible entropy of a node in a fitted tree with depth 1? | ⃝ | ⃝ | ⃝ | ⃝ |
| Lowest possible entropy of a node in a fitted tree with depth 2? | ⃝ | ⃝ | ⃝ | ⃝ |
| Depth of a fitted decision tree with no depth limit? | ⃝ | ⃝ | ⃝ | ⃝ |
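Taking entropy to mean base-2 Shannon entropy (an assumption about the lecture's convention), a short script illustrates the depth-0 case and one possible depth-1 split on this dataset:

```python
import math
from collections import Counter

def entropy(labels):
    """Base-2 Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

# The x2 and y columns from the table above.
x2 = [1, 2, 3, 4, 1, 2, 3, 4]
y = [0, 1, 0, 1, 0, 1, 0, 1]

root = entropy(y)  # depth 0: every point sits in a single node

# One possible depth-1 split: x2 <= 1 sends only the points with x2 = 1
# (all labeled 0) to the left child, which is therefore pure.
left = [label for val, label in zip(x2, y) if val <= 1]
left_entropy = entropy(left)
```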


(b) (4 pt) Select true or false for each statement about the bootstrap.

| Statement | T | F |
|---|---|---|
| Increasing the number of bootstrap resamples increases the model bias of a random forest. | ⃝ | ⃝ |
| Increasing the number of bootstrap resamples causes a confidence interval for the mean of a population to decrease in width. | ⃝ | ⃝ |
| Increasing the number of bootstrap resamples does not change the center of a sampling distribution. | ⃝ | ⃝ |
| After fitting any regression model, we can bootstrap the test set to create a confidence interval for the population error of a model. | ⃝ | ⃝ |

10. (7 points) Thinking in Parallel

For this question, assume that the following functions are written in Python to compute the average of a column in the DataFrame df and fit a decision tree to a design matrix. Assume that all other code outside these functions runs instantly, and that there is no limit on the number of functions that can be run in parallel unless otherwise stated.

```python
@ray.remote
def avg(col):
    ...  # Takes 1 second to run
```

(a) (1 pt) If df contains 10 columns, how many seconds will it take to compute the average for all columns

in df if we call avg serially?

1 2 10 20

(b) (2 pt) If df contains 10 columns, how many seconds will it take to run the following code?

```python
vals = []
for col in df.columns:
    vals.append(avg.remote(col))
ray.get(vals)
```

1 2 10 20

(c) (2 pt) How many seconds will it take to run the following code?

```python
frst = []
for _ in range(10):
    frst.append(fit_tree.remote(df))
ray.get(frst)
```

1 2 10 20

```python
@ray.remote
def fit_tree(df):
    ...  # Takes 2 seconds to run
```

(d) (2 pt) How many seconds will it take to run the code in part (c) if we can only run a maximum of four functions in parallel at a time?

1 2 3 4 5 6
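The scheduling arithmetic behind these parts can be summed up as: n identical t-second tasks with at most k running at once finish in ceil(n/k) · t seconds (assuming, as the problem does, no overhead beyond the stated per-task runtimes):

```python
import math

def wall_clock_seconds(n_tasks, task_seconds, max_parallel):
    """Total time when identical tasks run in batches of at most max_parallel."""
    return math.ceil(n_tasks / max_parallel) * task_seconds

serial_avgs = wall_clock_seconds(10, 1, 1)       # ten 1-second avg calls, one at a time
parallel_trees = wall_clock_seconds(10, 2, 10)   # ten 2-second fits, all at once
capped_trees = wall_clock_seconds(10, 2, 4)      # ten 2-second fits, four at a time
```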


11. (0 points) Optional: Draw a Picture About Berkeley Data Science (or use this page for scratch work)