# CS代考程序代写 case study database file system SQL python Name: Email: Student ID:

Name: Email: Student ID:

@berkeley.edu

DS-100 Final Exam Spring 2018

Instructions:

• This final exam must be completed in the 3 hour time period ending at 11:00AM,

unless you have accommodations supported by a DSP letter.

• Note all questions on this exam are single choice only.

• Please put your student id at the top of each page to ensure that pages are not lost during scanning.

• When selecting your choices, you must fully shade in the circle. Check marks will likely be mis-graded.

• You may use a two-sheet (two-sided) study guide.

• Work quickly through each question. There are a total of 199 points on this exam.

Honor Code:

As a member of the UC Berkeley community, I act with honesty, integrity, and respect for others. I am the person whose name is on the exam and I completed this exam in accordance with the honor code.

Signature:

1

DS100 Final, Page 2 of 29, SID:

May 10th, 2018

Syntax Reference Regular Expressions

“ˆ” matches the position at the beginning of string (unless used for negation “[ˆ]”)

“$” matches the position at the end of string character.

“?” match preceding literal or sub-expression 0 or 1 times. When following “+” or “*” results in non-greedy matching.

“+” match preceding literal or sub-expression one or more times.

“*” match preceding literal or sub-expression zero or more times

“.” match any character except new line. Some useful Python functions and syntax

re.findall(pattern, st) return the list of all sub-strings in st that match pattern.

Useful Pandas Syntax

“[

]” match any one of the characters inside, accepts a range, e.g., “[a-c]”.

df.loc[row_selection, col_list] # row selection can be boolean df.iloc[row_selection, col_list] # row selection can be boolean df.groupby(group_columns)[[’colA’, ’colB’]].sum()

pd.merge(df1, df2, on=’hi’) # Merge df1 and df2 on the ’hi’ column

Variance and Expected Value

The expected value of X is E [X ] = mj =1 xj pj . The variance of X is Var [X ] = E [(X − E [X ])2 ] = E [X2] − E [X]2. The standard deviation of X is SD [X] = Var [X].

“( )” used to create a sub-expression

“{n}” precedingexpressionrepeatedntimes.

“d” match any digit character. “D” is the complement.

“w” match any word character (letters, dig- its, underscore). “W” is the comple- ment.

“s” match any whitespace character includ- ing tabs and newlines. S is the comple- ment.

“” match boundary between words

np.random.choice(n, replace, size)

sample size numbers 0 to n with replacement.

DS100 Final, Page 3 of 29, SID: May 10th, 2018

Problem Formulation

1. [3 Pts] In 1936, the Literary Digest ran a poll to predict the outcome of the Presidential elec- tion. They constructed a sample of over 10 million individuals by aggregating lists of

– magazine subscribers

– registered automobile owners

– telephone records

and received responses from about 2.4 million individuals from this 10 million sample.

(a) What kind of sample is this?

⃝ Convenience Sample ⃝ SRS ⃝ Stratified Sample ⃝ Census

(b) Which is likely a more serious concern for the Literary Digest estimate of the proportion

of voters who support FDR? ⃝ Bias ⃝ Variance

(c) Including more registered magazine subscribers would more likely have helped reduce ⃝ Bias ⃝ Variance

2. [5 Pts] Which kind of statistical problem is associated with each of the following tasks?

(a) Filtering emails according to whether they are spam.

⃝ Estimation ⃝ Prediction ⃝ Causal Inference

(b) Determining whether a new feature will improve a website’s revenue from an A/B test.

⃝ Estimation ⃝ Prediction ⃝ Causal Inference

(c) Investigating whether perceived gender has any effect on student teaching evaluations.

⃝ Estimation ⃝ Prediction ⃝ Causal Inference

(d) Building a recommendation system from historical ratings to serve personalized content.

⃝ Estimation ⃝ Prediction ⃝ Causal Inference

(e) Determining the growth rate of yeast cells in a petri dish.

⃝ Estimation ⃝ Prediction ⃝ Causal Inference

DS100 Final, Page 4 of 29, SID: May 10th, 2018

3. Suppose we observe a sample of n runners from a larger population, and we record their race times X1, . . . , Xn. We want to estimate the maximum race time θ∗ in the population. When comparing estimates, we prefer whichever is closer to θ∗ without going over. We consider the following three estimators based on our sample:

12

θ1 =maxXi i

θ2 = 1 Xi

i

θ3 = max Xi + 1 i

(a) [2 Pts] θ is never an over estimate but could be an underestimate of θ∗. 1

⃝ True ⃝ False

(b) [2 Pts] θ is never a worse estimate of θ∗ than θ .

n

⃝ True ⃝ False

(c) [2 Pts] θ is never a worse estimate of θ∗ than θ .

⃝ True ⃝ False

(d) [3 Pts] Which loss l(θ, θ ) best reflects our goal of “closest without going over”? (where

31

∗

θ represents any θi, i = 1, 2, 3)

∗∗2

⃝ l(θ,θ )=(θ−θ )

∗

⃝ l(θ,θ )=

∗∗ θ−θ, ifθ≤θ

∞, otherwise ∗∗

∗∗ θ−θ, ifθ≤θ

0, otherwise

⃝ l(θ,θ )=|θ−θ |

∗

⃝ l(θ,θ )=

DS100 Final, Page 5 of 29, SID: May 10th, 2018

Data Collection and Cleaning

4. For each scenario below, mark the sampling technique used.

(a) [1 Pt] A researcher wants to study the diet of California Residents. The researcher collects a dataset by asking her family members.

⃝ SRS ⃝ Stratified Sample ⃝ Cluster Sample ⃝ Convenience Sample

(b) [1 Pt] Bay Area Rapid Transit (BART) wants to survey its customers one day, so they

randomly select 5 trains and survey all of the customers on these trains

⃝ SRS ⃝ Stratified Sample ⃝ Cluster Sample ⃝ Convenience Sample

(c) [1 Pt] In order to survey drivers in a certain city, the police set up checkpoints at ran- domly selected road locations, then inspected every driver at those locations.

⃝ SRS ⃝ Stratified Sample ⃝ Cluster Sample ⃝ Convenience Sample

(d) [1 Pt] To study how different student organizations perceive campus issues, a professor

surveyed 3 students at random from each student organization.

⃝ SRS ⃝ Stratified Sample ⃝ Cluster Sample ⃝ Convenience Sample

5. [1 Pt] The date 01/01/1970 is typically associated with which data anomaly: ⃝ Outliers ⃝ Missing Values ⃝ Leap Years ⃝ Roundoff Error

6. [1 Pt] When would it be safe to drop records with missing values?

⃝ When less than ten percent of the records have missing values.

⃝ When the missing value occurs in a field that is not being studied. ⃝ When the missing value implies that the entire record is corrupted. ⃝ When the missing values are encoded using 999.

7. [1 Pt] When loading a comma delimited file which of the following is a parsing concern? ⃝ Unquoted tab characters in strings

⃝ Unquoted newline characters in strings ⃝ Dates with negative values

⃝ Capitalization

DS100 Final, Page 6 of 29, SID: May 10th, 2018

SQL

Consider the following database schema used to track doctor visits to animals in a zoo (note: all the questions in this SQL section are based on this schema):

CREATE TABLE animals ( aid INT PRIMARY KEY, animal_type TEXT, name TEXT,

age INTEGER, color TEXT);

CREATE TABLE visits (

vid INT PRIMARY KEY,

aid INT REFERENCES animals(aid), did INT REFERENCES doctors(did));

CREATE TABLE doctors ( did INT,

name TEXT,

PRIMARY KEY (did));

Each row of the animals table describes a distinct animal. Each row of the doctors table describes a distinct doctor at the zoo. Each row of the visits table describes a distinct visit of an animal to a doctor. The entire dataset is contained in the following tables:

aid animal type

0 rabbit

1 bear

2 bear

3 cat

vid aid did

132 1 0 145 2 1 167 0 3 168 2 3 169 2 2

(c) visits

name age

Bugs 2 Air 5 Care 1 Grumpy 75

(a) animals

color did name

white 0 Turk golden 1 House golden 2 Dre gray 3 Bailey

(b) doctors

8. Mark each statement below as True

(a) [1 Pt] An animal who visits the same doctor multiple times is recorded under the same

doctor id (did) in the visits table.

⃝ True ⃝ False

(b) [2 Pts] More than one animal could have the same animal type, age and color

⃝ True ⃝ False

(c) [2 Pts] Each aid in the visits table must be present in the animals table.

⃝ True ⃝ False

or False.

DS100 Final, Page 7 of 29, SID: May 10th, 2018 9. [4 Pts] What does the following query compute?

SELECT animal_type, color, AVG(age) AS avg_age FROM animals

GROUP BY animal_type, color;

⃝ The average age of each type of animal.

⃝ The average age of the animals for each type of color.

⃝ The average age for each combination of animal type and color. ⃝ This query throws an error.

10. [4 Pts] Which of the following SQL queries computes the names of the animals in our zoo who visited the doctor more than 2 times?

⃝ SELECT name FROM animals, visits WHERE COUNT(*) > 2

GROUP BY animals.aid, animals.name;

⃝ SELECT name FROM animals, visits GROUP BY animals.aid, animals.name HAVING COUNT(*) > 2;

⃝ SELECT name FROM animals, visits

WHERE visits.aid = animals.aid AND COUNT(*) > 2 GROUP BY animals.aid, animals.name

⃝ SELECT name FROM animals, visits WHERE visits.aid = animals.aid GROUP BY animals.aid, animals.name HAVING COUNT(*) > 2;

11. [3 Pts] When run on above data what does the following compute:

SELECT animal_type

FROM animals JOIN visits ON animals.aid = visits.aid GROUP BY animal_type

ORDER BY COUNT(*) DESC

LIMIT 1;

⃝ rabbit ⃝ bear ⃝ 3 ⃝ The above query is invalid.

DS100 Final, Page 8 of 29, SID: May 10th, 2018

Pandas

12. Using each of the zoo tables (from the SQL questions) as Pandas dataframes:

aid animal type

0 rabbit

1 bear

2 bear

3 cat

vid aid did

132 1 0 145 2 1 167 0 3 168 2 3 169 2 2

(c) visits

name age

Bugs 2 Air 5 Care 1 Grumpy 75

(a) animals

color did name

white 0 Turk golden 1 House golden 2 Dre gray 3 Bailey

(b) doctors

Evaluate each of the following Python expressions:

(a) [2 Pts] (doctors[doctors[’name’] == ’Dre’][[’did’]] .merge(visits)

.merge(animals)[’name’][0])

⃝ ’Air’ ⃝ ’Care’ ⃝ ’Grumpy’ ⃝ None

(b) [3 Pts] len(animals.merge(visits, on=’aid’) .groupby(’color’)[[’age’]].mean())

⃝0⃝1⃝2⃝3

(c) [3 Pts] list(visits.groupby(’did’) .filter(lambda g: len(g) > 1)

.merge(animals, on=’aid’)[’name’])

⃝ [’Bugs’] ⃝ [’Care’] ⃝ [’Air’] ⃝ [’Bugs’, ’Care’]

(d) [3 Pts] list(animals.merge(visits, how=’outer’) .groupby([’aid’])

.filter(lambda g: len(g[’vid’].dropna()) == 0) [’name’])

⃝ [’Bugs’] ⃝ [’Care’] ⃝ [’Grumpy’] ⃝ [’Air’]

DS100 Final, Page 9 of 29, SID: May 10th, 2018

13. The following is the output of the command taxi df.head() run on the taxi df dataframe containing taxi trips in NYC. You may assume that there are no missing values in the dataframe and the duration is measured in seconds.

vendor id

start timestamp

2016-06-08 07:36:19 2016-04-03 12:58:11 2016-06-05 02:49:13 2016-05-05 17:18:27 2016-05-12 17:43:38

passenger count duration

1 1040 1 827 5 614 2 867 4 4967

id

0 2

1 2

2 2

3 2

4 1

(a) [1 Pt] Which of the following lines returns a dataframe containing only the rides with a duration less than 3 hours.

⃝ taxi_df[taxi_df[’duration’] < 3]
⃝ taxi_df.set_index(’duration’) < 3
⃝ taxi_df[’duration’] < 3 * 60 * 60
⃝ taxi_df[taxi_df[’duration’] < 3 * 60 * 60]
(b) [4 Pts] We would like to know the average and duration and for each passenger count for each vendor id, excluding (vendor id, passenger count) pairs for which
we have less than 10 records. Which of the following returns this dataframe.
⃝ (taxi_df.groupby([’vendor_id’, ’passenger_count’]) .filter(lambda x: x.shape[0] >= 10) .groupby([’vendor_id’, ’passenger_count’]) .agg({’duration’: ’mean’}))

⃝ (taxi_df.groupby([’passenger_count’]) .filter(lambda x: x.shape[0] >= 10) .groupby([’vendor_id’, ’passenger_count’]) .agg({’duration’: ’mean’}))

⃝ (taxi_df.groupby([’vendor_id’, ’passenger_count’]) .filter(lambda x: x.shape[0] < 10) .groupby([’vendor_id’, ’passenger_count’]) .agg({’duration’: ’mean’}))
⃝ (taxi_df.groupby([’vendor_id’, ’passenger_count’]) .filter(lambda x: x.shape[1] >= 10) .groupby([’vendor_id’, ’passenger_count’]) .mean())

DS100 Final, Page 10 of 29, SID: May 10th, 2018

Big Data

14. We want to store a big file on a distributed file system by splitting it into smaller fragments. Assuming the file divides evenly into 800 fragments and we use 4-way replication answer the following questions.

(a) [2 Pts] If the distributed file system contains 8 separate nodes, how many fragments of the file will be stored on each node?

⃝100 ⃝400 ⃝800 ⃝3200

(b) [2 Pts] What is the maximum number of machines that can fail and still guarantee that

we can read the entire file. ⃝1⃝2⃝3⃝4⃝5

15. Which of the following statements are correct?

(a) [1Pt] Inthestarschema,thedimensiontablecontainstherelationshipsbetweendifferent

facts in the separate fact tables. ⃝ True ⃝ False

(b) [1 Pt] Star schemas can help eliminate update errors by reducing duplication of data. ⃝ True ⃝ False

(c) [1 Pt] During the reduce phase of MapReduce all the records associated with a given key are sent to the same machine.

⃝ True ⃝ False

(d) [1 Pt] Because files are spread across multiple machines, reading a large file from a dis-

tributed file system is usually slower than reading the same large file from a single drive. ⃝ True ⃝ False

(e) [1 Pt] When using MapReduce, we need to have a memory buffer that is big enough to load all the data from disk to memory.

⃝ True ⃝ False

DS100 Final, Page 11 of 29, SID: May 10th, 2018

Regular Expressions

16. [3 Pts] Given that we are using the regular expression r”ta.*c”, which option specifies the starting and ending position of the first match in the string: “tacocat”

⃝ 0-2 ⃝ 0-3 ⃝ 0-5 ⃝ 0-6 ⃝ The string contains no matches.

17. EvaluatethefollowingPythonexpressionsthatusetheremodule.Notes:(1)assumeimport

re was already run; (2) The character “ ” represents a single space: (a) [2 Pts] re.findall(r”d{3}.d{3}.d{4}”,

“123.456.7890 and Fax 800999000″) ⃝ None

⃝ [’123.456.7890’]

⃝ [’800999000’]

⃝ [’123.456.7890’, ’800999000’]

(b) [2 Pts] re.findall(r”{[ˆ}]*}”, “{begin} {and {end}}”) ⃝ None

⃝ [’{begin}’]

⃝ [’{begin}’, ’{end}’]

⃝ [’{begin}’, ’{and {end}’]

(c) [2 Pts] len(re.findall(r”[cat]+|dog”, “cat catch dog attack”))

⃝1⃝2⃝3⃝4⃝5

(d) [2 Pts] re.findall(r”

.*?

“,

“

stuff?

more stuff

“) ⃝ [’

stuff?

’]

⃝ [’

stuff?

more stuff

’]

⃝ [’

stuff?

’, ’

more stuff

’]

⃝ [’

more stuff

’]

DS100 Final, Page 12 of 29, SID: May 10th, 2018

Visualization & EDA

18. Using the following box plots from lecture answer the following questions:

50 40 30 20 10

10 8 6 4 2

6 5 4 3 2 1

Boxplot

total_bill

tip

size

(a) [1 Pt] Which of piece of information is not communicated by these box plots. ⃝ The number of observations in each box.

⃝ Quartiles (lower and upper) and median.

⃝ A simple comparison between distributions across different groups. ⃝ Outliers values for each category.

(b) [1 Pt] Using only the box plot we can tell that the tip distribution appears to be: ⃝ Bimodal ⃝ Unimodal ⃝ Skewed left ⃝ Skewed right

19. [2 Pts] Which of the following changes will most effectively improve the following plot to communicate the relationship between the two variables x and y?

⃝ Change the dot size and/or trans- parency.

⃝ Remove the outliers.

⃝ Display a 1D histogram.

⃝ Change the scale of x and y.

DS100 Final, Page 13 of 29, SID: May 10th, 2018 20. For this question consider the following dataframe (compass df) and data visualization.

born order

0 2nd

1 2nd

2 2nd

3 1st

4 1st

5 1st

delivery perc

Cesarean 0.08 Overall 0.13 Vaginal 0.15 Cesarean 0.16 Overall 0.26 Vaginal 0.35

(a) [1 Pt] Which plotting mistake best characterizes a problem with this plot. ⃝ Use of stacking

⃝ Use of angles to compare magnitudes ⃝ Chart junk

⃝ Overplotting

(b) [3 Pts] The original intent of this plot was to demonstrate that the 2nd born child has a lower risk of HIV-1 infection. Which of the following snippets of code would generate a plot that best illustrates this trend?

⃝ sns.barplot(y=’perc’, x=’delivery’, hue=’born_order’, data=compass_df);

⃝ sns.barplot(y=’perc’, x=’born_order’, hue=’delivery’, data=compass_df);

⃝ sns.boxplot(y=’perc’, x=’delivery’, hue=’born_order’, data=compass_df);

⃝ compass_df.plot(y=’perc’, kind=’pie’);

DS100 Final, Page 14 of 29, SID: May 10th, 2018

21. Suppose we have constructed a dataset about meals at restaurants around Berkeley. The fol- lowing is just a sample of the dataset.

total bill

0 16.99 1 10.34 3 23.68 4 24.59

tip day

1.01 Sun 1.66 Mon 3.31 Wed 3.61 Thu

party size

2 3 2 4

date

2017-01-01 2017-01-02 2017-01-04 2017-01-05

place

Taqueria El Buen Sabor Burger King

Ichiraku Ramen Akatsuki Hideout

For each of the following scenarios, determine which plot type is most appropriate to reveal the distribution of and/or the relationships between the following variable(s).

(a) [1 Pt] The spread of the total bill for each day of the week:

⃝ Bar plot ⃝ Side-by-side boxplots ⃝ Scatter plot ⃝ Contour Plot

(b) [1 Pt] The distribution of the tip field for meals at Taqueria El Buen Sabor: ⃝ Histogram ⃝ Bar plot ⃝ Line plot ⃝ Contour plots

(c) [1 Pt] Average tip for meals on each day from January 2017 to January 2018: ⃝ Histogram ⃝ Line plot ⃝ Side-by-side boxplots ⃝ Contour plots

(d) [1 Pt] Number of meals for each place in 2017:

⃝ Histogram ⃝ Bar plot ⃝ Side-by-side boxplots ⃝ Scatter plot

22. [2Pts] Whatadditionalinformationdoestheviolinplotontherightprovidethatisnotpresent in the boxplot on the left.

Boxplot Violin plot

ABCD ABCD

group group

⃝ The violin plot displays the number of observations, which is hidden in the boxplot.

⃝ The violin plot shows the underlying distribution into each group, e.g. we observe

a bimodal distribution in group B.

⃝ The violin plot shows the number of missing values, which is hidden in the boxplot.

⃝ The violin plot does not provide any additional information.

Value

Value

DS100 Final, Page 15 of 29, SID: May 10th, 2018

Modeling and Estimation

23. [6 Pts] What parameter estimate would minimize the following regularized loss function:

1 n

l(θ)=λ(θ−4)2 +n ⃝θˆ=4+1n xi

(xi −θ)2 (1)

⃝θˆ=1n xi λn i=1

i=1

λn i=1 ⃝θˆ= 1 n xi

n(λ+1) i=1 ⃝θˆ=λ+1 n(xi−4)

λ+1 n(λ+1) i=1 ⃝θˆ=4λ+1 nxi

λ+1 n(λ+1) i=1

You may use the space below for scratch work (not graded, no partial credit).

DS100 Final, Page 16 of 29, SID: May 10th, 2018

24. [8 Pts] Suppose X1, . . . , Xn are random variables with E[Xi] = μ∗ and Var[Xi] = θ∗. Con- sider the following loss function

1 n

l(θ) = log(θ) + nθ

i=1 Let θ denote the minimizer for l(θ). What is E[ θ]?

⃝ θ∗ ⃝ θ∗+μ∗ ⃝ θ∗+μ∗/2 ⃝ E[θ∗+μ∗]

⃝ θ∗+(μ∗)2

Xi2.

⃝ (θ∗+μ∗)2 You may use the space below for scratch work (not graded, no partial credit).

DS100 Final, Page 17 of 29, SID: May 10th, 2018

Regression

25. [2 Pts] Which model would be the most appropriate linear model for the following dataset?

30 20 10

0 10 20

20 15 10 5 0 5 10 15 20

X

⃝ y=θ1x+θ2 ⃝ y=dk=1θkxk ⃝ y=θ1x+θ2sin(x) ⃝ y=θ1x+θ2sin(θ3x) ⃝ Since y is a non-linear function of x, the relationship can’t be expressed by a linear model.

26. [2 Pts] Which of the following depicts models with the largest bias?

12.5 10.0 7.5 5.0 2.5 0.0

12.5 10.0 7.5 5.0 2.5 0.0

12.5 10.0 7.5 5.0 2.5 0.0

2.5 5.0

7.5 10.0

x

⃝⃝⃝

27. [4 Pts] Given a full rank feature matrix Φ ∈ Rnd and response values Y ∈ Rn the following

equation computes what quantity:

ΦT Y −ΦΦTΦ−1 ΦTY (12)

⃝ 0 ⃝ residuals ⃝ squared residuals ⃝ Yˆ ⃝ θˆ ⃝ squared error 28. [2 Pts] Which of the following loss functions is most sensitive to extreme outliers.

⃝ L1-Loss Function ⃝ Squared Loss ⃝ Absolute Loss Function ⃝ Huber Loss

2.5 5.0

7.5 10.0

x

2.5 5.0

7.5 10.0

x

y

y

y

Y

DS100 Final, Page 18 of 29, SID: May 10th, 2018

29. Given a dataset of 100 tweets where each tweet is no longer than 10 words. We apply a bag-of- words featurization to the tweets with a vocabulary of 10, 000 unique words plus an addition bias term. Our goal is to predict the number of retweets from the text of the tweet.

(a) [1 Pt] Because words are nominal this is a classification task. ⃝ True ⃝ False

(b) [2 Pts] The feature (covariate) matrix including bias term has 100 rows and how many columns (including zero valued columns)?

⃝1 ⃝10 ⃝11 ⃝10,000 ⃝10,001 ⃝10,010 ⃝10,011

(c) [2 Pts] Because n > d the solution to the normal equations is not well defined.

⃝ True ⃝ False

(d) [2 Pts] By applying L1 regularization we may be able to identify informative words.

⃝ True ⃝ False

30. In the process of training linear models with different numbers of features you created the

following plot but forgot to include the Y -axis label.

0 2 4 6 8 10 12 Number of Features

(a) [1 Pt] The Y-axis might represent the training error: ⃝ True ⃝ False (b) [1 Pt] The Y-axis might represent the bias: ⃝ True ⃝ False

(c) [1 Pt] The Y-axis might represent the test error: ⃝ True ⃝ False

(d) [1 Pt] The Y-axis might represent the variance. ⃝ True ⃝ False

DS100 Final, Page 19 of 29, SID: May 10th, 2018 31. Consider the following model training script to estimate the training error:

1 X_train, X_test, y_train, y_test =

2 train_test_split(X, y, test_size=0.1)

3

4 model = lm.LinearRegression(fit_intercept=True)

5 model.fit(X_test, y_test)

6

7 y_fitted = model.predict(X_train)

8 y_predicted = model.predict(X_test)

9

10 training_error = rmse(y_fitted, y_predicted)

(a) [3 Pts] Line 5 contains a serious mistake. Assuming our eventual goal is to compute the

training error, which of the following corrects that mistake. ⃝ model.fit(X_train, y_test)

⃝ model.fit(X_train, y_train) ⃝ model.fit(X, y)

(b) [3 Pts] Line 10 contains a serious mistake. Assuming we already have corrected the mistake in Line 5 which of the following corrects the mistake on Line 10.

⃝ training_error = rmse(y_train, y_predicted) ⃝ training_error = rmse(y_train, y_test)

⃝ training_error = rmse(y_fitted, y_test)

⃝ training_error = rmse(y_fitted, y_train)

32. [2 Pts] Which of the following techniques could be used to reduce over-fitting? ⃝ Adding noise to the training data

⃝ Cross-validation to remove features ⃝ Fitting the model on the test split ⃝ Adding features to the training data

DS100 Final, Page 20 of 29, SID: May 10th, 2018

33. Suppose you are given a dataset {(xi, yi)}ni=1 where xi ∈ R is a one dimensional feature and yi ∈ R is a real-valued response. To model this data, you choose a model characterized by the following loss function:

n

L(θ)= 1yi −θ0 −x3iθ12 +λ|θ1| (13)

n i=1

For the following statements, indicate whether it is True or False.

(a) [1 Pt] This model includes a bias/intercept term. ⃝ True ⃝ False

(b) [1 Pt] As λ decreases to smaller values, the model will reduce to a constant θ0 ⃝ True ⃝ False

(c) [1 Pt] Larger λ values help reduce the chances of overfitting. ⃝ True ⃝ False

(d) [1 Pt] Increasing λ decreases model variance. ⃝ True ⃝ False

(e) [1 Pt] The training error should be used to determine the best value for λ. ⃝ True ⃝ False

DS100 Final, Page 21 of 29, SID: May 10th, 2018

Stochastic Gradient Descent (Going Downhill Quickly!)

34. Consider the following broken Python implementation of stochastic gradient descent. 1 def stochastic_grad_descent(

2 3

4 “””

5 X:A2D

6 Y:A1D

7 theta0:

8 grad_function: Maps a parameter vector, a feature matrix,

9 and a response vector to the gradient of some loss

10 function at the given parameter value.

11 batch_size: the number of data points to use in each

12 gradient estimate

13 returns the optimal theta

14 “””

15 theta = theta0

16 ind 17

18 for 19

= np.random.choice(len(Y), replace=False, size=batch_size)

t in range(1, max_iter+1):

(xbatch, ybatch) = (X[ind, :], Y[ind])

grad = grad_function(theta, xbatch, ybatch) theta=theta+t/grad

20

21

22 return theta

X, Y, theta0, grad_function,

max_iter = 10000, batch_size=2):

array, the feature matrix.

array, the response vector.

A 1D array, the initial parameter vector.

(a) [3 Pts] Which of the following best describes the bug in how data are sampled?

⃝ The call to sample on Line 16 should have been with replacement.

⃝ A new random sample of indices should be constructed on each loop iteration.

⃝ Like the bootstrap, each random sample should be the size of the original data set(i.e.,size = len(Y)inLine17)

⃝ The len(Y) on Line 16 should be len(X).

(b) [3 Pts] Assuming that the stochastic gradient grad is computed correctly, what is the

correct implementation of the gradient update on Line 21: ⃝theta = theta – t / grad

⃝theta = theta – t * grad

⃝ theta = theta + 1/t * grad

⃝ theta = theta – 1/t * grad

DS100 (c)

Final, Page 22 of 29, SID: May 10th, 2018

[5 Pts] Suppose we wanted to add L2 regularization with each dimension having a dif- ferent regularization parameter:

d

Rλ(θ) = λkθk2 (14)

k=1

where λ is now a vector of regularization parameters. Which of the following rewrites of

Line 20 would achieve this goal (assuming λ = lam):

⃝ grad = (grad_function(theta, xbatch, ybatch) +

2*theta*lam)

⃝ grad = (grad_function(theta, xbatch, ybatch) + theta.dot(lam))

⃝ grad = (grad_function(theta, xbatch, ybatch) – theta.dot(lam))

⃝ grad = (grad_function(theta, xbatch, ybatch) – 2*theta*lam)

35. Use the following plot to answer each of the following questions about convexity:

f1 f2 f3

x

(a) [1 Pt] f1(x) = max(0.01x, −x) is convex. (b) [1 Pt] f2(x) = −2x is convex. ⃝ True (c) [1 Pt] f3(x) = −x2 is convex. ⃝ True (d) [1 Pt] f4(x) = f1(x) + f2(x) is convex.

⃝ True ⃝ False ⃝ False

⃝ False

⃝ True ⃝ False

y

DS100 Final, Page 23 of 29, SID: May 10th, 2018

Classification

36. [4 Pts] True or False.

(a) A binary (0/1) classifier that always predicts 1 can get 100% precision, and its recall will be the fraction of ones in the training set.

⃝ True ⃝ False

(b) If the training data is linearly separable we expect a logistic regression model to obtain

100% training accuracy. ⃝ True ⃝ False

(c) We should use classification if the response variable is categorical. ⃝ True ⃝ False

(d) A binary classifier that only predicts class 1 may still achieve 99% accuracy on some prediction tasks.

⃝ True ⃝ False

37. [2 Pts] The plot below is a scatter plot of a dataset with two dimensional features and binary labels (e.g., Class 0 and Class 1). Without additional feature transformations, is the this dataset linearly separable?

⃝ Yes. ⃝ No. ⃝ We cannot tell that from this plot.

Class 1 Class 0

X1

X2

DS100 Final, Page 24 of 29, SID: May 10th, 2018 38. [4 Pts] We perform a 4-fold cross validation on 4 different hyper-parameters, the mean square

error are shown in the table below. Which λ should we select?

Fold Num

λ=0.1 λ=0.2 λ=0.3 λ=0.4

Row Max

Row Min

Row Avg

1 2 3 4

80.2 84.1 76.8 77.3 81.5 74.5 79.4 75.2

70.1 91.2 83.3 88.8 81.6 86.5 79.2 85.4

91.2 88.8 86.5 85.4

70.1 76.8 74.5 75.2

83.36 83 82.12 80.92

Col Avg

79.475 77.775 78.55 87.975

⃝ λ=0.1 ⃝ λ=0.2 ⃝ λ=0.3 ⃝ λ=0.4

39. [4 Pts] Answer true or false for each of the following statements about logistic regression:

(a) If no regularization is used and the training data is linearly separable, the optimal model parameters will tend towards positive or negative infinity.

⃝ True ⃝ False

(b) After using L2 regularization, the optimal model parameter will be the mean of the data,

since L2 regularization is similar to the square loss.

⃝ True ⃝ False

(c) L1 regularization can help us select a subset of the features that are important.

⃝ True ⃝ False

(d) After using the regularization, we expect the training accuracy to increase and the test

accuracy to decrease. ⃝ True ⃝ False

40. [2 Pts] Suppose you are given the following dataset {(xi, yi)}ni=1 consisting of x and y pairs where the covariate xi ∈ R and the response yi ∈ {0, 1}.

1.5 1.0 0.5 0.0 0.5

3210123

X

Given this data, the value P (Y = 1 | x = −1) is likely closest to: ⃝ 0.95 ⃝ 0.50 ⃝ 0.05 ⃝ -0.95

Y

DS100 Final, Page 25 of 29, SID: May 10th, 2018

Statistical Inference

41. [3 Pts] True or False.

(a) A 95% confidence interval is wider than a 90% confidence interval.

⃝ True ⃝ False

(b) A p-value of 0.97 says that under the null model, there is a 97% chance of observing a

test statistic at least as extreme as the one calculated from the data. ⃝ True ⃝ False

(c) Suppose we have 100 samples drawn independently from a population. If we construct a separate 95% confidence interval for each sample, 95 of them will include the population mean.

⃝ True ⃝ False

42. A roulette wheel has 2 green slots, 18 red slots, and 18 black slots. Suppose you observe 760

games on a particular wheel and see that the red slot is chosen 380 times.

(a) [2 Pts] What is the expected number of times that the red slot is chosen?

⃝ 760 ⃝ 18 ∗760 ⃝ 1−18∗760 ⃝ 760 2 38 38 18

(b) [2 Pts] You hypothesize that the wheel has been altered so that the frequency of landing on red is higher than by chance. Given the data generation model that landing on red is the outcome of a Bernoulli trail with probability p, which of the following would be the most appropriate null hypothesis?

⃝ p = 0 ⃝ p = 18 ⃝ p = 1 ⃝ p = 1 38 2

(c) [2 Pts] You decide to study this problem by running a simulation of N = 10, 000 repli- cations of a fair roulette wheel constructed as described above. The percentiles for the proportion of red is shown below:

Percentile 2.5% 5% 10% 50% 90% 95% 97.5%

Proportion 0.438 0.445 0.450 0.474 0.497 0.504 0.509

Using the 5% convention for statistical significance, is the null model consistent with your observations? ⃝ Yes ⃝ No ⃝ Not enough information given to answer

DS100 Final, Page 26 of 29, SID: May 10th, 2018

43. Two methods of memorizing words are to be compared. You deterministically pair 1,000 people in the study together, manually matching so that the two people in each pair have very similar education levels and ages. For each of the 500 pairs of people, you randomly assign one person to memorization method 1 and the other to method 2. After a week of training, the number of words recalled in a memory test is recorded. A portion of the data is shown below:

Participant ID Pair ID Memorization Method Age Group

Education High School High School

Words Recalled 25

21

1

2

3 22 26-35 Undergraduate 20 4 21 26-35 Undergraduate 30

11 12

18-25 18-25

… 999 1000

… 500 500

… 1 2

… 55-65 55-65

… Masters Masters

… 29 17

(a) [2Pts] Thenullhypothesisforthisexperimentisthatthereisnodifferenceintheaverage number of words recalled

⃝ across memorization methods. ⃝ across Pair IDs. ⃝ across age groups. ⃝ across education levels.

(b) [2 Pts] Which of the following describes a reasonable test statistic for this experiment? By “ungroup”, we mean remove any lingering effect of the group by operation.

⃝ 1. Group by pair ID.

2. Subtract words recalled for memorization method 2 from method 1. 3. Ungroup.

4. Take the average of the differences from step 2.

⃝ 1. Group by memorization method, age group, education.

2. Take the average of words recalled. Group by age group and education. 3. Subtract words recalled for memorization method 2 from method 1.

4. Ungroup.

5. Take the average of the differences from step 3.

(c) Instead you decide to use permutation test to analyze the data. What permutation is justified by the design of the experiment?

i. [1 Pt] Group by

⃝ Pair ID ⃝ Age Group ⃝ Education

⃝ Words Recalled ⃝ None of the above ii. [1 Pt] and permute the values of

⃝ Pair ID ⃝ Age Group ⃝ Education

⃝ Memorization Method ⃝ None of the above

DS100 Final, Page 27 of 29, SID: May 10th, 2018

Probability

44. There are 32 participants in a randomized clinical trial: 8 are male and 24 are female. 16 are assigned to treatment and the others are put into the control group. What is the probability that none of the men are in the treatment group if:

(a) [3Pts] thetreatmentwasassignedusingstratifiedrandomsampling,groupingbygender? ⃝ 32/32 ⃝ 1 − 8 16 ⃝ 15 24−i ⃝ 0

8 16 32 i=0 32−i

(b) [4 Pts] the treatment was assigned using simple random sampling?

⃝ 1− 8 16 ⃝ 24!16! ⃝ 16 16−i ⃝ 1/32 32 8!32! i=1 i 16

(c) [4 Pts] the treatment was assigned using cluster random sampling of 2 groups of 8 using clusters as described below?

⃝0⃝1⃝1⃝1 268

Cluster

Male

Female

A B C D

0 3 5 0

8 5 3 8

DS100 Final, Page 28 of 29, SID: May 10th, 2018

Ethics

45. [2 Pts] During the guest lecture on ethics, Josh Kroll presented a case study on facial recog- nition. In it, Joy Buolamwini, an MIT PhD student, argues that current facial recognition technology:

⃝ Could lead to issues concerning the right to privacy

⃝ Straddles a gray zone on data usage agreements

⃝ Does not perform well on certain subpopulations

⃝ Creates an asymmetry of power between those who have data and those who do not

46. [2 Pts] PredPol is a model that helps police determine where they should patrol to maximize their likelihood of spotting crimes. As presented in lecture, which of the following best ex- plains why PredPol was consistently suggesting low-income minority neighborhoods as areas that needed more policing?

⃝ Past arrest rates in those neighborhoods are higher than in their higher-income, non- minority counterparts.

⃝ Thecontractedcompanydevelopingthemodelintroducedsystematicbiasesintothe model to advance a political agenda.

⃝ Location data was missing for most of the police reports. Cases that had location data were often in lower-income minority neighborhoods.

⃝ There was a flaw in the data cleaning that led to an aggregation of crime cases into particular low-income minority communities.

DS100 Final, Page 29 of 29, SID: May 10th, 2018

Last Thoughts

This page was intentionally left blank for you! Feel free to use it as scrap paper, draw a picture, write a song, or just tell us how you feel now that the semester is over.