Name: Email: Student ID:
@berkeley.edu
DS100 Final Exam, Version A Fall 2018
Instructions:
• This final exam must be completed in the 170-minute time period ending at 2:30 PM, unless you have accommodations supported by a DSP letter.
• Note that some questions have bubbles to select a choice. This means that you should only select one choice. Other questions have boxes. This means you should select all that apply.
• When selecting your choices, you must fully shade in the box/circle. Check marks will likely be misgraded. We reserve the right to deny regrade requests if an answer choice is not completely filled in.
• Write clearly and legibly when filling in free response questions.
• You may use a two-sheet (each two-sided) study guide.
Honor Code:
As a member of the UC Berkeley community, I act with honesty, integrity, and respect for others. I am the person whose name is on the exam and I completed this exam in accordance with the honor code.
Signature:
DS100 Final, Page 2 of 31 December 13th, 2018
Syntax Reference

Regular Expressions
"|" matches expression on either side of symbol. Has lowest priority.
"\" match the following character literally.
"?" match preceding literal or subexpression 0 or 1 times.
"+" match preceding literal or subexpression one or more times.
"*" match preceding literal or subexpression zero or more times.
"." match any character except new line.
"[ ]" match any one of the characters inside; accepts a range, e.g., "[a-c]". All characters inside treated literally.
"( )" used to create a subexpression.
"{n}" preceding expression repeated n times.

Some useful Python functions and syntax
re.findall(pattern, st) returns the list of all non-overlapping substrings in st that match pattern.
np.random.choice(a, replace, size) generates a random sample from a consisting of size values (with replacement if replace=True). a can be 1-D array-like or int.

Useful Pandas Syntax
df.loc[row_selection, col_list] # row selection can be boolean
df.iloc[row_selection, col_list] # row selection can be boolean
pd.get_dummies(data) # Convert categorical variable into indicator values
df.groupby(group_columns)[['colA', 'colB']].agg(agg_func)
df.groupby(group_columns)[['colA', 'colB']].filter(filter_func)
Variance and Expected Value
The expected value of X is E[X] = Σ_{j=1}^{m} x_j p_j. The variance of X is Var[X] = E[(X − E[X])²] = E[X²] − E[X]². The standard deviation of X is SD[X] = √Var[X].
Misc
For calculations involving percentiles of collections of numbers, we will use the following convention from Data 8: Let p be a number between 0 and 100. The pth percentile is the smallest number in the collection that is at least as large as p% of all the values.
The logistic equation is σ(x) = 1/(1 + exp(−x)) and the KL divergence for two distributions is D(P‖Q) = Σ_{k=1}^{K} P(k) log(P(k)/Q(k)).
Score Breakdown

Page    Points
4       8
5       3
6       6
7       12
8       8
9       12
11      12
12      7
13      9
14      4
15      4
16      9
17      3.5
18      4
19      11.5
20      5
21      11
22      7
23      2
24      8
25      6
26      6
27      11
28      15
30      7
31      9
Total:  200
Tabular Data
1. For this section, we will be working with the UC Berkeley Undergraduate Career Survey dataset. Each year, the UC Berkeley career center surveys graduating seniors for their plans after graduating. Below is a sample of the full dataset. The full dataset contains many thousands of rows.
j_name                  c_name      c_location     m_name
Llama Technician        Google      MOUNTAIN VIEW  EECS
Software Engineer       Salesforce  SF             EECS
Open Source Maintainer  Github      SF             Computer Science
Big Data Engineer       Microsoft   REDMOND        Data Science
Data Analyst            Startup     BERKELEY       Data Science
Analyst Intern          Google      SF             Philosophy

Table 1: survey Table
Each record of the survey table is an entry corresponding to a student. We have the student's major information (m_name), company information (c_name, c_location), and the job title (j_name).
(a) [3 Pts] Write a SQL query that selects all data science major graduates that got jobs in Berkeley. The result generated by your query should include all 4 columns.
______________ FROM survey ____________________________________________ ____________________________________________
(b) [5 Pts] Write a SQL query to find the 5 most popular companies that data science graduates will work at, from most popular to 5th most popular.
SELECT c_name, __________________ as count FROM survey
WHERE _______________ = ’Data Science’ GROUP BY _______________
ORDER BY _______________ LIMIT 5
Solution:
SELECT * FROM survey
WHERE m_name = ’Data Science’
AND c_location = ’Berkeley’
Solution:
SELECT c_name, COUNT(*) AS count
FROM survey
WHERE m_name = ’Data Science’
GROUP BY c_name
ORDER BY count DESC
LIMIT 5;
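For reference, the same queries can be run in pandas; a sketch using a toy frame built from the six rows shown in Table 1 (column values taken from the table, matching its uppercase locations):

```python
import pandas as pd

# Toy stand-in for the survey table, using the rows shown in Table 1.
survey = pd.DataFrame({
    "j_name": ["Llama Technician", "Software Engineer", "Open Source Maintainer",
               "Big Data Engineer", "Data Analyst", "Analyst Intern"],
    "c_name": ["Google", "Salesforce", "Github", "Microsoft", "Startup", "Google"],
    "c_location": ["MOUNTAIN VIEW", "SF", "SF", "REDMOND", "BERKELEY", "SF"],
    "m_name": ["EECS", "EECS", "Computer Science", "Data Science",
               "Data Science", "Philosophy"],
})

# Part (a): all four columns for data science majors with jobs in Berkeley.
part_a = survey[(survey["m_name"] == "Data Science")
                & (survey["c_location"] == "BERKELEY")]

# Part (b): companies hiring data science graduates, most popular first.
part_b = (survey[survey["m_name"] == "Data Science"]
          .groupby("c_name").size()
          .sort_values(ascending=False)
          .head(5))
```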
(c)
[3 Pts] Suppose our table has 9,000 rows, with 3,000 unique job names, 1,700 unique company names, 817 unique locations, and 105 unique major names. The table above has many redundancies. Suppose we wanted to instead use the star schema idea from lecture, where we have one fact table and many dimension tables. How many dimension tables would we end up with? How many rows would there be in our fact table? How many columns would there be in our fact table? There may be more than one correct answer.
i. Number of dimension tables: 4
ii. Number of rows in fact table: 9000
iii. Number of columns in fact table: 4
(d)
[3 Pts] Consider the pandas expression below, where nunique returns the number of unique elements in a group.

survey.groupby('c_name')['m_name'].nunique().max()

What does it return?
⃝ A. One value: The number of unique majors for the company with the most unique majors.
⃝ B. One value: The number of unique companies for the major with the most hires.
⃝ C. Many values: For each company, the count of the number of hires for the most popular major.
⃝ D. Many values: For each major, the count of the number of hires by the most popular company.

(e) [3 Pts] Which of the SQL expressions below is equivalent to the pandas code from above?

⃝ A. SELECT MAX(count)
     FROM (
         SELECT c_name, COUNT(DISTINCT m_name) AS count
         FROM survey
         GROUP BY c_name
     );

⃝ B. SELECT c_name, MAX(COUNT(DISTINCT m_name)) AS count
     FROM survey
     GROUP BY c_name;

⃝ C. SELECT c_name, COUNT(DISTINCT m_name) AS count
     FROM survey
     GROUP BY c_name
     HAVING MAX(count);

⃝ D. SELECT MAX(count)
     FROM (
         SELECT c_name, COUNT(DISTINCT m_name) AS count
         FROM survey
         GROUP BY c_name
     )
     WHERE count >= MAX(count);
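The pandas expression from part (d) can be sanity-checked on a small toy frame (the companies and majors below are hypothetical):

```python
import pandas as pd

# Toy survey: company and major for each hire (hypothetical values).
survey = pd.DataFrame({
    "c_name": ["Google", "Google", "Google", "Github", "Github"],
    "m_name": ["EECS", "Data Science", "EECS", "EECS", "EECS"],
})

# For each company, count distinct majors hired, then take the largest count.
result = survey.groupby("c_name")["m_name"].nunique().max()
# Google hires from 2 distinct majors, Github from 1, so result is 2.
```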
Big Data
2. [6 Pts] The figure above from class shows four distinct file blocks labeled A, B, C, and D spread across four machines, where each machine holds exactly 3 blocks.
(a) For the figure above, at most, how many of our machines can fail without any data loss? 2
(b) Suppose that instead of 4 machines, we have only 3 machines that can store 3 blocks each. Suppose we want to be able to recover our data even if two machines fail. What is the maximum total number of distinct blocks we can store? 3
(c) Same as part b, but now suppose we only need to be able to recover our data if one machine fails. What is the maximum total number of distinct blocks we can store? 4
3. [3 Pts] Suppose we use the map reduce paradigm to compute the total lab scores for each student in DS100. Suppose there are 800 students and 12 labs, and exactly 1 submission per student. Each execution of the map operation is an execution of the autograder, i.e. it will compute the score for a single lab for a single student. The reduce operation computes the total score for each student by adding up all of the lab scores.
(a) How many key-value pairs will be generated in total after all map operations have completed execution? 9600
(b) How many distinct keys will there be? 800
(c) How many final key-value pairs will remain after all reduce operations have completed? 800
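The three counts can be simulated with a toy map-reduce; the score of 1 per lab is a placeholder for whatever the autograder would compute:

```python
from collections import defaultdict

n_students, n_labs = 800, 12

# Map: one autograder run per (student, lab) emits a key-value pair
# (student id, lab score). The score of 1 here is a placeholder.
mapped = [(student, 1) for student in range(n_students)
                       for lab in range(n_labs)]

# Shuffle/reduce: sum the lab scores for each student key.
totals = defaultdict(int)
for student, score in mapped:
    totals[student] += score

n_pairs = len(mapped)                  # key-value pairs emitted by map
n_keys = len({k for k, _ in mapped})   # distinct keys
n_results = len(totals)                # final key-value pairs after reduce
```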
4. [3 Pts] As described in class, the traditional data warehouse is a large tabular database that is periodically updated through the ETL process, which combines data from several smaller data sources into a common tabular format. The alternative is a data lake, where data is stored in its original natural form. Which of the following are good reasons to use a data lake approach?
A. The data is sensitive, e.g. medical data or government secrets.
B. To maximize compatibility with commercial data analysis and visualization tools.
C. When there is no natural way to store the data in tabular format.
D. To ensure that the data is clean.
Bootstrap and the Sampling Distribution
5. Note: The following problem is stylized to fit on paper, meaning that the sample size and number of replicates are much smaller than they should be in practice.
In order to infer the population mean, variance, and median of a discrete random variable X, a single simple random sample of size 5 is drawn from the population.
x 11 11 92 34 53
The mean, variance, and median of the values above are 40.2, 918.16, and 34, respectively. You decide to perform a bootstrap analysis of your statistics. The replicates and (rounded) summary statistics are given to you below. Bootstrap replicates are displayed either as rows or columns; you must infer which from the structure of this problem. rowmeans, rowvars, and rowmedians are the means, variances, and medians of the rows respectively.
replicates
       0   1   2   3   4   rowmeans  rowvars  rowmedians
0     11  11  34  11  92         32      985          11
1     53  92  92  92  92         84      243          92
2     11  11  11  53  34         24      290          11
3     53  92  11  34  11         40      918          34

colmeans     32    52    37    48    57
colvars     441  1640  1096   881  1274
colmedians   11    11    11    34    34
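The row statistics can be recomputed from the replicate values shown above (rows taken as replicates); a quick check, which also shows the quantity in option A of part (b):

```python
import numpy as np

# Bootstrap replicates from the table above (each row is one replicate).
replicates = np.array([
    [11, 11, 34, 11, 92],
    [53, 92, 92, 92, 92],
    [11, 11, 11, 53, 34],
    [53, 92, 11, 34, 11],
])

rowmeans = replicates.mean(axis=1)
rowvars = replicates.var(axis=1)

# The bootstrap estimate of the variance of the sample mean is the spread
# of the replicate means, i.e. np.var(rowmeans).
boot_var_of_mean = np.var(rowmeans)
```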
The summary statistics are loaded into a Python session as numpy arrays.
(a) [1 Pt] What is the sample mean? 40.2
(b) [3 Pts] Which of the following is the bootstrap estimate for the variance of the sample mean?
⃝ A. np.var(rowmeans) ⃝ B. np.mean(rowvars) ⃝ C. np.var(colmeans) ⃝ D. np.mean(colvars) ⃝ E. None of the above
(c) [4 Pts] Which of the following changes would decrease the variance of the sample mean? Select all that apply.
A. Increasing the sample size
B. Increasing the size of each bootstrap replicate
C. Increasing the number of bootstrap replicates
D. Combining all the bootstrap replicates into one array before estimating the variance
(d) [3 Pts] Which of the following plots displays the approximate sampling distribution of the sample median according to the results of the bootstrap?
⃝ A. ⃝ B.
⃝ C. ⃝ D.
(e) [3 Pts] Using the results of the bootstrap, construct a 50% confidence interval for the population
variance.
(243, 918)
(f) [2 Pts] Above, we drew R bootstrap replicates. If we draw many more sets of bootstrap replicates, say M = 10000 sets of R replicates each, and calculate 10000 50% confidence intervals for the population variance the same way we did in the part above, then roughly 50% of those intervals will contain the true population variance.
⃝ True √ False
6. [4 Pts] This problem is unrelated to the previous problem. Suppose you have a box containing prize tickets. Half of the tickets are worth $0, and the other half are worth $3. Suppose you draw a simple random sample of size 3 with replacement from the box. Plot the sampling distribution of the sample mean and provide titles for the axes.
Solution:
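The distribution to plot can be enumerated exactly; a sketch:

```python
from itertools import product
from fractions import Fraction
from collections import Counter

# Enumerate all equally likely ordered draws of size 3 (with replacement)
# from a box whose tickets are worth $0 or $3 with equal probability.
dist = Counter()
for draw in product([0, 3], repeat=3):
    mean = Fraction(sum(draw), 3)
    dist[mean] += Fraction(1, 8)  # each of the 2^3 sequences has chance 1/8

# Sampling distribution of the sample mean:
# P(0) = 1/8, P(1) = 3/8, P(2) = 3/8, P(3) = 1/8
```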
Hypothesis Testing
7. A mysterious stranger on Sproul Plaza stops you on your way to class and claims that she has learned to flip any coin such that it lands on heads more often than the 50% you’d expect from random chance. To demonstrate, she takes a penny from her wallet, flips it 10 times, and gets heads nine times and only gets tails once.
(a) [4 Pts] The null hypothesis is that this was pure random chance, and that the probability of getting heads was 50% for each flip. What is the p-value under the null hypothesis of getting 1 or fewer tails out of 10 flips? You may leave your answer as a fraction if necessary.
Solution: 11/2^10

There is 1 way to get all heads, and n = 10 ways to get exactly one tails. This is out of a total of 2^n equally likely sequences. Therefore the chance of getting 1 or fewer tails is 11/2^10.
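The counting argument can be verified directly:

```python
from math import comb

n = 10
# P(1 or fewer tails) = P(0 tails) + P(1 tail) under a fair coin.
p_value = (comb(n, 0) + comb(n, 1)) / 2**n
# 11/1024, roughly 0.011
```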
(b) [4 Pts] Suppose the stranger flips the coin 28 more times, and they all end up heads. The resulting p-value including all 38 flips under the null hypothesis is approximately p_b = 10^−10. Which of the following are true? Select all that apply.
A. It is extremely unlikely that the stranger just happened to get 37 heads by randomly getting heads on 50/50 coin flips.
B. p_b is the probability that the null hypothesis is true.
C. 1 − p_b is the probability that the stranger has the skill to flip any arbitrary coin and get heads.
D. If you flipped a fair coin 38 times, p_b is the chance that you'd get at least 37 heads by random chance.
E. The stranger has proven beyond any reasonable doubt that she has the skill to flip any coin to land on heads with high probability.
F. None of the above.
8. The DS100 staff is trying to test whether going to discussion improves a student's grade in the class. In order to test this, they consider an observational study to measure possible effects of going to discussion on a student's grade in the class.
(a) [4 Pts] For Fall 2019, suppose A is a Series containing final exam grades for students who attended section 7 or more times, and B is a Series containing final exam grades for students who attended section 3 or fewer times. Assume that there are no students who attended section 4, 5, or 6 times. The staff wishes to evaluate the null hypothesis that attending discussion has no relationship with a student's score on the final exam. Which of the following are reasonable test statistics to evaluate the null hypothesis? Select all that apply.
A. A.mean() - B.mean()
B. A.sum() - B.sum()
C. A.median() - B.median()
D. A.max() - B.max()
(b) [2 Pts] Suppose the Fall 2019 staff selects the first statistic above: A.mean() – B.mean(). Suppose this difference is 8 points out of 100. From this information alone, what can the staff conclude? Select all that apply.
A. It is very unlikely that this large difference in performance was merely due to chance.
B. Attending section helps improve a student's performance.
C. Neither of these.
(c) [2 Pts] A staff member suggests using the bootstrap to create a confidence interval for the test statistic from part b. Another staff member disagrees and says that the bootstrap would be useless for this purpose because the data is already a census, not a sample. Who is right?
⃝ A. The bootstrap would yield a useful confidence interval.
⃝ B. The bootstrap confidence interval would be useless since the data is already a census.
⃝ C. Neither is correct. The bootstrap confidence interval would be useless, but for a different reason.
Solution: A was awarded 2 points, B was awarded 1 point.
(d) [3 Pts] Another staff member suggests using a permutation test. Which of the following could a permutation test help with? Select all that apply.
A. Can be used to provide a p-value for the null hypothesis.
B. Can provide strong evidence that the difference in performance was not merely due to chance.
C. Can establish a causal relationship that attending section helps improve a student's performance.
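A permutation test along these lines can be sketched as follows (the exam scores below are hypothetical stand-ins for A and B):

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical final-exam scores for the two attendance groups.
A = np.array([85.0, 90.0, 78.0, 92.0, 88.0])
B = np.array([75.0, 80.0, 82.0, 70.0, 79.0])

observed = A.mean() - B.mean()
pooled = np.concatenate([A, B])

# Permutation test: repeatedly shuffle the pooled scores, split them into
# two groups of the original sizes, and recompute the test statistic.
n_perms = 10_000
stats = np.empty(n_perms)
for i in range(n_perms):
    rng.shuffle(pooled)
    stats[i] = pooled[:len(A)].mean() - pooled[len(A):].mean()

# One-sided p-value: fraction of shuffles at least as extreme as observed.
p_value = np.mean(stats >= observed)
```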
Classification
9. Suppose we train a binary classifier on some dataset. Suppose y is the set of true labels, and yˆ is the set of predicted labels.

y    0  0  0  0  0  1  1  1  1  1
yˆ   0  1  1  1  1  1  1  0  0  0

Determine each of the following quantities.

(a) [1 Pt] The number of true positives

Solution: 2

(b) [1 Pt] The number of false negatives

Solution: 3

(c) [1 Pt] The precision of our classifier. Write your answer as a simplified fraction.

Solution: 2/(2 + 4) = 1/3

10. You have a classification data set, where x is some value and y is the label for that value:

x    2  3  0  1
y    1  0  1  0

Suppose that we're using a logistic regression model to predict the probability that Y = 1 given x: P(Y = 1|x) = σ(φ(x)^T θ)

(a) [6 Pts] Suppose that φ(x) = [φ1 φ2 φ3]^T = [1 x x²]^T and our model parameters are θ* = [1 0 −2]^T. For the following parts, leave your answer as an expression (do not numerically evaluate log, e, π, etc).

i. Compute Pˆ(y = 1|x = 0).

Solution: 1/(1 + exp(−1))

ii. What is the loss for this single prediction Pˆ(y = 1|x = 0), assuming we are using KL divergence as our loss function (or equivalently that we are using the cross entropy as our loss function)?

Solution: log(1 + exp(−1))

(b) [4 Pts] Suppose φ(x) = [1 x x%2]^T, where % is the modulus operator. Are the data from part (a) linearly separable with these features? If so, give the equation for a separating plane, e.g. φ2 = 3φ3 + 1. Use 1-indexing, e.g. we have φ1, φ2, and φ3. If not, just write "no".

Solution: Yes, they can be separated by the hyperplane φ3 = 0.5.
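The solutions to 10(a) can be checked numerically; a quick sketch:

```python
import math

def sigma(x):
    """Logistic function."""
    return 1 / (1 + math.exp(-x))

# Question 10(a): phi(x) = [1, x, x^2], theta* = [1, 0, -2].
theta = [1, 0, -2]
x = 0
phi = [1, x, x**2]
z = sum(t * p for t, p in zip(theta, phi))  # equals 1 at x = 0

p_hat = sigma(z)         # 1 / (1 + e^-1), roughly 0.731
loss = -math.log(p_hat)  # cross entropy for the true label y = 1
# loss equals log(1 + e^-1), roughly 0.313
```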
11. [4 Pts] Suppose we have the dataset below.
x    y
1    1
−1   0

Suppose we have the feature set φ(x) = [φ1 φ2]^T = [1 x]^T. Suppose we use gradient descent to compute the θ which minimizes the KL divergence under a logistic model without regularization, i.e.

argmin_θ −(1/n) Σ_{i=1}^{n} (y_i φ(x_i)^T θ + log(σ(−φ(x_i)^T θ)))
Select all that are true regarding the data points and the optimal theta value θ.
A. The data is linearly separable.
B. The optimal θ yields an average cross entropy loss of zero.
C. The optimal θ diverges to −∞
D. The optimal θ diverges to +∞
E. The equation of the line that separates the 2 classes is φ2 = 0.
F. None of the above.
Solution:
A. True. When drawn in the 2D feature space, the points are linearly separable.
B. True. If the data is linearly separable, we can achieve an average cross entropy loss of zero and our parameter value θ will diverge.
C. False. The optimal theta value θ diverges to +∞.
D. True. The optimal theta value θ diverges to +∞.
E. True. If we draw the line φ2 = 0 in the 2D feature space, this separates the points.
F. False. 4 choices were true above.
12. Suppose we have the dataset below.

x    y
−3   1
−1   0
1    0
3    1

Suppose we have the feature set φ(x) = [1 x²]^T. Suppose we use gradient descent to compute the θ which minimizes the KL divergence under a logistic model without regularization, i.e.

argmin_θ −(1/n) Σ_{i=1}^{n} (y_i φ(x_i)^T θ + log(σ(−φ(x_i)^T θ)))

(a) [3 Pts] Explain in 10 words or fewer why the magnitudes of θ1 and θ2 will be very large.

Solution: Because the data is linearly separable.

(b) [3 Pts] Will the sign of θ2 be negative or positive?
⃝ A. Could be either, it depends on where our gradient descent starts
⃝ B. Positive
⃝ C. Negative
⃝ D. Neither, θ2 will be zero

(c) [3 Pts] If we use L1 regularization, which of our θ values would you expect to be zero?
⃝ A. Neither of them
⃝ B. θ1
⃝ C. θ2
⃝ D. Both θ1 and θ2
Bias Variance Tradeoff
13. In class, we showed that the expected squared error can be decomposed into several important terms:
E[(Y − fθˆ(x))²] = σ² + (h(x) − E[fθˆ(x)])² + E[(E[fθˆ(x)] − fθˆ(x))²].
(a) [1 Pt] For which of the following reasons are we taking an expectation? In other words, what are the sources of randomness that we are considering in the derivation of the bias-variance tradeoff?
A. We chose arbitrary features when doing feature engineering.
B. We drew random samples from some larger population when we built our training set.
C. There is some noise in the underlying process that generates our observations Y from our features.
D. Our x values could have had missing or erroneous data, e.g. participants misreading a question on a survey.
E. None of the Above.
(b) [1.5 Pts] Which of the following do we treat as fixed? Select all that apply.
A. θˆ
B. σ²
C. h(x)
(c) [1 Pt] By decreasing model complexity, we are able to decrease σ².
⃝ A. True
⃝ B. False
14. Your team would like to train a machine learning model in order to predict the next YouTube video that a user will click on based on m features for each of the previous d videos watched by that user. In other words, the total number of features is m × d. You’re not sure how many videos to consider.
(a) [2 Pts] Your colleague generates the following plot, where the value d is on the x-axis. However, they forgot to label the y-axis.
Which of the following could the y axis represent? Select all that apply.
A. Training Error
B. Validation Error
C. Bias
D. Variance
(b) [2 Pts] Your colleague generates the following plot, where the value d is on the x-axis. However, they forgot to label the y-axis.
Which of the following could the y-axis represent? Select all that apply.
A. Training Error
B. Validation Error
C. Bias
D. Variance
Cross Validation
15. [2.5 Pts] Aman and Ed built a model on their data with two regularization hyperparameters λ and γ. They have 4 good candidate values for λ and 3 possible values for γ, and they are wondering which λ, γ pair will be the best choice. If they were to perform five-fold cross-validation, how many validation errors would they need to calculate?
16. [2 Pts] In the typical setup of k-fold cross validation, we use a different parameter value on each fold, compute the mean squared error of each fold and choose the parameter whose fold has the lowest loss.
⃝ True √ False
17. [2 Pts] Suppose we have m data points in our training set and n data points in our test set. In leave-one-out cross validation, we only use one data point for validation while the rest are used for training. Which of the following is leave-one-out cross validation equivalent to?
⃝ A. m-fold cross validation
⃝ B. n-fold cross validation
⃝ C. (m + n)-fold cross validation
⃝ D. 1-fold cross validation
18. [5 Pts] Suppose we have a linear regression model with L2 regularization that we'd like to train. Recall that ridge regression has a single hyperparameter λ. Suppose we are trying to pick a λ value from [0, 0.1, 0.2, 0.3, 0.4]. In class, we discussed cross validation, but there are other ways we could attempt to compute λ. Let λ_CV be the optimal λ that would be calculated using 5-fold cross-validation. Let λ_X be the optimal λ that would be computed using procedure X below.
Procedure X: Don't create any sort of validation set. Instead, for every candidate λ value, compute the theta that minimizes the average loss over the entire training set including the regularization term λ Σ_{i=1}^{d} θ_i². Return the λ that yields the lowest training loss.
Which of the following are true? Select all that apply.
A. λ_X will require vastly more computation time to compute than λ_CV.
B. λ_X ≤ λ_CV.
C. Procedure X reduces the risk of overfitting even more than cross validation.
D. If we computed both, we should use the smaller of λ_X and λ_CV to train our final model.
E. None of the above are true.
E. None of the above are true.
Solution: 60 (for Question 15: 4 λ values × 3 γ values × 5 folds)
19. Suppose you are working with a partner to train a model with one hyperparameter λ. Together, you and your partner run 5fold cross validation and compute mean squared errors for each fold and value of λ from a set of 4 candidate values for λ. However, your partner forgets to send you the results for the last two folds! The table below contains the mean squared errors for the first three of five total folds.
Fold Num   λ = 0.1   λ = 0.2   λ = 0.3   λ = 0.4   Row Avg
1          64.2      60.1      77.7      79.2      70.3
2          76.8      66.8      88.8      98.8      82.8
3          81.5      71.5      86.5      88.5      82.0

(a) [3 Pts] Your partner uses the full table containing data for all five folds to create a final model to use on test data. Given the information above, what can you conclude about the final model? Select all that apply.
A. Our final model should use λ = 0.4.
B. Our final model should be trained on fold 1, since it achieves the lowest row average.
C. Our final model should be trained on fold 2, since it achieves the highest row average.
D. None of the above.

(b) [2 Pts] Let's say we know the row averages for all 5 folds. Which of the following are valid conclusions we can draw? Select all that apply.
A. We can determine which fold number to use for our model.
B. We can determine which λ value to use in our model.
C. None of the above.
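Cross validation compares λ values by averaging each λ's column over the folds, not by the per-fold row averages; a sketch using the three known folds:

```python
import numpy as np

# Mean squared errors for the first three folds (rows) and the four
# candidate lambda values 0.1, 0.2, 0.3, 0.4 (columns), from the table.
mse = np.array([
    [64.2, 60.1, 77.7, 79.2],
    [76.8, 66.8, 88.8, 98.8],
    [81.5, 71.5, 86.5, 88.5],
])
lambdas = [0.1, 0.2, 0.3, 0.4]

# Average each lambda over the folds (a column average, not a row average).
avg_per_lambda = mse.mean(axis=0)
best = lambdas[int(np.argmin(avg_per_lambda))]
# On these three folds alone, lambda = 0.2 has the lowest average MSE.
```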
Regularization
20. [3.5 Pts] Which of the following are indications that you should regularize? Select all that apply.
A. Our training loss is 0.
B. Our model bias is too high.
C. Our model variance is too high.
D. Our weights are too large.
E. Our model does better on unseen data than training data.
F. We have linearly dependent features.
G. We are training a classification model and the data is linearly separable.
21. [7.5 Pts] Suppose we have a data set which we divide into 3 equally sized parts, A, B, and C. We fit 3 linear regression models with L2 regularization (i.e. ridge regression), X, Y , and Z, all on A. Each model uses the same features and training set, the only difference is the λ used by each model. Select all below that are always true.
A. Suppose Z has the lowest average loss on B. Model Z will have the lowest average loss when evaluated on C.
B. If A and B have the same exact mean and variance, the average loss of model Y on B will be exactly equal to the average loss of Y on A.
C. If λ = 0 for model X, Loss(X, A) ≤ Loss(Y, A) and Loss(X, A) ≤ Loss(Z, A).
D. If λ_Y < λ_Z, then Loss(Y, A) ≤ Loss(Z, A).
E. If λ_Y > λ_Z, then Loss(Y, B) ≥ Loss(Z, B).
F. None of the above.
Solution:
A: Not guaranteed since we don’t know the distributions of B, C.
B: Having the same mean and variance does not imply that the data are the same.
C: Since increasing λ increases bias, the loss of X must be less than or equal to the loss of Y, Z on A.
D: Since Y and Z were trained on A, and Y is less restricted than Z, the loss of Y on A must belessthanthelossofZonA. E:EventhoughZisamorerestricted(i.e.simpler)model,itis possible that the dataset B is slightly better for Z. In other words, minimizing training error with a regularized model does not guarantee minimized error on unseen datasets.
Probability and Potpourri
Some of the problems in this section were explicitly (or near explicitly) covered in a lecture or discussion section. As a result, they are worth fewer points than you might expect given the amount of work needed to derive them from scratch. If you find yourself spending a ton of time on one of these, come back to it later.
22. Recall from lecture 21 that Jensen’s Inequality states that for a random variable X and a convex function f, f(E[X]) ≤ E[f(X)].
(a) [2 Pts] In class, we showed that the Kullback-Leibler divergence D_KL(P‖Q) ≥ 0. To prove this, we applied Jensen's inequality with which of the following four functions?
⃝ A. f(X) = −log X ⃝ B. f(X) = log X ⃝ C. f(X) = −X² ⃝ D. f(X) = X²
(b) [3 Pts] As we know, the variance of a random variable is always greater than or equal to 0, i.e. Var(X) ≥ 0. Give a function f that lets us trivially prove this statement using Jensen's inequality. For example, if we could prove that the variance is always nonnegative by plugging in f(X) = e^X, you would write e^X.
(c) [2 Pts] For which of the following functions f will equality (instead of inequality) hold for Jensen's inequality regardless of the random variable X? You may assume a, b, and c are constants. Select all that apply.
A. f(X)=a
B. f(X)=aX
C. f(X)=aX+b
D. f(X) = aX² + bX + c
E. None of the Above
Solution (b): X². Since X² is convex, Jensen's inequality tells us that E[X]² ≤ E[X²]. Since Var(X) = E[X²] − E[X]², we know that Var(X) ≥ 0.
Solution (c): This follows directly from the linearity of expectation: for a linear f, f(E[X]) = E[f(X)]. Note that in general, E[X²] ≠ E[X]².
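The inequality E[X]² ≤ E[X²] from part (b) is easy to check numerically; a quick sketch (the distribution parameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=2.0, scale=3.0, size=100_000)

# Jensen with the convex f(X) = X^2 gives E[X]^2 <= E[X^2],
# which is exactly Var(X) = E[X^2] - E[X]^2 >= 0.
lhs = np.mean(x) ** 2
rhs = np.mean(x ** 2)
var = rhs - lhs  # matches np.var(x) up to floating-point error
```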
Solution (a): −log X. We did this proof in class, and −log X is the appropriate convex function.

23. A/B Testing
(a) [2 Pts] You have a coin which lands heads on average 50% of the time, and tails 50% of the time. You know that the coin is fair, having flipped it millions of times in the past. Suppose your most recent four flips have all been heads. How does the probability that a fifth toss will again be heads compare to the probability that a fifth toss will be tails?
⃝ A. The odds of heads are greater than 50% for the fifth flip.
⃝ B. The odds of heads are 50% for the fifth flip.
⃝ C. The odds of heads are less than 50% for the fifth flip.
(b)
[4 Pts] You want to know whether or not a coin is fair. As you flip the coin, you keep track of your test statistic D, which is the total number of heads observed minus the total number of tails observed. After each flip, you compute the p-value under the null hypothesis (that the coin is fair) for your observed value of D. If the p-value ever falls below 0.05, you stop and announce the coin is unfair. If you reach T flips, you announce the coin is fair. Suppose you have a fair coin; what is the probability p_FD of a false discovery with this procedure using that fair coin?
⃝ A. At most 5%, independent of the value of T.
⃝ B. As T grows, p_FD asymptotically decreases to 0%.
⃝ C. As T grows, p_FD asymptotically increases to 100%.
24. [2 Pts] Which of the following statements are true? Select all that apply.
A. A matrix with a condition number of 10^20 will magnify numerical issues more than a matrix with a condition number of 10^10.
B. Suppose a 3-dimensional sphere is inscribed in a 3-dimensional cube. The volume inside the sphere is greater than the volume outside the sphere but inside the cube.
C. Suppose a 20-dimensional sphere is inscribed in a 20-dimensional hypercube. The volume inside the sphere is greater than the volume outside the sphere but inside the hypercube.
D. Suppose f1(x) = rx(1 − x), and f2(x) = rx − rx². For r = 1.9 and x = 0.8, f1 and f2 will return two different numbers in Python.
Solution: These were all discussed in the numerical issues lecture.
25. [2 Pts] For this problem, recall that stochastic gradient descent is very similar to normal gradient descent, except that the gradient of the loss function is computed on a random sample of the data instead of the entire dataset. Which of the following are true? Select all that apply.
A. At a particular iteration, stochastic gradient descent will often update θ more accurately compared to an update with regular gradient descent.
B. For a convex loss function, a single step of gradient descent always decreases the loss.
C. For a convex loss function, a single step of stochastic gradient descent always decreases the loss.
D. Suppose it takes t seconds for one update of regular gradient descent. Stochastic gradient descent can usually perform more than one update in t seconds.
E. None of the Above
DS100 Final, Page 25 of 31 December 13th, 2018
Linear Regression (Hard Problem)
Throughout this section we refer to ”least squares regression”, which is the process of minimizing the average L2 loss using a linear regression model. Ordinary least squares is the version of least squares regression where we do not use regularization. Assume throughout that our model includes a bias term. Warning: Parts of this problem are pretty hard!
26. [3 Pts] What is always true about the residuals in least squares regression? Select all that apply.
A. They are orthogonal to the column space of the features.
B. They represent the errors of the predictions.
C. Their sum is equal to the mean squared error.
D. Their sum is equal to zero.
E. None of the above.
Solution: (a), (b)
(c) is supposed to be a trick, since the mean squared error is the mean of the squared residuals, not their sum. This tests whether students understand what the acronym MSE represents or just regurgitate it mindlessly.
(e) is wrong since (a) and (b) are correct.
27. [3 Pts] What are possible disadvantages of ordinary least squares (OLS) regression compared to ridge or LASSO? Select all that apply.
A. The OLS estimate selects too few features as being important.
B. OLS has unrealistically small variance compared to LASSO or ridge.
C. OLS is computationally much more expensive.
D. OLS is more prone to overfitting.
E. None of the above.
Solution: (e)
(a) LASSO tends to identify fewer features as important
(b) LASSO and ridge are biased, but often have smaller MSE than OLS (that’s why people use them after all), which means that OLS (often) has larger variance compared to LASSO and ridge,
DS100 Final, Page 26 of 31 December 13th, 2018
not smaller
(c) OLS has the simplest loss function and the simplest closed form solution
28. [3 Pts] What differentiates LASSO compared to OLS? Select all that apply.
A. LASSO uses the mean absolute error (MAE) loss, while OLS uses mean squared error (MSE).
B. LASSO tends to identify more features as relevant.
C. LASSO typically has lower average error on the training data.
D. All weights in a LASSO model must be less than 1.
E. None of the above.

Solution: (e)
(a) is false, since it is the penalization of the coefficients, not the residuals, that uses L1 in LASSO.
(b) is false: LASSO tends to identify fewer features as relevant (the second S in LASSO stands for selection, after all).
(c) is false: the training error for a LASSO model is always greater than or equal to that of an OLS model, since LASSO minimizes the same loss plus a penalty.
(d) is false: the LASSO estimate is biased (shrunk toward zero), but its weights are not required to be less than 1.
(e) is correct since (a), (b), (c), and (d) are all false.
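The feature-selection contrast in (b) can be illustrated with a quick sketch, assuming scikit-learn is available. The data, the choice alpha=0.5, and the 2-signal/3-noise split are arbitrary choices for the demonstration:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Only the first two features matter; the last three are pure noise.
y = 4 * X[:, 0] - 3 * X[:, 1] + rng.normal(scale=0.5, size=200)

ols = LinearRegression().fit(X, y)
lasso = Lasso(alpha=0.5).fit(X, y)

# OLS keeps all five coefficients nonzero; the L1 penalty drives the
# noise-feature coefficients exactly to zero.
print(np.sum(ols.coef_ != 0))
print(np.sum(np.abs(lasso.coef_) > 1e-8))
```

With this setup LASSO retains only the two genuinely relevant features, which is exactly why (b) is backwards.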
29. [3 Pts] Which are true about the predictions made by OLS? Select all that apply.
A. They are projections of the observations onto the column space of the features.
B. They are linear in the chosen features.
C. They are orthogonal to the residuals.
D. They are orthogonal to the column space of the features.
E. None of the above.

Solution: (a), (b), (c)
(a) is correct because the predictions are linear projections onto the column space. This fact also makes (c) correct, and it is what makes (d) and (e) incorrect.
(b) is also correct because even in, e.g., polynomial regression, the resulting predictions are linear in the new/transformed features. Admittedly, this is somewhat awkwardly worded.
30. [3 Pts] Which of the following would be true if you chose mean absolute error (L1) instead of mean squared error (L2) as your loss function? Select all that apply.
A. The results of the regression would be more sensitive to outliers.
B. You would not be able to use gradient descent to find the regression line.
C. You would not be able to use the normal equation to calculate your parameters.
D. The sum of the residuals would now be zero.
E. None of the above.

Solution: (c)
(a) is false because using L1 loss increases robustness to outliers.
(b) is false because you can still use (sub)gradient descent given the convexity of L1 loss.
(c) is true: the normal equation only works if we're minimizing the L2 loss.
(d) is false because the sum of the residuals was already zero in OLS, so this would not be a change from OLS.
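The robustness claim in (a) has a classic special case: for a constant model, average L2 loss is minimized by the mean and average L1 loss by the median. A minimal numeric illustration on made-up data:

```python
import numpy as np

data = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
with_outlier = np.append(data, 100.0)

# L2 minimizer (mean) vs. L1 minimizer (median), before the outlier:
print(data.mean(), np.median(data))                  # 3.0 3.0

# One extreme point drags the mean far away but barely moves the median.
print(with_outlier.mean(), np.median(with_outlier))  # ~19.17 3.5
```

The mean jumps from 3.0 to about 19.17 while the median only moves to 3.5, which is the sense in which L1 regression is the more robust choice.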
31. Let ŷ ∈ R^n be the vector of fitted values in the ordinary least squares regression of y ∈ R^n on the full column-rank feature matrix Φ ∈ R^(n×d), with n much larger than d. Denote the fitted coefficients as β̂ ∈ R^d and the vector of residuals as e ∈ R^n.
(a) [4 Pts] What is Φ(Φ^T Φ)^(-1) Φ^T y?
⃝ A. 0  ⃝ B. ŷ  ⃝ C. e  ⃝ D. β̂  ⃝ E. 1  ⃝ F. None of the above
(b) [4 Pts] What is Φ(Φ^T Φ)^(-1) Φ^T ŷ? Notice: this problem has a hat on ŷ.
⃝ A. 0  ⃝ B. ŷ  ⃝ C. e  ⃝ D. β̂  ⃝ E. 1  ⃝ F. None of the above
Suppose e ≠ 0. Define a new feature matrix Ψ by appending the residual vector e to the feature matrix Φ. In other words,

Ψ = [ Φ:,1  Φ:,2  ···  Φ:,d  e ].

Solution: (b). We discussed this in discussion 6, where we called Φ(Φ^T Φ)^(-1) Φ^T the hat matrix. It projects y into the feature space.

Solution: (b). Since ŷ is already in the feature space, projecting it into the feature space has no effect.
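The hat-matrix facts used in (a) and (b) can be verified numerically. The sketch below uses synthetic data and forms the projection explicitly (inverting Φ^T Φ directly is fine here only because the toy matrix is tiny and well-conditioned):

```python
import numpy as np

rng = np.random.default_rng(0)
# Small feature matrix with a bias column and two random features.
Phi = np.column_stack([np.ones(30), rng.normal(size=(30, 2))])
y = rng.normal(size=30)

# Hat matrix: projects any vector onto the column space of Phi.
H = Phi @ np.linalg.inv(Phi.T @ Phi) @ Phi.T
y_hat = H @ y             # part (a): H y gives the fitted values

# Part (b): y_hat is already in the column space, so projecting
# again changes nothing (H is idempotent: H @ H == H).
print(np.allclose(H @ y_hat, y_hat))  # True
```

The idempotence check is exactly the algebra behind (b): H ŷ = H(Hy) = H²y = Hy = ŷ.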
(c) [4 Pts] We now want to fit the model y = Ψγ = γ_1 Φ:,1 + γ_2 Φ:,2 + ··· + γ_d Φ:,d + γ_{d+1} e by choosing γ̂ = [γ̂_1 ... γ̂_{d+1}]^T to minimize the L2 loss. What is γ̂_{d+1}?
⃝ A. 0  ⃝ B. 1  ⃝ C. e^T y  ⃝ D. 1 − β̂^T β̂  ⃝ E. (Φ^T Φ)^(-1) Φ^T  ⃝ F. None of the above
32. We collect some data D = {(x_1, y_1), ..., (x_n, y_n)} and decide to model the relationship between X and y as

y = β_1 Φ:,1 + β_2 Φ:,2

where Φ_i,: = [1  x_i]. We found the estimates β̂_1 = 2 and β̂_2 = 5 for the coefficients by minimizing the L2 loss. Given that

Φ^T Φ = [4  2]
        [2  5],

answer the following problems. If not enough information is given, write "Cannot be determined."
(a) [4 Pts] What was the sample size n? Hint: Consider the form of the feature matrix.
(b) [7 Pts] What must Φ^T y be for this data set?
Solution: We're effectively memorizing all of our regression values here. This is the equivalent (in a roundabout way) of using someone's weight, age, and height to predict their height. It will work perfectly, but the model is useless.
Solution: n = 4, since the top-left entry of Φ^T Φ is the dot product of the bias column with itself:

[Φ^T Φ]_1,1 = Σ_{i=1}^{n} 1 × 1 = n = 4
Solution: β̂ comes from the normal equations

Φ^T Φ β̂ = Φ^T y

Therefore, we have

Φ^T y = [4  2] [2]   [18]
        [2  5] [5] = [29]
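The matrix arithmetic above can be double-checked in a couple of lines:

```python
import numpy as np

# Given quantities from the problem.
PhiTPhi = np.array([[4, 2],
                    [2, 5]])
beta_hat = np.array([2, 5])

# Normal equations: Phi^T Phi @ beta_hat = Phi^T y,
# so Phi^T y is recovered by a single matrix-vector product.
PhiTy = PhiTPhi @ beta_hat
print(PhiTy)  # [18 29]
```

Row by row: 4·2 + 2·5 = 18 and 2·2 + 5·5 = 29, matching the solution.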
Cleaning, EDA, Visualization
Let’s take a look at the California Air Quality Index (AQI) for 2017. The following cells and outputs are for your reference.
aq = pd.read_csv("./air_quality_final.csv", index_col=0)
aq.head()

         Date   AQI  COUNTY CODE   COUNTY        LAT         LON
0  01/01/2017  24.0            1  Alameda  37.687526 -121.784217
1  01/02/2017  19.0            1  Alameda  37.687526 -121.784217
2  01/03/2017   NaN            1  Alameda  37.687526 -121.784217
3  01/04/2017  15.0            1  Alameda   0.000000    0.000000
4  01/05/2017  20.0            1      NaN  37.687526 -121.784217

aq.iloc[49437:49442]

             Date   AQI  COUNTY CODE COUNTY        LAT         LON
49437  01/01/2017   NaN          113   Yolo  38.534450 -121.773400
49438  01/02/2017  15.0          113   Yolo  38.534450 -121.773400
49439  01/03/2017  36.0          113   Yolo  38.534450 -121.773400
49440  01/04/2017  18.0          113   Yolo  37.995239 -121.756812
49441  01/05/2017  16.0          113    NaN  38.534450 -121.773400

aq.describe()

                AQI   COUNTY CODE           LAT           LON
count  49810.000000  49812.000000  49812.000000  49812.000000
mean      38.270167     56.169678     36.294056   -119.859815
std       24.750558     30.486150      2.235560      2.099002
min        0.000000      1.000000      0.000000   -124.203470
25%       21.000000     29.000000     34.144350   -121.618549
50%       35.000000     65.000000     36.487823   -119.828400
75%       52.000000     77.000000     37.960400   -118.147294
max      537.000000    113.000000     41.756130      0.000000

print(aq['COUNTY'].nunique())

Output: 51
33. [3 Pts] Select all that apply.
A. Supposing that there is a one to one mapping from COUNTY CODE to COUNTY, we can extrapolate the value of COUNTY for index 4.
B. Grouping by COUNTY is equivalent to grouping by LAT, LON.
C. The primary key in this dataset is the DATE.
D. None of the above
Solution: (a)
A: True.
B: No, there are different latitudes and longitudes within a county.
C: Dates are not unique.
For all following questions, assume we have finished cleaning the dataset (filled in or removed missing values, NaN, etc.).
34. [2 Pts] Which of the following correctly verifies that the mapping from COUNTY CODE to COUNTY is 1 to 1? Select only one.
⃝ A. len(aq['COUNTY'].value_counts()) == len(aq['COUNTY_CODE'].value_counts())
⃝ B. len(set(aq['COUNTY'])) == len(set(aq['COUNTY_CODE']))
⃝ C. len(aq['COUNTY'].unique()) == len(aq['COUNTY_CODE'].unique())
⃝ D. len(aq.groupby(['COUNTY', 'COUNTY_CODE'])) == len(set(aq['COUNTY'])) and len(set(aq['COUNTY'])) == len(set(aq['COUNTY_CODE']))
⃝ E. None of the above
Solution: (d)
A, B, C: Having 51 unique COUNTY values and 51 unique COUNTY CODE values does not imply a 1 to 1 mapping.
D: Correct.
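Why choices A–C fail while D works can be seen on a tiny hypothetical counterexample (the frame below is fabricated for illustration):

```python
import pandas as pd

# Counterexample: two unique values in each column, but the mapping is
# many-to-many (code 1 maps to both 'A' and 'B').
df = pd.DataFrame({'COUNTY_CODE': [1, 1, 2, 2],
                   'COUNTY':      ['A', 'B', 'A', 'B']})

# Choices A-C only compare counts of unique values: both sides are 2,
# so the check passes even though the mapping is not 1 to 1.
print(df['COUNTY'].nunique() == df['COUNTY_CODE'].nunique())  # True

# Choice D counts distinct (COUNTY, COUNTY_CODE) pairs: here there are
# 4 pairs but only 2 counties, so the check correctly fails.
print(len(df.groupby(['COUNTY', 'COUNTY_CODE'])) == df['COUNTY'].nunique())  # False
```

A 1-to-1 mapping forces each county to pair with exactly one code (and vice versa), so the pair count must equal both unique counts, which is exactly what D tests.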
35. [2 Pts] In the questions below, select the best plot to visualize a certain aspect of our data.
(a) Visualize the AQI for Los Angeles, San Diego, San Francisco, Humboldt, and Inyo counties over the first 7 days of January 2017.
⃝ A. Stacked bar plot
⃝ B. Side by side line plot
⃝ C. KDE plot
⃝ D. Side by side violin plot
(b) Visualize the distribution of site locations by latitude and longitude.
⃝ A. Histogram
⃝ B. Scatter plot
⃝ C. Bar plot
⃝ D. 1D KDE plot
(c) Visualize the average AQI over all counties for each day of January.
⃝ A. Overlaid line plot
⃝ B. Line plot
⃝ C. Side by side histogram
⃝ D. Side by side box plot
36. [9 Pts] We wish to visualize the mean AQI measurements taken for Alameda, San Francisco, and Yolo county over the entire period. Fill in the code below to accomplish this. Use choices from the following table.

reg = r'__________ __________ __________ __________ __________'
temp = (__________
    .loc[
        __________[__________].str.__________(__________),
        __________
    ]
    .__________(__________)
    .__________()
    .reset_index())
sns.barplot(x=__________, y=__________, data=temp);

Choices:
aq          'Date'    :         'AQI'    'COUNTY CODE'
'COUNTY'    'LAT'     'LON'     Alameda  San Francisco
Yolo        str       apply     match    groupby
agg         findall   count     sum      mean
|           ==        or        and      filter
Solution:

reg = r'Alameda|San Francisco|Yolo'
temp = (aq
    .loc[aq['COUNTY'].str.match(reg), :]
    .groupby('COUNTY')
    .mean()
    .reset_index())
sns.barplot(x='COUNTY', y='AQI', data=temp);
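As a sanity check, the data-preparation part of the filled-in pipeline can be run on a small made-up stand-in for the cleaned aq table (the values below are fabricated purely for illustration; the seaborn call is omitted since it only draws the result):

```python
import pandas as pd

# Tiny synthetic stand-in for the cleaned aq table.
aq = pd.DataFrame({
    'COUNTY': ['Alameda', 'Alameda', 'Yolo', 'San Francisco', 'Inyo'],
    'AQI':    [20.0, 30.0, 40.0, 10.0, 99.0],
})

reg = r'Alameda|San Francisco|Yolo'
temp = (aq
    .loc[aq['COUNTY'].str.match(reg), :]   # keep only the three counties
    .groupby('COUNTY')                     # one group per county
    .mean()                                # mean AQI per county
    .reset_index())

print(temp)  # Alameda 25.0, San Francisco 10.0, Yolo 40.0; Inyo dropped
```

Note that str.match anchors at the start of the string, which is enough here because each alternative in the pattern is a full county name; 'Inyo' matches none of them and is filtered out.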