# CS代考程序代写 AI algorithm decision tree SQL Data 100, Final Fall 2019

Data 100, Final Fall 2019
Name:
Email:
Student ID:
Exam Room:
First and last name of student to your left: First and last name of student to your right: All work on this exam is my own (please sign):
@berkeley.edu
Instructions:
• This final exam consists of 117 points and must be completed in the 170 minute time period ending at 6:00PM, unless you have accommodations supported by a DSP letter.
• Note that some questions have circular bubbles to select a choice. This means that you should only select one choice. Other questions have boxes. This means you should select all that apply. When selecting your choices, you must fully shade in the box/circle. Check marks will likely be mis-graded.
• You may use three cheat sheets each with two sides.
• Please show your work for computation questions as we may award partial credit.
1

Data 100 Page 2 of 22 Initials:
1 An Instructor Thinks This Is A Good Question [9 Pts.]
The average response time for a question on Piazza this semester was 11 minutes. As always, the number of questions answered by each TA is highly variable, with a few TAs going above and beyond the call of duty. Below are the number of contributions for the top four TAs (out of 20, 000 total Piazza contributions):
Suppose we take an SRS (simple random sample) of size n = 500 contributions from the original 20, 000 contributions. We will also define some random variables:
• Di = 1 when the ith contribution in our sample is made by Daniel; else Di = 0.
• Si = 1 when the ith contribution in our sample is made by Suraj; else Si = 0.
• Mi = 1 when the ith contribution in our sample is made by Mansi; else Mi = 0.
• Ai = 1 when the ith contribution in our sample is made by Allen; else Ai = 0.
• Oi = 1 when the ith contribution is made by anyone other than Daniel, Suraj, Mansi, or Allen; else, Oi = 0
Throughout this problem, you may leave your answer as an unsimplified fraction. If your answer is much more complicated than necessary, we may deduct points. Some of these problems are simple, and some are quite tricky. If you’re stuck, move on and come back later.
TA
# contributions
Daniel Suraj Mansi Allen
2000
1800
700
500
(a)
i. [1 Pt] What is P(A1 = 1)? P(A1 =1)=
ii. [1 Pt] What is E[S1]? E[S1] =
iii. [1 Pt] What is E[M100]? E[M100] =

Data 100
Page 3 of 22 Initials:
iv.
v.
[1 Pt] What is Var[D50]? Var[D50] =
[1Pt] WhatisD400 +S400 +A400 +M400 +O400? D400 +S400 +A400 +M400 +O400 =
(b)
For parts b.i and b.ii, let:
• ND =􏰂500 Di i=1
• NS =􏰂500 Si i=1
• NM = 􏰂500 Mi i=1
• NA =􏰂500 Ai i=1
• NO =􏰂500 Oi i=1
i. [1 Pt] What is E[NA]? E[NA] =
ii. [1Pt] WhatisVar(ND +NS +NA +NM +NO)? Var(ND +NS +NA +NM +NO)=
[2 Pts] Let’s consider the situation where we sample with replacement instead of taking a SRS. If we take a sample with replacement of 10 contributions, what is the probability that 3 were by Daniel, 3 were by Suraj, and 4 were by Mansi?
Probability =
(c)

Data 100 Page 4 of 22 Initials:
2 Relative Mean Squared Error [6 Pts.]
Consider a set of points {x1, x2, …, xn}, where each xi ∈ R, and further suppose we want to determine a summary statistic c for this data. Naturally, our choice of loss function determines the optimal c.
In this problem, let’s consider a new loss function l(c) = (x−c)2/x. We call this loss function the relative squared error loss. If we compute the average over an entire dataset, we get the empirical risk function below:
L(c)=1􏰃n (xi−c)2 n i=1 xi
Forexample,supposeourdatais[0.1, 0.2, 0.5, 0.5, 1],andweconsiderthesum- mary statistic c = 1. The empirical risk would be:
1 􏰀(0.1 − 1)2 (0.2 − 1)2 (0.5 − 1)2 (0.5 − 1)2 (1 − 1)2 􏰁 5 0.1 + 0.2 + 0.5 + 0.5 + 1
= (8.1+3.2+0.5+0.5) = 2.46 5
[6 pts] Give the summary statistic that minimizes the relative mean squared error for the data above, i.e. [0.1, 0.2, 0.5, 0.5, 1]. Make sure to show your work in the space below, correct answers will not be accepted without shown work.
cˆ =

Data 100 Page 5 of 22 Initials:
3 Election (Pandas) [12 Pts.]
You are given an DataFrame elections with the results of each U.S. presidential election. The first 8 rows of elections is shown on the left. The max votes Series on the right is described later on this page.
(a) [3 Pts] Suppose we want to add a new column called Popular Result that is equal to ’win’ if the candidate won the popular vote and ’loss’ if the candidate lost the popu- lar vote. Note, this is not the same thing as the Result column, e.g. Donald Trump won the 2016 election but lost the popular vote, i.e. did not have the largest value for Popular Vote in 2016.
To do this, we’ll start by using a new pandas function we have not learned in class called transform. For example, the code below creates a Series called max votes shown at the top right of this page.
Using the max votes Series, create the new Popular Result column in elections. Your code may not use any loops. We have done the first line for you. If you’re not quite sure what your goal is, we provide a picture of the result on the next page. You may not need all lines. Hint: The .loc feature in pandas accepts boolean arrays for either of its arguments.
elections[“Popular_Result”] = “loss”
__________________________________________________________
__________________________________________________________
___________________________________________________ = “win”

Data 100 (b)
Page 6 of 22 Initials: [2 Pts] Below is the correct result for part a of this problem.
elections
Fill in the code below so that df is a dataframe with only candidates whose ultimate
result was not the same as the popular vote, i.e.
df
You may not need all lines. Make sure to assign df somewhere.
_______________________________________________________
_______________________________________________________
_______________________________________________________

Data 100 (c)
Page 7 of 22 Initials:
[4 Pts] Create a series win fraction giving the fraction each candidate won out of all elections participated in by that candidate. For example, Andrew Jackson participated in 3 presidential elections (1824, 1828, and 1832) and won 2 of these (1828 and 1832), so his fraction is 2/3. You should use the Result column, not the Popular Result column. For example, win fraction.to frame().head(9) would give us:
(d)
win fraction
You may not use loops of any kind. You do not need to worry about the order of the candidates. You may assume that no two candidates share the same name.
def f(s):
________________________________________
________________________________________
________________________________________
win_fraction = ________________________________________
[3 Pts] Create a series s that gives the name of the last candidate who successfully won
office for each party. That is, s.to frame() would give us:
s
elections_sorted = elections.sort_values(_____________)
winners_only = ____________________________________
s = winners_only.____________(___________)[____________].__________

Data 100 Page 8 of 22 Initials:
4 Regression [13 Pts.]
Recall from lab 9 the tips dataset from the seaborn library, which contains records about tips, total bills, and information about the person who paid the tip. Throughout this entire problem, assume there are a total of 20 records, though we only ever show 5. The first 5 rows of the resulting dataframe are shown below. The integer on the far left is the index, not a column of the DataFrame.
Suppose we want to predict the tip from the other available data. Four possible design matrices XMFB, XMF , XFB, and XF are given below.

Data 100 (a) i.
Page 9 of 22 Initials: [2 Pts] What is the rank of each of our four design matrices?
rank(XMFB)= ⃝1 ⃝2 ⃝3 ⃝4 ⃝5 ⃝19 rank(XMF)= ⃝1 ⃝2 ⃝3 ⃝4 ⃝5 ⃝19 rank(XFB)= ⃝1 ⃝2 ⃝3 ⃝4 ⃝5 ⃝19 rank(XF)= ⃝1 ⃝2 ⃝3 ⃝4 ⃝5 ⃝19
ii. [2 Pts] Recall that an Ordinary Least Squares (OLS) model is an
⃝20 ⃝20 ⃝20 ⃝ 20
unregularized lin-
ear model that minimizes the MSE for a given design matrix. Suppose we train three
different unregularized OLS models on XM F , XF B and XF , respectively. The result-
⃗⃗⃗
ing predictions given by each model are yˆM F , yˆF B , and yˆF . Which of the following
statements are true?
⃗⃗
􏰄 yˆ M F = yˆ F B
⃗⃗
􏰄 yˆ M F = yˆ F
⃗⃗
􏰄 yˆ F B = yˆ F
􏰄 None of These
iii. In lecture, we said that the residuals sum to zero for an OLS model trained on a
feature matrix that includes a bias term. For example, if SFB is the sum of the ⃗
residualsforyˆFB,thenSFB =0becauseXFB includesabiasterm.
i. [2Pts] LetSMF,SFB,andSF bethesumsoftheresidualsforourthreemodels. Which of the following are true? We have omitted SFB from the list below
because we already gave away the answer above. 􏰄 SMF =0 􏰄 SF =0 􏰄 Neitherofthese
ii. [2 Pts] Let SF , SF , and SF be the sums of the residuals for only female MFFB F
customers. For example, SF is the sum of the residuals for the 0th, 4th, etc. MF
rowsofXMF,SF isthesumoftheresidualsforthe0th,4th,etc.rowsofXFB, FB
and similarly for SF . Which of the following are true?
􏰄SF =0 􏰄SF =0 􏰄SF =0 􏰄Noneofthese MF FB F

Data 100 (b)
Page 10 of 22 Initials:
Suppose we create a new design matrix XB that contains only the total bill, size, and
a bias term. Suppose we then fit an OLS model on XB, which generates predictions ⃗
yˆ = [yˆ0, yˆ1, …, yˆ19] = [2.631665, 2.0483329, . . . ] with residuals ⃗r = [r0, r1, …, r19] = [−1.621665, −0.388329, . . . ].
Suppose we then do a very strange thing: We create a new design matrix W that has ⃗
the columns from XB, as well as two new columns corresponding to yˆ and ⃗r from our model on XB. Note: You’d never ever do this, but we’re asking as a way to probe your knowledge of regression. The first 5 rows of W are given below.
i. [2 Pts] What is the rank of W?
⃝0 ⃝1 ⃝2 ⃝3 ⃝4 ⃝5 ⃝10 ⃝20 ⃝40
ii. [3 Pts] Let βˆ1, βˆ2, βˆ3, βˆ4, βˆ5 be optimal parameters of a linear regression model on W, e.g. βˆ4 is the weight of the yhat column of our data frame. Give a set of parameters that minimizes the MSE.
βˆ1 = βˆ2 = βˆ3 = βˆ4 = βˆ5 =

Data 100 Page 11 of 22 Initials:
5 Alternate Classification Techniques [14 Pts.]
The primary technique for binary classification in our course was logistic regression, where we first calculated P(Y = 1|⃗x) = σ(⃗xTβ⃗), then applied a threshold T to compute a label (either 0 or 1). In other words, we predict yˆ = f(⃗x) = I(σ(⃗xT β⃗) > T), where I is an indicator function (i.e. returns 1 if the argument is true, 0 otherwise).
We trained such a model by finding the β⃗ that minimizes the cross entropy loss between our predicted probabilities and the true labels.
In this problem we’ll explore some variants on this idea.
(a) In this part, we’ll consider various loss functions.
i. [2 Pts] Suppose our true labels are ⃗y = [0, 0, 1], our predicted probabilities of being
in class 1 are [0.1, 0.6, 0.9], and our threshold is T = 0.5. Give the total (not average) cross-entropy loss. Do not simplify your answer.
Total CE Loss =
ii. [2 Pts] For the same values as above, give the total squared loss. Do not simplify your answer.
Squared Loss =
iii. [2 Pts] Which of the following are valid reasons why we minimized the average cross-entropy loss rather than the average squared loss?
􏰄 To prevent our parameters from going to infinity for linearly separable data.
􏰄 There is no closed form solution for the average squared loss.
􏰄 To improve the chance that gradient descent converges to a good set of
parameters.
􏰄 The cross entropy loss gives a higher penalty to very wrong probabilities.
􏰄 None of the above
iv. [1Pt] Athirdlossfunctionwemightconsideristhezero-oneloss,givenbyLZO(y,yˆ)= I(y ̸= yˆ). In other words, the loss is 1 if the label is incorrect, and 0 if it is correct. For the same values above, what is the total zero-one loss?
⃝0⃝1⃝2⃝3
v. [2 Pts] The zero-one loss is a function of both β⃗ and T . This is in contrast to the
⃗ ⃗ˆ ˆ
cross-entropy loss, which is only a function of β. Let βZO and TZO be parameters
⃗ˆ ˆ that minimize the zero-one loss. Which of the following are true about βZO and TZO?
􏰄 They maximize accuracy. 􏰄 They maximize precision. 􏰄 They maximize recall. 􏰄 None of these

Data 100
Page 12 of 22 Initials:
(b)
vi. [3 Pts] A DS100 student wants to run gradient descent on the total zero-one loss to
⃗ˆ ˆ
find optimal βZO and TZO. Give the very specific reason that this will always fail.
Answer in 10 words or less. Vague answers will be given no credit.
[2 Pts] In this part, we’ll consider an alternative to the logistic function.
Instead of using the logistic function as our choice of f , let’s say we instead use a scaled
inverse tangent function, f(x) = 1 tan−1(x) + 1. This choice of f has the exact same π2
tail-end behavior as σ(x). In other words, it is always between 0 and 1. A plot of f is below:
Which of the following are true?
􏰄 Thecrossentropylossisstillwelldefinedforallpossibleoutputsofourmodel.
􏰄 We are still able to construct an ROC curve and use the AUC as a metric for our classifier.
􏰄 We can still compute a confusion matrix from our classifier.
􏰄 We can still assume that log P (Y =1|x) is linear.
􏰄 None of the above
Linear Separability [4 Pts.]
6
P (Y =0|x)
Suppose we fit a logistic regression model with two features x1,x2, and find that with clas- ⃗ˆ ˆ ˆ
sification threshold T = 0.75 and β = [β1, β2] = [2, 3], we achieve 100% training accuracy. Let x2 = mx1 + b be the equation for the line that separates the two classes. Give m and b (you may leave your answers in terms of ln). Hint: You might find the following fact useful: σ(ln(3)) = 0.75.
m= b=

Data 100 Page 13 of 22 Initials:
7 ROC Curves [5 Pts.]
Here, we present a ROC curve, with unlabelled axes.
(a) [4 Pts] Fill in the pseudocode below to generate a ROC Curve. (Ignore the ”X” above.)
Hint: You can convert a boolean array to an array of 1’s and 0’s by multiplying the array by 1:
>>> y
array([False, False, True, True], dtype=bool)
>>> 1 * y
array([0, 0, 1, 1])
predicted_probs = np.array([0.37, 0.1, …])
y_actual = np.array([1, 0, …])
thresholds = np.linspace(______, _______, 1000)
tprs, fprs = [], []
for t in ________________________________:
y_pred = ________________________________
a = np.sum((y_pred == y_actual) & (y_pred == 1))
b = np.sum((y_pred == y_actual) & (y_pred == 0))
c = np.sum((y_pred != y_actual) & (y_pred == 1))
d = np.sum((y_pred != y_actual) & (y_pred == 0))
tprs.append(________________________________)
fprs.append(________________________________)
plt.plot(fprs, tprs)
(b) [1 Pt] Which of the following classification thresholds most likely corresponds to the point marked with an ”X” above?
⃝0.1 ⃝0.65 ⃝0.9 ⃝1.0

Data 100
Page 14 of 22
Initials:
8
(a)
PCA [7 Pts.]
Consider the matrix X below.
(b)
2 0 −5
Suppose we decompose X using PCA into X = UΣV T . Let r x c be the dimensions of
VT.
i. [1 Pt] What is r?
⃝0 ⃝1 ⃝2 ⃝3 ⃝4 ⃝5 ⃝Noneofthese
ii. [1 Pt] What is c?
⃝0 ⃝1 ⃝2 ⃝3 ⃝4 ⃝5 ⃝Noneofthese
[3 Pts] Let P be the principal component matrix of X. That is, P = UΣ.
Suppose we now decompose the principal component matrix P into its principal compo- nents,givingusP =UPΣPVPT.WhatisVPT?
VPT =
Consider the statement: ”When we created 2D PCA scatter plots in this class, we were usually plotting the first 2 _________ of __________”?
i. [1 Pt] For the first blank, what is the appropriate word? ⃝ rows ⃝ columns
ii. [1 Pt] For the second blank, what is the appropriate object?
⃝ X ⃝ U ⃝ Σ ⃝ VT ⃝ UΣ ⃝ ΣVT ⃝ UΣVT
0 2 −1
0 2 −2 X=1 1 −3 1 1 −4
(c)

Data 100
Page 15 of 22 Initials:
9
(a)
SQL [10 Pts.]
[5 Pts] In this problem, we have the two tables below, named brackets and names respectively. The left table is a list of U.S. tax brackets, e.g. a person’s first \$9700 of income is taxed at 10%, income between \$9701 and \$39475 is taxed at 12%, etc. The right table is a list of people, their ages, and incomes.
brackets names
Give a SQL query that results in the table below, except that the order of your rows may be different. Here, the rate column represents the highest tax bracket at which their income is taxed. For example, Lorenza earns \$165,743, so her highest income is taxed at the 32% rate. The how much column says how much of the person’s income is taxed at this rate, e.g. \$5,017 of Lorenza’s income is taxed at 32% since her income exceeds the low of the 32% bracket of \$160,726 by \$5,017. Your output should have the same column names as the example below.
SELECT ___________________________________________________
FROM _____________________________________________________
WHERE ________________________ AND _______________________;

Data 100 (b)
Page 16 of 22 Initials:
[5 Pts] For this problem, we have the ds100 grades table below.
Suppose we want to generate the table below.
The table above provides the average grade on HW5 for students with ”ee” in their name, separated into two groups: those who have taken Data 8 and those who have not. For example, Akeem and Desiree both have ”ee” in their names, and have taken Data 8. The average of their scores is 91. Kathleen has an ”ee” in her name, but has not taken Data 8. Since she is the only person in the table who has not taken Data 8, the average is just her score of 95. Penelope and Ashoka do not have ”ee” in their names, so their data will not get included in the table. Each table includes the count, the HW5 average, and whether the row corresponds to students who took Data 8 or not. Give a query below that generates this table. The order of your rows does not matter. Your output should have the same column names as the example below.
SELECT ___________________________________________________
FROM _____________________________________________________
WHERE ____________________________________________________
__________________________________________________________;

Data 100 Page 17 of 22 Initials:
10 Decision Trees [8 Pts.]
Suppose we are trying to train a decision tree model for a binary classification task. We denote the two classes as 0 (the negative class) and 1 (the positive class) respectively. Our input data consists of 6 sample points and 2 features x1 and x2.
The data is given in the table below, and is also plotted for your convenience on the right.
(a) [2 Pts] What is the entropy at the root of the tree? Do not simplify your answer. entropy =
(b) [3 Pts] Suppose we split the root note with a rule of the form xi ≥ β, where i could be either 1 or 2. Which of the following rules minimizes the weighted entropy of the two resulting child nodes?
⃝x1≥3 ⃝x1≥4.5 ⃝x1≥8.5 ⃝x2≥3.5 ⃝x2≥4.5
(c) [3 Pts] Now, suppose we split the root note with a different rule of the form below:
x1 ≥β1 andx2 ≤β2,
where β1,β2 are the thresholds we choose for splitting. Give a β1 and β2 value that
minimizes the entropy of the two resulting child nodes of the root. β1 =
β2 =

Data 100 Page 18 of 22 Initials:
11 Clustering [7 Pts.]
(a) The two figures below show two datasets clustered into three clusters each. For each dataset, state whether the given clustering could have been generated by the K-means and Max- agglomerative clustering algorithms. By max-agglomerative we mean the exact algorithm discussed in class, where the distance between two clusters is given by the maximum distance between any two points in those clusters.
Note: There are no hidden overplotted cluster markers. For example, there’s no need to look closely at all the triangles to see if there is a square or circle hidden somewhere.
i. [2 Pts] Dataset 1:
􏰄 K-means 􏰄 Max-agglomerative 􏰄 None of these
ii. [2 Pts] Dataset 2:
􏰄 K-means 􏰄 Max-agglomerative 􏰄 None of these
(b) For each of the following statements, say whether the statement is true or false.
i. [1 Pt] If we run K-Means clustering three times, and the generated labels are exactly equal all three times, then the locations of the generated cluster centers are also exactly equal all three times.
⃝ True ⃝ False
ii. [1 Pt] Assuming no two points have the same distance, the cluster labels computed by
K-means are always the same for a given dataset. ⃝ True ⃝ False
iii. [1 Pt] Assuming no two points have the same distance, the cluster labels computed by Max-agglomerative clustering are always the same for a given dataset.
⃝ True ⃝ False

Data 100 Page 19 of 22 Initials:
12 Potpourri [16 Pts.]
(a) [1 Pt] Suppose we train an OLS model to predict a person’s salary from their age and get β1 as the coefficient. Suppose we then train another OLS model to a predict a person’s salary from both their age and number of years of education and get parameters γ1 and γ2, respectively. For these two models β1 = γ1.
⃝ Always True ⃝ Sometimes True ⃝ Never True
(b) [1 Pt] Suppose we train a ridge regression model with non-zero hyperparameter λ to predict a person’s salary from their age and get β1 as the coefficient. Suppose we then train another ridge regression model using the same non-zero hyperparameter λ to predict a person’s salary from both their age and number of years of education and get parameters γ1 and γ2, respec- tively. For these two models β1 = γ1.
⃝ Always True ⃝ Sometimes True ⃝ Never True
(c) [1 Pt] If we get 100% training accuracy with a logistic regression model, then the data is
linearly separable.
⃝ Always True ⃝ Sometimes True ⃝ Never True
(d) [1 Pt] If we get 100% training accuracy with a decision tree model, then the data is linearly separable.
⃝ Always True ⃝ Sometimes True ⃝ Never True
(e) [1Pt] Increasingthehyperparameterλinaridgeregressionmodeldecreasestheaverageloss.
⃝ Always True ⃝ Sometimes True ⃝ Never True
(f) [1 Pt] Let MSE1 be the training MSE for an unregularized OLS model trained on X1. Let MSE2 be the training MSE for an unregularized OLS model trained on X2, where X2 is just X1 with one new linearly independent column. If MSE1 > 0, then MSE2 < MSE1. ⃝ Always True ⃝ Sometimes True ⃝ Never True (g) [1 Pt] When using regularization on a linear regression model, you should center and scale the quantitative non-bias columns of your design matrix. ⃝ Always True ⃝ Sometimes True ⃝ Never True Data 100 Page 20 of 22 Initials: (h) [3 Pts] Suppose you have the following .xml file:

DS 100
Fall 2019 Josh Hug Deb Nolan

CS 61B
Spring 2019 Josh Hug
Fernando Perez

Which of the following XPath queries will return only the strings ”Josh Hug” and ”Deb Nolan” (can have multiple of each)? There is at least one correct answer.
􏰄 //professor/text()
􏰄 //professor/../class/professor/text()
􏰄 //class/professor/../class/professor/text()
􏰄 //semester/../professor/text()
􏰄 /catalog/class[name/text()=”DS 100″]/professor/text() 􏰄 /catalog/class/name[text()=”DS 100″]/professor/text()
(i) [3 Pts] Consider the regular expression dw{2,5}d+[hug+]\$
Which of the following strings match this regular expression? At least one of these is correct.
􏰄 123445dg
􏰄 1234dddhug 􏰄 61bdug
􏰄 61bdg
􏰄 61bdugggg
􏰄 1hello234gg
(j) [3 Pts] Consider the string 61bdugggg
Which of these regular expressions match the entire string? At least one of these is correct.
􏰄 dbug*|w*
􏰄 [61b]+d{1,3}[a-z]* 􏰄 d{2}b+[ds100][hug]* 􏰄 .*g\$
􏰄 61bdugggg
􏰄 61[b|d]{1}ug+

Data 100 Page 21 of 22 Initials:
13 HCE [6 Pts.]
In this problem, we will ask a somewhat sensitive and complex real world problem. We will be lenient in grading this problem, but we want you to provide an opinion and try to defend it. You will not be penalized for unpopular or ”politically incorrect” opinions. Joke answers will receive no credit. If there is something unclear in the problem description, write your assumption.
In a hypothetical course, all submitted work is automatically reviewed for cheating by plagiarism detection software. However, some students also have the entirety of their subjected to an intensive manual review at the end of the semester. It is not possible to manually review all student’s work due to the large number of students in the course.
One approach is to randomly select students for manual review. An alternate approach is to use a model to try to target students who are more likely to plagiarize. For example, a student who has all perfect scores on assignments but very poor midterm grades might warrant manual review.
Suppose you build a logistic regression model to classify students with one of two labels: ”investi- gate” or ”do not investigate”. Students who are given the ”investigate” label have all of their work carefully reviewed by a teaching assistant (TA) for evidence of cheating. Students who are given the ”do not investigate” label are not manually reviewed at all.
The model uses as features the full text of all of the student’s electronically submitted work, grades on each problem for all assignments and exams, submission times for electronically submitted work, and the full text of all the student’s Piazza posts. The model works by generating a plagiarism probability for each student. Students with a plagiarism probability above a certain threshold will be assigned the ”investigate” label. The model is trained on a dataset collected during previous semesters of the course, where each student has a true label corresponding to whether or not the student was caught plagiarizing.
(a) [3 Pts] Below, describe at least one benefit and at least one downside of using such a logistic regression model compared to the randomized approach.
(b) [3 Pts] Suppose we add a demographic factor to our design matrix, specifically whether the student is international or not. Suppose that after training, the coefficient related to the inter- national feature is non-zero. Is it ethical to include this feature in your model? Why or why not?

Data 100 Page 22 of 22 Initials:
14
(a)
(b)
1729 [0 Pts.]
[0 Pts] What is the height difference between Josh Hug and Suraj Rampure? (Make sure to specify units.)
Height difference =
[0 Pts] What should Josh name his new kid (assume female if you want a gender specific name)?
Name =