# CS代考程序代写 algorithm Agda python Name: Email: Student ID:

Name: Email: Student ID:

@berkeley.edu

DS-100 Midterm Exam Fall 2018

Instructions:

• This midterm exam must be completed in the 110 minute time period ending at

10:00, unless you have accommodations supported by a DSP letter.

• Note that some questions have bubbles to select a choice. This means that you should only select one choice. Other questions have boxes. This means you should select all that apply.

• When selecting your choices, you must fully shade in the box/circle. Check marks will likely be mis-graded.

• You may use a one-sheet (two-sided) study guide.

Honor Code:

As a member of the UC Berkeley community, I act with honesty, integrity, and respect for others. I am the person whose name is on the exam and I completed this exam in accordance with the honor code.

Signature:

1

DS100 Midterm,

Page 2 of 18 October 17, 2018

Syntax Reference Regular Expressions

“|” matches expression on either side of sym- bol. Has lowest priority.

“” match the following character literally. “?” match preceding literal or sub-expression

0 or 1 times.

“+” match preceding literal or sub-expression

one or more times.

Some useful Python functions and syntax

re.findall(pattern, st) returns the list of all sub-strings in st that match pattern.

Useful Pandas Syntax

“*” match preceding literal or sub-expression zero or more times.

“.” match any character except new line.

“[ ]” matchanyoneofthecharactersinside, accepts a range, e.g., “[a-c]”. All characters inside treated literally.

“( )” used to create a sub-expression. “{n}” preceding expression repeated n times.

np.random.choice(a, replace, size)

Generates a random sample from a con- sisting of size values (with replacement if replace=True). a can be 1-D array-like or int.

df.loc[row_selection, col_list] # row selection can be boolean df.iloc[row_selection, col_list] # row selection can be boolean pd.get_dummies(data) # Convert categorical variable into indicator values df.groupby(group_columns)[[’colA’, ’colB’]].agg(agg_func) df.groupby(group_columns)[[’colA’, ’colB’]].filter(filter_func)

Variance and Expected Value

The expected value of X is E[X] = mj=1 xjpj. The variance of X is V ar[X] = E[(X−E[X])2] = E[X2] − E[X]2. The standard deviation of X is SD[X] = V ar[X].

DS100 Midterm, Page 3 of 18

October 17, 2018

Sampling

1. Below is shown the first 5 rows of the table restaurants:

name

Marufuku Japanese 2241 Jack in the Box Fast Food 1592

cuisine size

Thai Basil Tako Sushi McDonald’s

Thai 820 Japanese 1739 Fast Food 1039

The table restaurants contains information regarding different restaurants. The name and cuisine columns contain strings, and the size column contains integers. The name column is the primary key of the table and therefore contains unique values. This is a preview of the first 5 rows of the table. You may assume it has many more rows than what is shown, with the same structure and no missing data.

Throughout problem 1 and 2, use the keywords and numbers from the answer bank below to fill in the blanks. Note that the same keyword/number can be used multiple times; some keywords may not be used at all. The documentation for some terms appears on the first page of this exam. This answer bank also appears on the back of your answer sheet.

1

2

3

4

5

6

7

8

9

10

11

12

<
>

<=
>=

==

restaurants

min rating df

pivot

agg

loc

index

iloc

sort values

size

filter

groupby

barplot

lineplot

jointplot

boxplot

’name’

’cuisine’

’size’

’rating’

max

min

median

mean

first

last

np.array

sample

pd.Series

list

values

True

False

n

x

(a) Complete the following function sample, which takes in a series and a sample size n and returns a simple random sample of n values in that series. Recall that a SRS is drawn without replacement. The result should be a list of the n values that are in the sample. For example, sample(restaurants[’name’], 10) should return a simple random sample of 10 restaurant names with no duplicates. The documentation for np.random.choice can be found on the first page of this exam.

def sample(series, n):

return _*_______(np.random.choice(_ ______._*

size=_

DS100 Midterm, Page 4 of 18 October 17, 2018

(b) Suppose that the probability that Jack in the Box appears in the simple random sample is 1 . What is the probability that McDonald’s appears in the sample? There is only one

10

correct answer.

⃝ A. < 1 10
⃝B. 1 10
⃝ C. > 1 10

⃝ D. Not enough information

(c) Whattypeofsampledoesrestaurants.groupby(’cuisine’)[’name’].first()

collect? Select all that apply.

A. Simple Random Sample B. Stratified Random Sample C. Cluster Sample

D. Probability Sample

E. None of the above

(d) Josh wants to collect a stratified random sample of restaurant names where the strata are the cuisine type, and he wants to collect 2 restaurant names per strata. Complete the following line of code to collect Josh’s desired stratified random sample.

restaurants._*_____(_ ______)[_*

lambda x: _

(e) Suppose there are 10 unique cuisine types with 15 fast food restaurants, 20 Japanese Restaurants, and 5 Thai Restaurants. There are 100 restaurants overall in the table with at least 2 restaurants for each cuisine type. What is the probability that McDonald’s is picked in Josh’s stratified sample from the previous problem?

DS100 Midterm, Page 5 of 18 October 17, 2018

(f) Fernando wants to collect a cluster sample, where each cluster is a cuisine type. Suppose Fernando wants to have 2 clusters in his cluster sample. Which of the following lines of code would create Fernando’s desired cluster sample? Select all that apply. At least one answer is correct. All of the code in all three possible answers is syntactically correct.

A.

B.

C.

restaurants[restaurants[’cuisine’].isin(

np.random.choice(restaurants[’cuisine’].unique(),

size=2, replace=True))][’name’]

restaurants[restaurants[’cuisine’].isin(

np.random.choice(restaurants[’cuisine’].unique(),

size=2, replace=False))][’name’]

restaurants[restaurants[’cuisine’].isin(

np.random.choice(restaurants[’cuisine’].values,

size=2, replace=False))][’name’]

(g) Withthesameassumptionsasinpart(e),whatistheprobabilitythatMcDonald’sappears in Fernando’s cluster sample?

(h) Manana goes for a third sampling strategy. She decides to take a cluster sample (just like Fernando) of 2 cuisines, but rather than collecting information about every restaurant in those two clusters, she then collects a SRS of one restaurant from each of her two ran- domly selected clusters, for a total of two restaurants. What type of sample has Manana collected? Select all that apply.

A. Simple Random Sample B. Stratified Random Sample C. Cluster Sample

D. Probability sample

E. None of the above

(i) Let H be the probability that McDonald’s appears in Manana’s sample from part (h). Suppose Manana repeats this the process in part (h) 5 times with replacement between samples so that each (two restaurant) sample is independent from the other 4 samples. Let X be a random variable that represents how many total times McDonald’s appears in these 5 samples. What is E[X]? Your answer should be in terms of H.

(j) What is V ar(X)? Again, your answer should be in terms of H.

DS100 Midterm, Page 6 of 18 October 17, 2018

Pandas

2. For the following problems, suppose we add a new column to our restaurants table which con- tains the average rating of each restaurant by users of a restaurant review service. Remember to only use terms from the table in the previous section (or on the back of your answer sheet)!

name cuisine size rating Marufuku Japanese 2241 4.5 Jack in the box Fast Food 1592 3.4

Thai Basil Tako Sushi McDonald’s

(a) Leta”lowestrankrestaurant”bearestaurantthathasthelowestratingforagivencuisine. Create a table of all lowest rank restaurants (i.e. one row for each cuisine). Include the restaurant’s name, size, and average rating.

min_rating_df = restaurants._*____________(_ ___________)*

._

._

(b) Create a chart that is useful for visualizing the ratings for the lowest rank restaurants.

sns._

*_______(min_rating_df.index, ________[_*_______])

(c) Change the ratings of all Japanese restaurants to be 5.

restaurants.loc[_

(c) Change the ratings of all Japanese restaurants to be 5.

restaurants.loc[_

*_______[________]_*_______ ’Japanese’,

____________] = ___________

(d) Return a list of all cuisines whose restaurants have an average size greater than or equal to 1000.

list(restaurants._

_

(d) Return a list of all cuisines whose restaurants have an average size greater than or equal to 1000.

list(restaurants._

*__________(___________)*

.______(

lambda x:x[________].________() >= 1000)

[___________].unique())

Thai

Japanese

Fast Food 1039 3.5

820 4.7 1739 2.3

._

lambda x:x[_

[_

Thai

Japanese

Fast Food 1039 3.5

820 4.7 1739 2.3

DS100 Midterm, Page 7 of 18 October 17, 2018

Visualization

3. Suppose we have the following histograms.

(a) Which of the histograms above is right skewed? There is only one correct answer. ⃝ A. World Bank Female Adult Literacy Rate

⃝ B. World Bank Gross National Income Per Capita

(b) Which histogram would look more symmetric with a log transformation applied? There is only one correct answer.

⃝ A. World Bank Female Adult Literacy Rate

⃝ B. World Bank Gross National Income Per Capita

DS100 Midterm, Page 8 of 18 October 17, 2018 4. Consider the scatter plot shown below.

Which of the following transformations would make the relationship between x and y more linear, i.e. if we plotted fy(y) vs. fx(x), which would look most linear? There is only one correct answer.

⃝A. fx(x)=x2

⃝ B. fx(x) = log(x) ⃝ C. fx(x) = x

⃝ D. fx(x) = log(x)

fy(y)=y2 fy(y) = y fy(y) = log(y)

fy(y) = y2

5. For each of the following relationships between x and y, select the appropriate transformation so that the transformed values are linearly related. In other words, select the transformations such that if we plotted fy(y) vs. fx(x), we’d expect to get a straight line. There is only one correct answer in each part.

(a) y = abx ⃝ A. ⃝ B. ⃝ C. ⃝ D. ⃝ E.

(b)y= x a+bx

⃝A. ⃝ B.

⃝ C. ⃝ D. ⃝ E.

fy(y) = log(y)

fy(y) = y

fy(y) = log(y)

fx(x) = log(x) fx(x) = log(x) fx(x) = x fx(x) = 1

fy(y) = 1 yx

The relationship is already linear.

fy(y)= 1 y

fy(y) = y fy(y) = 1

fx(x)=x fx(x) = 1

x

fx(x) = 1 yx

None of the transformations above create a linear relationship. The relationship is already linear.

DS100 Midterm, Page 9 of 18 October 17, 2018

Regular Expressions

6. Recall that the re.findall(regex, string) method returns a list of all matching strings,e.g.re.findall(’cow’, ’A cow = cow.’)wouldreturn[’cow’, ’cow’].

For the following regular expression, which of the following strings would result in exactly one match? By one match, we mean the list returned by findall is of length 1.

’[a-z][aueoi][a-z]+’

A. ’ba’

B. ’bat’

C. ’batch’

D. ’batches’

E. ’batches of bees’

F. None of the above

7. For the following regular expression, which of the following strings would result in exactly

one match?

’(burrito|dog){2}’

A. ’dogdog’

B. ’burritodog’ C. ’dogburrito’ D. ’burrito dog’ E. None of the above

8. How many matches would be returned by the code below (note the square brackets!): re.findall(“[Cow|Man]{2}”, “CowManCowCowManMan999”)?

⃝ A. 0 ⃝ B. 1 ⃝ C. 2 ⃝ D. 3 ⃝ E. 4 ⃝ F. 6 ⃝ G. 8 ⃝ H. 9 ⃝ I.12 ⃝ J.17 ⃝ K.18

9. What are the start and end positions for the match returned by re.search below? Use Python’s 0-indexing and semi-open intervals [a, b) notation. If you believe that no match is returned, answer with the empty interval [0, 0).

re.search(r’..*’, ’pic.jpg.*bak’)

(a) Theinclusivestartpositionis: ⃝ A.0 ⃝ B.3 ⃝ C.4 ⃝ D.7 ⃝ E.8 (b) Theexclusiveendpositionis: ⃝ A.0 ⃝ B.3 ⃝ C.7 ⃝ D.8 ⃝ E.11

DS100 Midterm, Page 10 of 18 October 17, 2018

Linear Models

10. Recall from lecture that a linear model is defined as a model where our prediction yˆ is given by the equation below, where d is the number of parameters in our model:

d

yˆ = f θ ( x ) = θ j φ j ( x )

j=1

Which of the following models are linear? Select all that apply. A. fθ(x)=θ1x+θ2sin(x)

B. fθ(x) = θ1x + θ2 sin(x2)

C. fθ(x)=θ1

D. fθ(x)=(θ1x+θ2)x

E. fθ(x)=ln(θ1x+θ2)+θ3

11. Suppose we have data about 5 people shown below.

name level trials phase Magda 1 10 1 Valerie 5 20 -1 Kumar 2 15 1 Octavia 6 30 1 Dorete 6 5 -1

(a) Supposewewanttomodelthelevelofeachperson,andusethefollowingconstantmodel: f (x) = θ . What is θˆ , the value that minimizes the average L2 loss?

θ11

(b) Wecanalsocomputeθˆfromthepreviouspartbyusingthenormalequationθˆ=(ΦTΦ)−1ΦTY. ˆ

If we use the normal equation to compute θ, how many rows and columns are in the feature matrix Φ? Write your answer in the form # rows × # columns, e.g. 1 × 1.

(c) What is (ΦT Φ)−1ΦT ? Write your answer in the form of a Python list, e.g. [1, 2, 3].

DS100 Midterm, Page 11 of 18 October 17, 2018

Gradient Descent

12.

Momentum is a common variation of gradient descent in which we include the gradient at a previous step of the iteration in our current update equation. More formally it is defined as follows, where γ is the weight of momentum.

θt+1=θt−α∂L −γ∂L

∂θ θt ∂θ θt−1

Fill in the code with the following keywords and numbers to implement gradient descent with momentum. Assume when t = 0 and t = −1, θt = t0.

Note that the same keyword/number can be used multiple times; some keywords may not be used at all. Only use one keyword per blank. You may not need all blanks.

def grad(phi, y, theta):

“””Returns dL/dtheta. Assume correct implementation.”””

def grad_desc_momentum(phi, y, num_iter, alpha, gamma, t0): “”” Returns theta computed after num_iter iterations. phi: matrix, design matrix

y: vector, response vector

num_iter: scalar, number of iterations to run

alpha: scalar, learning rate

gamma: scalar, weight of momentum

t0: theta for t=0

“””

theta, theta_prev = ___________, _**_____________
theta
phi
y
theta prev
num iter
t0
temp
alpha
gamma
range
len
t
1 2 3 4 5 6 7 8 9
10
11
12
13
14
15
16
17
18
19
for
_**

*____ – _*___ * g – _____ * m,
______________ return theta
Recall that python allows multiple assignment, e.g. ”one, two = two, one” would swap the values of one and two.

DS100 Midterm, Page 12 of 18 October 17, 2018 13. Consider the following function of f(θ), which alternates between completely flat regions and

regions of absolute slope equal to 1. There is only one correct answer for each part.

(a) Assuming that θ starts in a flat region that is not a minimum and α > 0, will the basic gradient descent algorithm terminate at a minimum? Note that the basic gradient descent algorithm is just the same as version with momentum on the previous page, but where γ = 0.

⃝ A. Never ⃝ B. Maybe ⃝ C. Yes with enough iterations

(b) Assuming that θ starts in a sloped region and α > 0, will the basic gradient descent

algorithm find the minimum?

⃝ A. Never ⃝ B. Maybe ⃝ C. Yes with enough iterations

(c) Assuming that θ starts in a flat region that is not a minimum and α > 0 and γ > 0, will the momentum gradient descent algorithm find the minimum?

⃝ A. Never ⃝ B. Maybe ⃝ C. Yes with enough iterations

(d) Assuming that θ starts in a sloped region and α > 0 and γ > 0, will the momentum

gradient descent algorithm find the minimum?

⃝ A. Never ⃝ B. Maybe ⃝ C. Yes with enough iterations

(e) Is f(θ) convex? ⃝ A. Yes

⃝ B. No

⃝ C. No, but −f(θ) is convex ⃝ D. No, but f(−θ) is convex

DS100 Midterm, Page 13 of 18 October 17, 2018

Feature Engineering

You Can’t Forget About Those Tips

14. Recallfromlabs5and6thetipsdatasetfromtheseabornlibrary,whichcontainsrecordsabout tips, total bills, and information about the person who paid the tip. There are a total of 244 records in tips. In addition, you can assume that there are no missing or NaN values in the dataset. The first 5 rows of the tips DataFrame are shown below, where sex takes on values ∈ {”M ale”, ”F emale”}, smoker takes on values ∈ {”Y es”, ”N o”}, day takes on values from Monday to Sunday as strings, and time takes on values ∈ {”Breakf ast”, ”Lunch”, ”Dinner”}.

(a) Suppose we use pd.get dummies to create a one-hot encoding of only our sex col- umn. This yields a feature matrix Φq1 with exactly 2 columns sex Male, sex Female, where values can be either 0 or 1 in each column.

Which of the following are true? Select all that apply. A. Φq1 has 244 rows.

B. Φq1 has full column rank.

C. (ΦTq1Φq1) is invertible.

D. None of the above

(b) Suppose we use pd.get dummies to create a one-hot encoding of only our sex and

smoker columns. This yields a feature matrix Φq2 with 4 columns. Which of the following are true? Select all that apply.

A. Φq2 has 244 rows.

B. Φq2 has full column rank. C. (ΦTq2Φq2) is invertible.

D. None of the above

DS100 Midterm, Page 14 of 18 October 17, 2018

(c) Suppose we use pd.get dummies to create a one-hot encoding of only our sex and smoker columns, and also include a bias column. This yields a feature matrix Φq3 with 5 columns.

Which of the following are true? Select all that apply. A. Φq3 has 244 rows.

B. Φq3 has full column rank.

C. (ΦTq3Φq3) is invertible.

D. None of the above

(d) For the day column, we can either use a one-hot encoding or an integer encoding. By integer encoding, we mean mapping Monday to 1, Tuesday to 2, and so on. Which of the following statements are true? Select all that apply.

A. One-hot encoding creates fewer columns than integer encoding.

B. One-hot encoding gives all days of the week the same weight, while integer

encoding gives certain days of the week higher weight than others.

C. The columns generated by the one-hot encoding of the days of the week are linearly independent of each other.

D. None of the above

More Feature Engineering

15. Which of the following are reasons to use N -Gram Encoding (N > 1) instead of Bag-of-words (1-Gram) encoding? For simplicity, you may assume that N = 2 for N-Gram Encoding. Select all that apply.

A. B. C.

Vectors for N -Gram encoding are less sparse than vectors from Bag-of-words en- coding.

Vectors for N -Gram encoding have lower dimension than vectors from Bag-of- words encoding.

It is easier for N -Gram encoding to deal with unseen combinations at prediction time.

D. N -Gram helps preserve word order while Bag-of-words does not. E. None of the above

DS100 Midterm, Page 15 of 18 October 17, 2018

Regularization

Elastic Net Regularization

16. Elastic Net is a regression technique that combines L1 and L2 regularization. It is preferred in many situations as it possesses the benefits of both LASSO and Ridge Regression. Minimizing theL2lossusingElasticNetisasfollows,whereλ1,λ2 >=0,λ1 +λ2 =λ,λ>0.

1pp θˆ=argmin (y −θx)2 +λ |θ |+λ θ2

θni1j2j i j=1 j=1

Suppose our goal was to get sparse parameters, i.e. we want as many parameters as possible to be zero. Which of the following choices for λ1, λ2 are most consistent with this goal, assuming λ = 1? There is only one correct answer.

⃝A. λ1 =0,λ2 =1 ⃝B. λ1 =0.5,λ2 =0.5 ⃝C. λ1 =1,λ2 =0

17. What happens to bias and variance as we increase the value of λ? Assume λ2 = λ1. There is only one correct answer in each part. You will be asked to justify why in the next question.

(a) Bias:

⃝ A.

⃝ B. ⃝ C.

(b) Variance: ⃝ A. ⃝ B. ⃝ C.

Bias goes up

Bias stays the same Bias goes down

Variance goes up Variance stays the same Variance goes down

18. Justify why by marking the true statements. Select all that apply for each part.

(a) Bias:

A.

B. C.

D. E.

Bias goes down because increasing λ reduces over fitting. Bias goes down because bias is minimized when λ2 = λ1.

Bias goes up because increasing λ penalizes complex models, limiting the set of possible solutions.

Bias goes up because the loss function becomes non-convex for sufficiently large λ.

None of the above

DS100

(b) Variance:

A.

B. C. D.

E. 19. What happens

apply.

Midterm, Page 16 of 18 October 17, 2018

Variance goes down because increasing λ encourages the value of the loss to decrease.

Variance goes down because increasing λ penalizes large model weights. Variance goes up because because increasing λ increases bias.

Variance goes up because increasing λ increases the magnitude of terms in the loss function.

None of the above

to the model parameters θ as λ → ∞, i.e. what is limλ→∞ θ? Select all that

ˆˆ

A. Converge to 0.

B. Diverge to infinity.

C. Converge to values that minimize the L2 loss.

D. Converge to equal but non-zero values.

E. Converge to a sparse vector.

20. Of the choices below, why do we prefer to use ridge regression over linear regression (i.e. the normal equation) in certain cases? Select all that apply.

A. B. C. D. E.

Ridge regression always guarantees an analytic solution, but the normal equation does not.

Ridge regression encourages sparsity in our model parameters, which is helpful for inferring useful features.

Ridge regression isn’t sensitive to outliers, which makes it preferable over linear regression.

Ridge regression always performs just as well as linear regression, with the added benefit of reduced variance.

None of the above

DS100 Midterm, Page 17 of 18 October 17, 2018

More Linear Models

21. Suppose in some universe, the true relationship between the measured luminosity of a single star Y can be written in terms of a single feature φ of that same star as

Y =θ∗φ+ε

where φ ∈ R is some non-random scalar feature, θ∗ ∈ R is a non-random scalar parameter,

and ε is a random variable with E[ε] = 0 and Var(ε) = σ2. For each star, you have a set

of features Φ = φ1 φ2 … φnT and luminosity measurements y = y1 y2 … ynT generated by this relationship. Your Φ may or may not include the feature φ described above. The εi for the various yi have the same probability distribution and are independent of each other.

(a) What is E[Y ]?

⃝A. 0 ⃝B. θ∗φ ⃝C. φ(ΦTΦ)−1ΦTy

⃝D. θ∗ ⃝E. Noneoftheabove

(b) What is Var(Y )?

σ2 σ2

⃝A.n ⃝B.n2 ⃝C.0

1 n 1 n 2

⃝D. n−1 yi−n yi ⃝E. Noneoftheabove

i=1 i=1

(c) Suppose you have information about the exact φ value for each star, but try to fit a linear

model for Y that includes an intercept term θ0.

Y = θ0 + θ1φ

Note the true relationship has no intercept term, so our model is not quite correct. Let θˆ 0

and θˆ be the values that minimize the average L loss. Let y be the actual observed data 12

and yˆ = θˆ + θˆ φ be the fitted values. 01

i. Which of the following could possibly be the value of θˆ after fitting our model? 0

Select all that apply; at least one is correct.

A.-1 B.0 C.1 D.10

ii. Which of the following could possibly be the residual vector for our model? Select

all that apply; at least one is correct.

A. −2 −4 6T B. 0.0001 0.0003 −0.0005T C. 3 12 −9T D. 1 1 1T

DS100 Midterm, Page 18 of 18 October 17, 2018

22. Suppose we create a new loss function called the OINK loss, defined as follows for a single observation:

a(fθ(x) − y) fθ(x) ≥ y LOINK(θ,x,y)= b(y−fθ(x)) fθ(x)