
Name:
DS-100 Practice Final Questions Fall 2017
Instructions:
• These are a random selection of previous final exam questions.
• This is not representative of the length of the final (it's too long!).
• You may use a single page (two-sided) cheat sheet.

DS100 Final, Page 2 of 42
1 Loss Minimization
1. In a petri dish, yeast populations grow exponentially over time. In order to estimate the growth rate of a certain yeast, you place yeast cells in each of n petri dishes and observe the population yi at time xi and collect a dataset {(x1, y1), . . . , (xn, yn)}. Because yeast populations are known to grow exponentially, you propose the following model:
log(y_i) = γx_i   (1)

where γ is the growth rate parameter (which you are trying to estimate). We would like to derive the L2-regularized least squares estimator.
(1) [4 Pts.] Write the regularized least squares loss function for γ under this model. Use λ as
the regularization parameter.
(2) [8 Pts.] Solve for the optimal γ as a function of the data and λ
Solution:

L(γ) = (1/n) Σ_{i=1}^n (log(y_i) − γx_i)^2 + λγ^2   (2)
Solution: Taking the derivative of the regularized loss function:

∂L(γ)/∂γ = (1/n) Σ_{i=1}^n ∂/∂γ (log(y_i) − γx_i)^2 + ∂/∂γ λγ^2   (3)
         = −(2/n) Σ_{i=1}^n (log(y_i) − γx_i)x_i + 2λγ   (4)
         = −(2/n) Σ_{i=1}^n log(y_i)x_i + (2γ/n) [nλ + Σ_{i=1}^n x_i^2]   (5)

Setting the derivative equal to zero and solving for γ:

0 = −Σ_{i=1}^n log(y_i)x_i + γ [nλ + Σ_{i=1}^n x_i^2]   (6)

γ [nλ + Σ_{i=1}^n x_i^2] = Σ_{i=1}^n log(y_i)x_i   (7)

γ = [nλ + Σ_{i=1}^n x_i^2]^{−1} Σ_{i=1}^n log(y_i)x_i   (8)
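The closed form above can be sanity-checked numerically. A minimal sketch with made-up data (the growth rate 0.5, the noise scale, and λ below are all hypothetical), comparing the formula against a brute-force grid search over the loss:

```python
import numpy as np

# Hypothetical data: exponential growth at rate 0.5 with small log-scale noise.
rng = np.random.default_rng(0)
x = np.linspace(1, 10, 50)
y = np.exp(0.5 * x + rng.normal(0, 0.05, size=x.size))
lam, n = 0.1, x.size

# Closed form: gamma = sum(log(y_i) x_i) / (n*lam + sum(x_i^2))
gamma_hat = np.sum(np.log(y) * x) / (n * lam + np.sum(x ** 2))

# Brute-force check against the loss L(gamma) itself.
def loss(g):
    return np.mean((np.log(y) - g * x) ** 2) + lam * g ** 2

grid = np.linspace(0, 1, 100_001)
gamma_grid = grid[np.argmin([loss(g) for g in grid])]
assert abs(gamma_hat - gamma_grid) < 1e-3
```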

2. Suppose we observe a dataset {x1, . . . , xn} of independent and identically distributed samples from the exponential distribution. Suppose we give you a "probability model" parameterized by λ:
fλ(x) = λe−λx
that estimates the probability of a particular data point. In addition we give you the “log-
likelihood” loss function as the following:
L(λ) = −n log(λ) + λ Σ_{i=1}^n x_i   (10)
Derive the parameter value λ that minimizes this loss function. Circle your answer.
Solution: Taking the derivative of the loss function with respect to the parameter λ we get:

∂L(λ)/∂λ = −n ∂/∂λ log(λ) + ∂/∂λ λ Σ_{i=1}^n x_i   (11)
         = −n/λ + Σ_{i=1}^n x_i   (12)

To minimize the loss we set the above derivative equal to zero and solve:

0 = −n/λ̂ + Σ_{i=1}^n x_i   (13)

n/λ̂ = Σ_{i=1}^n x_i   (14)

λ̂ = n / Σ_{i=1}^n x_i   (15)

Thus the loss-minimizing estimate is:

λ̂ = n / Σ_{i=1}^n x_i = [(1/n) Σ_{i=1}^n x_i]^{−1} = 1 / Mean(x)   (16)
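This estimate can also be verified numerically. A small sketch with simulated data (the true rate 2 and the sample size are arbitrary):

```python
import numpy as np

# Simulated exponential sample with true rate lambda = 2 (hypothetical).
rng = np.random.default_rng(1)
x = rng.exponential(scale=1 / 2, size=10_000)
n = x.size

lam_hat = n / x.sum()  # the closed-form minimizer, i.e. 1 / Mean(x)

def loss(lam):
    return -n * np.log(lam) + lam * x.sum()

# The loss at the closed-form estimate beats nearby values.
assert loss(lam_hat) < loss(0.9 * lam_hat)
assert loss(lam_hat) < loss(1.1 * lam_hat)
```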

3. Suppose we collect a dataset of n observations {x1, . . . , xn} which we believe are drawn from a distribution with the following PDF:
f_μ(x) = C exp(−(x − μ)^6 / 6)   (18)

where C is a constant that does not depend on μ. As before we are given the loss function:

L(μ) = −n log C + (1/6) Σ_{i=1}^n (x_i − μ)^6   (19)

(1) [4 Pts.] Compute the derivative of the loss with respect to μ.

Solution: Taking the derivative:

∂L(μ)/∂μ = −∂/∂μ [n log C] + (1/6) Σ_{i=1}^n ∂/∂μ (x_i − μ)^6   (20)
         = 0 − (6/6) Σ_{i=1}^n (x_i − μ)^5   (21)
         = −Σ_{i=1}^n (x_i − μ)^5   (22)

(2) [3 Pts.] Because there is no closed form solution for μ in ∂L(μ)/∂μ = 0, we would likely use gradient descent. Given the gradient

g(μ) = ∂/∂μ L(μ),   (23)

and a step size ρ(t), what is the gradient descent update rule to go from μ(t) to μ(t+1)? (Hint: your answer should contain only the variables g(μ(t)), μ(t), μ(t+1), and ρ(t).)

Solution: Recall that the gradient points in the "uphill" direction. When we minimize we want to go in the opposite direction (the negative gradient). So the update rule would look like:

μ(t+1) ← μ(t) − ρ(t) g(μ(t))   (24)
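A minimal gradient descent sketch for this loss, using a tiny hypothetical sample (for data symmetric around 3, the minimizer is μ = 3):

```python
import numpy as np

# Tiny hypothetical sample; by symmetry sum((x_i - mu)^6) is minimized at mu = 3.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

def g(mu):
    # the derivative computed in part (1): -sum((x_i - mu)^5)
    return -np.sum((x - mu) ** 5)

mu, rho = 0.0, 1e-4  # constant step size, chosen small enough to be stable
for _ in range(5000):
    mu = mu - rho * g(mu)  # mu^(t+1) <- mu^(t) - rho^(t) g(mu^(t))

assert abs(mu - 3.0) < 1e-3
```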

2 Wrangling and Querying Data

2.1 SQL
For the questions in this subsection, assume we have a massive database in the cloud with the following schema:
-- A simple digital media store database
CREATE TABLE media (
    mid integer PRIMARY KEY,
    name text, type char, year_released integer, length integer,
    buy_cost float, rent_cost float, avg_rating float);

CREATE TABLE customers (
    cid integer PRIMARY KEY,
    name text, joined date, nation_id integer, activity_level integer);

CREATE TABLE transactions (
    tid integer PRIMARY KEY,
    tdate date, item integer, customer integer,
    rent_or_buy integer, price_paid float, percent_viewed float,
    FOREIGN KEY (item) REFERENCES media,
    FOREIGN KEY (customer) REFERENCES customers);

CREATE VIEW stats AS
SELECT min(length) AS len_min, max(length) AS len_max,
       avg(length) AS len_mu, stddev(length) AS len_sigma,
       min(avg_rating) AS ar_min, max(avg_rating) AS ar_max,
       avg(avg_rating) AS ar_mu, stddev(avg_rating) AS ar_sigma
FROM media;

4. [4 Pts.] In the media table above, the type column encodes the type of media as a unique character code (e.g., ’S’ for song, ’M’ for movie, ’E’ for episode, etc.). Suppose we wanted to modify the stats view to display the stats for each type of media. Which of the following are true? (Select all that apply.)
A. We need to change the granularity of the view to be finer than it is above.
B. We need to add a GROUP BY type clause to the view.
C. It would be helpful to add media.type to the list of columns in the SELECT clause of the view.
D. The modified view should have more rows than the original view above.
E. None of the above.
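One way the modified view could look, sketched in sqlite3 (an illustration only: the rows are made up, and sqlite lacks stddev, so only min/max/avg are shown). Grouping by type makes the granularity finer and yields one row per media type:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE media (mid integer PRIMARY KEY, type char, length integer)")
conn.executemany("INSERT INTO media VALUES (?, ?, ?)",
                 [(1, "S", 3), (2, "S", 5), (3, "M", 120)])
conn.execute("""
    CREATE VIEW stats_by_type AS
    SELECT type, min(length) AS len_min, max(length) AS len_max,
           avg(length) AS len_mu
    FROM media
    GROUP BY type""")
rows = conn.execute("SELECT * FROM stats_by_type ORDER BY type").fetchall()
assert rows == [("M", 120, 120, 120.0), ("S", 3, 5, 4.0)]
```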
5. [3 Pts.] Which of the following queries finds the ids of media that are 2 standard deviations longer than the mean length? (Select only one.)

A. SELECT media.mid FROM media, stats
   WHERE media.mid = stats.mid
   AND media.length >= stats.len_mu + 2*(stats.len_sigma);

B. SELECT media.mid
   FROM media, stats
   WHERE media.length >= stats.len_mu + 2*(stats.len_sigma);

C. SELECT media.mid FROM media
   WHERE media.length >= avg(media.length) + 2*stddev(media.length);

D. None of the above.

2.2 SQL Sampling
The transactions table has 30 million (30 × 10^6) rows. It is too large to load into the memory of our laptop. We will extract a sample from the database server to process on our laptop in Python.
SELECT *
FROM transactions TABLESAMPLE Bernoulli(.0001);
6. [2 Pts.] In expectation, how many rows will there be in the answer to this query?
7. Your friend Emily suggests using the following simpler query instead:

SELECT *
FROM transactions
LIMIT XX;

(where XX is replaced by the correct answer to the previous question). Select all the true statements:
A. Emily’s LIMIT query will probably run faster than the TABLESAMPLE query. For Emily’s query, the database engine can simply access the first XX rows it finds in the table, and skip the rest.
B. Emily’s query result may be biased to favor certain rows.
C. The output of the TABLESAMPLE query provides a hint about how many rows there are in the transactions table while Emily’s LIMIT query does not.
D. Emily’s LIMIT query may run fast, but it will swamp the memory on your laptop, since it doesn’t sample the database.
Solution: 30 × 10^6 × 0.0001 = 3000 rows in expectation.
Solution: True, for the reasoning given in the statement itself.
Solution: True. The database will optimize for speed, which will likely fa- vor clusters of records stored near each other.
Solution: True. You can extrapolate from the sample size and the sample probability to predict the table size.

Solution: False. Emily’s query will only return XX rows to the laptop.
E. None of the above.
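The expectation in question 6 is just n × p; a quick simulated check (the number of simulated runs is arbitrary):

```python
import numpy as np

# Each of the 30 million rows is kept independently with probability 0.0001,
# so the expected sample size is n * p = 3000.
n, p = 30 * 10**6, 0.0001
rng = np.random.default_rng(0)
sizes = rng.binomial(n, p, size=1000)  # simulate 1000 Bernoulli-sampled runs
assert n * p == 3000
assert abs(sizes.mean() - 3000) < 50
```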
8. [2 Pts.] You will recall from Homework 5 that it is possible to do bootstrap sampling in SQL by constructing a design table with two columns. Each of the columns used in that scheme is described by a single choice below. Identify the two correct choices:
A. A foreign key to the table being sampled.
B. A count column to capture the number of tuples in each bootstrap sample.
C. An identifier to group rows together into bootstrap samples.
D. A regularization column to prevent overfitting.
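One plausible realization of such a design table, sketched with pandas (the column names and sizes here are hypothetical, not the homework's exact scheme): each row records which tuple was drawn (a key into the sampled table) and which bootstrap sample it belongs to.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_rows, n_samples = 5, 3  # tiny, made-up sizes

design = pd.DataFrame({
    # groups rows together into bootstrap samples
    "sample_id": np.repeat(np.arange(n_samples), n_rows),
    # key into the sampled table, drawn with replacement
    "row_id": rng.integers(0, n_rows, size=n_rows * n_samples),
})
assert design.shape == (n_rows * n_samples, 2)
assert design["sample_id"].nunique() == n_samples
```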
2.3 Pandas
For the questions in this subsection, assume that we have pandas dataframes with the same schemas as described in the previous section on SQL. That is, we have a media dataframe with columns mid, name, type, year_released, et cetera. Assume that the index column of each dataframe is meaningless—the primary key is represented as a regular column.
9. [3 Pts.] Consider the following code snippet:

def get_average_price_paid(join_method):
    return (customers
            .merge(transactions, how=join_method,
                   left_on='cid', right_on='customer')
            .loc[:, 'price_paid']
            .fillna(0)  # <- Important
            .mean())

inner = get_average_price_paid('inner')
outer = get_average_price_paid('outer')
left = get_average_price_paid('left')
right = get_average_price_paid('right')

Assume that all item prices are positive, all transactions refer to valid customers in the customers table, but some customers may have no transactions.

(1) How are inner and outer related? Pick one best answer.
A. inner < outer
B. inner ≤ outer
C. inner = outer
D. inner ≥ outer
E. inner > outer

(2) How are left and right related? Pick one best answer.
A. left < right
B. left ≤ right
C. left = right
D. left ≥ right
E. left > right

(3) How are left and outer related? Pick one best answer.
A. left < outer
B. left ≤ outer
C. left = outer
D. left ≥ outer

10. [3 Pts.] We wish to write a python expression to find the largest amount of money spent by one person on any single date. We will use the following code:

biggie = transactions.groupby(_____)['price_paid'].sum().max()

What should we pass in as our groupby predicate? Select only one answer.
A. 'tdate'
B. 'customer'
C. ['item', 'tdate']
D. ['customer', 'tdate']
E. ['customer', 'item']

11. [6 Pts.] Fill in the following python code that finds the names of every customer who has spent over $100.

merged = customers.merge(__A__, left_on=__B__, right_on=__C__)
grouped = merged.groupby(__D__).__E__()
names = grouped[__F__].index

Solution:

merged = customers.merge(transactions, left_on='cid', right_on='customer')
grouped = merged.groupby('name').sum()
names = grouped[grouped.price_paid > 100].index
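The join relationships in question 9 can be seen on a tiny hypothetical pair of tables (customer 3 has no transactions, so outer-style joins add a NaN that fillna turns into a 0):

```python
import pandas as pd

customers = pd.DataFrame({"cid": [1, 2, 3]})  # customer 3: no transactions
transactions = pd.DataFrame({"customer": [1, 1, 2],
                             "price_paid": [10.0, 20.0, 30.0]})

def get_average_price_paid(join_method):
    return (customers
            .merge(transactions, how=join_method,
                   left_on="cid", right_on="customer")
            .loc[:, "price_paid"]
            .fillna(0)
            .mean())

inner = get_average_price_paid("inner")  # 20.0: only matched rows
outer = get_average_price_paid("outer")  # 15.0: customer 3 contributes a 0
left = get_average_price_paid("left")
right = get_average_price_paid("right")
assert inner >= outer and left == outer and right == inner
```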

DS100 Final, Page 10 of 42
12. [4 Pts.] We wish to find years where the average price paid (over all time) for products released in that year is greater than the average price paid across all transactions; from those years we want to return the earliest (smallest). We have the following code:
merged = transactions.merge(media, left_on="item", right_on="mid")
mean_price = merged.groupby("year_released").mean().price_paid.mean()  # Line A
by_year = merged.groupby("year_released").count()                      # Line B
is_greater = by_year[by_year.price_paid > mean_price]                  # Line C
result = is_greater.sort_index(ascending=False).index[0]               # Line D

Some of these lines need to be modified in order for the code to work properly. We have suggested replacements for each line below. Which lines need to be replaced? Select all that apply.

A. mean_price = merged.price_paid.mean()
B. by_year = merged.groupby("year_released").mean()
C. is_greater = by_year.where(by_year.price_paid > mean_price)
D. result = is_greater.sort_index(ascending=True).index[0]
E. All the lines are correct.
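A sketch of the snippet with the suggested replacements for lines A, B, and D applied, run on made-up data (all sample values are hypothetical):

```python
import pandas as pd

transactions = pd.DataFrame({"item": [1, 2, 3, 3],
                             "price_paid": [5.0, 10.0, 50.0, 40.0]})
media = pd.DataFrame({"mid": [1, 2, 3],
                      "year_released": [2000, 2001, 2001]})

merged = transactions.merge(media, left_on="item", right_on="mid")
mean_price = merged.price_paid.mean()                    # overall mean: 26.25
by_year = merged.groupby("year_released").mean()         # per-year mean price
is_greater = by_year[by_year.price_paid > mean_price]
result = is_greater.sort_index(ascending=True).index[0]  # earliest qualifying year
assert result == 2001
```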

DS100 Final, Page 11 of 42
3 Feature Engineering
For this question you were given the following sales data and asked to build a model to predict units sold based on the product attributes to guide the design of future products.
13. Write down a reasonable schema for this data.

ProdID  Name            Desc                                     Price   Category  Units Sold
13      Errorplane      "A truly uncaught exception . . . "      404.00  Toy       9
42      Rock Kit        "Launch into minerology with . . . "     123.45  Toys      1
54      Punative Jokes  "Jokes that will get you fined . . . "   1.00    Books     30

Solution:

Products(prodId INTEGER, name VARCHAR, desc VARCHAR,
         price REAL, cat CHARACTER(20), sold INTEGER);
14. Suppose we are interested in building a linear predictive model. For each of the columns indicate which (one or more) of the feature transformations could be appropriate.
(1) The ProdID column:
A. Drop the column
B. One-Hot Encoding
C. Leave as is

Solution: Because we are trying to predict sales of future products and each product is likely to have a unique product id, this feature is not likely to be helpful and may harm predictions.

(2) The Name column:
A. The length of the text in characters
B. One-Hot Encoding
C. Bag-of-words Encoding
D. Leave as is

Solution: The length of the name could be helpful in predicting how well a product sells. Perhaps long names are hard to remember? A one-hot encoding would not likely be helpful unless we are going to make new products with the same name as old products. A bag-of-words encoding could be helpful if, for example, certain words like "kit" implied a decrease or increase in sales.
(3) The Desc column:
A. The length of the text in characters
B. One-Hot Encoding
C. Bag-of-words Encoding
D. bi-gram Encoding
E. Leave as is

Solution: This is similar to the Name column, but given the longer text a bi-gram featurization may also be helpful.

(4) The Price column:
A. The length of the text in characters
B. One-Hot Encoding
C. Bag-of-words Encoding
D. Convert the price to an indicator indicating if it is less than 19.99.
E. Leave as is

Solution: As a floating point number we could leave the price as is. We might however also compare the price to a few thresholds (e.g., 19.99 or 199.99) to test if the price is in some critical sales range.

(5) The Category column:
A. The length of the text in characters.
B. One-Hot Encoding
C. Bag-of-words Encoding
D. N-Gram Encoding
E. Leave as is

Solution: Categorical data is typically best represented in a one-hot encoding.

15. It might be reasonable to assume that the relationship between units sold and price differs for each category (e.g., an expensive toy might be less likely to sell than expensive jewelry). Which of the following feature functions might capture this intuition?

A. φ(category, price) = category + price
B. φ(category, price) = price × category
C. φ(category, price) = OneHot(category) + price
D. φ(category, price) = price × OneHot(category)
E. φ(category, price) = category × OneHot(price)
F. φ(category, price) = Concatenate(OneHot(category), price)

Solution: D. This will result in a vector of length k, where k is the number of distinct categories. Each element will be equal to the price for examples that match the category, and 0 otherwise. So the learned coefficient for each element is the slope for prices for that category. Most of the other choices would result in type errors, since it doesn't make sense to add or multiply a category (a string) by any number. Choice F would run, but it would only allow predictions for each category to vary by additive factors; each category would get a different intercept rather than a different slope on price.
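The price × OneHot(category) feature can be sketched directly (the data values below are hypothetical):

```python
import numpy as np

categories = ["Toy", "Books", "Toy"]
prices = np.array([404.00, 1.00, 123.45])

cats = sorted(set(categories))  # ["Books", "Toy"]
one_hot = np.array([[c == cat for cat in cats] for c in categories], dtype=float)
features = prices[:, None] * one_hot  # each price lands in its category's slot

# Row 0 ("Toy", 404.00) -> [0.0, 404.0]
assert features[0].tolist() == [0.0, 404.0]
```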

4 Feature Engineering 2
For this problem we collected the following data on the new social networking app UFace.
16. Suppose we are interested in predicting the number of responses for future posts. For each of the columns, indicate which (one or more) of the given feature transformations could be informative. Select all that apply.
(1) [2 Pts.] The PostID column:
A. Drop the column
B. One-Hot encoding
C. Leave as is
(2) [2 Pts.] The Time column:
A. Take the hour as a float
B. One-Hot encoding
C. Bag-of-words encoding
D. Time since midnight in seconds
(3) [2 Pts.] The Text column:
A. The length of the text
B. One-Hot encoding
C. Bag-of-words encoding
D. Leave as is
(4) [2 Pts.] The State column:
A. The length of the text
B. One-Hot encoding
C. Bag-of-words encoding
D. Leave as is
The data for this question:

PostID  UTC Time  Text                             Num. Responses  State
3       08:10 PM  "Checkout my breakfast . . . "   2               VA
13      11:00 AM  "Studied all night for . . . "   5               CA
14      12:04 PM  "Hello world!"                   0               NY
17      11:35 PM  "That exam was lit ..."          42              CA

17. [4 Pts.] Suppose we believe that people are more likely to respond to tweets in the afternoon (roughly from hours 13 to 17). Which of the following feature functions would help capture this intuition? Assume that the function localHour takes a time and a state as its arguments and returns the hour of the day (in 24-hour time) in the state’s time zone. Also assume that any boolean-valued feature is encoded as 0 (false) or 1 (true). Select all that apply.
A. φ(time, state) = localHour(time, state)
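One such feature is an afternoon indicator; a sketch with a hypothetical localHour implementation (the UTC offsets below are made up for illustration):

```python
def localHour(hour_utc, state):
    # hypothetical offsets; a real implementation would use time zone data
    offsets = {"VA": -5, "CA": -8, "NY": -5}
    return (hour_utc + offsets[state]) % 24

def phi(hour_utc, state):
    # 1 during the local afternoon (hours 13-17), else 0
    return 1 if 13 <= localHour(hour_utc, state) <= 17 else 0

assert phi(20, "VA") == 1  # 20:00 UTC is 15:00 in VA
assert phi(20, "CA") == 0  # 20:00 UTC is 12:00 in CA
```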
A. The tumor is benign
B. The tumor is more likely benign
C. The tumor is more likely to be malignant
D. The tumor is malignant
31. [4 Pts.] Which of the following would justify applying regularization to a logistic regression model? Select all that apply.
A. The training error is too high.
B. The test error is too low.
C. The data are high-dimensional.
D. There is a large class imbalance.
E. None of the above justify regularization for logistic regression.

32. For each of the following circle T for true or F for false.
(1) Increasing the regularization penalty decreases bias.

Solution: False. Increasing the regularization penalty reduces variance but increases bias.

(2) Without taking precautions, reducing bias often leads to increased variance.

Solution: True. This is the fundamental bias-variance tradeoff.

(3) In the bias-variance trade-off, the variance refers to the variability in predictions across different training datasets.

Solution: True. In the bias-variance trade-off the variance refers to variability in predictions due to variability in training data.

(4) As we improve the model to reduce bias we often run the risk of under-fitting.

Solution: False. Improving the model by reducing bias often leads to over-fitting.

33. In this question we complete the following figure describing the train-test split and k-fold cross validation. Note the data table with many records and a few columns is depicted on the left as a tall rectangle.
[Figure "K-Fold Cross Validation": the Data table (a tall rectangle) on the left is split into labeled regions (A)-(E) showing the train-test split and the K folds.]

(1) This part of the figure refers to the validation data. ⃝(A) ⃝(B) ⃝(C) ⃝(D) √(E)

(2) This part of the figure refers to the testing data. ⃝(A) √(B) ⃝(C) ⃝(D) ⃝(E)

(3) This part of the figure refers to the process of constructing the train-test split. √(A) ⃝(B) ⃝(C) ⃝(D) ⃝(E)

(4)
Select all the following statements that apply to the above figure.
A. This figure illustrates 5-fold cross validation.
B. This figure illustrates 6-fold cross-validation.
C. Assuming all the data points are distinct each of the validation data sets are also distinct.
D. The test data should be used during cross-validation to fully evaluate the model.
34. For each of the following select T for true or F for false on the answer sheet.
(1) [1 Pt.] Regularization can be used to manage the bias-variance trade-off.

Solution: True. Regularization encourages simpler models which can help to reduce variance but increase bias.

(2) [1 Pt.] When conducting linear regression, adding polynomial features to your data often decreases the variance of your fitted model.

Solution: False. Adding more features tends to increase your model's variance since there are more parameters to fit.

(3) [1 Pt.] When conducting linear regression, adding polynomial features to your data often decreases the bias of your fitted model.

Solution: True. Adding more features tends to decrease your model's bias since your model can fit more complicated patterns in the data.

(4) [1 Pt.] Suppose your data are an i.i.d. sample from a population. Then collecting a larger sample for use as a training set can help reduce bias.

Solution: False. Increasing the dataset size without changing the modeling procedure can often reduce variance but is unlikely to address bias.

(5) [1 Pt.] Suppose your data are an i.i.d. sample from a population. Then collecting a larger sample for use as a training set can help reduce variance.

Solution: True. More data often helps to reduce variance in the model fitting process.

(6) [1 Pt.] Training error is typically larger than test error.

Solution: False. Training error often under-estimates the test error.

(7) [1 Pt.] If you include the test set in your training data, your accuracy as measured on the test set will probably increase.

Solution: True. Training on the test data improves test accuracy but this improvement can be misleading due to over-fitting.

(8) [1 Pt.] It is important to frequently evaluate models on the test data throughout the process of model development.

Solution: False. Noooooooooooo. Once test data is used it is no longer test data. You should create validation datasets or use cross-validation procedures to evaluate models.

35. [2 Pts.] A colleague has been developing models all quarter and noticed recently that her test error has started to gradually increase while her training error has been decreasing. Which of the following is the most likely explanation for what is happening? Select only one.

A. She is starting to over-fit to her training data.
B. She is starting to under-fit to her training data.
C. The model is overly biased.
D. None of the above.

36. [5 Pts.] Given the following general loss formulation:
argmin_θ  Σ_{i=1}^n (y_i − x_i^T θ)^2 + λ Σ_{p=1}^d θ_p^2   (37)
Which of the following statements are true? Select all that apply.

A. There are d data points.
B. There are n data points.
C. The data is d dimensional.
D. This is a classification problem.
E. This is a linear model.
F. This problem has LASSO regularization.
G. Larger values of λ imply increased regularization.
H. Larger values of λ will increase variance.
I. Larger values of λ will likely increase bias.
J. None of the above are true.
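The effect of λ in Eq. (37) can be seen with the ridge closed form (the data here is synthetic and hypothetical): larger λ shrinks θ toward zero, i.e. more regularization and more bias.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3))  # n = 100 data points, d = 3 dimensions
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, size=100)

def ridge(lam):
    # minimizer of sum((y_i - x_i^T theta)^2) + lam * ||theta||^2
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

assert np.linalg.norm(ridge(100.0)) < np.linalg.norm(ridge(0.1))
```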
37. [3 Pts.] In class we broke the least-squares error into three separate terms:
E[(y − f_θ(x))^2] = E[(y − h(x))^2] + E[(h(x) − f_θ(x))^2] + E[(f_θ(x) − E[f_θ(x)])^2]   (38)

where y = h(x) + ε, h(x) is the true model and ε is zero-mean noise. For each of the following terms, indicate its usual interpretation in the bias-variance trade-off:

1. E[(y − h(x))^2]:  A. Bias  B. Variance  C. Noise
2. E[(h(x) − f_θ(x))^2]:  A. Bias  B. Variance  C. Noise
3. E[(f_θ(x) − E[f_θ(x)])^2]:  A. Bias  B. Variance  C. Noise

9 Big Data
38. Which of the following are true:
A. Star schemas are designed to decrease redundancy.
B. A Data Warehouse is typically updated every time a change occurs in a related Operational Data Store.
C. A typical Data Warehouse favors cleanliness over completeness: it rejects data that does not conform to the warehouse schema.
D. A typical Data Lake favors completeness over cleanliness: it allows you to store any data you like, without even requiring a schema.
E. The “T” in ETL involves many of the same tasks as Data Wrangling.
A. Because a traditional ETL process only loads data into the warehouse periodically, it will lose sensor information recorded in the operational data store.
B. Each sensor reading should be timestamped in the data warehouse.
C. There is no reason for the data warehouse to record timestamps for information on the vehicles.
40. Which of the following features are typical of a distributed file system:
A. It can store large volumes of data.
B. It is optimized to store data as compactly as possible.
Solution: False. That would be too resource-intensive. Instead, we run batch ETL periodically.
Solution: False. For this application, sensor readings would not be UPDATEs to existing sensor readings; instead it makes more sense for each sensor reading to be timestamped and stored in the operational data store separately. So the ETL process would pick up the full history of sensor readings in this case.
Solution: False. It would be useful for the warehouse to record the history of changes to information about vehicles.

Solution: False. It in fact wastes storage by replicating data, to provide fault tolerance among other features.
C. It can keep serving files even after a certain number of machine failures.
D. After a crash, if any data can be recovered at all, then all the data can be recovered.
41. In class, we asserted that MapReduce is being used less and less in practice. It is being replaced by what other programming interfaces? Why?
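As a concrete illustration of the point in this question, the canonical MapReduce example, word count, collapses to a one-line dataframe aggregation (toy data):

```python
import pandas as pd

words = pd.Series(["big", "data", "big"])  # hypothetical token stream
counts = words.value_counts()              # replaces the whole map + reduce pipeline
assert counts["big"] == 2 and counts["data"] == 1
```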
42. Which of the following are true?
A. Because people can store any file in a Data Lake, it is harder to assess data quality in a Lake than in a Warehouse.
B. The raw data in a Data Lake will likely require more wrangling than the data in a well-governed Data Warehouse.
C. The lack of a unifying schema in a Data Lake makes it difficult to get a global view of information being captured.
D. Relative to traditional Data Warehouses, Data Lakes make it easier to secure data in a well-governed way.
Solution: In general this is false. It is possible for some shards to be recoverable and others not.

Solution: SQL and Dataframe APIs. Both of those interfaces can achieve anything that MapReduce can do, and offer higher-level constructs that are convenient for tabular data: e.g. joins, built-in aggregates (reduce functions), etc.

Solution: The first 3 parts are true and self-explanatory. The last one is false: data lakes encourage users to "land" data files that are unstructured and experimental. It is hard to know what confidential information might be in such files, and who should get access to them as a result.

43. Consider the following simple Data Warehouse schema from a Cellular Service Provider, which records activity on a cell phone network:

CREATE TABLE devices (
    did integer, customer_id integer, phone_number varchar(13),
    firstname text, lastname text,
    zip varchar(12), registered_on varchar(2),
    PRIMARY KEY (did),
    UNIQUE (customer_id) -- a ''candidate'' key
);

CREATE TABLE billing (
    rate_code char,
    description text, base_fee float, per_minute float,
    max_minutes integer, overage_fee float,
    PRIMARY KEY (rate_code));

CREATE TABLE calls (
    caller_handset_id integer, callee_handset_id integer,
    cell_tower_id integer, call_start datetime, call_end datetime,
    billing_code char,
    PRIMARY KEY (caller_handset_id, call_start),
    FOREIGN KEY (caller_handset_id) REFERENCES devices,
    FOREIGN KEY (billing_code) REFERENCES billing);
(1) [3 Pts.] Which of these tables is a dimension table? Select all that apply.
A. devices
B. calls
C. billing
D. None of the above.

Solution: devices and billing

(2) [3 Pts.] Which of the following statements are true? Select all that apply.
A. The calls.billing_code column violates star schema design because any update to a single billing fee requires updates to many call records.
B. If we want to look for correlations between a device's average call length and the time since it was registered, we have to perform a join.
C. If the cell service provider implemented a Data Lake, it would make it easier for them to load audio recordings of calls for subsequent analysis.
D. None of the above statements are true.

44. [3 Pts.] The figure below depicts a distributed file system with one logical "big file" partitioned into 4 "shards" (A, B, C, D) and replicated across multiple worker machines (1, 2, 3, 4).
[Figure: a logical "Big File" divided into shards A, B, C, D; each of Workers 1-4 stores three of the four shards, so every shard is replicated on three workers.]

DS100 Final, Page 34 of 42
Suppose workers 1 AND 2 both fail. Which of the following statements are true? Select all
that apply.
A. The full file will remain available since worker 3 and worker 4 are both still running.
B. The system can tolerate one more worker failure without losing data.
C. If every request requires all 4 shards of the file, then worker 3 and worker 4 can share the work evenly.
D. None of the above statements are true.
45. Consider only the mechanism of partitioning files into shards, and storing different shards on different machines. Which of the following statements are true? Select all that apply.

A. Partitioning enhances the ability of the system to store large files.
B. Partitioning allows the system to tolerate machine failures without losing data.
C. Partitioning allows the system to read files in parallel.
D. None of the above statements are true.
46. [2 Pts.] Recall the statistical query pattern discussed in class for computing on very large data
sets. Which of the following statements are true? Select all that apply.
A. It eliminates the need for the end-user device (e.g. a laptop) to acquire all the
data.
B. It pushes the computational task closer to the large-scale data storage.
C. It is well suited to both MapReduce and SQL interfaces.
D. An alternative to the statistical query pattern for big data is to acquire a sample of the full dataset on the end-user device.
E. None of the above statements are true.

DS100 Final, Page 35 of 42
10 XPath
47. [3 Pts.] Below is a tree representation of an XML document. The tree has 10 nodes, which we have numbered 1 through 10. Two of these are text nodes: one containing "Some Text" (labeled #5), and the other "More Text" (labeled #10). In addition, some of the nodes (e.g., #7) have attributes.
For each of the following XPath expressions, provide the numbers for the nodes which are located by the expression. If no nodes match, say NULL.
1. //f

Solution: 4 and 9

2. //b/..

Solution: 1 and 3

3. //c//f

Solution: Any "f" that is a descendant of any "c": 9

4. //b[@id]

Solution: Any "b" that has the attribute id: 8
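These expressions can be tried against any small XML document. A sketch with Python's standard library on a made-up document (not the exam's tree), using the subset of XPath that xml.etree supports:

```python
import xml.etree.ElementTree as ET

# Hypothetical document, unrelated to the exam's 10-node tree.
doc = ET.fromstring("<a><c><f/></c><b id='x'><f/></b><b/></a>")

assert len(doc.findall(".//f")) == 2        # every <f> at any depth
assert len(doc.findall(".//b[@id]")) == 1   # every <b> carrying an id attribute
```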

DS100 Final, Page 36 of 42
11 EDA and Visualization
48. [2 Pts.] Consider the following statistics for infant mortality rate. According to these statistics, which transformation would best symmetrize the distribution? (Select only one.)
Transformation   lower quartile   median   upper quartile
x                13               30       68
√x               3.5              5        8
log(x)           1.15             1.5      1.8
A. no transformation
B. square root
C. log
D. not possible to tell with this information
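The heuristic behind this question: compare the gap below the median to the gap above it under each transformation, and pick the transformation that makes the two gaps most nearly equal.

```python
# Quartile gaps (median - Q1, Q3 - median) from the table above.
stats = {"x": (13, 30, 68), "sqrt(x)": (3.5, 5, 8), "log(x)": (1.15, 1.5, 1.8)}
asym = {name: abs((med - q1) - (q3 - med)) for name, (q1, med, q3) in stats.items()}

# log(x) has the smallest asymmetry between its quartile gaps
assert min(asym, key=asym.get) == "log(x)"
```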
49. [5 Pts.] For each of the following scenarios, determine which plot type is most appropriate to reveal the distribution of and/or the relationships between the following variable(s). For each scenario, select only one plot type. Some plot types may be used multiple times.
A. histogram
B. pie chart
C. bar plot
D. line plot
E. side-by-side boxplots
F. scatter plot
G. stacked bar plot H. overlaid line plots
I. mosaic plot
(1) [1 Pt.] sale price and number of bedrooms (assume integer) for houses sold in Berkeley in 2010.

Solution: E. Side-by-side boxplots. We might imagine using a scatter plot since we are plotting the relationship between two numeric quantities. However, because the number of bedrooms is an integer and most houses will only have a small number, we are likely to encounter over-plotting in the scatter plot. Therefore side-by-side boxplots are likely to be most informative.

(2) [1 Pt.] sale price and date of sale for houses sold in Berkeley between 1995 and 2015.

Solution: F. Scatter plot. Here we are plotting two numeric quantities with sufficient spread on each axis.

(3) [1 Pt.] infant birth weight (grams) for babies born at Alta Bates hospital in 2016.

Solution: A. Histogram. Here we are plotting the distribution of a likely large number of observations and therefore a histogram would be most appropriate.

(4) [1 Pt.] mother's education-level (highest degree held) for students admitted to UC Berkeley in 2016.

Solution: C. Bar plot. Here we want to visualize counts of a categorical variable.

(5) [1 Pt.] SAT score and HS GPA of students admitted to UC Berkeley in 2016.

Solution: F. Scatter plot. Here we are visualizing the relationship between two continuous quantities.

(6) [1 Pt.] race and gender of students admitted to UC Berkeley in 2016.

Solution: I. Mosaic plot. Here we are visualizing the relationship between two categorical variables.

(7) [1 Pt.] the percentage of female students admitted to UC Berkeley each year from 1950 to 2000.

Solution: D. Line plot. This allows us to see the trends over time.

(8) [1 Pt.] SAT scores for male and female students admitted to UCB from 1950 to 2000.

Solution: E. Side-by-side boxplots. This allows us to see the distributions of SAT scores per gender and year.

50. [4 Pts.] Consider the following empirical distribution:
(1) [1 Pt.] The distribution has ___ mode(s). A. 1  B. 2  C. 3  D. 4

(2) [1 Pt.] The distribution is:
A. Skewed left
B. Symmetric
C. Skewed right

(3) [2 Pts.] Select all of the following properties displayed by the distribution:
A. gaps
B. outliers
C. normal left tail
D. None of the above

51. [4 Pts.] Select all of the problems associated with the following plot (there may be more than one problem):
A. Over-plotting
B. Use of chart junk
C. Vertical axis should be in log scale
D. Missing vertical axis label
E. Poor use of the horizontal dimension
F. Graph elements interfere with data
G. Stacking
H. Use of angles to convey information
I. None of the above are problems with this awesome plot.

52. In the odd questions, name the plot’s type (for example, “scatter” or “box”). In the even questions, answer whether the plot is useful for answering the given query.
(1) [1 Pt.]
(2) [1 Pt.] True or false: This plot is useful for answering the question: “Did underemploy- ment generally increase when unemployment increased?”
Solution: This is a stacked line plot.
Solution: False. Because of the stacking, it’s hard even to tell how much underem- ployment there was in each year, much less compare changes in underemployment with changes in unemployment. To answer this question, a good visualization would be a scatter plot of the change in unemployment and underemployment for each year.
(3) [1 Pt.]

(The plot is from http://www.stubbornmule.net/2009/07/love-old-fashioned/ and displays topics of selected popular music over time. Not all pop songs are represented
in the dataset.)
Solution: This is a mosaic plot.

(4) [1 Pt.] True or false: This plot is useful for answering the question: "Among the songs in this dataset, how many were released in each of the five decades from 1960 to 2010?"

Solution: True. The widths of the bars show us how many songs were released in each category, and the categories include the 1960s through 2000s.

(5) [1 Pt.]

Solution: This is a line plot on a logarithmic scale.

(6) [1 Pt.] True or false: This plot is useful for answering the question: "Assuming the bacteria population grew linearly over time, what was the rate of increase?"

Solution: False. Since the plot is on a log scale, the slope of the line is the exponential rate of growth, not the linear rate. (The question would have been clearer if it had read, "If we model the population growth as a linear function of time, what would be a good estimate of the linear growth rate?")
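The log-scale point can be checked numerically (with hypothetical growth data): fitting a line to log(population) recovers the exponential rate, not a linear one.

```python
import numpy as np

t = np.arange(10)
pop = 100 * np.exp(0.3 * t)  # exponential growth at rate 0.3 (made up)

# On the log scale the data is exactly linear with slope = the exponential rate.
slope = np.polyfit(t, np.log(pop), 1)[0]
assert abs(slope - 0.3) < 1e-6
```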

End of Exam