# CS代考程序代写 data science python SQL Data 100 Principles and Techniques of Data Science

Data 100 Principles and Techniques of Data Science

Summer 2019

INSTRUCTIONS

• You have 80 minutes to complete the exam.

• This exam has 6 pages and a total of 40 points.

Midterm Solutions

• The exam is closed book, closed notes, closed computer, closed calculator, except one hand-written 8.5″ × 11″ crib sheet of your own creation and the official Data 100 reference sheet.

• Mark your answers on the exam itself. We will not grade answers written on scratch paper.

• Please put your name at the top of every page of the exam.

Last name

First name

Student ID number

CalCentral email (_@berkeley.edu)

Name of the person to your left

Name of the person to your right

All the work on this exam is my own.

(please sign)

2

This page is intentionally left blank, but feel free to use it as scratch paper.

Name: 3 1. (12 points) Rush Hour

Fill in both the Python code and the SQL query to produce each result below, assuming that the following tables are stored both as Pandas DataFrames and SQLite tables. Only the first few rows are shown for each table. The trfc table contains one row per sensor recording of hourly average car speed in mph. The time column contains strings that encode the hour of day and whether the time occurred during rush hour. The dates table contains one row for all dates in 2019 with their days of the week.

trfc dates

(a) (4 pt) Calculate the average speed during rush hour.

Python: trfc.loc[trfc[‘time’].str.contains(‘yes’), ‘spd’].mean()

SQL: SELECT AVG(spd) FROM trfc WHERE time LIKE ‘%yes’;

(b) (4 pt) Create a table t with one row per recording in trfc. Each row should contain the day of week, speed, hour of day as a two-character string, and whether the recording occurred during rush hour (either “yes” or “no”). Hint: The correct call to extract() takes in a single regex with two captured groups: one for hour of day and one for rush hour. Also, the provided SQL already computes the hr column.

Python: m = trfc.merge(dates, left_on=’dt’, right_on=’date’)

m[[‘hr’, ‘rush’]] = m[‘time’].str.extract(r’hr=(d+),rush=(w+)’)

time

dt

spd

hr=01,rush=no hr=13,rush=no hr=08,rush=yes hr=18,rush=yes

May 1 May 3 May 29 May 3

70 59 37 30

date

day

Jan 1 May 2 Jun 13 May 4

Mon Wed Sat Thu

SQL:

t = m[[‘day’, ‘spd’, ‘hr’, ‘rush’]]

CREATE TABLE t AS SELECT day, spd, SUBSTR(time, 4, 2) AS hr,

CASE WHEN time LIKE ‘%no’ THEN “no” ELSE “yes” END AS rush

FROM trfc JOIN dates ON dt = date;

(c) (4 pt) Find the minimum speed in a cluster sample with two clusters: take a SRS of two unique values in day, then find the minimum speed across all recordings on those days of the week. Note that there are many speed recordings for every date in May. You may assume that the t table is correctly created.

Python: days = np.random.choice(t[‘day’].unique(), 2, False) t.loc[t[‘day’].isin(days), ‘spd’].min()

SQL: SELECT MIN(spd) FROM t WHERE day IN ( SELECT day FROM t GROUP BY day

ORDER BY RANDOM() LIMIT 2

);

4

2. (5 points) SAMpling

Suppose that there are ten people in a room and one of these people is named Sam. We will take random samples of these people and compute probabilities associated with these samples. Bubble in the circles corresponding to your answers.

(a) (1 pt) What is the probability that Sam is not in a simple random sample of 1 individual? 1 1 2 1 4 9 Noneofthese

(b) (2 pt) What is the probability that Sam is not in a simple random sample of 3 individuals?

1 3 7 9 3 3 9 3 None of these 10 10 10 10 10 10

(c) (2 pt) Suppose we take a sample of 2 individuals by first drawing a simple random sample of size 5, then taking a simple random sample of size 2 from that sample. What is the probability that Sam is not in this sample?

1 1 3 1 4 1 · 9 None of these 10 5 10 2 5 210

3. (5 points) Go Bears?

([go]|[bear])+s?!

Shade in the box for all of the strings below that match the regular expression above. Only shade a box if the whole string matches the expression, not just a substring. Do not put a checkmark in the box; shade in the entire box.

go! gear? garbs! bearsbears! gobears?!

10 5 5 2 5 10

Name: 5 4. (10 points) Ballers

Suppose you have a dataset of 30,000 basketball games played in the NBA over the last 20 years. After conducting EDA, you find that each row in this dataset corresponds to a single game with the six columns described below. You also find that all columns contain duplicate values.

Column Description

date Date of game

team Home team’s name

opp Visiting team’s name

win 1 if the home team won, else 0 Ordinal

Data Type Numeric Nominal Nominal

scr Score of home team opp_scr Score of visiting team

Numeric, between 80 and 170 Numeric, between 80 and 170

For each of the following questions, shade in one or more boxes corresponding to your answer.

(a) (2 pt) Given the data types of this dataset, which of the following visualizations are not appropriate?

A histogram of win.

A histogram of scr.

A box plot with team on the x-axis and opp_scr on the y-axis.

A 2D KDE plot with scr on the x-axis and opp_scr on the y-axis.

(b) (2 pt) Which of the following plots will likely suffer from overplotting?

A scatter plot with dates on the x-axis and number of games played on that date on the y-axis. A scatter plot with scr on the x-axis and opp_scr on the y-axis.

A scatter plot with scr on the x-axis and win on the y-axis.

A dot plot with team on the x-axis and the average of scr for each team on the y-axis.

(c) (2 pt) Which of the following plots show all teams that improved in scoring? (These are the teams with higher scores at later dates.)

A line plot with one line for every team with date on the x-axis and scr on the y-axis.

A line plot with one line for every team with date on the x-axis and opp_scr – scr on the y-axis. One separate line plot for each team with date on the x-axis and scr on the y-axis.

A bar plot with one bar per team and average of the latest five games as bar lengths.

(d) (2 pt) Which of the following plots show that home teams scored more on average than visiting teams? Correct answer: all four choices

6

(e) (2 pt) Suppose you find a linear relationship when you make a scatter plot with np.log(scr) on the x-axis and np.log(opp_scr) on the y-axis. When you fit a least-squares line on this plot, you find the slope of the line is 2 and the intercept is 5. Which of the following relationships hold?

opp_scr=2·scr+5

log(opp_scr) = log(2 · scr + 5) opp_scr = e5 · e2·scr

opp_scr = e5 · scr2

5. (8 points) Dim Matrices

You perform principal component analysis on a data matrix D using the following Python code from lecture.

m = D.shape[0]

X = (D – np.mean(D, axis=0)) / np.sqrt(m)

u, s, vt = np.linalg.svd(X, full_matrices=False)

Here are the values of a few expressions executed after running the code above:

(a) (2 pt) What is the shape of D? Recall that a matrix with 10 rows and 3 columns has shape (10 × 3).

(5×5) (5×40) (40×5) (40×40)

(b) (2 pt) What is the rank of D? 0123456

(c) (2 pt) What percentage of D’s total variance is kept if PCA is used to reduce the number of dimensions to 3?

12% 22% 60% 92% 98% Not enough information

(d) (2 pt) Suppose the last row in X is: array([10, 4, -5, 2, 1]). After projecting this point onto the

first principal component, what is the location of this point on the principal component axis?

10 11 24 100 128 Notenoughinformation

Python

Result

s

u.shape

vt[0]

array([12, 6, 4, 2, 0])

(40, 5)

array([0.8, 0, -0.6, 0, 0])