CS代考程序代写 data science python SQL Data 100 & 200A Principles and Techniques of Data Science

Data 100 & 200A Principles and Techniques of Data Science
Spring 2019
• You have 70 minutes to complete the exam.
Midterm 1 Solutions
• The exam is closed book, closed notes, closed computer, closed calculator, except one hand-written 8.5″ × 11″ crib sheet of your own creation and the official Data 100 study guide.
• Mark your answers on the exam itself. We will not grade answers written on scratch paper.
Last name
First name
Student ID number
CalCentral email (_@berkeley.edu)
Exam room
Name of the person to your left
Name of the person to your right
All the work on this exam is my own.
(please sign)

1. (12 points) Python vs. SQL
Fill in both the Python code and the SQL query to produce each result below, assuming that the following three tables are stored both as Pandas DataFrames and Sqlite tables. Only the first few rows are shown for each table. The cities table contains one row per city and its population in thousands. The names table contains one row per state. The states table contains one row per state with its population in millions. Assume that cities contains only a small subset of US cities. There may be multiple cities in each state, but every city is in a state that appears in both names and states, and every state contains at least one city.
cities names states
(a) (4 pt) Create a table t that is the same as cities but with an additional column ab containing the abbreviation (e.g. CA) of the state in which each city is located. Hint: The str.extract method of a Series called on a regular expression with one group returns a DataFrame with one column labeled 0 containing the first substring matching the group. Assume there is exactly one comma in each city value.
Python: t = cities.copy()
Nevada City, California Carson City, Nevada Newark, New Jersey
3 55 285
California Nevada Washington
39.5 3.0 7.4
x = cities[‘city’].str.extract(r’, (w+)’)[0]
t[‘ab’] = list(names.set_index(‘full’).loc[x, ‘abbrev’])
CREATE TABLE t AS SELECT city, pop, abbrev AS ab FROM cities JOIN names ON
city LIKE ‘%, ‘ || full;
(b) (4 pt) Create a two-column table u of the cities and their populations (labeled city and pop) that are in
states with a population above 5 million. Assume that t from part (a) was constructed correctly. Python: m = t.merge(states, left_on=’ab’, right_on=’state’)
m[m[‘people’] > 5][[‘city’, ‘pop’]]
ab IN (SELECT state FROM states WHERE people > 5)
(c) (4 pt) Create a table with one row per state that contains the state’s abbreviation and the fraction of cities (from the cities table) in that state that have a population above 50,000.
Python: t[‘pop’].groupby(t[‘ab’]).agg( lambda s: sum(s>50)/len(s))
SQL: SELECT ab, SUM(CASE WHEN pop < 50 THEN 0 ELSE 1 END)/COUNT(*) FROM t GROUP BY ab; Name: 3 2. (6 points) Sampling Circle the correct response to each question about this population of six individuals. (a) (2 pt) From the population above, you draw a simple random sample A of 2 individuals. What’s the probability that Ace and Adele both appear in sample A? 1 ·1 1 ·1 1 ·1 1 +1 2·1 ·1 2·1 ·1 2·(1 ·1) 2·(1 +1) 66653366 66 65 33 66 (b) (2 pt) You separately draw a cluster sample B from the same population based on the cluster column. (Sample A is replaced in the population before drawing sample B, so the two are independent.) What’s the probability that Ace and Adele both appear in sample B? 0 1 1 1 1 2 3 Noneofthese 643234 (c) (2 pt) You then combine all individuals from sample A and sample B into sample C. Thus, there may be repeated individuals in sample C. What’s the probability that Bella appears exactly once in sample C? 0 1 1 1 1 2 5 Noneofthese 643236 3. (4 points) Regular Expressions [a-z]+_[a-z]{2}[_r]_?[a-z]+ Circle all of the strings below that match the regular expression above. Only circle a string below if the whole string matches the expression, not just a substring. bar_chart group_by_x a_box_plot visualize_first name cluster Abdul Ace Adele Aerie Bella Buzz Blue Blue Blue Blue Gold Gold 4 4. (10 points) Data Visualization (a) (2 pt) Are the two histograms below displaying exactly the same data? Circle only one answer. (a) Yes (b) No (c) Impossible to tell 0 5 10 15 20 5 10 15 20 25 xx There is a location shift between the two distributions. (b) (2 pt) Are the two Gaussian kernel density plots below displaying exactly the same data? Circle only one answer. (a) Yes (b) No (c) Impossible to tell −4 −2 0 2 4 −4 −2 0 2 4 . (c) (2 pt) Which of the following can be determined from looking at this pie chart? Circle only one answer. B The density plots could be displaying the same data and look very different because of the bandwith, however, one cannot tell for sure. (a) Category B is twice as frequent as category A (b) Category A is half as frequent as category B (c) Category B is more frequent than category A (d) All of the above (e) None of the above A With angles/areas, it is very hard to precisely compare frequencies. C Density Frequency 0.0 0.1 0.2 0.3 0.4 0 10 20 30 Density Frequency 0.0 0.2 0.4 0.6 0.8 0 10 20 30 40 Name: 5 (d) (2 pt) You have 30 unique observations of a real-valued quantitative variable. Which of the following visualizations effectively depicts the distribution of these values while retaining as much information as possible about the original data? Circle only one answer. (a) A pie chart (b) A strip chart (c) A box plot A strip chart displays all of the data. A boxplot would summarize the data and potentially miss issues such as bimodality. A pie chart is not applicable. (2 pt) You are given six lists, each of a few thousand numbers taking on values in the real line. Which of the following is the most effective way to visually compare the center and spread of the corresponding six distributions? Choose only one answer. (a) Side-by-side box plots (b) Side-by-side histograms (c) Side-by-side strip charts With a few thousand values, strip charts would be unreadable. Visually comparing several histograms is difficult. (e) 5. (8 points) Dimensionality Reduction You perform principal component analysis on a data matrix D using the following Python code from lecture: n = D.shape[0] X = (D - np.mean(D, axis=0)) / np.sqrt(n) u, s, vt = np.linalg.svd(X, full_matrices=False) The resulting value of s is np.array([3, 1, 0, 0, 0]). (a) (4 pt) To draw a histogram of the data’s distribution along the first principal component of X, which of the following arrays would you visualize? Circle all correct expressions. X @ u.T[:,0] (u * s)[:,0] X @ vt[0,:] (X @ vt.T)[:,0] (b) (2 pt) What proportion of the total variance in D is accounted for by the first principal component? 9 10 (c) (2 pt) What is the rank of X? 2

Leave a Reply

Your email address will not be published. Required fields are marked *