DS/CMPSC 410 Programming Models for Big Data Fall 2021
Final Exam Study Guide
December 6, 2021
The weight of each topic is an estimate. The actual weight of exam questions can vary slightly. A question on the exam can also relate to more than one topic area.
1. Big Data Opportunities, MapReduce, Hadoop, Spark (5-10%)
• What is the difference between the requirements of Big Data Streaming Analytics and those of Big Data Discovery?
• How to decide the real-time requirement of deploying a machine learning model for streaming data analytics?
o DataProcessingTime + MLmodelTime + DecisionTime <= ResponseTime
• What big data analytics problems motivated Google’s innovation of MapReduce?
• How is MapReduce related to the Google search engine?
• What are the main innovative ideas in MapReduce that enable scalable big data analytics?
• What types of problems are suitable for MapReduce?
• What is the challenge of using MapReduce for iterative Big Data analytics?
• What enabled Spark to be adopted by industry so quickly?
• What advantages does Spark provide over MapReduce?
2. RDD and Lazy Evaluation (7-10%)
• What is the main cost reduction of MapReduce’s plan ahead (mapper or reducer, input or output)?
• What information is used in MapReduce’s plan ahead?
• What is the difference between a variable in Spark whose value is an RDD and a variable in other programming languages?
• Is an RDD mutable or immutable? What does this mean?
• How does lazy evaluation help Spark to more effectively distribute data and computations on a cluster?
• Is RDD closely related to lazy evaluation in Spark, or is it an independent feature of Spark?
• What is the difference between the local mode and the cluster mode of running Spark?
• What are the implications of lazy evaluation for debugging Spark code?
• What is the difference between map and flatMap in Spark? What task is suitable for map, and what task is suitable for flatMap?
• What is filter in Spark? How is it different from map? What task is suitable for filter?
• What are the problems with the following code? Correct all of the incorrect code.
    text_RDD=sc.textFile("TweetCovid19Vaccine.csv")
    tokens_RDD.flatMap(lambda x: x.split(" ") )
    wc_RDD= tokens_RDD.map(lambda x: (x, 1)).reduceByKey(lambda a, b: a+b, 1)
    wc_RDD.saveAsTextFile("TweetWordCounts.txt")
• Suppose the filename for the input tweets above is “TweetCovid19_Vaccine.csv”, and each line is in a code cell of Jupyter Notebook. Which line will trigger the error due to the incorrect filename?
• Suppose we add tokens_RDD.take(3) to the second cell. Which cell will trigger the error due to the incorrect filename?
• Where do we specify the local mode (vs. cluster mode) of running Spark?
• What is the relationship between the Spark context and reading an input file into an RDD?
3. Filter, Transformation, and Ordering (7-10%)
• You have been given two designs for processing a streaming tweet input. You want to count the frequency of all words, hashtags, and Twitter users mentioned every minute. Should you filter for hashtags and Twitter users before or after reduceByKey? Explain the rationale for your decision.
• After we test Spark code in local mode, should we comment out actions used for debugging purposes (e.g., take(..))?
• If we generate an RDD using map but need to use it in two separate filters (to create two separate RDDs), what can we do to enhance the performance of this code?
• Which of the following Spark features are enabled by RDD?
A. Recompute RDD if needed (e.g., re-using RDD, worker failure)
B. Can be used to store key-value pairs even if the key is a list of multiple items.
C. Persist computed RDD (in memory or on disk)
D. Lazy evaluation
E. Join on a pair of RDDs, regardless of whether the structures of their keys match.
4. Data Frame Transformation and SQL Functions (7-10%)
• What situations in Big Data processing require the use of DataFrame?
• What is the relationship between a SparkSession and a DataFrame?
• What is the requirement for an input CSV file for Spark to be able to infer its schema?
• Which of the following information is specified in a schema for a DataFrame?
A. Name of each column
B. Type of each column
C. Whether the column can have null values
D. Whether the column is an array
E. Whether the column is a key of the DataFrame
• Be able to use the join operation to join two DataFrames on a given column name.
• How to generate the titles of all movies that have at least 100 reviews?
• How to generate the titles of all movies whose average reviews are above 3.5?
• How to use withColumn to create a new column that is an array of genres from the column “Genres”, whose value is a string that connects the genres of a movie using “-”?
• How to generate the titles of all Drama movies that have at least 100 reviews and whose average reviews are above 3.5?
• How to improve the efficiency of a Big Data analytics pipeline involving DataFrame joins and filtering?
• How to extract elements of an RDD that was created from a DataFrame?
• How to use array_contains to filter movies of a specific genre (e.g., drama) from a DataFrame that stores the genres of a movie in an array?
• How to join two DataFrames that have a common column name?
5. Spark-submit in a cluster mode, persist, and GroupBy (5-10%)
• During the process of developing a Spark application, what is the role of running Spark in local mode (e.g., with Jupyter Notebook)?
• What are the key benefits of running Spark on a cluster versus running Spark in local mode?
• What are the parameters of spark-submit in cluster mode?
• What steps are involved in converting PySpark code that runs in local mode in Jupyter Notebook to run in cluster mode?
• Which of the following is correct about persist?
1. Interpreting a PySpark persist statement on an RDD will immediately cause the content of the RDD to be saved (in memory or on disk).
2. Adding persist to an RDD or DataFrame can cause no harm.
3. If an RDD or DataFrame is used in an iteration, it is beneficial to persist it.
4. Persist of an RDD in Spark makes a “best effort” to save the RDD in memory and/or on disk, but does not guarantee it.
6. Recommendation Systems, Hyperparameter Tuning (7-10%)
• What is the principle of Alternating Least Squares (ALS) for recommendation systems?
• What are the hyperparameters for training a recommendation system using ALS?
• What are the different approaches to reducing the risk of overfitting when developing a recommendation system?
• Is the choice of hyperparameter values important for developing a machine learning application? If so, why?
• What are the different roles of validation data and testing data in hyperparameter tuning?
• Pandas DataFrame and SQL DataFrame are different; therefore, a Pandas DataFrame cannot be converted to a SQL DataFrame.
• How do we combine the RDD containing (
with the RDD of validation data
to calculate error for each combination of
• How do we use groupBy on a DataFrame to calculate the total number of movies in a particular genre?
• What is the effect of setCheckpointDir on the SparkContext of a SparkSession?
7. From Small to Big, Sampling, and Persist (5-10%)
• What hyperparameters for ALS models need to persist?
• How do you sample a big rating dataset for testing in local mode?
• What command can show you the status of jobs?
• What is the problem with randomly sampling a small dataset from a Big movie review dataset?
• Which of the following sampling approaches for constructing a smaller movie review dataset from a Big movie review dataset achieves two goals at the same time: (1) obtains review data suitable for constructing and evaluating a recommendation system, and (2) reduces potential biases introduced by the sampling approach?
A. Random sampling from the entire review dataset.
B. Identify and sample top k1 users (based on number of reviews provided), use all reviews of these users.
C. Randomly sample k2 users, use all reviews of these users.
D. Identify k1 users that are most influential (based on their posts on a social media platform), use all reviews of these users.
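To make the user-based sampling options above concrete, here is a minimal pure-Python sketch of option C (randomly sample k2 users, then keep all of their reviews). The toy ratings, the function name sample_by_user, and the seed are illustrative assumptions, not part of the course material. In PySpark the same idea would typically be expressed as a sample of distinct user IDs joined back to the ratings.

```python
import random

# Hypothetical toy ratings: (user_id, movie_id, rating)
ratings = [
    (1, 10, 4.0), (1, 11, 3.5), (1, 12, 5.0),
    (2, 10, 2.0),
    (3, 11, 4.5), (3, 12, 3.0),
    (4, 10, 1.0), (4, 13, 4.0),
]

def sample_by_user(ratings, k, seed=0):
    """Randomly pick k users and keep every review written by them."""
    users = sorted({user for user, _, _ in ratings})
    rng = random.Random(seed)          # fixed seed only for reproducibility
    chosen = set(rng.sample(users, k))
    return [r for r in ratings if r[0] in chosen]

subset = sample_by_user(ratings, k=2)
sampled_users = {user for user, _, _ in subset}
```

Because whole users are kept, each sampled user's complete rating history is available for training and evaluation, which plain row-level random sampling does not guarantee.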
8. Decision Trees Learning Using ML and Visualization (5-10%)
• What is the principle for Decision Tree learning?
• What hyperparameters need to be tuned for Decision Tree learning?
• How does the depth of a decision tree relate to its overfitting risk?
• Can the Decision Tree learned by Spark be used to generate a visualization?
• How to interpret a learned Decision Tree (i.e., knowledge/rules in the tree) using its
• What information is captured in the leaf of a decision tree?
• How to evaluate and select the best Decision Tree model among a set of models (e.g.,
created using different hyperparameters)?
• Can Decision Tree learning be carried out if the data is distributed and cannot be copied into a central location (e.g., for privacy protection reasons)? What is the principle for adapting DT learning to such a situation?
9. Frequent Patterns Mining (10-15%)
• What are the potential benefits of identifying frequent patterns in a dataset?
• You are hired by a College to apply Data Science to improve its course scheduling
system. What frequent patterns can be potentially useful to discover?
• Suppose you are on a DS team asked by a client to uncover frequent patterns regarding
webpage visits of their customers on the client’s website. What important
information do you want to gather to assess the feasibility of the project?
• What are the principles for reducing the cost of finding frequent patterns (frequent itemsets)?
• How to use an accumulator? Can we use map to apply a function that updates an
accumulator? Why not?
• What is the difference between map and foreach?
• How to convert a column (in a PySpark DataFrame) that is a string representation of a set of column values (connected by a character such as -, _) into an array representation of the column values?
• What is the impact of choosing top k items (e.g., top k ports in Darknet’s scanner data) for finding frequent patterns?
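Two of the ideas in this topic, converting a delimiter-joined string into an array and reducing the cost of counting frequent patterns, can be sketched in plain Python. The records, the "-" delimiter, and the min_support value below are made-up assumptions for illustration. The pruning shown is the Apriori-style idea that a pair can only be frequent if both of its items are frequent. In PySpark, the split step would typically use pyspark.sql.functions.split inside withColumn.

```python
from collections import Counter
from itertools import combinations

# Hypothetical records: ports scanned by each source, joined with "-"
records = ["22-80-443", "80-443", "22-80", "80-443-8080", "23"]

# String representation -> array (list) representation of the values
baskets = [sorted(set(r.split("-"))) for r in records]

min_support = 2  # assumed threshold: minimum number of baskets

# Pass 1: count single items and keep only the frequent ones
item_counts = Counter(item for b in baskets for item in b)
frequent_items = {i for i, c in item_counts.items() if c >= min_support}

# Pass 2: count only pairs whose members are both frequent,
# which avoids generating candidate pairs that cannot be frequent
pair_counts = Counter(
    pair
    for b in baskets
    for pair in combinations([i for i in b if i in frequent_items], 2)
)
frequent_pairs = {p: c for p, c in pair_counts.items() if c >= min_support}
```

Restricting the data to the top k items has a similar effect on cost: fewer items means far fewer candidate combinations, at the price of never discovering patterns that involve the excluded items.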
10. One Hot Encoding and kMeans Clustering (10-15%)
• Does the Euclidean Distance measure always provide an adequate measure of distance/similarity between different data points? If not, what are the limitations?
• You are given a Darknet dataset that includes a variable (port-scanned) that has 1000 possible values (whose similarities are important for clustering the data) and a numerical variable that ranges from 0 to 1000. One member of your team suggests that you do One Hot Encoding on the port-scanned feature and use a distance measure based on the cosine similarity measure for k-means clustering. What is your response?
• Under what situations should One Hot Encoding be applied to the data before sending it to clustering?
• How to evaluate clustering results using external labels (e.g., Mirai signature)?
• What factors affect the outcomes of kMeans clustering?
o Number of clusters (k)
o Whether and how categorical variables are encoded using One Hot Encoding
o Choice of distance measure
• How to use pyspark DataFrame SQL function withColumn to create OneHotEncoded features?
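The interaction between One Hot Encoding and Euclidean distance raised in this topic can be checked with a small pure-Python sketch. The port list and the feature values are hypothetical, and the helper names one_hot and euclidean are introduced only for this illustration.

```python
import math

def one_hot(value, categories):
    """Encode a categorical value as a 0/1 vector over the given categories."""
    return [1.0 if value == c else 0.0 for c in categories]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

ports = [22, 23, 80, 443]  # hypothetical category values

# Any two different categories are equally far apart after One Hot Encoding
d_22_23 = euclidean(one_hot(22, ports), one_hot(23, ports))
d_22_443 = euclidean(one_hot(22, ports), one_hot(443, ports))

# Appending an unscaled numerical feature (range 0 to 1000) to the vector
a = one_hot(22, ports) + [10.0]
b = one_hot(22, ports) + [900.0]   # same port as a, very different number
c = one_hot(443, ports) + [12.0]   # different port, similar number

d_ab = euclidean(a, b)
d_ac = euclidean(a, c)
```

Here d_22_23 and d_22_443 are both sqrt(2), and d_ab is far larger than d_ac, so in this toy setup the unscaled numerical feature dominates the distance while the encoded categorical feature contributes at most sqrt(2).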
11. Bucketing for Encoding Numerical Variables (10-15%)
• What can be an issue for clustering a dataset involving both categorical features (that need to use One Hot Encoding) and numerical features together?
• How should one design the “buckets” for bucketing-based encoding?
• How does bucketing-based encoding affect Euclidean Distance measure of the feature
• How to rename the column of bucketing-based feature in DataFrame into one that is
• What do we learn from miniProject deliverable #3?
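As a rough illustration of bucketing-based encoding and its effect on Euclidean distance, the sketch below maps a numerical value to a bucket and then to a one-hot vector. The bucket boundaries in splits are an arbitrary assumption, and the helper names are hypothetical. The boundary convention (each bucket is [low, high), with the top boundary included in the last bucket) follows the behaviour of pyspark.ml.feature.Bucketizer, but this is plain Python, not the Spark API.

```python
import bisect
import math

splits = [0, 10, 100, 1000]  # assumed boundaries: [0,10), [10,100), [100,1000]

def bucket_index(x, splits):
    """Return i such that splits[i] <= x < splits[i+1]; last bucket includes its top boundary."""
    if x < splits[0] or x > splits[-1]:
        raise ValueError("value outside the bucket range")
    i = bisect.bisect_right(splits, x) - 1
    return min(i, len(splits) - 2)

def bucket_one_hot(x, splits):
    idx = bucket_index(x, splits)
    return [1.0 if i == idx else 0.0 for i in range(len(splits) - 1)]

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

d_same_bucket = euclidean(bucket_one_hot(11, splits), bucket_one_hot(99, splits))
d_neighbors = euclidean(bucket_one_hot(9, splits), bucket_one_hot(11, splits))
```

After encoding, 11 and 99 are at distance 0 while 9 and 11 are at distance sqrt(2), so the placement of bucket boundaries, rather than the raw numerical gap, determines the distance.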
12. Deep Learning (5-10%)
o What is the principle of Deep NN learning?
o Can a Deep NN be used for multi-class classification problems?
o Can a Deep NN be used for generating a mapping from input features to more than one
numerical output variables (e.g., speed and change of speed)?
o What is the impact of choosing linear/nonlinear activation functions in constructing a Deep NN?
o What is a typical activation function for the output layer of a handwritten digit recognition DNN?
o What are the best practices for deciding the batch size for Deep NN learning?
o How to identify and reduce the risk of overfitting in training a Deep NN?
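For the multi-class questions above, a common choice for the output layer is softmax, which turns one raw score per class into a probability distribution over the classes. The sketch below is plain Python with made-up scores for the ten digit classes. It illustrates the function itself and is not a statement of the course's expected answer.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw output scores for the digits 0 through 9
logits = [1.2, 0.3, -0.8, 2.5, 0.0, -1.1, 0.7, 3.1, -0.2, 0.4]
probs = softmax(logits)
predicted_digit = probs.index(max(probs))
```

The predicted class is the index with the largest probability, which is always the index with the largest raw score because softmax preserves ordering.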