# COMP9313 2021T3 Final Exam

The deadline for the final exam is:
Monday, 29 November, 6:00 pm (AEDT)
Please submit your answers through Moodle. Do not wait until the last minute to submit, and double-check that your submission was successful.

Question 1. Concepts (3 marks)
(a) (2 marks) Show four differences between MapReduce and Spark.
(b) (1 mark) NoSQL Concept: For the CAP theorem, briefly explain why, if we want to guarantee "P", then "A" and "C" cannot both be guaranteed.
Question 2. MapReduce Programming (14 marks)
You should explain how the input is mapped into (key, value) pairs by the map stage, i.e., specify what the key and the associated value are in each pair and how they are computed. Then you should explain how the (key, value) pairs produced by the map stage are processed by the reduce stage to get the final answer(s). You only need to provide the pseudo code for the classes, including Mapper and Reducer (optionally Combiner etc., and self-defined classes if necessary). The efficiency of your method will be considered.
(a) (6 marks) Problem: Given a large text dataset, find the top-k frequent terms. For terms having the same frequency, sort them in alphabetical order (considering that you can utilize multiple reducers).
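The exam expects MapReduce pseudo code; purely as an illustration of the required logic, the word-count round plus a top-k selection (frequency descending, ties broken alphabetically) can be simulated in plain Python. The documents and k below are made up:

```python
import heapq
from collections import defaultdict

def mapper(doc):
    # round 1 map: emit (term, 1) for every term occurrence
    for term in doc.split():
        yield term, 1

def reducer(term, counts):
    # round 1 reduce: sum the partial counts for one term
    yield term, sum(counts)

def top_k(term_counts, k):
    # round 2: order by descending frequency, ties broken alphabetically
    return heapq.nsmallest(k, term_counts, key=lambda tc: (-tc[1], tc[0]))

# simulate the shuffle between map and reduce
docs = ["big data is big", "data analytics on big data"]
groups = defaultdict(list)
for doc in docs:
    for term, one in mapper(doc):
        groups[term].append(one)
counts = [kv for term, ones in groups.items() for kv in reducer(term, ones)]

print(top_k(counts, 2))
```

With multiple reducers, each reducer would emit its local top-k and a final pass would merge them; the composite sort key (frequency, then term) is what makes the tie-breaking deterministic.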

(b) (8 marks) Problem Background: Given an undirected graph G, its “line graph” is another graph L(G) that represents the adjacencies between edges of G, such that:
• each vertex of L(G) represents an edge of G; and
• two vertices of L(G) are adjacent if and only if their corresponding edges share a common endpoint ("are incident") in G.
The following figures show a graph (left) and its line graph (right). Each vertex of the line graph is shown labelled with the pair of endpoints of the corresponding edge in the original graph. For instance, the vertex on the right labelled (1,3) corresponds to the edge on the left between the vertices 1 and 3. Vertex (1,3) is adjacent to three other vertices: (1,2) and (1,4) (corresponding to edges sharing the endpoint 1 in G) and (3,4) (corresponding to an edge sharing the endpoint 3 in G).
Problem: Given the adjacency list of an undirected graph G, use MapReduce to generate the adjacency list of its line graph L(G). Note that each edge connecting two nodes i and j is represented by (i, j) in L(G) (if i < j).

Question 3. Spark Programming

(a) Given a pair RDD in the format of <docID, Text>, where docID is the ID of a document and Text is the content of a document, the following code will construct the Boolean inverted index for this document set. That is, the output is in the format of <term, List(docID1, docID2, ..., docIDn)>, if there are n documents containing this term.
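For intuition only (not a model MapReduce answer): all edges incident to a common node are pairwise adjacent in L(G), so grouping each node's incident edges and emitting the resulting cliques yields the line graph. A plain-Python sketch on a small graph consistent with the figure description (a real job would use a second MapReduce round, keyed by edge, to build the final adjacency lists):

```python
from collections import defaultdict

# adjacency list of a small example graph G
G = {1: [2, 3, 4], 2: [1, 4], 3: [1, 4], 4: [1, 2, 3]}

def edge(u, v):
    # canonical name (i, j) with i < j for the undirected edge u-v
    return (min(u, v), max(u, v))

# map stage: for each node u, emit key = u, value = an incident edge of u
pairs = [(u, edge(u, v)) for u, nbrs in G.items() for v in nbrs]

# shuffle: group the incident edges by their shared endpoint
incident = defaultdict(set)
for u, e in pairs:
    incident[u].add(e)

# reduce stage: edges sharing an endpoint are pairwise adjacent in L(G)
LG = defaultdict(set)
for u, edges in incident.items():
    for e in edges:
        LG[e] |= edges - {e}

print(sorted(LG[(1, 3)]))  # neighbours of vertex (1,3) in L(G)
```

On this example, vertex (1,3) of L(G) ends up adjacent to (1,2), (1,4), and (3,4), matching the figure description.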
In the pairRDD InvList, the keys (terms) are sorted in alphabetical order, and the docIDs in each list are sorted in ascending order according to their IDs (numerical values).
val docs = sc.parallelize(List((1, "hello world"), (2, "this scala program"), (3, "creates a pair RDD"), (4, "in spark") … …))
// fill your code here, and store the result in a pair RDD InvList
InvList.saveAsTextFile("hdfs://…")
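As a language-neutral sketch of the transformation the missing Spark code must perform (flatMap to (term, docID) pairs, deduplicate, then group and sort), using the same sample documents:

```python
docs = [(1, "hello world"), (2, "this scala program"),
        (3, "creates a pair RDD"), (4, "in spark")]

# flatMap + distinct analogue: one (term, docID) pair per distinct occurrence
pairs = {(term, doc_id) for doc_id, text in docs for term in text.split()}

# groupByKey + sortByKey analogue: keys in sorted order,
# docIDs ascending within each posting list
inv_list = sorted((term, sorted(d for t, d in pairs if t == term))
                  for term in {t for t, _ in pairs})

print(inv_list[0])
```

In Spark one would express the same pipeline with flatMap, distinct, groupByKey (or aggregateByKey), and sortByKey.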
(b) (8 marks) Given a directed graph in format of “sourceNodeID, targetNodeID, Distance” and a node “s”, compute the number (a single integer) of nodes that are reachable from s in the graph (i.e., there exists a path from s to the node). Use the GraphX pregel operator for this task.
val pairs = sc.parallelize(List("1,2,5", "1,4,6", "3,4,2", "3,6,8" … …))
val sourceNode = s.toInt
val edgeList = pairs.map(x => x.split(",")).map(x => Edge(x(0).toLong, x(1).toLong, x(2).toDouble))
val graph = Graph.fromEdges[Int, Double](edgeList, 0)
// fill your code here, and store the result in a variable numRNodes
println(numRNodes)
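The GraphX pregel operator itself is Scala-only; as an illustration of the message-passing logic it would implement (the initial message marks s as reached, reached vertices notify their out-neighbours each superstep, and iteration stops when no new vertex is reached), here is a small Python simulation on made-up edges. Distances are irrelevant for reachability, so they are dropped:

```python
# (src, dst) pairs from a sample input; distances dropped
edges = [(1, 2), (1, 4), (3, 4), (3, 6)]
s = 1

# superstep loop: reached vertices send a message along out-edges;
# a vertex receiving its first message becomes reached and active
reached = {s}
active = {s}
while active:
    msgs = {dst for src, dst in edges if src in active}
    active = msgs - reached
    reached |= msgs

num_r_nodes = len(reached) - 1  # nodes reachable from s, excluding s itself
print(num_r_nodes)
```

In GraphX the same idea maps onto pregel's vprog (become reached on first message), sendMsg (only newly reached vertices send), and mergeMsg, followed by a count of the reached vertices.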

Question 4. Finding Similar Items (7 marks)
(i) (2 marks) Given two documents A (“the sky is blue the sun is bright”) and B (“the sun in the sky is bright”), using the words as tokens, compute the 2-shingles for A and B, and then compute their Jaccard similarity based on their 2-shingles.
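A quick way to sanity-check a hand computation of word-level 2-shingles and the resulting Jaccard similarity:

```python
def shingles(text, k=2):
    # word-level k-shingles of a document, as a set
    words = text.split()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

A = shingles("the sky is blue the sun is bright")
B = shingles("the sun in the sky is bright")
jaccard = len(A & B) / len(A | B)
print(len(A), len(B), jaccard)
```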
(ii) (4 marks) We want to compute the min-hash signatures for the two documents A and B given in part (i), using two pseudo-random permutations of the columns defined by the following hash functions:
h1(n) = (5n − 1) mod M
h2(n) = (2n + 1) mod M
Here, n is the row number in the original ordering of the 2-shingles (according to their occurrence in A, then B), and M is the number of all 2-shingles you have computed from A and B. Instead of explicitly reordering the columns for each hash function, we use the implementation discussed in class, in which we read the data in each column once in sequential order and update the min-hash signatures as we pass through them.
Complete the steps of the algorithm and give the resulting signatures for A and B.
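A sketch of that one-pass algorithm in Python. The row numbering below (shingles in order of occurrence in A, then the two new ones in B) and M = 9 follow from part (i), but they are assumptions of this sketch, not values given here:

```python
# rows 0-6: A's 2-shingles in order of occurrence; rows 7-8: the two
# shingles that appear only in B (this ordering is an assumption)
shingle_rows = {"A": {0, 1, 2, 3, 4, 5, 6},
                "B": {0, 1, 4, 6, 7, 8}}
M = 9  # total number of distinct 2-shingles from A and B
h = [lambda n: (5 * n - 1) % M, lambda n: (2 * n + 1) % M]

# single sequential pass over the rows, updating both signatures
sig = {doc: [M] * len(h) for doc in shingle_rows}  # M serves as +infinity
for row in range(M):
    hashed = [f(row) for f in h]
    for doc, rows in shingle_rows.items():
        if row in rows:
            sig[doc] = [min(a, b) for a, b in zip(sig[doc], hashed)]

print(sig)
```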
(iii) (1 mark) Suppose we wish to find similar sets, and we do so by MinHashing the sets 10 times and then applying locality-sensitive hashing using 5 bands of 2 rows (MinHash values) each. If two sets have a Jaccard similarity of 0.6, what is the probability that they will be identified as candidates by the locality-sensitive hashing (i.e., they hash to the same bucket at least once)? You may assume that there are no coincidences, where two unequal values hash to the same bucket. A correct expression is sufficient: you need not give the actual number.
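The standard band/row analysis applies here: a single band of r rows matches with probability s^r, so over b independent bands the candidate probability is 1 − (1 − s^r)^b. Numerically:

```python
# b bands of r rows: one band matches iff all r MinHash values agree
# (probability s**r), so the candidate probability is 1 - (1 - s**r)**b
s, r, b = 0.6, 2, 5
p = 1 - (1 - s ** r) ** b
print(p)
```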

Question 5. Mining Data Streams (6 marks)
(a) (3 marks) Bloom Filter
Consider a Bloom filter of size m = 7 (i.e., 7 bits) and 2 hash functions that both take a string (lowercase) as input:
h0(str) = (∑_{c in str} (c − 'a')) mod 7
h1(str) = str.length mod 7
Here, c – ‘a’ is used to compute the position of the letter c in the 26 alphabetical letters, e.g., h0(“bd”) = (1 + 3) mod 7 = 4.
(i) (1 mark) Given the set of strings S = {"hi", "big", "data"}, show the updates of the Bloom filter as each string is inserted.
(ii) (1 mark) Given a string “comp”, use the Bloom filter to check whether it is contained in S.
(iii) (1 mark) Give a false positive example, that is, the string is not in S but the Bloom filter declares it is in S.
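The filter's behaviour under these two hash functions can be checked mechanically; a minimal sketch, with the bit array encoded as a plain list of ints:

```python
M = 7

def h0(s):
    # sum of letter positions (c - 'a'), mod 7
    return sum(ord(c) - ord('a') for c in s) % M

def h1(s):
    return len(s) % M

bits = [0] * M
for word in ["hi", "big", "data"]:  # insert the set S
    bits[h0(word)] = 1
    bits[h1(word)] = 1

def maybe_contains(word):
    # Bloom filter query: all hashed bits must be set
    return bits[h0(word)] == 1 and bits[h1(word)] == 1

print(bits)
print(maybe_contains("comp"))  # "comp" was never inserted
```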
(b) (3 marks) CM-Sketch
Assume that we have 5 buckets and three hash functions:
h0(str) = (str.length * 2) mod 5
h1(str) = str.length mod 5
h2(str) = (str[0] − 'a') mod 5
Given a stream of terms: "big", "data", "data", "set", "data", "analytics", show the steps of building the CM-Sketch. Then use the built CM-Sketch to get the count for the word "data".
Initially, the CM-Sketch is like:
     b0  b1  b2  b3  b4
h0    0   0   0   0   0
h1    0   0   0   0   0
h2    0   0   0   0   0
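A minimal sketch of the update and point-query logic under these three hash functions, with the 3 × 5 counter array laid out as in the initial table:

```python
M = 5
hs = [lambda s: (len(s) * 2) % M,
      lambda s: len(s) % M,
      lambda s: (ord(s[0]) - ord('a')) % M]

# 3 x 5 counter array, initially all zero
sketch = [[0] * M for _ in hs]
for word in ["big", "data", "data", "set", "data", "analytics"]:
    for row, h in enumerate(hs):
        sketch[row][h(word)] += 1

def estimate(word):
    # point query: the minimum of the word's counters across the rows
    return min(sketch[row][h(word)] for row, h in enumerate(hs))

print(estimate("data"))
```

Note that the estimate can exceed the true frequency when other terms collide with "data" in every row; the minimum over rows only bounds the count from above.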

Question 6. Recommender Systems (6 marks)
(a) (2 marks) Consider a database containing information about movies: genre, director, and decade of release. We also have information about which users have seen each movie. The rating for a user on a movie is either 0 or 1. Here is a summary of the database:
Consider user U1 is interested in the period 2000s, the director D2 and the genre Humor. We have some existing recommender system R that recommended the movie B to user U1. Is the recommender system R content-based or collaborative filtering based? Explain your answer.
(b) (4 marks) Given the above five movies, ratings from three users are as below. Estimate the rating of U1 to B using the user-user collaborative filtering method.
• Do not include the ratings for B in the user vectors (thus the dimension of the user vectors is 4).
• Adopt the cosine similarity measure to compute the similarities.
• Use the weighted average to perform the rating prediction.
user  item  rating
U1    A     2
U1    D     4
U2    A     1
U2    B     3
U2    D     4
U3    B     5
U3    D     3
U3    E     2
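The three required steps can be checked mechanically; a sketch assuming the item order A, C, D, E for the 4-dimensional vectors (movie C is unrated by everyone, so it contributes zeros), with missing ratings treated as 0:

```python
from math import sqrt

ratings = {"U1": {"A": 2, "D": 4},
           "U2": {"A": 1, "B": 3, "D": 4},
           "U3": {"B": 5, "D": 3, "E": 2}}
items = ["A", "C", "D", "E"]  # B excluded => 4-dimensional vectors

def vec(user):
    # missing ratings are treated as 0
    return [ratings[user].get(i, 0) for i in items]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

# similarity of U1 to each neighbour, then a similarity-weighted
# average of the neighbours' ratings for B
sims = {u: cosine(vec("U1"), vec(u)) for u in ["U2", "U3"]}
pred = sum(sims[u] * ratings[u]["B"] for u in sims) / sum(sims.values())
print(round(pred, 3))
```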
End Of Paper