# CS5344 Lab 2

AY2020/2021 Semester 2
This lab has two tasks and requires you to work with a large set of documents and search for relevant documents. You will need to understand the concepts of RDD, transformation and action in Apache Spark, and implement an algorithm at the RDD level. This is an individual lab assignment.
Task A: Write a Spark program to find the top-k frequently occurring word pairs in a set of documents.
Step 1. Preprocess the documents. (Refer to Data Pre-processing section)
Step 2. Compute the count of every word pair in the resulting documents. Note that (word1, word2) and (word2, word1) are considered the same word pair.
Step 3. Sort the list of word pairs in descending order and obtain the top-k frequently occurring word pairs.
Input: (a) Set of documents (in “datafiles” folder), (b) stopwords to remove (in stopwords.txt).
Output: One line per word pair together with its count. Sort the word pairs in descending order of frequency of occurrence. Use k = 5.
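One possible reading of Steps 1–3, treating a word pair as an unordered co-occurrence of two distinct words within the same document (the helper name and the commented pipeline below are illustrative, not part of the hand-out):

```python
from itertools import combinations

def word_pairs(words):
    """Unordered pairs of distinct words in one document; sorting each
    pair makes ('tea', 'cup') and ('cup', 'tea') the same key."""
    return [tuple(sorted(p)) for p in combinations(sorted(set(words)), 2)]

# The Spark pipeline itself might look like this (requires pyspark;
# "datafiles" and k = 5 come from the assignment, while preprocess()
# stands in for your own cleaning function):
#
#   docs = sc.wholeTextFiles("datafiles").mapValues(preprocess)
#   top5 = (docs.values()
#               .flatMap(word_pairs)
#               .map(lambda pair: (pair, 1))
#               .reduceByKey(lambda a, b: a + b)
#               .takeOrdered(5, key=lambda kv: -kv[1]))
```

Whether a pair should be counted once per document (as `set(words)` does here) or once per occurrence is a design choice; check the assignment's intent before committing to one.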
Task B: Write a Spark program that finds the top-k relevant documents given a query comprising a set of keywords.
A document can be modelled as a vector of words (or terms). Each entry in the vector is a TF-IDF value that reflects how important a word is to a document in a collection, computed as TF-IDF = (1 + log(TF)) * log(N/DF), where N is the total number of documents, TF is the count of the word in the document, and DF is the number of documents containing the word. Figure 1 shows a simple example.
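Plugging numbers into the formula, using log base 10 as required later in Step 3 (the word counts here are made up for illustration):

```python
from math import log10

def tf_idf(tf, df, n_docs):
    """(1 + log10 TF) * log10(N / DF), as defined above."""
    return (1 + log10(tf)) * log10(n_docs / df)

# A word appearing 10 times in one of 100 documents, and occurring in
# 10 documents overall: (1 + 1) * 1 = 2.0
score = tf_idf(tf=10, df=10, n_docs=100)
```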

Figure 1. Example of representing documents as vectors.
A query can also be represented as a vector where each entry represents a word, with value 1 if the word is in the query and 0 otherwise. We can compute a relevance score for each document d to a query q based on the cosine similarity of their corresponding vectors V1 and V2, and rank the documents with respect to the query:

relevance(q, d) = cosine(V1, V2) = (V1 · V2) / (||V1|| × ||V2||)
Step 1. Preprocess the documents. (Refer to Data Pre-processing section)
Step 2. Compute term frequency (TF) of every word in a document.
Step 3. Compute TF-IDF of every word w.r.t. a document.
You can use key-value pair RDD and the groupByKey() API. Use log base 10 for TF-IDF calculation.
Step 4. Compute normalized TF-IDF of every word w.r.t. a document.
If the TF-IDF value of word1 in doc1 is t1 and the sum of squares of the TF-IDF values of all the words in doc1 is S, then the normalized TF-IDF value of word1 is t1/√S.
Step 5. Compute the relevance of each document w.r.t. a query.
Step 6. Sort and get top-k documents.
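Steps 4 and 5 can be sketched in plain Python before lifting them onto RDDs. The function names are illustrative, and the cosine shortcut in `relevance` relies on the document vector having already been normalized in Step 4:

```python
from math import sqrt

def normalize(tfidf_by_word):
    """Step 4: divide each TF-IDF value by the square root of the sum
    of squares over all words in the document."""
    s = sqrt(sum(v * v for v in tfidf_by_word.values()))
    return {w: v / s for w, v in tfidf_by_word.items()}

def relevance(doc_vec, query_words):
    """Step 5: cosine of the normalized document vector with a 0/1
    query vector.  ||doc|| = 1 after Step 4 and ||query|| = sqrt of the
    number of query words, so the cosine reduces to the sum of the
    matching entries divided by sqrt(|query|)."""
    return sum(doc_vec.get(w, 0.0) for w in query_words) / sqrt(len(query_words))

# On key-value RDDs this might look like (illustrative, assuming an
# RDD of (doc_id, (word, tfidf)) pairs and the groupByKey() hint above):
#
#   by_doc = tfidf.groupByKey().mapValues(lambda ws: normalize(dict(ws)))
#   scores = by_doc.mapValues(lambda vec: relevance(vec, query_words))
#   top5   = scores.takeOrdered(5, key=lambda kv: -kv[1])
```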
Input: (a) set of documents (in “datafiles” folder), (b) set of keywords for a query (in query.txt), (c) stopwords to remove (in stopwords.txt).

Output: One line per document together with its relevance score.
The output should be sorted in descending order of the relevance of the documents to the query. Use k = 5.
Deliverables: Zip your executable Spark programs for Task A and Task B (with documentation in the code) together with the output files, and upload the archive to the Lab2 folder. The zipped folder should be named StudentID_Lab2.
Data Pre-processing:
- Convert all the words to lower case.
- Remove the stop words (provided in stopwords.txt) and then join words on special symbols like -, ', etc. Ex. "tea-table" becomes "teatable".
- Drop symbols such as "?", "!", ".", etc. occurring at the end of words. Ex. "home?" or "home ?" becomes "home".
- Drop independent numbers (not alphanumeric) in sentences. Ex. "flying 2 birds" becomes "flying birds".
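The rules above could be implemented along these lines; the exact regexes and the handling order are assumptions, so adjust them to match your own reading of the rules:

```python
import re

def preprocess(text, stopwords):
    """One reading of the pre-processing rules: lower-case, drop stop
    words, join on - and ', strip trailing punctuation, drop bare
    numbers.  Empty leftovers (e.g. a lone '?') are discarded."""
    tokens = []
    for word in text.lower().split():
        if word in stopwords:
            continue
        word = re.sub(r"[-']", "", word)        # "tea-table" -> "teatable"
        word = re.sub(r"[?!.,;:]+$", "", word)  # "home?" -> "home"
        if not word or word.isdigit():          # drop "2", keep "apple1"
            continue
        tokens.append(word)
    return tokens
```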

Important Notes:
- Specify the Python version along with the supporting packages used and the command to run your scripts.
- Your code should be executable either on the virtual machine configuration given in Lab 1 or on a stand-alone Spark configuration.
- You can consider non-English, alphanumeric words as new words. Ex. "apple1" can be considered a new word.