Here is an example Colab starter for Spark:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!pip install -q pyspark

import pyspark
from pyspark import SparkConf, SparkContext

conf = SparkConf().setMaster("local[*]").set("spark.executor.memory", "1g")
sc = SparkContext(conf = conf)

Create a Python Spark program that does the following:
1. ( ) Loads each line of the text file 1661-0.txt as an entry in an RDD: https://www.gutenberg.org/files/1661/1661-0.txt

hint: you can use this in your Colab notebook to download the file automatically:

!wget "https://www.gutenberg.org/files/1661/1661-0.txt"

2. (Birds in Town and Village) Loads each line of the text file 7353.txt.utf-8 as an entry in a different RDD: https://www.gutenberg.org/ebooks/7353.txt.utf-8

!wget "https://www.gutenberg.org/ebooks/7353.txt.utf-8"

3. For both RDDs, maps each entry to tuples of sequential word bigrams (word, next word), lowercased and containing letters only. For example, the following line:

“{From the elms hard by comes a subdued, airy prattle of a few sparrows.}”
would be represented as the tuples

[('from', 'the'), ('the', 'elms'), ('elms', 'hard'), ('hard', 'by'), ('by', 'comes'), ('comes', 'a'), ('a', 'subdued'), ('subdued', 'airy'), ('airy', 'prattle'), ('prattle', 'of'), ('of', 'a'), ('a', 'few'), ('few', 'sparrows')]
hint: you can reduce a line of text to lowercase letters and spaces with a function like:
def only_letters(line):
    return ''.join([c.lower() for c in line if c.isalpha() or c == ' '])
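As a rough illustration of step 3, the hint function can be combined with `split` and `zip` to turn one line into bigram tuples. This is only a sketch: the name `line_to_bigrams` is made up for this example, and it assumes that bigrams do not span line boundaries, since each RDD entry is a single line.

```python
def only_letters(line):
    """Lowercase the line and keep only letters and spaces."""
    return ''.join(c.lower() for c in line if c.isalpha() or c == ' ')


def line_to_bigrams(line):
    """Return the sequential (word, next word) pairs found in one line.

    Hypothetical helper for illustration; not part of the assignment text.
    """
    words = only_letters(line).split()
    # zip pairs each word with its successor; a line with fewer than
    # two words produces an empty list.
    return list(zip(words, words[1:]))


example = "{From the elms hard by comes a subdued, airy prattle of a few sparrows.}"
bigrams = line_to_bigrams(example)
print(bigrams[:3])  # [('from', 'the'), ('the', 'elms'), ('elms', 'hard')]
```

On an RDD of lines, a function like this could be passed to `flatMap`, so that every bigram becomes its own RDD entry rather than one list per line.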
4. Filters out any pair that contains a stopword. Use the stopwords from the nltk library. Here's an example that builds a STOPWORDS set containing English stop words:

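!pip install nltk
import nltk
nltk.download('stopwords')  # the stopword corpus must be downloaded before first use
STOPWORDS = set(nltk.corpus.stopwords.words('english'))  # a set gives fast membership tests
print('the' in STOPWORDS)  # True
print('computer' in STOPWORDS)  # False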
With the example in step 3 the tuples would be filtered to:
[('elms', 'hard'), ('subdued', 'airy'), ('airy', 'prattle')]
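The filtering rule in step 4 could be written as a small predicate, sketched below. To keep the example runnable without downloading anything, it uses `SAMPLE_STOPWORDS`, a hypothetical hand-picked subset that stands in for the nltk list. The predicate name `has_no_stopword` is also made up for this example.

```python
# Hypothetical stand-in for the nltk English stopword list, limited to the
# stopwords that appear in the example sentence.
SAMPLE_STOPWORDS = frozenset({'from', 'the', 'by', 'a', 'of', 'few'})


def has_no_stopword(bigram, stopwords=SAMPLE_STOPWORDS):
    """Return True only if neither word of the bigram is a stopword."""
    first, second = bigram
    return first not in stopwords and second not in stopwords


bigrams = [('from', 'the'), ('the', 'elms'), ('elms', 'hard'), ('hard', 'by'),
           ('by', 'comes'), ('comes', 'a'), ('a', 'subdued'), ('subdued', 'airy'),
           ('airy', 'prattle'), ('prattle', 'of'), ('of', 'a'), ('a', 'few'),
           ('few', 'sparrows')]

kept = [b for b in bigrams if has_no_stopword(b)]
print(kept)  # [('elms', 'hard'), ('subdued', 'airy'), ('airy', 'prattle')]
```

The same predicate could be handed to an RDD's `filter` method. Using a set (or frozenset) rather than a list keeps each membership check constant-time, which matters when the test runs once per bigram.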
5. Computes the most frequent bigrams that are in the Birds in Town and Village RDD and NOT in the other RDD (the one loaded in step 1). There are several ways to do this, some more efficient than others. Printing the top 10 bigrams by count should give the output:

[(('bird', 'life'), 18), (('young', 'birds'), 9), (('small', 'birds'), 7), (('day', 'long'), 7), (('wild', 'birds'), 6), (('one', 'another'), 6), (('birds', 'would'), 5), (('would', 'probably'), 5), (('w', 'h'), 5), (('exotic', 'birds'), 5)]

Full credit for efficient solutions that use Spark functions
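To sanity-check the logic of step 5 on small inputs, a plain-Python reference can help before writing the Spark version. The sketch below is not a Spark solution. The function name `top_exclusive_bigrams` and the toy data are invented for illustration, and the comments name the pair-RDD operations that play a similar role.

```python
from collections import Counter


def top_exclusive_bigrams(target_bigrams, other_bigrams, n=10):
    """Count bigrams in target_bigrams, drop any that also occur in
    other_bigrams, and return the n most frequent as (bigram, count) pairs."""
    # Counting occurrences; in Spark this resembles mapping each bigram
    # to (bigram, 1) and combining with reduceByKey.
    counts = Counter(target_bigrams)
    # Bigrams seen in the other text; in Spark, subtractByKey on pair RDDs
    # removes keys that are present in a second RDD.
    seen_elsewhere = set(other_bigrams)
    exclusive = {b: c for b, c in counts.items() if b not in seen_elsewhere}
    # Highest count first; breaking ties alphabetically is a choice made by
    # this sketch, so tie order may differ from the expected output above.
    return sorted(exclusive.items(), key=lambda kv: (-kv[1], kv[0]))[:n]


# Toy data, made up for this example.
birds = [('bird', 'life'), ('young', 'birds'), ('bird', 'life'), ('one', 'day')]
other = [('one', 'day'), ('young', 'man')]
print(top_exclusive_bigrams(birds, other, n=2))
# [(('bird', 'life'), 2), (('young', 'birds'), 1)]
```

In a Spark version, the final selection could be done with `takeOrdered` and a key that negates the count, which avoids sorting the whole RDD just to read the first ten entries.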

Please download your .ipynb file from Google Colab (see File -> Download .ipynb) and upload it to this question. DO NOT SUBMIT A WEB LINK.
