
Tasks and Task Explanations
The Tasks (1 mark each)
1. Find the top stems
Implement a function get_top_stems that returns a list of the n most frequent stems of words that are not in NLTK's list of stopwords. To determine whether a word is a stop word, remember to lowercase it first. The list must be sorted by frequency in descending order, and the words must preserve the original casing. The input arguments of the function are:
• document: The name of the Gutenberg document, e.g. “austen-emma.txt”.
• n: The number of stems to return.
To produce the correct results, the function must do this:
• Use the NLTK libraries to find the tokens and the stems.
• Use NLTK’s sentence tokeniser before NLTK’s word tokeniser.
• Use NLTK’s list of stop words, and compare your words with those of the list after lowercasing.
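For illustration, here is a minimal sketch of one way this could be put together. The helper name top_stems_sketch is ours, and the choice of the Porter stemmer is an assumption; the task does not name a specific stemmer.

import collections
import nltk

def top_stems_sketch(document, n):
    text = nltk.corpus.gutenberg.raw(document)
    # Sentence tokenisation first, then word tokenisation, as required.
    tokens = [w for sent in nltk.sent_tokenize(text)
                for w in nltk.word_tokenize(sent)]
    # Compare with the stop word list after lowercasing.
    stop = set(nltk.corpus.stopwords.words('english'))
    kept = [w for w in tokens if w.lower() not in stop]
    # Count the stems of the remaining tokens (Porter stemmer is an assumption).
    stemmer = nltk.PorterStemmer()
    counts = collections.Counter(stemmer.stem(w) for w in kept)
    return [stem for stem, _ in counts.most_common(n)]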
2. Find the top PoS bigrams
Implement a function get_top_pos_bigrams that returns a list with the n most frequent bigrams of parts of speech. Do not remove stop words. The list of bigrams must be sorted by frequency in descending order. The input arguments are:
• document: The name of the Gutenberg document, e.g. “austen-emma.txt”.
• n: The number of bigrams to return.
To produce the correct results, the function must do this:
• Use NLTK’s pos_tag_sents instead of pos_tag.
• Use NLTK’s “universal” PoS tagset.
• When computing bigrams, do not consider parts of speech of words that are in different sentences. For example, for the text "Sentence 1. And sentence 2" the bigrams are ('NOUN', 'NUM'), ('NUM', '.'), ('CONJ', 'NOUN'), and ('NOUN', 'NUM'). Note that ('.', 'CONJ') is not a valid bigram, since the punctuation mark and the word "And" are in different sentences.
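A sketch of how the sentence-boundary constraint can be respected (the name top_pos_bigrams_sketch is ours):

import collections
import nltk

def top_pos_bigrams_sketch(document, n):
    text = nltk.corpus.gutenberg.raw(document)
    sents = [nltk.word_tokenize(s) for s in nltk.sent_tokenize(text)]
    # pos_tag_sents tags a whole list of tokenised sentences in one call.
    tagged = nltk.pos_tag_sents(sents, tagset='universal')
    counts = collections.Counter()
    for sent in tagged:
        tags = [tag for _, tag in sent]
        # zip pairs tags only within this sentence, so no bigram crosses a boundary.
        counts.update(zip(tags, tags[1:]))
    return [bigram for bigram, _ in counts.most_common(n)]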
3. Find the distribution of frequencies of parts of speech after a given word
Implement a function get_pos_after that returns the distribution of the parts of speech of the words that follow a word given as an input to the function. The result must be returned in descending order of frequency. The input arguments of the function are:
• document: The name of the Gutenberg document, e.g. “austen-emma.txt”.
• word: The word.
To produce the correct results, the function must do this:
• First do sentence tokenisation, then word tokenisation.
• Do not consider words that occur in different sentences. Thus, if a word ends a sentence, there are no words following it.
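A minimal sketch, assuming an exact (case-sensitive) match on the input word; the name pos_after_sketch is ours:

import collections
import nltk

def pos_after_sketch(document, word):
    text = nltk.corpus.gutenberg.raw(document)
    sents = [nltk.word_tokenize(s) for s in nltk.sent_tokenize(text)]
    tagged = nltk.pos_tag_sents(sents, tagset='universal')
    counts = collections.Counter()
    for sent in tagged:
        # Pair each tagged token with its successor; pairs never cross sentences,
        # so a sentence-final word contributes nothing.
        for (w, _), (_, next_tag) in zip(sent, sent[1:]):
            if w == word:
                counts[next_tag] += 1
    return counts.most_common()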
4. Get the words with highest tf.idf
In this exercise you will implement a simple approach to find keywords in a document.
Implement a function get_top_word_tfidf that returns the list of n words with highest tf.idf. The result must be returned in descending order of tf.idf. The input arguments are:
• document: The name of the Gutenberg document, e.g. “austen-emma.txt”.
• n: The number of words to return.
To produce the correct results, the function must do this:
• Use Scikit-learn’s TfidfVectorizer.
• Fit the tf.idf vectorizer using the documents of the NLTK Gutenberg corpus.
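A minimal sketch (the name top_word_tfidf_sketch is ours; on scikit-learn versions older than 1.0, use get_feature_names() instead of get_feature_names_out()):

import nltk
from sklearn.feature_extraction.text import TfidfVectorizer

def top_word_tfidf_sketch(document, n):
    fileids = nltk.corpus.gutenberg.fileids()
    # Fit on the raw text of every Gutenberg document.
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform([nltk.corpus.gutenberg.raw(f) for f in fileids])
    # Pick out the row of the requested document and rank its weights.
    row = tfidf[fileids.index(document)].toarray()[0]
    vocab = vectorizer.get_feature_names_out()
    top = row.argsort()[::-1][:n]
    return [vocab[i] for i in top]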
5. Get the sentences with highest average of tf.idf
In this exercise you will implement a simple document summariser that returns the most important sentences based on the average tf.idf.
Implement a function get_top_sentence_tfidf that returns the positions of the sentences which have the largest average tf.idf. The list of sentence positions must be returned in the order of occurrence in the document. The input arguments are:
• document: The name of the Gutenberg document, e.g. “austen-emma.txt”.
• n: The number of sentence positions to return.
The reason for returning the sentence positions in the order of occurrence, and not in order of average tf.idf, is that this is what document summarisers normally do.
To produce the correct results, the function must do this:
• Use Scikit-learn’s TfidfVectorizer.
• Fit the tf.idf vectorizer using the sentences of the documents of the NLTK Gutenberg corpus. This is different from task 4: here you compute the tf.idf of sentences, not of documents.
• Use NLTK’s sentence tokeniser to find the sentences.
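A minimal sketch. The task does not define "average tf.idf" precisely; taking it as the sentence's total tf.idf weight divided by its number of distinct terms is our assumption, and the name top_sentence_tfidf_sketch is ours:

import nltk
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def top_sentence_tfidf_sketch(document, n):
    # Fit on the sentences of every Gutenberg document, not on whole documents.
    train_sents = []
    for f in nltk.corpus.gutenberg.fileids():
        train_sents.extend(nltk.sent_tokenize(nltk.corpus.gutenberg.raw(f)))
    vectorizer = TfidfVectorizer()
    vectorizer.fit(train_sents)
    # Score each sentence of the target document.
    doc_sents = nltk.sent_tokenize(nltk.corpus.gutenberg.raw(document))
    tfidf = vectorizer.transform(doc_sents)
    totals = np.asarray(tfidf.sum(axis=1)).ravel()
    nterms = np.asarray((tfidf != 0).sum(axis=1)).ravel()
    averages = totals / np.maximum(nterms, 1)  # assumed definition of "average"
    # Take the n highest-scoring sentences, then report their positions in
    # document order rather than score order.
    top = np.argsort(averages)[::-1][:n]
    return sorted(int(i) for i in top)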

Copy the code below into JupyterLab (launched from Anaconda Navigator) and open the notebook with a Python 3 kernel.
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
import numpy as np
import collections
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('punkt')
nltk.download('gutenberg')
nltk.download('stopwords')                    # needed for task 1
nltk.download('averaged_perceptron_tagger')   # needed for tasks 2 and 3
nltk.download('universal_tagset')             # needed for tasks 2 and 3

# Task 1 (1 mark)

def get_top_stems(document, n):
    """Return a list of the n most frequent stems of a Gutenberg document, sorted by
    frequency in descending order. Don't forget to remove stop words before counting
    the stems.

    >>> get_top_stems('austen-emma.txt', 10)
    [',', '.', '--', "''", ';', '``', 'mr.', '!', "'s", 'emma']
    >>> get_top_stems('austen-sense.txt', 7)
    [',', '.', "''", ';', '``', '--', 'elinor']
    """
    return []

# Task 2 (1 mark)

def get_top_pos_bigrams(document, n):
    """Return the n most frequent bigrams of parts of speech. Return the list sorted
    in descending order of frequency. The parts of speech of words in different
    sentences cannot form a bigram. Use the universal pos tag set.

    >>> get_top_pos_bigrams('austen-emma.txt', 3)
    [('NOUN', '.'), ('PRON', 'VERB'), ('DET', 'NOUN')]
    """
    return []

# Task 3 (1 mark)

def get_pos_after(document, word):
    """Return the distribution of frequencies of the parts of speech occurring after
    a word. Return the result sorted by frequency in descending order. Do not consider
    words that occur in different sentences. Use the universal pos tag set.

    >>> get_pos_after('austen-emma.txt', 'the')
    [('NOUN', 3434), ('ADJ', 1148), ('ADV', 170), ('NUM', 61), ('VERB', 24), ('.', 7)]
    """
    return []

# Task 4 (1 mark)

def get_top_word_tfidf(document, n):
    """Return the list of n words with highest tf.idf. The reference for computing
    tf.idf is the NLTK Gutenberg corpus. The list of words must be sorted by tf.idf
    in descending order.

    >>> get_top_word_tfidf('austen-emma.txt', 3)
    ['emma', 'mr', 'harriet']
    """
    return []

# Task 5 (1 mark)

def get_top_sentence_tfidf(document, n):
    """Return the positions of the n sentences which have the largest average tf.idf.
    The list of sentence positions must be returned in the order of occurrence in the
    document. The reference for computing tf.idf is the list of sentences from the
    NLTK Gutenberg corpus.

    >>> get_top_sentence_tfidf('austen-emma.txt', 3)
    [5668, 5670, 6819]
    """
    return []

# DO NOT MODIFY THE CODE BELOW

if __name__ == "__main__":
    import doctest
    doctest.testmod(optionflags=doctest.ELLIPSIS)
