# MIE 1624 Introduction to Data Science and Analytics – Winter 2017

Assignment 2

Due date: Thursday, March 23 at 11:59pm

Background

In recent years, there has been growing interest in mining online political sentiment to predict the outcome of elections. Continuing from Assignment 1, you are asked to implement more sophisticated machine learning approaches and apply them to calculating the sentiment of tweets, in particular tweets related to the 2015 Canadian federal election.

As in Assignment 1, you have access to a classified data set, classified_tweets.txt, and an unclassified data set, unclassified_tweets.txt. The classified data set contains tweets whose sentiment has already been analyzed and recorded as a value of 0 or 4: a value of 0 denotes a negative tweet and a value of 4 denotes a positive tweet.

You are asked to implement two of the following well-established approaches: logistic regression, linear regression, SVM, and Naive Bayes. Based on an analysis of their performance on the classified data set, select one of the two and use it to categorize the tweets in the unclassified data set. You are also asked to implement an algorithm that categorizes the tweets in the unclassified data set by political party.
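As an illustration of one of the listed approaches, the following is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing, written with the standard library only. The class name `NaiveBayes`, the whitespace tokenizer, and the toy tweets are assumptions made for this example; they are not part of the assignment data, and a library implementation (for example from scikit-learn) may well be a better choice in practice.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter(labels)        # label -> number of tweets
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word in tokenize(text):
                if word in self.vocab:  # ignore words never seen in training
                    score += math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy data invented for illustration; labels follow the 0/4 convention.
train_texts = ["great debate tonight", "love this candidate",
               "terrible policy", "worst campaign ever"]
train_labels = [4, 4, 0, 0]

model = NaiveBayes().fit(train_texts, train_labels)
prediction = model.predict("love the debate")
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied together.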

Produce a report detailing the analysis you performed to choose the more suitable of your two initial classifiers, the results of applying the chosen algorithm to the unclassified tweets, and any insights into the political sentiment of the Canadian electorate with respect to the major political parties participating in the 2015 federal election.
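One simple way to relate tweets to parties is keyword matching, followed by an aggregate of the predicted sentiment per party. The sketch below assumes hypothetical keyword lists (`PARTY_KEYWORDS`) and invented example tweets; the actual keywords should be chosen after inspecting the data, and other categorization methods are equally valid.

```python
from collections import defaultdict

# Hypothetical keyword lists; extend them after inspecting the actual tweets.
PARTY_KEYWORDS = {
    "Conservative": {"conservative", "cpc", "harper", "tory"},
    "Liberal": {"liberal", "lpc", "trudeau"},
    "NDP": {"ndp", "mulcair"},
}


def parties_mentioned(tweet):
    """Return the set of parties whose keywords appear in the tweet."""
    words = {w.strip("#@.,!?:;\"'").lower() for w in tweet.split()}
    return {party for party, keys in PARTY_KEYWORDS.items() if words & keys}


def positive_share_by_party(tweets, sentiments):
    """Fraction of positive (label 4) tweets among those mentioning each party."""
    totals, positives = defaultdict(int), defaultdict(int)
    for tweet, sentiment in zip(tweets, sentiments):
        for party in parties_mentioned(tweet):
            totals[party] += 1
            positives[party] += int(sentiment == 4)
    return {party: positives[party] / totals[party] for party in totals}


# Invented tweets; in practice the sentiments would be classifier predictions.
tweets = ["Trudeau was great tonight #lpc", "Not impressed by Harper", "Go NDP!"]
sentiments = [4, 0, 4]
shares = positive_share_by_party(tweets, sentiments)
```

A tweet that mentions several parties is counted once for each of them, which is a design choice worth stating in the report.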

Finding the most suitable algorithm for a given application task (comparing multiple classifiers on a specific domain) requires selecting performance measures, such as accuracy, true positive rate (TPR), and false positive rate (FPR), by which to judge the algorithms being considered. As part of your evaluation procedure, select an appropriate number of measures, review the results that your chosen methods obtain on them, and try to explain these observations.

One of the questions to ask here is: “Can the observed results be attributed to the characteristics of the implemented classifiers or are they observed by chance?” Hypothesis testing is one way to gather additional evidence of the extent to which the results of the evaluation metrics are representative of the general behaviour of the classifiers under consideration. As the data we have access to is limited (it is not the whole population of tweets), statistical resampling techniques (simple or multiple resampling, etc.) can be applied to the available data to improve the estimate of the classification error. (Keep in mind that, given sufficient data, it is always possible to show that a difference between two alternatives, no matter how small, is statistically significant.)

When evaluating the performance of two algorithms, also keep in mind that there can be an inherent tradeoff between the results on various performance measures. For example, the TPR and the FPR measure different things, and an algorithm with good results on one often yields poor results on the other. Classifier evaluation can also be viewed as a problem of analyzing high-dimensional data, and various methods can be employed for an effective visualization. In your report, present at least one graphical comparison of the performance of the classification algorithms you have selected.
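The measures and the resampling idea described above can be sketched as follows. This example computes accuracy, TPR, and FPR from true and predicted labels, and uses a bootstrap (one possible resampling technique, chosen here as an assumption) to obtain a percentile interval for accuracy. The function names and the label vectors are invented for illustration.

```python
import random


def confusion_counts(y_true, y_pred, positive=4):
    """Count true/false positives and negatives, treating `positive` as the positive label."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        elif pred == positive:
            fp += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def evaluate(y_true, y_pred, positive=4):
    """Return accuracy, true positive rate and false positive rate."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "tpr": tp / (tp + fn) if tp + fn else 0.0,
        "fpr": fp / (fp + tn) if fp + tn else 0.0,
    }


def bootstrap_accuracy_interval(y_true, y_pred, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for accuracy (resampling tweets with replacement)."""
    rng = random.Random(seed)
    n = len(y_true)
    correct = [t == p for t, p in zip(y_true, y_pred)]
    accuracies = sorted(
        sum(correct[rng.randrange(n)] for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lower = accuracies[int((alpha / 2) * n_resamples)]
    upper = accuracies[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Invented labels for illustration only.
y_true = [4, 4, 4, 0, 0, 0, 4, 0]
y_pred = [4, 4, 0, 0, 0, 4, 4, 0]
metrics = evaluate(y_true, y_pred)
lower, upper = bootstrap_accuracy_interval(y_true, y_pred)
```

With so few examples the interval is very wide, which illustrates why a single accuracy figure on a small sample should be interpreted with caution.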

Learning Objectives

(1) Understand how to apply machine learning algorithms (e.g., logistic regression, linear regression, SVM, and Naive Bayes) to the task of text classification.

(2) Improve on skills and competencies required to compare and contrast the performance of classification algorithms on one domain (text classification), including application of performance measurements, statistical hypothesis testing and visualization of comparisons.

(3) Improve on skills and competencies required to collate and present domain-specific, evidence-based insights.
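For the statistical hypothesis testing mentioned in objective (2), one option among several is McNemar's exact test, which compares two classifiers evaluated on the same examples by looking only at the examples where exactly one of them is correct. The sketch below is an assumption about how such a test could be coded with the standard library (it needs Python 3.8 or later for `math.comb`); it is not a method prescribed by the assignment, and the label vectors are invented.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar test.

    Returns (only_a, only_b, p_value), where only_a counts examples that
    classifier A gets right and B gets wrong, and only_b the reverse.
    """
    rows = list(zip(y_true, pred_a, pred_b))
    only_a = sum(1 for t, a, b in rows if a == t and b != t)
    only_b = sum(1 for t, a, b in rows if b == t and a != t)
    n = only_a + only_b
    if n == 0:
        return only_a, only_b, 1.0  # the classifiers never disagree in correctness
    k = min(only_a, only_b)
    # Under the null hypothesis, each disagreement favours A or B with probability 1/2.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return only_a, only_b, min(1.0, 2 * tail)


# Invented labels for illustration only.
y_true = [4, 4, 4, 4, 0, 0, 0, 0]
pred_a = [4, 4, 4, 4, 0, 0, 0, 4]
pred_b = [0, 0, 0, 0, 4, 0, 0, 0]
only_a, only_b, p_value = mcnemar_exact(y_true, pred_a, pred_b)
```

In this toy case classifier A wins five of the six disagreements, yet the p-value is about 0.22, so the difference would not be judged significant at the usual 5% level.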

To Do:

● Apply two of the following algorithms to the task of classifying tweets as positive or negative: logistic regression, linear regression, SVM, Naive Bayes.

● Prepare a 3 to 4-page report describing:

➢ (1) the classification algorithms you have implemented (at most 1.5 pages);

➢ (2) the results of applying the two selected algorithms to the classified data set, according to the selected performance measures, as well as your interpretation of the results, including graphical representations of the comparisons between the chosen classifiers, and your choice of method to apply to the unclassified data set;

➢ (3) the results of applying the chosen algorithm to the unclassified data set, as well as any potential insights into the political sentiment of the Canadian electorate with respect to the major political parties participating in the 2015 federal election.

All graphs should be easy to read and have all axes appropriately labelled. All other visual materials should be equally easy to understand.

Tools Required

● Software

○ Python version 3.X only; Python 2.7 is not allowed for this assignment. Your code should run on the Data Scientist Workbench (Kernel 3). All libraries and built-ins are allowed; the major libraries you might consider are NumPy, SciPy, scikit-learn, Matplotlib, and pandas.

● Data files

○ corpus.txt: file containing a set of words and their associated sentiment values

○ stop_words.txt: file containing an extensive list of stop words

○ classified_tweets.txt: file containing a set of tweets that have already been classified as negative (sentiment score of 0) or positive (sentiment score of 4)
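A typical first step with these files is to normalize each tweet and drop stop words before extracting features. The sketch below assumes that stop_words.txt holds one word per line, which may not match the actual file layout; the function names, the regular expressions, and the example tweet are likewise illustrative assumptions.

```python
import re


def load_stop_words(path):
    """Read stop words, assuming one word per line (the real layout may differ)."""
    with open(path, encoding="utf-8") as handle:
        return {line.strip().lower() for line in handle if line.strip()}


def clean_tweet(text, stop_words):
    """Lowercase a tweet, remove URLs, and return its tokens without stop words."""
    text = re.sub(r"https?://\S+", " ", text.lower())  # drop URLs
    tokens = re.findall(r"[a-z0-9']+", text)            # keep word-like tokens
    return [token for token in tokens if token not in stop_words]


# Small stand-in set; in practice use load_stop_words("stop_words.txt").
stop_words = {"the", "is", "a", "to"}
tokens = clean_tweet("The debate is LIVE: http://example.com #elxn42", stop_words)
```

The token pattern strips the `#` and `@` prefixes, so hashtags and mentions are kept as plain words; whether that is desirable depends on the features you choose.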

What to submit via Blackboard:

(1) An IPython notebook and an equivalent Python .py file containing your implementation of the classifiers and the various evaluation methods:

(a) lastname_studentnumber_assignment2.ipynb

(b) lastname_studentnumber_assignment2.py

(2) Your report of 3 to 4 pages, named lastname_studentnumber_assignment2.pdf


Respect the above convention when naming your files, making sure that all letters are lowercase and underscores are used as shown. Only submissions via Blackboard will be accepted.

BEFORE uploading your files, make sure that:

(1) your file name does not contain any extras, such as version information or additional name parts; for example, lastname_firstname_studentnumber_assignment2.py is not acceptable.

(2) you comment your code appropriately and describe your algorithms in sufficient detail.
