Machine Learning | Natural Language Processing | Data Mining | Kaggle Competition | Python

Goal

In this assignment, the task is to predict the sentiment of tweets about four technology companies: Apple, Microsoft, Google, and Twitter. Here are some examples of tweets with different sentiments:

  • positive: “http://t.co/QV4m1Un9 Forget the phone.. Nice UI. Liking the Scroll Feature #android #google #nexus”
  • negative: “Have never had such poor customer service at @Apple before! What happened? (@ Apple Store w/ 2 others) http://t.co/GKlXMUi6”
  • neutral: “The lock screen now has facial recognition capability! #Google #Android #ICS”

Your goal is to train a classifier that predicts whether a tweet expresses positive, neutral, or negative sentiment.

Methodology

You need to train classifiers using the training data, and then predict on the test data. You are free to choose the feature extraction method and classifier algorithm, including methods that were not introduced in class. You should probably use cross-validation to select good parameters.
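As a rough illustration of this workflow, the sketch below trains a from-scratch bag-of-words multinomial Naive Bayes classifier and uses k-fold cross-validation to choose its smoothing parameter. This is one possible baseline, not the method required by the assignment. The function names, the candidate `alpha` values, and the toy tweets in `TOY_TEXTS` are all assumptions made up for illustration; the real training data would replace them.

```python
import math
import re
from collections import Counter, defaultdict

# Toy data invented for illustration only; not taken from the competition files.
TOY_TEXTS = [
    "love the new #google phone",
    "great UI, love it",
    "terrible customer service at @apple",
    "worst update ever, terrible",
    "new lock screen released today",
    "the event is today",
]
TOY_LABELS = ["positive", "positive", "negative", "negative", "neutral", "neutral"]


def tokenize(text):
    """Lowercase a tweet and split it into word, hashtag, and mention tokens."""
    return re.findall(r"[#@]?\w+", text.lower())


def train_nb(texts, labels, alpha):
    """Collect the counts needed by multinomial Naive Bayes (smoothing = alpha)."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(texts, labels):
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab, alpha


def predict_nb(model, text):
    """Return the class with the highest log posterior for one tweet."""
    class_counts, word_counts, vocab, alpha = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total)
        denom = sum(word_counts[label].values()) + alpha * len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # tokens never seen in training are ignored
                score += math.log((word_counts[label][tok] + alpha) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def cv_accuracy(texts, labels, alpha, k=3):
    """Mean accuracy over k folds; fold i holds out every example with index % k == i."""
    scores = []
    for fold in range(k):
        train_idx = [i for i in range(len(texts)) if i % k != fold]
        test_idx = [i for i in range(len(texts)) if i % k == fold]
        model = train_nb(
            [texts[i] for i in train_idx], [labels[i] for i in train_idx], alpha
        )
        hits = sum(predict_nb(model, texts[i]) == labels[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / k


def select_alpha(texts, labels, candidates=(0.1, 0.5, 1.0), k=3):
    """Pick the smoothing value with the best cross-validated accuracy."""
    return max(candidates, key=lambda a: cv_accuracy(texts, labels, a, k))


if __name__ == "__main__":
    best_alpha = select_alpha(TOY_TEXTS, TOY_LABELS)
    final_model = train_nb(TOY_TEXTS, TOY_LABELS, best_alpha)
    print(best_alpha, predict_nb(final_model, "love this great phone"))
```

The same pattern (hold out part of the training data, score each candidate setting, refit on all training data with the winner) carries over to any other feature extractor or classifier you decide to try.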

Evaluation on Kaggle

You need to submit your test predictions to Kaggle for evaluation. 50% of the test data will be used to show your ranking on the live leaderboard. After the assignment deadline, the remaining 50% will be used to calculate your final ranking. The entry with the highest final ranking will win a prize! Also, the top-ranked entries will be asked to give a short 5-minute presentation on what they did.

To submit to Kaggle, you need to create an account and use the competition invitation that will be posted on Canvas.

Note: You can only submit 2 times per day to Kaggle!

What to hand in

You need to turn in the following things:

  1. This ipynb file with your source code and documentation. You should write about all the various attempts that you make to find a good solution.
  2. Your final submission file to Kaggle.
  3. The ipynb file Assignment3-Final.ipynb, which contains the code that generates the final submission file that you submit to Kaggle. This code will be used to verify that your Kaggle submission is reproducible.
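For the reproducibility requirement, it can help to keep the code that writes the submission file small and deterministic. The sketch below writes one row per test tweet in a fixed order. The function name `write_submission` and the header `("Id", "Category")` are assumptions for illustration only; use whatever column names and id values the competition's sample submission actually specifies.

```python
import csv
import io


def write_submission(ids, predictions, out_file, header=("Id", "Category")):
    """Write one (id, predicted label) row per test tweet, in the given order.

    The default header is a placeholder; the real column names are an assumption.
    """
    if len(ids) != len(predictions):
        raise ValueError("ids and predictions must have the same length")
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(header)
    for row_id, label in zip(ids, predictions):
        writer.writerow([row_id, label])


if __name__ == "__main__":
    # Demo with an in-memory buffer; for a real file use
    # open("submission.csv", "w", newline="") instead.
    buffer = io.StringIO()
    write_submission([1, 2], ["positive", "neutral"], buffer)
    print(buffer.getvalue())
```

If any step of your pipeline uses randomness (shuffling, random initialization), fixing the random seed in Assignment3-Final.ipynb makes it more likely that rerunning the notebook regenerates the same file.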

Grading

The marks of the assignment are distributed as follows:

  • 45% – Results using various classifiers and feature representations.
  • 30% – Trying out feature representations (e.g. adding additional features) or classifiers not used in the tutorials.
  • 20% – Quality of the written report. More points for insightful observations and analysis.
  • 5% – Final ranking on the Kaggle test data (private leaderboard). If a submission cannot be reproduced by the submitted code, it will not receive marks for ranking.
  • Late Penalty: 25 marks will be subtracted for each day late.

Note: you should start early! Some classifiers may take a while to train.

Kaggle Notebooks

You can use Kaggle notebooks to run your code. This ipynb has also been uploaded to the Kaggle competition site. Note that you still need to submit your code to Canvas for grading.
