Monthly Archives: April 2020

CSE115 Project 1 | Python | WEB | DATA Visualization

CSE115 Project 1

Part 4 (front-end and web server) requirements

Submit a ZIP file of your code (like with lab activities and lab quizzes) by Friday May 1st at 6:00 PM, worth 48 out of the project’s 60 total points.

NOTE: You have a limited number (5) of AutoLab submissions.  Use them wisely.  Your LAST submission counts.


Languages: JavaScript and Python

Topics: lists, dictionaries, loops, conditionals, reading CSV files, writing CSV files, HTML, JavaScript front end, Bottle web server, JSON, AJAX (lectures up to and including Monday, April 6)


Overview

In this last part of the project you will put together the pieces from earlier parts of the project to create a small web application. There are several pieces you need to complete; each is relatively small, but you need to work incrementally to ensure that all the pieces work together in the finished product.

If you did not quite complete the functions for parts 1, 2, and/or 3, keep working on them. Remember that the bulk of the points come from your final project submission.

Your finished web application must create three graphs. Because the OpenBuffalo towing dataset is updated daily, the data your program downloads (and is then shown in your graph) can change. So while your graphs should look similar to those below, they will not be identical. You can use AutoLab to check your code for the first 3 parts and then use those results to help verify your graphs are showing the proper data. The first graph is a scatter plot showing the number of tows on each day of the month:

[Example graphs: the scatter plot, pie chart, and line graph described here.]

The second graph is a pie chart showing each police district's share of the tows. The third graph is a line graph showing why cars were towed and how those reasons change from month to month.


Web server (Python)

Define your Bottle web server so it has five routes:

  • The “/” must serve up the HTML file as a static file.

You may use any name you want for the remaining routes. Best practices encourage selecting route names that are meaningful and descriptive within the context of the application. (A minimal sketch of possible routes follows the list below.)

  • A route to serve up the JavaScript code file as a static file
  • A route to serve up the scatter plot data as a JSON blob
  • A route to serve up the pie chart data as a JSON blob
  • A route to serve up the line graph data as a JSON blob
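As a rough illustration only (the file names, route names, and root directory here are placeholders, not requirements), the two static-file routes might look something like this:

import bottle

@bottle.route("/")
def serve_html():
    # Serve the HTML page; "index.html" and root="." are assumptions about your layout.
    return bottle.static_file("index.html", root=".")

@bottle.route("/app.js")
def serve_js():
    # Serve the JavaScript code file as a static file.
    return bottle.static_file("app.js", root=".")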

HTML file

You will need to write an HTML file for your application. Inside your HTML file’s <head> element, it will need 2 <script> elements to load in the needed JavaScript files. The first <script> element should cause the Plot.ly library to be loaded. The URL for this is:

https://cdn.plot.ly/plotly-latest.min.js

The second <script> element will be the JavaScript code you wrote for this application. Since this file is on the same web server as your HTML file, the second <script> element’s src property should use the route specified in your web server (do not include a protocol or server).

Your HTML file also needs a <body> element. The <body> element will need to include an onload property. This should call a function in your JavaScript code which sends 3 data requests to the web server: one for each of your graphs. When the server responds to each request, you will need code that creates the graph using that data. As an example, the <body> element in my code is:

<body onload="getData();">

Inside the <body> element, you will need to include 3 <div> elements. Each <div> element will need to include an id property and be assigned a unique value. You will need to use these ids to display your graphs.


Data processing (Python)

You will also need additional code to process the data. The functions required for parts 1, 2, and 3 of the project will make this process much simpler. You should make certain they are available for this part of the project (copying them in as needed).

Graph #1 – the scatter plot

This chart must show the number of tows that occurred on each day of the month. After reading the dataset from your CSV file, count the number of dictionaries recorded for each of the 31 possible days.

Use a layout to give the chart a title and labels for the x- and y-axes. For more information, use this link to Plot.ly’s scatter plot documentation.
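A hedged sketch of one way to do this count, assuming each row read back from cached_data.csv is a dictionary whose 'tow_date' value starts with "YYYY-MM-DD" (the helper name is made up):

def tows_per_day(rows):
    # rows: the list of dictionaries read from cached_data.csv
    counts = [0] * 31                      # one slot for each possible day, 1 through 31
    for row in rows:
        day = int(row['tow_date'][8:10])   # characters 8-9 of "YYYY-MM-DD..." are the day
        counts[day - 1] += 1
    # x/y lists in the shape a Plotly scatter trace expects
    return {'x': list(range(1, 32)), 'y': counts}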

Graph #2 – the pie chart

This chart breaks out the tows by the police district in which they occurred. Even though the graph shows percentages, it uses the number of tows in each district. For a pie chart, Plot.ly needs two lists: one with names (“labels”) and one with the data (“values”). The labels will be the names of the police districts: District A, District B, District C, District D, and District E. To calculate the values, you will need to read in the complete data set from your CSV file. You will then want to work with each district in the order they appear in your labels list. For each district, get the list of matches using ‘police_district’ as the key and the district’s name as the value. Add the length of that list to your list of values.

Use a layout to give the chart a title. For more information about pie charts, use this link to the Plot.ly documentation.
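A hedged sketch of building the labels/values lists (the filtering is written inline here; if your part 1 get_matches(rows, key, value) helper is available, the loop body can call it instead):

def pie_chart_data(rows):
    labels = ['District A', 'District B', 'District C', 'District D', 'District E']
    values = []
    for district in labels:
        # rows whose 'police_district' value equals this district's name
        matches = [row for row in rows if row['police_district'] == district]
        values.append(len(matches))
    return {'labels': labels, 'values': values}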

Graph #3 – the line graph

This must show the number of tows broken down by the description of why the tow occurred. Within each tow description, you will need to plot how often that was the reason for a tow in each month. The way to do this is to read the complete data set from the CSV file and get the list of all descriptions. You can then get a list of dictionaries which matches that description (use get_matches) and use that to get a month-by-month count (the value returned by your count_by_month function). Remember to add a name to each set of data displayed in the graph.

Use a layout to give the chart a title and labels for the x- and y-axes. For more information, use this link to Plot.ly’s line graph documentation.
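A hedged sketch of one way to assemble the traces (the month counting is inlined here and assumes 'tow_date' starts with "YYYY-MM"; your get_matches and count_by_month functions can replace those pieces):

def line_graph_data(rows):
    traces = []
    descriptions = sorted({row['tow_description'] for row in rows})
    for description in descriptions:
        matches = [row for row in rows if row['tow_description'] == description]
        counts = [0] * 12                      # tows per month for this description
        for row in matches:
            month = int(row['tow_date'][5:7])
            counts[month - 1] += 1
        # 'name' labels this line in the Plotly legend
        traces.append({'x': list(range(1, 13)), 'y': counts, 'name': description})
    return traces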


Fetching and caching OpenBuffalo data

The following function uses your get_data, minimize_dictionary, and write_cache functions to get data from the OpenBuffalo site. In doing this, it includes two additional optimizations. The function begins by checking if the cached_data.csv file exists. When the file exists, it will not request data from OpenBuffalo or update the cache. This minimizes the likelihood of your application going over any resource limits. Second, it uses a query string that the OpenBuffalo API provides to limit the request to 10000 records. Note that this query string is specific to this API and may not be usable everywhere.

Please see https://piazza.com/class/k5e51319xsy76d?cid=879 for a change this will require you make to your  minimize_dictionary function.

You should include this in your code and call this function just before starting the webserver.

import os.path
import cache  # This assumes that your functions from parts 2 & 3 are in a file named cache.py

def load_data():
    csv_file = 'cached_data.csv'
    if not os.path.isfile(csv_file):
        query_str = "?$limit=10000"
        url = "https://data.buffalony.gov/resource/5c88-nfii.json" + query_str
        data = cache.get_data(url)
        data = cache.minimize_dictionaries(data, ['tow_date', 'tow_description', 'police_district'])
        cache.write_cache(data, csv_file)


JSON and AJAX

JavaScript front-end and Python back-end run on different machines. (JavaScript is executed by the machine of the person looking at the page; the Python code is executed by repl.it’s servers.) This means that we cannot communicate directly from JavaScript to Python (or vice versa). We instead rely on an approach called AJAX to allow the two sides to coordinate their work.

This process begins when the JavaScript front-end needs data from the Python back-end. To make this request, it must do two things:

  1. Send a request to the repl.it server requesting the data;
  2. Tell the JavaScript system what function should be called if/when it receives this response.

You will need the ajaxGetRequest function to accomplish both of these tasks. We will discuss how to use this function during lectures; you should copy-and-paste this function into your JavaScript code:

// path - string specifying URL to which data request is sent
// callback - function called by JavaScript when response is received
function ajaxGetRequest(path, callback) {
    let request = new XMLHttpRequest();
    request.onreadystatechange = function() {
        if (this.readyState === 4 && this.status === 200) {
            callback(this.response);
        }
    };
    request.open("GET", path);
    request.send();
}

You will need to provide two arguments for each of your 3 (or more) calls to ajaxGetRequest. An example of this call is:

ajaxGetRequest(“/towsByDay”, showScatter);

The first argument ("/towsByDay") is the route on the repl.it server where the request is sent. For this to work, the Python back-end will need a function with a matching annotation (e.g., @bottle.route("/towsByDay")). The second argument (showScatter) must be the name of the JavaScript function you wrote. JavaScript automatically executes the second argument (showScatter), but only when it receives a response from repl.it. That function should include one parameter. This parameter will be the data the back-end sent over the Internet to respond to our request. As an example, the first bit of my function looks like this:

function showScatter(response) {

    // Data is always sent in JSON, so we must first convert it                                                                                                                

    let data = JSON.parse(response);

    // Now write the code that uses data to create the scatter plot                                                                                                             

Finally, where do we make the calls to the ajaxGetRequest function?  For this project, these should be in the JavaScript function called in the <body> tag’s onload property. In the example above the function getData would have all of the calls to ajaxGetRequest.
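On the Python side, the route named in the ajaxGetRequest call simply returns the processed data as a JSON string. A minimal sketch, assuming the same "/towsByDay" route as the example above (the placeholder data stands in for whatever your data-processing function returns):

import json
import bottle

@bottle.route("/towsByDay")
def tows_by_day():
    # In your application this would be built from the cached CSV data,
    # e.g., by a helper like tows_per_day(rows).
    data = {'x': list(range(1, 32)), 'y': [0] * 31}   # placeholder shape only
    return json.dumps(data)                           # JSON blob consumed by showScatter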


GRADING

For your final project submission, the autograders in AutoLab will score the functions specified in parts 1, 2, and 3. The functionality for part 4 of the final project will be manually graded by the TAs in the weeks after the project deadline. As before, you are limited to 5 submissions on AutoLab and it is your last submission that counts: that is the autograder score we will use and the submission that will be manually graded by the TAs.


SUBMISSION INSTRUCTIONS

In order for us to grade your project, you must submit all of your .html, .js, and .py files. We will download and execute your project for the manual grading, so make certain everything runs before you submit.

Code organization

It is important you follow these instructions so that the autograders can run and you earn those points.
Your Python code MUST be separated into 3 files and the files MUST have the following names:

cache.py – this file must contain ALL OF THE FUNCTIONS FROM PARTS 2 and 3. Do not include any web server code and the file cannot import or use the bottle library.

backend.py – this file must contain ALL OF THE FUNCTIONS FROM PART 1. You can include the functions from part 4 that process the data, but this is optional. Do not include any web server code and the file cannot import or use the bottle library.

main.py – this file should include all of the Python code that defines the web server’s behavior.  This file should have the following basic structure:

import bottle

import json

import cache
import backend

@bottle.route(…)

  # …all the routes and handlers defined here…


import os.path
def load_data( ):
  # You should use my code for load_data()

load_data()

bottle.run(…)

Note the import cache and import backend lines. You will need to include the name of the module when you call functions that are in those files. For example, when load_data wants to call the get_data function, it writes the call as cache.get_data(url).

Submission

To submit, export the project files from repl.it as a zip file. If you can, please remove the following files from the ZIP file you submit:

cached_data.csv
poetry.lock

pyproject.toml

requirements.txt

but only do this if you are confident you know how. The most important thing is that we can download and run your submission.

Feedback

Autolab will give two kinds of feedback:

  1. An ungraded checklist showing how well your code meets (some of) the explicitly stated expectations spelled out above in this document.  The first bit of the feedback will look something like this:

******************************************************************************

*** File names and quantities                                              ***

******************************************************************************

    PASS: I expected to find 1 HTML file.  I found 1:

          index.html

    PASS: I expected to find 1 .js file.  I found 1:

          permits.js

    PASS: I expected to find 1 file named ‘main.py’.  I found 1:

          main.py

    PASS: I expected to find 1 file named ‘cache.py’.  I found 1:

          cache.py


    PASS: I expected to find 1 file named ‘backend.py’.  I found 1:

          backend.py

Please verify that all of the checks are labeled “PASS”. This is needed for the autograders and makes the manual grading go much faster. We gave you lots of leeway for how to structure your code for part 4, but may not be able to find and grade code if you move things around.

  2. A report on the automated grading of the functionality of parts 1, 2, and 3.  Each part is graded out of 8 points.

Overall Project Grading

Project Part | Part 1 submit | Part 2 submit | Part 3 submit | Final submission     | Overall points
Part 1       | 4             |               |               | 10                   | 14
Part 2       |               | 4             |               | 10                   | 14
Part 3       |               |               | 4             | 10                   | 14
Part 4       |               |               |               | 18 (manually graded) | 18

R Assignment: STAT318 Statistical Analysis

Posted on 2020-04-26 | Filed under R

Use R code to carry out a statistical analysis.


Description

The purpose of this project is to allow you to carry out a statistical analysis from start to finish. This project provides the opportunity to apply the principles you learn to an actual problem. Since the focus of STAT 318 is the application of statistical methods, this project is the most important part of the class. You will be working independently on this project. This project must be related to Chapter 6: multiple linear regression (with at least 4 predictors) with a normal response.

As such, this project paper will serve as a substitute for the final exam and account for 25% of your final grade. The following is a basic outline of what will be expected from this project.

Rubric for paper:

  • (1) Research question and data set approval: 5 points
  • (2) Project report and analysis: 85 points
  • (3) Formatting of paper (correct grammar, clear and concise writing, etc. All math equations should be written using the equation editor in Word or in math type in TeX such as RMarkdown): 10 points

Total: 100 points

Detailed outline

  1. Decide on a research question
    • As we learn different methods, you will hopefully have ideas about how to apply them to problems you may be interested in. While this is one way to come up with a good research question, you may find the project more enjoyable if you first think of a topic you are interested in and then a question that can be answered using multiple regression (with *at least* 4 predictors).
    • I need to approve your research question to ensure that you will be able to carry out the project.
    • You may submit these early. I will give feedback as they come in.
  2. Collect data
    • There are plenty of data sets available online. You can use the links below to find data sets. However, I highly recommend that you spend time finding a data set that you are interested in analyzing and not limiting yourself to the data sets from the links below. I also need the source for your data.
  3. Research paper
    • (1) Introduction (1-2 paragraphs): State the research objectives. State any background information of the topic that is important to know.
    • (2) Methodology (1-2 paragraphs): Describe in detail what the method is and why it is appropriate to use.
    • (3) Results (multiple paragraphs): Answer your original research question by conducting a full analysis and discussing the results. Reports should address model assumptions, model selection (any additional forms or interactions?), model validation, final regression model, interpreting coefficients, providing confidence and prediction intervals. Provide any appropriate plots and tables to help summarize the results. Do NOT copy output from R as any tables you make should be formatted cleanly.
    • (4) Discussion (~ 2 paragraphs): Summarize the study and the results. Were your results expected or surprising? Were there any problems with your data? Would a different method work better? Does your analysis raise additional questions that could be investigated? What are the strengths and weaknesses of your model?
    • (5) References: Your entire analysis needs to be written in a report. This will need to be written like an essay that you would write for any other class. While I will be mostly grading your report on correct application of statistical principles, I will take points off for sloppy writing. You will need to make sure your equations are written in appropriate equation form (such as using the math editor in Word and math type in TeX). An important part of any statistical analysis is clearly communicating your results. I want to see you put in some effort to that end. There is no page length required because the page length may vary depending on plots and tables you may need to include, but the paper you submit should clearly answer and summarize the research question.
  4. R Code
    • You will need to upload your R code with your final report. This code should be easy to follow, commented, and include all components of your analysis.

MATH11154 | Stochastic Analysis in Finance

Stochastic Analysis in Finance
MATH11154
May 2019

All random variables in the following questions are defined on a probability space (Ω, F, P), and G denotes a σ-algebra contained in F.

1. (a) Let X and Y be random variables such that E|X| < ∞. Define precisely the conditional expectations E(X|G) and E(X|Y). [7 marks]

(b) Let X be a random variable with finite second moment, and consider R = X − E(X|G), the difference between the “true value” of X and the “predicted value” of X based on the “information” G. Compute ER and E(R|G). Show that

ER² = EX² − EZ²,

where Z = E(X|G). [7 marks]

(c) Let X be a random variable with finite second moment, and let L²(G) denote the space of G-measurable random variables with finite second moment. Show that Z = E(X|G) minimises the mean square distance of X from L²(G), i.e.,

E(X − Z)² = min{E(X − Y)² : Y ∈ L²(G)}. [6 marks]

(d) Let X_1 and X_2 be independent identically distributed random variables such that E|X_1| = E|X_2| < ∞. Set X̄ = (X_1 + X_2)/2, and determine E(αX_1 + (1 − α)X_2 | X̄) as a function of X̄ for any constant α ∈ R. [Hint: You may use without proof that, due to symmetry, E(X_1|X̄) = E(X_2|X̄).] [5 marks]

Solution:

(a) (i) Z := E(X|G) is a G-measurable random variable such that E(Z·1_G) = E(X·1_G) for every G ∈ G. (ii) E(X|Y) = E(X|σ(Y)), where σ(Y) is the σ-algebra generated by Y. [7 marks]

(b) (i) ER = E(X − E(X|G)) = EX − E(E(X|G)) = EX − EX = 0.
(ii) E(R|G) = E(X − E(X|G) | G) = E(X|G) − E(X|G) = 0.
(iii) Notice that E(XZ) = E(E(XZ|G)) = EZ². Hence

ER² = E(X − Z)² = EX² − 2E(XZ) + EZ² = EX² − EZ². [7 marks]

(c) E(X − Y)² = E(X − Z + Z − Y)² = E(X − Z)² + 2E{(X − Z)(Z − Y)} + E(Z − Y)². Notice that E{(X − Z)(Z − Y)} = E(E((X − Z)(Z − Y)|G)) = E((Z − Y)·E(X − Z|G)) = 0. Consequently,

E(X − Y)² = E(X − Z)² + E(Z − Y)² ≥ E(X − Z)²,

which proves the statement since Z ∈ L²(G). [6 marks]

(d) By symmetry E(X_1|X̄) = E(X_2|X̄). Hence

E(X_i|X̄) = (1/2){E(X_1|X̄) + E(X_2|X̄)} = E(X̄|X̄) = X̄ for i = 1, 2.

Consequently,

E(αX_1 + (1 − α)X_2 | X̄) = αE(X_1|X̄) + (1 − α)E(X_2|X̄) = αX̄ + (1 − α)X̄ = X̄. [5 marks]

2. We want to model the evolution of the instantaneous interest rate by an Itô process r = (r_t), t ≥ 0, satisfying the following conditions:

(i) r_0 = 4 and 3 ≤ r_t ≤ 5 almost surely for all t ≥ 0;
(ii) Er_t = 4 for all t ≥ 0;
(iii) E(|r_t − 4|²) ≤ 1/200 for all t ≥ 0.

The solution r of the stochastic differential equation

dr_t = 100(4 − r_t) dt + (|r_t − 5| |r_t − 3|)^(1/2) dW_t,  r_0 = 4,

is suggested as a suitable model, where W = (W_t), t ≥ 0, is a Wiener process.

(a) State precisely a comparison theorem for SDEs and, applying it to suitable SDEs, show that property (i) holds for the solution r. [12 marks]
(b) Prove that r satisfies property (ii). [6 marks]
(c) Setting Y_t := r_t − 4 and using Itô's formula, write an expression for the stochastic differential of e^(100t) Y_t. Hence, estimating E(|e^(100t) Y_t|²), or otherwise, deduce that r satisfies property (iii). [7 marks]

Solution:

HADOOP | Distributed System | Spark | Python | Distributed Systems Assignment

DS/CMPSC 410 MiniProject #4

Spring 2020

Instructor: John Yen

TA: Rupesh Prajapati

Learning Objectives

  • Be able to apply k-means clustering to the Darknet dataset
  • Be able to interpret the features of the cluster centers generated
  • Be able to compare the results of k-means clustering with different values of k

Total points: 35

  • Exercise 1: 10 points
  • Exercise 2: 10 points
  • Exercise 3: 15 points

Due: 5 pm, April 24, 2020.

import pyspark
import csv
from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, StringType, LongType
from pyspark.sql.functions import col, column
from pyspark.sql.functions import expr
from pyspark.sql.functions import split
from pyspark.sql import Row
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler, IndexToString
from pyspark.ml.clustering import KMeans

ss = SparkSession.builder.master("local").appName("FrequentPortSet").getOrCreate()
Scanners_df = ss.read.csv("/storage/home/juy1/work/Darknet/scanners-dataset1-anon.csv", header=True, inferSchema=True)

We can use printSchema() to display the schema of the DataFrame Scanners_df to see whether it was inferred correctly.

Scanners_df.printSchema()

Exercise 1 (10 points)

Use k-means to cluster the scanners who scan the top 3-port-sets. Complete the code below for k-means clustering (k=10) on the following input features:

  • num_ports_scanned
  • avg_lifetime
  • total_lifetime
  • avg_pkt_size

Specify Parameters for k Means Clustering

km = KMeans(featuresCol="features", predictionCol="prediction").setK(???).setSeed(???)
km.explainParams()
va = VectorAssembler().setInputCols([???]).setOutputCol("features")
Scanners_df.printSchema()
data = va.transform(???)
data.persist()
kmModel = km.fit(???)
kmModel
predictions = ???.transform(???)
predictions.persist().show(3)
Cluster1_df = predictions.where(col("prediction") == 0)
Cluster1_df.persist().count()
summary = kmModel.summary
summary.clusterSizes
kmModel.computeCost(data)
centers = kmModel.clusterCenters()
print("Cluster Centers:")
i = 0
for center in centers:
    print("Cluster ", str(i+1), center)
    i = i + 1
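If it helps to see the overall pattern in one self-contained piece, here is a generic sketch of the VectorAssembler + KMeans flow on made-up toy rows (the column names match the features listed above, but the toy data, k, and seed values are arbitrary assumptions, not the exercise answer):

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.master("local").appName("KMeansSketch").getOrCreate()

# Hypothetical toy rows standing in for the Darknet scanner features.
toy_df = spark.createDataFrame(
    [(3, 10.0, 30.0, 120.0), (5, 2.5, 12.5, 60.0), (40, 99.0, 400.0, 1500.0)],
    ["num_ports_scanned", "avg_lifetime", "total_lifetime", "avg_pkt_size"])

# Assemble the input columns into a single 'features' vector column.
va = VectorAssembler(
    inputCols=["num_ports_scanned", "avg_lifetime", "total_lifetime", "avg_pkt_size"],
    outputCol="features")
data = va.transform(toy_df)

# Fit k-means with an explicit k and seed, then inspect the clusters.
km = KMeans(featuresCol="features", predictionCol="prediction", k=2, seed=42)
model = km.fit(data)
model.transform(data).show()
for i, center in enumerate(model.clusterCenters()):
    print("Cluster", i + 1, center)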

Exercise 2 Analyze the result of k-means clustering (k =10) (10 points)

  • (a) Describe what the cluster center for the largest cluster indicates about scanners in this group. (5 points)
  • (b) Describe what the cluster center for the second largest cluster indicates about scanners in this group. (5 points)

Exercise 3 Perform k-means clustering for a different choice of the value of k (15 points)

  • a) Change the value of k to 30. (5 points)
  • b) Compare the “cost” of the result of k-means with that of k=10. (5 points)
  • c) Compare the top two clusters generated with k=30 with the top two clusters generated with k=10. (5 points)

# Code for Exercise 3 (a)

Answer for Exercise 3 (b):

Answer for Exercise 3 (c):

Machine Learning | AI | Style Transfer | Image Processing | Python | Coursework

Following up on our check-in, I have collected some materials to get you started.

  1. Style transfer Keras guide (link): This page has complete code to run style transfer. Unfortunately, the coding concepts it uses are quite a bit different than what we have learned about so far. 
  2. Attached .py file: I have simplified the code from #1 a bit in the attached .py file. This should be more manageable and a good place to start. The parts you should focus on understanding are the various inputs, hyperparameters, and losses. The details of the implementation are important but secondary.
  3. Papers: Have a look at these two papers to get an understanding of the model being implemented in the code. Focus on understanding the intuition and the role of the various losses.
    1. “A Neural Algorithm of Artistic Style” (Gatys et al): The earlier of the two papers (only uses style and content losses)
    2. “Perceptual Losses for Real-Time Style Transfer and Super-Resolution” (Johnson et al): The second paper, which also discusses the total variation loss
  4. Tutorial (link): A tutorial video on style transfer. The code presented will not work on current tensorflow versions but the presenter gives a good overview of how this style of code works and which parts are most important.

Reinforcement Learning | Python | Hex | Artificial Intelligence | AlphaGo | Reinforcement Learning 2020: Assignment 4 Self-Play

Reinforcement Learning 2020: Assignment 4 Self-Play

Aske Plaat (rl@liacs.leidenuniv.nl)
April 15, 2020

1 Introduction

The objective of this assignment is to build intuition on deep reinforcement learning in games, where self-play is used to create an environment for generating the rewards for training. You will implement a deep neural network player for the game of Hex and experiment with various self-play algorithms.

Self-play is old. Samuel’s Checkers player in the 1950s used a rudimentary form of self-play to learn a better evaluation function. In the early 1990s, Tesauro’s TD-Gammon used self-play to create a Backgammon player strong enough to beat the human champion. However, without a doubt, the most famous self-play success is AlphaGo Zero, the Go playing program that taught itself to play Go entirely by self-play, without any heuristic domain knowledge, tabula rasa.

Deep reinforcement learning is computationally intensive, and self-play is even more so. Therefore we will choose Hex as our game, a simple, fast game, and we will choose a small board size, 7×7, to allow for plenty of training to occur.

DeepMind has not published the source code of AlphaGo. However, the scientific publications contain enough detail for many re-implementations to have been created, most of which are public.

For this assignment, you may find the following resources useful:

  • Chapter 7 of Learning to Play.
  • There is a variety of existing AlphaZero implementations from which you may pick one.
  • Shantanu Thakoor, Surag Nair, Megha Jhunjhunwala and others have written AlphaZero General (A0G), a re-implementation fully in Python. A0G runs on PyTorch, TensorFlow, or Keras. Implementations of 6×6 Othello, Gomoku, Tic Tac Toe, and Connect4 are available. The Python code is extensible for your own game implementation. A writeup and documentation are available on the github site.

• PolyGames by Facebook

When you get stuck, your first resource should be the book for this course. If you find that confusing or it does not answer your question, please write an email to rl@liacs.leidenuniv.nl. Consult the resources of the assignment, ask questions in class, or come to the question hour.

2 Hex – 3 points

Install an AlphaZero implementation and familiarize yourself with it. Run the 6×6 Othello player and play against a trained player. Look at Coach.py for the main structure of the self-play, train a player, and see if you can follow the training steps.

Implement a 7×7 Hex player via AlphaZero. It helps self-play training to use a GPU. See the previous assignment for tips on using a GPU.

Learn the Hex player by self-play. Write scripts to automate playing against each other.

3 Tournament – 3 points

Perform a thorough analysis of the performance of the learned player. Run a tournament to determine the Elo rating of the following players:

• ID-TT alpha-beta Hex player
• MCTS Hex player
• AlphaZero self-play Hex player 1
• AlphaZero self-play Hex player 2

Use the 7×7 board size. Plot Elo graphs. Perform a computation of how many games should be played.
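If you implement the Elo bookkeeping yourself, the core update is small. A minimal sketch (the K-factor of 32 and the starting rating of 1000 are common defaults, not values prescribed by this assignment):

def elo_update(r_a, r_b, score_a, k=32):
    # score_a: 1.0 if player A won, 0.5 for a draw, 0.0 if A lost.
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

# Example: both players start at 1000 and A beats B once.
print(elo_update(1000.0, 1000.0, 1.0))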

Learn TWO AlphaZero players independently but equally long. See if they are equal in strength or whether they differ significantly. If so, what does this tell you about the learning process?

If the performance of the self-play player disappoints, consider a third player that has trained for longer.

4 Hyperparameters – 3 points

See if you can improve the performance of the Hex player. You can find default values for most of these hyperparameters in main.py. However, hyperparameters are not the only thing that influences the training process. Report in detail on everything you tried and its effect.


Again, make a plan first, with a rough estimate of the training time and evaluation time this will take. It is better to do a small experiment well than to be too ambitious and run out of time.

Report on your results, using the scripts of the previous section.

5 Report

Your submission should be a self-contained report of at least 6 to 8 pages, possibly more with figures etc.; the page count we expect of you might vary depending on your layout. Your code should be uploaded to blackboard, ready to run and free of syntax errors. Your report should contain:

  1. Make sure that your report is a self-contained document with an introduction, methods and conclusion, so even someone not following the course would understand what you did and why
  2. Also report on your general approach and issues you encountered
  3. Be sure to properly reference all information that wasn’t gained as insight through your data or is describing your experiments, especially things you believe to be common knowledge, e.g. information about AlphaZero or MCTS
  4. For AlphaZero Hex implementation: relevant source code files, ready to interpret and run, on blackboard. Short user documentation in report
  5. For all sections: Supporting scripts for running, testing, and performance evaluation on blackboard. Short user documentation in report
  6. Tournament results (approach, positions used, outcome)
  7. For Hyperparameter tuning: Report on experiment design and results
  8. One overall conclusion, lessons learned, observations

Self-play is computationally demanding. Your compute resources are limited. Choose your experiments carefully and realistically. The grading for this assignment will take the computational resources into account (do not try to do the impossible, but make the most of the resources you have). Describing carefully what you do, what the outcome was, and what you think might be going on will get you points.

The report must be handed in on 15 May 2020 before 23:59. For each full 24 hours late, a full point will be deducted.


Natural Language Processing | Single-Document Classification | NLP | UniMelbourne | COMP90042

COMP90042 Project 2020:
Climate Change Misinformation Detection

Copyright the University of Melbourne, 2020

Project type: Individual (non-group)
Due date: 9pm Wed, 13th May 2020
Codalab due date: 1pm Wed, 13th May 2020 (no extensions possible for competition component)

Although there is ample scientific evidence on climate change, there is still a lingering debate over fundamental questions such as whether global warming is human-induced. One factor that contributes to this continuing debate is that ideologically-driven misinformation is regularly being distributed on mainstream and social media.

The challenge of the project is to build a system that detects whether a document contains climate change misinformation. The task is framed as a binary classification problem: a document either contains climate change misinformation or it doesn’t.

The caveat of the task is that we only have training documents with positive labels. That is, the training documents provided are articles with climate change misinformation. An example can be seen below:

New Zealand schools to terrify children about the “climate crisis” Who cares about education if you believe the world is ending? What will it take for sanity to return? Global cooling? Another Ice Age even? The climate lunatics would probably still blame human CO2 no matter what happens. But before that, New Zealand seems happy to indoctrinate an entire generation with alarmist nonsense, and encourage them to wag school to protest for more “action” (reported gleefully by the Guardian, natch): Totally disgraceful, and the fact that the Guardian pushes this nonsensical propaganda as “news” is appalling. You should read the whole thing just to understand how bonkers this is.

We encourage you to have a skim of the documents to get a sense how they read; there are some common patterns that these documents share in challenging climate change, e.g. by fear mongering (like the example article), or by discrediting the science behind climate change.

You’re free to explore any learning methods for the task, e.g. by building a supervised binary classifier or developing methods that leverage only the positive label data (one-class classification). You should not, however, attempt to reverse engineer the sources of the documents, and classify using the document source as a feature. Any sign that you have done so will result in significant penalty, as it trivialises the problem. If you want to expand your training data (e.g. to get negative label data), you can crawl the web for additional resources. Alternatively, you can also use publicly released NLP datasets. A list of popular NLP datasets can be seen here. Pre-trained models and word embeddings are also allowed.

You will need to write a report outlining your system, the reason behind the choices you have made, and the performance results for your techniques.

We hope that you will enjoy the project. To make it more engaging we will run this task as a codalab competition. You will be competing with other students in the class. The following sections give more details on data format, the use of codalab, and the marking scheme. Your assessment will be based on your report, your performance in the competition, and your code.

Submission materials: Please submit the following:

• Report (.pdf)
• Python code (.py or .ipynb)
• Scripting code (.sh or similar), if using Unix command line tools for preprocessing
• External training data (.json, <10MB), if used


You should submit these files via Canvas, as a zip archive. Submissions using formats other than those listed, e.g., docx, 7z, rar, etc., will not be marked and will be given a score of 0. Your external training data json file (if you have one) should not exceed 10MB.

If multiple code files are included, please make clear in the header of each file what it does. If pre-trained models are used, you do not need to include them as part of the submission, but make sure your code or script downloads them if necessary. We should be able to run your code, if needed, however note that code is secondary – the primary focus of marking will be your report, and your system performance on codalab.

You must submit at least one entry to the codalab competition.

Late submissions: -20% per day
Marks: 40% of mark for class

Materials: See Using Jupyter Notebook and Python page on Canvas (under Modules>Resources) for information on the basic setup required for COMP90042, including an iPython notebook viewer and the Python packages NLTK, Numpy, Scipy, Matplotlib, Scikit-Learn, and Gensim. For this project, you are encouraged to use the NLP tools accessible from NLTK, such as the Stanford parser, NER tagger etc, or you may elect to use the Spacy or AllenNLP toolkit, which bundle a lot of excellent NLP tools. You are free to use public NLP datasets or crawl additional resources from the web. You may also use Python based deep learning libraries: TensorFlow/Keras or PyTorch. You should use Python 3.

You are being provided with various files including a training, a development and a test set. See the instructions below (section “Datasets”) for information on their format and usage. If there’s something you want to use and you are not sure if it’s allowed, please ask in the discussion forum (without giving away too much about your ideas to the class).

Evaluation: You will be evaluated based on several criteria: the correctness of your approach, the originality and appropriateness of your method, the performance of your system, and the clarity and comprehensiveness of your final report.

Updates: Any major changes to the project will be announced via Canvas. Minor changes and clarifications will be announced in the discussion forum on Canvas; we recommend you check it regularly.

Academic Misconduct: This is an individual project, and while you’re free to discuss the project with other students, this is ultimately an individual task, and so reuse of code between students, copying large chunks of code from online sources, or other instances of clear influence will be considered cheating. Do remember to cite your sources properly, both for research ideas and algorithmic solutions and code snippets. We will be checking submissions for originality and will invoke the University’s Academic Misconduct policy where inappropriate levels of collusion or plagiarism are deemed to have taken place.

Datasets

You are provided with several data files for use in the project:

train.json a set of training documents (all with positive labels)
dev.json a set of development documents (mixture of positive and negative labels)
test-unlabelled.json a set of test documents (without labels)

All data files are json files, formulated as a dictionary of key-value pairs. E.g.,

"train-0": {
    "text": "why houston flooding isn't a sign of climate change...",
    "label": 1
}

is the first entry in train.json. The dev and test files follow the same format, with the exception that test excludes the label fields. You should use the Python json library to load these files.
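A minimal sketch of loading and inspecting one of these files with the standard json library (file name as given above):

import json

with open("train.json") as f:
    train = json.load(f)

# Each key such as "train-0" maps to a dict with "text" and "label" fields.
for doc_id, doc in list(train.items())[:3]:
    print(doc_id, doc["label"], doc["text"][:60])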

For the test data, the label fields are missing, but the “id” number (e.g. test-0) is included which should be used when creating your codalab submissions.

Each of these datasets has a different purpose. The training data should be used for building your models, e.g., for use in development of features, rules and heuristics, and for supervised/unsupervised learning. You are encouraged to inspect this data closely to fully understand the task and some common patterns shared by documents with climate change misinformation. As mentioned earlier, you are free to expand the training data by crawling the web or using public NLP datasets.

The development set is formatted like the training set, but this partition contains documents with negative and positive labels. This will help you make major implementation decisions (e.g. choosing optimal hyper-parameter configurations), and should also be used for detailed analysis of your system – both for measuring performance, and for error analysis – in the report.

You will use the test set, as discussed above, to participate in the codalab competition. You should not at any time manually inspect the test dataset; any sign that you have done so will result in loss of marks.

Scoring Predictions

We provide a scoring script scoring.py for evaluation of your outputs. This takes as input two files: the ground truth, and your predictions, and outputs precision, recall and F1 score on the positive class. Shown below is the output from running against random predictions on the development set:

$ python3 scoring.py --groundtruth dev.json --predictions dev-baseline-r.json
Performance on the positive class (documents with misinformation):
Precision = 0.5185185185185185
Recall = 0.56

F1 = 0.5384615384615384

Your scores will hopefully be a good deal higher! We will be focussing on F1 scores on the positive class, and the precision and recall performance are there for your information, and may prove useful in developing and evaluating your system.

Also provided is an example prediction file, dev-baseline-r.json, to help you understand the required file format for your outputs. The label fields are randomly populated for this baseline system.
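A hedged sketch of producing an output file in the same spirit (the exact output schema should be checked against dev-baseline-r.json; the all-zero labels below are placeholders only):

import json

with open("test-unlabelled.json") as f:
    test = json.load(f)

# One predicted "label" per test key; replace the placeholder 0s with real predictions.
predictions = {doc_id: {"label": 0} for doc_id in test}

with open("test-output.json", "w") as f:
    json.dump(predictions, f)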

Misinformation Detection System

You are asked to develop a climate change misinformation detection method, or several such methods. How you do this is up to you. The following are some possible approaches:

  • Expand the training data to include documents with negative labels and build a binary supervised classifier;
  • Formulate the problem as a one-class classification or outlier detection task, and use the provided positive label data as supervision (a minimal sketch of this formulation follows this list);
  • Build a language/topic model using the positive label data, and classify a document based on how well its language/topic conforms to patterns predicted by the language/topic model;
  • A combination of the approaches above.

Note that these are only suggestions to help you start the project. You are free to use your own ideas for solving the problem. Regardless of your approach, the extent of success of your detection system will likely depend on the informativeness of the extracted features and the novelty of the model design.
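As a hedged sketch of the one-class formulation mentioned above (scikit-learn is listed in the Materials section; the toy texts and the nu value are assumptions you would replace and tune on the development set):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import OneClassSVM

# Hypothetical positive-only training texts and unseen development texts.
train_texts = ["climate alarmism is nonsense ...", "global warming stopped in 1998 ...",
               "co2 is plant food not pollution ...", "the climate has always changed ..."]
dev_texts = ["ipcc report summarises the evidence ...", "warming is a hoax pushed by elites ..."]

vec = TfidfVectorizer(stop_words="english", max_features=5000)
X_train = vec.fit_transform(train_texts)

# One-class SVM fit on positive-label documents only; nu roughly controls the
# tolerated fraction of outliers among the training documents.
clf = OneClassSVM(kernel="linear", nu=0.5)
clf.fit(X_train)

# +1 = predicted in-class (misinformation), -1 = outlier (no misinformation).
print(clf.predict(vec.transform(dev_texts)))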


If you are at all uncertain about what design choices to make, you should evaluate your methods using the development data, and use the results to justify your choice in your report. You will need to perform error analysis on the development data, where you attempt to understand where your approach(es) work well, and where they fail.

Your approaches should run in a modest time frame, with the end to end process of training and evaluation not taking more than 24 hours of wall clock time on a commodity desktop machine (which may have a single GPU card). You are welcome to use cloud computing for running your code, however techniques with excessive computational demands will be penalised. This time limit includes all stages of processing: preprocessing, training and prediction.

Evaluation

Your submissions will be evaluated on the following grounds:

Component      | Marks | Criteria
Report writing | 10    | clarity of writing; coherence of document structure; use of illustrations; use of tables & figures for experimental results
Report content | 20    | exposition of technique; motivation for method(s) and justification of design decisions; correctness of technique; ambition of technique; quality of error analysis; interpretation of results and experimental conclusions
Performance    | 10    | positive class F1 score of system

The performance levels will be set such that 0=random; 1-3=simple baseline performance; 4-7=small improvements beyond baseline; 8-10=substantial improvements, where the extent of improvement is based on your relative ranking on the leaderboard. The top end scores will be reserved for top systems on the leaderboard. You must submit at least one competition entry to codalab, and your best result in the leaderboard will be used in marking.

Once you are satisfied that your system is working as intended, you should use the training and development data to do a thorough error analysis, looking for patterns in the kinds of errors your basic system is making. You should consider the various steps in processing, and identify where the most serious problems are occurring. If there are any relatively simple fixes that might have a sizeable impact on performance, you should feel free to note and apply them, but your main goal is to identify opportunities for major enhancements. You should include a summary of this error analysis in your report.

A report should be submitted with the description, analysis, and comparative assessment (where applicable) of methods used. There is no fixed template for the report, but we would recommend following the structure of a short system description paper, e.g., https://www.aclweb.org/anthology/W18-5516. You should mention any choices you made in implementing your system along with empirical justification for those choices. Use your error analysis of the basic system to motivate your enhancements, and describe them in enough detail that we could replicate them without looking at your code. Using the development dataset, you should evaluate whether your enhancements increased performance as compared to the basic system, and also report your relative performance on the codalab leaderboard. Finally, discuss what steps you might take next if you were to continue development of your system (since you don’t actually have to do it, feel free to be ambitious!).

For the evaluation, you should generally avoid reporting numbers in the text: include at least one table, and at least one chart. Using the development set, you should report your results, showing precision, recall and F1-score, as appropriate. In addition, you are encouraged to report results with other metrics, where needed to best support your error analysis.

Your description of your method should be clear and concise. You should write it at a level that a masters student could read and understand without difficulty. If you use any existing algorithms, you do not have to rewrite the complete description, but must provide a summary that shows your understanding and you should provide a citation to reference(s) in the relevant literature. In the report, we will be very interested in seeing evidence of your thought processes and reasoning for choosing one approach over another.


The report should be submitted as a PDF, and be no more than four A4 pages of content (excluding references, for which you can use the 5th page). You should follow the ACL style files based on a “short paper”, which are available from https://acl2020.org/calls/papers/. We prefer you to use LaTeX; however, you are also permitted to use Word. You should include your full name and student id under the title (using the \author field in LaTeX, and the \aclfinalcopy option). We will not accept reports that are longer than the stated limits above, or otherwise violate the style requirements.

Codalab

You will need to join the competition on codalab. To do so, visit

https://competitions.codalab.org/competitions/24205?secret_key=37c3468d-1b4f-4a10-ac60-38bba553d8ee

and sign up for a codalab account using your @student.unimelb.edu.au email address, and request to join the competition. You can do this using the “Participate” tab. Only students enrolled in the subject will be permitted to join, and this may take a few hours for us to process, so please do not leave this to the last minute.

Please edit your account details by clicking on your login in the top right corner and selecting “Settings”. Please provide your student number as your Team name. Submissions which have no team name will not be marked.

You can use this system to submit your outputs, by selecting the “Participate” tab then clicking the “Ongoing evaluation” button, and then “Submit”. This will allow you to select a file, which is uploaded to the codalab server, which will evaluate your results and add an entry to the leaderboard. Your file should be a zip archive containing a single file named test-output.json, which has a label prediction for all the keys in test-unlabelled.json. The system will provide an error message if the filename is different, as it cannot process your file.

The results are shown on the leaderboard under the “Results” tab, under “Ongoing Evaluation”. To start you off, I have submitted a system that gives random predictions, to give you a baseline. The competition ends at 1pm on 13th May, after which submissions will no longer be accepted (individual extensions can not be granted to this deadline). At this point the “Final Evaluation” values will be available. These two sets of results reflect evaluation on different subsets of the test data. The best score on the ongoing evaluation may not be the best on the final evaluation, and we will be using the final evaluation scores in assessment. The final result for your best submission(s) can now be used in the report and discussed in your analysis.

Note that codalab allows only 3 submissions per user per day, so please only upload your results when you have made a meaningful change to your system. Note that the system is a little slow to respond at times, so you will need to give it a minute or so to process your file and update the result table.


Texas Tech University|CS 3365| Software Engineering II GOAL: THREAT MODELING USING STRIDE

Homework #2: CS3365 - Fall 2017

You are required to implement the Make Order Request use case using an object-oriented language (C++, C#, or Java). The following depicts the communication diagram for the Make Order Request use case, followed by the classes for the use case. Implement the Make Order Request use case as specified using the objects instantiated from the classes. Your implementation of Make Order Request should follow the sequence of the communication diagram using the methods defined in the classes. State why you needed to add new methods or change the methods specified in the classes if you made changes (or additions) to the methods in the classes.

[Communication diagram residue - main sequence for the Make Order Request use case: M1: Order Request; M2: Order Request; M5: Authorize Credit Card Request; M8: Credit Card Approved; M9: Create Order; M10: Order Confirmation; M11: Order Confirmation; M12: Customer Output. Participants: aCustomer, «interface» :CreditCard, «entity» :DeliveryOrder, «business logic» :PurchaseOrderManager, «user interaction» :CustomerInterface]


Python Assignment | Spark Assignment | Coursework | Distributed Systems Assignment | Machine Learning

Description

INF553 Foundations and Applications of Data Mining Spring 2020

Assignment 3

Deadline: Mar. 23rd, 11:59 PM PST

1. Overview of the Assignment

In Assignment 3, you will complete three tasks. The goal is to let you be familiar with Min-Hash, Locality Sensitive Hashing (LSH), and various types of recommendation systems.

2. Requirements

2.1 Programming Requirements

  1. You must use Python to implement all the tasks. You can only use standard Python libraries (i.e., external libraries like numpy or pandas are not allowed).
  2. You are required to only use Spark RDD, i.e., no points will be given if you use Spark DataFrame or DataSet.
  3. There will be 10% bonus for Scala implementation in each task. You can get the bonus only when both Python and Scala implementations are correct.

2.2 Programming Environment

Python 3.6, Scala 2.11, and Spark 2.3.0
We will use Vocareum to automatically run and grade your submission. You must test your scripts on the local machine and the Vocareum terminal before submission.

2.3 Write your own code

Do not share code with other students!!

For this assignment to be an effective learning experience, you must write your own code! We emphasize this point because you will be able to find Python implementations of some of the required functions on the web. Please do not look for or at any such code!

TAs will combine all the code we can find from the web (e.g., Github) as well as other students’ code from this and other (previous) sections for plagiarism detection. We will report all detected plagiarism to the university.

3. Yelp Data

In this assignment, we generated the review data from the original Yelp review dataset with some filters, such as the condition: “state” == “CA”. We randomly took 80% of the data for training, 10% of the data for testing, and 10% of the data as the blind dataset.

You can access and download the following JSON files either under the directory on the Vocareum: resource/asnlib/publicdata/ or in the Google Drive (USC email only): https://drive.google.com/open?id=146Re0IDgtHB2OImmKOpzU43pGl12ZLqF

  1. train_review.json
  2. test_review.json – containing only the target user and business pairs for prediction tasks
  3. test_review_ratings.json – containing the ground truth rating for the testing pairs
  4. user_avg.json – containing the average stars for the users in the train dataset
  5. business_avg.json – containing the average stars for the businesses in the train dataset
  6. We do not share the blind dataset.

4. Tasks

You need to submit the following files on Vocareum: (all lowercase)

  1. [REQUIRED] Python scripts: task1.py, task2train.py, task2predict.py, task3train.py, task3predict.py
  2. [REQUIRED] Model files: task2.model, task3item.model, task3user.model
  3. [REQUIRED] Result files: task1.res, task2.predict
  4. [REQUIRED FOR SCALA] Scala scripts: task1.scala, task2train.scala, task2predict.scala, task3train.scala, task3predict.scala; one Jar package: hw3.jar
  5. [REQUIRED FOR SCALA] Model files: task2.scala.model, task3item.scala.model, task3user.scala.model
  6. [REQUIRED FOR SCALA] Result files: task1.scala.res, task2.scala.predict
  7. [OPTIONAL] You can include other scripts to support your programs (e.g., callable functions).

4.1 Task1: Min-Hash + LSH (2pts)

4.1.1 Task description

In this task, you will implement the Min-Hash and Locality Sensitive Hashing algorithms with Jaccard similarity to find similar business pairs in the train_review.json file. We focus on the 0 or 1 ratings rather than the actual ratings/stars in the reviews. Specifically, if a user has rated a business, the user’s contribution in the characteristic matrix is 1. If the user hasn’t rated the business, the contribution is 0. Table 1 shows an example. Your task is to identify business pairs whose Jaccard similarity is >= 0.05.

Table 1: The left table shows the original ratings; the right table shows the 0 or 1 ratings.

You can define any collection of hash functions that you think would result in a consistent permutation of the row entries of the characteristic matrix. Some potential hash functions are:

𝑓(𝑥) = (𝑎𝑥 + 𝑏) % 𝑚

𝑓(𝑥) = ((𝑎𝑥 + 𝑏) % 𝑝) % 𝑚

where 𝑝 is any prime number; 𝑚 is the number of bins. You can define any combination for the parameters (𝑎, 𝑏, 𝑝, or 𝑚) in your implementation.

After you have defined all the hash functions, you will build the signature matrix using Min-Hash. Then you will divide the matrix into 𝒃 bands with 𝒓 rows each, where 𝒃 × 𝒓 = 𝒏 (𝒏 is the number of hash functions). You need to set 𝒃 and 𝒓 properly to balance the number of candidates and the computational cost. Two businesses become a candidate pair if their signatures are identical in at least one band.

Lastly, you need to verify the candidate pairs using their original Jaccard similarity. Table 1 shows an example of calculating the Jaccard similarity between two businesses. Your final outputs will be the business pairs whose Jaccard similarity is >= 0.05.

           user1  user2  user3  user4
business1    0      1      1      1
business2    0      1      0      0

Table 2: Jaccard similarity (business1, business2) = #intersection / #union = 1/3
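To make the moving parts concrete, here is a small plain-Python illustration of Min-Hash signatures plus LSH banding on toy data (this only illustrates the idea; the assignment itself must be implemented with Spark RDDs, and the parameter choices here are arbitrary):

import random

# Toy characteristic matrix: business -> set of user indices that rated it.
business_users = {"b1": {0, 1, 3}, "b2": {1, 3, 4}, "b3": {2}}

n_hash, m = 30, 5           # number of hash functions and number of bins (assumptions)
params = [(random.randint(1, m - 1), random.randint(0, m - 1)) for _ in range(n_hash)]

def minhash_signature(users):
    # One minimum per hash function f(x) = (a*x + b) % m.
    return [min((a * u + b) % m for u in users) for a, b in params]

signatures = {b: minhash_signature(u) for b, u in business_users.items()}

# LSH banding: b bands of r rows each (b * r = n_hash); identical band -> candidate pair.
b_bands, r_rows = 15, 2
buckets = {}
for biz, sig in signatures.items():
    for band in range(b_bands):
        key = (band, tuple(sig[band * r_rows:(band + 1) * r_rows]))
        buckets.setdefault(key, set()).add(biz)

candidates = {tuple(sorted((x, y))) for bucket in buckets.values() if len(bucket) > 1
              for x in bucket for y in bucket if x < y}
print(candidates)   # verify each candidate's true Jaccard similarity afterwards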

4.1.2 Execution commands

Python $ spark-submit task1.py <input_file> <output_file>
Scala $ spark-submit --class task1 hw3.jar <input_file> <output_file>

<input_file>: the train review set
<output_file>: the similar business pairs and their similarities

4.1.3 Output format

You must write a business pair and its similarity in the JSON format using exactly the same tags as the example in Figure 1. Each line represents for a business pair (“b1”, “b2”). There is no need to have an output for (“b2”, “b1”).

Figure 1: An example output for Task 1 in the JSON format


4.1.4 Grading

You should generate the ground truth that contains all the business pairs (from the train review set) whose Jaccard similarity is >=0.05. You need to compare Task 1 outputs (1pt) against the ground truth using the following metrics. Your accuracy should be >= 0.8 (1pt). The execution time on Vocareum should be less than 200 seconds.

Accuracy = number of true positives / number of ground truth pairs

4.2 Task2: Content-based Recommendation System (2pts)

4.2.1 Task description

In this task, you will build a content-based recommendation system by generating profiles from review texts for users and businesses in the train review set. Then you will use the system/model to predict if a user prefers to review a given business, i.e., computing the cosine similarity between the user and item profile vectors.

During the training process, you will construct business and user profiles as the model:

  1. Concatenating all the review texts for the business into one document and parsing the document, e.g., removing punctuation, numbers, and stopwords. You can also remove extremely rare words (those whose count is less than 0.0001% of the total number of words) to reduce the vocabulary size.
  2. Measuring word importance using TF-IDF, i.e., term frequency * inverse document frequency
  3. Using the top 200 words with the highest TF-IDF scores to describe the document
  4. Creating a Boolean vector with these significant words as the business profile
  5. Creating a Boolean vector for the user profile by aggregating the profiles of the businesses that the user has reviewed (see the sketch after this list)
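A minimal, non-Spark sketch of steps 2-5 for already-cleaned documents follows; the dictionary names, the max-term-frequency normalization, and the base-2 logarithm are assumptions of the example (any standard TF-IDF variant should behave similarly).

    import math
    from collections import Counter

    def build_business_profiles(documents):
        # documents: {business_id: [cleaned token, ...]} produced by step 1.
        n_docs = len(documents)
        df = Counter()                       # document frequency of each word
        for tokens in documents.values():
            df.update(set(tokens))
        profiles = {}
        for biz, tokens in documents.items():
            tf = Counter(tokens)
            max_tf = max(tf.values())
            tfidf = {w: (c / max_tf) * math.log2(n_docs / df[w]) for w, c in tf.items()}
            # Boolean profile: the set of the top 200 words by TF-IDF score.
            profiles[biz] = set(sorted(tfidf, key=tfidf.get, reverse=True)[:200])
        return profiles

    def build_user_profile(reviewed_business_ids, business_profiles):
        # Step 5: aggregate (here, union) the profiles of the reviewed businesses.
        return set().union(*(business_profiles[b] for b in reviewed_business_ids))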

During the predicting process, you will estimate whether a user would prefer to review a business by computing the cosine similarity between the user and business profile vectors. A (user, business) pair is considered valid if its cosine similarity is >= 0.01. You should only output these valid pairs.
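For Boolean profiles stored as word sets, the cosine similarity reduces to the intersection size divided by the product of the square roots of the set sizes; a minimal sketch:

    import math

    def cosine_similarity(user_profile, business_profile):
        # Cosine similarity between two Boolean (set-valued) profiles.
        if not user_profile or not business_profile:
            return 0.0
        return len(user_profile & business_profile) / (
            math.sqrt(len(user_profile)) * math.sqrt(len(business_profile)))

A (user, business) pair is kept in the output only if this value is >= 0.01.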

4.2.2 Execution commands

Training commands:

Python $ spark-submit task2train.py <train_file> <model_file> <stopwords>
Scala  $ spark-submit --class task2train hw3.jar <train_file> <model_file> <stopwords>

Predicting commands:
Python $ spark-submit task2predict.py <test_file> <model_file> <output_file>

Scala $ spark-submit --class task2predict hw3.jar <test_file> <model_file> <output_file>

<train_file>: the train review set
<model_file>: the output model
<stopwords>: the file containing the stopwords to be removed
<test_file>: the test review set (only target pairs)
<model_file>: the model generated during the training process
<output_file>: the output results

4.2.3 Output format:

Model format: There is no strict format for the content-based model.

Prediction format:

You must write the results in the JSON format using exactly the same tags as the example in Figure 2. Each line represents a predicted pair of ("user_id", "business_id").

Figure 2: An example prediction output for Task 2 in JSON format

4.2.4 Grading

You should be able to generate the content-based model as well as the prediction results (1pt). We will compare your prediction results against the ground truth (i.e., the test reviews). Your accuracy should be >= 0.7 for the test datasets (1pt), i.e., the number of identified pairs should be at least 70% of the total number of given (user, business) pairs. The execution time of the training process on Vocareum should be less than 600 seconds. The execution time of the predicting process on Vocareum should be less than 300 seconds.

4.3 Task3: Collaborative Filtering Recommendation System (4pts)

4.3.1 Task description

In this task, you will build collaborative filtering (CF) recommendation systems from the train reviews and use the models to predict the rating for a given (user, business) pair. You are required to implement 2 cases:

• Case 1: Item-based CF recommendation system (2pts)

In Case 1, during the training process, you will build a model by computing the Pearson correlation for the business pairs that have at least three co-rated users. During the predicting process, you will use the model to predict the rating for a given (user, business) pair. You must use at most N business neighbors that are most similar to the target business for the prediction (you can try various values of N, e.g., 3 or 5); see the sketch after Case 2.

• Case 2: User-based CF recommendation system with Min-Hash LSH (2pts)

In Case 2, during the training process, since the number of potential user pairs might be too large to compute, you should combine the Min-Hash and LSH algorithms in your user-based CF recommendation system. You need to (1) identify similar user pairs based on their co-rated businesses, without considering the rating scores (similar to Task 1); this reduces the number of user pairs for which you need to compute the Pearson correlation; and (2) compute the Pearson correlation for the candidate user pairs that have a Jaccard similarity >= 0.01 and at least three co-rated businesses. The predicting process is similar to Case 1.
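Below is a minimal sketch of the two building blocks shared by both cases: the Pearson correlation over co-rated users (co-rated businesses in Case 2, with the roles of users and businesses swapped) and the weighted-average prediction over at most N neighbors. Averaging over the co-rated items only and dividing by the sum of absolute weights are common conventions, not requirements stated in this handout.

    def pearson(ratings_a, ratings_b):
        # ratings_a, ratings_b: {user_id: stars} for two businesses (item-based case).
        co_rated = set(ratings_a) & set(ratings_b)
        if len(co_rated) < 3:
            return None                      # too few co-rated users: not in the model
        avg_a = sum(ratings_a[u] for u in co_rated) / len(co_rated)
        avg_b = sum(ratings_b[u] for u in co_rated) / len(co_rated)
        num = sum((ratings_a[u] - avg_a) * (ratings_b[u] - avg_b) for u in co_rated)
        den = (sum((ratings_a[u] - avg_a) ** 2 for u in co_rated) ** 0.5
               * sum((ratings_b[u] - avg_b) ** 2 for u in co_rated) ** 0.5)
        return num / den if den else 0.0

    def predict_rating(neighbor_ratings, neighbor_sims, n_neighbors=3):
        # neighbor_ratings: {neighbor_id: the user's rating of that neighbor}
        # neighbor_sims:    {neighbor_id: similarity to the target business/user}
        top = sorted(neighbor_sims.items(), key=lambda kv: kv[1], reverse=True)[:n_neighbors]
        num = sum(neighbor_ratings[i] * w for i, w in top)
        den = sum(abs(w) for _, w in top)
        return num / den if den else None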

4.3.2 Execution commands

Training commands:

Python $ spark-submit task3train.py <train_file> <model_file> <cf_type>
Scala  $ spark-submit --class task3train hw3.jar <train_file> <model_file> <cf_type>

Predicting commands:
Python $ spark-submit task3predict.py <train_file> <test_file> <model_file> <output_file> <cf_type>
Scala  $ spark-submit --class task3predict hw3.jar <train_file> <test_file> <model_file> <output_file> <cf_type>

4.3.3 Output format

Model format:

You must write the model in the JSON format using exactly the same tags as the example in Figure 3. Each line represents a business pair ("b1", "b2") for the item-based model (Figure 3a) or a user pair ("u1", "u2") for the user-based model (Figure 3b). There is no need to include ("b2", "b1") or ("u2", "u1").

Figure 3: (a) is an example item-based model and (b) is an example user-based model

Prediction format:
You must write the prediction results in the JSON format using exactly the same tags as the example in Figure 4. Each line represents a predicted ("user_id", "business_id") pair together with its predicted rating.

Figure 4: An example output for Task 3 in JSON format

Training parameters:
<train_file>: the train review set
<model_file>: the output model
<cf_type>: either "item_based" or "user_based"

Predicting parameters:
<train_file>: the train review set
<test_file>: the test review set (only target pairs)
<model_file>: the model generated during the training process
<output_file>: the output results
<cf_type>: either "item_based" or "user_based"

4.3.4 Grading

You should be able to generate the item-based and user-based CF models. We will compare your model to our ground truth. The number of similar pairs in your item-based model should match at least 90% of the pairs in the ground truth (0.5pt). The number of similar pairs in your user-based model should match at least 50% of the pairs in the ground truth (0.5pt).

In addition, we will compare your prediction results against the ground truth. You should ONLY output the predictions that can be generated from your model. For those pairs that your model cannot predict (e.g., due to the cold-start problem or too few co-rated users), we will first predict them with the business average stars for the item-based model and with the user average stars for the user-based model. We provide two files containing the average stars for users and businesses in the training dataset, respectively. There is a tag "UNK" holding the overall average stars of the whole review set, which can be used for predicting new businesses and users. We then use RMSE (Root Mean Squared Error) to evaluate the performance, as defined by the following formula (1pt/prediction):

𝑅𝑀𝑆𝐸 = √((1/𝑛) ∑ᵢ (𝑃𝑟𝑒𝑑ᵢ − 𝑅𝑎𝑡𝑒ᵢ)²)

where 𝑃𝑟𝑒𝑑ᵢ is the predicted rating for the 𝑖-th (user, business) pair, 𝑅𝑎𝑡𝑒ᵢ is the corresponding true rating, and 𝑛 is the total number of (user, business) pairs.
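Expressed as code, a minimal sketch for checking your own predictions locally:

    import math

    def rmse(predictions, true_ratings):
        # predictions, true_ratings: equal-length lists aligned by (user, business) pair.
        n = len(predictions)
        return math.sqrt(sum((p - r) ** 2 for p, r in zip(predictions, true_ratings)) / n)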

The execution time of the training process on Vocareum should be less than 600 seconds. The execution time of the predicting process on Vocareum should be less than 100 seconds. For your reference, the table below shows the RMSE requirements for all the prediction tasks (1pt).

         Case 1 (Test)   Case 2 (Test)   Case 1 (Blind)   Case 2 (Blind)
RMSE         0.9             1.0              0.9              1.0

5. About Vocareum

  1. You can use the provided datasets under the directory resource: /asnlib/publicdata/
  2. You should upload the required files under your workspace: work/
  3. You must test your scripts on both the local machine and the Vocareum terminal before submission.
  4. During the submission period, Vocareum will directly evaluate the following result files: task1.res, task2.predict, task3item.model, and task3user.model. Vocareum will also run the task3predict scripts and evaluate the prediction results for both the test and blind sets.
  5. During the grading period, Vocareum will run both the train and predict scripts. If the training or predicting process fails to run, you can receive 50% of the score only if the submission report shows that your submitted models or results are correct (regrading).
  6. Here are the commands (for Python scripts); they follow the execution command formats given in Sections 4.1.2, 4.2.2, and 4.3.2.

  7. You will receive a submission report after Vocareum finishes executing your scripts. The submission report shows the accuracy for each task. We do not test the Scala implementations during the submission period.
  8. Vocareum will automatically run both the Python and Scala implementations during the grading period.
  9. The total execution time during the submission period should be less than 600 seconds. The execution time during the grading period needs to be less than 3000 seconds.
  10. Please start your assignment early! You can resubmit any script on Vocareum. We will only grade your last submission.

6. Grading Criteria

(% penalty = percentage deducted from the points you would otherwise receive)

  1. You can use your 5 free late days separately or together. You must submit a late-day request via https://forms.gle/worKTbCRBWKQ6jqu6. This form records the number of late days you use for each assignment. By default, if no request is submitted, we will not count any late days for you.
  2. There will be a 10% bonus for each task if your Scala implementation is correct. The Scala bonus is calculated only when the corresponding Python results are correct. There are no partial points for Scala.
  3. There will be no points if your submission cannot be executed on Vocareum.
  4. Once the grade is posted on Blackboard, we will regrade your assignment only if there is a grading error. No other regrading requests will be accepted. No exceptions.
  5. There will be a 20% penalty for late submissions within one week, and no points after that.