University of Waterloo

ECE 657A: Data and Knowledge Modeling and Analysis

Winter 2017

Assignment 1: Data Cleaning and Dimensionality Reduction

Due: February 10th, 2017, 11:59pm

Overview

Assignment Type: done in groups of up to three students.

Notification of Group: by Wednesday, Jan 25, one member of the group should email mcrowley@uwaterloo.ca and the TA Rasoul (r26mohammadinasiri@uwaterloo.ca) with the [first name, last name, student number, email] of everyone in the group.

Hand in: One report (PDF) per group, via the LEARN dropbox. Also submit the code/scripts needed to reproduce your work. (If you don't know LaTeX, you should try to use it: it's good practice and will make the project report easier.)

Objective: To study how to apply some of the methods discussed in class on three datasets. The emphasis is on the analysis and presentation of results, not on the code implemented or used. You can use libraries available in MATLAB, Python, R, or any other programs available to you; you must mention the sources explicitly, with proper references.

Data sets

Available on LEARN, if you aren’t registered yet they should be emailed to you.

Dataset A: This is a time-series dataset collected from a set of motion sensors for wearable activity recognition. The data is given in time order, with 19,000 samples and 81 features. Some missing values ('NaN') and outliers are present (note: the negative values are not outliers). This dataset is used to illustrate data cleaning and preprocessing techniques. (File: DataA.mat)

Dataset B: Handwritten digits 0, 1, 2, 3, and 4 (5 classes). This dataset contains 2066 samples with 784 features, corresponding to a 28 x 28 gray-scale (0-255) image of each digit, arranged column-wise. This dataset is used to illustrate the difference between feature extraction methods. (File: DataB.mat)

Dataset C: This dataset contains cardiotocogram measurements. The goal is to classify each observation into one of three categories: normal (1) / suspect (2) / pathologic (3), given as the ground-truth label gnd. It includes a sample-feature matrix fea with 2100 samples, 21 features, and 3 classes; the features represent measurements of heart rate and uterine contractions. Each sample is a separate row. This dataset is used to illustrate the difference between feature selection methods. (File: DataC.mat)
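If you work in Python rather than MATLAB, the .mat files can be read with SciPy. A minimal sketch (it writes a tiny stand-in file first, since the real data files are only on LEARN; the fea/gnd field names are those given above for DataC):

```python
import numpy as np
from scipy.io import savemat, loadmat

# Write a tiny stand-in file, then read it back the same way DataC.mat
# would be loaded ('fea' and 'gnd' are the field names given above).
savemat('DataC_demo.mat', {'fea': np.zeros((5, 21)), 'gnd': np.ones((5, 1))})
data = loadmat('DataC_demo.mat')          # returns a dict of variables
fea, gnd = data['fea'], data['gnd']       # feature matrix and labels
```

Inspect `data.keys()` after loading the real files to confirm the variable names in each dataset.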


Questions

I. Data Cleaning and Preprocessing (for dataset A)

1. Detect any problems that need to be fixed in dataset A. Report such problems.

2. Fix the detected problems using some of the methods discussed in class.

3. Normalize the data using min-max and z-score normalization. Plot histograms of features 9 and 24; compare and comment on the differences before and after normalization. For both features, plot the auto-correlation before and after each normalization, and report and discuss your observations.
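The two normalizations in question 3 can be sketched as follows (a minimal NumPy example; random data stands in for dataset A, which would be loaded from DataA.mat):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(1000, 81))  # stand-in for dataset A

# Min-max normalization: rescale each feature (column) to [0, 1].
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Z-score normalization: zero mean, unit variance per feature.
X_zscore = (X - X.mean(axis=0)) / X.std(axis=0)
```

Note that min-max shifts and rescales but preserves the shape of each histogram, while z-score recenters it around zero; the auto-correlation structure is unchanged by either, since both are linear per-feature transforms.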

II. Feature Extraction (for dataset B)

1. Apply PCA to the data as a dimensionality reduction technique; compute the eigenvectors and eigenvalues.

2. Plot a 2-dimensional representation of the data points based on the first and second principal components (display the data points of each class in a different color). Explain the results with respect to the known classes.

3. Repeat step 2 for the 5th and 6th components. Comment on the result.

4. Use the Naive Bayes classifier to classify 8 sets of dimensionality-reduced data (using the first 2, 4, 10, 30, 60, 200, 500, and all 784 PCA components). Plot the classification error for the 8 sets against the retained variance (r_m from lecture 3, slide 22) in each case.

5. As the class labels are already known, you can use Linear Discriminant Analysis (LDA) to reduce the dimensionality. Plot the data points using the first 2 LDA components (display the data points of each class in a different color). Explain the results in terms of the known classes, and compare with the results obtained using PCA.
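The PCA computation underlying questions 1-4 can be sketched as an eigendecomposition of the sample covariance matrix. A minimal NumPy illustration on random stand-in data (the real input would be the 2066 x 784 matrix from DataB.mat), including the retained-variance curve r_m used in question 4:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 50))          # stand-in for the digit samples

Xc = X - X.mean(axis=0)                 # center the data
C = Xc.T @ Xc / (Xc.shape[0] - 1)       # sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)    # eigh: for symmetric matrices
order = np.argsort(eigvals)[::-1]       # sort descending by eigenvalue
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

proj2d = Xc @ eigvecs[:, :2]            # projection onto PC1 and PC2

# Retained variance r_m for the first m components, m = 1..d
retained = np.cumsum(eigvals) / eigvals.sum()
```

Scatter-plotting the two columns of `proj2d`, colored by class label, gives the plot required in question 2; slicing `eigvecs[:, 4:6]` instead gives question 3.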

III. Nonlinear Dimensionality Reduction (for dataset B)

Apply the nonlinear dimensionality reduction methods Locally Linear Embedding (LLE) and ISOMAP to dataset B, setting the number of nearest neighbours to 5 and the projected low dimension to 4.

1. Apply LLE to the images of digit '3' only. Visualize the original images by plotting the images corresponding to those instances on a 2-D representation of the data based on the first and second components of LLE. Use the given Matlab function plotImages.m to do this; see Figure 1 for an example of what this looks like for random locations of images of the digits 1-3. Describe qualitatively what kind of variation is captured.

2. Repeat step 1 using the ISOMAP method. Comment on the result. Does ISOMAP do better

in some way? Are the patterns being found globally based or locally based?


3. Use the Naive Bayes classifier to classify the dataset based on the projected 4-dimensional representations from LLE and ISOMAP. Train your classifier on a randomly selected 70% of the data, and test on the remaining 30%. Retrain for multiple iterations (using different random partitions of the data) and use the average accuracy over the runs for your analysis. Justify why your number of iterations was sufficient. Based on the average accuracies, compare their performance with PCA and LDA. Discuss the results.
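One way to sketch this section's pipeline in Python, using scikit-learn's LLE and ISOMAP implementations as stand-ins for the MATLAB libraries linked below (synthetic data replaces the digit images; the k=5, 4-dimension settings match the specification above):

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding, Isomap
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
y = rng.integers(0, 5, 300)            # 5 digit classes
X = rng.normal(size=(300, 20))         # stand-in for the 784-D images
X[:, 0] += 2.0 * y                     # inject class-dependent structure

results = {}
for name, embed in [('LLE', LocallyLinearEmbedding(n_neighbors=5, n_components=4)),
                    ('ISOMAP', Isomap(n_neighbors=5, n_components=4))]:
    Z = embed.fit_transform(X)         # project to 4 dimensions
    accs = []
    for seed in range(20):             # repeated random 70/30 partitions
        Ztr, Zte, ytr, yte = train_test_split(
            Z, y, test_size=0.3, random_state=seed)
        accs.append(GaussianNB().fit(Ztr, ytr).score(Zte, yte))
    results[name] = float(np.mean(accs))
```

To justify the number of iterations, report the standard deviation of `accs` alongside the mean: when the standard error is small relative to the differences between methods, further repetitions are unnecessary.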

IV. Feature Selection (for dataset C)

Reduce the number of features using filter-based and wrapper-based feature selection methods. The data must be normalized (min-max) before further processing. Experiment and report on the following tasks:

1. Using the Sequential Forward Selection (SFS) strategy and the sum of squared Euclidean distances as the objective function, implement filter-based feature selection to select 8 features. Report the selected feature subset.

2. Using the Naive Bayes classifier as the objective function, implement wrapper-based feature selection with the SFS search strategy (to select 8 features). Report the selected feature subset.

3. Using the same objective function as in (2), implement the Sequential Backward Selection (SBS) strategy to select 8 features. Report the selected feature subset.

4. Use the Naive Bayes classifier to classify the dataset using the feature subsets selected in (1), (2), and (3), and also using all 21 features. Follow the same policy as in Section III.3 to divide the data into training and test sets. Report the average accuracy and run time in each case and discuss the results.
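A wrapper-based SFS loop like the one in question 2 can be sketched as follows (a hand-rolled illustration with scikit-learn's Gaussian Naive Bayes as the objective; synthetic data stands in for DataC, with only the first 5 of 21 features carrying class signal):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
n, d = 400, 21
y = rng.integers(1, 4, n)              # labels 1/2/3, like gnd in DataC
X = rng.normal(size=(n, d))
X[:, :5] += y[:, None]                 # only features 0-4 carry signal

selected, remaining = [], list(range(d))
while len(selected) < 8:               # grow the subset one feature at a time
    # Score every candidate: cross-validated NB accuracy of current
    # subset plus that one candidate feature (the "wrapper" objective).
    scores = {f: cross_val_score(GaussianNB(), X[:, selected + [f]], y,
                                 cv=3).mean()
              for f in remaining}
    best = max(scores, key=scores.get) # greedy: keep the best candidate
    selected.append(best)
    remaining.remove(best)
```

SBS (question 3) is the mirror image: start from all 21 features and greedily remove the feature whose removal hurts the cross-validated accuracy least, until 8 remain. The filter variant (question 1) replaces the classifier score with the sum of squared Euclidean distances criterion.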

Deliverables

For submitting your assignment please consider the following notes:

• Submit all of your work as one compressed file (.zip, .rar) named Gx.zip or Gx.rar, where "x" indicates your group number. (You will be able to see your group number on LEARN after you submit your group members to mcrowley@uwaterloo.ca and r26mohammadinasiri@uwaterloo.ca by Jan 25.)

• Your compressed file should have all code, images, etc in addition to your report’s document.

• Write a technical document as your report and include its PDF in your compressed file.

• Your report (.pdf file) should list the name and student number of every member of your group at the beginning, with separate sections for the answer to each part of each question.

• Late submissions (up to 3 days) are accepted with a penalty of 10% per day.

• All code should be clearly written and commented, and should be runnable on another system with just the dataset files beside the code in the same folder.


• Do not upload the data set files.

• One member of each group should upload the report to your group's dropbox on LEARN. Each member does not need to submit their own version; the last version submitted will be the one that is graded.

Some Helpful Info:

1. Naive Bayes classifier: PredictClass = classify(Xtest,Xtrain,Ytrain,’diaglinear’);

2. Random data split: p = randperm(n,k)

3. Plot images on defined coordinates: Use plotImages.m

Example of use:

% X: n x d matrix of images (n samples, d = height*width pixels)
% xy_coord: n x 2 matrix of plotting coordinates
height = 28; width = 28;    % digit images in dataset B are 28 x 28
digitsImages = reshape(X', height, width, size(X,1));
scale = 0.02;               % display size of each plotted image
skip = 1;                   % plot every instance (increase to thin out)
plotImages(digitsImages, xy_coord(:,1:2), scale, skip);

Sample code and result plot is shown in Fig. 1.

4. Libraries to load and use:

LLE: http://www.cs.nyu.edu/~roweis/lle/code.html

ISOMAP: http://web.mit.edu/cocosci/isomap/isomap.html

LDA-dimensionality reduction: http://lvdmaaten.github.io/drtoolbox/

4



Figure 1: Using plotImages to plot pictures in a sample 2D space; (a) a sample code, (b) output of the sample code.
