# 程序代写代做代考 python deep learning database scheme algorithm CSE474/574 Introduction to Machine Learning

CSE474/574 Introduction to Machine Learning

Programming Assignment 1

Handwritten Digits Classification

Due Date: March 8th 2017

1 Introduction

In this assignment, your task is to implement a Multilayer Perceptron Neural Network and evaluate its

performance in classifying handwritten digits. For CSE574 students only: You will use the same network to

analyze a more challenging face dataset and compare the performance of the neural network against a deep

neural network using the TensorFlow library.

After completing this assignment, you are able to understand:

• How Neural Network works and use Feed Forward, Back Propagation to implement Neural Network?

• How to setup a Machine Learning experiment on real data?

• How regularization plays a role in the bias-variance tradeoff?

• For CSE574 students only: How to use TensorFlow library to deploy deep neural networks and under-

stand how having multiple hidden layers can improve the performance of the neural network?

To get started with the exercise, you will need to download the supporting files and unzip its contents to

the directory you want to complete this assignment.

Warning: In this project, you will have to handle many computing intensive tasks such as training

a neural network. Our suggestion is to use the CSE server Metallica (this server is dedicated to intensive

computing tasks) to run your computation. YOU MUST USE PYTHON 3 FOR IMPLEMENTATION.

In addition, training such a big dataset will take a very long time, maybe many hours or even days to

complete. Therefore, we suggest that you should start doing this project as soon as possible so that the

computer will have time to do heavy computational jobs.

1.1 File included in this exercise

• mnist all.mat : original dataset from MNIST. In this file, there are 10 matrices for testing set and 10

matrices for training set, which corresponding to 10 digits. You will have to split the training data

into training and validation data.

• face all.mat : sample of face images from the CelebA data set. In this file there is one data matrix and

one corresponding labels vector. You will have to split the data into training and testing data.

• nnScript.py : Python script for this programming project. Contains function definitions –

– preprocess(): performs some preprocess tasks, and output the preprocessed train, validation and

test data with their corresponding labels. You need to make changes to this function.

– sigmoid(): compute sigmoid function. The input can be a scalar value, a vector or a matrix. You

need to make changes to this function.

1

– nnObjFunction(): compute the error function of Neural Network. You need to make changes to

this function.

– nnPredict(): predicts the label of data given the parameters of Neural Network. You need to make

changes to this function.

– initializeWeights(): return the random weights for Neural Network given the number of unit in

the input layer and output layer.

• deepnnScript.py : Python script for calling the TensorFlow library for running the deep neural network

(For CSE574 students only). This script will be released by Feb 20th 2017.

1.2 Datasets

Two data sets will be provided. Both consist of images. See the notebook available here – http://nbviewer.

jupyter.org/github/ubdsgroup/ubmlcourse/blob/master/notebooks/ProgrammingAssignment1.ipynb,

for pointers about how to handle the data.

1.2.1 MNIST Dataset

The MNIST dataset [1] consists of a training set of 60000 examples and test set of 10000 examples. All

digits have been size-normalized and centered in a fixed image of 28× 28 size. In original dataset, each pixel

in the image is represented by an integer between 0 and 255, where 0 is black, 255 is white and anything

between represents different shade of gray.

YOu will need to split the training set of 60000 examples into two sets. First set of 50000 randomly sampled

examples will be used for training the neural network. The remainder 10000 examples will be used as a

validation set to estimate the hyper-parameters of the network (regularization constant λ, number of hidden

units).

1.2.2 CelebFaces Attributes Dataset (CelebA)

CelebFaces Attributes Dataset (CelebA) [3] is a large-scale face attributes dataset with more than 200K

celebrity images. CelebA has large diversities, large quantities, and rich annotations, including:

• 10,177 number of identities,

• 202,599 number of face images, and

• 5 landmark locations, 40 binary attributes annotations per image.

For this programming assignment, we will have provided a subset of the images. The subset will consist of

data for 26407 face images, split into two classes. One class will be images in which the individual is wearing

glasses and the other class will be images in which the individual is not wearing glasses. Each image is a

54× 44 matrix, flattened into a vector of length 2376.

2 Your tasks

• Implement Neural Network (forward pass and back propagation)

• Incorporate regularization on the weights (λ)

• Use validation set to tune hyper-parameters for Neural Network (number of units in the hidden layer

and λ).

• For CSE574 students only: Run the deep neural network code we provided and compare the results

with normal neural network. The code will be released by Feb 20th 2017.

• Write a report to explain the experimental results.

2

Figure 1: Neural network

3 Some practical tips in implementation

3.1 Feature selection

In the dataset, one can observe that there are many features which values are exactly the same for all data

points in the training set. With those features, the classification models cannot gain any more information

about the difference (or variation) between data points. Therefore, we can ignore those features in the

pre-processing step.

Later on in this course, you will learn more sophisticated models to reduce the dimension of dataset (but

not for this assignment).

3.2 Neural Network

3.2.1 Neural Network Representation

Neural network can be graphically represented as in Figure 1.

As observed in the Figure 1, there are totally 3 layers in the neural network:

• The first layer comprises of (d + 1) units, each represents a feature of image (there is one extra unit

representing the bias).

• The second layer in neural network is called the hidden units. In this document, we denote m + 1

as the number of hidden units in hidden layer. There is an additional bias node at the hidden layer

as well. Hidden units can be considered as the learned features extracted from the original data set.

Since number of hidden units will represent the dimension of learned features in neural network, it’s

our choice to choose an appropriate number of hidden units. Too many hidden units may lead to the

slow training phase while too few hidden units may cause the the under-fitting problem.

• The third layer is also called the output layer. The value of lth unit in the output layer represents

the probability of a certain hand-written image belongs to digit l. Since we have 10 possible digits,

there are 10 units in the output layer. In this document, we denote k as the number of output units

in output layer.

The parameters in Neural Network model are the weights associated with the hidden layer units and the

output layers units. In our standard Neural Network with 3 layers (input, hidden, output), in order to

represent the model parameters, we use 2 matrices:

3

• W (1) ∈ Rm×(d+1) is the weight matrix of connections from input layer to hidden layer. Each row in

this matrix corresponds to the weight vector at each hidden layer unit.

• W (2) ∈ Rk×(m+1) is the weight matrix of connections from hidden layer to output layer. Each row in

this matrix corresponds to the weight vector at each output layer unit.

We also further assume that there are n training samples when performing learning task of Neural Network.

In the next section, we will explain how to perform learning in Neural Network.

3.2.2 Feedforward Propagation

In Feedforward Propagation, given parameters of Neural Network and a feature vector x, we want to compute

the probability that this feature vector belongs to a particular digit.

Suppose that we have totally m hidden units. Let aj for 1 ≤ j ≤ m be the linear combination of input

data and let zj be the output from the hidden unit j after applying an activation function (in this exercise,

we use sigmoid as an activation function). For each hidden unit j (j = 1, 2, · · · ,m), we can compute its

value as follow:

aj =

D+1∑

i=1

w

(1)

ji xi (1)

zj = σ(aj) =

1

1 + exp(−aj)

(2)

where w

(1)

ji = W

(1)[j][i] is the weight of connection from unit i in input layer to unit j in hidden layer. Note

that we do not compute the output for the bias hidden node (m+ 1); zm+1 is directly set to 1.

The third layer in neural network is called the output layer where the learned features in hidden units

are linearly combined and a sigmoid function is applied to produce the output. Since in this assignment,

we want to classify a hand-written digit image to its corresponding class, we can use the one-vs-all binary

classification in which each output unit l (l = 1, 2, · · · , 10) in neural network represents the probability of an

image belongs to a particular digit. For this reason, the total number of output unit is k = 10. Concretely,

for each output unit l (l = 1, 2, · · · , 10), we can compute its value as follow:

bl =

m+1∑

j=1

w

(2)

lj zj (3)

ol = σ(bl) =

1

1 + exp(−bl)

(4)

Now we have finished the Feedforward pass.

3.2.3 Error function and Backpropagation

The error function in this case is the negative log-likelihood error function which can be written as follow:

J(W (1),W (2)) = −

1

n

n∑

i=1

k∑

l=1

(yil ln oil + (1− yil) ln(1− oil)) (5)

where yil indicates the l

th target value in 1-of-K coding scheme of input data i and oil is the output at l

th

output node for the ith data example (See (4)).

Because of the form of error function in equation (5), we can separate its error function in terms of error

for each input data xi:

J(W (1),W (2)) =

1

n

n∑

i=1

Ji(W

(1),W (2)) (6)

where

Ji(W

(1),W (2)) = −

k∑

l=1

(yil ln oil + (1− yil) ln(1− oil)) (7)

4

One way to learn the model parameters in neural networks is to initialize the weights to some random

numbers and compute the output value (feed-forward), then compute the error in prediction, transmits this

error backward and update the weights accordingly (error backpropagation).

The feed-forward step can be computed directly using formula (1), (2), (3) and (4).

On the other hand, the error backpropagation step requires computing the derivative of error function

with respect to the weight.

Consider the derivative of error function with respect to the weight from the hidden unit j to output

unit l where j = 1, 2, · · · ,m+ 1 and l = 1, · · · , 10:

∂Ji

∂w

(2)

lj

=

∂Ji

∂ol

∂ol

∂bl

∂bl

∂w

(2)

lj

(8)

= δlzj (9)

where

δl =

∂Ji

∂ol

∂ol

∂bl

= −(

yl

ol

−

1− yl

1− ol

)(1− ol)ol = ol − yl

Note that we are dropping the subscript i for simplicity. The error function (log loss) that we are using

in (5) is different from the the squared loss error function that we have discussed in class. Note that the

choice of the error function has “simplified” the expressions for the error!

On the other hand, the derivative of error function with respect to the weight from the input unit l to

output unit j where p = 1, 2, · · · , d+ 1 and j = 1, · · · ,m can be computed as follow:

∂Ji

∂w

(1)

jp

=

k∑

l=1

∂Ji

∂ol

∂ol

∂bl

∂bl

∂zj

∂zj

∂aj

∂aj

∂w

(1)

jp

(10)

=

k∑

l=1

δlw

(2)

lj (1− zj)zjxp (11)

= (1− zj)zj(

k∑

l=1

δlw

(2)

lj )xp (12)

Note that we do not compute the gradient for the weights at the bias hidden node.

After finish computing the derivative of error function with respect to weight of each connection in neural

network, we now can write the formula for the gradient of error function:

∇J(W (1),W (2)) =

1

n

n∑

i=1

∇Ji(W (1),W (2)) (13)

We again can use the gradient descent to update each weight (denoted in general as w) with the following

rule:

wnew = wold − γ∇J(wold) (14)

3.2.4 Regularization in Neural Network

In order to avoid overfitting problem (the learning model is best fit with the training data but give poor

generalization when test with validation data), we can add a regularization term into our error function to

control the magnitude of parameters in Neural Network. Therefore, our objective function can be rewritten

as follow:

J̃(W (1),W (2)) = J(W (1),W (2)) +

λ

2n

m∑

j=1

d+1∑

p=1

(w

(1)

jp )

2 +

k∑

l=1

m+1∑

j=1

(w

(2)

lj )

2

(15)

where λ is the regularization coefficient.

5

With this new objective function, the partial derivative of new objective function with respect to weight

from hidden layer to output layer can be calculated as follow:

∂J̃

∂w

(2)

lj

=

1

n

(

n∑

i=1

∂Ji

∂w

(2)

lj

+ λw

(2)

lj

)

(16)

Similarly, the partial derivative of new objective function with respect to weight from input layer to hidden

layer can be calculated as follow:

∂J̃

∂w

(1)

jp

=

1

n

(

n∑

i=1

∂Ji

∂w

(1)

jp

+ λw

(1)

jp

)

(17)

With this new formulas for computing objective function (15) and its partial derivative with respect to

weights (16) (17) , we can again use gradient descent to find the minimum of objective function.

3.2.5 Python implementation of Neural Network

In the supporting files, we have provided the base code for you to complete. In particular, you have to

complete the following functions in Python:

• sigmoid : compute sigmoid function. The input can be a scalar value, a vector or a matrix.

• nnObjFunction: compute the objective function of Neural Network with regularization and the gradient

of objective function.

• nnPredict : predicts the label of data given the parameters of Neural Network.

Details of how to implement the required functions is explained in Python code.

Optimization: In general, the learning phase of Neural Network consists of 2 tasks. First task is

to compute the value and gradient of error function given Neural Network parameters. Second task

is to optimize the error function given the value and gradient of that error function. As explained

earlier, we can use gradient descent to perform the optimization problem. In this assignment, you

have to use the Python scipy function: scipy.optimize.minimize (using the option method=’CG’

for conjugate gradient descent), which performs the conjugate gradient descent algorithm to perform

optimization task. In principle, conjugate gradient descent is similar to gradient descent but it chooses

a more sophisticated learning rate γ in each iteration so that it will converge faster than gradient

descent. Details of how to use minimize are provided here: http://docs.scipy.org/doc/scipy-0.

14.0/reference/generated/scipy.optimize.minimize.html.

We use regularization in Neural Network to avoid overfitting problem (more about this will be discussed

in class). You are expected to change different value of λ to see its effect in prediction accuracy in validation

set. Your report should include diagrams to explain the relation between λ and performance of Neural

Network. Moreover, by plotting the value of λ with respect to the accuracy of Neural Network, you should

explain in your report how to choose an appropriate hyper-parameter λ to avoid both underfitting and

overfitting problem. You can vary λ from 0 (no regularization) to 1 in increments of 0.1 or 0.2.

You are also expected to try different number hidden units to see its effect to the performance of Neural

Network. Since training Neural Network is very slow, especially when the number hidden units in Neural

Network is large. You should try with small hidden units and gradually increase the size and see how it

effects the training time. Your report should include some diagrams to explain relation between number of

hidden units and training time. Recommended values: 4, 8, 12, 16, 20.

6

4 TensorFlow Library

In this assignment you will only implement a single layer Neural Network. You will realize that implementing

multiple layers can be a very cumbersome coding task. However, additional layers can provide a better

modeling of the data set. The analysis of the challenging CelebA data set will show how adding more layers

can improve the performance of the Neural Network. To experiment with Neural Networks with multiple

layers, we will use Google’s TensorFlow library (https://www.tensorflow.org/).

TensorFlow is currently installed as a Python module in the metallica server. Students planning to

use their personal laptop or desktop machines can install the TensorFlow library (tensorflow-0.12.1) using

instructions from the main site.

Your experiments should include the following:

• Evaluate the accuracy of single hidden layer Neural Network on CelebA data set, to distinguish between

two classes – wearing glasses and not wearing glasses.

• Evaluate the accuracy of deep Neural Network (try 3, 5, and 7 hidden layers) on CelebA data set.

• Compare the performance of single vs. deep Neural Networks in terms of accuracy on test data and

learning time.

5 Submission

You are required to submit a single file called proj1.zip using UBLearns.

File proj1.zip must contain 2 folders: report and code.

• Folder report contains your report file (in pdf format). Please indicate the team members and your

course number on the top of the report.

• Folder code must contains the following updated files: nnScript.py and params.pickle1. File params.pickle

contains the learned parameters of Neural Network. Concretely, file params.pickle must contain the

following variables: optimal n hidden (number of units in hidden layer), w1 (matrix of weight W (1)

as mentioned in section 3.2.1), w2 (matrix of weight W (2) as mentioned in section 3.2.1), optimal λ

(regularization coeffient λ as mentioned in section 3.2.4).2

Using UBLearns Submission: In the groups page of the UBLearns website you will see groups

called “Programming Assignment 1 Group x”. Please choose any available group number for your group

and join the group. All project group members must join the same group. Please do not join any other

group on UBLearns that you are not part of. You should submit one solution per group through the

groups page.

Project report: The hard-copy of report will be collected in class at due date. Your report should include

the following:

• Explanation of how to choose the hyper-parameters for Neural Network (number of hidden units,

regularization term λ).

• For CSE574 students only: Compare the results of deep neural network and neural network with one

hidden layer on the CelebA data set.

1Check this to learn how to pickle objects in Python: https://wiki.python.org/moin/UsingPickle

2If you want to write more supporting functions to complete the required functions, you should include these supporting

functions and a README file which explains your supporting functions.

7

6 Grading scheme

• For CSE474 students [Total 100 points]:

– Successfully implement Neural Network: 60 points (preprocess() [10 points], sigmoid() [10 points],

nnObjFunction() [30 points], nnPredict() [10 points]).

– Project report: 40 points

∗ Explanation with supporting figures of how to choose the hyper-parameter for Neural Net-

work: 30 points

∗ Accuracy of classification method on the handwritten digits test data: 10 points

• For CSE574 students [Total 120 points]:

– Successfully implement Neural Network: 60 points (preprocess() [10 points], sigmoid() [10 points],

nnObjFunction() [30 points], nnPredict() [10 points]).

– Project report: 60 points

∗ Explanation with supporting figures of how to choose the hyper-parameter for Neural Net-

work: 30 points

∗ Accuracy of classification method on the handwritten digits test data: 10 points

∗ Accuracy of classification method on the CelebA data set: 10 points

∗ Comparison of your neural network with a deep neural network (using TensorFlow) in terms

of accuracy and training time: 10 points

• Students in CSE474 section may attempt the CSE574 requirements (CelebA data analysis, comparison

with deep neural networks) for extra credit

References

[1] LeCun, Yann; Corinna Cortes, Christopher J.C. Burges. “MNIST handwritten digit database”.

[2] Bishop, Christopher M. “Pattern recognition and machine learning (information science and statistics).”

(2007).

[3] Liu, Ziwei; Luo, Ping; Wang, Xiaogang; Tang, Xiaoou. “Deep Learning Face Attributes in the Wild”,

Proceedings of International Conference on Computer Vision (ICCV) (2015).

8