# CS代考 CSC 311: Introduction to Machine Learning – cscodehelp代写

CSC 311: Introduction to Machine Learning

Lecture 4 – Multiclass Classification & Neural Networks

Based on slides by Amir-massoud Farahmand & Emad A.M. Andrews

Intro ML (UofT) CSC311-Lec4 1 / 48

Classification: predicting a discrete-valued target

Binary classification: predicting a binary-valued target

Multiclass classification: predicting a discrete(> 2)-valued target

Examples of multi-class classification

predict the value of a handwritten digit

classify e-mails as spam, travel, work, personal

Intro ML (UofT) CSC311-Lec4

Multiclass Classification

Classification tasks with more than two categories:

It is very hard to say what makes a 2 Some examples from an earlier version of the net

Intro ML (UofT) CSC311-Lec4 3 / 48

Multiclass Classification

Targets form a discrete set {1,…,K}.

It’s often more convenient to represent them as one-hot vectors, or

a one-of-K encoding:

t = (0,…,0,1,0,…,0) ∈ RK

entry k is 1

Intro ML (UofT) CSC311-Lec4 4 / 48

Multiclass Classification

Now there are D input dimensions and K output dimensions, so we need K × D weights, which we arrange as a weight matrix W.

Also, we have a K-dimensional vector b of biases. Linear predictions:

Vectorized:

wkjxj+bk for k=1,2,…,K

z = Wx + b

Intro ML (UofT)

CSC311-Lec4

Multiclass Classification

Predictions are like probabilities: want 1 ≥ yk ≥ 0 and k yk = 1 A natural activation function to use is the softmax function, a

multivariable generalization of the logistic function: yk = softmax(z1,…,zK)k = ezk

The inputs zk are called the logits. Properties:

Outputs are positive and sum to 1 (so they can be interpreted as probabilities)

If one of the zk is much larger than the others, softmax(z)k ≈ 1 (behaves like argmax).

Exercise: how does the case of K = 2 relate to the logistic function?

Note: sometimes σ(z) is used to denote the softmax function; in this class, it will denote the logistic function applied elementwise. Intro ML (UofT) CSC311-Lec4 6 / 48

Multiclass Classification

If a model outputs a vector of class probabilities, we can use cross-entropy as the loss function:

LCE(y, t) = −

= −t⊤(log y),

where the log is applied elementwise.

Just like with logistic regression, we typically combine the softmax

and cross-entropy into a softmax-cross-entropy function.

Intro ML (UofT) CSC311-Lec4 7 / 48

Multiclass Classification

Softmax regression:

z = Wx + b

y = softmax(z) LCE = −t⊤(log y)

Gradient descent updates can be derived for each row of W: ∂LCE = ∂LCE · ∂zk =(yk −tk)·x

∂wk ∂zk ∂wk

wk ← wk − α (y(i) − t(i))x(i) Nkk

Similar to linear/logistic reg (no coincidence) (verify the update)

Intro ML (UofT) CSC311-Lec4 8 / 48

Limits of Linear Classification

Visually, it’s obvious that XOR is not linearly separable. But how to show this?

Intro ML (UofT) CSC311-Lec4 9 / 48

Limits of Linear Classification

Showing that XOR is not linearly separable (proof by contradiction)

If two points lie in a half-space, line segment connecting them also lie in the same halfspace.

Suppose there were some feasible weights (hypothesis). If the positive examples are in the positive half-space, then the green line segment must be as well.

Similarly, the red line segment must line within the negative half-space.

But the intersection can’t lie in both half-spaces. Contradiction!

Intro ML (UofT) CSC311-Lec4 10 / 48

Limits of Linear Classification

Sometimes we can overcome this limitation using feature maps, just like for linear regression. E.g., for XOR:

x1 ψ(x)=x2

x1 x2 ψ1(x) ψ2(x) ψ3(x) t 000000 010101 101001 111110

This is linearly separable. (Try it!)

Not a general solution: it can be hard to pick good basis functions. Instead, we’ll use neural nets to learn nonlinear hypotheses

Intro ML (UofT) CSC311-Lec4 11 / 48

Neural Networks

Intro ML (UofT) CSC311-Lec4 12 / 48

Inspiration: The Brain

Our brain has ∼ 1011 neurons, each of which communicates (is connected) to ∼ 104 other neurons

Figure: The basic computational unit of the brain: Neuron [Pic credit: http://cs231n.github.io/neural-networks-1/]

Intro ML (UofT) CSC311-Lec4

Inspiration: The Brain

Neurons receive input signals and accumulate voltage. After some threshold they will fire spiking responses.

[Pic credit: www.moleculardevices.com]

Intro ML (UofT) CSC311-Lec4 14 / 48

Inspiration: The Brain

Neurons in form billions of connections in the brain.

Intro ML (UofT) CSC311-Lec4 15 / 48

Inspiration: The Brain

For neural nets, we use a much simpler model neuron, or unit:

output bias i’th weight

y=g b+ xiwi i

Compare with logistic regression: y = σ(w⊤x + b)

By throwing together lots of these incredibly simplistic neuron-like processing units, we can do some powerful computations!

Intro ML (UofT) CSC311-Lec4 16 / 48

w1 w2 w3 weights inputs

nonlinearity i’th input

We can connect lots of units together into a directed acyclic graph.

Typically, units are grouped together into layers.

This gives a

feed-forward neural network. That’s in contrast to recurrent neural networks, which can have cycles.

Intro ML (UofT) CSC311-Lec4 17 / 48

Each hidden layer i connects Ni−1 input units to Ni output units.

In the simplest case, all input units are connected to all output units. We call

this a fully connected layer. We’ll consider other layer types later. Note: the inputs and outputs for a layer are distinct from the inputs and

outputs to the network.

Intro ML (UofT) CSC311-Lec4 18 / 48

Each hidden layer i connects Ni−1 input units to Ni output units.

In the simplest case, all input units are connected to all output units. We call

this a fully connected layer. We’ll consider other layer types later. Note: the inputs and outputs for a layer are distinct from the inputs and

outputs to the network.

If we need to compute M outputs from N

inputs, we can do so in parallel using matrix multiplication. This means we’ll be using a M × N matrix

The output units are a function of the input units:

y = f (x) = φ (Wx + b)

Intro ML (UofT) CSC311-Lec4

Each hidden layer i connects Ni−1 input units to Ni output units.

In the simplest case, all input units are connected to all output units. We call

this a fully connected layer. We’ll consider other layer types later. Note: the inputs and outputs for a layer are distinct from the inputs and

outputs to the network.

If we need to compute M outputs from N

inputs, we can do so in parallel using matrix multiplication. This means we’ll be using a M × N matrix

The output units are a function of the input units:

y = f (x) = φ (Wx + b)

A multilayer network consisting of fully connected layers is called a multilayer perceptron. Despite the name, it has nothing to do with perceptrons!

Intro ML (UofT) CSC311-Lec4

Some activation functions:

Rectified Linear Unit (ReLU)

y = max(0, z)

y = log 1 + ez

Intro ML (UofT)

CSC311-Lec4

Some activation functions:

Hard Threshold

y = 1 if z > 0 0 ifz≤0

Hyperbolic Tangent (tanh)

ez − e−z y=ez+e−z

y = 1 1+e−z

Intro ML (UofT)

CSC311-Lec4

Each layer computes a function, so the network computes a composition of functions:

h(1) = f(1)(x) = φ(W(1)x + b(1))

h(2) = f(2)(h(1)) = φ(W(2)h(1) + b(2))

y = f(L)(h(L−1))

Or more compactly:

y=f(L) ◦···◦f(1)(x).

Neural nets provide modularity: we can implement each layer’s computations as a black box.

Intro ML (UofT) CSC311-Lec4 21 / 48

Feature Learning

Last layer:

If task is regression: choose

y = f(L)(h(L−1)) = (w(L))T h(L−1) + b(L)

Intro ML (UofT) CSC311-Lec4 22 / 48

Feature Learning

Last layer:

If task is regression: choose

y = f(L)(h(L−1)) = (w(L))T h(L−1) + b(L) If task is binary classification: choose

y = f(L)(h(L−1)) = σ((w(L))T h(L−1) + b(L))

Intro ML (UofT) CSC311-Lec4 22 / 48

Feature Learning

Last layer:

If task is regression: choose

y = f(L)(h(L−1)) = (w(L))T h(L−1) + b(L) If task is binary classification: choose

y = f(L)(h(L−1)) = σ((w(L))T h(L−1) + b(L))

Neural nets can be viewed as a way of learning features:

Intro ML (UofT) CSC311-Lec4

Feature Learning

Last layer:

If task is regression: choose

y = f(L)(h(L−1)) = (w(L))T h(L−1) + b(L) If task is binary classification: choose

y = f(L)(h(L−1)) = σ((w(L))T h(L−1) + b(L))

Neural nets can be viewed as a way of learning features:

Intro ML (UofT) CSC311-Lec4 22 / 48

Feature Learning

Suppose we’re trying to classify images of handwritten digits. Each image is represented as a vector of 28 × 28 = 784 pixel values.

Each first-layer hidden unit computes φ(wiT x). It acts as a feature detector.

We can visualize w by reshaping it into an image. Here’s an example that responds to a diagonal stroke.

Intro ML (UofT) CSC311-Lec4 23 / 48

Feature Learning

Here are some of the features learned by the first hidden layer of a handwritten digit classifier:

Intro ML (UofT) CSC311-Lec4 24 / 48

Expressive Power

We’ve seen that there are some functions that linear classifiers can’t represent. Are deep networks any better?

Intro ML (UofT) CSC311-Lec4 25 / 48

Expressive Power

We’ve seen that there are some functions that linear classifiers can’t represent. Are deep networks any better?

Suppose a layer’s activation function was the identity, so the layer just computes a affine transformation of the input

We call this a linear layer

Intro ML (UofT) CSC311-Lec4 25 / 48

Expressive Power

We’ve seen that there are some functions that linear classifiers can’t represent. Are deep networks any better?

Suppose a layer’s activation function was the identity, so the layer just computes a affine transformation of the input

We call this a linear layer

Any sequence of linear layers can be equivalently represented with

a single linear layer.

y = W(3)W(2)W(1) x ′

Deep linear networks are no more expressive than linear regression.

Intro ML (UofT) CSC311-Lec4 25 / 48

Expressive Power

Multilayer feed-forward neural nets with nonlinear activation functions are universal function approximators: they can approximate any function arbitrarily well.

This has been shown for various activation functions (thresholds, logistic, ReLU, etc.)

Even though ReLU is “almost” linear, it’s nonlinear enough.

Intro ML (UofT) CSC311-Lec4 26 / 48

Designing a network to classify XOR:

Assume hard threshold activation function

Intro ML (UofT) CSC311-Lec4 27 / 48

h1 computes I[x1 + x2 − 0.5 > 0] i.e.x1ORx2

Intro ML (UofT) CSC311-Lec4 28 / 48

h1 computes I[x1 + x2 − 0.5 > 0] i.e.x1ORx2

h2 computes I[x1 + x2 − 1.5 > 0] i.e.x1ANDx2

Intro ML (UofT) CSC311-Lec4 28 / 48

h1 computes I[x1 + x2 − 0.5 > 0] i.e.x1ORx2

h2 computes I[x1 + x2 − 1.5 > 0] i.e.x1ANDx2

y computes I[h1 − h2 − 0.5 > 0] ≡ I[h1 + (1 − h2) − 1.5 > 0] i.e. h1 AND(NOTh2)=x1 XORx2

Intro ML (UofT) CSC311-Lec4

Expressive Power

Universality for binary inputs and targets:

Hard threshold hidden units, linear output

Strategy: 2D hidden units, each of which responds to one particular input configuration

Intro ML (UofT) CSC311-Lec4

Expressive Power

Universality for binary inputs and targets:

Hard threshold hidden units, linear output

Strategy: 2D hidden units, each of which responds to one particular input configuration

Only requires one hidden layer, though it needs to be extremely wide.

Intro ML (UofT) CSC311-Lec4 29 / 48

Expressive Power

What about the logistic activation function?

You can approximate a hard threshold by scaling up the weights and biases:

y = σ(x) y = σ(5x)

Intro ML (UofT) CSC311-Lec4

Expressive Power

What about the logistic activation function?

You can approximate a hard threshold by scaling up the weights and biases:

y = σ(x) y = σ(5x)

This is good: logistic units are differentiable, so we can train them

with gradient descent.

Intro ML (UofT) CSC311-Lec4 30 / 48

Expressive Power

Limits of universality

You may need to represent an exponentially large network.

How can you find the appropriate weights to represent a given function?

If you can learn any function, you’ll just overfit.

We desire a compact representation.

Intro ML (UofT) CSC311-Lec4

Training Neural Networks with Backpropagation

Intro ML (UofT) CSC311-Lec4 32 / 48

Recap: Gradient Descent

Recall: gradient descent moves opposite the gradient (the direction of steepest descent)

Weight space for a multilayer neural net: one coordinate for each weight or bias of the network, in all the layers

Conceptually, not any different from what we’ve seen so far — just higher dimensional and harder to visualize!

We want to define a loss L and compute the gradient of the cost dJ /dw, which is the vector of partial derivatives.

This is the average of dL/dw over all the training examples, so in this lecture we focus on computing dL/dw.

Intro ML (UofT) CSC311-Lec4 33 / 48

Univariate Chain Rule

We’ve already been using the univariate Chain Rule. Recall: if f(x) and x(t) are univariate functions, then

d f(x(t)) = df dx.

Intro ML (UofT)

CSC311-Lec4

Univariate Chain Rule

Recall: Univariate logistic least squares model

z = wx + b y = σ(z)

L = 12 ( y − t ) 2 Let’s compute the loss derivatives ∂L , ∂L

Intro ML (UofT) CSC311-Lec4 35 / 48

Univariate Chain Rule

How you would have done it in calculus class

= ∂ 1(σ(wx+b)−t)2 ∂w 2

= 1 ∂ (σ(wx+b)−t)2 2 ∂w

= (σ(wx+b)−t) ∂ (σ(wx+b)−t) ∂w

∂ 1 2 = ∂b 2(σ(wx+b)−t)

= 1 ∂ (σ(wx+b)−t)2 2∂b

= (σ(wx+b)−t) ∂ (σ(wx+b)−t) ∂b

= (σ(wx+b)−t)σ′(wx+b) ∂ (wx+b) ∂b

1 L=2(σ(wx+b)−t)

= (σ(wx+b)−t)σ′(wx+b) ∂ (wx+b) ∂w

= (σ(wx+b)−t)σ′(wx+b) What are the disadvantages of this approach?

= (σ(wx + b) − t)σ′(wx + b)x

Intro ML (UofT) CSC311-Lec4

Univariate Chain Rule

A more structured way to do it

Computing the derivatives:

dL = y − t dy

dL = dLdy = dLσ′(z) dz dy dz dy

L = 12 ( y − t ) 2

Remember, the goal isn’t to obtain closed-form solutions, but to be

able to write a program that efficiently computes the derivatives.

Computing the loss:

z = wx + b y=σ(z)

∂ L = d L d z = d L x ∂w dz dw dz ∂L = dL dz = dL

∂b dz db dz

Intro ML (UofT) CSC311-Lec4

Univariate Chain Rule

We can diagram out the computations using a computation graph.

The nodes represent all the inputs and computed quantities, and the edges represent which nodes are computed directly as a function of which other nodes.

Computing the loss:

z = wx + b y = σ(z)

L = 12 ( y − t ) 2

Intro ML (UofT) CSC311-Lec4 38 / 48

Univariate Chain Rule

A slightly more convenient notation:

Use y to denote the derivative dL/dy, sometimes called the error signal.

This emphasizes that the error signals are just values our program is computing (rather than a mathematical operation).

Computing the loss:

z = wx + b y = σ(z)

L = 21 ( y − t ) 2 Intro ML (UofT)

Computing the derivatives:

z = y σ′(z) w=zx

CSC311-Lec4

Multivariate Chain Rule

Problem: what if the computation graph has fan-out > 1? This requires the multivariate Chain Rule!

L2-Regularized regression

z = wx + b y = σ(z)

L = 1 (y − t)2 2

Softmax regression

zl =wljxj +bl j

12 ezk R = 2w yk = l ezl

Lreg =L+λR

L = − tk log yk k

Intro ML (UofT) CSC311-Lec4

Multivariate Chain Rule

Suppose we have a function f(x,y) and functions x(t) and y(t). (All the variables here are scalar-valued.) Then

d f(x(t),y(t)) = ∂f dx + ∂f dy dt ∂x dt ∂y dt

Plug in to Chain Rule:

f (x, y) = y + exy x(t) = cos t y(t) = t2

Intro ML (UofT)

df = ∂f dx + ∂f dy dt ∂x dt ∂y dt

= (yexy) · (− sin t) + (1 + xexy) · 2t CSC311-Lec4

Multivariable Chain Rule

In the context of backpropagation:

In our notation:

t = x dx + y dy dt dt

Intro ML (UofT) CSC311-Lec4 42 / 48

Backpropagation

Full backpropagation algorithm:

Letv1,…,vN beatopologicalorderingofthecomputationgraph (i.e. parents come before children.)

vN denotes the variable we’re trying to compute derivatives of (e.g. loss).

Intro ML (UofT) CSC311-Lec4 43 / 48

Backpropagation

Example: univariate logistic least squares regression

Forward pass:

z = wx + b y = σ(z)

L= 1(y−t)2 2

Lreg =L+λR

Intro ML (UofT) CSC311-Lec4 44 / 48

Backpropagation

Example: univariate logistic least squares regression Backward pass:

Forward pass:

z = wx + b y = σ(z)

L= 1(y−t)2 2

Lreg =L+λR

dLreg = Lreg λ

dLreg = Lreg dL

= Lreg = L dL

= L (y − t)

Intro ML (UofT)

CSC311-Lec4 44 / 48

Backpropagation

Example: univariate logistic least squares regression Backward pass:

Forward pass:

z = wx + b y = σ(z)

L= 1(y−t)2 2

Lreg =L+λR

dLreg = Lreg λ

z = y dy dz

b = z ∂z ∂b

dLreg = Lreg dL

w= z ∂w + R dw = z x + R w

= Lreg = L dL

= L (y − t)

Intro ML (UofT)

CSC311-Lec4

Perceptron (multiple outputs):

Forward pass:

zi =w(1)xj +b(1) ij i

hi = σ(zi)

yk =w(2)hi +b(2) ki k

L = 1 (yk − tk)2

Intro ML (UofT) CSC311-Lec4 45 / 48

Perceptron (multiple outputs): Backward pass:

Forward pass:

zi =w(1)xj +b(1) ij i

hi = σ(zi)

yk =w(2)hi +b(2) ki k

L = 1 (yk − tk)2

yk =L(yk −tk)

w(2) = yk hi ki

b(2) = yk k

hi =ykw(2) ki

zi = hi σ′(zi)

w(1) = zi xj ij

Intro ML (UofT)

CSC311-Lec4

Backpropagation

In vectorized form:

Forward pass:

z = W(1)x + b(1) h = σ(z)

y = W(2)h + b(2)

L = 21 ∥ y − t ∥ 2

Intro ML (UofT) CSC311-Lec4 46 / 48

Backpropagation

In vectorized form:

Backward pass:

Forward pass:

z = W(1)x + b(1) h = σ(z)

y = W(2)h + b(2)

L = 21∥y − t∥2 Intro ML (UofT)

y = L (y − t) W(2) = yh⊤

h = W(2)⊤y

z = h ◦ σ′(z) W(1) = zx⊤

CSC311-Lec4

Computational Cost

Computational cost of forward pass: one add-multiply operation per weight

zi =w(1)xj +b(1) ij i

Intro ML (UofT) CSC311-Lec4 47 / 48

Computational Cost

Computational cost of forward pass: one add-multiply operation per weight

zi =w(1)xj +b(1) ij i

Computational cost of backward pass: two add-multiply operations per weight

w(2) = yh

hi = ykw(2) ki

CSC311-Lec4

Computational Cost

Computational cost of forward pass: one add-multiply operation per weight

zi =w(1)xj +b(1) ij i

Computational cost of backward pass: two add-multiply operations per weight

w(2) = yh

hi = ykw(2) ki

Rule of thumb: the backward pass is about as expensive as two forward passes.

For a multilayer perceptron, this means the cost is linear in the number of layers, quadratic in the number of units per layer. Intro ML (UofT) CSC311-Lec4

is used to train the overwhelming majority of neural nets today.

Even optimization algorithms much fancier than gradient descent (e.g. second-order methods) use backprop to compute the gradients.

Despite its practical success, backprop is believed to be neurally implausible.

Intro ML (UofT) CSC311-Lec4 48 / 48