# CS代考程序代写 ÿþ Where are we?

ÿþ Where are we?

Least Squares Method for regression Examples

The LMS objective

Gradient descent

Incremental/stochastic gradient descent Exercises

What s the mileage?

Suppose we want to predict the mileage of a car from its weight and age

(x 100 lb)

x1 x2

31.5 6 21 36.2 2 25 43.1 0 18 27.6 2 30

What we want: A function that can predict mileage using x1 and x2

(years)

Linear regression: The strategy Predicting continuous values using a linear model

Assumption: The output is a linear function of the inputs Mileage=w0 +w1 x1 +w2 x2

Learning: Using the training data to find the best possible value ofw

Prediction: Given the values for x1, x2 for a new car, use the learned w to predict the for the new car

Linear regression: The strategy Predicting continuous values using a linear model

Assumption: The output is a linear function of the inputs Mileage=w0 +w1 x1 +w2 x2

Parameters of the model Also called weights Collectively, a vector

Learning: Using the training data to find the best possible value ofw

Prediction: Given the values for x1, x2 for a new car, use the learned w to predict the for the new car

Linear regression: The strategy

” Inputs are vectors: 5Ø ” R

” Outputs are real numbers: ” R

” We have a training set

D={ 5Ø, , 5Ø, ,ï”}

” We want to approximate as

= + + ï” +

For simplicity, we will assume that the first feature is always 1.

1

= 5Ø

is the learned weight vector in R

=

î”

This lets makes notation easier

Examples y

One dimensional input

x1

Examples y

One dimensional input

Predict using y = w1 + w2 x2

x1

Examples y

One dimensional input

Predict using y = w1 + w2 x2

The linear function is not our only choice. We could have tried to fit the data as another polynomial

x1

Examples y

One dimensional input

Predict using y = w1 + w2 x2

The linear function is not our only choice. We could have tried to fit the data as another polynomial

x1

Two dimensional input Predictusingy=w1 +w2 x2+w3 x3

What is the best weight vector?

Question: How do we know which weight vector is the best one for a training set?

For an input ( , ) in the training set, the cost of a mistake is Define the cost (or loss) for a particular weight vector w to be

Sum of squared costs over the training set

One strategy for learning: Find the w with least cost on this data

What is the best weight vector?

Question: How do we know which weight vector is the best one for a training set?

For an input ( , ) in the training set, the cost of a mistake is

Define the cost (or loss) for a particular weight vector w to be

Sum of squared costs over the training set

One strategy for learning: Find the w with least cost on this data

What is the best weight vector?

Question: How do we know which weight vector is the best one for a training set?

For an input ( , ) in the training set, the cost of a mistake is Define the cost (or loss) for a particular weight vector w to be

One strategy for learning: Find the w with least cost on this data

What is the best weight vector?

Question: How do we know which weight vector is the best one for a training set?

For an input ( , ) in the training set, the cost of a mistake is Define the cost (or loss) for a particular weight vector w to be

Sum of squared costs over the training set

One strategy for learning: Find the w with least cost on this data

What is the best weight vector?

Question: How do we know which weight vector is the best one for a training set?

For an input ( , ) in the training set, the cost of a mistake is Define the cost (or loss) for a particular weight vector w to be

Sum of squared costs over the training set

One strategy for learning: Find the w with least cost on this data

Least Mean Squares (LMS) Regression

Learning: minimizing mean squared error

Least Mean Squares (LMS) Regression

Learning: minimizing mean squared error

Different strategies exist for learning by optimization

” Gradient descent is a popular algorithm

(For this particular minimization objective, there is also an analytical solution. No need for gradient descent)

We are trying to minimize

Gradient descent General strategy for minimizing

a function J(w)

” Start with an initial guess for

w, say w0

” Iterate till convergence:

Computethegradientofthe gradient of J at wt

Update wt to get wt+1 by taking a step in the opposite direction of the gradient

Intuition: The gradient is the direction of steepest increase in the function. To get to the minimum, go in the opposite direction

We are trying to minimize

Gradient descent General strategy for minimizing

a function J(w)

” Start with an initial guess for

w, say w0

” Iterate till convergence:

Computethegradientofthe gradient of J at wt

Update wt to get wt+1 by taking a step in the opposite direction of the gradient

Intuition: The gradient is the direction of steepest increase in the function. To get to the minimum, go in the opposite direction

Gradient descent General strategy for minimizing

a function J(w)

” Start with an initial guess for

w, say w0

” Iterate till convergence:

Computethegradientofthe gradient of J at wt

Update wt to get wt+1 by taking a step in the opposite direction of the gradient

We are trying to minimize

Intuition: The gradient is the direction of steepest increase in the function. To get to the minimum, go in the opposite direction

Gradient descent General strategy for minimizing

a function J(w)

” Start with an initial guess for

w, say w0

” Iterate till convergence:

Computethegradientofthe gradient of J at wt

Update wt to get wt+1 by taking a step in the opposite direction of the gradient

We are trying to minimize

Intuition: The gradient is the direction of steepest increase in the function. To get to the minimum, go in the opposite direction

We are trying to minimize

Gradient descent General strategy for minimizing

a function J(w)

” Start with an initial guess for

w, say w0

” Iterate till convergence:

Computethegradientofthe gradient of J at wt

Update wt to get wt+1 by taking a step in the opposite direction of the gradient

Intuition: The gradient is the direction of steepest increase in the function. To get to the minimum, go in the opposite direction

We are trying to minimize

Gradient descent General strategy for minimizing

a function J(w)

” Start with an initial guess for

w, say w0

” Iterate till convergence:

Computethegradientofthe gradient of J at wt

Update wt to get wt+1 by taking a step in the opposite direction of the gradient

Intuition: The gradient is the direction of steepest increase in the function. To get to the minimum, go in the opposite direction

Gradient descent for LMS

1. Initializew0

2. Fort=0,1,2,....

1. Compute gradient of J(w) at wt. Call it r J(wt)

2. Update w as follows:

r: Called the learning rate

(For now, a small constant. We will get to this later)

We are trying to minimize

We are trying to minimize

Gradient descent for LMS 1. Initializew0

2. Fort=0,1,2,.... 1.

2. Update w as follows:

What is the gradient of J?

Compute gradient of J(w) at wt. Call it r J(wt)

r: Called the learning rate

(For now, a small constant. We will get to this later)

Gradient of the cost ” Thegradientisoftheform

” Rememberthatwisavectorwithdelements w=[w1,w2,w3,wj,,wd]

We are trying to minimize

We are trying to minimize

Gradient of the cost ” Thegradientisoftheform

We are trying to minimize

Gradient of the cost ” Thegradientisoftheform

We are trying to minimize

Gradient of the cost ” Thegradientisoftheform

We are trying to minimize

Gradient of the cost ” Thegradientisoftheform

We are trying to minimize

Gradient of the cost ” Thegradientisoftheform

We are trying to minimize

Gradient of the cost ” Thegradientisoftheform

One element of the gradient vector

Gradient of the cost ” Thegradientisoftheform

We are trying to minimize

One element of the gradient vector

Sum of Error £ Input

We are trying to minimize

Gradient descent for LMS

1. Initializew0

2. Fort=0,1,2,....

1. Compute gradient of J(w) at wt. Call it r J(wt)

Evaluate the function for each training example to compute the error and construct the gradient vector

2. Update w as follows:

r: Called the learning rate

(For now, a small constant. We will get to this later)

Gradient descent for LMS

1. Initializew0

2. Fort=0,1,2,....

1. Compute gradient of J(w) at wt. Call it r J(wt)

Evaluate the function for each training example to compute the error and construct the gradient vector

We are trying to minimize

One element ofrJ

2. Update w as follows:

r: Called the learning rate

(For now, a small constant. We will get to this later)

Gradient descent for LMS

1. Initializew0

2. Fort=0,1,2,....

1. Compute gradient of J(w) at wt. Call it r J(wt)

Evaluate the function for each training example to compute the error

and construct the gradient vector

2. Update w as follows:

We are trying to minimize

One element ofrJ

r: Called the learning rate

(For now, a small constant. We will get to this later)

Gradient descent for LMS 1. Initializew0

2. Fort=0,1,2,....(untiltotalerrorisbelowathreshold) 1. Compute gradient of J(w) at wt. Call it r J(wt)

Evaluate the function for each training example to compute the error and construct the gradient vector

2. Update w as follows:

We are trying to minimize

One element ofrJ

r: Called the learning rate

(For now, a small constant. We will get to this later)

Gradient descent for LMS

1. Initializew0

2. Fort=0,1,2,....(untiltotalerrorisbelowathreshold)

1.

Compute gradient of J(w) at wt. Call it r J(wt)

Evaluate the function for each training example to compute the error

and construct the gradient vector

Update w as follows:

r: Called the learning rate

(For now, a small constant. We will get to this later)

We are trying to minimize

One element ofrJ

2.

1.

Compute gradient of J(w) at wt. Call it r J(wt)

Evaluate the function for each training example to compute the error

and construct the gradient vector

Update w as follows: r: Called the learning rate

(For now, a small constant. We will get to this later)

We are trying to minimize

Gradient descent for LMS

1. Initializew0

2. Fort=0,1,2,....(untiltotalerrorisbelowathreshold)

2.

One element ofrJ

This algorithm is guaranteed to converge to the minimum of J if r is small enough. Why? The objective J is a convex function

We are trying to minimize

Gradient descent for LMS 1. Initializew0

2. Fort=0,1,2,....(untiltotalerrorisbelowathreshold) 1. Compute gradient of J(w) at wt. Call it r J(wt)

Evaluate the function for each training example to compute the error and construct the gradient vector

2. Update w as follows:

Gradient descent for LMS 1. Initializew0

2. Fort=0,1,2,....(untiltotalerrorisbelowathreshold) 1. Compute gradient of J(w) at wt. Call it r J(wt)

Evaluate the function for each training example to compute the error and construct the gradient vector

2. Update w as follows:

The weight vector is not updated until all errors are calculated

Why not make early updates to the weight vector as soon as we encounter errors instead of waiting for a full pass over the data?

We are trying to minimize

Incremental/Stochastic gradient descent ” Repeatforeachexample(xi,yi)

Pretend that the entire training set is represented by this single example

Use this example to calculate the gradient and update the model

” Contrastwithbatchgradientdescentwhichmakes one update to the weight vector for every pass over the data

Incremental/Stochastic gradient descent

1. Initialize

2. Fort=0,1,2,....(untilerrorbelowsomethreshold)

For each training example ( , ):

” Update w. For each element of the weight vector (wj):

Incremental/Stochastic gradient descent

1. Initialize

2. Fort=0,1,2,....(untilerrorbelowsomethreshold)

For each training example ( , ):

” Update w. For each element of the weight vector (wj):

Contrast with the previous method, where the weights are updated only after all examples are processed once

Incremental/Stochastic gradient descent

1. Initialize

2. Fort=0,1,2,....(untilerrorbelowsomethreshold)

For each training example ( , ):

” Update w. For each element of the weight vector (wj):

This update rule is also called the Widrow-Hoff rule in the neural networks literature

Incremental/Stochastic gradient descent

1. Initialize

2. Fort=0,1,2,....(untilerrorbelowsomethreshold)

For each training example ( , ):

” Update w. For each element of the weight vector (wj):

Online/Incremental algorithms are often preferred when the training set is very large

May get close to optimum much faster than the batch version

This update rule is also called the Widrow-Hoff rule in the neural networks literature

Learning Rates and Convergence

” In the general (non-separable) case the learning rate r must decrease to zero to guarantee convergence

” The learning rate is called the step size.

Moresophisticatedalgorithmschoosethestepsizeautomaticallyand

converge faster

” Choosing a better starting point can also have impact

” Gradient descent and its stochastic version are very simple algorithms

Yet,almostallthealgorithmswewilllearnintheclasscanbetracedback to gradient decent algorithms for different loss functions and different hypotheses spaces

Linear regression: Summary

” Whatwewant:Predictarealvaluedoutputusinga feature representation of the input

” Assumption:Outputisalinearfunctionoftheinputs ” Learningbyminimizingtotalcost

Gradient descent and stochastic gradient descent to find the best weight vector

This particular optimization can be computed directly by framing the problem as a matrix problem