# CS代考程序代写 ÿþ Where are we?

ÿþ Where are we?
Least Squares Method for regression  Examples
 The LMS objective
 Incremental/stochastic gradient descent  Exercises

What s the mileage?
Suppose we want to predict the mileage of a car from its weight and age
(x 100 lb)
x1 x2
31.5 6 21 36.2 2 25 43.1 0 18 27.6 2 30
What we want: A function that can predict mileage using x1 and x2
(years)

Linear regression: The strategy Predicting continuous values using a linear model
Assumption: The output is a linear function of the inputs Mileage=w0 +w1 x1 +w2 x2
Learning: Using the training data to find the best possible value ofw
Prediction: Given the values for x1, x2 for a new car, use the learned w to predict the for the new car

Linear regression: The strategy Predicting continuous values using a linear model
Assumption: The output is a linear function of the inputs Mileage=w0 +w1 x1 +w2 x2
Parameters of the model Also called weights Collectively, a vector
Learning: Using the training data to find the best possible value ofw
Prediction: Given the values for x1, x2 for a new car, use the learned w to predict the for the new car

Linear regression: The strategy
” Inputs are vectors: 5Ø ” R
” Outputs are real numbers: ” R
” We have a training set
D={ 5Ø, , 5Ø, ,ï”}
” We want to approximate as
= + + ï” +
For simplicity, we will assume that the first feature is always 1.
1
= 5Ø
is the learned weight vector in R
=

î”
This lets makes notation easier

Examples y
One dimensional input
x1

Examples y
One dimensional input
Predict using y = w1 + w2 x2
x1

Examples y
One dimensional input
Predict using y = w1 + w2 x2
The linear function is not our only choice. We could have tried to fit the data as another polynomial
x1

Examples y
One dimensional input
Predict using y = w1 + w2 x2
The linear function is not our only choice. We could have tried to fit the data as another polynomial
x1
Two dimensional input Predictusingy=w1 +w2 x2+w3 x3

What is the best weight vector?
Question: How do we know which weight vector is the best one for a training set?
For an input ( , ) in the training set, the cost of a mistake is Define the cost (or loss) for a particular weight vector w to be
Sum of squared costs over the training set
One strategy for learning: Find the w with least cost on this data

What is the best weight vector?
Question: How do we know which weight vector is the best one for a training set?
For an input ( , ) in the training set, the cost of a mistake is
Define the cost (or loss) for a particular weight vector w to be
Sum of squared costs over the training set
One strategy for learning: Find the w with least cost on this data

What is the best weight vector?
Question: How do we know which weight vector is the best one for a training set?
For an input ( , ) in the training set, the cost of a mistake is Define the cost (or loss) for a particular weight vector w to be
One strategy for learning: Find the w with least cost on this data

What is the best weight vector?
Question: How do we know which weight vector is the best one for a training set?
For an input ( , ) in the training set, the cost of a mistake is Define the cost (or loss) for a particular weight vector w to be
Sum of squared costs over the training set
One strategy for learning: Find the w with least cost on this data

What is the best weight vector?
Question: How do we know which weight vector is the best one for a training set?
For an input ( , ) in the training set, the cost of a mistake is Define the cost (or loss) for a particular weight vector w to be
Sum of squared costs over the training set
One strategy for learning: Find the w with least cost on this data

Least Mean Squares (LMS) Regression
Learning: minimizing mean squared error

Least Mean Squares (LMS) Regression
Learning: minimizing mean squared error
Different strategies exist for learning by optimization
” Gradient descent is a popular algorithm
(For this particular minimization objective, there is also an analytical solution. No need for gradient descent)

We are trying to minimize
Gradient descent General strategy for minimizing
a function J(w)
w, say w0
” Iterate till convergence:
 Update wt to get wt+1 by taking a step in the opposite direction of the gradient
Intuition: The gradient is the direction of steepest increase in the function. To get to the minimum, go in the opposite direction

We are trying to minimize
Gradient descent General strategy for minimizing
a function J(w)
w, say w0
” Iterate till convergence:
 Update wt to get wt+1 by taking a step in the opposite direction of the gradient
Intuition: The gradient is the direction of steepest increase in the function. To get to the minimum, go in the opposite direction

Gradient descent General strategy for minimizing
a function J(w)
w, say w0
” Iterate till convergence:
 Update wt to get wt+1 by taking a step in the opposite direction of the gradient
We are trying to minimize
Intuition: The gradient is the direction of steepest increase in the function. To get to the minimum, go in the opposite direction

Gradient descent General strategy for minimizing
a function J(w)
w, say w0
” Iterate till convergence:
 Update wt to get wt+1 by taking a step in the opposite direction of the gradient
We are trying to minimize
Intuition: The gradient is the direction of steepest increase in the function. To get to the minimum, go in the opposite direction

We are trying to minimize
Gradient descent General strategy for minimizing
a function J(w)
w, say w0
” Iterate till convergence:
 Update wt to get wt+1 by taking a step in the opposite direction of the gradient
Intuition: The gradient is the direction of steepest increase in the function. To get to the minimum, go in the opposite direction

We are trying to minimize
Gradient descent General strategy for minimizing
a function J(w)
w, say w0
” Iterate till convergence:
 Update wt to get wt+1 by taking a step in the opposite direction of the gradient
Intuition: The gradient is the direction of steepest increase in the function. To get to the minimum, go in the opposite direction

1. Initializew0
2. Fort=0,1,2,....
1. Compute gradient of J(w) at wt. Call it r J(wt)
2. Update w as follows:
r: Called the learning rate
(For now, a small constant. We will get to this later)
We are trying to minimize

We are trying to minimize
Gradient descent for LMS 1. Initializew0
2. Fort=0,1,2,.... 1.
2. Update w as follows:
What is the gradient of J?
Compute gradient of J(w) at wt. Call it r J(wt)
r: Called the learning rate
(For now, a small constant. We will get to this later)

” Rememberthatwisavectorwithdelements  w=[w1,w2,w3,wj,,wd]
We are trying to minimize

We are trying to minimize

We are trying to minimize

We are trying to minimize

We are trying to minimize

We are trying to minimize

We are trying to minimize
One element of the gradient vector

We are trying to minimize
One element of the gradient vector
Sum of Error £ Input

We are trying to minimize
1. Initializew0
2. Fort=0,1,2,....
1. Compute gradient of J(w) at wt. Call it r J(wt)
Evaluate the function for each training example to compute the error and construct the gradient vector
2. Update w as follows:
r: Called the learning rate
(For now, a small constant. We will get to this later)

1. Initializew0
2. Fort=0,1,2,....
1. Compute gradient of J(w) at wt. Call it r J(wt)
Evaluate the function for each training example to compute the error and construct the gradient vector
We are trying to minimize
One element ofrJ
2. Update w as follows:
r: Called the learning rate
(For now, a small constant. We will get to this later)

1. Initializew0
2. Fort=0,1,2,....
1. Compute gradient of J(w) at wt. Call it r J(wt)
Evaluate the function for each training example to compute the error
2. Update w as follows:
We are trying to minimize
One element ofrJ
r: Called the learning rate
(For now, a small constant. We will get to this later)

Gradient descent for LMS 1. Initializew0
2. Fort=0,1,2,....(untiltotalerrorisbelowathreshold) 1. Compute gradient of J(w) at wt. Call it r J(wt)
Evaluate the function for each training example to compute the error and construct the gradient vector
2. Update w as follows:
We are trying to minimize
One element ofrJ
r: Called the learning rate
(For now, a small constant. We will get to this later)

1. Initializew0
2. Fort=0,1,2,....(untiltotalerrorisbelowathreshold)
1.
Compute gradient of J(w) at wt. Call it r J(wt)
Evaluate the function for each training example to compute the error
Update w as follows:
r: Called the learning rate
(For now, a small constant. We will get to this later)
We are trying to minimize
One element ofrJ
2.

1.
Compute gradient of J(w) at wt. Call it r J(wt)
Evaluate the function for each training example to compute the error
Update w as follows: r: Called the learning rate
(For now, a small constant. We will get to this later)
We are trying to minimize
1. Initializew0
2. Fort=0,1,2,....(untiltotalerrorisbelowathreshold)
2.
One element ofrJ
This algorithm is guaranteed to converge to the minimum of J if r is small enough. Why? The objective J is a convex function

We are trying to minimize
Gradient descent for LMS 1. Initializew0
2. Fort=0,1,2,....(untiltotalerrorisbelowathreshold) 1. Compute gradient of J(w) at wt. Call it r J(wt)
Evaluate the function for each training example to compute the error and construct the gradient vector
2. Update w as follows:

Gradient descent for LMS 1. Initializew0
2. Fort=0,1,2,....(untiltotalerrorisbelowathreshold) 1. Compute gradient of J(w) at wt. Call it r J(wt)
Evaluate the function for each training example to compute the error and construct the gradient vector
2. Update w as follows:
The weight vector is not updated until all errors are calculated
Why not make early updates to the weight vector as soon as we encounter errors instead of waiting for a full pass over the data?
We are trying to minimize

 Pretend that the entire training set is represented by this single example
 Use this example to calculate the gradient and update the model
” Contrastwithbatchgradientdescentwhichmakes one update to the weight vector for every pass over the data

1. Initialize
2. Fort=0,1,2,....(untilerrorbelowsomethreshold)
 For each training example ( , ):
” Update w. For each element of the weight vector (wj):

1. Initialize
2. Fort=0,1,2,....(untilerrorbelowsomethreshold)
 For each training example ( , ):
” Update w. For each element of the weight vector (wj):
Contrast with the previous method, where the weights are updated only after all examples are processed once

1. Initialize
2. Fort=0,1,2,....(untilerrorbelowsomethreshold)
 For each training example ( , ):
” Update w. For each element of the weight vector (wj):
This update rule is also called the Widrow-Hoff rule in the neural networks literature

1. Initialize
2. Fort=0,1,2,....(untilerrorbelowsomethreshold)
 For each training example ( , ):
” Update w. For each element of the weight vector (wj):
Online/Incremental algorithms are often preferred when the training set is very large
May get close to optimum much faster than the batch version
This update rule is also called the Widrow-Hoff rule in the neural networks literature

Learning Rates and Convergence
” In the general (non-separable) case the learning rate r must decrease to zero to guarantee convergence
” The learning rate is called the step size.
 Moresophisticatedalgorithmschoosethestepsizeautomaticallyand
converge faster
” Choosing a better starting point can also have impact
” Gradient descent and its stochastic version are very simple algorithms
 Yet,almostallthealgorithmswewilllearnintheclasscanbetracedback to gradient decent algorithms for different loss functions and different hypotheses spaces

Linear regression: Summary
” Whatwewant:Predictarealvaluedoutputusinga feature representation of the input
” Assumption:Outputisalinearfunctionoftheinputs ” Learningbyminimizingtotalcost
 Gradient descent and stochastic gradient descent to find the best weight vector
 This particular optimization can be computed directly by framing the problem as a matrix problem