February 2, 2022

Recitation 2 February 2, 2022 1 / 27

Recitation 2

Gradient Descent and Stochastic Gradient Descent

Copyright By cscodehelp代写 加微信 cscodehelp

Gradient Descent

Adaptive Learning Rates

Stochastic Gradient Descent Applications

Linear Regression Logistic Regression

Recitation 2

February 2, 2022

Recitation 2 February 2, 2022 4 / 27

Statistical Learning Theory to Gradient Descent

We are given the data set (x1,y1),…,(xn,yn) where xi ∈ Rd and yi ∈ R. We want to fit a linear function to this data by performing empirical risk minimization.

What is the hypothesis space? And what loss function can we use?

Recitation 2 February 2, 2022 5 / 27

Statistical Learning Theory to Gradient Descent

We are given the data set (x1,y1),…,(xn,yn) where xi ∈ Rd and yi ∈ R. We want to fit a linear function to this data by performing empirical risk minimization.

We search the hypothesis space F = {hθ(x) = θTx | θ ∈ Rd} to find the function that minimizes the loss as calculated by, let’s say,

l(a, y) = (a − y)2.

Statistical Learning Theory to Gradient Descent

The empirical risk is given by

1 n l(hθ(xi),yi)= n (θTxi −yi)2

decision function means finding a good θ.

To find this function we will minimize the empirical loss on the training data. How?

J(θ):=Rn(hθ)= n

The hypothesis space is parameterized by θ, so finding a good

Recitation 2 February 2, 2022 6 / 27

Given a current guess for θ, we will use the gradient of the empirical loss (w.r.t. θ) to get a local linear approximation.

If the gradient is non-zero, taking a small step in the direction of the negative gradient is guaranteed to decrease the empirical loss.

This motivates the minimization algorithm called gradient descent.

Recitation 2 February 2, 2022 7 / 27

Gradient Descent

Gradient Descent

Gradient Descent Algorithm

Goal: find θ∗ = arg minθ J(θ) θ0 :=[initial condition]

while not [termination condition]:

compute ∇J(θi )

εi :=[choose learning rate at iteration i] θi+1 :=θi −εi∇J(θi)

i := i + 1

Recitation 2

February 2, 2022

Gradient Descent

Gradient Checker

Recall the mathematical definition of the derivative as: ∂ f(θ)=lim f(θ+ε)−f(θ−ε)

Approximate the gradient by setting ε to a small constant, say

ε = 10−4. Compare that this is close to your computed value within some tolerance level.

Now let’s expand this method to deal with vector input θ ∈ Rd . Let’s say we want to verify out gradient at dimension i (∇f (θ))i . We can make use of one-hot vector ei in which all dimension except the ith are 0 and the ith dimension has a value of 1: ei = [0, 0, …, 1, …, 0]T

The gradient at ith dimension can be then approximated as

[∇f(θ)](i) ≈ f(θ+εei)−f(θ−εei) 2ε

Recitation 2

February 2, 2022

Gradient Descent

Gradient Descent

How to initialize θ0?

sample from some distribution

compose θ0 using some heuristics

Glorot et. al, He at. al, PyTorch initializations

How to choose termination conditions? run for a fixed number of iteration

the value of J(θ) stabilizes θi converges

What is a good learning rate?

Recitation 2

February 2, 2022

Gradient Descent

Learning Rate

Recitation 2 February 2, 2022 11 / 27

Application

Supposewewouldliketofindθ∗ ∈Rthatminimizesf(θ)=θ2−2θ+1. The gradient (in this case, the derivative) ∇f (θ) = 2θ − 2. We can easily see that θ∗ = argminθf (θ) = 1.

Gradient Descent

Learning Rate

We applied gradient descent for 1000 iterations on f (θ) = θ2 − 2θ + 1 with varying learning rate ε ∈ {1, 0.1, 0.01, 0.001, 0.0001}.

When the learning rate is too large (ε = 1), f (θ) does not decrease through iterations. The value of θi at each iteration significantly fluctuates.

When the learning rate is too small (ε = 0.0001), f (θ) decreases very slowly.

Recitation 2 February 2, 2022 12 / 27

Gradient Descent

Adaptive Learning Rate

Instead of using a fixed learning rate through all iterations, we can adjust our learning rate in each iteration using a simple algorithm. At each iteration i:

θ ̃ := θi−1 − εi−1∇f (θi−1) δ := f (θi−1) − f (θ ̃)

if δ ≥ threshold:

we achieve a satisfactory reduction on f (θ)

maybe we can consider increasing the learning rate for next iteration εi := 2εi−1

the reduction is unsatisfactory θi =θi−1

the learning rate is too large, so we reduce the learning rate εi := 21εi−1

Recitation 2 February 2, 2022 13 / 27

Gradient Descent

Adaptive Learning Rate

How to decide a proper threshold for f (θi−1) − f (θ ̃)?

You can find more details at this link.

Recitation 2 February 2, 2022 14 / 27

Armijo rule

If learning rate ε satisfies

f (θi−1) − f (θ ̃) ≥ 12ε∥∇f (θi−1)∥2 (1)

then f (θ) is guaranteed to converge to a (local) maximum under certain technical assumptions.

Stochastic Gradient Descent

Gradient Descent

Running gradient descent on the entire training dataset can be expensive. Motivates SGD.

SGD Algorithm

Initialize weights θ0 :=[initial condition] Repeat:

Randomly select a data point (xi , yi ) θi+1 =θi −ε∇θl(fθ(xi),yi)

Check for stopping critria

Recitation 2

February 2, 2022

Gradient Descent

Minibatch Gradient Descent

SGD can be noisy. Can we get a better estimate of the gradient?

Minibatch Gradient Descent Algorithm (batch size = n)

Initialize weights θ0 :=[initial condition] Repeat:

Randomly select n data point (xi , yi )ni =1 n

θi+1 =θi −ε[n1 ∇θl(fθ(xi),yi)] i=1

Check for stopping critria

Larger n ⇒ Better estimate of gradient but slower

Smaller n ⇒ Worse estimate of gradient but faster

Recitation 2

February 2, 2022

Gradient Descent for Linear Regression

Applications

Problem Setup

Data: D = {xi,yi}ni=1 where xi ∈ Rd and yi ∈ R HypothesisSpace: F={f :Rd →R|f(x)=θTx, θ∈Rd} Action: Prediction on an input x, yˆ = θT x

Loss: l(y,yˆ)=(yˆ−y)2

n Hence,costfunctionJ(θ)=n1 (θTxi −yi)2

Goal: Find the optimum theta, θ∗ = arg minθ J(θ)

Recitation 2

February 2, 2022

Finding θ∗

Applications

Approach 1: Closed-Form Solution (HW1)

J(θ)= n1 (θTxi −yi)2 = n1||Xθ−b||2

To minimize J(θ), we take the derivative and set it to zero

∇J(θ)=XTXθ−XTy =0 θˆ = ( X T X ) − 1 X T y

Pros: One shot algorithm for a direct solution Cons: Doesn’t scale, invertibility constraints

Recitation 2 February 2, 2022 18 / 27

Finding θ∗

Approach 2: Iterative Gradient Descent Initialize θ0

While not [terminating condition]

Compute the gradient

∇θJ(θi−1) = d J(θi−1), d J(θi−1) … dθ1 dθ2

θi =θi−1−εi∇θJ(θi−1) Return θi

Pros: Conceptually simple, good chance of convergence Cons: Slow

Applications

d J(θi−1)T dθd

Recitation 2

February 2, 2022

Finding θ∗

Applications

Approach 3: Stochastic Gradient Descent Initialize θ0

While not [terminating condition]

Select a random training point {xj , yj }

θi =θi−1−εi∇θJ(j)(θi−1)whereJ(j)(θ)=(θTxj −yj)2

Pros: Fast and memory efficient

Cons: Practical challenges, noisy gradient steps

Recitation 2 February 2, 2022 20 / 27

Gradient Descent for Logistic Regression

Applications

Problem Setup

Data: D = {xi,yi}ni=1 where xi ∈ Rd and yi ∈ {0,1} Hypothesis Space: (Slightly convoluted)

H={h:Rd →(0,1)|h (x)=g(θTx), θ∈Rd,g(z)= θ

Action: Prediction on an input x: yˆ = h ( x ) = g ( θ T x ) = 1

θ 1+e−θT x

Interpret this as a probability: p(y = 1|x;θ) = hθ(x) and p(y =0|x;θ)=1−hθ(x)

More succinctly, p(y|x, θ) = (hθ(x))y (1 − hθ(x))1−y

Recitation 2 February 2, 2022

Applications

Gradient Descent for Logistic Regression

Goal: Minimize logistic loss over all examples:

l(hθ (x ), y ) = −y log hθ (x ) − (1 − y )(1 − log hθ (x ))

Assuming your data is generated iid.

Then likelihood, L(θ) = P(y1 … yn|x1 … xn; θ) = p(yi |xi ; θ)

Maximizing log likelihood,

l(θ)=logL(θ)= yi loghθ(xi)+(1−yi)(1−loghθ(xi))

Equivalent to our goal of loss minimization

Recitation 2 February 2, 2022 22 / 27

Applications

Gradient Descent for Logistic Regression

No analytical one-shot solution

Find optimum using gradient based approaches

The gradient works out to be quite tractable

Recitation 2

February 2, 2022

Gradient Descent for Logistic Regression

Applications

For a single example (x,y):

l(θ) = y log hθ(x) + (1 − y)(1 − log hθ(x))

∂ l(θ)= ∂l(θ) ∂g(θTx) ∂θj ∂g(θTx) ∂θj

Recitation 2 February 2, 2022 24 / 27

Applications

Gradient Descent for Logistic Regression

For a single example (x,y):

l(θ) = y log hθ(x) + (1 − y)(1 − log hθ(x))

∂ l(θ)= ∂l(θ) ∂g(θTx) ∂θj ∂g(θTx) ∂θj

= y 1 − (1 − y) 1 ∂g(θT x) g(θTx) 1−g(θTx) ∂θj

Recitation 2 February 2, 2022 25 / 27

Gradient Descent for Logistic Regression

Applications

For a single example (x,y):

l(θ) = y log hθ(x) + (1 − y)(1 − log hθ(x))

∂ l(θ)= ∂l(θ) ∂g(θTx)

∂g(θTx) ∂θj

− (1 − y) 1 ∂g(θT x)

1−g(θTx) ∂θj

g(θT x)(1 − g(θT x))∂θT x ∂θj

∂θj = y 1

1 g(θTx) 1−g(θTx)

g(θTx) 1 − (1 − y)

Recitation 2

February 2, 2022

Applications

Gradient Descent for Logistic Regression

For a single example (x,y):

l(θ) = y log hθ(x) + (1 − y)(1 − log hθ(x))

∂ l(θ)= ∂l(θ) ∂g(θTx)

∂g(θTx) ∂θj

− (1 − y) 1 ∂g(θT x)

1−g(θTx) ∂θj

g(θT x)(1 − g(θT x))∂θT x ∂θj

= y(1 − g(θT x)) − (1 − y)(g(θT x)xj = (y − hθ(x))xj

∂θj = y 1

1 g(θTx) 1−g(θTx)

g(θTx) 1 − (1 − y)

Recitation 2 February 2, 2022

程序代写 CS代考 加微信: cscodehelp QQ: 2235208643 Email: kyit630461@163.com