# Recitation 2

Recitation 2 February 2, 2022 1 / 27
Recitation 2

Linear Regression
Logistic Regression

Statistical Learning Theory to Gradient Descent
We are given the data set (x1,y1),…,(xn,yn) where xi ∈ Rd and yi ∈ R. We want to fit a linear function to this data by performing empirical risk minimization.
What is the hypothesis space? And what loss function can we use?

Statistical Learning Theory to Gradient Descent
We are given the data set (x1,y1),…,(xn,yn) where xi ∈ Rd and yi ∈ R. We want to fit a linear function to this data by performing empirical risk minimization.
We search the hypothesis space F = {hθ(x) = θᵀx | θ ∈ Rd} to find the function that minimizes the loss as calculated by, let's say, l(a, y) = (a − y)².

Statistical Learning Theory to Gradient Descent
The empirical risk is given by
J(θ) := R̂n(hθ) = (1/n) Σᵢ₌₁ⁿ l(hθ(xi), yi) = (1/n) Σᵢ₌₁ⁿ (θᵀxi − yi)²
The hypothesis space is parameterized by θ, so finding a good decision function means finding a good θ.
To find this function we will minimize the empirical loss on the training data. How?

Given a current guess for θ, we will use the gradient of the empirical loss (w.r.t. θ) to get a local linear approximation.
If the gradient is non-zero, taking a sufficiently small step in the direction of the negative gradient is guaranteed to decrease the empirical loss.
This motivates the minimization algorithm called gradient descent.

Goal: find θ∗ = arg minθ J(θ)

θ0 := [initial condition]
while not [termination condition]:
    compute ∇J(θi)
    εi := [choose learning rate at iteration i]
    θi+1 := θi − εi ∇J(θi)
    i := i + 1
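The loop above can be sketched in Python. This is a minimal sketch, not the recitation's code: the function name, fixed learning rate, gradient-norm stopping tolerance, and iteration cap are all illustrative assumptions, applied to the quadratic example used later in the recitation.

```python
import numpy as np

def gradient_descent(grad, theta0, lr=0.1, tol=1e-8, max_iter=1000):
    """Minimize J via fixed-step gradient descent, given its gradient."""
    theta = theta0
    for _ in range(max_iter):
        g = grad(theta)
        if np.abs(g) < tol:      # termination condition: gradient nearly zero
            break
        theta = theta - lr * g   # step in the direction of the negative gradient
    return theta

# Example: J(theta) = theta^2 - 2*theta + 1 has gradient 2*theta - 2
theta_star = gradient_descent(lambda t: 2 * t - 2, theta0=0.0)
```

With this learning rate the iterates satisfy θ ← 0.8θ + 0.2, so they contract geometrically toward the minimizer θ∗ = 1.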

Recall the mathematical definition of the derivative:
∂f(θ)/∂θ = lim(ε→0) [f(θ + ε) − f(θ − ε)] / (2ε)
Approximate the gradient by setting ε to a small constant, say ε = 10⁻⁴, and check that it is close to your computed value within some tolerance level.
Now let's extend this method to vector input θ ∈ Rd. Say we want to verify our gradient at dimension i, (∇f(θ))ᵢ. We can make use of the one-hot vector eᵢ, in which all dimensions except the ith are 0 and the ith dimension is 1: eᵢ = [0, 0, …, 1, …, 0]ᵀ
The gradient at the ith dimension can then be approximated as
(∇f(θ))ᵢ ≈ [f(θ + εeᵢ) − f(θ − εeᵢ)] / (2ε)
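This gradient check can be sketched as follows; the function name and the quadratic test function are assumptions for illustration.

```python
import numpy as np

def numerical_gradient(f, theta, eps=1e-4):
    """Central-difference approximation of the gradient of f at theta."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e_i = np.zeros_like(theta)
        e_i[i] = 1.0  # one-hot vector along dimension i
        grad[i] = (f(theta + eps * e_i) - f(theta - eps * e_i)) / (2 * eps)
    return grad

# Check an analytic gradient: f(theta) = ||theta||^2 has gradient 2*theta
theta = np.array([1.0, -2.0, 3.0])
approx = numerical_gradient(lambda t: np.dot(t, t), theta)
analytic = 2 * theta
```

For a quadratic, the central difference is exact up to floating-point error, so the two gradients agree to high precision.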

How to initialize θ0?
sample from some distribution
compose θ0 using some heuristics (Glorot et al., He et al., PyTorch initializations)
How to choose termination conditions?
run for a fixed number of iterations
the value of J(θ) stabilizes
θi converges
What is a good learning rate?

Learning Rate
Application
Suppose we would like to find θ∗ ∈ R that minimizes f(θ) = θ² − 2θ + 1. The gradient (in this case, the derivative) is ∇f(θ) = 2θ − 2. We can easily see that θ∗ = arg minθ f(θ) = 1.

Learning Rate
We applied gradient descent for 1000 iterations on f (θ) = θ2 − 2θ + 1 with varying learning rate ε ∈ {1, 0.1, 0.01, 0.001, 0.0001}.
When the learning rate is too large (ε = 1), f (θ) does not decrease through iterations. The value of θi at each iteration significantly fluctuates.
When the learning rate is too small (ε = 0.0001), f (θ) decreases very slowly.

Instead of using a fixed learning rate through all iterations, we can adjust our learning rate in each iteration using a simple algorithm. At each iteration i:

θ̃ := θi−1 − εi−1 ∇f(θi−1)
δ := f(θi−1) − f(θ̃)
if δ ≥ threshold:
    we achieved a satisfactory reduction in f(θ), so accept the step: θi := θ̃
    maybe we can consider increasing the learning rate for the next iteration: εi := 2εi−1
else:
    the reduction is unsatisfactory, so reject the step: θi := θi−1
    the learning rate is too large, so we reduce it: εi := (1/2)εi−1
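The double-or-halve heuristic above can be sketched in Python. The function name, default threshold, and iteration cap are assumptions; the loop itself follows the accept/reject rule from the slide.

```python
def adaptive_gradient_descent(f, grad, theta0, lr=1.0, threshold=1e-12, max_iter=1000):
    """Gradient descent that doubles the step on a satisfactory reduction
    and halves it (rejecting the step) otherwise."""
    theta, eps = theta0, lr
    for _ in range(max_iter):
        theta_try = theta - eps * grad(theta)
        delta = f(theta) - f(theta_try)
        if delta >= threshold:   # satisfactory reduction: accept, grow the step
            theta = theta_try
            eps *= 2.0
        else:                    # unsatisfactory: reject, shrink the step
            eps *= 0.5
    return theta

# The recitation's example: f(theta) = theta^2 - 2*theta + 1, minimized at 1
theta_star = adaptive_gradient_descent(
    f=lambda t: t**2 - 2*t + 1, grad=lambda t: 2*t - 2, theta0=0.0)
```

On this example the first step (ε = 1) overshoots and is rejected; the halved step then lands exactly on the minimizer.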

How to decide a proper threshold for f (θi−1) − f (θ ̃)?
You can find more details at this link.
Armijo rule
If learning rate ε satisfies
f(θi−1) − f(θ̃) ≥ (1/2) ε ‖∇f(θi−1)‖²  (1)
then f(θ) is guaranteed to converge to a (local) minimum under certain technical assumptions.

Running gradient descent on the entire training dataset can be expensive. Motivates SGD.
SGD Algorithm
Initialize weights θ0 := [initial condition]
Repeat:
    Randomly select a data point (xi, yi)
    θi+1 := θi − ε ∇θ l(fθ(xi), yi)
    Check for stopping criteria

SGD can be noisy. Can we get a better estimate of the gradient?
Minibatch Gradient Descent Algorithm (batch size = n)
Initialize weights θ0 := [initial condition]
Repeat:
    Randomly select n data points {(xi, yi)}ᵢ₌₁ⁿ
    θi+1 := θi − ε (1/n) Σᵢ₌₁ⁿ ∇θ l(fθ(xi), yi)
    Check for stopping criteria
Larger n ⇒ better estimate of the gradient, but slower
Smaller n ⇒ worse estimate of the gradient, but faster
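A minimal NumPy sketch of the minibatch update for the squared-error linear-regression loss; the function name, synthetic data, and hyperparameters are assumptions for illustration.

```python
import numpy as np

def minibatch_gd(X, y, batch_size=32, lr=0.01, n_iters=2000, seed=0):
    """Minibatch gradient descent for least-squares linear regression."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        idx = rng.choice(len(X), size=batch_size, replace=False)
        Xb, yb = X[idx], y[idx]
        # gradient of (1/b) * sum_i (theta^T x_i - y_i)^2 over the batch
        grad = (2.0 / batch_size) * Xb.T @ (Xb @ theta - yb)
        theta -= lr * grad
    return theta

# Noiseless synthetic data, so the minimizer is the generating parameter
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta
theta = minibatch_gd(X, y)
```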

Applications
Problem Setup
Data: D = {(xi, yi)}ᵢ₌₁ⁿ where xi ∈ Rd and yi ∈ R
Hypothesis Space: F = {f : Rd → R | f(x) = θᵀx, θ ∈ Rd}
Action: Prediction on an input x, ŷ = θᵀx
Loss: l(y, ŷ) = (ŷ − y)²
Hence, the cost function is J(θ) = (1/n) Σᵢ₌₁ⁿ (θᵀxi − yi)²
Goal: Find the optimal θ, θ∗ = arg minθ J(θ)

Finding θ∗
Applications
Approach 1: Closed-Form Solution (HW1)
J(θ) = (1/n) Σᵢ₌₁ⁿ (θᵀxi − yi)² = (1/n) ‖Xθ − y‖²
To minimize J(θ), we take the gradient and set it to zero:
∇J(θ) ∝ XᵀXθ − Xᵀy = 0
θ̂ = (XᵀX)⁻¹ Xᵀy
Pros: one-shot algorithm for a direct solution
Cons: doesn't scale; invertibility constraints
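The normal-equations solution can be sketched in a few lines; the synthetic data here are assumed for illustration.

```python
import numpy as np

# Noiseless synthetic data: y = X @ true_theta
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta

# Normal equations: theta_hat = (X^T X)^{-1} X^T y.
# solve() is preferred over forming an explicit inverse for numerical stability.
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

If XᵀX is singular or ill-conditioned (the "invertibility constraints" above), `np.linalg.lstsq` is a more robust choice.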

Finding θ∗
Applications
Approach 2: Iterative Gradient Descent
Initialize θ0
While not [terminating condition]:
    ∇θJ(θi−1) = [∂J(θi−1)/∂θ1, ∂J(θi−1)/∂θ2, …, ∂J(θi−1)/∂θd]ᵀ
    θi := θi−1 − εi ∇θJ(θi−1)
Return θi
Pros: conceptually simple, good chance of convergence
Cons: slow

Finding θ∗
Applications
Approach 3: Stochastic Gradient Descent
Initialize θ0
While not [terminating condition]:
    Select a random training point (xj, yj)
    θi := θi−1 − εi ∇θ J⁽ʲ⁾(θi−1), where J⁽ʲ⁾(θ) = (θᵀxj − yj)²
Pros: fast and memory efficient
Cons: practical challenges, noisy gradient steps
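A sketch of this single-point update for the squared-error loss; the function name, synthetic data, and hyperparameters are assumptions.

```python
import numpy as np

def sgd_linear_regression(X, y, lr=0.01, n_iters=5000, seed=0):
    """SGD for least squares: one randomly chosen point per update."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        j = rng.integers(len(X))
        # gradient of J^(j)(theta) = (theta^T x_j - y_j)^2
        grad = 2.0 * (X[j] @ theta - y[j]) * X[j]
        theta -= lr * grad
    return theta

# Noiseless synthetic data, so SGD converges to the generating parameter
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta
theta = sgd_linear_regression(X, y)
```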

Applications
Problem Setup
Data: D = {(xi, yi)}ᵢ₌₁ⁿ where xi ∈ Rd and yi ∈ {0, 1}
Hypothesis Space: (slightly convoluted)
H = {h : Rd → (0, 1) | hθ(x) = g(θᵀx), θ ∈ Rd, g(z) = 1/(1 + e⁻ᶻ)}
Action: Prediction on an input x: ŷ = hθ(x) = g(θᵀx) = 1/(1 + e^(−θᵀx))
Interpret this as a probability: p(y = 1 | x; θ) = hθ(x) and p(y = 0 | x; θ) = 1 − hθ(x)
More succinctly, p(y | x; θ) = (hθ(x))ʸ (1 − hθ(x))¹⁻ʸ
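The hypothesis and its probabilistic interpretation can be sketched directly; the function names and example values are assumptions.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^{-z}), mapping R to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(theta, x):
    """p(y = 1 | x; theta) = h_theta(x) = g(theta^T x)."""
    return sigmoid(theta @ x)

theta = np.array([1.0, -1.0])
x = np.array([2.0, 1.0])
p = predict_proba(theta, x)  # theta^T x = 1, so p = g(1)
```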

Applications
Goal: Minimize the logistic loss over all examples:
l(hθ(x), y) = −y log hθ(x) − (1 − y) log(1 − hθ(x))
Assuming your data is generated i.i.d.,
the likelihood is L(θ) = P(y1, …, yn | x1, …, xn; θ) = Πᵢ₌₁ⁿ p(yi | xi; θ)
Maximizing the log likelihood,
l(θ) = log L(θ) = Σᵢ₌₁ⁿ [yi log hθ(xi) + (1 − yi) log(1 − hθ(xi))]
is equivalent to our goal of loss minimization.

Applications
No analytical one-shot solution
Find optimum using gradient based approaches
The gradient works out to be quite tractable


Applications
For a single example (x, y):
l(θ) = y log hθ(x) + (1 − y) log(1 − hθ(x))
∂l(θ)/∂θj = [∂l(θ)/∂g(θᵀx)] · [∂g(θᵀx)/∂θj]
= [y/g(θᵀx) − (1 − y)/(1 − g(θᵀx))] · ∂g(θᵀx)/∂θj
= [y/g(θᵀx) − (1 − y)/(1 − g(θᵀx))] · [g(θᵀx)(1 − g(θᵀx))] · ∂(θᵀx)/∂θj
= [y(1 − g(θᵀx)) − (1 − y)g(θᵀx)] xj
= (y − hθ(x)) xj
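As a sanity check (not from the slides), the closed-form gradient (y − hθ(x))x can be verified against a central-difference approximation, exactly as in the gradient-checking section earlier; the example values are assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(theta, x, y):
    """l(theta) = y log h(x) + (1 - y) log(1 - h(x)) for one example."""
    h = sigmoid(theta @ x)
    return y * np.log(h) + (1 - y) * np.log(1 - h)

def analytic_grad(theta, x, y):
    """The slide's result: grad_theta l(theta) = (y - h_theta(x)) x."""
    return (y - sigmoid(theta @ x)) * x

theta = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 3.0, -0.5])
y = 1.0

# Central-difference approximation along each one-hot direction
eps = 1e-6
num = np.array([
    (log_likelihood(theta + eps * e, x, y)
     - log_likelihood(theta - eps * e, x, y)) / (2 * eps)
    for e in np.eye(3)
])
```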