# CS代考 COMP90049. – cscodehelp代写

Lecture 5: Introduction to Optimization
Introduction to Machine Learning Semester 1, 2022
Copyright @ University of Melbourne 2022. All rights reserved. No part of the publication may be reproduced in any form by print, photoprint, microfilm or any other means without written permission from the author.

Last time… Probability
• estimate the (conditional, joint) probability of observations • Bayes rule
• Marginalization
• Probabilistic models
• Maximum likelihood estimation (taster) • Maximum aposteriori estimation (taster)

Last time… Probability
• estimate the (conditional, joint) probability of observations • Bayes rule
• Marginalization
• Probabilistic models
• Maximum likelihood estimation (taster) • Maximum aposteriori estimation (taster)
Today… Optimization
• Curves, minima
• Recipe for numerical optimization
• Maximum likelihood of the Binomial (from scratch!)

Optimization

Introduction I
We are all here to learn about Machine Learning. • What is learning?
• It probably has something to do with change or mastering or optimizing performance on a specific task
• Machine learning typically involves to build models (like seen last time), and learning boils down to finding model parameters that optimize some measure of performance
But, how do we know what is optimal?

Finding Optimal Points I
Finding the parameters that optimize a target
Ex1: Estimate the study time which leads to the best grade in COMP90049.
Ex2: Find the shoe price which leads to maximum profit of our shoe shop.

Finding Optimal Points I
Finding the parameters that optimize a target
Ex1: Estimate the study time which leads to the best grade in COMP90049.
Ex2: Find the shoe price which leads to maximum profit of our shoe shop.

Finding Optimal Points I
Finding the parameters that optimize a target
Ex1: Estimate the study time which leads to the best grade in COMP90049.
Ex2: Find the shoe price which leads to maximum profit of our shoe shop.
Ex3: Predicting housing prices from a weighted combination of house age and house location
Ex4: Find the parameters θ of a spam classifier which lead to the lowest error
Ex5: Find the parameters θ of a spam classifier which lead to the highest data log likelihood

Finding Optimal Points I
Finding the parameters that optimize a target
Ex1: Estimate the study time which leads to the best grade in COMP90049.
Ex2: Find the shoe price which leads to maximum profit of our shoe shop.
Ex3: Predicting housing prices from a weighted combination of house age and house location
Ex4: Find the parameters θ of a spam classifier which lead to the lowest error
Ex5: Find the parameters θ of a spam classifier which lead to the highest data log likelihood

Objective functions
Find parameter values θ that maximize (or minimize) the value of a function f (θ)
• we want to find the extreme points of the objective function. Depending on our target, this could be
• …the maximum
E.g., the maximum profit of our shoe shop
E.g., the largest possible (log) likelihood of the data
θˆ = argmax f (θ) θ
• …or the minimum (in which case we often call f a loss function) E.g., the smallest possible classification error
θˆ = argmin f (θ) θ

Finding extreme points of a function
• At its extreme point, f (θ) is ‘flat’: its slope is equal to zero.
• We can measure the slope of a function at any point through its first
derivative at that point
• The derivative measures the change of the output f (θ) given a change in
the input θ
• We write the derivative of f with respect to θ as ∂f ∂θ

Finding extreme points of a function
• At its extreme point, f (θ) is ‘flat’: its slope is equal to zero.
• We can measure the slope of a function at any point through its first
derivative at that point
• The derivative measures the change of the output f (θ) given a change in
the input θ
• We write the derivative of f with respect to θ as ∂f
In order to find the parameters that maximize / minimize an objective function, we find those inputs at which the derivative of the function evaluates to zero.
That’s it!

Finding a Minimum / Maximum
• For our function, with a single 1-dimensional parameter θ f(θ) = θ2
Take the derivative
We want to find the point where this derivative is zero, so
and solve for θ
2θ = 0 θ=0

Finding a Minimum / Maximum
• For our function, with a single 1-dimensional parameter θ f(θ) = θ2
Take the derivative
We want to find the point where this derivative is zero, so
and solve for θ
2θ = 0 θ=0
The global minimum of f (θ) = θ2 occurs at the point where θ=0.

Recipe for finding Minima / Maxima
1. Define your function of interest f (θ) (e.g., data log likelihood) 2. Compute its first derivative with respect to its input θ
3. Set the derivative equal to zero
4. Solve for θ

Recipe for finding Minima / Maxima
1. Define your function of interest f (θ) (e.g., data log likelihood) 2. Compute its first derivative with respect to its input θ
3. Set the derivative equal to zero
4. Solve for θ
Let’s do this for a more interesting problem. Recall our binomial spam model from the last lecture?

Maximum Likelihood Optimization of the Binomial Spam Model
1. Problem setup / identifying the function of interest
• Consider a data set of emails, where each email is an observation x which is labeled either as spam or not spam
• We have N observations, each with 2 possible outcomes. The data consequently follows a binomial distribution and the data likelihood is
L(θ) = p(X;N,θ) = N! θx(1−θ)N−x x!(N − x)!
• So the parameter θ = P(spam)

Maximum Likelihood Optimization of the Binomial Spam Model
1. Problem setup / identifying the function of interest
• Consider a data set of emails, where each email is an observation x which is labeled either as spam or not spam
• We have N observations, each with 2 possible outcomes. The data consequently follows a binomial distribution and the data likelihood is
L(θ) = p(X;N,θ) = N! θx(1−θ)N−x x!(N − x)!
• So the parameter θ = P(spam)
• Imagine we have a data set of 100 emails: 20 are spam (and
consequently 80 emails are not spam).
• In the last lecture, we agreed intuitively that
P(spam) = θ = 20/100 = Nx .
• We will now derive the same result mathematically, and show that θ = Nx is the θˆ that maximizes the likelihood of the observed data

Maximum Likelihood Optimization of the Binomial Spam Model
(Log transformation aside)
• Log is a monotonic transformation: The same θ will maximize both p(x, y) and log p(x, y)
• Log values are less extreme (cf. x scale vs y scale) • Products become sums (avoid under/overflow)

Maximum Likelihood Optimization of the Binomial Spam Model
L(θ)=p(X;N,θ)= N! θx(1−θ)N−x ≈θx(1−θ)N−x x!(N − x)!

Maximum Likelihood Optimization of the Binomial Spam Model
L(θ)=p(X;N,θ)= N! θx(1−θ)N−x ≈θx(1−θ)N−x x!(N − x)!

Maximum Likelihood Optimization of the Binomial Spam Model
L(θ)=p(X;N,θ)= N! θx(1−θ)N−x ≈θx(1−θ)N−x x!(N − x)!

Maximum Likelihood Optimization of the Binomial Spam Model
2. Computing its first derivative
L(θ) = p(X;N,θ) = N! θx(1−θ)N−x x!(N − x)!
≈ θx (1 − θ)N−x Move to log space (makes our life easier)
logL(θ) = xlogθ + (N − x)log(1 − θ)

Maximum Likelihood Optimization of the Binomial Spam Model
2. Computing its first derivative
L(θ)=p(X;N,θ)= N! θx(1−θ)N−x ≈θx(1−θ)N−x x!(N − x)!
Move to log space (makes or life easier)
logL(θ) = xlogθ + (N − x)log(1 − θ)

Maximum Likelihood Optimization of the Binomial Spam Model
2. Computing its first derivative
L(θ)=p(X;N,θ)= N! θx(1−θ)N−x ≈θx(1−θ)N−x x!(N − x)!
Move to log space (makes or life easier)
logL(θ) = xlogθ + (N − x)log(1 − θ)
Take the derivative of L wrt the parameters θ
∂L = x − N − x ∂θ θ 1−θ

Maximum Likelihood Optimization of the Binomial Spam Model
3. Set the derivative to zero
4. Solve for θ
0=x−N−x θ 1−θ
θ 1−θ x × (1 − θ) = N − x
1−θ = N −x θx
1θ − 1 = Nx − 1 1θ = Nx
[rearrange] [+1]
[×(1 − θ)] [× 1 ]
Which corresponds to our estimate of x = 20 = 0.2 for our spam N 100
classification problem.

https://pollev.com/iml2022
for a quick quiz on optimization / MLE!

Possible Complications
Can you think of scenarios where this approach breaks down?

Possible Complications
Can you think of scenarios where this approach breaks down?
• Our loss function is not differentiable
• It is mathematically impossible to set the derivative to 0 and solve for the
parameters θ. “No closed-form solution”.
• Our function has multiple ‘extreme points’ where the slope equals zero.
Which one is the correct one?
to be continued…

• What is optimization?
• Objective function / loss function
Next: Naive Bayes

Solution subject to Constraints

Constrained Optimization
Finding the parameters that optimize a target subject to one or more constraints.
• Buy 3 pieces of fruit which lead to the best nutritional value. But we only have a budget of 3\$.
• I want to estimate the parameters of a Categorical distribution to maximize the data log likelihood and I know that the parameters must sum to 1.

Constrained Optimization
It often happens that the parameters we want to learn have to obey constraints
argmin f (θ) θ
subject to g(θ) = 0,
• ideally, we would like to incorporate such constraints and still be able to
follow the general recipe for optimization discussed before
• Lagrangians allow us to do exactly that in the case of equality
constraints (there are also boundary constraints, which we won’t cover)
• we combine our target functions with (sets of) constraints multiplied
through Lagrange multipliers λ
L(θ, λ) = f (θ) − λg(θ)
• proceed as before: derivative, set to zero, solve for θ

Constrained Optimization
• Find an optimal parameter vector θ such that each all θi sum up to a certain constant b.
• Formalize the constraint:
• Set the constraint to zero
0=􏰂θi −b=−b+􏰂θi ii
• set the constraint and write the Lagrangian
gc (θ) = −b + 􏰂 θi i
L(θ,λ) = f(θ) − λgc(θ)
= f (θ) − λ(−b + 􏰂 θi ) i
• proceed as before: derivative, set to zero, solve for θ

. Introduction to Natural Language Processing, Appendix B (up to B.1)
. without Permanent Scarring. https:// people.eecs.berkeley.edu/~klein/papers/lagrange-multipliers.pdf . Sections 1, 2 (up to 2.4), 3.1, 3.5