Lecture 5: Introduction to Optimization

Introduction to Machine Learning Semester 1, 2022

Copyright @ University of Melbourne 2022. All rights reserved. No part of the publication may be reproduced in any form by print, photoprint, microfilm or any other means without written permission from the author.

Copyright By cscodehelp代写 加微信 cscodehelp

Last time… Probability

• estimate the (conditional, joint) probability of observations • Bayes rule

• Marginalization

• Probabilistic models

• Maximum likelihood estimation (taster) • Maximum aposteriori estimation (taster)

Last time… Probability

• estimate the (conditional, joint) probability of observations • Bayes rule

• Marginalization

• Probabilistic models

• Maximum likelihood estimation (taster) • Maximum aposteriori estimation (taster)

Today… Optimization

• Curves, minima

• Gradients, derivatives

• Recipe for numerical optimization

• Maximum likelihood of the Binomial (from scratch!)

Optimization

Introduction I

We are all here to learn about Machine Learning. • What is learning?

• It probably has something to do with change or mastering or optimizing performance on a specific task

• Machine learning typically involves to build models (like seen last time), and learning boils down to finding model parameters that optimize some measure of performance

But, how do we know what is optimal?

Finding Optimal Points I

Finding the parameters that optimize a target

Ex1: Estimate the study time which leads to the best grade in COMP90049.

Ex2: Find the shoe price which leads to maximum profit of our shoe shop.

Finding Optimal Points I

Finding the parameters that optimize a target

Ex1: Estimate the study time which leads to the best grade in COMP90049.

Ex2: Find the shoe price which leads to maximum profit of our shoe shop.

Finding Optimal Points I

Finding the parameters that optimize a target

Ex1: Estimate the study time which leads to the best grade in COMP90049.

Ex2: Find the shoe price which leads to maximum profit of our shoe shop.

Ex3: Predicting housing prices from a weighted combination of house age and house location

Ex4: Find the parameters θ of a spam classifier which lead to the lowest error

Ex5: Find the parameters θ of a spam classifier which lead to the highest data log likelihood

Finding Optimal Points I

Finding the parameters that optimize a target

Ex1: Estimate the study time which leads to the best grade in COMP90049.

Ex2: Find the shoe price which leads to maximum profit of our shoe shop.

Ex3: Predicting housing prices from a weighted combination of house age and house location

Ex4: Find the parameters θ of a spam classifier which lead to the lowest error

Ex5: Find the parameters θ of a spam classifier which lead to the highest data log likelihood

Objective functions

Find parameter values θ that maximize (or minimize) the value of a function f (θ)

• we want to find the extreme points of the objective function. Depending on our target, this could be

• …the maximum

E.g., the maximum profit of our shoe shop

E.g., the largest possible (log) likelihood of the data

θˆ = argmax f (θ) θ

• …or the minimum (in which case we often call f a loss function) E.g., the smallest possible classification error

θˆ = argmin f (θ) θ

Finding extreme points of a function

• At its extreme point, f (θ) is ‘flat’: its slope is equal to zero.

• We can measure the slope of a function at any point through its first

derivative at that point

• The derivative measures the change of the output f (θ) given a change in

the input θ

• We write the derivative of f with respect to θ as ∂f ∂θ

Finding extreme points of a function

• At its extreme point, f (θ) is ‘flat’: its slope is equal to zero.

• We can measure the slope of a function at any point through its first

derivative at that point

• The derivative measures the change of the output f (θ) given a change in

the input θ

• We write the derivative of f with respect to θ as ∂f

In order to find the parameters that maximize / minimize an objective function, we find those inputs at which the derivative of the function evaluates to zero.

That’s it!

Finding a Minimum / Maximum

• For our function, with a single 1-dimensional parameter θ f(θ) = θ2

Take the derivative

We want to find the point where this derivative is zero, so

and solve for θ

2θ = 0 θ=0

Finding a Minimum / Maximum

• For our function, with a single 1-dimensional parameter θ f(θ) = θ2

Take the derivative

We want to find the point where this derivative is zero, so

and solve for θ

2θ = 0 θ=0

The global minimum of f (θ) = θ2 occurs at the point where θ=0.

Recipe for finding Minima / Maxima

1. Define your function of interest f (θ) (e.g., data log likelihood) 2. Compute its first derivative with respect to its input θ

3. Set the derivative equal to zero

4. Solve for θ

Recipe for finding Minima / Maxima

1. Define your function of interest f (θ) (e.g., data log likelihood) 2. Compute its first derivative with respect to its input θ

3. Set the derivative equal to zero

4. Solve for θ

Let’s do this for a more interesting problem. Recall our binomial spam model from the last lecture?

Maximum Likelihood Optimization of the Binomial Spam Model

1. Problem setup / identifying the function of interest

• Consider a data set of emails, where each email is an observation x which is labeled either as spam or not spam

• We have N observations, each with 2 possible outcomes. The data consequently follows a binomial distribution and the data likelihood is

L(θ) = p(X;N,θ) = N! θx(1−θ)N−x x!(N − x)!

• So the parameter θ = P(spam)

Maximum Likelihood Optimization of the Binomial Spam Model

1. Problem setup / identifying the function of interest

• Consider a data set of emails, where each email is an observation x which is labeled either as spam or not spam

• We have N observations, each with 2 possible outcomes. The data consequently follows a binomial distribution and the data likelihood is

L(θ) = p(X;N,θ) = N! θx(1−θ)N−x x!(N − x)!

• So the parameter θ = P(spam)

• Imagine we have a data set of 100 emails: 20 are spam (and

consequently 80 emails are not spam).

• In the last lecture, we agreed intuitively that

P(spam) = θ = 20/100 = Nx .

• We will now derive the same result mathematically, and show that θ = Nx is the θˆ that maximizes the likelihood of the observed data

Maximum Likelihood Optimization of the Binomial Spam Model

(Log transformation aside)

• Log is a monotonic transformation: The same θ will maximize both p(x, y) and log p(x, y)

• Log values are less extreme (cf. x scale vs y scale) • Products become sums (avoid under/overflow)

Maximum Likelihood Optimization of the Binomial Spam Model

L(θ)=p(X;N,θ)= N! θx(1−θ)N−x ≈θx(1−θ)N−x x!(N − x)!

Maximum Likelihood Optimization of the Binomial Spam Model

L(θ)=p(X;N,θ)= N! θx(1−θ)N−x ≈θx(1−θ)N−x x!(N − x)!

Maximum Likelihood Optimization of the Binomial Spam Model

L(θ)=p(X;N,θ)= N! θx(1−θ)N−x ≈θx(1−θ)N−x x!(N − x)!

Maximum Likelihood Optimization of the Binomial Spam Model

2. Computing its first derivative

L(θ) = p(X;N,θ) = N! θx(1−θ)N−x x!(N − x)!

≈ θx (1 − θ)N−x Move to log space (makes our life easier)

logL(θ) = xlogθ + (N − x)log(1 − θ)

Maximum Likelihood Optimization of the Binomial Spam Model

2. Computing its first derivative

L(θ)=p(X;N,θ)= N! θx(1−θ)N−x ≈θx(1−θ)N−x x!(N − x)!

Move to log space (makes or life easier)

logL(θ) = xlogθ + (N − x)log(1 − θ)

Maximum Likelihood Optimization of the Binomial Spam Model

2. Computing its first derivative

L(θ)=p(X;N,θ)= N! θx(1−θ)N−x ≈θx(1−θ)N−x x!(N − x)!

Move to log space (makes or life easier)

logL(θ) = xlogθ + (N − x)log(1 − θ)

Take the derivative of L wrt the parameters θ

∂L = x − N − x ∂θ θ 1−θ

Maximum Likelihood Optimization of the Binomial Spam Model

3. Set the derivative to zero

4. Solve for θ

0=x−N−x θ 1−θ

θ 1−θ x × (1 − θ) = N − x

1−θ = N −x θx

1θ − 1 = Nx − 1 1θ = Nx

[rearrange] [+1]

[×(1 − θ)] [× 1 ]

Which corresponds to our estimate of x = 20 = 0.2 for our spam N 100

classification problem.

Please go to

https://pollev.com/iml2022

for a quick quiz on optimization / MLE!

Possible Complications

Can you think of scenarios where this approach breaks down?

Possible Complications

Can you think of scenarios where this approach breaks down?

• Our loss function is not differentiable

• It is mathematically impossible to set the derivative to 0 and solve for the

parameters θ. “No closed-form solution”.

• Our function has multiple ‘extreme points’ where the slope equals zero.

Which one is the correct one?

to be continued…

• What is optimization?

• Objective function / loss function

• Gradients, derivatives, and slopes

Next: Naive Bayes

Solution subject to Constraints

Constrained Optimization

Finding the parameters that optimize a target subject to one or more constraints.

• Buy 3 pieces of fruit which lead to the best nutritional value. But we only have a budget of 3$.

• I want to estimate the parameters of a Categorical distribution to maximize the data log likelihood and I know that the parameters must sum to 1.

Constrained Optimization

It often happens that the parameters we want to learn have to obey constraints

argmin f (θ) θ

subject to g(θ) = 0,

• ideally, we would like to incorporate such constraints and still be able to

follow the general recipe for optimization discussed before

• Lagrangians allow us to do exactly that in the case of equality

constraints (there are also boundary constraints, which we won’t cover)

• we combine our target functions with (sets of) constraints multiplied

through Lagrange multipliers λ

L(θ, λ) = f (θ) − λg(θ)

• proceed as before: derivative, set to zero, solve for θ

Constrained Optimization

• Find an optimal parameter vector θ such that each all θi sum up to a certain constant b.

• Formalize the constraint:

• Set the constraint to zero

0=θi −b=−b+θi ii

• set the constraint and write the Lagrangian

gc (θ) = −b + θi i

L(θ,λ) = f(θ) − λgc(θ)

= f (θ) − λ(−b + θi ) i

• proceed as before: derivative, set to zero, solve for θ

. Introduction to Natural Language Processing, Appendix B (up to B.1)

. without Permanent Scarring. https:// people.eecs.berkeley.edu/~klein/papers/lagrange-multipliers.pdf . Sections 1, 2 (up to 2.4), 3.1, 3.5

程序代写 CS代考 加微信: cscodehelp QQ: 2235208643 Email: kyit630461@163.com