# Machine Learning 10-601

Tom M. Mitchell

Machine Learning Department Carnegie Mellon University

February 2, 2015

Today:

• Logistic regression

• Generative/Discriminative classifiers

Readings: (see class website) Required:

• Mitchell: “Naïve Bayes and Logistic Regression”

Optional

• Ng & Jordan

Announcements

• HW3 due Wednesday Feb 4

• HW4 will be handed out next Monday Feb 9

• new reading available:

– Estimating Probabilities: MLE and MAP (Mitchell)

– see Lecture tab of class website

• required reading for today:

– Naïve Bayes and Logistic Regression (Mitchell)

Gaussian Naïve Bayes – Big Picture

Example: Y = PlayBasketball (boolean), X1 = Height, X2 = MLgrade; assume P(Y=1) = 0.5.

GNB predicts by applying Bayes rule: P(Y=1 | X1, X2) ∝ P(Y=1) P(X1 | Y=1) P(X2 | Y=1), with each P(Xi | Y=yk) a learned Gaussian; see the sketch below.
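
A minimal sketch of this big picture, assuming made-up class-conditional means and variances (none of these numbers come from the lecture):

```python
# Gaussian Naive Bayes "big picture" for the PlayBasketball example.
# All parameter values below are hypothetical, chosen only for illustration.
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def gnb_posterior(height, mlgrade):
    """P(Y=1 | X1=height, X2=mlgrade) via Bayes rule, assuming P(Y=1) = 0.5
    and conditionally independent Gaussian features."""
    prior_y1 = 0.5
    h1, g1 = (185.0, 8.0), (80.0, 10.0)   # (mu, sigma) of Height, MLgrade given Y=1
    h0, g0 = (172.0, 8.0), (88.0, 10.0)   # (mu, sigma) of Height, MLgrade given Y=0
    like_y1 = gaussian_pdf(height, *h1) * gaussian_pdf(mlgrade, *g1)
    like_y0 = gaussian_pdf(height, *h0) * gaussian_pdf(mlgrade, *g0)
    num = prior_y1 * like_y1
    return num / (num + (1.0 - prior_y1) * like_y0)

print(gnb_posterior(190.0, 75.0))   # tall student, middling ML grade -> posterior near 1
```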

Logistic Regression

Idea:

• Naïve Bayes allows computing P(Y|X) by learning P(Y) and P(X|Y)

• Why not learn P(Y|X) directly?

• Consider learning f: X → Y, where
• X is a vector of real-valued features, < X1 ... Xn >
• Y is boolean
• assume all Xi are conditionally independent given Y
• model P(Xi | Y = yk) as Gaussian N(μik, σi)
• model P(Y) as Bernoulli(π)

• What does that imply about the form of P(Y|X)?

Derive form for P(Y|X) for Gaussian P(Xi|Y=yk), assuming σik = σi

P(Y=1|X) = P(Y=1) P(X|Y=1) / [ P(Y=1) P(X|Y=1) + P(Y=0) P(X|Y=0) ]

         = 1 / ( 1 + exp( ln[(1-π)/π] + Σi ln [ P(Xi|Y=0) / P(Xi|Y=1) ] ) )

Because the Gaussians share σi, each log-ratio is linear in Xi:

ln [ P(Xi|Y=0) / P(Xi|Y=1) ] = ((μi0 - μi1)/σi²) Xi + (μi1² - μi0²)/(2σi²)

Very convenient! Collecting terms, with wi = (μi1 - μi0)/σi² and w0 = ln[π/(1-π)] + Σi (μi0² - μi1²)/(2σi²):

P(Y=1|X=<X1 ... Xn>) = 1 / (1 + exp(-(w0 + Σi wi Xi)))

implies

P(Y=0|X) = exp(-(w0 + Σi wi Xi)) / (1 + exp(-(w0 + Σi wi Xi)))

implies

P(Y=1|X) / P(Y=0|X) = exp(w0 + Σi wi Xi)

implies

ln [ P(Y=1|X) / P(Y=0|X) ] = w0 + Σi wi Xi

linear classification rule! (predict Y=1 exactly when w0 + Σi wi Xi > 0)
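
The derivation can be checked numerically. Here is a small sketch, with randomly drawn, purely illustrative parameters, verifying that the GNB posterior computed by Bayes rule matches the logistic form with the weights given above:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 3
pi = 0.5                                  # P(Y=1); any value in (0,1) works
mu0, mu1 = rng.normal(size=n), rng.normal(size=n)
sigma = rng.uniform(0.5, 2.0, size=n)     # shared across classes: sigma_ik = sigma_i

def gnb_posterior(x):
    """P(Y=1|X=x) computed directly from the generative model via Bayes rule."""
    def log_lik(mu):
        return np.sum(-0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma))
    a = np.log(1 - pi) + log_lik(mu0)     # log [P(Y=0) P(X|Y=0)] up to a shared constant
    b = np.log(pi) + log_lik(mu1)         # log [P(Y=1) P(X|Y=1)] up to the same constant
    return np.exp(b) / (np.exp(a) + np.exp(b))

# Weights implied by the derivation above
w = (mu1 - mu0) / sigma ** 2
w0 = np.log(pi / (1 - pi)) + np.sum((mu0 ** 2 - mu1 ** 2) / (2 * sigma ** 2))

x = rng.normal(size=n)
print(gnb_posterior(x), 1 / (1 + np.exp(-(w0 + w @ x))))   # the two values agree
```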

Logistic function: σ(z) = 1 / (1 + e^-z), the S-shaped curve that maps any real z to (0, 1); logistic regression sets P(Y=1|X) = σ(w0 + Σi wi Xi).

Logistic regression more generally

• Logistic regression when Y not boolean (but still discrete-valued).

• Now y ∈ {y1 … yR} : learn R-1 sets of weights

for k < R:

P(Y=yk|X) = exp(wk0 + Σi wki Xi) / ( 1 + Σ_{j<R} exp(wj0 + Σi wji Xi) )

for k = R (the remaining class has no weights of its own):

P(Y=yR|X) = 1 / ( 1 + Σ_{j<R} exp(wj0 + Σi wji Xi) )
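
A minimal sketch of these R-class posteriors, with the last class serving as the reference class that carries no weights of its own (the array shapes, helper name, and example numbers are assumptions for illustration):

```python
import numpy as np

def multiclass_posteriors(x, W, b):
    """P(Y=yk | x) for k = 1..R, given R-1 weight vectors W (shape (R-1, n))
    and R-1 intercepts b. Class yR is the reference class with no weights."""
    scores = np.exp(b + W @ x)        # exp(w_k0 + sum_i w_ki x_i) for each k < R
    denom = 1.0 + scores.sum()        # the reference class contributes exp(0) = 1
    return np.append(scores, 1.0) / denom

# Example with R = 3 classes and n = 2 features (numbers are made up)
W = np.array([[0.5, -1.0],
              [1.5,  0.2]])
b = np.array([0.1, -0.3])
print(multiclass_posteriors(np.array([1.0, 2.0]), W, b))   # three probabilities summing to 1
```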

• Training data D = { <X^1, Y^1>, … , <X^L, Y^L> }

• Data likelihood = Π_{l=1..L} P(X^l, Y^l | W)

• Data conditional likelihood = Π_{l=1..L} P(Y^l | X^l, W)

Expressing Conditional Log Likelihood

l(W) ≡ ln Π_l P(Y^l | X^l, W) = Σ_l ln P(Y^l | X^l, W)

For boolean Y coded as {0, 1}:

l(W) = Σ_l [ Y^l ln P(Y^l=1|X^l,W) + (1 - Y^l) ln P(Y^l=0|X^l,W) ]

     = Σ_l [ Y^l (w0 + Σi wi Xi^l) - ln(1 + exp(w0 + Σi wi Xi^l)) ]
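
A small sketch of l(W) under these definitions, using the identity above (the function and variable names are made up for illustration):

```python
import numpy as np

def conditional_log_likelihood(w0, w, X, y):
    """l(W) for binary logistic regression.
    X: (L, n) feature matrix, y: (L,) labels in {0,1}, w: (n,) weights."""
    z = w0 + X @ w
    # np.logaddexp(0, z) computes ln(1 + exp(z)) without overflow for large z
    return np.sum(y * z - np.logaddexp(0.0, z))

# Tiny made-up example
X = np.array([[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]])
y = np.array([1.0, 0.0, 1.0])
print(conditional_log_likelihood(-0.1, np.array([0.8, 0.3]), X, y))
```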

Maximizing Conditional Log Likelihood

Good news: l(W) is a concave function of W

Bad news: no closed-form solution to maximize l(W)

Gradient Descent:

Batch gradient: use error E_D[w] computed over the entire training set D

Do until satisfied:

1. Compute the gradient ∇E_D[w]

2. Update the vector of parameters: w ← w - η ∇E_D[w]

Stochastic gradient: use error E_d[w] computed over single examples d ∈ D

Do until satisfied:

1. Choose (with replacement) a random training example d
2. Compute the gradient just for d: ∇E_d[w]
3. Update the vector of parameters: w ← w - η ∇E_d[w]

Stochastic approximates Batch arbitrarily closely as η → 0
Stochastic can be much faster when D is very large
Intermediate approach: use error over subsets (mini-batches) of D
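
A minimal sketch contrasting the two update schemes on a simple squared-error objective E_D[w] for a linear model (the data set, learning rate, and step count are arbitrary illustrative choices, not values from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                   # made-up training set D
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=200)

eta = 0.05                                      # learning rate
w_batch = np.zeros(3)
w_stoch = np.zeros(3)

for step in range(2000):
    # Batch: gradient of the mean squared error over all of D
    grad_D = -X.T @ (y - X @ w_batch) / len(X)
    w_batch -= eta * grad_D

    # Stochastic: gradient of the squared error on one randomly chosen example
    i = rng.integers(len(X))
    grad_i = -(y[i] - X[i] @ w_stoch) * X[i]
    w_stoch -= eta * grad_i

print(w_batch, w_stoch)                         # both end up near true_w
```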

Maximize Conditional Log Likelihood: Gradient Ascent

∂l(W)/∂wi = Σ_l Xi^l ( Y^l - P(Y^l = 1 | X^l, W) )

Gradient ascent algorithm: iterate until change < ε

For all i, repeat:

wi ← wi + η Σ_l Xi^l ( Y^l - P(Y^l = 1 | X^l, W) )

where P(Y^l = 1 | X^l, W) is evaluated with the current weights and η is a small step size.
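
A minimal sketch of this gradient ascent rule for binary logistic regression; it is not the course's reference implementation, and the stopping test and the scaling by L are implementation choices:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, eta=0.1, eps=1e-6, max_iters=10000):
    """Gradient ascent on the conditional log likelihood.
    X: (L, n) features, y: (L,) labels in {0,1}. Returns (w0, w)."""
    L, n = X.shape
    w0, w = 0.0, np.zeros(n)
    for _ in range(max_iters):
        p = sigmoid(w0 + X @ w)                 # P(Y=1 | X^l, W) for every example
        grad_w0 = np.sum(y - p)                 # intercept acts like a weight on X_0 = 1
        grad_w = X.T @ (y - p)
        w0 += eta * grad_w0 / L                 # dividing by L just rescales eta
        w += eta * grad_w / L
        if max(abs(grad_w0), np.max(np.abs(grad_w))) / L < eps:
            break                               # updates have become negligibly small
    return w0, w

# Usage on made-up data
rng = np.random.default_rng(2)
X = rng.normal(size=(300, 2))
y = (X[:, 0] - 0.5 * X[:, 1] + 0.2 * rng.normal(size=300) > 0).astype(float)
print(train_logistic_regression(X, y))
```
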
That’s all for M(C)LE. How about MAP?

• One common approach is to define priors on W
– Normal distribution, zero mean, identity covariance
• Helps avoid very large weights and overfitting
• MAP estimate:

W ← argmax_W ln [ P(W) Π_l P(Y^l | X^l, W) ]

• let’s assume Gaussian prior: W ~ N(0, σI)
MLE vs MAP

• Maximum conditional likelihood estimate

W_MCLE = argmax_W Σ_l ln P(Y^l | X^l, W)

• Maximum a posteriori estimate with prior W ~ N(0, σI)

W_MAP = argmax_W [ ln P(W) + Σ_l ln P(Y^l | X^l, W) ]
MAP estimates and Regularization

• Maximum a posteriori estimate with prior W ~ N(0, σI)

W_MAP = argmax_W [ Σ_l ln P(Y^l | X^l, W) - c Σi wi² ]

where the - c Σi wi² term comes from ln P(W) (c is a constant determined by σ) and is called a “regularization” term

• helps reduce overfitting
• keeps weights nearer to zero (if P(W) is a zero-mean Gaussian prior), or whatever the prior suggests
• used very frequently in Logistic Regression
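
A minimal sketch of how the regularization term changes a single gradient ascent step: the likelihood gradient from before gains a term that shrinks each weight toward zero (`lam` is an assumed name standing in for the constant determined by the prior width σ):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def map_gradient_step(w0, w, X, y, eta=0.1, lam=0.01):
    """One gradient ascent step on  sum_l ln P(y^l | x^l, W) - lam * sum_i w_i^2.
    lam is a stand-in for the regularization constant set by the prior."""
    p = sigmoid(w0 + X @ w)                      # P(Y=1 | X, W) for every example
    new_w0 = w0 + eta * np.sum(y - p)            # the intercept is typically left unregularized
    new_w = w + eta * (X.T @ (y - p) - 2.0 * lam * w)
    return new_w0, new_w
```
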
The Bottom Line
• Consider learning f: X → Y, where
• X is a vector of real-valued features, < X1 ... Xn >
• Y is boolean
• assume all Xi are conditionally independent given Y
• model P(Xi | Y = yk) as Gaussian N(μik, σi)
• model P(Y) as Bernoulli(π)

• Then P(Y|X) takes the logistic form derived above,

P(Y=1|X) = 1 / (1 + exp(-(w0 + Σi wi Xi)))

and we can directly estimate W

• Furthermore, the same holds if the Xi are boolean
– try proving that to yourself

Generative vs. Discriminative Classifiers

Training classifiers involves estimating f: X → Y, or P(Y|X)

Generative classifiers (e.g., Naïve Bayes)

• Assume some functional form for P(X|Y), P(Y)

• Estimate parameters of P(X|Y), P(Y) directly from training data

• Use Bayes rule to calculate P(Y | X = xi)

Discriminative classifiers (e.g., Logistic regression)

• Assume some functional form for P(Y|X)

• Estimate parameters of P(Y|X) directly from training data

Use Naïve Bayes or Logistic Regression?

Consider

• Restrictiveness of modeling assumptions

• Rate of convergence (in amount of training data) toward asymptotic hypothesis

Naïve Bayes vs Logistic Regression

Consider Y boolean, Xi continuous, X = < X1 ... Xn >

Number of parameters:
• NB: 4n + 1 (for each Xi: μi0, μi1, σi0², σi1², plus the prior parameter π)
• LR: n + 1 (w1 … wn plus w0)

Estimation method:
• NB parameter estimates are uncoupled
• LR parameter estimates are coupled


G.Naïve Bayes vs. Logistic Regression

[Ng & Jordan, 2002]

Recall the two assumptions used in deriving the form of LR from GNBayes:

1. Xi conditionally independent of Xk given Y
2. P(Xi | Y = yk) = N(μik, σi), not N(μik, σik)

Consider three learning methods:

• GNB (assumption 1 only) – decision surface can be non-linear
• GNB2 (assumptions 1 and 2) – decision surface linear
• LR – decision surface linear, trained without assumption 1

Which method works better if we have infinite training data, and…

• Both (1) and (2) are satisfied: LR = GNB2 = GNB
• (1) is satisfied, but not (2): GNB > GNB2, GNB > LR, LR > GNB2
• Neither (1) nor (2) is satisfied: GNB > GNB2, LR > GNB2, LR > GNB

So, GNB requires n = O(log d) training examples to converge to its asymptotic classifier, but LR requires n = O(d) examples, where d is the number of features.

Some experiments from UCI data sets

[Ng & Jordan, 2002]

Naïve Bayes vs. Logistic Regression

The bottom line:

GNB2 and LR both use linear decision surfaces, GNB need not

Given infinite data, LR is better than or equal to GNB2, because its training procedure does not make assumptions 1 or 2 (though our derivation of the form of P(Y|X) did).

But GNB2 converges more quickly to its perhaps-less-accurate asymptotic error

And GNB is both more biased than LR (it makes assumption 1) and less biased (it does not force assumption 2, so its decision surface need not be linear), so either might outperform the other.

What you should know:

• Logistic regression

– Functional form follows from Naïve Bayes assumptions
• For Gaussian Naïve Bayes assuming variance σi,k = σi

• For discrete-valued Naïve Bayes too

– But training procedure picks parameters without making conditional independence assumption

– MLE training: pick W to maximize P(Y | X, W)

– MAP training: pick W to maximize P(W | X, Y)
• ‘regularization’

• helps reduce overfitting

• Gradient ascent/descent

– General approach when closed-form solutions unavailable

• Generative vs. Discriminative classifiers
– Bias vs. variance tradeoff

extra slides

What is the minimum possible error?

Best case:

• conditional independence assumption is satisfied

• we know P(Y), P(X|Y) perfectly (e.g., infinite training data)

Questions to think about:

• Can you use Naïve Bayes for a combination of discrete and real-valued Xi?

• How can we easily model the assumption that just 2 of the n attributes are dependent?

• What does the decision surface of a Naïve Bayes classifier look like?

• How would you select a subset of Xi’s?