# CS代考计算机代写 algorithm Machine Learning 10-601

Machine Learning 10-601
Tom M. Mitchell
Machine Learning Department Carnegie Mellon University
February 2, 2015
Today:
• Logistic regression
• Generative/Discriminative classifiers
Readings: (see class website) Required:
• Mitchell: “Naïve Bayes and Logistic Regression”
Optional
• Ng & Jordan

Announcements
• HW3 due Wednesday Feb 4
• HW4 will be handed out next Monday Feb 9
• new reading available:
– Estimating Probabilities: MLE and MAP (Mitchell)
– see Lecture tab of class website
• required reading for today:
– Naïve Bayes and Logistic Regression (Mitchell)

Gaussian Naïve Bayes – Big Picture
Example: Y= PlayBasketball (boolean), X1=Height, X2=MLgrade assume P(Y=1) = 0.5

Logistic Regression
Idea:
• Naïve Bayes allows computing P(Y|X) by learning P(Y) and P(X|Y)
• Why not learn P(Y|X) directly?

• Consider learning f: XàY, where
• X is a vector of real-valued features, < X1 ... Xn > • Y is boolean
• assume all Xi are conditionally independent given Y • model P(Xi | Y = yk) as Gaussian N(μik,σi)
• model P(Y) as Bernoulli (π)
• What does that imply about the form of P(Y|X)?

Derive form for P(Y|X) for Gaussian P(Xi|Y=yk) assuming σik = σi

Very convenient!
implies implies
implies

Very convenient!
implies implies
implies
linear classification rule!

Logistic function

Logistic regression more generally
• Logistic regression when Y not boolean (but still discrete-valued).
• Now y ∈ {y1 … yR} : learn R-1 sets of weights for k to maximize conditional likelihood of training data
where
• Training data D =
• Data likelihood =
• Data conditional likelihood =

Expressing Conditional Log Likelihood

Maximizing Conditional Log Likelihood
Good news: l(W) is concave function of W
Bad news: no closed-form solution to maximize l(W)

Gradient Descent:
Batch gradient: use error over entire training set D Do until satisfied:
1. Compute the gradient
2. Update the vector of parameters:
Stochastic approximates Batch arbitrarily closely as Stochastic can be much faster when D is very large Intermediate approach: use error over subsets of D
Stochastic gradient: use error over single examples Do until satisfied:
1. Choose (with replacement) a random training example 2. Compute the gradient just for :
3. Update the vector of parameters:

Maximize Conditional Log Likelihood: Gradient Ascent

Maximize Conditional Log Likelihood: Gradient Ascent
Gradient ascent algorithm: iterate until change < ε For all i, repeat That’s all for M(C)LE. How about MAP? • One common approach is to define priors on W – Normal distribution, zero mean, identity covariance • Helps avoid very large weights and overfitting • MAP estimate • let’s assume Gaussian prior: W ~ N(0, σ) MLE vs MAP • Maximum conditional likelihood estimate • Maximum a posteriori estimate with prior W~N(0,σI) MAP estimates and Regularization • Maximum a posteriori estimate with prior W~N(0,σI) called a “regularization” term • helps reduce overfitting • keep weights nearer to zero (if P(W) is zero mean Gaussian prior), or whatever the prior suggests • used very frequently in Logistic Regression The Bottom Line • Consider learning f: XàY, where • X is a vector of real-valued features, < X1 ... Xn > • Y is boolean
• assume all Xi are conditionally independent given Y • model P(Xi | Y = yk) as Gaussian N(μik,σi)
• model P(Y) as Bernoulli (π)
• Then P(Y|X) is of this form, and we can directly estimate W
• Furthermore, same holds if the Xi are boolean • trying proving that to yourself

Generative vs. Discriminative Classifiers
Training classifiers involves estimating f: XàY, or P(Y|X)
Generative classifiers (e.g., Naïve Bayes)
• Assume some functional form for P(X|Y), P(X)
• Estimate parameters of P(X|Y), P(X) directly from training data
• Use Bayes rule to calculate P(Y|X= xi)
Discriminative classifiers (e.g., Logistic regression)
• Assume some functional form for P(Y|X)
• Estimate parameters of P(Y|X) directly from training data

Use Naïve Bayes or Logisitic Regression?
Consider
• Restrictiveness of modeling assumptions
• Rate of convergence (in amount of training data) toward asymptotic hypothesis

Naïve Bayes vs Logistic Regression
Consider Y boolean, Xi continuous, X=
Number of parameters: • NB: 4n +1
• LR: n+1
Estimation method:
• NB parameter estimates are uncoupled • LR parameter estimates are coupled

G.Naïve Bayes vs. Logistic Regression
Recall two assumptions deriving form of LR from GNBayes: 1. Xi conditionally independent of Xk given Y
2. P(Xi | Y = yk) = N(μik,σi), ß not N(μik,σik)
Consider three learning methods: • GNB (assumption 1 only)
• GNB2 (assumption 1 and 2)
• LR
Which method works better if we have infinite training data, and… • Both (1) and (2) are satisfied
• Neither (1) nor (2) is satisfied
• (1) is satisfied, but not (2)

G.Naïve Bayes vs. Logistic Regression
[Ng & Jordan, 2002]
Recall two assumptions deriving form of LR from GNBayes: 1. Xi conditionally independent of Xk given Y
2. P(Xi | Y = yk) = N(μik,σi), ß not N(μik,σik)
Consider three learning methods: • GNB (assumption 1 only) • GNB2 (assumption 1 and 2)
• LR
Which method works better if we have infinite training data, and… • Both (1) and (2) are satisfied
• Neither (1) nor (2) is satisfied
• (1) is satisfied, but not (2)

G.Naïve Bayes vs. Logistic Regression
[Ng & Jordan, 2002]
Recall two assumptions deriving form of LR from GNBayes: 1. Xi conditionally independent of Xk given Y
2. P(Xi | Y = yk) = N(μik,σi), ß not N(μik,σik)
Consider three learning methods:
• GNB (assumption 1 only) — decision surface can be non-linear • GNB2 (assumption 1 and 2) – decision surface linear
• LR — decision surface linear, trained without
assumption 1.
Which method works better if we have infinite training data, and…
• Both (1) and (2) are satisfied: LR = GNB2 = GNB
• (1) is satisfied, but not (2) : GNB > GNB2, GNB > LR, LR > GNB2 • Neither (1) nor (2) is satisfied: GNB>GNB2, LR > GNB2, LR>
So, GNB requires n = O(log d) to converge, but LR requires n = O(d)

Some experiments from UCI data sets
[Ng & Jordan, 2002]

Naïve Bayes vs. Logistic Regression
The bottom line:
GNB2 and LR both use linear decision surfaces, GNB need not
Given infinite data, LR is better or equal to GNB2 because training procedure does not make assumptions 1 or 2 (though our derivation of the form of P(Y|X) did).
But GNB2 converges more quickly to its perhaps-less-accurate asymptotic error
And GNB is both more biased (assumption1) and less (no assumption 2) than LR, so either might outperform the other

What you should know:
• Logistic regression
– Functional form follows from Naïve Bayes assumptions • For Gaussian Naïve Bayes assuming variance σi,k = σi
• For discrete-valued Naïve Bayes too
– But training procedure picks parameters without making conditional independence assumption
– MLE training: pick W to maximize P(Y | X, W)
– MAP training: pick W to maximize P(W | X,Y) • ‘regularization’
• helps reduce overfitting
• Gradient ascent/descent
– General approach when closed-form solutions unavailable
• Generative vs. Discriminative classifiers – Bias vs. variance tradeoff

extra slides

What is the minimum possible error?
Best case:
• conditionalindependenceassumptionissatistied
• weknowP(Y),P(X|Y)perfectly(e.g.,infinitetrainingdata)

Questions to think about:
• Can you use Naïve Bayes for a combination of discrete and real-valued Xi?
• How can we easily model the assumption that just 2 of the n attributes as dependent?
• What does the decision surface of a Naïve Bayes classifier look like?
• How would you select a subset of Xi’s?

Posted in Uncategorized