# Machine Learning 10-601

Tom M. Mitchell

Machine Learning Department Carnegie Mellon University

January 28, 2015

Today:

• Naïve Bayes
  – discrete-valued Xi’s
  – Document classification
• Gaussian Naïve Bayes
  – real-valued Xi’s
  – Brain image classification

Readings: Required:

• Mitchell: “Naïve Bayes and Logistic Regression”

(available on class website)

Optional:

• Bishop 1.2.4

• Bishop 4.2

Recently:

• Bayes classifiers to learn P(Y|X)
• MLE and MAP estimates for parameters of P
• Conditional independence
• Naïve Bayes → make Bayesian learning practical

Next:

• Text classification
• Naïve Bayes and continuous variables Xi:
  – Gaussian Naïve Bayes classifier
• Learn P(Y|X) directly
  – Logistic regression, Regularization, Gradient ascent
  – Naïve Bayes or Logistic Regression?
• Generative vs. Discriminative classifiers

Naïve Bayes in a Nutshell

Bayes rule:

$$P(Y = y_k \mid X_1, \dots, X_n) = \frac{P(Y = y_k)\, P(X_1, \dots, X_n \mid Y = y_k)}{\sum_j P(Y = y_j)\, P(X_1, \dots, X_n \mid Y = y_j)}$$

Assuming conditional independence among the Xi’s:

$$P(Y = y_k \mid X_1, \dots, X_n) = \frac{P(Y = y_k) \prod_i P(X_i \mid Y = y_k)}{\sum_j P(Y = y_j) \prod_i P(X_i \mid Y = y_j)}$$

So, the classification rule for Xnew = < X1, ..., Xn > is:

$$Y^{new} \leftarrow \arg\max_{y_k} P(Y = y_k) \prod_i P(X_i^{new} \mid Y = y_k)$$

Example: Live in Sq Hill? P(S|G,D,B)

• S=1 iff live in Squirrel Hill
• G=1 iff shop at SH Giant Eagle
• D=1 iff Drive or Carpool to CMU
• B=1 iff Birthday is before July 1

Parameters to estimate:

P(S=1) :          P(S=0) :
P(D=1 | S=1) :    P(D=0 | S=1) :
P(D=1 | S=0) :    P(D=0 | S=0) :
P(G=1 | S=1) :    P(G=0 | S=1) :
P(G=1 | S=0) :    P(G=0 | S=0) :
P(B=1 | S=1) :    P(B=0 | S=1) :
P(B=1 | S=0) :    P(B=0 | S=0) :

Tom: D=1, G=0, B=0

$$P(S{=}1 \mid D{=}1, G{=}0, B{=}0) = \frac{P(S{=}1)\,P(D{=}1|S{=}1)\,P(G{=}0|S{=}1)\,P(B{=}0|S{=}1)}{P(S{=}1)\,P(D{=}1|S{=}1)\,P(G{=}0|S{=}1)\,P(B{=}0|S{=}1) + P(S{=}0)\,P(D{=}1|S{=}0)\,P(G{=}0|S{=}0)\,P(B{=}0|S{=}0)}$$
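Once the blank probabilities above are filled in, the computation is mechanical. A minimal Python sketch, using made-up placeholder values (the slide intentionally leaves the real ones blank):

```python
# Naive Bayes posterior for the Squirrel Hill example.
# All probability values below are hypothetical placeholders;
# the slide leaves the real ones blank.
p_s1 = 0.30                      # P(S=1), assumed
p_d1_s1, p_d1_s0 = 0.70, 0.40    # P(D=1|S=1), P(D=1|S=0), assumed
p_g1_s1, p_g1_s0 = 0.80, 0.10    # P(G=1|S=1), P(G=1|S=0), assumed
p_b1_s1, p_b1_s0 = 0.50, 0.50    # P(B=1|S=1), P(B=1|S=0), assumed

# Tom: D=1, G=0, B=0
num = p_s1 * p_d1_s1 * (1 - p_g1_s1) * (1 - p_b1_s1)
den = num + (1 - p_s1) * p_d1_s0 * (1 - p_g1_s0) * (1 - p_b1_s0)
print(num / den)  # P(S=1 | D=1, G=0, B=0), about 0.14 with these numbers
```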

Another way to view Naïve Bayes (Boolean Y):

$$\frac{P(Y{=}1 \mid X_1, \dots, X_n)}{P(Y{=}0 \mid X_1, \dots, X_n)} = \frac{P(Y{=}1)}{P(Y{=}0)} \prod_i \frac{P(X_i \mid Y{=}1)}{P(X_i \mid Y{=}0)}$$

Decision rule: is this quantity greater or less than 1?

Naïve Bayes: classifying text documents

• Classify which emails are spam?

• Classify which emails promise an attachment?

How shall we represent text documents for Naïve Bayes?

Learning to classify documents: P(Y|X)

• Y discrete valued, e.g., Spam or not
• X = < X1, X2, ..., Xn >
• Xi is a random variable describing…

Answer 1: Xi is boolean: 1 if word i is in the document, else 0

e.g., Xpleased = 1

Issues?


Answer 2:

• Xi represents the ith word position in document

• X1 = “I”, X2 = “am”, X3 = “pleased”

• and, let’s assume the Xi are iid (independent, identically distributed)

Learning to classify documents: P(Y|X), the “Bag of Words” model

• Y discrete valued, e.g., Spam or not
• X = < X1, X2, ..., Xn >
• Xi are iid random variables. Each represents the word at its position i in the document
• Generating a document according to this distribution = rolling a 50,000-sided die, once for each word position in the document (see the sketch after this list)
• The observed counts for each word follow a ??? distribution
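A minimal sketch of this generative process, using a hypothetical 7-word vocabulary in place of 50,000 words:

```python
import numpy as np

# Toy vocabulary and class-conditional word distribution theta_i = P(word i | Y=yk);
# both are hypothetical stand-ins for a 50,000-word vocabulary.
vocab = ["aardvark", "about", "all", "Africa", "gas", "oil", "Zaire"]
theta = np.array([0.05, 0.25, 0.25, 0.15, 0.10, 0.10, 0.10])

rng = np.random.default_rng(0)
doc_len = 20
# "Roll the die" once per word position:
positions = rng.choice(len(vocab), size=doc_len, p=theta)
# The observed per-word counts follow a multinomial distribution:
counts = np.bincount(positions, minlength=len(vocab))
print(dict(zip(vocab, counts)))
```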

Multinomial Distribution
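For reference, the multinomial pmf: if each of $n$ word positions is drawn independently with word probabilities $\theta_1, \dots, \theta_m$, then the count vector $(c_1, \dots, c_m)$ with $\sum_i c_i = n$ has probability

$$P(c_1, \dots, c_m \mid \theta, n) = \frac{n!}{c_1! \cdots c_m!} \prod_{i=1}^{m} \theta_i^{c_i}$$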

Multinomial Bag of Words

Word counts for an example document:

aardvark 0
about    2
all      2
Africa   1
apple    0
anxious  0
…
gas      1
…
oil      1
…
Zaire    0

MAP estimates for bag of words

MAP estimate for multinomial:
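One standard form, assuming a Dirichlet prior with parameters $\beta_1, \dots, \beta_m$ over the multinomial parameters $\theta$:

$$\hat\theta_i^{MAP} = \frac{c_i + \beta_i - 1}{\sum_{j=1}^{m}(c_j + \beta_j - 1)}$$

where $c_i$ is the observed count of word $i$; choosing every $\beta_i = 2$ gives add-one (Laplace) smoothing.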

What β’s should we choose?

Naïve Bayes Algorithm – discrete Xi

• Train Naïve Bayes (examples):
  for each value yk: estimate $P(Y = y_k)$
  for each value xij of each attribute Xi: estimate $P(X_i = x_{ij} \mid Y = y_k)$
  (for documents: the probability that word xij appears in position i, given Y=yk*)

• Classify (Xnew):

$$Y^{new} \leftarrow \arg\max_{y_k} P(Y = y_k) \prod_i P(X_i^{new} \mid Y = y_k)$$

* Additional assumption: word probabilities are position independent
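A minimal, self-contained sketch of the train/classify steps above, with add-one smoothing for the conditional estimates; the toy data and names are illustrative, not from the lecture:

```python
import math
from collections import Counter

def train_nb(examples, labels, attr_values):
    """examples: list of dicts {attribute: value}; labels: list of classes.
    attr_values: dict {attribute: list of its possible values}.
    MAP estimates with add-one (Laplace) smoothing."""
    n = len(labels)
    class_counts = Counter(labels)
    prior = {yk: c / n for yk, c in class_counts.items()}  # P(Y=yk)
    cond = {}  # cond[(attr, value, yk)] = P(Xi=value | Y=yk)
    for attr, values in attr_values.items():
        for yk in class_counts:
            denom = class_counts[yk] + len(values)  # add-one smoothing
            for v in values:
                num = 1 + sum(1 for x, y in zip(examples, labels)
                              if y == yk and x[attr] == v)
                cond[(attr, v, yk)] = num / denom
    return prior, cond

def classify_nb(x_new, prior, cond):
    """argmax_yk P(Y=yk) * prod_i P(Xi | Y=yk), computed in log space."""
    def log_posterior(yk):
        return math.log(prior[yk]) + sum(
            math.log(cond[(attr, v, yk)]) for attr, v in x_new.items())
    return max(prior, key=log_posterior)

# Hypothetical toy data in the spirit of the Squirrel Hill example:
X = [{"D": 1, "G": 1, "B": 0}, {"D": 0, "G": 0, "B": 1},
     {"D": 1, "G": 1, "B": 1}, {"D": 0, "G": 1, "B": 0}]
y = [1, 0, 1, 0]
prior, cond = train_nb(X, y, {"D": [0, 1], "G": [0, 1], "B": [0, 1]})
print(classify_nb({"D": 1, "G": 0, "B": 0}, prior, cond))
```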

For code and data, see

www.cs.cmu.edu/~tom/mlbook.html

click on “Software and Data”

What if we have continuous Xi?

E.g., image classification: Xi is the real-valued ith pixel

Naïve Bayes requires P(Xi | Y=yk), but Xi is real (continuous)

Common approach: assume P(Xi | Y=yk) follows a Normal (Gaussian) distribution

Gaussian Distribution (also called “Normal”)

$$p(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

p(x) is a probability density function, whose integral (not sum) is 1
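As a one-function sketch, the density can be evaluated directly (illustrative helper, not from the slides):

```python
import math

def gaussian_pdf(x, mu, sigma):
    """N(x; mu, sigma): a density, not a probability -- it can exceed 1."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

print(gaussian_pdf(0.0, 0.0, 1.0))  # ~0.3989, the peak of a standard normal
```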

What if we have continuous Xi? Gaussian Naïve Bayes (GNB): assume

$$P(X_i = x \mid Y = y_k) = \frac{1}{\sigma_{ik}\sqrt{2\pi}}\, e^{-\frac{(x-\mu_{ik})^2}{2\sigma_{ik}^2}}$$

Sometimes assume the variance
• is independent of Y (i.e., σi),
• or independent of Xi (i.e., σk),
• or both (i.e., σ)

Gaussian Naïve Bayes Algorithm – continuous Xi (but still discrete Y)

• Train Naïve Bayes (examples):
  for each value yk: estimate* $P(Y = y_k)$
  for each attribute Xi: estimate the class conditional mean $\mu_{ik}$ and variance $\sigma_{ik}$

• Classify (Xnew):

$$Y^{new} \leftarrow \arg\max_{y_k} P(Y = y_k) \prod_i N(X_i^{new};\, \mu_{ik}, \sigma_{ik})$$

* probabilities must sum to 1, so we need estimate only n−1 parameters…

Estimating Parameters: Y discrete, Xi continuous

Maximum likelihood estimates, where $X_i^j$ is the ith feature of the jth training example, $y_k$ is the kth class, and $\delta(Y^j{=}y_k) = 1$ if $Y^j = y_k$, else 0:

$$\hat\mu_{ik} = \frac{\sum_j X_i^j\, \delta(Y^j{=}y_k)}{\sum_j \delta(Y^j{=}y_k)}$$

$$\hat\sigma_{ik}^2 = \frac{\sum_j \left(X_i^j - \hat\mu_{ik}\right)^2 \delta(Y^j{=}y_k)}{\sum_j \delta(Y^j{=}y_k)}$$
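A minimal numpy sketch of these estimates and the resulting GNB classifier, assuming a feature matrix of shape (n_examples, n_features); all names and data are illustrative:

```python
import numpy as np

def train_gnb(X, y):
    """Per-class priors, means mu_ik, and variances sigma^2_ik (MLE)."""
    classes = np.unique(y)
    priors = np.array([np.mean(y == k) for k in classes])
    mus = np.array([X[y == k].mean(axis=0) for k in classes])
    vars_ = np.array([X[y == k].var(axis=0) for k in classes])  # ddof=0: MLE
    return classes, priors, mus, vars_

def classify_gnb(x_new, classes, priors, mus, vars_):
    """argmax_k [log pi_k + sum_i log N(x_i; mu_ik, sigma_ik)]."""
    log_post = (np.log(priors)
                - 0.5 * np.sum(np.log(2 * np.pi * vars_)
                               + (x_new - mus) ** 2 / vars_, axis=1))
    return classes[np.argmax(log_post)]

# Hypothetical toy data: two Gaussian blobs in 2-D.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(3.0, 1.0, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
model = train_gnb(X, y)
print(classify_gnb(np.array([2.5, 2.5]), *model))  # expect class 1
```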

How many parameters must we estimate for Gaussian Naïve Bayes if Y has k possible values and X = < X1, ..., Xn >?

GNB Example: Classify a person’s cognitive state, based on brain image

• reading a sentence or viewing a picture?
• reading the word describing a “Tool” or “Building”?
• answering the question, or getting confused?

Y is the mental state (reading “house” or “bottle”); Xi are the voxel activities.

[Figure: mean activations over all training examples for Y=“bottle”, i.e., a plot of the μ’s defining P(Xi | Y=“bottle”). Color scale: below average, average, and high fMRI activation.]

Classification task: is person viewing a “tool” or “building”?

[Figure: classification accuracy (0 to 1) for each participant (p1–p12), with the level of statistically significant accuracy (p<0.05) marked.]
Where is information encoded in the brain?
[Figure: accuracies of cubical 27-voxel classifiers centered at each significant voxel; accuracy scale 0.7–0.8.]
Naïve Bayes: What you should know

• Designing classifiers based on Bayes rule
• Conditional independence
  – What it is
  – Why it’s important
• Naïve Bayes assumption and its consequences
  – Which (and how many) parameters must be estimated under different generative models (different forms for P(X|Y))
    • and why this matters
• How to train Naïve Bayes classifiers
  – MLE and MAP estimates
  – with discrete and/or continuous inputs Xi

Questions to think about:

• Can you use Naïve Bayes for a combination of discrete and real-valued Xi?
• How can we easily model just 2 of n attributes as dependent?
• What does the decision surface of a Naïve Bayes classifier look like?
• How would you select a subset of the Xi’s?