
Machine Learning 10-601
Tom M. Mitchell
Machine Learning Department Carnegie Mellon University
February 25, 2015
Today:
• Graphical models
• Bayes Nets:
– Inference
– Learning
– EM
Readings:
• Bishop chapter 8
• Mitchell chapter 6

Midterm
• In class on Monday, March 2
• Closed book
• You may bring an 8.5×11 “cheat sheet” of notes
• Covers all material through today
• Be sure to come on time. We’ll start at 11am sharp

Bayesian Networks Definition
A Bayes network represents the joint probability distribution over a collection of random variables
A Bayes network is a directed acyclic graph and a set of conditional probability distributions (CPD’s)
• Each node denotes a random variable
• Edges denote dependencies
• For each node Xi, its CPD defines P(Xi | Pa(Xi))
• The joint distribution over all variables is defined to be
P(X1, ..., Xn) = Πi P(Xi | Pa(Xi))
Pa(X) = immediate parents of X in the graph
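To make the definition concrete, here is a minimal Python sketch (not from the lecture; the three-node chain, the variable names A, B, C, and the probability values are made up for illustration): each node stores its parents and a CPD table, and the joint probability of a full assignment is just the product of the per-node CPD entries.

# Minimal sketch: a Bayes net as a dict of CPDs; joint = product of P(Xi | Pa(Xi)).
# Network structure and numbers are hypothetical, chosen only to illustrate the definition.
# Each entry: node -> (list of parents, CPD table mapping parent values to P(node=1 | parents)).
cpds = {
    "A": ([], {(): 0.3}),                   # P(A=1) = 0.3
    "B": (["A"], {(0,): 0.2, (1,): 0.9}),   # P(B=1 | A)
    "C": (["B"], {(0,): 0.1, (1,): 0.7}),   # P(C=1 | B)
}

def joint_prob(assignment):
    """P(X1,...,Xn) = prod_i P(Xi | Pa(Xi)) for a full 0/1 assignment."""
    p = 1.0
    for node, (parents, table) in cpds.items():
        p1 = table[tuple(assignment[pa] for pa in parents)]
        p *= p1 if assignment[node] == 1 else 1.0 - p1
    return p

print(joint_prob({"A": 1, "B": 1, "C": 0}))   # 0.3 * 0.9 * (1 - 0.7) = 0.081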

What You Should Know
• Bayes nets are a convenient representation for encoding dependencies / conditional independence
• BN = graph plus parameters of CPD’s
– Defines joint distribution over variables
– Can calculate everything else from that
– Though inference may be intractable
• Reading conditional independence relations from the graph
– Each node is conditionally independent of its non-descendants, given only its parents
– X and Y are conditionally independent given Z if Z D-separates every path connecting X to Y
– Marginal independence : special case where Z={}

Inference in Bayes Nets
• In general, intractable (NP-complete)
• For certain cases, tractable
– Assigning probability to fully observed set of variables
– Or if just one variable unobserved
– Or for singly connected graphs (i.e., no undirected loops)
• Belief propagation
• Sometimes use Monte Carlo methods
– Generate many samples according to the Bayes Net distribution, then count up the results
• Variational methods for tractable approximate solutions

Example
• Bird flu and Allergies both cause Sinus problems
• Sinus problems cause Headaches and runny Nose

Prob. of joint assignment: easy
• Suppose we are interested in a particular joint assignment of all the variables
What is P(f,a,s,h,n)?
P(f,a,s,h,n) = P(f) P(a) P(s|f,a) P(h|s) P(n|s)
let’s use p(a,b) as shorthand for p(A=a, B=b)
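A minimal Python sketch of this computation for the Flu / Allergy / Sinus / Headache / Nose network (the θ values below are made up for illustration, not taken from the lecture): the joint probability of a fully observed assignment is just a product of five CPD lookups.

# Sketch: P(f,a,s,h,n) = P(f) P(a) P(s|f,a) P(h|s) P(n|s), with hypothetical CPD values.
theta_F = 0.05                                # P(F=1)
theta_A = 0.20                                # P(A=1)
theta_S = {(0, 0): 0.05, (0, 1): 0.4,
           (1, 0): 0.6,  (1, 1): 0.9}         # P(S=1 | F, A)
theta_H = {0: 0.1, 1: 0.7}                    # P(H=1 | S)
theta_N = {0: 0.2, 1: 0.8}                    # P(N=1 | S)

def bern(p, x):
    """P(X=x) for a binary variable with P(X=1) = p."""
    return p if x == 1 else 1.0 - p

def joint_prob(f, a, s, h, n):
    return (bern(theta_F, f) * bern(theta_A, a) *
            bern(theta_S[(f, a)], s) *
            bern(theta_H[s], h) * bern(theta_N[s], n))

print(joint_prob(0, 1, 1, 1, 0))   # 0.95 * 0.2 * 0.4 * 0.7 * 0.2 ≈ 0.0106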

Prob. of marginals: not so easy
• How do we calculate P(N=n)?
P(N=n) = Σf Σa Σs Σh P(f,a,s,h,n)
let’s use p(a,b) as shorthand for p(A=a, B=b)
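A brute-force sketch of this summation, with the same made-up CPD values as above: enumerate every assignment of the summed-out variables F, A, S, H and add up the joint probabilities. The cost grows exponentially with the number of summed-out variables, which is why marginals are “not so easy” in general.

from itertools import product

# Brute-force marginal P(N=n): sum the joint over all values of F, A, S, H.
# CPD values are hypothetical, for illustration only.
theta_F, theta_A = 0.05, 0.20
theta_S = {(0, 0): 0.05, (0, 1): 0.4, (1, 0): 0.6, (1, 1): 0.9}
theta_H = {0: 0.1, 1: 0.7}
theta_N = {0: 0.2, 1: 0.8}
bern = lambda p, x: p if x == 1 else 1.0 - p

def joint_prob(f, a, s, h, n):
    return (bern(theta_F, f) * bern(theta_A, a) * bern(theta_S[(f, a)], s) *
            bern(theta_H[s], h) * bern(theta_N[s], n))

def marginal_N(n):
    # 2^4 = 16 terms here; exponential in the number of summed-out variables in general.
    return sum(joint_prob(f, a, s, h, n) for f, a, s, h in product((0, 1), repeat=4))

print(marginal_N(1))                    # P(N=1)
print(marginal_N(0) + marginal_N(1))    # sanity check: ≈ 1.0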

Generating a sample from joint distribution: easy
How can we generate random samples drawn according to P(F,A,S,H,N)?
Hint: random sample of F according to P(F=1) = θF=1 :
• draw a value of r uniformly from [0,1]
• if r < θ then output F=1, else F=0
Solution:
• draw a random value f for F, using its CPD
• then draw values for A, for S|A,F, for H|S, for N|S
Note we can estimate marginals like P(N=n) by generating many samples from the joint distribution, then counting the fraction of samples for which N=n
Similarly for anything else we care about, e.g., P(F=1 | H=1, N=0)
→ weak but general method for estimating any probability term...
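A Python sketch of this ancestral-sampling procedure and of the Monte Carlo estimate it enables (same made-up CPD values as in the sketches above): draw each variable in topological order, conditioning on its already-sampled parents, then estimate a marginal by counting.

import random

# Ancestral sampling for the Flu/Allergy net, then a Monte Carlo estimate of P(N=1).
# CPD values are hypothetical, for illustration only.
theta_F, theta_A = 0.05, 0.20
theta_S = {(0, 0): 0.05, (0, 1): 0.4, (1, 0): 0.6, (1, 1): 0.9}
theta_H = {0: 0.1, 1: 0.7}
theta_N = {0: 0.2, 1: 0.8}

def bern_sample(p):
    """Draw r uniformly from [0,1]; output 1 if r < p, else 0."""
    return 1 if random.random() < p else 0

def sample_joint():
    # Sample parents before children: F, A, then S | F,A, then H | S and N | S.
    f = bern_sample(theta_F)
    a = bern_sample(theta_A)
    s = bern_sample(theta_S[(f, a)])
    h = bern_sample(theta_H[s])
    n = bern_sample(theta_N[s])
    return f, a, s, h, n

samples = [sample_joint() for _ in range(100000)]
print(sum(n for (f, a, s, h, n) in samples) / len(samples))   # estimate of P(N=1)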
Inference in Bayes Nets
• In general, intractable (NP-complete)
• For certain cases, tractable
– Assigning probability to fully observed set of variables
– Or if just one variable unobserved
– Or for singly connected graphs (i.e., no undirected loops)
• Variable elimination
• Belief propagation
• Often use Monte Carlo methods
– e.g., Generate many samples according to the Bayes Net distribution, then count up the results
– Gibbs sampling
• Variational methods for tractable approximate solutions
see Graphical Models course 10-708

Learning of Bayes Nets
• Four categories of learning problems
– Graph structure may be known / unknown
– Variable values may be fully observed / partly unobserved
• Easy case: learn parameters when graph structure is known and data is fully observed
• Interesting case: graph known, data partly observed
• Gruesome case: graph structure unknown, data partly unobserved

Learning CPTs from Fully Observed Data
• Example: Consider learning the parameter θs|ij = P(S=1 | F=i, A=j)
• Max Likelihood Estimate is
θs|ij = Σk δ(S(k)=1, F(k)=i, A(k)=j) / Σk δ(F(k)=i, A(k)=j), where k indexes the training examples
• Remember why?
δ(x) = 1 if x=true, = 0 if x=false
let’s use p(a,b) as shorthand for p(A=a, B=b)

MLE estimate of θs|ij from fully observed data
• Maximum likelihood estimate: choose θ maximizing the log probability of the observed data
• Our case: the log likelihood decomposes into a separate term for each CPD entry, so each θs|ij is just the count ratio above

Estimate θs|ij from partly observed data
• What if F, A, H, N are observed, but not S?
• Can’t calculate MLE
• Let X be all observed variable values (over all examples)
• Let Z be all unobserved variable values
• Can’t calculate MLE: arg maxθ log P(X, Z | θ), because Z is unknown
• WHAT TO DO?
• EM seeks* to estimate:
θ* = arg maxθ EZ|X,θ [ log P(X, Z | θ) ]
* EM guaranteed to find local maximum
• here, observed X={F,A,H,N}, unobserved Z={S}

EM Algorithm - Informally
EM is a general procedure for learning from partly observed data
Given observed variables X, unobserved Z (X={F,A,H,N}, Z={S})
Begin with arbitrary choice for parameters θ
Iterate until convergence:
• E Step: estimate the values of unobserved Z, using θ
• M Step: use observed values plus E-step estimates to derive a better θ
Guaranteed to find local maximum. Each iteration increases the likelihood of the observed data, P(X|θ).

EM Algorithm - Precisely
EM is a general procedure for learning from partly observed data
Given observed variables X, unobserved Z (X={F,A,H,N}, Z={S})
Define Q(θ′ | θ) = EZ|X,θ [ log P(X, Z | θ′) ]
Iterate until convergence:
• E Step: Use X and current θ to calculate P(Z|X,θ)
• M Step: Replace current θ by arg maxθ′ Q(θ′ | θ)
Guaranteed to find local maximum. Each iteration increases the likelihood of the observed data, P(X|θ).
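As a concrete illustration of the two steps just described, here is a small Python sketch of EM for this network when S is never observed. The training data, the initial parameters, and the absence of smoothing are all made-up simplifications, not the course's reference implementation: the E step computes P(S=1 | f,a,h,n; θ) for each example by Bayes rule, and the M step recomputes each CPT entry as an expected-count ratio.

# Sketch of EM for the Flu/Allergy net with X={F,A,H,N} observed and Z={S} hidden.
# Data and starting parameters are hypothetical; no smoothing/pseudocounts are used.
data = [   # observed (f, a, h, n) tuples; S is hidden in every example
    (0, 0, 0, 0), (0, 1, 1, 1), (1, 0, 1, 1), (0, 0, 0, 1),
    (1, 1, 1, 1), (0, 1, 0, 0), (0, 0, 0, 0), (1, 0, 1, 0),
]

theta_S = {(0, 0): 0.5, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 0.5}   # P(S=1 | F, A)
theta_H = {0: 0.4, 1: 0.6}                                       # P(H=1 | S)
theta_N = {0: 0.4, 1: 0.6}                                       # P(N=1 | S)
# theta_F, theta_A involve only observed variables, so their MLEs are plain frequencies (omitted).
bern = lambda p, x: p if x == 1 else 1.0 - p

for iteration in range(50):
    # E step: q[k] = P(S=1 | f,a,h,n; theta) for each example, by Bayes rule
    # (the P(f) and P(a) factors cancel between numerator and denominator).
    q = []
    for f, a, h, n in data:
        w1 = theta_S[(f, a)] * bern(theta_H[1], h) * bern(theta_N[1], n)
        w0 = (1 - theta_S[(f, a)]) * bern(theta_H[0], h) * bern(theta_N[0], n)
        q.append(w1 / (w1 + w0))

    # M step: MLE-style count ratios, with counts replaced by expected counts.
    for i, j in theta_S:
        matches = [qk for (f, a, h, n), qk in zip(data, q) if (f, a) == (i, j)]
        if matches:
            theta_S[(i, j)] = sum(matches) / len(matches)
    exp1 = sum(q)                # expected number of examples with S=1
    exp0 = len(data) - exp1      # expected number of examples with S=0
    theta_H[1] = sum(qk * h for (f, a, h, n), qk in zip(data, q)) / exp1
    theta_H[0] = sum((1 - qk) * h for (f, a, h, n), qk in zip(data, q)) / exp0
    theta_N[1] = sum(qk * n for (f, a, h, n), qk in zip(data, q)) / exp1
    theta_N[0] = sum((1 - qk) * n for (f, a, h, n), qk in zip(data, q)) / exp0

print(theta_S, theta_H, theta_N)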
E Step: Use X, θ, to Calculate P(Z|X,θ)
observed X={F,A,H,N}, unobserved Z={S}
• How? Bayes net inference problem.
• For each example: P(S=1 | f,a,h,n; θ) = p(S=1|f,a) p(h|S=1) p(n|S=1) / Σs p(s|f,a) p(h|s) p(n|s)
let’s use p(a,b) as shorthand for p(A=a, B=b)

EM and estimating θs|ij
observed X = {F,A,H,N}, unobserved Z={S}
E step: Calculate P(Zk|Xk; θ) for each training example, k
M step: update all relevant parameters. For example:
θs|ij = Σk P(S(k)=1 | f(k),a(k),h(k),n(k); θ) δ(F(k)=i, A(k)=j) / Σk δ(F(k)=i, A(k)=j)
Recall MLE was:
θs|ij = Σk δ(S(k)=1, F(k)=i, A(k)=j) / Σk δ(F(k)=i, A(k)=j)

EM and estimating θ
More generally, given observed set X, unobserved set Z of boolean values
E step: Calculate for each training example, k, the expected value of each unobserved variable
M step: Calculate estimates similar to MLE, but replacing each count by its expected count

Using Unlabeled Data to Help Train Naïve Bayes Classifier
Learn P(Y|X), where Y is the class label and X1...X4 are the features; some examples are unlabeled (Y = ?)
Y  X1 X2 X3 X4
1  0  0  1  1
0  0  1  0  0
0  0  0  1  0
?  0  1  1  0
?  0  1  0  1
E step: Calculate for each training example, k, the expected value of each unobserved variable Y
M step: Calculate estimates similar to MLE, but replacing each count by its expected count
let’s use y(k) to indicate value of Y on kth example
e.g., the fully observed MLE P(Xi=1 | Y=1) = Σk δ(xi(k)=1, y(k)=1) / Σk δ(y(k)=1) becomes the same ratio with each δ count replaced by the corresponding expected count E[y(k)] = P(Y(k)=1 | x1(k)...x4(k); θ)

Experimental Evaluation (from [Nigam et al., 2000])
• Newsgroup postings
– 20 newsgroups, 1000/group
• Web page classification
– student, faculty, course, project
– 4199 web pages
• Reuters newswire articles
– 12,902 articles
– 90 topic categories

20 Newsgroups
word w ranked by P(w|Y=course) / P(w|Y ≠ course)
Using one labeled example per class

Bayes Nets – What You Should Know
• Representation
– Bayes nets represent joint distribution as a DAG + Conditional Distributions
– D-separation lets us decode conditional independence assumptions
• Inference
– NP-hard in general
– For some graphs, some queries, exact inference is tractable
– Approximate methods too, e.g., Monte Carlo methods, ...
• Learning
– Easy for known graph, fully observed data (MLE’s, MAP est.)
– EM for partly observed data, known graph
