
Machine Learning 10-601

Tom M. Mitchell

Machine Learning Department Carnegie Mellon University

February 18, 2015

Today:

• Graphical models

• Bayes Nets:

– Representing distributions

– Conditional independencies

– Simple inference

– Simple learning

Readings:

• Bishop chapter 8, through 8.2

Graphical Models

• Key Idea:

– Conditional independence assumptions useful

– but Naïve Bayes is extreme!

– Graphical models express sets of conditional independence assumptions via graph structure

– Graph structure plus associated parameters define joint probability distribution over set of variables


• Two types of graphical models:

– Directed graphs (aka Bayesian Networks)

– Undirected graphs (aka Markov Random Fields)

Graphical Models – Why Care?

• Among most important ML developments of the decade

• Graphical models allow combining:

– Prior knowledge in form of dependencies/independencies

– Prior knowledge in form of priors over parameters

– Observed training data

• Principled and ~general methods for

– Probabilistic inference

– Learning

• Useful in practice

– Diagnosis, help systems, text analysis, time series models, …

Conditional Independence

Definition: X is conditionally independent of Y given Z, if the probability distribution governing X is independent of the value of Y, given the value of Z

Which we often write

P(X | Y, Z) = P(X | Z)

E.g., P(Thunder | Rain, Lightning) = P(Thunder | Lightning)

Marginal Independence

Definition: X is marginally independent of Y if

P(X = x, Y = y) = P(X = x) P(Y = y), for all values x, y

Equivalently, if

P(X = x | Y = y) = P(X = x), for all values x, y

Equivalently, if

P(Y = y | X = x) = P(Y = y), for all values x, y

Represent Joint Probability Distribution over Variables

Describe network of dependencies

Bayes Nets define Joint Probability Distribution in terms of this graph, plus parameters

Benefits of Bayes Nets:

• Represent the full joint distribution in fewer parameters, using prior knowledge about dependencies

• Algorithms for inference and learning

Bayesian Networks Definition

A Bayes network represents the joint probability distribution over a collection of random variables

A Bayes network is a directed acyclic graph together with a set of conditional probability distributions (CPDs)

• Each node denotes a random variable

• Edges denote dependencies

• For each node Xi, its CPD defines P(Xi | Pa(Xi))

• The joint distribution over all variables is defined to be

P(X1, …, Xn) = ∏i P(Xi | Pa(Xi))

Pa(X) = immediate parents of X in the graph

Bayesian Network

[Figure: directed graph with edges StormClouds → Lightning, StormClouds → Rain, Lightning → Thunder, and Lightning, Rain → WindSurf]

Nodes = random variables

A conditional probability distribution (CPD) is associated with each node N, defining P(N | Parents(N)). E.g., the CPD for WindSurf:

Parents   P(W|Pa)   P(¬W|Pa)
L, R      0         1.0
L, ¬R     0         1.0
¬L, R     0.2       0.8
¬L, ¬R    0.9       0.1

The joint distribution over all variables:

P(S, L, R, T, W) = P(S) P(L|S) P(R|S) P(T|L) P(W|L,R)

Bayesian Network


What can we say about conditional independencies in a Bayes Net?

One thing is this:

Each node is conditionally independent of its non-descendants, given only its immediate parents. E.g., Thunder is conditionally independent of Rain, given Lightning.

(Same StormClouds network and CPD table as above.)

Some helpful terminology

Parents = Pa(X) = immediate parents

Antecedents = parents, parents of parents, …

Children = immediate children

Descendants = children, children of children, …

Bayesian Networks

• CPD for each node Xi describes P(Xi | Pa(Xi))

Chain rule of probability says that in general:

P(X1, …, Xn) = ∏i P(Xi | X1, …, Xi−1)

But in a Bayes net:

P(X1, …, Xn) = ∏i P(Xi | Pa(Xi))

How Many Parameters?

(Same StormClouds network and CPD table as above.)

To define joint distribution in general?

To define joint distribution for this Bayes Net?
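A worked count, assuming all five variables are Boolean as in the table above: an arbitrary joint distribution over S, L, R, T, W needs

$$2^5 - 1 = 31 \text{ independent parameters,}$$

while this Bayes net needs one parameter per parent configuration of each CPD (since P(¬X | Pa) = 1 − P(X | Pa)):

$$\underbrace{1}_{P(S)} + \underbrace{2}_{P(L \mid S)} + \underbrace{2}_{P(R \mid S)} + \underbrace{2}_{P(T \mid L)} + \underbrace{4}_{P(W \mid L,R)} = 11.$$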

Inference in Bayes Nets

(Same StormClouds network and CPD table as above.)

P(S=1, L=0, R=1, T=0, W=1) = P(S=1) P(L=0|S=1) P(R=1|S=1) P(T=0|L=0) P(W=1|L=0, R=1)
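Numerically, this is just a product of CPD entries. Here is a minimal Python sketch (not from the lecture): the WindSurf CPD comes from the table above, while the other CPD numbers are hypothetical placeholders, since the slides do not give them.

```python
# Minimal joint-probability sketch for the StormClouds network.
# Only the WindSurf CPD is from the slides; the rest are made-up placeholders.

p_s = 0.4                                   # P(S=1), placeholder
p_l_given_s = {1: 0.3, 0: 0.05}             # P(L=1 | S), placeholder
p_r_given_s = {1: 0.6, 0: 0.1}              # P(R=1 | S), placeholder
p_t_given_l = {1: 0.9, 0: 0.02}             # P(T=1 | L), placeholder
p_w_given_lr = {(1, 1): 0.0, (1, 0): 0.0,   # P(W=1 | L, R), from the slides
                (0, 1): 0.2, (0, 0): 0.9}

def bernoulli(p_one, value):
    """Return P(X=value) given P(X=1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one

def joint(s, l, r, t, w):
    """P(S,L,R,T,W) = P(S) P(L|S) P(R|S) P(T|L) P(W|L,R)."""
    return (bernoulli(p_s, s)
            * bernoulli(p_l_given_s[s], l)
            * bernoulli(p_r_given_s[s], r)
            * bernoulli(p_t_given_l[l], t)
            * bernoulli(p_w_given_lr[(l, r)], w))

print(joint(s=1, l=0, r=1, t=0, w=1))       # P(S=1, L=0, R=1, T=0, W=1)
```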

Learning a Bayes Net

(Same StormClouds network and CPD table as above.)

Consider learning when the graph structure is given, and the data consist of fully observed training examples { <s, l, r, t, w> }. What is the MLE solution? MAP?
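A sketch of the answer in Python, assuming fully observed Boolean data: the MLE for each CPD row is the relative frequency of the child value among examples matching that parent configuration, and a simple MAP estimate under a Beta prior just adds pseudocounts. The example data and pseudocount value below are hypothetical.

```python
from collections import Counter

def estimate_cpd(data, child, parents, pseudocount=1.0):
    """Estimate P(child=1 | parent assignment) from fully observed data.

    data: list of dicts mapping variable name -> 0/1.
    pseudocount=0 gives the MLE (relative frequencies);
    pseudocount>0 gives a MAP estimate under a Beta(1+pc, 1+pc) prior.
    """
    ones, totals = Counter(), Counter()
    for example in data:
        pa = tuple(example[p] for p in parents)
        totals[pa] += 1
        ones[pa] += example[child]
    return {pa: (ones[pa] + pseudocount) / (totals[pa] + 2 * pseudocount)
            for pa in totals}

# Hypothetical fully observed examples <s, l, r, t, w>:
data = [{'S': 1, 'L': 0, 'R': 1, 'T': 0, 'W': 1},
        {'S': 0, 'L': 0, 'R': 0, 'T': 0, 'W': 1},
        {'S': 1, 'L': 1, 'R': 1, 'T': 1, 'W': 0}]

print(estimate_cpd(data, child='W', parents=('L', 'R'), pseudocount=0))  # MLE
print(estimate_cpd(data, child='W', parents=('L', 'R')))                 # MAP
```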

Algorithm for Constructing Bayes Network

• Choose an ordering over variables, e.g., X1, X2, … Xn

• For i = 1 to n

– Add Xi to the network

– Select parents Pa(Xi) as a minimal subset of X1 … Xi−1 such that P(Xi | Pa(Xi)) = P(Xi | X1, …, Xi−1)

Notice this choice of parents assures

P(X1, …, Xn) = ∏i P(Xi | X1, …, Xi−1) (by chain rule)

= ∏i P(Xi | Pa(Xi)) (by construction)

Example

• Bird flu and Allergies both cause Nasal problems

• Nasal problems cause Sneezes and Headaches

What is the Bayes Network for X1,…X4 with NO assumed conditional independencies?

What is the Bayes Network for Naïve Bayes?

What do we do if variables are mix of discrete and real valued?
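For reference, the factorizations behind the first two questions. With no assumed conditional independencies the graph is fully connected, and the chain rule gives

$$P(X_1, \dots, X_4) = P(X_1)\, P(X_2 \mid X_1)\, P(X_3 \mid X_1, X_2)\, P(X_4 \mid X_1, X_2, X_3).$$

For Naïve Bayes, the class variable Y is the sole parent of every feature:

$$P(Y, X_1, \dots, X_n) = P(Y) \prod_{i=1}^{n} P(X_i \mid Y).$$

For a mix of discrete and real-valued variables, each CPD simply takes an appropriate parametric form, e.g., a Gaussian whose parameters depend on the node's parents.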

Bayes Network for a Hidden Markov Model

Implies the future is conditionally independent of the past, given the present

[Figure: a chain of unobserved states S_{t−2} → S_{t−1} → S_t → S_{t+1} → S_{t+2}, each state S_t emitting an observed output O_t]
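Reading the joint distribution off this graph for a length-T sequence gives the standard HMM factorization:

$$P(S_1, \dots, S_T, O_1, \dots, O_T) = P(S_1)\, P(O_1 \mid S_1) \prod_{t=2}^{T} P(S_t \mid S_{t-1})\, P(O_t \mid S_t)$$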

What You Should Know

• Bayes nets are a convenient representation for encoding dependencies / conditional independence

• BN = graph plus parameters of CPDs

– Defines joint distribution over variables

– Can calculate everything else from that

– Though inference may be intractable

• Reading conditional independence relations from the graph

– Each node is conditionally independent of its non-descendants, given only its parents

– ‘Explaining away’

See Bayes Net applet: http://www.cs.cmu.edu/~javabayes/Home/applet.html

Inference in Bayes Nets

• In general, intractable (NP-complete)

• For certain cases, tractable

– Assigning probability to fully observed set of variables

– Or if just one variable unobserved

– Or for singly connected graphs (i.e., no undirected loops)

• Belief propagation

• For multiply connected graphs

• Junction tree

• Sometimes use Monte Carlo methods

– Generate many samples according to the Bayes Net distribution, then count up the results

• Variational methods for tractable approximate solutions

Example

• Bird flu and Allergies both cause Sinus problems

• Sinus problems cause Headaches and runny Nose

Prob. of joint assignment: easy

• Suppose we are interested in a particular joint assignment f, a, s, h, n

What is P(f, a, s, h, n)? Reading it off the network:

P(f, a, s, h, n) = P(f) P(a) P(s|f,a) P(h|s) P(n|s)

(let’s use p(a,b) as shorthand for p(A=a, B=b))

Prob. of marginals: not so easy

• How do we calculate P(N=n)?

P(N=n) = Σf Σa Σs Σh P(f) P(a) P(s|f,a) P(h|s) P(n|s)

Summed naïvely, this is exponential in the number of variables being marginalized out.

Generating a sample from joint distribution: easy

How can we generate random samples drawn according to P(F,A,S,H,N)?

Sample the variables in topological order: draw f and a from their priors, then s given f and a, then h and n given s.

Note we can estimate marginals like P(N=n) by generating many samples from the joint distribution, then counting the fraction of samples for which N=n.

Similarly for anything else we care about, e.g., P(F=1 | H=1, N=0)

→ a weak but general method for estimating any probability term…
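A minimal Python sketch of this sampling scheme, using hypothetical CPD numbers (the lecture gives the network structure but not these values):

```python
import random

# Hypothetical CPDs for the Flu/Allergy -> Sinus -> Headache, Nose network.
# All numbers below are made up for illustration.
P_F = 0.1                                    # P(F=1)
P_A = 0.2                                    # P(A=1)
P_S = {(0, 0): 0.05, (0, 1): 0.5,            # P(S=1 | F, A)
       (1, 0): 0.7,  (1, 1): 0.9}
P_H = {0: 0.1, 1: 0.8}                       # P(H=1 | S)
P_N = {0: 0.05, 1: 0.7}                      # P(N=1 | S)

def flip(p):
    return 1 if random.random() < p else 0

def sample():
    """Draw one joint sample by sampling variables in topological order."""
    f = flip(P_F)
    a = flip(P_A)
    s = flip(P_S[(f, a)])
    h = flip(P_H[s])
    n = flip(P_N[s])
    return f, a, s, h, n

# Estimate the marginal P(N=1) by counting.
samples = [sample() for _ in range(100_000)]
print(sum(n for *_, n in samples) / len(samples))

# Estimate a conditional, P(F=1 | H=1, N=0), by filtering samples.
matching = [f for f, a, s, h, n in samples if h == 1 and n == 0]
print(sum(matching) / len(matching))
```

The conditional estimate shows why the method is weak: samples inconsistent with the evidence are discarded, so rare evidence requires very many samples.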

Prob. of marginals: not so easy

But sometimes the structure of the network allows us to be clever → avoid exponential work

e.g., a chain: A → B → C → D → E
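On this chain, for example, the marginal P(E = e) can be computed by pushing each sum inward (this is variable elimination), making the cost linear in the chain length rather than exponential:

$$P(e) = \sum_{d} P(e \mid d) \sum_{c} P(d \mid c) \sum_{b} P(c \mid b) \sum_{a} P(b \mid a)\, P(a)$$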

Inference in Bayes Nets

• In general, intractable (NP-complete)

• For certain cases, tractable

– Assigning probability to fully observed set of variables

– Or if just one variable unobserved

– Or for singly connected graphs (i.e., no undirected loops)

• Variable elimination

• Belief propagation

• For multiply connected graphs

• Junction tree

• Sometimes use Monte Carlo methods

– Generate many samples according to the Bayes Net distribution, then count up the results

• Variational methods for tractable approximate solutions