Bias, Variance and Error
Bias and Variance
Given an algorithm that outputs an estimate θ̂ for θ, we define:
• the bias of the estimator: E[θ̂] − θ
• the variance of the estimator: E[(θ̂ − E[θ̂])²]
E.g., consider the estimator θ̂ = (# heads)/n for the probability θ of heads, based on n independent coin flips.
What is its bias? Its variance?
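A minimal simulation sketch of this example, not part of the original slides; it assumes Python with numpy, and the values of θ, n, and the number of trials are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3       # true probability of heads (arbitrary illustrative value)
n = 20            # coin flips per dataset (arbitrary illustrative value)
trials = 100_000  # number of simulated datasets

# theta_hat = (# heads) / n for each simulated dataset of n flips
theta_hat = rng.binomial(n, theta, size=trials) / n

bias = theta_hat.mean() - theta   # estimate of E[theta_hat] - theta (near 0: this estimator is unbiased)
variance = theta_hat.var()        # estimate of E[(theta_hat - E[theta_hat])^2]
print(bias, variance, theta * (1 - theta) / n)  # empirical variance vs. analytic theta(1-theta)/n
```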
Bias and Variance
Given an algorithm that outputs an estimate θ̂ for θ, we define:
• the bias of the estimator: E[θ̂] − θ
• the variance of the estimator: E[(θ̂ − E[θ̂])²]
Which estimator has higher bias? Higher variance?
Bias – Variance decomposition of error
Reading: Bishop chapter 9.1, 9.2
• Consider a simple regression problem f: X → Y with y = f(x) + ε, where f(x) is deterministic and ε is noise drawn from N(0, σ)
Define the expected prediction error: E_D[(y − h_D(x))²], taking the expectation over training sets D; h_D(x) is the learned estimate of f(x)
Sources of error
What if we have a perfect learner and infinite data?
– Our learned h(x) satisfies h(x) = f(x)
– We still have remaining, unavoidable error: σ²
Sources of error
• What if we have only n training examples?
• What is our expected error?
– Taken over random training sets of size n, drawn from the distribution D = p(x, y)
Sources of error
The expected error decomposes into unavoidable error (σ²), squared bias, and variance.
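A simulation sketch of this decomposition, not from the slides: it assumes Python with numpy, uses sin as a made-up target f, a deliberately biased learner (fitting a straight line), and arbitrary values for σ, n, and the number of training sets.

```python
import numpy as np

rng = np.random.default_rng(0)
f = np.sin                       # made-up deterministic target f(x)
sigma, n, runs = 0.3, 30, 2000   # noise std, training-set size, number of random training sets
x_test = np.linspace(0, np.pi, 50)

preds = np.empty((runs, x_test.size))
for r in range(runs):
    x = rng.uniform(0, np.pi, n)          # training inputs drawn from p(x)
    y = f(x) + rng.normal(0, sigma, n)    # y = f(x) + eps, eps ~ N(0, sigma)
    w = np.polyfit(x, y, deg=1)           # learner: fit a line (a deliberately biased model class)
    preds[r] = np.polyval(w, x_test)      # h_D(x) on held-out points

h_bar = preds.mean(axis=0)                 # average prediction E_D[h_D(x)]
bias2 = np.mean((h_bar - f(x_test)) ** 2)  # squared bias
variance = np.mean(preds.var(axis=0))      # variance across training sets
print(bias2, variance, sigma ** 2)         # expected error ≈ bias² + variance + σ²
```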
L2 vs. L1 Regularization
• Gaussian prior P(W) → L2 regularization
• Laplace prior P(W) → L1 regularization
[Figure: contours of constant P(Data|W) and constant P(W) in the (w1, w2) plane for each prior]
Summary
• Bias of parameter estimators
• Variance of parameter estimators
• We can define analogous notions for estimators (learners) of functions
• Expected error in learned functions comes from
– unavoidable error (invariant of training set size, due to noise)
– bias (can be caused by incorrect modeling assumptions)
– variance (decreases with training set size)
• MAP estimates generally more biased than MLE
– but bias vanishes as training set size → ∞
• Regularization corresponds to producing MAP estimates (see the sketch below)
– L2 / Gaussian prior / leads to smaller weights
– L1 / Laplace prior / leads to fewer non-zero weights
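A quick illustration of the last two bullets, not from the slides; it assumes Python with numpy and scikit-learn's Ridge (L2 penalty) and Lasso (L1 penalty), with made-up data and arbitrary regularization strengths.

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
true_w = np.array([3.0, -2.0] + [0.0] * 8)       # only the first two features matter
y = X @ true_w + rng.normal(0, 0.5, size=100)

w_l2 = Ridge(alpha=1.0).fit(X, y).coef_          # L2 / Gaussian prior: weights shrink toward zero
w_l1 = Lasso(alpha=0.5).fit(X, y).coef_          # L1 / Laplace prior: many weights become exactly zero

print(np.round(w_l2, 2))
print(np.round(w_l1, 2), "-", int((w_l1 == 0).sum()), "weights driven exactly to zero")
```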
Machine Learning 10-601
Tom M. Mitchell
Machine Learning Department, Carnegie Mellon University
February 18, 2015
Today:
• Graphical models
• Bayes Nets:
• Representing distributions
• Conditional independencies
• Simple inference
• Simple learning
Readings:
• Bishop chapter 8, through 8.2
Graphical Models
• Key Idea:
– Conditional independence assumptions useful
– but Naïve Bayes is extreme!
– Graphical models express sets of conditional independence assumptions via graph structure
– Graph structure plus associated parameters define joint probability distribution over set of variables
• Two types of graphical models:
– Directed graphs (aka Bayesian Networks)
– Undirected graphs (aka Markov Random Fields)
Graphical Models – Why Care?
• Among the most important ML developments of the decade
• Graphical models allow combining:
– Prior knowledge in form of dependencies/independencies
– Prior knowledge in form of priors over parameters
– Observed training data
• Principled and ~general methods for
– Probabilistic inference
– Learning
• Useful in practice
– Diagnosis, help systems, text analysis, time series models, …
Conditional Independence
Definition: X is conditionally independent of Y given Z, if the probability distribution governing X is independent of the value of Y, given the value of Z
which we often write as P(X | Y, Z) = P(X | Z)
E.g., P(Thunder | Rain, Lightning) = P(Thunder | Lightning)
Marginal Independence
Definition: X is marginally independent of Y if P(X = x | Y = y) = P(X = x), for all values x, y
Equivalently, if P(Y = y | X = x) = P(Y = y), for all values x, y
Equivalently, if P(X = x, Y = y) = P(X = x) P(Y = y), for all values x, y
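A small numeric check of these two definitions, not from the slides: it assumes Python with numpy and a made-up joint distribution over three binary variables, constructed so that X and Y are conditionally independent given Z but not marginally independent.

```python
import numpy as np

# Made-up joint over binary (X, Y, Z), built so that X is independent of Y given Z by construction
p_z = np.array([0.5, 0.5])
p_x_given_z = np.array([[0.9, 0.1], [0.2, 0.8]])   # rows: z, cols: x
p_y_given_z = np.array([[0.7, 0.3], [0.4, 0.6]])   # rows: z, cols: y

joint = np.zeros((2, 2, 2))                        # joint[x, y, z]
for x in range(2):
    for y in range(2):
        for z in range(2):
            joint[x, y, z] = p_z[z] * p_x_given_z[z, x] * p_y_given_z[z, y]

# Conditional independence: P(X | Y, Z) equals P(X | Z) for every (y, z)
p_x_given_yz = joint / joint.sum(axis=0, keepdims=True)   # P(X | Y, Z)
p_x_given_z2 = joint.sum(axis=1) / joint.sum(axis=(0, 1)) # P(X | Z)
print(np.allclose(p_x_given_yz, p_x_given_z2[:, None, :]))  # True

# Marginal independence: P(X, Y) vs P(X) P(Y) -- fails for this joint
p_xy = joint.sum(axis=2)
print(np.allclose(p_xy, p_xy.sum(axis=1, keepdims=True) * p_xy.sum(axis=0, keepdims=True)))  # False
```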
Represent Joint Probability Distribution over Variables
Describe network of dependencies
Bayes Nets define Joint Probability Distribution in terms of this graph, plus parameters
Benefits of Bayes Nets:
• Represent the full joint distribution in fewer parameters, using prior knowledge about dependencies
• Algorithms for inference and learning
Bayesian Networks Definition
A Bayes network represents the joint probability distribution over a collection of random variables
A Bayes network is a directed acyclic graph and a set of conditional probability distributions (CPDs)
• Each node denotes a random variable
• Edges denote dependencies
• For each node Xi its CPD defines P(Xi | Pa(Xi))
• The joint distribution over all variables is defined to be P(X1, …, Xn) = ∏i P(Xi | Pa(Xi))
Pa(X) = immediate parents of X in the graph
Bayesian Network
[Network: StormClouds → Lightning, Rain; Lightning → Thunder; Lightning, Rain → WindSurf]
Nodes = random variables
A conditional probability distribution (CPD) is associated with each node N, defining P(N | Parents(N)), e.g. for WindSurf:

Parents   P(W|Pa)   P(¬W|Pa)
L, R      0         1.0
L, ¬R     0         1.0
¬L, R     0.2       0.8
¬L, ¬R    0.9       0.1
The joint distribution over all variables: P(S, L, R, T, W) = P(S) P(L|S) P(R|S) P(T|L) P(W|L, R)
Bayesian Network
What can we say about conditional independencies in a Bayes Net?
One thing is this:
Each node is conditionally independent of its non-descendants, given only its immediate parents.
(Same StormClouds network and WindSurf CPD as above.)
Some helpful terminology
Parents = Pa(X) = immediate parents
Antecedents = parents, parents of parents, …
Children = immediate children
Descendants = children, children of children, …
Bayesian Networks
• CPD for each node Xi describes P(Xi | Pa(Xi))
Chain rule of probability says that in general: P(X1, …, Xn) = ∏i P(Xi | X1, …, Xi−1)
But in a Bayes net: P(X1, …, Xn) = ∏i P(Xi | Pa(Xi))
How Many Parameters?
(Same StormClouds network and WindSurf CPD as above.)
To define joint distribution in general?
To define joint distribution for this Bayes Net?
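A quick parameter count for these two questions, not from the slides; a sketch in Python assuming the StormClouds structure described above, with one P(X=1 | parent setting) parameter per row of each boolean CPD.

```python
# Parents of each node in the StormClouds network described above
parents = {
    "StormClouds": [],
    "Lightning":   ["StormClouds"],
    "Rain":        ["StormClouds"],
    "Thunder":     ["Lightning"],
    "WindSurf":    ["Lightning", "Rain"],
}

n_vars = len(parents)
full_joint = 2 ** n_vars - 1                                # unconstrained joint over 5 booleans: 31
bayes_net  = sum(2 ** len(pa) for pa in parents.values())   # 1 + 2 + 2 + 2 + 4 = 11
print(full_joint, bayes_net)
```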
Inference in Bayes Nets
(Same StormClouds network and WindSurf CPD as above.)
P(S=1, L=0, R=1, T=0, W=1) =
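A sketch of how this query is read off the factorization, not the slide's worked answer: only the WindSurf CPD appears in the table above, so the probabilities used below for StormClouds, Lightning, Rain, and Thunder are made-up placeholder values.

```python
# P(S=1, L=0, R=1, T=0, W=1)
#   = P(S=1) P(L=0|S=1) P(R=1|S=1) P(T=0|L=0) P(W=1|L=0, R=1)
p_s1             = 0.4        # placeholder P(S=1)
p_l0_given_s1    = 1 - 0.7    # placeholder, from P(L=1|S=1) = 0.7
p_r1_given_s1    = 0.8        # placeholder P(R=1|S=1)
p_t0_given_l0    = 1 - 0.1    # placeholder, from P(T=1|L=0) = 0.1
p_w1_given_l0_r1 = 0.2        # from the CPD table above: P(W | ¬L, R) = 0.2

p = p_s1 * p_l0_given_s1 * p_r1_given_s1 * p_t0_given_l0 * p_w1_given_l0_r1
print(p)  # 0.4 * 0.3 * 0.8 * 0.9 * 0.2 = 0.01728
```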
Learning a Bayes Net
(Same StormClouds network and WindSurf CPD as above.)
Consider learning when the graph structure is given, and data = { }
What is the MLE solution? MAP?
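A sketch of the standard answer for fully observed boolean data, not from the slides: the MLE for each CPD entry is a ratio of counts, and a MAP estimate with a Beta prior just adds pseudocounts. It assumes Python with numpy; the toy dataset and hyperparameters are invented.

```python
import numpy as np

# Toy fully observed data; columns are (Lightning, Rain, WindSurf), values invented for illustration
data = np.array([
    [0, 1, 1], [0, 1, 0], [0, 1, 1], [1, 1, 0],
    [0, 0, 1], [0, 0, 1], [1, 0, 0], [0, 0, 0],
])
L, R, W = data[:, 0], data[:, 1], data[:, 2]

# MLE for one CPD entry, e.g. P(W=1 | L=0, R=1): a ratio of counts
mask = (L == 0) & (R == 1)
mle = W[mask].sum() / mask.sum()

# MAP for the same entry with a Beta(beta1, beta0) prior: add pseudocounts (mode of the posterior)
beta1, beta0 = 2, 2                  # invented hyperparameters
map_est = (W[mask].sum() + beta1 - 1) / (mask.sum() + beta1 + beta0 - 2)

print(mle, map_est)                  # 2/3 vs. 3/5 for this toy data
```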