# CS计算机代考程序代写 Bayesian Textbook Reference

Textbook Reference

Duda et al. Pattern Classification, 2nd Edition (2000)

� This week: Sections 3.1–3.5

1/25

Recap: Bayes Decision Theory

Recap:

� Knowing class priors and class-conditioned data densities, we can infer posterior probabilities using the Bayes theorem

P(ωj |x)= p(x|ωj)·P(ωj) p(x)

� From there, optimal classifiers can be built, e.g. the maximum accuracy

classifier:

Question:

� Can we assume that we know in advance the data densities p(x | ωj ), or should they be learned from the data?

argmax �P(ωj |x)� j

=argmax �logp(x|ωj)+logP(ωj)� j

2/25

Learning p(x | ωj ): Histogram Approach

Idea:

� Build a grid in input space, and for each class build a histogram that counts the number of observations that belong to each bin.

Problem:

� The number of bins is sd where s is the number of steps along each dimension and d is the number of dimensions.

� Many bins will have zero observations, not because the data probability is truly zero, but because finitely many examples have been observed.

� � � � �

� � � �

x1 x2

3/25

Learning p(x | ωj ): Model Approach

Idea:

� Assume that p(x|ωj) is a known parametric form, e.g. N (μj , Σj ) where μj,Σj aretheparametersofthe distribution.

� Estimate the parameters of the distribution that best fit the few observations we have.

Advantage:

� With the model-based approach, we need to estimate a finite and typically small number of model parameters and not the whole data distribution.

x1 x2

4/25

Maximum Likelihood Estimation

Goal: Let {p(x | θ), θ ∈ Θ} be a set of density functions (e.g. Gaussian), where θ denotes a parameter vector (e.g. mean / covariance). We would like to find the parameter θ that is the most likely with respect to the data.

Maximum Likelihood (ML) Approach:

� Assume that we have a dataset D = (x1,…,xN).

� Assume that each example xk ∈ Rd in the dataset has been is generated independently and from the same density function p(x | θ).

� In that case, the joint density function can be written as:

p(D|θ)=

�N k=1

p(xk |θ)

We also call this quantity the likelihood of θ w.r.t. the dataset D.

� The best parameter is then given by θˆ = arg max p(D | θ). θ

5/25

Maximum Likelihood, Simple Gaussian Case

Assume the data density is modeled as a univariate Gaussian with unit variance and unknown mean θ. For a given data point xk , the density function can be written as:

1 �1 2� p(xk|θ)=√2πexp −2(xk−θ)

Using the independence assumption, the joint density becomes:

�N 1 � 1 2� p(D|θ)= √2πexp −2(xk −θ)

k=1

Question: How to find θ that maximizes the function p(D | θ).

��� ��� ��� ��� ���

����� �

���

�����

�����

�����

�����

���

6/25

�� � �

����� �

Finding the Maximum of a Function

For some function f of interest (e.g. the data likelihood) we would like to compute:

θˆ=argmax f(θ) θ

When the function to optimize is continuously differentiable and concave, the maximum is found at the point where the gradient is zero.

∂f/∂θ1

∂f/∂θ2

∇ θ f ( θ ) = . . . = 0

∂f /∂θh

7/25

Maximum Likelihood, Simple Gaussian Case

�����

�����

�����

�����

���

��

��

��

���

Observation: The function p(D | θ) is not concave with θ.

Idea: Transform the function in a way that (i) doesn’t change its maximum and

(ii) makes it concave.

Applying the logarithm ensures the two properties above:

�N 1 � 1 2� logp(D|θ)=log √2πexp −2(xk −θ)

k=1

=�N �−1log(2π)−1(xk−θ)2� k=1 2 2

8/25

����� � �

�� � �

Maximum Likelihood, Simple Gaussian Case

Having found the log-likelihood w.r.t. D to be logp(D|θ)=�N �−1log(2π)−1(xk −θ)2�,

k=1 2 2

the best parameter θˆ can then be found by solving ∇θ log P(D | θ) = 0.

9/25

Maximum Likelihood, Multivariate Case

The log-likelihood of a multivariate Gaussian distribution w.r.t. D is given by �N1�d �1 �−1

logp(D|θ)= −2log (2π) det(Σ) −2(xk −μ) Σ (xk −μ) k=1

Question: Assuming Σ is fixed, what is the optimal parameter vector μ?

10/25

Building a Classifier End-to-End with ML

Consider a labeled dataset containing two examples for the first class, and four examples for the second class. Points are in N0 and we assume they are

generated from classes ω1 , ω2 following geometric unknown parameters θ1 and θ2 respectively.

probability distributions of

P(ω2) = 0.5

Known

Unknown

θ1,θ2

Known

P(ω1) = 0.5

P(x =k|θ1) = (1 − θ1)k θ1

D1 = {0,2}

P(x =k|θ2) = (1 − θ2)k θ2

D2 = {0,2,0,1}

11/25

Building a Classifier End-to-End with ML

Recall:

P(ω1) = 0.5

P(ω2) = 0.5

P(x =k|θ1)=(1−θ1)kθ1

P(x =k|θ2)=(1−θ2)kθ2

D1 ={0,2}

D2 = {0,2,0,1}

12/25

Building a Classifier End-to-End with ML

Recall:

P(ω1) = 0.5

P(ω2) = 0.5

P(x =k|θ1)=(1−θ1)kθ1

P(x =k|θ2)=(1−θ2)kθ2

D1 ={0,2}

D2 = {0,2,0,1}

13/25

Building a Classifier End-to-End with ML

Recall:

P(ω1) = 0.5

P(ω2) = 0.5

P(x =k|θ1)=(1−θ1)kθ1

P(x =k|θ2)=(1−θ2)kθ2

D1 ={0,2}

D2 = {0,2,0,1}

14/25

From ML to Bayes Classifiers

ML classifier: Class posteriors are given by:

ˆ p(x|ωi,θi)P(ωi)

p ( x , θˆ )

where θi is the maximum likelihood parameter for class ωi .

Bayes classifier: Class posteriors are computed by bypassing the intermediate computation of the parameter θˆ, and instead conditioning directly on the data:

P(ωi |x,D) = p(x|D,ωi)p(D|ωi)P(ωi) p(x | D) p(D)

p(x|D,ωi) = � p(x|θi,D,ωi)p(θi |D,ωi)dθi

p(θi |Di) = � p(Di |θi)p(θi) . p(Di |θi)p(θi)dθi

ˆ

ˆ P(ωi |x,θi)=

where and

15/25

From ML to Bayes Classifiers

Simplified Bayes classifier: Applying the Bayes rule internally, assuming that each class generates its dataset independently, assuming that priors are known, leveraging the fact that new data points are also generated i.i.d. and making thedependenceonωi implicitthroughthedatasetDi orparameterθi,wearrive after some steps at the simplified form:

where and

P(ωi |x,D) = p(x|Di)P(ωi) (1) p(x|D)

p(x|Di) = � p(x|θi)p(θi |Di)dθi (2) p(θi |Di) = � p(Di |θi)p(θi) . (3)

p(Di |θi)p(θi)dθi

We will use this simplified form in the following slides and also in the homeworks.

16/25

Building a Classifier End-to-End with Bayes

Recall: we consider the following data generating process

Known

Unknown

θ1,θ2

Known

P(ω1) = 0.5

P(ω2) = 0.5

P(x =k|θ1) = (1 − θ1)k θ1

D1 = {0,2}

P(x =k|θ2) = (1 − θ2)k θ2

D2 = {0,2,0,1}

17/25

Building a Classifier End-to-End Bayes

Recall:

P(ω1) = 0.5

P(ω2) = 0.5

P(x =k|θ1)=(1−θ1)kθ1

P(x =k|θ2)=(1−θ2)kθ2 D1 ={0,2}

D2 = {0,2,0,1} … and we further set:

p(θ1) ∼ U(0,1) p(θ2) ∼ U(0,1)

18/25

Building a Classifier End-to-End Bayes

Recall:

P(ω1) = 0.5

P(ω2) = 0.5

P(x =k|θ1)=(1−θ1)kθ1

P(x =k|θ2)=(1−θ2)kθ2 D1 ={0,2}

D2 = {0,2,0,1} … and we further set:

p(θ1) ∼ U(0,1) p(θ2) ∼ U(0,1)

19/25

Building a Classifier End-to-End Bayes

Recall:

P(ω1) = 0.5

P(ω2) = 0.5

P(x =k|θ1)=(1−θ1)kθ1

P(x =k|θ2)=(1−θ2)kθ2 D1 ={0,2}

D2 = {0,2,0,1} … and we further set:

p(θ1) ∼ U(0,1) p(θ2) ∼ U(0,1)

20/25

Building a Classifier End-to-End Bayes

Recall:

P(ω1) = 0.5

P(ω2) = 0.5

P(x =k|θ1)=(1−θ1)kθ1

P(x =k|θ2)=(1−θ2)kθ2 D1 ={0,2}

D2 = {0,2,0,1} … and we further set:

p(θ1) ∼ U(0,1) p(θ2) ∼ U(0,1)

21/25

ML vs. Bayes Classifiers

Observations:

� ML and Bayes classifiers do not always produce the same decisions

� Bayes classifiers are influenced by the prior distribution and are

consequently less sensitive to the data.

� Bayes classifiers will tend to favor the outcome that is supported by a larger amount of data.

22/25

ML vs. Bayes: Gaussian Case

Consider the simple data density p(x | μ) = N (μ, σ2) with unknown parameter μ. We would like to compare the ML and Bayes approaches to estimate the parameter μ from some dataset D = {x1, . . . , xn}.

ML approach:

� The maximum likelihood estimate is given by μˆ = 1 �n

n k=1

empirical mean (cf. previous slides).

Bayes approach:

� Assuming some prior distribution p(μ) = N(μ0,σ02), the posterior distribution can be computed as:

p(D | μ)p(μ) p(μ | D) = p(D)

where α is a normalizing factor.

= α

�n k=1

p(xk |μ)p(μ)

xk , i.e. the

23/25

ML vs. Bayes: Gaussian Case

Bayes approach (continued):

� The posterior distribution can be expanded as:

which corresponds to a new Gaussian distribution N(μn,σn2) with parameters σn2 σn2 2 �1 1�−1

μn=σ2/nμˆ+σ02μ0 and σn= σ2/n+σ02

� We observe that (1) the mean estimate μn is now pulled towards the prior if too little data is available, and (2) the mean estimate comes with a variance term σn2 that can be useful for confidence estimation.

24/25

Summary

� In practice, parameters of the class-conditioned distributions are not known and they must be inferred from some finite dataset.

� Two main approaches for learning these parameters: (1) maximum likelihood estimation and (2) Bayesian inference.

� Bayesian inference is often more difficult than maximum likelihood estimation (both analytically and computationally), because it requires integration.

� Bayesian inference readily incorporates interesting functionalities (inclusion of priors, construction of confidence estimates).

25/25