
WiSe 2021/22
Machine Learning 1/1-X
Lecture 2 Parameter Estimation

Textbook Reference
Duda et al. Pattern Classification, 2nd Edition (2000)
- This week: Sections 3.1–3.5

Recap: Bayes Decision Theory
- Knowing the class priors and class-conditional data densities, we can infer posterior probabilities using Bayes' theorem:

$$P(\omega_j \mid x) = \frac{p(x \mid \omega_j)\, P(\omega_j)}{p(x)}$$

- From there, optimal classifiers can be built, e.g. the maximum accuracy classifier:

$$\arg\max_j P(\omega_j \mid x) = \arg\max_j \big[\log p(x \mid \omega_j) + \log P(\omega_j)\big]$$

- Can we assume that we know the data densities p(x | ω_j) in advance, or should they be learned from the data?

Learning p(x | ωj ): Histogram
- Build a grid in the input space and, for each class, build a histogram that counts the number of observations that belong to each bin.
- The number of bins is $s^d$, where s is the number of steps along each dimension and d is the number of dimensions.
- Many bins will have zero observations, not because the data probability is truly zero, but because only finitely many examples have been observed.
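The sparsity problem above can be illustrated with a small simulation. This is a toy sketch with made-up grid sizes and uniformly drawn samples, not anything from the lecture:

```python
import random

s, d, n = 10, 5, 1000   # steps per dimension, dimensions, number of samples
num_bins = s ** d       # 10^5 bins in total

random.seed(0)
# Draw toy samples uniformly in [0, 1)^d and record which bin each one hits.
occupied = {tuple(int(random.random() * s) for _ in range(d))
            for _ in range(n)}

# At most n of the s^d bins can be occupied, so the vast majority stay empty.
empty_fraction = 1 - len(occupied) / num_bins
assert empty_fraction >= 0.99
```

Even with a modest 5-dimensional grid, at least 99% of the bins receive no observation at all, which is exactly the motivation for the model-based approach that follows.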

Learning p(x | ωj ): Model-Based
- Assume that p(x | ω_j) has a known parametric form, e.g. N(μ_j, Σ_j), where μ_j, Σ_j are the parameters of the distribution.
- Estimate the parameters of the distribution that best fit the few observations we have.
- With the model-based approach, we only need to estimate a finite and typically small number of model parameters, not the whole data distribution.

Maximum Likelihood Estimation
Goal: Let {p(x | θ), θ ∈ Θ} be a family of density functions (e.g. Gaussian), where θ denotes a parameter vector (e.g. mean and covariance). We would like to find the parameter θ that is most likely with respect to the data.

Maximum Likelihood (ML) Approach:

- Assume that we have a dataset D = (x_1, ..., x_N).
- Assume that each example x_k ∈ R^d in the dataset has been generated independently from the same density function p(x | θ).
- In that case, the joint density function can be written as:

$$p(D \mid \theta) = \prod_{k=1}^{N} p(x_k \mid \theta)$$

We also call this quantity the likelihood of θ with respect to the dataset D.

- The best parameter is then given by $\hat{\theta} = \arg\max_\theta\, p(D \mid \theta)$.

Maximum Likelihood, Simple Gaussian Case
Assume the data density is modeled as a univariate Gaussian with unit variance and unknown mean θ. For a given data point x_k, the density function can be written as:

$$p(x_k \mid \theta) = \frac{1}{\sqrt{2\pi}} \exp\!\Big(-\frac{1}{2}(x_k - \theta)^2\Big)$$

Using the independence assumption, the joint density becomes:

$$p(D \mid \theta) = \prod_{k=1}^{N} \frac{1}{\sqrt{2\pi}} \exp\!\Big(-\frac{1}{2}(x_k - \theta)^2\Big)$$

Question: How do we find the θ that maximizes p(D | θ)?

Finding the Maximum of a Function
For some function f of interest (e.g. the data likelihood), we would like to compute:

$$\hat{\theta} = \arg\max_\theta f(\theta)$$

When the function to optimize is continuously differentiable and concave, the maximum is found at the point where the gradient is zero:

$$\nabla_\theta f(\theta) = \begin{pmatrix} \partial f/\partial \theta_1 \\ \vdots \\ \partial f/\partial \theta_d \end{pmatrix} = 0$$

Maximum Likelihood, Simple Gaussian Case
Observation: The function p(D | θ) is not concave in θ.

Idea: Transform the function in a way that (i) does not change its arg max and (ii) makes it concave.

Applying the logarithm ensures the two properties above:

$$\log p(D \mid \theta) = \log \prod_{k=1}^{N} \frac{1}{\sqrt{2\pi}} \exp\!\Big(-\frac{1}{2}(x_k - \theta)^2\Big) = \sum_{k=1}^{N} \Big[-\frac{1}{2}\log(2\pi) - \frac{1}{2}(x_k - \theta)^2\Big]$$

Maximum Likelihood, Simple Gaussian Case
Having found the log-likelihood with respect to D to be

$$\log p(D \mid \theta) = \sum_{k=1}^{N} \Big[-\frac{1}{2}\log(2\pi) - \frac{1}{2}(x_k - \theta)^2\Big],$$

the best parameter θ̂ can then be found by solving ∇_θ log p(D | θ) = 0, which yields the empirical mean $\hat{\theta} = \frac{1}{N}\sum_{k=1}^{N} x_k$.
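Solving the gradient equation for the unit-variance Gaussian gives the sample mean as the closed-form ML estimate. A minimal numerical check, using a toy dataset and helper names of our own choosing:

```python
import math

def log_likelihood(data, theta):
    """Log-likelihood of a unit-variance Gaussian with mean theta."""
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (x - theta) ** 2
               for x in data)

data = [1.2, 0.8, 1.5, 0.9]       # toy dataset
theta_ml = sum(data) / len(data)  # closed-form solution: the sample mean

# The closed-form estimate is at least as likely as nearby parameter values.
assert log_likelihood(data, theta_ml) >= log_likelihood(data, theta_ml + 0.1)
assert log_likelihood(data, theta_ml) >= log_likelihood(data, theta_ml - 0.1)
```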

Maximum Likelihood, Multivariate Case
The log-likelihood of a multivariate Gaussian distribution with respect to D is given by:

$$\log p(D \mid \theta) = \sum_{k=1}^{N} \Big[-\frac{1}{2}\log\big((2\pi)^d \det(\Sigma)\big) - \frac{1}{2}(x_k - \mu)^\top \Sigma^{-1} (x_k - \mu)\Big]$$

Question: Assuming Σ is fixed, what is the optimal parameter vector μ?
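One way to answer the question above, sketched as a standard derivation: setting the gradient of the log-likelihood with respect to μ to zero (only the quadratic term depends on μ) again yields the empirical mean.

```latex
\nabla_\mu \log p(D \mid \theta)
  = \sum_{k=1}^{N} \Sigma^{-1}(x_k - \mu) = 0
\quad\Longrightarrow\quad
\hat{\mu} = \frac{1}{N} \sum_{k=1}^{N} x_k
```

Since Σ is fixed and invertible, Σ^{-1} can be dropped from the equation, and the optimal μ does not depend on Σ.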

Building a Classifier with ML
Consider a labeled dataset containing two examples for the first class and four examples for the second class. Points are in ℕ_0, and we assume they are generated from classes ω_1, ω_2 following geometric probability distributions with unknown parameters θ_1 and θ_2 respectively:

P(ω_1) = 0.5    P(x = k | θ_1) = (1 − θ_1)^k θ_1    D_1 = {0, 2}
P(ω_2) = 0.5    P(x = k | θ_2) = (1 − θ_2)^k θ_2    D_2 = {0, 2, 0, 1}
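For this geometric likelihood, maximizing θ^n (1 − θ)^s with n observations summing to s gives the closed-form estimate θ̂ = n/(n + s). The following sketch (function names are ours) applies it to the two datasets above and builds the resulting ML classifier:

```python
def geometric_ml(data):
    """ML estimate for P(x=k|theta) = (1-theta)^k * theta.

    Maximizing theta^n * (1-theta)^s with n = len(data), s = sum(data)
    gives theta_hat = n / (n + s).
    """
    return len(data) / (len(data) + sum(data))

def geometric_pmf(k, theta):
    """Probability of observing the count k under parameter theta."""
    return (1 - theta) ** k * theta

theta1 = geometric_ml([0, 2])        # D_1 -> theta_hat = 2/4 = 0.5
theta2 = geometric_ml([0, 2, 0, 1])  # D_2 -> theta_hat = 4/7

def classify(x):
    """ML classifier: with equal priors, pick the class whose
    class-conditional probability of x is larger."""
    return 1 if geometric_pmf(x, theta1) > geometric_pmf(x, theta2) else 2
```

For x = 0, class ω_2 wins because θ̂_2 > θ̂_1; for x = 2, the heavier tail of class ω_1 wins.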


Bayes Parameter Estimation
Assume some dataset D = (x_1, ..., x_N). We would like to model this dataset using some probability function p(D | θ), with θ an unknown parameter that needs to be learned.

ML parameter estimation: Choose the parameter θ that maximizes the data likelihood:

$$\hat{\theta} = \arg\max_\theta p(D \mid \theta)$$

Bayes parameter estimation: Instead of learning a fixed estimate, build a posterior distribution over this parameter using Bayes' theorem:

$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)} = \frac{p(D \mid \theta)\, p(\theta)}{\int p(D \mid \theta')\, p(\theta')\, d\theta'}$$

This approach requires defining a prior distribution p(θ), which models our initial belief about the parameter θ before observing the data.

From ML to Bayes Classifiers
ML classifier: Class posteriors are given by:

$$P(\omega_i \mid x) = \frac{p(x \mid \omega_i; \hat{\theta}_i)\, P(\omega_i)}{\sum_{i'} p(x \mid \omega_{i'}; \hat{\theta}_{i'})\, P(\omega_{i'})}$$

where θ̂_i is the maximum likelihood parameter for the distribution of class ω_i.

Bayes classifier: Class posteriors are computed by bypassing the intermediate computation of the parameters θ_i, and instead conditioning directly on the data:

$$P(\omega_i \mid x, D) = \frac{p(x \mid \omega_i, D)\, P(\omega_i \mid D)}{\sum_{i'} p(x \mid \omega_{i'}, D)\, P(\omega_{i'} \mid D)} = \frac{p(x \mid \omega_i, D_i)\, P(\omega_i)}{\sum_{i'} p(x \mid \omega_{i'}, D_{i'})\, P(\omega_{i'})}$$

Bayes Classifiers (cont.)
The terms of the class posterior

$$P(\omega_i \mid x, D) = \frac{p(x \mid \omega_i, D_i)\, P(\omega_i)}{\sum_{i'} p(x \mid \omega_{i'}, D_{i'})\, P(\omega_{i'})}$$

can be expressed with model parameters as:

$$p(x \mid \omega_i, D_i) = \int p(x \mid \theta_i, \omega_i, D_i)\, p(\theta_i \mid \omega_i, D_i)\, d\theta_i = \int p(x \mid \theta_i)\, p(\theta_i \mid D_i)\, d\theta_i$$

where

$$p(\theta_i \mid D_i) = \frac{p(D_i \mid \theta_i)\, p(\theta_i)}{\int p(D_i \mid \theta_i)\, p(\theta_i)\, d\theta_i}$$

are our Bayes parameter estimates.

Building a Classifier with Bayes
Recall: we consider the following data generating process:

P(ω_1) = 0.5    P(x = k | θ_1) = (1 − θ_1)^k θ_1    D_1 = {0, 2}
P(ω_2) = 0.5    P(x = k | θ_2) = (1 − θ_2)^k θ_2    D_2 = {0, 2, 0, 1}

Building a Classifier with Bayes
P(x = k | θ_1) = (1 − θ_1)^k θ_1    D_1 = {0, 2}    P(ω_1) = 0.5
P(x = k | θ_2) = (1 − θ_2)^k θ_2    D_2 = {0, 2, 0, 1}    P(ω_2) = 0.5

... and we further set:

p(θ_1) ∼ U(0, 1)    p(θ_2) ∼ U(0, 1)
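With the uniform prior U(0, 1), the posterior p(θ | D) ∝ θ^n (1 − θ)^s is a Beta(n + 1, s + 1) distribution, and the predictive P(x = k | D) = ∫ (1 − θ)^k θ · p(θ | D) dθ reduces to a ratio of Beta functions. A sketch under these assumptions (function names are ours):

```python
import math

def beta_fn(a, b):
    """Beta function B(a, b) = Gamma(a) Gamma(b) / Gamma(a + b)."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

def bayes_predictive(k, data):
    """P(x=k | D) for the geometric likelihood with a uniform prior on theta.

    The posterior is Beta(n+1, s+1) with n = len(data), s = sum(data);
    integrating (1-theta)^k * theta against it gives B(n+2, s+k+1) / B(n+1, s+1).
    """
    n, s = len(data), sum(data)
    return beta_fn(n + 2, s + k + 1) / beta_fn(n + 1, s + 1)

p1 = bayes_predictive(0, [0, 2])        # P(x=0 | omega_1, D_1) = 1/2
p2 = bayes_predictive(0, [0, 2, 0, 1])  # P(x=0 | omega_2, D_2) = 5/9
```

These predictive probabilities, weighted by the equal class priors, are what the Bayes classifier compares in place of the plug-in ML densities.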


ML vs. Bayes Classifiers
Observations:
- ML and Bayes classifiers do not always produce the same decisions.
- Bayes classifiers are influenced by the prior distribution and are consequently less sensitive to the data.
- Bayes classifiers will tend to favor the outcome that is supported by a larger amount of data.

ML vs. Bayes: Gaussian Case
Consider the simple data density p(x | μ) = N(μ, σ²) with unknown parameter μ. We would like to compare the ML and Bayes approaches for estimating the parameter μ from some dataset D = {x_1, ..., x_n}.

ML approach:

- The maximum likelihood estimate is given by $\hat{\mu} = \frac{1}{n}\sum_{k=1}^{n} x_k$, i.e. the empirical mean (cf. previous slides).

Bayes approach:

- Assuming some prior distribution p(μ) = N(μ_0, σ_0²), the posterior distribution can be computed as:

$$p(\mu \mid D) = \frac{p(D \mid \mu)\, p(\mu)}{p(D)} = \alpha \prod_{k=1}^{n} p(x_k \mid \mu)\, p(\mu)$$

where α is a normalizing factor.

ML vs. Bayes: Gaussian Case
Bayes approach (continued):

- The posterior distribution can be expanded, which shows that it corresponds to a new Gaussian distribution N(μ_n, σ_n²) with parameters

$$\mu_n = \frac{\sigma_n^2}{\sigma^2/n}\,\hat{\mu} + \frac{\sigma_n^2}{\sigma_0^2}\,\mu_0 \qquad \text{and} \qquad \sigma_n^2 = \Big(\frac{1}{\sigma^2/n} + \frac{1}{\sigma_0^2}\Big)^{-1}$$

- We observe that (1) the mean estimate μ_n is pulled towards the prior when little data is available, and (2) the mean estimate comes with a variance term σ_n² that can be useful for confidence estimation.
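The posterior update above can be written as a few lines of code. This is a sketch under the slide's assumptions (known variance, Gaussian prior); the function name is ours:

```python
def gaussian_posterior(data, mu0, sigma0_sq, sigma_sq):
    """Posterior N(mu_n, sigma_n^2) over the mean of a Gaussian with known
    variance sigma_sq, given the prior N(mu0, sigma0_sq)."""
    n = len(data)
    mu_hat = sum(data) / n  # ML estimate (empirical mean)
    sigma_n_sq = 1.0 / (n / sigma_sq + 1.0 / sigma0_sq)
    mu_n = sigma_n_sq * (n / sigma_sq * mu_hat + mu0 / sigma0_sq)
    return mu_n, sigma_n_sq

# Example: prior N(0, 1), known variance 1, four observations equal to 1.0.
mu_n, sigma_n_sq = gaussian_posterior([1.0, 1.0, 1.0, 1.0], 0.0, 1.0, 1.0)
# mu_n = 0.8: pulled from the empirical mean 1.0 towards the prior mean 0.0;
# sigma_n_sq = 0.2: the estimate tightens as more data arrives.
```

As n grows, the weight on μ̂ approaches 1 and σ_n² shrinks towards 0, recovering the ML estimate in the large-data limit.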

Summary
- In practice, the parameters of the class-conditional distributions are not known and must be inferred from some finite dataset.
- There are two main approaches for learning these parameters: (1) maximum likelihood estimation and (2) Bayesian inference.
- Bayesian inference is often more difficult than maximum likelihood estimation (both analytically and computationally) because it requires integration.
- Bayesian inference readily incorporates useful functionality (inclusion of priors, construction of confidence estimates).