

Outline

- Latent Variable Models
- The Expectation Maximization Procedure
- Gaussian Mixture Models
- K-Means Clustering
- Kernel K-Means

Motivation

[Figure: 2D PCA projections of the Iris, Boston, Diabetes, and Digits datasets]

Complex data cannot be modeled accurately by standard probability distributions (e.g. Gaussian) and require a more complex description, e.g. with latent variables.

Latent Variable Models

A latent variable model (LVM) is a description of the data distribution that makes
use of a layer of abstraction (latent variables) to improve modeling capability.

[Figure: the “classical” description models the data x directly via p(x|θ); the latent variable description first draws z from p(z|θ), then x from p(x|z,θ)]

Relation between the ‘classical’ and latent variable description:

p(x|θ) = ∑_{z∈Z} p(z|θ) · p(x|z,θ).

Learning a Latent Variable Model

Assume a dataset D whose samples have been drawn i.i.d. In that case, the log-likelihood of the data according to the model can be written as:

log P(D|θ) = log ∏_{x∈D} p(x|θ) = ∑_{x∈D} log p(x|θ)

and we wish to maximize that quantity. Injecting the latent variable model, the objective to optimize becomes

log P(D|θ) = ∑_{x∈D} log ∑_{z∈Z} p(z|θ) · p(x|z,θ)

Problem: The maximum of log p(D|θ) cannot be found analytically.

Learning a Latent Variable Model

Strategy: If the function p(D|θ) cannot be easily optimized, find a lower-bound
of that function that is easier to optimize.


Building a Lower-Bound

Jensen’s inequality

For any element λ of the d-dimensional simplex (i.e. λ ≥ 0 and λᵀ1 = 1) and any concave function f : R^d → R, we have

f( ∑_{i=1}^d λi ai ) ≥ ∑_{i=1}^d λi f(ai)

Application to the latent variable model:

- Let λ be some probability distribution q(z|x) independent of θ.
- Let ai contain the remaining terms.
- Because the latent variable model ∑_{x∈D} log ∑_{z∈Z} p(z|θ) · p(x|z,θ) does not contain such a distribution q(z|x) independent of θ, create it!

Building a Lower-Bound

Step 1: Add q(z|x) both on the numerator and denominator:

log p(D|θ) = ∑_{x∈D} log ∑_z [ p(z|θ) · p(x|z,θ) / q(z|x) ] · q(z|x)

Step 2: Apply Jensen’s inequality:

log p(D|θ) ≥ ∑_{x∈D} ∑_z log[ p(z|θ) · p(x|z,θ) / q(z|x) ] · q(z|x)

… and verify the bound:

= ∑_{x∈D} ∑_z log[ p(x|θ) · p(z|x,θ) / q(z|x) ] · q(z|x)

= ∑_{x∈D} ∑_z log p(x|θ) · q(z|x) − ∑_{x∈D} ∑_z log[ q(z|x) / p(z|x,θ) ] · q(z|x)

where the first term equals log P(D|θ) and the second term is the KL divergence KL(q(z|x) ‖ p(z|x,θ)) ≥ 0.
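The decomposition above, log p(x|θ) = (lower bound) + KL, can be checked numerically on a small discrete model (the distributions below are made-up toy values):

```python
import numpy as np

# Toy check of: log p(x|θ) = ELBO(q) + KL(q ‖ p(z|x,θ)) for one observation x
pz = np.array([0.3, 0.5, 0.2])          # p(z|θ), three latent states
px_given_z = np.array([0.9, 0.2, 0.5])  # p(x|z,θ) for one fixed x

px = np.sum(pz * px_given_z)            # p(x|θ) = Σ_z p(z|θ)·p(x|z,θ)
posterior = pz * px_given_z / px        # p(z|x,θ) by Bayes' rule

q = np.array([0.2, 0.3, 0.5])           # an arbitrary distribution q(z|x)

elbo = np.sum(q * np.log(pz * px_given_z / q))   # the Jensen lower bound
kl = np.sum(q * np.log(q / posterior))           # KL(q ‖ posterior)

print(np.isclose(np.log(px), elbo + kl))  # True: bound + KL = log-likelihood
print(kl >= 0)                            # True: the bound is a lower bound
```

Choosing q equal to the posterior makes the KL term vanish and the bound tight, which is exactly the E-step introduced next.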

Building a Lower-Bound

Question: How to ensure that

KL(q(z|x) ‖ p(z|x,θ))

is small, so that maximizing the lower-bound with respect to θ gives a good approximation of the true log-likelihood?

Chicken & egg problem:

- If we use a simple q, e.g. uniformly distributed, we get a loose lower-bound from which we get a bad θ.
- To get a tight lower-bound, we need to choose q(z|x) ≈ p(z|x,θ), which requires that we know θ.

The Expectation-Maximization (EM) Algorithm

Strategy: Start with some random solution θ and alternately estimate q and θ,
until we converge.


The Expectation-Maximization (EM) Algorithm

θ0 ← random()

q1(z|x) ← p(z|x,θ0)    (E-step)

θ1 ← argmax_θ ∑_{x∈D} ∑_z log[ p(x,z|θ) / q1(z|x) ] · q1(z|x)    (M-step)

q2(z|x) ← p(z|x,θ1)    (E-step)

θ2 ← argmax_θ ∑_{x∈D} ∑_z log[ p(x,z|θ) / q2(z|x) ] · q2(z|x)    (M-step)

…

The Expectation-Maximization (EM) Algorithm

[Figure: alternating updates in (θ, q) space, from (θ0, q1) to (θn, qn)]

Properties of the algorithm:

- Block coordinate descent
- Locally optimal step size
- The algorithm lands in a local maximum of the likelihood p(D|θ).

Advantages compared to gradient descent:

- no learning rate,
- no need to compute the gradients.

Application: Cluster Analysis

[Figure: 2D PCA projections of the Iris, Boston, Diabetes, and Digits datasets]

Question: Can we build a model of the cluster structure of the data?

The Gaussian Mixture Model (GMM)

- The Gaussian Mixture Model is a latent variable model specialized for clustering.

[Figure: the general latent variable description p(z|θ), p(x|z,θ) compared with the Gaussian mixture model p(z=i|θ), p(x|z=i,θ), where each value i = 1, 2, 3, 4 of the latent variable generates one cluster of the data]

- The latent variable z of a GMM is an integer between 1 and C. Each value of z indicates a cluster of the data.

- For each cluster, we associate a different data-generating distribution (e.g. Gaussian with different means and covariances).

The Gaussian Mixture Model (GMM)

The GMM is formally defined by the two equations:

p(z = i|θ) = αi

p(x|z = i, θ) = (2π)^{−d/2} · |Σi|^{−1/2} · exp( −½ (x − µi)ᵀ Σi⁻¹ (x − µi) )

The parameters θ of the model to be learned are:

- The mixing coefficients (αi)_{i=1}^C, subject to αi ≥ 0 and ∑_{i=1}^C αi = 1.
- The mean vectors (µi)_{i=1}^C.
- The covariances (Σi)_{i=1}^C, subject to positive semi-definiteness.

Various simplifications of the GMM can be used in practice, e.g. for speed, statistical robustness, or simplicity of implementation:

- diagonal/isotropic/tied/fixed covariance matrices,
- fixed mixing coefficients, etc.

EM for the GMM (simplified)

Consider the simplified GMM:

p(z = i|θ) = 1/C

p(x|z = i, θ) = (γ/π)^{d/2} · exp(−γ‖x − µi‖²)

E-step (apply Bayes’ rule):

q(z = i|x) = p(z = i|θ) · p(x|z = i, θ) / p(x|θ) = exp(−γ‖x − µi‖²) / ∑_{j=1}^C exp(−γ‖x − µj‖²)

M-step (set the lower-bound gradient to zero):

∂/∂µi ∑_{x∈D} ∑_{i=1}^C log(exp(−γ‖x − µi‖²)) · q(z = i|x) = 0

⇒ µi = ∑_{x∈D} x · q(z = i|x) / ∑_{x∈D} q(z = i|x)

Implementing GMMs

The (simplified) GMM model can be implemented easily in a few lines of code:
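For instance, with NumPy (a sketch under our own naming; the initialization is passed in as an argument for reproducibility):

```python
import numpy as np

# EM for the simplified GMM: uniform mixing coefficients and isotropic
# components p(x|z=i,θ) ∝ exp(-γ‖x-µi‖²).
def gmm_em(X, mu0, gamma, n_iter=50):
    mu = mu0.astype(float).copy()
    for _ in range(n_iter):
        # E-step: q(z=i|x) ∝ exp(-γ‖x-µi‖²), normalized over the C clusters
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # (N, C)
        logq = -gamma * d2
        logq -= logq.max(axis=1, keepdims=True)                   # stability
        q = np.exp(logq)
        q /= q.sum(axis=1, keepdims=True)
        # M-step: µi = Σ_x x·q(z=i|x) / Σ_x q(z=i|x)
        mu = (q.T @ X) / q.sum(axis=0)[:, None]
    return mu, q

# Two well-separated blobs; the two centers converge to the blob means.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-3, 0.5, (100, 2)), rng.normal(3, 0.5, (100, 2))])
mu, q = gmm_em(X, mu0=np.array([[-1.0, -1.0], [1.0, 1.0]]), gamma=1.0)
print(mu)  # centers land near (-3, -3) and (3, 3)
```

Subtracting the row-wise maximum before exponentiating leaves the responsibilities unchanged but avoids numerical underflow for large γ.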


GMM Demo (1st try)

[Figure: snapshots of the cluster model over six EM iterations]

GMM Demo (2nd try)

[Figure: snapshots of the cluster model over six EM iterations, starting from a different initialization]

Gaussian Mixture Model (Summary)

- The number of clusters needs to be determined in advance.
- EM converges to a local optimum after a few iterations.
- EM is non-convex: initialization matters.
- For a better result, run the algorithm multiple times and retain the best solution.

The K-Means Algorithm

Cluster assignment step:

c(x) = argmin_k ‖x − µk‖²

Centroid update step:

µk = ∑_{x: c(x)=k} x / ∑_{x: c(x)=k} 1

- K-means can be interpreted as the limit case of EM for Gaussian distributions with γ → ∞.
- K-means produces hard assignments (i.e. data points belong to a single cluster).
- K-means has the same limitations as the GMM (non-convex, predefined cluster shape).
- K-means has only one parameter: the number of clusters.
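The two alternating steps can be sketched directly (a minimal version with a naive deterministic initialization; names are ours):

```python
import numpy as np

# Plain k-means: alternate assignment and centroid update until convergence.
def kmeans(X, k, n_iter=100):
    mu = X[:k].copy()                            # naive init: first k points
    c = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # assignment step: c(x) = argmin_k ‖x - µk‖²
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        c = d2.argmin(axis=1)
        # update step: µk = mean of the points assigned to cluster k
        for i in range(k):
            if np.any(c == i):                   # keep empty clusters unchanged
                mu[i] = X[c == i].mean(axis=0)
    return mu, c

# Two unit squares of points, far apart; each gets one centroid.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
              [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)
mu, c = kmeans(X, k=2)
print(mu)  # centroids at (0.5, 0.5) and (10.5, 10.5), in some order
```

Note the hard assignment `argmin` in place of the GMM's soft responsibilities, matching the γ → ∞ limit described above.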

Kernelizing K-Means

Problem: Standard k-means requires clusters to have a round shape. It fails on ‘banana’ datasets.

Idea: Replace x by the nonlinear feature map φ(x) and define µi = ∑_j βij φ(xj), i.e. centroids become weighted combinations of data points.

1. Cluster assignment step:

c(x) = argmin_i ‖φ(x) − µi‖² = argmax_i [ 2 ∑_j βij k(x, xj) − βiᵀ K βi ]

2. Centroid update step:

βi = argmin_{βi} ∑_{x: c(x)=i} ‖φ(x) − µi‖² = K⁻¹ · ( ∑_{x: c(x)=i} k(X, x) / ∑_{x: c(x)=i} 1 )

[Figure: standard k-means vs. kernel k-means on a nonlinear (‘banana’) dataset]
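A two-cluster sketch of these updates with an RBF kernel (our own choices throughout: when K is invertible, the centroid update reduces to βij = 1/Ni for points j in cluster i, which we use directly; the farthest-point initialization is also an assumption of ours):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # k(a, b) = exp(-γ‖a-b‖²)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * d2)

# Kernel k-means with two clusters; centroids live implicitly as µi = Σ_j βij φ(xj).
def kernel_kmeans2(X, gamma=1.0, n_iter=20):
    K = rbf_kernel(X, X, gamma)
    i0, i1 = 0, int(np.argmin(K[0]))              # farthest-point initialization
    c = np.stack([K[i0], K[i1]]).argmax(axis=0)
    for _ in range(n_iter):
        # hard-assignment centroid update: βij = 1/Ni for j in cluster i
        beta = np.stack([(c == i) / max((c == i).sum(), 1) for i in range(2)])
        # assignment step: argmax_i 2 Σ_j βij k(x,xj) − βiᵀ K βi
        scores = 2 * beta @ K - np.einsum('in,nm,im->i', beta, K, beta)[:, None]
        c = scores.argmax(axis=0)
    return c

X = np.array([[0, 0], [0.5, 0], [0, 0.5], [5, 5], [5.5, 5], [5, 5.5]])
c = kernel_kmeans2(X)
print(c)  # first three points share one label, last three the other
```

Everything is computed from the kernel matrix K alone; φ(x) is never materialized.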

Other Cluster Algorithms

- Deep k-means
- Hierarchical clustering
  - Divisive clustering
  - Agglomerative clustering
- DBSCAN
- Affinity Propagation

Beyond Clustering

[Figure: mixture model (today’s lecture) vs. product of experts (next week’s lecture)]

Summary

- Latent variable models are a powerful approach to modeling complex data distributions.

- Expectation-maximization is a framework to efficiently optimize these models.

- Latent variable models can be used for clustering, e.g. the Gaussian mixture model.

- The well-known k-means algorithm can be seen as a limit case of the Gaussian mixture model.

- K-means can be kernelized to produce nonlinear boundaries between clusters.

Outline

Covered content:

- Products of Experts (PoE)
- Restricted Boltzmann Machine (RBM)
- Structure of an RBM
- RBM learning algorithms
- Applications of RBMs

Reference:

- Hinton, Geoffrey E. (2002). “Training Products of Experts by Minimizing Contrastive Divergence”. Neural Computation, 14(8):1771–1800.

Beyond Clustering

[Figure: mixture model (last week’s lecture) vs. product of experts (today’s lecture)]

Mixture model vs. Product of experts

Mixture Model: Mixture models are commonly used for clustering problems, and the probability is defined as a weighted sum of probability distributions:

p(x|α, θ) = ∑_{k=1}^C αk · p(x|θk)

with αk denoting the mixing coefficients, subject to the constraints ∑_{k=1}^C αk = 1 and αk ≥ 0.

Product of Experts (PoE): Products of experts are commonly used for learning distributed representations (e.g. unsupervised feature extraction) and are defined as a product of functions:

p(x|θ) = (1/Z) ∏_{j=1}^H g(x|θj)

where g(x|θj) is the jth expert and Z is a normalization term set to ensure that p(x|θ) integrates to 1.

Example 1: Product of Gaussians

Consider the experts to be univariate Gaussians specialized on a particular input dimension:

g(x|θi) = (2πσi²)^{−1/2} · exp( −(xi − µi)² / (2σi²) )

The product of experts (PoE) can be developed as:

p(x|θ) = ∏_{i=1}^d g(x|θi) = (2π)^{−d/2} |Σ|^{−1/2} exp( −½ (x − µ)ᵀ Σ⁻¹ (x − µ) )

with Σ = diag(σ1², …, σd²), i.e. a multivariate Gaussian distribution without correlation between features.
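This identity is easy to confirm numerically on toy values (the means and variances below are made up for illustration):

```python
import numpy as np

# Check: a product of univariate Gaussian experts equals a multivariate
# Gaussian with diagonal covariance Σ = diag(σ1², …, σd²).
mu = np.array([0.5, -1.0, 2.0])
sigma2 = np.array([1.0, 0.5, 2.0])
x = np.array([0.0, 0.3, 1.5])

# product of the d univariate experts
experts = (2 * np.pi * sigma2) ** -0.5 * np.exp(-(x - mu) ** 2 / (2 * sigma2))
poe = experts.prod()

# multivariate Gaussian density with diagonal covariance
Sigma = np.diag(sigma2)
d = len(x)
diff = x - mu
mvn = (2 * np.pi) ** (-d / 2) * np.linalg.det(Sigma) ** -0.5 \
      * np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff)

print(np.isclose(poe, mvn))  # True
```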

Example 2: Product of Gaussian Mixtures

Let the experts be the univariate mixtures:

g(x|θi) = ∑_{k=1}^C π^{−1/2} exp(−(xi − µik)²)

The product of experts (PoE) can be developed as:

p(x|θ) = (1/Z) ∏_{i=1}^d ∑_{k=1}^C π^{−1/2} exp(−(xi − µik)²)

= (1/Z) ∑_{k1=1}^C ⋯ ∑_{kd=1}^C ∏_{i=1}^d π^{−1/2} exp(−(xi − µiki)²)

where each term of the outer sums is a multivariate Gaussian, i.e. we obtain a mixture of exponentially many (C^d) multivariate Gaussians. Therefore, a PoE can encode many variations with few parameters.

Example 3: Product of t-Student Distributions

Define the experts to be t-Student distributions in some projected space (z = wjᵀx):

g(x|θj) = 1 / (αj + (wjᵀx)²),    ‖wj‖ = 1

The resulting product of experts

p(x|θ) = (1/Z) ∏_{j=1}^H 1 / (αj + (wjᵀx)²)

with H the number of experts, produces a non-Gaussian multivariate distribution, which can be useful to model e.g. image or speech data. This PoE has connections to other analyses, e.g. independent component analysis (ICA).

The Restricted Boltzmann Machine

[Figure: bipartite graph of an RBM, with binary input units x connected to binary latent units h]

The restricted Boltzmann machine (RBM) is a joint probability model defined over input features x ∈ {0,1}^d and latent variables h ∈ {0,1}^H:

p(x, h|θ) = (1/Z) exp( ∑_{i=1}^d ∑_{j=1}^H xi wij hj + ∑_{j=1}^H bj hj )

The parameter wij can be interpreted as the connection strength between input feature xi and latent variable hj. The larger wij, the more strongly xi and hj co-activate.

The Restricted Boltzmann Machine (PoE View)

Connection between RBM and PoE

The RBM, when marginalized over its hidden units, has the structure of a product of experts with g(x|θj) = 1 + exp(wjᵀx + bj).

Proof:

p(x|θ) = ∑_{h∈{0,1}^H} p(x, h|θ)

= ∑_{h∈{0,1}^H} (1/Z) exp( ∑_{i=1}^d ∑_{j=1}^H xi wij hj + ∑_{j=1}^H bj hj )

= (1/Z) ∑_{h∈{0,1}^H} ∏_{j=1}^H exp( (wjᵀx + bj) · hj )

= (1/Z) ∏_{j=1}^H ∑_{hj∈{0,1}} exp( (wjᵀx + bj) · hj )

= (1/Z) ∏_{j=1}^H (1 + exp(wjᵀx + bj))
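The marginalization can be verified by brute force on a tiny RBM (random toy weights; Z is omitted since it cancels on both sides):

```python
import numpy as np
from itertools import product

# Tiny RBM: d=3 visible units, H=2 hidden units, random parameters.
rng = np.random.default_rng(0)
d, H = 3, 2
W = rng.normal(size=(d, H))
b = rng.normal(size=H)
x = np.array([1.0, 0.0, 1.0])

# direct sum over all 2^H hidden configurations (unnormalized p(x|θ))
direct = sum(np.exp(x @ W @ h + b @ h)
             for h in map(np.array, product([0.0, 1.0], repeat=H)))

# product-of-experts form: Π_j (1 + exp(wjᵀx + bj))
poe = np.prod(1.0 + np.exp(W.T @ x + b))

print(np.isclose(direct, poe))  # True
```

The key step is that the exponential of a sum over hidden units factorizes, so the 2^H-term sum collapses into a product of H two-term sums.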

Interpreting the RBM Experts

The experts

g(x|θj) = 1 + exp(wjᵀx + bj)

forming the RBM implement two behaviors:

- If wjᵀx + bj > 0, then g(x|θj) ≫ 1: the example x is in the area of competence of the expert, and the latter speaks in favor of x.
- If wjᵀx + bj < 0, then g(x|θj) ≈ 1: the expert withdraws from the product, i.e. it abstains on x.


Multivariate Normal Distributions

P(x|ωj) = 1/√((2π)^d det(Σj)) · exp( −½ (x − µj)ᵀ Σj⁻¹ (x − µj) )

- µj is the mean (center of the data distribution).
- Σj is the covariance (elongation of the data distribution).

Image source: Duda et al. (2000)

Optimal Classifier for Gaussians (Σ1 = Σ2)

Note: log P(x|ωj) = −(d/2) log 2π − ½ log det(Σ) − ½ (x − µj)ᵀ Σ⁻¹ (x − µj)

Optimal Classifier for Gaussians (Σ1 = Σ2)

Image source: Duda et al. (2000)

- The decision boundary is linear and oriented by the mean and covariance.
- The offset is controlled by the class prior probabilities.

Optimal Classifier for Gaussians (Σ1 ≠ Σ2)

- When the covariances Σ1 and Σ2 are not the same, the decision boundary becomes quadratic: ellipse, parabola, hyperbola, and degenerate forms.

Image source: Duda et al. (2000)

Binary Data Distributions

Assume that our data is now binary: x ∈ {0,1}^d, with each dimension generated independently according to some parameters p and q:

P(xi = 0|ω1) = 1 − pi    P(xi = 0|ω2) = 1 − qi
P(xi = 1|ω1) = pi        P(xi = 1|ω2) = qi

The probability of the whole multivariate observation can be written as:

P(x|ω1) = ∏_{i=1}^d [ pi xi + (1 − pi)(1 − xi) ]

Question: How to express the decision boundary P(ω1|x) > P(ω2|x)?

Optimal Classifier for Binary Data (1st try)

Recall: P(x|ω1) = ∏_{i=1}^d [ pi xi + (1 − pi)(1 − xi) ]

Optimal Classifier for Binary Data

Observation:

For binary data, the data likelihood

P(x|ω1) = ∏_{i=1}^d [ pi xi + (1 − pi)(1 − xi) ]

can be equivalently expressed as

P(x|ω1) = ∏_{i=1}^d pi^{xi} · (1 − pi)^{1−xi}
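Since each xi is 0 or 1, exactly one of the two summands (or factors) is active per dimension, so the two forms agree. This can be confirmed exhaustively on toy parameters:

```python
import numpy as np
from itertools import product

# Check that both likelihood expressions agree on every x ∈ {0,1}^d
# (the parameters p are made-up toy values).
p = np.array([0.2, 0.7, 0.9])

for x in map(np.array, product([0, 1], repeat=3)):
    form1 = np.prod(p * x + (1 - p) * (1 - x))        # sum form
    form2 = np.prod(p ** x * (1 - p) ** (1 - x))      # exponent form
    assert np.isclose(form1, form2)
print("both forms agree on all 8 binary inputs")
```

The exponent form is the convenient one for the next slide, since taking its logarithm turns the product into a linear function of x.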

Optimal Classifier for Binary Data (2nd try)

Recall: P(x|ω1) = ∏_{i=1}^d pi^{xi} · (1 − pi)^{1−xi}

From Binary to Multi-Class

The two-class problem, ω = (ω1, ω2):

decide ω1 if P(ω1|x) > P(ω2|x), and ω2 else,

can be generalized to multiple classes.

Let ω = (ω1, ω2, …, ωC) be our C possible classes. The optimal decision now takes the form:

decide ωi if ∀ j≠i : P(ωi|x) > P(ωj|x)

… or equivalently …

decide ωi where i = argmax_j P(ωj|x)

Example: Multi-Class Gaussian

Decision function:

“decide ωi” where i = argmax_j P(ωj|x)

Observation:

The decision surface of a multi-class classifier can be quite complex and nonlinear.

Image source: Duda et al. (2000)

From Classification to Cost Minimization

Example: Suppose you would like to purchase a second-hand car. After observing the car, you assess that it has a defect with probability

P(defect | x) = 0.1    P(no defect | x) = 0.9

Concretely, the decision you need to take is not to classify whether the car has a defect or not, but whether to buy the car or not.

For this, we need to evaluate the cost of each scenario, e.g.

cost(buy | defect) = 100.0      cost(buy | no defect) = −20.0
cost(not buy | defect) = 0.0    cost(not buy | no defect) = 0.0

and take the action with the lowest expected cost.

From Classification to Cost Minimization

Classification Accuracy Maximization:

“classify as ωi” where i = argmax_j P(ωj|x)

Expected Cost Minimization: Let (αi)i be the set of actions. The expected cost of taking a certain action is given by

R(αj|x) = ∑_{k=1}^C λ(αj|ωk) · P(ωk|x)

where λ(αj|ωk) is the cost of action αj given the class ωk. We then decide to

“take action αi” where i = argmin_j R(αj|x)
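Applied to the second-hand car example from the previous slide, the rule works out as follows:

```python
# Expected cost R(α|x) = Σ_k λ(α|ωk)·P(ωk|x) for the car example.
p = {"defect": 0.1, "no defect": 0.9}
cost = {
    ("buy", "defect"): 100.0, ("buy", "no defect"): -20.0,
    ("not buy", "defect"): 0.0, ("not buy", "no defect"): 0.0,
}

R = {a: sum(cost[(a, w)] * p[w] for w in p) for a in ("buy", "not buy")}
best = min(R, key=R.get)
# R: buy ≈ 0.1·100 + 0.9·(−20) = −8, not buy = 0
print(best)  # buy
```

Despite the 10% defect risk, buying has the lower expected cost, so the cost-minimizing decision differs from what a pure defect/no-defect classification would suggest.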

Classification Accuracy as a Special Case

Recall: R(αj|x) = ∑_{k=1}^C λ(αj|ωk) · P(ωk|x)

Show that the problem of maximum accuracy classification is a special instance of expected cost minimization with a particular set of actions (αj)j and a particular cost function λ(αj|ωk).

Analyzing the Two-Class Case

Recall: R(αj|x) = ∑_{k=1}^C λ(αj|ωk) · P(ωk|x)

Show that in the two-action two-class case, changing the cost function means shifting the decision boundary.

(Note: when differences of lambdas are negative, we need to stop before applying the log.)

Measuring Classification Error

So far, we have studied what the decision boundary should be in order to predict optimally.

However, in certain cases, it is also important to determine the expected error of the classifier (e.g. to determine whether the classifier is good enough for practical use).

The expected error is the probability that the data is of a different class than the one predicted:

P(Err|x) = P(ω1|x) if “decide ω2”,  and P(ω2|x) if “decide ω1”

For a maximally accurate classifier, this reduces to

P(Err|x) = min{P(ω1|x), P(ω2|x)}

Measuring Classification Error

The expected error of this maximally accurate classifier is computed as the integral of its error probability over the distribution p(x):

P(Err) = ∫ P(Err|x) · p(x) dx = ∫ min{P(ω1|x), P(ω2|x)} · p(x) dx

This is also known as the Bayes error rate.

Generally, this integral cannot be solved analytically because of the min function. The error must instead be bounded or evaluated empirically.

Bounding the Error of a Classifier

Very basic bound

Observe for binary classification that P(ω1|x) = 1 − P(ω2|x), so the smaller of the two posteriors is at most 0.5. The error can be bounded as:

P(Err) = ∫ min{P(ω1|x), P(ω2|x)} · p(x) dx ≤ ∫ 0.5 · p(x) dx = 0.5

The optimal classifier predicts the correct class at least 50% of the time.

Slightly Better Bound

Recall (Bayes’ rule): P(ωj|x) = p(x|ωj) · P(ωj) / p(x)

The error can be bounded as:

P(Err) = ∫ min{P(ω1|x), P(ω2|x)} · p(x) dx

= ∫ min{p(x|ω1) · P(ω1), p(x|ω2) · P(ω2)} dx

≤ ∫ max{p(x|ω1), p(x|ω2)} · min{P(ω1), P(ω2)} dx

≤ 2 · min{P(ω1), P(ω2)}

Additional insight: The optimal classifier improves its accuracy when one class prior probability strongly dominates the other.

Summary

- Bayesian decision theory allows us to build optimal classifiers based on prior knowledge about the classes and the data distributions.

- For certain simple data distributions, optimal classifiers are linear.

- The framework is general enough to extend to multi-class scenarios and user-defined cost functions.

- The error of the optimal classifier is typically harder to compute than the actual decision function, but it is still possible to compute bounds on the error.

2020_TUB10

Wojciech Samek & Grégoire Montavon

ML1 Lecture 10: Kernel Ridge Regression 2
