# CS计算机代考程序代写 decision tree algorithm Bayesian AI GMM deep learning lecture/12-em-annotated.pdf

lecture/12-em-annotated.pdf

lecture/13-poe-annotated.pdf

lecture/14-xai-annotated.pdf

lecture/lecture1-annotated.pdf

lecture/lecture10.pdf

lecture/lecture11.pdf

1/24

Outline

� Latent Variable Models

� The Expectation Maximization Procedure

� Gaussian Mixture Models

� K-Means Clustering

� Kernel K-Means

2/24

Motivation

PCA of Iris dataset PCA of Boston dataset

PCA of Diabetes dataset PCA of Digits dataset

Complex data cannot be modeled accurately by standard probability distributions

(e.g. Gaussian) and require a more complex description, e.g. with latent variables.

3/24

Latent Variable Models

A latent variable model (LVM) is a description of the data distribution that makes

use of a layer of abstraction (latent variables) to improve modeling capability.

p(x|z,θ)

p(z|θ)

dataxdata

“Classical”

descrip�on

Latent variable

descrip�on

p(x|θ)

x

z

Relation between the ‘classsical’ and latent variable description:

p(� |θ) =

�

z∈Z

p(z |θ) · p(� |z , θ).

4/24

Learning a Latent Variable Model

Assuming a dataset D where we assume that the samples have been drawn i.i.d..

In that case, the log-likelihood of the data according to the model can be written

as:

logP(D|θ) = log

�

�∈D

p(� |θ)

=

�

�∈D

log p(� |θ)

and we wish to maximize that quantity. Injecting the latent variable model, the

objective to optimize becomes

=

�

�∈D

log

�

z∈Z

p(z |θ) · p(� |z , θ)

Problem: The maximum of log p(D|θ) cannot be found analytically.

5/24

Learning a Latent Variable Model

Strategy: If the function p(D|θ) cannot be easily optimized, find a lower-bound

of that function that is easier to optimize.

6/24

Building a Lower-Bound

Jensen’s inequality

For any element λ of the d-dimensional simplex (i.e. λ � 0 and λ�1 = 1, and

any concave function f : Rd → R, we have

f (

d�

i=1

λiai ) ≥

d�

i=1

λi f (ai )

Application to the latent variable model:

� Let λ be some probability distribution q(z |�) independent from θ.

� Let �i containing the remaining terms.

� Because the latent variable model

�

�∈D log

�

z∈Z p(z |θ) · p(� |z , θ) does

not contain such distribution q(z |�) independent from θ, create it!

7/24

Building a Lower-Bound

Step 1: Add q(z |�) both on the numerator and denominator:

log p(D|θ) =

�

�∈D

log

�

z

p(z |θ) · p(� |z , θ)

q(z |�) · q(z |�)

Step 2: Applying the Jensen inequality

≥

�

�∈D

�

z

log

�

p(z |θ) · p(� |z , θ)

q(z |�)

�

· q(z |�)

… and verify the bound:

=

�

�∈D

�

z

log

�

p(� |θ) · p(z |� , θ)

q(z |�)

�

· q(z |�)

=

�

�∈D

�

z

log

�

p(� |θ)

�

· q(z |�)

� �� �

log P(D|θ)

−

�

�∈D

�

z

log

�

q(z |�)

p(z |� , θ)

�

· q(z |�)

� �� �

KL(q(z|�) � p(z|�,θ))≥ 0

8/24

Building a Lower-Bound

Question: How to ensure that

KL(q(z |�) � p(z |� , θ))

is small, so that maximizing the lower-bound with θ gives a good approximation

of the true log-likelihood?

Chicken & egg problem:

� If we use a simple q, e.g. uniformly distributed, we get a loose

lower-bound from which we get a bad θ.

� To get a tight lower-bound, we need to choose q(z |�) ≈ p(z |� , θ) which

requires that we know θ.

9/24

The Expectation-Maximization (EM) Algorithm

Strategy: Start with some random solution θ and alternately estimate q and θ,

until we converge.

10/24

The Expectation-Maximization (EM) Algorithm

θ0 ← random()

q1(z |�) ← p(z |� , θ0) (E-step)

θ1 ← argmax

θ

�

�∈D

�

z

log

�

p(� , z |θ)

q1(z |�)

�

· q1(z |�) (M-step)

q2(z |�) ← p(z |� , θ1) (E-step)

θ2 ← argmax

θ

�

�∈D

�

z

log

�

p(� , z |θ)

q2(z |�)

�

· q2(z |�) (M-step)

…

11/24

The Expectation-Maximization (EM) Algorithm

θ

q

(θ0,q1)

(θn,qn)

Properties of the algorithm:

� Block coordinate descent

� Locally optimal step size

� The algorithm lands in a local

minimum of the function p(D|θ).

Advantages of EM compared to gradient

descent:

� no learning rate

� no need to compute the gradients.

12/24

Application: Cluster Analysis

PCA of Iris dataset PCA of Boston dataset

PCA of Diabetes dataset PCA of Digits dataset

Question: Can we build a model of the cluster structure of the data?

13/24

The Gaussian Mixture Model (GMM)

� The Gaussian Mixture Model is a latent variable model specialized for

clustering.

p(x|z,θ)

p(z|θ)

data

Latent variable

descrip�on

(general

formula�on)

x

z

p(x|z=i,θ)

p(z=i|θ)

data

Gaussian

mixture

model

x

i=1 i=2 i=3 i=4

� The latent variable z of a GMM is an integer between 1 and C . Each

value of z indicates a cluster of the data.

� For each cluster, we associate a different data-generating distribution

(e.g. Gaussian with different means and covariances).

14/24

The Gaussian Mixture Model (GMM)

The GMM is formally defined by the two equations:

p(z = i |θ) = αi

p(� |z = i , θ) = (2π)−d/2 · |Σi |−1/2 · exp

�

− 1

2

(� − µi )

�

Σ

−1

i (x − µi )

�

The parameters θ of the model to be learned are:

� The mixing coefficients (αi )Ci=1 subject to αi ≥ 0 and

�C

i=1

αi = 1

� The mean vectors (µi )

C

i=1.

� The covariances (Σi )Ci=1 subject to positive semi-definiteness.

Various simplifications of the GMM model can be used in practice, e.g. for

speed, statistical robustness, or simplicity of implementation.

� diagonal/isotropic/tied/fixed covariance matrices,

� fixed mixing coefficients, etc.

15/24

EM for the GMM (simplified)

Consider the simplified GMM:

p(z = i |θ) = 1

C

p(� |z = i , θ) = 1

(π/γ)d/2

exp(−γ�� − µi�

2

)

E-step: (Apply Bayes rule)

q(z = i |�) = p(z = i |θ) · p(� |z = i , θ)

p(� |θ) =

exp(−γ�� − µi�2)�C

i=1

exp(−γ�� − µi�2)

M-step: (Set lower-bound gradient to zero)

∂

∂θ

�

�∈D

C�

i=1

log(exp(−γ�� − µi�

2

)) · q(z = i |�) = 0

⇒ µi =

�

�∈D � · q(z = i |�)�

�∈D q(z = i |�)

16/24

Implementing GMMs

The (simplified) GMM model can be implemented easily in few lines of code:

17/24

GMM Demo (1st try)

�� � � � ��

��

�

�

�

��

�������������

�� � � � ��

��

�

�

�

��

�������������

�� � � � ��

��

�

�

�

��

�������������

�� � � � ��

��

�

�

�

��

�������������

�� � � � ��

��

�

�

�

��

�������������

�� � � � ��

��

�

�

�

��

�������������

18/24

GMM Demo (2nd try)

�� � � � ��

��

�

�

�

��

�������������

�� � � � ��

��

�

�

�

��

�������������

�� � � � ��

��

�

�

�

��

�������������

�� � � � ��

��

�

�

�

��

�������������

�� � � � ��

��

�

�

�

��

�������������

�� � � � ��

��

�

�

�

��

�������������

19/24

Gaussian Mixture Model (Summary)

� Number of clusters need to be determined in advance.

� EM converge to some local minima after a few iterations

� EM is non-convex, initialization matters.

� For a better result, run the algorithm multiple times and retain the best

solution.

20/24

The K-Means Algorithm

Cluster assignments step

c(�) = argmin

k

�� − µk�

2

Centroid update step

µk =

�

� :c(�)=k

�

�

� :c(�)=k

1

� K-means can be interpreted as the limit case of EM for Gaussian

distributions with γ → ∞.

� K-means produces hard-assignments (i.e. data points belong to a single

cluster).

� K-means has the same limitations as GMM (non-convex, predefined

cluster shape).

� K-means has only one parameter: the number of clusters.

21/24

Kernelizing K-Means

Problem: Standard k-means requires clusters to have round

shape. It fails on ‘banana’ datasets.

Idea: Replace � by the nonlinear feature map φ(�) and de-

fine µi =

�

j

βijφ(� j), i.e. centroids become weighted com-

binations of data points.

1. Cluster assignments step:

c(�) = argmin

i

�φ(�)− µi�

2

= argmax

i

2

�

j

βijk(� , � j)− β�i Kβi

2. Centroid update step:

βi = argmin

βi

�

� :c(�)=i

�φ(�)− µi�

2

= K

−1 ·

��

� :c(�)=i

k(X , �)

�

� :c(�)=i

1

�

Standard k-means

Kernel k-means

22/24

Other Cluster Algorithms

� Deep k-means

� Hierarchical clustering

� Divisive clustering

� Agglomerative clustering

� DBScan

� Affinity Propagation

23/24

Beyond Clustering

Mixture model

(today’s lecture)

Product of experts

(next week’s lecture)

24/24

Summary

� Latent variable models is a powerful approach to model complex data

distribution.

� Expectation-maximization is a framework to efficiently optimize these

models.

� Latent variable models can be used for clustering, e.g. the Gaussian

mixture model.

� The well-known k-means algorithm can be seen as a limit case of the

Gaussian mixture model.

� K-means can be kernelized to produce nonlinear boundaries between

clusters.

1/24

Outline

Covered content:

� Products of Experts (PoE)

� Restricted Boltzmann Machine (RBM)

� Structure of an RBM

� RBM learning algorithms

� Application of RBMs

Reference:

� Hinton, Geoffrey E. (2002). ”Training Products of Experts by Minimizing Contrastive

Divergence” (PDF). Neural Computation. 14 (8): 1771–1800

2/24

Beyond Clustering

Mixture model

(last week’s lecture)

Product of experts

(today’s lecture)

3/24

Mixture model vs. Product of experts

Mixture Model: Mixture models are commonly used for clustering problems and the

probability is defined as a weighted sum of probability distributions:

p(x|α, θ) =

C�

k=1

αk · p(x|θk)

with αk denoting the mixing coefficients subject to the constraints

�C

k=1 αk = 1 and αk ≥ 0.

Product of Experts (PoE): Products of experts are commonly used for learning dis-

tributed representations (e.g. unsupervised feature extraction) and are defined as a product

of functions.

p(x|θ) = 1Z

H�

j=1

g(x|θj)� �� �

jth

expert

where Z is a normalization term set to ensure that p(x|θ) integrates to 1.

4/24

Example 1: Product of Gaussians

Consider the experts to be univariate Gaussians spe-

cialized on a particular input dimension

g(x|θi) =

1

(2πσ2i )

1/2

exp

�

− (xi − µi)

2

2σ2i

�

The product of experts (PoE) can be developed as:

p(x|θ) =

d�

i=1

g(x|θi)

=

1

(2π)d/2|Σ|1/2

exp

�

− 1

2

(x− µ)�Σ−1(x− µ)

�

with Σ = diag(σ21 , . . . ,σ

2

d), i.e. a multivariate Gaussian

distribution without correlation between features.

5/24

Example 2: Product of Gaussian Mixtures

Let the experts be the univariate mixtures:

g(x|θi) =

C�

k=1

π−1/2 exp(−(xi − µik)2)

The product of experts (PoE) can be developed as:

p(x|θ) = 1Z

d�

i=1

C�

k=1

π−1/2 exp(−(xi − µik)2)

=

1

Z

C�

k1=1

· · ·

C�

kd=1

d�

i=1

π−1/2 exp(−(xi − µiki)2)

� �� �

multivariate Gaussian

i.e. a mixture of exponentially many (Cd) multivariate

Gaussians. Therefore, a PoE can encode many varia-

tions with few parameters.

6/24

Example 3: Product of t-Student Distributions

Define experts to be t-Student distributions in some

projected space (z = w�j x):

g(x|θj) =

1

αj + (w

�

j x)

2

�wj� = 1

The resulting product of experts

p(x|θ) = 1Z

H�

j=1

1

αj + (w

�

j x)

2

withH the number of experts, produces a non-Gaussian

multivariate distribution, which can be useful to model

e.g. image or speech data. This PoE has connections

to other analyses, e.g. independent component analysis

(ICA).

7/24

The Restricted Boltzmann Machine

input

latent

0 1 0 0 1

0 1 0 0 1 1

The restricted Boltzmann machine (RBM) is a joint probability model defined over input

features x ∈ {0, 1}d and latent variables h ∈ {0, 1}H .

p(x,h|θ) = 1Z exp

� d�

i=1

H�

j=1

xiwijhj +

H�

j=1

bjhj

�

The parameter wij can be interpreted as the connection strength between input feature xi

and latent variable hj . The larger wij the stronger xi and hj co-activate.

8/24

The Restricted Boltzmann Machine (PoE View)

Connection between RBM and PoE

The RBM, when marginalized over its hidden units, has the structure of a Product of

Experts with g(x, θj) = (1 + exp(w

�

j x+ bj)).

Proof:

p(x|θ) =

�

h∈{0,1}d

p(x,h|θ)

=

�

h∈{0,1}d

1

Z

exp

� d�

i=1

H�

j=1

xiwijhj +

H�

j=1

bjhj

�

=

1

Z

�

h∈{0,1}d

H�

j=1

exp((w�j x+ bj) · hj)

=

1

Z

H�

j=1

�

hj∈{0,1}

exp((w�j x+ bj) · hj)

=

1

Z

H�

j=1

(1 + exp(w�j x+ bj))

9/24

Interpreting the RBM Experts

The experts

g(x, θj) = 1 + exp(w

�

j x+ bj)

forming the RBM implement two behaviors:

� w�j x+ bj > 0 : g(x; θj) � 1: the example x is in

the area of competence of the expert and the

latter speaks in favor of x.

� w�j x+ bj < 0 : g(x; θj) ≈ 1: the example x is
outside the expert’s area of competence, and the
expert ‘withdraws’, i.e. multiplies by 1.
This double behavior is important to implement the
nonlinearity.
10/24
RBM as a Neural Network
The product of expert (PoE) model
p(x|θ) = 1Z
H�
j=1
(1 + exp(w�j x+ bj))
can be rewritten as:
log p(x|θ) =
H�
j=1
log(1 + exp(� �� �
softplus
w�j x+ bj))
� �� �
f(x,θ)
− logZ
where f(x, θ) can be interpreted as a neural network.
∑
x
f
softplus
w
(x)
Note: The neural network can predict which examples x are more probable relative to
other examples, but not the actual probability score, because logZ is difficult to compute.
11/24
RBM as a Neural Network
RBM trained on MNIST digits (only digits “3” with 100-125 foreground pixels).
2% least likely 2% most likely
� ��� ���� ���� ����
�������������������������������������
���
���
���
���
���
��
�
��
��
�
��
�
�
�
�
��
�
�
�
�
��
��
Observations:
� Digits with anomalies are predicted the RBM to be less likely than clean digits.
12/24
RBM as a Neural Network
Universal approximation
The RBM is an universal approximator of data distributions in the binary input space
{0, 1}d.
“Proof” by construction: Add a softplus neuron pointing to each corner of the hypercube.
Scale the weight wj according to the probability of each corner, and choose the bias accord-
ingly.
Note: This construction requires exponentially many neurons. In practice, each expert θj
captures more than a single corner of the hypercube (i.e. learn general features).
13/24
RBM as a Neural Network
RBM weights ((wj)
H
j=1) K-Means centroids
Observation
� In the RBM, experts (wj)j encode only part of the digit (e.g. a stroke). The expert
therefore contributes to make all data containing that particular stroke more likely.
� The experts that the RBM extracts can be used for transfer learning, e.g. reused to
predict different digits/characters.
� K-means centroids are much more localized, and consequently less transferable.
14/24
Learning an RBM
Recap: mixture model (EM bound)
The mixture model can be optimized by finding the optimum of a lower-bound at each
iteration
log p(D|θ) =
N�
n=1
log
C�
k=1
αkp(xn|θk)
βk
βk ≥
N�
n=1
C�
k=1
log
�αkp(xn|θk)
βk
�
βk
where N is the number of data points, and (αk)k and (βk)k are distributions over the C
mixture elements.
Question: Can the same EM approach be used for Product of Experts?
log p(D|θ) =
N�
n=1
�
f(xn, θ)− logZ
�
Answer: No, the expression cannot be bounded in the same way as for the mixture model
(it has a different structure).
15/24
Gradient of the RBM
Idea: Although EM is not applicable, we can still compute the gradient of the log-likelihood
and perform gradient descent.
∂
∂θ
log p(D|θ) = ∂
∂θ
N�
n=1
�
f(xn, θ)− logZ
�
=
∂
∂θ
N�
n=1
�
f(xn, θ)− log
�
x∈{0,1}d
exp(f(x, θ))
�
=
N�
n=1
� ∂
∂θ
f(xn, θ)−
�
x∈{0,1}d exp(f(x, θ))
∂
∂θ
f(x, θ)
�
x∈{0,1}d exp(f(x, θ))
�
=
N�
n=1
∂
∂θ
f(xn, θ)−NEx∼p(x|θ)
� ∂
∂θ
f(x, θ)
�
The gradient is a difference between a data-dependent and a model-dependent term.
16/24
RBM Update Rule
Based on the gradient calculation above, we can build the update rule:
θ ← θ + γ ·
� 1
N
N�
n=1
∂
∂θ
f(xn, θ)− Ex∼p(x|θ)
� ∂
∂θ
f(x, θ)
��
Interpretation:
� The first term makes the data more likely according to the model.
� The second term makes everything that the model considers likely, less likely.
� The training procedure terminates when we reach some equilibrium.
data
p(x)
data
17/24
Computation of the RBM update rule
θ ← θ + γ ·
� 1
N
N�
n=1
∂
∂θ
f(xn, θ)− Ex∼p(x|θ)
� ∂
∂θ
f(x, θ)
��
Observations
� The left hand side is easy to compute (backprop in f(x, θ), averaged over all data
points).
� The right hand side is more tricky, because p(x|θ) is defined over exponentially many
states. An unbiased approximation of the expected gradient can be obtained by
generating a sample {x1, . . . ,xm} from the model distribution p(x|θ).
Question: How do we sample from p(x|θ)?
� Idea: Switch back to the ‘classical’ (i.e. non-PoE) view of the RBM, where we can use
latent variables to ease sampling.
18/24
Recap: ‘Classical’ View of the RBM
input
latent
0 1 0 0 1
0 1 0 0 1 1
The RBM is a probability model defined over input features x ∈ {0, 1}d and
and latent variables h ∈ {0, 1}H .
p(x,h|θ) = 1Z exp
� d�
i=1
H�
j=1
xiwijhj +
H�
j=1
bjhj
�
The parameter wij can be interpreted as the connection strength between
input feature xi and latent variable hj . The larger wij the stronger xi and hj
co-activate.
19/24
Conditional Distributions in an RBM
Sampling p(x) and p(h) is hard, however, hidden variables h can be sampled easily and
independently when we condition on the state x, and similarly for x conditioned on h. Let
zj =
�
i xiwij + bj . We proceed as:
p(h|x, θ) =
1
Z exp
��
j
zjhj
�
�
h
1
Z exp
��
j
zjhj
� =
�
j
exp(zjhj)
�
j
�
hj
exp(zjhj)
=
�
j
exp(zjhj)
1 + exp(zj)� �� �
p(hj |x,θ)
and we observe that, conditioned on x, the latent variables (hj)j can be sampled easily
(Bernoulli distributions) and independently.
By symmetry of the RBM model, we get a similar result for the visible units conditioned on
the latent variables, i.e. p(xi|h, θ) = exp(zihi)/(1 + exp(zi)) with zi =
�
j wijhj .
20/24
Block Gibbs Sampling
Observation: We know p(hj |x, θ) and p(xi|h, θ).
But how do we sample p(x|θ), which we need to
compute the RBM gradient?
Block Gibbs Sampling: Start with some ran-
dom input x(0), then sample alternately:
h(0) ∼
�
p(hj |x(0), θ)
�H
j=1
x(1) ∼
�
p(xi |h(0), θ)
�d
i=1
h(1) ∼
�
p(hj |x(1), θ)
�H
j=1
x(2) ∼
�
p(xi |h(1), θ)
�d
i=1
...
until x(·) converges in distribution to p(x|θ).
21/24
Fast Approximations
In practice, Gibbs sampling can take a long time to
converge. Instead, we can use fast approximations
of converged Gibbs sampling.
Common approximation strategies:
� Contrastive divergence (CD-k): Start from
an existing data point, and perform k steps
of alternate Gibbs sampling.
� Persistent contrastive divergence (PCD):
Start from the Gibbs sample x in the
previous iteration of gradient descent, and
perform one step of Gibbs sampling.
22/24
Application: RBM for Data Denoising
Suppose you receive an example xnoisy, and would
like to denoise it. A reconstruction of that digit can
be obtained by mapping to the latent variables and
projecting back:
1. Projection on latent variables
h = sigm(Wxnoisy + b)
2. Backprojection on the input domain
xrec = sigm(W
�h)
Before denoising
After denoising
23/24
Application: RBM for Representation Learning
The RBM model can be used as a neural net-
work initialization to achieve lower generaliza-
tion error on some classification task.
� Step 1: Create a layer of features based
on the RBM learned parameters:
φ(x) =
sigm(w�1 x)
...
sigm(w�Hx)
� Step 2: Add a top layer that maps
φ(x) to the class probabilities, and train
the resulting neural network end-to-end
using gradient descent.
MNIST example:
Source: Erhan et al. (2010) Why Does Unsuper-
vised Pre-training Help Deep Learning?
24/24
Summary
� The Product of Experts is an unsupervised learning approach that is substantially
different from Mixture Models.
� Product of experts such as the RBM are optimized by gradient descent (instead of
EM).
� The RBM has several desirable properties: Simple gradient, block Gibbs sampler
(quite fast).
� The RBM can be used for various tasks, e.g. probability modeling, data denoising,
representation learning.
1/24
Outline
� What is Explainable AI?
� Desiderata of an Explainable AI technique
� Uses of Explainable AI
� Methods for Explainable AI
� Activation Maximization
� Shapley Values
� Taylor Expansions
� Layer-wise Relevance Propagation
2/24
What is Explainable AI?
Standard machine learning:
� The function f is typically considered to be a “black-box” whose parameters are learned
from the data using e.g. gradient descent. The objective to minimize encourages the
predictions f (x) to coincide with the ground truth on the training and test data.
x1
f(x)x2
xd
x
...
Machine learning + Explainable AI:
� We do not only look at the outcome f (x) of the prediction but also at the way the
prediction is produced by the ML model, e.g. which features are used, how these features
are combined, or to what input pattern the model responds the most.
3/24
What is Explainable AI?
Example 1: Synthesize an input pattern that most strongly activates the output of the ML
model associated to a particular class.
Image source: Nguyen et al. (2016) Multifaceted Feature Visualization: Uncovering the Different Types of
Features Learned By Each Neuron in Deep Neural Networks
4/24
What is Explainable AI?
Example 2: Highlight features that have contributed for a given data point to the ML prediction.
image image image image
heatmap
(bike)
heatmap
(person)
heatmap
(cat)
heatmap
(person)
image image image
heatmap
(train)
heatmap
(train)
heatmap
(dining table)
Image source: Lapuschkin et al. (2016) Analyzing Classifiers: Fisher Vectors and Deep Neural Networks
5/24
What is Explainable AI?
Example 3: Concept activation vectors (TCAV). Highlight the mid-level concepts that explain,
for a given data point, the ML prediction.
Source: Google Keynote’19 (URL: https://www.youtube.com/watch?v=lyRPyRKHO8M&t=2279s)
6/24
Desiderata of an Explanation
In practice, we would like the explanation technique to satisfy a number
of properties:
1. Fidelity: The explanation should reflect the quantity being
explained and not something else.
2. Understandability: The explanation must be easily
understandable by its receiver.
3. Sufficiency: The explanation should provide sufficient
information on how the model came up with its prediction.
4. Low Overhead: The explanation should not cause the prediction
model to become less accurate or less efficient.
5. Runtime Efficiency: Explanations should be computable in
reasonable time.
see also Swartout & Moore (1993), Explanation in Second Generation Expert Systems.
image
heatmap (train)
7/24
Uses of an Explanation
Verify (and improve?) a ML model
� Verify that the model is based on features
which generalize well to examples outside
the current data distribution (this cannot
be done with standard validation
techniques!).
� Reliance of the ML models on wrong
features is often encountered when there
are spurious correlation in the data.
� From the explanation, the model’s
trustworthiness can be reevaluated, and
the flawed ML model can be potentially
retrained based on the user feedback.
8/24
Uses of an Explanation
Example: The classifier is right for the wrong reasons
image heatmap (horse)
aer bic bir boa bot
79.08 66.44 45.90 70.88 27.64
bus car cat cha cow
69.67 80.96 59.92 51.92 47.60
din dog hor mot per
58.06 42.28 80.45 69.34 85.10
pot she sof tra tvm
28.62 49.58 49.31 82.71 54.33
average precision of the Fisher Vector
model on the PascalVOC dataset
� In this example, the classifier accurately predicts the horse class, but based on the wrong
features (some copyright tag in the corner).
� This incorrect decision strategy cannot be detected by just looking at the test error.
cf. Lapuschkin et al. (2019) Unmasking Clever Hans Predictors and Assessing What Machines Really Learn.
Nature Communications
9/24
Uses of an Explanation
Learn something about the data
(or about the system that produced the data)
� Step 1: Train a ML model that predicts
well the data.
� Step 2: Apply XAI to the trained ML
model to produce explanations of the ML
decision strategy.
� Step 3: Based on the XAI explanations,
the user can compare his reasoning with
that of the ML model, and can potentially
refine his own domain knowledge.
Image source: Thomas et al. (2019) Analyzing
Neuroimaging Data Through Recurrent Deep
Learning Models
10/24
Part II: Methods of XAI
Presented methods
� Activation maximization
� Shapley values
� Taylor expansion
� Layer-wise relevance propagation
Other methods
� Surrogate models (LIME)
� Integrated gradients / expected gradients / SmoothGrad
� Influence functions
� . . .
11/24
Activation Maximization
Assume a trained a ML model (e.g. a neural network), and we would like to understand what
concept is associated to some particular output neuron of the ML model, e.g. the output neuron
that codes for the class ‘cat’. Activation maximization proceeds in two steps:
� Step 1: Think of the ML model as a function of the input
� Step 2: Explain the function f by generating a maximally activating input pattern:
x� = arg max
x
f (x, θ)
12/24
Activation Maximization
Problem: In most cases f (x) does not have single point
corresponding to the maximum.
E.g. in linear models, f (x) = w�x + b, we
can keep moving the point x further along the
direction w , and the output continues to grow).
Therefore, we would like to apply a preference for ‘regular’
regions of the input domain, i.e.
x� = arg max
x
f (x) + Ω(x)
In practice, the preference can be for data points with
small norm (i.e. we set Ω(x) = −λ�x�2 so that points
with large norm are penalized.)
13/24
Activation Maximization: Examples
f (x) = w�x + b and Ω(x) = −λ�x�2 f (x) = max(x1, x2) and Ω(x) = −λ�x�2
14/24
Activation Maximization: Probability View
Assume the model produces a log-probability for class ωc :
f (x) = log p(ωc |x)
The input x� that maximizes this function can be inter-
preted as the point where the classifier is the most sure
about class ωc .
Choose the regularizer Ω(x) = log p(x), i.e. favor points
that are likely.
The optimization problem becomes:
x� = arg max
x
log p(ωc |x) + log p(x)
= arg max
x
log p(x|ωc)
where x� can now be interpreted as the most typical input
for class ωc .
15/24
Attribution of a Prediction to Input Features
ML
blackbox
predic�on
input
a�ribu�on
1. The data x ∈ Rd is fed to the ML black-box and we get a prediction f (x) ∈ R.
2. We explain the prediction by determining the contribution of each input feature.
Key property of an explanation: conservation (
�d
i=1 φi = f (x)).
16/24
Attribution: Shapley Values
� Framework originally proposed in the context
of game theory (Shapley 1951) for assigning
payoffs in a cooperative game, and recently
applied to ML models.
� Each input variable is viewed as a player, and
the function output as the profit realized by
the cooperating players.
The Shapley values φ1, . . . , φd measuring the contribution of each feature are:
φi =
�
S : i /∈S
|S|!(d−|S|−1)!
d!
�
f (xS∪{i}) − f (xS)
�
where (xS)S are all possible subsets of features contained in the input x.
17/24
Attribution: Shapley Values
Recall: φi =
�
S : i /∈S
|S|!(d−|S|−1)!
d!� �� �
αS
�
f (xS∪{i}) − f (xS)
�
� �� �
ΔS
Worked-through example: Consider the function f (x) = x1 · (x2 + x3). Calculate the contri-
bution of each feature to the prediction f (1) = 1 · (1 + 1) = 2.
18/24
Attribution: Taylor Expansions
� Many ML models f (x) are complex and
nonlinear when taken globally but are simple
and linear when taken locally.
� The function can be approximated locally by some Taylor expansion:
f (x) = f (�x) +
d�
i=1
[∇f (�x)]i · (xi − �xi)� �� �
φi
+ . . .
� First-order terms φi of the expansion can serve as an explanation.
� The explanation (φi)i depends on the choice of root point �x.
19/24
Attribution: Taylor Expansions
Example: Attribute the prediction f (x) = w�x with x ∈ Rd on the d input features.
20/24
Attribution: Taylor Expansions
Limitations: Gradient information is too local-
ized.
� Cannot handle saturation effects and
discontinuities e.g. cannot explain the
function
f (x) =
d�
i=1
xi − max(0, xi − θ)
at the point x = (2, 2).
This limitation can be overcome by looking at
the structure of the model and decompose the
problem of explanation in multiple parts (→ next
slide).
21/24
Attribution: Look at the Structure of the Model
∑a4
a3
a5
a6
x1
x2
yout
-1
1
1
1
-1
-1
1
-1
1
1
wij vj
Observation:
� The function implemented by a ML model
is typically a composition of simple
elementary functions.
� These functions are simpler to analyze
than the whole input-output function.
Idea:
� Treat the problem of explanation as
propagating the prediction backward in
the input-output graph.
� The layer-wise relevance propagation
(LRP) method implements this approach
and can be used to explain ML models
(→ next slide).
22/24
Attribution: The LRP Method
∑a4
a3
a5
a6
x1
x2
yout
-1
1
1
1
-1
-1
1
-1
1
1
wij vj
Example: Consider yout to be the quantity to
explain:
Rout ← yout
� Step 1: Propagate on the hidden layer
∀6j=3 : Rj ←
ajvj�6
j=3 ajvj
Rout
� Step 2: Propagate on the first layer
∀2i=1 : Ri ←
6�
j=3
xiwij�2
i=1 xiwij
Rj
Note: Other propagation rules can be engineered, and choosing appropriate propagation rules
is important to ensure LRP works well for practical neural network architectures.
23/24
Attribution: The LRP Method
Effect of LRP rules on the explanation (e.g. class ‘castle’ predicted by a VGG-16 neural network.)
VGG-16 Network
3x
3
@
6
4
3x
3
@
6
4
3x
3
@
1
28
3x
3
@
1
28
3x
3
@
2
56
3x
3
@
2
56
3x
3
@
2
56
3x
3
@
5
12
3x
3
@
5
12
3x
3
@
5
12
7x
1
@
4
09
6
1x
1
@
4
09
6
1x
1
@
1
00
0
3x
3
@
5
12
3x
3
@
5
12
3x
3
@
5
12
LRP-0
LRP-ϵ
LRP-°
LRP-0LRP-ϵLRP-°
24/24
Summary
� Explainable AI is an important addition to classical ML models (e.g. for validating a ML
model or extracting knowledge from it).
� Many XAI methods have been developed, each of them, with their strengths and
limitations:
� Activation maximization can be used to understand what a ML model has learned,
but is unsuitable for explaining an individual prediction f (x).
� Shapley value has strong theoretical foundations, but is computationally unfeasible
for high-dimensional input data.
� Taylor expansions are simple and theoretically founded for simple models, but the
expansion does not extrapolate well in complex nonlinear models.
� LRP leverages the structure of the ML model to handle nonlinear decision functions,
but requires to carefully choose the propagation rules.
1/29
Machine Learning 1, Course Outline
� Covered topics
� Bayes Decision Theory, Parameter Estimation
� Component/Discriminant Analyses (PCA, Fisher)
� Model Selection, Learning Theory
� SVMs, Regression, Neural Nets, Boosting, Decision Trees
� Clustering, Latent Variable Models
� Weekly homeworks
� Elective course (only for Machine Learning 1-X)
2/29
Machine Learning 1, Important Dates
Registering homework groups
� Request need to be sent before Friday 6 November 2020 at 14:00.
First homework submission deadline
� Monday 9 November 2020 at 14:00.
Exam registration
� Opens in January 2021. To be able to register, you need to first
pass the prerequisites (homeworks / elective) and then request an
Übungsschein/Genehmigung.
Exams
� March 2021, two sessions (you can choose the session you prefer),
exact exam dates still to be determined.
3/29
Textbook Reference
Duda et al. Pattern Classification, 2nd Edi-
tion (2000)
� The first four lectures will be based
on this textbook.
� This week: Sections 2.1-2.6;2.9
4/29
Example of a Machine Learning Classifier
Information about the input (fish
on the conveyor belt) is first
acquired through sensors (e.g.
wage, camera, etc.)
The data is preprocessed (e.g.
cropping, rescaling, normaliza-
tion).
Then a finite number of features
is extracted.
Finally, a classifier can be learned
on these features.
Image source: Duda et al. (2000)
5/29
Single-Feature Example (Length)
6/29
Two-Feature Example (Length + Lightness)
7/29
Bayesian Decision Theory
Goals:
� Merging empirical observations with prior knowledge about the
task (e.g. prior class probabilities).
� Incorporating distributional assumptions (e.g. Gaussian-generated
data, independence of features).
� Build optimal classifiers (in terms of classification accuracy or in
terms of costs).
8/29
Bayesian Decision Theory
Notation:
� ω1,ω2, . . . Different classes (e.g. salmon, sea bass, ...)
� x ∈ Rd vector of observations
(e.g. x1 is the length and x2 is the lightness).
we now add probabilities ...
� P(ωj ): Probability of being of class j .
� p(x): Data density function (e.g. histogram)
� p(x |ωj ): Class-conditioned data density function (e.g. histogram).
� P(ωj | x): Probability of being of a certain class after observing x.
9/29
Bayes Theorem
P(ωj | x) =
p(x |ωj ) · P(ωj )
p(x)
Image source: Duda et al. (2000)
10/29
Optimally Accurate Classifier
Decide
�
ω1 if P(ω1 | x) > P(ω2 | x)

ω2 else.

Alternate formulations:

11/29

Multivariate Normal Distributions

P(x|ωj ) =

1�

(2π)ddet(Σj )

exp

�

− 1

2

(x− µj )�Σ−1j (x− µj )

�

� µj is the mean (center of the

data distribution)

� Σj is the covariance

(elongation of the data

distribution).

Image source: Duda et al. (2000)

12/29

Optimal Classifier for Gaussians (Σ1 = Σ2)

Note: logP(x|ωj ) = − d2 log 2π −

1

2

log det(Σ)− 1

2

(x− µj )�Σ−1(x− µj )

13/29

Optimal Classifier for Gaussians (Σ1 = Σ2)

Image source:

Duda et al. (2000)

� Decision boundary is linear and oriented by mean and covariance.

� Offset is controlled by class prior probabilities.

14/29

Optimal Classifier for Gaussians (Σ1 �= Σ2)

� When covariances Σ1 and Σ2

are not the same, the decision

boundary is quadric instead of

linear. Quadrics include circle,

ellipse, parabola, hyperbola,

and degenerate forms.

Image source: Duda et al. (2000)

15/29

Binary Data Distributions

Assume that our data is now binary: x ∈ {0, 1}d . With each dimension

generated independently according to some parameters p and q.

P(xi = 0 |ω1) = 1− pi P(xi = 0 |ω2) = 1− qi

P(xi = 1 |ω1) = pi P(xi = 1 |ω2) = qi

The probability of the whole multivariate observation can be written as:

P(x|ω1) =

d�

i=1

�

pixi + (1− pi ) · (1− xi )

�

Question: How to express the decision boundary

P(ω1|x) > P(ω2|x)

16/29

Optimal Classifier for Binary Data (1st try)

Recall: P(x|ω1) =

�d

i=1

�

pixi + (1− pi ) · (1− xi )

�

17/29

Optimal Classifier for Binary Data

Observation:

For binary data, the data likelihood

P(x|ω1) =

d�

i=1

�

pixi + (1− pi ) · (1− xi )

�

can be equivalently expressed as

P(x|ω1) =

d�

i=1

�

p

xi

i · (1− pi )

(1−xi )

�

18/29

Optimal Classifier for Binary Data (2nd try)

Recall: P(x|ω1) =

�d

i=1

�

p

xi

i · (1− pi )(1−xi )

�

19/29

From Binary to Multi-Class

The two-class problem: ω = (ω1,ω2)

decide

�

ω1 if P(ω1 | x) > P(ω2 | x)

ω2 else,

can be generalized to multiple classes.

Let ω = (ω1,ω2, . . . ,ωC ) be our C possible classes. The optimal decision

now takes the form:

decide ωi if ∀j �=i : P(ωi |x) > P(ωj |x)

… or equivalently …

decide ωi where i = argmax

j

P(ωj |x)

20/29

Example: Multi-Class Gaussian

Decision function:

“decide ωi”

i = argmax

j

P(ωj |x)

Observation:

The decision surface of a multi-class

classifier can be quite complex and

nonlinear.

Image source: Duda et al. (2000)

21/29

From Classification to Cost Minimization

Example: Suppose you would like to purchase a second-hand car. After

observing the car, you assess that it has a defect with probability

P(defect | x) = 0.1 P(no defect | x) = 0.9

Concretely, the decision you need to take is not to classify the whether

the car has a defect or not, but whether to buy the car or not.

For this, we need to evaluate the cost of each scenario, e.g.

cost ( buy | defect ) = 100.0 cost ( buy | no defect ) = −20.0

cost ( not buy | defect ) = 0.0 cost ( not buy | no defect ) = 0.0

and take the action with the lowest expected cost.

22/29

From Classification to Cost Minimization

Classification Accuracy Maximization:

“classify as ωi” where i = argmax

j

P(ωj |x)

Expected Cost Minimization: Let (αi )i be the set of actions. The

expected cost of taking a certain action is given by

R(αj |x) =

C�

k=1

λ(αj |ωk)P(ωk |x)

where λ(αj ,ωk) is the cost of action αj given the class ωk . We then

decide to

“take action αi” where i = argmin

j

R(αj |x)

23/29

Classification Accuracy as a Special Case

Recall: R(αj |x) =

�C

k=1 λ(αj |ωk)P(ωk |x)

Show that the problem of maximum accuracy classification is a special

instance of expected cost minimization with a particular set of actions

(αj )j and a particular cost function λ(αj |ωk).

24/29

Analyzing the Two-Class Case

Recall: R(αj |x) =

�C

k=1 λ(αj |ωk)P(ωk |x)

Show that in the two-action two-class case, changing the cost function

means shifting the decision boundary.

(Note: when differences of lambdas are negative,

we need to stop before applying the log)

25/29

Measuring Classification Error

So far, we have studied what the decision boundary should be in order to

predict optimally.

However, in certain cases, it is also important to determine what is the

expected error of the classifier (e.g. to determine whether the classifier

is good enough for practical use).

The expected error is the probability that the data is of a different class

than the one predicted:

P(Err | x) =

�

P(ω1 | x) if “decide ω2”

P(ω2 | x) if “decide ω1”

For a maximally accurate classifier, this reduces to

P(Err | x) = min{P(ω1 | x),P(ω2 | x)}

26/29

Measuring Classification Error

The expected error of this maximally accurate classifier is computed as

the integral of its error probability over the distribution p(x).

P(Err) =

�

x

P(Err | x)p(x)dx

=

�

x

min{P(ω1 | x),P(ω2 | x)}p(x)dx

This is also known as the Bayes error rate.

Generally, this integral cannot be solved analytically, because of the min

function. Error must instead be bounded or evaluated empirically.

27/29

Bounding the Error of a Classifier

Very basic bound

Observe for binary classification that P(ω1 | x) = 1− P(ω2 | x).

The error can be bounded as:

P(Err) =

�

x

min{P(ω1 | x),P(ω2 | x)}p(x)dx

≤

�

x

0.5 p(x)dx = 0.5

The optimal classifier predicts the correct class at least 50% of the time.

28/29

Slightly Better Bound

Recall: P(ωj | x) =

p(x |ωj ) · P(ωj )

p(x)

The error can be bounded as:

P(Err) =

�

x

min{P(ω1 | x),P(ω2 | x)}p(x)dx

=

�

x

min{p(x |ω1)P(ω1), p(x |ω2)P(ω2)}dx

≤

�

x

max{p(x |ω1), p(x |ω2)}min{P(ω1),P(ω2)}dx

≤ 2min{P(ω1),P(ω2)}

Additional insight: The optimal classifier improves its accuracy when one

class prior probability is strongly dominant over to the other class.

29/29

Summary

� Bayesian decision theory allows to build optimal classifiers based on

prior knowledge about the classes and the data distributions.

� For certain simple data distributions, optimal classifiers are linear.

� The framework is general enough to extend to multi-class

scenarios and user-defined cost functions.

� Error of the optimal classifier is typically harder to compute than

the actual decision function, but it is still possible to compute

bounds on the error.

2020_TUB10

Wojciech Samek & Grégoire Montavon

ML1 Lecture 10: Kernel Ridge Regression 2

ML1 Lecture 10: Kernel Ridge Regression 3

ML1 Lecture 10: Kernel Ridge Regression 4

ML1 Lecture 10: Kernel Ridge Regression 5

ML1 Lecture 10: Kernel Ridge Regression 6

ML1 Lecture 10: Kernel Ridge Regression 7

ML1 Lecture 10: Kernel Ridge Regression 8

ML1 Lecture 10: Kernel Ridge Regression 9

ML1 Lecture 10: Kernel Ridge Regression 10

ML1 Lecture 10: Kernel Ridge Regression 11

ML1 Lecture 10: Kernel Ridge Regression 12

ML1 Lecture 10: Kernel Ridge Regression 13

ML1 Lecture 10: Kernel Ridge Regression 14

ML1 Lecture 10: Kernel Ridge Regression 15

ML1 Lecture 10: Kernel Ridge Regression 16

ML1 Lecture 10: Kernel Ridge Regression 17

ML1 Lecture 10: Kernel Ridge Regression 18

ML1 Lecture 10: Kernel Ridge Regression 19

ML1 Lecture 10: Kernel Ridge Regression 20

ML1 Lecture 10: Kernel Ridge Regression 21

ML1 Lecture 10: Kernel Ridge Regression 22

ML1 Lecture 10: Kernel Ridge Regression 23

ML1 Lecture 10: Kernel Ridge Regression 24

ML1 Lecture 10: Kernel Ridge Regression 25

ML1 Lecture 10: Kernel Ridge Regression 26

ML1 Lecture 10: Kernel Ridge Regression 27

ML1 Lecture 10: Kernel Ridge Regression 28

ML1 Lecture 10: Kernel Ridge Regression 29

ML1 Lecture 10: Kernel Ridge Regression 30

ML1 Lecture 10: Kernel Ridge Regression 31

ML1 Lecture 10: Kernel Ridge Regression 32

ML1 Lecture 10: Kernel Ridge Regression 33

ML1 Lecture 10: Kernel Ridge Regression 34

ML1 Lecture 10: Kernel Ridge Regression 35

ML1 Lecture 10: Kernel Ridge Regression 36

ML1 Lecture 10: Kernel Ridge Regression 37

ML1 Lecture 10: Kernel Ridge Regression 38

ML1 Lecture 10: Kernel Ridge Regression 39

ML1 Lecture 10: Kernel Ridge Regression 40

ML1 Lecture 10: Kernel Ridge Regression 41

ML1 Lecture 10: Kernel Ridge Regression 42

ML1 Lecture 10: Kernel Ridge Regression 43

ML1 Lecture 10: Kernel Ridge Regression 44

ML1 Lecture 10: Kernel Ridge Regression 45