
Machine Learning

p(θ | D) = p(D | θ) p(θ) / ∫ p(D | θ) p(θ) dθ

Likelihood: How much evidence is there in the data for a specific hypothesis?
Prior: What are my beliefs about the different hypotheses?
Posterior: What is my updated belief after having seen the data?
Evidence: What is my belief about any observations?

Likelihood
p(y | f, x)
How strongly do I believe that the data y I see comes from the function f at input x?

Prior
p(f)
Before I have seen the data, how much do I believe in each possible function?

Posterior
p(f | y)
Given that I have seen observations y, what is my updated belief about the function f?

Bayes Rule

p(f | y) = p(y | f, x) p(f) / p(y)

where p(y | f, x) is the Likelihood, p(f) is the Prior, and p(y) is the Marginal Likelihood.

Model: we define the Likelihood and the Prior
Inference: we compute the Posterior through marginalisation of our beliefs

Marginal Likelihood/Evidence
p(y) = ∫ p(y | f) p(f) df

Given different data sets y_i, how much evidence do they provide for the model?
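One way to make the evidence concrete is a simple Monte Carlo estimate: draw hypotheses from the prior and average the likelihood of the observed data. Below is a minimal sketch (not from the slides) for a one-parameter linear model; the data, the noise precision β and the prior precision α are illustrative choices.

import numpy as np

rng = np.random.default_rng(0)

# Toy data set (illustrative values only)
x = np.array([-1.0, 0.0, 1.0, 2.0])
y = np.array([-0.9, 0.1, 1.2, 1.8])

beta = 4.0   # assumed noise precision
alpha = 1.0  # assumed prior precision on the single weight

def log_likelihood(w):
    # log p(y | w, x) for the model y_n = w * x_n + noise with precision beta
    resid = y - w * x
    return 0.5 * len(y) * np.log(beta / (2 * np.pi)) - 0.5 * beta * np.sum(resid ** 2)

# Monte Carlo estimate of p(y) = integral of p(y | w) p(w) dw:
# average the likelihood over draws from the prior
w_samples = rng.normal(0.0, 1.0 / np.sqrt(alpha), size=100_000)
evidence = np.mean([np.exp(log_likelihood(w)) for w in w_samples])
print(f"Monte Carlo estimate of the evidence p(y): {evidence:.4g}")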

Regression Model
y = x · w ± 15

Linear Regression

Likelihood

y = w^T φ(x) + ε
y − w^T φ(x) = ε

y − w^T φ(x) ∼ N(ε | 0, β^{-1} I) = (β / 2π)^{1/2} exp( −1/2 (ε − 0)^T β (ε − 0) )

⇒ N(y − w^T φ(x) | 0, β^{-1} I) = (β / 2π)^{1/2} exp( −1/2 (y − w^T φ(x))^T β (y − w^T φ(x)) )

⇒ N(y − w^T φ(x) | 0, β^{-1} I) = N(y | w^T φ(x), β^{-1} I)

⇒ p(y | w, x) = N(y | w^T φ(x), β^{-1} I)
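To make the noise model concrete, here is a small numerical sketch (not from the slides) that generates targets as y = w^T φ(x) + ε and evaluates the resulting Gaussian log-likelihood; the polynomial basis φ, the weights and β are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(1)

def phi(x):
    # Illustrative polynomial basis: phi(x) = [1, x, x^2]
    return np.stack([np.ones_like(x), x, x ** 2], axis=-1)

beta = 25.0                          # assumed noise precision
w_true = np.array([0.5, -1.0, 0.3])  # illustrative weights

x = np.linspace(-2, 2, 20)
Phi = phi(x)                                                      # N x M design matrix
y = Phi @ w_true + rng.normal(0, 1 / np.sqrt(beta), size=len(x))  # y = w^T phi(x) + eps

def log_likelihood(w):
    # log N(y | Phi w, beta^{-1} I)
    resid = y - Phi @ w
    return 0.5 * len(y) * np.log(beta / (2 * np.pi)) - 0.5 * beta * resid @ resid

print(log_likelihood(w_true), log_likelihood(np.zeros(3)))  # the true weights score higher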

Linear Regression

• Likelihood is Gaussian in w
  p(y | w, x) = N(y | w^T φ(x), β^{-1} I)
• Conjugate Prior
  p(w) = N(w | m_0, S_0)
• Posterior
  p(w | y) = ?

Definition (Conjugate Prior)
In Bayesian probability theory, if the posterior distribution p(θ | D) is in the same probability distribution family as the prior probability distribution p(θ), the prior and posterior are then called conjugate distributions, and the prior is called a conjugate prior for the likelihood function p(D | θ).
A conjugate prior is an algebraic convenience, giving a closed-form expression for the posterior; otherwise numerical integration may be necessary. Further, conjugate priors may give intuition, by more transparently showing how a likelihood function updates a prior distribution.

p(θ | D) = p(D | θ) p(θ) / p(D) = p(D | θ) p(θ) / ∫ p(D | θ) p(θ) dθ ∝ p(D | θ) p(θ)

The integral in the denominator is the hard part.

• We know the functional form of the posterior
• We know that the posterior is proportional to the likelihood times the prior
• Use these facts to avoid the integral

Conjugacy Example I
Beta(μ | a, b) = Γ(a + b) / (Γ(a)Γ(b)) μ^{a−1} (1 − μ)^{b−1}
• The Beta distribution is the conjugate prior of the Bernoulli likelihood parameter μ

Conjugacy Example I

p(μ | D) = p(D | μ) p(μ) / p(D) ∝ p(D | μ) p(μ)
         = ∏_i Bern(x_i | μ) Beta(μ | a, b)
         = ∏_i μ^{x_i} (1 − μ)^{1 − x_i} · Γ(a + b) / (Γ(a)Γ(b)) μ^{a−1} (1 − μ)^{b−1}
         = μ^{Σ_i x_i} (1 − μ)^{Σ_i (1 − x_i)} · Γ(a + b) / (Γ(a)Γ(b)) μ^{a−1} (1 − μ)^{b−1}
         = Γ(a + b) / (Γ(a)Γ(b)) μ^{Σ_i x_i + a − 1} (1 − μ)^{Σ_i (1 − x_i) + b − 1}

Conjugacy Example I

p(μ | D) = p(D | μ) p(μ) / p(D) = [ Γ(a + b) / (Γ(a)Γ(b)) μ^{Σ_i x_i + a − 1} (1 − μ)^{Σ_i (1 − x_i) + b − 1} ] / ∫ p(D | μ) p(μ) dμ

• we still do not know the normaliser/evidence
• conjugacy means that we know its form

Conjugacy Example I

p(μ | D) ∝ μ^{Σ_i x_i + a − 1} (1 − μ)^{Σ_i (1 − x_i) + b − 1}
         = Γ(a_n + b_n) / (Γ(a_n)Γ(b_n)) μ^{a_n − 1} (1 − μ)^{b_n − 1}
         = Beta(μ | a_n, b_n),   with a_n = Σ_i x_i + a and b_n = Σ_i (1 − x_i) + b

• we know the normaliser of a Beta distribution
• now we can avoid computing the integral
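Since the posterior is again a Beta distribution, updating it is just bookkeeping on the counts. A minimal sketch of this conjugate update (the prior parameters a, b and the data-generating μ are illustrative):

import numpy as np

rng = np.random.default_rng(2)

a, b = 2.0, 2.0                    # assumed Beta prior parameters
x = rng.binomial(1, 0.7, size=50)  # Bernoulli draws with (pretend-unknown) mu = 0.7

# Conjugate update: a_n = sum_i x_i + a,  b_n = sum_i (1 - x_i) + b
a_n = x.sum() + a
b_n = (1 - x).sum() + b

print(f"posterior is Beta({a_n:.0f}, {b_n:.0f}) with mean {a_n / (a_n + b_n):.3f}")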

Conjugacy Example II
p(x_1 | x_2) = p(x_1, x_2) / p(x_2) = p(x_1, x_2) / ∫ p(x_1, x_2) dx_1

p(x_1, x_2) = p(x_1 | x_2) p(x_2)

• The conjugate prior to the mean of the Gaussian distribution is a Gaussian

Conjugacy Example II
• Joint

p(x_1, x_2) = N( [x_1; x_2] | [μ_1; μ_2], [[Σ_11, Σ_12], [Σ_21, Σ_22]] )
            ∝ exp( −1/2 [x_1 − μ_1; x_2 − μ_2]^T [[Σ_11, Σ_12], [Σ_21, Σ_22]]^{-1} [x_1 − μ_1; x_2 − μ_2] )

• Marginal

p(x_2) = (2π)^{−D/2} |Σ_22|^{−1/2} exp( −1/2 (x_2 − μ_2)^T Σ_22^{-1} (x_2 − μ_2) )
       ∝ exp( −1/2 (x_2 − μ_2)^T Σ_22^{-1} (x_2 − μ_2) )

Exponential
• Identifying the exponent leads to

p(x_1 | x_2) ∝ exp( −1/2 (x_1 − (μ_1 + Σ_12 Σ_22^{-1} (x_2 − μ_2)))^T (Σ_11 − Σ_12 Σ_22^{-1} Σ_21)^{-1} (x_1 − (μ_1 + Σ_12 Σ_22^{-1} (x_2 − μ_2))) )

• Through conjugacy we know the functional form of the conditional distribution

p(x_1 | x_2) = N( x_1 | μ_1 + Σ_12 Σ_22^{-1} (x_2 − μ_2), Σ_11 − Σ_12 Σ_22^{-1} Σ_21 ),

with the first argument the conditional mean and the second the conditional covariance.
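A small numerical sketch of the conditioning result above; the joint mean and covariance are arbitrary illustrative values, and the conditional mean and covariance come directly from the closed form.

import numpy as np

# Illustrative 2-D joint Gaussian, split into blocks x1 (first dim) and x2 (second dim)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.5]])

mu1, mu2 = mu[:1], mu[1:]
S11, S12 = Sigma[:1, :1], Sigma[:1, 1:]
S21, S22 = Sigma[1:, :1], Sigma[1:, 1:]

x2 = np.array([0.5])  # the value we condition on

# p(x1 | x2) = N(x1 | mu1 + S12 S22^{-1} (x2 - mu2), S11 - S12 S22^{-1} S21)
cond_mean = mu1 + S12 @ np.linalg.solve(S22, x2 - mu2)
cond_cov = S11 - S12 @ np.linalg.solve(S22, S21)
print("conditional mean:", cond_mean, "conditional covariance:", cond_cov)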

• We know that the prior p(θ) and the posterior p(θ | D) are in the same functional family
• We know that the posterior is proportional to the prior times the likelihood: p(θ | D) ∝ p(D | θ) p(θ)
• This means we can just identify the parameters and get the posterior without performing integration

Reflections
1. what happens if you choose a prior that is very confident at a place far from the true value?
2. what happens if you choose a prior that is very confident at the true value?
3. in the plot above we cannot see the order of the red lines; to do so, create a plot where the x-axis is the number of data points used to compute the posterior and the y-axis is the distance from the posterior mean to the prior mean.
4. how much does the order of the data points matter? Redo the plot above with many different random permutations of the data, then plot the mean and the variance of each iteration. What can you see?

Reflections
1. what is the most likely line according to your prior belief?
2. what is the least likely line according to your prior belief?
3. are there any lines that have zero probability under this belief?

p(w | y, X) = N( w | (S_0^{-1} + β X^T X)^{-1} (S_0^{-1} w_0 + β X^T y), (S_0^{-1} + β X^T X)^{-1} )

Reflections
1. what would happen if you assume a noise-free situation, i.e. β → ∞?
2. what would happen if we assume a zero-mean prior?
3. what happens if we do not observe any data?
4. when you observe more and more data, which terms are going to dominate the posterior?
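As a sanity check of the posterior above, here is a sketch that computes its mean and covariance for a toy line-fitting problem; the design matrix has a bias column, and the prior and noise settings are illustrative.

import numpy as np

rng = np.random.default_rng(3)

# Toy line-fitting data (illustrative)
N = 30
x = rng.uniform(-1, 1, size=N)
X = np.column_stack([np.ones(N), x])   # bias + slope design matrix
beta = 25.0                            # assumed noise precision
y = X @ np.array([-0.3, 0.8]) + rng.normal(0, 1 / np.sqrt(beta), size=N)

# Prior p(w) = N(w | w0, S0) (illustrative settings)
w0 = np.zeros(2)
S0 = 0.5 * np.eye(2)

# Posterior: SN = (S0^{-1} + beta X^T X)^{-1},  mN = SN (S0^{-1} w0 + beta X^T y)
SN = np.linalg.inv(np.linalg.inv(S0) + beta * X.T @ X)
mN = SN @ (np.linalg.solve(S0, w0) + beta * X.T @ y)
print("posterior mean:", mN)
print("posterior covariance:\n", SN)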

Reflections
1. our prior is spherical, what assumption does this encode? Does this make sense for a line?
2. with a few data points the posterior quickly starts to look non-spherical, what does this mean? Does this make sense?
3. with many data points the posterior becomes spherical again. Why is this? Look at Eq. 24: can you see why this is the case for this data?

Does this make sense?

Posterior Variance   S_N = (αI + β X^T X)^{-1}
Posterior Mean       m_N = (αI + β X^T X)^{-1} β X^T y

Posterior Variance

S_N = (αI + β X^T X)^{-1}
    = ( αI + β [[N, Σ_i x_i], [Σ_i x_i, Σ_i x_i^2]] )^{-1}
    = [[βN + α, β Σ_i x_i], [β Σ_i x_i, α + β Σ_i x_i^2]]^{-1}
    = 1 / ( (βN + α)(α + β Σ_i x_i^2) − (β Σ_i x_i)^2 ) · [[α + β Σ_i x_i^2, −β Σ_i x_i], [−β Σ_i x_i, βN + α]]

Posterior Variance

S_N = 1 / ( (βN + α)(α + β Σ_i x_i^2) − (β Σ_i x_i)^2 ) · [[α + β Σ_i x_i^2, −β Σ_i x_i], [−β Σ_i x_i, βN + α]]

• Let's assume the input is centered ⇒ Σ_i x_i = 0

S_N = 1 / ( (βN + α)(α + β Σ_i x_i^2) ) · [[α + β Σ_i x_i^2, 0], [0, βN + α]]
    = [[1 / (βN + α), 0], [0, 1 / (α + β Σ_i x_i^2)]]

Posterior Mean

m_N = (αI + β X^T X)^{-1} β X^T y
    = β S_N [[1, …, 1], [x_1, …, x_N]] [y_1; …; y_N]
    = β S_N [Σ_i y_i; Σ_i y_i x_i]

Posterior Mean

m_N = β S_N [Σ_i y_i; Σ_i y_i x_i]

• Let's assume the input is centered ⇒ Σ_i x_i = 0

m_N = β [[1 / (βN + α), 0], [0, 1 / (α + β Σ_i x_i^2)]] [Σ_i y_i; Σ_i y_i x_i]
    = [ β Σ_i y_i / (βN + α); β Σ_i y_i x_i / (α + β Σ_i x_i^2) ]

Posterior Mean Slope

w̃_0 = β Σ_i y_i / (βN + α)

p(w_0) = N(w_0 | 0, 1/α)
p(ε) = N(ε | 0, 1/β)
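A quick numerical check of the centred-data expressions above: build S_N and m_N directly from (αI + βX^TX)^{-1} and compare them with the closed forms (the data and hyperparameters are illustrative).

import numpy as np

rng = np.random.default_rng(4)

N, alpha, beta = 40, 2.0, 25.0
x = rng.uniform(-1, 1, size=N)
x = x - x.mean()                      # centre the inputs so that sum_i x_i = 0
y = 0.5 + 1.5 * x + rng.normal(0, 1 / np.sqrt(beta), size=N)

X = np.column_stack([np.ones(N), x])
SN = np.linalg.inv(alpha * np.eye(2) + beta * X.T @ X)
mN = beta * SN @ X.T @ y

# Closed forms for centred inputs
SN_closed = np.diag([1 / (beta * N + alpha), 1 / (alpha + beta * np.sum(x ** 2))])
mN_closed = np.array([beta * y.sum() / (beta * N + alpha),
                      beta * np.sum(y * x) / (alpha + beta * np.sum(x ** 2))])

print(np.allclose(SN, SN_closed), np.allclose(mN, mN_closed))  # both should print True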

p(θ | y) = p(y | θ) p(θ) / ∫ p(y | θ) p(θ) dθ

Likelihood: How much evidence is there in the data for a specific hypothesis?
Prior: What are my beliefs about the different hypotheses?
Posterior: What is my updated belief after having seen the data?
Evidence: What is my belief about the data?

The Compute: Evidence
p(y) = ∫ p(y | θ) p(θ) dθ

Regression Model

Which Parametrisation
• Should I use a line, a polynomial, or quadratic basis functions?
• How many basis functions should I use?
• The likelihood won't help me
• How do we proceed?

Regression Models
Linear Model:          p(y_i | x_i, w) = N(w_0 + w_1 · x_i, β^{-1})
Basis function Model:  p(y_i | x_i, w) = N(Σ_j w_j φ_j(x_i), β^{-1})
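The two parametrisations differ only in the design matrix fed to the same Gaussian likelihood. A minimal sketch contrasting the two (the Gaussian-bump basis functions are an illustrative choice):

import numpy as np

x = np.linspace(-1, 1, 5)

# Linear model: features [1, x], so that E[y_i] = w_0 + w_1 * x_i
Phi_linear = np.column_stack([np.ones_like(x), x])

# Basis-function model: e.g. Gaussian bumps phi_j(x) = exp(-(x - c_j)^2 / (2 * 0.3^2))
centres = np.linspace(-1, 1, 4)
Phi_rbf = np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2 * 0.3 ** 2))

print(Phi_linear.shape, Phi_rbf.shape)  # (5, 2) and (5, 4)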

p(Y) = ∫ p(Y | W) p(W) dW

Probabilities are a zero-sum game

Model Selection1
1 D Thesis

Occam's Razor

Occam's Razor
Definition (Occam's Razor)
“All things being equal, the simplest solution tends to be the best one” – William of Ockham

MacKay, 1991

Image Segmentation

Markov Random Field
[Figure: grid-structured Markov random field with latent binary pixels x_0, …, x_8 and corresponding observed pixels y_0, …, y_8]

Markov Random Field

p(y) = Σ_x p(y | x) p(x) = Σ_{i=1}^{N} p(y | x_i) p(x_i)

• x_i is a specific binary image
• Let's take a 2009 iPhone with a 3 Megapixel camera
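The sum has one term per possible binary image, which is only feasible for tiny images. The sketch below (not from the slides) enumerates all binary images of a 3×3 "camera", i.e. 2^9 = 512 terms, under an illustrative Gaussian observation model and a uniform prior, just to show the computation that explodes at real resolutions.

import itertools
import numpy as np

n_pixels = 9                      # a tiny 3x3 "camera": 2**9 = 512 possible binary images
y = np.array([0.9, 0.1, 0.8, 0.2, 0.7, 0.4, 0.6, 0.3, 0.5])  # illustrative noisy observation
sigma = 0.3                       # assumed observation noise

def likelihood(y, x):
    # Illustrative p(y | x): independent Gaussian noise around each binary pixel
    return np.prod(np.exp(-(y - x) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2))

prior = 1.0 / 2 ** n_pixels       # uniform prior over all binary images

# p(y) = sum_i p(y | x_i) p(x_i): one term for every possible binary image
evidence = sum(likelihood(y, np.array(xi)) * prior
               for xi in itertools.product([0, 1], repeat=n_pixels))
print(f"evidence from {2 ** n_pixels} terms: {evidence:.3e}")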

Number of terms

[Several slides of digits illustrating how astronomically large the number of terms in this sum is.]

• Possible black and white 3 Megapixel images
  2^3145728
• Number of atoms in the universe
  10^80 ≈ (2^(10/3))^80 ≈ 2^267
• Age of the universe in seconds
  4.35 · 10^17 ≈ 2^59
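These powers of two are easy to verify with a few lines of arithmetic (not from the slides):

import math

n_pixels = 3 * 1024 * 1024                  # pixels in a 3 Megapixel binary image
print(n_pixels)                             # 3145728, so there are 2**3145728 such images
print(math.ceil(n_pixels * math.log10(2)))  # that count has about 946959 decimal digits

print(80 / math.log10(2))   # 10**80 atoms is roughly 2**266 (the slide rounds to 2**267)
print(math.log2(4.35e17))   # 4.35e17 seconds is roughly 2**59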

Approximate Inference
• When we have non-conjugate models we have to compute the marginal likelihood
• We have to approximate the computation
• Deterministic approximations
• Stochastic approximations

• How to quantify beliefs using distributions
• How to update beliefs using data
• Conjugacy for tractable posteriors

Dual Linear Regression
p(y | w, x) = ∏_n p(y_n | w, x_n) = ∏_n N(y_n | w^T x_n, σ^2 I)

p(w) = N(0, τ^2 I)
p(w | y, x) ∝ p(y | w, x) p(w)

Dual Linear Regression
p(w | y, x) ∝ ∏_n exp( −1/(2σ^2) (w^T x_n − y_n)^2 ) · exp( −1/(2τ^2) w^T w )
            ∝ exp( −1/(2σ^2) (Xw − y)^T (Xw − y) ) · exp( −1/(2τ^2) w^T w )

• Let's maximise the above to find a point estimate (not a distribution) of w

Dual Linear Regression

−log p(w | y, x) = J(w) = 1/2 (Xw − y)^T (Xw − y) + λ/2 w^T w   (up to constants, with λ = σ^2/τ^2)

• Find a stationary point in w

∂J(w)/∂w = X^T (Xw − y) + λ w = 0
⇒ w = −1/λ X^T (Xw − y) = X^T a = Σ_n α_n x_n

Dual Linear Regression

J(w) = 1/2 (Xw − y)^T (Xw − y) + λ/2 w^T w,   w = X^T a

• Rewrite the objective in terms of a

J(a) = 1/2 a^T X X^T X X^T a − a^T X X^T y + 1/2 y^T y + λ/2 a^T X X^T a

Dual Linear Regression
[K]_ij = x_i^T x_j

J(a) = 1/2 a^T K K a − a^T K y + 1/2 y^T y + λ/2 a^T K a

• K is a matrix of all inner products between the data points

Dual Linear Regression

α_n = −1/λ (w^T x_n − y_n)
w = Σ_n α_n x_n = X^T a

• Eliminate w and rewrite in terms of a

⇒ a = (K + λI)^{-1} y

Dual Linear Regression

[K]_ij = x_i^T x_j

J(a) = 1/2 a^T K K a − a^T K y + 1/2 y^T y + λ/2 a^T K a

a = (K + λI)^{-1} y

y(x*) = w^T x* = a^T X x* = a^T k(x, x*)
      = ((K + λI)^{-1} y)^T k(x, x*) = k(x*, x) (K + λI)^{-1} y
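Putting the dual solution together: a compact sketch of dual (kernel) ridge regression with the linear kernel [K]_ij = x_i^T x_j, checked against the primal solution w = X^T a; the data and λ are illustrative.

import numpy as np

rng = np.random.default_rng(5)

# Toy 1-D data stored as an N x D matrix (D = 1 here); values are illustrative
N = 25
X = rng.uniform(-1, 1, size=(N, 1))
y = 0.7 * X[:, 0] + rng.normal(0, 0.1, size=N)
lam = 0.5                                      # illustrative regulariser lambda

K = X @ X.T                                    # [K]_ij = x_i^T x_j (linear kernel)
a = np.linalg.solve(K + lam * np.eye(N), y)    # a = (K + lambda I)^{-1} y

def predict(X_star):
    # y(x*) = k(x*, x) (K + lambda I)^{-1} y, with k(x*, x)_n = x*^T x_n
    return (X_star @ X.T) @ a

X_star = np.array([[0.0], [0.5], [1.0]])
print(predict(X_star))

# The primal solution w = X^T a gives the same predictions
w = X.T @ a
print(X_star @ w)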

What have we actually done
• Linear Regression
  • See data
  • Encode relationship between variates using parameters w
  • Make predictions using w
• Dual Linear Regression
  • See data
  • Encode relationship between variates
