# MLE

The usual representation we come across is a probability density function:

$$P(X = x \mid \theta)$$

But what if we know that $x_1, x_2, \ldots, x_n \sim N(\mu, \sigma^2)$, but we don't know $\mu$?

We can set up a likelihood equation, $P(x \mid \mu, \sigma)$, and find the value of $\mu$ that maximizes it.

These slides are derived from course notes at http://www.cs.cmu.edu/~tom/10601_sp08/slides/recitation-mle-nb.pdf

MLE OF MU

Since the $x_i$'s are independent and from the same distribution,

$$p(x \mid \mu, \sigma^2) = \prod_{i=1}^{n} p(x_i \mid \mu, \sigma^2)$$

Taking the log likelihood (we get to do this since log is monotonic) and removing some constants:

$$L(x) = \prod_{i=1}^{n} p(x_i \mid \mu, \sigma^2) = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^{n} \prod_{i=1}^{n} \exp\!\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right)$$

$$\log(L(x)) = l(x) \propto -\sum_{i=1}^{n} (x_i - \mu)^2$$

CALCULUS

We can take the derivative of this with respect to $\mu$ and set it equal to zero to maximize:

$$\frac{dl(x)}{d\mu} = \frac{d}{d\mu}\left(-\sum_{i=1}^{n} \left(x_i^2 - 2x_i\mu + \mu^2\right)\right) = 2\sum_{i=1}^{n} (x_i - \mu)$$

$$\sum_{i=1}^{n} (x_i - \mu) = 0 \;\Rightarrow\; n\mu = \sum_{i=1}^{n} x_i \;\Rightarrow\; \hat{\mu} = \frac{\sum_{i=1}^{n} x_i}{n}$$
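As a quick sanity check (not from the original slides), here is a minimal Python sketch, using synthetic Gaussian data, confirming that the sample mean maximizes the Gaussian log-likelihood:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=2.0, size=1000)  # x_i ~ N(mu=5, sigma^2=4)

def log_likelihood(mu, x, sigma=2.0):
    # Gaussian log-likelihood up to an additive constant: -sum((x_i - mu)^2) / (2 sigma^2)
    return -np.sum((x - mu) ** 2) / (2 * sigma ** 2)

# Evaluate over a grid of candidate mu values and pick the maximizer
grid = np.linspace(0, 10, 10001)
mu_hat_grid = grid[np.argmax([log_likelihood(m, x) for m in grid])]

print(mu_hat_grid)   # ~ the sample mean
print(x.mean())      # closed-form MLE: sum(x_i) / n
```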

Maximum a posteriori (MAP)

What if you have some ideas about your parameter?

We can use Bayes’ Rule:

$$P(\theta \mid x) = \frac{P(x \mid \theta)\,P(\theta)}{P(x)} = \frac{P(x \mid \theta)\,P(\theta)}{\sum_{\theta} P(\theta, x)} = \frac{P(x \mid \theta)\,P(\theta)}{\sum_{\theta} P(x \mid \theta)\,P(\theta)}$$

Maximum a posteriori (MAP)

This is just maximizing the numerator, since the

denominator is a normalizing constant.

This assumes we know the prior distribution.

$$\arg\max_{\theta} P(\theta \mid x) = \arg\max_{\theta} \frac{P(x \mid \theta)\,P(\theta)}{\sum_{\theta'} P(x \mid \theta')\,P(\theta')} = \arg\max_{\theta} P(x \mid \theta)\,P(\theta)$$
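A minimal sketch (not from the slides) of MAP estimation for the Gaussian mean: it assumes a known likelihood variance and a Gaussian prior $N(\mu_0, \tau^2)$ on $\mu$ (my own toy choices), and compares the MAP estimate to the MLE.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = 2.0                      # known likelihood std dev
x = rng.normal(loc=5.0, scale=sigma, size=20)

mu0, tau = 0.0, 1.0              # prior on mu: N(mu0, tau^2)

def log_posterior(mu):
    # log P(mu | x) up to a constant: log-likelihood + log-prior
    return (-np.sum((x - mu) ** 2) / (2 * sigma ** 2)
            - (mu - mu0) ** 2 / (2 * tau ** 2))

grid = np.linspace(-2, 8, 20001)
mu_map_grid = grid[np.argmax([log_posterior(m) for m in grid])]

# Closed form for the Gaussian-Gaussian conjugate pair
n = len(x)
mu_map = (sigma ** 2 * mu0 + tau ** 2 * x.sum()) / (sigma ** 2 + n * tau ** 2)

print(mu_map_grid, mu_map)   # MAP estimate (pulled toward the prior mean mu0)
print(x.mean())              # MLE ignores the prior
```

Because the prior is informative and the sample is small, the MAP estimate sits between the prior mean and the sample mean.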

Computing MAP

- Analytical solution
- Numerical optimization, e.g. gradient descent (see the sketch below)
- EM algorithm
- Monte Carlo optimization, such as simulated annealing
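To illustrate the numerical-optimization option above, a minimal sketch of my own (not from the slides): gradient ascent on the same Gaussian-with-Gaussian-prior log posterior as in the previous sketch (equivalently, gradient descent on its negative).

```python
import numpy as np

rng = np.random.default_rng(2)
sigma, mu0, tau = 2.0, 0.0, 1.0
x = rng.normal(loc=5.0, scale=sigma, size=20)

def grad_log_posterior(mu):
    # d/dmu [log-likelihood + log-prior]
    return np.sum(x - mu) / sigma ** 2 - (mu - mu0) / tau ** 2

mu = 0.0                   # starting point
lr = 0.05                  # step size
for _ in range(500):       # simple fixed-iteration gradient ascent
    mu += lr * grad_log_posterior(mu)

print(mu)  # converges to the closed-form MAP estimate
```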

WHAT WE CAN DO NOW

MAP is the foundation for Naive Bayes classifiers.

Here, we’re assuming our data are drawn from two

“classes”. We have a bunch of data where we know the

class, and want to be able to predict P(class|data-point).

So, we use empirical probabilities estimated from the training data.

In NB, we also make the assumption that the features

are conditionally independent.

$$\text{prediction} = \arg\max_{c} P(C = c \mid X = x) \approx \arg\max_{c} \hat{P}(X = x \mid C = c)\,\hat{P}(C = c)$$

SPAM FILTERING

Suppose we wanted to build a spam filter. To use the

“bag of words” approach, assuming that n words in an

email are conditionally independent, we’d get:

$$\hat{P}(\text{spam} \mid w) \propto \hat{P}(\text{spam}) \prod_{i=1}^{n} \hat{P}(w_i \mid \text{spam})$$

$$\hat{P}(\neg\text{spam} \mid w) \propto \hat{P}(\neg\text{spam}) \prod_{i=1}^{n} \hat{P}(w_i \mid \neg\text{spam})$$

Whichever one's bigger wins!
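A minimal bag-of-words sketch of this decision rule, my own illustration with made-up toy emails and add-one (Laplace) smoothing, which the slides don't mention:

```python
import math
from collections import Counter

# Toy labeled data (hypothetical tokenized emails)
train = [
    (["win", "money", "now"], "spam"),
    (["cheap", "money", "offer"], "spam"),
    (["meeting", "schedule", "today"], "ham"),
    (["project", "meeting", "notes"], "ham"),
]

labels = [y for _, y in train]
prior = {c: labels.count(c) / len(labels) for c in set(labels)}   # P_hat(class)
word_counts = {c: Counter() for c in prior}                       # counts for P_hat(word | class)
for words, y in train:
    word_counts[y].update(words)
vocab = {w for words, _ in train for w in words}

def log_score(words, c):
    # log P_hat(c) + sum_i log P_hat(w_i | c); add-one smoothing avoids log(0)
    total = sum(word_counts[c].values())
    score = math.log(prior[c])
    for w in words:
        score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
    return score

email = ["cheap", "meeting", "money"]
print(max(prior, key=lambda c: log_score(email, c)))  # whichever class scores higher wins
```

Working in log space avoids underflow when multiplying many small word probabilities.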

ECE 750 T17

Bayes’ Classifier


$$p(w_j \mid x) = \frac{p(x \mid w_j)\,p(w_j)}{p(x)}$$

Here $p(w_j \mid x)$ is the posterior, $p(x \mid w_j)$ is the likelihood (class-conditional), $p(w_j)$ is the prior, and $p(x)$ is the evidence.

For $c$ classes, the evidence is the normalization term:

$$p(x) = \sum_{j=1}^{c} p(x \mid w_j)\,p(w_j)$$
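A minimal numeric sketch (toy numbers of my own, not from the slides) of computing the posteriors over $c$ classes with this normalization term:

```python
import numpy as np

# Hypothetical class-conditional likelihoods p(x | w_j) at some observed x, and priors p(w_j)
likelihood = np.array([0.05, 0.20, 0.10])   # p(x | w_1), p(x | w_2), p(x | w_3)
prior = np.array([0.5, 0.3, 0.2])           # p(w_1), p(w_2), p(w_3)

evidence = np.sum(likelihood * prior)       # p(x) = sum_j p(x | w_j) p(w_j)
posterior = likelihood * prior / evidence   # p(w_j | x)

print(posterior)           # sums to 1
print(posterior.argmax())  # index of the MAP class
```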


Maximum a Posteriori (MAP) Classifier

Decide $w_1$ if $p(w_1 \mid x) > p(w_2 \mid x)$; otherwise decide $w_2$.

Assumes $p(x \mid w_j)$ and $p(w_j)$ are known.

likelihood $\equiv p(x \mid w)$: a density w.r.t. $x$ (area under the curve is 1)

posterior $\equiv p(w \mid x)$: a mass function w.r.t. $w$ ($p(w_1 \mid x) + p(w_2 \mid x) = 1$)

Priors come from background knowledge

Likelihood comes from observation


Maximum a Posteriori (MAP) Classifier

$$p(\text{error} \mid x) = \begin{cases} p(w_1 \mid x) & \text{if we decide } w_2 \\ p(w_2 \mid x) & \text{if we decide } w_1 \end{cases}$$

MAP minimizes $p(\text{error})$.

Generalization

1. More than one feature: $x = [x_1, x_2, \ldots, x_d]$
2. More than 2 classes: $\{w_1, w_2, \ldots, w_c\}$
3. Allowing actions rather than just deciding on class.
4. How to deal with situations where misclassifications for some classes are more costly than for others.


Generalization

- Use of more than one feature (use feature vector $x$)
- Use of more than 2 classes (use $w_j$, $j = 1, \ldots, c$)
- Allowing actions rather than just deciding on classes (possible rejection)
- Dealing with different misclassification costs
- Introducing a loss function which is more general than the probability of error


Loss Function

- Allowing actions: accepting the action or rejecting
- The loss function states how costly each action taken is

Let $\{w_1, w_2, \ldots, w_c\}$ be the set of $c$ classes.

Let $\{\alpha_1, \alpha_2, \ldots, \alpha_a\}$ be the set of $a$ possible actions.

Let $\lambda(\alpha_i \mid w_j)$ be the loss incurred for taking action $\alpha_i$ when the state is $w_j$.


Loss Function

The expected loss of taking action $\alpha_i$ is the conditional risk:

$$R(\alpha_i \mid x) = \sum_{j=1}^{c} \lambda(\alpha_i \mid w_j)\,P(w_j \mid x)$$

Overall risk $R$ = sum of $R(\alpha_i \mid x)$ over all $x$, for $i = 1, \ldots, a$.

Minimizing $R$ $\equiv$ minimizing $R(\alpha_i \mid x)$ for each $x$: select the action $\alpha_i$ for which $R(\alpha_i \mid x)$ is minimum.

Bayes risk = the best performance that can be achieved.
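A minimal sketch (toy loss matrix and posteriors of my own) of picking the action with minimum conditional risk:

```python
import numpy as np

# Hypothetical loss matrix: rows = actions alpha_i, columns = true classes w_j
loss = np.array([
    [0.0, 2.0],   # action alpha_1 (decide w_1): no loss if w_1, loss 2 if w_2
    [1.0, 0.0],   # action alpha_2 (decide w_2): loss 1 if w_1, no loss if w_2
])

posterior = np.array([0.3, 0.7])      # P(w_1 | x), P(w_2 | x) for some observation x

risk = loss @ posterior               # R(alpha_i | x) = sum_j lambda(alpha_i | w_j) P(w_j | x)
best_action = np.argmin(risk)         # min-risk (Bayes) decision

print(risk)         # [1.4, 0.3]
print(best_action)  # 1 -> take alpha_2, i.e. decide w_2
```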


2-Class Case

Conditional risk:

$$R(\alpha_1 \mid x) = \lambda_{11}\,p(w_1 \mid x) + \lambda_{12}\,p(w_2 \mid x)$$

$$R(\alpha_2 \mid x) = \lambda_{21}\,p(w_1 \mid x) + \lambda_{22}\,p(w_2 \mid x)$$

where $\alpha_1$ means deciding $w_1$, $\alpha_2$ means deciding $w_2$, and $\lambda_{ij} = \lambda(\alpha_i \mid w_j)$.


2-Class Case

Min Risk Decision Rule

$\alpha_1$: the action "decide $w_1$".

If $R(\alpha_1 \mid x) < R(\alpha_2 \mid x)$, i.e. if

$$(\lambda_{21} - \lambda_{11})\,p(x \mid w_1)\,p(w_1) > (\lambda_{12} - \lambda_{22})\,p(x \mid w_2)\,p(w_2),$$

decide $w_1$; otherwise decide $w_2$.
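The step from the risk comparison to this inequality is worth spelling out; a short derivation of my own (not on the slides), using Bayes' rule:

$$
\begin{aligned}
R(\alpha_1 \mid x) &< R(\alpha_2 \mid x) \\
\lambda_{11}\,p(w_1 \mid x) + \lambda_{12}\,p(w_2 \mid x) &< \lambda_{21}\,p(w_1 \mid x) + \lambda_{22}\,p(w_2 \mid x) \\
(\lambda_{12} - \lambda_{22})\,p(w_2 \mid x) &< (\lambda_{21} - \lambda_{11})\,p(w_1 \mid x)
\end{aligned}
$$

and, substituting $p(w_j \mid x) = p(x \mid w_j)\,p(w_j)/p(x)$ and multiplying both sides by $p(x) > 0$:

$$(\lambda_{21} - \lambda_{11})\,p(x \mid w_1)\,p(w_1) > (\lambda_{12} - \lambda_{22})\,p(x \mid w_2)\,p(w_2)$$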
2-Class Case
Equivalently, if

$$\frac{p(x \mid w_1)}{p(x \mid w_2)} > \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \frac{p(w_2)}{p(w_1)}$$

(the right-hand side is independent of $x$), then take action $\alpha_1$ (decide $w_1$); else take action $\alpha_2$ (decide $w_2$).

The left-hand side is the likelihood ratio.
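A tiny numeric sketch of this likelihood-ratio test, with toy losses, priors, and densities of my own:

```python
# Toy numbers (hypothetical): zero loss for correct decisions, asymmetric loss for errors
lam11, lam12, lam21, lam22 = 0.0, 3.0, 1.0, 0.0
p_w1, p_w2 = 0.6, 0.4                    # priors
p_x_w1, p_x_w2 = 0.02, 0.05              # class-conditional densities at the observed x

threshold = (lam12 - lam22) / (lam21 - lam11) * (p_w2 / p_w1)   # independent of x
likelihood_ratio = p_x_w1 / p_x_w2

decision = "w1" if likelihood_ratio > threshold else "w2"
print(likelihood_ratio, threshold, decision)   # 0.4 vs 2.0 -> decide w2
```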
Bayes’ Classifier
Learning priors and conditionals
- If we have training data, we can learn the prior and the conditional from the training data (frequency-based probabilities).
- We may assume distributions and learn their parameters using MLE or other methods.
Naive Bayes
$$P(w \mid x) = \frac{P(x \mid w)\,P(w)}{P(x)}$$

Learning = estimating $P(x \mid w)$ and $P(w)$ from the training data.

Classification = computing $P(w \mid x_{new})$ for a new example $x_{new}$.
Naive Bayes
Assumes:

$$x = \langle f_1, \ldots, f_d \rangle \quad \text{(discrete-valued features)}$$

$$P(f_1, \ldots, f_d \mid w) = \prod_{i=1}^{d} P(f_i \mid w)$$

i.e. $f_i$ and $f_j$ are conditionally independent given $w$, for all $i \neq j$.
- Bayes classifier: optimum in the sense of probability of error; for given prior probabilities, loss function, and class-conditional densities, no other decision rule will have a lower risk (the expected value of the loss function, e.g., the probability of error).
- In practice, the class-conditional densities are estimated using parametric and non-parametric methods.