CS代考计算机代写 algorithm Machine Learning 10-601

Machine Learning 10-601
Tom M. Mitchell
Machine Learning Department Carnegie Mellon University
January 21, 2015
Today:
• BayesRule
• Estimatingparameters
• MLE • MAP
some of these slides are derived from William Cohen, Andrew Moore, Aarti Singh, Eric Xing, Carlos Guestrin. – Thanks!
Readings:
Probabilityreview
• BishopCh.1thru1.2.3 • Bishop,Ch.2thru2.2
• AndrewMoore’sonline tutorial

Announcements
• Class is using Piazza for questions/discussions about homeworks, etc.
– see class website for Piazza address
– http://www.cs.cmu.edu/~ninamf/courses/601sp15/
• Recitations thursdays 7-8pm, Wean 5409 – videos for future recitations (class website)
• HW1 was accepted to Sunday 5pm for full credit
• HW2 out today on class website, due in 1 week
• HW3 will involve programming (in Octave )

P(A|B) = P(B|A) * P(A) P(B)
we call P(A) the “prior” and P(A|B) the “posterior”
Bayes’ rule
…by no means merely a curious speculation in the doctrine of chances, but necessary to be solved in order to a sure foundation for all our reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any that would give a clear account of the strength of analogical or inductive reasoning…
Bayes, Thomas (1763) An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370-418

Other Forms of Bayes Rule
P(A|B) =
P(B|A) * P(A) P(B)
P(A|B) = P(B| A)P(A)
P(B| A)P(A)+P(B|~ A)P(~ A)
P(A|B∧ X) = P(B| A∧ X)P(A∧ X) P(B∧X)

Applying Bayes Rule
P(A|B)= P(B|A)P(A)
P(B | A)P(A) + P(B |~ A)P(~ A)
A = you have the flu, B = you just coughed
Assume:
P(A) = 0.05 P(B|A) = 0.80 P(B| ~A) = 0.20
what is P(flu | cough) = P(A|B)?

what does all this have to do with function approximation?
instead of F: XàY, learn P(Y | X)

The Joint Distribution
Recipe for making a joint distribution of M variables:
Example: Boolean variables A, B, C
A
0
0
0
0
1
1
1
1
B
0
0
1
1
0
0
1
1
C
0
1
0
1
0
1
0
1
Prob
0.30
0.05
0.10
0.05
0.05
0.10
0.25
0.10
0.05 0.10
0.10
B 0.10 0.30
0.05
C
A
0.25
0.05
[A. Moore]

The Joint Distribution
Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values (M Boolean variablesà2M rows).
Example: Boolean variables A, B, C
A
0
0
0
0
1
1
1
1
B
0
0
1
1
0
0
1
1
C
0
1
0
1
0
1
0
1
Prob
0.30
0.05
0.10
0.05
0.05
0.10
0.25
0.10
0.05 0.10
0.10
B 0.10 0.30
0.05
C
A
0.25
0.05
[A. Moore]

The Joint Distribution
Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values (M Boolean variablesà2M rows).
2. For each combination of values, say how probable it is.
Example: Boolean variables A, B, C
A
0
0
0
0
1
1
1
1
B
0
0
1
1
0
0
1
1
C
0
1
0
1
0
1
0
1
Prob
0.30
0.05
0.10
0.05
0.05
0.10
0.25
0.10
0.05 0.10
0.10
B 0.10 0.30
0.05
C
A
0.25
0.05
[A. Moore]

The Joint Distribution
Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values (M Boolean variablesà2M rows).
2. For each combination of values, say how probable it is.
3. If you subscribe to the axioms of probability, those probabilities must sum to 1.
Example: Boolean variables A, B, C
A
0
0
0
0
1
1
1
1
B
0
0
1
1
0
0
1
1
C
0
1
0
1
0
1
0
1
Prob
0.30
0.05
0.10
0.05
0.05
0.10
0.25
0.10
0.05 0.10
0.10
B 0.10 0.30
0.05
C
A
0.25
0.05
[A. Moore]

Using the Joint Distribution
One you have the JD you can ask for the probability of any logical expression involving these variables
P(E) =
∑P(row) rows matching E
[A. Moore]

Using the Joint
P(Poor Male) = 0.4654
P(E) =
∑P(row) rows matching E
[A. Moore]

Using the Joint
P(Poor) = 0.7604
P(E) =
∑P(row) rows matching E
[A. Moore]

Inference with the Joint
∑P(row) = rows matching E1 and E2
P(E2 )
P(Male | Poor) = 0.4654 / 0.7604 = 0.612
P(E | E ) = 12
P(E ∧E ) 1 2
P(row) rows matching E2

[A. Moore]

Learning and the Joint Distribution
Suppose we want to learn the function f: àW Equivalently, P(W | G, H)
Solution: learn joint distribution from data, calculate P(W | G, H) e.g., P(W=rich | G = female, H = 40.5- ) =
[A. Moore]

sounds like the solution to learning F: XàY,
or P(Y | X). Are we done?

sounds like the solution to learning F: XàY,
or P(Y | X).
Main problem: learning P(Y|X) can require more data than we have
consider learning Joint Dist. with 100 attributes
# of rows in this table?
# of people on earth?
fraction of rows with 0 training examples?

What to do?
1. Be smart about how we estimate probabilities from sparse data
– maximum likelihood estimates
– maximum a posteriori estimates
2. Be smart about how to represent joint distributions
– Bayes networks, graphical models

1. Be smart about how we estimate probabilities

Estimating Probability of Heads
X=1 X=0

Estimating θ = P(X=1)
X=1 X=0 100 flips: 51 Heads (X=1), 49 Tails (X=0)
Test B:
3 flips: 2 Heads (X=1), 1 Tails (X=0)
Test A:

Estimating θ = P(X=1)
Case C: (online learning)
X=1 X=0
• keep flipping, want single learning algorithm that gives reasonable estimate after each flip

Principles for Estimating Probabilities
Principle 1 (maximum likelihood):
• choose parameters θ that maximize P(data | θ) • e.g.,
Principle 2 (maximum a posteriori prob.):
• choose parameters θ that maximize P(θ | data) • e.g.

Maximum Likelihood Estimation
P(X=1) = θ P(X=0) = (1-θ) Data D:
X=1 X=0
Flips produce data D with heads, tails
• flipsareindependent,identicallydistributed1’sand0’s (Bernoulli)
• and are counts that sum these outcomes (Binomial)

Maximum Likelihood Estimate for Θ
[C. Guestrin]

hint:

Summary:
Maximum Likelihood Estimate
X=1 X=0
P(X=1) = θ P(X=0) = 1-θ (Bernoulli)

Principles for Estimating Probabilities
Principle 1 (maximum likelihood):
• choose parameters θ that maximize P(data | θ)
Principle 2 (maximum a posteriori prob.): • choose parameters θ that maximize
P(θ | data) = P(data | θ) P(θ) P(data)

Beta prior distribution – P(θ)

Beta prior distribution – P(θ)
[C. Guestrin]

and MAP estimate is therefore

and MAP estimate is therefore

Some terminology
• Likelihood function: P(data | θ)
• Prior: P(θ)
• Posterior: P(θ | data)
• Conjugate prior: P(θ) is the conjugate prior for likelihood function P(data | θ) if the forms of P(θ) and P(θ | data) are the same.

You should know
• Probabilitybasics
– random variables, conditional probs, …
– Bayes rule
– Joint probability distributions
– calculating probabilities from the joint distribution
• Estimatingparametersfromdata
– maximum likelihood estimates
– maximum a posteriori estimates
– distributions – binomial, Beta, Dirichlet, … – conjugate priors

Extra slides

Independent Events
• Definition: two events A and B are independent if P(A ^ B)=P(A)*P(B)
• Intuition: knowing A tells us nothing about the value of B (and vice versa)

Picture “A independent of B”

Expected values
Given a discrete random variable X, the expected value of X, written E[X] is
Example:
X P(X)
0 0.3
1 0.2
2 0.5

Expected values
Given discrete random variable X, the expected value of X, written E[X] is
We also can talk about the expected value of functions of X

Covariance
Given two discrete r.v.’s X and Y, we define the covariance of X and Y as
e.g., X=gender, Y=playsFootball or X=gender, Y=leftHanded
Remember:

Posted in Uncategorized

Leave a Reply

Your email address will not be published. Required fields are marked *