CS代考计算机代写 algorithm Machine Learning 10-601
Machine Learning 10-601
Tom M. Mitchell
Machine Learning Department Carnegie Mellon University
January 21, 2015
Today:
• BayesRule
• Estimatingparameters
• MLE • MAP
some of these slides are derived from William Cohen, Andrew Moore, Aarti Singh, Eric Xing, Carlos Guestrin. – Thanks!
Readings:
Probabilityreview
• BishopCh.1thru1.2.3 • Bishop,Ch.2thru2.2
• AndrewMoore’sonline tutorial
Announcements
• Class is using Piazza for questions/discussions about homeworks, etc.
– see class website for Piazza address
– http://www.cs.cmu.edu/~ninamf/courses/601sp15/
• Recitations thursdays 7-8pm, Wean 5409 – videos for future recitations (class website)
• HW1 was accepted to Sunday 5pm for full credit
• HW2 out today on class website, due in 1 week
• HW3 will involve programming (in Octave )
P(A|B) = P(B|A) * P(A) P(B)
we call P(A) the “prior” and P(A|B) the “posterior”
Bayes’ rule
…by no means merely a curious speculation in the doctrine of chances, but necessary to be solved in order to a sure foundation for all our reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any that would give a clear account of the strength of analogical or inductive reasoning…
Bayes, Thomas (1763) An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370-418
Other Forms of Bayes Rule
P(A|B) =
P(B|A) * P(A) P(B)
P(A|B) = P(B| A)P(A)
P(B| A)P(A)+P(B|~ A)P(~ A)
P(A|B∧ X) = P(B| A∧ X)P(A∧ X) P(B∧X)
Applying Bayes Rule
P(A|B)= P(B|A)P(A)
P(B | A)P(A) + P(B |~ A)P(~ A)
A = you have the flu, B = you just coughed
Assume:
P(A) = 0.05 P(B|A) = 0.80 P(B| ~A) = 0.20
what is P(flu | cough) = P(A|B)?
what does all this have to do with function approximation?
instead of F: XàY, learn P(Y | X)
The Joint Distribution
Recipe for making a joint distribution of M variables:
Example: Boolean variables A, B, C
A
0
0
0
0
1
1
1
1
B
0
0
1
1
0
0
1
1
C
0
1
0
1
0
1
0
1
Prob
0.30
0.05
0.10
0.05
0.05
0.10
0.25
0.10
0.05 0.10
0.10
B 0.10 0.30
0.05
C
A
0.25
0.05
[A. Moore]
The Joint Distribution
Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values (M Boolean variablesà2M rows).
Example: Boolean variables A, B, C
A
0
0
0
0
1
1
1
1
B
0
0
1
1
0
0
1
1
C
0
1
0
1
0
1
0
1
Prob
0.30
0.05
0.10
0.05
0.05
0.10
0.25
0.10
0.05 0.10
0.10
B 0.10 0.30
0.05
C
A
0.25
0.05
[A. Moore]
The Joint Distribution
Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values (M Boolean variablesà2M rows).
2. For each combination of values, say how probable it is.
Example: Boolean variables A, B, C
A
0
0
0
0
1
1
1
1
B
0
0
1
1
0
0
1
1
C
0
1
0
1
0
1
0
1
Prob
0.30
0.05
0.10
0.05
0.05
0.10
0.25
0.10
0.05 0.10
0.10
B 0.10 0.30
0.05
C
A
0.25
0.05
[A. Moore]
The Joint Distribution
Recipe for making a joint distribution of M variables:
1. Make a truth table listing all combinations of values (M Boolean variablesà2M rows).
2. For each combination of values, say how probable it is.
3. If you subscribe to the axioms of probability, those probabilities must sum to 1.
Example: Boolean variables A, B, C
A
0
0
0
0
1
1
1
1
B
0
0
1
1
0
0
1
1
C
0
1
0
1
0
1
0
1
Prob
0.30
0.05
0.10
0.05
0.05
0.10
0.25
0.10
0.05 0.10
0.10
B 0.10 0.30
0.05
C
A
0.25
0.05
[A. Moore]
Using the Joint Distribution
One you have the JD you can ask for the probability of any logical expression involving these variables
P(E) =
∑P(row) rows matching E
[A. Moore]
Using the Joint
P(Poor Male) = 0.4654
P(E) =
∑P(row) rows matching E
[A. Moore]
Using the Joint
P(Poor) = 0.7604
P(E) =
∑P(row) rows matching E
[A. Moore]
Inference with the Joint
∑P(row) = rows matching E1 and E2
P(E2 )
P(Male | Poor) = 0.4654 / 0.7604 = 0.612
P(E | E ) = 12
P(E ∧E ) 1 2
P(row) rows matching E2
∑
[A. Moore]
Learning and the Joint Distribution
Suppose we want to learn the function f:
Solution: learn joint distribution from data, calculate P(W | G, H) e.g., P(W=rich | G = female, H = 40.5- ) =
[A. Moore]
sounds like the solution to learning F: XàY,
or P(Y | X). Are we done?
sounds like the solution to learning F: XàY,
or P(Y | X).
Main problem: learning P(Y|X) can require more data than we have
consider learning Joint Dist. with 100 attributes
# of rows in this table?
# of people on earth?
fraction of rows with 0 training examples?
What to do?
1. Be smart about how we estimate probabilities from sparse data
– maximum likelihood estimates
– maximum a posteriori estimates
2. Be smart about how to represent joint distributions
– Bayes networks, graphical models
1. Be smart about how we estimate probabilities
Estimating Probability of Heads
X=1 X=0
Estimating θ = P(X=1)
X=1 X=0 100 flips: 51 Heads (X=1), 49 Tails (X=0)
Test B:
3 flips: 2 Heads (X=1), 1 Tails (X=0)
Test A:
Estimating θ = P(X=1)
Case C: (online learning)
X=1 X=0
• keep flipping, want single learning algorithm that gives reasonable estimate after each flip
Principles for Estimating Probabilities
Principle 1 (maximum likelihood):
• choose parameters θ that maximize P(data | θ) • e.g.,
Principle 2 (maximum a posteriori prob.):
• choose parameters θ that maximize P(θ | data) • e.g.
Maximum Likelihood Estimation
P(X=1) = θ P(X=0) = (1-θ) Data D:
X=1 X=0
Flips produce data D with heads, tails
• flipsareindependent,identicallydistributed1’sand0’s (Bernoulli)
• and are counts that sum these outcomes (Binomial)
Maximum Likelihood Estimate for Θ
[C. Guestrin]
hint:
Summary:
Maximum Likelihood Estimate
X=1 X=0
P(X=1) = θ P(X=0) = 1-θ (Bernoulli)
Principles for Estimating Probabilities
Principle 1 (maximum likelihood):
• choose parameters θ that maximize P(data | θ)
Principle 2 (maximum a posteriori prob.): • choose parameters θ that maximize
P(θ | data) = P(data | θ) P(θ) P(data)
Beta prior distribution – P(θ)
Beta prior distribution – P(θ)
[C. Guestrin]
and MAP estimate is therefore
and MAP estimate is therefore
Some terminology
• Likelihood function: P(data | θ)
• Prior: P(θ)
• Posterior: P(θ | data)
• Conjugate prior: P(θ) is the conjugate prior for likelihood function P(data | θ) if the forms of P(θ) and P(θ | data) are the same.
You should know
• Probabilitybasics
– random variables, conditional probs, …
– Bayes rule
– Joint probability distributions
– calculating probabilities from the joint distribution
• Estimatingparametersfromdata
– maximum likelihood estimates
– maximum a posteriori estimates
– distributions – binomial, Beta, Dirichlet, … – conjugate priors
Extra slides
Independent Events
• Definition: two events A and B are independent if P(A ^ B)=P(A)*P(B)
• Intuition: knowing A tells us nothing about the value of B (and vice versa)
Picture “A independent of B”
Expected values
Given a discrete random variable X, the expected value of X, written E[X] is
Example:
X P(X)
0 0.3
1 0.2
2 0.5
Expected values
Given discrete random variable X, the expected value of X, written E[X] is
We also can talk about the expected value of functions of X
Covariance
Given two discrete r.v.’s X and Y, we define the covariance of X and Y as
e.g., X=gender, Y=playsFootball or X=gender, Y=leftHanded
Remember: