Machine Learning Theory
Maria-Florina (Nina) Balcan
February 9th, 2015
Goals of Machine Learning Theory
Develop & analyze models to:
• understand what kinds of tasks we can hope to learn, and from what kind of data; what the key resources involved are (e.g., data, running time)
• prove guarantees for practically successful algs (when will they succeed? how long will they take?)
• develop new algs that provably meet desired criteria (within new learning paradigms)
Interesting tools & connections to other areas:
• Algorithms, Probability & Statistics, Optimization, Complexity Theory, Information Theory, Game Theory.
Very vibrant field:
• Conference on Learning Theory
• NIPS, ICML
Today’s focus: Sample Complexity for Supervised Classification (Function Approximation)
• Statistical Learning Theory (Vapnik)
• PAC (Valiant)
• Recommended reading: Mitchell, Ch. 7
• Suggested exercises: 7.1, 7.2, 7.7
• Additional resources: my learning theory course!
Supervised Classification
Decide which emails are spam and which are important.
[Figure: incoming emails sorted by supervised classification into "spam" and "not spam".]
Goal: use emails seen so far to produce good prediction rule for future data.
Example: Supervised Classification
Represent each message by features. (e.g., keywords, spelling, etc.)
[Figure: table of example messages with their features and labels, plotted in feature space.]
Reasonable RULES:
• Predict SPAM if unknown AND (money OR pills)
• Predict SPAM if 2·money + 3·pills − 5·known > 0
The second rule is a linear separator; the data shown is linearly separable.
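To make the second rule concrete, here is a tiny illustrative sketch (not from the lecture; the keyword counts are hypothetical):

```python
# Illustrative sketch of the linear rule above:
# predict SPAM if 2*money + 3*pills - 5*known > 0,
# where money/pills/known are keyword counts for a message.
def predict_spam(money, pills, known):
    score = 2 * money + 3 * pills - 5 * known   # a linear separator in feature space
    return "SPAM" if score > 0 else "NOT SPAM"

print(predict_spam(money=2, pills=1, known=0))  # SPAM
print(predict_spam(money=0, pills=0, known=3))  # NOT SPAM
```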
• Very well understood: Occam's bound, VC theory, etc.
• Note: to talk about these we need a precise model.
Two Core Aspects of Machine Learning
1) Algorithm Design. How to optimize? (Computation)
Automatically generate rules that do well on observed data.
• E.g.: logistic regression, SVM, Adaboost, etc.
2) Confidence Bounds, Generalization ((Labeled) Data)
Confidence for rule effectiveness on future data.
PAC/SLT models for Supervised Learning
[Diagram: Data Source (Distribution D on X) → Expert/Oracle (target c*: X → Y) → Labeled Examples (x1,c*(x1)), …, (xm,c*(xm)) → Learning Algorithm → Alg. outputs h: X → Y, e.g., a decision tree (x1 > 5?, x6 > 2?) or a linear separator.]
PAC/SLT models for Supervised Learning
[Diagram as above. Today: Y = {−1, 1}.]
• Algo sees training sample S: (x1,c*(x1)), …, (xm,c*(xm)), with the xi drawn independently and identically distributed (i.i.d.) from D and labeled by c*.
• Does optimization over S, finds hypothesis h (e.g., a decision tree).
• Goal: h has small error over D.
PAC/SLT models for Supervised Learning
• X – feature or instance space; distribution D over X, e.g., X = R^d or X = {0,1}^d
• Algo sees training sample S: (x1,c*(x1)), …, (xm,c*(xm)), xi i.i.d. from D – labeled examples assumed to be drawn i.i.d. from some distribution D over X and labeled by some target concept c*; labels ∈ {−1, 1} (binary classification)
• Algo does optimization over S, finds hypothesis h.
• Goal: h has small error over D:
$err_D(h) = \Pr_{x \sim D}[h(x) \neq c^*(x)]$
Need a bias: no free lunch.
[Figure: instance space X with distribution D; regions labeled + and − by the hypothesis h and the target c*.]
Function Approximation: The Big Picture
PAC/SLT models for Supervised Learning
Bias: fix a hypothesis space H (whose complexity is not too large).
Realizable case: c* ∈ H. Agnostic case: c* is "close to" H.
PAC/SLT models for Supervised Learning
• Algo sees training sample S: (x1,c*(x1)),…, (xm,c*(xm)), xi i.i.d. from D
• Does optimization over S, finds hypothesis h ∈ H.
• Goal: h has small error over D.
True error: $err_D(h) = \Pr_{x \sim D}[h(x) \neq c^*(x)]$ (how often h(x) ≠ c*(x) over future instances drawn at random from D).
But, we can only measure:
Training error: $err_S(h) = \frac{1}{m}\sum_{i=1}^{m} I[h(x_i) \neq c^*(x_i)]$ (how often h(x) ≠ c*(x) over the training instances).
Sample complexity: bound $err_D(h)$ in terms of $err_S(h)$.
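As a concrete illustration of the two error measures, here is a minimal sketch (not from the lecture): the hypothesis h, the target c*, and the uniform distribution D over {0,1}^d below are hypothetical choices, and the true error is approximated by Monte Carlo on a fresh sample.

```python
import random

def empirical_error(h, c_star, sample):
    # err_S(h) = (1/m) * sum_i I[h(x_i) != c_star(x_i)]
    return sum(h(x) != c_star(x) for x in sample) / len(sample)

def estimate_true_error(h, c_star, draw_x, trials=100_000):
    # Monte Carlo estimate of err_D(h) = Pr_{x ~ D}[h(x) != c_star(x)]
    fresh = [draw_x() for _ in range(trials)]
    return empirical_error(h, c_star, fresh)

if __name__ == "__main__":
    d = 5
    rng = random.Random(0)
    draw_x = lambda: tuple(rng.randint(0, 1) for _ in range(d))  # D = uniform over {0,1}^d (assumed)
    c_star = lambda x: 1 if x[0] == 1 else -1                    # hypothetical target concept
    h = lambda x: 1 if x[0] == 1 and x[1] == 1 else -1           # hypothetical hypothesis
    S = [draw_x() for _ in range(20)]                            # training sample S
    print("err_S(h) =", empirical_error(h, c_star, S))
    print("err_D(h) ~", estimate_true_error(h, c_star, draw_x))  # close to 0.25 here
```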
Sample Complexity for Supervised Learning
Consistent Learner
• Input: S: (x1,c*(x1)),…, (xm,c*(xm))
• Output: Find h in H consistent with the sample (if one exists).
Theorem: if the target is in H, and we have an algo that can find consistent functions, then $m \ge \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)$ examples suffice to get generalization error ≤ ε with prob. ≥ 1 − δ.
Sample Complexity for Supervised Learning
Consistent Learner
• Input: S: (x1,c*(x1)), …, (xm,c*(xm))
• Output: Find h in H consistent with the sample (if one exists).
• The bound is inversely linear in ε and only logarithmic in |H|.
• ε is called the error parameter: D might place low weight on certain parts of the space.
• δ is called the confidence parameter: there is a small chance the examples we get are not representative of the distribution.
Sample Complexity for Supervised Learning
Consistent Learner
• Input: S: (x1,c*(x1)),…, (xm,c*(xm))
• Output: Find h in H consistent with the sample (if one exists).
Example: H is the class of conjunctions over X = {0,1}^n; |H| = 3^n. E.g., h = x1 ∧ x3 ∧ x5, or h = x1 ∧ x2 ∧ x4 ∧ x̄9.
Then $m \ge \frac{1}{\epsilon}\left(n \ln 3 + \ln\frac{1}{\delta}\right)$ examples suffice.
For n = 10, ε = 0.1, δ = 0.01: m ≥ 156 examples suffice.
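A quick numerical check of this example (not from the lecture, just plugging into the bound above):

```python
import math

# n Boolean features, target error eps, confidence delta (values from the slide)
n, eps, delta = 10, 0.1, 0.01
m = (n * math.log(3) + math.log(1 / delta)) / eps
print(math.ceil(m))   # 156
```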
Sample Complexity for Supervised Learning
Consistent Learner
• Input: S: (x1,c*(x1)),…, (xm,c*(xm))
• Output: Find h in H consistent with the sample (if one exists).
Example: H is the class of conjunctions over X = {0,1}^n.
Side HWK question: show that any conjunction can be represented by a small decision tree; also by a linear separator.
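For concreteness, here is a minimal sketch (not from the lecture) of the standard elimination algorithm, a consistent learner for conjunctions in the realizable case: start with all 2n literals and delete every literal that some positive example falsifies. The sample and target below are hypothetical.

```python
def learn_conjunction(sample):
    """sample: list of (x, y) pairs, x a 0/1 tuple, y in {-1, +1}; realizable case assumed."""
    n = len(sample[0][0])
    pos = [True] * n   # may the literal  x_i  still appear in h?
    neg = [True] * n   # may the literal ~x_i  still appear in h?
    for x, y in sample:
        if y == +1:                    # positive examples prune the literals they falsify
            for i, bit in enumerate(x):
                if bit == 1:
                    neg[i] = False     # x_i = 1 here, so ~x_i cannot be in a consistent h
                else:
                    pos[i] = False     # x_i = 0 here, so  x_i cannot be in a consistent h
    def h(x):
        ok = all((not pos[i] or x[i] == 1) and (not neg[i] or x[i] == 0) for i in range(len(x)))
        return +1 if ok else -1        # +1 iff x satisfies every remaining literal
    return h

# Usage on a toy sample labeled by the hypothetical target c* = x1 AND (NOT x3), n = 4:
S = [((1, 0, 0, 1), +1), ((1, 1, 0, 0), +1), ((0, 1, 0, 1), -1), ((1, 1, 1, 0), -1)]
h = learn_conjunction(S)
print([h(x) for x, _ in S])   # [1, 1, -1, -1], i.e., consistent with S
```

In the realizable case the output (the most specific conjunction consistent with the positives) is automatically consistent with the negative examples as well.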
Sample Complexity for Supervised Learning
Proof
Assume there are k bad hypotheses h1, h2, …, hk with $err_D(h_i) \ge \epsilon$.
1) Fix $h_i$. Prob. that $h_i$ is consistent with the first training example is ≤ 1 − ε. Prob. that $h_i$ is consistent with the first m training examples is ≤ $(1-\epsilon)^m$.
2) Prob. that at least one $h_i$ is consistent with the first m training examples is ≤ $k(1-\epsilon)^m \le |H|(1-\epsilon)^m$.
3) Calculate the value of m so that $|H|(1-\epsilon)^m \le \delta$.
4) Use the fact that $1 - x \le e^{-x}$: it suffices to set $|H|e^{-\epsilon m} \le \delta$.
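Completing step 4 (a standard calculation, not written out on the slide): solving $|H|e^{-\epsilon m} \le \delta$ for m recovers the bound quoted earlier:
$|H|e^{-\epsilon m} \le \delta \;\Longleftrightarrow\; \epsilon m \ge \ln\frac{|H|}{\delta} \;\Longleftrightarrow\; m \ge \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right).$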
Sample Complexity: Finite Hypothesis Spaces Realizable Case
Theorem: $m \ge \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)$ labeled examples suffice so that, with probability at least 1 − δ over different samples of m training examples, every h ∈ H with $err_S(h) = 0$ has $err_D(h) \le \epsilon$.
Sample Complexity: Finite Hypothesis Spaces Realizable Case
1) PAC: How many examples suffice to guarantee small error whp.
2) Statistical Learning Way: with probability at least 1 − δ, for all h ∈ H s.t. $err_S(h) = 0$ we have
$err_D(h) \le \frac{1}{m}\left(\ln|H| + \ln\frac{1}{\delta}\right)$.
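These are two readings of the same bound (a standard remark, not spelled out on the slide): setting the right-hand side equal to ε and solving for m recovers the PAC form,
$\epsilon = \frac{1}{m}\left(\ln|H| + \ln\frac{1}{\delta}\right) \;\Longleftrightarrow\; m = \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right).$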
Supervised Learning: PAC model (Valiant)
• X – instance space, e.g., X = {0,1}^n or X = R^n
• S = {(xi, yi)} – labeled examples drawn i.i.d. from some distribution D over X and labeled by some target concept c*; labels ∈ {−1, 1} (binary classification)
• Algorithm A PAC-learns concept class H if for any target c* in H, any distribution D over X, and any ε, δ > 0:
– A uses at most poly(n, 1/ε, 1/δ, size(c*)) examples and running time.
– With probability ≥ 1 − δ, A produces h in H of error at most ε.
Uniform Convergence
• This basic result only bounds the chance that a bad hypothesis looks perfect on the data. What if there is no perfect h ∈ H (agnostic case)? What can we say if c* ∉ H?
• Can we say that w.h.p. all h ∈ H satisfy |err_D(h) − err_S(h)| ≤ ε?
– Called “uniform convergence”.
– Motivates optimizing over S, even if we can’t find a perfect function.
Sample Complexity: Finite Hypothesis Spaces Agnostic Case
What if there is no perfect h? We can still bound the gap: with probability at least 1 − δ, all h ∈ H satisfy $|err_D(h) - err_S(h)| \le \sqrt{\frac{\ln|H| + \ln(2/\delta)}{2m}}$.
To prove bounds like this, we need some good tail inequalities.
Hoeffding bounds
Consider coin of bias p flipped m times.
Let N be the observed # heads. Let ε ∈ [0,1].
Hoeffding bounds:
• Pr[N/m > p + ε] ≤ $e^{-2m\epsilon^2}$, and
• Pr[N/m < p − ε] ≤ $e^{-2m\epsilon^2}$.
Exponentially decreasing tails
• Tail inequality: bound probability mass in tail of distribution (how concentrated is a random variable around its expectation).
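As a sanity check on these bounds, here is a minimal simulation sketch (not from the lecture; the bias p, sample size m, and ε below are illustrative choices):

```python
import math
import random

def hoeffding_check(p=0.3, m=100, eps=0.1, trials=100_000, seed=0):
    # Empirically compare Pr[N/m > p + eps] with the Hoeffding bound exp(-2*m*eps^2).
    rng = random.Random(seed)
    exceed = 0
    for _ in range(trials):
        heads = sum(rng.random() < p for _ in range(m))   # N = observed number of heads
        if heads / m > p + eps:
            exceed += 1
    return exceed / trials, math.exp(-2 * m * eps ** 2)

empirical, bound = hoeffding_check()
print(f"empirical tail prob ~ {empirical:.4f} <= Hoeffding bound {bound:.4f}")
```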
Sample Complexity: Finite Hypothesis Spaces Agnostic Case
• Proof: Just apply Hoeffding.
– Chance of failure is at most $2|H|e^{-2|S|\epsilon^2}$.
– Set this to δ and solve.
• So, w.h.p., the best hypothesis on the sample is ε-best over D.
– Note: this is worse than the previous bound (1/ε has become 1/ε²), because we are asking for something stronger.
– Can also get bounds “between” these two.
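To see the 1/ε vs. 1/ε² gap numerically, here is a minimal sketch (not from the lecture) that inverts the failure probability 2|H|e^{-2mε²} ≤ δ from the proof above, reusing the parameters of the conjunction example:

```python
import math

def m_agnostic(H_size, eps, delta):
    # Solving 2|H| * exp(-2*m*eps^2) <= delta for m gives
    # m >= (1 / (2*eps^2)) * (ln|H| + ln(2/delta)).
    return math.ceil((math.log(H_size) + math.log(2 / delta)) / (2 * eps ** 2))

print(m_agnostic(3 ** 10, 0.1, 0.01))   # 815, vs. 156 in the realizable case
```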