Sample Complexity for Function Approximation. Model Selection.
Maria-Florina (Nina) Balcan
February 16th, 2015
Topics: sample complexity, structural risk minimization.
Two Core Aspects of Machine Learning

Algorithm Design. How to optimize? [Computation]
• Automatically generate rules that do well on observed data.
• E.g., logistic regression, SVM, Adaboost, etc.

Confidence Bounds, Generalization [(Labeled) Data]
• Confidence for rule effectiveness on future data.
PAC/SLT models for Supervised Classification

[Diagram: a Data Source with distribution D on X produces instances; an Expert/Oracle labels them with the target c*: X → Y, yielding Labeled Examples (x1,c*(x1)),…,(xm,c*(xm)); the Learning Algorithm consumes these and outputs h: X → Y, e.g., a decision tree that splits on "x1 > 5", then "x6 > 2", to predict +1/−1.]
PAC/SLT models for Supervised Learning

• X – feature/instance space; distribution D over X, e.g., X = R^d or X = {0,1}^d.
• Algo sees training sample S: (x1,c*(x1)),…,(xm,c*(xm)), xi i.i.d. from D
  – labeled examples, drawn i.i.d. from D and labeled by target c*
  – labels ∈ {−1,1} – binary classification
• Algo does optimization over S, finds hypothesis h.
• Goal: h has small error over D:
  err_D(h) = Pr_{x∼D}[h(x) ≠ c*(x)]
• Fix hypothesis space H [whose complexity is not too large].
  – Realizable: c* ∈ H.
  – Agnostic: c* "close to" H.

[Figure: instance space X, with the region where hypothesis h and target c* disagree shaded; its probability mass under D is err_D(h).]
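To make the err_S vs. err_D distinction concrete, here is a minimal sketch (my illustration, not from the slides): it labels Gaussian data with a simple threshold target c*, computes the training error of a slightly-off hypothesis h, and approximates the true error with a large fresh sample. The specific target, hypothesis, and sample sizes are all made-up assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def c_star(x):
    # Target concept c*: label = sign of the first coordinate.
    return np.sign(x[:, 0])

def h(x):
    # A slightly wrong hypothesis: threshold shifted by 0.1.
    return np.sign(x[:, 0] - 0.1)

S = rng.normal(size=(50, 2))                # training sample S ~ D^m, m = 50
err_S = np.mean(h(S) != c_star(S))          # empirical error err_S(h)

fresh = rng.normal(size=(1_000_000, 2))     # large fresh sample to approximate D
err_D = np.mean(h(fresh) != c_star(fresh))  # Monte Carlo estimate of err_D(h)

print(f"err_S(h) = {err_S:.3f}   err_D(h) ~= {err_D:.3f}")
```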
Sample Complexity for Supervised Learning: Realizable Case

Consistent Learner
• Input: S: (x1,c*(x1)),…,(xm,c*(xm))
• Output: Find h in H consistent with S (if one exists).

Theorem (finite H): m ≥ (1/ε)[ln|H| + ln(1/δ)] examples suffice so that, with probability ≥ 1−δ over different samples of m training examples, every h ∈ H consistent with S has err_D(h) ≤ ε. Note the bound is linear in 1/ε.
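As a quick numeric check of the "linear in 1/ε" claim, a small helper (my addition) evaluating the bound above:

```python
import math

def realizable_m(H_size: int, eps: float, delta: float) -> int:
    """Examples sufficient so that, whp >= 1 - delta, every h in H
    consistent with S has err_D(h) <= eps (finite H, realizable case)."""
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / eps)

# Halving eps roughly doubles m -- linear in 1/eps:
print(realizable_m(10**6, 0.10, 0.05))  # 169
print(realizable_m(10**6, 0.05, 0.05))  # 337
```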
Sample Complexity: Infinite Hypothesis Spaces, Realizable Case

E.g., H = linear separators in R^d: VCdim(H) = d + 1, so the sample complexity is linear in d.

So, if we double the number of features, then we only need roughly twice the number of samples to do well.

What if c* ∉ H?
Sample Complexity: Uniform Convergence, Agnostic Case

Empirical Risk Minimization (ERM)
• Input: S: (x1,c*(x1)),…,(xm,c*(xm))
• Output: Find h in H with smallest err_S(h).

1/ε² dependence [as opposed to 1/ε for realizable].
Hoeffding bounds

Consider a coin of bias p flipped m times. Let N be the observed number of heads; clearly E[N/m] = p.
[N = X1 + X2 + … + Xm, where Xi = 1 with prob. p and 0 with prob. 1−p.]

Hoeffding Inequality: Let γ ∈ [0,1]. Then
P(N/m − p ≥ γ) ≤ e^{−2mγ²}, and likewise for the lower tail; hence
P(|N/m − p| ≥ γ) ≤ 2e^{−2mγ²}.

[Figure: distribution of N/m concentrated around p, with tails beyond p−γ and p+γ.]

Exponentially decreasing tails.

Tail inequality: bounds the probability mass in the tail of a distribution, i.e., how concentrated a random variable is around its expectation.
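A minimal simulation (my addition, not from the slides) comparing the empirical two-sided tail probability with the Hoeffding bound 2e^{−2mγ²}; the bias, deviation, and trial counts are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(1)
p, m, gamma, trials = 0.3, 100, 0.1, 200_000

# Each row: m flips of a coin with bias p; the row mean is N/m.
flips = rng.random((trials, m)) < p
dev = np.abs(flips.mean(axis=1) - p)

empirical = np.mean(dev >= gamma)          # estimate of P(|N/m - p| >= gamma)
hoeffding = 2 * np.exp(-2 * m * gamma**2)  # bound: 2 e^{-2 m gamma^2}
print(f"empirical tail {empirical:.4f} <= bound {hoeffding:.4f}")
```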
Sample Complexity: Finite Hypothesis Spaces, Agnostic Case

Proof (Hoeffding & union bound):
• Fix h; by Hoeffding, the prob. that |err_S(h) − err_D(h)| ≥ ε is at most 2e^{−2mε²}.
• By the union bound over all h ∈ H, the prob. that ∃h s.t. |err_S(h) − err_D(h)| ≥ ε is at most 2|H|e^{−2mε²}. Set this to δ and solve.

Fact: W.h.p. ≥ 1−δ, err_D(ĥ) ≤ err_D(h*) + 2ε, where ĥ is the ERM output and h* is the hypothesis of smallest true error rate.
[Chain of inequalities: err_D(ĥ) ≤ err_S(ĥ) + ε ≤ err_S(h*) + ε ≤ err_D(h*) + 2ε, using uniform convergence twice and the definition of ERM once.]
Sample Complexity: Finite Hypothesis Spaces, Agnostic Case

1) How many examples suffice to get uniform convergence whp (and hence success for ERM)?
m ≥ (1/(2ε²))[ln(2|H|) + ln(1/δ)].
1/ε² dependence [as opposed to 1/ε for realizable], but we get something stronger: err_S is close to err_D for all h ∈ H simultaneously.

2) Statistical Learning Theory style: With prob. at least 1 − δ, for all h ∈ H:
err_D(h) ≤ err_S(h) + √( (1/(2m)) [ln(2|H|) + ln(1/δ)] ).
The error term decays like 1/√m, as opposed to 1/m for the realizable case.
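A small helper (illustrative, not from the slides) evaluating both forms of the finite-H agnostic bound; the |H|, ε, δ, and m values are arbitrary:

```python
import math

def uc_sample_size(H_size: int, eps: float, delta: float) -> int:
    """m sufficient for uniform convergence: set 2|H| e^{-2 m eps^2} = delta."""
    return math.ceil((math.log(2 * H_size) + math.log(1 / delta)) / (2 * eps**2))

def agnostic_gap(H_size: int, m: int, delta: float) -> float:
    """Whp >= 1 - delta bound on |err_D(h) - err_S(h)| for all h in finite H."""
    return math.sqrt((math.log(2 * H_size) + math.log(1 / delta)) / (2 * m))

print(uc_sample_size(10**6, 0.05, 0.05))  # note the 1/eps^2 dependence
print(agnostic_gap(10**6, 10_000, 0.05))  # gap decays like 1/sqrt(m)
```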
Sample Complexity: Infinite Hypothesis Spaces, Agnostic Case

1) How many examples suffice to get UC whp (and hence success for ERM)?
2) Statistical Learning Theory style: With prob. at least 1 − δ, for all h ∈ H:
err_D(h) ≤ err_S(h) + O( √( (1/(2m)) [VCdim(H) ln(em/VCdim(H)) + ln(1/δ)] ) ).
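The same kind of numeric check for the VC version, with the O(·) constant taken as 1 (an assumption; the exact constant varies by reference):

```python
import math

def vc_gap(d: int, m: int, delta: float) -> float:
    """The sqrt term of the VC bound, with the O(.) constant taken as 1."""
    return math.sqrt((d * math.log(math.e * m / d) + math.log(1 / delta)) / (2 * m))

# Linear separators in R^10: VCdim = d + 1 = 11.
for m in (1_000, 10_000, 100_000):
    print(m, round(vc_gap(11, m, 0.05), 4))
```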
VC-dimension Generalization Bounds

E.g., err_D(h) ≤ err_S(h) + O( √( (1/(2m)) [VCdim(H) ln(em/VCdim(H)) + ln(1/δ)] ) ).

VC bounds are distribution-independent bounds:
• Generic: hold for any concept class and any distribution. [Nearly tight in the worst case over the choice of D.]
• Might be very loose for specific distributions that are more benign than the worst case.
• Hold only for binary classification; we want bounds for function approximation in general (e.g., multiclass classification and regression).
Rademacher Complexity Bounds [Koltchinskii & Panchenko 2002]

• Distribution/data dependent. Tighter for nice distributions.
• Apply to general classes of real-valued functions & can be used to recover the VC bounds for supervised classification.
• Prominent technique for generalization bounds in the last decade.

See "Introduction to Statistical Learning Theory" by O. Bousquet, S. Boucheron, and G. Lugosi.
Rademacher Complexity

Problem Setup
• A space Z and a distribution D over Z.
• F, a class of functions from Z to [0,1].
• S = {z1,…,zm}, drawn i.i.d. from D.

Want a high-probability uniform convergence bound: all f ∈ F satisfy
E_D[f(z)] ≤ E_S[f(z)] + term(complexity of F, niceness of D/S).

What measure of complexity?

E.g., Z = X × Y with Y = {−1,1} (or any general discrete Y), H = {h: X → Y} a hypothesis space (e.g., linear separators), and
F = L(H) = {l_h : Z → {0,1}}, where l_h(z) = l_h(x,y) = 1[h(x) ≠ y]
[the loss function induced by h and the 0/1 loss].
Then E_{z∼D}[l_h(z)] = err_D(h) and E_S[l_h(z)] = err_S(h), so the bound reads
err_D(h) ≤ err_S(h) + term(complexity of H, niceness of D/S).
Rademacher Complexity

Space Z and a distribution D over Z; F a class of functions from Z to [0,1]; S = {z1,…,zm} i.i.d. from D.

The empirical Rademacher complexity of F is:
R̂_S(F) = E_{σ1,…,σm}[ sup_{f∈F} (1/m) Σ_i σ_i f(z_i) ],
where the σ_i are i.i.d. Rademacher variables chosen uniformly from {−1,1}.

The Rademacher complexity of F is: R_m(F) = E_S[R̂_S(F)].

The sup measures, for a given set S and Rademacher vector σ, the maximum correlation between f(z_i) and σ_i over all f ∈ F. So, taking the expectation over σ, this measures the ability of the class F to fit random noise.
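A Monte Carlo estimate of R̂_S(F) for small finite classes (my sketch; the sample size and class choices are arbitrary). It numerically matches the two examples worked out below: a singleton class has complexity ≈ 0, while the class of all 0/1 labelings fits noise perfectly and gets ≈ 1/2.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n_sigma = 10, 5_000
sigmas = rng.choice([-1.0, 1.0], size=(n_sigma, m))   # Rademacher vectors

def empirical_rademacher(F_values):
    """F_values: (num_functions, m) array; row j holds (f_j(z_1),...,f_j(z_m)).
    Monte Carlo estimate of E_sigma[ sup_f (1/m) sum_i sigma_i f(z_i) ]."""
    correlations = sigmas @ F_values.T / m            # (n_sigma, num_functions)
    return correlations.max(axis=1).mean()            # average of the sup

# Singleton class F = {f}: complexity ~ 0.
print(empirical_rademacher(rng.random((1, m))))
# All 2^m zero/one functions on S: complexity ~ 1/2.
all_01 = ((np.arange(2**m)[:, None] >> np.arange(m)) & 1).astype(float)
print(empirical_rademacher(all_01))
```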
Rademacher Complexity

(Same setup: space Z, distribution D over Z, F a class of functions from Z to [0,1], S = {z1,…,zm} i.i.d. from D.)

Theorem: Whp ≥ 1−δ, all f ∈ F satisfy:
E_D[f(z)] ≤ E_S[f(z)] + 2R_m(F) + √( ln(2/δ) / (2m) )
E_D[f(z)] ≤ E_S[f(z)] + 2R̂_S(F) + 3√( ln(1/δ) / m )

Useful if R_m(F) decays with m.
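The "decays with m" point can be checked numerically. For 1-D threshold functions there are only m+1 distinct labelings of an m-point sample, so R̂_S stays far below the 1/2 of the unrestricted class and shrinks as m grows (a sketch of mine, reusing the Monte Carlo idea above):

```python
import numpy as np

rng = np.random.default_rng(6)

def rademacher_thresholds(m, n_sigma=5_000):
    """Estimate of R_S-hat for 1-D threshold classifiers on m sorted points.
    Their m+1 distinct labelings are the prefix patterns 1...10...0."""
    sigmas = rng.choice([-1.0, 1.0], size=(n_sigma, m))
    # labelings[j] = 1 on the first j points, 0 afterwards (j = 0..m)
    labelings = (np.arange(m)[None, :] < np.arange(m + 1)[:, None]).astype(float)
    return (sigmas @ labelings.T / m).max(axis=1).mean()

for m in (10, 100, 1000):
    print(m, round(rademacher_thresholds(m), 3))   # decays as m grows
```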
Rademacher Complexity: Examples

1) F = {f} (a single function): R̂_S(F) = 0.
[Linearity of expectation: each σ_i f(z_i) individually has expectation 0.]

2) F = {all 0/1 functions}: R̂_S(F) = 1/2.
[To maximize, set f(z_i) = 1 when σ_i = 1 and f(z_i) = 0 when σ_i = −1. Then the quantity inside the expectation is (1/m)·(# of 1's in σ), which is 1/2 in expectation by linearity of expectation.]
3) F = L(H), H = binary classifiers:
R̂_S(F) ≤ √( ln(2|H[S]|) / m ), where H[S] is the set of labelings of S realized by H;
in particular, for finite H: R̂_S(F) ≤ √( ln(2|H|) / m ).
Rademacher Complexity Bounds

The theorem above is a data-dependent bound: it bounds the expectation of each f ∈ F in terms of its empirical average and the Rademacher complexity of F. The proof uses the symmetrization and ghost-sample tricks (the same as for the VC bound).
Rademacher Complexity: Binary Classification

Fact: For H = {h: X → Y} a hypothesis space (e.g., linear separators), F = L(H), d = VCdim(H):
R̂_S(F) ≤ √( ln(2|H[S]|) / m ), so, by Sauer's lemma, R_m(F) ≤ √( 2d ln(em/d) / m ).

Theorem: For any H, any distribution D, w.h.p. ≥ 1−δ all h ∈ H satisfy:
err_D(h) ≤ err_S(h) + R_m(H) + 3√( ln(2/δ) / (2m) ),
which yields the generalization bound
err_D(h) ≤ err_S(h) + √( 2d ln(em/d) / m ) + 3√( ln(2/δ) / (2m) ).

Many more uses: margin bounds for SVM, boosting, regression bounds, etc.
Can we use our bounds for model selection?

True Error, Training Error, Overfitting

Model selection: trade-off between decreasing training error and keeping H simple.

err_D(h) ≤ err_S(h) + R_m(H) + …

[Figure: error vs. complexity; the train error decreases as complexity grows, while the generalization error first decreases and then increases.]
Structural Risk Minimization (SRM)

𝐻1 ⊆ 𝐻2 ⊆ 𝐻3 ⊆ ⋯ ⊆ 𝐻𝑖 ⊆ …

[Figure: error rate vs. hypothesis complexity; the empirical error curve keeps decreasing while the true error curve turns back up — overfitting.]

What happens if we increase m? The empirical-error (black) curve will stay close to the true-error (red) curve for longer; everything shifts to the right.
Structural Risk Minimization (SRM)

𝐻1 ⊆ 𝐻2 ⊆ 𝐻3 ⊆ ⋯ ⊆ 𝐻𝑖 ⊆ …
• hk = argmin_{h∈Hk} err_S(h).
• As k increases, err_S(hk) goes down but the complexity term goes up.
• k̂ = argmin_{k≥1} { err_S(hk) + complexity(Hk) }; output ĥ = h_k̂.

Claim: W.h.p., err_D(ĥ) ≤ min_{k*} min_{h*∈H_{k*}} [ err_D(h*) + 2·complexity(H_{k*}) ].
Proof:
• We chose ĥ s.t. err_S(ĥ) + complexity(H_k̂) ≤ err_S(h*) + complexity(H_{k*}).
• Whp, err_D(ĥ) ≤ err_S(ĥ) + complexity(H_k̂).
• Whp, err_S(h*) ≤ err_D(h*) + complexity(H_{k*}).
Chaining the three inequalities gives the claim.
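A runnable SRM sketch under stated assumptions (mine, not the slides'): the nested classes H_k are histogram classifiers over 2^k bins of [0,1], ERM within each class is majority vote per bin, and complexity(H_k) is the finite-H agnostic penalty from earlier with |H_k| = 2^(2^k) bin labelings.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy 1-D data: x ~ U[0,1]; noisy target c*(x) = 1[x > 0.35], 10% label noise.
m = 400
x = rng.random(m)
y = ((x > 0.35) ^ (rng.random(m) < 0.1)).astype(int)

def erm_histogram(x, y, k):
    """H_k = histogram classifiers on 2^k equal bins of [0,1] (nested in k).
    ERM is the majority label per bin. Returns (predict_fn, err_S)."""
    bins = 2 ** k
    idx = np.minimum((x * bins).astype(int), bins - 1)
    maj = np.array([y[idx == b].mean() > 0.5 if np.any(idx == b) else False
                    for b in range(bins)]).astype(int)
    predict = lambda t: maj[np.minimum((t * bins).astype(int), bins - 1)]
    return predict, np.mean(predict(x) != y)

def complexity(k, m, delta=0.05):
    """Finite-H penalty sqrt((ln(2|H_k|) + ln(1/delta)) / (2m)), |H_k| = 2^(2^k)."""
    ln_H = (2 ** k) * np.log(2)
    return np.sqrt((np.log(2) + ln_H + np.log(1 / delta)) / (2 * m))

# SRM: pick k minimizing empirical error plus the complexity penalty.
scores = {k: erm_histogram(x, y, k)[1] + complexity(k, m) for k in range(1, 8)}
k_hat = min(scores, key=scores.get)
print("SRM picks k =", k_hat, "penalized score =", round(scores[k_hat], 3))
```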
Techniques to Handle Overfitting

• Structural Risk Minimization (SRM): 𝐻1 ⊆ 𝐻2 ⊆ ⋯ ⊆ 𝐻𝑖 ⊆ …; minimize the generalization bound: ĥ = argmin_{k≥1} { err_S(hk) + complexity(Hk) }.
  – Often computationally hard…
  – Nice case where it is possible: M. Kearns, Y. Mansour, ICML'98, "A Fast, Bottom-Up Decision Tree Pruning Algorithm with Near-Optimal Generalization".
• Regularization: a general family closely related to SRM.
  – E.g., SVM, regularized logistic regression, etc.
  – Minimizes expressions of the form err_S(h) + λ‖h‖², for some norm when H is a vector space (e.g., the L2 norm), with λ picked through cross-validation.
• Cross Validation: hold out part of the training data and use it as a proxy for the generalization error. (A sketch combining the last two bullets appears below.)
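A self-contained sketch tying the last two bullets together: minimize a loss plus λ‖w‖² (here, ridge regression, my stand-in for the general recipe) and pick λ with a holdout set. The data, split, and λ grid are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy data: y = <w*, x> + noise, with only 3 of 30 features informative.
n, d = 120, 30
X = rng.normal(size=(n, d))
w_star = np.zeros(d)
w_star[:3] = [2.0, -1.0, 0.5]
y = X @ w_star + 0.5 * rng.normal(size=n)

# Split: fit on 70%, treat the held-out 30% as a generalization-error proxy.
X_tr, y_tr, X_ho, y_ho = X[:84], y[:84], X[84:], y[84:]

def ridge(X, y, lam):
    """argmin_w ||Xw - y||^2 + lam * ||w||^2, in closed form."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# Regularization weight lambda picked through (holdout) cross-validation.
ho_err = {lam: np.mean((X_ho @ ridge(X_tr, y_tr, lam) - y_ho) ** 2)
          for lam in (0.01, 0.1, 1.0, 10.0, 100.0)}
best = min(ho_err, key=ho_err.get)
print("lambda picked by holdout:", best)
```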
What you should know
• Notion of sample complexity.
• Understand the reasoning behind the simple sample complexity bound for finite H [exam question!].
• Shattering, VC dimension as measure of complexity, Sauer’s lemma, form of the VC bounds (upper and lower bounds).
• Rademacher Complexity.
• Model Selection, Structural Risk Minimization.
L2 vs. L1 Regularization

Gaussian prior P(W) ↔ L2 regularization; Laplace prior P(W) ↔ L1 regularization.

[Figure: contours of P(Data|W) and level sets of constant P(W) in the (w1, w2) plane for each prior.]
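To see the contrast the figure is drawing, a small (proximal) gradient-descent sketch (my addition; problem sizes and λ are arbitrary) fitting the same sparse regression with each penalty. The L1/Laplace penalty typically drives most weights exactly to zero; the L2/Gaussian penalty shrinks them but rarely zeroes them.

```python
import numpy as np

rng = np.random.default_rng(5)

# Sparse ground truth: only the first 3 of 20 weights are nonzero.
n, d = 200, 20
X = rng.normal(size=(n, d))
w_star = np.zeros(d)
w_star[:3] = [3.0, -2.0, 1.0]
y = X @ w_star + 0.1 * rng.normal(size=n)

def fit(penalty, lam=5.0, steps=2000, lr=1e-3):
    """(Proximal) gradient descent on ||Xw - y||^2 plus the chosen penalty."""
    w = np.zeros(d)
    for _ in range(steps):
        w = w - lr * (2 * X.T @ (X @ w - y))   # gradient of the data term
        if penalty == "l2":
            w = w - lr * 2 * lam * w           # gradient of lam * ||w||^2
        else:
            # soft-thresholding: proximal step for lam * ||w||_1
            w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)
    return w

for p in ("l2", "l1"):
    print(p, "nonzero weights:", int(np.sum(np.abs(fit(p)) > 1e-6)))
```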