# Sample Complexity for Function Approximation. Model Selection.

Maria-Florina (Nina) Balcan

February 16th, 2015

Structural risk minimization

Sample complexity

Two Core Aspects of Machine Learning

Algorithm Design. How to optimize? [Computation]

• Automatically generate rules that do well on observed data.

• E.g.: logistic regression, SVM, Adaboost, etc.

Confidence Bounds, Generalization [(Labeled) Data]

• Confidence for rule effectiveness on future data.

PAC/SLT models for Supervised Classification

Data Source: distribution D on X; an Expert / Oracle labels examples with target c* : X → Y.

Labeled Examples: (x1, c*(x1)), …, (xm, c*(xm))

Learning Algorithm: outputs a hypothesis h : X → Y.

[Figure: the data source feeds labeled examples to the learning algorithm, which outputs a hypothesis h, e.g., a decision tree splitting on x1 > 5 and x6 > 2 with leaves +1 and −1.]

PAC/SLT models for Supervised Learning

• X – feature/instance space; distribution D over X, e.g., X = R^d or X = {0,1}^d.

• Algorithm sees training sample S: (x1, c*(x1)), …, (xm, c*(xm)), xi i.i.d. from D
  – labeled examples, drawn i.i.d. from D and labeled by target c*
  – labels ∈ {−1, 1} – binary classification.

• Algorithm does optimization over S, finds hypothesis h.

• Goal: h has small error over D:

  errD(h) = Pr_{x∼D}[ h(x) ≠ c*(x) ]

• Fix hypothesis space H [whose complexity is not too large].

• Realizable: c* ∈ H. Agnostic: c* "close to" H.

[Figure: instance space X, with the regions labeled + and − by the target c* and by a hypothesis h.]

Sample Complexity for Supervised Learning

Realizable Case

Consistent Learner

• Input: S: (x1, c*(x1)), …, (xm, c*(xm))

• Output: Find h in H consistent with S (if one exists).

Theorem: m ≥ (1/𝜖) [ ln|H| + ln(1/δ) ] labeled examples suffice so that, with probability at least 1 − δ (over different samples of m training examples), every h ∈ H consistent with S has errD(h) ≤ 𝜖.

Linear in 1/𝜖.
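As a small illustration (the function and parameter names are ours, not from the lecture), the finite-H realizable bound m ≥ (1/𝜖)(ln|H| + ln(1/δ)) can be evaluated directly:

```python
import math

def realizable_sample_complexity(h_size: int, eps: float, delta: float) -> int:
    """Examples sufficient so that, w.p. >= 1 - delta, every h in H consistent
    with the sample has true error at most eps (finite H, realizable case)."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / eps)

# Halving the target error eps roughly doubles m: linear dependence on 1/eps.
m_coarse = realizable_sample_complexity(h_size=10**6, eps=0.10, delta=0.05)
m_fine = realizable_sample_complexity(h_size=10**6, eps=0.05, delta=0.05)
```

Note the only logarithmic dependence on |H| and 1/δ: even a hypothesis class of a million functions needs only a few hundred examples here.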

Sample Complexity: Infinite Hypothesis Spaces Realizable Case

E.g., H= linear separators in Rd

VCdim(H)= d+1

Sample complexity linear in d.

So, if we double the number of features, then we only need roughly twice the number of samples to do well.
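To see the linear dependence on d concretely, here is a sketch of the standard realizable VC sample complexity m = O((1/𝜖)(d ln(1/𝜖) + ln(1/δ))), with constants dropped (the helper is ours, for illustration only):

```python
import math

def vc_realizable_samples(d: int, eps: float, delta: float) -> int:
    """O((1/eps) * (d * ln(1/eps) + ln(1/delta))) samples, constants dropped."""
    return math.ceil((d * math.log(1.0 / eps) + math.log(1.0 / delta)) / eps)

# Doubling the VC dimension (e.g., doubling the number of features for
# linear separators, since VCdim(H) = d + 1) roughly doubles the sample size.
m_d = vc_realizable_samples(d=10, eps=0.1, delta=0.05)
m_2d = vc_realizable_samples(d=20, eps=0.1, delta=0.05)
```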

What if c∗ ∉ H?

Sample Complexity: Uniform Convergence Agnostic Case

Empirical Risk Minimization (ERM)

• Input: S: (x1,c*(x1)),…, (xm,c*(xm))

• Output: Find h in H with smallest errS(h)

1/𝜖² dependence [as opposed to 1/𝜖 for realizable]

Hoeffding bounds

Consider coin of bias p flipped m times.

Let N be the observed # heads. Clearly E[N/m] = p.

[N = X1 + X2 + … + Xm, Xi = 1 with prob. p, 0 with prob. 1 − p.]

Hoeffding Inequality

Let γ ∈ [0,1]. Then:

  P( N/m − p ≥ γ ) ≤ e^{−2mγ²}

(and symmetrically for the lower tail, so P( |N/m − p| ≥ γ ) ≤ 2e^{−2mγ²}).

[Figure: distribution of N/m concentrated around p, with tails beyond p − γ and p + γ.]

Exponentially decreasing tails

Tail inequality: bound probability mass in tail of distribution (how concentrated is a random variable around its expectation).
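A quick Monte Carlo sanity check of the inequality (the simulation code is ours): the estimated tail probability should sit below the Hoeffding bound e^{−2mγ²}.

```python
import math
import random

random.seed(0)

def upper_tail_estimate(p: float, m: int, gamma: float, trials: int = 10000) -> float:
    """Estimate P(N/m - p >= gamma) for a coin of bias p flipped m times."""
    hits = 0
    for _ in range(trials):
        heads = sum(random.random() < p for _ in range(m))
        if heads / m - p >= gamma:
            hits += 1
    return hits / trials

p, m, gamma = 0.5, 100, 0.1
empirical = upper_tail_estimate(p, m, gamma)   # true value is about 0.028
bound = math.exp(-2 * m * gamma ** 2)          # e^{-2}, about 0.135
```

The bound is loose here (0.135 vs. roughly 0.028), which is typical: Hoeffding trades tightness for generality.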

Sample Complexity: Finite Hypothesis Spaces Agnostic Case

Proof:

• Fix h; by Hoeffding, the prob. that |errS(h) − errD(h)| ≥ ε is at most 2e^{−2mε²}.

• By union bound over all h ∈ H, the prob. that ∃h s.t. |errS(h) − errD(h)| ≥ ε is at most 2|H|e^{−2mε²}. Set this to δ and solve for m. [Hoeffding & union bound.]

Fact: W.h.p. ≥ 1 − δ, errD(h) ≤ errD(h*) + 2ε, where h is the ERM output and h* is the hypothesis of smallest true error rate.

[Chain: errD(h) ≤ errS(h) + ε ≤ errS(h*) + ε ≤ errD(h*) + 2ε.]

Sample Complexity: Finite Hypothesis Spaces Agnostic Case

1) How many examples suffice to get UC whp (so success for ERM).

1/𝜖² dependence [as opposed to 1/𝜖 for realizable], but we get something stronger: uniform convergence for all h ∈ H.

2) Statistical Learning Theory style: With prob. at least 1 − 𝛿, for all h ∈ H:

[sqrt(1/m) dependence, as opposed to 1/m for realizable.]

  errD(h) ≤ errS(h) + sqrt( ( ln(2|H|) + ln(1/δ) ) / (2m) )
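As a sketch (the helper names are ours), the uniform-convergence gap above and the resulting 1/𝜖² sample complexity for finite H:

```python
import math

def uc_gap(h_size: int, m: int, delta: float) -> float:
    """sqrt((ln(2|H|) + ln(1/delta)) / (2m)): w.h.p., every h in H has
    |errS(h) - errD(h)| below this value."""
    return math.sqrt((math.log(2 * h_size) + math.log(1.0 / delta)) / (2 * m))

def agnostic_sample_complexity(h_size: int, eps: float, delta: float) -> int:
    """m making the gap at most eps: quadratic in 1/eps."""
    return math.ceil((math.log(2 * h_size) + math.log(1.0 / delta)) / (2 * eps ** 2))

m_needed = agnostic_sample_complexity(h_size=10**6, eps=0.10, delta=0.05)
# Halving eps now multiplies m by about 4, versus about 2 in the realizable case.
m_half = agnostic_sample_complexity(h_size=10**6, eps=0.05, delta=0.05)
```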

Sample Complexity: Infinite Hypothesis Spaces Agnostic Case

1) How many examples suffice to get UC whp (so success for ERM).

2) Statistical Learning Theory style: With prob. at least 1 − 𝛿, for all h ∈ H:

  errD(h) ≤ errS(h) + O( sqrt( (1/(2m)) [ VCdim(H) · ln( em / VCdim(H) ) + ln(1/δ) ] ) )

VCdimension Generalization Bounds

E.g., errD(h) ≤ errS(h) + O( sqrt( (1/(2m)) [ VCdim(H) · ln( em / VCdim(H) ) + ln(1/δ) ] ) )

VC bounds: distribution-independent bounds

• Generic: hold for any concept class and any distribution. [Nearly tight in the worst case over the choice of D.]

• Might be very loose for specific distributions that are more benign than the worst case…

• Hold only for binary classification; we want bounds for function approximation in general (e.g., multiclass classification and regression).

Rademacher Complexity Bounds

[Koltchinskii&Panchenko 2002]

• Distribution/data dependent. Tighter for nice distributions.

• Apply to general classes of real-valued functions & can be used to recover the VC bounds for supervised classification.

• Prominent technique for generalization bounds in the last decade.

See "Introduction to Statistical Learning Theory" by O. Bousquet, S. Boucheron, and G. Lugosi.

Rademacher Complexity

Problem Setup

• A space Z and a distribution D|Z.

• F, a class of functions from Z to [0,1].

• S = {z1, …, zm}, drawn i.i.d. from D|Z.

Want a high-probability uniform convergence bound: all f ∈ F satisfy

  E_D[f(z)] ≤ E_S[f(z)] + term(complexity of F, niceness of D/S).

What measure of complexity?

E.g., Z = X × Y with Y = {−1, 1} (or general discrete Y), and H = {h : X → Y} a hypothesis space (e.g., linear separators).

Take F = L(H) = { l_h : Z → {0,1} }, where l_h( z = (x, y) ) = 1[ h(x) ≠ y ]. [Loss function induced by h and the 0/1 loss.]

Then E_{z∼D}[ l_h(z) ] = errD(h) and E_S[ l_h(z) ] = errS(h), so such a bound gives:

  errD(h) ≤ errS(h) + term(complexity of H, niceness of D/S).

Rademacher Complexity

Space Z and a distribution D|Z; F a class of functions from Z to [0,1]; let S = {z1, …, zm} be i.i.d. from D|Z.

The empirical Rademacher complexity of F is:

  R_S(F) = E_{σ1,…,σm} [ sup_{f∈F} (1/m) Σ_i σ_i f(z_i) ]

where the σ_i are i.i.d. Rademacher variables chosen uniformly from {−1, 1}.

The Rademacher complexity of F is: R_m(F) = E_S[ R_S(F) ].

The sup measures, for a given set S and Rademacher vector σ, the maximum correlation between f(z_i) and σ_i over all f ∈ F.

So, taking the expectation over σ, this measures the ability of the class F to fit random noise.

Theorem: W.h.p. ≥ 1 − δ, all f ∈ F satisfy:

  E_D[f(z)] ≤ E_S[f(z)] + 2 R_m(F) + sqrt( ln(2/δ) / (2m) )

  E_D[f(z)] ≤ E_S[f(z)] + 2 R_S(F) + 3 sqrt( ln(1/δ) / m )

Useful if these terms decay with m.


Examples:

1) F = {f} (a single function): R_m(F) = 0.

  [Linearity of expectation: each σ_i f(z_i) individually has expectation 0.]

2) F = {all 0/1 functions}: R_m(F) = 1/2.

  [To maximize, set f(z_i) = 1 when σ_i = 1 and f(z_i) = 0 when σ_i = −1. The sum inside the expectation is then the number of +1's in σ, which is m/2 in expectation, so dividing by m gives 1/2.]
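The second example can be checked numerically. In this minimal Monte Carlo sketch (ours, not from the lecture), the sup in the definition is attained by matching f(z_i) to the positive σ_i, so the estimate should be near 1/2:

```python
import random

random.seed(0)

def rc_all_01_functions(m: int, draws: int = 5000) -> float:
    """Monte Carlo estimate of the empirical Rademacher complexity of
    F = {all 0/1 functions}: the sup of (1/m) * sum_i sigma_i * f(z_i)
    is attained by setting f(z_i) = 1 exactly when sigma_i = +1."""
    total = 0.0
    for _ in range(draws):
        sigma = [random.choice((-1, 1)) for _ in range(m)]
        total += sum(s == 1 for s in sigma) / m  # sup value for this sigma
    return total / draws

estimate = rc_all_01_functions(m=50)  # should be close to 1/2
```

This is exactly the "ability to fit random noise" reading: a class rich enough to realize any labeling of S has the largest possible complexity.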


3) F = L(H), H = binary classifiers: R_S(F) ≤ sqrt( ln(2|H[S]|) / m ), where H[S] is the set of labelings of S realized by H.

  H finite: R_S(F) ≤ sqrt( ln(2|H|) / m ).

Rademacher Complexity Bounds

The bound E_D[f(z)] ≤ E_S[f(z)] + 2 R_S(F) + 3 sqrt( ln(1/δ) / m ) is a data-dependent bound: it bounds the expectation of each f ∈ F in terms of its empirical average & the Rademacher complexity of F measured on the sample itself.

The proof uses symmetrization and ghost-sample tricks (same as for the VC bound).

Rademacher Complexity: Binary Classification

Fact: For H = {h : X → Y} a hypothesis space (e.g., linear separators), F = L(H), d = VCdim(H):

  R_S(F) ≤ sqrt( ln(2|H[S]|) / m ), so by Sauer's lemma, R_m(F) ≤ sqrt( 2d ln(em/d) / m ).

Theorem: For any H, any distribution D, w.h.p. ≥ 1 − δ, all h ∈ H satisfy:

  errD(h) ≤ errS(h) + R_m(H) + 3 sqrt( ln(2/δ) / (2m) )

so:

  errD(h) ≤ errS(h) + sqrt( 2d ln(em/d) / m ) + 3 sqrt( ln(2/δ) / (2m) )   [generalization bound]

Many more uses: margin bounds for SVM, boosting, regression bounds, etc.

Can we use our bounds for model selection?

True Error, Training Error, Overfitting

Model selection: trade-off between decreasing training error and keeping H simple.

  errD(h) ≤ errS(h) + R_m(H) + …

[Figure: error vs. complexity — the train-error curve decreases with hypothesis complexity while the generalization-error curve is U-shaped.]

Structural Risk Minimization (SRM)

𝐻1 ⊆𝐻2 ⊆𝐻3 ⊆⋯⊆𝐻𝑖 ⊆…

[Figure: error rate vs. hypothesis complexity — empirical error decreases monotonically, while the true error turns back up in the overfitting regime.]

What happens if we increase m?

The empirical-error curve stays close to the true-error curve for longer; everything shifts to the right…


Structural Risk Minimization (SRM)

𝐻1 ⊆ 𝐻2 ⊆ 𝐻3 ⊆ ⋯ ⊆ 𝐻𝑘 ⊆ …

• hk = argmin_{h∈Hk} { errS(h) }

• As k increases, errS(hk) goes down but the complexity term goes up.

• Let k̂ = argmin_{k≥1} { errS(hk) + complexity(Hk) }; output ĥ = h_k̂.

Claim: W.h.p., errD(ĥ) ≤ min_{k*} min_{h*∈H_{k*}} [ errD(h*) + 2 · complexity(H_{k*}) ].

Proof:

• We chose ĥ s.t. errS(ĥ) + complexity(H_k̂) ≤ errS(h*) + complexity(H_{k*}).

• W.h.p., errD(ĥ) ≤ errS(ĥ) + complexity(H_k̂).

• W.h.p., errS(h*) ≤ errD(h*) + complexity(H_{k*}).

Chaining these three inequalities gives errD(ĥ) ≤ errD(h*) + 2 · complexity(H_{k*}).
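The SRM rule itself fits in a few lines. In this sketch the empirical errors and complexity penalties are made-up illustrative numbers, not values from the lecture:

```python
def srm_select(emp_errors, penalties):
    """Return the index k minimizing errS(h_k) + complexity(H_k) over nested
    classes H_1 ⊆ H_2 ⊆ ...; emp_errors falls with k while penalties rise."""
    scores = [e + c for e, c in zip(emp_errors, penalties)]
    return min(range(len(scores)), key=scores.__getitem__)

emp_errors = [0.30, 0.15, 0.08, 0.06, 0.055]  # errS(h_k): non-increasing in k
penalties = [0.01, 0.03, 0.07, 0.15, 0.30]    # complexity(H_k): increasing in k
best_k = srm_select(emp_errors, penalties)    # index 2: 0.08 + 0.07 = 0.15
```

Neither the smallest class (high training error) nor the largest (high penalty) wins; the penalized objective picks the middle of the U.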

Techniques to Handle Overfitting

• Structural Risk Minimization (SRM): 𝐻1 ⊆ 𝐻2 ⊆ ⋯ ⊆ 𝐻𝑖 ⊆ …

  Minimize the generalization bound: ĥ = argmin_{k≥1} { errS(hk) + complexity(Hk) }.

  – Often computationally hard…

  – Nice case where it is possible: M. Kearns, Y. Mansour, ICML'98, "A Fast, Bottom-Up Decision Tree Pruning Algorithm with Near-Optimal Generalization".

• Regularization: general family closely related to SRM.

  – E.g., SVM, regularized logistic regression, etc.

  – Minimize expressions of the form errS(h) + λ‖h‖², where ‖·‖ is some norm when H is a vector space (e.g., the L2 norm) and λ is picked through cross-validation.

• Cross Validation:

  – Hold out part of the training data and use it as a proxy for the generalization error.
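A minimal hold-out split, as a sketch (the function name and the 80/20 fraction are our choices, not from the lecture):

```python
import random

random.seed(0)

def holdout_split(data, train_frac=0.8):
    """Shuffle indices and split data into a training part and a held-out part;
    error on the held-out part serves as a proxy for the generalization error."""
    idx = list(range(len(data)))
    random.shuffle(idx)
    cut = int(train_frac * len(data))
    train = [data[i] for i in idx[:cut]]
    held_out = [data[i] for i in idx[cut:]]
    return train, held_out

train, held_out = holdout_split(list(range(100)))
```

Because the held-out examples are never used for fitting, their average loss is an unbiased estimate of the true error of the trained hypothesis.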

What you should know

• Notion of sample complexity.

• Understand the reasoning behind the simple sample complexity bound for finite H [exam question!].

• Shattering, VC dimension as measure of complexity, Sauer’s lemma, form of the VC bounds (upper and lower bounds).

• Rademacher Complexity.

• Model Selection, Structural Risk Minimization.

L2 vs. L1 Regularization

• Gaussian prior P(W) ↔ L2 regularization.

• Laplace prior P(W) ↔ L1 regularization.

[Figure: in the (w1, w2) plane, contours of constant P(Data|W) meet a circular constant-P(W) contour for the Gaussian prior, and a diamond-shaped one for the Laplace prior.]