# CS代考计算机代写 algorithm Machine Learning Theory II

Machine Learning Theory II

Maria-Florina (Nina) Balcan

February 11th, 2015

Sample complex.

Â

Two Core Aspects of Machine Learning

Algorithm Design. How to optimize?

Computation

Automatically generate rules that do well on observed data.

• E.g.: logistic regression, SVM, Adaboost, etc.

Confidence Bounds, Generalization

(Labeled) Data

Confidence for rule effectiveness on future data.

Today’s focus: Sample Complexity for Supervised Classification (Function Approximation)

• Statistical Learning Theory (Vapnik) • PAC (Valiant)

• Recommended reading: Mitchell: Ch. 7 • Suggested exercises: 7.1, 7.2, 7.7

• Additional resources: my learning theory course!

•

Supervised Learning

E.g., which emails are spam and which are important.

Not spam

spam

•

E.g., classify images as man versus women.

Man Women

Learning Algorithm

PAC/SLT models for Supervised Learning

Data Source

Distribution D on X

Expert / Oracle

Labeled Examples

(x1,c*(x1)),…, (xm,c*(xm))

Alg.outputs

h:X!Y

x1 > 5

c* : X ! Y + + –

x6 > 2 +1

+1 -1

+ —

•

•

PAC/SLT models for Supervised Learning

X – feature/instance space; distribution D over X e.g.,X=Rd orX={0,1}d

Algo sees training sample S: (x1,c*(x1)),…, (xm,c*(xm)), xi i.i.d. from D – labeledexamples-drawni.i.d.fromDandlabeledbytargetc* – labels 2 {-1,1} – binary classification

• Algo does optimization over S, find hypothesis h. • Goal: h has small error over D.

𝑒𝑟𝑟𝐷 h = Pr (h 𝑥 ≠ 𝑐∗(𝑥)) 𝑥~𝐷

Bias: fix hypothesis space H

• Realizable: 𝑐∗ ∈ 𝐻.

• Agnostic: 𝑐∗ “close to” H.

h++ c* – ++ —

–

Instance space X

[whose complexity is not too large]

PAC/SLT models for Supervised Learning

• Algo sees training sample S: (x1,c*(x1)),…, (xm,c*(xm)), xi i.i.d. from D

• Does optimization over S, find hypothesis h ∈ 𝐻.

• Goal: h has small error over D.

True error: errD h = Pr (h x ≠ c∗(x))

x~D

How often h 𝑥 ≠ 𝑐∗(𝑥) over future instances drawn at random from D

• But, can only measure:

Trainingerror:errS h =m1 iI h xi ≠c∗ xi

How often h 𝑥 ≠ 𝑐∗(𝑥) over training instances

Sample complexity: bound 𝑒𝑟𝑟𝐷 h in terms of 𝑒𝑟𝑟𝑆 h

Sample Complexity for Supervised Learning

Consistent Learner

• Input: S: (x1,c*(x1)),…, (xm,c*(xm))

• Output: Find h in H consistent with the sample (if one exits).

Bound only logarithmic in |H|, linear in 1/𝜖 Probability over different samples of m

training examples

So, if c∗ ∈ H and can find consistent fns, then only need this many examples to get generalization error ≤ 𝜖 with prob. ≥ 1 − 𝛿

Sample Complexity for Supervised Learning

Consistent Learner

• Input: S: (x1,c*(x1)),…, (xm,c*(xm))

• Output: Find h in H consistent with the sample (if one exits).

Example: H is the class of conjunctions over X = 0,1 n. |H| = 3n E.g.,h=x1x3x5 orh=x1x2x4𝑥9

Then𝑚≥1𝜖 𝑛ln3+ln 𝛿1 suffice

Sample Complexity: Finite Hypothesis Spaces Realizable Case

1) PAC: How many examples suffice to guarantee small error whp.

2) Statistical Learning Way: Withprobabilityatleast1−𝛿,forallh∈Hs.t.errS h =0wehave

errD(h)≤m1 lnH+ln𝛿1 .

Supervised Learning: PAC model (Valiant)

• X – instance space, e.g., X = 0,1 n or X = Rn

• Sl={(xi, yi)} – labeled examples drawn i.i.d. from some

distr. D over X and labeled by some target concept c* – labels 2 {-1,1} – binary classification

• Algorithm A PAC-learns concept class H if for any

target c* in H, any distrib. D over X, any , > 0:

– A uses at most poly(n,1/,1/,size(c*)) examples and running

time.

– With prob. ≥ 1 − 𝛿, A produces h in H of error at · .

What if H is infinite?

E.g., linear separators in Rd E.g., thresholds on the real line

+ + – + —

-+

w

-+-

a

E.g., intervals on the real line

b

Sample Complexity: Infinite Hypothesis Spaces

• H[m] – maximum number of ways to split m points using concepts

in H; i.e.

Sauer’s Lemma: H m = O mVCdim H

Effective number of hypotheses

• H[S] – the set of splittings of dataset S using concepts from H.

• H[m] – max number of ways to split m points using concepts in H

H m = max |H[S]| S =m

• •

Effective number of hypotheses

H[S] – the set of splittings of dataset S using concepts from H. H[m] – max number of ways to split m points using concepts in H

H m = max |H[S]| S =m

H[m] ≤ 2m

E.g., H= Thresholds on the real line —+

—-

–++

-+++ ++++

-+

w

|H S | = 5

In general, if |S|=m (all distinct), |H S | = m + 1 ≪ 2m

• •

Effective number of hypotheses

H[S] – the set of splittings of dataset S using concepts from H. H[m] – max number of ways to split m points using concepts in H

H m = max |H[S]| S =m

E.g., H= Intervals on the real line

–+-

—-

In general, |S|=m (all distinct), H m = m m+1 + 1 = O(m2) ≪ 2m 2

There are m+1 possible options for the first part, m left for the second part, the order does not matter, so (m choose 2) + 1 (for empty interval).

H[m] ≤ 2m

-+-

• •

Effective number of hypotheses H[S] – the set of splittings of dataset S using concepts from H.

H[m] – max number of ways to split m points using concepts in H

H m = max |H[S]| S =m

H[m] ≤ 2m

Definition: H shatters S if |H S | = 2|𝑆|.

Sample Complexity: Infinite Hypothesis Spaces

• H[m] – max number of ways to split m points using concepts in H

S= {x1, x2, … , xm} i.i.d. from D B:∃h∈HwitherrS h =0buterrD h ≥ε.

S’ ={x1′ , … , x𝑚′ } another i.i.d. “ghost sample” from D

B’:∃h∈HwitherrS h =0buterrS′ h ≥ε. Claim: To bound P(B), sufficient to bound P(B’)

.Over S ∪ S′ only H[2m] effective hypotheses left… but, no randomness left.

Need randomness to bound the probability of a bad event, another symmetrization trick….

Very Very Rough Idea:

Sample Complexity: Infinite Hypothesis Spaces Realizable Case

H[m] – max number of ways to split m points using concepts in H

• Nottooeasytointerpretsometimeshardtocalculate exactly, but can get a good bound using “VC-dimension

IfHm =2m,thenm≥m(….) ε

• VC-dimension is roughly the point at which H stops looking like it contains all functions, so hope for solving for m.

Sample Complexity: Infinite Hypothesis Spaces H[m] – max number of ways to split m points using concepts in H

Sauer’s Lemma: H m = O mVCdim H

Shattering, VC-dimension Definition: H shatters S if |H S | = 2|𝑆|.

A set of points S is shattered by H is there are hypotheses in H that split S in all of the 2|𝑆| possible ways, all possible ways of classifying points in S are achievable using concepts in H.

Definition: VC-dimension (Vapnik-Chervonenkis dimension)

The VC-dimension of a hypothesis space H is the cardinality of the largest set S that can be shattered by H.

If arbitrarily large finite sets can be shattered by H, then

VCdim(H) = ∞

Shattering, VC-dimension

Definition: VC-dimension (Vapnik-Chervonenkis dimension)

The VC-dimension of a hypothesis space H is the cardinality of the largest set S that can be shattered by H.

If arbitrarily large finite sets can be shattered by H, then VCdim(H) = ∞

To show that VC-dimension is d:

– there exists a set of d points that can be shattered

– there is no set of d+1 points that can be shattered.

Fact: If H is finite, then VCdim(H) ≤ log(|H|).

Shattering, VC-dimension

If the VC-dimension is d, that means there exists a set of d points that can be shattered, but there is no set of d+1 points that can be shattered.

E.g., H= Thresholds on the real line

-+

w

-+-

VCdimH =1

+-

E.g., H= Intervals on the real line

VCdimH =2

+-+

Shattering, VC-dimension

If the VC-dimension is d, that means there exists a set of d points that can be shattered, but there is no set of d+1 points that can be shattered.

E.g., H= Union of k intervals on the real line

VCdimH =2k

+

-+-+-

VCdim H ≥ 2k VCdimH <2k+1
+
A sample of size 2k shatters (treat each pair of points as a separate case of intervals)
-+-+ ...
Shattering, VC-dimension
E.g., H= linear separators in R2 VCdim H ≥ 3
Shattering, VC-dimension
E.g., H= linear separators in R2 VCdim H < 4
Case 1: one point inside the triangle formed by the others. Cannot label inside point as positive and outside points as negative.
Case 2: all points on the boundary (convex hull). Cannot label two diagonally as positive and other two as negative.
Fact: VCdim of linear separators in Rd is d+1
Sauer’s Lemma Sauer’s Lemma:
Let d = VCdim(H)
• m≤d,thenHm =2m • m>d,thenHm =Om𝑑

Proof: induction on m and d. Cool combinatorial argument! Hint: try proving it for intervals…

Sample Complexity: Infinite Hypothesis Spaces Realizable Case

Sauer’s Lemma: H m = O mVCdim H

Sample Complexity: Infinite Hypothesis Spaces Realizable Case

E.g., H= linear separators in Rd

Sample complexity linear in d

So, if double the number of features, then I only need roughly twice the number of samples to do well.

Nearly Matching Bounds

Theorem (lower bound):

Forany𝐻,anyalgo𝐴,any0<𝜖<1/8,any𝛿<.01,∃ distr. 𝐷and
target 𝑐∗ ∈ 𝐻 s.t. if 𝐴 sees fewer than max 1 log 1 , 𝑉𝐶𝑑𝑖𝑚 𝐻 −1 𝜖 𝛿 32𝜖
examples, then with prob. ≥ 𝛿, 𝐴 produces h with 𝑒𝑟𝑟𝐷 h > 𝜖.

Lower Bound (simpler form)

• Theorem: For any 𝐻 there exists 𝐷 such that any algorithm needs Ω 𝑉𝐶𝑑𝑖𝑚 𝐻 examples to reach error 𝜖 with prob ≥ 3.

𝜖4

• Proof: consider 𝑑 = 𝑉𝐶𝑑𝑖𝑚(𝐻) shattered points:

Prob 1 − 4𝜖

Prob 4𝜖 each 𝑑−1

• Consider target 𝑐∗ that labels these points randomly.

• Unless I see roughly 1⁄2 of the rare points, have error ≥ 𝜖

• Each example has only prob 4𝜖 of being one of the rare points,

and need to see 𝑑−1 rare points, so need to see Ω 𝑑 total. 2𝜖

What if c∗ ∉ H?

Uniform Convergence

•

• •

This basic result only bounds the chance that a bad hypothesis looks perfect on the data. What if there is no perfect h∈H (agnostic case)?

What can we say if c∗ ∉ H?

Can we say that whp all h∈H satisfy |errD(h) – errS(h)| ≤ ?

– Called “uniform convergence”.

– Motivates optimizing over S, even if we can’t find a perfect function.

Sample Complexity: Finite Hypothesis Spaces Realizable Case

Agnostic Case

What if there is no perfect h?

To prove bounds like this, need some good tail inequalities.

Hoeffding bounds

Consider coin of bias p flipped m times.

Let N be the observed # heads. Let ∈ [0,1].

Hoeffding bounds: 2

• Pr[N/m > p + ] ≤ e-2m , and • Pr[N/m < p - ] ≤ e-2m2.
Exponentially decreasing tails
• Tail inequality: bound probability mass in tail of distribution (how concentrated is a random variable around its expectation).
Sample Complexity: Finite Hypothesis Spaces Agnostic Case
• Proof: Just apply Hoeffding. 2
– Chance of failure at most 2|H|e-2|S| .
– Set to . Solve.
• So, whp, best on sample is -best over D.
– Note: this is worse than previous bound (1/ has become 1/2), because we are asking for something stronger.
– Can also get bounds “between” these two.
• •
•
What you should know
Notion of sample complexity.
Understand reasoning behind the simple sample complexity bound for finite H.
Shattering, VC dimension as measure of complexity, Sauer’s lemma, form of the VC bounds (upper and lower bounds).