# Machine Learning Theory

Maria-Florina (Nina) Balcan

February 9th, 2015


Goals of Machine Learning Theory

Develop & analyze models to understand:

• what kinds of tasks we can hope to learn, and from what kind of data; what are key resources involved (e.g., data, running time)

• prove guarantees for practically successful algs (when will they succeed, how long will they take?)

• develop new algs that provably meet desired criteria (within new learning paradigms)

Interesting tools & connections to other areas:

• Algorithms, Probability & Statistics, Optimization, Complexity Theory, Information Theory, Game Theory.

Very vibrant field:

• Conference on Learning Theory

• NIPS, ICML


Today’s focus: Sample Complexity for Supervised Classification (Function Approximation)

• Statistical Learning Theory (Vapnik)

• PAC (Valiant)

• Recommended reading: Mitchell, Ch. 7

• Suggested exercises: 7.1, 7.2, 7.7

• Additional resources: my learning theory course!

Supervised Classification

Decide which emails are spam and which are important.

[Figure: a stream of emails mapped by supervised classification into "spam" vs. "not spam".]

Goal: use emails seen so far to produce good prediction rule for future data.


Example: Supervised Classification

Represent each message by features. (e.g., keywords, spelling, etc.)

[Table: example messages represented as feature vectors, each with a spam/not-spam label.]

Reasonable RULES:

Predict SPAM if unknown AND (money OR pills)

Predict SPAM if 2·money + 3·pills − 5·known > 0

Linearly separable


Very well understood: Occam's bound, VC theory, etc.

Note: to talk about these we need a precise model.

Two Core Aspects of Machine Learning

Algorithm Design (Computation). How to optimize?

• Automatically generate rules that do well on observed data.

• E.g.: logistic regression, SVM, Adaboost, etc.

Confidence Bounds, Generalization ((Labeled) Data)

• Confidence for rule effectiveness on future data.

PAC/SLT models for Supervised Learning

• Data Source: distribution D on X

• Expert / Oracle: labels examples by the target c* : X → Y

• Learning Algorithm: receives labeled examples (x1, c*(x1)), …, (xm, c*(xm)) and outputs h : X → Y

[Figure: pipeline from data source through expert/oracle to the learning algorithm; the output h is drawn as a small decision tree (split on x1 > 5, then x6 > 2, with ±1 leaves) separating + from − points.]

PAC/SLT models for Supervised Learning

• Distribution D on X; target c* : X → Y. Today: Y = {-1, 1}.

• Algo sees training sample S: (x1, c*(x1)), …, (xm, c*(xm)), with the xi drawn independently and identically distributed (i.i.d.) from D and labeled by c*.

• Does optimization over S, finds hypothesis h (e.g., a decision tree), and outputs h : X → Y.

• Goal: h has small error over D.
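The sampling process above can be sketched in a few lines of Python. Everything concrete here is a made-up illustration, not from the slides: X = {0,1}² with D uniform, target c* the OR of the two bits, and a deliberately imperfect hypothesis h that looks only at the first bit.

```python
# Hypothetical instantiation of the PAC/SLT setup (not from the slides):
# X = {0,1}^2, D uniform over X, labels in {-1, +1}.
import random

def c_star(x):
    # Target concept c* : X -> {-1, +1}; here, the OR of the two bits.
    return 1 if (x[0] or x[1]) else -1

def h(x):
    # A hypothesis h : X -> {-1, +1} that ignores the second bit.
    return 1 if x[0] else -1

def draw_sample(m, seed=0):
    # S = (x1, c*(x1)), ..., (xm, c*(xm)) with the xi i.i.d. from D.
    rng = random.Random(seed)
    xs = [(rng.randint(0, 1), rng.randint(0, 1)) for _ in range(m)]
    return [(x, c_star(x)) for x in xs]

# h disagrees with c* only on x = (0, 1), so err_D(h) = 1/4 under uniform D;
# the empirical error on a large sample should be close to that.
S = draw_sample(10000)
emp_err = sum(1 for x, y in S if h(x) != y) / len(S)
```

With m = 10000 the empirical error concentrates near the true error 1/4; quantifying exactly this gap is what the sample-complexity bounds later in the lecture are about.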

PAC/SLT models for Supervised Learning

• X – feature or instance space; distribution D over X. E.g., X = R^d or X = {0,1}^d.

• Algo sees training sample S: (x1, c*(x1)), …, (xm, c*(xm)), xi i.i.d. from D
– labeled examples, assumed to be drawn i.i.d. from some distribution D over X and labeled by some target concept c*
– labels ∈ {-1, 1} – binary classification

• Algo does optimization over S, finds hypothesis h.

• Goal: h has small error over D, err_D(h) = Pr_{x∼D}[h(x) ≠ c*(x)].

Need a bias: no free lunch.

[Figure: instance space X under distribution D, with the regions where h and c* predict + and −; h errs where it disagrees with c*.]

Function Approximation: The Big Picture


PAC/SLT models for Supervised Learning

• X – feature or instance space; distribution D over X. E.g., X = R^d or X = {0,1}^d.

• Algo sees training sample S: (x1, c*(x1)), …, (xm, c*(xm)), xi i.i.d. from D and labeled by some target concept c*; labels ∈ {-1, 1} – binary classification.

• Algo does optimization over S, finds hypothesis h.

• Goal: h has small error over D, err_D(h) = Pr_{x∼D}[h(x) ≠ c*(x)].

Bias: fix a hypothesis space H (whose complexity is not too large).

• Realizable case: c* ∈ H. Agnostic case: c* "close to" H.

PAC/SLT models for Supervised Learning

• Algo sees training sample S: (x1, c*(x1)), …, (xm, c*(xm)), xi i.i.d. from D.

• Does optimization over S, finds hypothesis h ∈ H.

• Goal: h has small error over D.

True error: err_D(h) = Pr_{x∼D}[h(x) ≠ c*(x)]
– how often h(x) ≠ c*(x) over future instances drawn at random from D.

But we can only measure:

Training error: err_S(h) = (1/m) Σ_i I[h(x_i) ≠ c*(x_i)]
– how often h(x) ≠ c*(x) over the training instances.

Sample complexity: bound err_D(h) in terms of err_S(h).


Sample Complexity for Supervised Learning

Consistent Learner

• Input: S: (x1, c*(x1)), …, (xm, c*(xm))

• Output: find h in H consistent with the sample (if one exists).

Theorem: m ≥ (1/ε)(ln|H| + ln(1/δ)) labeled examples suffice so that, with probability ≥ 1 − δ, every h ∈ H consistent with S has err_D(h) ≤ ε.

Contrapositive: if the target is in H, and we have an algo that can find consistent fns, then we only need this many examples to get generalization error ≤ ε with prob. ≥ 1 − δ.
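For a finite class given explicitly as a list of functions, a consistent learner is just a scan over H. This is a sketch under that assumption; the threshold class and toy sample below are invented for illustration.

```python
# A consistent learner for a small, explicitly enumerated finite class H:
# return some h in H that makes no mistakes on S (None if none exists).
def consistent_learner(H, S):
    for h in H:
        if all(h(x) == y for x, y in S):
            return h
    return None

# Toy class: thresholds "predict +1 iff x >= t" for t = 0..4 over integer x.
H = [lambda x, t=t: 1 if x >= t else -1 for t in range(5)]

# Sample labeled by the target threshold t = 2 (realizable case).
S = [(0, -1), (1, -1), (2, 1), (3, 1)]
h = consistent_learner(H, S)
```

Any h this procedure returns has zero training error, which is exactly the precondition of the realizable-case bound.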

Sample Complexity for Supervised Learning

Consistent Learner

• Input: S: (x1, c*(x1)), …, (xm, c*(xm))

• Output: find h in H consistent with the sample (if one exists).

Bound is inversely linear in ε; bound is only logarithmic in |H|.

ε is called the error parameter:

• D might place low weight on certain parts of the space.

δ is called the confidence parameter:

• there is a small chance the examples we get are not representative of the distribution.

Sample Complexity for Supervised Learning

Consistent Learner

• Input: S: (x1, c*(x1)), …, (xm, c*(xm))

• Output: find h in H consistent with the sample (if one exists).

Example: H is the class of conjunctions over X = {0,1}^n, so |H| = 3^n (each variable appears positively, negatively, or not at all). E.g., h = x1 ∧ x3 ∧ x5 or h = x1 ∧ x2 ∧ x4 ∧ ¬x9.

Then m ≥ (1/ε)(n ln 3 + ln(1/δ)) examples suffice.

For n = 10, ε = 0.1, δ = 0.01, m ≥ 156 suffice.
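The numbers on this slide can be checked directly: plugging |H| = 3^n into the realizable-case bound m ≥ (1/ε)(ln|H| + ln(1/δ)) gives the conjunction formula, and the stated 156 for n = 10, ε = 0.1, δ = 0.01.

```python
# Realizable-case sample complexity for conjunctions: |H| = 3^n, so
# m >= (1/eps) * (n ln 3 + ln(1/delta)) examples suffice.
import math

def conjunction_sample_bound(n, eps, delta):
    return math.ceil((1 / eps) * (n * math.log(3) + math.log(1 / delta)))

m = conjunction_sample_bound(n=10, eps=0.1, delta=0.01)  # matches the slide: 156
```

Note how mild the dependence on n is: doubling the number of features only doubles the n·ln 3 term, reflecting the ln|H| in the general bound.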

Sample Complexity for Supervised Learning

Consistent Learner

• Input: S: (x1,c*(x1)),…, (xm,c*(xm))

• Output: find h in H consistent with the sample (if one exists).

Example: H is the class of conjunctions over X = {0,1}^n.

Side HWK question: show that any conjunction can be represented by a small decision tree; also by a linear separator.

Sample Complexity for Supervised Learning

Proof

Assume there are k bad hypotheses h1, h2, …, hk with err_D(hi) ≥ ε.

1) Fix hi. The prob. that hi is consistent with the first training example is ≤ 1 − ε, so the prob. that hi is consistent with the first m training examples is ≤ (1 − ε)^m. (The probability is over the draw of the m training examples.)

2) The prob. that at least one hi is consistent with the first m training examples is ≤ k(1 − ε)^m ≤ |H|(1 − ε)^m.

3) Calculate the value of m so that |H|(1 − ε)^m ≤ δ. Using the fact that 1 − x ≤ e^{−x}, it suffices to set |H|e^{−εm} ≤ δ.

Sample Complexity: Finite Hypothesis Spaces (Realizable Case)

1) PAC way: how many examples suffice to guarantee small error whp.

2) Statistical Learning way: with probability at least 1 − δ, for all h ∈ H s.t. err_S(h) = 0 we have

err_D(h) ≤ (1/m)(ln|H| + ln(1/δ)).
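The two readings are the same inequality solved for different variables. A small sketch of the statistical-learning reading: given m, |H|, and δ, compute the error guarantee; the specific numbers reuse the conjunction example from earlier in the lecture.

```python
# Statistical-learning form: w.p. >= 1 - delta, every h in H with
# err_S(h) = 0 satisfies err_D(h) <= (1/m) * (ln|H| + ln(1/delta)).
import math

def error_bound(H_size, m, delta):
    return (math.log(H_size) + math.log(1 / delta)) / m

# With |H| = 3^10, m = 156, delta = 0.01 this recovers eps close to 0.1,
# consistent with the PAC reading of the same bound.
eps = error_bound(3 ** 10, 156, 0.01)
```

Doubling m halves the guaranteed error, which is the "inversely linear in ε" behavior noted earlier.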

Supervised Learning: PAC model (Valiant)

• X – instance space, e.g., X = {0,1}^n or X = R^n.

• S = {(xi, yi)} – labeled examples drawn i.i.d. from some distribution D over X and labeled by some target concept c*; labels ∈ {-1, 1} – binary classification.

• Algorithm A PAC-learns concept class H if for any target c* in H, any distribution D over X, any ε, δ > 0:

– A uses at most poly(n, 1/ε, 1/δ, size(c*)) examples and running time.

– With probability ≥ 1 − δ, A produces h in H of error at most ε.

Uniform Convergence

• This basic result only bounds the chance that a bad hypothesis looks perfect on the data. What if there is no perfect h ∈ H (agnostic case)?

• What can we say if c* ∉ H?

• Can we say that whp all h ∈ H satisfy |err_D(h) − err_S(h)| ≤ ε?

– Called "uniform convergence".

– Motivates optimizing over S, even if we can't find a perfect function.

Sample Complexity: Finite Hypothesis Spaces (Agnostic Case)

What if there is no perfect h? Then we aim for uniform convergence: with probability ≥ 1 − δ, all h ∈ H satisfy |err_D(h) − err_S(h)| ≤ ε.

To prove bounds like this, we need some good tail inequalities.

Hoeffding bounds

Consider a coin of bias p flipped m times. Let N be the observed # heads, and let ε ∈ [0,1].

Hoeffding bounds:

• Pr[N/m > p + ε] ≤ e^{−2mε²}, and

• Pr[N/m < p − ε] ≤ e^{−2mε²}.

Exponentially decreasing tails.

• Tail inequality: bound the probability mass in the tail of a distribution (how concentrated a random variable is around its expectation).
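The bound is easy to probe empirically. A sketch (trial counts and parameters are arbitrary choices): simulate the coin, estimate the upper-tail probability, and compare against e^{−2mε²}.

```python
# Empirical check of the Hoeffding upper tail: estimate
# Pr[N/m > p + eps] by simulation and compare with e^{-2 m eps^2}.
import math
import random

def upper_tail_frequency(p, m, eps, trials=5000, seed=1):
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        heads = sum(1 for _ in range(m) if rng.random() < p)
        if heads / m > p + eps:
            hits += 1
    return hits / trials

p, m, eps = 0.5, 100, 0.1
observed = upper_tail_frequency(p, m, eps)
bound = math.exp(-2 * m * eps ** 2)  # e^{-2}, about 0.135
```

Hoeffding is not tight here (the true tail for these parameters is much smaller than e^{−2}), but it holds for any p, which is exactly what the union-bound argument on the next slide needs.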
Sample Complexity: Finite Hypothesis Spaces (Agnostic Case)

• Proof: just apply Hoeffding.

– For a fixed h, the chance that |err_D(h) − err_S(h)| > ε is at most 2e^{−2|S|ε²}; a union bound over H gives a chance of failure of at most 2|H|e^{−2|S|ε²}.

– Set this to δ. Solve.

• So, whp, the best h on the sample is ε-best over D.

– Note: this is worse than the previous bound (1/ε has become 1/ε²), because we are asking for something stronger.

– Can also get bounds "between" these two.
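Solving 2|H|e^{−2mε²} ≤ δ for m makes the 1/ε vs 1/ε² comparison concrete. A sketch, reusing the conjunction class as the example of a finite H:

```python
# Agnostic vs realizable sample complexity for a finite H.
import math

def realizable_bound(H_size, eps, delta):
    # From |H| e^{-eps m} <= delta: m >= (1/eps)(ln|H| + ln(1/delta)).
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / eps)

def agnostic_bound(H_size, eps, delta):
    # From 2|H| e^{-2 m eps^2} <= delta:
    # m >= (1/(2 eps^2))(ln(2|H|) + ln(1/delta)).
    return math.ceil((math.log(2 * H_size) + math.log(1 / delta)) / (2 * eps ** 2))

# Same |H| = 3^10, eps = 0.1, delta = 0.01: the uniform-convergence
# guarantee costs several times more examples.
m_re = realizable_bound(3 ** 10, 0.1, 0.01)
m_ag = agnostic_bound(3 ** 10, 0.1, 0.01)
```

Both bounds share the ln|H| term; the extra cost of the agnostic case is entirely in the 1/ε → 1/ε² dependence.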