# The No-Free-Lunch Theorem


Prof. Dan A. Simovici

UMB


Outline

1. Preliminaries
2. The No-Free-Lunch Theorem

## Preliminaries

### Reminder

If K is an event such that P(K) = p, the indicator 1_K is the random variable

$$\mathbf{1}_K = \begin{cases} 1 & \text{if } K \text{ takes place}, \\ 0 & \text{otherwise}, \end{cases}$$

that is,

$$\mathbf{1}_K : \begin{pmatrix} 0 & 1 \\ 1-p & p \end{pmatrix}, \quad\text{and}\quad E(\mathbf{1}_K) = p.$$

If X is a random variable

$$X : \begin{pmatrix} x_1 & \cdots & x_n \\ p_1 & \cdots & p_n \end{pmatrix},$$

then

$$X = \sum_{i=1}^{n} x_i \mathbf{1}_{X = x_i}, \quad\text{where}\quad \mathbf{1}_{X = x_i} : \begin{pmatrix} 0 & 1 \\ 1-p_i & p_i \end{pmatrix}.$$
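The indicator identities above can be checked numerically. The sketch below uses an arbitrarily chosen three-point random variable (the values and probabilities are illustrative, not from the slides): it estimates E(1_{X=x_i}) by sampling and verifies the pointwise identity X = Σ_i x_i · 1_{X=x_i}.

```python
import random

# Sketch: empirical check of the indicator identities for a three-point
# random variable X (values and probabilities chosen arbitrarily).
values = [2.0, 5.0, 7.0]
probs = [0.5, 0.3, 0.2]

random.seed(0)
samples = random.choices(values, weights=probs, k=100_000)

# E(1_{X = x_i}) = p_i: the empirical frequency of {X = x_i} approximates p_i.
for x, p in zip(values, probs):
    freq = sum(1 for s in samples if s == x) / len(samples)
    assert abs(freq - p) < 0.01

# Pointwise identity: X = sum_i x_i * 1_{X = x_i}.
for s in samples[:100]:
    assert s == sum(x * (1 if s == x else 0) for x in values)
```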

### First Lemma

Lemma. Let Z be a random variable that takes values in [0, 1] such that E[Z] = μ. Then, for every a ∈ (0, 1) we have

$$P(Z > 1-a) \geq \frac{\mu-(1-a)}{a} \quad\text{and}\quad P(Z > a) \geq \frac{\mu-a}{1-a} \geq \mu - a.$$

Proof: The random variable Y = 1 − Z is non-negative with E(Y) = 1 − E(Z) = 1 − μ. By Markov's inequality:

$$P(Z \leq 1-a) = P(1-Z \geq a) = P(Y \geq a) \leq \frac{E(Y)}{a} = \frac{1-\mu}{a}.$$

Therefore,

$$P(Z > 1-a) \geq 1 - \frac{1-\mu}{a} = \frac{a+\mu-1}{a} = \frac{\mu-(1-a)}{a}.$$

### Proof (cont'd)

By replacing a with 1 − a we have:

$$P(Z > a) \geq \frac{\mu-a}{1-a} \geq \mu - a.$$
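A quick numeric sanity check of the lemma (a sketch, not part of the proof): take a two-point random variable Z on [0, 1] with Z = 1 with probability q and Z = 0 otherwise, so μ = q; the choice q = 0.6 and the grid of values for a are arbitrary, and a is kept at most μ so that the final chained inequality applies.

```python
# Sketch: check both bounds of the first lemma on a two-point Z in [0,1].
# Z = 1 with probability q, Z = 0 otherwise, so mu = E(Z) = q.
q = 0.6
mu = q

for a in [0.1, 0.2, 0.3, 0.4, 0.5]:
    # For this Z, P(Z > 1-a) = P(Z > a) = q for every a in (0, 1).
    p_tail = q
    assert p_tail >= (mu - (1 - a)) / a - 1e-12       # first bound
    assert p_tail >= (mu - a) / (1 - a) - 1e-12       # second bound
    assert (mu - a) / (1 - a) >= mu - a - 1e-12       # chain, valid since a <= mu
```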

### Second Lemma

Lemma. Let θ be a random variable that ranges in the interval [0, 1] such that E(θ) ≥ 1/4. We have

$$P\left(\theta > \frac{1}{8}\right) \geq \frac{1}{7}.$$

Proof: From the second inequality of the previous lemma it follows that

$$P(\theta > a) \geq \frac{E(\theta) - a}{1 - a}.$$

By substituting a = 1/8 we obtain:

$$P\left(\theta > \frac{1}{8}\right) \geq \frac{\frac{1}{4}-\frac{1}{8}}{1-\frac{1}{8}} = \frac{\frac{1}{8}}{\frac{7}{8}} = \frac{1}{7}.$$
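The arithmetic behind this substitution can be verified with exact rationals; μ = 1/4 and a = 1/8 are exactly the values used above.

```python
from fractions import Fraction

# Exact arithmetic of the second lemma: substitute E(theta) = 1/4 and
# a = 1/8 into the bound (E(theta) - a) / (1 - a).
mu, a = Fraction(1, 4), Fraction(1, 8)
bound = (mu - a) / (1 - a)
assert bound == Fraction(1, 7)
```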

## The No-Free-Lunch Theorem

A learning task is defined by an unknown probability distribution D over X × Y.

The goal of the learner is to find (to learn) a hypothesis h : X → Y whose risk L_D(h) is sufficiently small.

The choice of a hypothesis class H reflects some prior knowledge that the learner has about the task: a belief that a member of H is a low-error model for the task.

Fundamental Question: Does there exist a universal learner A and a training-set size m such that, for every distribution D, if A receives m i.i.d. examples from D, then with high probability A produces a hypothesis h with low risk?

The No-Free-Lunch (NFL) Theorem stipulates that a universal learner (one that works for every distribution) does not exist!

A learner fails if, upon receiving a sequence of i.i.d. examples from a distribution D, its output hypothesis is likely to have a large loss (say, larger than 0.3), whereas for the same distribution there exists another learner that outputs a hypothesis with a small loss.

More precise statement: for every learner for a binary prediction task, there exists a distribution D on which that learner fails.

No learner can succeed on all learning tasks: every learner has tasks on which it fails while other learners succeed.

### Recall: the 0/1 Loss Function

The loss function is the 0/1-loss function ℓ_{0−1}:

$$\ell_{0-1}(h,(x,y)) = \begin{cases} 0 & \text{if } h(x) = y, \\ 1 & \text{if } h(x) \neq y. \end{cases}$$
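A direct transcription of ℓ_{0−1} as a function; the hypothesis h below is an arbitrary toy choice for illustration, not anything from the proof.

```python
# Sketch: the 0/1 loss returns 0 on a correct prediction and 1 on a mistake.
def zero_one_loss(h, x, y):
    return 0 if h(x) == y else 1

h = lambda x: x % 2          # toy hypothesis on integer inputs
assert zero_one_loss(h, 3, 1) == 0   # h(3) = 1, correct label
assert zero_one_loss(h, 3, 0) == 1   # h(3) = 1, wrong label
```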

### The NFL Theorem

For a learning algorithm A, denote by A(S) the hypothesis returned by A upon receiving the training sequence S.

Theorem. Let A be any learning algorithm for the task of binary classification with respect to the 0/1-loss function over a domain X, and let m < |X|/2 be a number representing a training-set size. There exists a distribution D over X × {0, 1} such that:

- there exists a function f : X → {0, 1} with L_D(f) = 0;
- with probability at least 1/7 over the choice of a sample S of size m, the hypothesis h = A(S) satisfies L_D(h) ≥ 1/8.
### Interpretation of the NFL Theorem

For every learner, there is a task on which it fails, even though the same task can be successfully learned by another learner.
### Proof

Let C be a subset of X of size 2m; this set exists because we assume that |X| > 2m.

Intuition of the proof: any algorithm that observes only half of the instances of C has no information about what the labels of the other half should be. Therefore, there exists a target function f that contradicts the labels A(S) predicts on the unobserved instances of C.

Note that:

There are T = 2^{2m} possible functions from C to {0, 1}: f_1, …, f_T.

The set C × {0, 1} consists of the pairs

$$C \times \{0,1\} = \{(x_1, 0), (x_1, 1), \ldots, (x_{2m}, 0), (x_{2m}, 1)\}.$$

For each f_i let D_i be the distribution over C × {0, 1} given by

$$D_i(\{(x,y)\}) = \begin{cases} \frac{1}{|C|} & \text{if } y = f_i(x), \\ 0 & \text{otherwise.} \end{cases}$$

The probability of choosing a pair (x, y) is 1/|C| if y is the true label according to f_i, and 0 otherwise (if y ≠ f_i(x)). Clearly, L_{D_i}(f_i) = 0.

### Intuition

Let m = 3 and C = {x_1, x_2, x_3, x_4, x_5, x_6}. Suppose that

$$f_i(x_1) = 1,\quad f_i(x_2) = 0,\quad f_i(x_3) = 1,\quad f_i(x_4) = 1,\quad f_i(x_5) = 1,\quad f_i(x_6) = 0.$$

The distribution D_i is:

$$D_i : \begin{pmatrix} (x_1,0) & (x_2,0) & (x_3,0) & (x_4,0) & (x_5,0) & (x_6,0) \\ 0 & \frac{1}{6} & 0 & 0 & 0 & \frac{1}{6} \end{pmatrix}$$

$$\begin{pmatrix} (x_1,1) & (x_2,1) & (x_3,1) & (x_4,1) & (x_5,1) & (x_6,1) \\ \frac{1}{6} & 0 & \frac{1}{6} & \frac{1}{6} & \frac{1}{6} & 0 \end{pmatrix}$$

Clearly, we have:

$$L_{D_i}(f_i) = P(\{(x,y) \mid f_i(x) \neq y\}) = 0.$$
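The example distribution D_i can be built programmatically. The sketch below uses the integers 1..6 as stand-ins for x_1, …, x_6, constructs D_i from the labels of f_i exactly as defined above, and checks that D_i is a probability distribution under which f_i has zero risk.

```python
from fractions import Fraction

# Sketch: build D_i for the example. D_i puts mass 1/|C| on each correctly
# labeled pair (x, f_i(x)) and 0 on every other pair.
C = [1, 2, 3, 4, 5, 6]                  # stand-ins for x1..x6
f_i = {1: 1, 2: 0, 3: 1, 4: 1, 5: 1, 6: 0}

D_i = {(x, y): (Fraction(1, len(C)) if y == f_i[x] else Fraction(0))
       for x in C for y in (0, 1)}

# D_i is a probability distribution ...
assert sum(D_i.values()) == 1

# ... and the risk of f_i under D_i is 0: no mass on pairs with f_i(x) != y.
risk = sum(p for (x, y), p in D_i.items() if f_i[x] != y)
assert risk == 0
```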

### Claim (*)

For every algorithm A that receives a training set of m examples from C × {0, 1} and returns a function A(S) : C → {0, 1} we have:

$$\max_{1 \leq i \leq T} E_{S \sim D_i^m}\left(L_{D_i}(A(S))\right) \geq \frac{1}{4}.$$

This means that for every A′ that receives a training set of m examples from X × {0, 1} there exist f : X → {0, 1} and a distribution D over X × {0, 1} such that

$$L_D(f) = 0 \quad\text{and}\quad E_{S \sim D^m}\left(L_D(A'(S))\right) \geq \frac{1}{4}.$$
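Claim (*) can be verified by brute force for a tiny instance. The sketch below takes m = 2, so |C| = 4, T = 2^{2m} = 16 labelings, and k = (2m)^m = 16 sample sequences, and uses one arbitrary learner ("memorize the sample, predict 0 elsewhere"; any other choice of A would do) to confirm that the worst labeling forces expected risk at least 1/4.

```python
from fractions import Fraction
from itertools import product

# Sketch: brute-force check of Claim (*) for m = 2 with an arbitrary learner.
m = 2
C = list(range(2 * m))

def learner(sample):
    """A(S): memorize observed labels, predict 0 on unseen points."""
    memory = dict(sample)
    return lambda x: memory.get(x, 0)

worst_expected_risk = Fraction(0)
for labels in product([0, 1], repeat=len(C)):           # all T labelings f_i
    f = dict(zip(C, labels))
    seqs = list(product(C, repeat=m))                   # all k sequences S_j
    total = Fraction(0)
    for seq in seqs:
        h = learner([(x, f[x]) for x in seq])           # A(S_j^i)
        errs = sum(1 for x in C if h(x) != f[x])
        total += Fraction(errs, len(C))                 # L_{D_i}(A(S_j^i))
    expected_risk = total / len(seqs)                   # E_{S ~ D_i^m}(...)
    worst_expected_risk = max(worst_expected_risk, expected_risk)

assert worst_expected_risk >= Fraction(1, 4)
```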

Note that the index j refers to samples while i refers to labeling functions. There are k = (2m)^m possible sequences (samples) of m examples from C:

$$S_1, \ldots, S_k.$$

If S_j = (x_1, …, x_m), the sequence labeled by a function f_i is denoted by

$$S_j^i = ((x_1, f_i(x_1)), \ldots, (x_m, f_i(x_m))).$$

If the distribution is D_i, then the possible training sets that A can receive are S_1^i, …, S_k^i, and all these training sets have the same probability of being sampled. Therefore, the expected error of A is:

$$E_{S \sim D_i^m}\left(L_{D_i}(A(S))\right) = \frac{1}{k} \sum_{j=1}^{k} L_{D_i}(A(S_j^i)).$$

Recall that there are T = 2^{2m} possible functions from C to {0, 1}: f_1, …, f_T, so 1 ≤ i ≤ T, where i is the superscript of S_j^i reflecting the labeling function f_i.

We have:

$$\begin{aligned} \max_{1 \leq i \leq T} \frac{1}{k} \sum_{j=1}^{k} L_{D_i}(A(S_j^i)) &\geq \frac{1}{T} \sum_{i=1}^{T} \frac{1}{k} \sum_{j=1}^{k} L_{D_i}(A(S_j^i)) \\ &= \frac{1}{k} \sum_{j=1}^{k} \frac{1}{T} \sum_{i=1}^{T} L_{D_i}(A(S_j^i)) \\ &\geq \min_{1 \leq j \leq k} \frac{1}{T} \sum_{i=1}^{T} L_{D_i}(A(S_j^i)). \end{aligned}$$
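The max ≥ average ≥ min step is a general fact about any finite array of losses indexed by (i, j): the largest row average dominates the grand average, which dominates the smallest column average. A small numeric illustration with an arbitrary 2 × 3 loss matrix:

```python
# Sketch: max over rows >= grand mean >= min over columns, for any matrix.
losses = [
    [0.50, 0.25, 0.75],
    [0.00, 1.00, 0.50],
]
T, k = len(losses), len(losses[0])

row_means = [sum(row) / k for row in losses]                      # average over j
col_means = [sum(losses[i][j] for i in range(T)) / T for j in range(k)]  # over i

assert max(row_means) >= sum(row_means) / T
assert abs(sum(row_means) / T - sum(col_means) / k) < 1e-12   # same grand mean
assert sum(col_means) / k >= min(col_means)
```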

Fix some j and let S_j = (x_1, …, x_m). Let v_1, …, v_p be the examples in C that do not appear in S_j. Clearly, p ≥ m. Therefore, for each h : C → {0, 1} and every i we have:

$$L_{D_i}(h) = \frac{1}{2m} \sum_{x \in C} \mathbf{1}_{h(x) \neq f_i(x)} \geq \frac{1}{2m} \sum_{r=1}^{p} \mathbf{1}_{h(v_r) \neq f_i(v_r)} \geq \frac{1}{2p} \sum_{r=1}^{p} \mathbf{1}_{h(v_r) \neq f_i(v_r)}.$$

Hence,

$$\begin{aligned} \frac{1}{T} \sum_{i=1}^{T} L_{D_i}(A(S_j^i)) &\geq \frac{1}{T} \sum_{i=1}^{T} \frac{1}{2p} \sum_{r=1}^{p} \mathbf{1}_{A(S_j^i)(v_r) \neq f_i(v_r)} \\ &= \frac{1}{2p} \sum_{r=1}^{p} \frac{1}{T} \sum_{i=1}^{T} \mathbf{1}_{A(S_j^i)(v_r) \neq f_i(v_r)} \\ &\geq \frac{1}{2} \min_{1 \leq r \leq p} \frac{1}{T} \sum_{i=1}^{T} \mathbf{1}_{A(S_j^i)(v_r) \neq f_i(v_r)}. \end{aligned}$$

Let v_r be an example in C that does not appear in the sample S_j. We can partition the functions {f_1, …, f_T} into T/2 disjoint pairs {f_i, f_{i′}} such that

$$f_i(c) \neq f_{i'}(c) \quad\text{if and only if}\quad c = v_r.$$

Since for such a pair {f_i, f_{i′}} we must have S_j^i = S_j^{i′}, it follows that

$$\mathbf{1}_{A(S_j^i)(v_r) \neq f_i(v_r)} + \mathbf{1}_{A(S_j^{i'})(v_r) \neq f_{i'}(v_r)} = 1,$$

which implies

$$\frac{1}{T} \sum_{i=1}^{T} \mathbf{1}_{A(S_j^i)(v_r) \neq f_i(v_r)} = \frac{1}{2}.$$
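The pairing argument can be checked directly for 2m = 4 points: pairing each labeling with the one that flips only the label of v_r shows that any prediction at v_r that cannot depend on f(v_r) (because v_r is unobserved) errs on exactly one function of each pair. As a stand-in for A(S_j^i)(v_r) the sketch uses a constant prediction, which is one such function of the observed labels.

```python
from itertools import product

# Sketch: over all T = 16 labelings of C = {0,1,2,3}, a prediction at the
# unobserved point v_r that is constant in f(v_r) errs on exactly half.
C = [0, 1, 2, 3]
v_r = 3                                  # a point not in the observed sample

functions = [dict(zip(C, labels)) for labels in product([0, 1], repeat=len(C))]
T = len(functions)

def predict_at_vr(f):
    # Stand-in for A(S_j^i)(v_r): may depend only on labels of C \ {v_r};
    # here, an arbitrary constant prediction.
    return 0

errors = sum(1 for f in functions if predict_at_vr(f) != f[v_r])
assert errors * 2 == T                   # average error indicator = 1/2
```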

Since

$$\frac{1}{T} \sum_{i=1}^{T} L_{D_i}(A(S_j^i)) \geq \frac{1}{2} \min_{1 \leq r \leq p} \frac{1}{T} \sum_{i=1}^{T} \mathbf{1}_{A(S_j^i)(v_r) \neq f_i(v_r)}$$

and

$$\frac{1}{T} \sum_{i=1}^{T} \mathbf{1}_{A(S_j^i)(v_r) \neq f_i(v_r)} = \frac{1}{2},$$

we have

$$\frac{1}{T} \sum_{i=1}^{T} L_{D_i}(A(S_j^i)) \geq \frac{1}{4}.$$

Thus,

$$\max_{1 \leq i \leq T} \frac{1}{k} \sum_{j=1}^{k} L_{D_i}(A(S_j^i)) \geq \min_{1 \leq j \leq k} \frac{1}{T} \sum_{i=1}^{T} L_{D_i}(A(S_j^i))$$

implies

$$\max_{1 \leq i \leq T} \frac{1}{k} \sum_{j=1}^{k} L_{D_i}(A(S_j^i)) \geq \frac{1}{4}.$$

We combined

$$E_{S \sim D_i^m}\left(L_{D_i}(A(S))\right) = \frac{1}{k} \sum_{j=1}^{k} L_{D_i}(A(S_j^i)),$$

$$\max_{1 \leq i \leq T} \frac{1}{k} \sum_{j=1}^{k} L_{D_i}(A(S_j^i)) \geq \min_{1 \leq j \leq k} \frac{1}{T} \sum_{i=1}^{T} L_{D_i}(A(S_j^i)),$$

$$\frac{1}{T} \sum_{i=1}^{T} L_{D_i}(A(S_j^i)) \geq \frac{1}{2} \min_{1 \leq r \leq p} \frac{1}{T} \sum_{i=1}^{T} \mathbf{1}_{A(S_j^i)(v_r) \neq f_i(v_r)},$$

and

$$\frac{1}{T} \sum_{i=1}^{T} \mathbf{1}_{A(S_j^i)(v_r) \neq f_i(v_r)} = \frac{1}{2}$$

to obtain:

$$\max_{1 \leq i \leq T} E_{S \sim D_i^m}\left(L_{D_i}(A(S))\right) \geq \frac{1}{4}.$$

Thus, Claim (*) is justified.

This means that for every algorithm A′ that receives a training set of m examples from X × {0, 1} there exist a function f : X → {0, 1} and a distribution D over X × {0, 1} such that L_D(f) = 0 and

$$E_{S \sim D^m}\left(L_D(A'(S))\right) \geq \frac{1}{4}.$$

By the second Lemma this implies:

$$P\left(L_D(A'(S)) \geq \frac{1}{8}\right) \geq \frac{1}{7}.$$
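An end-to-end check of this conclusion for m = 2 (a sketch using the same arbitrary memorize-then-predict-0 learner as before, not part of the proof): the worst-case labeling has expected risk at least 1/4, and the probability over the uniform choice of the sample that the realized risk is at least 1/8 is at least 1/7.

```python
from fractions import Fraction
from itertools import product

# Sketch: for m = 2, find the worst labeling for one arbitrary learner and
# verify both the expectation bound and the 1/7 probability bound.
m = 2
C = list(range(2 * m))

def learner(sample):
    memory = dict(sample)
    return lambda x: memory.get(x, 0)

worst_mean, worst_risks = None, None
for labels in product([0, 1], repeat=len(C)):
    f = dict(zip(C, labels))
    risks = []
    for seq in product(C, repeat=m):
        h = learner([(x, f[x]) for x in seq])
        risks.append(Fraction(sum(1 for x in C if h(x) != f[x]), len(C)))
    mean = sum(risks) / len(risks)
    if worst_mean is None or mean > worst_mean:
        worst_mean, worst_risks = mean, risks

assert worst_mean >= Fraction(1, 4)                       # Claim (*)
p_bad = Fraction(sum(1 for r in worst_risks if r >= Fraction(1, 8)),
                 len(worst_risks))
assert p_bad >= Fraction(1, 7)                            # theorem's conclusion
```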