
Outline

From structured risk minimization to prediction error

Error bounds: Hoeffding, Union, Vapnik-Chervonenkis

VC dimension of typical models

Classification in (high-dimensional) feature space, and kernels


Recap: Structured Risk Minimization

Structured risk minimization is a method for model selection that structures the space of solutions into a sequence of increasingly large sets, and chooses among competing solutions the one that is included in the smallest set.

Example: Concentric spheres Sn = {θ ∈ R^d : ∥θ∥² < Cn} with Cn < Cn+1. We recover common regularization schemes for linear models, e.g. ridge regression, large-margin, etc.

Question: Can we link model complexity and prediction strength directly?
Formalizing the Learning Problem
Given some ground-truth joint distribution P(x, y) over the input x ∈ R^d and the output y ∈ {−1, 1} (or R for regression), we measure the error of some classifier f : R^d → R using the risk function:

R[f] = ∫ l(f(x), y) dP(x, y)

where l(f(x), y) is some loss function.

Examples:

0/1 loss (classification):  l(f(x), y) = 0 if f(x)·y > 0, and 1 otherwise

MSE (regression):  l(f(x), y) = (1/2)(f(x) − y)²

Formalizing the Learning Problem

We would like to minimize the risk function

R[f] = ∫ l(f(x), y) dP(x, y)

Problem: P is unknown −→ we need an inductive principle.

A related quantity that can be computed from some training sample

D = (x1, y1), …, (xN, yN)

is the empirical risk, which replaces P(x, y) by an average over the training sample:

Remp[f] = (1/N) Σ_{i=1..N} l(f(xi), yi)
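As a concrete sketch (not from the slides; the data and weights below are made up for illustration), the empirical risk under the 0/1 loss can be computed as:

```python
import numpy as np

def zero_one_loss(fx, y):
    # 0/1 loss: 0 if f(x) and y have the same sign, 1 otherwise
    return (fx * y <= 0).astype(float)

def empirical_risk(f, X, y, loss=zero_one_loss):
    # Remp[f] = (1/N) * sum_i l(f(x_i), y_i)
    return np.mean(loss(f(X), y))

# Toy data and a fixed linear score function f(x) = w.x (hypothetical values)
w = np.array([1.0, -1.0])
X = np.array([[2.0, 1.0], [0.0, 1.0], [1.0, 3.0], [3.0, 0.5]])
y = np.array([1, -1, 1, 1])
print(empirical_risk(lambda X: X @ w, X, y))  # 0.25: one of four points is misclassified
```

Swapping in a squared-error loss gives the regression analogue with the same averaging structure.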

Predicting R[f]

(Figure: analogy between estimating a parameter and learning a ML model)

Questions:

Given a dataset of N points, how well will my model generalize?

How quickly do R[f] and Remp[f] converge with N?

Does it actually converge?

(Figure label: law of large numbers)

Predicting R[f]

Cross-validation (last week)

Simulate the true distribution by randomly splitting the available data between training and test. This approach provides an estimator of the risk R[f].

Error bounds (this week)

Bound R[f] based on model complexity. This can be achieved for specific measures of complexity, e.g. the VC dimension.

In the next few slides, we give a brief overview of error bounds, based on Bousquet et al. (2003), Introduction to Statistical Learning Theory.

Error Bounds: Hoeffding’s Inequality

Hoeffding’s inequality provides an upper bound on the probability that the sum of bounded independent random variables deviates from its expected value by more than a certain amount. (Wikipedia)

Theorem (Hoeffding). Let Z1, …, ZN be N i.i.d. random variables with g(Z) ∈ [a, b]. Then, for all ε > 0, we have:

Pr( | (1/N) Σ_{i=1..N} g(Zi) − E[g(Z)] | ≥ ε ) ≤ 2 exp( −2Nε² / (b − a)² )

When applied to the variables composing the risk, i.e. Zi = (f(xi), yi) are the prediction/label pairs and g = l, and observing that l(·) ∈ [0, 1], we get:

P( |Remp[f] − R[f]| > ε ) ≤ 2 exp(−2Nε²)

which further reduces to:

P( R[f] ≤ Remp[f] + √( log(2/δ) / (2N) ) ) ≥ 1 − δ
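A small numeric sketch of the resulting confidence term √(log(2/δ)/(2N)) (illustrative values, not from the slides):

```python
import math

def hoeffding_gap(N, delta):
    # Confidence term of Hoeffding's bound: sqrt(log(2/delta) / (2N))
    return math.sqrt(math.log(2.0 / delta) / (2.0 * N))

# For a *fixed* function evaluated on N = 1000 fresh samples, with
# probability 95% the true risk exceeds the empirical risk by at most ~0.043
print(round(hoeffding_gap(1000, 0.05), 3))  # 0.043
```

Quadrupling N halves the gap, the usual 1/√N rate.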

Error Bounds: Hoeffding’s Inequality

Problem: When training a machine learning model, we make the function f depend on the data. This causes the random variables Z1, …, ZN representing the predictions and the labels to no longer be i.i.d. Therefore, Hoeffding's inequality is not applicable in that case.

Idea: Avoid the dependency on a particular function f by considering instead a supremum over all functions in F:

R[f] − Remp[f] ≤ max_{g∈F} ( R[g] − Remp[g] )

i.e. the worst-case gap between true and empirical risk.

Error Bounds: Union Bound

We build a union bound (of the type P(A ∪ B) ≤ P(A) + P(B)) for the expression above, and then apply Hoeffding's inequality to each individual function:

P( max_{g∈F} R[g] − Remp[g] > ε ) = P( ∪_{g∈F} { R[g] − Remp[g] > ε } )
                                  ≤ Σ_{g∈F} P( R[g] − Remp[g] > ε )
                                  ≤ |F| exp(−2Nε²)

After some manipulations, we get:

R[f] ≤ Remp[f] + √( (log|F| + log(1/δ)) / (2N) )

with probability at least 1 − δ.
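The union bound's confidence term can be tabulated the same way; a sketch showing the logarithmic dependence on |F| (illustrative values):

```python
import math

def union_bound_gap(N, card_F, delta):
    # sqrt((log|F| + log(1/delta)) / (2N)): the price of picking the best
    # function out of a finite class F using the data
    return math.sqrt((math.log(card_F) + math.log(1.0 / delta)) / (2.0 * N))

# The dependence on |F| is only logarithmic: going from 100 to 10000
# candidate functions barely widens the bound
print(round(union_bound_gap(1000, 100, 0.05), 3))    # 0.062
print(round(union_bound_gap(1000, 10000, 0.05), 3))  # 0.078
```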

Error Bounds: Vapnik-Chervonenkis

Refinement of the union bound in the previous slide:

Spread the δs unevenly across points, to only consider those that are likely to be misclassified.

Make the approach applicable to large F by measuring the size of F 'projected on the data':

F_{x1,…,xN} = { (f(x1), …, f(xN)) : f ∈ F }

SF(N) = max_{x1,…,xN} |F_{x1,…,xN}|   (aka. the growth function)

Applying these two steps, we can arrive at the bound

R[f] ≤ Remp[f] + 2 √( 2 (log SF(2N) + log(2/δ)) / N )

with probability at least 1 − δ (cf. Bousquet 2003).

Error bounds (Summary)

Hoeffding:             R[f] ≤ Remp[f] + √( log(2/δ) / (2N) )

Union Bound:           R[f] ≤ Remp[f] + √( (log|F| + log(1/δ)) / (2N) )

Vapnik-Chervonenkis:   R[f] ≤ Remp[f] + 2 √( 2 (log SF(2N) + log(2/δ)) / N )

Notes:

All three bounds hold with probability at least 1 − δ.

Hoeffding's bound only holds for fixed functions (i.e. it is not applicable to a trained classifier).

VC bounds are also applicable to large (infinite) function spaces.

Error Bounds: Vapnik-Chervonenkis

The VC dimension is the size of the largest set of data points that the function class can classify in every possible way (i.e. shatter):

hF = max{ N : SF(N) = 2^N }

Another VC bound, where the VC dimension is stated explicitly:

R[f] ≤ Remp[f] + 2 √( 2 (hF (log(2N/hF) + 1) + log(2/δ)) / N )
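Plugging illustrative numbers into this explicit VC bound (a sketch, using the Sauer-type confidence term above):

```python
import math

def vc_gap(N, h, delta):
    # Confidence term of the explicit VC bound:
    # 2 * sqrt(2 * (h * (log(2N/h) + 1) + log(2/delta)) / N)
    return 2.0 * math.sqrt(2.0 * (h * (math.log(2.0 * N / h) + 1.0)
                                  + math.log(2.0 / delta)) / N)

# The gap shrinks roughly like sqrt(h * log(N) / N) as N grows
for N in (10**3, 10**4, 10**5):
    print(N, round(vc_gap(N, h=10, delta=0.05), 3))
```

With hF = 10 and δ = 0.05, the term drops from about 0.73 at N = 1000 to below 0.1 at N = 100000.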

Model Complexity: Sinusoid Function

fθ : R → {−1, 1},   fθ(x) = sign(sin(θx))

F = { fθ : θ ∈ R }

log|F| = ∞ −→ the union bound doesn't provide a guarantee

hF = ∞ −→ the VC bound doesn't provide a guarantee either
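That hF = ∞ can be verified constructively; the sketch below follows the standard construction (points xi = 10^−i with an explicit θ per labeling, cf. Burges 1998) and checks that any number of such points is shattered:

```python
import itertools
import math

def shatters(m):
    # For points x_i = 10^(-i), i = 1..m, and any labels y_i in {-1, +1},
    # theta = pi * (1 + sum_i ((1 - y_i)/2) * 10^i) yields
    # sign(sin(theta * x_i)) = y_i, so the m points are shattered.
    xs = [10.0 ** (-i) for i in range(1, m + 1)]
    for ys in itertools.product([-1, 1], repeat=m):
        theta = math.pi * (1 + sum((1 - y) // 2 * 10**i
                                   for i, y in enumerate(ys, start=1)))
        preds = [1 if math.sin(theta * x) > 0 else -1 for x in xs]
        if preds != list(ys):
            return False
    return True

print(shatters(4))  # True: all 2^4 labelings of 4 points are realized
```

Since the same construction works for every m, a single sinusoid parameter already gives infinite VC dimension.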

Linear Classifier

fw,b : R^d → R,   fw,b(x) = sign(w⊤x + b)

F = { fw,b : w ∈ R^d, b ∈ R }

log|F| = ∞ −→ the union bound doesn't provide a guarantee

hF = d + 1 −→ the VC bound guarantees convergence

Convergence is especially fast when d is low.
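A quick constructive check of hF = d + 1 for d = 2 (a sketch with hypothetical points): three non-collinear points can be labeled in all 2³ ways by an affine function, since the 3×3 system [xi, 1]·(w, b) = yi has an exact solution:

```python
import itertools
import numpy as np

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # non-collinear points in R^2
A = np.hstack([X, np.ones((3, 1))])                  # rows [x_i1, x_i2, 1]

def can_shatter(A):
    for ys in itertools.product([-1.0, 1.0], repeat=A.shape[0]):
        wb = np.linalg.solve(A, np.array(ys))        # exact fit: A @ wb = ys
        if not np.array_equal(np.sign(A @ wb), np.array(ys)):
            return False
    return True

print(can_shatter(A))  # True: the 3 points are shattered
```

Adding a fourth point would break this: no affine function in R² shatters 4 points in general position, matching hF = d + 1 = 3.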

Linear Classifier with Margin

fw,b : R^d → R,   fw,b(x) = sign(w⊤x + b)

F = { fw,b : w ∈ R^d, b ∈ R,  ∀ i = 1, …, N : |w⊤xi + b| ≥ 1 },   with margin M = 2/∥w∥

hF ≤ min( d + 1, 4R²/M² + 1 )

where R is the radius of a sphere containing the data (Burges (1998), A Tutorial on Support Vector Machines for Pattern Recognition).

The VC bound guarantees convergence. Convergence is especially fast when d is low or when the margin is large.

(Figure: separating hyperplane with margin M, data enclosed in a sphere of radius R)
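A numeric sketch of this capacity bound (illustrative values for d, R and M):

```python
def margin_vc_bound(d, R, M):
    # h_F <= min(d + 1, 4 R^2 / M^2 + 1): a large margin caps the
    # capacity independently of the ambient dimension
    return min(d + 1, 4.0 * R**2 / M**2 + 1.0)

# In d = 1000 dimensions, data inside a sphere of radius R = 5 classified
# with margin M = 1 gives a VC-dimension bound of 101, not 1001
print(margin_vc_bound(1000, 5.0, 1.0))  # 101.0
```

This is the quantitative reason why large-margin classifiers remain controllable in very high-dimensional feature spaces.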

From Linear to Nonlinear Models

Linear models of type f(x) = w⊤x + b are easy to regularize (e.g. dimensionality reduction, large-margin). However, the class of linear models is too restrictive (cannot solve nonlinear problems).

Approach 1: Consider an arbitrary nonlinear function f(x) (e.g. a quadratic model f(x) = x⊤Ax + b⊤x + c), and learn the parameters of this model from the data.

Approach 2: Apply a fixed nonlinear feature map φ : R^d → R^D to the data, and learn a linear model in the image domain:

f(x) = β⊤φ(x) + b

Linear Model in Feature Space

Procedure:

1. Consider a fixed feature map:

φ : R^d → R^D,   x ↦ φ(x),   where d ≪ D (typically)

2. Learn β ∈ R^D and b ∈ R such that the function f(x) = β⊤φ(x) + b correctly predicts the data.

Linear Model in Feature Space

Example: All second-order monomials

φ:R2 →R3

(x1, x2) ↦ (z1, z2, z3) := (x1², √2·x1x2, x2²)
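A sketch of why this map helps (XOR-style toy data, chosen for illustration): labels given by the sign of x1·x2 are not linearly separable in R², but become separable along the z2 coordinate in feature space:

```python
import numpy as np

def phi(x):
    # Second-order monomial feature map: (x1, x2) -> (x1^2, sqrt(2) x1 x2, x2^2)
    return np.array([x[0]**2, np.sqrt(2.0) * x[0] * x[1], x[1]**2])

# XOR-style data: label = sign(x1 * x2), not linearly separable in R^2
X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
y = np.array([1, 1, -1, -1])

# In feature space the coordinate z2 = sqrt(2) x1 x2 separates the classes
# with the linear model f(z) = beta.z + b, beta = (0, 1, 0), b = 0
beta = np.array([0.0, 1.0, 0.0])
preds = np.sign(np.array([phi(x) for x in X]) @ beta)
print(preds)  # [ 1.  1. -1. -1.], matching y
```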

Linear Model in Feature Space

Remarks:

Classical statistics: learning becomes harder as the data become high-dimensional.

VC-dimension viewpoint: not harder, as we can control complexity via other means (e.g. by enforcing a large margin in the feature space).

⇒ complexity matters, not dimensionality.

From Feature Spaces to Kernels

Example:

x = (x1, x2)

φ(x) = (x1², √2·x1x2, x2²)

The dot product in the (high-dimensional) feature space can be computed directly in R²:

⟨φ(x), φ(x′)⟩ = ⟨x, x′⟩² =: k(x, x′)

This is known as the kernel trick (cf. Boser, Guyon & Vapnik 1992).
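The identity ⟨φ(x), φ(x′)⟩ = ⟨x, x′⟩² can be spot-checked numerically (a sketch with random inputs):

```python
import numpy as np

def phi(x):
    # (x1, x2) -> (x1^2, sqrt(2) x1 x2, x2^2)
    return np.array([x[0]**2, np.sqrt(2.0) * x[0] * x[1], x[1]**2])

def k(x, xp):
    # Degree-2 polynomial kernel: evaluated in R^2, no feature map needed
    return np.dot(x, xp)**2

rng = np.random.default_rng(0)
x, xp = rng.normal(size=2), rng.normal(size=2)
# <phi(x), phi(x')> and k(x, x') agree up to floating point error
print(np.dot(phi(x), phi(xp)), k(x, xp))
```

The kernel side costs O(d) instead of O(D), which is what makes very high-dimensional (even infinite-dimensional) feature spaces practical.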

The Kernel Trick in Practice

Example:

We have a dataset D = {x1, …, xN} and would like to compute the (squared) distance to the mean in feature space, i.e.

f(x) = ∥ φ(x) − (1/N) Σ_{i=1..N} φ(xi) ∥²

Applying the identity

∥a − b∥² = ⟨a, a⟩ − 2⟨a, b⟩ + ⟨b, b⟩

we arrive at the formulation

f(x) = k(x, x) − (2/N) Σ_{i=1..N} k(x, xi) + (1/N²) Σ_{i=1..N} Σ_{j=1..N} k(xi, xj)

where the function (initially defined in feature space) can be expressed solely in terms of the kernel function k(x, x′) = ⟨φ(x), φ(x′)⟩.
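A sketch implementing this kernelized distance and checking it against the explicit feature-space computation (using the degree-2 monomial map from earlier as the kernel's feature map; data is random for illustration):

```python
import numpy as np

def phi(x):
    # Degree-2 monomial map; its kernel is k(x, x') = <x, x'>^2
    return np.array([x[0]**2, np.sqrt(2.0) * x[0] * x[1], x[1]**2])

def k(a, b):
    return np.dot(a, b)**2

def dist2_to_mean(x, X):
    # f(x) = k(x,x) - (2/N) sum_i k(x, x_i) + (1/N^2) sum_ij k(x_i, x_j)
    N = len(X)
    return (k(x, x)
            - 2.0 / N * sum(k(x, xi) for xi in X)
            + 1.0 / N**2 * sum(k(xi, xj) for xi in X for xj in X))

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 2))
x = rng.normal(size=2)

# Same quantity computed explicitly in feature space
mu = np.mean([phi(xi) for xi in X], axis=0)
print(np.sum((phi(x) - mu)**2), dist2_to_mean(x, X))  # equal up to rounding
```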

From Kernels to Feature Spaces

[Mercer] If k is continuous, symmetric, and for all square-integrable functions g(x) satisfies the condition

∫∫ g(x) k(x, x′) g(x′) dx dx′ ≥ 0

then the kernel can be expanded as

k(x, x′) = Σ_{i=1..NF} λi ψi(x) ψi(x′)

with λi > 0 and NF ∈ N or NF = ∞, and we can construct the feature map

φ(x) := ( √λ1 ψ1(x), √λ2 ψ2(x), … )

satisfying ⟨φ(x), φ(x′)⟩ = k(x, x′).

Example

Show that k : R^d × R^d → R with k(x, x′) = c + x1x′1, and c > 0, is a Mercer kernel.

Find a feature map φ(x) associated to this kernel.
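One possible answer can at least be verified numerically; the feature map below is an assumption to be checked, not given on the slide:

```python
import numpy as np

# Candidate feature map for k(x, x') = c + x1*x1' (an assumption we verify
# numerically): phi(x) = (sqrt(c), x1), so <phi(x), phi(x')> = c + x1*x1'
c = 2.0

def k(x, xp):
    return c + x[0] * xp[0]

def phi(x):
    return np.array([np.sqrt(c), x[0]])

rng = np.random.default_rng(2)
x, xp = rng.normal(size=3), rng.normal(size=3)
print(np.dot(phi(x), phi(xp)), k(x, xp))  # the two values coincide

# Since k has a feature map, every Gram matrix built from it is PSD
X = rng.normal(size=(6, 3))
K = np.array([[k(a, b) for b in X] for a in X])
print(np.linalg.eigvalsh(K).min() >= -1e-10)  # True
```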

Example of Kernels

Examples of commonly used kernels satisfying the Mercer property:

Linear:       k(x, x′) = ⟨x, x′⟩
Polynomial:   k(x, x′) = (⟨x, x′⟩ + β)^γ         β ∈ R≥0, γ ∈ N
Gaussian:     k(x, x′) = exp(−γ∥x − x′∥²)        γ ∈ R>0
Student:      k(x, x′) = 1 / (α + ∥x − x′∥²)     α ∈ R>0
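Mercer's condition implies that every Gram matrix built from these kernels is positive semi-definite; a sketch spot-checking this on random data (parameter values are illustrative):

```python
import numpy as np

# The four kernels from the table above
def linear(x, xp):                          return np.dot(x, xp)
def polynomial(x, xp, beta=1.0, gamma=3):   return (np.dot(x, xp) + beta)**gamma
def gaussian(x, xp, gamma=0.5):             return np.exp(-gamma * np.sum((x - xp)**2))
def student(x, xp, alpha=1.0):              return 1.0 / (alpha + np.sum((x - xp)**2))

# Check that the Gram matrix of each kernel on random points has no
# significantly negative eigenvalue (PSD up to rounding)
rng = np.random.default_rng(3)
X = rng.normal(size=(8, 2))
for kern in (linear, polynomial, gaussian, student):
    K = np.array([[kern(a, b) for b in X] for a in X])
    print(kern.__name__, np.linalg.eigvalsh(K).min() >= -1e-8)
```

A random spot-check like this cannot prove the Mercer property, but a single negative eigenvalue would disprove it.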

Final remarks

Cross-validation vs. VC dimension

Last week: Since it was unclear how to measure complexity, we tried to directly predict the generalization error via cross-validation, and chose parameters to minimize this error.

This week: Model complexity can be reliably measured (VC dimension). This allows us to do model selection (e.g. prioritize large-margin models) without having to consume data.

Nonlinear representation and kernels

Linearly unsolvable problems can become solvable in high-dimensional feature spaces.

High dimensionality is not a big problem, because other techniques (e.g. large margin) can be used to control capacity.
