# Support Vector Machines

Outline

- Large margin classifiers
- Primal/dual formulation of the SVM
- Kernel SVMs
- Hard-margin / soft-margin
- SVM and Hinge Loss
- SVMs beyond classification
- Applications

## Linear Classifiers with Margin

Let φ(x) be some feature map mapping the data in Rd to some feature space RD.

We consider functions of the type fθ : Rd → R with fθ(x) = sign(w⊤φ(x) + b), where θ = (w,b) denotes the parameters.

We consider the set of functions that classify with some margin:

F = { fθ : θ ∈ RD+1, ∀Ni=1 : |w⊤φ(xi) + b| ≥ 1 },   with margin M = 2/∥w∥.

Let R denote the radius of the smallest ball enclosing the data in feature space. The VC-dimension of F can then be bounded as

hF ≤ min(D + 1, 4R²/M² + 1)

(cf. Burges 1998).

## Support Vector Machine (SVM)

(Figure: increasing the margin reduces complexity; the data points lying on the margin are the support vectors.)

## SVM Primal

The classifier with largest margin between the positive and negative data points can be obtained by solving the minimization problem:

min_{w,b} ½∥w∥²   (minimizing ∥w∥²/2 is equivalent to maximizing the margin 2/∥w∥)

subject to the constraints

∀Ni=1 : yi ·(w⊤φ(xi)+b)≥1.

Once the parameters have been optimized, the decision function is given by f(x) = sign(w⊤φ(x) + b)

and it can be applied to classify new data.
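The decision rule is a one-liner once w and b are known. Below is a minimal sketch in plain Python, where the identity feature map stands in for φ and ties (score exactly zero) are assigned to the positive class by convention; the parameter values in the usage lines are hypothetical:

```python
def predict(w, b, x, feature_map=lambda v: v):
    """Primal SVM decision function f(x) = sign(w^T phi(x) + b).

    The identity feature map is a placeholder for phi; a score of exactly
    zero is assigned to the positive class by convention."""
    phi = feature_map(x)
    score = sum(wj * pj for wj, pj in zip(w, phi)) + b
    return 1 if score >= 0 else -1

# Hypothetical parameters w = (1), b = 0: classify by the sign of x
print(predict([1.0], 0.0, [2.5]))   # 1
print(predict([1.0], 0.0, [-0.3]))  # -1
```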

## SVM Primal

(Figure: the optimization problem seen from the input view, showing the margin and the constraints, and from the SRM view.)

## Slater’s Condition

Consider the optimization problem

min_θ f(θ)

with f convex, subject to the inequality constraints ∀mi=1 : gi(θ) ≤ 0, with the gi also convex. If there exists a point θ in the input domain that satisfies all constraints with strict inequalities, then the solution of the optimization problem is also given by its Lagrange dual formulation, the constrained optimization problem:

max_α min_θ  f(θ) + Σ_{i=1}^m αi gi(θ)   s.t. ∀mi=1 : αi ≥ 0,   (1)

where the inner minimization problem is typically solved analytically.

## SVM: from the Primal to the Dual

Slater’s condition for SVM:

∃ w,b s.t. ∀Ni=1 : yi ·(w⊤φ(xi)+b) > 1

In other words, the two classes must be linearly separable. When φ can be computed explicitly, this can be tested by running e.g. the perceptron algorithm.
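The perceptron converges after finitely many updates exactly when the data is linearly separable, so running it for a bounded number of epochs gives a one-sided separability check. A sketch in plain Python on explicit features (the epoch budget is an arbitrary cut-off, so a `None` result only means "not separated within the budget"):

```python
def perceptron_separable(X, y, max_epochs=1000):
    """Run the perceptron on (X, y) with labels in {-1, +1}.
    Returns (w, b) if an epoch completes with no mistakes (data separated),
    or None if the budget is exhausted (inconclusive)."""
    d = len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, yi in zip(X, y):
            if yi * (sum(wj * xj for wj, xj in zip(w, x)) + b) <= 0:
                # Misclassified (or on the boundary): perceptron update
                w = [wj + yi * xj for wj, xj in zip(w, x)]
                b += yi
                mistakes += 1
        if mistakes == 0:
            return w, b
    return None

# Separable toy data succeeds; contradictory labels never do
print(perceptron_separable([[1.0], [-1.0]], [1, -1]) is not None)  # True
print(perceptron_separable([[1.0], [1.0]], [1, -1]))               # None
```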

## SVM: from the Primal to the Dual

If Slater’s condition is verified, we inject the SVM objective into Eq. (1):

max_α min_{w,b}  ½∥w∥² + Σ_{i=1}^N αi · (1 − yi · (w⊤φ(xi) + b))   s.t. α ≽ 0,

and after solving min{·} analytically by setting the gradient to zero, we get:

max_α  Σ_{i=1}^N αi − ½ Σ_{ij} αi αj yi yj ⟨φ(xi), φ(xj)⟩   s.t. α ≽ 0 and Σ_{i=1}^N αi yi = 0,

where ⟨φ(xi), φ(xj)⟩ = k(xi, xj), and where we have also found in the process w = Σ_i αi yi φ(xi).

## Dual Optimization of the SVM (derivation)

Starting from the Lagrangian

max_α min_{w,b}  ½∥w∥² + Σ_{i=1}^N αi · (1 − yi · (w⊤φ(xi) + b))   s.t. α ≽ 0,

setting the gradient of the inner objective to zero gives

∂/∂w = 0  ⟹  w = Σ_{i=1}^N αi yi φ(xi),   ∂/∂b = 0  ⟹  Σ_{i=1}^N αi yi = 0,

and substituting these back into the objective yields the dual problem:

max_α  Σ_{i=1}^N αi − ½ Σ_{ij} αi αj yi yj k(xi, xj)   s.t. α ≽ 0 and Σ_{i=1}^N αi yi = 0.
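To make the dual concrete, here is a minimal solver sketch in plain Python. To keep it simple it drops the bias term b, which removes the equality constraint Σi αi yi = 0 and leaves only αi ≥ 0 to project onto; real implementations use dedicated solvers such as SMO instead. The toy data, learning rate, and step count are all hypothetical choices:

```python
def solve_dual_no_bias(X, y, kernel, lr=0.01, steps=2000):
    """Projected gradient ascent on the SVM dual *without* the bias term:
    maximize sum_i alpha_i - 1/2 sum_ij alpha_i alpha_j y_i y_j K_ij
    subject to alpha_i >= 0."""
    n = len(X)
    K = [[kernel(X[i], X[j]) for j in range(n)] for i in range(n)]
    alpha = [0.0] * n
    for _ in range(steps):
        for i in range(n):
            grad = 1.0 - sum(alpha[j] * y[j] * y[i] * K[i][j] for j in range(n))
            alpha[i] = max(0.0, alpha[i] + lr * grad)  # project onto alpha_i >= 0
    return alpha

def linear(u, v):
    return sum(a * b for a, b in zip(u, v))

# Toy 1-d data: x = +1 labeled +1, x = -1 labeled -1.
# The recovered w = sum_i alpha_i y_i x_i should approach 1 (margin 2/||w|| = 2).
alpha = solve_dual_no_bias([[1.0], [-1.0]], [1, -1], linear)
w = alpha[0] * 1 * 1.0 + alpha[1] * (-1) * (-1.0)
print(round(w, 3))  # ~1.0
```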

## Predicting with the Dual Solution

The prediction function can be rewritten as:

f(x) = sign(w⊤φ(x) + b) = sign(Σ_i αi yi φ(xi)⊤φ(x) + b) = sign(Σ_i αi yi k(xi, x) + b)

Question: how do we find the bias b?

Answer: We recall that because the model has maximized the margin between the two classes, the nearest point to the decision boundary for each class is exactly on the margin, i.e.

min_i yi · (w⊤φ(xi) + b) = 1

After some manipulations, we get:

b = 1 − min_{i : yi = 1} Σ_j αj yj k(xj, xi).
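This recovery of b takes only a few lines once α is known. A sketch in plain Python (the toy α and data are hypothetical, chosen so that both points sit exactly on the margin):

```python
def recover_bias(alpha, X, y, kernel):
    """b = 1 - min over positive examples i of sum_j alpha_j y_j k(x_j, x_i),
    using the fact that the nearest positive point lies exactly on the margin."""
    scores = [
        sum(aj * yj * kernel(xj, xi) for aj, yj, xj in zip(alpha, y, X))
        for xi, yi in zip(X, y)
        if yi == 1
    ]
    return 1.0 - min(scores)

def linear(u, v):
    return sum(a * b for a, b in zip(u, v))

# alpha = (0.5, 0.5) solves the dual for x = +1 (y = +1) and x = -1 (y = -1);
# the problem is symmetric, so the bias comes out to 0.
print(recover_bias([0.5, 0.5], [[1.0], [-1.0]], [1, -1], linear))  # 0.0
```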

## Examples of Kernels

Observation: In the SVM dual form, we never need to access the feature map φ(·) explicitly. Instead, we can always express computations in terms of the kernel function.

Examples of commonly used kernels satisfying the Mercer property are:

- Linear: k(x, x′) = ⟨x, x′⟩
- Polynomial: k(x, x′) = (⟨x, x′⟩ + β)^γ, with β ∈ R≥0, γ ∈ N
- Gaussian: k(x, x′) = exp(−γ∥x − x′∥²), with γ ∈ R>0

Note: The feature map associated with the Gaussian kernel is infinite-dimensional. However, in the dual form, we never need to access it for training and prediction; we only need the kernel function.
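These three kernels are straightforward to implement directly. A plain-Python sketch, with vectors as lists and arbitrary parameter defaults:

```python
import math

def linear_kernel(x, xp):
    """k(x, x') = <x, x'>"""
    return sum(a * b for a, b in zip(x, xp))

def polynomial_kernel(x, xp, beta=1.0, gamma=2):
    """k(x, x') = (<x, x'> + beta)^gamma, with beta >= 0 and integer gamma >= 1"""
    return (linear_kernel(x, xp) + beta) ** gamma

def gaussian_kernel(x, xp, gamma=1.0):
    """k(x, x') = exp(-gamma * ||x - x'||^2), with gamma > 0"""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, xp)))

print(linear_kernel([1.0, 2.0], [3.0, 4.0]))  # 11.0
print(polynomial_kernel([1.0], [1.0]))        # (1 + 1)^2 = 4.0
print(gaussian_kernel([0.0], [0.0]))          # 1.0 (identical points)
```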

## SVM with Polynomial Kernel

The SVM decision function

g(x) = Σ_{i=1}^N αi yi · (⟨xi, x⟩ + β)^γ + b,   where k(xi, x) = (⟨xi, x⟩ + β)^γ,

is a polynomial of degree γ.

(Figure: decision functions for degree γ = 1, 3, 12.)

Observation: The polynomial decision function takes very high values in the extrapolation regime.

## SVM with Gaussian Kernel

The SVM decision function

g(x) = Σ_{i=1}^N αi yi · exp(−γ∥xi − x∥²) + b,   where k(xi, x) = exp(−γ∥xi − x∥²),

is a superposition of ‘bump’ functions. Intermediate values of γ usually give the best results.

(Figure: decision functions for γ = 0.01, 0.30, 10.00.)

## More Advanced Kernels

More advanced kernels can be used to incorporate prior knowledge into the classifier, and to let the algorithm operate on non-vector data.

Examples:

- Bag-of-words kernels (text classification, image recognition)
- Tree kernels (text classification)
- Weighted degree kernel (bioinformatics)
- Graph kernels

Some of these kernels will be studied in Machine Learning 2.

## Computational Requirements of SVMs

Computation required by the primal:

- Optimization of a vector (w, b) in RD+1 subject to N constraints.
- The primal is infeasible when D is very large (e.g. infinite-dimensional).

Computation required by the dual:

- Computation of a kernel matrix K of size N × N with Kij = k(xi, xj).
- The dual is infeasible when N is large (e.g. 1 million data points).

What if both N and D are large?

- Random features (approximate infinite-dimensional feature maps through randomization).
- Neural networks (learn an efficient feature map of size ≪ D).

## Soft-Margin SVM

To account for cases where the data is not separable (e.g. due to noise), we introduce slack variables (ξi)i that let points not satisfy the margin constraints, at the cost of some penalty:

min_{w,b,ξ}  ½∥w∥² + C Σ_{i=1}^N ξi

subject to the constraints

∀Ni=1 : yi · (w⊤φ(xi) + b) ≥ 1 − ξi   and   ξi ≥ 0.

Note:

- Slater’s condition is now always satisfied (by choosing slack variables large enough for difficult examples), and the soft-margin SVM also has a dual formulation.
- The bias can be recovered from the dual using another set of conditions, the KKT conditions.

## KKT Conditions (simplified)

Consider the optimization problem min_θ f(θ), subject to the inequality constraints ∀mi=1 : gi(θ) ≤ 0. For this problem, the Karush-Kuhn-Tucker (KKT) conditions are:

∇f(θ) + Σ_{i=1}^m λi ∇gi(θ) = 0   (stationarity)

∀mi=1 : gi(θ) ≤ 0   (primal feasibility)

∀mi=1 : λi ≥ 0   (dual feasibility)

∀mi=1 : λi gi(θ) = 0   (complementary slackness)

If Slater’s condition is satisfied, then the KKT conditions are satisfied at the optimum.

The bias of the primal can be recovered from the ‘complementary slackness’ equation.

## Effect of the parameter C

The larger the parameter C, the more the decision boundary is forced to correctly classify every data point. Best results are usually obtained with intermediate values of C.

(Figure: decision boundaries for C = 0.1, 10.0, 10000.0.)

## SVM and Hinge Loss

Recall that we have defined g(x) = w⊤φ(x) + b, and therefore, we can write the constraint as:

ξi ≥ 1 − yi · g(xi)

We can consider instead a positive slack function

ξi′ = max(0, 1 − yi · g(xi))

(this is the hinge loss lhinge(yi, g(xi))) and inject it directly in the objective:

min_{w,b}  ½∥w∥² + C Σ_{i=1}^N max(0, 1 − yi · g(xi))

where the first term is the regularization term Ereg and the second term is the empirical risk Eemp,

i.e. we optimize the sum of a fitting term and a regularization term (as we do e.g. for logistic regression but with a different loss function).
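This unconstrained form can be optimized directly with subgradient descent, which makes for a handy sanity check even though dedicated solvers are far more efficient. A plain-Python sketch on hypothetical toy data (C, the learning rate, and the step count are arbitrary choices):

```python
def hinge_objective(w, b, X, y, C):
    """(1/2)||w||^2 + C * sum_i max(0, 1 - y_i * g(x_i)), with g(x) = w.x + b."""
    reg = 0.5 * sum(wj * wj for wj in w)
    emp = sum(max(0.0, 1.0 - yi * (sum(wj * xj for wj, xj in zip(w, x)) + b))
              for x, yi in zip(X, y))
    return reg + C * emp

def subgradient_step(w, b, X, y, C, lr):
    """One subgradient step: the regularizer contributes w, and every
    margin-violating point contributes -C * y_i * x_i (and -C * y_i to b)."""
    gw, gb = list(w), 0.0
    for x, yi in zip(X, y):
        if yi * (sum(wj * xj for wj, xj in zip(w, x)) + b) < 1:  # margin violated
            gw = [gwj - C * yi * xj for gwj, xj in zip(gw, x)]
            gb -= C * yi
    return [wj - lr * gwj for wj, gwj in zip(w, gw)], b - lr * gb

# Toy run: two separable 1-d points; w should hover around the
# hard-margin solution w = 1, b = 0.
X, y = [[1.0], [-1.0]], [1, -1]
w, b = [0.0], 0.0
for _ in range(500):
    w, b = subgradient_step(w, b, X, y, C=10.0, lr=0.01)
print(round(b, 6))  # 0.0 by symmetry
```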

## SVM Beyond Classification

Support vector data description (SVDD):

Consider an unlabeled dataset D = {x1, . . . , xN}. We would like to find the smallest enclosing sphere of the data, i.e.

min_{R,c} R²

subject to the constraints

∀Ni=1 : ∥φ(xi) − c∥² ≤ R²

and we would like to declare a point anomalous if it exceeds a certain distance τ from the center c, i.e.

f(x) = sign(∥φ(x) − c∥ − τ)

Note: Like the SVM, the SVDD algorithm also has a dual form, it can be expressed in terms of kernels, and we can add slack variables to it.
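The detection rule itself is easy to sketch in plain Python, with the identity feature map as a stand-in for φ and a hypothetical center and threshold:

```python
import math

def svdd_score(x, center, tau, feature_map=lambda v: v):
    """f(x) = sign(||phi(x) - c|| - tau): +1 (anomalous) if x lies farther
    than tau from the center c in feature space, else -1 (normal).
    The identity feature map is a placeholder for phi."""
    phi = feature_map(x)
    dist = math.sqrt(sum((p - c) ** 2 for p, c in zip(phi, center)))
    return 1 if dist - tau > 0 else -1

# Sphere of radius tau = 2 around the origin (hypothetical values)
print(svdd_score([3.0, 0.0], [0.0, 0.0], 2.0))  # 1  (outside: anomalous)
print(svdd_score([1.0, 0.0], [0.0, 0.0], 2.0))  # -1 (inside: normal)
```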

## SVM Beyond Classification

Support vector regression (SVR):

Consider a labeled regression dataset D = {(x1, y1), . . . , (xN, yN)} with yi ∈ R. We would like to build the regression model

min_{w,b} ∥w∥²

subject to the constraints

∀Ni=1 : |w⊤φ(xi) + b − yi| ≤ ε

Note: Like the SVM, support vector regression has a dual form, it can be kernelized, and we can add slack variables to it. With the slack variables, the constraints above can also be seen as applying an ε-insensitive loss.
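The ε-insensitive loss mentioned above is a one-liner. A plain-Python sketch (ε = 0.1 is an arbitrary default):

```python
def eps_insensitive_loss(y_true, y_pred, eps=0.1):
    """Zero inside the tube |y_true - y_pred| <= eps, linear outside it."""
    return max(0.0, abs(y_true - y_pred) - eps)

print(eps_insensitive_loss(1.0, 1.05))  # 0.0 (inside the tube)
print(eps_insensitive_loss(1.0, 1.5))   # ~0.4 (penalized linearly)
```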

## Visual Summary

(Diagram linking: large margin / VC dimension; SVMs and quadratic programming; the kernel trick; kernel choice / prior knowledge; slack variables / noise modeling; classification / regression / …)

## SVM for ECG-based Arrhythmia Detection

Source: Wang et al. (2014), Hardware Specialization in Low-power Sensing Applications to Address Energy and Resilience

## SVM for Text Categorization

Precision/recall-breakeven point on the ten most frequent Reuters categories and microaveraged performance over all Reuters categories.

Source: Joachims et al. 1998, Text categorization with Support Vector Machines: Learning with many relevant features

## SVM/HoG for Pedestrian Detection

Source: Dutta 2015, Pedestrian Detection using HOG and SVM in Automotives
