• Support Vector Machines (SVMs).
• Semi-Supervised Learning.
• Semi-Supervised SVMs.
Maria-Florina Balcan
03/25/2015
Support Vector Machines (SVMs).
One of the most theoretically well-motivated and practically effective classification algorithms in machine learning.
Directly motivated by Margins and Kernels!
Geometric Margin
WLOG homogeneous linear separators [w0 = 0].
Definition: The margin of example 𝑥 w.r.t. a linear separator 𝑤 is the distance from 𝑥 to the plane 𝑤 ⋅ 𝑥 = 0.
If ‖𝑤‖ = 1, the margin of 𝑥 w.r.t. 𝑤 is |𝑥 ⋅ 𝑤|.
[Figure: the margins of two examples 𝑥1 and 𝑥2 relative to the separator 𝑤.]
Geometric Margin
Definition: The margin of example 𝑥 w.r.t. a linear sep. 𝑤 is the distance from 𝑥 to the plane 𝑤 ⋅ 𝑥 = 0.
Definition: The margin 𝛾𝑤 of a set of examples 𝑆 wrt a linear separator 𝑤 is the smallest margin over points 𝑥 ∈ 𝑆.
Definition: The margin 𝛾 of a set of examples 𝑆 is the maximum 𝛾𝑤 over all linear separators 𝑤.
[Figure: a set of + and − points separated by 𝑤, with margin 𝛾 on each side of the separator.]
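To make the definitions concrete, here is a minimal Python/NumPy sketch (the toy numbers are hypothetical, not from the lecture) that computes the geometric margin of a single example, |𝑤 ⋅ 𝑥|/‖𝑤‖, and the margin 𝛾𝑤 of a set of examples as the smallest per-example margin:

```python
import numpy as np

def example_margin(w, x):
    """Geometric margin of example x w.r.t. the separator w.x = 0: |w.x| / ||w||."""
    return abs(np.dot(w, x)) / np.linalg.norm(w)

def dataset_margin(w, X):
    """gamma_w: the smallest margin over all points in X."""
    return min(example_margin(w, x) for x in X)

# Toy example (hypothetical numbers).
w = np.array([3.0, 4.0])                 # ||w|| = 5
X = np.array([[1.0, 1.0], [2.0, 0.0]])
print(example_margin(w, X[0]))           # |3 + 4| / 5 = 1.4
print(dataset_margin(w, X))              # min(1.4, 1.2) = 1.2
```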
Margin Important Theme in ML
Both sample-complexity and algorithmic implications.
Sample/mistake-bound complexity:
• If the margin is large, the number of mistakes the Perceptron makes is small (independent of the dimension of the space)!
• If large margin 𝛾 and if alg. produces a large margin classifier, then amount of data needed depends only on R/𝛾 [Bartlett & Shawe-Taylor ’99].
Algorithmic Implications
Suggests searching for a large margin classifier… SVMs
Support Vector Machines (SVMs) Directly optimize for the maximum margin separator: SVMs
First, assume we know a lower bound on the margin 𝛾.
Input: 𝛾, S={(x1, 𝑦1), …,(xm, 𝑦m)}; Find: some w where:
• ‖w‖ = 1
• For all i, 𝑦𝑖𝑤 ⋅ 𝑥𝑖 ≥ 𝛾
Output: w, a separator of margin 𝛾 over S
Realizable case, where the data is linearly separable by margin 𝛾
Support Vector Machines (SVMs) Directly optimize for the maximum margin separator: SVMs
E.g., search for the best possible 𝛾
Input: S={(x1, 𝑦1), …, (xm, 𝑦m)}
Find: some w and maximum 𝛾 where:
• ‖w‖ = 1
• For all i, 𝑦𝑖 𝑤 ⋅ 𝑥𝑖 ≥ 𝛾
Output: maximum margin separator over S
Support Vector Machines (SVMs) Directly optimize for the maximum margin separator: SVMs
Input: S={(x1, 𝑦1), …,(xm, 𝑦m)}; Maximize 𝛾 under the constraint:
• ‖w‖ = 1
• For all i, 𝑦𝑖 𝑤 ⋅ 𝑥𝑖 ≥ 𝛾
Support Vector Machines (SVMs)
Directly optimize for the maximum margin separator: SVMs
This is a constrained optimization problem.
Input: S={(x1, 𝑦1), …,(xm, 𝑦m)}; Maximize 𝛾 under the constraint:
• ‖w‖ = 1
• For all i, 𝑦𝑖 𝑤 ⋅ 𝑥𝑖 ≥ 𝛾
Here 𝛾 is the objective function and the two bullets are the constraints.
• Famous example of constrained optimization: linear programming, where objective fn is linear, constraints are linear (in)equalities
Support Vector Machines (SVMs)
Directly optimize for the maximum margin separator: SVMs
Input: S={(x1, 𝑦1), …,(xm, 𝑦m)}; Maximize 𝛾 under the constraint:
• ‖w‖ = 1
• For all i, 𝑦𝑖 𝑤 ⋅ 𝑥𝑖 ≥ 𝛾
This constraint is non-linear. In fact, it's even non-convex: for two distinct unit vectors 𝑤1 and 𝑤2, the midpoint (𝑤1 + 𝑤2)/2 has norm less than 1, so the feasible set ‖𝑤‖ = 1 is not convex.
Support Vector Machines (SVMs)
Directly optimize for the maximum margin separator: SVMs
Input: S={(x1, 𝑦1), …,(xm, 𝑦m)}; Maximize 𝛾 under the constraint:
• ‖w‖ = 1
• For all i, 𝑦𝑖 𝑤 ⋅ 𝑥𝑖 ≥ 𝛾
Let 𝑤′ = 𝑤/𝛾; then maximizing 𝛾 is equivalent to minimizing ‖𝑤′‖² (since ‖𝑤′‖² = 1/𝛾²). So, dividing both sides by 𝛾 and writing in terms of 𝑤′ we get:
[Figure: the separator 𝑤′ with margin hyperplanes 𝑤′ ⋅ 𝑥 = −1 and 𝑤′ ⋅ 𝑥 = 1.]
Input: S={(x1, 𝑦1), …, (xm, 𝑦m)}
Minimize ‖𝑤′‖² under the constraint:
• For all i, 𝑦𝑖 𝑤′ ⋅ 𝑥𝑖 ≥ 1
Support Vector Machines (SVMs) Directly optimize for the maximum margin separator: SVMs
Input: S={(x1, 𝑦1), …, (xm, 𝑦m)}
Find argmin_w ‖𝑤‖² s.t.:
• For all i, 𝑦𝑖 𝑤 ⋅ 𝑥𝑖 ≥ 1
• The objective is convex (quadratic)
• All constraints are linear
• Can solve efficiently (in poly time) using standard quadratic programming (QP) software
This is a constrained optimization problem.
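As a concrete illustration, here is a minimal sketch that hands this QP to the cvxpy modeling library (the choice of cvxpy and the toy data are assumptions, not part of the lecture; any standard QP solver would do). Since the constraints are 𝑦𝑖 𝑤 ⋅ 𝑥𝑖 ≥ 1, the geometric margin of the returned separator is 1/‖𝑤‖:

```python
import numpy as np
import cvxpy as cp

# Toy linearly separable data (hypothetical numbers), labels in {-1, +1}.
X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

w = cp.Variable(X.shape[1])
objective = cp.Minimize(cp.sum_squares(w))      # minimize ||w||^2
constraints = [cp.multiply(y, X @ w) >= 1]      # y_i (w . x_i) >= 1 for all i
cp.Problem(objective, constraints).solve()

w_hat = w.value
print(w_hat, 1.0 / np.linalg.norm(w_hat))       # separator and its geometric margin 1/||w||
```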
Support Vector Machines (SVMs) Question: what if data isn’t perfectly linearly separable?
Issue 1: now have two objectives
• maximize margin
• minimize # of misclassifications.
[Figure: separator 𝑤 with margin lines 𝑤 ⋅ 𝑥 = −1 and 𝑤 ⋅ 𝑥 = 1; some points fall on the wrong side.]
Ans 1: Let's optimize their sum: minimize
‖𝑤‖² + 𝐶(# misclassifications), where 𝐶 is some tradeoff constant.
Issue 2: This is computationally hard (NP-hard) [Guruswami-Raghavendra ’06], even if we didn’t care about the margin and just minimized the number of mistakes.
Support Vector Machines (SVMs)
Question: what if data isn’t perfectly linearly separable? Replace “# mistakes” with upper bound called “hinge loss”
Input: S={(x1, 𝑦1), …,(xm, 𝑦m)};
Find argmin_{w, 𝜉1, …, 𝜉𝑚} ‖𝑤‖² + 𝐶 Σ𝑖 𝜉𝑖 s.t.:
• For all i, 𝑦𝑖 𝑤 ⋅ 𝑥𝑖 ≥ 1 − 𝜉𝑖 and 𝜉𝑖 ≥ 0
𝜉𝑖 are “slack variables”
(Compare with the separable-case formulation: minimize ‖𝑤′‖² subject to 𝑦𝑖 𝑤′ ⋅ 𝑥𝑖 ≥ 1 for all i; the slack 𝜉𝑖 relaxes each constraint.)
Support Vector Machines (SVMs)
Question: what if data isn’t perfectly linearly separable? Replace “# mistakes” with upper bound called “hinge loss”
𝜉𝑖 are “slack variables”
C controls the relative weighting between the twin goals of making ‖𝑤‖² small (so that the margin is large) and ensuring that most examples have functional margin ≥ 1.
ℓ(𝑤, 𝑥, 𝑦) = max(0, 1 − 𝑦 𝑤 ⋅ 𝑥)
Input: S={(x1, 𝑦1), …,(xm, 𝑦m)};
Find argmin_{w, 𝜉1, …, 𝜉𝑚} ‖𝑤‖² + 𝐶 Σ𝑖 𝜉𝑖 s.t.:
• For all i, 𝑦𝑖 𝑤 ⋅ 𝑥𝑖 ≥ 1 − 𝜉𝑖 and 𝜉𝑖 ≥ 0
Support Vector Machines (SVMs)
Question: what if data isn’t perfectly linearly separable? Replace “# mistakes” with upper bound called “hinge loss”
Replace the number of mistakes in the objective ‖𝑤‖² + 𝐶(# misclassifications) with the hinge loss.
Input: S={(x1, 𝑦1), …,(xm, 𝑦m)};
Find argmin_{w, 𝜉1, …, 𝜉𝑚} ‖𝑤‖² + 𝐶 Σ𝑖 𝜉𝑖 s.t.:
• For all i, 𝑦𝑖 𝑤 ⋅ 𝑥𝑖 ≥ 1 − 𝜉𝑖 and 𝜉𝑖 ≥ 0
ℓ(𝑤, 𝑥, 𝑦) = max(0, 1 − 𝑦 𝑤 ⋅ 𝑥)
Support Vector Machines (SVMs)
Question: what if data isn’t perfectly linearly separable? Replace “# mistakes” with upper bound called “hinge loss”
Σ𝑖 𝜉𝑖 is the total amount we have to move the points to get them on the correct side of the lines 𝑤 ⋅ 𝑥 = +1/−1, where the distance between the lines 𝑤 ⋅ 𝑥 = 0 and 𝑤 ⋅ 𝑥 = 1 counts as “1 unit”.
Input: S={(x1, 𝑦1), …,(xm, 𝑦m)};
Find argmin_{w, 𝜉1, …, 𝜉𝑚} ‖𝑤‖² + 𝐶 Σ𝑖 𝜉𝑖 s.t.:
• For all i, 𝑦𝑖 𝑤 ⋅ 𝑥𝑖 ≥ 1 − 𝜉𝑖 and 𝜉𝑖 ≥ 0
ℓ(𝑤, 𝑥, 𝑦) = max(0, 1 − 𝑦 𝑤 ⋅ 𝑥)
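At the optimum, each slack variable equals the hinge loss of its example, 𝜉𝑖 = max(0, 1 − 𝑦𝑖 𝑤 ⋅ 𝑥𝑖), so the objective can be written without slacks at all. A minimal NumPy sketch of that objective (the toy numbers are hypothetical):

```python
import numpy as np

def soft_margin_objective(w, X, y, C):
    """||w||^2 + C * sum_i max(0, 1 - y_i w.x_i).
    At the optimum of the slack formulation, each xi_i equals exactly this hinge loss."""
    margins = y * (X @ w)                      # functional margins y_i (w . x_i)
    hinge = np.maximum(0.0, 1.0 - margins)     # slack needed by each example
    return np.dot(w, w) + C * hinge.sum()

# Toy check (hypothetical numbers).
X = np.array([[2.0, 1.0], [0.5, 0.5], [-1.0, -2.0]])
y = np.array([1, 1, -1])
print(soft_margin_objective(np.array([0.5, 0.5]), X, y, C=1.0))
```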
What if the data is far from being linearly separable?
Example: distinguishing two classes of images. No good linear separator in the pixel representation.
SVM philosophy: “Use a Kernel”
Support Vector Machines (SVMs)
Primal form:
Input: S={(x1, y1), …, (xm, ym)}
Find argmin_{w, 𝜉1, …, 𝜉𝑚} ‖w‖² + 𝐶 Σ𝑖 𝜉𝑖 s.t.:
• For all i, y𝑖 w ⋅ x𝑖 ≥ 1 − 𝜉𝑖 and 𝜉𝑖 ≥ 0
Which is equivalent to the Lagrangian dual:
Input: S={(x1, y1), …, (xm, ym)}
Find argmin_α ½ Σ𝑖 Σ𝑗 y𝑖 y𝑗 α𝑖 α𝑗 (x𝑖 ⋅ x𝑗) − Σ𝑖 α𝑖 s.t.:
• For all i, 0 ≤ α𝑖 ≤ C
• Σ𝑖 y𝑖 α𝑖 = 0
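A sketch of solving this dual as a QP, again using cvxpy as an assumed off-the-shelf solver (toy data hypothetical). It uses the identity ½ Σ𝑖 Σ𝑗 y𝑖 y𝑗 α𝑖 α𝑗 (x𝑖 ⋅ x𝑗) = ½ ‖Σ𝑖 α𝑖 y𝑖 x𝑖‖², which is how the objective is written below:

```python
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [1.5, 2.5], [-2.0, -1.0], [-1.0, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
C = 10.0

alpha = cp.Variable(len(y))
Yx = y[:, None] * X                                   # rows are y_i * x_i
objective = cp.Minimize(0.5 * cp.sum_squares(Yx.T @ alpha) - cp.sum(alpha))
constraints = [alpha >= 0, alpha <= C, y @ alpha == 0]
cp.Problem(objective, constraints).solve()

w = Yx.T @ alpha.value                                # w = sum_i alpha_i y_i x_i
support = np.flatnonzero(alpha.value > 1e-6)          # indices of the support vectors
print(w, support)
```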
SVMs (Lagrangian Dual)
Input: S={(x1, y1), …, (xm, ym)}
Find argmin_α ½ Σ𝑖 Σ𝑗 y𝑖 y𝑗 α𝑖 α𝑗 (x𝑖 ⋅ x𝑗) − Σ𝑖 α𝑖 s.t.:
• For all i, 0 ≤ α𝑖 ≤ C
• Σ𝑖 y𝑖 α𝑖 = 0
• The final classifier is: w = Σ𝑖 α𝑖 y𝑖 x𝑖
• The points x𝑖 for which α𝑖 ≠ 0 are called the “support vectors”.
Kernelizing the Dual SVMs
Replace x𝑖 ⋅ x𝑗 with K(x𝑖, x𝑗).
Input: S={(x1, y1), …, (xm, ym)}
Find argmin_α ½ Σ𝑖 Σ𝑗 y𝑖 y𝑗 α𝑖 α𝑗 (x𝑖 ⋅ x𝑗) − Σ𝑖 α𝑖 s.t.:
• For all i, 0 ≤ α𝑖 ≤ C
• Σ𝑖 y𝑖 α𝑖 = 0
• Final classifier is: w = Σ𝑖 α𝑖 y𝑖 x𝑖
• The points x𝑖 for which α𝑖 ≠ 0 are called the “support vectors”.
• With a kernel, classify x using Σ𝑖 α𝑖 y𝑖 K(x, x𝑖).
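In practice one rarely solves this QP by hand. A hedged sketch using scikit-learn's SVC (toy data hypothetical) shows that its stored dual coefficients (α𝑖 y𝑖 for the support vectors) reproduce the kernelized classifier Σ𝑖 α𝑖 y𝑖 K(x, x𝑖), plus a bias term 𝑏 that scikit-learn adds even though the slides use homogeneous separators:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

# Tiny toy data set (hypothetical numbers), labels in {-1, +1}.
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [1.1, 0.9]])
y = np.array([-1, -1, 1, 1])

clf = SVC(kernel="rbf", C=1.0, gamma=1.0).fit(X, y)

# clf.dual_coef_ stores alpha_i * y_i for each support vector.
x_test = np.array([[0.9, 1.0]])
K = rbf_kernel(x_test, clf.support_vectors_, gamma=1.0)   # K(x, x_i)
score = K @ clf.dual_coef_.ravel() + clf.intercept_       # sum_i alpha_i y_i K(x, x_i) + b
print(np.sign(score), clf.decision_function(x_test))      # the two scores agree
```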
Support Vector Machines (SVMs).
One of the most theoretically well-motivated and practically effective classification algorithms in machine learning.
Directly motivated by Margins and Kernels!
What you should know
• The importance of margins in machine learning.
• The primal form of the SVM optimization problem.
• The dual form of the SVM optimization problem.
• Kernelizing SVM.
• Think about how it’s related to Regularized Logistic Regression.
Modern (Partially) Supervised Machine Learning
• Using Unlabeled Data and Interaction for Learning
Classic Paradigm Insufficient Nowadays
Modern applications: massive amounts of raw data. Only a tiny fraction can be annotated by human experts.
Examples: protein sequences, billions of web pages, images.
Semi-Supervised Learning
[Diagram: raw data → expert labeler annotates a subset (face / not face) → labeled data → classifier.]
Active Learning
[Diagram: raw data → the learner selects examples and queries the expert labeler (face / not face) → classifier.]
Semi-Supervised Learning
Prominent paradigm over the past 15 years in Machine Learning.
• Most applications have lots of unlabeled data, but labeled data is rare or expensive:
• Web page and document classification
• Computer Vision
• Computational Biology,
• ….
Semi-Supervised Learning
Labeled examples: Sl = {(x1, y1), …, (xml, yml)}, with each x𝑖 drawn i.i.d. from D and y𝑖 = c∗(x𝑖).
Unlabeled examples: Su = {x1, …, xmu}, drawn i.i.d. from D.
The algorithm outputs a classifier h. Goal: h has small error over D, where errD(h) = Pr_{x∼D}(h(x) ≠ c∗(x)).
Semi-supervised learning: no querying. Just have lots of additional unlabeled data.
A bit puzzling, since it is unclear what unlabeled data can do for you….
Key Insight
Unlabeled data useful if we have beliefs not only about the form of the target, but also about its relationship with the underlying distribution.
Combining Labeled and Unlabeled Data
• Several methods have been developed to try to use unlabeled data to improve performance, e.g.:
– Transductive SVM [Joachims ’99]
– Co-training [Blum & Mitchell ’98]
– Graph-based methods [B&C01], [ZGL03]
Workshops: [ICML ’03, ICML ’05, …]
Books:
• Semi-Supervised Learning, O. Chapelle, B. Scholkopf and A. Zien (eds), MIT Press, 2006
• Introduction to Semi-Supervised Learning, Zhu & Goldberg, Morgan & Claypool, 2009
Test of time awards at ICML!
Example of “typical” assumption: Margins
• The separator goes through low density regions of the space/large margin.
– assume we are looking for linear separator
– belief: should exist one with large separation
[Figure: two panels comparing the separator found by an SVM trained on the labeled data only with the separator found by a Transductive SVM that also uses the unlabeled data.]
Transductive Support Vector Machines
Optimize for the separator with large margin w.r.t. the labeled and unlabeled data. [Joachims ’99]
Input: Sl = {(x1, y1), …, (xml, yml)}, Su = {x1, …, xmu}
argmin_w ‖w‖² s.t.:
• y𝑖 w ⋅ x𝑖 ≥ 1, for all i ∈ {1, …, ml}
• y𝑢 w ⋅ x𝑢 ≥ 1, for all u ∈ {1, …, mu}
• y𝑢 ∈ {−1, 1} for all u ∈ {1, …, mu}
Find a labeling of the unlabeled sample and 𝑤 s.t. 𝑤 separates both labeled and unlabeled data with maximum margin.
[Figure: labeled + and − points together with many unlabeled points; the separator 𝑤′ with margin lines 𝑤′ ⋅ 𝑥 = ±1 passes through a low-density region of the unlabeled data.]
Transductive Support Vector Machines
Optimize for the separator with large margin w.r.t. the labeled and unlabeled data. [Joachims ’99]
Input: Sl = {(x1, y1), …, (xml, yml)}, Su = {x1, …, xmu}
argmin_w ‖w‖² + 𝐶 Σ𝑖 𝜉𝑖 + 𝐶 Σ𝑢 𝜉𝑢 s.t.:
• y𝑖 w ⋅ x𝑖 ≥ 1 − 𝜉𝑖, for all i ∈ {1, …, ml}
• y𝑢 w ⋅ x𝑢 ≥ 1 − 𝜉𝑢, for all u ∈ {1, …, mu}
• y𝑢 ∈ {−1, 1} for all u ∈ {1, …, mu}
Find a labeling of the unlabeled sample and 𝑤 s.t. 𝑤 separates both labeled and unlabeled data with maximum margin.
Transductive Support Vector Machines Optimize for the separator with large margin wrt labeled and
unlabeled data.
Input: Sl={(x1,y1), …,(xml,yml)} Su={x1, …,xmu}
argmin_w ‖w‖² + 𝐶 Σ𝑖 𝜉𝑖 + 𝐶 Σ𝑢 𝜉𝑢 s.t.:
• y𝑖 w ⋅ x𝑖 ≥ 1 − 𝜉𝑖, for all i ∈ {1, …, ml}
• y𝑢 w ⋅ x𝑢 ≥ 1 − 𝜉𝑢, for all u ∈ {1, …, mu}
• yu ∈ {−1,1} for all u ∈ {1,…,mu}
NP-hard… The problem is convex only after you have guessed the labels, and there are far too many possible guesses.
Transductive Support Vector Machines Optimize for the separator with large margin wrt labeled and
unlabeled data.
Heuristic (Joachims) high level idea:
• First maximize margin over the labeled points
• Use this to give initial labels to unlabeled points based on this separator.
• Try flipping labels of unlabeled points to see if doing so can increase the margin.
Keep going until no more improvements. Finds a locally-optimal solution (a simplified sketch of this loop is given below).
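For illustration only, here is a heavily simplified Python sketch of this label-flipping idea, using scikit-learn's LinearSVC as the inner SVM solver (an assumption). Joachims' actual procedure flips pairs of labels, enforces a class-ratio constraint on the unlabeled data, and gradually increases the penalty on the unlabeled slack variables; none of that is reproduced here.

```python
import numpy as np
from sklearn.svm import LinearSVC

def tsvm_sketch(X_lab, y_lab, X_unl, C=1.0, max_iters=100):
    """Very rough sketch of the transductive-SVM heuristic (labels in {-1, +1})."""
    # Step 1: maximize margin over the labeled points only.
    clf = LinearSVC(C=C).fit(X_lab, y_lab)
    # Step 2: use that separator to give initial labels to the unlabeled points.
    y_unl = clf.predict(X_unl)

    X_all = np.vstack([X_lab, X_unl])
    for _ in range(max_iters):
        clf = LinearSVC(C=C).fit(X_all, np.concatenate([y_lab, y_unl]))
        # Step 3: flip the label of the unlabeled point that most disagrees
        # with the current separator, if any; otherwise stop.
        scores = clf.decision_function(X_unl) * y_unl
        worst = np.argmin(scores)
        if scores[worst] >= 0:
            break
        y_unl[worst] = -y_unl[worst]
    return clf, y_unl
```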
Experiments [Joachims ’99]
Transductive Support Vector Machines
Helpful distribution
[Figure: a highly compatible distribution, where a large-margin separator is consistent with the unlabeled data.]
Non-helpful distributions
1/𝛾² clusters, all partitions separable by large margin.
Semi-Supervised Learning
Prominent paradigm over the past 15 years in Machine Learning.
Key Insight
Unlabeled data useful if we have beliefs not only about the form of the target, but also about its relationship with the underlying distribution.
Prominent techniques
– Transductive SVM [Joachims ’99]
– Co-training [Blum & Mitchell ’98]
– Graph-based methods [B&C01], [ZGL03]