
10-601 Machine Learning
Maria-Florina Balcan Spring 2015
Plan: Perceptron algorithm for learning linear separators.
1 Learning Linear Separators
Here we can think of examples as being from {0,1}^n or from R^n. Given a training set of labeled examples (that is consistent with a linear separator), we can find a hyperplane w · x = w_0 such that all positive examples are on one side and all negative examples are on the other. That is, w · x > w_0 for positive x's and w · x < w_0 for negative x's. We can solve this using linear programming. The sample complexity results for classes of finite VC-dimension, together with known results about linear programming, imply that the class of linear separators is efficiently learnable in the PAC (distributional) model. Today we will talk about the Perceptron algorithm.

1.1 The Perceptron Algorithm

One of the oldest algorithms used in machine learning (from the early 60s) is an online algorithm for learning a linear threshold function, called the Perceptron Algorithm. We present the Perceptron algorithm in the online learning model. In this model, the following scenario repeats:

1. The algorithm receives an unlabeled example.
2. The algorithm predicts a classification of this example.
3. The algorithm is then told the correct answer.

We will call whatever is used to perform step (2) the algorithm's "current hypothesis." As mentioned, the Perceptron algorithm is an online algorithm for learning linear separators. For simplicity, we'll use a threshold of 0, so we're looking at learning functions like:

w_1 x_1 + w_2 x_2 + ... + w_n x_n > 0.
We can simulate a nonzero threshold with a "dummy" input x_0 that is always 1, so this can be done without loss of generality.
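Concretely, writing the augmented weight and example vectors out (this is just the reduction described above, spelled out):

\[
  w \cdot x > w_0 \;\Longleftrightarrow\; (w_1, \ldots, w_n, -w_0) \cdot (x_1, \ldots, x_n, 1) > 0,
\]

so a separator with threshold w_0 over n features becomes a separator through the origin over n + 1 features.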

The Perceptron Algorithm:
1. Start with the all-zeroes weight vector w_1 = 0, and initialize t to 1.
2. Given example x, predict positive iff w_t · x > 0.
3. On a mistake, update as follows:
   • Mistake on positive: w_{t+1} ← w_t + x.
   • Mistake on negative: w_{t+1} ← w_t − x.
   t ← t + 1.
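For concreteness, here is a minimal sketch of these three steps in code (Python with NumPy; the function name and the (x, y) list interface are my own choices for illustration, with labels in {−1, +1}):

import numpy as np

def perceptron(examples):
    """Online Perceptron on a list of (x, y) pairs with y in {-1, +1}.

    Returns the final weight vector and the number of mistakes made.
    A nonzero threshold is handled by the dummy feature x_0 = 1 above.
    """
    data = [(np.asarray(x, dtype=float), y) for x, y in examples]
    w = np.zeros(len(data[0][0]))       # step 1: w_1 = 0
    mistakes = 0
    for x, y in data:
        pred = 1 if w @ x > 0 else -1   # step 2: predict positive iff w_t . x > 0
        if pred != y:                   # step 3: update only on a mistake
            w += y * x                  # +x on a missed positive, -x on a missed negative
            mistakes += 1
    return w, mistakes

For example, on the two examples ((1, 2), +1) and ((−1, 0), −1), a single pass makes one mistake (on the first example) and ends with w = (1, 2), which classifies both correctly.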
So, this seems reasonable. If we make a mistake on a positive x we get w_{t+1} · x = (w_t + x) · x = w_t · x + ||x||^2, and similarly if we make a mistake on a negative x we have w_{t+1} · x = (w_t − x) · x = w_t · x − ||x||^2. So, in both cases we move closer (by ||x||^2) to the value we wanted.
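As a tiny numerical check (numbers chosen only for illustration): if w_t = (0, 0) and we mistakenly predict negative on the positive example x = (1, 2), the update gives w_{t+1} = (1, 2), and indeed w_{t+1} · x = 5 = w_t · x + ||x||^2.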
We will show the following guarantee for the Perceptron Algorithm:
Theorem 1 Let S be a sequence of labeled examples consistent with a linear threshold function w∗ · x > 0, where w∗ is a unit-length vector. Then the number of mistakes M on S made by the online Perceptron algorithm is at most (R/γ)^2, where
R = max_{x∈S} ||x||, and γ = min_{x∈S} |w∗ · x|.
Note that since w∗ is a unit-length vector, the quantity |w∗ · x| is equal to the distance of x to the separating hyperplane w∗ · x = 0. The parameter “γ” is often called the “margin” of w∗, or more formally, the L2 margin because we are measuring Euclidean distance.
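To spell out why: the distance from a point x to the hyperplane {z : w∗ · z = 0} is |w∗ · x| / ||w∗||, which equals |w∗ · x| exactly when ||w∗|| = 1.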
Proof of Theorem 1. We are going to look at the following two quantities: w_t · w∗ and ||w_t||.

Claim 1: w_{t+1} · w∗ ≥ w_t · w∗ + γ. That is, every time we make a mistake, the dot-product of our weight vector with the target increases by at least γ.

Proof: If x was a positive example, then we get w_{t+1} · w∗ = (w_t + x) · w∗ = w_t · w∗ + x · w∗ ≥ w_t · w∗ + γ (by definition of γ). Similarly, if x was a negative example, we get (w_t − x) · w∗ = w_t · w∗ − x · w∗ ≥ w_t · w∗ + γ.
Claim 2: ||w_{t+1}||^2 ≤ ||w_t||^2 + R^2. That is, every time we make a mistake, the squared length of our weight vector increases by at most R^2.

Proof: If x was a positive example, we get ||w_t + x||^2 = ||w_t||^2 + 2 w_t · x + ||x||^2. This is at most ||w_t||^2 + ||x||^2 because w_t · x ≤ 0 (remember, we made a mistake on x), and this in turn is at most ||w_t||^2 + R^2. The same argument (flipping signs) applies if x was negative but we predicted positive.

Claim 1 implies that after M mistakes, w_{M+1} · w∗ ≥ γM. On the other hand, Claim 2 implies that after M mistakes, ||w_{M+1}||^2 ≤ R^2 M. Now, all we need to do is use the fact that w_{M+1} · w∗ ≤ ||w_{M+1}||, since w∗ is a unit-length vector. So, this means we must have γM ≤ R√M, and thus M ≤ (R/γ)^2.
Discussion: In order to use the Perceptron algorithm to find a consistent linear separator given a set S of labeled examples that is linearly separable by margin γ, we do the following. We repeatedly feed the whole set S of labeled examples into the Perceptron algorithm, for up to (R/γ)^2 + 1 rounds, until we get to a point where the current hypothesis is consistent with the whole set S. Note that by Theorem 1, we are guaranteed to reach such a point. The running time is then polynomial in |S| and (R/γ)^2.
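A sketch of that procedure, reusing the update from the earlier code sketch (Python; the function name and the explicit max_passes argument are mine — by Theorem 1, passing (R/γ)^2 + 1 for max_passes suffices):

import numpy as np

def perceptron_until_consistent(examples, max_passes):
    """Cycle the Perceptron over a labeled set S until it is consistent with S.

    examples: list of (x, y) pairs with y in {-1, +1}, assumed linearly
    separable by some margin gamma; max_passes >= (R/gamma)^2 + 1 guarantees
    success by Theorem 1.  Returns a consistent weight vector, or None.
    """
    data = [(np.asarray(x, dtype=float), y) for x, y in examples]
    w = np.zeros(len(data[0][0]))
    for _ in range(max_passes):
        mistakes = 0
        for x, y in data:
            if (1 if w @ x > 0 else -1) != y:
                w += y * x
                mistakes += 1
        if mistakes == 0:               # a full mistake-free pass: w is consistent with S
            return w
    return None

Each pass either makes at least one mistake or certifies consistency, and the total number of mistakes is at most (R/γ)^2, which is why (R/γ)^2 + 1 passes are enough.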
In the worst case, γ can be exponentially small in n. On the other hand, if we’re lucky and the data is well-separated, γ might even be large compared to 1/n. This is called the “large margin” case. (In fact, the latter is the more modern spin on things: namely, that in many natural cases, we would hope that there exists a large-margin separator.) In fact, one nice thing here is that the mistake-bound depends on just a purely geometric quantity: the amount of “wiggle-room” available for a solution and doesn’t depend in any direct way on the number of features in the space.
So, if data is separable by a large margin, then the Perceptron is a good algorithm to use.
1.2 Additional More Advanced Notes
Guarantee in a distributional setting: In order to get a distributional guarantee we can do the following. (This is not the most sample-efficient online-to-PAC reduction, but it is the simplest to think about.) Let M = (R/γ)^2. For any ε, δ, we draw a sample of size (M/ε) · log(M/δ). We then run Perceptron on the data set and look at the sequence of hypotheses produced: h_1, h_2, .... For each i, if h_i is consistent with the following (1/ε) · log(M/δ) examples, then we stop and output h_i. We can argue that with probability at least 1 − δ, the hypothesis we output has error at most ε. This can be shown as follows. If h_i were a bad hypothesis with true error greater than ε, then the chance we stopped and output h_i would be at most δ/M. So, by the union bound, there is at most a δ chance that we are fooled by any of the hypotheses.
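The probability calculation behind the last two sentences, written out (with k = (1/ε) log(M/δ) denoting the length of the validation window and log the natural logarithm):

\[
  \Pr[\text{a fixed bad } h_i \text{ survives its } k \text{ validation examples}]
  \le (1-\varepsilon)^{k} \le e^{-\varepsilon k} = e^{-\log(M/\delta)} = \delta/M,
\]

and a union bound over these hypotheses gives a failure probability of at most M · δ/M = δ.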
Note that this implies that if the margin over the whole distribution is 1/poly(n), the Per- ceptron algorithm can be used to PAC learn the class of linear separators.
What if there is no perfect separator? What if only most of the data is separable by a large margin, or what if w∗ is not perfect? We can see that the thing we need to look at is Claim 1. Claim 1 said that we make "γ amount of progress" on every mistake. Now it's possible there will be mistakes where we make very little progress, or even negative progress. One thing we can do is bound the total number of mistakes we make in terms of the total distance we would have to move the points to make them actually separable by margin γ. Let's call that TD_γ. Then, we get that after M mistakes, w_{M+1} · w∗ ≥ γM − TD_γ. So, combining with Claim 2, we get that R√M ≥ γM − TD_γ. We could solve the quadratic, but this implies, for instance, that M ≤ (R/γ)^2 + (2/γ)TD_γ. The quantity (1/γ)TD_γ is called the total hinge-loss of w∗.
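One way to carry out that algebra (a sketch, writing a = R/γ and b = (1/γ)TD_γ for brevity): dividing R√M ≥ γM − TD_γ by γ gives M ≤ a√M + b, so with s = √M,

\[
  s^2 - a s - b \le 0
  \;\Longrightarrow\;
  s \le \frac{a + \sqrt{a^2 + 4b}}{2}
  \;\Longrightarrow\;
  M = s^2 \le \frac{a^2 + 2b + a\sqrt{a^2 + 4b}}{2} \le a^2 + 2b,
\]

where the last step uses a√(a^2 + 4b) = √(a^4 + 4a^2 b) ≤ √(a^4 + 4a^2 b + 4b^2) = a^2 + 2b. This is exactly the stated bound M ≤ (R/γ)^2 + (2/γ)TD_γ.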
So, this is not too bad: we can't necessarily say that we're making only a small multiple of the number of mistakes that w∗ makes (in fact, the problem of finding an approximately optimal separator is NP-hard), but we can say we're doing well in terms of the "total distance" parameter.
Perceptron for approximately maximizing margins. We saw that the Perceptron algorithm makes at most (R/γ)^2 mistakes on any sequence of examples that is linearly separable by margin γ, i.e., any sequence for which there exists a unit-length vector w∗ such that all examples x satisfy l(x)(w∗ · x) ≥ γ, where l(x) ∈ {−1, 1} is the label of x.
Suppose we are handed a set of examples S and we want to actually find a large-margin separator for them. One approach is to directly solve for the maximum-margin separator using convex programming (which is what is done in the SVM algorithm). However, if we only need to approximately maximize the margin, then another approach is to use Perceptron. In particular, suppose we cycle through the data using the Perceptron algorithm, updating not only on mistakes, but also on examples x that our current hypothesis gets correct by margin less than γ/2. Assuming our data is separable by margin γ, then we can show that this is guaranteed to halt in a number of rounds that is polynomial in R/γ. (In fact, we can replace γ/2 with (1 − ε)γ and have bounds that are polynomial in R/(εγ).)
The Margin Perceptron Algorithm(γ):
1. Initialize w_1 = l(x)x, where x is the first example seen, and initialize t to 1.
2. Predict positive if (w_t · x)/||w_t|| ≥ γ/2, predict negative if (w_t · x)/||w_t|| ≤ −γ/2, and consider an example to be a margin mistake when (w_t · x)/||w_t|| ∈ (−γ/2, γ/2).
3. On a mistake (incorrect prediction or margin mistake), update as in the standard Perceptron algorithm: w_{t+1} ← w_t + l(x)x; t ← t + 1.
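A sketch of Margin Perceptron(γ) in the same style as the earlier code (Python; the function name and the max_passes safety cap are mine, and γ is assumed to be given):

import numpy as np

def margin_perceptron(examples, gamma, max_passes=10000):
    """Margin Perceptron(gamma) on (x, y) pairs with y in {-1, +1}.

    Updates on every incorrect prediction and on every margin mistake,
    i.e. whenever y * (w . x) / ||w|| < gamma / 2.  Theorem 2 below bounds
    the total number of updates by 8(R/gamma)^2 + 4(R/gamma).
    """
    data = [(np.asarray(x, dtype=float), y) for x, y in examples]
    x1, y1 = data[0]
    w = y1 * x1                          # step 1: w_1 = l(x) x on the first example
    for _ in range(max_passes):
        updated = False
        for x, y in data:
            # on data separable by margin gamma, w never becomes the zero
            # vector, since w . w* only grows, so the division is safe
            if y * (w @ x) / np.linalg.norm(w) < gamma / 2:   # mistake or margin mistake
                w = w + y * x            # step 3: same additive update as before
                updated = True
        if not updated:                  # every example is now correct by margin >= gamma/2
            return w
    return w

For the approximate-maximization variant mentioned above, one would replace gamma / 2 in the test by (1 − ε)·gamma.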
Theorem 2 Let S be a sequence of labeled examples consistent with a linear threshold function w∗ · x > 0, where w∗ is a unit-length vector, and let
R = max_{x∈S} ||x||, and γ = min_{x∈S} |w∗ · x|.
Then the number of mistakes (including margin mistakes) made by Margin Perceptron(γ) on S is at most 8(R/γ)^2 + 4(R/γ).
Proof: The argument for this new algorithm follows the same lines as the argument for the original Perceptron algorithm.

As before, each update increases w_t · w∗ by at least γ. What is now a little more complicated is to bound the increase in ||w_t||. For the original algorithm, we had ||w_{t+1}||^2 ≤ ||w_t||^2 + R^2, which implies

||w_{t+1}|| ≤ ||w_t|| + R^2/(2||w_t||).

For the new algorithm, we can show instead:

||w_{t+1}|| ≤ ||w_t|| + γ/2 + R^2/(2||w_t||).     (1)

To see this, note that:

||w_{t+1}||^2 = ||w_t||^2 + 2 l(x) (w_t · x) + ||x||^2 = ||w_t||^2 ( 1 + 2 l(x) (w_t · x)/||w_t||^2 + ||x||^2/||w_t||^2 ).

Taking the square root of both sides and using the inequalities √(1 + α) ≤ 1 + α/2 and ||x||^2 ≤ R^2, we get:

||w_{t+1}|| ≤ ||w_t|| ( 1 + l(x) (w_t · x)/||w_t||^2 + R^2/(2||w_t||^2) ).

Combining this with the fact that l(x) (w_t · x)/||w_t|| ≤ γ/2 (since w_t made a mistake or margin mistake on x), we get the desired upper bound on ||w_{t+1}||, namely ||w_{t+1}|| ≤ ||w_t|| + γ/2 + R^2/(2||w_t||).

Note that (1) implies that if ||w_t|| ≥ 2R^2/γ then ||w_{t+1}|| ≤ ||w_t|| + 3γ/4. Note also that ||w_{t+1}|| ≤ ||w_t|| + ||x|| (by the triangle inequality), so ||w_{t+1}|| ≤ ||w_t|| + R. These two facts imply that after M updates we have:

||w_{M+1}|| ≤ (2R^2/γ + R) + 3Mγ/4.

Since γM ≤ w_{M+1} · w∗ ≤ ||w_{M+1}||, solving γM ≤ (2R^2/γ + R) + 3Mγ/4 gives M ≤ 8R^2/γ^2 + 4R/γ, as desired.