# 程序代写代做代考 algorithm database book

book

2007/2/23

page 113

Chapter 10

Classification of

Handwritten Digits

Classification by computer of handwritten digits is a standard problem in pattern

recognition. The typical application is automatic reading of zip codes on envelopes.

A comprehensive review of different algorithms is given in [62].

10.1 Handwritten Digits and a Simple Algorithm

In Figure 10.1 we illustrate handwritten digits that we will use in the examples in

this chapter.

2 4 6 8 10 12 14 16

2

4

6

8

10

12

14

16

2 4 6 8 10 12 14 16

2

4

6

8

10

12

14

16

2 4 6 8 10 12 14 16

2

4

6

8

10

12

14

16

2 4 6 8 10 12 14 16

2

4

6

8

10

12

14

16

2 4 6 8 10 12 14 16

2

4

6

8

10

12

14

16

2 4 6 8 10 12 14 16

2

4

6

8

10

12

14

16

2 4 6 8 10 12 14 16

2

4

6

8

10

12

14

16

2 4 6 8 10 12 14 16

2

4

6

8

10

12

14

16

2 4 6 8 10 12 14 16

2

4

6

8

10

12

14

16

2 4 6 8 10 12 14 16

2

4

6

8

10

12

14

16

Figure 10.1. Handwritten digits from the U.S. Postal Service database;

see, e.g., [47].

We will treat the digits in three different but equivalent formats:

1. As 16 × 16 gray scale images, as in Figure 10.1;

2. As functions of two variables, s = s(x, y), as in Figure 10.2; and

3. As vectors in R256.

In the classification of an unknown digit we need to compute the distance

to known digits. Different distance measures can be used, and perhaps the most

natural one to use is the Euclidean distance: stack the columns of the image in a

113

张嘉平

张嘉平

张嘉平

book

2007/2/23

page 114

114 Chapter 10. Classification of Handwritten Digits

2 4 6 8 10 12 14 16

2

4

6

8

10

12

14

16

0 2 4

6 8 10

12 14 160

5

10

15

20

−1

−0.5

0

0.5

1

Figure 10.2. A digit considered as a function of two variables.

vector and identify each digit as a vector in R256. Then define the distance function

(x, y) = ∥x− y ∥2.

An alternative distance function is the cosine between two vectors.

In a real application of handwritten digit classification, e.g., zip code reading,

there are hardware and real-time factors that must be taken into account. In this

chapter we describe an idealized setting. The problem is as follows:

Given a set of manually classified digits (the training set), classify a set

of unknown digits (the test set).

In the U.S. Postal Service database, the training set contains 7291 handwritten

digits. Here we will use a subset of 1707 digits, relatively equally distributed between

0 and 9. The test set has 2007 digits.

If we consider the training set digits as vectors or points, then it is reasonable

to assume that all digits of one kind form a cluster of points in a Euclidean 256-

dimensional vector space. Ideally the clusters are well separated (otherwise the task

of classifying unknown digits will be very difficult), and the separation between the

clusters depends on how well written the training digits are.

2 4 6 8 10 12 14 16

2

4

6

8

10

12

14

16

2 4 6 8 10 12 14 16

2

4

6

8

10

12

14

16

2 4 6 8 10 12 14 16

2

4

6

8

10

12

14

16

2 4 6 8 10 12 14 16

2

4

6

8

10

12

14

16

2 4 6 8 10 12 14 16

2

4

6

8

10

12

14

16

2 4 6 8 10 12 14 16

2

4

6

8

10

12

14

16

2 4 6 8 10 12 14 16

2

4

6

8

10

12

14

16

2 4 6 8 10 12 14 16

2

4

6

8

10

12

14

16

2 4 6 8 10 12 14 16

2

4

6

8

10

12

14

16

2 4 6 8 10 12 14 16

2

4

6

8

10

12

14

16

Figure 10.3. The means (centroids) of all digits in the training set.

In Figure 10.3 we illustrate the means (centroids) of the digits in the training

set. From the figure we get the impression that a majority of the digits are well

written. (If there were many badly written digits, the means would be very diffuse.)

张嘉平

张嘉平

张嘉平

张嘉平

张嘉平

张嘉平

张嘉平

张嘉平

book

2007/2/23

page 115

10.2. Classification Using SVD Bases 115

This indicates that the clusters are rather well separated. Therefore, it seems likely

that a simple classification algorithm that computes the distance from each unknown

digit to the means may be reasonably accurate.

A simple classification algorithm

Training: Given the manually classified training set, compute the means (cen-

troids) mi, i = 0, . . . , 9, of all the 10 classes.

Classification: For each digit in the test set, classify it as k if mk is the closest

mean.

It turns out that for our test set, the success rate of this algorithm is around

75%, which is not good enough. The reason for this relatively bad performance is

that the algorithm does not use any information about the variation within each

class of digits.

10.2 Classification Using SVD Bases

We will now describe a classification algorithm that is based on the modeling of the

variation within each digit class using orthogonal basis vectors computed using the

SVD. This can be seen as a least squares algorithm based on a reduced rank model ;

cf. Chapter 7.

If we consider the images as 16 × 16 matrices, then the data are multidimen-

sional; see Figure 10.4. Stacking all the columns of each image above each other

gives a matrix. Let A ∈ Rm×n, with m = 256, be the matrix consisting of all the

training digits of one kind, the 3’s, say. The columns of A span a linear subspace

of Rm. However, this subspace cannot be expected to have a large dimension, be-

cause if it did, then the subspaces of the different kinds of digits would intersect

(remember that we are considering subspaces of R256).

Now the idea is to “model” the variation within the set of training (and test)

digits of one kind using an orthogonal basis of the subspace. An orthogonal basis

can be computed using the SVD, and any matrix A is a sum of rank 1 matrices:

A =

m!

i=1

σiuiv

T

i = + + · · · . (10.1)

Each column in A represents an image of a digit 3, and therefore the left singular

vectors ui are an orthogonal basis in the “image space of 3’s.” We will refer to the

left singular vectors as “singular images.” From (10.1) the jth column of A is equal

to

aj =

m!

i=1

(σivij)ui,

张嘉平

张嘉平

张嘉平

张嘉平

张嘉平

张嘉平

张嘉平

张嘉平

张嘉平

张嘉平

张嘉平

张嘉平

张嘉平

张嘉平

张嘉平

张嘉平

张嘉平

book

2007/2/23

page 116

116 Chapter 10. Classification of Handwritten Digits

16

16

digits

3

A256

digits

Figure 10.4. The image of one digit is a matrix, and the set of images of

one kind form a tensor. In the lower part of the figure, each digit (of one kind) is

represented by a column in the matrix.

and we see that the coordinates of image j in A in terms of this basis are σivij .

From the matrix approximation properties of the SVD (Theorems 6.6 and 6.7), we

know that the first singular vector represents the “dominating” direction of the data

matrix. Therefore, if we fold the vectors ui back into images, we expect the first

singular vector to look like a 3, and the following singular images should represent

the dominating variations of the training set around the first singular image. In

Figure 10.5 we illustrate the singular values and the first three singular images for

the training set 3’s. In the middle graph we plot the coordinates of each of the

131 digits in terms of the first three singular vectors. We see that all the digits have

a large portion (between 0.05 and 0.1) of the first singular image, which, in fact,

looks very much like the mean of 3’s in Figure 10.3. We then see that there is a

rather large variation in the coordinates in terms of the second and third singular

images.

The SVD basis classification algorithm will be based on the following assump-

tions:

1. Each digit (in the training set and the test set) is well characterized by a

few of the first singular images of its own kind. The more precise meaning of

“few” should be investigated in experiments.

2. An expansion in terms of the first few singular images discriminates well

between the different classes of digits.

张嘉平

张嘉平

张嘉平

张嘉平

张嘉平

张嘉平

张嘉平

张嘉平

book

2007/2/23

page 117

10.2. Classification Using SVD Bases 117

0 20 40 60 80 100 120 140

10

−1

10

0

10

1

10

2

10

3

0 20 40 60 80 100 120 140

0

0.1

0.2

0 20 40 60 80 100 120 140

−0.5

0

0.5

0 20 40 60 80 100 120 140

−0.5

0

0.5

2 4 6 8 10 12 14 16

2

4

6

8

10

12

14

16

2 4 6 8 10 12 14 16

2

4

6

8

10

12

14

16

2 4 6 8 10 12 14 16

2

4

6

8

10

12

14

16

Figure 10.5. Singular values (top), coordinates of the 131 test digits in

terms of the first three right singular vectors vi (middle), and the first three singular

images (bottom).

3. If an unknown digit can be better approximated in one particular basis of

singular images, the basis of 3’s say, than in the bases of the other classes,

then it is likely that the unknown digit is a 3.

Thus we should compute how well an unknown digit can be represented in

the 10 different bases. This can be done by computing the residual vector in least

squares problems of the type

min

αi

!!!!!

z −

k”

i=1

αiui

!!!!!

,

where z represents an unknown digit and ui represents the singular images. We can

张嘉平

张嘉平

张嘉平

张嘉平

book

2007/2/23

page 118

118 Chapter 10. Classification of Handwritten Digits

0 1 2 3 4 5 6 7 8 9

0.4

0.5

0.6

0.7

0.8

0.9

1

Basis

R

e

si

d

u

a

l

0 1 2 3 4 5 6 7 8 9

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Basis

R

e

si

d

u

a

l

Figure 10.6. Relative residuals of all test 3’s (top) and 7’s (bottom) in

terms of all bases. Ten basis vectors were used for each class.

write this problem in the form

min

α

∥ z − Ukα ∥2,

where Uk =

!

u1 u2 · · · uk

”

. Since the columns of Uk are orthogonal, the solu-

tion of this problem is given by α = UTk z, and the norm of the residual vector of

the least squares problems is

∥ (I − UkUTk )z ∥2, (10.2)

i.e., the norm of the projection of the unknown digit onto the subspace orthogonal

to span(Uk).

To demonstrate that the assumptions above are reasonable, we illustrate in

Figure 10.6 the relative residual norm for all test 3’s and 7’s in terms of all 10 bases.

张嘉平

张嘉平

张嘉平

book

2007/2/23

page 119

10.2. Classification Using SVD Bases 119

2 4 6 8 10 12 14 16

2

4

6

8

10

12

14

16

2 4 6 8 10 12 14 16

2

4

6

8

10

12

14

16

2 4 6 8 10 12 14 16

2

4

6

8

10

12

14

16

2 4 6 8 10 12 14 16

2

4

6

8

10

12

14

16

2 4 6 8 10 12 14 16

2

4

6

8

10

12

14

16

2 4 6 8 10 12 14 16

2

4

6

8

10

12

14

16

0 1 2 3 4 5 6 7 8 9 10

0.4

0.5

0.6

0.7

0.8

0.9

1

# basis vectors

R

e

si

d

u

a

l

Figure 10.7. Unknown digit (nice 3) and approximations using 1, 3, 5, 7,

and 9 terms in the 3-basis (top). Relative residual ∥ (I −UkUTk )z ∥2/∥ z ∥2 in least

squares problem (bottom).

In the two figures, there is one curve for each unknown digit, and naturally it is

not possible to see the individual curves. However, one can see that most of the

test 3’s and 7’s are best approximated in terms of their own basis. The graphs also

give information about which classification errors are more likely than others. (For

example, 3’s and 5’s are similar, whereas 3’s and 4’s are quite different; of course

this only confirms what we already know.)

It is also interesting to see how the residual depends on the number of terms

in the basis. In Figure 10.7 we illustrate the approximation of a nicely written 3 in

terms of the 3-basis with different numbers of basis images. In Figures 10.8 and 10.9

we show the approximation of an ugly 3 in the 3-basis and a nice 3 in the 5-basis.

From Figures 10.7 and 10.9 we see that the relative residual is considerably

smaller for the nice 3 in the 3-basis than in the 5-basis. We also see from Figure 10.8

that the ugly 3 is not well represented in terms of the 3-basis. Therefore, naturally,

if the digits are very badly drawn, then we cannot expect to get a clear classification

based on the SVD bases.

It is possible to devise several classification algorithms based on the model of

expanding in terms of SVD bases. Below we give a simple variant.

张嘉平

张嘉平

张嘉平

张嘉平

张嘉平

张嘉平

book

2007/2/23

page 120

120 Chapter 10. Classification of Handwritten Digits

2 4 6 8 10 12 14 16

2

4

6

8

10

12

14

16

2 4 6 8 10 12 14 16

2

4

6

8

10

12

14

16

2 4 6 8 10 12 14 16

2

4

6

8

10

12

14

16

2 4 6 8 10 12 14 16

2

4

6

8

10

12

14

16

2 4 6 8 10 12 14 16

2

4

6

8

10

12

14

16

2 4 6 8 10 12 14 16

2

4

6

8

10

12

14

16

0 1 2 3 4 5 6 7 8 9 10

0.65

0.7

0.75

0.8

0.85

0.9

0.95

1

# basis vectors

R

e

si

d

u

a

l

Figure 10.8. Unknown digit (ugly 3) and approximations using 1, 3, 5,

7, and 9 terms in the 3-basis (top). Relative residual in least squares problem

(bottom).

An SVD basis classification algorithm

Training: For the training set of known digits, compute the SVD of each set of

digits of one kind.

Classification: For a given test digit, compute its relative residual in all 10 bases.

If one residual is significantly smaller than all the others, classify as that.

Otherwise give up.

The work in this algorithm can be summarized as follows:

Training: Compute SVDs of 10 matrices of dimension m2 × ni.

Each digit is an m×m digitized image.

ni: the number of training digits i.

Test: Compute 10 least squares residuals (10.2).

张嘉平

张嘉平

张嘉平

张嘉平

book

2007/2/23

page 121

10.2. Classification Using SVD Bases 121

2 4 6 8 10 12 14 16

2

4

6

8

10

12

14

16

2 4 6 8 10 12 14 16

2

4

6

8

10

12

14

16

2 4 6 8 10 12 14 16

2

4

6

8

10

12

14

16

2 4 6 8 10 12 14 16

2

4

6

8

10

12

14

16

2 4 6 8 10 12 14 16

2

4

6

8

10

12

14

16

2 4 6 8 10 12 14 16

2

4

6

8

10

12

14

16

0 1 2 3 4 5 6 7 8 9 10

0.7

0.75

0.8

0.85

0.9

0.95

1

# basis vectors

R

e

si

d

u

a

l

Figure 10.9. Unknown digit (nice 3) and approximations using 1, 3, 5,

7, and 9 terms in the 5-basis (top). Relative residual in least squares problem

(bottom).

Table 10.1. Correct classifications as a function of the number of basis

images (for each class).

# basis images 1 2 4 6 8 10

correct (%) 80 86 90 90.5 92 93

Thus the test phase is quite fast, and this algorithm should be suitable for

real-time computations. The algorithm is related to the SIMCA method [89].

We next give some test results (from [82]) for the U.S. Postal Service database,

here with 7291 training digits and 2007 test digits [47]. In Table 10.1 we give

classification results as a function of the number of basis images for each class.

Even if there is a very significant improvement in performance compared to

the method in which one used only the centroid images, the results are not good

enough, as the best algorithms reach about 97% correct classifications. The training

and test contain some digits that are very difficult to classify; we give a few examples

in Figure 10.10. Such badly written digits are very difficult to handle automatically.

张嘉平

张嘉平

book

2007/2/23

page 122

122 Chapter 10. Classification of Handwritten Digits

2 4 6 8 10 12 14 16

2

4

6

8

10

12

14

16

2 4 6 8 10 12 14 16

2

4

6

8

10

12

14

16

2 4 6 8 10 12 14 16

2

4

6

8

10

12

14

16

2 4 6 8 10 12 14 16

2

4

6

8

10

12

14

16

Figure 10.10. Ugly digits in the U.S. Postal Service database.

Figure 10.11. A digit (left) and acceptable transformations (right).

Columnwise from left to right the digit has been (1) written with a thinner and

a thicker pen, (2) stretched diagonally, (3) compressed and elongated vertically, and

(4) rotated.

10.3 Tangent Distance

A good classification algorithm should be able to classify unknown digits that are

rather well written but still deviate considerably in Euclidean distance from the

ideal digit. There are some deviations that humans can easily handle and which are

quite common and acceptable. We illustrate a few such variations20 in Figure 10.11.

Such transformations constitute no difficulties for a human reader, and ideally they

should be very easy to deal with in automatic digit recognition. A distance measure,

tangent distance, that is invariant under small such transformations is described in

[86, 87].

16 × 16 images can be interpreted as points in R256. Let p be a fixed pattern

in an image. We shall first consider the case of only one allowed transformation,

translation of the pattern (digit) in the x-direction, say. This translation can be

thought of as moving the pattern along a curve in R256. Let the curve be param-

eterized by a real parameter α so that the curve is given by s(p,α) and in such a

way that s(p, 0) = p. In general, the curve is nonlinear and can be approximated

by the first two terms in the Taylor expansion,

s(p,α) = s(p, 0) +

ds

dα

(p, 0)α + O(α2) ≈ p + tpα,

where tp =

ds

dα

(p, 0) is a vector in R256. By varying α slightly around 0, we make

a small movement of the pattern along the tangent at the point p on the curve.

Assume that we have another pattern e that is approximated similarly:

s(e,α) ≈ e + teα.

20Note that the transformed digits have been written not manually but by using the techniques

described later in this section. The presentation in this section is based on the papers [86, 87, 82].

张嘉平

张嘉平

张嘉平

张嘉平

张嘉平

张嘉平

张嘉平

张嘉平

张嘉平

张嘉平