
Tagging with Hidden Markov Models

Michael Collins

1 Tagging Problems

In many NLP problems, we would like to model pairs of sequences. Part-of-speech

(POS) tagging is perhaps the earliest, and most famous, example of this type of

problem. In POS tagging our goal is to build a model whose input is a sentence,

for example

the dog saw a cat

and whose output is a tag sequence, for example

D N V D N (1)

(here we use D for a determiner, N for noun, and V for verb). The tag sequence is

the same length as the input sentence, and therefore specifies a single tag for each

word in the sentence (in this example D for the, N for dog, V for saw, and so on).

We will use x1 . . . xn to denote the input to the tagging model: we will often

refer to this as a sentence. In the above example we have the length n = 5, and

x1 = the, x2 = dog, x3 = saw, x4 = a, x5 = cat. We will use y1 . . . yn to denote

the output of the tagging model: we will often refer to this as the state sequence or

tag sequence. In the above example we have y1 = D, y2 = N, y3 = V, and so on.

This type of problem, where the task is to map a sentence x1 . . . xn to a tag se-

quence y1 . . . yn, is often referred to as a sequence labeling problem, or a tagging

problem.

We will assume that we have a set of training examples, (x(i), y(i)) for i = 1 . . . m, where each x(i) is a sentence x(i)_1 . . . x(i)_{ni}, and each y(i) is a tag sequence y(i)_1 . . . y(i)_{ni} (we assume that the i'th example is of length ni). Hence x(i)_j is the j'th word in the i'th training example, and y(i)_j is the tag for that word. Our task is to learn a function that maps sentences to tag sequences from these training examples.


2 Generative Models, and The Noisy Channel Model

Supervised problems in machine learning are defined as follows. We assume train-

ing examples (x(1), y(1)) . . . (x(m), y(m)), where each example consists of an input

x(i) paired with a label y(i). We use X to refer to the set of possible inputs, and Y

to refer to the set of possible labels. Our task is to learn a function f : X → Y that

maps any input x to a label f(x).

Many problems in natural language processing are supervised learning problems. For example, in tagging problems each x(i) would be a sequence of words x(i)_1 . . . x(i)_{ni}, and each y(i) would be a sequence of tags y(i)_1 . . . y(i)_{ni} (we use ni to refer to the length of the i'th training example). X would refer to the set of all sequences x1 . . . xn, and Y would be the set of all tag sequences y1 . . . yn. Our task would be to learn a function f : X → Y that maps sentences to tag sequences.

In machine translation, each input x would be a sentence in the source language

(e.g., Chinese), and each “label” would be a sentence in the target language (e.g.,

English). In speech recognition each input would be the recording of some ut-

terance (perhaps pre-processed using a Fourier transform, for example), and each

label is an entire sentence. Our task in all of these examples is to learn a function

from inputs x to labels y, using our training examples (x(i), y(i)) for i = 1 . . . m as

evidence.

One way to define the function f(x) is through a conditional model. In this

approach we define a model that defines the conditional probability

p(y|x)

for any x, y pair. The parameters of the model are estimated from the training

examples. Given a new test example x, the output from the model is

f(x) = arg max_{y∈Y} p(y|x)

Thus we simply take the most likely label y as the output from the model. If our

model p(y|x) is close to the true conditional distribution of labels given inputs, the

function f(x) will be close to optimal.

An alternative approach, which is often used in machine learning and natural

language processing, is to define a generative model. Rather than directly estimat-

ing the conditional distribution p(y|x), in generative models we instead model the

joint probability

p(x, y)

over (x, y) pairs. The parameters of the model p(x, y) are again estimated from the

training examples (x(i), y(i)) for i = 1 . . . m. In many cases we further decompose


the probability p(x, y) as follows:

p(x, y) = p(y)p(x|y) (2)

and then estimate the models for p(y) and p(x|y) separately. These two model

components have the following interpretations:

• p(y) is a prior probability distribution over labels y.

• p(x|y) is the probability of generating the input x, given that the underlying

label is y.

We will see that in many cases it is very convenient to decompose models in this

way; for example, the classical approach to speech recognition is based on this type

of decomposition.

Given a generative model, we can use Bayes rule to derive the conditional

probability p(y|x) for any (x, y) pair:

p(y|x) = p(y)p(x|y) / p(x)

where

p(x) = ∑_{y∈Y} p(x, y) = ∑_{y∈Y} p(y)p(x|y)

Thus the joint model is quite versatile, in that we can also derive the probabilities

p(x) and p(y|x).

We use Bayes rule directly in applying the joint model to a new test example.

Given an input x, the output of our model, f(x), can be derived as follows:

f(x) = arg max_y p(y|x)

= arg max_y p(y)p(x|y) / p(x)   (3)

= arg max_y p(y)p(x|y)   (4)

Eq. 3 follows by Bayes rule. Eq. 4 follows because the denominator, p(x), does not

depend on y, and hence does not affect the arg max. This is convenient, because it

means that we do not need to calculate p(x), which can be an expensive operation.

Models that decompose a joint probability into terms p(y) and p(x|y) are

often called noisy-channel models. Intuitively, when we see a test example x, we

assume that it has been generated in two steps: first, a label y has been chosen with

probability p(y); second, the example x has been generated from the distribution


p(x|y). The model p(x|y) can be interpreted as a “channel” which takes a label y

as its input, and corrupts it to produce x as its output. Our task is to find the most

likely label y, given that we observe x.

In summary:

• Our task is to learn a function from inputs x to labels y = f(x). We assume

training examples (x(i), y(i)) for i = 1 . . . m.

• In the noisy channel approach, we use the training examples to estimate

models p(y) and p(x|y). These models define a joint (generative) model

p(x, y) = p(y)p(x|y)

• Given a new test example x, we predict the label

f(x) = arg max_{y∈Y} p(y)p(x|y)

Finding the output f(x) for an input x is often referred to as the decoding

problem.
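As a concrete illustration, the decoding rule above can be sketched in a few lines of Python. The label set and all probability values below are hypothetical, chosen only to make the arithmetic visible:

```python
# A minimal sketch of noisy-channel decoding. The prior p(y) and the
# channel model p(x|y) below are invented toy distributions.

# Prior p(y) over labels.
prior = {"N": 0.5, "V": 0.3, "D": 0.2}

# Channel model p(x|y): probability of generating word x given label y.
channel = {
    "N": {"dog": 0.6, "saw": 0.1, "the": 0.3},
    "V": {"dog": 0.1, "saw": 0.8, "the": 0.1},
    "D": {"dog": 0.1, "saw": 0.1, "the": 0.8},
}

def decode(x):
    # f(x) = argmax_y p(y) * p(x|y); the denominator p(x) is not
    # needed because it does not depend on y.
    return max(prior, key=lambda y: prior[y] * channel[y].get(x, 0.0))

print(decode("saw"))  # V: 0.3 * 0.8 = 0.24 beats N (0.05) and D (0.02)
```

Note that `decode` never computes p(x), exactly as Eq. 4 promises.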

3 Generative Tagging Models

We now see how generative models can be applied to the tagging problem. We

assume that we have a finite vocabulary V , for example V might be the set of

words seen in English, e.g., V = {the, dog, saw, cat, laughs, . . .}. We use K to

denote the set of possible tags; again, we assume that this set is finite. We then give

the following definition:

Definition 1 (Generative Tagging Models) Assume a finite set of words V , and

a finite set of tags K. Define S to be the set of all sequence/tag-sequence pairs

〈x1 . . . xn, y1 . . . yn〉 such that n ≥ 0, xi ∈ V for i = 1 . . . n and yi ∈ K for

i = 1 . . . n. A generative tagging model is then a function p such that:

1. For any 〈x1 . . . xn, y1 . . . yn〉 ∈ S,

p(x1 . . . xn, y1 . . . yn) ≥ 0

2. In addition,

∑_{〈x1 . . . xn, y1 . . . yn〉∈S} p(x1 . . . xn, y1 . . . yn) = 1


Hence p(x1 . . . xn, y1 . . . yn) is a probability distribution over pairs of sequences

(i.e., a probability distribution over the set S).

Given a generative tagging model, the function from sentences x1 . . . xn to tag

sequences y1 . . . yn is defined as

f(x1 . . . xn) = arg max_{y1 . . . yn} p(x1 . . . xn, y1 . . . yn)

Thus for any input x1 . . . xn, we take the highest probability tag sequence as the

output from the model.

Having introduced generative tagging models, there are three critical questions:

• How do we define a generative tagging model p(x1 . . . xn, y1 . . . yn)?

• How do we estimate the parameters of the model from training examples?

• How do we efficiently find

arg max_{y1 . . . yn} p(x1 . . . xn, y1 . . . yn)

for any input x1 . . . xn?

The next section describes how trigram hidden Markov models can be used to

answer these three questions.

4 Trigram Hidden Markov Models (Trigram HMMs)

In this section we describe an important type of generative tagging model, a trigram

hidden Markov model, describe how the parameters of the model can be estimated

from training examples, and describe how the most likely sequence of tags can be

found for any sentence.

4.1 Definition of Trigram HMMs

We now give a formal definition of trigram hidden Markov models (trigram HMMs).

The next section shows how this model form is derived, and gives some intuition

behind the model.

Definition 2 (Trigram Hidden Markov Model (Trigram HMM)) A trigram HMM

consists of a finite set V of possible words, and a finite set K of possible tags, to-

gether with the following parameters:


• A parameter

q(s|u, v)

for any trigram (u, v, s) such that s ∈ K ∪ {STOP}, and u, v ∈ K ∪ {*}.

The value for q(s|u, v) can be interpreted as the probability of seeing the tag

s immediately after the bigram of tags (u, v).

• A parameter

e(x|s)

for any x ∈ V , s ∈ K. The value for e(x|s) can be interpreted as the

probability of seeing observation x paired with state s.

Define S to be the set of all sequence/tag-sequence pairs 〈x1 . . . xn, y1 . . . yn+1〉

such that n ≥ 0, xi ∈ V for i = 1 . . . n, yi ∈ K for i = 1 . . . n, and yn+1 = STOP.

We then define the probability for any 〈x1 . . . xn, y1 . . . yn+1〉 ∈ S as

p(x1 . . . xn, y1 . . . yn+1) = ∏_{i=1}^{n+1} q(yi|yi−2, yi−1) × ∏_{i=1}^{n} e(xi|yi)

where we have assumed that y0 = y−1 = *.

As one example, if we have n = 3, x1 . . . x3 equal to the sentence the dog

laughs, and y1 . . . y4 equal to the tag sequence D N V STOP, then

p(x1 . . . xn, y1 . . . yn+1) = q(D|∗, ∗)× q(N|∗, D)× q(V|D, N)× q(STOP|N, V)

×e(the|D)× e(dog|N)× e(laughs|V)

Note that this model form is a noisy-channel model. The quantity

q(D|∗, ∗)× q(N|∗, D)× q(V|D, N)× q(STOP|N, V)

is the prior probability of seeing the tag sequence D N V STOP, where we have

used a second-order Markov model (a trigram model), very similar to the language

models we derived in the previous lecture. The quantity

e(the|D)× e(dog|N)× e(laughs|V)

can be interpreted as the conditional probability p(the dog laughs|D N V STOP):

that is, the conditional probability p(x|y) where x is the sentence the dog laughs,

and y is the tag sequence D N V STOP.
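The computation above can be reproduced mechanically. The following sketch uses hypothetical values for the q and e parameters; any values satisfying the probability constraints would do:

```python
# Hypothetical parameter values for the trigram HMM, chosen only to
# make the example concrete.
q = {
    ("*", "*", "D"): 0.8, ("*", "D", "N"): 0.7,
    ("D", "N", "V"): 0.6, ("N", "V", "STOP"): 0.9,
}
e = {("the", "D"): 0.5, ("dog", "N"): 0.4, ("laughs", "V"): 0.2}

sentence = ["the", "dog", "laughs"]
tags = ["D", "N", "V", "STOP"]

# p(x, y) = prod_i q(y_i | y_{i-2}, y_{i-1}) * prod_i e(x_i | y_i),
# with y_0 = y_{-1} = *.
padded = ["*", "*"] + tags
p = 1.0
for i in range(len(tags)):
    p *= q[(padded[i], padded[i + 1], padded[i + 2])]
for word, tag in zip(sentence, tags):
    p *= e[(word, tag)]

print(p)  # 0.8 * 0.7 * 0.6 * 0.9 * 0.5 * 0.4 * 0.2
```

The first loop contributes the four trigram terms (including the STOP transition); the second contributes the three emission terms.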


4.2 Independence Assumptions in Trigram HMMs

We now describe how the form for trigram HMMs can be derived: in particular, we

describe the independence assumptions that are made in the model. Consider a pair

of sequences of random variables X1 . . . Xn, and Y1 . . . Yn, where n is the length

of the sequences. We assume that each Xi can take any value in a finite set V of

words. For example, V might be a set of possible words in English, for example

V = {the, dog, saw, cat, laughs, . . .}. Each Yi can take any value in a finite set K

of possible tags. For example, K might be the set of possible part-of-speech tags

for English, e.g. K = {D, N, V, . . .}.

The length n is itself a random variable—it can vary across different sentences—

but we will use a similar technique to the method used for modeling variable-length

Markov processes (see the previous lecture notes).

Our task will be to model the joint probability

P (X1 = x1 . . . Xn = xn, Y1 = y1 . . . Yn = yn)

for any observation sequence x1 . . . xn paired with a state sequence y1 . . . yn, where

each xi is a member of V , and each yi is a member of K.

We will find it convenient to define one additional random variable Yn+1, which

always takes the value STOP. This will play a similar role to the STOP symbol seen

for variable-length Markov sequences, as described in the previous lecture notes.

The key idea in hidden Markov models is the following definition:

P (X1 = x1 . . . Xn = xn, Y1 = y1 . . . Yn+1 = yn+1)

= ∏_{i=1}^{n+1} P (Yi = yi|Yi−2 = yi−2, Yi−1 = yi−1) × ∏_{i=1}^{n} P (Xi = xi|Yi = yi)   (5)

where we have assumed that y0 = y−1 = *, where * is a special start symbol.

Note the similarity to our definition of trigram HMMs. In trigram HMMs we

have made the assumption that the joint probability factorizes as in Eq. 5, and in

addition we have assumed that for any i, for any values of yi−2, yi−1, yi,

P (Yi = yi|Yi−2 = yi−2, Yi−1 = yi−1) = q(yi|yi−2, yi−1)

and that for any value of i, for any values of xi and yi,

P (Xi = xi|Yi = yi) = e(xi|yi)

Eq. 5 can be derived as follows. First, we can write

P (X1 = x1 . . . Xn = xn, Y1 = y1 . . . Yn+1 = yn+1)

= P (Y1 = y1 . . . Yn+1 = yn+1) P (X1 = x1 . . . Xn = xn|Y1 = y1 . . . Yn+1 = yn+1)   (6)


This step is exact, by the chain rule of probabilities. Thus we have decomposed

the joint probability into two terms: first, the probability of choosing tag sequence

y1 . . . yn+1; second, the probability of choosing the word sequence x1 . . . xn, con-

ditioned on the choice of tag sequence. Note that this is exactly the same type of

decomposition as seen in noisy channel models.

Now consider the probability of seeing the tag sequence y1 . . . yn+1. We make

independence assumptions as follows: we assume that for any sequence y1 . . . yn+1,

P (Y1 = y1 . . . Yn+1 = yn+1) = ∏_{i=1}^{n+1} P (Yi = yi|Yi−2 = yi−2, Yi−1 = yi−1)

That is, we have assumed that the sequence Y1 . . . Yn+1 is a second-order Markov

sequence, where each state depends only on the previous two states in the sequence.

Next, consider the probability of the word sequence x1 . . . xn, conditioned on

the choice of tag sequence, y1 . . . yn+1. We make the following assumption:

P (X1 = x1 . . . Xn = xn|Y1 = y1 . . . Yn+1 = yn+1)

= ∏_{i=1}^{n} P (Xi = xi|X1 = x1 . . . Xi−1 = xi−1, Y1 = y1 . . . Yn+1 = yn+1)

= ∏_{i=1}^{n} P (Xi = xi|Yi = yi)   (7)

The first step of this derivation is exact, by the chain rule. The second step involves

an independence assumption, namely that for i = 1 . . . n,

P (Xi = xi|X1 = x1 . . . Xi−1 = xi−1, Y1 = y1 . . . Yn+1 = yn+1) = P (Xi = xi|Yi = yi)

Hence we have assumed that the value for the random variable Xi depends only on

the value of Yi. More formally, the value for Xi is conditionally independent of the

previous observations X1 . . . Xi−1, and the other state values Y1 . . . Yi−1, Yi+1 . . . Yn+1,

given the value of Yi.

One useful way of thinking of this model is to consider the following stochastic

process, which generates sequence pairs y1 . . . yn+1, x1 . . . xn:

1. Initialize i = 1 and y0 = y−1 = *.

2. Generate yi from the distribution

q(yi|yi−2, yi−1)

3. If yi = STOP then return y1 . . . yi, x1 . . . xi−1. Otherwise, generate xi from

the distribution

e(xi|yi),

set i = i+ 1, and return to step 2.
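The three steps above can be sketched directly as a sampler. The parameters q and e below are invented toy distributions; `q[(u, v)]` maps a tag bigram to a distribution over the next tag, and `e[s]` maps a tag to a distribution over words:

```python
import random

# Hypothetical toy parameters for the generative process.
q = {
    ("*", "*"): {"D": 1.0},
    ("*", "D"): {"N": 1.0},
    ("D", "N"): {"V": 0.8, "STOP": 0.2},
    ("N", "V"): {"STOP": 1.0},
}
e = {"D": {"the": 1.0}, "N": {"dog": 0.5, "cat": 0.5}, "V": {"saw": 1.0}}

def draw(dist):
    # Draw one outcome from a {value: probability} dictionary.
    r, total = random.random(), 0.0
    for value, prob in dist.items():
        total += prob
        if r < total:
            return value
    return value  # guard against floating-point underflow

def generate():
    # Step 1: initialize i = 1 and y0 = y-1 = *.
    tags, words = [], []
    u, v = "*", "*"
    while True:
        s = draw(q[(u, v)])      # step 2: generate y_i
        tags.append(s)
        if s == "STOP":          # step 3: stop, or emit x_i and loop
            return words, tags
        words.append(draw(e[s]))
        u, v = v, s

words, tags = generate()
print(words, tags)
```

Every sampled pair ends with the STOP tag and has one fewer word than tags, mirroring the fact that STOP emits no word.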


4.3 Estimating the Parameters of a Trigram HMM

We will assume that we have access to some training data. The training data con-

sists of a set of examples where each example is a sentence x1 . . . xn paired with a

tag sequence y1 . . . yn. Given this data, how do we estimate the parameters of the

model? We will see that there is a simple and very intuitive answer to this question.

Define c(u, v, s) to be the number of times the sequence of three states (u, v, s)

is seen in training data: for example, c(V, D, N) would be the number of times the

sequence of three tags V, D, N is seen in the training corpus. Similarly, define

c(u, v) to be the number of times the tag bigram (u, v) is seen. Define c(s) to be

the number of times that the state s is seen in the corpus. Finally, define c(s ; x)

to be the number of times state s is seen paired with observation x in the corpus: for

example, c(N ; dog) would be the number of times the word dog is seen paired

with the tag N.

Given these definitions, the maximum-likelihood estimates are

q(s|u, v) = c(u, v, s) / c(u, v)

and

e(x|s) = c(s ; x) / c(s)

For example, we would have the estimates

q(N|V, D) = c(V, D, N) / c(V, D)

and

e(dog|N) = c(N ; dog) / c(N)

Thus estimating the parameters of the model is simple: we just read off counts

from the training corpus, and then compute the maximum-likelihood estimates as

described above.
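As a sketch, the counts and maximum-likelihood estimates can be computed as follows; the two-sentence tagged corpus is invented for illustration, but with real data the counts would be read off in exactly the same way:

```python
from collections import Counter

# A hypothetical toy tagged corpus.
corpus = [
    (["the", "dog", "saw", "the", "cat"], ["D", "N", "V", "D", "N"]),
    (["the", "cat", "laughs"], ["D", "N", "V"]),
]

trigram, bigram = Counter(), Counter()
tag_count, emission = Counter(), Counter()

for words, tags in corpus:
    # Pad with the start symbols and STOP, as in the model definition.
    padded = ["*", "*"] + tags + ["STOP"]
    for i in range(2, len(padded)):
        trigram[tuple(padded[i - 2:i + 1])] += 1  # c(u, v, s)
        bigram[tuple(padded[i - 2:i])] += 1       # c(u, v)
    for word, tag in zip(words, tags):
        tag_count[tag] += 1                       # c(s)
        emission[(tag, word)] += 1                # c(s ; x)

def q(s, u, v):
    # q(s|u, v) = c(u, v, s) / c(u, v)
    return trigram[(u, v, s)] / bigram[(u, v)]

def e(x, s):
    # e(x|s) = c(s ; x) / c(s)
    return emission[(s, x)] / tag_count[s]

print(q("N", "*", "D"))  # both sentences begin * * D N, so 1.0
print(e("dog", "N"))     # dog occurs once among the three N tokens
```

Estimation really is just counting: no iterative optimization is needed.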

4.4 Decoding with HMMs: the Viterbi Algorithm

We now turn to the problem of finding the most likely tag sequence for an input

sentence x1 . . . xn. This is the problem of finding

arg max_{y1 . . . yn+1} p(x1 . . . xn, y1 . . . yn+1)


where the arg max is taken over all sequences y1 . . . yn+1 such that yi ∈ K for

i = 1 . . . n, and yn+1 = STOP. We assume that p again takes the form

p(x1 . . . xn, y1 . . . yn+1) = ∏_{i=1}^{n+1} q(yi|yi−2, yi−1) × ∏_{i=1}^{n} e(xi|yi)   (8)

Recall that we have assumed in this definition that y0 = y−1 = *, and yn+1 =

STOP.

The naive, brute force method would be to simply enumerate all possible tag

sequences y1 . . . yn+1, score them under the function p, and take the highest scor-

ing sequence. For example, given the input sentence

the dog barks

and assuming that the set of possible tags is K = {D, N, V}, we would consider all

possible tag sequences:

D D D STOP

D D N STOP

D D V STOP

D N D STOP

D N N STOP

D N V STOP

. . .

and so on. There are 3^3 = 27 possible sequences in this case.

For longer sentences, however, this method will be hopelessly inefficient. For

an input sentence of length n, there are |K|n possible tag sequences. The expo-

nential growth with respect to the length n means that for any reasonable length

sentence, brute-force search will not be tractable.
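For small examples, though, the brute-force method is easy to write down, and it gives a reference against which the dynamic-programming algorithm can be checked. The sketch below uses hypothetical toy parameters; any trigram or emission not listed gets probability 0:

```python
from itertools import product

# Brute-force decoding: enumerate all |K|^n tag sequences and score
# each one under p. Toy parameters, invented for illustration.
K = ["D", "N", "V"]
q = {("*", "*", "D"): 0.9, ("*", "D", "N"): 0.8,
     ("D", "N", "V"): 0.7, ("N", "V", "STOP"): 0.9}
e = {("the", "D"): 0.6, ("dog", "N"): 0.5, ("barks", "V"): 0.4}

def score(words, tags):
    # p(x1...xn, y1...yn+1) with y0 = y-1 = * and yn+1 = STOP.
    padded = ["*", "*"] + list(tags) + ["STOP"]
    p = 1.0
    for i in range(len(tags) + 1):
        p *= q.get(tuple(padded[i:i + 3]), 0.0)
    for word, tag in zip(words, tags):
        p *= e.get((word, tag), 0.0)
    return p

words = ["the", "dog", "barks"]
# |K|^n = 3^3 = 27 candidate sequences.
best = max(product(K, repeat=len(words)), key=lambda t: score(words, t))
print(best)  # ('D', 'N', 'V')
```

With |K| around 50 tags and a 20-word sentence this loop would visit 50^20 sequences, which is why the Viterbi algorithm of the next section is needed.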

4.4.1 The Basic Algorithm

Instead, we will see that we can efficiently find the highest probability tag se-

quence, using a dynamic programming algorithm that is often called the Viterbi

algorithm. The input to the algorithm is a sentence x1 . . . xn. Given this sentence,

for any k ∈ {1 . . . n}, for any sequence y1 . . . yk such that yi ∈ K for i = 1 . . . k

we define the function

r(y1 . . . yk) = ∏_{i=1}^{k} q(yi|yi−2, yi−1) × ∏_{i=1}^{k} e(xi|yi)   (9)


This is simply a truncated version of the definition of p in Eq. 8, where we just

consider the first k terms. In particular, note that

p(x1 . . . xn, y1 . . . yn+1) = r(y1 . . . yn)× q(yn+1|yn−1, yn)

= r(y1 . . . yn)× q(STOP|yn−1, yn) (10)

Next, for any k ∈ {1 . . . n}, for any u ∈ K, v ∈ K, define S(k, u, v) to

be the set of sequences y1 . . . yk such that yk−1 = u, yk = v, and yi ∈ K for

i = 1 . . . k. Thus S(k, u, v) is the set of all tag sequences of length k, which end

in the tag bigram (u, v). Define

π(k, u, v) = max_{〈y1 . . . yk〉∈S(k,u,v)} r(y1 . . . yk)   (11)

We now observe that we can calculate the π(k, u, v) values for all (k, u, v)

efficiently, as follows. First, as a base case define

π(0, *, *) = 1

and

π(0, u, v) = 0

if u ≠ * or v ≠ *. These definitions just reflect the fact that we always assume that

y0 = y−1 = *.

Next, we give the recursive definition.

Proposition 1 For any k ∈ {1 . . . n}, for any u ∈ K and v ∈ K, we can use the

following recursive definition:

π(k, u, v) = max_{w∈K} (π(k − 1, w, u) × q(v|w, u) × e(xk|v))   (12)

This definition is recursive because it makes use of the π(k − 1, w, u)

values computed for shorter sequences. This definition will be key to our dynamic

programming algorithm.

How can we justify this recurrence? Recall that π(k, u, v) is the highest prob-

ability for any sequence y1 . . . yk ending in the bigram (u, v). Any such sequence

must have yk−2 = w for some state w. The highest probability for any sequence

of length k − 1 ending in the bigram (w, u) is π(k − 1, w, u), hence the highest

probability for any sequence of length k ending in the trigram (w, u, v) must be

π(k − 1, w, u) × q(v|w, u) × e(xk|v)

In Eq. 12 we simply search over all possible values for w, and return the maximum.

Our second claim is the following:


Input: a sentence x1 . . . xn, parameters q(s|u, v) and e(x|s).

Initialization: Set π(0, *, *) = 1, and π(0, u, v) = 0 for all (u, v) such that u ≠ *

or v ≠ *.

Algorithm:

• For k = 1 . . . n,

– For u ∈ K, v ∈ K,

π(k, u, v) = max_{w∈K} (π(k − 1, w, u) × q(v|w, u) × e(xk|v))

• Return max_{u∈K, v∈K} (π(n, u, v) × q(STOP|u, v))

Figure 1: The basic Viterbi Algorithm.

Proposition 2

max_{y1 . . . yn+1} p(x1 . . . xn, y1 . . . yn+1) = max_{u∈K, v∈K} (π(n, u, v) × q(STOP|u, v))   (13)

This follows directly from the identity in Eq. 10.

Figure 1 shows an algorithm that puts these ideas together. The algorithm takes

a sentence x1 . . . xn as input, and returns

max_{y1 . . . yn+1} p(x1 . . . xn, y1 . . . yn+1)

as its output. The algorithm first fills in the π(k, u, v) values using the recursive

definition. It then uses the identity in Eq. 13 to calculate the highest probability for

any sequence.

The running time for the algorithm is O(n|K|^3), hence it is linear in the length

of the sequence, and cubic in the number of tags.
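The algorithm in Figure 1 can be sketched as follows. The `tags_at` helper, which substitutes the start symbol * for tag positions before the first word, is an implementation detail not spelled out in the figure, and the parameters are hypothetical toy values:

```python
def viterbi_max(words, K, q, e):
    """Return max over y1...yn+1 of p(x1...xn, y1...yn+1), as in Figure 1."""
    n = len(words)

    def tags_at(k):
        # Positions 0 and -1 carry the start symbol *.
        return ["*"] if k < 1 else K

    # pi[(k, u, v)]: highest probability of any length-k tag sequence
    # ending in the bigram (u, v).
    pi = {(0, "*", "*"): 1.0}
    for k in range(1, n + 1):
        for u in tags_at(k - 1):
            for v in K:
                pi[(k, u, v)] = max(
                    pi.get((k - 1, w, u), 0.0)
                    * q.get((w, u, v), 0.0)
                    * e.get((words[k - 1], v), 0.0)
                    for w in tags_at(k - 2)
                )
    # Eq. 13: fold in the STOP transition.
    return max(
        pi[(n, u, v)] * q.get((u, v, "STOP"), 0.0)
        for u in tags_at(n - 1)
        for v in K
    )

# Hypothetical toy parameters; unseen entries default to probability 0.
q = {("*", "*", "D"): 0.9, ("*", "D", "N"): 0.8,
     ("D", "N", "V"): 0.7, ("N", "V", "STOP"): 0.9}
e = {("the", "D"): 0.6, ("dog", "N"): 0.5, ("barks", "V"): 0.4}
print(viterbi_max(["the", "dog", "barks"], ["D", "N", "V"], q, e))
```

The three nested loops over (u, v, w) at each of the n positions give the O(n|K|^3) running time.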

4.4.2 The Viterbi Algorithm with Backpointers

The algorithm we have just described takes a sentence x1 . . . xn as input, and re-

turns

max_{y1 . . . yn+1} p(x1 . . . xn, y1 . . . yn+1)

as its output. However we’d really like an algorithm that returned the highest prob-

ability sequence, that is, an algorithm that returns

arg max_{y1 . . . yn+1} p(x1 . . . xn, y1 . . . yn+1)


Input: a sentence x1 . . . xn, parameters q(s|u, v) and e(x|s).

Initialization: Set π(0, *, *) = 1, and π(0, u, v) = 0 for all (u, v) such that u ≠ *

or v ≠ *.

Algorithm:

• For k = 1 . . . n,

– For u ∈ K, v ∈ K,

π(k, u, v) = max_{w∈K} (π(k − 1, w, u) × q(v|w, u) × e(xk|v))

bp(k, u, v) = arg max_{w∈K} (π(k − 1, w, u) × q(v|w, u) × e(xk|v))

• Set (yn−1, yn) = arg max_{(u,v)} (π(n, u, v) × q(STOP|u, v))

• For k = (n− 2) . . . 1,

yk = bp(k + 2, yk+1, yk+2)

• Return the tag sequence y1 . . . yn

Figure 2: The Viterbi Algorithm with backpointers.

for any input sentence x1 . . . xn.

Figure 2 shows a modified algorithm that achieves this goal. The key step

is to store backpointer values bp(k, u, v) at each step, which record the previous

state w which leads to the highest scoring sequence ending in (u, v) at position k

(the use of backpointers such as these is very common in dynamic programming

methods). At the end of the algorithm, we unravel the backpointers to find the

highest probability sequence, and then return this sequence. The algorithm again

runs in O(n|K|^3) time.
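A sketch of the backpointer version, again with hypothetical toy parameters (the `tags_at` helper handling the * start symbol is an implementation detail):

```python
# Hypothetical toy parameters; unseen entries default to probability 0.
q = {("*", "*", "D"): 0.9, ("*", "D", "N"): 0.8,
     ("D", "N", "V"): 0.7, ("N", "V", "STOP"): 0.9}
e = {("the", "D"): 0.6, ("dog", "N"): 0.5, ("barks", "V"): 0.4}

def viterbi_decode(words, K, q, e):
    """Return the highest probability tag sequence, as in Figure 2."""
    n = len(words)

    def tags_at(k):
        # Positions 0 and -1 carry the start symbol *.
        return ["*"] if k < 1 else K

    pi = {(0, "*", "*"): 1.0}
    bp = {}
    for k in range(1, n + 1):
        for u in tags_at(k - 1):
            for v in K:
                best_w, best_p = None, -1.0
                for w in tags_at(k - 2):
                    p = (pi.get((k - 1, w, u), 0.0)
                         * q.get((w, u, v), 0.0)
                         * e.get((words[k - 1], v), 0.0))
                    if p > best_p:
                        best_w, best_p = w, p
                pi[(k, u, v)] = best_p
                bp[(k, u, v)] = best_w
    # Choose the final tag bigram (y_{n-1}, y_n) ...
    u, v = max(
        ((u, v) for u in tags_at(n - 1) for v in K),
        key=lambda uv: pi[(n,) + uv] * q.get(uv + ("STOP",), 0.0),
    )
    tags = [u, v]
    # ... then unravel backpointers: y_k = bp(k + 2, y_{k+1}, y_{k+2}).
    for k in range(n - 2, 0, -1):
        tags.insert(0, bp[(k + 2, tags[0], tags[1])])
    return tags[-n:]  # drop any * padding for very short sentences

print(viterbi_decode(["the", "dog", "barks"], ["D", "N", "V"], q, e))
```

Only the bp table and the final unraveling loop distinguish this from the basic algorithm; the running time is unchanged.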

5 Summary

We’ve covered a fair amount of material in this note, but the end result is fairly

straightforward: we have derived a complete method for learning a tagger from

a training corpus, and for applying it to new sentences. The main points were as

follows:

• A trigram HMM has parameters q(s|u, v) and e(x|s), and defines the joint


probability of any sentence x1 . . . xn paired with a tag sequence y1 . . . yn+1

(where yn+1 = STOP) as

p(x1 . . . xn, y1 . . . yn+1) = ∏_{i=1}^{n+1} q(yi|yi−2, yi−1) × ∏_{i=1}^{n} e(xi|yi)

• Given a training corpus from which we can derive counts, the maximum-

likelihood estimates for the parameters are

q(s|u, v) = c(u, v, s) / c(u, v)

and

e(x|s) = c(s ; x) / c(s)

• Given a new sentence x1 . . . xn, and parameters q and e that we have es-

timated from a training corpus, we can find the highest probability tag se-

quence for x1 . . . xn using the algorithm in figure 2 (the Viterbi algorithm).
