# 程序代写代做代考 Hidden Markov Mode data structure algorithm chain CS447: Natural Language Processing

CS447: Natural Language Processing

http://courses.engr.illinois.edu/cs447

Julia Hockenmaier

juliahmr@illinois.edu

3324 Siebel Center

Lecture 6:

HMM algorithms

CS447: Natural Language Processing (J. Hockenmaier)

Recap: Statistical POS tagging

she1 promised2 to3 back4 the5 bill6

w = w1 w2 w3 w4 w5 w6

t = t1 t2 t3 t4 t5 t6

PRP1 VBD2 TO3 VB4 DT5 NN6

What is the most likely sequence of tags t

for the given sequence of words w ?

�2

CS447: Natural Language Processing (J. Hockenmaier)

Statistical POS tagging with HMMs

What is the most likely sequence of tags t

for the given sequence of words w ?

Hidden Markov Models define P(t) and P(w|t) as:

Transition probabilities:

P(t) = ∏i P(ti | ti−1) [bigram HMM]

or P(t) = ∏i P(ti | ti−1, ti−2) [trigram HMM]

Emission probabilities:

P(w | t) = ∏i P(wi | ti)

�3

Estimate argmaxt P(t|w) directly (in a conditional model)

or use Bayes’ Rule (and a generative model):

argmax

t

P(t|w) = argmax

t

P(t,w)

P(w)

= argmax

t

P(t,w)

= argmax

t

P(t)P(w|t)

Estimate argmaxt P(t|w) directly (in a conditional model)

or use Bayes’ Rule (and a generative model):

argmax

t

P(t|w) = argmax

t

P(t,w)

P(w)

= argmax

t

P(t,w)

= argmax

t

P(t)P(w|t)

CS447: Natural Language Processing (J. Hockenmaier)

HMMs as probabilistic automata

DT

JJ

NN

0.7

0.3

0.4

0.6

0.55

VBZ

0.45

0.5

the

0.2

a

0.1every

0.1

some 0.1

no

0.01

able

…

…

0.003

zealous

…

…

0.002

zone

0.00024

abandonment

0.001

yields

…

…

0.02

acts

A (bigram) HMM defines

Transition probabilities:

P( ti | ti-1)

Emission probabilities:

P( wi | ti )

�4

CS447: Natural Language Processing (J. Hockenmaier)

HMMs as probabilistic automata

Transition probabilities P(ti | ti−1):

Probability of going from one state (ti−1) of the

automaton to the next (ti)

“Markov model”: We’re making a Markov [independence]

assumption for how to move between states of the automaton

Emission probabilities P(wi | ti):

Probability of emitting a symbol (wi) in a given state of

the automaton (ti)

“Hidden Markov model”: The data that we see (at test time)

consists only of the words w, and we find tags for w by

searching for the most likely sequence of (hidden) states of the

automaton (the tags t) that generated the data w

�5 CS498JH: Introduction to NLP

An example HMM

�6

D N V A .

D 0.8 0.2

N 0.7 0.3

V 0.6 0.4

A 0.8 0.2

.

Transition Matrix A

the man ball throws sees red blue .

D 1

N 0.7 0.3

V 0.6 0.4

A 0.8 0.2

. 1

Emission Matrix B

D N V A .

π 1

Initial state vector π

D N

V

A

.

CS447: Natural Language Processing (J. Hockenmaier)

!

Using HMMs for tagging

-The input to an HMM tagger is a sequence of words, w.

The output is the most likely sequence of tags, t, for w.

-For the underlying HMM model, w is a sequence of output

symbols, and t is the most likely sequence of states (in the

Markov chain) that generated w.

�7

argmax

t

P( t⇤⇥�⌅

Outputtagger

| w⇤⇥�⌅

Inputtagger

) = argmax

t

P(w, t)

P(w)

= argmax

t

P(w, t)

= argmax

t

P( w⇤⇥�⌅

OutputHMM

| t⇤⇥�⌅

StatesHMM

)P( t⇤⇥�⌅

StatesHMM

)

CS447: Natural Language Processing (J. Hockenmaier)

How would the

automaton for a trigram

HMM with transition probabilities

P(ti | ti-2ti-1) look like?

What about unigrams

or n-grams?

???

???

�8

CS447: Natural Language Processing (J. Hockenmaier)

DT

JJ

NN VBZq0

Encoding a trigram model as FSA

JJ_DT

NN_DT

JJ

NN VBZDT

DT_

JJ_JJ

NN_JJ

VBZ_NN

NN_NN

Bigram model:

States = Tag Unigrams

Trigram model:

States = Tag Bigrams

�9 CS447: Natural Language Processing (J. Hockenmaier)

Trigram HMMs

In a trigram HMM tagger, each state qi corresponds

to a POS tag bigram (the tags of the current and

preceding word): qi=tjtk

Emission probabilities depend only on the current

POS tag: States tjtk and titk use the same emission

probabilities P(wi | tk)

�10

CS447: Natural Language Processing (J. Hockenmaier)

Building an HMM tagger

To build an HMM tagger, we have to:

-Train the model, i.e. estimate its parameters

(the transition and emission probabilities)

Easy case: we have a corpus labeled with POS tags

(supervised learning)

-Define and implement a tagging algorithm that

finds the best tag sequence t* for each input

sentence w:

t* = argmaxt P(t)P(w | t)

�11 CS498JH: Introduction to NLP

Learning an HMM

Where do we get the transition probabilities P(tj | ti)

(matrix A) and the emission probabilities P(wj | ti)

(matrix B) from?

Case 1: We have a POS-tagged corpus.

– This is learning from labeled data, aka “supervised learning”

Case 2: We have a raw (untagged) corpus and a tagset.

– This is learning from unlabeled data, aka “unsupervised learning”

�12

Pierre_NNP Vinken_NNP ,_, 61_CD years_NNS

old_JJ ,_, will_MD join_VB the_DT board_NN

as_IN a_DT nonexecutive_JJ director_NN Nov._NNP

29_CD ._.

Pierre Vinken , 61 years old , will

join the board as a nonexecutive

director Nov. 29 .

Tagset:

NNP: proper noun

CD: numeral,

JJ: adjective,…

CS498JH: Introduction to NLP

We count how often we see titj and wj_ti etc. in the data

(use relative frequency estimates):

Learning the transition probabilities:

Learning the emission probabilities:

We might use some smoothing, but this is pretty trivial…

Learning an HMM from labeled data

�13

P (tj |ti) =

C(titj)

C(ti)

Pierre_NNP Vinken_NNP ,_, 61_CD years_NNS

old_JJ ,_, will_MD join_VB the_DT board_NN

as_IN a_DT nonexecutive_JJ director_NN Nov._NNP

29_CD ._.

P (wj |ti) =

C(wj ti)

C(ti)

CS447: Natural Language Processing (J. Hockenmaier)

Learning an HMM from unlabeled data

We can’t count anymore.

We have to guess how often we’d expect to see titj

etc. in our data set. Call this expected count C(…)

-Our estimate for the transition probabilities:

-Our estimate for the emission probabilities:

-We will talk about how to obtain these counts on Friday

�14

Pierre Vinken , 61 years old , will

join the board as a nonexecutive

director Nov. 29 .

Tagset:

NNP: proper noun

CD: numeral,

JJ: adjective,…

P̂ (tj |ti) =

�C(titj)⇥

�C(ti)⇥

P̂ (wj |ti) =

�C(wj ti)⇥

�C(ti)⇥

CS447: Natural Language Processing (J. Hockenmaier)

Finding the best tag sequence

The number of possible tag sequences is

exponential in the length of the input sentence:

Each word can have up to T tags.

There are N words.

There are up to TN possible tag sequences.

We cannot enumerate all TN possible tag sequences.

But we can exploit the independence assumptions

in the HMM to define an efficient algorithm that

returns the tag sequence with the highest probability

�15 CS447: Natural Language Processing (J. Hockenmaier)

Dynamic

Programming for

HMMs

�16

CS498JH: Introduction to NLP

The three basic problems for HMMs

We observe an output sequence w=w1…wN:

w=“she promised to back the bill”

Problem I (Likelihood): find P(w | λ )

Given an HMM λ = (A, B, π), compute the likelihood

of the observed output, P(w | λ )

Problem II (Decoding): find Q=q1..qT

Given an HMM λ = (A, B, π), what is the most likely sequence of

states Q=q1..qN ≈ t1…tN to generate w?

Problem III (Estimation): find argmax λ P(w | λ )

Find the parameters A, B, π which maximize P(w | λ)

�17 CS498JH: Introduction to NLP

How can we solve these problems?

I. Likelihood of the input w:

Compute P(w | λ ) for the input w and HMM λ

II. Decoding (= tagging) the input w:

Find the best tags t*=argmaxt P(t | w,λ) for the input w and HMM λ

III. Estimation (= learning the model):

Find the best model parameters λ*=argmax λ P(t, w | λ)

for the training data w

These look like hard problems: With T tags, every input string

w1…n has Tn possible tag sequences

Can we find efficient (polynomial-time) algorithms?

�18

CS447: Natural Language Processing (J. Hockenmaier)

Dynamic programming

Dynamic programming is a general technique to solve

certain complex search problems by memoization

1.) Recursively decompose the large search problem

into smaller subproblems that can be solved efficiently

–There is only a polynomial number of subproblems.

2.) Store (memoize) the solution of each subproblem

in a common data structure

–Processing this data structure takes polynomial time

�19 CS498JH: Introduction to NLP

Dynamic programming algorithms for HMMs

I. Likelihood of the input:

Compute P(w| λ ) for an input sentence w and HMM λ

⇒ Forward algorithm

II. Decoding (=tagging) the input:

Find best tags t*=argmaxt P(t | w,λ) for an input sentence w and HMM λ

⇒ Viterbi algorithm

III. Estimation (=learning the model):

Find best model parameters λ*=argmax λ P(t, w | λ) for training data w

⇒ Forward-Backward algorithm

�20

CS447: Natural Language Processing (J. Hockenmaier)

Tags

Bookkeeping: the trellis

We use a N×T table (“trellis”) to keep track of the HMM.

The HMM can assign one of the T tags to each of the N words.

w(1) w(2) … w(i-1) w(i) w(i+1) … w(N-1) w(N)

t1

…

tj

…

tT

Words (“time steps”)

�21

word w(i) has tag tj

CS447: Natural Language Processing (J. Hockenmaier)

One tag sequence = one path through trellis

w(1) w(2) … w(i-1) w(i) w(i+1) … w(N-1) w(N)

q1

…

qj

…

qT

�22

One path through the trellis = one tag sequence

CS447: Natural Language Processing (J. Hockenmaier)

Computing P(t,w) for one tag sequence

w(1) w(2) … w(i-1) w(i) w(i+1) … w(N-1) w(N)

q1

…

qj

…

qT

P(w(1)|q1)

P(w(2) | qj)

P(w(i) | qi)

P(t(1)=q1)

P(qj | q1)

P(qi | q…)

P(q..| qi)

P(w(i+1) | qi+1)

P(w(N) | qj)

P(qj | q..)

�23

One path through the trellis = one tag sequence

To get its probability, we just multiply the initial state

and all emission and transition probabilities

CS447: Natural Language Processing (J. Hockenmaier)

The Viterbi algorithm

�24

CS447: Natural Language Processing (J. Hockenmaier)

Finding the best tag sequence

The number of possible tag sequences is exponential

in the length of the input sentence:

Each word can have up to T tags.

There are N words.

There are up to TN possible tag sequences.

We cannot enumerate all TN possible tag sequences.

But we can exploit the independence assumptions

in the HMM to define an efficient algorithm that

returns the tag sequence with the highest probability

in linear (O(N)) time.

�25 CS447: Natural Language Processing (J. Hockenmaier)

Notation: ti/wi vs t(i)/w(i)

To make the distinction between the i-th word/tag in

the vocabulary/tag set and the i-th word/tag in the

sentence clear:

use superscript notation w(i) for the i-th token

in the sequence

and subscript notation wi for the i-th type

in the inventory (tagset/vocabulary):

�26

CS447: Natural Language Processing (J. Hockenmaier)

HMM decoding

We observe a sentence w = w(1)…w(N)

w= “she promised to back the bill”

We want to use an HMM tagger to find its POS tags t

t* = argmaxt P(w, t)

= argmaxt P(t(1))·P(w(1)| t(1))·P(t(2)| t(1))·…·P(w(N)| t(N))

To do this efficiently, we will use

dynamic programming to exploit

the independence assumptions

in the HMM.

�27 CS447: Natural Language Processing (J. Hockenmaier)

The Viterbi algorithm

A dynamic programming algorithm which finds the

best (=most probable) tag sequence t* for an input

sentence w: t* = argmaxt P(w | t)P(t)

Complexity: linear in the sentence length.

With a bigram HMM, Viterbi runs in O(T2N) steps

for an input sentence with N words and a tag set of T tags.

The independence assumptions of the HMM tell us

how to break up the big search problem

(find t* = argmaxt P(w | t)P(t)) into smaller subproblems.

The data structure used to store the solution of these

subproblems is the trellis.

�28

CS447: Natural Language Processing (J. Hockenmaier)

HMM independences

1. Emissions depend only on the current tag:

… P(w(i) = man | t(i) = NN )…

We only have to multiply the emission probability

P(w(i) | tj ) with the probability of the best tag

sequence that gets us to t(i) = tj

�29 CS447: Natural Language Processing (J. Hockenmaier)

HMM independences

2. Transition probabilities to the current tag t(i)

depend only on the previous tag t(i−1):

… P( t(i) = NN | t(i−1) = DT )

-Assume the probability of the best tag sequence

for the prefix w(1)…w(i−1) that ends in the tag t(i−1) = tj

is known, and stored in a variable max[i−1][j].

-To compute the probability of the best tag sequence

for w(1)…w(i-1)w(i) that ends in the tags t(i-1)t(i) = tjtk,

multiply max[i−1][j] with P(tk | tj) and P(w(i) | tk)

-To compute the probability of the best tag sequence

for w(1)…w(i-1)w(i) that ends in t(i) = tk ,

consider all possible tags t(i-1) = tj for the preceding word:

max[i][k] = maxj ( max[i−1][j] P(tk | tj) )P(w(i) | tk)

�30

CS447: Natural Language Processing (J. Hockenmaier)

HMM independences

3. The current tag also determines the transition

probability of the next tag:

… P( t(i+1) = VBZ | t(i) = NN )…

We cannot fix the current tag t(i) based on the

probability of getting to t(i) (and producing w(i))

We have to wait until we have reached the last word

in the sequence.

Then, we can trace back to get the best tag sequence

for the entire sentence.

�31 CS447: Natural Language Processing (J. Hockenmaier)

Using the trellis to find t*

Let trellis[i][j] (word w(j) and tag tj) store the probability

of the best tag sequence for w(1)…w(i) that ends in tj

trellis[i][j] = max P(w(1)…w(i), t(1)…, t(i) = tj )

We can recursively compute trellis[i][j]

from the entries in the previous column trellis[i-1][j]

trellis[i][j] = P(w(i)| tj) ⋅Maxk( trellis[i-1][k]P(tj | tk) )

At the end of the sentence, we pick the highest

scoring entry in the last column of the trellis

�32

CS447: Natural Language Processing (J. Hockenmaier)

At any given cell

-For each cell in the preceding column: multiply its entry with

the transition probability to the current cell.

-Keep a single backpointer to the best (highest scoring) cell in

the preceding column

-Multiply this score with the emission probability of the current

word

�33

w(n-1) w(n)

t1 P(w(1..n-1), t(n-1)=t1)

… …

ti P(w(1..n-1), t(n-1)=ti)

… …

tN P(w(1..n-1), tn-1=ti)

P(ti |t1)

P(ti |ti)

P(ti

|tN

)

trellis[n][i] =

P(w(n)|ti)

⋅Max(trellis[n-1][j]P(ti |ti))

CS447: Natural Language Processing (J. Hockenmaier)

At the end of the sentence

In the last column (i.e. at the end of the sentence)

pick the cell with the highest entry, and trace back the

backpointers to the first word in the sentence.

�34

CS498JH: Introduction to NLP �35

w(1) w(2) … w(i-1) w(i) w(i+1) … w(N-1) w(N)

q1

…

qj

…

qT

Retrieving t* = argmaxt P(t,w)

By keeping one backpointer from each cell to the cell

in the previous column that yields the highest probability,

we can retrieve the most likely tag sequence when we’re done.

CS447: Natural Language Processing (J. Hockenmaier)

The Viterbi algorithm

Viterbi( w1…n){

for t (1…T) // INITIALIZATION: first column

trellis[1][t].viterbi = p_init[t] × p_emit[t][w1]

for i (2…n){ // RECURSION: every other column

for t (1….T){

trellis[i][t] = 0

for t’ (1…T){

tmp = trellis[i-1][t’].viterbi × p_trans[t’][t]

if (tmp > trellis[i][t].viterbi){

trellis[i][t].viterbi = tmp

trellis[i][t].backpointer = t’}}

trellis[i][t].viterbi ×= p_emit[t][wi]}}

t_max = NULL, vit_max = 0; // FINISH: find the best cell in the last column

for t (1…T)

if (trellis[n][t].vit > vit_max){t_max = t; vit_max = trellis[n][t].value }

return unpack(n, t_max);

}

�36

CS447: Natural Language Processing (J. Hockenmaier)

Unpacking the trellis

unpack(n, t){

i = n;

tags = new array[n+1];

while (i > 0){

tags[i] = t;

t = trellis[i][t].backpointer;

i–;

}

return tags;

}

�37 CS447: Natural Language Processing (J. Hockenmaier)

Today’s key concepts

HMM taggers

Learning HMMs from labeled text

Viterbi for HMMs

Dynamic programming

Independence assumptions in HMMs

The trellis

�38

CS447: Natural Language Processing (J. Hockenmaier)

Supplementary material:

Viterbi for Trigram

HMMs

�39 CS447: Natural Language Processing (J. Hockenmaier)

Trigram HMMs

In a Trigram HMM, transition probabilities are of the form:

P(t(i) = ti | t(i−1) = tj, t(i−2) = tk )

The i-th tag in the sequence influences the probabilities

of the (i+1)-th tag and the (i+2)-th tag:

… P(t(i+1) | t(i), t(i−1)) … P(t(i+2) | t(i+1), t(i))

Hence, each row in the trellis for a trigram HMM has to

correspond to a pair of tags — the current and the preceding tag:

(abusing notation)

trellis[i]⟨j,k⟩: word w(i) has tag tj, word w(i−1) has tag tk

The trellis now has T2 rows.

But we still need to consider only T transitions into each cell,

since the current word’s tag is the next word’s preceding tag:

Transitions are only possible from trellis[i]⟨j,k⟩ to trellis[i+1]⟨l,j⟩

�40