COMP4121 Lecture Notes

The Hidden Markov Models and the Viterbi Algorithm

THE UNIVERSITY OF NEW SOUTH WALES

School of Computer Science and Engineering The University of Wales Sydney 2052, Australia

Hidden Markov Models and the Viterbi Algorithm

We start with an example, the problem of speech recognition.

A phoneme is an “elementary particle” of human speech, so to speak. Spoken words of a particular language are understood by a listener by parsing them into a sequence of phonemes. There is no general consensus what exactly the phonemes are, even for a specific language such as English. For all practical purposes we can think of them as the basic sounds of a language, which have the property that replacing one phoneme in a word with a different phoneme would change the meaning of the word. For example, if you replace the last phoneme /s/ in ‘kiss” with phoneme /l/ you obtain a different word, “kill”. However, we identify variants of the same sound as a single phoneme, for example the “k” sound in words “cat”, “kit” and “skill” are considered all to be slight variants of a single basic phoneme /k/, and replacing one such variant with another would not cause the listener to mistakenly understand the word being pronounced.

So let us now consider a sentence being spoken, for example “I love algorithms”. We understand this sentence by breaking the speech into constituent phonemes. I tried to reproduce the International Phonetic Alphabet (IPA) transcription in LaTeX, but could only get an approximation; so the utterance of “I love algorithms” would look approximately like this: aI lav algorI∂mz .

Humans are very good at understanding speech, but machines got good at that only quite recently. Of course, machines do not have a direct access to the sequence of the phonemes spoken; they only have access to the resulting (sampled) sound waveforms, and from such waveforms they have to “decide” (if this is the right word for what speech recognition machines do) what phonemes were spoken. In order to do that a machine has to:

1. partition the waveform into disjoint pieces, so that each piece should correspond to a single phoneme (this is tricky in itself, figuring out where the boundaries between the waveforms of adjacent phonemes are);

2. then it has to do feature extraction from each piece, thus producing features vectors f⃗[1], f⃗[2], . . . (this is usually done by taking the DFT of each piece and then looking for features in the frequency domain);

3. finally, it has to decide for each feature vector f⃗[i] to which phoneme it corresponds, i.e., it has to decide what phonemes would produce as their signatures each of the feature vectors f⃗[i].

Note that some phonemes are only slightly different and they also vary a lot from speaker to speaker, so it is hard to determine with absolute certainty from a feature vector f⃗[i] what phoneme “caused” a waveform that resulted in that particular feature vector.

!”!#!$%&’!($)%*+,-./0!

! !!!!!!!!!!!!!!!!!!!1(#1!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!1!$1!(1!&1!!!!!!!!!!!!!!!!!!!!!!!1!(1!$1!)1!%1!*1!+1!21!.1!31!! !

One can model human speech as a Markov chain, whose states are the phonemes, and an utterance is an evolution of a Markov chain through such states, i.e., phonemes. This Markov chain is hidden from the listener

!!!!!!!!!!!!!!!!!!!!

! !!!!!!!!!!!!!!!!!!!!!4567!!!!!!!!!!!!!!!!!!!!!!!!!!!4587!4597!45:7!!!!!!!!!!!!!!!!!!45;74

(in our case a machine) which instead has access only to a corresponding sequence of manifestations caused by uttering these phonemes, namely the corresponding waveforms, processed into feature vectors:

“%!”! “#!”!

#$”! #$”!

#$”! #$”!

Figure 1: Hidden Markov Model (HMM)

So given a sequence of such manifestations f⃗[1], f⃗[2], . . . , f⃗[10] how can we decide what sequence of phonemes caused it, in a way that maximizes the likelihood that we are correct? This is what the Viterbi algorithm does. It is a dynamic programming algorithm which produces a sequence of states that are most likely to have as a consequence the sequence of “observations” f⃗[1], f⃗[2], . . . , f⃗[10]. The Viterbi algorithm has a huge range of ap- plications, including in speech recognition, also for decoding the transmitted convolutional codes in presence of noise in communication channels, which is used in digital telecommunications such as the CDMA mobile phone network and 802.11 wireless LANs; it is also used in many other fields such as bioinformatics and computational linguistics.

General setup. We are given a finite Markov chain, consisting of a set of its states S = {s1 , s2 , . . . , sK } with initial probabilities π1,π2,…,πK, where πi is the probability that the chain starts in state si, and its matrix M = (p(i,j) : 1 ≤ i,j ≤ K) of size K × K of transition probabilities p(i,j) = p(si → sj) (so if at an instant t the chain is in state si, then p(i, j) is the probability that at the next instant t + 1 the chain will be in state sj). We have no direct access to the states of such a Markov chain, but we do have a set of observables O = {o1,o2,…,oN} which are the manifestations of the states S = {s1,s2,…,sK} and the emission matrix E of size K × N where entry e(i, k) represents the probability that when the chain is in state si the observable outcome will be ok.1

Assume now that we are given a sequence of observable outcomes (y1, y2, . . . , yT ); the goal is to find the sequence (x1, x2, . . . , xT ) of states of the Markov chain for which the probability of such a sequence to be the cause of observing the sequence of outcomes (y1, y2, . . . , yT ) is the largest. In other words, we are looking for the most likely cause of the sequence of observations (y1, y2, . . . , yT ). In our speech recognition example we are given a sequence of feature vectors (f⃗[1],f⃗[2],…,f⃗[10]) extracted from the sound waveform and we have to find what sequence of phonemes (φ1 , φ2 , . . . , φ10 ) is the most likely cause of the sequence of the feature vectors (f⃗[1], f⃗[2], . . . , f⃗[10]).

1We will abuse the notation for technical convenience and sometimes write p(si,sj) and sometimes just p(i,j) and again sometimes p(si → sj) for the probability of the Markov chain transitioning from state pi into state pj. Similarly, sometimes we write e(i,j) and sometimes e(si,oj) for the probability that state si causes the observable outcome oj.

Theoretically, we could solve this problem by brute force: we could go through all possible sequences of states ⃗x = (x1,x2,…,xT) of the same length as the sequence of observables ⃗y = (y1,y2,…,yT) and compute the corresponding probabilities that each of such sequences might be the cause for the given sequence of observables ⃗y = (y1, y2, . . . , yT ) and then select the sequence ⃗x for which such probability is the largest. This could, theoretically, be done as follows. We first find the probability P(⃗x) that a sequence ⃗x occurred at all, as the following product:

P(⃗x)=π(x1)·p(x1 →x2)·p(x2 →x3)…p(xT−1 →xT)

because π(x1) is the probability that the starting state (phoneme in our example) is x1, times the probability p(x1 → x2) that system transitions from state x1 into state x2 and so fourth.

Next, we compute the probability P ∗(⃗x, ⃗y) that this particular sequence of states ⃗x = (x1, . . . , xT ) will cause the sequence of observable outcomes ⃗y = (y1,…,yT ), as the product of the probabilities e(x1,y1)·e(x2,y2)·…· e(xT , yT ) where each e(xi, yi) is the “emission” probability that state xi produces the observable manifestation yi.

Finally, the total probability P(⃗x,⃗y) that sequence ⃗x is the cause of the sequence of observable outcomes ⃗y is the product of the probabilities P(⃗x) that sequence ⃗x occurred at all and the probability P∗(⃗x,⃗y) that, if it did occur, it also caused the right observable outcome ⃗y, so in total P(⃗x,⃗y) = P(⃗x) · P∗(⃗x,⃗y).

So given the sequence of outcomes ⃗y = (y1, y2, . . . , yT ) observed, we could now simply pick the sequence of states ⃗x for which the value of P(⃗x,⃗y) is the largest. However, this is not a feasible algorithm: if the total number of states is K and the observed sequence is of length T then there would be in total KT many sequences to try, i.e., exponentially many which is no good for real time (or almost real time) algorithms such as speech recognition or decoding the convolutional codes used in mobile telephony. Instead, the Viterbi algorithm solves this problem in time O(T × K2) using dynamic programming.

The Viterbi Algorithm We solve all subproblems S(i, k) for every 1 ≤ i ≤ T and every 1 ≤ k ≤ K: Subproblem S(i, k): Find the sequence of states (x1, . . . xi) such that xi = sk and such that the probability

of observing the outcome (y1 , . . . , yi ) is maximal.

Thus, we solve subproblems for all truncations (x1 , . . . , xi ), 1 ≤ i ≤ T , but also insisting that the last state

xi of such a truncated subsequence be state sk.

Let us denote by L(i,k) the largest possible probability for which there exists a sequence (x1,x2,…,xi) ending with xi = sk causing the subsequence of observable outcomes (y1, y2, . . . , yi); such probabilities must satisfy the following recursion:

L(1,k)=πk ·e(sk,y1) L(i,k)=max(L(i−1,m)p(sm →sk)e(sk,yi));

σ(i,k)=argmax(L(i−1,m)p(sm →sk)e(sk,yi)).

Here σ(i, k) stores the value of m for which L(i − 1, m) p(sm → sk) e(sk, yi) is the largest which allows us to reconstruct the sequence of states which maximize the probabilities we are tracking. We now obtain the solution for our original problem by backtracking:

xT =argmax(L(T,m)) m∈S

xi−1 =σ(i,xi), i=T,T −1,…,2.

A Python code for this algorithm can be found at, https://en.wikipedia.org/wiki/Viterbi_algorithm, together with some examples and even animations. Please take a look to make sure you learn this most im- portant algorithm with so many diverse applications. Here we present a simple example which should help you understand the Viterbi Algorithm.

Exercise: In California there are twice as many raccoons as possums. Having gotten a job with Google, you are in California observing a back yard. It is dusk, so the probability that you think you saw a raccoon when you are actually looking at a possum at a distance is 1/3; the probability that you think you saw a possum while you are actually looking at a raccoon at a distance is 1/4. Raccoons move in packs; so if a raccoon comes to your back yard the probability that the next animal to follow will also be a raccoon is 4/5. Possums are solitary animals, so if a possum comes to your back yard, this does not impact the probabilities what the next animal to come will be (a possum or a raccoon, but recall in California there are twice as many raccoons as possums!) You believe that you saw four animals coming in the following order: a raccoon, a possum, a possum, a raccoon

(rppr). Given such a sequence of observations, what actual sequence of animals is most likely to cause such a sequence of your observations?

• Solution:

• probabilities of initial states: π(R) = 2/3; π(P ) = 1/3.

• transition probabilities:

P(R→R)=4/5;P(R→P)=1/5;P(P →P)=1/3;P(P →R)=2/3.

• emission probabilities:

O(r|P ) = 1/3; O(p|P ) = 2/3; O(p|R) = 1/4; O(r|R) = 3/4.

3/4 R 4/5 R 4/5 R

3/4 1/4 3/4 1/4

Observed emissions: r p p r

P 1/3 P 1/3 P 1/3 P

2/3 1/3 2/3 1/3 2/3 1/3 2/3

1/3 rprprprp

3/4 R 4/5 R 4/5 R

3/4 1/4 3/4 1/4

Observed emissions: r p p r

P 1/3 P 1/3 P 1/3 P

2/3 1/3 2/3 1/3 2/3 1/3 2/3

1/3 rprprprp

• The sequence of observations is rppr.

• probabilities of initial states: π(R) = 2/3; π(P ) = 1/3.

• transition probabilities:

P(R→R)=4/5;P(R→P)=1/5;P(P →P)=1/3;P(P →R)=2/3.

• emission probabilities:

E(r|P) = 1/3; E(p|P) = 2/3; E(p|R) = 1/4; E(r|R) = 3/4.

• The states of the Markov Chain states: s1 = R; s2 = P . (actual animals)

• The observations: o1 = r; o2 = p. (observations you make, i.e., what you believe that you have seen)

• The initialisation and the recursion:

L(1, k) = π(sk) · E(o1|sk) (1) L(i,k)=max(L(i−1,m)P(sm →sk)E(oi|sk)); (2)

σ(i, k) = arg max (L(i − 1, m) P(sm → sk) E(oi|sk)) . (3)

Here σ(i, k) stores the value of m for which L(i − 1, m) P(sm → sk) E(oi|sk) is the largest which allows us to reconstruct the sequence of states which maximise the probabilities we are tracking. We now obtain the solution for our original problem by backtracking:

xT =argmax(L(T,m)) m∈S

xi−1 =σ(i,xi), i=T,T −1,…,2.

Initialisation t = 1 with our first observation o1 = r, using 1 applied to state s1 = R and then state

L(1, R) = π(s1)E(o1|s1) = π(R)E(r|R) = 2/3 × 3/4 = 1/2; L(1, P ) = π(s2)E(o1|s2) = π(P )E(r|P ) = 1/3 × 1/3 = 1/9.

Note that, generally, likelihoods do not define probability spaces, because they need not sum up to 1! 5

3/4 R 4/5 R 4/5 R

3/4 1/4 3/4 1/4

Observed emissions: r p p r

P 1/3 P 1/3 P 1/3 P

2/3 1/3 2/3 1/3 2/3 1/3 2/3

1/3 rprprprp

L(2,R) = max{L(1,R)P(R → R)E(p|R), L(1,P)P(P → R)E(p|R)}

= max{1/2 × 4/5 × 1/4, 1/9 × 2/3 × 1/4} = max{1/10, 1/54} = 1/10;

σ(2, R) = R;

L(2,P) = max{L(1,R)P(R → P)E(p|P), L(1,P)P(P → P)E(p|P)}

= max{1/2 × 1/5 × 2/3, 1/9 × 1/3 × 2/3} = max{1/15, 2/81} = 1/15; σ(2,P) = R;

L(3,R) = max{L(2,R)P(R → R)E(p|R), L(2,P)P(P → R)E(p|R)}

= max{1/10 × 4/5 × 1/4, 1/15 × 2/3 × 1/4} = max{1/50, 1/90} = 1/50;

σ(3, R) = R;

L(3,P) = max{L(2,R)P(R → P)E(p|P), L(2,P)P(P → P)E(p|P)}

= max{1/10 × 1/5 × 2/3, 1/15 × 1/3 × 2/3} = max{1/75, 2/135} = 2/135; σ(3,P) = P;

L(4, R) = max{L(3, R) P(R → R) E(r|R), L(3, P ) P(P → R) E(r|R)}

= max{1/50 × 4/5 × 3/4, 2/135 × 2/3 × 3/4} = max{3/250, 1/135} = 3/250;

σ(4, R) = R;

L(4,P) = max{L(3,R)P(R → P)E(r|P), L(3,P)P(P → P)E(r|P)}

= max{1/50 × 1/5 × 1/3, 2/135 × 1/3 × 1/3} = max{1/750, 2/1215} = 2/1215; σ(4,P) = P.

Since L(4, R) > L(4, P ) we conclude that the event ends with R,. and this events’ likelihood is 3/250 ≈ 0.012. To retrieve the whole most likely sequence we now use σ function to backtrack. Since σ(4, R) = R;

σ(3, R) = R, σ(2, R) = R we conclude that the most likely sequence is actually RRRR, despite the fact that our observations were rppr!

Let us now calculate the likelihood that it was sequence RPPR which caused our observations rppr: π(R)P(R → P )P(P → P )P(P → R) × E(r|R)E(p|P )E(p|P )E(r|R) =

likelihood of run RPPR likelihood of emissions rppr for run RPPR = (2/3×1/5×1/3×2/3)×(3/4×2/3×2/3×3/4)

= 4/135 × 1/4

= 1/135 ≈ 0.0074

Thus, it is more likely that the sequence RRRR caused the sequence of observations rppr (likelihood ≈ 0.012) then that the sequence RP P R caused the same observations (likelihood ≈ 0.0074). Without the Viterbi algorithm we would conclude that the sequence is RPPR which is likely incorrect. So you can appreciate why this algorithm is a powerful tool for solving many problems related to noisy signals and thus noisy observations, such as for speech recognition, decoding convolutional codes in telecommunications and in many other applications in diverse fields such as bioinformatics and many others.