# CS代考计算机代写 algorithm 5. Weighted FSAs

5. Weighted FSAs

1 Reminder/summary: (Boolean) FSAs

Generation of strings is fundamentally defined like this:

(1) x1x2…xn ∈L(M) ⇐⇒ ··· I(q0)∧∆(q0,x1,q1)∧···∧∆(qn−1,xn,qn)∧F(qn)

q0∈Q q1∈Q qn∈Q

Forward and backward are useful “helper functions” whose values can be computed recursively:

(2)

(3)

We (4)

(5)

(6)

fwdM (ε)(q) = I(q)

fwdM(x1…xn)(q)= fwdM(x1…xn−1)(qn−1)∧∆(qn−1,xn,q)

qn−1 ∈Q

bwdM (ε)(q) = F (q)

bwdM(x1…xn)(q)= ∆(q,x1,q1)∧bwdM(x2…xn)(q1)

q1 ∈Q

can use forward and/or backward values to check whether a string is generated:

uv∈L(M) ⇐⇒ fwdM(u)(q)∧bwdM(v)(q) q∈Q

w∈L(M) ⇐⇒ fwdM(w)(q)∧bwdM(ε)(q) ⇐⇒ fwdM(w)(q)∧F(q) q∈Q q∈Q

w∈L(M) ⇐⇒ fwdM(ε)(q)∧bwdM(w)(q) ⇐⇒ I(q)∧bwdM(w)(q) q∈Q q∈Q

Notice that everything here is about computing with booleans, using conjunction and disjunction.

(7)

C 12 V

What does this FSA say about the string ‘CVC’? What does this FSA say about the string ‘VCV’?

In both cases what it says is just “yes”/“true” — even though they are generated in different ways, ‘VCV’ is generated two ways and ‘CVC’ only one way, etc.

As soon as one choice of q0, q1, q2, q3 gives us a ‘true’, all the other choices of states make no difference.

V

C

V

V

3

Ling185A, Winter 2021 — Tim Hunter, UCLA

2 Introducing weights

Here’s a picture of a weighted FSA. Each possible “step”, including starting and ending, has a non-negative number we call its weight.

1

C/1 112

(8)

V/1

V/0.9

C/0.8 V/0.9

3

V/1

Here’s the linguistic idea being implemented:

• State 1 represents a syllable boundary.

• The 0.8 transition penalizes syllables that have a coda.

• The 0.9 transitions penalize syllables that don’t have an onset.

We multiply the weights along a path. For example:

(9) fM (CV) = 1 × 1 × 1 × 1

fM (VC) = 1 × 0.9 × 0.8 × 1

fM (CVC) = 1 × 1 × 1 × 0.8 × 1

fM (CVV) = 1 × 1 × 1 × 0.9 × 1

fM (VVC) = 1 × 0.9 × 0.8 × 0.9 × 1

For “ambiguous” strings, we multiply along each path and take the maximum. The idea is that a string is as good as its best analysis.

(10)

(11)

for‘VCV’viapath[1,1,2,1]: 1×0.9×1×1×1=0.9

for ‘VCV’ via path [1,3,1,1]: 1 × 0.9 × 0.8 × 0.9 × 1 = 0.648

fM (VCV) = max(0.9, 0.648) = 0.9 for‘CVCV’viapath[1,2,1,2,1]: 1×1×1×1×1×1=1

for‘CVCV’viapath[1,2,3,1,1]: 1×1×1×0.8×0.9×1=0.72 fM (CVCV) = max(1, 0.72) = 1

Any “impossible” transitions have a weight of zero. This makes things behave nicely:

• Any impossible paths get a weight of zero, because 0 × x = 0.

• Impossible paths “do nothing” to the overall value assigned to a string, because max(x, 0) = x.

(12) for‘VC’viapath[1,3,1]: 1×0.9×0.8×1=0.72 for‘VC’viapath[1,1,1]: 1×1×0×1=0

fM (VC) = max(0.72, 0) = 0.72

Question: What if every weight in a particular weighted FSA was either zero or one?

Ling185A, Winter 2021 — Tim Hunter, UCLA

3 Weighted FSAs, formally

Now the careful definitions.

(13) A weighted finite-state automaton (WFSA) is a five-tuple (Q, Σ, I, F, ∆) where:

• Q is a finite set of states;

• Σ, the alphabet, is a finite set of symbols;

• I : Q → R≥0 is a function assigning starting weights to states;

• F : Q → R≥0 is a function assigning ending weights to states;

• ∆ : (Q × Σ × Q) → R≥0 is a function assigning weights to transitions.

Then for any WFSA M = (Q,Σ,I,F,∆), the function fM : Σ∗ → R≥0 is defined as:

(14) fM(x1 …xn) = max value for x1 …xn via path p all possible paths p

=max max… max maxI(q0)×∆(q0,x1,q1)×···×∆(qn−1,xn,qn)×F(qn) q0∈Q q1∈Q qn−1∈Q qn∈Q

For the empty string, the length n is 0, so this just reduces to (15) fM (ε) = max[I (q0 ) × F (q0 )]

q0 ∈Q

The definition in (14) makes the task of calculating fM (w) for some long string w seem rather intimidating. It looks like you have to consider a very large number of distinct paths, and for each new path you consider you have to “start afresh”, with no help from the values of other paths. Similarly, knowing the value of fM (u) and/or fM (v) doesn’t seem to give you any head start in calculating the value of fM (uv).

4 Forward values

It turns out that the helpful idea of forward values carries over to weighted FSAs — although it is a bit less intuitive here.

Let’s define fwdM (w)(q) as the best product-of-weights we can get by choosing some state to start at and then some transitions to take, which produce the string w and land us in the state q.

(16) fwdM(x1…xn)(q)=max max… max I(q0)×∆(q0,x1,q1)×···×∆(qn−1,xn,q) q0 ∈Q q1 ∈Q qn−1 ∈Q

We can use forward values to calculate the overall weight fM (w) assigned to a string w: (17) fM (w) = max fwdM (w)(qn) × F (qn)

qn ∈Q

This equation is perhaps slightly less obvious than it looks. It relies on two important points:

• The calculation of fwdM (w)(qn) already took into account all the relevant ways of reaching state qn, and chose the very best of those.

• A non-optimal way of getting-to-qn can’t be part of the optimal way of getting-to-qn-and-stopping: if a > b, then a × F(qn) > b × F(qn).

We can construct a table of forward values, just like we did for booleans.

(18) Using the WFSA in (8):

State VCVVC

1 1.0

2 0.0

3 0.0

0.9 0.72 0.0 0.9 0.9 0.0

0.9 0.81 0.648 0.0 0.0 0.81 0.9 0.81 0.0

Ling185A, Winter 2021 — Tim Hunter, UCLA

As with booleans, the values in a column only depend on (i) the values in the column immediately to its left, and (ii) the last symbol being “added”. So forward values can be computed recursively:

(19) fwdM (ε)(q) = I(q)

fwdM(x1…xn)(q)= max fwdM(x1…xn−1)(qn−1)×∆(qn−1,xn,q)

qn−1 ∈Q

It may take a bit of thought to convince yourself that this works, but the important logic is the same as

for (17) above:

• The calculation of fwdM (x1 . . . xn−1)(qn−1) already took into account all the relevant ways of reaching

state qn−1, and chose the very best of those.

• A non-optimal way of getting-to-qn−1 can’t be part of the optimal way of getting-to-qn−1-and-then-q:

if a > b, then a × ∆(qn−1, xn, qn) > b × ∆(qn−1, xn, qn).

More abstractly, the crucial underlying point here is that multiplication distributes over max, which says

that we can “pull out” k in calculations like the following:

(20) max a × k, b × k = max(a, b) × k max g(x) × k = max g(x) × k

x∈S x∈S

This is what justifies the following algebraic reshuffles, which get us from the definition in (14) to the

equation in (17) (by “pulling out” F(qn)):

(21) fM(x1…xn)=max… max maxI(q0)×∆(q0,x1,q1)×···×∆(qn−1,xn,qn)×F(qn)

q0∈Q qn−1∈Q qn∈Q

=maxmax… max I(q0)×∆(q0,x1,q1)×···×∆(qn−1,xn,qn)×F(qn)

qn ∈Q q0 ∈Q qn−1 ∈Q

=maxmax… max I(q0)×∆(q0,x1,q1)×···×∆(qn−1,xn,qn)×F(qn) qn ∈Q q0 ∈Q qn−1 ∈Q

fwdM (x1…xn)(qn)

and from the definition in (16) to the equation in (19) (by “pulling out” ∆(qn−1,xn,q)):

(22)

fwdM (x1 . . . xn)(q)

=max… max I(q0)×∆(q0,x1,q1)×···×∆(qn−1,xn,q)

q0 ∈Q qn−1 ∈Q

= max max… max I(q0)×∆(q0,x1,q1)×···×∆(qn−1,xn,q)

qn−1 ∈Q q0 ∈Q qn−2 ∈Q

= max max… max I(q0)×∆(q0,x1,q1)×···×∆(qn−2,xn−1,qn−1)×∆(qn−1,xn,q)

qn−1 ∈Q q0 ∈Q qn−2 ∈Q

= max fwdM(x1…xn−1)(qn−1)×∆(qn−1,xn,q)

qn−1 ∈Q Probabilistic FSAs

5

If we are interested in defining a probability distribution over strings, then one simple idea is to take some WFSA M (say the one in (8) above), add up the weights of all generated strings and divide:

(23) Z(M) = fM(w) Pr(w) = fM(w) w∈Σ∗ Z(M)

But unfortunately this doesn’t always work: sometimes the infinite sum Z(M) does not converge.

A more common approach is to set up a WFSA so that it obeys certain conditions which guarantee that the sum Z(M) turns out to be 1, so that fM(w) just is the probability of w; a WFSA that satisfies these conditions is called a probabilistic finite-state automaton.

Ling185A, Winter 2021 — Tim Hunter, UCLA

(24) A probabilistic finite-state automaton (PFSA) is a WFSA where:

• all starting weights sum to one;

• each state’s outgoing transition weights and ending weight sum to one; and

• there are no “dead ends”, i.e. no states from which it is impossible to end.

Here’s an example of a PFSA, based on (8) from above.

0.1

(25)

C/0.5 112

V/0.2

V/0.5 C/1 V/0.2

3

V/0.5

With a PFSA, we can take the maximum across paths, as before, to find the highest probability associated with any analysis of the given string.1 Or, we can take the sum across paths to find the total probability of the string being generated (by any analysis)!

(26)

(27)

for ‘VCV’ via path [1,1,2,1]: 1 × 0.2 × 0.5 × 0.5 × 0.1 = 0.005 for ‘VCV’ via path [1,3,1,1]: 1 × 0.2 × 1 × 0.2 × 0.1 = 0.004

max(0.005, 0.004) = 0.005 0.005 + 0.004 = 0.009

for ‘CVCV’ via path [1,2,1,2,1]: 1 × 0.5 × 0.5 × 0.5 × 0.5 × 0.1 = 0.00625 for ‘CVCV’ via path [1,2,3,1,1]: 1 × 0.5 × 0.5 × 1 × 0.2 × 0.1 = 0.005

max(0.00625, 0.005) = 0.00625 0.00625 + 0.005 = 0.01125

And the same tricks for using forward (or backward) values also work for calculating these total probabil- ities, because just like (20) above, multiplication distributes over addition:

(28) (a × k) + (b × k) = (a + b) × k g(x) × k = g(x) × k x∈S x∈S

1In the context of a probabilistic grammar, the recursive calculation of these probabilities in a table like (18) above is known as the “Viterbi algorithm” — invented by Andrew Viterbi, right here at UCLA in 1967! See https://en.wikipedia. org/wiki/Andrew_Viterbi. This was one of many independently-discovered algorithms that were eventually unified under the general perspective presented here.

Ling185A, Winter 2021 — Tim Hunter, UCLA

6 The general pattern: Semirings

Is it generated? Highest weight? Total weight?

Possible values

{True, False} R≥0

R≥0

Generalized “and” () and neutral element

∧ True × 1 × 1

Generalized “or” () and neutral element

∨ False max 0 + 0

These are all instances of a single algebraic concept.

(29) A set R, along with an associated “generalized and” operation and an associated “generalized or” operation , counts as a semiring iff the following conditions are satisfied for all x, y, z ∈ R:

a. x(yz)=(xy)z

b. Thereisaparticularelement⊤∈Rsuchthat⊤x=x⊤=x. c. x(yz)=(xy)z

d. Thereisaparticularelement⊥∈Rsuchthat⊥x=x⊥=x. e. xy=yx

f. x(yz)=(xy)(xz)and(yz)x=(yx)(zx) g. x⊥=⊥x=⊥

Officially, the semiring is the five-tuple R = (R, , ⊥, , ⊤).

Side note: In the literature, you’ll often see the symbol ⊗ used where I’m using , and the symbol ⊕ used where I’m using . (And in such cases you’ll typically see something like 0 or 0 used as the neutral element for ⊕, and something like 1 or 1 as the neutral element for ⊗.) The choice of ⊗ and ⊕ is natural if you think of numbers as the “canonical” case of a semiring, and think of others as generalizations of those. I’ve chosen and because this seemed to make more sense given that we, in effect, saw the boolean semiring first and then generalized to others from there. But the important thing is that what makes a system a semiring is the fact that its pieces interact in the ways stated in (29), not the names or symbols we use for the pieces of the system.

So here’s the general notion of a semiring-weighted FSA.

(30) For any semiring R = (R, , ⊥, , ⊤), an R-weighted finite-state automaton is a five-tuple (Q, Σ, I, F, ∆) where:

• Q is a finite set of states;

• Σ, the alphabet, is a finite set of symbols;

• I : Q → R is a function assigning starting values to states;

• F : Q → R is a function assigning ending values to states;

• ∆ : (Q × Σ × Q) → R is a function assigning values to transitions.

Then for any R-weighted FSA M = (Q, Σ, I, F, ∆), the function fM : Σ∗ → R is defined as:

(31) fM(x1…xn)=… I(q0)∆(q0,x1,q1)…∆(qn−1,xn,qn)F(qn) q0∈Q q1∈Q qn−1∈Q qn∈Q

And the algebraic properties of the semiring operations (especially (29f); recall (21) and (22)) ensure that the now-familiar tricks for calculating these values efficiently will work:

(32) fwdM (ε)(q) = I(q)

fwdM(x1…xn)(q)= fwdM(x1…xn−1)(qn−1)∆(qn−1,xn,q) qn−1 ∈Q

Ling185A, Winter 2021 — Tim Hunter, UCLA

(33)

(34)

(35)

(36)

7

bwdM (ε)(q) = F (q)

bwdM (x1 . . . xn)(q) = ∆(q, x1, q1) bwdM (x2 . . . xn)(q1)

q1 ∈Q

fM (uv) = fwdM (u)(q) bwdM (v)(q) q∈Q

fM (w) = fwdM (w)(q) bwdM (ε)(q) = fwdM (w)(q) F (q) q∈Q q∈Q

fM (w) = fwdM (ε)(q) bwdM (w)(q) = I(q) bwdM (w)(q) q∈Q q∈Q

Two more useful semirings

7.1 Costs

We can associate a natural number “cost” with each transition.

The numbers in (37) assign a cost of 1 for having an onset, and assign a cost of 2 for not having a coda.

0

(37)

C/0 012

V/0

V/1

C/2 V/1

3

V/0

Here we take the sum of the costs along a single path, and take the minimum across the various possible paths.

(38) for‘VCV’viapath[1,1,2,1]: 0+1+0+0+0=1 for‘VCV’viapath[1,3,1,1]: 0+1+2+1+0=4

fM (VCV) = min(1, 4) = 1

If we also take all the transitions not shown in the diagram to have the special value ∞, where • x+∞=∞+x=∞,and

• min(x,∞)=min(∞,x)=x,

then:

• Each of the “bad paths” that we did not consider in (38) has the value ∞ (because x + ∞ = ∞).

• So we can consider fM (VCV) to actually be the minimum over all paths; including all the ∞ paths has no effect (because min(x, ∞) = x).

• And therefore if, for some string w, all paths are “bad paths”, then fM (w) = ∞.

Ling185A, Winter 2021 — Tim Hunter, UCLA

7.2 Output strings

We can associate a set of output strings, over some alphabet Γ, with each transition. (This set of output

strings might often be a singleton set.)

The output strings in (39) have the effect of optionally lengthening vowels that follow an onset, and deleting all codas.

{ε}

(39)

C/{C} {ε}1 2

V/{V}

V/{V, VV} C/{ε} V/{V}

3

V/{V, VV}

Here we use (lifted) concatenation to combine weights along a single path, and take the union across the various possible paths.

(40) (41)

Plus,

• •

•

X•Y ={u+v|u∈X,v∈Y}

for ‘VCV’ via path [1,1,2,1]: {ε} • {V} • {C} • {V, VV} • {ε} = {VCV, VCVV}

for ‘VCV’ via path [1,3,1,1]: {ε} • {V} • {ε} • {V} • {ε} = {VV}

fM (VCV) = {VCV, VCVV} ∪ {VV} = {VCV, VCVV, VV}

if we take all the transitions not shown in the diagram to have the value {}, then:

Each of the “bad paths” that we did not consider in (41) has the value {} (because X • {} = {}).

So we can consider fM (VCV) to actually be the union over all paths; including all the {} paths has no effect (because X ∪ {} = X).

And therefore if, for some string w, all paths are “bad paths”, then fM (w) = {}.

Possible values

{True, False} R≥0

R≥0

N ∪ {∞} P (Γ∗ )

Generalized “or” () and neutral element

Generalized “and” () and neutral element

∧ True

× 1 max0 × 1+0 + 0 min ∞

• {ε} ∪ {}

Is it generated? Highest weight? Total weight? Lowest cost? Output strings?

∨ False

Ling185A, Winter 2021 — Tim Hunter, UCLA