# Computational Linguistics

Copyright © 2017 Gerald Penn. All rights reserved.

10A. Log-Likelihood Dependency Parsing

Gerald Penn
Department of Computer Science, University of Toronto
CSC 2501 / 485, Fall 2018

Based on slides by Yuji Matsumoto, Dragomir Radev, David Smith and Jason Eisner

Word Dependency Parsing

Raw sentence:
  He reckons the current account deficit will narrow to only 1.8 billion in September.

Part-of-speech tagging yields the POS-tagged sentence:
  He reckons the current account deficit will narrow to only 1.8 billion in September.
  PRP VBZ DT JJ NN NN MD VB TO RB CD CD IN NNP .

Word dependency parsing yields the word dependency parsed sentence:
  [Figure: dependency tree over the sentence, with arc labels ROOT, SUBJ, S-COMP, SPEC, MOD, COMP]

(slide adapted from Yuji Matsumoto)

Parsing Methods

Shift-Reduce Type Algorithms

◮ Data structures:
  ◮ Stack [..., w_i]_S of partially processed tokens
  ◮ Queue [w_j, ...]_Q of remaining input tokens
◮ Parsing actions built from atomic actions:
  ◮ Adding arcs (w_i → w_j, w_i ← w_j)
  ◮ Stack and queue operations
◮ Left-to-right parsing in O(n) time
◮ Restricted to projective dependency graphs

Dependency Parsing 54(103)

Yamada's Algorithm

◮ Three parsing actions:

  Shift:  [...]_S [w_i, ...]_Q  ⇒  [..., w_i]_S [...]_Q
  Left:   [..., w_i, w_j]_S [...]_Q  ⇒  [..., w_i]_S [...]_Q   (adds arc w_i → w_j)
  Right:  [..., w_i, w_j]_S [...]_Q  ⇒  [..., w_j]_S [...]_Q   (adds arc w_i ← w_j)

◮ Algorithm variants:
  ◮ Originally developed for Japanese (strictly head-final) with only the Shift and Right actions [Kudo and Matsumoto 2002]
  ◮ Adapted for English (with mixed headedness) by adding the Left action [Yamada and Matsumoto 2003]
  ◮ Multiple passes over the input give time complexity O(n²)
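The three actions can be tried out as plain stack-and-queue operations. The sketch below is illustrative only: the scripted action list and the toy sentence are my own, not Yamada and Matsumoto's system, which chooses each action with an SVM over multiple passes.

```python
def yamada(words, actions):
    # Replay a scripted sequence of Yamada-style actions.
    # words are 1-indexed tokens; returns the set of (head, dependent) arcs.
    stack, queue, arcs = [], list(range(1, len(words) + 1)), set()
    for act in actions:
        if act == "shift":          # [...]S [wi,...]Q => [...,wi]S [...]Q
            stack.append(queue.pop(0))
        elif act == "left":         # [...,wi,wj]S => [...,wi]S, arc wi -> wj
            wj = stack.pop()
            arcs.add((stack[-1], wj))
        elif act == "right":        # [...,wi,wj]S => [...,wj]S, arc wj -> wi
            wj = stack.pop()
            wi = stack.pop()
            arcs.add((wj, wi))
            stack.append(wj)
    return arcs

# "I saw her": "saw" heads both "I" and "her"
arcs = yamada(["I", "saw", "her"], ["shift", "shift", "right", "shift", "left"])
```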

Nivre's Algorithm

◮ Four parsing actions:

  Shift:        [...]_S [w_i, ...]_Q  ⇒  [..., w_i]_S [...]_Q
  Reduce:       [..., w_i]_S [...]_Q  ⇒  [...]_S [...]_Q   (requires ∃w_k: w_k → w_i)
  Left-Arc_r:   [..., w_i]_S [w_j, ...]_Q  ⇒  [...]_S [w_j, ...]_Q   (adds arc w_i ←_r w_j; requires ¬∃w_k: w_k → w_i)
  Right-Arc_r:  [..., w_i]_S [w_j, ...]_Q  ⇒  [..., w_i, w_j]_S [...]_Q   (adds arc w_i →_r w_j; requires ¬∃w_k: w_k → w_j)

◮ Characteristics:
  ◮ Integrated labeled dependency parsing
  ◮ Arc-eager processing of right-dependents
  ◮ Single pass over the input gives time complexity O(n)

Example

Parsing "Economic news had little effect on financial markets ." with Nivre's algorithm. Each step shows the action, the resulting stack and queue, and the arc added (if any):

 0. Initial:         [root]_S [Economic news had little effect on financial markets .]_Q
 1. Shift          ⇒ [root Economic]_S [news ...]_Q
 2. Left-Arc_nmod  ⇒ [root]_S [news ...]_Q                      (Economic ←_nmod news)
 3. Shift          ⇒ [root news]_S [had ...]_Q
 4. Left-Arc_sbj   ⇒ [root]_S [had ...]_Q                       (news ←_sbj had)
 5. Right-Arc_pred ⇒ [root had]_S [little ...]_Q                (root →_pred had)
 6. Shift          ⇒ [root had little]_S [effect ...]_Q
 7. Left-Arc_nmod  ⇒ [root had]_S [effect ...]_Q                (little ←_nmod effect)
 8. Right-Arc_obj  ⇒ [root had effect]_S [on ...]_Q             (had →_obj effect)
 9. Right-Arc_nmod ⇒ [root had effect on]_S [financial ...]_Q   (effect →_nmod on)
10. Shift          ⇒ [root had effect on financial]_S [markets .]_Q
11. Left-Arc_nmod  ⇒ [root had effect on]_S [markets .]_Q       (financial ←_nmod markets)
12. Right-Arc_pc   ⇒ [root had effect on markets]_S [.]_Q       (on →_pc markets)
13. Reduce         ⇒ [root had effect on]_S [.]_Q
14. Reduce         ⇒ [root had effect]_S [.]_Q
15. Reduce         ⇒ [root had]_S [.]_Q
16. Reduce         ⇒ [root]_S [.]_Q
17. Right-Arc_p    ⇒ [root .]_S []_Q                            (root →_p .)
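The derivation above can be replayed mechanically. Below is a minimal sketch of the four transitions, driven by the example's scripted action sequence; the action encoding and helper names are my own.

```python
def arc_eager(words, actions):
    # Replay a scripted arc-eager derivation. words are 1-indexed;
    # node 0 is the artificial root. Returns the set of (head, dep, label) arcs.
    stack, queue = [0], list(range(1, len(words) + 1))
    arcs, has_head = set(), set()
    for act in actions:
        name = act[0]
        if name == "shift":
            stack.append(queue.pop(0))
        elif name == "reduce":          # only legal if top already has a head
            assert stack[-1] in has_head
            stack.pop()
        elif name == "left":            # Left-Arc_r: top <- front of queue, pop top
            dep = stack.pop()
            arcs.add((queue[0], dep, act[1]))
            has_head.add(dep)
        elif name == "right":           # Right-Arc_r: top -> front, push front
            dep = queue.pop(0)
            arcs.add((stack[-1], dep, act[1]))
            has_head.add(dep)
            stack.append(dep)
    return arcs

words = "Economic news had little effect on financial markets .".split()
actions = [("shift",), ("left", "nmod"), ("shift",), ("left", "sbj"),
           ("right", "pred"), ("shift",), ("left", "nmod"), ("right", "obj"),
           ("right", "nmod"), ("shift",), ("left", "nmod"), ("right", "pc"),
           ("reduce",), ("reduce",), ("reduce",), ("reduce",), ("right", "p")]
arcs = arc_eager(words, actions)
```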

Classifier-Based Parsing

◮ Data-driven deterministic parsing:
  ◮ Deterministic parsing requires an oracle.
  ◮ An oracle can be approximated by a classifier.
  ◮ A classifier can be trained using treebank data.
◮ Learning methods:
  ◮ Support vector machines (SVM) [Kudo and Matsumoto 2002, Yamada and Matsumoto 2003, Isozaki et al. 2004, Cheng et al. 2004, Nivre et al. 2006]
  ◮ Memory-based learning (MBL) [Nivre et al. 2004, Nivre and Scholz 2004]
  ◮ Maximum entropy modeling (MaxEnt) [Cheng et al. 2005]
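As a toy illustration of approximating the oracle with a classifier, here is a multiclass perceptron over sparse binary features. The feature names and training triples are invented; real systems use the SVM, MBL, or MaxEnt learners cited above.

```python
from collections import defaultdict

class ActionPerceptron:
    # One weight vector per parser action; score = sum of active feature weights.
    def __init__(self, actions):
        self.actions = actions
        self.w = {a: defaultdict(float) for a in actions}

    def score(self, feats, action):
        return sum(self.w[action][f] for f in feats)

    def predict(self, feats):
        return max(self.actions, key=lambda a: self.score(feats, a))

    def update(self, feats, gold):
        pred = self.predict(feats)
        if pred != gold:                # standard perceptron update
            for f in feats:
                self.w[gold][f] += 1.0
                self.w[pred][f] -= 1.0

# toy "treebank-derived" training data: parser-state features -> oracle action
data = [(["s0:JJ", "q0:NN"], "left-arc"),
        (["s0:VB", "q0:NN"], "right-arc"),
        (["s0:NN", "q0:VB"], "shift")]
clf = ActionPerceptron(["shift", "left-arc", "right-arc"])
for _ in range(3):
    for feats, gold in data:
        clf.update(feats, gold)
```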

Feature Models

◮ Learning problem:
  ◮ Approximate a function from parser states (represented by feature vectors) to parser actions, given a training set of gold-standard derivations.
◮ Typical features:
  ◮ Tokens:
    ◮ Target tokens
    ◮ Linear context (neighbors in S and Q)
    ◮ Structural context (parents, children, siblings in G)
  ◮ Attributes:
    ◮ Word form (and lemma)
    ◮ Part-of-speech (and morpho-syntactic features)
    ◮ Dependency type (if labeled)
    ◮ Distance (between target tokens)
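A hypothetical minimal feature template over the target tokens and their linear context might look as follows; the feature names and configuration encoding are invented for illustration.

```python
def extract_features(stack, queue, pos):
    # stack/queue hold token indices; pos maps index -> POS tag.
    feats = []
    if stack:                                  # target token: top of stack
        feats.append("s0.pos=" + pos[stack[-1]])
    if queue:                                  # target token: front of queue
        feats.append("q0.pos=" + pos[queue[0]])
    if len(queue) > 1:                         # linear context in Q
        feats.append("q1.pos=" + pos[queue[1]])
    if stack and queue:
        feats.append("pair=" + pos[stack[-1]] + "+" + pos[queue[0]])
        feats.append("dist=" + str(queue[0] - stack[-1]))   # distance feature
    return feats

feats = extract_features([2], [3, 4], {2: "NN", 3: "VB", 4: "JJ"})
```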

Great ideas in NLP: Log-linear models
(Berger, della Pietra, della Pietra 1996; Darroch & Ratcliff 1972)

In the beginning, we used generative models:

  p(A) * p(B | A) * p(C | A,B) * p(D | A,B,C) * ...

Each choice depends on a limited part of the history. But which dependencies to allow? What if they're all worthwhile? Should we keep p(D | A,B,C), or switch to p(D | A,B) * p(C | A,B,D), or ...?

Solution: Log-linear (max-entropy) modeling. Rather than deciding which dependencies to allow (given limited training data), throw them all in:

  (1/Z) * Φ(A) * Φ(B,A) * Φ(C,A) * Φ(C,B) * Φ(D,A,B) * Φ(D,B,C) * Φ(D,A,C) * ...

Features may interact in arbitrary ways. Iterative scaling keeps adjusting the feature weights until the model agrees with the training data.
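Concretely, a log-linear model scores each outcome by summing the weights of its active features and normalizes by Z. A small sketch with invented features and weights:

```python
import math

def loglinear_prob(active, weights):
    # active: label -> list of feature names that fire for that label
    scores = {y: sum(weights.get(f, 0.0) for f in fs) for y, fs in active.items()}
    z = sum(math.exp(s) for s in scores.values())        # partition function
    return {y: math.exp(s) / z for y, s in scores.items()}

p = loglinear_prob({"A": ["f1"], "B": ["f2"], "C": []},
                   {"f1": 1.0, "f2": -1.0})
```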

Log-linear models are great for n-way classification. How about structured outputs?

◮ Also good for predicting sequences (find preferred tags), but to allow fast dynamic programming, only use n-gram features.
◮ Also good for dependency parsing (find preferred links), but to allow fast dynamic programming or MST parsing, only use single-edge features.

Edge-Factored Parsers (McDonald et al. 2005)

  Byl jasný studený dubnový den a hodiny odbíjely třináctou
  V   A     A       A       N   J N      V        C
  ("It was a bright cold day in April and the clocks were striking thirteen")

Is jasný → den a good edge? Yes, lots of green: the features it fires are well attested in training data:

◮ jasný den ("bright day")
◮ jasný N ("bright NOUN")
◮ A N
◮ A N preceding a conjunction

How about the competing edge jasný → hodiny? Not as good, lots of red: its features are rare or undertrained:

◮ jasný hodiny ("bright clocks") ... undertrained ...
◮ jasn hodi ("bright clock", stems only: byl jasn stud dubn den a hodi odbí třin)
◮ A_plural N_singular
◮ A N where N follows a conjunction

Which edge is better, "bright day" or "bright clocks"?

Score of an edge e: score(e) = θ · features(e), where θ is our current weight vector. Standard algorithms then find the valid parse with maximum total score.

Not every set of high-scoring edges forms a valid parse:

◮ Two edges into the same word can't both be kept (one parent per word).
◮ Two crossing links can't both be kept (projectivity).
◮ Three edges forming a loop can't all be kept (no cycles).

Thus, an edge may lose (or win) because of a consensus of other edges.

Finding the Highest-Scoring Parse

  The cat in the hat wore a stovepipe. ROOT

◮ Convert to a context-free grammar (CFG), then use dynamic programming.
◮ Each subtree is a linguistic constituent (here a noun phrase).

[Figure: the dependency graph, redrawn with the tree vertically stretched.]

◮ The CKY algorithm for CFG parsing is O(n³). Unfortunately, it is O(n⁵) here: to score the "cat wore" link, it is not enough to know the subtree is an NP; we must know it is rooted at "cat". So the nonterminal set grows by a factor of O(n): {NP_the, NP_cat, NP_hat, ...}, and CKY's "grammar constant" is no longer constant.

◮ Solution: use a different decomposition (Eisner 1996), which brings parsing back to O(n³).
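Eisner's decomposition can be sketched as a dynamic program over "complete" and "incomplete" spans. The score-only version below (no backpointers) follows the standard presentation of the algorithm; the toy score matrix is invented for the check.

```python
NEG = float("-inf")

def eisner(score):
    # score[h][d] = score of arc h -> d; index 0 is the artificial root.
    # C[s][t][d] = best complete span, I[s][t][d] = best incomplete span,
    # d = 0: head at right end (left arc), d = 1: head at left end (right arc).
    n = len(score) - 1
    C = [[[0.0, 0.0] for _ in range(n + 1)] for _ in range(n + 1)]
    I = [[[NEG, NEG] for _ in range(n + 1)] for _ in range(n + 1)]
    for k in range(1, n + 1):              # span length
        for s in range(0, n + 1 - k):
            t = s + k
            # build an incomplete span by adding the arc between s and t
            best = max(C[s][r][1] + C[r + 1][t][0] for r in range(s, t))
            I[s][t][0] = best + score[t][s]            # arc t -> s
            I[s][t][1] = best + score[s][t]            # arc s -> t
            # complete a span by absorbing a finished half
            C[s][t][0] = max(C[s][r][0] + I[r][t][0] for r in range(s, t))
            C[s][t][1] = max(I[s][r][1] + C[r][t][1] for r in range(s + 1, t + 1))
    return C[0][n][1]                      # best projective tree rooted at 0

# root + two words; best tree is 0 -> 1 -> 2 with score 1 + 5 = 6
score = [[NEG, 1.0, 1.0],
         [NEG, NEG, 5.0],
         [NEG, 0.0, NEG]]
best = eisner(score)
```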

Spans vs. Constituents

Two kinds of substring:
◮ Constituent of the tree: links to the rest only through its headword (root).
◮ Span of the tree: links to the rest only through its endwords.

Decomposing a tree into spans (adjacent spans share an endword):

  The cat in the hat wore a stovepipe. ROOT
    = The cat + cat in the hat wore a stovepipe. ROOT
  cat in the hat wore a stovepipe. ROOT
    = cat in the hat wore + wore a stovepipe. ROOT
  cat in the hat wore = cat in + in the hat wore
  in the hat wore = in the hat + hat wore

Hard Constraints on Valid Trees

With score(e) = θ · features(e), standard algorithms find the valid parse with maximum total score, subject to the hard constraints: one parent per word, no crossing links, no cycles. Thus, an edge may lose (or win) because of a consensus of other edges.

Non-Projective Parses

The "projectivity" restriction (no crossing links). Do we really want it?

  I 'll give a talk tomorrow on bootstrapping   ROOT

The subtree rooted at "talk" ("a talk on bootstrapping") is a discontiguous noun phrase.

Non-Projective Parses

Occasional non-projectivity in English:

  I 'll give a talk tomorrow on bootstrapping   ROOT

Frequent non-projectivity in Latin, etc.:

  ista meam norit gloria canitiem   ROOT
  that_NOM my_ACC may-know glory_NOM going-gray_ACC
  "That glory may-know my going-gray" (i.e., it shall last till I go gray)

Non-Projective Parsing Algorithms

◮ Complexity of projective (Proj) vs. non-projective (NonP) parsing:

  Problem/Algorithm                 Proj    NonP
  Complete grammar parsing          P       NP-hard    [Gaifman 1965, Neuhaus and Bröker 1997]
  Deterministic parsing             O(n)    O(n²)      [Nivre 2003, Covington 2001]
  First-order spanning tree         O(n³)   O(n²)      [McDonald et al. 2005b]
  Nth-order spanning tree (N > 1)   P       NP-hard    [McDonald and Pereira 2006]

McDonald's Approach (non-projective)

Consider the sentence "John saw Mary". The Chu-Liu-Edmonds algorithm finds the maximum-weight spanning tree, which may be non-projective, in O(n²) time:

◮ Every node selects its best parent.
◮ If there are cycles, contract them and repeat.

[Figure: dense weighted graph over root, John, saw, Mary (left) and the resulting maximum spanning tree root → saw, saw → John, saw → Mary with weights 10, 30, 30 (right).]

(slide thanks to Dragomir Radev)
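The contract-and-repeat idea can be sketched recursively. This is an illustrative implementation, not McDonald et al.'s code, and the edge weights below are assumptions patterned on the slide's figure.

```python
def find_cycle(head):
    # head maps dependent -> chosen head; return a cycle's nodes, or None
    for start in head:
        seen, v = [], start
        while v in head and v not in seen:
            seen.append(v)
            v = head[v]
        if v in seen:
            return seen[seen.index(v):]
    return None

def chu_liu_edmonds(arcs, nodes, root):
    # arcs: (head, dep) -> weight; returns {dep: head}, the max arborescence
    head = {}
    for d in nodes:
        if d != root:                      # every node selects its best parent
            head[d] = max((w, h) for (h, dd), w in arcs.items() if dd == d)[1]
    cycle = find_cycle(head)
    if cycle is None:
        return head
    cyc, c = set(cycle), max(nodes) + 1    # contract the cycle into node c
    new_arcs, backmap = {}, {}
    for (h, d), w in arcs.items():
        if h in cyc and d in cyc:
            continue
        if d in cyc:       # entering arc: pay for breaking d's current cycle arc
            key, nw = (h, c), w - arcs[(head[d], d)]
        elif h in cyc:     # leaving arc
            key, nw = (c, d), w
        else:
            key, nw = (h, d), w
        if key not in new_arcs or nw > new_arcs[key]:
            new_arcs[key], backmap[key] = nw, (h, d)
    sub = chu_liu_edmonds(new_arcs, [v for v in nodes if v not in cyc] + [c], root)
    result = {v: head[v] for v in cyc}     # keep cycle arcs ...
    for d, h in sub.items():               # ... except the one the entering arc displaces
        oh, od = backmap[(h, d)]
        result[od] = oh
    return result

# 0 = root, 1 = John, 2 = saw, 3 = Mary (weights assumed, patterned on the figure)
arcs = {(0, 1): 9, (0, 2): 10, (0, 3): 9, (1, 2): 20, (2, 1): 30,
        (2, 3): 30, (3, 2): 0, (3, 1): 11, (1, 3): 3}
tree = chu_liu_edmonds(arcs, [0, 1, 2, 3], 0)
```

Note that the greedy step alone yields the cycle John ↔ saw here; the contraction step is what recovers root → saw.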

Summing over All Non-Projective Trees

Finding the highest-scoring non-projective tree: the Chu-Liu-Edmonds algorithm finds the maximum-weight spanning tree in O(n²) time. How about the total weight Z of all trees? It can be found in O(n³) time by matrix determinants and inverses (Smith & Smith, 2007).

(slide thanks to Dragomir Radev)

Graph Theory to the Rescue!

Tutte's Matrix-Tree Theorem (1948): the determinant of the Kirchhoff (aka Laplacian) adjacency matrix of a directed graph G, with row and column r struck out, equals the sum of scores of all directed spanning trees of G rooted at node r.

Exactly the Z we need! O(n³) time!

Building the Kirchhoff (Laplacian) Matrix

Start from the matrix of arc scores s(h, d) for arcs h → d (row 0 holds the arcs out of the root), then:

◮ Negate the edge scores.
◮ Put each column's sum of incoming scores on the diagonal (the children's totals).
◮ Strike out the root's row and column.
◮ Take the determinant.

The struck-out matrix is

  K = [  Σ_{j≠1} s(j,1)   −s(1,2)          ...   −s(1,n)
         −s(2,1)          Σ_{j≠2} s(j,2)   ...   −s(2,n)
         ...
         −s(n,1)          −s(n,2)          ...   Σ_{j≠n} s(j,n)  ]

and det K = Z.

N.B.: This allows multiple children of the root, but see Koo et al. 2007.
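The recipe can be checked numerically on a tiny graph: build the matrix, strike the root row and column, and compare the determinant to the brute-force sum over all arborescences. The arc scores below are invented.

```python
def det(m):
    # determinant by Laplace expansion along the first row (fine for tiny matrices)
    if len(m) == 1:
        return m[0][0]
    return sum((-1) ** j * m[0][j] * det([row[:j] + row[j + 1:] for row in m[1:]])
               for j in range(len(m)))

def tree_partition(s):
    # s[h][d] = (exponentiated) score of arc h -> d; node 0 is the root.
    # Z = sum over spanning arborescences rooted at 0 of the product of arc scores.
    n = len(s)
    L = [[sum(s[h][d] for h in range(n) if h != d) if i == d else -s[i][d]
          for d in range(1, n)] for i in range(1, n)]  # root row/column struck out
    return det(L)

# arcs: 0->1 (1), 0->2 (2), 1->2 (3), 2->1 (4); the three arborescences
# give Z = 1*2 + 1*3 + 2*4 = 13
s = [[0.0, 1.0, 2.0],
     [0.0, 0.0, 3.0],
     [0.0, 4.0, 0.0]]
z = tree_partition(s)
```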

Why Should This Work?

Chu-Liu-Edmonds analogy: every node selects its best parent; if there are cycles, contract and recur. The determinant satisfies an analogous recurrence: expanding det K splits it into the terms that use a given arc, say 1 → 2, whose total is s(1,2) times the determinant of the contracted matrix K({1,2} | {1,2}), and the terms that do not. The claim is clear for a 1×1 matrix; use induction. (This is the classical argument for the undirected case; the directed case needs special handling at the root.)

Graph Deletion & Contraction

Important fact: κ(G) = κ(G − {e}) + κ(G / {e}), i.e., the spanning trees of G split into those that do not use edge e (deletion) and those that do (contraction).