# CS代考程序代写 computational biology algorithm Lecture 7:

Lecture 7:

Dynamic Programming II (Adv)

The University of Sydney

Page 1

Fast Fourier Transform

How does Google know what I want?

The University of Sydney Page 4

Edit distance

– How similar are two strings? – S = “ocurrance”

– T = “occurrence”

o

c

–

u

r

r

a

n

c

e

o

c

c

u

r

r

e

n

c

e

1 mismatch, 1 gap

Modify S to get T by adding c, swapping a-e The University of Sydney

Page 5

Which edit is better?

o

c

–

u

r

r

a

n

c

e

o

c

c

u

r

r

e

n

c

e

1 mismatch, 1 gap

o

c

–

u

r

r

–

a

n

c

e

o

c

c

u

r

r

e

–

n

c

e

0 mismatches, 3 gaps

o

c

u

r

r

a

n

c

e

–

o

c

c

u

r

r

e

n

c

e

The University of Sydney

5 mismatches, 1 gap

Page 8

Edit Distance

– Applications.

– Basis for Unix diff.

– Speech recognition.

– Computational biology.

– Edit distance. [Levenshtein 1966, Needleman-Wunsch 1970] – Gap penalty d and mismatch penalty apq.

– Cost = sum of gap and mismatch penalties.

C

T

G

A

C

C

T

A

C

C

T

–

C

T

G

A

C

C

T

A

C

C

T

C

C

T

G

A

C

T

A

C

A

T

C

C

T

G

A

C

–

T

A

C

A

T

aTC + aGT + aAG+ 2aCA

The University of Sydney Page 9

2d + aCA

Sequence Alignment

– Goal: Given two strings X = x1 x2 . . . xm and Y = y1 y2 . . . yn find alignment of minimum cost.

– Definition: An alignment M is a set of ordered pairs xi-yj such that each item occurs in at most one pair and no crossings.

– Definition: The pair xi-yj and xa-yb cross if i < a, but j > b.

Example: CTACCG vs. TACATG.

x1 x2 x3 x4 x5 x6

C

T

A

C

C

–

G

–

T

A

C

A

T

G

The University of Sydney M = x2-y1, x3-y2, x4-y3, x5-y4, x6-y6. Page 10

y1 y2 y3 y4 y5 y6

Sequence Alignment

cost(M) = Saxi,yj + Sd + Sd (xi,yj)ÎM i: xi unmatched j: yj unmatched

mismatch Example: CTACCG vs. TACATG.

x1 x2 x3 x4 x5 x6

C

T

A

C

C

–

G

–

T

A

C

A

T

G

The University of Sydney

Page 11

y1 y2 y3 y4 y5 y6

M = x2-y1, x3-y2, x4-y3, x5-y4, x6-y6.

Key steps: Dynamic programming

1. Define subproblems 2. Find recurrences

3. Solve the base cases

4. Transform recurrence into an efficient algorithm

The University of Sydney Page 12

Sequence Alignment: Problem Structure Step 1: Define subproblems

OPT(i, j) =

min cost of aligning strings x1 x2 . . . xi and y1 y2 …yj.

x1 x2 x3 x4 x5 xi

··· ···

C

T

A

C

C

–

G

–

T

A

C

A

T

G

The University of Sydney

Page 14

y1 y2 y3 y4 y5 yj

Sequence Alignment: Problem Structure

– Definition: OPT(i, j) = min cost of aligning strings x1 x2 . . . xi and y1 y2 …yj.

– Case 1: OPT matches xi-yj.

• pay mismatch for xi-yj + min cost of aligning

twostringsx1 x2 …xi-1 andy1 y2 …yj-1 – Case 2a: OPT leaves xi unmatched.

• paygapforxi andmincostofaligningx1 x2 …xi-1 andy1 y2 …yj – Case 2b: OPT leaves yj unmatched.

• paygapforyj andmincostofaligningx1 x2 …xi andy1 y2 …yj-1

OPT(i, j) = min

axiyj + OPT(i-1,j-1) d + OPT(i-1, j)

d + OPT(i, j-1)

The University of Sydney Page 19

Sequence Alignment: Problem Structure Step 3: Solve the base cases

OPT(0,j) = j·d and OPT(i,0) = i·d

OPT(i,j) = j·d if i=0 OPI(i, j) = OPT(i,j) = i·d if j=0

min{axiyj + OPT(i-1, j-1),d + OPT(i-1, j),d + OPT(i, j-1)} otherwise

The University of Sydney

Page 20

Sequence Alignment: Algorithm

Sequence-Alignment(m, n, x1x2…xm, y1y2…yn, d, a) { for i = 0 to m

M[0, i] = id for j = 0 to n

M[j, 0] = jd

for i = 1 to m

for j = 1 to n

M[i, j] = min{a[xi, yj] + M[i-1, j-1], d + M[i-1, j],

return M[m, n]

}

d + M[i, j-1]}

– Analysis. Q(mn) time and space.

– English words or sentences: m, n £ 100.

– Computational biology: m = n = 100,000. 10 billions ops OK, but 10GB array?

The University of Sydney

Page 21

Sequence Alignment: Algorithm

Sequence-Alignment(m, n, x1x2…xm, y1y2…yn, d, a) { for i = 0 to m

M[0, i] = id for j = 0 to n

M[j, 0] = jd

for i = 1 to m

for j = 1 to n

M[i, j] = min{a[xi, yj] + M[i-1, j-1], d + M[i-1, j],

return M[m, n]

}

d + M[i, j-1]}

– To get the alignment itself we can trace back through the array M.

The University of Sydney Page 22

Sequence Alignment: Linear Space

Question: Can we avoid using quadratic space?

Easy. Optimal value in O(m + n) space and O(mn) time.

– Compute OPT(i, j) from OPT(i-1, j), OPT(i-1, j-1) and OPT(i, j-1). – BUT! No longer a simple way to recover alignment itself.

Theorem: [Hirschberg 1975] Optimal alignment in O(m + n) space and O(mn) time.

– Clever combination of divide-and-conquer and dynamic programming. – Inspired by idea of Savitch from complexity theory.

The University of Sydney Page 25

Sequence Alignment: Linear Space

– Edit distance graph.

– m ́n grid graph GXY (as shown in the figure)

– Horizontal/vertical edges have cost d (gap)

– Diagonal edges from (i-1, j-1) to (i,j) have cost axiyj. (match)

e y1 y2 y3 y4 y5 y6 0-0

axiyj d

d i-j

e

x1 x2

x3

m-n

The University of Sydney

Page 26

Sequence Alignment: Linear Space

– Edit distance graph.

– Let f(i, j) be cheapest path from (0,0) to (i, j). – Observation: f(i, j) = OPT(i, j).

e y1 y2 y3 y4 y5 y6 0-0

axiyj d

d i-j

e

x1 x2

x3

m-n

The University of Sydney

Page 27

Sequence Alignment: Linear Space

min{axiyj + OPT(i-1, j-1),d + OPT(i-1, j),d + OPT(i, j-1)}

e

x1 x2

x3

e y1 y2 y3 y4 y5 y6 0-0

axiyj d

d i-j

m-n

The University of Sydney

Page 28

Sequence Alignment: Linear Space

– Edit distance graph.

– Let f(i, j) be cheapest path from (0,0) to (i, j).

– Can compute f(m,n) in O(mn) time and O(mn) space.

j

y4

i-j

e y1 y2 y3 0-0

x1 x2

y5 y6

e

x3

m-n

The University of Sydney

Page 29

Sequence Alignment: Linear Space

– Edit distance graph.

– Let f(i, j) be cheapest path from (0,0) to (i, j).

– If only interested in the value of the optimal alignment we do it in O(mn)

time and O(m + n) space.

e y1 y2 y3 0-0

x1 x2

j

y4

i-j

y5 y6

e

x3

m-n

The University of Sydney

Page 30

Sequence Alignment: Linear Space

– Edit distance graph.

– Let f(i, j) be cheapest path from (0,0) to (i, j).

– This corresponds to optimal alignment of X[1..i] and Y[1..j]

– For any column j, can compute f(•, j) in O(mn)j time and O(m + n) space

y4

i-j

e y1 y2 y3 0-0

x1 x2

y5 y6

e

x3

m-n

The University of Sydney

Page 31

Sequence Alignment: Linear Space

– Edit distance graph.

– Let f(i, j) be cheapest path from (0,0) to (i, j).

– If only interested in the value of the optimal alignment we do it in O(mn) time and O(m + n) space.

Space-Efficient-Alignment(X,Y) {

array B[0..m,0..1]

for i = 0 to m

B[i,0] = id for j = 1 to n B[0,1] = jd

#Collapse A into an m ́2 array] # B[i,0] = A[i,j-1]

# B[i,1] = A[i,j]

#corresponds to A[0,j]

for i = 1 to m

B[i,1] = min(a[xi, yj] + B[i-1,0],

d + B[i-1,1],d + B[i,0])

endFor

Move column i of B to column 0 #(B[i,0]=B[i,1]) endFor

}

The University of Sydney Page 32

Sequence Alignment: Linear Space

– Edit distance graph.

– Let f(i, j) be length of cheapest path from (0,0) to (i, j).

– This corresponds to cost of optimal alignment of X[1..i] and Y[1..j] – Observation: f(i, j) = OPT(i, j).

e y1 y2 y3 y4 y5 y6 0-0

e

x1 x2

d d

i-j

x3

m-n

The University of Sydney

Page 33

Sequence Alignment: Linear Space

– Edit distance graph.

– Let g(i, j) be cheapest path from (i, j) to (m, n).

– This corresponds to optimal alignment of X[m..i+1] and Y[m..j+1]

– Can compute by reversing the edge orientations and inverting the roles of (0, 0) and (m, n)

0-0

x1

x2 x3

ePage 34

i-j d

d

The University of Sydney m-n y1 y2 y3 y4 y5 y6 e

Sequence Alignment: Linear Space

– Edit distance graph.

– Let g(i, j) be cheapest path from (i, j) to (m, n).

– For any column j, can compute g(•, j) for all j in O(mn) time and O(m +

n) space.

e y1 y2 0-0

x1 x2

j

y3

i-j

y4 y5 y6

e

x3

m-n

The University of Sydney

Page 35

Sequence Alignment: Linear Space

Observation 1: The cost of the cheapest path that uses (i, j) is f(i, j) + g(i, j).

e y1 y2 y3 y4 y5 y6 0-0

x1 i-j x2

e

x3

m-n

The University of Sydney

Page 36

Sequence Alignment: Linear Space

Observation 2: Let q be an index that minimizes f(q, n/2) + g(q, n/2). Then, the cheapest path from (0, 0)

e y1 y2 0-0

x1 x2

y4 y5 y6

to (m, n) uses (q, n/2).

n/2

y3

i-j

e

q

x3

m-n

The University of Sydney

Page 37

Sequence Alignment: Linear Space

Divide: Find index q that minimizes f(q, n/2) + g(q, n/2) using DP. Can be done in O(nm) time and O(n + m) space

– Align xq and yn/2.

Conquer: recursively compute optimal alignment in each piece.

e y1 y2 0-0

x1

e

y3

q,n/2

n/2

m-n

y4 y5 y6

q

x2

x3

The University of Sydney

Page 38

Pseudocode

Divide-and-Conquer alignment(X,Y) { If |X|≤2 or |Y|≤2 then

OptimalAlignment(X,Y) #Alg using quadratic space f(·,n/2)=Space-Efficient-Alignment(X,Y[1..n/2]) g(·,n/2)=Backward-S-E-Alignment(X,Y[n/2..n])

Let q be the index minimizing f(q,n/2)+g(q,n/2)

A = Divide-and-Conquer alignment(X[1..q],Y[1..n/2])

B = Divide-and-Conquer alignment(X[m..q+1],Y[n..n/2+1]) return concatenation of A and B

}

n/2

q

The University of Sydney

Page 39

Sequence Alignment: Running Time Analysis Warmup

Theorem: Let T(m, n) = max running time of algorithm on strings of length at most m and n. T(m, n) = O(mn log n).

T(m, n) £ 2T(m, n/2) + O(mn) Þ T(m, n) = O(mn logn)

Remark: Analysis is not tight because two subproblems are of size (q, n/2) and (m – q, n/2).

The University of Sydney

Page 40

Sequence Alignment: Running Time Analysis

Theorem: Let T(m, n) = max running time of algorithm on strings of length m and n. T(m, n) = kmn for some constant k > 0.

Proof: (by induction on n)

– O(mn)timetocomputef(•,n/2)andg(•,n/2)andfindindexq. – T(q, n/2) + T(m – q, n/2) time for two recursive calls.

The University of Sydney

Page 41

Sequence Alignment: Running Time Analysis

Theorem: Let T(m, n) = max running time of algorithm on strings of length m and n. T(m, n) = kmn for some constant k > 0.

Proof: (by induction on n)

– O(mn)timetocomputef(•,n/2)andg(•,n/2)andfindindexq. – T(q, n/2) + T(m – q, n/2) time for two recursive calls.

– For some constant c we have:

T(m,2) £ cm

T(2, n) £ cn

T(m, n) £ cmn+T(q, n/2)+T(m-q, n/2)

The University of Sydney

Page 42

Sequence Alignment: Running Time Analysis

Theorem: Let T(m, n) = max running time of algorithm on strings of length m and n. T(m, n) = kmn for some constant k > 0.

Proof: (by induction on n)

– O(mn)timetocomputef(•,n/2)andg(•,n/2)andfindindexq. – T(q, n/2) + T(m – q, n/2) time for two recursive calls.

– For some constant c we have:

– Basecases:m≤2orn≤2.

– Inductive hypothesis: T(m’, n’) £ 2cm’n’ for m’ and n’ with m’n’ < mn.
T(m,2) £ cm
T(2, n) £ cn
T(m, n) £ cmn+T(q, n/2)+T(m-q, n/2)
T(m,n) £ T(q,n/2)+T(m-q,n/2)+cmn £ 2cqn/2+2c(m-q)n/2+cmn
= cqn+cmn-cqn+cmn = 2cmn
Page 43
The University of Sydney
Sequence Alignment: Running Time Analysis
Theorem: An optimal alignment can be computed in O(mn) time using O(m+n) space.
The University of Sydney
Page 44
Sequence Alignment: History [m=n]
– Needleman and Wunsch 1970 O(n3)
– Sankoff 1972 O(n2)
[see also Vintsyuk’68 for speech processing
Wagner and Fisher’74 for string matching]
– Still an active research area (experimental research) Chakraborty and Angana’13 (claimed 54-90% speedup)
The University of Sydney
Page 45
Generalising the algorithm
Problem:
Nature often inserts or removes entire substrings of nucleotides (creating long gaps), rather than editing just one position at a time.
The penalty for a gap of length 10 should not be 10 times the penalty for a gap of length 1, but something significantly smaller. Can we modify the scoring function in which the penalty for a gap of length k is:
d0+d1·k ?
The University of Sydney
Page 46
Dynamic Programming Summary
– 1Ddynamicprogramming
– Weightedintervalscheduling
– SegmentedLeastSquares
– Maximum-sumcontiguoussubarray – Longestincreasingsubsequence
– 2Ddynamicprogramming – Knapsack
– Sequencealignment
– Dynamicprogrammingoverintervals
– RNASecondaryStructure
– Dynamicprogrammingoversubsets – TSP
– k-path – Playlist
The University of Sydney
Page 47