THE UNIVERSITY OF NEW SOUTH WALES
4. INTEGER MULTIPLICATION II
Raveen de Silva,
office: K17 202
School of Computer Science and Engineering UNSW Sydney
Term 3, 2021
Table of Contents
1. Recap
2. Toom's algorithm
3. Puzzle
Basics revisited: how do we multiply two numbers?
The primary school algorithm:
          X X X X    <- first input integer
        * X X X X    <- second input integer
        ---------
          X X X X
        X X X X
      X X X X
    X X X X
    ---------------
    X X X X X X X X  <- result, of length 2n

O(n^2) intermediate operations:
- O(n^2) elementary multiplications
- O(n^2) elementary additions

Can we do it faster than in n^2 many steps?

The Karatsuba trick

Take the two input numbers A and B, and split them into two halves:

A = A1 2^(n/2) + A0
B = B1 2^(n/2) + B0.

AB can now be calculated as follows:

AB = A1 B1 2^n + (A1 B0 + A0 B1) 2^(n/2) + A0 B0
   = A1 B1 2^n + ((A1 + A0)(B1 + B0) − A1 B1 − A0 B0) 2^(n/2) + A0 B0.

The Karatsuba trick

We have saved one multiplication: we now need only three, namely A0 B0, A1 B1 and (A1 + A0)(B1 + B0). The runtime satisfies

T(n) = 3 T(n/2) + c n,

and the Master Theorem gives T(n) = Θ(n^(log2 3)) = O(n^1.585).

Table of Contents

1. Recap
2. Toom's algorithm
3. Puzzle

Toom's algorithm

Can we do better if we break the numbers into more than two pieces?

Let's try breaking the numbers A, B into 3 pieces; then with k = n/3 we obtain

A = [k bits of A2][k bits of A1][k bits of A0],

i.e.,

A = A2 2^(2k) + A1 2^k + A0
B = B2 2^(2k) + B1 2^k + B0.

Toom's algorithm

Multiplying out, we get

AB =   A2 B2 2^(4k)                      <- coefficient C4
     + (A2 B1 + A1 B2) 2^(3k)            <- coefficient C3
     + (A2 B0 + A1 B1 + A0 B2) 2^(2k)    <- coefficient C2
     + (A1 B0 + A0 B1) 2^k               <- coefficient C1
     + A0 B0.                            <- coefficient C0

Toom's algorithm

Question
Can we get these five coefficients with fewer than 9 multiplications? Best case: 5 multiplications?

Should we perhaps look at

(A2 + A1 + A0)(B2 + B1 + B0) = A0 B0 + A1 B0 + A2 B0
                             + A0 B1 + A1 B1 + A2 B1
                             + A0 B2 + A1 B2 + A2 B2?

It is not at all clear how to get all five coefficients with only 5 multiplications...

The Karatsuba trick: slicing into 3 pieces

We now look for a method of getting these coefficients without any guesswork!

C4 = A2 B2
C3 = A2 B1 + A1 B2
C2 = A2 B0 + A1 B1 + A0 B2
C1 = A1 B0 + A0 B1
C0 = A0 B0

This should remind us of multiplying polynomials.

The Karatsuba trick: slicing into 3 pieces

Let

A = A2 2^(2k) + A1 2^k + A0
B = B2 2^(2k) + B1 2^k + B0.
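As an aside, the two-piece Karatsuba scheme recapped above can be sketched in runnable form. This is an illustrative sketch only (the function name and base-case threshold are our own choices); Python's built-in arbitrary-precision integers stand in for n-bit numbers.

```python
def karatsuba(a: int, b: int) -> int:
    """Multiply non-negative integers using three recursive multiplications."""
    if a < 16 or b < 16:                         # base case: one "digit" multiply
        return a * b
    half = max(a.bit_length(), b.bit_length()) // 2
    a1, a0 = a >> half, a & ((1 << half) - 1)    # a = a1 * 2^half + a0
    b1, b0 = b >> half, b & ((1 << half) - 1)    # b = b1 * 2^half + b0
    hi = karatsuba(a1, b1)                       # A1 * B1
    lo = karatsuba(a0, b0)                       # A0 * B0
    mid = karatsuba(a1 + a0, b1 + b0) - hi - lo  # A1*B0 + A0*B1, one multiply
    return (hi << (2 * half)) + (mid << half) + lo
```

Only shifts, additions and three recursive calls appear, matching the recurrence T(n) = 3 T(n/2) + c n.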
We form the naturally corresponding polynomials:

PA(x) = A2 x^2 + A1 x + A0
PB(x) = B2 x^2 + B1 x + B0,

where A = PA(2^k) and B = PB(2^k). Then the coefficients of PC(x) = PA(x) PB(x) are exactly the coefficients we want!

The Karatsuba trick: slicing into 3 pieces

Let

PC(x) = PA(x) PB(x) = C4 x^4 + C3 x^3 + C2 x^2 + C1 x + C0,

so that

A · B = PA(2^k) PB(2^k) = PC(2^k) = C4 2^(4k) + C3 2^(3k) + C2 2^(2k) + C1 2^k + C0.

Note that the right hand side involves only shifts and additions.

The Karatsuba trick: slicing into 3 pieces

Question
How do we find PC(x) without multiplying out in full?

Answer
Find the value of PC(x) at several points, using PC(x0) = PA(x0) PB(x0). Since the product polynomial PC(x) = PA(x) PB(x) is of degree 4, we need 5 values to uniquely determine PC(x). For simplicity, we choose −2, −1, 0, 1 and 2.

The Karatsuba trick: slicing into 3 pieces

PA(x) --evaluate--> PA(−2), PA(−1), PA(0), PA(1), PA(2)
PB(x) --evaluate--> PB(−2), PB(−1), PB(0), PB(1), PB(2)
                    (multiply the corresponding values)
PC(x) <--reconstruct-- PC(−2), PC(−1), PC(0), PC(1), PC(2)

The Karatsuba trick: slicing into 3 pieces

First, let's examine the "evaluate" step. For PA(x) = A2 x^2 + A1 x + A0 we have

PA(−2) = A2 (−2)^2 + A1 (−2) + A0 = 4 A2 − 2 A1 + A0
PA(−1) = A2 (−1)^2 + A1 (−1) + A0 = A2 − A1 + A0
PA(0)  = A2 0^2 + A1 · 0 + A0     = A0
PA(1)  = A2 1^2 + A1 · 1 + A0     = A2 + A1 + A0
PA(2)  = A2 2^2 + A1 · 2 + A0     = 4 A2 + 2 A1 + A0,

and we have similar expressions for PB(x). Note that these evaluations involve only a constant number of additions.

The Karatsuba trick: slicing into 3 pieces

The next step is to multiply values of PA and PB to calculate values of PC:

PC(−2) = PA(−2) PB(−2)
PC(−1) = PA(−1) PB(−1)
PC(0)  = PA(0) PB(0)
PC(1)  = PA(1) PB(1)
PC(2)  = PA(2) PB(2)

At this stage we require only five multiplications of large (i.e. n/3-bit) numbers.

The Karatsuba trick: slicing into 3 pieces

Perhaps the least straightforward step is "reconstruct". How do we go from values of PC to its coefficients?
Expanding out, we have

PC(−2) = C4 (−2)^4 + C3 (−2)^3 + C2 (−2)^2 + C1 (−2) + C0
PC(−1) = C4 (−1)^4 + C3 (−1)^3 + C2 (−1)^2 + C1 (−1) + C0
PC(0)  = C4 0^4 + C3 0^3 + C2 0^2 + C1 · 0 + C0
PC(1)  = C4 1^4 + C3 1^3 + C2 1^2 + C1 · 1 + C0
PC(2)  = C4 2^4 + C3 2^3 + C2 2^2 + C1 · 2 + C0.

This is a system of five linear equations in five variables. We know how to solve these!

The Karatsuba trick: slicing into 3 pieces

Using Gaussian elimination, we obtain

C0 = PC(0)
C1 = (1/12) [PC(−2) − 8 PC(−1) + 8 PC(1) − PC(2)]
C2 = (1/24) [−PC(−2) + 16 PC(−1) − 30 PC(0) + 16 PC(1) − PC(2)]
C3 = (1/12) [−PC(−2) + 2 PC(−1) − 2 PC(1) + PC(2)]
C4 = (1/24) [PC(−2) − 4 PC(−1) + 6 PC(0) − 4 PC(1) + PC(2)].

Note that these expressions do not involve any multiplications of TWO large numbers, and thus can be computed in linear time (i.e. O(n), where n is the number of bits).

The Karatsuba trick: slicing into 3 pieces

With the coefficients C4, C3, C2, C1, C0 obtained, we can now form the polynomial

PC(x) = C4 x^4 + C3 x^3 + C2 x^2 + C1 x + C0.

We can then compute

PC(2^k) = C4 2^(4k) + C3 2^(3k) + C2 2^(2k) + C1 2^k + C0

in linear time, using only bitwise shifts of the coefficients and a constant number of additions. Thus we have obtained A · B = PA(2^k) PB(2^k) = PC(2^k) with only 5 multiplications!

The Karatsuba trick: slicing into 3 pieces

 1: function Mult(A, B)
 2:   obtain A2, A1, A0 and B2, B1, B0 such that
        A = A2 2^(2k) + A1 2^k + A0;  B = B2 2^(2k) + B1 2^k + B0
 3:   form polynomials
        PA(x) = A2 x^2 + A1 x + A0;  PB(x) = B2 x^2 + B1 x + B0

The Karatsuba trick: slicing into 3 pieces

 4:   evaluate
        PA(−2) ← 4 A2 − 2 A1 + A0;   PB(−2) ← 4 B2 − 2 B1 + B0;
        PA(−1) ← A2 − A1 + A0;       PB(−1) ← B2 − B1 + B0;
        PA(0)  ← A0;                 PB(0)  ← B0;
        PA(1)  ← A2 + A1 + A0;       PB(1)  ← B2 + B1 + B0;
        PA(2)  ← 4 A2 + 2 A1 + A0;   PB(2)  ← 4 B2 + 2 B1 + B0.

The Karatsuba trick: slicing into 3 pieces

 5:   multiply
        PC(−2) ← Mult(PA(−2), PB(−2));
        PC(−1) ← Mult(PA(−1), PB(−1));
        PC(0)  ← Mult(PA(0), PB(0));
        PC(1)  ← Mult(PA(1), PB(1));
        PC(2)  ← Mult(PA(2), PB(2)).
The Karatsuba trick: slicing into 3 pieces

 6:   reconstruct
        C0 ← PC(0);
        C1 ← (1/12) [PC(−2) − 8 PC(−1) + 8 PC(1) − PC(2)];
        C2 ← (1/24) [−PC(−2) + 16 PC(−1) − 30 PC(0) + 16 PC(1) − PC(2)];
        C3 ← (1/12) [−PC(−2) + 2 PC(−1) − 2 PC(1) + PC(2)];
        C4 ← (1/24) [PC(−2) − 4 PC(−1) + 6 PC(0) − 4 PC(1) + PC(2)].

The Karatsuba trick: slicing into 3 pieces

 7:   form PC(x) = C4 x^4 + C3 x^3 + C2 x^2 + C1 x + C0
 8:   compute PC(2^k) = C4 2^(4k) + C3 2^(3k) + C2 2^(2k) + C1 2^k + C0
 9:   return PC(2^k) = A · B
10: end function

The Karatsuba trick: slicing into 3 pieces

How fast is this algorithm? We have replaced one multiplication of two n-bit numbers with 5 multiplications of n/3-bit numbers, with an overhead of additions, shifts and so on, all doable in linear time c n, and thus

T(n) = 5 T(n/3) + c n.

We now apply the Master Theorem. The critical exponent is c* = log3 5 ≈ 1.465, so the critical polynomial is n^(log3 5). Clearly f(n) = O(n^(log3 5 − 0.1)), so the first case of the Master Theorem applies and we get

T(n) = Θ(n^(log3 5)) = O(n^1.47).

The Karatsuba trick: slicing into 3 pieces

Recall that the original Karatsuba algorithm runs in time n^(log2 3) ≈ n^1.58 > n^1.47. Thus, we have obtained a significantly faster algorithm.
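The ten-line pseudocode above translates directly into a runnable sketch. This is our own illustrative version (the name toom3 and the base-case threshold are ours, and the sign handling is needed because the evaluations at negative points can be negative): evaluate at −2, ..., 2, five recursive multiplications, then the Gaussian-elimination formulas, in which every division is exact.

```python
def toom3(a: int, b: int) -> int:
    """Multiply integers by slicing into 3 pieces (5 recursive products)."""
    if a < 0 or b < 0:                       # recursion can produce negatives
        sign = -1 if (a < 0) != (b < 0) else 1
        return sign * toom3(abs(a), abs(b))
    if a < (1 << 16) or b < (1 << 16):       # base-case threshold (arbitrary)
        return a * b
    k = (max(a.bit_length(), b.bit_length()) + 2) // 3
    mask = (1 << k) - 1
    a0, a1, a2 = a & mask, (a >> k) & mask, a >> (2 * k)
    b0, b1, b2 = b & mask, (b >> k) & mask, b >> (2 * k)
    # evaluate PA and PB at x = -2, -1, 0, 1, 2 (additions and shifts only)
    pa = [4*a2 - 2*a1 + a0, a2 - a1 + a0, a0, a2 + a1 + a0, 4*a2 + 2*a1 + a0]
    pb = [4*b2 - 2*b1 + b0, b2 - b1 + b0, b0, b2 + b1 + b0, 4*b2 + 2*b1 + b0]
    # multiply: the only five multiplications of large numbers
    pm2, pm1, p0, p1, p2 = (toom3(x, y) for x, y in zip(pa, pb))
    # reconstruct C0..C4 via the Gaussian-elimination formulas (exact divisions)
    c0 = p0
    c1 = (pm2 - 8*pm1 + 8*p1 - p2) // 12
    c2 = (-pm2 + 16*pm1 - 30*p0 + 16*p1 - p2) // 24
    c3 = (-pm2 + 2*pm1 - 2*p1 + p2) // 12
    c4 = (pm2 - 4*pm1 + 6*p0 - 4*p1 + p2) // 24
    # form PC and evaluate it at 2^k using shifts and additions
    return (c4 << (4*k)) + (c3 << (3*k)) + (c2 << (2*k)) + (c1 << k) + c0
```

The structure mirrors the pseudocode exactly: steps 2-3 slice, step 4 evaluates, step 5 recurses, step 6 reconstructs, and steps 7-9 reassemble the product by shifting.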
Why not slice the numbers A and B into an even larger number of slices? Maybe we can get an even faster algorithm?

In a sense, BOTH yes and no. Let's see what happens if we slice the numbers into p + 1 (approximately) equal slices, where

p = 1, 2, 3, ...

The general case: slicing the input numbers A, B into p + 1 slices
For simplicity, let us assume A and B have exactly
n = (p + 1)k bits (otherwise pad out the most significant slice with zeros).
Note: p is a fixed parameter of our design – p+1 is the number of slices we are going to make, but k depends on the input values A and B and can be arbitrarily large!
The runtime of our algorithm will be a function of both p and k, but the asymptotics will be entirely in terms of k, i.e. in terms of n.
We begin by slicing A and B into p + 1 pieces each:

A = [k bits of Ap][k bits of Ap−1]...[k bits of A1][k bits of A0],

i.e.,

A = Ap 2^(pk) + Ap−1 2^((p−1)k) + ... + A1 2^k + A0
B = Bp 2^(pk) + Bp−1 2^((p−1)k) + ... + B1 2^k + B0.

Then

AB = Σ_{t=0}^{2p} ( Σ_{i+j=t} Ai Bj ) 2^(tk) = Σ_{t=0}^{2p} Ct 2^(tk).
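The identity above can be checked on a small example. A sketch (all names ours), with p = 2 and k = 8, so each number consists of three 8-bit slices:

```python
def slices(x: int, k: int, count: int):
    """Split x into `count` k-bit slices, least significant first."""
    return [(x >> (i * k)) & ((1 << k) - 1) for i in range(count)]

p, k = 2, 8
A, B = 0xABCDEF, 0x123456                 # 24-bit inputs, three slices each
As, Bs = slices(A, k, p + 1), slices(B, k, p + 1)

# C_t = sum of A_i * B_j over i + j = t, for t = 0, ..., 2p
C = [sum(As[i] * Bs[t - i] for i in range(p + 1) if 0 <= t - i <= p)
     for t in range(2 * p + 1)]

# the identity: AB equals the sum of C_t shifted by t*k bits
assert sum(Ct << (t * k) for t, Ct in enumerate(C)) == A * B
```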
We form the naturally corresponding polynomials:

PA(x) = Ap x^p + Ap−1 x^(p−1) + ... + A1 x + A0
PB(x) = Bp x^p + Bp−1 x^(p−1) + ... + B1 x + B0.

As before, we have

A = PA(2^k) and B = PB(2^k),

and

PC(x) = PA(x) PB(x) = Σ_{t=0}^{2p} Ct x^t.

We will again adopt the following strategy: first multiply PA(x) and PB(x) to obtain PC(x), then evaluate PC(2^k).

We saw that the coefficients of PC(x) are given by

Ct = Σ_{i+j=t} Ai Bj,

requiring (p + 1)^2 multiplications in total.
However, PC(x) is of degree 2p, so we can instead evaluate PC (x0) = PA(x0) · PB (x0)
at 2p+1 points (say −p,…,p) and reconstruct the polynomial from these values.
An important digression: convolution
This is in fact a very common problem, fundamental to many fields of engineering!
Define vectors v = (v0, v1, ..., vr) and w = (w0, w1, ..., ws). Then let u = (u0, u1, ..., u_{r+s}) such that

ut = Σ_{i+j=t} vi wj.

u is said to be the linear convolution of v and w, denoted u = v ⋆ w.
If we form polynomials with coefficients from ⃗v and w⃗ , their product has coefficients given by u⃗.
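A direct sketch of linear convolution and its link to polynomial products (function and variable names are ours):

```python
def convolve(v, w):
    """Linear convolution: u_t = sum of v_i * w_j over all i + j = t."""
    u = [0] * (len(v) + len(w) - 1)
    for i, vi in enumerate(v):
        for j, wj in enumerate(w):
            u[i + j] += vi * wj
    return u

# (1 + 2x + 3x^2)(4 + 5x) = 4 + 13x + 22x^2 + 15x^3
assert convolve([1, 2, 3], [4, 5]) == [4, 13, 22, 15]
```

This direct method takes Θ(rs) multiplications, which is exactly the cost the evaluate-multiply-reconstruct strategy aims to beat.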
An important digression: convolution
For example, if you have an audio signal and you want to emphasise the bass sounds, you would pass the sequence of discrete samples of the signal through a digital filter which amplifies the low frequencies more than the medium and the high audio frequencies.
This is accomplished by computing the linear convolution of the sequence of discrete samples of the signal with a sequence of values which correspond to that filter, called the impulse response of the filter.
This means that the samples of the output sound are simply the coefficients of the product of two polynomials:
1. polynomial PA(x) whose coefficients Ai are the samples of the input signal;
2. polynomial PB(x) whose coefficients Bi are the samples of the so-called impulse response of the filter (they depend on what kind of filtering you want to do).
An important digression: convolution
Convolutions are the bread-and-butter of signal processing, and for that reason it is extremely important to find fast ways of multiplying two polynomials of possibly very large degrees.
In signal processing these degrees can be greater than 1000.
This is the main reason for us to study methods of fast computation of convolutions (aside from finding products of large integers, which is what we are doing at the moment).
PA(x) --evaluate--> PA(−p), ..., PA(p)
PB(x) --evaluate--> PB(−p), ..., PB(p)
                    (multiply the corresponding values)
PC(x) <--reconstruct-- PC(−p), ..., PC(p)
To evaluate PA(x0), we compute

PA(x0) = A0 + A1 x0 + A2 x0^2 + ... + Ap x0^p.

This can be rewritten as a matrix-vector product:

PA(x0) = [1  x0  x0^2  ···  x0^p] [A0, A1, A2, ..., Ap]^T.
Then we can let

       [ 1   −p       (−p)^2       ···   (−p)^p       ]
       [ 1   −(p−1)   (−(p−1))^2   ···   (−(p−1))^p   ]
Mp  =  [ 1   −(p−2)   (−(p−2))^2   ···   (−(p−2))^p   ]
       [ ...                                          ]
       [ 1   p        p^2          ···   p^p          ],

which is a Vandermonde matrix.
Since p is a constant, we will assume that this matrix is precomputed, so it will contribute only a constant to the runtime of our algorithm.
Then we have

[ PA(−p)     ]         [ A0  ]
[ PA(−(p−1)) ]         [ A1  ]
[ PA(−(p−2)) ]  =  Mp  [ A2  ]
[ ...        ]         [ ... ]
[ PA(p)      ]         [ Ap  ].
This matrix-vector product requires O(p2) multiplications of a k-bit number Ai by an entry of Mp. However, Mp is a constant matrix of integers, so each of these multiplications takes Θ(k) time. The same is true for PB(x).
Therefore, the ‘evaluate’ step takes linear time.
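The 'evaluate' step can be sketched as an explicit matrix-vector product. Here p = 2, so Mp has 2p + 1 = 5 rows and p + 1 = 3 columns (all names and the example coefficients are ours):

```python
p = 2
points = list(range(-p, p + 1))                          # the points -p, ..., p
Mp = [[x ** j for j in range(p + 1)] for x in points]    # Vandermonde rows 1, x, x^2

A = [7, 3, 5]                                # coefficients A0, A1, A2 of PA(x)
# each entry of `values` is a dot product: only multiplications by the
# small constant entries of Mp, never of two large numbers
values = [sum(row[j] * A[j] for j in range(p + 1)) for row in Mp]

# agrees with evaluating PA(x) = 5x^2 + 3x + 7 directly
assert values == [A[0] + A[1] * x + A[2] * x * x for x in points]
```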
The next step is to multiply. We compute

PC(−p)     = PA(−p) PB(−p)
PC(−(p−1)) = PA(−(p−1)) PB(−(p−1))
PC(−(p−2)) = PA(−(p−2)) PB(−(p−2))
...
PC(p)      = PA(p) PB(p).
This requires 2p + 1 multiplications of large numbers.
The ‘reconstruct’ step requires us to go from 2p + 1 values of PC(x) to the coefficients of the polynomial.
Again we recall that

PC(x0) = C0 + C1 x0 + C2 x0^2 + ... + C2p x0^(2p)
       = [1  x0  x0^2  ···  x0^(2p)] [C0, C1, C2, ..., C2p]^T.

Then we can let

        [ 1   −p       (−p)^2       ···   (−p)^(2p)       ]
        [ 1   −(p−1)   (−(p−1))^2   ···   (−(p−1))^(2p)   ]
M2p  =  [ 1   −(p−2)   (−(p−2))^2   ···   (−(p−2))^(2p)   ]
        [ ...                                             ]
        [ 1   p        p^2          ···   p^(2p)          ].
A square Vandermonde matrix with distinct rows is invertible.
We will assume that M2p and its inverse are precomputed.
Stacking up these values, we obtain

[ PC(−p)     ]          [ C0  ]
[ PC(−(p−1)) ]          [ C1  ]
[ PC(−(p−2)) ]  =  M2p  [ C2  ]      (1)
[ ...        ]          [ ... ]
[ PC(p)      ]          [ C2p ].

Then, to find the coefficients Ct, we left-multiply both sides of (1) by M2p^(−1) to obtain

[ C0  ]               [ PC(−p)     ]
[ C1  ]               [ PC(−(p−1)) ]
[ C2  ]  =  M2p^(−1)  [ PC(−(p−2)) ]
[ ... ]               [ ...        ]
[ C2p ]               [ PC(p)      ].
Much like the ‘evaluate’ step, this requires O(p2) multiplications of a constant by a large number, which is linear time.
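The 'reconstruct' step can be sketched by solving the Vandermonde system exactly. Rather than a precomputed inverse, this sketch uses Fraction-based Gaussian elimination (all names are ours); p = 1 keeps the example small, giving a degree-2 product polynomial.

```python
from fractions import Fraction

def solve(M, y):
    """Solve the linear system M c = y exactly by Gauss-Jordan elimination."""
    n = len(M)
    M = [[Fraction(v) for v in row] + [Fraction(yi)] for row, yi in zip(M, y)]
    for col in range(n):
        piv = next(r for r in range(col, n) if M[r][col] != 0)  # find a pivot
        M[col], M[piv] = M[piv], M[col]
        M[col] = [v / M[col][col] for v in M[col]]              # normalise row
        for r in range(n):
            if r != col and M[r][col] != 0:                     # eliminate column
                M[r] = [a - M[r][col] * b for a, b in zip(M[r], M[col])]
    return [row[-1] for row in M]

p = 1                                    # small example: 2p + 1 = 3 points
points = range(-p, p + 1)
C = [6, 7, 2]                            # true coefficients of PC(x) = 2x^2 + 7x + 6
M2p = [[x ** t for t in range(2 * p + 1)] for x in points]
values = [sum(c * x ** t for t, c in enumerate(C)) for x in points]
assert solve(M2p, values) == C           # reconstruction recovers C0, C1, C2
```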
We have replaced one multiplication of two n-bit numbers with 2p + 1 multiplications of (roughly) n/(p+1)-bit numbers, plus a linear-time overhead f(n) = c n of additions, shifts and multiplications by constants, so

T(n) = (2p + 1) T(n/(p + 1)) + c n.

The critical exponent is c* = log_{p+1}(2p + 1). Then there exists a small ε > 0 such that f(n) = O(n^(c* − ε)), so Case 1 of the Master Theorem applies.

We conclude that

T(n) = Θ(n^(c*)) = Θ(n^(log_{p+1}(2p+1))).
How fast is our algorithm?
How small can log_{p+1}(2p + 1) get? Note that

log_{p+1}(2p + 1) < log_{p+1} 2(p + 1)
                  = log_{p+1} 2 + log_{p+1}(p + 1)
                  = 1 + log_{p+1} 2
                  = 1 + 1/log2(p + 1),

which can be made arbitrarily close to 1 by choosing a sufficiently large p. Therefore using a large enough number of slices allows us to get a runtime arbitrarily close to linear!

How fast is our algorithm?

Question
How many slices are needed for an O(n^1.1) time algorithm?

Answer
We want c* ≤ 1.1, so we solve

1 + 1/log2(p + 1) = 1.1,

giving log2(p + 1) = 10, i.e. p + 1 = 2^10 = 1024; so 1024 slices should be used.

How fast is our algorithm, really?

Unfortunately, there are some problems when we use large p. In our algorithm, we evaluate the polynomials PA(x) and PB(x), both of degree p, at the values x = −p, ..., p. For p = 1023, evaluating PA(p) = Ap p^p + ... + A0 involves multiplying Ap by p^p = 1023^1023 > 10^3000.
Thus, while evaluations of PA(x) and PB(x) for
x = −p, . . . , p can all be done in asymptotically linear time, the constant factor on this is absolutely humongous.
How fast is our algorithm, really?
Consequently, slicing the input numbers into more than just a few slices results in a hopelessly slow algorithm, despite the fact that the asymptotic bounds improve as we increase the number of slices!
It is for this reason that, for example, Python implements multiplication of large numbers using only two slices, i.e. the original Karatsuba algorithm.
The moral is: in practice, asymptotic estimates are useless unless the constants hidden by the big-O notation are estimated and found to be reasonably small!
Crucial question: Are there numbers x0, x1, ..., xp such that the size of xi^p does not grow uncontrollably?
Answer: YES; they are the complex numbers zi lying on the unit circle, i.e., such that |zi | = 1!
This motivates us to consider values of polynomials at inputs which are equally spaced complex numbers all lying on the unit circle.
The sequence of such values is called the discrete Fourier transform (DFT) of the sequence of the coefficients of the polynomial being evaluated.
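A quick numerical check of why points on the unit circle behave so well: powers of a unit-modulus complex number never grow, unlike the powers p^p that plagued the integer evaluation points. (This sketch uses Python's cmath; the specific root of unity chosen here is just for illustration.)

```python
import cmath

p = 1023
# a primitive (2p + 1)-th root of unity: one of 2p + 1 equally spaced
# points on the complex unit circle
z = cmath.exp(2j * cmath.pi / (2 * p + 1))

assert 1023 ** 1023 > 10 ** 3000         # the integer point p = 1023 blows up...
assert abs(abs(z ** p) - 1.0) < 1e-9     # ...but |z^p| stays equal to 1
```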
We will present a very fast algorithm for computing these values, called the Fast Fourier Transform, abbreviated as FFT.
The Fast Fourier Transform is among the most frequently executed algorithms today, and thus arguably one of the most important algorithms of all.
Every mobile phone performs thousands of FFT runs each second, for example to compress your speech signal or to compress images taken by your camera, to mention just a few uses of the FFT.
Table of Contents
1. Recap
2. Toom's algorithm
3. Puzzle
The warden meets with 23 new prisoners when they arrive. He tells them, “You may meet today and plan a strategy. But after today, you will be in isolated cells and will have no communication with one another. In the prison there is a switch room, which contains two light switches labeled A and B, each of which can be in either the on or the off position. I am not telling you their present positions. The switches are not connected to anything.
“After today, from time to time whenever I feel so inclined, I will select one prisoner at random and escort him to the switch room. This prisoner will select one of the two switches and reverse its position. He must move one, but only one of the switches. He can’t move both but he can’t move none either. Then he will be led back to his cell.
“No one else will enter the switch room until I lead the next prisoner there, and he’ll be instructed to do the same thing. I’m going to choose prisoners at random. I may choose the same guy three times in a row, or I may jump around and come back.

“But, given enough time, everyone will eventually visit the switch room many times. At any time, any one of you may declare to me:
We have all visited the switch room.
“If it is true, then you will all be set free. If it is false, and somebody has not yet visited the switch room, you will be fed to the alligators.”
What is the strategy the prisoners can devise to gain their freedom?
That’s All, Folks!!