# CS代考计算机代写 computer architecture compiler scheme mips RISC-V algorithm Lecture 6:

Lecture 6:
Arithmetic 2/3
John Owens
Introduction to Computer Architecture UC Davis EEC 170, Winter 2021

▪ A+B=S
A B
Cin
A B
Cin
A B
Cin
A B
Cin
Cout Sum
Cout Sum
Cout Sum
Cout Sum
A3
B3
A2
S3
B2
S2
A1
B1
S1
A0
B0
S0
40
UC Davis EEC 170, Winter 2021 / © John Owens

But What About Performance? ▪ Critical path of one bitslice is CP
▪ Critical path of n-bit rippled-carry adder is n*CP
– Thought experiment:
– @ 7 nm: FO4 is ~2.5 ps
– 2 gate delays * 64b = 320 ps
– 4 GHz clock period is 250 ps
▪ Design Trick:
– Throw hardware at it
A0 B0
A1 B1
A2 B2
A3 B3
CarryIn0
1-bit ALU
CarryOut0
Result0
Result1
Result2
Result3
CarryIn1
1-bit ALU
CarryOut1
CarryIn2
1-bit ALU
CarryOut2 CarryIn3
CarryOut3
41
UC Davis EEC 170, Winter 2021 / © John Owens
1-bit ALU

Truth Table for Adder Bit Slice ▪ 3 inputs (A, B, Cin); 2 outputs (Sum, Cout)
A
B
Cin
Sum
Cout
0
0
0
0
0
0
0
1
1
0
0
1
0
1
0=Cin
0
1
1
0
1=Cin
1
0
0
1
0=Cin
1
0
1
0
1=Cin
1
1
0
0
1
1
1
1
1
1
42 UC Davis EEC 170, Winter 2021 / © John Owens

A0 B0
A1 B1
A2 B2
A3 B3
Carry Look Ahead (Design trick: peek)
C0 = Cin
S
S
S
S
C4 = . . .
G
P
G = A and B P = A xor B WHY are these interesting?
43
UC Davis EEC 170, Winter 2021 / © John Owens
A B 0 0
0 1
1 0
1 1
Cout
0 “kill”
Cin “propagate” Cin “propagate” 1 “generate”
G P
G P
C2 = G1 + G0 · P1 + C0 · P0 · P1
C1 = G0 + C0 · P0
G P
C3 = G2 + G1 · P2 + G0 · P1 · P2 + C0 · P0 · P1 · P2
G P

CLA vs. Ripple
C0 = Cin
S
G = A and B P = A xor B
A0 B0
CarryIn0 A0
Result0
Result1
Result2
Result3
G P
B0 A1 A1
B1
B1
A2 B2
1-bit ALU
C1 = G0 + C0 · P0
C2 = G1 + G0 · P1 + C0 · P0 · P1
C3 = G2 + G1 · P2 + G0 · P1 · P2 + C0 · P0 · P1 · P2
G
P
CarryOut0 CarryIn1
S
1-bit ALU
G P
A2
A3 B3
CarryOut1 CarryIn2
S
1-bit ALU
G P
B2
A3 B3
CarryOut2 CarryIn3
1-bit ALU
CarryOut3
G P
S
C4 = . . .
44
UC Davis EEC 170, Winter 2021 / © John Owens

C
L C0
▪ Abstraction!
▪ Good exercise to work
A
G0 P0
C1 = G0 + C0 · P0
out what G and P are for a 4-bit adder
C2 = G1 + G0 · P1 + C0 · P0 · P1
C3 = G2 + G1 · P2 + G0 · P1 · P2 + C0 · P0 · P1 · P2
G P
C4 = . . .
45
UC Davis EEC 170, Winter 2021 / © John Owens

Design Trick: Guess (or “Precompute”)
CP(2n) = 2*CP(n)
CP(2n) = CP(n) + CP(mux)
1
0
Cout
46
UC Davis EEC 170, Winter 2021 / © John Owens

Carry Skip Adder: reduce worst case delay
B A4 B A0
P3 S P3 S
P2
P1
P0
P2
P1
P0
▪ Just speed up the slowest case for each block
▪ Exercise: optimal design uses variable block sizes (why?)
47 UC Davis EEC 170, Winter 2021 / © John Owens

▪ Reuse hardware if possible
– +/- reuse is compelling argument for 2’s complement
▪ For higher performance:
– Look for critical path, optimize for it Reorganize equations [propagate/generate / carry lookahead]
– Precompute [carry save]
– Reduce worst-case delay [carry skip]
48 UC Davis EEC 170, Winter 2021 / © John Owens

Multiply (unsigned)
▪ Paper and pencil example (unsigned):
Multiplicand 1000 Multiplier x1001 1000
0000 0000
1000 Product 01001000
▪ m bits x n bits = m+n bit product
▪ Binary makes it easy:
– 0 => place 0 ( 0 x multiplicand )
– 1 => place a copy ( 1 x multiplicand )
▪ 4 versions of multiply hardware & algorithm (see book):
– successive refinement
49 UC Davis EEC 170, Winter 2021 / © John Owens
§3.3 Multiplication

m bits x n bits = m+n bit product
How do we store a 128b result from a 64b x 64b multiply?
50 UC Davis EEC 170, Winter 2021 / © John Owens

Unsigned Combinational Multiplier
0000
A3 A2 A1 A0
A3 A2
A3 A2
A3 A2
P7 P6 P5 P4 P3 P2 P1 P0
B0 B1
B2 B3
A1
A0
A1
A0
A1
A0
51 UC Davis EEC 170, Winter 2021 / © John Owens

Unsigned Combinational Multiplier
▪ ▪

At each stage shift A left (multiply it by 2)
Use next bit of B to determine whether to add in shifted multiplicand (Stage i A3 accumulates A * 2i if Bi == 1)
A3 A2
0000
A3 A2 A1 A0 A2
B0 B1
B2 B3
A1
A0
A1
A0
Accumulate 2n bit partial product at each stage
A3 A2
A1
A0
▪ Q: How much
hardware for 32 bit multiplier? Critical path?
P7 P6 P5 P4 P3 P2 P1 P0
52 UC Davis EEC 170, Winter 2021 / © John Owens

▪ 64-bit Multiplicand reg, 64-bit ALU, 64-bit Product reg, 32-bit multiplier reg
Multiplicand
64 bits
Shift Left
Multiplier
32 bits
Control
Shift Right
64-bit ALU
Product
64 bits
Write
Multiplier = datapath + control
53
UC Davis EEC 170, Winter 2021 / © John Owens

Multiply Algorithm V1
Product
0000 0000 0011
1:0000 0010 0011 2:0000 0010 0011 3:0000 0010 0001 1:0000 0110 0001 2:0000 0110 0001 3:0000 0110 0000
0000 0110 0000
Start
1. Test Multiplier0
Multiplier Multiplicand 0000 0010
Multiplier0 = 1
Multiplier0 = 0
0000 0010
0000 0100
0000 0100 0000 0100 0000 1000 0000 1000
0000 1000
1a. Add multiplicand to product & place the result in Product register
54
UC Davis EEC 170, Winter 2021 / © John Owens
2. Shift the Multiplicand register left 1 bit.
3. Shift the Multiplier register right 1 bit.
32nd repetition?
shift P (sign ext) 11 -> nop, shift 11 -> nop, shift 01 -> add
shift done
Blue + red = multiplier
(red is 2 bits to compare); Green = arithmetic ops; Black = product
62 UC Davis EEC 170, Winter 2021 / © John Owens
0010
0010
0010
0010
0010
0010
+ 1110->
+ 0010->

Booth’s Example (2 x -3)
Operation
0. initial value
1a. P = P -m 1b.
2a.
2b.
3a.
3b. 4a.
4b.
Blue + red = multiplier (red is 2 bits to compare); Green = arithmetic ops; Black = product
Multiplicand 0010
1110 + 1110 0010
0010
+ 1110
0010 0010
Product
0000 1101 0 1110 1101 0
01
next?
10 -> sub
shift P (sign ext) 01 -> add + 0010
1111 011
0001 011 0000 1011 0
01 shift P
10 -> sub shift
1110 1011 0
1111 0101 1 11 -> nop
0010
1111 0101 1 1111 1010 1
shift done
63
UC Davis EEC 170, Winter 2021 / © John Owens

Current Bits
00 01 10 11
00 01 10 11
Bit to the Right
0 0 0 0
1 1 1 1
Explanation Example Recode
Middle of zeros Single one Begins run of 1s Begins run of 1s
Ends run of 1s Ends run of 1s Isolated 0 Middle of run
00 00 00 00 00 00 00 00 01 00 00 01 11 10 00 00 01 11 11 00
00 00 11 11 00 00 01 11 11 00 00 11 10 11 00 00 11 11 11 00
0
1 -2 -1
1
2 -1 0
Same insight as one-bit
Allows multiplication 2 bits at a time.
Booth’s, simply adjust for alignment of 2 bits.
64
UC Davis EEC 170, Winter 2021 / © John Owens

RISC-V Multiplication Support
▪ Four multiply instructions: – mul: multiply
– Gives the lower 64 bits of the product – mulh: multiply high
– Gives the upper 64 bits of the product, assuming the operands are signed
– mulhu: multiply high unsigned
– Gives the upper 64 bits of the product, assuming the operands are unsigned
– mulhsu: multiply high signed/unsigned
– Gives the upper 64 bits of the product, assuming one operand is signed and the other unsigned
– Use mulh result to check for 64-bit overflow
65 UC Davis EEC 170, Winter 2021 / © John Owens

RISC-V Support for multiply
▪ “If both the high and low bits of the same product are required, then the recommended code sequence is: MULH[[S]U] rdh, rs1, rs2; MUL rdl, rs1, rs2 (source register specifiers must be in same order andrdhcannot be the same asrs1or
rs2). Microarchitectures can then fuse these into a single multiply operation instead of performing two separate multiplies.”
66 UC Davis EEC 170, Winter 2021 / © John Owens

Multiplication Summary
▪ Iterative algorithm ▪ Design techniques:
– Analyze hardware—what’s not in use?
– Spend more hardware to get higher performance
– Booth’s Algorithm—more general (2’s complement)
– Booth’s Algorithm—recoding is powerful technique to think about problem in a different way
– Booth’s Algorithm—more bits at once gives higher performance
67 UC Davis EEC 170, Winter 2021 / © John Owens

RISC-V Support for divide
▪ 4 instructions:
-{div, divu, rem, remu} rd, rs1, rs2
– – – –
▪ “If both the quotient and remainder are required from the same division, the recommended code sequence is: DIV[U] rdq, rs1, rs2; REM[U] rdr,
rs1, rs2 (rdq cannot be the same as rs1 or rs2). Microarchitectures can then fuse these into a single divide operation instead of performing two separate divides.”
▪ Overflow and division-by-zero don’t produce errors
– Just return defined results
– Faster for the common case of no error
div: rs1 / rs2, treat as signed
divu: rs1 / rs2, treat as unsigned
rem: rs1 mod rs2, treat as signed
remu: rs1 mod rs2, treat as unsigned
72 UC Davis EEC 170, Winter 2021 / © John Owens

MIPS Support for multiply/divide
▪ Rather than target the general-purpose registers:
– –
hi
mul placed its output into two specialhiandloregisters div placed its divide output intoloand its rem output into

general-purpose register)
MIPS providedmfloandmfhiinstructions (destination:
73 UC Davis EEC 170, Winter 2021 / © John Owens

Divide: Paper & Pencil
1001 Quotient Divisor 1000 1001010 Dividend
–100100
101
1010
–1000
10 Remainder
(or Modulo result)

▪ ▪
See how big a number can be subtracted, creating quotient bit on each step
– Binary => 1 * divisor or 0 * divisor Dividend = Quotient x Divisor + Remainder 3 versions of divide, successive refinement
74 UC Davis EEC 170, Winter 2021 / © John Owens

Divide Hardware Version 1
▪ 64-bit Divisor reg, 64-bit ALU, 64-bit Remainder reg, 32-bit Quotient reg
Divisor
64 bits
Shift Right
64-bit ALU
Remainder
64 bits
Write
Quotient
32 bits
Control
Shift Left
75
UC Davis EEC 170, Winter 2021 / © John Owens

Divide Algorithm V1
▪ Takes n+1 steps for n-bit Quotient & Rem.
Start: Place Dividend in Remainder
▪ Remainder Quotient 0000 0111 0000
Divisor 0010 0000
Remainder ≥ 0
1. Subtract the Divisor register from the Remainder register, and place the result in the Remainder register.
Test Remainder
– 01010 >> 1 = 00101 = 5 = 10/2
– Each bit shifted right == divide by two
– Why?
– Compilers do this—“strength reduction”
86 UC Davis EEC 170, Winter 2021 / © John Owens

Shift Operations

With left shift, what do we shift in?
– 00011 << 2 = 01100 (arithmetic) - 0000XXXX << 4 = XXXX0000 (logical) - We shifted in zeroes How about right shift? - XXXX0000 >> 4 = 0000XXXX (logical) – Shifted in zero
– 00110 (= 6) >> 1 = 00011 (3) (arithmetic) – Shifted in zero
– 11110 (= -2) >> 1 = 11111 (-1) (arithmetic) – Shifted in one
87 UC Davis EEC 170, Winter 2021 / © John Owens

Shift Operations
– XXXX0000 >> 4 = 0000XXXX: Logical shift
– Shifted in zero
– 00110 (= 6) >> 1 = 00011 (3)
11110 (= -2) >> 1 = 11111 (-1): Arithmetic shift
– Shifted in sign bit
▪ RISC-V supports both logical and arithmetic:
– slli, srai, srli: Shift amount taken from within instruction (“imm”)
6 bits 6 bits 5 bits 3 bits 5 bits
7 bits 5 bits 5 bits 3 bits 5 bits
– sll, sra, srl: shift amount taken from register (“variable”)
– How far can we shift with slli/srai/slli? With sll/sra/srl?
7 bits
7 bits
funct6
shamt
rs1
funct3
rd
opcode
funct7
rs2
rs1
funct3
rd
opcode
88
UC Davis EEC 170, Winter 2021 / © John Owens

Combinational Shifter from MUXes
Basic Building Block
AB sel
D
8-bit right shifter
A7 A6 A5 A4 A3 A2 A1 A0
S2 S1 S0
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
10
▪ What comes in the MSBs?
▪ How many levels for 64-bit shifter? ▪ What if we use 4-1 Muxes ?
R7 R6 R5 R4 R3 R2 R1 R0
89 UC Davis EEC 170, Winter 2021 / © John Owens
10

General Shift Right Scheme using 16 bit example
S0 (0,1)
S1 (0,2)
S2 (0,4)
S3 (0,8)
▪ If we added right-to-left connections, we could supportROTATE (not in RISC-V but found in other ISAs)
90 UC Davis EEC 170, Winter 2021 / © John Owens

Funnel Shifter ▪ Shift A by i bits
▪ Problem: Set Y, X, sa ▪ Logical:
▪ Arithmetic: ▪ Rotate:
▪ Left shifts:
Extract 64 bits of 128 (sa selects which bits)
| sa |
Y
X
Y
64
X
64
R
Shift Right
91
UC Davis EEC 170, Winter 2021 / © John Owens
64 R

Barrel Shifter
▪ Technology-dependent solutions: transistor per switch
in6
in5
in4
out3
out2
out1
out0
SR3
SR2 SR1 SR0
in3
in2 in1 in0
92
UC Davis EEC 170, Winter 2021 / © John Owens

Shifter Summary
▪ Shifts common in logical ops, also in arithmetic ▪ RISC-V has:
– 2 flavors of shift: logical and arithmetic
– 2 directions of shift: right and left
– 2 sources for shift amount: immediate, variable
▪ Lots of cool shift algorithms, but …
– Barrel shifter prevalent in today’s hardware
93 UC Davis EEC 170, Winter 2021 / © John Owens

Floating Point
▪ Representation for non-integral numbers
– Including very small and very large numbers
▪ Like scientific notation
– –2.34 × 1056
– +0.002 × 10–4
– +987.02 × 109 ▪ In binary
– ±1.xxxxxxx2 × 2yyyy
▪ Typesfloatanddoublein C
normalized
not normalized
94
UC Davis EEC 170, Winter 2021 / © John Owens
§3.5 Floating Point

Floating Point Standard
▪ Defined by IEEE Std 754-1985
▪ Developed in response to divergence of representations
– Portability issues for scientific code ▪ Now almost universally adopted
▪ Two representations
– Single precision (32-bit)
– Double precision (64-bit)
95 UC Davis EEC 170, Winter 2021 / © John Owens

96 UC Davis EEC 170, Winter 2021 / © John Owens

Floating-point Formats
▪ Single-precision (32 bits)
▪ Double precision (64 bits)
97 UC Davis EEC 170, Winter 2021 / © John Owens

IEEE Floating-Point Format
x = (−1)S ×(1+Fraction)×2(Exponent−Bias)
▪ S: sign bit (0 ⇒ non-negative, 1 ⇒ negative)