# CS计算机代考程序代写 cache algorithm data structure assembly AI x86 concurrency compiler prolog Static Scheduling & VLIW 15-740

Static Scheduling & VLIW 15-740

Prof. Nathan Beckmann

(Original slides by Onur Mutlu, edited by Seth Goldstein)

Carnegie Mellon University

Reprise of dynamic scheduling

n

2

DO WE REALLY NEED ALL THIS COMPLEX HARDWARE?

HOW FAR CAN WE GET WITHOUT IT?

3

Key Questions

Q1. How do we find independent instructions to fetch/execute?

Q2. How do we enable more compiler optimizations?

e.g., common subexpression elimination, constant propagation, dead code elimination, redundancy elimination, …

Q3. How do we increase the instruction fetch rate?

i.e., have the ability to fetch more instructions per cycle

4

Key Questions

Q1. How do we find independent instructions to fetch/execute?

Q2. How do we enable more compiler optimizations?

e.g., common subexpression elimination, constant propagation, dead code elimination, redundancy elimination, …

Q3. How do we increase the instruction fetch rate?

i.e., have the ability to fetch more instructions per cycle

A: Enabling the compiler to optimize across a larger number of instructions that will be executed straight line (without branches getting in the way) eases all of the above

5

Very long instruction word – VLIW

n Compiler does the scheduling statically

n Simple hardware with multiple function units

q Reduced hardware complexity

q Little or no scheduling done in hardware, e.g., in-order q Hopefully, faster clock and less power

n Compiler required to group and schedule instructions (compare to OoO superscalar)

q Predicated instructions to help with scheduling (trace, etc.) q More registers (for software pipelining, etc.)

6

VLIW example – 2-cycle loads

n RISC code MUL R1, R3, 3

LD R4, 0(R1)

ADD R2, R2, R4

SUB R3, R3, 1

BNEZ R3, -4

n VLIW code MUL R1, R3, 3

LD R4, 0(R1)

NOP

ADD R2, R2, R4

SUB R3, R3, 1

NOP

NOP

BNEZ R3, -4

7

Very long instruction word – VLIW

n Compiler does the scheduling statically

n Simple hardware with multiple function units

n Compiler required to group and schedule instructions (compare to OoO superscalar)

n Example machines:

q Multiflow, Cydra 5 (8-16 ops per VLIW)

q IA-64 Itanium (3 ops per bundle, “EPIC” VLIW variant) q TMS32xxxx (5+ ops per VLIW, DSPs)

q Crusoe (4 ops per VLIW, JIT x86 è VLIW)

q Tilera TILEPro (3 ops per VLIW, 100+ cores)

8

Comparison between SS « VLIW

From Mark Smotherman, “Understanding EPIC Architectures and Implementations”

Comparison: CISC, RISC, VLIW

VLIW is a natural extension of RISC ideas to superscalar

VLIW relies on compiler for ILP

n

11

VLIW: Finding Independent Operations

n Within a basic block, there is limited instruction-level parallelism

n To find multiple instructions to be executed in parallel, the compiler needs to consider multiple basic blocks

n Problem: Moving an instruction above a branch is unsafe because instruction is not guaranteed to be executed

n Idea: Enlarge blocks at compile time by finding the frequently-executed paths

q Trace scheduling

q Superblock scheduling q Hyperblock scheduling q Software Pipelining

It’s all about the compiler and how to schedule the instructions to maximize parallelism

12

List Scheduling: For 1 basic block

n Idea:Assignprioritytoeachinstruction

n Initializereadylistthatholdsallreadyinstructions

n Choose one ready instruction I from ready list with the highest priority

n InsertIintoschedule

q Ensuring resources are available (structural hazards)

n Addthoseinstructionswhoseprecedenceconstraintsare now satisfied into the ready list

13

Data Precedence Graph

i1 i2 i5 i6 i7 i10 i11 i12

222222 i3 2 i8 2 i13

2

22

i4

i9

i14

4

4 i15

2 i16

14

Instruction Prioritization Heuristics

n Number of descendants in precedence graph

n Maximum latency from root node of precedence graph n Length of operation latency

n Ranking of paths based on importance

n Some combination of above

15

VLIW List Scheduling

n Assign priorities – critical path in this example

n Compute data ready list (DRL) – all operations whose predecessors

have been scheduled.

n Select from DRL in priority order while checking resource constraints n Add newly ready operations to DRL and repeat for next instruction

5

1

23334

23456

223

789

112

1011 12

1

4-wide VLIW

Data Ready List

1

{1}

6

3

4

5

{2,3,4,5,6}

9

2

7

8

{2,7,8,9}

12

10

11

{10,11,12}

13

{13}

13

16

VLIW List Scheduling

5

1 2233435364

4-wide VLIW

2

2

3

2

1

13

789

11

1011 12

17

VLIW List Scheduling

5

1 2233435364

4-wide VLIW

1

6

3

4

5

9

2

7

8

12

10

11

13

2

2

3

2

1

13

789

11

1011 12

18

VLIW List Scheduling+Structural Hazards

5

1 2233435364

4-wide VLIW

2

2

3

2

1011 12

1

13

789 11

19

VLIW List Scheduling+Structural Hazards

5

1 2233435364

4-wide VLIW

1

2

3

6

5

9

4

7

8

12

10

11

13

2

2

3

2

1011 12

1

13

789 11

20

Extending the scheduling domain

n Basic block is too small to get any real parallelism q Recall: 88% of OOO benefit from speculation è

larger scheduling window

n How to extend the basic block?

q Why do we have basic blocks in the first place?

q Loops

n Loop unrolling

n Software pipelining

q Non-loops

n Will almost always involve some speculation n Thus profiling may be very important

[MTZ’13]

21

Safety and Legality in Code Motion

n Two characteristics of speculative code motion:

q Safety: whether or not spurious exceptions may occur q Legality: whether or not result will be always correct

n Four possible types of code motion:

r4 = r1 …

r1 = …

r1 = r2 & r3

r1 = r2 & r3

(a) safe and legal

(b) illegal

r1 = …

r1 = load A

r4 = r1 …

r1 = load A

(c) unsafe

(d) unsafe and illegal

22

Code Movement Constraints

n Downward

q When moving an operation from a BB to one of its dest BB’s,

n all the other dest basic blocks should still be able to use the result of the operation

n the other source BB’s of the dest BB should not be disturbed n Upward

q When moving an operation from a BB to its source BB’s n register values required by the other dest BB’s must not be

destroyed

n the movement must not cause new exceptions

23

Trace Scheduling

n Trace: A frequently executed path in the control-flow graph (has multiple side entrances and multiple side exits)

n Idea: Find independent operations within a trace to pack into VLIW instructions.

q Traces determined via profiling

q Compiler adds fix-up code for correctness (if a side entrance or side exit of a trace is exercised at runtime, corresponding fix-up code is executed)

24

Trace Scheduling Idea

25

Trace Scheduling (II)

n There may be conditional branches from the middle of the trace (side exits) and transitions from other traces into the middle of the trace (side entrances).

n These control-flow transitions are ignored during trace scheduling.

n After scheduling, fix-up/bookkeeping code is inserted to ensure the correct execution of off-trace code.

n Fisher, “Trace scheduling: A technique for global microcode compaction,” IEEE TC 1981.

26

Trace Scheduling (III)

Instr 1

Instr 2

Instr 3

Instr 4

Instr 5

Instr 2

Instr 3

Instr 4

Instr 1

Instr 5

What bookkeeping is required when Instr 1

is moved below the side entrance in the trace?

27

Trace Scheduling (IV)

Instr 3

Instr 4

Instr 1

Instr 2

Instr 3

Instr 4

Instr 5

Instr 2

Instr 3

Instr 4

Instr 1

Instr 5

28

Trace Scheduling (V)

Instr 1

Instr 2

Instr 3

Instr 4

Instr 5

Instr 1

Instr 5

Instr 2

Instr 3

Instr 4

What bookkeeping is required when Instr 5 moves above the side entrance in the trace?

29

Trace Scheduling (VI)

Instr 5

Instr 1

Instr 2

Instr 3

Instr 4

Instr 5

Instr 1

Instr 5

Instr 2

Instr 3

Instr 4

30

Trace Scheduling Fixup Code Issues

n Sometimes need to copy instructions more than once to ensure correctness on all paths (see C below)

Original trace

A

B X C

D A’ B’ C’ Y Scheduled B

trace E DYA

C’’’ E’’ D’’

EC

BX Correctness C

B’’ X

DY

31

Trace Scheduling Overview

n Trace Selection

q select seed block (the highest frequency basic block) q extend trace (along the highest frequency edges)

forward (successor of the last block of the trace)

backward (predecessor of the first block of the trace) q don’t cross loop back edge

q bound max_trace_length heuristically

n Trace Scheduling

q build data precedence graph for a whole trace

q perform list scheduling and allocate registers

q add compensation code to maintain semantic correctness

n Speculative Code Motion (upward)

q move an instruction above a branch if safe

32

Trace Scheduling Example (I)

B1

9 stalls

B3

1 stall

fdiv f1, f2, f3 fadd f4, f1, f5 beq r1, $0

fdiv f1, f2, f3 fadd f4, f1, f5

beq r1, $0

990

ld r2, 0(r3)

990

800

800

10

r2 and f2 not live out

B3

f2 not live out

B6

B2

ld r2, 4(r3)

10

ld r2, 0(r3)

add r2, r2, 4 beq r2, $0

fsub f2, f2, f6 st.d f2, 0(r8)

add r3, r3, 4 add r8, r8, 4

add r2, r2, 4 beq r2, $0

B4

B5

200

fsub f2, f3, f7

200

B7

B6

1 stall

fsub f2, f2, f6 st.d f2, 0(r8)

add r3, r3, 4 add r8, r8, 4

33

Trace Scheduling Example (I)

9 stalls

fdiv f1, f2, f3 fadd f4, f1, f5 beq r1, $0

ld r2, 0(r3)

add r2, r2, 4 beq r2, $0

fsub f2, f2, f6 st.d f2, 0(r8)

B6

fdiv f1, f2, f3 beq r1, $0

ld r2, 0(r3) fsub f2, f2, f6

add r2, r2, 4 beq r2, $0

st.d f2, 0(r8)

add r3, r3, 4 add r8, r8, 4 fadd f4, f1, f5

1 stall

r2 and f2 not live out

B3 0 stall 0 stall

f2 not live out

1 stall

1 stall

add r3, r3, 4 add r8, r8, 4

B3 B6

34

Trace Scheduling Example (II)

fdiv f1, f2, f3 beq r1, $0

ld r2, 0(r3) fsub f2, f2, f6

add r2, r2, 4 beq r2, $0

st.d f2, 0(r8)

add r3, r3, 4

add r8,r8,4

fadd f4, f1, f5 fadd f4, f1, f5

B6

0 stall 0 stall

1 stall

fdiv f1, f2, f3 beq r1, $0

ld r2, 0(r3) fsub f2, f2, f6

add r2, r2, 4 beq r2, $0

st.d f2, 0(r8)

fadd f4, f1, f5

Split comp. code

fadd f4, f1, f5

add r3, r3, 4

B3 add r8,r8,4 B3

B6

35

Trace Scheduling Example (III)

fdiv f1, f2, f3 beq r1, $0

ld r2, 0(r3) fsub f2, f2, f6 add r2, r2, 4 beq r2, $0

st.d f2, 0(r8)

fadd f4, f1, f5

Split comp. code

fadd f4, f1, f5

add r3, r3, 4 add r8, r8, 4 fadd f4, f1, f5

B3

B6

add r3, r3, 4 add r8, r8, 4

Join comp. code

36

Trace Scheduling Example (IV)

fdiv f1, f2, f3 beq r1, $0

ld r2, 0(r3) fsub f2, f2, f6 add r2, r2, 4 beq r2, $0

st.d f2, 0(r8)

add r3, r3, 4 add r8, r8, 4 fadd f4, f1, f5

fadd f4, f1, f5

Split comp. code

fadd f4, f1, f5

B3

add r2, r2, 4 beq r2, $0

fsub f2, f2, f6 st.d f2, 0(r8)

add r3, r3, 4 add r8, r8, 4

B6

Copied split instructions

add r3, r3, 4 add r8, r8, 4

Join comp. code

37

Trace Scheduling Example (V)

ld r2, 0(r3) fsub f2, f2, f6 add r2, r2, 4

beq r2, $0

fdiv f1, f2, f3 beq r1, $0

fadd f4, f1, f5

ld r2, 4(r3)

add r2, r2, 4 beq r2, $0

B3

st.d f2, 0(r8) add r3, r3, 4 add r8, r8, 4 fadd f4, f1, f5

fadd f4, f1, f5

B6

fsub f2, f2, f6 st.d f2, 0(r8)

add r3, r3, 4 add r8, r8, 4

fsub f2, f3, f7

add r3, r3, 4 add r8, r8, 4

38

Trace Scheduling Tradeoffs

n Advantages

+ Enables the finding of more independent instructions à fewer

NOPs in a VLIW instruction

n Disadvantages

— Profile dependent

— What if dynamic path deviates from trace à lots of NOPs in the VLIW instructions

— Code bloat and additional fix-up code executed — Due to side entrances and side exits

— Infrequent paths interfere with the frequent path — Effectiveness depends on the bias of branches

— Unbiased branches à smaller traces à less opportunity for finding independent instructions

39

Superblock Scheduling

n Trace: multiple entry, multiple exit block

n Superblock: single-entry, multiple exit block

q A trace with side entrances are eliminated

q Infrequent paths do not interfere with the frequent path

+ More optimization/scheduling opportunity than traces

+ Eliminates “difficult” bookkeeping due to side entrances

Hwu+, “The Superblock: An Effective Technique for VLIW and superscalar compilation,” J of SC 1991.

40

Superblock example

opA: mul r1,r2,3

99

opC: mul r3,r2,3

opA: mul r1,r2,3

99 opB: add r2,r2,1

11

opB: add r2,r2,1

opC’: mul r3,r2,3

opC: mul r3,r2,3

1

Original Code

Code After Superblock Formation

opA: mul r1,r2,3

99

opC: mov r3,r1

1

opB: add r2,r2,1

opC’: mul r3,r2,3

Code After Common Subexpression Elimination

41

Superblock Scheduling Shortcomings

— Still profile-dependent

— No single frequently executed path if there is an unbiased

branch

— Reduces the size of superblocks

— Code bloat and additional fix-up code executed — Due to side exits

42

Hyperblock Scheduling

n Idea: Use predication support to eliminate unbiased branches and increase the size of superblocks

n Hyperblock: A single-entry, multiple-exit block with internal control flow eliminated using predication (if-conversion)

n Advantages

+ Reduces the effect of unbiased branches on scheduled block size

n Disadvantages

— Requires predicated execution support

— All disadvantages of predicated execution

43

Hyperblock Formation (I)

n Hyperblock formation 1. Block selection

2. Tail duplication

3. If-conversion

10

90 80

80 20

20

n Block selection

q Select subset of BBs for inclusion in HB

q Difficult problem 10

q Weighted cost/benefit function n Height overhead

n Resource overhead

n Dependency overhead

n Branch elimination benefit n Weighted by frequency

90 10

10

n Mahlke et al., “Effective Compiler Support for Predicated Execution Using the Hyperblock,” MICRO 1992.

BB1

BB2

BB3

BB4

BB5

BB6

44

Hyperblock Formation (II)

Tail duplication same as with Superblock formation 10

BB1

80

BB2

80

10

90 10

BB6

10

20

BB1

80

BB2

80

20

BB3

20

90

10

BB3

20

BB4

BB4

10

BB5

BB5

90

BB6

81 9

10 19

BB6’

45

Hyperblock Formation (III)

10

BB1

80

BB2

80

If-convert (predicate) intra-hyperblock branches 10

BB1 p1,p2 = CMPP

BB2 if p1

BB3 if p2

BB4

BB6

BB4

20

BB3

20

90

10

10

BB5

BB6

81 9

9

1

10

81

9

1

BB6’

BB5

BB6’

46

WHAT ABOUT MEMORY?

47

Non-Faulting Loads and Exception Propagation

ld.s r1=[a]

inst 1 inst 2 …. br

inst 1 inst 2 ….

br

unsafe code

motion

….

ld r1=[a]

use=r1

n ld.s fetches speculatively from memory

i.e. any exception due to ld.s is suppressed

n If ld.s r1 did not cause an exception then chk.s r1 is a NOP, else a branch is taken (to execute some compensation code)

….

chk.s r1

use=r1

ld r1=[a]

48

Non-Faulting Loads and Exception Propagation in IA-64

ld.s r1=[a]

inst 1 inst 2 use=r1 ….

br

inst 1 inst 2 ….

br br

unsafe code

motion

….

ld r1=[a] use=r1

n Load data can be speculatively consumed prior to check

n “speculation” status is propagated with speculated data

n Any instruction that uses a speculative result also becomes speculative itself (i.e. suppressed exceptions)

n chk.s checks the entire dataflow sequence for exceptions

….

chk.s use

ld r1=[a] use=r1

49

Aggressive ST-LD Reordering in IA-64

inst 1

inst 2

….

st [?]

st[?] ….

ld r1=[x]

use=r1

ld.a r1=[x]

inst 1

inst 2

….

st [?]

….

ld.c r1=[x] use=r1

potential aliasing

n ld.a starts the monitoring of any store to the same address as the advanced load

n If no aliasing has occurred since ld.a, ld.c is a NOP n If aliasing has occurred, ld.c re-loads from memory

50

Aggressive ST-LD Reordering in IA-64

inst 1

inst 2

….

st [?]

st[?] ….

ld r1=[x] use=r1

ld.a r1=[x]

inst 1 inst 2 use=r1 ….

st [?] …. chk.a X ….

potential aliasing

ld r1=[a] use=r1

51

Can We Do Better?

n Hyperblock still

q Profile dependent

q Requires fix-up code

q And, requires predication support

n Single-entry, single-exit enlarged blocks

q Block-structured ISA

n Optimizes multiple paths (can use predication to enlarge blocks) n No need for fix-up code (duplication instead of fixup)

52

Block Structured ISA

n Blocks (> instructions) are atomic (all-or-none) operations q Either all of the block is committed or none of it

n Compiler enlarges blocks by combining basic blocks with their control flow successors

q Branches within the enlarged block converted to “fault” operations à if the fault operation evaluates to true, the block is discarded and the target of fault is fetched

Melvin and Patt, “Enhancing Instruction Scheduling with a Block-Structured ISA,” IJPP 1995.

53

Block Structured ISA (II)

n Advantages:

+ Larger atomic blocks à larger units can be fetched from I-cache

+ Aggressive compiler optimizations (e.g. reordering) can be enabled within atomic blocks (no side entries or exits)

+ Can explicitly represent dependencies among operations within an enlarged block

n Disadvantages:

— “Fault operations” can lead to work to be wasted (atomicity)

— Code bloat (multiple copies of the same basic block exists in the binary and possibly in I-cache)

— Need to predict which enlarged block comes next n Optimizations

q Within an enlarged block, the compiler can perform optimizations that cannot normally/easily be performed across basic blocks

54

Block Structured ISA (III)

n Hao et al., “Increasing the instruction fetch rate via block- structured instruction set architectures,” MICRO 1996.

55

Superblock vs. BS-ISA

n Superblock

q Single-entry, multiple exit code block

q Not atomic

q Compiler inserts fix-up code on superblock side exit

n BS-ISA blocks

q Single-entry, single exit

q Atomic

q Need to roll back to the beginning of the block on fault

56

Superblock vs. BS-ISA

n Superblock

+ No ISA support needed

— Optimizes for only 1 frequently executed path

— Not good if dynamic path deviates from profiled path à missed opportunity to optimize another path

n Block Structured ISA

+ Enables optimization of multiple paths and their dynamic selection.

+ Dynamic prediction to choose the next enlarged block. Can dynamically adapt to changes in frequently executed paths at run- time

+ Atomicity can enable more aggressive code optimization — Code bloat becomes severe as more blocks are combined

— Requires “next enlarged block” prediction, ISA+HW support — More wasted work on “fault” due to atomicity requirement

57

Summary and Questions

n Trace, superblock, hyperblock, block-structured ISA

n How many entries, how many exits does each of them have?

q What are the corresponding benefits and downsides?

n What are the common benefits?

q Enable and enlarge the scope of code optimizations q Reduce fetch breaks; increase fetch rate

n What are the common downsides?

q Code bloat (code size increase)

q Wasted work if control flow deviates from enlarged block’s path

58

VLIW Summary

n Heavy reliance on compiler (push RISC to the extreme) q Compiler algorithms (e.g., software pipelining) have lasting

impact outside of VLIW

n Is there enough statically knowable parallelism?

q E.g., memory aliasing and branch bias n What about wasted FUs? Code bloat?

q Code size is already a big problem with x86 apps!

n Architecture joke: “VLIW is the architecture of the future,

and always will be.”

n Yet many DSPs are VLIW. Why?

59

WHAT ABOUT LOOPS?

60

The problem with loops

n Consider the following code:

for (int i = 0; i < N; i++) { b[i] = b[i] * b[i]; } 61 The problem with loops n Consider the following code: for (int i = 0; i < N; i++) { b[i] = b[i] * b[i]; } n RISC assembly (LD & MUL 2-cycles) LOOP: Useful work Loop overhead LD R1, 0(R3) MUL R2, R1, R1 ST R2, 0(R3) ADD R3, R3, 4 BLT R3, R4, LOOP è 7 cycles per iteration 62 The problem with loops n Consider the following code: for (int i = 0; i < N; i++) { b[i] = b[i] * b[i]; } n VLIW assembly (1 ALU, 1 LD/ST; 2-cycles) LOOP: è5 cycles per iteration NOP NOP LD R1, 0(R3) NOP NOP NOP Useful work MUL R2, R1, R1 ADD R,3 R,3 4 BLT R3, R4, LOOP ST R2, -4(R3) Loop overhead 63 Amortize overheads by unrolling loops n Key idea is to schedule the following code instead: for (int i = 0; i < N; i+=4) { b[i+0] = b[i+0] * b[i+0]; b[i+1] = b[i+1] * b[i+1]; b[i+2] = b[i+2] * b[i+2]; b[i+3] = b[i+3] * b[i+3]; } Loop unrolling è Larger scheduling block è Better schedule 64 Amortize overheads by unrolling loops n VLIW assembly LOOP: NOP NOP è 2 cycles per iteration LD R1, 0(R9) LD R3, 4(R9) LD R5, 8(R9) LD R7, 12(R9) ST R2, 0(R9) ST R4, 4(R9) ST R6, 8(R9) ST R8, -4(R9) MUL R2, R1, R1 MUL R4, R3, R3 MUL R6, R5, R5 MUL R8, R7, R7 ADD R9, R9, 16 BLT R9, R10 LOOP Loop overhead Useful work 65 Correctness of loop unrolling n Is this transformation legal? for (int i = 0; i < N; i++) { b[i] = b[i] * b[i]; } for (int i = 0; i < N; i+=4) { b[i+0] = b[i+0] * b[i+0]; b[i+1] = b[i+1] * b[i+1]; b[i+2] = b[i+2] * b[i+2]; b[i+3] = b[i+3] * b[i+3]; } 66 Correctness of loop unrolling n Instead, schedule the following code: int i; for (i = 0; i+3 < N; i+=4) { b[i+0] = b[i+0] * b[i+0]; b[i+1] = b[i+1] * b[i+1]; b[i+2] = b[i+2] * b[i+2]; b[i+3] = b[i+3] * b[i+3]; } for (; i < N; i++) { b[i] = b[i] * b[i]; } 67 Loop unrolling summary n Advantages q Reduces loop overhead q Improves code schedule within loop n (eg, hiding MUL latency in example) n Disadvantages q Increases code size q Less effective on loops with internal branches n Can use predication … more on this later 68 Loop Unrolling n Idea: Replicate loop body multiple times within an iteration + Reduces loop maintenance overhead q Induction variable increment or loop condition test + Enlarges basic block (and analysis scope) q Enables code optimization and scheduling opportunities — What if iteration count not a multiple of unroll factor? (need extra code to detect this) — Increases code size 69 Can we do better? n VLIW assembly (ALU + LD/ST) LOOP: NOP NOP LD R1, 0(R9) LD R3, 4(R9) LD R5, 8(R9) LD R7, 12(R9) ST R2, 0(R9) ST R4, 4(R9) ST R6, 8(R9) ST R8, -4(R9) MUL R2, R1, R1 MUL R4, R3, R3 MUL R6, R5, R5 MUL R8, R7, R7 ADD R9, R9, 16 BLT R9, R10 LOOP è 2 cycles per iteration Loop overhead Useful work Not in this case – LD/ST unit is at 100% utilization 70 Can we do better? n VLIW assembly (3-wide) LOOP: NOP NOP NOP NOP NOP LD R1, 0(R9) LD R3, 4(R9) LD R5, 8(R9) LD R7, 12(R9) MUL R2, R1, R1 MUL R4, R3, R3 MUL R6, R5, R5 MUL R8, R7, R7 ST R2, 0(R9) ST R4, 4(R9) ST R6, 8(R9) ST R8, 12(R9) NOP ADD R9, R9, 16 BLT R9, R10, LOOP NOP Loop overhead è 7/4 cycles per iteration Useful work 71 Can we do better? n VLIW assembly (3-wide) LOOP: NOP NOP NOP NOP NOP LD R1, 0(R31) LD R3, 4(R31) LD R5, 8(R31) LD R7, 12(R31) LD R9, 12(R31) LD R11, 12(R31) LD R13, 12(R31) LD R15, 12(R31) MUL R2, R1, R1 MUL R4, R3, R3 MUL R6, R3, R3 MUL R8, R3, R3 MUL R10, R3, R3 MUL R12, R3, R3 MUL R14, R5, R5 MUL R16, R7, R7 ST R2, 0(R31) ST R4, 4(R31) ST R6, 8(R31) ST R8, 12(R31) ST R10, 16(R31) ST R12, 20(R31) ST R14, 24(R31) ST R16, -4(R31) NOP ADD R31, R31, 32 BLT R31, R10, LOOP NOP è 11/8 cycles per iteration Loop overhead Useful work …but code & register bloat 72 Can we do better than unrolling? n VLIW assembly (3-wide) LOOP: NOP NOP Ramp-up & ramp-down overhead NOP Loop overhead Useful work NOP NOP NOP LD R1, 0(R31) LD R3, 4(R31) LD R5, 8(R31) LD R7, 12(R31) LD R9, 12(R31) LD R11, 12(R31) LD R13, 12(R31) LD R15, 12(R31) MUL R2, R1, R1 MUL R4, R3, R3 MUL R6, R3, R3 MUL R8, R3, R3 MUL R10, R3, R3 MUL R12, R3, R3 MUL R14, R5, R5 MUL R16, R7, R7 ST R2, 0(R31) ST R4, 4(R31) ST R6, 8(R31) ST R8, 12(R31) ST R10, 16(R31) ST R12, 20(R31) ST R14, 24(R31) ST R16, -4(R31) NOP ADD R31, R31, 32 BLT R31, R10, LOOP 73 Can we do better than unrolling? n VLIW assembly (3-wide) LOOP: NOP NOP Ramp-up & ramp-down overhead NOP NOP NOP LD R1, 0(R31) LD R3, 4(R31) LD R5, 8(R31) LD R7, 12(R31) LD R9, 12(R31) LD R11, 12(R31) LD R13, 12(R31) LD R15, 12(R31) MUL R2, R1, R1 MUL R4, R3, R3 MUL R6, R3, R3 MUL R8, R3, R3 MUL R10, R3, R3 MUL R12, R3, R3 MUL R14, R5, R5 MUL R16, R7, R7 ST R2, 0(R31) ST R4, 4(R31) ST R6, 8(R31) ST R8, 12(R31) ST R10, 16(R31) ST R12, 20(R31) ST R14, 24(R31) ST R16, -4(R31) NOP ADD R31, R31, 32 BLT R31, R10, LOOP Perfect efficiency NOP Loop overhead Useful work Goal: Maintain peak efficiency, w/out ramp-up/down 74 Software Pipelining n Idea: Move instructions across iterations of the loop q Very large improvements in running time are possible n E.g., 5-wide VLIW LOOP: { LD R1, 0(R9) NOP NOP NOP LD R1, 4(R9) MUL R2, R1, R1 ADD R9, R9, 4 NOP ADD R9, R9, 4 BGE R9, R10, END Current iteration One iteration ago Two iterations ago NOP SUB R10, R10, 4 LD R1, 0(R9) MUL R2, R1, R1 ST R2, -8(R9) ADD R9, R9, 4 BLT R9, R10, LOOP } NOP ST R2, -4(R9) NOP ST R2, 0(R9) END: NOP NOP NOP NOP NOP MUL R2, R1, R1 NOP NOP 15-745 © Seth Copen Goldstein 2000-5 75 Software Pipelining n Idea: Move instructions across iterations of the loop q Very large improvements in running time are possible n E.g., 5-wide VLIW LD R1, 0(R9) NOP NOP NOP LD R1, 4(R9) MUL R2, R1, R1 NOP ADD R9, R9, 4 NOP ADD R9, R9, 4 BGE R9, R10, END è 1 cycle per iteration } NOP ST R2, -4(R9) NOP ST R2, 0(R9) LOOP: Perfect efficiency END: { SUB R10, R10, 4 LD R1, 0(R9) NOP NOP NOP NOP NOP MUL R2, R1, R1 NOP NOP MUL R2, R1, R1 ST R2, -8(R9) ADD R9, R9, 4 BLT R9, R10, LOOP 15-745 © Seth Copen Goldstein 2000-5 76 Goal of SP n Increase distance between dependent operations by moving destination operation to a later iteration A:a¬ ld[d] B:b¬ a*a C: st [d], b D:d¬ d+4 Assume all have latency of 2 A B C D 15-745 © Seth Copen Goldstein 2000-5 77 Can we decrease the latency? n Lets unroll A: a¬ld[d] B: b¬a*a C: st [d], b D: d¬d+4 A1: a ¬ ld [d] B1: b ¬ a * a C1: st [d], b D1: d¬ d+4 A B C D A1 B1 C1 D1 15-745 © Seth Copen Goldstein 2000-5 78 Rename variables A: a¬ ld[d] B: b¬ a*a C: st [d], b D: d1¬d+4 A1: a1 ¬ ld [d1] B1: b1¬ a1*a1 C1: st [d1], b1 D1:d¬ d1+4 A B C D A1 B1 C1 D1 15-745 © Seth Copen Goldstein 2000-5 79 Schedule A: a¬ ld[d] B: b¬ a*a C: st [d], b D: d1¬d+4 A1: a1 ¬ ld [d1] B1: b1¬ a1*a1 C1: st [d1], b1 D1:d¬ d1+4 AD B A1 C B1 D1 C1 A B C D1 D A1 B1 C1 15-745 © Seth Copen Goldstein 2000-5 80 Unroll Some More A: a¬ B: b¬ C: D: d1¬ A1: a1 ¬ B1: b1¬ C1: D1: d2¬ A2: a2 ¬ B2: b2¬ C2: D2: d ¬ ld[d] a*a st [d], b d+4 ld [d1] a1*a1 st [d1], b1 d1+4 ld [d2] a2*a2 st [d2], b2 d2 + 4 AD B C A1 D1 B1 A2 C1 C2 D2 B2 A B C D2 D A1 B1 C1 D1 A2 B2 C2 15-745 © Seth Copen Goldstein 2000-5 81 Unroll Some More A: a ¬ B: b¬ C: D: d1¬ A1: a1 ¬ B1: b1¬ C1: D1: d2¬ A2: a2 ¬ B2: b2¬ C2: D2: d¬ ld [d] a*a st [d], b d+4 ld [d1] a1*a1 st [d1], b1 d1+4 ld [d2] a2*a2 st [d2], b2 d2+4 AD B C A1 D1 B1 A2 D2 C1 B2 A3 C2 B3 C3 A B C D3 D A1 B1 C1 D1 A2 B2 C2 D2 A3 B3 C3 D3 15-745 © Seth Copen Goldstein 2000-5 82 A B C D4 D A1 B1 C1 D1 A2 B2 C2 D2 A3 B3 C3 D3 A4 B4 C4 One More Time AD B C A1 D1 B1 A2 D2 C1 B2 A3 C2 B3 C3 D3 A4 B4 C4 15-745 © Seth Copen Goldstein 2000-5 83 A B C D4 D A1 B1 C1 D1 A2 B2 C2 D2 A3 B3 C3 D3 A4 B4 C4 Can Rearrange AD B C A1 D1 B1 A2 D2 C1 B2 A3 C2 B3 C3 D3 A4 B4 C4 15-745 © Seth Copen Goldstein 2000-5 84 Rearrange AD B C A1 D1 B1 A2 D2 C1 B2 A3 C2 B3 C3 A B C D3 D A1 B1 C1 D1 A2 B2 C2 D2 A3 B3 C3 D3 15-745 © Seth Copen Goldstein 2000-5 85 Rearrange AD B C A1 D1 B1 A2 D2 C1 B2 A3D3 C2 B3 C3 A B C D3 D A1 B1 C1 D1 A2 B2 C2 D2 A3 B3 C3 15-745 © Seth Copen Goldstein 2000-5 86 SP Loop A: a ¬ B: b¬ D: d1¬ A1: a1 ¬ D1: d2¬ C: B1: b1¬ A2: a2 ¬ D2: d¬ B2: b2¬ C1: D3: d2¬ C2: ld [d] a*a d+4 ld [d1] d1+4 st [d], b a1*a1 ld [d2] d2+4 a2*a2 st [d1], b1 d1+4 st [d2], b2 Prolog Body Epilog A B C C C D3 D A1 B1 B1 B1 C1 D1 A2 A2 A2 B2 C2 D2 D2 D2 15-745 © Seth Copen Goldstein 2000-5 87 Goal of Software Pipelining n Increase distance between dependent operations by moving destination operation to a later iteration A B C dependencies in initial loop A after SP B C i+1 i+2 iteration i 15-745 © Seth Copen Goldstein 2000-5 88 Goal of Software Pipelining n Discover ILP across iterations! A0 A0 A1 B0 A2 B1 C0 A3 B2 C1 B3 C2 C3 A1 B0 Ai Bi-1 Ci-2 Bi Ci-1 Ci 15-745 © Seth Copen Goldstein 2000-5 89 Example Assume operating on a infinite wide machine A0 A1 B0 Prolog loop body epilog Ai Bi-1 Ci-2 Bi Ci-1 Ci 15-745 © Seth Copen Goldstein 2000-5 90 Dealing with exit conditions for (i=0; i= N) goto done

A0

B0

if (i+1 == N) goto last i=1

A1

if (i+2 == N) goto epilog i=2

{

}

loop: Ai Ai

Bi Bi-1 Ci Ci-2

i++

if (i < N) goto loop epilog: Bi Ci-1 last: ci done: 15-745 © Seth Copen Goldstein 2000-5 91 Loop Unrolling vs. Software Pipelining For SuperScalar or VLIW n Loop Unrolling reduces loop overhead n Software Pipelining reduces fill/drain n Best is if you combine them Software Pipelining Loop Unrolling Time 15-745 © Seth Copen Goldstein 2000-5 92 TODO: More complicated SP example 93 Software Pipelining Approaches n The first serious approach to software pipelining was presented by Aiken & Nicolau. q Aiken’s 1988 Ph.D. thesis. q Impractical as it ignores resource hazards (focusing only on data-dependence constraints). n “Iterative Modulo Scheduling” Rau MICRO’94 q Addresses resource constraints q Iterate over different loop lengths until one schedules successfully q Compute loop lower/upper bounds from required & available resources 94 SYSTOLIC ARRAYS 95 Why Systolic Architectures? n Idea: Data flows from the computer memory in a rhythmic fashion, passing through many processing elements before it returns to memory n Similar to an assembly line q Different people work on the same car q Many cars are assembled simultaneously q Can be two-dimensional n Special purpose accelerators/architectures need q Simple, regular designs (keep # unique parts small and regular) q High concurrency à high performance q Balanced computation and I/O (memory access) 96 Systolic Architectures n H. T. Kung, “Why Systolic Architectures?,” IEEE Computer 1982. Memory: heart PEs: cells Memory pulses data through cells 97 Systolic Architectures n Basic principle: Replace a single PE with a regular array of PEs and carefully orchestrate flow of data between the PEs à achieve high throughput w/o increasing memory bandwidth requirements n Differences from pipelining: q Array structure can be non-linear and multi-dimensional q PE connections can be multidirectional (and different speed) q PEs can have local memory and execute kernels (rather than a piece of the instruction) 98 Systolic Computation Example n 99 Systolic Computation Example: Convolution 100 Systolic Computation: Convolution y1 w3 w2 x2 x1 y1 w1 w3 w2 w1 x3 x2 x1 x3 y1 w3 y1 w2 x2 y2 w1 y2 w3 w2 w1 x4 x3 x2 y2 y3 w3 w2 x4 x3 w1 101 Systolic Computation: Convolution y1 w3 w2 w1 x2 x1 102 Systolic Computation: Convolution y1 w3 w2 w1 x3 x2 x1 103 Systolic Computation: Convolution y1 y2 w3 w2 w1 x3 x2 104 Systolic Computation: Convolution y1 y2 w3 w2 w1 x4 x3 x2 105 Systolic Computation: Convolution y1 y2 y3 w3 w2 w1 x4 x3 106 Systolic Computation Example: Convolution n Worthwhile to implement adder and multiplier separately to allow overlapping of add/multiply executions 107 TODO: Example relating SP to systolic architecture for some computation (maybe the convolution) 108 More Programmability n Each PE in a systolic array q Can store multiple “weights” q Weights can be selected on the fly q Eases implementation of, e.g., adaptive filtering n Taken further q Each PE can have its own data and instruction memory q Data memory à to store partial/temporary results, constants q Leads to stream processing, pipeline parallelism n Moregenerally,stagedexecution 109 Pipeline Parallelism 110 File Compression Example 111 Why pipeline parallelism in software? n Pipeline parallelism vs data parallelism q Why split pipeline stages across PEs? q No cycle-time benefit like we got in hardware n Data movement patterns differ q Pipeline parallelism: move input data between PEs q Data parallelism: move task code/data between PEs n Tight feedback loops within single stage q E.g., compression or encryption n Appropriate design depends on application 112 Systolic Array Summary n Advantages q Makes multiple uses of each data item à reduce data fetches q High concurrency q Regular design (both data and control flow) n Disadvantages q Not good at exploiting irregular parallelism q Relatively special purpose à need software, programmer support to be a general purpose model 113 The WARP Computer n HT Kung, CMU, 1984-1988 n Linear array of 10 cells, each cell a 10 Mflop programmable processor n Attached to a general purpose host machine n High-level language and optimizing compiler to program the systolic array n Used extensively to accelerate vision and robotics tasks n Annaratone et al., “Warp Architecture and Implementation,” ISCA 1986. n Annaratone et al., “The Warp Computer: Architecture, Implementation, and Performance,” IEEE TC 1987. 114 The WARP Computer 115 The WARP Computer 116 Resource Constraints n Minimally indivisible sequences, i and j, can execute together if combined resources in a step do not exceed available resources. n R(i) is a resource configuration vector R(i) is the number of units of resource i n r(i) is a resource usage vector s.t. 0 £ r(i) £ R(i) n Each node in G has an associated r(i) 15-745 © Seth Copen Goldstein 2000-5 117 SCALAR REPLACEMENT 118 Scalar Replacement: Example DO I = 1, N DO J = 1, M A(I) = A(I) + B(J) ENDDO ENDDO DO I = 1, N T = A(I) DO J = 1, M T = T + B(J) ENDDO A(I) = T ENDDO n AllloadsandstorestoAintheinner loop have been saved n A(I)canbeleftinaregister throughout the inner loop n Superscalar+cachewillgetmostof n HighchanceofTbeingallocateda this, but not allocate A(I) to register register by compiler Scalar Replacement n Convert array reference to scalar reference n Approach is to use dependences to achieve these memory hierarchy transformations Dependence and Memory Hierarchy n True or Flow – save loads and cache miss n Anti – save cache miss? n Output – save stores n Input – save loads A(I) = … + B(I) … =A(I)+k A(I) = … … = B(I) Dependence and Memory Hierarchy n Loop Carried dependences – Consistent dependences most useful for memory management purposes n Consistent dependences – dependences with constant dependence distance Using Dependences n In the reduction example DO I = 1, N DO J = 1, M A(I) = A(I) + B(J) ENDDO ENDDO DO I = 1, N T = A(I) DO J = 1, M T = T + B(J) ENDDO A(I) = T ENDDO n True dependence – replace the references to A in the inner loop by scalar T n Output dependence – store can be moved outside the inner loop n Antidependence – load can be moved before the inner loop Scalar Replacement n Example: Scalar Replacement in case of loop independent dependence DO I = 1, N A(I) = B(I) + C X(I) = A(I)*Q ENDDO DO I = 1, N t = B(I) + C A(I) = t X(I) = t*Q ENDDO n One less load for each iteration for reference to A Scalar Replacement n Example: Scalar Replacement in case of loop carried dependence spanning single iteration DO I = 1, N A(I) = B(I-1) B(I) = A(I) + C(I) ENDDO tB = B(0) DO I = 1, N tA = tB A(I) = tA tB = tA + C(I) B(I) = tB ENDDO n One less load for each iter for ref to B which had a loop carried true dependence of 1 iter n Also one less load per iter for reference to A Scalar Replacement n Example: Scalar Replacement in case of loop carried dependence spanning multiple iterations DO I = 1, N A(I) = B(I-1) + B(I+1) ENDDO n t1 = B(0) t2 = B(1) DO I = 1, N t3 = B(I+1) A(I) = t1 + t3 t1 = t2 t2 = t3 ENDDO One less load for each iter for ref to B which had a loop carried input dependence of 2 iters Invariants maintained were t1=B(I-1); t2=B(I); t3=B(I+1) n Aiken/Nicolau Scheduling Step 1 Perform scalar replacement to eliminate memory references where possible. for i:=1 to N do a := j Å V[i-1] b := a Å f c := e Å j d := f Å c e := b Å d f := U[i] g: V[i] := b h: W[i] := d j := X[i] for i:=1 to N do a:=jÅb b:=aÅf c:=eÅj d:=fÅc e:=bÅd f := U[i] g: V[i] := b h: W[i] := d j := X[i] 15-745 © Seth Copen Goldstein 2000-5 127 Aiken/Nicolau Scheduling Step 2 Unroll the loop and compute the data-dependence graph (DDG). DDG for rolled loop: for i:=1 to N do a := j Å b b := a Å f c := e Å j d := f Å c e := b Å d f := U[i] g: V[i] := b h: W[i] := d j := X[i] ac j bd gf h e 15-745 © Seth Copen Goldstein 2000-5 128 Aiken/Nicolau Scheduling Step 2, cont’d DDG for unrolled loop: for i:=1 to N do a := j Å b b := a Å f c:=eÅj d:=fÅc e:=bÅd f:=U[i] g: V[i] := b h: W[i] := d j := X[i] a1 c1 b1 d1 g a j1 e h1 121 b f1 c 22 g2 a3 b 3 j2 d 2 f2 e2 c3 h1 15-745 © Seth Copen Goldstein 2000-5 129 Aiken/Nicolau Scheduling Step 3 Build a tableau of iteration number vs cycle time. ac 1 1 b d1 1j iteration 123456 1 acfjfjfjfjfjfj 2bd 3egha 1 h 4 cb g1a2 e115dga f 6ehb b1c 7 cga 2j28db 2d 9 ehga g2a3 2 10 cb fh11 dga 2e112 ehb b3 2 13 cg c 14 d 3 15 eh 15-745 © Seth Copen Goldstein 2000-5 130 cycle Aiken/Nicolau Scheduling Step 3 Build a tableau of iteration number vs cycle time. basically, you’re emulating iteration 1 1 123456 ac a superscalar with infinite 1 acfjfjfjfjfjfj 2 bd 3 egh a 4 cb 5 dga 6 eh b 7cga 8db 9 ehga resources, infinite register b d1 renaming, always predicting 1j the loop-back branch: h g a 1 e 1 121 thus, just pure f1 b2jc2 data dependency 2d g2a3 2 10 cb fh11 dga 2e112 ehb b3 2 13 cg c 14 d 3 15 eh 15-745 © Seth Copen Goldstein 2000-5 131 cycle Aiken/Nicolau Scheduling Step 4 Find repeating patterns of instructions. iteration 123456 1 acfjfjfjfjfjfj 2 bd 34 egh acb 5 dga 6 ehb 7 cg a 8db 9 10 11 12 13 14 15 eh g a cb dga ehb cg d eh 15-745 © Seth Copen Goldstein 2000-5 132 cycle Aiken/Nicolau Scheduling Step 4 Find repeating patterns of instructions. iteration 123456 1 acfjfjfjfjfjfj 2 bd 34 egh acb 5 dga 6 ehb 7 cg a 8db 9 10 11 12 13 14 15 eh g a cb dga ehb cg d eh 15-745 © Seth Copen Goldstein 2000-5 133 cycle Aiken/Nicolau Scheduling Step 4 Find repeating patterns of instructions. iteration 123456 1 acfjfjfjfjfjfj 2 bd 34 egh acb 5 dga 6 ehb 7 cg a 8db Go back and relate slopes to DDG 9 10 11 12 13 14 15 eh g a cb dga ehb cg d eh 15-745 © Seth Copen Goldstein 2000-5 134 cycle Aiken/Nicolau Scheduling Step 5 “Coalesce” the slopes. iteration iteration 123456 123456 1 acfjfjfjfjfjfj 2 bd 3 egh a 4 cb 1 acfj 2 bd 3 egh 4 fj a cb fj dg a eh b fj 5 dga 6 ehb 7 cg a 8db8db 9 10 11 12 13 14 15 eh g a cb dga ehb cg d eh cg a 9 eh g fj 5 6 7 10 ca 11 db 12 eh g 13 c 14 d 15 eh 15-745 © Seth Copen Goldstein 2000-5 135 cycle cycle Aiken/Nicolau Scheduling Step 6 Find the loop body and “reroll” the loop. iteration 123456 1 acfj 2bd 3 egh 4 fj a cb fj dga ehbfj 5 6 7 8db 9 10 11 12 13 14 15 ehg fj ca db eh g c d eh cg a 15-745 © Seth Copen Goldstein 2000-5 136 cycle Aiken/Nicolau Scheduling Step 6 Find the loop body and “reroll” the loop. Prologue/entry code Loop body Epilogue/exit code iteration 123456 1 acfj 2bd fj 3 egh a 4 cb fj 5 dga 6 ehbfj 78 cg a 910 11 db 12 eh g 13 c 14 d 15 eh db ehg fj ca 15-745 © Seth Copen Goldstein 2000-5 137 cycle Aiken/Nicolau Scheduling Step 7 Generate code. (Assume VLIW-like machine for this example. The instructions on each line should be issued in parallel.) L: bi+1 := ai Å fi W[i] := di ai+2 := ji+1 Å bi+1 i := i+1 bN :=aN Å fN-1 W[N-1] := dN-1 w[N]:=dN a1:=j0 Å b0 b1:=a1 Å f0 e1:=b1 Å d1 c2:=e1 Å j1 d2:=f1 Å c2 e2:=b2 Å d2 c3:=e2 Å j2 di:=fi-1Åci ei := bi Å di ci+1:=eiÅji c1:=e0 Å j0 d1:=f0 Å c1 V[1] := b1 b2:=a2 Å f1 V[2] := b2 W[2] := d2 V[3] := b3 f1:=U[1] f2:=U[2] W[1] := d1 f3 := U[3] a3:=j2 Å b3:=a3 Å a4:=j3 Å j1 := X[1] j2 := X[2] a2:=j1 Å b1 j3 := X[3] b2 f2 f4 := U[4] b3 i:=3 j4 := X[4] dN-1 :=fN-2 Å cN-1 eN-1 := bN-1 Å dN-1 cN := eN-1 Å jN-1 dN := fN-1 + cN eN :=bN Å dN V[i+1] := bi+1 v[N] := bN fi+2 := U[I+2] ji+2 := X[i+2] if i

n For an edge, u®v, we must have s(v)-s(u) 3 d(u,v)

15-745 © Seth Copen Goldstein 2000-5 149

Precedence Constraints

n Cyclic: constraint becomes a tuple: q p is the minimum iteration delay

(or the loop carried dependence distance) q d is the delay

n For an edge, u®v, we must have s(v)-s(u) 3 d(u,v)-s*p(u,v)

np30

n If data dependence is

q within an iteration, p=0

q loop-carried across p iter boundaries, p>0

15-745 © Seth Copen Goldstein 2000-5 150

Iterative Approach

n Finding minimum S that satisfies the constraints is NP- Complete.

n Heuristic:

q Find lower and upper bounds for S

q foreach s from lower to upper bound? n Schedule graph.

n If succeed, done

n Otherwise try again (with next higher s)

n Thus: “Iterative Modulo Scheduling” Rau MICRO’94

15-745 © Seth Copen Goldstein 2000-5 151

Iterative Approach

n Heuristic:

q Find lower and upper bounds for S

q foreach s from lower to upper bound

n Schedule graph.

n If succeed, done

n Otherwise try again (with next higher s)

n So the key difference:

q AN88 does not assume S when scheduling

q IMS must assume an S for each scheduling attempt to understand resource conflicts

15-745 © Seth Copen Goldstein 2000-5 152

Lower Bounds

n Resource Constraints: SR

maximum over all resources of # of uses divided by #

available…

n Precedence Constraints: SE

max delay over all cycles in dataflow graph

In practice, one is easy, other is hard.

Tim’s secret approach: just use SR as lower bound, then do binary search for best S

15-745 © Seth Copen Goldstein 2000-5 153

Acyclic Example

a <0,2>

<0,3> c

b <0,1>

Lower Bound: SR=2 Upper Bound: 5

15-745 © Seth Copen Goldstein 2000-5 154

Lower Bound on s

• Assume 1 ALU and 1 MU

• Assume latency Op or load is 1 cycle

a

c

for i:=1 to N do a := j Å b b := a Å f c := e Å j d:=fÅc e := b Å d

f := U[i]

g: V[i] := b

h: W[i] := d

j := X[i]

<1,1> <0,1>

<1,1> <0,1>

<1,1> <0,1>

j

b

h

f

d

g

<0,1>

<0,1>

<1,1>

Resources => 5 cycles Dependencies => 3 cycles

e

15-745 © Seth Copen Goldstein 2000-5 155

Scheduling data structures

To schedule for initiation interval s:

n Create a resource table with s rows and R columns

n Create a vector, s, of length N for n instructions in the loop

q s[n] = the time at which n is scheduled, or NONE

n Prioritize instructions by some heuristic q critical path (or cycle)

q resource critical

15-745 © Seth Copen Goldstein 2000-5 156

Scheduling algorithm

n Pick an instruction, n

n Calculate earliest time due to dependence constraints For all x=pred(n),

earliest = max(earliest, s(x)+d(x,n)-s.p(x,n)) n try and schedule n from earliest to (earliest+s-1)

s.t. resource constraints are obeyed.

q possible twist: deschedule a conflicting node to make way for n, maybe randomly, like sim anneal

n If we fail, then this schedule is faulty (i.e. give up on this s)

15-745 © Seth Copen Goldstein 2000-5 157

Scheduling algorithm – cont.

n We now schedule n at earliest, I.e., s(n) = earliest n Fix up schedule

q Successors, x, of n must be scheduled s.t.

s(x) >= s(n)+d(n,x)-s.p(n,x), otherwise they are removed

(descheduled) and put back on worklist.

n repeat this some number of times until either

q succeed, then register allocate q fail, then increase s

15-745 © Seth Copen Goldstein 2000-5 158

Simplest Example

for () {

a = b+c

b = a*a

c = a*194 }

<1,1>

<1,1> <0,1> <0,1>

b

Resources:

a

1

1

c

What is IIres? What is IIrec?

15-745 © Seth Copen Goldstein 2000-5 159

Simplest Example

for () { 0 a = b+c

b = a*a

c = a*194 1 }

a

Modulo Resource Table:

0 1

b

c

Try II = 2

1

15-745 © Seth Copen Goldstein 2000-5 160

Simplest Example

for () { 0 a = b+c

b = a*a

c = a*194 1 }

a

Try II = 2

Modulo Resource Table:

0 1

1

b

c

1

15-745 © Seth Copen Goldstein 2000-5 161

Simplest Example

for () { 0 a = b+c

b = a*a

c = a*194 1 }

Modulo Resource Table:

0 1

1

1

a

2

Try II = 2

b

c

1

15-745 © Seth Copen Goldstein 2000-5 162

Simplest Example

for () { 0 a = b+c

b = a*a

c = a*194 1 }

a

b

Try II = 2

Modulo Resource Table:

0 1

earliest a: sigma(c) + delay(c) – 2 = 2+1-2 = 1

1

c

2

1

15-745 © Seth Copen Goldstein 2000-5 163

Simplest Example

for () {

a = b+c

b = a*a

c = a*194

0 1 2

}

b

a

Try II = 2

Modulo Resource Table:

0 1

earliest b? scheduled b? what next?

1

c

1

15-745 © Seth Copen Goldstein 2000-5 164

Simplest Example

for () {

a = b+c

b = a*a

c = a*194

0 1 2 3

}

Try II = 2

Modulo Resource Table:

0 1

Lesson: lower bound may not be achievable

1

b

a

c

1

15-745 © Seth Copen Goldstein 2000-5 165

Example

for i:=1 to N do a := j Å b b := a Å f c := e Å j d := f Å c e := b Å d

f := U[i]

g: V[i] := b

h: W[i] := d

j := X[i]

Priorities: ?

<1,1> <0,1>

<1,1> <0,1>

<1,1> <0,1>

<0,1>

<0,1>

<1,1>

g

a

h

j

f

c

b

d

e

15-745 © Seth Copen Goldstein 2000-5 166

Example

for i:=1 to N do a := j Å b b := a Å f c := e Å j d := f Å c e := b Å d

f := U[i]

g: V[i] := b

h: W[i] := d

j := X[i]

Priorities: c,d,e,a,b,f,j,g,h

g

a

<1,1> <0,1>

<1,1> <0,1>

<1,1> <0,1>

<0,1>

<0,1>

<1,1>

h

f

c

j

b

d

e

15-745 © Seth Copen Goldstein 2000-5 167

for i:=1 to N do a := j Å b b := a Å f c := e Å j d := f Å c e := b Å d

f := U[i]

g: V[i] := b

h: W[i] := d

j := X[i]

s=5

instr

s

a

b

c

d

e

f

g

h

j

ALU

MU

Priorities: c,d,e,a,b,f,j,g,h

a

b

j

h

f

c

g

e

d

15-745 © Seth Copen Goldstein 2000-5 168

for i:=1 to N do a := j Å b b := a Å f c := e Å j d := f Å c e := b Å d

f := U[i]

g: V[i] := b

h: W[i] := d

j := X[i]

Priorities: a,b,f,j,g,h

s=5

instr

s

a

b

c

0

d

1

e

2

f

g

h

j

ALU

MU

c

d

e

j

a

c

f

b

d

g

h

e

15-745 © Seth Copen Goldstein 2000-5 169

for i:=1 to N do a := j Å b b := a Å f c := e Å j d := f Å c e := b Å d

f := U[i]

g: V[i] := b

h: W[i] := d

j := X[i]

Priorities: b,f,j,g,h

s=5

instr

s

a

3

b

c

0

d

1

e

2

f

g

h

j

ALU

MU

c

d

e

a

j

a

c

f

b

d

g

h

e

15-745 © Seth Copen Goldstein 2000-5 170

for i:=1 to N do a := j Å b b := a Å f c := e Å j d := f Å c e := b Å d

f := U[i]

g: V[i] := b

h: W[i] := d

j := X[i]

Priorities: b,f,j,g,h

s=5

instr

s

a

3

b

4

c

0

d

1

e

2

f

g

h

j

ALU

MU

c

d

e

a

b

j

a

c

f

b

d

g

h

e

15-745 © Seth Copen Goldstein 2000-5 171

for i:=1 to N do a := j Å b b := a Å f c := e Å j d := f Å c e := b Å d

f := U[i]

g: V[i] := b

h: W[i] := d

j := X[i]

Priorities: e,f,j,g,h

s=5

ALU

MU

c

d

a

b

j

a

c

f

b

d

h

g

instr

a

b

c

d

e

f

g

h

j

1

s

3

4

0

e

b causes b->e edge violation

15-745 © Seth Copen Goldstein 2000-5 172

for i:=1 to N do a := j Å b b := a Å f c := e Å j d := f Å c e := b Å d

f := U[i]

g: V[i] := b

h: W[i] := d

j := X[i]

Priorities: e,f,j,g,h

s=5

ALU

MU

c

d

e

a

b

j

a

c

f

b

d

h

g

instr

a

b

c

d

e

f

g

h

j

1

s

3

4

0

7

e

e causes e->c edge violation

15-745 © Seth Copen Goldstein 2000-5 173

for i:=1 to N do a := j Å b b := a Å f c := e Å j d := f Å c e := b Å d

f := U[i]

g: V[i] := b

h: W[i] := d

j := X[i]

Priorities: f,j,g,h

s=5

instr

s

a

3

b

4

c

5

d

6

e

7

f

0

g

h

j

ALU

MU

c

f

d

e

a

b

j

a

c

f

b

d

g

h

e

15-745 © Seth Copen Goldstein 2000-5 174

for i:=1 to N do a := j Å b b := a Å f c := e Å j d := f Å c e := b Å d

f := U[i]

g: V[i] := b

h: W[i] := d

j := X[i]

Priorities:j,g,h

s=5

instr

s

a

3

b

4

c

5

d

6

e

7

f

0

g

h

j

1

ALU

MU

c

f

d

j

e

a

b

j

a

c

f

b

d

g

h

e

15-745 © Seth Copen Goldstein 2000-5 175

for i:=1 to N do a := j Å b b := a Å f c := e Å j d := f Å c e := b Å d

f := U[i]

g: V[i] := b

h: W[i] := d

j := X[i]

Priorities:g,h

s=5

instr

s

a

3

b

4

c

5

d

6

e

7

f

0

g

7

h

8

j

1

ALU

MU

c

f

d

j

e

g

a

h

b

j

a

c

f

b

d

g

h

e

15-745 © Seth Copen Goldstein 2000-5 176

Creating the Loop

n Create the body from the schedule.

n Determine which iteration an instruction

falls into

q Mark its sources and dest as belonging to that iteration.

q Add Moves to update registers n Prolog fills in gaps at beginning

q For each move we will have an instruction in prolog, and we fill in dependent instructions

n Epilog fills in gaps at end

instr

s

a

3

b

4

c

5

d

6

e

7

f

0

g

7

h

8

j

1

15-745 © Seth Copen Goldstein 2000-5 177

f0 = U[0]; j0 = X[0];

FOR i = 0 to N f1 := U[i+1] j1 := X[i+1] nop

a := j0 ? b b := a ? f0 c := e ? j0 d := f0 ? c e := b ? d

h: W[i] := d f0 = f1

j0 = j1

g: V[i] := b

15-745 © Seth Copen Goldstein 2000-5 178

Conditionals

n What about internal control structure, I.e., conditionals n Three approaches

q Schedule both sides and use conditional moves

q Schedule each side, then make the body of the conditional a

macro op with appropriate resource vector q Trace schedule the loop

15-745 © Seth Copen Goldstein 2000-5 179

What to take away

n Architecture includes compiler!

n Dependence analysis is very important

(including alias analysis)

n Software pipelining crucial for statically scheduled, but also very useful for dynamically scheduled

15-745 © Seth Copen Goldstein 2000-5 180

Multiflow: An early VLIW architecture (1987)

EPIC – Intel IA-64 Architecture

n Gets rid of lock-step execution of instructions within a VLIW instruction

n Idea: More ISA support for static scheduling and parallelization q Specify dependencies within and between VLIW instructions

(explicitly parallel)

+ No lock-step execution

+ Static reordering of stores and loads + dynamic checking

— Hardware needs to perform dependency checking (albeit aided by software)

— Other disadvantages of VLIW still exist

n Huck et al., “Introducing the IA-64 Architecture,” IEEE Micro, Sep/Oct 2000.

183

IA-64 Instructions

n IA-64 “Bundle” (~EPIC Instruction)

q Total of 128 bits

q Contains three IA-64 instructions

q Template bits in each bundle specify dependencies within a bundle

\

n IA-64 Instruction

q Fixed-length 41 bits long

q Contains three 7-bit register specifiers

q Contains a 6-bit field for specifying one of the 64 one-bit predicate registers

184

IA-64 Instruction Bundles and Groups

n Groups of instructions can be executed safely in parallel

q Marked by “stop bits”

n Bundles are for packaging

q Groups can span multiple bundles

n Alleviates recompilation need somewhat

185

Template Bits

n Specify two things

q Stop information: Boundary of independent instructions

q Functional unit information: Where should each instruction be routed

186