
### Presentations text content in Advanced Compiler Techniques

Slide1

LIU Xianhua, School of EECS, Peking University

Loops

Slide2

Content

Concepts:

- Dominators
- Depth-First Ordering
- Back edges
- Graph depth
- Reducibility
- Natural Loops
- Efficiency of Iterative Algorithms
- Dependences & Loop Transformation

Slide3

Loops are Important!

- Loops dominate program execution time.
  - They need special treatment during optimization.
- Loops also affect the running time of program analyses.
  - e.g., a dataflow problem can be solved in just a single pass if a program has no loops.

Slide4

Dominators

- Node d dominates node n if every path from the entry to n goes through d.
  - Written as: d dom n
- Quick observations:
  - Every node dominates itself.
  - The entry dominates every node.
- Common cases:
  - The test of a while loop dominates all blocks in the loop body.
  - The test of an if-then-else dominates all blocks in either branch.

Slide5

Dominator Tree

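- Immediate dominance: d idom n if d dom n, d ≠ n, and there is no other node m (m ≠ d, m ≠ n) such that d dom m and m dom n.
- Immediate dominance relationships form a tree.

[Figure: a five-node flow graph (nodes 1 to 5) next to its dominator tree.]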

Slide6

Finding Dominators

- A dataflow analysis problem: for each node, find all of its dominators.
  - Direction: forward
  - Confluence: set intersection
  - Boundary: OUT[Entry] = {Entry}
  - Initialization: OUT[B] = all nodes
  - Equations:
    - $\mathrm{OUT}[B] = \mathrm{IN}[B] \cup \{B\}$
    - $\mathrm{IN}[B] = \bigcap_{p \,\in\, \mathrm{pred}(B)} \mathrm{OUT}[p]$
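The dataflow formulation above can be sketched as a simple round-robin iteration. This is a minimal illustration, not an optimized algorithm; the function name `dominators` and the five-node `GRAPH` are assumptions, with the edges chosen only so that the result agrees with the dominator sets quoted in the example that follows.

```python
def dominators(succ, entry):
    """Iterate OUT[B] = {B} | (intersection of OUT[p] over predecessors p)."""
    nodes = set(succ)
    pred = {n: set() for n in nodes}
    for n, targets in succ.items():
        for t in targets:
            pred[t].add(n)

    # Boundary: OUT[entry] = {entry}; initialization: OUT[B] = all nodes.
    out = {n: set(nodes) for n in nodes}
    out[entry] = {entry}

    changed = True
    while changed:
        changed = False
        for b in nodes - {entry}:
            in_b = set(nodes)
            for p in pred[b]:
                in_b &= out[p]          # confluence is set intersection
            new_out = in_b | {b}
            if new_out != out[b]:
                out[b] = new_out
                changed = True
    return out


# Hypothetical flow graph (an assumption, not taken from the slides' figure).
GRAPH = {1: [2, 4, 5], 2: [3], 3: [2], 4: [5], 5: [1, 2]}
DOM = dominators(GRAPH, entry=1)
```

With this assumed graph, `DOM[3]` is `{1, 2, 3}` and `DOM[5]` is `{1, 5}`.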

Slide7

Example: Dominators

[Figure: the example flow graph (nodes 1 to 5) annotated with the dominator sets computed for its nodes: {1}, {1,2}, {1,2,3}, {1,4}, {1,5}.]

Slide8

Depth-First Search

- Start at entry.
- If you can follow an edge to an unvisited node, do so.
- If not, backtrack to your parent (the node from which you were visited).

Slide9

Depth-First Spanning Tree

- Root = entry.
- Tree edges are the edges along which we first visit the node at the head.

[Figure: depth-first spanning tree of the example flow graph (nodes 1 to 5).]

Slide10

Depth-First Node Order

- The reverse of the order in which a DFS retreats from the nodes. Example: 1-4-5-2-3.
- Alternatively, the reverse of a postorder traversal of the tree. Example postorder: 3-2-5-4-1.

[Figure: the example flow graph (nodes 1 to 5).]
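As a small sketch of the definition above, the following recursive DFS records the order in which it retreats from nodes and reverses it. The name `depth_first_order` and the edge list in `GRAPH` are assumptions; the edges were picked so that the output reproduces the orders 3-2-5-4-1 and 1-4-5-2-3 quoted on this slide.

```python
def depth_first_order(succ, entry):
    """Return (df_order, tree_edges) for a DFS that starts at entry."""
    visited = set()
    postorder = []
    tree_edges = []

    def visit(n):
        visited.add(n)
        for s in succ[n]:
            if s not in visited:
                tree_edges.append((n, s))  # first visit of s: a tree edge
                visit(s)
        postorder.append(n)                # the search retreats from n here

    visit(entry)
    return postorder[::-1], tree_edges


# Hypothetical flow graph; successors are tried left to right.
GRAPH = {1: [2, 4, 5], 2: [3], 3: [2], 4: [5], 5: [1, 2]}
ORDER, TREE = depth_first_order(GRAPH, 1)
```

The order depends on which successor is tried first, so a different successor ordering gives a different, equally valid DFST and DF order.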

Slide11

Four Kinds of Edges

- Tree edges.
- Advancing edges (node to proper descendant).
- Retreating edges (node to ancestor, including edges to self).
- Cross edges (between two nodes, neither of which is an ancestor of the other).

Slide12

A Little Magic

- Of these edges, only retreating edges go from high to low in DF order.
  - Example of proof: you must retreat from the head of a tree edge before you can retreat from its tail.
- Also surprising: all cross edges go right to left in the DFST.
  - Assuming we add children of any node from the left.
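One way to make the four edge kinds concrete is to classify every edge with DFS entry and exit times. This is a sketch under assumptions: `classify_edges` is an illustrative name, and `GRAPH` is a hypothetical five-node graph rather than the one drawn on the slides.

```python
def classify_edges(succ, entry):
    """Label each edge as tree, advancing, retreating, or cross for one DFST."""
    pre, post, kind = {}, {}, {}
    clock = 0

    def visit(n):
        nonlocal clock
        pre[n] = clock
        clock += 1
        for s in succ[n]:
            if s not in pre:
                kind[(n, s)] = "tree"
                visit(s)
        post[n] = clock
        clock += 1

    visit(entry)

    def in_subtree(d, a):
        # True if d lies in the DFST subtree rooted at a (including d == a).
        return pre[a] <= pre[d] and post[d] <= post[a]

    for n in pre:
        for s in succ[n]:
            if (n, s) in kind:
                continue
            if in_subtree(n, s):
                kind[(n, s)] = "retreating"  # target is an ancestor, or n itself
            elif in_subtree(s, n):
                kind[(n, s)] = "advancing"   # target is a proper descendant
            else:
                kind[(n, s)] = "cross"
    return kind


GRAPH = {1: [2, 4, 5], 2: [3], 3: [2], 4: [5], 5: [1, 2]}
KINDS = classify_edges(GRAPH, 1)
```

In this assumed graph, 3 -> 2 and 5 -> 1 come out as retreating, 1 -> 5 as advancing, and 5 -> 2 as cross.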

Slide13

Example: Non-Tree Edges

[Figure: the example flow graph (nodes 1 to 5) with its non-tree edges labeled Retreating, Forward, and Cross.]

Slide14


Back Edges

- An edge is a back edge if its head dominates its tail.
- Theorem: every back edge is a retreating edge in every DFST of every flow graph.
  - The converse is almost always true, but not always.

[Figure: a back edge. The head is reached before the tail in any DFST, and the search must reach the tail before retreating from the head, so the tail is a descendant of the head.]

Slide15

Example: Back Edges

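[Figure: the example flow graph (nodes 1 to 5) annotated with dominator sets {1}, {1,2}, {1,2,3}, {1,4}, {1,5}, used to identify its back edges.]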

Slide16


Reducible Flow Graphs

- A flow graph is reducible if every retreating edge in any DFST for that flow graph is a back edge.
- Testing reducibility: remove all back edges from the flow graph and check that the result is acyclic.
- Hint why it works: all cycles must include some retreating edge in every DFST.
  - In particular, the edge that enters the first node of the cycle that is visited.
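The reducibility test described above can be sketched directly: compute dominators, drop every edge whose head dominates its tail, and check the remainder for cycles. The names `dominators` and `is_reducible` and both example graphs are assumptions, and the sketch assumes every node is reachable from the entry.

```python
def dominators(succ, entry):
    nodes = set(succ)
    pred = {n: [p for p in nodes if n in succ[p]] for n in nodes}
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for b in nodes - {entry}:
            new = set(nodes)
            for p in pred[b]:
                new &= dom[p]
            new |= {b}
            if new != dom[b]:
                dom[b], changed = new, True
    return dom


def is_reducible(succ, entry):
    """Remove back edges (head dominates tail) and test the rest for cycles."""
    dom = dominators(succ, entry)
    kept = {n: [s for s in succ[n] if s not in dom[n]] for n in succ}

    # Kahn-style topological sort: acyclic iff every node can be removed.
    indegree = {n: 0 for n in succ}
    for n in kept:
        for s in kept[n]:
            indegree[s] += 1
    ready = [n for n in succ if indegree[n] == 0]
    removed = 0
    while ready:
        n = ready.pop()
        removed += 1
        for s in kept[n]:
            indegree[s] -= 1
            if indegree[s] == 0:
                ready.append(s)
    return removed == len(succ)


# Hypothetical graphs: a reducible one and a three-node nonreducible one.
GRAPH = {1: [2, 4, 5], 2: [3], 3: [2], 4: [5], 5: [1, 2]}
NONREDUCIBLE = {"A": ["B", "C"], "B": ["C"], "C": ["B"]}
```

In `NONREDUCIBLE`, neither B nor C dominates the other, so no edge is removed and the cycle between B and C survives.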

Slide17

DFST on a Cycle

[Figure: DFST on a cycle, with three callouts: "Depth-first search reaches here first"; "Search must reach these nodes before leaving the cycle"; "So this is a retreating edge".]

Slide18

Why Reducibility?

- Folk theorem: all flow graphs in practice are reducible.
- Fact: if you use only while-loops, for-loops, repeat-loops, if-then(-else), break, and continue, then your flow graph is reducible.

Slide19

Example: Remove Back Edges

[Figure: the example flow graph (nodes 1 to 5) with its back edges removed.]

Remaining graph is acyclic.

Slide20

Example: Nonreducible Graph

[Figure: a nonreducible flow graph on nodes A, B, and C, shown with two of its DFSTs.]

- In any DFST, one of these edges will be a retreating edge.
- But no heads dominate their tails, so deleting back edges leaves the cycle.

Slide21

- Proper ordering of nodes during an iterative algorithm ensures that the number of passes is limited by the number of "nested" back edges.
- The depth of nested loops is an upper bound on the number of nested back edges.

Slide22

DF Order and Retreating Edges

- Suppose that for a reaching-definitions (RD) analysis, we visit nodes during each iteration in DF order.
- The fact that a definition d reaches a block will propagate in one pass along any increasing sequence of blocks.
- When d arrives at the tail of a retreating edge, it is too late to propagate d from OUT to IN.
  - The IN at the head has already been computed for that round.

Slide23

Example: DF Order

[Figure: the example flow graph (nodes 1 to 5). Definition d is Gen'd by node 2; the figure marks how far d has propagated after the first pass and after the second pass.]

Slide24

Depth of a Flow Graph

- The depth of a flow graph with a given DFST and DF order is the greatest number of retreating edges along any acyclic path.
- For RD, if we use DF order to visit nodes, we converge in depth+2 passes.
  - Depth+1 passes to follow that number of increasing segments.
  - 1 more pass to realize we converged.

Slide25

Example: Depth = 2

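    1 -> 4 -> 7 ---> 3 -> 10 -> 17 ---> 6 -> 18 -> 20

- Pass 1: increasing segment 1 -> 4 -> 7, followed by the retreating edge 7 ---> 3.
- Pass 2: increasing segment 3 -> 10 -> 17, followed by the retreating edge 17 ---> 6.
- Pass 3: increasing segment 6 -> 18 -> 20.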

Slide26

Similarly . . .

- Available expressions (AE) also works in depth+2 passes.
  - Unavailability propagates along retreat-free node sequences in one pass.
- So does live variables (LV) if we use the reverse of DF order.
  - A use propagates backward along paths that do not use a retreating edge in one pass.

Slide27

In General . . .

- The depth+2 bound works for any monotone framework, as long as information only needs to propagate along acyclic paths.
  - Example: if a definition reaches a point, it does so along an acyclic path.

Slide28

However . . .

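Constant propagation does not have this property.

    L: a = b
       b = c
       c = 1
       goto L

Here the constant 1 reaches `b = c` only after one trip around the loop and `a = b` only after two, so the information has to travel along a cyclic path.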

Slide29

Why Depth+2 is Good

- Normal control-flow constructs produce reducible flow graphs with the number of back edges at most the nesting depth of loops.
  - Nesting depth tends to be small.
- A study by Knuth has shown that the average depth of typical flow graphs is approximately 2.75.

Slide30

Example: Nested Loops

[Figure: two flow graphs. 3 nested while-loops: depth = 3. 3 nested repeat-loops: depth = 1.]

Slide31

Natural Loops

A natural loop is defined by:

- A single entry point called the header.
  - The header dominates all nodes in the loop.
- A back edge that enters the loop header.
  - Otherwise, it is not possible for the flow of control to return to the header directly from the "loop"; i.e., there really is no loop.

Slide32

Find Natural Loops

- The natural loop of a back edge a->b is {b} plus the set of nodes that can reach a without going through b.
  - To compute it, remove b from the flow graph and find all nodes that can reach a (its direct and indirect predecessors).
- Theorem: two natural loops are either disjoint, identical, or nested.
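The construction above amounts to a backward search from the tail of the back edge that refuses to pass through the header. A minimal sketch, assuming the illustrative name `natural_loop` and a hypothetical five-node `GRAPH` (not the one in the slides' figure):

```python
def natural_loop(succ, tail, head):
    """Nodes of the natural loop of the back edge tail -> head."""
    pred = {n: [] for n in succ}
    for n, targets in succ.items():
        for t in targets:
            pred[t].append(n)

    loop = {head}              # pre-marking the header blocks the search there
    worklist = [tail]
    while worklist:
        n = worklist.pop()
        if n not in loop:
            loop.add(n)
            worklist.extend(pred[n])   # walk edges backward from the tail
    return loop


GRAPH = {1: [2, 4, 5], 2: [3], 3: [2], 4: [5], 5: [1, 2]}
INNER = natural_loop(GRAPH, tail=3, head=2)
OUTER = natural_loop(GRAPH, tail=5, head=1)
```

For this assumed graph the two loops, `{2, 3}` and `{1, 4, 5}`, are disjoint, which is one of the three cases the theorem allows.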

Slide33

Example: Natural Loops

[Figure: the example flow graph (nodes 1 to 5) with two regions outlined: the natural loop of 3 -> 2 and the natural loop of 5 -> 1.]

Slide34

Relationship between Loops

- If two loops do not have the same header:
  - they are either disjoint, or
  - one is entirely contained in (nested within) the other.
  - Innermost loop: one that contains no other loop.
- If two loops share the same header:
  - it is hard to tell which is the inner loop;
  - combine them as one loop.

[Figure: a four-node flow graph (nodes 1 to 4).]

Slide35

Basic Parallelism

Examples:

    FOR i = 1 to 100
      a[i] = b[i] + c[i]

    FOR i = 11 TO 20
      a[i] = a[i-1] + 3

    FOR i = 11 TO 20
      a[i] = a[i-10] + 3

- Does there exist a data dependence edge between two different iterations?
- A data dependence edge is loop-carried if it crosses iteration boundaries.
- DoAll loops: loops without loop-carried dependences.

Slide36

Data Dependence of Variables

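- True dependence: `a =` followed by `= a` (write, then read).
- Anti-dependence: `= a` followed by `a =` (read, then write).
- Output dependence: `a =` followed by `a =` (write, then write).
- Input dependence: `= a` followed by `= a` (read, then read).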

Slide37

Affine Array Accesses

- Common patterns of data accesses (i, j, k are loop indexes):
  - A[i], A[j], A[i-1], A[0], A[i+j], A[2*i], A[2*i+1], A[i,j], A[i-1, j+1]
- Array indexes are affine expressions of surrounding loop indexes.
  - Loop indexes: $i_n, i_{n-1}, \dots, i_1$
  - Integer constants: $c_n, c_{n-1}, \dots, c_0$
  - Array index: $c_n i_n + c_{n-1} i_{n-1} + \dots + c_1 i_1 + c_0$
  - Affine expression: linear expression + a constant term ($c_0$)

Slide38

    FOR i := 2 to 5 do
      A[i-2] = A[i] + 1;

- Between read access A[i] and write access A[i-2] there is a dependence if:
  - there exist two iterations $i_r$ and $i_w$ within the loop bounds such that
  - iterations $i_r$ and $i_w$ read and write the same array element, respectively:
  - $\exists$ integers $i_w, i_r$: $2 \le i_w, i_r \le 5$ and $i_r = i_w - 2$
- Between write access A[i-2] and write access A[i-2] there is a dependence if:
  - $\exists$ integers $i_w, i_v$: $2 \le i_w, i_v \le 5$ and $i_w - 2 = i_v - 2$
  - To rule out the case where the same instance depends on itself, add the constraint $i_w \ne i_v$.
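Because the loop bounds here are tiny, the existence question can be checked by brute-force enumeration. The sketch below is only an illustration of the constraint system, not how a compiler solves it; `find_dependences` is a hypothetical helper name.

```python
def find_dependences(lo, hi, first_index, second_index):
    """Enumerate iteration pairs whose two accesses touch the same element."""
    return [
        (i1, i2)
        for i1 in range(lo, hi + 1)
        for i2 in range(lo, hi + 1)
        if first_index(i1) == second_index(i2)
    ]


# FOR i := 2 to 5 do A[i-2] = A[i] + 1
# Write A[i_w - 2] versus read A[i_r]: pairs are (i_w, i_r).
PAIRS = find_dependences(2, 5, lambda i: i - 2, lambda i: i)

# Write versus write, excluding an instance depending on itself (i_w != i_v).
WRITE_WRITE = [
    (i_w, i_v)
    for i_w, i_v in find_dependences(2, 5, lambda i: i - 2, lambda i: i - 2)
    if i_w != i_v
]
```

`PAIRS` is `[(4, 2), (5, 3)]`: the element read in iteration 2 is overwritten in iteration 4, and likewise for 3 and 5. `WRITE_WRITE` is empty, so no two distinct iterations write the same element.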

Slide39

Memory Disambiguation

Undecidable at compile time:

    read(n)
    For i = …
      a[i] = a[n]

Slide40

Domain of Data Dependence Analysis

- Only use loop bounds and array indexes that are affine functions of loop variables.

      for i = 1 to n
        for j = 2i to 100
          a[i + 2j + 3][4i + 2j][i * i] = …
          … = a[1][2i + 1][j]

- Assume a data dependence between the read and write operations if there exist integers $i_r, j_r, i_w, j_w$ such that:
  - $1 \le i_w, i_r \le n$
  - $2 i_w \le j_w \le 100$
  - $2 i_r \le j_r \le 100$
  - $i_w + 2 j_w + 3 = 1$
  - $4 i_w + 2 j_w = 2 i_r + 1$
- Equate each dimension of the array access; ignore non-affine ones (here, the third dimension `i * i`).
  - No solution: no data dependence.
  - Solution: there may be a dependence.

Slide41

Iteration Space

- An abstraction for loops.
- Each iteration is represented as coordinates in iteration space.

      for i = 0, 5
        for j = 0, 3
          a[i, j] = 3

[Figure: the iteration space plotted on axes i and j.]

Slide42

Iteration Space

- An abstraction for loops.

      for i = 0, 5
        for j = i, 3
          a[i, j] = 0

[Figure: the iteration space plotted on axes i and j.]

Slide43

Iteration Space

- An abstraction for loops.

      for i = 0, 5
        for j = i, 7
          a[i, j] = 0

[Figure: the iteration space plotted on axes i and j.]

Slide44

Affine Access


Slide45

Affine Transform

[Figure: an iteration space on axes i and j mapped to a transformed iteration space on axes u and v.]

Slide46

Loop Transformation

    for i = 1, 100
      for j = 1, 200
        A[i, j] = A[i, j] + 3
      end_for
    end_for

    for u = 1, 200
      for v = 1, 100
        A[v, u] = A[v, u] + 3
      end_for
    end_for

Slide47

Old Iteration Space

    for i = 1, 100
      for j = 1, 200
        A[i, j] = A[i, j] + 3
      end_for
    end_for

Slide48

New Iteration Space

    for u = 1, 200
      for v = 1, 100
        A[v, u] = A[v, u] + 3
      end_for
    end_for

Slide49

Old Array Accesses

    for i = 1, 100
      for j = 1, 200
        A[i, j] = A[i, j] + 3
      end_for
    end_for

Slide50

New Array Accesses

    for u = 1, 200
      for v = 1, 100
        A[v, u] = A[v, u] + 3
      end_for
    end_for

Slide51

Interchange Loops?

    for i = 2, 1000
      for j = 1, 1000
        A[i, j] = A[i-1, j+1] + 3
      end_for
    end_for

e.g. dependence vector $d_{old} = (1, -1)$

[Figure: the iteration space plotted on axes i and j.]

    for u = 1, 1000
      for v = 2, 1000
        A[v, u] = A[v-1, u+1] + 3
      end_for
    end_for

Slide52

Interchange Loops?

- A transformation is legal if the new dependence is lexicographically positive, i.e. the leading non-zero in the dependence vector is positive.
- Distance vector: (1,-1) = (4,2) - (3,3)
- Loop interchange is not legal if there exists a dependence (+, -).
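The legality rule can be sketched for the two-loop case: interchange swaps the components of every distance vector, and each swapped vector must stay lexicographically positive. Both function names are assumptions, and the sketch treats an all-zero vector (a dependence within a single iteration) as unaffected by interchange.

```python
def lexicographically_positive(vec):
    """True if the first non-zero component of vec is positive."""
    for component in vec:
        if component != 0:
            return component > 0
    return False  # all zeros: not a loop-carried dependence


def interchange_is_legal(distance_vectors):
    """Check a 2-deep loop interchange against (d_i, d_j) distance vectors."""
    for d_i, d_j in distance_vectors:
        swapped = (d_j, d_i)
        # Loop-independent dependences (0, 0) do not constrain the interchange.
        if any(swapped) and not lexicographically_positive(swapped):
            return False
    return True
```

For the example above, `interchange_is_legal([(1, -1)])` is `False`, because the swapped vector (-1, 1) has a negative leading component.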

Slide53

GCD Test

    for i = 1, 100
      a[2*i] = …
      … = a[2*i+1] + 3

- Is there any dependence?
- Solve a linear Diophantine equation: $2 i_w = 2 i_r + 1$

Slide54

GCD

- The greatest common divisor (GCD) of integers $a_1, a_2, \dots, a_n$, denoted $\gcd(a_1, a_2, \dots, a_n)$, is the largest integer that evenly divides all these integers.
- Theorem: the linear Diophantine equation $a_1 x_1 + a_2 x_2 + \dots + a_n x_n = c$ has an integer solution $x_1, x_2, \dots, x_n$ iff $\gcd(a_1, a_2, \dots, a_n)$ divides $c$.

Slide55

Examples

- Example 1: gcd(2,-2) = 2. No solutions (for $2 i_w - 2 i_r = 1$, since 2 does not divide 1).
- Example 2: gcd(24,36,54) = 6. Many solutions.
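The theorem translates into a very short check. This is a sketch with an assumed helper name `gcd_test`; note that the test ignores loop bounds, so a `True` result only means a dependence may exist. The right-hand side 30 used for the second set of coefficients is an illustrative choice, not a value from the slides.

```python
from functools import reduce
from math import gcd


def gcd_test(coefficients, c):
    """True if a1*x1 + ... + an*xn = c can have an integer solution."""
    g = reduce(gcd, (abs(a) for a in coefficients))
    if g == 0:
        return c == 0          # every coefficient is zero
    return c % g == 0


# a[2*i] written, a[2*i+1] read: 2*i_w - 2*i_r = 1
NO_DEPENDENCE = not gcd_test([2, -2], 1)

# Coefficients from Example 2 with an assumed right-hand side of 30.
MAY_DEPEND = gcd_test([24, 36, 54], 30)
```

Both `NO_DEPENDENCE` and `MAY_DEPEND` evaluate to `True` here.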

Slide56

Loop Fusion

    for i = 1, 1000
      A[i] = B[i] + 3
    end_for
    for j = 1, 1000
      C[j] = A[j] + 5
    end_for

    for i = 1, 1000
      A[i] = B[i] + 3
      C[i] = A[i] + 5
    end_for

Better reuse between A[i] and A[i]
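As a quick sanity check of the fusion above, the two versions can be written out and compared on sample data. This sketch uses 0-based Python lists instead of the 1-based arrays on the slide, and the function names are assumptions; it demonstrates that the results agree, not the cache benefit itself.

```python
def separate_loops(b):
    n = len(b)
    a, c = [0] * n, [0] * n
    for i in range(n):
        a[i] = b[i] + 3
    for j in range(n):
        c[j] = a[j] + 5        # a[j] is reloaded long after it was written
    return a, c


def fused_loop(b):
    n = len(b)
    a, c = [0] * n, [0] * n
    for i in range(n):
        a[i] = b[i] + 3
        c[i] = a[i] + 5        # a[i] was produced in this same iteration
    return a, c
```

Fusion is safe here because iteration i of the second loop only reads the value written by iteration i of the first.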

Slide57

Loop Distribution

Original loop:

    for i = 1, 1000
      A[i] = A[i-1] + 3
      C[i] = B[i] + 5
    end_for

After distribution:

    for i = 1, 1000
      A[i] = A[i-1] + 3
    end_for
    for i = 1, 1000
      C[i] = B[i] + 5
    end_for

The 2nd loop is parallel.

Slide58

Register Blocking

    for j = 1, 2*m
      for i = 1, 2*n
        A[i, j] = A[i-1, j] + A[i-1, j-1]
      end_for
    end_for

    for j = 1, 2*m, 2
      for i = 1, 2*n, 2
        A[i, j]     = A[i-1, j]   + A[i-1, j-1]
        A[i, j+1]   = A[i-1, j+1] + A[i-1, j]
        A[i+1, j]   = A[i, j]     + A[i, j-1]
        A[i+1, j+1] = A[i, j+1]   + A[i, j]
      end_for
    end_for

Better reuse between A[i,j] and A[i,j]

Slide59

Virtual Register Allocation

    for j = 1, 2*M, 2
      for i = 1, 2*N, 2
        r1 = A[i-1, j]
        r2 = r1 + A[i-1, j-1]
        A[i, j] = r2
        r3 = A[i-1, j+1] + r1
        A[i, j+1] = r3
        A[i+1, j] = r2 + A[i, j-1]
        A[i+1, j+1] = r3 + r2
      end_for
    end_for

Memory operations reduced to register load/store

Slide60

Scalar Replacement

    for i = 2, N+1
      … = A[i-1] + 1
      A[i] = …
    end_for

    t1 = A[1]
    for i = 2, N+1
      … = t1 + 1
      t1 = …
      A[i] = t1
    end_for

Eliminate loads and stores for array references

Slide61

Unroll-and-Jam

    for j = 1, 2*M
      for i = 1, N
        A[i, j] = A[i-1, j] + A[i-1, j-1]
      end_for
    end_for

    for j = 1, 2*M, 2
      for i = 1, N
        A[i, j]   = A[i-1, j]   + A[i-1, j-1]
        A[i, j+1] = A[i-1, j+1] + A[i-1, j]
      end_for
    end_for

Exposes more opportunity for scalar replacement

Slide62

Large Arrays

    for i = 1, 1000
      for j = 1, 1000
        A[i, j] = A[i, j] + B[j, i]
      end_for
    end_for

Suppose arrays A and B have row-major layout.

- B has poor cache locality.
- Loop interchange will not help.

Slide63

Loop Blocking

    for v = 1, 1000, 20
      for u = 1, 1000, 20
        for j = v, v+19
          for i = u, u+19
            A[i, j] = A[i, j] + B[j, i]
          end_for
        end_for
      end_for
    end_for
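The blocked loop nest above visits the same (i, j) pairs as the original, just tile by tile. The sketch below checks that equivalence on a small matrix; it uses 0-based indices, an assumed `tile` parameter in place of the fixed 20, and clamps tile edges with `min` so that sizes not divisible by the tile width also work. Python will not show the cache effect, only that the results match.

```python
def add_transpose_plain(a, b, n):
    for i in range(n):
        for j in range(n):
            a[i][j] += b[j][i]


def add_transpose_blocked(a, b, n, tile):
    # v and u step over tile origins; j and i sweep one tile-by-tile block.
    for v in range(0, n, tile):
        for u in range(0, n, tile):
            for j in range(v, min(v + tile, n)):
                for i in range(u, min(u + tile, n)):
                    a[i][j] += b[j][i]


def make_matrix(n, scale):
    return [[scale * (r * n + c) for c in range(n)] for r in range(n)]


N = 6
A1, A2 = make_matrix(N, 1), make_matrix(N, 1)
B = make_matrix(N, 10)
add_transpose_plain(A1, B, N)
add_transpose_blocked(A2, B, N, tile=4)
SAME = A1 == A2
```

Each element of A is updated exactly once in both versions, so the order of traversal does not change the result.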

Slide64

Loop Unrolling for ILP

    for i = 1, 10
      a[i] = b[i];
      *p = ...
    end_for

    for i = 1, 10, 2
      a[i] = b[i];
      *p = …
      a[i+1] = b[i+1];
      *p = …
    end_for

- Larger scheduling regions.
- Fewer dynamic branches.
- Increased code size.
