ECE 454 Computer Systems Programming

Uploaded On 2018-02-01

Presentation Transcript

Slide1

ECE 454 Computer Systems Programming
Memory performance (Part II: Optimizing for caches)

Ding Yuan

ECE Dept., University of Toronto

http://www.eecg.toronto.edu/~yuan

Slide2

Content

Cache basics and organization (last lec.)

Optimizing for Caches (this lec.)
Tiling/blocking
Loop reordering

Prefetching (next lec.)

Virtual Memory (next lec.)

Slide3

Optimizing for Caches

Slide4

Memory Optimizations

Write code that has locality

Spatial: access data contiguously

Temporal: make sure access to the same data is not too far apart in time

How to achieve?

Proper choice of algorithm

Loop transformations

Slide5

Background: Array Allocation

Basic Principle

T A[L];

Array of data type T and length L

Contiguously allocated region of L * sizeof(T) bytes
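The contiguity rule can be checked directly in C. The sketch below is illustrative only: the function name `check_array_layout` is hypothetical, and it compares against `sizeof` rather than hard-coding the 4-byte int and 8-byte pointer sizes that the slide's addresses assume.

```c
#include <stddef.h>

/* Returns 1 if every check of the "L * sizeof(T) contiguous bytes" rule holds. */
int check_array_layout(void) {
    char string[12];
    int val[5];
    double a[3];
    char *p[3];
    int ok = 1;

    /* Total size is L * sizeof(T). */
    ok &= (sizeof(string) == 12 * sizeof(char));
    ok &= (sizeof(val) == 5 * sizeof(int));
    ok &= (sizeof(a) == 3 * sizeof(double));
    ok &= (sizeof(p) == 3 * sizeof(char *));

    /* Element i starts exactly i * sizeof(T) bytes after element 0. */
    for (size_t i = 0; i < 5; i++)
        ok &= ((size_t)((char *)&val[i] - (char *)&val[0]) == i * sizeof(int));
    for (size_t i = 0; i < 3; i++) {
        ok &= ((size_t)((char *)&a[i] - (char *)&a[0]) == i * sizeof(double));
        ok &= ((size_t)((char *)&p[i] - (char *)&p[0]) == i * sizeof(char *));
    }
    return ok;
}
```

On a typical 64-bit machine with 4-byte ints, these checks correspond to the offsets x, x + 4, x + 8, ... for `val` and x, x + 8, x + 16 for `a` and `p`.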

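char string[12];   spans x to x + 12

int val[5];   elements start at x, x + 4, x + 8, x + 12, x + 16; the array ends at x + 20

double a[3];   elements start at x, x + 8, x + 16; the array ends at x + 24

char *p[3]; (64 bit)   elements start at x, x + 8, x + 16; the array ends at x + 24

Slide6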

Multidimensional (Nested) Arrays

Declaration

T A[R][C];

2D array of data type T
R rows, C columns
T element requires K bytes

Array Size

R * C * K bytes

Arrangement

Row-Major Ordering (C code)
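Row-major ordering means that element A[i][j] sits (i * C + j) elements after A[0][0]. The following sketch checks that rule against the compiler's actual layout; `ROWS`, `COLS`, `row_major_index`, and `check_row_major` are hypothetical names chosen for this example, with `ROWS` and `COLS` standing in for R and C.

```c
#include <stddef.h>

enum { ROWS = 3, COLS = 4 };  /* hypothetical sizes standing in for R and C */

/* Row-major rule: element (i, j) is the (i * cols + j)-th element in memory. */
size_t row_major_index(size_t i, size_t j, size_t cols) {
    return i * cols + j;
}

/* Returns 1 if the layout of int A[ROWS][COLS] matches the rule, else 0. */
int check_row_major(void) {
    int A[ROWS][COLS];
    const char *base = (const char *)&A[0][0];

    for (size_t i = 0; i < ROWS; i++)
        for (size_t j = 0; j < COLS; j++) {
            size_t expected = row_major_index(i, j, COLS) * sizeof(int);
            size_t actual = (size_t)((const char *)&A[i][j] - base);
            if (actual != expected)
                return 0;
        }
    /* The whole array occupies R * C * K bytes. */
    return sizeof(A) == (size_t)ROWS * COLS * sizeof(int);
}
```

Because consecutive j values are adjacent in memory while consecutive i values are a full row apart, the order of the loops over i and j decides how well a traversal uses each cache block.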

Logical view: rows A[0][0] ... A[0][C-1] through A[R-1][0] ... A[R-1][C-1]

int A[R][C];

Layout in memory: A[0][0] ... A[0][C-1], A[1][0] ... A[1][C-1], ..., A[R-1][0] ... A[R-1][C-1]

4*R*C Bytes

Slide7

Assumed Simple Cache

2 ints per block
2-way set associative
2 blocks, 1 set in total
i.e., same thing as fully associative
Replacement policy: Least Recently Used (LRU)

[Diagram: cache with Block 0 and Block 1]

Slide8

Some Key Questions

How many elements are there per block?
Does the data structure fit in the cache?
Do I re-use blocks over time?
In what order am I accessing blocks?

Slide9

Simple Array

[Diagram: array A with elements 1 to 4, next to the cache]

for (i = 0; i < N; i++) {
    … = A[i];
}

Miss rate = #misses / #accesses = (N/2) / N = ½ = 50%

Slide10

Simple Array with outer loop

[Diagram: array A with elements 1 to 4, next to the cache]

for (k = 0; k < P; k++) {
    for (i = 0; i < N; i++) {
        … = A[i];
    }
}

Assume A[] fits in the cache:

Miss rate = #misses / #accesses = (N/2) / (N*P) = 1/(2P)

Lesson: for sequential accesses with re-use, if the data fits in the cache, the first visit suffers all the misses

Slide11

Simple Array

[Diagram: array A with elements 1 to 8, next to the cache]

for (i = 0; i < N; i++) {
    … = A[i];
}

Assume A[] does not fit in the cache:

Miss rate = #misses / #accesses

Slide12

Simple Array

[Diagram: array A with elements 1 to 8; the cache ends up holding elements 5 to 8]

for (i = 0; i < N; i++) {
    … = A[i];
}

Assume A[] does not fit in the cache:

Miss rate = #misses / #accesses = (N/2) / N = ½ = 50%

Lesson: for sequential accesses, if there is no re-use, it doesn’t matter whether the data structure fits

Slide13

Simple Array with outer loop

[Diagram: array A with elements 1 to 8, next to the cache]

for (k = 0; k < P; k++) {
    for (i = 0; i < N; i++) {
        … = A[i];
    }
}

Assume A[] does not fit in the cache:

Miss rate = #misses / #accesses = (P*N/2) / (P*N) = ½ = 50%
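The miss rates worked out on the simple-array slides can be checked with a small simulation. The sketch below models the assumed cache (2 blocks of 2 ints, fully associative, LRU) and replays the loop's accesses. The names `Cache`, `cache_access`, and `scan_miss_rate` are illustrative and not from the lecture, and A[] is assumed to start on a block boundary.

```c
enum { MAX_BLOCKS = 8 };

typedef struct {
    long tag[MAX_BLOCKS];    /* block number held in each slot, -1 if empty */
    long stamp[MAX_BLOCKS];  /* time of last access, 0 if never used */
    int nblocks, elems_per_block;
    long now, accesses, misses;
} Cache;

static void cache_init(Cache *c, int nblocks, int elems_per_block) {
    c->nblocks = nblocks;
    c->elems_per_block = elems_per_block;
    c->now = c->accesses = c->misses = 0;
    for (int s = 0; s < nblocks; s++) { c->tag[s] = -1; c->stamp[s] = 0; }
}

/* Touch element number `elem`; on a miss, replace the least recently used slot. */
static void cache_access(Cache *c, long elem) {
    long block = elem / c->elems_per_block;
    int lru = 0;
    c->now++;
    c->accesses++;
    for (int s = 0; s < c->nblocks; s++) {
        if (c->tag[s] == block) { c->stamp[s] = c->now; return; }  /* hit */
        if (c->stamp[s] < c->stamp[lru]) lru = s;
    }
    c->misses++;
    c->tag[lru] = block;
    c->stamp[lru] = c->now;
}

/* Miss rate of `passes` sequential scans over an n-element int array. */
double scan_miss_rate(int n, int passes) {
    Cache c;
    cache_init(&c, 2, 2);  /* 2 blocks, 2 ints per block, as on the slides */
    for (int k = 0; k < passes; k++)
        for (int i = 0; i < n; i++)
            cache_access(&c, i);
    return (double)c.misses / (double)c.accesses;
}
```

With N = 4 (fits) and P = 4, `scan_miss_rate(4, 4)` gives 1/(2P) = 0.125. With N = 8 (does not fit), the result stays at 0.5 for any P, because LRU evicts each block before the next pass returns to it.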

Lesson: for sequential accesses with re-use, if the data structure doesn’t fit, the miss rate is the same as with no re-use

Slide14

Let’s warm up our cache

Problem (and opportunity)

L1 cache reference: 0.5 ns* (L1 cache size: 32 KB)
Main memory reference: 100 ns (mem. size: 4 GBs)

Locality

Temporal locality
Spatial locality

Target program: matrix multiplication

Slide15

2D array

[Diagram: array A with elements 1 to 4, next to the cache]

for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        … = A[i][j];
    }
}

Assume A[] fits in the cache:

Miss rate = #misses / #accesses = (N*N/2) / (N*N) = ½ = 50%

Slide16

2D array

[Diagram: array A with elements 1 to 16, next to the cache]

for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        … = A[i][j];
    }
}

Assume A[] does not fit in the cache:

Miss rate = #misses / #accesses = (N*N/2) / (N*N) = ½ = 50%

Lesson: for 2D accesses, if row order and no re-use, same hit rate as sequential, whether it fits or not

Slide17

2D array

[Diagram: array A with elements 1 to 4, next to the cache]

for (j = 0; j < N; j++) {
    for (i = 0; i < N; i++) {
        … = A[i][j];
    }
}

Assume A[] fits in the cache:

Miss rate = #misses / #accesses = (N*N/2) / (N*N) = ½ = 50%

Lesson: for 2D accesses, if column order and no re-use, same hit rate as sequential if the entire column fits in the cache

Slide18

2D array

[Diagram: array A with elements 1 to 16, next to the cache]

for (j = 0; j < N; j++) {
    for (i = 0; i < N; i++) {
        … = A[i][j];
    }
}

Assume A[] does not fit in the cache:

Miss rate = #misses / #accesses = (N*N) / (N*N) = 100%
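The contrast between row order and column order can be reproduced by simulating the assumed 2-block LRU cache. This is a sketch under the slides' assumptions (2 ints per block, A[] starting on a block boundary); `Cache`, `cache_access`, and `traverse_miss_rate` are hypothetical names, not part of the lecture.

```c
enum { MAX_BLOCKS = 8 };

typedef struct {
    long tag[MAX_BLOCKS];    /* block number held in each slot, -1 if empty */
    long stamp[MAX_BLOCKS];  /* time of last access, 0 if never used */
    int nblocks, elems_per_block;
    long now, accesses, misses;
} Cache;

static void cache_init(Cache *c, int nblocks, int elems_per_block) {
    c->nblocks = nblocks;
    c->elems_per_block = elems_per_block;
    c->now = c->accesses = c->misses = 0;
    for (int s = 0; s < nblocks; s++) { c->tag[s] = -1; c->stamp[s] = 0; }
}

/* Touch element number `elem`; on a miss, replace the least recently used slot. */
static void cache_access(Cache *c, long elem) {
    long block = elem / c->elems_per_block;
    int lru = 0;
    c->now++;
    c->accesses++;
    for (int s = 0; s < c->nblocks; s++) {
        if (c->tag[s] == block) { c->stamp[s] = c->now; return; }  /* hit */
        if (c->stamp[s] < c->stamp[lru]) lru = s;
    }
    c->misses++;
    c->tag[lru] = block;
    c->stamp[lru] = c->now;
}

/* Miss rate of one full traversal of an n x n row-major int array.
   column_order == 0: i outer, j inner; otherwise: j outer, i inner. */
double traverse_miss_rate(int n, int column_order) {
    Cache c;
    cache_init(&c, 2, 2);  /* 2 blocks, 2 ints per block */
    for (int outer = 0; outer < n; outer++)
        for (int inner = 0; inner < n; inner++) {
            int i = column_order ? inner : outer;
            int j = column_order ? outer : inner;
            cache_access(&c, (long)i * n + j);  /* row-major element number */
        }
    return (double)c.misses / (double)c.accesses;
}
```

For the 4 x 4 array, `traverse_miss_rate(4, 0)` is 0.5 and `traverse_miss_rate(4, 1)` is 1.0; for the 2 x 2 array, whose column fits in the cache, both orders give 0.5.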

Lesson: for 2D accesses in column order, if an entire column doesn’t fit, then the miss rate is 100% (block (1,2) is gone after the access to element 9).

Slide19

Matrix multiplication

[Diagram: array A with elements 1 to 16 and array B with elements 17 to 32]

for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        for (k = 0; k < N; k++) {
            … = A[i][k] * B[k][j];
        }
    }
}

Slide20

Two 2D Arrays

[Diagram: array A with elements 1 to 16, array B with elements 17 to 32, and the cache]

A[] does not fit, B[] does not fit, column of B[] does not fit (at same time as row of A[])

for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        for (k = 0; k < N; k++) {
            … = A[i][k] * B[k][j];
        }
    }
}

Miss rate = #misses / #accesses = ?

Slide21

Two 2D Arrays

[Diagram: array A with elements 1 to 16, array B with elements 17 to 32, and the cache]

A[] does not fit, B[] does not fit, column of B[] does not fit (at same time as row of A[])

for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        for (k = 0; k < N; k++) {
            … = A[i][k] * B[k][j];
        }
    }
}

The innermost loop (i=j=0): A[0][0] * B[0][0], A[0][1] * B[1][0], A[0][2] * B[2][0], A[0][3] * B[3][0]

Cache contents (block: time stamp of last access): (1,2): 3 (updated from 1), (17,18): 2, (21,22): 4, (3,4): 5

Miss rate = #misses / #accesses = ?

Slide22

Two 2D Arrays

[Diagram: array A with elements 1 to 16, array B with elements 17 to 32, and the cache]

A[] does not fit, B[] does not fit, column of B[] does not fit (at same time as row of A[])

for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        for (k = 0; k < N; k++) {
            … = A[i][k] * B[k][j];
        }
    }
}

The innermost loop (i=j=0): A[0][0] * B[0][0], A[0][1] * B[1][0], A[0][2] * B[2][0], A[0][3] * B[3][0]

Cache contents (block: time stamp of last access): (1,2): 3, (25,26): 6, (21,22): 4, (3,4): 7 (updated from 5)

Miss rate = #misses / #accesses = ?

Slide23

Two 2D Arrays

[Diagram: array A with elements 1 to 16, array B with elements 17 to 32, and the cache]

A[] does not fit, B[] does not fit, column of B[] does not fit (at same time as row of A[])

for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        for (k = 0; k < N; k++) {
            … = A[i][k] * B[k][j];
        }
    }
}

The innermost loop (i=j=0): A[0][0] * B[0][0], A[0][1] * B[1][0], A[0][2] * B[2][0], A[0][3] * B[3][0]

Cache contents (block: time stamp of last access): (29,30): 8, (25,26): 6, (21,22): 4, (3,4): 7

Miss rate = #misses / #accesses = ?

Slide24

Two 2D Arrays

[Diagram: array A with elements 1 to 16, array B with elements 17 to 32, and the cache]

A[] does not fit, B[] does not fit, column of B[] does not fit (at same time as row of A[])

for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        for (k = 0; k < N; k++) {
            … = A[i][k] * B[k][j];
        }
    }
}

Next time (i=0, j=1): A[0][0] * B[0][1], A[0][1] * B[1][1], A[0][2] * B[2][1], A[0][3] * B[3][1]

Cache contents (block: time stamp of last access): (29,30): 8, (25,26): 6, (21,22): 4, (3,4): 7

Miss rate = #misses / #accesses = ?

Slide25

Two 2D Arrays

[Diagram: array A with elements 1 to 16, array B with elements 17 to 32, and the cache]

A[] does not fit, B[] does not fit, column of B[] does not fit (at same time as row of A[])

for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        for (k = 0; k < N; k++) {
            … = A[i][k] * B[k][j];
        }
    }
}

Next time (i=0, j=1): A[0][0] * B[0][1], A[0][1] * B[1][1], A[0][2] * B[2][1], A[0][3] * B[3][1]

Cache contents (block: time stamp of last access): (29,30): 8, (25,26): 6, (1,2): 9, (3,4): 7

Miss rate = #misses / #accesses = ?

Slide26

Two 2D Arrays

[Diagram: array A with elements 1 to 16, array B with elements 17 to 32, and the cache]

A[] does not fit, B[] does not fit, column of B[] does not fit (at same time as row of A[])

for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        for (k = 0; k < N; k++) {
            … = A[i][k] * B[k][j];
        }
    }
}

Next time (i=0, j=1): A[0][0] * B[0][1], A[0][1] * B[1][1], A[0][2] * B[2][1], A[0][3] * B[3][1]

Cache contents (block: time stamp of last access): (29,30): 8, (17,18): 10, (1,2): 11 (updated from 9), (3,4): 7

Miss rate = #misses / #accesses = ?

Slide27

Two 2D Arrays

[Diagram: array A with elements 1 to 16, array B with elements 17 to 32, and the cache]

A[] does not fit, B[] does not fit, column of B[] does not fit (at same time as row of A[])

for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        for (k = 0; k < N; k++) {
            … = A[i][k] * B[k][j];
        }
    }
}

Next time (i=0, j=1): A[0][0] * B[0][1], A[0][1] * B[1][1], A[0][2] * B[2][1], A[0][3] * B[3][1]

Cache contents (block: time stamp of last access): (29,30): 8, (17,18): 10, (1,2): 11, (21,22): 12

Miss rate = #misses / #accesses = ?

Slide28

Two 2D Arrays

[Diagram: array A with elements 1 to 16, array B with elements 17 to 32, and the cache]

A[] does not fit, B[] does not fit, column of B[] does not fit (at same time as row of A[])

for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        for (k = 0; k < N; k++) {
            … = A[i][k] * B[k][j];
        }
    }
}

Next time (i=0, j=1): A[0][0] * B[0][1], A[0][1] * B[1][1], A[0][2] * B[2][1], A[0][3] * B[3][1]

Cache contents (block: time stamp of last access): (3,4): 13, (17,18): 10, (1,2): 11, (21,22): 12

Miss rate = #misses / #accesses = ?

Slide29

Two 2D Arrays

[Diagram: array A with elements 1 to 16, array B with elements 17 to 32, and the cache]

A[] does not fit, B[] does not fit, column of B[] does not fit (at same time as row of A[])

for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        for (k = 0; k < N; k++) {
            … = A[i][k] * B[k][j];
        }
    }
}

Next time (i=0, j=1): A[0][0] * B[0][1], A[0][1] * B[1][1], A[0][2] * B[2][1], A[0][3] * B[3][1]

Cache contents (block: time stamp of last access): (3,4): 15 (updated from 13), (25,26): 14, (1,2): 11, (21,22): 12

Miss rate = #misses / #accesses = ?

Slide30

Two 2D Arrays

[Diagram: array A with elements 1 to 16, array B with elements 17 to 32, and the cache]

A[] does not fit, B[] does not fit, column of B[] does not fit (at same time as row of A[])

for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        for (k = 0; k < N; k++) {
            … = A[i][k] * B[k][j];
        }
    }
}

Next time (i=0, j=1): A[0][0] * B[0][1], A[0][1] * B[1][1], A[0][2] * B[2][1], A[0][3] * B[3][1]

Cache contents (block: time stamp of last access): (3,4): 15, (25,26): 14, (29,30): 16, (21,22): 12

Miss rate = #misses / #accesses = 75%

Slide31
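The 75% figure traced on the preceding 2D-array slides can be reproduced with a short simulation. The sketch below assumes what the cache drawings appear to show: four blocks of 2 ints, fully associative with LRU replacement, A[] stored immediately before B[], and only the loads of A[i][k] and B[k][j] counted. All names (`Cache`, `cache_access`, `matmul_trace`) are hypothetical.

```c
enum { MAX_BLOCKS = 8 };

typedef struct {
    long tag[MAX_BLOCKS];    /* block number held in each slot, -1 if empty */
    long stamp[MAX_BLOCKS];  /* time of last access, 0 if never used */
    int nblocks, elems_per_block;
    long now, accesses, misses;
} Cache;

static void cache_init(Cache *c, int nblocks, int elems_per_block) {
    c->nblocks = nblocks;
    c->elems_per_block = elems_per_block;
    c->now = c->accesses = c->misses = 0;
    for (int s = 0; s < nblocks; s++) { c->tag[s] = -1; c->stamp[s] = 0; }
}

/* Touch element number `elem`; on a miss, replace the least recently used slot. */
static void cache_access(Cache *c, long elem) {
    long block = elem / c->elems_per_block;
    int lru = 0;
    c->now++;
    c->accesses++;
    for (int s = 0; s < c->nblocks; s++) {
        if (c->tag[s] == block) { c->stamp[s] = c->now; return; }  /* hit */
        if (c->stamp[s] < c->stamp[lru]) lru = s;
    }
    c->misses++;
    c->tag[lru] = block;
    c->stamp[lru] = c->now;
}

/* Replay the loads of the i-j-k loop nest for n x n int matrices A and B. */
void matmul_trace(int n, long *misses, long *accesses) {
    Cache c;
    long a_base = 0;            /* A occupies elements 0 .. n*n-1 */
    long b_base = (long)n * n;  /* B follows A directly (assumption) */
    cache_init(&c, 4, 2);       /* 4 blocks, 2 ints per block (assumption) */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++) {
                cache_access(&c, a_base + (long)i * n + k);  /* A[i][k] */
                cache_access(&c, b_base + (long)k * n + j);  /* B[k][j] */
            }
    *misses = c.misses;
    *accesses = c.accesses;
}
```

For n = 4 the trace has 128 accesses and 96 misses, i.e. 75%: every access to B misses, and every second access to A misses.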

Example: Matrix Multiplication

[Diagram: c += a * b, where row i of a and column j of b produce element (i, j) of c]

c = (double *) calloc(n*n, sizeof(double));

/* Multiply n x n matrices a and b (stored row-major in flat arrays) */
void mmm(double *a, double *b, double *c, int n) {
    int i, j, k;
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            for (k = 0; k < n; k++)
                c[i*n + j] += a[i*n + k] * b[k*n + j];
}

Slide32

Cache Miss Analysis

Assume:
Matrix elements are doubles
Cache block 64B = 8 doubles
Cache capacity << n (much smaller than n)
i.e., can’t even hold an entire row in the cache!

First iteration:
How many misses?
n/8 misses for the row of a (one miss per 8-double block)
n misses for the column of b (every element lies in a different block)
n/8 + n = 9n/8 misses

[Diagram: contents of the cache at the end of the first iteration, showing the 8-wide blocks of a and b]

Slide33

Cache Miss Analysis

Assume:
Matrix elements are doubles
Cache block = 8 doubles
Cache capacity << n (much smaller than n)

Second iteration:
Number of misses: n/8 + n = 9n/8 misses

Total misses (entire mmm): 9n/8 * n^2 = (9/8) * n^3
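Written out step by step, the count above is one row of a (read block by block) plus one column of b (one block per element) for each of the n^2 elements of c. The numeric value for n = 512 is only an illustration and is not taken from the slides.

```latex
\begin{align*}
\text{misses per element of } c
  &= \underbrace{\frac{n}{8}}_{\text{row of } a}
   + \underbrace{n}_{\text{column of } b}
   = \frac{9n}{8}, \\
\text{total misses}
  &= \frac{9n}{8} \cdot n^2 = \frac{9}{8}\, n^3, \\
n = 512:\quad
  &\frac{9}{8} \cdot 512^3 = 150{,}994{,}944 .
\end{align*}
```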

[Diagram: c += a * b, with the 8-wide blocks of a and b highlighted]

Slide34

Doing Better

MMM has lots of re-use:
try to use all of a cache block once loaded

Challenge:
we need both rows and columns to work with

Compromise:
operate in sub-squares of the matrices
One sub-square per matrix should fit in cache simultaneously
Heavily re-use the sub-squares before loading new ones

Called ‘Tiling’ or ‘Blocking’
A sub-square is a ‘tile’

Slide35

Tiled Matrix Multiplication

c = (double *) calloc(n*n, sizeof(double));

/* Multiply n x n matrices a and b.
   T is the tile size, a compile-time constant; n must be a multiple of T. */
void mmm(double *a, double *b, double *c, int n) {
    int i, j, k, i1, j1, k1;
    for (i = 0; i < n; i += T)
        for (j = 0; j < n; j += T)
            for (k = 0; k < n; k += T)
                /* T x T mini matrix multiplications */
                for (i1 = i; i1 < i+T; i1++)
                    for (j1 = j; j1 < j+T; j1++)
                        for (k1 = k; k1 < k+T; k1++)
                            c[i1*n + j1] += a[i1*n + k1] * b[k1*n + j1];
}
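A quick way to gain confidence in a tiled loop nest is to compare it against the plain triple loop. The sketch below is one possible check, not the lecture's code: it passes the tile size as a parameter `t`, clamps the last tile so that n need not be a multiple of `t`, and uses hypothetical helper names (`mmm_naive`, `mmm_tiled`, `tiled_matches_naive`).

```c
#include <stdlib.h>

static size_t min_size(size_t x, size_t y) { return x < y ? x : y; }

/* Reference i-j-k multiplication: c += a * b for n x n row-major matrices. */
void mmm_naive(const double *a, const double *b, double *c, size_t n) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                c[i*n + j] += a[i*n + k] * b[k*n + j];
}

/* Tiled version; min_size() clamps the last tile when t does not divide n. */
void mmm_tiled(const double *a, const double *b, double *c, size_t n, size_t t) {
    for (size_t i = 0; i < n; i += t)
        for (size_t j = 0; j < n; j += t)
            for (size_t k = 0; k < n; k += t)
                for (size_t i1 = i; i1 < min_size(i + t, n); i1++)
                    for (size_t j1 = j; j1 < min_size(j + t, n); j1++)
                        for (size_t k1 = k; k1 < min_size(k + t, n); k1++)
                            c[i1*n + j1] += a[i1*n + k1] * b[k1*n + j1];
}

/* Returns 1 if both versions agree, 0 if they differ,
   -1 on bad input or allocation failure. */
int tiled_matches_naive(size_t n, size_t t) {
    if (n == 0 || t == 0) return -1;
    double *a = malloc(n * n * sizeof *a);
    double *b = malloc(n * n * sizeof *b);
    double *c1 = calloc(n * n, sizeof *c1);
    double *c2 = calloc(n * n, sizeof *c2);
    int result = -1;
    if (a && b && c1 && c2) {
        /* Small integer values keep every sum exact, so == is a fair test. */
        for (size_t x = 0; x < n * n; x++) {
            a[x] = (double)(x % 7);
            b[x] = (double)(x % 5);
        }
        mmm_naive(a, b, c1, n);
        mmm_tiled(a, b, c2, n, t);
        result = 1;
        for (size_t x = 0; x < n * n; x++)
            if (c1[x] != c2[x]) result = 0;
    }
    free(a); free(b); free(c1); free(c2);
    return result;
}
```

Tiling changes only the order in which the same multiply-adds are performed, so the two results should match for any tile size, including sizes that do not divide n or that exceed it.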

[Diagram: c += a * b, with row i1 of a tile of a and column j1 of a tile of b; tile size T x T]

Slide36

Big picture

[Diagram: c += a * b]

First calculate C[0][0] to C[T-1][T-1]

Slide37

Big picture

[Diagram: c += a * b]

Next calculate C[0][T] to C[T-1][2T-1]

Slide38

Detailed Visualization

[Diagram: c += a * b, one tile of each matrix]

Still have to access b[] column-wise
But now b’s cache blocks don’t get replaced

Slide39

Cache Miss Analysis

Assume:
Cache block = 8 doubles
Cache capacity << n (much smaller than n)
Need to fit 3 tiles in cache: hence ensure 3T^2 < capacity (since 3 arrays a, b, c)

Misses per tile-iteration:
T^2/8 misses for each tile
2n/T * T^2/8 = nT/4

Total misses:
Tiled: nT/4 * (n/T)^2 = n^3/(4T)
Untiled: (9/8) * n^3

[Diagram: c += a * b with tile size T x T and n/T tiles along each dimension]
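The two estimates can be turned into a small calculator for choosing T. This is a sketch under the slide's assumptions (8 doubles per block, capacity measured in doubles, 3T^2 < capacity); the function names are hypothetical.

```c
#include <stddef.h>

/* Largest tile size T with 3*T*T < capacity, where capacity is in doubles. */
size_t max_tile_size(size_t capacity_doubles) {
    size_t t = 0;
    while (3 * (t + 1) * (t + 1) < capacity_doubles)
        t++;
    return t;
}

/* Estimated misses without tiling: (9/8) * n^3. */
double untiled_misses(double n) {
    return 9.0 / 8.0 * n * n * n;
}

/* Estimated misses with T x T tiles: n^3 / (4T). */
double tiled_misses(double n, double t) {
    return n * n * n / (4.0 * t);
}
```

As an example, the 32 KB L1 cache mentioned earlier holds 4096 doubles, for which `max_tile_size(4096)` returns 36. The ratio of the two estimates is 9T/2, so larger tiles that still fit give proportionally fewer misses.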