Slide1
ECE 454 Computer Systems Programming
Memory performance (Part II: Optimizing for caches)
Ding Yuan
ECE Dept., University of Toronto
http://www.eecg.toronto.edu/~yuan
Slide2
Content
Cache basics and organization (last lec.)
Optimizing for Caches (this lec.)
  Tiling/blocking
  Loop reordering
Prefetching (next lec.)
Virtual Memory (next lec.)
Slide3
Optimizing for Caches
Slide4
Memory Optimizations
Write code that has locality
Spatial: access data contiguously
Temporal: make sure access to the same data is not too far apart in time
How to achieve?
Proper choice of algorithm
Loop transformations
Slide5
Background: Array Allocation
Basic Principle
T A[L];
Array of data type T and length L
Contiguously allocated region of L * sizeof(T) bytes
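A minimal sketch of the allocation rule, assuming nothing beyond standard C. The helper names `int_elem_offset` and `double3_size` are made up for illustration and are not from the slides. The first measures, with real pointers, how far element i sits from the start of an int array; the second reports the total size of a 3-element double array.

```c
#include <assert.h>
#include <stddef.h>

/* Byte distance from the start of a 5-element int array to element i,
   measured with pointers instead of computed from the formula.
   (Hypothetical helper, for illustration only.) */
static size_t int_elem_offset(size_t i) {
    static int val[5];
    assert(i <= 5); /* forming the one-past-the-end address is allowed */
    return (size_t)((char *)(val + i) - (char *)val);
}

/* Total size of T A[L] is L * sizeof(T); shown here for double a[3]. */
static size_t double3_size(void) {
    return sizeof(double[3]);
}
```

On a platform where int is 4 bytes, `int_elem_offset(3)` evaluates to 12, which is the x + 12 position of val[3] in the layout that follows.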
char string[12];        occupies x to x + 12
int val[5];             elements at x, x + 4, x + 8, x + 12, x + 16; ends at x + 20
double a[3];            elements at x, x + 8, x + 16; ends at x + 24
char *p[3]; (64 bit)    elements at x, x + 8, x + 16; ends at x + 24
Slide6
Multidimensional (Nested) Arrays
Declaration
T A[R][C];
2D array of data type T
R rows, C columns
T element requires K bytes
Array Size
R * C * K bytes
Arrangement
Row-Major Ordering (C code)
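Row-major ordering means A[i][j] sits (i * C + j) * K bytes past A[0][0]. The sketch below compares that prediction with an offset measured from real addresses. The sizes R = 3 and C = 4 and both function names are illustrative assumptions, not values from the slides.

```c
#include <assert.h>
#include <stddef.h>

enum { R = 3, C = 4 }; /* illustrative sizes, not from the slides */

/* Byte offset of A[i][j] from A[0][0], measured with pointers. */
static size_t measured_offset(size_t i, size_t j) {
    static int A[R][C];
    assert(i < R && j < C);
    return (size_t)((char *)&A[i][j] - (char *)&A[0][0]);
}

/* Offset predicted by row-major ordering: (i * C + j) * K bytes,
   where K = sizeof(int) for this array. */
static size_t row_major_offset(size_t i, size_t j) {
    return (i * C + j) * sizeof(int);
}
```

The two functions agree for every valid (i, j), and `measured_offset(1, 0)` equals C * sizeof(int): a whole row lies between A[0][0] and A[1][0].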
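[Figure: the R x C grid, from A[0][0] ... A[0][C-1] in the first row to A[R-1][0] ... A[R-1][C-1] in the last row]
int A[R][C];
[Figure: memory layout with rows stored one after another: A[0][0] ... A[0][C-1], A[1][0] ... A[1][C-1], ..., A[R-1][0] ... A[R-1][C-1]; 4*R*C bytes in total]
Slide7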
Assumed Simple Cache
2 ints per block
2-way set associative
2 blocks, 1 set in total
i.e., same thing as fully associative
Replacement policy: Least Recently Used (LRU)
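The assumed cache is small enough to simulate in a few lines. This is a sketch under the stated assumptions (2 blocks, 2 ints per block, fully associative, LRU). The `Cache` type and the function names are invented for illustration, and array elements are identified by index rather than by address.

```c
#include <assert.h>
#include <stddef.h>

enum { NUM_BLOCKS = 2, INTS_PER_BLOCK = 2 }; /* the assumed simple cache */

typedef struct {
    long block[NUM_BLOCKS];          /* memory block held by each slot, -1 = empty */
    unsigned long stamp[NUM_BLOCKS]; /* time of last use, 0 = never used */
    unsigned long now, misses, accesses;
} Cache;

static void cache_init(Cache *c) {
    for (int s = 0; s < NUM_BLOCKS; s++) {
        c->block[s] = -1;
        c->stamp[s] = 0;
    }
    c->now = c->misses = c->accesses = 0;
}

/* Touch the int at element index idx; returns 1 on a miss, 0 on a hit. */
static int cache_access(Cache *c, long idx) {
    long want = idx / INTS_PER_BLOCK;
    int victim = 0;
    assert(idx >= 0);
    c->accesses++;
    c->now++;
    for (int s = 0; s < NUM_BLOCKS; s++) {
        if (c->block[s] == want) {   /* hit: refresh the time stamp */
            c->stamp[s] = c->now;
            return 0;
        }
        if (c->stamp[s] < c->stamp[victim])
            victim = s;              /* track the least recently used slot */
    }
    c->block[victim] = want;         /* miss: replace the LRU slot */
    c->stamp[victim] = c->now;
    c->misses++;
    return 1;
}

/* P sequential passes over an N-element int array; returns the miss count. */
static unsigned long scan_misses(long N, long P) {
    Cache c;
    cache_init(&c);
    for (long k = 0; k < P; k++)
        for (long i = 0; i < N; i++)
            cache_access(&c, i);
    return c.misses;
}
```

One pass over 4 or 8 ints misses on half of the accesses (`scan_misses(4, 1)` is 2, `scan_misses(8, 1)` is 4). With three passes, the 4-element array that fits still costs only 2 misses in total, while the 8-element array costs 12 misses out of 24 accesses.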
Cache
Block 0
Block 1
Slide8
Some Key Questions
How many elements are there per block?
Does the data structure fit in the cache?
Do I re-use blocks over time?
In what order am I accessing blocks?
Slide9
Simple Array
[Figure: array A holding elements 1 2 3 4, next to the cache]
for (i = 0; i < N; i++) {
    … = A[i];
}
Miss rate = #misses / #accesses = (N/2) / N = ½ = 50%
Slide10
Simple Array with outer loop
[Figure: array A holding elements 1 2 3 4, next to the cache]
for (k = 0; k < P; k++) {
    for (i = 0; i < N; i++) {
        … = A[i];
    }
}
Assume A[] fits in the cache:
Miss rate = #misses / #accesses = (N/2) / (N*P) = 1/(2P)
Lesson: for sequential accesses with re-use, if the data structure fits in the cache, the first visit suffers all the misses
Slide11
Simple Array
[Figure: array A holding elements 1 2 3 4 5 6 7 8, next to the cache]
for (i = 0; i < N; i++) {
    … = A[i];
}
Assume A[] does not fit in the cache:
Miss rate = #misses / #accesses
Slide12
Simple Array
[Figure: array A holding elements 1 2 3 4 5 6 7 8; after the loop the cache holds 5 6 7 8]
for (i = 0; i < N; i++) {
    … = A[i];
}
Assume A[] does not fit in the cache:
Miss rate = #misses / #accesses = (N/2) / N = ½ = 50%
Lesson: for sequential accesses with no re-use, it doesn’t matter whether the data structure fits
Slide13
Simple Array with outer loop
[Figure: array A holding elements 1 2 3 4 5 6 7 8, next to the cache]
for (k = 0; k < P; k++) {
    for (i = 0; i < N; i++) {
        … = A[i];
    }
}
Assume A[] does not fit in the cache:
Miss rate = #misses / #accesses = (P * N/2) / (P * N) = ½ = 50%
Lesson: for sequential accesses with re-use, if the data structure doesn’t fit, the miss rate is the same as with no re-use
Slide14
Let’s warm up our cache
Problem (and opportunity)
L1 cache reference 0.5 ns* (L1 cache size: 32 KB)
Main memory reference 100 ns (mem. size: 4 GB)
Locality
Temporal locality
Spatial locality
Target program: matrix multiplication
Slide15
2D array
[Figure: 2D array A holding elements 1 2 3 4, next to the cache]
for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        … = A[i][j];
    }
}
Assume A[] fits in the cache:
Miss rate = #misses / #accesses = (N*N/2) / (N*N) = ½ = 50%
Slide16
2D array
[Figure: 4 x 4 array A holding elements 1 to 16 in row-major order, next to the cache]
for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        … = A[i][j];
    }
}
Assume A[] does not fit in the cache:
Miss rate = #misses / #accesses = (N*N/2) / (N*N) = ½ = 50%
Lesson: for 2D accesses, if row order and no re-use, same hit rate as sequential, whether it fits or not
Slide17
2D array
[Figure: 2D array A holding elements 1 2 3 4, next to the cache]
for (j = 0; j < N; j++) {
    for (i = 0; i < N; i++) {
        … = A[i][j];
    }
}
Assume A[] fits in the cache:
Miss rate = #misses / #accesses = (N*N/2) / (N*N) = ½ = 50%
Lesson: for 2D accesses, if column order and no re-use, same hit rate as sequential if the entire column fits in the cache
Slide18
2D array
[Figure: 4 x 4 array A holding elements 1 to 16 in row-major order, next to the cache]
for (j = 0; j < N; j++) {
    for (i = 0; i < N; i++) {
        … = A[i][j];
    }
}
Assume A[] does not fit in the cache:
Miss rate = #misses / #accesses = (N*N) / (N*N) = 100%
Lesson: for 2D accesses in column order, if the entire column doesn’t fit, the miss rate is 100% (block (1,2) is gone after the access to element 9).
Slide19
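Looking back at the four 2D array slides: the two traversal orders differ only in which loop is innermost. The sketch below writes them side by side for a 4 x 4 int array filled with 1 to 16, as in the figures. The function names are illustrative assumptions. Both orders compute the same sum and only the memory access pattern changes.

```c
#include <assert.h>
#include <stddef.h>

enum { N = 4 }; /* matches the 4 x 4 figure; any size works */

/* Row order: the inner loop walks along a row, so consecutive
   accesses are adjacent in memory (row-major layout). */
static long sum_row_order(int A[N][N]) {
    long sum = 0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            sum += A[i][j];
    return sum;
}

/* Column order: the inner loop walks down a column, so consecutive
   accesses are N ints apart. */
static long sum_col_order(int A[N][N]) {
    long sum = 0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            sum += A[i][j];
    return sum;
}

/* Fill A with 1..N*N, check that both orders agree, return the sum. */
static long checked_sum(void) {
    int A[N][N];
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            A[i][j] = (int)(i * N + j + 1);
    assert(sum_row_order(A) == sum_col_order(A));
    return sum_row_order(A);
}
```

With 2 ints per block, the row-order loop uses both ints of each block back to back. The column-order loop touches N different blocks before it returns to the second int of the first one, which is why it needs the whole column to stay cached.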
Matrix multiplication
for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        for (k = 0; k < N; k++) {
            … = A[i][k] * B[k][j];
        }
    }
}
[Figure: 4 x 4 matrix A holding elements 1 to 16 and 4 x 4 matrix B holding elements 17 to 32, both in row-major order]
Slide20
2 2D Arrays
A[] does not fit, B[] does not fit, column of B[] does not fit (at same time as row of A[])
Miss rate = #misses / #accesses =
for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
        for (k = 0; k < N; k++) {
            … = A[i][k] * B[k][j];
        }
    }
}
[Figure: 4 x 4 matrix A holding elements 1 to 16 and 4 x 4 matrix B holding elements 17 to 32, next to an empty cache]
Slide21
2 2D Arrays
(Same loop nest, assumptions, and A/B layout as on the first "2 2D Arrays" slide.)
The innermost loop (i=j=0): A[0][0] * B[0][0], A[0][1] * B[1][0], A[0][2] * B[2][0], A[0][3] * B[3][0]
Cache contents (block: LRU time stamp):
  (1,2): 3 (was 1)
  (17,18): 2
  (21,22): 4
  (3,4): 5
Slide22
2 2D Arrays
(Same loop nest, assumptions, and A/B layout as on the first "2 2D Arrays" slide.)
The innermost loop (i=j=0): A[0][0] * B[0][0], A[0][1] * B[1][0], A[0][2] * B[2][0], A[0][3] * B[3][0]
Cache contents (block: LRU time stamp):
  (1,2): 3
  (25,26): 6 (replaced (17,18))
  (21,22): 4
  (3,4): 7 (was 5)
Slide23
2 2D Arrays
(Same loop nest, assumptions, and A/B layout as on the first "2 2D Arrays" slide.)
The innermost loop (i=j=0): A[0][0] * B[0][0], A[0][1] * B[1][0], A[0][2] * B[2][0], A[0][3] * B[3][0]
Cache contents (block: LRU time stamp):
  (29,30): 8 (replaced (1,2))
  (25,26): 6
  (21,22): 4
  (3,4): 7
Slide24
2 2D Arrays
(Same loop nest, assumptions, and A/B layout as on the first "2 2D Arrays" slide.)
Next iteration (i=0, j=1): A[0][0] * B[0][1], A[0][1] * B[1][1], A[0][2] * B[2][1], A[0][3] * B[3][1]
Cache contents (block: LRU time stamp):
  (29,30): 8
  (25,26): 6
  (21,22): 4
  (3,4): 7
Slide25
2 2D Arrays
(Same loop nest, assumptions, and A/B layout as on the first "2 2D Arrays" slide.)
Next iteration (i=0, j=1): A[0][0] * B[0][1], A[0][1] * B[1][1], A[0][2] * B[2][1], A[0][3] * B[3][1]
Cache contents (block: LRU time stamp):
  (29,30): 8
  (25,26): 6
  (1,2): 9 (replaced (21,22))
  (3,4): 7
Slide26
2 2D Arrays
(Same loop nest, assumptions, and A/B layout as on the first "2 2D Arrays" slide.)
Next iteration (i=0, j=1): A[0][0] * B[0][1], A[0][1] * B[1][1], A[0][2] * B[2][1], A[0][3] * B[3][1]
Cache contents (block: LRU time stamp):
  (29,30): 8
  (17,18): 10 (replaced (25,26))
  (1,2): 11 (was 9)
  (3,4): 7
Slide27
2 2D Arrays
(Same loop nest, assumptions, and A/B layout as on the first "2 2D Arrays" slide.)
Next iteration (i=0, j=1): A[0][0] * B[0][1], A[0][1] * B[1][1], A[0][2] * B[2][1], A[0][3] * B[3][1]
Cache contents (block: LRU time stamp):
  (29,30): 8
  (17,18): 10
  (1,2): 11
  (21,22): 12 (replaced (3,4))
Slide28
2 2D Arrays
(Same loop nest, assumptions, and A/B layout as on the first "2 2D Arrays" slide.)
Next iteration (i=0, j=1): A[0][0] * B[0][1], A[0][1] * B[1][1], A[0][2] * B[2][1], A[0][3] * B[3][1]
Cache contents (block: LRU time stamp):
  (3,4): 13 (replaced (29,30))
  (17,18): 10
  (1,2): 11
  (21,22): 12
Slide29
2 2D Arrays
(Same loop nest, assumptions, and A/B layout as on the first "2 2D Arrays" slide.)
Next iteration (i=0, j=1): A[0][0] * B[0][1], A[0][1] * B[1][1], A[0][2] * B[2][1], A[0][3] * B[3][1]
Cache contents (block: LRU time stamp):
  (3,4): 15 (was 13)
  (25,26): 14 (replaced (17,18))
  (1,2): 11
  (21,22): 12
Slide30
2 2D Arrays
(Same loop nest, assumptions, and A/B layout as on the first "2 2D Arrays" slide.)
Next iteration (i=0, j=1): A[0][0] * B[0][1], A[0][1] * B[1][1], A[0][2] * B[2][1], A[0][3] * B[3][1]
Cache contents (block: LRU time stamp):
  (3,4): 15
  (25,26): 14
  (29,30): 16 (replaced (1,2))
  (21,22): 12
Miss rate = #misses / #accesses = 75%
Slide31
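The 75% result of the "2 2D Arrays" walkthrough can be reproduced with a small simulation. This sketch assumes what the figures appear to show: a fully associative LRU cache with 4 blocks of 2 ints, a 4 x 4 matrix A at element indices 0 to 15, and B at 16 to 31 (the slides number the same elements 1 to 32). All names are illustrative.

```c
#include <assert.h>

enum { N = 4, NUM_BLOCKS = 4, INTS_PER_BLOCK = 2 }; /* assumed from the figures */

static long held[NUM_BLOCKS];           /* memory block held by each slot */
static unsigned long stamp[NUM_BLOCKS]; /* time of last use, 0 = empty slot */
static unsigned long now;

/* Touch one int by element index; returns 1 on a miss, 0 on a hit. */
static int touch(long idx) {
    long want = idx / INTS_PER_BLOCK;
    int victim = 0;
    assert(idx >= 0);
    now++;
    for (int s = 0; s < NUM_BLOCKS; s++) {
        if (stamp[s] != 0 && held[s] == want) {
            stamp[s] = now;             /* hit: refresh the time stamp */
            return 0;
        }
        if (stamp[s] < stamp[victim])
            victim = s;                 /* least recently used (or empty) slot */
    }
    held[victim] = want;                /* miss: replace the LRU slot */
    stamp[victim] = now;
    return 1;
}

/* Number of accesses made by the i-j-k loop nest: one to A, one to B. */
static unsigned long mmm_accesses(void) {
    return 2UL * N * N * N;
}

/* Replay the loop nest on an empty cache and count the misses. */
static unsigned long mmm_misses(void) {
    unsigned long misses = 0;
    now = 0;
    for (int s = 0; s < NUM_BLOCKS; s++)
        stamp[s] = 0;
    for (long i = 0; i < N; i++)
        for (long j = 0; j < N; j++)
            for (long k = 0; k < N; k++) {
                misses += (unsigned long)touch(i * N + k);         /* A[i][k] */
                misses += (unsigned long)touch(N * N + k * N + j); /* B[k][j] */
            }
    return misses;
}
```

Under these assumptions the simulation counts 96 misses in 128 accesses, which is 75%: every access to B misses and every second access to A hits.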
Example: Matrix Multiplication
[Figure: c += a * b, with row i of a and column j of b highlighted]
c = (double *) calloc(n*n, sizeof(double));
/* Multiply n x n matrices a and b (row-major, stored in flat arrays) */
void mmm(double *a, double *b, double *c, int n) {
    int i, j, k;
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            for (k = 0; k < n; k++)
                c[i*n + j] += a[i*n + k]*b[k*n + j];
}
Slide32
Cache Miss Analysis
Assume:
Matrix elements are doubles
Cache block 64B = 8 doubles
Cache capacity << n (much smaller than n)
i.e., can’t even hold an entire row in the cache!
First iteration:
How many misses?
n/8 misses for the row of a (8 doubles per block)
n misses for the column of b (one per element)
n/8 + n = 9n/8 misses
[Figure: c += a * b, showing what is in cache at the end of the first iteration; the cached strips are 8 wide]
Slide33
Cache Miss Analysis
Assume:
Matrix elements are doubles
Cache block = 8 doubles
Cache capacity << n (much smaller than n)
Second iteration:
Number of misses: n/8 + n = 9n/8 misses
Total misses (entire mmm): 9n/8 * n^2 = (9/8) * n^3
[Figure: c += a * b for the second iteration; the cached strips are 8 wide]
Slide34
Doing Better
MMM has lots of re-use:
try to use all of a cache block once loaded
Challenge:
we need both rows and columns to work with
Compromise:
operate in sub-squares of the matrices
One sub-square per matrix should fit in cache simultaneously
Heavily re-use the sub-squares before loading new ones
Called ‘Tiling’ or ‘Blocking’
A sub-square is a ‘tile’
Slide35
Tiled Matrix Multiplication
c = (double *) calloc(n*n, sizeof(double));
/* Multiply n x n matrices a and b (row-major, stored in flat arrays).
   T is the tile size; as written, n must be a multiple of T. */
void mmm(double *a, double *b, double *c, int n) {
    int i, j, k, i1, j1, k1;
    for (i = 0; i < n; i+=T)
        for (j = 0; j < n; j+=T)
            for (k = 0; k < n; k+=T)
                /* T x T mini matrix multiplications */
                for (i1 = i; i1 < i+T; i1++)
                    for (j1 = j; j1 < j+T; j1++)
                        for (k1 = k; k1 < k+T; k1++)
                            c[i1*n + j1] += a[i1*n + k1]*b[k1*n + j1];
}
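To check that tiling changes only the order of the work and not the result, the sketch below puts an untiled and a tiled multiply side by side. It is an illustration, not the lecture's code: the function names are made up, the tile size is passed as a parameter instead of the constant T, and edge tiles are clipped so that n does not have to be a multiple of the tile size.

```c
#include <assert.h>
#include <stddef.h>

/* Untiled reference: c += a * b for n x n row-major matrices. */
static void mmm_naive(const double *a, const double *b, double *c, size_t n) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++)
            for (size_t k = 0; k < n; k++)
                c[i * n + j] += a[i * n + k] * b[k * n + j];
}

static size_t min_sz(size_t x, size_t y) { return x < y ? x : y; }

/* Tiled version; tile is the tile edge (T in the slides). Edge tiles are
   clipped with min_sz, so n need not be a multiple of tile. */
static void mmm_tiled(const double *a, const double *b, double *c,
                      size_t n, size_t tile) {
    assert(tile > 0);
    for (size_t i = 0; i < n; i += tile)
        for (size_t j = 0; j < n; j += tile)
            for (size_t k = 0; k < n; k += tile)
                /* one tile x tile mini matrix multiplication */
                for (size_t i1 = i; i1 < min_sz(i + tile, n); i1++)
                    for (size_t j1 = j; j1 < min_sz(j + tile, n); j1++)
                        for (size_t k1 = k; k1 < min_sz(k + tile, n); k1++)
                            c[i1 * n + j1] += a[i1 * n + k1] * b[k1 * n + j1];
}

/* Returns 1 if both versions agree on a small 5 x 5 test, else 0. */
static int tiled_matches_naive(size_t tile) {
    enum { M = 5 };
    double a[M * M], b[M * M], c1[M * M] = {0}, c2[M * M] = {0};
    for (size_t x = 0; x < M * M; x++) {
        a[x] = (double)(x + 1);         /* small integers keep the sums exact */
        b[x] = (double)(2 * x) - 7.0;
    }
    mmm_naive(a, b, c1, M);
    mmm_tiled(a, b, c2, M, tile);
    for (size_t x = 0; x < M * M; x++)
        if (c1[x] != c2[x])
            return 0;
    return 1;
}
```

For each element of c, the tiled loops still add the products in increasing k order, so the two versions match exactly here. The tile size only affects how long the blocks of a, b, and c stay in the cache.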
[Figure: c += a * b, with tile row i1 of a and tile column j1 of b highlighted; tile size T x T]
Slide36
Big picture
[Figure: c += a * b]
First calculate C[0][0] – C[T-1][T-1]
Slide37
Big picture
[Figure: c += a * b]
Next calculate C[0][T] – C[T-1][2T-1]
Slide38
Detailed Visualization
[Figure: c += a * b, one tile at a time]
Still have to access b[] column-wise
But now b’s cache blocks don’t get replaced
Slide39
Cache Miss Analysis
Assume:
Cache block = 8 doubles
Cache capacity << n (much smaller than n)
Need to fit 3 tiles in cache: hence ensure 3T^2 < capacity (since 3 arrays a, b, c)
Misses per tile-iteration:
T^2/8 misses for each tile
2n/T * T^2/8 = nT/4
Total misses:
Tiled: nT/4 * (n/T)^2 = n^3/(4T)
Untiled: (9/8) * n^3
[Figure: c += a * b in tiles; tile size T x T, n/T tiles per row]
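The two totals can be compared numerically. This is a small sketch of the formulas above with illustrative function names. It evaluates the model only and does not measure a real cache, so the numbers hold only under the slide's assumptions (8 doubles per block, 3T^2 doubles fitting in the cache).

```c
#include <assert.h>

/* Estimated misses for the untiled i-j-k multiply: (9/8) * n^3. */
static double untiled_misses(double n) {
    return 9.0 / 8.0 * n * n * n;
}

/* Estimated misses for the tiled multiply with tile edge T: n^3 / (4T). */
static double tiled_misses(double n, double T) {
    assert(T > 0.0);
    return n * n * n / (4.0 * T);
}

/* Ratio untiled / tiled; algebraically (9/8) * 4T = 4.5 * T, independent of n. */
static double improvement(double n, double T) {
    return untiled_misses(n) / tiled_misses(n, T);
}
```

For example, `improvement(1024.0, 32.0)` evaluates to 144, so under this model a tile edge of 32 cuts the miss count by a factor of 4.5 * 32. Larger tiles help more, up to the point where three tiles no longer fit in the cache.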