Part 1
For dgemm0, the body of the third (innermost) for loop consists of one floating
point multiplication and one floating point addition. Since the body is executed
$n^3$ times, we get a total of $2n^3$ floating point operations. Also, because no
register reuse is employed, every value of a, b and c is loaded every time, which
results in $100 + 100 + \frac{1}{4} = 200.25$ cycles for the multiplication and
$100 + \frac{1}{4} = 100.25$ cycles for the addition. The addition does not need
to load its second operand, since that is the result of the multiplication, which
is already in a register. Hence, for $n = 1000$ we get
$(200.25 + 100.25)n^3 = 300.5n^3 = 300.5 \cdot 1000^3 = 300.5 \cdot 10^9$ cycles
in total, or 150.25 seconds at $2 \cdot 10^9$ cycles per second.
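The dgemm0 estimate can be reproduced with a few lines of arithmetic. The sketch below is only an illustration of the cost model described above; the constant names are made up here, and the values (100 cycles per load, 1/4 cycle per floating point operation, a clock of 2 · 10^9 cycles per second) are assumptions read off the figures used in this part.

```python
# Assumed model parameters, taken from the numbers used in the text.
LOAD_CYCLES = 100     # cycles to load one operand from memory
FLOP_CYCLES = 0.25    # cycles per floating point operation
CLOCK_HZ = 2e9        # clock rate in cycles per second


def dgemm0_cycles(n):
    """Estimated cycles for dgemm0 when nothing is kept in registers."""
    mul = 2 * LOAD_CYCLES + FLOP_CYCLES   # load a and b, then multiply
    add = LOAD_CYCLES + FLOP_CYCLES       # load c, then add the product
    return (mul + add) * n ** 3


def dgemm0_seconds(n):
    """Estimated run time of dgemm0 in seconds."""
    return dgemm0_cycles(n) / CLOCK_HZ


if __name__ == "__main__":
    print(dgemm0_cycles(1000), dgemm0_seconds(1000))
```

Changing `LOAD_CYCLES` shows how strongly the estimate depends on memory latency rather than on the arithmetic itself.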
For dgemm1, the multiplication operands are loaded $n^3$ times each in total,
whereas the first addition operand is loaded $n^2$ times in total. This results
in a total of $100(2n^3 + n^2)$ cycles for loading operands into registers. Also,
we again have $2n^3$ floating point operations in total, which consume
$\frac{1}{4}(2n^3) = 0.5n^3$ cycles. For $n = 1000$, the loads alone take

$$\frac{100(2n^3 + n^2)\ \text{cycles}}{2 \cdot 10^9\ \text{cycles/sec}} = 100.05\ \text{seconds},$$

and the $0.5n^3$ arithmetic cycles add a further 0.25 seconds, for a total
execution time of about 100.30 seconds.
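The difference between the two load counts is easier to see next to the loops themselves. The routines are not listed in this report, so the sketch below assumes the usual textbook forms: dgemm0 updates c in memory on every iteration, and dgemm1 keeps the running sum in a local variable standing in for a register. The load counters and the helper `dgemm1_load_seconds` are illustrative additions.

```python
def dgemm0(a, b, c, n):
    """No register reuse: a, b and c are loaded on every iteration."""
    loads = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i * n + j] += a[i * n + k] * b[k * n + j]
                loads += 3
    return loads


def dgemm1(a, b, c, n):
    """c[i][j] is loaded once per (i, j) and accumulated in a local variable."""
    loads = 0
    for i in range(n):
        for j in range(n):
            r = c[i * n + j]          # one load of c per (i, j)
            loads += 1
            for k in range(n):
                r += a[i * n + k] * b[k * n + j]
                loads += 2            # only a and b are loaded here
            c[i * n + j] = r
    return loads


def dgemm1_load_seconds(n, load_cycles=100, clock_hz=2e9):
    """Time spent on loads in dgemm1 under the assumed cost model."""
    return load_cycles * (2 * n ** 3 + n ** 2) / clock_hz


if __name__ == "__main__":
    n = 4
    a = [float(x) for x in range(n * n)]
    b = [float(x % 3) for x in range(n * n)]
    print(dgemm0(a, b, [0.0] * (n * n), n), dgemm1(a, b, [0.0] * (n * n), n))
```

For n = 4 the counters give 3n^3 = 192 loads for dgemm0 and 2n^3 + n^2 = 144 loads for dgemm1, matching the counts used in the analysis.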
Part 2
n=64
dgemm0 0.000000 sec inf Gflops
dgemm1 0.000000 sec inf Gflops max|C0-C1|=0.0000000000000000
dgemm2 0.000000 sec inf Gflops max|C0-C1|=0.0000000000000000
n=128
dgemm0 0.030000 sec 0.139810 Gflops
dgemm1 0.020000 sec 0.209715 Gflops max|C0-C1|=0.0000000000000000
dgemm2 0.010000 sec 0.419430 Gflops max|C0-C1|=0.0000000000000000
n=256
dgemm0 0.260000 sec 0.129056 Gflops
dgemm1 0.130000 sec 0.258111 Gflops max|C0-C1|=0.0000000000000000
dgemm2 0.120000 sec 0.279620 Gflops max|C0-C1|=0.0000000000000000
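The Gflops column is consistent with dividing the $2n^3$ floating point operations by the measured time. The helper below is a hypothetical reconstruction of that calculation, not the code that produced the output; it also shows why a measured time of zero is reported as inf.

```python
def gflops(n, seconds):
    """Gflops for an n x n matrix multiplication that took `seconds`."""
    flops = 2 * n ** 3                 # one multiply and one add per iteration
    if seconds == 0:
        return float("inf")            # timer too coarse to register the run
    return flops / seconds / 1e9


if __name__ == "__main__":
    for n, t in [(128, 0.03), (256, 0.26), (64, 0.0)]:
        print(n, t, round(gflops(n, t), 6))
```

All reported times are multiples of 0.01 sec, which suggests the timer resolution is too coarse for n=64; a finer timer or repeated runs would be needed to get a finite figure there.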
Part 3
Cache Reuse
Part 1
• 10x10
1. ijk
First, for i=j=0 we will have a miss for a[0][0] and one miss for each
element in the first column of b. After that, the whole b will be in
cache along with the first row of a. Similarly, we will have a miss for
the rest of the elements in the first column of a. After the miss in
a[n-1][0], all rows of a will have been stored in the cache. Also, we
will not have any more misses for b since the cache is large enough
to hold all a and b (and c).
Hence, when the whole calculation finishes, a single miss for each
element in the first column of a and b will have occurred, and no miss
for the rest of their elements. Therefore, the miss rate is
$\frac{2 \cdot 10}{2 \cdot 10^3} = 1\%$.
2. jik
First, for i=j=0 we will have a miss for a[0][0] and one miss for each
element in the first column of b. After that, the whole b will be in
cache along with the first row of a. Similarly, we will have a miss for
each element of the first column of a just before j=0 becomes j=1.
At this point, both a and b are in cache in their entirety and no more
misses will occur.
Hence, when the whole calculation finishes, a single miss for each
element in the first column of a and b will have occurred, and no miss
for the rest of their elements. Therefore, the miss rate is
$\frac{2 \cdot 10}{2 \cdot 10^3} = 1\%$.
3. ikj
First, for i=k=0 we will have a miss for a[0][0] and b[0][0], and the
first rows of a and b are loaded in cache. For the remaining values of
k, and while i=0, we will have a miss for the rest of the elements in
the first column of b, and the whole b is loaded in the cache. For the
remaining values of i we will also have a miss for the rest of the
elements in the first column of a. Now both a and b are in cache.
Hence, when the whole calculation finishes, a single miss for each
element in the first column of a and b will have occurred, and no miss
for the rest of their elements. Therefore, the miss rate is
$\frac{2 \cdot 10}{2 \cdot 10^3} = 1\%$.
4. jki
First, for j=k=0 we will have a miss for all the elements in the first
column of a and a miss for b[0][0]. Now the first row of b and the
whole a are loaded in cache. For the remaining values of k, and while
j=0, we will have a miss for the rest of the elements in the first
column of b, and the whole b is loaded in the cache. Now both a and b
are in cache.
Hence, when the whole calculation finishes, a single miss for each
element in the first column of a and b will have occurred, and no miss
for the rest of their elements. Therefore, the miss rate is
$\frac{2 \cdot 10}{2 \cdot 10^3} = 1\%$.
5. kij
First, for k=i=j=0 we will have a miss for a[0][0] and b[0][0], and the
first rows of a and b are loaded in cache. Then, for the remaining
values of i, and while k=0, we will have a miss for the rest of the
elements in the first column of a. Now the first row of b and the whole
a are in cache. For the remaining values of k, we will have a miss for
the rest of the elements in the first column of b, and the whole b is
loaded in the cache. Now both a and b are in the cache.
Hence, when the whole calculation finishes, a single miss for each
element in the first column of a and b will have occurred, and no miss
for the rest of their elements. Therefore, the miss rate is
$\frac{2 \cdot 10}{2 \cdot 10^3} = 1\%$.
6. kji
First, for k=j=0, we will have a miss for b[0][0] and all the elements
in the first column of a. Now the whole a and the first row of b are in
the cache. After that, for the rest of the values of k, we will also have
a miss for all the elements in the first column of b, and the whole b
is now loaded in the cache.
Hence, when the whole calculation finishes, a single miss for each
element in the first column of a and b will have occurred, and no miss
for the rest of their elements. Therefore, the miss rate is
$\frac{2 \cdot 10}{2 \cdot 10^3} = 1\%$.
• 10000x10000
1. ijk
Hence, in a we will have $\frac{n^2}{10}$ elements where misses occur, which
gives $n \cdot \frac{n^2}{10} = \frac{n^3}{10}$ misses, and in b we will have
$n \cdot n^2 = n^3$ misses. Therefore, we get a miss rate of
$$\frac{\frac{n^3}{10} + n^3}{2 \cdot n^3} = \frac{1.1}{2} = 55\%.$$
2. jik
Same as in ijk.
3. ikj
In this case we will have misses every time we try to access any
element of a or b. This happens because we traverse both a and
b column-wise whereas only elements of the same row are stored in
cache.
Since each element of a and b is loaded n times, it will also cause n
misses.
Obviously, the miss rate is 100%.
5. kij
$$\frac{n \cdot n^2 + n^2/10}{2 \cdot n^3} = \frac{1 + 1/(10n)}{2} = 50.0005\%$$
Part 2
We can calculate the misses for this part based on Part 1. Specifically, first
consider the block multiplication part (outer 3 loops) as a multiplication of two
m × m matrices where m = n/10, using a cache of 60/10 = 6 lines, where each
line can hold one element. Since m is far greater than the 6 lines of cache,
the number of misses for each block will be equal to the number of misses of each
element for the 10000 × 10000 matrix in Part 1, with the only difference that
we substitute the number 10 in every formula with the number 1, since we no
longer have 10 elements per cache line, but 1.
Now that we have the number of misses for each block in both a and b, we can
calculate the number of misses for every element in each block by multiplying
the number of misses of this block by the number of misses of the 10 × 10 matrix
multiplication in Part 1 for the same access pattern.
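The two levels of the argument correspond to the two loop nests of a blocked multiplication. The report does not list the blocked routine, so the following is a minimal sketch that assumes ijk order at both levels, a block size of 10, and a matrix size that is a multiple of the block size; the function names are illustrative.

```python
def blocked_matmul(a, b, n, bs=10):
    """Return c = a * b for n x n row-major lists, using bs x bs blocks."""
    c = [0.0] * (n * n)
    # Outer three loops: an (n/bs) x (n/bs) multiplication over blocks.
    for ib in range(0, n, bs):
        for jb in range(0, n, bs):
            for kb in range(0, n, bs):
                # Inner three loops: one bs x bs block multiplication.
                for i in range(ib, ib + bs):
                    for j in range(jb, jb + bs):
                        r = c[i * n + j]
                        for k in range(kb, kb + bs):
                            r += a[i * n + k] * b[k * n + j]
                        c[i * n + j] = r
    return c


def naive_matmul(a, b, n):
    """Reference triple loop used to check the blocked version."""
    c = [0.0] * (n * n)
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i * n + j] += a[i * n + k] * b[k * n + j]
    return c


if __name__ == "__main__":
    n = 20
    a = [float((3 * x) % 7) for x in range(n * n)]
    b = [float((5 * x) % 11) for x in range(n * n)]
    print(blocked_matmul(a, b, n) == naive_matmul(a, b, n))
```

The outer nest is what the text treats as an m × m multiplication with m = n/10, and the inner nest is the 10 × 10 multiplication analysed in Part 1.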
Part 3
Part 4