MCS-284 Prelab Activity: Reuse Distances, Fall 2012

Memory layout and cache blocks

The lab uses so-called "double-precision" floating point numbers, which are declared using the C datatype double. Each of these is 64 bits (8 bytes) in size. The two-dimensional arrays used in the lab are laid out in "row-major order," which means that all of the numbers in the first row come first, then all of the numbers in the second row, etc. As an example, in a 200 × 200 array, the first 1600 bytes would contain the 200 8-byte numbers from row 0, the next 1600 bytes would contain the 200 8-byte numbers from row 1, etc.

The processors we will use in the lab bring data into their caches in 64-byte blocks. Given that our arrays are composed of 8-byte numbers, this means that each cache block holds 8 array elements. To simplify our prelab analysis, assume that each array starts at the beginning of a cache block and that the value of n is a multiple of 8. (You can conduct experiments with values of n that are not multiples of 8; this prelab analysis is only intended to give approximate answers anyway.) Given these simplifying assumptions, each row of an n × n array occupies n/8 cache blocks.

Algorithms and FLOPS/byte

In the lab, you will be experimenting with two different algorithms, naive and reordered. The naive algorithm follows:

   for i in the range 0 ≤ i < n:
      for j in the range 0 ≤ j < n:
         for k in the range 0 ≤ k < n:
            C_i, j ← C_i, j + A_i, kB_k, j

The reordered algorithm is the same except for the order of the loops:

   for i in the range 0 ≤ i < n:
      for k in the range 0 ≤ k < n:
         for j in the range 0 ≤ j < n:
            C_i, j ← C_i, j + A_i, kB_k, j

Each of these algorithms reads from the A array n³ times and the same for the B and C arrays. (It also writes into the C array; for simplicity, we'll ignore this.) Thus, given our use of 8-byte numbers, each of these algorithms does a total of 24n³ bytes of reading.

However, we would get a different answer if instead of looking at the amount of data read from the L1 cache by the processor, we looked at the amount of data read into the L1 cache from the L2 cache or into the L2 cache from the main memory. (I am assuming there is no L3 cache.) The reason why the answer would be different is because a value read into one of the caches might be used repeatedly by the processor before it is displaced from the cache. Conversely, a value might be read into the cache as part of a larger cache block without the particular value being used at all by the processor before the block is displaced from the cache.

Each of the n² numbers in each array gets read by the processor n times. Thus, the best a cache could do would be to reduce the memory traffic by a factor of n, turning the n reads of each value into just 1. The worst possibility, on the other hand, would be for each cache block (containing 8 different numbers) to be displaced from the cache after only one of the values is read a single time. In this case, the memory traffic would be inflated by a factor of 8. Thus, all depending on how much reuse the cache achieves, the total number of bytes fetched into the cache might be anywhere from 24n² to 192n³.

Each of the algorithms also performs n³ floating point multiplications and n³ floating point additions for a total of 2n³ floating point operations. Thus, depending on the cache reuse, the FLOPS/byte ratio might be anywhere from as low as 2n³/(192n³) = 1/96 to as high as 2n³/(24n²) = n/12. In this prelab activity, you'll pin down more precisely what FLOPS/byte ratio each algorithm might achieve for particular values of n and particular cache sizes.

Temporal reuse distance

The temporal reuse distance for reading an array element is the number of cache blocks used since the previous time the same array element was read, including the array element's own block. (Recall that for simplicity we are excluding writes.) For example, consider the read of C_i, j. We can focus in on one particular combination of i and j, since they are all treated the same way. What is the temporal reuse distance of C_0, 0?

In the naive algorithm, this one element of C keeps getting read over and over again in rapid succession as the innermost loop varies k. In the time from the first read of C_0, 0 to the second, only three cache blocks are accessed, one from each array. Because the program is very consistent in its behavior, this temporal reuse distance of 3 stays fixed for all the reads from C, not just the distance between the first two reads of C_0, 0.

In the reordered algorithm, the temporal reuse distance for C is greater. Between the first read of C_0, 0 and the second, there are also reads of C_0, 1, C_0, 2, etc., as the innermost loop varies j while i and k stay fixed. These n reads from C occupy n/8 cache blocks. Because i and k are staying fixed, only a single block is read over and over again from A. However, just as the innermost loop scans an entire row of the C array before returning to the initial element, an entire row of B is scanned: B_0, 0, B_0, 1, B_0, 2, etc. This contributes another n/8 cache blocks. Thus, the temporal reuse distance for C in the reordered algorithm is n/8 + 1 + n/8 = n/4 + 1. Because we are only doing an approximate analysis and are considering large values of n, we can simplify this formula to just its highest order term, n/4.

Because modern on-chip caches have reasonably high associativity, we won't go too far wrong if we make the simplifying assumption that blocks remain in the cache so long as the cache's total capacity isn't exceeded. You should keep in mind that this is a simplification; cache blocks may also be displaced due to conflicts. As a result, a matrix size that seems like it ought to fit in the cache may not; as you experiment with increasing matrix sizes, don't be surprised if you see performance drop off somewhat earlier than predicted.

We can put together our analysis of temporal reuse distance and this simplifying assumption about the cache. Array reads only contribute to the amount of data fetched into a cache if the temporal reuse distance is greater than the cache's capacity, measured in blocks. For example, suppose you are using a computer with a 32KB L1 cache and a 3MB L2 cache. With 64B blocks, this means the L1 cache has a capacity of 512 blocks and the L2 cache has a capacity of 49,152 blocks.

Because the temporal reuse distance for C is 3 in the naive algorithm, which is far less than even 512, we can see that the reads from C are not going to substantially contribute to the amount of data that needs to be fetched into either cache, no matter how large n is. Each time a block of C is loaded into the cache, it will be reused n times. Thus, the n³ reads of C contribute only 8n² bytes of fetching into either cache, rather than 8n³. Compared to the 2n³ floating point operations, this is a small amount of traffic so long as n is realistically large.

The story with the reordered algorithm is more interesting. Once n/4 becomes larger than the cache capacity (or even nears that capacity, given the reality of conflicts), each read from C_0, 0 (or any other element of C) will require those 8 bytes of data be fetched into the cache just for the single read. As such, the data traffic into a particular cache will go up when n/4 is approximately that cache's capacity. If the bandwidth available for fetching data into that particular cache is constraining the system's performance, then performance will drop. In our example with 512 blocks in the L1 cache, this particular performance drop-off might occur when n is in the neighborhood of 2048, which we find by solving n/4 = 512 for n. Similarly, there would be an increase in traffic being fetched into the L2 cache when n exceeds 196,608. That's such a large matrix size that you won't be able to experimentally observe whether any performance drop-off occurs.

The same analysis needs to be done for the reads from the A and B arrays in each algorithm. Before you do that, however, let's turn our attention to spatial reuse.

Spatial reuse distance

As we saw in the previous section, when n is larger than 2048, the reordered algorithm will no longer exhibit any temporal reuse of elements from C in the L1 cache. However, it may still exhibit spatial reuse of C's cache blocks. That is, when the block containing C_0, 0 is fetched into the L1 cache, the other values within the block (such as C_0, 1) may be used before the block is displaced, even though C_0, 0 itself won't be reused. If so, then each byte of C that is fetched into the L1 cache gets used; if not, only one-eighth of them do. In order to answer this question, we need to look at C's spatial reuse distance, which is the number of cache blocks accessed from the read of C_0, 0 to the read of C_0, 1. (The same distance will apply between any two consecutive elements of C.)

In the reordered algorithm, the j index varies the fastest because it is controlled by the innermost loop. As such, if we look at the reads of C_i, j, we see that they occur in close succession; between reading C_0, 0 and C_0, 1, there is just one read from A and one read from B. Thus, the spatial reuse distance for C is just three cache blocks.

Because this spatial reuse distance is so much less than the L1 cache capacity, we have good reason to expect each block of C to stay in the cache long enough for all 8 array elements in it to get used. Thus, the number of bytes fetched into the L1 cache for C in the reordered algorithm goes up from 8n² only to 8n³; it doesn't go up further to 64n³, no matter how large n gets.

The bottom line is that there are three possibilities. If the temporal reuse distance of a particular array in one of our algorithms is less than the cache capacity, then reading from that array contributes only 8n² bytes of traffic being fetched into that cache. If the temporal reuse distance exceeds the capacity, but the spatial reuse distance is less than the capacity, then the traffic goes up to 8n³. And if the spatial reuse distance is also larger than the capacity, then the traffic goes up even further to 64n³. In other algorithms, you might on rare occasion run into a case where the temporal reuse distance is small but the spatial reuse distance is large. If that happened in either of our algorithms, you'd see an array contributing 64n² bytes of traffic. However, this doesn't in fact happen with any of the arrays in either algorithm.

Now you do it

Naive algorithm
1. Reads from array C (this part filled in based on the earlier analysis)
  1. Temporal reuse distance
    1. How many elements of C are read before returning to the original one? Divide by 8 to find cache blocks, unless only one element of C is read per block. 1 element = 1 block
    2. How many elements of A are read before returning to the original element of C? Divide by 8 to find cache blocks, unless only one element of A is read per block.1 element = 1 block
    3. How many elements of B are read before returning to the original element of C? Divide by 8 to find cache blocks, unless only one element of B is read per block.1 element = 1 block
    4. Add these up to find the temporal reuse distance for C. Simplify to just the highest-order term. 3 blocks
    5. For each of the two cache capacities, figure out if there is a value of n where the temporal reuse distance of C exceeds the cache capacity. none
      1. If so: Up until that point, the reads from C contribute 8n² bytes of traffic into the cache. Continue on to the spatial reuse distance part to figure out what happens beyond that point.
      2. If not: The reads from C contribute 8n² bytes of traffic into the cache for all values of n. You are done analyzing reads from array C for this algorithm; skip the part about spatial reuse distance.
  2. Spatial reuse distance skipped
    1. From when C_0, 0 is read until C_0, 1 is read, how many elements of C are read? Divide by 8 to find the number of cache blocks, unless only one element of C is read per block.
    2. From when C_0, 0 is read until C_0, 1 is read, how many elements of A are read? Divide by 8 to find the number of cache blocks, unless only one element of A is read per block.
    3. From when C_0, 0 is read until C_0, 1 is read, how many elements of B are read? Divide by 8 to find the number of cache blocks, unless only one element of B is read per block.
    4. Add these up to find the spatial reuse distance for C. Simplify to just the highest-order term.
    5. For each of the two cache capacities, figure out if there is a value of n where the spatial reuse distance of C exceeds the cache capacity.
      1. If so: Up until that point (but after the point where temporal reuse stops), the reads from C contribute 8n³ bytes of traffic into the cache. After that point, they contribute 64n³ bytes.
      2. If not: The reads from C contribute 8n³ bytes of traffic into the cache for all values of n that exceed the limit calculated based on temporal reuse.
2. Reads from array A
  1. Temporal reuse distance
    1. How many elements of A are read before returning to the original one? Divide by 8 to find cache blocks, unless only one element of A is read per block.
    2. How many elements of B are read before returning to the original element of A? Divide by 8 to find cache blocks, unless only one element of B is read per block.
    3. How many elements of C are read before returning to the original element of A? Divide by 8 to find cache blocks, unless only one element of C is read per block.
    4. Add these up to find the temporal reuse distance for A. Simplify to just the highest-order term.
    5. For each of the two cache capacities, figure out if there is a value of n where the temporal reuse distance of A exceeds the cache capacity.
      1. If so: Up until that point, the reads from A contribute 8n² bytes of traffic into the cache. Continue on to the spatial reuse distance part to figure out what happens beyond that point.
      2. If not: The reads from A contribute 8n² bytes of traffic into the cache for all values of n. You are done analyzing reads from array A for this algorithm; skip the part about spatial reuse distance.
  2. Spatial reuse distance
    1. From when A_0, 0 is read until A_0, 1 is read, how many elements of A are read? Divide by 8 to find the number of cache blocks, unless only one element of A is read per block.
    2. From when A_0, 0 is read until A_0, 1 is read, how many elements of B are read? Divide by 8 to find the number of cache blocks, unless only one element of B is read per block.
    3. From when A_0, 0 is read until A_0, 1 is read, how many elements of C are read? Divide by 8 to find the number of cache blocks, unless only one element of C is read per block.
    4. Add these up to find the spatial reuse distance for A. Simplify to just the highest-order term.
    5. For each of the two cache capacities, figure out if there is a value of n where the spatial reuse distance of A exceeds the cache capacity.
      1. If so: Up until that point (but after the point where temporal reuse stops), the reads from A contribute 8n³ bytes of traffic into the cache. After that point, they contribute 64n³ bytes.
      2. If not: The reads from A contribute 8n³ bytes of traffic into the cache for all values of n that exceed the limit calculated based on temporal reuse.
3. Reads from array B
  1. Temporal reuse distance
    1. How many elements of B are read before returning to the original one? Divide by 8 to find cache blocks, unless only one element of B is read per block.
    2. How many elements of A are read before returning to the original element of B? Divide by 8 to find cache blocks, unless only one element of A is read per block.
    3. How many elements of C are read before returning to the original element of B? Divide by 8 to find cache blocks, unless only one element of C is read per block.
    4. Add these up to find the temporal reuse distance for B. Simplify to just the highest-order term.
    5. For each of the two cache capacities, figure out if there is a value of n where the temporal reuse distance of B exceeds the cache capacity.
      1. If so: Up until that point, the reads from B contribute 8n² bytes of traffic into the cache. Continue on to the spatial reuse distance part to figure out what happens beyond that point.
      2. If not: The reads from B contribute 8n² bytes of traffic into the cache for all values of n. You are done analyzing reads from array B for this algorithm; skip the part about spatial reuse distance.
  2. Spatial reuse distance
    1. From when B_0, 0 is read until B_0, 1 is read, how many elements of B are read? Divide by 8 to find the number of cache blocks, unless only one element of B is read per block.
    2. From when B_0, 0 is read until B_0, 1 is read, how many elements of A are read? Divide by 8 to find the number of cache blocks, unless only one element of A is read per block.
    3. From when B_0, 0 is read until B_0, 1 is read, how many elements of C are read? Divide by 8 to find the number of cache blocks, unless only one element of C is read per block.
    4. Add these up to find the spatial reuse distance for B. Simplify to just the highest-order term.
    5. For each of the two cache capacities, figure out if there is a value of n where the spatial reuse distance of B exceeds the cache capacity.
      1. If so: Up until that point (but after the point where temporal reuse stops), the reads from B contribute 8n³ bytes of traffic into the cache. After that point, they contribute 64n³ bytes.
      2. If not: The reads from B contribute 8n³ bytes of traffic into the cache for all values of n that exceed the limit calculated based on temporal reuse.
4. Overall assessment for this algorithm
  1. For each cache, put in order all the threshold values of n where the traffic for any of the arrays changes. (There may be 0, 1, or 2 of these thresholds for each of the three arrays. For array C, there are 0 thresholds for this algorithm.)
  2. Within each of the ranges of n defined by these thresholds, add up the three arrays' traffic contributions. Simplify to just the highest-order term. (You don't need to do this separately for each cache: the traffic contributions will be the same for corresponding ranges of n, with just the threshold values defining the ranges being different.)
  3. Divide the 2n³ floating point operations by each of these total traffics to find the FLOPS/byte for this algorithm within each range of n values. (Again, this needn't be done separately for each cache.)
Reordered algorithm
1. Reads from array C (this part filled in based on the earlier analysis)
  1. Temporal reuse distance
    1. How many elements of C are read before returning to the original one? Divide by 8 to find cache blocks, unless only one element of C is read per block. n elements = n/8 blocks
    2. How many elements of A are read before returning to the original element of C? Divide by 8 to find cache blocks, unless only one element of A is read per block. 1 element = 1 block
    3. How many elements of B are read before returning to the original element of C? Divide by 8 to find cache blocks, unless only one element of B is read per block. n elements = n/8 blocks
    4. Add these up to find the temporal reuse distance for C. Simplify to just the highest-order term. n/4+1 blocks, approximately n/4
    5. For each of the two cache capacities, figure out if there is a value of n where the temporal reuse distance of C exceeds the cache capacity. L1: 2048; L2: 196,608
      1. If so: Up until that point, the reads from C contribute 8n² bytes of traffic into the cache. Continue on to the spatial reuse distance part to figure out what happens beyond that point.
      2. If not: The reads from C contribute 8n² bytes of traffic into the cache for all values of n. You are done analyzing reads from array C for this algorithm; skip the part about spatial reuse distance.
  2. Spatial reuse distance
    1. From when C_0, 0 is read until C_0, 1 is read, how many elements of C are read? Divide by 8 to find the number of cache blocks, unless only one element of C is read per block. 1 element = 1 block
    2. From when C_0, 0 is read until C_0, 1 is read, how many elements of A are read? Divide by 8 to find the number of cache blocks, unless only one element of A is read per block. 1 element = 1 block
    3. From when C_0, 0 is read until C_0, 1 is read, how many elements of B are read? Divide by 8 to find the number of cache blocks, unless only one element of B is read per block. 1 element = 1 block
    4. Add these up to find the spatial reuse distance for C. Simplify to just the highest-order term. 3 blocks
    5. For each of the two cache capacities, figure out if there is a value of n where the spatial reuse distance of C exceeds the cache capacity. none
      1. If so: Up until that point (but after the point where temporal reuse stops), the reads from C contribute 8n³ bytes of traffic into the cache. After that point, they contribute 64n³ bytes.
      2. If not: The reads from C contribute 8n³ bytes of traffic into the cache for all values of n that exceed the limit calculated based on temporal reuse.
2. Reads from array A
  1. Temporal reuse distance
    1. How many elements of A are read before returning to the original one? Divide by 8 to find cache blocks, unless only one element of A is read per block.
    2. How many elements of B are read before returning to the original element of A? Divide by 8 to find cache blocks, unless only one element of B is read per block.
    3. How many elements of C are read before returning to the original element of A? Divide by 8 to find cache blocks, unless only one element of C is read per block.
    4. Addt these up to find the temporal reuse distance for A. Simplify to just the highest-order term.
    5. For each of the two cache capacities, figure out if there is a value of n where the temporal reuse distance of A exceeds the cache capacity.
      1. If so: Up until that point, the reads from A contribute 8n² bytes of traffic into the cache. Continue on to the spatial reuse distance part to figure out what happens beyond that point.
      2. If not: The reads from A contribute 8n² bytes of traffic into the cache for all values of n. You are done analyzing reads from array A for this algorithm; skip the part about spatial reuse distance.
  2. Spatial reuse distance
    1. From when A_0, 0 is read until A_0, 1 is read, how many elements of A are read? Divide by 8 to find the number of cache blocks, unless only one element of A is read per block.
    2. From when A_0, 0 is read until A_0, 1 is read, how many elements of B are read? Divide by 8 to find the number of cache blocks, unless only one element of B is read per block.
    3. From when A_0, 0 is read until A_0, 1 is read, how many elements of C are read? Divide by 8 to find the number of cache blocks, unless only one element of C is read per block.
    4. Add these up to find the spatial reuse distance for A. Simplify to just the highest-order term.
    5. For each of the two cache capacities, figure out if there is a value of n where the spatial reuse distance of A exceeds the cache capacity.
      1. If so: Up until that point (but after the point where temporal reuse stops), the reads from A contribute 8n³ bytes of traffic into the cache. After that point, they contribute 64n³ bytes.
      2. If not: The reads from A contribute 8n³ bytes of traffic into the cache for all values of n that exceed the limit calculated based on temporal reuse.
3. Reads from array B
  1. Temporal reuse distance
    1. How many elements of B are read before returning to the original one? Divide by 8 to find cache blocks, unless only one element of B is read per block.
    2. How many elements of A are read before returning to the original element of B? Divide by 8 to find cache blocks, unless only one element of A is read per block.
    3. How many elements of C are read before returning to the original element of B? Divide by 8 to find cache blocks, unless only one element of C is read per block.
    4. Add these up to find the temporal reuse distance for B. Simplify to just the highest-order term.
    5. For each of the two cache capacities, figure out if there is a value of n where the temporal reuse distance of B exceeds the cache capacity.
      1. If so: Up until that point, the reads from B contribute 8n² bytes of traffic into the cache. Continue on to the spatial reuse distance part to figure out what happens beyond that point.
      2. If not: The reads from B contribute 8n² bytes of traffic into the cache for all values of n. You are done analyzing reads from array B for this algorithm; skip the part about spatial reuse distance.
  2. Spatial reuse distance
    1. From when B_0, 0 is read until B_0, 1 is read, how many elements of B are read? Divide by 8 to find the number of cache blocks, unless only one element of B is read per block.
    2. From when B_0, 0 is read until B_0, 1 is read, how many elements of A are read? Divide by 8 to find the number of cache blocks, unless only one element of A is read per block.
    3. From when B_0, 0 is read until B_0, 1 is read, how many elements of C are read? Divide by 8 to find the number of cache blocks, unless only one element of C is read per block.
    4. Add these up to find the spatial reuse distance for B. Simplify to just the highest-order term.
    5. For each of the two cache capacities, figure out if there is a value of n where the spatial reuse distance of B exceeds the cache capacity.
      1. If so: Up until that point (but after the point where temporal reuse stops), the reads from B contribute 8n³ bytes of traffic into the cache. After that point, they contribute 64n³ bytes.
      2. If not: The reads from B contribute 8n³ bytes of traffic into the cache for all values of n that exceed the limit calculated based on temporal reuse.
4. Overall assessment for this algorithm
  1. For each cache, put in order all the threshold values of n where the traffic for any of the arrays changes. (There may be 0, 1, or 2 of these thresholds for each of the three arrays. For array C, there is 1 threshold for this algorithm.)
  2. Within each of the ranges of n defined by these thresholds, add up the three arrays' traffic contributions. Simplify to just the highest-order term. (You don't need to do this separately for each cache: the traffic contributions will be the same for corresponding ranges of n, with just the threshold values defining the ranges being different.)
  3. Divide the 2n³ floating point operations by each of these total traffics to find the FLOPS/byte for this algorithm within each range of n values. (Again, this needn't be done separately for each cache.)