The order in which a program accesses memory locations can have a substantial impact on the program's performance as a result of such memory-system hardware as caches. In this lab, you will measure the speed of at least two variants of matrix multiplication on matrices of varying size to get some appreciation for this phenomenon.

You can complete this project either individually or with a partner of your choice.

Matrix multiplication is a basic building block for many numerical computations in scientific and engineering applications as well as quantitative finance. For simplicity, in this lab we will only consider square matrices. If *A* and *B* are both matrices of size *n* × *n*, then their product, *C*, is another *n* × *n* matrix. Each element of *C* can be computed using elements from one row of *A* and one column of *B*. In particular, the element *C*_{i, j} is formed from row *i* of *A* and column *j* of *B*. Specifically, *C*_{i, j} is the summation of *A*_{i, k}*B*_{k, j} over all values of *k*.

Throughout the lab, we will be working with a slight generalization of matrix multiplication, which computes *C* as the sum of its original value and the matrix product *AB*. If you just want the matrix product, you can initialize *C* to a zero matrix. Because we are incorporating the initial contents of *C*, all *n* terms of the above summation (each of the form *A*_{i, k}*B*_{k, j}) can be added in turn to *C*_{i, j}.

In this lab, we will only consider algorithms for matrix multiplication that more or less directly correspond to the preceding definition. Each of the *n*^{2} elements of *C* is computed by adding *n* products. Assuming that the matrices are composed of floating point numbers, this means the matrix multiplication involves *n*^{3} floating point multiplications together with *n*^{3} floating point additions to do the summations. There are fancier algorithms that do fewer computations, but because floating point arithmetic does not obey algebraic laws such as associativity, these fancier algorithms will generally produce different answers.

The naive algorithm computes each element of *C* in turn, doing all the computation for the first element of *C* before moving on to the next. In pseudocode, it looks like this:

for *i* in the range 0 ≤ *i* < *n*:

for *j* in the range 0 ≤ *j* < *n*:

for *k* in the range 0 ≤ *k* < *n*:

*C*_{i, j} ← *C*_{i, j} + *A*_{i, k}*B*_{k, j}

In this pseudocode version, the outer two loops ensure that all the different elements of *C* are computed. For each particular element of *C*, the innermost loop adds up the terms of the summation.

If we translate this into the C programming language, it looks as follows:

for(i = 0; i < n; i++){ for(j = 0; j < n; j++){ for(k = 0; k < n; k++){ c[i*n + j] += a[i*n + k] * b[k*n + j]; // <- this line is executed n*n*n times } } }

To understand the C version, you need to know some of the notational tricks of the programming language, and you also need to understand how the three *n* × *n* matrices are being stored in memory.

Notations such as `i++`

are used to mean that the variable `i`

should be increased by 1. The `+=`

notation indicates that the value computed on its right side should be added to whatever is currently stored in the location specified on the left side. Both of these notations may be familiar to you from Java, Python, or C++. If you have any question about the programming details (or anything else), please do ask.

The more interesting part is how the matrices are stored. Each one is stored as a sequence of *n*^{2} values. The first *n* constitute the first row of the matrix, the next *n* the second row, etc. (This is called "row major order".) As such, given that we are numbering the rows and columns starting from 0, to reach a position that is in row number *i* requires skipping over *i* full rows of *n* elements. This explains why the element *C*_{i, j} is coded as `c[i*n + j]`

, and similarly for the others.

Rather than adding up all *n* terms for *C*_{0,0} before moving on to *C*_{0,1} and so forth, we could interleave work on the different elements of *C*. Consider, for example the following algorithm:

for *i* in the range 0 ≤ *i* < *n*:

for *k* in the range 0 ≤ *k* < *n*:

for *j* in the range 0 ≤ *j* < *n*:

*C*_{i, j} ← *C*_{i, j} + *A*_{i, k}*B*_{k, j}

This is nearly identical to the naive algorithm; only the order of the loops is altered. The exact same computations are done, just in a different order. The reordering can't have any impact on the final answer, even though floating point addition is not associative. This is because the terms contributing to any one element, *C*_{i,j}, are still added in the same order. The only difference is that they are not done immediately after one another; work on other elements of *C* is interleaved in between.

There are two reasons why this reordered algorithm might be faster than the naive algorithm. One concerns pipelined processor design. In the naive algorithm, each iteration of the innermost loop contains a floating point addition that is data dependent on the floating point addition in the prior iteration. If these floating point additions turn out to be the critical path that limits the speed of the computation and if they have a latency greater than one cycle, then moving the data-dependent instructions further apart would help. The reordered algorithm does this.

The other reason why the reordered algorithm might be faster is because it accesses memory locations in a different order, which might reduce the number of blocks that need to be fetched into the caches.

In this lab, your goal is to provide evidence that the reordered algorithm is faster than the naive one as well as evidence that this speed difference stems (at least in part) from the memory system rather than the processor pipeline.

Consider measuring the performance of one algorithm, such as the naive one. You will have to provide some specific value of *n* and see how long the computation takes. By doing the computation repeatedly with the same value of *n*, you can get an idea how consistent your measurements are.

Now suppose you change the value of *n*. We expect that the time will also change. For example, if you double *n*, then we know that the computation will now involve 8 times as many floating point multiplications and 8 times as many floating point additions because the innermost loop is executed *n*^{3} times. If the operations are done at the same rate as before, this means that the computation will take 8 times as long. Therefore, if your measured time is exactly 8 times as long, that would mean that the only effect we were observing was attributable to the algorithm, rather than anything peculiar about the computer hardware. But if the time change was consistently more than a factor of 8 or less than a factor of 8, then we might be learning something about the rate at which the hardware is able to execute instructions and how that depends on the size of the matrices.

To make it easier to separate the hardware-dependent performance effects from the scaling of the algorithm, we can look at the performance in units other than seconds. In particular, we will use the units of floating point operations per seconds (FLOPS). When you increase the value of *n*, the number of floating point operations goes up and so does the number of seconds. If the time is scaling up just as the algorithm would predict (like *n*^{3}), then the speed in FLOPS will remain constant. If the speed in FLOPS changes, that indicates a hardware effect.

You should try the naive algorithm and the reordered algorithm for values of *n* in the range from the 100 to 1500. Remember to try each value of *n* more than once, so that you know how precise the results are. (Ideally you should do your experimentation in a randomized order.) For each experimental condition, record the speed in FLOPS. You can use the program I provide in the next section.

Supposing that the reordered algorithm is faster (has a higher FLOPS value), you still need to determine whether the memory system seems to be involved in its speed difference. Changing the size of the matrices shouldn't affect the rate at which the processor pipeline can do additions but might affect the number of blocks that need to be fetched into the caches. Therefore, if you see any substantial change in the naive algorithm's speed deficit as you change the matrix size, you can reasonably conclude that at least part of the effect comes from the memory system. In the prelab analysis we'll do in class, we'll approximately determine some critical matrix sizes where you might expect to see each algorithm's performance change.

I've built the naive algorithm into a C program, naive.c, with the extra code to time how long it takes and then compute and print the number of FLOPS. You don't have to understand this extra code; you can just focus your attention on the three nested `for`

loops that perform the matrix multiplication.

Once you have downloaded the program, you can compile it in a terminal window using the following command:

cc -o naive -O3 naive.c

Note that the option "`-O3`

" uses the upper-case letter O (oh), not the digit 0 (zero). This command runs the C compiler, specifying that the output (the executable machine language) should go in the file `naive`

and that the compiler should take pains to make the result as fast as it can.

To run this program with a specific value of *n*, such as 200, you would use a command such as the following:

./naive 200

If you get output like `4.38E+08`

, that means that the program executed 4.38 × 10^{8} floating point operations per second.

To make a copy of the program with another name, you can use a command like this:

cp naive.c reordered.c

Then you can edit the copy to reflect the reordered algorithm and compile it and run it analogously.

You should explain that your goal was to test the hypothesis that reordering matrix multiplication improves its performance due to better use of the memory system.

You should explain your experiment. This includes presenting the C code for each of the two algorithms, sticking to just the matrix multiplication itself, not the timing code. It also includes indicating the values of *n* you used with each algorithm and the number of times you tried each. If you mixed up the order of the experiments to avoid systematic biases, you should specify that as well. Finally, you should specify the hardware and software you used. In the Apple menu, you will find "About This Mac," which you can use to find the processor, memory, and (with a little more work) the amount of L2 and L3 cache. In a terminal window, you can use the command

cc --version

to find what version of the C compiler you used. You should also mention having used the `-O3`

option.

You should give an explanation of what effects you would expect to find. Your explanation should be more detailed than anything in this assignment. In particular, you should describe the pattern of memory accesses for each algorithm. You should include our prelab analysis of the approximate matrix size at which it becomes impossible to realize full temporal reuse of all values from the L3 cache, so that the memory bandwidth is likely to become limiting. You should also include our prelab analysis of the approximate matrix size at which the naive algorithm would no longer achieve full spatial reuse of the blocks loaded into the L1 cache, which could cause bandwidth between the levels of cache to become limiting.

You should report your quantitative results in a form that allows your reader to see not only what the typical value for each experimental condition is but also how repeatable the values are. In particular, you should present the results in a table and one or more graphs. One form of graph you might consider is a box and whisker plot. However, other forms of graphs also have advantages, for example if you would prefer to focus on comparing the algorithms rather than on showing how repeatable the observations are.

You should provide an interpretation for your results. Do they support the hypothesis that the reordered algorithm is faster? (Can you say anything about how much faster?) Do they support the further hypothesis that the speed difference stems (at least in part) from an interaction between the algorithm and the memory system? These are not "yes" or "no" questions; explain what leads you to your conclusion.

Even the reordered algorithm still leaves a lot of room for improvement. A "blocked" version of matrix multiplication would add two extra loops to the nest, for a total of five. The outer two would iterate over big blocks of *j* and *k*, while the inner loops would be like the reordered algorithm, but limited to work within the current block. You can talk with me about this approach if it interests you.