The order in which a program accesses memory locations can have a substantial impact on the program's performance, as a result of memory-system hardware such as caches and prefetchers. In this lab, you will measure the speed of at least two variants of matrix multiplication, on matrices of varying size, to get some appreciation for this phenomenon.

Matrix multiplication is a basic building block for many numerical computations in scientific and engineering applications. For simplicity, in this lab we will only consider square matrices. If *A* and *B* are both matrices of size *n* × *n*, then their product, *C*, is another *n* × *n* matrix. Each element of *C* can be computed using elements from one row of *A* and one column of *B*. In particular, the element *C*_{i, j} is formed from row *i* of *A* and column *j* of *B*. Specifically, *C*_{i, j} is the summation of *A*_{i, k}*B*_{k, j} over all values of *k*.

Throughout the lab, we will be working with a slight generalization of matrix multiplication, which computes *C* as the sum of its original value and the matrix product *AB*. If you just want the matrix product, you can initialize *C* to a zero matrix. Because we are incorporating the initial contents of *C*, all *n* terms of the above summation (each of the form *A*_{i, k}*B*_{k, j}) can be added in turn to *C*_{i, j}.

In this lab, we will only consider algorithms for matrix multiplication that more or less directly correspond to the preceding definition. Each of the *n*^{2} elements of *C* is computed by adding *n* products. Assuming that the matrices are composed of floating point numbers, this means the matrix multiplication involves *n*^{3} floating point multiplications, together with *n*^{3} floating point additions to do the summations. There are fancier algorithms that do fewer computations, but because floating point arithmetic does not obey algebraic laws such as associativity, these fancier algorithms will generally produce different answers.

The naive algorithm computes each element of *C* in turn, doing all the computation for the first element of *C* before moving on to the next. In pseudocode, it might look like this:

for *i* in the range 0 ≤ *i* < *n*:
    for *j* in the range 0 ≤ *j* < *n*:
        for *k* in the range 0 ≤ *k* < *n*:
            *C*_{i, j} ← *C*_{i, j} + *A*_{i, k}*B*_{k, j}

In this pseudocode version, the outer two loops ensure that all the different elements of *C* are computed. For each particular element of *C*, the innermost loop adds up the terms of the summation.

If we translate this into the C programming language, it looks as follows:

for(i = 0; i < n; i++){
    for(j = 0; j < n; j++){
        for(k = 0; k < n; k++){
            c[i*n + j] += a[i*n + k] * b[k*n + j]; // <- this line is executed n*n*n times
        }
    }
}

To understand the C version, you need to know some of the notational tricks of the programming language, but you also need to understand how the three *n* × *n* matrices are being stored in memory.

Notations such as `i++` are used to mean that the variable `i` should be increased by 1. The `+=` notation indicates that the value computed on its right side should be added to whatever is currently stored in the location specified on the left side. Both of these notations may be familiar to you from Java, Python, or C++. If you have any question about the programming details (or anything else), please do ask.

The more interesting part is how the matrices are stored. Each one is stored as a sequence of *n*^{2} values. The first *n* constitute the first row of the matrix, the next *n* the second row, etc. (This is called "row major order".) As such, given that we are numbering the rows and columns starting from 0, reaching a position in row number *i* requires skipping over *i* full rows of *n* elements, and reaching column number *j* within that row requires skipping *j* more elements. This explains why the element *C*_{i, j} is coded as `c[i*n + j]`, and similarly for the others.

Rather than adding up all *n* terms for *C*_{0,0} before moving on to *C*_{0,1} and so forth, we could interleave work on the different elements of *C*. Consider, for example, the following algorithm:

for *i* in the range 0 ≤ *i* < *n*:
    for *k* in the range 0 ≤ *k* < *n*:
        for *j* in the range 0 ≤ *j* < *n*:
            *C*_{i, j} ← *C*_{i, j} + *A*_{i, k}*B*_{k, j}

This is nearly identical to the naive algorithm; only the order of the loops is altered. The exact same computations are done, just in a different order. And the reordering can't have any impact on the final answer, even though floating point addition is not associative. This is because the terms contributing to any one element, *C*_{i,j}, are still added in the same order. The only difference is that they are not done immediately after one another; work on other elements of *C* is interleaved in between.

There are two reasons why this reordered algorithm might be faster than the naive algorithm. One concerns pipelined processor design. In the naive algorithm, each iteration of the innermost loop contains a floating point addition that is data dependent on the floating point addition in the prior iteration. If these floating point additions turn out to be the critical path that limits the speed of the computation, and if they have a latency greater than one cycle, then moving the data-dependent instructions further apart would help. The reordered algorithm does this.

The other reason why the reordered algorithm might be faster is because it accesses memory locations in a different order, which might correspond better to the design of caches and prefetchers.

In this lab, your goal is to provide evidence that the reordered algorithm is faster than the naive one, as well as evidence that this speed difference stems (at least in part) from the memory system rather than the processor pipeline.

Consider measuring the performance of one algorithm, such as the naive one. You will have to provide some specific value of *n* and see how long the computation takes. By doing the computation repeatedly with the same value of *n*, you can get an idea how consistent your measurements are and hence how precisely to report them.

Now suppose you change the value of *n*. We expect that the time will also change. For example, if you double *n*, then we know that the computation will now involve 8 times as many floating point multiplications and 8 times as many floating point additions, because the innermost loop is executed *n*^{3} times. If the operations are done at the same rate as before, this means that the computation will take 8 times as long. Therefore, if your measured time is exactly 8 times as long, that would mean that the only effect we were observing was attributable to the algorithm, rather than to anything peculiar about the computer hardware. But if the time consistently changed by more than a factor of 8, or by less, then we might be learning something about the rate at which the hardware is able to execute instructions, and how that depends on the size of the matrices.

To make it easier to separate the hardware-dependent performance effects from the scaling of the algorithm, we can look at the performance in units other than seconds. In particular, we will use the units of floating point operations per second (FLOPS). When you increase the value of *n*, the number of floating point operations goes up, and so does the number of seconds. If the time is scaling up just as the algorithm would predict (like *n*^{3}), then the speed in FLOPS will remain constant. If the speed in FLOPS changes, that indicates a hardware effect.

You should try the naive algorithm and the reordered algorithm for a variety of different values of *n* in the range from the low hundreds to the low thousands. Remember to try each value of *n* more than once, so that you know how precise the results are. (Ideally you should do your experimentation in a randomized order.) For each experimental condition, record the speed in FLOPS. You can use the program I provide in the next section.

Supposing that the reordered algorithm is faster (has a higher FLOPS value), you still need to determine whether the memory system seems to be involved in its speed difference. Would changing the size of the matrices affect the rate at which the processor pipeline can do additions? Would changing the size of the matrices affect the cache miss rate? Based on these considerations, what sort of a trend in FLOPS rates would you expect if the naive algorithm's performance deficit stems from the way it accesses memory?

I've built the naive algorithm into a C program, naive.c, with the extra code to time how long it takes and then compute and print the number of FLOPS. You don't have to understand this extra code; you can just focus your attention on the three nested `for` loops that perform the matrix multiplication.

Once you have downloaded the program, you can compile it in a terminal window using the following command:

cc -o naive -fast naive.c

This runs the C compiler, specifying that the output (the executable machine language) should go in the file `naive` and that the compiler should take pains to make the result as fast as it can.

To run this program with a specific value of *n*, such as 200, you would use a command such as the following:

./naive 200

If you get output like `4.38E+08`, that means that the program executed 4.38 × 10^{8} floating point operations per second.

To make a copy of the program with another name, you can use a command like this:

cp naive.c reordered.c

Then you can edit the copy to reflect the reordered algorithm and compile it and run it analogously.

You should explain that your goal was to test the hypothesis that reordering matrix multiplication improves its performance due to better use of the memory system. You should give an explanation of why you would expect this to be true. Your explanation should be more detailed than anything in this assignment. In particular, you should describe the pattern of memory accesses for each algorithm, focusing on what happens as the innermost loop iterates.

You should explain your experiment. This includes presenting the C code for each of the two algorithms, sticking to just the matrix multiplication itself, not the timing code. It also includes indicating the values of *n* you used with each algorithm, and the number of times you tried each. If you mixed up the order of the experiments to avoid systematic biases, you should specify that as well. Finally, you should specify the hardware and software you used. In the Apple menu, you will find "About This Mac," which you can use to find the processor, memory, and (with a little more work) the amount of L2 cache and the bus speed. In a terminal window, you can use the command

cc --version

to find what version of the C compiler you used. You should also mention having used the `-fast` option.

You should report your quantitative results. State your impression of how repeatable the results are and report your numbers with a corresponding degree of precision. Present the results in a table and a graph.

You should provide an interpretation for your results. Do they support the hypothesis that the reordered algorithm is faster? (Can you say anything about how much faster?) Do they support the further hypothesis that the speed difference stems (at least in part) from an interaction between the algorithm and the memory system? These are not "yes" or "no" questions; explain what leads you to your conclusion.

Even the reordered algorithm still leaves a lot of room for improvement. Talk with me if you would like suggestions for other improvements to try. What follows is a brief mention of two.

If the same memory location is accessed each time around the innermost loop, you could move that access out of the loop.

A "blocked" version of matrix multiplication would add two extra loops to the nest, for a total of five. The outer two would iterate over big blocks of the matrices, while the inner loops would be limited to work within the current block.