MCS-284 Lab 4: Memory System Performance

Due December 11, 2007

In this lab, you will again be experimenting with the performance of a C++ program, measuring the number of clock cycles needed to execute the central portion of the program. Unlike in Lab 3, the program serves a real purpose (simulating Conway's Game of Life) rather than being cooked up just to test computers' performance. Another major difference from Lab 3 is that the focus this time is on the memory system. This time, let's try doing the lab with each student working independently.

You will need to write a clear report that explains what you did (so that someone else could replicate your experiment), provides the factual results of your experiment, and discusses what interpretation you place on those results. Your report should be aimed at an audience with background similar to our course, but without specific knowledge of this lab assignment.

For simplicity, you are welcome to stick with just a single computer rather than comparing the performance of two different kinds of computer. Be sure to specify which kind of computer you used; by looking at the file /proc/cpuinfo, you can find both the name of the processor and the amount of cache it has. The cache size listed there is for the L2 cache, which is likely to be the relevant one for the experiments you are conducting. To provide a really satisfactory interpretation of your results, you will also find it useful to know such attributes of the cache as its associativity and block size. As I demonstrated in class, this information is available within the /sys/devices/system/cpu/cpu0/ directory. The L1 cache information is also available there, if you want to go the extra mile and document that for the sake of completeness.
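
For example, commands along the following lines will display the relevant information; the index2 subdirectory usually corresponds to the L2 cache, but the numbering can vary from machine to machine, so check each index directory's level and type files to be sure:

grep -e 'model name' -e 'cache size' /proc/cpuinfo
cat /sys/devices/system/cpu/cpu0/cache/index2/size
cat /sys/devices/system/cpu/cpu0/cache/index2/ways_of_associativity
cat /sys/devices/system/cpu/cpu0/cache/index2/coherency_line_size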

Compiling and running the program

You will need to download two files: life.cpp, which is the actual C++ source code of the program, and cyccounter.h, which is a helper file that contains the necessary definitions for using the processor's cycle counter to get exact timings in units of clock cycles. To compile the program, use the following command:

g++ -o life -O life.cpp

You can run the program using a command like

./life 200 100 20 10 0 1

The meaning of most of these numbers will become apparent from the program's output, but for the full story, you can read a comment at the top of the source file.
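
If you are curious how cyccounter.h obtains its timings, the underlying mechanism on x86 processors is the rdtsc instruction, which reads a 64-bit counter of elapsed clock cycles. A minimal sketch of such a helper is shown below; the actual names and details in cyccounter.h may differ:

#include <stdint.h>

// Hypothetical sketch of a cycle-counter helper; the real cyccounter.h
// may differ. rdtsc places the low 32 bits of the processor's timestamp
// counter in EAX and the high 32 bits in EDX.
inline uint64_t get_cycles()
{
  uint32_t lo, hi;
  __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
  return ((uint64_t) hi << 32) | lo;
}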

Questions your experimentation should address

If you read the program, you'll see that the main portion, which is timed, loops through the generations one by one, and for each generation, loops over all the columns and rows of cells. For each cell in each generation, the work done is the same: adding up how many neighbors are live and then (in a second pass over the cells) setting the cell to be alive or dead based on the number of living neighbors and the cell's own prior status. Because this same work is done for each cell in each generation, one might think that the number of clock cycles would scale up approximately proportionately to the number of cells and generations. How well, or poorly, does your experimental evidence comport with this hypothesis of approximate proportionality? Be sure you examine cases where the number of cells is small relative to the L2 cache size and cases where the number of cells is large relative to the L2 cache size. You may not be able to explain all of the phenomena you observe, but you should at least give some idea what may be going on.
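
To make that structure concrete, here is a sketch of the kind of per-generation work described above; the array and variable names are made up for illustration and need not match the ones actually used in life.cpp:

// Hypothetical sketch of one generation's work; life.cpp's actual code
// may differ. Cells hold 1 (alive) or 0 (dead), and the border cells at
// indices 0 and width+1 (or height+1) are assumed to stay dead.
for(int col = 1; col <= width; col++){
  for(int row = 1; row <= height; row++){
    int count = 0;                       // first pass: count live neighbors
    for(int dc = -1; dc <= 1; dc++)
      for(int dr = -1; dr <= 1; dr++)
        if(dc != 0 || dr != 0)
          count += grid[row+dr][col+dc];
    neighbors[row][col] = count;
  }
}
for(int col = 1; col <= width; col++){
  for(int row = 1; row <= height; row++){
    int n = neighbors[row][col];         // second pass: apply the rule
    grid[row][col] = (n == 3) || (n == 2 && grid[row][col]);
  }
}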

Some widths of cell grids may be particularly interesting for reasons that have to do with their specific numerical values rather than their general size. For example, 4094 isn't much larger than 4000, but it has a special property: when you add in the two border cells, each row winds up occupying 4096 bytes, which is exactly 2 to the 12th power. Does this make the program perform differently? Can you improve its performance in this case by adding a little bit of padding to each row? (The program has a built-in feature for adding a specifiable amount of padding to each row.) You ought to be able to connect your results to the structural organization of the L2 cache.
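
To give a purely hypothetical illustration of the arithmetic involved: suppose an L2 cache held 2 to the 20th power bytes (1 MB), was 8-way set associative, and used 64-byte blocks. It would then contain 1048576 / 64 = 16384 blocks grouped into 16384 / 8 = 2048 sets, and two addresses would map to the same set whenever they differed by a multiple of 2048 * 64 = 131072 bytes. With rows of exactly 4096 bytes, cells in the same column that are 131072 / 4096 = 32 rows apart would compete for the same set of 8 blocks, whereas a little padding per row would break up that regular pattern. Redo this arithmetic with the actual parameters of your machine's cache.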

Each time the program loops over all the cells, it does so with two nested loops that start with these lines:

for(int col = 1; col <= width; col++){
  for(int row = 1; row <= height; row++){

If you interchange these two lines in the two places they appear within the timed portion of the program, the program will still produce the same results, but it will process the cells in a different order. How does this affect its performance? You should recompile the modified program the same way as shown earlier in this lab handout and should repeat the timing experiments you've previously done. In accounting for your results, you may want to know that both the Pentium 4 and the Core 2 can prefetch a cache block in advance of any access to that block, if the preceding block is being read sequentially.