MCS-284 Lab 4: Memory System Performance

Due December 5, 2008

In this lab, you will again be experimenting with the performance of a C++ program, measuring the number of clock cycles needed to execute the central portion of the program. Unlike in Lab 3, the program is for a real purpose (simulating Conway's Game of Life), rather than just cooked up to test computers' performance. Another major difference from Lab 3 is that the focus this time is on the memory system. You may do this lab with a partner so that you can talk about what you are doing and hopefully come do a clearer understanding.

You will need to write a clear report that explains what you did (so that someone else could replicate your experiment), provides the factual results of your experiment, and discusses what interpretation you place on those results. Your report should be aimed at an audience with background similar to our course, but without specific knowledge of this lab assignment.

You and your partner should do all experimental runs on a single machine, which should be one of the following: oh326ll03, oh326ll04, oh326ll07, oh326ll11, oh326ll12, oh326ll13, oh326ll15, oh326ll16, oh326ll17, or oh326ll18. Eight of these are the eight machines on the side of the room with the glass windows and the art installation. I can point out the other two listed machines if you want, and can explain why I don't recommend the six machines that aren't on this list.

Be sure to specify which kind of computer you used; by looking at the file /proc/cpuinfo, you can find both the name of the processor and the amount of cache it has. The cache size listed here is for the L2 cache, which is likely to be the relevant one for the experiments you are conducting. In order to provide a really satisfactory interpretation of your results, you will also find it useful to know such attributes of the cache as its associativity and block size. As I demonstrated in class, this information is available within the /sys/devices/system/cpu/cpu0/ directory. The L1 cache information is also available there, if you want to go the extra mile and document that for the sake of completeness.

Compiling and running the program

You will need to download two files: life.cpp, which is the actual C++ source code of the program, and cyccounter.h, which is a helper file that contains the necessary definitions for using the processor's cycle counter to get exact timings in units of clock cycles. To compile the program, use the following command:

g++ -o life -O life.cpp

You can run the program using a command like

./life 1000 4000 0 1

The meaning of most of these numbers will become apparent from the program's output, but to find the full story, you can read a comment at the top of the source file.

Questions your experimentation should address

If you read the program, you'll see that the main portion, which is timed, loops through the generations one by one, and for each generation, loops over all the columns and rows of cells. Each position in the grid is called a "cell", whether it is live or dead; so the number of cells remains fixed, even as the number of living cells changes.

For each cell in each generation, the work done is the same: adding up how many neighbors are live and then (in a second pass over the cells) setting the cell to be alive or dead based on the number of living neighbors and the cell's own prior status. Because this same work is done for each cell in each generation, one might think that the number of clock cycles would scale up approximately proportionately to the number of cells and generations. See if this seems to be the case when you double the number of cells, either by switching from 1000 rows to 2000 or from 4000 columns to 8000.

Some widths of cell grids may be particularly interesting, for reasons that have to do with their particular numerical values rather than their general size. For example, 4094 isn't much larger than 4000, but it has a special property: when you add in two border cells, each row winds up occupying 4096 bytes, which is exactly 2 to the 12th power. Does this make the program perform differently? Can you improve its performance in this case by adding a little bit of padding to each row? (The program has a built-in feature for adding a specifiable amount of padding to each row.) You ought to be able to connect your results to the structural organization of the L2 cache.

Each time the program loops over all the cells, it does it with two nested loops that start with these lines:

for(int col = 1; col <= width; col++){
  for(int row = 1; row <= height; row++){

In particular, these two lines appear in the two places within the timed portion of the program, which is to say, between the two calls to read_cyccounter(). (They also appear outside the the timed portion, but that is less relevant.) If you interchange the order of these two lines in the two places they appear within the timed portion, the program will still produce the same results, but it will process the cells in a different order. How does this affect its performance? You should recompile the modified program the same way as shown earlier in this lab handout and should repeat the timing experiments you've previously done. In accounting for your results, you may want to know that both the Pentium 4 and the Core 2 can prefetch a cache block in advance of any access to that block, if the preceding block is being read sequentially.