MCS-284 Lab 4: Multiprocessor Performance

Due December 11, 2009

Overview

In this lab, you will continue working with variations on the matrix multiplication program from the prior lab, but using two threads so as to be able to exploit two processors. You will be considering different ways to divide the n³ loop iterations between the two threads. Some ways may not even reliably produce correct results. But among those that produce correct results, there will also be performance differences. Recall that the Core 2 Duo has a separate L1 cache for each processor core but a single shared L2 cache. Two threads can interact at either level of the cache hierarchy. At Level 2, the interaction may be beneficial: if both threads are reading the same data, a single L2 miss can bring that data in for both of them. At Level 1, the cache coherence protocol can cause detrimental interactions.

The programs

The program ilh.c divides the iterations of the i loop into low-numbered and high-numbered halves. It also does the computation in a single-threaded fashion so as to be able to compute the speedup for the two-threaded version and check the correctness of the answers. Please ask me if you have any questions about this program.
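
For reference, the heart of such a program might look like the following minimal sketch. It assumes POSIX threads and flattened, row-major double arrays, and the names (multiply_rows, struct half, and so on) are made up for illustration; the actual ilh.c handed out with the lab may differ in its details.

#include <pthread.h>

#define N 1000                    /* matrix dimension; the lab uses n = 1000 */

static double a[N*N], b[N*N], c[N*N];  /* row-major matrices; statics start zeroed */

struct half { int i_low, i_high; };    /* the range of i values for one thread */

/* Multiply the rows of a in [i_low, i_high), accumulating into c. */
static void *multiply_rows(void *arg)
{
    struct half *h = arg;
    int i, j, k;
    for (i = h->i_low; i < h->i_high; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++)
                c[i*N + j] += a[i*N + k] * b[k*N + j];
    return NULL;
}

/* Two-threaded version: a second thread takes the low half of the
   i range, the main thread takes the high half, and then they join. */
static void multiply_two_threads(void)
{
    pthread_t low_thread;
    struct half low = { 0, N/2 }, high = { N/2, N };
    pthread_create(&low_thread, NULL, multiply_rows, &low);
    multiply_rows(&high);
    pthread_join(low_thread, NULL);
}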

You should create slightly modified copies of this program, jlh.c and klh.c, that move the low-vs.-high division to the j or k loop. The i loop should now cover all n possible values in both threads, and the order in which the three loops are nested should be unchanged. (The program name in each case indicates which loop variable has its range partitioned and that the partitioning is into low and high halves. In particular, the program jlh is not a tribute to our textbook's coauthor, John L. Hennessy.)
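
As an illustration (again only a sketch with made-up names, not necessarily how you must write it), the thread function for jlh.c could keep the same i-j-k nesting and consult the partition in the j loop instead:

/* jlh sketch: each thread covers all n values of i, but only the
   columns j in [j_low, j_high).  klh.c would be analogous, with the
   partition consulted in the k loop. */
struct jhalf { int j_low, j_high; };

static void *multiply_columns(void *arg)
{
    struct jhalf *h = arg;
    int i, j, k;
    for (i = 0; i < N; i++)
        for (j = h->j_low; j < h->j_high; j++)
            for (k = 0; k < N; k++)
                c[i*N + j] += a[i*N + k] * b[k*N + j];
    return NULL;
}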

You should also make three other versions, ieo.c, jeo.c, and keo.c, each of which divides the iterations of one loop so that one thread computes the even-numbered iterations and the other computes the odd-numbered ones. For example, in ieo.c, the i loop variable would take on the values 0, 2, 4, ... in one thread and 1, 3, 5, ... in the other thread. The nesting of the three loops should again be unchanged. The code for these versions should be simpler than for the lh versions; if your code is complicated, you aren't thinking about it right.
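
For example, again under the assumptions of the earlier sketches (and with hypothetical names), the even/odd split of the i loop amounts to choosing a starting value and stepping by 2:

/* ieo sketch: one thread is passed parity 0 and computes rows 0, 2, 4, ...;
   the other is passed parity 1 and computes rows 1, 3, 5, ....
   The loop nesting order is unchanged. */
static void *multiply_parity(void *arg)
{
    int parity = *(int *)arg;     /* 0 for one thread, 1 for the other */
    int i, j, k;
    for (i = parity; i < N; i += 2)
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++)
                c[i*N + j] += a[i*N + k] * b[k*N + j];
    return NULL;
}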

Forming hypotheses

Think about how each of the versions will be accessing memory. For each version, do you think the two-threaded execution will reliably produce the same results as the single-threaded execution? Why or why not? (If not, will the answers at least be close?) Also, for each version, make a qualitative prediction regarding the speedup, that is, the number of times faster the two-threaded computation is than the single-threaded computation (the single-threaded time divided by the two-threaded time). Possible predictions include a speedup of nearly 2 (the ideal for two processors), a speedup somewhere between 1 and 2, a speedup of approximately 1 (no benefit from the second thread), or even a speedup below 1 (the two-threaded version actually runs slower).

If you predict that more than one version will have good speedup, you might want to also predict their relative ordering. Is there one you expect to be the very best? One that would still be good, but not so good?

Experimentation

You can compile and run the programs the same way as in the prior lab. You can do all your experiments with n=1000. Be sure to run each program a significant number of times, preferably in a randomized order; this is even more important in this lab than in the prior one. To show the speedup from using two cores, the timings need to be of total elapsed time rather than CPU time. This means that anything competing with your programs for the computer's attention will produce anomalously long runtimes, which can make the measured speedup anomalously high or low, depending on whether the two-threaded or the single-threaded computation is affected. If you do enough runs, you can set aside values that are clearly outliers compared to the others.
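
If you are unsure whether a timing is elapsed (wall-clock) time, the following sketch shows one way to measure it with gettimeofday; the programs distributed for the lab may time things differently, so treat this only as an illustration. The clock() function, by contrast, reports CPU time, which for a two-threaded run charges the work done on both cores and so would hide any speedup.

#include <sys/time.h>

/* Returns the elapsed (wall-clock) seconds taken by work(). */
static double elapsed_seconds(void (*work)(void))
{
    struct timeval start, end;
    gettimeofday(&start, NULL);
    work();
    gettimeofday(&end, NULL);
    return (end.tv_sec - start.tv_sec)
         + (end.tv_usec - start.tv_usec) / 1.0e6;
}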

Your lab report

You should state your hypotheses, together with an explanation of why you expected them to be true.

You should explain your experiment. This includes presenting the C code for at least two of the algorithms, one that divides a loop's iterations into even and odd, and one that divides low-numbered iterations from high-numbered ones. (In each case, just include the code for the computational loops, not the setup, timing, and comparison code.) Using this code for reference, you should describe what all six algorithms were. You should also provide general background information, such as the value of n you used with each algorithm and the number of times you tried each. If you mixed up the order of the experiments to avoid systematic biases, you should specify that as well. Finally, you should specify the hardware and software you used. In the Apple menu, you will find "About This Mac," which you can use to find the processor, memory, and (with a little more work) the amount of L2 cache and the bus speed. In a terminal window, you can use the command

cc --version

to find what version of the C compiler you used. You should also mention having used the -fast option.

You should report your quantitative results. State your impression of how repeatable the results are, what you did in terms of discarding outliers, and how you summarized the remaining values (mean? median?). You should report your numbers with a degree of precision that corresponds to the observed repeatability. Present the results in a table.

You should provide an interpretation for your results. Do they support your hypotheses? This is not a "yes" or "no" question; explain what leads you to your conclusion.

Extra credit opportunity

You might find it interesting to run these two-threaded programs on a single processor core, leaving it to the operating system to interleave the execution of the two threads. Is there any way this could still provide a speedup? To try the experiment, use the following command before running the program:

hwprefs cpu_count=1

After you are done with your uniprocessor experiments, be sure to give the same command again but with 2 in place of 1.