MCS-284 Lab 4: Multiprocessor Performance

Due December 12, 2013

Overview

In this lab, you will continue working with variations on the matrix multiplication program from the prior lab, but using two threads so as to be able to exploit two processors. (Most of our computers now have four processor cores, but for simplicity we'll stick to two. You can earn extra credit by experimenting with four.)

You will be considering different ways to divide the n³ loop iterations between the two threads. Some ways may not even reliably produce correct results. But among those that produce correct results, there will also be performance differences. Recall that the Core i5 has private L1 and L2 caches for each processor core, while sharing a single L3 cache. (Likewise, the Core 2 Duo chip used in our older computers has private L1 caches, while sharing a single L2 cache.) Two threads can interact at either the private (core-specific) or the shared portion of the cache hierarchy. At the shared level, there may be a beneficial interaction if both threads are reading the same data, so that a single miss can bring data in for both threads. In the private portion of the hierarchy, the cache coherence protocol can cause detrimental interactions if both threads write into the same block.

Important note

The last section of this lab assignment is a checklist of my grading expectations. Please read this section in advance and plan your work so that you meet these expectations.

The programs

Each of the programs you will work with fills three n × n matrices with pseudorandom numbers in the range from 0.0 to 1.0 and then adds into the c matrix the product of the a and b matrices.

The program ilh.c divides the iterations of the i loop into low-numbered and high-numbered halves. It also does the computation in a single-threaded fashion so as to be able to compute the speedup for the two-threaded version and check the correctness of the answers. Please ask me if you have any questions about this program.

You should create slightly modified copies of this program, jlh.c and klh.c, that move the low-vs.-high division to the j or k loop. The i loop should now cover all n possible values in both threads, and the order in which the three loops are nested should be unchanged. (The program name in each case indicates which loop variable has its range partitioned and that the partitioning is into low and high halves. In particular, the program jlh is not a tribute to our textbook's coauthor, John L. Hennessy.)

You should also make three other versions, ieo.c, jeo.c, and keo.c, each of which divides one loop's iterations so that, with 2 threads, one thread computes the even-numbered iterations and the other the odd-numbered ones. (As in the lh versions, your doComputations procedure should work for any number of threads, not just 2.) For example, in ieo.c, the i loop variable would take on the values 0, 2, 4, ... in thread 0 of 2. That same variable would take on the values 1, 3, 5, ... in thread 1 of 2. The nesting of the three loops should again be unchanged. The code for these versions should be simpler than for the lh versions; if your code is complicated, you aren't thinking about it right.

Forming hypotheses

Think about how each of the versions will be accessing memory, both the portion that is read and the portion that is written. Based on the patterns of memory accesses, make these predictions:

  1. Which of the six two-threaded versions will reliably produce the same results as the single-threaded reference version?

  2. For each two-threaded version, will it run faster or slower than the single-threaded version? (A qualitative speedup/slowdown prediction is enough.)

Experimentation

You can compile and run the programs the same way as in the prior lab. You can do all your experiments with n = 1500. Be sure to run each program a significant number of times, preferably in a randomized order. This is even more important in this lab than in the prior one. In order to show the speedup from using two cores, the timings need to be of total elapsed time rather than CPU time. This means that anything that competes with your programs for the computer's attention will cause anomalously long runtimes, which can cause the speedup to be anomalously high or low, depending on whether the two-threaded or single-threaded computation is impacted. If you do enough runs, you can set aside values that clearly are outliers compared to the others.

Your lab report

You should explain your hypotheses, together with an explanation of why you expected them to be true.

You should explain your experiment. This includes presenting the C code for at least two of the algorithms, one that divides a loop's iterations into even and odd, and one that divides low-numbered iterations from high-numbered ones. (In each case, just include the code for the computational loops, not the setup, timing, and comparison code.) Using this code for reference, you should describe what all six algorithms were. You should also provide general background information, such as the value of n you used with each algorithm and the number of times you tried each. If you mixed up the order of the experiments to avoid systematic biases, you should specify that as well. Finally, you should specify the hardware and software you used. In the Apple menu, you will find "About This Mac," which you can use to find the processor, memory, and (with a little more work) the cache sizes. In a terminal window, you can use the command

cc --version

to find what version of the C compiler you used. You should also mention having used the -O3 option.

You should report your quantitative performance results. State your impression of how repeatable the results are, what you did in terms of discarding outliers, and how you summarized the remaining values (mean? median?). Present the results in a table.

You should also report your results regarding correctness. For each program, did you observe any errors? If so, what can you say about the magnitude and sign of the errors?

You should provide an interpretation for your results both with regard to correctness and with regard to performance. Do they support your hypotheses? This is not a "yes" or "no" question; explain what leads you to your conclusion.

Checklist of expectations used for grading

  1. There is a hypothesis stated regarding which two-threaded versions will reliably produce the same results as the single-threaded reference version.

  2. There is a qualitative speedup/slowdown hypothesis for each two-threaded version.

  3. There is an explanation for each hypothesis; this need not be correct, as long as a correct analysis shows up later in the interpretation of results.

  4. The code for at least one low/high version and one even/odd version is shown.

  5. The six versions are described.

  6. The matrix size and number of runs per condition are stated.

  7. The processor, memory, cache sizes, C compiler version, and optimization setting are reported.

  8. The typical value of each version's speedup is indicated quantitatively in a table with some indication of how it was calculated (such as whether it is a mean or median).

  9. The degree of run-to-run variability in each version's speedup is indicated at least qualitatively in some way other than by just listing the raw data.

  10. The versions for which errors were observed are indicated.

  11. For each version with observed errors, some indication is given (beyond a listing of the raw data) as to whether those errors included small positive errors, small negative errors, large positive errors, and large negative errors.

  12. The hypothesis or interpretation of the results draws a connection between correctness and which threads are involved in computing each element of the result array.

  13. The hypothesis or interpretation with regard to correctness explains how errors could arise in certain versions through the non-associative nature of floating-point addition.

  14. The hypothesis or interpretation with regard to correctness explains another potential source of error in certain versions that has nothing to do with floating-point arithmetic.

  15. The hypothesis or interpretation with regard to correctness comments on which of the two error mechanisms is expected to only produce small errors and which might produce large ones.

  16. The hypothesis or interpretation with regard to correctness comments on which of the two error mechanisms can result in upward versus downward deviations from the single-threaded values.

  17. The hypothesis or interpretation with regard to performance makes clear which versions will tend to pass ownership of cache blocks back and forth between the two processors' private caches using the cache coherence mechanism.

  18. The hypothesis or interpretation with regard to performance makes clear, for each version of the program, what fraction of the data in a typical block of b brought into a private cache gets used.

  19. The hypothesis or interpretation with regard to performance makes clear, for each version of the program, how much risk there is that half the data in many of the blocks of b brought into the shared cache will go unused.

  20. The hypothesis or interpretation with regard to performance makes clear how much potential each version has for both processors to use all of each block of b that is brought into the shared cache.