In this lab, you will continue working with variations on the matrix multiplication program from the prior lab, but using two threads so as to be able to exploit two processors. You will be considering different ways to divide the n³ loop iterations between the two threads. Some ways may not even reliably produce correct results. But among those that produce correct results, there will also be performance differences. Recall that the Core 2 Duo has separate L1 caches for each processor core, while sharing a single L2 cache. Two threads can interact at either level of the cache hierarchy. At the L2 level, there may be a beneficial interaction if both threads are reading the same data, so that a single L2 miss can bring data in for both threads. At the L1 level, the cache coherence protocol can cause detrimental interactions.
The last section of this lab assignment is a checklist of my minimal grading expectations. Please read this section in advance and plan your work so that you meet these expectations.
The program ilh.c divides the iterations of the i loop into low-numbered and high-numbered halves, one for each thread. It also performs the same computation in a single-threaded fashion, so that it can compute the speedup of the two-threaded version and check the correctness of its answers. Please ask me if you have any questions about this program.
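To make the division of work concrete, the computational part of ilh.c looks roughly like the following sketch, which uses POSIX threads. Treat the file you were actually given as authoritative: the names N, a, b, c, and multiply_half are just stand-ins, and the setup, timing, and comparison code that ilh.c also contains is omitted here.

#include <pthread.h>

#define N 1000                        /* matrix size; stands in for ilh.c's n */
static double a[N][N], b[N][N], c[N][N];

/* Each thread computes half of the i range: thread 0 takes the low-numbered
   half and thread 1 takes the high-numbered half. */
static void *multiply_half(void *arg) {
    int half = *(int *)arg;           /* 0 = low half, 1 = high half */
    int start = (half == 0) ? 0 : N / 2;
    int stop = (half == 0) ? N / 2 : N;
    int i, j, k;

    for (i = start; i < stop; i++)
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];
    return NULL;
}

int main(void) {
    pthread_t low_thread, high_thread;
    int low = 0, high = 1;

    pthread_create(&low_thread, NULL, multiply_half, &low);
    pthread_create(&high_thread, NULL, multiply_half, &high);
    pthread_join(low_thread, NULL);
    pthread_join(high_thread, NULL);
    return 0;
}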
You should create slightly modified copies of this program, jlh.c and klh.c, that move the low-vs.-high division to the j or k loop. The i loop should now cover all n possible values in both threads, and the order in which the three loops are nested should be unchanged. (The program name in each case indicates which loop variable has its range partitioned and that the partitioning is into low and high halves. In particular, the program jlh is not a tribute to our textbook's coauthor, John L. Hennessy.)
You should also make three other versions, ieo.c, jeo.c, and keo.c, each of which divides the iterations of one loop so that one thread computes the even-numbered iterations and the other computes the odd-numbered ones. For example, in ieo.c, the i loop variable would take on the values 0, 2, 4, ... in one thread and 1, 3, 5, ... in the other. The nesting of the three loops should again be unchanged. The code for these versions should be simpler than for the lh versions; if your code is complicated, you aren't thinking about it right.
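For instance, assuming the same declarations and thread-creation code as in the sketch of ilh.c above, the thread function for a program along the lines of ieo.c might look something like this (again, only an illustrative sketch, not a prescribed solution):

/* Each thread computes every other value of i: thread 0 takes the
   even-numbered iterations (0, 2, 4, ...) and thread 1 takes the
   odd-numbered ones (1, 3, 5, ...). */
static void *multiply_every_other(void *arg) {
    int parity = *(int *)arg;         /* 0 = even iterations, 1 = odd iterations */
    int i, j, k;

    for (i = parity; i < N; i += 2)
        for (j = 0; j < N; j++)
            for (k = 0; k < N; k++)
                c[i][j] += a[i][k] * b[k][j];
    return NULL;
}

Notice how little changes relative to the low/high version; jeo.c and keo.c would similarly change only which loop variable gets the stride of 2.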
Think about how each of the versions will be accessing memory. Based on the pattern of memory accesses, make these predictions:
For each version, do you think the two-threaded execution will reliably produce the same results as the single-threaded execution? Why or why not? If not, will the answers at least be close? Where the answers differ, is there any reason to expect downward versus upward deviations?
For each version, make a qualitative prediction regarding the speedup, that is, the number of times faster the two-threaded computation is than the single-threaded computation. Possible predictions range from a speedup of nearly 2, through a more modest speedup or essentially no speedup at all, to an outright slowdown.
If you predict that more than one version will have good speedup, you might want to also predict their relative ordering. Is there one you expect to be the very best? Another that would still be good, but not quite as good?
You can compile and run the programs the same way as in the prior lab. You can do all your experiments with n=1000. Be sure to run each program a significant number of times, preferably in a randomized order. This is even more important in this lab than in the prior one. In order to show the speedup from using two cores, the timings need to be of total elapsed time rather than CPU time. This means that anything that competes with your programs for the computer's attention will cause anomalously long runtimes, which can cause the speedup to be anomalously high or low, depending on whether the two-threaded or single-threaded computation is impacted. If you do enough runs, you can set aside values that clearly are outliers compared to the others.
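As an aside on timing: the distributed programs measure the time for you, but it is worth seeing why the distinction between elapsed time and CPU time matters. The sketch below is one way to take both measurements (not necessarily how the distributed code does it): clock() reports processor time charged to the whole process, which accumulates across both threads and so cannot show a two-core speedup, while gettimeofday() reports real elapsed time.

#include <stdio.h>
#include <sys/time.h>
#include <time.h>

int main(void) {
    struct timeval wall_start, wall_stop;
    clock_t cpu_start, cpu_stop;
    double elapsed, cpu;

    cpu_start = clock();
    gettimeofday(&wall_start, NULL);

    /* ... the computation being timed would go here ... */

    gettimeofday(&wall_stop, NULL);
    cpu_stop = clock();

    elapsed = (wall_stop.tv_sec - wall_start.tv_sec)
            + (wall_stop.tv_usec - wall_start.tv_usec) / 1.0e6;
    cpu = (double)(cpu_stop - cpu_start) / CLOCKS_PER_SEC;
    printf("elapsed time: %f s, CPU time: %f s\n", elapsed, cpu);
    return 0;
}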
You should state your hypotheses, together with an explanation of why you expected them to be true.
You should explain your experiment. This includes presenting the C code for at least two of the algorithms, one that divides a loop's iterations into even and odd, and one that divides low-numbered iterations from high-numbered ones. (In each case, just include the code for the computational loops, not the setup, timing, and comparison code.) Using this code for reference, you should describe what all six algorithms were. You should also provide general background information, such as the value of n you used with each algorithm and the number of times you tried each. If you mixed up the order of the experiments to avoid systematic biases, you should specify that as well. Finally, you should specify the hardware and software you used. In the Apple menu, you will find "About This Mac," which you can use to find the processor, memory, and (with a little more work) the amount of L2 cache and the bus speed. In a terminal window, you can use the command
cc --version
to find what version of the C compiler you used. You should also mention having used the -fast option.
You should report your quantitative results. State your impression of how repeatable the results are, what you did in terms of discarding outliers, and how you summarized the remaining values (mean? median?). You should report your numbers with a degree of precision that corresponds to the observed repeatability. Present the results in a table.
You should provide an interpretation for your results. Do they support your hypotheses? This is not a "yes" or "no" question; explain what leads you to your conclusion.
You might find it interesting to run these two-threaded programs on a single processor core, leaving it to the operating system to interleave the execution of the two threads. Is there any way this could still provide a speedup? To try the experiment, use the following command before running the program:
hwprefs cpu_count=1
After you are done with your uniprocessor experiments, be sure to give the same command again but with 2 in place of 1.
Here, then, is the checklist of minimal grading expectations promised earlier.
There is a hypothesis stated regarding which two-threaded versions will reliably produce the same results as the single-threaded reference version.
There is a qualitative speedup/slowdown hypothesis for each two-threaded version.
There is an explanation for each hypothesis; this need not be correct, as long as a correct analysis shows up later in the interpretation of results.
The code for at least one low/high version and one even/odd version is shown.
The six versions are described.
The matrix size and number of runs per condition are stated.
The processor, memory, L2 cache size, bus speed, C compiler version, and optimization setting are reported.
The typical value of each version's speedup is indicated quantitatively in either a table or a graph, with some indication of how it was calculated (such as whether it is a mean or median). If the mean or median is only shown graphically, then the raw data is in a table.
The degree of run-to-run variability in each version's speedup is indicated at least qualitatively in some way other than by just listing the raw data.
The hypothesis or interpretation of the results draws a connection between correctness and which threads are involved in computing each element of the result array.
The hypothesis or interpretation with regard to correctness explains how errors could arise in certain versions through the non-associative nature of floating-point addition.
The hypothesis or interpretation with regard to correctness explains another potential source of error in certain versions that has nothing to do with floating-point arithmetic.
The hypothesis or interpretation with regard to correctness comments on which of the two error mechanisms is expected to only produce small errors and which might produce large ones.
The hypothesis or interpretation with regard to correctness comments on which of the two error mechanisms can result in upward versus downward deviations from the single-threaded values.
The hypothesis or interpretation with regard to performance makes clear which versions will pass ownership of cache blocks back and forth between the two processors' L1 caches using the cache coherence mechanism.
The hypothesis or interpretation with regard to performance clearly distinguishes "false sharing" from cases where the two processors are writing to the same memory location.
The hypothesis or interpretation with regard to performance makes clear, for each version of the program, what fraction of the data in each block brought into an L1 cache gets used.
The hypothesis or interpretation with regard to performance makes clear how much potential each version has for both processors to use each block of data that is brought into the L2 cache.