# MCS-284 Lab 3: Processor Performance

## Due November 21, 2008

This lab focuses on carrying out an experiment and analyzing the results, rather than on programming. You will need to write a clear report that explains what you did (so that someone else could replicate your experiment), provides the factual results of your experiment, and discusses what interpretation you place on those results. Your report should be aimed at an audience with background similar to our course, but without specific knowledge of this lab assignment.

Your will perform your experiments on two computers with different processors, in order to allow comparison. In the OHS 326 lab, there are two kinds of computers: those in beige cases and those in black. Use one of each. In order to find out what model of processor each has, as well as its clock rate, you should give the following command. Like all commands throughout this assignment, you would do this is a terminal (shell) window:

```fgrep name /proc/cpuinfo
```

## Acquainting yourself with a program

The first thing you will need to do is acquaint yourself with a program I am providing. This program, and the other program you will encounter in this lab, are useful programs, but not in the usual sense. They are not designed to perform important computations, but rather to experimentally probe aspects of processor architecture. The measure of their usefulness is how much they allow you to learn about the Pentium 4 and Core 2 architectures, not how interesting their computational results are.

You will need to download two files: countmod.cpp, which is the actual C++ source code of the program, and cyccounter.h, which is a helper file that contains the necessary definitions for using the processor's cycle counter to get exact timings in units of clock cycles. The main loop of the `countmod.cpp` program is as follows:

```  for(unsigned int i = 0; i < n; i++){
result = i % m;
}
```

This loops for `i` having the values from 0 up to `n` The percent sign is the remainder operation, so each time around the loop, the current value of `i` is divided by `m` and the remainder is stored as the `result`. On thing you might notice is that only the last iteration of the loop actually achieves anything. As such, a smart compiler would eliminate all the other iterations of the loop and produce code that runs just as quickly no matter what the value of `n`. Since our goal in this lab is to see how the repeated divisions perform, that would be a disaster. Luckily, as you'll see, our compiler isn't that smart.

To compile the program, use the following commands:

```g++ -O -S countmod.cpp
g++ -o countmod countmod.s
```

The first of these two commands uses the C++ compiler, telling it to stop after translating the program into assembly language (that is the `-S` option) and to take some care to generate efficient code (that is the `-O` option). The resulting assembly code is left in the file `countmod.s`. The second command assembles that program, producing an executable machine language version in the file `countmod`. Normally this two-step process wouldn't be necessary, but for the sake of your experimentation, it is valuable to be able to examine the assembly language version.

You can run the program using a command like

```./countmod 1000 17
```

In this command, the first number, 1000, provides the value for `n`, the number of iterations. The second number, 17, provides the value for `m`, the divisor. Running the command will produce an output line such as the following:

```./countmod 1000 17 13 0026286
```

The first three elements of the output line simply repeat the command. The virtue of including this in the output will become apparent when you later run a script that executes a large number of different commands. The output from that script will be self-describing: each piece of output will be tagged with the command that produced it. The fourth element of the output line, 13, is the result computed by the loop shown earlier, or more particularly, by its last iteration: 999 divided by 17 leaves a remainder of 13. The reason to include this is in the output is not only to allow us to check that the program is functioning correctly, but also to discourage the compiler from optimizing the computation entirely out of existence: we need to show we actually care about the result. Finally comes the number we are most interested in: this particular execution took 26286 clock cycles.

If you run the same command repeatedly, the number of clock cycles may vary somewhat for two reasons. First, there may be small variations due to memory system effects. Second, there may be rare but larger variations if an interrupt occurs while the loop is running. In the worst case, the operating system might respond to an interrupt by choosing to switch processes and run one or more other programs for a fraction of a second, which would greatly increase the cycle count. (You probably will never see that.) Finally, if the processor is running at a reduced clock rate to save power, the cycle counts will be artificially inflated; this will normally influence at most the first run, because running a CPU-intensive program causes the processor to leave its power-saving mode. To deal with these variations, when you get to the part of the lab where you are doing performance comparisons, you will run each timing command repeatedly and use the smallest cycle count reported by any of the runs. Ordinarily in experimental work it is conventional to use an average of multiple runs. However, if you have reason to believe there is some underlying true number of clock cycles the processor needs, and that all deviations are going to be in the upward direction, caused by adding in extraneous cycles doing other work, then it makes sense to use the minimum rather than the average.

You may notice that the cycle counts are shown with extra leading zeros so as to always be seven digits. The reason why I wrote the program that way is because it will make it easier for you to find the minimum number of clock cycles for each test condition when you run your experimental script. As I'll explain later, you'll just need to sort the output lines into lexicographic order. If the cycle counts were displayed with varying numbers of digits, lexicographic order wouldn't correspond to numerical order: 100000 comes before 99999 lexicographically because 1 precedes 9.

Before we can consider making any performance comparisons, however, you need to convince yourself that our compiler really is stupid enough to do all 1000 divisions, even if only the last turns out to be relevant. I would suggest you confirm this two separate ways: by looking at the assembly language code and by doing a little informal preliminary experimentation.

If you look in your `countmod.s` file, you'll see that there is a large amount of messy IA-32 assembly language code. However, you can focus your attention on the portion between the two `rdtsc` instructions, since these are the ones that capture the processor's cycle counter, once before the loop and once after. Within that narrower portion of the code, you are looking for the loop itself, which would start with a label and end with a jump to that label. In my copy of the code, it looks as follows:

```.L45:
movl	%ebx, %eax
movl	\$0, %edx
divl	%ecx
movl	%edx, %edi
cmpl	%esi, %ebx
jne	.L45
```

You don't need to be able to read IA-32 assembly language. Because there is some value in puzzling a bit of it out, I'll project this on the screen at the beginning of lab and we can collectively work through it. However, for your purposes, the key fact is that there really is a loop here; the compiler didn't get rid of looping. Also, this loop does indeed have a division instruction in it; that's what the `divl` is. (If our program had used a constant divisor, rather than reading in the divisor of 17 from the command line, the compiler might well have found a way to compute the remainder without division.)

To confirm that the program really is looping, see what happens when you increase the iteration count from 1000 to 2000 and to 3000. If our understanding of the assembly language is correct, and the program really is looping, you should see the cycle counts go up. While we are at it, I should point out that the cycle counts may not exactly double and triple, even if you get rid of run-to-run variation by taking the minimum cycle counts from repeated runs. The cycle counts you measure include not only the 1000, 2000, or 3000 trips through the loop, but also some extra cycles spent on such matters as retrieving the cycle counter at the start and end and initializing the loop counter to 0. Therefore, when you do the performance comparisons later in this lab, you will be focusing on how many additional cycles are spent for each 1000 additional trips through the loop. Since the fixed costs of starting and ending should be identical, this will give you a more pure measure of the cost per trip through the loop. Hopefully the increase in clock cycles from 2000 to 3000 will be about the same as from 1000 to 2000, confirming the linear relationship we expect.

## Making a variant of the program

The `countmod` program I provided you performs lots of divisions, but it is worth noticing that they are all independent of each other. Before the processor can start each of the divisions, it needs to complete the incrementing of the loop counter, `i`. However, it does not need to have completed the previous iteration's division. This may have important performance implications if divisions are very long-latency instructions (as we might expect) and the processor has enough hardware resources to be able to have more than one division underway at a time.

To probe the significance of this issue, it would be interesting to compare the performance of `countmod` with that of a similar program in which each division is dependent on the value produced by the preceding division. Make a copy of `countmod.cpp` called `itermod.cpp` and then modify the loop body to the following:

```    result = result % m;
```

Use `g++` to compile `itermod.cpp` to assembly language and then produce an executable version, analogously to what you did earlier. Take a look at the assembly language version, again focusing on the main loop. Hopefully you will find that it is nearly identical to the `countmod` version. In particular, the compiler should still be generating looping code with one division per iteration. (A little mathematical reasoning would show that the result is always 0.) You can again do an informal preliminary experiment to confirm that the cycle count scales up with the number of iterations.

## Experimental data collection

At this point, you have two different programs (`countmod` and `itermod`) and two different processors (Pentium 4 and Core 2). You also have three different iteration counts (1000, 2000, and 3000). As such, you are all set to collect 12 different cycle counts. However, each one of these 12 should be found as the minimum of a whole bunch of nominally identical runs, for the reason explained earlier. My experience is that 10 runs under each experimental condition is probably adequate, though once you have some data, you'll be able to see if I seem to be right. Under this assumption, you need to do 120 program executions, 60 on each of the two processors. You could do that by hand if you needed to, but some automation sure would be nice.

When doing the 60 runs on a particular processor, you should do the runs in a randomized order. That way, if there is some time-dependent confounding effect, it won't systematically effect one experimental condition more than another. To achieve both randomization and automation, you can use a program to generate a script of the 60 commands, shuffled into a random order. Then you will run that same randomized script on each of the two computers, producing two different files of 60 outputs for analysis.

I've written a small program for you, randomize.py, in the Python programming language; it prints out the 60 commands in a randomized order. You don't need to be able to read Python, though if you look at the program, I hope you'll find it plausible that it does what I say. To run it, with the output going into a file named `script`, you would give the following command:

```python randomize.py >script
```

Having prepared the script in this way, you could run the commands in the script, capturing the output into a file, using a command such as the following:

```bash script >output-on-core2
```

My suggestion would be that you execute this command on a computer with a Core 2 processor and execute a similar command, but with a different output filename, on a computer with a Pentium 4 processor.

## Data analysis

The next step is to undo the randomization of the experimental data and select out the minimum cycle count for each of the 12 experimental conditions. (An experimental condition is a combination of a program, iteration count, and processor.) To undo the randomization, you can sort the output into lexicographic order, using a command such as the following:

```sort output-on-core2
```

The output will appear in your terminal window; if you'd rather have it in a file, you can redirect it using the `>` operation as in the earlier examples. You should see data from the 6 experimental conditions that were run on the Core 2 processor; they should be listed in a systematic order, and for each of the six, there should be a group of 10 consecutive lines that reflect the 10 different outputs that were produced from different executions of the same command. The 10 lines in each group should differ only in the cycle counts, which should be in increasing order within each group, making it easy to pick out the minimum values by using lines 1, 11, 21, 31, 41, and 51.

While you are looking at the sorted data, you can check if I seem to be right that 10 repetitions of each condition was enough. There is no sure-fire test, but if the first few cycle counts in each group of 10 are nearly identical, that is a good sign. If in one of the groups of 10, the first line is much smaller than the second, I would have real doubts as to whether you had found the true minimum, and would suggest gathering more data.

Once you have extracted the minimum cycle count for each of the 12 experimental conditions, your next step would be to test that for each of the 4 combinations of a program and processor, the relationship between iteration count and cycle count is linear, or at least nearly so. That is, test that the increase in cycle count going from 1000 to 2000 iterations is essentially the same as the increase from 2000 to 3000. Assuming this test is passed, you can now summarize your data by giving the number of cycles per additional iteration, for each of the 4 program/processor combinations.

## Drawing conclusions from your data

Think about which of the two programs would be more likely to give you an accurate estimate of the latency (in clock cycles) of integer division. Based on the data from that program, how do the two processors seem to compare in this regard? In assessing how expected or unexpected this result is, you may want to consider what we've read about the two processor designs. Which one has the deeper, more fine-grained pipeline, so as to be able to support higher clock rates?

The processors may also differ in how many independent integer divisions can be concurrently in progress. How would you estimate this degree of concurrency from your data? What would your estimate be for each processor?

The preceding two questions point to some interesting architectural differences between the processors. Those architectural differences influence the performance of the two programs. However, the clock rates also influence the performance. To put your cycle counts into context, convert them into seconds using the two processors' cycle times. How many times faster is the Core 2 than the Pentium 4 for `itermod`? How many times faster for `countmod`?

## Writing a lab report

As stated at the outset, you will need to write a clear report that explains what you did (so that someone else could replicate your experiment), provides the factual results of your experiment, and discusses what interpretation you place on those results. Your report should be aimed at an audience with background similar to our course, but without specific knowledge of this lab assignment.

Although you are to provide enough information about your experiment to allow replication, that doesn't necessarily include all the details that I gave in this lab assignment. Nor, more generally, should you formulate your report as a response to the assignment. Phrases such as "we were asked to" have no place in a report of an experiment. Just come right out and tell your reader what you did, as though your goal was to convey newly obtained scientific knowledge, not to demonstrate completion of a course assignment.

Once you have analyzed your data and drawn your conclusions, you should be in a position to decide on such matters as what title would suit your report and what main point you should state in your introduction. I have intentionally given this lab assignment a rather vague title ("Processor Performance") and vague section names ("a program", "a variant of the program"), because I want you to figure out for yourself how to characterize more precisely what issue it is that you studied. You should not emulate my vagueness.

Feel free to ask for my feedback on a draft of your lab report, just as you can consult me at any earlier stage in your work.