MCS-284 Lab 3: Measuring Processor Architectures' Performance (Fall 2002)

Due: November 18, 2002

The goal of this lab is to see how the performance of an advanced pipelined architecture can be empirically measured, using the processor's cycle counter, and in particular to see the performance interaction between different hardware approaches to control hazards and different software approaches to jumping. As usual, I expect most of you to work in pairs.

You will measure the performance of two different programs each on two different computers, for a total of four combinations. The two programs will be nearly identical. Both read in a number, n, and then loop n times, doing nothing in the loop except incrementing the counter i until it reaches n. Each program then prints out the number of clock cycles that the loop took. (This is found by using a special cycle counter built into the computer.) The only difference between the two programs is in how they jump back from the bottom of the loop to the top. One jumps to a label, like the MIPS j instruction, while the other jumps to an address stored in a register, like the MIPS jr instruction. (The register is initialized before the loop with the address of the label at the top of the loop.) The two computers you will use are not variants of the MIPS architecture, but rather of the IA-32 architecture. One is the Intel Pentium II or III, which we have in most of our lab computers. Be sure to do all your experiments with a single lab computer (other than rt or lahi, which use the Pentium 4). You can find out what kind of processor a computer has by in a shell window giving the command

 cat /proc/cpuinfo

Your lab report should specify the particular processor model name. The other IA-32 implementation you will measure is the Pentium 4 processor, which lahi and rt have. Because those two computers will only be used for running the experiments, and not for any of the program development work or data analysis and writing, we should be able to share those two machines among all the lab groups.

It would be unreasonable for me to expect you to learn Intel's assembly language after having just struggled with MIPS's. So instead, you'll use a different approach: writing the programs in C and compiling them with the gcc compiler. I'll have you look at the assembly language output produced by the compiler so that you can see the difference between the two versions, but it is much easier to read an unfamiliar assembly language than to write it.

I have written the first version of the program, which jumps to a label. It is in the loop.c file linked from the web version of this lab handout. In order to use the cycle counter, it makes use of a second file, cyccounter.h, which is also on the web. You will need to download copies of these two files into your own directory. You are not expected to be able to read and understand cyccounter.h.

To compile this program using gcc, you can use the following two commands:

gcc -O -S loop.c
gcc -o loop loop.s

The first command compiles the C program loop.c and stops at assembly language (that is what the -S option does). This assembly language version is left in the file loop.s. The -O option says that the compiler should "optimize" the program, in other words, take the time to do a good job of compiling. The second command says to compile (actually assemble) the loop.s assembly language program the rest of the way to an executable machine language program, which it should put in the output file loop. (The -o option just says that what follows it is the output file.) Note that there are no zero digits in the above commands, just upper- and lower-case letter O's.

In order to run the loop program, you can just give the command

./loop

You will now need to type in a value for n (the number of iterations). The program will then print out the number of cycles the loop took. You can repeat this several times with the same value of n to get an idea how repeatable it is, and repeat it with several different values of n to see how the number of cycles grows with n, in particular, how many additional cycles are done for each additional trip through the loop. If there is some variation in the cycle counts for a particular n, you should use the smallest rather than taking an average. This is because we only want to pay attention to cycles actually spent in the processor, not the extra cycles that can sometimes be needed because of memory delays or other effects outside of chapter 6's scope.

The question of how many cycles each iteration of the loop takes can be answered in two ways: the average number of cycles per iteration or the marginal number of cycles per iteration. The average is found by taking the total number of cycles for n iterations and dividing it by n. The marginal value is found by seeing how many more cycles the loop takes to do n₁ iterations than n₀ (where n₁ > n₀) and dividing that difference by n₁ - n₀. The marginal value is generally the more interesting one, because it excludes any fixed costs. Some of those fixed costs are artifacts of the measurement study: starting and stopping the timer takes some time. Even fixed costs that aren't artificial are of diminishing relevance as the number of iterations grows large, and if the number of iterations is small, then performance is probably not crucial in the first place. Besides, we are particularly interested in the performance of the jumping to the top of the loop, and that is a per-iteration phenomenon, not a start-up cost like initializing i to 0. Therefore, you should calculate marginal, not average, values in this lab.

There are two important experimental design questions you will need to answer: what values of n should you test, and how many theoretically identical runs should you do with each value of n? To answer the second question first, you should do enough runs to be reasonably confident that your results are not thrown off by any extraneous effect (such as memory-system-induced delays) that adds extra cycles in to some runs. How many runs this is will depend on what sort of variability you see in your initial experiments.

Returning to the other experimental design question, of what values of n you should try, you should keep in mind that it is always best to design experiments so that the effect you want to measure is large relative to other effects you are not interested in. In this case, we are interested in how many extra clock cycles are added for each additional trip around the loop. If the loop is done very few times, most of the cycles measured will be for things as uninteresting as starting and stopping the timer. Furthermore, if you compare runs with very similar values of n, then even if those values of n are large, most of the cycles you measure will not be the incremental cycles caused by the change in n

The issues mentioned in the prior paragraph should give you some guidance on what size values of n to try, but you may still be wondering how many different values of n to try. The implicit assumption in the foregoing is that the graph of cycles versus n will be approximately a straight line. If your experimental plan is based on this assumption, you should gather enough data that if the assumption were wrong, you would have some chance of discovering that. This tells something about the absolute minimum number of different values of n you should use: enough that there is some possibility for the corresponding points to not fall along a line.

To make the alternative version of the program, where the loop is closed not with a jump to a label but rather a jump using a register, you will need to edit the C program (or rather, a copy of it with a different name) and recompile. The loop in the C program ends with the statement

 goto loop;

This is what gets translated by the C compiler into a jump to a label. We want to change this to the C equivalent of jumping to an address specified by a register, which is using goto with the target pointed to by a variable. This can't be done in standard C, but can using a non-standard extension to C that the gcc compiler provides. To learn about this non-standard extension, you'll need to look at the gcc documentation. You can do this in the KDE Help System. To use that system, click on the icon of a life ring, which is in the main bar at the bottom of the screen. Then select (from the contents bar on the right of the window) "Browse info pages", "Programming", "gcc", "Extensions to the C Language", and finally "Labels as Values." This should have the information you need. If you have trouble understanding it, be sure to ask. One likely trouble spot is that foo is used in the example without explanation. The idea is that this is a label appearing in part of the example program that isn't shown. That is, some statement would be labeled with foo: (note the colon here). Then after assigning &&foo to ptr, the statement goto *ptr; has the same effect as

goto
foo;

would. Be sure you initialize your pointer variable once before entering the loop, rather than setting it each time through the loop. Be sure to ask for help if you need it, either with operating the KDE Help program or with understanding how to modify the C program.

You can compile and run the new version of the program the same way as the original. You should also compare the two assembly language versions (the .s files) to make sure that the loop is really the same except for the jump instruction at the end, and possibly which specific register has been assigned to hold each variable. The jump instruction in the IA-32 assembly language is jmp. This is used for both jumping to a label (MIPS's j) and jumping via a register (MIPS's jr). The difference is that if you are jumping to a label, the label appears after the jmp, while if you are jumping to a location specified by a register, the jmp is followed by a * and then the register name. The labels generated by gcc start with .L, and the register names all start with % (which is like $ for MIPS). If you have trouble identifying what lines of assembly language are the loop, or understanding the difference, please ask for help.

While you are looking at the assembly language versions, you should also count how many instructions are executed each time through the loop. By putting this together with your cycle count data, you should be able to calculate the CPI for each version of the loop. By this I mean not the overall CPI of the full program (or even of that portion that is between the starting and stopping of the timer) but rather the CPI of the looping itself, based on just how many instructions are done per additional trip around the loop and how many extra cycles that takes.

Note, by the way, that neither version of the program is good C programming style, and there certainly is no reason why a programmer would use the second version, where a variable (or register) is used to jump to the fixed address at the top of the loop. However, these programs let us try out in a simple context the two forms of jump instructions that real programs do use. For example, jumping to an address contained in a register is an important part of how modern object-oriented languages are implemented.

Graphs can be useful for showing your data, but you need to be careful about some common pitfalls, particularly if you use a computer to generate the graphs rather than doing it by hand:

The actual data points should be marked (for example with dots or other little symbols), rather than just showing a line or curve that goes through the points.
The position of the points should correspond in some way to their numerical values: either proportionately if you use a linear scale or in proportion to their logarithms if you use a logarithmic scale. It is not acceptable to have 100, 200, 500, and 5000 be equally spaced from one another. This may seem obvious, but many software packages intended primarily for business graphics will do the wrong thing by default, and many students don't seem to pay any attention. The problem is not with the software, however ill designed it may be: the problem is with accepting its default behavior rather than changing its settings or even drawing the graph by hand.
If you want the reader to compare two things, they should be shown on the same graph, or at absolute worst drawn to the same scale on adjacent graphs. If each of them can be more clearly shown on its own scale, you can also do both: a common graph to show the comparison and also separate graphs to show the individual data sets most clearly. If the difficulty with a combined presentation is that the numerical values are vastly different, a logarithmic scale may help.

In your lab report, you should show what the difference in assembly language code is, explain how you determined the cycles-per-iteration and CPI of each version on each processor, and provide your results. You should also try to draw some bigger picture conclusions from these results. What can you say about the pipelines of the two processors? How coping with control hazards is influenced by the kind of jump?

Instructor: Max Hailperin