MCS-284 Lab 3: Measuring Processor Architectures' Performance (Fall 2000)

Due: November 17, 2000

The goal of this lab is to see how the performance of an advanced pipelined architecture can be empirically measured, using the processor's cycle counter, and in particular to see the performance interaction between different hardware approaches to control hazards and different software approaches to jumping. As usual, I expect most of you to work in pairs.

You will measure the performance of two different programs each on two different computers, for a total of four combinations. The two programs will be nearly identical. Both read in a number, n, and then loop n times, doing nothing in the loop except incrementing the counter i until it reaches n. Each program then prints out the number of clock cycles that the loop took. (This is found by using a special cycle counter built into the computer.) The only difference between the two programs is in how they jump back from the bottom of the loop to the top. One jumps to a label, like the MIPS j instruction, while the other jumps to an address stored in a register, like the MIPS jr instruction. (The register is initialized before the loop with the address of the label at the top of the loop.) The two computers you will use are not variants of the MIPS architecture, but rather of the IA-32 architecture. One is the Intel Pentium II or III processor, which we have in our lab computers. The other is the AMD K-6 processor, which I have in my computer at home. (I will run your programs for you on my home computer.)

It would be unreasonable for me to expect you to learn Intel's assembly language after having just struggled with MIPS's. So instead, you'll use a different approach: writing the programs in C and compiling them with the gcc compiler. I'll have you look at the assembly language output produced by the compiler so that you can see the difference between the two versions, but it is much easier to read an unfamiliar assembly language than to write it.

I have written the first version of the program, which jumps to a label. It is in the loop.c file linked from the web version of this lab handout. In order to use the cycle counter, it makes use of a second file, cyccounter.h, which is also on the web. You will need to download copies of these two files into your own directory. You are not expected to be able to read and understand cyccounter.h.

To compile this program using gcc, you can use the following two commands:

gcc -O -S loop.c
gcc -o loop loop.s

The first command compiles the C program loop.c and stops at assembly language (that is what the -S option does). This assembly language version is left in the file loop.s. The -O option says that the compiler should "optimize" the program, in other words, take the time to do a good job of compiling. The second command says to compile (actually assemble) the loop.s assembly language program the rest of the way to an executable machine language program, which it should put in the output file loop. (The -o option just says that what follows it is the output file.)

In order to run the loop program, you can just give the command

./loop

You will now need to type in a value for n (the number of iterations). The program will then print out the number of cycles the loop took. You can repeat this several times with the same value of n to get an idea how repeatable it is, and repeat it with several different values of n to see how the number of cycles grows with n, in particular, how many additional cycles are done for each additional trip through the loop. If there is some variation in the cycle counts for a particular n, you should use the smallest rather than taking an average. This is because we only want to pay attention to cycles actually spent in the processor, not the extra cycles that can sometimes be needed because of memory delays or other effects outside of chapter 6's scope.

The question of how many cycles each iteration of the loop takes can be answered in two ways: the average number of cycles per iteration or the marginal number of cycles per iteration. The average is found by taking the total number of cycles for n iterations and dividing it by n. The marginal value is found by seeing how many more cycles the loop takes to do n₁ iterations than n₀ (where n₁ > n₀) and dividing that difference by n₁ - n₀. The marginal value is generally the more interesting one, because it excludes any fixed costs. Some of those fixed costs are artifacts of the measurement study: starting and stopping the timer takes some time. Even fixed costs that aren't artificial are of diminishing relevance as the number of iterations grows large, and if the number of iterations is small, then performance is probably not crucial in the first place. Besides, we are particularly interested in the performance of the jumping to the top of the loop, and that is a per-iteration phenomenon, not a start-up cost like initializing i to 0. Therefore, you should calculate marginal, not average, values in this lab.

There are two important experimental design questions you will need to answer: what values of n should you test, and how many theoretically identical runs should you do with each value of n? To answer the second question first, you should do enough runs to be reasonably confident that your results are not thrown off by any extraneous effect (such as memory-system-induced delays) that adds extra cycles in to some runs. How many runs this is will depend on what sort of variability you see in your initial experiments.

Returning to the other experimental design question, of what values of n you should try, you should keep in mind that it is always best to design experiments so that the effect you want to measure is large relative to other effects you are not interested in. In this case, we are interested in how many extra clock cycles are added for each additional trip around the loop. If the loop is done very few times, most of the cycles measured will be for things as uninteresting as starting and stopping the timer. Furthermore, if you compare runs with very similar values of n, then even if those values of n are large, most of the cycles you measure will not be the incremental cycles caused by the change in n

To make the alternative version of the program, where the loop is closed not with a jump to a label but rather a jump using a register, you will need to edit the C program (or rather, a copy of it with a different name) and recompile. The loop in the C program ends with the statement

goto loop;

This is what gets translated by the C compiler into a jump to a label. We want to change this to the C equivalent of jumping to an address specified by a register, which is using goto with the target pointed to by a variable. This can't be done in standard C, but can using a non-standard extension to C that the gcc compiler provides. To learn about this non-standard extension, you'll need to look at the gcc documentation. You can do this by giving the command

info gcc

in a terminal window. Now go to the menu item "C Extensions" and from there to "Labels as Values." This should have the information you need. If you have trouble understanding it, be sure to ask. Be sure you initialize your pointer variable once before entering the loop, rather than setting it each time through the loop. Be sure to ask for help if you need it, either with operating the info program or with understanding how to modify the C program.

You can compile and run the new version of the program as above. You should also compare the two assembly language versions (the .s files) to make sure that the loop is really the same except for the jump instruction at the end, and possibly which specific register has been assigned to hold each variable. The jump instruction in the IA-32 assembly language is jmp. This is used for both jumping to a label (MIPS's j) and jumping via a register (MIPS's jr). The difference is that if you are jumping to a label, the label appears after the jmp, while if you are jumping to a location specified by a register, the jmp is followed by a * and then the register name. The labels generated by gcc start with .L, and the register names all start with % (which is like $ for MIPS). If you have trouble identifying what lines of assembly language are the loop, or understanding the difference, please ask for help.

While you are looking at the assembly language versions, you should also count how many instructions are executed each time through the loop. By putting this together with your cycle count data, you should be able to calculate the CPI for each version of the loop. By this I mean not the overall CPI of the full program (or even of that portion that is between the starting and stopping of the timer) but rather the CPI of the looping itself, based on just how many instructions are done per additional trip around the loop and how many extra cycles that takes.

Note, by the way, that neither version of the program is good C programming style, and there certainly is no reason why a programmer would use the second version, where a variable (or register) is used to jump to the fixed address at the top of the loop. However, these programs let us try out in a simple context the two forms of jump instructions that real programs do use. For example, jumping to an address contained in a register is an important part of how modern object-oriented languages are implemented.

Thus far, your cycle count data, and hence CPI, are for the Pentium II or III processor. In order to get data for the K-6, you will need to email me the assembly language version of your two programs and information about what values of n you want me to input and how many theoretically identical runs you want me to do with each value of n with each program. I will then email you back the results. Note that I'm not going to be able to do a whole class worth of runs the last night; if you want to be sure to get results, you will need to send me your request by November 13. Earlier would be better, so that you have more time to work on analysis and write-up.

It is important that you send me the assembly language programs, not the C versions, so that we can be confident I am executing the exact same instructions on the K-6 as you did on the Pentium II or III, even if I might have a different version of the C compiler. And, you must clue me in to your experimental design: not just the values of n, but also the number of repetitions with each value.

Graphs can be useful for showing your data, but you need to be careful about some common pitfalls, particularly if you use a computer to generate the graphs rather than doing it by hand:

The actual data points should be marked (for example with dots or other little symbols), rather than just showing a line or curve that goes through the points.
The position of the points should correspond in some way to their numerical values: either proportionately if you use a linear scale or in proportion to their logarithms if you use a logarithmic scale. It is not acceptable to have 100, 200, 500, and 5000 be equally spaced from one another. This may seem obvious, but many software packages intended primarily for business graphics will do the wrong thing by default, and many students don't seem to pay any attention. The problem is not with the software, however ill designed it may be: the problem is with accepting its default behavior rather than changing its settings or even drawing the graph by hand.
If you want the reader to compare two things, they should be shown on the same graph, or at absolute worst drawn to the same scale on adjacent graphs. If each of them can be more clearly shown on its own scale, you can also do both: a common graph to show the comparison and also separate graphs to show the individual data sets most clearly. If the difficulty with a combined presentation is that the numerical values are vastly different, a logarithmic scale may help.

In your lab report, you should show what the difference in assembly language code is, explain how you determined the cycles-per-iteration and CPI of each version on each processor, and provide your results. You should also try to draw some bigger picture conclusions from these results. What can you say about the handling of control hazards on the two processors? How is it influenced by the kind of jump? If you are feeling more ambitious, you can also find information on Intel's and AMD's web sites describing the specific mechanisms the two processors use for dealing with control hazards. I can help you with uncovering and understanding this information. You could then connect your experimental findings to this published information about the processors' designs.

Instructor: Max Hailperin