You will measure the performance of two different programs each on two
different computers, for a total of four combinations. The two
programs will be nearly identical. Both read in a number,
n, and then loop n times, doing nothing in the
loop except incrementing the counter i until it reaches
n. Each program then prints out the number of clock cycles
that the loop took. (This is found by using a special cycle counter
built into the computer.) The only difference between the two
programs is in how they jump back from the bottom of the loop to the
top. One jumps to a label, like the MIPS j
instruction,
while the other jumps to an address stored in a register, like the
MIPS jr
instruction. (The register is initialized before
the loop with the address of the label at the top of the loop.) The
two computers you will use are not variants of the MIPS architecture,
but rather of the IA-32 architecture. One is the Intel Pentium II or
III, which we have in most of our lab computers. Be sure to
do all your experiments with a single lab computer (other than rt or
lahi, which use the Pentium 4). You
can find out what kind of processor a computer has by in a shell window giving
the command
cat /proc/cpuinfoYour lab report should specify the particular processor model name. The other IA-32 implementation you will measure is the Pentium 4 processor, which lahi and rt have. Because those two computers will only be used for running the experiments, and not for any of the program development work or data analysis and writing, we should be able to share those two machines among all the lab groups.
It would be unreasonable for me to expect you to learn Intel's assembly language after having just struggled with MIPS's. So instead, you'll use a different approach: writing the programs in C and compiling them with the gcc compiler. I'll have you look at the assembly language output produced by the compiler so that you can see the difference between the two versions, but it is much easier to read an unfamiliar assembly language than to write it.
I have written the first version of the program, which jumps to a label. It is in the loop.c file linked from the web version of this lab handout. In order to use the cycle counter, it makes use of a second file, cyccounter.h, which is also on the web. You will need to download copies of these two files into your own directory. You are not expected to be able to read and understand cyccounter.h.
To compile this program using gcc, you can use the following two commands:
gcc -O -S loop.c gcc -o loop loop.sThe first command compiles the C program loop.c and stops at assembly language (that is what the -S option does). This assembly language version is left in the file loop.s. The -O option says that the compiler should "optimize" the program, in other words, take the time to do a good job of compiling. The second command says to compile (actually assemble) the loop.s assembly language program the rest of the way to an executable machine language program, which it should put in the output file loop. (The -o option just says that what follows it is the output file.) Note that there are no zero digits in the above commands, just upper- and lower-case letter O's.
In order to run the loop program, you can just give the command
./loopYou will now need to type in a value for n (the number of iterations). The program will then print out the number of cycles the loop took. You can repeat this several times with the same value of n to get an idea how repeatable it is, and repeat it with several different values of n to see how the number of cycles grows with n, in particular, how many additional cycles are done for each additional trip through the loop. If there is some variation in the cycle counts for a particular n, you should use the smallest rather than taking an average. This is because we only want to pay attention to cycles actually spent in the processor, not the extra cycles that can sometimes be needed because of memory delays or other effects outside of chapter 6's scope.
The question of how many cycles each iteration of the loop takes can
be answered in two ways: the average number of cycles per iteration or
the marginal number of cycles per iteration. The average is found by
taking the total number of cycles for n iterations and
dividing it by n. The marginal value is found by seeing
how many more cycles the loop takes to do n1
iterations than n0 (where n1 >
n0) and dividing that difference by n1
- n0. The marginal value is generally the more
interesting one, because it excludes any fixed costs. Some of those
fixed costs are artifacts of the measurement study: starting and
stopping the timer takes some time. Even fixed costs that aren't
artificial are of diminishing relevance as the number of iterations
grows large, and if the number of iterations is small, then
performance is probably not crucial in the first place. Besides, we
are particularly interested in the performance of the jumping to the
top of the loop, and that is a per-iteration phenomenon, not a
start-up cost like initializing i
to 0. Therefore, you
should calculate marginal, not average, values in this lab.
There are two important experimental design questions you will need to answer: what values of n should you test, and how many theoretically identical runs should you do with each value of n? To answer the second question first, you should do enough runs to be reasonably confident that your results are not thrown off by any extraneous effect (such as memory-system-induced delays) that adds extra cycles in to some runs. How many runs this is will depend on what sort of variability you see in your initial experiments.
Returning to the other experimental design question, of what values of n you should try, you should keep in mind that it is always best to design experiments so that the effect you want to measure is large relative to other effects you are not interested in. In this case, we are interested in how many extra clock cycles are added for each additional trip around the loop. If the loop is done very few times, most of the cycles measured will be for things as uninteresting as starting and stopping the timer. Furthermore, if you compare runs with very similar values of n, then even if those values of n are large, most of the cycles you measure will not be the incremental cycles caused by the change in n
The issues mentioned in the prior paragraph should give you some guidance on what size values of n to try, but you may still be wondering how many different values of n to try. The implicit assumption in the foregoing is that the graph of cycles versus n will be approximately a straight line. If your experimental plan is based on this assumption, you should gather enough data that if the assumption were wrong, you would have some chance of discovering that. This tells something about the absolute minimum number of different values of n you should use: enough that there is some possibility for the corresponding points to not fall along a line.
To make the alternative version of the program, where the loop is closed not with a jump to a label but rather a jump using a register, you will need to edit the C program (or rather, a copy of it with a different name) and recompile. The loop in the C program ends with the statement
goto loop;This is what gets translated by the C compiler into a jump to a label. We want to change this to the C equivalent of jumping to an address specified by a register, which is using goto with the target pointed to by a variable. This can't be done in standard C, but can using a non-standard extension to C that the gcc compiler provides. To learn about this non-standard extension, you'll need to look at the gcc documentation. You can do this in the KDE Help System. To use that system, click on the icon of a life ring, which is in the main bar at the bottom of the screen. Then select (from the contents bar on the right of the window) "Browse info pages", "Programming", "gcc", "Extensions to the C Language", and finally "Labels as Values." This should have the information you need. If you have trouble understanding it, be sure to ask. One likely trouble spot is that
foo
is used in the example
without explanation. The idea is that this is a label appearing in
part of the example program that isn't shown. That is, some statement
would be labeled with foo:
(note the colon here). Then
after assigning &&foo
to ptr
, the
statement goto *ptr;
has the same effect as goto
foo;
would. Be sure you initialize your pointer variable once
before entering the loop, rather than setting it each time through the
loop. Be sure to ask for help if you need it, either with operating
the KDE Help program or with understanding how to modify the C
program.
You can compile and run the new version of the program the same way as
the original. You should also compare the two assembly language
versions (the .s files) to make sure that the loop is really the same
except for the jump instruction at the end, and possibly which
specific register has been assigned to hold each variable. The jump
instruction in the IA-32 assembly language is jmp
. This
is used for both jumping to a label (MIPS's j
) and
jumping via a register (MIPS's jr
). The difference is
that if you are jumping to a label, the label appears after the
jmp
, while if you are jumping to a location specified by
a register, the jmp
is followed by a *
and
then the register name. The labels generated by gcc start with .L,
and the register names all start with % (which is like $ for MIPS).
If you have trouble identifying what lines of assembly language are
the loop, or understanding the difference, please ask for help.
While you are looking at the assembly language versions, you should also count how many instructions are executed each time through the loop. By putting this together with your cycle count data, you should be able to calculate the CPI for each version of the loop. By this I mean not the overall CPI of the full program (or even of that portion that is between the starting and stopping of the timer) but rather the CPI of the looping itself, based on just how many instructions are done per additional trip around the loop and how many extra cycles that takes.
Note, by the way, that neither version of the program is good C programming style, and there certainly is no reason why a programmer would use the second version, where a variable (or register) is used to jump to the fixed address at the top of the loop. However, these programs let us try out in a simple context the two forms of jump instructions that real programs do use. For example, jumping to an address contained in a register is an important part of how modern object-oriented languages are implemented.
Graphs can be useful for showing your data, but you need to be careful about some common pitfalls, particularly if you use a computer to generate the graphs rather than doing it by hand:
In your lab report, you should show what the difference in assembly language code is, explain how you determined the cycles-per-iteration and CPI of each version on each processor, and provide your results. You should also try to draw some bigger picture conclusions from these results. What can you say about the pipelines of the two processors? How coping with control hazards is influenced by the kind of jump?
Instructor: Max Hailperin