This lab focuses mostly on designing an experiment and analyzing the results, rather than on programming or the actual data-gathering for the experiment. As such, I would like you to work in pairs, so as to have someone with whom to talk over the design and analysis. Please try to avoid using the same partner as in Lab 1; getting to know and work with new people is valuable. Given that we have an odd number of students, please let me know if you'd like to work alone or in a group of three. I'll intervene in the group formation process if necessary.
You will need to write a clear report that explains what you did (so that someone else could replicate your experiment), provides the factual results of your experiment, and discusses what interpretation you place on those results. Your report should be aimed at an audience with background similar to our course, but without specific knowledge of this lab assignment.
All experimentation throughout the lab should be carried out on two computers with different processors, in order to allow comparison. I will not always explicitly mention this in the remainder of the lab assignment; you just need to remember to do everything on both computers. In the OHS 326 lab, there are two kinds of computers: those in beige cases and those in black. Use one of each. In order to find out what model of processor each has, as well as its clock rate, you should give the following command. Like all commands throughout this assignment, you would do this in a terminal (shell) window:
fgrep name /proc/cpuinfo
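The exact output depends on the machine, but you should see one "model name" line per processor core, along the lines of this invented example:

model name	: Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz

On many processors the nominal clock rate appears at the end of that string.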
The first thing you will need to do is measure the performance of a program I am providing. You will need to download two files: counter.cpp, which is the actual C++ source code of the program, and cyccounter.h, which is a helper file that contains the necessary definitions for using the processor's cycle counter to get exact timings in units of clock cycles. To compile the program, use the following commands:
g++ -O -S counter.cpp
g++ -o counter counter.s
The first of these two commands uses the C++ compiler, telling it to stop after translating the program into assembly language (that is the -S option) and to take the trouble to generate efficient code (that is the -O option). The resulting assembly code is left in the file counter.s. The second command assembles that program, producing an executable machine language version in the file counter. Normally this two-step process wouldn't be necessary, but for the sake of your experimentation, it is valuable to be able to examine, and even possibly modify, the assembly language version.
You can run the program using a command like
./counter 1000 2
This will produce an output line containing two numbers. The first one should (in this case) be 1000, which is the result computed by the program. Frankly, it isn't very interesting, because the program is not carrying out a particularly useful computation. In this case, it found the result of 1000 by counting from 0 up to (but not including) 1000, and for each number that is not divisible by 2, adding 2 to a running total (which starts at 0). The numbers 1000 and 2 come from the command line and could be changed if you want to run a different number of iterations of the loop or use a different number as the divisor in the divisibility checks (and also as what to add into the running total). The more interesting result from the program is the second number it outputs; that is the number of clock cycles that the loop took to run.
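To make that description concrete, here is a hypothetical sketch of the computation being timed; the real counter.cpp (which also reads the cycle counter using cyccounter.h) may differ in names and details:

#include <iostream>
#include <cstdlib>

int main(int argc, char *argv[])
{
    long n = std::atol(argv[1]);   // number of trips through the loop (1000 above)
    long f = std::atol(argv[2]);   // the divisor, and also what gets added (2 above)
    long total = 0;                // the running total, printed as the first output number
    for (long i = 0; i < n; i++)
        if (i % f != 0)            // is i not divisible by f?
            total += f;            // if so, add f to the running total
    std::cout << total << std::endl;   // the real program also prints the cycle count
    return 0;
}

With n = 1000 and f = 2, the 500 odd values of i each contribute 2, giving the result 1000.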
If you run the same command repeatedly, the number of clock cycles may vary somewhat for two reasons. First, there may be small variations due to memory system effects. Second, there may be rare but larger variations if an interrupt occurs while the loop is running. In the worst case, the operating system might respond to an interrupt by choosing to switch processes and run one or more other programs for a fraction of a second, which would greatly increase the cycle count. (You probably will never see that.) To deal with this possibility, I recommend that you run each timing command at least half a dozen times and use the smallest cycle count reported by any of the runs. Ordinarily in experimental work it is conventional to use an average of multiple runs. However, if you have reason to believe there is some underlying true number of clock cycles the processor needs, and that all deviations are going to be in the upward direction, caused by adding in extraneous cycles doing other work, then it makes sense to use the minimum rather than the average.
The cycle counts you measure include the 1000 trips through the loop, but they also include some extra cycles spent on such matters as retrieving the cycle counter at the start and end, and initializing the loop counter to 0. Therefore, to get a more accurate handle on the number of cycles spent per trip through the loop, you should do a second measurement, replacing 1000 by 2000, and then take the difference of the two cycle counts. That is, you will be focusing on how many additional cycles are spent for 1000 additional trips through the loop. Since the fixed costs of starting and ending should be identical, this will give you a more pure measure of the cost per trip through the loop. To verify that you really do have a linear relationship between loop trips and clock cycles, you could do a third measurement with 3000 trips. Hopefully the increase in clock cycles from 2000 to 3000 will be about the same as from 1000 to 2000.
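In other words, using the minimum cycle counts from your repeated runs, the quantity of interest is

cycles per trip ≈ (cycle count for 2000 trips - cycle count for 1000 trips) / 1000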
Having found how many clock cycles the average additional trip through the loop takes, it would be useful to translate this into seconds by taking into account the clock rate. This will allow a fuller comparison of the two processors.
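As a made-up illustration of the arithmetic: a trip that takes 20 cycles on a 2 GHz processor takes 20 / (2 × 10^9) seconds, or 10 nanoseconds, since in general

seconds per trip = (cycles per trip) / (clock rate in Hz)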
You should also provide additional context for the number of clock cycles by counting the number of assembly language instructions executed on the average trip through the loop. If a lot of cycles are spent per trip through the loop, counting the instructions will let you see whether the issue is a high instruction count or a high CPI. (You can calculate the CPI by dividing the cycles you measured by your count of instructions.) To count the instructions, you'll need to look in counter.s for the loop. In my copy of the file, the loop looks as follows; because yours might be a bit different for some reason, you should look for similar code in your own file:
.L45:
	movl	-452(%ebp), %edx
	movl	%edx, %eax
	sarl	$31, %edx
	idivl	%ebx
	testl	%edx, %edx
	je	.L149
	addl	%ebx, %esi
.L149:
	addl	$1, -452(%ebp)
	cmpl	%edi, -452(%ebp)
	jne	.L45
You don't really need to understand IA-32 assembly language for this lab, although being able to puzzle out what some of the instructions are doing may prove useful; I'd be happy to work with you on that. For the moment, all you need to do is count how many instructions there are in the loop. Be sure you don't count the two labels, which aren't instructions. Also, you need to take into account that half the trips through the loop (the even-numbered ones) will be skipping the addl instruction that precedes label .L149. (The je instruction is the conditional branch.)
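Putting those pieces together: if you count k instructions in the loop body (including the conditional addl), then half the trips execute all k instructions and the other half execute k - 1, so

average instructions per trip = k - 0.5
CPI ≈ (measured cycles per trip) / (k - 0.5)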
If your results from the first part of the lab are like mine, you probably found that each trip through the loop is taking a considerable number of clock cycles, even relative to the number of instructions. (Recall that modern processors ought to be able to execute multiple instructions per cycle.) Thus, some further experimentation seems to be called for to determine what accounts for the rather lackluster performance. This is where you'll get to do your own experimental design, rather than just repeating an experiment that I designed.
Looking at the code shown above for the loop, I spot two likely contenders for performance difficulties. One is the integer division instruction, idivl, which is used for the divisibility test. If the C++ program were specialized to always check for divisibility by 2, it could use a more efficient approach. But because it checks for divisibility by whatever number is given on the command line (the variable f in the C++ code), the division instruction is needed. Integer division is a difficult operation, and moreover one that occurs so seldom in real programs that processor designers have little incentive to spend resources making it run quickly. So this instruction could well be a major performance bottleneck.
The other potential performance issue I see is the conditional branch, which arranges that when the division's remainder is equal to zero, the step of increasing the running total is skipped. As we've learned in our study of pipelining, conditional branches are difficult to execute efficiently.
Your goal is to design and carry out one or more additional experiments to illuminate what role each of these possibilities plays in the loop's performance on each of the two processors. How many cycles are attributable to the division? To the conditional branching? Is there any positive or negative synergy between the two, or do their incremental costs just add together?
In order to gain this insight, you will likely want to do timing runs with modified versions of the program. You can either modify at the C++ source level and recompile or modify the assembly version and just go through the second step of the compilation process (the one that does the assembling). Even if you work at the C++ source level, you should examine the looping portion of the assembly code to make sure you understand what the impact of your change is. I would be happy to consult with you as you plan what to do and as you work to understand the code and performance results.
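As one concrete illustration (only a sketch, and one that assumes the loop resembles the hypothetical C++ shown earlier): to isolate the cost of the idivl instruction when the divisor is 2, you could replace the divisibility test with a bit test that involves no division but branches in exactly the same alternating pattern:

    for (long i = 0; i < n; i++)
        if ((i & 1) != 0)          // same taken/not-taken pattern as i % 2 != 0, but no idivl needed
            total += f;

Comparing this version's cycles per trip with the original's would suggest how many of the cycles are attributable to the division; experiments in a similar before-and-after spirit can address the conditional branch.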
Another kind of experiment you might find interesting, other than modifying the program, would be to use divisors other than 2. This will have two impacts. One is that it changes what fraction of the trips through the loop take the conditional branch around the addition versus what fraction go through the addition. The other less obvious, but perhaps more interesting, effect is that the processor's attempts at branch prediction may work better for some divisors than for others. The increasing proportion of trips through the loop that do the addition should be a nice smooth change as you increase the divisor, whereas the branch prediction might suddenly break down at some point. Like other performance issues, this might also differ between the two processors.
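For instance, you could repeat the timing procedure with commands such as

./counter 1000 3
./counter 1000 7
./counter 1000 10

and see how the cycles per trip change as the divisor grows, on each of the two processors.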