This lab focuses mostly on designing an experiment and analyzing the results, rather than on programming or the actual data-gathering for the experiment. As such, I would like you to work in pairs, so as to have someone with whom to talk over the design and analysis. Please try to avoid using the same partner as in Lab 1; getting to know and work with new people is valuable. Given that we have an odd number of students, please let me know if you'd like to work alone or in a group of three. I'll intervene in the group formation process if necessary.
You will need to write a clear report that explains what you did (so that someone else could replicate your experiment), provides the factual results of your experiment, and discusses what interpretation you place on those results. Your report should be aimed at an audience with background similar to our course, but without specific knowledge of this lab assignment.
All experimentation throughout the lab should be carried out on two computers with different processors, in order to allow comparison. I will not always explicitly mention this in the remainder of the lab assignment; you just need to remember to do everything on both computers. In the OHS 326 lab, there are two kinds of computers: those in beige cases and those in black. Use one of each. In order to find out what model of processor each has, as well as its clock rate, you should give the following command. Like all commands throughout this assignment, you would do this in a terminal (shell) window:
fgrep name /proc/cpuinfo
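The exact output depends on the machine, but you should see one "model name" line per processor core, along the lines of this invented example:

model name	: Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz

On many processors the nominal clock rate appears at the end of that string.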
The first thing you will need to do is measure the performance of a program I am providing. You will need to download two files: counter.cpp, which is the actual C++ source code of the program, and cyccounter.h, which is a helper file that contains the necessary definitions for using the processor's cycle counter to get exact timings in units of clock cycles. To compile the program, use the following commands:
g++ -O -S counter.cpp
g++ -o counter counter.s
The first of these two commands uses the C++ compiler, telling it to stop after translating the program into assembly language (that is the -S option) and to take the trouble to generate efficient code (that is the -O option). The resulting assembly code is left in the file counter.s. The second command assembles that program, producing an executable machine language version in the file counter. Normally this two-step process wouldn't be necessary, but for the sake of your experimentation, it is valuable to be able to examine, and even possibly modify, the assembly language version.
You can run the program using a command like
./counter 1000 2
This will produce an output line containing two numbers. The first one should (in this case) be 1000, which is the result computed by the program. Frankly, it isn't very interesting, because the program is not carrying out a particularly useful computation. In this case, it found the result of 1000 by counting from 0 up to (but not including) 1000, and for each number that is not divisible by 2, adding 2 to a running total (which starts at 0). The numbers 1000 and 2 come from the command line and could be changed if you want to run a different number of iterations of the loop or use a different number as the divisor in the divisibility checks (and also as what to add into the running total). The more interesting result from the program is the second number it outputs; that is the number of clock cycles that the loop took to run.
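To make that description concrete, here is a hypothetical sketch of the computation being timed; the real counter.cpp (which also reads the cycle counter using cyccounter.h) may differ in names and details:

#include <iostream>
#include <cstdlib>

int main(int argc, char *argv[])
{
    long n = std::atol(argv[1]);   // number of trips through the loop (1000 above)
    long f = std::atol(argv[2]);   // the divisor, and also what gets added (2 above)
    long total = 0;                // the running total, printed as the first output number
    for (long i = 0; i < n; i++)
        if (i % f != 0)            // is i not divisible by f?
            total += f;            // if so, add f to the running total
    std::cout << total << std::endl;   // the real program also prints the cycle count
    return 0;
}

With n = 1000 and f = 2, the 500 odd values of i each contribute 2, giving the result 1000.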
If you run the same command repeatedly, the number of clock cycles may vary somewhat for two reasons. First, there may be small variations due to memory system effects. Second, there may be rare but larger variations if an interrupt occurs while the loop is running. In the worst case, the operating system might respond to an interrupt by choosing to switch processes and run one or more other programs for a fraction of a second, which would greatly increase the cycle count. (You probably will never see that.) To deal with this possibility, I recommend that you run each timing command at least half a dozen times and use the smallest cycle count reported by any of the runs. Ordinarily in experimental work it is conventional to use an average of multiple runs. However, if you have reason to believe there is some underlying true number of clock cycles the processor needs, and that all deviations are going to be in the upward direction, caused by adding in extraneous cycles doing other work, then it makes sense to use the minimum rather than the average.
The cycle counts you measure include the 1000 trips through the loop, but they also include some extra cycles spent on such matters as retrieving the cycle counter at the start and end, and initializing the loop counter to 0. Therefore, to get a more accurate handle on the number of cycles spent per trip through the loop, you should do a second measurement, replacing 1000 by 2000, and then take the difference of the two cycle counts. That is, you will be focusing on how many additional cycles are spent for 1000 additional trips through the loop. Since the fixed costs of starting and ending should be identical, this will give you a more pure measure of the cost per trip through the loop. To verify that you really do have a linear relationship between loop trips and clock cycles, you could do a third measurement with 3000 trips. Hopefully the increase in clock cycles from 2000 to 3000 will be about the same as from 1000 to 2000.
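In other words, using the minimum cycle counts from your repeated runs, the quantity of interest is

cycles per trip ≈ (cycle count for 2000 trips - cycle count for 1000 trips) / 1000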
Having found how many clock cycles the average additional trip through the loop takes, it would be useful to translate this into seconds by taking into account the clock rate. This will allow a fuller comparison of the two processors.
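As a made-up illustration of the arithmetic: a trip that takes 20 cycles on a 2 GHz processor takes 20 / (2 × 10^9) seconds, or 10 nanoseconds, since in general

seconds per trip = (cycles per trip) / (clock rate in Hz)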
You should also provide additional context for the number of clock cycles by counting the number of assembly language instructions executed on the average trip through the loop. If a lot of cycles are spent per trip through the loop, counting the instructions will let you see whether the issue is a high instruction count or a high CPI. (You can calculate the CPI by dividing the cycles you measured by your count of instructions.) To count the instructions, you'll need to look in counter.s for the loop. In my copy of the file, the loop looks as follows; because yours might be a bit different for some reason, you should look for similar code in your own file:
.L45:
	movl	-452(%ebp), %edx
	movl	%edx, %eax
	sarl	$31, %edx
	idivl	%ebx
	testl	%edx, %edx
	je	.L149
	addl	%ebx, %esi
.L149:
	addl	$1, -452(%ebp)
	cmpl	%edi, -452(%ebp)
	jne	.L45
You don't really need to understand IA-32 assembly language for this lab, although being able to puzzle out what some of the instructions are doing may prove useful; I'd be happy to work with you on that. For the moment, all you need to do is count how many instructions there are in the loop. Be sure you don't count the two labels, which aren't instructions. Also, you need to take into account that half the trips through the loop (the even-numbered ones) will be skipping the addl instruction that precedes label .L149. (The je instruction is the conditional branch.)
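Putting those pieces together: if you count k instructions in the loop body (including the conditional addl), then half the trips execute all k instructions and the other half execute k - 1, so

average instructions per trip = k - 0.5
CPI ≈ (measured cycles per trip) / (k - 0.5)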
If your results from the first part of the lab are like mine, you probably found that each trip through the loop is taking a considerable number of clock cycles, even relative to the number of instructions. (Recall that modern processors ought to be able to execute multiple instructions per cycle.) Thus, some further experimentation seems to be called for to determine what accounts for the rather lackluster performance. This is where you'll get to do your own experimental design, rather than just repeating an experiment that I designed.
Looking at the code shown above for the loop, I spot two likely contenders for performance difficulties. One is the integer division instruction, idivl, which is used for the divisibility test. If the C++ program were specialized to always check for divisibility by 2, it could use a more efficient approach. But because it checks for divisibility by whatever number is given on the command line (the variable f in the C++ code), the division instruction is needed. Integer division is a difficult operation, and moreover one that occurs so seldom in real programs that processor designers have little incentive to spend resources making it run quickly. So this instruction could well be a major performance bottleneck.
The other potential performance issue I see is the conditional branch, which arranges that when the division's remainder is equal to zero, the step of increasing the running total is skipped. As we've learned in our study of pipelining, conditional branches are difficult to execute efficiently.
Your goal is to design and carry out one or more additional experiments to illuminate what role each of these possibilities plays in the loop's performance on each of the two processors. How many cycles are attributable to the division? To the conditional branching? Is there any positive or negative synergy between the two, or do their incremental costs just add together?
In order to gain this insight, you will likely want to do timing runs with modified versions of the program. You can either modify at the C++ source level and recompile or modify the assembly version and just go through the second step of the compilation process (the one that does the assembling). Even if you work at the C++ source level, you should examine the looping portion of the assembly code to make sure you understand what the impact of your change is. I would be happy to consult with you as you plan what to do and as you work to understand the code and performance results.
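As one concrete illustration (only a sketch, and one that assumes the loop resembles the hypothetical C++ shown earlier): to isolate the cost of the idivl instruction when the divisor is 2, you could replace the divisibility test with a bit test that involves no division but branches in exactly the same alternating pattern:

    for (long i = 0; i < n; i++)
        if ((i & 1) != 0)          // same taken/not-taken pattern as i % 2 != 0, but no idivl needed
            total += f;

Comparing this version's cycles per trip with the original's would suggest how many of the cycles are attributable to the division; experiments in a similar before-and-after spirit can address the conditional branch.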
Another kind of experiment you might find interesting, other than modifying the program, would be to use divisors other than 2. This will have two impacts. One is that it changes what fraction of the trips through the loop take the conditional branch around the addition versus what fraction go through the addition. The other less obvious, but perhaps more interesting, effect is that the processor's attempts at branch prediction may work better for some divisors than for others. The increasing proportion of trips through the loop that do the addition should be a nice smooth change as you increase the divisor, whereas the branch prediction might suddenly break down at some point. Like other performance issues, this might also differ between the two processors.
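For instance, you could repeat the timing procedure with commands such as

./counter 1000 3
./counter 1000 7
./counter 1000 10

and see how the cycles per trip change as the divisor grows, on each of the two processors.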