MC48 Lab 3: Control Hazards and Pipeline Performance (Fall 1998)

Due: November 19, 1998

In this lab, you and a partner of your choice will predict how control hazards affect the performance difference between two existing pipelined processors, and will experimentally test that prediction. You will then estimate the impact of a hypothetical further change in pipeline structure.

Because there are an odd number of students in the class, one group will need to have three students in it. If you would like to nominate your group to be the group of three, please let me know. Also, if you have trouble finding a partner, let me know.

Background

The five-stage MIPS pipeline described in our textbook corresponds to the original MIPS processors, the R2000 and R3000. In those processors, the handling of branches is moved up to the ID stage (as described in section 6.6), so that only one instruction following the branch is in the pipeline when the decision whether to branch is made. (Unconditional control transfer instructions, such as jump, are handled the same way, since again one following instruction is already in the pipeline before the jump instruction is even recognized as such.) Rather than using hardware to flush the extra instruction from the pipeline, the MIPS processors take the "delayed branch" approach described in an elaboration on pages 502 and 503. That is, the instruction after the branch (or jump, jump and link, etc.) is executed first in any case, before control transfers to the instruction being branched to.
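
As a concrete illustration of a delayed branch (the particular instructions and registers here are made up for the example), consider this MIPS fragment:

        beq  $t0, $zero, skip    # branch condition is resolved in the ID stage
        addi $t1, $t1, 1         # delay slot: executed whether or not the branch is taken
        sub  $t2, $t2, $t1       # reached only if the branch falls through
skip:
        lw   $t3, 0($sp)         # branch target

The addi in the delay slot runs in either case; only after it does control reach either the sub (on fall-through) or the lw (on a taken branch).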

Because of this, the CPI of the R2000/R3000 is very nearly 1 even in programs that do a lot of branching, so long as the program doesn't use floating point instructions (which we haven't talked about) and so long as we ignore memory-system effects (which we will until the next chapter and the next lab). Of course, some of the instructions executed at that CPI of 1 may be nop instructions, if nothing better can be found to do in the delay slots after control transfers.

The newer R4000 version of the MIPS processor has a different pipeline structure. Rather than a five-stage pipeline, with branching handled in the second stage, the R4000 has an eight-stage pipeline, with branching handled in the fourth stage. (The reason for this is that each stage does less, so that the clock rate can be higher.)

In order to allow programs from the R2000/R3000 to run on the R4000 without change, a single instruction following each branch or jump is still executed in any case. However, this is no longer enough to take care of the full control hazard. Rather than conditionally flushing extra instructions out of the pipeline, the R4000 designers chose a simpler approach: the pipeline is simply stalled (after the one "delay slot" instruction) until it is known where instructions should be fetched from. Thus each branch (or jump) causes extra stall cycles, even if the branch winds up falling through rather than being taken. As a result, the CPI of the R4000 is increased. How much worse the CPI is depends on how much branching or jumping the program does.

Pre-lab

First, figure out how many stall cycles each branch or jump introduces. Second, make a prediction (as an explicit mathematical formula) for how the increase in CPI relates to the fraction of instructions executed that are branches or jumps. (For the increase in CPI, you can use the ratio of the R4000's CPI to the R3000's, or you can equivalently think of the ratio of the R4000's total number of cycles for a program to that of the R3000, since both are executing the same program.)
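
For reference, one possible shape for such a prediction (a sketch to check your own reasoning against, not necessarily the required answer): writing s for the number of stall cycles per branch or jump that you determined, f for the fraction of executed instructions that are branches or jumps, and N for the total number of instructions executed,

\[
  \frac{\text{cycles}_{\text{R4000}}}{\text{cycles}_{\text{R3000}}}
  \approx \frac{N (1 + s f)}{N \cdot 1} = 1 + s f
\]

using the fact that the R3000's CPI is very nearly 1 under our assumptions.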

In lab

In order to test your prediction, we need data for a variety of different programs with different fractions of control transfer instructions. For each we will need to know that fraction, as well as the number of cycles the program takes on an R3000 and on an R4000. In each case, we want to count only cycles consumed by the processor pipeline, excluding extra ones used by the memory system (since we don't understand that yet). Further, our programs should all use few or no floating point instructions, since we haven't talked about their performance impact.

Rather than testing several separate programs, we can do a slight variation on this experiment more easily. Namely, we can run one largish program, composed of multiple procedures, and compare data from each of several of its procedures. Since procedures differ from one another, we will still get a variety of densities of control transfer instructions, but with fewer logistical hassles, since only one program needs to be run. We'll take this approach, running the troff program (a text formatter) on a sample text input file and gathering statistics from it that we can analyze separately for each procedure.

You will need to do the experimentation on one of the Silicon Graphics Indy machines, or at least remotely logged into one over the network. The first phase of the experiment consists of making an instrumented version of the troff program and running it on the sample input to produce statistical output about how often each portion of the program is executed. Because the instrumented version of the program is rather large, and you won't need it after the experiment is over, I suggest you do this work in a temporary directory (on the Indy's own disk drive) rather than in your home directory (on the network file server). You can do this as follows (in a terminal/shell window):

mkdir /tmp/troff$$
cd /tmp/troff$$
ssrun -ideal /usr/local/bin/troff ~max/MC48/sample.text >/dev/null
The first line makes the temporary directory and the second changes your current directory to it. ($$ expands to your shell's process ID, which makes the directory name unique to you; run all of the commands in this lab, including the cleanup at the end, from the same shell so that $$ expands to the same number each time.) The third line does the actual work of instrumenting the program and running it on the sample text.

At this point you have a file called something like troff.ideal.2324. The number on the end will likely be different. In order to analyze this data, you will use the prof program. To get an overall analysis of the whole program, with the R3000's pipeline assumed, you can do

prof -r3000 -op troff.ideal.* >troff-r3000
This puts the output into the file troff-r3000. You can do the same thing with 4000 substituted for 3000 (in both places) to get a file describing how the program would run on the R4000.
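That is, for the R4000:

prof -r4000 -op troff.ideal.* >troff-r4000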

If you look in these files, you will see that one major feature is a list of the procedures, starting with the ones that contribute the most to the program's total number of cycles and working down to the ones that are very minor. Pick a handful of procedures to focus on, drawing from at or near the top of the list, since these are the ones that run for a substantial number of cycles. For each one, rerun prof, telling it to focus on just that one procedure. For example, if you pick token::next, you would use

prof -r3000 -O token::next -op troff.ideal.* >troff-r3000-token::next
in order to get a file, troff-r3000-token::next, analyzing the behavior of that one procedure on the R3000 pipeline. Do this for each of the procedures you selected, on each of the two processors; the shell loop sketched below can automate the repetition.
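
If you are comfortable with shell loops, a sketch like the following can generate all of these files at once. (The second procedure name here is a placeholder; substitute the procedures you actually picked.)

for cpu in r3000 r4000
do
    for proc in token::next some_other_proc    # placeholders: your procedures
    do
        prof -$cpu -O $proc -op troff.ideal.* >troff-$cpu-$proc
    done
done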
At this point, you will have a bunch of files of analysis output from prof, each with a name starting with troff-r. These files are worth hanging onto as the basis from which you are going to draw conclusions, including testing your pre-lab prediction. However, the other files in the temporary directory, in particular the instrumented version of the troff program and its libraries, are large and now useless. Therefore, you should copy the output files from prof into your own directory and delete the temporary directory with its contents. You can do this as follows:

cd
cp /tmp/troff$$/troff-r* .
rm -rf /tmp/troff$$

The first line changes directory back to your home directory. (You could instead change to some subdirectory of your home directory.) The second line copies the output files from the temporary directory to the current (home) directory (be sure to include the space and period at the end of this command), and the final line removes the temporary directory and its contents (including the instrumented program).

Post-lab

Now you are done doing the work that needs to be done on an Indy. Using the files of data you have produced, you should test your prediction about the relationship between the density of control transfer instructions and the increase in CPI. From the top of the files, you can get the number of cycles, the number of instructions, and the CPI. From later in the files, under the heading "dynamic instruction counts," you can find the number of instructions of each kind that were executed. (If you have any question about which ones are branches or jumps of some sort, ask. Simple rule: if the mnemonic starts with a b or a j, it is a branch or jump.) Please ignore the column that purports to be a percentage; the percentages are incorrect. Instead, add up the absolute counts and divide the sum yourself by the total number of instructions from the top of the file. Then test how well your data fits your prediction.
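
If you would like to automate the adding up, a sketch along the following lines may help. The field positions are an assumption about the layout of the "dynamic instruction counts" section (mnemonic in the first column, count in the second), and the pattern assumes no unrelated lines in the file begin with a matching word, so check both against your actual files:

awk '$1 ~ /^[bj]/ { sum += $2 } END { print sum }' troff-r3000-token::next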

Finally, it is worth considering whether some of the R4000's penalty for its greater pipelining could be avoided. Rather than unconditionally stalling the pipeline, the instructions after a branch could be tentatively allowed to proceed, and then later flushed out (except the first one) if the branch turns out to be taken. (This approach is described in section 6.6.) Your data files indicate how many of the branches are taken and how many are untaken. (You'll need to combine the forward and backward branches.) Using this data, how much better might the overall troff program's CPI be on a modified R4000 that used the flush-if-taken approach?
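
One way to set up this estimate (a sketch, not the required method): let s again be the cycles per control transfer lost to stalling, which on the modified machine become flush cycles paid only when the transfer is taken, and let f_taken be the fraction of executed instructions that are taken branches or jumps (jumps, being unconditional, always count as taken). Then, to a first approximation,

\[
  \text{CPI}_{\text{modified}} \approx 1 + s f_{\text{taken}}
  \qquad \text{versus} \qquad
  \text{CPI}_{\text{R4000}} \approx 1 + s f
\]

so the savings come entirely from the untaken branches, whose tentatively fetched instructions never need to be flushed.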

Instructor: Max Hailperin