Because there is an odd number of students, exactly one group will have to have one or three members rather than two. Please try to find your own partners; if you want to be in the odd-sized group (working alone, or with two others), or if you can't find a partner, see me. Each group should submit a single jointly authored lab report.
It is important that you complete the Pre-Lab portion of this assignment before you start the In Lab portion.
Auto-prefetch can decrease the miss rate if the prefetched blocks are used, but may also increase the traffic between main memory and the cache if the blocks go unused. In fact, if the prefetched blocks go unused and bump other blocks out of the cache that would have been used, the miss rate can actually go up rather than down. A prefetch strategy is highly effective if it causes a substantial decrease in the miss rate at the cost of only a minor increase in memory traffic; for example, the miss rate might go down by 50% while the memory traffic goes up by only 1%. In the other scenarios, we can term the prefetch strategy ineffective (if the miss rate doesn't go down by much), costly (if the miss rate is substantially reduced, but at a high price in memory traffic), or counterproductive (if the miss rate and the traffic both go up).
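To make that arithmetic concrete, here is a one-line sketch (using made-up counts, not real dineroIII output) of how the two percentages are computed from a no-prefetch run and an auto-prefetch run:

# Made-up counts: 12000 demand misses without prefetch vs. 6000 with,
# and 480000 units of memory traffic without vs. 484800 with.
awk 'BEGIN { printf "misses %+.0f%%, traffic %+.1f%%\n", 100*(6000-12000)/12000, 100*(484800-480000)/480000 }'
# prints: misses -50%, traffic +1.0%  -- the highly effective case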
The question you are to answer is this: how do cache block size and total cache size affect the cost and effectiveness of auto-prefetch? That is, consider a two-dimensional space of possible caches, where one dimension is block size and the other is total cache size. If you were to map out, for each point in that space, the percent increase in memory traffic that auto-prefetching would cause, what would the map look like? Now suppose you mapped out, for each point in the space, the percent reduction (or increase!) in misses that auto-prefetching would cause; what does that map look like? Putting these together, in what regions of the space is auto-prefetching cheap and highly effective? In which regions is it so ineffective as to be pointless? In which is it outright harmful?
Before starting the lab, you should decide and write down the following two things: which particular form of auto-prefetch you will evaluate, and what ranges of cache sizes and block sizes you will sweep through in your experiments.
You will use the dineroIII trace-driven cache simulator. A trace-driven cache simulator reads in a trace, that is, a record of the memory references from a particular execution (with each address tagged as a read, a write, or an instruction fetch). The simulator is parameterizable, so the same trace can be tried with a variety of different settings for block size, associativity, cache size, prefetch strategy, etc. At the end of the simulation, a variety of statistics are produced, including the cache's miss rate and the amount of memory traffic. DineroIII is a high-quality, flexible simulator that is often used in this kind of work.
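If you'd like to see what a trace looks like, you can peek at its first few lines without permanently decompressing the file. In the din trace format, each line pairs a reference-type label (0 for a data read, 1 for a data write, 2 for an instruction fetch) with a hexadecimal address:

gzcat ~max/MC48/tex.din.gz | head

which prints lines along these lines (the addresses here are invented for illustration):

2 41a4
0 7ff0
2 41a8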
We have available on-line traces from three benchmarks (tex, gcc, and spice). Each of the traces contains about a million memory references. The traces reside in the directory ~max/MC48, and have filenames tex.din.gz, cc1.din.gz, and spice.din.gz respectively. The .gz suffix indicates that they are compressed using the gzip program (to save disk space); to decompress them at the same time you run dineroIII, you can use the command

gzcat ~max/MC48/name.din.gz | dineroIII options

where name is the name of the trace you want to use and options is a list of dineroIII parameters as specified in the accompanying manual page for dineroIII. Note that at a bare minimum you need to specify the -b option and either the -u option or the -i and -d options. The -f option will also be particularly relevant to your investigation. Remember that the block size for the -b option is measured in bytes, and so needs to be a multiple of four.
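For instance (the particular sizes here are purely illustrative, not a recommendation), the following pair of commands simulates separate 16K-byte instruction and data caches with 16-byte blocks on the gcc trace, first without prefetching (-fd) and then with one form of auto-prefetch (-fa):

gzcat ~max/MC48/cc1.din.gz | dineroIII -b16 -i16k -d16k -fd
gzcat ~max/MC48/cc1.din.gz | dineroIII -b16 -i16k -d16k -fa

Comparing the miss and traffic statistics from the two runs gives you one data point in your two-dimensional space.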
You should observe the improvement prefetching makes in the number of misses, specifically the so-called ``demand'' misses, i.e., the misses on those blocks actually requested by the CPU, rather than by the prefetching. You should also observe the increase in memory traffic caused by unnecessary prefetches. Further, it would be interesting to see whether instruction and data references are equally suited to auto-prefetching. (DineroIII prints out separate statistics in each category as well as totals.)
You may find it convenient to automate the data collection using a feature of the C shell (csh) known as ``foreach.'' In particular, you can set up a shell script that uses nested foreach loops to run systematically through a collection of cache sizes, for each of them through a collection of block sizes, for each of those through a collection of traces, and for each of those through two prefetching strategies (don't prefetch, or some particular form of auto-prefetch). The exact nature of this will depend on what you are trying to do. The script below is just an example; you should work from your own experimental design choices (from the pre-lab) regarding what ranges of values to iterate over.
# For each cache size, block size, benchmark trace, and fetch policy
# (d = no prefetch, a = auto-prefetch), run dineroIII on the compressed
# trace and save its output in a file named bench.size.block.prefetch.
foreach size (8k 16k 32k 64k 128k)
  foreach block (4 8 16 32)
    foreach bench (cc1 spice tex)
      foreach prefetch (d a)
        echo $bench $size $block $prefetch
        gzcat ~max/MC48/"$bench".din.gz | dineroIII -b$block -d$size -i$size \
          -f$prefetch > "$bench"."$size"."$block"."$prefetch"
        echo $bench $size $block $prefetch done
      end
    end
  end
end
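To run the script, put the loops in a file and hand that file to the C shell (the script name here is just an example):

csh run-prefetch-experiments.csh

Each simulation's output then lands in its own file (cc1.8k.4.d, cc1.8k.4.a, and so on), and the parameter lines echoed before and after each run let you watch the script's progress.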
Instructor: Max Hailperin