MCS-378 Lab 2: Fixing sched_yield, Fall 1999

Due: October 20, 1999

In this lab, we will investigate a bug that has been reported in Linux's scheduler, and come up with a patch to the Linux kernel that solves the problem. Of course, each group might come up with a different approach to solving the problem. We'll submit the best solution we come up with to the Linux kernel mailing list for further testing by others and potential inclusion in the next release of Linux.

This is a challenging assignment, and I fully expect that in addition to working together as a team, you will need to consult with me. Don't be shy.

Lab facilities

Each team will have an older PC in the OHS 329 lab which will be for the exclusive use of that team, and for which they will have the root password. This will allow you to experiment with kernel modifications and with use of the SCHED_FIFO scheduling policy. Note that we will be using these same machines in the next lab (with permuted group membership), so it would be desirable not to do anything irreversible to them. If you do manage to foul the machine up, be sure to 'fess up. We've got a special CD-ROM and boot disk which we can use to bring the machine back to a clean state; I'll show you this. So long as you don't make too much trouble for the rest of us, you can feel free to do whatever other experiments you want on these machines.

I will issue each team five diskettes. Two of these diskettes are write protected, have "Orig 2.2.12 boot" written on them, and contain identical copies of the Linux 2.2.12 kernel, suitable for booting any of our three PCs. (You put the floppy in the drive and turn the machine on.) You should leave at least one of these two diskettes write-protected. The other three diskettes are MS-DOS formatted, and contain some uninteresting Windows setup files (from OmniTech), which you can delete. You can put one of these disks in an already booted Linux machine, or equally well in one of the normal lab machines, and mount it using the command

    mount /mnt/floppy

At this point, the diskette is available to you as the /mnt/floppy directory. When you are done reading or writing it, give the following command before you physically eject the diskette:

    umount /mnt/floppy

The experimental machines are not networked, so you will need to mount MS-DOS floppies like this in order to transport files back and forth between our experimental machines and the normal lab machines. That is how you will print files, maintain safe backup copies, etc. (Don't treat the hard drive of your experimental machine as a safe place to leave your work.)

Your job

The problem, in a nutshell, is that the sched_yield system call doesn't work. In particular, it has been reported that with SCHED_FIFO, the sched_yield system call doesn't seem to do anything at all. The situation with the SCHED_RR policy is presumably the same, since these two policies are handled nearly identically. The situation with SCHED_OTHER may well be different (since it is handled quite differently), but that doesn't mean it works correctly either.
  1. Read the sched_yield man page so that you understand what this system call is supposed to do. The man page may not suffice, so ask me if you need me to refer you to other reference sources or do some explaining.
  2. Write a simple test program that uses the SCHED_FIFO policy, forks a child process, and uses sched_yield to yield to that child. Have the parent and child both print some output, so you can see which order they run in. You will need to run this test program as root. With luck, you will be able to confirm the report that yielding isn't working.
  3. To see whether yielding is working in normal (SCHED_OTHER) processes, you will need a different testing strategy. Talk with me if you can't come up with one. (Don't get hung up on this; the main part of the lab is still to come.)
  4. Read the source code file /usr/src/linux/kernel/sched.c and understand how sched_yield is implemented. In particular, write an explanation of why it isn't working correctly. You can focus on the procedures sys_sched_yield (the system call itself), schedule (the actual scheduler, which chooses the next process, hopefully respecting any yielding that is being done), and goodness and prev_goodness (helpers used by schedule).
  5. Design a revision of sched.c that should make sched_yield work. If you find this too challenging to do in general, consider just making it work for SCHED_FIFO and SCHED_RR, or alternatively just for SCHED_OTHER. Remember, this may not be a simple matter of fixing a localized typo kind of bug in one place. You may need to scrap (and replace) substantial parts of the design of how yielding is done. Discuss your design with me.
  6. Try modifying sched.c in accordance with your redesign, building a new kernel incorporating your change, booting your new kernel, and testing. You should test both that the system seems to still be more or less working and also see whether you have indeed fixed sched_yield. To rebuild the kernel, put a blank disk in the floppy drive and, with your current directory being /usr/src/linux, give the command
    make bzdisk
    
    To boot your new kernel, leave that disk in the drive and give the command
    reboot
    
    Remember that kernel hacking is difficult, and debugging especially so. So don't feel bad if your modifications don't work at first. Be sure to consult with me.
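
As a starting point for step 2, here is one possible shape for the SCHED_FIFO test, offered as a rough sketch rather than the required solution. The names become_fifo, say, and run_fifo_yield_test are made up for illustration, the priority of 1 is an arbitrary choice, and the reasoning about output order assumes a single-processor machine like the ones in our lab.

```c
/* Sketch of a SCHED_FIFO sched_yield test. Must be run as root.
   All function names here are illustrative, not from any library. */
#include <assert.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Switch the calling process to SCHED_FIFO at the given priority.
   Returns 0 on success, -1 on failure (for example, if not root). */
int become_fifo(int priority)
{
    struct sched_param param;
    memset(&param, 0, sizeof param);
    param.sched_priority = priority;
    return sched_setscheduler(0, SCHED_FIFO, &param);
}

/* Print one line with a single write call, bypassing stdio buffering,
   so the order of lines reflects the order in which processes ran. */
void say(const char *who, int step)
{
    char buf[64];
    int n = snprintf(buf, sizeof buf, "%s: step %d\n", who, step);
    assert(n > 0 && (size_t)n < sizeof buf);
    if (write(STDOUT_FILENO, buf, (size_t)n) < 0) {
        /* nothing useful to do if the write fails */
    }
}

/* Parent and child each print a line and then yield, `rounds` times.
   Returns 0 if the test ran, -1 if it could not be set up. */
int run_fifo_yield_test(int rounds)
{
    pid_t pid;
    int i;

    if (become_fifo(1) != 0) {      /* priority 1 is an arbitrary choice */
        perror("sched_setscheduler");
        return -1;
    }
    pid = fork();                   /* child inherits policy and priority */
    if (pid < 0) {
        perror("fork");
        return -1;
    }
    for (i = 0; i < rounds; i++) {
        say(pid == 0 ? "child" : "parent", i);
        sched_yield();
    }
    if (pid == 0)
        _exit(0);
    waitpid(pid, NULL, 0);
    return 0;
}
```

To drive it, add a main that simply calls run_fifo_yield_test(3). If yielding behaves as the man page describes, the parent and child lines should alternate. If all of one process's lines appear before any of the other's, the yields had no visible effect.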

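For step 3, printing alone won't tell you much, because SCHED_OTHER processes are time-sliced anyway. One possible strategy, sketched below under the assumption of a single-processor machine, is to run two busy-looping children for the same wall-clock interval, have only one of them call sched_yield on every iteration, and compare how much CPU time each received. The names cpu_seconds_used, spin, and run_other_yield_test are made up for illustration, and you may well come up with a better approach.

```c
/* Sketch of a SCHED_OTHER sched_yield test comparing CPU time received.
   All function names here are illustrative, not from any library. */
#include <assert.h>
#include <sched.h>
#include <stdio.h>
#include <sys/resource.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* CPU seconds (user plus system) consumed so far by the calling process,
   or -1.0 if getrusage fails. */
double cpu_seconds_used(void)
{
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1.0;
    return (double)ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6
         + (double)ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6;
}

/* Busy-loop until `seconds` of wall-clock time have elapsed, optionally
   yielding on every iteration. Returns the CPU seconds consumed. */
double spin(int seconds, int yield_each_time)
{
    time_t deadline;
    volatile unsigned long counter = 0;

    assert(seconds >= 0);
    deadline = time(NULL) + seconds;
    while (time(NULL) < deadline) {
        counter++;
        if (yield_each_time)
            sched_yield();
    }
    return cpu_seconds_used();
}

/* Fork one yielding child and one non-yielding child, and let each
   report the CPU time it received. Returns 0 on success, -1 on error. */
int run_other_yield_test(int seconds)
{
    int i;

    fflush(stdout);                 /* avoid duplicated buffered output */
    for (i = 0; i < 2; i++) {
        pid_t pid = fork();
        if (pid < 0) {
            perror("fork");
            return -1;
        }
        if (pid == 0) {
            double used = spin(seconds, i == 0);
            printf("%s child used %.2f CPU seconds\n",
                   i == 0 ? "yielding" : "spinning", used);
            fflush(stdout);
            _exit(0);
        }
    }
    while (wait(NULL) > 0)
        ;                           /* reap both children */
    return 0;
}
```

Call run_other_yield_test(10) from main as an ordinary user. If yielding has an effect, you would expect the yielding child to report noticeably less CPU time than the spinning child. Roughly equal figures would suggest the yields are being ignored.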

Instructor: Max Hailperin