MCS-388 Lab 2: Adding Variables (Spring 2005)

March 14, 2005

Objective

In this lab, you'll turn your c1 compiler into c2, extending the source language to include variables, which can be declared, assigned to, and used in expressions. This will involve introducing two new mechanisms (which are unfortunately commonly lumped under the name "symbol table"): Of course, you'll also need to enrich the parser and scanner to deal with the new language features. Furthermore, we'll have to start thinking a little bit about conventions for saving and restoring registers to the stack.

The language extensions

An identifier name can start with any letter, and can then have any number of additional characters that can be letters or digits. The case of letters is significant.

The program is still of the same overall form, with main() and then a brace-enclosed list of zero or more statements, but now there are three possibilities for what a statement can be, rather than just one. In addition to the old print_int statements, now there can be declaration statements and assignment statements.

A declaration statement starts with the keyword int, then has one or more variable declarations separated by commas, and then a semicolon at the end. Each variable declaration can be either just an identifier (declaring the identifier to be a local integer variable, but leaving it uninitialized), or an identifier followed by an equal sign and an expression, in which case the variable is initialized just as if by the corresponding assignment statement. It is illegal to redeclare an already declared identifier. The scope of a variable begins immediately after its declaration and extends to the right brace that closes the block of statements. Note the preceding specification needs to be read carefully: "immediately after its declaration" is different than "immediately after its declaration statement" and also different from "immediately after the identifier's appearance in its declaration."

An assignment statement is a variable followed by an equal sign, an expression, and then a semicolon.

There is one new form of expression, namely a variable.

In the two preceding paragraphs, reference is made to variables. A variable is simply an identifier, from a grammatical perspective. However, it must have been declared to be a variable (otherwise, the program is erroneous). Thus, it is an error if an undeclared identifier appears either in an assignment or an expression.

Target code issues

We will use the $s registers for variables; there are nine of these, running from $s0 to $s8. (Programs with more than nine variables will be outside the scope of our compilers, for now.)

To compile an assignment statement (or initialization), for the time being we will take a simple approach, analogous to the way we are currently compiling print_int statements. Namely, the expression should be evaluated into an arbitrary register, and then the value moved from that register into the variable's register. (Be sure to deallocate the original register if it is a temporary.)

Note that when a variable is used as an expression, there is no reason to move its value into a temporary register; the variable's register can be directly used.

By convention, the $s registers are supposed to be left unchanged by any procedure invocation. This works to our advantage when we call the print_int procedure; we can count on it not to clobber the registers we're using to store main's variables in. However, we should also ensure that main plays by the same rules. This means we should be sure that if main's caller had any values it cared about in the $s registers when it called main, they are still (or again) there when main returns. The simplest way to arrange for this (which is good enough for now) is to change the prolog of main to allocate a 40-byte stack frame instead of a 4-byte one, and store not only $ra but also all nine of the $s registers into that frame. Similarly, the epilogue should load all these registers back from the stack frame, and deallocate all 40 bytes of it.

Error handling

There are now two new kinds of errors the compiler can detect, beyond syntax errors. Namely, an identifier can be redeclared, or it can be used without having been declared. For each of these cases, you should print out a simple, informative message. It should state exactly what the problem is: what identifier is at issue (i.e., its name), where in the program it occurs (its character position), and what is wrong with it (that it is being redeclared or used without declaration). An error message in implementation terms (like a null pointer exception) is not acceptable. No recovery is necessary: it is acceptable for your compiler to simply exit after reporting the error. However, you may also choose to recover and continue. If you do so, make sure that the compiler doesn't make a nuisance of itself, for example by complaining repeatedly about the same undeclared variable.

For reporting the position of the error, Java_cup and JLex have built in support for character counting. If you have a production like

var ::= ID:id
then in the action {: ... :}, you can not only use id to refer to the attribute of the ID, but also idleft and idright to refer to the character positions over which it ranges. These are automatically synthesized over the whole parse tree so that if you had something like expr:e in a production, then eleft and eright would be the ends of the whole expr. (See what the java_cup manual says about positions for all the details.)

Suggestions for an Identifier class

There are some techniques that will make the "symbol table" portion of your compiler easier and will allow it to more gracefully adapt to future needs. This section suggests some of these techniques. The basic assumption is that there is an Identifier class; when the lexical analyzer sees an identifier, it passes back a token with an attribute that is the unique Identifier object corresponding to the name (lexeme) it saw. (It could also pass back the lexeme itself, as a String, and leave the conversion into an Identifier as a job for the parser. I prefer the former approach, but the tradeoffs are subtle enough that I will not explain them here. If you care, ask me.)

A good way to arrange for the uniqueness property (that there will never be two Identifier objects with the same name) is to make the Identifier class's constructor be private, so that it can not be invoked from outside the class. Instead, have a public static (i.e., class-wide) method, perhaps called get, that the outside world uses to get the appropriate Identifier object for a particular name. For example, rather than saying new Identifier(yytext()), the lexical analyzer would say Identifier.get(yytext()). The static get method would check a table to see if an Identifier with that name already existed, and if so, return it. Otherwise, it could invoke the private constructor, install the result into the table, and return it.

The next questions concern the nature of the table: where should it be stored, and how should it be implemented? The best approach to where the table should be stored is as a private static member of the Identifier class. That way the public static method, get, has access to it, but nothing external to the Identifier class does. Regarding how it should be implemented: the standard Java library contains a HashMap class that is perfect for the job. You should look it up in a text, or look at the official documentation for this class, which is available on the web. (For those reading this on paper: a link to it is embedded in the prior sentence.)

The above paragraphs address the question of how Strings that are identifier names (lexemes) get mapped into the corresponding Identifier object. Now, what operations should the Identifier class provide? You can give objects of the Identifier class the ability to remember and retrieve some arbitrary piece of information about the Identifier. At the moment, that might be the Register that has been allocated for the variable. For simplicity, you could stick with just that. Alternatively, to provide enough generality to accomodate future changes, you could consider making the Identifier remember an Object, rather than specifically a Register. In either case, when a variable is declared, you'd allocate a Register and ask the variable's Identifier object to remember it. Later when the variable is used, you'd retrieve from the Identifier the Register that it is remembering. If you take the approach of storing arbitrary Objects, you'd need to retrieve the Object the Identifier is remembering and cast it to Register.

Regarding error checking, the "remember this information" method on the Identifier class could throw an exception if the identifier is already declared, and the "retrieve the remembered information" method could return null or throw an exception if the identifier is undeclared (has had no information remembered). Whether to return null or throw an exception is a design decision you need to think about, but neither option is unreasonable. If you have the Identifiers remember generic Objects, the final possible error (which shouldn't arise yet!) is that when you try to cast the information Object to a Register, you might find that it is of the wrong type, i.e., that some other kind of information had been registered with this Identifier.

Remember that you should test your Identifier class out on its own, either by using a small test program or by interacting with it in BlueJ, rather than needing to test it in the context of your compiler.

Postlab

Write a lab report showing your c2 modules, highlighting the differences from the code for c1. You should also describe the tests you ran, and any problems uncovered.


Course web site: http://www.gac.edu/~max/courses/S2005/MCS-388/
Instructor: Max Hailperin <max@gustavus.edu>