MCS-388 Lab 2: Adding Variables (Spring 2002)

Due: March 18, 2002

Objective

In this lab, you'll turn your c1 compiler into c2, extending the source language to include variables, which can be declared, assigned to, and used in expressions. This will involve introducing two new mechanisms (which are unfortunately commonly lumped under the name "symbol table"):

You'll need a mechanism for mapping variable names (lexemes) into Identifier objects, such that repeated occurrences of the same name map into the same Identifier.
You'll need a mechanism for associating information with an Identifier when it is declared and retrieving that information when it is used in an assignment or expression.

Of course, you'll also need to enrich the parser and scanner to deal with the new language features. Furthermore, we'll have to start thinking a little bit about conventions for saving and restoring registers to the stack.

The language extensions

An identifier name can start with any letter, and can then have any number of additional characters that can be letters or digits. The case of letters is significant.

The program is still of the same overall form, with main() and then a brace-enclosed list of zero or more statements, but now there are three possibilities for what a statement can be, rather than just one. In addition to the old print_int statements, now there can be declaration statements and assignment statements.

A declaration statement starts with the keyword int, then has one or more variable declarations separated by commas, and then a semicolon at the end. Each variable declaration can be either just an identifier (declaring the identifier to be a local integer variable, but leaving it uninitialized), or an identifier followed by an equal sign and an expression, in which case the variable is initialized just as if by the corresponding assignment statement. It is illegal to redeclare an already declared identifier. The scope of a variable begins immediately after its declaration and extends to the right brace that closes the block of statements. Note the preceding specification needs to be read carefully: "immediately after its declaration" is different than "immediately after its declaration statement" and also different from "immediately after the identifier's appearance in its declaration."

An assignment statement is a variable followed by an equal sign, an expression, and then a semicolon.

There is one new form of expression, namely a variable.

In the two preceding paragraphs, reference is made to variables. A variable is simply an identifier, from a grammatical perspective. However, it must have been declared to be a variable (otherwise, the program is erroneous). Thus, it is an error if an undeclared identifier appears either in an assignment or an expression.

Target code issues

We will use the $s registers for variables; there are nine of these, running from $s0 to $s8. (Programs with more than nine variables will be outside the scope of our compilers, for now.)

To compile an assignment statement (or initialization), for the time being we will take a simple approach, analogous to the way we are currently compiling print_int statements. Namely, the expression should be evaluated into an arbitrary register, and then the value moved from that register into the variable's register. (Be sure to deallocate the original register if it is a temporary.)

Note that when a variable is used as an expression, there is no reason to move its value into a temporary register; the variable's register can be directly used.

By convention, the $s registers are supposed to be left unchanged by any procedure invocation. This works to our advantage when we call the print_int procedure; we can count on it not to clobber the registers we're using to store main's variables in. However, we should also ensure that main plays by the same rules. This means we should be sure that if main's caller had any values it cared about in the $s registers when it called main, they are still (or again) there when main returns. The simplest way to arrange for this (which is good enough for now) is to change the prolog of main to allocate a 40-byte stack frame instead of a 4-byte one, and store not only $ra but also all nine of the $s registers into that frame. Similarly, the epilogue should load all these registers back from the stack frame, and deallocate all 40 bytes of it.

Error handling

There are now two new kinds of errors the compiler can detect, beyond syntax errors. Namely, an identifier can be redeclared, or it can be used without having been declared. For each of these cases, you should print out a simple, informative message. It should state exactly what the problem is: what identifier is at issue (i.e., its name), where in the program it occurs (its character position), and what is wrong with it (that it is being redeclared or used without declaration). An error message in implementation terms (like a null pointer exception) is not acceptable. No recovery is necessary: it is acceptable for your compiler to simply exit after reporting the error. However, you may also choose to recover and continue. If you do so, make sure that the compiler doesn't make a nuisance of itself, for example by complaining repeatedly about the same undeclared variable.

For reporting the position of the error, Java_cup and JLex have built in support for character counting. If you have a production like

expr ::= ID:id

then in the action {: ... :}, you can not only use id to refer to the attribute of the ID, but also idleft and idright to refer to the character positions over which it ranges. These are automatically synthesized over the whole parse tree so that if you had something like expr:e in a production, then eleft and eright would be the ends of the whole expr. (See what the java_cup manual says about positions for all the details.)

Suggestions for an Identifier class

There are some techniques that will make the "symbol table" portion of your compiler easier and will allow it to more gracefully adapt to future needs. This section suggests some of these techniques. The basic assumption is that there is an Identifier class; when the lexical analyzer sees an identifier, it passes back a token with an attribute that is the unique Identifier object corresponding to the name (lexeme) it saw. (It could also pass back the lexeme itself, as a String, and leave the conversion into an Identifier as a job for the parser. I prefer the former approach, but I can't think of any good reason why.)

A good way to arrange for the uniqueness property (that there will never be two Identifier objects with the same name) is to make the Identifier class's constructor be private, so that it can not be invoked from outside the class. Instead, have a public static (i.e., class-wide) method, perhaps called get, that the outside world uses to get the appropriate Identifier object for a particular name. For example, rather than saying new Identifier(yytext()), the lexical analyzer would say Identifier.get(yytext()). The static get method would check a table to see if an Identifier with that name already existed, and if so, return it. Otherwise, it could invoke the private constructor, install the result into the table, and return it.

The next questions concern the nature of the table: where should it be stored, and how should it be implemented? The best approach to where the table should be stored is as a private static member of the Identifier class. That way the public static method, get, has access to it, but nothing external to the Identifier class does. Regarding how it should be implemented: the standard Java library contains a Hashtable class that is perfect for the job. You should look it up in a text, or look at the official documentation for this class, which is available on the web. (For those reading this on paper: a link to it is embedded in the prior sentence.)

The above paragraphs address the question of how Strings that are identifier names (lexemes) get mapped into the corresponding Identifier object. Now, what operations should the Identifier class provide? At the moment, we have only one kind of thing that identifiers can name (local variables, as opposed to globals or procedures, for example), we only have one type (int), and we don't have any nesting of scopes. However, all of these are subject to change. There are two tempting extreme positions:

One extreme position would be to design around the assumption that all there will ever be are local integer variables in a single scope. You could have methods like getRegister on the Identifier class, which makes sense at the moment, but will lead to problems when identifiers are used as names for other things (like procedures) that don't live in registers.
Another extreme position would be to attempt to foresee all the functionality that may ever be needed, and design it all in now.

I would like to point out that there is some middle ground between these extremes. You can give objects of the Identifier class the ability to register and retrieve some arbitrary, unspecified piece of information about the Identifier. For complete generality, it could be an Object. Then you could make a class, say LocalInt, containing the information that you want to store about local variables of type int. (That information would include the register.) When a variable is declared, you'd make a LocalInt object and register it with the variable's Identifier object. Later when the variable is used, you'd retrieve from the Identifier the Object that it has registered with it, and cast it to LocalInt, and then retrieve the register from that. That way other (future) kinds of Identifiers could have other kinds of information registered with them, without your having to yet decide what that information will be.

Of course, some of the decisions may turn out to be wrong. We may discover in later labs that there is some "call back" operation that the Identifier class needs to be able to perform on any piece of information that has been registered with an Identifier. In that case, it won't be possible to register any arbitrary Object. However, we'll be able to quite gracefully evolve from the registration method taking an Object to it taking an ObjectSuitableForRegisteringWithAnIdentifier (that name is not to be taken seriously), where that name identifies not a concrete class, but rather an abstract interface. We just need to make sure that LocalInt (and any similar classes) implement this interface, and we'll be set. This isn't an evolutionary path of 100% upward compatibility, but it is at least not particularly trouble prone. Similarly, having a specialized LocalInt class may prove to be a bad decision, but the impact of changing it will be relatively localized. For example, the Identifier class doesn't even know it exists.

Regarding error checking, the "register this information" method on the Identifier class could throw an exception if the identifier is already declared, and the "retrieve the registered information" method could return null or throw an exception if the identifier is undeclared (has had no information registered). Whether to return null or throw an exception is a design decision you need to think about, but neither option is unreasonable. The final possible error (which shouldn't arise yet!) is that when you try to cast the information object to a LocalInt, you might find that it is of the wrong type, i.e., that some other kind of information had been registered with this Identifier.

I haven't said anything yet about being prepared for scoping. As long as the Identifier class's methods for registering and retrieving information are an abstract enough interface, they will survive the introduction of scoping. Register will mean: register in the current scope, and complain only if it is already declared in this same scope, not some outer one. Retrieve will mean: retrieve from the closest surrounding scope in which it has been declared. How will this be done? We can deal with that later; for now, all we need to know is that we have an interface that will survive. The implementation can for now be a simple implementation that assumes there is only a single scope. Later we can replace it with a scope-cognizant implementation. This is another example of how one can "design to not suffer too badly as a result of being wrong" as opposed to "design to be right." Both are important techniques.

Postlab

Write a lab report showing your c2 modules, highlighting the differences from the code for c1. You should also describe the tests you ran, and any problems uncovered.

Course web site: http://www.gac.edu/~max/courses/S2002/MCS-388/
Instructor: Max Hailperin <max@gac.edu>