c1
compiler into
c2
, extending the source language to include variables,
which can be declared, assigned to, and used in expressions. This
will involve introducing two new mechanisms (which are unfortunately
commonly lumped under the name "symbol table"):
The program is still of the same overall form, with
main()
and then a brace-enclosed list of zero or more
statements, but now there are three possibilities for what a statement
can be, rather than just one. In addition to the old
print_int
statements, now there can be declaration
statements and assignment statements.
A declaration statement starts with the keyword int
, then
has one or more variable declarations separated by commas, and then a
semicolon at the end. Each variable declaration can be either just an
identifier (declaring the identifier to be a local integer variable,
but leaving it uninitialized), or an identifier followed by an equal
sign and an expression, in which case the variable is initialized just
as if by the corresponding assignment statement. It is illegal to
redeclare an already declared identifier. The scope of a variable
begins immediately after its declaration and extends to
the right brace that closes the block of statements. Note the
preceding specification needs to be read carefully: "immediately after
its declaration" is different than "immediately after its declaration
statement" and also different from "immediately after the identifier's
appearance in its declaration."
An assignment statement is a variable followed by an equal sign, an expression, and then a semicolon.
There is one new form of expression, namely a variable.
In the two preceding paragraphs, reference is made to variables. A variable is simply an identifier, from a grammatical perspective. However, it must have been declared to be a variable (otherwise, the program is erroneous). Thus, it is an error if an undeclared identifier appears either in an assignment or an expression.
$s
registers for variables; there are
nine of these, running from $s0
to $s8
.
(Programs with more than nine variables will be outside the scope of
our compilers, for now.)
To compile an assignment statement (or initialization), for the time
being we will take a simple approach, analogous to the way we are
currently compiling print_int
statements. Namely, the
expression should be evaluated into an arbitrary register, and then
the value moved from that register into the variable's register.
(Be sure to deallocate the original register if it is a temporary.)
Note that when a variable is used as an expression, there is no reason to move its value into a temporary register; the variable's register can be directly used.
By convention, the $s
registers are supposed to be left
unchanged by any procedure invocation. This works to our advantage
when we call the print_int
procedure; we can count on it
not to clobber the registers we're using to store
main
's variables in. However, we should also ensure that
main
plays by the same rules. This means we should be
sure that if main
's caller had any values it cared about
in the $s
registers when it called main
, they
are still (or again) there when main
returns. The
simplest way to arrange for this (which is good enough for now) is to
change the prolog of main
to allocate a 40-byte
stack frame instead of a 4-byte one, and store not only
$ra
but also all nine of the $s
registers
into that frame. Similarly, the epilogue should load all these
registers back from the stack frame, and deallocate all 40 bytes of
it.
For reporting the position of the error, Java_cup and JLex have built in support for character counting. If you have a production like
var ::= ID:idthen in the action
{: ... :}
, you can not only use id
to refer to the
attribute of the ID, but also idleft
and idright
to refer to the
character positions over which it ranges. These are automatically
synthesized over the whole parse tree so that if you had something
like expr:e
in a production, then eleft
and eright
would be the ends
of the whole expr. (See what the java_cup manual says about positions
for all the details.)
A good way to arrange for the uniqueness property (that there will
never be two Identifier objects with the same name) is to make the
Identifier class's constructor be private, so that it can not be
invoked from outside the class. Instead, have a public static (i.e.,
class-wide) method, perhaps called get
, that the outside
world uses to get the appropriate Identifier object for a particular
name. For example, rather than saying new
Identifier(yytext())
, the lexical analyzer would say
Identifier.get(yytext())
. The static get
method would check a table to see if an Identifier with that name
already existed, and if so, return it. Otherwise, it could invoke the
private constructor, install the result into the table, and return it.
The next questions concern the nature of the table: where should it
be stored, and how should it be implemented? The best approach to
where the table should be stored is as a private static member of the
Identifier class. That way the public static method,
get
, has access to it, but nothing external to the
Identifier class does. Regarding how it should be implemented: the
standard Java library contains a HashMap class that is perfect for
the job. You should look it up in a text, or look at the official
documentation for this class, which is available on the web.
(For those reading this on paper: a link to it is embedded in the
prior sentence.)
The above paragraphs address the question of how Strings that are identifier names (lexemes) get mapped into the corresponding Identifier object. Now, what operations should the Identifier class provide? You can give objects of the Identifier class the ability to remember and retrieve some arbitrary piece of information about the Identifier. At the moment, that might be the Register that has been allocated for the variable. For simplicity, you could stick with just that. Alternatively, to provide enough generality to accomodate future changes, you could consider making the Identifier remember an Object, rather than specifically a Register. In either case, when a variable is declared, you'd allocate a Register and ask the variable's Identifier object to remember it. Later when the variable is used, you'd retrieve from the Identifier the Register that it is remembering. If you take the approach of storing arbitrary Objects, you'd need to retrieve the Object the Identifier is remembering and cast it to Register.
Regarding error checking, the "remember this information" method on the Identifier class could throw an exception if the identifier is already declared, and the "retrieve the remembered information" method could return null or throw an exception if the identifier is undeclared (has had no information remembered). Whether to return null or throw an exception is a design decision you need to think about, but neither option is unreasonable. If you have the Identifiers remember generic Objects, the final possible error (which shouldn't arise yet!) is that when you try to cast the information Object to a Register, you might find that it is of the wrong type, i.e., that some other kind of information had been registered with this Identifier.
Remember that you should test your Identifier class out on its own, either by using a small test program or by interacting with it in BlueJ, rather than needing to test it in the context of your compiler.
c2
modules, highlighting
the differences from the code for c1
. You should also
describe the tests you ran, and any problems uncovered.