c1
compiler into
c2
, extending the source language to include variables,
which can be declared, assigned to, and used in expressions. This
will involve introducing two new mechanisms (which are unfortunately
commonly lumped under the name "symbol table"):
The program is still of the same overall form, with
main()
and then a brace-enclosed list of zero or more
statements, but now there are three possibilities for what a statement
can be, rather than just one. In addition to the old
print_int
statements, now there can be declaration
statements and assignment statements.
A declaration statement starts with the keyword int
, then
has one or more variable declarations separated by commas, and then a
semicolon at the end. Each variable declaration can be either just an
identifier (declaring the identifier to be a local integer variable,
but leaving it uninitialized), or an identifier followed by an equal
sign and an expression, in which case the variable is initialized just
as if by the corresponding assignment statement. It is illegal to
redeclare an already declared identifier. The scope of a variable
begins immediately after its declaration and extends to
the right brace that closes the block of statements. Note the
preceding specification needs to be read carefully: "immediately after
its declaration" is different than "immediately after its declaration
statement" and also different from "immediately after the identifier's
appearance in its declaration."
An assignment statement is a variable followed by an equal sign, an expression, and then a semicolon.
There is one new form of expression, namely a variable.
In the two preceding paragraphs, reference is made to variables. A variable is simply an identifier, from a grammatical perspective. However, it must have been declared to be a variable (otherwise, the program is erroneous). Thus, it is an error if an undeclared identifier appears either in an assignment or an expression.
$s
registers for variables; there are
nine of these, running from $s0
to $s8
.
(Programs with more than nine variables will be outside the scope of
our compilers, for now.)
To compile an assignment statement (or initialization), for the time
being we will take a simple approach, analogous to the way we are
currently compiling print_int
statements. Namely, the
expression should be evaluated into an arbitrary register, and then
the value moved from that register into the variable's register.
(Be sure to deallocate the original register if it is a temporary.)
Note that when a variable is used as an expression, there is no reason to move its value into a temporary register; the variable's register can be directly used.
By convention, the $s
registers are supposed to be left
unchanged by any procedure invocation. This works to our advantage
when we call the print_int
procedure; we can count on it
not to clobber the registers we're using to store
main
's variables in. However, we should also ensure that
main
plays by the same rules. This means we should be
sure that if main
's caller had any values it cared about
in the $s
registers when it called main
, they
are still (or again) there when main
returns. The
simplest way to arrange for this (which is good enough for now) is to
change the prolog of main
to allocate a 40-byte
stack frame instead of a 4-byte one, and store not only
$ra
but also all nine of the $s
registers
into that frame. Similarly, the epilogue should load all these
registers back from the stack frame, and deallocate all 40 bytes of
it.
For reporting the position of the error, Java_cup and JLex have built in support for character counting. If you have a production like
expr ::= ID:idthen in the action
{: ... :}
, you can not only use id
to refer to the
attribute of the ID, but also idleft
and idright
to refer to the
character positions over which it ranges. These are automatically
synthesized over the whole parse tree so that if you had something
like expr:e
in a production, then eleft
and eright
would be the ends
of the whole expr. (See what the java_cup manual says about positions
for all the details.)
A good way to arrange for the uniqueness property (that there will
never be two Identifier objects with the same name) is to make the
Identifier class's constructor be private, so that it can not be
invoked from outside the class. Instead, have a public static (i.e.,
class-wide) method, perhaps called get
, that the outside
world uses to get the appropriate Identifier object for a particular
name. For example, rather than saying new
Identifier(yytext())
, the lexical analyzer would say
Identifier.get(yytext())
. The static get
method would check a table to see if an Identifier with that name
already existed, and if so, return it. Otherwise, it could invoke the
private constructor, install the result into the table, and return it.
The next questions concern the nature of the table: where should it
be stored, and how should it be implemented? The best approach to
where the table should be stored is as a private static member of the
Identifier class. That way the public static method,
get
, has access to it, but nothing external to the
Identifier class does. Regarding how it should be implemented: the
standard Java library contains a Hashtable class that is perfect for
the job. You should look it up in a text, or look at the official
documentation for this class, which is available on the web.
(For those reading this on paper: a link to it is embedded in the
prior sentence.)
The above paragraphs address the question of how Strings that are identifier names (lexemes) get mapped into the corresponding Identifier object. Now, what operations should the Identifier class provide? At the moment, we have only one kind of thing that identifiers can name (local variables, as opposed to globals or procedures, for example), we only have one type (int), and we don't have any nesting of scopes. However, all of these are subject to change. There are two tempting extreme positions:
getRegister
on the Identifier
class, which makes sense at the moment, but will lead to problems when
identifiers are used as names for other things (like procedures) that
don't live in registers.
I would like to point out that there is some middle ground between these extremes. You can give objects of the Identifier class the ability to register and retrieve some arbitrary, unspecified piece of information about the Identifier. For complete generality, it could be an Object. Then you could make a class, say LocalInt, containing the information that you want to store about local variables of type int. (That information would include the register.) When a variable is declared, you'd make a LocalInt object and register it with the variable's Identifier object. Later when the variable is used, you'd retrieve from the Identifier the Object that it has registered with it, and cast it to LocalInt, and then retrieve the register from that. That way other (future) kinds of Identifiers could have other kinds of information registered with them, without your having to yet decide what that information will be.
Of course, some of the decisions may turn out to be wrong. We may
discover in later labs that there is some "call back" operation that
the Identifier class needs to be able to perform on any piece of
information that has been registered with an Identifier. In that case,
it won't be possible to register any arbitrary Object. However, we'll
be able to quite gracefully evolve from the registration method taking
an Object to it taking an ObjectSuitableForRegisteringWithAnIdentifier
(that name is not to be taken seriously), where that name identifies
not a concrete class, but rather an abstract interface. We just need
to make sure that LocalInt (and any similar classes)
implement
this interface, and we'll be set. This isn't
an evolutionary path of 100% upward compatibility, but it is at least
not particularly trouble prone. Similarly, having a specialized
LocalInt class may prove to be a bad decision, but the impact of
changing it will be relatively localized. For example, the Identifier
class doesn't even know it exists.
Regarding error checking, the "register this information" method on the Identifier class could throw an exception if the identifier is already declared, and the "retrieve the registered information" method could return null or throw an exception if the identifier is undeclared (has had no information registered). Whether to return null or throw an exception is a design decision you need to think about, but neither option is unreasonable. The final possible error (which shouldn't arise yet!) is that when you try to cast the information object to a LocalInt, you might find that it is of the wrong type, i.e., that some other kind of information had been registered with this Identifier.
I haven't said anything yet about being prepared for scoping. As long as the Identifier class's methods for registering and retrieving information are an abstract enough interface, they will survive the introduction of scoping. Register will mean: register in the current scope, and complain only if it is already declared in this same scope, not some outer one. Retrieve will mean: retrieve from the closest surrounding scope in which it has been declared. How will this be done? We can deal with that later; for now, all we need to know is that we have an interface that will survive. The implementation can for now be a simple implementation that assumes there is only a single scope. Later we can replace it with a scope-cognizant implementation. This is another example of how one can "design to not suffer too badly as a result of being wrong" as opposed to "design to be right." Both are important techniques.
c2
modules, highlighting
the differences from the code for c1
. You should also
describe the tests you ran, and any problems uncovered.