MCS-388 Lab 2: Scanning and Parsing (Spring 2008)

Due: March 20, 2008

Objective

In this lab, you will turn your AST classes from the previous lab into a true compiler by equipping them with a scanner and parser. That way, your compiler will be able to read in any program (within its limited language) and generate the translation for that program, rather than always generating the translation for one fixed program as in the prior lab.

Starting point

Download and unzip the file c2.zip. This will produce a directory c2 structured similarly to the previous lab, for use with ant.

Copy all the files from the previous lab's c1/src into c2/src, because your new compiler will build on your work from the prior lab. The scanner and parser are provided by two new source files, src/scanner.jflex and src/Parser.cup.

One you have copied the files, you can try building and testing the initial version of the compiler. In a shell with your current directory being c2, give the command ant. With any luck, the output you see should end with a successful execution of the LLVM code generated by running your compiler on the testProgram.c source file.

If you look at the testProgram.c used in the testing, you will see that despite the .c ending on the filename, the syntax isn't that of the C programming language. That's going to change in the course of this lab. For now, the test program consists of a single arithmetic expression. A look at the grammar in src/Parser.cup should confirm that as the compiler currently stands, the syntax of a program contains nothing but a single expression. Moreover, if you look at src/scanner.jflex, you'll see another limitation of the current compiler: although it skips whitespace, it has no provision for skipping comments. (You could try putting a comment into testProgram.c and see that it trips up the compiler.) These observations set the stage for the work you need to do in order to expand the compiler to a somewhat more realistic source language. In subsequent labs, you will further expand the language.

Before getting to work on your extensions to the compiler, it would be worth noting that the build.xml file has additional features beyond the ability to run the whole compiler-creation and test compilation process. For example, you can test just the scanner by running ant test_scanner to see what tokens it identifies. I would be happy to explain any of the other testing features that might interest you.

Changes you need to make

First, modify the scanner so that it will skip comments extending from // to the end of the line and test that this works.

Second, modify the syntax of expressions so as to allow procedure call expressions. A procedure call expression consists of a name followed by a parenthesized expression. This will require changes in both the scanner and the parser. Again, you should stop and test that your compiler works with examples of the modified language. Once this is working, you can remove the implicit call to println provided in the parser's action for the stmt nonterminal. Instead, the call to println can be made explicit in the source code file, testProgram.c.

Third, modify the syntax of statements so that a statement is no longer just an unadorned expression, but rather has a semicolon after the expression, as in the following example:

println(2+2);

This will again require modifying both the scanner and the parser. Even though this is a minor change, you should test your compiler. One of the secrets to building a large program successfully is to be meticulous about testing each individual change.

Fourth, modify the syntax of programs so that a program is no longer just an unadorned statement, but rather has some extra symbols at the beginning and end, as in the following example:

int main(){
  println(2+2);
}

Once more, this will require changes in both the scanner and the parser. The first two lexemes, int and main, play different roles. The first should be a special keyword token, which must be int, whereas the second should be an arbitrary name; whatever name is given should be used for the compiled procedure, whether or not it is main. Again, test your compiler.

Finally, modify the syntax of programs further so that between the curly braces the programmer can write any list of zero or more statements, rather than only one. At this point, the language is as complete as it will be for this lab, unless you chose to do one or more of the optional changes. You should give your compiler one last testing.

Optional additional changes

It would be nice if the scanner also skipped comments that start with /* and extend through the next instance of */.

If you look at the example main procedure above, you'll see that it is defined as returning an int but doesn't in fact contain any return statement; the code just falls off the end of the procedure body. This works out OK because the parser is automatically building an implicit return of 0 into the AST at the end of the procedure. However, it would be nice if you could include an explicit return statement, which could return the value of any expression, not necessarily 0.

Your test program can include calls to procedures other than just println; in particular, you can use putchar to output individual characters. However, when you do so, the argument needs to be the code number of the character being output, and it is a bit inelegant to have to use magic numbers such as 43 for a plus sign or 61 for an equal sign. It would be so much nicer if you could just write '+' or '='. By having the scanner produce the same tokens for the character constants as it would for the decimal numerals, you can leave the parser alone. If you want to get fancy, you can also handle backslashes specially.

What you should turn in

Please turn in a printout of the final version of any files you modified or added. You should also turn in evidence of your testing of the final version.


Course web site: http://gustavus.edu/+max/courses/S2008/MCS-388/
Instructor: Max Hailperin <max@gustavus.edu>