MCS-388 Lab 4: Control Flow and Scoping (Spring 2008)

Due: April 21, 2008

Objective

You will extend your compiler to provide some important additions to the source language:

Several new kinds of statements:
- while statements
- if statements, both with and without else parts
- brace-enclosed compound statements
Nested variable scopes (arising from declarations in compound statements)
The six arithmetic comparison operators (<, >, <=, >=, ==, !=)

Of these, while and if statements introduce interesting control flow, for the first time allowing programs that do something other than continue in a straight-line path. Compound statements are important in conjunction with while and if statements, in order to allow a loop body (for example) to perform more than one action. Moreover, introducing compound statements gives us a natural context for variable scoping. The comparison operators are the only addition of no real substance: they are fundamentally similar to the arithmetic operators, except for a minor quirk of LLVM that we will need to work around. The comparison operators could equally well have been in the compiler the whole time. The only reason to add them now is that they become particularly handy to have now that we have loops and conditional statements.

Some notes on the new language features

We will follow the lead of C, rather than Java, in using integers to represent truth values, rather than having a separate boolean type. If the condition of a while or if evaluates to a non-zero value it counts as true, while a zero value counts as false. The comparison operations should all produce 1 for true and 0 for false. The comparison operators should be usable anywhere any other operator (e.g., +) could be used. So, for example, there is nothing wrong with

x = y > z;

which assigns x the value 1 or 0, or

println((x > y) + (z > w));

which prints one of 0, 1, or 2.

The syntax of while and if statements should be as in C (and C++ and Java). For example, the controlling expression in each case must be parenthesized. Remember that any statement can be used as the body of a loop, or as an alternative in an if statement: there is no requirement of curly braces. They have nothing to do with the while or if statement and are only present if a compound statement is used; they form part of the compound statement. For example, the following three while loops are all legal:

while(x >= 10)
  x = x / 10;     // body is an assignment statement

while(x >= 10){   // body is a compound statement, which
  x = x / 10;     // happens to only have one statement in it
  }

while(x >= 10){   // body is a more general compound statement
  int d = x % 10;
  x = x / 10;
  println(d);
  }

{                 // here is is a compound statement
 int x = 3;
 {                // with another compound statement inside
  int y = 4;      // just to emphasize that they don't
  println(x+y);   // need to show up with while or if
 }
 println(x);
}

Each brace-enclosed compound statement constitutes a new scope, nested inside the surrounding scope. It is legal to redeclare a variable that was declared in some outer, surrounding scope, but not legal to redeclare a variable in the exact same scope.

Comparison operations in LLVM

Each comparison operation needs to translate into two LLVM instructions because LLVM is a strongly typed language, in which 32-bit values (such as we are using for integers) are distinct from 1-bit values (such as the comparison instruction produces). As such, we need to first use one instruction to do the comparison (producing a 1-bit result) and then a second instruction to widen it to 32 bits. This second instruction is known as "zero extension," because the 31 additional bits are always zero. Here is an example of testing whether the value named %t0 is less than the value named %t1:

    %t2 = icmp slt i32 %t0, %t1
    %t3 = zext i1 %t2 to i32

At this point, the value named %t3 is a normal 32-bit integer value, which will be either 0 or 1. The code slt indicates that a signed less-than comparison was desired. Other comparisons are sgt (signed greater than), sle (signed less than or equal to), sge (signed greater than or equal to), eq (equal to), and ne (not equal to).

Labels and branching in LLVM assembly language

In order to translate while and if statements, you'll need to know how to code branch instructions in LLVM assembly language. To mark a position within the assembly code that is a possible target for branching to, use a label followed by a colon, as in this example:

label_0:

In order to unconditionally branch to that label, you would use the following LLVM instruction:

    br label %label_0

In order to conditionally branch to label_1 if %t17 is nonzero and to label_2 if %t17 is zero, use this pair of instructions:

    %t18 = icmp ne i32 %t17, 0
    br i1 %t18, label %label_1, label %label_2

Note that the conditional branch instruction includes two labels. Unlike in most assembly languages, there is no circumstance under which a branch instruction falls through to the next instruction: it always branches somewhere. (The branch target label could happen to be immediately afterward.) Another point worth noting is that the conditional branch needs to be based on a single-bit value, that is, a value of type i1. For this reason, given that we are primarily using normal 32-bit values, the conditional branch needed to be prefaced by an instruction that tests whether %t17 is not equal to 0.

A related way in which LLVM differs from normal assembly languages is that a label cannot be immediately preceded by a normal instruction, such as a load, store, or arithmetic computation. Instead of following one of those instructions immediately by a label, you need to insert an unconditional branch instruction in between, branching to the label.

Symbol table

In the preceding lab you used a HashMap; now you'll want to use some class of your own creation that supports not only the put and get methods (or close analogs of them), but also methods for entering and exiting scopes. I expect that we'll spend some class time on how this can be represented, but you would also be well advised to look back over your solution to last year's MCS-287 Lab 3.

Shift/reduce conflicts

Our textbook mentions (on page 293) that YACC will resolve a shift/reduce conflict in favor of shifting and (on page 282) that this correctly resolves the dangling-else ambiguity that you are likely to encounter when you incorporate the two forms of if statements. The same is true for CUP, but only if you give an option telling it how many conflicts to expect. That is, in the build.xml file, you can change the action

    <cup srcfile="${src}/Parser.cup" 
         destdir="${java}" 
         parser="Parser"
    />

    <cup srcfile="${src}/Parser.cup" 
         destdir="${java}" 
         parser="Parser"
         expect="1"
    />

if your grammar has one shift/reduce conflict in it that can be correctly resolved in favor of shifting. You should make the corresponding change in the cup_dump target of the build.xml file as well if you decide to go this route.

There are at least two other ways you could deal with the dangling-else ambiguity. The easiest one is just to declare a precedence for the else terminal at the top of your Parser.cup file. Then you won't need to modify the build.xml and yet CUP will still resolve the shift/reduce conflict in favor of shifting, much like it would for an expression such as 3+4*5. The other approach, conceptually clean but messy in practice, would be to write an unambiguous grammar in the first place, as shown on page 212 of the textbook.

A variable allocation issue

From the previous lab, you are emitting an LLVM instruction such as the following each time a variable is declared:

%t0 = alloca i32

This instruction allocates stack space to hold the variable. That creates an issue now that programs can have loops in them. Consider a program such as the following:

int main(){
  int n = 100;
  while(n){
    int x = n;
    println(x);
    n = n - 1;
  }
}

This program will allocate stack space for the variable x every time through the loop, which is 100 times in this case. By replacing the 100 with a larger number, you could make the amount of allocated memory even larger. (None of the memory is deallocated until the procedure returns.) This isn't good. To improve your compiler's handling of such situations, you should move the alloca instruction to the beginning of the procedure, rather than positioning it where the declaration statement occurs. There are a couple different strategies for how you might accomplish that. I'd be happy to talk with you.

Postlab

Write a lab report highlighting the differences from the code for your previous compiler. You should also describe the tests you ran and any problems uncovered.

Course web site: http://gustavus.edu/+max/courses/S2008/MCS-388/
Instructor: Max Hailperin <max@gustavus.edu>