LET'S BUILD A COMPILER!

By

Jack W. Crenshaw, Ph.D.

16 April 1989

Part IX: A TOP VIEW

INTRODUCTION

In the previous installments, we have learned many of the techniques required to build a full-blown compiler. We've done both assignment statements (with Boolean and arithmetic expressions), relational operators, and control constructs. We still haven't addressed procedure or function calls, but even so we could conceivably construct a mini-language without them. I've always thought it would be fun to see just how small a language one could build that would still be useful. We're almost in a position to do that now. The problem is: though we know how to parse and translate the constructs, we still don't know quite how to put them all together into a language.

In those earlier installments, the development of our programs had a decidedly bottom-up flavor. In the case of expression parsing, for example, we began with the very lowest level constructs, the individual constants and variables, and worked our way up to more complex expressions.

Most people regard the top-down design approach as being better than the bottom-up one. I do too, but the way we did it certainly seemed natural enough for the kinds of things we were parsing.

You mustn't get the idea, though, that the incremental approach that we've been using in all these tutorials is inherently bottom-up. In this installment I'd like to show you that the approach can work just as well when applied from the top down ... maybe better. We'll consider languages such as C and Pascal, and see how complete compilers can be built starting from the top.

In the next installment, we'll apply the same technique to build a complete translator for a subset of the KISS language, which I'll be calling TINY. But one of my goals for this series is that you will not only be able to see how a compiler for TINY or KISS works, but that you will also be able to design and build compilers for your own languages. The C and Pascal examples will help. One thing I'd like you to see is that the natural structure of the compiler depends very much on the language being translated, so the simplicity and ease of construction of the compiler depends very much on letting the language set the program structure.

It's a bit much to produce a full C or Pascal compiler here, and we won't try. But we can flesh out the top levels far enough so that you can see how it goes.

Let's get started.

THE TOP LEVEL

One of the biggest mistakes people make in a top-down design is failing to start at the true top. They think they know what the overall structure of the design should be, so they go ahead and write it down.

Whenever I start a new design, I always like to do it at the absolute beginning. In program design language (PDL), this top level looks something like:

     begin
        solve the problem
     end

OK, I grant you that this doesn't give much of a hint as to what the next level is, but I like to write it down anyway, just to give me that warm feeling that I am indeed starting at the top.

For our problem, the overall function of a compiler is to compile a complete program. Any definition of the language, written in BNF, begins here. What does the top level BNF look like? Well, that depends quite a bit on the language to be translated. Let's take a look at Pascal.

THE STRUCTURE OF PASCAL

Most texts for Pascal include a BNF or "railroad-track" definition of the language. Here are the first few lines of one:
     <program> ::= <program-header> <block> '.'

     <program-header> ::= PROGRAM <ident>

     <block> ::= <declarations> <statements>

We can write recognizers to deal with each of these elements, just as we've done before. For each one, we'll use our familiar single-character tokens to represent the input, then flesh things out a little at a time. Let's begin with the first recognizer: the program itself.

To translate this, we'll start with a fresh copy of the Cradle. Since we're back to single-character names, we'll just use a 'p' to stand for 'PROGRAM.'

To a fresh copy of the cradle, add the following code, and insert a call to it from the main program:

-- ------------------------------------------------------------
-- Parse and translate a program
: prog ( -- )
        0 LOCAL name
        'p' match       \ Handles program header part 
        getname TO name
        name prolog
        '.' match
        name epilog ;
-- ------------------------------------------------------------

The words Prolog and Epilog perform whatever is required to let the program interface with the operating system, so that it can execute as a program. Needless to say, this part will be very OS-dependent. Remember, I've been emitting code for an x86 running under the OS I use, which is Windows XP.

Anyhow, Windows XP is a particularly easy OS to interface to, especially when we can use iForth as an intermediary. Here is the code for Prolog and Epilog:

-- ------------------------------------------------------------
-- Write the prolog
: prolog ( char -- )
        S" CODE " ROT CHAR-APPEND emitln 
        S" rpush," emitln ;
-- ------------------------------------------------------------
-- Write the epilog
: epilog ( char -- )
        S" rpop,  ebx jmp," emitln 
        DROP S" END-CODE"   emitln ;
-- ------------------------------------------------------------

As usual, add this code and try out the "compiler." (chap9a.frt) At this point, there is only one legal input:

     px.   (where x is any single letter, the program name)

Well, as usual our first effort is rather unimpressive, but by now I'm sure you know that things will get more interesting. There is one important thing to note: the output is a working, complete, and executable program (at least after it's assembled, or, in this case, when you've pasted it into the Forth commandline).

This is very important. The nice feature of the top-down approach is that at any stage you can compile a subset of the complete language and get a program that will run on the target machine. From here on, then, we need only add features by fleshing out the language constructs. It's all very similar to what we've been doing all along, except that we're approaching it from the other end.

FLESHING IT OUT

To flesh out the compiler, we only have to deal with language features one by one. I like to start with a stub procedure that does nothing, then add detail in incremental fashion. Let's begin by processing a block, in accordance with its PDL above. We can do this in two stages. First, add the null procedure:
-- ------------------------------------------------------------
-- Parse and translate a pascal block 
: doblock ( char -- ) prolog ;
-- ------------------------------------------------------------
and modify Prog to read:
-- ------------------------------------------------------------
-- Parse and translate a program 
: prog ( char -- )
        0 LOCAL name
        'p' match       \ Handles program header part 
        getname TO name
        name doblock
        '.' match
        name epilog ;
-- ------------------------------------------------------------

That certainly shouldn't change the behavior of the program, and it doesn't. But now the definition of Prog is complete, and we can proceed to flesh out DoBlock. That's done right from its BNF definition:

-- ------------------------------------------------------------
-- Parse and translate a pascal block 
: doblock ( char -- )
        LOCAL name
        declarations 
        name prolog
        'OF name 1 prolog
        name postlabel
        statements ;
-- ------------------------------------------------------------

The procedure PostLabel was defined in the installment on branches. Copy it into your cradle.

I probably need to explain the reason for inserting the label where I have. It has to do with the operation of XP and iForth. iForth allows the entry point to the main program to be anywhere in the program. All you have to do is to give that point a name. The call to PostLabel puts that name just before the first executable statement in the main program.

OK, now we need stubs for the words Declarations and Statements. Make them null procedures as we did before.

Does the program (chap9b.frt) still run the same? Then we can move on to the next stage.

DECLARATIONS

The BNF for Pascal declarations is:
     <declarations> ::= ( <label list>    |
                          <constant list> |
                          <type list>     |
                          <variable list> |
                          <procedure>     |
                          <function>         )*

(Note that I'm using the more liberal definition used by Turbo Pascal. In the standard Pascal definition, each of these parts must be in a specific order relative to the rest.)

As usual, let's let a single character represent each of these declaration types. The new form of Declarations is:

-- ------------------------------------------------------------
-- Parse and translate the declaration part 
: declarations ( -- )
        BEGIN 
          Look 'l' = 
          Look 'c' = OR 
          Look 't' = OR
          Look 'v' = OR  
          Look 'p' = OR
          Look 'f' = OR
        WHILE   CASE Look
                 'l' OF  labels      ENDOF
                 'c' OF  constants   ENDOF
                 't' OF  types       ENDOF
                 'v' OF  variables   ENDOF
                 'p' OF  doprocedure ENDOF
                 'f' OF  dofunction  ENDOF
                ENDCASE 
        REPEAT ;
-- ------------------------------------------------------------

Of course, we need stub procedures for each of these declaration types. This time, they can't quite be null procedures, since otherwise we'll end up with an infinite while loop. At the very least, each recognizer must eat the character that invokes it. Insert the following procedures:

-- ------------------------------------------------------------
-- Process label statement 
: labels ( -- ) 'l' match ;
-- ------------------------------------------------------------
-- Process const statement 
: constants ( -- ) 'c' match ;
-- ------------------------------------------------------------
-- Process type statement 
: types ( -- ) 't' match ;
-- ------------------------------------------------------------
-- Process var statement 
: variables ( -- ) 'v' match ;
-- ------------------------------------------------------------
-- Process procedure definition 
: doprocedure ( -- ) 'p' match ;
-- ------------------------------------------------------------
-- Process function definition 
: dofunction ( -- ) 'f' match ;
-- ------------------------------------------------------------

Now try out the compiler (chap9c.frt) with a few representative inputs. You can mix the declarations any way you like, as long as the last character in the program is '.' to indicate the end of the program. Of course, none of the declarations actually declare anything, so you don't need (and can't use) any characters other than those standing for the keywords.

We can flesh out the statement part in a similar way. The BNF for it is:

     <statements> ::= <compound statement>

     <compound statement> ::= BEGIN <statement> (';' <statement>) END

Note that statements can begin with any identifier except END. So the first stub form of procedure Statements is:

-- ------------------------------------------------------------
-- Parse and translate the statement part
: statements ( -- )
        'b' match
        BEGIN  Look 'e' <>
        WHILE  getchar 
        REPEAT 
        'e' match ;
-- ------------------------------------------------------------

At this point the compiler (chap9d.frt) will accept any number of declarations, followed by the BEGIN block of the main program. This block itself can contain any characters at all (except an END), but it must be present.

The simplest form of input is now

     'pxbe.'

Try it. Also try some combinations of this. Make some deliberate errors and see what happens.

At this point you should be beginning to see the drill. We begin with a stub translator to process a program, then we flesh out each procedure in turn, based upon its BNF definition. Just as the lower-level BNF definitions add detail and elaborate upon the higher-level ones, the lower-level recognizers will parse more detail of the input program. When the last stub has been expanded, the compiler will be complete. That's top-down design/implementation in its purest form.

You might note that even though we've been adding procedures, the output of the program hasn't changed. That's as it should be. At these top levels there is no emitted code required. The recognizers are functioning as just that: recognizers. They are accepting input sentences, catching bad ones, and channeling good input to the right places, so they are doing their job. If we were to pursue this a bit longer, code would start to appear.

The next step in our expansion should probably be procedure Statements. The Pascal definition is:

    <statement> ::= <simple statement> | <structured statement>

    <simple statement> ::= <assignment> | <procedure call> | null

    <structured statement> ::= <compound statement> |
                               <if statement>       |
                               <case statement>     |
                               <while statement>    |
                               <repeat statement>   |
                               <for statement>      |
                               <with statement>

These are starting to look familiar. As a matter of fact, you have already gone through the process of parsing and generating code for both assignment statements and control structures. This is where the top level meets our bottom-up approach of previous sessions. The constructs will be a little different from those we've been using for KISS, but the differences are nothing you can't handle.

I think you can get the picture now as to the procedure. We begin with a complete BNF description of the language. Starting at the top level, we code up the recognizer for that BNF statement, using stubs for the next-level recognizers. Then we flesh those lower-level statements out one by one.

As it happens, the definition of Pascal is very compatible with the use of BNF, and BNF descriptions of the language abound. Armed with such a description, you will find it fairly straightforward to continue the process we've begun.

You might have a go at fleshing a few of these constructs out, just to get a feel for it. I don't expect you to be able to complete a Pascal compiler here ... there are too many things such as procedures and types that we haven't addressed yet ... but it might be helpful to try some of the more familiar ones. It will do you good to see executable programs coming out the other end.

If I'm going to address those issues that we haven't covered yet, I'd rather do it in the context of KISS. We're not trying to build a complete Pascal compiler just yet, so I'm going to stop the expansion of Pascal here. Let's take a look at a very different language.

THE STRUCTURE OF C

The C language is quite another matter, as you'll see. Texts on C rarely include a BNF definition of the language. Probably that's because the language is quite hard to write BNF for.

One reason I'm showing you these structures now is so that I can impress upon you these two facts:

  1. The definition of the language drives the structure of the compiler. What works for one language may be a disaster for another. It's a very bad idea to try to force a given structure upon the compiler. Rather, you should let the BNF drive the structure, as we have done here.
  2. A language that is hard to write BNF for will probably be hard to write a compiler for, as well. C is a popular language, and it has a reputation for letting you do virtually anything that is possible to do. Despite the success of Small C, C is not an easy language to parse.

A C program has less structure than its Pascal counterpart. At the top level, everything in C is a static declaration, either of data or of a function. We can capture this thought like this:

     <program> ::= ( <global declaration> )*

     <global declaration> ::= <data declaration>  | <function>

In Small C, functions can only have the default type int, which is not declared. This makes the input easy to parse: the first token is either "int," "char," or the name of a function. In Small C, the preprocessor commands are also processed by the compiler proper, so the syntax becomes:

     <global declaration> ::= '#' <preprocessor command>  |
                              'int' <data list>           |
                              'char' <data list>          |
                              <ident> <function body>     |

Although we're really more interested in full C here, I'll show you the code corresponding to this top-level structure for Small C.

-- ------------------------------------------------------------
-- Parse and translate a program 
: prog ( -- )
        BEGIN   Look ^Z <>
        WHILE   CASE Look
                  '#' OF preproc  ENDOF
                  'i' OF intdecl  ENDOF
                  'c' OF chardecl ENDOF
                      int dofunction
                ENDCASE
        REPEAT ;
-- ------------------------------------------------------------

Note that I've had to use a ^Z to indicate the end of the source. C has no keyword such as END or the '.' to otherwise indicate the end.

With full C, things aren't even this easy. The problem comes about because in full C, functions can also have types. So when the compiler sees a keyword like "int," it still doesn't know whether to expect a data declaration or a function definition. Things get more complicated since the next token may not be a name ... it may start with an '*' or '(', or combinations of the two.

More specifically, the BNF for full C begins with:

     <program> ::= ( <top-level decl> )*

     <top-level decl> ::= <function def> | <data decl>

     <data decl> ::= [<class>] <type> <decl-list>

     <function def> ::= [<class>] [<type>] <function decl>

You can now see the problem: The first two parts of the declarations for data and functions can be the same. Because of the ambiguity in the grammar as written above, it's not a suitable grammar for a recursive-descent parser. Can we transform it into one that is suitable? Yes, with a little work. Suppose we write it this way:

     <top-level decl> ::= [<class>] <decl>

     <decl> ::= <type> <typed decl> | <function decl>

     <typed decl> ::= <data list> | <function decl>

We can build a parsing routine for the class and type definitions, and have them store away their findings and go on, without their ever having to "know" whether a function or a data declaration is being processed.

To begin, key in the following version of the main program:

-- ------------------------------------------------------------
-- Main Program
: main ( -- )
        init
        BEGIN   
          Look ^Z <> 
        WHILE
          getclass
          gettype
          topdecl
        REPEAT ;
-- ------------------------------------------------------------

For the first round, just make the three procedures stubs that do nothing but call GetChar.

Does this program work? Well, it would be hard put not to, since we're not really asking it to do anything. It's been said that a C compiler will accept virtually any input without choking. It's certainly true of this compiler, since in effect all it does is to eat input characters until it finds a ^Z.

Next, let's make GetClass do something worthwhile. Declare the global variable

     0 VALUE class

and change GetClass to do the following:

-- ------------------------------------------------------------
-- Get a storage class specifier 
: getclass ( -- )
        Look 'a' =  Look 'x' = OR  Look 's' = OR
           IF Look TO class
              getchar
         ELSE 'a' TO class      
        ENDIF ;
-- ------------------------------------------------------------

Here, I've used three single characters to represent the three storage classes "auto," "extern," and "static." These are not the only three possible classes ... there are also "register" and "typedef," but this should give you the picture. Note that the default class is "auto."

We can do a similar thing for types. Enter the following word next:

-- ------------------------------------------------------------
-- Get a type specifier 
: gettype ( -- )
        BL TO typ
        Look 'u' 
         = IF   'u' TO sign
                'i' TO typ
                getchar
         ELSE   's' TO sign
        ENDIF
        Look 'i' = Look 'l' = OR  Look 'c' = OR
           IF   Look TO typ
                getchar
        ENDIF ;
-- ------------------------------------------------------------

Note that you must add two more global variables, Sign and Typ.

With these two procedures in place, the compiler will process the class and type definitions and store away their findings. We can now process the rest of the declaration.

We are by no means out of the woods yet, because there are still many complexities just in the definition of the type, before we even get to the actual data or function names. Let's pretend for the moment that we have passed all those gates, and that the next thing in the input stream is a name. If the name is followed by a left paren, we have a function declaration. If not, we have at least one data item, and possibly a list, each element of which can have an initializer.

Insert the following version of TopDecl:

-- ------------------------------------------------------------
-- Process a top-level declaration 
: topdecl ( -- )
        getname LOCAL Name
        Look '(' = IF  name dofunc 
                 ELSE  name dodata
                ENDIF ;
-- ------------------------------------------------------------

(Note that, since we have already read the name, we must pass it along to the appropriate routine.)

Finally, add the two procedures DoFunc and DoData:

-- -------------------------------------------------------------
-- Process a function definition 
: dofunc ( char -- )
        LOCAL n
        '(' match
        ')' match
        '{' match
        '}' match
        typ BL = IF  'i' TO typ  ENDIF
        CR class EMIT SPACE sign EMIT SPACE typ EMIT 
           ."  function " n EMIT ;
-- -------------------------------------------------------------
-- Process a Data Declaration 
: dodata ( char -- )
        LOCAL n
        typ BL = IF  S" Type declaration" expected  ENDIF
        CR class EMIT SPACE sign EMIT SPACE typ EMIT 
           ."  data " n EMIT 
        BEGIN  Look ',' = 
        WHILE  ',' match
               getname TO n
               CR class EMIT SPACE sign EMIT SPACE typ EMIT 
                  ."  data " n EMIT 
        REPEAT ';' match ;
-- ------------------------------------------------------------

Since we're still a long way from producing executable code, I decided to just have these two routines tell us what they found.

OK, give this program (chap9e.frt) a try. For data declarations, it's OK to give a list separated by commas. We can't process initializers as yet. We also can't process argument lists for the functions, but the "(){}" characters should be there.

We're still a very long way from having a C compiler, but what we have is starting to process the right kinds of inputs, and is recognizing both good and bad inputs. In the process, the natural structure of the compiler is starting to take form.

Can we continue this until we have something that acts more like a compiler. Of course we can. Should we? That's another matter. I don't know about you, but I'm beginning to get dizzy, and we've still got a long way to go to even get past the data declarations.

At this point, I think you can see how the structure of the compiler evolves from the language definition. The structures we've seen for our two examples, Pascal and C, are as different as night and day. Pascal was designed at least partly to be easy to parse, and that's reflected in the compiler. In general, in Pascal there is more structure and we have a better idea of what kinds of constructs to expect at any point. In C, on the other hand, the program is essentially a list of declarations, terminated only by the end of file.

We could pursue both of these structures much farther, but remember that our purpose here is not to build a Pascal or a C compiler, but rather to study compilers in general. For those of you who do want to deal with Pascal or C, I hope I've given you enough of a start so that you can take it from here (although you'll soon need some of the stuff we still haven't covered yet, such as typing and procedure calls). For the rest of you, stay with me through the next installment. There, I'll be leading you through the development of a complete compiler for TINY, a subset of KISS.

See you then.

*****************************************************************
*                                                               *
*                        COPYRIGHT NOTICE                       *
*                                                               *
*   Copyright (C) 1989 Jack W. Crenshaw. All rights reserved.   *
*                                                               *
*****************************************************************