LET'S BUILD A COMPILER!

By

Jack W. Crenshaw, Ph.D.

24 July 1988

Part 2: EXPRESSION PARSING

GETTING STARTED

If you've read the introduction document to this series, you will already know what we're about. You will also have copied the cradle software into your Forth system, and have compiled it. So you should be ready to go.

The purpose of this article is for us to learn how to parse and translate mathematical expressions. What we would like to see as output is a series of assembly-language statements that perform the desired actions. For purposes of definition, an expression is the right-hand side of an equation, as in

               x = 2*y + 3/(4*z)

In the early going, I'll be taking things in very small steps. That's so that the beginners among you won't get totally lost. There are also some very good lessons to be learned early on, that will serve us well later. For the more experienced readers: bear with me. We'll get rolling soon enough.

SINGLE DIGITS

In keeping with the whole theme of this series (KISS, remember?), let's start with the absolutely most simple case we can think of. That, to me, is an expression consisting of a single digit.

Before starting to code, make sure you have loaded a baseline copy of the "cradle" that I gave last time. We'll be using it again for other experiments. Then enter this code on the command line:

-- -------------------------------------------------------------
-- Parse and translate a math expression
: expression ( -- ) getnum (.)  S"  d# -> eax mov," $+ emitln ;
-- -------------------------------------------------------------

Now enter "CR init expression" on the Forth command line. Try any single-digit number as input. You should get a single line of assembly-language output (please note that there is obviously something wrong with the number put into EAX). Now try any other character as input, and you'll see that the parser properly reports an error.

Congratulations! You have just written a working translator!

OK, I grant you that it's pretty limited. But don't brush it off too lightly. This little "compiler" does, on a very limited scale, exactly what any larger compiler does: it correctly recognizes legal statements in the input "language" that we have defined for it, and it produces correct, executable assembler code, suitable for assembling into object format. Just as importantly, it correctly recognizes statements that are not legal, and gives a meaningful error message. Who could ask for more? As we expand our parser, we'd better make sure those two characteristics always hold true.

There are some other features of this tiny program worth mentioning. First, you can see that we don't separate code generation from parsing ... as soon as the parser knows what we want done, it generates the object code directly. In a real compiler, of course, the reads in GetChar would be from a disk file, and the writes to another disk file, but this way is much easier to deal with while we're experimenting.

Also note that an expression must leave a result somewhere. I've chosen the x86 register EAX. I could have made some other choices, but this one makes sense.
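
As a rough model of what this first translator does, here is a sketch in Python rather than Forth, so that it runs without the cradle. The names `ParseError` and `expression` are illustrative only, and the sketch assumes the digit itself is emitted as the immediate operand; it is not the cradle's actual code.

```python
class ParseError(Exception):
    """Raised when the input is not a legal expression."""


def expression(source):
    """Translate a one-digit expression into a single load of EAX."""
    look = source[:1]  # one character of lookahead, like Look in the cradle
    if not (look.isascii() and look.isdigit()):
        raise ParseError("Integer expected")
    # Assumption: the digit is emitted unchanged as the immediate operand.
    return [f"{look} d# -> eax mov,"]


if __name__ == "__main__":
    print(expression("7"))
```

Both halves of the contract are visible here: a legal input yields one instruction, and anything else yields an error instead of bad code.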

BINARY EXPRESSIONS

Now that we have that under our belt, let's branch out a bit. Admittedly, an "expression" consisting of only one character is not going to meet our needs for long, so let's see what we can do to extend it. Suppose we want to handle expressions of the form:

                         1+2
     or
                         4-3
     or, in general,
                     <term> +/- <term>

(That's a bit of Backus-Naur Form, or BNF.)

To do this we need a word that recognizes a term and leaves its result somewhere, and another that recognizes and distinguishes between a '+' and a '-' and generates the appropriate code. But if Expression is going to leave its result in EAX, where should Term leave its result? Answer: the same place. We're going to have to save the first result of Term somewhere before we get the next one.

OK, basically what we want to do is have the word Term do what Expression was doing before. So just rename the word Expression as Term (: term expression ;), and enter the following new version of Expression:

-- ------------------------------------------------------------
-- Recognize and translate an add
: add ( -- )
   '+' match
   term
   S" ebx -> eax  add," emitln ;
-- ------------------------------------------------------------
-- Recognize and translate a subtract
: subtract ( -- )
   '-' match
   term
   S" ebx -> eax  sub," emitln ;
-- ------------------------------------------------------------
-- Parse and translate an expression
: expression ( -- )
   term
   S" eax -> ebx  mov," emitln
   CASE Look
    '+' OF add       ENDOF
    '-' OF subtract  ENDOF
        S" Addop" expected
   ENDCASE ;
-- ------------------------------------------------------------

When you're finished with that, the order of the routines should be:

Term (The old Expression)
Add
Subtract
Expression

(Alternatively, load chap2a.frt).

Now enter "init expression" on the command line. Try any combination of two single digits, separated by a '+' or a '-', that you can think of. You should get a series of four assembly-language instructions out of each run. Now try some expressions with deliberate errors in them. Does the parser catch the errors?

Take a look at the object code generated. There are two observations we can make. First, the code generated is not what we would write ourselves. The sequence

        n d# -> eax  mov,
        eax  -> ebx  mov,

is inefficient. If we were writing this code by hand, we would probably just load the data directly to EBX.

There is a message here: code generated by our parser is less efficient than the code we would write by hand. Get used to it. That's going to be true throughout this series. It's true of all compilers to some extent. Computer scientists have devoted whole lifetimes to the issue of code optimization, and there are indeed things that can be done to improve the quality of code output. Some compilers do quite well, but there is a heavy price to pay in complexity, and it's a losing battle anyway ... there will probably never come a time when a good assembly-language programmer can't out-program a compiler. Before this session is over, I'll briefly mention some ways that we can do a little optimization, just to show you that we can indeed improve things without too much trouble. But remember, we're here to learn, not to see how tight we can make the object code. For now, and really throughout this series of articles, we'll studiously ignore optimization and concentrate on getting out code that works.

Speaking of which: ours doesn't! The code is wrong! As things are working now, the subtraction process subtracts EBX (which has the first argument in it) from EAX (which has the second). That's the wrong way, so we end up with the wrong sign for the result. So let's fix up the word Subtract with a sign-changer, so that it reads

-- -----------------------------------------------------------
-- Recognize and translate a subtract
: subtract ( -- )
   '-' match
   term
   S" ebx -> eax  sub," emitln
   S" eax  neg," emitln ;
-- -----------------------------------------------------------

Now our code (chap2b.frt) is even less efficient, but at least it gives the right answer! Unfortunately, the rules that give the meaning of math expressions require that the terms in an expression come out in an inconvenient order for us. Again, this is just one of those facts of life you learn to live with. This one will come back to haunt us when we get to division.
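
The operand-order problem is easy to check with a little arithmetic. The sketch below is a Python model of the registers (not generated code, and it ignores 32-bit wraparound); the function name `run_subtract` is made up for the illustration.

```python
def run_subtract(first, second, negate):
    """Model the registers for `first - second` as the parser emits it."""
    eax = first        # n d# -> eax mov,   (first term)
    ebx = eax          # eax -> ebx mov,    (save the first term)
    eax = second       # n d# -> eax mov,   (second term)
    eax = eax - ebx    # ebx -> eax sub,    (EAX := EAX - EBX)
    if negate:
        eax = -eax     # eax neg,           (the sign-changer)
    return eax


if __name__ == "__main__":
    print(run_subtract(4, 3, negate=False))
    print(run_subtract(4, 3, negate=True))
```

Without the negate step the model computes second minus first; with it, the result has the sign we expect for 4-3.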

OK, at this point we have a parser that can recognize the sum or difference of two digits. Earlier, we could only recognize a single digit. But real expressions can have either form (or an infinity of others). For kicks, go back and run the program with the single input line '1'.

Didn't work, did it? And why should it? We just finished telling our parser that the only kinds of expressions that are legal are those with two terms. We must rewrite the word Expression to be a lot more broadminded, and this is where things start to take the shape of a real parser.

GENERAL EXPRESSIONS

In the real world, an expression can consist of one or more terms, separated by "addops" ('+' or '-'). In BNF, this is written

          <expression> ::= <term> [<addop> <term>]*

We can accommodate this definition of an expression with the addition of a simple loop to the word Expression:

-- -------------------------------------------------------------
-- Parse and translate an expression
: expression ( -- )
   term
   BEGIN   Look '+' =  Look '-' = OR
   WHILE   S" eax -> ebx  mov," emitln
           CASE Look
            '+' OF add      ENDOF
            '-' OF subtract ENDOF
                S" Addop" expected
           ENDCASE
   REPEAT ;
-- -------------------------------------------------------------

Now we're getting somewhere! This version (chap2c.frt) handles any number of terms, and it only cost us two extra lines of code. As we go on, you'll discover that this is characteristic of top-down parsers ... it only takes a few lines of code to accommodate extensions to the language. That's what makes our incremental approach possible. Notice, too, how well the code of the word Expression matches the BNF definition. That, too, is characteristic of the method. As you get proficient in the approach, you'll find that you can turn BNF into parser code just about as fast as you can type!

OK, compile the new version of our parser, and give it a try. As usual, verify that the "compiler" can handle any legal expression, and will give a meaningful error message for an illegal one. Neat, eh? You might note that in our test version, any error message comes out sort of buried in whatever code had already been generated. But remember, that's just because we are using the CRT as our "output file" for this series of experiments. In a production version, the two outputs would be separated ... one to the output file, and one to the screen.
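
To show how directly the loop follows the BNF, here is a hedged Python sketch of the same structure. The class `Parser` and its method names are assumptions made for the illustration, the emitted strings are simplified copies of the ones in the listings, and, like the Forth version, it does not complain about trailing characters.

```python
class ParseError(Exception):
    pass


class Parser:
    def __init__(self, source):
        self.source = source
        self.pos = 0
        self.out = []  # emitted assembly lines

    @property
    def look(self):
        """Current lookahead character ('' at end of input)."""
        return self.source[self.pos:self.pos + 1]

    def match(self, ch):
        if self.look != ch:
            raise ParseError(f"'{ch}' expected")
        self.pos += 1

    def term(self):
        if not (self.look.isascii() and self.look.isdigit()):
            raise ParseError("Integer expected")
        self.out.append(f"{self.look} d# -> eax mov,")
        self.pos += 1

    def expression(self):
        # <expression> ::= <term> [<addop> <term>]*
        self.term()
        while self.look in ("+", "-"):
            self.out.append("eax -> ebx mov,")
            if self.look == "+":
                self.match("+")
                self.term()
                self.out.append("ebx -> eax add,")
            else:
                self.match("-")
                self.term()
                self.out.append("ebx -> eax sub,")
                self.out.append("eax neg,")
        return self.out


if __name__ == "__main__":
    for line in Parser("1+2-3").expression():
        print(line)
```

The `while` line is the `[ ... ]*` of the BNF, and a lone digit now passes because the loop body may run zero times.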

USING THE STACK

At this point I'm going to violate my rule that we don't introduce any complexity until it's absolutely necessary, long enough to point out a problem with the code we're generating. As things stand now, the parser uses EAX for the "primary" register, and EBX as a place to store the partial sum. That works fine for now, because as long as we deal with only the "addops" '+' and '-', any new term can be added in as soon as it is found. But in general that isn't true. Consider, for example, the expression

               1+(2-(3+(4-5)))

If we put the '1' in EBX, where do we put the '2'? Since a general expression can have any degree of complexity, we're going to run out of registers fast!

Fortunately, there's a simple solution. Like every modern microprocessor, the x86 has a stack, which is the perfect place to save a variable number of items. So instead of moving the term in EAX to EBX, let's just push it onto the stack. For the benefit of those unfamiliar with x86 assembly language, a push is written

               push,

and a pop,

               pop, .

So let's change the EmitLn in Expression to read:

                S" eax push," emitln

and the two lines in Add and Subtract to

               S" [esp] -> eax  add,  [esp 4 +] -> esp  lea," emitln

and

               S" [esp] -> eax  sub,  [esp 4 +] -> esp  lea," emitln

respectively. Now try the parser again (chap2d.frt) and make sure we haven't broken it.

Once again, the generated code is less efficient than before, but it's a necessary step, as you'll see.
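
One way to convince yourself that the pushes and pops balance is to execute the steps instead of emitting them. The following Python sketch does that for digits joined by '+' or '-'; it assumes well-formed input, and the name `run_sum` is invented for the example.

```python
def run_sum(source):
    """Execute, rather than emit, the push-based code for a sum of digits.

    Assumes well-formed input such as "9-4+2": digits separated by + or -.
    Returns the final EAX value and whatever is left on the stack.
    """
    stack = []                        # models the CPU stack
    eax = int(source[0])              # n d# -> eax mov,
    for i in range(1, len(source), 2):
        op, digit = source[i], source[i + 1]
        stack.append(eax)             # eax push,
        eax = int(digit)              # n d# -> eax mov,
        if op == "+":
            eax = eax + stack.pop()   # [esp] -> eax add, then drop the slot
        elif op == "-":
            eax = eax - stack.pop()   # [esp] -> eax sub, then drop the slot
            eax = -eax                # eax neg,
        else:
            raise ValueError("Addop expected")
    return eax, stack


if __name__ == "__main__":
    print(run_sum("9-4+2"))
```

The stack comes back empty after every run, which is what lets the same scheme nest to any depth later on.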

MULTIPLICATION AND DIVISION

Now let's get down to some REALLY serious business. As you all know, there are other math operators than "addops" ... expressions can also have multiply and divide operations. You also know that there is an implied operator precedence, or hierarchy, associated with expressions, so that in an expression like

                    2 + 3 * 4,

we know that we're supposed to multiply first, then add. (See why we needed the stack?)

In the early days of compiler technology, people used some rather complex techniques to ensure that the operator precedence rules were obeyed. It turns out, though, that none of this is necessary ... the rules can be accommodated quite nicely by our top-down parsing technique. Up till now, the only form that we've considered for a term is that of a single decimal digit.

More generally, we can define a term as a product of factors; i.e.,

          <term> ::= <factor>  [ <mulop> <factor> ]*

What is a factor? For now, it's what a term used to be ... a single digit.

Notice the symmetry: a term has the same form as an expression. As a matter of fact, we can add to our parser with a little judicious copying and renaming. But to avoid confusion, the listing below is the complete set of parsing routines. (Note the way we handle the reversal of operands in Divide.)

-- -------------------------------------------------------------
-- Parse and translate a math factor
: factor ( -- )
    getnum (.)  S"  d# -> eax mov," $+ emitln ;
-- -------------------------------------------------------------
-- Recognize and translate a multiply
: multiply  ( -- )
   '*' match
   factor
   S" [esp] dword mul, [esp 4 +] -> esp lea," emitln ;
-- -------------------------------------------------------------
-- Recognize and translate a divide
: divide ( -- )
   '/' match
   factor
   S" ebx pop, ebx -> eax xchg, eax -> edx mov, #31 b# -> edx sar, ebx idiv," emitln ;
-- -------------------------------------------------------------
-- Parse and translate a math term
: term ( -- )
   factor
   BEGIN   Look '*' =  Look '/' = OR
   WHILE   S" eax push," emitln
           CASE Look
             '*' OF multiply  ENDOF
             '/' OF divide    ENDOF
                    S" Mulop" expected
           ENDCASE
   REPEAT ;
-- -------------------------------------------------------------
-- Recognize and translate an add
: add ( -- )
   '+' match
   term
   S" [esp] -> eax add, [esp 4 +] -> esp lea," emitln ;
-- -------------------------------------------------------------
-- Recognize and translate a subtract
: subtract ( -- )
   '-' match
   term
   S" [esp] -> eax sub, [esp 4 +] -> esp lea, eax neg," emitln ;
-- -------------------------------------------------------------
-- Parse and translate an expression
: expression ( -- )
   term
   BEGIN   Look '+' =  Look '-' = OR
   WHILE   S" eax push," emitln
           CASE Look
            '+' OF add      ENDOF
            '-' OF subtract ENDOF
                   S" Addop" expected
           ENDCASE
   REPEAT ;
-- -------------------------------------------------------------

Hot dog! A nearly functional parser/translator, in only 52 lines of Forth (chap2e.frt)! The output is starting to look really useful, if you continue to overlook the inefficiency, which I hope you will. Remember, we're not trying to produce tight code here.
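
To see that splitting the grammar into terms and factors really does give multiplication and division their precedence, here is a small Python model that parses the same grammar but carries out each operation immediately instead of emitting it. The class `Machine` is an assumption for illustration; it folds the sub/neg and xchg/idiv sequences into plain arithmetic, and it truncates division toward zero, as idiv does.

```python
class ParseError(Exception):
    pass


class Machine:
    """Parses digits with + - * / and performs what the emitted code would."""

    def __init__(self, source):
        self.source, self.pos = source, 0
        self.eax, self.stack = 0, []

    @property
    def look(self):
        return self.source[self.pos:self.pos + 1]

    def factor(self):
        if not (self.look.isascii() and self.look.isdigit()):
            raise ParseError("Integer expected")
        self.eax = int(self.look)         # n d# -> eax mov,
        self.pos += 1

    def term(self):
        # <term> ::= <factor> [ <mulop> <factor> ]*
        self.factor()
        while self.look in ("*", "/"):
            op = self.look
            self.stack.append(self.eax)   # eax push,
            self.pos += 1                 # match the operator
            self.factor()
            first = self.stack.pop()
            if op == "*":
                self.eax = first * self.eax
            else:
                # Truncate toward zero, unlike Python's floor division.
                q = abs(first) // abs(self.eax)
                self.eax = q if (first < 0) == (self.eax < 0) else -q

    def expression(self):
        # <expression> ::= <term> [ <addop> <term> ]*
        self.term()
        while self.look in ("+", "-"):
            op = self.look
            self.stack.append(self.eax)   # eax push,
            self.pos += 1                 # match the operator
            self.term()
            first = self.stack.pop()
            self.eax = first + self.eax if op == "+" else first - self.eax
        return self.eax


if __name__ == "__main__":
    print(Machine("2+3*4").expression())
```

Because Expression only ever sees whole terms, the multiply inside a term is finished before the pending add gets its turn; no precedence table is needed.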

PARENTHESES

We can wrap up this part of the parser by adding parentheses to math expressions. As you know, parentheses are a mechanism to force a desired operator precedence. So, for example, in the expression

               2*(3+4) ,

the parentheses force the addition before the multiply. Much more importantly, though, parentheses give us a mechanism for defining expressions of any degree of complexity, as in

               (1+2)/((3+4)+(5-6))

The key to incorporating parentheses into our parser is to realize that no matter how complicated an expression enclosed by parentheses may be, to the rest of the world it looks like a simple factor. That is, one of the forms for a factor is:

          <factor> ::= (<expression>)

This is where the recursion comes in. An expression can contain a factor which contains another expression which contains a factor, etc., ad infinitum.

Complicated or not, we can take care of this by adding just a few lines of Forth to the word Factor:

-- -------------------------------------------------------------
-- Parse and translate a math factor

DEFER expression ( -- )

: factor ( -- )
   Look '(' = IF  '(' match  expression  ')' match  EXIT  ENDIF
   getnum (.)  S"  d# -> eax mov," $+ emitln ;
-- -------------------------------------------------------------

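Note again how easily we can extend the parser, and how well the Forth code matches the BNF syntax. Because Factor now calls Expression before Expression has been defined, the DEFER line creates a forward reference; once the real Expression has been compiled, it still has to be attached to that deferred word (in standard Forth, with IS).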

As usual, compile the new version and make sure that it correctly parses legal sentences, and flags illegal ones with an error message (chap2f.frt).
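
The recursion is easier to follow with the whole parser in one place. The sketch below is a Python model, not the Forth source: `Parser`, `_binary`, and the `CODE` table are names chosen for the illustration, and the emitted strings are copied from the listings above.

```python
class ParseError(Exception):
    pass


POP = "[esp 4 +] -> esp lea,"
CODE = {
    "+": f"[esp] -> eax add, {POP}",
    "-": f"[esp] -> eax sub, {POP} eax neg,",
    "*": f"[esp] dword mul, {POP}",
    "/": "ebx pop, ebx -> eax xchg, eax -> edx mov, "
         "#31 b# -> edx sar, ebx idiv,",
}


class Parser:
    def __init__(self, source):
        self.source, self.pos, self.out = source, 0, []

    @property
    def look(self):
        return self.source[self.pos:self.pos + 1]

    def match(self, ch):
        if self.look != ch:
            raise ParseError(f"'{ch}' expected")
        self.pos += 1

    def factor(self):
        # <factor> ::= <digit> | ( <expression> )
        if self.look == "(":
            self.match("(")
            self.expression()          # the recursive step
            self.match(")")
        elif self.look.isascii() and self.look.isdigit():
            self.out.append(f"{self.look} d# -> eax mov,")
            self.pos += 1
        else:
            raise ParseError("Integer expected")

    def _binary(self, operand, ops):
        """Shared loop: <operand> [ <op> <operand> ]*"""
        operand()
        while self.look in ops:
            op = self.look
            self.out.append("eax push,")
            self.match(op)
            operand()
            self.out.append(CODE[op])

    def term(self):
        self._binary(self.factor, ("*", "/"))

    def expression(self):
        self._binary(self.term, ("+", "-"))
        return self.out


if __name__ == "__main__":
    for line in Parser("2*(3+4)").expression():
        print(line)
```

In Python the forward reference from `factor` to `expression` needs no special declaration; that is the job the deferred word does in the Forth version.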

UNARY MINUS

At this point, we have a parser that can handle just about any expression, right? OK, try this input sentence:

                    -1

Woops! It doesn't work, does it? The word Expression expects everything to start with an integer, so it coughs up the leading minus sign. You'll find that +3 won't work either, nor will something like

                    -(3-2) .

There are a couple of ways to fix the problem. The easiest (although not necessarily the best) way is to stick an imaginary leading zero in front of expressions of this type, so that -3 becomes 0-3. We can easily patch this into our existing version of Expression:

-- -------------------------------------------------------------
-- Parse and translate an expression
: expression ( -- )
   Look addop? IF  S" eax -> eax  xor," emitln
             ELSE  term
            ENDIF
   BEGIN Look addop?
   WHILE   S" eax  push," emitln
           CASE Look
            '+' OF add      ENDOF
            '-' OF subtract ENDOF
                   S" Addop" expected
           ENDCASE
   REPEAT ;
-- -------------------------------------------------------------

I told you that making changes was easy! This time it cost us only three new lines of Forth. Note the new reference to the word Addop?. Since the test for an addop appeared twice, I chose to embed it in a new word, which must be compiled before Expression. The form of Addop? should be apparent from that for Alpha?. Here it is:

-- ------------------------------------------------------------
-- Recognize an addop
: addop? ( char -- tf ) DUP '+' =  SWAP '-' =  OR ;
-- ------------------------------------------------------------

OK, make these changes to the program (chap2g.frt) and recompile. You should also include Addop? in your baseline copy of the cradle. We'll be needing it again later. Now try the input -1 again. Wow! The efficiency of the code is pretty poor ... six lines of code just for loading a simple constant ... but at least it's correct. Remember, we're not trying to replace iForth here.
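
Here is the leading-zero trick as a hedged Python sketch, cut down to digits and addops so that the patch stands out. The names `is_addop` and `translate` are illustrative, and the emitted strings are simplified copies of those in the listing.

```python
class ParseError(Exception):
    pass


def is_addop(ch):
    """Python stand-in for the Forth word Addop?."""
    return ch in ("+", "-")


def translate(source):
    out = []
    pos = 0

    def look():
        return source[pos:pos + 1]

    def term():
        nonlocal pos
        if not (look().isascii() and look().isdigit()):
            raise ParseError("Integer expected")
        out.append(f"{look()} d# -> eax mov,")
        pos += 1

    if is_addop(look()):
        out.append("eax -> eax xor,")   # the imaginary leading zero
    else:
        term()
    while is_addop(look()):
        op = look()
        out.append("eax push,")
        pos += 1                         # match the addop
        term()
        if op == "+":
            out.append("[esp] -> eax add, [esp 4 +] -> esp lea,")
        else:
            out.append("[esp] -> eax sub, [esp 4 +] -> esp lea, eax neg,")
    return out


if __name__ == "__main__":
    for line in translate("-1"):
        print(line)
```

For the input -1 the model clears EAX, pushes it, loads the 1, and then subtracts and negates, which is the long way round to a constant but does give the right value.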

At this point we're just about finished with the structure of our expression parser. This version of the program should correctly parse and compile just about any expression you care to throw at it. It's still limited in that we can only handle factors involving single decimal digits. But I hope that by now you're starting to get the message that we can accommodate further extensions with just some minor changes to the parser. You probably won't be surprised to hear that a variable or even a function call is just another kind of a factor.

In the next session, I'll show you just how easy it is to extend our parser to take care of these things too, and I'll also show you just how easily we can accommodate multicharacter numbers and variable names. So you see, we're not far at all from a truly useful parser.

A WORD ABOUT OPTIMIZATION

Earlier in this session, I promised to give you some hints as to how we can improve the quality of the generated code. As I said, the production of tight code is not the main purpose of this series of articles. But you need to at least know that we aren't just wasting our time here ... that we can indeed modify the parser further to make it produce better code, without throwing away everything we've done to date. As usual, it turns out that some optimization is not that difficult to do ... it simply takes some extra code in the parser.

There are two basic approaches we can take:

1. Try to fix up the code after it's generated

   This is the concept of "peephole" optimization. The general idea is that we know what combinations of instructions the compiler is going to generate, and we also know which ones are pretty bad (such as the code for -1, above). So all we do is to scan the produced code, looking for those combinations, and replacing them by better ones. It's sort of a macro expansion, in reverse, and a fairly straightforward exercise in pattern-matching. The only complication, really, is that there may be a lot of such combinations to look for. It's called peephole optimization simply because it only looks at a small group of instructions at a time.

   Peephole optimization can have a dramatic effect on the quality of the code, with little change to the structure of the compiler itself. There is a price to pay, though, in the speed, size, and complexity of the compiler. Looking for all those combinations calls for a lot of IF tests, each one of which is a source of error. And, of course, it takes time.

   In the classical implementation of a peephole optimizer, it's done as a second pass to the compiler. The output code is written to disk, and then the optimizer reads and processes the disk file again. As a matter of fact, you can see that the optimizer could even be a separate program from the compiler proper. Since the optimizer only looks at the code through a small "window" of instructions (hence the name), a better implementation would be to simply buffer up a few lines of output, and scan the buffer after each EmitLn.

2. Try to generate better code in the first place

   This approach calls for us to look for special cases before we Emit them. As a trivial example, we should be able to identify a constant zero, and emit a XOR instead of a load, or even do nothing at all, as in an add of zero, for example. Closer to home, if we had chosen to recognize the unary minus in Factor instead of in Expression, we could treat constants like -1 as ordinary constants, rather than generating them from positive ones. None of these things are difficult to deal with ... they only add extra tests in the code, which is why I haven't included them in our program. The way I see it, once we get to the point that we have a working compiler, generating useful code that executes, we can always go back and tweak the thing to tighten up the code produced. That's why there are Release 2.0's in the world.
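
The buffered form of the peephole idea can be sketched in a few lines of Python. This is only an illustration with a single pattern: the function name `peephole` is invented, and the replacement line assumes the assembler accepts an immediate operand for add in the same notation it uses for mov.

```python
PUSH = "eax push,"
LOAD_SUFFIX = " d# -> eax mov,"
ADD_FROM_STACK = "[esp] -> eax add, [esp 4 +] -> esp lea,"


def peephole(lines):
    """Collapse `push eax / load constant / add from stack` into one add."""
    out = []
    for line in lines:
        out.append(line)
        # Only the last three buffered lines are examined: the "window".
        if (len(out) >= 3
                and out[-3] == PUSH
                and out[-2].endswith(LOAD_SUFFIX)
                and out[-1] == ADD_FROM_STACK):
            constant = out[-2].split()[0]
            del out[-3:]
            # Assumption: add takes an immediate in this notation.
            out.append(f"{constant} d# -> eax add,")
    return out


if __name__ == "__main__":
    generated = [
        "1 d# -> eax mov,",
        PUSH,
        "2 d# -> eax mov,",
        ADD_FROM_STACK,
    ]
    for line in peephole(generated):
        print(line)
```

The rewrite is safe because the pushed value is consumed by the very next add, so adding the constant to EAX directly leaves both EAX and the stack as they would have been. Every further pattern is another test of this kind, which is where the size and the risk of errors come from.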

There is one more type of optimization worth mentioning, that seems to promise pretty tight code without too much hassle. It's my "invention" in the sense that I haven't seen it suggested in print anywhere, though I have no illusions that it's original with me.

This is to avoid such a heavy use of the stack, by making better use of the CPU registers. Remember back when we were doing only addition and subtraction, that we used registers EAX and EBX, rather than the stack? It worked, because with only those two operations, the "stack" never needs more than two entries.

Well, the x86 has eight general-purpose registers (although one of them, ESP, is already spoken for as the CPU stack pointer). Why not use them as a privately managed stack? The key is to recognize that, at any point in its processing, the parser knows how many items are on the stack, so it can indeed manage it properly. We can define a private "stack pointer" that keeps track of which stack level we're at, and addresses the corresponding register. The word Factor, for example, would not cause data to be loaded into register EAX, but into whatever the current "top-of-stack" register happened to be.

What we're doing in effect is to replace the CPU's RAM stack with a locally managed stack made up of registers. For most expressions, the stack level will never exceed eight, so we'll get pretty good code out. Of course, we also have to deal with those odd cases where the stack level does exceed eight, but that's no problem either. We simply let the stack spill over into the CPU stack. For levels beyond eight, the code is no worse than what we're generating now, and for levels less than eight, it's considerably better.

For the record, I have implemented this concept, just to make sure it works before I mentioned it to you. It does. In practice, it turns out that you can't really use all eight levels ... you need at least one register free to reverse the operand order for division (luckily the x86 has an XCHG, like XTHL of the 8080!). For expressions that include function calls, we would also need a register reserved for them. Still, there is a nice improvement in code size for most expressions.
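
A minimal sketch of such a private register stack, in Python, might look like the following. Everything in it is an assumption made for illustration: the choice of four registers, keeping EAX free as scratch, and the names `RegisterStack` and `load_constant`. It is not the implementation mentioned above.

```python
SCRATCH = "eax"                            # kept free, not a stack level
REGISTERS = ("ebx", "ecx", "esi", "edi")   # assumed register levels


class RegisterStack:
    """Compile-time bookkeeping: which register holds each stack level."""

    def __init__(self, registers=REGISTERS):
        self.registers = tuple(registers)
        self.depth = 0                     # the private "stack pointer"

    def location(self, level):
        if level <= len(self.registers):
            return self.registers[level - 1]
        return "cpu-stack"                 # spilled past the registers

    def push(self):
        self.depth += 1
        return self.location(self.depth)

    def pop(self):
        if self.depth == 0:
            raise IndexError("private stack is empty")
        where = self.location(self.depth)
        self.depth -= 1
        return where


def load_constant(stack, n):
    """What Factor might emit: load into the current top-of-stack level."""
    where = stack.push()
    if where == "cpu-stack":
        return [f"{n} d# -> {SCRATCH} mov,", f"{SCRATCH} push,"]
    return [f"{n} d# -> {where} mov,"]


if __name__ == "__main__":
    stack = RegisterStack()
    for n in range(1, 6):
        print(load_constant(stack, n))
```

The depth counter lives entirely in the parser, so no run-time code is spent tracking it; only the loads and the occasional spill reach the output.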

So, you see, getting better code isn't that difficult, but it does add complexity to our translator ... complexity we can do without at this point. For that reason, I strongly suggest that we continue to ignore efficiency issues for the rest of this series, secure in the knowledge that we can indeed improve the code quality without throwing away what we've done.

Next lesson, I'll show you how to deal with variables, factors, and function calls. I'll also show you just how easy it is to handle multicharacter tokens and embedded white space.

*****************************************************************
*                                                               *
*                        COPYRIGHT NOTICE                       *
*                                                               *
*   Copyright (C) 1988 Jack W. Crenshaw. All rights reserved.   *
*                                                               *
*****************************************************************
