This series of tutorials promises to be perhaps one of the longest-running mini-series in history, rivalled only by the delay in Volume IV of Knuth. Begun in 1988, the series ran into a four-year hiatus in 1990 when the "cares of this world," changes in priorities and interests, and the need to make a living seemed to stall it out after Installment 14. Those of you with loads of patience were finally rewarded, in the spring of last year, with the long-awaited Installment 15. In it, I began to try to steer the series back on track, and in the process, to make it easier to continue on to the goal, which is to provide you with not only enough understanding of the difficult subject of compiler theory, but also enough tools, in the form of canned subroutines and concepts, so that you would be able to continue on your own and become proficient enough to build your own parsers and translators. Because of that long hiatus, I thought it appropriate to go back and review the concepts we have covered so far, and to redo some of the software, as well. In the past, we've never concerned ourselves much with the development of production-quality software tools ... after all, I was trying to teach (and learn) concepts, not production practice. To do that, I tended to give you, not complete compilers or parsers, but only those snippets of code that illustrated the particular point we were considering at the moment.
I still believe that's a good way to learn any subject; no one wants to have to make changes to 100,000 line programs just to try out a new idea. But the idea of just dealing with code snippets, rather than complete programs, also has its drawbacks in that we often seemed to be writing the same code fragments over and over. Although repetition has been thoroughly proven to be a good way to learn new ideas, it's also true that one can have too much of a good thing. By the time I had completed Installment 14 I seemed to have reached the limits of my abilities to juggle multiple files and multiple versions of the same software functions. Who knows, perhaps that's one reason I seemed to have run out of gas at that point.
Fortunately, the later versions of iForth allow us to have our cake and eat it too. By using the concept of interpreting include files, we can still write small subroutines and functions, and keep our main programs and test programs small and simple. But, once written, the code in the Forth libs will always be there for us to use, and "linking" them in is totally painless and transparent.
Since, by now, most of you are programming in either C or C++, I know what you're thinking: DFW, with their iForth, certainly didn't invent the concept of interpretable include files. And of course you're right. But if you've not used Forth lately, or ever, you may not realize just how painless the whole process is. Even in C or C++, you still have to build a make file, either manually or by telling the compiler how to do so. You must also list, using "extern" statements or header files, the functions you want to import. In Forth, you don't even have to do that. You need only name the files you wish to use, and all of their words automatically become available.
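If you haven't seen it, here's roughly what that looks like in practice. The file names below are only my assumption about how the units might be stored on disk; substitute whatever names your own copies use:

-- A minimal sketch: pulling the support units into a test program.
-- No make file, no "extern" declarations -- just name the files, and
-- every word they define becomes available.
S" errors.frt"  INCLUDED
S" scanner.frt" INCLUDED
S" codegen.frt" INCLUDED
S" parser.frt"  INCLUDED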
It's not my intention to get into a language-war debate here, so I won't pursue the subject any further. Even I no longer use Forth on my job ... I use C at work and C++ for my articles in Embedded Systems Programming and other magazines. Believe me, when I set out to resurrect this series, I thought long and hard about switching both languages and target systems to the ones that we're all using these days, C/C++ and PC architecture, and possibly object-oriented methods as well. In the end, I felt it would cause more confusion than the hiatus itself has. And after all, Forth still remains one of the best possible languages for teaching, not to mention production programming. Finally, iForth still compiles at the speed of light, much faster than competing C/C++ compilers. And linking is of course simply not necessary!
For one of the few times in our lives, we don't have to compromise between completeness and efficiency.
When we're writing a Forth lib, we can make it as complete as we like, including any member functions and data items we may think we will ever need, confident that doing so will not create unwanted bloat in the executable.
The point, really, is simply this: By using iForth's lib mechanism, we can have all the advantages and convenience of writing small, seemingly stand-alone test programs, without having to constantly rewrite the support functions that we need. Once written, the libs sit there, quietly waiting to do their duty and give us the support we need, when we need it.
Using this principle, in Installment 15 I set out to minimize our tendency to re-invent the wheel by organizing our code into separate Forth files, each containing different parts of the compiler. We ended up with the following units:

    Input
    Output
    Errors
    Scanner
    Parser
    CodeGen
Each of these files serves a different function, and encapsulates specific areas of functionality. The Input and Output units, as their names imply, provide character stream I/O and the all-important lookahead character upon which our predictive parser is based. The Errors unit, of course, provides standard error handling. The Scanner unit contains all of our boolean functions such as Alpha?, and the routines GetName and GetNumber, which process multi-character tokens.
The two units we'll be working with the most, and the ones that most represent the personality of our compiler, are Parser and CodeGen. Theoretically, the Parser unit should encapsulate all aspects of the compiler that depend on the syntax of the compiled language (though, as we saw last time, a small amount of this syntax spills over into Scanner). Similarly, the code generator, CodeGen, contains all of the code dependent upon the target machine. In this installment, we'll be continuing with the development of the functions in these two all-important units.
Before we proceed, however, I think I should clarify the relationship between, and the functionality of these units. Those of you who are familiar with compiler theory as taught in universities will, of course, recognize the names, Scanner, Parser, and CodeGen, all of which are components of a classical compiler implementation. You may be thinking that I've abandoned my commitment to the KISS philosophy, and drifted towards a more conventional architecture than we once had. A closer look, however, should convince you that, while the names are similar, the functionalities are quite different.
Together, the scanner and parser of a classical implementation comprise the so-called "front end," and the code generator, the back end. The front end routines process the language-dependent, syntax-related aspects of the source language, while the code generator, or back end, deals with the target machine-dependent parts of the problem. In classical compilers, the two ends communicate via a file of instructions written in an intermediate language (IL).
Typically, a classical scanner is a single procedure, operating as a co-procedure with the parser. It "tokenizes" the source file, reading it character by character, recognizing language elements, translating them into tokens, and passing them along to the parser. You can think of the parser as an abstract machine, executing "op codes," which are the tokens. Similarly, the parser generates op codes of a second abstract machine, which mechanizes the IL. Typically, the IL file is written to disk by the parser, and read back again by the code generator.
Our organization is quite different. We have no lexical scanner, in the classical sense; our Scanner, though it has a similar name, is not a single procedure or co-procedure, but merely a set of separate subroutines which are called by the parser as needed.
Similarly, the classical code generator, the back end, is a translator in its own right, reading an IL "source" file, and emitting an object file. Our code generator doesn't work that way. In our compiler, there IS no intermediate language; every construct in the source language syntax is converted into assembly language as it is recognized by the parser. Like Scanner, the unit CodeGen consists of individual procedures which are called by the parser as needed.
This "code 'em as you find 'em" philosophy may not produce the world's most efficient code -- for example, we haven't provided (yet!) a convenient place for an optimizer to work its magic -- but it sure does simplify the compiler, doesn't it?
And that observation prompts me to reflect, once again, on how we have managed to reduce a compiler's functions to such comparatively simple terms. I've waxed eloquent on this subject in past installments, so I won't belabor the point too much here. However, because of the time that's elapsed since those last soliloquies, I hope you'll grant me just a little time to remind myself, as well as you, how we got here. We got here by applying several principles that writers of commercial compilers seldom have the luxury of using. These are:

    o  The KISS philosophy: Never do things the hard way without a reason
    o  Lazy coding: Never put off until tomorrow what you can put off forever (with credits to P.J. Plauger)
    o  Skepticism: A stubborn refusal to do something just because that's the way it's always been done
    o  Acceptance of inefficient code
    o  Rejection of arbitrary constraints
As I've reviewed the history of compiler construction, I've learned that virtually every production compiler in history has suffered from pre-imposed conditions that strongly influenced its design. The original FORTRAN compiler of John Backus, et al, had to compete with assembly language, and therefore was constrained to produce extremely efficient code. The IBM compilers for the minicomputers of the 70's had to run in the very small RAM memories then available -- as small as 4k. The early Ada compiler had to compile itself. Per Brinch Hansen decreed that his Pascal compiler developed for the IBM PC must execute in a 64k machine. Compilers developed in Computer Science courses had to compile the widest variety of languages, and therefore required LALR parsers.
In each of these cases, these preconceived constraints literally dominated the design of the compiler.
A good example is Brinch Hansen's compiler, described in his excellent book, "Brinch Hansen on Pascal Compilers" (highly recommended). Though his compiler is one of the most clear and un-obscure compiler implementations I've seen, that one decision, to compile large files in a small RAM, totally drives the design, and he ends up with not just one, but many intermediate files, together with the drivers to write and read them.
In time, the architectures resulting from such decisions have found their way into computer science lore as articles of faith. In this one man's opinion, it's time that they were re-examined critically. The conditions, environments, and requirements that led to classical architectures are not the same as the ones we have today. There's no reason to believe the solutions should be the same, either.
In this tutorial, we've followed the leads of such pioneers in the world of small compilers for Pcs as Leor Zolman, Ron Cain, and James Hendrix, who didn't know enough compiler theory to know that they "couldn't do it that way." We have resolutely refused to accept arbitrary constraints, but rather have done whatever was easy. As a result, we have evolved an architecture that, while quite different from the classical one, gets the job done in very simple and straightforward fashion.
I'll end this philosophizing with an observation re the notion of an intermediate language. While I've noted before that we don't have one in our compiler, that's not exactly true; we do have one, or at least are evolving one, in the sense that we are defining code generation functions for the parser to call. In essence, every call to a code generation procedure can be thought of as an instruction in an intermediate language. Should we ever find it necessary to formalize an intermediate language, this is the way we would do it: emit codes from the parser, each representing a call to one of the code generator procedures, and then process each code by calling those procedures in a separate pass, implemented in a back end. Frankly, I don't see that we'll ever find a need for this approach, but there is the connection, if you choose to follow it, between the classical and the current approaches.
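Just to make that connection concrete, here's a purely hypothetical sketch (nothing we'll actually use in this series) of one way the parser could record its calls to the code generator as IL "instructions," to be replayed later by a trivial back end. The buffer size and word names are my own inventions:

-- Hypothetical sketch only: record the execution tokens of CodeGen
-- words instead of executing them at once, then replay the recorded
-- "instructions" in a separate back-end pass.
CREATE il-buf 1024 CELLS ALLOT          -- room for 1024 IL "instructions"
0 VALUE il-len
: il,    ( xt -- )  il-buf il-len CELLS + !  1 +TO il-len ;
: run-il ( -- )     il-len 0 ?DO  il-buf I CELLS + @ EXECUTE  LOOP ;
-- The parser would then say   ['] _push il,   wherever it now says _push.

Again, I don't expect we'll ever need it; it's here only to show how short the bridge to a classical back end would be.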
Still, because of the total rewrite of the parsing modules, I was able to include only so much in the last installment. As a result, our hero, the parser, when last seen, was a shadow of his former self, consisting of only enough code to parse and process a factor made up of either a variable or a constant. The main effort of this current installment will be to help flesh the parser out to its former glory. In the process, I hope you'll bear with me if we sometimes cover ground we've long since been over and dealt with.
First, let's take care of a problem that we've addressed before: Our current version of procedure Factor, as we left it in Installment 15, can't handle negative arguments. To fix that, we'll introduce the procedure SignedFactor:
-- ------------------------------------------------------------
-- Parse and translate a factor with optional sign
: signedfactor ( -- )
    Look >R
    Look addop? IF getchar ENDIF
    factor
    R> '-' = IF _negate ENDIF ;
-- ------------------------------------------------------------
Note that this procedure calls a new code generation routine, _Negate:
-- ------------------------------------------------------------
-- Negate primary
: _negate ( -- )
    S" eax neg," emitln ;
-- ------------------------------------------------------------
(Here, and elsewhere in this series, I'm only going to show you the new routines. I'm counting on you to put them into the proper include file, which you should normally have no trouble identifying.)
In the main program, simply change the word called from Factor to SignedFactor (main1.frt), and give the code a test. Isn't it neat how Forth handles all the details?
Yes, I know, the code isn't very efficient. If we input a number, -3, the generated code is:
3 d# -> eax mov,
eax neg,
which is really, really dumb. We can do better, of course, by simply prepending a minus sign to the string passed to LoadConstant, but it adds a few lines of code to SignedFactor, and I'm applying the KISS philosophy very aggressively here. What's more, to tell the truth, I think I'm subconsciously enjoying generating "really, really dumb" code, so I can have the pleasure of watching it get dramatically better when we get into optimization methods.
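If you'd like to try the better approach anyway, here's one way it might look. This is a sketch only, not part of the tutorial's code: it assumes getnumber leaves its digit string on the stack as ( c-addr u ), that loadconstant accepts such a string, and it borrows the $+ string concatenation we'll use later in storevariable:

-- Sketch only: fold a leading '-' into the constant, so that "-3"
-- compiles to a single mov instead of a mov followed by a neg.
: signedfactor ( -- )
    Look >R
    Look addop? IF getchar ENDIF
    Look digit? R@ '-' = AND IF
       S" -" getnumber $+ loadconstant   -- e.g. "-3 d# -> eax mov,"
    ELSE
       factor  R@ '-' = IF _negate ENDIF
    ENDIF
    R> DROP ;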
Most of you have never heard of John Spray, so allow me to introduce him to you here. John's from New Zealand, and used to teach computer science at one of its universities. John wrote a compiler for the Motorola 6809, based on a delightful, Pascal-like language of his own design called "Whimsical." He later ported the compiler to the 68000, and for awhile it was the only compiler I had for my homebrewed 68000 system.
For the record, one of my standard tests for any new compiler is to see how the compiler deals with a null program like:
program main;
begin
end.
My test is to measure the time required to compile and link, and the size of the object file generated. The undisputed loser in the test is the DEC C compiler for the VAX, which took 60 seconds to compile, on a VAX 11/780, and generated a 50k object file. John's compiler is the undisputed, once, future, and forever king in the code size department. Given the null program, Whimsical generates precisely one byte of code, implementing the one instruction,

    RET
By setting a compiler option to generate an include file rather than a standalone program, John can even cut this size, from one byte to zero! Sort of hard to beat a null object file, wouldn't you say?
Needless to say, I consider John to be something of an expert on code optimization, and I like what he has to say: "The best way to optimize is not to have to optimize at all, but to produce good code in the first place." Words to live by. When we get started on optimization, we'll follow John's advice, and our first step will not be to add a peephole optimizer or other after-the-fact device, but to improve the quality of the code emitted before optimization. So make a note of SignedFactor as a good first candidate for attention, and for now we'll leave it be.
I'm sure you know what's coming next: we must, yet again, create the rest of the procedures that implement the recursive-descent parsing of an expression. We all know that the hierarchy of procedures for arithmetic expressions is:

    expression
        term
            factor
However, for now let's continue to do things one step at a time, and consider only expressions with additive terms in them. The code to implement expressions, including a possibly signed first term, is shown next:
-- ------------------------------------------------------------
-- Parse and translate an expression
: expression ( -- )
    signedfactor
    BEGIN Look addop?
    WHILE
        CASE Look
            '+' OF add      ENDOF
            '-' OF subtract ENDOF
        ENDCASE
    REPEAT ;
-- ------------------------------------------------------------

This procedure calls two other procedures to process the operations:
-- ------------------------------------------------------------
-- Parse and translate an addition operation
: add ( -- )
    '+' match  _push  factor  _popadd ;
-- ------------------------------------------------------------
-- Parse and translate a subtraction operation
: subtract ( -- )
    '-' match  _push  factor  _popsub ;
-- ------------------------------------------------------------
The three procedures _Push, _PopAdd, and _PopSub are new code generation routines. As the name implies, procedure _Push generates code to push the primary register (EAX, in our 80x86 implementation) onto the stack. _PopAdd and _PopSub pop the top of the stack again, and add it to, or subtract it from, the primary register. The code is shown next:
-- ----------------------------------------------------------------------------
-- Push primary to stack
: _push ( -- )
    S" eax push," emitln ;
-- ----------------------------------------------------------------------------
-- Add TOS to primary
: _popadd ( -- )
    S" [esp] -> eax add, [esp 4 +] -> esp lea," emitln ;
-- ----------------------------------------------------------------------------
-- Subtract TOS from primary
: _popsub ( -- )
    S" [esp] -> eax sub, [esp 4 +] -> esp lea," emitln  _negate ;
-- ----------------------------------------------------------------------------
Add these routines to Parser and CodeGen, and change the main program (main2.frt) to call Expression. Voila!
The next step, of course, is to add the capability for dealing with multiplicative terms. To that end, we'll add a word Term, and code generation procedures _PopMul and _PopDiv. These code generation procedures are shown next:
-- ---------------------------------------------------------------
-- Multiply TOS and primary
: _popmul ( -- )
    S" [esp] dword mul, [esp 4 +] -> esp lea," emitln ;
-- ---------------------------------------------------------------
-- Divide primary by TOS
: _popdiv ( -- )
    S" ecx pop, ecx -> eax xchg," emitln
    S" eax -> edx mov, #31 b# -> edx sar, ecx idiv," emitln ;
-- ---------------------------------------------------------------
I admit, the division routine is a little busy, but there's no help for it. Unfortunately, while the 80x86 CPU allows a division using the top of stack (TOS), it wants the arguments in the wrong order, just as it does for subtraction. So our only recourse is to pop the stack to a scratch register (ECX) and perform the division with that. Note the use of signed multiply and divide operations. This follows an implied, but unstated, assumption, that all our variables will be signed 32-bit integers. This decision will come back to haunt us later, when we start looking at multiple data types, type conversions, etc.
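To see what all this means in practice, here's roughly the sequence the parser will end up emitting for the expression a/b, assuming a and b are variables and that loadvariable produces the same load format you'll see in the larger example near the end of this installment:

A dword-ptr -> eax mov,                         ; load a
eax push,                                       ; push it
B dword-ptr -> eax mov,                         ; load b
ecx pop, ecx -> eax xchg,                       ; pop a; divisor b ends up in ECX, dividend a in EAX
eax -> edx mov, #31 b# -> edx sar, ecx idiv,    ; sign-extend and divide: EAX := a / b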
Our procedure Term is virtually a clone of Expression, and looks like this:
-- ------------------------------------------------------------
-- Parse and translate a term
: term ( -- )
    factor
    BEGIN Look mulop?
    WHILE
        _push
        CASE Look
            '*' OF multiply ENDOF
            '/' OF divide   ENDOF
        ENDCASE
    REPEAT ;
-- ------------------------------------------------------------
Our next step is to change some names. SignedFactor now becomes SignedTerm, and the calls to Factor in Expression, Add, Subtract and SignedTerm get changed to call Term:
-- ------------------------------------------------------------
-- Parse and translate a term with optional leading sign
: signedterm ( -- )
    Look >R
    Look addop? IF getchar ENDIF
    term
    R> '-' = IF _negate ENDIF ;
-- ------------------------------------------------------------

...

-- ------------------------------------------------------------
-- Parse and translate an expression
: expression ( -- )
    signedterm
    BEGIN Look addop?
    WHILE
        CASE Look
            '+' OF add      ENDOF
            '-' OF subtract ENDOF
        ENDCASE
    REPEAT ;
-- ------------------------------------------------------------
If memory serves me correctly, we once had both a procedure SignedFactor and a procedure SignedTerm. I had reasons for doing that at the time ... they had to do with the handling of Boolean algebra and, in particular, the Boolean "not" function. But certainly, for arithmetic operations, that duplication isn't necessary. In an expression like:

-x*y

it's very apparent that the sign goes with the whole term, x*y, and not just the factor x, and that's the way Expression is coded.
Test this new code by executing main2.frt. It still calls Expression, so you should now be able to deal with expressions containing any of the four arithmetic operators.
Our last bit of business, as far as expressions go, is to modify procedure Factor to allow for parenthetical expressions. By using a recursive call to Expression, we can reduce the needed code to virtually nothing. Five lines added to Factor do the job:
-- ------------------------------------------------------------
-- Parse and Translate a Factor
: factor ( -- )
    Look '(' = IF  '(' match  expression  ')' match  EXIT ENDIF
    Look digit? IF  getnumber loadconstant  EXIT ENDIF
    Look alpha? IF  getname loadvariable  EXIT ENDIF
    S" Unrecognized character `" Look CHAR-APPEND &' CHAR-APPEND errors ;
-- ------------------------------------------------------------
At this point, your "compiler" should be able to handle any legal expression you can throw at it. Better yet, it should reject all illegal ones!
As long as we're this close, we might as well handle the assignment statement. The code needs only to read and remember the name of the target variable, match an '=' sign, call Expression, then store the result. The procedure is shown next:
-- ------------------------------------------------------------
-- Parse and translate an assignment statement
: assignment ( -- )
    getname  DUP 1+ ALLOCATE ?ALLOCATE  LOCAL name
    name PACK DROP
    '=' match
    expression
    name COUNT storevariable
    name FREE ?ALLOCATE ;
-- ------------------------------------------------------------
The assignment calls for yet another code generation routine:
-- ------------------------------------------------------------
-- Store primary register in variable
: storevariable ( c-addr u -- )
    S" eax -> " 2SWAP $+  S" dword-ptr mov," $+  emitln ;
-- ------------------------------------------------------------
Now, change the call in Main to call Assignment (main3.frt), and you should see a full assignment statement being processed correctly. Pretty neat, eh? And painless, too.
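For example, given the input x=a+b, the output should look something like this (assuming the same load format as before; the exact spacing depends on how the strings are pasted together):

A dword-ptr -> eax mov,                     ; load a
eax push,                                   ; push it
B dword-ptr -> eax mov,                     ; load b
[esp] -> eax add, [esp 4 +] -> esp lea,     ; pop and add a
eax -> X dword-ptr mov,                     ; store the sum in x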
In the past, we've always tried to show BNF relations to define the syntax we're developing. I haven't done that here, and it's high time I did. Here's the BNF:
<factor>      ::= <variable> | <constant> | '(' <expression> ')'
<signed_term> ::= [<addop>] <term>
<term>        ::= <factor> (<mulop> <factor>)*
<expression>  ::= <signed_term> (<addop> <term>)*
<assignment>  ::= <variable> '=' <expression>
Pascal treats the Boolean operators pretty much identically to the way it treats arithmetic ones. A Boolean "and" has the same precedence level as multiplication, and the "or" as addition. C, on the other hand, sets them at different precedence levels, and all told has a whopping 17 levels. In our earlier work, I chose something in between, with seven levels. As a result, we ended up with things called Boolean expressions, paralleling in most details the arithmetic expressions, but at a different precedence level. All of this, as it turned out, came about because I didn't like having to put parentheses around the Boolean expressions in statements like:
IF (c >= 'A') and (c <= 'Z') then ...
In retrospect, that seems a pretty petty reason to add many layers of complexity to the parser. Perhaps more to the point, I'm not sure I was even able to avoid the parens.
For kicks, let's start anew, taking a more Pascal-ish approach, and just treat the Boolean operators at the same precedence level as the arithmetic ones. We'll see where it leads us. If it seems to be down the garden path, we can always backtrack to the earlier approach.

For starters, we'll add the "addition-level" operators to Expression. That's easily done; first, modify the function Addop? in Scanner to include two extra operators: '|' for "or," and '~' for "exclusive or":
-- ------------------------------------------------------------
: addop? ( char -- tf )
    DUP '+' =  OVER '-' = OR  OVER '|' = OR  SWAP '~' = OR ;
-- ------------------------------------------------------------

Next, we must include the parsing of the operators in procedure Expression:
-- ------------------------------------------------------------
:NONAME ( -- )
    signedterm
    BEGIN Look addop?
    WHILE
        CASE Look
            '+' OF add      ENDOF
            '-' OF subtract ENDOF
            '|' OF _or      ENDOF
            '~' OF _xor     ENDOF
        ENDCASE
    REPEAT ;  IS expression
-- ------------------------------------------------------------
(The underscores are needed, of course, because "or" and "xor" are reserved words in Forth.)
Next, the words _Or and _Xor:
-- ------------------------------------------------------------
-- Parse and translate OR
: _or ( -- )
    '|' match  _push  term  _popor ;
-- ------------------------------------------------------------
-- Parse and translate XOR
: _xor ( -- )
    '~' match  _push  term  _popxor ;
-- ------------------------------------------------------------
And, finally, the new code generator procedures:
-- --------------------------------------------------------------------
-- OR TOS to primary
: _popor ( -- )
    S" [esp] -> eax or, [esp 4 +] -> esp lea," emitln ;
-- --------------------------------------------------------------------
-- XOR TOS to primary
: _popxor ( -- )
    S" [esp] -> eax xor, [esp 4 +] -> esp lea," emitln ;
-- --------------------------------------------------------------------
Now, let's test the translator (you might want to change the call in Main back to a call to Expression, just to avoid having to type "x=" for an assignment every time).
So far, so good. The parser nicely handles expressions of the form:
x|y~z
Unfortunately, it also does nothing to protect us from mixing Boolean and arithmetic algebra. It will merrily generate code for:
(a+b)*(c~d)
We've talked about this a bit, in the past. In general the rules for what operations are legal or not cannot be enforced by the parser itself, because they are not part of the syntax of the language, but rather its semantics. A compiler that doesn't allow mixed-mode expressions of this sort must recognize that c and d are Boolean variables, rather than numeric ones, and balk at multiplying them in the next step. But this "policing" can't be done by the parser; it must be handled somewhere between the parser and the code generator. We aren't in a position to enforce such rules yet, because we haven't got either a way of declaring types, or a symbol table to store the types in. So, for what we've got to work with at the moment, the parser is doing precisely what it's supposed to do.
Anyway, are we sure that we don't want to allow mixed-type operations? We made the decision some time ago (or, at least, I did) to adopt the value 0 as a Boolean "false," and -1, or FFFFFFFFh, as a Boolean "true." The nice part about this choice is that bitwise operations work exactly the same way as logical ones. In other words, when we do an operation on one bit of a logical variable, we do it on all of them. This means that we don't need to distinguish between logical and bitwise operations, as is done in C with the operators & and &&, and | and ||. Reducing the number of operators by half certainly doesn't seem all bad.
From the point of view of the data in storage, of course, the computer and compiler couldn't care less whether the number FFFFFFFFh represents the logical TRUE, or the numeric -1. Should we? I sort of think not. I can think of many examples (though they might be frowned upon as "tricky" code) where the ability to mix the types might come in handy. Example, the Dirac delta function, which could be coded in one simple line:
-(x=0)
or the absolute value function (definitely tricky code!):
x*(1+2*(x<0))
(Both tricks lean on our choice of -1 for "true": -(x=0) yields 1 when x is zero and 0 otherwise, while the factor 1+2*(x<0) evaluates to -1 for negative x and +1 otherwise.)

Please note, I'm not advocating coding like this as a way of life. I'd almost certainly write these functions in more readable form, using IFs, just to keep from confusing later maintainers. Still, a moral question arises: Do we have the right to enforce our ideas of good coding practice on the programmer, by writing the language so he can't do anything else? That's what Niklaus Wirth did, in many places in Pascal, and Pascal has been criticized for it -- for not being as "forgiving" as C.
An interesting parallel presents itself in the example of the Motorola 68000 design. Though Motorola brags loudly about the orthogonality of their instruction set, the fact is that it's far from orthogonal. For example, you can read a variable from its address:
MOVE X,D0 (where X is the name of a variable)
but you can't write in the same way. To write, you must load an address register with the address of X. The same is true for PC-relative addressing:
MOVE X(PC),D0    (legal)
MOVE D0,X(PC)    (illegal)
When you begin asking how such non-orthogonal behavior came about, you find that someone in Motorola had some theories about how software should be written. Specifically, in this case, they decided that self-modifying code, which you can implement using PC-relative writes, is a Bad Thing. Therefore, they designed the processor to prohibit it. Unfortunately, in the process they also prohibited all writes of the forms shown above, however benign. Note that this was not something done by default. Extra design work had to be done, and extra gates added, to destroy the natural orthogonality of the instruction set.
One of the lessons I've learned from life: If you have two choices, and can't decide which one to take, sometimes the best thing to do is nothing. Why add extra gates to a processor to enforce some stranger's idea of good programming practice? Leave the instructions in, and let the programmers debate what good programming practice is. Similarly, why should we add extra code to our parser, to test for and prevent conditions that the user might prefer to do, anyway? I'd rather leave the compiler simple, and let the software experts debate whether the practices should be used or not.
All of which serves as rationalization for my decision as to how to prevent mixed-type arithmetic: I won't. For a language intended for systems programming, the fewer rules, the better. If you don't agree, and want to test for such conditions, we can do it once we have a symbol table.
With that bit of philosophy out of the way, we can press on to the "and" operator, which goes into procedure Term. By now, you can probably do this without me, but here's the code, anyway:
In Scanner,
-- ------------------------------------------------------------
: mulop? ( char -- tf )
    DUP '*' =  OVER '/' = OR  SWAP '&' = OR ;
-- ------------------------------------------------------------
In Parser,
-- ------------------------------------------------------------
-- Parse and translate a term
: term ( -- )
    factor
    BEGIN Look mulop?
    WHILE
        _push
        CASE Look
            '*' OF multiply ENDOF
            '/' OF divide   ENDOF
            '&' OF _and     ENDOF
        ENDCASE
    REPEAT ;
-- ------------------------------------------------------------
-- Parse and translate AND (Term has already pushed the primary)
: _and ( -- )
    '&' match  factor  _popand ;
-- ------------------------------------------------------------
and in CodeGen,
-- --------------------------------------------------------------------
-- AND TOS to primary
: _popand ( -- )
    S" [esp] -> eax and, [esp 4 +] -> esp lea," emitln ;
-- --------------------------------------------------------------------
Your parser should now be able to process almost any sort of logical expression, and (should you be so inclined), mixed-mode expressions as well.
Why not "all sorts of logical expressions"? Because, so far, we haven't
dealt with the logical "not" operator, and this is where it gets tricky.
The logical "not" operator seems, at first glance, to be identical in
its behavior to the unary minus, so my first thought was to let the
exclusive or operator, '~', double as the unary "not." That didn't
work. In my first attempt, procedure SignedTerm
simply ate my '~',
because the character passed the test for an addop, but SignedTerm
ignores all addops except '-'. It would have been easy enough to add
another line to SignedTerm
, but that would still not solve the problem,
because note that Expression
only accepts a signed term for the first
argument.
Mathematically, an expression like:
-a * -b
makes little or no sense, and the parser should flag it as an error. But the same expression, using a logical "not," makes perfect sense:
not a and not b
In the case of these unary operators, choosing to make them act the same way seems an artificial force fit, sacrificing reasonable behavior on the altar of implementational ease. While I'm all for keeping the implementation as simple as possible, I don't think we should do so at the expense of reasonableness. Patching like this would be missing the main point, which is that the logical "not" is simply not the same kind of animal as the unary minus. Consider the exclusive or, which is most naturally written as:
a~b ::= (a and not b) or (not a and b)
If we allow the "not" to modify the whole term, the last term in parentheses would be interpreted as:
not(a and b)
which is not the same thing at all. So it's clear that the logical "not" must be thought of as connected to the factor, not the term.
The idea of overloading the '~' operator also makes no sense from a mathematical point of view. The implication of the unary minus is that it's equivalent to a subtraction from zero:
-x <=> 0-x
In fact, in one of my more simple-minded versions of Expression, I reacted to a leading addop by simply preloading a zero, then processing the operator as though it were a binary operator. But a "not" is not equivalent to an exclusive or with zero ... that would just give back the original number. Instead, it's an exclusive or with FFFFFFFFh, or -1.
In short, the seeming parallel between the unary "not" and the unary minus falls apart under closer scrutiny. "not" modifies the factor, not the term, and it is related to neither the unary minus nor the exclusive or. Therefore, it deserves a symbol to call its own. What better symbol than the obvious one, also used by C, the '!' character? Using the rules about the way we think the "not" should behave, we should be able to code the exclusive or (assuming we'd ever need to), in the very natural form:
a & !b | !a & b
Note that no parentheses are required -- the precedence levels we've chosen automatically take care of things.
If you're keeping score on the precedence levels, this definition puts the '!' at the top of the heap. The levels become:

    1.  !
    2.  - (unary)
    3.  *, /, &
    4.  +, -, |, ~
Looking at this list, it's certainly not hard to see why we had trouble using '~' as the "not" symbol!
So how do we mechanize the rules? In the same way as we did with SignedTerm, but at the factor level. We'll define a procedure NotFactor:
-- ------------------------------------------------------------
-- Parse and translate a factor with optional "not"
: notfactor ( -- )
    Look '!' <> IF  factor  EXIT ENDIF
    '!' match  factor  _not ;
-- ------------------------------------------------------------
and call it from all the places where we formerly called Factor, i.e., from Term, Multiply, Divide, and _And. Note the new code generation procedure:
-- ------------------------------------------------------------
-- Bitwise NOT primary
: _not ( -- )
    S" eax not," emitln ;
-- ------------------------------------------------------------
Try this now, with a few simple cases. In fact, try that exclusive or example,
a&!b|!a&b
You should get the code (without the comments, of course):
A dword-ptr -> eax mov,                     ; load a
eax push,                                   ; push it
B dword-ptr -> eax mov,                     ; load b
eax not,                                    ; not it
[esp] -> eax and, [esp 4 +] -> esp lea,     ; and with a
eax push,                                   ; push result
A dword-ptr -> eax mov,                     ; load a
eax not,                                    ; not it
eax push,                                   ; push it
B dword-ptr -> eax mov,                     ; load b
[esp] -> eax and, [esp 4 +] -> esp lea,     ; and with !a
[esp] -> eax or, [esp 4 +] -> esp lea,      ; or with first term
That's precisely what we'd like to get. So, at least for both arithmetic and logical operators, our new precedence and new, slimmer syntax hang together. Even the peculiar, but legal, expression with leading addop:
~x
makes sense. SignedTerm ignores the leading '~', as it should, since the expression is equivalent to:

0~x

which is equal to x.
When we look at the BNF we've created, we find that our boolean algebra now adds only one extra line:
<not_factor>  ::= [!] <factor>
<factor>      ::= <variable> | <constant> | '(' <expression> ')'
<signed_term> ::= [<addop>] <term>
<term>        ::= <not_factor> (<mulop> <not_factor>)*
<expression>  ::= <signed_term> (<addop> <term>)*
<assignment>  ::= <variable> '=' <expression>
That's a big improvement over earlier efforts. Will our luck continue to hold when we get to relational operators? We'll find out soon, but it will have to wait for the next installment. We're at a good stopping place, and I'm anxious to get this installment into your hands. It's already been a year since the release of Installment 15. I blush to admit that all of this current installment has been ready for almost as long, with the exception of relational operators. But the information does you no good at all, sitting on my hard disk, and by holding it back until the relational operations were done, I've kept it out of your hands for that long. It's time for me to let go of it and get it out where you can get value from it. Besides, there are quite a number of serious philosophical questions associated with the relational operators, as well, and I'd rather save them for a separate installment where I can do them justice.
Have fun with the new, leaner arithmetic and logical parsing, and I'll see you soon with relationals.
*****************************************************************
*                                                               *
*                       COPYRIGHT NOTICE                        *
*                                                               *
*  Copyright (C) 1995 Jack W. Crenshaw. All rights reserved.    *
*                                                               *
*****************************************************************