This series of tutorials promises to be perhaps one of the longest-running mini-series in history, rivalled only by the delay in Volume IV of Knuth. Begun in 1988, the series ran into a four-year hiatus in 1990 when the "cares of this world," changes in priorities and interests, and the need to make a living seemed to stall it out after Installment 14. Those of you with loads of patience were finally rewarded, in the spring of last year, with the long-awaited Installment 15. In it, I began to try to steer the series back on track, and in the process, to make it easier to continue on to the goal, which is to provide you with not only enough understanding of the difficult subject of compiler theory, but also enough tools, in the form of canned subroutines and concepts, so that you would be able to continue on your own and become proficient enough to build your own parsers and translators. Because of that long hiatus, I thought it appropriate to go back and review the concepts we have covered so far, and to redo some of the software, as well. In the past, we've never concerned ourselves much with the development of production-quality software tools ... after all, I was trying to teach (and learn) concepts, not production practice. To do that, I tended to give you, not complete compilers or parsers, but only those snippets of code that illustrated the particular point we were considering at the moment.
I still believe that's a good way to learn any subject; no one wants to have to make changes to 100,000 line programs just to try out a new idea. But the idea of just dealing with code snippets, rather than complete programs, also has its drawbacks in that we often seemed to be writing the same code fragments over and over. Although repetition has been thoroughly proven to be a good way to learn new ideas, it's also true that one can have too much of a good thing. By the time I had completed Installment 14 I seemed to have reached the limits of my abilities to juggle multiple files and multiple versions of the same software functions. Who knows, perhaps that's one reason I seemed to have run out of gas at that point.
Fortunately, the later versions of iForth allow us to have our cake and eat it too. By using the concept of interpreting include files, we can still write small subroutines and functions, and keep our main programs and test programs small and simple. But, once written, the code in the Forth libs will always be there for us to use, and "linking" them in is totally painless and transparent.
Since, by now, most of you are programming in either C or C++, I know what you're thinking: DFW, with their iForth, certainly didn't invent the concept of interpretable include files. And of course you're right. But if you've not used Forth lately, or ever, you may not realize just how painless the whole process is. Even in C or C++, you still have to build a make file, either manually or by telling the compiler how to do so. You must also list, using "extern" statements or header files, the functions you want to import. In Forth, you don't even have to do that. You need only name the files you wish to use, and all of their words automatically become available.
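If you haven't seen it, here's roughly what that looks like in practice. The file names below are only my assumption about how the units might be stored on disk; substitute whatever names your own copies use:

-- A minimal sketch: pulling the support units into a test program.
-- No make file, no "extern" declarations -- just name the files, and
-- every word they define becomes available.
S" errors.frt"  INCLUDED
S" scanner.frt" INCLUDED
S" codegen.frt" INCLUDED
S" parser.frt"  INCLUDED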
It's not my intention to get into a language-war debate here, so I won't pursue the subject any further. Even I no longer use Forth on my job ... I use C at work and C++ for my articles in Embedded Systems Programming and other magazines. Believe me, when I set out to resurrect this series, I thought long and hard about switching both languages and target systems to the ones that we're all using these days, C/C++ and PC architecture, and possibly object-oriented methods as well. In the end, I felt it would cause more confusion than the hiatus itself has. And after all, Forth still remains one of the best possible languages for teaching, not to mention production programming. Finally, iForth still compiles at the speed of light, much faster than competing C/C++ compilers. And linking is of course simply not necessary!
For one of the few times in our lives, we don't have to compromise between completeness and efficiency.
When we're writing a Forth lib, we can make it as complete as we like, including any member functions and data items we may think we will ever need, confident that doing so will not create unwanted bloat in the executable.
The point, really, is simply this: By using iForth's lib mechanism, we can have all the advantages and convenience of writing small, seemingly stand-alone test programs, without having to constantly rewrite the support functions that we need. Once written, the libs sit there, quietly waiting to do their duty and give us the support we need, when we need it.
Using this principle, in Installment 15 I set out to minimize our tendency to re-invent the wheel by organizing our code into separate Forth files, each containing different parts of the compiler. We ended up with the following units:

    Input
    Output
    Errors
    Scanner
    Parser
    CodeGen
Each of these files serves a different function, and encapsulates specific areas of functionality. The Input and Output units, as their names imply, provide character stream I/O and the all-important lookahead character upon which our predictive parser is based. The Errors unit, of course, provides standard error handling. The Scanner unit contains all of our boolean functions such as Alpha?, and the routines GetName and GetNumber, which process multi-character tokens.
The two units we'll be working with the most, and the ones that most represent the personality of our compiler, are Parser and CodeGen. Theoretically, the Parser unit should encapsulate all aspects of the compiler that depend on the syntax of the compiled language (though, as we saw last time, a small amount of this syntax spills over into Scanner). Similarly, the code generator, CodeGen, contains all of the code dependent upon the target machine. In this installment, we'll be continuing with the development of the functions in these two all-important units.
Before we proceed, however, I think I should clarify the relationship between, and the functionality of these units. Those of you who are familiar with compiler theory as taught in universities will, of course, recognize the names, Scanner, Parser, and CodeGen, all of which are components of a classical compiler implementation. You may be thinking that I've abandoned my commitment to the KISS philosophy, and drifted towards a more conventional architecture than we once had. A closer look, however, should convince you that, while the names are similar, the functionalities are quite different.
Together, the scanner and parser of a classical implementation comprise the so-called "front end," and the code generator, the back end. The front end routines process the language-dependent, syntax-related aspects of the source language, while the code generator, or back end, deals with the target machine-dependent parts of the problem. In classical compilers, the two ends communicate via a file of instructions written in an intermediate language (IL).
Typically, a classical scanner is a single procedure, operating as a co-procedure with the parser. It "tokenizes" the source file, reading it character by character, recognizing language elements, translating them into tokens, and passing them along to the parser. You can think of the parser as an abstract machine, executing "op codes," which are the tokens. Similarly, the parser generates op codes of a second abstract machine, which mechanizes the IL. Typically, the IL file is written to disk by the parser, and read back again by the code generator.
Our organization is quite different. We have no lexical scanner, in the classical sense; our Scanner, though it has a similar name, is not a single procedure or co-procedure, but merely a set of separate subroutines which are called by the parser as needed.
Similarly, the classical code generator, the back end, is a translator in its own right, reading an IL "source" file, and emitting an object file. Our code generator doesn't work that way. In our compiler, there IS no intermediate language; every construct in the source language syntax is converted into assembly language as it is recognized by the parser. Like Scanner, the unit CodeGen consists of individual procedures which are called by the parser as needed.
This "code 'em as you find 'em" philosophy may not produce the world's most efficient code -- for example, we haven't provided (yet!) a convenient place for an optimizer to work its magic -- but it sure does simplify the compiler, doesn't it?
And that observation prompts me to reflect, once again, on how we have managed to reduce a compiler's functions to such comparatively simple terms. I've waxed eloquent on this subject in past installments, so I won't belabor the point too much here. However, because of the time that's elapsed since those last soliloquies, I hope you'll grant me just a little time to remind myself, as well as you, how we got here. We got here by applying several principles that writers of commercial compilers seldom have the luxury of using. These are:

    o  The KISS philosophy: Never do things the hard way without a reason
    o  Lazy coding: Never put off until tomorrow what you can put off forever (with credits to P.J. Plauger)
    o  Skepticism: A stubborn refusal to do something just because that's the way it's always been done
    o  Acceptance of inefficient code
    o  Rejection of arbitrary constraints
As I've reviewed the history of compiler construction, I've learned that virtually every production compiler in history has suffered from pre-imposed conditions that strongly influenced its design. The original FORTRAN compiler of John Backus, et al, had to compete with assembly language, and therefore was constrained to produce extremely efficient code. The IBM compilers for the minicomputers of the 70's had to run in the very small RAM memories then available -- as small as 4k. The early Ada compiler had to compile itself. Per Brinch Hansen decreed that his Pascal compiler developed for the IBM PC must execute in a 64k machine. Compilers developed in Computer Science courses had to compile the widest variety of languages, and therefore required LALR parsers.
In each of these cases, these preconceived constraints literally dominated the design of the compiler.
A good example is Brinch Hansen's compiler, described in his excellent book, "Brinch Hansen on Pascal Compilers" (highly recommended). Though his compiler is one of the most clear and un-obscure compiler implementations I've seen, that one decision, to compile large files in a small RAM, totally drives the design, and he ends up with not just one, but many intermediate files, together with the drivers to write and read them.
In time, the architectures resulting from such decisions have found their way into computer science lore as articles of faith. In this one man's opinion, it's time that they were re-examined critically. The conditions, environments, and requirements that led to classical architectures are not the same as the ones we have today. There's no reason to believe the solutions should be the same, either.
In this tutorial, we've followed the leads of such pioneers in the world of small compilers for Pcs as Leor Zolman, Ron Cain, and James Hendrix, who didn't know enough compiler theory to know that they "couldn't do it that way." We have resolutely refused to accept arbitrary constraints, but rather have done whatever was easy. As a result, we have evolved an architecture that, while quite different from the classical one, gets the job done in very simple and straightforward fashion.
I'll end this philosophizing with an observation re the notion of an intermediate language. While I've noted before that we don't have one in our compiler, that's not exactly true; we do have one, or at least are evolving one, in the sense that we are defining code generation functions for the parser to call. In essence, every call to a code generation procedure can be thought of as an instruction in an intermediate language. Should we ever find it necessary to formalize an intermediate language, this is the way we would do it: emit codes from the parser, each representing a call to one of the code generator procedures, and then process each code by calling those procedures in a separate pass, implemented in a back end. Frankly, I don't see that we'll ever find a need for this approach, but there is the connection, if you choose to follow it, between the classical and the current approaches.
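Just to make that connection concrete, here's a purely hypothetical sketch (nothing we'll actually use in this series) of one way the parser could record its calls to the code generator as IL "instructions," to be replayed later by a trivial back end. The buffer size and word names are my own inventions:

-- Hypothetical sketch only: record the execution tokens of CodeGen
-- words instead of executing them at once, then replay the recorded
-- "instructions" in a separate back-end pass.
CREATE il-buf 1024 CELLS ALLOT          -- room for 1024 IL "instructions"
0 VALUE il-len
: il,    ( xt -- )  il-buf il-len CELLS + !  1 +TO il-len ;
: run-il ( -- )     il-len 0 ?DO  il-buf I CELLS + @ EXECUTE  LOOP ;
-- The parser would then say   ['] _push il,   wherever it now says _push.

Again, I don't expect we'll ever need it; it's here only to show how short the bridge to a classical back end would be.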
Still, because of the total rewrite of the parsing modules, I was able to include only so much in the last installment. As a result, our hero, the parser, when last seen, was a shadow of his former self, consisting of only enough code to parse and process a factor made up of either a variable or a constant. The main effort of this current installment will be to help flesh the parser out to its former glory. In the process, I hope you'll bear with me if we sometimes cover ground we've long since been over and dealt with.
First, let's take care of a problem that we've addressed before: Our current version of procedure Factor, as we left it in Installment 15, can't handle negative arguments. To fix that, we'll introduce the procedure SignedFactor:
-- ------------------------------------------------------------
-- Parse and translate a factor with optional sign
: signedfactor ( -- )
    Look >R
    Look addop? IF getchar ENDIF
    factor
    R> '-' = IF _negate ENDIF ;
-- ------------------------------------------------------------
Note that this procedure calls a new code generation routine, _Negate:
-- ------------------------------------------------------------
-- Negate primary
: _negate ( -- )
    S" eax neg," emitln ;
-- ------------------------------------------------------------
(Here, and elsewhere in this series, I'm only going to show you the new routines. I'm counting on you to put them into the proper include file, which you should normally have no trouble identifying.)
In the main program, simply change the word called from Factor to SignedFactor (main1.frt), and give the code a test. Isn't it neat how Forth handles all the details?
Yes, I know, the code isn't very efficient. If we input a number, -3, the generated code is:
3 d# -> eax mov,
eax neg,
which is really, really dumb. We can do better, of course, by simply prepending a minus sign to the string passed to LoadConstant, but it adds a few lines of code to SignedFactor, and I'm applying the KISS philosophy very aggressively here. What's more, to tell the truth, I think I'm subconsciously enjoying generating "really, really dumb" code, so I can have the pleasure of watching it get dramatically better when we get into optimization methods.
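If you'd like to try the better approach anyway, here's one way it might look. This is a sketch only, not part of the tutorial's code: it assumes getnumber leaves its digit string on the stack as ( c-addr u ), that loadconstant accepts such a string, and it borrows the $+ string concatenation we'll use later in storevariable:

-- Sketch only: fold a leading '-' into the constant, so that "-3"
-- compiles to a single mov instead of a mov followed by a neg.
: signedfactor ( -- )
    Look >R
    Look addop? IF getchar ENDIF
    Look digit? R@ '-' = AND IF
       S" -" getnumber $+ loadconstant   -- e.g. "-3 d# -> eax mov,"
    ELSE
       factor  R@ '-' = IF _negate ENDIF
    ENDIF
    R> DROP ;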
Most of you have never heard of John Spray, so allow me to introduce him to you here. John's from New Zealand, and used to teach computer science at one of its universities. John wrote a compiler for the Motorola 6809, based on a delightful, Pascal-like language of his own design called "Whimsical." He later ported the compiler to the 68000, and for awhile it was the only compiler I had for my homebrewed 68000 system.
For the record, one of my standard tests for any new compiler is to see how the compiler deals with a null program like:
program main;
begin
end.
My test is to measure the time required to compile and link, and the size of the object file generated. The undisputed loser in the test is the DEC C compiler for the VAX, which took 60 seconds to compile, on a VAX 11/780, and generated a 50k object file. John's compiler is the undisputed, once, future, and forever king in the code size department. Given the null program, Whimsical generates precisely one byte of code, implementing the one instruction,

    RET
By setting a compiler option to generate an include file rather than a standalone program, John can even cut this size, from one byte to zero! Sort of hard to beat a null object file, wouldn't you say?
Needless to say, I consider John to be something of an expert on code optimization, and I like what he has to say: "The best way to optimize is not to have to optimize at all, but to produce good code in the first place." Words to live by. When we get started on optimization, we'll follow John's advice, and our first step will not be to add a peephole optimizer or other after-the-fact device, but to improve the quality of the code emitted before optimization. So make a note of SignedFactor as a good first candidate for attention, and for now we'll leave it be.
I'm sure you know what's coming next: we must, yet again, create the rest of the procedures that implement the recursive-descent parsing of an expression. We all know that the hierarchy of procedures for arithmetic expressions is:

    expression
        term
            factor
However, for now let's continue to do things one step at a time, and consider only expressions with additive terms in them. The code to implement expressions, including a possibly signed first term, is shown next:
-- ------------------------------------------------------------
-- Parse and translate an expression
: expression ( -- )
    signedfactor
    BEGIN Look addop?
    WHILE
        CASE Look
            '+' OF add      ENDOF
            '-' OF subtract ENDOF
        ENDCASE
    REPEAT ;
-- ------------------------------------------------------------

This procedure calls two other procedures to process the operations:
-- ------------------------------------------------------------
-- Parse and translate an addition operation
: add ( -- )
    '+' match  _push  factor  _popadd ;
-- ------------------------------------------------------------
-- Parse and translate a subtraction operation
: subtract ( -- )
    '-' match  _push  factor  _popsub ;
-- ------------------------------------------------------------
The three procedures _Push, _PopAdd, and _PopSub are new code generation routines. As the name implies, procedure _Push generates code to push the primary register (EAX, in our 80x86 implementation) onto the stack. _PopAdd and _PopSub pop the top of the stack again, and add it to, or subtract it from, the primary register. The code is shown next:
-- ----------------------------------------------------------------------------
-- Push primary to stack
: _push ( -- )
    S" eax push," emitln ;
-- ----------------------------------------------------------------------------
-- Add TOS to primary
: _popadd ( -- )
    S" [esp] -> eax add, [esp 4 +] -> esp lea," emitln ;
-- ----------------------------------------------------------------------------
-- Subtract TOS from primary
: _popsub ( -- )
    S" [esp] -> eax sub, [esp 4 +] -> esp lea," emitln  _negate ;
-- ----------------------------------------------------------------------------
Add these routines to Parser and CodeGen, and change the main program (main2.frt) to call Expression. Voila!
The next step, of course, is to add the capability for dealing with multiplicative terms. To that end, we'll add a word Term, and code generation procedures _PopMul and _PopDiv. These code generation procedures are shown next:
-- ---------------------------------------------------------------
-- Multiply TOS and primary
: _popmul ( -- )
    S" [esp] dword mul, [esp 4 +] -> esp lea," emitln ;
-- ---------------------------------------------------------------
-- Divide primary by TOS
: _popdiv ( -- )
    S" ecx pop, ecx -> eax xchg," emitln
    S" eax -> edx mov, #31 b# -> edx sar, ecx idiv," emitln ;
-- ---------------------------------------------------------------
I admit, the division routine is a little busy, but there's no help for it. Unfortunately, while the 80x86 CPU allows a division using the top of stack (TOS), it wants the arguments in the wrong order, just as it does for subtraction. So our only recourse is to pop the stack to a scratch register (ECX) and perform the division with that. Note the use of signed multiply and divide operations. This follows an implied, but unstated, assumption, that all our variables will be signed 32-bit integers. This decision will come back to haunt us later, when we start looking at multiple data types, type conversions, etc.
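To see what all this means in practice, here's roughly the sequence the parser will end up emitting for the expression a/b, assuming a and b are variables and that loadvariable produces the same load format you'll see in the larger example near the end of this installment:

A dword-ptr -> eax mov,                         ; load a
eax push,                                       ; push it
B dword-ptr -> eax mov,                         ; load b
ecx pop, ecx -> eax xchg,                       ; pop a; divisor b ends up in ECX, dividend a in EAX
eax -> edx mov, #31 b# -> edx sar, ecx idiv,    ; sign-extend and divide: EAX := a / b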
Our procedure Term is virtually a clone of Expression, and looks like this:
-- ------------------------------------------------------------
-- Parse and translate a term
: term ( -- )
    factor
    BEGIN Look mulop?
    WHILE
        _push
        CASE Look
            '*' OF multiply ENDOF
            '/' OF divide   ENDOF
        ENDCASE
    REPEAT ;
-- ------------------------------------------------------------
Our next step is to change some names. SignedFactor now becomes SignedTerm, and the calls to Factor in Expression, Add, Subtract and SignedTerm get changed to call Term:
-- ------------------------------------------------------------
-- Parse and translate a term with optional leading sign
: signedterm ( -- )
    Look >R
    Look addop? IF getchar ENDIF
    term
    R> '-' = IF _negate ENDIF ;
-- ------------------------------------------------------------

...

-- ------------------------------------------------------------
-- Parse and translate an expression
: expression ( -- )
    signedterm
    BEGIN Look addop?
    WHILE
        CASE Look
            '+' OF add      ENDOF
            '-' OF subtract ENDOF
        ENDCASE
    REPEAT ;
-- ------------------------------------------------------------
If memory serves me correctly, we once had both a procedure SignedFactor and a procedure SignedTerm. I had reasons for doing that at the time ... they had to do with the handling of Boolean algebra and, in particular, the Boolean "not" function. But certainly, for arithmetic operations, that duplication isn't necessary. In an expression like:

-x*y

it's very apparent that the sign goes with the whole term, x*y, and not just the factor x, and that's the way Expression is coded.
Test this new code by executing main2.frt. It still calls Expression, so you should now be able to deal with expressions containing any of the four arithmetic operators.
Our last bit of business, as far as expressions go, is to modify procedure Factor to allow for parenthetical expressions. By using a recursive call to Expression, we can reduce the needed code to virtually nothing. Five lines added to Factor do the job:
-- ------------------------------------------------------------
-- Parse and Translate a Factor
: factor ( -- )
    Look '(' = IF  '(' match  expression  ')' match  EXIT ENDIF
    Look digit? IF  getnumber loadconstant  EXIT ENDIF
    Look alpha? IF  getname loadvariable  EXIT ENDIF
    S" Unrecognized character `" Look CHAR-APPEND &' CHAR-APPEND errors ;
-- ------------------------------------------------------------
At this point, your "compiler" should be able to handle any legal expression you can throw at it. Better yet, it should reject all illegal ones!
As long as we're this close, we might as well handle the assignment statement. The code needs only to read and remember the name of the target variable, match an '=' sign, call Expression, then store the result. The procedure is shown next:
-- ------------------------------------------------------------
-- Parse and translate an assignment statement
: assignment ( -- )
    getname  DUP 1+ ALLOCATE ?ALLOCATE  LOCAL name
    name PACK DROP
    '=' match
    expression
    name COUNT storevariable
    name FREE ?ALLOCATE ;
-- ------------------------------------------------------------
The assignment calls for yet another code generation routine:
-- ------------------------------------------------------------
-- Store primary register in variable
: storevariable ( c-addr u -- )
    S" eax -> " 2SWAP $+  S" dword-ptr mov," $+  emitln ;
-- ------------------------------------------------------------
Now, change the call in Main to call Assignment (main3.frt), and you should see a full assignment statement being processed correctly. Pretty neat, eh? And painless, too.
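For example, given the input x=a+b, the output should look something like this (assuming the same load format as before; the exact spacing depends on how the strings are pasted together):

A dword-ptr -> eax mov,                     ; load a
eax push,                                   ; push it
B dword-ptr -> eax mov,                     ; load b
[esp] -> eax add, [esp 4 +] -> esp lea,     ; pop and add a
eax -> X dword-ptr mov,                     ; store the sum in x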
In the past, we've always tried to show BNF relations to define the syntax we're developing. I haven't done that here, and it's high time I did. Here's the BNF:
<factor>      ::= <variable> | <constant> | '(' <expression> ')'
<signed_term> ::= [<addop>] <term>
<term>        ::= <factor> (<mulop> <factor>)*
<expression>  ::= <signed_term> (<addop> <term>)*
<assignment>  ::= <variable> '=' <expression>
Pascal treats the Boolean operators pretty much identically to the way it treats arithmetic ones. A Boolean "and" has the same precedence level as multiplication, and the "or" as addition. C, on the other hand, sets them at different precedence levels, and all told has a whopping 17 levels. In our earlier work, I chose something in between, with seven levels. As a result, we ended up with things called Boolean expressions, paralleling in most details the arithmetic expressions, but at a different precedence level. All of this, as it turned out, came about because I didn't like having to put parentheses around the Boolean expressions in statements like:
IF (c >= 'A') and (c <= 'Z') then ...
In retrospect, that seems a pretty petty reason to add many layers of complexity to the parser. Perhaps more to the point, I'm not sure I was even able to avoid the parens.
For kicks, let's start anew, taking a more Pascal-ish approach, and just treat the Boolean operators at the same precedence level as the arithmetic ones. We'll see where it leads us. If it seems to be down the garden path, we can always backtrack to the earlier approach.

For starters, we'll add the "addition-level" operators to Expression. That's easily done; first, modify the function Addop? in Scanner to include two extra operators: '|' for "or," and '~' for "exclusive or":
-- ------------------------------------------------------------
: addop? ( char -- tf )
    DUP '+' =  OVER '-' = OR  OVER '|' = OR  SWAP '~' = OR ;
-- ------------------------------------------------------------

Next, we must include the parsing of the operators in procedure Expression:
-- ------------------------------------------------------------
:NONAME ( -- )
    signedterm
    BEGIN Look addop?
    WHILE
        CASE Look
            '+' OF add      ENDOF
            '-' OF subtract ENDOF
            '|' OF _or      ENDOF
            '~' OF _xor     ENDOF
        ENDCASE
    REPEAT ;  IS expression
-- ------------------------------------------------------------
(The underscores are needed, of course, because "or" and "xor" are reserved words in Forth.)
Next, the words _Or and _Xor:
-- ------------------------------------------------------------
-- Parse and translate OR
: _or ( -- )
    '|' match  _push  term  _popor ;
-- ------------------------------------------------------------
-- Parse and translate XOR
: _xor ( -- )
    '~' match  _push  term  _popxor ;
-- ------------------------------------------------------------
And, finally, the new code generator procedures:
-- --------------------------------------------------------------------
-- OR TOS to primary
: _popor ( -- )
    S" [esp] -> eax or, [esp 4 +] -> esp lea," emitln ;
-- --------------------------------------------------------------------
-- XOR TOS to primary
: _popxor ( -- )
    S" [esp] -> eax xor, [esp 4 +] -> esp lea," emitln ;
-- --------------------------------------------------------------------
Now, let's test the translator (you might want to change the call in Main back to a call to Expression, just to avoid having to type "x=" for an assignment every time).
So far, so good. The parser nicely handles expressions of the form:
x|y~z
Unfortunately, it also does nothing to protect us from mixing Boolean and arithmetic algebra. It will merrily generate code for:
(a+b)*(c~d)
We've talked about this a bit, in the past. In general the rules for what operations are legal or not cannot be enforced by the parser itself, because they are not part of the syntax of the language, but rather its semantics. A compiler that doesn't allow mixed-mode expressions of this sort must recognize that c and d are Boolean variables, rather than numeric ones, and balk at multiplying them in the next step. But this "policing" can't be done by the parser; it must be handled somewhere between the parser and the code generator. We aren't in a position to enforce such rules yet, because we haven't got either a way of declaring types, or a symbol table to store the types in. So, for what we've got to work with at the moment, the parser is doing precisely what it's supposed to do.
Anyway, are we sure that we don't want to allow mixed-type operations? We made the decision some time ago (or, at least, I did) to adopt the value 0 as a Boolean "false," and -1, or FFFFFFFFh, as a Boolean "true." The nice part about this choice is that bitwise operations work exactly the same way as logical ones. In other words, when we do an operation on one bit of a logical variable, we do it on all of them. This means that we don't need to distinguish between logical and bitwise operations, as is done in C with the operators & and &&, and | and ||. Reducing the number of operators by half certainly doesn't seem all bad.
From the point of view of the data in storage, of course, the computer and compiler couldn't care less whether the number FFFFFFFFh represents the logical TRUE, or the numeric -1. Should we? I sort of think not. I can think of many examples (though they might be frowned upon as "tricky" code) where the ability to mix the types might come in handy. Example, the Dirac delta function, which could be coded in one simple line:
-(x=0)
or the absolute value function (definitely tricky code!):
x*(1+2*(x<0))
(Both tricks lean on our choice of -1 for "true": -(x=0) yields 1 when x is zero and 0 otherwise, while the factor 1+2*(x<0) evaluates to -1 for negative x and +1 otherwise.)

Please note, I'm not advocating coding like this as a way of life. I'd almost certainly write these functions in more readable form, using IFs, just to keep from confusing later maintainers. Still, a moral question arises: Do we have the right to enforce our ideas of good coding practice on the programmer, by writing the language so he can't do anything else? That's what Niklaus Wirth did, in many places in Pascal, and Pascal has been criticized for it -- for not being as "forgiving" as C.
An interesting parallel presents itself in the example of the Motorola 68000 design. Though Motorola brags loudly about the orthogonality of their instruction set, the fact is that it's far from orthogonal. For example, you can read a variable from its address:
MOVE X,D0 (where X is the name of a variable)
but you can't write in the same way. To write, you must load an address register with the address of X. The same is true for PC-relative addressing:
MOVE X(PC),D0    (legal)
MOVE D0,X(PC)    (illegal)
When you begin asking how such non-orthogonal behavior came about, you find that someone in Motorola had some theories about how software should be written. Specifically, in this case, they decided that self-modifying code, which you can implement using PC-relative writes, is a Bad Thing. Therefore, they designed the processor to prohibit it. Unfortunately, in the process they also prohibited all writes of the forms shown above, however benign. Note that this was not something done by default. Extra design work had to be done, and extra gates added, to destroy the natural orthogonality of the instruction set.
One of the lessons I've learned from life: If you have two choices, and can't decide which one to take, sometimes the best thing to do is nothing. Why add extra gates to a processor to enforce some stranger's idea of good programming practice? Leave the instructions in, and let the programmers debate what good programming practice is. Similarly, why should we add extra code to our parser, to test for and prevent conditions that the user might prefer to do, anyway? I'd rather leave the compiler simple, and let the software experts debate whether the practices should be used or not.
All of which serves as rationalization for my decision as to how to prevent mixed-type arithmetic: I won't. For a language intended for systems programming, the fewer rules, the better. If you don't agree, and want to test for such conditions, we can do it once we have a symbol table.
With that bit of philosophy out of the way, we can press on to the "and" operator, which goes into procedure Term. By now, you can probably do this without me, but here's the code, anyway:
In Scanner,
-- ------------------------------------------------------------
: mulop? ( char -- tf )
    DUP '*' =  OVER '/' = OR  SWAP '&' = OR ;
-- ------------------------------------------------------------
In Parser,
-- ------------------------------------------------------------
-- Parse and translate a term
: term ( -- )
    factor
    BEGIN Look mulop?
    WHILE
        _push
        CASE Look
            '*' OF multiply ENDOF
            '/' OF divide   ENDOF
            '&' OF _and     ENDOF
        ENDCASE
    REPEAT ;
-- ------------------------------------------------------------
-- Parse and translate AND (Term has already pushed the primary)
: _and ( -- )
    '&' match  factor  _popand ;
-- ------------------------------------------------------------
and in CodeGen,
-- --------------------------------------------------------------------
-- AND TOS to primary
: _popand ( -- )
    S" [esp] -> eax and, [esp 4 +] -> esp lea," emitln ;
-- --------------------------------------------------------------------
Your parser should now be able to process almost any sort of logical expression, and (should you be so inclined), mixed-mode expressions as well.
Why not "all sorts of logical expressions"? Because, so far, we haven't
dealt with the logical "not" operator, and this is where it gets tricky.
The logical "not" operator seems, at first glance, to be identical in
its behavior to the unary minus, so my first thought was to let the
exclusive or operator, '~', double as the unary "not." That didn't
work. In my first attempt, procedure SignedTerm
simply ate my '~',
because the character passed the test for an addop, but SignedTerm
ignores all addops except '-'. It would have been easy enough to add
another line to SignedTerm
, but that would still not solve the problem,
because note that Expression
only accepts a signed term for the first
argument.
Mathematically, an expression like:
-a * -b
makes little or no sense, and the parser should flag it as an error. But the same expression, using a logical "not," makes perfect sense:
not a and not b
In the case of these unary operators, choosing to make them act the same way seems an artificial force fit, sacrificing reasonable behavior on the altar of implementational ease. While I'm all for keeping the implementation as simple as possible, I don't think we should do so at the expense of reasonableness. Patching like this would be missing the main point, which is that the logical "not" is simply not the same kind of animal as the unary minus. Consider the exclusive or, which is most naturally written as:
a~b ::= (a and not b) or (not a and b)
If we allow the "not" to modify the whole term, the last term in parentheses would be interpreted as:
not(a and b)
which is not the same thing at all. So it's clear that the logical "not" must be thought of as connected to the factor, not the term.
The idea of overloading the '~' operator also makes no sense from a mathematical point of view. The implication of the unary minus is that it's equivalent to a subtraction from zero:
-x <=> 0-x
In fact, in one of my more simple-minded versions of Expression, I reacted to a leading addop by simply preloading a zero, then processing the operator as though it were a binary operator. But a "not" is not equivalent to an exclusive or with zero ... that would just give back the original number. Instead, it's an exclusive or with FFFFFFFFh, or -1.
In short, the seeming parallel between the unary "not" and the unary minus falls apart under closer scrutiny. "not" modifies the factor, not the term, and it is related to neither the unary minus nor the exclusive or. Therefore, it deserves a symbol to call its own. What better symbol than the obvious one, also used by C, the '!' character? Using the rules about the way we think the "not" should behave, we should be able to code the exclusive or (assuming we'd ever need to), in the very natural form:
a & !b | !a & b
Note that no parentheses are required -- the precedence levels we've chosen automatically take care of things.
If you're keeping score on the precedence levels, this definition puts the '!' at the top of the heap. The levels become:

    1.  !
    2.  - (unary)
    3.  *, /, &
    4.  +, -, |, ~
Looking at this list, it's certainly not hard to see why we had trouble using '~' as the "not" symbol!
So how do we mechanize the rules? In the same way as we did with SignedTerm, but at the factor level. We'll define a procedure NotFactor:
-- ------------------------------------------------------------
-- Parse and translate a factor with optional "not"
: notfactor ( -- )
    Look '!' <> IF  factor  EXIT ENDIF
    '!' match  factor  _not ;
-- ------------------------------------------------------------
and call it from all the places where we formerly called Factor, i.e., from Term, Multiply, Divide, and _And. Note the new code generation procedure:
-- ------------------------------------------------------------
-- Bitwise NOT primary
: _not ( -- )
    S" eax not," emitln ;
-- ------------------------------------------------------------
Try this now, with a few simple cases. In fact, try that exclusive or example,
a&!b|!a&b
You should get the code (without the comments, of course):
A dword-ptr -> eax mov,                     ; load a
eax push,                                   ; push it
B dword-ptr -> eax mov,                     ; load b
eax not,                                    ; not it
[esp] -> eax and, [esp 4 +] -> esp lea,     ; and with a
eax push,                                   ; push result
A dword-ptr -> eax mov,                     ; load a
eax not,                                    ; not it
eax push,                                   ; push it
B dword-ptr -> eax mov,                     ; load b
[esp] -> eax and, [esp 4 +] -> esp lea,     ; and with !a
[esp] -> eax or, [esp 4 +] -> esp lea,      ; or with first term
That's precisely what we'd like to get. So, at least for both arithmetic and logical operators, our new precedence and new, slimmer syntax hang together. Even the peculiar, but legal, expression with leading addop:
~x
makes sense. SignedTerm ignores the leading '~', as it should, since the expression is equivalent to:

0~x

which is equal to x.
When we look at the BNF we've created, we find that our boolean algebra now adds only one extra line:
<not_factor>  ::= [!] <factor>
<factor>      ::= <variable> | <constant> | '(' <expression> ')'
<signed_term> ::= [<addop>] <term>
<term>        ::= <not_factor> (<mulop> <not_factor>)*
<expression>  ::= <signed_term> (<addop> <term>)*
<assignment>  ::= <variable> '=' <expression>
That's a big improvement over earlier efforts. Will our luck continue to hold when we get to relational operators? We'll find out soon, but it will have to wait for the next installment. We're at a good stopping place, and I'm anxious to get this installment into your hands. It's already been a year since the release of Installment 15. I blush to admit that all of this current installment has been ready for almost as long, with the exception of relational operators. But the information does you no good at all, sitting on my hard disk, and by holding it back until the relational operations were done, I've kept it out of your hands for that long. It's time for me to let go of it and get it out where you can get value from it. Besides, there are quite a number of serious philosophical questions associated with the relational operators, as well, and I'd rather save them for a separate installment where I can do them justice.
Have fun with the new, leaner arithmetic and logical parsing, and I'll see you soon with relationals.
*****************************************************************
*                                                               *
*                       COPYRIGHT NOTICE                        *
*                                                               *
*  Copyright (C) 1995 Jack W. Crenshaw. All rights reserved.    *
*                                                               *
*****************************************************************