I won't spend a lot of time making excuses; only point out that things happen, and priorities change. In the four years since installment fourteen, I've managed to get laid off, get divorced, have a nervous breakdown, begin a new career as a writer, begin another one as a consultant, move, work on two real-time systems, and raise fourteen baby birds, three pigeons, six possums, and a duck. For a while there, the parsing of source code was not high on my list of priorities. Neither was writing stuff for free, instead of writing stuff for pay. But I do try to be faithful, and I do recognize and feel my responsibility to you, the reader, to finish what I've started. As the tortoise said in one of my son's old stories, I may be slow, but I'm sure. I'm sure that there are people out there anxious to see the last reel of this film, and I intend to give it to them. So, if you're one of those who's been waiting, more or less patiently, to see how this thing comes out, thanks for your patience. I apologize for the delay. Let's move on.
It may seem a little anachronistic to be programming in Forth, when the rest of the world seems to have gone bananas over C++. It also seems a little strange to be programming in a classical style when the rest of the world has switched to object-oriented methods. Still, in spite of the four-year hiatus, it would be entirely too wrenching a change, at this point, to switch to, say, C++ with object orientation.
Anyway, Forth is still not only a powerful programming language (more than ever, in fact), but it's a wonderful medium for teaching. C is a notoriously difficult language to read ... it's often been accused, along with Forth, of being a "write-only language." When I program in C++, I find myself spending at least 50% of my time struggling with language syntax rather than with concepts. A stray "&" or "*" can not only change the functioning of the program, but its correctness as well. By contrast, Forth code is usually quite transparent and easy to read, even if you don't know the language. What you see is almost always what you get, and we can concentrate on concepts rather than implementation details. I've said from the beginning that the purpose of this tutorial series was not to generate the world's fastest compiler, but to teach the fundamentals of compiler technology, while spending the least amount of time wrestling with language syntax or other aspects of software implementation.
Finally, since a lot of what we do in this course amounts to software experimentation, it's important to have a compiler and associated environment that compiles quickly and with no fuss. In my opinion, by far the most significant time measure in software development is the speed of the edit/compile/test cycle. In this department, iForth is king. The compilation speed is blazing fast, and continues to get faster in every release (how do they keep doing that?). Despite vast improvements in C compilation speed over the years, even Intel's fastest C/C++ compiler is still no match for Forth. Further, the editor built into the IDE, the absence of any need for a make facility, and the absence of a linker all complement each other to produce a wonderful environment for quick turnaround. For all of these reasons, I intend to stick with Forth for the duration of this series. We'll be using iForth for Windows. If you don't have this compiler, don't worry ... nothing we do here is going to count on your having the latest version. Using the Windows version helps me a lot, by allowing me to use the Clipboard to copy code from the compiler's editor into these documents. It should also help you at least as much, copying the code in the other direction.
I've thought long and hard about whether or not to introduce objects to our discussion. I'm a big advocate of object-oriented methods for all uses, and such methods definitely have their place in compiler technology. In fact, I've written papers on just this subject (Refs. 1-3). But the architecture of a compiler which is based on object-oriented approaches is vastly different from that of the more classical compiler we've been building. Again, it would seem to be entirely too much to change horses in midstream. As I said, programming styles change. Who knows, it may be another six years before we finish this thing, and if we keep changing the code every time programming style changes, we may never finish.
So for now, at least, I've determined to continue the classical style in Forth, though we might indeed discuss objects and object orientation as we go. Likewise, the target machine will remain the Intel 80x86 family. Of all the decisions to be made here, this one has been the easiest. Though I know that many of you would like to see code for the 68000, the 80x86 has become, if anything, even more popular as a platform for embedded systems, and it's for that application that this whole effort began in the first place. Of course, compiling for the PC/MSDOS platform, we have to deal with all the issues of DOS system calls, DOS linker formats, the PC file system and hardware, and all those other complications of a DOS environment. An embedded system, on the other hand, must run standalone, and it's for this kind of application, as an alternative to assembly language, that I've always imagined a language like KISS would thrive. Anyway, who wants to deal with the 68000 architecture if they don't have to?
The one feature of Forth that I'm going to be making heavy use of is INCLUDE. In the past, we've had to make compromises between code size and complexity, and program functionality. A lot of our work has been in the nature of computer experimentation, looking at only one aspect of compiler technology at a time. We did this to avoid having to carry around large programs, just to investigate simple concepts. In the process, we've re-invented the wheel and re-programmed the same functions more times than I'd like to count. Forth include files provide a wonderful way to get functionality and simplicity at the same time: You write reusable code, and invoke it with a single line. Your test program stays small, but it can do powerful things.
One feature of Forth include files is the possibility of run-time initialization. As with an Ada package, any interpreted code in the include file gets executed as the program is initialized. As you'll see later, this sometimes gives us neat simplifications in the code. Our procedure Init, which has been with us since Installment 1, goes away entirely when we use include files. The various routines in the Cradle, another key feature of our approach, will get distributed among the includes.
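To make the idea concrete, here's a minimal sketch of the mechanism. The file name counter.frt and the value #Loads are hypothetical, invented just for this illustration; the point is that any interpreted code at the top level of an included file executes the moment the file is loaded:

    -- counter.frt -- a hypothetical include file
    0 VALUE #Loads              -- an ordinary definition, compiled as usual

    1 +TO #Loads                -- interpreted code: runs at load time
    CR .( counter.frt initialized)

A program that pulls this file in gets the initialization for free, with no explicit call in the main program, exactly the behavior the Input unit will exploit shortly to prime the lookahead character.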
The concept of include files, of course, is no different from that of C modules. However, in C (and C++), the interface between modules comes via preprocessor include statements and header files. As someone who's had to read a lot of other people's C programs, I've always found this rather bewildering. It always seems that whatever data structure you'd like to know about is in some other file. Include files are simpler for the very reason that they're criticized by some: The function interfaces and their implementation are included in the same file. While this organization may create problems with code security, it also reduces the number of files by half, which isn't half bad. Linking of the object files is also easy, because the Forth compiler takes care of it without the need for make files or other mechanisms.
Since we're going to be using multiple modules in our new approach, we have to address the issue of file management. If you've followed all the other sections of this tutorial, you know that, as our programs evolve, we're going to be replacing older, more simple-minded units with more capable ones. This brings us to an issue of version control. There will almost certainly be times when we will overlay a simple file (unit), but later wish we had the simple one again. A case in point is embodied in our predilection for using single-character variable names, keywords, etc., to test concepts without getting bogged down in the details of a lexical scanner. Thanks to the use of units, we will be doing much less of this in the future. Still, I not only suspect, but am certain that we will need to save some older versions of files, for special purposes, even though they've been replaced by newer, more capable ones.
To deal with this problem, I suggest that you create different directories, with different versions of the units as needed. If we do this properly, the code in each directory will remain self-consistent. I've tentatively created four directories: SINGLE (for single-character experimentation), MULTI (for, of course, multi-character versions), TINY, and KISS.
Enough said about philosophy and details. Let's get on with the resurrection of the software.
(Compare this with the C/Unix approach using getchar and ungetc, and I think you'll agree that our approach is simpler.) We'll begin our hike into the future by translating this concept into our new, unit-based organization. The first unit, appropriately called Input, is shown below:
-- ------------------------------------------------------------
-- input
-- ------------------------------------------------------------
0 VALUE Look                    -- Lookahead character
-- ------------------------------------------------------------
-- Read new character from input stream
: getchar ( -- ) EKEY TO Look ;
-- ------------------------------------------------------------
-- Unit initialization
getchar
-- ------------------------------------------------------------
As you can see, there's nothing very profound, and certainly nothing complicated, about this unit, since it consists of only a single procedure. But already, we can see how the use of units gives us advantages. Note the executable code in the initialization block. This code "primes the pump" of the input stream for us, something we've always had to do before, by inserting the call to GetChar in line, or in procedure Init. This time, the call happens without any special reference to it on our part, except within the unit itself. As I predicted earlier, this mechanism is going to make our lives much simpler as we proceed. I consider it to be one of the most useful features of Forth, and I lean on it heavily.
Copy this unit into your compiler's IDE, and compile it. To test the software, of course, we always need a main program. I used the following, really complex test program, which we'll later evolve into the Main for our compiler:
-- ------------------------------------------------------------
NEEDS -input

: main1 ( -- ) Look EMIT ;
-- ------------------------------------------------------------
Note also that we can access the lookahead character, even though it's not declared in the main source. All variables declared within a file are global, but they can be hidden from prying eyes by using PRIVATE; to that extent, we get a modicum of information hiding. Of course, if we were writing in an object-oriented fashion, we should not allow outside modules to access the unit's internal variables. But, although Forth source has a lot in common with objects, we're not doing object-oriented design or code here, so our use of Look is appropriate.
Go ahead and save the test program as main1.frt.
I hasten to point out, as I've done before, that the function of file input is, and always has been, considered to be a dummy version of the real thing. In a production version of a compiler, the input stream will, of course, come from a file rather than from the keyboard. And it will almost certainly include line buffering, at the very least, and more likely, a rather large text buffer to support efficient disk I/O. The nice part about the include approach is that, as with objects, we can modify the code in the file to be as simple or as sophisticated as we like. As long as the interface, as embodied in the public procedures and the lookahead character, doesn't change, the rest of the program is totally unaffected. And since units are interpreted, rather than compiled, the time required to link with them is virtually nil. Again, the result is that we can get all the benefits of sophisticated implementations, without having to carry the code around as so much baggage.
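As an illustration of that point, here's a hedged sketch of what a file-based replacement for unit Input might look like, using the standard Forth file-access words. It keeps the same interface (Look and getchar); the names open-source and infile are my own inventions for this sketch, not part of the series' code:

    -- input (hypothetical file-based version)
    0 VALUE Look
    0 VALUE infile

    -- Open the source file for reading
    : open-source ( c-addr u -- ) R/O OPEN-FILE THROW TO infile ;

    -- Read next character into Look; 0 signals end of file
    : getchar ( -- )
        PAD 1 infile READ-FILE THROW    -- leaves count actually read
        IF PAD C@ ELSE 0 ENDIF TO Look ;

Note one wrinkle a real version must handle: the priming call to getchar can no longer run at include time, since it has to wait until after open-source has been executed.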
In later installments, I intend to provide a full-blown IDE for the KISS compiler, using a true Windows application generated by MPE's applications framework. For now, though, we'll obey my #1 rule to live by: Keep It Simple.
-- ------------------------------------------------------------
-- output
-- ------------------------------------------------------------
-- Emit an instruction
: emits ( c-addr u -- ) Tab EMIT TYPE ;
-- ------------------------------------------------------------
-- Emit an instruction on a new line
: emitln ( c-addr u -- ) CR emits ;
-- ------------------------------------------------------------
(Notice that this file has no initialization clause.)
Test this unit with the following main program (test.frt):
-- ------------------------------------------------------------
NEEDS -input
NEEDS -output

: test ( -- ) CR ." MAIN:" S" Hello, world!" emitln ;
-- ------------------------------------------------------------
Did you see anything that surprised you? You may have been surprised to see that you needed to type something, even though the main program requires no input. That's because of the initialization in input.frt, which still requires something to put into the lookahead character. Sorry, there's no way out of that box, or rather, we don't want to get out. Except for simple test cases such as this, we will always want a valid lookahead character, so the right thing to do about this "problem" is ... nothing.
In our original Cradle, we had two error-handling procedures: Error, which didn't halt, and Abort, which did. But I don't think we ever found a use for the procedure that didn't halt, so in the new, lean and mean file Errors, shown next, the word errors takes the place of Abort.
-- -------------------------------------------------------------------
-- Errors
-- -------------------------------------------------------------------
-- Write error message and halt
: errors ( c-addr u -- ) CR ^G EMIT ." Error: " TYPE '.' EMIT ABORT ;
-- -------------------------------------------------------------------
-- Write "<something> expected"
: expected ( c-addr u -- ) S"  expected" $+ errors ;
-- -------------------------------------------------------------------
As usual, here's a test program (test2.frt):
-- -------------------------------------------------------------------
NEEDS -input
NEEDS -output
NEEDS -errors

: test2 ( -- ) S" Integer" expected ;
-- -------------------------------------------------------------------
Have you noticed that the "needs" line in our main program keeps getting longer? That's OK. In the final version, the main program will only call the words in our parser, so its needs clause will only have a couple of entries. But for now, it's probably best to include all the units so we can test the words in them.
Even though there is no functional procedure called "Scanner," it still makes sense to separate the scanning functions from the parsing functions. So I've created two more units called, amazingly enough, Scanner and Parser. The Scanner unit contains all of the routines known as recognizers. Some of these, such as IsAlpha, are pure boolean routines which operate on the lookahead character only. The other routines are those which collect tokens, such as identifiers and numeric constants. The Parser unit will contain all of the routines making up the recursive-descent parser. The general rule should be that unit Parser contains all of the information that is language-specific; in other words, the syntax of the language should be wholly contained in Parser. In an ideal world, this rule should be true to the extent that we can change the compiler to compile a different language, merely by replacing the single file, Parser.
In practice, things are almost never this pure. There's always a small amount of "leakage" of syntax rules into the scanner as well. For example, the rules concerning what makes up a legal identifier or constant may vary from language to language. In some languages, the rules concerning comments permit them to be filtered by the scanner, while in others they do not. So in practice, both units are likely to end up having language-dependent components, but the changes required to the scanner should be relatively trivial.
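As a small, hypothetical illustration of such leakage, here's what a scanner-level comment filter might look like for a language with curly-brace comments. The word name skipcomment is my invention, not part of our units; it assumes Look and getchar from unit Input:

    -- Skip a '{ ... }' comment (hypothetical sketch).
    -- On entry, Look holds '{'; on exit, Look holds the first
    -- character after the closing '}'.
    : skipcomment ( -- )
        BEGIN  getchar  Look '}' =  UNTIL
        getchar ;

A language whose comments can nest, or whose comments are significant to the parser, would force this logic out of the scanner entirely; that's exactly the kind of language-dependence the paragraph above is warning about.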
Now, recall that we've used two versions of the scanner routines: One that handled only single-character tokens, which we used for a number of our tests, and another that provided full support for multi-character tokens. Now that we have our software separated into units, I don't anticipate getting much use out of the single-character version, but it doesn't cost us much to provide for both. I've created two versions of the Scanner unit. The first one, called Scanner1, contains the single-digit version of the recognizers:
-- ------------------------------------------------------------
-- Scanner1
-- ------------------------------------------------------------
NEEDS -input
NEEDS -errors
-- ------------------------------------------------------------
: alpha? ( char -- tf ) >UPC 'A' 'Z' 1+ WITHIN ;    -- recognize an alpha character
: digit? ( char -- tf ) '0' '9' 1+ WITHIN ;         -- recognize a decimal digit
: alnum? ( char -- tf ) DUP alpha? SWAP digit? OR ; -- recognize alphanumeric
: addop? ( char -- tf ) DUP '+' = SWAP '-' = OR ;   -- test for AddOp
: mulop? ( char -- tf ) DUP '*' = SWAP '/' = OR ;   -- test for MulOp
-- ------------------------------------------------------------
-- match a specific input character
: match ( char -- )
    DUP Look = IF   DROP getchar
               ELSE S" `" ROT CHAR-APPEND &' CHAR-APPEND expected
               ENDIF ;
-- ------------------------------------------------------------
-- get an identifier
: getname ( -- char )
    Look alpha? 0= IF S" Name" expected ENDIF
    Look >UPC getchar ;
-- ------------------------------------------------------------
-- get a number
: getnum ( -- char )
    Look digit? 0= IF S" Integer" expected ENDIF
    Look getchar ;
-- ------------------------------------------------------------
The following code fragment of the main program provides a good test of the scanner. For brevity, I'll only include the executable code here; the rest remains the same. Don't forget, though, to add the name Scanner1 to the "needs" clause (main2.frt).
-- -----------------------
NEEDS -scanner1

: main2 ( -- )
    getname EMIT  '=' match  getnum EMIT  '+' match  getname CR EMIT ;
-- -----------------------
This code will recognize all sentences of the form:
x=0+y
where x and y can be any single-character variable names, and 0 any digit. The code should reject all other sentences, and give a meaningful error message. If it did, you're in good shape and we can proceed.
Only GetName and GetNumber change between the two units, but just to be sure there are no mistakes, I've reproduced the entire unit here. This is unit Scanner:
-- ------------------------------------------------------------
-- Scanner
-- ------------------------------------------------------------
NEEDS -input
NEEDS -errors
-- ------------------------------------------------------------
: alpha? ( char -- tf ) >UPC 'A' 'Z' 1+ WITHIN ;    -- recognize an alpha character
: digit? ( char -- tf ) '0' '9' 1+ WITHIN ;         -- recognize a decimal digit
: alnum? ( char -- tf ) DUP alpha? SWAP digit? OR ; -- recognize alphanumeric
: addop? ( char -- tf ) DUP '+' = SWAP '-' = OR ;   -- test for AddOp
: mulop? ( char -- tf ) DUP '*' = SWAP '/' = OR ;   -- test for MulOp
-- ------------------------------------------------------------
-- match a specific input character
: match ( char -- )
    DUP Look = IF   DROP getchar
               ELSE S" `" ROT CHAR-APPEND &' CHAR-APPEND expected
               ENDIF ;
-- ------------------------------------------------------------
CREATE token$ 0 C, #256 CHARS ALLOT     -- unencoded token
: char+! ( c addr -- ) DUP >S COUNT + C! 1 S> C+! ;
-- ------------------------------------------------------------
-- get an identifier
: getname ( -- c-addr u )
    Look alpha? 0= IF S" Name" expected ENDIF
    token$ C0!
    BEGIN Look alnum? WHILE Look >UPC token$ char+! getchar REPEAT
    token$ COUNT ;
-- ------------------------------------------------------------
-- get a number
: getnum ( -- c-addr u )
    Look digit? 0= IF S" Integer" expected ENDIF
    token$ C0!
    BEGIN Look digit? WHILE Look token$ char+! getchar REPEAT
    token$ COUNT ;
-- ------------------------------------------------------------
The following test program will test this scanner.
-- -----------------------
NEEDS -scanner

: main3 ( -- )
    getname TYPE  '=' match  getnum TYPE  '+' match  getname CR TYPE ;
-- -----------------------

Simply change the "needs" clause to use Scanner instead of Scanner1. Now you should be able to type multi-character names and numbers (main3.frt).
In spite of the relative simplicity of both scanners, a lot of thought has gone into them, and a lot of decisions had to be made. I'd like to share those thoughts with you now so you can make your own educated decision, appropriate for your application. First, note that both versions of GetName translate the input characters to upper case. Obviously, there was a design decision made here, and this is one of those cases where the language syntax splatters over into the scanner. In the C language, the case of characters in identifiers is significant. For such a language, we obviously can't map the characters to upper case. The design I'm using assumes a language like Pascal, where the case of characters doesn't matter. For such languages, it's easier to go ahead and map all identifiers to upper case in the scanner, so we don't have to worry later on when we're comparing strings for equality.

We could even have gone a step further, and mapped the characters to upper case right as they come in, in GetChar. This approach works too, and I've used it in the past, but it's too confining. Specifically, it will also map characters that may be part of quoted strings, which is not a good idea. So if you're going to map to upper case at all, GetName is the proper place to do it.
Note that the function GetNumber in this scanner returns a string, just as GetName does. This is another one of those things I've oscillated about almost daily, and the last swing was all of ten minutes ago. The alternative approach, and one I've used many times in past installments, returns an integer result.

Both approaches have their good points. Since we're fetching a number, the approach that immediately comes to mind is to return it as an integer. But bear in mind that the eventual use of the number will be in a write statement that goes back to the outside world. Someone -- either us or the code hidden inside the write statement -- is going to have to convert the number back to a string again. Turbo Pascal includes such string conversion routines, but why use them if we don't have to? Why convert a number from string to integer form, only to convert it right back again in the code generator, only a few statements later?
Furthermore, as you'll soon see, we're going to need a temporary storage spot for the value of the token we've fetched. If we treat the number in its string form, we can store the value of either a variable or a number in the same string. Otherwise, we'll have to create a second, integer variable.
On the other hand, we'll find that carrying the number as a string virtually eliminates any chance of optimization later on. As we get to the point where we are beginning to concern ourselves with code generation, we'll encounter cases in which we're doing arithmetic on constants. For such cases, it's really foolish to generate code that performs the constant arithmetic at run time. Far better to let the parser do the arithmetic at compile time, and merely code the result. To do that, we'll wish we had the constants stored as integers rather than strings.
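To make that trade-off concrete, here is a hedged sketch of the kind of folding an integer representation would enable. Everything here is hypothetical (the words foldadd and n>string are inventions for this sketch, not part of our units); it assumes the integer version of getnum and the loadconstant word from unit CodeGen:

    -- Convert an integer back to its string form, using standard
    -- pictured numeric output, so it can feed loadconstant.
    : n>string ( n -- c-addr u ) DUP ABS 0 <# #S ROT SIGN #> ;

    -- Hypothetical folded addition: if we know both operands are
    -- constants, do the arithmetic at compile time and emit a
    -- single load of the result instead of a load, a load, and an add.
    : foldadd ( -- )
        getnum  '+' match  getnum  +
        n>string loadconstant ;

With the string representation we've chosen, this optimization isn't available without first converting the operands back to integers, which is exactly the cost the next paragraph weighs.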
What finally swung me back over to the string approach was an aggressive application of the KISS test, plus reminding myself that we've studiously avoided issues of code efficiency. One of the things that makes our simple-minded parsing work, without the complexities of a "real" compiler, is that we've said up front that we aren't concerned about code efficiency. That gives us a lot of freedom to do things the easy way rather than the efficient one, and it's a freedom we must be careful not to abandon voluntarily, in spite of the urges for efficiency shouting in our ear. In addition to being a big believer in the KISS philosophy, I'm also an advocate of "lazy programming," which in this context means, don't program anything until you need it. As P.J. Plauger says, "Never put off until tomorrow what you can put off indefinitely." Over the years, much code has been written to provide for eventualities that never happened. I've learned that lesson myself, from bitter experience. So the bottom line is: We won't convert to an integer here because we don't need to. It's as simple as that.
For those of you who still think we may need the integer version (and indeed we may), here it is:
-- ------------------------------------------------------------
-- get a number (integer version)
: getnum ( -- val )
    newline
    Look digit? 0= IF S" Integer" expected ENDIF
    0 >R
    BEGIN Look digit? WHILE R> #10 * Look '0' - + >R getchar REPEAT
    '#' TO token
    R> ;
-- ------------------------------------------------------------
You might file this one away, as I intend to, for a rainy day.
Remember the very first installment of this series? We read an integer value, say n, and generated the code to load it into the EAX register via an immediate move:
n d# -> eax mov,
Shortly afterwards, we repeated the process for a variable,
X dword-ptr -> eax mov,
and then for a factor that could be either constant or variable. For old times sake, let's revisit that process. Define the following new unit:
-- ------------------------------------------------------------
-- Parser
-- ------------------------------------------------------------
NEEDS -input
NEEDS -scanner
NEEDS -codegen
-- ------------------------------------------------------------
-- Parse and translate a factor
: factor ( -- ) getnum loadconstant ;
-- ------------------------------------------------------------
As you can see, this unit calls a word, LoadConstant, which actually effects the output of the assembly-language code. The unit also uses a new unit, CodeGen. This step represents the last major change in our architecture from earlier installments: the removal of the machine-dependent code to a separate unit. If I have my way, there will not be a single line of code, outside of CodeGen, that betrays the fact that we're targeting the 80x86 CPU. And this is one place I think that having my way is quite feasible.
For those of you who wish I were using the 68000 architecture (or any other one) instead of the 80x86, here's your answer: Merely replace CodeGen with one suitable for your CPU of choice.
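For instance, a 68000 replacement might look something like the sketch below. This is only a hypothetical illustration of the idea, not tested against any real 68000 assembler syntax; the point is that the interface, a loadconstant word taking a string, stays exactly the same:

    -- ------------------------------------------------------------
    -- CodeGen68K (hypothetical sketch)
    -- ------------------------------------------------------------
    NEEDS -output
    -- ------------------------------------------------------------
    -- Load the primary register (here D0) with a constant
    : loadconstant ( c-addr u -- )
        S" #" 2SWAP $+          -- prefix the immediate marker
        S" ,D0 MOVE.L" $+       -- append destination and opcode
        emitln ;
    -- ------------------------------------------------------------

Because the parser only ever talks to the code generator through words like this, nothing above CodeGen would need to change.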
So far, our code generator has only one procedure in it. Here's the unit:
-- ------------------------------------------------------------
-- CodeGen
-- ------------------------------------------------------------
NEEDS -output
-- ------------------------------------------------------------
-- Load the primary register with a constant
: LoadConstant ( c-addr u -- ) S"  d# -> eax mov," $+ emitln ;
-- ------------------------------------------------------------
Copy and compile this file, and execute the following main program (main4.frt):
-- ------------------------------------------------------------
NEEDS -input
NEEDS -output
NEEDS -errors
NEEDS -scanner
NEEDS -parser

: main4 ( -- ) factor ;
-- ------------------------------------------------------------
There it is, the generated code, just as we hoped it would be. Now, I hope you can begin to see the advantage of the unit-based architecture of our new design. Here we have a main program that's all of five lines long. That's all of the program we need to see, unless we choose to see more. And yet, all those units are sitting there, patiently waiting to serve us. We can have our cake and eat it too, in that we have simple and short code, but powerful allies. What remains to be done is to flesh out the units to match the capabilities of earlier installments. We'll do that in the next installment, but before I close, let's finish out the parsing of a factor, just to satisfy ourselves that we still know how. The final version of CodeGen includes the new procedure, LoadVariable:
-- ------------------------------------------------------------
-- CodeGen
-- ------------------------------------------------------------
NEEDS -output
-- ------------------------------------------------------------
-- Load the primary register with a constant
: loadconstant ( c-addr u -- ) S"  d# -> eax mov," $+ emitln ;
-- ------------------------------------------------------------
-- Load a variable to the primary register
: loadvariable ( c-addr u -- ) S"  dword-ptr -> eax mov," $+ emitln ;
-- ------------------------------------------------------------
The parser unit itself doesn't change, but we have a more complex version of the word Factor:
-- ------------------------------------------------------------
-- Parse and translate a factor
: factor ( -- )
    Look digit? IF getnum  loadconstant EXIT ENDIF
    Look alpha? IF getname loadvariable EXIT ENDIF
    S" Unrecognized character `" Look CHAR-APPEND &' CHAR-APPEND errors ;
-- ------------------------------------------------------------
Now, without altering the main program, you should find that our program (main4.frt) will process either a variable or a constant factor. At this point, our architecture is almost complete; we have units to do all the dirty work, and enough code in the parser and code generator to demonstrate that everything works. What remains is to flesh out the units we've defined, particularly the parser and code generator, to support the more complex syntax elements that make up a real language. Since we've done this many times before in earlier installments, it shouldn't take long to get us back to where we were before the long hiatus. We'll continue this process in Installment 16, coming soon. See you then.
*****************************************************************
*                                                               *
*                       COPYRIGHT NOTICE                        *
*                                                               *
*   Copyright (C) 1994 Jack W. Crenshaw. All rights reserved.   *
*                                                               *
*****************************************************************