I won't spend a lot of time making excuses; only point out that things happen, and priorities change. In the four years since installment fourteen, I've managed to get laid off, get divorced, have a nervous breakdown, begin a new career as a writer, begin another one as a consultant, move, work on two real-time systems, and raise fourteen baby birds, three pigeons, six possums, and a duck. For a while there, the parsing of source code was not high on my list of priorities. Neither was writing stuff for free, instead of writing stuff for pay. But I do try to be faithful, and I do recognize and feel my responsibility to you, the reader, to finish what I've started. As the tortoise said in one of my son's old stories, I may be slow, but I'm sure. I'm sure that there are people out there anxious to see the last reel of this film, and I intend to give it to them. So, if you're one of those who's been waiting, more or less patiently, to see how this thing comes out, thanks for your patience. I apologize for the delay. Let's move on.
It may seem a little anachronistic to be programming in Forth, when the rest of the world seems to have gone bananas over C++. It also seems a little strange to be programming in a classical style when the rest of the world has switched to object-oriented methods. Still, in spite of the four-year hiatus, it would be entirely too wrenching a change, at this point, to switch to, say, C++ with object orientation.
Anyway, Forth is still not only a powerful programming language (more than ever, in fact), but it's a wonderful medium for teaching. C is a notoriously difficult language to read ... it's often been accused, along with Forth, of being a "write-only language." When I program in C++, I find myself spending at least 50% of my time struggling with language syntax rather than with concepts. A stray "&" or "*" can not only change the functioning of the program, but its correctness as well. By contrast, Forth code is usually quite transparent and easy to read, even if you don't know the language. What you see is almost always what you get, and we can concentrate on concepts rather than implementation details. I've said from the beginning that the purpose of this tutorial series was not to generate the world's fastest compiler, but to teach the fundamentals of compiler technology, while spending the least amount of time wrestling with language syntax or other aspects of software implementation.
Finally, since a lot of what we do in this course amounts to software experimentation, it's important to have a compiler and associated environment that compiles quickly and with no fuss. In my opinion, by far the most significant time measure in software development is the speed of the edit/compile/test cycle. In this department, iForth is king. The compilation speed is blazing fast, and continues to get faster in every release (how do they keep doing that?). Despite vast improvements in C compilation speed over the years, even Intel's fastest C/C++ compiler is still no match for Forth. Further, the editor built into the IDE, the absence of any need for a make facility, and the absence of a linker all complement each other to produce a wonderful environment for quick turnaround. For all of these reasons, I intend to stick with Forth for the duration of this series. We'll be using iForth for Windows. If you don't have this compiler, don't worry ... nothing we do here is going to count on your having the latest version. Using the Windows version helps me a lot, by allowing me to use the Clipboard to copy code from the compiler's editor into these documents. It should also help you at least as much, copying the code in the other direction.
I've thought long and hard about whether or not to introduce objects to our discussion. I'm a big advocate of object-oriented methods for all uses, and such methods definitely have their place in compiler technology. In fact, I've written papers on just this subject (Refs. 1-3). But the architecture of a compiler which is based on object-oriented approaches is vastly different from that of the more classical compiler we've been building. Again, it would seem to be entirely too much to change horses in midstream. As I said, programming styles change. Who knows, it may be another six years before we finish this thing, and if we keep changing the code every time programming style changes, we may never finish.
So for now, at least, I've determined to continue the classical style in Forth, though we might indeed discuss objects and object orientation as we go. Likewise, the target machine will remain the Intel 80x86 family. Of all the decisions to be made here, this one has been the easiest. Though I know that many of you would like to see code for the 68000, the 80x86 has become, if anything, even more popular as a platform for embedded systems, and it's for that application that this whole effort began in the first place. Of course, compiling for the PC/MSDOS platform, we have to deal with all the issues of DOS system calls, DOS linker formats, the PC file system and hardware, and all those other complications of a DOS environment. An embedded system, on the other hand, must run standalone, and it's for this kind of application, as an alternative to assembly language, that I've always imagined a language like KISS would thrive. Anyway, who wants to deal with the 68000 architecture if they don't have to?
The one feature of Forth that I'm going to be making heavy use of is INCLUDE. In the past, we've had to make compromises between code size and complexity, and program functionality. A lot of our work has been in the nature of computer experimentation, looking at only one aspect of compiler technology at a time. We did this to avoid having to carry around large programs, just to investigate simple concepts. In the process, we've re-invented the wheel and re-programmed the same functions more times than I'd like to count. Forth include files provide a wonderful way to get functionality and simplicity at the same time: You write reusable code, and invoke it with a single line. Your test program stays small, but it can do powerful things.
One feature of Forth include files is the possibility of run-time initialization. As with an Ada package, any interpreted code in the include file gets executed as the program is initialized. As you'll see later, this sometimes gives us neat simplifications in the code. Our procedure Init, which has been with us since Installment 1, goes away entirely when we use include files. The various routines in the Cradle, another key feature of our approach, will get distributed among the includes.
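To make the idea concrete, here's a minimal sketch of the mechanism. The file name counter.frt and the value #Loads are hypothetical, invented just for this illustration; the point is that any interpreted code at the top level of an included file executes the moment the file is loaded:

    -- counter.frt -- a hypothetical include file
    0 VALUE #Loads              -- an ordinary definition, compiled as usual

    1 +TO #Loads                -- interpreted code: runs at load time
    CR .( counter.frt initialized)

A program that pulls this file in gets the initialization for free, with no explicit call in the main program, exactly the behavior the Input unit will exploit shortly to prime the lookahead character.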
The concept of include files, of course, is no different from that of C modules. However, in C (and C++), the interface between modules comes via preprocessor include statements and header files. As someone who's had to read a lot of other people's C programs, I've always found this rather bewildering. It always seems that whatever data structure you'd like to know about is in some other file. Include files are simpler for the very reason that they're criticized by some: The function interfaces and their implementation are included in the same file. While this organization may create problems with code security, it also reduces the number of files by half, which isn't half bad. Linking of the object files is also easy, because the Forth compiler takes care of it without the need for make files or other mechanisms.
Since we're going to be using multiple modules in our new approach, we have to address the issue of file management. If you've followed all the other sections of this tutorial, you know that, as our programs evolve, we're going to be replacing older, more simple-minded units with more capable ones. This brings us to an issue of version control. There will almost certainly be times when we will overlay a simple file (unit), but later wish we had the simple one again. A case in point is embodied in our predilection for using single-character variable names, keywords, etc., to test concepts without getting bogged down in the details of a lexical scanner. Thanks to the use of units, we will be doing much less of this in the future. Still, I not only suspect, but am certain that we will need to save some older versions of files, for special purposes, even though they've been replaced by newer, more capable ones.
To deal with this problem, I suggest that you create different directories, with different versions of the units as needed. If we do this properly, the code in each directory will remain self-consistent. I've tentatively created four directories: SINGLE (for single-character experimentation), MULTI (for, of course, multi-character versions), TINY, and KISS.
Enough said about philosophy and details. Let's get on with the resurrection of the software.
(Compare this with the C/Unix approach using getchar and ungetc, and I think you'll agree that our approach is simpler.) We'll begin our hike into the future by translating this concept into our new, unit-based organization. The first unit, appropriately called Input, is shown below:
-- ------------------------------------------------------------
-- input
-- ------------------------------------------------------------
0 VALUE Look                    -- Lookahead character
-- ------------------------------------------------------------
-- Read new character from input stream
: getchar ( -- ) EKEY TO Look ;
-- ------------------------------------------------------------
-- Unit initialization
getchar
-- ------------------------------------------------------------
As you can see, there's nothing very profound, and certainly nothing complicated, about this unit, since it consists of only a single procedure. But already, we can see how the use of units gives us advantages. Note the executable code in the initialization block. This code "primes the pump" of the input stream for us, something we've always had to do before, by inserting the call to GetChar in line, or in procedure Init. This time, the call happens without any special reference to it on our part, except within the unit itself. As I predicted earlier, this mechanism is going to make our lives much simpler as we proceed. I consider it to be one of the most useful features of Forth, and I lean on it heavily.
Copy this unit into your compiler's IDE, and compile it. To test the software, of course, we always need a main program. I used the following, really complex test program, which we'll later evolve into the Main for our compiler:
-- ------------------------------------------------------------
NEEDS -input

: main1 ( -- ) Look EMIT ;
-- ------------------------------------------------------------
Note also that we can access the lookahead character, even though it's not declared in the main source. All variables declared within a file are global, but they can be hidden from prying eyes by using PRIVATE; to that extent, we get a modicum of information hiding. Of course, if we were writing in an object-oriented fashion, we should not allow outside modules to access the unit's internal variables. But, although Forth source has a lot in common with objects, we're not doing object-oriented design or code here, so our use of Look is appropriate.
Go ahead and save the test program as main1.frt.
I hasten to point out, as I've done before, that the function of file input is, and always has been, considered to be a dummy version of the real thing. In a production version of a compiler, the input stream will, of course, come from a file rather than from the keyboard. And it will almost certainly include line buffering, at the very least, and more likely, a rather large text buffer to support efficient disk I/O. The nice part about the include approach is that, as with objects, we can modify the code in the file to be as simple or as sophisticated as we like. As long as the interface, as embodied in the public procedures and the lookahead character, doesn't change, the rest of the program is totally unaffected. And since units are interpreted, rather than compiled, the time required to link with them is virtually nil. Again, the result is that we can get all the benefits of sophisticated implementations, without having to carry the code around as so much baggage.
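As an illustration of that point, here's a hedged sketch of what a file-based replacement for unit Input might look like, using the standard Forth file-access words. It keeps the same interface (Look and getchar); the names open-source and infile are my own inventions for this sketch, not part of the series' code:

    -- input (hypothetical file-based version)
    0 VALUE Look
    0 VALUE infile

    -- Open the source file for reading
    : open-source ( c-addr u -- ) R/O OPEN-FILE THROW TO infile ;

    -- Read next character into Look; 0 signals end of file
    : getchar ( -- )
        PAD 1 infile READ-FILE THROW    -- leaves count actually read
        IF PAD C@ ELSE 0 ENDIF TO Look ;

Note one wrinkle a real version must handle: the priming call to getchar can no longer run at include time, since it has to wait until after open-source has been executed.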
In later installments, I intend to provide a full-blown IDE for the KISS compiler, using a true Windows application generated by MPE's applications framework. For now, though, we'll obey my #1 rule to live by: Keep It Simple.
-- ------------------------------------------------------------
-- output
-- ------------------------------------------------------------
-- Emit an instruction
: emits ( c-addr u -- ) Tab EMIT TYPE ;
-- ------------------------------------------------------------
-- Emit an instruction on a new line
: emitln ( c-addr u -- ) CR emits ;
-- ------------------------------------------------------------
(Notice that this file has no initialization clause.)
Test this unit with the following main program (test.frt):
-- ------------------------------------------------------------
NEEDS -input
NEEDS -output

: test ( -- ) CR ." MAIN:" S" Hello, world!" emitln ;
-- ------------------------------------------------------------
Did you see anything that surprised you? You may have been surprised to see that you needed to type something, even though the main program requires no input. That's because of the initialization in input.frt, which still requires something to put into the lookahead character. Sorry, there's no way out of that box, or rather, we don't want to get out. Except for simple test cases such as this, we will always want a valid lookahead character, so the right thing to do about this "problem" is ... nothing.
In our original Cradle, we had two error-handling procedures: Error, which didn't halt, and Abort, which did. But I don't think we ever found a use for the procedure that didn't halt, so in the new, lean and mean file Errors, shown next, the word errors takes the place of Abort.
-- -------------------------------------------------------------------
-- Errors
-- -------------------------------------------------------------------
-- Write error message and halt
: errors ( c-addr u -- ) CR ^G EMIT ." Error: " TYPE '.' EMIT ABORT ;
-- -------------------------------------------------------------------
-- Write "<something> expected"
: expected ( c-addr u -- ) S"  expected" $+ errors ;
-- -------------------------------------------------------------------
As usual, here's a test program (test2.frt):
-- -------------------------------------------------------------------
NEEDS -input
NEEDS -output
NEEDS -errors

: test2 ( -- ) S" Integer" expected ;
-- -------------------------------------------------------------------
Have you noticed that the "needs" line in our main program keeps getting longer? That's OK. In the final version, the main program will only call the words in our parser, so its needs clause will only have a couple of entries. But for now, it's probably best to include all the units so we can test the words in them.
Even though there is no functional procedure called "Scanner," it still makes sense to separate the scanning functions from the parsing functions. So I've created two more units called, amazingly enough, Scanner and Parser. The Scanner unit contains all of the routines known as recognizers. Some of these, such as IsAlpha, are pure boolean routines which operate on the lookahead character only. The other routines are those which collect tokens, such as identifiers and numeric constants. The Parser unit will contain all of the routines making up the recursive-descent parser. The general rule should be that unit Parser contains all of the information that is language-specific; in other words, the syntax of the language should be wholly contained in Parser. In an ideal world, this rule should be true to the extent that we can change the compiler to compile a different language, merely by replacing the single file, Parser.
In practice, things are almost never this pure. There's always a small amount of "leakage" of syntax rules into the scanner as well. For example, the rules concerning what makes up a legal identifier or constant may vary from language to language. In some languages, the rules concerning comments permit them to be filtered by the scanner, while in others they do not. So in practice, both units are likely to end up having language-dependent components, but the changes required to the scanner should be relatively trivial.
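As a small, hypothetical illustration of such leakage, here's what a scanner-level comment filter might look like for a language with curly-brace comments. The word name skipcomment is my invention, not part of our units; it assumes Look and getchar from unit Input:

    -- Skip a '{ ... }' comment (hypothetical sketch).
    -- On entry, Look holds '{'; on exit, Look holds the first
    -- character after the closing '}'.
    : skipcomment ( -- )
        BEGIN  getchar  Look '}' =  UNTIL
        getchar ;

A language whose comments can nest, or whose comments are significant to the parser, would force this logic out of the scanner entirely; that's exactly the kind of language-dependence the paragraph above is warning about.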
Now, recall that we've used two versions of the scanner routines: One that handled only single-character tokens, which we used for a number of our tests, and another that provided full support for multi-character tokens. Now that we have our software separated into units, I don't anticipate getting much use out of the single-character version, but it doesn't cost us much to provide for both. I've created two versions of the Scanner unit. The first one, called Scanner1, contains the single-digit version of the recognizers:
-- ------------------------------------------------------------
-- Scanner1
-- ------------------------------------------------------------
NEEDS -input
NEEDS -errors
-- ------------------------------------------------------------
: alpha? ( char -- tf ) >UPC 'A' 'Z' 1+ WITHIN ;    -- recognize an alpha character
: digit? ( char -- tf ) '0' '9' 1+ WITHIN ;         -- recognize a decimal digit
: alnum? ( char -- tf ) DUP alpha? SWAP digit? OR ; -- recognize alphanumeric
: addop? ( char -- tf ) DUP '+' = SWAP '-' = OR ;   -- test for AddOp
: mulop? ( char -- tf ) DUP '*' = SWAP '/' = OR ;   -- test for MulOp
-- ------------------------------------------------------------
-- match a specific input character
: match ( char -- )
    DUP Look = IF   DROP getchar
               ELSE S" `" ROT CHAR-APPEND &' CHAR-APPEND expected
               ENDIF ;
-- ------------------------------------------------------------
-- get an identifier
: getname ( -- char )
    Look alpha? 0= IF S" Name" expected ENDIF
    Look >UPC getchar ;
-- ------------------------------------------------------------
-- get a number
: getnum ( -- char )
    Look digit? 0= IF S" Integer" expected ENDIF
    Look getchar ;
-- ------------------------------------------------------------
The following code fragment of the main program provides a good test of the scanner. For brevity, I'll only include the executable code here; the rest remains the same. Don't forget, though, to add the name Scanner1 to the "needs" clause (main2.frt).
-- -----------------------
NEEDS -scanner1

: main2 ( -- )
    getname EMIT  '=' match  getnum EMIT  '+' match  getname CR EMIT ;
-- -----------------------
This code will recognize all sentences of the form:
x=0+y
where x and y can be any single-character variable names, and 0 any digit. The code should reject all other sentences, and give a meaningful error message. If it did, you're in good shape and we can proceed.
Only GetName and GetNumber change between the two units, but just to be sure there are no mistakes, I've reproduced the entire unit here. This is unit Scanner:
-- ------------------------------------------------------------
-- Scanner
-- ------------------------------------------------------------
NEEDS -input
NEEDS -errors
-- ------------------------------------------------------------
: alpha? ( char -- tf ) >UPC 'A' 'Z' 1+ WITHIN ;    -- recognize an alpha character
: digit? ( char -- tf ) '0' '9' 1+ WITHIN ;         -- recognize a decimal digit
: alnum? ( char -- tf ) DUP alpha? SWAP digit? OR ; -- recognize alphanumeric
: addop? ( char -- tf ) DUP '+' = SWAP '-' = OR ;   -- test for AddOp
: mulop? ( char -- tf ) DUP '*' = SWAP '/' = OR ;   -- test for MulOp
-- ------------------------------------------------------------
-- match a specific input character
: match ( char -- )
    DUP Look = IF   DROP getchar
               ELSE S" `" ROT CHAR-APPEND &' CHAR-APPEND expected
               ENDIF ;
-- ------------------------------------------------------------
CREATE token$ 0 C, #256 CHARS ALLOT     -- unencoded token
: char+! ( c addr -- ) DUP >S COUNT + C! 1 S> C+! ;
-- ------------------------------------------------------------
-- get an identifier
: getname ( -- c-addr u )
    Look alpha? 0= IF S" Name" expected ENDIF
    token$ C0!
    BEGIN Look alnum? WHILE Look >UPC token$ char+! getchar REPEAT
    token$ COUNT ;
-- ------------------------------------------------------------
-- get a number
: getnum ( -- c-addr u )
    Look digit? 0= IF S" Integer" expected ENDIF
    token$ C0!
    BEGIN Look digit? WHILE Look token$ char+! getchar REPEAT
    token$ COUNT ;
-- ------------------------------------------------------------
The following test program will test this scanner.
-- -----------------------
NEEDS -scanner

: main3 ( -- )
    getname TYPE  '=' match  getnum TYPE  '+' match  getname CR TYPE ;
-- -----------------------

Simply change the "needs" clause to use Scanner instead of Scanner1. Now you should be able to type multi-character names and numbers (main3.frt).
In spite of the relative simplicity of both scanners, a lot of thought has gone into them, and a lot of decisions had to be made. I'd like to share those thoughts with you now so you can make your own educated decision, appropriate for your application. First, note that both versions of GetName translate the input characters to upper case. Obviously, there was a design decision made here, and this is one of those cases where the language syntax splatters over into the scanner. In the C language, the case of characters in identifiers is significant. For such a language, we obviously can't map the characters to upper case. The design I'm using assumes a language like Pascal, where the case of characters doesn't matter. For such languages, it's easier to go ahead and map all identifiers to upper case in the scanner, so we don't have to worry later on when we're comparing strings for equality.

We could even have gone a step further, and mapped the characters to upper case right as they come in, in GetChar. This approach works too, and I've used it in the past, but it's too confining. Specifically, it will also map characters that may be part of quoted strings, which is not a good idea. So if you're going to map to upper case at all, GetName is the proper place to do it.
Note that the function GetNumber in this scanner returns a string, just as GetName does. This is another one of those things I've oscillated about almost daily, and the last swing was all of ten minutes ago. The alternative approach, and one I've used many times in past installments, returns an integer result.

Both approaches have their good points. Since we're fetching a number, the approach that immediately comes to mind is to return it as an integer. But bear in mind that the eventual use of the number will be in a write statement that goes back to the outside world. Someone -- either us or the code hidden inside the write statement -- is going to have to convert the number back to a string again. Turbo Pascal includes such string conversion routines, but why use them if we don't have to? Why convert a number from string to integer form, only to convert it right back again in the code generator, only a few statements later?
Furthermore, as you'll soon see, we're going to need a temporary storage spot for the value of the token we've fetched. If we treat the number in its string form, we can store the value of either a variable or a number in the same string. Otherwise, we'll have to create a second, integer variable.
On the other hand, we'll find that carrying the number as a string virtually eliminates any chance of optimization later on. As we get to the point where we are beginning to concern ourselves with code generation, we'll encounter cases in which we're doing arithmetic on constants. For such cases, it's really foolish to generate code that performs the constant arithmetic at run time. Far better to let the parser do the arithmetic at compile time, and merely code the result. To do that, we'll wish we had the constants stored as integers rather than strings.
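To make that trade-off concrete, here is a hedged sketch of the kind of folding an integer representation would enable. Everything here is hypothetical (the words foldadd and n>string are inventions for this sketch, not part of our units); it assumes the integer version of getnum and the loadconstant word from unit CodeGen:

    -- Convert an integer back to its string form, using standard
    -- pictured numeric output, so it can feed loadconstant.
    : n>string ( n -- c-addr u ) DUP ABS 0 <# #S ROT SIGN #> ;

    -- Hypothetical folded addition: if we know both operands are
    -- constants, do the arithmetic at compile time and emit a
    -- single load of the result instead of a load, a load, and an add.
    : foldadd ( -- )
        getnum  '+' match  getnum  +
        n>string loadconstant ;

With the string representation we've chosen, this optimization isn't available without first converting the operands back to integers, which is exactly the cost the next paragraph weighs.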
What finally swung me back over to the string approach was an aggressive application of the KISS test, plus reminding myself that we've studiously avoided issues of code efficiency. One of the things that makes our simple-minded parsing work, without the complexities of a "real" compiler, is that we've said up front that we aren't concerned about code efficiency. That gives us a lot of freedom to do things the easy way rather than the efficient one, and it's a freedom we must be careful not to abandon voluntarily, in spite of the urges for efficiency shouting in our ear. In addition to being a big believer in the KISS philosophy, I'm also an advocate of "lazy programming," which in this context means, don't program anything until you need it. As P.J. Plauger says, "Never put off until tomorrow what you can put off indefinitely." Over the years, much code has been written to provide for eventualities that never happened. I've learned that lesson myself, from bitter experience. So the bottom line is: We won't convert to an integer here because we don't need to. It's as simple as that.
For those of you who still think we may need the integer version (and indeed we may), here it is:
-- ------------------------------------------------------------
-- get a number (integer version)
: getnum ( -- val )
    newline
    Look digit? 0= IF S" Integer" expected ENDIF
    0 >R
    BEGIN Look digit? WHILE R> #10 * Look '0' - + >R getchar REPEAT
    '#' TO token
    R> ;
-- ------------------------------------------------------------
You might file this one away, as I intend to, for a rainy day.
Remember the very first installment of this series? We read an integer value, say n, and generated the code to load it into the EAX register via an immediate move:
n d# -> eax mov,
Shortly afterwards, we repeated the process for a variable,
X dword-ptr -> eax mov,
and then for a factor that could be either constant or variable. For old times sake, let's revisit that process. Define the following new unit:
-- ------------------------------------------------------------
-- Parser
-- ------------------------------------------------------------
NEEDS -input
NEEDS -scanner
NEEDS -codegen
-- ------------------------------------------------------------
-- Parse and translate a factor
: factor ( -- ) getnum loadconstant ;
-- ------------------------------------------------------------
As you can see, this unit calls a word, LoadConstant, which actually effects the output of the assembly-language code. The unit also uses a new unit, CodeGen. This step represents the last major change in our architecture from earlier installments: the removal of the machine-dependent code to a separate unit. If I have my way, there will not be a single line of code, outside of CodeGen, that betrays the fact that we're targeting the 80x86 CPU. And this is one place I think that having my way is quite feasible.
For those of you who wish I were using the 68000 architecture (or any other one) instead of the 80x86, here's your answer: Merely replace CodeGen with one suitable for your CPU of choice.
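For instance, a 68000 replacement might look something like the sketch below. This is only a hypothetical illustration of the idea, not tested against any real 68000 assembler syntax; the point is that the interface, a loadconstant word taking a string, stays exactly the same:

    -- ------------------------------------------------------------
    -- CodeGen68K (hypothetical sketch)
    -- ------------------------------------------------------------
    NEEDS -output
    -- ------------------------------------------------------------
    -- Load the primary register (here D0) with a constant
    : loadconstant ( c-addr u -- )
        S" #" 2SWAP $+          -- prefix the immediate marker
        S" ,D0 MOVE.L" $+       -- append destination and opcode
        emitln ;
    -- ------------------------------------------------------------

Because the parser only ever talks to the code generator through words like this, nothing above CodeGen would need to change.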
So far, our code generator has only one procedure in it. Here's the unit:
-- ------------------------------------------------------------
-- CodeGen
-- ------------------------------------------------------------
NEEDS -output
-- ------------------------------------------------------------
-- Load the primary register with a constant
: LoadConstant ( c-addr u -- ) S"  d# -> eax mov," $+ emitln ;
-- ------------------------------------------------------------
Copy and compile this file, and execute the following main program (main4.frt):
-- ------------------------------------------------------------
NEEDS -input
NEEDS -output
NEEDS -errors
NEEDS -scanner
NEEDS -parser

: main4 ( -- ) factor ;
-- ------------------------------------------------------------
There it is, the generated code, just as we hoped it would be. Now, I hope you can begin to see the advantage of the unit-based architecture of our new design. Here we have a main program that's all of five lines long. That's all of the program we need to see, unless we choose to see more. And yet, all those units are sitting there, patiently waiting to serve us. We can have our cake and eat it too, in that we have simple and short code, but powerful allies. What remains to be done is to flesh out the units to match the capabilities of earlier installments. We'll do that in the next installment, but before I close, let's finish out the parsing of a factor, just to satisfy ourselves that we still know how. The final version of CodeGen includes the new procedure, LoadVariable:
-- ------------------------------------------------------------
-- CodeGen
-- ------------------------------------------------------------
NEEDS -output
-- ------------------------------------------------------------
-- Load the primary register with a constant
: loadconstant ( c-addr u -- ) S"  d# -> eax mov," $+ emitln ;
-- ------------------------------------------------------------
-- Load a variable to the primary register
: loadvariable ( c-addr u -- ) S"  dword-ptr -> eax mov," $+ emitln ;
-- ------------------------------------------------------------
The parser unit itself doesn't change, but we have a more complex version of the word Factor:
-- ------------------------------------------------------------
-- Parse and translate a factor
: factor ( -- )
    Look digit? IF getnum  loadconstant EXIT ENDIF
    Look alpha? IF getname loadvariable EXIT ENDIF
    S" Unrecognized character `" Look CHAR-APPEND &' CHAR-APPEND errors ;
-- ------------------------------------------------------------
Now, without altering the main program, you should find that our program (main4.frt) will process either a variable or a constant factor. At this point, our architecture is almost complete; we have units to do all the dirty work, and enough code in the parser and code generator to demonstrate that everything works. What remains is to flesh out the units we've defined, particularly the parser and code generator, to support the more complex syntax elements that make up a real language. Since we've done this many times before in earlier installments, it shouldn't take long to get us back to where we were before the long hiatus. We'll continue this process in Installment 16, coming soon. See you then.
*****************************************************************
*                                                               *
*                       COPYRIGHT NOTICE                        *
*                                                               *
*   Copyright (C) 1994 Jack W. Crenshaw. All rights reserved.   *
*                                                               *
*****************************************************************