Pascal
and C
, but I
stopped far short of pushing it through to completion. The
reason was simple: if we're going to produce a real, functional
compiler for any language, I'd rather do it for KISS
, the
language that I've been defining in this tutorial series.
In this installment, we're going to do just that, for a subset of
KISS
which I've chosen to call TINY
.
The process will be essentially that outlined in Installment IX,
except for one notable difference. In that installment, I
suggested that you begin with a full BNF description of the
language. That's fine for something like Pascal
or C
, for which
the language definition is firm. In the case of TINY
, however,
we don't yet have a full description ... we seem to be defining
the language as we go. That's OK. In fact, it's preferable,
since we can tailor the language slightly as we go, to keep the
parsing easy.
So in the development that follows, we'll actually be doing a top-down development of both the language and its compiler. The BNF description will grow along with the compiler.
In this process, there will be a number of decisions to be made, each of which will influence the BNF and therefore the nature of the language. At each decision point I'll try to remember to explain the decision and the rationale behind my choice. That way, if you happen to hold a different opinion and would prefer a different option, you can choose it instead. You now have the background to do that. I guess the important thing to note is that nothing we do here is cast in concrete. When you're designing your language, you should feel free to do it your way.
Many of you may be asking at this point: Why bother starting over
from scratch? We had a working subset of KISS
as the outcome of
Installment VII (lexical scanning). Why not just extend it as
needed? The answer is threefold. First of all, I have been
making a number of changes to further simplify the program ...
changes like encapsulating the code generation procedures, so
that we can convert to a different target machine more easily.
Second, I want you to see how the development can indeed be done
from the top down as outlined in the last installment. Finally,
we both need the practice. Each time I go through this exercise,
I get a little better at it, and you will, also.
Tiny BASIC
,
Tiny Pascal
, and Tiny C
, each of which was a subset of its parent full
language. Tiny BASIC
, for example, had only single-character
variable names and global variables. It supported only a single
data type. Sound familiar? At this point we have almost all the
tools we need to build a compiler like that.
Yet a language called Tiny-anything still carries some baggage
inherited from its parent language. I've often wondered if this
is a good idea. Granted, a language based upon some parent
language will have the advantage of familiarity, but there may
also be some peculiar syntax carried over from the parent that
may tend to add unnecessary complexity to the compiler. (Nowhere
is this more true than in Small C
.)
I've wondered just how small and simple a compiler could be made
and still be useful, if it were designed from the outset to be
both easy to use and to parse. Let's find out. This language
will just be called "TINY
," period. It's a subset of KISS
, which
I also haven't fully defined, so that at least makes us
consistent (!). I suppose you could call it TINY
KISS
. But that
opens up a whole can of worms involving cuter and cuter (and
perhaps more risque) names, so let's just stick with TINY
.
The main limitations of TINY
will be because of the things we
haven't yet covered, such as data types. Like its cousins Tiny C
and Tiny BASIC
, TINY
will have only one data type, the 32-bit
integer. The first version we develop will also have no
procedure calls and will use single-character variable names,
although as you will see we can remove these restrictions without
much effort.
The language I have in mind will share some of the good features
of Pascal
, C
, and Ada
. Taking a lesson from the comparison of
the Pascal
and C
compilers in the previous installment, though,
TINY
will have a decided Pascal
flavor. Wherever feasible, a
language structure will be bracketed by keywords or symbols, so
that the parser will know where it's going without having to
guess.
One other ground rule: As we go, I'd like to keep the compiler producing real, executable code. Even though it may not do much at the beginning, it will at least do it correctly.
Finally, I'll use a couple of Pascal
restrictions that make
sense: All data and procedures must be declared before they are
used. That makes good sense, even though for now the only data
type we'll use is a word. This rule in turn means that the only
reasonable place to put the executable code for the main program
is at the end of the listing.
The top-level definition will be similar to Pascal
:
<program> ::= PROGRAM <top-level decl> <main> '.'
Already, we've reached a decision point. My first thought was to
make the main block optional. It doesn't seem to make sense to
write a "program" with no main program, but it does make sense if
we're allowing for multiple modules, linked together. As a
matter of fact, I intend to allow for this in KISS
. But then we
begin to open up a can of worms that I'd rather leave closed for
now. For example, the term "PROGRAM" really becomes a misnomer.
The MODULE of Modula-2 or the Unit of Turbo Pascal
would be more
appropriate. Second, what about scope rules? We'd need a
convention for dealing with name visibility across modules.
Better for now to just keep it simple and ignore the idea
altogether.
There's also a decision in choosing to require the main program
to be last. I toyed with the idea of making its position
optional, as in C
. But this doesn't really make
much sense in view of the Pascal
-like requirement that all data
and procedures be declared before they're referenced. Since the
main program can only call procedures that have already been
declared, the only position that makes sense is at the end, a la
Pascal
.
Given the BNF above, let's write a parser that just recognizes the brackets:
-- ------------------------------------------------------------ -- Parse and translate a program : prog ( -- ) 'p' match header prolog '.' match epilog ; -- ------------------------------------------------------------
The procedure Header just emits the startup code required by the assembler:
-- ------------------------------------------------------------ -- Write header info : header ( -- ) S" -- DATA section --------" emitdln S" " emitdln ; -- ------------------------------------------------------------
The procedures Prolog and Epilog emit the code for identifying the main program, and for returning to the OS (i.e. Forth):
-- ------------------------------------------------------------ -- Write the prolog : prolog ( -- ) S" " emitln S" -- CODE section --------" emitdln S" " emitdln S" CODE main" emitdln S" rpush," emitln newlabel postlabel ; -- ------------------------------------------------------------ -- Write the epilog : epilog ( -- ) S" rpop, ebx jmp," emitln S" END-CODE" emitdln ; -- ------------------------------------------------------------
The main program just calls Prog, and then looks for a clean ending:
-- ------------------------------------------------------------------------- : tiny10 ( -- ) init prog Look ^M <> IF S" Unexpected data after " S" ." name.aborts ENDIF ; -- -------------------------------------------------------------------------
Where name.aborts
is a new error-reporting routine added to the cradle:
-- abort, reporting offending item : name.aborts ( c-addr1 u1 char ) >R S" `" $+ R> CHAR-APPEND &' CHAR-APPEND aborts ;
At this point, TINY
will accept only one input "program," the
null program (chap10a.frt):
PROGRAM . (or 'p.' in our shorthand.)
Note, though, that the compiler does generate correct code for this program. It will run, and do what you'd expect the null program to do, that is, nothing but return gracefully to the OS.
As a matter of interest, one of my favorite compiler benchmarks
is to compile, link, and execute the null program in whatever
language is involved. You can learn a lot about the
implementation by measuring the overhead in time required to
compile what should be a trivial case. It's also interesting to
measure the amount of code produced. In many compilers, the code
can be fairly large, because they always include the whole run-time
library whether they need it or not. Early versions of
Turbo Pascal
produced a 12K object file for this case. VAX C
generates 50K!
The smallest null programs I've seen are those produced by Modula-2 compilers, and they run about 200-800 bytes.
In the case of TINY
, we have no run-time library as yet, so the
object code is indeed tiny: two bytes. That's got to be a
record, and it's likely to remain one since it is the minimum
size required by the OS.
The next step is to process the code for the main program. I'll
use the Pascal
BEGIN
-block:
<main> ::= BEGIN <block> END
Here, again, we have made a decision. We could have chosen to
require a "PROCEDURE MAIN" sort of declaration, similar to C
. I
must admit that this is not a bad idea at all ... I don't
particularly like the Pascal
approach since I tend to have
trouble locating the main program in a Pascal
listing. But the
alternative is a little awkward, too, since you have to deal with
the error condition where the user omits the main program or
misspells its name. Here I'm taking the easy way out.
Another solution to the "where is the main program" problem might be to require a name for the program, and then bracket the main by
BEGIN <name> END <name>
similar to the convention of Modula 2
. This adds a bit of
"syntactic sugar" to the language. Things like this are easy to
add or change to your liking, if the language is your own design.
To parse this definition of a main block, change procedure Prog
to read:
-- ------------------------------------------------------------ -- Parse and translate a program : prog ( -- ) 'p' match header main '.' match ; -- ------------------------------------------------------------and add the new procedure:
-- ------------------------------------------------------------ -- Parse and translate a main program : main ( -- ) 'b' match prolog 'e' match epilog ; -- ------------------------------------------------------------
Now, the only legal program is (chap10b.frt):
PROGRAM BEGIN END . (or 'pbe.')
Aren't we making progress??? Well, as usual it gets better. You might try some deliberate errors here, like omitting the 'b' or the 'e', and see what happens. As always, the compiler should flag all illegal inputs.
The obvious next step is to decide what we mean by a declaration.
My intent here is to have two kinds of declarations: variables
and procedures/functions. At the top level, only global
declarations are allowed, just as in C
.
For now, there can only be variable declarations, identified by
the keyword VAR
(abbreviated 'v'):
<top-level decls> ::= ( <data declaration> )* <data declaration> ::= VAR <var-list>
Note that since there is only one variable type, there is no need
to declare the type. Later on, for full KISS
, we can easily add
a type description.
The procedure Prog
becomes:
-- ------------------------------------------------------------ -- Parse and translate a program : prog ( -- ) 'p' match header topdecls main '.' match ; -- ------------------------------------------------------------Now, add the two new procedures:
-- ------------------------------------------------------------ -- Process a data declaration : decl ( -- ) 'v' match getchar ; -- ------------------------------------------------------------ -- Parse and translate global declarations : topdecls ( -- ) BEGIN Look 'b' <> WHILE CASE Look 'v' OF decl ENDOF S" Unrecognized keyword " Look name.aborts ENDCASE REPEAT ; -- ------------------------------------------------------------
Note that at this point, Decl
is just a stub. It generates no
code, and it doesn't process a list ... every variable must occur
in a separate VAR
statement.
OK, now we can have any number of data declarations, each
starting with a 'v' for VAR
, before the BEGIN
-block. Try a few
cases and see what happens (chap10c.frt).
With a little extra code, that's an easy thing to do from
procedure Decl
. Modify it as follows:
-- ------------------------------------------------------------ -- Parse and translate a data declaration : decl ( -- ) 'v' match getname alloc ; -- ------------------------------------------------------------
The procedure Alloc
just issues a command to the assembler to
allocate storage:
-- ------------------------------------------------------------ -- Allocate storage for a variable : alloc ( char -- ) S" CREATE " ROT CHAR-APPEND S" 0 , " $+ emitdln ; -- ------------------------------------------------------------
Give this one a whirl (chap10d.frt). Try an input that declares some variables, such as:
pvxvyvzbe.
See how the storage is allocated? Simple, huh? Note also that the entry point, "MAIN," comes out in the right place.
For the record, a "real" compiler would also have a symbol table to record the variables being used. Normally, the symbol table is necessary to record the type of each variable. But since in this case all variables have the same type, we don't need a symbol table for that reason. As it turns out, we're going to find a symbol necessary even without different types, but let's postpone that need until it arises.
Of course, we haven't really parsed the correct syntax for a data declaration, since it involves a variable list. Our version only permits a single variable. That's easy to fix, too.
The BNF for <var-list> is
<var-list> ::= <ident> (, <ident>)*
Adding this syntax to Decl
gives this new version:
-- ------------------------------------------------------------ -- Parse and translate a data declaration : decl ( -- ) 'v' match getname alloc BEGIN Look ',' = WHILE getchar getname alloc REPEAT ; -- ------------------------------------------------------------
OK, now compile this code and give it a try (chap10e.frt).
Try a number of lines of VAR
declarations, try a list of several variables on one
line, and try combinations of the two. Does it work?
Pascal
is that it doesn't allow
initializing data items in the declaration. That feature is
admittedly sort of a frill, and it may be out of place in a
language that purports to be a minimal language. But it's also
so easy to add that it seems a shame not to do so. The BNF
becomes:
<var-list> ::= <var> ( <var> )* <var> ::= <ident> [ = <integer> ]Change
Alloc
as follows:
-- ------------------------------------------------------------ -- Allocate storage for a variable : alloc ( char -- ) S" CREATE " ROT CHAR-APPEND BL CHAR-APPEND Look '=' = IF '=' match getnum (.) ELSE S" 0" ENDIF $+ S" ," $+ emitln ; -- ------------------------------------------------------------
There you are: an initializer with five added lines of Forth
.
OK, try this version of TINY
and verify that you can, indeed,
give the variables initial values (chap10f.frt).
By golly, this thing is starting to look real! Of course, it still doesn't do anything, but it looks good, doesn't it?
Before leaving this section, I should point out that we've used
two versions of function GetNum
. One, the earlier one, returns a
character value, a single digit. The other accepts a multi-digit
integer and returns an integer value. There's no reason to
limit ourselves to single-digit values here, so the correct
version to use is the one that returns an integer. Here it is:
-- ------------------------------------------------------------ -- Get a Number : getnum ( -- val ) Look digit? 0= IF S" Integer" expected ENDIF 0 >R BEGIN Look digit? WHILE R> #10 * Look '0' - + >R getchar REPEAT R> ; -- ------------------------------------------------------------
As a matter of fact, strictly speaking we should allow for
expressions in the data field of the initializer, or at the very
least for negative values. For now, let's just allow for
negative values by changing the code for Alloc
as follows:
-- ------------------------------------------------------------ -- Allocate storage for a variable : alloc ( char -- ) S" CREATE " ROT CHAR-APPEND BL CHAR-APPEND Look '=' = IF '=' match Look '-' = IF '-' match S" -" $+ ENDIF getnum (.) ELSE S" 0" ENDIF $+ S" ," $+ emitln ; -- ------------------------------------------------------------
Now you should be able to initialize variables with negative and/or multi-digit values (chap10g.frt).
pvavavabe.
Here we've declared the variable A three times. As you can see, the compiler will cheerfully accept that, and generate three identical labels. Not good.
Later on, when we start referencing variables, the compiler will also let us reference variables that don't exist. The assembler will catch both of these error conditions, but it doesn't seem friendly at all to pass such errors along to the assembler. The compiler should catch such things at the source language level.
So even though we don't need a symbol table to record data types, we ought to install one just to check for these two conditions. Since at this point we are still restricted to single-character variable names, the symbol table can be trivial. To provide for it, first add the following declaration at the beginning of your program:
CREATE ST 26 CHARS ALLOT
and insert the following function:
-- ------------------------------------------------------------ -- Look for symbol in table : intable ( char -- tf ) 'A' - ST + C@ 0<> ; -- ------------------------------------------------------------
We also need to initialize the table to all blanks. The
following lines in Init
will do the job:
begin ST 26 ERASE ...
Finally, insert the following two lines at the beginning of
Alloc
:
( n) DUP intable IF S" Duplicate variable name" ROT name.aborts ENDIF 'v' OVER 'A' - ST + C!
That should do it. The compiler will now catch duplicate declarations. Later, we can also use InTable when generating references to the variables. (chap10h.frt)
Believe it or not, though, we almost have a usable language! What's missing is the executable code that must go into the main program. But that code is just assignment statements and control statements ... all stuff we have done before. So it shouldn't take us long to provide for them, as well.
The BNF definition given earlier for the main program included a statement block, which we have so far ignored:
<main> ::= BEGIN <block> END
For now, we can just consider a block to be a series of assignment statements:
<block> ::= (Assignment)*
Let's start things off by adding a parser for the block. We'll begin with a stub for the assignment statement:
-- ------------------------------------------------------------ -- Parse and translate an assignment statement : assignment ( -- ) getchar ; -- ------------------------------------------------------------ -- Parse and translate a block of statements : block ( -- ) BEGIN Look 'e' <> WHILE assignment REPEAT ; -- ------------------------------------------------------------
Modify procedure Main to call Block
as shown below:
-- ------------------------------------------------------------ -- Parse and translate a main program : main ( -- ) 'b' match prolog block 'e' match epilog ; -- ------------------------------------------------------------
This version still won't generate any code for the "assignment statements" ... all it does is to eat characters until it sees the 'e' for 'END.' But it sets the stage for what is to follow.
The next step, of course, is to flesh out the code for an assignment statement. This is something we've done many times before, so I won't belabor it. This time, though, I'd like to deal with the code generation a little differently. Up till now, we've always just inserted the Emits that generate output code in line with the parsing routines. A little unstructured, perhaps, but it seemed the most straightforward approach, and made it easy to see what kind of code would be emitted for each construct.
However, several of you have asked me if the CPU-dependent code couldn't be collected into one spot where it would be easier to retarget to another CPU. The answer, of course, is yes.
To accomplish this, insert the following "code generation" routines:
-- ----------------------------------------------------------------------------------------------------------------------- -- Load/store primary register with a constant/variable : loadconstant ( u -- ) (.) S" d# -> eax mov," $+ emitln ; : loadvariable ( char -- ) DUP intable 0= IF .undefined ENDIF S" dword-ptr -> eax mov," ROT CHAR-PREPEND emitln ; : storevariable ( char -- ) DUP intable 0= IF .undefined ENDIF S" eax -> " ROT CHAR-APPEND S" dword-ptr mov," $+ emitln ; -- Allocate storage for a static variable : allocatestorage ( char c-addr u -- ) ROT S" CREATE " ROT CHAR-APPEND BL CHAR-APPEND 2SWAP $+ S" , " $+ emitdln ; : callfunction ( char -- ) S" offset NEAR call," ROT CHAR-PREPEND emitln ; -- call a function : _negate ( -- ) S" eax neg," emitln ; -- negate primary : _not ( -- ) S" eax not," emitln ; -- not primary : _push ( -- ) S" eax push," emitln ; -- push primary to stack : _pop ( -- ) S" eax pop," emitln ; -- pop primary from stack : _decr ( -- ) S" 1 b# -> eax sub," emitln ; -- decrement primary : _incr ( -- ) S" 1 b# -> eax add," emitln ; -- increment primary : _decSP ( -- ) S" [esp -4 +] -> esp lea," emitln ; -- drop 1 stack item : _incSP ( -- ) S" [esp 4 +] -> esp lea," emitln ; -- restore stack item : _popadd ( -- ) S" [esp] -> eax add, [esp 4 +] -> esp lea," emitln ; -- add TOS to primary : _popsub ( -- ) S" [esp] -> eax sub, [esp 4 +] -> esp lea," emitln _negate ; -- subtract TOS from primary : _popor ( -- ) S" [esp] -> eax or, [esp 4 +] -> esp lea," emitln ; -- or TOS to primary : _popxor ( -- ) S" [esp] -> eax xor, [esp 4 +] -> esp lea," emitln ; -- xor TOS to primary : _popand ( -- ) S" [esp] -> eax and, [esp 4 +] -> esp lea," emitln ; -- xor TOS to primary : _popcmp ( -- ) S" [esp] -> eax cmp, [esp 4 +] -> esp lea," emitln ; -- compare TOS to primary : _sete ( -- ) S" al sete, al -> eax movsx," emitln ; -- set primary if equal : _setne ( -- ) S" al setne, al -> eax movsx," emitln ; -- set primary if NOT equal : _setg ( -- ) S" al setl, al -> eax movsx," emitln ; -- set primary if greater : _setl ( -- ) S" al setg, al -> eax movsx," emitln ; -- set primary if less : _setge ( -- ) S" al setle, al -> eax movsx," emitln ; -- set primary if greater or equal : _setle ( -- ) S" al setge, al -> eax movsx," emitln ; -- set primary if less or equal -- jump always; jump if eax 0 : _branch ( c-addr u -- ) S" offset NEAR jmp," $+ emitln ; : _pcmp+b0> ( c-addr u -- ) _popcmp S" offset NEAR jg," $+ emitln ; : _branch0 ( c-addr u -- ) S" eax -> eax or, " 2SWAP $+ S" offset NEAR je," $+ emitln ; : _branch<>0 ( c-addr u -- ) S" eax -> eax or, " 2SWAP $+ S" offset NEAR jnz," $+ emitln ; -- multiply TOS and primary; divide primary by TOS : _popmul ( -- ) S" [esp] dword mul, [esp 4 +] -> esp lea," emitln ; : _popdiv ( -- ) S" ecx pop, ecx -> eax xchg, eax -> edx mov, #31 b# -> edx sar, ebx idiv," emitln ; -- -----------------------------------------------------------------------------------------------------------------------
The nice part of this approach, of course, is that we can retarget the compiler to a new CPU simply by rewriting these "code generator" procedures. In addition, we will find later that we can improve the code quality by tweaking these routines a bit, without having to modify the compiler proper.
Note that both LoadVariable
and StoreVariable
check the symbol table to make
sure that the variable is defined. The error handler .Undefined
simply calls Abort
:
-- ------------------------------------------------------------------ -- Report an undefined identifier : .undefined ( char -- ) S" Undefined identifier `" ROT CHAR-APPEND &' CHAR-APPEND aborts ; -- ------------------------------------------------------------------
OK, we are now finally ready to begin processing executable code.
We'll do that by replacing the stub version of procedure
Assignment
.
We've been down this road many times before, so this should all be familiar to you. In fact, except for the changes associated with the code generation, we could just copy the procedures from Part VII. Since we are making some changes, I won't just copy them, but we will go a little faster than usual.
The BNF for the assignment statement is:
<assignment> ::= <ident> = <expression> <expression> ::= <first term> ( <addop> <term> )* <first term> ::= <first factor> <rest> <term> ::= <factor> <rest> <rest> ::= ( <mulop> <factor> )* <first factor> ::= [ <addop> ] <factor> <factor> ::= <var> | <number> | ( <expression> )
This version of the BNF is also a bit different than we've used before ... yet another "variation on the theme of an expression." This particular version has what I consider to be the best treatment of the unary minus. As you'll see later, it lets us handle negative constant values efficiently. It's worth mentioning here that we have often seen the advantages of "tweaking" the BNF as we go, to help make the language easy to parse. What you're looking at here is a bit different: we've tweaked the BNF to make the code generation more efficient! That's a first for this series.
Anyhow, the following code implements the BNF:
-- ------------------------------------------------------------- -- Parse and Translate a Math Factor DEFER expression : factor ( -- ) Look '(' = IF '(' match expression ')' match EXIT ENDIF Look alpha? IF getname loadvariable ELSE getnum loadconstant ENDIF ; -- ------------------------------------------------------------ -- Parse and translate a negative factor : negfactor ( -- ) '-' match Look digit? IF getnum NEGATE loadconstant EXIT ENDIF factor _negate ; -- ------------------------------------------------------------ -- Parse and translate a leading factor : firstfactor ( -- ) CASE Look '+' OF '+' match factor ENDOF '-' OF negfactor ENDOF factor ENDCASE ; -- ------------------------------------------------------------ -- Recognize and translate a multiply : multiply ( -- ) '*' match factor _popmul ; -- ------------------------------------------------------------ -- Recognize and translate a divide : divide ( -- ) '/' match factor _popdiv ; -- ------------------------------------------------------------ -- Common code used by term and firstterm : term1 ( -- ) BEGIN Look mulop? WHILE _push CASE Look '*' OF multiply ENDOF '/' OF divide ENDOF ENDCASE REPEAT ; -- ------------------------------------------------------------ -- Parse and translate a math term : term ( -- ) factor term1 ; -- ------------------------------------------------------------ -- Parse and translate a leading term : firstterm ( -- ) firstfactor term1 ; -- ------------------------------------------------------------ -- Recognize and translate an add : add ( -- ) '+' match term _popadd ; -- ------------------------------------------------------------ -- Recognize and translate a subtract : subtract ( -- ) '-' match term _popsub ; -- ------------------------------------------------------------ -- Parse and translate an expression :NONAME ( -- ) firstterm BEGIN Look addop? WHILE _push CASE Look '+' OF Add ENDOF '-' OF Subtract ENDOF ENDCASE REPEAT ; IS expression -- ------------------------------------------------------------ -- Parse and translate an assignment statement : assignment ( -- ) getname LOCAL name '=' match expression name storevariable ; -- ------------------------------------------------------------
OK, if you've got all this code inserted, then compile it and check it out. You should be seeing reasonable-looking code, representing a complete program that will assemble and execute. We have a compiler! (chap10i.frt)
NotFactor
somewhat, to parallel the structure of FirstFactor
.
To begin, we're going to need some more recognizers:
-- ------------------------------------------------------------ -- Recognize a boolean Orop : orop? ( char -- tf ) DUP '|' = SWAP '~' = OR ; -- ------------------------------------------------------------ -- Recognize a relop : relop? ( char -- tf ) DUP '=' = OVER '#' = OR OVER '<' = OR SWAP '>' = OR ; -- ------------------------------------------------------------
All of this gives us the tools we need. The BNF for the Boolean expressions is:
<bool-expr> ::= <bool-term> ( <orop> <bool-term> )* <bool-term> ::= <not-factor> ( <andop> <not-factor> )* <not-factor> ::= [ '!' ] <relation> <relation> ::= <expression> [ <relop> <expression> ]
Sharp-eyed readers might note that this syntax does not include
the non-terminal "bool-factor" used in earlier versions. It was
needed then because I also allowed for the Boolean constants TRUE
and FALSE. But remember that in TINY
there is no distinction
made between Boolean and arithmetic types ... they can be freely
intermixed. So there is really no need for these predefined
values ... we can just use -1 and 0, respectively.
In C
terminology, we could always use the defines:
#define TRUE -1 #define FALSE 0
(That is, if TINY
had a preprocessor.) Later on, when we allow
for declarations of constants, these two values will be
predefined by the language.
The reason that I'm harping on this is that I've already tried the alternative, which is to include TRUE and FALSE as keywords. The problem with that approach is that it then requires lexical scanning for every variable name in every expression. If you'll recall, I pointed out in Installment VII that this slows the compiler down considerably. As long as keywords can't be in expressions, we need to do the scanning only at the beginning of every new statement ... quite an improvement. So using the syntax above not only simplifies the parsing, but speeds up the scanning as well.
OK, given that we're all satisfied with the syntax above, the corresponding code is shown below:
-- ------------------------------------------------------------- -- Recognize and translate a relational "equals" : equals ( -- ) '=' match expression _popcmp _sete ; -- ------------------------------------------------------------- -- Recognize and translate a relational "not equals" : notequals ( -- ) '#' match expression _popcmp _setne ; -- ------------------------------------------------------------- -- Recognize and translate a relational "less than" : less ( -- ) '<' match expression _popcmp _setl ; -- ------------------------------------------------------------- -- Recognize and translate a relational "greater than" : greater ( -- ) '>' match expression _popcmp _setg ; -- ------------------------------------------------------------- -- Parse and translate a relation : relation ( -- ) expression Look relop? 0= ?EXIT _push CASE Look '=' OF equals ENDOF '#' OF notequals ENDOF '<' OF less ENDOF '>' OF greater ENDOF ENDCASE ; -- ------------------------------------------------------------- -- Parse and translate a boolean factor with leading NOT : notfactor ( -- ) Look '!' <> IF relation EXIT ENDIF '!' match relation _not ; -- ------------------------------------------------------------- -- Parse and translate a boolean term : boolterm ( -- ) notfactor BEGIN Look '&' = WHILE _push '&' match notfactor _popand REPEAT ; -- ------------------------------------------------------------ -- Recognize and translate a boolean OR : boolOR ( -- ) '|' match boolterm _popor ; -- ------------------------------------------------------------ -- Recognize and translate an EXCLUSIVE OR : boolXOR ( -- ) '~' match boolterm _popxor ; -- ------------------------------------------------------------ -- Parse and translate a boolean expression :NONAME ( -- ) boolterm BEGIN Look orop? WHILE _push CASE Look '|' OF boolor ENDOF '~' OF boolxor ENDOF ENDCASE REPEAT ; IS boolexpression -- ------------------------------------------------------------
To tie it all together, don't forget to change the references to
Expression
in procedures Factor
and Assignment
so that they call
BoolExpression
instead.
OK, if you've got all that typed in, compile it and give it a whirl (chap10j.frt). First, make sure you can still parse an ordinary arithmetic expression. Then, try a Boolean one. Finally, make sure that you can assign the results of relations. Try, for example:
pvx,y,zbx=z>ye.which stands for:
PROGRAM VAR X,Y,Z BEGIN X = Z > Y END.
See how this assigns a Boolean value to X?
TINY
, we'll only
allow two kinds of them, the IF
and the WHILE
:
<if> ::= IF <bool-expression> <block> [ ELSE <block>] ENDIF <while> ::= WHILE <bool-expression> <block> ENDWHILE
Once again, let me spell out the decisions implicit in this
syntax, which departs strongly from that of C
or Pascal
. In both
of those languages, the "body" of an IF
or WHILE
is regarded as a
single statement. If you intend to use a block of more than one
statement, you have to build a compound statement using BEGIN-END
(in Pascal
) or '{}' (in C
). In TINY
(and KISS
) there is no such
thing as a compound statement ... single or multiple they're all
just blocks to these languages.
In KISS
, all the control structures will have explicit and unique
keywords bracketing the statement block, so there can be no
confusion as to where things begin and end. This is the modern
approach, used in such respected languages as Ada
and Modula 2
,
and it completely eliminates the problem of the "dangling else."
Note that I could have chosen to use the same keyword END
to end
all the constructs, as is done in Pascal
. (The closing '}' in C
serves the same purpose.) But this has always led to confusion,
which is why Pascal
programmers tend to write things like
end { loop }
or
end { if }
As I explained in Part V, using unique terminal keywords does increase the size of the keyword list and therefore slows down the scanning, but in this case it seems a small price to pay for the added insurance. Better to find the errors at compile time rather than run time.
One last thought: The two constructs above each have the non-terminals
<bool-expression> and <block>
juxtaposed with no separating keyword. In Pascal
we would expect
the keywords THEN
and DO
in these locations.
I have no problem with leaving out these keywords, and the parser has no trouble either, on condition that we make no errors in the bool-expression part. On the other hand, if we were to include these extra keywords we would get yet one more level of insurance at very little cost, and I have no problem with that, either. Use your best judgment as to which way to go.
OK, with that bit of explanation let's proceed. As usual, we're going to need some new code generation routines. We already defined most of them above.
Except for the encapsulation of the code generation, the code to parse the control constructs is the same as you've seen before:
-- ------------------------------------------------------------- -- Recognize and translate an IF Construct DEFER block ( -- ) : doIF ( -- ) 0 0 LOCALS| L2 L1 | 'i' match boolexpression newlabel DUP 1+ ALLOCATE ?ALLOCATE DUP TO L1 PACK DROP L1 COUNT DUP 1+ ALLOCATE ?ALLOCATE DUP TO L2 PACK DROP L1 COUNT _branch0 block Look 'l' = IF L2 FREE ?ALLOCATE newlabel DUP 1+ ALLOCATE ?ALLOCATE DUP TO L2 PACK DROP L2 COUNT _branch L1 COUNT postlabel block ENDIF L2 COUNT postlabel 'e' match L1 FREE ?ALLOCATE L2 FREE ?ALLOCATE ; -- ------------------------------------------------------------ -- Parse and translate a WHILE Statement : doWHILE ( -- ) 0 0 LOCALS| L2 L1 | newlabel DUP 1+ ALLOCATE ?ALLOCATE DUP TO L1 PACK DROP newlabel DUP 1+ ALLOCATE ?ALLOCATE DUP TO L2 PACK DROP 'w' match L1 COUNT postlabel boolexpression L2 COUNT _branch0 block 'e' match L1 COUNT _branch L2 COUNT postlabel L1 FREE ?ALLOCATE L2 FREE ?ALLOCATE ; -- ------------------------------------------------------------
To tie everything together, we need only modify procedure Block
to recognize the "keywords" for the IF
and WHILE
. As usual, we
expand the definition of a block like so:
<block> ::= ( <statement> )*
where
<statement> ::= <if> | <while> | <assignment>
The corresponding code is:
-- ------------------------------------------------------------ -- Parse and translate a block of statements :NONAME BEGIN Look 'e' <> Look 'l' <> AND WHILE CASE Look 'i' OF doIF ENDOF 'w' OF doWHILE ENDOF assignment ENDCASE REPEAT ; IS block -- ------------------------------------------------------------
OK, add the routines I've given, compile and test them (chap10k.frt). You should be able to parse the single-character versions of any of the control constructs. It's looking pretty good!
As a matter of fact, except for the single-character limitation
we've got a virtually complete version of TINY
. I call it, with
tongue planted firmly in cheek, TINY Version 0.1
.
To begin with, let's simply allow for whitespace. This involves
only adding calls to SkipWhite
at the end of the three routines,
GetName
, GetNum
, and Match
. A call to SkipWhite
in Init
primes
the pump in case there are leading spaces.
Next, we need to deal with newlines. This is really a two-step process, since the treatment of the newlines with single-character tokens is different from that for multi-character ones. We can eliminate some work by doing both steps at once, but I feel safer taking things one step at a time.
Insert the new procedure:
-- ------------------------------------------------------------ -- Skip over an end-of-line : newline ( -- ) BEGIN Look ^M = WHILE getchar Look ^J = IF getchar ENDIF skipwhite REPEAT ; -- ------------------------------------------------------------
Note that we have seen this procedure before in the form of
word Fin
. I've changed the name since this new one seems
more descriptive of the actual function. I've also changed the
code to allow for multiple newlines and lines with nothing but
white space.
The next step is to insert calls to NewLine
wherever we decide a
newline is permissible. As I've pointed out before, this can be
very different in different languages. In TINY
, I've decided to
allow them virtually anywhere. This means that we need calls to
NewLine
at the beginning (not the end, as with SkipWhite
) of the
procedures GetName
, GetNum
, and Match
.
For procedures that have while loops, such as TopDecl
, we need a
call to NewLine
at the beginning of the procedure and at the
bottom of each loop. That way, we can be assured that NewLine
has just been called at the beginning of each pass through the
loop.
If you've got all this done, try the program out and verify that it will indeed handle white space and newlines.
If it does, then we're ready to deal with multi-character tokens and keywords. To begin, add the additional declarations (copied almost verbatim from Part VII):
-- ------------------------------------------------------------ -- Type declarations 8 =: /symbol 100 =: maxentry -- ------------------------------------------------------------ -- Variable declarations 0 VALUE token -- encoded token CREATE token$ 16 CHARS ALLOT -- unencoded token -- ------------------------------------------------------------ -- Definition of keywords and token types -- helper to build symbol tables : $, ( c-addr u size -- ) >S S MIN >R R@ C, R@ 0 ?DO C@+ C, LOOP DROP S> R> - CHARS ALLOT ; -- Symboltable builder itself : SYMTAB ( size -- ) CREATE DUP 0< IF 0 , ABS DUP , 1+ maxentry * ALLOT ELSE HERE >R 0 , ( #items ) DUP >S , ( itemsize ) BEGIN BL <WORD> DUP WHILE S $, 1 R@ +! REPEAT 2DROP -R -S ENDIF DOES> CELL+ @+ 1+ ROT * + ( ix -- addr ) ; /symbol SYMTAB KWlist IF ELSE ENDIF WHILE ENDWHILE VAR BEGIN END PROGRAM : KW->token ( kw_index -- ) 2+ C" xilewevbep" + C@ TO token ; -- ------------------------------------------------------------
Next, add the three procedures, also from Part VII:
-- ------------------------------------------------------------ -- Table Lookup : lookup ( c-addr u 'table -- n2 | -1 ) 0 0 LOCALS| /symbol n table sz addr | table 2 CELLS - @+ TO n @ TO /symbol n 0<= IF -1 EXIT ENDIF 0 n DO /symbol 1+ I * table + COUNT addr sz COMPARE 0= IF I UNLOOP EXIT ENDIF -1 +LOOP -1 ; -- ------------------------------------------------------------ . . -- ------------------------------------------------------------ -- Get an identifier and scan it for keywords : scan ( -- ) getname 0 KWlist lookup KW->token ; -- ------------------------------------------------------------ -- Match a specific input string : matchstring ( c-addr u -- ) token$ COUNT 2OVER COMPARE IF &` CHAR-PREPEND &' CHAR-APPEND expected ELSE 2DROP ENDIF ; -- ------------------------------------------------------------
Now, we have to make a fairly large number of subtle changes to
the remaining procedures. First, we must change the function
GetName
to a procedure, again as we did in Part VII:
-- ------------------------------------------------------------ -- Get an identifier : getname ( -- c-addr u ) newline Look alpha? 0= IF S" Name" expected ENDIF token$ C0! BEGIN Look alnum? WHILE Look >UPC token$ char+! getchar REPEAT token$ COUNT skipwhite ; -- ------------------------------------------------------------
Note that this word also leaves its result in the global string
token$
.
Next, we have to change every reference to GetName
to reflect its
new form. These occur in Factor
, Assignment
, and Decl
:
-- ------------------------------------------------------------- -- Parse and translate a math factor DEFER boolexpression ( -- ) : factor ( -- ) Look '(' = IF '(' match boolexpression ')' match EXIT ENDIF Look alpha? IF getname DROP C@ loadvariable EXIT ENDIF getnum loadconstant ; -- ------------------------------------------------------------ . . -- ------------------------------------------------------------ -- Parse and translate an assignment statement : assignment ( -- ) token$ COUNT DUP 1+ ALLOCATE ?ALLOCATE DUP LOCAL name PACK DROP '=' match boolexpression name COUNT DROP C@ storevariable name FREE ?ALLOCATE ; -- ------------------------------------------------------------ . . -- ------------------------------------------------------------ -- Parse and translate a data declaration : decl ( -- ) getname DROP C@ alloc BEGIN Look ',' = WHILE ',' match getname DROP C@ alloc REPEAT ; -- ------------------------------------------------------------
(Note that we're still only allowing single-character variable names, so we take the easy way out here and simply use the first character of the string.)
Finally, we must make the changes to use Token
instead of Look
as
the test character and to call Scan
at the appropriate places.
Mostly, this involves deleting calls to Match
, occasionally
replacing calls to Match
by calls to MatchString, and Replacing
calls to NewLine
by calls to Scan
. Here are the affected
routines:
-- ------------------------------------------------------------- -- Recognize and translate an IF construct DEFER block : doIF ( -- ) 0 0 LOCALS| L2 L1 | boolexpression newlabel DUP 1+ ALLOCATE ?ALLOCATE DUP TO L1 PACK DROP L1 COUNT DUP 1+ ALLOCATE ?ALLOCATE DUP TO L2 PACK DROP L1 COUNT _branch0 block token 'l' = IF L2 FREE ?ALLOCATE newlabel DUP 1+ ALLOCATE ?ALLOCATE DUP TO L2 PACK DROP L2 COUNT _branch L1 COUNT postlabel block ENDIF L2 COUNT postlabel S" ENDIF" matchstring L1 FREE ?ALLOCATE L2 FREE ?ALLOCATE ; -- ------------------------------------------------------------ -- Parse and translate a WHILE statement : doWHILE ( -- ) 0 0 LOCALS| L2 L1 | newlabel DUP 1+ ALLOCATE ?ALLOCATE DUP TO L1 PACK DROP newlabel DUP 1+ ALLOCATE ?ALLOCATE DUP TO L2 PACK DROP L1 COUNT postlabel boolexpression L2 COUNT _branch0 block L1 COUNT _branch L2 COUNT postlabel S" ENDWHILE" matchstring L1 FREE ?ALLOCATE L2 FREE ?ALLOCATE ; -- ------------------------------------------------------------ -- Parse and translate a block of statements :NONAME ( label -- ) LOCAL L scan BEGIN token 'e' <> token 'l' <> AND WHILE CASE token 'i' OF doIF ENDOF 'w' OF doWHILE ENDOF assignment ENDCASE scan REPEAT ; IS block -- ------------------------------------------------------------ -- Parse and Translate Global Declarations : topdecls ( -- ) scan BEGIN token 'b' <> WHILE CASE token 'v' OF decl ENDOF S" Unrecognized keyword " token$ COUNT name.aborts ENDCASE scan REPEAT ; -- ------------------------------------------------------------ -- Parse and translate a main program : doprogram ( -- ) S" BEGIN" matchstring prolog block S" END" matchstring epilog ; -- ------------------------------------------------------------ -- Parse and translate a program : prog ( -- ) S" PROGRAM" matchstring getname prog$ PACK DROP header topdecls doprogram '.' match ; -- ------------------------------------------------------------ -- Initialize : init ( -- ) CLEAR Lcount ST 26 CHARS ERASE CR getchar scan ; -- ------------------------------------------------------------
That should do it. If all the changes got in correctly, you should now be parsing programs that look like programs. (If you didn't make it through all the changes, don't despair. A complete listing of the final form is given later.)
Did it work? If so, then we're just about home. In fact, with a few minor exceptions we've already got a compiler that's usable. There are still a few areas that need improvement.
We've done this step before. This time, as usual, I'm doing it a little differently. I think the approach used here keeps things just about as simple as possible.
The natural way to implement a symbol table in Forth
is by
declaring an array of symbol strings. This has the advantage that we can use
the existing procedure Lookup
to search the symbol table as well
as the keyword list. As it turns out, even when we need more
fields we can still use the same approach, simply by storing the
other fields in separate arrays.
OK, here are the changes that need to be made. First, add the new constant:
0 ST 2 CELLS - =: [cnt]ST 0 SType 2 CELLS - =: [cnt]SType
Then change the definition of the symbol table as follows:
100 =: maxentry . . /symbol NEGATE SYMTAB ST =CELL 1- NEGATE SYMTAB SType
(The NEGATE prevents initializing the symbol table from the input stream.)
Next, we need to replace InTable
:
-- ------------------------------------------------------------ -- Look for symbol in table : intable ( c-addr u -- tf ) 0 ST lookup 0>= ; -- ------------------------------------------------------------
We also need a new procedure, AddEntry
, that adds a new entry to
the table:
-- ------------------------------------------------------------ -- Add a new entry to the symbol table : addentry ( c-addr u type -- ) [cnt]ST LOCALS| #entries T sz addr | addr sz intable IF S" Duplicate identifier " addr sz name.aborts ENDIF #entries @ maxentry = IF S" Symbol table full" aborts ENDIF 1 #entries +! addr sz #entries @ ST PACK DROP T #entries @ SType C! ; -- ------------------------------------------------------------
This procedure is called by Alloc
:
-- ------------------------------------------------------------ -- Allocate storage for a variable : alloc ( c-addr u -- ) LOCALS| sz addr | addr sz intable IF S" Duplicate variable name " addr sz name.aborts ENDIF addr sz 'v' addentry . . . -- ------------------------------------------------------------
Finally, we must change all the routines that currently treat the
variable name as a single character. These include LoadVariable
,
StoreVariable
, and Factor
, Assignment
, and Decl
(change token$ CHAR+ C@
to token$ COUNT
).
One last thing: change procedure Init
to clear the array as
shown:
-- ------------------------------------------------------------ -- Initialize : init ( -- ) CLEAR Lcount [cnt]ST OFF [cnt]SType OFF CR getchar scan ; -- ------------------------------------------------------------
That should do it. Try it out and verify that you can, indeed, use multi-character variable names.
Pascal
'<>' for "not equals," instead of '#'.
If you'll recall, in Part VII I pointed out that the conventional way to deal with relops is to include them in the list of keywords, and let the lexical scanner find them. But, again, this requires scanning throughout the expression parsing process, whereas so far we've been able to limit the use of the scanner to the beginning of a statement.
I mentioned then that we can still get away with this, since the multi-character relops are so few and so limited in their usage. It's easy to just treat them as special cases and handle them in an ad hoc manner.
The changes required affect only the code generation routines and
procedures Relation
and friends.
Modify the relation parsing routines as shown below:
-- ------------------------------------------------------------- -- Recognize and translate a relational "less than or equal" : lessorequal ( -- ) '=' match expression _popcmp _setle ; -- ------------------------------------------------------------- -- Recognize and Translate a Relational "Not Equals" : notequals ( -- ) '>' match expression _popcmp _setne ; -- ------------------------------------------------------------- -- recognize and translate a relational "less than" : less ( -- ) '<' match Look '=' = IF lessorequal EXIT ENDIF Look '>' = IF notequals EXIT ENDIF expression _popcmp _setl ; -- ------------------------------------------------------------- -- recognize and translate a relational "greater than" : greater ( -- ) '>' match Look '=' = IF '=' match expression _popcmp _setge EXIT ENDIF expression _popcmp _setg ; -- -------------------------------------------------------------
That's all it takes. Now you can process all the relops. Try it.
Now, the convention these days, established in C
and continued in
Ada
and Modula 2
, is to leave I/O statements out of the language
itself, and just include them in the subroutine library. That
would be fine, except that so far we have no provision for
subroutines. Anyhow, with this approach you run into the problem
of variable-length argument lists. In Pascal
, the I/O statements
are built into the language because they are the only ones for
which the argument list can have a variable number of entries.
In C
, we settle for kludges like scanf and printf, and must pass
the argument count to the called procedure. In Ada
and Modula 2
we must use the awkward (and slow!) approach of a separate call
for each argument.
So I think I prefer the Pascal
approach of building the I/O in,
even though we don't need to.
As usual, for this we need some more code generation routines. These turn out to be the easiest of all, because all we do is to call library procedures to do the work:
-- ------------------------------------------------------------- -- read to primary register and store in variable : readvar ( c-addr u -- ) S" READ" callfunction storevariable ; -- ------------------------------------------------------------- -- write from primary register : writevar ( -- ) S" WRITE" callfunction ;
The idea is that READ
loads the value from input to EAX, and
WRITE outputs it from there.
These two words represent our first encounter with a need
for library procedures ... the components of a Run Time Library
(RTL). Of course, someone (namely us) has to write these
routines, but they're not part of the compiler itself. I won't
even bother showing the routines here, since these are obviously
very much OS-dependent. I will simply say that for
Windows XP, they are not particularly simple.. One reason I
won't show them here is that you can add all kinds of fanciness
to the things, for example by prompting in READ
for the inputs,
and by giving the user a chance to reenter a bad input.
But that is really separate from compiler design, so for now I'll just assume that a library call TINYLIB.LIB exists. Since we now need it loaded, we need to add a statement to include it in procedure Header:
-- ------------------------------------------------------------ -- Write header info : header ( -- ) S" -- DATA section --------" emitdln S" NEEDS -tinylib" emitdln S" " emitdln ; -- ------------------------------------------------------------
That takes care of that part. Now, we also need to recognize the read and write commands. We can do this by adding two more keywords to our list:
-- ------------------------------------------------------------ -- Definition of keywords and token types /symbol SYMTAB KWlist IF ELSE ENDIF WHILE ENDWHILE READ WRITE VAR BEGIN END PROGRAM : KW->token ( kw_index -- ) 2+ C" xileweRWvbep" + C@ TO token ; -- ------------------------------------------------------------
(Note how I'm using upper case codes here to avoid conflict with
the 'w' of WHILE
.)
Next, we need procedures for processing the read/write statement and its argument list:
-- ------------------------------------------------------------ -- Process a read statement : doread ( -- ) '(' match getname readvar BEGIN Look ',' = WHILE ',' match getname readvar REPEAT ')' match ; -- ------------------------------------------------------------- -- Process a write statement : dowrite ( -- ) '(' match expression writevar BEGIN Look ',' = WHILE ',' match expression writevar REPEAT ')' match ; -- ------------------------------------------------------------
Finally, we must expand procedure Block
to handle the new
statement types:
-- ------------------------------------------------------------ -- Parse and translate a block of statements :NONAME ( -- ) scan BEGIN token 'e' <> token 'l' <> AND WHILE CASE token 'i' OF doIF ENDOF 'w' OF doWHILE ENDOF 'R' OF doread ENDOF 'W' OF dowrite ENDOF assignment ENDCASE scan REPEAT ; IS block -- ------------------------------------------------------------
That's all there is to it. Now we have a language!
TINY
completely defined. It's not much ...
actually a toy compiler. TINY
has only one data type and no
subroutines ... but it's a complete, usable language. While
you're not likely to be able to write another compiler in it, or
do anything else very seriously, you could write programs to read
some input, perform calculations, and output the results. Not
too bad for a toy.
Most importantly, we have a firm base upon which to build further
extensions. I know you'll be glad to hear this: this is the last
time I'll start over in building a parser ... from now on I
intend to just add features to TINY
until it becomes KISS
. Oh,
there'll be other times we will need to try things out with new
copies of the Cradle, but once we've found out how to do those
things they'll be incorporated into TINY
.
What will those features be? Well, for starters we need subroutines and functions. Then we need to be able to handle different types, including arrays, strings, and other structures. Then we need to deal with the idea of pointers. All this will be upcoming in future installments.
See you then.
For references purposes, the complete listing of TINY
Version 1.0
is shown below (tiny10.frt):
-- Variable declarations -------------------------------------------------------------------------------------------------- -1 =: unknown 0 VALUE Lcount -- label counter 0 VALUE Look -- lookahead character 0 VALUE token -- encoded token #100 =: maxentry CREATE token$ 0 C, #256 CHARS ALLOT -- unencoded token CREATE prog$ 0 C, #256 CHARS ALLOT -- program name -- Symbol tables and Labels ---------------------------------------------------------------------------------------------------------- -- helper to build symbol tables : $, ( c-addr u size -- ) >S S MIN >R R@ C, R@ 0 ?DO C@+ C, LOOP DROP S> R> - CHARS ALLOT ; -- Type declarations : SYMTAB ( size -- ) CREATE DUP 0< IF 0 , ABS DUP , 1+ maxentry * ALLOT ELSE HERE >R 0 , ( #items ) DUP >S , ( itemsize ) BEGIN BL <WORD> DUP WHILE 2DUP S" \" COMPARE 0= IF 2DROP REFILL DROP ELSE S $, 1 R@ +! ENDIF REPEAT 2DROP -R -S ENDIF DOES> CELL+ @+ 1+ ROT * + ( ix -- addr ) ; -- Definition of keywords and token types 8 CHARS =: /symbol /symbol SYMTAB KWlist IF ELSE ENDIF \ WHILE ENDWHILE \ DO ENDDO \ LOOP ENDLOOP \ REPEAT UNTIL \ FOR TO ENDFOR \ BREAK \ READ WRITE \ VAR BEGIN END PROGRAM : KW->token ( kw_index -- ) 2+ C" xilewedeLerufteBRWvbep" + C@ TO token ; /symbol NEGATE SYMTAB ST =CELL 1- NEGATE SYMTAB SType 0 ST 2 CELLS - =: [cnt]ST 0 SType 2 CELLS - =: [cnt]SType : lookup ( c-addr u 'table -- n2|unknown ) 0 0 LOCALS| /symbol n table sz addr | table 2 CELLS - @+ TO n @ TO /symbol n 0<= IF unknown EXIT ENDIF 0 n DO /symbol 1+ I * table + COUNT addr sz COMPARE 0= IF I UNLOOP EXIT ENDIF -1 +LOOP unknown ; -- Locate a symbol in table -- Returns the index of the entry. (unknown) -1 if not present. : locate ( c-addr u -- ix ) 0 ST lookup ; -- Look for symbol in table : intable ( c-addr u -- tf ) 0 ST lookup 0>= ; -- Dump the symbol tabel : .symbols ( -- ) [cnt]ST @ 0= IF CR ." No symbols defined." EXIT ENDIF CR ." -- type --.-------- name ---------" CR [cnt]ST @ 0 ?DO CR 2 HTAB I 1+ SType C@ EMIT #14 HTAB I 1+ ST .$ LOOP ; -- Tools ------------------------------------------------------------------------------------------------------------------ : getchar ( -- ) EKEY TO Look ; -- read new character from input stream : error ( c-addr u -- ) CR ^G EMIT ." Error: " TYPE ." ." ; -- report an error : aborts ( c-addr u -- ) error ABORT ; -- report error and halt : expected ( c-addr u -- ) S" Expected" $+ aborts ; -- report what was expected : alpha? ( char -- tf ) >UPC 'A' 'Z' 1+ WITHIN ; -- recognize an alpha character : digit? ( char -- tf ) '0' '9' 1+ WITHIN ; -- recognize a decimal digit : alnum? ( char -- tf ) DUP alpha? SWAP digit? OR ; -- recognize alphanumeric : white? ( char -- tf ) DUP Tab = SWAP BL = OR ; -- recognize white space : orop? ( char -- tf ) DUP '|' = SWAP '~' = OR ; -- recognize an OR operand : addop? ( char -- tf ) DUP '+' = SWAP '-' = OR ; -- test for AddOp : mulop? ( char -- tf ) DUP '*' = SWAP '/' = OR ; -- test for MulOp : emits ( c-addr u -- ) Tab EMIT TYPE ; -- output a string with tab : emitdln ( c-addr u -- ) CR TYPE ; -- output a string without tab, +crlf : emitln ( c-addr u -- ) CR emits ; -- output a string with tab and crlf -- abort, reporting offending item : name.aborts ( c-addr1 u1 c-addr2 u2 ) &` CHAR-PREPEND &' CHAR-APPEND $+ aborts ; : .undefined ( c-addr u -- ) S" Undefined identifier " 2SWAP name.aborts ; -- generate a unique label : newlabel ( -- c-addr u ) S" @" Lcount U>D <# #S #> $+ 1 +TO Lcount ; -- post a label to output : postlabel ( c-addr u -- ) CR TYPE ':' EMIT ; -- Recognize a relop : relop? ( char -- tf ) DUP '=' = OVER '#' = OR OVER '<' = OR SWAP '>' = OR ; : char+! ( c addr -- ) DUP >S COUNT + C! 1 S> C+! ; -- skip white space : skipwhite ( -- ) BEGIN Look white? WHILE getchar REPEAT ; -- skip cr+lf : fin ( -- ) Look ^M = IF getchar ENDIF Look ^J = IF getchar ENDIF skipwhite ; : newline ( -- ) BEGIN Look ^M = WHILE fin REPEAT ; -- match a specific input character : match ( char -- ) DUP Look = IF DROP getchar skipwhite ELSE S" `" ROT CHAR-APPEND &' CHAR-APPEND expected ENDIF ; -- get an identifier. : getname ( -- c-addr u ) newline Look alpha? 0= IF S" Name" expected ENDIF token$ C0! BEGIN Look alnum? WHILE Look >UPC token$ char+! getchar REPEAT token$ COUNT skipwhite ; -- get a number : getnum ( -- val ) newline Look digit? 0= IF S" Integer" expected ENDIF 0 >R BEGIN Look digit? WHILE R> #10 * Look '0' - + >R getchar REPEAT '#' TO token R> skipwhite ; -- Get an identifier and scan it for keywords : scan ( -- ) getname 0 KWlist lookup KW->token ; -- Add a new entry to symbol table : addentry ( c-addr u type -- ) [cnt]ST LOCALS| #entries T sz addr | addr sz intable IF S" Duplicate identifier " addr sz name.aborts ENDIF #entries @ maxentry = IF S" Symbol table full" aborts ENDIF 1 #entries +! addr sz #entries @ ST PACK DROP T #entries @ SType C! ; -- Match a specific input string : matchstring ( c-addr u -- ) token$ COUNT 2OVER COMPARE IF &` CHAR-PREPEND &' CHAR-APPEND expected ELSE 2DROP ENDIF ; -- Code generation ------------------------------------------------------------------------------------------------------- -- Load/store primary register with a constant/variable : loadconstant ( u -- ) (.) S" d# -> eax mov," $+ emitln ; : loadvariable ( c-addr u -- ) 2DUP intable 0= IF .undefined ENDIF S" dword-ptr -> eax mov," $+ emitln ; : storevariable ( c-addr u -- ) 2DUP intable 0= IF .undefined ENDIF S" eax -> " 2SWAP $+ S" dword-ptr mov," $+ emitln ; -- Allocate storage for a static variable : allocatestorage ( c-addr1 u1 c-addr2 u2 -- ) S" CREATE " 2ROT $+ BL CHAR-APPEND 2SWAP $+ S" , " $+ emitdln ; : callfunction ( c-addr u -- ) S" offset NEAR call," $+ emitln ; -- call a function : _negate ( -- ) S" eax neg," emitln ; -- negate primary : _not ( -- ) S" eax not," emitln ; -- not primary : _push ( -- ) S" eax push," emitln ; -- push primary to stack : _pop ( -- ) S" eax pop," emitln ; -- pop primary from stack : _decr ( -- ) S" 1 b# -> eax sub," emitln ; -- decrement primary : _incr ( -- ) S" 1 b# -> eax add," emitln ; -- increment primary : _decSP ( -- ) S" [esp -4 +] -> esp lea," emitln ; -- drop 1 stack item : _incSP ( -- ) S" [esp 4 +] -> esp lea," emitln ; -- restore stack item : _popadd ( -- ) S" [esp] -> eax add, [esp 4 +] -> esp lea," emitln ; -- add TOS to primary : _popsub ( -- ) S" [esp] -> eax sub, [esp 4 +] -> esp lea," emitln _negate ; -- subtract TOS from primary : _popor ( -- ) S" [esp] -> eax or, [esp 4 +] -> esp lea," emitln ; -- or TOS to primary : _popxor ( -- ) S" [esp] -> eax xor, [esp 4 +] -> esp lea," emitln ; -- xor TOS to primary : _popand ( -- ) S" [esp] -> eax and, [esp 4 +] -> esp lea," emitln ; -- xor TOS to primary : _popcmp ( -- ) S" [esp] -> eax cmp, [esp 4 +] -> esp lea," emitln ; -- compare TOS to primary : _sete ( -- ) S" al sete, al -> eax movsx," emitln ; -- set primary if equal : _setne ( -- ) S" al setne, al -> eax movsx," emitln ; -- set primary if NOT equal : _setg ( -- ) S" al setl, al -> eax movsx," emitln ; -- set primary if greater : _setl ( -- ) S" al setg, al -> eax movsx," emitln ; -- set primary if less : _setge ( -- ) S" al setle, al -> eax movsx," emitln ; -- set primary if greater or equal : _setle ( -- ) S" al setge, al -> eax movsx," emitln ; -- set primary if less or equal -- jump always; jump if eax 0 : _branch ( c-addr u -- ) S" offset NEAR jmp," $+ emitln ; : _pcmp+b0> ( c-addr u -- ) _popcmp S" offset NEAR jg," $+ emitln ; : _branch0 ( c-addr u -- ) S" eax -> eax or, " 2SWAP $+ S" offset NEAR je," $+ emitln ; : _branch<>0 ( c-addr u -- ) S" eax -> eax or, " 2SWAP $+ S" offset NEAR jnz," $+ emitln ; -- multiply TOS and primary; divide primary by TOS : _popmul ( -- ) S" [esp] dword mul, [esp 4 +] -> esp lea," emitln ; : _popdiv ( -- ) S" ecx pop, ecx -> eax xchg, eax -> edx mov, #31 b# -> edx sar, ebx idiv," emitln ; -- read to primary register and store in variable : readvar ( c-addr u -- ) S" READ" callfunction storevariable ; -- write from primary register : writevar ( -- ) S" WRITE" callfunction ; -- Write assembler header info; epilogue, and prologue : header ( -- ) S" -- DATA section --------" emitdln S" " emitdln ; : prolog ( -- ) S" " emitln S" -- CODE section --------" emitdln S" " emitdln S" CODE " prog$ COUNT $+ emitdln S" rpush," emitln newlabel postlabel ; : epilog ( -- ) S" rpop, ebx jmp," emitln S" END-CODE" emitdln ; -- Expressions ----------------------------------------------------------------------------------------------------------- -- parse and translate an identifier. -- Note that we need to store getname locally because parsing a call's -- parameterlist may clobber it. : ident ( -- ) getname DUP 1+ ALLOCATE ?ALLOCATE DUP LOCAL name PACK DROP Look '(' = IF '(' match ')' match name COUNT callfunction ELSE name COUNT loadvariable ENDIF name FREE ?ALLOCATE ; DEFER expression DEFER boolexpression -- parse and translate a math factor : factor ( -- ) Look '(' = IF '(' match boolexpression ')' match EXIT ENDIF Look alpha? IF getname loadvariable EXIT ENDIF getnum loadconstant ; -- parse and translate the first math factor : signedfactor ( -- ) Look '+' = IF getchar ENDIF Look '-' <> IF factor EXIT ENDIF getchar skipwhite Look digit? IF getnum NEGATE loadconstant ELSE factor _negate ENDIF ; -- recognize and translate multiply / divide : multiply ( -- ) '*' match factor _popmul ; : divide ( -- ) '/' match factor _popdiv ; -- parse and translate a math term : term ( -- ) signedfactor BEGIN Look mulop? WHILE _push CASE Look '*' OF multiply ENDOF '/' OF divide ENDOF S" Mulop" expected ENDCASE REPEAT ; -- recognize and translate add / subtract : add ( -- ) '+' match term _popadd ; : subtract ( -- ) '-' match term _popsub ; -- parse and translate an expression :NONAME ( -- ) term BEGIN Look addop? WHILE _push CASE Look '+' OF add ENDOF '-' OF subtract ENDOF S" Addop" expected ENDCASE REPEAT ; IS expression -- recognize and translate a relational "equals" : equals ( -- ) '=' match expression _popcmp _sete ; -- recognize and translate a relational "less or equal" : lessorequal ( -- ) '=' match expression _popcmp _setle ; -- recognize and translate a relational "not equals" : notequals ( -- ) '>' match expression _popcmp _setne ; -- recognize and translate a relational "less than" : less ( -- ) '<' match Look '=' = IF lessorequal EXIT ENDIF Look '>' = IF notequals EXIT ENDIF expression _popcmp _setl ; -- recognize and translate a relational "greater than" : greater ( -- ) '>' match Look '=' = IF '=' match expression _popcmp _setge EXIT ENDIF expression _popcmp _setg ; -- parse and translate a relation : relation ( -- ) expression Look relop? 0= ?EXIT _push CASE Look '=' OF equals ENDOF '<' OF less ENDOF '>' OF greater ENDOF ENDCASE ; -- parse and translate a boolean factor with NOT : notfactor ( -- ) Look '!' <> IF relation EXIT ENDIF '!' match relation _not ; -- parse and translate a boolean term : boolterm ( -- ) notfactor BEGIN Look '&' = WHILE _push '&' match notfactor _popand REPEAT ; -- Recognize and translate a boolean OR : boolOR ( -- ) '|' match boolterm _popor ; -- Recognize and translate an EXCLUSIVE OR : boolXOR ( -- ) '~' match boolterm _popxor ; -- Parse and translate a boolean expression :NONAME ( -- ) boolterm BEGIN Look orop? WHILE _push CASE Look '|' OF boolOR ENDOF '~' OF boolXOR ENDOF ENDCASE REPEAT ; IS boolexpression -- Block statements ------------------------------------------------------------------------------------------------------- DEFER block -- Parse and translate an assignment statement : assignment ( -- ) token$ COUNT DUP 1+ ALLOCATE ?ALLOCATE DUP LOCAL name PACK DROP '=' match boolexpression name COUNT storevariable name FREE ?ALLOCATE ; -- Recognize and translate an IF construct : doIF ( label -- ) 0 0 LOCALS| L2 L1 L | boolexpression newlabel DUP 1+ ALLOCATE ?ALLOCATE DUP TO L1 PACK DROP L1 COUNT DUP 1+ ALLOCATE ?ALLOCATE DUP TO L2 PACK DROP L1 COUNT _branch0 L block token 'l' = IF L2 FREE ?ALLOCATE newlabel DUP 1+ ALLOCATE ?ALLOCATE DUP TO L2 PACK DROP L2 COUNT _branch L1 COUNT postlabel L block ENDIF L2 COUNT postlabel S" ENDIF" matchstring L1 FREE ?ALLOCATE L2 FREE ?ALLOCATE ; : doWHILE ( -- ) 0 0 LOCALS| L2 L1 | newlabel DUP 1+ ALLOCATE ?ALLOCATE DUP TO L1 PACK DROP newlabel DUP 1+ ALLOCATE ?ALLOCATE DUP TO L2 PACK DROP L1 COUNT postlabel boolexpression L2 COUNT _branch0 L2 block L1 COUNT _branch L2 COUNT postlabel S" ENDWHILE" matchstring L1 FREE ?ALLOCATE L2 FREE ?ALLOCATE ; -- Parse and translate a loop statement : doLOOP ( -- ) 0 0 LOCALS| L2 L1 | newlabel DUP 1+ ALLOCATE ?ALLOCATE DUP TO L1 PACK DROP newlabel DUP 1+ ALLOCATE ?ALLOCATE DUP TO L2 PACK DROP L1 COUNT postlabel L2 block L1 COUNT _branch L2 COUNT postlabel S" ENDLOOP" matchstring L1 FREE ?ALLOCATE L2 FREE ?ALLOCATE ; -- Parse and translate a REPEAT statement : doREPEAT ( -- ) 0 0 LOCALS| L2 L1 | newlabel DUP 1+ ALLOCATE ?ALLOCATE DUP TO L1 PACK DROP newlabel DUP 1+ ALLOCATE ?ALLOCATE DUP TO L2 PACK DROP L1 COUNT postlabel L2 block S" UNTIL" matchstring boolexpression L1 COUNT _branch0 L2 COUNT postlabel L1 FREE ?ALLOCATE L2 FREE ?ALLOCATE ; -- Parse and translate a for statement : doFOR ( -- ) 0 0 0 LOCALS| name L2 L1 | getname DUP 1+ ALLOCATE ?ALLOCATE DUP TO name PACK DROP newlabel DUP 1+ ALLOCATE ?ALLOCATE DUP TO L1 PACK DROP newlabel DUP 1+ ALLOCATE ?ALLOCATE DUP TO L2 PACK DROP '=' match expression _decr name COUNT storevariable scan S" TO" matchstring expression _push L1 COUNT postlabel name COUNT loadvariable _incr name COUNT storevariable L2 COUNT _pcmp+b0> L2 block L1 COUNT _branch L2 COUNT postlabel S" ENDFOR" matchstring _incSP name FREE ?ALLOCATE L1 FREE ?ALLOCATE L2 FREE ?ALLOCATE ; -- Parse and Translate a DO Statement : doDO ( -- ) 0 0 LOCALS| L2 L1 | newlabel DUP 1+ ALLOCATE ?ALLOCATE DUP TO L1 PACK DROP newlabel DUP 1+ ALLOCATE ?ALLOCATE DUP TO L2 PACK DROP expression L1 COUNT postlabel _push L2 block _pop _decr L1 COUNT _branch<>0 _decSP L2 COUNT postlabel S" ENDDO" matchstring _incSP L1 FREE ?ALLOCATE L2 FREE ?ALLOCATE ; -- Recognize and translate a BREAK : doBREAK ( label -- ) DUP C@ 0= IF S" No loop to break from" aborts ENDIF COUNT _branch ; -- Process a read statement : doread ( -- ) '(' match getname readvar BEGIN Look ',' = WHILE ',' match getname readvar REPEAT ')' match ; -- Process a write statement : dowrite ( -- ) '(' match expression writevar BEGIN Look ',' = WHILE ',' match expression writevar REPEAT ')' match ; -- Recognize and translate a statement block :NONAME ( label -- ) LOCAL L scan BEGIN token 'e' <> token 'l' <> AND token 'u' <> AND WHILE CASE token 'i' OF L doIF ENDOF 'w' OF doWHILE ENDOF 'd' OF doDO ENDOF 'L' OF doLOOP ENDOF 'r' OF doREPEAT ENDOF 'f' OF doFOR ENDOF 'B' OF L doBREAK ENDOF 'R' OF doread ENDOF 'W' OF dowrite ENDOF assignment ENDCASE scan REPEAT ; IS block -- Declarations ----------------------------------------------------------------------------------------------------------- -- Allocate storage for a variable : alloc ( c-addr u -- ) LOCALS| sz addr | addr sz intable IF S" Duplicate variable name " addr sz name.aborts ENDIF addr sz 'v' addentry addr sz Look '=' = IF '=' match Look '-' = IF '-' match S" -" ELSE S" " ENDIF getnum (.) $+ ELSE S" 0" ENDIF allocatestorage ; -- Parse and translate a data declaration : decl ( -- ) getname alloc BEGIN Look ',' = WHILE ',' match getname alloc REPEAT ; -- Parse and translate the global declarations : topdecls ( -- ) scan BEGIN token 'b' <> WHILE CASE token 'v' OF decl ENDOF S" Unrecognized keyword " token$ COUNT name.aborts ENDCASE scan REPEAT ; -- Parse and translate MAIN code : doprogram ( -- ) S" BEGIN" matchstring prolog C" " block S" END" matchstring epilog ; -- Parse and translate a program ------------------------------------------------------------------------------------------ : prog ( -- ) S" PROGRAM" matchstring getname prog$ PACK DROP header topdecls doprogram '.' match ; -- initialize : init ( -- ) CLEAR Lcount [cnt]ST OFF [cnt]SType OFF CR getchar scan ; : TINY10 ( -- ) init prog Look ^M <> IF S" Unexpected data after " S" ." name.aborts ENDIF ;
***************************************************************** * * * COPYRIGHT NOTICE * * * * Copyright (C) 1989 Jack W. Crenshaw. All rights reserved. * * * *****************************************************************