LET'S BUILD A COMPILER!

By

Jack W. Crenshaw, Ph.D.

3 June 1989

Part XI: LEXICAL SCAN REVISITED

INTRODUCTION

I've got some good news and some bad news. The bad news is that this installment is not the one I promised last time. What's more, the one after this one won't be, either. The good news is the reason for this installment: I've found a way to simplify and improve the lexical scanning part of the compiler. Let me explain.

BACKGROUND

If you'll remember, we talked at length about the subject of lexical scanners in Part VII, and I left you with a design for a distributed scanner that I felt was about as simple as I could make it ... more than most that I've seen elsewhere. We used that idea in Part X. The compiler structure that resulted was simple, and it got the job done.

Recently, though, I've begun to have problems, and they're the kind that send a message that you might be doing something wrong.

The whole thing came to a head when I tried to address the issue of semicolons. Several people have asked me about them, and whether or not KISS will have them separating the statements. My intention has been not to use semicolons, simply because I don't like them and, as you can see, they have not proved necessary.

But I know that many of you, like me, have gotten used to them, and so I set out to write a short installment to show you how they could easily be added, if you were so inclined.

Well, it turned out that they weren't easy to add at all. In fact it was darned difficult.

I guess I should have realized that something was wrong, because of the issue of newlines. In the last couple of installments we've addressed that issue, and I've shown you how to deal with newlines with a procedure called, appropriately enough, NewLine. In TINY Version 1.0, I sprinkled calls to this procedure in strategic spots in the code.

It seems that every time I've addressed the issue of newlines, though, I've found it to be tricky, and the resulting parser turned out to be quite fragile ... one addition or deletion here or there and things tended to go to pot. Looking back on it, I realize that there was a message in this that I just wasn't paying attention to.

When I tried to add semicolons on top of the newlines, that was the last straw. I ended up with much too complex a solution. I began to realize that something fundamental had to change.

So, in a way this installment will cause us to backtrack a bit and revisit the issue of scanning all over again. Sorry about that. That's the price you pay for watching me do this in real time. But the new version is definitely an improvement, and will serve us well for what is to come.

As I said, the scanner we used in Part X was about as simple as one can get. But anything can be improved. The new scanner is more like the classical scanner, and not as simple as before. But the overall compiler structure is even simpler than before. It's also more robust, and easier to add to and/or modify. I think that's worth the time spent in this digression. So in this installment, I'll be showing you the new structure. No doubt you'll be happy to know that, while the changes affect many procedures, they aren't very profound and so we lose very little of what's been done so far.

Ironically, the new scanner is much more conventional than the old one, and is very much like the more generic scanner I showed you earlier in Part VII. Then I started trying to get clever, and I almost clevered myself clean out of business. You'd think one day I'd learn: K-I-S-S!

THE PROBLEM

The problem begins to show itself in procedure Block, which I've reproduced below:

-- ------------------------------------------------------------
-- Parse and translate a block of statements
:NONAME ( -- )
        scan
        BEGIN  token 'e' <> 
               token 'l' <> AND
        WHILE  CASE token
                'i' OF  doIF      ENDOF
                'w' OF  doWHILE   ENDOF
                'R' OF  doread    ENDOF
                'W' OF  dowrite   ENDOF
                        assignment
               ENDCASE
               scan
        REPEAT ; IS block
-- ------------------------------------------------------------

As you can see, Block is oriented to individual program statements. At each pass through the loop, we know that we are at the beginning of a statement. We exit the block when we have scanned an END or an ELSE.

But suppose that we see a semicolon instead. The procedure as it's shown above can't handle that, because the word Scan only expects and can only accept tokens that begin with a letter.

I tinkered around for quite awhile to come up with a fix. I found many possible approaches, but none were very satisfying. I finally figured out the reason.

Recall that when we started with our single-character parsers, we adopted a convention that the lookahead character would always be prefetched. That is, we would have the character that corresponds to our current position in the input stream fetched into the global character Look, so that we could examine it as many times as needed. The rule we adopted was that every recognizer, if it found its target token, would advance look to the next character in the input stream.

That simple and fixed convention served us very well when we had single-character tokens, and it still does. It would make a lot of sense to apply the same rule to multi-character tokens.

But when we got into lexical scanning, I began to violate that simple rule. The scanner of Part X did indeed advance to the next token if it found an identifier or keyword, but it didn't do that if it found a carriage return, a whitespace character, or an operator.

Now, that sort of mixed-mode operation gets us into deep trouble in the word Block, because whether or not the input stream has been advanced depends upon the kind of token we encounter. If it's a keyword or the target of an assignment statement, the "cursor," as defined by the contents of Look, has been advanced to the next token or to the beginning of whitespace. If, on the other hand, the token is a semicolon, or if we have hit a carriage return, the cursor has not advanced.

Needless to say, we can add enough logic to keep us on track. But it's tricky, and makes the whole parser very fragile.

There's a much better way, and that's just to adopt that same rule that's worked so well before, to apply to tokens as well as single characters. In other words, we'll prefetch tokens just as we've always done for characters. It seems so obvious once you think about it that way.

Interestingly enough, if we do things this way the problem that we've had with newline characters goes away. We can just lump them in as whitespace characters, which means that the handling of newlines becomes very trivial, and much less prone to error than we've had to deal with in the past.

THE SOLUTION

Let's begin to fix the problem by re-introducing the two words:

-- ------------------------------------------------------------
-- Get an identifier 
: getname ( -- )
        skipwhite
        Look alpha? 0= IF  S" Identifier" expected  ENDIF
        'x' TO token  
        token$ C0!  
        BEGIN   
          Look >UPC token$ char+!  
          getchar
          Look alnum? 0=
        UNTIL ;
-- ------------------------------------------------------------
-- Get a number
: getnum ( -- )
        skipwhite
        Look digit? 0= IF  S" Number" expected  ENDIF
        '#' TO token
        token$ C0!
        BEGIN   
          Look token$ char+!  
          getchar
          Look digit? 0=
        UNTIL ;
-- ------------------------------------------------------------

These two words are functionally almost identical to the ones I showed you in Part VII. They each fetch the current token, either an identifier or a number, into the global string token$. They also set the encoded version, Token, to the appropriate code. The input stream is left with Look containing the first character not part of the token.

We can do the same thing for operators, even multi-character operators, with a procedure such as:

-- ------------------------------------------------------------
-- Get an operator
: getop ( -- )
        Look TO token
        token$ C0!
        BEGIN   
          Look token$ char+!  
          getchar
          Look digit?  Look alpha? OR  Look white? OR
        UNTIL ;
-- ------------------------------------------------------------

Note that GetOp returns, as its encoded token, the first character of the operator. This is important, because it means that we can now use that single character to drive the parser, instead of the lookahead character.

We need to tie these procedures together into a single procedure that can handle all three cases. The following procedure will read any one of the token types and always leave the input stream advanced beyond it:

-- ------------------------------------------------------------
-- Get the next input token
: next ( -- )
        skipwhite
        Look alpha? IF  getname EXIT  ENDIF
        Look digit? IF  getnum  EXIT  ENDIF
        getop ;
-- ------------------------------------------------------------

Note that here I have put SkipWhite before the calls rather than after. This means that, in general, the variable Look will not have a meaningful value in it, and therefore we should not use it as a test value for parsing, as we have been doing so far. That's the big departure from our normal approach.

Now, remember that before I was careful not to treat the carriage return (CR) and line feed (LF) characters as white space. This was because, with SkipWhite called as the last thing in the scanner, the encounter with LF would trigger a read statement. If we were on the last line of the program, we couldn't get out until we input another line with a non-white character. That's why I needed the second procedure, NewLine, to handle the CRLF's.

But now, with the call to SkipWhite coming first, that's exactly the behavior we want. The compiler must know there's another token coming or it wouldn't be calling Next. In other words, it hasn't found the terminating END yet. So we're going to insist on more data until we find something.

All this means that we can greatly simplify both the program and the concepts, by treating CR and LF as whitespace characters, and eliminating NewLine. You can do that simply by modifying the function White?:

-- ------------------------------------------------------------
-- Recognize white space
: white? ( char -- tf )
        DUP  BL  = 
        OVER Tab = OR 
        OVER ^M  = OR
        SWAP ^J  = OR ;
-- ------------------------------------------------------------

We've already tried similar routines in Part VII, but you might as well try these new ones out. Add them to a copy of the Cradle and call Next with the following main program:

-- ------------------------------------------------------------
-- main program
        init
        BEGIN
          next
          CR Token EMIT SPACE token$ .$
          Token '.' = 
        UNTIL ;
-- ------------------------------------------------------------

Compile it and verify that you can separate a program into a series of tokens, and that you get the right encoding for each token.

This almost works, but not quite. There are two potential problems: First, in KISS/TINY almost all of our operators are single-character operators. The only exceptions are the relops >=, <=, and <>. It seems a shame to treat all operators as strings and do a string compare, when only a single character compare will almost always suffice. Second, and much more important, the thing doesn't work when two operators appear together, as in (a+b)*(c+d). Here the string following 'b' would be interpreted as a single operator ")*(."

It's possible to fix that problem. For example, we could just give GetOp a list of legal characters, and we could treat the parentheses as different operator types than the others. But this begins to get messy.

Fortunately, there's a better way that solves all the problems. Since almost all the operators are single characters, let's just treat them that way, and let GetOp get only one character at a time. This not only simplifies GetOp, but also speeds things up quite a bit. We still have the problem of the relops, but we were treating them as special cases anyway.

So here's the final version of GetOp:

-- ------------------------------------------------------------
-- Get an operator
: getop ( -- )
        skipwhite
        Look TO Token 
        Look token$ char-place
        getchar ;
-- ------------------------------------------------------------

Note that I still give the string token$ a value. If you're truly concerned about efficiency, you could leave this out. When we're expecting an operator, we will only be testing Token anyhow, so the value of the string won't matter. But to me it seems to be good practice to give the thing a value just in case.

Try this new version with some realistic-looking code. You should be able to separate any program into its individual tokens, with the caveat that the two-character relops will scan into two separate tokens. That's OK ... we'll parse them that way.

Now, in Part VII the function of Next was combined with procedure Scan, which also checked every identifier against a list of keywords and encoded each one that was found. As I mentioned at the time, the last thing we would want to do is to use such a procedure in places where keywords should not appear, such as in expressions. If we did that, the keyword list would be scanned for every identifier appearing in the code. Not good.

The right way to deal with that is to simply separate the functions of fetching tokens and looking for keywords. The version of Scan shown below does nothing but check for keywords. Notice that it operates on the current token and does not advance the input stream.

-- ------------------------------------------------------------
-- Scan the current identifier for keywords 
: scan ( -- ) 
        Token 'x' = IF  token$ COUNT  0 KWlist lookup 
                        KW->token  
                 ENDIF ;
-- ------------------------------------------------------------

There is one last detail. In the compiler there are a few places that we must actually check the string value of the token. Mainly, this is done to distinguish between the different END's, but there are a couple of other places. (I should note in passing that we could always eliminate the need for matching END characters by encoding each one to a different character. Right now we are definitely taking the lazy man's route.)

The following version of MatchString takes the place of the character-oriented Match. Note that, like Match, it does advance the input stream.

-- ------------------------------------------------------------
-- Match a specific input string 
: matchstring ( c-addr u -- )
        token$ COUNT 2OVER 
        COMPARE IF  &` CHAR-PREPEND &' CHAR-APPEND expected
              ELSE  2DROP
             ENDIF  
        next ;
-- ------------------------------------------------------------

FIXING UP THE COMPILER

Armed with these new scanner procedures, we can now begin to fix the compiler to use them properly. The changes are all quite minor, but there are quite a few places where changes are necessary. Rather than showing you each place, I will give you the general idea and then just give the finished product.

First of all, the code for procedure Block doesn't change, though its function does:

-- ------------------------------------------------------------
-- Parse and translate a block of statements 
:NONAME ( -- )
        scan
        BEGIN  token 'e' <> 
               token 'l' <> AND
        WHILE  CASE token
                'i' OF  doIF      ENDOF
                'w' OF  doWHILE   ENDOF
                'R' OF  doread    ENDOF
                'W' OF  dowrite   ENDOF
                        assignment
               ENDCASE
               scan
        REPEAT ; IS block
-- ------------------------------------------------------------

Remember that the new version of Scan doesn't advance the input stream, it only scans for keywords. The input stream must be advanced by each procedure that Block calls.

In general, we have to replace every test on Look with a similar test on Token. For example:

-- -------------------------------------------------------------
-- Parse and translate a boolean expression
:NONAME ( -- )
        boolterm
        BEGIN   Token orop?
        WHILE   _push
                CASE Token
                  '|' OF  boolOR  ENDOF
                  '~' OF  boolXOR ENDOF
                ENDCASE
        REPEAT ; IS boolexpression
-- ------------------------------------------------------------

In procedures like Add, we don't have to use Match anymore. We need only call Next to advance the input stream:

-- ------------------------------------------------------------
-- Recognize and translate an add
: add       ( -- ) next term   _popadd ; 
-- ------------------------------------------------------------

Control structures are actually simpler. We just call Next to advance over the control keywords:

-- ------------------------------------------------------------
-- Recognize and translate an if construct 

DEFER block

: doIF ( -- )
        0 0 LOCALS| L2 L1 |
        next
        boolexpression
        newlabel DUP 1+ ALLOCATE ?ALLOCATE DUP TO L1 PACK DROP
        L1 COUNT DUP 1+ ALLOCATE ?ALLOCATE DUP TO L2 PACK DROP
        L1 COUNT _branch0
        block
        token 'l' = IF  L2 FREE ?ALLOCATE
                        next
                        newlabel DUP 1+ ALLOCATE ?ALLOCATE DUP TO L2 PACK DROP
                        L2 COUNT _branch
                        L1 COUNT postlabel
                        block
                 ENDIF
        L2 COUNT postlabel
        S" ENDIF" matchstring

        L1 FREE ?ALLOCATE 
        L2 FREE ?ALLOCATE ;
-- ------------------------------------------------------------

That's about the extent of the required changes. In the listing of TINY Version 1.1 below, I've also made a number of other "improvements" that aren't really required. Let me explain them briefly:

I've deleted the two procedures Prog and Main, and combined their functions into the main program. They didn't seem to add to program clarity ... in fact they seemed to just muddy things up a little.
I've deleted the keywords PROGRAM and BEGIN from the keyword list. Each one only occurs in one place, so it's not necessary to search for it.
Having been bitten by an overdose of cleverness, I've reminded myself that TINY is supposed to be a minimalist program. Therefore I've replaced the fancy handling of unary minus with the dumbest one I could think of. A giant step backwards in code quality, but a great simplification of the compiler. KISS is the right place to use the other version.
I've added some error-checking routines such as CheckTable and CheckDup, and replaced in-line code by calls to them. This cleans up a number of routines.
I've taken the error checking out of code generation routines like StoreVariable, and put it in the parser where it belongs. See Assignment, for example.
I've cleaned up the code for the relational operators by the addition of the new procedures CompareExpression and NextExpression.

CONCLUSION

The resulting compiler for TINY is given below (tiny11.frt). Other than the removal of the keyword PROGRAM, it parses the same language as before. It's just a bit cleaner, and more importantly it's considerably more robust. I feel good about it.

The next installment will be another digression: the discussion of semicolons and such that got me into this mess in the first place. THEN we'll press on into procedures and types. Hang in there with me. The addition of those features will go a long way towards removing KISS from the "toy language" category. We're getting very close to being able to write a serious compiler.

TINY VERSION 1.1

-- ------------------------------------------------------------
-- Variable declarations

-1 =: unknown

0 VALUE Lcount                          -- label counter
0 VALUE Look                            -- lookahead character 
0 VALUE token                           -- encoded token

#100 =: maxentry

CREATE token$  0 C, #256 CHARS ALLOT    -- unencoded token
CREATE prog$   0 C, #256 CHARS ALLOT    -- program name

: char+!     ( c addr -- ) DUP >S COUNT + C!  1 S> C+! ;
: char-place ( c addr -- ) 1 OVER C!  CHAR+ C! ;

:   token> ( -- c-addr u ) token$ COUNT ;
:  =token  ( char -- tf )  token =  ;
: <>token  ( char -- tf )  token <> ;

-- Symbol tables and labels -----------------------------------

-- Helper to build symbol tables
: $, ( c-addr u size -- ) 
        >S S MIN >R  
        R@ C, R@ 0 ?DO  C@+ C,  LOOP DROP 
        S> R> - CHARS ALLOT ; 

-- Type declarations  
: SYMTAB ( size -- ) 
        CREATE  DUP 0< IF  0 , ABS DUP , 
                           1+ maxentry * ALLOT   
                     ELSE  HERE >R
                           0 ,  ( #items )
                           DUP >S ,  ( itemsize )
                           BEGIN  BL <WORD> DUP  
                           WHILE  2DUP S" \" COMPARE 
                                  0= IF  2DROP REFILL DROP
                                   ELSE  S $, 1 R@ +!  
                                  ENDIF
                           REPEAT 2DROP -R -S
                    ENDIF
        DOES>   CELL+ @+ 1+ ROT * + ( ix -- addr ) ; 

-- Definition of keywords and token types

8 CHARS =: /symbol

/symbol SYMTAB KWlist   IF ELSE ENDIF   \ 
                        WHILE ENDWHILE  \ 
                        DO ENDDO        \
                        LOOP ENDLOOP    \ 
                        REPEAT UNTIL    \
                        FOR TO ENDFOR   \
                        BREAK           \
                        READ WRITE      \ 
                        VAR END

: KW->token ( index -- ) 2+ C" xilewedeLerufteBRWve" + C@ TO token ;
 
        /symbol  NEGATE SYMTAB ST
        =CELL 1- NEGATE SYMTAB SType

0 ST    2 CELLS - =: [cnt]ST
0 SType 2 CELLS - =: [cnt]SType

: lookup ( c-addr u 'table -- n2|unknown )
        0 0 LOCALS| /symbol n table sz addr |
        table 2 CELLS - @+ TO n  @ TO /symbol
        n 0<= IF  unknown EXIT  ENDIF
           0 n DO 
                  /symbol 1+ I * table + COUNT  
                  addr sz COMPARE 0= IF  I UNLOOP EXIT  ENDIF
         -1 +LOOP 
        unknown ;

-- Locate symbol in table 
: locate  ( c-addr u -- ix ) 0 ST lookup ;      -- Returns -1 | the index of the entry.
: intable ( c-addr u -- tf ) 0 ST lookup 0>= ;  -- Look for symbol in table

-- Dump symbol table
: .symbols ( -- )
        [cnt]ST @ 0= IF  CR ." No symbols defined." EXIT  ENDIF
        CR ." -- type --.--- name ---" 
        [cnt]ST @ 0 ?DO  CR     5 HTAB I 1+ SType C@ EMIT 
                              #16 HTAB I 1+ ST    .$ 
                   LOOP ;

-- Tools ------------------------------------------------------

: getchar  ( -- )          EKEY TO Look ;                       -- read new character from input stream 
: error    ( c-addr u -- ) CR ^G EMIT ." Error: " TYPE ." ." ;  -- report an error 
: aborts   ( c-addr u -- ) error ABORT ;                        -- report error and halt 
: expected ( c-addr u -- ) S"  expected" $+ aborts ;            -- report what was expected 
: alpha?   ( char -- tf )  >UPC 'A' 'Z' 1+ WITHIN ;             -- recognize an alpha character 
: digit?   ( char -- tf )  '0' '9' 1+ WITHIN ;                  -- recognize a decimal digit
: alnum?   ( char -- tf )  DUP alpha? SWAP digit? OR ;          -- recognize alphanumeric
: orop?    ( char -- tf )  DUP '|' =  SWAP '~' =  OR ;          -- recognize an OR operand
: addop?   ( char -- tf )  DUP '+' =  SWAP '-' =  OR ;          -- test for AddOp
: mulop?   ( char -- tf )  DUP '*' =  SWAP '/' =  OR ;          -- test for MulOp
: emits    ( c-addr u -- ) Tab EMIT TYPE ;                      -- output a string with tab 
: emitdln  ( c-addr u -- ) CR TYPE ;                            -- output a string without tab, +crlf
: emitln   ( c-addr u -- ) CR emits ;                           -- output a string with tab and crlf 

-- Recognize white space
: white? ( char -- tf )  
        DUP  Tab =  
        OVER BL  = OR
        OVER ^M  = OR 
        SWAP ^J  = OR ; 

-- Skip white space
: skipwhite ( -- ) BEGIN  Look white?  WHILE  getchar  REPEAT ;

-- Abort, reporting an offending item
: name.aborts ( c-addr1 u1 c-addr2 u2 ) &' CHAR-APPEND S"  `" 2SWAP $+ $+ aborts ;

: .undefined ( c-addr u -- ) S" Undefined identifier" 2SWAP name.aborts ;
: .duplicate ( c-addr u -- ) S" Duplicate identifier" 2SWAP name.aborts ;
: checktable ( c-addr u -- ) 2DUP intable    IF 2DROP EXIT ENDIF .undefined ; -- Is identifier in the symbol table?
: checkdup   ( c-addr u -- ) 2DUP intable 0= IF 2DROP EXIT ENDIF .duplicate ; -- Is identifier already in symbol table?
: checkident ( -- )          'x' <>token IF  S" Identifier" expected  ENDIF ; -- Is current token an identifier?

-- Generate a unique label / post it to output
: newlabel  ( -- c-addr u ) S" @" Lcount U>D <# #S #> $+  1 +TO Lcount ;
: postlabel ( c-addr u -- ) CR TYPE ':' EMIT ;

-- Recognize a relop
: relop? ( char -- tf )  
        DUP  '=' = 
        OVER '#' = OR  
        OVER '<' = OR
        SWAP '>' = OR ;

-- Match a specific input character 
: match ( char -- )
        DUP Look = IF  DROP getchar skipwhite 
                 ELSE  S" `" ROT CHAR-APPEND &' CHAR-APPEND expected 
                ENDIF ;

-- Get an identifier 
: getname ( -- )
        skipwhite
        Look alpha? 0= IF  S" Identifier" expected  ENDIF
        'x' TO token
        token$ C0! 
        BEGIN
          Look >UPC token$ char+!  getchar
          Look alnum? 0=
        UNTIL ;

-- Get a number 
: getnum ( -- )
        skipwhite
        Look digit? 0= IF  S" Number" expected  ENDIF
        '#' TO token
        token$ C0!
        BEGIN   
          Look token$ char+!  
          getchar
          Look digit? 0=
        UNTIL ;

-- Get an operator
: getop ( -- )
        skipwhite
        Look TO token 
        Look token$ char-place
        getchar ;

-- Get next token of any type
: next ( -- )
        skipwhite
        Look alpha? IF  getname EXIT  ENDIF
        Look digit? IF  getnum  EXIT  ENDIF
        getop ;

-- Get an identifier and scan it for keywords
: scan ( -- ) 'x' =token IF  token>  0 KWlist lookup  KW->token  ENDIF ;

-- Match a specific input string 
: matchstring ( c-addr u -- )
        token> 2OVER 
        COMPARE IF  &` CHAR-PREPEND &' CHAR-APPEND expected
              ELSE  2DROP  next
             ENDIF ;

-- Add a new entry to symbol table 
: addentry ( c-addr u type -- )
        [cnt]ST LOCALS| #entries T sz addr |
        addr sz checkdup
        #entries @ maxentry = IF  S" Symbol table full" aborts  ENDIF
        1 #entries +!
        addr sz  #entries @ ST     PACK DROP 
        T        #entries @ SType  C! ;

-- Code generation --------------------------------------------

-- Load primary register with a constant
: loadconstant  ( c-addr u -- ) S"  d# -> eax  mov," $+ emitln ;

-- Load primary register from variable
: loadvariable  ( c-addr u -- ) 
        2DUP intable 0= IF  .undefined  ENDIF
        S"  dword-ptr -> eax  mov," $+ emitln ;

-- Store primary register in variable
: storevariable ( c-addr u -- ) 
        2DUP intable 0= IF  .undefined  ENDIF
        S" eax -> " 2SWAP $+ S"  dword-ptr  mov," $+ emitln ;

-- Allocate storage for a static variable
: allocatestorage ( $value $name -- )  
        S" CREATE " 2SWAP $+ BL CHAR-APPEND 2SWAP $+ S"  , " $+ emitdln ;

: _callw  ( c-addr u -- ) S"  offset NEAR call," $+ emitln ;                    -- call a word
: _clear  ( -- ) S" eax -> eax xor," emitln ;                                   -- clear primary
: _negate ( -- ) S" eax  neg,"   emitln ;                                       -- negate primary
: _not    ( -- ) S" eax  not,"   emitln ;                                       -- not primary
: _push   ( -- ) S" eax  push,"  emitln ;                                       -- push primary to stack
: _pop    ( -- ) S" eax  pop,"   emitln ;                                       -- pop primary from stack
: _decr   ( -- ) S" 1 b# -> eax sub," emitln ;                                  -- decrement primary
: _incr   ( -- ) S" 1 b# -> eax add," emitln ;                                  -- increment primary
: _decSP  ( -- ) S" [esp -4 +] -> esp lea," emitln ;                            -- drop 1 stack item
: _incSP  ( -- ) S" [esp  4 +] -> esp lea," emitln ;                            -- restore stack item
: _popadd ( -- ) S" [esp] -> eax  add,  [esp 4 +] -> esp lea," emitln ;         -- add TOS to primary
: _popsub ( -- ) S" [esp] -> eax  sub,  [esp 4 +] -> esp lea," emitln _negate ; -- subtract TOS from primary
: _popor  ( -- ) S" [esp] -> eax  or,   [esp 4 +] -> esp lea," emitln ;         -- or TOS to primary
: _popxor ( -- ) S" [esp] -> eax  xor,  [esp 4 +] -> esp lea," emitln ;         -- xor TOS to primary
: _popand ( -- ) S" [esp] -> eax  and,  [esp 4 +] -> esp lea," emitln ;         -- xor TOS to primary
: _popcmp ( -- ) S" [esp] -> eax  cmp,  [esp 4 +] -> esp lea," emitln ;         -- compare TOS to primary
: _sete   ( -- ) S" al  sete,   al -> eax  movsx," emitln ;                     -- set primary if equal
: _setne  ( -- ) S" al  setne,  al -> eax  movsx," emitln ;                     -- set primary if NOT equal
: _setg   ( -- ) S" al  setl,   al -> eax  movsx," emitln ;                     -- set primary if greater
: _setl   ( -- ) S" al  setg,   al -> eax  movsx," emitln ;                     -- set primary if less
: _setge  ( -- ) S" al  setle,  al -> eax  movsx," emitln ;                     -- set primary if greater or equal
: _setle  ( -- ) S" al  setge,  al -> eax  movsx," emitln ;                     -- set primary if less or equal

-- A collection of jumps
: _branch    ( c-addr u -- ) S"  offset NEAR  jmp," $+ emitln ;    
: _pcmp+b0>  ( c-addr u -- ) _popcmp S"  offset NEAR  jg,"  $+ emitln ;
: _branch0   ( c-addr u -- ) S" eax -> eax or,  " 2SWAP $+ S"  offset NEAR  je,"  $+ emitln ;
: _branch<>0 ( c-addr u -- ) S" eax -> eax or,  " 2SWAP $+ S"  offset NEAR  jnz," $+ emitln ;

-- Multiply TOS and primary; divide primary by TOS
: _popmul ( -- ) S" [esp] dword mul, [esp 4 +] -> esp lea," emitln ;
: _popdiv ( -- ) S" ecx pop, ecx -> eax xchg, eax -> edx mov, #31 b# -> edx sar, ebx idiv," emitln ;

-- Read to primary register and store in variable
: readit  ( c-addr u -- ) S" READ" _callw  storevariable ; 

-- Write from primary register
: writeit ( -- ) S" WRITE" _callw ; 

-- Write assembler header info
: header ( -- ) 
        S" -- DATA section --------" emitdln 
        S" " emitdln ;

-- Write assembler prologue
: prolog ( -- ) 
        S" " emitln 
        S" -- CODE section --------" emitdln 
        S" "          emitdln 
        S" CODE " prog$ COUNT $+ emitdln  
        S" rpush,"    emitln
        newlabel postlabel ;

-- Write assembler epilogue
: epilog ( -- ) 
        S" rpop,  ebx jmp," emitln
        S" END-CODE"        emitdln ;

-- Expressions ------------------------------------------------

DEFER boolexpression

-- Parse and translate a math factor
: factor ( -- ) 
        '(' =token IF  next boolexpression  S" )" matchstring  EXIT  ENDIF
        'x' =token IF  token> loadvariable  next  EXIT  ENDIF
        '#' =token IF  token> loadconstant  next  EXIT  ENDIF 
        S" Math factor" expected ;

-- Recognize and translate multiply / divide 
: multiply  ( -- ) next  factor _popmul ;  
: divide    ( -- ) next  factor _popdiv ;

-- Parse and translate a math term
: term ( -- ) 
        factor 
        BEGIN   token mulop?
        WHILE   _push
                CASE token
                  '*' OF multiply  ENDOF
                  '/' OF divide    ENDOF
                ENDCASE 
        REPEAT ;

-- Recognize and translate add / subtract
: add       ( -- ) next  term   _popadd ; 
: subtract  ( -- ) next  term   _popsub ;

-- Parse and translate an expression
: expression ( -- ) 
        token addop? IF  _clear 
                   ELSE  term
                  ENDIF
        BEGIN   token addop? 
        WHILE   _push
                CASE token
                  '+' OF add      ENDOF
                  '-' OF subtract ENDOF
                ENDCASE 
        REPEAT ;

-- Get another / next expression and compare 
: comparex ( -- ) expression _popcmp ;
: nextx    ( -- ) next comparex ;

-- Recognize and translate a relational "equals" / "less or equal" / "not equals"
: equals      ( -- )  nextx _sete  ;
: lessorequal ( -- )  nextx _setle ;
: notequals   ( -- )  nextx _setne ;

-- Recognize and translate a relational "less than"
: less ( -- ) 
        next
        CASE token
          '=' OF  lessorequal    ENDOF
          '>' OF  notequals      ENDOF
                  comparex _setl 
        ENDCASE ;

-- Recognize and translate a relational "greater than"
: greater ( -- ) 
        next
        '=' =token IF  nextx _setge EXIT  ENDIF
        comparex _setg ;

-- Parse and translate a relation
: relation ( -- )
        expression
        token relop? 0= ?EXIT
        _push
        CASE token
          '=' OF  equals     ENDOF
          '<' OF  less       ENDOF
          '>' OF  greater    ENDOF
        ENDCASE ;

-- Parse and translate a boolean factor with NOT
: notfactor ( -- )
        '!' <>token IF  relation EXIT  ENDIF
        next  relation _not ;

-- Parse and translate a boolean term 
: boolterm ( -- )
        notfactor
        BEGIN   '&' =token 
        WHILE   _push next  notfactor _popand
        REPEAT ;

-- Recognize and translate a boolean OR / XOR
: boolOR  ( -- ) next boolterm _popor ;
: boolXOR ( -- ) next boolterm _popxor ;

-- Parse and translate a boolean expression 
:NONAME ( -- )
        boolterm
        BEGIN   token orop?
        WHILE   _push
                CASE token
                  '|' OF  boolOR  ENDOF
                  '~' OF  boolXOR ENDOF
                ENDCASE
        REPEAT ; IS boolexpression

-- Parse and translate an assignment statement
: assignment ( -- )
        token> DUP 1+ ALLOCATE ?ALLOCATE DUP LOCAL name PACK DROP
        next  
        S" =" matchstring
        boolexpression
        name COUNT storevariable
        name FREE ?ALLOCATE ;

-- Block statements -------------------------------------------

DEFER block

-- Recognize and translate an IF construct
: doIF ( label -- )
        0 0 LOCALS| L2 L1 L |
        next
        boolexpression
        newlabel DUP 1+ ALLOCATE ?ALLOCATE DUP TO L1 PACK DROP
        L1 COUNT DUP 1+ ALLOCATE ?ALLOCATE DUP TO L2 PACK DROP
        L1 COUNT _branch0
        L block
        'l' =token IF   L2 FREE ?ALLOCATE
                        next
                        newlabel DUP 1+ ALLOCATE ?ALLOCATE DUP TO L2 PACK DROP
                        L2 COUNT _branch
                        L1 COUNT postlabel
                        L block
                ENDIF
        L2 COUNT postlabel
        S" ENDIF" matchstring

        L1 FREE ?ALLOCATE 
        L2 FREE ?ALLOCATE ;

-- Recognize and translate a WHILE construct
: doWHILE ( -- )
        0 0 LOCALS| L2 L1 |
        newlabel DUP 1+ ALLOCATE ?ALLOCATE DUP TO L1 PACK DROP
        newlabel DUP 1+ ALLOCATE ?ALLOCATE DUP TO L2 PACK DROP

        next
        L1 COUNT postlabel
        boolexpression
        L2 COUNT _branch0
        L2 block
        S" ENDWHILE" matchstring
        L1 COUNT _branch
        L2 COUNT postlabel

        L1 FREE ?ALLOCATE 
        L2 FREE ?ALLOCATE ;

-- Parse and translate a LOOP statement
: doLOOP ( -- )
        0 0 LOCALS| L2 L1 |
        newlabel DUP 1+ ALLOCATE ?ALLOCATE DUP TO L1 PACK DROP
        newlabel DUP 1+ ALLOCATE ?ALLOCATE DUP TO L2 PACK DROP

        next
        L1 COUNT postlabel
        L2 block
        L1 COUNT _branch
        L2 COUNT postlabel
        S" ENDLOOP" matchstring

        L1 FREE ?ALLOCATE 
        L2 FREE ?ALLOCATE ;

-- Parse and translate a REPEAT statement 
: doREPEAT ( -- )
        0 0 LOCALS| L2 L1 |
        newlabel DUP 1+ ALLOCATE ?ALLOCATE DUP TO L1 PACK DROP
        newlabel DUP 1+ ALLOCATE ?ALLOCATE DUP TO L2 PACK DROP

        next
        L1 COUNT postlabel
        L2 block
        S" UNTIL" matchstring
        boolexpression
        L1 COUNT _branch0
        L2 COUNT postlabel

        L1 FREE ?ALLOCATE 
        L2 FREE ?ALLOCATE ;

-- Parse and translate a FOR statement 
: doFOR ( -- )
        0 0 0 LOCALS| name L2 L1 |
        next
        checkident  token> checktable
        token>   DUP 1+ ALLOCATE ?ALLOCATE DUP TO name PACK DROP
        newlabel DUP 1+ ALLOCATE ?ALLOCATE DUP TO L1   PACK DROP
        newlabel DUP 1+ ALLOCATE ?ALLOCATE DUP TO L2   PACK DROP

        next S" =" matchstring  expression
        _decr
        name COUNT storevariable
        S" TO" matchstring  
        expression
        _push
        L1 COUNT postlabel
        name COUNT loadvariable
        _incr
        name COUNT storevariable
        L2 COUNT _pcmp+b0>
        L2 block
        L1 COUNT _branch
        L2 COUNT postlabel
        S" ENDFOR" matchstring
        _incSP

        name FREE ?ALLOCATE
        L1   FREE ?ALLOCATE 
        L2   FREE ?ALLOCATE ;

-- Parse and Translate a DO Statement
: doDO ( -- )
        0 0 LOCALS| L2 L1 |
        newlabel DUP 1+ ALLOCATE ?ALLOCATE DUP TO L1 PACK DROP
        newlabel DUP 1+ ALLOCATE ?ALLOCATE DUP TO L2 PACK DROP

        next
        expression
        L1 COUNT postlabel
        _push
        L2 block
        _pop
        _decr
        L1 COUNT _branch<>0
        _decSP
        L2 COUNT postlabel
        S" ENDDO" matchstring
        _incSP 

        L1 FREE ?ALLOCATE 
        L2 FREE ?ALLOCATE ;

-- Recognize and translate a BREAK
: doBREAK ( label -- )
        DUP C@ 0= IF  S" No loop to break from" aborts  ENDIF
        COUNT _branch 
        next ;

-- Read a single variable
: readvar ( -- ) checkident  token> checktable  token> readit  next ;

-- Process a read statement 
: doread ( -- )
        next 
        S" (" matchstring
        readvar
        BEGIN  ',' =token
        WHILE  next readvar
        REPEAT 
        S" )" matchstring ;

-- Process a write statement
: dowrite ( -- )
        next 
        S" (" matchstring
        expression writeit
        BEGIN  ',' =token
        WHILE  next expression writeit
        REPEAT 
        S" )" matchstring ;

-- Recognize and translate a statement block
:NONAME ( label -- )
        LOCAL L
        scan
        BEGIN  'e' <>token 
               'l' <>token AND
               'u' <>token AND
        WHILE  CASE token
                'i' OF  L doIF    ENDOF
                'w' OF  doWHILE   ENDOF
                'd' OF  doDO      ENDOF
                'L' OF  doLOOP    ENDOF
                'r' OF  doREPEAT  ENDOF
                'f' OF  doFOR     ENDOF
                'B' OF  L doBREAK ENDOF
                'R' OF  doread    ENDOF
                'W' OF  dowrite   ENDOF
                        assignment
               ENDCASE
               scan
        REPEAT ; IS block

-- Declarations -----------------------------------------------

-- Allocate storage for a variable
: alloc ( -- )
        0 LOCAL aname
        next 
        token> DUP 1+ ALLOCATE ?ALLOCATE DUP TO aname PACK DROP
        'x' <>token IF  S" Variable name" expected  ENDIF
        aname COUNT checkdup
        aname COUNT 'v' addentry
        next
        '=' =token IF   S" =" matchstring
                        '-' =token IF  S" -" matchstring  
                                       S" -"  
                                 ELSE  S" "  
                                ENDIF
                        token> $+  
                        next
                 ELSE   S" 0"
                ENDIF   
        aname COUNT allocatestorage 
        aname FREE ?ALLOCATE ;

-- Parse and translate the global declarations 
: topdecls ( -- )
        scan
        BEGIN   'v' =token
        WHILE   alloc
                BEGIN ',' =token WHILE alloc REPEAT
        REPEAT ;

-- Parse and translate a program ------------------------------

-- Initialize everything
: init ( -- ) 
        CLEAR Lcount
        [cnt]ST    0!
        [cnt]SType 0!
        CR getchar next ; 

: TINY11 ( -- )
        init
        S" PROGRAM" matchstring  token> prog$ PACK DROP next
        header
        topdecls
        S" BEGIN" matchstring
        prolog
        C" " block
        S" END" matchstring
        epilog ;
-- ------------------------------------------------------------

*****************************************************************
*                                                               *
*                        COPYRIGHT NOTICE                       *
*                                                               *
*   Copyright (C) 1989 Jack W. Crenshaw. All rights reserved.   *
*                                                               *
*****************************************************************

LET'S BUILD A COMPILER! By Jack W. Crenshaw, Ph.D. 3 June 1989