In this installment, we'll get rid of those restrictions. We'll also extend what we've done to include assignment statements function calls and. Remember, though, that the second restriction was mainly self-imposed ... a choice of convenience on our part, to make life easier and to let us concentrate on the fundamental concepts. As you'll see in a bit, it's an easy restriction to get rid of, so don't get too hung up about it. We'll use the trick when it serves us to do so, confident that we can discard it when we're ready to.
Most expressions that we see in practice involve variables, such as
b * b + 4 * a * c
No parser is much good without being able to deal with them. Fortunately, it's also quite easy to do.
Remember that in our parser as it currently stands, there are two kinds of factors allowed: integer constants and expressions within parentheses. In BNF notation,
<factor> ::= <number> | (<expression>)
The '|' stands for "or", meaning of course that either form is a legal form for a factor. Remember, too, that we had no trouble knowing which was which ... the lookahead character is a left paren '(' in one case, and a digit in the other.
It probably won't come as too much of a surprise that a variable is just another kind of factor. So we extend the BNF above to read:
<factor> ::= <number> | (<expression>) | <variable>
Again, there is no ambiguity: if the lookahead character is a
letter, we have a variable; if a digit, we have a number. Back
when we translated the number, we just issued code to load the
number, as immediate data, into EAX. Now we do the same, only we
load a variable. Let's modify the current version of Factor
to read:
-- ------------------------------------------------------------------------------------ -- Parse and translate a math factor DEFER expression ( -- ) : factor ( -- ) Look '(' = IF '(' match expression ')' match EXIT ENDIF Look alpha? IF S" dword-ptr -> eax mov," getname CHAR-PREPEND emitln EXIT ENDIF S" d# -> eax mov," getnum '0' + CHAR-PREPEND emitln ; -- ------------------------------------------------------------------------------------
I've remarked before how easy it is to add extensions to the parser, because of the way it's structured. You can see that this still holds true here. This time it cost us all of two extra lines of code. Notice, too, how the if-else-else structure exactly parallels the BNF syntax equation.
OK, compile and test this new version of the parser (chap3a.frt). That didn't hurt too badly, did it?
Up till now, we've been able to write what is called a
"predictive parser." That means that at any point, we can know
by looking at the current lookahead character exactly what to do
next. That isn't the case when we add functions. Every language
has some naming rules for what constitutes a legal identifier.
For the present, ours is simply that it is one of the letters
'a'..'z'. The problem is that a variable name and a function
name obey the same rules. So how can we tell which is which?
One way is to require that they each be declared before they are
used. Forth takes that approach. The other is that we might
require a function to be followed by a (possibly empty) parameter
list. That's the rule used in C
.
Since we don't yet have a mechanism for declaring types, let's
use the C
rule for now. Since we also don't have a mechanism to
deal with parameters, we can only handle empty lists, so our
function calls will have the form
x() .
Since we're not dealing with parameter lists yet, there is nothing to do but to call the function, so we need only to issue a CALL instead of a MOV.
Now that there are two possibilities for the "Alpha? IF" branch
of the test in Factor
, let's treat them in a separate word.
Modify Factor
to read:
-- ------------------------------------------------------------- -- Parse and translate a math factor DEFER expression ( -- ) : factor ( -- ) Look '(' = IF '(' match expression ')' match EXIT ENDIF Look alpha? IF ident EXIT ENDIF S" d# -> eax mov," getnum '0' + CHAR-PREPEND emitln ; -- -------------------------------------------------------------
and insert before it the new word
-- ------------------------------------------------------------- -- Parse and translate an identifier : ident ( -- ) getname LOCAL name Look '(' = IF '(' match ')' match S" offset NEAR call," ELSE S" dword-ptr -> eax mov," ENDIF name CHAR-PREPEND emitln ; -- -------------------------------------------------------------
OK, compile and test this version (chap3b.frt). Does it parse all legal expressions? Does it correctly flag badly formed ones?
The important thing to notice is that even though we no longer
have a predictive parser, there is little or no complication
added with the recursive descent approach that we're using. At
the point where Factor
finds an identifier (letter), it doesn't
know whether it's a variable name or a function name, nor does it
really care. It simply passes it on to Ident
and leaves it up to
that word to figure it out. Ident
, in turn, simply tucks
away the identifier and then reads one more character to decide
which kind of identifier it's dealing with.
Keep this approach in mind. It's a very powerful concept, and it should be used whenever you encounter an ambiguous situation requiring further lookahead. Even if you had to look several tokens ahead, the principle would still work.
Ident
through Expression
) there are
only two calls to the error routine, Expected
. Even those aren't
necessary ... if you'll look again in Term
and Expression
, you'll
see that those statements can't be reached. I put them in early
on as a bit of insurance, but they're no longer needed. Why
don't you delete them now?
So how did we get this nice error handling virtually for free?
It's simply that I've carefully avoided reading a character
directly using GetChar
. Instead, I've relied on the error
handling in GetName
, GetNum
, and Match
to do all the error
checking for me. Astute readers will notice that some of the
calls to Match
(for example, the ones in Add
and Subtract
) are
also unnecessary ... we already know what the character is by the
time we get there ... but it maintains a certain symmetry to
leave them in, and the general rule to always use Match
instead
of GetChar
is a good one.
I mentioned an "almost" above. There is a case where our error handling leaves a bit to be desired. So far we haven't told our parser what and end-of-line looks like, or what to do with embedded white space. So a space character (or any other character not part of the recognized character set) simply causes the parser to terminate, ignoring the unrecognized characters.
It could be argued that this is reasonable behavior at this point. In a "real" compiler, there is usually another statement following the one we're working on, so any characters not treated as part of our expression will either be used for or rejected as part of the next one.
But it's also a very easy thing to fix up, even if it's only temporary. All we have to do is assert that the expression should end with an end-of-line , i.e., a carriage return.
To see what I'm talking about, try the input line
1+2 <space> 3+4
See how the space was treated as a terminator? Now, to make the compiler properly flag this, add the line
Look ^M <> IF S" Newline" expected ENDIF
in the main program, just after the call to Expression
. That
catches anything left over in the input stream.
As usual, recompile the program (chap3c.frt) and verify that it does what it's supposed to.
KISS
principle.
Of course, parsing an expression is not much good without having something to do with it afterwards. Expressions usually (but not always) appear in assignment statements, in the form
<Ident> = <Expression>
We're only a breath away from being able to parse an assignment
statement, so let's take that last step. Just after the word
Expression
, add the following new word:
-- ------------------------------------------------------------------------------- -- Parse and translate an assignment statement : assignment ( -- ) getname LOCAL name '=' match expression S" eax -> " 'OF name 1 $+ S" dword-ptr mov," $+ emitln ; -- -------------------------------------------------------------------------------
Note again that the code exactly parallels the BNF. And notice
further that the error checking was painless, handled by GetName
and Match
.
Now change the call to Expression
, in the main program, to one to
Assignment
. That's all there is to it (chap3d.frt).
Son of a gun! We are actually compiling assignment statements. If those were the only kind of statements in a language, all we'd have to do is put this in a loop and we'd have a full-fledged compiler!
Well, of course they're not the only kind. There are also little
items like control statements (IF
s and loops
), procedures,
declarations, etc. But cheer up. The arithmetic expressions
that we've been dealing with are among the most challenging in a
language. Compared to what we've already done, control
statements will be easy. I'll be covering them in the fifth
installment. And the other statements will all fall in line, as
long as we remember to KISS
.
Most compilers separate out the handling of the input stream into
a separate module called the lexical scanner. The idea is that
the scanner deals with all the character-by-character input, and
returns the separate units (tokens) of the stream. There may
come a time when we'll want to do something like that, too, but
for now there is no need. We can handle the multi-character
tokens that we need by very slight and very local modifications
to GetName
and GetNum
.
The usual definition of an identifier is that the first character must be a letter, but the rest can be alphanumeric (letters or numbers). To deal with this, we need one other recognizer function
-- ------------------------------------------------------------ -- Recognize an alphanumeric : alnum? ( char -- tf ) DUP alpha? SWAP digit? OR ; -- ------------------------------------------------------------
Add this function to your parser. I put mine just after Digit?
.
While you're at it, might as well include it as a permanent
member of Cradle, too.
Now, we need to modify function GetName
to return a string
instead of a character:
-- ------------------------------------------------------------ -- Get an identifier CREATE token 0 C, 256 CHARS ALLOT : char+! ( c addr -- ) DUP >R COUNT + C! 1 R> C+! ; : getname ( -- c-addr u ) Look alpha? 0= IF S" Name" expected ENDIF token C0! BEGIN Look alnum? WHILE Look >UPC token char+! getchar REPEAT token COUNT ; -- ------------------------------------------------------------
Similarly, modify GetNum
to read:
-- ------------------------------------------------------------ -- Get a number CREATE #value 0 C, 256 CHARS ALLOT : getnum ( -- c-addr u ) Look digit? 0= IF S" Integer" expected ENDIF #value C0! BEGIN Look digit? WHILE Look #value char+! getchar REPEAT #value COUNT ; -- ------------------------------------------------------------
A slight change to Factor
is needed:
-- ------------------------------------------------------------ : factor ( -- ) Look '(' = IF '(' match expression ')' match EXIT ENDIF Look alpha? IF ident EXIT ENDIF getnum S" d# -> eax mov," $+ emitln ; -- ------------------------------------------------------------
Amazingly enough, that is virtually all the changes required to
the parser! Of course, as the local variable Name
in words
Ident
and Assignment
was originally declared as an integer, we must
now make sure it can hold a string. This is actually rather
messy, because iForth does not support LOCAL strings:
-- -------------------------------------------------------------------------------- : ident ( -- ) getname DUP 1+ ALLOCATE ?ALLOCATE LOCAL name name PACK DROP Look '(' = IF '(' match ')' match name COUNT S" offset NEAR call," ELSE name COUNT S" dword-ptr -> eax mov," ENDIF $+ emitln name FREE ?ALLOCATE ; -- -------------------------------------------------------------------------------- : assignment ( -- ) getname DUP 1+ ALLOCATE ?ALLOCATE LOCAL name name PACK DROP '=' match expression S" eax -> " name COUNT $+ S" dword-ptr mov," $+ emitln name FREE ?ALLOCATE ; -- --------------------------------------------------------------------------------Make this change, and then recompile and test (chap3e.frt). Now do you believe that it's a simple change?
The key to easy handling of white space is to come up with a
simple rule for how the parser should treat the input stream, and
to enforce that rule everywhere. Up till now, because white
space wasn't permitted, we've been able to assume that after each
parsing action, the lookahead character Look
contains the next
meaningful character, so we could test it immediately. Our
design was based upon this principle.
It still sounds like a good rule to me, so that's the one we'll
use. This means that every routine that advances the input
stream must skip over white space, and leave the next non-white
character in Look
. Fortunately, because we've been careful to
use GetName
, GetNum
, and Match
for most of our input processing,
it is only those three routines (plus Init
) that we need to
modify.
Not surprisingly, we start with yet another new recognizer routine:
-- ------------------------------------------------------------ -- Recognize white space : white? ( char -- tf ) DUP Tab = SWAP BL = OR ; -- ------------------------------------------------------------
We also need a routine that will eat white-space characters, until it finds a non-white one:
-- ------------------------------------------------------------ -- Skip over leading white space : skipwhite ( -- ) BEGIN Look white? WHILE getchar REPEAT ; -- ------------------------------------------------------------
Now, add calls to SkipWhite
to Match
, GetName
, and GetNum
as
shown below:
-- ------------------------------------------------------------- -- Match a specific input character : match ( char -- ) DUP Look = IF DROP getchar skipwhite ELSE S" `" ROT CHAR-APPEND &' CHAR-APPEND expected ENDIF ; -- ------------------------------------------------------------- -- get an identifier : getname ( -- c-addr u ) Look alpha? 0= IF S" Name" expected ENDIF token C0! BEGIN Look alnum? WHILE Look >UPC token char+! getchar REPEAT token COUNT skipwhite ; -- get a number : getnum ( -- c-addr u ) Look digit? 0= IF S" Integer" expected ENDIF #value C0! BEGIN Look digit? WHILE Look #value char+! getchar REPEAT #value COUNT skipwhite ; -- ------------------------------------------------------------
Finally, we need to skip over leading blanks where we "prime the
pump" in Init
:
-- ------------------------------------------------------------ -- Initialize : init ( -- ) getchar skipwhite ; -- ------------------------------------------------------------
Make these changes and recompile the program. You will find that
you will have to move Match
below SkipWhite
, to avoid an error
message from the Forth
compiler. Test the program as always to
make sure it works properly (chap3f.frt).
Since we've made quite a few changes during this session, I'm reproducing the entire parser below:
-- Variable declarations -------------------------------------------------------------------------------------------------- 0 VALUE Look -- lookahead character CREATE token PRIVATE 0 C, #256 CHARS ALLOT CREATE #value PRIVATE 0 C, #256 CHARS ALLOT -- Tools ------------------------------------------------------------------------------------------------------------------ : getchar ( -- ) EKEY TO Look ; -- read new character from input stream : error ( c-addr u -- ) CR ^G EMIT ." Error: " TYPE ." ." ; -- report an error : aborts ( c-addr u -- ) error ABORT ; -- report error and halt : expected ( c-addr u -- ) S" Expected" $+ aborts ; -- report what was expected : alpha? ( char -- tf ) >UPC 'A' 'Z' 1+ WITHIN ; -- recognize an alpha character : digit? ( char -- tf ) '0' '9' 1+ WITHIN ; -- recognize a decimal digit : alnum? ( char -- tf ) DUP alpha? SWAP digit? OR ; -- recognize alphanumeric : emits ( c-addr u -- ) Tab EMIT TYPE ; -- output a string with tab : emitln ( c-addr u -- ) CR emits ; -- output a string with tab and crlf : addop? ( char -- tf ) DUP '+' = SWAP '-' = OR ; -- test for AddOp : mulop? ( char -- tf ) DUP '*' = SWAP '/' = OR ; -- test for MulOp : white? ( char -- tf ) DUP Tab = SWAP BL = OR ; -- recognize white space : char+! ( c addr -- ) DUP >S COUNT + C! 1 S> C+! ; -- skip white space : skipwhite ( -- ) BEGIN Look white? WHILE getchar REPEAT ; -- Specifics -------------------------------------------------------------------------------------------------------------- -- initialize : init ( -- ) CR getchar skipwhite ; -- match a specific input character : match ( char -- ) DUP Look = IF DROP getchar skipwhite ELSE S" `" ROT CHAR-APPEND &' CHAR-APPEND expected ENDIF ; -- get an identifier : getname ( -- c-addr u ) Look alpha? 0= IF S" Name" expected ENDIF token C0! BEGIN Look alnum? WHILE Look >UPC token char+! getchar REPEAT token COUNT skipwhite ; -- get a number : getnum ( -- c-addr u ) Look digit? 0= IF S" Integer" expected ENDIF #value C0! BEGIN Look digit? WHILE Look #value char+! getchar REPEAT #value COUNT skipwhite ; -- parse and translate a math expression ---------------------------------------------------------------------------------- DEFER expression ( -- ) -- parse and translate an identifier. -- Note that we need to store getname locally because parsing a call's -- parameterlist may clobber it. : ident ( -- ) getname DUP 1+ ALLOCATE ?ALLOCATE LOCAL name name PACK DROP Look '(' = IF '(' match ')' match name COUNT S" offset NEAR call," ELSE name COUNT S" dword-ptr -> eax mov," ENDIF $+ emitln name FREE ?ALLOCATE ; : factor ( -- ) Look '(' = IF '(' match expression ')' match EXIT ENDIF Look alpha? IF ident EXIT ENDIF getnum S" d# -> eax mov," $+ emitln ; : multiply ( -- ) '*' match factor S" [esp] dword mul, [esp 4 +] -> esp lea," emitln ; : divide ( -- ) '/' match factor S" ebx pop, ebx -> eax xchg, eax -> edx mov, #31 b# -> edx sar, ebx idiv," emitln ; : term ( -- ) factor BEGIN Look mulop? WHILE S" eax push," emitln CASE Look '*' OF multiply ENDOF '/' OF divide ENDOF S" Mulop" expected ENDCASE REPEAT ; : add ( -- ) '+' match term S" [esp] -> eax add, [esp 4 +] -> esp lea," emitln ; : subtract ( -- ) '-' match term S" [esp] -> eax sub, [esp 4 +] -> esp lea, eax neg," emitln ; :NONAME ( -- ) Look addop? IF S" eax -> eax xor," emitln ELSE term ENDIF BEGIN Look addop? WHILE S" eax push," emitln CASE Look '+' OF add ENDOF '-' OF subtract ENDOF S" Addop" expected ENDCASE REPEAT ; IS expression -- Parse and translate an assignment statement. -- Note that we need to store getname locally because parsing a call's -- parameterlist may clobber it. : assignment ( -- ) getname DUP 1+ ALLOCATE ?ALLOCATE LOCAL name name PACK DROP '=' match expression S" eax -> " name COUNT $+ S" dword-ptr mov," $+ emitln name FREE ?ALLOCATE ; -- Main Program ----------------------------------------------------------------------------------------------------------- : main ( -- ) init assignment Look ^M <> IF S" Newline" expected ENDIF ;
Now the parser is complete. It's got every feature we can put in a one-line "compiler." Tuck it away in a safe place. Next time we'll move on to a new subject, but we'll still be talking about expressions for quite awhile. Next installment, I plan to talk a bit about interpreters as opposed to compilers, and show you how the structure of the parser changes a bit as we change what sort of action has to be taken. The information we pick up there will serve us in good stead later on, even if you have no interest in interpreters. See you next time.
***************************************************************** * * * COPYRIGHT NOTICE * * * * Copyright (C) 1988 Jack W. Crenshaw. All rights reserved. * * * *****************************************************************