Maybe I should mention why we need a lexical scanner at all ... after all, we've been able to manage all right without one, up till now, even when we provided for multi-character tokens.
The only reason, really, has to do with keywords. It's a fact of
computer life that the syntax for a keyword has the same form as
that for any other identifier. We can't tell until we get the
complete word whether or not it is a keyword. For example, the
variable IFILE and the keyword IF
look just alike, until you get
to the third character. In the examples to date, we were always
able to make a decision based upon the first character of the
token, but that's no longer possible when keywords are present.
We need to know that a given string is a keyword before we begin
to process it. And that's why we need a scanner.
In the last session, I also promised that we would be able to provide for normal tokens without making wholesale changes to what we have already done. I didn't lie ... we can, as you will see later. But every time I set out to install these elements of the software into the parser we have already built, I had bad feelings about it. The whole thing felt entirely too much like a band-aid. I finally figured out what was causing the problem: I was installing lexical scanning software without first explaining to you what scanning is all about, and what the alternatives are. Up till now, I have studiously avoided giving you a lot of theory, and certainly not alternatives. I generally don't respond well to the textbooks that give you twenty-five different ways to do something, but no clue as to which way best fits your needs. I've tried to avoid that pitfall by just showing you one method, that works.
But this is an important area. While the lexical scanner is
hardly the most exciting part of a compiler, it often has the
most profound effect on the general "look & feel" of the
language, since after all it's the part closest to the user. I
have a particular structure in mind for the scanner to be used
with KISS
. It fits the look & feel that I want for that
language. But it may not work at all for the language you're
cooking up, so in this one case I feel that it's important for
you to know your options.
So I'm going to depart, again, from my usual format. In this session we'll be getting much deeper than usual into the basic theory of languages and grammars. I'll also be talking about areas other than compilers in which lexical scanning plays an important role. Finally, I will show you some alternatives for the structure of the lexical scanner. Then, and only then, will we get back to our parser from the last installment. Bear with me ... I think you'll find it's worth the wait. In fact, since scanners have many applications outside of compilers, you may well find this to be the most useful session for you.
Typically, lexical scanning is done in a separate part of the compiler, so that the parser per se sees only a stream of input tokens. Now, theoretically it is not necessary to separate this function from the rest of the parser. There is only one set of syntax equations that define the whole language, so in theory we could write the whole parser in one module.
Why the separation? The answer has both practical and theoretical bases.
In 1956, Noam Chomsky defined the "Chomsky Hierarchy" of grammars. They are:

   o Type 0: Unrestricted (e.g., English)
   o Type 1: Context-Sensitive
   o Type 2: Context-Free
   o Type 3: Regular
A few features of the typical programming language (particularly
the older ones, such as FORTRAN
) are Type 1, but for the most
part all modern languages can be described using only the last
two types, and those are all we'll be dealing with here.
The neat part about these two types is that there are very specific ways to parse them. It has been shown that any regular grammar can be parsed using a particular form of abstract machine called the state machine (finite automaton). We have already implemented state machines in some of our recognizers.
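To make the state-machine idea concrete, here is a sketch with an explicit state table, written in Python since the idea is language-independent. The state numbers and table layout are my own illustration, not part of the compiler we're building; it recognizes identifiers of the form letter followed by letters or digits.

```python
# A tiny table-driven finite automaton for: <letter> [ <letter> | <digit> ]*
# States: 0 = start, 1 = in-identifier, 2 = dead/error.  State 1 accepts.

def char_class(ch):
    """Classify one input character for the transition table."""
    if ch.isalpha():
        return "letter"
    if ch.isdigit():
        return "digit"
    return "other"

# transition[state][class] -> next state
TRANSITION = {
    0: {"letter": 1, "digit": 2, "other": 2},
    1: {"letter": 1, "digit": 1, "other": 2},
}

def is_ident(s):
    state = 0
    for ch in s:
        state = TRANSITION.get(state, {}).get(char_class(ch), 2)
        if state == 2:          # dead state: no way back
            return False
    return state == 1           # accept only if we saw at least one letter

print(is_ident("IFILE"))   # True
print(is_ident("2X"))      # False
```

Later on we'll see that the same machine can be expressed with no table and no state variable at all.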
Similarly, Type 2 (context-free) grammars can always be parsed using a push-down automaton (a state machine augmented by a stack). We have also implemented these machines. Instead of implementing a literal stack, we have relied on the built-in stack associated with recursive coding to do the job, and that in fact is the preferred approach for top-down parsing.
Now, it happens that in real, practical grammars, the parts that qualify as regular expressions tend to be the lower-level parts, such as the definition of an identifier:
<ident> ::= <letter> [ <letter> | <digit> ]*
Since it takes a different kind of abstract machine to parse the two types of grammars, it makes sense to separate these lower-level functions into a separate module, the lexical scanner, which is built around the idea of a state machine. The idea is to use the simplest parsing technique needed for the job.
There is another, more practical reason for separating scanner
from parser. We like to think of the input source file as a
stream of characters, which we process from left to right without
backtracking. In practice that isn't possible. Almost every
language has certain keywords such as IF
, WHILE
, and END
. As I
mentioned earlier, we can't really know whether a given
character string is a keyword, until we've reached the end of it,
as defined by a space or other delimiter. So in that sense, we
must save the string long enough to find out whether we have a
keyword or not. That's a limited form of backtracking.
So the structure of a conventional compiler involves splitting up the functions of the lower-level and higher-level parsing. The lexical scanner deals with things at the character level, collecting characters into strings, etc., and passing them along to the parser proper as indivisible tokens. It's also considered normal to let the scanner have the job of identifying keywords.
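The division of labor, including the limited backtracking just described, can be sketched like this (an illustrative Python fragment, not the Forth code we'll write; the keyword set is hypothetical): the scanner buffers characters into a complete word, and only then decides whether that word is a keyword.

```python
KEYWORDS = {"IF", "ELSE", "ENDIF", "END"}   # illustrative keyword set

def next_word(text, pos):
    """Collect one alphanumeric token starting at pos; classify it
    only after the whole word has been read."""
    start = pos
    while pos < len(text) and text[pos].isalnum():
        pos += 1
    word = text[start:pos].upper()
    kind = "keyword" if word in KEYWORDS else "identifier"
    return word, kind, pos

print(next_word("IFILE = 2", 0))   # ('IFILE', 'identifier', 5)
print(next_word("IF X", 0))        # ('IF', 'keyword', 2)
```

Note that IFILE and IF take the same path through the loop; only the final lookup distinguishes them.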
If you use a parser generator such as the Unix
tools LEX and YACC, that's what you'll get. The output of
LEX
is a state machine implemented in C
, plus a table of actions
corresponding to the input grammar given to LEX
. The YACC
output
is similar ... a canned table-driven parser, plus the table
corresponding to the language syntax.
That is not the only choice, though. In our previous
installments, you have seen over and over that it is possible to
implement parsers without dealing specifically with tables,
stacks, or state variables. In fact, in Installment V I warned
you that if you find yourself needing these things you might be
doing something wrong, and not taking advantage of the power of
Forth
. There are basically two ways to define a state machine's
state: explicitly, with a state number or code, and implicitly,
simply by virtue of the fact that I'm at a certain place in the
code (if it's Tuesday, this must be Belgium). We've relied
heavily on the implicit approaches before, and I think you'll
find that they work well here, too.
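For contrast with the explicit-state-number style, here is what the implicit style looks like in sketch form (Python, purely illustrative; the Forth versions follow shortly). The "state" is simply where we are in the code.

```python
def get_name(text, pos):
    """Implicit state machine for an identifier: the position in this
    function IS the state -- there is no state variable anywhere."""
    if not text[pos].isalpha():                       # 'start' state
        raise ValueError("Name expected")
    start = pos
    while pos < len(text) and text[pos].isalnum():    # 'in-identifier' state
        pos += 1
    return text[start:pos], pos                       # 'finish' state

print(get_name("IFILE=2", 0))   # ('IFILE', 5)
```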
In practice, it may not even be necessary to have a well-defined
lexical scanner. This isn't our first experience at dealing with
multi-character tokens. In Installment III, we extended our
parser to provide for them, and we didn't even need a lexical
scanner. That was because in that narrow context, we could
always tell, just by looking at the single lookahead character,
whether we were dealing with a number, a variable, or an
operator. In effect, we built a distributed lexical scanner,
using procedures GetName
and GetNum
.
With keywords present, we can't know anymore what we're dealing with, until the entire token is read. This leads us to a more localized scanner; although, as you will see, the idea of a distributed scanner still has its merits.
Let's begin with the two definitions most often seen in real programming languages:
<ident>  ::= <letter> [ <letter> | <digit> ]*
<number> ::= [ <digit> ]+
(Remember, the '*' indicates zero or more occurrences of the terms in brackets, and the '+', one or more.)
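These two definitions map directly onto regular expressions. As a cross-check of what they accept (an illustrative Python fragment using the standard re module, nothing to do with our Forth code):

```python
import re

# <ident>  ::= <letter> [ <letter> | <digit> ]*
# <number> ::= [ <digit> ]+
IDENT  = re.compile(r"[A-Za-z][A-Za-z0-9]*\Z")
NUMBER = re.compile(r"[0-9]+\Z")

print(bool(IDENT.match("IFILE")))   # True
print(bool(IDENT.match("2X")))      # False
print(bool(NUMBER.match("123")))    # True
```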
We have already dealt with similar items in Installment III. Let's begin (as usual) with a bare cradle. Not surprisingly, we are going to need a new recognizer:
-- ------------------------------------------------------------
-- Recognize an alphanumeric character
: alnum? ( char -- tf ) DUP alpha? SWAP digit? OR ;
-- ------------------------------------------------------------
Using this let's write the following two routines, which are very similar to those we've used before:
-- ------------------------------------------------------------
-- Get an identifier.
: char+! ( c addr -- ) DUP >R COUNT + C! 1 R> C+! ;

CREATE token 0 C, #256 CHARS ALLOT

: getname ( -- c-addr u )
    Look alpha? 0= IF S" Name" expected ENDIF
    token C0!
    BEGIN Look alnum? WHILE Look >UPC token char+! getchar REPEAT
    token COUNT ;
-- ------------------------------------------------------------
-- Get a number
CREATE #value 0 C, #256 CHARS ALLOT

: getnum ( -- c-addr u )
    Look digit? 0= IF S" Integer" expected ENDIF
    #value C0!
    BEGIN Look digit? WHILE Look #value char+! getchar REPEAT
    #value COUNT ;
-- ------------------------------------------------------------
(Notice that this version of GetNum
returns a string, not an
integer as before.)
You can easily verify that these routines work by calling them from the main program, as in
GetName TYPE
This program will print any legal name typed in. It will reject anything else. (chap7a.frt)
Test the other routine similarly.
Recall the two words White? and SkipWhite, introduced earlier to
handle white space. Make sure that these routines are in your
current version of the cradle, and add the line

   SkipWhite

at the end of both GetName and GetNum.
Now, let's define the new procedure:
-- ------------------------------------------------------------
-- Lexical scanner
: scan ( -- c-addr u )
    Look alpha? IF getname EXIT ENDIF
    Look digit? IF getnum EXIT ENDIF
    Look token C! token 1
    getchar skipwhite ;
-- ------------------------------------------------------------
We can call this from the new main program:
-- ------------------------------------------------------------
-- Main Program
: main ( -- )
    init
    BEGIN
      scan token$ PACK COUNT CR TYPE
      token$ CHAR+ C@ ^M =
    UNTIL ;
-- ------------------------------------------------------------
(You will have to add the declaration of the string token$
at the
beginning of the program. Make it any convenient length, say 16
characters.)
Now, run the program (chap7b.frt). Note how the input string is, indeed, separated into distinct tokens.
GetName
does indeed
implement a state machine. The state is implicit in the current
position in the code. A very useful trick for visualizing what's
going on is the syntax diagram, or "railroad-track" diagram.
It's a little difficult to draw one in this medium, so I'll use
them very sparingly, but the figure below should give you the
idea:
          .-----> Other ----------------------------> Error
          |
 Start --+----> Letter -----------+---> Other -----> Finish
                  ^               |
                  |               v
                  |<----- Letter <---------|
                  |                        |
                  `<----- Digit <----------'
As you can see, this diagram shows how the logic flows as characters are read. Things begin, of course, in the start state, and end when a character other than an alphanumeric is found. If the first character is not alpha, an error occurs. Otherwise the machine will continue looping until the terminating delimiter is found.
Note that at any point in the flow, our position is entirely dependent on the past history of the input characters. At that point, the action to be taken depends only on the current state, plus the current input character. That's what makes this a state machine.
Because of the difficulty of drawing railroad-track diagrams in
this medium, I'll continue to stick to syntax equations from now
on. But I highly recommend the diagrams to you for anything you
do that involves parsing. After a little practice you can begin
to see how to write a parser directly from the diagrams.
Parallel paths get coded into guarded actions (guarded by IF
's or
CASE
statements), serial paths into sequential calls. It's
almost like working from a schematic.
We didn't even discuss SkipWhite
, which was introduced earlier,
but it also is a simple state machine, as is GetNum
. So is their
parent procedure, Scan
. Little machines make big machines.
The neat thing that I'd like you to note is how painlessly this implicit approach creates these state machines. I personally prefer it a lot over the table-driven approach. It also results in a small, tight, and fast scanner.
One approach is simply to treat the newline characters, CR and
LF, as white space; that's essentially how the C standard library
routine isspace works. We didn't actually try this before. I'd
like to do it now, so you can get a feel for the results.
To do this, simply modify the single executable line of White?
to read:
: white? ( char -- tf )
    DUP Tab =  OVER BL = OR
    OVER ^M = OR  SWAP ^J = OR ;
We need to give the main program a new stop condition, since it will never see a CR. Let's just use:
token$ CHAR+ C@ '.' = UNTIL ;
OK, compile this program (chap7c.frt) and run it. Try a couple of lines, terminated by the period. I used:
now is the time for all good men.
Hey, what happened? When I tried it, I didn't get the last token, the period. The program didn't halt. What's more, when I pressed the 'enter' key a few times, I still didn't get the period.
If you're still stuck in your program, you'll find that typing a period on a new line will terminate it.
What's going on here? The answer is that we're hanging up in
SkipWhite
. A quick look at that routine will show that as long
as we're typing null lines, we're going to just continue to loop.
After SkipWhite
encounters an LF, it tries to execute a GetChar
.
But since the input buffer is now empty, GetChar
's read statement
insists on having another line. The word Scan
gets the
terminating period, all right, but it calls SkipWhite
to clean
up, and SkipWhite
won't return until it gets a non-null line.
This kind of behavior is not quite as bad as it seems. In a real
compiler, we'd be reading from an input file instead of the
console, and as long as we have some procedure for dealing with
end-of-files, everything will come out OK. But for reading data
from the console, the behavior is just too bizarre. The fact of
the matter is that the C
/Unix
convention is just not compatible
with the structure of our parser, which calls for a lookahead
character. The code that the Bell wizards have implemented
doesn't use that convention, which is why they need ungetc
.
OK, let's fix the problem. To do that, we need to go back to the
old definition of White?
(delete the CR and LF characters) and
make use of the word Fin
that I introduced last time. If
it's not in your current version of the cradle, put it there now.
Also, modify the main program to read:
-- ------------------------------------------------------------
-- Main Program
: main ( -- )
    init
    BEGIN
      scan token$ PACK COUNT CR TYPE
      token$ CHAR+ C@ ^M = IF fin ENDIF
      token$ CHAR+ C@ '.' =
    UNTIL ;
-- ------------------------------------------------------------
Note the "guard" test preceding the call to Fin
. That's what
makes the whole thing work, and ensures that we don't try to read
a line ahead.
Try the code now. I think you'll like it better.
If you refer to the code we did in the last installment, you'll
find that I quietly sprinkled calls to Fin
throughout the code,
wherever a line break was appropriate. This is one of those
areas that really affects the look & feel that I mentioned. At
this point I would urge you to experiment with different
arrangements and see how you like them. If you want your
language to be truly free-field, then newlines should be
transparent. In this case, the best approach is to put the
following lines at the beginning of Scan
:
BEGIN Look ^M = WHILE fin REPEAT
If, on the other hand, you want a line-oriented language like
Assembler, BASIC
, or FORTRAN
(or even Ada
... note that it has
comments terminated by newlines), then you'll need for Scan
to
return CR's as tokens. It must also eat the trailing LF. The
best way to do that is to use this line, again at the beginning
of Scan
:
Look ^J = IF fin ENDIF
For other conventions, you'll have to use other arrangements. In my example of the last session, I allowed newlines only at specific places, so I was somewhere in the middle ground. In the rest of these sessions, I'll be picking ways to handle newlines that I happen to like, but I want you to know how to choose other ways for yourselves.
In the subset of KISS
that we've built so far, the
only tokens that have multiple characters are the identifiers and
numbers. All operators were single characters. The only
exception I can think of is the relops <=, >=
, and <>
, but they
could be dealt with as special cases.
Still, other languages have multi-character operators, such as
the :=
of Pascal
or the ++
and >>
of C
. So while we may
not need multi-character operators, it's nice to know how to get
them if necessary.
Needless to say, we can handle operators very much the same way as the other tokens. Let's start with a recognizer:
-- ------------------------------------------------------------
-- Recognize any operator
: op? ( char -- tf )
    DUP '+' =  OVER '-' = OR  OVER '*' = OR  OVER '/' = OR
    OVER '<' = OR  OVER '>' = OR  OVER ':' = OR  SWAP '=' = OR ;
-- ------------------------------------------------------------
It's important to note that we don't have to include every
possible operator in this list. For example, the parentheses
aren't included, nor is the terminating period. The current
version of Scan
handles single-character operators just fine as
it is. The list above includes only those characters that can
appear in multi-character operators. (For specific languages, of
course, the list can always be edited.)
Now, let's add GetOp and modify Scan to read:
-- ------------------------------------------------------------
-- Get an operator
: getop ( -- c-addr u )
    token$ C0!
    Look op? 0= IF S" Operator" expected ENDIF
    BEGIN Look op? WHILE Look token$ char+! getchar REPEAT
    token$ COUNT skipwhite ;
-- ------------------------------------------------------------
-- Lexical scanner
: scan ( -- c-addr u )
    BEGIN Look ^M = WHILE fin REPEAT
    Look alpha? IF getname EXIT ENDIF
    Look digit? IF getnum EXIT ENDIF
    Look op?    IF getop  EXIT ENDIF
    Look token$ char-place token$ COUNT
    getchar skipwhite ;
-- ------------------------------------------------------------
Try the program (chap7e.frt) now. You will find that any code fragments you care to throw at it will be neatly broken up into individual tokens.
How many times have you worked with a program or operating system that had rigid rules about how you must separate items in a list? (Think of the last time you used MSDOS!) Some programs require spaces as delimiters, and some require commas. Worst of all, some require both, in different places. Most are pretty unforgiving about violations of their rules.
I think this is inexcusable. It's too easy to write a parser that will handle both spaces and commas in a flexible way. Consider the following procedure:
-- ------------------------------------------------------------
-- Skip over a comma
: skipcomma ( -- )
    skipwhite
    Look ',' = IF getchar skipwhite ENDIF ;
-- ------------------------------------------------------------
This three-line procedure will skip over a delimiter consisting of any number (including zero) of spaces, with zero or one comma embedded in the string.
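The same tolerant-delimiter idea, sketched in illustrative Python (the function name and position-based interface are my own, not part of our parser):

```python
def skip_comma(text, pos):
    """Skip white space, then at most one comma, then more white space."""
    while pos < len(text) and text[pos].isspace():
        pos += 1
    if pos < len(text) and text[pos] == ",":
        pos += 1
        while pos < len(text) and text[pos].isspace():
            pos += 1
    return pos

# All of these delimiter styles land us on the '2':
for s in ("1 2", "1,2", "1 , 2", "1,  2"):
    print(s[skip_comma(s, 1)])
```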
Temporarily, change the call to SkipWhite
in Scan
to a call to
SkipComma
, and try inputting some lists. Works nicely, eh?
Don't you wish more software authors knew about SkipComma
?
For the record, I found that adding the equivalent of SkipComma
to my Z80 assembler-language programs took all of 6 (six) extra
bytes of code. Even in a 64K machine, that's not a very high
price to pay for user-friendliness!
I think you can see where I'm going here. Even if you never write a line of a compiler code in your life, there are places in every program where you can use the concepts of parsing. Any program that processes a command line needs them. In fact, if you think about it for a bit, you'll have to conclude that any time you write a program that processes user inputs, you're defining a language. People communicate with languages, and the syntax implicit in your program defines that language. The real question is: are you going to define it deliberately and explicitly, or just let it turn out to be whatever the program ends up parsing?
I claim that you'll have a better, more user-friendly program if you'll take the time to define the syntax explicitly. Write down the syntax equations or draw the railroad-track diagrams, and code the parser using the techniques I've shown you here. You'll end up with a better program, and it will be easier to write, to boot.
The main consideration is <shudder> efficiency. Remember that
when we were dealing with single-character tokens, every test was
a
comparison of a single character, Look
, with a byte constant. We
also used the Case
statement heavily.
With the multi-character tokens being returned by Scan
, all those
tests now become string comparisons. Much slower. And not only
slower, but more awkward, since there is no string equivalent of
the Case
statement in Forth
. It seems especially wasteful to
test for what used to be single characters ... the =
, +
, and
other operators ... using string comparisons.
Using string comparison is not impossible ... Ron Cain used just
that approach in writing Small C
. Since we're sticking to the
KISS
principle here, we would be truly justified in settling for
this approach. But then I would have failed to tell you about
one of the key approaches used in "real" compilers.
You have to remember: the lexical scanner is going to be called a lot! Once for every token in the whole source program, in fact. Experiments have indicated that the average compiler spends anywhere from 20% to 40% of its time in the scanner routines. If there were ever a place where efficiency deserves real consideration, this is it.
For this reason, most compiler writers ask the lexical scanner to do a little more work, by "tokenizing" the input stream. The idea is to match every token against a list of acceptable keywords and operators, and return unique codes for each one recognized. In the case of ordinary variable names or numbers, we just return a code that says what kind of token they are, and save the actual string somewhere else.
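In outline, a tokenizing scanner returns a code plus the text of the token. Here is an illustrative Python sketch of that interface (the code names mirror the Forth constants we'll define shortly, but the function itself is hypothetical):

```python
KEYWORDS = {"IF": "IfSym", "ELSE": "ElseSym", "ENDIF": "EndifSym", "END": "EndSym"}

def scan_token(text, pos):
    """Return (token_code, lexeme, new_pos).  Codes are illustrative."""
    while pos < len(text) and text[pos].isspace():
        pos += 1
    if pos >= len(text):
        return ("EOF", "", pos)
    start = pos
    if text[pos].isalpha():
        while pos < len(text) and text[pos].isalnum():
            pos += 1
        word = text[start:pos].upper()
        return (KEYWORDS.get(word, "Ident"), word, pos)
    if text[pos].isdigit():
        while pos < len(text) and text[pos].isdigit():
            pos += 1
        return ("Number", text[start:pos], pos)
    return ("Operator", text[pos], pos + 1)

print(scan_token("IF x12 = 345", 0))   # ('IfSym', 'IF', 2)
```

The parser then switches on the code and consults the lexeme only when it actually needs the spelling.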
One of the first things we're going to need is a way to identify
keywords. We can always do it with successive IF
tests, but it
surely would be nice if we had a general-purpose routine that
could compare a given string with a table of keywords. (By the
way, we're also going to need such a routine later, for dealing
with symbol tables.) Standard Forth
also doesn't allow for
initializing arrays, so you tend to see code like
S" IF" 1 table PACK DROP S" ELSE" 2 table PACK DROP . . S" END" n table PACK DROP
which can get pretty old if there are many keywords.
Fortunately, Forth
has facilities that eliminate this problem.
First, modify your declarations like this:
-- ------------------------------------------------------------
-- Type declarations
: $, ( c-addr u size -- )
    >S S MIN >R R@ C,
    R@ 0 ?DO C@+ C, LOOP DROP
    S> R> - CHARS ALLOT ;

: SYMTAB ( size -- )
    CREATE HERE >R 0 , ( #items ) DUP >S , ( itemsize )
    BEGIN BL <WORD> DUP WHILE S $, 1 R@ +! REPEAT
    2DROP -R -S
  DOES> CELL+ @+ 1+ ROT * + ( ix -- addr ) ;
-- ------------------------------------------------------------
Now, just beneath those declarations, add the following:
-- ------------------------------------------------------------
-- Definition of keywords and token types
8 CHARS =: /symbol

/symbol SYMTAB KWlist IF ELSE ENDIF END
-- ------------------------------------------------------------
Next, insert the following new function:
-- ------------------------------------------------------------
-- Table lookup
-- If the input string matches a table entry, return the entry
-- index. If not, return a -1.
: lookup ( c-addr u 'table -- n2 )
    0 0 LOCALS| /symbol n table sz addr |
    table 2 CELLS - @+ TO n  @ TO /symbol
    0 n 1- DO
      /symbol 1+ I * table + COUNT addr sz COMPARE
      0= IF I UNLOOP EXIT ENDIF
    -1 +LOOP
    -1 ;
-- ------------------------------------------------------------
To test it, load chap7f.frt and
enter: ( c-addr u -- ) 0 KWlist lookup
.
Example:
-- ------------------------------------------------------------
S" ENDIF" 0 KWlist lookup . ( 2 )
-- ------------------------------------------------------------
Now that we can recognize keywords, the next thing is to arrange to return codes for them.
So what kind of code should we return? A reasonable choice would be the address of the keyword string
: SymType ( u -- ) KWlist CONSTANT ;

0 SymType IfSym    1 SymType ElseSym
2 SymType EndifSym 3 SymType EndSym

-1 =: unknown
-2 =: Ident
-3 =: Number
-4 =: Operator
and arrange to return a variable of this type. Let's give it a try. Insert the lines above into your type definitions.
Now, add the two variable declarations to the cradle:
0 VALUE token                   -- current token code
CREATE token$ 16 CHARS ALLOT    -- string form of the token
Modify the scanner to read:
-- ------------------------------------------------------------
-- Lexical scanner
: scan ( -- c-addr u )
    BEGIN Look ^M = WHILE fin REPEAT
    Look alpha? IF
      getname 2DUP 0 KWlist lookup
      DUP 0>= IF KWlist ELSE DROP Ident ENDIF TO token
      EXIT
    ENDIF
    Look digit? IF getnum  Number   TO token EXIT ENDIF
    Look op?    IF getop   Operator TO token EXIT ENDIF
    Look token$ char-place Operator TO token
    token$ COUNT getchar skipwhite ;
-- ------------------------------------------------------------
Finally, modify the main program to read:
-- ------------------------------------------------------------
-- Main Program
: main ( -- )
    init
    BEGIN
      scan
      CASE token
        Ident    OF ." Ident "    ENDOF
        Number   OF ." Number "   ENDOF
        Operator OF ." Operator " ENDOF
        IfSym    OF ." Keyword "  ENDOF
        ElseSym  OF ." Keyword "  ENDOF
        EndifSym OF ." Keyword "  ENDOF
        EndSym   OF ." Keyword "  ENDOF
        unknown  OF ." unknown "  ENDOF
      ENDCASE
      TYPE CR
      token EndSym =
    UNTIL ;
-- ------------------------------------------------------------
What we've done here is to replace the string Token used earlier
with a computed constant. Scan returns the token code in the
value Token, and returns the string itself in the variable
token$.
OK, compile this and give it a whirl (chap7g.frt). If everything goes right, you should see that we are now recognizing keywords.
What we have now is working right, and it was easy to generate
from what we had earlier. However, it still seems a little
"busy" to me. It also seems a little cleaner to move the table
lookup into GetName
. The new form for the four words is, then:
-- ------------------------------------------------------------
-- Get an Identifier
: getname ( -- c-addr u )
    Look alpha? 0= IF S" Name" expected ENDIF
    token$ C0!
    BEGIN Look alnum? WHILE Look >UPC token$ char+! getchar REPEAT
    token$ COUNT skipwhite
    0 KWlist lookup DUP 0>= IF KWlist ELSE DROP Ident ENDIF TO token
    token$ COUNT ;
-- ------------------------------------------------------------
-- Get a number
: getnum ( -- c-addr u )
    Look digit? 0= IF S" Integer" expected ENDIF
    token$ C0!
    BEGIN Look digit? WHILE Look token$ char+! getchar REPEAT
    Number TO token
    token$ COUNT skipwhite ;
-- ------------------------------------------------------------
-- Get an operator
: op? ( char -- tf )
    DUP '+' =  OVER '-' = OR  OVER '*' = OR  OVER '/' = OR
    OVER '<' = OR  OVER '>' = OR  OVER ':' = OR  SWAP '=' = OR ;

: getop ( -- c-addr u )
    token$ C0!
    Look op? 0= IF S" Operator" expected ENDIF
    BEGIN Look op? WHILE Look token$ char+! getchar REPEAT
    Operator TO token
    token$ COUNT skipwhite ;
-- ------------------------------------------------------------
-- Lexical scanner
: scan ( -- c-addr u )
    BEGIN Look ^M = WHILE fin REPEAT
    Look alpha? IF getname EXIT ENDIF
    Look digit? IF getnum EXIT ENDIF
    Look op?    IF getop  EXIT ENDIF
    Look token$ char-place Operator TO token
    token$ COUNT getchar skipwhite ;
-- ------------------------------------------------------------
Pascal
used the mechanism of an enumerated type that I've just
described. It is certainly a workable mechanism, but it doesn't
seem the simplest approach to me.
For one thing, the list of possible symbol types can get pretty
long. Here, I've used just one symbol, "Operator
," to stand for
all of the operators, but I've seen other designs that actually
return different codes for each one.
There is, of course, another simple type that can be returned as
a code: the character. Instead of returning the enumeration
value Operator
for a +
sign, what's wrong with just returning
the character itself? A character is just as good a variable for
encoding the different token types, it can be used in case
statements easily, and it's sure a lot easier to type. What
could be simpler?
Besides, we've already had experience with the idea of encoding keywords as single characters. Our previous programs are already written that way, so using this approach will minimize the changes to what we've already done.
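The encoding scheme we're about to build in Forth (a keyword index into the string "xilee", with 'x' for plain identifiers) can be sketched in illustrative Python like this:

```python
KWLIST = ["IF", "ELSE", "ENDIF", "END"]
KWCODE = "xilee"    # position 0 = 'x' = "not a keyword, plain identifier"

def token_code(word):
    """Map a name to its single-character token code.
    A failed lookup yields -1, and -1 + 1 = 0 selects 'x'."""
    try:
        index = KWLIST.index(word.upper())
    except ValueError:
        index = -1
    return KWCODE[index + 1]

print(token_code("ENDIF"))   # 'e'
print(token_code("IFILE"))   # 'x'
```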
Some of you may feel that this idea of returning character codes
is too mickey-mouse. I must admit it gets a little awkward for
multi-character operators like <=
. If you choose to stay with
the enumerated type, fine. For the rest, I'd like to show you
how to change what we've done above to support that approach.
First, you can delete the SymType
declaration now ... we won't be
needing that.
Next, to replace SymType
, add the following constant string:
CREATE KWcode ," xilee"
(I'll be encoding all idents with the single character 'x'.)
Lastly, modify Scan
and its relatives as follows:
-- ------------------------------------------------------------
-- Get an identifier
: getname ( -- c-addr u )
    Look alpha? 0= IF S" Name" expected ENDIF
    token$ C0!
    BEGIN Look alnum? WHILE Look >UPC token$ char+! getchar REPEAT
    token$ COUNT skipwhite
    0 KWlist lookup KWcode + C@ TO token
    token$ COUNT ;
-- ------------------------------------------------------------
-- Get a number
: getnum ( -- c-addr u )
    Look digit? 0= IF S" Integer" expected ENDIF
    token$ C0!
    BEGIN Look digit? WHILE Look token$ char+! getchar REPEAT
    '#' TO token
    token$ COUNT skipwhite ;
-- ------------------------------------------------------------
-- Get an operator
: op? ( char -- tf )
    DUP '+' =  OVER '-' = OR  OVER '*' = OR  OVER '/' = OR
    OVER '<' = OR  OVER '>' = OR  OVER ':' = OR  SWAP '=' = OR ;

: getop ( -- c-addr u )
    token$ C0!
    Look op? 0= IF S" Operator" expected ENDIF
    BEGIN Look op? WHILE Look token$ char+! getchar REPEAT
    token$ C@ 1 = IF token$ CHAR+ C@ ELSE '?' ENDIF TO token
    token$ COUNT skipwhite ;
-- ------------------------------------------------------------
-- Lexical Scanner
: scan ( -- c-addr u )
    BEGIN Look ^M = WHILE fin REPEAT
    Look alpha? IF getname EXIT ENDIF
    Look digit? IF getnum EXIT ENDIF
    Look op?    IF getop  EXIT ENDIF
    Look token$ char-place '?' TO token
    token$ COUNT getchar skipwhite ;
-- ------------------------------------------------------------
-- Main Program
: main ( -- )
    init
    BEGIN
      scan
      CASE token
        'x' OF ." Ident "    ENDOF
        '#' OF ." Number "   ENDOF
        'i' OF ." Keyword "  ENDOF
        'l' OF ." Keyword "  ENDOF
        'e' OF ." Keyword "  ENDOF
        ." Operator "
      ENDCASE
      token$ .$
      token$ COUNT S" END" COMPARE 0=
    UNTIL ;
-- ------------------------------------------------------------
This program should work the same as the previous version. A minor difference in structure, maybe, but it seems more straightforward to me.
The problem with the conventional approach is that the scanner
has no knowledge of context. For example, it can't distinguish
between the assignment operator =
and the relational operator
=
(perhaps that's why both C
and Pascal
use different strings
for the two). All the scanner can do is to pass the operator
along to the parser, which can hopefully tell from the context
which operator is meant. Similarly, a keyword like IF
has no
place in the middle of a math expression, but if one happens to
appear there, the scanner will see no problem with it, and will
return it to the parser, properly encoded as an IF
.
With this kind of approach, we are not really using all the information at our disposal. In the middle of an expression, for example, the parser "knows" that there is no need to look for keywords, but it has no way of telling the scanner that. So the scanner continues to do so. This, of course, slows down the compilation.
In real-world compilers, the designers often arrange for more information to be passed between parser and scanner, just to avoid this kind of problem. But that can get awkward, and certainly destroys a lot of the modularity of the structure.
The alternative is to seek some way to use the contextual information that comes from knowing where we are in the parser. This leads us back to the notion of a distributed scanner, in which various portions of the scanner are called depending upon the context.
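The idea of a distributed scanner can be sketched in a few lines of Python. The dispatch logic moves into the parser: each context calls only the recognizer that is legal there, so nothing ever looks for tokens that cannot occur. The names below are hypothetical, chosen to echo the tutorial's GetName and GetNum.

```python
def get_name(text, pos):
    """Collect an identifier; the caller has already seen an alpha lookahead."""
    start = pos
    while pos < len(text) and text[pos].isalnum():
        pos += 1
    return text[start:pos].upper(), pos

def get_num(text, pos):
    """Collect a number; the caller has already seen a digit lookahead."""
    start = pos
    while pos < len(text) and text[pos].isdigit():
        pos += 1
    return text[start:pos], pos

def factor(text, pos):
    """Inside an expression the parser never asks about keywords:
    it inspects the lookahead character and picks the one legal recognizer."""
    if text[pos].isalpha():
        return get_name(text, pos)
    return get_num(text, pos)
```

There is no central `scan` here at all; the "scanner" is distributed across the parsing routines that need it.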
In KISS, as in most languages, keywords only appear at the beginning of a statement. In places like expressions, they are not allowed. Also, with one minor exception (the multi-character relops) that is easily handled, all operators are single characters, which means that we don't need GetOp at all.
So it turns out that even with multi-character tokens, we can still always tell from the current lookahead character exactly what kind of token is coming, except at the very beginning of a statement.
Even at that point, the only kind of token we can accept is an identifier. We need only to determine if that identifier is a keyword or the target of an assignment statement.
We end up, then, still needing only GetName and GetNum, which are used very much as we've used them in earlier installments.
It may seem at first to you that this is a step backwards, and a rather primitive approach. In fact, it is an improvement over the classical scanner, since we're using the scanning routines only where they're really needed. In places where keywords are not allowed, we don't slow things down by looking for them.
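The payoff can be made concrete with a small Python sketch (illustrative names, not the tutorial's code): only the statement loop consults the keyword table, once per statement, while everything inside an expression bypasses it entirely.

```python
# Hypothetical keyword table for the statement level only.
KEYWORDS = {"IF": "i", "ELSE": "l", "END": "e"}

def parse_block(words):
    """Statement loop: the keyword check happens only here.
    Takes a list of already-collected identifiers, stops at END."""
    out = []
    for word in words:
        kind = KEYWORDS.get(word, "x")   # 'x' = ordinary identifier
        if kind == "e":
            break                         # END terminates the block
        if kind == "i":
            out.append("IF-construct")    # would call doIF here
        else:
            out.append(f"assign to {word}")  # would parse an assignment
    return out
```

Any identifier scanned while parsing the right-hand side of those assignments would never touch KEYWORDS at all.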
To see how this works in practice, we'll use a subset of KISS with only one control construct (the IF) and no Boolean expressions. That's enough to demonstrate the parsing of both keywords and expressions. The extension to the full set of constructs should be pretty apparent from what we've already done.
All the elements of the program to parse this subset, using single-character tokens, exist already in our previous programs. I built it by judicious copying of these files, but I wouldn't dare try to lead you through that process. Instead, to avoid any confusion, the whole program is shown below (kiss00.frt):
-- ----------------------------------------------------------------------------
-- Variable declarations ------------------------------------------------------
0 VALUE Look    -- lookahead character
0 VALUE Lcount  -- label counter
0 VALUE token   -- encoded value of scanned token
CREATE token$ 0 C, #256 CHARS ALLOT  -- token in unencoded string form

-- Tools ----------------------------------------------------------------------
: getchar ( -- ) EKEY TO Look ;  -- read new character from input stream
: error    ( c-addr u -- ) CR ^G EMIT ." Error: " TYPE ." ." ;  -- report an error
: aborts   ( c-addr u -- ) error ABORT ;  -- report error and halt
: expected ( c-addr u -- ) S" Expected" $+ aborts ;  -- report what was expected
: alpha? ( char -- tf ) >UPC 'A' 'Z' 1+ WITHIN ;  -- recognize an alpha character
: digit? ( char -- tf ) '0' '9' 1+ WITHIN ;  -- recognize a decimal digit
: alnum? ( char -- tf ) DUP alpha? SWAP digit? OR ;  -- recognize alphanumeric
: white? ( char -- tf ) DUP Tab = SWAP BL = OR ;  -- recognize white space
: addop? ( char -- tf ) DUP '+' = SWAP '-' = OR ;  -- test for AddOp
: mulop? ( char -- tf ) DUP '*' = SWAP '/' = OR ;  -- test for MulOp
: emits  ( c-addr u -- ) Tab EMIT TYPE ;  -- output a string with tab
: emitln ( c-addr u -- ) CR emits ;  -- output a string with tab and crlf

-- skip cr+lf
: fin ( -- ) Look ^M = IF getchar ENDIF  Look ^J = IF getchar ENDIF ;

-- skip white space
: skipwhite ( -- ) BEGIN Look white? WHILE getchar REPEAT ;

-- match a specific input character
: match ( char -- )
  DUP Look = IF DROP getchar
  ELSE S" `" ROT CHAR-APPEND &' CHAR-APPEND expected ENDIF ;

-- get an identifier.
: getname ( -- char )
  BEGIN Look ^M = WHILE fin REPEAT
  Look alpha? 0= IF S" Name" expected ENDIF
  Look >UPC getchar skipwhite ; ( Returns a single character for now )

-- get a number
: getnum ( -- char )
  Look digit? 0= IF S" Integer" expected ENDIF
  Look getchar skipwhite ; ( Returns a single character for now )

-- generate a unique label
: newlabel ( -- c-addr u ) S" @" Lcount U>D <# #S #> $+  1 +TO Lcount ;

-- post a label to output
: postlabel ( c-addr u -- ) CR TYPE ':' EMIT ;

-- parse and translate an identifier.
-- Note that we need to store getname locally because parsing a call's
-- parameterlist may clobber it.
: ident ( -- )
  getname LOCAL name
  Look '(' = IF '(' match ')' match S" offset NEAR call,"
             ELSE S" dword-ptr -> eax mov," ENDIF
  name CHAR-PREPEND emitln ;

DEFER expression

-- parse and translate a math factor
: factor ( -- )
  Look '(' = IF '(' match expression ')' match EXIT ENDIF
  Look alpha? IF ident EXIT ENDIF
  S" d# -> eax mov," getnum CHAR-PREPEND emitln ;

-- parse and translate the first math factor
: signedfactor ( -- )
  Look '+' = IF getchar ENDIF
  Look '-' <> IF factor EXIT ENDIF
  getchar
  Look digit? IF S" #-" getnum CHAR-APPEND S" d# -> eax mov," $+
              ELSE factor S" eax neg," ENDIF
  emitln ;

-- recognize and translate multiply
: multiply ( -- ) '*' match factor
  S" [esp] dword mul, [esp 4 +] -> esp lea," emitln ;

-- recognize and translate divide
: divide ( -- ) '/' match factor
  S" ecx pop, ecx -> eax xchg, eax -> edx mov," emitln
  S" #31 b# -> edx sar, ebx idiv," emitln ;

-- parse and translate a math term
: term ( -- )
  signedfactor
  BEGIN Look mulop?
  WHILE S" eax push," emitln
        CASE Look '*' OF multiply ENDOF
                  '/' OF divide   ENDOF
             S" Mulop" expected
        ENDCASE
  REPEAT ;

-- recognize and translate add
: add ( -- ) '+' match term
  S" [esp] -> eax add, [esp 4 +] -> esp lea," emitln ;

-- recognize and translate subtract
: subtract ( -- ) '-' match term
  S" [esp] -> eax sub, [esp 4 +] -> esp lea, eax neg," emitln ;

-- parse and translate an expression
:NONAME ( -- )
  term
  BEGIN Look addop?
  WHILE S" eax push," emitln
        CASE Look '+' OF add      ENDOF
                  '-' OF subtract ENDOF
             S" Addop" expected
        ENDCASE
  REPEAT ; IS expression

-- Parse and translate a boolean condition -- This is a dummy.
: condition ( -- ) S" condition" emitln ;

DEFER block

-- Recognize and translate an IF construct
: doIF ( -- )
  0 0 LOCALS| L2 L1 |
  'i' match condition
  newlabel DUP 1+ ALLOCATE ?ALLOCATE DUP TO L1 PACK DROP
  L1 COUNT DUP 1+ ALLOCATE ?ALLOCATE DUP TO L2 PACK DROP
  L1 COUNT S" offset NEAR je," $+ emitln
  block
  Look 'l' = IF 'l' match
                L2 FREE ?ALLOCATE
                newlabel DUP 1+ ALLOCATE ?ALLOCATE DUP TO L2 PACK DROP
                L2 COUNT S" offset NEAR jmp," $+ emitln
                L1 COUNT postlabel
                block ENDIF
  'e' match
  L2 COUNT postlabel
  L1 FREE ?ALLOCATE  L2 FREE ?ALLOCATE ;

-- Parse and translate an assignment statement
: assignment ( -- )
  getname LOCAL name
  '=' match expression
  S" eax -> " 'OF name 1 $+ S" dword-ptr mov," $+ emitln ;

-- Recognize and translate a statement block
:NONAME ( -- )
  BEGIN Look 'e' <>  Look 'l' <>  AND
  WHILE CASE Look 'i' OF doIF ENDOF
             ^M  OF BEGIN Look ^M = WHILE fin REPEAT ENDOF
             assignment
        ENDCASE
  REPEAT ; IS block

-- Parse and translate a program
: doprogram ( -- )
  block
  Look 'e' <> IF S" End" expected ENDIF
  S" END" emitln ;

-- initialize
: init ( -- ) CLEAR Lcount CR getchar skipwhite ;

: KISS00 ( -- ) init doprogram ;
-- ----------------------------------------------------------------------------

A couple of comments:
The structure of the expression parser, using SignedFactor, etc., is a little different from what you've seen before. It's yet another variation on the same theme. Don't let it throw you ... the change is not required for what follows.
Note that calls to Fin have been added at strategic spots to allow for multiple lines.
Before we proceed to adding the scanner, first copy this file and verify that it does indeed parse things correctly. Don't forget the "codes": 'i' for IF, 'l' for ELSE, and 'e' for END or ENDIF.
If the program works, then let's press on. In adding the scanner modules to the program, it helps to have a systematic plan. In all the parsers we've written to date, we've stuck to a convention that the current lookahead character should always be a non-blank character. We preload the lookahead character in Init, and keep the "pump primed" after that. To keep the thing working right at newlines, we had to modify this a bit and treat the newline as a legal token.
In the multi-character version, the rule is similar: The current lookahead character should always be left at the beginning of the next token, or at a newline.
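That invariant can be sketched as a tiny Python class (hypothetical names, mirroring the tutorial's Init/GetName/SkipWhite): the constructor primes the pump, and every token routine ends by skipping trailing blanks, so the lookahead is always already at the start of the next token.

```python
class Scanner:
    def __init__(self, text):
        self.text = text
        self.pos = 0
        self.skip_white()                 # prime the pump, as in Init

    @property
    def look(self):
        """Current lookahead character, or '' at end of input."""
        return self.text[self.pos] if self.pos < len(self.text) else ""

    def skip_white(self):
        while self.look in (" ", "\t"):
            self.pos += 1

    def get_name(self):
        start = self.pos
        while self.look.isalnum():
            self.pos += 1
        name = self.text[start:self.pos].upper()
        self.skip_white()                 # leave lookahead at the next token
        return name
```

After every call to `get_name`, `look` points at the next token (or at a newline), never at a blank; that is the whole convention in one place.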
The multi-character version is shown next. To get it, I've made the following changes:
1. Added the variables Token and Token$, and the type definitions needed by Lookup.

2. Added the definitions of KWlist and KWcode.

3. Added Lookup.

4. Replaced GetName and GetNum by their multi-character versions. (Note that the call to Lookup has been moved out of GetName, so that it will not be executed for calls within an expression.)

5. Added a new procedure, Scan, that calls GetName, then scans for keywords.

6. Added a new procedure, MatchString, that looks for a specific keyword. Note that, unlike Match, MatchString does not read the next keyword.

7. Modified Block to call Scan.

8. Changed the calls to Fin a bit. Fin is now called within GetName.
Here is the program in its entirety (kiss01.frt):
-- -----------------------------------------------------------------------------
-- Variable declarations -------------------------------------------------------
-1 =: unknown
0 VALUE Lcount  -- label counter
0 VALUE Look    -- lookahead character
0 VALUE token   -- encoded token
CREATE token$ 0 C, #16 CHARS ALLOT  -- unencoded token

-- helper to build symbol tables
: $, ( c-addr u size -- )
  >S S MIN >R R@ C, R@ 0 ?DO C@+ C, LOOP DROP S> R> - CHARS ALLOT ;

-- Type declarations
: SYMTAB ( size -- )
  CREATE HERE >R 0 , ( #items ) DUP >S , ( itemsize )
  BEGIN BL <WORD> DUP WHILE S $, 1 R@ +! REPEAT 2DROP -R -S
  DOES> CELL+ @+ 1+ ROT * + ( ix -- addr ) ;

-- Definition of keywords and token types
8 CHARS =: /symbol
/symbol SYMTAB KWlist IF ELSE ENDIF WHILE ENDWHILE READ WRITE VAR END

: KW->token ( kw -- ) 2+ C" xilewhRWve" + C@ TO token ;

: lookup ( c-addr u 'table -- n2|unknown )
  0 0 LOCALS| /symbol n table sz addr |
  table 2 CELLS - @+ TO n @ TO /symbol
  0 n 1- DO
    /symbol 1+ I * table + COUNT addr sz COMPARE
    0= IF I UNLOOP EXIT ENDIF
  -1 +LOOP
  unknown ;

-- Tools -----------------------------------------------------------------------
: getchar ( -- ) EKEY TO Look ;  -- read new char
: error    ( c-addr u -- ) CR ^G EMIT ." Error: " TYPE ." ." ;  -- report an error
: aborts   ( c-addr u -- ) error ABORT ;  -- report and halt
: expected ( c-addr u -- ) S" Expected" $+ aborts ;  -- report expected
: alpha? ( char -- tf ) >UPC 'A' 'Z' 1+ WITHIN ;  -- is alpha char?
: digit? ( char -- tf ) '0' '9' 1+ WITHIN ;  -- is digit?
: alnum? ( char -- tf ) DUP alpha? SWAP digit? OR ;  -- is alphanum?
: white? ( char -- tf ) DUP Tab = SWAP BL = OR ;  -- is white space?
: orop?  ( char -- tf ) DUP '|' = SWAP '~' = OR ;  -- recognize ORs
: addop? ( char -- tf ) DUP '+' = SWAP '-' = OR ;  -- test for AddOps
: mulop? ( char -- tf ) DUP '*' = SWAP '/' = OR ;  -- test for MulOps
: emits  ( c-addr u -- ) Tab EMIT TYPE ;  -- output with tab
: emitln ( c-addr u -- ) CR emits ;  -- output with crlf
: char+! ( c addr -- ) DUP >S COUNT + C! 1 S> C+! ;

-- skip white space
: skipwhite ( -- ) BEGIN Look white? WHILE getchar REPEAT ;

-- skip cr+lf
: fin ( -- )
  Look ^M = IF getchar ENDIF  Look ^J = IF getchar ENDIF  skipwhite ;

-- match a specific input character
: match ( char -- )
  DUP Look = IF DROP getchar skipwhite
  ELSE S" `" ROT CHAR-APPEND &' CHAR-APPEND expected ENDIF ;

-- get an identifier.
: getname ( -- c-addr u )
  BEGIN Look ^M = WHILE fin REPEAT
  Look alpha? 0= IF S" Name" expected ENDIF
  token$ C0!
  BEGIN Look alnum? WHILE Look >UPC token$ char+! getchar REPEAT
  token$ COUNT skipwhite ;

-- get a number
: getnum ( -- c-addr u )
  Look digit? 0= IF S" Integer" expected ENDIF
  token$ C0!
  BEGIN Look digit? WHILE Look token$ char+! getchar REPEAT
  '#' TO token  token$ COUNT skipwhite ;

-- Get an identifier and scan it for keywords
: scan ( -- ) getname 0 KWlist lookup KW->token ;

-- Match a specific input string
: matchstring ( c-addr u -- )
  token$ COUNT 2OVER COMPARE
  IF &` CHAR-PREPEND &' CHAR-APPEND expected ELSE 2DROP ENDIF ;

-- generate a unique label
: newlabel ( -- c-addr u ) S" @" Lcount U>D <# #S #> $+  1 +TO Lcount ;

-- post a label to output
: postlabel ( c-addr u -- ) CR TYPE ':' EMIT ;

-- parse and translate an identifier.
-- Note that we need to store getname locally because parsing a call's
-- parameterlist may clobber it.
: ident ( -- )
  getname DUP 1+ ALLOCATE ?ALLOCATE DUP LOCAL name PACK DROP
  Look '(' = IF '(' match ')' match name COUNT S" offset NEAR call,"
             ELSE name COUNT S" dword-ptr -> eax mov," ENDIF
  $+ emitln
  name FREE ?ALLOCATE ;

DEFER expression

-- parse and translate a math factor
: factor ( -- )
  Look '(' = IF '(' match expression ')' match EXIT ENDIF
  Look alpha? IF ident EXIT ENDIF
  getnum S" d# -> eax mov," $+ emitln ;

-- parse and translate the first math factor
: signedfactor ( -- )
  Look '+' = IF getchar ENDIF
  Look '-' <> IF factor EXIT ENDIF
  getchar skipwhite
  Look digit? IF S" #-" getnum $+ S" d# -> eax mov," $+
              ELSE factor S" eax neg," ENDIF
  emitln ;

-- recognize and translate multiply
: multiply ( -- ) '*' match factor
  S" [esp] dword mul, [esp 4 +] -> esp lea," emitln ;

-- recognize and translate divide
: divide ( -- ) '/' match factor
  S" ecx pop, ecx -> eax xchg, eax -> edx mov," emitln
  S" #31 b# -> edx sar, ebx idiv," emitln ;

-- parse and translate a math term
: term ( -- )
  signedfactor
  BEGIN Look mulop?
  WHILE S" eax push," emitln
        CASE Look '*' OF multiply ENDOF
                  '/' OF divide   ENDOF
             S" Mulop" expected
        ENDCASE
  REPEAT ;

-- recognize and translate add
: add ( -- ) '+' match term
  S" [esp] -> eax add, [esp 4 +] -> esp lea," emitln ;

-- recognize and translate subtract
: subtract ( -- ) '-' match term
  S" [esp] -> eax sub, [esp 4 +] -> esp lea, eax neg," emitln ;

-- parse and translate an expression
:NONAME ( -- )
  term
  BEGIN Look addop?
  WHILE S" eax push," emitln
        CASE Look '+' OF add      ENDOF
                  '-' OF subtract ENDOF
             S" Addop" expected
        ENDCASE
  REPEAT ; IS expression

-- Parse and translate a boolean condition -- This is a dummy.
: condition ( -- ) S" condition" emitln ;

DEFER block

-- Recognize and translate an IF construct
: doIF ( -- )
  0 0 LOCALS| L2 L1 |
  condition
  newlabel DUP 1+ ALLOCATE ?ALLOCATE DUP TO L1 PACK DROP
  L1 COUNT DUP 1+ ALLOCATE ?ALLOCATE DUP TO L2 PACK DROP
  L1 COUNT S" offset NEAR je," $+ emitln
  block
  token 'l' = IF L2 FREE ?ALLOCATE
                 newlabel DUP 1+ ALLOCATE ?ALLOCATE DUP TO L2 PACK DROP
                 L2 COUNT S" offset NEAR jmp," $+ emitln
                 L1 COUNT postlabel
                 block ENDIF
  L2 COUNT postlabel
  S" ENDIF" matchstring
  L1 FREE ?ALLOCATE  L2 FREE ?ALLOCATE ;

-- Parse and translate an assignment statement
: assignment ( -- )
  token$ COUNT DUP 1+ ALLOCATE ?ALLOCATE DUP LOCAL name PACK DROP
  '=' match expression
  S" eax -> " name COUNT $+ S" dword-ptr mov," $+ emitln
  name FREE ?ALLOCATE ;

-- Recognize and translate a statement block
:NONAME ( -- )
  scan
  BEGIN token 'e' <>  token 'l' <>  AND
  WHILE CASE token 'i' OF doIF ENDOF
             assignment
        ENDCASE
        scan
  REPEAT ; IS block

-- Parse and translate a program
: doprogram ( -- )
  block
  token 'e' <> IF S" End" expected ENDIF
  S" END" emitln ;

-- initialize
: init ( -- ) CLEAR Lcount CR getchar skipwhite ;

: KISS01 ( -- ) init doprogram ;
-- -----------------------------------------------------------------------------
Compare this program with its single-character counterpart. I think you will agree that the differences are minor.
We are very close to having all the elements that we need to build a real, functional compiler. There are still a few things missing, notably procedure calls and type definitions. We will deal with those in the next few sessions. Before doing so, however, I thought it would be fun to turn the translator above into a true compiler. That's what we'll be doing in the next installment.
Up till now, we've taken a rather bottom-up approach to parsing, beginning with low-level constructs and working our way up. In the next installment, I'll also be taking a look from the top down, and we'll discuss how the structure of the translator is altered by changes in the language definition.
See you then.
***************************************************************** * * * COPYRIGHT NOTICE * * * * Copyright (C) 1988 Jack W. Crenshaw. All rights reserved. * * * *****************************************************************