This series of articles is a tutorial on the theory and practice of developing language parsers and compilers. Before we are finished, we will have covered every aspect of compiler construction, designed a new programming language, and built a working compiler.
Though I am not a computer scientist by education (my Ph.D. is in a different field, Physics), I have been interested in compilers for many years. I have bought and tried to digest the contents of virtually every book on the subject ever written. I don't mind telling you that it was slow going. Compiler texts are written for Computer Science majors, and are tough sledding for the rest of us. But over the years a bit of it began to seep in. What really caused it to jell was when I began to branch off on my own and begin to try things on my own computer. Now I plan to share with you what I have learned. At the end of this series you will by no means be a computer scientist, nor will you know all the esoterics of compiler theory. I intend to completely ignore the more theoretical aspects of the subject. What you will know is all the practical aspects that one needs to know to build a working system.
This is a "learn-by-doing" series. In the course of the series I
will be performing experiments on a computer. You will be
expected to follow along, repeating the experiments that I do,
and performing some on your own. I will be using iForth 2.0
on a PC clone.
I will periodically insert examples written in Forth
.
These will be executable code, which you will be expected
to copy into your own computer and run. If you don't have a copy
of iForth
, you will be severely limited in how well you will be
able to follow what's going on. If you don't have a copy, I urge
you to get one. After all, it's an excellent product, good for
many other uses!
Some articles on compilers show you examples, or show you (as in
the case of Small-C
) a finished product, which you can then copy
and use without a whole lot of understanding of how it works. I
hope to do much more than that. I hope to teach you how the
things get done, so that you can go off on your own and not only
reproduce what I have done, but improve on it.
This is admittedly an ambitious undertaking, and it won't be done in one page. I expect to do it in the course of a number of articles. Each article will cover a single aspect of compiler theory, and will pretty much stand alone. If all you're interested in at a given time is one aspect, then you need to look only at that one article. Each article will be uploaded as it is complete, so you will have to wait for the last one before you can consider yourself finished. Please be patient.
The average text on compiler theory covers a lot of ground that we won't be covering here. The typical sequence is:
I'll be taking a much different approach in this series. To begin with, I won't dwell long on options. I'll be giving you a way that works. If you want to explore options, well and good ... I encourage you to do so ... but I'll be sticking to what I know. I also will skip over most of the theory that puts people to sleep. Don't get me wrong: I don't belittle the theory, and it's vitally important when it comes to dealing with the more tricky parts of a given language. But I believe in putting first things first. Here we'll be dealing with the 95% of compiler techniques that don't need a lot of theory to handle.
I also will discuss only one approach to parsing: top-down,
recursive descent parsing, which is the only technique that's
at all amenable to hand-crafting a compiler. The other
approaches are only useful if you have a tool like YACC
, and also
don't care how much memory space the final product uses.
I also take a page from the work of Ron Cain, the author of the
original Small C
. Whereas almost all other compiler authors have
historically used an intermediate language like P-code and
divided the compiler into two parts (a front end that produces
P-code, and a back end that processes P-code to produce
executable object code), Ron showed us that it is a
straightforward matter to make a compiler directly produce
executable object code, in the form of assembly language
statements. The code will not be the world's tightest code ...
producing optimized code is a much more difficult job. But it
will work, and work reasonably well. Just so that I don't leave
you with the impression that our end product will be worthless, I
do intend to show you how to "soup up" the compiler with some
optimization.
Finally, I'll be using some tricks that I've found to be most
helpful in letting me understand what's going on without wading
through a lot of boiler plate. Chief among these is the use of
single-character tokens, with no embedded spaces, for the early
design work. I figure that if I can get a parser to recognize
and deal with I-T-L, I can get it to do the same with IF-THEN-ELSE
.
And I can. In the second "lesson," I'll show you just
how easy it is to extend a simple parser to handle tokens of
arbitrary length. As another trick, I completely ignore file
I/O, figuring that if I can read source from the keyboard and
output object to the screen, I can also do it from/to disk files.
Experience has proven that once a translator is working
correctly, it's a straightforward matter to redirect the I/O to
files. The last trick is that I make no attempt to do error
correction/recovery. The programs we'll be building will
recognize errors, and will not crash, but they will simply stop
on the first error ... just like good ol' iForth
does. There will
be other tricks that you'll see as you go. Most of them can't be
found in any compiler textbook, but they work.
A word about style and efficiency. As you will see, I tend to
write programs in very small, easily understood pieces. None
of the procedures we'll be working with will be more than about
15-20 lines long. I'm a fervent devotee of the KISS
(Keep It
Simple, Sidney) school of software development. I try to never
do something tricky or complex, when something simple will do.
Inefficient? Perhaps, but you'll like the results. As Brian
Kernighan has said, first make it run, then make it run fast.
If, later on, you want to go back and tighten up the code in one
of our products, you'll be able to do so, since the code will be
quite understandable. If you do so, however, I urge you to wait
until the program is doing everything you want it to.
I also have a tendency to delay building a module until I discover that I need it. Trying to anticipate every possible future contingency can drive you crazy, and you'll generally guess wrong anyway. In this modern day of screen editors and fast compilers, I don't hesitate to change a module when I feel I need a more powerful one. Until then, I'll write only what I need.
One final caveat: One of the principles we'll be sticking to here is that we don't fool around with P-code or imaginary CPUs, but that we will start out on day one producing working, executable object code, at least in the form of assembly language source. However, you may not like my choice of assembler language ... it's 80x86 code, which is what works on my system (under Windows XP). I think you'll find, though, that the translation to any other CPU such as the 68000 will be quite obvious, though, so I don't see a problem here. In fact, I hope someone out there who knows the 68K language better than I do will offer us the equivalent object code fragments as we need them.
There are many different ways to organize the scanning activities
of a parser. In Unix systems, authors tend to use getc
and
ungetc
. I've had very good luck with the approach shown here,
which is to use a single, global, lookahead character. Part of
the initialization procedure (the only part, so far!) serves to
"prime the pump" by reading the first character from the input
stream. No other special techniques are required with iForth 2.0
... each successive call to GetChar
will read the next character
in the stream.
-- Variable declarations ----------------------------------------- 0 VALUE Look -- lookahead character -- Tools --------------------------------------------------------- -- read new character from input stream : getchar ( -- ) EKEY TO Look ; -- report an error : error ( c-addr u -- ) CR ^G EMIT ." Error: " TYPE ." ." ; -- report an error and stop : aborts ( c-addr u -- ) error ABORT ; -- report what was expected : expected ( c-addr u -- ) S" Expected" $+ aborts ; -- recognize an alpha character : alpha? ( char -- tf ) >UPC 'A' 'Z' 1+ WITHIN ; -- recognize a decimal digit : digit? ( char -- tf ) '0' '9' 1+ WITHIN ; -- output a string with tab : emits ( c-addr u -- ) Tab EMIT TYPE ; -- output a string with tab and crlf : emitln ( c-addr u -- ) CR emits ; -- initialize : init ( -- ) getchar ; -- Specifics ----------------------------------------------------- -- match a specific input character ------------------------------ : match ( char -- ) DUP Look = IF DROP getchar ELSE S" `" ROT CHAR-APPEND &' CHAR-APPEND expected ENDIF ; -- get an identifier --------------------------------------------- : getname ( -- char ) Look alpha? 0= IF S" Name" expected ENDIF Look >UPC getchar ; -- get a number -------------------------------------------------- : getnum ( -- char ) Look digit? 0= IF S" Integer" expected ENDIF Look '0' - getchar ; -- Main Program -------------------------------------------------- : cradle ( -- ) init ;
That's it for this introduction. Copy the code above into iForth
and
compile it. Make sure that it compiles and runs correctly (type CRADLE
<cr>,
the computer waits for a single character, then exits). Then
proceed to the first lesson, which is on expression parsing.
***************************************************************** * * * COPYRIGHT NOTICE * * * * Copyright (C) 1988 Jack W. Crenshaw. All rights reserved. * * * * * *****************************************************************