Last update: Sat Aug 10, 1996; converted to HTML.
author: Marcel Hendrix, email: mhx@IAEhv.nl

Abstract

This paper discusses a parallel Forth implementation for the INMOS Transputer; tForth version 1.0 by the Dutch Forth Workshop.

Besides being a parallel Forth, another important feature of tForth is that it tracks the documents prepared by X3J14, an American National Standards Technical Committee (ANS TC) with the mission of standardizing the Forth language. ANS Forth materialized in 1994. Several important consequences for tForth's design are discussed.

The (occam-like) parallel constructs of tForth, its optimizing compiler and server/client concept are described in detail.

The tForth kernel, sans the parallel extensions, proves easy to port to other 32-bit CPU's. This process uses the same base set of source files plus a small hardware dependent kernel file. As a first experiment, a 32-bit protected mode Forth for the Intel 386+387 and 486 chips is produced: iForth. Helped by the ANS Forth word ENVIRONMENT?, both these Forths run our standard set of examples, benchmarks and utilities -- currently a collection of over two megabytes of text files. Some details of iForth's implementation are disclosed.

Introduction

tForth is developed by five people over a period of two years; originally we called ourselves the Dutch Transputer Forth Workshop. The project was started and partly sponsored by the Dutch FIG branch, at that time headed by Hans Nieuwenhuijsen. After two years of implementing, testing and writing documentation, the project is now finally nearing completion. The original DTFW crew will stay together to support the transputer product, but a subgroup is planning to extend the project to other very high performance hardware (32-bit RISC and DSP) under the banner of the The Dutch Forth Workshop.

The tForth system

The Transputer hardware architecture

The INMOS transputer was first introduced in 1985 [1,2]. The first processor in the series was the T414, a 32-bit 15 MIPS processor (at 30 MHz) with the ability to perform multitasking in hardware. New members have been added to the family regularly, but we'll limit ourselves to the T800, a T414 with an ANSI-IEEE 754-1985 standard 64-bit floating-point coprocessor integrated onto the chip (2.25 MFLOPs at 30 MHz), 4 Kbyte onboard RAM and 4 high-speed communication links. The chip has a smart onboard memory interface that minimizes external circuitry but still gives 40 Mbytes/s throughput with 100 ns DRAM.

The best-known feature of a transputer is probably its link interface. The links allow networking of transputers by direct point-to-point connections without any external logic. The four T800 links support bit rates of 5, 10 and 20 Mbit/s, bidirectionally, and with full automatic DMA control. For the programmer there is no handshaking or protocol to take care of. Sending messages over a link can be done with a single instruction (plus a few to set up buffer pointers and counts).

Peeking inside the chip reveals that the transputer is a stack machine. It has two three-level stacks for integer and floating-point expression evaluation. Apart from these stacks, the transputer has a workspace pointer. This is a register that points to the base of a workspace where local variables are kept. When calling a subroutine, room is allocated at the bottom (lowest address) of the workspace and the integer stack plus the old workspace pointer are copied there.

Transputer multitasking is implemented with on-chip timer and microcoded scheduling hardware. The basic item to schedule is called a process. A process needs a (potentially very small) workspace to save its state and queueing information, and for its local variables.

At any time, a process is active: being executed or on a list waiting to be executed, or it is inactive: ready to input, ready to output, or waiting until a specified time. Inactive processes do not consume any time. Input and output is not necessarily done on a hardware link, there is something called a memory channel that behaves in exactly the same way as the hardware and is used for on-chip interprocess communication.

Active processes are held in two linked lists of process workspaces, one of high priority processes and one of low priority processes. A high priority process runs until completion, or until it becomes inactive (although one may assume that to be a programming error!). A low priority process is only permitted to run for two timeslices (about 2 ms), after that it is forcibly descheduled at the next available descheduling point.

Mapping tForth on the hardware; tForth internals

We have decided to use subroutine threading with in-line code expansion for all those Forth primitives that run significantly slower if they have to execute an additional call/return pair (like 1+ or DUP) [11,12]. Because we also use the transputer on-chip stacks to hold the top three items of the corresponding Forth stacks (data and floating-point), a simple peephole optimizer is able to eliminate the need to push or pop the software stacks if any two primitives are compiled directly after each other (without an intervening high-level call) [13]. Combined with the fact that a transputer call instruction using a 32-bit offset is eight (8) bytes long, we find that macro expansion not only gives an enormous speedup but also leads to somewhat smaller code size.

Actually, our present peephole optimizer is not simple anymore. The first approach used a state-machine that combined the current stack depth with the stack requirements of the macro to be compiled, resulting in an optimized sequence of external d(f)push, d(f)pop instructions or equivalent internal stack shuffling. We found however that almost every primitive benefited from some kind of special optimization that was very hard to generalize (e.g addition is commutative, subtraction is not, ROT is faster using a local etc.). We ended up putting a full-scale stack optimizer in every primitive. The speedup is about 50% over the state machine approach. However, the transputer is extremely hostile to the stack concept (a dpush or dpop costs about 8 bytes!) and we suspect that for other CPU's these extreme measures will not be necessary. For iForth we do not keep the top of the stack in registers and (therefore) perform much less optimization, nevertheless iForth (on a 33 MHz '386) is about 3 times faster than tForth (on a 25 MHz T800).

A subroutine-threaded Forth without macro expansion is really not much faster than for instance a direct threaded system. But if words like DUP + 2+ are implemented as macro's, what do you do when the user accesses them in interpret mode: 13 DUP . .? And what happens when a macro is ticked: ' DUP routine ! routine @ EXECUTE? Although the first problem could be tackled by compiling input into a buffer and execute it there (watch out for nested interpreters!), the second problem cannot. Thus, we have decided to implement a shadow definition for every macro. The shadow is directly executable and can be ticked and thus indirectly compiled, although that will result in suboptimal code. Macro's are identified by a special bit in their header, which is used by ' and FIND to pick the right definition, depending on STATE. Words like COMPILE [COMPILE] POSTPONE and COMPILE, need to be really smart when a user specifies a macro word as their argument. The easy way out is to always compile the slow non-macro definition, but tForth does extra processing, again relying on STATE, to optimize these cases.

Considering the fact that a transputer context switch is in effect a change of the workspace, we decided to put all of the Forth administration in the workspace. For tForth this means all of its five stacks (data, return, system, floating-point and locals) plus the user table. The stack pointers are implemented as user variables and are thus also located in the workspace. This means that a switch between Forth processes has no additional overhead as it involves a simple swap of workspace pointers, which is accomplished by the scheduling hardware automatically.

A transputer call instruction changes the workspace pointer. When all Forth stacks and the user area are in the workspace, it is clearly unacceptable (and unnecessary) that the workspace moves around when Forth words call each other. We solved this by starting every single piece of code with an instruction that ``undoes'' the workspace adjustment caused by a call. This creates a small problem for the occasional Forth word that in a subroutine-threaded model effectively ``jumps'' into a piece of code (for instance EXECUTE). All affected words must simulate a workspace adjustment before they jump.

When discussing transputer hardware scheduling we mentioned that a low priority process is only permitted to run for two timeslices (about 2 ms) before it is forcibly descheduled at the next available descheduling point. This descheduling point is an almost Forth-like concept resembling PAUSE. A process can prevent descheduling if it does not use two special instructions, lend (loop end) and j (an unconditional branch). Of course it must also avoid becoming inactive. We kept this optionally non-cooperating feature in tForth by not using the special loop instruction (it is incompatible with +LOOP anyway) and by clearly documenting which tForth control flow words (AGAIN REPEAT, but not ELSE) use the special unconditional branch instruction. Also, we provide alternative words that perform the same action but cannot be descheduled. This gives the programmer full dynamic control over the priorities of Forth processes, using PAUSE-like words of his own design.

Hardware requirements

To be able to use tForth, the hardware must provide for a T4xx or T8xx processor with at least 256 Kbyte of RAM. Keyboard, screen and file I/O is handled by a PC-compatible machine running MS-DOS, MS-Windows, or some flavor of UNIX. This machine needs a small plug-in board with a standard INMOS C012 link adapter chip. Alternatively a UNIX workstation can be used, provided it is able to handle a C012 chip. The interface problem is trivial if the transputer itself is seated on a PC plug-in board, as these cards almost always provide a C012 compatible hardware interface [4].

The consequences of tracking ANS Forth

tForth conforms as closely to the future ANS Forth as it is possible to guarantee. All of the words in the ANS document are available [6,7,8].

A transputer Forth conforming to the Forth 83 standard is a bad idea, as that explicitly requires the use of a 16-bit model [9]. The T4 and T8-type transputers can not access this data type directly; they only handle bytes and words efficiently (it is no problem for the 16-bit T2). The ANS Forth does not specify a virtual machine, and only requires that the size of a cell is an integral multiple of the size of a character, which should have a minimum of 8 bits (data-stack elements, return-stack elements, addresses and single-precision numbers are all one cell wide). A natural choice for the transputer is thus a byte-sized character and a 32-bit cell.

Of course, there is no such thing as a free lunch. The extra flexibility and speed is paid for with alignment problems. These have now become the responsibility of the programmer (if he cares about transportability, that is). To a transputer a cell is not equivalent to four bytes in a row, as a cell address should be a multiple of four. ANS Forth specifies the words CELL+ CELLS CHAR+ CHARS ALIGN and ALIGNED for use with , C, ALLOT etcetera. Furthermore CREATE must guarantee that aligned addresses are passed into DOES> code. Stack comments specifying addresses now mention if these addresses should be aligned.

To a user of tForth alignment problems are hardly visible. The only problem we frequently encountered is when using CREATE to build records of mixed datatypes and forgetting to use strategically placed ALIGN words (This happened often when transporting quick hacks from iForth to tForth. iForth uses 32-bit cells too, but the hardware is able to fetch a word from an unaligned address so a programming error goes unnoticed). Therefore special attention must be paid to the mixed use of C, and , and to the compilation of strings. Floating-point numbers must also be aligned (single and double-precision IEEE numbers plus the processor-specific internal format must be supported).

The problems become really interesting when the hardware does not support bytes and thus CHAR+ is not equivalent to 1+ anymore. We expect to find a lot of invalid assumptions in our metacompiler code once we get to build a TMS320C30 target (This DSP chip treats everything as a 32-bit item and so CHAR+ and CELL+ must be equivalent).

LOOPing

ANS Forth still specifies the +LOOP algorithm introduced with Forth 83. This means the loop index must be incremented or decremented across the boundary between the loop limit minus one and the loop limit in order for looping to stop. This specification allows a loop index to address all of memory (64 Kbyte for a 16-bit Forth) but it rules out generating efficient machine code for a transputer. Essentially the Forth-83 LOOP wants to treat the loop index as an unsigned number. For RISC-like reasons, the transputer treats addresses as signed numbers. This saves a flag register that is difficult to handle on a stack computer and presents extra context to save when a task switch occurs. The transputer can test for overflow after an addition with the testerr instruction. The problem with (+)LOOP deliberately generating an overflow to signal loop end, is that operating system software may consider this to be a serious system error and is able to halt the complete transputer if it occurs. All it takes is to set the transputer's halt-on-error flag. tForth assumes this flag is never set, and our bootloader explicitly resets this flag at startup.

division

We like the way ANS Forth ``solves'' the question of floored versus symmetric division. The implementor is free to choose the method used in the kernel, as long as the documents clearly specify which method is used. Standard ANS Forth programs must include code to generate the desired form of division and may not rely on the implementation-defined result. tForth uses symmetric division because the hardware directly supports it. Besides, we have not seen much code, apart from contrived examples, that really needs the floored variant. ANS Forth requires that SM/REM and FM/MOD are available in CORE , so a simple redefinition at the top of a source file is all it takes to switch between the two variants:

 : /    S" FM/MOD NIP "  EVALUATE ; IMMEDIATE
 : MOD  S" FM/MOD DROP " EVALUATE ; IMMEDIATE

The code shown will not obstruct the workings of the optimizer.

Restricted access to code and headerfields

The ANS documents describe the Dictionary as

an extensible structure that contains definitions and associated data space. The form of this structure is not specified by the standard, but it may be described as consisting of three logical parts: ordered word list(s) that may be searched for word names, a code space where the actions of the definitions are stored, and a data space. Of these, only data space may be directly addressed by a standard program.

By this definition, words like PFA, NFA, CFA etcetera are no longer required, and an implementation can use any algorithm it sees fit to organize the word headers and the information stored therein. Furthermore, it means special threading techniques and in-line code generation are now legal. For a user, it means something like : FOO [ ' BAR , ] ; is no longer considered portable code (if it ever was). Likewise, 3 CONSTANT FUBAR 5 FUBAR 2+ ! must be frowned upon. Regrettably, forbidding access to word headers and vocabulary internals makes illegal some elegant OOP techniques [10].

tForth uses the relaxed rules of ANS Forth to implement a subroutine-threaded Forth where some primitives are expanded in-line and optimized at macro edges for optimal stack access. The headers are separated from the code (this makes implementing locals much easier) and use a binned hashing technique with 256 threads for every vocabulary (or wordlist). The header table contains extra information to aid the optimizer (more flags) and contains an additional pointer to optional forget code that is executed when FORGET walks the complete set of wordlists and removes user definitions one by one.

 Header: flags(word) forget(ptr) hash(ptr) code(ptr) link(ptr) name(token)

Flags holds the bits that signal IMMEDIATE, COMPILE-ONLY SMUDGED MACRO, (not in ANS Forth) and PRIVATE.
The hash field is used to look up a word in a given wordlist.
The code field holds a pointer to the executable code.

The link links all words in a wordlist. This is essential for a safe and bullet-proof FORGET. For instance, it allows the following:

 VOCABULARY one   ALSO one DEFINITIONS  : hello  ." hello!"  ; 1000 ALLOT
 PREVIOUS
 VOCABULARY two   ALSO two DEFINITIONS  : hello2 ." hello2!" ; 1000 ALLOT
 PREVIOUS  ALSO one DEFINITIONS  : hello3  ." hello3!"  ;

 FORGET hello  ( or even FORGET one . Try it on your system... )

Name is a standard Forth counted string.

POSTPONE COMPILE [COMPILE] COMPILE,

We already saw that access to code and headerfields has become restricted. This affects COMPILE and [COMPILE]. On systems that generate machine code, but do not have tForth's luxurious shadow words, things like COMPILE DUP do not work, or have to be written [COMPILE] DUP. POSTPONE has been invented to circumvent the IMMEDIATE problem, although there are still some pathological cases where it cannot do what you want. The famous example is COMPILE [COMPILE] IF.

COMPILE, provides a portable alternative for the common coding practice: [ ' DUP ] LITERAL ,. It should now be written ['] DUP COMPILE, but again, it will only work if the programmer knows the ticked word is not IMMEDIATE.

ENVIRONMENT?

This CORE word enables one to check for strings in the present environment. It takes a string argument from the stack and returns false or [value] true. We use this word a lot in our Forths, but almost always to look for strings we defined ourselves, never for the ones ANS Forth thinks are needed. We use it to check for server version number, floating-point size (we have 3 now, maybe 4 later), and if we are running tForth or iForth. Our systems implement it by using a special hidden wordlist, but ANS Forth allows this to be done by other means (presumably checking a disk-based database).

memory allocation

The tForth parallel extensions need dynamic memory allocation to implement the PAR structure. Luckily ANS Forth provides it in a usable form (direct addressing) although a handle-based scheme like that proposed in [10] is more like the state of the art, and could have been implemented painlessly on 8-bit and 16-bit systems.

Floating-point

Again, the presence of the Floating-point wordset in ANS Forth is exactly what we needed for both iForth and tForth. Both Forths use the hardware coprocessor and the hardware floating-point stacks. The standard sorely lacks flexible tools for floating-point number formatting, like <# does for integers (There are words to print in common formats, and lately REPRESENT was added). Also, there is no way to use floating-point locals, which is a serious omission as there are at least 3 different floating-point formats.

Multitasking and BLOCKs.

Multi-tasking is not part of ANS Forth. Nevertheless, in the Informative Annex generic multitasking is described using the PAUSE concept (implicit in a set of I/O words). Fortunately, we can easily implement a PAUSE-like concept on a micro-scale in tForth (that is within user-defined words), however once in the system an I/O word can be descheduled because its time slice runs out, not only because it has finished accessing the disk block or buffer. What we do is acquire a special semaphore in the block buffer words. This semaphore is released in every I/O word documented in the ANS Document (they give a list). When a process has the semaphore, it can freely access the buffer. This solution does not extend easily to more than one block buffer, so we had rather seen ANS Forth words like LOCK and UNLOCK .

Sequential files do not have this problem because most OS's give out unique fid's when a file is multiple accessed (it therefore is environmentally dependent what happens).

Parallel Forth

Transputer links and channels

The parallel processes enabled by the transputer hardware communicate by the use of channels [2,3,5]. The communication regulates data transfer as well as decides which process is to run. A communication proceeds along a channel that connects two processes, the output and the input process. The output process stops until it is able to hand its data to the input process. The input process stops until the process on the other side of the channel is ready to deliver its data. In principle, there is no limit to the number of channels connected to a process, but there is always just one input and one output process that relates to a channel.

A channel in above description can be a hardware bi-directional link, of which there are a limited number (between 2 and 4), or a memory channel, of which there can be an almost unlimited number (they only need one word of memory). Of course, processes running on different transputers can only communicate using hardware links. Therefore the ideal, writing code to run on n processes without needing knowledge of the physical location of these n processes, is unattainable in practice. The new generation of transputers features an unlimited number of virtual channels, that is, the link messages are multiplexed and routed automatically [14]. This eliminates the differences between a memory channel and a hardware link, and thus the difference between local and remote processes.

The CHANNEL and LINK definitions

A memory channel is created by the defining word CHANNEL, followed by a name. The channel created is initialized automatically, i.e. it is ready to be used. It is the user's responsibility to handle it properly, this means observing the rule of attaching one input and output process through this channel.

The channels LINK0 LINK1 LINK2 LINK3 and EVENT are pre-defined. The first four refer to the hardware channels that connect the transputers to each other, opposed to the memory channels defined using CHANNEL. These hardware channels behave identically to those defined using CHANNEL down to the machine code level, as far as a single input or output action is concerned. The EVENT channel is special in that it is read only. Nothing but external circuitry is able to restart the process thus descheduled. The data read is garbage and should be thrown away. Note that, although a link is bi-directional and thus is equivalent to two separate memory channels, it is possible to refer to both with the same name. tForth's high-level channel i/o words sort out the hardware details for themselves. As an example, the following syntax is correct:

 LINK1 CHANNEL-C@  3 +  LINK1 CHANNEL-C!

This reads a byte of link1, adds three to it and sends it back over the same link.

Channel Input and Output operations

A channel can be seen as a transputer address representing an hardware link or a memory channel.

The word CHANNEL-@ fetches a word from a channel, CHANNEL-! stores a word to a channel. They are similar to @ !. As already used in bove, CHANNEL-C@ CHANNEL-C! fetch and store bytes, similar to C@ C!. Finally we have CHANNEL-SEND CHANNEL-RECEIVE. Both accept a string address plus count and a channel. CHANNEL-SEND sends count bytes located at string address over the channel. CHANNEL-RECEIVE receives count bytes from the channel and stores them at string address. Note that the number of bytes have to agree. Normally they will be communicated by a separate CHANNEL-@ CHANNEL-! exchange.

In order to make an hardware link equivalent to a memory channel, it is sometimes useful to hide the fact that the name of a link corresponds to two memory addresses, one for input, the other for output. Note that use of the CHANNEL-xx words already guarantees this automatically, but in some situations it is better to have explicit conversion words. >INPUT-CHANNEL and >OUTPUT-CHANNEL convert a channel addres to the address needed for the transputer hardware in, and out, instructions.

Language constructs using channels

The SELECT statement in tForth allows a process to make a choice over its future behavior dependent on the readiness of other concurrent processes to communicate with it over a channel. Several input sources are possible for this construct. We will concentrate here on those components directly using channels. An informal description follows:

 SELECT
	[channel] GUARD [code] ENDGUARD
	...
	[channel] [boolean] ?GUARD [code] ENDGUARD
	...
	[end] [start] REPLICATE [channel] GUARD [code] ENDGUARD
	...
	[end] [start] REPLICATE [channel] [boolean] ?GUARD [code] ENDGUARD
	...
 ENDSELECT

Here ``[channel]'' is used to describe an arbitrary series of tForth words that should result in a channel address put on the data stack, for use by GUARD.

The word ``[code]'' describes any tForth code. This [code] is executed on the condition that something can be read from the channel GUARDed.

Again, the ``[boolean]'' can be generated by any sequence of tForth words. ?GUARD needs a boolean true and a ready channel to activate its code.

The REPLICATE statement should be seen as DO LOOP in disguise. Between REPLICATE and GUARD, the loop index I can be used to compute channel addresses for GUARD. If one of these computed channels becomes ready, the corresponding index is pushed on the data stack and the [code] after GUARDis executed.

Time

There are two clock registers, one for each priority level. The high priority clock increments every microsecond. The low priority clock increments every 64 microseconds.

Time is cyclic. Whenever a clock register reaches the maximum positive number fitting in a CELL , it ``increments'' to MININT. Time values must be considered unsigned values and manipulated with unsigned operators like U< and U>. As tForth pays no attention whatsoever to the error flag, it is perfectly okay to use + to add time values together.

A disadvantage of the internal clocks is that their intervals depend on the process priority. Therefore tForth provides the words >MS and >TICKS that work independent of this priority. The words accept or output data in millisecond format, like the ANSI standard word MS. With >MS a count in ``timer ticks'' is translated to real elapsed time in milliseconds. With >TICKS, a real time in milliseconds is translated to timer ticks. Both words determine the priority of their callers to decide on the conversion factor.

Reading the time

The present value of the relevant timer register can be read with ?MS. The time since switch-on is returned in milliseconds (tForth does not attempt to provide a real time-of-day clock. On a 32-bits transputer the low priority timer register will cycle every 76 hours, but on a 16-bit transputer this will only be 4.2 seconds).

Waiting

It is possible to halt a process for a specified number of milliseconds. The word to use is MS . Maybe we should stress that this word only delays the process calling it, not the transputer itself; other processes and link i/o will keep on running. For a 32-bits transputer the limits for high and low-priority processes are 1.2 and 76 hours respectively.

Language constructs using TIME

The SELECT statement in tForth allows a process to make a choice over its future behavior dependent on signals from the system timer. We will concentrate here on those components directly using the timer. An informal description follows:

 SELECT
	[timeout] TIMEOUT [code] ENDGUARD
	...
	[timeout] [boolean] ?TIMEOUT [code] ENDGUARD
	...
	[end] [start] REPLICATE [timeout] TIMEOUT [code] ENDGUARD
	...
	[end] [start] REPLICATE [timeout] [boolean] ?TIMEOUT [code] ENDGUARD
	...
 ENDSELECT

Here ``[timeout]'' is used to describe an arbitrary series of tForth words that should result in an unsigned number being put on the data stack, for use by TIMEOUT or ?TIMEOUT. This number describes a time interval in timer ticks.

The word ``[code]'' describes any tForth code. This [code] is executed on the condition that the specified time interval has elapsed.

Again, the ``[boolean]'' can be generated by any sequence of tForth words. ?TIMEOUT needs a boolean true and a counted down time interval to activate its code.

The REPLICATE statement was described with GUARD, above.

An example of the above follows

\ Needed now tForth REQUESTS for keys to be send.
: KEY-REQUEST 	{{ ITERM! 0 BOOTLINK @ CHANNEL-C! }} ;

: X6
	CR ." Type any key to continue (you have a few seconds...)"
	KEY-REQUEST
	SELECT
		BOOTLINK @
		GUARD
			BOOTLINK @ CHANNEL-C@
			CR ." You typed '" EMIT ." '"
		ENDGUARD

		100000
		TIMEOUT CR ." Time out, press a key "
		        BOOTLINK @ CHANNEL-C@ DROP
		ENDGUARD
	ENDSELECT ;

An example of REPLICATE

: X5-CALCULATE-TIMEOUT 7 = IF 10000 ELSE 100000 THEN ;

: X5
	SELECT
		10 0 REPLICATE
		I X5-CALCULATE-TIMEOUT TIMEOUT
			7 <> ABORT" Failure in X5"
		ENDGUARD
	ENDSELECT ;

Concurrent Language Features

tForth supports true and apparent concurrency. In true concurrent systems each parallel activity is executed by a different physical processor. If parallel execution is simulated by a single-processor system it is said to use apparent concurrency. In tForth the distinction between both forms of concurrency is explicitly programmed by the user. Both forms have advantages. The same language constructs can be used for both. However, as we have already seen with the transputer link hardware and channel concept, important differences exist that reflect the physical implementation. Here we will discuss the apparent concurrency features only, as the true concurrent features of tForth are planned for version 2.0 and cannot be discussed yet. This does not mean true concurrent programming is impossible, it only states that powerful high-level constructs like automatic placement of processes on a transputer network are not yet available. However, the basic tools to help the user build these utilities are already present in tForth version 1.0.

To simulate parallel tasks on a single-processor system, executing the instructions of one task is interleaved with running all other tasks. The transputer hardware handles this automatically. This ``slicing'' is completely transparent to the user: there are no special requirements for the use of data structures or program control statements; a task or process may be structured just like any other definition. However, special syntax exists to create, start and stop processes.

The transputer supports running a task at two levels of priority [3]. The hardware maintains four queues of active processes, two for high, and two for low priority processes. A process can be in one of four possible states: executing; waiting to execute, which implies that it is in one of the two active process queues; waiting for a timer event, which implies it is in one of the two timer queues, or waiting for a communication event, in which case it is in no queue. A high priority process will execute without interruption until it terminates, or waits for a timer or communication event to take place. In this case, if there are any further high priority processes waiting to proceed, then the process at the head of the high priority active queue will be scheduled. If there are no high priority processes waiting to execute, then the next waiting low priority process will be scheduled. Low priority processes may be pre-empted at any time by a high-priority process that becomes ready to run. Low priority processes are time-sliced; if a low priority process executes a j, or lend, instruction, and has been executing for more than its time-slice period, it is descheduled and placed at the back of the low priority active queue, with the process at the head of the queue commencing execution.

We already mentioned that tForth uses a combination of time-slicing and a method involving the word PAUSE. A process can immediately suspend its execution by calling PAUSE.

Asynchronous processes

A colon definition is a routine that will execute in the sequence of other colon definitions called by the main program. A task or process is a routine that will execute in parallel with the calling program, immediately after it is started. It has been a design decision of tForth not to have special compiler syntax for routines that are to be run as a task. The difference with normal colon definitions rests solely in how they are started, and what happens after they run to completion. Typing the name of a colon definition runs its code as a single task, just like it is for a normal Forth system. Actually, this means the colon definition runs instead of the default interpreter loop. If it finishes, the interpreter automatically restarts. The same colon definition can be run concurrent with all other active tasks if special syntax is used to submit it. Carefully note that if it runs to completion (by executing the final ;), there is no interpreter to regain control. This means a task must explicitly execute STOP or stopp, when it finishes. Having the word STOP is advantageous, because it compeletely saves the state of the task in its private workspace. Using RERUN, these tasks can be restarted at the exact point they left of.

The necessary steps to be able to run a task concurrently are (1) create a workspace and (2) pass the wanted priority plus the address of the newly allocated workspace plus the machine address of the code to be run to the transputer hardware. As already mentioned, the code must contain an infinite loop or execute the STOP statement when it finishes. Because the machine executable address of a colon definition is not equivalent to its Forth execution token, the word RUN is available as a conversion operator.

We'll describe one of the two kinds of asynchronous concurrent processes supported by tForth. Methods are discussed to create, destroy, run and stop processes. Some simple examples will be given.

Process priorities

In ``tForth and Concurrency'' we mentioned that the transputer supports two task priority levels; high priority and low priority. tForth allows Forth tasks to run at both these machine priority levels. Consequently, the words LO-PRIO and HI-PRIO are available to be used together with the RUN command (to be explained below).

The word GETPRIORITY assesses the priority level of the calling task. Its dual, SETPRIORITY , changes the priority level of the caller, at the same time leaving the current priority level on the data stack. A guaranteed feature of this word is that it will not cause descheduling.

We needed SETPRIORITY for the tForth SEMAPHORE concept. It was very difficult to find a way to do priority switching. The transputer instruction set does not support it directly, probably because there is no use for it in an occam context.

Concurrent Forth processes

The workspace of a Forth process is created using the defining word FORTH-PROCESS and contains all five Forth stacks and a partially initialized user area (that is: copied from the starting process). Code to be run in this workspace can be programmed using all features of the regular tForth compiler. There is one exception: compilation by sub-processes is not supported, because these processes may not modify the shared, but unprotected, header space. However, it is ok to compile executable code fit for other processes to use. (Note the subtle difference). In general, you should not try to abuse tForth as a multi-user system, it is not designed for this.

There is no compile-time parameter to FORTH-PROCESS as the size of stacks and user area is fixed. A Forth task needs about 8 KByte of memory to run.

A feature of the word created with FORTH-PROCESS is that it not merely returns the address of the allocated workspace, but that it will actually start a concurrent Forth process when passed a priority and a machine executable address. When the process executes STOP the transputer hardware automatically writes data at negative offsets of the workspace pointer in order to allow a restart of the process. This information is used by RERUN to let the process continue just after the point where it left off.

An example:

 FORTH-PROCESS process
 VARIABLE count

 : counter
	BEGIN		      \ begin ... again allows RERUN
	    1000 0 DO
		       I count +!
	         LOOP
	     STOP		\ do one iteration only
	AGAIN ;

	LO-PRIO  ' counter RUN process

	count ?		\ result of the first iteration

	' process RERUN

	count ?		\ result of the second iteration

The process identifier

We now have seen that each high-level process in tForth is created by the defining word FORTH-PROCESS (except for the interpreter process). One of the important parameters that tForth keeps in the parameter field of a process is the address of the workspace that is allocated for it. As this is an unique address, it is ideally suited as a Process IDentifier or PID. Various important parameters are stored at negative offsets from the PID. You can use 'PID to find the Process IDentifier of a named process. Example:

 FORTH-PROCESS process
 ' process  ( cfa) 'PID ( PID) H.  ( might print $80031B00)

Do not confuse 'PID and PID. The tForth word PID returns the process identifier of the interpreter or main process. The server will try to put the workspace of the main process in transputer on-chip RAM, so PID will in most cases be very nearly equal to MININT.

Running, stopping and killing processes

We now know how to run one of the two kinds of processes tForth supports. It comes down to executing the word defined by FORTH-PROCESS after having put the priority and the address of the executable code on the data stack. Remember that the word RUN is only a conversion operator that serves to convert execution tokens to machine addresses suitable for these defined words.

Stopping a Forth process is accomplished by having it execute STOP or stopp,. A machine process can only stop by executing stopp,.

Killing a process is the operation of forcing a process to execute STOP or stopp,. It probably should be avoided, but it has its uses, especially when FORGETting code containing process definitions (tForth already uses KILL for this purpose internally).

KILL needs an execution token on the stack. It will try to stop the process that is identified by this token, removing the process from all the process queues. It works by redirecting the stored instruction pointer of the process to code that executes stopp,, and making sure the process runs at least one more time to execute that instruction. There is a check if the word corresponding to the execution token is created by a FORTH-PROCESS, but it can not be 100% fool-proof. If it fails, a random word is stored at a random memory address.

There are difficulties with killing processes:

When a process is in a timer queue, you cannot KILL it.
If a process controls one or more semaphores, you'd better not KILL it, as that will lock the resources connected to the semaphores.
If a process is doing channel i/o, it is not in any queue and thus the workspace can be removed without doing anything special. However, i/o implies two processes are involved. If one of these processes is performing the KILL , there is still no problem. If an outsider process (most likely the root process) does the KILL , it is possible that the still living process tries to communicate with the process now KILLed, with potentially disastrous results.

These three conditions cannot be detected by tForth itself. The programmer is expected to override the FORGET fields of processes with such behavior. A crude solution would be to let the process poll a flag that is set by the forget part. The forget code PAUSE 's until the process has read the flag. A transputer-like solution for this is to use a CHANNEL.

Example usage: ' name KILL

Synchronized concurrent processing

We will now describe one of the two tForth constructs that allow synchronized concurrent processing. Occam would call this a ``PAR construct''. The syntax is discussed and methods are shown to pass parameters to the separate processes. Simple examples are given throughout.

In tForth's PAR construct all started concurrent processes are forced to wait until each and every process has finished and the main line of sequential processing can continue. The ``waiting to be finished'' bit, the synchronization, is the only feature discerning them from a set of asynchronous concurrent processes as described in the previous paragraphs. Of course there are superficial differences too, like that it is not necessary for the user to allocate named workspaces for each and every process needed in a PAR. Allocation and deallocation code is generated by the compiler automatically.

As a quick preview, inside a colon definition a standard PAR will look as follows:

 VARIABLE gorilla	  1 gorilla !
 VARIABLE bananas	111 bananas !

  : ZOO
  	PAR
  	   STARTP  -24 bananas +!  ENDP
	   STARTP    1 gorilla +!  ENDP
	ENDPAR
	CR bananas ?  gorilla ? ;

Between the PAR and ENDPAR two processes are started in parallel. Their routines are specified between STARTP and ENDP . This example carefully avoids the problems that arise when in the course of their execution the processes need to modify the same variable.

Preserving priority

It is not necessary for parallel processes to have the same priority. Thus one might suspect that the priority of the main sequential process depends on the priority of the parallel process finishing last. We can document that the priority of the main process is explicitly restored by all of tForth's PAR constructs.

How to define a high-level PAR

tForth's high-level PAR construct is designed to be used with dynamically allocated workspaces and routines written in high-level tForth. The workspace of a sub-process contains all 5 tForth stacks and a copy of the user area of the main sequential process.

Occam splits of n-1 tasks if asked to run n processes. The root process takes over one of the remaining parallel tasks and runs it in its own workspace. This would be wrong if one of the parallel tasks wants to start its own set of parallel jobs: then two or more processes (at different ``task depths'') will end up with the same workspace and overwrite each other's data!

tForth therefore starts up exactly n NEW processes for each n PAR . The root process executes endp, after it has allocated and started the n tasks.

To define a high-level parallel construct four words are available: PAR, STARTP, ENDP and ENDPAR.

PAR marks the start of a list of sub-processes. The routine definition for each sub-process is enclosed by STARTP and ENDP. These two words generate code that takes care of workspace allocation and deallocation. ENDPAR marks the end of a list of processes. It compiles code that waits until all of the processes that are in the list of processes have terminated and makes sure sequential processing proceeds with the same priority level in effect as before the PAR was started.

Example usage, a parallel multiplier:

\ First create a parameter area. Each parallel process can access this area.
\              item#: 0   1    2    3     4      5
	CREATE data   0 , 11 , 12 , 123 , 3311 , 0 ,

: TH	data []CELL ;	( index -- address )

: ZOTZ
	PAR
	   STARTP	 	\ multiply items 1 and 2 and store at 0
		1 TH @	\ get item 1
		2 TH @	\ get item 2
		  *		\ multiply
		0 TH !	\ store at item 0
	   ENDP
	   STARTP 		\ multiply items 3 and 4 and store at 5
	 	3 TH @	\ get item 3
		4 TH @	\ get item 4
		  * 		\ multiply
		5 TH !	\ store at item 5
	   ENDP
	ENDPAR ;

	PREVIOUS

	ZOTZ  CR 0 TH ?  5 TH ?

Parameter passing

Two words are available to pass parameters to the high-level processes inside a PAR ENDPAR construct. The word :I signals the following STARTP that it should generate code to transfer n words from the data stack of the main sequential process onto the data stack of this particular process. The word :F does the same thing, but now for the floating-point items. After the execution of the code generated by STARTP , the relevant stack of the root process has dropped the specified number of items.

Example:

 : WOOF ( -- )
 	PAR
  	   12 21  2 :I STARTP  + {{ . SPACE }} ENDP
	   3e PI  2 :F STARTP  F+ E.           ENDP
	ENDPAR
	.S ;

\ After WOOF , tForth prints:
\ FORTH> WOOF 33  6.141593E0
\   Data: ---
\ System: ---
\  Float: --- ok
\ FORTH>

Process replication

It is possible to start concurrent processes using a DO LOOP related construction. The word to be used is TIMES .

[limit] [start] TIME makes sure the next code fragment that is inside a PAR ... ENDPAR construct is executed [limit] - [start] +1 times. TIME has the same functionality as DO. The next STARTP command will act as the corresponding LOOP statement. The loop index is available using I as with any other DO ... LOOP command. You can pass items to the I-th sub-process using :I.

Example:

: FOO
	PAR  3 0 TIMES
		   I 2*  I 1+  I DUP *   3 :I
		STARTP
		    + +  .
		ENDP
	ENDPAR ;

\ Execution of FOO generates the following output:
\ FORTH> FOO 1  5  11   ok

The tForth server/client scheme

The key to tForth being platform independent lies in its server/client concept. All I/O and memory management is routed through stub words that connect, via one hardware link, to a server on the host computer. This server is written in C, using the host-supplied compiler and I/O libraries. If a user has access to the source code, he can link in any library available using the hooks in our protocol. In that way we have added graphics facilities to tForth. The server can be made to work in parallel with tForth, which makes a lot of a difference when there are only one or two transputers available.

Porting tForth

As tForth left the bootstrap stage, it became able to metacompile itself. The resulting source-code cleanup revealed that it is very easy to generate Forths for other CPU's, using the same base set of source files plus a small hardware dependent kernel. Right now, we have build a 32-bit protected mode Forth for the Intel 386+387 and 486 chips in this way: iForth. Both tForth and iForth can run the same metacompiler source code, enabling us to generate a new tForth with tForth, a tForth with iForth, an iForth with tForth and an iForth using iForth. Helped by the ANS Forth word ENVIRONMENT?, both Forths run our standard set of examples and benchmarks -- currently a collection of over two megabytes of text files. iForth is not a parallel Forth, so obviously some examples must be excluded.

Where do we go from here?

Version 1.0 of tForth does not run on a network of transputers, but the necessary hooks are there. At this moment we are experimenting with tForth on a 64-node transputer system, which we hope will result in an Helios-compliant tForth version 2.00. This version will provide the high-level tools to load and distribute tForth code over large networks. Any transputer on the net can be selected to have full and exclusive access to keyboard, screen and other host facilities.

In a parallel effort, we are working on kernel ports to the Intel '386+387 and '486 chips (Ed: now finished), the Motorola 68000 series and the Texas Instruments TMS320C30 (Ed: finished) and '40 chips. Porting problems with byte-order and character size will result in incremental touch-ups of our present metacompiler, trying to make it truly universal.

literature

 [1] IMS T800 Architecture; Technical Note 6. INMOS Ltd, Bristol, 1987.
 [2] Transputer Instruction Set: a compiler writer's guide.
     INMOS Limited, Prentice Hall International (UK) Ltd 1988.
 [3] Inside The Transputer.
     D.A.P. Mitchell, J.A. Thompson, G.A. Manson, G.R. Brooks,
     Blackwell Scientific Publications, 1990
 [4] User Guide Transputer Education Kit; Theory of Operation, Installation,
     Schematics. Computer System Architects.
 [5] A tutorial introduction to OCCAM programming. Dick Pountain, INMOS Ltd.
     1987
 [6] draft proposed American National Standard for Information Systems -
     Programming Languages - Forth  (X3J14 dpANS-2 - Aug 3, 1991)
     Secretariat Computer and Business Equipment Manufacturers Association,
     311 First Street, N.W., Suite 500, Washington, DC 20001-2178
 [7] Comments received during the first public review of BSR X3.215.199x,
     dpANS Forth. Secretariat CBEMA.
 [8] dpANS-3, Changed sections only, by John Rible. Secretariat CBEMA.
 [9] Forth-83 Standard. A publication of the Forth Standards Team,
     P.O. box 4545 Mountain View, CA 94040, USA, August 1983
[10] Object-Oriented Forth; Implementation of Data Structures.
     Dick Pountain, Academic Press.
[11] Faster Forth; Reducing overhead in threaded interpretive languages.
     Ronald L. Greene, BYTE, June 1984.
[12] A Fast Forth for the 68000, Lori Chavez. Dr. Dobb's Journal, October 1987.
[13] Implementing Forth on the 80386. John E. Lecky, JFAR Vol 5, No 1, 1987
[14] The T9000 Transputer; Product Overview, INMOS Ltd, Bristol, 1991

APPENDIX

List of tForth features

All of the transputer hardware is supported: hardware task-switching, hardware floating-point, the on-chip timer, the links. For the T4 a simple floating-point emulator is provided.
Conforming to ANS Forth. All ANS word sets and their extensions are supported: CORE BLOCK DOUBLE ERROR FACILITY FILE FLOAT LOCALS MEMORY TOOLKIT SEARCH and STRING.
Full 32-bit system, incrementally compiling to machine code. Speed of high-level words is about two times slower than hand-optimized assembly language. (The generated code beats most transputer ``C'' implementations with regards to speed and size for anything but trivial programs. It is faster than all of them by an order of magnitude with regards to compile and link speeds: ready to execute a 4700 line program consisting of 11 source files within 30 seconds, where the INMOS ICC will take 84 seconds to compile and link the Sieve of Eratosthanes, a single-file program of 39 lines). Highlevel words can be mixed with machine code primitives, in-line generation of machine code is possible.
tForth can be loaded at any available transputer address, but it will always claim the on-chip RAM for its workspace (this can be changed if unacceptable). tForth can use all the memory that is available (dynamically as well as statically), or it can be deliberately shrunk to a workable minimum of 192 KBytes (73 Kbytes free for user programs).
All OCCAM constructs like PAR, ALT and PRIPAR have tForth counter parts, but tForth also allows dynamic memory management and recursion, which is not available in OCCAM. The on-chip timer and the transputer links can be accessed with high-level Forth words. Semaphores and queues are implemented and are used in our I/O and memory management system to enable multi-process I/O.
tForth has an integrated optimizer and runs only about 2 times slower than hand-optimized assembly language. There is an ASSEMBLER vocabulary to get the extra speed if needed. The assembler supports all documented (and some undocumented) instructions for every transputer model known to us. It is possible to cross assemble for other models, the assembler emits warnings if the code is not executable on the host hardware.
Symbolic disassembler loads on demand.
Several flavors of debuggers and profilers.
String, Complex number, 2D-Matrix, Array and Graphics support (fully documented source code in library format).
Online help for every tForth kernel word.
A flexible command line editor is included. All of the user's own programming tools (for instance editor, grep, browser) can be seamlessly integrated into the tForth environment using the words SYSTEM and OS-IMPORT (not available under MS-Windows).
Function key support, command and file history.
Compile errors start a user-specified editor with the cursor pointing at the problem.
Two megabytes of examples are available for study or inclusion in your own programs. Examples encompass benchmarks, parallel demo's, graphics programming, big number arithmetic, text processing tools, neural networks, text compression and even a few games.
High level support for module management, localized definitions and use of forward references. Smart forgetting automatically handles process descheduling, memory deallocation and deinstallation of vectors.
All I/O is vectored. A line of code or a function key press direct all output to a disk file or a free transputer link.
Source code for the C server is available. User functions enable the calling of any C-library on the host, and thus the host platform may process in parallel with the transputer.

free counter