I have ported iForth from Linux to Windows 95, as a native 32-bit console application. Since Forth vendors have suggested that Windows 95 has a huge overhead, resulting in their Forths running much slower than usual, I thought I'd run some tests. I'm using the benchmark suite that was developed to measure i486 / Pentium differences.

The question: is there a difference between

  1. iForth 1.06 on a Pentium 166 MHz under Windows 95 (native 32-bit console), and
  2. iForth 1.06 on a Pentium 166 MHz under DOS 7.0 (using the GO32 DOS extender)?

In both cases the same machine code is run. The Win95 version makes use of Borland's C library. The following tests are a mix of integer and floating-point code; I/O is eliminated as much as possible. The benchmark code is available from my homepage.

1. The Ertl/Maierhofer Benchmark Suite

For the code, see Anton Ertl's homepage, or here.

BENCHMARK			   Win95	DOS 7.0		rel.
--------------------------------------------------------------------
Fibonacci...9227465      	 2.040 sec	2.052 sec	0.99
Sieving...1899           	 2.690 sec	2.192 sec	1.23
Matrix multiply...		 2.530 sec	2.099 sec	1.21
bubbling...			 3.070 sec	2.778 sec	1.11
The average slowdown caused by Win95 is 13%.
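
The rel. column and the 13% figure can be checked with a few lines of Python, used here only as a calculator. This is a sketch under two assumptions: that rel. is the Win95 time divided by the DOS 7.0 time, and that the average is a plain arithmetic mean of the four ratios. The variable names are illustrative and not part of the benchmark suite.

```python
# Timings in seconds copied from the table above: (Win95, DOS 7.0).
timings = {
    "Fibonacci": (2.040, 2.052),
    "Sieving": (2.690, 2.192),
    "Matrix multiply": (2.530, 2.099),
    "bubbling": (3.070, 2.778),
}

# Assumption: rel. = Win95 time / DOS time, so values above 1 mean Win95 is slower.
rel = {name: win95 / dos for name, (win95, dos) in timings.items()}

# Assumption: the quoted average is the arithmetic mean of the ratios.
mean_rel = sum(rel.values()) / len(rel)

for name, r in rel.items():
    print(f"{name:16s} {r:.2f}")
print(f"mean slowdown: {(mean_rel - 1) * 100:.0f}%")
```

Under those assumptions the mean ratio comes out at about 1.13, in line with the 13% quoted.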

2. Caches

Some people may remember my confusion over the fact that the timing of a push REG + pop REG instruction pair on my '486 didn't match the timings found in the Intel databook. This was thought to be caused by a sub-optimal implementation of the caches. Here are these figures again, for an unrolled loop of 300,000 instructions:

BENCHMARK			  Win95		DOS 7.0		rel.
--------------------------------------------------------------------
300,000-times push+pop  	 3.790 sec	3.646 sec	1.04
(push+pop expected		 3.600 sec	3.600 sec	1.00)
300,000-times nop+nop		 1.810 sec	1.837 sec	0.99
(nop+nop expected 		 3.600 sec	3.600 sec	1.00)
No dramatic slowdown is apparent.

3. Memory Moves

Another important performance criterion is the speed with which memory can be read and written. Here are the high-level iForth results; overlap and no-overlap indicate the relative positions of the copied memory regions. Due to unfortunate Forth legacy code, ANS Forth forbids the use of fast machine instructions to implement memory block moves and fills.
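
Why the copy direction matters for overlapping regions can be sketched outside Forth. The Python below is only a model of the semantics, not iForth's implementation: `cmove` copies byte by byte from low to high addresses like CMOVE, `cmove_up` copies from high to low like CMOVE>, and `move` picks whichever direction leaves an intact copy of the source, as MOVE is expected to do. The function names and the bytearray standing in for memory are assumptions made for the illustration.

```python
def cmove(mem, src, dst, n):
    """Copy n bytes one at a time, lowest address first (models CMOVE)."""
    for i in range(n):
        mem[dst + i] = mem[src + i]

def cmove_up(mem, src, dst, n):
    """Copy n bytes one at a time, highest address first (models CMOVE>)."""
    for i in range(n - 1, -1, -1):
        mem[dst + i] = mem[src + i]

def move(mem, src, dst, n):
    """Pick the direction that preserves the source data (models MOVE)."""
    if src < dst < src + n:
        cmove_up(mem, src, dst, n)   # destination overlaps the tail of the source
    else:
        cmove(mem, src, dst, n)

# Overlapping copy of 4 bytes from offset 0 to offset 2.
a = bytearray(b"ABCDEF")
cmove(a, 0, 2, 4)   # forward copy re-reads bytes it has already overwritten
b = bytearray(b"ABCDEF")
move(b, 0, 2, 4)    # backward copy keeps the source intact
print(a, b)
```

In the forward case the first two bytes propagate through the buffer (`ABABAB`), while `move` produces a true copy (`ABABCD`). A word that must reproduce this byte-by-byte propagation cannot simply be mapped onto the fastest block-move instruction, which is presumably the legacy the text refers to.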

BENCHMARK	  Win95		DOS 7.0		rel.	1/rel.
--------------------------------------------------------------
no-overlap
CMOVE      : 	55.555 MB/s	48.859 MB/s	1.14	0.88
CMOVE>     : 	45.454 MB/s	42.253 MB/s	1.08	0.93
MOVE 1->2  : 	53.571 MB/s	50.335 MB/s	1.06	0.94
MOVE 2<-1  : 	45.454 MB/s	50.000 MB/s	0.91	1.10
overlap
CMOVE      : 	45.454 MB/s	44.776 MB/s	1.02	0.99
CMOVE>     : 	45.454 MB/s	38.363 MB/s	1.18	0.84
MOVE 1<-2  : 	68.181 MB/s	60.483 MB/s	1.13	0.89
MOVE 2->1  : 	68.181 MB/s	60.728 MB/s	1.13	0.89
Surprisingly, we find that Win95 is 7% faster than DOS 7.0!

4. The FORTH Inc Benchmark Suite

Here is the test posted by Elizabeth Rather, as modified by Tom Zimmer.

Test			 	 Win95		DOS 7.0		rel.
--------------------------------------------------------------------
Testing DO LOOP 	=	0.0500 us	0.0300 us	1.67
Testing * 		= 	0.1100 us	0.1440 us	0.76
Testing / 		= 	0.3900 us	0.3160 us	1.23
Testing + 		= 	0.0550 us	0.0540 us	1.02
Testing M*	 	= 	0.1700 us	0.1730 us	0.98
Testing M/ 		= 	0.3300 us	0.3230 us	1.02
Testing M+ 		= 	0.1100 us	0.1200 us	0.92
Testing /MOD 		= 	0.3300 us	0.3830 us	0.86
Testing */ 		= 	0.4400 us	0.4300 us	1.02
Testing Eratosthenes 	= 	0.6105 us	0.2564 us	2.38
Testing Hoare's qsort	=      11.0000 us      11.1000 us	0.99
all tests 		=       2.1400 sec      2.1160 sec 	1.01
There are large differences between the two OS's, but I think they are caused by a lack of timer resolution under Win95. The "all tests" row is probably okay: no significant difference for this test.
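
The effect of a coarse timer on per-operation figures can be illustrated with a small model. The sketch below assumes a clock that only advances in whole ticks and simply drops partial ticks. The 55 ms tick (the classic PC timer period), the iteration counts and the function name are assumptions chosen for illustration, not values taken from the benchmark source.

```python
import math

TICK_MS = 55.0  # assumed timer granularity in ms (hypothetical, not from the text)

def measured_us_per_op(true_us_per_op, iterations, tick_ms=TICK_MS):
    """Per-operation time as reported by a clock that only counts whole ticks."""
    true_total_ms = true_us_per_op * iterations / 1000.0
    ticks = math.floor(true_total_ms / tick_ms)   # partial ticks are lost
    return ticks * tick_ms * 1000.0 / iterations

# A 0.03 us operation timed over 1e6 and 1e8 iterations.
short_run = measured_us_per_op(0.03, 1_000_000)    # true total 30 ms: less than one tick
long_run = measured_us_per_op(0.03, 100_000_000)   # true total 3000 ms: 54 whole ticks
print(short_run, long_run)
```

In this model a run shorter than one tick reports zero elapsed time, and per-operation results can only take values that are multiples of one tick divided by the iteration count. Running each test for many ticks shrinks the error, which is consistent with the long "all tests" total being more trustworthy than the individual rows.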

5. Guy Kelly's Forth Dimensions Benchmark Suite

A little-used test found in FD March/April 1992, by Guy Kelly. This code is very hard for modern 32-bit Forth implementations. Unfortunately I could not run it under Win95: it was too fast with respect to the timer resolution and reported zero elapsed time for almost all tests.

6. Mixed Bag

Here follows a bunch of more or less well-known benchmarks. The 1MLOOP program is from Bernie Mentink.

The "unnest/nest pairs" benchmark tests how long it takes to unwind from 32 million nested colon definitions (no loops!).

The SAVAGE benchmark is well known; Bill Savage wrote it.

FLOAT is a mix of fp operations.

FPMATH was translated by R. L. Smith from an article in Dr. Dobb's Journal, September 1988.

A long time ago I myself posted the Forth DHRYSTONE and WHETSTONE programs here.


BENCHMARK			  Win95	    DOS 7.0	rel.
--------------------------------------------------------------
1mloop				0.031 sec   0.035 sec	0.89
SuperSieve 1e9 -- 1.001e9	0.710 sec   0.693 sec	1.02
(50181 primes found)
32 million nest/unnest pairs 	2.960 sec   2.947 sec	1.00
Savage (floating point) 	0.018 sec   0.018 sec	1.00
Float (mix of operations)	1.210 sec   0.737 sec	1.64
Fpmath  			0.600 sec   0.579 sec	1.04
Forth Dhrystone 		101010 D/s 166389 D/s	0.61
Forth Whetstone 		53475 KW/s 57208 KW/s	0.93
There is a definite slowdown of about 40% under Win95 in two important benchmarks (Float and Dhrystone). I made sure these figures are correct by running the tests many times. The mean slowdown is 20%.
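
Float is reported as a time and Dhrystone as a rate, so their rel. values (1.64 and 0.61) look different while describing nearly the same slowdown. A short Python check, assuming rel. is simply the Win95 figure divided by the DOS 7.0 figure in both rows (the variable names are illustrative):

```python
# Figures copied from the table above.
float_win95_s, float_dos_s = 1.210, 0.737          # seconds, lower is better
dhry_win95_rate, dhry_dos_rate = 101010, 166389    # Dhrystones/s, higher is better

# Express both as "Win95 speed as a fraction of DOS speed".
float_speed = float_dos_s / float_win95_s          # invert the time ratio
dhry_speed = dhry_win95_rate / dhry_dos_rate       # a rate ratio is already a speed ratio

print(f"Float:     {float_speed:.2f} of DOS speed")
print(f"Dhrystone: {dhry_speed:.2f} of DOS speed")
```

Both come out at about 0.61 of the DOS speed, i.e. roughly 40% below the DOS values.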

CONCLUSION

The strict average of all benchmark results says that the Win95 iForth is 9% slower than the DOS 7.0 iForth. When the memory copy test is removed, the result is even worse: 17% slower. Two important benchmarks (FLOAT and DHRYSTONE) are 40% below the DOS values. Quite a lot of cycles seem to be going down the GUI.

The iForth server is built with Borland's C++ 4.5. I do not know if the Borland libraries set up a "cycle stealing scheme" to make Windows programming easier. I certainly notice strange delays in the input and output routines.

Right now, iForth under Windows 95 is only a toy compared to the DOS and especially the Linux version. I intend to go straight to the Win95 DLLs with the next server version, to see if that improves matters. I have to do that anyway to add DDE and device-independent graphics support.
