I tried to port iForth from Linux to Windows 95, but switched to Windows NT 4.0 when the high-level iForth tools and applications ran into a host of silly limitations. Since many people seem to think Windows carries a huge overhead, I thought I'd run some tests. I'm using the benchmark suite developed to compare the iForths for Windows 95 and DOS. If you compare these results with those for the (preliminary) iForth for Windows 95, you will notice that the current iForth 1.07 is a lot faster than iForth 1.06.

The question: Is there a difference between

  1. iForth 1.07 on a Pentium 166 MHz under Windows NT (native 32-bit console), and
  2. iForth 1.07 on a Pentium 166 MHz under Linux 2.0.0 (console)?

In both cases the same machine code is run. The WinNT version makes use of Borland's C library. The following tests are a mix of integer and floating-point code; I/O is eliminated as much as possible. The benchmark code is available from my homepage.

1. The Ertl/Maierhofer Benchmark Suite

For the code, see Anton Ertl's homepage.

BENCHMARK			   WinNT	Linux   	rel.
--------------------------------------------------------------------
Fibonacci...9227465      	 2.086 sec	2.069 sec	1.01
Sieving...1899           	 1.936 sec	1.912 sec	1.01
Matrix multiply...		 1.498 sec	1.480 sec	1.01
bubbling...			 2.312 sec	2.256 sec	1.02

There is hardly a difference here.

2. Caches

Some people may remember my confusion over the fact that the timing of a push REG + pop REG instruction pair on my '486 didn't match the timings found in the Intel databook. This was thought to be caused by a sub-optimal implementation of the caches. Here are these figures again, for an unrolled loop of 300,000 instructions:

BENCHMARK			  WinNT		Linux  		rel.
--------------------------------------------------------------------
300,000-times push+pop  	 3.645 sec	3.640 sec	1.00
(push+pop expected		 3.600 sec	3.600 sec	1.00)
300,000-times nop+nop		 1.834 sec	1.829 sec	1.00
(nop+nop expected 		 3.600 sec	3.600 sec	1.00)

Almost too good to be true.

3. Memory Moves

Another important performance criterion is the speed with which memory can be read and written. Here are the high-level iForth results; "overlap" and "no-overlap" indicate the relative positions of the copied memory regions. Due to unfortunate Forth legacy code, ANS Forth forbids using fast machine instructions to implement memory block moves and fills.

BENCHMARK	  WinNT		Linux  		rel.	1/rel.
--------------------------------------------------------------
no-overlap
CMOVE      : 	49.019 MB/s	85.227 MB/s	0.58	1.74
CMOVE>     : 	53.003 MB/s	57.251 MB/s	0.93	1.08
MOVE 1->2  : 	51.194 MB/s	60.240 MB/s	0.85	1.18
MOVE 2<-1  : 	52.816 MB/s	58.139 MB/s	0.91	1.10
overlap
CMOVE      : 	45.454 MB/s	52.631 MB/s	0.86	1.16
CMOVE>     : 	45.454 MB/s	52.816 MB/s	0.86	1.16
MOVE 1<-2  : 	72.115 MB/s	90.361 MB/s	0.80	1.25
MOVE 2->1  : 	72.465 MB/s	89.820 MB/s	0.81	1.24

Amazingly, we see that WinNT is on average 17.5% slower than Linux 2.0 here (the mean of the rel. column is 0.825).
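The 17.5% figure can be checked from the table itself. The short Python sketch below is only my reconstruction of the arithmetic, assuming that rel. means WinNT throughput divided by Linux throughput and that the average is a plain mean of the rounded rel. column; the variable names are made up for this illustration.

```python
# Throughput in MB/s copied from the table above: (WinNT, Linux).
MEMORY_MOVES = {
    "no-overlap CMOVE":     (49.019, 85.227),
    "no-overlap CMOVE>":    (53.003, 57.251),
    "no-overlap MOVE 1->2": (51.194, 60.240),
    "no-overlap MOVE 2<-1": (52.816, 58.139),
    "overlap CMOVE":        (45.454, 52.631),
    "overlap CMOVE>":       (45.454, 52.816),
    "overlap MOVE 1<-2":    (72.115, 90.361),
    "overlap MOVE 2->1":    (72.465, 89.820),
}

# rel. = WinNT / Linux (assumed definition); below 1.0 means WinNT is slower.
rel_exact = [winnt / linux for winnt, linux in MEMORY_MOVES.values()]

# The table prints rel. with two decimals, so round the same way.
rel_rounded = [round(r, 2) for r in rel_exact]

# Plain arithmetic mean of the rounded column, expressed as a slowdown.
mean_rel = sum(rel_rounded) / len(rel_rounded)
slowdown_pct = (1.0 - mean_rel) * 100.0

# The same mean taken over the unrounded ratios, for comparison.
slowdown_exact_pct = (1.0 - sum(rel_exact) / len(rel_exact)) * 100.0

if __name__ == "__main__":
    for name, r in zip(MEMORY_MOVES, rel_rounded):
        print(f"{name:22s} rel. {r:.2f}  1/rel. {1.0 / r:.2f}")
    print(f"mean slowdown (rounded column): {slowdown_pct:.1f}%")
    print(f"mean slowdown (exact ratios):   {slowdown_exact_pct:.1f}%")
```

Averaging the unrounded ratios instead gives a marginally larger slowdown, so the conclusion does not depend on the rounding.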

4. The FORTH Inc Benchmark Suite

Here is the test posted by Elizabeth Rather, as modified by Tom Zimmer.

Test			 	 WinNT		Linux  		rel.
--------------------------------------------------------------------
Testing DO LOOP 	=	0.0430 us	0.0430 us	1.00
Testing * 		= 	0.1150 us	0.1150 us	1.00
Testing / 		= 	0.3530 us	0.3500 us	1.01
Testing + 		= 	0.0490 us	0.0480 us	1.02
Testing M*	 	= 	0.1220 us	0.1210 us	1.01
Testing M/ 		= 	0.3330 us	0.3320 us	1.00
Testing M+ 		= 	0.1150 us	0.1150 us	1.00
Testing /MOD 		= 	0.3570 us	0.3560 us	1.00
Testing */ 		= 	0.4110 us	0.4100 us	1.00
Testing Eratosthenes 	= 	0.4762 us	0.4518 us	1.05
Testing Hoare's qsort	=       7.9000 us       7.8000 us	1.01
all tests 		=       2.0300 sec      2.0100 sec 	1.01

There is no significant difference for this test.

5. Guy Kelly's Forth Dimensions Benchmark Suite

A little-used test found in Forth Dimensions (FD), March/April 1992, by Guy Kelly. This code is very hard for modern 32-bit Forth implementations. Unfortunately I could not run it under Windows NT: it was too fast with respect to the timer resolution and reported zero elapsed time for almost all tests.
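A common workaround for a timer that is too coarse is to repeat a test until the measured interval is comfortably longer than one clock tick, and then divide by the repetition count. The Python sketch below illustrates that idea only. It is not how Guy Kelly's suite or iForth measures time, and the names time_per_call and min_elapsed are made up for this illustration.

```python
import time


def time_per_call(func, min_elapsed=0.05, clock=time.perf_counter):
    """Return (seconds per call, repetitions used) for func.

    The repetition count is doubled until one timed batch lasts at least
    min_elapsed seconds, so a coarse clock cannot report zero elapsed time.
    """
    reps = 1
    while True:
        start = clock()
        for _ in range(reps):
            func()
        elapsed = clock() - start
        if elapsed >= min_elapsed:
            return elapsed / reps, reps
        reps *= 2  # batch was too short to measure reliably; try a longer one


if __name__ == "__main__":
    per_call, reps = time_per_call(lambda: sum(range(100)))
    print(f"{per_call * 1e6:.3f} us per call over {reps} repetitions")
```

The loop overhead is included in the result, so for very short tests the usual refinement is to time an empty loop the same way and subtract it.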

6. Mixed Bag

What follows is a bunch of more or less well-known benchmarks. The 1MLOOP program is from Bernie Mentink.

The "unnest/nest pairs" benchmark tests how long it takes to unwind from 32 million nested colon definitions (no loops!).

The SAVAGE benchmark is well known; Bill Savage wrote it.

FLOAT is a mix of floating-point operations.

FPMATH was translated by R. L. Smith from an article in Dr. Dobb's Journal, September 1988.

A long time ago I myself posted the Forth DHRYSTONE and WHETSTONE programs here.


BENCHMARK			  WinNT	    Linux  	rel.
--------------------------------------------------------------
1mloop				0.036 sec   0.036 sec	1.00
SuperSieve 1e9 -- 1.001e9	0.722 sec   0.709 sec	1.02
(50181 primes found)
32 million nest/unnest pairs 	3.037 sec   2.957 sec	1.03
Savage (floating point) 	0.013 sec   0.013 sec	1.00
Float (mix of operations)	0.541 sec   0.499 sec	1.08
Fpmath  			0.384 sec   0.325 sec	1.18
Forth Dhrystone 		160000 D/s 161550 D/s	0.99
Forth Whetstone 		67957 KW/s 64935 KW/s	1.05

There is a definite slowdown of 18% under WinNT in one benchmark (Fpmath). I made sure these figures are correct by running the tests many times. The mean slowdown is 3%.
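The table mixes two kinds of numbers: elapsed times, where lower is better, and Dhrystone/Whetstone rates, where higher is better. One reading that reproduces both the 18% and the 3% figure is to express every row as "how much longer WinNT takes" before averaging, which means inverting the ratio for the two rate rows. The Python sketch below follows that reading; it is my assumption about how the mean was taken, not something stated above, and the names are made up for this illustration.

```python
# Elapsed times in seconds copied from the table: (WinNT, Linux). Lower is better.
TIMED = {
    "1mloop":      (0.036, 0.036),
    "SuperSieve":  (0.722, 0.709),
    "nest/unnest": (3.037, 2.957),
    "Savage":      (0.013, 0.013),
    "Float":       (0.541, 0.499),
    "Fpmath":      (0.384, 0.325),
}

# Rates copied from the table: (WinNT, Linux). Higher is better.
RATED = {
    "Dhrystone": (160000, 161550),  # D/s
    "Whetstone": (67957, 64935),    # KW/s
}

# Slowdown factor of WinNT relative to Linux; above 1.0 means WinNT is slower.
slowdown = {name: winnt / linux for name, (winnt, linux) in TIMED.items()}
# For rates the ratio is inverted (assumption), so "above 1.0" keeps its meaning.
slowdown.update({name: linux / winnt for name, (winnt, linux) in RATED.items()})

worst = max(slowdown, key=slowdown.get)
worst_pct = (slowdown[worst] - 1.0) * 100.0
mean_pct = (sum(slowdown.values()) / len(slowdown) - 1.0) * 100.0

if __name__ == "__main__":
    print(f"worst case: {worst}, {worst_pct:.0f}% slower under WinNT")
    print(f"mean slowdown: {mean_pct:.0f}%")
```

Note that under this reading Whetstone counts as a small speedup for WinNT, which pulls the mean down a little.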

CONCLUSION

The strict average of all benchmark results says that the WinNT iForth is 10% slower than the Linux iForth. When the memory copy test is removed, the result is better: only 1.5% slower. Windows NT versus Linux does not show the appreciable overhead Windows 95 has with respect to DOS (40%).

The iForth server is built with Borland's C++ 4.5. I do not know if the Borland libraries set up a "cycle stealing scheme" to make Windows programming easier.
