The big Pentium - i486 Debate

I now own a Pentium machine. Naturally, I wanted to find out how fast this thing is in combination with iForth. (The code generator of iForth is optimized for register-stack and register-memory operations and does not follow the conventional register-register path.)

The question I started out with: Is there a non-trivial difference between

iForth 1.06 on a Pentium 166 MHz (under Win95 / DOS 7.0),
iForth 1.06 on a '486/66 DX2 (under DOS 5.0),

where the trivial result would be a difference based purely on the higher clock frequency of the Pentium: 166/66, or about 2.5.

The following tests are a mix of integer and floating-point code; I/O is eliminated as much as possible. The benchmark code is available from my homepage.

1. The Ertl/Maierhofer Benchmark Suite

For the code, see Anton Ertl's homepage.


BENCHMARK                          486           Pentium        rel.
--------------------------------------------------------------------
Fibonacci...9227465             10.347 sec      2.052 sec       5.04
Sieving...1899                  10.934 sec      2.192 sec       4.99
Matrix multiply...               8.537 sec      2.099 sec       4.07
bubbling...                     11.024 sec      2.778 sec       3.97

The average speedup of the Pentium over the i486 is a factor of 4.51, or 1.8 times better than expected on the basis of the clock-speed difference.
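
The arithmetic behind these figures can be checked with a few lines of Python. This is only a sketch and not part of the benchmark code: the timings are copied from the table above, the names timings and CLOCK_RATIO are made up for the illustration, and the clock ratio is rounded to 2.5 as in the text.

```python
# Timings in seconds, copied from the table above: (i486, Pentium).
timings = {
    "Fibonacci": (10.347, 2.052),
    "Sieve": (10.934, 2.192),
    "Matrix multiply": (8.537, 2.099),
    "Bubble sort": (11.024, 2.778),
}

CLOCK_RATIO = 2.5  # 166 MHz / 66 MHz, rounded as in the text (assumption)

# Speedup of each test = i486 time divided by Pentium time.
speedups = {name: t486 / tpent for name, (t486, tpent) in timings.items()}
mean_speedup = sum(speedups.values()) / len(speedups)

for name, s in speedups.items():
    print(f"{name:16s} {s:5.2f}")
print(f"mean speedup      {mean_speedup:5.2f}")
print(f"relative to clock {mean_speedup / CLOCK_RATIO:5.2f}")
```

The results agree with the figures quoted above to within rounding of the last digit.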

2. Caches

Some people may remember my confusion over the fact that the timing of a push REG + pop REG instruction pair on my '486 didn't match the timings found in the Intel databook. This was thought to be caused by a sub-optimal implementation of the caches. Here are these figures again, for an unrolled loop of 300,000 instructions:


BENCHMARK                          486          Pentium         rel.
--------------------------------------------------------------------
300,000-times push+pop          27.899 sec      3.646 sec       7.65
(push+pop expected               9.100 sec      3.600 sec       2.53)
300,000-times nop+nop            9.137 sec      1.837 sec       4.97
(nop+nop expected                9.100 sec      3.600 sec       2.53)

From the "expected" rows we see that the Pentium machine has a much better caching strategy / implementation. A mystery remains, however, as the Pentium seems to use substantially less than 1 cycle to execute a NOP instruction! This is probably why everybody thinks that a Pentium is about twice as fast as a similarly clocked '486 (see also CONCLUSION). Overall, the mean speedup is a factor of 6.31, or 2.52 times better than expected.
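
One way to read the table is to convert the run times into cycles per instruction. The Python sketch below is not from the benchmark code. It assumes that the "expected" rows stand for exactly one clock cycle per instruction; that is my assumption, although it fits the fact that 9.1 sec at 66 MHz and 3.6 sec at 166 MHz are very nearly the same number of cycles. The function name is made up for the illustration.

```python
def cycles_per_instruction(measured_s, expected_s, expected_cpi=1.0):
    """Scale the assumed cycle count of the 'expected' run by the
    ratio of measured to expected run time."""
    return expected_cpi * measured_s / expected_s

# (measured, expected) run times in seconds, copied from the table above.
rows = {
    ("push+pop", "486"): (27.899, 9.100),
    ("push+pop", "Pentium"): (3.646, 3.600),
    ("nop+nop", "486"): (9.137, 9.100),
    ("nop+nop", "Pentium"): (1.837, 3.600),
}

for (test, cpu), (measured, expected) in rows.items():
    cpi = cycles_per_instruction(measured, expected)
    print(f"{test:9s} {cpu:8s} {cpi:4.2f} cycles/instruction")
```

Under that assumption the '486 needs about three cycles per instruction in the push+pop loop, while the Pentium NOP loop comes out at roughly half a cycle per instruction. The latter would be consistent with the Pentium issuing two simple instructions per clock in its two integer pipelines.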

3. Memory Moves

Another important performance criterion is the speed with which memory can be read and written. Here are the high-level iForth results; overlap and no-overlap indicate the relative positions of the copied memory regions. Because of unfortunate Forth legacy code, ANS Forth forbids the use of fast machine instructions to implement memory block moves and fills.
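
To make the overlap cases concrete, here is a minimal model of the three words in Python, with a bytearray standing in for memory. It follows the ANS Forth descriptions (CMOVE copies character by character from low to high addresses, CMOVE> from high to low, MOVE leaves the destination holding the original source contents). It is an illustration only, not iForth's implementation, and the function names are mine.

```python
def cmove(mem, src, dst, n):
    """CMOVE: copy n bytes one at a time, lowest address first."""
    for i in range(n):
        mem[dst + i] = mem[src + i]

def cmove_up(mem, src, dst, n):
    """CMOVE>: copy n bytes one at a time, highest address first."""
    for i in reversed(range(n)):
        mem[dst + i] = mem[src + i]

def move(mem, src, dst, n):
    """MOVE: destination ends up with the original source contents."""
    mem[dst:dst + n] = bytes(mem[src:src + n])

# Overlapping regions, destination one byte above the source.
a = bytearray(b"abcdef")
cmove(a, 0, 1, 5)
b = bytearray(b"abcdef")
cmove_up(b, 0, 1, 5)
c = bytearray(b"abcdef")
move(c, 0, 1, 5)
print(a, b, c)
```

With the destination one byte above the source, cmove smears the first byte across the whole region, while cmove_up and move both produce a shifted copy. This byte-at-a-time propagation is presumably the legacy behaviour the remark above refers to, since a wide block-move instruction does not reproduce it.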


BENCHMARK           486             Pentium         rel.    1/rel.
------------------------------------------------------------------
no-overlap
CMOVE      :         8.291 MB/s     48.859 MB/s     0.17    5.88
CMOVE>     :         9.310 MB/s     42.253 MB/s     0.22    4.55
MOVE 1->2  :         9.068 MB/s     50.335 MB/s     0.18    5.56
MOVE 2<-1  :         9.085 MB/s     50.000 MB/s     0.18    5.56
overlap
CMOVE      :         5.787 MB/s     44.776 MB/s     0.13    7.69
CMOVE>     :         5.784 MB/s     38.363 MB/s     0.15    6.67
MOVE 1<-2  :        10.570 MB/s     60.483 MB/s     0.17    5.88
MOVE 2->1  :        10.691 MB/s     60.728 MB/s     0.18    5.68

The speedup is really good: 5.93 times, or 2.37 times better than expected.

4. The FORTH Inc Benchmark Suite

Here is the test posted by Elizabeth Rather, as modified by Tom Zimmer. You may wonder why the speedup for Eratosthenes is so much larger than for the other tests. The reason is that FORTH Inc. has designed their benchmarks so that the time to print two strings is included. The Pentium machine has much faster video circuitry (PCI, not ISA). The Sieve happens to execute much faster than the rest of the tests, so this I/O time becomes dominant.


Test                          486             Pentium          rel.
-------------------------------------------------------------------
Testing DO LOOP         =     0.1380 us       0.0300 us        4.60
Testing *               =     0.7290 us       0.1440 us        5.06
Testing /               =     0.9320 us       0.3160 us        2.95
Testing +               =     0.2050 us       0.0540 us        3.80
Testing M*              =     0.7950 us       0.1730 us        4.60
Testing M/              =     0.9020 us       0.3230 us        2.79
Testing M+              =     0.4910 us       0.1200 us        4.09
Testing /MOD            =     0.9370 us       0.3830 us        2.45
Testing */              =     1.4290 us       0.4300 us        3.32
Testing Eratosthenes    =     2.7106 us       0.2564 us       10.57
Testing Hoare's qsort   =    39.8000 us      11.1000 us        3.59
all tests               =     7.2160 sec      2.1160 sec       3.41

The average speedup is 4.26, or 1.7 times better than expected.

5. Guy Kelly's Forth Dimensions Benchmark Suite

A little-used test by Guy Kelly, found in Forth Dimensions (FD), March/April 1992. This code is very hard for modern 32-bit Forth implementations.


    BENCH        Pentium (sec)   486 (sec)    rel.    1/rel.
------------------------------------------------------------
    Empty     :     0.007         0.038       0.18    5.56
    Thread    :     0.058         0.122       0.48    2.08
    Nest1     :     0.044         0.268       0.16    6.25
    Nest2     :     0.028         0.071       0.39    2.56
    Prims     :     0.045         0.136       0.33    3.00
    Sieve     :     0.050         0.152       0.33    3.00
    Loads     :     0.017         0.062       0.27    3.70
    Comp      :     0.018         0.069       0.26    3.85
    C prim    :     0.161         0.689       0.23    4.35
    C sec     :     0.170         0.688       0.25    4.00
  rd+wrd+fnd  :     0.100         0.469       0.21    4.76
read + <word> :     0.050         0.147       0.34    2.94
    <word>    :     0.253         0.817       0.31    3.22
     word     :     0.239         0.982       0.24    4.17
    refill    :     0.139         0.490       0.28    3.57

The average speedup is 3.80, or 1.52 times better than expected.

6. Mixed Bag

Here follows a bunch of more or less well-known benchmarks. The 1MLOOP program is from Bernie Mentink.

The "unnest/nest pairs" benchmark tests how long it takes to unwind from 32 million nested colon definitions (no loops!).

The well-known SAVAGE benchmark was written by Bill Savage.

FLOAT is a mix of fp operations.

FPMATH was translated by R. L. Smith from an article in Dr. Dobb's Journal, September 1988.

A long time ago I myself posted the Forth DHRYSTONE and WHETSTONE programs here.


BENCHMARK                           486           Pentium       rel.
--------------------------------------------------------------------
1mloop                              0.121 sec     0.035 sec     3.42
SuperSieve 1e9 -- 1.001e9           4.446 sec     0.693 sec     6.42
(50181 primes found)
32 million nest/unnest pairs        9.434 sec     2.947 sec     3.20
Savage (floating point)             0.075 sec     0.018 sec     4.17
Float (mix of operations)           3.039 sec     0.737 sec     4.12
Fpmath                              2.074 sec     0.579 sec     3.58
Forth Dhrystone                     41631 D/s     166389 D/s    0.25
Forth Whetstone                     14495 KW/s    57208 KW/s    0.25

The average speedup is 4.11, or 1.65 times better than expected. (Note that the Dhrystone and Whetstone figures are rates, not times, so their rel. value of 0.25 corresponds to a speedup of 4.)

CONCLUSION

The average of the "unexpected speedups" is 1.93. This is, however, strongly influenced by the memory move operations and the push/pop loop experiment, which are fairly atypical for mainstream Forth. I seem to remember that Intel claims that the Pentium is about 70% faster than an i486 for a mix of operations.
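
The influence of those two experiments can be estimated with a short calculation. The Python sketch below is mine, not part of the benchmark code: the factors are the "better than expected" figures quoted at the end of each section, and the choice of which sections to leave out (2 and 3) is my reading of the remark above.

```python
# "Better than expected" factors quoted at the end of each section.
unexpected = {
    "1 Ertl/Maierhofer": 1.80,
    "2 Caches": 2.52,
    "3 Memory moves": 2.37,
    "4 FORTH Inc": 1.70,
    "5 Guy Kelly": 1.52,
    "6 Mixed bag": 1.65,
}

def mean(values):
    values = list(values)
    return sum(values) / len(values)

# Sections treated as untypical for mainstream Forth (assumption).
EXCLUDED = {"2 Caches", "3 Memory moves"}

overall = mean(unexpected.values())
typical = mean(v for k, v in unexpected.items() if k not in EXCLUDED)

print(f"all sections:          {overall:.2f}")
print(f"without sections 2, 3: {typical:.2f}")
```

Leaving out the cache and memory move sections, the remaining four average about 1.67, which is close to the 70% figure recalled above.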

I wonder if the future will bring us a Forth compiler that is truly optimized for the Pentium, and that will show higher speedups than, say, a factor of 2 over a similarly clocked '486.