TIMEMOD benchmark codes and driver files

As announced on CLF.

iForth timemod.frt
VFX timemod.fth
(Linux) Gforth 0.5.0 timemod.fs and Gforth 0.5.0 fsl_util.fs
(Win32/DOS) Gforth 0.4.9 timemod.fs and Gforth 0.4.9 fsl_util.fs_4
(Win32/DOS) Gforth 0.5.0 fsl_util.fs_5 (use with the above timemod.fs)
SwiftForth timemod.f and SwiftForth fsl_util.f
Win32Forth timemod.f32 and Win32Forth fsl_util.f32

Win32Forth: rename timemod.f32 and fsl_util.f32 to timemod.f and fsl_util.f before use. Do not confuse with the two files for SwiftForth. Likewise: rename timemod.fs_4 to timemod.fs etc.

MONSTER benchmark codes

The Forth MONSTER benchmark

Benchmark results

This is a collection of 65 benchmarks I've been using since 1991 to develop the iForth compiler. Not all of the code is portable. In places the Forth assembler is used, in other places special tricks of the iForth compiler come into play. However, there's only one benchmark, GAUSSJ.frt, that I suspect no system can run. The type of benchmarks includes integer and floating-point algorithms, input/output (file I/O) performance and interpreter and compiler speed tests. There are even two programs that run the interpreter in a time-critical piece of compiled code.

Sometimes the code is written in a way to expose gotchas of the Intel x86 line of CPUs. Also, the code might favor the AMD Athlon in some places (i.e. run slower than necessary on lesser CPUs). However, I feel the benchmarks are still interesting enough across the board.

I do not excuse myself for not writing portable code -- I do what is needed to get the job done. Hopefully this vast amount of sometimes tricky Forth code is just what the advanced student needs.

Of course I didn't write all these benchmarks myself. Many programs are taken from comp.lang.forth or the Forth archives at taygeta. Many come from magazine articles (Like Forth Dimensions). Many are ported from well-known 'C' or FORTRAN benchmarks. Some of the benchmarks are interesting by themselves (e.g. there is a program to compute all 64-bit Persistent Numbers).

These Forth benchmarks shows the quality of your compiler's integer, floating-point and I/O code as compared to what iForth would give you. Results are available in the monster.tbl file.

Selected Benchmark Results

Run the benchmarks by including the master file BDRIVER.FRT and then typing ALL-TESTS .

AMD Athlon 900 MHz system, 128 MB RAM, Windows 2000

BENCHMARK t2 "tak.frt" The famous three-way recurse useless code.

********** BENCHMARK: "tak.frt" run at 00:14:41, April 16, 2001

1000 times   TAK, 2.471 seconds elapsed. Result = 7 (7)
1000 times  TAKK, 2.122 seconds elapsed. Result = 7 (7)
1000 times  TACK, 1.716 seconds elapsed. Result = 7 (7)
1000 times TAKKK, 0.000 seconds elapsed. Result = 7 (7)

BENCHMARK t38 "sievef.frt" You know the Sieve of Eratosthenes? But did you ever try to run it on the FPU?

********** BENCHMARK: "sievef.frt" run at 00:16:35, April 16, 2001

1000 iterations.
1899.000000 primes found.
2.158 seconds elapsed.

BENCHMARK t13 "dsieve.frt" This is a parallel implementation of the Sieve of Eratosthenes. It runs faster even on a single CPU (portable only to Forths that can access OS threads).

********** BENCHMARK: "dsieve.frt" run at 00:14:56, April 16, 2001

1 iteration, non-threaded.
539776 primes found.
2.423 seconds elapsed.

1 iteration, threaded.
539776 primes found.
1.765 seconds elapsed.

1 iteration, threaded.
539776 primes found.
1.768 seconds elapsed.

BENCHMARK t12 "dhryston.frt" The Dhrystone benchmark.

********** BENCHMARK: "dhryston.frt" run at 00:14:55, April 16, 2001

Starting Dhrystone, 100000 loops ... 0.083 seconds elapsed.
That is 1204819 Dhrystones/sec.

BENCHMARK t46 "whetvsn2.frt" The Whetstone (FP) benchmark.

********** BENCHMARK: "whetvsn2.frt" run at 00:16:47, April 16, 2001

Start .. 0.347 seconds elapsed... that makes 576,368 KWhet/sec.

BENCHMARK t18 "fastread.frt" Tests the speed of your Forth's file I/O.

********** BENCHMARK: "fastread.frt" run at 00:15:24, April 16, 2001


zygote
0.126 seconds elapsed. 128 (maxtest)
zygote
0.022 seconds elapsed. 128 (test-mapversion)
zygote
0.097 seconds elapsed. 128 (test-boyer-moore-version)

2.07 milliseconds, 7  (test-TAK)


260001 lines in 0.087 seconds elapsed. (lcount)

1.122 seconds elapsed. (test-invert)

0.090 seconds elapsed. (1024 read-files)
0.238 seconds elapsed. (  10 read-files)
0.374 seconds elapsed. (1024 read-files)
0.375 seconds elapsed. (  10 read-files)
0.479 seconds elapsed. (   9 read-files)
0.094 seconds elapsed. (1100 read-files)
0.231 seconds elapsed. (  11 read-files)
0.377 seconds elapsed. (1100 read-files)
0.375 seconds elapsed. (  11 read-files)
0.375 seconds elapsed. (  10 read-files)

BENCHMARK t19 "fd92.frt" This benchmark comes from Forth Dimensions 1992.

********** BENCHMARK: "fd92.frt" run at 00:15:29, April 16, 2001

*** Benchmark from FD March/April 1992, Guy Kelly ***
   Empty      : 0.001 seconds elapsed.
   Thread     : 0.025 seconds elapsed.
   Nest1      : 0.030 seconds elapsed.
   Nest2      : 0.004 seconds elapsed.
   Prims      : 0.002 seconds elapsed.
   Sieve      : 0.004 seconds elapsed.
   Loads      : 0.005 seconds elapsed.
   Comp       : 0.001 seconds elapsed.
   C prim     : 0.062 seconds elapsed.
   C sec      : 0.070 seconds elapsed.
  rd+wrd+fnd  : 0.064 seconds elapsed.
read + <word> : 0.052 seconds elapsed.
   <word>     : 0.048 seconds elapsed.
    word      : 0.030 seconds elapsed.
    refill    : 0.192 seconds elapsed.

BENCHMARK t30 "many_sorts.frt" Tests different sorting algorithms. 'New' Forth.

********** BENCHMARK: "many_sorts.frt" run at 00:16:20, April 16, 2001

        Sedge sort   - 100000 elements, 68 ms (needs 14874 passes)
        Quicker sort - 100000 elements, 70 ms (needs 60746 passes)
        Quick sort   - 100000 elements, 65 ms (needs 86326 passes)
        Comb sort    - 100000 elements, 219 ms (needs 43 passes)
        Heap sort    - 100000 elements, 191 ms (needs 1624893 passes)
        Shell sort   - 100000 elements, 299 ms (needs 2793058 passes)
        SIS sort     -  10000 elements, 787 ms (needs 25161320 passes)
        Bubble sort  -  10000 elements, 3063 ms (needs 9888 passes)

BENCHMARK t41 "sortsc.frt" Tests different sorting algorithms. 'Old' Forth.

********** BENCHMARK: "sortsc.frt" run at 00:16:38, April 16, 2001

Evaluating NR-QUICK
Test       Dict RAM   Fetches    Stores   Compares    Time   Score  Maximum Average
-----------------------------------------------------------------------------------
RAMP     498000   0    622602     40959     557070    0.05    3.66    3.66    3.66
SLOPE    498000   0    663579     81922     557083    0.05    3.89    3.89    3.77
WILD     498000   0   1047912    341600     678763    0.09    6.24    6.24    4.58
SHUFFLE  498000   0   1056320    343432     685038    0.09    6.28    6.28    5.01
BYTE     498000   0    994683    370544     603589    0.09    5.92    6.34    5.19
FLAT     498000   0   1138687    548866     565248    0.10    6.74    7.03    5.44
CHECKER  498000   0   1101497    518212     561480    0.09    6.52    7.03    5.59
HUMP     498000   0   1007353    391671     595322    0.09    5.99    7.03    5.64

Evaluating R-QUICK
Test       Dict RAM   Fetches    Stores   Compares    Time   Score  Maximum Average
-----------------------------------------------------------------------------------
RAMP     498000   0    671761     49152     598033    0.05    3.93    3.93    3.93
SLOPE    498000   0    712737     90112     598048    0.06    4.17    4.17    4.04
WILD     498000   0   1183112    317790     829149    0.10    6.98    6.98    5.02
SHUFFLE  498000   0   1187685    312742     838541    0.10    7.02    7.03    5.51
BYTE     498000   0   1170198    430714     709739    0.10    6.90    7.31    5.79
FLAT     498000   0   1261567    614400     614400    0.10    7.39    7.94    6.05
CHECKER  498000   0   1233009    581124     622058    0.10    7.23    7.94    6.21
HUMP     498000   0   1156347    455738     670708    0.10    6.80    7.94    6.29

Evaluating HIBBARD
Test       Dict RAM   Fetches    Stores   Compares    Time   Score  Maximum Average
-----------------------------------------------------------------------------------
RAMP     498000   0   1064996         0     532498    0.06    4.72    4.72    4.72
SLOPE    498000   0   2135532    576172     779680    0.15   10.34   10.34    7.53
WILD     498000   0   4394402   1688406    1352998    0.32   22.06   22.06   12.38
SHUFFLE  498000   0   4465416   1722632    1371392    0.32   22.41   22.41   14.88
BYTE     498000   0   2743320    862002     940659    0.19   13.49   22.41   14.60
FLAT     498000   0   1064996         0     532498    0.06    4.71   22.41   12.94
CHECKER  498000   0   1085478     20480     532499    0.06    4.83   22.41   11.78
HUMP     498000   0   2412754    696712     858021    0.17   11.77   22.41   11.78

BENCHMARK t37 "savage.frt" Bill Savage wrote this FPU test ages ago.

********** BENCHMARK: "savage.frt" run at 00:16:33, April 16, 2001

Test with MaxLoop = 2500 ... 0.002 seconds elapsed.
Error = -8.69931682245806E-010
Test with MaxLoop = 25000 ... 0.017 seconds elapsed.
Error = -1.04020728031173E-007
Test with MaxLoop = 250000 ... 0.162 seconds elapsed.
Error = -1.24108046293259E-005
Test with MaxLoop = 2500000 ... 1.624 seconds elapsed.
Error = -1.42431538552046E-003

BENCHMARK t47 "clinpack/4linpack.frt" The LINPACK suite in Forth.

********** BENCHMARK: "clinpack/4linpack.frt" run at 00:16:47, April 16, 2001

Unrolled DP precision LINPACK

  norm. resid        resid          machep          x[0]-1        x[n-1]-1
       1        8.21565038E-14  2.22044605E-16 -1.49880108E-14 -1.89848137E-14
times are reported for matrices of order 100
      dgefa      dgesl      total       kflops      unit      ratio
times for array with leading dimension of 201
      0.002      0.000      0.002      343333      0.006      0.036
      0.002      0.000      0.002      343333      0.006      0.036
      0.003      0.000      0.003      228888      0.009      0.054
      0.002   8.000E-5      0.002      305184      0.007      0.040
times for array with leading dimension of 200
      0.002      0.000      0.002      343333      0.006      0.036
      0.002      0.000      0.002      343333      0.006      0.036
      0.002      0.000      0.002      343333      0.006      0.036
      0.002   8.000E-5      0.002      305184      0.006      0.038
Unrolled, DP Precision, 305185 Kflops; 100 Reps

BENCHMARK t53 "tfftdp/tfftdp.frt" A special FFT implementation. Very ugly Forth.

********** BENCHMARK: "fft/fft_c.frt" run at 00:16:48, April 16, 2001

FFT-Mayer: 04 Oct 1994
FFT size: 131072, run-time =   411 ms [1.000], sum of squared errors = 1.219198E-9
FFT size:  65536, run-time =   182 ms [0.940], sum of squared errors = 1.554338E-10
FFT size:  32768, run-time =    47 ms [0.518], sum of squared errors = 6.390517E-13
FFT size:  16384, run-time =    13 ms [0.307], sum of squared errors = 1.955693E-14
FFT size:   8192, run-time =     5 ms [0.254], sum of squared errors = 7.584755E-16
FFT size:   4096, run-time =     2 ms [0.220], sum of squared errors = 3.386117E-17
FFT size:   2048, run-time =     1 ms [0.240], sum of squared errors = 1.507175E-17
FFT size:   1024, run-time =     0 ms [0.000], sum of squared errors = 1.101705E-19
FFT size:    512, run-time =     0 ms [0.000], sum of squared errors = 9.511414E-22
FFT size:    256, run-time =     0 ms [0.000], sum of squared errors = 1.933452E-22
FFT size:    128, run-time =     0 ms [0.000], sum of squared errors = 3.258967E-23
FFT size:     64, run-time =     0 ms [0.000], sum of squared errors = 3.255156E-26
FFT size:     32, run-time =     0 ms [0.000], sum of squared errors = 9.150787E-29
FFT size:     16, run-time =     0 ms [0.000], sum of squared errors = 1.459393E-28
FFT size:      8, run-time =     0 ms [0.000], sum of squared errors = 0.000000
FFT size:      4, run-time =     0 ms [0.000], sum of squared errors = 0.000000

BENCHMARK t55 "MPEBench/benchm.frt" Not Really The MPE Benchmarks. Slight modifications to frustrate compilers that inline everything in sight. (Not so successful, VFX is still some 20% faster than iForth sometimes).

********** BENCHMARK: "MPEBench/benchm.frt" run at 00:18:02, April 16, 2001

This system's primitives using extensions

Test time including overhead               ms     times     ns (each)
DO LOOP                                     7     1000000       7
+                                           8     1000000       8
M+                                         15     1000000      15
*                                           8     1000000       8
/                                          55     1000000      55
M*                                         10     1000000      10
M/                                         55     1000000      55
/MOD                                       57     1000000      57
*/                                         65     1000000      65
ARRAY fill                                 10     1000000      10
Total:                                    295     1

This system's application performance using extensions

Test time including overhead               ms     times     ns (each)
Eratosthenes sieve 1899 Primes            451     8190000      55
Fibonacci recursion ( 34 -> 5702887 )     471     5702853      82
Hoare's quick sort (reverse order)        267     1000000     267
Generate random numbers (1024 kb array)   274     262144     1045
LZ77 Comp. (100 kb Random Data Mem>Mem)    99     1
Dhrystone (integer)                       855     1000000     855     1169590 Dhrystones/sec
Total:                                   2422     1

BENCHMARK t8 "compile-rtime1.frt" Here the interpreter runs while a compiled definition is executing. This can be very efficient if the compiler is either very dumb or every smart.

********** BENCHMARK: "compile-rtime1.frt" run at 00:14:50, April 16, 2001

0.562 seconds elapsed.

BENCHMARK t56 "mmul/mm.frt" Matrix-multiplication benchmark. Your compiler will be 10 to 100 times slower. If not, we have to talk :-)

********** BENCHMARK: "mmul/mm.frt" run at 00:18:05, April 16, 2001

CLK 902 MHz
500x500 mm - normal algorithm                           31.67 MFlops,  28.48 ticks/flop,   7.893 s
500x500 mm - blocking, factor of 20                    191.43 MFlops,   4.71 ticks/flop,   1.305 s
500x500 mm - transposed B matrix                       121.36 MFlops,   7.43 ticks/flop,   2.059 s
500x500 mm - transposed B matrix #2                    123.49 MFlops,   7.30 ticks/flop,   2.024 s
500x500 mm - Robert's algorithm                        121.93 MFlops,   7.39 ticks/flop,   2.050 s
500x500 mm - T. Maeno's algorithm, subarray 20x20      189.16 MFlops,   4.76 ticks/flop,   1.321 s
500x500 mm - D. Warner's algorithm, subarray 20x20     134.11 MFlops,   6.72 ticks/flop,   1.864 s
500x500 mm - generic mat*                              456.26 MFlops,   1.97 ticks/flop,   0.547 s
500x500 mm - iForth DGEMM1                             456.50 MFlops,   1.97 ticks/flop,   0.547 s
CLK 897 MHz
120x120 mm - normal algorithm                          271.84 MFlops,   3.29 ticks/flop,  12.713 ms
120x120 mm - blocking, factor of 20                    215.91 MFlops,   4.15 ticks/flop,  16.006 ms
120x120 mm - transposed B matrix                       336.24 MFlops,   2.66 ticks/flop,  10.278 ms
120x120 mm - transposed B matrix #2                    388.64 MFlops,   2.30 ticks/flop,   8.892 ms
120x120 mm - Robert's algorithm                        366.07 MFlops,   2.45 ticks/flop,   9.440 ms
120x120 mm - T. Maeno's algorithm, subarray 20x20      226.45 MFlops,   3.96 ticks/flop,  15.261 ms
120x120 mm - D. Warner's algorithm, subarray 20x20     158.53 MFlops,   5.65 ticks/flop,  21.799 ms
120x120 mm - generic mat*                              685.61 MFlops,   1.30 ticks/flop,   5.040 ms
120x120 mm - iForth DGEMM1                             685.52 MFlops,   1.30 ticks/flop,   5.041 ms
CLK 895 MHz
60x60 mm - normal algorithm                            345.33 MFlops,   2.59 ticks/flop,   1.250 ms
60x60 mm - blocking, factor of 20                      220.67 MFlops,   4.05 ticks/flop,   1.957 ms
60x60 mm - transposed B matrix                         321.31 MFlops,   2.78 ticks/flop,   1.344 ms
60x60 mm - transposed B matrix #2                      385.03 MFlops,   2.32 ticks/flop,   1.121 ms
60x60 mm - Robert's algorithm                          373.78 MFlops,   2.39 ticks/flop,   1.155 ms
60x60 mm - T. Maeno's algorithm, subarray 20x20        229.38 MFlops,   3.90 ticks/flop,   1.883 ms
60x60 mm - D. Warner's algorithm, subarray 20x20       158.54 MFlops,   5.64 ticks/flop,   2.724 ms
60x60 mm - generic mat*                                691.80 MFlops,   1.29 ticks/flop,   0.624 ms
60x60 mm - iForth DGEMM1                               691.30 MFlops,   1.29 ticks/flop,   0.624 ms

4x4 mm - generic mat*                                  216.64 MFlops,   4.13 ticks/flop,   0.000 us
5x5 mm - generic mat*                                  346.84 MFlops,   2.58 ticks/flop,   0.000 us
6x6 mm - generic mat*                                  484.69 MFlops,   1.84 ticks/flop,   0.000 us
7x7 mm - generic mat*                                  521.59 MFlops,   1.71 ticks/flop,   1.000 us
8x8 mm - generic mat*                                  628.28 MFlops,   1.42 ticks/flop,   1.000 us
9x9 mm - generic mat*                                  625.26 MFlops,   1.43 ticks/flop,   2.000 us
10x10 mm - generic mat*                                694.48 MFlops,   1.28 ticks/flop,   2.000 us
11x11 mm - generic mat*                                748.74 MFlops,   1.19 ticks/flop,   3.000 us
12x12 mm - generic mat*                                782.20 MFlops,   1.14 ticks/flop,   4.000 us
13x13 mm - generic mat*                                790.87 MFlops,   1.13 ticks/flop,   5.000 us
14x14 mm - generic mat*                                831.73 MFlops,   1.07 ticks/flop,   6.000 us
15x15 mm - generic mat*                                836.65 MFlops,   1.06 ticks/flop,   8.000 us
16x16 mm - generic mat*                                867.92 MFlops,   1.03 ticks/flop,   9.000 us
17x17 mm - generic mat*                                886.63 MFlops,   1.00 ticks/flop,  11.000 us
18x18 mm - generic mat*                                886.42 MFlops,   1.00 ticks/flop,  13.000 us
19x19 mm - generic mat*                                912.80 MFlops,   0.98 ticks/flop,  15.000 us
20x20 mm - generic mat*                                924.73 MFlops,   0.96 ticks/flop,  17.000 us
21x21 mm - generic mat*                                915.25 MFlops,   0.97 ticks/flop,  20.000 us
22x22 mm - generic mat*                                914.60 MFlops,   0.97 ticks/flop,  23.000 us
23x23 mm - generic mat*                                910.64 MFlops,   0.98 ticks/flop,  26.000 us
24x24 mm - generic mat*                                919.00 MFlops,   0.97 ticks/flop,  30.000 us
25x25 mm - generic mat*                                921.05 MFlops,   0.97 ticks/flop,  33.000 us
26x26 mm - generic mat*                                866.73 MFlops,   1.03 ticks/flop,  40.000 us
27x27 mm - generic mat*                                922.13 MFlops,   0.97 ticks/flop,  42.000 us
28x28 mm - generic mat*                                925.65 MFlops,   0.96 ticks/flop,  47.000 us
29x29 mm - generic mat*                                920.12 MFlops,   0.97 ticks/flop,  53.000 us
30x30 mm - generic mat*                                898.29 MFlops,   0.99 ticks/flop,  60.000 us
31x31 mm - generic mat*                                888.08 MFlops,   1.00 ticks/flop,  67.000 us
32x32 mm - generic mat*                                915.42 MFlops,   0.97 ticks/flop,  71.000 us

BENCHMARK t57 "../misc/floyd.frt" This is Floyd's algorithm (combinatorial logic).

********** BENCHMARK: "../misc/floyd.frt" run at 00:18:51, April 16, 2001

20 microseconds per shuffle
23 microseconds per shuffle
4 microseconds per shuffle3
14 microseconds per shuffle4

BENCHMARK t60 "../misc/golygonc.frt" A geometric problem.

********** BENCHMARK: "../misc/golygonc.frt" run at 00:19:02, April 16, 2001

48-sided Golygon.
246803720 golygons 4.180 seconds.

BENCHMARK t62 "../misc/md5-nwr.frt" The MD5 protocol.

********** BENCHMARK: "../misc/md5-nwr.frt" run at 00:19:07, April 16, 2001

1000 times 10,000 spaces ... 77 us/iteration, result: f38898bb69bb02bccb9594dfe471c5c0

BENCHMARK t65 "../misc/my-sha.frt" The SHA protocol.

********** BENCHMARK: "../misc/my-sha.frt" run at 00:19:12, April 16, 2001

SHA-1 test for EX1 for 100000 loops, 1 microseconds/iteration
SHA-1 test for EX2 for 100000 loops, 2 microseconds/iteration
SHA-1 test for EX3 for 10 loops, 14100 microseconds/iteration, 70.921 MBytes/sec.

BENCHMARK t66 "../misc/pento2.frt" Another geometric problem, or better, puzzle.

********** BENCHMARK: "../misc/pento2.frt" run at 00:19:14, April 16, 2001
Solutions to the Pentomino Puzzle by Exhaustive Search

Total solutions = 2339 Total pieces tried = 2455939, 0.964 seconds elapsed.

BENCHMARK t67 "mullen3.frt" This searches for 64-bit Persistent Numbers.

********** BENCHMARK: "mullen3.frt" run at 00:19:15, April 16, 2001

N =  2, MAX_L =  4, DISTINCT = 55           (0.000 seconds elapsed)
N =  3, MAX_L =  5, DISTINCT = 220          (0.000 seconds elapsed)
N =  4, MAX_L =  6, DISTINCT = 715          (0.001 seconds elapsed)
N =  5, MAX_L =  7, DISTINCT = 2,002        (0.004 seconds elapsed)
N =  6, MAX_L =  7, DISTINCT = 5,005        (0.012 seconds elapsed)
N =  7, MAX_L =  8, DISTINCT = 11,440       (0.029 seconds elapsed)
N =  8, MAX_L =  9, DISTINCT = 24,310       (0.069 seconds elapsed)
N =  9, MAX_L =  9, DISTINCT = 48,620       (0.153 seconds elapsed)
N = 10, MAX_L = 10, DISTINCT = 92,378       (0.313 seconds elapsed)
N = 11, MAX_L = 10, DISTINCT = 167,960      (0.619 seconds elapsed)
N = 12, MAX_L = 10, DISTINCT = 293,930      (1.155 seconds elapsed)
N = 13, MAX_L = 10, DISTINCT = 497,420      (2.113 seconds elapsed)
N = 14, MAX_L = 10, DISTINCT = 817,190      (3.635 seconds elapsed)
N = 15, MAX_L = 11, DISTINCT = 1,307,504    (6.270 seconds elapsed)
N = 16, MAX_L = 11, DISTINCT = 2,042,975    (10.157 seconds elapsed)
MAX-LENGTH = 11, champion = 277777788888899

free counter