FFT benchmark code

The FFT floating-point benchmark in Forth

Benchmark results

The URL http://hpwww.epfl.ch/bench/bench.FAQ.html contains interesting benchmark source code pointers. This is a port of the "C" FFT mark by Ron Mayer to Forth. The Forth FFT benchmark shows the quality of your compiler's FPU code as compared to what an optimizing "C" compiler would give you. Above are links to the Forth source I used with the iForth compiler. Results for a large number of C compilers on a variety of systems are in the fft.tbl file.

Intel Pentium P54C 166 MHz, 48 MB, iForth 1.11, NT 4.0

FFT-Mayer: 04 Oct 1994
FFT size: 131072, run-time =  1743 ms, sum of squared errors = 1.215567E-9
FFT size:  65536, run-time =   825 ms, sum of squared errors = 1.538225E-10
FFT size:  32768, run-time =   382 ms, sum of squared errors = 6.745882E-13
FFT size:  16384, run-time =   135 ms, sum of squared errors = 1.892659E-14
FFT size:   8192, run-time =    67 ms, sum of squared errors = 7.951561E-16
FFT size:   4096, run-time =    31 ms, sum of squared errors = 3.672111E-17
FFT size:   2048, run-time =    15 ms, sum of squared errors = 1.503414E-17
FFT size:   1024, run-time =     6 ms, sum of squared errors = 1.129821E-19
FFT size:    512, run-time =     3 ms, sum of squared errors = 1.015006E-21
FFT size:    256, run-time =     1 ms, sum of squared errors = 2.096234E-22

AMD Athlon 900 MHz, 128 MB, iForth 1.11, Windows 2000

FFT-Mayer: 04 Oct 1994
FFT size: 131072, run-time =   420 ms, sum of squared errors = 1.215567E-9
FFT size:  65536, run-time =   192 ms, sum of squared errors = 1.538225E-10
FFT size:  32768, run-time =    79 ms, sum of squared errors = 6.745882E-13
FFT size:  16384, run-time =    14 ms, sum of squared errors = 1.892659E-14
FFT size:   8192, run-time =     4 ms, sum of squared errors = 7.951561E-16
FFT size:   4096, run-time =     2 ms, sum of squared errors = 3.672111E-17
FFT size:   2048, run-time =     1 ms, sum of squared errors = 1.503414E-17

The FFT result puts iForth in the following company (fft.tbl, latest result added in 1997):

====================== Time to do 131072 point FFT =======================
    System                  OS               CPU/FPU    CPU     Run    REF
						       (MHz) time(sec)
### ----------------------- -------------- ----------- ----- --------- ---
001 HP 9000/J210XC          HP-UX 10.20    PA7200_2CPU   120    0.18    30
002 SGI Onyx                Irix 6.2       MIPS R8000     75    0.2123  28
003 HP 9000/J280            HP-UX 10.20    ----------- -----    0.24    34
004 AlphaServer 2100 5/250  UNIX V3.2b     DEC 21064     250    0.2529   4
005 AlphaServer 2100 5/250  UNIX V3.2b     DEC 21064     250    0.2542   4
006 AlphaServer 2100 5/250  UNIX V3.2b     DEC 21064     250    0.2787   4
007 HP 9000/J280            HP-UX 10.20    ----------- -----    0.28    34
008 SGI Origin 200          Irix 6.4       MIPS R10000   180    0.2942  27
009 DEC 2100 4/275          OSF/1 V3.0b    DEC 21064     275    0.3079   2
010 SGI Indigo2             IRIX 6.2       MIPS R10000   195    0.3177  17
011 SGI Indigo2             IRIX 6.2       MIPS R10000   195    0.3437  17
--- ----------------------------------------------------------------------
    AMD Athlon, iForth 1.11 W2K            AMD Athlon    900    0.420
--- ----------------------------------------------------------------------
012 SGI O2                  IRIX 6.3       MIPS R10000   175    0.4432  24
013 SGI O2                  IRIX 6.3       MIPS R10000   175    0.5266  24
014 Sun Ultra 1             Solaris 2.5    UltraSPARC1   167    0.6600   6
015 HP 9000/J210            HP-UX 10.01    PA-RISC       120    0.665   23
016 Sun Ultra 1             Solaris 2.5    UltraSPARC1   143    0.8350   6
017 SGI Challenge S         Irix 6.2       MIPS R4400    200    0.866   25
018 Dell XPS Pro200n        NT 3.51        PentiumPro    200    0.920   20
019 Pentium Pro             Windows 95     PentiumPro    200    0.96    33
020 Brett Station ATX       Linux 2.0.0    PentiumPro    180    0.9850  29
021 Dell XPS Pro200n No opt NT 3.51        PentiumPro    200    0.990   20

Notes

There are several interesting aspects to this benchmark.

First, the FFT mark is based on the Fast Hartley Transform and uses conversion routines from (I)FHT to (I)FFT to get conventional results. The FSL also contains Fast Hartley routines. When I translated Skip's code in a format compatible with FFT_C.frt, I found that Ron Mayer's code is about two times faster than Skip Carter's implementation (see FFT_4.frt).

Second, by adding a few scaling statements (which were not needed for benchmarking), FFT.frt can now be used as a fast, general, FHT, IFHT, FFT and IFFT toolbox. I've added a few test words that instill confidence that the component transforms are working correctly.

Third, it puzzled me that the difference in runtime between my old P54C 166 MHz and new AMD Athlon 900 MHz was rather disappointing. Nothing I did to the code was able to improve it.

 vsize   P54C   Athlon
-------+------+-------
 131072  1743 	 420
  65536   825 	 192
  32768   382 	  79
  16384   135 	  14
   8192    67 	   4
   4096    31 	   2
   2048    15 	   1
   1024     6 	   0
    512     3 	   0
    256     1 	   0

It is obvious that for a vector size of 16K and up the Athlon slows down enormously. Less obvious, but significant, is that the P54C also slows down above 16K. I attribute this fact to the tiny 64K data cache of the Athlon. With single-precision floating-point I can keep the speed difference between the P54C and the Athlon to a factor above 10 even for 32K vectors. Extrapolation says that with 256K of cache, the Athlon might have outrun the current top-performer, a HP 9000/J210XC (dual-cpu machine).

I am interested in tests on Intel PIII for these large vectors. Does the PIII fade away more gracefully (like the P54C seems to do)? free counter