FFT-Mayer: 04 Oct 1994
FFT size: 131072, run-time = 1743 ms, sum of squared errors = 1.215567E-9
FFT size: 65536, run-time = 825 ms, sum of squared errors = 1.538225E-10
FFT size: 32768, run-time = 382 ms, sum of squared errors = 6.745882E-13
FFT size: 16384, run-time = 135 ms, sum of squared errors = 1.892659E-14
FFT size: 8192, run-time = 67 ms, sum of squared errors = 7.951561E-16
FFT size: 4096, run-time = 31 ms, sum of squared errors = 3.672111E-17
FFT size: 2048, run-time = 15 ms, sum of squared errors = 1.503414E-17
FFT size: 1024, run-time = 6 ms, sum of squared errors = 1.129821E-19
FFT size: 512, run-time = 3 ms, sum of squared errors = 1.015006E-21
FFT size: 256, run-time = 1 ms, sum of squared errors = 2.096234E-22
FFT-Mayer: 04 Oct 1994
FFT size: 131072, run-time = 420 ms, sum of squared errors = 1.215567E-9
FFT size: 65536, run-time = 192 ms, sum of squared errors = 1.538225E-10
FFT size: 32768, run-time = 79 ms, sum of squared errors = 6.745882E-13
FFT size: 16384, run-time = 14 ms, sum of squared errors = 1.892659E-14
FFT size: 8192, run-time = 4 ms, sum of squared errors = 7.951561E-16
FFT size: 4096, run-time = 2 ms, sum of squared errors = 3.672111E-17
FFT size: 2048, run-time = 1 ms, sum of squared errors = 1.503414E-17

The FFT result puts iForth in the following company (fft.tbl, latest result added in 1997):
====================== Time to do 131072 point FFT =======================
    System                  OS             CPU/FPU     CPU   Run       REF
                                                       (MHz) time(sec)
### ----------------------- -------------- ----------- ----- --------- ---
001 HP 9000/J210XC          HP-UX 10.20    PA7200_2CPU 120   0.18      30
002 SGI Onyx                Irix 6.2       MIPS R8000  75    0.2123    28
003 HP 9000/J280            HP-UX 10.20    ----------- ----- 0.24      34
004 AlphaServer 2100 5/250  UNIX V3.2b     DEC 21064   250   0.2529    4
005 AlphaServer 2100 5/250  UNIX V3.2b     DEC 21064   250   0.2542    4
006 AlphaServer 2100 5/250  UNIX V3.2b     DEC 21064   250   0.2787    4
007 HP 9000/J280            HP-UX 10.20    ----------- ----- 0.28      34
008 SGI Origin 200          Irix 6.4       MIPS R10000 180   0.2942    27
009 DEC 2100 4/275          OSF/1 V3.0b    DEC 21064   275   0.3079    2
010 SGI Indigo2             IRIX 6.2       MIPS R10000 195   0.3177    17
011 SGI Indigo2             IRIX 6.2       MIPS R10000 195   0.3437    17
--- ----------------------------------------------------------------------
    AMD Athlon, iForth 1.11 W2K            AMD Athlon  900   0.420
--- ----------------------------------------------------------------------
012 SGI O2                  IRIX 6.3       MIPS R10000 175   0.4432    24
013 SGI O2                  IRIX 6.3       MIPS R10000 175   0.5266    24
014 Sun Ultra 1             Solaris 2.5    UltraSPARC1 167   0.6600    6
015 HP 9000/J210            HP-UX 10.01    PA-RISC     120   0.665     23
016 Sun Ultra 1             Solaris 2.5    UltraSPARC1 143   0.8350    6
017 SGI Challenge S         Irix 6.2       MIPS R4400  200   0.866     25
018 Dell XPS Pro200n        NT 3.51        PentiumPro  200   0.920     20
019 Pentium Pro             Windows 95     PentiumPro  200   0.96      33
020 Brett Station ATX       Linux 2.0.0    PentiumPro  180   0.9850    29
021 Dell XPS Pro200n No opt NT 3.51        PentiumPro  200   0.990     20
First, the FFT benchmark is based on the Fast Hartley Transform and uses conversion routines from (I)FHT to (I)FFT to get conventional results. The FSL (Forth Scientific Library) also contains Fast Hartley routines. When I translated Skip's code into a format compatible with FFT_C.frt, I found that Ron Mayer's code is about two times faster than Skip Carter's implementation (see FFT_4.frt).
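To make the (I)FHT-to-(I)FFT conversion concrete, here is a minimal Python sketch for real input. It assumes the standard identity between the discrete Hartley transform H and the Fourier transform X of a real sequence: Re X[k] = (H[k] + H[N-k])/2 and Im X[k] = (H[N-k] - H[k])/2. The function names are hypothetical, and the transform is a naive O(N^2) sum for readability, not the fast algorithm used by Mayer's code or by FFT.frt.

```python
import cmath
import math


def dht(x):
    """Naive O(N^2) discrete Hartley transform of a real sequence.

    H[k] = sum_j x[j] * (cos(2*pi*j*k/N) + sin(2*pi*j*k/N))
    """
    n = len(x)
    result = []
    for k in range(n):
        acc = 0.0
        for j in range(n):
            angle = 2.0 * math.pi * j * k / n
            acc += x[j] * (math.cos(angle) + math.sin(angle))
        result.append(acc)
    return result


def dft_from_dht(h):
    """Convert a Hartley spectrum of real data into a Fourier spectrum."""
    n = len(h)
    spectrum = []
    for k in range(n):
        hk = h[k]
        hnk = h[(n - k) % n]  # index N-k, wrapping so that k = 0 pairs with itself
        spectrum.append(complex((hk + hnk) / 2.0, (hnk - hk) / 2.0))
    return spectrum


def dft(x):
    """Reference O(N^2) Fourier transform, used only to check the conversion."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * math.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


if __name__ == "__main__":
    samples = [1.0, 2.0, 0.5, -1.0, 3.0, 0.0, -2.5, 4.0]
    via_hartley = dft_from_dht(dht(samples))
    direct = dft(samples)
    worst = max(abs(a - b) for a, b in zip(via_hartley, direct))
    print("largest difference between the two spectra:", worst)
```

The conversion step costs only O(N) additions, which is why a fast Hartley core plus a thin conversion layer can compete with a dedicated FFT.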
Second, by adding a few scaling statements (which were not needed for benchmarking), FFT.frt can now be used as a fast, general FHT, IFHT, FFT and IFFT toolbox. I've added a few test words that instill confidence that the component transforms are working correctly.
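The role of such scaling can be illustrated with a small Python sketch. It assumes that the scaling in question is the usual 1/N normalization: the unscaled Hartley transform applied twice returns the input multiplied by N, so an inverse transform is the forward transform followed by a division by N. The names below are hypothetical and the transform is the naive O(N^2) form, so this only mirrors the kind of round-trip check a test word might perform.

```python
import math


def dht(x):
    """Naive, unscaled discrete Hartley transform of a real sequence."""
    n = len(x)
    return [
        sum(
            x[j] * (math.cos(2.0 * math.pi * j * k / n)
                    + math.sin(2.0 * math.pi * j * k / n))
            for j in range(n)
        )
        for k in range(n)
    ]


def idht(h):
    """Inverse Hartley transform: the same transform, scaled by 1/N."""
    n = len(h)
    return [value / n for value in dht(h)]


def round_trip_error(x):
    """Sum of squared errors after a forward and an inverse transform."""
    restored = idht(dht(x))
    return sum((a - b) ** 2 for a, b in zip(x, restored))


if __name__ == "__main__":
    samples = [0.25 * i - 1.0 for i in range(16)]
    print("sum of squared errors:", round_trip_error(samples))
```

Without the division by N the round trip would return every sample multiplied by 16 here, which does not matter for timing but does matter for a general-purpose toolbox.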
Third, I was puzzled that the speedup from my old 166 MHz P54C to my new 900 MHz AMD Athlon was disappointingly small. Nothing I did to the code improved it.
Run-times in ms:

  vsize   P54C  Athlon
-------+------+-------
 131072   1743     420
  65536    825     192
  32768    382      79
  16384    135      14
   8192     67       4
   4096     31       2
   2048     15       1
   1024      6       0
    512      3       0
    256      1       0
It is obvious that for a vector size of 16K and up the Athlon slows down enormously. Less obvious, but significant, is that the P54C also slows down above 16K. I attribute this fact to the tiny 64K data cache of the Athlon. With single-precision floating-point I can keep the speed difference between the P54C and the Athlon at a factor above 10 even for 32K vectors. Extrapolation says that with 256K of cache, the Athlon might have outrun the current top performer, an HP 9000/J210XC (dual-CPU machine).
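One way to see the slowdown independently of vector size is to divide each run-time by N*log2(N), the operation count an FFT-like algorithm scales with. The Python sketch below does this for the run-times listed above; the normalization is an illustrative choice rather than part of the original benchmark, the names are hypothetical, and the 1 ms timer resolution makes the smallest Athlon entries unreliable, so sizes below 2048 are left out.

```python
import math

# Run-times in ms, copied from the table above: size -> (P54C, Athlon).
RUNTIME_MS = {
    131072: (1743, 420),
    65536: (825, 192),
    32768: (382, 79),
    16384: (135, 14),
    8192: (67, 4),
    4096: (31, 2),
    2048: (15, 1),
}


def ns_per_nlogn(n, ms):
    """Run-time in nanoseconds per unit of N*log2(N) work."""
    return ms * 1e6 / (n * math.log2(n))


def speedup(n):
    """How many times faster the Athlon is than the P54C at size n."""
    p54c_ms, athlon_ms = RUNTIME_MS[n]
    return p54c_ms / athlon_ms


if __name__ == "__main__":
    print(f"{'vsize':>7} {'P54C ns':>8} {'Athlon ns':>10} {'speedup':>8}")
    for n in sorted(RUNTIME_MS, reverse=True):
        p54c_ms, athlon_ms = RUNTIME_MS[n]
        print(
            f"{n:>7} {ns_per_nlogn(n, p54c_ms):>8.0f} "
            f"{ns_per_nlogn(n, athlon_ms):>10.0f} {speedup(n):>8.1f}"
        )
```

With these numbers the Athlon's normalized cost roughly triples between 16K and 128K points (about 61 ns to about 188 ns), while the P54C's grows by about a third (about 589 ns to about 782 ns), and the speedup drops from nearly 10 to just over 4.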
I am interested in tests on Intel PIII for these large vectors. Does the PIII fade away more gracefully (like the P54C seems to do)?