The code presented here is not a verbatim translation of Mark Smotherman's original "C" code. In converting it I have favored constructs that fit the Forth language better, but I did not try to change the core ideas of the original algorithms. I should stress that I actually use these algorithms in my numerical work; this is not an artificial benchmark.
Shown below is the result of the ALL-TESTS word. This word prints the number of floating-point operations executed per second. As this depends on the CPU speed, I also added the average number of clock ticks needed per FP operation (F* and F+ in this case). On a current x86 CPU with an exceptionally good compiler the best result you may find is 0.5 ticks/flop; 2 to 6 ticks/flop is already very good. The tick count is a useful figure for comparing (with some care) wild mixes of languages, compilers, CPUs and platforms.
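The two figures can be reproduced from a run time by hand. The following is a minimal sketch in C, not the code behind ALL-TESTS: it assumes that an n x n multiply is counted as 2*n^3 operations (one F* and one F+ per inner-loop step), and the function names are made up for illustration.

```c
#include <assert.h>

/* Operations counted for an n x n matrix multiply:
   n^3 multiplications plus n^3 additions (assumed counting rule). */
double mm_flops(int n)
{
    assert(n > 0);
    return 2.0 * n * n * n;
}

/* Millions of floating-point operations per second. */
double mflops(int n, double seconds)
{
    assert(seconds > 0.0);
    return mm_flops(n) / seconds / 1.0e6;
}

/* Average CPU clock ticks spent per floating-point operation. */
double ticks_per_flop(int n, double seconds, double clock_mhz)
{
    return clock_mhz * 1.0e6 * seconds / mm_flops(n);
}
```

As a check, a 500x500 run that takes 1.127 s on a 2998 MHz clock works out to roughly 221.8 MFlops and 13.5 ticks/flop, which agrees with the corresponding line in the results below up to rounding of the printed time. Note that ticks/flop is simply the clock rate in MHz divided by the MFlops figure.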
The results of the several Forth compilers can be compared with the results for optimized "C" code generated with Microsoft's C++ 4.0. It was not trivial to generate correct code with all the various optimization switches set to the optimal position (mm.c was actually incorrect in a subtle way). You will note that the iForth code for MAT* easily beats C for most algorithms. The exceptions, however, are the MAENO and WARNER algorithms. These are ideally suited to the register allocation strategies of a good C compiler, and it shows! Not all is well, though. In the mm.c file I've added two alternative algorithms that make 'minor' modifications to WARNER. These 'minor' changes (removal of local variables) make the code about twice as slow. Also note that only the NORMAL and BLOCKING algorithms are hassle-free in use. Both WARNER and MAENO have restrictions on the array size. The iForth MAT* operator has no known restrictions and doesn't need a transposed copy of matrix B. One other remark: the original C code had all arrays allocated statically. This is, of course, unusable in practice, so I've modified the code to allocate dynamically. This increases run-time by about 10% (a previous version of these pages erroneously showed the results of the unrealistic "C" implementation).
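For readers who do not have mm.c at hand, here is a minimal sketch of what the 'normal' and 'transposed B' variants look like when the matrices are allocated dynamically. It is not Smotherman's code: the flat row-major layout, the function names and the error convention are assumptions made for this illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* 'Normal' algorithm: c = a * b for n x n row-major matrices
   that the caller has allocated (for example with malloc). */
void mm_normal(int n, const double *a, const double *b, double *c)
{
    assert(n > 0);
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += a[i*n + k] * b[k*n + j];   /* strided walk down a column of b */
            c[i*n + j] = sum;
        }
}

/* 'Transposed B' variant: copy b into bt first so that the inner loop
   reads both operands with unit stride.
   Returns 0 on success, -1 if the temporary copy cannot be allocated. */
int mm_transposed(int n, const double *a, const double *b, double *c)
{
    assert(n > 0);
    double *bt = malloc((size_t)n * (size_t)n * sizeof *bt);
    if (bt == NULL)
        return -1;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            bt[j*n + i] = b[i*n + j];
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += a[i*n + k] * bt[j*n + k];  /* both operands contiguous */
            c[i*n + j] = sum;
        }
    free(bt);
    return 0;
}
```

The extra copy in mm_transposed is the 'transposed copy of matrix B' mentioned above; it costs n*n additional storage and one pass over B.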
With 'r' MEGAFLOPS the software runs the specified algorithm (here 'r') for a number of steadily increasing matrix sizes. On my hardware a very severe performance breakdown is noticeable above a certain size. With n TO N ALL-TESTS you run the tests with a specified size n. I found that very slight changes in n (e.g. from 74 to 75) can cause a speed difference as large as a factor of 2. This has to do with the discrete unroll factors in the BLAS level I routines (no code shown).
My conclusion is that at the moment there is no good FP support in most of the Forth compilers tested. The generated code varies from abysmal to mediocre, compared to what a good "C" compiler offers, or what can be done by adding native BLAS support (i.e. the iForth MAT* results). In all cases you should opt for the hardware FPU stack; it is between 2 and 3 times faster than software emulation. (IMHO the hardware stack option offered by SwiftForth is not usable because it is too easy to overflow the FPU stack in high-level Forth. E.g. there are FSL algorithms that use more than 8 FP stack positions.)
Unless you are using iForth it doesn't matter much which algorithm is used. In iForth the generic MAT* algorithm (which uses Pentium-optimized BLAS) cannot be beaten. When commercial compilers get better, the 'blocking' algorithm with size 20 will be the best choice when N goes above about 100. For smaller matrices the 'normal' algorithm can perform spectacularly better.
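To make the 'blocking' idea concrete, the following is a rough C sketch with a block size of 20. It is an assumption-laden illustration, not the routine from mm.c or the Forth source: the loop order, the row-major layout and the names BLOCK, min_int and mm_blocking are all chosen here for clarity.

```c
#include <assert.h>

#define BLOCK 20   /* block size, as in the 'blocking, factor of 20' runs */

static int min_int(int x, int y) { return x < y ? x : y; }

/* c = a * b for n x n row-major matrices, processed in BLOCK-wide tiles
   so that the part of b being reused stays small enough to sit in cache. */
void mm_blocking(int n, const double *a, const double *b, double *c)
{
    assert(n > 0);
    for (int i = 0; i < n * n; i++)
        c[i] = 0.0;

    for (int kk = 0; kk < n; kk += BLOCK) {
        int kmax = min_int(kk + BLOCK, n);
        for (int jj = 0; jj < n; jj += BLOCK) {
            int jmax = min_int(jj + BLOCK, n);
            /* Accumulate the contribution of one tile of b into c. */
            for (int i = 0; i < n; i++)
                for (int j = jj; j < jmax; j++) {
                    double sum = c[i*n + j];
                    for (int k = kk; k < kmax; k++)
                        sum += a[i*n + k] * b[k*n + j];
                    c[i*n + j] = sum;
                }
        }
    }
}
```

Because the tile limits are clipped with min_int, this version accepts any n. For small matrices the whole of b fits in cache anyway, so the extra loop overhead buys nothing, which is consistent with the 'normal' algorithm doing better there.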
Please send (updated) results to mhx@iae.nl. Be sure to mention the Forth version, CPU, clock speed and OS.
500x500 mm - normal algorithm 5.00 MFlops, utime 49.952 secs
500x500 mm - blocking, factor of 20 14.00 MFlops, utime 17.856 secs
500x500 mm - transposed b matrix 20.26 MFlops, utime 12.338 secs
500x500 mm - Robert's algorithm 20.25 MFlops, utime 12.347 secs
500x500 mm - 20x 20 subarray (from T. Maeno) 30.19 MFlops, utime 8.282 secs
500x500 mm - 20x 20 subarray (from D. Warner) 39.00 MFlops, utime 6.410 secs
120x120 mm - normal algorithm 14.53 MFlops, utime 0.238 secs
120x120 mm - blocking, factor of 20 14.38 MFlops, utime 0.240 secs
120x120 mm - transposed b matrix 32.49 MFlops, utime 0.106 secs
120x120 mm - Robert's algorithm 32.11 MFlops, utime 0.108 secs
120x120 mm - 20x 20 subarray (from T. Maeno) 32.11 MFlops, utime 0.108 secs
120x120 mm - 20x 20 subarray (from D. Warner) 43.13 MFlops, utime 0.080 secs
60x 60 mm - normal algorithm 17.97 MFlops, utime 0.024 secs
60x 60 mm - blocking, factor of 20 14.48 MFlops, utime 0.030 secs
60x 60 mm - transposed b matrix 32.19 MFlops, utime 0.013 secs
60x 60 mm - Robert's algorithm 31.72 MFlops, utime 0.014 secs
60x 60 mm - 20x 20 subarray (from T. Maeno) 32.19 MFlops, utime 0.013 secs
60x 60 mm - 20x 20 subarray (from D. Warner) 43.20 MFlops, utime 0.010 secs

500x500 mm - normal algorithm 9.66 MFlops, 17.06 ticks/flop, 25.861 s
500x500 mm - blocking, factor of 20 24.21 MFlops, 6.81 ticks/flop, 10.323 s
500x500 mm - transposed B matrix 36.92 MFlops, 4.46 ticks/flop, 6.770 s
500x500 mm - Robert's algorithm 37.65 MFlops, 4.38 ticks/flop, 6.639 s
500x500 mm - T. Maeno's algorithm, subarray 20x20 10.13 MFlops, 16.27 ticks/flop, 24.655 s
500x500 mm - D. Warner's algorithm, subarray 20x20 10.44 MFlops, 15.79 ticks/flop, 23.929 s
500x500 mm - generic iForth MAT* 78.23 MFlops, 2.10 ticks/flop, 3.195 s
120x120 mm - normal algorithm 53.58 MFlops, 3.07 ticks/flop, 0.064 s
120x120 mm - blocking, factor of 20 24.13 MFlops, 6.83 ticks/flop, 0.143 s
120x120 mm - transposed B matrix 42.58 MFlops, 3.87 ticks/flop, 0.081 s
120x120 mm - Robert's algorithm 46.47 MFlops, 3.55 ticks/flop, 0.074 s
120x120 mm - T. Maeno's algorithm, subarray 20x20 10.11 MFlops, 16.31 ticks/flop, 0.341 s
120x120 mm - D. Warner's algorithm, subarray 20x20 10.40 MFlops, 15.86 ticks/flop, 0.332 s
120x120 mm - generic iForth MAT* 72.49 MFlops, 2.27 ticks/flop, 0.047 s
60x60 mm - normal algorithm 38.98 MFlops, 4.23 ticks/flop, 0.011 s
60x60 mm - blocking, factor of 20 23.69 MFlops, 6.96 ticks/flop, 0.018 s
60x60 mm - transposed B matrix 39.14 MFlops, 4.21 ticks/flop, 0.011 s
60x60 mm - Robert's algorithm 45.84 MFlops, 3.59 ticks/flop, 0.009 s
60x60 mm - T. Maeno's algorithm, subarray 20x20 9.94 MFlops, 16.59 ticks/flop, 0.043 s
60x60 mm - D. Warner's algorithm, subarray 20x20 10.32 MFlops, 15.97 ticks/flop, 0.041 s
60x60 mm - generic iForth MAT* 62.86 MFlops, 2.62 ticks/flop, 0.006 s

500x500 mm - normal algorithm 6.32 MFlops, 26.09 ticks/flop, 39.543 s
500x500 mm - blocking, factor of 20 20.52 MFlops, 8.03 ticks/flop, 12.178 s
500x500 mm - transposed B matrix 21.42 MFlops, 7.70 ticks/flop, 11.666 s
500x500 mm - Robert's algorithm 21.89 MFlops, 7.53 ticks/flop, 11.418 s
500x500 mm - T. Maeno's algorithm, subarray 20x20 13.61 MFlops, 12.12 ticks/flop, 18.365 s
500x500 mm - D. Warner's algorithm, subarray 20x20 11.87 MFlops, 13.89 ticks/flop, 21.054 s
500x500 mm - generic mat* 63.49 MFlops, 2.59 ticks/flop, 3.937 s
500x500 mm - iForth MAT* 63.50 MFlops, 2.59 ticks/flop, 3.936 s
120x120 mm - normal algorithm 27.77 MFlops, 5.94 ticks/flop, 0.124 s
120x120 mm - blocking, factor of 20 21.36 MFlops, 7.72 ticks/flop, 0.161 s
120x120 mm - transposed B matrix 27.35 MFlops, 6.03 ticks/flop, 0.126 s
120x120 mm - Robert's algorithm 29.32 MFlops, 5.62 ticks/flop, 0.117 s
120x120 mm - T. Maeno's algorithm, subarray 20x20 13.89 MFlops, 11.87 ticks/flop, 0.248 s
120x120 mm - D. Warner's algorithm, subarray 20x20 12.09 MFlops, 13.64 ticks/flop, 0.285 s
120x120 mm - generic mat* 63.67 MFlops, 2.59 ticks/flop, 0.054 s
120x120 mm - iForth MAT* 63.62 MFlops, 2.59 ticks/flop, 0.054 s
60x60 mm - normal algorithm 38.09 MFlops, 4.30 ticks/flop, 0.011 s
60x60 mm - blocking, factor of 20 20.84 MFlops, 7.86 ticks/flop, 0.020 s
60x60 mm - transposed B matrix 28.17 MFlops, 5.81 ticks/flop, 0.015 s
60x60 mm - Robert's algorithm 32.38 MFlops, 5.06 ticks/flop, 0.013 s
60x60 mm - T. Maeno's algorithm, subarray 20x20 13.71 MFlops, 11.95 ticks/flop, 0.031 s
60x60 mm - D. Warner's algorithm, subarray 20x20 11.91 MFlops, 13.76 ticks/flop, 0.036 s
60x60 mm - generic mat* 56.58 MFlops, 2.89 ticks/flop, 0.007 s
60x60 mm - iForth MAT* 56.74 MFlops, 2.88 ticks/flop, 0.007 s

500x500 mm - normal algorithm 4.37 MFlops, 37.49 ticks/flop, 57.157 s
500x500 mm - blocking, factor of 20 7.61 MFlops, 21.54 ticks/flop, 32.838 s
500x500 mm - transposed B matrix 9.22 MFlops, 17.77 ticks/flop, 27.098 s
500x500 mm - Robert's algorithm 9.32 MFlops, 17.58 ticks/flop, 26.802 s
500x500 mm - T. Maeno's algorithm, subarray 20x20 5.71 MFlops, 28.70 ticks/flop, 43.751 s
500x500 mm - D. Warner's algorithm, subarray 20x20 7.69 MFlops, 21.30 ticks/flop, 32.480 s
120x120 mm - normal algorithm 8.97 MFlops, 18.39 ticks/flop, 0.385 s
120x120 mm - blocking, factor of 20 8.10 MFlops, 20.35 ticks/flop, 0.426 s
120x120 mm - transposed B matrix 9.85 MFlops, 16.74 ticks/flop, 0.350 s
120x120 mm - Robert's algorithm 10.09 MFlops, 16.33 ticks/flop, 0.342 s
120x120 mm - T. Maeno's algorithm, subarray 20x20 6.32 MFlops, 26.08 ticks/flop, 0.546 s
120x120 mm - D. Warner's algorithm, subarray 20x20 8.06 MFlops, 20.45 ticks/flop, 0.428 s

120x120 mm - normal algorithm 1.75 MFlops, 94.07 ticks/flop, 1.970 s
120x120 mm - blocking, factor of 20 1.63 MFlops, 100.78 ticks/flop, 2.110 s
120x120 mm - transposed B matrix 1.74 MFlops, 94.40 ticks/flop, 1.977 s
120x120 mm - Robert's algorithm 1.80 MFlops, 91.56 ticks/flop, 1.917 s
120x120 mm - T. Maeno's algorithm, subarray 20x20 1.42 MFlops, 115.49 ticks/flop, 2.419 s
120x120 mm - D. Warner's algorithm, subarray 20x20 1.80 MFlops, 91.65 ticks/flop, 1.919 s

120x120 mm - normal algorithm 5.19 MFlops, 31.77 ticks/flop, 0.665 s
120x120 mm - blocking, factor of 20 4.67 MFlops, 35.32 ticks/flop, 0.739 s
120x120 mm - transposed B matrix 5.21 MFlops, 31.66 ticks/flop, 0.663 s
120x120 mm - Robert's algorithm 5.71 MFlops, 28.89 ticks/flop, 0.605 s
120x120 mm - T. Maeno's algorithm, subarray 20x20 2.96 MFlops, 55.70 ticks/flop, 1.166 s
120x120 mm - D. Warner's algorithm, subarray 20x20 3.50 MFlops, 47.02 ticks/flop, 0.985 s

120x120 mm - normal algorithm 2.20 MFlops, 77.23 ticks/flop, 1.570 s
120x120 mm - blocking, factor of 20 1.34 MFlops, 126.61 ticks/flop, 2.573 s
120x120 mm - transposed B matrix 2.20 MFlops, 77.04 ticks/flop, 1.566 s
120x120 mm - Robert's algorithm 2.24 MFlops, 75.61 ticks/flop, 1.537 s
120x120 mm - T. Maeno's algorithm, subarray 20x20 1.24 MFlops, 136.26 ticks/flop, 2.770 s
120x120 mm - D. Warner's algorithm, subarray 20x20 1.43 MFlops, 118.68 ticks/flop, 2.412 s
60x60 mm - normal algorithm 2.21 MFlops, 74.05 ticks/flop, 0.195 s
60x60 mm - blocking, factor of 20 1.28 MFlops, 127.77 ticks/flop, 0.336 s
60x60 mm - transposed B matrix 1.98 MFlops, 82.52 ticks/flop, 0.217 s
60x60 mm - Robert's algorithm 2.05 MFlops, 79.94 ticks/flop, 0.210 s
60x60 mm - T. Maeno's algorithm, subarray 20x20 1.18 MFlops, 138.16 ticks/flop, 0.363 s
60x60 mm - D. Warner's algorithm, subarray 20x20 1.36 MFlops, 120.44 ticks/flop, 0.317 s

500x500 mm (aborted - over 6 minutes per benchmark)
120x120 mm - normal algorithm 0.93 MFlops, 176.76 ticks/flop, 3.702 s
120x120 mm - blocking, factor of 20 0.55 MFlops, 295.18 ticks/flop, 6.182 s
120x120 mm - transposed B matrix 0.87 MFlops, 189.46 ticks/flop, 3.968 s
120x120 mm - Robert's algorithm 0.90 MFlops, 181.76 ticks/flop, 3.807 s
120x120 mm - T. Maeno's algorithm, subarray 20x20 0.45 MFlops, 366.00 ticks/flop, 7.666 s
120x120 mm - D. Warner's algorithm, subarray 20x20 0.57 MFlops, 288.33 ticks/flop, 6.039 s
60x60 mm - normal algorithm 0.92 MFlops, 178.71 ticks/flop, 0.467 s
60x60 mm - blocking, factor of 20 0.56 MFlops, 290.60 ticks/flop, 0.760 s
60x60 mm - transposed B matrix 0.80 MFlops, 203.84 ticks/flop, 0.533 s
60x60 mm - Robert's algorithm 0.86 MFlops, 191.84 ticks/flop, 0.502 s
60x60 mm - T. Maeno's algorithm, subarray 20x20 0.44 MFlops, 372.79 ticks/flop, 0.976 s
60x60 mm - D. Warner's algorithm, subarray 20x20 0.53 MFlops, 307.52 ticks/flop, 0.805 s

500x500 mm - normal algorithm 61.25 MFlops, 5.77 ticks/flop, 4.081 s
500x500 mm - blocking, factor of 20 74.84 MFlops, 4.72 ticks/flop, 3.340 s
500x500 mm - transposed B matrix 67.44 MFlops, 5.24 ticks/flop, 3.706 s
500x500 mm - Robert's algorithm 67.76 MFlops, 5.22 ticks/flop, 3.689 s
500x500 mm - T. Maeno's algorithm, subarray 20x20 16.35 MFlops, 21.64 ticks/flop, 15.285 s
500x500 mm - D. Warner's algorithm, subarray 20x20 31.50 MFlops, 11.23 ticks/flop, 7.934 s
500x500 mm - generic iForth MAT* 242.22 MFlops, 1.46 ticks/flop, 1.032 s
120x120 mm - normal algorithm 136.37 MFlops, 2.58 ticks/flop, 0.025 s
120x120 mm - blocking, factor of 20 76.26 MFlops, 4.62 ticks/flop, 0.045 s
120x120 mm - transposed B matrix 138.66 MFlops, 2.54 ticks/flop, 0.024 s
120x120 mm - Robert's algorithm 144.34 MFlops, 2.44 ticks/flop, 0.023 s
120x120 mm - T. Maeno's algorithm, subarray 20x20 16.35 MFlops, 21.58 ticks/flop, 0.211 s
120x120 mm - D. Warner's algorithm, subarray 20x20 31.64 MFlops, 11.15 ticks/flop, 0.109 s
120x120 mm - generic iForth MAT* 201.97 MFlops, 1.74 ticks/flop, 0.017 s
60x60 mm - normal algorithm 136.41 MFlops, 2.58 ticks/flop, 0.003 s
60x60 mm - blocking, factor of 20 75.14 MFlops, 4.69 ticks/flop, 0.005 s
60x60 mm - transposed B matrix 125.37 MFlops, 2.81 ticks/flop, 0.003 s
60x60 mm - Robert's algorithm 144.44 MFlops, 2.44 ticks/flop, 0.002 s
60x60 mm - T. Maeno's algorithm, subarray 20x20 16.29 MFlops, 21.66 ticks/flop, 0.026 s
60x60 mm - D. Warner's algorithm, subarray 20x20 31.45 MFlops, 11.22 ticks/flop, 0.013 s
60x60 mm - generic iForth MAT* 193.59 MFlops, 1.82 ticks/flop, 0.002 s

500x500 mm - normal algorithm 35.41 MFlops, 9.99 ticks/flop, 7.059 s
500x500 mm - blocking, factor of 20 65.90 MFlops, 5.37 ticks/flop, 3.793 s
500x500 mm - transposed B matrix 39.35 MFlops, 8.99 ticks/flop, 6.351 s
500x500 mm - Robert's algorithm 39.57 MFlops, 8.94 ticks/flop, 6.317 s
500x500 mm - T. Maeno's algorithm, subarray 20x20 15.51 MFlops, 22.81 ticks/flop, 16.115 s
500x500 mm - D. Warner's algorithm, subarray 20x20 27.38 MFlops, 12.92 ticks/flop, 9.127 s
500x500 mm - generic mat* 187.74 MFlops, 1.88 ticks/flop, 1.331 s
500x500 mm - iForth MAT* 187.77 MFlops, 1.88 ticks/flop, 1.331 s
120x120 mm - normal algorithm 86.43 MFlops, 4.08 ticks/flop, 0.039 s
120x120 mm - blocking, factor of 20 67.71 MFlops, 5.21 ticks/flop, 0.051 s
120x120 mm - transposed B matrix 61.86 MFlops, 5.70 ticks/flop, 0.055 s
120x120 mm - Robert's algorithm 63.82 MFlops, 5.53 ticks/flop, 0.054 s
120x120 mm - T. Maeno's algorithm, subarray 20x20 15.60 MFlops, 22.61 ticks/flop, 0.221 s
120x120 mm - D. Warner's algorithm, subarray 20x20 27.69 MFlops, 12.74 ticks/flop, 0.124 s
120x120 mm - generic mat* 191.96 MFlops, 1.83 ticks/flop, 0.018 s
120x120 mm - iForth MAT* 192.00 MFlops, 1.83 ticks/flop, 0.017 s
60x60 mm - normal algorithm 78.82 MFlops, 4.45 ticks/flop, 0.005 s
60x60 mm - blocking, factor of 20 66.81 MFlops, 5.25 ticks/flop, 0.006 s
60x60 mm - transposed B matrix 58.40 MFlops, 6.00 ticks/flop, 0.007 s
60x60 mm - Robert's algorithm 61.70 MFlops, 5.68 ticks/flop, 0.007 s
60x60 mm - T. Maeno's algorithm, subarray 20x20 15.49 MFlops, 22.65 ticks/flop, 0.027 s
60x60 mm - D. Warner's algorithm, subarray 20x20 27.47 MFlops, 12.77 ticks/flop, 0.015 s
60x60 mm - generic mat* 169.99 MFlops, 2.06 ticks/flop, 0.002 s
60x60 mm - iForth MAT* 170.02 MFlops, 2.06 ticks/flop, 0.002 s

ProForth VFX for i386+, Version: 3.10.0007, Build date: 23 June 2000
P11-350, 128Mb RAM, WinNT 4.0
80-bit extended floats for F@ and friends
Absolutely no FP optimisation at all!
Using NDP stack
===============
500x500 mm - normal algorithm 19.84 MFlops, 17.48 ticks/flop, 12.599 s
500x500 mm - blocking, factor of 20 28.61 MFlops, 12.12 ticks/flop, 8.736 s
500x500 mm - transposed B matrix 23.28 MFlops, 14.90 ticks/flop, 10.737 s
500x500 mm - Robert's algorithm 22.93 MFlops, 15.13 ticks/flop, 10.900 s
500x500 mm - T. Maeno's algorithm, subarray 20x20 29.13 MFlops, 11.90 ticks/flop, 8.579 s
500x500 mm - D. Warner's algorithm, subarray 20x20 31.93 MFlops, 10.86 ticks/flop, 7.828 s
120x120 mm - normal algorithm 38.85 MFlops, 8.93 ticks/flop, 0.088 s
120x120 mm - blocking, factor of 20 29.56 MFlops, 11.73 ticks/flop, 0.116 s
120x120 mm - transposed B matrix 34.13 MFlops, 10.16 ticks/flop, 0.101 s
120x120 mm - Robert's algorithm 34.90 MFlops, 9.94 ticks/flop, 0.099 s
120x120 mm - T. Maeno's algorithm, subarray 20x20 29.77 MFlops, 11.65 ticks/flop, 0.116 s
120x120 mm - D. Warner's algorithm, subarray 20x20 33.55 MFlops, 10.34 ticks/flop, 0.102 s
60x60 mm - normal algorithm 37.90 MFlops, 9.20 ticks/flop, 0.011 s
60x60 mm - blocking, factor of 20 29.36 MFlops, 11.88 ticks/flop, 0.014 s
60x60 mm - transposed B matrix 32.55 MFlops, 10.71 ticks/flop, 0.013 s
60x60 mm - Robert's algorithm 33.63 MFlops, 10.37 ticks/flop, 0.012 s
60x60 mm - T. Maeno's algorithm, subarray 20x20 29.54 MFlops, 11.81 ticks/flop, 0.014 s
60x60 mm - D. Warner's algorithm, subarray 20x20 33.36 MFlops, 10.46 ticks/flop, 0.012 s
(submitted by) Stephen Pelc, sfp@mpeltd.demon.co.uk
MicroProcessor Engineering Ltd - More Real, Less Time
133 Hill Lane, Southampton SO15 5AF, England
tel: +44 23 80 631441, fax: +44 23 80 339691
web: http://www.mpeltd.demon.co.uk
(original code, no MFlops, for ProForth VFX for Windows)

500x500 mm - normal algorithm 31.31 MFlops, 28.55 ticks/flop, 7.983 s
500x500 mm - blocking, factor of 20 168.74 MFlops, 5.29 ticks/flop, 1.481 s
500x500 mm - transposed B matrix 119.44 MFlops, 7.48 ticks/flop, 2.093 s
500x500 mm - transposed B matrix #2 121.34 MFlops, 7.36 ticks/flop, 2.060 s
500x500 mm - Robert's algorithm 119.88 MFlops, 7.45 ticks/flop, 2.085 s
500x500 mm - T. Maeno's algorithm, subarray 20x20 199.49 MFlops, 4.48 ticks/flop, 1.253 s
500x500 mm - D. Warner's algorithm, subarray 20x20 149.88 MFlops, 5.96 ticks/flop, 1.667 s
500x500 mm - generic mat* 453.47 MFlops, 1.97 ticks/flop, 0.551 s
500x500 mm - iForth DGEMM1 453.30 MFlops, 1.97 ticks/flop, 0.551 s
120x120 mm - normal algorithm 265.16 MFlops, 3.37 ticks/flop, 13.033 ms
120x120 mm - blocking, factor of 20 193.48 MFlops, 4.62 ticks/flop, 17.861 ms
120x120 mm - transposed B matrix 332.64 MFlops, 2.69 ticks/flop, 10.389 ms
120x120 mm - transposed B matrix #2 379.05 MFlops, 2.36 ticks/flop, 9.117 ms
120x120 mm - Robert's algorithm 363.68 MFlops, 2.46 ticks/flop, 9.502 ms
120x120 mm - T. Maeno's algorithm, subarray 20x20 243.99 MFlops, 3.66 ticks/flop, 14.164 ms
120x120 mm - D. Warner's algorithm, subarray 20x20 178.15 MFlops, 5.02 ticks/flop, 19.399 ms
120x120 mm - generic mat* 675.42 MFlops, 1.32 ticks/flop, 5.116 ms
120x120 mm - iForth DGEMM1 678.68 MFlops, 1.31 ticks/flop, 5.092 ms
60x60 mm - normal algorithm 344.42 MFlops, 2.60 ticks/flop, 1.254 ms
60x60 mm - blocking, factor of 20 198.41 MFlops, 4.52 ticks/flop, 2.177 ms
60x60 mm - transposed B matrix 320.04 MFlops, 2.80 ticks/flop, 1.349 ms
60x60 mm - transposed B matrix #2 387.19 MFlops, 2.31 ticks/flop, 1.115 ms
60x60 mm - Robert's algorithm 374.21 MFlops, 2.39 ticks/flop, 1.154 ms
60x60 mm - T. Maeno's algorithm, subarray 20x20 244.99 MFlops, 3.66 ticks/flop, 1.763 ms
60x60 mm - D. Warner's algorithm, subarray 20x20 180.36 MFlops, 4.97 ticks/flop, 2.395 ms
60x60 mm - generic mat* 693.99 MFlops, 1.29 ticks/flop, 0.622 ms
60x60 mm - iForth DGEMM1 693.25 MFlops, 1.29 ticks/flop, 0.623 ms
Note: With a 26x26 matrix the Athlon reaches 1113.11 MFlops peak.

CLK 2998 MHz
500x500 mm - normal algorithm 221.68 MFlops, 13.52 ticks/flop, 1.127 s
500x500 mm - blocking, factor of 20 395.53 MFlops, 7.57 ticks/flop, 0.632 s
500x500 mm - transposed B matrix 643.28 MFlops, 4.66 ticks/flop, 0.388 s
500x500 mm - transposed B matrix #2 640.79 MFlops, 4.67 ticks/flop, 0.390 s
500x500 mm - Robert's algorithm 637.72 MFlops, 4.70 ticks/flop, 0.392 s
500x500 mm - T. Maeno's algorithm, subarray 20x20 800.03 MFlops, 3.74 ticks/flop, 0.312 s
500x500 mm - D. Warner's algorithm, subarray 20x20 527.72 MFlops, 5.68 ticks/flop, 0.473 s
500x500 mm - generic mat* 1492.59 MFlops, 2.00 ticks/flop, 0.167 s
500x500 mm - iForth DGEMM1 1470.64 MFlops, 2.03 ticks/flop, 0.169 s
120x120 mm - normal algorithm 1025.67 MFlops, 2.92 ticks/flop, 3.369 ms
120x120 mm - blocking, factor of 20 419.26 MFlops, 7.15 ticks/flop, 8.242 ms
120x120 mm - transposed B matrix 975.71 MFlops, 3.07 ticks/flop, 3.542 ms
120x120 mm - transposed B matrix #2 1044.70 MFlops, 2.86 ticks/flop, 3.308 ms
120x120 mm - Robert's algorithm 1016.85 MFlops, 2.94 ticks/flop, 3.398 ms
120x120 mm - T. Maeno's algorithm, subarray 20x20 916.87 MFlops, 3.26 ticks/flop, 3.769 ms
120x120 mm - D. Warner's algorithm, subarray 20x20 576.57 MFlops, 5.19 ticks/flop, 5.994 ms
120x120 mm - generic mat* 1940.21 MFlops, 1.54 ticks/flop, 1.781 ms
120x120 mm - iForth DGEMM1 1939.37 MFlops, 1.54 ticks/flop, 1.782 ms
60x60 mm - normal algorithm 956.97 MFlops, 3.13 ticks/flop, 0.451 ms
60x60 mm - blocking, factor of 20 414.05 MFlops, 7.24 ticks/flop, 1.043 ms
60x60 mm - transposed B matrix 849.18 MFlops, 3.53 ticks/flop, 0.508 ms
60x60 mm - transposed B matrix #2 965.94 MFlops, 3.10 ticks/flop, 0.447 ms
60x60 mm - Robert's algorithm 881.04 MFlops, 3.40 ticks/flop, 0.490 ms
60x60 mm - T. Maeno's algorithm, subarray 20x20 868.44 MFlops, 3.45 ticks/flop, 0.497 ms
60x60 mm - D. Warner's algorithm, subarray 20x20 554.86 MFlops, 5.40 ticks/flop, 0.778 ms
60x60 mm - generic mat* 1715.25 MFlops, 1.74 ticks/flop, 0.251 ms
60x60 mm - iForth DGEMM1 1699.92 MFlops, 1.76 ticks/flop, 0.254 ms

500x500 mm - normal algorithm 24.73 MFlops, utime 10.108 secs
500x500 mm - blocking, factor of 20 124.61 MFlops, utime 2.006 secs
500x500 mm - transposed b matrix 78.83 MFlops, utime 3.171 secs
500x500 mm - Robert's algorithm 130.25 MFlops, utime 1.919 secs
500x500 mm - 20x 20 subarray (from T. Maeno) 292.51 MFlops, utime 0.855 secs
500x500 mm - 20x 20 subarray (from D. Warner) 218.34 MFlops, utime 1.145 secs
120x120 mm - normal algorithm 130.85 MFlops, utime 0.026 secs
120x120 mm - blocking, factor of 20 145.29 MFlops, utime 0.024 secs
120x120 mm - transposed b matrix 354.01 MFlops, utime 0.010 secs
120x120 mm - Robert's algorithm 378.22 MFlops, utime 0.009 secs
120x120 mm - 20x 20 subarray (from T. Maeno) 418.28 MFlops, utime 0.008 secs
120x120 mm - 20x 20 subarray (from D. Warner) 300.20 MFlops, utime 0.012 secs
60x 60 mm - normal algorithm 141.92 MFlops, utime 0.003 secs
60x 60 mm - blocking, factor of 20 148.76 MFlops, utime 0.003 secs
60x 60 mm - transposed b matrix 392.73 MFlops, utime 0.001 secs
60x 60 mm - Robert's algorithm 406.78 MFlops, utime 0.001 secs
60x 60 mm - 20x 20 subarray (from T. Maeno) 431.14 MFlops, utime 0.001 secs
60x 60 mm - 20x 20 subarray (from D. Warner) 312.59 MFlops, utime 0.001 secs

500x500 mm - normal algorithm 1.121 secs.
500x500 mm - temporary variable in loop 1.631 secs.
500x500 mm - unrolled inner loop, factor of 4 1.260 secs.
500x500 mm - unrolled inner loop, factor of 8 1.147 secs.
500x500 mm - unrolled inner loop, factor of 16 1.144 secs.
500x500 mm - pointers used to access matrices 1.371 secs.
500x500 mm - pointers used, unrolled by 4 1.103 secs.
500x500 mm - transposed B matrix 0.694 secs.
500x500 mm - interchanged inner loops 1.073 secs.
500x500 mm - blocking, step size of 20 1.249 secs.
500x500 mm - Robert's algorithm 0.388 secs.
500x500 mm - T. Maeno's algorithm, subarray 20x20 0.397 secs.
500x500 mm - Generic Maeno, subarray 20x20 0.668 secs.
500x500 mm - D. Warner's algorithm, subarray 20x20 0.747 secs.
==============================================================
Total using no extensions and using no hackery 14.000 secs.
240x240 mm - normal algorithm 0.616 secs.
240x240 mm - temporary variable in loop 1.086 secs.
240x240 mm - unrolled inner loop, factor of 4 0.762 secs.
240x240 mm - unrolled inner loop, factor of 8 0.659 secs.
240x240 mm - unrolled inner loop, factor of 16 0.628 secs.
240x240 mm - pointers used to access matrices 0.810 secs.
240x240 mm - pointers used, unrolled by 4 0.636 secs.
240x240 mm - transposed B matrix 0.732 secs.
240x240 mm - interchanged inner loops 1.149 secs.
240x240 mm - blocking, step size of 20 1.343 secs.
240x240 mm - Robert's algorithm 0.277 secs.
240x240 mm - T. Maeno's algorithm, subarray 20x20 0.409 secs.
240x240 mm - Generic Maeno, subarray 20x20 0.711 secs.
240x240 mm - D. Warner's algorithm, subarray 20x20 0.799 secs.
==============================================================
Total using no extensions and using no hackery 10.624 secs.
60x60 mm - normal algorithm 0.490 secs.
60x60 mm - temporary variable in loop 1.104 secs.
60x60 mm - unrolled inner loop, factor of 4 0.698 secs.
60x60 mm - unrolled inner loop, factor of 8 0.656 secs.
60x60 mm - unrolled inner loop, factor of 16 0.738 secs.
60x60 mm - pointers used to access matrices 0.702 secs.
60x60 mm - pointers used, unrolled by 4 0.467 secs.
60x60 mm - transposed B matrix 1.143 secs.
60x60 mm - interchanged inner loops 1.842 secs.
60x60 mm - blocking, step size of 20 2.016 secs.
60x60 mm - Robert's algorithm 0.487 secs.
60x60 mm - T. Maeno's algorithm, subarray 20x20 0.616 secs.
60x60 mm - Generic Maeno, subarray 20x20 1.113 secs.
60x60 mm - D. Warner's algorithm, subarray 20x20 1.233 secs.
==============================================================
Total using no extensions and using no hackery 13.313 secs.
This page last modified, on: Thursday, August 03, 2006, 23:47 PM