------------------------------------------------------
IDCT COMPARISON : ACCURACY AND SPEED BENCHMARK RESULTS
------------------------------------------------------

All results reported only for
  IEEE test conditions: -L = -256, +H = 255, sign = 1, #iters = 10000

Peak                                              CPU        
abs     mean  square        mean              execution  
error       error           error               time          algorithm
worst   worst overall   worst   overall   AMDK62  CELERON2    name
------------------------------------------------------------------------------
0(64)* .00000 .0000000 .000000  .000000  2631.78  1268.671 IEEE-1180 float64
1(64)  .01610 .01274   .002400 -.000142    65.056   32.847 int32 AAN IEEE-1180
1(64)  .01610 .01391   .003600 -.000030                    int32 AAN JREVDCT

3(4)  2.1026  .41767  1.3161    .031114                    ap528mmx (fail)
1(64)  .05110 .040950  .036500  .012691                    idctchen_mmx
1(64)  .01620 .012925  .002600 -.000025                  **ap922mmx "hacked"
1(64)  .00200 .0010594 .000700 -.000080    18.295          ap922hybrid 3DNOW
1(64)  .00200 .0010437 .000700 -.000081   244.629          ap922hybrid X87
1(5)   .00010 .0000078 .000100  .000008    21.271          ap922float 3DNOW
1(5)   .00010 .0000078 .000100  .000008             12.528 ap922float ISSE
1(3)   .00010 .0000005 .000100  .000005   239.319  102.956 ap922float frnd X87
1(3)   .00010 .0000005 .000100  .000005   524.095  221.753 ap922float arnd X87
1(3)   .00010 .0000005 .000100  .000005   259.774  112.245 ap922doubl frnd X87
0(64)* .00000 .0000000 .000000  .000000   510.599  227.658 ap922doubl arnd X87

For all listed columns, *SMALLER* numbers represent better performance.


* reference precision level

** MPEG2AVI 0.16B34 "MMX32" iDCT (partial implementation of AP922)

   "arnd"   = accurate rounding (using standard C-library "floor()")
   "frnd"   = fast truncation

   X87      = standard C code, 80387 FPU
   3DNOW    = AMD 3D-Now acceleration (AMD K6-2/Athlon)
   ISSE     = Intel SSE acceleration (Pentium3/Celeron2)

%% "AMDK62"   = AMD K6/2-500 @ 500MHz (100FSB), PC100SDRAM
   "CELERON2" = FC-PGA Celeron 566 @ 850MHz (100FSB), PC100SDRAM

   Execution times should only be used to rank iDCT performance 
   within a CPU family!  Execution times of different CPUs cannot
   be compared directly!
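
   The error columns in the table can be gathered with bookkeeping along
   the following lines.  This is only a sketch of how such figures are
   commonly collected in an IEEE-1180 style test, not the code that
   produced the numbers above.  The names (idct_err_stats, idct_err_add,
   idct_err_summarize) are made up for illustration, and reporting the
   worst mean error as a magnitude is an assumption.

```c
#include <string.h>

#define BLOCK_SIZE 64   /* one 8x8 block of iDCT output */

/* Running totals, kept separately for each of the 64 output positions. */
typedef struct {
    long blocks;                /* number of blocks accumulated        */
    int  peak[BLOCK_SIZE];      /* largest |error| seen per position   */
    long sum_err[BLOCK_SIZE];   /* sum of signed errors per position   */
    long sum_sq[BLOCK_SIZE];    /* sum of squared errors per position  */
} idct_err_stats;

/* Final figures, in the same order as the table's error columns. */
typedef struct {
    int    peak_abs;      /* peak absolute error over all positions     */
    double mse_worst;     /* worst per-position mean square error       */
    double mse_overall;   /* mean square error over all positions       */
    double me_worst;      /* worst per-position mean error (magnitude)  */
    double me_overall;    /* signed mean error over all positions       */
} idct_err_summary;

void idct_err_reset(idct_err_stats *s)
{
    memset(s, 0, sizeof *s);
}

/* Compare one block from the iDCT under test against the reference. */
void idct_err_add(idct_err_stats *s, const short *ref, const short *test)
{
    int i;
    for (i = 0; i < BLOCK_SIZE; i++) {
        int e = test[i] - ref[i];
        int a = e < 0 ? -e : e;
        if (a > s->peak[i])
            s->peak[i] = a;
        s->sum_err[i] += e;
        s->sum_sq[i]  += (long)e * e;
    }
    s->blocks++;
}

idct_err_summary idct_err_summarize(const idct_err_stats *s)
{
    idct_err_summary r = {0, 0.0, 0.0, 0.0, 0.0};
    long total_err = 0, total_sq = 0;
    int i;

    if (s->blocks == 0)
        return r;               /* nothing accumulated yet */

    for (i = 0; i < BLOCK_SIZE; i++) {
        double mse = (double)s->sum_sq[i]  / (double)s->blocks;
        double me  = (double)s->sum_err[i] / (double)s->blocks;
        if (me < 0.0)
            me = -me;
        if (s->peak[i] > r.peak_abs) r.peak_abs  = s->peak[i];
        if (mse > r.mse_worst)       r.mse_worst = mse;
        if (me > r.me_worst)         r.me_worst  = me;
        total_err += s->sum_err[i];
        total_sq  += s->sum_sq[i];
    }
    r.mse_overall = (double)total_sq  / ((double)s->blocks * BLOCK_SIZE);
    r.me_overall  = (double)total_err / ((double)s->blocks * BLOCK_SIZE);
    return r;
}
```

   Call idct_err_add() once per test block, passing the reference output
   and the output of the iDCT under test, then call idct_err_summarize()
   after the last block.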

AMD_3DNOW vs Intel-SSE
----------------------
WARNING : my *personal rant* follows...

     After hand-coding both the Intel-SSE and 3DNow versions of ap922float, 
   I feel like volunteering my opinion to the world.  For me, the 3D-Now 
   instruction set was undoubtedly easier to hand-code.  Even on AMD's aging
   K6-2, 3D-Now instructions have few pairing restrictions (other than 
   MMX-multiply and 3DNow-multiply sharing the same execution resource.)
   3DNow uses the familiar MMX registers, and the instruction latencies are
   extremely low (2 cycles.)  The AMD Athlon retains these low latencies, 
   despite having a higher clock speed.
     Intel SSE, on the other hand, has longer instruction latencies (4 for
   sse-add, 5 for sse-multiply), and some instruction pairing restrictions
   (MMX shifter, SSE-adder, and mmx pack/unpack all share the same execution
   port.)  The longer latencies lead to more wasted CPU cycles inside the
   body of for-loops.  Pairing restrictions reduce Intel SSE's 
   float<->int conversion speed, which introduces more wasted cycles during 
   AP922float's initial "input/unpack/int->float" and final 
   "output/float->int/repack" operations.  
     On the plus side, SSE introduces a whole new set of 128-bit registers
   (xmm0-xmm7.)  This allows programmers to work with vectors of 4-floats
   (versus vectors of 2-floats for AMD's 3DNow.)  Furthermore, SSE offers
   two instructions for float->int conversion, letting the programmer select
   an accurate rounding mode (not available in 3DNow.)  
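
   The practical difference between a truncating and a rounding
   float->int conversion can be shown in plain C.  This is a portable
   sketch, not SSE code: the two function names are hypothetical, and the
   rounding version breaks ties away from zero for simplicity, whereas
   the hardware's default rounding mode is (to my knowledge)
   round-to-nearest-even.

```c
/* Truncating conversion: the fraction is dropped (round toward zero).
   The result can be off by almost one full unit. */
int float_to_int_trunc(float x)
{
    return (int)x;
}

/* Rounding conversion built on top of truncation: add half a unit in
   the direction of the sign, then truncate.  The result is never more
   than half a unit away from x.  Ties go away from zero. */
int float_to_int_round(float x)
{
    return (int)(x >= 0.0f ? x + 0.5f : x - 0.5f);
}
```

   For an input of 1.7 the truncating version returns 1 and the rounding
   version returns 2; for -1.7 they return -1 and -2 respectively.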
     Comparing execution throughput, the AMD Athlon and Pentium3/Celeron2 
   CPUs offer similar theoretical peak throughput, which is 2 adds + 2 mults
   per clock cycle (3DNow or Intel-SSE.) However, due to 3DNow's shorter
   instruction latencies, 3DNow implementations of a math algorithm would
   probably execute faster than the same algorithm implemented in IntelSSE.
   Furthermore, AMD's better MMX/3DNow execution units can pair
   MMX shifts, packs, and float<->int instructions with no restriction.
   So data conversion operations execute quicker.
   The Pentium4's SSE unit is expected to at least double the 
   Pentium3's fpu throughput (clock for clock.)

...

    In the benchmarks above, the AMD and Celeron850 scores cannot be directly 
  compared.  However, if a comparison were to be made, the Cel850 
  (ap922float_sse) score of 12.528 should first be converted to that of a 
  Celeron566 by multiplying the execution time by 1.5 (the clock ratio 
  850/566, since execution time scales inversely with CPU MHz.)  This 
  yields an estimated score of 18.8 for a hypothetical Celeron566.
    Applying the same conversion in the other direction, the AMD K6/2-500's
  score of 21.271 can be converted to a "K6/2-566", again by the CPU-MHz 
  ratio (500/566.)  This yields an estimated score of 18.8 for a
  hypothetical K6/2-566.
    To summarize :

    AP922float 3DNow           AP922float IntelSSE
    hypothetical K6/2-566     hypothetical Celeron-566
    ---------------------     ------------------------
          18.8                            18.8       (lower score is better)

    Here the two scores are nearly identical.  Based on this "thought 
  experiment", we would expect the real-world performance of AP922float
  to be about the same for the 3DNow and IntelSSE implementations when
  comparing a K6/2 and a Celeron2 CPU at the same clock speed.  This is 
  great for AMD considering the age of the K6/2.  The Athlon and PentiumIII 
  CPUs execute the same (respective) code faster.
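
    The clock-speed conversion used in this thought experiment amounts to
  a one-line scaling.  The helper below is only a sketch (the function
  name is made up), and it rests on the same assumption as the text:
  execution time is taken to be inversely proportional to CPU MHz, which
  ignores memory and bus effects.

```c
/* Estimate the execution time at a hypothetical clock speed, assuming
   time is inversely proportional to clock frequency. */
double scale_time_to_clock(double measured_time,
                           double measured_mhz,
                           double target_mhz)
{
    return measured_time * (measured_mhz / target_mhz);
}
```

    Calling scale_time_to_clock(12.528, 850.0, 566.0) gives the
  hypothetical Celeron-566 estimate, and
  scale_time_to_clock(21.271, 500.0, 566.0) gives the hypothetical
  K6/2-566 estimate.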
