----------------------------------------------------------------------------
fdctpk10.zip v1.01 08/26/2000 
----------------------------------------------------------------------------

----------------------------------------------------------------------------
Summary
----------------------------------------------------------------------------
    This package includes several implementations of the forward
    discrete cosine transform (hereafter called fDCT).  The DCT is used
    in many image compression standards, including the popular JPEG and
    MPEG-video standards.  Full source-code listings (C, assembly) are
    included for each fDCT.

  (1) *reference* double-precision fDCT
      extremely slow, but most accurate ****** (by definition)

  (2) double-precision AAN fDCT (from Brent Beyeler's BBMPEG)
      very slow, very accurate *****

  (3) 3DNow single-precision AAN fDCT
      average speed, very accurate ****

  (4) MMX(3DNow enhanced) AP922 MMX fDCT
      fast, accurate ***

  (5) MMX AP922 MMX fDCT
      fast, accurate ***

  (6) MPEG software simulation integer-32 AAN fDCT
      slow, accurate      **

  The star ratings rank accuracy only, and are arbitrary.
  ALL of the included DCTs *should* meet or exceed IEEE-1180 guidelines.

  fDCT-accuracy tests are summarized in 'compare.txt'

----------------------------------------------------------------------------
Contents
----------------------------------------------------------------------------

  *reference* fDCT
  -----------------------
  This is the algorithm used to test the other fDCTs.
  The reference fDCT uses double-precision floating point math (actually
  extended double 80-bit on the x87 CPUs.)  This fDCT is used to
  establish a baseline for the error measurements.  It is probably too
  slow to use in practical applications.


  double-precision AAN fDCT (x87daan)
  --------------------------------
  Taken from BBMPEG, this fDCT uses the AAN algorithm with
  double-precision float arithmetic.  It is the 2nd most accurate fDCT
  in this package (after the reference fDCT.)  It is also the second slowest.


  3D-Now AAN fDCT (3dnowaan)
  ----------------------
  This package contains a code-listing for a 3D-Now floating-point
  implementation of the AAN algorithm.  The code was derived from
  BBMPEG's AAN double-precision fDCT.  It has been tested on an
  AMD K6/2-500 with Microsoft Visual C++ 6.0 Professional.  You will
  need VC++ 6 to compile the 3Dnow fdct code 'as is.'
  Speedwise, the 3DNow-AAN is between the MMX fDCTs and the i32aan.


  AP-922 MMX(3Dnow) fDCT (ap922mm3)
  ----------------------
  Complete forward_DCT, derived from Intel's application note AP-922.
  This fDCT uses 3D-Now MMX enhancements to improve upon the AP922's
  accuracy.  Speedwise, AP922mm3 is slightly faster than AP922mmx.
  The C-equivalent listing can be compiled/run on any x86 CPU.


  AP-922 MMX fDCT (ap922mmx)
  ----------------------
  Complete forward_DCT, derived from Intel's application note AP-922.
  This implementation uses standard MMX instructions, to achieve
  very accurate output at high speed.  The AP922 MMX fDCT is roughly as
  fast as other (less precise) MMX fDCT implementations.
  The C-equivalent listing can be compiled/run on any x86 CPU.


  AAN int32 fDCT (i32aan)
  ---------------------
  This dct is distributed with the JPEG and MPEG reference codecs.
  Although it uses 32-bit integer math, the i32aan is less accurate
  than the AP922MMX.  It is also slower.


----------------------------------------------------------------------------
File Descriptions
----------------------------------------------------------------------------

  This package contains the following files:

  fdctmm32.c- AP922 standard-MMX fDCT code-listing
              Two implementations are included :
                (1) MMX implementation (inline assembly, Visual C++ 6 Pro)
                (2) standard C-language (any ANSI C compiler)
              For a given input, both versions produce identical output.


  fdctmm32.doc - (MSword6 document)
              Mathematical derivation (ok, so it's not rigorous!) of
              the foward_dct_row() macro.

  fdctam32.c- AP922 MMX(3DNow) fDCT code-listing
              Two implementations are included :
                (1) MMX implementation (inline assembly, Visual C++ 6 Pro)
                (2) standard C-language (any ANSI C compiler)
              For a given input, both versions produce identical output.

              Due to the presence of 3D-Now instructions, this file
              requires a header file from the AMDSDK : 'amd3dx.h'.
              The header is included in this package, but it only
              supports a few C++ compilers (Watcom, VC++.)
              The standard-C listing does not require any special
              instruction support, and will compile with any C compiler.

  fdct3dn.c - 3DNOW fDCT code-listing, written in Visual C++ 6 (inline
              assembly)  

  fdct3dn_extracomments.c - code-listing with all 3D-Now instructions
              converted to VC++ "_emit" primitives (same as asm 'DB'),
              for easier porting to compilers without native 3D-Now
              support.  Listing is for fdct3dn.c v1.00 (not v1.01!)

  fdctref.c - x87 double-precision (64-bit) AAN fDCT (from BBMPEG)

  fdctint.c - integer 32-bit AAN fDCT (from MPEG Software Simulation Group)

  report_XXX.txt - accuracy reports for the various fDCT
              implementations.  You can compare the 3D-Now implementation
              against the integer AAN and x87d-AAN fDCT(s).
              ***NOTE*** During my profile runs, Visual C++'s profiler
              produced wide variation from test run to test run. Therefore
              the performance profiles should be interpreted as 'relative'
              rankings of performance (i.e. which function is fastest,
              which is slowest, etc.)  They should not be read as
              absolute measurements!

  fdct3dn.doc - (MS-Word6 document)
              Shows the order of operations within the 3D-Now AAN fDCT.
              The listing is in 'pseudo-code', for easier reading. 

  ieee1180.c - modified ieee1180 test package.  The original program
              tested iDCT-functions for IEEE-1180/1990 compliance.  The
              IEEE document defines the maximum allowable error, but the
              program did not contain the equivalent values for
              fDCT-functions.  So even with my modifications, this
              program gives no "pass/fail" indication.  It now only
              reports the error measurements (MSE, mean error.)

  amd3dx.h -  macro definitions for AMD 3D-Now asm opcodes (Visual C++, Watcom)
              Provided by the AMD-SDK, this header file enables 3D-Now
              code generation within Visual C++'s inline-assembler.
              (Get the AMDSDK from http://www.amd.com)

  Ignore any other files ... I re-used another project directory for this 
  project, so these other files most likely shouldn't be part of this 
  package.

----------------------------------------------------------------------------
Miscellaneous Notes
----------------------------------------------------------------------------

  --------------------
  fDCT-Accuracy report
  --------------------
  Please see "compare.txt"

  ----------------------------------------------
  About the IEEE-1180 test function (ieee1180.c)
  ----------------------------------------------
  In its original form, Tom Lane's ieee1180 program tested inverse-dct
  functions for conformance with ieee1180/1990 precision requirements.
  The code did not support the complementary test : forward-dct.
  I modified Tom Lane's code to compute the MSE/mean errors on
  the following path :

  coefficient | path 1 -> reference_fdct -> reference_idct -> ref results
  generator   | path 2 -> test_fdct      -> reference_idct -> test results

  ERROR( ref_results, test_results ) -> printf()

  In effect, the modified code performs a 'complete loop' test.  Because
  the test_fdct's output is checked *after* it has been transformed by
  the reference_idct, all errors are amplified.

  It is important to realize that my modifications render the error-limit
  tests invalid.  These error-limits were defined for the idct operation.
  Since I do not have access to the IEEE1180/1990 document, this 'hack'
  is the best I can do for accuracy testing.  While the hacked test does
  produce numerical error values, the results should only be used to 
  develop a qualitative sense of accuracy for the tested fdct.
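
  The ERROR( ref_results, test_results ) step in the diagram above
  could be computed roughly as in the sketch below.  The names
  dct_error_stats and compare_results are hypothetical and do not come
  from ieee1180.c; the sketch only assumes that both paths deliver
  their reconstructed samples as arrays of short.

```c
#include <assert.h>
#include <stddef.h>

/* Error statistics between two sets of reconstructed samples. */
typedef struct {
    double mean_error;   /* average of (test - ref)   */
    double mse;          /* average of (test - ref)^2 */
} dct_error_stats;

/* Compare the output of path 2 (test_fdct -> reference_idct) against
 * the output of path 1 (reference_fdct -> reference_idct).
 * Hypothetical helper for illustration; not taken from ieee1180.c. */
dct_error_stats compare_results(const short *ref, const short *test,
                                size_t count)
{
    dct_error_stats s = { 0.0, 0.0 };
    double sum = 0.0, sum_sq = 0.0;
    size_t i;

    assert(ref && test && count > 0);
    for (i = 0; i < count; i++) {
        int e = (int)test[i] - (int)ref[i];   /* signed per-sample error */
        sum    += e;
        sum_sq += (double)e * e;
    }
    s.mean_error = sum / (double)count;
    s.mse        = sum_sq / (double)count;
    return s;
}
```

  A mean error far from zero suggests a systematic bias in the tested
  fDCT (for example, one-sided rounding), while the MSE reflects the
  overall size of the deviations.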

  -------
  x87dAAN
  -------
  This dct_function is from the BBMPEG package.  It was the source for
  the 3D-Now AAN implementation.


  --------
  3DNowAAN
  --------
  This 3D-Now AAN fDCT implementation is faster than the 32-bit integer
  implementation.  Originally, I believed the 3dnowaan was 2X faster
  than the i32aan, but retesting reveals wide variation from test run to
  test run.  Visual C++ 6's profiler is not accurate enough for measuring
  'fast' functions within an entire project.  (The aan dcts account for
  <1% of total execution time.)

  The 3D-Now AAN fDCT is also more accurate than the i32aan, approaching
  the numerical accuracy of BBMPEG's x87d-AAN fDCT.


     OPTIMIZATION

     This implementation is not fully optimized.  The fDCT function is divided
     into 3 major loops:

        (1) dct_rows_and_transpose();
        (2) dct_columns();
        (3) dct_postscale_and_transpose();

     Step (3) could be integrated into (1) and (2), but doing so would
     reduce the code's readability (making further hand-optimization more
     difficult.)

     As noted in the 3D-Now code-listing, the final postscaler/rounder
     introduces a small shift into the negative fDCT elements.  The shift's
     magnitude is less than 1/32768.  Additional code can remove the
     shift entirely, but at the cost of lower performance.  Tests show no
     difference in accuracy between the current implementation and a
     more accurate-rounding algorithm (not implemented.)

     Finally, MMX code runs best on QWORD aligned memory structures.
     Non-QWORD aligned loads/stores stall the execution pipeline.

     In its present form, fdct3dn.c declares most arrays 'static', as
     Visual C++ will QWORD align non-local vars.  If the code is to be used
     with other compilers, the programmer should verify proper QWORD
     alignment.
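
     One way to verify alignment at run time is sketched below.  It
     assumes a C99 compiler that provides <stdint.h>; the names
     is_qword_aligned and check_block_alignment are hypothetical and do
     not appear in the package.

```c
#include <assert.h>
#include <stdint.h>

/* Returns nonzero when p sits on an 8-byte (QWORD) boundary. */
int is_qword_aligned(const void *p)
{
    return ((uintptr_t)p & 7u) == 0;
}

/* Hypothetical guard: call before handing a block to an MMX routine.
 * Compiles to nothing when NDEBUG is defined. */
void check_block_alignment(const short *block)
{
    assert(is_qword_aligned(block) && "fDCT block must be QWORD aligned");
}
```

     Such a check only detects a misaligned buffer.  Forcing alignment
     still needs a compiler-specific or language-level mechanism, such
     as an alignment specifier on the array declaration.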


  ------------
  AP922
  ------------

  Intel's application-note AP-922 presents the dct in mathematical form.
  By itself, the supplied code-listing shows the 'first step' toward
  a complete dct/idct implementation.  For interested parties, I have
  written a document fdctmm32.doc, describing the discovery process I used
  to complete the AP922 fDCT.

  The QWORD alignment issues mentioned earlier also apply to AP922.
  Ideally, the code should be converted to MASM, where user-specified 
  assembler-directives will explicitly align the data structures.  (As
  opposed to implicitly relying on VC++ 6's compiler-optimization behavior.)
  


----------------------------------------------------------------------------
Usage Disclaimer
----------------------------------------------------------------------------
  
     Since this package contains fDCT-code derived from the MPEG software
  simulation group, the distribution terms of the MSSG apply to this package
  as well.  In short, the code is freely usable in non-commercial
  applications.  Parties interested in commercial applications should check
  with the appropriate patent-holders (if applicable) for possible licensing
  requirements.

-------------
Helpful Links
-------------

  http://www.mpeg.org - mpeg2v12 codec

  http://developer.intel.com/vtune/cbts/appnotes.htm - AP-922

  http://members.home.net/beyeler - BBMPEG

  http://www.elecard.com/peter    - AP922 iDCT (the *correct* way!)
    Thanks to Peter Gubanov's AP-922 idct code-listing, I had the insight
    needed to complete the AP-922 fDCT.

  http://www.concentric.net/~psilon - AP922 iDCT (VirtualDub Source code)
    Another *correct* implementation of AP-922 MMX iDCT.


----------------------------------------------------------------------------
Revision History
----------------------------------------------------------------------------
  v1.01  08/26/2000 Revised fdctmm32.c, fdctam32.c to "clip" output values
                    to the numerical range {-2048,+2047}.  Note, only these
                    two fDCTs have been fixed.  (The others can be
                    fixed similarly.)
                    Completed the C-language listings for AP-922 fDCT.
                    (The listings are part of fdctmm32.c, fdctam32.c)

              
                    Removed 'pass/fail' indication from the test-reporter.
                    This was removed because the program does not
                    implement nor attempt to implement any IEEE-1180
                    precision test for fDCTs.  (The original code in
                    the ieee1180.c tester-program worked with iDCTs only.)

         07/23/2000 Added 'compare.txt' to summarize the accuracy of
                    the various fDCTs.

  v1.0   07/22/2000 (initial release)
                    collected two forward_dct's worthy of mention,
                    AP922 MMX fDCT,  3DNow AAN fDCT (both written by me)

----------------------------------------------------------------------------
liaor@iname.com http://members.tripod.com/~liaor
