idctpk12.zip 09/04/2000 v1.2 

 ************************************************************************
 *                                                                      *
 *      DISCLAIMER                                                      *
 *      ----------                                                      *
 *                                                                      *
 *      This software comes without warranty of any kind.  The          *
 *      author specifically disclaims all warranties, express or        *
 *      implied, and all liability, including consequential and         *
 *      other indirect damages, for the use of this code,               *
 *      including liability for infringement of any proprietary         *
 *      rights, and including the warranties of merchantability         *
 *      and fitness for a particular purpose.  The author does not      *
 *      assume any responsibility for any errors which may              *
 *      appear in this code nor any responsibility to update it.        *
 *                                                                      *
 ************************************************************************

--------------------------------------------------------------------------
Inverse Discrete Cosine Transform functions
--------------------------------------------------------------------------

     1. Contents of iDCT package
     2. iDCT description
     3. Descriptions of individual iDCT algorithms
     4. Test results
     5. Known problems

--------------------------------------------------------------------------
1) Contents
--------------------------------------------------------------------------

The earlier version of this package contained iDCT functions I
had collected and incorporated into MPEG2AVI.  Unfortunately, the
'mmx32idct' was numerically incorrect.  (This idct is not part
of this release.)

In the time since, I converted the MMX iDCT algorithm 
(from AP922) into several floating-point iDCT algorithms.
     
     8x8 IDCT functions
     ------------------
     1.  jrevdct.c - JPEG v4 reference iDCT (32-bit integer)
         came with the original ieee1180.tar.gz distribution
         http://www.mpeg.org
     
     
     2.  idct_int.c - MPEG-2 v1.2 reference Chen iDCT (32-bit integer)
         came with the MSSG MPEG2CODEC v1.2 (08/96)
         http://www.mpeg.org
     
     3.  idct_ap528.c - Intel AAN iDCT (16-bit MMX)
         http://developer.intel.com/drg/mmx/appnotes - AppNote 528
         *** HORRIBLY INACCURATE, DO NOT USE ***
     
     4.  Chen iDCT (16-bit MMX) - Khanh Nguyen-Phi's iDCT
         http://www.ece.ubc.ca/~knguyen/   ( k.p.nguyen@ieee.org )
         (Contact information and www-link no longer work.)
     
     5.  idct_ap922float.c - Floating-point iDCT (derived from AP922)

         http://developer.intel.com/vtune/cbts/strmsimd - AppNote 922

         Numerical precision can be increased at the expense of speed.
         In "best precision mode" (accurate rounding, double-precision),
         ap922float matches the IEEE-1180/1990 reference-IDCT.

         ap922float is available in both C-code and Intel-SSE versions.
         To compile the Intel-SSE version "as is", you will need
         Visual C++ 6.0 Service Pack 4 + Processor Pack Beta.
         
         An AMD 3D-Now version is in the works.
     
     6.  idct_ap922hybr.c - Hybrid float/int32 iDCT (derived from AP922)
             Faster than ap922float, but less accurate.  This version
         exists only because I wasted too much time on ap922float_sse,
         but I didn't want to leave out AMD users in this release.
     
         ap922hybrid is available in both C-code and AMD 3D-Now versions.

         To compile the 3D-Now version "as is", you will need Visual C++
         6.0 (5.0 should work, too).

     A program-generated report for each iDCT is in \REPORTS.
     
     The ZIP archive may contain additional (undocumented) files.
     These should be ignored.

     Finally, a MATLAB script file (ap922.m) is included for users
     with MATLAB.  Several AP922 matrices (pertaining to the
     idct_column operators) are already entered in it.  This script
     facilitated AP922float's development.  It requires the Symbolic
     Math Toolbox for MATLAB.

--------------------------------------------------------------------------
2) iDCT Description
--------------------------------------------------------------------------

          This package assumes the user is familiar with the
     mathematics of the discrete cosine transform (DCT).  The
     functions in this package are targeted at MPEG applications.
     These applications have additional input/output format
     requirements, listed below:

     iDCT Input : 8x8 array (64 16-bit integers)
                  input elements lie in the range {-2048,+2047}

     iDCT Output: 8x8 array (64 16-bit integers)
                  output elements lie in the range {-256,+255}
     
     All iDCTs in this release produce 'normal' (non-transposed)
     output.  Most iDCTs properly clip the output to {-256,+255}.
     (Forgot about this the first time around.)
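
     The input/output contract above can be illustrated with a slow,
     direct-form iDCT.  This is only a sketch for checking formats and
     ranges, not code from this package: the names idct8x8_sketch,
     cos_pi16 and round_half_up are hypothetical, and the rounding
     rule (nearest, halves rounded up) is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* cos(k*pi/16) for k = 0..8 */
static const double COS16[9] = {
    1.0,
    0.98078528040323044913,
    0.92387953251128675613,
    0.83146961230254523708,
    0.70710678118654752440,
    0.55557023301960222474,
    0.38268343236508977173,
    0.19509032201612826785,
    0.0
};

/* cos(k*pi/16) for any k >= 0, using the symmetries of the cosine. */
static double cos_pi16(int k)
{
    k %= 32;                                    /* period is 2*pi        */
    if (k > 16) k = 32 - k;                     /* cos(2*pi - a) = cos a */
    return (k > 8) ? -COS16[16 - k] : COS16[k]; /* cos(pi - a) = -cos a  */
}

/* Round to the nearest integer, halves rounded up (assumed rule). */
static long round_half_up(double v)
{
    long t = (long)v;                        /* truncates toward zero */
    if (v < 0.0 && (double)t != v) t -= 1;   /* now t == floor(v)     */
    return ((v - (double)t) >= 0.5) ? t + 1 : t;
}

/*
 * Direct double-precision 8x8 inverse DCT (hypothetical helper).
 * in  : 64 coefficients, row-major, expected range -2048..+2047
 * out : 64 samples, row-major (non-transposed), clipped to -256..+255
 */
void idct8x8_sketch(const short in[64], short out[64])
{
    double basis[8][8];  /* basis[x][u] = C(u)/2 * cos((2x+1)*u*pi/16) */
    int x, y, u, v;

    assert(in != NULL && out != NULL);

    for (x = 0; x < 8; x++)
        for (u = 0; u < 8; u++) {
            double cu = (u == 0) ? COS16[4] : 1.0;   /* C(0) = 1/sqrt(2) */
            basis[x][u] = 0.5 * cu * cos_pi16((2 * x + 1) * u);
        }

    for (y = 0; y < 8; y++)
        for (x = 0; x < 8; x++) {
            double sum = 0.0;
            long r;
            for (v = 0; v < 8; v++)
                for (u = 0; u < 8; u++)
                    sum += basis[y][v] * basis[x][u] * in[v * 8 + u];
            r = round_half_up(sum);
            if (r < -256) r = -256;          /* clip to the output range */
            if (r > 255)  r = 255;
            out[y * 8 + x] = (short)r;
        }
}
```

     For example, a block whose only nonzero coefficient is a DC value
     of 8 yields 64 samples of 1, because the DC term contributes
     F(0,0)/8 to every output sample.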
     
--------------------------------------------------------------------------
3)  Descriptions of individual iDCT algorithms
--------------------------------------------------------------------------

  ----------------------------------------
  IDCT_AP528MMX.C - 16-bit MMX AAN iDCT (Intel AP-528)

     Not IEEE 1180-1990 compliant
     
          In its original form, Intel's AP-528 iDCT (AAN) is pretty
     awful.  The iDCT requires input data to be left-shifted by 4
     bits, and the iDCT outputs the matrix in transposed order.
     Finally, the output is mean-shifted by approximately -0.5.  The
     mean-shift is sufficient to cause VISIBLE artifacts in MPEG
     decoding engines, where iDCT output values are applied as
     delta-values to inter-coded (predicted) frames.  The error
     propagates from frame to frame within MPEG sequences, and has
     the visual appearance of color/brightness shifts that cycle up
     and down.
     
     The improved AP528_IDCT contains two noteworthy modifications:
     
     1)   In-place left-shift of input data.  The left-shift
     eliminates the need to pre-shift the iDCT input coefficients,
     and doing it in place imposes minimal performance overhead
     (<10 cycles on Pentium/MMX).  This is generally faster than
     pre-scaling the data outside the AAN iDCT function.
     
     2)   Mean-shift compensation.  The DC-coefficient [iDCT(0,0)] is
     padded upward prior to transformation.  This reduces the mean
     error to about -0.03, enough to eliminate the color/brightness
     cycling artifacts described earlier.
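
     In plain C, the two modifications amount to roughly the
     following.  This is a sketch, not the MMX code in this package:
     the function name is hypothetical, the dc_bias value must be
     supplied by the caller (the actual padding amount is not stated
     here), and applying the bias after the shift is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Prepare one 8x8 coefficient block for an AAN iDCT that expects
 * its input pre-shifted left by 4 bits (hypothetical helper).
 *
 * blk     : 64 coefficients, each expected in -2048..+2047
 * dc_bias : HYPOTHETICAL padding added to the shifted DC term;
 *           the real value used by the package is not given here.
 */
void ap528_prepare_input_sketch(short blk[64], short dc_bias)
{
    int i;

    assert(blk != NULL);

    /* Multiply by 16 instead of using <<, because left-shifting a
       negative signed value is undefined behaviour in C. */
    for (i = 0; i < 64; i++)
        blk[i] = (short)(blk[i] * 16);

    /* Mean-shift compensation: nudge the DC coefficient upward. */
    blk[0] = (short)(blk[0] + dc_bias);
}
```

     Because the input range is -2048..+2047, the scaled values stay
     within 16 bits (-32768..+32752), provided dc_bias is small.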
     
          I added an output-transposer to idct_ap528.c, to make the
     output compatible with the ieee1180 tester.  Since the transposer
     is completely separate from the core iDCT-operation, the code
     can be deleted to save computational time, provided that the
     application accepts transposed iDCT output.
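
     For reference, an output transposer of this kind is just an
     in-place swap across the diagonal of the 8x8 block.  The
     following is a plain-C sketch with a hypothetical name, not the
     routine from idct_ap528.c:

```c
#include <assert.h>
#include <stddef.h>

/* Transpose a row-major 8x8 block in place (hypothetical helper). */
void transpose8x8_sketch(short blk[64])
{
    int r, c;

    assert(blk != NULL);

    for (r = 0; r < 8; r++)
        for (c = r + 1; c < 8; c++) {   /* visit upper triangle only */
            short t        = blk[r * 8 + c];
            blk[r * 8 + c] = blk[c * 8 + r];
            blk[c * 8 + r] = t;
        }
}
```

     Applying it twice restores the original block, which is a quick
     sanity check when deciding whether the step can be dropped.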

  ----------------------------------------------------
  IDCTCHEN.C - 16-bit MMX Chen1 iDCT (Khanh Nguyen-Phi)

     Not IEEE 1180-1990 compliant
     
          Although not IEEE compliant, this iDCT is just as fast as
     Intel's AAN, but more accurate.  Khanh's original Chen1 function
     contains raw assembly code that is not optimally scheduled,
     so I hand-optimized it a little more.
          The modified Chen1_IDCT is (almost) optimally scheduled for
     Intel MMX CPUs.  K6-2/Athlon CPUs will run the code faster
     (thanks to AMD's better MMX unit.)  

  ---------------------------------------------------------
  IDCT_AP922FLOAT (derived from Intel AP-922)

     IEEE 1180-1990 compliant

     After releasing the marginally accurate AP528, Intel did
     further work on DCT implementations, and released AP922.
     AP922 is almost as fast as the earlier Intel MMX DCTs, but
     far more accurate, meeting IEEE-1180/1990 requirements.
     
     "AP922float" is the name I give to my derivative work.
     This iDCT uses the basic algorithm presented in AP-922, and
     substitutes floating-point math for scaled-int (MMX) math.

     The benefits of floating-point math are easily revealed in
     the ieee1180 test suite.  AP922float is extremely accurate.  
     When double-precision and accurate-rounding mode(s) are used,
     AP922float matches the IEEE-1180/1990 reference-IDCT, yet 
     runs approximately 5X faster.  (Test CPU is AMD K6/2.)

     Three versions of AP922float are included :
       1) Intel SSE - requires Pentium-III or Celeron2 (533A, 566+)
          single-precision (32-bit) floating point, accurate rounding
          idct_ap922sse_rawcode.c - "raw code" (unoptimized version)
          idct_ap922sse_opt.c     - "optimized code" (slightly faster)

       2) AMD 3DNow - requires AMD K6/2 or other 3DNow capable CPU
          single-precision (32-bit) floating point, truncation rounding
          idct_ap922tdn.c

       3) C-code - x87 (FPU) works with any CPU (i486DX or later)
          selectable precision & rounding-mode
          *when double-precision and accurate-rounding mode(s) are used,
           iDCT output matches the IEEE-1180/1990 reference-IDCT
          
          idct_ap922tdn.c - (natural coefficient table w[] order)
          idct_ap922x87.c - (Intel-SSE table w[] order)
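
     The difference between the "accurate rounding" and "truncation
     rounding" modes named above can be sketched in C as follows.
     The function names are hypothetical, and the tie-breaking rule
     (halves rounded away from zero) is an assumption; the package's
     own code may round differently.

```c
#include <assert.h>

/* Truncation: drop the fraction, i.e. round toward zero. */
short to_short_truncate_sketch(float v)
{
    assert(v > -32768.0f && v < 32767.0f);
    return (short)(int)v;
}

/* Round to nearest: offset by half a unit away from zero, then
   truncate.  Ties go away from zero (assumed rule). */
short to_short_nearest_sketch(float v)
{
    assert(v > -32768.0f && v < 32767.0f);
    return (short)(int)(v < 0.0f ? v - 0.5f : v + 0.5f);
}
```

     Truncation always moves a value toward zero, so each converted
     sample can be off by almost a full unit, whereas rounding to
     nearest keeps the conversion error within half a unit.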
          

  ---------------------------------------------------------
  IDCT_AP922HYBRID (derived from Intel AP-922)
  

     IEEE 1180-1990 compliant

     "AP922hybrid" is the name I give to my derivative work.
     This iDCT combines the original AP-922 MMX iDCT with the new
     AP922float iDCT.  The 3D-Now AP922hybrid is faster than
     AP922float (3DNow), but less accurate.

     Two versions of AP922hybrid are included :
       1) AMD 3DNow - requires CPU with 3D-Now support (AMD K6/2 or later)
          single-precision (32-bit) floating point, truncation rounding
          idct_ap922hybr.c

       2) C-code - x87 (FPU) works with any CPU (i486DX or later)
          single-precision (32-bit) floating point, truncation rounding
          This version offers no advantage over the AP922float;
          it is provided for educational purposes only.
          idct_ap922hybr.c

   ----------------------------------------------
   Integer 32-bit iDCTs (JPEG and MPEG reference)

     IEEE 1180-1990 compliant
     
     These IDCTs are part of the MPEG and JPEG reference software.


--------------------------------------------------------------------------
4)  Test results
--------------------------------------------------------------------------
     This ZIP includes the test package from www.mpeg.org : ieee1180.tar.gz
     So far, this is the only publicly released test-package that I've 
     found.  I used it to generate the reports in this package.

     Full test reports for each individual iDCT are located in \REPORTS.
     idct_summary.txt compares all iDCTs using one test from
     the suite.
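
     The reports are built from error statistics between a reference
     iDCT and the iDCT under test.  The sketch below shows how such
     statistics (peak error, mean error, mean square error) can be
     accumulated.  It is an illustration with hypothetical names, not
     the code of the ieee1180 test package, and it does not reproduce
     the pass/fail thresholds of the standard.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int    peak_error;     /* largest |test - ref| over all samples */
    double mean_error;     /* average of (test - ref), keeps sign   */
    double mean_sq_error;  /* average of (test - ref)^2             */
} idct_error_stats;

/* Compare n output samples of a tested iDCT against a reference. */
idct_error_stats idct_compare_sketch(const short *ref,
                                     const short *test, size_t n)
{
    idct_error_stats s = {0, 0.0, 0.0};
    double sum = 0.0, sum_sq = 0.0;
    size_t i;

    assert(ref != NULL && test != NULL && n > 0);

    for (i = 0; i < n; i++) {
        int err = (int)test[i] - (int)ref[i];
        int mag = (err < 0) ? -err : err;
        if (mag > s.peak_error) s.peak_error = mag;
        sum    += err;
        sum_sq += (double)err * err;
    }
    s.mean_error    = sum / (double)n;
    s.mean_sq_error = sum_sq / (double)n;
    return s;
}
```

     A signed mean error is the figure that exposes a systematic bias
     such as the -0.5 mean-shift of the original AP-528 iDCT.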

--------------------------------------------------------------------------
5)  Known problems
--------------------------------------------------------------------------

   Intel SSE-code and data alignment

     The included project file does not align data structures to
     16-byte memory offsets.  This causes problems for the Intel-SSE
     code, which *requires* 16-byte alignment.  The Intel-SSE code
     will crash with a general-protection fault if data is not
     aligned properly.

     For now, the "workaround" is to declare additional dummy variables
     at global scope, in the hope that the added vars will coincidentally
     align the coefficient tables for the Intel-SSE code.

   MMX-code and data alignment

     Visual C++ 6 does not align data structures to 8-byte memory
     offsets.  Although the MMX code will still run (without generating
     an error), the unaligned memory accesses incur a severe
     performance penalty.
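
     One way to avoid relying on coincidental alignment is to
     over-allocate a buffer and round the pointer up to the required
     boundary (16 bytes for SSE, 8 bytes for MMX).  The sketch below
     is not part of this package.  The helper names are hypothetical,
     and it assumes a compiler that provides <stdint.h>, which
     Visual C++ 6 does not.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Round a pointer up to the next multiple of `alignment`
   (which must be a power of two). */
void *align_up_sketch(void *p, size_t alignment)
{
    uintptr_t addr = (uintptr_t)p;

    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (void *)((addr + alignment - 1) & ~(uintptr_t)(alignment - 1));
}

/*
 * Allocate `count` floats starting on an `alignment`-byte boundary.
 * *raw_out receives the pointer that must later be passed to free().
 * Returns NULL if the allocation fails.
 */
float *alloc_aligned_floats_sketch(size_t count, size_t alignment,
                                   void **raw_out)
{
    void *raw;

    assert(raw_out != NULL);

    /* Extra (alignment - 1) bytes leave room for rounding up. */
    raw = malloc(count * sizeof(float) + alignment - 1);
    *raw_out = raw;
    return raw ? (float *)align_up_sketch(raw, alignment) : NULL;
}
```

     A caller would copy the coefficient tables into the aligned
     buffer once at start-up, hand the aligned pointer to the iDCT
     code, and release the raw pointer with free() when finished.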


--------------------------------------------------------------------------
Useful links
--------------------------------------------------------------------------
  http://developer.intel.com/vtune/cbts/strmsimd - AP922 document (PDF)
  http://www.elecard.com/peter      - AP922 mmx iDCT
  http://www.concentric.net/~psilon - VirtualDub (AP922 mmx iDCT)

--------------------------------------------------------------------------
Revision History
--------------------------------------------------------------------------

09/04/2000 v1.2
      removed idctmm32.c (broken!)
      added idct_ap922float   (Intel SSE, AMD 3DNow, and standard C-code)
      added idct_ap922hybrid  (AMD 3DNow and standard C-code)
      ieeetest.c now tests for proper output range {-256,+255}

11/04/99 v1.1
      optimized idctmm32.c, added idctmm32_transpose.c
      modified ieee1180.c to issue an "emms" instruction after each
       j_rev_dct() call

10/31/99 v1.0
      initial release, includes 3 MMX iDCT functions and ieee1180
       test reports

liaor@iname.com      http://members.tripod.com/~liaor
