1. 11 May 2012: 1 commit
      rv40dsp x86: MMX/MMX2/3DNow/SSE2/SSSE3 implementations of MC · 110d0cdc
      Committed by Christophe Gisquet
      Code mostly inspired by vp8's MC, however:
      - its MMX2 horizontal filter is worse because it can't take advantage of
        the coefficient redundancy
      - that same coefficient redundancy allows better code for non-SSSE3 versions
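The "coefficient redundancy" mentioned above can be illustrated with a scalar C sketch. It assumes that the RV40 luma sub-pixel filters are 6-tap kernels of the form (1, -5, C1, C2, -5, 1), where only the two centre taps change with the sub-pixel position (52/20 with shift 6 for 1/4, 20/20 with shift 5 for 1/2, 20/52 with shift 6 for 3/4). That kernel layout is an assumption of this sketch, and `rv40_h_filter_row` and `round_shift_clip` are hypothetical names, not the rv40dsp API or the commit's assembly.

```c
#include <assert.h>
#include <stdint.h>

/* Round, shift and clamp an intermediate filter sum to the 0..255 range. */
static uint8_t round_shift_clip(int sum, int shift)
{
    int v = sum + (1 << (shift - 1));  /* round to nearest */
    if (v < 0)
        return 0;                      /* avoid shifting a negative value */
    v >>= shift;
    return (uint8_t)(v > 255 ? 255 : v);
}

/*
 * Hypothetical scalar reference for one row of an RV40-style horizontal
 * 6-tap filter. Assumed taps: (1, -5, C1, C2, -5, 1).
 *   pos 1 (1/4 pel): C1 = 52, C2 = 20, shift 6
 *   pos 2 (1/2 pel): C1 = 20, C2 = 20, shift 5
 *   pos 3 (3/4 pel): C1 = 20, C2 = 52, shift 6
 * Because the outer tap pairs share a coefficient, each pair is summed
 * first and multiplied once. src needs 2 readable pixels to the left of
 * the row and 3 to the right.
 */
static void rv40_h_filter_row(uint8_t *dst, const uint8_t *src,
                              int width, int pos)
{
    static const int c1[4]    = { 0, 52, 20, 20 };
    static const int c2[4]    = { 0, 20, 20, 52 };
    static const int shift[4] = { 0,  6,  5,  6 };

    assert(pos >= 1 && pos <= 3);
    for (int x = 0; x < width; x++) {
        const uint8_t *s = src + x;
        int outer = s[-2] + s[3];  /* both taps weighted  1 */
        int inner = s[-1] + s[2];  /* both taps weighted -5 */
        int sum = outer - 5 * inner + c1[pos] * s[0] + c2[pos] * s[1];
        dst[x] = round_shift_clip(sum, shift[pos]);
    }
}
```

Summing the symmetric pairs first leaves one multiplication by the constant -5 and two by position-dependent coefficients per output pixel, instead of six multiplications.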
      
      Benchmark (values in tens of units, rounded):
              V8x8  H8x8  2D8x8  V16x16  H16x16  2D16x16
      C       445    358   985    1785    1559    3280
      MMX*    219    271   478     714     929    1443
      SSE2    131    158   294     425     515     892
      SSSE3   120    122   248     387     390     763
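The per-function gains in the table can be read as speedup ratios over the C baseline. The short C sketch below only divides the figures copied from the table; whatever unit the original benchmark used cancels out in the ratio. The helper names are illustrative, not part of the commit.

```c
#include <assert.h>
#include <stdio.h>

/* Benchmark figures copied verbatim from the table in the commit message. */
static const char *const impl_names[4]  = { "C", "MMX*", "SSE2", "SSSE3" };
static const char *const block_names[6] = { "V8x8", "H8x8", "2D8x8",
                                            "V16x16", "H16x16", "2D16x16" };
static const int bench[4][6] = {
    { 445, 358, 985, 1785, 1559, 3280 },  /* C     */
    { 219, 271, 478,  714,  929, 1443 },  /* MMX*  */
    { 131, 158, 294,  425,  515,  892 },  /* SSE2  */
    { 120, 122, 248,  387,  390,  763 },  /* SSSE3 */
};

/* Speedup of implementation row `impl` over the C row for one column. */
static double speedup_over_c(int impl, int block)
{
    assert(impl >= 0 && impl < 4 && block >= 0 && block < 6);
    return (double)bench[0][block] / bench[impl][block];
}

/* Print the speedup of every SIMD row, one line per implementation. */
static void print_speedups(void)
{
    for (int i = 1; i < 4; i++) {
        printf("%-6s", impl_names[i]);
        for (int b = 0; b < 6; b++)
            printf(" %s=%.2fx", block_names[b], speedup_over_c(i, b));
        putchar('\n');
    }
}
```

For example, the SSSE3 2D16x16 function comes out at about 4.3x the C version (3280 / 763). These are per-function ratios; the overall decoding speedup quoted below is much smaller because MC is only part of the decoding time.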
      
      The end result is an overall speedup of around 15% for the SSSE3 version (on 6 sequences);
      all loop filter functions now take around 55% of decoding time, while luma MC
      dsp functions take around 6%, chroma ones 1.3% and biweight around 2.3%.
      Signed-off-by: Diego Biurrun <diego@biurrun.de>
  2. 10 May 2012: 29 commits
  3. 09 May 2012: 10 commits