1. 28 4月, 2020 2 次提交
    • R
      arm64/lib: improve CRC32 performance for deep pipelines · acc2d05a
      Rongwei Wang 提交于
      to #26730415
      
      commit efdb25efc7645b326cd5eb82be5feeabe167c24e upstream.
      
      Improve the performance of the crc32() asm routines by getting rid of
      most of the branches and small sized loads on the common path.
      
      Instead, use a branchless code path involving overlapping 16 byte
      loads to process the first (length % 32) bytes, and process the
      remainder using a loop that processes 32 bytes at a time.
      
      Tested using the following test program:
      
        #include <stdlib.h>
      
        extern void crc32_le(unsigned short, char const*, int);
      
        int main(void)
        {
          static const char buf[4096];
      
          srand(20181126);
      
          for (int i = 0; i < 100 * 1000 * 1000; i++)
            crc32_le(0, buf, rand() % 1024);
      
          return 0;
        }
      
      On Cortex-A53 and Cortex-A57, the performance regresses but only very
      slightly. On Cortex-A72 however, the performance improves from
      
        $ time ./crc32
      
        real  0m10.149s
        user  0m10.149s
        sys   0m0.000s
      
      to
      
        $ time ./crc32
      
        real  0m7.915s
        user  0m7.915s
        sys   0m0.000s
      
      Cc: Rui Sun <sunrui26@huawei.com>
      Signed-off-by: NArd Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: NWill Deacon <will.deacon@arm.com>
      Signed-off-by: NRongwei Wang <rongwei.wang@linux.alibaba.com>
      Acked-by: NZou Cao <zoucao@linux.alibaba.com>
      acc2d05a
    • R
      arm64/lib: add accelerated crc32 routines · 19473a51
      Rongwei Wang 提交于
      to #26730415
      
      commit 7481cddf29ede204b475facc40e6f65459939881 upstream.
      
      Unlike crc32c(), which is wired up to the crypto API internally so the
      optimal driver is selected based on the platform's capabilities,
      crc32_le() is implemented as a library function using a slice-by-8 table
      based C implementation. Even though few of the call sites may be
      bottlenecks, calling a time variant implementation with a non-negligible
      D-cache footprint is a bit of a waste, given that ARMv8.1 and up
      mandates
      support for the CRC32 instructions that were optional in ARMv8.0, but
      are
      already widely available, even on the Cortex-A53 based Raspberry Pi.
      
      So implement routines that use these instructions if available, and fall
      back to the existing generic routines otherwise. The selection is based
      on alternatives patching.
      
      Note that this unconditionally selects CONFIG_CRC32 as a builtin. Since
      CRC32 is relied upon by core functionality such as CONFIG_OF_FLATTREE,
      this just codifies the status quo.
      Signed-off-by: NArd Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: NRongwei Wang <rongwei.wang@linux.alibaba.com>
      Acked-by: NZou Cao <zoucao@linux.alibaba.com>
      19473a51