    Arm64 CRC32 parallel computation optimization for RocksDB (#5494) · a3c1832e
    Committed by Yuqi Gu
    Summary:
    CRC32C parallel computation optimization:
    The algorithm comes from the Intel whitepaper: [crc-iscsi-polynomial-crc32-instruction-paper](https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/crc-iscsi-polynomial-crc32-instruction-paper.pdf)
    Input data is divided into three equal-sized blocks.
    Three parallel blocks (crc0, crc1, crc2) for 1024 bytes.
    One block: 42 (BLK_LENGTH) * 8 (step length: crc32c_u64) bytes.
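    
    The three-block split described above can be illustrated with a portable sketch. This is not the code from this change: it uses a bitwise software CRC32C instead of the Arm crc32c instructions, and it combines the partial results by feeding zero bytes through the CRC, which stands in for the carry-less multiply by precomputed constants described in the whitepaper. The names RawUpdate, ShiftZeros, Crc32cLinear and Crc32cThreeWay are hypothetical and chosen only for this illustration.
    
    ```cpp
    #include <cassert>
    #include <cstddef>
    #include <cstdint>
    
    // Raw CRC32C state update (reflected Castagnoli polynomial 0x82F63B78),
    // without the initial/final bit inversion.
    inline uint32_t RawUpdate(uint32_t state, const uint8_t* data, size_t n) {
      for (size_t i = 0; i < n; ++i) {
        state ^= data[i];
        for (int bit = 0; bit < 8; ++bit) {
          state = (state >> 1) ^ (0x82F63B78u & (0u - (state & 1u)));
        }
      }
      return state;
    }
    
    // Advance a CRC state across n zero bytes. A hardware version would
    // replace this loop with a carry-less multiply by a precomputed constant.
    inline uint32_t ShiftZeros(uint32_t state, size_t n) {
      for (size_t i = 0; i < n * 8; ++i) {
        state = (state >> 1) ^ (0x82F63B78u & (0u - (state & 1u)));
      }
      return state;
    }
    
    // Reference: one sequential pass over the whole buffer.
    inline uint32_t Crc32cLinear(const uint8_t* data, size_t n) {
      return ~RawUpdate(0xFFFFFFFFu, data, n);
    }
    
    // Three-way split: crc0 carries the initial state, crc1 and crc2 start
    // from zero. The three RawUpdate calls are independent of each other,
    // which is what lets hardware overlap them.
    inline uint32_t Crc32cThreeWay(const uint8_t* data, size_t n) {
      assert(data != nullptr || n == 0);
      const size_t blk = n / 3;  // three equal-sized blocks
      const uint32_t crc0 = RawUpdate(0xFFFFFFFFu, data, blk);
      const uint32_t crc1 = RawUpdate(0u, data + blk, blk);
      const uint32_t crc2 = RawUpdate(0u, data + 2 * blk, blk);
      // CRC is linear over GF(2): shift crc0 past two blocks, crc1 past one.
      uint32_t state = ShiftZeros(crc0, 2 * blk) ^ ShiftZeros(crc1, blk) ^ crc2;
      // Bytes left over after the three blocks are processed sequentially.
      state = RawUpdate(state, data + 3 * blk, n - 3 * blk);
      return ~state;
    }
    ```
    
    For any input, Crc32cThreeWay should return the same value as Crc32cLinear. The sketch itself is not faster, since the zero-byte shift costs as much as the work it saves; the speedup reported below depends on replacing that shift with a constant-time multiply and on the CPU overlapping the three independent CRC chains.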
    
    1. crc32c_test:
    ```
    [==========] Running 4 tests from 1 test case.
    [----------] Global test environment set-up.
    [----------] 4 tests from CRC
    [ RUN      ] CRC.StandardResults
    [       OK ] CRC.StandardResults (1 ms)
    [ RUN      ] CRC.Values
    [       OK ] CRC.Values (0 ms)
    [ RUN      ] CRC.Extend
    [       OK ] CRC.Extend (0 ms)
    [ RUN      ] CRC.Mask
    [       OK ] CRC.Mask (0 ms)
    [----------] 4 tests from CRC (1 ms total)
    
    [----------] Global test environment tear-down
    [==========] 4 tests from 1 test case ran. (1 ms total)
    [  PASSED  ] 4 tests.
    ```
    
    2. RocksDB benchmark: db_bench --benchmarks="crc32c"
    
    ```
    Linear Arm crc32c:
      crc32c: 1.005 micros/op 995133 ops/sec; 3887.2 MB/s (4096 per op)
    ```
    
    ```
    Parallel optimization with Armv8 crypto extension:
      crc32c: 0.419 micros/op 2385078 ops/sec; 9316.7 MB/s (4096 per op)
    ```
    
    The parallel version is ~2.4x faster than the linear Arm crc32c instructions (9316.7 MB/s vs. 3887.2 MB/s).
    Pull Request resolved: https://github.com/facebook/rocksdb/pull/5494
    
    Differential Revision: D16340806
    
    fbshipit-source-id: 95dae9a5b646fd20a8303671d82f17b2e162e945