    crypto: arm/chacha20 - faster 8-bit rotations and other optimizations · a1b22a5f
    Committed by Eric Biggers
    Optimize ChaCha20 NEON performance by:
    
    - Implementing the 8-bit rotations using the 'vtbl.8' instruction.
    - Streamlining the part that adds the original state and XORs the data.
    - Making some other small tweaks.
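
The first item relies on the fact that rotating a 32-bit word by 8 bits only moves whole bytes, so it can be done with a single byte-permutation (table lookup) instead of a shift pair combined with an OR or insert. The C sketch below illustrates that equivalence in scalar code; it is not the patch's assembly. The helper names and the index table are assumptions chosen for illustration, with the table derived for little-endian byte order and two words per 8-byte vector.

```c
#include <stdint.h>

/* Rotate left by 8 with the conventional shift-and-OR sequence. */
static uint32_t rotl8_shift(uint32_t x)
{
    return (x << 8) | (x >> 24);
}

/*
 * Illustrative index table (an assumption, derived for little-endian
 * layout): output byte i of each 32-bit word takes input byte (i + 3) % 4
 * of the same word. Two words fit in one 8-byte vector.
 */
static const uint8_t rol8_table[8] = { 3, 0, 1, 2, 7, 4, 5, 6 };

/* Scalar stand-in for a byte table lookup: out[i] = in[table[i]]. */
static void permute_bytes(uint8_t out[8], const uint8_t in[8],
                          const uint8_t table[8])
{
    for (int i = 0; i < 8; i++)
        out[i] = in[table[i]];
}

/* Store a 32-bit word in little-endian byte order. */
static void store_le32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)v;
    p[1] = (uint8_t)(v >> 8);
    p[2] = (uint8_t)(v >> 16);
    p[3] = (uint8_t)(v >> 24);
}

/* Load a 32-bit word from little-endian byte order. */
static uint32_t load_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Rotate two words left by 8 using only the byte permutation. */
static void rotl8_table_x2(uint32_t w[2])
{
    uint8_t in[8], out[8];

    store_le32(in, w[0]);
    store_le32(in + 4, w[1]);
    permute_bytes(out, in, rol8_table);
    w[0] = load_le32(out);
    w[1] = load_le32(out + 4);
}

/* Returns 1 if both methods give the same result for the pair (a, b). */
static int rotl8_methods_agree(uint32_t a, uint32_t b)
{
    uint32_t w[2] = { a, b };

    rotl8_table_x2(w);
    return w[0] == rotl8_shift(a) && w[1] == rotl8_shift(b);
}
```

Calling `rotl8_methods_agree()` on a few inputs is a quick way to convince yourself that the permutation and the shift-based rotation are interchangeable; which one is faster depends on the CPU, as the message goes on to note.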
    
    On ARM Cortex-A7, these optimizations improve ChaCha20 performance from
    about 12.08 cycles per byte to about 11.37 -- a 5.9% improvement.
    
    There is a tradeoff involved with the 'vtbl.8' rotation method since
    there is at least one CPU (Cortex-A53) where it's not fastest.  But it
    seems to be a better default; see the added comment.  Overall, this
    patch reduces Cortex-A53 performance by less than 0.5%.

    Signed-off-by: Eric Biggers <ebiggers@google.com>
    Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
chacha20-neon-core.S 12.6 KB