1. 17 July 2015 (3 commits)
    • crypto: poly1305 - Add a SSE2 SIMD variant for x86_64 · c70f4abe
      Martin Willi authored
      Implements an x86_64 assembler driver for the Poly1305 authenticator. This
      single-block variant holds the 130-bit integer in five 32-bit words, but uses
      SSE to do two multiplications/additions in parallel.
      
      When updates are called with small blocks, the overhead of
      kernel_fpu_begin()/kernel_fpu_end() negates the performance gain. We therefore
      fall back to poly1305-generic for small updates.
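      
      That small-update fallback can be sketched roughly as follows. This is an
      illustrative outline only, not the actual glue code; the threshold and the
      helper/struct names (POLY1305_SIMD_THRESHOLD, poly1305_simd_blocks,
      poly1305_generic_blocks, struct poly1305_sketch_state) are hypothetical,
      while kernel_fpu_begin()/kernel_fpu_end() and irq_fpu_usable() are the real
      x86 kernel FPU helpers.
      
      /* Hedged sketch of the dispatch idea described above. */
      static void poly1305_do_update(struct poly1305_sketch_state *st,
                                     const u8 *src, unsigned int len)
      {
              if (len < POLY1305_SIMD_THRESHOLD || !irq_fpu_usable()) {
                      /* saving/restoring SIMD state would cost more than SSE2 saves */
                      poly1305_generic_blocks(st, src, len);
                      return;
              }
      
              kernel_fpu_begin();     /* make SSE registers usable in kernel context */
              poly1305_simd_blocks(st, src, len);
              kernel_fpu_end();
      }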
      
      For large messages, throughput increases by ~5-10% compared to
      poly1305-generic:
      
      testing speed of poly1305 (poly1305-generic)
      test  0 (   96 byte blocks,   16 bytes per update,   6 updates): 4080026 opers/sec,  391682496 bytes/sec
      test  1 (   96 byte blocks,   32 bytes per update,   3 updates): 6221094 opers/sec,  597225024 bytes/sec
      test  2 (   96 byte blocks,   96 bytes per update,   1 updates): 9609750 opers/sec,  922536057 bytes/sec
      test  3 (  288 byte blocks,   16 bytes per update,  18 updates): 1459379 opers/sec,  420301267 bytes/sec
      test  4 (  288 byte blocks,   32 bytes per update,   9 updates): 2115179 opers/sec,  609171609 bytes/sec
      test  5 (  288 byte blocks,  288 bytes per update,   1 updates): 3729874 opers/sec, 1074203856 bytes/sec
      test  6 ( 1056 byte blocks,   32 bytes per update,  33 updates):  593000 opers/sec,  626208000 bytes/sec
      test  7 ( 1056 byte blocks, 1056 bytes per update,   1 updates): 1081536 opers/sec, 1142102332 bytes/sec
      test  8 ( 2080 byte blocks,   32 bytes per update,  65 updates):  302077 opers/sec,  628320576 bytes/sec
      test  9 ( 2080 byte blocks, 2080 bytes per update,   1 updates):  554384 opers/sec, 1153120176 bytes/sec
      test 10 ( 4128 byte blocks, 4128 bytes per update,   1 updates):  278715 opers/sec, 1150536345 bytes/sec
      test 11 ( 8224 byte blocks, 8224 bytes per update,   1 updates):  140202 opers/sec, 1153022070 bytes/sec
      
      testing speed of poly1305 (poly1305-simd)
      test  0 (   96 byte blocks,   16 bytes per update,   6 updates): 3790063 opers/sec,  363846076 bytes/sec
      test  1 (   96 byte blocks,   32 bytes per update,   3 updates): 5913378 opers/sec,  567684355 bytes/sec
      test  2 (   96 byte blocks,   96 bytes per update,   1 updates): 9352574 opers/sec,  897847104 bytes/sec
      test  3 (  288 byte blocks,   16 bytes per update,  18 updates): 1362145 opers/sec,  392297990 bytes/sec
      test  4 (  288 byte blocks,   32 bytes per update,   9 updates): 2007075 opers/sec,  578037628 bytes/sec
      test  5 (  288 byte blocks,  288 bytes per update,   1 updates): 3709811 opers/sec, 1068425798 bytes/sec
      test  6 ( 1056 byte blocks,   32 bytes per update,  33 updates):  566272 opers/sec,  597984182 bytes/sec
      test  7 ( 1056 byte blocks, 1056 bytes per update,   1 updates): 1111657 opers/sec, 1173910108 bytes/sec
      test  8 ( 2080 byte blocks,   32 bytes per update,  65 updates):  288857 opers/sec,  600823808 bytes/sec
      test  9 ( 2080 byte blocks, 2080 bytes per update,   1 updates):  590746 opers/sec, 1228751888 bytes/sec
      test 10 ( 4128 byte blocks, 4128 bytes per update,   1 updates):  301825 opers/sec, 1245936902 bytes/sec
      test 11 ( 8224 byte blocks, 8224 bytes per update,   1 updates):  153075 opers/sec, 1258896201 bytes/sec
      
      Benchmark results from a Core i5-4670T.
      Signed-off-by: Martin Willi <martin@strongswan.org>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      c70f4abe
    • crypto: chacha20 - Add an eight block AVX2 variant for x86_64 · 3d1e93cd
      Martin Willi authored
      Extends the x86_64 ChaCha20 implementation with a function that processes
      eight ChaCha20 blocks in parallel using AVX2.
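      
      The eight-wide path only pays off when enough data is queued, so the glue
      layer steps down through the available widths. The sketch below is a rough,
      hedged illustration of that dispatch, not the actual glue code; the block
      function names are illustrative and the single-block tail handling is
      omitted.
      
      /* Illustrative width dispatch: 8 blocks via AVX2 while possible, then
       * 4 blocks via SSSE3; CHACHA20_BLOCK_SIZE is the 64-byte ChaCha20 block. */
      static void chacha20_dosimd_sketch(u32 *state, u8 *dst, const u8 *src,
                                         unsigned int bytes)
      {
              while (bytes >= 8 * CHACHA20_BLOCK_SIZE) {
                      chacha20_8block_xor_avx2(state, dst, src);
                      bytes -= 8 * CHACHA20_BLOCK_SIZE;
                      src   += 8 * CHACHA20_BLOCK_SIZE;
                      dst   += 8 * CHACHA20_BLOCK_SIZE;
              }
              while (bytes >= 4 * CHACHA20_BLOCK_SIZE) {
                      chacha20_4block_xor_ssse3(state, dst, src);
                      bytes -= 4 * CHACHA20_BLOCK_SIZE;
                      src   += 4 * CHACHA20_BLOCK_SIZE;
                      dst   += 4 * CHACHA20_BLOCK_SIZE;
              }
              /* remaining full and partial blocks are handled one at a time */
      }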
      
      For large messages, throughput increases by ~55-70% compared to the
      four-block SSSE3 variant:
      
      testing speed of chacha20 (chacha20-simd) encryption
      test 0 (256 bit key, 16 byte blocks): 42249230 operations in 10 seconds (675987680 bytes)
      test 1 (256 bit key, 64 byte blocks): 46441641 operations in 10 seconds (2972265024 bytes)
      test 2 (256 bit key, 256 byte blocks): 33028112 operations in 10 seconds (8455196672 bytes)
      test 3 (256 bit key, 1024 byte blocks): 11568759 operations in 10 seconds (11846409216 bytes)
      test 4 (256 bit key, 8192 byte blocks): 1448761 operations in 10 seconds (11868250112 bytes)
      
      testing speed of chacha20 (chacha20-simd) encryption
      test 0 (256 bit key, 16 byte blocks): 41999675 operations in 10 seconds (671994800 bytes)
      test 1 (256 bit key, 64 byte blocks): 45805908 operations in 10 seconds (2931578112 bytes)
      test 2 (256 bit key, 256 byte blocks): 32814947 operations in 10 seconds (8400626432 bytes)
      test 3 (256 bit key, 1024 byte blocks): 19777167 operations in 10 seconds (20251819008 bytes)
      test 4 (256 bit key, 8192 byte blocks): 2279321 operations in 10 seconds (18672197632 bytes)
      
      Benchmark results from a Core i5-4670T.
      Signed-off-by: Martin Willi <martin@strongswan.org>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      3d1e93cd
    • crypto: chacha20 - Add a SSSE3 SIMD variant for x86_64 · c9320b6d
      Martin Willi authored
      Implements an x86_64 assembler driver for the ChaCha20 stream cipher. This
      single-block variant works on a single state matrix using SSE instructions.
      It requires SSSE3 due to the use of pshufb for efficient 8/16-bit rotate
      operations.
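      
      The pshufb rotate trick can be reproduced in a small standalone userspace
      program. The snippet below only illustrates how a byte shuffle
      (_mm_shuffle_epi8) rotates packed 32-bit words left by 16 and 8 bits; it is
      not the kernel assembly itself. Build with, e.g., gcc -O2 -mssse3 rot.c.
      
      #include <stdio.h>
      #include <stdint.h>
      #include <tmmintrin.h>   /* SSSE3: _mm_shuffle_epi8 (pshufb) */
      
      static uint32_t rotl32(uint32_t v, int n) { return (v << n) | (v >> (32 - n)); }
      
      int main(void)
      {
              /* byte-permutation masks: rotate each 32-bit lane left by 16 / 8 */
              const __m128i rot16 = _mm_set_epi8(13,12,15,14, 9,8,11,10, 5,4,7,6, 1,0,3,2);
              const __m128i rot8  = _mm_set_epi8(14,13,12,15, 10,9,8,11, 6,5,4,7, 2,1,0,3);
              uint32_t in[4] = { 0x01234567, 0x89abcdef, 0xdeadbeef, 0x00c0ffee };
              uint32_t o16[4], o8[4];
      
              __m128i v = _mm_loadu_si128((const __m128i *)in);
              _mm_storeu_si128((__m128i *)o16, _mm_shuffle_epi8(v, rot16));
              _mm_storeu_si128((__m128i *)o8,  _mm_shuffle_epi8(v, rot8));
      
              for (int i = 0; i < 4; i++)
                      printf("%08x: rotl16 %08x (%s)  rotl8 %08x (%s)\n",
                             (unsigned)in[i],
                             (unsigned)o16[i], o16[i] == rotl32(in[i], 16) ? "ok" : "BAD",
                             (unsigned)o8[i],  o8[i]  == rotl32(in[i],  8) ? "ok" : "BAD");
              return 0;
      }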
      
      For large messages, throughput increases by ~65% compared to
      chacha20-generic:
      
      testing speed of chacha20 (chacha20-generic) encryption
      test 0 (256 bit key, 16 byte blocks): 45089207 operations in 10 seconds (721427312 bytes)
      test 1 (256 bit key, 64 byte blocks): 43839521 operations in 10 seconds (2805729344 bytes)
      test 2 (256 bit key, 256 byte blocks): 12702056 operations in 10 seconds (3251726336 bytes)
      test 3 (256 bit key, 1024 byte blocks): 3371173 operations in 10 seconds (3452081152 bytes)
      test 4 (256 bit key, 8192 byte blocks): 422468 operations in 10 seconds (3460857856 bytes)
      
      testing speed of chacha20 (chacha20-simd) encryption
      test 0 (256 bit key, 16 byte blocks): 43141886 operations in 10 seconds (690270176 bytes)
      test 1 (256 bit key, 64 byte blocks): 46845874 operations in 10 seconds (2998135936 bytes)
      test 2 (256 bit key, 256 byte blocks): 18458512 operations in 10 seconds (4725379072 bytes)
      test 3 (256 bit key, 1024 byte blocks): 5360533 operations in 10 seconds (5489185792 bytes)
      test 4 (256 bit key, 8192 byte blocks): 692846 operations in 10 seconds (5675794432 bytes)
      
      Benchmark results from a Core i5-4670T.
      Signed-off-by: Martin Willi <martin@strongswan.org>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      c9320b6d
  2. 05 January 2015 (1 commit)
  3. 25 August 2014 (1 commit)
  4. 20 June 2014 (2 commits)
    • crypto: aes - AES CTR x86_64 "by8" AVX optimization · 22cddcc7
      Chandramouli Narayanan authored
      This patch introduces a "by8" AES CTR mode AVX optimization inspired by the
      Intel Optimized IPSEC Cryptographic library. For additional information,
      please see:
      http://downloadcenter.intel.com/Detail_Desc.aspx?agr=Y&DwnldID=22972
      
      The functions aes_ctr_enc_128_avx_by8(), aes_ctr_enc_192_avx_by8() and
      aes_ctr_enc_256_avx_by8() are adapted from the Intel Optimized IPSEC
      Cryptographic library. When both the AES and AVX features are enabled on a
      platform, the glue code in the AESNI module overrides the existing "by4" CTR
      mode en/decryption with the "by8" AES CTR mode en/decryption.
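      
      The glue code picks the routine matching the key length. The fragment below
      is a hedged sketch of that selection only; the argument order of the by8
      routines is an assumption here, and aes_ctr_by8_sketch() is a made-up
      wrapper, not the actual aesni-intel glue function.
      
      /* Illustrative only: dispatch on the expanded key length, mirroring the
       * key-length check mentioned later in this log. */
      static void aes_ctr_by8_sketch(struct crypto_aes_ctx *ctx, u8 *out,
                                     const u8 *in, unsigned int len, u8 *iv)
      {
              void *keys = ctx->key_enc;
      
              if (ctx->key_length == AES_KEYSIZE_128)
                      aes_ctr_enc_128_avx_by8(in, iv, keys, out, len);
              else if (ctx->key_length == AES_KEYSIZE_192)
                      aes_ctr_enc_192_avx_by8(in, iv, keys, out, len);
              else
                      aes_ctr_enc_256_avx_by8(in, iv, keys, out, len);
      }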
      
      On a Haswell desktop, with turbo disabled and all cpus running
      at maximum frequency, the "by8" CTR mode optimization
      shows better performance results across data & key sizes
      as measured by tcrypt.
      
      The average performance improvement of the "by8" version over the "by4"
      version is as follows:
      
      For 128 bit key and data sizes >= 256 bytes, there is a 10-16% improvement.
      For 192 bit key and data sizes >= 256 bytes, there is a 20-22% improvement.
      For 256 bit key and data sizes >= 256 bytes, there is a 20-25% improvement.
      
      A typical run of tcrypt with AES CTR mode encryption of the "by4" and "by8"
      optimization shows the following results:
      
      tcrypt with "by4" AES CTR mode encryption optimization on a Haswell Desktop:
      ---------------------------------------------------------------------------
      
      testing speed of __ctr-aes-aesni encryption
      test 0 (128 bit key, 16 byte blocks): 1 operation in 343 cycles (16 bytes)
      test 1 (128 bit key, 64 byte blocks): 1 operation in 336 cycles (64 bytes)
      test 2 (128 bit key, 256 byte blocks): 1 operation in 491 cycles (256 bytes)
      test 3 (128 bit key, 1024 byte blocks): 1 operation in 1130 cycles (1024 bytes)
      test 4 (128 bit key, 8192 byte blocks): 1 operation in 7309 cycles (8192 bytes)
      test 5 (192 bit key, 16 byte blocks): 1 operation in 346 cycles (16 bytes)
      test 6 (192 bit key, 64 byte blocks): 1 operation in 361 cycles (64 bytes)
      test 7 (192 bit key, 256 byte blocks): 1 operation in 543 cycles (256 bytes)
      test 8 (192 bit key, 1024 byte blocks): 1 operation in 1321 cycles (1024 bytes)
      test 9 (192 bit key, 8192 byte blocks): 1 operation in 9649 cycles (8192 bytes)
      test 10 (256 bit key, 16 byte blocks): 1 operation in 369 cycles (16 bytes)
      test 11 (256 bit key, 64 byte blocks): 1 operation in 366 cycles (64 bytes)
      test 12 (256 bit key, 256 byte blocks): 1 operation in 595 cycles (256 bytes)
      test 13 (256 bit key, 1024 byte blocks): 1 operation in 1531 cycles (1024 bytes)
      test 14 (256 bit key, 8192 byte blocks): 1 operation in 10522 cycles (8192 bytes)
      
      testing speed of __ctr-aes-aesni decryption
      test 0 (128 bit key, 16 byte blocks): 1 operation in 336 cycles (16 bytes)
      test 1 (128 bit key, 64 byte blocks): 1 operation in 350 cycles (64 bytes)
      test 2 (128 bit key, 256 byte blocks): 1 operation in 487 cycles (256 bytes)
      test 3 (128 bit key, 1024 byte blocks): 1 operation in 1129 cycles (1024 bytes)
      test 4 (128 bit key, 8192 byte blocks): 1 operation in 7287 cycles (8192 bytes)
      test 5 (192 bit key, 16 byte blocks): 1 operation in 350 cycles (16 bytes)
      test 6 (192 bit key, 64 byte blocks): 1 operation in 359 cycles (64 bytes)
      test 7 (192 bit key, 256 byte blocks): 1 operation in 635 cycles (256 bytes)
      test 8 (192 bit key, 1024 byte blocks): 1 operation in 1324 cycles (1024 bytes)
      test 9 (192 bit key, 8192 byte blocks): 1 operation in 9595 cycles (8192 bytes)
      test 10 (256 bit key, 16 byte blocks): 1 operation in 364 cycles (16 bytes)
      test 11 (256 bit key, 64 byte blocks): 1 operation in 377 cycles (64 bytes)
      test 12 (256 bit key, 256 byte blocks): 1 operation in 604 cycles (256 bytes)
      test 13 (256 bit key, 1024 byte blocks): 1 operation in 1527 cycles (1024 bytes)
      test 14 (256 bit key, 8192 byte blocks): 1 operation in 10549 cycles (8192 bytes)
      
      tcrypt with "by8" AES CTR mode encryption optimization on a Haswell Desktop:
      ---------------------------------------------------------------------------
      
      testing speed of __ctr-aes-aesni encryption
      test 0 (128 bit key, 16 byte blocks): 1 operation in 340 cycles (16 bytes)
      test 1 (128 bit key, 64 byte blocks): 1 operation in 330 cycles (64 bytes)
      test 2 (128 bit key, 256 byte blocks): 1 operation in 450 cycles (256 bytes)
      test 3 (128 bit key, 1024 byte blocks): 1 operation in 1043 cycles (1024 bytes)
      test 4 (128 bit key, 8192 byte blocks): 1 operation in 6597 cycles (8192 bytes)
      test 5 (192 bit key, 16 byte blocks): 1 operation in 339 cycles (16 bytes)
      test 6 (192 bit key, 64 byte blocks): 1 operation in 352 cycles (64 bytes)
      test 7 (192 bit key, 256 byte blocks): 1 operation in 539 cycles (256 bytes)
      test 8 (192 bit key, 1024 byte blocks): 1 operation in 1153 cycles (1024 bytes)
      test 9 (192 bit key, 8192 byte blocks): 1 operation in 8458 cycles (8192 bytes)
      test 10 (256 bit key, 16 byte blocks): 1 operation in 353 cycles (16 bytes)
      test 11 (256 bit key, 64 byte blocks): 1 operation in 360 cycles (64 bytes)
      test 12 (256 bit key, 256 byte blocks): 1 operation in 512 cycles (256 bytes)
      test 13 (256 bit key, 1024 byte blocks): 1 operation in 1277 cycles (1024 bytes)
      test 14 (256 bit key, 8192 byte blocks): 1 operation in 8745 cycles (8192 bytes)
      
      testing speed of __ctr-aes-aesni decryption
      test 0 (128 bit key, 16 byte blocks): 1 operation in 348 cycles (16 bytes)
      test 1 (128 bit key, 64 byte blocks): 1 operation in 335 cycles (64 bytes)
      test 2 (128 bit key, 256 byte blocks): 1 operation in 451 cycles (256 bytes)
      test 3 (128 bit key, 1024 byte blocks): 1 operation in 1030 cycles (1024 bytes)
      test 4 (128 bit key, 8192 byte blocks): 1 operation in 6611 cycles (8192 bytes)
      test 5 (192 bit key, 16 byte blocks): 1 operation in 354 cycles (16 bytes)
      test 6 (192 bit key, 64 byte blocks): 1 operation in 346 cycles (64 bytes)
      test 7 (192 bit key, 256 byte blocks): 1 operation in 488 cycles (256 bytes)
      test 8 (192 bit key, 1024 byte blocks): 1 operation in 1154 cycles (1024 bytes)
      test 9 (192 bit key, 8192 byte blocks): 1 operation in 8390 cycles (8192 bytes)
      test 10 (256 bit key, 16 byte blocks): 1 operation in 357 cycles (16 bytes)
      test 11 (256 bit key, 64 byte blocks): 1 operation in 362 cycles (64 bytes)
      test 12 (256 bit key, 256 byte blocks): 1 operation in 515 cycles (256 bytes)
      test 13 (256 bit key, 1024 byte blocks): 1 operation in 1284 cycles (1024 bytes)
      test 14 (256 bit key, 8192 byte blocks): 1 operation in 8681 cycles (8192 bytes)
      
      crypto: Incorporate feed back to AES CTR mode optimization patch
      
      Specifically, the following:
      a) alignment around main loop in aes_ctrby8_avx_x86_64.S
      b) .rodata around data constants used in the assembly code.
      c) the use of CONFIG_AVX in the glue code.
      d) fix up white space.
      e) informational message for "by8" AES CTR mode optimization
      f) "by8" AES CTR mode optimization can be simply enabled
      if the platform supports both AES and AVX features. The
      optimization works superbly on Sandybridge as well.
      
      Testing on Haswell shows no performance change since the last revision.
      
      Testing on Sandybridge shows that the "by8" AES CTR mode optimization
      greatly improves performance.
      
      tcrypt log with "by4" AES CTR mode optimization on Sandybridge
      --------------------------------------------------------------
      
      testing speed of __ctr-aes-aesni encryption
      test 0 (128 bit key, 16 byte blocks): 1 operation in 383 cycles (16 bytes)
      test 1 (128 bit key, 64 byte blocks): 1 operation in 408 cycles (64 bytes)
      test 2 (128 bit key, 256 byte blocks): 1 operation in 707 cycles (256 bytes)
      test 3 (128 bit key, 1024 byte blocks): 1 operation in 1864 cycles (1024 bytes)
      test 4 (128 bit key, 8192 byte blocks): 1 operation in 12813 cycles (8192 bytes)
      test 5 (192 bit key, 16 byte blocks): 1 operation in 395 cycles (16 bytes)
      test 6 (192 bit key, 64 byte blocks): 1 operation in 432 cycles (64 bytes)
      test 7 (192 bit key, 256 byte blocks): 1 operation in 780 cycles (256 bytes)
      test 8 (192 bit key, 1024 byte blocks): 1 operation in 2132 cycles (1024 bytes)
      test 9 (192 bit key, 8192 byte blocks): 1 operation in 15765 cycles (8192 bytes)
      test 10 (256 bit key, 16 byte blocks): 1 operation in 416 cycles (16 bytes)
      test 11 (256 bit key, 64 byte blocks): 1 operation in 438 cycles (64 bytes)
      test 12 (256 bit key, 256 byte blocks): 1 operation in 842 cycles (256 bytes)
      test 13 (256 bit key, 1024 byte blocks): 1 operation in 2383 cycles (1024 bytes)
      test 14 (256 bit key, 8192 byte blocks): 1 operation in 16945 cycles (8192 bytes)
      
      testing speed of __ctr-aes-aesni decryption
      test 0 (128 bit key, 16 byte blocks): 1 operation in 389 cycles (16 bytes)
      test 1 (128 bit key, 64 byte blocks): 1 operation in 409 cycles (64 bytes)
      test 2 (128 bit key, 256 byte blocks): 1 operation in 704 cycles (256 bytes)
      test 3 (128 bit key, 1024 byte blocks): 1 operation in 1865 cycles (1024 bytes)
      test 4 (128 bit key, 8192 byte blocks): 1 operation in 12783 cycles (8192 bytes)
      test 5 (192 bit key, 16 byte blocks): 1 operation in 409 cycles (16 bytes)
      test 6 (192 bit key, 64 byte blocks): 1 operation in 434 cycles (64 bytes)
      test 7 (192 bit key, 256 byte blocks): 1 operation in 792 cycles (256 bytes)
      test 8 (192 bit key, 1024 byte blocks): 1 operation in 2151 cycles (1024 bytes)
      test 9 (192 bit key, 8192 byte blocks): 1 operation in 15804 cycles (8192 bytes)
      test 10 (256 bit key, 16 byte blocks): 1 operation in 421 cycles (16 bytes)
      test 11 (256 bit key, 64 byte blocks): 1 operation in 444 cycles (64 bytes)
      test 12 (256 bit key, 256 byte blocks): 1 operation in 840 cycles (256 bytes)
      test 13 (256 bit key, 1024 byte blocks): 1 operation in 2394 cycles (1024 bytes)
      test 14 (256 bit key, 8192 byte blocks): 1 operation in 16928 cycles (8192 bytes)
      
      tcrypt log with "by8" AES CTR mode optimization on Sandybridge
      --------------------------------------------------------------
      
      testing speed of __ctr-aes-aesni encryption
      test 0 (128 bit key, 16 byte blocks): 1 operation in 383 cycles (16 bytes)
      test 1 (128 bit key, 64 byte blocks): 1 operation in 401 cycles (64 bytes)
      test 2 (128 bit key, 256 byte blocks): 1 operation in 522 cycles (256 bytes)
      test 3 (128 bit key, 1024 byte blocks): 1 operation in 1136 cycles (1024 bytes)
      test 4 (128 bit key, 8192 byte blocks): 1 operation in 7046 cycles (8192 bytes)
      test 5 (192 bit key, 16 byte blocks): 1 operation in 394 cycles (16 bytes)
      test 6 (192 bit key, 64 byte blocks): 1 operation in 418 cycles (64 bytes)
      test 7 (192 bit key, 256 byte blocks): 1 operation in 559 cycles (256 bytes)
      test 8 (192 bit key, 1024 byte blocks): 1 operation in 1263 cycles (1024 bytes)
      test 9 (192 bit key, 8192 byte blocks): 1 operation in 9072 cycles (8192 bytes)
      test 10 (256 bit key, 16 byte blocks): 1 operation in 408 cycles (16 bytes)
      test 11 (256 bit key, 64 byte blocks): 1 operation in 428 cycles (64 bytes)
      test 12 (256 bit key, 256 byte blocks): 1 operation in 595 cycles (256 bytes)
      test 13 (256 bit key, 1024 byte blocks): 1 operation in 1385 cycles (1024 bytes)
      test 14 (256 bit key, 8192 byte blocks): 1 operation in 9224 cycles (8192 bytes)
      
      testing speed of __ctr-aes-aesni decryption
      test 0 (128 bit key, 16 byte blocks): 1 operation in 390 cycles (16 bytes)
      test 1 (128 bit key, 64 byte blocks): 1 operation in 402 cycles (64 bytes)
      test 2 (128 bit key, 256 byte blocks): 1 operation in 530 cycles (256 bytes)
      test 3 (128 bit key, 1024 byte blocks): 1 operation in 1135 cycles (1024 bytes)
      test 4 (128 bit key, 8192 byte blocks): 1 operation in 7079 cycles (8192 bytes)
      test 5 (192 bit key, 16 byte blocks): 1 operation in 414 cycles (16 bytes)
      test 6 (192 bit key, 64 byte blocks): 1 operation in 417 cycles (64 bytes)
      test 7 (192 bit key, 256 byte blocks): 1 operation in 572 cycles (256 bytes)
      test 8 (192 bit key, 1024 byte blocks): 1 operation in 1312 cycles (1024 bytes)
      test 9 (192 bit key, 8192 byte blocks): 1 operation in 9073 cycles (8192 bytes)
      test 10 (256 bit key, 16 byte blocks): 1 operation in 415 cycles (16 bytes)
      test 11 (256 bit key, 64 byte blocks): 1 operation in 454 cycles (64 bytes)
      test 12 (256 bit key, 256 byte blocks): 1 operation in 598 cycles (256 bytes)
      test 13 (256 bit key, 1024 byte blocks): 1 operation in 1407 cycles (1024 bytes)
      test 14 (256 bit key, 8192 byte blocks): 1 operation in 9288 cycles (8192 bytes)
      
      crypto: Fix redundant checks
      
      a) Fix the redundant check for cpu_has_aes
      b) Fix the key length check when invoking the CTR mode "by8"
      encryptor/decryptor.
      
      crypto: fix typo in AES ctr mode transform
      Signed-off-by: Chandramouli Narayanan <mouli@linux.intel.com>
      Reviewed-by: Mathias Krause <minipli@googlemail.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      22cddcc7
    • crypto: des_3des - add x86-64 assembly implementation · 6574e6c6
      Jussi Kivilinna authored
      This patch adds an x86_64 assembly implementation of the Triple DES EDE cipher
      algorithm. Two assembly implementations are provided. The first is a regular
      'one block at a time' encrypt/decrypt function. The second is a 'three blocks
      at a time' function that gains a performance increase on out-of-order CPUs.
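      
      A hedged sketch of how an ECB-style walk might combine the two routines; the
      helper names (des3_ede_enc_3way, des3_ede_enc_1way) are made up for
      illustration and this is not the module's actual code.
      
      /* Illustrative chunking: use the 3-way routine while three full blocks
       * remain, then finish with the single-block routine. */
      static void des3_ecb_encrypt_sketch(const u32 *expkey, u8 *dst,
                                          const u8 *src, unsigned int nblocks)
      {
              while (nblocks >= 3) {
                      des3_ede_enc_3way(expkey, dst, src); /* 3 independent blocks */
                      src += 3 * DES3_EDE_BLOCK_SIZE;
                      dst += 3 * DES3_EDE_BLOCK_SIZE;
                      nblocks -= 3;
              }
              while (nblocks--) {
                      des3_ede_enc_1way(expkey, dst, src);
                      src += DES3_EDE_BLOCK_SIZE;
                      dst += DES3_EDE_BLOCK_SIZE;
              }
      }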
      
      tcrypt test results:
      
      Intel Core i5-4570:
      
      des3_ede-asm vs des3_ede-generic:
      size    ecb-enc ecb-dec cbc-enc cbc-dec ctr-enc ctr-dec
      16B     1.21x   1.22x   1.27x   1.36x   1.25x   1.25x
      64B     1.98x   1.96x   1.23x   2.04x   2.01x   2.00x
      256B    2.34x   2.37x   1.21x   2.40x   2.38x   2.39x
      1024B   2.50x   2.47x   1.22x   2.51x   2.52x   2.51x
      8192B   2.51x   2.53x   1.21x   2.56x   2.54x   2.55x
      Signed-off-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      6574e6c6
  5. 21 March 2014 (1 commit)
    • crypto: sha - SHA1 transform x86_64 AVX2 · 7c1da8d0
      Chandramouli Narayanan authored
      This patch adds an x86_64 AVX2 optimization of the SHA1 transform to the
      crypto subsystem. The patch has been tested with the 3.14.0-rc1 kernel.
      
      On a Haswell desktop, with turbo disabled and all CPUs running at maximum
      frequency, tcrypt shows an AVX2 performance improvement over the AVX
      implementation ranging from 3% for 256-byte updates to 16% for 1024-byte
      updates.
      
      This patch adds sha1_avx2_transform() and the glue, build and configuration
      changes needed for the AVX2 optimization of the SHA1 transform.
      
      sha1-ssse3 is a single module that adds the necessary optimization support
      (SSSE3/AVX/AVX2) for the low-level SHA1 transform function. When better
      optimization support is available, the transform function is overridden
      accordingly. In the AVX2 case, because performance varies with data block
      size, either the AVX or the AVX2 transform function is chosen at run time,
      whichever suits best. The Makefile change therefore appends the necessary
      objects to the link. Because of this, the patch merely adds the AVX2
      transform to the existing build mix and Kconfig support and leaves the
      configuration build support as is.
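      
      A hedged sketch of that run-time choice; the cutover constant
      (AVX2_CUTOVER_BYTES), the wrapper name and the transform entry-point names
      are assumptions for illustration, not the module's actual code.
      
      /* Illustrative only: short updates use the AVX transform, longer ones the
       * AVX2 transform, reflecting the size-dependent behaviour described above. */
      static void sha1_apply_transform_sketch(u32 *digest, const char *data,
                                              unsigned int blocks)
      {
              if (blocks * SHA1_BLOCK_SIZE < AVX2_CUTOVER_BYTES)
                      sha1_transform_avx(digest, data, blocks);
              else
                      sha1_transform_avx2(digest, data, blocks);
      }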
      Signed-off-by: Chandramouli Narayanan <mouli@linux.intel.com>
      Reviewed-by: Marek Vasut <marex@denx.de>
      Acked-by: H. Peter Anvin <hpa@linux.intel.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      7c1da8d0
  6. 15 January 2014 (1 commit)
  7. 31 December 2013 (1 commit)
  8. 20 December 2013 (1 commit)
  9. 24 September 2013 (1 commit)
  10. 13 September 2013 (1 commit)
  11. 07 September 2013 (1 commit)
  12. 24 July 2013 (1 commit)
  13. 21 June 2013 (2 commits)
  14. 24 May 2013 (1 commit)
  15. 25 April 2013 (6 commits)
  16. 03 April 2013 (1 commit)
  17. 26 February 2013 (1 commit)
  18. 20 January 2013 (1 commit)
  19. 09 November 2012 (1 commit)
    • crypto: camellia - add AES-NI/AVX/x86_64 assembler implementation of camellia cipher · d9b1d2e7
      Jussi Kivilinna authored
      This patch adds an AES-NI/AVX/x86_64 assembler implementation of the Camellia
      block cipher. The implementation processes data in sixteen-block chunks, which
      are byte-sliced, and the AES SubBytes operation is reused for the Camellia
      s-box with the help of pre- and post-filtering.
      
      Patch has been tested with tcrypt and automated filesystem tests.
      
      tcrypt test results:
      
      Intel Core i5-2450M:
      
      camellia-aesni-avx vs camellia-asm-x86_64-2way:
      128bit key:                                             (lrw:256bit)    (xts:256bit)
      size    ecb-enc ecb-dec cbc-enc cbc-dec ctr-enc ctr-dec lrw-enc lrw-dec xts-enc xts-dec
      16B     0.98x   0.96x   0.99x   0.96x   0.96x   0.95x   0.95x   0.94x   0.97x   0.98x
      64B     0.99x   0.98x   1.00x   0.98x   0.98x   0.99x   0.98x   0.93x   0.99x   0.98x
      256B    2.28x   2.28x   1.01x   2.29x   2.25x   2.24x   1.96x   1.97x   1.91x   1.90x
      1024B   2.57x   2.56x   1.00x   2.57x   2.51x   2.53x   2.19x   2.17x   2.19x   2.22x
      8192B   2.49x   2.49x   1.00x   2.53x   2.48x   2.49x   2.17x   2.17x   2.22x   2.22x
      
      256bit key:                                             (lrw:384bit)    (xts:512bit)
      size    ecb-enc ecb-dec cbc-enc cbc-dec ctr-enc ctr-dec lrw-enc lrw-dec xts-enc xts-dec
      16B     0.97x   0.98x   0.99x   0.97x   0.97x   0.96x   0.97x   0.98x   0.98x   0.99x
      64B     1.00x   1.00x   1.01x   0.99x   0.98x   0.99x   0.99x   0.99x   0.99x   0.99x
      256B    2.37x   2.37x   1.01x   2.39x   2.35x   2.33x   2.10x   2.11x   1.99x   2.02x
      1024B   2.58x   2.60x   1.00x   2.58x   2.56x   2.56x   2.28x   2.29x   2.28x   2.29x
      8192B   2.50x   2.52x   1.00x   2.56x   2.51x   2.51x   2.24x   2.25x   2.26x   2.29x
      Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      d9b1d2e7
  20. 15 October 2012 (2 commits)
  21. 01 August 2012 (2 commits)
    • crypto: cast6 - add x86_64/avx assembler implementation · 4ea1277d
      Johannes Goetzfried authored
      This patch adds an x86_64/AVX assembler implementation of the Cast6 block
      cipher. The implementation processes eight blocks in parallel (two 4-block
      chunk AVX operations). The table lookups are done in general-purpose registers.
      For small block sizes the functions from the generic module are called. A good
      performance increase is provided for block sizes greater than or equal to 128
      bytes.
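      
      "Table lookups in general-purpose registers" means each byte of a word is
      extracted with shifts and masks and used to index a 256-entry table. The C
      fragment below is a hedged, generic illustration of that pattern; the tables
      and the all-XOR combination are placeholders, not the real CAST6 round
      function, which also mixes additions, subtractions and round keys.
      
      /* Generic byte-indexed s-box lookup done entirely in GPRs. */
      static uint32_t gpr_sbox_lookup(uint32_t x, const uint32_t S[4][256])
      {
              return S[0][(x >> 24) & 0xff] ^ S[1][(x >> 16) & 0xff] ^
                     S[2][(x >>  8) & 0xff] ^ S[3][x & 0xff];
      }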
      
      Patch has been tested with tcrypt and automated filesystem tests.
      
      Tcrypt benchmark results:
      
      Intel Core i5-2500 CPU (fam:6, model:42, step:7)
      
      cast6-avx-x86_64 vs. cast6-generic
      128bit key:                                             (lrw:256bit)    (xts:256bit)
      size    ecb-enc ecb-dec cbc-enc cbc-dec ctr-enc ctr-dec lrw-enc lrw-dec xts-enc xts-dec
      16B     0.97x   1.00x   1.01x   1.01x   0.99x   0.97x   0.98x   1.01x   0.96x   0.98x
      64B     0.98x   0.99x   1.02x   1.01x   0.99x   1.00x   1.01x   0.99x   1.00x   0.99x
      256B    1.77x   1.84x   0.99x   1.85x   1.77x   1.77x   1.70x   1.74x   1.69x   1.72x
      1024B   1.93x   1.95x   0.99x   1.96x   1.93x   1.93x   1.84x   1.85x   1.89x   1.87x
      8192B   1.91x   1.95x   0.99x   1.97x   1.95x   1.91x   1.86x   1.87x   1.93x   1.90x
      
      256bit key:                                             (lrw:384bit)    (xts:512bit)
      size    ecb-enc ecb-dec cbc-enc cbc-dec ctr-enc ctr-dec lrw-enc lrw-dec xts-enc xts-dec
      16B     0.97x   0.99x   1.02x   1.01x   0.98x   0.99x   1.00x   1.00x   0.98x   0.98x
      64B     0.98x   0.99x   1.01x   1.00x   1.00x   1.00x   1.01x   1.01x   0.97x   1.00x
      256B    1.77x   1.83x   1.00x   1.86x   1.79x   1.78x   1.70x   1.76x   1.71x   1.69x
      1024B   1.92x   1.95x   0.99x   1.96x   1.93x   1.93x   1.83x   1.86x   1.89x   1.87x
      8192B   1.94x   1.95x   0.99x   1.97x   1.95x   1.95x   1.87x   1.87x   1.93x   1.91x
      Signed-off-by: Johannes Goetzfried <Johannes.Goetzfried@informatik.stud.uni-erlangen.de>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      4ea1277d
    • crypto: cast5 - add x86_64/avx assembler implementation · 4d6d6a2c
      Johannes Goetzfried authored
      This patch adds an x86_64/AVX assembler implementation of the Cast5 block
      cipher. The implementation processes sixteen blocks in parallel (four 4-block
      chunk AVX operations). The table lookups are done in general-purpose registers.
      For small block sizes the functions from the generic module are called. A good
      performance increase is provided for block sizes greater than or equal to 128
      bytes.
      
      Patch has been tested with tcrypt and automated filesystem tests.
      
      Tcrypt benchmark results:
      
      Intel Core i5-2500 CPU (fam:6, model:42, step:7)
      
      cast5-avx-x86_64 vs. cast5-generic
      64bit key:
      size    ecb-enc ecb-dec cbc-enc cbc-dec ctr-enc ctr-dec
      16B     0.99x   0.99x   1.00x   1.00x   1.02x   1.01x
      64B     1.00x   1.00x   0.98x   1.00x   1.01x   1.02x
      256B    2.03x   2.01x   0.95x   2.11x   2.12x   2.13x
      1024B   2.30x   2.24x   0.95x   2.29x   2.35x   2.35x
      8192B   2.31x   2.27x   0.95x   2.31x   2.39x   2.39x
      
      128bit key:
      size    ecb-enc ecb-dec cbc-enc cbc-dec ctr-enc ctr-dec
      16B     0.99x   0.99x   1.00x   1.00x   1.01x   1.01x
      64B     1.00x   1.00x   0.98x   1.01x   1.02x   1.01x
      256B    2.17x   2.13x   0.96x   2.19x   2.19x   2.19x
      1024B   2.29x   2.32x   0.95x   2.34x   2.37x   2.38x
      8192B   2.35x   2.32x   0.95x   2.35x   2.39x   2.39x
      Signed-off-by: Johannes Goetzfried <Johannes.Goetzfried@informatik.stud.uni-erlangen.de>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      4d6d6a2c
  22. 27 June 2012 (2 commits)
  23. 12 June 2012 (3 commits)
    • crypto: serpent - add x86_64/avx assembler implementation · 7efe4076
      Johannes Goetzfried authored
      This patch adds an x86_64/AVX assembler implementation of the Serpent block
      cipher. The implementation is very similar to the SSE2 implementation and
      processes eight blocks in parallel. Because of AVX's non-destructive
      three-operand syntax, all move instructions can be removed, which provides a
      small performance increase.
      
      Patch has been tested with tcrypt and automated filesystem tests.
      
      Tcrypt benchmark results:
      
      Intel Core i5-2500 CPU (fam:6, model:42, step:7)
      
      serpent-avx-x86_64 vs. serpent-sse2-x86_64
      128bit key:                                             (lrw:256bit)    (xts:256bit)
      size    ecb-enc ecb-dec cbc-enc cbc-dec ctr-enc ctr-dec lrw-enc lrw-dec xts-enc xts-dec
      16B     1.03x   1.01x   1.01x   1.01x   1.00x   1.00x   1.00x   1.00x   1.00x   1.01x
      64B     1.00x   1.00x   1.00x   1.00x   1.00x   0.99x   1.00x   1.01x   1.00x   1.00x
      256B    1.05x   1.03x   1.00x   1.02x   1.05x   1.06x   1.05x   1.02x   1.05x   1.02x
      1024B   1.05x   1.02x   1.00x   1.02x   1.05x   1.06x   1.05x   1.03x   1.05x   1.02x
      8192B   1.05x   1.02x   1.00x   1.02x   1.06x   1.06x   1.04x   1.03x   1.04x   1.02x
      
      256bit key:                                             (lrw:384bit)    (xts:512bit)
      size    ecb-enc ecb-dec cbc-enc cbc-dec ctr-enc ctr-dec lrw-enc lrw-dec xts-enc xts-dec
      16B     1.01x   1.00x   1.01x   1.01x   1.00x   1.00x   0.99x   1.03x   1.01x   1.01x
      64B     1.00x   1.00x   1.00x   1.00x   1.00x   1.00x   1.00x   1.01x   1.00x   1.02x
      256B    1.05x   1.02x   1.00x   1.02x   1.05x   1.02x   1.04x   1.05x   1.05x   1.02x
      1024B   1.06x   1.02x   1.00x   1.02x   1.07x   1.06x   1.05x   1.04x   1.05x   1.02x
      8192B   1.05x   1.02x   1.00x   1.02x   1.06x   1.06x   1.04x   1.05x   1.05x   1.02x
      
      serpent-avx-x86_64 vs aes-asm (8kB block):
               128bit  256bit
      ecb-enc  1.26x   1.73x
      ecb-dec  1.20x   1.64x
      cbc-enc  0.33x   0.45x
      cbc-dec  1.24x   1.67x
      ctr-enc  1.32x   1.76x
      ctr-dec  1.32x   1.76x
      lrw-enc  1.20x   1.60x
      lrw-dec  1.15x   1.54x
      xts-enc  1.22x   1.64x
      xts-dec  1.17x   1.57x
      Signed-off-by: Johannes Goetzfried <Johannes.Goetzfried@informatik.stud.uni-erlangen.de>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      7efe4076
    • crypto: twofish - add x86_64/avx assembler implementation · 107778b5
      Johannes Goetzfried authored
      This patch adds an x86_64/AVX assembler implementation of the Twofish block
      cipher. The implementation processes eight blocks in parallel (two 4-block
      chunk AVX operations). The table lookups are done in general-purpose registers.
      For small block sizes the 3-way parallel functions from the twofish-x86_64-3way
      module are called. A good performance increase is provided for block sizes
      greater than or equal to 128 bytes.
      
      Patch has been tested with tcrypt and automated filesystem tests.
      
      Tcrypt benchmark results:
      
      Intel Core i5-2500 CPU (fam:6, model:42, step:7)
      
      twofish-avx-x86_64 vs. twofish-x86_64-3way
      128bit key:                                             (lrw:256bit)    (xts:256bit)
      size    ecb-enc ecb-dec cbc-enc cbc-dec ctr-enc ctr-dec lrw-enc lrw-dec xts-enc xts-dec
      16B     0.96x   0.97x   1.00x   0.95x   0.97x   0.97x   0.96x   0.95x   0.95x   0.98x
      64B     0.99x   0.99x   1.00x   0.99x   0.98x   0.98x   0.99x   0.98x   0.99x   0.98x
      256B    1.20x   1.21x   1.00x   1.19x   1.15x   1.14x   1.19x   1.20x   1.18x   1.19x
      1024B   1.29x   1.30x   1.00x   1.28x   1.23x   1.24x   1.26x   1.28x   1.26x   1.27x
      8192B   1.31x   1.32x   1.00x   1.31x   1.25x   1.25x   1.28x   1.29x   1.28x   1.30x
      
      256bit key:                                             (lrw:384bit)    (xts:512bit)
      size    ecb-enc ecb-dec cbc-enc cbc-dec ctr-enc ctr-dec lrw-enc lrw-dec xts-enc xts-dec
      16B     0.96x   0.96x   1.00x   0.96x   0.97x   0.98x   0.95x   0.95x   0.95x   0.96x
      64B     1.00x   0.99x   1.00x   0.98x   0.98x   1.01x   0.98x   0.98x   0.98x   0.98x
      256B    1.20x   1.21x   1.00x   1.21x   1.15x   1.15x   1.19x   1.20x   1.18x   1.19x
      1024B   1.29x   1.30x   1.00x   1.28x   1.23x   1.23x   1.26x   1.27x   1.26x   1.27x
      8192B   1.31x   1.33x   1.00x   1.31x   1.26x   1.26x   1.29x   1.29x   1.28x   1.30x
      
      twofish-avx-x86_64 vs aes-asm (8kB block):
               128bit  256bit
      ecb-enc  1.19x   1.63x
      ecb-dec  1.18x   1.62x
      cbc-enc  0.75x   1.03x
      cbc-dec  1.23x   1.67x
      ctr-enc  1.24x   1.65x
      ctr-dec  1.24x   1.65x
      lrw-enc  1.15x   1.53x
      lrw-dec  1.14x   1.52x
      xts-enc  1.16x   1.56x
      xts-dec  1.16x   1.56x
      Signed-off-by: Johannes Goetzfried <Johannes.Goetzfried@informatik.stud.uni-erlangen.de>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      107778b5
    • crypto: sha1 - use Kbuild supplied flags for AVX test · 65df5774
      Mathias Krause authored
      Commit ea4d26ae ("raid5: add AVX optimized RAID5 checksumming")
      introduced x86/ arch-wide defines for AFLAGS and CFLAGS indicating AVX
      support in binutils, based on the same test we currently have in
      x86/crypto/. To minimize duplication, drop our implementation in favour of
      the one in x86/.
      Signed-off-by: Mathias Krause <minipli@googlemail.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      65df5774
  24. 14 March 2012 (1 commit)
    • crypto: camellia - add assembler implementation for x86_64 · 0b95ec56
      Jussi Kivilinna authored
      This patch adds an x86_64 assembler implementation of the Camellia block cipher.
      Two sets of functions are provided. The first set is regular 'one block at a
      time' encrypt/decrypt functions. The second is 'two blocks at a time' functions
      that gain a performance increase on out-of-order CPUs. Performance of the 2-way
      functions should equal that of the 1-way functions on in-order CPUs.
      
      Patch has been tested with tcrypt and automated filesystem tests.
      
      Tcrypt benchmark results:
      
      AMD Phenom II 1055T (fam:16, model:10):
      
      camellia-asm vs camellia_generic:
      128bit key:                                             (lrw:256bit)    (xts:256bit)
      size    ecb-enc ecb-dec cbc-enc cbc-dec ctr-enc ctr-dec lrw-enc lrw-dec xts-enc xts-dec
      16B     1.27x   1.22x   1.30x   1.42x   1.30x   1.34x   1.19x   1.05x   1.23x   1.24x
      64B     1.74x   1.79x   1.43x   1.87x   1.81x   1.87x   1.48x   1.38x   1.55x   1.62x
      256B    1.90x   1.87x   1.43x   1.94x   1.94x   1.95x   1.63x   1.62x   1.67x   1.70x
      1024B   1.96x   1.93x   1.43x   1.95x   1.98x   2.01x   1.67x   1.69x   1.74x   1.80x
      8192B   1.96x   1.96x   1.39x   1.93x   2.01x   2.03x   1.72x   1.64x   1.71x   1.76x
      
      256bit key:                                             (lrw:384bit)    (xts:512bit)
      size    ecb-enc ecb-dec cbc-enc cbc-dec ctr-enc ctr-dec lrw-enc lrw-dec xts-enc xts-dec
      16B     1.23x   1.23x   1.33x   1.39x   1.34x   1.38x   1.04x   1.18x   1.21x   1.29x
      64B     1.72x   1.69x   1.42x   1.78x   1.81x   1.89x   1.57x   1.52x   1.56x   1.65x
      256B    1.85x   1.88x   1.42x   1.86x   1.93x   1.96x   1.69x   1.65x   1.70x   1.75x
      1024B   1.88x   1.86x   1.45x   1.95x   1.96x   1.95x   1.77x   1.71x   1.77x   1.78x
      8192B   1.91x   1.86x   1.42x   1.91x   2.03x   1.98x   1.73x   1.71x   1.78x   1.76x
      
      camellia-asm vs aes-asm (8kB block):
               128bit  256bit
      ecb-enc  1.15x   1.22x
      ecb-dec  1.16x   1.16x
      cbc-enc  0.85x   0.90x
      cbc-dec  1.20x   1.23x
      ctr-enc  1.28x   1.30x
      ctr-dec  1.27x   1.28x
      lrw-enc  1.12x   1.16x
      lrw-dec  1.08x   1.10x
      xts-enc  1.11x   1.15x
      xts-dec  1.14x   1.15x
      
      Intel Core2 T8100 (fam:6, model:23, step:6):
      
      camellia-asm vs camellia_generic:
      128bit key:                                             (lrw:256bit)    (xts:256bit)
      size    ecb-enc ecb-dec cbc-enc cbc-dec ctr-enc ctr-dec lrw-enc lrw-dec xts-enc xts-dec
      16B     1.10x   1.12x   1.14x   1.16x   1.16x   1.15x   1.02x   1.02x   1.08x   1.08x
      64B     1.61x   1.60x   1.17x   1.68x   1.67x   1.66x   1.43x   1.42x   1.44x   1.42x
      256B    1.65x   1.73x   1.17x   1.77x   1.81x   1.80x   1.54x   1.53x   1.58x   1.54x
      1024B   1.76x   1.74x   1.18x   1.80x   1.85x   1.85x   1.60x   1.59x   1.65x   1.60x
      8192B   1.77x   1.75x   1.19x   1.81x   1.85x   1.86x   1.63x   1.61x   1.66x   1.62x
      
      256bit key:                                             (lrw:384bit)    (xts:512bit)
      size    ecb-enc ecb-dec cbc-enc cbc-dec ctr-enc ctr-dec lrw-enc lrw-dec xts-enc xts-dec
      16B     1.10x   1.07x   1.13x   1.16x   1.11x   1.16x   1.03x   1.02x   1.08x   1.07x
      64B     1.61x   1.62x   1.15x   1.66x   1.63x   1.68x   1.47x   1.46x   1.47x   1.44x
      256B    1.71x   1.70x   1.16x   1.75x   1.69x   1.79x   1.58x   1.57x   1.59x   1.55x
      1024B   1.78x   1.72x   1.17x   1.75x   1.80x   1.80x   1.63x   1.62x   1.65x   1.62x
      8192B   1.76x   1.73x   1.17x   1.78x   1.80x   1.81x   1.64x   1.62x   1.68x   1.64x
      
      camellia-asm vs aes-asm (8kB block):
               128bit  256bit
      ecb-enc  1.17x   1.21x
      ecb-dec  1.17x   1.20x
      cbc-enc  0.80x   0.82x
      cbc-dec  1.22x   1.24x
      ctr-enc  1.25x   1.26x
      ctr-dec  1.25x   1.26x
      lrw-enc  1.14x   1.18x
      lrw-dec  1.13x   1.17x
      xts-enc  1.14x   1.18x
      xts-dec  1.14x   1.17x
      Signed-off-by: Jussi Kivilinna <jussi.kivilinna@mbnet.fi>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      0b95ec56
  25. 21 November 2011 (2 commits)