Commit d52d5ad1

Authored Sep 05, 2010 by Andy Polyakov

    modes/asm/ghash-*.pl: switch to [more reproducible] performance results
    collected with 'apps/openssl speed ghash'.

Parent: a3b0c44b
Showing 4 changed files with 44 additions and 44 deletions:

  crypto/modes/asm/ghash-parisc.pl    +3  -3
  crypto/modes/asm/ghash-sparcv9.pl   +2  -2
  crypto/modes/asm/ghash-x86.pl       +34 -35
  crypto/modes/asm/ghash-x86_64.pl    +5  -4
crypto/modes/asm/ghash-parisc.pl

@@ -12,9 +12,9 @@
 # The module implements "4-bit" GCM GHASH function and underlying
 # single multiplication operation in GF(2^128). "4-bit" means that it
 # uses 256 bytes per-key table [+128 bytes shared table]. On PA-7100LC
-# it processes one byte in 19 cycles, which is more than twice as fast
-# as code generated by gcc 3.2. PA-RISC 2.0 loop is scheduled for 8
-# cycles, but measured performance on PA-8600 system is ~9 cycles per
+# it processes one byte in 19.6 cycles, which is more than twice as
+# fast as code generated by gcc 3.2. PA-RISC 2.0 loop is scheduled for
+# 8 cycles, but measured performance on PA-8600 system is ~9 cycles per
 # processed byte. This is ~2.2x faster than 64-bit code generated by
 # vendor compiler (which used to be very hard to beat:-).
 #
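The "4-bit" method referenced in the comment above (a 256-byte per-key table, consuming one nibble of input per step) can be sketched in Python. This is an illustrative model of the table-driven approach, not OpenSSL's actual code: real implementations replace the four single-bit shifts per nibble with a precomputed reduction table (the 128-byte "shared table" the comment mentions), and of course run on machine words, not bignums.

```python
R = 0xE1 << 120  # GCM reduction constant for x^128 + x^7 + x^2 + x + 1

def mul_x(v):
    """Multiply a GF(2^128) element (GCM bit-reflected form) by x."""
    return (v >> 1) ^ R if v & 1 else v >> 1

def gmul(x, y):
    """Reference bit-by-bit GF(2^128) multiply (NIST SP 800-38D style)."""
    z, v = 0, y
    for i in range(128):
        if (x >> (127 - i)) & 1:
            z ^= v
        v = mul_x(v)
    return z

def init_4bit(h):
    """Build the 16-entry (16 x 16 = 256 byte) per-key table: table[n] = n*H."""
    table = [0] * 16
    table[8] = h                 # nibble 1000 carries the x^0 coefficient
    v = h
    for i in (4, 2, 1):          # table[4] = H*x, table[2] = H*x^2, table[1] = H*x^3
        v = mul_x(v)
        table[i] = v
    for i in (2, 4, 8):          # remaining entries by linearity (XOR)
        for j in range(1, i):
            table[i | j] = table[i] ^ table[j]
    return table

def gmult_4bit(x, table):
    """Multiply x by H one 4-bit nibble at a time (Horner's scheme)."""
    z = 0
    for b in reversed(x.to_bytes(16, 'big')):   # highest-degree nibbles first
        for nib in (b & 0xF, b >> 4):
            for _ in range(4):   # z *= x^4; real code uses a rem_4bit table here
                z = mul_x(z)
            z ^= table[nib]
    return z
```

A quick cross-check against the bit-by-bit reference confirms the table walk, e.g. `gmult_4bit(X, init_4bit(H)) == gmul(X, H)` for any 128-bit `X`, `H`.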
crypto/modes/asm/ghash-sparcv9.pl

@@ -17,8 +17,8 @@
 #
 #		gcc 3.3.x	cc 5.2		this assembler
 #
-# 32-bit build	81.0		48.6		11.8	(+586%/+311%)
-# 64-bit build	27.5		20.3		11.8	(+133%/+72%)
+# 32-bit build	81.4		43.3		12.6	(+546%/+244%)
+# 64-bit build	20.2		21.2		12.6	(+60%/+68%)
 #
 # Here is data collected on UltraSPARC T1 system running Linux:
 #
crypto/modes/asm/ghash-x86.pl

@@ -21,17 +21,18 @@
 #
 #		gcc 2.95.3(*)	MMX assembler	x86 assembler
 #
-# Pentium	100/112(**)	-		50
-# PIII		63 /77		12.2		24
-# P4		96 /122		18.0		84(***)
-# Opteron	50 /71		10.1		30
-# Core2		54 /68		8.6		18
+# Pentium	105/111(**)	-		50
+# PIII		68 /75		12.2		24
+# P4		125/125		17.8		84(***)
+# Opteron	66 /70		10.1		30
+# Core2		54 /67		8.4		18
 #
 # (*)	gcc 3.4.x was observed to generate few percent slower code,
 #	which is one of reasons why 2.95.3 results were chosen,
 #	another reason is lack of 3.4.x results for older CPUs;
-#	comparison is not completely fair, because C results are
-#	for vanilla "256B" implementations, not "528B";-)
+#	comparison with MMX results is not completely fair, because C
+#	results are for vanilla "256B" implementation, while
+#	assembler results are for "528B";-)
 # (**)	second number is result for code compiled with -fPIC flag,
 #	which is actually more relevant, because assembler code is
 #	position-independent;
@@ -44,7 +45,7 @@
 # May 2010
 #
-# Add PCLMULQDQ version performing at 2.13 cycles per processed byte.
+# Add PCLMULQDQ version performing at 2.10 cycles per processed byte.
 # The question is how close is it to theoretical limit? The pclmulqdq
 # instruction latency appears to be 14 cycles and there can't be more
 # than 2 of them executing at any given time. This means that single
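The pclmulqdq instruction discussed in this comment computes a carry-less (polynomial) product of two 64-bit operands, and the Karatsuba pre-/post-processing mentioned further down builds a 128x128-bit product from three such multiplications instead of four. A minimal Python model of both (illustrative only; the GCM bit-reflected convention and the modulo-reduction step are not shown here):

```python
def clmul(a, b):
    """Carry-less multiply: 64-bit x 64-bit -> 128-bit, with XOR in place
    of addition. This is what a single pclmulqdq instruction computes."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r

def clmul128(ah, al, bh, bl):
    """128x128-bit carry-less product via Karatsuba: three clmuls, not four.
    The extra XORs are the pre- and post-processing the comment refers to."""
    lo = clmul(al, bl)
    hi = clmul(ah, bh)
    mid = clmul(ah ^ al, bh ^ bl) ^ hi ^ lo   # = ah*bl ^ al*bh over GF(2)
    return (hi << 128) ^ (mid << 64) ^ lo
```

For example, `clmul(3, 3)` is 5, not 9: (x+1)^2 = x^2 + 1 over GF(2), since the cross terms cancel under XOR.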
@@ -60,38 +61,36 @@
 # Before we proceed to this implementation let's have closer look at
 # the best-performing code suggested by Intel in their white paper.
 # By tracing inter-register dependencies Tmod is estimated as ~19
-# cycles and Naggr is 4, resulting in 2.05 cycles per processed byte.
-# As implied, this is quite optimistic estimate, because it does not
-# account for Karatsuba pre- and post-processing, which for a single
-# multiplication is ~5 cycles. Unfortunately Intel does not provide
-# performance data for GHASH alone, only for fused GCM mode. But
-# we can estimate it by subtracting CTR performance result provided
-# in "AES Instruction Set" white paper: 3.54-1.38=2.16 cycles per
-# processed byte or 5% off the estimate. It should be noted though
-# that 3.54 is GCM result for 16KB block size, while 1.38 is CTR for
-# 1KB block size, meaning that real number is likely to be a bit
-# further from estimate.
+# cycles and Naggr chosen by Intel is 4, resulting in 2.05 cycles per
+# processed byte. As implied, this is quite optimistic estimate,
+# because it does not account for Karatsuba pre- and post-processing,
+# which for a single multiplication is ~5 cycles. Unfortunately Intel
+# does not provide performance data for GHASH alone. But benchmarking
+# AES_GCM_encrypt ripped out of Fig. 15 of the white paper with aadt
+# alone resulted in 2.46 cycles per byte of out 16KB buffer. Note that
+# the result accounts even for pre-computing of degrees of the hash
+# key H, but its portion is negligible at 16KB buffer size.
 #
 # Moving on to the implementation in question. Tmod is estimated as
 # ~13 cycles and Naggr is 2, giving asymptotic performance of ...
 # 2.16. How is it possible that measured performance is better than
 # optimistic theoretical estimate? There is one thing Intel failed
-# to recognize. By fusing GHASH with CTR former's performance is
-# really limited to above (Tmul + Tmod/Naggr) equation. But if GHASH
-# procedure is detached, the modulo-reduction can be interleaved with
-# Naggr-1 multiplications and under ideal conditions even disappear
-# from the equation. So that optimistic theoretical estimate for this
-# implementation is ... 28/16=1.75, and not 2.16. Well, it's probably
-# way too optimistic, at least for such small Naggr. I'd argue that
-# (28+Tproc/Naggr), where Tproc is time required for Karatsuba pre-
-# and post-processing, is more realistic estimate. In this case it
-# gives ... 1.91 cycles per processed byte. Or in other words,
-# depending on how well we can interleave reduction and one of the
-# two multiplications the performance should be betwen 1.91 and 2.16.
-# As already mentioned, this implementation processes one byte [out
-# of 1KB buffer] in 2.13 cycles, while x86_64 counterpart - in 2.07.
-# x86_64 performance is better, because larger register bank allows
-# to interleave reduction and multiplication better.
+# to recognize. By serializing GHASH with CTR in same subroutine
+# former's performance is really limited to above (Tmul + Tmod/Naggr)
+# equation. But if GHASH procedure is detached, the modulo-reduction
+# can be interleaved with Naggr-1 multiplications at instruction level
+# and under ideal conditions even disappear from the equation. So that
+# optimistic theoretical estimate for this implementation is ...
+# 28/16=1.75, and not 2.16. Well, it's probably way too optimistic,
+# at least for such small Naggr. I'd argue that (28+Tproc/Naggr),
+# where Tproc is time required for Karatsuba pre- and post-processing,
+# is more realistic estimate. In this case it gives ... 1.91 cycles.
+# Or in other words, depending on how well we can interleave reduction
+# and one of the two multiplications the performance should be betwen
+# 1.91 and 2.16. As already mentioned, this implementation processes
+# one byte out of 8KB buffer in 2.10 cycles, while x86_64 counterpart
+# - in 2.02. x86_64 performance is better, because larger register
+# bank allows to interleave reduction and multiplication better.
 #
 # Does it make sense to increase Naggr? To start with it's virtually
 # impossible in 32-bit mode, because of limited register bank
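The cycles-per-byte arithmetic in the comment above can be reproduced directly. When Naggr aggregated multiplications share one Tmod-cycle modulo-reduction, each 16-byte block costs roughly Tmul + Tmod/Naggr cycles; taking Tmul = 28 (the figure implied by the comment's own 28/16 = 1.75 estimate), a quick Python check of the quoted numbers:

```python
def cycles_per_byte(Tmul, Tmod, Naggr):
    """Asymptotic cost of one 16-byte GHASH block when Naggr aggregated
    multiplications amortize a single Tmod-cycle modulo-reduction."""
    return (Tmul + Tmod / Naggr) / 16

# Intel white-paper code: Tmod ~19 cycles, Naggr = 4
print(round(cycles_per_byte(28, 19, 4), 2))   # 2.05
# This implementation with reduction serialized: Tmod ~13, Naggr = 2
print(round(cycles_per_byte(28, 13, 2), 2))   # 2.16
# Reduction fully interleaved away: bare 28/16
print(round(28 / 16, 2))                      # 1.75
# The (28 + Tproc/Naggr)/16 estimate, with Tproc ~5 for Karatsuba overhead
print(round(cycles_per_byte(28, 5, 2), 2))    # 1.91
```

The measured 2.10 cycles per byte landing between the 1.91 and 2.16 bounds is exactly the "depending on how well we can interleave reduction" point the comment makes.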
crypto/modes/asm/ghash-x86_64.pl

@@ -20,17 +20,18 @@
 #		gcc 3.4.x(*)	assembler
 #
 # P4		28.6		14.0		+100%
-# Opteron	18.5		7.7		+140%
-# Core2		17.5		8.1(**)		+115%
+# Opteron	19.3		7.7		+150%
+# Core2		17.8		8.1(**)		+120%
 #
 # (*)	comparison is not completely fair, because C results are
-#	for vanilla "256B" implementation, not "528B";-)
+#	for vanilla "256B" implementation, while assembler results
+#	are for "528B";-)
 # (**)	it's mystery [to me] why Core2 result is not same as for
 #	Opteron;
 # May 2010
 #
-# Add PCLMULQDQ version performing at 2.07 cycles per processed byte.
+# Add PCLMULQDQ version performing at 2.02 cycles per processed byte.
 # See ghash-x86.pl for background information and details about coding
 # techniques.
 #