From 79fe664f19489870ae24b4d0e11830a7624ca9d6 Mon Sep 17 00:00:00 2001
From: Andy Polyakov
Date: Wed, 26 Sep 2007 12:16:32 +0000
Subject: [PATCH] Clarify commentary in sha512-sparcv9.pl.

---
 crypto/sha/asm/sha512-sparcv9.pl | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/crypto/sha/asm/sha512-sparcv9.pl b/crypto/sha/asm/sha512-sparcv9.pl
index 25f80390ac..fa276d258b 100644
--- a/crypto/sha/asm/sha512-sparcv9.pl
+++ b/crypto/sha/asm/sha512-sparcv9.pl
@@ -17,7 +17,7 @@
 # Performance is >75% better than 64-bit code generated by Sun C and
 # over 2x than 32-bit code. X[16] resides on stack, but access to it
 # is scheduled for L2 latency and staged through 32 least significant
-# bits of %l0-%l7. The latter is done to achieve 32-/64-bit bit ABI
+# bits of %l0-%l7. The latter is done to achieve 32-/64-bit ABI
 # duality. Nevetheless it's ~40% faster than SHA256, which is pretty
 # good [optimal coefficient is 50%].
 #
@@ -25,14 +25,22 @@
 #
 # It's not any faster than 64-bit code generated by Sun C 5.8. This is
 # because 64-bit code generator has the advantage of using 64-bit
-# loads to access X[16], which I consciously traded for 32-/64-bit ABI
-# duality [as per above]. But it surpasses 32-bit Sun C generated code
-# by 60%, not to mention that it doesn't suffer from severe decay when
-# running 4 times physical cores threads and that it leaves gcc [3.4]
-# behind by over 4x factor! If compared to SHA256, single thread
+# loads(*) to access X[16], which I consciously traded for 32-/64-bit
+# ABI duality [as per above]. But it surpasses 32-bit Sun C generated
+# code by 60%, not to mention that it doesn't suffer from severe decay
+# when running 4 times physical cores threads and that it leaves gcc
+# [3.4] behind by over 4x factor! If compared to SHA256, single thread
 # performance is only 10% better, but overall throughput for maximum
 # amount of threads for given CPU exceeds corresponding one of SHA256
 # by 30% [again, optimal coefficient is 50%].
+#
+# (*)	Unlike pre-T1 UltraSPARC loads on T1 are executed strictly
+#	in-order, i.e. load instruction has to complete prior next
+#	instruction in given thread is executed, even if the latter is
+#	not dependent on load result! This means that on T1 two 32-bit
+#	loads are always slower than one 64-bit load. Once again this
+#	is unlike pre-T1 UltraSPARC, where, if scheduled appropriately,
+#	2x32-bit loads can be as fast as 1x64-bit ones.
 
 $bits=32; for (@ARGV) { $bits=64 if (/\-m64/ || /\-xarch\=v9/); }
--
GitLab
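
For illustration, here is a minimal Perl sketch in the spirit of these modules of the two load strategies the new footnote contrasts: a single 64-bit ldx versus a pair of 32-bit ld instructions recombined with sllx/or. It is not taken from sha512-sparcv9.pl; the subroutine names, the stack offset and the destination registers are invented for the example.

#!/usr/bin/env perl
# Illustrative sketch only, not code from sha512-sparcv9.pl: the sub
# names, stack offset and registers are invented for the example.
# SPARC is big-endian, so the 32-bit word at the lower address holds
# the upper half of a 64-bit X[i] slot on the stack.
use strict;
use warnings;

# One 64-bit load: what 64-bit-only code can use, and the cheaper
# option on T1, where loads complete strictly in order.
sub emit_ldx {
	my ($off, $dst) = @_;
	return "\tldx\t[%sp+$off],$dst\n";
}

# Two 32-bit loads recombined with sllx/or: preserves 32-/64-bit ABI
# duality, but costs two in-order loads on T1.
sub emit_ld_pair {
	my ($off, $hi, $lo, $dst) = @_;
	my $code  = "\tld\t[%sp+$off],$hi\n";            # upper 32 bits
	$code    .= "\tld\t[%sp+".($off+4)."],$lo\n";    # lower 32 bits
	$code    .= "\tsllx\t$hi,32,$dst\n";             # rebuild 64-bit value
	$code    .= "\tor\t$dst,$lo,$dst\n";
	return $code;
}

my $asm = emit_ldx(112, "%g1") . emit_ld_pair(112, "%l0", "%l1", "%g1");
print $asm;

The sketch only spells out the cost comparison the footnote makes; the module itself keeps the X[16] halves in the low 32 bits of %l0-%l7 and schedules the accesses for L2 latency, as the commentary in the first hunk describes.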