arm64: do_csum: implement accelerated scalar version
hulk inclusion
category: feature
feature: checksum performance
bugzilla: 13700
CVE: NA

--------------------------------------------------

It turns out that the IP checksumming code is still exercised often,
even though one might expect that modern NICs with checksum offload
have no use for it. However, as Lingyan points out, there are
combinations of features where the network stack may still fall back
to software checksumming, and so it makes sense to provide an
optimized implementation in software as well.

So provide an implementation of do_csum() in scalar assembler, which,
unlike C, gives direct access to the carry flag, making the code run
substantially faster. The routine uses overlapping 64-byte loads for
all input sizes > 64 bytes, in order to reduce the number of branches
and improve performance on cores with deep pipelines.

On Cortex-A57, this implementation is on par with Lingyan's NEON
implementation, and roughly 7x as fast as the generic C code.

Diff from Ard's original patch: add a validation check for len.

Cc: "huanglingyan (A)" <huanglingyan2@huawei.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
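The assembly itself lives in arch/arm64/lib/csum.S (added below) and is not reproduced here. As a rough, hedged illustration of the technique the commit message describes, the following C sketch models a ones'-complement accumulation with end-around carry folding (what the carry flag and ADCS give the assembly for free) and an overlapping final 64-byte block. The names csum_sketch, sum_block64, and add_with_fold are made up for this illustration and do not appear in the patch; how the real routine avoids double-counting the overlap, its alignment handling, byte ordering, small-input path, and final 16-bit fold are all omitted or simplified.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Add with end-around carry: a carry out of bit 63 is folded back in,
 * analogous to chaining ADCS and adding the final carry in assembly. */
static uint64_t add_with_fold(uint64_t sum, uint64_t val)
{
	sum += val;
	if (sum < val)		/* unsigned overflow => carry out */
		sum++;
	return sum;
}

/* Sum one 64-byte block as eight 64-bit words (memcpy avoids
 * unaligned-access issues in this portable sketch). */
static uint64_t sum_block64(const unsigned char *p, uint64_t sum)
{
	uint64_t w;
	int i;

	for (i = 0; i < 8; i++) {
		memcpy(&w, p + 8 * i, sizeof(w));
		sum = add_with_fold(sum, w);
	}
	return sum;
}

/* Sketch of the >64-byte path: full 64-byte blocks, then one final
 * block that overlaps the previous one so no scalar tail loop or
 * extra branches are needed. */
uint64_t csum_sketch(const unsigned char *buf, size_t len)
{
	uint64_t sum = 0;
	size_t off = 0;

	if (len < 64)
		return 0;	/* small inputs are out of scope for this sketch */

	while (off + 64 <= len) {
		sum = sum_block64(buf + off, sum);
		off += 64;
	}

	if (off < len) {
		/* Re-load the last 64 bytes of the buffer, but zero the
		 * bytes already covered by the previous block so nothing
		 * is counted twice. (Hypothetical handling; the actual
		 * assembly resolves the overlap in registers.) */
		unsigned char tail[64];
		size_t overlap = 64 - (len - off);

		memcpy(tail, buf + len - 64, 64);
		memset(tail, 0, overlap);
		sum = sum_block64(tail, sum);
	}

	return sum;	/* the real do_csum() also folds this down to 16 bits */
}
```

The point of the overlapping final block is branch reduction: every iteration handles a full 64 bytes, so the only data-dependent decision is whether a partial tail exists at all, which suits cores with deep pipelines as the commit message notes.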
New file: arch/arm64/lib/csum.S (0 → 100644)