    arm64: do_csum: implement accelerated scalar version · efa29e0a
    Committed by Ard Biesheuvel
    hulk inclusion
    category: feature
    feature: checksum performance
    bugzilla: 13700
    CVE: NA
    
    --------------------------------------------------
    
    It turns out that the IP checksumming code is still exercised often,
    even though one might expect that modern NICs with checksum offload
    have no use for it. However, as Lingyan points out, there are
    combinations of features where the network stack may still fall back
    to software checksumming, and so it makes sense to provide an
    optimized implementation in software as well.
    
    So provide an implementation of do_csum() in scalar assembler, which,
    unlike C, gives direct access to the carry flag, making the code run
    substantially faster. The routine uses overlapping 64-byte loads for
    all input sizes > 64 bytes, in order to reduce the number of branches
    and improve performance on cores with deep pipelines.
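
To illustrate why direct access to the carry flag matters, here is a minimal C sketch of ones' complement summation over 64-bit words. It is not the code in csum.S and does not model the overlapping 64-byte loads. The names add64_carry, fold64, sketch_csum and ref_csum16 are hypothetical helpers introduced only for this example. In C, the carry-out has to be recovered with a comparison after every addition, whereas an assembler routine can chain add-with-carry instructions.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: ones' complement addition of two 64-bit words.
 * C has no carry flag, so the carry-out is recovered with a comparison
 * and wrapped back into bit 0 (end-around carry). */
static uint64_t add64_carry(uint64_t sum, uint64_t val)
{
    sum += val;
    return sum + (sum < val);
}

/* Hypothetical helper: fold a 64-bit ones' complement sum down to 16 bits.
 * Each width is folded twice so that the carry from the first fold is
 * absorbed by the second. */
static uint16_t fold64(uint64_t sum)
{
    sum = (sum & 0xffffffffu) + (sum >> 32);
    sum = (sum & 0xffffffffu) + (sum >> 32);
    sum = (sum & 0xffffu) + (sum >> 16);
    sum = (sum & 0xffffu) + (sum >> 16);
    return (uint16_t)sum;
}

/* Sketch only: sum the buffer 8 bytes at a time in native byte order,
 * zero-padding the tail. The result is the unfolded-then-folded 16-bit
 * sum (not complemented), in native byte order. */
static uint16_t sketch_csum(const unsigned char *buf, size_t len)
{
    uint64_t sum = 0, word;

    while (len >= 8) {
        memcpy(&word, buf, 8);      /* alignment-safe load */
        sum = add64_carry(sum, word);
        buf += 8;
        len -= 8;
    }
    if (len) {
        word = 0;                   /* pad the tail with zero bytes */
        memcpy(&word, buf, len);
        sum = add64_carry(sum, word);
    }
    return fold64(sum);
}

/* Straightforward 16-bit reference, used only to cross-check the sketch. */
static uint16_t ref_csum16(const unsigned char *buf, size_t len)
{
    uint32_t sum = 0;
    uint16_t w;

    while (len >= 2) {
        memcpy(&w, buf, 2);
        sum += w;
        sum = (sum & 0xffffu) + (sum >> 16);
        buf += 2;
        len -= 2;
    }
    if (len) {
        w = 0;
        memcpy(&w, buf, 1);
        sum += w;
        sum = (sum & 0xffffu) + (sum >> 16);
    }
    return (uint16_t)sum;
}
```

The comparison in add64_carry is the extra work per word that a C version has to do; an assembler version can instead let each addition consume the carry of the previous one and fold only once at the end.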
    
    On Cortex-A57, this implementation is on par with Lingyan's NEON
    implementation, and roughly 7x as fast as the generic C code.
    
    Difference from Ard's original patch: add a validation check for len.
    
    Cc: "huanglingyan (A)" <huanglingyan2@huawei.com>
    Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
    Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
    Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
    Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
csum.S 2.4 KB