• S
    nft_set_pipapo: Introduce AVX2-based lookup implementation · 7400b063
    Stefano Brivio 提交于
    If the AVX2 set is available, we can exploit the repetitive
    characteristic of this algorithm to provide a fast, vectorised
    version by using 256-bit wide AVX2 operations for bucket loads and
    bitwise intersections.
    
    In most cases, this implementation consistently outperforms rbtree
    set instances despite the fact they are configured to use a given,
    single, ranged data type out of the ones used for performance
    measurements by the nft_concat_range.sh kselftest.
    
    That script, injecting packets directly on the ingoing device path
    with pktgen, reports, averaged over five runs on a single AMD Epyc
    7402 thread (3.35GHz, 768 KiB L1D$, 12 MiB L2$), the figures below.
    CONFIG_RETPOLINE was not set here.
    
    Note that this is not a fair comparison over hash and rbtree set
    types: non-ranged entries (used to have a reference for hash types)
    would be matched faster than this, and matching on a single field
    only (which is the case for rbtree) is also significantly faster.
    
    However, it's not possible at the moment to choose this set type
    for non-ranged entries, and the current implementation also needs
    a few minor adjustments in order to match on less than two fields.
    
     ---------------.-----------------------------------.------------.
     AMD Epyc 7402  |          baselines, Mpps          | this patch |
      1 thread      |___________________________________|____________|
      3.35GHz       |        |        |        |        |            |
      768KiB L1D$   | netdev |  hash  | rbtree |        |            |
     ---------------|  hook  |   no   | single |        |   pipapo   |
     type   entries |  drop  | ranges | field  | pipapo |    AVX2    |
     ---------------|--------|--------|--------|--------|------------|
     net,port       |        |        |        |        |            |
              1000  |   19.0 |   10.4 |    3.8 |    4.0 | 7.5   +87% |
     ---------------|--------|--------|--------|--------|------------|
     port,net       |        |        |        |        |            |
               100  |   18.8 |   10.3 |    5.8 |    6.3 | 8.1   +29% |
     ---------------|--------|--------|--------|--------|------------|
     net6,port      |        |        |        |        |            |
              1000  |   16.4 |    7.6 |    1.8 |    2.1 | 4.8  +128% |
     ---------------|--------|--------|--------|--------|------------|
     port,proto     |        |        |        |        |            |
             30000  |   19.6 |   11.6 |    3.9 |    0.5 | 2.6  +420% |
     ---------------|--------|--------|--------|--------|------------|
     net6,port,mac  |        |        |        |        |            |
                10  |   16.5 |    5.4 |    4.3 |    3.4 | 4.7   +38% |
     ---------------|--------|--------|--------|--------|------------|
     net6,port,mac, |        |        |        |        |            |
     proto    1000  |   16.5 |    5.7 |    1.9 |    1.4 | 3.6   +26% |
     ---------------|--------|--------|--------|--------|------------|
     net,mac        |        |        |        |        |            |
              1000  |   19.0 |    8.4 |    3.9 |    2.5 | 6.4  +156% |
     ---------------'--------'--------'--------'--------'------------'
    
    A similar strategy could be easily reused to implement specialised
    versions for other SIMD sets, and I plan to post at least a NEON
    version at a later time.
    Signed-off-by: NStefano Brivio <sbrivio@redhat.com>
    Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
    7400b063
Makefile 9.0 KB