1. 16 Dec, 2019 33 commits
  2. 14 Dec, 2019 7 commits
    • S
      selftests/bpf: Test wire_len/gso_segs in BPF_PROG_TEST_RUN · a06bf42f
      Stanislav Fomichev committed
      Make sure we can pass arbitrary data in wire_len/gso_segs.
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20191213223028.161282-2-sdf@google.com
    • S
      bpf: Expose __sk_buff wire_len/gso_segs to BPF_PROG_TEST_RUN · 850a88cc
      Stanislav Fomichev committed
      wire_len should not be less than real len and is capped by GSO_MAX_SIZE.
      gso_segs is capped by GSO_MAX_SEGS.
      
      v2:
      * set wire_len to skb->len when passed wire_len is 0 (Alexei Starovoitov)
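The rules above can be sketched in userspace C. This is only an illustration of the described checks, not the kernel code: the helper name `sketch_check_wire_len_gso_segs`, the `SKETCH_*` constants, and their values (65536 and 65535) are assumptions made for this example.

```c
#include <errno.h>
#include <stdint.h>

/* Assumed stand-ins for the kernel's GSO_MAX_SIZE / GSO_MAX_SEGS. */
#define SKETCH_GSO_MAX_SIZE 65536u
#define SKETCH_GSO_MAX_SEGS 65535u

/*
 * Hypothetical helper: validates user-supplied wire_len/gso_segs against
 * the real packet length and stores the effective wire length in *pkt_len.
 * Returns 0 on success and -EINVAL when a value is out of range.
 */
static int sketch_check_wire_len_gso_segs(uint32_t skb_len, uint32_t wire_len,
                                          uint32_t gso_segs, uint32_t *pkt_len)
{
    if (wire_len == 0) {
        /* v2 behaviour: an unset wire_len falls back to the real length. */
        *pkt_len = skb_len;
    } else {
        /* wire_len must not be below the real length or above the cap. */
        if (wire_len < skb_len || wire_len > SKETCH_GSO_MAX_SIZE)
            return -EINVAL;
        *pkt_len = wire_len;
    }

    /* gso_segs is only capped from above. */
    if (gso_segs > SKETCH_GSO_MAX_SEGS)
        return -EINVAL;

    return 0;
}
```

With a 100-byte packet, a wire_len of 0 yields an effective length of 100, while a wire_len of 99 is rejected.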
      Signed-off-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Cc: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20191213223028.161282-1-sdf@google.com
    • A
      Merge branch 'bpf-dispatcher' · 02620d9e
      Alexei Starovoitov committed
      Björn Töpel says:
      
      ====================
      Overview
      ========
      
      This is the 6th iteration of the series that introduces the BPF
      dispatcher, which is a mechanism to avoid indirect calls.
      
      The BPF dispatcher is a multi-way branch code generator, targeted for
      BPF programs. E.g. when an XDP program is executed via the
      bpf_prog_run_xdp(), it is invoked via an indirect call. With
      retpolines enabled, the indirect call has a substantial performance
      impact. The dispatcher is a mechanism that transforms indirect calls to
      direct calls, and therefore avoids the retpoline. The dispatcher is
      generated using the BPF JIT, and relies on text poking provided by
      bpf_arch_text_poke().
      
      The dispatcher hijacks a trampoline function via the __fentry__ nop
      of the trampoline. One dispatcher instance currently supports up to 48
      dispatch points. This can be extended in the future.
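As a rough C-level model of the idea (the real dispatcher is machine code emitted by the BPF JIT, not C), a dispatcher compares the function pointer against the known programs and calls the match directly, keeping the indirect call only as a fallback. All names here (`sketch_dispatch`, `prog_a`, `prog_b`, `prog_c`, `struct sketch_insn`) are hypothetical.

```c
#include <stddef.h>

struct sketch_insn; /* opaque stand-in for struct bpf_insn */

typedef unsigned int (*sketch_prog_fn)(const void *ctx,
                                       const struct sketch_insn *insnsi);

/* Three dummy "programs"; only prog_a and prog_b are known to the dispatcher. */
static unsigned int prog_a(const void *ctx, const struct sketch_insn *insnsi)
{
    (void)ctx; (void)insnsi;
    return 1;
}

static unsigned int prog_b(const void *ctx, const struct sketch_insn *insnsi)
{
    (void)ctx; (void)insnsi;
    return 2;
}

static unsigned int prog_c(const void *ctx, const struct sketch_insn *insnsi)
{
    (void)ctx; (void)insnsi;
    return 3;
}

/* Same shape as the trampoline signature: ctx, insnsi, bpf_func. */
static unsigned int sketch_dispatch(const void *ctx,
                                    const struct sketch_insn *insnsi,
                                    sketch_prog_fn bpf_func)
{
    if (bpf_func == prog_a)
        return prog_a(ctx, insnsi);   /* direct call, no retpoline needed */
    if (bpf_func == prog_b)
        return prog_b(ctx, insnsi);   /* direct call, no retpoline needed */
    return bpf_func(ctx, insnsi);     /* unknown program: indirect fallback */
}
```

A call with `prog_c` still works, but it takes the indirect fallback path, which is the case the "full; walk all entries, and fallback" measurements cover.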
      
      In this series, only one dispatcher instance is supported, and the
      only user is XDP. The dispatcher is updated when an XDP program is
      attached/detached to/from a netdev. An alternative would have been to
      update the dispatcher at program load time, but as there are usually
      more XDP programs loaded than attached, updating at attach/detach
      time was picked.
      
      The XDP dispatcher is always enabled, if available, because it helps
      even when retpolines are disabled. Please refer to the "Performance"
      section below.
      
      The first patch refactors the image allocation from the BPF trampoline
      code. Patch two introduces the dispatcher, and patch three adds a
      dispatcher for XDP, and wires up the XDP control-/fast-path. Patch
      four adds the dispatcher to BPF_TEST_RUN. Patch five adds a simple
      selftest, and the last adds alignment to jump targets.
      
      I have rebased the series on commit 679152d3 ("libbpf: Fix printf
      compilation warnings on ppc64le arch").
      
      Generated code, x86-64
      ======================
      
      The dispatcher currently has a maximum of 48 entries, where one entry
      is a unique BPF program. Multiple users of a dispatcher instance using
      the same BPF program will share that entry.
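The entry sharing described above can be modelled with a small reference-counted table. This is a userspace sketch under assumed names (`sketch_dispatcher`, `sketch_add`, `sketch_remove`), not the kernel's data structure; only the limit of 48 entries is taken from the text.

```c
#include <stddef.h>

#define SKETCH_MAX_PROGS 48 /* dispatch points per instance, per the text */

struct sketch_entry {
    const void *prog;   /* program address, NULL when the slot is free */
    unsigned int users; /* number of attachments sharing this entry */
};

struct sketch_dispatcher {
    struct sketch_entry entries[SKETCH_MAX_PROGS];
    unsigned int num_progs; /* number of unique programs */
};

/* Returns 0 on success, -1 if prog is NULL or the table is full. */
static int sketch_add(struct sketch_dispatcher *d, const void *prog)
{
    struct sketch_entry *free_slot = NULL;

    if (!prog)
        return -1;

    for (size_t i = 0; i < SKETCH_MAX_PROGS; i++) {
        if (d->entries[i].prog == prog) {
            d->entries[i].users++; /* same program: share the entry */
            return 0;
        }
        if (!d->entries[i].prog && !free_slot)
            free_slot = &d->entries[i];
    }

    if (!free_slot)
        return -1;

    free_slot->prog = prog;
    free_slot->users = 1;
    d->num_progs++;
    return 0;
}

/* Drops one user; the entry is freed when the last user goes away. */
static void sketch_remove(struct sketch_dispatcher *d, const void *prog)
{
    for (size_t i = 0; i < SKETCH_MAX_PROGS; i++) {
        if (d->entries[i].prog != prog || !prog)
            continue;
        if (--d->entries[i].users == 0) {
            d->entries[i].prog = NULL;
            d->num_progs--;
        }
        return;
    }
}
```

Attaching the same program to two netdevs would call `sketch_add()` twice but consume only one of the 48 entries.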
      
      The program/slot lookup is performed by a binary search, O(log
      n). Let's have a look at the generated code.
      
      The trampoline function has the following signature:
      
        unsigned int tramp(const void *ctx,
                           const struct bpf_insn *insnsi,
                           unsigned int (*bpf_func)(const void *,
                                                    const struct bpf_insn *))
      
      On Intel x86-64 this means that rdx will contain the bpf_func. To
      make it easier to read, I've let the BPF programs have the following
      range: 0xffffffffffffffff (-1) to 0xfffffffffffffff0
      (-16). 0xffffffff81c00f10 is the retpoline thunk, in this case
      __x86_indirect_thunk_rdx. If retpolines are disabled the thunk will be
      a regular indirect call.
      
      The minimal dispatcher will then look like this:
      
      ffffffffc0002000: cmp    rdx,0xffffffffffffffff
      ffffffffc0002007: je     0xffffffffffffffff ; -1
      ffffffffc000200d: jmp    0xffffffff81c00f10
      
      A 16 entry dispatcher looks like this:
      
      ffffffffc0020000: cmp    rdx,0xfffffffffffffff7 ; -9
      ffffffffc0020007: jg     0xffffffffc0020130
      ffffffffc002000d: cmp    rdx,0xfffffffffffffff3 ; -13
      ffffffffc0020014: jg     0xffffffffc00200a0
      ffffffffc002001a: cmp    rdx,0xfffffffffffffff1 ; -15
      ffffffffc0020021: jg     0xffffffffc0020060
      ffffffffc0020023: cmp    rdx,0xfffffffffffffff0 ; -16
      ffffffffc002002a: jg     0xffffffffc0020040
      ffffffffc002002c: cmp    rdx,0xfffffffffffffff0 ; -16
      ffffffffc0020033: je     0xfffffffffffffff0 ; -16
      ffffffffc0020039: jmp    0xffffffff81c00f10
      ffffffffc002003e: xchg   ax,ax
      ffffffffc0020040: cmp    rdx,0xfffffffffffffff1 ; -15
      ffffffffc0020047: je     0xfffffffffffffff1 ; -15
      ffffffffc002004d: jmp    0xffffffff81c00f10
      ffffffffc0020052: nop    DWORD PTR [rax+rax*1+0x0]
      ffffffffc002005a: nop    WORD PTR [rax+rax*1+0x0]
      ffffffffc0020060: cmp    rdx,0xfffffffffffffff2 ; -14
      ffffffffc0020067: jg     0xffffffffc0020080
      ffffffffc0020069: cmp    rdx,0xfffffffffffffff2 ; -14
      ffffffffc0020070: je     0xfffffffffffffff2 ; -14
      ffffffffc0020076: jmp    0xffffffff81c00f10
      ffffffffc002007b: nop    DWORD PTR [rax+rax*1+0x0]
      ffffffffc0020080: cmp    rdx,0xfffffffffffffff3 ; -13
      ffffffffc0020087: je     0xfffffffffffffff3 ; -13
      ffffffffc002008d: jmp    0xffffffff81c00f10
      ffffffffc0020092: nop    DWORD PTR [rax+rax*1+0x0]
      ffffffffc002009a: nop    WORD PTR [rax+rax*1+0x0]
      ffffffffc00200a0: cmp    rdx,0xfffffffffffffff5 ; -11
      ffffffffc00200a7: jg     0xffffffffc00200f0
      ffffffffc00200a9: cmp    rdx,0xfffffffffffffff4 ; -12
      ffffffffc00200b0: jg     0xffffffffc00200d0
      ffffffffc00200b2: cmp    rdx,0xfffffffffffffff4 ; -12
      ffffffffc00200b9: je     0xfffffffffffffff4 ; -12
      ffffffffc00200bf: jmp    0xffffffff81c00f10
      ffffffffc00200c4: nop    DWORD PTR [rax+rax*1+0x0]
      ffffffffc00200cc: nop    DWORD PTR [rax+0x0]
      ffffffffc00200d0: cmp    rdx,0xfffffffffffffff5 ; -11
      ffffffffc00200d7: je     0xfffffffffffffff5 ; -11
      ffffffffc00200dd: jmp    0xffffffff81c00f10
      ffffffffc00200e2: nop    DWORD PTR [rax+rax*1+0x0]
      ffffffffc00200ea: nop    WORD PTR [rax+rax*1+0x0]
      ffffffffc00200f0: cmp    rdx,0xfffffffffffffff6 ; -10
      ffffffffc00200f7: jg     0xffffffffc0020110
      ffffffffc00200f9: cmp    rdx,0xfffffffffffffff6 ; -10
      ffffffffc0020100: je     0xfffffffffffffff6 ; -10
      ffffffffc0020106: jmp    0xffffffff81c00f10
      ffffffffc002010b: nop    DWORD PTR [rax+rax*1+0x0]
      ffffffffc0020110: cmp    rdx,0xfffffffffffffff7 ; -9
      ffffffffc0020117: je     0xfffffffffffffff7 ; -9
      ffffffffc002011d: jmp    0xffffffff81c00f10
      ffffffffc0020122: nop    DWORD PTR [rax+rax*1+0x0]
      ffffffffc002012a: nop    WORD PTR [rax+rax*1+0x0]
      ffffffffc0020130: cmp    rdx,0xfffffffffffffffb ; -5
      ffffffffc0020137: jg     0xffffffffc00201d0
      ffffffffc002013d: cmp    rdx,0xfffffffffffffff9 ; -7
      ffffffffc0020144: jg     0xffffffffc0020190
      ffffffffc0020146: cmp    rdx,0xfffffffffffffff8 ; -8
      ffffffffc002014d: jg     0xffffffffc0020170
      ffffffffc002014f: cmp    rdx,0xfffffffffffffff8 ; -8
      ffffffffc0020156: je     0xfffffffffffffff8 ; -8
      ffffffffc002015c: jmp    0xffffffff81c00f10
      ffffffffc0020161: nop    DWORD PTR [rax+rax*1+0x0]
      ffffffffc0020169: nop    DWORD PTR [rax+0x0]
      ffffffffc0020170: cmp    rdx,0xfffffffffffffff9 ; -7
      ffffffffc0020177: je     0xfffffffffffffff9 ; -7
      ffffffffc002017d: jmp    0xffffffff81c00f10
      ffffffffc0020182: nop    DWORD PTR [rax+rax*1+0x0]
      ffffffffc002018a: nop    WORD PTR [rax+rax*1+0x0]
      ffffffffc0020190: cmp    rdx,0xfffffffffffffffa ; -6
      ffffffffc0020197: jg     0xffffffffc00201b0
      ffffffffc0020199: cmp    rdx,0xfffffffffffffffa ; -6
      ffffffffc00201a0: je     0xfffffffffffffffa ; -6
      ffffffffc00201a6: jmp    0xffffffff81c00f10
      ffffffffc00201ab: nop    DWORD PTR [rax+rax*1+0x0]
      ffffffffc00201b0: cmp    rdx,0xfffffffffffffffb ; -5
      ffffffffc00201b7: je     0xfffffffffffffffb ; -5
      ffffffffc00201bd: jmp    0xffffffff81c00f10
      ffffffffc00201c2: nop    DWORD PTR [rax+rax*1+0x0]
      ffffffffc00201ca: nop    WORD PTR [rax+rax*1+0x0]
      ffffffffc00201d0: cmp    rdx,0xfffffffffffffffd ; -3
      ffffffffc00201d7: jg     0xffffffffc0020220
      ffffffffc00201d9: cmp    rdx,0xfffffffffffffffc ; -4
      ffffffffc00201e0: jg     0xffffffffc0020200
      ffffffffc00201e2: cmp    rdx,0xfffffffffffffffc ; -4
      ffffffffc00201e9: je     0xfffffffffffffffc ; -4
      ffffffffc00201ef: jmp    0xffffffff81c00f10
      ffffffffc00201f4: nop    DWORD PTR [rax+rax*1+0x0]
      ffffffffc00201fc: nop    DWORD PTR [rax+0x0]
      ffffffffc0020200: cmp    rdx,0xfffffffffffffffd ; -3
      ffffffffc0020207: je     0xfffffffffffffffd ; -3
      ffffffffc002020d: jmp    0xffffffff81c00f10
      ffffffffc0020212: nop    DWORD PTR [rax+rax*1+0x0]
      ffffffffc002021a: nop    WORD PTR [rax+rax*1+0x0]
      ffffffffc0020220: cmp    rdx,0xfffffffffffffffe ; -2
      ffffffffc0020227: jg     0xffffffffc0020240
      ffffffffc0020229: cmp    rdx,0xfffffffffffffffe ; -2
      ffffffffc0020230: je     0xfffffffffffffffe ; -2
      ffffffffc0020236: jmp    0xffffffff81c00f10
      ffffffffc002023b: nop    DWORD PTR [rax+rax*1+0x0]
      ffffffffc0020240: cmp    rdx,0xffffffffffffffff ; -1
      ffffffffc0020247: je     0xffffffffffffffff ; -1
      ffffffffc002024d: jmp    0xffffffff81c00f10
      
      The nops are there to align jump targets to 16 B.
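The compare tree in the listing can be replayed with a short C function. It is a sketch that walks a sorted array in the same order as the generated cmp/jg/je sequence. The name `sketch_lookup` and the comparison counter are assumptions added for illustration, and the real dispatcher is emitted as machine code rather than executed as a loop.

```c
#include <stdint.h>

/*
 * progs[] holds program addresses sorted in ascending signed order, and
 * [a, b] is the inclusive index range to search. Returns the matching
 * index, or -1 when the target is unknown (the "jmp thunk" fallback).
 * *cmps counts the cmp instructions that would execute on that path.
 */
static int sketch_lookup(const int64_t *progs, int a, int b,
                         int64_t target, int *cmps)
{
    while (a < b) {
        int pivot = (b - a) / 2;

        (*cmps)++;
        if (target > progs[a + pivot]) /* cmp rdx, imm ; jg upper half */
            a = a + pivot + 1;
        else                           /* fall through to lower half */
            b = a + pivot;
    }

    (*cmps)++;                         /* leaf: cmp rdx, imm ; je prog */
    return target == progs[a] ? a : -1;
}
```

With the 16 example addresses (-16 to -1) the first pivot is -9, then -13 or -5, matching the listing, and every lookup executes five compares. The single-entry case executes one, as in the minimal dispatcher.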
      
      Performance
      ===========
      
      The tests were performed using the xdp_rxq_info sample program with
      the following command-line:
      
      1. XDP_DRV:
        # xdp_rxq_info --dev eth0 --action XDP_DROP
      2. XDP_SKB:
        # xdp_rxq_info --dev eth0 -S --action XDP_DROP
      3. xdp-perf, from selftests/bpf:
        # test_progs -v -t xdp_perf
      
      Run with mitigations=auto
      -------------------------
      
      Baseline:
      1. 21.7 Mpps (21736190)
      2. 3.8 Mpps   (3837582)
      3. 15 ns
      
      Dispatcher:
      1. 30.2 Mpps (30176320)
      2. 4.0 Mpps   (4015579)
      3. 5 ns
      
      Dispatcher (full; walk all entries, and fallback):
      1. 22.0 Mpps (21986704)
      2. 3.8 Mpps   (3831298)
      3. 17 ns
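To relate the packet rates to the per-call timings, a rate can be converted into time per packet. The helper below is a hypothetical illustration (`sketch_ns_per_packet` is not part of any tool used in the tests).

```c
/* Convert a packets-per-second rate into nanoseconds spent per packet. */
static double sketch_ns_per_packet(double pps)
{
    return 1e9 / pps;
}
```

For the XDP_DRV numbers above this gives roughly 46 ns per packet for the baseline (21736190 pps) and roughly 33 ns with the dispatcher (30176320 pps), a saving of about 13 ns per packet. That is in the same range as the 10 ns difference reported by xdp-perf (15 ns vs 5 ns).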
      
      Run with mitigations=off
      ------------------------
      
      Baseline:
      1. 29.9 Mpps (29875135)
      2. 4.1 Mpps   (4100179)
      3. 4 ns
      
      Dispatcher:
      1. 30.4 Mpps (30439241)
      2. 4.1 Mpps   (4109350)
      3. 4 ns
      
      Dispatcher (full; walk all entries, and fallback):
      1. 28.9 Mpps (28903269)
      2. 4.1 Mpps   (4080078)
      3. 5 ns
      
      xdp-perf runs, aligned vs non-aligned jump targets
      --------------------------------------------------
      
      In this test dispatchers of different sizes, with and without jump
      target alignment, were exercised. As outlined above the function
      lookup is performed via binary search. This means that depending on
      the pointer value of the function, it can reside in the upper or lower
      part of the search table. The performed tests were:
      
      1. aligned, mitigations=auto, function entry < other entries
      2. aligned, mitigations=auto, function entry > other entries
      3. non-aligned, mitigations=auto, function entry < other entries
      4. non-aligned, mitigations=auto, function entry > other entries
      5. aligned, mitigations=off, function entry < other entries
      6. aligned, mitigations=off, function entry > other entries
      7. non-aligned, mitigations=off, function entry < other entries
      8. non-aligned, mitigations=off, function entry > other entries
      
      The micro benchmarks showed that alignment of jump targets has some
      positive impact.
      
      A reply to this cover letter will contain complete data for all runs.
      
      Multiple xdp-perf baseline with mitigations=auto
      ------------------------------------------------
      
       Performance counter stats for './test_progs -v -t xdp_perf' (1024 runs):
      
                   16.69 msec task-clock                #    0.984 CPUs utilized            ( +-  0.08% )
                       2      context-switches          #    0.123 K/sec                    ( +-  1.11% )
                       0      cpu-migrations            #    0.000 K/sec                    ( +- 70.68% )
                      97      page-faults               #    0.006 M/sec                    ( +-  0.05% )
              49,254,635      cycles                    #    2.951 GHz                      ( +-  0.09% )  (12.28%)
              42,138,558      instructions              #    0.86  insn per cycle           ( +-  0.02% )  (36.15%)
               7,315,291      branches                  #  438.300 M/sec                    ( +-  0.01% )  (59.43%)
               1,011,201      branch-misses             #   13.82% of all branches          ( +-  0.01% )  (83.31%)
              15,440,788      L1-dcache-loads           #  925.143 M/sec                    ( +-  0.00% )  (99.40%)
                  39,067      L1-dcache-load-misses     #    0.25% of all L1-dcache hits    ( +-  0.04% )
                   6,531      LLC-loads                 #    0.391 M/sec                    ( +-  0.05% )
                     442      LLC-load-misses           #    6.76% of all LL-cache hits     ( +-  0.77% )
         <not supported>      L1-icache-loads
                  57,964      L1-icache-load-misses                                         ( +-  0.06% )
              15,442,496      dTLB-loads                #  925.246 M/sec                    ( +-  0.00% )
                     514      dTLB-load-misses          #    0.00% of all dTLB cache hits   ( +-  0.73% )  (40.57%)
                     130      iTLB-loads                #    0.008 M/sec                    ( +-  2.75% )  (16.69%)
           <not counted>      iTLB-load-misses                                              ( +-  8.71% )  (0.60%)
         <not supported>      L1-dcache-prefetches
         <not supported>      L1-dcache-prefetch-misses
      
               0.0169558 +- 0.0000127 seconds time elapsed  ( +-  0.07% )
      
      Multiple xdp-perf dispatcher with mitigations=auto
      --------------------------------------------------
      
      Note that this includes generating the dispatcher.
      
       Performance counter stats for './test_progs -v -t xdp_perf' (1024 runs):
      
                    4.80 msec task-clock                #    0.953 CPUs utilized            ( +-  0.06% )
                       1      context-switches          #    0.258 K/sec                    ( +-  1.57% )
                       0      cpu-migrations            #    0.000 K/sec
                      97      page-faults               #    0.020 M/sec                    ( +-  0.05% )
              14,185,861      cycles                    #    2.955 GHz                      ( +-  0.17% )  (50.49%)
              45,691,935      instructions              #    3.22  insn per cycle           ( +-  0.01% )  (99.19%)
               8,346,008      branches                  # 1738.709 M/sec                    ( +-  0.00% )
                  13,046      branch-misses             #    0.16% of all branches          ( +-  0.10% )
              15,443,735      L1-dcache-loads           # 3217.365 M/sec                    ( +-  0.00% )
                  39,585      L1-dcache-load-misses     #    0.26% of all L1-dcache hits    ( +-  0.05% )
                   7,138      LLC-loads                 #    1.487 M/sec                    ( +-  0.06% )
                     671      LLC-load-misses           #    9.40% of all LL-cache hits     ( +-  0.73% )
         <not supported>      L1-icache-loads
                  56,213      L1-icache-load-misses                                         ( +-  0.08% )
              15,443,735      dTLB-loads                # 3217.365 M/sec                    ( +-  0.00% )
           <not counted>      dTLB-load-misses                                              (0.00%)
           <not counted>      iTLB-loads                                                    (0.00%)
           <not counted>      iTLB-load-misses                                              (0.00%)
         <not supported>      L1-dcache-prefetches
         <not supported>      L1-dcache-prefetch-misses
      
              0.00503705 +- 0.00000546 seconds time elapsed  ( +-  0.11% )
      
      Revisions
      =========
      
      v4->v5: [1]
        * Fixed s/xdp_ctx/ctx/ typo (Toke)
        * Marked dispatcher trampoline with noinline attribute (Alexei)
      
      v3->v4: [2]
        * Moved away from doing dispatcher lookup based on the trampoline
          function, to a model where the dispatcher instance is explicitly
          passed to the bpf_dispatcher_change_prog() (Alexei)
      
      v2->v3: [3]
        * Removed xdp_call, and instead make the dispatcher available to all
          XDP users via bpf_prog_run_xdp() and dev_xdp_install(). (Toke)
        * Always enable the dispatcher, if available (Alexei)
        * Reuse BPF trampoline image allocator (Alexei)
        * Make sure the dispatcher is exercised in selftests (Alexei)
        * Only allow one dispatcher, and wire it to XDP
      
      v1->v2: [4]
        * Fixed i386 build warning (kbuild robot)
        * Made bpf_dispatcher_lookup() static (kbuild robot)
        * Make sure xdp_call.h is only enabled for builtins
        * Add xdp_call() to ixgbe, mlx4, and mlx5
      
      RFC->v1: [5]
        * Improved error handling (Edward and Andrii)
        * Explicit cleanup (Andrii)
        * Use 32B with sext cmp (Alexei)
        * Align jump targets to 16B (Alexei)
        * 4 to 16 entries (Toke)
        * Added stats to xdp_call_run()
      
      [1] https://lore.kernel.org/bpf/20191211123017.13212-1-bjorn.topel@gmail.com/
      [2] https://lore.kernel.org/bpf/20191209135522.16576-1-bjorn.topel@gmail.com/
      [3] https://lore.kernel.org/bpf/20191123071226.6501-1-bjorn.topel@gmail.com/
      [4] https://lore.kernel.org/bpf/20191119160757.27714-1-bjorn.topel@gmail.com/
      [5] https://lore.kernel.org/bpf/20191113204737.31623-1-bjorn.topel@gmail.com/
      ====================
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • B
      bpf, x86: Align dispatcher branch targets to 16B · 116eb788
      Björn Töpel committed
      From Intel 64 and IA-32 Architectures Optimization Reference Manual,
      3.4.1.4 Code Alignment, Assembly/Compiler Coding Rule 11: All branch
      targets should be 16-byte aligned.
      
      This commit aligns branch targets according to the Intel manual.
      
      The nops used to align branch targets make the dispatcher larger, and
      therefore the number of supported dispatch points/programs is
      decreased from 64 to 48.
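The amount of nop padding needed before a branch target follows from the current address. A minimal sketch, with the hypothetical helper name `sketch_pad_to_16`:

```c
#include <stdint.h>

/* Number of padding bytes needed to reach the next 16-byte boundary. */
static unsigned int sketch_pad_to_16(uint64_t addr)
{
    /* Unsigned negation wraps, so (-addr & 15) is the distance to the
     * next multiple of 16, and 0 when addr is already aligned. */
    return (unsigned int)(-addr & 15);
}
```

In the 16-entry listing from the 'bpf-dispatcher' merge message, the code before the first aligned target ends at ffffffffc002003e, so 2 bytes of padding (xchg ax,ax) bring the next target to ffffffffc0020040. The next gap starts at ffffffffc0020052 and needs 14 bytes, which the listing fills with two multi-byte nops.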
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20191213175112.30208-7-bjorn.topel@gmail.com
    • B
      selftests: bpf: Add xdp_perf test · e754f5a6
      Björn Töpel committed
      The xdp_perf is a dummy XDP test, only used to measure the cost of
      jumping into a naive XDP program one million times.
      
      To build and run the program:
        $ cd tools/testing/selftests/bpf
        $ make
        $ ./test_progs -v -t xdp_perf
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20191213175112.30208-6-bjorn.topel@gmail.com
    • B
      bpf: Start using the BPF dispatcher in BPF_TEST_RUN · f23c4b39
      Björn Töpel committed
      In order to properly exercise the BPF dispatcher, this commit adds BPF
      dispatcher usage to BPF_TEST_RUN when executing XDP programs.
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20191213175112.30208-5-bjorn.topel@gmail.com
    • B
      bpf, xdp: Start using the BPF dispatcher for XDP · 7e6897f9
      Björn Töpel committed
      This commit adds a BPF dispatcher for XDP. The dispatcher is updated
      from the XDP control-path, dev_xdp_install(), and used when an XDP
      program is run via bpf_prog_run_xdp().
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20191213175112.30208-4-bjorn.topel@gmail.com