1. 14 Sep 2014, 1 commit
  2. 10 Sep 2014, 1 commit
  3. 06 Sep 2014, 2 commits
    • net: Add function for parsing the header length out of linear ethernet frames · 56193d1b
      Alexander Duyck committed
      This patch updates some of the flow_dissector API so that it can be used to
      parse the length of ethernet buffers stored in fragments.  Most of the
      changes were to __skb_get_poff, which needed to be updated to accept a
      linear buffer instead of an skb.
      
      I have split __skb_get_poff into two functions: skb_get_poff retains the
      functionality of the original __skb_get_poff, while the new __skb_get_poff
      works much like __skb_flow_dissect does in relation to skb_flow_dissect,
      i.e. it provides the same functionality but works with just a data buffer
      and hlen instead of needing an skb.
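      
      A minimal sketch of the resulting split, assuming the wrapper dissects the
      skb itself and then hands the linear data and header length to the
      low-level helper (the argument list, including the flow_keys parameter, is
      an assumption based on the description above, not the exact kernel API):
      
        u32 __skb_get_poff(const struct sk_buff *skb, void *data,
                           const struct flow_keys *keys, int hlen);
      
        u32 skb_get_poff(const struct sk_buff *skb)
        {
                struct flow_keys keys;
      
                if (!skb_flow_dissect(skb, &keys))
                        return 0;
      
                /* same behaviour as the old __skb_get_poff: operate on the
                 * skb's linear data with its header length
                 */
                return __skb_get_poff(skb, skb->data, &keys, skb_headlen(skb));
        }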
      Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      56193d1b
    • net: bpf: make eBPF interpreter images read-only · 60a3b225
      Daniel Borkmann committed
      With eBPF being extended further and its exposure to user space on the way,
      hardening the memory range the interpreter uses to steer its command flow
      seems appropriate.  This patch moves the bytecode to be interpreted onto
      read-only pages.
      
      If we execute a corrupted BPF interpreter image for some reason, e.g. one
      placed there by an attacker who got past the verifier stage, it would not
      only provide arbitrary read/write memory access but arbitrary function
      calls as well. After the BPF interpreter image has been set up, its
      contents do not change until destruction time, so we can set the image up
      on pages made immutable in order to mitigate modifications to that code.
      The idea is derived from commit 314beb9b ("x86: bpf_jit_comp: secure bpf
      jit against spraying attacks").
      
      This is possible because bpf_prog is not part of sk_filter anymore.
      After setup, the bpf_prog cannot be altered during its lifetime. This
      prevents any modifications to the entire bpf_prog structure (incl. the
      function/JIT image pointer).
      
      Every eBPF program (including migrated classic BPF) has to call
      bpf_prog_select_runtime() to select either the interpreter or a JIT image
      as the last setup step, and all of them are freed via bpf_prog_free(),
      including the non-JIT case. Therefore, we can easily integrate this into
      the eBPF lifetime, and since we directly allocate a bpf_prog, we incur no
      performance penalty.
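      
      A minimal sketch of that lifetime; only bpf_prog_select_runtime() and
      bpf_prog_free() are named by this commit, the allocation helpers and error
      handling below are assumptions for illustration:
      
        struct bpf_prog *fp;
      
        fp = bpf_prog_alloc(bpf_prog_size(len), 0);   /* assumed allocation helper */
        if (!fp)
                return -ENOMEM;
        /* ... fill fp->len and the eBPF instructions ... */
      
        bpf_prog_select_runtime(fp);    /* pick JIT image or interpreter as the
                                         * last step; the image then stays on
                                         * read-only pages until destruction */
      
        /* ... run the program, e.g. via SK_RUN_FILTER() ... */
      
        bpf_prog_free(fp);              /* makes the pages writable again, then frees */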
      
      Tested with seccomp and test_bpf testsuite in JIT/non-JIT mode and manual
      inspection of kernel_page_tables.  Brad Spengler proposed the same idea
      via Twitter during development of this patch.
      
      Joint work with Hannes Frederic Sowa.
      Suggested-by: Brad Spengler <spender@grsecurity.net>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: Alexei Starovoitov <ast@plumgrid.com>
      Cc: Kees Cook <keescook@chromium.org>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      60a3b225
  4. 03 Aug 2014, 5 commits
  5. 31 Jul 2014, 1 commit
    • net: filter: don't release unattached filter through call_rcu() · 34c5bd66
      Pablo Neira committed
      sk_unattached_filter_destroy() does not always need to release the
      filter object via rcu. Since this filter is never attached to the
      socket, the caller should be responsible for releasing the filter
      in a safe way, which may not necessarily imply rcu.
      
      This is a short summary of clients of this function:
      
      1) xt_bpf.c and cls_bpf.c use the bpf matchers from rules; these rules
         are removed from the packet path before the filter is released. Thus,
         the framework makes sure the filter is safely removed.
      
      2) In the ppp driver, the ppp_lock ensures serialization between the
         xmit and filter attachment/detachment path. This doesn't use rcu
         so deferred release via rcu makes no sense.
      
      3) In the isdn/ppp driver, it is called from isdn_ppp_release() and
         isdn_ppp_ioctl(). This driver uses mutexes and spinlocks, no rcu.
         Thus, deferred rcu makes no sense to me either; the deferred releases
         may just be masking the effects of a wrong locking strategy, which
         should be fixed in the driver itself.
      
      4) In the team driver, this is the only place where rcu synchronization
         with an unattached filter is used. Therefore, this patch introduces a
         synchronize_rcu() call in the genetlink path to make sure the filter
         doesn't go away while packets are still walking over it (see the
         sketch after this list). I think we can revisit this once struct
         bpf_prog (that only wraps specific bpf code bits) is in place, and
         then add a specific struct rcu_head in the scope of the team driver
         if Jiri thinks this is needed.
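      
      A minimal sketch of the caller-side pattern described in point 4 above;
      the lb_priv/fp field names and the lock are assumptions for illustration,
      only synchronize_rcu() and sk_unattached_filter_destroy() are named by
      this patch:
      
        struct sk_filter *old_fp;
      
        /* swap in the new filter under the control path's lock */
        old_fp = rcu_dereference_protected(lb_priv->fp,
                                           lockdep_is_held(&team->lock));
        rcu_assign_pointer(lb_priv->fp, new_fp);
      
        synchronize_rcu();                      /* wait for packet-path readers */
        if (old_fp)
                sk_unattached_filter_destroy(old_fp);   /* now safe to free directly */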
      
      Deferred rcu release for unattached filters was originally introduced
      in 302d6637 ("filter: Allow to create sk-unattached filters").
      Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      34c5bd66
  6. 25 Jul 2014, 1 commit
  7. 24 Jul 2014, 1 commit
  8. 14 Jul 2014, 1 commit
  9. 09 Jul 2014, 1 commit
  10. 26 Jun 2014, 3 commits
  11. 19 Jun 2014, 1 commit
  12. 12 Jun 2014, 1 commit
  13. 11 Jun 2014, 1 commit
    • net: filter: cleanup A/X name usage · e430f34e
      Alexei Starovoitov committed
      The macro 'A' used in the internal BPF interpreter:
       #define A regs[insn->a_reg]
      was easily confused with the name of the classic BPF register 'A', since
      'A' would mean two different things depending on context.
      
      This patch cleans up the naming and clarifies its usage in the
      following way:
      
      - A and X are names of two classic BPF registers
      
      - BPF_REG_A denotes internal BPF register R0 used to map classic register A
        in internal BPF programs generated from classic
      
      - BPF_REG_X denotes internal BPF register R7 used to map classic register X
        in internal BPF programs generated from classic
      
      - internal BPF instruction format:
      struct sock_filter_int {
              __u8    code;           /* opcode */
              __u8    dst_reg:4;      /* dest register */
              __u8    src_reg:4;      /* source register */
              __s16   off;            /* signed offset */
              __s32   imm;            /* signed immediate constant */
      };
      
      - BPF_X/BPF_K is a single bit used to encode the source operand of an instruction
      In classic:
        BPF_X - means use register X as source operand
        BPF_K - means use 32-bit immediate as source operand
      In internal:
        BPF_X - means use 'src_reg' register as source operand
        BPF_K - means use 32-bit immediate as source operand
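      
      For illustration, a sketch of how a classic 'A += X' looks in the internal
      format with the renamed fields (the initializer below is illustrative, not
      taken from the patch itself):
      
        struct sock_filter_int insn = {
                .code    = BPF_ALU | BPF_ADD | BPF_X,   /* BPF_X: use src_reg */
                .dst_reg = BPF_REG_A,                   /* R0, maps classic A */
                .src_reg = BPF_REG_X,                   /* R7, maps classic X */
                .off     = 0,
                .imm     = 0,
        };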
      Suggested-by: Chema Gonzalez <chema@google.com>
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Acked-by: Daniel Borkmann <dborkman@redhat.com>
      Acked-by: Chema Gonzalez <chema@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e430f34e
  14. 06 Jun 2014, 1 commit
  15. 03 Jun 2014, 1 commit
  16. 02 Jun 2014, 2 commits
    • net: filter: improve filter block macros · f8f6d679
      Daniel Borkmann committed
      Commit 9739eef1 ("net: filter: make BPF conversion more readable")
      started to introduce helper macros similar to BPF_STMT()/BPF_JUMP()
      macros from classic BPF.
      
      However, quite a few statements in the filter conversion functions
      remained in the old style, which gives a mixture of block macros and
      non-block macros in the code. This patch makes the block macros themselves
      more readable by using explicit member initialization, and converts
      the remaining statements where possible, leaving the code in a more
      consistent state.
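      
      A sketch of what the explicit member initialization style looks like for
      one such block macro (the macro body below is illustrative, not copied
      from the patch):
      
        /* ALU op on registers, e.g. dst_reg += src_reg */
        #define BPF_ALU64_REG(OP, DST, SRC)                             \
                ((struct sock_filter_int) {                             \
                        .code  = BPF_ALU64 | BPF_OP(OP) | BPF_X,        \
                        .dst_reg = DST,                                 \
                        .src_reg = SRC,                                 \
                        .off   = 0,                                     \
                        .imm   = 0 })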
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f8f6d679
    • net: filter: get rid of BPF_S_* enum · 34805931
      Daniel Borkmann committed
      This patch finally allows us to get rid of the BPF_S_* enum.
      Currently, the code performs unnecessary encode and decode
      workarounds in seccomp and in the filter migration itself when a
      filter is being attached, in order to overcome the BPF_S_* encoding,
      which is no longer used by the new interpreter or the JIT compilers.
      
      Keeping it around would mean that we would also need to extend and
      maintain this enum and the related encoders/decoders in the future.
      We can get rid of all that and save these operations during filter
      attaching. Naturally, the JIT compilers need to be updated for this
      as well.
      
      Before the JIT conversion is done, each compiler checks whether A
      is loaded at startup, to find out if it needs to emit instructions
      to clear A first. Since BPF extensions are a subset of the
      BPF_LD | BPF_{W,H,B} | BPF_ABS variants, the case statements for
      extensions can be removed at that point. To ease and minimize code
      changes in the classic JITs, we have introduced bpf_anc_helper().
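      
      A sketch of how a classic JIT might use bpf_anc_helper() in place of the
      removed BPF_S_ANC_* case statements (BPF_ANC as the ancillary-load marker
      and the concrete case labels are assumptions for illustration):
      
        u16 code = bpf_anc_helper(&filter[i]);
      
        switch (code) {
        case BPF_ANC | SKF_AD_PROTOCOL:
                /* emit a load of skb->protocol */
                break;
        case BPF_LD | BPF_W | BPF_ABS:
                /* plain absolute word load */
                break;
        /* ... */
        }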
      
      Tested with test_bpf on x86_64 (JIT, int), s390x (JIT, int),
      arm (JIT, int), i386 (int), ppc64 (JIT, int); for sparc we
      unfortunately didn't have access, but the changes are analogous to
      the rest.
      
      Joint work with Alexei Starovoitov.
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mircea Gherzan <mgherzan@gmail.com>
      Cc: Kees Cook <keescook@chromium.org>
      Acked-by: Chema Gonzalez <chemag@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      34805931
  17. 24 May 2014, 2 commits
  18. 22 May 2014, 1 commit
    • net: filter: cleanup invocation of internal BPF · 5fe821a9
      Alexei Starovoitov committed
      Kernel API for classic BPF socket filters is:
      
      sk_unattached_filter_create() - validate classic BPF, convert, JIT
      SK_RUN_FILTER() - run it
      sk_unattached_filter_destroy() - destroy socket filter
      
      Clean up the internal BPF kernel API as follows:
      
      sk_filter_select_runtime() - final step of internal BPF creation.
        Try to JIT the internal BPF program; if JIT is not available, select the interpreter
      SK_RUN_FILTER() - run it
      sk_filter_free() - free the internal BPF program
      
      Disallow direct calls to the BPF interpreter. Execution of the BPF program
      should be done with the SK_RUN_FILTER() macro.
      
      Example of internal BPF create, run, destroy:
      
        struct sk_filter *fp;
      
        fp = kzalloc(sk_filter_size(prog_len), GFP_KERNEL);
        memcpy(fp->insnsi, prog, prog_len * sizeof(fp->insnsi[0]));
        fp->len = prog_len;
      
        sk_filter_select_runtime(fp);
      
        SK_RUN_FILTER(fp, ctx);
      
        sk_filter_free(fp);
      
      Sockets, seccomp, the testsuite and tracing use different ways to populate
      sk_filter, so the first steps of program creation are not shared.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Acked-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5fe821a9
  19. 16 May 2014, 1 commit
    • net: filter: x86: internal BPF JIT · 62258278
      Alexei Starovoitov committed
      Maps all internal BPF instructions into x86_64 instructions.
      This patch replaces the original BPF x86_64 JIT with an internal BPF x86_64 JIT.
      The sysctl net.core.bpf_jit_enable is reused as the on/off switch.
      
      Performance:
      
      1. The old BPF JIT and the internal BPF JIT generate equivalent x86_64 code.
         No performance difference is observed for filters that were JIT-able before.
      
      Example assembler code for BPF filter "tcpdump port 22"
      
      original BPF -> old JIT:            original BPF -> internal BPF -> new JIT:
         0:   push   %rbp                      0:     push   %rbp
         1:   mov    %rsp,%rbp                 1:     mov    %rsp,%rbp
         4:   sub    $0x60,%rsp                4:     sub    $0x228,%rsp
         8:   mov    %rbx,-0x8(%rbp)           b:     mov    %rbx,-0x228(%rbp) // prologue
                                              12:     mov    %r13,-0x220(%rbp)
                                              19:     mov    %r14,-0x218(%rbp)
                                              20:     mov    %r15,-0x210(%rbp)
                                              27:     xor    %eax,%eax         // clear A
         c:   xor    %ebx,%ebx                29:     xor    %r13,%r13         // clear X
         e:   mov    0x68(%rdi),%r9d          2c:     mov    0x68(%rdi),%r9d
        12:   sub    0x6c(%rdi),%r9d          30:     sub    0x6c(%rdi),%r9d
        16:   mov    0xd8(%rdi),%r8           34:     mov    0xd8(%rdi),%r10
                                              3b:     mov    %rdi,%rbx
        1d:   mov    $0xc,%esi                3e:     mov    $0xc,%esi
        22:   callq  0xffffffffe1021e15       43:     callq  0xffffffffe102bd75
        27:   cmp    $0x86dd,%eax             48:     cmp    $0x86dd,%rax
        2c:   jne    0x0000000000000069       4f:     jne    0x000000000000009a
        2e:   mov    $0x14,%esi               51:     mov    $0x14,%esi
        33:   callq  0xffffffffe1021e31       56:     callq  0xffffffffe102bd91
        38:   cmp    $0x84,%eax               5b:     cmp    $0x84,%rax
        3d:   je     0x0000000000000049       62:     je     0x0000000000000074
        3f:   cmp    $0x6,%eax                64:     cmp    $0x6,%rax
        42:   je     0x0000000000000049       68:     je     0x0000000000000074
        44:   cmp    $0x11,%eax               6a:     cmp    $0x11,%rax
        47:   jne    0x00000000000000c6       6e:     jne    0x0000000000000117
        49:   mov    $0x36,%esi               74:     mov    $0x36,%esi
        4e:   callq  0xffffffffe1021e15       79:     callq  0xffffffffe102bd75
        53:   cmp    $0x16,%eax               7e:     cmp    $0x16,%rax
        56:   je     0x00000000000000bf       82:     je     0x0000000000000110
        58:   mov    $0x38,%esi               88:     mov    $0x38,%esi
        5d:   callq  0xffffffffe1021e15       8d:     callq  0xffffffffe102bd75
        62:   cmp    $0x16,%eax               92:     cmp    $0x16,%rax
        65:   je     0x00000000000000bf       96:     je     0x0000000000000110
        67:   jmp    0x00000000000000c6       98:     jmp    0x0000000000000117
        69:   cmp    $0x800,%eax              9a:     cmp    $0x800,%rax
        6e:   jne    0x00000000000000c6       a1:     jne    0x0000000000000117
        70:   mov    $0x17,%esi               a3:     mov    $0x17,%esi
        75:   callq  0xffffffffe1021e31       a8:     callq  0xffffffffe102bd91
        7a:   cmp    $0x84,%eax               ad:     cmp    $0x84,%rax
        7f:   je     0x000000000000008b       b4:     je     0x00000000000000c2
        81:   cmp    $0x6,%eax                b6:     cmp    $0x6,%rax
        84:   je     0x000000000000008b       ba:     je     0x00000000000000c2
        86:   cmp    $0x11,%eax               bc:     cmp    $0x11,%rax
        89:   jne    0x00000000000000c6       c0:     jne    0x0000000000000117
        8b:   mov    $0x14,%esi               c2:     mov    $0x14,%esi
        90:   callq  0xffffffffe1021e15       c7:     callq  0xffffffffe102bd75
        95:   test   $0x1fff,%ax              cc:     test   $0x1fff,%rax
        99:   jne    0x00000000000000c6       d3:     jne    0x0000000000000117
                                              d5:     mov    %rax,%r14
        9b:   mov    $0xe,%esi                d8:     mov    $0xe,%esi
        a0:   callq  0xffffffffe1021e44       dd:     callq  0xffffffffe102bd91 // MSH
                                              e2:     and    $0xf,%eax
                                              e5:     shl    $0x2,%eax
                                              e8:     mov    %rax,%r13
                                              eb:     mov    %r14,%rax
                                              ee:     mov    %r13,%rsi
        a5:   lea    0xe(%rbx),%esi           f1:     add    $0xe,%esi
        a8:   callq  0xffffffffe1021e0d       f4:     callq  0xffffffffe102bd6d
        ad:   cmp    $0x16,%eax               f9:     cmp    $0x16,%rax
        b0:   je     0x00000000000000bf       fd:     je     0x0000000000000110
                                              ff:     mov    %r13,%rsi
        b2:   lea    0x10(%rbx),%esi         102:     add    $0x10,%esi
        b5:   callq  0xffffffffe1021e0d      105:     callq  0xffffffffe102bd6d
        ba:   cmp    $0x16,%eax              10a:     cmp    $0x16,%rax
        bd:   jne    0x00000000000000c6      10e:     jne    0x0000000000000117
        bf:   mov    $0xffff,%eax            110:     mov    $0xffff,%eax
        c4:   jmp    0x00000000000000c8      115:     jmp    0x000000000000011c
        c6:   xor    %eax,%eax               117:     mov    $0x0,%eax
        c8:   mov    -0x8(%rbp),%rbx         11c:     mov    -0x228(%rbp),%rbx // epilogue
        cc:   leaveq                         123:     mov    -0x220(%rbp),%r13
        cd:   retq                           12a:     mov    -0x218(%rbp),%r14
                                             131:     mov    -0x210(%rbp),%r15
                                             138:     leaveq
                                             139:     retq
      
      On fully cached SKBs both JITed functions take 12 nsec to execute.
      BPF interpreter executes the program in 30 nsec.
      
      The difference in generated assembler is due to the following:
      
      The old BPF JIT implements the LDX_MSH instruction via the sk_load_byte_msh()
      helper function inside bpf_jit.S.
      The new JIT removes the helper and does it explicitly, so the ldx_msh cost
      is the same for both JITs, but the generated code looks longer.
      
      The new JIT has 4 registers to save, so the prologue/epilogue are larger,
      but the cost is within the noise on x86_64.
      
      The old JIT checks whether the first insn clears A and, if not, emits
      'xor %eax,%eax'. The new JIT clears %rax unconditionally.
      
      2. The old BPF JIT doesn't support the ANC_NLATTR, ANC_PAY_OFFSET, ANC_RANDOM
         extensions. The new JIT supports all BPF extensions.
         Performance of such filters improves 2-4x depending on the filter;
         the longer the filter, the higher the gain.
         Synthetic benchmarks with many ancillary loads see a 20x speedup,
         which seems to be the maximum gain from the JIT.
      
      Notes:
      
      . net.core.bpf_jit_enable=2 + tools/net/bpf_jit_disasm is still functional
        and can be used to see generated assembler
      
      . there are two jit_compile() functions and code flow for classic filters is:
        sk_attach_filter() - load classic BPF
        bpf_jit_compile() - try to JIT from classic BPF
        sk_convert_filter() - convert classic to internal
        bpf_int_jit_compile() - JIT from internal BPF
      
        seccomp and tracing filters will just call bpf_int_jit_compile()
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      62258278
  20. 14 May 2014, 1 commit
  21. 12 May 2014, 1 commit
  22. 05 May 2014, 3 commits
  23. 24 Apr 2014, 1 commit
  24. 23 Apr 2014, 1 commit
  25. 15 Apr 2014, 1 commit
    • net: filter: seccomp: fix wrong decoding of BPF_S_ANC_SECCOMP_LD_W · 8c482cdc
      Daniel Borkmann committed
      While reviewing seccomp code, we found that BPF_S_ANC_SECCOMP_LD_W has
      been wrongly decoded by commit a8fc9277 ("sk-filter: Add ability to
      get socket filter program (v2)") into the opcode BPF_LD|BPF_B|BPF_ABS
      although it should have been decoded as BPF_LD|BPF_W|BPF_ABS.
      
      In practice, this should not have had much of a side effect though, as such
      a conversion is/was being done through prctl(2) PR_SET_SECCOMP. The reverse
      operation, PR_GET_SECCOMP, will only return the current seccomp mode, but
      not the filter itself. Since the transition to the new BPF infrastructure,
      it's also not used anymore, so we can simply remove this as it's
      unreachable.
      
      Fixes: a8fc9277 ("sk-filter: Add ability to get socket filter program (v2)")
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8c482cdc
  26. 14 Apr 2014, 1 commit
    • filter: prevent nla extensions to peek beyond the end of the message · 05ab8f26
      Mathias Krause committed
      The BPF_S_ANC_NLATTR and BPF_S_ANC_NLATTR_NEST extensions fail to check
      for a minimal message length before testing whether the supplied offset
      is within the bounds of the message. This allows the subtraction of the
      nla header size to underflow and therefore, as the data type is unsigned,
      permits far too big offset and length values for the search of the
      netlink attribute.
      
      The remainder calculation for the BPF_S_ANC_NLATTR_NEST extension is
      also wrong. It has the minuend and subtrahend mixed up and therefore
      calculates a huge length value, allowing the search to overrun the end
      of the message while looking for the netlink attribute.
      
      The following three BPF snippets will trigger the bugs when attached to
      a UNIX datagram socket and parsing a message with length 1, 2 or 3.
      
       ,-[ PoC for missing size check in BPF_S_ANC_NLATTR ]--
       | ld	#0x87654321
       | ldx	#42
       | ld	#nla
       | ret	a
       `---
      
       ,-[ PoC for the same bug in BPF_S_ANC_NLATTR_NEST ]--
       | ld	#0x87654321
       | ldx	#42
       | ld	#nlan
       | ret	a
       `---
      
       ,-[ PoC for wrong remainder calculation in BPF_S_ANC_NLATTR_NEST ]--
       | ; (needs a fake netlink header at offset 0)
       | ld	#0
       | ldx	#42
       | ld	#nlan
       | ret	a
       `---
      
      Fix the first issue by ensuring the message length fulfills the minimal
      size constraints of an nla header. Fix the second bug by getting the math
      for the remainder calculation right.
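      
      A sketch of the bounds handling described above for the BPF_S_ANC_NLATTR
      case (A and X follow the classic accumulator/index naming; the exact code
      is illustrative, not copied from the patch):
      
        struct nlattr *nla;
      
        if (skb_is_nonlinear(skb))
                return 0;
        if (skb->len < sizeof(struct nlattr))           /* new minimal length check */
                return 0;
        if (A > skb->len - sizeof(struct nlattr))       /* no unsigned underflow now */
                return 0;
      
        /* remainder is skb->len - A, i.e. minuend and subtrahend in the right order */
        nla = nla_find((struct nlattr *)&skb->data[A], skb->len - A, X);
        A = nla ? (void *)nla - (void *)skb->data : 0;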
      
      Fixes: 4738c1db ("[SKFILTER]: Add SKF_ADF_NLATTR instruction")
      Fixes: d214c753 ("filter: add SKF_AD_NLATTR_NEST to look for nested..")
      Cc: Patrick McHardy <kaber@trash.net>
      Cc: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: Mathias Krause <minipli@googlemail.com>
      Acked-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      05ab8f26
  27. 08 Apr 2014, 1 commit
    • net: filter: be more defensive on div/mod by X==0 · 5f9fde5f
      Daniel Borkmann committed
      The old interpreter behaviour was that we returned 0 whenever we
      found that a division by 0 would take place. The new interpreter
      currently just skips that instruction instead and continues
      execution.
      
      It's true that a return value of 0 might not be appropriate
      in all cases, but the current users (socket filters -> drop
      packet, seccomp -> SECCOMP_RET_KILL, cls_bpf -> unclassified,
      etc.) seem fine with that behaviour. Better this than undefined
      BPF program behaviour, since A is expected to contain the
      result of the division. In the future, as more use cases open
      up, we could further adapt this return value to our needs, if
      necessary.
      
      So reintroduce the return of 0 for division by 0, as in the old
      interpreter. Also, in the case of K, which is guaranteed to be
      32 bits wide, sk_chk_filter() already takes care of preventing
      division by 0 invoked through K, so we can generally spare these
      tests.
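      
      A sketch of the restored behaviour in the interpreter's division-by-X case
      (the case label and the A/X accumulator naming follow the description
      above; the exact code is illustrative):
      
        case BPF_ALU | BPF_DIV | BPF_X:
                if (X == 0)
                        return 0;       /* drop / kill / unclassified, see above */
                A /= X;
                CONT;                   /* assumed to dispatch the next instruction */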
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Reviewed-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5f9fde5f
  28. 02 Apr 2014, 1 commit
  29. 31 Mar 2014, 1 commit
    • net: filter: rework/optimize internal BPF interpreter's instruction set · bd4cf0ed
      Alexei Starovoitov committed
      This patch replaces/reworks the kernel-internal BPF interpreter with
      an optimized BPF instruction set format that is modelled more closely
      on native instruction sets and is designed to be JITed with a one-to-one
      mapping. Thus, the new interpreter is noticeably faster than the
      current implementation of sk_run_filter(), mainly for two reasons:
      
      1. Fall-through jumps:
      
        Classic BPF jump instructions are forced to go to either the 'true'
        or the 'false' branch, which causes a branch-miss penalty. The new
        BPF jump instructions have only one branch and fall through otherwise,
        which fits the CPU branch predictor logic better. `perf stat`
        shows a drastic difference in branch-misses between the old and
        the new code.
      
      2. Jump-threaded implementation of the interpreter vs. a switch
         statement:
      
        Instead of a single table jump at the top of the 'switch' statement,
        gcc now generates multiple table-jump instructions, which
        helps the CPU branch predictor logic.
      
      Note that the verification of filters is still being done through
      sk_chk_filter() in the classic BPF format, so filters from user or
      kernel space are verified in the same way as we do now, and the same
      restrictions/constraints hold as well.
      
      We reuse the current BPF JIT compilers in a way that this upgrade is
      fine even as is, but it nevertheless allows for a successive upgrade
      of the BPF JIT compilers to the new format.
      
      The internal instruction set migration is being done after the
      probing for JIT compilation, so in case JIT compilers are able to
      create a native opcode image, we're going to use that, and in all
      other cases we're doing a follow-up migration of the BPF program's
      instruction set, so that it can be transparently run in the new
      interpreter.
      
      In short, the *internal* format extends BPF in the following way (more
      details can be taken from the appended documentation):
      
        - Number of registers increases from 2 to 10
        - Register width increases from 32-bit to 64-bit
        - Conditional jt/jf targets replaced with jt/fall-through (see the
          sketch after this list)
        - Adds signed > and >= insns
        - 16 4-byte stack slots for register spill-fill replaced
          with up to 512 bytes of multi-use stack space
        - Introduction of bpf_call insn and register passing convention
          for zero overhead calls from/to other kernel functions
        - Adds arithmetic right shift and endianness conversion insns
        - Adds atomic_add insn
        - Old tax/txa insns are replaced with 'mov dst,src' insn
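      
      To make the jt/fall-through point concrete, here is a sketch comparing the
      two jump models; the classic line uses the existing BPF_JUMP() macro, while
      the internal-format part is pseudocode for the idea, not the exact encoding:
      
        /* classic BPF: both outcomes are explicit jump targets (jt/jf) */
        BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, ETH_P_IP, /* jt */ 1, /* jf */ 3),
      
        /* internal BPF: jump only if the condition holds, otherwise fall
         * through to the next instruction:
         *
         *   if (dst_reg == imm) goto pc + off;
         */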
      
      Performance of two BPF filters, generated by libpcap and bpf_asm
      respectively, was measured on x86_64, i386 and arm32 (other libpcap
      programs show similar performance differences):
      
      fprog #1 is taken from Documentation/networking/filter.txt:
      tcpdump -i eth0 port 22 -dd
      
      fprog #2 is taken from 'man tcpdump':
      tcpdump -i eth0 'tcp port 22 and (((ip[2:2] - ((ip[0]&0xf)<<2)) -
         ((tcp[12]&0xf0)>>2)) != 0)' -dd
      
      Raw performance data from BPF micro-benchmark: SK_RUN_FILTER on the
      same SKB (cache-hit) or 10k SKBs (cache-miss); time in ns per call,
      smaller is better:
      
      --x86_64--
               fprog #1  fprog #1   fprog #2  fprog #2
               cache-hit cache-miss cache-hit cache-miss
      old BPF      90       101        192       202
      new BPF      31        71         47        97
      old BPF jit  12        34         17        44
      new BPF jit TBD
      
      --i386--
               fprog #1  fprog #1   fprog #2  fprog #2
               cache-hit cache-miss cache-hit cache-miss
      old BPF     107       136        227       252
      new BPF      40       119         69       172
      
      --arm32--
               fprog #1  fprog #1   fprog #2  fprog #2
               cache-hit cache-miss cache-hit cache-miss
      old BPF     202       300        475       540
      new BPF     180       270        330       470
      old BPF jit  26       182         37       202
      new BPF jit TBD
      
      Thus, without changing any userland BPF filters, applications on
      top of AF_PACKET (or other families) such as libpcap/tcpdump, cls_bpf
      classifier, netfilter's xt_bpf, team driver's load-balancing mode,
      and many more will have better interpreter filtering performance.
      
      While we are replacing the internal BPF interpreter, we also need
      to convert seccomp BPF in the same step to make use of the new
      internal structure, since seccomp uses lower-level API details
      without being decoupled through higher-level calls like
      sk_unattached_filter_{create,destroy}(), for example.
      
      Just as for normal socket filtering, seccomp BPF also experiences
      a time-to-verdict speedup:
      
      05-sim-long_jumps.c of libseccomp was used as micro-benchmark:
      
        seccomp_rule_add_exact(ctx,...
        seccomp_rule_add_exact(ctx,...
      
        rc = seccomp_load(ctx);
      
        for (i = 0; i < 10000000; i++)
           syscall(199, 100);
      
      'short filter' has 2 rules
      'large filter' has 200 rules
      
      'short filter' performance is slightly better on x86_64/i386/arm32
      'large filter' is much faster on x86_64 and i386 and shows no
                     difference on arm32
      
      --x86_64-- short filter
      old BPF: 2.7 sec
       39.12%  bench  libc-2.15.so       [.] syscall
        8.10%  bench  [kernel.kallsyms]  [k] sk_run_filter
        6.31%  bench  [kernel.kallsyms]  [k] system_call
        5.59%  bench  [kernel.kallsyms]  [k] trace_hardirqs_on_caller
        4.37%  bench  [kernel.kallsyms]  [k] trace_hardirqs_off_caller
        3.70%  bench  [kernel.kallsyms]  [k] __secure_computing
        3.67%  bench  [kernel.kallsyms]  [k] lock_is_held
        3.03%  bench  [kernel.kallsyms]  [k] seccomp_bpf_load
      new BPF: 2.58 sec
       42.05%  bench  libc-2.15.so       [.] syscall
        6.91%  bench  [kernel.kallsyms]  [k] system_call
        6.25%  bench  [kernel.kallsyms]  [k] trace_hardirqs_on_caller
        6.07%  bench  [kernel.kallsyms]  [k] __secure_computing
        5.08%  bench  [kernel.kallsyms]  [k] sk_run_filter_int_seccomp
      
      --arm32-- short filter
      old BPF: 4.0 sec
       39.92%  bench  [kernel.kallsyms]  [k] vector_swi
       16.60%  bench  [kernel.kallsyms]  [k] sk_run_filter
       14.66%  bench  libc-2.17.so       [.] syscall
        5.42%  bench  [kernel.kallsyms]  [k] seccomp_bpf_load
        5.10%  bench  [kernel.kallsyms]  [k] __secure_computing
      new BPF: 3.7 sec
       35.93%  bench  [kernel.kallsyms]  [k] vector_swi
       21.89%  bench  libc-2.17.so       [.] syscall
       13.45%  bench  [kernel.kallsyms]  [k] sk_run_filter_int_seccomp
        6.25%  bench  [kernel.kallsyms]  [k] __secure_computing
        3.96%  bench  [kernel.kallsyms]  [k] syscall_trace_exit
      
      --x86_64-- large filter
      old BPF: 8.6 seconds
          73.38%    bench  [kernel.kallsyms]  [k] sk_run_filter
          10.70%    bench  libc-2.15.so       [.] syscall
           5.09%    bench  [kernel.kallsyms]  [k] seccomp_bpf_load
           1.97%    bench  [kernel.kallsyms]  [k] system_call
      new BPF: 5.7 seconds
          66.20%    bench  [kernel.kallsyms]  [k] sk_run_filter_int_seccomp
          16.75%    bench  libc-2.15.so       [.] syscall
           3.31%    bench  [kernel.kallsyms]  [k] system_call
           2.88%    bench  [kernel.kallsyms]  [k] __secure_computing
      
      --i386-- large filter
      old BPF: 5.4 sec
      new BPF: 3.8 sec
      
      --arm32-- large filter
      old BPF: 13.5 sec
       73.88%  bench  [kernel.kallsyms]  [k] sk_run_filter
       10.29%  bench  [kernel.kallsyms]  [k] vector_swi
        6.46%  bench  libc-2.17.so       [.] syscall
        2.94%  bench  [kernel.kallsyms]  [k] seccomp_bpf_load
        1.19%  bench  [kernel.kallsyms]  [k] __secure_computing
        0.87%  bench  [kernel.kallsyms]  [k] sys_getuid
      new BPF: 13.5 sec
       76.08%  bench  [kernel.kallsyms]  [k] sk_run_filter_int_seccomp
       10.98%  bench  [kernel.kallsyms]  [k] vector_swi
        5.87%  bench  libc-2.17.so       [.] syscall
        1.77%  bench  [kernel.kallsyms]  [k] __secure_computing
        0.93%  bench  [kernel.kallsyms]  [k] sys_getuid
      
      BPF filters generated by seccomp are very branchy, so the new
      internal BPF performs better than the old one. Performance
      gains will be even higher once a BPF JIT for the new structure
      is merged, which is planned as future work (as successive
      JIT migrations).
      
      BPF has also been stress-tested with trinity's BPF fuzzer.
      
      Joint work with Daniel Borkmann.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Cc: Hagen Paul Pfeifer <hagen@jauu.net>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Paul Moore <pmoore@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      Cc: linux-kernel@vger.kernel.org
      Acked-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bd4cf0ed