1. 02 Mar 2015 (2 commits)
  2. 06 Dec 2014 (1 commit)
    • net: sock: allow eBPF programs to be attached to sockets · 89aa0758
      Committed by Alexei Starovoitov
      Introduce a new setsockopt() command:
      
      setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd, sizeof(prog_fd))
      
      where prog_fd was received from syscall bpf(BPF_PROG_LOAD, attr, ...)
      and attr->prog_type == BPF_PROG_TYPE_SOCKET_FILTER
      
      setsockopt() calls bpf_prog_get(), which increments the refcnt of the program
      so that it doesn't get unloaded while a socket is still using it.
      
      The same eBPF program can be attached to multiple sockets.
      
      When the user task exits, its sockets are closed automatically; this calls
      sk_filter_uncharge(), which decrements the refcnt of the eBPF program.
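      
      For illustration (an editor's sketch, not part of this patch), the attach
      step in user space is just the setsockopt() call shown above:
      
      #include <sys/socket.h>
      
      /* Attach an eBPF socket filter; prog_fd must come from
       * bpf(BPF_PROG_LOAD, ...) with prog_type == BPF_PROG_TYPE_SOCKET_FILTER.
       * SO_ATTACH_BPF is provided by the kernel's socket headers. */
      static int attach_bpf(int sock_fd, int prog_fd)
      {
              return setsockopt(sock_fd, SOL_SOCKET, SO_ATTACH_BPF,
                                &prog_fd, sizeof(prog_fd));
      }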
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 27 Sep 2014 (2 commits)
    • bpf: handle pseudo BPF_LD_IMM64 insn · 0246e64d
      Committed by Alexei Starovoitov
      eBPF programs passed from user space use pseudo BPF_LD_IMM64 instructions to
      refer to process-local map fds. Scan the program for such instructions and,
      if the fds are valid, convert them to 'struct bpf_map' pointers, which the
      verifier then uses to check map access in bpf_map_lookup/update() calls.
      If the program passes the verifier, convert each pseudo BPF_LD_IMM64 into
      the generic form by dropping the BPF_PSEUDO_MAP_FD flag.
      
      Note that the eBPF interpreter is generic and knows nothing about pseudo insns.
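      
      For illustration only (an editor's sketch, not from this patch), user space
      might emit such a pseudo instruction pair as follows, with map_fd being the
      process-local file descriptor of a BPF map:
      
      #include <string.h>
      #include <linux/bpf.h>
      
      /* Fill two consecutive bpf_insn slots with a pseudo BPF_LD_IMM64 that
       * refers to a map fd; the verifier rewrites it as described above. */
      static void emit_ld_map_fd(struct bpf_insn *insn, int dst_reg, int map_fd)
      {
              memset(insn, 0, 2 * sizeof(*insn)); /* unused fields must be zero */
              insn[0].code    = BPF_LD | BPF_DW | BPF_IMM;
              insn[0].dst_reg = dst_reg;
              insn[0].src_reg = BPF_PSEUDO_MAP_FD; /* imm holds a map fd         */
              insn[0].imm     = map_fd;            /* lower 32 bits: the fd      */
              /* insn[1] stays zero: the upper 32 bits are unused for an fd */
      }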
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • bpf: expand BPF syscall with program load/unload · 09756af4
      Committed by Alexei Starovoitov
      eBPF programs are similar to kernel modules: they are loaded by a user
      process and automatically unloaded when the process exits. Each eBPF program
      is a safe, run-to-completion set of instructions. The eBPF verifier statically
      determines that the program terminates and is safe to execute.
      
      The following syscall wrapper can be used to load the program:
      int bpf_prog_load(enum bpf_prog_type prog_type,
                        const struct bpf_insn *insns, int insn_cnt,
                        const char *license)
      {
          union bpf_attr attr = {
              .prog_type = prog_type,
              .insns = ptr_to_u64(insns),
              .insn_cnt = insn_cnt,
              .license = ptr_to_u64(license),
          };
      
          return bpf(BPF_PROG_LOAD, &attr, sizeof(attr));
      }
      where 'insns' is an array of eBPF instructions and 'license' is a string that
      must be GPL-compatible in order to call helper functions marked gpl_only.
      
      Upon successful load, the syscall returns prog_fd.
      Use close(prog_fd) to unload the program.
      
      User-space tests and examples follow in later patches.
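      
      In the meantime, a minimal sketch of using the wrapper above (editor's
      illustration with <linux/bpf.h> and <stdio.h> included, not one of those
      patches) could look like:
      
      /* r0 = 0; exit  -- a socket filter returning 0 truncates the packet to
       * zero bytes, i.e. drops it */
      struct bpf_insn insns[] = {
              { .code = BPF_ALU64 | BPF_MOV | BPF_K, .dst_reg = BPF_REG_0, .imm = 0 },
              { .code = BPF_JMP | BPF_EXIT },
      };
      int prog_fd = bpf_prog_load(BPF_PROG_TYPE_SOCKET_FILTER, insns,
                                  sizeof(insns) / sizeof(insns[0]), "GPL");
      
      if (prog_fd < 0)
              perror("BPF_PROG_LOAD");        /* errno explains the rejection */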
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 11 Sep 2014 (1 commit)
  5. 10 Sep 2014 (4 commits)
    • net: bpf: be friendly to kmemcheck · 286aad3c
      Committed by Daniel Borkmann
      As reported by Mikulas Patocka, kmemcheck currently barks out a false
      positive, since we don't have a special kmemcheck annotation for the
      bitfields used in the bpf_prog structure.
      
      We currently have jited:1, len:31, and thus when len is accessed with
      CONFIG_KMEMCHECK enabled, kmemcheck throws a warning that we're reading
      uninitialized memory.
      
      As we don't need the whole bit range for the pages member, we can just shrink
      it to a u16 and use a bool flag for jited instead of a bitfield.
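      
      Roughly, the change looks like this (an illustrative sketch of the members
      named above, not the full bpf_prog definition):
      
      /* before: the bitfield shares a word that kmemcheck cannot annotate, so
       * reading len looks like reading partly uninitialized memory */
              u32     pages;          /* Number of allocated pages */
              u32     jited:1,        /* Is our filter JIT'ed? */
                      len:31;         /* Number of filter blocks */
      
      /* after: no bitfield, nothing for kmemcheck to trip over */
              u16     pages;          /* Number of allocated pages */
              bool    jited;          /* Is our filter JIT'ed? */
              u32     len;            /* Number of filter blocks */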
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: bpf: consolidate JIT binary allocator · 738cbe72
      Committed by Daniel Borkmann
      Commit 314beb9b ("x86: bpf_jit_comp: secure bpf jit against spraying
      attacks"), later replicated for the s390 architecture in aa2d2c73
      ("s390/bpf,jit: address randomize and write protect jit code"), added write
      protection for BPF JIT images as well as a random start address for the JIT
      code, so that it no longer sits on a page boundary.
      
      Since both use a very similar allocator for the BPF binary header, we can
      consolidate this code into the BPF core, as it's mostly JIT independent
      anyway.
      
      This will also allow future archs that support DEBUG_SET_MODULE_RONX to
      simply reuse it instead of reimplementing it.
      
      JIT tested on x86_64 and s390x with BPF test suite.
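      
      From an arch JIT's point of view the consolidated code is used roughly as
      follows (an editor's sketch; the allocator is shown under the name
      bpf_jit_binary_alloc(), as it appears in the tree, and the fill-hole
      callback is illustrative):
      
      static void jit_fill_hole(void *area, unsigned int size)
      {
              memset(area, 0xcc, size);  /* e.g. pad unused space with INT3 on x86 */
      }
      
      static u8 *jit_alloc_image(unsigned int proglen)
      {
              struct bpf_binary_header *header;
              u8 *image;
      
              header = bpf_jit_binary_alloc(proglen, &image, 4, jit_fill_hole);
              if (!header)
                      return NULL;
              /* the arch back end emits its native code into image[] here */
              set_memory_ro((unsigned long)header, header->pages);
              return image;
      }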
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: filter: split filter.h and expose eBPF to user space · daedfb22
      Committed by Alexei Starovoitov
      Allow user space to generate eBPF programs.
      
      uapi/linux/bpf.h: eBPF instruction set definition
      
      linux/filter.h: the rest
      
      This patch only moves macro definitions, but in practice it freezes the
      existing eBPF instruction set, though new instructions can still be added in
      the future.
      
      These eBPF definitions cannot go into uapi/linux/filter.h, since the names
      may conflict with existing applications.
      
      Full eBPF ISA description is in Documentation/networking/filter.txt
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Acked-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: filter: add "load 64-bit immediate" eBPF instruction · 02ab695b
      Committed by Alexei Starovoitov
      Add a BPF_LD_IMM64 instruction to load a 64-bit immediate value into a
      register. All previous instructions were 8 bytes; this is the first 16-byte
      instruction. Two consecutive 'struct bpf_insn' blocks are interpreted as a
      single instruction:
      insn[0].code = BPF_LD | BPF_DW | BPF_IMM
      insn[0].dst_reg = destination register
      insn[0].imm = lower 32-bit
      insn[1].code = 0
      insn[1].imm = upper 32-bit
      All unused fields must be zero.
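      
      As a C-level illustration (editor's sketch, using the uapi definitions from
      <linux/bpf.h>), loading the 64-bit constant imm64 into R1 is encoded as:
      
      const __u64 imm64 = 0x123456789abcdef0ULL;          /* example constant */
      struct bpf_insn ld_imm64[2] = {
              { .code = BPF_LD | BPF_DW | BPF_IMM, .dst_reg = BPF_REG_1,
                .imm  = (__u32)imm64 },                   /* lower 32 bits    */
              { .code = 0, .imm = (__u32)(imm64 >> 32) }, /* upper 32 bits    */
      };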
      
      Classic BPF has a similar instruction, BPF_LD | BPF_W | BPF_IMM, which loads
      a 32-bit immediate value into a register.
      
      x64 JITs it as a single 'movabsq %rax, imm64'; arm64 may JIT it as a sequence
      of four 'movk x0, #imm16, lsl #shift' insns.
      
      Note that old eBPF programs are binary compatible with new interpreter.
      
      It lets eBPF programs load a 64-bit constant into a register with one
      instruction instead of using two registers and four instructions:
      BPF_MOV32_IMM(R1, imm32)
      BPF_ALU64_IMM(BPF_LSH, R1, 32)
      BPF_MOV32_IMM(R2, imm32)
      BPF_ALU64_REG(BPF_OR, R1, R2)
      
      User-space-generated programs will use this instruction only to load constants.
      
      To tell the kernel that user space needs a pointer, a _pseudo_ variant of
      this instruction may be added later; it will use extra bits of the encoding
      to indicate what type of pointer user space is asking the kernel to provide.
      For example, the 'off' or 'src_reg' fields can be used for this purpose:
      src_reg = 1 could mean that user space is asking the kernel to validate and
      load an in-kernel map pointer;
      src_reg = 2 could mean that user space needs a pointer to a read-only data
      section;
      src_reg = 3 could mean that user space needs a pointer to per-cpu local data.
      None of these future pseudo instructions will carry the actual pointer as
      part of the instruction; rather, they will be treated as a request to the
      kernel to provide one. The kernel will verify the request for a pointer,
      drop the _pseudo_ marking and store the actual internal pointer inside the
      instruction, so the end result is that the interpreter and JITs never see
      pseudo BPF_LD_IMM64 insns and only operate on the generic BPF_LD_IMM64 that
      loads a 64-bit immediate into a register. User space never operates on
      direct pointers, and the verifier can easily distinguish a request for a
      pointer from other instructions.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 06 Sep 2014 (1 commit)
    • net: bpf: make eBPF interpreter images read-only · 60a3b225
      Committed by Daniel Borkmann
      With eBPF being extended further and its exposure to user space on the way,
      hardening the memory range the interpreter uses to steer its command flow
      seems appropriate. This patch moves the to-be-interpreted bytecode to
      read-only pages.
      
      In case we execute a corrupted BPF interpreter image for some reason, e.g.
      one planted by an attacker who got past the verifier stage, it would provide
      not only arbitrary read/write memory access but arbitrary function calls as
      well. After the BPF interpreter image has been set up, its contents do not
      change until destruction time, thus we can set the image up on pages made
      immutable in order to mitigate modifications to that code. The idea is
      derived from commit 314beb9b ("x86: bpf_jit_comp: secure bpf jit against
      spraying attacks").
      
      This is possible because bpf_prog is not part of sk_filter anymore. After
      setup, bpf_prog cannot be altered during its lifetime. This prevents any
      modifications to the entire bpf_prog structure (including the function/JIT
      image pointer).
      
      Every eBPF program (including migrated classic BPF) has to call
      bpf_prog_select_runtime() to select either the interpreter or a JIT image as
      a last setup step, and they are all freed via bpf_prog_free(), including
      non-JIT. Therefore, we can easily integrate this into the eBPF lifetime, and
      since we directly allocate a bpf_prog, we have no performance penalty.
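      
      For orientation, the resulting life-cycle reads roughly like this (editor's
      sketch in the spirit of the create/run/destroy example from commit 5fe821a9;
      the allocation helper name is an assumption, select_runtime/free are named
      above):
      
        struct bpf_prog *fp;
      
        fp = bpf_prog_alloc(bpf_prog_size(prog_len), 0); /* pages that can be set RO */
        memcpy(fp->insnsi, prog, prog_len * sizeof(fp->insnsi[0]));
        fp->len = prog_len;
      
        bpf_prog_select_runtime(fp);  /* pick JIT or interpreter, make image RO */
        BPF_PROG_RUN(fp, ctx);        /* run it                                 */
        bpf_prog_free(fp);            /* make writable again, then free         */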
      
      Tested with seccomp and test_bpf testsuite in JIT/non-JIT mode and manual
      inspection of kernel_page_tables.  Brad Spengler proposed the same idea
      via Twitter during development of this patch.
      
      Joint work with Hannes Frederic Sowa.
      Suggested-by: Brad Spengler <spender@grsecurity.net>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Cc: Alexei Starovoitov <ast@plumgrid.com>
      Cc: Kees Cook <keescook@chromium.org>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  7. 03 Aug 2014 (5 commits)
  8. 25 Jul 2014 (1 commit)
  9. 14 Jul 2014 (1 commit)
  10. 09 Jul 2014 (1 commit)
  11. 11 Jun 2014 (1 commit)
    • net: filter: cleanup A/X name usage · e430f34e
      Committed by Alexei Starovoitov
      The macro 'A' used in internal BPF interpreter:
       #define A regs[insn->a_reg]
      was easily confused with the name of classic BPF register 'A', since
      'A' would mean two different things depending on context.
      
      This patch is trying to clean up the naming and clarify its usage in the
      following way:
      
      - A and X are names of two classic BPF registers
      
      - BPF_REG_A denotes internal BPF register R0 used to map classic register A
        in internal BPF programs generated from classic
      
      - BPF_REG_X denotes internal BPF register R7 used to map classic register X
        in internal BPF programs generated from classic
      
      - internal BPF instruction format:
      struct sock_filter_int {
              __u8    code;           /* opcode */
              __u8    dst_reg:4;      /* dest register */
              __u8    src_reg:4;      /* source register */
              __s16   off;            /* signed offset */
              __s32   imm;            /* signed immediate constant */
      };
      
      - BPF_X/BPF_K is 1 bit used to encode source operand of instruction
      In classic:
        BPF_X - means use register X as source operand
        BPF_K - means use 32-bit immediate as source operand
      In internal:
        BPF_X - means use 'src_reg' register as source operand
        BPF_K - means use 32-bit immediate as source operand
      Suggested-by: Chema Gonzalez <chema@google.com>
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Acked-by: Daniel Borkmann <dborkman@redhat.com>
      Acked-by: Chema Gonzalez <chema@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  12. 02 Jun 2014 (2 commits)
    • net: filter: improve filter block macros · f8f6d679
      Committed by Daniel Borkmann
      Commit 9739eef1 ("net: filter: make BPF conversion more readable") started
      to introduce helper macros similar to the BPF_STMT()/BPF_JUMP() macros from
      classic BPF.
      
      However, quite a few statements in the filter conversion functions remained
      in the old style, which left a mixture of block macros and non-block macros
      in the code. This patch makes the block macros themselves more readable by
      using explicit member initialization, and converts the remaining statements
      where possible, so the code stays in a more consistent state.
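      
      One of the block macros in the new, explicitly initialized style looks
      roughly like this (illustrative; the full set lives in include/linux/filter.h):
      
      #define BPF_ALU64_REG(OP, DST, SRC)                             \
              ((struct sock_filter_int) {                             \
                      .code    = BPF_ALU64 | BPF_OP(OP) | BPF_X,      \
                      .dst_reg = DST,                                 \
                      .src_reg = SRC,                                 \
                      .off     = 0,                                   \
                      .imm     = 0 })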
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: filter: get rid of BPF_S_* enum · 34805931
      Committed by Daniel Borkmann
      This patch finally allows us to get rid of the BPF_S_* enum. Currently, the
      code performs unnecessary encode and decode workarounds in seccomp and in
      the filter migration itself when a filter is being attached, in order to
      cope with the BPF_S_* encoding, which is no longer used by the new
      interpreter or the JIT compilers.
      
      Keeping it around would mean that in the future we would also need to extend
      and maintain this enum and the related encoders/decoders. We can get rid of
      all that and save these operations during filter attaching. Naturally, the
      JIT compilers need to be updated for this as well.
      
      Before the JIT conversion is done, each compiler checks whether A is loaded
      at startup in order to know whether it needs to emit instructions to clear A
      first. Since BPF extensions are a subset of the BPF_LD | BPF_{W,H,B} |
      BPF_ABS variants, case statements for extensions can be removed at that
      point. To ease and minimize code changes in the classic JITs, we have
      introduced bpf_anc_helper().
      
      Tested with test_bpf on x86_64 (JIT, int), s390x (JIT, int),
      arm (JIT, int), i386 (int), ppc64 (JIT, int); for sparc we
      unfortunately didn't have access, but the changes are analogous to
      the rest.
      
      Joint work with Alexei Starovoitov.
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Mircea Gherzan <mgherzan@gmail.com>
      Cc: Kees Cook <keescook@chromium.org>
      Acked-by: Chema Gonzalez <chemag@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  13. 24 May 2014 (2 commits)
  14. 22 May 2014 (1 commit)
    • net: filter: cleanup invocation of internal BPF · 5fe821a9
      Committed by Alexei Starovoitov
      Kernel API for classic BPF socket filters is:
      
      sk_unattached_filter_create() - validate classic BPF, convert, JIT
      SK_RUN_FILTER() - run it
      sk_unattached_filter_destroy() - destroy socket filter
      
      Clean up the internal BPF kernel API as follows:
      
      sk_filter_select_runtime() - final step of internal BPF creation.
        Try to JIT internal BPF program, if JIT is not available select interpreter
      SK_RUN_FILTER() - run it
      sk_filter_free() - free internal BPF program
      
      Disallow direct calls to the BPF interpreter. Execution of the BPF program
      should be done with the SK_RUN_FILTER() macro.
      
      Example of internal BPF create, run, destroy:
      
        struct sk_filter *fp;
      
        fp = kzalloc(sk_filter_size(prog_len), GFP_KERNEL);
        memcpy(fp->insnsi, prog, prog_len * sizeof(fp->insnsi[0]));
        fp->len = prog_len;
      
        sk_filter_select_runtime(fp);
      
        SK_RUN_FILTER(fp, ctx);
      
        sk_filter_free(fp);
      
      Sockets, seccomp, the test suite and tracing use different ways to populate
      sk_filter, so the first steps of program creation are not common.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Acked-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  15. 16 May 2014 (1 commit)
    • net: filter: x86: internal BPF JIT · 62258278
      Committed by Alexei Starovoitov
      Map all internal BPF instructions to x86_64 instructions. This patch
      replaces the original BPF x64 JIT with an internal-BPF x64 JIT. The sysctl
      net.core.bpf_jit_enable is reused as the on/off switch.
      
      Performance:
      
      1. old BPF JIT and internal BPF JIT generate equivalent x86_64 code.
        No performance difference is observed for filters that were JIT-able before
      
      Example assembler code for BPF filter "tcpdump port 22"
      
      original BPF -> old JIT:            original BPF -> internal BPF -> new JIT:
         0:   push   %rbp                      0:     push   %rbp
         1:   mov    %rsp,%rbp                 1:     mov    %rsp,%rbp
         4:   sub    $0x60,%rsp                4:     sub    $0x228,%rsp
         8:   mov    %rbx,-0x8(%rbp)           b:     mov    %rbx,-0x228(%rbp) // prologue
                                              12:     mov    %r13,-0x220(%rbp)
                                              19:     mov    %r14,-0x218(%rbp)
                                              20:     mov    %r15,-0x210(%rbp)
                                              27:     xor    %eax,%eax         // clear A
         c:   xor    %ebx,%ebx                29:     xor    %r13,%r13         // clear X
         e:   mov    0x68(%rdi),%r9d          2c:     mov    0x68(%rdi),%r9d
        12:   sub    0x6c(%rdi),%r9d          30:     sub    0x6c(%rdi),%r9d
        16:   mov    0xd8(%rdi),%r8           34:     mov    0xd8(%rdi),%r10
                                              3b:     mov    %rdi,%rbx
        1d:   mov    $0xc,%esi                3e:     mov    $0xc,%esi
        22:   callq  0xffffffffe1021e15       43:     callq  0xffffffffe102bd75
        27:   cmp    $0x86dd,%eax             48:     cmp    $0x86dd,%rax
        2c:   jne    0x0000000000000069       4f:     jne    0x000000000000009a
        2e:   mov    $0x14,%esi               51:     mov    $0x14,%esi
        33:   callq  0xffffffffe1021e31       56:     callq  0xffffffffe102bd91
        38:   cmp    $0x84,%eax               5b:     cmp    $0x84,%rax
        3d:   je     0x0000000000000049       62:     je     0x0000000000000074
        3f:   cmp    $0x6,%eax                64:     cmp    $0x6,%rax
        42:   je     0x0000000000000049       68:     je     0x0000000000000074
        44:   cmp    $0x11,%eax               6a:     cmp    $0x11,%rax
        47:   jne    0x00000000000000c6       6e:     jne    0x0000000000000117
        49:   mov    $0x36,%esi               74:     mov    $0x36,%esi
        4e:   callq  0xffffffffe1021e15       79:     callq  0xffffffffe102bd75
        53:   cmp    $0x16,%eax               7e:     cmp    $0x16,%rax
        56:   je     0x00000000000000bf       82:     je     0x0000000000000110
        58:   mov    $0x38,%esi               88:     mov    $0x38,%esi
        5d:   callq  0xffffffffe1021e15       8d:     callq  0xffffffffe102bd75
        62:   cmp    $0x16,%eax               92:     cmp    $0x16,%rax
        65:   je     0x00000000000000bf       96:     je     0x0000000000000110
        67:   jmp    0x00000000000000c6       98:     jmp    0x0000000000000117
        69:   cmp    $0x800,%eax              9a:     cmp    $0x800,%rax
        6e:   jne    0x00000000000000c6       a1:     jne    0x0000000000000117
        70:   mov    $0x17,%esi               a3:     mov    $0x17,%esi
        75:   callq  0xffffffffe1021e31       a8:     callq  0xffffffffe102bd91
        7a:   cmp    $0x84,%eax               ad:     cmp    $0x84,%rax
        7f:   je     0x000000000000008b       b4:     je     0x00000000000000c2
        81:   cmp    $0x6,%eax                b6:     cmp    $0x6,%rax
        84:   je     0x000000000000008b       ba:     je     0x00000000000000c2
        86:   cmp    $0x11,%eax               bc:     cmp    $0x11,%rax
        89:   jne    0x00000000000000c6       c0:     jne    0x0000000000000117
        8b:   mov    $0x14,%esi               c2:     mov    $0x14,%esi
        90:   callq  0xffffffffe1021e15       c7:     callq  0xffffffffe102bd75
        95:   test   $0x1fff,%ax              cc:     test   $0x1fff,%rax
        99:   jne    0x00000000000000c6       d3:     jne    0x0000000000000117
                                              d5:     mov    %rax,%r14
        9b:   mov    $0xe,%esi                d8:     mov    $0xe,%esi
        a0:   callq  0xffffffffe1021e44       dd:     callq  0xffffffffe102bd91 // MSH
                                              e2:     and    $0xf,%eax
                                              e5:     shl    $0x2,%eax
                                              e8:     mov    %rax,%r13
                                              eb:     mov    %r14,%rax
                                              ee:     mov    %r13,%rsi
        a5:   lea    0xe(%rbx),%esi           f1:     add    $0xe,%esi
        a8:   callq  0xffffffffe1021e0d       f4:     callq  0xffffffffe102bd6d
        ad:   cmp    $0x16,%eax               f9:     cmp    $0x16,%rax
        b0:   je     0x00000000000000bf       fd:     je     0x0000000000000110
                                              ff:     mov    %r13,%rsi
        b2:   lea    0x10(%rbx),%esi         102:     add    $0x10,%esi
        b5:   callq  0xffffffffe1021e0d      105:     callq  0xffffffffe102bd6d
        ba:   cmp    $0x16,%eax              10a:     cmp    $0x16,%rax
        bd:   jne    0x00000000000000c6      10e:     jne    0x0000000000000117
        bf:   mov    $0xffff,%eax            110:     mov    $0xffff,%eax
        c4:   jmp    0x00000000000000c8      115:     jmp    0x000000000000011c
        c6:   xor    %eax,%eax               117:     mov    $0x0,%eax
        c8:   mov    -0x8(%rbp),%rbx         11c:     mov    -0x228(%rbp),%rbx // epilogue
        cc:   leaveq                         123:     mov    -0x220(%rbp),%r13
        cd:   retq                           12a:     mov    -0x218(%rbp),%r14
                                             131:     mov    -0x210(%rbp),%r15
                                             138:     leaveq
                                             139:     retq
      
      On fully cached SKBs both JITed functions take 12 nsec to execute.
      BPF interpreter executes the program in 30 nsec.
      
      The difference in generated assembler is due to the following:
      
      The old BPF implements the LDX_MSH instruction via the sk_load_byte_msh()
      helper function inside bpf_jit.S. The new JIT removes the helper and does it
      explicitly, so the ldx_msh cost is the same for both JITs, but the generated
      code looks longer.
      
      New JIT has 4 registers to save, so prologue/epilogue are larger,
      but the cost is within noise on x64.
      
      The old JIT checks whether the first insn clears A and, if not, emits
      'xor %eax,%eax'. The new JIT clears %rax unconditionally.
      
      2. The old BPF JIT doesn't support the ANC_NLATTR, ANC_PAY_OFFSET and
        ANC_RANDOM extensions; the new JIT supports all BPF extensions.
        Performance of such filters improves 2-4 times depending on the filter:
        the longer the filter, the higher the performance gain. Synthetic
        benchmarks with many ancillary loads see a 20x speedup, which seems to
        be the maximum gain from the JIT.
      
      Notes:
      
      . net.core.bpf_jit_enable=2 + tools/net/bpf_jit_disasm is still functional
        and can be used to see generated assembler
      
      . there are two jit_compile() functions and code flow for classic filters is:
        sk_attach_filter() - load classic BPF
        bpf_jit_compile() - try to JIT from classic BPF
        sk_convert_filter() - convert classic to internal
        bpf_int_jit_compile() - JIT from internal BPF
      
        seccomp and tracing filters will just call bpf_int_jit_compile()
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  16. 12 May 2014 (1 commit)
  17. 05 May 2014 (2 commits)
    • net: filter: make register naming more comprehensible · 30743837
      Committed by Daniel Borkmann
      The current code is a bit hard to parse with respect to which registers can
      be used, how they are mapped and how it all plays together. It makes much
      more sense to define this a bit more clearly so that the code is more
      intuitive. This patch cleans this up and makes the naming more consistent
      across the code. It also allows moving some of the defines into the header
      file. The clearing of the A and X registers in __sk_run_filter() does not
      get particular register names assigned, as they have no 'official' function
      but merely result from the concrete initial mapping of old BPF programs.
      Since we already use lower-case names for BPF helper functions for BPF_CALL,
      be consistent here as well. No functional changes.
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: filter: simplify label names from jump-table · 5bcfedf0
      Committed by Daniel Borkmann
      This patch simplifies label naming for the BPF jump table. When we define
      labels via DL(), we just concatenate/textify the combination of the
      instruction opcode, which consists of the class, subclass, word size, target
      register and so on. Each time the BPF_ prefix is left intact, so that e.g.
      the preprocessor generates the label BPF_ALU_BPF_ADD_BPF_X for DL(BPF_ALU,
      BPF_ADD, BPF_X), whereas a label name of ALU_ADD_X is much easier to grasp.
      Pure cleanup only.
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Acked-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  18. 23 Apr 2014 (1 commit)
  19. 15 Apr 2014 (1 commit)
    • net: filter: seccomp: fix wrong decoding of BPF_S_ANC_SECCOMP_LD_W · 8c482cdc
      Committed by Daniel Borkmann
      While reviewing seccomp code, we found that BPF_S_ANC_SECCOMP_LD_W has
      been wrongly decoded by commit a8fc9277 ("sk-filter: Add ability to
      get socket filter program (v2)") into the opcode BPF_LD|BPF_B|BPF_ABS
      although it should have been decoded as BPF_LD|BPF_W|BPF_ABS.
      
      In practice, this should not have had much of a side effect though, as such
      conversion is/was being done through prctl(2) PR_SET_SECCOMP. The reverse
      operation, PR_GET_SECCOMP, only returns the current seccomp mode, not the
      filter itself. Since the transition to the new BPF infrastructure, it's also
      not used anymore, so we can simply remove it as it's unreachable.
      
      Fixes: a8fc9277 ("sk-filter: Add ability to get socket filter program (v2)")
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  20. 31 Mar 2014 (4 commits)
    • net: filter: rework/optimize internal BPF interpreter's instruction set · bd4cf0ed
      Committed by Alexei Starovoitov
      This patch replaces/reworks the kernel-internal BPF interpreter with an
      optimized BPF instruction-set format that is modelled more closely after
      native instruction sets and is designed to be JITed with a one-to-one
      mapping. Thus, the new interpreter is noticeably faster than the current
      implementation of sk_run_filter(), mainly for two reasons:
      
      1. Fall-through jumps:
      
        BPF jump instructions are forced to go to either the 'true' or the 'false'
        branch, which causes a branch-miss penalty. The new BPF jump instructions
        have only one branch and fall through otherwise, which fits the CPU
        branch-predictor logic better. `perf stat` shows a drastic difference in
        branch misses between the old and the new code.
      
      2. Jump-threaded implementation of interpreter vs switch
         statement:
      
        Instead of single table-jump at the top of 'switch' statement,
        gcc will now generate multiple table-jump instructions, which
        helps CPU branch predictor logic.
      
      Note that the verification of filters is still being done through
      sk_chk_filter() in classical BPF format, so filters from user- or
      kernel space are verified in the same way as we do now, and same
      restrictions/constraints hold as well.
      
      We reuse the current BPF JIT compilers in a way that this upgrade would even
      be fine as is, but that nevertheless allows for a successive upgrade of the
      BPF JIT compilers to the new format.
      
      The internal instruction set migration is being done after the
      probing for JIT compilation, so in case JIT compilers are able to
      create a native opcode image, we're going to use that, and in all
      other cases we're doing a follow-up migration of the BPF program's
      instruction set, so that it can be transparently run in the new
      interpreter.
      
      In short, the *internal* format extends BPF in the following way (more
      details can be taken from the appended documentation):
      
        - Number of registers increase from 2 to 10
        - Register width increases from 32-bit to 64-bit
        - Conditional jt/jf targets replaced with jt/fall-through
        - Adds signed > and >= insns
        - 16 4-byte stack slots for register spill-fill replaced
          with up to 512 bytes of multi-use stack space
        - Introduction of bpf_call insn and register passing convention
          for zero overhead calls from/to other kernel functions
        - Adds arithmetic right shift and endianness conversion insns
        - Adds atomic_add insn
        - Old tax/txa insns are replaced with 'mov dst,src' insn
      
      Performance of two BPF filters generated by libpcap resp. bpf_asm
      was measured on x86_64, i386 and arm32 (other libpcap programs
      have similar performance differences):
      
      fprog #1 is taken from Documentation/networking/filter.txt:
      tcpdump -i eth0 port 22 -dd
      
      fprog #2 is taken from 'man tcpdump':
      tcpdump -i eth0 'tcp port 22 and (((ip[2:2] - ((ip[0]&0xf)<<2)) -
         ((tcp[12]&0xf0)>>2)) != 0)' -dd
      
      Raw performance data from BPF micro-benchmark: SK_RUN_FILTER on the
      same SKB (cache-hit) or 10k SKBs (cache-miss); time in ns per call,
      smaller is better:
      
      --x86_64--
               fprog #1  fprog #1   fprog #2  fprog #2
               cache-hit cache-miss cache-hit cache-miss
      old BPF      90       101        192       202
      new BPF      31        71         47        97
      old BPF jit  12        34         17        44
      new BPF jit TBD
      
      --i386--
               fprog #1  fprog #1   fprog #2  fprog #2
               cache-hit cache-miss cache-hit cache-miss
      old BPF     107       136        227       252
      new BPF      40       119         69       172
      
      --arm32--
               fprog #1  fprog #1   fprog #2  fprog #2
               cache-hit cache-miss cache-hit cache-miss
      old BPF     202       300        475       540
      new BPF     180       270        330       470
      old BPF jit  26       182         37       202
      new BPF jit TBD
      
      Thus, without changing any userland BPF filters, applications on
      top of AF_PACKET (or other families) such as libpcap/tcpdump, cls_bpf
      classifier, netfilter's xt_bpf, team driver's load-balancing mode,
      and many more will have better interpreter filtering performance.
      
      While we are replacing the internal BPF interpreter, we also need
      to convert seccomp BPF in the same step to make use of the new
      internal structure since it makes use of lower-level API details
      without being further decoupled through higher-level calls like
      sk_unattached_filter_{create,destroy}(), for example.
      
      Just as for normal socket filtering, also seccomp BPF experiences
      a time-to-verdict speedup:
      
      05-sim-long_jumps.c of libseccomp was used as micro-benchmark:
      
        seccomp_rule_add_exact(ctx,...
        seccomp_rule_add_exact(ctx,...
      
        rc = seccomp_load(ctx);
      
        for (i = 0; i < 10000000; i++)
           syscall(199, 100);
      
      'short filter' has 2 rules
      'large filter' has 200 rules
      
      'short filter' performance is slightly better on x86_64/i386/arm32
      'large filter' is much faster on x86_64 and i386 and shows no
                     difference on arm32
      
      --x86_64-- short filter
      old BPF: 2.7 sec
       39.12%  bench  libc-2.15.so       [.] syscall
        8.10%  bench  [kernel.kallsyms]  [k] sk_run_filter
        6.31%  bench  [kernel.kallsyms]  [k] system_call
        5.59%  bench  [kernel.kallsyms]  [k] trace_hardirqs_on_caller
        4.37%  bench  [kernel.kallsyms]  [k] trace_hardirqs_off_caller
        3.70%  bench  [kernel.kallsyms]  [k] __secure_computing
        3.67%  bench  [kernel.kallsyms]  [k] lock_is_held
        3.03%  bench  [kernel.kallsyms]  [k] seccomp_bpf_load
      new BPF: 2.58 sec
       42.05%  bench  libc-2.15.so       [.] syscall
        6.91%  bench  [kernel.kallsyms]  [k] system_call
        6.25%  bench  [kernel.kallsyms]  [k] trace_hardirqs_on_caller
        6.07%  bench  [kernel.kallsyms]  [k] __secure_computing
        5.08%  bench  [kernel.kallsyms]  [k] sk_run_filter_int_seccomp
      
      --arm32-- short filter
      old BPF: 4.0 sec
       39.92%  bench  [kernel.kallsyms]  [k] vector_swi
       16.60%  bench  [kernel.kallsyms]  [k] sk_run_filter
       14.66%  bench  libc-2.17.so       [.] syscall
        5.42%  bench  [kernel.kallsyms]  [k] seccomp_bpf_load
        5.10%  bench  [kernel.kallsyms]  [k] __secure_computing
      new BPF: 3.7 sec
       35.93%  bench  [kernel.kallsyms]  [k] vector_swi
       21.89%  bench  libc-2.17.so       [.] syscall
       13.45%  bench  [kernel.kallsyms]  [k] sk_run_filter_int_seccomp
        6.25%  bench  [kernel.kallsyms]  [k] __secure_computing
        3.96%  bench  [kernel.kallsyms]  [k] syscall_trace_exit
      
      --x86_64-- large filter
      old BPF: 8.6 seconds
          73.38%    bench  [kernel.kallsyms]  [k] sk_run_filter
          10.70%    bench  libc-2.15.so       [.] syscall
           5.09%    bench  [kernel.kallsyms]  [k] seccomp_bpf_load
           1.97%    bench  [kernel.kallsyms]  [k] system_call
      new BPF: 5.7 seconds
          66.20%    bench  [kernel.kallsyms]  [k] sk_run_filter_int_seccomp
          16.75%    bench  libc-2.15.so       [.] syscall
           3.31%    bench  [kernel.kallsyms]  [k] system_call
           2.88%    bench  [kernel.kallsyms]  [k] __secure_computing
      
      --i386-- large filter
      old BPF: 5.4 sec
      new BPF: 3.8 sec
      
      --arm32-- large filter
      old BPF: 13.5 sec
       73.88%  bench  [kernel.kallsyms]  [k] sk_run_filter
       10.29%  bench  [kernel.kallsyms]  [k] vector_swi
        6.46%  bench  libc-2.17.so       [.] syscall
        2.94%  bench  [kernel.kallsyms]  [k] seccomp_bpf_load
        1.19%  bench  [kernel.kallsyms]  [k] __secure_computing
        0.87%  bench  [kernel.kallsyms]  [k] sys_getuid
      new BPF: 13.5 sec
       76.08%  bench  [kernel.kallsyms]  [k] sk_run_filter_int_seccomp
       10.98%  bench  [kernel.kallsyms]  [k] vector_swi
        5.87%  bench  libc-2.17.so       [.] syscall
        1.77%  bench  [kernel.kallsyms]  [k] __secure_computing
        0.93%  bench  [kernel.kallsyms]  [k] sys_getuid
      
      BPF filters generated by seccomp are very branchy, so the new
      internal BPF performance is better than the old one. Performance
      gains will be even higher when BPF JIT is committed for the
      new structure, which is planned in future work (as successive
      JIT migrations).
      
      BPF has also been stress-tested with trinity's BPF fuzzer.
      
      Joint work with Daniel Borkmann.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Cc: Hagen Paul Pfeifer <hagen@jauu.net>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Paul Moore <pmoore@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: H. Peter Anvin <hpa@linux.intel.com>
      Cc: linux-kernel@vger.kernel.org
      Acked-by: Kees Cook <keescook@chromium.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: filter: move filter accounting to filter core · fbc907f0
      Committed by Daniel Borkmann
      This patch basically does two things: i) it removes the extern keyword
      from the include/linux/filter.h file to be more consistent with the rest of
      Joe's changes, and ii) it moves filter accounting into the filter core
      framework.
      
      Filter accounting, mainly done through sk_filter_{un,}charge(), takes care
      of the case when sockets are being cloned through sk_clone_lock(), so that
      removal of the filter on one socket won't result in eviction while it's
      still referenced by the other.
      
      These functions actually belong in net/core/filter.c and not
      include/net/sock.h, as we want to keep all of that in a central place.
      It's also not in the fast path, so uninlining them is fine and even allows
      us to get rid of sk_filter_release_rcu()'s EXPORT_SYMBOL and a forward
      declaration.
      
      Joint work with Alexei Starovoitov.
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: filter: keep original BPF program around · a3ea269b
      Committed by Daniel Borkmann
      In order to open up the possibility of internally transforming a BPF program
      into an alternative and possibly non-trivial reversible representation, we
      need to keep the original BPF program around, so that it can be passed back
      to user space without the need for a complex decoder.
      
      The reason for that use case resides in commit a8fc9277 ("sk-filter:
      Add ability to get socket filter program (v2)"), that is, the ability
      to retrieve the currently attached BPF filter from a given socket used
      mainly by the checkpoint-restore project, for example.
      
      Therefore, we add two helpers, sk_{store,release}_orig_filter, to take care
      of that. In the sk_unattached_filter_create() case, there's no such
      possibility/requirement to retrieve a loaded BPF program, so we can spare
      ourselves the work in that case.
      
      This approach will simplify and slightly speed up both the sk_get_filter()
      and sock_diag_put_filterinfo() handlers, as we won't need to successively
      decode filters anymore through sk_decode_filter(). As we still need
      sk_decode_filter() later on, we keep it around.
      
      Joint work with Alexei Starovoitov.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Cc: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: filter: add jited flag to indicate jit compiled filters · f8bbbfc3
      Committed by Daniel Borkmann
      This patch adds a jited flag to the sk_filter struct in order to indicate
      whether a filter is currently JIT-compiled or not. The size of sk_filter is
      not expanded, as the 32-bit 'len' member allows its upper bits to be reused,
      since a filter can currently only grow as large as BPF_MAXINSNS.
      
      Therefore, there's also enough room for other flags needed in the future to
      reuse the 'len' field if necessary. The jited flag also allows for
      alternative interpreter functions to be run; currently, we can only detect
      JIT-compiled filters by testing that fp->bpf_func does not equal the address
      of sk_run_filter().
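      
      Concretely, the flag fits into the word that already holds 'len' (a sketch
      of the sk_filter members described above, not the full struct):
      
              unsigned int    jited:1,        /* Is our filter JIT'ed?   */
                              len:31;         /* Number of filter blocks */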
      
      Joint work with Alexei Starovoitov.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Cc: Pablo Neira Ayuso <pablo@netfilter.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  21. 22 Jan 2014 (1 commit)
    • net: filter: let bpf_tell_extensions return SKF_AD_MAX · 37692299
      Committed by Daniel Borkmann
      Michal Sekletar added in commit ea02f941 ("net: introduce
      SO_BPF_EXTENSIONS") a facility with which user space can query the BPF
      ancillary instruction set, which is imho a step in the right direction for
      letting user-space high-level-to-BPF optimizers make an informed decision
      about possibly using these extensions.
      
      The original rationale was to return, through a getsockopt(2), a bitfield of
      which instructions are supported and which are not; as of right now, we just
      return 0 to indicate base support for SKF_AD_PROTOCOL up to
      SKF_AD_PAY_OFFSET. The limitations of this approach are that this API, which
      we need to maintain for a long time, can only support a maximum of 32
      extensions, and that it needs additional maintenance/updates each time a new
      extension comes in.
      
      I thought about this a bit more, and what we can do here to overcome this is
      to just return SKF_AD_MAX. Since we never remove any extension (we cannot
      break user space) and always linearly increase SKF_AD_MAX with each newly
      added extension, user space can decide which extensions from the whole set
      are supported and which aren't by just checking which of them have an
      offset < SKF_AD_MAX of the underlying kernel.
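      
      In user space that check could look like this (editor's sketch;
      SO_BPF_EXTENSIONS and SKF_AD_PAY_OFFSET come from the socket and filter
      uapi headers):
      
      int max_ext = 0;
      socklen_t optlen = sizeof(max_ext);
      int have_pay_offset = 0;
      
      /* the kernel returns SKF_AD_MAX; an ancillary load is usable iff its
       * offset lies below that value */
      if (getsockopt(sock_fd, SOL_SOCKET, SO_BPF_EXTENSIONS, &max_ext, &optlen) == 0)
              have_pay_offset = SKF_AD_PAY_OFFSET < max_ext;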
      
      Since SKF_AD_MAX must be updated each time we add new extensions, we don't
      need to introduce an additional enum and we get the maintenance for free. At
      some point in time, when SO_BPF_EXTENSIONS becomes ubiquitous for most
      kernels, an application can simply make use of this and easily run on newer
      or older underlying kernels without needing to be recompiled, of course.
      Since this is for 3.14, it's not too late to do this change.
      
      Cc: Michal Sekletar <msekleta@redhat.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
      Acked-by: Michal Sekletar <msekleta@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  22. 19 Jan 2014 (1 commit)
    • net: introduce SO_BPF_EXTENSIONS · ea02f941
      Committed by Michal Sekletar
      For user space packet capturing libraries such as libpcap, there's
      currently only one way to check which BPF extensions are supported
      by the kernel, that is, commit aa1113d9 ("net: filter: return
      -EINVAL if BPF_S_ANC* operation is not supported"). For querying all
      extensions at once this might be rather inconvenient.
      
      Therefore, this patch introduces a new option which can be used as
      an argument for getsockopt(), and allows one to obtain information
      about which BPF extensions are supported by the current kernel.
      
      As David Miller suggests, we do not need to define any bits right now, and
      the status quo can just return 0 in order to state that this version
      supports SKF_AD_PROTOCOL up to SKF_AD_PAY_OFFSET. Later additions to the
      BPF extensions need to add their bits to the bpf_tell_extensions() function,
      as documented in the comment.
      Signed-off-by: Michal Sekletar <msekleta@redhat.com>
      Cc: David Miller <davem@davemloft.net>
      Reviewed-by: Daniel Borkmann <dborkman@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  23. 08 Oct 2013 (1 commit)
    • net: fix unsafe set_memory_rw from softirq · d45ed4a4
      Committed by Alexei Starovoitov
      On an x86 system with net.core.bpf_jit_enable = 1,
      
      sudo tcpdump -i eth1 'tcp port 22'
      
      causes the warning:
      [   56.766097]  Possible unsafe locking scenario:
      [   56.766097]
      [   56.780146]        CPU0
      [   56.786807]        ----
      [   56.793188]   lock(&(&vb->lock)->rlock);
      [   56.799593]   <Interrupt>
      [   56.805889]     lock(&(&vb->lock)->rlock);
      [   56.812266]
      [   56.812266]  *** DEADLOCK ***
      [   56.812266]
      [   56.830670] 1 lock held by ksoftirqd/1/13:
      [   56.836838]  #0:  (rcu_read_lock){.+.+..}, at: [<ffffffff8118f44c>] vm_unmap_aliases+0x8c/0x380
      [   56.849757]
      [   56.849757] stack backtrace:
      [   56.862194] CPU: 1 PID: 13 Comm: ksoftirqd/1 Not tainted 3.12.0-rc3+ #45
      [   56.868721] Hardware name: System manufacturer System Product Name/P8Z77 WS, BIOS 3007 07/26/2012
      [   56.882004]  ffffffff821944c0 ffff88080bbdb8c8 ffffffff8175a145 0000000000000007
      [   56.895630]  ffff88080bbd5f40 ffff88080bbdb928 ffffffff81755b14 0000000000000001
      [   56.909313]  ffff880800000001 ffff880800000000 ffffffff8101178f 0000000000000001
      [   56.923006] Call Trace:
      [   56.929532]  [<ffffffff8175a145>] dump_stack+0x55/0x76
      [   56.936067]  [<ffffffff81755b14>] print_usage_bug+0x1f7/0x208
      [   56.942445]  [<ffffffff8101178f>] ? save_stack_trace+0x2f/0x50
      [   56.948932]  [<ffffffff810cc0a0>] ? check_usage_backwards+0x150/0x150
      [   56.955470]  [<ffffffff810ccb52>] mark_lock+0x282/0x2c0
      [   56.961945]  [<ffffffff810ccfed>] __lock_acquire+0x45d/0x1d50
      [   56.968474]  [<ffffffff810cce6e>] ? __lock_acquire+0x2de/0x1d50
      [   56.975140]  [<ffffffff81393bf5>] ? cpumask_next_and+0x55/0x90
      [   56.981942]  [<ffffffff810cef72>] lock_acquire+0x92/0x1d0
      [   56.988745]  [<ffffffff8118f52a>] ? vm_unmap_aliases+0x16a/0x380
      [   56.995619]  [<ffffffff817628f1>] _raw_spin_lock+0x41/0x50
      [   57.002493]  [<ffffffff8118f52a>] ? vm_unmap_aliases+0x16a/0x380
      [   57.009447]  [<ffffffff8118f52a>] vm_unmap_aliases+0x16a/0x380
      [   57.016477]  [<ffffffff8118f44c>] ? vm_unmap_aliases+0x8c/0x380
      [   57.023607]  [<ffffffff810436b0>] change_page_attr_set_clr+0xc0/0x460
      [   57.030818]  [<ffffffff810cfb8d>] ? trace_hardirqs_on+0xd/0x10
      [   57.037896]  [<ffffffff811a8330>] ? kmem_cache_free+0xb0/0x2b0
      [   57.044789]  [<ffffffff811b59c3>] ? free_object_rcu+0x93/0xa0
      [   57.051720]  [<ffffffff81043d9f>] set_memory_rw+0x2f/0x40
      [   57.058727]  [<ffffffff8104e17c>] bpf_jit_free+0x2c/0x40
      [   57.065577]  [<ffffffff81642cba>] sk_filter_release_rcu+0x1a/0x30
      [   57.072338]  [<ffffffff811108e2>] rcu_process_callbacks+0x202/0x7c0
      [   57.078962]  [<ffffffff81057f17>] __do_softirq+0xf7/0x3f0
      [   57.085373]  [<ffffffff81058245>] run_ksoftirqd+0x35/0x70
      
      We cannot reuse the JITed filter memory, since it's read-only, so use the
      original BPF insns memory to hold the work_struct.
      
      Defer the kfree of sk_filter until the JIT has completed freeing the image.
      
      Tested on x86_64 and i386.
      Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  24. 11 Jun 2013 (1 commit)
  25. 20 May 2013 (1 commit)