1. 18 September 2020 (2 commits)
  2. 29 August 2020 (1 commit)
    • bpf: Introduce sleepable BPF programs · 1e6c62a8
      Committed by Alexei Starovoitov
      Introduce sleepable BPF programs that can request this property for themselves
      via the BPF_F_SLEEPABLE flag at program load time. In that case they are able
      to use helpers like bpf_copy_from_user() that might sleep. At present only
      fentry/fexit/fmod_ret and lsm programs can request to be sleepable, and only
      when they are attached to kernel functions that are known to allow sleeping.
      
      The non-sleepable programs rely on implicit rcu_read_lock() and
      migrate_disable() to protect the lifetime of the programs, the maps they use,
      and the per-cpu kernel structures used to pass info between bpf programs and the
      kernel. Sleepable programs cannot be enclosed in rcu_read_lock().
      migrate_disable() maps to preempt_disable() in non-RT kernels, so the progs
      should not be enclosed in migrate_disable() either. Therefore
      rcu_read_lock_trace is used to protect the lifetime of sleepable progs.
      
      There are many networking and tracing program types. In many cases the
      'struct bpf_prog *' pointer itself is rcu-protected within some other kernel
      data structure, and the kernel code uses rcu_dereference() to load that
      program pointer and call BPF_PROG_RUN() on it. None of these cases are touched.
      Instead, sleepable bpf programs are allowed with the bpf trampoline only. The
      program pointers are hard-coded into the generated assembly of the bpf trampoline
      and synchronize_rcu_tasks_trace() is used to protect the lifetime of the program.
      The same trampoline can hold both sleepable and non-sleepable progs.
      
      When rcu_read_lock_trace is held it means that some sleepable bpf program is
      running from a bpf trampoline. Those programs can use bpf arrays and preallocated
      hash/lru maps. These map types wait for programs to complete via
      synchronize_rcu_tasks_trace().
      
      Updates to the trampoline now have to do synchronize_rcu_tasks_trace() and
      synchronize_rcu_tasks() to wait for sleepable progs to finish and for the
      trampoline assembly to finish.
      
      This is the first step of introducing sleepable progs. Eventually dynamically
      allocated hash maps can be allowed and networking program types can become
      sleepable too.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: KP Singh <kpsingh@google.com>
      Link: https://lore.kernel.org/bpf/20200827220114.69225-3-alexei.starovoitov@gmail.com
      1e6c62a8
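      To make the new flag and helper concrete, below is a minimal, hedged sketch of a
      sleepable LSM program in C. Current libbpf loads "lsm.s" sections with
      BPF_F_SLEEPABLE set; the attach point and the field access are illustrative
      assumptions, not something taken from this commit.

        #include "vmlinux.h"
        #include <bpf/bpf_helpers.h>
        #include <bpf/bpf_tracing.h>

        char argv0[64];

        SEC("lsm.s/bprm_committed_creds")        /* ".s" marks the program sleepable */
        int BPF_PROG(dump_argv0, struct linux_binprm *bprm)
        {
                /* Copying from user memory may fault and sleep; this is only
                 * allowed because the program was loaded with BPF_F_SLEEPABLE. */
                bpf_copy_from_user(argv0, sizeof(argv0),
                                   (const void *)bprm->mm->arg_start);
                return 0;
        }

        char LICENSE[] SEC("license") = "GPL";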
  3. 21 April 2020 (1 commit)
    • bpf, x86: Fix encoding for lower 8-bit registers in BPF_STX BPF_B · aee194b1
      Committed by Luke Nelson
      This patch fixes an encoding bug in emit_stx for BPF_B when the source
      register is BPF_REG_FP.
      
      The current implementation for BPF_STX BPF_B in emit_stx saves one REX
      byte when the operands can be encoded using Mod-R/M alone. The lower 8
      bits of registers %rax, %rbx, %rcx, and %rdx can be accessed without using
      a REX prefix via %al, %bl, %cl, and %dl, respectively. Other registers
      (e.g., %rsi, %rdi, %rbp, %rsp) require a REX prefix to use their 8-bit
      equivalents (%sil, %dil, %bpl, %spl).
      
      The current code checks if the source for BPF_STX BPF_B is BPF_REG_1
      or BPF_REG_2 (which map to %rdi and %rsi), in which case it emits the
      required REX prefix. However, it misses the case when the source is
      BPF_REG_FP (mapped to %rbp).
      
      The result is that BPF_STX BPF_B with BPF_REG_FP as the source operand
      will read from register %ch instead of the correct %bpl. This patch fixes
      the problem by fixing and refactoring the check on which registers need
      the extra REX byte. Since no BPF registers map to %rsp, there is no need
      to handle %spl.
      
      Fixes: 62258278 ("net: filter: x86: internal BPF JIT")
      Signed-off-by: Xi Wang <xi.wang@gmail.com>
      Signed-off-by: Luke Nelson <luke.r.nels@gmail.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200418232655.23870-1-luke.r.nels@gmail.com
      aee194b1
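      A hedged sketch of the check described above; the helper name is hypothetical and
      only illustrates which BPF registers need the extra REX byte for 8-bit stores:

        /* %al/%bl/%cl/%dl can be encoded without REX, but BPF_REG_1 (%rdi -> %dil),
         * BPF_REG_2 (%rsi -> %sil) and BPF_REG_FP (%rbp -> %bpl) cannot. */
        static bool byte_reg_needs_rex(u32 bpf_reg)
        {
                return is_ereg(bpf_reg) ||      /* r8..r15 always need REX */
                       bpf_reg == BPF_REG_1 ||
                       bpf_reg == BPF_REG_2 ||
                       bpf_reg == BPF_REG_FP;
        }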
  4. 11 March 2020 (1 commit)
  5. 05 March 2020 (3 commits)
  6. 10 January 2020 (1 commit)
    • bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS · 85d33df3
      Committed by Martin KaFai Lau
      The patch introduces BPF_MAP_TYPE_STRUCT_OPS.  The map value
      is a kernel struct with its func ptr implemented in bpf prog.
      This new map is the interface to register/unregister/introspect
      a bpf implemented kernel struct.
      
      The kernel struct is actually embedded inside another new struct
      (called the "value" struct in the code).  For example,
      "struct tcp_congestion_ops" is embedded in:
      struct bpf_struct_ops_tcp_congestion_ops {
      	refcount_t refcnt;
      	enum bpf_struct_ops_state state;
      	struct tcp_congestion_ops data;  /* <-- kernel subsystem struct here */
      }
      The map value is "struct bpf_struct_ops_tcp_congestion_ops".
      The "bpftool map dump" will then be able to show the
      state ("inuse"/"tobefree") and the subsystem's refcnt (e.g. the
      number of tcp_socks in the tcp_congestion_ops case).  This "value" struct
      is created automatically by a macro.  Having a separate "value" struct
      will also make extending "struct bpf_struct_ops_XYZ" easier (e.g. adding
      "void (*init)(void)" to "struct bpf_struct_ops_XYZ" to do some
      initialization work before registering the struct_ops to the kernel
      subsystem).  libbpf will take care of finding and populating the
      "struct bpf_struct_ops_XYZ" from "struct XYZ".
      
      Register a struct_ops to a kernel subsystem:
      1. Load all needed BPF_PROG_TYPE_STRUCT_OPS prog(s)
      2. Create a BPF_MAP_TYPE_STRUCT_OPS with attr->btf_vmlinux_value_type_id
         set to the btf id "struct bpf_struct_ops_tcp_congestion_ops" of the
         running kernel.
         Instead of reusing the attr->btf_value_type_id,
         btf_vmlinux_value_type_id is added such that attr->btf_fd can still be
         used as the "user" btf which could store other useful sysadmin/debug
         info that may be introduced in the future,
         e.g. creation-date/compiler-details/map-creator...etc.
      3. Create a "struct bpf_struct_ops_tcp_congestion_ops" object as described
         in the running kernel btf.  Populate the value of this object.
         The function ptr should be populated with the prog fds.
      4. Call BPF_MAP_UPDATE with the object created in (3) as
         the map value.  The key is always "0".
      
      During BPF_MAP_UPDATE, the code that saves the kernel-func-ptr's
      args as an array of u64 is generated.  BPF_MAP_UPDATE also allows
      the specific struct_ops to do some final checks in "st_ops->init_member()"
      (e.g. ensure all mandatory func ptrs are implemented).
      If everything looks good, it will register this kernel struct
      to the kernel subsystem.  The map will not allow further update
      from this point.
      
      Unregister a struct_ops from the kernel subsystem:
      BPF_MAP_DELETE with key "0".
      
      Introspect a struct_ops:
      BPF_MAP_LOOKUP_ELEM with key "0".  The map value returned will
      have the prog _id_ populated as the func ptr.
      
      The map value state (enum bpf_struct_ops_state) will transit from:
      INIT (map created) =>
      INUSE (map updated, i.e. reg) =>
      TOBEFREE (map value deleted, i.e. unreg)
      
      The kernel subsystem needs to call bpf_struct_ops_get() and
      bpf_struct_ops_put() to manage the "refcnt" in the
      "struct bpf_struct_ops_XYZ".  This patch uses a separate refcnt
      for the purpose of tracking the subsystem usage.  Another approach
      is to reuse the map->refcnt and then "show" (i.e. during map_lookup)
      the subsystem's usage by doing map->refcnt - map->usercnt to filter out
      the map-fd/pinned-map usage.  However, that will also tie down the
      future semantics of map->refcnt and map->usercnt.
      
      The very first subsystem's refcnt (during reg()) holds one
      count to map->refcnt.  When the very last subsystem's refcnt
      is gone, it will also release the map->refcnt.  All bpf_prog will be
      freed when the map->refcnt reaches 0 (i.e. during map_free()).
      
      Here is what the bpftool map command output looks like:
      [root@arch-fb-vm1 bpf]# bpftool map show
      6: struct_ops  name dctcp  flags 0x0
      	key 4B  value 256B  max_entries 1  memlock 4096B
      	btf_id 6
      [root@arch-fb-vm1 bpf]# bpftool map dump id 6
      [{
              "value": {
                  "refcnt": {
                      "refs": {
                          "counter": 1
                      }
                  },
                  "state": 1,
                  "data": {
                      "list": {
                          "next": 0,
                          "prev": 0
                      },
                      "key": 0,
                      "flags": 2,
                      "init": 24,
                      "release": 0,
                      "ssthresh": 25,
                      "cong_avoid": 30,
                      "set_state": 27,
                      "cwnd_event": 28,
                      "in_ack_event": 26,
                      "undo_cwnd": 29,
                      "pkts_acked": 0,
                      "min_tso_segs": 0,
                      "sndbuf_expand": 0,
                      "cong_control": 0,
                      "get_info": 0,
                      "name": [98,112,102,95,100,99,116,99,112,0,0,0,0,0,0,0
                      ],
                      "owner": 0
                  }
              }
          }
      ]
      
      Misc Notes:
      * bpf_struct_ops_map_sys_lookup_elem() is added for syscall lookup.
        It does an in-place update on "*value" instead of returning a pointer
        to syscall.c.  Otherwise, it needs a separate copy of the "zero" value
        for the BPF_STRUCT_OPS_STATE_INIT to avoid races.
      
      * The bpf_struct_ops_map_delete_elem() is also called without
        preempt_disable() from map_delete_elem().  This is because
        the "->unreg()" may require a sleepable context, e.g.
        "tcp_unregister_congestion_control()".
      
      * "const" is added to some of the existing "struct btf_func_model *"
        function args to avoid a compiler warning caused by this patch.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20200109003505.3855919-1-kafai@fb.com
      85d33df3
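      For illustration, a hedged sketch of the BPF side that the registration flow above
      consumes. Section names follow libbpf's struct_ops conventions; the callbacks are
      deliberately trivial and the names are illustrative assumptions:

        #include "vmlinux.h"
        #include <bpf/bpf_helpers.h>
        #include <bpf/bpf_tracing.h>

        SEC("struct_ops/sample_ssthresh")
        __u32 BPF_PROG(sample_ssthresh, struct sock *sk)
        {
                return 16;                       /* illustrative fixed ssthresh */
        }

        SEC("struct_ops/sample_undo_cwnd")
        __u32 BPF_PROG(sample_undo_cwnd, struct sock *sk)
        {
                return 16;                       /* illustrative fixed cwnd */
        }

        SEC("struct_ops/sample_cong_avoid")
        void BPF_PROG(sample_cong_avoid, struct sock *sk, __u32 ack, __u32 acked)
        {
        }

        /* libbpf turns this ".struct_ops" map into a BPF_MAP_TYPE_STRUCT_OPS map,
         * fills the "value" struct with the prog fds and issues BPF_MAP_UPDATE with
         * key 0, which registers the congestion control with the kernel. */
        SEC(".struct_ops")
        struct tcp_congestion_ops sample_cc = {
                .ssthresh   = (void *)sample_ssthresh,
                .undo_cwnd  = (void *)sample_undo_cwnd,
                .cong_avoid = (void *)sample_cong_avoid,
                .name       = "bpf_sample_cc",
        };

        char LICENSE[] SEC("license") = "GPL";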
  7. 14 December 2019 (2 commits)
  8. 25 November 2019 (3 commits)
    • bpf: Simplify __bpf_arch_text_poke poke type handling · b553a6ec
      Committed by Daniel Borkmann
      Given that we have BPF_MOD_NOP_TO_{CALL,JUMP}, BPF_MOD_{CALL,JUMP}_TO_NOP
      and BPF_MOD_{CALL,JUMP}_TO_{CALL,JUMP} poke types and that we also pass in
      old_addr as well as new_addr, it's a bit redundant and unnecessarily
      complicates __bpf_arch_text_poke() itself since we can derive the same from
      the *_addr values that were passed in. Hence simplify and use BPF_MOD_{CALL,JUMP}
      as types, which also allows cleaning up the call sites.
      
      In addition to that, __bpf_arch_text_poke() currently verifies that text
      matches expected old_insn before we invoke text_poke_bp(). Also add a check
      on new_insn and skip the rewrite if it already matches. The reason this is
      useful is that it avoids any special casing in prog_array_map_poke_run()
      when old and new prog are NULL, and it has the benefit that even in this case
      we check whether the text really matches our expectations.
      Suggested-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/fcb00a2b0b288d6c73de4ef58116a821c8fe8f2f.1574555798.git.daniel@iogearbox.net
      b553a6ec
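      After this simplification the interface boils down to two poke types, with the
      nop cases derived from the addresses. A hedged sketch of the resulting shape;
      the comments about NULL semantics are inferred from the commit message, not
      quoted from the patch:

        enum bpf_text_poke_type {
                BPF_MOD_CALL,           /* poke site holds (or becomes) a call */
                BPF_MOD_JUMP,           /* poke site holds (or becomes) a jump */
        };

        /* old_addr == NULL: the site currently holds a nop;
         * new_addr == NULL: turn the site back into a nop. */
        int bpf_arch_text_poke(void *ip, enum bpf_text_poke_type t,
                               void *old_addr, void *new_addr);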
    • bpf, x86: Emit patchable direct jump as tail call · 428d5df1
      Committed by Daniel Borkmann
      Add initial code emission for *direct* jumps for tail call maps in
      order to avoid the retpoline overhead from a493a87f ("bpf, x64:
      implement retpoline for tail call") for situations that allow for
      it, meaning, for known constant keys at verification time which are
      used as the index into the tail call map. In the case of Cilium, which makes
      heavy use of tail calls, constant keys are used in the vast majority of cases;
      only a single occurrence uses a dynamic key.
      
      High level outline is that if the target prog is NULL in the map, we
      emit a 5-byte nop for the fall-through case and if not, we emit a
      5-byte direct relative jmp to the target bpf_func + skipped prologue
      offset. Later during runtime, we patch these 5-byte nop/jmps upon
      tail call map update or deletions dynamically. Note that on x86-64
      the direct jmp works as we reuse the same stack frame and skip
      prologue (as opposed to some other JIT implementations).
      
      One of the issues is that the tail call map slots can change at any
      given time even during JITing. Therefore, we have two passes: i) emit
      nops for all patchable locations during main JITing phase until we
      declare prog->jited = 1 eventually. At this point the image is stable,
      not public yet and with all jmps disabled. While JITing, we collect
      additional info like poke->ip in order to remember the patch location
      for later modifications. In ii) bpf_tail_call_direct_fixup() walks
      over the prog's poke_tab, locks the tail call map's poke_mutex to
      prevent parallel updates, and patches in the right locations via
      __bpf_arch_text_poke(). Note, the main bpf_arch_text_poke() cannot
      be used at this point since we're not yet exposed to kallsyms. For
      the update we use plain memcpy() since the image is not public and
      still in read-write mode. After patching, we activate that poke entry
      through poke->ip_stable. Meaning, at this point any tail call map
      updates/deletions are not going to ignore that poke entry anymore.
      Then, bpf_arch_text_poke() might still occur on the read-write image
      until we finally lock it as read-only. Both modifications on the
      given image are under text_mutex to avoid interference with each
      other when update requests come in in parallel for different tail
      call maps (current one we have locked in JIT and different one where
      poke->ip_stable was already set).
      
      Example prog:
      
        # ./bpftool p d x i 1655
         0: (b7) r3 = 0
         1: (18) r2 = map[id:526]
         3: (85) call bpf_tail_call#12
         4: (b7) r0 = 1
         5: (95) exit
      
      Before:
      
        # ./bpftool p d j i 1655
        0xffffffffc076e55c:
         0:   nopl   0x0(%rax,%rax,1)
         5:   push   %rbp
         6:   mov    %rsp,%rbp
         9:   sub    $0x200,%rsp
        10:   push   %rbx
        11:   push   %r13
        13:   push   %r14
        15:   push   %r15
        17:   pushq  $0x0                      _
        19:   xor    %edx,%edx                |_ index (arg 3)
        1b:   movabs $0xffff88d95cc82600,%rsi |_ map (arg 2)
        25:   mov    %edx,%edx                |  index >= array->map.max_entries
        27:   cmp    %edx,0x24(%rsi)          |
        2a:   jbe    0x0000000000000066       |_
        2c:   mov    -0x224(%rbp),%eax        |  tail call limit check
        32:   cmp    $0x20,%eax               |
        35:   ja     0x0000000000000066       |
        37:   add    $0x1,%eax                |
        3a:   mov    %eax,-0x224(%rbp)        |_
        40:   mov    0xd0(%rsi,%rdx,8),%rax   |_ prog = array->ptrs[index]
        48:   test   %rax,%rax                |  prog == NULL check
        4b:   je     0x0000000000000066       |_
        4d:   mov    0x30(%rax),%rax          |  goto *(prog->bpf_func + prologue_size)
        51:   add    $0x19,%rax               |
        55:   callq  0x0000000000000061       |  retpoline for indirect jump
        5a:   pause                           |
        5c:   lfence                          |
        5f:   jmp    0x000000000000005a       |
        61:   mov    %rax,(%rsp)              |
        65:   retq                            |_
        66:   mov    $0x1,%eax
        6b:   pop    %rbx
        6c:   pop    %r15
        6e:   pop    %r14
        70:   pop    %r13
        72:   pop    %rbx
        73:   leaveq
        74:   retq
      
      After; state after JIT:
      
        # ./bpftool p d j i 1655
        0xffffffffc08e8930:
         0:   nopl   0x0(%rax,%rax,1)
         5:   push   %rbp
         6:   mov    %rsp,%rbp
         9:   sub    $0x200,%rsp
        10:   push   %rbx
        11:   push   %r13
        13:   push   %r14
        15:   push   %r15
        17:   pushq  $0x0                      _
        19:   xor    %edx,%edx                |_ index (arg 3)
        1b:   movabs $0xffff9d8afd74c000,%rsi |_ map (arg 2)
        25:   mov    -0x224(%rbp),%eax        |  tail call limit check
        2b:   cmp    $0x20,%eax               |
        2e:   ja     0x000000000000003e       |
        30:   add    $0x1,%eax                |
        33:   mov    %eax,-0x224(%rbp)        |_
        39:   jmpq   0xfffffffffffd1785       |_ [direct] goto *(prog->bpf_func + prologue_size)
        3e:   mov    $0x1,%eax
        43:   pop    %rbx
        44:   pop    %r15
        46:   pop    %r14
        48:   pop    %r13
        4a:   pop    %rbx
        4b:   leaveq
        4c:   retq
      
      After; state after map update (target prog):
      
        # ./bpftool p d j i 1655
        0xffffffffc08e8930:
         0:   nopl   0x0(%rax,%rax,1)
         5:   push   %rbp
         6:   mov    %rsp,%rbp
         9:   sub    $0x200,%rsp
        10:   push   %rbx
        11:   push   %r13
        13:   push   %r14
        15:   push   %r15
        17:   pushq  $0x0
        19:   xor    %edx,%edx
        1b:   movabs $0xffff9d8afd74c000,%rsi
        25:   mov    -0x224(%rbp),%eax
        2b:   cmp    $0x20,%eax               .
        2e:   ja     0x000000000000003e       .
        30:   add    $0x1,%eax                .
        33:   mov    %eax,-0x224(%rbp)        |_
        39:   jmpq   0xffffffffffb09f55       |_ goto *(prog->bpf_func + prologue_size)
        3e:   mov    $0x1,%eax
        43:   pop    %rbx
        44:   pop    %r15
        46:   pop    %r14
        48:   pop    %r13
        4a:   pop    %rbx
        4b:   leaveq
        4c:   retq
      
      After; state after map update (no prog):
      
        # ./bpftool p d j i 1655
        0xffffffffc08e8930:
         0:   nopl   0x0(%rax,%rax,1)
         5:   push   %rbp
         6:   mov    %rsp,%rbp
         9:   sub    $0x200,%rsp
        10:   push   %rbx
        11:   push   %r13
        13:   push   %r14
        15:   push   %r15
        17:   pushq  $0x0
        19:   xor    %edx,%edx
        1b:   movabs $0xffff9d8afd74c000,%rsi
        25:   mov    -0x224(%rbp),%eax
        2b:   cmp    $0x20,%eax               .
        2e:   ja     0x000000000000003e       .
        30:   add    $0x1,%eax                .
        33:   mov    %eax,-0x224(%rbp)        |_
        39:   nopl   0x0(%rax,%rax,1)         |_ fall-through nop
        3e:   mov    $0x1,%eax
        43:   pop    %rbx
        44:   pop    %r15
        46:   pop    %r14
        48:   pop    %r13
        4a:   pop    %rbx
        4b:   leaveq
        4c:   retq
      
      A nice bonus is that this also shrinks the code emission quite a bit
      for every tail call invocation.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/6ada4c1c9d35eeb5f4ecfab94593dafa6b5c4b09.1574452833.git.daniel@iogearbox.net
      428d5df1
    • bpf, x86: Generalize and extend bpf_arch_text_poke for direct jumps · 4b3da77b
      Committed by Daniel Borkmann
      Add BPF_MOD_{NOP_TO_JUMP,JUMP_TO_JUMP,JUMP_TO_NOP} patching for x86
      JIT in order to be able to patch direct jumps or nop them out. We need
      this facility in order to patch tail call jumps and in later work also
      BPF static keys.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/aa4784196a8e5e985af4b30a4fe5336bce6e9643.1574452833.git.daniel@iogearbox.net
      4b3da77b
  9. 16 November 2019 (5 commits)
    • bpf: Support attaching tracing BPF program to other BPF programs · 5b92a28a
      Committed by Alexei Starovoitov
      Allow FENTRY/FEXIT BPF programs to attach to other BPF programs of any type
      including their subprograms. This feature allows snooping on the input and output
      packets of XDP and TC programs, including their return values. In order to do that
      the verifier needs to track types not only of vmlinux, but types of other BPF
      programs as well. The verifier also needs to translate uapi/linux/bpf.h types
      used by networking programs into kernel internal BTF types used by FENTRY/FEXIT
      BPF programs. In some cases LLVM optimizations can remove arguments from BPF
      subprograms without adjusting the BTF info that the LLVM backend emits. When the BTF info
      disagrees with the actual types that the verifier sees, the BPF trampoline has to
      fall back to being conservative and treat all arguments as u64. The FENTRY/FEXIT
      program can still attach to such subprograms, but it won't be able to recognize
      pointer types like 'struct sk_buff *' and it won't be able to pass them to
      bpf_skb_output() for dumping packets to user space. The FENTRY/FEXIT program
      would need to use bpf_probe_read_kernel() instead.
      
      The BPF_PROG_LOAD command is extended with an attach_prog_fd field. When it's set
      to zero, the attach_btf_id is one of the vmlinux BTF type ids. When attach_prog_fd
      points to a previously loaded BPF program, the attach_btf_id is the BTF type id of
      its main function or one of its subprograms.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Song Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20191114185720.1641606-18-ast@kernel.org
      5b92a28a
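      A hedged userspace sketch of the new attach_prog_fd path using libbpf's
      attach-target API; the object, program and target function names are
      illustrative assumptions:

        #include <bpf/bpf.h>
        #include <bpf/libbpf.h>

        /* Attach an fentry tracer to a subprogram of an already-loaded XDP prog. */
        static int attach_to_bpf_prog(__u32 xdp_prog_id)
        {
                struct bpf_object *obj = bpf_object__open_file("fentry_tracer.o", NULL);
                struct bpf_program *prog =
                        bpf_object__find_program_by_name(obj, "trace_xdp_subprog");
                int target_fd = bpf_prog_get_fd_by_id(xdp_prog_id);

                /* Sets attach_prog_fd and resolves attach_btf_id from the target's BTF */
                bpf_program__set_attach_target(prog, target_fd, "xdp_subprog");

                if (bpf_object__load(obj))
                        return -1;
                return libbpf_get_error(bpf_program__attach_trace(prog)) ? -1 : 0;
        }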
    • bpf: Reserve space for BPF trampoline in BPF programs · 9fd4a39d
      Committed by Alexei Starovoitov
      The BPF trampoline can be made to work with the existing 5 bytes of BPF program
      prologue, but let's add 5 bytes of NOPs to the beginning of every JITed BPF
      program to make the BPF trampoline's job easier. They can be removed in the future.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20191114185720.1641606-14-ast@kernel.org
      9fd4a39d
    • bpf: Introduce BPF trampoline · fec56f58
      Committed by Alexei Starovoitov
      Introduce BPF trampoline concept to allow kernel code to call into BPF programs
      with practically zero overhead.  The trampoline generation logic is
      architecture dependent.  It converts the native calling convention into the BPF
      calling convention. The BPF ISA is 64-bit (even on 32-bit architectures). The
      registers R1 to R5 are used to pass arguments into BPF functions. The main BPF
      program accepts only a single argument "ctx" in R1, whereas the CPU's native calling
      convention is different. x86-64 passes the first 6 arguments in registers
      and the rest on the stack. x86-32 passes the first 3 arguments in registers.
      sparc64 passes the first 6 in registers. And so on.
      
      The trampolines between BPF and kernel already exist.  BPF_CALL_x macros in
      include/linux/filter.h statically compile trampolines from BPF into kernel
      helpers. They convert up to five u64 arguments into kernel C pointers and
      integers. On 64-bit architectures these BPF-to-kernel trampolines are nops. On
      32-bit architectures they're meaningful.
      
      The opposite job, kernel-to-BPF trampolines, is done by CAST_TO_U64 macros and
      __bpf_trace_##call() shim functions in include/trace/bpf_probe.h. They convert
      kernel function arguments into array of u64s that BPF program consumes via
      R1=ctx pointer.
      
      This patch set is doing the same job as __bpf_trace_##call() static
      trampolines, but dynamically for any kernel function. There are ~22k global
      kernel functions that are attachable via nop at function entry. The function
      arguments and types are described in BTF.  The job of btf_distill_func_proto()
      function is to extract useful information from BTF into "function model" that
      architecture dependent trampoline generators will use to generate assembly code
      to cast kernel function arguments into an array of u64s.  For example, the kernel
      function eth_type_trans has two pointer arguments. They will be cast to u64 and stored
      on the stack of the generated trampoline. The pointer to that stack space will be
      passed into the BPF program in R1. On x86-64 such a generated trampoline will consume
      16 bytes of stack and two stores of %rdi and %rsi into the stack. The verifier will
      make sure that only two u64 are accessed read-only by BPF program. The verifier
      will also recognize the precise type of the pointers being accessed and will
      not allow typecasting of the pointer to a different type within BPF program.
      
      The tracing use case in the datacenter demonstrated that certain key kernel
      functions (like tcp_retransmit_skb) have 2 or more kprobes that are always
      active.  Other functions have both a kprobe and a kretprobe.  So it is essential to
      keep both kernel code and BPF programs executing at maximum speed. Hence the
      generated BPF trampoline is re-generated every time a new program is attached or
      detached, to maintain maximum performance.
      
      To avoid the high cost of retpoline the attached BPF programs are called
      directly. __bpf_prog_enter/exit() are used to support per-program execution
      stats.  In the future this logic will be optimized further by adding support
      for bpf_stats_enabled_key inside generated assembly code. Introduction of
      preemptible and sleepable BPF programs will completely remove the need to call
      to __bpf_prog_enter/exit().
      
      Detaching a BPF program from the trampoline should not fail. To avoid memory
      allocation in the detach path, half of the page is used as a reserve and flipped
      after each attach/detach. 2k bytes is enough to call 40+ BPF programs directly,
      which is enough for BPF tracing use cases. This limit can be increased in the
      future.
      
      BPF_TRACE_FENTRY programs have access to raw kernel function arguments while
      BPF_TRACE_FEXIT programs have access to the kernel return value as well. Often
      a kprobe BPF program remembers function arguments in a map while a kretprobe
      fetches the arguments from the map and analyzes them together with the return value.
      BPF_TRACE_FEXIT accelerates this typical use case.
      
      Recursion prevention for kprobe BPF programs is done via per-cpu
      bpf_prog_active counter. In practice that turned out to be a mistake. It
      caused programs to randomly skip execution. The tracing tools missed results
      they were looking for. Hence BPF trampoline doesn't provide builtin recursion
      prevention. It's a job of BPF program itself and will be addressed in the
      follow up patches.
      
      BPF trampoline is intended to be used beyond tracing and fentry/fexit use cases
      in the future. For example to remove retpoline cost from XDP programs.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20191114185720.1641606-5-ast@kernel.org
      fec56f58
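      To make the fentry/fexit model concrete, a hedged sketch of an FEXIT program on
      eth_type_trans() (the example function named in the commit): the trampoline stores
      the two pointer arguments plus the return value as u64s behind ctx, and the
      BPF_PROG() convenience macro unpacks them with their BTF types:

        #include "vmlinux.h"
        #include <bpf/bpf_helpers.h>
        #include <bpf/bpf_tracing.h>

        /* Parameters mirror eth_type_trans(skb, dev); the last one is the return
         * value, which only FEXIT programs can see. */
        SEC("fexit/eth_type_trans")
        int BPF_PROG(trace_eth_type_trans, struct sk_buff *skb,
                     struct net_device *dev, unsigned short proto)
        {
                /* Direct dereference of the BTF pointer is type-checked by the verifier */
                bpf_printk("len %u proto %x", skb->len, proto);
                return 0;
        }

        char LICENSE[] SEC("license") = "GPL";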
    • bpf: Add bpf_arch_text_poke() helper · 5964b200
      Committed by Alexei Starovoitov
      Add the bpf_arch_text_poke() helper, which is used by the BPF trampoline logic to
      patch nops/calls in kernel text into calls into the BPF trampoline, and to patch
      calls/nops inside BPF programs too.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Song Liu <songliubraving@fb.com>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/20191114185720.1641606-4-ast@kernel.org
      5964b200
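      A hedged sketch of the hook's shape: architectures without text-patching support
      keep a weak default that reports failure, and the x86 JIT overrides it:

        /* Weak default: arches that cannot patch text report "unsupported". */
        int __weak bpf_arch_text_poke(void *ip, enum bpf_text_poke_type t,
                                      void *old_addr, void *new_addr)
        {
                return -ENOTSUPP;
        }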
    • bpf: Refactor x86 JIT into helpers · 3b2744e6
      Committed by Alexei Starovoitov
      Refactor x86 JITing of LDX, STX, CALL instructions into separate helper
      functions.  There are no functional changes in the LDX and STX helpers.  There is a
      minor change in the CALL helper: it will populate the target address correctly on
      the first JIT pass instead of the second. That won't reduce the total number of JIT
      passes, though.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Song Liu <songliubraving@fb.com>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/20191114185720.1641606-3-ast@kernel.org
      3b2744e6
  10. 17 October 2019 (1 commit)
    • bpf: Add support for BTF pointers to x86 JIT · 3dec541b
      Committed by Alexei Starovoitov
      A pointer to a BTF object is a pointer to a kernel object or NULL.
      Such pointers can only be used by BPF_LDX instructions.
      The verifier changed their opcode from LDX|MEM|size
      to LDX|PROBE_MEM|size to make JITing easier.
      The number of entries in the extable is the number of BPF_LDX insns
      that access kernel memory via a "pointer to BTF type".
      Only these load instructions can fault.
      Since the x86 extable is relative, it has to be allocated in the same
      memory region as the JITed code.
      Allocate it prior to the last JIT pass and let the last pass populate it.
      A pointer to the extable in bpf_prog_aux is necessary to make page fault
      handling fast.
      Page fault handling is done in two steps:
      1. bpf_prog_kallsyms_find() finds the BPF program that page faulted.
         It's done by walking the rb tree.
      2. Then the extable for the given bpf program is binary searched.
      This process is similar to how page faulting is done for kernel modules.
      The exception handler skips over the faulting x86 instruction and
      initializes the destination register with zero. This mimics the exact
      behavior of bpf_probe_read() (when probe_kernel_read() faults, the dest is zeroed).
      
      JITs for other architectures can add support in a similar way.
      Until then they will reject the unknown opcode and fall back to the interpreter.
      
      Since the extable should be aligned and placed near the JITed code,
      make bpf_jit_binary_alloc() return a 4-byte-aligned image offset,
      so that the extable-alignment formula in bpf_int_jit_compile() doesn't need
      to rely on the internal implementation of bpf_jit_binary_alloc().
      On x86 gcc defaults to 16-byte alignment for regular kernel functions
      due to better performance. JITed code may be aligned to 16 in the future,
      but it will use 4 in the meantime.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20191016032505.2089704-10-ast@kernel.org
      3dec541b
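      A hedged sketch of the verifier-side opcode rewrite described above; the condition
      is a hypothetical helper name used only for illustration:

        /* Loads through a pointer to a BTF type are retagged as PROBE_MEM so the
         * JIT knows to emit an extable entry for them (only they can fault). */
        if (load_is_through_btf_pointer(env, insn))     /* illustrative check */
                insn->code = BPF_LDX | BPF_PROBE_MEM | BPF_SIZE(insn->code);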
  11. 05 October 2019 (1 commit)
  12. 02 August 2019 (1 commit)
    • bpf: fix x64 JIT code generation for jmp to 1st insn · 7c2e988f
      Committed by Alexei Starovoitov
      The introduction of bounded loops exposed an old bug in the x64 JIT.
      The JIT maintains an array of offsets to the end of all instructions to
      compute jmp offsets.
      addrs[0] - offset of the end of the 1st insn (that includes the prologue).
      addrs[1] - offset of the end of the 2nd insn.
      The JIT didn't keep the offset of the beginning of the 1st insn,
      since classic BPF didn't have backward jumps and valid extended BPF
      couldn't have a branch to the 1st insn, because it didn't allow loops.
      With bounded loops it's possible to construct a valid program that
      jumps backwards to the 1st insn.
      Fix the JIT by computing:
      addrs[0] - offset of the end of the prologue == start of the 1st insn.
      addrs[1] - offset of the end of the 1st insn.
      
      v1->v2:
      - Yonghong noticed a bug in jit linfo.
        Fix it by passing 'addrs + 1' to bpf_prog_fill_jited_linfo(),
        since it expects the insn_to_jit_off array to contain offsets to the last byte.
      
      Reported-by: syzbot+35101610ff3e83119b1b@syzkaller.appspotmail.com
      Fixes: 2589726d ("bpf: introduce bounded loops")
      Fixes: 0a14842f ("net: filter: Just In Time compiler for x86-64")
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Song Liu <songliubraving@fb.com>
      7c2e988f
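      A hedged sketch of the first-pass estimation under the new indexing convention
      (illustrative, not the literal kernel code): the prologue gets its own slot, so
      addrs[i + insn->off] - addrs[i] stays valid even for a backward jump to the 1st insn.

        /* Rough first-pass estimate: prologue and each insn take <= 64 bytes.
         * addrs[0] = end of prologue, addrs[i] = end of insn i for 1 <= i <= len. */
        for (proglen = 0, i = 0; i <= prog->len; i++) {
                proglen += 64;
                addrs[i] = proglen;
        }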
  13. 15 June 2019 (1 commit)
  14. 05 June 2019 (1 commit)
  15. 27 January 2019 (1 commit)
  16. 10 December 2018 (1 commit)
    • bpf: Add bpf_line_info support · c454a46b
      Committed by Martin KaFai Lau
      This patch adds bpf_line_info support.
      
      It accepts an array of bpf_line_info objects during BPF_PROG_LOAD.
      The "line_info", "line_info_cnt" and "line_info_rec_size" are added
      to the "union bpf_attr".  The "line_info_rec_size" makes
      bpf_line_info extensible in the future.
      
      The new "check_btf_line()" ensures the userspace line_info is valid
      for the kernel to use.
      
      When the verifier is translating/patching the bpf_prog (through
      "bpf_patch_insn_single()"), the line_infos' insn_off is also
      adjusted by the newly added "bpf_adj_linfo()".
      
      If the bpf_prog is jited, this patch also provides the jited addrs (in
      aux->jited_linfo) for the corresponding line_info.insn_off.
      "bpf_prog_fill_jited_linfo()" is added to fill the aux->jited_linfo.
      It is currently called by the x86 jit.  Other jits can also use
      "bpf_prog_fill_jited_linfo()", and that will be done in follow-up patches.
      In the future, if deemed necessary, a particular jit could also provide
      its own "bpf_prog_fill_jited_linfo()" implementation.
      
      A few "*line_info*" fields are added to the bpf_prog_info such
      that the user can get the xlated line_info back (i.e. the line_info
      with its insn_off reflecting the translated prog).  The jited_line_info
      is available if the prog is jited.  It is an array of __u64.
      If the prog is not jited, jited_line_info_cnt is 0.
      
      The verifier's verbose log with line_info will be done in
      a follow up patch.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      c454a46b
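      A hedged userspace sketch of the new attr fields; FILE_NAME_OFF and LINE_OFF are
      placeholder offsets into the program BTF's string section and purely illustrative:

        /* One record: instruction 0 came from line 10, column 3 of some file.
         * line_col packs the line number in the upper bits and the column in the
         * lower 10 bits. */
        struct bpf_line_info linfo[1] = {
                { .insn_off      = 0,
                  .file_name_off = FILE_NAME_OFF,
                  .line_off      = LINE_OFF,
                  .line_col      = (10 << 10) | 3 },
        };

        union bpf_attr attr = {
                /* ... prog_type, insns, insn_cnt, license, prog_btf_fd ... */
                .line_info          = (__u64)(unsigned long)linfo,
                .line_info_cnt      = 1,
                .line_info_rec_size = sizeof(struct bpf_line_info),
        };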
  17. 13 June 2018 (1 commit)
    • treewide: kmalloc() -> kmalloc_array() · 6da2ec56
      Committed by Kees Cook
      The kmalloc() function has a 2-factor argument form, kmalloc_array(). This
      patch replaces cases of:
      
              kmalloc(a * b, gfp)
      
      with:
              kmalloc_array(a, b, gfp)
      
      as well as handling cases of:
      
              kmalloc(a * b * c, gfp)
      
      with:
      
              kmalloc(array3_size(a, b, c), gfp)
      
      as it's slightly less ugly than:
      
              kmalloc_array(array_size(a, b), c, gfp)
      
      This does, however, attempt to ignore constant size factors like:
      
              kmalloc(4 * 1024, gfp)
      
      though any constants defined via macros get caught up in the conversion.
      
      Any factors with a sizeof() of "unsigned char", "char", and "u8" were
      dropped, since they're redundant.
      
      The tools/ directory was manually excluded, since it has its own
      implementation of kmalloc().
      
      The Coccinelle script used for this was:
      
      // Fix redundant parens around sizeof().
      @@
      type TYPE;
      expression THING, E;
      @@
      
      (
        kmalloc(
      -	(sizeof(TYPE)) * E
      +	sizeof(TYPE) * E
        , ...)
      |
        kmalloc(
      -	(sizeof(THING)) * E
      +	sizeof(THING) * E
        , ...)
      )
      
      // Drop single-byte sizes and redundant parens.
      @@
      expression COUNT;
      typedef u8;
      typedef __u8;
      @@
      
      (
        kmalloc(
      -	sizeof(u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(__u8) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(char) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(unsigned char) * (COUNT)
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(u8) * COUNT
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(__u8) * COUNT
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(char) * COUNT
      +	COUNT
        , ...)
      |
        kmalloc(
      -	sizeof(unsigned char) * COUNT
      +	COUNT
        , ...)
      )
      
      // 2-factor product with sizeof(type/expression) and identifier or constant.
      @@
      type TYPE;
      expression THING;
      identifier COUNT_ID;
      constant COUNT_CONST;
      @@
      
      (
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * (COUNT_ID)
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * COUNT_ID
      +	COUNT_ID, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * COUNT_CONST
      +	COUNT_CONST, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * (COUNT_ID)
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * COUNT_ID
      +	COUNT_ID, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * (COUNT_CONST)
      +	COUNT_CONST, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * COUNT_CONST
      +	COUNT_CONST, sizeof(THING)
        , ...)
      )
      
      // 2-factor product, only identifiers.
      @@
      identifier SIZE, COUNT;
      @@
      
      - kmalloc
      + kmalloc_array
        (
      -	SIZE * COUNT
      +	COUNT, SIZE
        , ...)
      
      // 3-factor product with 1 sizeof(type) or sizeof(expression), with
      // redundant parens removed.
      @@
      expression THING;
      identifier STRIDE, COUNT;
      type TYPE;
      @@
      
      (
        kmalloc(
      -	sizeof(TYPE) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(TYPE))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * (COUNT) * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * (COUNT) * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * COUNT * (STRIDE)
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      |
        kmalloc(
      -	sizeof(THING) * COUNT * STRIDE
      +	array3_size(COUNT, STRIDE, sizeof(THING))
        , ...)
      )
      
      // 3-factor product with 2 sizeof(variable), with redundant parens removed.
      @@
      expression THING1, THING2;
      identifier COUNT;
      type TYPE1, TYPE2;
      @@
      
      (
        kmalloc(
      -	sizeof(TYPE1) * sizeof(TYPE2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE1) * sizeof(TYPE2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2))
        , ...)
      |
        kmalloc(
      -	sizeof(THING1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kmalloc(
      -	sizeof(THING1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(THING1), sizeof(THING2))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * COUNT
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      |
        kmalloc(
      -	sizeof(TYPE1) * sizeof(THING2) * (COUNT)
      +	array3_size(COUNT, sizeof(TYPE1), sizeof(THING2))
        , ...)
      )
      
      // 3-factor product, only identifiers, with redundant parens removed.
      @@
      identifier STRIDE, SIZE, COUNT;
      @@
      
      (
        kmalloc(
      -	(COUNT) * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	(COUNT) * (STRIDE) * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	(COUNT) * STRIDE * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	(COUNT) * (STRIDE) * (SIZE)
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      |
        kmalloc(
      -	COUNT * STRIDE * SIZE
      +	array3_size(COUNT, STRIDE, SIZE)
        , ...)
      )
      
      // Any remaining multi-factor products, first at least 3-factor products,
      // when they're not all constants...
      @@
      expression E1, E2, E3;
      constant C1, C2, C3;
      @@
      
      (
        kmalloc(C1 * C2 * C3, ...)
      |
        kmalloc(
      -	(E1) * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kmalloc(
      -	(E1) * (E2) * E3
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kmalloc(
      -	(E1) * (E2) * (E3)
      +	array3_size(E1, E2, E3)
        , ...)
      |
        kmalloc(
      -	E1 * E2 * E3
      +	array3_size(E1, E2, E3)
        , ...)
      )
      
      // And then all remaining 2 factors products when they're not all constants,
      // keeping sizeof() as the second factor argument.
      @@
      expression THING, E1, E2;
      type TYPE;
      constant C1, C2, C3;
      @@
      
      (
        kmalloc(sizeof(THING) * C2, ...)
      |
        kmalloc(sizeof(TYPE) * C2, ...)
      |
        kmalloc(C1 * C2 * C3, ...)
      |
        kmalloc(C1 * C2, ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * (E2)
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(TYPE) * E2
      +	E2, sizeof(TYPE)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * (E2)
      +	E2, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	sizeof(THING) * E2
      +	E2, sizeof(THING)
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	(E1) * E2
      +	E1, E2
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	(E1) * (E2)
      +	E1, E2
        , ...)
      |
      - kmalloc
      + kmalloc_array
        (
      -	E1 * E2
      +	E1, E2
        , ...)
      )
      Signed-off-by: Kees Cook <keescook@chromium.org>
      6da2ec56
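      As a concrete illustration in this file's context (hedged; the exact call site may
      differ), the x86 JIT's per-instruction offset array allocation changes from the
      open-coded multiply to the overflow-checked two-factor form:

        /* before: the multiplication can overflow silently */
        addrs = kmalloc(prog->len * sizeof(*addrs), GFP_KERNEL);

        /* after: kmalloc_array() checks count * size for overflow */
        addrs = kmalloc_array(prog->len, sizeof(*addrs), GFP_KERNEL);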
  18. 04 May 2018 (1 commit)
  19. 03 May 2018 (2 commits)
    • bpf, x64: fix memleak when not converging on calls · 39f56ca9
      Committed by Daniel Borkmann
      The JIT logic in jit_subprogs() is as follows: for all subprogs we
      allocate a bpf_prog_alloc(), populate it (prog->is_func = 1 here),
      and pass it to bpf_int_jit_compile(). If a failure occurred during
      JIT and prog->jited is not set, then we bail out from attempting to
      JIT the whole program, and punt to the interpreter instead. In case
      JITing went successfully, we fix up BPF call offsets and do another
      pass to bpf_int_jit_compile() (extra_pass is true at that point) to
      complete JITing the calls. Given that this requires passing JIT context
      around, addrs and jit_data from the x86 JIT are freed in the extra_pass in
      bpf_int_jit_compile() when calls are involved (if not, they can
      be freed immediately). However, if the JIT image didn't converge in the
      original pass, then we leak addrs and jit_data: image itself is NULL,
      prog->is_func is set and extra_pass is false in that case, meaning both
      become unreachable and are never cleaned up. Therefore we need to free
      them on !image as well. Only the x64 JIT is affected.
      
      Fixes: 1c2a088a ("bpf: x64: add JIT support for multi-function programs")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      39f56ca9
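      A hedged sketch of the fix's shape (the exact condition in the patch may differ
      slightly): the cleanup path must also be taken when no image was produced, not only
      when the prog is not a subprog or we are in the extra pass.

        /* Also free the per-prog JIT state when JITing failed to converge (!image). */
        if (!image || !prog->is_func || extra_pass) {
        out_addrs:
                kfree(addrs);
                kfree(jit_data);
                prog->aux->jit_data = NULL;
        }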
    • bpf, x64: fix memleak when not converging after image · 3aab8884
      Committed by Daniel Borkmann
      While reviewing x64 JIT code, I noticed that we leak the prior allocated
      JIT image in the case where proglen != oldproglen during the JIT passes.
      Prior to the commit e0ee9c12 ("x86: bpf_jit: fix two bugs in eBPF JIT
      compiler") we would just break out of the loop and use the image as the
      JITed prog, since it could only shrink in size anyway. After e0ee9c12,
      we would bail out to the out_addrs label where we free addrs and jit_data but
      not the image coming from bpf_jit_binary_alloc().
      
      Fixes: e0ee9c12 ("x86: bpf_jit: fix two bugs in eBPF JIT compiler")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      3aab8884
  20. 02 May 2018 (1 commit)
    • x86/bpf: Clean up non-standard comments, to make the code more readable · a2c7a983
      Committed by Ingo Molnar
      So by chance I looked into x86 assembly in arch/x86/net/bpf_jit_comp.c and
      noticed the weird and inconsistent comment style it mistakenly learned from
      the networking code:
      
       /* Multi-line comment ...
        * ... looks like this.
        */
      
      Fix this to use the standard comment style specified in Documentation/CodingStyle
      and used in arch/x86/ as well:
      
       /*
        * Multi-line comment ...
        * ... looks like this.
        */
      
      Also, to quote Linus's ... more explicit views about this:
      
        http://article.gmane.org/gmane.linux.kernel.cryptoapi/21066
      
        > But no, the networking code picked *none* of the above sane formats.
        > Instead, it picked these two models that are just half-arsed
        > shit-for-brains:
        >
        >  (no)
        >      /* This is disgusting drug-induced
        >        * crap, and should die
        >        */
        >
        >   (no-no-no)
        >       /* This is also very nasty
        >        * and visually unbalanced */
        >
        > Please. The networking code actually has the *worst* possible comment
        > style. You can literally find that (no-no-no) style, which is just
        > really horribly disgusting and worse than the otherwise fairly similar
        > (d) in pretty much every way.
      
      Also improve the comments and some other details while at it:
      
       - Don't mix same-line and previous-line comment style on otherwise
         identical code patterns within the same function,
      
       - capitalize 'BPF' and x86 register names consistently,
      
       - capitalize sentences consistently,
      
       - instead of 'x64' use 'x86-64': x64 is a Microsoft specific term,
      
       - use more consistent punctuation,
      
       - use standard coding style in macros as well,
      
       - fix typos and a few other minor details.
      
      Consistent coding style is not optional, at least in arch/x86/.
      
      No change in functionality.
      
      ( In case this commit causes conflicts with pending development code
        I'll be glad to help resolve any conflicts! )
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Alexei Starovoitov <ast@fb.com>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: netdev@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      a2c7a983
  21. 27 April 2018 (1 commit)
    • x86/bpf: Clean up non-standard comments, to make the code more readable · 5f26c501
      Committed by Ingo Molnar
      So by chance I looked into x86 assembly in arch/x86/net/bpf_jit_comp.c and
      noticed the weird and inconsistent comment style it mistakenly learned from
      the networking code:
      
       /* Multi-line comment ...
        * ... looks like this.
        */
      
      Fix this to use the standard comment style specified in Documentation/CodingStyle
      and used in arch/x86/ as well:
      
       /*
        * Multi-line comment ...
        * ... looks like this.
        */
      
      Also, to quote Linus's ... more explicit views about this:
      
        http://article.gmane.org/gmane.linux.kernel.cryptoapi/21066
      
        > But no, the networking code picked *none* of the above sane formats.
        > Instead, it picked these two models that are just half-arsed
        > shit-for-brains:
        >
        >  (no)
        >      /* This is disgusting drug-induced
        >        * crap, and should die
        >        */
        >
        >   (no-no-no)
        >       /* This is also very nasty
        >        * and visually unbalanced */
        >
        > Please. The networking code actually has the *worst* possible comment
        > style. You can literally find that (no-no-no) style, which is just
        > really horribly disgusting and worse than the otherwise fairly similar
        > (d) in pretty much every way.
      
      Also improve the comments and some other details while at it:
      
       - Don't mix same-line and previous-line comment style on otherwise
         identical code patterns within the same function,
      
       - capitalize 'BPF' and x86 register names consistently,
      
       - capitalize sentences consistently,
      
       - instead of 'x64' use 'x86-64': x64 is a Microsoft specific term,
      
       - use more consistent punctuation,
      
       - use standard coding style in macros as well,
      
       - fix typos and a few other minor details.
      
      Consistent coding style is not optional, at least in arch/x86/.
      
      No change in functionality.
      
      ( In case this commit causes conflicts with pending development code
        I'll be glad to help resolve any conflicts! )
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: David S. Miller <davem@davemloft.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Alexei Starovoitov <ast@fb.com>
      Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
      Cc: netdev@vger.kernel.org
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      5f26c501
  22. 25 April 2018 (1 commit)
    • bpf, x64: fix JIT emission for dead code · 1612a981
      Committed by Gianluca Borello
      Commit 2a5418a1 ("bpf: improve dead code sanitizing") replaced dead
      code with a series of ja-1 instructions, for safety. That made JIT
      compilation much more complex for some BPF programs. One instance of such
      programs is, for example:
      
      bool flag = false
      ...
      /* A bunch of other code */
      ...
      if (flag)
              do_something()
      
      In some cases llvm is not able to remove the code for do_something() at
      compile time, so the generated BPF program ends up with a large number
      of dead instructions. In one specific real-life example, there are two
      series of ~500 and ~1000 dead instructions in the program. When the
      verifier replaces them with a series of ja-1 instructions, it causes an
      interesting behavior at JIT time.
      
      During the first pass, since all the instructions are estimated at 64
      bytes, the ja-1 instructions end up being translated as 5-byte JMP
      instructions (0xE9), since the jump offsets become increasingly large (>
      127) as each instruction gets discovered to be 5 bytes instead of the
      estimated 64.
      
      Starting from the second pass, the first N instructions of the ja-1
      sequence get translated into 2-byte JMPs (0xEB) because the jump offsets
      become <= 127 this time. In particular, N is defined as roughly 127 / (5
      - 2) ~= 42. So, each further pass will make the subsequent N JMP
      instructions shrink from 5 to 2 bytes, making the image shrink every time.
      This means that in order to have the entire program converge, there need
      to be, in the real example above, at least ~1000 / 42 ~= 24 passes just
      for translating the dead code. If we add this number to the passes needed
      to translate the other non dead code, it brings such program to 40+
      passes, and JIT doesn't complete. Ultimately the userspace loader fails
      because such BPF program was supposed to be part of a prog array owner
      being JITed.
      
      While it is certainly possible to try to refactor such programs to help
      the compiler remove dead code, the behavior is not really intuitive and it
      puts further burden on the BPF developer who is not expecting such
      behavior. To make things worse, such programs are working just fine in all
      the kernel releases prior to the ja-1 fix.
      
      A possible approach to mitigate this behavior consists in noticing that
      for ja-1 instructions we don't really need to rely on the estimated size
      of the previous and current instructions: we know that a -1 BPF jump
      offset can be safely translated into a 0xEB instruction with a jump offset
      of -2.
      
      This fix brings the BPF program in the previous example back to converging
      in ~9 passes.
      
      Fixes: 2a5418a1 ("bpf: improve dead code sanitizing")
      Signed-off-by: Gianluca Borello <g.borello@gmail.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      1612a981
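      A hedged, self-contained sketch of the rule described above (not the literal patch):
      the displacement for a dead-code "ja -1" is pinned to -2 bytes so it always encodes
      as the 2-byte 0xEB form, independent of the per-pass size estimates.

        /* Compute the x86 jump displacement for a BPF_JA instruction.
         * addrs[i] holds the offset of the end of BPF insn i in the image. */
        static int ja_jmp_offset(int bpf_off, const int *addrs, int i)
        {
                if (bpf_off == -1)
                        return -2;      /* emitted as 0xEB 0xFE: 2-byte jmp to itself */
                return addrs[i + bpf_off] - addrs[i];
        }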
  23. 08 March 2018 (1 commit)
    • bpf, x64: increase number of passes · 6007b080
      Committed by Daniel Borkmann
      In Cilium some of the main programs we run today are hitting 9 passes
      on x64's JIT compiler, and we've already had cases where we surpassed
      the limit, in which case the JIT punts the program to the interpreter
      instead, leading to insertion failures due to CONFIG_BPF_JIT_ALWAYS_ON,
      or to insertion failures due to the prog array owner being JITed but the
      program to insert not (both must have the same JITed/non-JITed property).
      
      In one concrete case the program image shrank from 12,767 bytes down to
      10,288 bytes, where the image converged after 16 steps. I've measured
      that this took 340us in the JIT until it converged on my i7-6600U. Thus,
      increase the original limit we had from day one, when the JIT covered
      cBPF only, before we run into cases (similar to the complexity limit)
      where we trip over this and hit program rejections.
      Also add a cond_resched() into the compilation loop; the JIT process
      runs without any locks and may sleep anyway.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      6007b080
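      A hedged sketch of the resulting loop shape; the limit follows the description above
      and jit_alloc_image() is a stand-in for the real image allocation, both illustrative:

        for (pass = 0; pass < 20 || image; pass++) {
                proglen = do_jit(prog, addrs, image, oldproglen, &ctx);
                if (proglen <= 0)
                        goto out;                       /* JIT error: give up */
                if (image)
                        break;                          /* final pass wrote the image */
                if (proglen == oldproglen)              /* size converged: allocate and
                                                         * let the next pass emit code */
                        image = jit_alloc_image(proglen);
                oldproglen = proglen;
                cond_resched();                         /* no locks held; be nice */
        }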
  24. 28 February 2018 (1 commit)
  25. 24 February 2018 (5 commits)