1. 05 Mar, 2020 2 commits
  2. 03 Mar, 2020 1 commit
    • bpf: Introduce pinnable bpf_link abstraction · 70ed506c
      Committed by Andrii Nakryiko
      Introduce the bpf_link abstraction, representing an attachment of a BPF program to
      a BPF hook point (e.g., tracepoint, perf event, etc.). bpf_link encapsulates
      ownership of the attached BPF program and reference counting of the link itself when
      it is referenced from multiple anonymous inodes, and ensures that the release
      callback is called from process context, so that users can safely take
      mutex locks and sleep.
      
      Additionally, the new abstraction makes it possible to generalize pinning
      of a link object in BPF FS. Pinning a link in BPF FS explicitly prevents
      BPF program detachment on process exit and lets an independent process
      open the link later and keep working with it.
      
      Convert two existing bpf_link-like objects (raw tracepoint and tracing BPF
      program attachments) to use the bpf_link framework, making them pinnable
      in BPF FS. More FD-based bpf_links will be added in follow-up patches.
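The ownership and reference-counting idea described above can be sketched in userspace C. This is only a model under stated assumptions: `struct link`, `link_get()`, `link_put()` and `demo_release()` are hypothetical names, not the kernel API, and the sketch does not model the deferral of the release callback to process context.

```c
#include <assert.h>

/* Hypothetical userspace model of a refcounted link; not the kernel API. */
struct link;

struct link_ops {
	void (*release)(struct link *link);	/* detach from the hook */
};

struct link {
	int refcnt;			/* number of holders (FDs, pins) */
	const struct link_ops *ops;
	int attached;			/* 1 while the program is attached */
};

static void link_init(struct link *link, const struct link_ops *ops)
{
	link->refcnt = 1;		/* the creator holds the first reference */
	link->ops = ops;
	link->attached = 1;
}

/* Take an extra reference, e.g. when the link is pinned. */
static void link_get(struct link *link)
{
	link->refcnt++;
}

/* Drop a reference; the last holder triggers release (detachment). */
static void link_put(struct link *link)
{
	assert(link->refcnt > 0);
	if (--link->refcnt == 0)
		link->ops->release(link);
}

static void demo_release(struct link *link)
{
	link->attached = 0;
}

static const struct link_ops demo_ops = { .release = demo_release };
```

In this model a pinned link survives the creator dropping its reference, which is the behaviour pinning is meant to provide.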
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200303043159.323675-2-andriin@fb.com
  3. 28 Feb, 2020 2 commits
    • bpf: INET_DIAG support in bpf_sk_storage · 1ed4d924
      Committed by Martin KaFai Lau
      This patch adds INET_DIAG support to bpf_sk_storage.
      
      1. Although this series adds bpf_sk_storage diag capability to inet sk,
         bpf_sk_storage is in general applicable to all fullsock.  Hence, the
         bpf_sk_storage logic will operate on SK_DIAG_* nlattr.  The caller
         will pass in its specific nesting nlattr (e.g. INET_DIAG_*) as
         the argument.
      
      2. The request will be like:
      	INET_DIAG_REQ_SK_BPF_STORAGES (nla_nest) (defined in a later patch)
      		SK_DIAG_BPF_STORAGE_REQ_MAP_FD (nla_put_u32)
      		SK_DIAG_BPF_STORAGE_REQ_MAP_FD (nla_put_u32)
      		......
      
         Since there could be multiple bpf_sk_storages in a sk,
         instead of reusing INET_DIAG_INFO ("ss -i"), the user can select
         specific bpf_sk_storages to dump by specifying an array of
         SK_DIAG_BPF_STORAGE_REQ_MAP_FD.
      
         If no SK_DIAG_BPF_STORAGE_REQ_MAP_FD is specified (i.e. an empty
         INET_DIAG_REQ_SK_BPF_STORAGES), it will dump all bpf_sk_storages
         of a sk.
      
      3. The reply will be like:
      	INET_DIAG_BPF_SK_STORAGES (nla_nest) (defined in a later patch)
      		SK_DIAG_BPF_STORAGE (nla_nest)
      			SK_DIAG_BPF_STORAGE_MAP_ID (nla_put_u32)
      			SK_DIAG_BPF_STORAGE_MAP_VALUE (nla_reserve_64bit)
      		SK_DIAG_BPF_STORAGE (nla_nest)
      			SK_DIAG_BPF_STORAGE_MAP_ID (nla_put_u32)
      			SK_DIAG_BPF_STORAGE_MAP_VALUE (nla_reserve_64bit)
      		......
      
      4. Unlike other INET_DIAG info of a sk, which is fairly static, the size
         required to dump the bpf_sk_storage(s) of a sk changes as the
         system adds more bpf_sk_storage maps.  It is hard to set a static
         min_dump_alloc size.

         Hence, this series learns the size at runtime and adjusts
         cb->min_dump_alloc as it iterates over all sk(s) of a system.  The
         "unsigned int *res_diag_size" in bpf_sk_storage_diag_put()
         is for this purpose.

         The next patch will update cb->min_dump_alloc as it
         iterates over the sk(s).
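The request layout in item 2 (one nest holding a u32 attribute per map fd) can be sketched as follows. This is a minimal illustration that writes the attributes by hand instead of using a netlink library. The numeric attribute types and the use of the nested flag bit are assumptions made for the example; real code should take them from the UAPI headers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Attribute header as on the netlink wire: 16-bit length, 16-bit type. */
struct attr_hdr {
	uint16_t len;	/* header + payload, excluding padding */
	uint16_t type;
};

#define ATTR_ALIGN(n)	(((n) + 3u) & ~3u)	/* netlink pads to 4 bytes */
#define ATTR_F_NESTED	(1u << 15)		/* same bit as NLA_F_NESTED */

/* Assumed values for illustration only. */
#define REQ_SK_BPF_STORAGES	2	/* stands in for INET_DIAG_REQ_SK_BPF_STORAGES */
#define REQ_MAP_FD		1	/* stands in for SK_DIAG_BPF_STORAGE_REQ_MAP_FD */

/*
 * Write one nest containing a u32 attribute per map fd.
 * Returns the number of bytes written, or 0 if buf is too small.
 * With nr_fds == 0 the nest is empty, i.e. "dump all storages".
 */
static size_t build_storage_req(uint8_t *buf, size_t cap,
				const uint32_t *map_fds, size_t nr_fds)
{
	size_t one = ATTR_ALIGN(sizeof(struct attr_hdr) + sizeof(uint32_t));
	size_t total = sizeof(struct attr_hdr) + nr_fds * one;
	struct attr_hdr hdr;
	size_t off, i;

	assert(buf != NULL);
	if (total > cap || total > UINT16_MAX)
		return 0;

	/* Outer nest: its length covers all inner attributes. */
	hdr.len = (uint16_t)total;
	hdr.type = (uint16_t)(REQ_SK_BPF_STORAGES | ATTR_F_NESTED);
	memcpy(buf, &hdr, sizeof(hdr));
	off = sizeof(hdr);

	for (i = 0; i < nr_fds; i++) {
		hdr.len = (uint16_t)(sizeof(hdr) + sizeof(uint32_t));
		hdr.type = REQ_MAP_FD;
		memcpy(buf + off, &hdr, sizeof(hdr));
		memcpy(buf + off + sizeof(hdr), &map_fds[i], sizeof(uint32_t));
		off += one;
	}
	return off;
}
```

Two map fds produce a 20-byte nest (4-byte outer header plus two 8-byte attributes); an empty nest is just the 4-byte header.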
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Song Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20200225230421.1975729-1-kafai@fb.com
    • bpf: Replace zero-length array with flexible-array member · d7f10df8
      Committed by Gustavo A. R. Silva
      The current codebase makes use of the zero-length array language
      extension to the C90 standard, but the preferred mechanism to declare
      variable-length types such as these is a flexible array member[1][2],
      introduced in C99:
      
      struct foo {
              int stuff;
              struct boo array[];
      };
      
      By making use of the mechanism above, we will get a compiler warning
      in case the flexible array does not occur last in the structure, which
      will help us prevent certain kinds of undefined behavior bugs from being
      inadvertently introduced[3] to the codebase from now on.

      Also, note that dynamic memory allocations won't be affected by
      this change:
      
      "Flexible array members have incomplete type, and so the sizeof operator
      may not be applied. As a quirk of the original implementation of
      zero-length arrays, sizeof evaluates to zero."[1]
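To illustrate why allocations are unaffected, here is a small sketch of allocating a structure that ends in a flexible array member. It reuses the `struct foo` / `struct boo` names from the example above; the `value` field, the use of `stuff` as an element count, and the helper `foo_alloc()` are assumptions added for the illustration.

```c
#include <assert.h>
#include <stdlib.h>

struct boo {
	int value;
};

struct foo {
	int stuff;		/* here: number of entries in array[] */
	struct boo array[];	/* flexible array member, must be last */
};

/* Allocate a foo with room for n trailing elements; caller frees. */
static struct foo *foo_alloc(int n)
{
	struct foo *f;

	assert(n >= 0);
	/* sizeof(*f) covers the header only, so add the array explicitly. */
	f = calloc(1, sizeof(*f) + (size_t)n * sizeof(f->array[0]));
	if (f)
		f->stuff = n;
	return f;
}
```

The size expression is the same one that code using a zero-length array would have written. In kernel code, the struct_size() helper performs this computation with overflow checking.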
      
      This issue was found with the help of Coccinelle.
      
      [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
      [2] https://github.com/KSPP/linux/issues/21
      [3] commit 76497732 ("cxgb3/l2t: Fix undefined behaviour")
      Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Song Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20200227001744.GA3317@embeddedor
  4. 25 Feb, 2020 2 commits
  5. 29 Jan, 2020 1 commit
  6. 25 Jan, 2020 1 commit
  7. 23 Jan, 2020 2 commits
    • bpf: Add BPF_FUNC_jiffies64 · 5576b991
      Committed by Martin KaFai Lau
      This patch adds a helper to read the 64-bit jiffies.  It will be used
      in a later patch to implement bpf_cubic.c.

      The helper is inlined when jit_requested is set and BITS_PER_LONG is 64,
      as is done for map_gen_lookup().  Other cases could be considered together
      with map_gen_lookup() if needed.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200122233646.903260-1-kafai@fb.com
    • bpf: Introduce dynamic program extensions · be8704ff
      Committed by Alexei Starovoitov
      Introduce dynamic program extensions. The users can load additional BPF
      functions and replace global functions in previously loaded BPF programs while
      these programs are executing.
      
      Global functions are verified individually by the verifier based on their types only.
      Hence a global function in the new program whose types match those of an older
      function can safely replace that function.
      
      This new function/program is called 'an extension' of the old program. At load time
      the verifier uses the (attach_prog_fd, attach_btf_id) pair to identify the function
      to be replaced. The extension program inherits its BPF program type from the
      target program. Technically, bpf_verifier_ops is copied from the target program.
      The BPF_PROG_TYPE_EXT program type is a placeholder. It has empty verifier_ops.
      The extension program can call the same bpf helper functions as the target program.
      A single BPF_PROG_TYPE_EXT type is used to extend XDP, SKB and all other program
      types. The verifier allows only one level of replacement, meaning that an
      extension program cannot recursively extend an extension. That also means that
      the maximum stack size increases from 512 to 1024 bytes and the maximum
      function nesting level from 8 to 16. The programs don't always consume that
      much. The stack usage is determined by the number of on-stack variables used by
      the program. The verifier could have enforced the 512 limit for the combined original
      plus extension program, but that would make for a difficult user experience. The main
      use case for extensions is to provide a generic mechanism to plug external
      programs into a policy program or function call chaining.
      
      BPF trampoline is used to track both fentry/fexit and program extensions
      because both are using the same nop slot at the beginning of every BPF
      function. Attaching fentry/fexit to a function that was replaced is not
      allowed. The opposite is true as well: replacing a function that is currently
      being analyzed with fentry/fexit is not allowed. The executable page allocated
      by BPF trampoline is not used by program extensions. This inefficiency will be
      optimized in future patches.
      
      Function-by-function verification of global functions supports scalars and
      pointers to context only. Hence program extensions are supported for this class
      of global functions only. In the future the verifier will be extended with
      support for pointers to structures, arrays with sizes, etc.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20200121005348.2769920-2-ast@kernel.org
  8. 17 Jan, 2020 1 commit
    • xdp: Use bulking for non-map XDP_REDIRECT and consolidate code paths · 1d233886
      Committed by Toke Høiland-Jørgensen
      Since the bulk queue used by XDP_REDIRECT now lives in struct net_device,
      we can re-use the bulking for the non-map version of the bpf_redirect()
      helper. This is a simple matter of having xdp_do_redirect_slow() queue the
      frame on the bulk queue instead of sending it out with __bpf_tx_xdp().
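The bulking idea can be sketched in a few lines of userspace C: frames are queued and handed to the device in batches instead of one at a time. This is a model only; `struct bulk_queue`, `bq_enqueue()`, `bq_flush()` and the batch size of 16 are assumptions for the illustration, and the "device" is simulated by counters.

```c
#include <assert.h>

#define BULK_SIZE 16	/* assumed batch size for this sketch */

struct bulk_queue {
	const void *frames[BULK_SIZE];
	unsigned int count;	/* frames waiting for the next flush */
	unsigned int sent;	/* frames handed to the device so far */
	unsigned int flushes;	/* number of transmit calls */
};

/* Hand every queued frame to the (simulated) device in one call. */
static void bq_flush(struct bulk_queue *bq)
{
	if (!bq->count)
		return;
	bq->sent += bq->count;
	bq->flushes++;
	bq->count = 0;
}

/* Queue one frame; flush first if the queue is already full. */
static void bq_enqueue(struct bulk_queue *bq, const void *frame)
{
	if (bq->count == BULK_SIZE)
		bq_flush(bq);
	assert(bq->count < BULK_SIZE);
	bq->frames[bq->count++] = frame;
}
```

Redirecting 20 frames this way costs two transmit calls (one when the queue fills, one at the final flush) instead of 20.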
      
      Unfortunately we can't make the bpf_redirect() helper return an error if
      the ifindex doesn't exist (as bpf_redirect_map() does), because we don't
      have a reference to the network namespace of the ingress device at the time
      the helper is called. So we have to leave it as-is and keep the device
      lookup in xdp_do_redirect_slow().
      
      Since this leaves less reason to have the non-map redirect code in a
      separate function, we get rid of the xdp_do_redirect_slow() function
      entirely. This does lose us the tracepoint disambiguation, but fortunately
      the xdp_redirect and xdp_redirect_map tracepoints use the same tracepoint
      entry structures. This means both can contain a map index, so we can just
      amend the tracepoint definitions so we always emit the xdp_redirect(_err)
      tracepoints, but with the map ID only populated if a map is present. This
      means we retire the xdp_redirect_map(_err) tracepoints entirely, but keep
      the definitions around in case someone is still listening for them.
      
      With this change, the performance of the xdp_redirect sample program goes
      from 5Mpps to 8.4Mpps (a 68% increase).
      
      Since the flush functions are no longer map-specific, rename the flush()
      functions to drop _map from their names. One of the renamed functions is
      the xdp_do_flush_map() callback used in all the xdp-enabled drivers. To
      keep from having to update all drivers, use a #define to keep the old name
      working, and only update the virtual drivers in this patch.
      Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/157918768505.1458396.17518057312953572912.stgit@toke.dk
  9. 16 Jan, 2020 3 commits
    • bpf: Add batch ops to all htab bpf map · 05799638
      Committed by Yonghong Song
      htab can't use the generic batch support due to some problematic behaviours
      inherent to the data structure: while iterating the bpf map, a
      concurrent program might delete the next entry that the batch was about to
      use, and in that case there's no easy way to retrieve the next entry.
      The issue has been discussed multiple times (see [1] and [2]).

      The only way htab can be traversed without the problem described
      above is by making sure that the traversal covers entire buckets.
      This commit implements those strict requirements for htab. The
      implementation follows the same interaction as the generic support, with
      some exceptions:
      
       - If the keys/values buffers are not big enough to hold a whole bucket,
         ENOSPC will be returned.
       - out_batch contains the value of the next bucket in the iteration, not
         the next key, but this is transparent to the user since the user
         should never use out_batch for anything other than bpf batch syscalls.
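The two exceptions above can be modelled with a toy hash table that is only ever traversed one whole bucket at a time. Everything here (`struct toy_htab`, `toy_lookup_batch()`, the fixed bucket count) is hypothetical and single-threaded; it is meant to show the ENOSPC and bucket-cursor semantics, not the kernel implementation.

```c
#include <assert.h>
#include <errno.h>

#define NR_BUCKETS	4
#define BUCKET_CAP	4

struct toy_bucket {
	unsigned int nr;
	int keys[BUCKET_CAP];
	int values[BUCKET_CAP];
};

struct toy_htab {
	struct toy_bucket buckets[NR_BUCKETS];
};

/*
 * Copy whole buckets starting at bucket *batch into keys/values.
 * On entry *count is the buffer capacity in elements; on return it is the
 * number of elements copied and *batch is the next bucket to visit.
 * Returns 0, -ENOSPC if the first bucket does not fit, or -ENOENT once
 * the last bucket has been visited.
 */
static int toy_lookup_batch(const struct toy_htab *h, unsigned int *batch,
			    int *keys, int *values, unsigned int *count)
{
	unsigned int cap = *count, copied = 0, b = *batch;

	assert(b <= NR_BUCKETS);
	for (; b < NR_BUCKETS; b++) {
		const struct toy_bucket *bkt = &h->buckets[b];

		if (copied + bkt->nr > cap) {
			if (copied == 0) {
				*count = 0;
				return -ENOSPC;	/* cannot split a bucket */
			}
			break;			/* resume here next call */
		}
		for (unsigned int i = 0; i < bkt->nr; i++) {
			keys[copied] = bkt->keys[i];
			values[copied] = bkt->values[i];
			copied++;
		}
	}
	*batch = b;
	*count = copied;
	return b == NR_BUCKETS ? -ENOENT : 0;
}
```

Because the cursor is a bucket index and not a key, a concurrent delete of "the next key" cannot invalidate it, which is the point of the bucket-wise design.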
      
      This commit implements BPF_MAP_LOOKUP_BATCH and adds support for the new
      command BPF_MAP_LOOKUP_AND_DELETE_BATCH. Note that for update/delete
      batch ops it is possible to use the generic implementations.
      
      [1] https://lore.kernel.org/bpf/20190724165803.87470-1-brianvv@google.com/
      [2] https://lore.kernel.org/bpf/20190906225434.3635421-1-yhs@fb.com/
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Brian Vazquez <brianvv@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200115184308.162644-6-brianvv@google.com
    • bpf: Add generic support for update and delete batch ops · aa2e93b8
      Committed by Brian Vazquez
      This commit adds generic support for update and delete batch ops that
      can be used for almost all the bpf maps. These commands share the same
      UAPI attr that lookup and lookup_and_delete batch ops use and the
      syscall commands are:
      
        BPF_MAP_UPDATE_BATCH
        BPF_MAP_DELETE_BATCH
      
      The main difference between update/delete and lookup batch ops is that
      for update/delete the keys/values must be specified by userspace and,
      because of that, neither in_batch nor out_batch is used.
      Suggested-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Brian Vazquez <brianvv@google.com>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200115184308.162644-4-brianvv@google.com
    • bpf: Add generic support for lookup batch op · cb4d03ab
      Committed by Brian Vazquez
      This commit introduces generic support for the bpf_map_lookup_batch.
      This implementation can be used by almost all the bpf maps since its core
      implementation is relying on the existing map_get_next_key and
      map_lookup_elem. The bpf syscall subcommand introduced is:
      
        BPF_MAP_LOOKUP_BATCH
      
      The UAPI attribute is:
      
        struct { /* struct used by BPF_MAP_*_BATCH commands */
               __aligned_u64   in_batch;       /* start batch,
                                                * NULL to start from beginning
                                                */
               __aligned_u64   out_batch;      /* output: next start batch */
               __aligned_u64   keys;
               __aligned_u64   values;
               __u32           count;          /* input/output:
                                                * input: # of key/value
                                                * elements
                                                * output: # of filled elements
                                                */
               __u32           map_fd;
               __u64           elem_flags;
               __u64           flags;
        } batch;
      
      in_batch/out_batch are opaque values used to communicate between
      user and kernel space; in_batch/out_batch must be of key_size length.
      
      To start iterating from the beginning, in_batch must be null.
      count is the # of key/value elements to retrieve. Note that the 'keys'
      buffer must be a buffer of key_size * count size and the 'values' buffer
      must be value_size * count, where value_size must be aligned to 8 bytes
      by userspace if it's dealing with percpu maps. 'count' will contain the
      number of keys/values successfully retrieved. Note that 'count' is an
      input/output variable and it can contain a lower value after a call.
      
      If there are no more entries to retrieve, ENOENT will be returned. If the error
      is ENOENT, count might be > 0 in case it copied some values but there were
      no more entries to retrieve.
      
      Note that if the return code is an error and not -EFAULT,
      count indicates the number of elements successfully processed.
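The count and ENOENT semantics described above can be sketched with a toy map whose batch lookup is built from a get-next-key step and a lookup step, as the commit describes. All names (`struct toy_map`, `gen_lookup_batch()`, and so on) are hypothetical, keys are plain ints, and using the last returned key as the batch cursor is an assumption of this model; callers should treat the cursor as opaque.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct toy_map {
	size_t nr;
	const int *keys;	/* iteration order of the map */
	const int *values;
};

/* Return the key following *prev (or the first key if prev is NULL). */
static int toy_get_next_key(const struct toy_map *m, const int *prev, int *next)
{
	size_t i = 0;

	if (prev) {
		while (i < m->nr && m->keys[i] != *prev)
			i++;
		i++;	/* step past prev */
	}
	if (i >= m->nr)
		return -ENOENT;
	*next = m->keys[i];
	return 0;
}

static int toy_lookup_elem(const struct toy_map *m, int key, int *value)
{
	for (size_t i = 0; i < m->nr; i++) {
		if (m->keys[i] == key) {
			*value = m->values[i];
			return 0;
		}
	}
	return -ENOENT;
}

/*
 * in_batch == NULL starts from the beginning. *count is the buffer
 * capacity on input and the number of filled elements on output.
 */
static int gen_lookup_batch(const struct toy_map *m, const int *in_batch,
			    int *out_batch, int *keys, int *values,
			    unsigned int *count)
{
	unsigned int max = *count, n = 0;
	const int *prev = in_batch;
	int key, err = 0;

	assert(keys && values && out_batch);
	while (n < max) {
		err = toy_get_next_key(m, prev, &key);
		if (err)
			break;		/* -ENOENT: map exhausted */
		err = toy_lookup_elem(m, key, &values[n]);
		if (err)
			break;
		keys[n] = key;
		prev = &keys[n];
		n++;
	}
	if (n)
		*out_batch = keys[n - 1];	/* cursor for the next call */
	*count = n;
	return err;
}
```

With five elements and a capacity of three, the first call returns 0 with count 3, and the second returns -ENOENT with count 2, so a caller must consume the returned elements even when the call reports ENOENT.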
      Suggested-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Brian Vazquez <brianvv@google.com>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200115184308.162644-3-brianvv@google.com
  10. 11 Jan, 2020 1 commit
    • bpf: Introduce function-by-function verification · 51c39bb1
      Committed by Alexei Starovoitov
      New llvm and old llvm with libbpf help produce BTF that distinguishes global and
      static functions. Unlike the arguments of a static function, the arguments of global
      functions cannot be removed or optimized away by llvm. The compiler has to use
      exactly the arguments specified in a function prototype. The argument type
      information allows the verifier to validate each global function independently.
      For now the only supported argument types are pointer to context and scalars. In
      the future pointers to structures, sizes, pointer to packet data can be
      supported as well. Consider the following example:
      
      static int f1(int ...)
      {
        ...
      }
      
      int f3(int b);
      
      int f2(int a)
      {
        f1(a) + f3(a);
      }
      
      int f3(int b)
      {
        ...
      }
      
      int main(...)
      {
        f1(...) + f2(...) + f3(...);
      }
      
      The verifier will start its safety checks from the first global function f2().
      It will recursively descend into f1() because it's static. Then it will check
      that arguments match for the f3() invocation inside f2(). It will not descend
      into f3(). It will finish f2() that has to be successfully verified for all
      possible values of 'a'. Then it will proceed with f3(). That function also has
      to be safe for all possible values of 'b'. Then it will start subprog 0 (which
      is main() function). It will recursively descend into f1() and will skip full
      check of f2() and f3(), since they are global. The order of processing global
      functions doesn't affect safety, since all global functions must be proven safe
      based on their arguments only.
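The traversal order described above can be mimicked by a toy walker over the example's call graph. It does no safety checking at all; it only counts how often each function body is analyzed, under the assumption that static callees are descended into and global callees are not. The names (`struct toy_func`, `walk_body()`, `verify_all()`) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum { F_MAIN, F1, F2, F3, NR_FUNCS };

struct toy_func {
	bool is_global;
	int nr_callees;
	int callees[3];
};

/* Call graph of the example: f1 is static, f2 and f3 are global. */
static const struct toy_func funcs[NR_FUNCS] = {
	[F_MAIN] = { false, 3, { F1, F2, F3 } },
	[F1]     = { false, 0, { 0 } },
	[F2]     = { true,  2, { F1, F3 } },
	[F3]     = { true,  0, { 0 } },
};

static int walks[NR_FUNCS];	/* how often each body was analyzed */

/* Analyze one body; descend only into static callees. */
static void walk_body(int f)
{
	assert(f >= 0 && f < NR_FUNCS);
	walks[f]++;
	for (int i = 0; i < funcs[f].nr_callees; i++) {
		int callee = funcs[f].callees[i];

		if (!funcs[callee].is_global)
			walk_body(callee);
		/* global callee: only the argument types would be checked */
	}
}

/* Globals are verified first, each on its own, then main (subprog 0). */
static void verify_all(void)
{
	for (int f = 0; f < NR_FUNCS; f++)
		if (funcs[f].is_global)
			walk_body(f);
	walk_body(F_MAIN);
}
```

In this model f1() is analyzed twice (once from f2(), once from main()), while f2() and f3() are each analyzed exactly once no matter how many callers they have.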
      
      Such function-by-function verification can drastically improve the speed of
      verification and reduce complexity.

      Note that the stack limit of 512 still applies to the call chain regardless of whether
      functions were static or global. The nested level of 8 also still applies. The
      same recursion prevention checks are in place as well.
      
      The type information and static/global kind are preserved after the verification,
      hence in the above example global functions f2() and f3() can be replaced later
      by equivalent functions with the same types that are loaded and verified later
      without affecting safety of this main() program. Such replacement (re-linking)
      of global functions is a subject of future patches.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Song Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20200110064124.1760511-3-ast@kernel.org
  11. 10 Jan, 2020 2 commits
    • bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS · 85d33df3
      Committed by Martin KaFai Lau
      The patch introduces BPF_MAP_TYPE_STRUCT_OPS.  The map value
      is a kernel struct with its func ptr implemented in bpf prog.
      This new map is the interface to register/unregister/introspect
      a bpf implemented kernel struct.
      
      The kernel struct is actually embedded inside another new struct
      (called the "value" struct in the code).  For example,
      "struct tcp_congestion_ops" is embedded in:
      struct bpf_struct_ops_tcp_congestion_ops {
      	refcount_t refcnt;
      	enum bpf_struct_ops_state state;
      	struct tcp_congestion_ops data;  /* <-- kernel subsystem struct here */
      };
      The map value is "struct bpf_struct_ops_tcp_congestion_ops".
      The "bpftool map dump" will then be able to show the
      state ("inuse"/"tobefree") and the number of subsystem's refcnt (e.g.
      number of tcp_sock in the tcp_congestion_ops case).  This "value" struct
      is created automatically by a macro.  Having a separate "value" struct
      will also make extending "struct bpf_struct_ops_XYZ" easier (e.g. adding
      "void (*init)(void)" to "struct bpf_struct_ops_XYZ" to do some
      initialization work before registering the struct_ops to the kernel
      subsystem).  The libbpf will take care of finding and populating the
      "struct bpf_struct_ops_XYZ" from "struct XYZ".
      
      Register a struct_ops to a kernel subsystem:
      1. Load all needed BPF_PROG_TYPE_STRUCT_OPS prog(s)
      2. Create a BPF_MAP_TYPE_STRUCT_OPS with attr->btf_vmlinux_value_type_id
         set to the btf id "struct bpf_struct_ops_tcp_congestion_ops" of the
         running kernel.
         Instead of reusing the attr->btf_value_type_id,
         btf_vmlinux_value_type_id is added such that attr->btf_fd can still be
         used as the "user" btf which could store other useful sysadmin/debug
         info that may be introduced in the future,
         e.g. creation-date/compiler-details/map-creator...etc.
      3. Create a "struct bpf_struct_ops_tcp_congestion_ops" object as described
         in the running kernel btf.  Populate the value of this object.
         The function ptr should be populated with the prog fds.
      4. Call BPF_MAP_UPDATE with the object created in (3) as
         the map value.  The key is always "0".
      
      During BPF_MAP_UPDATE, the code that saves the kernel-func-ptr's
      args as an array of u64 is generated.  BPF_MAP_UPDATE also allows
      the specific struct_ops to do some final checks in "st_ops->init_member()"
      (e.g. ensure all mandatory func ptrs are implemented).
      If everything looks good, it will register this kernel struct
      to the kernel subsystem.  The map will not allow further update
      from this point.
      
      Unregister a struct_ops from the kernel subsystem:
      BPF_MAP_DELETE with key "0".
      
      Introspect a struct_ops:
      BPF_MAP_LOOKUP_ELEM with key "0".  The map value returned will
      have the prog _id_ populated as the func ptr.
      
      The map value state (enum bpf_struct_ops_state) will transit from:
      INIT (map created) =>
      INUSE (map updated, i.e. reg) =>
      TOBEFREE (map value deleted, i.e. unreg)
      
      The kernel subsystem needs to call bpf_struct_ops_get() and
      bpf_struct_ops_put() to manage the "refcnt" in the
      "struct bpf_struct_ops_XYZ".  This patch uses a separate refcnt
      for the purpose of tracking the subsystem usage.  Another approach
      is to reuse the map->refcnt and then "show" (i.e. during map_lookup)
      the subsystem's usage by doing map->refcnt - map->usercnt to filter out
      the map-fd/pinned-map usage.  However, that will also tie down the
      future semantics of map->refcnt and map->usercnt.
      
      The very first subsystem's refcnt (during reg()) holds one
      count to map->refcnt.  When the very last subsystem's refcnt
      is gone, it will also release the map->refcnt.  All bpf_prog will be
      freed when the map->refcnt reaches 0 (i.e. during map_free()).
      
      Here is what the bpftool map command output looks like:
      [root@arch-fb-vm1 bpf]# bpftool map show
      6: struct_ops  name dctcp  flags 0x0
      	key 4B  value 256B  max_entries 1  memlock 4096B
      	btf_id 6
      [root@arch-fb-vm1 bpf]# bpftool map dump id 6
      [{
              "value": {
                  "refcnt": {
                      "refs": {
                          "counter": 1
                      }
                  },
                  "state": 1,
                  "data": {
                      "list": {
                          "next": 0,
                          "prev": 0
                      },
                      "key": 0,
                      "flags": 2,
                      "init": 24,
                      "release": 0,
                      "ssthresh": 25,
                      "cong_avoid": 30,
                      "set_state": 27,
                      "cwnd_event": 28,
                      "in_ack_event": 26,
                      "undo_cwnd": 29,
                      "pkts_acked": 0,
                      "min_tso_segs": 0,
                      "sndbuf_expand": 0,
                      "cong_control": 0,
                      "get_info": 0,
                      "name": [98,112,102,95,100,99,116,99,112,0,0,0,0,0,0,0
                      ],
                      "owner": 0
                  }
              }
          }
      ]
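The "name" field in the dump is printed as an array of byte values. A short sketch of turning those 16 bytes back into a string is shown below; the helper name `decode_name()` is made up for the example.

```c
#include <assert.h>
#include <string.h>

/* The "name" array from the dump above, one byte per character. */
static const unsigned char name_bytes[16] = {
	98, 112, 102, 95, 100, 99, 116, 99, 112, 0, 0, 0, 0, 0, 0, 0
};

/* Copy the NUL-padded bytes into out (at least 17 bytes) as a C string. */
static const char *decode_name(char out[17])
{
	assert(out != NULL);
	memcpy(out, name_bytes, sizeof(name_bytes));
	out[16] = '\0';	/* guard in case all 16 bytes are used */
	return out;
}
```

Decoded as ASCII, the bytes spell "bpf_dctcp".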
      
      Misc Notes:
      * bpf_struct_ops_map_sys_lookup_elem() is added for syscall lookup.
        It does an in-place update on "*value" instead of returning a pointer
        to syscall.c.  Otherwise, it needs a separate copy of "zero" value
        for the BPF_STRUCT_OPS_STATE_INIT to avoid races.

      * The bpf_struct_ops_map_delete_elem() is also called without
        preempt_disable() from map_delete_elem().  This is because
        the "->unreg()" may require a sleepable context, e.g.
        the "tcp_unregister_congestion_control()".
      
      * "const" is added to some of the existing "struct btf_func_model *"
        function arg to avoid a compiler warning caused by this patch.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20200109003505.3855919-1-kafai@fb.com
    • bpf: Introduce BPF_PROG_TYPE_STRUCT_OPS · 27ae7997
      Committed by Martin KaFai Lau
      This patch allows the kernel's struct ops (i.e. func ptr) to be
      implemented in BPF.  The first use case in this series is the
      "struct tcp_congestion_ops" which will be introduced in a
      later patch.
      
      This patch introduces a new prog type BPF_PROG_TYPE_STRUCT_OPS.
      The BPF_PROG_TYPE_STRUCT_OPS prog is verified against a particular
      func ptr of a kernel struct.  The attr->attach_btf_id is the btf id
      of a kernel struct.  The attr->expected_attach_type is the member
      "index" of that kernel struct.  The first member of a struct starts
      with member index 0.  That will avoid ambiguity when a kernel struct
      has multiple func ptrs with the same func signature.
      
      For example, a BPF_PROG_TYPE_STRUCT_OPS prog is written
      to implement the "init" func ptr of the "struct tcp_congestion_ops".
      The attr->attach_btf_id is the btf id of the "struct tcp_congestion_ops"
      of the _running_ kernel.  The attr->expected_attach_type is 3.
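The member index used as expected_attach_type is simply the zero-based position of the member in the struct. The sketch below looks it up in a hard-coded list of member names, taken from the order in which bpftool prints the "data" struct elsewhere on this page; that list and the helper `member_index()` are assumptions for illustration, and a real loader would read the layout from the running kernel's BTF.

```c
#include <assert.h>
#include <string.h>

/* Member order as printed in the bpftool dump of the "data" struct. */
static const char *const cong_ops_members[] = {
	"list", "key", "flags", "init", "release", "ssthresh", "cong_avoid",
	"set_state", "cwnd_event", "in_ack_event", "undo_cwnd", "pkts_acked",
	"min_tso_segs", "sndbuf_expand", "cong_control", "get_info", "name",
	"owner",
};

/* Return the zero-based member index, or -1 if the name is unknown. */
static int member_index(const char *name)
{
	size_t n = sizeof(cong_ops_members) / sizeof(cong_ops_members[0]);

	assert(name != NULL);
	for (size_t i = 0; i < n; i++)
		if (strcmp(cong_ops_members[i], name) == 0)
			return (int)i;
	return -1;
}
```

With this ordering, "init" is the fourth member, giving index 3 as stated in the text.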
      
      The ctx of BPF_PROG_TYPE_STRUCT_OPS is an array of u64 args saved
      by arch_prepare_bpf_trampoline that will be done in the next
      patch when introducing BPF_MAP_TYPE_STRUCT_OPS.
      
      "struct bpf_struct_ops" is introduced as a common interface for the kernel
      struct that supports BPF_PROG_TYPE_STRUCT_OPS prog.  The supporting kernel
      struct will need to implement an instance of the "struct bpf_struct_ops".
      
      The supporting kernel struct also needs to implement a bpf_verifier_ops.
      During BPF_PROG_LOAD, bpf_struct_ops_find() will find the right
      bpf_verifier_ops by searching the attr->attach_btf_id.
      
      A new "btf_struct_access" is also added to the bpf_verifier_ops such
      that the supporting kernel struct can optionally provide its own specific
      check on accessing the func arg (e.g. provide limited write access).
      
      After btf_vmlinux is parsed, the new bpf_struct_ops_init() is called
      to initialize some values (e.g. the btf id of the supporting kernel
      struct) and it can only be done once the btf_vmlinux is available.
      
      The R0 checks at BPF_EXIT are skipped for the BPF_PROG_TYPE_STRUCT_OPS prog
      if the return type of the prog->aux->attach_func_proto is "void".
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20200109003503.3855825-1-kafai@fb.com
  12. 20 Dec, 2019 2 commits
  13. 17 Dec, 2019 1 commit
  14. 14 Dec, 2019 4 commits
  15. 12 Dec, 2019 1 commit
  16. 25 Nov, 2019 6 commits
  17. 21 Nov, 2019 1 commit
  18. 18 Nov, 2019 3 commits
    • bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY · fc970227
      Committed by Andrii Nakryiko
      Add the ability to memory-map the contents of a BPF array map. This is extremely useful
      for working with BPF global data from userspace programs. It avoids the
      typical bpf_map_{lookup,update}_elem operations, improving both performance
      and usability.
      
      Special consideration had to be given to map freezing, to avoid having a
      writable memory view into a frozen map. To solve this issue, map freezing and
      mmap-ing now happen under a mutex:
        - if map is already frozen, no writable mapping is allowed;
        - if map has writable memory mappings active (accounted in map->writecnt),
          map freezing will keep failing with -EBUSY;
        - once number of writable memory mappings drops to zero, map freezing can be
          performed again.
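The three rules above can be modelled as follows. This is a single-threaded sketch, so the mutex is omitted; `struct toy_map` and the `toy_*` functions are hypothetical, and the error code for a refused writable mapping is an assumption (the text only names -EBUSY for the freeze case).

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct toy_map {
	bool frozen;
	int writecnt;	/* active writable mappings */
};

/* Map the toy map; writable mappings are refused once it is frozen. */
static int toy_mmap(struct toy_map *m, bool writable)
{
	if (writable) {
		if (m->frozen)
			return -EPERM;	/* errno chosen for the sketch */
		m->writecnt++;
	}
	return 0;
}

/* Undo one mapping created by toy_mmap(). */
static void toy_munmap(struct toy_map *m, bool writable)
{
	if (writable) {
		assert(m->writecnt > 0);
		m->writecnt--;
	}
}

/* Freezing fails with -EBUSY while any writable mapping exists. */
static int toy_freeze(struct toy_map *m)
{
	if (m->writecnt > 0)
		return -EBUSY;
	m->frozen = true;
	return 0;
}
```

Read-only mappings stay possible after freezing, since they cannot create a writable view into the frozen map.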
      
      Only non-per-CPU plain arrays are supported right now. Maps with spinlocks
      can't be memory mapped either.
      
      For a BPF_F_MMAPABLE array, memory allocation has to be done through vmalloc()
      to be mmap()'able. We also need to make sure that array data memory is
      page-sized and page-aligned, so we over-allocate memory in such a way that
      struct bpf_array is at the end of a single page of memory with array->value
      being aligned with the start of the second page. On deallocation we need to
      accommodate this memory arrangement to free vmalloc()'ed memory correctly.
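The memory arrangement can be sketched in userspace with aligned_alloc() standing in for vmalloc(). The header is placed at the end of the first page so that the data begins exactly at the second page, and the original allocation address is recovered by rounding the header address down to a page boundary. `struct toy_array`, the helper names and the fixed 4096-byte page size are assumptions of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_PAGE 4096u	/* assumed page size */

struct toy_array {
	uint32_t max_entries;
	uint64_t value[];	/* data, must start on a page boundary */
};

static size_t round_up_page(size_t n)
{
	return (n + TOY_PAGE - 1) & ~((size_t)TOY_PAGE - 1);
}

/* Allocate header + data so that value[] starts on the second page. */
static struct toy_array *toy_array_alloc(uint32_t max_entries)
{
	size_t data = round_up_page((size_t)max_entries * sizeof(uint64_t));
	unsigned char *mem = aligned_alloc(TOY_PAGE, TOY_PAGE + data);
	struct toy_array *arr;

	assert(offsetof(struct toy_array, value) <= TOY_PAGE);
	if (!mem)
		return NULL;
	/* Header sits at the tail of page 0; value[] begins at page 1. */
	arr = (struct toy_array *)(mem + TOY_PAGE -
				   offsetof(struct toy_array, value));
	arr->max_entries = max_entries;
	return arr;
}

/* Recover the pointer returned by the allocator from the header address. */
static void *toy_array_base(struct toy_array *arr)
{
	return (void *)((uintptr_t)arr & ~((uintptr_t)TOY_PAGE - 1));
}
```

Freeing must go through `toy_array_base()`, because the header pointer itself is not the address the allocator handed out.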
      
      One important consideration concerns how the memory-mapping subsystem functions.
      The memory-mapping subsystem provides a few optional callbacks, among them open()
      and close().  close() is called for each memory region that is unmapped, so
      that users can decrease their reference counters and free up resources, if
      necessary. open() is *almost* symmetrical: it's called for each memory region
      that is being mapped, **except** the very first one. So bpf_map_mmap does the
      initial refcnt bump, while open() will do any extra ones after that. Thus the
      number of close() calls is equal to the number of open() calls plus one.
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Song Liu <songliubraving@fb.com>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Link: https://lore.kernel.org/bpf/20191117172806.2195367-4-andriin@fb.com
    • bpf: Convert bpf_prog refcnt to atomic64_t · 85192dbf
      Committed by Andrii Nakryiko
      Similarly to bpf_map's refcnt/usercnt, convert bpf_prog's refcnt to atomic64
      and remove the artificial 32k limit. This makes bpf_prog's refcounting
      non-failing, simplifying the logic of users of bpf_prog_add/bpf_prog_inc.
      
      Validated compilation by running allyesconfig kernel build.
      Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Link: https://lore.kernel.org/bpf/20191117172806.2195367-3-andriin@fb.com
      85192dbf
    • A
      bpf: Switch bpf_map ref counter to atomic64_t so bpf_map_inc() never fails · 1e0bd5a0
      Andrii Nakryiko committed
      92117d84 ("bpf: fix refcnt overflow") turned refcounting of bpf_map into a
      potentially failing operation, when the refcount reaches the BPF_MAX_REFCNT
      limit (32k). The limit exists because, with a 32-bit counter, it's possible in
      practice to overflow the refcounter and make it wrap around to 0, causing an
      erroneous map free while there are still references to it, and thus
      use-after-free problems.
      
      But failing refcounting operations are problematic in some cases. One
      example is the mmap() interface. After establishing the initial memory-mapping,
      the user is allowed to arbitrarily map/remap/unmap parts of the mapped memory,
      arbitrarily splitting it into multiple non-contiguous regions. All of this
      happens without any control from the users of the mmap subsystem. Rather, the
      mmap subsystem sends notifications to the original creator of the memory
      mapping through open/close callbacks, which are optionally specified during
      initial memory mapping creation. These callbacks are used to maintain an
      accurate refcount for bpf_map (see next patch in this series). The problem is
      that the open() callback is not supposed to fail, because the memory-mapped
      resource is set up and properly referenced. This poses a problem for using
      memory-mapping with BPF maps.
      
      One solution to this is to maintain a separate refcount for just memory-mappings
      and do a single bpf_map_inc/bpf_map_put when it goes from/to zero, respectively.
      There are similar use cases in current work on tcp-bpf, necessitating an extra
      counter as well. This seems like a rather unfortunate and ugly solution that
      doesn't scale well to various new use cases.
      
      Another approach to solve this is to use the non-failing refcount_t type, which
      uses a 32-bit counter internally, but, once reaching the overflow state at
      UINT_MAX, stays there. This ultimately causes a memory leak, but prevents
      use-after-free.
      
      But given that refcounting is not the most performance-critical operation with
      BPF maps (it's not used from running BPF program code), we can also just switch
      to a 64-bit counter that can't overflow in practice, potentially disadvantaging
      32-bit platforms a tiny bit. This simplifies semantics and frees the scenarios
      described above from worrying about a failing refcount increment operation.
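
The three counting strategies discussed above can be compared side by side. The following is a single-threaded sketch under assumptions, not kernel code: all `sketch_*` names and `SKETCH_MAX_REFCNT` are hypothetical, the counters are plain integers rather than atomics, and the saturating variant only mimics the behaviour described in the text (stick at UINT_MAX), not the real refcount_t implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_MAX_REFCNT 32768	/* models the 32k cap mentioned above */

/* Capped 32-bit counter: the increment can fail, callers must handle it. */
static int sketch_inc_capped(int32_t *refcnt)
{
	if (*refcnt >= SKETCH_MAX_REFCNT)
		return -EBUSY;
	(*refcnt)++;
	return 0;
}

/* Saturating 32-bit counter: never fails, but sticks at UINT32_MAX. */
static void sketch_inc_saturating(uint32_t *refcnt)
{
	if (*refcnt != UINT32_MAX)
		(*refcnt)++;
}

/* Returns true when the object should be freed. A saturated counter never
 * reaches zero again, so the object is leaked instead of freed early. */
static bool sketch_dec_saturating(uint32_t *refcnt)
{
	assert(*refcnt > 0);
	if (*refcnt == UINT32_MAX)
		return false;
	return --(*refcnt) == 0;
}

/* 64-bit counter: cannot realistically overflow, so it never fails. */
static void sketch_inc64(int64_t *refcnt)
{
	(*refcnt)++;
}
```

Only the first variant forces every caller to handle an error path, which is exactly what a callback that is not allowed to fail cannot do.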
      
      In terms of struct bpf_map size, we are still good and use the same amount of
      space:
      
      BEFORE (3 cache lines, 8 bytes of padding at the end):
      struct bpf_map {
      	const struct bpf_map_ops  * ops __attribute__((__aligned__(64))); /*     0     8 */
      	struct bpf_map *           inner_map_meta;       /*     8     8 */
      	void *                     security;             /*    16     8 */
      	enum bpf_map_type  map_type;                     /*    24     4 */
      	u32                        key_size;             /*    28     4 */
      	u32                        value_size;           /*    32     4 */
      	u32                        max_entries;          /*    36     4 */
      	u32                        map_flags;            /*    40     4 */
      	int                        spin_lock_off;        /*    44     4 */
      	u32                        id;                   /*    48     4 */
      	int                        numa_node;            /*    52     4 */
      	u32                        btf_key_type_id;      /*    56     4 */
      	u32                        btf_value_type_id;    /*    60     4 */
      	/* --- cacheline 1 boundary (64 bytes) --- */
      	struct btf *               btf;                  /*    64     8 */
      	struct bpf_map_memory memory;                    /*    72    16 */
      	bool                       unpriv_array;         /*    88     1 */
      	bool                       frozen;               /*    89     1 */
      
      	/* XXX 38 bytes hole, try to pack */
      
      	/* --- cacheline 2 boundary (128 bytes) --- */
      	atomic_t                   refcnt __attribute__((__aligned__(64))); /*   128     4 */
      	atomic_t                   usercnt;              /*   132     4 */
      	struct work_struct work;                         /*   136    32 */
      	char                       name[16];             /*   168    16 */
      
      	/* size: 192, cachelines: 3, members: 21 */
      	/* sum members: 146, holes: 1, sum holes: 38 */
      	/* padding: 8 */
      	/* forced alignments: 2, forced holes: 1, sum forced holes: 38 */
      } __attribute__((__aligned__(64)));
      
      AFTER (same 3 cache lines, no extra padding now):
      struct bpf_map {
      	const struct bpf_map_ops  * ops __attribute__((__aligned__(64))); /*     0     8 */
      	struct bpf_map *           inner_map_meta;       /*     8     8 */
      	void *                     security;             /*    16     8 */
      	enum bpf_map_type  map_type;                     /*    24     4 */
      	u32                        key_size;             /*    28     4 */
      	u32                        value_size;           /*    32     4 */
      	u32                        max_entries;          /*    36     4 */
      	u32                        map_flags;            /*    40     4 */
      	int                        spin_lock_off;        /*    44     4 */
      	u32                        id;                   /*    48     4 */
      	int                        numa_node;            /*    52     4 */
      	u32                        btf_key_type_id;      /*    56     4 */
      	u32                        btf_value_type_id;    /*    60     4 */
      	/* --- cacheline 1 boundary (64 bytes) --- */
      	struct btf *               btf;                  /*    64     8 */
      	struct bpf_map_memory memory;                    /*    72    16 */
      	bool                       unpriv_array;         /*    88     1 */
      	bool                       frozen;               /*    89     1 */
      
      	/* XXX 38 bytes hole, try to pack */
      
      	/* --- cacheline 2 boundary (128 bytes) --- */
      	atomic64_t                 refcnt __attribute__((__aligned__(64))); /*   128     8 */
      	atomic64_t                 usercnt;              /*   136     8 */
      	struct work_struct work;                         /*   144    32 */
      	char                       name[16];             /*   176    16 */
      
      	/* size: 192, cachelines: 3, members: 21 */
      	/* sum members: 154, holes: 1, sum holes: 38 */
      	/* forced alignments: 2, forced holes: 1, sum forced holes: 38 */
      } __attribute__((__aligned__(64)));
      
      This patch, while modifying all users of bpf_map_inc, also cleans up its
      interface to match bpf_map_put with separate operations for bpf_map_inc and
      bpf_map_inc_with_uref (to match bpf_map_put and bpf_map_put_with_uref,
      respectively). Also, given there are no users of bpf_map_inc_not_zero
      specifying uref=true, remove uref flag and default to uref=false internally.
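
The paired inc/put interface described above might look roughly like the following user-space model built on C11 atomics. It is a sketch under assumptions: the `sketch_*` names and `struct sketch_bpf_map` are hypothetical, and the real kernel functions have different signatures and do additional work when the last reference is dropped.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in holding only the two 64-bit counters. */
struct sketch_bpf_map {
	atomic_int_fast64_t refcnt;
	atomic_int_fast64_t usercnt;
};

/* Non-failing increment of the plain reference count. */
static void sketch_map_inc(struct sketch_bpf_map *m)
{
	atomic_fetch_add(&m->refcnt, 1);
}

/* Non-failing increment that also takes a user reference. */
static void sketch_map_inc_with_uref(struct sketch_bpf_map *m)
{
	atomic_fetch_add(&m->refcnt, 1);
	atomic_fetch_add(&m->usercnt, 1);
}

/* Takes a reference only if the map is still alive; no uref variant. */
static bool sketch_map_inc_not_zero(struct sketch_bpf_map *m)
{
	int_fast64_t old = atomic_load(&m->refcnt);

	while (old != 0) {
		/* On failure, old is reloaded with the current value. */
		if (atomic_compare_exchange_weak(&m->refcnt, &old, old + 1))
			return true;
	}
	return false;
}

/* Returns true when the last reference was dropped. */
static bool sketch_map_put(struct sketch_bpf_map *m)
{
	int_fast64_t old = atomic_fetch_sub(&m->refcnt, 1);

	assert(old > 0);
	return old == 1;
}

/* Mirror of sketch_map_inc_with_uref(). */
static bool sketch_map_put_with_uref(struct sketch_bpf_map *m)
{
	atomic_fetch_sub(&m->usercnt, 1);
	return sketch_map_put(m);
}
```

Each increment has exactly one matching put, and only the inc_not_zero variant can decline to take a reference.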
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Song Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20191117172806.2195367-2-andriin@fb.com
      1e0bd5a0
  19. 16 Nov, 2019 4 commits