1. 02 May 2020, 1 commit
    • bpf: Sharing bpf runtime stats with BPF_ENABLE_STATS · d46edd67
      Committed by Song Liu
      Currently, sysctl kernel.bpf_stats_enabled controls BPF runtime stats.
      Typical userspace tools use kernel.bpf_stats_enabled as follows:
      
        1. Enable kernel.bpf_stats_enabled;
        2. Check program run_time_ns;
        3. Sleep for the monitoring period;
        4. Check program run_time_ns again, calculate the difference;
        5. Disable kernel.bpf_stats_enabled.
      
      The problem with this approach is that only one userspace tool can toggle
      this sysctl. If multiple tools toggle the sysctl at the same time, the
      measurement may be inaccurate.
      
      To fix this problem while keeping backward compatibility, introduce a new
      bpf command BPF_ENABLE_STATS. On success, this command enables stats and
      returns a valid fd. BPF_ENABLE_STATS takes the argument "type". Currently,
      only one type, BPF_STATS_RUN_TIME, is supported. We can extend the
      command to support other types of stats in the future.
      
      With BPF_ENABLE_STATS, a userspace tool would use the following flow:
      
        1. Get a fd with BPF_ENABLE_STATS, and make sure it is valid;
        2. Check program run_time_ns;
        3. Sleep for the monitoring period;
        4. Check program run_time_ns again, calculate the difference;
        5. Close the fd.
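
      A minimal sketch of steps 1 and 5, using the raw bpf(2) syscall and the uapi
      additions from this series (BPF_ENABLE_STATS, BPF_STATS_RUN_TIME and the
      enable_stats field of union bpf_attr); error handling is elided:

      #include <unistd.h>
      #include <sys/syscall.h>
      #include <linux/bpf.h>

      /* Step 1: enable run_time_ns stats. They stay enabled until the returned
       * fd (and any other BPF_ENABLE_STATS fd) is closed, which is step 5. */
      static int enable_run_time_stats(void)
      {
              union bpf_attr attr = {};

              attr.enable_stats.type = BPF_STATS_RUN_TIME;
              return syscall(__NR_bpf, BPF_ENABLE_STATS, &attr, sizeof(attr));
      }
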
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200430071506.1408910-2-songliubraving@fb.com
  2. 29 Apr 2020, 3 commits
  3. 27 Apr 2020, 1 commit
  4. 26 Apr 2020, 2 commits
  5. 31 Mar 2020, 1 commit
    • bpf: Implement bpf_link-based cgroup BPF program attachment · af6eea57
      Committed by Andrii Nakryiko
      Implement a new sub-command to attach cgroup BPF programs and return an FD-based
      bpf_link on success. A bpf_link, once attached to a cgroup, cannot be
      replaced, except by an owner holding its FD. Cgroup bpf_link supports only
      BPF_F_ALLOW_MULTI semantics. Both link-based and prog-based BPF_F_ALLOW_MULTI
      attachments can be freely intermixed.
      
      To prevent bpf_cgroup_link from keeping cgroup alive past the point when no
      BPF program can be executed, implement auto-detachment of link. When
      cgroup_bpf_release() is called, all attached bpf_links are forced to release
      cgroup refcounts, but they leave bpf_link otherwise active and allocated, as
      well as still owning underlying bpf_prog. This is because user-space might
      still have FDs open and active, so bpf_link as a user-referenced object can't
      be freed yet. Once last active FD is closed, bpf_link will be freed and
      underlying bpf_prog refcount will be dropped. But cgroup refcount won't be
      touched, because cgroup is released already.
      
      The inherent race between bpf_cgroup_link release (from closing last FD) and
      cgroup_bpf_release() is resolved by both operations taking cgroup_mutex. So
      the only additional check required is when bpf_cgroup_link attempts to detach
      itself from cgroup. At that time we need to check whether there is still
      cgroup associated with that link. And if not, exit with success, because
      bpf_cgroup_link was already successfully detached.
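
      A minimal sketch of the userspace side, using the raw bpf(2) syscall with the
      link_create attributes introduced by this series; the attach type is only an
      example and error handling is elided:

      #include <unistd.h>
      #include <sys/syscall.h>
      #include <linux/bpf.h>

      /* Attach prog_fd to the cgroup referenced by cgroup_fd and get back a
       * bpf_link fd; closing the last fd to the link detaches the program. */
      static int cgroup_link_attach(int cgroup_fd, int prog_fd)
      {
              union bpf_attr attr = {};

              attr.link_create.prog_fd = prog_fd;
              attr.link_create.target_fd = cgroup_fd;
              attr.link_create.attach_type = BPF_CGROUP_INET_INGRESS; /* example hook */
              return syscall(__NR_bpf, BPF_LINK_CREATE, &attr, sizeof(attr));
      }
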
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Roman Gushchin <guro@fb.com>
      Link: https://lore.kernel.org/bpf/20200330030001.2312810-2-andriin@fb.com
  6. 30 Mar 2020, 1 commit
  7. 28 Mar 2020, 2 commits
    • bpf: Enable bpf cgroup hooks to retrieve cgroup v2 and ancestor id · 0f09abd1
      Committed by Daniel Borkmann
      Enable the bpf_get_current_cgroup_id() helper for connect(), sendmsg(),
      recvmsg() and bind-related hooks in order to retrieve the cgroup v2
      context which can then be used as part of the key for BPF map lookups,
      for example. Given these hooks operate in process context, 'current' is
      always valid and points to the app that is performing the mentioned
      syscalls if it's subject to a v2 cgroup. Also, with the same motivation as
      commit 77236281 ("bpf: Introduce bpf_skb_ancestor_cgroup_id helper"),
      enable retrieval of the ancestor from current so the cgroup id can be used
      for policy lookups which can then forbid connect() / bind(), for example.
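
      A hypothetical cgroup/connect4 program illustrating the now-available helpers;
      the policy logic is an assumption, only the helper calls come from this change:

      #include <linux/bpf.h>
      #include <bpf/bpf_helpers.h>

      SEC("cgroup/connect4")
      int connect4_policy(struct bpf_sock_addr *ctx)
      {
              __u64 cgid = bpf_get_current_cgroup_id();
              __u64 parent = bpf_get_current_ancestor_cgroup_id(1); /* one level up */

              if (!cgid || !parent)
                      return 1; /* allow by default */
              /* e.g. use cgid/parent as (part of) a BPF map key for a policy
               * lookup and return 0 to forbid the connect() for that cgroup. */
              return 1; /* allow */
      }

      char _license[] SEC("license") = "GPL";
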
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/d2a7ef42530ad299e3cbb245e6c12374b72145ef.1585323121.git.daniel@iogearbox.net
    • bpf: Add netns cookie and enable it for bpf cgroup hooks · f318903c
      Committed by Daniel Borkmann
      In Cilium we're mainly using BPF cgroup hooks today in order to implement
      kube-proxy free Kubernetes service translation for ClusterIP, NodePort (*),
      ExternalIP, and LoadBalancer as well as HostPort mapping [0] for all traffic
      between Cilium managed nodes. While this works in its current shape and avoids
      packet-level NAT for inter Cilium managed node traffic, there is one major
      limitation we're facing today, that is, lack of netns awareness.
      
      In Kubernetes, the concept of Pods (which hold one or multiple containers)
      has been built around network namespaces, so while we can use the global scope
      of attaching to root BPF cgroup hooks also to our advantage (e.g. for exposing
      NodePort ports on loopback addresses), we also need to differentiate
      between the initial network namespace and non-initial ones. For example, ExternalIP
      services mandate that non-local service IPs are not to be translated from the
      host (initial) network namespace. Right now, we have an ugly
      work-around in place where non-local service IPs for ExternalIP services are
      not translated from connect() and friends BPF hooks but instead via less efficient
      packet-level NAT on the veth tc ingress hook for Pod traffic.
      
      On top of determining whether we're in the initial or a non-initial network namespace,
      we also have a need for a socket-cookie-like mechanism scoped to the network
      namespace. Socket cookies have the nice property that they can be combined as part
      of the key structure e.g. for BPF LRU maps without having to worry that the
      cookie could be recycled. We are planning to use this for our sessionAffinity
      implementation for services. Therefore, add a new bpf_get_netns_cookie() helper
      which would resolve both use cases at once: bpf_get_netns_cookie(NULL) would
      provide the cookie for the initial network namespace while passing the context
      instead of NULL would provide the cookie from the application's network namespace.
      We're using a hole, so there is no size increase; the assignment happens only once.
      This allows for a comparison against the initial namespace as well as regular
      cookie usage as we have today with socket cookies. We could later enable
      this helper for other program types as well if the need arises.
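
      The intended usage can be sketched roughly as follows (hypothetical
      cgroup/connect4 snippet; only bpf_get_netns_cookie() itself is from this patch):

      #include <linux/bpf.h>
      #include <bpf/bpf_helpers.h>

      SEC("cgroup/connect4")
      int sock4_connect(struct bpf_sock_addr *ctx)
      {
              __u64 init_cookie = bpf_get_netns_cookie(NULL); /* initial netns */
              __u64 app_cookie = bpf_get_netns_cookie(ctx);   /* caller's netns */

              if (app_cookie == init_cookie) {
                      /* Caller is in the host (initial) network namespace,
                       * e.g. skip ExternalIP translation here. */
              }
              return 1; /* allow */
      }

      char _license[] SEC("license") = "GPL";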
      
        (*) Both externalTrafficPolicy={Local|Cluster} types
        [0] https://github.com/cilium/cilium/blob/master/bpf/bpf_sock.c
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/c47d2346982693a9cf9da0e12690453aded4c788.1585323121.git.daniel@iogearbox.net
  8. 18 Mar 2020, 1 commit
  9. 14 Mar 2020, 11 commits
  10. 13 Mar 2020, 1 commit
  11. 11 Mar 2020, 1 commit
    • bpf: Add bpf_link_new_file that doesn't install FD · babf3164
      Committed by Andrii Nakryiko
      Add a bpf_link_new_file() API for cases when we need to ensure the anon_inode is
      successfully created before we proceed with the expensive BPF program attachment
      procedure, which would otherwise require an equally (if not more) expensive and
      potentially failing compensating detachment procedure just because anon_inode
      creation failed. This API allows simplifying the code by first ensuring that the
      anon_inode is created and, only after the BPF program is attached, proceeding with
      fd_install(), which can't fail.
      
      After anon_inode file is created, link can't be just kfree()'d anymore,
      because its destruction will be performed by deferred file_operations->release
      call. For this, the bpf_link API requires specifying two separate operations:
      release() and dealloc(), the former performing detachment only, while the latter
      frees the memory used by bpf_link itself. dealloc() needs to be specified, because
      struct bpf_link is frequently embedded into a link type-specific container
      struct (e.g., struct bpf_raw_tp_link), so bpf_link itself doesn't know how to
      properly free the memory. In case the anon_inode file was successfully
      created, but subsequent BPF attachment failed, the bpf_link needs to be marked as
      "defunct", so that the file's release() callback will perform only memory
      deallocation, but no detachment.
      
      Convert raw tracepoint and tracing attachment to the new API and eliminate
      detachment from the error handling path.
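
      Roughly, the pattern this enables looks like the sketch below; it is close to,
      but not literally, the converted raw tracepoint path, with error paths abbreviated:

      	struct bpf_raw_tp_link *raw_tp;   /* container embedding struct bpf_link */
      	struct file *link_file;
      	int link_fd, err;
      	...
      	bpf_link_init(&raw_tp->link, &bpf_raw_tp_lops);
      	link_file = bpf_link_new_file(&raw_tp->link, &link_fd);
      	if (IS_ERR(link_file)) {
      		kfree(raw_tp);                  /* nothing attached yet */
      		return PTR_ERR(link_file);
      	}
      	err = bpf_probe_register(raw_tp->btp, prog);  /* expensive, may fail */
      	if (err) {
      		/* mark link defunct: release() will only free, not detach */
      		bpf_link_cleanup(&raw_tp->link, link_file, link_fd);
      		return err;
      	}
      	fd_install(link_fd, link_file);         /* can't fail */
      	return link_fd;
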
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20200309231051.1270337-1-andriin@fb.com
  12. 10 Mar 2020, 1 commit
  13. 05 Mar 2020, 3 commits
  14. 03 Mar 2020, 1 commit
    • bpf: Introduce pinnable bpf_link abstraction · 70ed506c
      Committed by Andrii Nakryiko
      Introduce a bpf_link abstraction, representing an attachment of a BPF program to
      a BPF hook point (e.g., tracepoint, perf event, etc.). bpf_link encapsulates
      ownership of the attached BPF program, reference counting of the link itself when
      it is referenced from multiple anonymous inodes, and it ensures that the release
      callback will be called from process context, so that users can safely take
      mutex locks and sleep.
      
      Additionally, with the new abstraction it's now possible to generalize pinning
      of a link object in BPF FS, allowing one to explicitly prevent BPF program
      detachment on process exit by pinning the link in BPF FS and letting an
      independent other process open it and keep working with it.
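
      For example, the pinning side from userspace is just a plain BPF FS object pin
      of the link fd (a sketch using libbpf's generic bpf_obj_pin(); the path is
      illustrative):

      #include <bpf/bpf.h>

      /* Keeps the attachment alive after the loader exits, until the pin is
       * removed, because BPF FS then holds a reference to the bpf_link. */
      int pin_link(int link_fd)
      {
              return bpf_obj_pin(link_fd, "/sys/fs/bpf/my_raw_tp_link");
      }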
      
      Convert two existing bpf_link-like objects (raw tracepoint and tracing BPF
      program attachments) to utilize the bpf_link framework, making them pinnable
      in BPF FS. More FD-based bpf_links will be added in follow-up patches.
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200303043159.323675-2-andriin@fb.com
  15. 28 Feb 2020, 2 commits
    • bpf: INET_DIAG support in bpf_sk_storage · 1ed4d924
      Committed by Martin KaFai Lau
      This patch adds INET_DIAG support to bpf_sk_storage.
      
      1. Although this series adds bpf_sk_storage diag capability to inet sk,
         bpf_sk_storage is in general applicable to all fullsock.  Hence, the
         bpf_sk_storage logic will operate on SK_DIAG_* nlattr.  The caller
         will pass in its specific nesting nlattr (e.g. INET_DIAG_*) as
         the argument.
      
      2. The request will be like:
      	INET_DIAG_REQ_SK_BPF_STORAGES (nla_nest) (defined in latter patch)
      		SK_DIAG_BPF_STORAGE_REQ_MAP_FD (nla_put_u32)
      		SK_DIAG_BPF_STORAGE_REQ_MAP_FD (nla_put_u32)
      		......
      
         Considering there could be multiple bpf_sk_storages in a sk,
         instead of reusing INET_DIAG_INFO ("ss -i"), the user can select
         some specific bpf_sk_storages to dump by specifying an array of
         SK_DIAG_BPF_STORAGE_REQ_MAP_FD (a userspace construction sketch
         follows after this list).
      
         If no SK_DIAG_BPF_STORAGE_REQ_MAP_FD is specified (i.e. an empty
         INET_DIAG_REQ_SK_BPF_STORAGES), it will dump all bpf_sk_storages
         of a sk.
      
      3. The reply will be like:
      	INET_DIAG_BPF_SK_STORAGES (nla_nest) (defined in latter patch)
      		SK_DIAG_BPF_STORAGE (nla_nest)
      			SK_DIAG_BPF_STORAGE_MAP_ID (nla_put_u32)
      			SK_DIAG_BPF_STORAGE_MAP_VALUE (nla_reserve_64bit)
      		SK_DIAG_BPF_STORAGE (nla_nest)
      			SK_DIAG_BPF_STORAGE_MAP_ID (nla_put_u32)
      			SK_DIAG_BPF_STORAGE_MAP_VALUE (nla_reserve_64bit)
      		......
      
      4. Unlike other INET_DIAG info of a sk, which is pretty static, the size
         required to dump the bpf_sk_storage(s) of a sk is dynamic, as the
         system adds more bpf_sk_storage_maps.  It is hard to set a static
         min_dump_alloc size.
      
         Hence, this series learns it at runtime and adjusts the
         cb->min_dump_alloc as it iterates all sk(s) of a system.  The
         "unsigned int *res_diag_size" in bpf_sk_storage_diag_put()
         is for this purpose.
      
         The next patch will update the cb->min_dump_alloc as it
         iterates the sk(s).
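
      A rough userspace sketch of building the request from point 2 above with
      libmnl; the map fds are placeholders and the INET_DIAG_* / SK_DIAG_* constants
      are the uapi additions from this series:

      #include <libmnl/libmnl.h>
      #include <linux/inet_diag.h>
      #include <linux/sock_diag.h>

      /* Append the bpf_sk_storage request attributes to an already prepared
       * inet_diag request message nlh. */
      static void put_sk_storage_req(struct nlmsghdr *nlh, int map_fd1, int map_fd2)
      {
              struct nlattr *nest;

              nest = mnl_attr_nest_start(nlh, INET_DIAG_REQ_SK_BPF_STORAGES);
              mnl_attr_put_u32(nlh, SK_DIAG_BPF_STORAGE_REQ_MAP_FD, map_fd1);
              mnl_attr_put_u32(nlh, SK_DIAG_BPF_STORAGE_REQ_MAP_FD, map_fd2);
              mnl_attr_nest_end(nlh, nest);
      }
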
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Song Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20200225230421.1975729-1-kafai@fb.com
    • bpf: Replace zero-length array with flexible-array member · d7f10df8
      Committed by Gustavo A. R. Silva
      The current codebase makes use of the zero-length array language
      extension to the C90 standard, but the preferred mechanism to declare
      variable-length types such as these ones is a flexible array member[1][2],
      introduced in C99:
      
      struct foo {
              int stuff;
              struct boo array[];
      };
      
      By making use of the mechanism above, we will get a compiler warning
      in case the flexible array does not occur last in the structure, which
      will help us prevent some kinds of undefined behavior bugs from being
      inadvertently introduced[3] to the codebase from now on.
      
      Also, notice that dynamic memory allocations won't be affected by
      this change:
      
      "Flexible array members have incomplete type, and so the sizeof operator
      may not be applied. As a quirk of the original implementation of
      zero-length arrays, sizeof evaluates to zero."[1]
      
      This issue was found with the help of Coccinelle.
      
      [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
      [2] https://github.com/KSPP/linux/issues/21
      [3] commit 76497732 ("cxgb3/l2t: Fix undefined behaviour")
      Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Song Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20200227001744.GA3317@embeddedor
  16. 25 Feb 2020, 2 commits
  17. 29 Jan 2020, 1 commit
  18. 25 Jan 2020, 1 commit
  19. 23 Jan 2020, 2 commits
    • bpf: Add BPF_FUNC_jiffies64 · 5576b991
      Committed by Martin KaFai Lau
      This patch adds a helper to read the 64-bit jiffies.  It will be used
      in a later patch to implement bpf_cubic.c.
      
      The helper is inlined when jit_requested is set and BITS_PER_LONG is 64,
      the same way as map_gen_lookup().  Other cases could be considered together
      with map_gen_lookup() if needed.
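
      As a sketch, a BPF program can now simply read the clock like this (the
      section/attach point is only an example):

      #include <linux/bpf.h>
      #include <bpf/bpf_helpers.h>

      SEC("sockops")
      int sample(struct bpf_sock_ops *ctx)
      {
              __u64 now = bpf_jiffies64();   /* 64-bit jiffies snapshot */

              /* e.g. compare against a stored timestamp to rate-limit actions */
              bpf_printk("jiffies64=%llu", now);
              return 1;
      }

      char _license[] SEC("license") = "GPL";
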
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200122233646.903260-1-kafai@fb.com
    • bpf: Introduce dynamic program extensions · be8704ff
      Committed by Alexei Starovoitov
      Introduce dynamic program extensions. The users can load additional BPF
      functions and replace global functions in previously loaded BPF programs while
      these programs are executing.
      
      Global functions are verified individually by the verifier based on their types only.
      Hence a global function in the new program whose type matches an older function can
      safely replace that corresponding function.
      
      This new function/program is called 'an extension' of the old program. At load time
      the verifier uses the (attach_prog_fd, attach_btf_id) pair to identify the function
      to be replaced. The BPF program type of the extension is derived from the target
      program. Technically, bpf_verifier_ops is copied from the target program.
      The BPF_PROG_TYPE_EXT program type is a placeholder. It has empty verifier_ops.
      The extension program can call the same bpf helper functions as the target program.
      A single BPF_PROG_TYPE_EXT type is used to extend XDP, SKB and all other program
      types. The verifier allows only one level of replacement, meaning that an
      extension program cannot recursively extend an extension. That also means that
      the maximum stack size increases from 512 to 1024 bytes and the maximum
      function nesting level from 8 to 16. The programs don't always consume that
      much. The stack usage is determined by the number of on-stack variables used by
      the program. The verifier could have enforced a 512 byte limit for the combined
      original plus extension program, but that makes for a difficult user experience. The main
      use case for extensions is to provide a generic mechanism to plug external
      programs into a policy program or function call chaining.
      
      BPF trampoline is used to track both fentry/fexit and program extensions
      because both are using the same nop slot at the beginning of every BPF
      function. Attaching fentry/fexit to a function that was replaced is not
      allowed. The opposite is true as well. Replacing a function that currently
      being analyzed with fentry/fexit is not allowed. The executable page allocated
      by BPF trampoline is not used by program extensions. This inefficiency will be
      optimized in future patches.
      
      Function-by-function verification of global functions supports scalars and
      pointers to context only. Hence program extensions are supported only for such a
      class of global functions. In the future the verifier will be extended with
      support for pointers to structures, arrays with sizes, etc.
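
      A minimal sketch of such an extension program; the replaced function name and
      prototype are illustrative, and SEC("freplace/...") is the libbpf section
      convention for BPF_PROG_TYPE_EXT programs:

      #include <linux/bpf.h>
      #include <bpf/bpf_helpers.h>

      /* Replaces the global function handle_packet() in the target program
       * identified at load time via (attach_prog_fd, attach_btf_id). */
      SEC("freplace/handle_packet")
      int new_handle_packet(struct xdp_md *ctx)
      {
              return XDP_PASS;
      }

      char _license[] SEC("license") = "GPL";
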
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20200121005348.2769920-2-ast@kernel.org
  20. 17 Jan 2020, 1 commit
    • xdp: Use bulking for non-map XDP_REDIRECT and consolidate code paths · 1d233886
      Committed by Toke Høiland-Jørgensen
      Since the bulk queue used by XDP_REDIRECT now lives in struct net_device,
      we can re-use the bulking for the non-map version of the bpf_redirect()
      helper. This is a simple matter of having xdp_do_redirect_slow() queue the
      frame on the bulk queue instead of sending it out with __bpf_tx_xdp().
      
      Unfortunately we can't make the bpf_redirect() helper return an error if
      the ifindex doesn't exist (as bpf_redirect_map() does), because we don't
      have a reference to the network namespace of the ingress device at the time
      the helper is called. So we have to leave it as-is and keep the device
      lookup in xdp_do_redirect_slow().
      
      Since this leaves less reason to have the non-map redirect code in a
      separate function, we get rid of the xdp_do_redirect_slow() function
      entirely. This does lose us the tracepoint disambiguation, but fortunately
      the xdp_redirect and xdp_redirect_map tracepoints use the same tracepoint
      entry structures. This means both can contain a map index, so we can just
      amend the tracepoint definitions so we always emit the xdp_redirect(_err)
      tracepoints, but with the map ID only populated if a map is present. This
      means we retire the xdp_redirect_map(_err) tracepoints entirely, but keep
      the definitions around in case someone is still listening for them.
      
      With this change, the performance of the xdp_redirect sample program goes
      from 5Mpps to 8.4Mpps (a 68% increase).
      
      Since the flush functions are no longer map-specific, rename the flush()
      functions to drop _map from their names. One of the renamed functions is
      the xdp_do_flush_map() callback used in all the xdp-enabled drivers. To
      keep from having to update all drivers, use a #define to keep the old name
      working, and only update the virtual drivers in this patch.
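
      The compatibility shim mentioned above is essentially a one-line alias along
      these lines (sketch):

      	/* old driver-facing name kept working */
      	#define xdp_do_flush_map	xdp_do_flush
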
      Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/157918768505.1458396.17518057312953572912.stgit@toke.dk
  21. 16 Jan 2020, 1 commit