1. 10 May, 2020 5 commits
    • bpf: Create file bpf iterator · 367ec3e4
      Yonghong Song committed
      To produce a file bpf iterator, the fd must
      correspond to a link_fd associated with a
      trace/iter program. When the pinned file is
      opened, a seq_file will be generated.
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/20200509175906.2475893-1-yhs@fb.com
    • bpf: Create anonymous bpf iterator · ac51d99b
      Yonghong Song committed
      A new bpf command BPF_ITER_CREATE is added.
      
      The anonymous bpf iterator is seq_file based.
      The seq_file private data are referenced by targets.
      The bpf_iter infrastructure allocates additional space
      at seq_file->private before the space used by targets
      to store some metadata, e.g.,
        prog:       prog to run
        session_id: a unique id for each opened seq_file
        seq_num:    how many times bpf programs are queried in this session
        done_stop:  an internal state to decide whether bpf program
                    should be called in seq_ops->stop() or not
      
      The seq_num will start from 0 for valid objects.
      The bpf program may see the same seq_num more than once if
       - seq_file buffer overflow happens and the same object
         is retried by bpf_seq_read(), or
       - the bpf program explicitly requests a retry of the
         same object
      
      Since modules are not supported for bpf_iter, all target
      registration happens at __init time, so there is no
      need to change bpf_iter_unreg_target() as it is used
      mostly in the error path of the init function, at which time
      no bpf iterators have been created yet.
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/20200509175905.2475770-1-yhs@fb.com
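The metadata layout described in this commit message can be pictured with a small user-space sketch. This is not the kernel code: `struct toy_iter_meta`, `toy_iter_alloc()`, `toy_iter_meta_of()` and `toy_iter_free()` are hypothetical names, and `prog` is reduced to an opaque pointer. The sketch only shows the idea of placing infrastructure fields in front of the target's private area and recovering them from the pointer the target sees.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the per-session metadata that the iterator
 * infrastructure keeps in front of the target's private data. */
struct toy_iter_meta {
	const void *prog;     /* program to run (opaque in this sketch) */
	uint64_t session_id;  /* unique id for each opened seq_file */
	uint64_t seq_num;     /* how many times the program was queried */
	uint64_t done_stop;   /* internal state for the stop() callback */
	unsigned char target_private[]; /* space used by the target */
};

/* Allocate metadata plus target_size bytes; return the target's view. */
void *toy_iter_alloc(size_t target_size, uint64_t session_id)
{
	struct toy_iter_meta *meta = calloc(1, sizeof(*meta) + target_size);

	if (meta == NULL)
		return NULL;
	meta->session_id = session_id;
	return meta->target_private;
}

/* Step back from the target's private pointer to the metadata header. */
struct toy_iter_meta *toy_iter_meta_of(void *target_private)
{
	assert(target_private != NULL);
	return (struct toy_iter_meta *)((unsigned char *)target_private -
			offsetof(struct toy_iter_meta, target_private));
}

/* Free the whole allocation given the target's private pointer. */
void toy_iter_free(void *target_private)
{
	if (target_private != NULL)
		free(toy_iter_meta_of(target_private));
}
```

The target only ever sees the pointer returned by `toy_iter_alloc()`, so it needs no knowledge of the header that precedes its data.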
    • bpf: Support bpf tracing/iter programs for BPF_LINK_CREATE · de4e05ca
      Yonghong Song committed
      Given a bpf program, the steps to create an anonymous bpf iterator are:
        - create a bpf_iter_link, which combines bpf program and the target.
          In the future, there could be more information recorded in the link.
          A link_fd will be returned to the user space.
        - create an anonymous bpf iterator with the given link_fd.
      
      The bpf_iter_link can be pinned to bpffs mount file system to
      create a file based bpf iterator as well.
      
      The benefits of using bpf_iter_link:
        - using bpf link simplifies design and implementation as bpf link
          is used for other tracing bpf programs.
        - for file based bpf iterator, bpf_iter_link provides a standard
          way to replace underlying bpf programs.
        - for both anonymous and file based iterators, bpf link query
          capability can be leveraged.
      
      The patch added support of tracing/iter programs for BPF_LINK_CREATE.
      A new link type BPF_LINK_TYPE_ITER is added to facilitate link
      querying. Currently, only prog_id is needed, so no
      additional in-kernel show_fdinfo() and fill_link_info() hooks
      are needed for the BPF_LINK_TYPE_ITER link.
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/20200509175901.2475084-1-yhs@fb.com
    • bpf: Allow loading of a bpf_iter program · 15d83c4d
      Yonghong Song committed
      A bpf_iter program is a tracing program with attach type
      BPF_TRACE_ITER. The load attribute
        attach_btf_id
      is used by the verifier against a particular kernel function,
      which represents a target, e.g., __bpf_iter__bpf_map
      for target bpf_map which is implemented later.
      
      The program return value must be 0 or 1 for now.
        0 : successful, except potential seq_file buffer overflow
            which is handled by seq_file reader.
        1 : request to restart the same object
      
      In the future, other return values may be used for filtering or
      terminating the iterator.
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/20200509175900.2474947-1-yhs@fb.com
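The 0/1 return convention in this commit message can be modelled in user space. The sketch below is a toy driver loop, not the kernel's seq_file code: `toy_iter_run()`, `toy_demo_prog()` and the `max_retries` bound are hypothetical, and it assumes that a retry means the program is called again with the same object and the same seq_num.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { TOY_ITER_OK = 0, TOY_ITER_RETRY = 1 };

typedef int (*toy_iter_prog_fn)(void *ctx, uint64_t seq_num, int obj);

/* Call prog once per object; on TOY_ITER_RETRY call it again with the
 * same object and seq_num. Returns the number of program invocations,
 * or -1 on an unknown return value or too many retries. */
long toy_iter_run(const int *objs, size_t n, toy_iter_prog_fn prog,
		  void *ctx, unsigned int max_retries)
{
	long calls = 0;
	uint64_t seq_num = 0;

	assert(prog != NULL);
	for (size_t i = 0; i < n; i++) {
		unsigned int retries = 0;

		for (;;) {
			int ret = prog(ctx, seq_num, objs[i]);

			calls++;
			if (ret == TOY_ITER_OK)
				break;
			if (ret != TOY_ITER_RETRY || ++retries > max_retries)
				return -1;
		}
		seq_num++; /* advances only after the object is accepted */
	}
	return calls;
}

struct toy_demo_state {
	int seen_any;       /* set once the program has run at least once */
	uint64_t last_seq;  /* seq_num of the most recent invocation */
};

/* Example program: asks for one retry of every object, then accepts it. */
int toy_demo_prog(void *ctx, uint64_t seq_num, int obj)
{
	struct toy_demo_state *s = ctx;

	(void)obj;
	if (!s->seen_any || s->last_seq != seq_num) {
		s->seen_any = 1;
		s->last_seq = seq_num;
		return TOY_ITER_RETRY;
	}
	return TOY_ITER_OK;
}
```

With `toy_demo_prog()`, every object is seen twice under the same seq_num, which matches the statement that a program may observe a seq_num more than once.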
    • bpf: Implement an interface to register bpf_iter targets · ae24345d
      Yonghong Song committed
      The target can call bpf_iter_reg_target() to register itself.
      The needed information:
        target:           target name
        seq_ops:          the seq_file operations for the target
        init_seq_private: target callback to initialize seq_priv during file open
        fini_seq_private: target callback to clean up seq_priv during file release
        seq_priv_size:    the private_data size needed by the seq_file
                          operations
      
      The target name represents a target which provides a seq_ops
      for iterating objects.
      
      The target can provide two callback functions, init_seq_private
      and fini_seq_private, called during file open/release time.
      For example, for /proc/net/{tcp6, ipv6_route, netlink, ...}, the net
      namespace needs to be set up properly during file open and
      released properly during file release.
      
      Function bpf_iter_unreg_target() is also implemented to unregister
      a particular target.
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/20200509175859.2474669-1-yhs@fb.com
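A registry keyed by target name, holding the fields listed in this commit message, might look like the following user-space sketch. All `toy_*` names are hypothetical, `seq_ops` is kept as an opaque pointer, and rejecting duplicate names is a choice made for this sketch rather than something the commit message states.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical registration record mirroring the listed fields. */
struct toy_iter_target {
	const char *target;                    /* target name */
	const void *seq_ops;                   /* seq_file operations (opaque) */
	int (*init_seq_private)(void *priv);   /* called at file open */
	void (*fini_seq_private)(void *priv);  /* called at file release */
	size_t seq_priv_size;                  /* private data size needed */
	struct toy_iter_target *next;          /* registry linkage */
};

static struct toy_iter_target *toy_targets;

struct toy_iter_target *toy_iter_find_target(const char *name)
{
	for (struct toy_iter_target *t = toy_targets; t != NULL; t = t->next)
		if (strcmp(t->target, name) == 0)
			return t;
	return NULL;
}

/* Returns 0 on success, -1 if the name is already registered. */
int toy_iter_reg_target(struct toy_iter_target *t)
{
	assert(t != NULL && t->target != NULL);
	if (toy_iter_find_target(t->target) != NULL)
		return -1;
	t->next = toy_targets;
	toy_targets = t;
	return 0;
}

/* Returns 0 on success, -1 if no such target is registered. */
int toy_iter_unreg_target(const char *name)
{
	for (struct toy_iter_target **pp = &toy_targets; *pp != NULL;
	     pp = &(*pp)->next) {
		if (strcmp((*pp)->target, name) == 0) {
			struct toy_iter_target *t = *pp;

			*pp = t->next;
			t->next = NULL;
			return 0;
		}
	}
	return -1;
}
```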
  2. 02 May, 2020 1 commit
    • bpf: Sharing bpf runtime stats with BPF_ENABLE_STATS · d46edd67
      Song Liu committed
      Currently, sysctl kernel.bpf_stats_enabled controls BPF runtime stats.
      Typical userspace tools use kernel.bpf_stats_enabled as follows:
      
        1. Enable kernel.bpf_stats_enabled;
        2. Check program run_time_ns;
        3. Sleep for the monitoring period;
        4. Check program run_time_ns again, calculate the difference;
        5. Disable kernel.bpf_stats_enabled.
      
      The problem with this approach is that only one userspace tool can toggle
      this sysctl. If multiple tools toggle the sysctl at the same time, the
      measurement may be inaccurate.
      
      To fix this problem while keeping backward compatibility, introduce a new
      bpf command BPF_ENABLE_STATS. On success, this command enables stats and
      returns a valid fd. BPF_ENABLE_STATS takes argument "type". Currently,
      only one type, BPF_STATS_RUN_TIME, is supported. We can extend the
      command to support other types of stats in the future.
      
      With BPF_ENABLE_STATS, a user space tool would have the following flow:
      
        1. Get an fd with BPF_ENABLE_STATS, and make sure it is valid;
        2. Check program run_time_ns;
        3. Sleep for the monitoring period;
        4. Check program run_time_ns again, calculate the difference;
        5. Close the fd.
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200430071506.1408910-2-songliubraving@fb.com
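Why an fd-based flow lets several tools coexist can be shown with a toy reference count. The sketch below is a user-space model, not the bpf() syscall: `toy_stats_enable()`, `toy_stats_close()` and `toy_stats_enabled()` are hypothetical, and it assumes that every open handle holds one reference and that stats stay on while any reference remains.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_HANDLES 16

static int toy_enable_count;                 /* open handles */
static bool toy_handle_open[TOY_MAX_HANDLES];

/* Models "get an fd": returns a handle >= 0, or -1 if none is free. */
int toy_stats_enable(void)
{
	for (int h = 0; h < TOY_MAX_HANDLES; h++) {
		if (!toy_handle_open[h]) {
			toy_handle_open[h] = true;
			toy_enable_count++;
			return h;
		}
	}
	return -1;
}

/* Models "close the fd": drops this handle's reference. */
int toy_stats_close(int h)
{
	if (h < 0 || h >= TOY_MAX_HANDLES || !toy_handle_open[h])
		return -1;
	assert(toy_enable_count > 0);
	toy_handle_open[h] = false;
	toy_enable_count--;
	return 0;
}

/* Stats are on as long as at least one handle is open. */
bool toy_stats_enabled(void)
{
	return toy_enable_count > 0;
}
```

In this model, one tool closing its handle cannot switch stats off underneath another tool that is still measuring, which is the problem the sysctl toggle has.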
  3. 29 April, 2020 3 commits
  4. 27 April, 2020 1 commit
  5. 26 April, 2020 2 commits
  6. 31 March, 2020 1 commit
    • bpf: Implement bpf_link-based cgroup BPF program attachment · af6eea57
      Andrii Nakryiko committed
      Implement new sub-command to attach cgroup BPF programs and return FD-based
      bpf_link back on success. bpf_link, once attached to cgroup, cannot be
      replaced, except by owner having its FD. Cgroup bpf_link supports only
      BPF_F_ALLOW_MULTI semantics. Both link-based and prog-based BPF_F_ALLOW_MULTI
      attachments can be freely intermixed.
      
      To prevent bpf_cgroup_link from keeping cgroup alive past the point when no
      BPF program can be executed, implement auto-detachment of link. When
      cgroup_bpf_release() is called, all attached bpf_links are forced to release
      cgroup refcounts, but they leave bpf_link otherwise active and allocated, as
      well as still owning underlying bpf_prog. This is because user-space might
      still have FDs open and active, so bpf_link as a user-referenced object can't
      be freed yet. Once last active FD is closed, bpf_link will be freed and
      underlying bpf_prog refcount will be dropped. But cgroup refcount won't be
      touched, because cgroup is released already.
      
      The inherent race between bpf_cgroup_link release (from closing last FD) and
      cgroup_bpf_release() is resolved by both operations taking cgroup_mutex. So
      the only additional check required is when bpf_cgroup_link attempts to detach
      itself from cgroup. At that time we need to check whether there is still
      cgroup associated with that link. And if not, exit with success, because
      bpf_cgroup_link was already successfully detached.
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Roman Gushchin <guro@fb.com>
      Link: https://lore.kernel.org/bpf/20200330030001.2312810-2-andriin@fb.com
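The auto-detachment rule in this commit message can be illustrated with a single-threaded toy model. It is only a sketch: the `toy_*` types and functions are hypothetical, the fixed-size link table is an assumption made for brevity, and the locking that the commit message attributes to cgroup_mutex is left out entirely.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_LINKS 8

struct toy_cgroup;

struct toy_cgroup_link {
	struct toy_cgroup *cgroup; /* NULL once the link is detached */
};

struct toy_cgroup {
	int refcnt; /* references held by attached links */
	struct toy_cgroup_link *links[TOY_MAX_LINKS];
};

/* Attach: the link takes one reference on the cgroup. */
int toy_link_attach(struct toy_cgroup_link *link, struct toy_cgroup *cg)
{
	assert(link != NULL && cg != NULL);
	for (size_t i = 0; i < TOY_MAX_LINKS; i++) {
		if (cg->links[i] == NULL) {
			cg->links[i] = link;
			link->cgroup = cg;
			cg->refcnt++;
			return 0;
		}
	}
	return -1; /* no free slot */
}

/* Cgroup release: force every link to drop its cgroup reference, but
 * leave the link objects themselves alive for their FD holders. */
void toy_cgroup_release(struct toy_cgroup *cg)
{
	assert(cg != NULL);
	for (size_t i = 0; i < TOY_MAX_LINKS; i++) {
		if (cg->links[i] != NULL) {
			cg->links[i]->cgroup = NULL;
			cg->links[i] = NULL;
			cg->refcnt--;
		}
	}
}

/* Link release (last FD closed): detach only if still associated. */
int toy_link_detach(struct toy_cgroup_link *link)
{
	struct toy_cgroup *cg;

	assert(link != NULL);
	cg = link->cgroup;
	if (cg == NULL)
		return 0; /* already auto-detached: report success */
	for (size_t i = 0; i < TOY_MAX_LINKS; i++) {
		if (cg->links[i] == link) {
			cg->links[i] = NULL;
			cg->refcnt--;
			break;
		}
	}
	link->cgroup = NULL;
	return 0;
}
```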
  7. 30 March, 2020 1 commit
  8. 28 March, 2020 2 commits
    • bpf: Enable bpf cgroup hooks to retrieve cgroup v2 and ancestor id · 0f09abd1
      Daniel Borkmann committed
      Enable the bpf_get_current_cgroup_id() helper for connect(), sendmsg(),
      recvmsg() and bind-related hooks in order to retrieve the cgroup v2
      context which can then be used as part of the key for BPF map lookups,
      for example. Given these hooks operate in process context, 'current' is
      always valid and points to the app that is performing the mentioned
      syscalls if it's subject to a v2 cgroup. Also with same motivation of
      commit 77236281 ("bpf: Introduce bpf_skb_ancestor_cgroup_id helper")
      enable retrieval of ancestor from current so the cgroup id can be used
      for policy lookups which can then forbid connect() / bind(), for example.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/d2a7ef42530ad299e3cbb245e6c12374b72145ef.1585323121.git.daniel@iogearbox.net
    • bpf: Add netns cookie and enable it for bpf cgroup hooks · f318903c
      Daniel Borkmann committed
      In Cilium we're mainly using BPF cgroup hooks today in order to implement
      kube-proxy free Kubernetes service translation for ClusterIP, NodePort (*),
      ExternalIP, and LoadBalancer as well as HostPort mapping [0] for all traffic
      between Cilium managed nodes. While this works in its current shape and avoids
      packet-level NAT for inter Cilium managed node traffic, there is one major
      limitation we're facing today, that is, lack of netns awareness.
      
      In Kubernetes, the concept of Pods (which hold one or multiple containers)
      has been built around network namespaces, so while we can use the global scope
      of attaching to root BPF cgroup hooks also to our advantage (e.g. for exposing
      NodePort ports on loopback addresses), we also have the need to differentiate
      between the initial network namespace and non-initial ones. For example, ExternalIP
      services mandate that non-local service IPs are not to be translated from the
      host (initial) network namespace. Right now, we have an ugly
      work-around in place where non-local service IPs for ExternalIP services are
      not xlated from connect() and friends BPF hooks but instead via less efficient
      packet-level NAT on the veth tc ingress hook for Pod traffic.
      
      On top of determining whether we're in initial or non-initial network namespace
      we also have a need for a socket-cookie like mechanism for network namespaces
      scope. Socket cookies have the nice property that they can be combined as part
      of the key structure e.g. for BPF LRU maps without having to worry that the
      cookie could be recycled. We are planning to use this for our sessionAffinity
      implementation for services. Therefore, add a new bpf_get_netns_cookie() helper
      which would resolve both use cases at once: bpf_get_netns_cookie(NULL) would
      provide the cookie for the initial network namespace while passing the context
      instead of NULL would provide the cookie from the application's network namespace.
      We're using a hole, so no size increase; the assignment happens only once.
      Therefore this allows for a comparison on initial namespace as well as regular
      cookie usage as we have today with socket cookies. We could later on enable
      this helper for other program types as well, as the need arises.
      
        (*) Both externalTrafficPolicy={Local|Cluster} types
        [0] https://github.com/cilium/cilium/blob/master/bpf/bpf_sock.c
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/c47d2346982693a9cf9da0e12690453aded4c788.1585323121.git.daniel@iogearbox.net
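The two uses of the cookie named in this commit message (detecting the initial namespace, and a key that is never recycled) can be modelled in user space. The sketch is not the BPF helper: `toy_netns_cookie()` takes a hypothetical `struct toy_net *` instead of a program context, and the lazily assigned, monotonically increasing counter is an assumption made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical namespace object; cookie 0 means "not assigned yet". */
struct toy_net {
	uint64_t cookie;
};

static struct toy_net toy_init_net;  /* stands in for the initial netns */
static uint64_t toy_next_cookie;     /* only ever increases */

/* NULL selects the initial namespace, mirroring the NULL argument
 * described above. The cookie is assigned once and then kept. */
uint64_t toy_netns_cookie(struct toy_net *net)
{
	if (net == NULL)
		net = &toy_init_net;
	if (net->cookie == 0) {
		net->cookie = ++toy_next_cookie;
		assert(net->cookie != 0);
	}
	return net->cookie;
}

/* A namespace is the initial one if its cookie equals the NULL cookie. */
bool toy_in_init_netns(struct toy_net *net)
{
	return toy_netns_cookie(net) == toy_netns_cookie(NULL);
}
```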
  9. 18 March, 2020 1 commit
  10. 14 March, 2020 11 commits
  11. 13 March, 2020 1 commit
  12. 11 March, 2020 1 commit
    • bpf: Add bpf_link_new_file that doesn't install FD · babf3164
      Andrii Nakryiko committed
      Add bpf_link_new_file() API for cases when we need to ensure anon_inode is
      successfully created before we proceed with expensive BPF program attachment
      procedure, which will require equally (if not more so) expensive and
      potentially failing compensation detachment procedure just because anon_inode
      creation failed. This API simplifies code by first ensuring that
      anon_inode is created and, after the BPF program is attached, proceeding with
      fd_install(), which can't fail.
      
      After anon_inode file is created, link can't be just kfree()'d anymore,
      because its destruction will be performed by deferred file_operations->release
      call. For this, bpf_link API required specifying two separate operations:
      release() and dealloc(), former performing detachment only, while the latter
      frees memory used by bpf_link itself. dealloc() needs to be specified, because
      struct bpf_link is frequently embedded into link type-specific container
      struct (e.g., struct bpf_raw_tp_link), so bpf_link itself doesn't know how to
      properly free the memory. In case when anon_inode file was successfully
      created, but subsequent BPF attachment failed, bpf_link needs to be marked as
      "defunct", so that file's release() callback will perform only memory
      deallocation, but no detachment.
      
      Convert raw tracepoint and tracing attachment to new API and eliminate
      detachment from error handling path.
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Link: https://lore.kernel.org/bpf/20200309231051.1270337-1-andriin@fb.com
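The split between release() and dealloc(), and the role of the "defunct" mark, can be sketched in user space as follows. This is a toy model, not the kernel API: every `toy_*` name is hypothetical, file creation is assumed to always succeed, and the `attach_ok` flag merely simulates whether the attachment step works.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

struct toy_link;

struct toy_link_ops {
	void (*release)(struct toy_link *link); /* detachment only */
	void (*dealloc)(struct toy_link *link); /* frees the container */
};

struct toy_link {
	const struct toy_link_ops *ops;
	bool defunct; /* set when attachment never happened */
};

/* Type-specific container embedding the generic link as first member. */
struct toy_raw_tp_link {
	struct toy_link link;
	int attached;
};

int toy_detach_calls;
int toy_dealloc_calls;

static void toy_raw_tp_release(struct toy_link *link)
{
	((struct toy_raw_tp_link *)link)->attached = 0;
	toy_detach_calls++;
}

static void toy_raw_tp_dealloc(struct toy_link *link)
{
	free((struct toy_raw_tp_link *)link);
	toy_dealloc_calls++;
}

static const struct toy_link_ops toy_raw_tp_ops = {
	.release = toy_raw_tp_release,
	.dealloc = toy_raw_tp_dealloc,
};

/* Stands in for the file's release callback: skip detachment for a
 * defunct link, but always free the memory. */
void toy_file_release(struct toy_link *link)
{
	assert(link != NULL);
	if (!link->defunct)
		link->ops->release(link);
	link->ops->dealloc(link);
}

/* Create the "file" first, attach second. Returns 0 and the link on
 * success, -1 (with everything cleaned up) on failure. */
int toy_attach(bool attach_ok, struct toy_link **out)
{
	struct toy_raw_tp_link *l = calloc(1, sizeof(*l));

	*out = NULL;
	if (l == NULL)
		return -1;
	l->link.ops = &toy_raw_tp_ops;
	if (!attach_ok) {
		l->link.defunct = true;     /* nothing to detach */
		toy_file_release(&l->link); /* only deallocates */
		return -1;
	}
	l->attached = 1;
	*out = &l->link;
	return 0;
}
```

The error path in `toy_attach()` never calls release(), which mirrors the stated goal of removing detachment from error handling.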
  13. 10 March, 2020 1 commit
  14. 05 March, 2020 3 commits
  15. 03 March, 2020 1 commit
    • bpf: Introduce pinnable bpf_link abstraction · 70ed506c
      Andrii Nakryiko committed
      Introduce bpf_link abstraction, representing an attachment of BPF program to
      a BPF hook point (e.g., tracepoint, perf event, etc). bpf_link encapsulates
      ownership of attached BPF program, reference counting of a link itself when
      referenced from multiple anonymous inodes, and ensures that release
      callback will be called from a process context, so that users can safely take
      mutex locks and sleep.
      
      Additionally, with a new abstraction it's now possible to generalize pinning
      of a link object in BPF FS, which makes it possible to explicitly prevent BPF
      program detachment on process exit by pinning the link in a BPF FS, and lets
      another, independent process open it to keep working with it.
      
      Convert two existing bpf_link-like objects (raw tracepoint and tracing BPF
      program attachments) into utilizing bpf_link framework, making them pinnable
      in BPF FS. More FD-based bpf_links will be added in follow up patches.
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200303043159.323675-2-andriin@fb.com
  16. 28 February, 2020 2 commits
    • bpf: INET_DIAG support in bpf_sk_storage · 1ed4d924
      Martin KaFai Lau committed
      This patch adds INET_DIAG support to bpf_sk_storage.
      
      1. Although this series adds bpf_sk_storage diag capability to inet sk,
         bpf_sk_storage is in general applicable to all fullsock.  Hence, the
         bpf_sk_storage logic will operate on SK_DIAG_* nlattr.  The caller
         will pass in its specific nesting nlattr (e.g. INET_DIAG_*) as
         the argument.
      
      2. The request will be like:
      	INET_DIAG_REQ_SK_BPF_STORAGES (nla_nest) (defined in latter patch)
      		SK_DIAG_BPF_STORAGE_REQ_MAP_FD (nla_put_u32)
      		SK_DIAG_BPF_STORAGE_REQ_MAP_FD (nla_put_u32)
      		......
      
         Considering there could be multiple bpf_sk_storages in a sk,
         instead of reusing INET_DIAG_INFO ("ss -i"),  the user can select
         some specific bpf_sk_storage to dump by specifying an array of
         SK_DIAG_BPF_STORAGE_REQ_MAP_FD.
      
         If no SK_DIAG_BPF_STORAGE_REQ_MAP_FD is specified (i.e. an empty
         INET_DIAG_REQ_SK_BPF_STORAGES), it will dump all bpf_sk_storages
         of a sk.
      
      3. The reply will be like:
      	INET_DIAG_BPF_SK_STORAGES (nla_nest) (defined in latter patch)
      		SK_DIAG_BPF_STORAGE (nla_nest)
      			SK_DIAG_BPF_STORAGE_MAP_ID (nla_put_u32)
      			SK_DIAG_BPF_STORAGE_MAP_VALUE (nla_reserve_64bit)
      		SK_DIAG_BPF_STORAGE (nla_nest)
      			SK_DIAG_BPF_STORAGE_MAP_ID (nla_put_u32)
      			SK_DIAG_BPF_STORAGE_MAP_VALUE (nla_reserve_64bit)
      		......
      
      4. Unlike other INET_DIAG info of a sk which is pretty static, the size
         required to dump the bpf_sk_storage(s) of a sk is dynamic as the
         system adds more bpf_sk_storage_map.  It is hard to set a static
         min_dump_alloc size.
      
         Hence, this series learns it at runtime and adjusts the
         cb->min_dump_alloc as it iterates all sk(s) of a system.  The
         "unsigned int *res_diag_size" in bpf_sk_storage_diag_put()
         is for this purpose.
      
         The next patch will update the cb->min_dump_alloc as it
         iterates the sk(s).
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Song Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20200225230421.1975729-1-kafai@fb.com
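The nested request layout from point 2 of this commit message can be made concrete with a minimal attribute builder. This sketch does not use the kernel or libnl API: the `toy_*` helpers are hypothetical, the attribute type numbers are placeholders rather than the real uapi values, and it assumes the usual netlink attribute encoding of a 4-byte header (16-bit length, 16-bit type) with 4-byte padding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_NLA_HDRLEN ((size_t)4)
#define TOY_NLA_ALIGN(len) (((len) + (size_t)3) & ~(size_t)3)

/* Placeholder type numbers, NOT the real uapi constants. */
enum {
	TOY_REQ_SK_BPF_STORAGES = 1, /* the nesting attribute */
	TOY_REQ_MAP_FD = 2,          /* one u32 map fd */
};

struct toy_buf {
	unsigned char data[256];
	size_t len;
};

/* Append one attribute; returns its offset, or (size_t)-1 if full. */
static size_t toy_put_attr(struct toy_buf *b, uint16_t type,
			   const void *payload, uint16_t plen)
{
	size_t total = TOY_NLA_ALIGN(TOY_NLA_HDRLEN + plen);
	uint16_t nla_len = (uint16_t)(TOY_NLA_HDRLEN + plen);
	size_t off = b->len;

	if (off + total > sizeof(b->data))
		return (size_t)-1;
	memset(b->data + off, 0, total);
	memcpy(b->data + off, &nla_len, sizeof(nla_len));
	memcpy(b->data + off + 2, &type, sizeof(type));
	if (plen > 0)
		memcpy(b->data + off + TOY_NLA_HDRLEN, payload, plen);
	b->len += total;
	return off;
}

/* Build: nest { MAP_FD, MAP_FD, ... }. An empty nest (n == 0) models
 * the "dump all storages" request. Returns 0 on success, -1 if full. */
int toy_build_storage_req(struct toy_buf *b, const uint32_t *map_fds,
			  size_t n)
{
	size_t nest;
	uint16_t nest_len;

	assert(b != NULL);
	nest = toy_put_attr(b, TOY_REQ_SK_BPF_STORAGES, NULL, 0);
	if (nest == (size_t)-1)
		return -1;
	for (size_t i = 0; i < n; i++) {
		if (toy_put_attr(b, TOY_REQ_MAP_FD, &map_fds[i],
				 sizeof(map_fds[i])) == (size_t)-1)
			return -1;
	}
	/* Close the nest: its length covers the header and all children. */
	nest_len = (uint16_t)(b->len - nest);
	memcpy(b->data + nest, &nest_len, sizeof(nest_len));
	return 0;
}
```

For two map fds the request occupies 20 bytes: 4 for the nest header plus two attributes of 8 bytes each.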
    • bpf: Replace zero-length array with flexible-array member · d7f10df8
      Gustavo A. R. Silva committed
      The current codebase makes use of the zero-length array language
      extension to the C90 standard, but the preferred mechanism to declare
      variable-length types such as these ones is a flexible array member[1][2],
      introduced in C99:
      
      struct foo {
              int stuff;
              struct boo array[];
      };
      
      By making use of the mechanism above, we will get a compiler warning
      in case the flexible array does not occur last in the structure, which
      will help us prevent some kind of undefined behavior bugs from being
      inadvertently introduced[3] to the codebase from now on.
      
      Also, notice that dynamic memory allocations won't be affected by
      this change:
      
      "Flexible array members have incomplete type, and so the sizeof operator
      may not be applied. As a quirk of the original implementation of
      zero-length arrays, sizeof evaluates to zero."[1]
      
      This issue was found with the help of Coccinelle.
      
      [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
      [2] https://github.com/KSPP/linux/issues/21
      [3] commit 76497732 ("cxgb3/l2t: Fix undefined behaviour")
      Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Song Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20200227001744.GA3317@embeddedor
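The claim that dynamic allocations are unaffected can be checked with a short example built on the `struct foo` from this commit message. The definition of `struct boo`, the helper `toy_foo_alloc()`, and the use of `stuff` as the element count are assumptions made for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct boo {
	int value;
};

struct foo {
	int stuff;          /* used here as the element count */
	struct boo array[]; /* flexible array member, must be last */
};

/* Allocate a foo with room for n trailing elements. The size formula
 * is the same one commonly used with zero-length arrays: the header
 * plus n elements. */
struct foo *toy_foo_alloc(int n)
{
	struct foo *f;

	assert(n >= 0);
	f = malloc(sizeof(*f) + (size_t)n * sizeof(f->array[0]));
	if (f == NULL)
		return NULL;
	f->stuff = n;
	for (int i = 0; i < n; i++)
		f->array[i].value = i * 10;
	return f;
}
```

`sizeof(struct foo)` does not count any array elements, so the allocation expression stays the same after converting from `array[0]` to `array[]`.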
  17. 25 February, 2020 2 commits
  18. 29 January, 2020 1 commit