1. 25 Apr, 2020 1 commit
  2. 31 Mar, 2020 3 commits
    • bpf: Implement bpf_prog replacement for an active bpf_cgroup_link · 0c991ebc
      Andrii Nakryiko committed
      Add a new operation (LINK_UPDATE) that replaces the active bpf_prog
      under a given bpf_link. Currently this is supported only for bpf_cgroup_link,
      but it will be extended to other kinds of bpf_links in follow-up patches.
      
      For bpf_cgroup_link, the implemented functionality matches the existing semantics
      for direct bpf_prog attachment (including the BPF_F_REPLACE flag). The user can
      either unconditionally set a new bpf_prog regardless of which bpf_prog is
      currently active under the given bpf_link or, optionally, specify the expected
      active bpf_prog. If the active bpf_prog doesn't match the expected one, no
      changes are made, the old bpf_link stays intact and attached, and the operation
      returns a failure.
      
      The cgroup_bpf_replace() operation resolves the race between auto-detachment and
      bpf_prog update in the same way as is done for bpf_link detachment,
      except that in this case the update cannot succeed because the target cgroup
      is marked as dying, so an error is returned.
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200330030001.2312810-3-andriin@fb.com
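
The replace-with-optional-expectation semantics described in this commit can be sketched as a small user-space model. This is not kernel code: `struct prog_model`, `struct link_model`, `link_update()` and `MODEL_F_REPLACE` are hypothetical names, and the specific errno values are assumptions, since the message only says that the operation returns a failure.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel objects; only identity matters here. */
struct prog_model { int id; };

struct link_model {
    struct prog_model *prog;  /* program currently active under the link */
    int cgroup_dying;         /* nonzero once the target cgroup is going away */
};

#define MODEL_F_REPLACE 1u    /* assumed flag: "check the expected old program" */

/*
 * Replace the program under `link` with `new_prog`.
 * With MODEL_F_REPLACE set, the swap happens only if `expected_old` is the
 * program that is active right now; otherwise the link is left untouched.
 */
static int link_update(struct link_model *link, struct prog_model *new_prog,
                       struct prog_model *expected_old, unsigned int flags)
{
    assert(link != NULL && new_prog != NULL);

    if (link->cgroup_dying)
        return -ENOLINK;      /* assumed errno: update cannot succeed */
    if ((flags & MODEL_F_REPLACE) && link->prog != expected_old)
        return -EPERM;        /* assumed errno: mismatch, nothing changed */

    link->prog = new_prog;
    return 0;
}
```

A caller that wants an unconditional swap passes `flags == 0`; a caller that wants compare-and-replace behaviour passes `MODEL_F_REPLACE` together with the program it believes is currently attached.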
    • bpf: Implement bpf_link-based cgroup BPF program attachment · af6eea57
      Andrii Nakryiko committed
      Implement new sub-command to attach cgroup BPF programs and return FD-based
      bpf_link back on success. bpf_link, once attached to cgroup, cannot be
      replaced, except by owner having its FD. Cgroup bpf_link supports only
      BPF_F_ALLOW_MULTI semantics. Both link-based and prog-based BPF_F_ALLOW_MULTI
      attachments can be freely intermixed.
      
      To prevent bpf_cgroup_link from keeping cgroup alive past the point when no
      BPF program can be executed, implement auto-detachment of link. When
      cgroup_bpf_release() is called, all attached bpf_links are forced to release
      cgroup refcounts, but they leave bpf_link otherwise active and allocated, as
      well as still owning underlying bpf_prog. This is because user-space might
      still have FDs open and active, so bpf_link as a user-referenced object can't
      be freed yet. Once last active FD is closed, bpf_link will be freed and
      underlying bpf_prog refcount will be dropped. But cgroup refcount won't be
      touched, because cgroup is released already.
      
      The inherent race between bpf_cgroup_link release (from closing the last FD) and
      cgroup_bpf_release() is resolved by both operations taking cgroup_mutex. So
      the only additional check required is when bpf_cgroup_link attempts to detach
      itself from the cgroup. At that time we need to check whether there is still
      a cgroup associated with that link and, if not, exit with success, because
      bpf_cgroup_link was already successfully detached.
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Roman Gushchin <guro@fb.com>
      Link: https://lore.kernel.org/bpf/20200330030001.2312810-2-andriin@fb.com
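
The auto-detachment lifetime rules in this commit can be illustrated with a single-threaded model. All names below (`struct cgroup_model`, `struct link_model`, `cgroup_release_link()`, `link_close_fd()`) are hypothetical, and the model omits the cgroup_mutex that the message says serializes the two paths in the kernel.

```c
#include <assert.h>
#include <stddef.h>

struct cgroup_model { int refcnt; };

struct link_model {
    struct cgroup_model *cgrp;  /* NULL once the link has been auto-detached */
    int prog_refcnt;            /* reference held on the underlying program */
    int open_fds;               /* user-space FDs still referring to the link */
};

/* Model of cgroup release: the link gives up its cgroup reference but stays
 * allocated and keeps owning its program. */
static void cgroup_release_link(struct link_model *link)
{
    assert(link != NULL);
    if (link->cgrp) {
        link->cgrp->refcnt--;
        link->cgrp = NULL;
    }
}

/* Model of closing one FD. Returns 1 if this was the last FD and the link
 * was torn down, 0 otherwise. */
static int link_close_fd(struct link_model *link)
{
    assert(link != NULL && link->open_fds > 0);
    if (--link->open_fds > 0)
        return 0;
    if (link->cgrp) {           /* still attached: detach, drop cgroup ref */
        link->cgrp->refcnt--;
        link->cgrp = NULL;
    }                           /* else already auto-detached: treat as success */
    link->prog_refcnt--;        /* the program reference is dropped either way */
    return 1;
}
```

The point of the model is the `NULL` check in `link_close_fd()`: once the cgroup side has run, the link must not touch the cgroup refcount again.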
    • bpf: Add socket assign support · cf7fbe66
      Joe Stringer committed
      Add support for TPROXY via a new bpf helper, bpf_sk_assign().
      
      This helper requires the BPF program to discover the socket via a call
      to bpf_sk*_lookup_*(), then pass this socket to the new helper. The
      helper takes its own reference to the socket in addition to any existing
      reference that may or may not currently be obtained for the duration of
      BPF processing. For the destination socket to receive the traffic, the
      traffic must be routed towards that socket via local route. The
      simplest example route is below, but in practice you may want to route
      traffic more narrowly (e.g. by CIDR):
      
        $ ip route add local default dev lo
      
      This patch avoids trying to introduce an extra bit into the skb->sk, as
      that would require more invasive changes to all code interacting with
      the socket to ensure that the bit is handled correctly, such as all
      error-handling cases along the path from the helper in BPF through to
      the orphan path in the input. Instead, we opt to use the destructor
      variable to switch on the prefetch of the socket.
      Signed-off-by: Joe Stringer <joe@wand.net.nz>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20200329225342.16317-2-joe@wand.net.nz
  3. 30 Mar, 2020 1 commit
  4. 28 Mar, 2020 2 commits
    • bpf: Enable bpf cgroup hooks to retrieve cgroup v2 and ancestor id · 0f09abd1
      Daniel Borkmann committed
      Enable the bpf_get_current_cgroup_id() helper for connect(), sendmsg(),
      recvmsg() and bind-related hooks in order to retrieve the cgroup v2
      context which can then be used as part of the key for BPF map lookups,
      for example. Given that these hooks operate in process context, 'current' is
      always valid and points to the app that is performing the mentioned
      syscalls if it's subject to a v2 cgroup. Also, with the same motivation as
      commit 77236281 ("bpf: Introduce bpf_skb_ancestor_cgroup_id helper"),
      enable retrieval of the ancestor from current so the cgroup id can be used
      for policy lookups which can then forbid connect() / bind(), for example.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/d2a7ef42530ad299e3cbb245e6c12374b72145ef.1585323121.git.daniel@iogearbox.net
    • bpf: Add netns cookie and enable it for bpf cgroup hooks · f318903c
      Daniel Borkmann committed
      In Cilium we're mainly using BPF cgroup hooks today in order to implement
      kube-proxy free Kubernetes service translation for ClusterIP, NodePort (*),
      ExternalIP, and LoadBalancer as well as HostPort mapping [0] for all traffic
      between Cilium managed nodes. While this works in its current shape and avoids
      packet-level NAT for inter Cilium managed node traffic, there is one major
      limitation we're facing today, that is, lack of netns awareness.
      
      In Kubernetes, the concept of Pods (which hold one or multiple containers)
      has been built around network namespaces, so while we can use the global scope
      of attaching to root BPF cgroup hooks also to our advantage (e.g. for exposing
      NodePort ports on loopback addresses), we also have the need to differentiate
      between the initial network namespace and non-initial ones. For example, ExternalIP
      services mandate that non-local service IPs are not to be translated from the
      host (initial) network namespace. Right now, we have an ugly
      work-around in place where non-local service IPs for ExternalIP services are
      not xlated from connect() and friends BPF hooks but instead via less efficient
      packet-level NAT on the veth tc ingress hook for Pod traffic.
      
      On top of determining whether we're in the initial or a non-initial network namespace,
      we also have a need for a socket-cookie-like mechanism with network namespace
      scope. Socket cookies have the nice property that they can be combined as part
      of the key structure e.g. for BPF LRU maps without having to worry that the
      cookie could be recycled. We are planning to use this for our sessionAffinity
      implementation for services. Therefore, add a new bpf_get_netns_cookie() helper
      which would resolve both use cases at once: bpf_get_netns_cookie(NULL) would
      provide the cookie for the initial network namespace while passing the context
      instead of NULL would provide the cookie from the application's network namespace.
      We're using a hole, so no size increase; the assignment happens only once.
      Therefore this allows for a comparison on initial namespace as well as regular
      cookie usage as we have today with socket cookies. We could later enable
      this helper for other program types as well, as the need arises.
      
        (*) Both externalTrafficPolicy={Local|Cluster} types
        [0] https://github.com/cilium/cilium/blob/master/bpf/bpf_sock.c
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/c47d2346982693a9cf9da0e12690453aded4c788.1585323121.git.daniel@iogearbox.net
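
The two properties the message relies on, a cookie that is assigned only once and never recycled, and a NULL context selecting the initial namespace, can be sketched in a user-space model. `struct netns_model`, `netns_cookie()` and `in_init_netns()` are hypothetical names; the model is single-threaded, and a plain counter is an assumption standing in for whatever the kernel uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct netns_model { uint64_t cookie; };     /* 0 means "not assigned yet" */

static struct netns_model init_netns_model;  /* stands in for the initial namespace */
static uint64_t next_cookie;                 /* only grows, so values are never reused */

/* Assign the cookie on first use; later calls return the same value. */
static uint64_t netns_cookie(struct netns_model *ns)
{
    if (ns == NULL)
        ns = &init_netns_model;  /* NULL context selects the initial namespace */
    if (ns->cookie == 0) {
        assert(next_cookie < UINT64_MAX);  /* the counter must never wrap */
        ns->cookie = ++next_cookie;
    }
    return ns->cookie;
}

/* Comparison against the initial namespace, as described in the message. */
static int in_init_netns(struct netns_model *ns)
{
    return netns_cookie(ns) == netns_cookie(NULL);
}
```

Because a cookie value is never handed out twice, it can be embedded in a map key (for example for session affinity) without the key silently pointing at a different namespace later.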
  5. 13 Mar, 2020 2 commits
  6. 05 Mar, 2020 2 commits
  7. 04 Mar, 2020 1 commit
  8. 28 Feb, 2020 1 commit
  9. 20 Feb, 2020 1 commit
    • bpf: Add bpf_read_branch_records() helper · fff7b643
      Daniel Xu committed
      Branch records are a CPU feature that can be configured to record
      certain branches that are taken during code execution. This data is
      particularly interesting for profile guided optimizations. perf has had
      branch record support for a while but the data collection can be a bit
      coarse grained.
      
      We (Facebook) have seen in experiments that associating metadata with
      branch records can improve results (after postprocessing). We generally
      use bpf_probe_read_*() to get metadata out of userspace. That's why bpf
      support for branch records is useful.
      
      Aside from this particular use case, having branch data available to bpf
      progs can be useful to get stack traces out of userspace applications
      that omit frame pointers.
      Signed-off-by: Daniel Xu <dxu@dxuuu.xyz>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/20200218030432.4600-2-dxu@dxuuu.xyz
  10. 18 Feb, 2020 1 commit
  11. 23 Jan, 2020 2 commits
    • bpf: Add BPF_FUNC_jiffies64 · 5576b991
      Martin KaFai Lau committed
      This patch adds a helper to read the 64bit jiffies.  It will be used
      in a later patch to implement the bpf_cubic.c.
      
      The helper is inlined when jit_requested is set and BITS_PER_LONG is 64,
      as is done for map_gen_lookup().  Other cases could be considered together
      with map_gen_lookup() if needed.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200122233646.903260-1-kafai@fb.com
    • bpf: Introduce dynamic program extensions · be8704ff
      Alexei Starovoitov committed
      Introduce dynamic program extensions. The users can load additional BPF
      functions and replace global functions in previously loaded BPF programs while
      these programs are executing.
      
      Global functions are verified individually by the verifier based on their types only.
      Hence a global function in the new program whose types match the older function can
      safely replace that corresponding function.
      
      This new function/program is called 'an extension' of old program. At load time
      the verifier uses (attach_prog_fd, attach_btf_id) pair to identify the function
      to be replaced. The BPF program type is derived from the target program into
      extension program. Technically bpf_verifier_ops is copied from target program.
      The BPF_PROG_TYPE_EXT program type is a placeholder. It has empty verifier_ops.
      The extension program can call the same bpf helper functions as target program.
      Single BPF_PROG_TYPE_EXT type is used to extend XDP, SKB and all other program
      types. The verifier allows only one level of replacement. Meaning that the
      extension program cannot recursively extend an extension. That also means that
      the maximum stack size is increasing from 512 to 1024 bytes and maximum
      function nesting level from 8 to 16. The programs don't always consume that
      much. The stack usage is determined by the number of on-stack variables used by
      the program. The verifier could have enforced 512 limit for combined original
      plus extension program, but it makes for difficult user experience. The main
      use case for extensions is to provide generic mechanism to plug external
      programs into policy program or function call chaining.
      
      BPF trampoline is used to track both fentry/fexit and program extensions
      because both are using the same nop slot at the beginning of every BPF
      function. Attaching fentry/fexit to a function that was replaced is not
      allowed. The opposite is true as well: replacing a function that is currently
      being analyzed with fentry/fexit is not allowed. The executable page allocated
      by BPF trampoline is not used by program extensions. This inefficiency will be
      optimized in future patches.
      
      Function by function verification of global function supports scalars and
      pointer to context only. Hence program extensions are supported for such class
      of global functions only. In the future the verifier will be extended with
      support to pointers to structures, arrays with sizes, etc.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
      Link: https://lore.kernel.org/bpf/20200121005348.2769920-2-ast@kernel.org
  12. 16 Jan, 2020 4 commits
    • bpf: Add batch ops to all htab bpf map · 05799638
      Yonghong Song committed
      htab can't use the generic batch support due to some problematic behaviours
      inherent to the data structure, i.e. while iterating the bpf map a
      concurrent program might delete the next entry that the batch was about to
      use; in that case there's no easy way to retrieve the next entry.
      The issue has been discussed multiple times (see [1] and [2]).
      
      The only way hmap can be traversed without the problem described
      above is by making sure that the traversal covers entire buckets.
      This commit implements those strict requirements for hmap. The
      implementation follows the same interaction as the generic support, with
      some exceptions:
      
       - If keys/values buffer are not big enough to traverse a bucket,
         ENOSPC will be returned.
       - out_batch contains the value of the next bucket in the iteration, not
         the next key, but this is transparent for the user since the user
         should never use out_batch for other than bpf batch syscalls.
      
      This commit implements BPF_MAP_LOOKUP_BATCH and adds support for new
      command BPF_MAP_LOOKUP_AND_DELETE_BATCH. Note that for update/delete
      batch ops it is possible to use the generic implementations.
      
      [1] https://lore.kernel.org/bpf/20190724165803.87470-1-brianvv@google.com/
      [2] https://lore.kernel.org/bpf/20190906225434.3635421-1-yhs@fb.com/
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Brian Vazquez <brianvv@google.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200115184308.162644-6-brianvv@google.com
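
The whole-bucket traversal rule and its two visible consequences (ENOSPC when a bucket does not fit, and a cursor that names the next bucket rather than the next key) can be sketched with a toy table. `struct htab_model`, `htab_lookup_batch()` and the fixed sizes are hypothetical; the model copies keys only and ignores concurrency.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define NBUCKETS   4
#define BUCKET_CAP 4

struct bucket_model { int n; int keys[BUCKET_CAP]; };
struct htab_model   { struct bucket_model b[NBUCKETS]; };

/*
 * Copy whole buckets into `keys`, starting at bucket *batch.
 * On return *count is the number of keys copied and *batch is the next
 * bucket to visit. Returns 0, -ENOSPC if not even the first bucket fits,
 * or -ENOENT once every bucket has been visited.
 */
static int htab_lookup_batch(const struct htab_model *h, unsigned int *batch,
                             int *keys, unsigned int *count)
{
    unsigned int cap, filled = 0, i;

    assert(h && batch && keys && count);
    cap = *count;
    for (i = *batch; i < NBUCKETS; i++) {
        unsigned int n = (unsigned int)h->b[i].n;

        if (n > cap - filled) {
            if (filled == 0) {
                *count = 0;
                return -ENOSPC;   /* buffer too small for a single bucket */
            }
            break;                /* stop at a bucket boundary */
        }
        for (unsigned int j = 0; j < n; j++)
            keys[filled++] = h->b[i].keys[j];
    }
    *batch = i;
    *count = filled;
    return i == NBUCKETS ? -ENOENT : 0;
}
```

A caller treats `*batch` as opaque and simply feeds it back into the next call, growing its buffer and retrying when it sees `-ENOSPC`.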
    • bpf: Add generic support for update and delete batch ops · aa2e93b8
      Brian Vazquez committed
      This commit adds generic support for update and delete batch ops that
      can be used for almost all the bpf maps. These commands share the same
      UAPI attr that lookup and lookup_and_delete batch ops use and the
      syscall commands are:
      
        BPF_MAP_UPDATE_BATCH
        BPF_MAP_DELETE_BATCH
      
      The main difference between update/delete and lookup batch ops is that
      for update/delete keys/values must be specified by userspace, and
      because of that neither in_batch nor out_batch is used.
      Suggested-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Brian Vazquez <brianvv@google.com>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200115184308.162644-4-brianvv@google.com
    • bpf: Add generic support for lookup batch op · cb4d03ab
      Brian Vazquez committed
      This commit introduces generic support for the bpf_map_lookup_batch.
      This implementation can be used by almost all the bpf maps since its core
      implementation is relying on the existing map_get_next_key and
      map_lookup_elem. The bpf syscall subcommand introduced is:
      
        BPF_MAP_LOOKUP_BATCH
      
      The UAPI attribute is:
      
        struct { /* struct used by BPF_MAP_*_BATCH commands */
               __aligned_u64   in_batch;       /* start batch,
                                                * NULL to start from beginning
                                                */
               __aligned_u64   out_batch;      /* output: next start batch */
               __aligned_u64   keys;
               __aligned_u64   values;
               __u32           count;          /* input/output:
                                                * input: # of key/value
                                                * elements
                                                * output: # of filled elements
                                                */
               __u32           map_fd;
               __u64           elem_flags;
               __u64           flags;
        } batch;
      
      in_batch/out_batch are opaque values used to communicate between
      user/kernel space; in_batch/out_batch must be of key_size length.
      
      To start iterating from the beginning in_batch must be null,
      count is the # of key/value elements to retrieve. Note that the 'keys'
      buffer must be a buffer of key_size * count size and the 'values' buffer
      must be value_size * count, where value_size must be aligned to 8 bytes
      by userspace if it's dealing with percpu maps. 'count' will contain the
      number of keys/values successfully retrieved. Note that 'count' is an
      input/output variable and it can contain a lower value after a call.
      
      If there are no more entries to retrieve, ENOENT will be returned. If error
      is ENOENT, count might be > 0 in case it copied some values but there were
      no more entries to retrieve.
      
      Note that if the return code is an error and not -EFAULT,
      count indicates the number of elements successfully processed.
      Suggested-by: Stanislav Fomichev <sdf@google.com>
      Signed-off-by: Brian Vazquez <brianvv@google.com>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200115184308.162644-3-brianvv@google.com
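
The message says the generic implementation is built on map_get_next_key and map_lookup_elem, that `count` is both input and output, and that ENOENT can arrive together with a nonzero `count`. A minimal user-space model of that contract follows. The array-backed `struct map_model` and the three functions are hypothetical stand-ins with `int` keys and values, not the kernel code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct kv        { int key; int value; };
struct map_model { const struct kv *e; size_t n; };

/* get-next-key analogue: a NULL `prev` yields the first key. */
static int map_get_next_key(const struct map_model *m, const int *prev, int *next)
{
    size_t i = 0;

    if (prev) {
        while (i < m->n && m->e[i].key != *prev)
            i++;
        i++;                        /* step past the previous key */
    }
    if (i >= m->n)
        return -ENOENT;
    *next = m->e[i].key;
    return 0;
}

static int map_lookup_elem(const struct map_model *m, int key, int *value)
{
    for (size_t i = 0; i < m->n; i++)
        if (m->e[i].key == key) {
            *value = m->e[i].value;
            return 0;
        }
    return -ENOENT;
}

/* Fill up to *count entries; *count is rewritten with the number filled.
 * Returns -ENOENT when the map ran out, possibly with *count > 0. */
static int map_lookup_batch(const struct map_model *m, const int *in_batch,
                            int *out_batch, int *keys, int *values,
                            unsigned int *count)
{
    unsigned int want, got = 0;
    const int *prev = in_batch;     /* NULL starts from the beginning */
    int key, err = 0;

    assert(m && out_batch && keys && values && count);
    want = *count;
    while (got < want) {
        err = map_get_next_key(m, prev, &key);
        if (err)
            break;                  /* map exhausted */
        if (map_lookup_elem(m, key, &values[got]) == 0)
            keys[got++] = key;
        *out_batch = key;           /* opaque cursor for the next call */
        prev = out_batch;
    }
    *count = got;
    return err;
}
```

A consumer loops until the call returns `-ENOENT`, processing `*count` entries after every call including the last one.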
    • bpf: Add bpf_send_signal_thread() helper · 8482941f
      Yonghong Song committed
      Commit 8b401f9e ("bpf: implement bpf_send_signal() helper")
      added helper bpf_send_signal() which permits bpf program to
      send a signal to the current process. The signal may be
      delivered to any threads in the process.
      
      We found a use case where sending the signal to the current
      thread is preferable.
        - A bpf program will collect the stack trace and then
          send signal to the user application.
        - The user application will add some thread specific
          information to the just collected stack trace for
          later analysis.
      
      If bpf_send_signal() is used, user application will need
      to check whether the thread receiving the signal matches
      the thread collecting the stack by checking thread id.
      If not, it will need to send signal to another thread
      through pthread_kill().
      
      This patch proposes a new helper bpf_send_signal_thread(),
      which sends the signal to the thread corresponding to
      the current kernel task. This way, user space is guaranteed that
      bpf_program execution context and user space signal handling
      context are the same thread.
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200115035002.602336-1-yhs@fb.com
  13. 10 Jan, 2020 4 commits
    • bpf: Document BPF_F_QUERY_EFFECTIVE flag · f5bfcd95
      Andrey Ignatov committed
      Document the BPF_F_QUERY_EFFECTIVE flag, mostly to clarify how it affects
      attach_flags, which may not be obvious and may lead to confusion.
      
      Specifically attach_flags is returned only for target_fd but if programs
      are inherited from an ancestor cgroup then returned attach_flags for
      current cgroup may be confusing. For example, two effective programs of
      same attach_type can be returned but w/o BPF_F_ALLOW_MULTI in
      attach_flags.
      
      Simple repro:
        # bpftool c s /sys/fs/cgroup/path/to/task
        ID       AttachType      AttachFlags     Name
        # bpftool c s /sys/fs/cgroup/path/to/task effective
        ID       AttachType      AttachFlags     Name
        95043    ingress                         tw_ipt_ingress
        95048    ingress                         tw_ingress
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Song Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20200108014006.938363-1-rdna@fb.com
    • bpf: Add BPF_FUNC_tcp_send_ack helper · 206057fe
      Martin KaFai Lau committed
      Add a helper to send out a tcp-ack.  It will be used in the later
      bpf_dctcp implementation, which needs to send out an ack
      when the CE state changes.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20200109004551.3900448-1-kafai@fb.com
    • bpf: Introduce BPF_MAP_TYPE_STRUCT_OPS · 85d33df3
      Martin KaFai Lau committed
      The patch introduces BPF_MAP_TYPE_STRUCT_OPS.  The map value
      is a kernel struct with its func ptr implemented in bpf prog.
      This new map is the interface to register/unregister/introspect
      a bpf implemented kernel struct.
      
      The kernel struct is actually embedded inside another new struct
      (called the "value" struct in the code).  For example,
      "struct tcp_congestion_ops" is embedded in:
      struct bpf_struct_ops_tcp_congestion_ops {
      	refcount_t refcnt;
      	enum bpf_struct_ops_state state;
      	struct tcp_congestion_ops data;  /* <-- kernel subsystem struct here */
      };
      The map value is "struct bpf_struct_ops_tcp_congestion_ops".
      The "bpftool map dump" will then be able to show the
      state ("inuse"/"tobefree") and the number of subsystem's refcnt (e.g.
      number of tcp_sock in the tcp_congestion_ops case).  This "value" struct
      is created automatically by a macro.  Having a separate "value" struct
      will also make extending "struct bpf_struct_ops_XYZ" easier (e.g. adding
      "void (*init)(void)" to "struct bpf_struct_ops_XYZ" to do some
      initialization work before registering the struct_ops to the kernel
      subsystem).  The libbpf will take care of finding and populating the
      "struct bpf_struct_ops_XYZ" from "struct XYZ".
      
      Register a struct_ops to a kernel subsystem:
      1. Load all needed BPF_PROG_TYPE_STRUCT_OPS prog(s)
      2. Create a BPF_MAP_TYPE_STRUCT_OPS with attr->btf_vmlinux_value_type_id
         set to the btf id "struct bpf_struct_ops_tcp_congestion_ops" of the
         running kernel.
         Instead of reusing the attr->btf_value_type_id,
         btf_vmlinux_value_type_id is added such that attr->btf_fd can still be
         used as the "user" btf which could store other useful sysadmin/debug
         info that may be introduced in the future,
         e.g. creation-date/compiler-details/map-creator...etc.
      3. Create a "struct bpf_struct_ops_tcp_congestion_ops" object as described
         in the running kernel btf.  Populate the value of this object.
         The function ptr should be populated with the prog fds.
      4. Call BPF_MAP_UPDATE with the object created in (3) as
         the map value.  The key is always "0".
      
      During BPF_MAP_UPDATE, the code that saves the kernel-func-ptr's
      args as an array of u64 is generated.  BPF_MAP_UPDATE also allows
      the specific struct_ops to do some final checks in "st_ops->init_member()"
      (e.g. ensure all mandatory func ptrs are implemented).
      If everything looks good, it will register this kernel struct
      to the kernel subsystem.  The map will not allow further update
      from this point.
      
      Unregister a struct_ops from the kernel subsystem:
      BPF_MAP_DELETE with key "0".
      
      Introspect a struct_ops:
      BPF_MAP_LOOKUP_ELEM with key "0".  The map value returned will
      have the prog _id_ populated as the func ptr.
      
      The map value state (enum bpf_struct_ops_state) will transit from:
      INIT (map created) =>
      INUSE (map updated, i.e. reg) =>
      TOBEFREE (map value deleted, i.e. unreg)
      
      The kernel subsystem needs to call bpf_struct_ops_get() and
      bpf_struct_ops_put() to manage the "refcnt" in the
      "struct bpf_struct_ops_XYZ".  This patch uses a separate refcnt
      for the purpose of tracking the subsystem usage.  Another approach
      is to reuse the map->refcnt and then "show" (i.e. during map_lookup)
      the subsystem's usage by doing map->refcnt - map->usercnt to filter out
      the map-fd/pinned-map usage.  However, that will also tie down the
      future semantics of map->refcnt and map->usercnt.
      
      The very first subsystem's refcnt (during reg()) holds one
      count to map->refcnt.  When the very last subsystem's refcnt
      is gone, it will also release the map->refcnt.  All bpf_prog will be
      freed when the map->refcnt reaches 0 (i.e. during map_free()).
      
      Here is what the bpftool map command output looks like:
      [root@arch-fb-vm1 bpf]# bpftool map show
      6: struct_ops  name dctcp  flags 0x0
      	key 4B  value 256B  max_entries 1  memlock 4096B
      	btf_id 6
      [root@arch-fb-vm1 bpf]# bpftool map dump id 6
      [{
              "value": {
                  "refcnt": {
                      "refs": {
                          "counter": 1
                      }
                  },
                  "state": 1,
                  "data": {
                      "list": {
                          "next": 0,
                          "prev": 0
                      },
                      "key": 0,
                      "flags": 2,
                      "init": 24,
                      "release": 0,
                      "ssthresh": 25,
                      "cong_avoid": 30,
                      "set_state": 27,
                      "cwnd_event": 28,
                      "in_ack_event": 26,
                      "undo_cwnd": 29,
                      "pkts_acked": 0,
                      "min_tso_segs": 0,
                      "sndbuf_expand": 0,
                      "cong_control": 0,
                      "get_info": 0,
                      "name": [98,112,102,95,100,99,116,99,112,0,0,0,0,0,0,0
                      ],
                      "owner": 0
                  }
              }
          }
      ]
      
      Misc Notes:
      * bpf_struct_ops_map_sys_lookup_elem() is added for syscall lookup.
        It does an inplace update on "*value" instead of returning a pointer
        to syscall.c.  Otherwise, it needs a separate copy of "zero" value
        for the BPF_STRUCT_OPS_STATE_INIT to avoid races.
      
      * The bpf_struct_ops_map_delete_elem() is also called without
        preempt_disable() from map_delete_elem().  It is because
        the "->unreg()" may require a sleepable context, e.g.
        the "tcp_unregister_congestion_control()".
      
      * "const" is added to some of the existing "struct btf_func_model *"
        function arg to avoid a compiler warning caused by this patch.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20200109003505.3855919-1-kafai@fb.com
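
The INIT => INUSE => TOBEFREE lifecycle described in this commit, including the rules that the key is always 0 and that no further update is allowed after registration, can be sketched as a small state machine. The names (`enum st_state`, `struct st_ops_map_model`, `st_map_update()`, `st_map_delete()`) are hypothetical and the errno values are assumptions. The enum ordering puts INUSE at 1, which is at least consistent with `"state": 1` in the dump above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

enum st_state { ST_INIT, ST_INUSE, ST_TOBEFREE };

struct st_ops_map_model {
    enum st_state state;
    int registered;        /* 1 while the kernel subsystem holds the struct */
};

/* Map-update analogue: registers once, then refuses further updates. */
static int st_map_update(struct st_ops_map_model *m, int key)
{
    assert(m != NULL);
    if (key != 0)
        return -EINVAL;    /* assumed errno: the only valid key is 0 */
    if (m->state != ST_INIT)
        return -EBUSY;     /* assumed errno: already registered or freed */
    m->state = ST_INUSE;
    m->registered = 1;
    return 0;
}

/* Map-delete analogue: unregisters and marks the value as to-be-freed. */
static int st_map_delete(struct st_ops_map_model *m, int key)
{
    assert(m != NULL);
    if (key != 0)
        return -EINVAL;
    if (m->state != ST_INUSE)
        return -ENOENT;    /* assumed errno: nothing is registered */
    m->state = ST_TOBEFREE;
    m->registered = 0;
    return 0;
}
```

In this model a map that has reached `ST_TOBEFREE` can never be registered again, so registering a new implementation would mean creating a new map.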
    • bpf: Introduce BPF_PROG_TYPE_STRUCT_OPS · 27ae7997
      Martin KaFai Lau committed
      This patch allows the kernel's struct ops (i.e. func ptr) to be
      implemented in BPF.  The first use case in this series is the
      "struct tcp_congestion_ops" which will be introduced in a
      later patch.
      
      This patch introduces a new prog type BPF_PROG_TYPE_STRUCT_OPS.
      The BPF_PROG_TYPE_STRUCT_OPS prog is verified against a particular
      func ptr of a kernel struct.  The attr->attach_btf_id is the btf id
      of a kernel struct.  The attr->expected_attach_type is the member
      "index" of that kernel struct.  The first member of a struct starts
      with member index 0.  That will avoid ambiguity when a kernel struct
      has multiple func ptrs with the same func signature.
      
      For example, a BPF_PROG_TYPE_STRUCT_OPS prog is written
      to implement the "init" func ptr of the "struct tcp_congestion_ops".
      The attr->attach_btf_id is the btf id of the "struct tcp_congestion_ops"
      of the _running_ kernel.  The attr->expected_attach_type is 3.
      
      The ctx of BPF_PROG_TYPE_STRUCT_OPS is an array of u64 args saved
      by arch_prepare_bpf_trampoline that will be done in the next
      patch when introducing BPF_MAP_TYPE_STRUCT_OPS.
      
      "struct bpf_struct_ops" is introduced as a common interface for the kernel
      struct that supports BPF_PROG_TYPE_STRUCT_OPS prog.  The supporting kernel
      struct will need to implement an instance of the "struct bpf_struct_ops".
      
      The supporting kernel struct also needs to implement a bpf_verifier_ops.
      During BPF_PROG_LOAD, bpf_struct_ops_find() will find the right
      bpf_verifier_ops by searching the attr->attach_btf_id.
      
      A new "btf_struct_access" is also added to the bpf_verifier_ops such
      that the supporting kernel struct can optionally provide its own specific
      check on accessing the func arg (e.g. provide limited write access).
      
      After btf_vmlinux is parsed, the new bpf_struct_ops_init() is called
      to initialize some values (e.g. the btf id of the supporting kernel
      struct) and it can only be done once the btf_vmlinux is available.
      
      The R0 checks at BPF_EXIT is excluded for the BPF_PROG_TYPE_STRUCT_OPS prog
      if the return type of the prog->aux->attach_func_proto is "void".
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: Yonghong Song <yhs@fb.com>
      Link: https://lore.kernel.org/bpf/20200109003503.3855825-1-kafai@fb.com
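
The member-index convention (the first member is index 0, and "init" of "struct tcp_congestion_ops" is 3) can be checked with a short lookup. The member order below is copied from the "bpftool map dump" output quoted in the BPF_MAP_TYPE_STRUCT_OPS entry on this page, so it reflects that kernel build only; `tcp_ca_members` and `member_index()` are hypothetical helper names.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Members of struct tcp_congestion_ops in the order shown by the dump. */
static const char *const tcp_ca_members[] = {
    "list", "key", "flags", "init", "release", "ssthresh", "cong_avoid",
    "set_state", "cwnd_event", "in_ack_event", "undo_cwnd", "pkts_acked",
    "min_tso_segs", "sndbuf_expand", "cong_control", "get_info", "name",
    "owner",
};

/* Return the zero-based member index, or -1 if the name is unknown. */
static int member_index(const char *name)
{
    size_t n = sizeof(tcp_ca_members) / sizeof(tcp_ca_members[0]);

    assert(name != NULL);
    for (size_t i = 0; i < n; i++)
        if (strcmp(tcp_ca_members[i], name) == 0)
            return (int)i;
    return -1;
}
```

Using an index rather than a signature is what removes the ambiguity the message mentions: two members with identical function-pointer types still have different indices.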
  14. 20 Dec, 2019 1 commit
    • bpf: Support replacing cgroup-bpf program in MULTI mode · 7dd68b32
      Andrey Ignatov committed
      The common use-case in production is to have multiple cgroup-bpf
      programs per attach type that cover multiple use-cases. Such programs
      are attached with BPF_F_ALLOW_MULTI and can be maintained by different
      people.
      
      Order of programs usually matters, for example imagine two egress
      programs: the first one drops packets and the second one counts packets.
      If they're swapped the result of counting program will be different.
      
      It brings operational challenges with updating cgroup-bpf program(s)
      attached with BPF_F_ALLOW_MULTI since there is no way to replace a
      program:
      
      * One way to update is to detach all programs first and then attach the
        new version(s) again in the right order. This introduces an
        interruption in the work a program is doing and may not be acceptable
        (e.g. if it's egress firewall);
      
      * Another way is to attach the new version of a program first and only
        then detach the old version. This introduces a time interval during
        which two versions of the same program are running, which may not be
        acceptable if a program is not idempotent. It also imposes an
        additional burden on program developers to make sure that two
        versions of their program can co-exist.
      
      Solve the problem by introducing a "replace" mode in the BPF_PROG_ATTACH
      command for cgroup-bpf programs being attached with the BPF_F_ALLOW_MULTI
      flag. This mode is enabled by the newly introduced BPF_F_REPLACE attach
      flag and the bpf_attr.replace_bpf_fd attribute, which passes the fd of
      the old program to replace.
      
      That way user can replace any program among those attached with
      BPF_F_ALLOW_MULTI flag without the problems described above.
      
      Details of the new API:
      
      * If BPF_F_REPLACE is set but replace_bpf_fd doesn't have valid
        descriptor of BPF program, BPF_PROG_ATTACH will return corresponding
        error (EINVAL or EBADF).
      
      * If replace_bpf_fd has valid descriptor of BPF program but such a
        program is not attached to specified cgroup, BPF_PROG_ATTACH will
        return ENOENT.
      
      BPF_F_REPLACE is introduced to make the user intent clear, since
      replace_bpf_fd alone can't be used for this (its default value, 0, is a
      valid fd). BPF_F_REPLACE also makes it possible to extend the API in the
      future (e.g. add BPF_F_BEFORE and BPF_F_AFTER if needed).
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/30cd850044a0057bdfcaaf154b7d2f39850ba813.1576741281.git.rdna@fb.com
      7dd68b32
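The replace semantics described in the commit above can be sketched as a small userspace model. This is only an illustration, not the kernel implementation: `struct prog_list`, `model_attach()` and `model_replace()` are hypothetical names, programs are represented by plain integer ids rather than file descriptors, and the fixed capacity is an arbitrary choice for the sketch. It shows the two properties the message stresses: a replaced program keeps its position in the execution order, and asking to replace a program that is not attached yields ENOENT.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical userspace model; not the kernel's data structures. */
#define MODEL_MAX_PROGS 8

struct prog_list {
    int progs[MODEL_MAX_PROGS];  /* program ids in execution order */
    size_t count;
};

/* Append a program, as a MULTI attach without BPF_F_REPLACE would. */
int model_attach(struct prog_list *l, int prog)
{
    assert(l != NULL);
    if (l->count == MODEL_MAX_PROGS)
        return -E2BIG;
    l->progs[l->count++] = prog;
    return 0;
}

/* Swap old_prog for new_prog in place, keeping its slot in the order. */
int model_replace(struct prog_list *l, int old_prog, int new_prog)
{
    assert(l != NULL);
    for (size_t i = 0; i < l->count; i++) {
        if (l->progs[i] == old_prog) {
            l->progs[i] = new_prog;
            return 0;
        }
    }
    return -ENOENT;  /* old program is not attached to this list */
}
```

With a "drop" program at slot 0 and a "count" program at slot 1, replacing the first one leaves the counter in slot 1, so there is no window with zero or two versions of the replaced program in the list.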
  15. 18 Nov, 2019 1 commit
    • A
      bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY · fc970227
      Andrii Nakryiko committed
      Add ability to memory-map contents of BPF array map. This is extremely useful
      for working with BPF global data from userspace programs. It allows to avoid
      typical bpf_map_{lookup,update}_elem operations, improving both performance
      and usability.
      
      Map freezing needed special consideration, to avoid having a writable
      memory view into a frozen map. To solve this issue, map freezing and
      mmap-ing now happen under a mutex:
        - if map is already frozen, no writable mapping is allowed;
        - if map has writable memory mappings active (accounted in map->writecnt),
          map freezing will keep failing with -EBUSY;
        - once number of writable memory mappings drops to zero, map freezing can be
          performed again.
      
      Only non-per-CPU plain arrays are supported right now. Maps with spinlocks
      can't be memory mapped either.
      
      For BPF_F_MMAPABLE array, memory allocation has to be done through vmalloc()
      to be mmap()'able. We also need to make sure that array data memory is
      page-sized and page-aligned, so we over-allocate memory in such a way that
      struct bpf_array is at the end of a single page of memory with array->value
      being aligned with the start of the second page. On deallocation we need to
      accommodate this memory arrangement to free vmalloc()'ed memory correctly.
      
      One important consideration concerns how the memory-mapping subsystem works.
      It provides a few optional callbacks, among them open() and close().
      close() is called for each memory region that is unmapped, so
      that users can decrease their reference counters and free up resources, if
      necessary. open() is *almost* symmetrical: it's called for each memory region
      that is being mapped, **except** the very first one. So bpf_map_mmap does
      initial refcnt bump, while open() will do any extra ones after that. Thus
      number of close() calls is equal to number of open() calls plus one more.
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Song Liu <songliubraving@fb.com>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Link: https://lore.kernel.org/bpf/20191117172806.2195367-4-andriin@fb.com
      fc970227
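The over-allocation arithmetic from the commit above (header at the end of the first page, data starting on the second page) can be sketched as follows. This is a sketch under stated assumptions, not the kernel code: `struct model_array` is a hypothetical stand-in for struct bpf_array with only two header fields, the page size is assumed to be 4 KiB, and all `model_*` names are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE 4096u  /* assumption: 4 KiB pages */

/* Hypothetical stand-in for struct bpf_array: header fields, then data. */
struct model_array {
    uint32_t elem_size;
    uint32_t max_entries;
    char value[];  /* element data starts here */
};

/* Round n up to a multiple of the page size. */
size_t model_page_align(size_t n)
{
    return (n + MODEL_PAGE_SIZE - 1) & ~((size_t)MODEL_PAGE_SIZE - 1);
}

/* Total bytes to allocate: page-rounded header plus page-rounded data. */
size_t model_alloc_size(uint32_t elem_size, uint32_t max_entries)
{
    assert(elem_size > 0);
    return model_page_align(sizeof(struct model_array)) +
           model_page_align((size_t)elem_size * max_entries);
}

/* Offset of the header inside the allocation so that value[] lands
 * exactly on a page boundary. */
size_t model_header_offset(void)
{
    return model_page_align(sizeof(struct model_array)) -
           offsetof(struct model_array, value);
}
```

Freeing such an allocation has to start from the base address (header address minus `model_header_offset()`), which is the deallocation adjustment the message refers to.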
  16. 16 Nov, 2019 2 commits
    • A
      bpf: Support attaching tracing BPF program to other BPF programs · 5b92a28a
      Alexei Starovoitov committed
      Allow FENTRY/FEXIT BPF programs to attach to other BPF programs of any type
      including their subprograms. This feature allows snooping on input and output
      packets in XDP, TC programs including their return values. In order to do that
      the verifier needs to track types not only of vmlinux, but types of other BPF
      programs as well. The verifier also needs to translate uapi/linux/bpf.h types
      used by networking programs into kernel internal BTF types used by FENTRY/FEXIT
      BPF programs. In some cases LLVM optimizations can remove arguments from BPF
      subprograms without adjusting BTF info that LLVM backend knows. When BTF info
      disagrees with actual types that the verifier sees the BPF trampoline has to
      fall back to conservative mode and treat all arguments as u64. The FENTRY/FEXIT
      program can still attach to such subprograms, but it won't be able to recognize
      pointer types like 'struct sk_buff *' and it won't be able to pass them to
      bpf_skb_output() for dumping packets to user space. The FENTRY/FEXIT program
      would need to use bpf_probe_read_kernel() instead.
      
      The BPF_PROG_LOAD command is extended with attach_prog_fd field. When it's set
      to zero the attach_btf_id is one of the vmlinux BTF type ids. When attach_prog_fd
      points to previously loaded BPF program the attach_btf_id is BTF type id of
      main function or one of its subprograms.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Song Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20191114185720.1641606-18-ast@kernel.org
      5b92a28a
    • A
      bpf: Introduce BPF trampoline · fec56f58
      Alexei Starovoitov committed
      Introduce BPF trampoline concept to allow kernel code to call into BPF programs
      with practically zero overhead.  The trampoline generation logic is
      architecture dependent.  It's converting native calling convention into BPF
      calling convention.  BPF ISA is 64-bit (even on 32-bit architectures). The
      registers R1 to R5 are used to pass arguments into BPF functions. The main BPF
      program accepts only single argument "ctx" in R1. Whereas CPU native calling
      convention is different. x86-64 is passing first 6 arguments in registers
      and the rest on the stack. x86-32 is passing first 3 arguments in registers.
      sparc64 is passing first 6 in registers. And so on.
      
      The trampolines between BPF and kernel already exist.  BPF_CALL_x macros in
      include/linux/filter.h statically compile trampolines from BPF into kernel
      helpers. They convert up to five u64 arguments into kernel C pointers and
      integers. On 64-bit architectures these BPF_to_kernel trampolines are nops. On
      32-bit architectures they're meaningful.
      
      The opposite job, kernel_to_BPF trampolines, is done by CAST_TO_U64 macros and
      __bpf_trace_##call() shim functions in include/trace/bpf_probe.h. They convert
      kernel function arguments into array of u64s that BPF program consumes via
      R1=ctx pointer.
      
      This patch set is doing the same job as __bpf_trace_##call() static
      trampolines, but dynamically for any kernel function. There are ~22k global
      kernel functions that are attachable via nop at function entry. The function
      arguments and types are described in BTF.  The job of btf_distill_func_proto()
      function is to extract useful information from BTF into "function model" that
      architecture dependent trampoline generators will use to generate assembly code
      to cast kernel function arguments into array of u64s.  For example the kernel
      function eth_type_trans takes two pointers. They will be cast to u64 and stored
      into stack of generated trampoline. The pointer to that stack space will be
      passed into BPF program in R1. On x86-64 such generated trampoline will consume
      16 bytes of stack and two stores of %rdi and %rsi into stack. The verifier will
      make sure that only two u64 are accessed read-only by BPF program. The verifier
      will also recognize the precise type of the pointers being accessed and will
      not allow typecasting of the pointer to a different type within BPF program.
      
      The tracing use case in the datacenter demonstrated that certain key kernel
      functions (like tcp_retransmit_skb) have 2 or more kprobes that are always
      active.  Other functions have both kprobe and kretprobe.  So it is essential to
      keep both kernel code and BPF programs executing at maximum speed. Hence
      generated BPF trampoline is re-generated every time new program is attached or
      detached to maintain maximum performance.
      
      To avoid the high cost of retpoline the attached BPF programs are called
      directly. __bpf_prog_enter/exit() are used to support per-program execution
      stats.  In the future this logic will be optimized further by adding support
      for bpf_stats_enabled_key inside generated assembly code. Introduction of
      preemptible and sleepable BPF programs will completely remove the need to call
      to __bpf_prog_enter/exit().
      
      Detach of a BPF program from the trampoline should not fail. To avoid memory
      allocation in the detach path, half of the page is used as a reserve and flipped
      after each attach/detach. 2k bytes is enough to call 40+ BPF programs directly
      which is enough for BPF tracing use cases. This limit can be increased in the
      future.
      
      BPF_TRACE_FENTRY programs have access to raw kernel function arguments while
      BPF_TRACE_FEXIT programs have access to kernel return value as well. Often
      kprobe BPF program remembers function arguments in a map while kretprobe
      fetches arguments from a map and analyzes them together with return value.
      BPF_TRACE_FEXIT accelerates this typical use case.
      
      Recursion prevention for kprobe BPF programs is done via per-cpu
      bpf_prog_active counter. In practice that turned out to be a mistake. It
      caused programs to randomly skip execution. The tracing tools missed results
      they were looking for. Hence BPF trampoline doesn't provide builtin recursion
      prevention. It's a job of BPF program itself and will be addressed in the
      follow up patches.
      
      BPF trampoline is intended to be used beyond tracing and fentry/fexit use cases
      in the future. For example to remove retpoline cost from XDP programs.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Link: https://lore.kernel.org/bpf/20191114185720.1641606-5-ast@kernel.org
      fec56f58
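The argument handling described in the commit above (two pointer arguments cast to u64, stored in 16 bytes of stack, and handed to the BPF program as ctx in R1) can be modelled in plain C. This is only a conceptual sketch: a real trampoline is architecture-specific generated machine code, and `model_prog_fn`, `model_trampoline2()` and `model_prog_both_set()` are hypothetical names used for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "BPF program": receives ctx, a pointer to an array of
 * u64 arguments, and returns a u64 result. */
typedef uint64_t (*model_prog_fn)(const uint64_t *ctx);

/* Stand-in for what a generated trampoline does for a function taking
 * two pointers: spill both arguments as u64 into stack slots and pass
 * the address of those slots as ctx. */
uint64_t model_trampoline2(model_prog_fn prog, const void *arg0,
                           const void *arg1)
{
    uint64_t ctx[2];  /* 16 bytes of stack, as in the eth_type_trans example */

    assert(prog != NULL);
    ctx[0] = (uint64_t)(uintptr_t)arg0;
    ctx[1] = (uint64_t)(uintptr_t)arg1;
    return prog(ctx);
}

/* Example program: reports whether both saved arguments are non-null.
 * It only reads ctx[0] and ctx[1], mirroring the read-only, two-slot
 * access the verifier would permit. */
uint64_t model_prog_both_set(const uint64_t *ctx)
{
    return ctx[0] != 0 && ctx[1] != 0;
}
```

Calling `model_trampoline2(model_prog_both_set, &a, &b)` with two valid addresses returns 1; passing a null second argument returns 0.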
  17. 03 Nov, 2019 1 commit
    • D
      bpf: Add probe_read_{user, kernel} and probe_read_{user, kernel}_str helpers · 6ae08ae3
      Daniel Borkmann committed
      The current bpf_probe_read() and bpf_probe_read_str() helpers are broken
      in that they assume they can be used for probing memory access for kernel
      space addresses /as well as/ user space addresses.
      
      However, plain use of probe_kernel_read() for both cases will always
      attempt to access the kernel address space, given access is performed under
      KERNEL_DS, and some archs in fact have overlapping address spaces where a
      kernel pointer and user pointer would have the /same/ address value and
      therefore accessing application memory via bpf_probe_read{,_str}() would
      read garbage values.
      
      Let's fix the BPF side by making use of recently added 3d708182 ("uaccess:
      Add non-pagefault user-space read functions"). Unfortunately, the only way
      to fix this status quo is to add dedicated bpf_probe_read_{user,kernel}()
      and bpf_probe_read_{user,kernel}_str() helpers. The bpf_probe_read{,_str}()
      helpers are kept as-is to retain their current behavior.
      
      The two *_user() variants attempt the access always under USER_DS set, the
      two *_kernel() variants will -EFAULT when accessing user memory if the
      underlying architecture has non-overlapping address ranges, also avoiding
      throwing the kernel warning via 00c42373 ("x86-64: add warning for
      non-canonical user access address dereferences").
      
      Fixes: a5e8c070 ("bpf: add bpf_probe_read_str helper")
      Fixes: 2541517c ("tracing, perf: Implement BPF programs attached to kprobes")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/796ee46e948bc808d54891a1108435f8652c6ca4.1572649915.git.daniel@iogearbox.net
      6ae08ae3
  18. 31 Oct, 2019 1 commit
    • A
      bpf: Replace prog_raw_tp+btf_id with prog_tracing · f1b9509c
      Alexei Starovoitov committed
      The bpf program type raw_tp together with 'expected_attach_type'
      was the most appropriate api to indicate BTF-enabled raw_tp programs.
      But during development it became apparent that 'expected_attach_type'
      cannot be used and new 'attach_btf_id' field had to be introduced.
      Which means that the information is duplicated in two fields where
      one of them is ignored.
      Clean it up by introducing new program type where both
      'expected_attach_type' and 'attach_btf_id' fields have
      specific meaning.
      In the future 'expected_attach_type' will be extended
      with other attach points that have similar semantics to raw_tp.
      This patch is replacing BTF-enabled BPF_PROG_TYPE_RAW_TRACEPOINT with
      prog_type = BPF_PROG_TYPE_TRACING
      expected_attach_type = BPF_TRACE_RAW_TP
      attach_btf_id = btf_id of raw tracepoint inside the kernel
      Future patches will add
      expected_attach_type = BPF_TRACE_FENTRY or BPF_TRACE_FEXIT
      where programs have the same input context and the same helpers,
      but different attach points.
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20191030223212.953010-2-ast@kernel.org
      f1b9509c
  19. 17 Oct, 2019 2 commits
  20. 07 Oct, 2019 1 commit
  21. 28 Aug, 2019 1 commit
  22. 22 Aug, 2019 2 commits
  23. 21 Aug, 2019 1 commit
  24. 18 Aug, 2019 1 commit
  25. 10 Aug, 2019 1 commit
    • D
      sock: make cookie generation global instead of per netns · cd48bdda
      Daniel Borkmann committed
      Generating and retrieving socket cookies are a useful feature that is
      exposed to BPF for various program types through bpf_get_socket_cookie()
      helper.
      
      The fact that the cookie counter is per netns is quite a limitation
      for BPF in practice in particular for programs in host namespace that
      use socket cookies as part of a map lookup key since they will be
      causing socket cookie collisions e.g. when attached to BPF cgroup hooks
      or cls_bpf on tc egress in host namespace handling container traffic
      from veth or ipvlan devices with peer in different netns. Change the
      counter to be global instead.
      
      Socket cookie consumers must treat the value as opaque in any case.
      Not every socket must have a cookie generated and knowledge of the
      counter value itself does not provide much value either way hence
      conversion to global is fine.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Willem de Bruijn <willemb@google.com>
      Cc: Martynas Pumputis <m@lambda.lt>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cd48bdda