1. 27 Apr 2020, 1 commit
  2. 25 Apr 2020, 1 commit
  3. 31 Mar 2020, 3 commits
    • libbpf: Add support for bpf_link-based cgroup attachment · cc4f864b
      Andrii Nakryiko authored
      Add bpf_program__attach_cgroup(), which uses the BPF_LINK_CREATE subcommand
      to create an FD-based kernel bpf_link. Also add the low-level
      bpf_link_create() API.
      
      If expected_attach_type is not specified explicitly with
      bpf_program__set_expected_attach_type(), libbpf will try to determine the
      proper attach type from the BPF program's section definition.
      
      Also add support for bpf_link's underlying BPF program replacement:
        - unconditional through high-level bpf_link__update_program() API;
        - cmpxchg-like with specifying expected current BPF program through
          low-level bpf_link_update() API.
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/20200330030001.2312810-4-andriin@fb.com
      cc4f864b
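      A minimal userspace sketch of the API added here, assuming an already
      loaded bpf_object; the cgroup path and function names are illustrative
      only, and error handling is simplified (older libbpf returned error
      pointers rather than NULL):

        #include <fcntl.h>
        #include <bpf/libbpf.h>

        /* Attach prog to a cgroup via an FD-based bpf_link, then swap in
         * new_prog behind the same link without detaching.
         */
        static int attach_and_update(struct bpf_program *prog,
                                     struct bpf_program *new_prog)
        {
            struct bpf_link *link;
            int cg_fd = open("/sys/fs/cgroup/unified/app", O_RDONLY);

            if (cg_fd < 0)
                return -1;

            /* Uses BPF_LINK_CREATE under the hood; the attach type comes
             * from the program's section definition unless overridden with
             * bpf_program__set_expected_attach_type().
             */
            link = bpf_program__attach_cgroup(prog, cg_fd);
            if (!link)
                return -1;

            /* Unconditional replacement of the program behind the link. */
            return bpf_link__update_program(link, new_prog);
        }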
    • bpf: Implement bpf_link-based cgroup BPF program attachment · af6eea57
      Andrii Nakryiko authored
      Implement a new sub-command to attach cgroup BPF programs and return an
      FD-based bpf_link back on success. A bpf_link, once attached to a cgroup,
      cannot be replaced, except by the owner holding its FD. Cgroup bpf_link
      supports only BPF_F_ALLOW_MULTI semantics. Both link-based and prog-based
      BPF_F_ALLOW_MULTI attachments can be freely intermixed.
      
      To prevent a bpf_cgroup_link from keeping the cgroup alive past the point
      when no BPF program can be executed, implement auto-detachment of the
      link. When cgroup_bpf_release() is called, all attached bpf_links are
      forced to release their cgroup refcounts, but each bpf_link otherwise
      stays active and allocated and keeps owning the underlying bpf_prog. This
      is because user-space might still have FDs open and active, so bpf_link
      as a user-referenced object can't be freed yet. Once the last active FD
      is closed, the bpf_link is freed and the underlying bpf_prog refcount is
      dropped. The cgroup refcount won't be touched, because the cgroup has
      already been released.
      
      The inherent race between bpf_cgroup_link release (from closing the last
      FD) and cgroup_bpf_release() is resolved by both operations taking
      cgroup_mutex. The only additional check required is when a bpf_cgroup_link
      attempts to detach itself from the cgroup: at that point we check whether
      a cgroup is still associated with the link, and if not, exit with success,
      because the bpf_cgroup_link was already successfully detached.
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Roman Gushchin <guro@fb.com>
      Link: https://lore.kernel.org/bpf/20200330030001.2312810-2-andriin@fb.com
      af6eea57
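      A sketch of the new sub-command at the raw bpf(2) level, paired with the
      cmpxchg-like BPF_LINK_UPDATE from the same series; the attach type and
      descriptors are placeholders, and kernel headers from this series onward
      are assumed:

        #include <string.h>
        #include <unistd.h>
        #include <sys/syscall.h>
        #include <linux/bpf.h>

        /* Create an FD-based bpf_link attaching prog_fd to cgroup_fd. */
        static int cgroup_link_create(int prog_fd, int cgroup_fd)
        {
            union bpf_attr attr;

            memset(&attr, 0, sizeof(attr));
            attr.link_create.prog_fd = prog_fd;
            attr.link_create.target_fd = cgroup_fd;
            attr.link_create.attach_type = BPF_CGROUP_INET_INGRESS;

            /* The returned FD owns the link; closing the last FD frees the
             * link and drops the prog refcount, as described above.
             */
            return syscall(__NR_bpf, BPF_LINK_CREATE, &attr, sizeof(attr));
        }

        /* Replace the program behind the link, but only if old_prog_fd
         * still backs it (cmpxchg-like semantics).
         */
        static int cgroup_link_swap(int link_fd, int old_prog_fd,
                                    int new_prog_fd)
        {
            union bpf_attr attr;

            memset(&attr, 0, sizeof(attr));
            attr.link_update.link_fd = link_fd;
            attr.link_update.new_prog_fd = new_prog_fd;
            attr.link_update.old_prog_fd = old_prog_fd;
            attr.link_update.flags = BPF_F_REPLACE;

            return syscall(__NR_bpf, BPF_LINK_UPDATE, &attr, sizeof(attr));
        }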
    • bpf: Add socket assign support · cf7fbe66
      Joe Stringer authored
      Add support for TPROXY via a new bpf helper, bpf_sk_assign().
      
      This helper requires the BPF program to discover the socket via a call
      to bpf_sk*_lookup_*(), then pass this socket to the new helper. The
      helper takes its own reference to the socket, in addition to any
      existing reference that may or may not be held for the duration of BPF
      processing. For the destination socket to receive the traffic, the
      traffic must be routed towards that socket via a local route. The
      simplest example route is below, but in practice you may want to route
      traffic more narrowly (e.g. by CIDR):
      
        $ ip route add local default dev lo
      
      This patch avoids trying to introduce an extra bit into skb->sk, as
      that would require more invasive changes to all code interacting with
      the socket to ensure that the bit is handled correctly, such as all
      error-handling cases along the path from the helper in BPF through to
      the orphan path on input. Instead, we opt to use the destructor
      variable to switch on the prefetch of the socket.
      Signed-off-by: Joe Stringer <joe@wand.net.nz>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Link: https://lore.kernel.org/bpf/20200329225342.16317-2-joe@wand.net.nz
      cf7fbe66
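      A minimal sketch of the helper in a TC ingress program; header parsing
      to fill the lookup tuple is elided, and the section and program names
      are placeholders:

        // SPDX-License-Identifier: GPL-2.0
        #include <linux/bpf.h>
        #include <linux/pkt_cls.h>
        #include <bpf/bpf_helpers.h>

        SEC("tc")
        int steer_to_proxy(struct __sk_buff *skb)
        {
            struct bpf_sock_tuple tuple = {};
            struct bpf_sock *sk;

            /* ... fill tuple.ipv4 from the parsed IPv4/TCP headers ... */

            sk = bpf_sk_lookup_tcp(skb, &tuple, sizeof(tuple.ipv4),
                                   BPF_F_CURRENT_NETNS, 0);
            if (!sk)
                return TC_ACT_OK;

            /* Take an own reference on sk and mark the skb so the stack
             * delivers it to this socket; requires the local route above.
             */
            bpf_sk_assign(skb, sk, 0);
            bpf_sk_release(sk);
            return TC_ACT_OK;
        }

        char _license[] SEC("license") = "GPL";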
  4. 30 Mar 2020, 1 commit
  5. 28 Mar 2020, 2 commits
    • bpf: Enable bpf cgroup hooks to retrieve cgroup v2 and ancestor id · 0f09abd1
      Daniel Borkmann authored
      Enable the bpf_get_current_cgroup_id() helper for connect(), sendmsg(),
      recvmsg() and bind-related hooks in order to retrieve the cgroup v2
      context, which can then be used as part of the key for BPF map lookups,
      for example. Since these hooks operate in process context, 'current' is
      always valid and points to the application performing the mentioned
      syscalls, if it is subject to a v2 cgroup. Also, with the same motivation
      as commit 77236281 ("bpf: Introduce bpf_skb_ancestor_cgroup_id helper"),
      enable retrieval of the ancestor cgroup id from 'current', so the cgroup
      id can be used for policy lookups which can then forbid connect() /
      bind(), for example.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/d2a7ef42530ad299e3cbb245e6c12374b72145ef.1585323121.git.daniel@iogearbox.net
      0f09abd1
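      A sketch of how these helpers could feed a policy lookup from a
      connect4 hook; the map layout and the ancestor level are hypothetical:

        // SPDX-License-Identifier: GPL-2.0
        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>

        struct {
            __uint(type, BPF_MAP_TYPE_HASH);
            __uint(max_entries, 1024);
            __type(key, __u64);   /* cgroup v2 id */
            __type(value, __u8);  /* presence = may connect() */
        } allow_cg SEC(".maps");

        SEC("cgroup/connect4")
        int connect4_policy(struct bpf_sock_addr *ctx)
        {
            /* cgroup v2 id of the task calling connect() ... */
            __u64 id = bpf_get_current_cgroup_id();
            /* ... or of its ancestor at a fixed level */
            __u64 anc = bpf_get_current_ancestor_cgroup_id(2);

            if (bpf_map_lookup_elem(&allow_cg, &id) ||
                bpf_map_lookup_elem(&allow_cg, &anc))
                return 1; /* allow */
            return 0;     /* forbid connect() */
        }

        char _license[] SEC("license") = "GPL";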
    • bpf: Add netns cookie and enable it for bpf cgroup hooks · f318903c
      Daniel Borkmann authored
      In Cilium we're mainly using BPF cgroup hooks today in order to implement
      kube-proxy-free Kubernetes service translation for ClusterIP, NodePort (*),
      ExternalIP, and LoadBalancer, as well as HostPort mapping [0], for all
      traffic between Cilium-managed nodes. While this works in its current
      shape and avoids packet-level NAT for traffic between Cilium-managed
      nodes, there is one major limitation we're facing today: the lack of
      netns awareness.
      
      In Kubernetes, the concept of Pods (which hold one or multiple containers)
      has been built around network namespaces, so while we can use the global
      scope of attaching to root BPF cgroup hooks to our advantage (e.g. for
      exposing NodePort ports on loopback addresses), we also need to
      differentiate between the initial network namespace and non-initial ones.
      For example, ExternalIP services mandate that non-local service IPs are
      not to be translated from the host (initial) network namespace. Right
      now, we have an ugly work-around in place where non-local service IPs
      for ExternalIP services are not xlated from the connect() and friends
      BPF hooks, but instead via less efficient packet-level NAT on the veth
      tc ingress hook for Pod traffic.
      
      On top of determining whether we're in the initial or a non-initial
      network namespace, we also need a socket-cookie-like mechanism at network
      namespace scope. Socket cookies have the nice property that they can be
      combined as part of a key structure, e.g. for BPF LRU maps, without
      having to worry that the cookie could be recycled. We are planning to use
      this for our sessionAffinity implementation for services. Therefore, add
      a new bpf_get_netns_cookie() helper which resolves both use cases at
      once: bpf_get_netns_cookie(NULL) provides the cookie for the initial
      network namespace, while passing the context instead of NULL provides
      the cookie of the application's network namespace. We're using a hole,
      so there is no size increase; the assignment happens only once. This
      allows for a comparison against the initial namespace as well as regular
      cookie usage as we have today with socket cookies. We could later enable
      this helper for other program types as the need arises.
      
        (*) Both externalTrafficPolicy={Local|Cluster} types
        [0] https://github.com/cilium/cilium/blob/master/bpf/bpf_sock.c
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Link: https://lore.kernel.org/bpf/c47d2346982693a9cf9da0e12690453aded4c788.1585323121.git.daniel@iogearbox.net
      f318903c
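      A sketch of the initial-netns check described above, again in a
      connect4 hook; the policy details are placeholders:

        // SPDX-License-Identifier: GPL-2.0
        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>

        SEC("cgroup/connect4")
        int connect4_netns_aware(struct bpf_sock_addr *ctx)
        {
            /* cookie of the caller's netns vs. the initial netns */
            __u64 cookie = bpf_get_netns_cookie(ctx);
            __u64 init_cookie = bpf_get_netns_cookie(NULL);

            if (cookie == init_cookie) {
                /* host (initial) namespace: e.g. skip ExternalIP xlation */
                return 1;
            }
            /* Pod namespace: the cookie can also be embedded in an LRU
             * map key (e.g. for sessionAffinity), as it is never recycled.
             */
            return 1;
        }

        char _license[] SEC("license") = "GPL";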
  6. 13 Mar 2020, 2 commits
  7. 05 Mar 2020, 2 commits
  8. 04 Mar 2020, 1 commit
  9. 20 Feb 2020, 1 commit
  10. 18 Feb 2020, 1 commit
  11. 23 Jan 2020, 2 commits
  12. 16 Jan 2020, 2 commits
  13. 10 Jan 2020, 2 commits
  14. 20 Dec 2019, 1 commit
    • bpf: Support replacing cgroup-bpf program in MULTI mode · 7dd68b32
      Andrey Ignatov authored
      The common use-case in production is to have multiple cgroup-bpf
      programs per attach type that cover multiple use-cases. Such programs
      are attached with BPF_F_ALLOW_MULTI and can be maintained by different
      people.
      
      Order of programs usually matters; for example, imagine two egress
      programs: the first one drops packets and the second one counts them.
      If they're swapped, the result of the counting program will be different.
      
      This brings operational challenges when updating cgroup-bpf program(s)
      attached with BPF_F_ALLOW_MULTI, since there is no way to replace a
      program:
      
      * One way to update is to detach all programs first and then attach the
        new version(s) again in the right order. This interrupts the work a
        program is doing and may not be acceptable (e.g. if it's an egress
        firewall);

      * Another way is to attach the new version of a program first and only
        then detach the old version. This introduces a time interval during
        which two versions of the same program are running, which may not be
        acceptable if the program is not idempotent. It also imposes an
        additional burden on program developers to make sure that two versions
        of their program can co-exist.
      
      Solve the problem by introducing a "replace" mode in the BPF_PROG_ATTACH
      command for cgroup-bpf programs attached with the BPF_F_ALLOW_MULTI
      flag. This mode is enabled by the newly introduced BPF_F_REPLACE attach
      flag and the bpf_attr.replace_bpf_fd attribute, which passes the fd of
      the old program to replace.
      
      That way the user can replace any program among those attached with the
      BPF_F_ALLOW_MULTI flag without the problems described above.
      
      Details of the new API:
      
      * If BPF_F_REPLACE is set but replace_bpf_fd doesn't hold a valid
        descriptor of a BPF program, BPF_PROG_ATTACH will return the
        corresponding error (EINVAL or EBADF).

      * If replace_bpf_fd holds a valid descriptor of a BPF program but that
        program is not attached to the specified cgroup, BPF_PROG_ATTACH will
        return ENOENT.
      
      BPF_F_REPLACE is introduced to make the user intent clear, since
      replace_bpf_fd alone can't be used for this (its default value, 0, is a
      valid fd). BPF_F_REPLACE also makes it possible to extend the API in the
      future (e.g. add BPF_F_BEFORE and BPF_F_AFTER if needed).
      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/30cd850044a0057bdfcaaf154b7d2f39850ba813.1576741281.git.rdna@fb.com
      7dd68b32
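      A sketch of the replace mode at the raw bpf(2) level; the attach type
      and descriptors are placeholders:

        #include <string.h>
        #include <unistd.h>
        #include <sys/syscall.h>
        #include <linux/bpf.h>

        /* Swap old_prog_fd for new_prog_fd in the cgroup's ALLOW_MULTI
         * list, keeping its position in the execution order.
         */
        static int prog_replace(int cgroup_fd, int old_prog_fd,
                                int new_prog_fd)
        {
            union bpf_attr attr;

            memset(&attr, 0, sizeof(attr));
            attr.target_fd = cgroup_fd;
            attr.attach_bpf_fd = new_prog_fd;
            attr.attach_type = BPF_CGROUP_INET_EGRESS;
            attr.attach_flags = BPF_F_ALLOW_MULTI | BPF_F_REPLACE;
            attr.replace_bpf_fd = old_prog_fd; /* program to be replaced */

            /* EINVAL/EBADF on a bad replace_bpf_fd; ENOENT if the old
             * program is not attached to this cgroup, as described above.
             */
            return syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
        }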
  15. 18 Nov 2019, 1 commit
    • bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY · fc970227
      Andrii Nakryiko authored
      Add the ability to memory-map the contents of a BPF array map. This is
      extremely useful for working with BPF global data from userspace
      programs. It allows avoiding the typical bpf_map_{lookup,update}_elem
      operations, improving both performance and usability.
      
      Special considerations were needed for map freezing, to avoid having a
      writable memory view into a frozen map. To solve this, map freezing and
      mmap-ing now happen under a mutex:
        - if a map is already frozen, no writable mapping is allowed;
        - if a map has writable memory mappings active (accounted in
          map->writecnt), map freezing keeps failing with -EBUSY;
        - once the number of writable memory mappings drops to zero, map
          freezing can be performed again.
      
      Only non-per-CPU plain arrays are supported right now. Maps with spinlocks
      can't be memory mapped either.
      
      For a BPF_F_MMAPABLE array, memory allocation has to be done through
      vmalloc() to be mmap()'able. We also need to make sure that the array
      data memory is page-sized and page-aligned, so we over-allocate memory
      in such a way that struct bpf_array sits at the end of a single page of
      memory, with array->value aligned to the start of the second page. On
      deallocation we need to accommodate this memory arrangement to free the
      vmalloc()'ed memory correctly.
      
      One important consideration concerns how the memory-mapping subsystem
      functions. It provides a few optional callbacks, among them open() and
      close(). close() is called for each memory region that is unmapped, so
      that users can decrease their reference counters and free up resources,
      if necessary. open() is *almost* symmetrical: it's called for each
      memory region being mapped, **except** the very first one. So
      bpf_map_mmap does the initial refcnt bump, while open() does any extra
      ones after that. Thus the number of close() calls equals the number of
      open() calls plus one.
      Signed-off-by: Andrii Nakryiko <andriin@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Song Liu <songliubraving@fb.com>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Link: https://lore.kernel.org/bpf/20191117172806.2195367-4-andriin@fb.com
      fc970227
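      A userspace sketch of the mmap() usage this enables, assuming a recent
      libbpf that provides bpf_map_create() (older libbpf spells map creation
      differently); sizes and the map name are placeholders:

        #include <sys/mman.h>
        #include <unistd.h>
        #include <bpf/bpf.h>

        /* Create a BPF_F_MMAPABLE array and map its values into userspace,
         * bypassing bpf_map_{lookup,update}_elem entirely.
         */
        static unsigned long long *map_array_values(int *map_fd)
        {
            LIBBPF_OPTS(bpf_map_create_opts, opts,
                        .map_flags = BPF_F_MMAPABLE);
            const int max_entries = 512;
            long page = sysconf(_SC_PAGE_SIZE);
            /* data region is page-sized and page-aligned, per the commit */
            size_t len = (sizeof(unsigned long long) * max_entries +
                          page - 1) & ~(page - 1);
            void *mem;
            int fd;

            fd = bpf_map_create(BPF_MAP_TYPE_ARRAY, "mmap_arr",
                                sizeof(int), sizeof(unsigned long long),
                                max_entries, &opts);
            if (fd < 0)
                return NULL;

            /* a writable mapping blocks map freezing (-EBUSY) until it is
             * unmapped again
             */
            mem = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
                       fd, 0);
            if (mem == MAP_FAILED)
                return NULL;

            *map_fd = fd;
            return mem; /* direct view of the array's values */
        }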
  16. 16 Nov 2019, 2 commits
  17. 03 Nov 2019, 1 commit
    • bpf: Add probe_read_{user, kernel} and probe_read_{user, kernel}_str helpers · 6ae08ae3
      Daniel Borkmann authored
      The current bpf_probe_read() and bpf_probe_read_str() helpers are broken
      in that they assume they can be used for probing memory access for
      kernel-space /as well as/ user-space addresses.
      
      However, plain use of probe_kernel_read() for both cases will always
      attempt to access the kernel address space, given that access is
      performed under KERNEL_DS, and some archs in fact have overlapping
      address spaces, where a kernel pointer and a user pointer can have the
      /same/ address value; accessing application memory via
      bpf_probe_read{,_str}() would therefore read garbage values.
      
      Let's fix the BPF side by making use of the recently added 3d708182
      ("uaccess: Add non-pagefault user-space read functions"). Unfortunately,
      the only way to fix the status quo is to add dedicated
      bpf_probe_read_{user,kernel}() and bpf_probe_read_{user,kernel}_str()
      helpers. The bpf_probe_read{,_str}() helpers are kept as-is to retain
      their current behavior.
      
      The two *_user() variants always attempt the access under USER_DS; the
      two *_kernel() variants return -EFAULT when accessing user memory if
      the underlying architecture has non-overlapping address ranges, also
      avoiding the kernel warning from 00c42373 ("x86-64: add warning for
      non-canonical user access address dereferences").
      
      Fixes: a5e8c070 ("bpf: add bpf_probe_read_str helper")
      Fixes: 2541517c ("tracing, perf: Implement BPF programs attached to kprobes")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Andrii Nakryiko <andriin@fb.com>
      Link: https://lore.kernel.org/bpf/796ee46e948bc808d54891a1108435f8652c6ca4.1572649915.git.daniel@iogearbox.net
      6ae08ae3
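      A tracing sketch contrasting the two variants; the probed function, its
      argument layout, and the x86 kprobe scaffolding (BPF_KPROBE from
      bpf_tracing.h, built with the usual target-arch define) are assumptions:

        // SPDX-License-Identifier: GPL-2.0
        #include <linux/bpf.h>
        #include <bpf/bpf_helpers.h>
        #include <bpf/bpf_tracing.h>

        SEC("kprobe/do_sys_openat2")
        int BPF_KPROBE(probe_open, int dfd, const char *filename,
                       void *how)
        {
            char path[64];
            unsigned long long flags;

            /* filename is a user pointer: read under the user address
             * space, never via KERNEL_DS
             */
            bpf_probe_read_user_str(path, sizeof(path), filename);

            /* how is a kernel pointer (struct open_how *): kernel-space
             * read of its first field; would -EFAULT on a user address on
             * archs with non-overlapping address ranges
             */
            bpf_probe_read_kernel(&flags, sizeof(flags), how);
            return 0;
        }

        char _license[] SEC("license") = "GPL";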
  18. 31 Oct 2019, 1 commit
  19. 17 Oct 2019, 2 commits
  20. 07 Oct 2019, 1 commit
  21. 28 Aug 2019, 1 commit
  22. 22 Aug 2019, 1 commit
  23. 21 Aug 2019, 1 commit
  24. 18 Aug 2019, 1 commit
  25. 10 Aug 2019, 1 commit
  26. 31 Jul 2019, 1 commit
  27. 30 Jul 2019, 1 commit
  28. 26 Jul 2019, 2 commits
  29. 16 Jul 2019, 1 commit