1. 26 Jun, 2018 1 commit
    • bpf: fix attach type BPF_LIRC_MODE2 dependency wrt CONFIG_CGROUP_BPF · fdb5c453
      Sean Young committed
      If the kernel is compiled with CONFIG_CGROUP_BPF not enabled, it is not
      possible to attach, detach or query IR BPF programs to /dev/lircN devices,
      making them impossible to use. For embedded devices, it should be possible
      to use IR decoding without cgroups or CONFIG_CGROUP_BPF enabled.
      
      This change requires some refactoring, since bpf_prog_{attach,detach,query}
      functions are now always compiled, but their code paths for cgroups need
      moving out. Rather than a #ifdef CONFIG_CGROUP_BPF in kernel/bpf/syscall.c,
      moving them to kernel/bpf/cgroup.c and kernel/bpf/sockmap.c does not
      require #ifdefs since that is already conditionally compiled.
      
      Fixes: f4364dcf ("media: rc: introduce BPF_PROG_LIRC_MODE2")
      Signed-off-by: Sean Young <sean@mess.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  2. 16 Jun, 2018 2 commits
    • bpf: reject any prog that failed read-only lock · 9facc336
      Daniel Borkmann committed
      We currently lock any JITed image as read-only via bpf_jit_binary_lock_ro()
      as well as the BPF image as read-only through bpf_prog_lock_ro(). In
      the case any of these would fail we throw a WARN_ON_ONCE() in order to
      yell loudly to the log. Perhaps, to some extent, this may be comparable
      to an allocation where __GFP_NOWARN is explicitly not set.
      
      Added via 65869a47 ("bpf: improve read-only handling"), this behavior
      is slightly different compared to any of the other in-kernel set_memory_ro()
      users who do not check the return code of set_memory_ro() and friends /at
      all/ (e.g. in the case of module_enable_ro() / module_disable_ro()). Given
      that in BPF this is a mandatory hardening step, we want to know whether there
      are any issues that would leave both BPF data writable. So it happens
      that syzkaller enabled fault injection and it triggered memory allocation
      failure deep inside x86's change_page_attr_set_clr() which was triggered
      from set_memory_ro().
      
      Now, there are two options: i) leaving everything as is, and ii) reworking
      the image locking code in order to have a final checkpoint out of the
      central bpf_prog_select_runtime() which probes whether any of the calls
      during prog setup weren't successful, and then bailing out with an error.
      Option ii) is a better approach since this additional paranoia avoids
      altogether leaving any potential W+X pages from BPF side in the system.
      Therefore, let's be strict about it, and reject programs on such an unlikely
      occasion. While testing I noticed also that one bpf_prog_lock_ro()
      call was missing on the outer dummy prog in case of calls, e.g. in the
      destructor we call bpf_prog_free_deferred() on the main prog where we
      try to bpf_prog_unlock_free() the program, and since we go via
      bpf_prog_select_runtime() do that as well.
      
      Reported-by: syzbot+3b889862e65a98317058@syzkaller.appspotmail.com
      Reported-by: syzbot+9e762b52dd17e616a7a5@syzkaller.appspotmail.com
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
    • bpf: fix panic in prog load calls cleanup · 7d1982b4
      Daniel Borkmann committed
      While testing I found that when hitting error path in bpf_prog_load()
      where we jump to free_used_maps and prog contained BPF to BPF calls
      that were JITed earlier, then we never clean up the bpf_prog_kallsyms_add()
      done under jit_subprogs(). Add proper API to make BPF kallsyms deletion
      more clear and fix that.
      
      Fixes: 1c2a088a ("bpf: x64: add JIT support for multi-function programs")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  3. 03 Jun, 2018 1 commit
  4. 30 May, 2018 1 commit
  5. 28 May, 2018 1 commit
    • bpf: Hooks for sys_sendmsg · 1cedee13
      Andrey Ignatov committed
      In addition to already existing BPF hooks for sys_bind and sys_connect,
      the patch provides new hooks for sys_sendmsg.
      
      It leverages existing BPF program type `BPF_PROG_TYPE_CGROUP_SOCK_ADDR`
      that provides access to socket itself (properties like family, type,
      protocol) and user-passed `struct sockaddr *` so that BPF program can
      override destination IP and port for system calls such as sendto(2) or
      sendmsg(2) and/or assign source IP to the socket.
      
      The hooks are implemented as two new attach types:
      `BPF_CGROUP_UDP4_SENDMSG` and `BPF_CGROUP_UDP6_SENDMSG` for UDPv4 and
      UDPv6 correspondingly.
      
      UDPv4 and UDPv6 have separate attach types for the same reason as sys_bind and
      sys_connect hooks, i.e. to prevent reading from / writing to e.g.
      user_ip6 fields when user passes sockaddr_in since it'd be out-of-bound.
      
      The difference from the already existing hooks is that the sys_sendmsg
      hooks are implemented only for unconnected UDP.
      
      For TCP it doesn't make sense to change user-provided `struct sockaddr *`
      at sendto(2)/sendmsg(2) time since socket either was already connected
      and has source/destination set or wasn't connected and call to
      sendto(2)/sendmsg(2) would lead to ENOTCONN anyway.
      
      Connected UDP is already handled by sys_connect hooks that can override
      source/destination at connect time and use fast-path later, i.e. these
      hooks don't affect UDP fast-path.
      
      Rewriting source IP is implemented differently than that in sys_connect
      hooks. When sys_sendmsg is used with unconnected UDP it doesn't work to
      just bind socket to desired local IP address since source IP can be set
      on per-packet basis by using ancillary data (cmsg(3)). So no matter if
      socket is bound or not, source IP has to be rewritten on every call to
      sys_sendmsg.
      
      To do so, two new fields are added to UAPI `struct bpf_sock_addr`:
      * `msg_src_ip4` to set source IPv4 for UDPv4;
      * `msg_src_ip6` to set source IPv6 for UDPv6.

      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  6. 25 May, 2018 1 commit
    • bpf: introduce bpf subcommand BPF_TASK_FD_QUERY · 41bdc4b4
      Yonghong Song committed
      Currently, suppose a userspace application has loaded a bpf program
      and attached it to a tracepoint/kprobe/uprobe, and a bpf
      introspection tool, e.g., bpftool, wants to show which bpf program
      is attached to which tracepoint/kprobe/uprobe. Such attachment
      information will be really useful to understand the overall bpf
      deployment in the system.
      
      There is a name field (16 bytes) for each program, which could
      be used to encode the attachment point. There are some drawbacks
      to this approach. First, a bpftool user (e.g., an admin) may not
      really understand the association between the name and the
      attachment point. Second, if one program is attached to multiple
      places, encoding a proper name which can imply all these
      attachments becomes difficult.
      
      This patch introduces a new bpf subcommand BPF_TASK_FD_QUERY.
      Given a pid and fd, if the <pid, fd> is associated with a
      tracepoint/kprobe/uprobe perf event, BPF_TASK_FD_QUERY will return
         . prog_id
         . tracepoint name, or
         . k[ret]probe funcname + offset or kernel addr, or
         . u[ret]probe filename + offset
      to the userspace.
      The user can use "bpftool prog" to find more information about
      bpf program itself with prog_id.

      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
  7. 24 May, 2018 3 commits
    • bpf: get JITed image lengths of functions via syscall · 815581c1
      Sandipan Das committed
      This adds two new fields to struct bpf_prog_info. For
      multi-function programs, these fields can be used to pass
      a list of the JITed image lengths of each function for a
      given program to userspace using the bpf system call with
      the BPF_OBJ_GET_INFO_BY_FD command.
      
      This can be used by userspace applications like bpftool
      to split up the contiguous JITed dump, also obtained via
      the system call, into more relatable chunks corresponding
      to each function.
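
      The splitting step described above can be sketched as follows. This is
      a minimal userspace illustration, not bpftool's code: it assumes the
      per-function lengths have already been fetched into an array, and the
      names `jit_chunk` and `split_jited_dump` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One function's slice of the contiguous JITed image (hypothetical type). */
struct jit_chunk {
	size_t offset;  /* byte offset into the full dump */
	uint32_t len;   /* length of this function's image in bytes */
};

/*
 * Fill chunks[i] for each of the nr_funcs entries in func_lens.
 * Returns 0 on success, -1 if the lengths do not add up to total_len.
 */
static int split_jited_dump(const uint32_t *func_lens, size_t nr_funcs,
			    size_t total_len, struct jit_chunk *chunks)
{
	size_t off = 0;

	assert(func_lens != NULL && chunks != NULL);
	for (size_t i = 0; i < nr_funcs; i++) {
		if (func_lens[i] > total_len - off)
			return -1;  /* lengths overrun the dump */
		chunks[i].offset = off;
		chunks[i].len = func_lens[i];
		off += func_lens[i];
	}
	return off == total_len ? 0 : -1;
}
```

      Each chunk can then be disassembled and labelled on its own instead of
      as one undivided block.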

      Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: fix multi-function JITed dump obtained via syscall · 4d56a76e
      Sandipan Das committed
      Currently, for multi-function programs, we cannot get the JITed
      instructions using the bpf system call's BPF_OBJ_GET_INFO_BY_FD
      command. Because of this, userspace tools such as bpftool fail
      to identify a multi-function program as being JITed or not.
      
      With the JIT enabled and the test program running, this can be
      verified as follows:
      
        # cat /proc/sys/net/core/bpf_jit_enable
        1
      
      Before applying this patch:
      
        # bpftool prog list
        1: kprobe  name foo  tag b811aab41a39ad3d  gpl
                loaded_at 2018-05-16T11:43:38+0530  uid 0
                xlated 216B  not jited  memlock 65536B
        ...
      
        # bpftool prog dump jited id 1
        no instructions returned
      
      After applying this patch:
      
        # bpftool prog list
        1: kprobe  name foo  tag b811aab41a39ad3d  gpl
                loaded_at 2018-05-16T12:13:01+0530  uid 0
                xlated 216B  jited 308B  memlock 65536B
        ...
      
        # bpftool prog dump jited id 1
           0:   nop
           4:   nop
           8:   mflr    r0
           c:   std     r0,16(r1)
          10:   stdu    r1,-112(r1)
          14:   std     r31,104(r1)
          18:   addi    r31,r1,48
          1c:   li      r3,10
        ...

      Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: get kernel symbol addresses via syscall · dbecd738
      Sandipan Das committed
      This adds two new fields to struct bpf_prog_info. For
      multi-function programs, these fields can be used to pass
      a list of kernel symbol addresses for all functions in a
      given program to userspace using the bpf system call with
      the BPF_OBJ_GET_INFO_BY_FD command.
      
      When bpf_jit_kallsyms is enabled, we can get the address
      of the corresponding kernel symbol for a callee function
      and resolve the symbol's name. The address is determined
      by adding the value of the call instruction's imm field
      to __bpf_call_base. This offset gets assigned to the imm
      field by the verifier.
      
      For some architectures, such as powerpc64, the imm field
      is not large enough to hold this offset.
      
      We resolve this by:
      
      [1] Assigning the subprog id to the imm field of a call
          instruction in the verifier instead of the offset of
          the callee's symbol's address from __bpf_call_base.
      
      [2] Determining the address of a callee's corresponding
          symbol by using the imm field as an index for the
          list of kernel symbol addresses now available from
          the program info.
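
      The two ways of resolving a callee's address can be contrasted with a
      small sketch. It is an illustration under assumptions: the function
      names are hypothetical, `call_base` stands in for the address of
      __bpf_call_base, and `ksyms` stands in for the address list obtained
      from the program info.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Old scheme: imm holds the callee's offset from a base address. */
static uint64_t callee_addr_from_offset(uint64_t call_base, int32_t imm)
{
	/* Sign-extend the 32-bit offset before adding it to the base. */
	return call_base + (uint64_t)(int64_t)imm;
}

/*
 * New scheme: imm holds the subprog id, used as an index into the
 * list of kernel symbol addresses. Returns 0 if the index is out of range.
 */
static uint64_t callee_addr_from_index(const uint64_t *ksyms, size_t nr_ksyms,
				       int32_t imm)
{
	assert(ksyms != NULL);
	if (imm < 0 || (size_t)imm >= nr_ksyms)
		return 0;
	return ksyms[imm];
}
```

      With the index scheme the 32-bit imm field only has to hold a small
      integer, so the distance between the callee and the base address no
      longer matters.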

      Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: Sandipan Das <sandipan@linux.vnet.ibm.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  8. 23 May, 2018 2 commits
  9. 09 May, 2018 2 commits
    • bpf: btf: Add struct bpf_btf_info · 62dab84c
      Martin KaFai Lau committed
      During BPF_OBJ_GET_INFO_BY_FD on a btf_fd, the current bpf_attr's
      info.info is directly filled with the BTF binary data.  It is
      not extensible.  In this case, we want to add BTF ID.
      
      This patch adds "struct bpf_btf_info" which has the BTF ID as
      one of its members.  The BTF binary data itself is exposed through
      the "btf" and "btf_size" members.

      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@fb.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: btf: Introduce BTF ID · 78958fca
      Martin KaFai Lau committed
      This patch gives an ID to each loaded BTF.  The ID is allocated by
      the idr like the existing prog-id and map-id.
      
      The btf_put(map->btf) is moved to __bpf_map_put() so that the
      userspace can stop seeing the BTF ID ASAP when the last BTF
      refcnt is gone.
      
      It also makes BTF accessible from userspace through the
      1. new BPF_BTF_GET_FD_BY_ID command.  It is limited to CAP_SYS_ADMIN
         which is in line with the BPF_BTF_LOAD cmd and the existing
         BPF_[MAP|PROG]_GET_FD_BY_ID cmd.
      2. new btf_id (and btf_key_id + btf_value_id) in "struct bpf_map_info"
      
      Once the BTF ID handler is accessible from userspace, freeing a BTF
      object has to go through a rcu period.  The BPF_BTF_GET_FD_BY_ID cmd
      can then be done under a rcu_read_lock() instead of taking
      spin_lock.
      [Note: A similar rcu usage can be done to the existing
             bpf_prog_get_fd_by_id() in a follow up patch]
      
      When processing the BPF_BTF_GET_FD_BY_ID cmd,
      refcount_inc_not_zero() is needed because the BTF object
      could be already in the rcu dead row.  btf_get() is
      removed since its usage is currently limited to btf.c
      alone.  refcount_inc() is used directly instead.

      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@fb.com>
      Acked-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  10. 05 May, 2018 1 commit
    • nfp: bpf: record offload neutral maps in the driver · 630a4d38
      Jakub Kicinski committed
      For asynchronous events originating from the device, like perf event
      output, we need to be able to make sure that objects being referred
      to by the FW message are valid on the host.  FW events can get queued
      and reordered.  Even if we had a FW message "barrier" we should still
      protect ourselves from bogus FW output.
      
      Add a reverse-mapping hash table and record in it all raw map pointers
      FW may refer to.  Only record neutral maps, i.e. perf event arrays.
      These are currently the only objects FW can refer to.  Use RCU protection
      on the read side, update side is under RTNL.
      
      Since program vs map destruction order is slightly painful for offload
      simply take an extra reference on all the recorded maps to make sure
      they don't disappear.

      Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  11. 04 May, 2018 2 commits
  12. 30 Apr, 2018 1 commit
  13. 27 Apr, 2018 1 commit
  14. 24 Apr, 2018 1 commit
  15. 20 Apr, 2018 3 commits
    • bpf: btf: Add pretty print support to the basic arraymap · a26ca7c9
      Martin KaFai Lau committed
      This patch adds pretty print support to the basic arraymap.
      Support for other bpf maps can be added later.
      
      This patch adds new attrs to the BPF_MAP_CREATE command to allow
      specifying the btf_fd, btf_key_id and btf_value_id.  The
      BPF_MAP_CREATE can then associate the btf to the map if
      the creating map supports BTF.
      
      A BTF supported map needs to implement two new map ops,
      map_seq_show_elem() and map_check_btf().  This patch has
      implemented these new map ops for the basic arraymap.
      
      It also adds file_operations, bpffs_map_fops, to the pinned
      map such that the pinned map can be opened and read.
      After that, the user has an intuitive way to do
      "cat bpffs/pathto/a-pinned-map" instead of getting
      an error.
      
      bpffs_map_fops should not be extended further to support
      other operations.  Other operations (e.g. write/key-lookup...)
      should be realized by the userspace tools (e.g. bpftool) through
      the BPF_OBJ_GET_INFO_BY_FD, map's lookup/update interface...etc.
      Follow up patches will allow the userspace to obtain
      the BTF from a map-fd.
      
      Here is a sample output when reading a pinned arraymap
      with the following map's value:
      
      struct map_value {
      	int count_a;
      	int count_b;
      };
      
      cat /sys/fs/bpf/pinned_array_map:
      
      0: {1,2}
      1: {3,4}
      2: {5,6}
      ...

      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: btf: Add BPF_OBJ_GET_INFO_BY_FD support to BTF fd · 60197cfb
      Martin KaFai Lau committed
      This patch adds BPF_OBJ_GET_INFO_BY_FD support to BTF fd.
      The original BTF data, which was used to create the BTF fd during
      the earlier BPF_BTF_LOAD call, will be returned.
      
      Userspace is expected to allocate a buffer for info.info
      and to set info.info_len to the buffer size before
      calling BPF_OBJ_GET_INFO_BY_FD.
      
      The original BTF data is copied to the userspace buffer (info.info).
      Only up to the user's specified info.info_len will be copied.
      
      The original BTF data size is set to info.info_len.  The userspace
      needs to check if it is bigger than its allocated buffer size.
      If it is, the userspace should realloc with the kernel-returned
      info.info_len and call the BPF_OBJ_GET_INFO_BY_FD again.
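
      The "call, compare sizes, reallocate, call again" protocol described
      above can be sketched without touching the real syscall. In this
      illustration `query_fn`, `fetch_all` and `fake_query` are hypothetical:
      the callback only models the described behaviour (copy at most *len
      bytes, then report the full size in *len).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical query callback modelling the behaviour described above. */
typedef int (*query_fn)(void *buf, uint32_t *len);

/*
 * Fetch the whole blob, growing the buffer whenever the reported size is
 * larger than what was allocated. The caller frees the result.
 */
static void *fetch_all(query_fn query, uint32_t first_guess, uint32_t *out_len)
{
	uint32_t cap = first_guess ? first_guess : 1;
	void *buf = malloc(cap);

	assert(query != NULL && out_len != NULL);
	while (buf) {
		uint32_t len = cap;
		void *bigger;

		if (query(buf, &len) != 0)
			break;              /* query failed */
		if (len <= cap) {           /* everything fit */
			*out_len = len;
			return buf;
		}
		/* Too small: retry with the size the query reported. */
		bigger = realloc(buf, len);
		if (!bigger)
			break;
		buf = bigger;
		cap = len;
	}
	free(buf);
	return NULL;
}

/* Fake data source standing in for the kernel, for illustration only. */
static const char fake_btf[] = "0123456789abcdef";

static int fake_query(void *buf, uint32_t *len)
{
	uint32_t full = sizeof(fake_btf);

	memcpy(buf, fake_btf, *len < full ? *len : full);
	*len = full;
	return 0;
}
```

      Looping rather than retrying exactly once also covers the case where
      the reported size changes between two calls.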

      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: btf: Add BPF_BTF_LOAD command · f56a653c
      Martin KaFai Lau committed
      This patch adds a BPF_BTF_LOAD command which
      1) loads and verifies the BTF (implemented in earlier patches)
      2) returns a BTF fd to userspace.  In the next patch, the
         BTF fd can be specified during BPF_MAP_CREATE.
      
      It is currently limited to CAP_SYS_ADMIN.

      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Alexei Starovoitov <ast@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  16. 04 Apr, 2018 1 commit
    • kernel/bpf/syscall: fix warning defined but not used · 33491588
      Anders Roxell committed
      There will be a build warning -Wunused-function if CONFIG_CGROUP_BPF
      isn't defined, since the only user is inside #ifdef CONFIG_CGROUP_BPF:
      kernel/bpf/syscall.c:1229:12: warning: ‘bpf_prog_attach_check_attach_type’
          defined but not used [-Wunused-function]
       static int bpf_prog_attach_check_attach_type(const struct bpf_prog *prog,
                  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      
      Current code moves function bpf_prog_attach_check_attach_type inside
      ifdef CONFIG_CGROUP_BPF.
      
      Fixes: 5e43f899 ("bpf: Check attach type at prog load time")
      Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  17. 31 Mar, 2018 4 commits
    • bpf: Post-hooks for sys_bind · aac3fc32
      Andrey Ignatov committed
      "Post-hooks" are hooks that are called right before returning from
      sys_bind. At this time IP and port are already allocated and no further
      changes to `struct sock` can happen before returning from sys_bind but
      BPF program has a chance to inspect the socket and change sys_bind
      result.
      
      Specifically it can e.g. inspect what port was allocated and if it
      doesn't satisfy some policy, BPF program can force sys_bind to fail and
      return EPERM to user.
      
      Another example of usage is recording the IP:port pair to some map to
      use it in later calls to sys_connect. E.g. if some TCP server inside
      cgroup was bound to some IP:port_n, it can be recorded to a map. And
      later when some TCP client inside same cgroup is trying to connect to
      127.0.0.1:port_n, BPF hook for sys_connect can override the destination
      and connect application to IP:port_n instead of 127.0.0.1:port_n. That
      helps forcing all applications inside a cgroup to use desired IP and not
      break those applications if they e.g. use localhost to communicate
      between each other.
      
      == Implementation details ==
      
      Post-hooks are implemented as two new attach types
      `BPF_CGROUP_INET4_POST_BIND` and `BPF_CGROUP_INET6_POST_BIND` for
      existing prog type `BPF_PROG_TYPE_CGROUP_SOCK`.
      
      Separate attach types for IPv4 and IPv6 are introduced to avoid access
      to IPv6 field in `struct sock` from `inet_bind()` and to IPv4 field from
      `inet6_bind()` since those fields might not make sense in such cases.

      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: Hooks for sys_connect · d74bad4e
      Andrey Ignatov committed
      == The problem ==
      
      See description of the problem in the initial patch of this patch set.
      
      == The solution ==
      
      The patch provides a much more reliable in-kernel solution for the 2nd
      part of the problem: making outgoing connections from the desired IP.
      
      It adds new attach types `BPF_CGROUP_INET4_CONNECT` and
      `BPF_CGROUP_INET6_CONNECT` for program type
      `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` that can be used to override both
      source and destination of a connection at connect(2) time.
      
      Local end of connection can be bound to desired IP using newly
      introduced BPF-helper `bpf_bind()`. It only allows binding to an IP though,
      and doesn't support binding to a port, i.e. it leverages the
      `IP_BIND_ADDRESS_NO_PORT` socket option. There are two reasons for this:
      * looking for a free port is expensive and can affect performance
        significantly;
      * there is no use-case for port.
      
      As for remote end (`struct sockaddr *` passed by user), both parts of it
      can be overridden, remote IP and remote port. It's useful if an
      application inside cgroup wants to connect to another application inside
      same cgroup or to itself, but knows nothing about IP assigned to the
      cgroup.
      
      Support is added for IPv4 and IPv6, for TCP and UDP.
      
      IPv4 and IPv6 have separate attach types for same reason as sys_bind
      hooks, i.e. to prevent reading from / writing to e.g. user_ip6 fields
      when user passes sockaddr_in since it'd be out-of-bound.
      
      == Implementation notes ==
      
      The patch introduces new field in `struct proto`: `pre_connect` that is
      a pointer to a function with same signature as `connect` but is called
      before it. The reason is in some cases BPF hooks should be called way
      before control is passed to `sk->sk_prot->connect`. Specifically
      `inet_dgram_connect` autobinds socket before calling
      `sk->sk_prot->connect` and there is no way to call `bpf_bind()` from
      hooks from e.g. `ip4_datagram_connect` or `ip6_datagram_connect` since
      it'd cause double-bind. On the other hand `proto.pre_connect` provides a
      flexible way to add BPF hooks for connect only for necessary `proto` and
      call them at desired time before `connect`. Since `bpf_bind()` is
      allowed to bind only to IP and autobind in `inet_dgram_connect` binds
      only port there is no chance of double-bind.
      
      bpf_bind() sets `force_bind_address_no_port` to bind to only IP regardless
      of the value of the `bind_address_no_port` socket field.
      
      bpf_bind() sets `with_lock` to `false` when calling to __inet_bind()
      and __inet6_bind() since all call-sites, where bpf_bind() is called,
      already hold socket lock.

      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: Hooks for sys_bind · 4fbac77d
      Andrey Ignatov committed
      == The problem ==
      
      There is a use-case when all processes inside a cgroup should use one
      single IP address on a host that has multiple IPs configured.  Those
      processes should use the IP for both ingress and egress, for TCP and UDP
      traffic. So TCP/UDP servers should be bound to that IP to accept
      incoming connections on it, and TCP/UDP clients should make outgoing
      connections from that IP. It should not require changing application
      code since it's often not possible.
      
      Currently it's solved by intercepting glibc wrappers around syscalls
      such as `bind(2)` and `connect(2)`. It's done by a shared library that
      is preloaded for every process in a cgroup so that whenever TCP/UDP
      server calls `bind(2)`, the library replaces IP in sockaddr before
      passing arguments to syscall. When application calls `connect(2)` the
      library transparently binds the local end of connection to that IP
      (`bind(2)` with `IP_BIND_ADDRESS_NO_PORT` to avoid performance penalty).
      
      Shared library approach is fragile though, e.g.:
      * some applications clear env vars (incl. `LD_PRELOAD`);
      * `/etc/ld.so.preload` doesn't help since some applications are linked
        with option `-z nodefaultlib`;
      * other applications don't use glibc and there is nothing to intercept.
      
      == The solution ==
      
      The patch provides much more reliable in-kernel solution for the 1st
      part of the problem: binding TCP/UDP servers on desired IP. It does not
      depend on application environment and implementation details (whether
      glibc is used or not).
      
      It adds new eBPF program type `BPF_PROG_TYPE_CGROUP_SOCK_ADDR` and
      attach types `BPF_CGROUP_INET4_BIND` and `BPF_CGROUP_INET6_BIND`
      (similar to already existing `BPF_CGROUP_INET_SOCK_CREATE`).
      
      The new program type is intended to be used with sockets (`struct sock`)
      in a cgroup and provided by user `struct sockaddr`. Pointers to both of
      them are parts of the context passed to programs of newly added types.
      
      The new attach types provide hooks in `bind(2)` system call for both
      IPv4 and IPv6 so that one can write a program to override IP addresses
      and ports user program tries to bind to and apply such a program for
      whole cgroup.
      
      == Implementation notes ==
      
      [1]
      Separate attach types for `AF_INET` and `AF_INET6` are added
      intentionally to prevent reading/writing to offsets that don't make
      sense for corresponding socket family. E.g. if user passes `sockaddr_in`
      it doesn't make sense to read from / write to `user_ip6[]` context
      fields.
      
      [2]
      The write access to `struct bpf_sock_addr_kern` is implemented using
      special field as an additional "register".
      
      There are just two registers in `sock_addr_convert_ctx_access`: `src`
      with value to write and `dst` with pointer to context that can't be
      changed not to break later instructions. But the fields, allowed to
      write to, are not available directly and to access them address of
      corresponding pointer has to be loaded first. To get additional register
      the 1st not used by `src` and `dst` one is taken, its content is saved
      to `bpf_sock_addr_kern.tmp_reg`, then the register is used to load
      address of pointer field, and finally the register's content is restored
      from the temporary field after writing `src` value.
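
      Choosing the scratch register can be sketched as below. This is a
      simplified illustration under assumptions, not the kernel's code: the
      register numbering and the choice to search downwards from the highest
      register are hypothetical, as is the name `pick_tmp_reg`.

```c
#include <assert.h>

/* Hypothetical register numbering; 9 is taken as the highest usable one. */
#define TOP_REG 9

/* Return a register number that is neither src nor dst. */
static int pick_tmp_reg(int src, int dst)
{
	int tmp = TOP_REG;

	assert(src >= 0 && src <= TOP_REG && dst >= 0 && dst <= TOP_REG);
	while (tmp == src || tmp == dst)
		tmp--;  /* at most two steps, so tmp stays >= TOP_REG - 2 */
	return tmp;
}
```

      The emitted sequence then follows the order given above: save the
      chosen register to the temporary field, load the pointer into it, store
      `src` through that pointer, and restore the register.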

      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
    • bpf: Check attach type at prog load time · 5e43f899
      Andrey Ignatov committed
      == The problem ==
      
      There are use-cases when a program of some type can be attached to
      multiple attach points and those attach points must have different
      permissions to access context or to call helpers.
      
      E.g. context structure may have fields for both IPv4 and IPv6 but it
      doesn't make sense to read from / write to IPv6 field when attach point
      is somewhere in IPv4 stack.
      
      Same applies to BPF-helpers: it may make sense to call some helper from
      some attach point, but not from other for same prog type.
      
      == The solution ==
      
      Introduce `expected_attach_type` field in `struct bpf_attr` for
      `BPF_PROG_LOAD` command. If scenario described in "The problem" section
      is the case for some prog type, the field will be checked twice:
      
      1) At load time prog type is checked to see if attach type for it must
         be known to validate program permissions correctly. Prog will be
         rejected with EINVAL if it's the case and `expected_attach_type` is
         not specified or has invalid value.
      
      2) At attach time `attach_type` is compared with `expected_attach_type`,
         if prog type requires to have one, and, if they differ, attach will
         be rejected with EINVAL.
      
      The `expected_attach_type` is now available as part of `struct bpf_prog`
      in both `bpf_verifier_ops->is_valid_access()` and
      `bpf_verifier_ops->get_func_proto()` and can be used to check context
      accesses and calls to helpers correspondingly.
      
      Initially the idea was discussed by Alexei Starovoitov <ast@fb.com> and
      Daniel Borkmann <daniel@iogearbox.net> here:
      https://marc.info/?l=linux-netdev&m=152107378717201&w=2

      Signed-off-by: Andrey Ignatov <rdna@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  18. 29 Mar, 2018 1 commit
    • bpf: introduce BPF_RAW_TRACEPOINT · c4f6699d
      Alexei Starovoitov committed
      Introduce BPF_PROG_TYPE_RAW_TRACEPOINT bpf program type to access
      kernel internal arguments of the tracepoints in their raw form.
      
      From the bpf program's point of view, access to the arguments looks like:
      struct bpf_raw_tracepoint_args {
             __u64 args[0];
      };
      
      int bpf_prog(struct bpf_raw_tracepoint_args *ctx)
      {
        // program can read args[N] where N depends on tracepoint
        // and statically verified at program load+attach time
      }
      
      kprobe+bpf infrastructure allows programs to access function arguments.
      This feature allows programs to access raw tracepoint arguments.
      
      Similar to proposed 'dynamic ftrace events' there are no abi guarantees
      to what the tracepoints arguments are and what their meaning is.
      The program needs to type cast args properly and use bpf_probe_read()
      helper to access struct fields when argument is a pointer.
      
      For every tracepoint __bpf_trace_##call function is prepared.
      In assembler it looks like:
      (gdb) disassemble __bpf_trace_xdp_exception
      Dump of assembler code for function __bpf_trace_xdp_exception:
         0xffffffff81132080 <+0>:     mov    %ecx,%ecx
         0xffffffff81132082 <+2>:     jmpq   0xffffffff811231f0 <bpf_trace_run3>
      
      where
      
      TRACE_EVENT(xdp_exception,
              TP_PROTO(const struct net_device *dev,
                       const struct bpf_prog *xdp, u32 act),
      
      The above assembler snippet casts the 32-bit 'act' field to 'u64'
      to pass it into bpf_trace_run3(), while the 'dev' and 'xdp' args are passed as-is.
      All of the ~500 __bpf_trace_*() functions are only 5-10 bytes long,
      and in total this approach adds 7k bytes to .text.
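
The widening step can be modelled in user space. The sketch below is an illustration only: the `*_model` types and the `pack_xdp_exception()`, `read_act()`, and `read_ifindex()` helpers are hypothetical stand-ins, and a fixed three-slot array replaces the kernel's variable-length `args[]`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* User-space stand-in for the context a raw tracepoint program receives. */
struct raw_tp_args_model {
    uint64_t args[3];
};

/* Stand-ins for the kernel types named in TP_PROTO (contents are made up). */
struct net_device_model { int ifindex; };
struct bpf_prog_model   { int id; };

/* Model of a __bpf_trace_##call wrapper: pointers pass through unchanged,
 * the 32-bit 'act' is zero-extended to 64 bits. */
static struct raw_tp_args_model
pack_xdp_exception(const struct net_device_model *dev,
                   const struct bpf_prog_model *xdp, uint32_t act)
{
    struct raw_tp_args_model ctx;

    assert(dev != NULL && xdp != NULL);
    ctx.args[0] = (uint64_t)(uintptr_t)dev;
    ctx.args[1] = (uint64_t)(uintptr_t)xdp;
    ctx.args[2] = (uint64_t)act;
    return ctx;
}

/* Program side: cast each slot back to the type the tracepoint declared. */
static uint32_t read_act(const struct raw_tp_args_model *ctx)
{
    return (uint32_t)ctx->args[2];
}

static int read_ifindex(const struct raw_tp_args_model *ctx)
{
    const struct net_device_model *dev =
        (const struct net_device_model *)(uintptr_t)ctx->args[0];

    /* A real bpf program would go through bpf_probe_read() here. */
    return dev->ifindex;
}
```

Because every slot is a plain 64-bit value, the program alone is responsible for knowing which slot holds a pointer and which holds a scalar.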
      
      This approach gives the lowest possible overhead
      when calling trace_xdp_exception() from kernel C code and
      transitioning into bpf land.
      Since tracepoint+bpf are used at speeds of 1M+ events per second,
      this is a valuable optimization.
      
      The new BPF_RAW_TRACEPOINT_OPEN sys_bpf command is introduced
      that returns anon_inode FD of 'bpf-raw-tracepoint' object.
      
      The user space looks like:
      // load bpf prog with BPF_PROG_TYPE_RAW_TRACEPOINT type
      prog_fd = bpf_prog_load(...);
      // receive anon_inode fd for given bpf_raw_tracepoint with prog attached
      raw_tp_fd = bpf_raw_tracepoint_open("xdp_exception", prog_fd);
      
      Ctrl-C of tracing daemon or cmdline tool that uses this feature
      will automatically detach bpf program, unload it and
      unregister tracepoint probe.
      
      On the kernel side the __bpf_raw_tp_map section of pointers to
      tracepoint definition and to __bpf_trace_*() probe function is used
      to find a tracepoint with "xdp_exception" name and
      corresponding __bpf_trace_xdp_exception() probe function
      which are passed to tracepoint_probe_register() to connect probe
      with tracepoint.
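
The name lookup described above amounts to searching a table of (tracepoint name, probe function) pairs. A minimal user-space sketch follows, assuming a plain array in place of the linker section; `struct raw_tp_map_model`, `raw_tp_table`, `find_raw_tp()`, and the probe functions are hypothetical names, not kernel symbols.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef int (*probe_fn)(void);

/* One entry per tracepoint: its name and the matching probe function. */
struct raw_tp_map_model {
    const char *name;
    probe_fn    probe;
};

static int probe_xdp_exception(void) { return 1; }
static int probe_sched_switch(void)  { return 2; }

/* Stand-in for the __bpf_raw_tp_map section (entries are illustrative). */
static const struct raw_tp_map_model raw_tp_table[] = {
    { "sched_switch",  probe_sched_switch  },
    { "xdp_exception", probe_xdp_exception },
};

/* Linear search by name; returns NULL when no tracepoint matches. */
static const struct raw_tp_map_model *find_raw_tp(const char *name)
{
    size_t n = sizeof(raw_tp_table) / sizeof(raw_tp_table[0]);

    assert(name != NULL);
    for (size_t i = 0; i < n; i++)
        if (strcmp(raw_tp_table[i].name, name) == 0)
            return &raw_tp_table[i];
    return NULL;
}
```

In the kernel the matching entry would then be handed to tracepoint_probe_register(); the sketch stops at the lookup.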
      
      Addition of bpf_raw_tracepoint doesn't interfere with ftrace and perf
      tracepoint mechanisms. perf_event_open() can be used in parallel
      on the same tracepoint.
      Multiple bpf_raw_tracepoint_open("xdp_exception", prog_fd) calls are permitted,
      each with its own bpf program. The kernel will execute
      all tracepoint probes and all attached bpf programs.
      
      In the future bpf_raw_tracepoints can be extended with
      query/introspection logic.
      
      The __bpf_raw_tp_map section logic was contributed by Steven Rostedt.
      
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Acked-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  19. 28 Mar, 2018 1 commit
  20. 21 Mar, 2018 1 commit
  21. 20 Mar, 2018 1 commit
    • J
      bpf: create tcp_bpf_ulp allowing BPF to monitor socket TX/RX data · 4f738adb
      John Fastabend authored
      This implements a BPF ULP layer to allow policy enforcement and
      monitoring at the socket layer. In order to support this a new
      program type BPF_PROG_TYPE_SK_MSG is used to run the policy at
      the sendmsg/sendpage hook. To attach the policy to sockets a
      sockmap is used with a new program attach type BPF_SK_MSG_VERDICT.
      
      Similar to previous sockmap usages, when a sock is added to a
      sockmap via a map update and the map has a BPF_SK_MSG_VERDICT
      program attached, the BPF ULP layer is created on the
      socket and the attached BPF_PROG_TYPE_SK_MSG program is run for
      every msg in the sendmsg case and every page/offset in the sendpage case.
      
      BPF_PROG_TYPE_SK_MSG Semantics/API:
      
      BPF_PROG_TYPE_SK_MSG supports only two return codes, SK_PASS and
      SK_DROP. Returning SK_DROP frees the copied data in the sendmsg
      case and leaves the data untouched in the sendpage case. Both cases
      return -EACCES to the user. Returning SK_PASS allows the msg to
      be sent.
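
The mapping from verdict to the value the caller sees can be sketched as follows. This is a user-space model: the enum values mirror `enum sk_action` from the kernel UAPI header, while `apply_verdict()` is a hypothetical helper. Treating any value other than SK_PASS as a drop is an assumption of the sketch.

```c
#include <assert.h>
#include <errno.h>

/* Values mirror enum sk_action in the kernel UAPI header. */
enum sk_verdict_model { SK_DROP = 0, SK_PASS = 1 };

/* Returns the number of bytes accepted for sending, or -EACCES when the
 * program drops the message. Anything other than SK_PASS counts as a drop
 * in this sketch. */
static long apply_verdict(int verdict, long msg_len)
{
    assert(msg_len >= 0);
    if (verdict == SK_PASS)
        return msg_len;
    return -EACCES;
}
```

The sketch leaves out the data handling that differs between the two paths: freeing the copied buffers for sendmsg, and leaving the page untouched for sendpage.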
      
      In the sendmsg case data is copied into kernel space buffers before
      running the BPF program. The kernel space buffers are stored in a
      scatterlist object where each element is a kernel memory buffer.
      Some effort is made to coalesce data from the sendmsg call here.
      For example a sendmsg call with many one byte iov entries will
      likely be pushed into a single entry. The BPF program is run with
      data pointers (start/end) pointing to the first sg element.
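
The coalescing of many small iov entries into one buffer can be illustrated in user space. The sketch below assumes a single flat destination buffer rather than a kernel scatterlist, and `coalesce_iov()` is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/* Copy every iovec segment back to back into 'dst'.
 * Returns the number of bytes copied, never more than 'cap'. */
static size_t coalesce_iov(const struct iovec *iov, int iovcnt,
                           char *dst, size_t cap)
{
    size_t used = 0;

    assert(iov != NULL && dst != NULL && iovcnt >= 0);
    for (int i = 0; i < iovcnt && used < cap; i++) {
        size_t n = iov[i].iov_len;

        if (n > cap - used)
            n = cap - used;     /* truncate the last segment to fit */
        memcpy(dst + used, iov[i].iov_base, n);
        used += n;
    }
    return used;
}
```

After the call, `dst` and `dst + used` play the role of the start/end data pointers handed to the program for the first element.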
      
      In the sendpage case data is not copied. We opt not to copy the
      data by default here, because the BPF infrastructure does not
      know what bytes will be needed nor when they will be needed. So
      copying all bytes may be wasteful. Because of this, the initial
      start/end data pointers are (0,0), meaning no data can be read or
      written. This avoids reading data that may be modified by the
      user. A new helper is added later in this series if reading and
      writing the data is needed. The helper call will do a copy by
      default so that the page is exclusively owned by the BPF call.
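
Why a (0,0) window makes the data inaccessible follows from the usual start/end bounds check. The sketch below is a user-space model: `struct msg_window` and `can_access()` are hypothetical names, and the check only approximates what the verifier requires of a program.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the data window a program sees. */
struct msg_window {
    uintptr_t data;
    uintptr_t data_end;
};

/* True only if the non-empty range [off, off + len) lies inside the window. */
static bool can_access(const struct msg_window *w, size_t off, size_t len)
{
    assert(w != NULL && w->data <= w->data_end);
    size_t avail = (size_t)(w->data_end - w->data);

    return len > 0 && off <= avail && len <= avail - off;
}
```

With both pointers at zero, `avail` is zero and every access is rejected, which matches the sendpage default described above.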
      
      The verdict from the BPF_PROG_TYPE_SK_MSG applies to the entire msg
      in the sendmsg() case and the entire page/offset in the sendpage case.
      This avoids ambiguity on how to handle mixed return codes in the
      sendmsg case. Again a helper is added later in the series if
      a verdict needs to apply to multiple system calls and/or only
      a subpart of the currently being processed message.
      
      The helper msg_redirect_map() can be used to select the socket to
      send the data on. It is used similarly to the existing redirect use
      cases and allows policy to redirect msgs.
      
      Pseudo code simple example:
      
      The basic logic to attach a program to a socket is as follows,
      
        // load the programs
        bpf_prog_load(SOCKMAP_TCP_MSG_PROG, BPF_PROG_TYPE_SK_MSG,
      		&obj, &msg_prog);
      
        // lookup the sockmap
        bpf_map_msg = bpf_object__find_map_by_name(obj, "my_sock_map");
      
        // get fd for sockmap
        map_fd_msg = bpf_map__fd(bpf_map_msg);
      
        // attach program to sockmap
        bpf_prog_attach(msg_prog, map_fd_msg, BPF_SK_MSG_VERDICT, 0);
      
      Adding sockets to the map is done in the normal way,
      
        // Add a socket 'fd' to sockmap at location 'i'
        bpf_map_update_elem(map_fd_msg, &i, &fd, BPF_ANY);
      
      After the above any socket attached to "my_sock_map", in this case
      'fd', will run the BPF msg verdict program (msg_prog) on every
      sendmsg and sendpage system call.
      
      For a complete example see BPF selftests or sockmap samples.
      
      Implementation notes:
      
      It seemed the simplest, to me at least, to use a refcnt to ensure
      psock is not lost across the sendmsg copy into the sg, the bpf program
      running on the data in sg_data, and the final pass to the TCP stack.
      Some performance testing may show a better method to do this and avoid
      the refcnt cost, but for now use the simpler method.
      
      Another item that will come after basic support is in place is
      supporting MSG_MORE flag. At the moment we call sendpages even if
      the MSG_MORE flag is set. An enhancement would be to collect the
      pages into a larger scatterlist and pass down the stack. Notice that
      bpf_tcp_sendmsg() could support this with some additional state saved
      across sendmsg calls. I built the code to support this without having
      to do refactoring work. Other features TBD include ZEROCOPY and the
      TCP_RECV_QUEUE/TCP_NO_QUEUE support. This will follow initial series
      shortly.
      
      Future work could improve size limits on the scatterlist rings used
      here. Currently, we use MAX_SKB_FRAGS simply because this was being
      used already in the TLS case. Future work could extend the kernel sk
      APIs to tune this depending on workload. This is a trade-off
      between memory usage and throughput performance.
      
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  22. 26 Jan, 2018 1 commit
  23. 19 Jan, 2018 2 commits
  24. 18 Jan, 2018 1 commit
    • J
      bpf: add new jited info fields in bpf_dev_offload and bpf_prog_info · fcfb126d
      Jiong Wang authored
      For host JIT, there are "jited_len"/"bpf_func" fields in struct bpf_prog
      used by all host JIT targets to get the jited image and its length. For
      offload, however, targets are likely to have different offload mechanisms,
      so this info is kept in device-private data fields.
      
      Therefore, the BPF_OBJ_GET_INFO_BY_FD syscall needs a unified way to get JIT
      length and contents info for offload targets.
      
      One way is to introduce a new callback that parses the device-private data and
      fills those fields in bpf_prog_info. This might be a little heavy; the other way
      is to add generic fields which will be initialized by all offload targets.
      
      This patch follows the second approach: it introduces two new fields in
      struct bpf_dev_offload and teaches bpf_prog_get_info_by_fd about them so it fills
      the correct jited_prog_len and jited_prog_insns in bpf_prog_info.
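
The second approach can be sketched as a user-space model. The commit message does not name the two new fields, so `jited_image` and `jited_len` below are assumed names. `struct offload_model`, `struct prog_info_model`, and `fill_jited_info()` are hypothetical, and memcpy() stands in for the copy to user space.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for the two generic offload fields. */
struct offload_model {
    const void *jited_image;
    uint32_t    jited_len;
};

/* Subset of what the caller passes in and gets back. */
struct prog_info_model {
    uint32_t jited_prog_len;    /* in: buffer size, out: full image size */
    void    *jited_prog_insns;  /* in: caller buffer, may be NULL */
};

static void fill_jited_info(const struct offload_model *off,
                            struct prog_info_model *info)
{
    uint32_t buf_len;

    assert(off != NULL && info != NULL);
    buf_len = info->jited_prog_len;
    info->jited_prog_len = off->jited_len;  /* always report the full size */
    if (info->jited_prog_insns && buf_len && off->jited_len) {
        uint32_t n = buf_len < off->jited_len ? buf_len : off->jited_len;

        /* Copy at most as many bytes as the caller's buffer can hold. */
        memcpy(info->jited_prog_insns, off->jited_image, n);
    }
}
```

Reporting the full length even when the buffer is too small lets a caller query the size first and retry with a larger buffer.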
      
      Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
      Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  25. 15 Jan, 2018 3 commits
  26. 06 Jan, 2018 1 commit