1. 06 1月, 2018 1 次提交
  2. 20 10月, 2017 1 次提交
  3. 09 10月, 2017 1 次提交
    • S
      netfilter: xt_bpf: Fix XT_BPF_MODE_FD_PINNED mode of 'xt_bpf_info_v1' · 98589a09
      Shmulik Ladkani 提交于
      Commit 2c16d603 ("netfilter: xt_bpf: support ebpf") introduced
      support for attaching an eBPF object by an fd, with the
      'bpf_mt_check_v1' ABI expecting the '.fd' to be specified upon each
      IPT_SO_SET_REPLACE call.
      
      However this breaks subsequent iptables calls:
      
       # iptables -A INPUT -m bpf --object-pinned /sys/fs/bpf/xxx -j ACCEPT
       # iptables -A INPUT -s 5.6.7.8 -j ACCEPT
       iptables: Invalid argument. Run `dmesg' for more information.
      
      That's because iptables works by loading existing rules using
      IPT_SO_GET_ENTRIES to userspace, then issuing IPT_SO_SET_REPLACE with
      the replacement set.
      
      However, the loaded 'xt_bpf_info_v1' has an arbitrary '.fd' number
      (from the initial "iptables -m bpf" invocation) - so when 2nd invocation
      occurs, userspace passes a bogus fd number, which leads to
      'bpf_mt_check_v1' to fail.
      
      One suggested solution [1] was to hack iptables userspace, to perform a
      "entries fixup" immediatley after IPT_SO_GET_ENTRIES, by opening a new,
      process-local fd per every 'xt_bpf_info_v1' entry seen.
      
      However, in [2] both Pablo Neira Ayuso and Willem de Bruijn suggested to
      depricate the xt_bpf_info_v1 ABI dealing with pinned ebpf objects.
      
      This fix changes the XT_BPF_MODE_FD_PINNED behavior to ignore the given
      '.fd' and instead perform an in-kernel lookup for the bpf object given
      the provided '.path'.
      
      It also defines an alias for the XT_BPF_MODE_FD_PINNED mode, named
      XT_BPF_MODE_PATH_PINNED, to better reflect the fact that the user is
      expected to provide the path of the pinned object.
      
      Existing XT_BPF_MODE_FD_ELF behavior (non-pinned fd mode) is preserved.
      
      References: [1] https://marc.info/?l=netfilter-devel&m=150564724607440&w=2
                  [2] https://marc.info/?l=netfilter-devel&m=150575727129880&w=2Reported-by: NRafael Buchbinder <rafi@rbk.ms>
      Signed-off-by: NShmulik Ladkani <shmulik.ladkani@gmail.com>
      Acked-by: NWillem de Bruijn <willemb@google.com>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NPablo Neira Ayuso <pablo@netfilter.org>
      98589a09
  4. 06 7月, 2017 1 次提交
    • D
      bpf: Implement show_options · 4cc7c186
      David Howells 提交于
      Implement the show_options superblock op for bpf as part of a bid to get
      rid of s_options and generic_show_options() to make it easier to implement
      a context-based mount where the mount options can be passed individually
      over a file descriptor.
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      cc: Alexei Starovoitov <ast@kernel.org>
      cc: Daniel Borkmann <daniel@iogearbox.net>
      cc: netdev@vger.kernel.org
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      4cc7c186
  5. 27 4月, 2017 1 次提交
  6. 26 1月, 2017 1 次提交
    • D
      bpf: add initial bpf tracepoints · a67edbf4
      Daniel Borkmann 提交于
      This work adds a number of tracepoints to paths that are either
      considered slow-path or exception-like states, where monitoring or
      inspecting them would be desirable.
      
      For bpf(2) syscall, tracepoints have been placed for main commands
      when they succeed. In XDP case, tracepoint is for exceptions, that
      is, f.e. on abnormal BPF program exit such as unknown or XDP_ABORTED
      return code, or when error occurs during XDP_TX action and the packet
      could not be forwarded.
      
      Both have been split into separate event headers, and can be further
      extended. Worst case, if they unexpectedly should get into our way in
      future, they can also removed [1]. Of course, these tracepoints (like
      any other) can be analyzed by eBPF itself, etc. Example output:
      
        # ./perf record -a -e bpf:* sleep 10
        # ./perf script
        sock_example  6197 [005]   283.980322:      bpf:bpf_map_create: map type=ARRAY ufd=4 key=4 val=8 max=256 flags=0
        sock_example  6197 [005]   283.980721:       bpf:bpf_prog_load: prog=a5ea8fa30ea6849c type=SOCKET_FILTER ufd=5
        sock_example  6197 [005]   283.988423:   bpf:bpf_prog_get_type: prog=a5ea8fa30ea6849c type=SOCKET_FILTER
        sock_example  6197 [005]   283.988443: bpf:bpf_map_lookup_elem: map type=ARRAY ufd=4 key=[06 00 00 00] val=[00 00 00 00 00 00 00 00]
        [...]
        sock_example  6197 [005]   288.990868: bpf:bpf_map_lookup_elem: map type=ARRAY ufd=4 key=[01 00 00 00] val=[14 00 00 00 00 00 00 00]
             swapper     0 [005]   289.338243:    bpf:bpf_prog_put_rcu: prog=a5ea8fa30ea6849c type=SOCKET_FILTER
      
        [1] https://lwn.net/Articles/705270/Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a67edbf4
  7. 28 11月, 2016 1 次提交
  8. 01 11月, 2016 1 次提交
    • D
      bpf, inode: add support for symlinks and fix mtime/ctime · 0f98621b
      Daniel Borkmann 提交于
      While commit bb35a6ef ("bpf, inode: allow for rename and link ops")
      added support for hard links that can be used for prog and map nodes,
      this work adds simple symlink support, which can be used f.e. for
      directories also when unpriviledged and works with cmdline tooling that
      understands S_IFLNK anyway. Since the switch in e27f4a94 ("bpf: Use
      mount_nodev not mount_ns to mount the bpf filesystem"), there can be
      various mount instances with mount_nodev() and thus hierarchy can be
      flattened to facilitate object sharing. Thus, we can keep bpf tooling
      also working by repointing paths.
      
      Most of the functionality can be used from vfs library operations. The
      symlink is stored in the inode itself, that is in i_link, which is
      sufficient in our case as opposed to storing it in the page cache.
      While at it, I noticed that bpf_mkdir() and bpf_mkobj() don't update
      the directories mtime and ctime, so add a common helper for it called
      bpf_dentry_finalize() that takes care of it for all cases now.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      0f98621b
  9. 28 9月, 2016 1 次提交
  10. 27 9月, 2016 2 次提交
  11. 12 7月, 2016 1 次提交
  12. 24 5月, 2016 1 次提交
    • D
      bpf, inode: disallow userns mounts · 612bacad
      Daniel Borkmann 提交于
      Follow-up to commit e27f4a94 ("bpf: Use mount_nodev not mount_ns
      to mount the bpf filesystem"), which removes the FS_USERNS_MOUNT flag.
      
      The original idea was to have a per mountns instance instead of a
      single global fs instance, but that didn't work out and we had to
      switch to mount_nodev() model. The intent of that middle ground was
      that we avoid users who don't play nice to create endless instances
      of bpf fs which are difficult to control and discover from an admin
      point of view, but at the same time it would have allowed us to be
      more flexible with regard to namespaces.
      
      Therefore, since we now did the switch to mount_nodev() as a fix
      where individual instances are created, we also need to remove userns
      mount flag along with it to avoid running into mentioned situation.
      I don't expect any breakage at this early point in time with removing
      the flag and we can revisit this later should the requirement for
      this come up with future users. This and commit e27f4a94 have
      been split to facilitate tracking should any of them run into the
      unlikely case of causing a regression.
      
      Fixes: b2197755 ("bpf: add support for persistent maps/progs")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      612bacad
  13. 21 5月, 2016 1 次提交
    • E
      bpf: Use mount_nodev not mount_ns to mount the bpf filesystem · e27f4a94
      Eric W. Biederman 提交于
      While reviewing the filesystems that set FS_USERNS_MOUNT I spotted the
      bpf filesystem.  Looking at the code I saw a broken usage of mount_ns
      with current->nsproxy->mnt_ns. As the code does not acquire a
      reference to the mount namespace it can not possibly be correct to
      store the mount namespace on the superblock as it does.
      
      Replace mount_ns with mount_nodev so that each mount of the bpf
      filesystem returns a distinct instance, and the code is not buggy.
      
      In discussion with Hannes Frederic Sowa it was reported that the use
      of mount_ns was an attempt to have one bpf instance per mount
      namespace, in an attempt to keep resources that pin resources from
      hiding.  That intent simply does not work, the vfs is not built to
      allow that kind of behavior.  Which means that the bpf filesystem
      really is buggy both semantically and in it's implemenation as it does
      not nor can it implement the original intent.
      
      This change is userspace visible, but my experience with similar
      filesystems leads me to believe nothing will break with a model of each
      mount of the bpf filesystem is distinct from all others.
      
      Fixes: b2197755 ("bpf: add support for persistent maps/progs")
      Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: N"Eric W. Biederman" <ebiederm@xmission.com>
      Acked-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e27f4a94
  14. 29 4月, 2016 1 次提交
  15. 28 3月, 2016 1 次提交
  16. 13 12月, 2015 1 次提交
    • D
      bpf, inode: allow for rename and link ops · bb35a6ef
      Daniel Borkmann 提交于
      Add support for renaming and hard links to the fs. Most of this can be
      implemented by using simple library operations under the same constraints
      that we don't use a reserved name like elsewhere. Linking can be useful
      to share/manage things like maps across subsystem users. It works within
      the file system boundary, but is not allowed for directories.
      
      Symbolic links are explicitly not implemented here, as it can be better
      done already by doing bind mounts inside bpf fs to set up shared directories
      f.e. useful when using volumes in docker containers that map a private
      working directory into /sys/fs/bpf/ which contains itself a bind mounted
      path from the host's /sys/fs/bpf/ mount that is shared among multiple
      containers. For single maps instead of whole directory, hard links can
      be easily used to do the same.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bb35a6ef
  17. 26 11月, 2015 1 次提交
    • D
      bpf: fix clearing on persistent program array maps · c9da161c
      Daniel Borkmann 提交于
      Currently, when having map file descriptors pointing to program arrays,
      there's still the issue that we unconditionally flush program array
      contents via bpf_fd_array_map_clear() in bpf_map_release(). This happens
      when such a file descriptor is released and is independent of the map's
      refcount.
      
      Having this flush independent of the refcount is for a reason: there
      can be arbitrary complex dependency chains among tail calls, also circular
      ones (direct or indirect, nesting limit determined during runtime), and
      we need to make sure that the map drops all references to eBPF programs
      it holds, so that the map's refcount can eventually drop to zero and
      initiate its freeing. Btw, a walk of the whole dependency graph would
      not be possible for various reasons, one being complexity and another
      one inconsistency, i.e. new programs can be added to parts of the graph
      at any time, so there's no guaranteed consistent state for the time of
      such a walk.
      
      Now, the program array pinning itself works, but the issue is that each
      derived file descriptor on close would nevertheless call unconditionally
      into bpf_fd_array_map_clear(). Instead, keep track of users and postpone
      this flush until the last reference to a user is dropped. As this only
      concerns a subset of references (f.e. a prog array could hold a program
      that itself has reference on the prog array holding it, etc), we need to
      track them separately.
      
      Short analysis on the refcounting: on map creation time usercnt will be
      one, so there's no change in behaviour for bpf_map_release(), if unpinned.
      If we already fail in map_create(), we are immediately freed, and no
      file descriptor has been made public yet. In bpf_obj_pin_user(), we need
      to probe for a possible map in bpf_fd_probe_obj() already with a usercnt
      reference, so before we drop the reference on the fd with fdput().
      Therefore, if actual pinning fails, we need to drop that reference again
      in bpf_any_put(), otherwise we keep holding it. When last reference
      drops on the inode, the bpf_any_put() in bpf_evict_inode() will take
      care of dropping the usercnt again. In the bpf_obj_get_user() case, the
      bpf_any_get() will grab a reference on the usercnt, still at a time when
      we have the reference on the path. Should we later on fail to grab a new
      file descriptor, bpf_any_put() will drop it, otherwise we hold it until
      bpf_map_release() time.
      
      Joint work with Alexei.
      
      Fixes: b2197755 ("bpf: add support for persistent maps/progs")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c9da161c
  18. 03 11月, 2015 1 次提交
    • D
      bpf: add support for persistent maps/progs · b2197755
      Daniel Borkmann 提交于
      This work adds support for "persistent" eBPF maps/programs. The term
      "persistent" is to be understood that maps/programs have a facility
      that lets them survive process termination. This is desired by various
      eBPF subsystem users.
      
      Just to name one example: tc classifier/action. Whenever tc parses
      the ELF object, extracts and loads maps/progs into the kernel, these
      file descriptors will be out of reach after the tc instance exits.
      So a subsequent tc invocation won't be able to access/relocate on this
      resource, and therefore maps cannot easily be shared, f.e. between the
      ingress and egress networking data path.
      
      The current workaround is that Unix domain sockets (UDS) need to be
      instrumented in order to pass the created eBPF map/program file
      descriptors to a third party management daemon through UDS' socket
      passing facility. This makes it a bit complicated to deploy shared
      eBPF maps or programs (programs f.e. for tail calls) among various
      processes.
      
      We've been brainstorming on how we could tackle this issue and various
      approches have been tried out so far, which can be read up further in
      the below reference.
      
      The architecture we eventually ended up with is a minimal file system
      that can hold map/prog objects. The file system is a per mount namespace
      singleton, and the default mount point is /sys/fs/bpf/. Any subsequent
      mounts within a given namespace will point to the same instance. The
      file system allows for creating a user-defined directory structure.
      The objects for maps/progs are created/fetched through bpf(2) with
      two new commands (BPF_OBJ_PIN/BPF_OBJ_GET). I.e. a bpf file descriptor
      along with a pathname is being passed to bpf(2) that in turn creates
      (we call it eBPF object pinning) the file system nodes. Only the pathname
      is being passed to bpf(2) for getting a new BPF file descriptor to an
      existing node. The user can use that to access maps and progs later on,
      through bpf(2). Removal of file system nodes is being managed through
      normal VFS functions such as unlink(2), etc. The file system code is
      kept to a very minimum and can be further extended later on.
      
      The next step I'm working on is to add dump eBPF map/prog commands
      to bpf(2), so that a specification from a given file descriptor can
      be retrieved. This can be used by things like CRIU but also applications
      can inspect the meta data after calling BPF_OBJ_GET.
      
      Big thanks also to Alexei and Hannes who significantly contributed
      in the design discussion that eventually let us end up with this
      architecture here.
      
      Reference: https://lkml.org/lkml/2015/10/15/925Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NHannes Frederic Sowa <hannes@stressinduktion.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b2197755