1. 09 11月, 2018 1 次提交
  2. 20 10月, 2018 2 次提交
    • M
      bpf: add queue and stack maps · f1a2e44a
      Mauricio Vasquez B 提交于
      Queue/stack maps implement a FIFO/LIFO data storage for ebpf programs.
      These maps support peek, pop and push operations that are exposed to eBPF
      programs through the new bpf_map[peek/pop/push] helpers.  Those operations
      are exposed to userspace applications through the already existing
      syscalls in the following way:
      
      BPF_MAP_LOOKUP_ELEM            -> peek
      BPF_MAP_LOOKUP_AND_DELETE_ELEM -> pop
      BPF_MAP_UPDATE_ELEM            -> push
      
      Queue/stack maps are implemented using a buffer, tail and head indexes,
      hence BPF_F_NO_PREALLOC is not supported.
      
      As opposite to other maps, queue and stack do not use RCU for protecting
      maps values, the bpf_map[peek/pop] have a ARG_PTR_TO_UNINIT_MAP_VALUE
      argument that is a pointer to a memory zone where to save the value of a
      map.  Basically the same as ARG_PTR_TO_UNINIT_MEM, but the size has not
      be passed as an extra argument.
      
      Our main motivation for implementing queue/stack maps was to keep track
      of a pool of elements, like network ports in a SNAT, however we forsee
      other use cases, like for exampling saving last N kernel events in a map
      and then analysing from userspace.
      Signed-off-by: NMauricio Vasquez B <mauricio.vasquez@polito.it>
      Acked-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      f1a2e44a
    • M
      bpf/verifier: add ARG_PTR_TO_UNINIT_MAP_VALUE · 2ea864c5
      Mauricio Vasquez B 提交于
      ARG_PTR_TO_UNINIT_MAP_VALUE argument is a pointer to a memory zone
      used to save the value of a map.  Basically the same as
      ARG_PTR_TO_UNINIT_MEM, but the size has not be passed as an extra
      argument.
      
      This will be used in the following patch that implements some new
      helpers that receive a pointer to be filled with a map value.
      Signed-off-by: NMauricio Vasquez B <mauricio.vasquez@polito.it>
      Acked-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      2ea864c5
  3. 16 10月, 2018 1 次提交
    • D
      bpf, sockmap: convert to generic sk_msg interface · 604326b4
      Daniel Borkmann 提交于
      Add a generic sk_msg layer, and convert current sockmap and later
      kTLS over to make use of it. While sk_buff handles network packet
      representation from netdevice up to socket, sk_msg handles data
      representation from application to socket layer.
      
      This means that sk_msg framework spans across ULP users in the
      kernel, and enables features such as introspection or filtering
      of data with the help of BPF programs that operate on this data
      structure.
      
      Latter becomes in particular useful for kTLS where data encryption
      is deferred into the kernel, and as such enabling the kernel to
      perform L7 introspection and policy based on BPF for TLS connections
      where the record is being encrypted after BPF has run and came to
      a verdict. In order to get there, first step is to transform open
      coding of scatter-gather list handling into a common core framework
      that subsystems can use.
      
      The code itself has been split and refactored into three bigger
      pieces: i) the generic sk_msg API which deals with managing the
      scatter gather ring, providing helpers for walking and mangling,
      transferring application data from user space into it, and preparing
      it for BPF pre/post-processing, ii) the plain sock map itself
      where sockets can be attached to or detached from; these bits
      are independent of i) which can now be used also without sock
      map, and iii) the integration with plain TCP as one protocol
      to be used for processing L7 application data (later this could
      e.g. also be extended to other protocols like UDP). The semantics
      are the same with the old sock map code and therefore no change
      of user facing behavior or APIs. While pursuing this work it
      also helped finding a number of bugs in the old sockmap code
      that we've fixed already in earlier commits. The test_sockmap
      kselftest suite passes through fine as well.
      
      Joint work with John.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      604326b4
  4. 08 10月, 2018 1 次提交
  5. 03 10月, 2018 1 次提交
  6. 01 10月, 2018 2 次提交
    • R
      bpf: introduce per-cpu cgroup local storage · b741f163
      Roman Gushchin 提交于
      This commit introduced per-cpu cgroup local storage.
      
      Per-cpu cgroup local storage is very similar to simple cgroup storage
      (let's call it shared), except all the data is per-cpu.
      
      The main goal of per-cpu variant is to implement super fast
      counters (e.g. packet counters), which don't require neither
      lookups, neither atomic operations.
      
      >From userspace's point of view, accessing a per-cpu cgroup storage
      is similar to other per-cpu map types (e.g. per-cpu hashmaps and
      arrays).
      
      Writing to a per-cpu cgroup storage is not atomic, but is performed
      by copying longs, so some minimal atomicity is here, exactly
      as with other per-cpu maps.
      Signed-off-by: NRoman Gushchin <guro@fb.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Acked-by: NSong Liu <songliubraving@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      b741f163
    • R
      bpf: extend cgroup bpf core to allow multiple cgroup storage types · 8bad74f9
      Roman Gushchin 提交于
      In order to introduce per-cpu cgroup storage, let's generalize
      bpf cgroup core to support multiple cgroup storage types.
      Potentially, per-node cgroup storage can be added later.
      
      This commit is mostly a formal change that replaces
      cgroup_storage pointer with a array of cgroup_storage pointers.
      It doesn't actually introduce a new storage type,
      it will be done later.
      
      Each bpf program is now able to have one cgroup storage of each type.
      Signed-off-by: NRoman Gushchin <guro@fb.com>
      Acked-by: NSong Liu <songliubraving@fb.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      8bad74f9
  7. 15 9月, 2018 1 次提交
  8. 13 8月, 2018 1 次提交
    • D
      bpf: decouple btf from seq bpf fs dump and enable more maps · e8d2bec0
      Daniel Borkmann 提交于
      Commit a26ca7c9 ("bpf: btf: Add pretty print support to
      the basic arraymap") and 699c86d6 ("bpf: btf: add pretty
      print for hash/lru_hash maps") enabled support for BTF and
      dumping via BPF fs for array and hash/lru map. However, both
      can be decoupled from each other such that regular BPF maps
      can be supported for attaching BTF key/value information,
      while not all maps necessarily need to dump via map_seq_show_elem()
      callback.
      
      The basic sanity check which is a prerequisite for all maps
      is that key/value size has to match in any case, and some maps
      can have extra checks via map_check_btf() callback, e.g.
      probing certain types or indicating no support in general. With
      that we can also enable retrieving BTF info for per-cpu map
      types and lpm.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NYonghong Song <yhs@fb.com>
      e8d2bec0
  9. 11 8月, 2018 1 次提交
    • M
      bpf: Introduce BPF_MAP_TYPE_REUSEPORT_SOCKARRAY · 5dc4c4b7
      Martin KaFai Lau 提交于
      This patch introduces a new map type BPF_MAP_TYPE_REUSEPORT_SOCKARRAY.
      
      To unleash the full potential of a bpf prog, it is essential for the
      userspace to be capable of directly setting up a bpf map which can then
      be consumed by the bpf prog to make decision.  In this case, decide which
      SO_REUSEPORT sk to serve the incoming request.
      
      By adding BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, the userspace has total control
      and visibility on where a SO_REUSEPORT sk should be located in a bpf map.
      The later patch will introduce BPF_PROG_TYPE_SK_REUSEPORT such that
      the bpf prog can directly select a sk from the bpf map.  That will
      raise the programmability of the bpf prog attached to a reuseport
      group (a group of sk serving the same IP:PORT).
      
      For example, in UDP, the bpf prog can peek into the payload (e.g.
      through the "data" pointer introduced in the later patch) to learn
      the application level's connection information and then decide which sk
      to pick from a bpf map.  The userspace can tightly couple the sk's location
      in a bpf map with the application logic in generating the UDP payload's
      connection information.  This connection info contact/API stays within the
      userspace.
      
      Also, when used with map-in-map, the userspace can switch the
      old-server-process's inner map to a new-server-process's inner map
      in one call "bpf_map_update_elem(outer_map, &index, &new_reuseport_array)".
      The bpf prog will then direct incoming requests to the new process instead
      of the old process.  The old process can finish draining the pending
      requests (e.g. by "accept()") before closing the old-fds.  [Note that
      deleting a fd from a bpf map does not necessary mean the fd is closed]
      
      During map_update_elem(),
      Only SO_REUSEPORT sk (i.e. which has already been added
      to a reuse->socks[]) can be used.  That means a SO_REUSEPORT sk that is
      "bind()" for UDP or "bind()+listen()" for TCP.  These conditions are
      ensured in "reuseport_array_update_check()".
      
      A SO_REUSEPORT sk can only be added once to a map (i.e. the
      same sk cannot be added twice even to the same map).  SO_REUSEPORT
      already allows another sk to be created for the same IP:PORT.
      There is no need to re-create a similar usage in the BPF side.
      
      When a SO_REUSEPORT is deleted from the "reuse->socks[]" (e.g. "close()"),
      it will notify the bpf map to remove it from the map also.  It is
      done through "bpf_sk_reuseport_detach()" and it will only be called
      if >=1 of the "reuse->sock[]" has ever been added to a bpf map.
      
      The map_update()/map_delete() has to be in-sync with the
      "reuse->socks[]".  Hence, the same "reuseport_lock" used
      by "reuse->socks[]" has to be used here also. Care has
      been taken to ensure the lock is only acquired when the
      adding sk passes some strict tests. and
      freeing the map does not require the reuseport_lock.
      
      The reuseport_array will also support lookup from the syscall
      side.  It will return a sock_gen_cookie().  The sock_gen_cookie()
      is on-demand (i.e. a sk's cookie is not generated until the very
      first map_lookup_elem()).
      
      The lookup cookie is 64bits but it goes against the logical userspace
      expectation on 32bits sizeof(fd) (and as other fd based bpf maps do also).
      It may catch user in surprise if we enforce value_size=8 while
      userspace still pass a 32bits fd during update.  Supporting different
      value_size between lookup and update seems unintuitive also.
      
      We also need to consider what if other existing fd based maps want
      to return 64bits value from syscall's lookup in the future.
      Hence, reuseport_array supports both value_size 4 and 8, and
      assuming user will usually use value_size=4.  The syscall's lookup
      will return ENOSPC on value_size=4.  It will will only
      return 64bits value from sock_gen_cookie() when user consciously
      choose value_size=8 (as a signal that lookup is desired) which then
      requires a 64bits value in both lookup and update.
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      5dc4c4b7
  10. 03 8月, 2018 5 次提交
  11. 18 7月, 2018 5 次提交
  12. 26 6月, 2018 1 次提交
    • S
      bpf: fix attach type BPF_LIRC_MODE2 dependency wrt CONFIG_CGROUP_BPF · fdb5c453
      Sean Young 提交于
      If the kernel is compiled with CONFIG_CGROUP_BPF not enabled, it is not
      possible to attach, detach or query IR BPF programs to /dev/lircN devices,
      making them impossible to use. For embedded devices, it should be possible
      to use IR decoding without cgroups or CONFIG_CGROUP_BPF enabled.
      
      This change requires some refactoring, since bpf_prog_{attach,detach,query}
      functions are now always compiled, but their code paths for cgroups need
      moving out. Rather than a #ifdef CONFIG_CGROUP_BPF in kernel/bpf/syscall.c,
      moving them to kernel/bpf/cgroup.c and kernel/bpf/sockmap.c does not
      require #ifdefs since that is already conditionally compiled.
      
      Fixes: f4364dcf ("media: rc: introduce BPF_PROG_LIRC_MODE2")
      Signed-off-by: NSean Young <sean@mess.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      fdb5c453
  13. 16 6月, 2018 1 次提交
  14. 04 6月, 2018 1 次提交
    • Y
      bpf: implement bpf_get_current_cgroup_id() helper · bf6fa2c8
      Yonghong Song 提交于
      bpf has been used extensively for tracing. For example, bcc
      contains an almost full set of bpf-based tools to trace kernel
      and user functions/events. Most tracing tools are currently
      either filtered based on pid or system-wide.
      
      Containers have been used quite extensively in industry and
      cgroup is often used together to provide resource isolation
      and protection. Several processes may run inside the same
      container. It is often desirable to get container-level tracing
      results as well, e.g. syscall count, function count, I/O
      activity, etc.
      
      This patch implements a new helper, bpf_get_current_cgroup_id(),
      which will return cgroup id based on the cgroup within which
      the current task is running.
      
      The later patch will provide an example to show that
      userspace can get the same cgroup id so it could
      configure a filter or policy in the bpf program based on
      task cgroup id.
      
      The helper is currently implemented for tracing. It can
      be added to other program types as well when needed.
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NYonghong Song <yhs@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      bf6fa2c8
  15. 25 5月, 2018 2 次提交
  16. 23 5月, 2018 2 次提交
  17. 16 5月, 2018 1 次提交
  18. 10 5月, 2018 1 次提交
  19. 05 5月, 2018 1 次提交
  20. 04 5月, 2018 3 次提交
    • D
      bpf: implement ld_abs/ld_ind in native bpf · e0cea7ce
      Daniel Borkmann 提交于
      The main part of this work is to finally allow removal of LD_ABS
      and LD_IND from the BPF core by reimplementing them through native
      eBPF instead. Both LD_ABS/LD_IND were carried over from cBPF and
      keeping them around in native eBPF caused way more trouble than
      actually worth it. To just list some of the security issues in
      the past:
      
        * fdfaf64e ("x86: bpf_jit: support negative offsets")
        * 35607b02 ("sparc: bpf_jit: fix loads from negative offsets")
        * e0ee9c12 ("x86: bpf_jit: fix two bugs in eBPF JIT compiler")
        * 07aee943 ("bpf, sparc: fix usage of wrong reg for load_skb_regs after call")
        * 6d59b7db ("bpf, s390x: do not reload skb pointers in non-skb context")
        * 87338c8e ("bpf, ppc64: do not reload skb pointers in non-skb context")
      
      For programs in native eBPF, LD_ABS/LD_IND are pretty much legacy
      these days due to their limitations and more efficient/flexible
      alternatives that have been developed over time such as direct
      packet access. LD_ABS/LD_IND only cover 1/2/4 byte loads into a
      register, the load happens in host endianness and its exception
      handling can yield unexpected behavior. The latter is explained
      in depth in f6b1b3bf ("bpf: fix subprog verifier bypass by
      div/mod by 0 exception") with similar cases of exceptions we had.
      In native eBPF more recent program types will disable LD_ABS/LD_IND
      altogether through may_access_skb() in verifier, and given the
      limitations in terms of exception handling, it's also disabled
      in programs that use BPF to BPF calls.
      
      In terms of cBPF, the LD_ABS/LD_IND is used in networking programs
      to access packet data. It is not used in seccomp-BPF but programs
      that use it for socket filtering or reuseport for demuxing with
      cBPF. This is mostly relevant for applications that have not yet
      migrated to native eBPF.
      
      The main complexity and source of bugs in LD_ABS/LD_IND is coming
      from their implementation in the various JITs. Most of them keep
      the model around from cBPF times by implementing a fastpath written
      in asm. They use typically two from the BPF program hidden CPU
      registers for caching the skb's headlen (skb->len - skb->data_len)
      and skb->data. Throughout the JIT phase this requires to keep track
      whether LD_ABS/LD_IND are used and if so, the two registers need
      to be recached each time a BPF helper would change the underlying
      packet data in native eBPF case. At least in eBPF case, available
      CPU registers are rare and the additional exit path out of the
      asm written JIT helper makes it also inflexible since not all
      parts of the JITer are in control from plain C. A LD_ABS/LD_IND
      implementation in eBPF therefore allows to significantly reduce
      the complexity in JITs with comparable performance results for
      them, e.g.:
      
      test_bpf             tcpdump port 22             tcpdump complex
      x64      - before    15 21 10                    14 19  18
               - after      7 10 10                     7 10  15
      arm64    - before    40 91 92                    40 91 151
               - after     51 64 73                    51 62 113
      
      For cBPF we now track any usage of LD_ABS/LD_IND in bpf_convert_filter()
      and cache the skb's headlen and data in the cBPF prologue. The
      BPF_REG_TMP gets remapped from R8 to R2 since it's mainly just
      used as a local temporary variable. This allows to shrink the
      image on x86_64 also for seccomp programs slightly since mapping
      to %rsi is not an ereg. In callee-saved R8 and R9 we now track
      skb data and headlen, respectively. For normal prologue emission
      in the JITs this does not add any extra instructions since R8, R9
      are pushed to stack in any case from eBPF side. cBPF uses the
      convert_bpf_ld_abs() emitter which probes the fast path inline
      already and falls back to bpf_skb_load_helper_{8,16,32}() helper
      relying on the cached skb data and headlen as well. R8 and R9
      never need to be reloaded due to bpf_helper_changes_pkt_data()
      since all skb access in cBPF is read-only. Then, for the case
      of native eBPF, we use the bpf_gen_ld_abs() emitter, which calls
      the bpf_skb_load_helper_{8,16,32}_no_cache() helper unconditionally,
      does neither cache skb data and headlen nor has an inlined fast
      path. The reason for the latter is that native eBPF does not have
      any extra registers available anyway, but even if there were, it
      avoids any reload of skb data and headlen in the first place.
      Additionally, for the negative offsets, we provide an alternative
      bpf_skb_load_bytes_relative() helper in eBPF which operates
      similarly as bpf_skb_load_bytes() and allows for more flexibility.
      Tested myself on x64, arm64, s390x, from Sandipan on ppc64.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      e0cea7ce
    • D
      bpf: migrate ebpf ld_abs/ld_ind tests to test_verifier · 93731ef0
      Daniel Borkmann 提交于
      Remove all eBPF tests involving LD_ABS/LD_IND from test_bpf.ko. Reason
      is that the eBPF tests from test_bpf module do not go via BPF verifier
      and therefore any instruction rewrites from verifier cannot take place.
      
      Therefore, move them into test_verifier which runs out of user space,
      so that verfier can rewrite LD_ABS/LD_IND internally in upcoming patches.
      It will have the same effect since runtime tests are also performed from
      there. This also allows to finally unexport bpf_skb_vlan_{push,pop}_proto
      and keep it internal to core kernel.
      
      Additionally, also add further cBPF LD_ABS/LD_IND test coverage into
      test_bpf.ko suite.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      93731ef0
    • B
      bpf: introduce new bpf AF_XDP map type BPF_MAP_TYPE_XSKMAP · fbfc504a
      Björn Töpel 提交于
      The xskmap is yet another BPF map, very much inspired by
      dev/cpu/sockmap, and is a holder of AF_XDP sockets. A user application
      adds AF_XDP sockets into the map, and by using the bpf_redirect_map
      helper, an XDP program can redirect XDP frames to an AF_XDP socket.
      
      Note that a socket that is bound to certain ifindex/queue index will
      *only* accept XDP frames from that netdev/queue index. If an XDP
      program tries to redirect from a netdev/queue index other than what
      the socket is bound to, the frame will not be received on the socket.
      
      A socket can reside in multiple maps.
      
      v3: Fixed race and simplified code.
      v2: Removed one indirection in map lookup.
      Signed-off-by: NBjörn Töpel <bjorn.topel@intel.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      fbfc504a
  21. 29 4月, 2018 1 次提交
    • Y
      bpf: add bpf_get_stack helper · c195651e
      Yonghong Song 提交于
      Currently, stackmap and bpf_get_stackid helper are provided
      for bpf program to get the stack trace. This approach has
      a limitation though. If two stack traces have the same hash,
      only one will get stored in the stackmap table,
      so some stack traces are missing from user perspective.
      
      This patch implements a new helper, bpf_get_stack, will
      send stack traces directly to bpf program. The bpf program
      is able to see all stack traces, and then can do in-kernel
      processing or send stack traces to user space through
      shared map or bpf_perf_event_output.
      Acked-by: NAlexei Starovoitov <ast@fb.com>
      Signed-off-by: NYonghong Song <yhs@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      c195651e
  22. 24 4月, 2018 2 次提交
    • J
      bpf: sockmap, map_release does not hold refcnt for pinned maps · ba6b8de4
      John Fastabend 提交于
      Relying on map_release hook to decrement the reference counts when a
      map is removed only works if the map is not being pinned. In the
      pinned case the ref is decremented immediately and the BPF programs
      released. After this BPF programs may not be in-use which is not
      what the user would expect.
      
      This patch moves the release logic into bpf_map_put_uref() and brings
      sockmap in-line with how a similar case is handled in prog array maps.
      
      Fixes: 3d9e9526 ("bpf: sockmap, fix leaking maps with attached but not detached progs")
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      ba6b8de4
    • R
      bpf: disable and restore preemption in __BPF_PROG_RUN_ARRAY · 6899b32b
      Roman Gushchin 提交于
      Running bpf programs requires disabled preemption,
      however at least some* of the BPF_PROG_RUN_ARRAY users
      do not follow this rule.
      
      To fix this bug, and also to make it not happen in the future,
      let's add explicit preemption disabling/re-enabling
      to the __BPF_PROG_RUN_ARRAY code.
      
      * for example:
       [   17.624472] RIP: 0010:__cgroup_bpf_run_filter_sk+0x1c4/0x1d0
       ...
       [   17.640890]  inet6_create+0x3eb/0x520
       [   17.641405]  __sock_create+0x242/0x340
       [   17.641939]  __sys_socket+0x57/0xe0
       [   17.642370]  ? trace_hardirqs_off_thunk+0x1a/0x1c
       [   17.642944]  SyS_socket+0xa/0x10
       [   17.643357]  do_syscall_64+0x79/0x220
       [   17.643879]  entry_SYSCALL_64_after_hwframe+0x42/0xb7
      Signed-off-by: NRoman Gushchin <guro@fb.com>
      Cc: Alexei Starovoitov <ast@kernel.org>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      6899b32b
  23. 20 4月, 2018 1 次提交
    • M
      bpf: btf: Add pretty print support to the basic arraymap · a26ca7c9
      Martin KaFai Lau 提交于
      This patch adds pretty print support to the basic arraymap.
      Support for other bpf maps can be added later.
      
      This patch adds new attrs to the BPF_MAP_CREATE command to allow
      specifying the btf_fd, btf_key_id and btf_value_id.  The
      BPF_MAP_CREATE can then associate the btf to the map if
      the creating map supports BTF.
      
      A BTF supported map needs to implement two new map ops,
      map_seq_show_elem() and map_check_btf().  This patch has
      implemented these new map ops for the basic arraymap.
      
      It also adds file_operations, bpffs_map_fops, to the pinned
      map such that the pinned map can be opened and read.
      After that, the user has an intuitive way to do
      "cat bpffs/pathto/a-pinned-map" instead of getting
      an error.
      
      bpffs_map_fops should not be extended further to support
      other operations.  Other operations (e.g. write/key-lookup...)
      should be realized by the userspace tools (e.g. bpftool) through
      the BPF_OBJ_GET_INFO_BY_FD, map's lookup/update interface...etc.
      Follow up patches will allow the userspace to obtain
      the BTF from a map-fd.
      
      Here is a sample output when reading a pinned arraymap
      with the following map's value:
      
      struct map_value {
      	int count_a;
      	int count_b;
      };
      
      cat /sys/fs/bpf/pinned_array_map:
      
      0: {1,2}
      1: {3,4}
      2: {5,6}
      ...
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NAlexei Starovoitov <ast@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      a26ca7c9
  24. 11 4月, 2018 1 次提交
    • Y
      bpf/tracing: fix a deadlock in perf_event_detach_bpf_prog · 3a38bb98
      Yonghong Song 提交于
      syzbot reported a possible deadlock in perf_event_detach_bpf_prog.
      The error details:
        ======================================================
        WARNING: possible circular locking dependency detected
        4.16.0-rc7+ #3 Not tainted
        ------------------------------------------------------
        syz-executor7/24531 is trying to acquire lock:
         (bpf_event_mutex){+.+.}, at: [<000000008a849b07>] perf_event_detach_bpf_prog+0x92/0x3d0 kernel/trace/bpf_trace.c:854
      
        but task is already holding lock:
         (&mm->mmap_sem){++++}, at: [<0000000038768f87>] vm_mmap_pgoff+0x198/0x280 mm/util.c:353
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #1 (&mm->mmap_sem){++++}:
             __might_fault+0x13a/0x1d0 mm/memory.c:4571
             _copy_to_user+0x2c/0xc0 lib/usercopy.c:25
             copy_to_user include/linux/uaccess.h:155 [inline]
             bpf_prog_array_copy_info+0xf2/0x1c0 kernel/bpf/core.c:1694
             perf_event_query_prog_array+0x1c7/0x2c0 kernel/trace/bpf_trace.c:891
             _perf_ioctl kernel/events/core.c:4750 [inline]
             perf_ioctl+0x3e1/0x1480 kernel/events/core.c:4770
             vfs_ioctl fs/ioctl.c:46 [inline]
             do_vfs_ioctl+0x1b1/0x1520 fs/ioctl.c:686
             SYSC_ioctl fs/ioctl.c:701 [inline]
             SyS_ioctl+0x8f/0xc0 fs/ioctl.c:692
             do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
             entry_SYSCALL_64_after_hwframe+0x42/0xb7
      
        -> #0 (bpf_event_mutex){+.+.}:
             lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:3920
             __mutex_lock_common kernel/locking/mutex.c:756 [inline]
             __mutex_lock+0x16f/0x1a80 kernel/locking/mutex.c:893
             mutex_lock_nested+0x16/0x20 kernel/locking/mutex.c:908
             perf_event_detach_bpf_prog+0x92/0x3d0 kernel/trace/bpf_trace.c:854
             perf_event_free_bpf_prog kernel/events/core.c:8147 [inline]
             _free_event+0xbdb/0x10f0 kernel/events/core.c:4116
             put_event+0x24/0x30 kernel/events/core.c:4204
             perf_mmap_close+0x60d/0x1010 kernel/events/core.c:5172
             remove_vma+0xb4/0x1b0 mm/mmap.c:172
             remove_vma_list mm/mmap.c:2490 [inline]
             do_munmap+0x82a/0xdf0 mm/mmap.c:2731
             mmap_region+0x59e/0x15a0 mm/mmap.c:1646
             do_mmap+0x6c0/0xe00 mm/mmap.c:1483
             do_mmap_pgoff include/linux/mm.h:2223 [inline]
             vm_mmap_pgoff+0x1de/0x280 mm/util.c:355
             SYSC_mmap_pgoff mm/mmap.c:1533 [inline]
             SyS_mmap_pgoff+0x462/0x5f0 mm/mmap.c:1491
             SYSC_mmap arch/x86/kernel/sys_x86_64.c:100 [inline]
             SyS_mmap+0x16/0x20 arch/x86/kernel/sys_x86_64.c:91
             do_syscall_64+0x281/0x940 arch/x86/entry/common.c:287
             entry_SYSCALL_64_after_hwframe+0x42/0xb7
      
        other info that might help us debug this:
      
         Possible unsafe locking scenario:
      
               CPU0                    CPU1
               ----                    ----
          lock(&mm->mmap_sem);
                                       lock(bpf_event_mutex);
                                       lock(&mm->mmap_sem);
          lock(bpf_event_mutex);
      
         *** DEADLOCK ***
        ======================================================
      
      The bug is introduced by Commit f371b304 ("bpf/tracing: allow
      user space to query prog array on the same tp") where copy_to_user,
      which requires mm->mmap_sem, is called inside bpf_event_mutex lock.
      At the same time, during perf_event file descriptor close,
      mm->mmap_sem is held first and then subsequent
      perf_event_detach_bpf_prog needs bpf_event_mutex lock.
      Such a senario caused a deadlock.
      
      As suggested by Daniel, moving copy_to_user out of the
      bpf_event_mutex lock should fix the problem.
      
      Fixes: f371b304 ("bpf/tracing: allow user space to query prog array on the same tp")
      Reported-by: syzbot+dc5ca0e4c9bfafaf2bae@syzkaller.appspotmail.com
      Signed-off-by: NYonghong Song <yhs@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      3a38bb98
  25. 31 3月, 2018 1 次提交
    • A
      bpf: Check attach type at prog load time · 5e43f899
      Andrey Ignatov 提交于
      == The problem ==
      
      There are use-cases when a program of some type can be attached to
      multiple attach points and those attach points must have different
      permissions to access context or to call helpers.
      
      E.g. context structure may have fields for both IPv4 and IPv6 but it
      doesn't make sense to read from / write to IPv6 field when attach point
      is somewhere in IPv4 stack.
      
      Same applies to BPF-helpers: it may make sense to call some helper from
      some attach point, but not from other for same prog type.
      
      == The solution ==
      
      Introduce `expected_attach_type` field in in `struct bpf_attr` for
      `BPF_PROG_LOAD` command. If scenario described in "The problem" section
      is the case for some prog type, the field will be checked twice:
      
      1) At load time prog type is checked to see if attach type for it must
         be known to validate program permissions correctly. Prog will be
         rejected with EINVAL if it's the case and `expected_attach_type` is
         not specified or has invalid value.
      
      2) At attach time `attach_type` is compared with `expected_attach_type`,
         if prog type requires to have one, and, if they differ, attach will
         be rejected with EINVAL.
      
      The `expected_attach_type` is now available as part of `struct bpf_prog`
      in both `bpf_verifier_ops->is_valid_access()` and
      `bpf_verifier_ops->get_func_proto()` () and can be used to check context
      accesses and calls to helpers correspondingly.
      
      Initially the idea was discussed by Alexei Starovoitov <ast@fb.com> and
      Daniel Borkmann <daniel@iogearbox.net> here:
      https://marc.info/?l=linux-netdev&m=152107378717201&w=2Signed-off-by: NAndrey Ignatov <rdna@fb.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      5e43f899