1. 30 8月, 2018 1 次提交
    • Y
      bpf: add bpffs pretty print for percpu arraymap/hash/lru_hash · c7b27c37
      Yonghong Song 提交于
      Added bpffs pretty print for percpu arraymap, percpu hashmap
      and percpu lru hashmap.
      
      For each map <key, value> pair, the format is:
         <key_value>: {
      	cpu0: <value_on_cpu0>
      	cpu1: <value_on_cpu1>
      	...
      	cpun: <value_on_cpun>
         }
      
      For example, on my VM, there are 4 cpus, and
      for test_btf test in the next patch:
         cat /sys/fs/bpf/pprint_test_percpu_hash
      
      You may get:
         ...
         43602: {
      	cpu0: {43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO}
      	cpu1: {43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO}
      	cpu2: {43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO}
      	cpu3: {43602,0,-43602,0x3,0xaa52,0x3,{43602|[82,170,0,0,0,0,0,0]},ENUM_TWO}
         }
         72847: {
      	cpu0: {72847,0,-72847,0x3,0x11c8f,0x3,{72847|[143,28,1,0,0,0,0,0]},ENUM_THREE}
      	cpu1: {72847,0,-72847,0x3,0x11c8f,0x3,{72847|[143,28,1,0,0,0,0,0]},ENUM_THREE}
      	cpu2: {72847,0,-72847,0x3,0x11c8f,0x3,{72847|[143,28,1,0,0,0,0,0]},ENUM_THREE}
      	cpu3: {72847,0,-72847,0x3,0x11c8f,0x3,{72847|[143,28,1,0,0,0,0,0]},ENUM_THREE}
         }
         ...
      Signed-off-by: NYonghong Song <yhs@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      c7b27c37
  2. 13 8月, 2018 1 次提交
    • D
      bpf: decouple btf from seq bpf fs dump and enable more maps · e8d2bec0
      Daniel Borkmann 提交于
      Commit a26ca7c9 ("bpf: btf: Add pretty print support to
      the basic arraymap") and 699c86d6 ("bpf: btf: add pretty
      print for hash/lru_hash maps") enabled support for BTF and
      dumping via BPF fs for array and hash/lru map. However, both
      can be decoupled from each other such that regular BPF maps
      can be supported for attaching BTF key/value information,
      while not all maps necessarily need to dump via map_seq_show_elem()
      callback.
      
      The basic sanity check which is a prerequisite for all maps
      is that key/value size has to match in any case, and some maps
      can have extra checks via map_check_btf() callback, e.g.
      probing certain types or indicating no support in general. With
      that we can also enable retrieving BTF info for per-cpu map
      types and lpm.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NYonghong Song <yhs@fb.com>
      e8d2bec0
  3. 11 8月, 2018 1 次提交
    • M
      bpf: Introduce BPF_MAP_TYPE_REUSEPORT_SOCKARRAY · 5dc4c4b7
      Martin KaFai Lau 提交于
      This patch introduces a new map type BPF_MAP_TYPE_REUSEPORT_SOCKARRAY.
      
      To unleash the full potential of a bpf prog, it is essential for the
      userspace to be capable of directly setting up a bpf map which can then
      be consumed by the bpf prog to make decision.  In this case, decide which
      SO_REUSEPORT sk to serve the incoming request.
      
      By adding BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, the userspace has total control
      and visibility on where a SO_REUSEPORT sk should be located in a bpf map.
      The later patch will introduce BPF_PROG_TYPE_SK_REUSEPORT such that
      the bpf prog can directly select a sk from the bpf map.  That will
      raise the programmability of the bpf prog attached to a reuseport
      group (a group of sk serving the same IP:PORT).
      
      For example, in UDP, the bpf prog can peek into the payload (e.g.
      through the "data" pointer introduced in the later patch) to learn
      the application level's connection information and then decide which sk
      to pick from a bpf map.  The userspace can tightly couple the sk's location
      in a bpf map with the application logic in generating the UDP payload's
      connection information.  This connection info contact/API stays within the
      userspace.
      
      Also, when used with map-in-map, the userspace can switch the
      old-server-process's inner map to a new-server-process's inner map
      in one call "bpf_map_update_elem(outer_map, &index, &new_reuseport_array)".
      The bpf prog will then direct incoming requests to the new process instead
      of the old process.  The old process can finish draining the pending
      requests (e.g. by "accept()") before closing the old-fds.  [Note that
      deleting a fd from a bpf map does not necessary mean the fd is closed]
      
      During map_update_elem(),
      Only SO_REUSEPORT sk (i.e. which has already been added
      to a reuse->socks[]) can be used.  That means a SO_REUSEPORT sk that is
      "bind()" for UDP or "bind()+listen()" for TCP.  These conditions are
      ensured in "reuseport_array_update_check()".
      
      A SO_REUSEPORT sk can only be added once to a map (i.e. the
      same sk cannot be added twice even to the same map).  SO_REUSEPORT
      already allows another sk to be created for the same IP:PORT.
      There is no need to re-create a similar usage in the BPF side.
      
      When a SO_REUSEPORT is deleted from the "reuse->socks[]" (e.g. "close()"),
      it will notify the bpf map to remove it from the map also.  It is
      done through "bpf_sk_reuseport_detach()" and it will only be called
      if >=1 of the "reuse->sock[]" has ever been added to a bpf map.
      
      The map_update()/map_delete() has to be in-sync with the
      "reuse->socks[]".  Hence, the same "reuseport_lock" used
      by "reuse->socks[]" has to be used here also. Care has
      been taken to ensure the lock is only acquired when the
      adding sk passes some strict tests. and
      freeing the map does not require the reuseport_lock.
      
      The reuseport_array will also support lookup from the syscall
      side.  It will return a sock_gen_cookie().  The sock_gen_cookie()
      is on-demand (i.e. a sk's cookie is not generated until the very
      first map_lookup_elem()).
      
      The lookup cookie is 64bits but it goes against the logical userspace
      expectation on 32bits sizeof(fd) (and as other fd based bpf maps do also).
      It may catch user in surprise if we enforce value_size=8 while
      userspace still pass a 32bits fd during update.  Supporting different
      value_size between lookup and update seems unintuitive also.
      
      We also need to consider what if other existing fd based maps want
      to return 64bits value from syscall's lookup in the future.
      Hence, reuseport_array supports both value_size 4 and 8, and
      assuming user will usually use value_size=4.  The syscall's lookup
      will return ENOSPC on value_size=4.  It will will only
      return 64bits value from sock_gen_cookie() when user consciously
      choose value_size=8 (as a signal that lookup is desired) which then
      requires a 64bits value in both lookup and update.
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      5dc4c4b7
  4. 27 7月, 2018 1 次提交
    • M
      bpf: btf: Use exact btf value_size match in map_check_btf() · 5f300e80
      Martin KaFai Lau 提交于
      The current map_check_btf() in BPF_MAP_TYPE_ARRAY rejects
      '> map->value_size' to ensure map_seq_show_elem() will not
      access things beyond an array element.
      
      Yonghong suggested that using '!=' is a more correct
      check.  The 8 bytes round_up on value_size is stored
      in array->elem_size.  Hence, using '!=' on map->value_size
      is a proper check.
      
      This patch also adds new tests to check the btf array
      key type and value type.  Two of these new tests verify
      the btf's value_size (the change in this patch).
      
      It also fixes two existing tests that wrongly encoded
      a btf's type size (pprint_test) and the value_type_id (in one
      of the raw_tests[]).  However, that do not affect these two
      BTF verification tests before or after this test changes.
      These two tests mainly failed at array creation time after
      this patch.
      
      Fixes: a26ca7c9 ("bpf: btf: Add pretty print support to the basic arraymap")
      Suggested-by: NYonghong Song <yhs@fb.com>
      Acked-by: NYonghong Song <yhs@fb.com>
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      5f300e80
  5. 23 5月, 2018 1 次提交
  6. 24 4月, 2018 1 次提交
  7. 20 4月, 2018 1 次提交
    • M
      bpf: btf: Add pretty print support to the basic arraymap · a26ca7c9
      Martin KaFai Lau 提交于
      This patch adds pretty print support to the basic arraymap.
      Support for other bpf maps can be added later.
      
      This patch adds new attrs to the BPF_MAP_CREATE command to allow
      specifying the btf_fd, btf_key_id and btf_value_id.  The
      BPF_MAP_CREATE can then associate the btf to the map if
      the creating map supports BTF.
      
      A BTF supported map needs to implement two new map ops,
      map_seq_show_elem() and map_check_btf().  This patch has
      implemented these new map ops for the basic arraymap.
      
      It also adds file_operations, bpffs_map_fops, to the pinned
      map such that the pinned map can be opened and read.
      After that, the user has an intuitive way to do
      "cat bpffs/pathto/a-pinned-map" instead of getting
      an error.
      
      bpffs_map_fops should not be extended further to support
      other operations.  Other operations (e.g. write/key-lookup...)
      should be realized by the userspace tools (e.g. bpftool) through
      the BPF_OBJ_GET_INFO_BY_FD, map's lookup/update interface...etc.
      Follow up patches will allow the userspace to obtain
      the BTF from a map-fd.
      
      Here is a sample output when reading a pinned arraymap
      with the following map's value:
      
      struct map_value {
      	int count_a;
      	int count_b;
      };
      
      cat /sys/fs/bpf/pinned_array_map:
      
      0: {1,2}
      1: {3,4}
      2: {5,6}
      ...
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NAlexei Starovoitov <ast@fb.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      a26ca7c9
  8. 23 2月, 2018 1 次提交
  9. 16 2月, 2018 1 次提交
    • D
      bpf: fix mlock precharge on arraymaps · 9c2d63b8
      Daniel Borkmann 提交于
      syzkaller recently triggered OOM during percpu map allocation;
      while there is work in progress by Dennis Zhou to add __GFP_NORETRY
      semantics for percpu allocator under pressure, there seems also a
      missing bpf_map_precharge_memlock() check in array map allocation.
      
      Given today the actual bpf_map_charge_memlock() happens after the
      find_and_alloc_map() in syscall path, the bpf_map_precharge_memlock()
      is there to bail out early before we go and do the map setup work
      when we find that we hit the limits anyway. Therefore add this for
      array map as well.
      
      Fixes: 6c905981 ("bpf: pre-allocate hash map elements")
      Fixes: a10423b8 ("bpf: introduce BPF_MAP_TYPE_PERCPU_ARRAY map")
      Reported-by: syzbot+adb03f3f0bb57ce3acda@syzkaller.appspotmail.com
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Cc: Dennis Zhou <dennisszhou@gmail.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      9c2d63b8
  10. 19 1月, 2018 2 次提交
  11. 11 1月, 2018 1 次提交
    • D
      bpf, array: fix overflow in max_entries and undefined behavior in index_mask · bbeb6e43
      Daniel Borkmann 提交于
      syzkaller tried to alloc a map with 0xfffffffd entries out of a userns,
      and thus unprivileged. With the recently added logic in b2157399
      ("bpf: prevent out-of-bounds speculation") we round this up to the next
      power of two value for max_entries for unprivileged such that we can
      apply proper masking into potentially zeroed out map slots.
      
      However, this will generate an index_mask of 0xffffffff, and therefore
      a + 1 will let this overflow into new max_entries of 0. This will pass
      allocation, etc, and later on map access we still enforce on the original
      attr->max_entries value which was 0xfffffffd, therefore triggering GPF
      all over the place. Thus bail out on overflow in such case.
      
      Moreover, on 32 bit archs roundup_pow_of_two() can also not be used,
      since fls_long(max_entries - 1) can result in 32 and 1UL << 32 in 32 bit
      space is undefined. Therefore, do this by hand in a 64 bit variable.
      
      This fixes all the issues triggered by syzkaller's reproducers.
      
      Fixes: b2157399 ("bpf: prevent out-of-bounds speculation")
      Reported-by: syzbot+b0efb8e572d01bce1ae0@syzkaller.appspotmail.com
      Reported-by: syzbot+6c15e9744f75f2364773@syzkaller.appspotmail.com
      Reported-by: syzbot+d2f5524fb46fd3b312ee@syzkaller.appspotmail.com
      Reported-by: syzbot+61d23c95395cc90dbc2b@syzkaller.appspotmail.com
      Reported-by: syzbot+0d363c942452cca68c01@syzkaller.appspotmail.com
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      bbeb6e43
  12. 09 1月, 2018 1 次提交
    • A
      bpf: prevent out-of-bounds speculation · b2157399
      Alexei Starovoitov 提交于
      Under speculation, CPUs may mis-predict branches in bounds checks. Thus,
      memory accesses under a bounds check may be speculated even if the
      bounds check fails, providing a primitive for building a side channel.
      
      To avoid leaking kernel data round up array-based maps and mask the index
      after bounds check, so speculated load with out of bounds index will load
      either valid value from the array or zero from the padded area.
      
      Unconditionally mask index for all array types even when max_entries
      are not rounded to power of 2 for root user.
      When map is created by unpriv user generate a sequence of bpf insns
      that includes AND operation to make sure that JITed code includes
      the same 'index & index_mask' operation.
      
      If prog_array map is created by unpriv user replace
        bpf_tail_call(ctx, map, index);
      with
        if (index >= max_entries) {
          index &= map->index_mask;
          bpf_tail_call(ctx, map, index);
        }
      (along with roundup to power 2) to prevent out-of-bounds speculation.
      There is secondary redundant 'if (index >= max_entries)' in the interpreter
      and in all JITs, but they can be optimized later if necessary.
      
      Other array-like maps (cpumap, devmap, sockmap, perf_event_array, cgroup_array)
      cannot be used by unpriv, so no changes there.
      
      That fixes bpf side of "Variant 1: bounds check bypass (CVE-2017-5753)" on
      all architectures with and without JIT.
      
      v2->v3:
      Daniel noticed that attack potentially can be crafted via syscall commands
      without loading the program, so add masking to those paths as well.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NJohn Fastabend <john.fastabend@gmail.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      b2157399
  13. 27 10月, 2017 1 次提交
    • Y
      perf/bpf: Extend the perf_event_read_local() interface, a.k.a. "bpf: perf... · 7d9285e8
      Yonghong Song 提交于
      perf/bpf: Extend the perf_event_read_local() interface, a.k.a. "bpf: perf event change needed for subsequent bpf helpers"
      
      eBPF programs would like access to the (perf) event enabled and
      running times along with the event value, such that they can deal with
      event multiplexing (among other things).
      
      This patch extends the interface; a future eBPF patch will utilize
      the new functionality.
      
      [ Note, there's a same-content commit with a poor changelog and a meaningless
        title in the networking tree as well - but we need this change for subsequent
        perf work, so apply it here as well, with a proper changelog. Hopefully Git
        will be able to sort out this somewhat messy workflow, if there are no other,
        conflicting changes to these files. ]
      Signed-off-by: NYonghong Song <yhs@fb.com>
      [ Rewrote the changelog. ]
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: <ast@fb.com>
      Cc: <daniel@iogearbox.net>
      Cc: <rostedt@goodmis.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: David S. Miller <davem@davemloft.net>
      Link: http://lkml.kernel.org/r/20171005161923.332790-2-yhs@fb.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      7d9285e8
  14. 20 10月, 2017 1 次提交
  15. 19 10月, 2017 1 次提交
  16. 08 10月, 2017 1 次提交
  17. 20 8月, 2017 2 次提交
    • D
      bpf: inline map in map lookup functions for array and htab · 7b0c2a05
      Daniel Borkmann 提交于
      Avoid two successive functions calls for the map in map lookup, first
      is the bpf_map_lookup_elem() helper call, and second the callback via
      map->ops->map_lookup_elem() to get to the map in map implementation.
      Implementation inlines array and htab flavor for map in map lookups.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7b0c2a05
    • M
      bpf: Allow selecting numa node during map creation · 96eabe7a
      Martin KaFai Lau 提交于
      The current map creation API does not allow to provide the numa-node
      preference.  The memory usually comes from where the map-creation-process
      is running.  The performance is not ideal if the bpf_prog is known to
      always run in a numa node different from the map-creation-process.
      
      One of the use case is sharding on CPU to different LRU maps (i.e.
      an array of LRU maps).  Here is the test result of map_perf_test on
      the INNER_LRU_HASH_PREALLOC test if we force the lru map used by
      CPU0 to be allocated from a remote numa node:
      
      [ The machine has 20 cores. CPU0-9 at node 0. CPU10-19 at node 1 ]
      
      ># taskset -c 10 ./map_perf_test 512 8 1260000 8000000
      5:inner_lru_hash_map_perf pre-alloc 1628380 events per sec
      4:inner_lru_hash_map_perf pre-alloc 1626396 events per sec
      3:inner_lru_hash_map_perf pre-alloc 1626144 events per sec
      6:inner_lru_hash_map_perf pre-alloc 1621657 events per sec
      2:inner_lru_hash_map_perf pre-alloc 1621534 events per sec
      1:inner_lru_hash_map_perf pre-alloc 1620292 events per sec
      7:inner_lru_hash_map_perf pre-alloc 1613305 events per sec
      0:inner_lru_hash_map_perf pre-alloc 1239150 events per sec  #<<<
      
      After specifying numa node:
      ># taskset -c 10 ./map_perf_test 512 8 1260000 8000000
      5:inner_lru_hash_map_perf pre-alloc 1629627 events per sec
      3:inner_lru_hash_map_perf pre-alloc 1628057 events per sec
      1:inner_lru_hash_map_perf pre-alloc 1623054 events per sec
      6:inner_lru_hash_map_perf pre-alloc 1616033 events per sec
      2:inner_lru_hash_map_perf pre-alloc 1614630 events per sec
      4:inner_lru_hash_map_perf pre-alloc 1612651 events per sec
      7:inner_lru_hash_map_perf pre-alloc 1609337 events per sec
      0:inner_lru_hash_map_perf pre-alloc 1619340 events per sec #<<<
      
      This patch adds one field, numa_node, to the bpf_attr.  Since numa node 0
      is a valid node, a new flag BPF_F_NUMA_NODE is also added.  The numa_node
      field is honored if and only if the BPF_F_NUMA_NODE flag is set.
      
      Numa node selection is not supported for percpu map.
      
      This patch does not change all the kmalloc.  F.e.
      'htab = kzalloc()' is not changed since the object
      is small enough to stay in the cache.
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      96eabe7a
  18. 30 6月, 2017 1 次提交
  19. 05 6月, 2017 1 次提交
  20. 26 5月, 2017 1 次提交
    • D
      bpf: fix wrong exposure of map_flags into fdinfo for lpm · a316338c
      Daniel Borkmann 提交于
      trie_alloc() always needs to have BPF_F_NO_PREALLOC passed in via
      attr->map_flags, since it does not support preallocation yet. We
      check the flag, but we never copy the flag into trie->map.map_flags,
      which is later on exposed into fdinfo and used by loaders such as
      iproute2. Latter uses this in bpf_map_selfcheck_pinned() to test
      whether a pinned map has the same spec as the one from the BPF obj
      file and if not, bails out, which is currently the case for lpm
      since it exposes always 0 as flags.
      
      Also copy over flags in array_map_alloc() and stack_map_alloc().
      They always have to be 0 right now, but we should make sure to not
      miss to copy them over at a later point in time when we add actual
      flags for them to use.
      
      Fixes: b95a5c4d ("bpf: add a longest prefix match trie map implementation")
      Reported-by: NJarno Rajahalme <jarno@covalent.io>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a316338c
  21. 25 4月, 2017 1 次提交
  22. 12 4月, 2017 1 次提交
  23. 23 3月, 2017 2 次提交
    • M
      bpf: Add array of maps support · 56f668df
      Martin KaFai Lau 提交于
      This patch adds a few helper funcs to enable map-in-map
      support (i.e. outer_map->inner_map).  The first outer_map type
      BPF_MAP_TYPE_ARRAY_OF_MAPS is also added in this patch.
      The next patch will introduce a hash of maps type.
      
      Any bpf map type can be acted as an inner_map.  The exception
      is BPF_MAP_TYPE_PROG_ARRAY because the extra level of
      indirection makes it harder to verify the owner_prog_type
      and owner_jited.
      
      Multi-level map-in-map is not supported (i.e. map->map is ok
      but not map->map->map).
      
      When adding an inner_map to an outer_map, it currently checks the
      map_type, key_size, value_size, map_flags, max_entries and ops.
      The verifier also uses those map's properties to do static analysis.
      map_flags is needed because we need to ensure BPF_PROG_TYPE_PERF_EVENT
      is using a preallocated hashtab for the inner_hash also.  ops and
      max_entries are needed to generate inlined map-lookup instructions.
      For simplicity reason, a simple '==' test is used for both map_flags
      and max_entries.  The equality of ops is implied by the equality of
      map_type.
      
      During outer_map creation time, an inner_map_fd is needed to create an
      outer_map.  However, the inner_map_fd's life time does not depend on the
      outer_map.  The inner_map_fd is merely used to initialize
      the inner_map_meta of the outer_map.
      
      Also, for the outer_map:
      
      * It allows element update and delete from syscall
      * It allows element lookup from bpf_prog
      
      The above is similar to the current fd_array pattern.
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      56f668df
    • M
      bpf: Fix and simplifications on inline map lookup · fad73a1a
      Martin KaFai Lau 提交于
      Fix in verifier:
      For the same bpf_map_lookup_elem() instruction (i.e. "call 1"),
      a broken case is "a different type of map could be used for the
      same lookup instruction". For example, an array in one case and a
      hashmap in another.  We have to resort to the old dynamic call behavior
      in this case.  The fix is to check for collision on insn_aux->map_ptr.
      If there is collision, don't inline the map lookup.
      
      Please see the "do_reg_lookup()" in test_map_in_map_kern.c in the later
      patch for how-to trigger the above case.
      
      Simplifications on array_map_gen_lookup():
      1. Calculate elem_size from map->value_size.  It removes the
         need for 'struct bpf_array' which makes the later map-in-map
         implementation easier.
      2. Remove the 'elem_size == 1' test
      
      Fixes: 81ed18ab ("bpf: add helper inlining infra and optimize map_array lookup")
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fad73a1a
  24. 17 3月, 2017 1 次提交
  25. 18 2月, 2017 1 次提交
  26. 19 1月, 2017 1 次提交
    • D
      bpf: don't trigger OOM killer under pressure with map alloc · d407bd25
      Daniel Borkmann 提交于
      This patch adds two helpers, bpf_map_area_alloc() and bpf_map_area_free(),
      that are to be used for map allocations. Using kmalloc() for very large
      allocations can cause excessive work within the page allocator, so i) fall
      back earlier to vmalloc() when the attempt is considered costly anyway,
      and even more importantly ii) don't trigger OOM killer with any of the
      allocators.
      
      Since this is based on a user space request, for example, when creating
      maps with element pre-allocation, we really want such requests to fail
      instead of killing other user space processes.
      
      Also, don't spam the kernel log with warnings should any of the allocations
      fail under pressure. Given that, we can make backend selection in
      bpf_map_area_alloc() generic, and convert all maps over to use this API
      for spots with potentially large allocation requests.
      
      Note, replacing the one kmalloc_array() is fine as overflow checks happen
      earlier in htab_map_alloc(), since it must also protect the multiplication
      for vmalloc() should kmalloc_array() fail.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d407bd25
  27. 11 1月, 2017 1 次提交
  28. 13 8月, 2016 1 次提交
  29. 17 7月, 2016 1 次提交
  30. 02 7月, 2016 2 次提交
    • M
      cgroup: bpf: Add BPF_MAP_TYPE_CGROUP_ARRAY · 4ed8ec52
      Martin KaFai Lau 提交于
      Add a BPF_MAP_TYPE_CGROUP_ARRAY and its bpf_map_ops's implementations.
      To update an element, the caller is expected to obtain a cgroup2 backed
      fd by open(cgroup2_dir) and then update the array with that fd.
      Signed-off-by: NMartin KaFai Lau <kafai@fb.com>
      Cc: Alexei Starovoitov <ast@fb.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Cc: Tejun Heo <tj@kernel.org>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4ed8ec52
    • D
      bpf: generally move prog destruction to RCU deferral · 1aacde3d
      Daniel Borkmann 提交于
      Jann Horn reported following analysis that could potentially result
      in a very hard to trigger (if not impossible) UAF race, to quote his
      event timeline:
      
       - Set up a process with threads T1, T2 and T3
       - Let T1 set up a socket filter F1 that invokes another filter F2
         through a BPF map [tail call]
       - Let T1 trigger the socket filter via a unix domain socket write,
         don't wait for completion
       - Let T2 call PERF_EVENT_IOC_SET_BPF with F2, don't wait for completion
       - Now T2 should be behind bpf_prog_get(), but before bpf_prog_put()
       - Let T3 close the file descriptor for F2, dropping the reference
         count of F2 to 2
       - At this point, T1 should have looked up F2 from the map, but not
         finished executing it
       - Let T3 remove F2 from the BPF map, dropping the reference count of
         F2 to 1
       - Now T2 should call bpf_prog_put() (wrong BPF program type), dropping
         the reference count of F2 to 0 and scheduling bpf_prog_free_deferred()
         via schedule_work()
       - At this point, the BPF program could be freed
       - BPF execution is still running in a freed BPF program
      
      While at PERF_EVENT_IOC_SET_BPF time it's only guaranteed that the perf
      event fd we're doing the syscall on doesn't disappear from underneath us
      for whole syscall time, it may not be the case for the bpf fd used as
      an argument only after we did the put. It needs to be a valid fd pointing
      to a BPF program at the time of the call to make the bpf_prog_get() and
      while T2 gets preempted, F2 must have dropped reference to 1 on the other
      CPU. The fput() from the close() in T3 should also add additionally delay
      to the reference drop via exit_task_work() when bpf_prog_release() gets
      called as well as scheduling bpf_prog_free_deferred().
      
      That said, it makes nevertheless sense to move the BPF prog destruction
      generally after RCU grace period to guarantee that such scenario above,
      but also others as recently fixed in ceb56070 ("bpf, perf: delay release
      of BPF prog after grace period") with regards to tail calls won't happen.
      Integrating bpf_prog_free_deferred() directly into the RCU callback is
      not allowed since the invocation might happen from either softirq or
      process context, so we're not permitted to block. Reviewing all bpf_prog_put()
      invocations from eBPF side (note, cBPF -> eBPF progs don't use this for
      their destruction) with call_rcu() look good to me.
      
      Since we don't know whether at the time of attaching the program, we're
      already part of a tail call map, we need to use RCU variant. However, due
      to this, there won't be severely more stress on the RCU callback queue:
      situations with above bpf_prog_get() and bpf_prog_put() combo in practice
      normally won't lead to releases, but even if they would, enough effort/
      cycles have to be put into loading a BPF program into the kernel already.
      Reported-by: NJann Horn <jannh@google.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1aacde3d
  31. 16 6月, 2016 2 次提交
    • D
      bpf, maps: flush own entries on perf map release · 3b1efb19
      Daniel Borkmann 提交于
      The behavior of perf event arrays are quite different from all
      others as they are tightly coupled to perf event fds, f.e. shown
      recently by commit e03e7ee3 ("perf/bpf: Convert perf_event_array
      to use struct file") to make refcounting on perf event more robust.
      A remaining issue that the current code still has is that since
      additions to the perf event array take a reference on the struct
      file via perf_event_get() and are only released via fput() (that
      cleans up the perf event eventually via perf_event_release_kernel())
      when the element is either manually removed from the map from user
      space or automatically when the last reference on the perf event
      map is dropped. However, this leads us to dangling struct file's
      when the map gets pinned after the application owning the perf
      event descriptor exits, and since the struct file reference will
      in such case only be manually dropped or via pinned file removal,
      it leads to the perf event living longer than necessary, consuming
      needlessly resources for that time.
      
      Relations between perf event fds and bpf perf event map fds can be
      rather complex. F.e. maps can act as demuxers among different perf
      event fds that can possibly be owned by different threads and based
      on the index selection from the program, events get dispatched to
      one of the per-cpu fd endpoints. One perf event fd (or, rather a
      per-cpu set of them) can also live in multiple perf event maps at
      the same time, listening for events. Also, another requirement is
      that perf event fds can get closed from application side after they
      have been attached to the perf event map, so that on exit perf event
      map will take care of dropping their references eventually. Likewise,
      when such maps are pinned, the intended behavior is that a user
      application does bpf_obj_get(), puts its fds in there and on exit
      when fd is released, they are dropped from the map again, so the map
      acts rather as connector endpoint. This also makes perf event maps
      inherently different from program arrays as described in more detail
      in commit c9da161c ("bpf: fix clearing on persistent program
      array maps").
      
      To tackle this, map entries are marked by the map struct file that
      added the element to the map. And when the last reference to that map
      struct file is released from user space, then the tracked entries
      are purged from the map. This is okay, because new map struct files
      instances resp. frontends to the anon inode are provided via
      bpf_map_new_fd() that is called when we invoke bpf_obj_get_user()
      for retrieving a pinned map, but also when an initial instance is
      created via map_create(). The rest is resolved by the vfs layer
      automatically for us by keeping reference count on the map's struct
      file. Any concurrent updates on the map slot are fine as well, it
      just means that perf_event_fd_array_release() needs to delete less
      of its own entires.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3b1efb19
    • D
      bpf, maps: extend map_fd_get_ptr arguments · d056a788
      Daniel Borkmann 提交于
      This patch extends map_fd_get_ptr() callback that is used by fd array
      maps, so that struct file pointer from the related map can be passed
      in. It's safe to remove map_update_elem() callback for the two maps since
      this is only allowed from syscall side, but not from eBPF programs for these
      two map types. Like in per-cpu map case, bpf_fd_array_map_update_elem()
      needs to be called directly here due to the extra argument.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      d056a788
  32. 09 3月, 2016 1 次提交
  33. 06 2月, 2016 2 次提交
    • A
      bpf: add lookup/update support for per-cpu hash and array maps · 15a07b33
      Alexei Starovoitov 提交于
      The functions bpf_map_lookup_elem(map, key, value) and
      bpf_map_update_elem(map, key, value, flags) need to get/set
      values from all-cpus for per-cpu hash and array maps,
      so that user space can aggregate/update them as necessary.
      
      Example of single counter aggregation in user space:
        unsigned int nr_cpus = sysconf(_SC_NPROCESSORS_CONF);
        long values[nr_cpus];
        long value = 0;
      
        bpf_lookup_elem(fd, key, values);
        for (i = 0; i < nr_cpus; i++)
          value += values[i];
      
      The user space must provide round_up(value_size, 8) * nr_cpus
      array to get/set values, since kernel will use 'long' copy
      of per-cpu values to try to copy good counters atomically.
      It's a best-effort, since bpf programs and user space are racing
      to access the same memory.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      15a07b33
    • A
      bpf: introduce BPF_MAP_TYPE_PERCPU_ARRAY map · a10423b8
      Alexei Starovoitov 提交于
      Primary use case is a histogram array of latency
      where bpf program computes the latency of block requests or other
      events and stores histogram of latency into array of 64 elements.
      All cpus are constantly running, so normal increment is not accurate,
      bpf_xadd causes cache ping-pong and this per-cpu approach allows
      fastest collision-free counters.
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      a10423b8
  34. 29 1月, 2016 1 次提交