1. 14 Feb 2018, 1 commit
  2. 27 Jan 2018, 1 commit
    • Y
      bpf: fix kernel page fault in lpm map trie_get_next_key · 6dd1ec6c
      Committed by Yonghong Song
      Commit b471f2f1 ("bpf: implement MAP_GET_NEXT_KEY command
      for LPM_TRIE map") introduced a bug like the one below:
      
          if (!rcu_dereference(trie->root))
              return -ENOENT;
          if (!key || key->prefixlen > trie->max_prefixlen) {
              root = &trie->root;
              goto find_leftmost;
          }
          ......
        find_leftmost:
          for (node = rcu_dereference(*root); node;) {
      
      The code after the find_leftmost label assumes that
      *root is not NULL, but that does not hold: trie->root
      can be changed to NULL by a concurrent, asynchronous
      delete operation.
      
      The issue was reported by syzbot and Eric Dumazet with the
      error log below:
        ......
        kasan: CONFIG_KASAN_INLINE enabled
        kasan: GPF could be caused by NULL-ptr deref or user memory access
        general protection fault: 0000 [#1] SMP KASAN
        Dumping ftrace buffer:
           (ftrace buffer empty)
        Modules linked in:
        CPU: 1 PID: 8033 Comm: syz-executor3 Not tainted 4.15.0-rc8+ #4
        Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
        RIP: 0010:trie_get_next_key+0x3c2/0xf10 kernel/bpf/lpm_trie.c:682
        ......
      
      This patch fixes the issue by using the local rcu_dereference()'d
      pointer throughout, instead of re-reading *(&trie->root) later on.
      
      Fixes: b471f2f1 ("bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE map")
      Reported-by: syzbot <syzkaller@googlegroups.com>
      Reported-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Alexei Starovoitov <ast@kernel.org>
      6dd1ec6c
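      The race the commit above fixes can be sketched in user-space C. This is an illustrative stand-in, not the kernel code: `atomic_load()` plays the role of `rcu_dereference()`, and `struct trie`/`struct node` are simplified mock-ups of the lpm_trie internals. The point is the pattern: read the shared root pointer exactly once into a local, then use only that local snapshot.

      ```c
      #include <stdatomic.h>
      #include <stddef.h>

      /* Simplified stand-ins for the lpm_trie structures. */
      struct node { int key; };
      struct trie { _Atomic(struct node *) root; };

      /* Buggy shape: trie->root is read twice. A concurrent delete can
       * set it to NULL between the check and the second read, so the
       * kernel version could fault on a NULL dereference. */
      int get_next_key_buggy(struct trie *t, int *out)
      {
          if (!atomic_load(&t->root))
              return -2;                           /* -ENOENT stand-in */
          /* ...concurrent delete may NULL the root right here... */
          struct node *n = atomic_load(&t->root);  /* may now be NULL */
          *out = n->key;                           /* potential NULL deref */
          return 0;
      }

      /* Fixed shape: a single read into a local, as the patch does with
       * trie->root; the non-NULL check and the use see the same snapshot. */
      int get_next_key_fixed(struct trie *t, int *out)
      {
          struct node *n = atomic_load(&t->root);  /* one dereference */
          if (!n)
              return -2;
          *out = n->key;                           /* same non-NULL snapshot */
          return 0;
      }
      ```

      Under RCU the snapshot stays valid until the reader leaves its read-side critical section, which is why using the local pointer is safe even after a concurrent delete.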
  3. 24 Jan 2018, 1 commit
  4. 20 Jan 2018, 1 commit
    • Y
      bpf: implement MAP_GET_NEXT_KEY command for LPM_TRIE map · b471f2f1
      Committed by Yonghong Song
      The LPM_TRIE map type currently does not implement the
      MAP_GET_NEXT_KEY command. This command is handy when users
      want to enumerate keys; without it, a separate map that
      supports key enumeration may be required just to store the
      keys. If the map data is sparse and all entries are to be
      deleted without closing the file descriptor, finding the
      keys with MAP_GET_NEXT_KEY is much faster than enumerating
      the whole key space.
      
      This patch implements MAP_GET_NEXT_KEY command for LPM_TRIE map.
      If the user-provided key pointer is NULL or the key has no
      exact match in the trie, the first key is returned;
      otherwise, the next key is returned.
      
      In this implementation, key enumeration follows a postorder
      traversal of the internal trie: given a sequence of
      MAP_GET_NEXT_KEY syscalls, more-specific keys are returned
      before less-specific ones.
      Signed-off-by: Yonghong Song <yhs@fb.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      b471f2f1
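      The contract described above (NULL or non-matching key returns the first key; the last key returns -ENOENT; otherwise the next key) can be modeled with a toy stand-in. This is illustrative only: the real interface is the bpf(2) syscall with cmd BPF_MAP_GET_NEXT_KEY, and here a fixed array replaces the trie, ordered as a postorder traversal would yield it (most specific first).

      ```c
      #include <stddef.h>

      /* Keys in the order a postorder walk of the trie would return
       * them: more-specific prefixes before less-specific ones. */
      static const int keys[] = {24, 16, 8};

      /* Toy model of the MAP_GET_NEXT_KEY semantics. */
      int map_get_next_key(const int *key, int *next)
      {
          size_t n = sizeof(keys) / sizeof(keys[0]);
          if (!key) {                  /* NULL key: start enumeration */
              *next = keys[0];
              return 0;
          }
          for (size_t i = 0; i < n; i++) {
              if (keys[i] == *key) {
                  if (i + 1 == n)
                      return -2;       /* last key: -ENOENT stand-in */
                  *next = keys[i + 1];
                  return 0;
              }
          }
          *next = keys[0];             /* no exact match: first key */
          return 0;
      }
      ```

      A caller enumerates all keys with the usual loop: start with a NULL key, then feed each returned key back in until the call reports no next key.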
  5. 15 Jan 2018, 1 commit
  6. 20 Oct 2017, 1 commit
  7. 26 Sep 2017, 1 commit
  8. 20 Sep 2017, 1 commit
  9. 20 Aug 2017, 1 commit
    • M
      bpf: Allow selecting numa node during map creation · 96eabe7a
      Committed by Martin KaFai Lau
      The current map creation API does not allow specifying a
      NUMA-node preference; the memory usually comes from the node
      where the map-creating process is running. Performance is not
      ideal if the bpf_prog is known to always run on a NUMA node
      different from that of the map-creating process.
      
      One of the use cases is sharding across CPUs into different
      LRU maps (i.e. an array of LRU maps). Here is the
      map_perf_test result for the INNER_LRU_HASH_PREALLOC test
      when the LRU map used by CPU0 is forced to be allocated from
      a remote NUMA node:
      
      [ The machine has 20 cores. CPU0-9 at node 0. CPU10-19 at node 1 ]
      
      ># taskset -c 10 ./map_perf_test 512 8 1260000 8000000
      5:inner_lru_hash_map_perf pre-alloc 1628380 events per sec
      4:inner_lru_hash_map_perf pre-alloc 1626396 events per sec
      3:inner_lru_hash_map_perf pre-alloc 1626144 events per sec
      6:inner_lru_hash_map_perf pre-alloc 1621657 events per sec
      2:inner_lru_hash_map_perf pre-alloc 1621534 events per sec
      1:inner_lru_hash_map_perf pre-alloc 1620292 events per sec
      7:inner_lru_hash_map_perf pre-alloc 1613305 events per sec
      0:inner_lru_hash_map_perf pre-alloc 1239150 events per sec  #<<<
      
      After specifying numa node:
      ># taskset -c 10 ./map_perf_test 512 8 1260000 8000000
      5:inner_lru_hash_map_perf pre-alloc 1629627 events per sec
      3:inner_lru_hash_map_perf pre-alloc 1628057 events per sec
      1:inner_lru_hash_map_perf pre-alloc 1623054 events per sec
      6:inner_lru_hash_map_perf pre-alloc 1616033 events per sec
      2:inner_lru_hash_map_perf pre-alloc 1614630 events per sec
      4:inner_lru_hash_map_perf pre-alloc 1612651 events per sec
      7:inner_lru_hash_map_perf pre-alloc 1609337 events per sec
      0:inner_lru_hash_map_perf pre-alloc 1619340 events per sec #<<<
      
      This patch adds one field, numa_node, to bpf_attr. Since NUMA
      node 0 is a valid node, a new flag BPF_F_NUMA_NODE is also
      added; the numa_node field is honored if and only if the
      BPF_F_NUMA_NODE flag is set.
      
      NUMA node selection is not supported for percpu maps.
      
      This patch does not change every allocation. For example,
      'htab = kzalloc()' is left unchanged since the object
      is small enough to stay in the cache.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      96eabe7a
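      The flag-plus-field design can be sketched with a minimal local mock. The struct below imitates only the relevant map-creation fields of union bpf_attr (the real definitions live in <linux/bpf.h>); BPF_F_NUMA_NODE matches the kernel UAPI value. The key point is why the flag exists at all: node 0 is a valid NUMA node, so a zero numa_node field alone cannot mean "no preference".

      ```c
      #include <stdint.h>
      #include <string.h>

      /* Matches the kernel UAPI flag value (BPF_F_NO_PREALLOC is bit 0,
       * BPF_F_NO_COMMON_LRU is bit 1). */
      #define BPF_F_NUMA_NODE (1U << 2)

      /* Local mock of the map-creation fields of union bpf_attr. */
      struct map_create_attr {
          uint32_t map_type;
          uint32_t key_size;
          uint32_t value_size;
          uint32_t max_entries;
          uint32_t map_flags;
          uint32_t numa_node;   /* honored iff BPF_F_NUMA_NODE is set */
      };

      /* Request allocation from a given node: set the field AND the
       * flag, since the field alone is ignored. */
      void map_attr_set_numa(struct map_create_attr *a, uint32_t node)
      {
          a->map_flags |= BPF_F_NUMA_NODE;
          a->numa_node = node;
      }
      ```

      A real caller would fill the genuine union bpf_attr the same way and pass it to the bpf(2) syscall with cmd BPF_MAP_CREATE.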
  10. 26 May 2017, 1 commit
    • D
      bpf: fix wrong exposure of map_flags into fdinfo for lpm · a316338c
      Committed by Daniel Borkmann
      trie_alloc() always requires BPF_F_NO_PREALLOC to be passed in
      via attr->map_flags, since it does not support preallocation
      yet. We check the flag, but we never copy it into
      trie->map.map_flags, which is later exposed via fdinfo and used
      by loaders such as iproute2. The latter uses it in
      bpf_map_selfcheck_pinned() to test whether a pinned map has the
      same spec as the one from the BPF object file and, if not,
      bails out; this currently happens for lpm since it always
      exposes 0 as its flags.
      
      Also copy over the flags in array_map_alloc() and
      stack_map_alloc(). They always have to be 0 right now, but we
      should make sure not to miss copying them over later, when
      actual flags are added for them to use.
      
      Fixes: b95a5c4d ("bpf: add a longest prefix match trie map implementation")
      Reported-by: Jarno Rajahalme <jarno@covalent.io>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      a316338c
  11. 12 Apr 2017, 1 commit
  12. 06 Mar 2017, 1 commit
  13. 18 Feb 2017, 1 commit
  14. 09 Feb 2017, 1 commit
    • D
      bpf, lpm: fix overflows in trie_alloc checks · c502faf9
      Committed by Daniel Borkmann
      Cap the maximum (total) value size and bail out if it is larger
      than KMALLOC_MAX_SIZE, as otherwise it makes no sense to
      proceed: we are guaranteed to fail the element allocation in
      lpm_trie_node_alloc() anyway. The likelihood of failure is
      still high for large values, though, similar to the htab case
      without preallocation.
      
      Next, make sure the cost variables are really u64 instead of
      size_t, so that we don't overflow on 32-bit and charge only a
      tiny map.pages against memlock while allowing a huge
      max_entries; also cap the max cost as we do for other map
      types.
      
      Fixes: b95a5c4d ("bpf: add a longest prefix match trie map implementation")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c502faf9
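      The shape of those two checks can be sketched as follows. This is an illustrative stand-in, not the kernel code: the function name and the exact caps are assumptions (KMALLOC_MAX_SIZE is arch-dependent; 4 MiB is a common value), but the widening-before-multiply is the essence of the fix.

      ```c
      #include <stdint.h>

      #define KMALLOC_MAX_SIZE_DEMO (1ULL << 22)  /* illustrative: 4 MiB */
      #define COST_CAP_DEMO         (1ULL << 32)  /* illustrative max cost */

      /* Returns 1 if the allocation plan is sane, 0 if it must be rejected. */
      int trie_cost_ok(uint32_t elem_size, uint32_t max_entries)
      {
          /* Bail out early if a single element could never be kmalloc'ed. */
          if (elem_size > KMALLOC_MAX_SIZE_DEMO)
              return 0;
          /* Widen to u64 BEFORE multiplying: with a 32-bit size_t, the
           * product would wrap, charging only a tiny amount against
           * memlock while permitting a huge max_entries. */
          uint64_t cost = (uint64_t)elem_size * max_entries;
          /* Cap the total cost, as other map types do. */
          if (cost >= COST_CAP_DEMO)
              return 0;
          return 1;
      }
      ```

      Note the cast on one operand is enough: C's usual arithmetic conversions then perform the whole multiplication in 64 bits.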
  15. 24 Jan 2017, 2 commits