1. 09 Sep 2017, 1 commit
  2. 24 Aug 2017, 1 commit
  3. 23 Aug 2017, 1 commit
  4. 21 Aug 2017, 1 commit
    •
      bpf: fix double free from dev_map_notification() · 274043c6
      Authored by Daniel Borkmann
      In the current code, dev_map_free() can still race with dev_map_notification().
      In dev_map_free(), we remove dtab from the list of dtabs after we have purged
      all entries from it. However, we don't xchg() the entries to NULL or the like,
      so each entry at that point still points to its device. If an unregister
      notification comes in at the same time, we therefore risk a double free,
      since the pointer is still present in the map and is then pushed again to
      __dev_map_entry_free().
      
      All this is completely unnecessary. Just remove the dtab from the list
      right before the synchronize_rcu(), so all outstanding readers from the
      notifier list have finished by then; thus we no longer need to deal with
      this corner case and also don't need to nullify dev entries. This is
      fine because we iterate over the map releasing all entries, and therefore
      all dev references, anyway.
      
      Fixes: 4cc7b954 ("bpf: devmap fix mutex in rcu critical section")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      274043c6
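The fixed teardown ordering can be sketched as a userspace toy model (the `toy_` names and the singly linked list are simplifications for illustration, not the kernel code):

```c
#include <stddef.h>

/* Toy model of the fixed teardown order in dev_map_free(): unlink the
 * dtab from the notifier's list BEFORE waiting for readers, so the
 * notifier can no longer find the map and each entry is freed once. */

struct toy_dtab { struct toy_dtab *next; int entries_freed; };

static struct toy_dtab *toy_dev_map_list;

static void toy_dev_map_free(struct toy_dtab *d)
{
    /* 1. Remove the dtab from the list first. */
    struct toy_dtab **pp = &toy_dev_map_list;
    while (*pp && *pp != d)
        pp = &(*pp)->next;
    if (*pp)
        *pp = d->next;

    /* 2. synchronize_rcu() would wait here until every notifier-side
     *    reader that might still see the old list has finished. */

    /* 3. Now release all entries; no notifier can race anymore, so
     *    there is no need to NULL out each dev pointer first. */
    d->entries_freed++;
}
```

Because step 1 happens before the grace period, step 3 is guaranteed to be the only path releasing the dev references.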
  5. 20 Aug 2017, 1 commit
    •
      bpf: Allow selecting numa node during map creation · 96eabe7a
      Authored by Martin KaFai Lau
      The current map creation API does not allow specifying a NUMA node
      preference.  The memory usually comes from the node where the
      map-creating process is running.  Performance is not ideal if the
      bpf_prog is known to always run on a NUMA node different from the
      map-creating process.
      
      One of the use cases is sharding per CPU across different LRU maps
      (i.e. an array of LRU maps).  Here is the result of map_perf_test on
      the INNER_LRU_HASH_PREALLOC test if we force the lru map used by
      CPU0 to be allocated from a remote numa node:
      
      [ The machine has 20 cores. CPU0-9 at node 0. CPU10-19 at node 1 ]
      
      ># taskset -c 10 ./map_perf_test 512 8 1260000 8000000
      5:inner_lru_hash_map_perf pre-alloc 1628380 events per sec
      4:inner_lru_hash_map_perf pre-alloc 1626396 events per sec
      3:inner_lru_hash_map_perf pre-alloc 1626144 events per sec
      6:inner_lru_hash_map_perf pre-alloc 1621657 events per sec
      2:inner_lru_hash_map_perf pre-alloc 1621534 events per sec
      1:inner_lru_hash_map_perf pre-alloc 1620292 events per sec
      7:inner_lru_hash_map_perf pre-alloc 1613305 events per sec
      0:inner_lru_hash_map_perf pre-alloc 1239150 events per sec  #<<<
      
      After specifying numa node:
      ># taskset -c 10 ./map_perf_test 512 8 1260000 8000000
      5:inner_lru_hash_map_perf pre-alloc 1629627 events per sec
      3:inner_lru_hash_map_perf pre-alloc 1628057 events per sec
      1:inner_lru_hash_map_perf pre-alloc 1623054 events per sec
      6:inner_lru_hash_map_perf pre-alloc 1616033 events per sec
      2:inner_lru_hash_map_perf pre-alloc 1614630 events per sec
      4:inner_lru_hash_map_perf pre-alloc 1612651 events per sec
      7:inner_lru_hash_map_perf pre-alloc 1609337 events per sec
      0:inner_lru_hash_map_perf pre-alloc 1619340 events per sec #<<<
      
      This patch adds one field, numa_node, to the bpf_attr.  Since numa node 0
      is a valid node, a new flag BPF_F_NUMA_NODE is also added.  The numa_node
      field is honored if and only if the BPF_F_NUMA_NODE flag is set.
      
      NUMA node selection is not supported for percpu maps.
      
      This patch does not change every kmalloc call.  For example,
      'htab = kzalloc()' is not changed, since the object
      is small enough to stay in the cache.
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      96eabe7a
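The new attribute can be exercised from userspace roughly as follows. This is a sketch assuming a uapi header that already carries the `numa_node` field and `BPF_F_NUMA_NODE` flag (Linux 4.14+); the helper name `create_map_on_node` is made up for illustration:

```c
#include <linux/bpf.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Sketch: create a hash map whose memory is preferentially allocated
 * on the given NUMA node.  numa_node is honored only when
 * BPF_F_NUMA_NODE is set, because node 0 is itself a valid node. */
static int create_map_on_node(unsigned int node)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.map_type    = BPF_MAP_TYPE_HASH;
    attr.key_size    = sizeof(int);
    attr.value_size  = sizeof(long);
    attr.max_entries = 512;
    attr.map_flags   = BPF_F_NUMA_NODE;   /* make numa_node meaningful */
    attr.numa_node   = node;

    /* Returns a map fd on success, -1 (with errno) on failure,
     * e.g. when unprivileged or on a pre-4.14 kernel. */
    return (int)syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
}
```

The map creation fails with EINVAL if the flag is set but the node is invalid, which is the point of gating the field behind the flag.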
  6. 17 Aug 2017, 1 commit
  7. 08 Aug 2017, 1 commit
    •
      bpf: devmap fix mutex in rcu critical section · 4cc7b954
      Authored by John Fastabend
      Originally we used a mutex to protect concurrent devmap update
      and delete operations from racing with netdev unregister notifier
      callbacks.
      
      The notifier hook is needed because we increment the netdev ref
      count when a dev is added to the devmap. This ensures the netdev
      reference is valid in the datapath. However, we don't want to block
      unregister events, hence the initial mutex and notifier handler.
      
      The concern was that in the notifier hook we search the map for dev
      entries that hold a refcnt on the net device being torn down. But
      in order to do this we require two steps,
      
        (i) dereference the netdev:  dev = rcu_dereference(map[i])
       (ii) test ifindex:   dev->ifindex == removing_ifindex
      
      and then finally we can swap in the NULL dev in the map via an
      xchg operation,
      
        xchg(map[i], NULL)
      
      The danger here is a concurrent update could run a different
      xchg op concurrently leading us to replace the new dev with a
      NULL dev incorrectly.
      
            CPU 1                        CPU 2
      
         notifier hook                   bpf devmap update
      
         dev = rcu_dereference(map[i])
                                         dev = rcu_dereference(map[i])
                                         xchg(map[i], new_dev)
                                         call_rcu(dev, ...)
         xchg(map[i], NULL)
      
      The above flow would create the incorrect state with the dev
      reference in the update path being lost. To resolve this the
      original code used a mutex around the above block. However,
      updates, deletes, and lookups occur inside rcu critical sections
      so we can't use a mutex in this context safely.
      
      Fortunately, by writing slightly better code we can avoid the
      mutex altogether. If CPU 1 in the above example uses a cmpxchg
      and _only_ replaces the dev reference in the map when it is in
      fact the expected dev the race is removed completely. The two
      cases being illustrated here, first the race condition,
      
            CPU 1                          CPU 2
      
         notifier hook                     bpf devmap update
      
         dev = rcu_dereference(map[i])
                                           dev = rcu_dereference(map[i])
                                           xchg(map[i], new_dev)
                                           call_rcu(dev, ...)
         odev = cmpxchg(map[i], dev, NULL)
      
      Now we can test the cmpxchg return value, detect odev != dev and
      abort. Or in the good case,
      
            CPU 1                          CPU 2
      
         notifier hook                     bpf devmap update
         dev = rcu_dereference(map[i])
         odev = cmpxchg(map[i], dev, NULL)
                                           [...]
      
      Now 'odev == dev' and we can do proper cleanup.
      
      And voilà, the original race we tried to solve with a mutex is
      corrected, and the trace noted by Sasha below is resolved by the
      removal of the mutex.
      
      Note: When walking the devmap and removing dev references as needed
      we depend on the core to fail any calls to dev_get_by_index() using
      the ifindex of the device being removed. This way we do not race with
      the user while searching the devmap.
      
      Additionally, the mutex was also protecting list add/del/read on
      the list of maps in-use. This patch converts this to an RCU list
      and spinlock implementation. This protects the list from concurrent
      alloc/free operations. The notifier hook walks this list so it uses
      RCU read semantics.
      
      BUG: sleeping function called from invalid context at kernel/locking/mutex.c:747
      in_atomic(): 1, irqs_disabled(): 0, pid: 16315, name: syz-executor1
      1 lock held by syz-executor1/16315:
       #0:  (rcu_read_lock){......}, at: [<ffffffff8c363bc2>] map_delete_elem kernel/bpf/syscall.c:577 [inline]
       #0:  (rcu_read_lock){......}, at: [<ffffffff8c363bc2>] SYSC_bpf kernel/bpf/syscall.c:1427 [inline]
       #0:  (rcu_read_lock){......}, at: [<ffffffff8c363bc2>] SyS_bpf+0x1d32/0x4ba0 kernel/bpf/syscall.c:1388
      
      Fixes: 2ddf71e2 ("net: add notifier hooks for devmap bpf map")
      Reported-by: Sasha Levin <alexander.levin@verizon.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4cc7b954
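The cmpxchg pattern above can be modeled with plain C11 atomics. This is a userspace toy, not the kernel code: the slot stands in for `map[i]`, and the `int` objects stand in for net_device pointers:

```c
#include <stdatomic.h>
#include <stddef.h>

/* Toy C11 model of the fix: the notifier clears a devmap slot only if
 * the slot still holds the dev it originally read (cmpxchg), instead
 * of a blind xchg that could wipe out a concurrently installed dev. */

/* Returns 1 if the slot held `dev` and was cleared; 0 if a concurrent
 * update changed the slot first, in which case we abort and must not
 * release the reference. */
static int notifier_try_remove(_Atomic(void *) *slot, void *dev)
{
    void *expected = dev;   /* odev = cmpxchg(map[i], dev, NULL) */
    return atomic_compare_exchange_strong(slot, &expected, NULL);
}
```

In the race from the commit message, the notifier's expected pointer no longer matches after the update path's xchg, so the compare-exchange fails and the new dev reference survives.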
  8. 25 Jul 2017, 1 commit
  9. 18 Jul 2017, 4 commits