1. 01 Oct 2017, 1 commit
  2. 29 Sep 2017, 4 commits
  3. 27 Sep 2017, 2 commits
    • bpf: add meta pointer for direct access · de8f3a83
      Committed by Daniel Borkmann
      This work enables generic transfer of metadata from XDP into skb. The
      basic idea is that we can make use of the fact that the resulting skb
      must be linear and already comes with a larger headroom for supporting
      bpf_xdp_adjust_head(), which mangles xdp->data. Here, we base our work
      on a similar principle and introduce a small helper bpf_xdp_adjust_meta()
      for adjusting a new pointer called xdp->data_meta. Thus, the packet has
      a flexible and programmable room for meta data, followed by the actual
      packet data. struct xdp_buff is therefore laid out such that we first
      point to data_hard_start, then data_meta directly prepended to data,
      followed by data_end marking the end of the packet. bpf_xdp_adjust_head()
      takes into account whether we have meta data already prepended and, if
      so, memmove()s it along with the given offset, provided there's enough room.
      
      xdp->data_meta is optional and programs are not required to use it. The
      rationale is that when we process the packet in XDP (e.g. as DoS filter),
      we can push further meta data along with it for the XDP_PASS case, and
      give the guarantee that a clsact ingress BPF program on the same device
      can pick this up for further post-processing. Since we work with skb
      there, we can also set skb->mark, skb->priority or other skb meta data
      out of BPF, thus having this scratch space generic and programmable
      allows for more flexibility than defining a direct 1:1 transfer of
      potentially new XDP members into skb (it's also more efficient as we
      don't need to initialize/handle each of such new members). The facility
      also works together with GRO aggregation. The scratch space at the head
      of the packet can be a multiple of 4 bytes, up to 32 bytes in size.
      Drivers not yet supporting xdp->data_meta can simply be set up with
      xdp->data_meta equal to xdp->data + 1; bpf_xdp_adjust_meta() will detect
      this and bail out, such that the subsequent match against xdp->data for
      later access is guaranteed to fail.
      
      The verifier treats xdp->data_meta/xdp->data the same way as we treat
      xdp->data/xdp->data_end pointer comparisons. The requirement for doing
      the compare against xdp->data is that it hasn't been modified from its
      original address we got from ctx access. It may have a range marking
      already from prior successful xdp->data/xdp->data_end pointer comparisons
      though.
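      
      For illustration, a minimal XDP program using the new facility might look
      like the sketch below (not part of the patch; it assumes a samples/bpf-style
      bpf_helpers.h providing SEC() and declaring bpf_xdp_adjust_meta()):
      
      #include <linux/bpf.h>
      #include "bpf_helpers.h"  /* assumed: SEC() and bpf_xdp_adjust_meta() */
      
      SEC("xdp")
      int xdp_store_meta(struct xdp_md *ctx)
      {
              __u32 *meta;
              void *data;
      
              /* grow the meta data area by 4 bytes in front of the packet */
              if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(*meta)))
                      return XDP_PASS;
      
              meta = (void *)(long)ctx->data_meta;
              data = (void *)(long)ctx->data;
              /* bounds check against xdp->data as required by the verifier */
              if ((void *)(meta + 1) > data)
                      return XDP_PASS;
      
              *meta = 0xcafe; /* scratch value for a clsact ingress prog to read */
              return XDP_PASS;
      }
      
      char _license[] SEC("license") = "GPL";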
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      de8f3a83
    • bpf: rename bpf_compute_data_end into bpf_compute_data_pointers · 6aaae2b6
      Committed by Daniel Borkmann
      Just do the rename into bpf_compute_data_pointers() as we'll add
      one more pointer here to recompute.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6aaae2b6
  4. 26 Sep 2017, 1 commit
  5. 20 Sep 2017, 3 commits
    • bpf: fix ri->map_owner pointer on bpf_prog_realloc · 7c300131
      Committed by Daniel Borkmann
      Commit 109980b8 ("bpf: don't select potentially stale
      ri->map from buggy xdp progs") passed the pointer to the prog
      itself to be loaded into r4 prior to the bpf_redirect_map() helper
      call, so that we can store the owner into ri->map_owner out of
      the helper.
      
      The issue with that is that the actual address of the prog is still
      subject to change when subsequent rewrites occur that require
      slow path in bpf_prog_realloc() to alloc more memory, e.g. from
      patching inlining helper functions or constant blinding. Thus,
      we really need to take prog->aux as the address we're holding,
      which also works with prog clones as they share the same aux
      object.
      
      Instead of then fetching aux->prog during runtime, which could
      potentially incur cache misses due to false sharing, we are
      going to just use aux for comparison on the map owner. This
      will also keep the patchlet at the same size, and the later check
      in xdp_map_invalid() only accesses the read-only aux pointer from
      the prog; it's also in the same cacheline already from prior
      access when calling bpf_func.
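      
      A hedged sketch of the ownership check described above (names follow the
      commit text, not necessarily the final code); the owner is recorded as the
      prog's aux pointer, which stays stable across bpf_prog_realloc():
      
      static bool xdp_map_invalid(const struct bpf_prog *xdp_prog,
                                  unsigned long aux)
      {
              /* aux was stored into ri->map_owner by the rewritten helper call */
              return (unsigned long)xdp_prog->aux != aux;
      }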
      
      Fixes: 109980b8 ("bpf: don't select potentially stale ri->map from buggy xdp progs")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7c300131
    • bpf: do not disable/enable BH in bpf_map_free_id() · 930651a7
      Committed by Eric Dumazet
      syzkaller reported the following splat [1]
      
      Since hard irq are disabled by the caller, bpf_map_free_id()
      should not try to enable/disable BH.
      
      Another solution would be to change htab_map_delete_elem() to
      defer the free_htab_elem() call until after
      raw_spin_unlock_irqrestore(&b->lock, flags), but this might not be
      enough to cover other code paths.
      
      [1]
      WARNING: CPU: 1 PID: 8052 at kernel/softirq.c:161 __local_bh_enable_ip
      +0x1e/0x160 kernel/softirq.c:161
      Kernel panic - not syncing: panic_on_warn set ...
      
      CPU: 1 PID: 8052 Comm: syz-executor1 Not tainted 4.13.0-next-20170915+
      #23
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
      Google 01/01/2011
      Call Trace:
       __dump_stack lib/dump_stack.c:16 [inline]
       dump_stack+0x194/0x257 lib/dump_stack.c:52
       panic+0x1e4/0x417 kernel/panic.c:181
       __warn+0x1c4/0x1d9 kernel/panic.c:542
       report_bug+0x211/0x2d0 lib/bug.c:183
       fixup_bug+0x40/0x90 arch/x86/kernel/traps.c:178
       do_trap_no_signal arch/x86/kernel/traps.c:212 [inline]
       do_trap+0x260/0x390 arch/x86/kernel/traps.c:261
       do_error_trap+0x120/0x390 arch/x86/kernel/traps.c:298
       do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:311
       invalid_op+0x18/0x20 arch/x86/entry/entry_64.S:905
      RIP: 0010:__local_bh_enable_ip+0x1e/0x160 kernel/softirq.c:161
      RSP: 0018:ffff8801cdcd7748 EFLAGS: 00010046
      RAX: 0000000000000082 RBX: 0000000000000201 RCX: 0000000000000000
      RDX: 1ffffffff0b5933c RSI: 0000000000000201 RDI: ffffffff85ac99e0
      RBP: ffff8801cdcd7758 R08: ffffffff85b87158 R09: 1ffff10039b9aec6
      R10: ffff8801c99f24c0 R11: 0000000000000002 R12: ffffffff817b0b47
      R13: dffffc0000000000 R14: ffff8801cdcd77e8 R15: 0000000000000001
       __raw_spin_unlock_bh include/linux/spinlock_api_smp.h:176 [inline]
       _raw_spin_unlock_bh+0x30/0x40 kernel/locking/spinlock.c:207
       spin_unlock_bh include/linux/spinlock.h:361 [inline]
       bpf_map_free_id kernel/bpf/syscall.c:197 [inline]
       __bpf_map_put+0x267/0x320 kernel/bpf/syscall.c:227
       bpf_map_put+0x1a/0x20 kernel/bpf/syscall.c:235
       bpf_map_fd_put_ptr+0x15/0x20 kernel/bpf/map_in_map.c:96
       free_htab_elem+0xc3/0x1b0 kernel/bpf/hashtab.c:658
       htab_map_delete_elem+0x74d/0x970 kernel/bpf/hashtab.c:1063
       map_delete_elem kernel/bpf/syscall.c:633 [inline]
       SYSC_bpf kernel/bpf/syscall.c:1479 [inline]
       SyS_bpf+0x2188/0x46a0 kernel/bpf/syscall.c:1451
       entry_SYSCALL_64_fastpath+0x1f/0xbe
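      
      A simplified sketch of the fix direction (map_idr/map_idr_lock as
      introduced by the bpf_map ID patch; not necessarily the exact final code):
      take the idr lock with the irqsave variants so the function is safe
      regardless of the caller's interrupt state.
      
      static void bpf_map_free_id(struct bpf_map *map)
      {
              unsigned long flags;
      
              if (!map->id)
                      return;
      
              /* irqsave instead of the BH variants: callers may already have
               * hard irqs disabled, as in the splat above.
               */
              spin_lock_irqsave(&map_idr_lock, flags);
              idr_remove(&map_idr, map->id);
              map->id = 0;
              spin_unlock_irqrestore(&map_idr_lock, flags);
      }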
      
      Fixes: f3f1c054 ("bpf: Introduce bpf_map ID")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      930651a7
    • bpf: Implement map_delete_elem for BPF_MAP_TYPE_LPM_TRIE · e454cf59
      Committed by Craig Gallek
      This is a simple non-recursive delete operation.  It prunes paths
      of empty nodes in the tree, but it does not try to further compress
      the tree as nodes are removed.
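      
      A hypothetical userspace usage sketch (helper name and key layout are
      illustrative; the key mirrors struct bpf_lpm_trie_key with 4 data bytes
      for an IPv4 prefix):
      
      #include <linux/bpf.h>
      #include <linux/types.h>
      #include <string.h>
      #include <sys/syscall.h>
      #include <unistd.h>
      
      static int lpm_delete_ipv4(int map_fd, __u32 prefixlen, const void *addr)
      {
              struct {
                      __u32 prefixlen;
                      __u8  data[4];
              } key = { .prefixlen = prefixlen };
              union bpf_attr attr = {};
      
              memcpy(key.data, addr, sizeof(key.data));
              attr.map_fd = map_fd;
              attr.key = (__u64)(unsigned long)&key;
      
              /* now supported for BPF_MAP_TYPE_LPM_TRIE by this patch */
              return syscall(__NR_bpf, BPF_MAP_DELETE_ELEM, &attr, sizeof(attr));
      }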
      Signed-off-by: Craig Gallek <kraig@google.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      e454cf59
  6. 19 Sep 2017, 1 commit
  7. 16 Sep 2017, 1 commit
  8. 09 Sep 2017, 3 commits
    • bpf: devmap, use cond_resched instead of cpu_relax · 374fb014
      Committed by John Fastabend
      Be a bit more friendly about waiting for flush bits to complete.
      Replace the cpu_relax() with a cond_resched().
      Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      374fb014
    • bpf: add support for sockmap detach programs · 5a67da2a
      Committed by John Fastabend
      The bpf map sockmap supports adding programs via attach commands. This
      patch adds the detach command to keep the API symmetric and allow
      users to remove previously added programs. Otherwise the user would
      have to delete the map and re-add it to get in this state.
      
      This also adds a series of additional tests to capture detach operation
      and also attaching/detaching invalid prog types.
      
      API note: socks will run (or not run) programs depending on the state
      of the map at the time the sock is added. We do not for example walk
      the map and remove programs from previously attached socks.
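      
      A hypothetical userspace usage sketch of the new detach command
      (function name is illustrative):
      
      #include <linux/bpf.h>
      #include <sys/syscall.h>
      #include <unistd.h>
      
      static int sockmap_detach_verdict(int map_fd)
      {
              union bpf_attr attr = {};
      
              attr.target_fd   = map_fd;
              attr.attach_type = BPF_SK_SKB_STREAM_VERDICT;
      
              /* removes the previously attached verdict prog without
               * having to destroy and recreate the sockmap
               */
              return syscall(__NR_bpf, BPF_PROG_DETACH, &attr, sizeof(attr));
      }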
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5a67da2a
    • bpf: don't select potentially stale ri->map from buggy xdp progs · 109980b8
      Committed by Daniel Borkmann
      We can potentially run into a couple of issues with the XDP
      bpf_redirect_map() helper. The ri->map in the per CPU storage
      can become stale in several ways, mostly due to misuse, where
      we can then trigger a use after free on the map:
      
      i) prog A is calling bpf_redirect_map(), returning XDP_REDIRECT
      and running on a driver not supporting XDP_REDIRECT yet. The
      ri->map on that CPU becomes stale when the XDP program is unloaded
      on the driver, and a prog B is loaded on a different driver which
      supports the XDP_REDIRECT return code. prog B would only have to omit
      calling bpf_redirect_map() and just return XDP_REDIRECT, which
      would then access the freed map in xdp_do_redirect() since it was not
      cleared for that CPU.
      
      ii) prog A is calling bpf_redirect_map(), returning a code other
      than XDP_REDIRECT. prog A is then detached, which triggers release
      of the map. prog B is attached which, similarly to i), would
      just return XDP_REDIRECT without having called bpf_redirect_map(),
      and thus accesses the freed map in xdp_do_redirect() since it was
      not cleared for that CPU.
      
      iii) prog A is attached to generic XDP, calling the bpf_redirect_map()
      helper and returning XDP_REDIRECT. xdp_do_generic_redirect() is
      currently not handling ri->map (will be fixed by Jesper), so it's
      not being reset. Later loading e.g. a native prog B which would,
      say, call bpf_xdp_redirect() and then return XDP_REDIRECT would
      find in xdp_do_redirect() that a map was set and use it, causing
      a use after free on map access.
      
      The fix thus needs to avoid accessing stale ri->map pointers; a naive
      way would be to call a BPF function from drivers that just resets
      it to NULL for all XDP return codes but XDP_REDIRECT, including
      XDP_REDIRECT for drivers not supporting it yet (and let ri->map
      be handled in xdp_do_generic_redirect()). There is a less
      intrusive way w/o letting drivers call a reset for each BPF run.
      
      The verifier knows we're calling into the bpf_xdp_redirect_map()
      helper, so it can do a small insn rewrite, transparent to the prog
      itself, in the sense that it fills R4 with a pointer to the prog's
      own bpf_prog. We have that pointer at verification time anyway, and
      R4 is allowed to be used since, per calling convention, we scratch
      R0 to R5 anyway, so they become inaccessible and the program cannot
      read them prior to a write. Then, the helper would store the prog
      pointer in the current CPU's struct redirect_info. Later in
      xdp_do_*_redirect() we check whether the redirect_info's prog
      pointer is the same as passed xdp_prog pointer, and if that's
      the case then all good, since the prog holds a ref on the map
      anyway, so it is always valid at that point in time and must
      have a reference count of at least 1. If in the unlikely case
      they are not equal, it means we got a stale pointer, so we clear
      and bail out right there. Also reset the map and the owning prog
      in bpf_xdp_redirect(), so that bpf_xdp_redirect_map() and
      bpf_xdp_redirect() won't get mixed up; only the last call should
      take precedence. A tc bpf_redirect() doesn't use the map anywhere
      yet, so there is no need to clear it there since it is never accessed
      in that layer.
      
      Note that in case the prog is released, and thus the map as
      well, we're still under an RCU read critical section at that time
      and have preemption disabled as well. Once we commit with the
      __dev_map_insert_ctx() from xdp_do_redirect_map() and set the
      map to ri->map_to_flush, we still wait for a xdp_do_flush_map()
      to finish in devmap dismantle time once flush_needed bit is set,
      so that is fine.
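      
      A simplified sketch of the ownership check described above (the function
      name here is illustrative): the rewritten helper call records the owning
      prog, and the redirect path only trusts ri->map when the currently
      running prog is that owner.
      
      static bool xdp_map_stale(const struct redirect_info *ri,
                                const struct bpf_prog *xdp_prog)
      {
              /* map_owner was filled from R4 by the rewritten helper call;
               * a mismatch means the map pointer was left by another prog.
               */
              return ri->map && ri->map_owner != xdp_prog;
      }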
      
      Fixes: 97f91a7c ("bpf: add bpf_redirect_map helper routine")
      Reported-by: Jesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      109980b8
  9. 06 Sep 2017, 1 commit
    • bpf: fix numa_node validation · 96e5ae4e
      Committed by Eric Dumazet
      syzkaller reported crashes in bpf map creation or map update [1]
      
      Problem is that nr_node_ids is a signed integer,
      NUMA_NO_NODE is also an integer, so it is very tempting
      to declare numa_node as a signed integer.
      
      This means the typical test to validate a user provided value :
      
              if (numa_node != NUMA_NO_NODE &&
                  (numa_node >= nr_node_ids ||
                   !node_online(numa_node)))
      
      must be written :
      
              if (numa_node != NUMA_NO_NODE &&
                  ((unsigned int)numa_node >= nr_node_ids ||
                   !node_online(numa_node)))
      
      [1]
      kernel BUG at mm/slab.c:3256!
      invalid opcode: 0000 [#1] SMP KASAN
      Dumping ftrace buffer:
         (ftrace buffer empty)
      Modules linked in:
      CPU: 0 PID: 2946 Comm: syzkaller916108 Not tainted 4.13.0-rc7+ #35
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
      task: ffff8801d2bc60c0 task.stack: ffff8801c0c90000
      RIP: 0010:____cache_alloc_node+0x1d4/0x1e0 mm/slab.c:3292
      RSP: 0018:ffff8801c0c97638 EFLAGS: 00010096
      RAX: ffffffffffff8b7b RBX: 0000000001080220 RCX: 0000000000000000
      RDX: 00000000ffff8b7b RSI: 0000000001080220 RDI: ffff8801dac00040
      RBP: ffff8801c0c976c0 R08: 0000000000000000 R09: 0000000000000000
      R10: ffff8801c0c97620 R11: 0000000000000001 R12: ffff8801dac00040
      R13: ffff8801dac00040 R14: 0000000000000000 R15: 00000000ffff8b7b
      FS:  0000000002119940(0000) GS:ffff8801db200000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000020001fec CR3: 00000001d2980000 CR4: 00000000001406f0
      Call Trace:
       __do_kmalloc_node mm/slab.c:3688 [inline]
       __kmalloc_node+0x33/0x70 mm/slab.c:3696
       kmalloc_node include/linux/slab.h:535 [inline]
       alloc_htab_elem+0x2a8/0x480 kernel/bpf/hashtab.c:740
       htab_map_update_elem+0x740/0xb80 kernel/bpf/hashtab.c:820
       map_update_elem kernel/bpf/syscall.c:587 [inline]
       SYSC_bpf kernel/bpf/syscall.c:1468 [inline]
       SyS_bpf+0x20c5/0x4c40 kernel/bpf/syscall.c:1443
       entry_SYSCALL_64_fastpath+0x1f/0xbe
      RIP: 0033:0x440409
      RSP: 002b:00007ffd1f1792b8 EFLAGS: 00000246 ORIG_RAX: 0000000000000141
      RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 0000000000440409
      RDX: 0000000000000020 RSI: 0000000020006000 RDI: 0000000000000002
      RBP: 0000000000000086 R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000401d70
      R13: 0000000000401e00 R14: 0000000000000000 R15: 0000000000000000
      Code: 83 c2 01 89 50 18 4c 03 70 08 e8 38 f4 ff ff 4d 85 f6 0f 85 3e ff ff ff 44 89 fe 4c 89 ef e8 94 fb ff ff 49 89 c6 e9 2b ff ff ff <0f> 0b 0f 0b 0f 0b 66 0f 1f 44 00 00 55 48 89 e5 41 57 41 56 41
      RIP: ____cache_alloc_node+0x1d4/0x1e0 mm/slab.c:3292 RSP: ffff8801c0c97638
      ---[ end trace d745f355da2e33ce ]---
      Kernel panic - not syncing: Fatal exception
      
      Fixes: 96eabe7a ("bpf: Allow selecting numa node during map creation")
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Martin KaFai Lau <kafai@fb.com>
      Cc: Alexei Starovoitov <ast@fb.com>
      Cc: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      96e5ae4e
  10. 02 Sep 2017, 3 commits
    • bpf: sockmap update/simplify memory accounting scheme · 90a9631c
      Committed by John Fastabend
      Instead of tracking wmem_queued and sk_mem_charge by incrementing
      in the verdict SK_REDIRECT paths and decrementing in the tx work
      path, use the skb_set_owner_w and sock_writeable helpers. This solves
      a few issues with the current code. First, in SK_REDIRECT the increments
      on sk_wmem_queued and sk_mem_charge were being done without the peer's
      sock lock being held. Under stress this can result in accounting
      errors when tx work and/or multiple verdict decisions are working
      on the peer psock.
      
      Additionally, this cleans up the code because we can rely on the
      default destructor to decrement memory accounting on kfree_skb. Also,
      this will trigger sk_write_space when space becomes available on
      kfree_skb(), which wasn't happening before, and prevents __sk_free
      from being called until all in-flight packets are completed.
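      
      A hedged sketch of the scheme described above (function name is
      illustrative, not the exact patch): the skb is charged to the peer
      socket so the default destructor undoes the accounting on kfree_skb().
      
      static int smap_charge_to_peer(struct sk_buff *skb, struct sock *peer)
      {
              if (!sock_writeable(peer))
                      return -EAGAIN;      /* no write space: let the caller drop */
      
              skb_set_owner_w(skb, peer);  /* destructor uncharges on kfree_skb() */
              return 0;
      }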
      
      Fixes: 174a79ff ("bpf: sockmap with sk redirect support")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      90a9631c
    • bpf: Only set node->ref = 1 if it has not been set · bb9b9f88
      Committed by Martin KaFai Lau
      This patch writes 'node->ref = 1' only if node->ref is 0.
      The number of lookups/s for a ~1M entries LRU map increased by
      ~30% (260097 to 343313).
      
      Other writes of 'node->ref = 0' are not changed.  In those cases, the
      same cache line has to be changed anyway.
      
      First column: Size of the LRU hash
      Second column: Number of lookups/s
      
      Before:
      > echo "$((2**20+1)): $(./map_perf_test 1024 1 $((2**20+1)) 10000000 | awk '{print $3}')"
      1048577: 260097
      
      After:
      > echo "$((2**20+1)): $(./map_perf_test 1024 1 $((2**20+1)) 10000000 | awk '{print $3}')"
      1048577: 343313
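      
      A sketch of the change described above (shape per the commit text):
      avoid dirtying the cache line when the ref bit is already set, which
      is the common case for hot LRU entries.
      
      static inline void bpf_lru_node_set_ref(struct bpf_lru_node *node)
      {
              /* ref is reset elsewhere; only write when it is still 0 */
              if (!node->ref)
                      node->ref = 1;
      }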
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      bb9b9f88
    • bpf: Inline LRU map lookup · cc555421
      Committed by Martin KaFai Lau
      Inline the lru map lookup to save the cost of making calls to
      bpf_map_lookup_elem() and htab_lru_map_lookup_elem().
      
      Different LRU hash sizes are tested.  The benefit diminishes when
      cache misses start to dominate in the bigger LRU hashes.
      Considering the change is simple, it is still worth optimizing.
      
      First column: Size of the LRU hash
      Second column: Number of lookups/s
      
      Before:
      > for i in $(seq 9 20); do echo "$((2**i+1)): $(./map_perf_test 1024 1 $((2**i+1)) 10000000 | awk '{print $3}')"; done
      513: 1132020
      1025: 1056826
      2049: 1007024
      4097: 853298
      8193: 742723
      16385: 712600
      32769: 688142
      65537: 677028
      131073: 619437
      262145: 498770
      524289: 316695
      1048577: 260038
      
      After:
      > for i in $(seq 9 20); do echo "$((2**i+1)): $(./map_perf_test 1024 1 $((2**i+1)) 10000000 | awk '{print $3}')"; done
      513: 1221851
      1025: 1144695
      2049: 1049902
      4097: 884460
      8193: 773731
      16385: 729673
      32769: 721989
      65537: 715530
      131073: 671665
      262145: 516987
      524289: 321125
      1048577: 260048
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cc555421
  11. 29 Aug 2017, 6 commits
    • bpf: fix oops on allocation failure · f740c34e
      Committed by Dan Carpenter
      "err" is set to zero if bpf_map_area_alloc() fails so it means we return
      ERR_PTR(0) which is NULL.  The caller, find_and_alloc_map(), is not
      expecting NULL returns and will oops.
      
      Fixes: 174a79ff ("bpf: sockmap with sk redirect support")
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f740c34e
    • bpf: sockmap indicate sock events to listeners · 78aeaaef
      Committed by John Fastabend
      After userspace pushes sockets into a sockmap it may not be receiving
      data (assuming stream_{parser|verdict} programs are attached). But, it
      may still want to manage the socks. A common pattern is to poll/select
      for a POLLRDHUP event so we can close the sock.
      
      This patch adds the logic to wake up these listeners.
      
      Also add TCP_SYN_SENT to the list of events to handle. We don't want
      to break the connection just because we happen to be in this state.
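      
      A hypothetical sketch of the userspace pattern mentioned above
      (function name is illustrative): the manager process polls for
      POLLRDHUP on socks it has pushed into the map and closes them when
      the peer hangs up.
      
      #define _GNU_SOURCE   /* for POLLRDHUP */
      #include <poll.h>
      
      static int wait_for_peer_close(int sock_fd, int timeout_ms)
      {
              struct pollfd pfd = {
                      .fd     = sock_fd,
                      .events = POLLRDHUP,
              };
      
              /* returns > 0 once the peer half-closes or hangs up */
              return poll(&pfd, 1, timeout_ms);
      }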
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      78aeaaef
    • bpf: harden sockmap program attach to ensure correct map type · 81374aaa
      Committed by John Fastabend
      When attaching a program to a sockmap we need to check that the map
      type is correct.
      
      Fixes: 174a79ff ("bpf: sockmap with sk redirect support")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      81374aaa
    • bpf: sockmap add missing rcu_read_(un)lock in smap_data_ready · d26e597d
      Committed by John Fastabend
      References to psock must be done inside RCU critical section.
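      
      A simplified sketch of the fix shape (details such as the callback
      locking are omitted): the psock lookup and use happen under
      rcu_read_lock().
      
      static void smap_data_ready(struct sock *sk)
      {
              struct smap_psock *psock;
      
              rcu_read_lock();
              psock = smap_psock_sk(sk);
              if (likely(psock))
                      strp_data_ready(&psock->strp);
              rcu_read_unlock();
      }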
      
      Fixes: 174a79ff ("bpf: sockmap with sk redirect support")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      d26e597d
    • bpf: sockmap, remove STRPARSER map_flags and add multi-map support · 2f857d04
      Committed by John Fastabend
      The addition of the BPF_SOCKMAP_STRPARSER map flag was to handle a
      specific use case where we want to have the BPF parse program disabled
      on an entry in a sockmap.
      
      However, Alexei found the API a bit cumbersome and I agreed. Let's
      remove the STRPARSER flag and support the use case by allowing socks
      to be in multiple maps. This allows users to create two maps one with
      programs attached and one without. When socks are added to maps they
      now inherit any programs attached to the map. This is a nice
      generalization and IMO improves the API.
      
      The API rules are less ambiguous and do not need a flag:
      
        - When a sock is added to a sockmap we have two cases,
      
           i. The sock map does not have any attached programs so
              we can add sock to map without inheriting bpf programs.
              The sock may exist in 0 or more other maps.
      
          ii. The sock map has an attached BPF program. To avoid duplicate
              bpf programs we only add the sock entry if it does not have
              an existing strparser/verdict attached, returning -EBUSY if
              a program is already attached. Otherwise attach the program
              and inherit strparser/verdict programs from the sock map.
      
      This allows for socks to be in a multiple maps for redirects and
      inherit a BPF program from a single map.
      
      Also this patch simplifies the logic around BPF_{EXIST|NOEXIST|ANY}
      flags. In the original patch I tried to be extra clever and only
      update map entries when necessary. Now I've decided the complexity
      is not worth it. If users constantly update an entry with the same
      sock for no reason (i.e. update an entry without actually changing
      any parameters on map or sock) we still do an alloc/release. Using
      this and allowing multiple entries of a sock to exist in a map the
      logic becomes much simpler.
      
      Note: Now that multiple maps are supported the "maps" pointer called
      when a socket is closed becomes a list of maps to remove the sock from.
      To keep the map up to date when a sock is added to the sockmap we must
      add the map/elem in the list. Likewise when it is removed we must
      remove it from the list. This results in searching the per psock list
      on delete operation. On TCP_CLOSE events we walk the list and remove
      the psock from all map/entry locations. I don't see any perf
      implications in this because at most I have a psock in two maps. If
      a psock were to be in many maps, it's possible this might be noticeable
      on delete, but I can't think of a reason to dup a psock in many maps.
      The sk_callback_lock is used to protect read/writes to the list. This
      was convenient because in all locations we were taking the lock
      anyways just after working on the list. Also the lock is per sock so
      in normal cases we shouldn't see any contention.
      Suggested-by: Alexei Starovoitov <ast@kernel.org>
      Fixes: 174a79ff ("bpf: sockmap with sk redirect support")
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      2f857d04
    • bpf: convert sockmap field attach_bpf_fd2 to type · 464bc0fd
      Committed by John Fastabend
      In the initial sockmap API we provided strparser and verdict programs
      using a single attach command by extending the attach API with the
      attach_bpf_fd2 field.
      
      However, if we add other programs in the future we will be adding a
      field for every new possible type, attach_bpf_fd(3,4,..). This
      seems a bit clumsy for an API. So let's push the programs using two
      new type fields.
      
         BPF_SK_SKB_STREAM_PARSER
         BPF_SK_SKB_STREAM_VERDICT
      
      This has the advantage of having a readable name and can easily be
      extended in the future.
      
      Updates to samples and sockmap included here also generalize tests
      slightly to support upcoming patch for multiple map support.
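      
      A hypothetical userspace usage sketch of the two new attach types
      (function name is illustrative):
      
      #include <linux/bpf.h>
      #include <sys/syscall.h>
      #include <unistd.h>
      
      static int sockmap_attach_progs(int map_fd, int parser_fd, int verdict_fd)
      {
              union bpf_attr attr = {};
              int err;
      
              attr.target_fd     = map_fd;
              attr.attach_bpf_fd = parser_fd;
              attr.attach_type   = BPF_SK_SKB_STREAM_PARSER;
              err = syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
              if (err)
                      return err;
      
              attr.attach_bpf_fd = verdict_fd;
              attr.attach_type   = BPF_SK_SKB_STREAM_VERDICT;
              return syscall(__NR_bpf, BPF_PROG_ATTACH, &attr, sizeof(attr));
      }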
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Fixes: 174a79ff ("bpf: sockmap with sk redirect support")
      Suggested-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      464bc0fd
  12. 25 Aug 2017, 1 commit
    • strparser: initialize all callbacks · 3fd87127
      Committed by Eric Biggers
      commit bbb03029 ("strparser: Generalize strparser") added more
      function pointers to 'struct strp_callbacks'; however, kcm_attach() was
      not updated to initialize them.  This could cause the ->lock() and/or
      ->unlock() function pointers to be set to garbage values, causing a
      crash in strp_work().
      
      Fix the bug by moving the callback structs into static memory, so
      unspecified members are zeroed.  Also constify them while we're at it.
      
      This bug was found by syzkaller, which encountered the following splat:
      
          IP: 0x55
          PGD 3b1ca067
          P4D 3b1ca067
          PUD 3b12f067
          PMD 0
      
          Oops: 0010 [#1] SMP KASAN
          Dumping ftrace buffer:
             (ftrace buffer empty)
          Modules linked in:
          CPU: 2 PID: 1194 Comm: kworker/u8:1 Not tainted 4.13.0-rc4-next-20170811 #2
          Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
          Workqueue: kstrp strp_work
          task: ffff88006bb0e480 task.stack: ffff88006bb10000
          RIP: 0010:0x55
          RSP: 0018:ffff88006bb17540 EFLAGS: 00010246
          RAX: dffffc0000000000 RBX: ffff88006ce4bd60 RCX: 0000000000000000
          RDX: 1ffff1000d9c97bd RSI: 0000000000000000 RDI: ffff88006ce4bc48
          RBP: ffff88006bb17558 R08: ffffffff81467ab2 R09: 0000000000000000
          R10: ffff88006bb17438 R11: ffff88006bb17940 R12: ffff88006ce4bc48
          R13: ffff88003c683018 R14: ffff88006bb17980 R15: ffff88003c683000
          FS:  0000000000000000(0000) GS:ffff88006de00000(0000) knlGS:0000000000000000
          CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
          CR2: 0000000000000055 CR3: 000000003c145000 CR4: 00000000000006e0
          DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
          DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
          Call Trace:
           process_one_work+0xbf3/0x1bc0 kernel/workqueue.c:2098
           worker_thread+0x223/0x1860 kernel/workqueue.c:2233
           kthread+0x35e/0x430 kernel/kthread.c:231
           ret_from_fork+0x2a/0x40 arch/x86/entry/entry_64.S:431
          Code:  Bad RIP value.
          RIP: 0x55 RSP: ffff88006bb17540
          CR2: 0000000000000055
          ---[ end trace f0e4920047069cee ]---
      
      Here is a C reproducer (requires CONFIG_BPF_SYSCALL=y and
      CONFIG_AF_KCM=y):
      
          #include <linux/bpf.h>
          #include <linux/kcm.h>
          #include <linux/types.h>
          #include <stdint.h>
          #include <sys/ioctl.h>
          #include <sys/socket.h>
          #include <sys/syscall.h>
          #include <unistd.h>
      
          static const struct bpf_insn bpf_insns[3] = {
              { .code = 0xb7 }, /* BPF_MOV64_IMM(0, 0) */
              { .code = 0x95 }, /* BPF_EXIT_INSN() */
          };
      
          static const union bpf_attr bpf_attr = {
              .prog_type = 1,
              .insn_cnt = 2,
              .insns = (uintptr_t)&bpf_insns,
              .license = (uintptr_t)"",
          };
      
          int main(void)
          {
              int bpf_fd = syscall(__NR_bpf, BPF_PROG_LOAD,
                                   &bpf_attr, sizeof(bpf_attr));
              int inet_fd = socket(AF_INET, SOCK_STREAM, 0);
              int kcm_fd = socket(AF_KCM, SOCK_DGRAM, 0);
      
              ioctl(kcm_fd, SIOCKCMATTACH,
                    &(struct kcm_attach) { .fd = inet_fd, .bpf_fd = bpf_fd });
          }
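      
      A sketch of the fix direction in kcm_attach() (member and callback names
      follow struct strp_callbacks and net/kcm; treat this as an approximation
      rather than the exact patch):
      
          static const struct strp_callbacks cb = {
                  .rcv_msg        = kcm_rcv_strparser,
                  .parse_msg      = kcm_parse_func_strparser,
                  .read_sock_done = kcm_read_sock_done,
                  /* remaining members stay NULL thanks to static storage */
          };
      
          err = strp_init(&psock->strp, csk, &cb);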
      
      Fixes: bbb03029 ("strparser: Generalize strparser")
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Tom Herbert <tom@quantonium.net>
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3fd87127
  13. 24 Aug 2017, 4 commits
  14. 23 Aug 2017, 3 commits
    • bpf: minor cleanups for dev_map · af4d045c
      Committed by Daniel Borkmann
      Some minor code cleanups, while going over it I also noticed that
      we're accounting the bitmap only for one CPU currently, so fix that
      up as well.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      af4d045c
    • bpf: fix map value attribute for hash of maps · 33ba43ed
      Committed by Daniel Borkmann
      Currently, iproute2's BPF ELF loader works fine with array of maps
      when retrieving the fd from a pinned node and doing a selfcheck
      against the provided map attributes from the object file, but we
      fail to do the same for hash of maps and thus refuse to get the
      map from pinned node.
      
      Reason is that when allocating hash of maps, fd_htab_map_alloc() will
      set the value size to sizeof(void *), and any user space map creation
      requests are forced to set 4 bytes as value size. Thus, selfcheck
      will complain about exposed 8 bytes on 64 bit archs vs. 4 bytes from
      object file as value size. Contract is that fdinfo or BPF_MAP_GET_FD_BY_ID
      returns the value size used to create the map.
      
      Fix it by handling it the same way as we do for array of maps, which
      means that we leave value size at 4 bytes and in the allocation phase
      round up value size to 8 bytes. alloc_htab_elem() needs an adjustment
      in order to copy rounded up 8 bytes due to bpf_fd_htab_map_update_elem()
      calling into htab_map_update_elem() with the pointer of the map
      pointer as value. Unlike array of maps where we just xchg(), we're
      using the generic htab_map_update_elem() callback also used from helper
      calls, which published the key/value already on return, so we need
      to ensure we memcpy() the right size.
      
      Fixes: bcc6b1b7 ("bpf: Add hash of maps support")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      33ba43ed
    • bpf: fix map value attribute for hash of maps · cd36c3a2
      Committed by Daniel Borkmann
      Currently, iproute2's BPF ELF loader works fine with array of maps
      when retrieving the fd from a pinned node and doing a selfcheck
      against the provided map attributes from the object file, but we
      fail to do the same for hash of maps and thus refuse to get the
      map from pinned node.
      
      Reason is that when allocating hash of maps, fd_htab_map_alloc() will
      set the value size to sizeof(void *), and any user space map creation
      requests are forced to set 4 bytes as value size. Thus, selfcheck
      will complain about exposed 8 bytes on 64 bit archs vs. 4 bytes from
      object file as value size. Contract is that fdinfo or BPF_MAP_GET_FD_BY_ID
      returns the value size used to create the map.
      
      Fix it by handling it the same way as we do for array of maps, which
      means that we leave value size at 4 bytes and in the allocation phase
      round up value size to 8 bytes. alloc_htab_elem() needs an adjustment
      in order to copy rounded up 8 bytes due to bpf_fd_htab_map_update_elem()
      calling into htab_map_update_elem() with the pointer of the map
      pointer as value. Unlike array of maps where we just xchg(), we're
      using the generic htab_map_update_elem() callback also used from helper
      calls, which published the key/value already on return, so we need
      to ensure we memcpy() the right size.
      
      Fixes: bcc6b1b7 ("bpf: Add hash of maps support")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Acked-by: Martin KaFai Lau <kafai@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cd36c3a2
  15. 21 Aug 2017, 1 commit
    • bpf: fix double free from dev_map_notification() · 274043c6
      Committed by Daniel Borkmann
      In the current code, dev_map_free() can still race with dev_map_notification().
      In dev_map_free(), we remove dtab from the list of dtabs after we purged
      all entries from it. However, we don't do xchg() with NULL or the like,
      so the entry at that point is still pointing to the device. If an unregister
      notification comes in at the same time, we therefore risk a double-free,
      since the pointer is still present in the map, and then pushed again to
      __dev_map_entry_free().
      
      All this is completely unnecessary. Just remove the dtab from the list
      right before the synchronize_rcu(), so all outstanding readers from the
      notifier list have finished by then, thus we don't need to deal with this
      corner case anymore and also wouldn't need to nullify dev entries. This is
      fine because we iterate over the map releasing all entries and therefore
      dev references anyway.
      
      Fixes: 4cc7b954 ("bpf: devmap fix mutex in rcu critical section")
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      274043c6
  16. 20 Aug 2017, 3 commits
    • bpf: inline map in map lookup functions for array and htab · 7b0c2a05
      Committed by Daniel Borkmann
      Avoid two successive function calls for the map in map lookup, first
      is the bpf_map_lookup_elem() helper call, and second the callback via
      map->ops->map_lookup_elem() to get to the map in map implementation.
      Implementation inlines array and htab flavor for map in map lookups.
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7b0c2a05
    • bpf: make htab inlining more robust wrt assumptions · 89c63074
      Committed by Daniel Borkmann
      Commit 9015d2f5 ("bpf: inline htab_map_lookup_elem()") was
      making the assumption that a direct call emission to the function
      __htab_map_lookup_elem() will always work out for JITs.
      
      This is currently true since all JITs we have are for 64 bit archs,
      but in case of 32 bit JITs like upcoming arm32, we get a NULL pointer
      dereference when executing the call to __htab_map_lookup_elem()
      since passed arguments are of a different size (due to pointer args)
      than what we do out of BPF. Guard and thus limit this for now for
      the current 64 bit JITs only.
      Reported-by: Shubham Bansal <illusionist.neo@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      89c63074
    • bpf: Allow selecting numa node during map creation · 96eabe7a
      Committed by Martin KaFai Lau
      The current map creation API does not allow providing the numa-node
      preference.  The memory usually comes from where the map-creation-process
      is running.  The performance is not ideal if the bpf_prog is known to
      always run in a numa node different from the map-creation-process.
      
      One of the use cases is sharding on CPU to different LRU maps (i.e.
      an array of LRU maps).  Here is the test result of map_perf_test on
      the INNER_LRU_HASH_PREALLOC test if we force the lru map used by
      CPU0 to be allocated from a remote numa node:
      
      [ The machine has 20 cores. CPU0-9 at node 0. CPU10-19 at node 1 ]
      
      ># taskset -c 10 ./map_perf_test 512 8 1260000 8000000
      5:inner_lru_hash_map_perf pre-alloc 1628380 events per sec
      4:inner_lru_hash_map_perf pre-alloc 1626396 events per sec
      3:inner_lru_hash_map_perf pre-alloc 1626144 events per sec
      6:inner_lru_hash_map_perf pre-alloc 1621657 events per sec
      2:inner_lru_hash_map_perf pre-alloc 1621534 events per sec
      1:inner_lru_hash_map_perf pre-alloc 1620292 events per sec
      7:inner_lru_hash_map_perf pre-alloc 1613305 events per sec
      0:inner_lru_hash_map_perf pre-alloc 1239150 events per sec  #<<<
      
      After specifying numa node:
      ># taskset -c 10 ./map_perf_test 512 8 1260000 8000000
      5:inner_lru_hash_map_perf pre-alloc 1629627 events per sec
      3:inner_lru_hash_map_perf pre-alloc 1628057 events per sec
      1:inner_lru_hash_map_perf pre-alloc 1623054 events per sec
      6:inner_lru_hash_map_perf pre-alloc 1616033 events per sec
      2:inner_lru_hash_map_perf pre-alloc 1614630 events per sec
      4:inner_lru_hash_map_perf pre-alloc 1612651 events per sec
      7:inner_lru_hash_map_perf pre-alloc 1609337 events per sec
      0:inner_lru_hash_map_perf pre-alloc 1619340 events per sec #<<<
      
      This patch adds one field, numa_node, to the bpf_attr.  Since numa node 0
      is a valid node, a new flag BPF_F_NUMA_NODE is also added.  The numa_node
      field is honored if and only if the BPF_F_NUMA_NODE flag is set.
      
      Numa node selection is not supported for percpu map.
      
      This patch does not change all the kmalloc.  F.e.
      'htab = kzalloc()' is not changed since the object
      is small enough to stay in the cache.
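      
      A hypothetical userspace usage sketch (function name is illustrative):
      create a pre-allocated LRU hash on a specific NUMA node; BPF_F_NUMA_NODE
      must be set for the numa_node field to be honored.
      
      #include <linux/bpf.h>
      #include <sys/syscall.h>
      #include <unistd.h>
      
      static int create_lru_hash_on_node(unsigned int max_entries, int node)
      {
              union bpf_attr attr = {};
      
              attr.map_type    = BPF_MAP_TYPE_LRU_HASH;
              attr.key_size    = 4;
              attr.value_size  = 8;
              attr.max_entries = max_entries;
              attr.map_flags   = BPF_F_NUMA_NODE;
              attr.numa_node   = node;
      
              return syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
      }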
      Signed-off-by: Martin KaFai Lau <kafai@fb.com>
      Acked-by: Daniel Borkmann <daniel@iogearbox.net>
      Acked-by: Alexei Starovoitov <ast@fb.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      96eabe7a
  17. 19 Aug 2017, 2 commits