1. 09 9月, 2017 2 次提交
    • J
      net: rcu lock and preempt disable missing around generic xdp · bbbe211c
      John Fastabend 提交于
      do_xdp_generic must be called inside rcu critical section with preempt
      disabled to ensure BPF programs are valid and per-cpu variables used
      for redirect operations are consistent. This patch ensures this is true
      and fixes the splat below.
      
      The netif_receive_skb_internal() code path is now broken into two rcu
      critical sections. I decided it was better to limit the preempt_enable/disable
      block to just the xdp static key portion and the fallout is more
      rcu_read_lock/unlock calls. Seems like the best option to me.
      
      [  607.596901] =============================
      [  607.596906] WARNING: suspicious RCU usage
      [  607.596912] 4.13.0-rc4+ #570 Not tainted
      [  607.596917] -----------------------------
      [  607.596923] net/core/dev.c:3948 suspicious rcu_dereference_check() usage!
      [  607.596927]
      [  607.596927] other info that might help us debug this:
      [  607.596927]
      [  607.596933]
      [  607.596933] rcu_scheduler_active = 2, debug_locks = 1
      [  607.596938] 2 locks held by pool/14624:
      [  607.596943]  #0:  (rcu_read_lock_bh){......}, at: [<ffffffff95445ffd>] ip_finish_output2+0x14d/0x890
      [  607.596973]  #1:  (rcu_read_lock_bh){......}, at: [<ffffffff953c8e3a>] __dev_queue_xmit+0x14a/0xfd0
      [  607.597000]
      [  607.597000] stack backtrace:
      [  607.597006] CPU: 5 PID: 14624 Comm: pool Not tainted 4.13.0-rc4+ #570
      [  607.597011] Hardware name: Dell Inc. Precision Tower 5810/0HHV7N, BIOS A17 03/01/2017
      [  607.597016] Call Trace:
      [  607.597027]  dump_stack+0x67/0x92
      [  607.597040]  lockdep_rcu_suspicious+0xdd/0x110
      [  607.597054]  do_xdp_generic+0x313/0xa50
      [  607.597068]  ? time_hardirqs_on+0x5b/0x150
      [  607.597076]  ? mark_held_locks+0x6b/0xc0
      [  607.597088]  ? netdev_pick_tx+0x150/0x150
      [  607.597117]  netif_rx_internal+0x205/0x3f0
      [  607.597127]  ? do_xdp_generic+0xa50/0xa50
      [  607.597144]  ? lock_downgrade+0x2b0/0x2b0
      [  607.597158]  ? __lock_is_held+0x93/0x100
      [  607.597187]  netif_rx+0x119/0x190
      [  607.597202]  loopback_xmit+0xfd/0x1b0
      [  607.597214]  dev_hard_start_xmit+0x127/0x4e0
      
      Fixes: d4455169 ("net: xdp: support xdp generic on virtual devices")
      Fixes: b5cdae32 ("net: Generic XDP")
      Acked-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      bbbe211c
    • D
      bpf: don't select potentially stale ri->map from buggy xdp progs · 109980b8
      Daniel Borkmann 提交于
      We can potentially run into a couple of issues with the XDP
      bpf_redirect_map() helper. The ri->map in the per CPU storage
      can become stale in several ways, mostly due to misuse, where
      we can then trigger a use after free on the map:
      
      i) prog A is calling bpf_redirect_map(), returning XDP_REDIRECT
      and running on a driver not supporting XDP_REDIRECT yet. The
      ri->map on that CPU becomes stale when the XDP program is unloaded
      on the driver, and a prog B loaded on a different driver which
      supports XDP_REDIRECT return code. prog B would have to omit
      calling to bpf_redirect_map() and just return XDP_REDIRECT, which
      would then access the freed map in xdp_do_redirect() since not
      cleared for that CPU.
      
      ii) prog A is calling bpf_redirect_map(), returning a code other
      than XDP_REDIRECT. prog A is then detached, which triggers release
      of the map. prog B is attached which, similarly as in i), would
      just return XDP_REDIRECT without having called bpf_redirect_map()
      and thus be accessing the freed map in xdp_do_redirect() since
      not cleared for that CPU.
      
      iii) prog A is attached to generic XDP, calling the bpf_redirect_map()
      helper and returning XDP_REDIRECT. xdp_do_generic_redirect() is
      currently not handling ri->map (will be fixed by Jesper), so it's
      not being reset. Later loading a e.g. native prog B which would,
      say, call bpf_xdp_redirect() and then returns XDP_REDIRECT would
      find in xdp_do_redirect() that a map was set and uses that causing
      use after free on map access.
      
      Fix thus needs to avoid accessing stale ri->map pointers, naive
      way would be to call a BPF function from drivers that just resets
      it to NULL for all XDP return codes but XDP_REDIRECT and including
      XDP_REDIRECT for drivers not supporting it yet (and let ri->map
      being handled in xdp_do_generic_redirect()). There is a less
      intrusive way w/o letting drivers call a reset for each BPF run.
      
      The verifier knows we're calling into bpf_xdp_redirect_map()
      helper, so it can do a small insn rewrite transparent to the prog
      itself in the sense that it fills R4 with a pointer to the own
      bpf_prog. We have that pointer at verification time anyway and
      R4 is allowed to be used as per calling convention we scratch
      R0 to R5 anyway, so they become inaccessible and program cannot
      read them prior to a write. Then, the helper would store the prog
      pointer in the current CPUs struct redirect_info. Later in
      xdp_do_*_redirect() we check whether the redirect_info's prog
      pointer is the same as passed xdp_prog pointer, and if that's
      the case then all good, since the prog holds a ref on the map
      anyway, so it is always valid at that point in time and must
      have a reference count of at least 1. If in the unlikely case
      they are not equal, it means we got a stale pointer, so we clear
      and bail out right there. Also do reset map and the owning prog
      in bpf_xdp_redirect(), so that bpf_xdp_redirect_map() and
      bpf_xdp_redirect() won't get mixed up, only the last call should
      take precedence. A tc bpf_redirect() doesn't use map anywhere
      yet, so no need to clear it there since never accessed in that
      layer.
      
      Note that in case the prog is released, and thus the map as
      well we're still under RCU read critical section at that time
      and have preemption disabled as well. Once we commit with the
      __dev_map_insert_ctx() from xdp_do_redirect_map() and set the
      map to ri->map_to_flush, we still wait for a xdp_do_flush_map()
      to finish in devmap dismantle time once flush_needed bit is set,
      so that is fine.
      
      Fixes: 97f91a7c ("bpf: add bpf_redirect_map helper routine")
      Reported-by: NJesper Dangaard Brouer <brouer@redhat.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NJohn Fastabend <john.fastabend@gmail.com>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      109980b8
  2. 08 9月, 2017 1 次提交
  3. 06 9月, 2017 2 次提交
  4. 02 9月, 2017 5 次提交
  5. 01 9月, 2017 4 次提交
  6. 31 8月, 2017 1 次提交
  7. 30 8月, 2017 6 次提交
  8. 29 8月, 2017 1 次提交
  9. 26 8月, 2017 1 次提交
    • S
      tcp: fix refcnt leak with ebpf congestion control · ebfa00c5
      Sabrina Dubroca 提交于
      There are a few bugs around refcnt handling in the new BPF congestion
      control setsockopt:
      
       - The new ca is assigned to icsk->icsk_ca_ops even in the case where we
         cannot get a reference on it. This would lead to a use after free,
         since that ca is going away soon.
      
       - Changing the congestion control case doesn't release the refcnt on
         the previous ca.
      
       - In the reinit case, we first leak a reference on the old ca, then we
         call tcp_reinit_congestion_control on the ca that we have just
         assigned, leading to deinitializing the wrong ca (->release of the
         new ca on the old ca's data) and releasing the refcount on the ca
         that we actually want to use.
      
      This is visible by building (for example) BIC as a module and setting
      net.ipv4.tcp_congestion_control=bic, and using tcp_cong_kern.c from
      samples/bpf.
      
      This patch fixes the refcount issues, and moves reinit back into tcp
      core to avoid passing a ca pointer back to BPF.
      
      Fixes: 91b5b21c ("bpf: Add support for changing congestion control")
      Signed-off-by: NSabrina Dubroca <sd@queasysnail.net>
      Acked-by: NLawrence Brakmo <brakmo@fb.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      ebfa00c5
  10. 25 8月, 2017 8 次提交
  11. 24 8月, 2017 2 次提交
  12. 23 8月, 2017 2 次提交
    • D
      bpf: misc xdp redirect cleanups · e4a8e817
      Daniel Borkmann 提交于
      Few cleanups including: bpf_redirect_map() is really XDP only due to
      the return code. Move it to a more appropriate location where we do
      the XDP redirect handling and change it's name into bpf_xdp_redirect_map()
      to make it consistent to the bpf_xdp_redirect() helper.
      
      xdp_do_redirect_map() helper can be static since only used out of filter.c
      file. Drop the goto in xdp_do_generic_redirect() and only return errors
      directly. In xdp_do_flush_map() only clear ri->map_to_flush which is the
      arg we're using in that function, ri->map is cleared earlier along with
      ri->ifindex.
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Acked-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e4a8e817
    • E
      udp: on peeking bad csum, drop packets even if not at head · fd6055a8
      Eric Dumazet 提交于
      When peeking, if a bad csum is discovered, the skb is unlinked from
      the queue with __sk_queue_drop_skb and the peek operation restarted.
      
      __sk_queue_drop_skb only drops packets that match the queue head.
      
      This fails if the skb was found after the head, using SO_PEEK_OFF
      socket option. This causes an infinite loop.
      
      We MUST drop this problematic skb, and we can simply check if skb was
      already removed by another thread, by looking at skb->next :
      
      This pointer is set to NULL by the  __skb_unlink() operation, that might
      have happened only under the spinlock protection.
      
      Many thanks to syzkaller team (and particularly Dmitry Vyukov who
      provided us nice C reproducers exhibiting the lockup) and Willem de
      Bruijn who provided first version for this patch and a test program.
      
      Fixes: 627d2d6b ("udp: enable MSG_PEEK at non-zero offset")
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Reported-by: NDmitry Vyukov <dvyukov@google.com>
      Cc: Willem de Bruijn <willemb@google.com>
      Acked-by: NPaolo Abeni <pabeni@redhat.com>
      Acked-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fd6055a8
  13. 22 8月, 2017 1 次提交
    • D
      net: check type when freeing metadata dst · e65a4955
      David Lamparter 提交于
      Commit 3fcece12 ("net: store port/representator id in metadata_dst")
      added a new type field to metadata_dst, but metadata_dst_free() wasn't
      updated to check it before freeing the METADATA_IP_TUNNEL specific dst
      cache entry.
      
      This is not currently causing problems since it's far enough back in the
      struct to be zeroed for the only other type currently in existance
      (METADATA_HW_PORT_MUX), but nevertheless it's not correct.
      
      Fixes: 3fcece12 ("net: store port/representator id in metadata_dst")
      Signed-off-by: NDavid Lamparter <equinox@diac24.net>
      Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
      Cc: Sridhar Samudrala <sridhar.samudrala@intel.com>
      Cc: Simon Horman <horms@verge.net.au>
      Cc: David S. Miller <davem@davemloft.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      e65a4955
  14. 19 8月, 2017 4 次提交