1. 06 Sep 2019, 1 commit
    • net: openvswitch: Set OvS recirc_id from tc chain index · 95a7233c
      Paul Blakey authored
      Offloaded OvS datapath rules are translated one to one to tc rules,
      for example the following simplified OvS rule:
      
      recirc_id(0),in_port(dev1),eth_type(0x0800),ct_state(-trk) actions:ct(),recirc(2)
      
      Will be translated to the following tc rule:
      
      $ tc filter add dev dev1 ingress \
      	    prio 1 chain 0 proto ip \
      		flower tcp ct_state -trk \
      		action ct pipe \
      		action goto chain 2
      
      Received packets will first travel through tc, and if they aren't stolen
      by it, as in the above rule, they will continue to the OvS datapath.
      Since we already executed some actions (action ct in this case) which might
      have modified the packets and updated action stats, we would like to continue
      processing with the correct recirc_id in OvS (here recirc_id(2)),
      picking up where we left off.
      
      To support this, introduce a new skb extension for tc, which
      will be used for translating tc chain to ovs recirc_id to
      handle these miss cases. Last tc chain index will be set
      by tc goto chain action and read by OvS datapath.
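
      As a hedged illustration of the consumer side (not the exact patch), a
      helper like the hypothetical example_recirc_id_from_tc_chain() below could
      recover the last chain index from the extension on such a miss:

      	#include <linux/skbuff.h>

      	static u32 example_recirc_id_from_tc_chain(const struct sk_buff *skb)
      	{
      		struct tc_skb_ext *ext;

      		/* set by the tc "goto chain" action before the miss */
      		ext = skb_ext_find(skb, TC_SKB_EXT);
      		if (!ext)
      			return 0;	/* no chain recorded, stay at recirc_id(0) */

      		/* OvS maps this chain index to its recirc_id */
      		return ext->chain;
      	}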
      Signed-off-by: Paul Blakey <paulb@mellanox.com>
      Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Acked-by: Pravin B Shelar <pshelar@ovn.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      95a7233c
  2. 31 Jul 2019, 1 commit
  3. 23 Jul 2019, 2 commits
  4. 17 Jul 2019, 1 commit
  5. 09 Jul 2019, 5 commits
  6. 10 Jun 2019, 2 commits
    • net: Don't disable interrupts in __netdev_alloc_skb() · 92dcabd7
      Sebastian Andrzej Siewior authored
      __netdev_alloc_skb() can be used from any context and is used by NAPI
      and non-NAPI drivers. Non-NAPI drivers use it in interrupt context and
      NAPI drivers use it during initial allocation (->ndo_open() or
      ->ndo_change_mtu()). Some NAPI drivers share the same function for the
      initial allocation and the allocation in their NAPI callback.
      
      The interrupts are disabled in order to ensure locked access from every
      context to `netdev_alloc_cache'.
      
      Let __netdev_alloc_skb() check if interrupts are disabled. If they are, use
      `netdev_alloc_cache'. Otherwise disable BH and use `napi_alloc_cache.page'.
      The IRQ check is cheaper than disabling & enabling interrupts, and
      memory allocation with interrupts disabled does not work on -RT.
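
      A minimal sketch of that branch (simplified; netdev_alloc_cache and
      napi_alloc_cache are the per-CPU caches defined in net/core/skbuff.c, and
      alloc_skb_data_sketch() is just an illustrative name):

      	static void *alloc_skb_data_sketch(unsigned int len, gfp_t gfp_mask)
      	{
      		struct page_frag_cache *nc;
      		void *data;

      		if (in_irq() || irqs_disabled()) {
      			/* hard IRQ context or IRQs off: IRQ-safe netdev cache */
      			nc = this_cpu_ptr(&netdev_alloc_cache);
      			data = page_frag_alloc(nc, len, gfp_mask);
      		} else {
      			/* otherwise disabling BH is enough for the NAPI cache */
      			local_bh_disable();
      			nc = this_cpu_ptr(&napi_alloc_cache.page);
      			data = page_frag_alloc(nc, len, gfp_mask);
      			local_bh_enable();
      		}
      		return data;
      	}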
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      92dcabd7
    • net: Don't disable interrupts in napi_alloc_frag() · 7ba7aeab
      Sebastian Andrzej Siewior authored
      netdev_alloc_frag() can be used from any context and is used by NAPI
      and non-NAPI drivers. Non-NAPI drivers use it in interrupt context
      and NAPI drivers use it during initial allocation (->ndo_open() or
      ->ndo_change_mtu()). Some NAPI drivers share the same function for the
      initial allocation and the allocation in their NAPI callback.
      
      The interrupts are disabled in order to ensure locked access from every
      context to `netdev_alloc_cache'.
      
      Let netdev_alloc_frag() check if interrupts are disabled. If they are,
      use `netdev_alloc_cache' otherwise disable BH and invoke
      __napi_alloc_frag() for the allocation. The IRQ check is cheaper
      than disabling & enabling interrupts, and memory allocation with
      interrupts disabled does not work on -RT.
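
      From a driver's point of view the helper can now be called from any of
      those contexts; a hedged usage sketch (example_rx_buffer_alloc() and
      rx_buf_size are hypothetical):

      	/* safe from ->ndo_open(), softirq and hard IRQ context alike */
      	static void *example_rx_buffer_alloc(unsigned int rx_buf_size)
      	{
      		return netdev_alloc_frag(SKB_DATA_ALIGN(rx_buf_size) +
      					 SKB_DATA_ALIGN(sizeof(struct skb_shared_info)));
      	}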
      Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7ba7aeab
  7. 05 Jun 2019, 1 commit
  8. 31 May 2019, 3 commits
  9. 25 May 2019, 1 commit
    • bpf: sockmap, fix use after free from sleep in psock backlog workqueue · bd95e678
      John Fastabend authored
      Backlog work for psock (sk_psock_backlog) might sleep while waiting
      for memory to free up when sending packets. However, while sleeping
      the socket may be closed and removed from the map by the user space
      side.
      
      This breaks an assumption in sk_stream_wait_memory, which expects the
      wait queue to still be there when it wakes up, resulting in the
      use-after-free shown below. To fix this, mark sendmsg as MSG_DONTWAIT
      to avoid the sleep altogether. We already set the flag for the
      sendpage case but missed the case where sendmsg is used.
      Sockmap is currently the only user of skb_send_sock_locked(), so only
      the sockmap paths should be impacted.
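
      A hedged sketch of the fix as described (not the exact diff; sk, data and
      len stand in for the backlog worker's state):

      	struct kvec kv = { .iov_base = data, .iov_len = len };
      	struct msghdr msg = { .msg_flags = MSG_DONTWAIT };
      	int ret;

      	/* non-blocking, so the psock backlog workqueue never sleeps in
      	 * sk_stream_wait_memory() */
      	ret = kernel_sendmsg_locked(sk, &msg, &kv, 1, len);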
      
      ==================================================================
      BUG: KASAN: use-after-free in remove_wait_queue+0x31/0x70
      Write of size 8 at addr ffff888069a0c4e8 by task kworker/0:2/110
      
      CPU: 0 PID: 110 Comm: kworker/0:2 Not tainted 5.0.0-rc2-00335-g28f9d1a3-dirty #14
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014
      Workqueue: events sk_psock_backlog
      Call Trace:
       print_address_description+0x6e/0x2b0
       ? remove_wait_queue+0x31/0x70
       kasan_report+0xfd/0x177
       ? remove_wait_queue+0x31/0x70
       ? remove_wait_queue+0x31/0x70
       remove_wait_queue+0x31/0x70
       sk_stream_wait_memory+0x4dd/0x5f0
       ? sk_stream_wait_close+0x1b0/0x1b0
       ? wait_woken+0xc0/0xc0
       ? tcp_current_mss+0xc5/0x110
       tcp_sendmsg_locked+0x634/0x15d0
       ? tcp_set_state+0x2e0/0x2e0
       ? __kasan_slab_free+0x1d1/0x230
       ? kmem_cache_free+0x70/0x140
       ? sk_psock_backlog+0x40c/0x4b0
       ? process_one_work+0x40b/0x660
       ? worker_thread+0x82/0x680
       ? kthread+0x1b9/0x1e0
       ? ret_from_fork+0x1f/0x30
       ? check_preempt_curr+0xaf/0x130
       ? iov_iter_kvec+0x5f/0x70
       ? kernel_sendmsg_locked+0xa0/0xe0
       skb_send_sock_locked+0x273/0x3c0
       ? skb_splice_bits+0x180/0x180
       ? start_thread+0xe0/0xe0
       ? update_min_vruntime.constprop.27+0x88/0xc0
       sk_psock_backlog+0xb3/0x4b0
       ? strscpy+0xbf/0x1e0
       process_one_work+0x40b/0x660
       worker_thread+0x82/0x680
       ? process_one_work+0x660/0x660
       kthread+0x1b9/0x1e0
       ? __kthread_create_on_node+0x250/0x250
       ret_from_fork+0x1f/0x30
      
      Fixes: 20bf50de ("skbuff: Function to send an skbuf on a socket")
      Reported-by: Jakub Sitnicki <jakub@cloudflare.com>
      Tested-by: Jakub Sitnicki <jakub@cloudflare.com>
      Signed-off-by: John Fastabend <john.fastabend@gmail.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
      bd95e678
  10. 18 Apr 2019, 1 commit
  11. 17 Apr 2019, 1 commit
  12. 04 Apr 2019, 1 commit
    • net-gro: Fix GRO flush when receiving a GSO packet. · 0ab03f35
      Steffen Klassert authored
      Currently we may incorrectly merge a received GSO packet,
      or a packet with a frag_list, into a packet sitting in the
      gro_hash list. skb_segment() may then crash because
      its assumptions about the skb layout are not met.
      The correct behaviour would be to flush the packet in the
      gro_hash list and send the received GSO packet directly
      afterwards. Commit d61d072e ("net-gro: avoid reorders")
      sets NAPI_GRO_CB(skb)->flush in this case, but this is not
      checked before merging. This patch makes sure to check this
      flag and to not merge in that case.
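
      A hedged sketch of the added check as it would sit in the GRO merge path
      (skb is the newly received packet; on -E2BIG the caller flushes the held
      packet and delivers skb separately):

      	/* d61d072e set NAPI_GRO_CB(skb)->flush for GSO/frag_list packets;
      	 * honour it here instead of merging */
      	if (unlikely(NAPI_GRO_CB(skb)->flush))
      		return -E2BIG;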
      
      Fixes: d61d072e ("net-gro: avoid reorders")
      Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      0ab03f35
  13. 28 Mar 2019, 1 commit
  14. 18 Feb 2019, 1 commit
  15. 05 Jan 2019, 1 commit
    • net, skbuff: do not prefer skb allocation fails early · f8c468e8
      David Rientjes authored
      Commit dcda9b04 ("mm, tree wide: replace __GFP_REPEAT by
      __GFP_RETRY_MAYFAIL with more useful semantic") replaced __GFP_REPEAT in
      alloc_skb_with_frags() with __GFP_RETRY_MAYFAIL when the allocation may
      directly reclaim.
      
      The previous behavior would require reclaim up to 1 << order pages for
      skb aligned header_len of order > PAGE_ALLOC_COSTLY_ORDER before failing,
      otherwise the allocations in alloc_skb() would loop in the page allocator
      looking for memory.  __GFP_RETRY_MAYFAIL makes both allocations failable
      under memory pressure, including for the HEAD allocation.
      
      This can cause, among many other things, write() to fail with ENOTCONN
      during RPC when under memory pressure.
      
      These allocations should succeed as they did previous to dcda9b04
      even if it requires calling the oom killer and additional looping in the
      page allocator to find memory.  There is no way to specify the previous
      behavior of __GFP_REPEAT, but it's unlikely to be necessary since the
      previous behavior only guaranteed that 1 << order pages would be reclaimed
      before failing for order > PAGE_ALLOC_COSTLY_ORDER.  That reclaim is not
      guaranteed to be contiguous memory, so repeating for such large orders is
      usually not beneficial.
      
      Remove the setting of __GFP_RETRY_MAYFAIL to restore the previous
      behavior: specifically, do not allow alloc_skb() to fail for small orders,
      and oom kill if necessary rather than allowing RPCs to fail.
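
      A hedged sketch of the resulting header allocation in
      alloc_skb_with_frags() (simplified; only the head allocation is shown):

      	/* no __GFP_RETRY_MAYFAIL or'd in: small orders may keep looping in
      	 * the page allocator and oom kill instead of failing early */
      	skb = alloc_skb(header_len, gfp_mask);
      	if (!skb)
      		return NULL;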
      
      Fixes: dcda9b04 ("mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic")
      Signed-off-by: David Rientjes <rientjes@google.com>
      Reviewed-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      f8c468e8
  16. 22 Dec 2018, 2 commits
  17. 20 Dec 2018, 3 commits
    • net: switch secpath to use skb extension infrastructure · 4165079b
      Florian Westphal authored
      Remove skb->sp and allocate secpath storage via extension
      infrastructure.  This also reduces sk_buff by 8 bytes on x86_64.
      
      Total size of allyesconfig kernel is reduced slightly, as there is
      less inlined code (one conditional atomic op instead of two on
      skb_clone).
      
      No differences in throughput in following ipsec performance tests:
      - transport mode with aes on 10GB link
      - tunnel mode between two network namespaces with aes and null cipher
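
      A hedged sketch of how the secpath is reached after the conversion
      (skb->sp is gone; skb_sec_path_sketch() is an illustrative stand-in for
      the real accessor):

      	static inline struct sec_path *skb_sec_path_sketch(const struct sk_buff *skb)
      	{
      	#ifdef CONFIG_XFRM
      		return skb_ext_find(skb, SKB_EXT_SEC_PATH);
      	#else
      		return NULL;
      	#endif
      	}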
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4165079b
    • net: convert bridge_nf to use skb extension infrastructure · de8bda1d
      Florian Westphal authored
      This converts the bridge netfilter (calling iptables hooks from bridge)
      facility to use the extension infrastructure.
      
      The bridge_nf-specific hooks in the skb clone and free paths are removed;
      they have been replaced by the skb_ext hooks, which do the same as the
      bridge nf allocation hooks did.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      de8bda1d
    • sk_buff: add skb extension infrastructure · df5042f4
      Florian Westphal authored
      This adds an optional extension infrastructure, with ipsec (xfrm) and
      bridge netfilter as the first users.
      objdiff shows no changes if the kernel is built without xfrm and
      br_netfilter support.
      
      The third (planned future) user is Multipath TCP which is still
      out-of-tree.
      MPTCP needs to map logical mptcp sequence numbers to the tcp sequence
      numbers used by individual subflows.
      
      This DSS mapping is read from tcp option space on receive and
      written to tcp option space on transmitted tcp packets that are part of
      an MPTCP connection.
      
      Extending skb_shared_info or adding a private data field to skb fclones
      doesn't work for incoming skbs, so a different DSS propagation method would
      be required for the receive side.
      
      mptcp has same requirements as secpath/bridge netfilter:
      
      1. extension memory is released when the sk_buff is free'd.
      2. data is shared after cloning an skb (clone inherits extension)
      3. adding extension to an skb will COW the extension buffer if needed.
      
      The "MPTCP upstreaming" effort adds SKB_EXT_MPTCP extension to store the
      mapping for tx and rx processing.
      
      Two new members are added to sk_buff:
      1. 'active_extensions' byte (filling a hole), telling which extensions
         are available for this skb.
         This has two purposes.
         a) it avoids the need to initialize the pointer.
         b) it allows "deleting" an extension by clearing its bit
         value in ->active_extensions.
      
         While it would be possible to store the active_extensions byte
         in the extension struct instead of sk_buff, there is one problem
         with this:
          When an extension has to be disabled, we can always clear the
          bit in skb->active_extensions.  But if it were stored in the
          extension buffer itself, we might have to COW that buffer first
          when dealing with a cloned skb, and on kmalloc failure we would
          be unable to turn the extension off.
      
      2. extension pointer, located at the end of the sk_buff.
         If the active_extensions byte is 0, the pointer is undefined;
         it is not initialized on skb allocation.
      
      This adds extra code to skb clone and free paths (to deal with
      refcount/free of extension area) but this replaces similar code that
      manages skb->nf_bridge and skb->sp structs in the followup patches of
      the series.
      
      It is possible to add support for extensions that are not preserved on
      clones/copies.
      
      To do this, one would need to define a bitmask of all extensions that
      need copy/cow semantics, and change __skb_ext_copy() to check
      ->active_extensions & SKB_EXT_PRESERVE_ON_CLONE, then just set
      ->active_extensions to 0 on the new clone.
      
      This isn't done here because all extensions that get added here
      need the copy/cow semantics.
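
      A hedged usage sketch of the new API, taking the secpath extension as the
      example consumer (error handling abbreviated):

      	struct sec_path *sp;

      	/* attach; COWs the extension area first if the skb is a clone */
      	sp = skb_ext_add(skb, SKB_EXT_SEC_PATH);
      	if (!sp)
      		return -ENOMEM;		/* kmalloc failed, nothing attached */

      	/* lookup; checks the bit in skb->active_extensions first */
      	sp = skb_ext_find(skb, SKB_EXT_SEC_PATH);

      	/* "delete"; only clears the bit, memory goes away with the skb */
      	skb_ext_del(skb, SKB_EXT_SEC_PATH);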
      
      v2:
      Allocate entire extension space using kmem_cache.
      Upside is that this allows better tracking of used memory,
      downside is that we will allocate more space than strictly needed in
      most cases (it's unlikely that all extensions are active/needed at the
      same time for the same skb).
      The allocated memory (except the small extension header) is not cleared,
      so there is no additional overhead aside from memory usage.
      
      Avoid the atomic_dec_and_test operation in skb_ext_put()
      by using a similar trick to the one kfree_skbmem() uses with fclone_ref:
      if the refcount is 1, there is no concurrent user and we can free right away.
      Signed-off-by: Florian Westphal <fw@strlen.de>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      df5042f4
  18. 05 Dec 2018, 1 commit
    • skbuff: Rename 'offload_mr_fwd_mark' to 'offload_l3_fwd_mark' · 875e8939
      Ido Schimmel authored
      Commit abf4bb6b ("skbuff: Add the offload_mr_fwd_mark field") added
      the 'offload_mr_fwd_mark' field to indicate that a packet has already
      undergone L3 multicast routing by a capable device. The field is used to
      prevent the kernel from forwarding a packet through a netdev through
      which the device has already forwarded the packet.
      
      Currently, no unicast packet is routed by both the device and the
      kernel, but this is about to change in subsequent patches and we need to
      be able to mark such packets, so that they will not be forwarded twice.
      
      Instead of adding yet another field to 'struct sk_buff', we can just
      rename 'offload_mr_fwd_mark' to 'offload_l3_fwd_mark', as a packet
      either has a multicast or a unicast destination IP.
      
      While at it, add a comment about both 'offload_fwd_mark' and
      'offload_l3_fwd_mark'.
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      875e8939
  19. 04 Dec 2018, 2 commits
    • udp: elide zerocopy operation in hot path · 52900d22
      Willem de Bruijn authored
      With MSG_ZEROCOPY, each skb holds a reference to a struct ubuf_info.
      Release of its last reference triggers a completion notification.
      
      The TCP stack in tcp_sendmsg_locked holds an extra ref independent of
      the skbs, because it can build, send and free skbs within its loop,
      possibly reaching refcount zero and freeing the ubuf_info too soon.
      
      The UDP stack currently also takes this extra ref, but does not need
      it as all skbs are sent after return from __ip(6)_append_data.
      
      Avoid the extra refcount_inc and refcount_dec_and_test, and generally
      the sock_zerocopy_put in the common path, by passing the initial
      reference to the first skb.
      
      This approach is taken instead of initializing the refcount to 0, as
      that would generate error "refcount_t: increment on 0" on the
      next skb_zcopy_set.
      
      Changes
        v3 -> v4
          - Move skb_zcopy_set below the only kfree_skb that might cause
            a premature uarg destroy before skb_zerocopy_put_abort
            - Move the entire skb_shinfo assignment block, to keep that
              cacheline access in one place
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Acked-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      52900d22
    • udp: msg_zerocopy · b5947e5d
      Willem de Bruijn authored
      Extend zerocopy to udp sockets. Allow setting sockopt SO_ZEROCOPY and
      interpret flag MSG_ZEROCOPY.
      
      This patch was previously part of the zerocopy RFC patchsets. Zerocopy
      is not effective at small MTU. With segmentation offload building
      larger datagrams, the benefit of page flipping outweighs the cost of
      generating a completion notification.
      
      tools/testing/selftests/net/msg_zerocopy.sh after applying follow-on
      test patch and making skb_orphan_frags_rx same as skb_orphan_frags:
      
          ipv4 udp -t 1
          tx=191312 (11938 MB) txc=0 zc=n
          rx=191312 (11938 MB)
          ipv4 udp -z -t 1
          tx=304507 (19002 MB) txc=304507 zc=y
          rx=304507 (19002 MB)
          ok
          ipv6 udp -t 1
          tx=174485 (10888 MB) txc=0 zc=n
          rx=174485 (10888 MB)
          ipv6 udp -z -t 1
          tx=294801 (18396 MB) txc=294801 zc=y
          rx=294801 (18396 MB)
          ok
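
      For reference, a hedged userspace usage sketch (udp_send_zerocopy() is a
      made-up helper; dst is filled in by the caller and error handling is
      omitted):

      	#include <sys/socket.h>
      	#include <netinet/in.h>

      	static ssize_t udp_send_zerocopy(int fd, const void *buf, size_t len,
      					 const struct sockaddr_in *dst)
      	{
      		int one = 1;

      		/* opt in once per socket ... */
      		setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));

      		/* ... then request it per call; completion notifications are
      		 * read from the socket's MSG_ERRQUEUE */
      		return sendto(fd, buf, len, MSG_ZEROCOPY,
      			      (const struct sockaddr *)dst, sizeof(*dst));
      	}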
      
      Changes
        v1 -> v2
          - Fixup reverse christmas tree violation
        v2 -> v3
          - Split refcount avoidance optimization into separate patch
            - Fix refcount leak on error in fragmented case
              (thanks to Paolo Abeni for pointing this one out!)
            - Fix refcount inc on zero
            - Test sock_flag SOCK_ZEROCOPY directly in __ip_append_data.
              This is needed since commit 5cf4a853 ("tcp: really ignore
              MSG_ZEROCOPY if no SO_ZEROCOPY") did the same for tcp.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Acked-by: Paolo Abeni <pabeni@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b5947e5d
  20. 30 Nov 2018, 1 commit
  21. 26 Nov 2018, 1 commit
    • net: remove unsafe skb_insert() · 4bffc669
      Eric Dumazet authored
      I do not see how one can effectively use skb_insert() without holding
      some kind of lock. Otherwise other cpus could have changed the list
      right before we have a chance of acquiring list->lock.
      
      The only existing user is in drivers/infiniband/hw/nes/nes_mgt.c, and this
      one probably meant to use __skb_insert(), since it appears nesqp->pau_list
      is protected by nesqp->pau_lock. It looks like nesqp->pau_lock
      could be removed, since nesqp->pau_list.lock could be used instead.
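
      A hedged sketch of the locked pattern such a caller should use instead
      (newsk, prev and list are the caller's own state):

      	unsigned long flags;

      	spin_lock_irqsave(&list->lock, flags);
      	__skb_insert(newsk, prev, prev->next, list);
      	spin_unlock_irqrestore(&list->lock, flags);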
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Faisal Latif <faisal.latif@intel.com>
      Cc: Doug Ledford <dledford@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: linux-rdma <linux-rdma@vger.kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4bffc669
  22. 22 Nov 2018, 1 commit
    • net: skb_scrub_packet(): Scrub offload_fwd_mark · b5dd186d
      Petr Machata authored
      When a packet is trapped and the corresponding SKB marked as
      already-forwarded, it retains this marking even after it is forwarded
      across veth links into another bridge. There, since it ingresses the
      bridge over veth, which doesn't have offload_fwd_mark, it triggers a
      warning in nbp_switchdev_frame_mark().
      
      Then nbp_switchdev_allowed_egress() decides not to allow egress from
      this bridge through another veth, because the SKB is already marked, and
      the mark (of 0) of course matches. Thus the packet is incorrectly
      blocked.
      
      Solve by resetting offload_fwd_mark() in skb_scrub_packet(). That
      function is called from tunnels and also from veth, and thus catches the
      cases where traffic is forwarded between bridges and transformed in a
      way that invalidates the marking.
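
      A hedged sketch of the scrub as it would sit in skb_scrub_packet()
      (field names as of this kernel, guarded by the switchdev option):

      	#ifdef CONFIG_NET_SWITCHDEV
      		skb->offload_fwd_mark = 0;
      		skb->offload_mr_fwd_mark = 0;
      	#endif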
      
      Fixes: 6bc506b4 ("bridge: switchdev: Add forward mark support for stacked devices")
      Fixes: abf4bb6b ("skbuff: Add the offload_mr_fwd_mark field")
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Suggested-by: Ido Schimmel <idosch@mellanox.com>
      Acked-by: Jiri Pirko <jiri@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      b5dd186d
  23. 20 Nov 2018, 1 commit
    • net: skb_scrub_packet(): Scrub offload_fwd_mark · 6f9a5069
      Petr Machata authored
      When a packet is trapped and the corresponding SKB marked as
      already-forwarded, it retains this marking even after it is forwarded
      across veth links into another bridge. There, since it ingresses the
      bridge over veth, which doesn't have offload_fwd_mark, it triggers a
      warning in nbp_switchdev_frame_mark().
      
      Then nbp_switchdev_allowed_egress() decides not to allow egress from
      this bridge through another veth, because the SKB is already marked, and
      the mark (of 0) of course matches. Thus the packet is incorrectly
      blocked.
      
      Solve by resetting offload_fwd_mark() in skb_scrub_packet(). That
      function is called from tunnels and also from veth, and thus catches the
      cases where traffic is forwarded between bridges and transformed in a
      way that invalidates the marking.
      Signed-off-by: Petr Machata <petrm@mellanox.com>
      Suggested-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: Ido Schimmel <idosch@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      6f9a5069
  24. 17 Nov 2018, 1 commit
  25. 16 Nov 2018, 2 commits
  26. 09 Nov 2018, 2 commits