1. 17 Aug, 2017 1 commit
  2. 14 Aug, 2017 2 commits
    • tap: XDP support · 761876c8
      Jason Wang authored
      This patch tries to implement XDP for tun. The implementation is
      split into two parts:
      
      - fast path: small, non-GSO packets. We try to do XDP at the page
        level before build_skb() (see the sketch after this entry). For
        XDP_TX, since creating/destroying queues is completely under the
        control of userspace, it is implemented through the generic XDP
        helper after the skb has been built. This could be optimized in
        the future.
      - slow path: big or GSO packets. We do XDP after the skb has been
        created, through the generic XDP helpers.
      
      Tests were done through pktgen with small packets.
      
      xdp1 test shows ~41.1% improvement:
      
      Before: ~1.7Mpps
      After:  ~2.3Mpps
      
      xdp_redirect to ixgbe shows ~60% improvement:
      
      Before: ~0.8Mpps
      After:  ~1.38Mpps
      Suggested-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      761876c8
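      A minimal sketch of the fast-path idea described above: run the BPF
      program on the raw page before any skb exists, and only call
      build_skb() for XDP_PASS. This is illustrative rather than the
      literal patch; tun_xdp_rx() and the headroom handling are
      assumptions, while bpf_prog_run_xdp(), build_skb() and the XDP_*
      action codes are the real kernel interfaces.

          /* Hypothetical helper: XDP at page level, before build_skb(). */
          static struct sk_buff *tun_xdp_rx(struct bpf_prog *xdp_prog,
                                            void *buf, unsigned int len,
                                            unsigned int headroom,
                                            unsigned int buflen)
          {
                  struct xdp_buff xdp;
                  struct sk_buff *skb;

                  xdp.data_hard_start = buf;
                  xdp.data = buf + headroom;
                  xdp.data_end = xdp.data + len;

                  switch (bpf_prog_run_xdp(xdp_prog, &xdp)) {
                  case XDP_PASS:
                          break;          /* build an skb below */
                  case XDP_DROP:
                          put_page(virt_to_head_page(buf));
                          return NULL;    /* dropped without any skb work */
                  default:
                          break;          /* XDP_TX/XDP_REDIRECT: handled via
                                             generic XDP after build_skb() */
                  }

                  skb = build_skb(buf, buflen);
                  if (!skb)
                          return ERR_PTR(-ENOMEM);
                  skb_reserve(skb, headroom);
                  skb_put(skb, len);
                  return skb;
          }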
    • tap: use build_skb() for small packet · 66ccbc9c
      Jason Wang authored
      In the past we used tun_alloc_skb(), which calls
      sock_alloc_send_pskb(), to allocate skbs. This socket-based method
      is not suitable for high-speed userspace such as virtualization,
      which usually:
      
      - ignores sk_sndbuf (INT_MAX) and expects to receive packets as
        fast as possible
      - does not want to block at sendmsg()
      
      To eliminate the above overheads, this patch tries to use
      build_skb() for small packets. We do this only when the following
      conditions are all met (see the check sketched after this entry):
      
      - TAP instead of TUN
      - sk_sndbuf is INT_MAX
      - the caller does not want to be blocked
      - zerocopy is not used
      - the packet is small enough to use build_skb()
      
      Pktgen from guest to host shows ~11% improvement for rx pps of tap:
      
      Before: ~1.70Mpps
      After : ~1.88Mpps
      
      More importantly, this makes it possible to implement XDP for tap
      before creating skbs.
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      66ccbc9c
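      The gating logic reads naturally as a single predicate. Below is a
      sketch of the five conditions above, close in shape to the
      tun_can_build_skb() check from this patch; TUN_RX_PAD and the exact
      length test are simplified assumptions:

          static bool tun_can_build_skb(struct tun_struct *tun,
                                        struct tun_file *tfile,
                                        int len, int noblock, bool zerocopy)
          {
                  if ((tun->flags & TUN_TYPE_MASK) != IFF_TAP)
                          return false;   /* TAP only, not TUN */
                  if (tfile->socket.sk->sk_sndbuf != INT_MAX)
                          return false;   /* sndbuf accounting is in use */
                  if (!noblock)
                          return false;   /* caller is willing to block */
                  if (zerocopy)
                          return false;   /* zerocopy has its own path */
                  /* data + headroom + skb_shared_info must fit one page */
                  if (SKB_DATA_ALIGN(len + TUN_RX_PAD) +
                      SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) > PAGE_SIZE)
                          return false;
                  return true;
          }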
  3. 04 Aug, 2017 1 commit
    • sock: enable MSG_ZEROCOPY · 1f8b977a
      Willem de Bruijn authored
      Prepare the datapath for refcounted ubuf_info. Clone ubuf_info with
      skb_zerocopy_clone() wherever needed due to skb split, merge, resize
      or clone.
      
      Split skb_orphan_frags into two variants. The split, merge, etc.
      paths support reference-counted zerocopy buffers, so they do not do
      a deep copy. Add skb_orphan_frags_rx for paths that may loop
      packets to receive sockets. That is not allowed, as it may cause
      unbounded latency. Deep copy all zerocopy buffers, ref-counted or
      not, in this path.
      
      The exact locations to modify were chosen by exhaustively searching
      through all code that might modify skb_frag references and/or the
      SKBTX_DEV_ZEROCOPY tx_flags bit.
      
      The changes err on the safe side, in two ways.
      
      (1) legacy ubuf_info paths virtio and tap are not modified. They keep
          a 1:1 ubuf_info to sk_buff relationship. Calls to skb_orphan_frags
          still call skb_copy_ubufs and thus copy frags in this case.
      
      (2) not all copies deep in the stack are addressed yet. skb_shift,
          skb_split and skb_try_coalesce can be refined to avoid copying.
          These are not in the hot path and this patch is hairy enough as
          is, so that is left for future refinement.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1f8b977a
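      For context, the userspace side of the MSG_ZEROCOPY series looks
      roughly like this: opt in with SO_ZEROCOPY, send with MSG_ZEROCOPY,
      then reap the completion from the socket error queue once the
      kernel no longer references the pages. A simplified sketch, with
      blocking details and error handling trimmed:

          #include <errno.h>
          #include <linux/errqueue.h>
          #include <sys/socket.h>

          static int send_zerocopy(int fd, const void *buf, size_t len)
          {
                  int one = 1;
                  char control[128];
                  struct msghdr msg = {
                          .msg_control = control,
                          .msg_controllen = sizeof(control),
                  };

                  if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one)))
                          return -1;
                  if (send(fd, buf, len, MSG_ZEROCOPY) != (ssize_t)len)
                          return -1;      /* buf must stay untouched for now */

                  /* Completion arrives as SO_EE_ORIGIN_ZEROCOPY on the
                   * error queue; reads there never block. */
                  while (recvmsg(fd, &msg, MSG_ERRQUEUE) == -1 && errno == EAGAIN)
                          ;               /* real code would poll() instead */
                  return 0;
          }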
  4. 25 Jul, 2017 1 commit
  5. 18 Jul, 2017 1 commit
  6. 27 Jun, 2017 1 commit
  7. 08 Jun, 2017 1 commit
    • net: Fix inconsistent teardown and release of private netdev state. · cf124db5
      David S. Miller authored
      Network devices can allocate resources and private memory using
      netdev_ops->ndo_init().  However, the release of these resources
      can occur in one of two different places.
      
      Either netdev_ops->ndo_uninit() or netdev->destructor().
      
      The decision of which operation frees the resources depends upon
      whether it is necessary for all netdev refs to be released before it
      is safe to perform the freeing.
      
      netdev_ops->ndo_uninit() presumably can occur right after the
      NETDEV_UNREGISTER notifier completes and the unicast and multicast
      address lists are flushed.
      
      netdev->destructor(), on the other hand, does not run until the
      netdev references all go away.
      
      Further complicating the situation is that netdev->destructor()
      almost universally also does a free_netdev().
      
      This creates a problem for the logic in register_netdevice(),
      because all callers of register_netdevice() manage the freeing of
      the netdev and invoke free_netdev(dev) if register_netdevice()
      fails.
      
      If netdev_ops->ndo_init() succeeds, but something else fails inside
      of register_netdevice(), it does call netdev_ops->ndo_uninit().  But
      it is not able to invoke netdev->destructor().
      
      This is because netdev->destructor() will do a free_netdev() and
      then the caller of register_netdevice() will do the same.
      
      However, this means that the resources that would normally be released
      by netdev->destructor() will not be.
      
      Over the years drivers have added local hacks to deal with this, by
      invoking their destructor parts by hand when register_netdevice()
      fails.
      
      Many drivers do not try to deal with this, and instead we have leaks.
      
      Let's close this hole by formalizing the distinction between what
      private things need to be freed up by netdev->destructor() and whether
      the driver needs unregister_netdevice() to perform the free_netdev().
      
      netdev->priv_destructor() performs all actions to free up the private
      resources that used to be freed by netdev->destructor(), except for
      free_netdev().
      
      netdev->needs_free_netdev is a boolean that indicates whether
      free_netdev() should be done at the end of unregister_netdevice().
      
      Now, register_netdevice() can sanely release all resources after
      netdev_ops->ndo_init() succeeds, by invoking both netdev_ops->ndo_uninit()
      and netdev->priv_destructor().
      
      And at the end of unregister_netdevice(), we invoke
      netdev->priv_destructor() and optionally call free_netdev().
      Signed-off-by: David S. Miller <davem@davemloft.net>
      cf124db5
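      Under the new scheme a driver splits its teardown like this. The
      foo_* names are hypothetical; priv_destructor and needs_free_netdev
      are the real fields this patch introduces:

          struct foo_priv {
                  struct foo_stats __percpu *stats;
          };

          static void foo_free(struct net_device *dev)
          {
                  struct foo_priv *priv = netdev_priv(dev);

                  free_percpu(priv->stats);       /* private state only ... */
                  /* ... free_netdev(dev) is no longer called here */
          }

          static void foo_setup(struct net_device *dev)
          {
                  dev->priv_destructor = foo_free;  /* was: dev->destructor */
                  dev->needs_free_netdev = true;    /* core does free_netdev() */
          }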
  8. 07 Jun, 2017 1 commit
  9. 18 May, 2017 2 commits
  10. 22 Mar, 2017 1 commit
  11. 14 Mar, 2017 2 commits
  12. 10 Mar, 2017 1 commit
  13. 02 Mar, 2017 1 commit
  14. 07 Feb, 2017 1 commit
  15. 21 Jan, 2017 1 commit
  16. 19 Jan, 2017 1 commit
    • tun: rx batching · 5503fcec
      Jason Wang authored
      We can only process one packet at a time during sendmsg(). This
      often leads to poor cache utilization under heavy load. So this
      patch does some batching during rx before submitting packets to the
      host network stack. This works by accepting MSG_MORE as a hint from
      the sendmsg() caller: if it is set, the packet is batched
      temporarily in a linked list, and the whole batch is submitted once
      MSG_MORE is cleared (see the sketch after this entry).
      
      Tests were done by pktgen (burst=128) in guest over mlx4(noqueue) on host:
      
                                       Mpps  -+%
          rx-frames = 0                0.91  +0%
          rx-frames = 4                1.00  +9.8%
          rx-frames = 8                1.00  +9.8%
          rx-frames = 16               1.01  +10.9%
          rx-frames = 32               1.07  +17.5%
          rx-frames = 48               1.07  +17.5%
          rx-frames = 64               1.08  +18.6%
          rx-frames = 64 (no MSG_MORE) 0.91  +0%
      
      Users are allowed to change the number of batched packets per
      device through ethtool -C rx-frames. NAPI_POLL_WEIGHT is used as an
      upper limit to prevent bh from being disabled for too long.
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      5503fcec
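      The batching policy is small enough to sketch in full. This follows
      the shape of the tun_rx_batched() added by the patch, simplified
      for illustration:

          static void tun_rx_batched(struct tun_struct *tun,
                                     struct tun_file *tfile,
                                     struct sk_buff *skb, int more)
          {
                  struct sk_buff_head *queue = &tfile->sk.sk_write_queue;
                  struct sk_buff_head process_queue;
                  u32 rx_batched = tun->rx_batched; /* ethtool -C rx-frames */
                  bool rcv = false;

                  if (!rx_batched || (!more && skb_queue_empty(queue))) {
                          local_bh_disable();
                          netif_receive_skb(skb);   /* no batching */
                          local_bh_enable();
                          return;
                  }

                  spin_lock(&queue->lock);
                  if (!more || skb_queue_len(queue) == rx_batched) {
                          /* batch full or MSG_MORE cleared: take the list */
                          __skb_queue_head_init(&process_queue);
                          skb_queue_splice_tail_init(queue, &process_queue);
                          rcv = true;
                  } else {
                          __skb_queue_tail(queue, skb); /* keep batching */
                  }
                  spin_unlock(&queue->lock);

                  if (rcv) {
                          struct sk_buff *nskb;

                          local_bh_disable();
                          while ((nskb = __skb_dequeue(&process_queue)))
                                  netif_receive_skb(nskb);
                          netif_receive_skb(skb); /* the one that flushed */
                          local_bh_enable();
                  }
          }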
  17. 09 Jan, 2017 1 commit
  18. 25 Dec, 2016 1 commit
  19. 07 Dec, 2016 1 commit
  20. 06 Dec, 2016 1 commit
    • [iov_iter] new primitives - copy_from_iter_full() and friends · cbbd26b8
      Al Viro authored
      copy_from_iter_full(), copy_from_iter_full_nocache() and
      csum_and_copy_from_iter_full() - counterparts of copy_from_iter()
      et al. that advance the iterator only in case of a successful full
      copy and return whether it was successful or not.
      
      Convert some obvious users.  *NOTE* - do not blindly assume that
      something is a good candidate for those unless you are sure that
      not advancing the iov_iter in the failure case is the right thing
      here.  Anything that does short-read/short-write kinds of things
      (or is in a loop, etc.) is unlikely to be a good candidate.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      cbbd26b8
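      A typical conversion looks like this; the caller no longer has to
      compare the return value against the requested length:

          /* before: a partial copy leaves the iterator advanced */
          if (copy_from_iter(&hdr, sizeof(hdr), from) != sizeof(hdr))
                  return -EFAULT;

          /* after: advances only on a full copy, returns bool */
          if (!copy_from_iter_full(&hdr, sizeof(hdr), from))
                  return -EFAULT;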
  21. 01 Dec, 2016 1 commit
  22. 25 Nov, 2016 1 commit
  23. 19 Nov, 2016 2 commits
  24. 31 Oct, 2016 1 commit
  25. 30 Oct, 2016 1 commit
  26. 21 Oct, 2016 1 commit
    • net: use core MTU range checking in core net infra · 91572088
      Jarod Wilson authored
      geneve:
      - Merge __geneve_change_mtu back into geneve_change_mtu, set max_mtu
      - This one isn't quite as straightforward as the others and could
        use some closer inspection and testing
      
      macvlan:
      - set min/max_mtu
      
      tun:
      - set min/max_mtu, remove tun_net_change_mtu
      
      vxlan:
      - Merge __vxlan_change_mtu back into vxlan_change_mtu
      - Set max_mtu to IP_MAX_MTU and retain dynamic MTU range checks in
        change_mtu function
      - This one is also not as straightforward and could use closer
        inspection and testing from vxlan folks
      
      bridge:
      - set max_mtu of IP_MAX_MTU and retain dynamic MTU range checks in
        change_mtu function
      
      openvswitch:
      - set min/max_mtu, remove internal_dev_change_mtu
      - note: max_mtu wasn't checked previously, it's been set to 65535, which
        is the largest possible size supported
      
      sch_teql:
      - set min/max_mtu (note: max_mtu previously unchecked, used max of 65535)
      
      macsec:
      - min_mtu = 0, max_mtu = 65535
      
      macvlan:
      - min_mtu = 0, max_mtu = 65535
      
      ntb_netdev:
      - min_mtu = 0, max_mtu = 65535
      
      veth:
      - min_mtu = 68, max_mtu = 65535
      
      8021q:
      - min_mtu = 0, max_mtu = 65535
      
      CC: netdev@vger.kernel.org
      CC: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      CC: Hannes Frederic Sowa <hannes@stressinduktion.org>
      CC: Tom Herbert <tom@herbertland.com>
      CC: Daniel Borkmann <daniel@iogearbox.net>
      CC: Alexander Duyck <alexander.h.duyck@intel.com>
      CC: Paolo Abeni <pabeni@redhat.com>
      CC: Jiri Benc <jbenc@redhat.com>
      CC: WANG Cong <xiyou.wangcong@gmail.com>
      CC: Roopa Prabhu <roopa@cumulusnetworks.com>
      CC: Pravin B Shelar <pshelar@ovn.org>
      CC: Sabrina Dubroca <sd@queasysnail.net>
      CC: Patrick McHardy <kaber@trash.net>
      CC: Stephen Hemminger <stephen@networkplumber.org>
      CC: Pravin Shelar <pshelar@nicira.com>
      CC: Maxim Krasnyansky <maxk@qti.qualcomm.com>
      Signed-off-by: Jarod Wilson <jarod@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      91572088
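      For each converted driver the pattern is the same: declare the
      legal range once and drop the per-driver bounds check. A sketch,
      where foo_setup() is hypothetical while min_mtu, max_mtu,
      ETH_MIN_MTU and ETH_MAX_MTU are the real fields and constants:

          static void foo_setup(struct net_device *dev)
          {
                  dev->min_mtu = ETH_MIN_MTU;     /* 68 */
                  dev->max_mtu = ETH_MAX_MTU;     /* 65535 */
                  /* dev_set_mtu() now rejects out-of-range values in the
                   * core, so no ndo_change_mtu range check is needed. */
          }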
  27. 24 Aug, 2016 1 commit
    • tun: fix transmit timestamp support · 7b996243
      Soheil Hassas Yeganeh authored
      Instead of using sock_tx_timestamp, use skb_tx_timestamp to record
      software transmit timestamp of a packet.
      
      sock_tx_timestamp resets and overrides the tx_flags of the skb.
      The function is intended to be called from within the protocol
      layer when creating the skb, not from a device driver. This is
      inconsistent with other drivers and will cause issues for TCP.
      
      In TCP, we intend to sample the timestamps for the last byte
      for each sendmsg/sendpage. For that reason, tcp_sendmsg calls
      tcp_tx_timestamp only with the last skb that it generates.
      For example, if a 128KB message is split into two 64KB packets
      we want to sample the SND timestamp of the last packet. The current
      code in the tun driver, however, will result in sampling the SND
      timestamp for both packets.
      
      Also, when the last packet is split into smaller packets for
      retransmission (see tcp_fragment), the tun driver will record
      timestamps for all of the retransmitted packets and not only the
      last packet.
      
      Fixes: eda29772 (tun: Support software transmit time stamping.)
      Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: Francis Yan <francisyyan@google.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      7b996243
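      The driver-side pattern, sketched with a hypothetical foo_xmit():
      skb_tx_timestamp() only acts on the tx_flags the protocol layer
      already set, whereas sock_tx_timestamp() (re)writes them, which is
      why the latter belongs in the protocol code that builds the skb:

          static netdev_tx_t foo_xmit(struct sk_buff *skb,
                                      struct net_device *dev)
          {
                  /* ... hand the packet to hardware / userspace ... */
                  skb_tx_timestamp(skb);  /* samples a SW timestamp only if
                                             the skb requested one */
                  return NETDEV_TX_OK;
          }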
  28. 21 Aug, 2016 2 commits
  29. 09 Jul, 2016 1 commit
    • tun: Don't assume type tun in tun_device_event · 86dfb4ac
      Craig Gallek authored
      The referenced change added a netdevice notifier for processing
      device queue size events.  These events are fired for all devices,
      but the registered callback assumed they only occurred for tun
      devices.  This fix adds a check (borrowed from macvtap.c, sketched
      after this entry) to discard non-tun device events.
      
      For reference, this fixes the following splat:
      [   71.505935] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
      [   71.513870] IP: [<ffffffff8153c1a0>] tun_device_event+0x110/0x340
      [   71.519906] PGD 3f41f56067 PUD 3f264b7067 PMD 0
      [   71.524497] Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
      [   71.529374] gsmi: Log Shutdown Reason 0x03
      [   71.533417] Modules linked in:[   71.533826] mlx4_en: eth1: Link Up
      
      [   71.539616]  bonding w1_therm wire cdc_acm ehci_pci ehci_hcd mlx4_en ib_uverbs mlx4_ib ib_core mlx4_core
      [   71.549282] CPU: 12 PID: 7915 Comm: set.ixion-haswe Not tainted 4.7.0-dbx-DEV #8
      [   71.556586] Hardware name: Intel Grantley,Wellsburg/Ixion_IT_15, BIOS 2.58.0 05/03/2016
      [   71.564495] task: ffff887f00bb20c0 ti: ffff887f00798000 task.ti: ffff887f00798000
      [   71.571894] RIP: 0010:[<ffffffff8153c1a0>]  [<ffffffff8153c1a0>] tun_device_event+0x110/0x340
      [   71.580327] RSP: 0018:ffff887f0079bbd8  EFLAGS: 00010202
      [   71.585576] RAX: fffffffffffffae8 RBX: ffff887ef6d03378 RCX: 0000000000000000
      [   71.592624] RDX: 0000000000000000 RSI: 0000000000000028 RDI: 0000000000000000
      [   71.599675] RBP: ffff887f0079bc48 R08: 0000000000000000 R09: 0000000000000001
      [   71.606730] R10: 0000000000000004 R11: 0000000000000000 R12: 0000000000000010
      [   71.613780] R13: 0000000000000000 R14: 0000000000000001 R15: ffff887f0079bd00
      [   71.620832] FS:  00007f5cdc581700(0000) GS:ffff883f7f700000(0000) knlGS:0000000000000000
      [   71.628826] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   71.634500] CR2: 0000000000000010 CR3: 0000003f3eb62000 CR4: 00000000001406e0
      [   71.641549] Stack:
      [   71.643533]  ffff887f0079bc08 0000000000000246 000000000000001e ffff887ef6d00000
      [   71.650871]  ffff887f0079bd00 0000000000000000 0000000000000000 ffffffff00000000
      [   71.658210]  ffff887f0079bc48 ffffffff81d24070 00000000fffffff9 ffffffff81cec7a0
      [   71.665549] Call Trace:
      [   71.667975]  [<ffffffff810eeb0d>] notifier_call_chain+0x5d/0x80
      [   71.673823]  [<ffffffff816365d0>] ? show_tx_maxrate+0x30/0x30
      [   71.679502]  [<ffffffff810eeb3e>] __raw_notifier_call_chain+0xe/0x10
      [   71.685778]  [<ffffffff810eeb56>] raw_notifier_call_chain+0x16/0x20
      [   71.691976]  [<ffffffff8160eb30>] call_netdevice_notifiers_info+0x40/0x70
      [   71.698681]  [<ffffffff8160ec36>] call_netdevice_notifiers+0x16/0x20
      [   71.704956]  [<ffffffff81636636>] change_tx_queue_len+0x66/0x90
      [   71.710807]  [<ffffffff816381ef>] netdev_store.isra.5+0xbf/0xd0
      [   71.716658]  [<ffffffff81638350>] tx_queue_len_store+0x50/0x60
      [   71.722431]  [<ffffffff814a6798>] dev_attr_store+0x18/0x30
      [   71.727857]  [<ffffffff812ea3ff>] sysfs_kf_write+0x4f/0x70
      [   71.733274]  [<ffffffff812e9507>] kernfs_fop_write+0x147/0x1d0
      [   71.739045]  [<ffffffff81134a4f>] ? rcu_read_lock_sched_held+0x8f/0xa0
      [   71.745499]  [<ffffffff8125a108>] __vfs_write+0x28/0x120
      [   71.750748]  [<ffffffff8111b137>] ? percpu_down_read+0x57/0x90
      [   71.756516]  [<ffffffff8125d7d8>] ? __sb_start_write+0xc8/0xe0
      [   71.762278]  [<ffffffff8125d7d8>] ? __sb_start_write+0xc8/0xe0
      [   71.768038]  [<ffffffff8125bd5e>] vfs_write+0xbe/0x1b0
      [   71.773113]  [<ffffffff8125c092>] SyS_write+0x52/0xa0
      [   71.778110]  [<ffffffff817528e5>] entry_SYSCALL_64_fastpath+0x18/0xa8
      [   71.784472] Code: 45 31 f6 48 8b 93 78 33 00 00 48 81 c3 78 33 00 00 48 39 d3 48 8d 82 e8 fa ff ff 74 25 48 8d b0 40 05 00 00 49 63 d6 41 83 c6 01 <49> 89 34 d4 48 8b 90 18 05 00 00 48 39 d3 48 8d 82 e8 fa ff ff
      [   71.803655] RIP  [<ffffffff8153c1a0>] tun_device_event+0x110/0x340
      [   71.809769]  RSP <ffff887f0079bbd8>
      [   71.813213] CR2: 0000000000000010
      [   71.816512] ---[ end trace 4db6449606319f73 ]---
      
      Fixes: 1576d986 ("tun: switch to use skb array for tx")
      Signed-off-by: Craig Gallek <kraig@google.com>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      86dfb4ac
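      The guard itself is one comparison. A sketch of its shape, assuming
      the fix keys off tun's rtnl_link_ops (the same idiom macvtap uses),
      since netdev_priv() is only a tun_struct for devices created by tun
      itself:

          static int tun_device_event(struct notifier_block *unused,
                                      unsigned long event, void *ptr)
          {
                  struct net_device *dev = netdev_notifier_info_to_dev(ptr);

                  if (dev->rtnl_link_ops != &tun_link_ops)
                          return NOTIFY_DONE;     /* not a tun device */

                  /* ... safe to treat netdev_priv(dev) as tun_struct and
                   * handle e.g. NETDEV_CHANGE_TX_QUEUE_LEN here ... */
                  return NOTIFY_DONE;
          }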
  30. 05 Jul, 2016 1 commit
  31. 01 Jul, 2016 1 commit
    • tun: switch to use skb array for tx · 1576d986
      Jason Wang authored
      We used to queue tx packets in sk_receive_queue. This is less
      efficient, since it requires spinlocks to synchronize between
      producer and consumer.
      
      This patch tries to address this by:
      
      - switching from sk_receive_queue to an skb_array (sketched after
        this entry), and resizing it when tx_queue_len is changed.
      - introducing a new proto_ops, peek_len, which is used for peeking
        at the skb length.
      - implementing a tun version of peek_len for vhost_net to use, and
        converting vhost_net to use peek_len where possible.
      
      Pktgen test shows about 15.3% improvement on guest receiving pps for small
      buffers:
      
      Before: ~1300000pps
      After : ~1500000pps
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1576d986
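      The point of skb_array is that the producer and consumer sides use
      separate locks and cache lines. A sketch of the two paths, where
      queue_tx_skb()/dequeue_tx_skb() are hypothetical wrappers around
      the real skb_array API from linux/skb_array.h:

          /* xmit side: only the producer lock is taken */
          static int queue_tx_skb(struct skb_array *ring, struct sk_buff *skb)
          {
                  if (skb_array_produce(ring, skb))
                          return -ENOSPC;         /* ring full: drop */
                  return 0;
          }

          /* read side (tun_do_read / vhost): only the consumer index */
          static struct sk_buff *dequeue_tx_skb(struct skb_array *ring)
          {
                  return skb_array_consume(ring); /* NULL when empty */
          }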
  32. 16 Jun, 2016 1 commit
  33. 11 Jun, 2016 1 commit
  34. 21 May, 2016 1 commit
    • tuntap: correctly wake up process during uninit · addf8fc4
      Jason Wang authored
      We used to check dev->reg_state against NETREG_REGISTERED each time
      we were woken up. But after commit 9e641bdc ("net-tun:
      restructure tun_do_read for better sleep/wakeup efficiency"), we
      use skb_recv_datagram(), which does not check dev->reg_state. As a
      result, if we delete a tun/tap device while a process is blocked in
      a read, the device will wait forever for the reference count held
      by that process.
      
      Fix this by setting RCV_SHUTDOWN, which is checked by
      skb_recv_datagram(), before trying to wake up the process during
      uninit (see the sketch after this entry).
      
      Fixes: 9e641bdc ("net-tun: restructure tun_do_read for better
      sleep/wakeup efficiency")
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Xi Wang <xii@google.com>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Acked-by: Eric Dumazet <edumazet@google.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      addf8fc4
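      A sketch of the uninit-side fix, with a hypothetical helper name
      and wakeup call: mark the socket shut down for reads first, then
      wake the sleeper, so a reader blocked in skb_recv_datagram() sees
      RCV_SHUTDOWN, returns, and drops its device reference instead of
      going back to sleep:

          static void tun_sock_shutdown_rx(struct sock *sk)
          {
                  sk->sk_shutdown |= RCV_SHUTDOWN; /* seen by datagram wait */
                  sk->sk_data_ready(sk);           /* wake blocked readers */
          }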
  35. 29 Apr, 2016 1 commit
    • tuntap: calculate rps hash only when needed · 3df97ba8
      Jason Wang authored
      There is no need to calculate the rps hash if rps is not enabled,
      so this patch exports rps_needed and checks it before trying to get
      the rps hash (see the sketch after this entry). Tests (using pktgen
      to inject packets into the guest) show this can improve pps by
      about 13% (when rps is disabled).
      
      Before:
      ~1150000 pps
      After:
      ~1300000 pps
      
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      ----
      Changes from V1:
      - Fix build when CONFIG_RPS is not set
      Signed-off-by: David S. Miller <davem@davemloft.net>
      3df97ba8
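      The guard in the receive path, roughly; rps_needed is the static
      key this patch exports, and the surrounding code is simplified:

          u32 rxhash = 0;

          #ifdef CONFIG_RPS
          if (static_key_false(&rps_needed))
                  rxhash = skb_get_hash(skb);     /* feeds tun's flow table */
          #endif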