1. 22 3月, 2017 1 次提交
  2. 14 3月, 2017 2 次提交
  3. 10 3月, 2017 1 次提交
  4. 02 3月, 2017 1 次提交
  5. 07 2月, 2017 1 次提交
  6. 21 1月, 2017 1 次提交
  7. 19 1月, 2017 1 次提交
    • J
      tun: rx batching · 5503fcec
      Jason Wang 提交于
      We can only process 1 packet at one time during sendmsg(). This often
      lead bad cache utilization under heavy load. So this patch tries to do
      some batching during rx before submitting them to host network
      stack. This is done through accepting MSG_MORE as a hint from
      sendmsg() caller, if it was set, batch the packet temporarily in a
      linked list and submit them all once MSG_MORE were cleared.
      
      Tests were done by pktgen (burst=128) in guest over mlx4(noqueue) on host:
      
                                       Mpps  -+%
          rx-frames = 0                0.91  +0%
          rx-frames = 4                1.00  +9.8%
          rx-frames = 8                1.00  +9.8%
          rx-frames = 16               1.01  +10.9%
          rx-frames = 32               1.07  +17.5%
          rx-frames = 48               1.07  +17.5%
          rx-frames = 64               1.08  +18.6%
          rx-frames = 64 (no MSG_MORE) 0.91  +0%
      
      User were allowed to change per device batched packets through
      ethtool -C rx-frames. NAPI_POLL_WEIGHT were used as upper limitation
      to prevent bh from being disabled too long.
      Signed-off-by: NJason Wang <jasowang@redhat.com>
      Acked-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5503fcec
  8. 09 1月, 2017 1 次提交
  9. 25 12月, 2016 1 次提交
  10. 07 12月, 2016 1 次提交
  11. 06 12月, 2016 1 次提交
    • A
      [iov_iter] new primitives - copy_from_iter_full() and friends · cbbd26b8
      Al Viro 提交于
      copy_from_iter_full(), copy_from_iter_full_nocache() and
      csum_and_copy_from_iter_full() - counterparts of copy_from_iter()
      et.al., advancing iterator only in case of successful full copy
      and returning whether it had been successful or not.
      
      Convert some obvious users.  *NOTE* - do not blindly assume that
      something is a good candidate for those unless you are sure that
      not advancing iov_iter in failure case is the right thing in
      this case.  Anything that does short read/short write kind of
      stuff (or is in a loop, etc.) is unlikely to be a good one.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      cbbd26b8
  12. 01 12月, 2016 1 次提交
  13. 25 11月, 2016 1 次提交
  14. 19 11月, 2016 2 次提交
  15. 31 10月, 2016 1 次提交
  16. 30 10月, 2016 1 次提交
  17. 21 10月, 2016 1 次提交
    • J
      net: use core MTU range checking in core net infra · 91572088
      Jarod Wilson 提交于
      geneve:
      - Merge __geneve_change_mtu back into geneve_change_mtu, set max_mtu
      - This one isn't quite as straight-forward as others, could use some
        closer inspection and testing
      
      macvlan:
      - set min/max_mtu
      
      tun:
      - set min/max_mtu, remove tun_net_change_mtu
      
      vxlan:
      - Merge __vxlan_change_mtu back into vxlan_change_mtu
      - Set max_mtu to IP_MAX_MTU and retain dynamic MTU range checks in
        change_mtu function
      - This one is also not as straight-forward and could use closer inspection
        and testing from vxlan folks
      
      bridge:
      - set max_mtu of IP_MAX_MTU and retain dynamic MTU range checks in
        change_mtu function
      
      openvswitch:
      - set min/max_mtu, remove internal_dev_change_mtu
      - note: max_mtu wasn't checked previously, it's been set to 65535, which
        is the largest possible size supported
      
      sch_teql:
      - set min/max_mtu (note: max_mtu previously unchecked, used max of 65535)
      
      macsec:
      - min_mtu = 0, max_mtu = 65535
      
      macvlan:
      - min_mtu = 0, max_mtu = 65535
      
      ntb_netdev:
      - min_mtu = 0, max_mtu = 65535
      
      veth:
      - min_mtu = 68, max_mtu = 65535
      
      8021q:
      - min_mtu = 0, max_mtu = 65535
      
      CC: netdev@vger.kernel.org
      CC: Nicolas Dichtel <nicolas.dichtel@6wind.com>
      CC: Hannes Frederic Sowa <hannes@stressinduktion.org>
      CC: Tom Herbert <tom@herbertland.com>
      CC: Daniel Borkmann <daniel@iogearbox.net>
      CC: Alexander Duyck <alexander.h.duyck@intel.com>
      CC: Paolo Abeni <pabeni@redhat.com>
      CC: Jiri Benc <jbenc@redhat.com>
      CC: WANG Cong <xiyou.wangcong@gmail.com>
      CC: Roopa Prabhu <roopa@cumulusnetworks.com>
      CC: Pravin B Shelar <pshelar@ovn.org>
      CC: Sabrina Dubroca <sd@queasysnail.net>
      CC: Patrick McHardy <kaber@trash.net>
      CC: Stephen Hemminger <stephen@networkplumber.org>
      CC: Pravin Shelar <pshelar@nicira.com>
      CC: Maxim Krasnyansky <maxk@qti.qualcomm.com>
      Signed-off-by: NJarod Wilson <jarod@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      91572088
  18. 24 8月, 2016 1 次提交
    • S
      tun: fix transmit timestamp support · 7b996243
      Soheil Hassas Yeganeh 提交于
      Instead of using sock_tx_timestamp, use skb_tx_timestamp to record
      software transmit timestamp of a packet.
      
      sock_tx_timestamp resets and overrides the tx_flags of the skb.
      The function is intended to be called from within the protocol
      layer when creating the skb, not from a device driver. This is
      inconsistent with other drivers and will cause issues for TCP.
      
      In TCP, we intend to sample the timestamps for the last byte
      for each sendmsg/sendpage. For that reason, tcp_sendmsg calls
      tcp_tx_timestamp only with the last skb that it generates.
      For example, if a 128KB message is split into two 64KB packets
      we want to sample the SND timestamp of the last packet. The current
      code in the tun driver, however, will result in sampling the SND
      timestamp for both packets.
      
      Also, when the last packet is split into smaller packets for
      retranmission (see tcp_fragment), the tun driver will record
      timestamps for all of the retransmitted packets and not only the
      last packet.
      
      Fixes: eda29772 (tun: Support software transmit time stamping.)
      Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Signed-off-by: NFrancis Yan <francisyyan@google.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      7b996243
  19. 21 8月, 2016 2 次提交
  20. 09 7月, 2016 1 次提交
    • C
      tun: Don't assume type tun in tun_device_event · 86dfb4ac
      Craig Gallek 提交于
      The referenced change added a netlink notifier for processing
      device queue size events.  These events are fired for all devices
      but the registered callback assumed they only occurred for tun
      devices.  This fix adds a check (borrowed from macvtap.c) to discard
      non-tun device events.
      
      For reference, this fixes the following splat:
      [   71.505935] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
      [   71.513870] IP: [<ffffffff8153c1a0>] tun_device_event+0x110/0x340
      [   71.519906] PGD 3f41f56067 PUD 3f264b7067 PMD 0
      [   71.524497] Oops: 0002 [#1] SMP DEBUG_PAGEALLOC
      [   71.529374] gsmi: Log Shutdown Reason 0x03
      [   71.533417] Modules linked in:[   71.533826] mlx4_en: eth1: Link Up
      
      [   71.539616]  bonding w1_therm wire cdc_acm ehci_pci ehci_hcd mlx4_en ib_uverbs mlx4_ib ib_core mlx4_core
      [   71.549282] CPU: 12 PID: 7915 Comm: set.ixion-haswe Not tainted 4.7.0-dbx-DEV #8
      [   71.556586] Hardware name: Intel Grantley,Wellsburg/Ixion_IT_15, BIOS 2.58.0 05/03/2016
      [   71.564495] task: ffff887f00bb20c0 ti: ffff887f00798000 task.ti: ffff887f00798000
      [   71.571894] RIP: 0010:[<ffffffff8153c1a0>]  [<ffffffff8153c1a0>] tun_device_event+0x110/0x340
      [   71.580327] RSP: 0018:ffff887f0079bbd8  EFLAGS: 00010202
      [   71.585576] RAX: fffffffffffffae8 RBX: ffff887ef6d03378 RCX: 0000000000000000
      [   71.592624] RDX: 0000000000000000 RSI: 0000000000000028 RDI: 0000000000000000
      [   71.599675] RBP: ffff887f0079bc48 R08: 0000000000000000 R09: 0000000000000001
      [   71.606730] R10: 0000000000000004 R11: 0000000000000000 R12: 0000000000000010
      [   71.613780] R13: 0000000000000000 R14: 0000000000000001 R15: ffff887f0079bd00
      [   71.620832] FS:  00007f5cdc581700(0000) GS:ffff883f7f700000(0000) knlGS:0000000000000000
      [   71.628826] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   71.634500] CR2: 0000000000000010 CR3: 0000003f3eb62000 CR4: 00000000001406e0
      [   71.641549] Stack:
      [   71.643533]  ffff887f0079bc08 0000000000000246 000000000000001e ffff887ef6d00000
      [   71.650871]  ffff887f0079bd00 0000000000000000 0000000000000000 ffffffff00000000
      [   71.658210]  ffff887f0079bc48 ffffffff81d24070 00000000fffffff9 ffffffff81cec7a0
      [   71.665549] Call Trace:
      [   71.667975]  [<ffffffff810eeb0d>] notifier_call_chain+0x5d/0x80
      [   71.673823]  [<ffffffff816365d0>] ? show_tx_maxrate+0x30/0x30
      [   71.679502]  [<ffffffff810eeb3e>] __raw_notifier_call_chain+0xe/0x10
      [   71.685778]  [<ffffffff810eeb56>] raw_notifier_call_chain+0x16/0x20
      [   71.691976]  [<ffffffff8160eb30>] call_netdevice_notifiers_info+0x40/0x70
      [   71.698681]  [<ffffffff8160ec36>] call_netdevice_notifiers+0x16/0x20
      [   71.704956]  [<ffffffff81636636>] change_tx_queue_len+0x66/0x90
      [   71.710807]  [<ffffffff816381ef>] netdev_store.isra.5+0xbf/0xd0
      [   71.716658]  [<ffffffff81638350>] tx_queue_len_store+0x50/0x60
      [   71.722431]  [<ffffffff814a6798>] dev_attr_store+0x18/0x30
      [   71.727857]  [<ffffffff812ea3ff>] sysfs_kf_write+0x4f/0x70
      [   71.733274]  [<ffffffff812e9507>] kernfs_fop_write+0x147/0x1d0
      [   71.739045]  [<ffffffff81134a4f>] ? rcu_read_lock_sched_held+0x8f/0xa0
      [   71.745499]  [<ffffffff8125a108>] __vfs_write+0x28/0x120
      [   71.750748]  [<ffffffff8111b137>] ? percpu_down_read+0x57/0x90
      [   71.756516]  [<ffffffff8125d7d8>] ? __sb_start_write+0xc8/0xe0
      [   71.762278]  [<ffffffff8125d7d8>] ? __sb_start_write+0xc8/0xe0
      [   71.768038]  [<ffffffff8125bd5e>] vfs_write+0xbe/0x1b0
      [   71.773113]  [<ffffffff8125c092>] SyS_write+0x52/0xa0
      [   71.778110]  [<ffffffff817528e5>] entry_SYSCALL_64_fastpath+0x18/0xa8
      [   71.784472] Code: 45 31 f6 48 8b 93 78 33 00 00 48 81 c3 78 33 00 00 48 39 d3 48 8d 82 e8 fa ff ff 74 25 48 8d b0 40 05 00 00 49 63 d6 41 83 c6 01 <49> 89 34 d4 48 8b 90 18 05 00 00 48 39 d3 48 8d 82 e8 fa ff ff
      [   71.803655] RIP  [<ffffffff8153c1a0>] tun_device_event+0x110/0x340
      [   71.809769]  RSP <ffff887f0079bbd8>
      [   71.813213] CR2: 0000000000000010
      [   71.816512] ---[ end trace 4db6449606319f73 ]---
      
      Fixes: 1576d986 ("tun: switch to use skb array for tx")
      Signed-off-by: NCraig Gallek <kraig@google.com>
      Acked-by: NJason Wang <jasowang@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      86dfb4ac
  21. 05 7月, 2016 1 次提交
  22. 01 7月, 2016 1 次提交
    • J
      tun: switch to use skb array for tx · 1576d986
      Jason Wang 提交于
      We used to queue tx packets in sk_receive_queue, this is less
      efficient since it requires spinlocks to synchronize between producer
      and consumer.
      
      This patch tries to address this by:
      
      - switch from sk_receive_queue to a skb_array, and resize it when
        tx_queue_len was changed.
      - introduce a new proto_ops peek_len which was used for peeking the
        skb length.
      - implement a tun version of peek_len for vhost_net to use and convert
        vhost_net to use peek_len if possible.
      
      Pktgen test shows about 15.3% improvement on guest receiving pps for small
      buffers:
      
      Before: ~1300000pps
      After : ~1500000pps
      Signed-off-by: NJason Wang <jasowang@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1576d986
  23. 16 6月, 2016 1 次提交
  24. 11 6月, 2016 1 次提交
  25. 21 5月, 2016 1 次提交
    • J
      tuntap: correctly wake up process during uninit · addf8fc4
      Jason Wang 提交于
      We used to check dev->reg_state against NETREG_REGISTERED after each
      time we are woke up. But after commit 9e641bdc ("net-tun:
      restructure tun_do_read for better sleep/wakeup efficiency"), it uses
      skb_recv_datagram() which does not check dev->reg_state. This will
      result if we delete a tun/tap device after a process is blocked in the
      reading. The device will wait for the reference count which was held
      by that process for ever.
      
      Fixes this by using RCV_SHUTDOWN which will be checked during
      sk_recv_datagram() before trying to wake up the process during uninit.
      
      Fixes: 9e641bdc ("net-tun: restructure tun_do_read for better
      sleep/wakeup efficiency")
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Xi Wang <xii@google.com>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NJason Wang <jasowang@redhat.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      addf8fc4
  26. 29 4月, 2016 1 次提交
    • J
      tuntap: calculate rps hash only when needed · 3df97ba8
      Jason Wang 提交于
      There's no need to calculate rps hash if it was not enabled. So this
      patch export rps_needed and check it before trying to get rps
      hash. Tests (using pktgen to inject packets to guest) shows this can
      improve pps about 13% (when rps is disabled).
      
      Before:
      ~1150000 pps
      After:
      ~1300000 pps
      
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NJason Wang <jasowang@redhat.com>
      ----
      Changes from V1:
      - Fix build when CONFIG_RPS is not set
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3df97ba8
  27. 19 4月, 2016 1 次提交
  28. 15 4月, 2016 1 次提交
    • P
      tun: use per cpu variables for stats accounting · 608b9977
      Paolo Abeni 提交于
      Currently the tun device accounting uses dev->stats without applying any
      kind of protection, regardless that accounting happens in preemptible
      process context.
      This patch move the tun stats to a per cpu data structure, and protect
      the updates with  u64_stats_update_begin()/u64_stats_update_end() or
      this_cpu_inc according to the stat type. The per cpu stats are
      aggregated by the newly added ndo_get_stats64 ops.
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      608b9977
  29. 09 4月, 2016 1 次提交
  30. 08 4月, 2016 1 次提交
  31. 05 4月, 2016 1 次提交
    • S
      sock: enable timestamping using control messages · c14ac945
      Soheil Hassas Yeganeh 提交于
      Currently, SOL_TIMESTAMPING can only be enabled using setsockopt.
      This is very costly when users want to sample writes to gather
      tx timestamps.
      
      Add support for enabling SO_TIMESTAMPING via control messages by
      using tsflags added in `struct sockcm_cookie` (added in the previous
      patches in this series) to set the tx_flags of the last skb created in
      a sendmsg. With this patch, the timestamp recording bits in tx_flags
      of the skbuff is overridden if SO_TIMESTAMPING is passed in a cmsg.
      
      Please note that this is only effective for overriding the recording
      timestamps flags. Users should enable timestamp reporting (e.g.,
      SOF_TIMESTAMPING_SOFTWARE | SOF_TIMESTAMPING_OPT_ID) using
      socket options and then should ask for SOF_TIMESTAMPING_TX_*
      using control messages per sendmsg to sample timestamps for each
      write.
      Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Acked-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c14ac945
  32. 02 4月, 2016 1 次提交
    • D
      tun, bpf: fix suspicious RCU usage in tun_{attach, detach}_filter · 5a5abb1f
      Daniel Borkmann 提交于
      Sasha Levin reported a suspicious rcu_dereference_protected() warning
      found while fuzzing with trinity that is similar to this one:
      
        [   52.765684] net/core/filter.c:2262 suspicious rcu_dereference_protected() usage!
        [   52.765688] other info that might help us debug this:
        [   52.765695] rcu_scheduler_active = 1, debug_locks = 1
        [   52.765701] 1 lock held by a.out/1525:
        [   52.765704]  #0:  (rtnl_mutex){+.+.+.}, at: [<ffffffff816a64b7>] rtnl_lock+0x17/0x20
        [   52.765721] stack backtrace:
        [   52.765728] CPU: 1 PID: 1525 Comm: a.out Not tainted 4.5.0+ #264
        [...]
        [   52.765768] Call Trace:
        [   52.765775]  [<ffffffff813e488d>] dump_stack+0x85/0xc8
        [   52.765784]  [<ffffffff810f2fa5>] lockdep_rcu_suspicious+0xd5/0x110
        [   52.765792]  [<ffffffff816afdc2>] sk_detach_filter+0x82/0x90
        [   52.765801]  [<ffffffffa0883425>] tun_detach_filter+0x35/0x90 [tun]
        [   52.765810]  [<ffffffffa0884ed4>] __tun_chr_ioctl+0x354/0x1130 [tun]
        [   52.765818]  [<ffffffff8136fed0>] ? selinux_file_ioctl+0x130/0x210
        [   52.765827]  [<ffffffffa0885ce3>] tun_chr_ioctl+0x13/0x20 [tun]
        [   52.765834]  [<ffffffff81260ea6>] do_vfs_ioctl+0x96/0x690
        [   52.765843]  [<ffffffff81364af3>] ? security_file_ioctl+0x43/0x60
        [   52.765850]  [<ffffffff81261519>] SyS_ioctl+0x79/0x90
        [   52.765858]  [<ffffffff81003ba2>] do_syscall_64+0x62/0x140
        [   52.765866]  [<ffffffff817d563f>] entry_SYSCALL64_slow_path+0x25/0x25
      
      Same can be triggered with PROVE_RCU (+ PROVE_RCU_REPEATEDLY) enabled
      from tun_attach_filter() when user space calls ioctl(tun_fd, TUN{ATTACH,
      DETACH}FILTER, ...) for adding/removing a BPF filter on tap devices.
      
      Since the fix in f91ff5b9 ("net: sk_{detach|attach}_filter() rcu
      fixes") sk_attach_filter()/sk_detach_filter() now dereferences the
      filter with rcu_dereference_protected(), checking whether socket lock
      is held in control path.
      
      Since its introduction in 99405162 ("tun: socket filter support"),
      tap filters are managed under RTNL lock from __tun_chr_ioctl(). Thus the
      sock_owned_by_user(sk) doesn't apply in this specific case and therefore
      triggers the false positive.
      
      Extend the BPF API with __sk_attach_filter()/__sk_detach_filter() pair
      that is used by tap filters and pass in lockdep_rtnl_is_held() for the
      rcu_dereference_protected() checks instead.
      Reported-by: NSasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5a5abb1f
  33. 02 3月, 2016 1 次提交
    • P
      net/tun: implement ndo_set_rx_headroom · eaea34b2
      Paolo Abeni 提交于
      ndo_set_rx_headroom controls the align value used by tun devices to
      allocate skbs on frame reception.
      When the xmit device adds a large encapsulation, this avoids an skb
      head reallocation on forwarding.
      
      The measured improvement when forwarding towards a vxlan dev with
      frame size below the egress device MTU is as follow:
      
      vxlan over ipv6, bridged: +6%
      vxlan over ipv6, ovs: +7%
      
      In case of ipv4 tunnels there is no improvement, since the tun
      device default alignment provides enough headroom to avoid the skb
      head reallocation.
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      eaea34b2
  34. 18 12月, 2015 1 次提交
  35. 02 12月, 2015 1 次提交
    • E
      net: rename SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA · 9cd3e072
      Eric Dumazet 提交于
      This patch is a cleanup to make following patch easier to
      review.
      
      Goal is to move SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA
      from (struct socket)->flags to a (struct socket_wq)->flags
      to benefit from RCU protection in sock_wake_async()
      
      To ease backports, we rename both constants.
      
      Two new helpers, sk_set_bit(int nr, struct sock *sk)
      and sk_clear_bit(int net, struct sock *sk) are added so that
      following patch can change their implementation.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9cd3e072
  36. 13 10月, 2015 1 次提交
  37. 04 8月, 2015 1 次提交