1. 01 7月, 2016 1 次提交
    • J
      tun: switch to use skb array for tx · 1576d986
      Jason Wang 提交于
      We used to queue tx packets in sk_receive_queue, this is less
      efficient since it requires spinlocks to synchronize between producer
      and consumer.
      
      This patch tries to address this by:
      
      - switch from sk_receive_queue to a skb_array, and resize it when
        tx_queue_len was changed.
      - introduce a new proto_ops peek_len which was used for peeking the
        skb length.
      - implement a tun version of peek_len for vhost_net to use and convert
        vhost_net to use peek_len if possible.
      
      Pktgen test shows about 15.3% improvement on guest receiving pps for small
      buffers:
      
      Before: ~1300000pps
      After : ~1500000pps
      Signed-off-by: NJason Wang <jasowang@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1576d986
  2. 16 6月, 2016 1 次提交
  3. 11 6月, 2016 1 次提交
  4. 21 5月, 2016 1 次提交
    • J
      tuntap: correctly wake up process during uninit · addf8fc4
      Jason Wang 提交于
      We used to check dev->reg_state against NETREG_REGISTERED after each
      time we are woke up. But after commit 9e641bdc ("net-tun:
      restructure tun_do_read for better sleep/wakeup efficiency"), it uses
      skb_recv_datagram() which does not check dev->reg_state. This will
      result if we delete a tun/tap device after a process is blocked in the
      reading. The device will wait for the reference count which was held
      by that process for ever.
      
      Fixes this by using RCV_SHUTDOWN which will be checked during
      sk_recv_datagram() before trying to wake up the process during uninit.
      
      Fixes: 9e641bdc ("net-tun: restructure tun_do_read for better
      sleep/wakeup efficiency")
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Xi Wang <xii@google.com>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NJason Wang <jasowang@redhat.com>
      Acked-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      addf8fc4
  5. 29 4月, 2016 1 次提交
    • J
      tuntap: calculate rps hash only when needed · 3df97ba8
      Jason Wang 提交于
      There's no need to calculate rps hash if it was not enabled. So this
      patch export rps_needed and check it before trying to get rps
      hash. Tests (using pktgen to inject packets to guest) shows this can
      improve pps about 13% (when rps is disabled).
      
      Before:
      ~1150000 pps
      After:
      ~1300000 pps
      
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NJason Wang <jasowang@redhat.com>
      ----
      Changes from V1:
      - Fix build when CONFIG_RPS is not set
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3df97ba8
  6. 19 4月, 2016 1 次提交
  7. 15 4月, 2016 1 次提交
    • P
      tun: use per cpu variables for stats accounting · 608b9977
      Paolo Abeni 提交于
      Currently the tun device accounting uses dev->stats without applying any
      kind of protection, regardless that accounting happens in preemptible
      process context.
      This patch move the tun stats to a per cpu data structure, and protect
      the updates with  u64_stats_update_begin()/u64_stats_update_end() or
      this_cpu_inc according to the stat type. The per cpu stats are
      aggregated by the newly added ndo_get_stats64 ops.
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      608b9977
  8. 09 4月, 2016 1 次提交
  9. 08 4月, 2016 1 次提交
  10. 05 4月, 2016 1 次提交
    • S
      sock: enable timestamping using control messages · c14ac945
      Soheil Hassas Yeganeh 提交于
      Currently, SOL_TIMESTAMPING can only be enabled using setsockopt.
      This is very costly when users want to sample writes to gather
      tx timestamps.
      
      Add support for enabling SO_TIMESTAMPING via control messages by
      using tsflags added in `struct sockcm_cookie` (added in the previous
      patches in this series) to set the tx_flags of the last skb created in
      a sendmsg. With this patch, the timestamp recording bits in tx_flags
      of the skbuff is overridden if SO_TIMESTAMPING is passed in a cmsg.
      
      Please note that this is only effective for overriding the recording
      timestamps flags. Users should enable timestamp reporting (e.g.,
      SOF_TIMESTAMPING_SOFTWARE | SOF_TIMESTAMPING_OPT_ID) using
      socket options and then should ask for SOF_TIMESTAMPING_TX_*
      using control messages per sendmsg to sample timestamps for each
      write.
      Signed-off-by: NSoheil Hassas Yeganeh <soheil@google.com>
      Acked-by: NWillem de Bruijn <willemb@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c14ac945
  11. 02 4月, 2016 1 次提交
    • D
      tun, bpf: fix suspicious RCU usage in tun_{attach, detach}_filter · 5a5abb1f
      Daniel Borkmann 提交于
      Sasha Levin reported a suspicious rcu_dereference_protected() warning
      found while fuzzing with trinity that is similar to this one:
      
        [   52.765684] net/core/filter.c:2262 suspicious rcu_dereference_protected() usage!
        [   52.765688] other info that might help us debug this:
        [   52.765695] rcu_scheduler_active = 1, debug_locks = 1
        [   52.765701] 1 lock held by a.out/1525:
        [   52.765704]  #0:  (rtnl_mutex){+.+.+.}, at: [<ffffffff816a64b7>] rtnl_lock+0x17/0x20
        [   52.765721] stack backtrace:
        [   52.765728] CPU: 1 PID: 1525 Comm: a.out Not tainted 4.5.0+ #264
        [...]
        [   52.765768] Call Trace:
        [   52.765775]  [<ffffffff813e488d>] dump_stack+0x85/0xc8
        [   52.765784]  [<ffffffff810f2fa5>] lockdep_rcu_suspicious+0xd5/0x110
        [   52.765792]  [<ffffffff816afdc2>] sk_detach_filter+0x82/0x90
        [   52.765801]  [<ffffffffa0883425>] tun_detach_filter+0x35/0x90 [tun]
        [   52.765810]  [<ffffffffa0884ed4>] __tun_chr_ioctl+0x354/0x1130 [tun]
        [   52.765818]  [<ffffffff8136fed0>] ? selinux_file_ioctl+0x130/0x210
        [   52.765827]  [<ffffffffa0885ce3>] tun_chr_ioctl+0x13/0x20 [tun]
        [   52.765834]  [<ffffffff81260ea6>] do_vfs_ioctl+0x96/0x690
        [   52.765843]  [<ffffffff81364af3>] ? security_file_ioctl+0x43/0x60
        [   52.765850]  [<ffffffff81261519>] SyS_ioctl+0x79/0x90
        [   52.765858]  [<ffffffff81003ba2>] do_syscall_64+0x62/0x140
        [   52.765866]  [<ffffffff817d563f>] entry_SYSCALL64_slow_path+0x25/0x25
      
      Same can be triggered with PROVE_RCU (+ PROVE_RCU_REPEATEDLY) enabled
      from tun_attach_filter() when user space calls ioctl(tun_fd, TUN{ATTACH,
      DETACH}FILTER, ...) for adding/removing a BPF filter on tap devices.
      
      Since the fix in f91ff5b9 ("net: sk_{detach|attach}_filter() rcu
      fixes") sk_attach_filter()/sk_detach_filter() now dereferences the
      filter with rcu_dereference_protected(), checking whether socket lock
      is held in control path.
      
      Since its introduction in 99405162 ("tun: socket filter support"),
      tap filters are managed under RTNL lock from __tun_chr_ioctl(). Thus the
      sock_owned_by_user(sk) doesn't apply in this specific case and therefore
      triggers the false positive.
      
      Extend the BPF API with __sk_attach_filter()/__sk_detach_filter() pair
      that is used by tap filters and pass in lockdep_rtnl_is_held() for the
      rcu_dereference_protected() checks instead.
      Reported-by: NSasha Levin <sasha.levin@oracle.com>
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5a5abb1f
  12. 02 3月, 2016 1 次提交
    • P
      net/tun: implement ndo_set_rx_headroom · eaea34b2
      Paolo Abeni 提交于
      ndo_set_rx_headroom controls the align value used by tun devices to
      allocate skbs on frame reception.
      When the xmit device adds a large encapsulation, this avoids an skb
      head reallocation on forwarding.
      
      The measured improvement when forwarding towards a vxlan dev with
      frame size below the egress device MTU is as follow:
      
      vxlan over ipv6, bridged: +6%
      vxlan over ipv6, ovs: +7%
      
      In case of ipv4 tunnels there is no improvement, since the tun
      device default alignment provides enough headroom to avoid the skb
      head reallocation.
      Signed-off-by: NPaolo Abeni <pabeni@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      eaea34b2
  13. 18 12月, 2015 1 次提交
  14. 02 12月, 2015 1 次提交
    • E
      net: rename SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA · 9cd3e072
      Eric Dumazet 提交于
      This patch is a cleanup to make following patch easier to
      review.
      
      Goal is to move SOCK_ASYNC_NOSPACE and SOCK_ASYNC_WAITDATA
      from (struct socket)->flags to a (struct socket_wq)->flags
      to benefit from RCU protection in sock_wake_async()
      
      To ease backports, we rename both constants.
      
      Two new helpers, sk_set_bit(int nr, struct sock *sk)
      and sk_clear_bit(int net, struct sock *sk) are added so that
      following patch can change their implementation.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      9cd3e072
  15. 13 10月, 2015 1 次提交
  16. 04 8月, 2015 1 次提交
  17. 01 6月, 2015 3 次提交
  18. 11 5月, 2015 2 次提交
  19. 12 4月, 2015 1 次提交
  20. 03 3月, 2015 1 次提交
  21. 09 2月, 2015 1 次提交
    • E
      net: rfs: add hash collision detection · 567e4b79
      Eric Dumazet 提交于
      Receive Flow Steering is a nice solution but suffers from
      hash collisions when a mix of connected and unconnected traffic
      is received on the host, when flow hash table is populated.
      
      Also, clearing flow in inet_release() makes RFS not very good
      for short lived flows, as many packets can follow close().
      (FIN , ACK packets, ...)
      
      This patch extends the information stored into global hash table
      to not only include cpu number, but upper part of the hash value.
      
      I use a 32bit value, and dynamically split it in two parts.
      
      For host with less than 64 possible cpus, this gives 6 bits for the
      cpu number, and 26 (32-6) bits for the upper part of the hash.
      
      Since hash bucket selection use low order bits of the hash, we have
      a full hash match, if /proc/sys/net/core/rps_sock_flow_entries is big
      enough.
      
      If the hash found in flow table does not match, we fallback to RPS (if
      it is enabled for the rxqueue).
      
      This means that a packet for an non connected flow can avoid the
      IPI through a unrelated/victim CPU.
      
      This also means we no longer have to clear the table at socket
      close time, and this helps short lived flows performance.
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Acked-by: NTom Herbert <therbert@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      567e4b79
  22. 05 2月, 2015 1 次提交
  23. 04 2月, 2015 2 次提交
  24. 14 1月, 2015 1 次提交
  25. 13 1月, 2015 1 次提交
    • P
      tuntap: Increase the number of queues in tun. · baf71c5c
      Pankaj Gupta 提交于
      Networking under kvm works best if we allocate a per-vCPU RX and TX
      queue in a virtual NIC. This requires a per-vCPU queue on the host side.
      
      It is now safe to increase the maximum number of queues.
      Preceding patch: 'net: allow large number of rx queues'
      made sure this won't cause failures due to high order memory
      allocations. Increase it to 256: this is the max number of vCPUs
      KVM supports.
      
      Size of tun_struct changes from 8512 to 10496 after this patch. This keeps
      pages allocated for tun_struct before and after the patch to 3.
      Signed-off-by: NPankaj Gupta <pagupta@redhat.com>
      Reviewed-by: NDavid Gibson <dgibson@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      baf71c5c
  26. 01 1月, 2015 2 次提交
  27. 17 12月, 2014 1 次提交
  28. 10 12月, 2014 1 次提交
    • A
      put iov_iter into msghdr · c0371da6
      Al Viro 提交于
      Note that the code _using_ ->msg_iter at that point will be very
      unhappy with anything other than unshifted iovec-backed iov_iter.
      We still need to convert users to proper primitives.
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      c0371da6
  29. 09 12月, 2014 3 次提交
  30. 06 12月, 2014 1 次提交
  31. 03 12月, 2014 1 次提交
  32. 24 11月, 2014 2 次提交