1. 04 Nov 2014, 1 commit
    • tun: Fix csum_start with VLAN acceleration · a8f9bfdf
      Committed by Herbert Xu
      When VLAN acceleration is in use on the xmit path, we end up
      setting csum_start to the wrong place.  The result is that
      whoever ends up doing the checksum setting will corrupt the packet
      instead of writing the checksum to the expected location; usually
      this means writing the checksum with an offset of -4.
      
      This patch fixes this by adjusting csum_start when VLAN acceleration
      is detected.
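      
      (Illustrative sketch only, not the literal diff: the adjustment in
      tun_put_user() plausibly looks like the fragment below; the local
      names vlan_hlen and gso are assumptions.)
      
          int vlan_hlen = 0;
      
          if (vlan_tx_tag_present(skb))     /* accelerated tag will be re-inserted */
                  vlan_hlen = VLAN_HLEN;    /* 4 bytes */
      
          gso.csum_start = skb_checksum_start_offset(skb) + vlan_hlen;
          gso.csum_offset = skb->csum_offset;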
      
      Fixes: 6680ec68 ("tuntap: hardware vlan tx support")
      Cc: stable@vger.kernel.org
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  2. 31 Oct 2014, 2 commits
    • drivers/net, ipv6: Select IPv6 fragment idents for virtio UFO packets · 5188cd44
      Committed by Ben Hutchings
      UFO is now disabled on all drivers that work with virtio net headers,
      but userland may try to send UFO/IPv6 packets anyway.  Instead of
      sending with ID=0, we should select identifiers on their behalf (as we
      used to).
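      
      (Hedged sketch of the idea, assuming a helper along the lines of the
      ipv6_proxy_select_ident() this series introduces: before handing a
      UFO/IPv6 packet from a tap to the stack, pick a fragment ID rather
      than leaving it at 0.)
      
          if ((skb_shinfo(skb)->gso_type & SKB_GSO_UDP) &&
              skb->protocol == htons(ETH_P_IPV6))
                  ipv6_proxy_select_ident(skb);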
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
      Fixes: 916e4cf4 ("ipv6: reuse ip6_frag_id from ip6_ufo_append_data")
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • drivers/net: Disable UFO through virtio · 3d0ad094
      Committed by Ben Hutchings
      IPv6 does not allow fragmentation by routers, so there is no
      fragmentation ID in the fixed header.  UFO for IPv6 requires the ID to
      be passed separately, but there is no provision for this in the virtio
      net protocol.
      
      Until recently our software implementation of UFO/IPv6 generated a new
      ID, but this was a bug.  Now we will use ID=0 for any UFO/IPv6 packet
      passed through a tap, which is even worse.
      
      Unfortunately there is no distinction between UFO/IPv4 and v6
      features, so disable UFO on taps and virtio_net completely until we
      have a proper solution.
      
      We cannot depend on VM managers respecting the tap feature flags, so
      keep accepting UFO packets but log a warning the first time we do
      this.
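      
      (Hedged sketch of the two halves of the change; the macro value and the
      warning text are illustrative, not the exact diff.)
      
          /* Stop advertising UFO to userspace ... */
          #define TUN_USER_FEATURES (NETIF_F_HW_CSUM | NETIF_F_TSO_ECN | \
                                     NETIF_F_TSO | NETIF_F_TSO6)   /* NETIF_F_UFO dropped */
      
          /* ... but keep honouring VIRTIO_NET_HDR_GSO_UDP, warning once. */
          if (gso.gso_type == VIRTIO_NET_HDR_GSO_UDP)
                  pr_warn_once("%s: using disabled UFO feature; please fix this program\n",
                               current->comm);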
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
      Fixes: 916e4cf4 ("ipv6: reuse ip6_frag_id from ip6_ufo_append_data")
      Signed-off-by: David S. Miller <davem@davemloft.net>
  3. 10 Sep 2014, 1 commit
  4. 16 Jul 2014, 1 commit
    • net: set name_assign_type in alloc_netdev() · c835a677
      Committed by Tom Gundersen
      Extend alloc_netdev{,_mq{,s}}() to take name_assign_type as argument, and convert
      all users to pass NET_NAME_UNKNOWN.
      
      Coccinelle patch:
      
      @@
      expression sizeof_priv, name, setup, txqs, rxqs, count;
      @@
      
      (
      -alloc_netdev_mqs(sizeof_priv, name, setup, txqs, rxqs)
      +alloc_netdev_mqs(sizeof_priv, name, NET_NAME_UNKNOWN, setup, txqs, rxqs)
      |
      -alloc_netdev_mq(sizeof_priv, name, setup, count)
      +alloc_netdev_mq(sizeof_priv, name, NET_NAME_UNKNOWN, setup, count)
      |
      -alloc_netdev(sizeof_priv, name, setup)
      +alloc_netdev(sizeof_priv, name, NET_NAME_UNKNOWN, setup)
      )
      
      v9: move comments here from the wrong commit
      Signed-off-by: Tom Gundersen <teg@jklm.no>
      Reviewed-by: David Herrmann <dh.herrmann@gmail.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  5. 22 May 2014, 1 commit
    • net-tun: restructure tun_do_read for better sleep/wakeup efficiency · 9e641bdc
      Committed by Xi Wang
      tun_do_read always adds current thread to wait queue, even if a packet
      is ready to read. This is inefficient because both sleeper and waker
      want to acquire the wait queue spin lock when packet rate is high.
      
      We restructure the read function and use common kernel networking
      routines to handle receive, sleep and wakeup. With this change,
      available packets are checked first before the reading thread is
      added to the wait queue.
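      
      (Generic check-before-sleep pattern the commit describes, not the exact
      diff; the surrounding function and names are assumed.)
      
          struct sk_buff *skb;
          DEFINE_WAIT(wait);
      
          /* fast path: a queued packet means we never touch the wait queue */
          skb = skb_dequeue(&sk->sk_receive_queue);
          if (skb)
                  return skb;
      
          /* slow path: enqueue ourselves, re-check, and only then sleep */
          prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
          /* ... schedule() if the queue is still empty, then finish_wait() */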
      
      Ran performance tests with the following configuration:
      
       - my packet generator -> tap1 -> br0 -> tap0 -> my packet consumer
       - sender pinned to one core and receiver pinned to another core
       - sender sends small UDP packets (64 bytes total) as fast as it can
       - Sandy Bridge cores
       - throughput figures are receiver-side goodput numbers
      
      The results are
      
      baseline: 731k pkts/sec, cpu utilization at 1.50 cpus
       changed: 783k pkts/sec, cpu utilization at 1.53 cpus
      
      The performance difference is largely determined by packet rate and
      inter-cpu communication cost. For example, if the sender and
      receiver are pinned to different cpu sockets, the results are
      
      baseline: 558k pkts/sec, cpu utilization at 1.71 cpus
       changed: 690k pkts/sec, cpu utilization at 1.67 cpus
      Co-authored-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: Xi Wang <xii@google.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  6. 27 Mar 2014, 1 commit
  7. 20 Feb 2014, 1 commit
  8. 17 Feb 2014, 1 commit
  9. 29 Jan 2014, 1 commit
    • tun: add device name(iff) field to proc fdinfo entry · 93e14b6d
      Committed by Masatake YAMATO
      A file descriptor opened for /dev/net/tun and a tun device are
      connected with ioctl.  Though understanding the connection is
      important for troubleshooting, userland has no way to find out which
      device is connected to a given file descriptor.
      
      This patch adds a new fdinfo field for the device name connected to
      a file descriptor opened for /dev/net/tun.
      
      Here is an example of the field:
      
          # lsof | grep tun
          qemu-syst 4565         qemu   25u      CHR             10,200       0t138      12921 /dev/net/tun
          ...
      
          # cat /proc/4565/fdinfo/25
          pos:	138
          flags:	0104002
          iff:	vnet0
      
          # ip link show dev vnet0
          8: vnet0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 ...
      
      changelog:
      
          v2: indent iff just like the other fdinfo fields are.
          v3: remove unused variable.
              Both are suggested by David Miller <davem@davemloft.net>.
      Signed-off-by: Masatake YAMATO <yamato@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  10. 23 Jan 2014, 1 commit
    • tuntap: Fix for a race in accessing numqueues · fa35864e
      Committed by Dominic Curran
      A patch fixing a race between queue selection and changing queues
      was introduced in commit 92bb73ea ("tuntap: fix a possible race
      between queue selection and changing queues").
      
      The fix was to prevent the driver from re-reading the tun->numqueues
      more than once within tun_select_queue() using ACCESS_ONCE().
      
      We have been experiencing 'Divide-by-zero' errors in tun_net_xmit()
      since we moved from 3.6 to 3.10, and believe that they come from a
      similar source, where the value of tun->numqueues changes to zero
      between the first and a subsequent read of tun->numqueues.
      
      The fix is a similar use of ACCESS_ONCE(), as well as a multiply
      instead of a divide in the if statement.
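      
      (Sketch of the kind of change described; the queue-length check shown is
      an assumption about where the divide lived, not a verbatim diff.)
      
          u32 numqueues = ACCESS_ONCE(tun->numqueues);
      
          if (txq >= numqueues)
                  goto drop;
      
          /* multiply instead of dividing by a possibly-zero numqueues */
          if (skb_queue_len(&tfile->socket.sk->sk_receive_queue) * numqueues
              >= dev->tx_queue_len)
                  goto drop;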
      Signed-off-by: Dominic Curran <dominic.curran@citrix.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Cc: Maxim Krasnyansky <maxk@qti.qualcomm.com>
      Acked-by: Jason Wang <jasowang@redhat.com>
      Acked-by: Max Krasnyansky <maxk@kernel.org>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 11 Jan 2014, 1 commit
    • net: core: explicitly select a txq before doing l2 forwarding · f663dd9a
      Committed by Jason Wang
      Currently, the tx queue is selected implicitly in ndo_dfwd_start_xmit().
      This causes several issues:
      
      - NETIF_F_LLTX was removed for macvlan, so the txq lock was taken for
        macvlan instead of the lower device, which misses the necessary txq
        synchronization for the lower device, such as txq stopping or freezing
        required by the dev watchdog or control path.
      - dev_hard_start_xmit() was called with a NULL txq, which bypasses the
        net device watchdog.
      - dev_hard_start_xmit() does not check txq everywhere, which will lead
        to a crash when tso is disabled for the lower device.
      
      Fix this by explicitly introducing a new parameter to .ndo_select_queue()
      just for selecting queues in the case of l2 forwarding offload.
      netdev_pick_tx() was also extended to accept this parameter and
      dev_queue_xmit_accel() was used to do l2 forwarding transmission.
      
      With these fixes, NETIF_F_LLTX can be preserved for macvlan and there's
      no need to check txq against NULL in dev_hard_start_xmit(). Also there's
      no need to keep a dedicated ndo_dfwd_start_xmit() and we can just reuse
      the code of dev_queue_xmit() to do the transmission.
      
      In the future, this will also be required for macvtap l2 forwarding
      support, since it provides a necessary synchronization method.
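      
      (Hedged sketch of the extended hook as described; the 3.13-era prototypes
      may differ in detail.)
      
          u16 (*ndo_select_queue)(struct net_device *dev, struct sk_buff *skb,
                                  void *accel_priv);
      
          /* l2-forwarding path: pass the upper device's private data so the
           * lower driver can pick a queue on its behalf */
          txq = netdev_pick_tx(lowerdev, skb, accel_priv);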
      
      Cc: John Fastabend <john.r.fastabend@intel.com>
      Cc: Neil Horman <nhorman@tuxdriver.com>
      Cc: e1000-devel@lists.sourceforge.net
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Acked-by: Neil Horman <nhorman@tuxdriver.com>
      Acked-by: John Fastabend <john.r.fastabend@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  12. 02 Jan 2014, 1 commit
  13. 01 Jan 2014, 1 commit
    • tun: Add support for RFS on tun flows · 9bc88939
      Committed by Tom Herbert
      This patch adds support so that the rps_flow_tables (RFS) can be
      programmed using the tun flows which are already set up to track flows
      for the purposes of queue selection.
      
      On the receive path (corresponding to select_queue and tun_net_xmit) the
      rxhash is saved in the flow_entry.  The original code only does flow
      lookup in select_queue, so this patch adds a flow lookup in tun_net_xmit
      if num_queues == 1 (select_queue is not called from
      dev_queue_xmit->netdev_pick_tx in that case).
      
      The flow is recorded (with the processing CPU) in tun_flow_update (TX
      path), and reset when the flow is deleted.
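      
      (Hedged sketch of the single-queue lookup described; helper names follow
      the existing tun flow table but are not guaranteed to match the diff.)
      
          if (tun->numqueues == 1) {
                  rxhash = skb_get_hash(skb);
                  if (rxhash) {
                          e = tun_flow_find(&tun->flows[tun_hashfn(rxhash)], rxhash);
                          if (e)
                                  tun_flow_save_rps_rxhash(e, rxhash);
                  }
          }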
      Signed-off-by: Tom Herbert <therbert@google.com>
      Signed-off-by: Zhi Yong Wu <wuzhy@linux.vnet.ibm.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  14. 18 Dec 2013, 1 commit
  15. 12 Dec 2013, 1 commit
  16. 11 Dec 2013, 3 commits
  17. 10 Dec 2013, 1 commit
  18. 07 Dec 2013, 3 commits
  19. 15 Nov 2013, 1 commit
    • tuntap: limit head length of skb allocated · 96f8d9ec
      Committed by Jason Wang
      We currently use hdr_len as a hint of the head length, which is
      advertised by the guest. But when the guest advertises a very big
      value, it can lead to a 64K+ kmalloc() allocation, which has a very
      high probability of failure when host memory is fragmented or under
      heavy stress. A huge hdr_len also reduces the effectiveness of
      zerocopy, or even disables it if a gso skb is linearized in the guest.
      
      To solve those issues, this patch introduces an upper limit (PAGE_SIZE)
      on the head, which guarantees an order-0 allocation each time.
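      
      (Minimal sketch of the clamp described; the local names and the
      tun_alloc_skb() call site are assumptions.)
      
          /* never trust hdr_len beyond one page; the rest goes into paged frags */
          linear = min_t(size_t, gso.hdr_len, PAGE_SIZE);
          skb = tun_alloc_skb(tfile, align, copylen, linear, noblock);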
      
      Cc: Stefan Hajnoczi <stefanha@redhat.com>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  20. 09 Oct 2013, 1 commit
  21. 13 Sep 2013, 1 commit
  22. 06 Sep 2013, 2 commits
  23. 22 Aug 2013, 4 commits
    • tun: Get skfilter layout · 76975e9c
      Committed by Pavel Emelyanov
      The only thing we may have from a tun device is the fprog, which
      contains the number of filter elements and a pointer to (user-space)
      memory where the elements are. The program itself may not be available
      if the device is persistent and detached.
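      
      (Hedged userspace sketch, assuming the ioctl added here is TUNGETFILTER
      returning a struct sock_fprog:)
      
          struct sock_fprog fprog;
      
          if (ioctl(fd, TUNGETFILTER, &fprog) == 0)
                  printf("filter: %u instructions at %p\n",
                         fprog.len, (void *)fprog.filter);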
      Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tun: Allow to skip filter on attach · 849c9b6f
      Committed by Pavel Emelyanov
      There's a small problem with sk-filters on tun devices. Consider
      an application doing this sequence of steps:
      
      fd = open("/dev/net/tun");
      ioctl(fd, TUNSETIFF, { .ifr_name = "tun0" });
      ioctl(fd, TUNATTACHFILTER, &my_filter);
      ioctl(fd, TUNSETPERSIST, 1);
      close(fd);
      
      At that point the tun0 will remain in the system and will keep in
      mind that there should be a socket filter at address '&my_filter'.
      
      If after that we do
      
      fd = open("/dev/net/tun");
      ioctl(fd, TUNSETIFF, { .ifr_name = "tun0" });
      
      we most likely receive the -EFAULT error, since tun_attach() would
      try to connect the filter back. But (!) if we provide a filter at
      address &my_filter, then tun0 will be created and the "new" filter
      would be attached, but the application may not know about that.
      
      This may create certain problems for anyone using tuns, but it's a
      critical problem for c/r -- if we meet a persistent tun device
      with a filter in mind, we will not be able to attach to it to dump
      its state (flags, owner, address, vnethdr size, etc.).
      
      The proposal is to allow attaching to a tun device (with TUNSETIFF)
      without attaching the filter to the tun file's socket. After this
      attach the app may e.g. clean the device by dropping the filter it
      doesn't want to have, or (in the case of c/r) get information
      about the device with tun ioctls.
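      
      (Hedged userspace sketch; IFF_NOFILTER is assumed to be the TUNSETIFF
      flag that requests attaching without restoring the old filter.)
      
          struct ifreq ifr;
      
          memset(&ifr, 0, sizeof(ifr));
          strncpy(ifr.ifr_name, "tun0", IFNAMSIZ - 1);
          ifr.ifr_flags = IFF_TUN | IFF_NOFILTER;
          ioctl(fd, TUNSETIFF, &ifr);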
      Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tun: Report whether the queue is attached or not · 3d407a80
      Committed by Pavel Emelyanov
      Multiqueue tun devices allow attaching to and detaching from their
      queues while keeping the interface itself set on the file.
      
      Knowing this is critical for the checkpoint part of the criu project.
      Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • tun: Add ability to create tun device with given index · fb7589a1
      Committed by Pavel Emelyanov
      Tun devices cannot be created with the ifindex the user wants, but this
      is required by the checkpoint-restore project.
      
      A long time ago such an ability was implemented for the rtnl_ops-based
      interface for creating links (9c7dafbf "net: Allow to create links
      with given ifindex"), but the only API for creating and managing
      tuntap devices is ioctl-based and is evolving by adding new ioctls
      (cde8b15f "tuntap: add ioctl to attach or detach a file from tuntap
      device").
      
      Following that trend, this adds a new ioctl that sets the ifindex for
      the device that _will_ be created by the TUNSETIFF ioctl. So those who
      want a tuntap device with ifindex N should open the tun device, call
      ioctl(fd, TUNSETIFINDEX, &N), then call TUNSETIFF. If the index N is
      busy, register_netdev will find this out and the ioctl will fail with
      -EBUSY.
      
      If TUNSETIFINDEX is not called, the ifindex is generated as before.
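      
      (Hedged userspace sketch of the sequence described; error handling is
      omitted and the example index is arbitrary.)
      
          int fd = open("/dev/net/tun", O_RDWR);
          unsigned int idx = 42;                  /* desired ifindex */
          struct ifreq ifr;
      
          ioctl(fd, TUNSETIFINDEX, &idx);
      
          memset(&ifr, 0, sizeof(ifr));
          strncpy(ifr.ifr_name, "tun0", IFNAMSIZ - 1);
          ifr.ifr_flags = IFF_TUN | IFF_NO_PI;
          ioctl(fd, TUNSETIFF, &ifr);             /* tun0 is created with ifindex 42 */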
      Signed-off-by: Pavel Emelyanov <xemul@parallels.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  24. 16 Aug 2013, 1 commit
  25. 14 Aug 2013, 1 commit
  26. 10 Aug 2013, 1 commit
    • net: attempt high order allocations in sock_alloc_send_pskb() · 28d64271
      Committed by Eric Dumazet
      Adding paged frags skbs to af_unix sockets introduced a performance
      regression on large sends because of additional page allocations, even
      if each skb could carry at least 100% more payload than before.
      
      We can instruct sock_alloc_send_pskb() to attempt high order
      allocations.
      
      Most of the time, it does a single page allocation instead of 8.
      
      I added an additional parameter to sock_alloc_send_pskb() to
      let other users opt in to this new feature in follow-up patches.
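      
      (Hedged sketch of the opt-in: the new last argument is the maximum page
      order the caller is willing to try for the paged part, with 0 keeping
      the old behaviour; the order value shown is only an example.)
      
          skb = sock_alloc_send_pskb(sk, header_len, data_len,
                                     flags & MSG_DONTWAIT, &err, 3);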
      
      Tested:
      
      Before patch :
      
      $ netperf -t STREAM_STREAM
      STREAM STREAM TEST
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/sec
      
       2304  212992  212992    10.00    46861.15
      
      After patch :
      
      $ netperf -t STREAM_STREAM
      STREAM STREAM TEST
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/sec
      
       2304  212992  212992    10.00    57981.11
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  27. 08 Aug 2013, 2 commits
  28. 28 Jul 2013, 1 commit
    • tuntap: hardware vlan tx support · 6680ec68
      Committed by Jason Wang
      Inspired by commit f09e2249 (macvtap: restore
      vlan header on user read). This patch adds hardware vlan tx support for
      tuntap. This is done by copying vlan header directly into userspace in
      tun_put_user() instead of doing it through __vlan_put_tag() in
      dev_hard_start_xmit(). This eliminates one unnecessary memmove() in
      vlan_insert_tag() for 802.1ad and 802.1q traffic.
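      
      (Hedged sketch of the idea in tun_put_user(): rebuild the 4-byte tag from
      the accelerated fields and copy it to userspace between the MAC header
      and the rest of the frame; the local struct is illustrative.)
      
          struct {
                  __be16 h_vlan_proto;
                  __be16 h_vlan_TCI;
          } veth;
      
          veth.h_vlan_proto = skb->vlan_proto;
          veth.h_vlan_TCI   = htons(vlan_tx_tag_get(skb));
          /* copy MAC addresses, then veth, then the remaining payload ... */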
      
      pktgen test shows about 20% improvement for 802.1q traffic:
      
      Before:
        662149pps 317Mb/sec (317831520bps) errors: 0
      After:
        801033pps 384Mb/sec (384495840bps) errors: 0
      
      Cc: Basil Gor <basil.gor@gmail.com>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  29. 23 Jul 2013, 1 commit
  30. 19 Jul 2013, 1 commit
    • tuntap: do not zerocopy if iov needs more pages than MAX_SKB_FRAGS · 88529176
      Committed by Jason Wang
      We try to linearize part of the skb when the number of iovs is greater
      than MAX_SKB_FRAGS. This is not enough, since each single vector may
      occupy more than one page, so zerocopy_sg_fromiovec() may still fail
      and may break the guest network.
      
      Solve this problem by calculating the pages needed for the iov before
      trying to do zerocopy, and switch to copy instead of zerocopy if it
      needs more than MAX_SKB_FRAGS.
      
      This is done by introducing a new helper to count the pages for an iov,
      and by calling uarg->callback() manually when switching from zerocopy
      to copy, to notify vhost.
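      
      (Hedged sketch of such a page-counting helper; the name iov_pages and the
      exact rounding are assumptions.)
      
          static int iov_pages(const struct iovec *iv, int offset,
                               unsigned long nr_segs)
          {
                  unsigned long seg, base;
                  int pages = 0, len, size;
      
                  /* skip whole vectors consumed by the virtio-net header */
                  while (nr_segs && (offset >= iv->iov_len)) {
                          offset -= iv->iov_len;
                          ++iv;
                          --nr_segs;
                  }
      
                  for (seg = 0; seg < nr_segs; seg++) {
                          base = (unsigned long)iv[seg].iov_base + offset;
                          len  = iv[seg].iov_len - offset;
                          /* pages spanned = offset-in-page + length, rounded up */
                          size = ((base & ~PAGE_MASK) + len + ~PAGE_MASK) >> PAGE_SHIFT;
                          pages += size;
                          offset = 0;
                  }
      
                  return pages;
          }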
      
      We can do further optimization on top.
      
      The bug was introduced in commit 0690899b
      ("tun: experimental zero copy tx support").
      
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>