1. 22 8月, 2013 2 次提交
    • P
      tun: Report whether the queue is attached or not · 3d407a80
      Pavel Emelyanov 提交于
      Multiqueue tun devices allow to attach and detach from its queues
      while keeping the interface itself set on file.
      
      Knowing this is critical for the checkpoint part of criu project.
      Signed-off-by: NPavel Emelyanov <xemul@parallels.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      3d407a80
    • P
      tun: Add ability to create tun device with given index · fb7589a1
      Pavel Emelyanov 提交于
      Tun devices cannot be created with ifidex user wants, but it's
      required by checkpoint-restore project.
      
      Long time ago such ability was implemented for rtnl_ops-based
      interface for creating links (9c7dafbf net: Allow to create links
      with given ifindex), but the only API for creating and managing
      tuntap devices is ioctl-based and is evolving with adding new ones
      (cde8b15f tuntap: add ioctl to attach or detach a file form tuntap
      device).
      
      Following that trend, here's how a new ioctl that sets the ifindex
      for device, that _will_ be created by TUNSETIFF ioctl looks like.
      So those who want a tuntap device with the ifindex N, should open
      the tun device, call ioctl(fd, TUNSETIFINDEX, &N), then call TUNSETIFF.
      If the index N is busy, then the register_netdev will find this out
      and the ioctl would be failed with -EBUSY.
      
      If setifindex is not called, then it will be generated as before.
      Signed-off-by: NPavel Emelyanov <xemul@parallels.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fb7589a1
  2. 16 8月, 2013 1 次提交
  3. 14 8月, 2013 1 次提交
  4. 10 8月, 2013 1 次提交
    • E
      net: attempt high order allocations in sock_alloc_send_pskb() · 28d64271
      Eric Dumazet 提交于
      Adding paged frags skbs to af_unix sockets introduced a performance
      regression on large sends because of additional page allocations, even
      if each skb could carry at least 100% more payload than before.
      
      We can instruct sock_alloc_send_pskb() to attempt high order
      allocations.
      
      Most of the time, it does a single page allocation instead of 8.
      
      I added an additional parameter to sock_alloc_send_pskb() to
      let other users to opt-in for this new feature on followup patches.
      
      Tested:
      
      Before patch :
      
      $ netperf -t STREAM_STREAM
      STREAM STREAM TEST
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/sec
      
       2304  212992  212992    10.00    46861.15
      
      After patch :
      
      $ netperf -t STREAM_STREAM
      STREAM STREAM TEST
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/sec
      
       2304  212992  212992    10.00    57981.11
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      28d64271
  5. 08 8月, 2013 2 次提交
  6. 28 7月, 2013 1 次提交
    • J
      tuntap: hardware vlan tx support · 6680ec68
      Jason Wang 提交于
      Inspired by commit f09e2249 (macvtap: restore
      vlan header on user read). This patch adds hardware vlan tx support for
      tuntap. This is done by copying vlan header directly into userspace in
      tun_put_user() instead of doing it through __vlan_put_tag() in
      dev_hard_start_xmit(). This eliminates one unnecessary memmove() in
      vlan_insert_tag() for 802.1ad and 802.1q traffic.
      
      pktgen test shows about 20% improvement for 802.1q traffic:
      
      Before:
        662149pps 317Mb/sec (317831520bps) errors: 0
      After:
        801033pps 384Mb/sec (384495840bps) errors: 0
      
      Cc: Basil Gor <basil.gor@gmail.com>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NJason Wang <jasowang@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6680ec68
  7. 23 7月, 2013 1 次提交
  8. 19 7月, 2013 1 次提交
    • J
      tuntap: do not zerocopy if iov needs more pages than MAX_SKB_FRAGS · 88529176
      Jason Wang 提交于
      We try to linearize part of the skb when the number of iov is greater than
      MAX_SKB_FRAGS. This is not enough since each single vector may occupy more than
      one pages, so zerocopy_sg_fromiovec() may still fail and may break the guest
      network.
      
      Solve this problem by calculate the pages needed for iov before trying to do
      zerocopy and switch to use copy instead of zerocopy if it needs more than
      MAX_SKB_FRAGS.
      
      This is done through introducing a new helper to count the pages for iov, and
      call uarg->callback() manually when switching from zerocopy to copy to notify
      vhost.
      
      We can do further optimization on top.
      
      The bug were introduced from commit 0690899b
      (tun: experimental zero copy tx support)
      
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NJason Wang <jasowang@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      88529176
  9. 11 7月, 2013 1 次提交
  10. 26 6月, 2013 1 次提交
  11. 13 6月, 2013 2 次提交
  12. 12 6月, 2013 1 次提交
  13. 11 6月, 2013 1 次提交
  14. 29 5月, 2013 1 次提交
    • J
      tuntap: forbid changing mq flag for persistent device · 8e6d91ae
      Jason Wang 提交于
      We currently allow changing the mq flag (IFF_MULTI_QUEUE) for a persistent
      device. This will result a mismatch between the number the queues in netdev and
      tuntap. This is because we only allocate a 1q netdevice when IFF_MULTI_QUEUE was
      not specified, so when we set the IFF_MULTI_QUEUE and try to attach more queues
      later, netif_set_real_num_tx_queues() may fail which result a single queue
      netdevice with multiple sockets attached.
      
      Solve this by disallowing changing the mq flag for persistent device.
      
      Bug was introduced by commit edfb6a14
      (tuntap: reduce memory using of queues).
      Reported-by: NSriram Narasimhan <sriram.narasimhan@hp.com>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NJason Wang <jasowang@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      8e6d91ae
  15. 29 4月, 2013 1 次提交
  16. 25 4月, 2013 1 次提交
  17. 13 4月, 2013 1 次提交
  18. 12 4月, 2013 1 次提交
    • J
      tuntap: initialize vlan_features · c0317998
      Jason Wang 提交于
      The vlan_features was zero which prevents vlan GSO packets to be transmitted to
      userspace. This is suboptimal so enable this by initialize vlan_features for
      tuntap.
      
      Netperf shows better performance of guest receiving since vlan TSO works for
      tuntap:
      
      before:
      netperf -H 192.168.5.4
      MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.5.4 ()
      port 0 AF_INET : demo
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/sec
      
       87380  16384  16384    10.01    2786.67
      
      after:
      netperf -H 192.168.5.4
      MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.5.4 ()
      port 0 AF_INET : demo
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/sec
      
       87380  16384  16384    10.00    8085.49
      Signed-off-by: NJason Wang <jasowang@redhat.com>
      Acked-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c0317998
  19. 28 3月, 2013 1 次提交
  20. 27 3月, 2013 1 次提交
    • J
      tuntap: set transport header before passing it to kernel · 38502af7
      Jason Wang 提交于
      Currently, for the packets receives from tuntap, before doing header check,
      kernel just reset the transport header in netif_receive_skb() which pretends no
      l4 header. This is suboptimal for precise packet length estimation (introduced
      in 1def9238) which needs correct l4 header for gso packets.
      
      So this patch set the transport header to csum_start for partial checksum
      packets, otherwise it first try skb_flow_dissect(), if it fails, just reset the
      transport header.
      
      Cc: Eric Dumazet <edumazet@google.com>
      Signed-off-by: NJason Wang <jasowang@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      38502af7
  21. 13 3月, 2013 1 次提交
  22. 07 3月, 2013 1 次提交
    • E
      tun: add a missing nf_reset() in tun_net_xmit() · f8af75f3
      Eric Dumazet 提交于
      Dave reported following crash :
      
      general protection fault: 0000 [#1] SMP
      CPU 2
      Pid: 25407, comm: qemu-kvm Not tainted 3.7.9-205.fc18.x86_64 #1 Hewlett-Packard HP Z400 Workstation/0B4Ch
      RIP: 0010:[<ffffffffa0399bd5>]  [<ffffffffa0399bd5>] destroy_conntrack+0x35/0x120 [nf_conntrack]
      RSP: 0018:ffff880276913d78  EFLAGS: 00010206
      RAX: 50626b6b7876376c RBX: ffff88026e530d68 RCX: ffff88028d158e00
      RDX: ffff88026d0d5470 RSI: 0000000000000011 RDI: 0000000000000002
      RBP: ffff880276913d88 R08: 0000000000000000 R09: ffff880295002900
      R10: 0000000000000000 R11: 0000000000000003 R12: ffffffff81ca3b40
      R13: ffffffff8151a8e0 R14: ffff880270875000 R15: 0000000000000002
      FS:  00007ff3bce38a00(0000) GS:ffff88029fc40000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
      CR2: 00007fd1430bd000 CR3: 000000027042b000 CR4: 00000000000027e0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      Process qemu-kvm (pid: 25407, threadinfo ffff880276912000, task ffff88028c369720)
      Stack:
       ffff880156f59100 ffff880156f59100 ffff880276913d98 ffffffff815534f7
       ffff880276913db8 ffffffff8151a74b ffff880270875000 ffff880156f59100
       ffff880276913dd8 ffffffff8151a5a6 ffff880276913dd8 ffff88026d0d5470
      Call Trace:
       [<ffffffff815534f7>] nf_conntrack_destroy+0x17/0x20
       [<ffffffff8151a74b>] skb_release_head_state+0x7b/0x100
       [<ffffffff8151a5a6>] __kfree_skb+0x16/0xa0
       [<ffffffff8151a666>] kfree_skb+0x36/0xa0
       [<ffffffff8151a8e0>] skb_queue_purge+0x20/0x40
       [<ffffffffa02205f7>] __tun_detach+0x117/0x140 [tun]
       [<ffffffffa022184c>] tun_chr_close+0x3c/0xd0 [tun]
       [<ffffffff8119669c>] __fput+0xec/0x240
       [<ffffffff811967fe>] ____fput+0xe/0x10
       [<ffffffff8107eb27>] task_work_run+0xa7/0xe0
       [<ffffffff810149e1>] do_notify_resume+0x71/0xb0
       [<ffffffff81640152>] int_signal+0x12/0x17
      Code: 00 00 04 48 89 e5 41 54 53 48 89 fb 4c 8b a7 e8 00 00 00 0f 85 de 00 00 00 0f b6 73 3e 0f b7 7b 2a e8 10 40 00 00 48 85 c0 74 0e <48> 8b 40 28 48 85 c0 74 05 48 89 df ff d0 48 c7 c7 08 6a 3a a0
      RIP  [<ffffffffa0399bd5>] destroy_conntrack+0x35/0x120 [nf_conntrack]
       RSP <ffff880276913d78>
      
      This is because tun_net_xmit() needs to call nf_reset()
      before queuing skb into receive_queue
      Reported-by: NDave Jones <davej@redhat.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      f8af75f3
  23. 28 2月, 2013 1 次提交
    • S
      hlist: drop the node parameter from iterators · b67bfe0d
      Sasha Levin 提交于
      I'm not sure why, but the hlist for each entry iterators were conceived
      
              list_for_each_entry(pos, head, member)
      
      The hlist ones were greedy and wanted an extra parameter:
      
              hlist_for_each_entry(tpos, pos, head, member)
      
      Why did they need an extra pos parameter? I'm not quite sure. Not only
      they don't really need it, it also prevents the iterator from looking
      exactly like the list iterator, which is unfortunate.
      
      Besides the semantic patch, there was some manual work required:
      
       - Fix up the actual hlist iterators in linux/list.h
       - Fix up the declaration of other iterators based on the hlist ones.
       - A very small amount of places were using the 'node' parameter, this
       was modified to use 'obj->member' instead.
       - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
       properly, so those had to be fixed up manually.
      
      The semantic patch which is mostly the work of Peter Senna Tschudin is here:
      
      @@
      iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
      
      type T;
      expression a,c,d,e;
      identifier b;
      statement S;
      @@
      
      -T b;
          <+... when != b
      (
      hlist_for_each_entry(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue(a,
      - b,
      c) S
      |
      hlist_for_each_entry_from(a,
      - b,
      c) S
      |
      hlist_for_each_entry_rcu(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_rcu_bh(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue_rcu_bh(a,
      - b,
      c) S
      |
      for_each_busy_worker(a, c,
      - b,
      d) S
      |
      ax25_uid_for_each(a,
      - b,
      c) S
      |
      ax25_for_each(a,
      - b,
      c) S
      |
      inet_bind_bucket_for_each(a,
      - b,
      c) S
      |
      sctp_for_each_hentry(a,
      - b,
      c) S
      |
      sk_for_each(a,
      - b,
      c) S
      |
      sk_for_each_rcu(a,
      - b,
      c) S
      |
      sk_for_each_from
      -(a, b)
      +(a)
      S
      + sk_for_each_from(a) S
      |
      sk_for_each_safe(a,
      - b,
      c, d) S
      |
      sk_for_each_bound(a,
      - b,
      c) S
      |
      hlist_for_each_entry_safe(a,
      - b,
      c, d, e) S
      |
      hlist_for_each_entry_continue_rcu(a,
      - b,
      c) S
      |
      nr_neigh_for_each(a,
      - b,
      c) S
      |
      nr_neigh_for_each_safe(a,
      - b,
      c, d) S
      |
      nr_node_for_each(a,
      - b,
      c) S
      |
      nr_node_for_each_safe(a,
      - b,
      c, d) S
      |
      - for_each_gfn_sp(a, c, d, b) S
      + for_each_gfn_sp(a, c, d) S
      |
      - for_each_gfn_indirect_valid_sp(a, c, d, b) S
      + for_each_gfn_indirect_valid_sp(a, c, d) S
      |
      for_each_host(a,
      - b,
      c) S
      |
      for_each_host_safe(a,
      - b,
      c, d) S
      |
      for_each_mesh_entry(a,
      - b,
      c, d) S
      )
          ...+>
      
      [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
      [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
      [akpm@linux-foundation.org: checkpatch fixes]
      [akpm@linux-foundation.org: fix warnings]
      [akpm@linux-foudnation.org: redo intrusive kvm changes]
      Tested-by: NPeter Senna Tschudin <peter.senna@gmail.com>
      Acked-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: NSasha Levin <sasha.levin@oracle.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b67bfe0d
  24. 14 2月, 2013 1 次提交
    • P
      net: Fix possible wrong checksum generation. · c9af6db4
      Pravin B Shelar 提交于
      Patch cef401de (net: fix possible wrong checksum
      generation) fixed wrong checksum calculation but it broke TSO by
      defining new GSO type but not a netdev feature for that type.
      net_gso_ok() would not allow hardware checksum/segmentation
      offload of such packets without the feature.
      
      Following patch fixes TSO and wrong checksum. This patch uses
      same logic that Eric Dumazet used. Patch introduces new flag
      SKBTX_SHARED_FRAG if at least one frag can be modified by
      the user. but SKBTX_SHARED_FRAG flag is kept in skb shared
      info tx_flags rather than gso_type.
      
      tx_flags is better compared to gso_type since we can have skb with
      shared frag without gso packet. It does not link SHARED_FRAG to
      GSO, So there is no need to define netdev feature for this.
      Signed-off-by: NPravin B Shelar <pshelar@nicira.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c9af6db4
  25. 30 1月, 2013 2 次提交
  26. 28 1月, 2013 1 次提交
    • E
      net: fix possible wrong checksum generation · cef401de
      Eric Dumazet 提交于
      Pravin Shelar mentioned that GSO could potentially generate
      wrong TX checksum if skb has fragments that are overwritten
      by the user between the checksum computation and transmit.
      
      He suggested to linearize skbs but this extra copy can be
      avoided for normal tcp skbs cooked by tcp_sendmsg().
      
      This patch introduces a new SKB_GSO_SHARED_FRAG flag, set
      in skb_shinfo(skb)->gso_type if at least one frag can be
      modified by the user.
      
      Typical sources of such possible overwrites are {vm}splice(),
      sendfile(), and macvtap/tun/virtio_net drivers.
      
      Tested:
      
      $ netperf -H 7.7.8.84
      MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
      7.7.8.84 () port 0 AF_INET
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/sec
      
       87380  16384  16384    10.00    3959.52
      
      $ netperf -H 7.7.8.84 -t TCP_SENDFILE
      TCP SENDFILE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 ()
      port 0 AF_INET
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/sec
      
       87380  16384  16384    10.00    3216.80
      
      Performance of the SENDFILE is impacted by the extra allocation and
      copy, and because we use order-0 pages, while the TCP_STREAM uses
      bigger pages.
      Reported-by: NPravin Shelar <pshelar@nicira.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cef401de
  27. 24 1月, 2013 2 次提交
    • J
      tuntap: limit the number of flow caches · b8732fb7
      Jason Wang 提交于
      We create new flow caches when a new flow is identified by tuntap, This may lead
      some issues:
      
      - userspace may produce a huge amount of short live flows to exhaust host memory
      - the unlimited number of flow caches may produce a long list which increase the
        time in the linear searching
      
      Solve this by introducing a limit of total number of flow caches.
      
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: NJason Wang <jasowang@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b8732fb7
    • J
      tuntap: reduce memory using of queues · edfb6a14
      Jason Wang 提交于
      A MAX_TAP_QUEUES(1024) queues of tuntap device is always allocated
      unconditionally even userspace only requires a single queue device. This is
      unnecessary and will lead a very high order of page allocation when has a high
      possibility to fail. Solving this by creating a one queue net device when
      userspace only use one queue and also reduce MAX_TAP_QUEUES to
      DEFAULT_MAX_NUM_RSS_QUEUES which can guarantee the success of
      the allocation.
      Reported-by: NDirk Hohndel <dirk@hohndel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NJason Wang <jasowang@redhat.com>
      Acked-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      edfb6a14
  28. 15 1月, 2013 1 次提交
    • P
      tun: fix LSM/SELinux labeling of tun/tap devices · 5dbbaf2d
      Paul Moore 提交于
      This patch corrects some problems with LSM/SELinux that were introduced
      with the multiqueue patchset.  The problem stems from the fact that the
      multiqueue work changed the relationship between the tun device and its
      associated socket; before the socket persisted for the life of the
      device, however after the multiqueue changes the socket only persisted
      for the life of the userspace connection (fd open).  For non-persistent
      devices this is not an issue, but for persistent devices this can cause
      the tun device to lose its SELinux label.
      
      We correct this problem by adding an opaque LSM security blob to the
      tun device struct which allows us to have the LSM security state, e.g.
      SELinux labeling information, persist for the lifetime of the tun
      device.  In the process we tweak the LSM hooks to work with this new
      approach to TUN device/socket labeling and introduce a new LSM hook,
      security_tun_dev_attach_queue(), to approve requests to attach to a
      TUN queue via TUNSETQUEUE.
      
      The SELinux code has been adjusted to match the new LSM hooks, the
      other LSMs do not make use of the LSM TUN controls.  This patch makes
      use of the recently added "tun_socket:attach_queue" permission to
      restrict access to the TUNSETQUEUE operation.  On older SELinux
      policies which do not define the "tun_socket:attach_queue" permission
      the access control decision for TUNSETQUEUE will be handled according
      to the SELinux policy's unknown permission setting.
      Signed-off-by: NPaul Moore <pmoore@redhat.com>
      Acked-by: NEric Paris <eparis@parisplace.org>
      Tested-by: NJason Wang <jasowang@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5dbbaf2d
  29. 12 1月, 2013 3 次提交
  30. 11 1月, 2013 2 次提交
  31. 22 12月, 2012 1 次提交
  32. 18 12月, 2012 1 次提交