1. 30 1月, 2013 2 次提交
  2. 28 1月, 2013 1 次提交
    • E
      net: fix possible wrong checksum generation · cef401de
      Eric Dumazet 提交于
      Pravin Shelar mentioned that GSO could potentially generate
      wrong TX checksum if skb has fragments that are overwritten
      by the user between the checksum computation and transmit.
      
      He suggested to linearize skbs but this extra copy can be
      avoided for normal tcp skbs cooked by tcp_sendmsg().
      
      This patch introduces a new SKB_GSO_SHARED_FRAG flag, set
      in skb_shinfo(skb)->gso_type if at least one frag can be
      modified by the user.
      
      Typical sources of such possible overwrites are {vm}splice(),
      sendfile(), and macvtap/tun/virtio_net drivers.
      
      Tested:
      
      $ netperf -H 7.7.8.84
      MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
      7.7.8.84 () port 0 AF_INET
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/sec
      
       87380  16384  16384    10.00    3959.52
      
      $ netperf -H 7.7.8.84 -t TCP_SENDFILE
      TCP SENDFILE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 ()
      port 0 AF_INET
      Recv   Send    Send
      Socket Socket  Message  Elapsed
      Size   Size    Size     Time     Throughput
      bytes  bytes   bytes    secs.    10^6bits/sec
      
       87380  16384  16384    10.00    3216.80
      
      Performance of the SENDFILE is impacted by the extra allocation and
      copy, and because we use order-0 pages, while the TCP_STREAM uses
      bigger pages.
      Reported-by: NPravin Shelar <pshelar@nicira.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cef401de
  3. 24 1月, 2013 2 次提交
    • J
      tuntap: limit the number of flow caches · b8732fb7
      Jason Wang 提交于
      We create new flow caches when a new flow is identified by tuntap, This may lead
      some issues:
      
      - userspace may produce a huge amount of short live flows to exhaust host memory
      - the unlimited number of flow caches may produce a long list which increase the
        time in the linear searching
      
      Solve this by introducing a limit of total number of flow caches.
      
      Cc: Stephen Hemminger <stephen@networkplumber.org>
      Signed-off-by: NJason Wang <jasowang@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      b8732fb7
    • J
      tuntap: reduce memory using of queues · edfb6a14
      Jason Wang 提交于
      A MAX_TAP_QUEUES(1024) queues of tuntap device is always allocated
      unconditionally even userspace only requires a single queue device. This is
      unnecessary and will lead a very high order of page allocation when has a high
      possibility to fail. Solving this by creating a one queue net device when
      userspace only use one queue and also reduce MAX_TAP_QUEUES to
      DEFAULT_MAX_NUM_RSS_QUEUES which can guarantee the success of
      the allocation.
      Reported-by: NDirk Hohndel <dirk@hohndel.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NJason Wang <jasowang@redhat.com>
      Acked-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      edfb6a14
  4. 15 1月, 2013 1 次提交
    • P
      tun: fix LSM/SELinux labeling of tun/tap devices · 5dbbaf2d
      Paul Moore 提交于
      This patch corrects some problems with LSM/SELinux that were introduced
      with the multiqueue patchset.  The problem stems from the fact that the
      multiqueue work changed the relationship between the tun device and its
      associated socket; before the socket persisted for the life of the
      device, however after the multiqueue changes the socket only persisted
      for the life of the userspace connection (fd open).  For non-persistent
      devices this is not an issue, but for persistent devices this can cause
      the tun device to lose its SELinux label.
      
      We correct this problem by adding an opaque LSM security blob to the
      tun device struct which allows us to have the LSM security state, e.g.
      SELinux labeling information, persist for the lifetime of the tun
      device.  In the process we tweak the LSM hooks to work with this new
      approach to TUN device/socket labeling and introduce a new LSM hook,
      security_tun_dev_attach_queue(), to approve requests to attach to a
      TUN queue via TUNSETQUEUE.
      
      The SELinux code has been adjusted to match the new LSM hooks, the
      other LSMs do not make use of the LSM TUN controls.  This patch makes
      use of the recently added "tun_socket:attach_queue" permission to
      restrict access to the TUNSETQUEUE operation.  On older SELinux
      policies which do not define the "tun_socket:attach_queue" permission
      the access control decision for TUNSETQUEUE will be handled according
      to the SELinux policy's unknown permission setting.
      Signed-off-by: NPaul Moore <pmoore@redhat.com>
      Acked-by: NEric Paris <eparis@parisplace.org>
      Tested-by: NJason Wang <jasowang@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5dbbaf2d
  5. 12 1月, 2013 3 次提交
  6. 11 1月, 2013 2 次提交
  7. 22 12月, 2012 1 次提交
  8. 18 12月, 2012 2 次提交
  9. 15 12月, 2012 1 次提交
    • J
      tuntap: fix ambigious multiqueue API · 4008e97f
      Jason Wang 提交于
      The current multiqueue API is ambigious which may confuse both user and LSM to
      do things correctly:
      
      - Both TUNSETIFF and TUNSETQUEUE could be used to create the queues of a tuntap
        device.
      - TUNSETQUEUE were used to disable and enable a specific queue of the
        device. But since the state of tuntap were completely removed from the queue,
        it could be used to attach to another device (there's no such kind of
        requirement currently, and it needs new kind of LSM policy.
      - TUNSETQUEUE could be used to attach to a persistent device without any
        queues. This kind of attching bypass the necessary checking during TUNSETIFF
        and may lead unexpected result.
      
      So this patch tries to make a cleaner and simpler API by:
      
      - Only allow TUNSETIFF to create queues.
      - TUNSETQUEUE could be only used to disable and enabled the queues of a device,
        and the state of the tuntap device were not detachd from the queues when it
        was disabled, so TUNSETQUEUE could be only used after TUNSETIFF and with the
         same device.
      
      This is done by introducing a list which keeps track of all queues which were
      disabled. The queue would be moved between this list and tfiles[] array when it
      was enabled/disabled. A pointer of the tun_struct were also introdued to track
      the device it belongs to when it was disabled.
      
      After the change, the isolation between management and application could be done
      through: TUNSETIFF were only called by management software and TUNSETQUEUE were
      only called by application.For LSM/SELinux, the things left is to do proper
      check during tun_set_queue() if needed.
      Signed-off-by: NJason Wang <jasowang@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      4008e97f
  10. 14 12月, 2012 1 次提交
    • E
      tuntap: dont use skb after netif_rx_ni(skb) · 49974420
      Eric Dumazet 提交于
      On Wed, 2012-12-12 at 23:16 -0500, Dave Jones wrote:
      > Since todays net merge, I see this when I start openvpn..
      >
      > general protection fault: 0000 [#1] PREEMPT SMP
      > Modules linked in: ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables xfs iTCO_wdt iTCO_vendor_support snd_emu10k1 snd_util_mem snd_ac97_codec coretemp ac97_bus microcode snd_hwdep snd_seq pcspkr snd_pcm snd_page_alloc snd_timer lpc_ich i2c_i801 snd_rawmidi mfd_core snd_seq_device snd e1000e soundcore emu10k1_gp gameport i82975x_edac edac_core vhost_net tun macvtap macvlan kvm_intel kvm binfmt_misc nfsd auth_rpcgss nfs_acl lockd sunrpc btrfs libcrc32c zlib_deflate firewire_ohci sata_sil firewire_core crc_itu_t radeon i2c_algo_bit drm_kms_helper ttm drm i2c_core floppy
      > CPU 0
      > Pid: 1381, comm: openvpn Not tainted 3.7.0+ #14                  /D975XBX
      > RIP: 0010:[<ffffffff815b54a4>]  [<ffffffff815b54a4>] skb_flow_dissect+0x314/0x3e0
      > RSP: 0018:ffff88007d0d9c48  EFLAGS: 00010206
      > RAX: 000000000000055d RBX: 6b6b6b6b6b6b6b4b RCX: 1471030a0180040a
      > RDX: 0000000000000005 RSI: 00000000ffffffe0 RDI: ffff8800ba83fa80
      > RBP: ffff88007d0d9cb8 R08: 0000000000000000 R09: 0000000000000000
      > R10: 0000000000000000 R11: 0000000000000101 R12: ffff8800ba83fa80
      > R13: 0000000000000008 R14: ffff88007d0d9cc8 R15: ffff8800ba83fa80
      > FS:  00007f6637104800(0000) GS:ffff8800bf600000(0000) knlGS:0000000000000000
      > CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      > CR2: 00007f563f5b01c4 CR3: 000000007d140000 CR4: 00000000000007f0
      > DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      > DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
      > Process openvpn (pid: 1381, threadinfo ffff88007d0d8000, task ffff8800a540cd60)
      > Stack:
      >  ffff8800ba83fa80 0000000000000296 0000000000000000 0000000000000000
      >  ffff88007d0d9cc8 ffffffff815bcff4 ffff88007d0d9ce8 ffffffff815b1831
      >  ffff88007d0d9ca8 00000000703f6364 ffff8800ba83fa80 0000000000000000
      > Call Trace:
      >  [<ffffffff815bcff4>] ? netif_rx+0x114/0x4c0
      >  [<ffffffff815b1831>] ? skb_copy_datagram_from_iovec+0x61/0x290
      >  [<ffffffff815b672a>] __skb_get_rxhash+0x1a/0xd0
      >  [<ffffffffa03b9538>] tun_get_user+0x418/0x810 [tun]
      >  [<ffffffff8135f468>] ? delay_tsc+0x98/0xf0
      >  [<ffffffff8109605c>] ? __rcu_read_unlock+0x5c/0xa0
      >  [<ffffffffa03b9a41>] tun_chr_aio_write+0x81/0xb0 [tun]
      >  [<ffffffff81145011>] ? __buffer_unlock_commit+0x41/0x50
      >  [<ffffffff811db917>] do_sync_write+0xa7/0xe0
      >  [<ffffffff811dc01f>] vfs_write+0xaf/0x190
      >  [<ffffffff811dc375>] sys_write+0x55/0xa0
      >  [<ffffffff81705540>] tracesys+0xdd/0xe2
      > Code: 41 8b 44 24 68 41 2b 44 24 6c 01 de 29 f0 83 f8 03 0f 8e a0 00 00 00 48 63 de 49 03 9c 24 e0 00 00 00 48 85 db 0f 84 72 fe ff ff <8b> 03 41 89 46 08 b8 01 00 00 00 e9 43 fd ff ff 0f 1f 40 00 48
      > RIP  [<ffffffff815b54a4>] skb_flow_dissect+0x314/0x3e0
      >  RSP <ffff88007d0d9c48>
      > ---[ end trace 6d42c834c72c002e ]---
      >
      >
      > Faulting instruction is
      >
      >    0:	8b 03                	mov    (%rbx),%eax
      >
      > rbx is slab poison (-20) so this looks like a use-after-free here...
      >
      >                         flow->ports = *ports;
      >  314:   8b 03                   mov    (%rbx),%eax
      >  316:   41 89 46 08             mov    %eax,0x8(%r14)
      >
      > in the inlined skb_header_pointer in skb_flow_dissect
      >
      > 	Dave
      >
      
      commit 96442e42 (tuntap: choose the txq based on rxq) added
      a use after free.
      
      Cache rxhash in a temp variable before calling netif_rx_ni()
      Reported-by: NDave Jones <davej@redhat.com>
      Signed-off-by: NEric Dumazet <edumazet@google.com>
      Cc: Jason Wang <jasowang@redhat.com>
      Acked-by: NJason Wang <jasowang@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      49974420
  11. 12 12月, 2012 1 次提交
  12. 08 12月, 2012 1 次提交
  13. 04 12月, 2012 2 次提交
    • M
      tun: only queue packets on device · 5d097109
      Michael S. Tsirkin 提交于
      Historically tun supported two modes of operation:
      - in default mode, a small number of packets would get queued
        at the device, the rest would be queued in qdisc
      - in one queue mode, all packets would get queued at the device
      
      This might have made sense up to a point where we made the
      queue depth for both modes the same and set it to
      a huge value (500) so unless the consumer
      is stuck the chance of losing packets is small.
      
      Thus in practice both modes behave the same, but the
      default mode has some problems:
      - if packets are never consumed, fragments are never orphaned
        which cases a DOS for sender using zero copy transmit
      - overrun errors are hard to diagnose: fifo error is incremented
        only once so you can not distinguish between
        userspace that is stuck and a transient failure,
        tcpdump on the device does not show any traffic
      
      Userspace solves this simply by enabling IFF_ONE_QUEUE
      but there seems to be little point in not doing the
      right thing for everyone, by default.
      Signed-off-by: NMichael S. Tsirkin <mst@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      5d097109
    • J
      tuntap: attach queue 0 before registering netdevice · eb0fb363
      Jason Wang 提交于
      We attach queue 0 after registering netdevice currently. This leads to call
      netif_set_real_num_{tx|rx}_queues() after registering the netdevice. Since we
      allow tun/tap has a maximum of 1024 queues, this may lead a huge number of
      uevents to be injected to userspace since we create 2048 kobjects and then
      remove 2046. Solve this problem by attaching queue 0 and set the real number of
      queues before registering netdevice.
      Reported-by: NJiri Slaby <jslaby@suse.cz>
      Tested-by: NJiri Slaby <jslaby@suse.cz>
      Signed-off-by: NJason Wang <jasowang@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      eb0fb363
  14. 27 11月, 2012 2 次提交
  15. 24 11月, 2012 1 次提交
  16. 20 11月, 2012 1 次提交
  17. 03 11月, 2012 1 次提交
  18. 01 11月, 2012 6 次提交
    • J
      tuntap: choose the txq based on rxq · 96442e42
      Jason Wang 提交于
      This patch implements a simple multiqueue flow steering policy - tx follows rx
      for tun/tap. The idea is simple, it just choose the txq based on which rxq it
      comes. The flow were identified through the rxhash of a skb, and the hash to
      queue mapping were recorded in a hlist with an ageing timer to retire the
      mapping. The mapping were created when tun receives packet from userspace, and
      was quired in .ndo_select_queue().
      
      I run co-current TCP_CRR test and didn't see any mapping manipulation helpers in
      perf top, so the overhead could be negelected.
      Signed-off-by: NJason Wang <jasowang@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      96442e42
    • J
      tuntap: add ioctl to attach or detach a file form tuntap device · cde8b15f
      Jason Wang 提交于
      Sometimes usespace may need to active/deactive a queue, this could be done by
      detaching and attaching a file from tuntap device.
      
      This patch introduces a new ioctls - TUNSETQUEUE which could be used to do
      this. Flag IFF_ATTACH_QUEUE were introduced to do attaching while
      IFF_DETACH_QUEUE were introduced to do the detaching.
      Signed-off-by: NJason Wang <jasowang@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      cde8b15f
    • J
      tuntap: multiqueue support · c8d68e6b
      Jason Wang 提交于
      This patch converts tun/tap to a multiqueue devices and expose the multiqueue
      queues as multiple file descriptors to userspace. Internally, each tun_file were
      abstracted as a queue, and an array of pointers to tun_file structurs were
      stored in tun_structure device, so multiple tun_files were allowed to be
      attached to the device as multiple queues.
      
      When choosing txq, we first try to identify a flow through its rxhash, if it
      does not have such one, we could try recorded rxq and then use them to choose
      the transmit queue. This policy may be changed in the future.
      Signed-off-by: NJason Wang <jasowang@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      c8d68e6b
    • J
      tuntap: RCUify dereferencing between tun_struct and tun_file · 6e914fc7
      Jason Wang 提交于
      RCU were introduced in this patch to synchronize the dereferences between
      tun_struct and tun_file. All tun_{get|put} were replaced with RCU, the
      dereference from one to other must be done under rtnl lock or rcu read critical
      region.
      
      This is needed for the following patches since the one of the goal of multiqueue
      tuntap is to allow adding or removing queues during workload. Without RCU,
      control path would hold tx locks when adding or removing queues (which may cause
      sme delay) and it's hard to change the number of queues without stopping the net
      device. With the help of rcu, there's also no need for tun_file hold an refcnt
      to tun_struct.
      Signed-off-by: NJason Wang <jasowang@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6e914fc7
    • J
      tuntap: move socket to tun_file · 54f968d6
      Jason Wang 提交于
      Current tuntap makes use of the socket receive queue as its tx queue. To
      implement multiple tx queues for tuntap and enable the ability of adding and
      removing queues during workload, the first step is to move the socket related
      structures to tun_file. Then we could let multiple fds/sockets to be attached to
      the tuntap.
      
      This patch removes tun_sock and moves socket related structures from tun_sock or
      tun_struct to tun_file. Two exceptions are tap_filter and sock_fprog, they are
      still kept in tun_structure since they are used to filter packets for the net
      device instead of per transmit queue (at least I see no requirements for
      them). After those changes, socket were created and destroyed during file open
      and close (instead of device creation and destroy), the socket structures could
      be dereferenced from tun_file instead of the file of tun_struct structure
      itself.
      
      For persisent device, since we purge during datching and wouldn't queue any
      packets when no interface were attached, there's no behaviod changes before and
      after this patch, so the changes were transparent to the userspace. To keep the
      attributes such as sndbuf, socket filter and vnet header, those would be
      re-initialize after a new interface were attached to an persist device.
      Signed-off-by: NJason Wang <jasowang@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      54f968d6
    • J
      tuntap: log the unsigned informaiton with %u · 1e588338
      Jason Wang 提交于
      Signed-off-by: NJason Wang <jasowang@redhat.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      1e588338
  19. 26 10月, 2012 2 次提交
    • D
      cgroup: net_cls: Rework update socket logic · 6a328d8c
      Daniel Wagner 提交于
      The cgroup logic part of net_cls is very similar as the one in
      net_prio. Let's stream line the net_cls logic with the net_prio one.
      
      The net_prio update logic was changed by following commit (note there
      were some changes necessary later on)
      
      commit 406a3c63
      Author: John Fastabend <john.r.fastabend@intel.com>
      Date:   Fri Jul 20 10:39:25 2012 +0000
      
          net: netprio_cgroup: rework update socket logic
      
          Instead of updating the sk_cgrp_prioidx struct field on every send
          this only updates the field when a task is moved via cgroup
          infrastructure.
      
          This allows sockets that may be used by a kernel worker thread
          to be managed. For example in the iscsi case today a user can
          put iscsid in a netprio cgroup and control traffic will be sent
          with the correct sk_cgrp_prioidx value set but as soon as data
          is sent the kernel worker thread isssues a send and sk_cgrp_prioidx
          is updated with the kernel worker threads value which is the
          default case.
      
          It seems more correct to only update the field when the user
          explicitly sets it via control group infrastructure. This allows
          the users to manage sockets that may be used with other threads.
      
      Since classid is now updated when the task is moved between the
      cgroups, we don't have to call sock_update_classid() from various
      places to ensure we always using the latest classid value.
      
      [v2: Use iterate_fd() instead of open coding]
      Signed-off-by: NDaniel Wagner <daniel.wagner@bmw-carit.de>
      Cc:  Li Zefan <lizefan@huawei.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Jamal Hadi Salim <jhs@mojatatu.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: John Fastabend <john.r.fastabend@intel.com>
      Cc: Neil Horman <nhorman@tuxdriver.com>
      Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: <netdev@vger.kernel.org>
      Cc: <cgroups@vger.kernel.org>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      6a328d8c
    • D
      cgroup: net_cls: Pass in task to sock_update_classid() · fd9a08a7
      Daniel Wagner 提交于
      sock_update_classid() assumes that the update operation always are
      applied on the current task. sock_update_classid() needs to know on
      which tasks to work on in order to be able to migrate task between
      cgroups using the struct cgroup_subsys attach() callback.
      Signed-off-by: NDaniel Wagner <daniel.wagner@bmw-carit.de>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Eric Dumazet <edumazet@google.com>
      Cc: Glauber Costa <glommer@parallels.com>
      Cc: Joe Perches <joe@perches.com>
      Cc: Neil Horman <nhorman@tuxdriver.com>
      Cc: Stanislav Kinsbursky <skinsbursky@parallels.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: <netdev@vger.kernel.org>
      Cc: <cgroups@vger.kernel.org>
      Acked-by: NNeil Horman <nhorman@tuxdriver.com>
      Signed-off-by: NDavid S. Miller <davem@davemloft.net>
      fd9a08a7
  20. 15 9月, 2012 1 次提交
  21. 15 8月, 2012 1 次提交
  22. 10 8月, 2012 1 次提交
  23. 31 7月, 2012 1 次提交
  24. 30 7月, 2012 1 次提交
  25. 23 7月, 2012 2 次提交