1. 12 Jul, 2018 1 commit
  2. 02 Jul, 2018 1 commit
  3. 28 Jun, 2018 1 commit
    • skbuff: preserve sock reference when scrubbing the skb. · 9c4c3252
      Committed by Flavio Leitner
      The sock reference is lost when scrubbing the packet and that breaks
      TSQ (TCP Small Queues) and XPS (Transmit Packet Steering) causing
      performance impacts of about 50% in a single TCP stream when crossing
      network namespaces.
      
      XPS breaks because the queue mapping stored in the socket is not
      available, so another random queue might be selected when the stack
      needs to transmit something like a TCP ACK, or TCP Retransmissions.
      That causes packet re-ordering and/or performance issues.
      
      TSQ breaks because it orphans the packet while it is still in the
      host, so packets are queued contributing to the buffer bloat problem.
      
      Preserving the sock reference fixes both issues. The socket is
      orphaned anyway in the receive path before any relevant action,
      and on the TX side netfilter checks whether the reference is local
      before using it.
      Signed-off-by: Flavio Leitner <fbl@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  4. 24 Jun, 2018 1 commit
  5. 23 Jun, 2018 4 commits
  6. 15 Jun, 2018 1 commit
  7. 06 Jun, 2018 1 commit
  8. 05 Jun, 2018 4 commits
  9. 04 Jun, 2018 1 commit
    • xsk: new descriptor addressing scheme · bbff2f32
      Committed by Björn Töpel
      Currently, AF_XDP only supports a fixed frame-size memory scheme where
      each frame is referenced via an index (idx). A user passes the frame
      index to the kernel, and the kernel acts upon the data.  Some NICs,
      however, do not have a fixed frame-size model, instead they have a
      model where a memory window is passed to the hardware and multiple
      frames are filled into that window (referred to as the "type-writer"
      model).
      
      By changing the descriptor format from the current frame index
      addressing scheme, AF_XDP can in the future be extended to support
      these kinds of NICs.
      
      In the index-based model, an idx refers to a frame of size
      frame_size. Addressing a frame in the UMEM is done by offsetting the
      UMEM starting address by a global offset, idx * frame_size + offset.
      Communicating via the fill- and completion-rings is done by means of
      idx.
      
      In this commit, the idx is removed in favor of an address (addr),
      which is a relative address ranging over the UMEM. Converting an
      idx-based address to the new addr is simple: addr = idx * frame_size +
      offset.
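
The conversion described above can be sketched in a few lines of C. This is only an illustration: the helper name idx_to_addr is hypothetical and is not part of the AF_XDP uapi, and the 64-bit result type is an assumption chosen so the multiplication cannot wrap.

```c
#include <stdint.h>

/*
 * Hypothetical helper (not part of the AF_XDP uapi): convert an old
 * index-based reference into the new UMEM-relative address.
 * The multiplication is done in 64 bits so that large UMEMs do not
 * wrap around a 32-bit value.
 */
static inline uint64_t idx_to_addr(uint32_t idx, uint32_t frame_size,
                                   uint32_t offset)
{
    return (uint64_t)idx * frame_size + offset;
}
```

For example, with a frame_size of 2048, idx 3 and offset 50 map to addr 6194.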
      
      We also stop referring to the UMEM "frame" as a frame. Instead it is
      simply called a chunk.
      
      To transfer ownership of a chunk to the kernel, the addr of the chunk
      is passed in the fill-ring. Note that the kernel will mask addr to
      make it chunk aligned, so there is no need for userspace to do
      that. E.g., for a chunk size of 2k, passing an addr of 2048, 2050 or
      3000 to the fill-ring will refer to the same chunk.
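
The masking behaviour can be illustrated with a minimal sketch. It assumes the chunk size is a power of two, which is what makes a plain bit mask sufficient; chunk_base is a hypothetical name and not a kernel function.

```c
#include <stdint.h>

/*
 * Hypothetical helper: return the start address of the chunk that
 * contains addr. Assumes chunk_size is a power of two, so clearing the
 * low bits is enough to align the address down.
 */
static inline uint64_t chunk_base(uint64_t addr, uint64_t chunk_size)
{
    return addr & ~(chunk_size - 1);
}
```

With a chunk size of 2048, the addresses 2048, 2050 and 3000 all yield 2048, matching the example in the text.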
      
      On the completion-ring, the addr will match that of the Tx descriptor,
      passed to the kernel.
      
      Changing the descriptor format to use chunks/addr will allow for
      future changes to move to a type-writer based model, where multiple
      frames can reside in one chunk. In this model, passing a single chunk
      into the fill-ring would potentially result in multiple Rx
      descriptors.
      
      This commit changes the uapi of AF_XDP sockets, and updates the
      documentation.
      Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
      Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
  10. 29 May, 2018 3 commits
    • virtio_net: Extend virtio to use VF datapath when available · ba5e4426
      Committed by Sridhar Samudrala
      This patch enables virtio_net to switch over to a VF datapath when the STANDBY
      feature is enabled and a VF netdev is present with the same MAC address.
      It allows live migration of a VM with a directly attached VF without the need
      to set up a bond/team between a VF and a virtio net device in the guest.
      
      It uses the API that is exported by the net_failover driver to create and
      destroy a master failover netdev. When the STANDBY feature is enabled, an
      additional netdev (failover netdev) is created that acts as a master device
      and tracks the state of the 2 lower netdevs. The original virtio_net netdev
      is marked as 'standby' netdev and a passthru device with the same MAC is
      registered as 'primary' netdev.
      
      The hypervisor needs to unplug the VF device from the guest on the source
      host and reset the MAC filter of the VF to initiate failover of datapath
      to virtio before starting the migration. After the migration is completed,
      the destination hypervisor sets the MAC filter on the VF and plugs it back
      to the guest to switch over to VF datapath.
      
      This patch is based on the discussion initiated by Jesse on this thread:
      https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
      Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Introduce net_failover driver · cfc80d9a
      Committed by Sridhar Samudrala
      The net_failover driver provides an automated failover mechanism via APIs
      to create and destroy a failover master netdev, and manages the primary and
      standby slave netdevs that get registered via the generic failover
      infrastructure.
      
      The failover netdev acts as a master device and controls 2 slave devices. The
      original paravirtual interface gets registered as 'standby' slave netdev and
      a passthru/vf device with the same MAC gets registered as 'primary' slave
      netdev. Both 'standby' and 'failover' netdevs are associated with the same
      'pci' device. The user accesses the network interface via 'failover' netdev.
      The 'failover' netdev chooses 'primary' netdev as default for transmits when
      it is available with link up and running.
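
The transmit-path choice described above can be sketched as follows. This is not the driver's code: the struct and function names are hypothetical, and falling back to the standby device when the primary is unusable is an assumption drawn from the failover behaviour described in this message.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of a lower netdev's state. */
struct sketch_netdev {
    const char *name;
    bool present;   /* device is registered with the failover instance */
    bool link_up;   /* carrier is up */
    bool running;   /* interface is administratively running */
};

/* A device can carry traffic only if it exists, has link and is running. */
static bool sketch_usable(const struct sketch_netdev *dev)
{
    return dev != NULL && dev->present && dev->link_up && dev->running;
}

/*
 * Prefer the primary (VF) device for transmits; otherwise fall back to
 * the standby (paravirtual) device. Returns NULL if neither is usable.
 */
static const struct sketch_netdev *
sketch_pick_xmit(const struct sketch_netdev *primary,
                 const struct sketch_netdev *standby)
{
    if (sketch_usable(primary))
        return primary;
    if (sketch_usable(standby))
        return standby;
    return NULL;
}
```

Under these assumptions, unplugging the VF (primary no longer present) makes the same call return the standby device, which corresponds to the failover step used before live migration.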
      
      This can be used by paravirtual drivers to enable an alternate low latency
      datapath. It also enables hypervisor controlled live migration of a VM with
      direct attached VF by failing over to the paravirtual datapath when the VF
      is unplugged.
      Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • net: Introduce generic failover module · 30c8bd5a
      Committed by Sridhar Samudrala
      The failover module provides a generic interface for paravirtual drivers
      to register a netdev and a set of ops with a failover instance. The ops
      are used as event handlers that get called to handle netdev register/
      unregister/link change/name change events on slave pci ethernet devices
      with the same mac address as the failover netdev.
      
      This enables paravirtual drivers to use a VF as an accelerated low latency
      datapath. It also allows migration of VMs with direct attached VFs by
      failing over to the paravirtual datapath when the VF is unplugged.
      Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  11. 25 May, 2018 1 commit
    • ppp: remove the PPPIOCDETACH ioctl · af8d3c7c
      Committed by Eric Biggers
      The PPPIOCDETACH ioctl effectively tries to "close" the given ppp file
      before f_count has reached 0, which is fundamentally a bad idea.  It
      does check 'f_count < 2', which excludes concurrent operations on the
      file since they would only be possible with a shared fd table, in which
      case each fdget() would take a file reference.  However, it fails to
      account for the fact that even with 'f_count == 1' the file can still be
      linked into epoll instances.  As reported by syzbot, this can trivially
      be used to cause a use-after-free.
      
      Yet, the only known user of PPPIOCDETACH is pppd versions older than
      ppp-2.4.2, which was released almost 15 years ago (November 2003).
      Also, PPPIOCDETACH apparently stopped working reliably at around the
      same time, when the f_count check was added to the kernel, e.g. see
      https://lkml.org/lkml/2002/12/31/83.  Also, the current 'f_count < 2'
      check makes PPPIOCDETACH only work in single-threaded applications; it
      always fails if called from a multithreaded application.
      
      All pppd versions released in the last 15 years just close() the file
      descriptor instead.
      
      Therefore, instead of hacking around this bug by exporting epoll
      internals to modules, and probably missing other related bugs, just
      remove the PPPIOCDETACH ioctl and see if anyone actually notices.  Leave
      a stub in place that prints a one-time warning and returns EINVAL.
      
      Reported-by: syzbot+16363c99d4134717c05b@syzkaller.appspotmail.com
      Fixes: 1da177e4 ("Linux-2.6.12-rc2")
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Acked-by: Paul Mackerras <paulus@ozlabs.org>
      Reviewed-by: Guillaume Nault <g.nault@alphalink.fr>
      Tested-by: Guillaume Nault <g.nault@alphalink.fr>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  12. 18 May, 2018 4 commits
  13. 17 May, 2018 2 commits
  14. 12 May, 2018 1 commit
  15. 11 May, 2018 1 commit
  16. 04 May, 2018 1 commit
  17. 28 Apr, 2018 1 commit
  18. 27 Apr, 2018 2 commits
  19. 20 Apr, 2018 1 commit
  20. 16 Apr, 2018 2 commits
  21. 01 Apr, 2018 2 commits
    • inet: frags: break the 2GB limit for frags storage · 3e67f106
      Committed by Eric Dumazet
      Some users are willing to provision huge amounts of memory to be able
      to perform reassembly reasonably well under pressure.
      
      Current memory tracking is using one atomic_t and integers.
      
      Switch to atomic_long_t so that 64bit arches can use more than 2GB,
      without any cost for 32bit arches.
      
      Note that this patch avoids an overflow error if high_thresh was set
      to ~2GB, since this test in inet_frag_alloc() was never true:
      
      if (... || frag_mem_limit(nf) > nf->high_thresh)
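
Why the test could never fire can be shown with a small sketch. It assumes the old counter behaves like a signed 32-bit integer and that the threshold sits at the top of that range; the function names are hypothetical and this is not the kernel's code.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Old scheme (assumed 32-bit signed accounting): if high_thresh is at
 * INT32_MAX, no representable value of mem can be greater, so the
 * limit check is never true.
 */
static bool over_limit_32(int32_t mem, int32_t high_thresh)
{
    return mem > high_thresh;
}

/*
 * New scheme on 64-bit arches (long-sized accounting): the counter can
 * exceed 2GB, so the comparison works for large thresholds too.
 */
static bool over_limit_64(int64_t mem, int64_t high_thresh)
{
    return mem > high_thresh;
}
```

With the figures from the test run in this message, over_limit_64(16000002880, 16000000000) is true, whereas neither value fits in the 32-bit variant at all.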
      
      Tested:
      
      $ echo 16000000000 >/proc/sys/net/ipv4/ipfrag_high_thresh
      
      <frag DDOS>
      
      $ grep FRAG /proc/net/sockstat
      FRAG: inuse 14705885 memory 16000002880
      
      $ nstat -n ; sleep 1 ; nstat | grep Reas
      IpReasmReqds                    3317150            0.0
      IpReasmFails                    3317112            0.0
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
    • inet: frags: use rhashtables for reassembly units · 648700f7
      Committed by Eric Dumazet
      Some applications still rely on IP fragmentation, and to be fair, the Linux
      reassembly unit does not work under any serious load.
      
      It uses static hash tables of 1024 buckets, and up to 128 items per bucket (!!!)
      
      A work queue is supposed to garbage collect items when the host is under memory
      pressure, and to do a hash rebuild, changing the seed used in hash computations.
      
      This work queue blocks softirqs for up to 25 ms when doing a hash rebuild,
      which occurs every 5 seconds if the host is under fire.
      
      Then there is the problem of sharing this hash table for all netns.
      
      It is time to switch to rhashtables, and allocate one of them per netns
      to speed up netns dismantle, since this is a critical metric these days.
      
      Lookup is now using RCU. A followup patch will even remove
      the refcount hold/release left from prior implementation and save
      a couple of atomic operations.
      
      Before this patch, 16 cpus (16 RX queue NIC) could not handle more
      than 1 Mpps frags DDOS.
      
      After the patch, I reach 9 Mpps without any tuning, and can use up to 2GB
      of storage for the fragments (exact number depends on frags being evicted
      after timeout)
      
      $ grep FRAG /proc/net/sockstat
      FRAG: inuse 1966916 memory 2140004608
      
      A followup patch will change the limits for 64bit arches.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Herbert Xu <herbert@gondor.apana.org.au>
      Cc: Florian Westphal <fw@strlen.de>
      Cc: Jesper Dangaard Brouer <brouer@redhat.com>
      Cc: Alexander Aring <alex.aring@gmail.com>
      Cc: Stefan Schmidt <stefan@osg.samsung.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
  22. 31 Mar, 2018 1 commit
  23. 30 Mar, 2018 1 commit
  24. 26 Mar, 2018 2 commits