1. 04 Aug 2017, 1 commit
    • sock: enable MSG_ZEROCOPY · 1f8b977a
      Authored by Willem de Bruijn
      Prepare the datapath for refcounted ubuf_info. Clone ubuf_info with
      skb_zerocopy_clone() wherever needed due to skb split, merge, resize
      or clone.
      
      Split skb_orphan_frags into two variants. The split, merge, .. paths
      support reference counted zerocopy buffers, so do not do a deep copy.
      Add skb_orphan_frags_rx for paths that may loop packets to receive
      sockets. That is not allowed, as it may cause unbounded latency.
      Deep copy all zerocopy buffers, ref-counted or not, in this path.
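
      A rough sketch of the resulting pair of helpers, paraphrased from the
      description above rather than quoted from the patch (skb_zcopy() and
      skb_uarg() are accessors introduced by the same series):

          static inline int skb_orphan_frags(struct sk_buff *skb, gfp_t gfp_mask)
          {
                  /* Split/merge/resize paths: reference counted zerocopy
                   * buffers survive; only legacy (virtio/tap style) ubuf_info
                   * is still deep copied. */
                  if (likely(!skb_zcopy(skb)))
                          return 0;
                  if (skb_uarg(skb)->callback == sock_zerocopy_callback)
                          return 0;
                  return skb_copy_ubufs(skb, gfp_mask);
          }

          static inline int skb_orphan_frags_rx(struct sk_buff *skb, gfp_t gfp_mask)
          {
                  /* Paths that may loop packets to receive sockets: always
                   * deep copy, ref-counted or not, to bound latency. */
                  if (likely(!skb_zcopy(skb)))
                          return 0;
                  return skb_copy_ubufs(skb, gfp_mask);
          }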
      
      The exact locations to modify were chosen by exhaustively searching
      through all code that might modify skb_frag references and/or the
      SKBTX_DEV_ZEROCOPY tx_flags bit.
      
      The changes err on the safe side, in two ways.
      
      (1) legacy ubuf_info paths virtio and tap are not modified. They keep
          a 1:1 ubuf_info to sk_buff relationship. Calls to skb_orphan_frags
          still call skb_copy_ubufs and thus copy frags in this case.
      
      (2) not all copies deep in the stack are addressed yet. skb_shift,
          skb_split and skb_try_coalesce can be refined to avoid copying.
          These are not in the hot path and this patch is hairy enough as
          is, so that is left for future refinement.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1f8b977a
  2. 13 Jul 2017, 1 commit
    • mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic · dcda9b04
      Authored by Michal Hocko
      __GFP_REPEAT was designed to allow a retry-but-eventually-fail semantic in
      the page allocator.  This has been true, but only for allocation
      requests larger than PAGE_ALLOC_COSTLY_ORDER; it has always been
      ignored for smaller sizes.  This is a bit unfortunate because there is
      no way to express the same semantic for those requests, and they are
      considered too important to fail, so they might end up looping in the
      page allocator forever, similarly to GFP_NOFAIL requests.
      
      Now that the whole tree has been cleaned up and accidental or misled
      usage of the __GFP_REPEAT flag has been removed for !costly requests, we
      can give the original flag a better name and, more importantly, a more
      useful semantic.  Let's rename it to __GFP_RETRY_MAYFAIL, which tells the
      user that the allocator will try really hard but there is no promise of
      success.  This works independent of the order and overrides the
      default allocator behavior.  Page allocator users have several levels of
      guarantee vs. cost options (take GFP_KERNEL as an example):
      
       - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_
         attempt to free memory at all. The most lightweight mode, which
         doesn't even kick background reclaim. Should be used carefully because
         it might deplete the memory and the next user might hit the more
         aggressive reclaim.
      
       - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT) - optimistic
         allocation without any attempt to free memory from the current
         context but can wake kswapd to reclaim memory if the zone is below
         the low watermark. Can be used from either atomic contexts or when
         the request is a performance optimization and there is another
         fallback for a slow path.
      
       - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) -
         non-sleeping allocation with an expensive fallback so it can access
         some portion of memory reserves. Usually used from interrupt/bh
         context with an expensive slow path fallback.
      
       - GFP_KERNEL - both background and direct reclaim are allowed and the
         _default_ page allocator behavior is used. That means that !costly
         allocation requests are basically nofail but there is no guarantee of
         that behavior so failures have to be checked properly by callers
         (e.g. OOM killer victim is allowed to fail currently).
      
       - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior
         and all allocation requests fail early rather than cause disruptive
         reclaim (one round of reclaim in this implementation). The OOM killer
         is not invoked.
      
       - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator
         behavior and all allocation requests try really hard. The request
         will fail if the reclaim cannot make any progress. The OOM killer
         won't be triggered.
      
       - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
         and all allocation requests will loop endlessly until they succeed.
         This might be really dangerous especially for larger orders.
      
      Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL
      because that was already their semantic.  No new users are added.
      __alloc_pages_slowpath is changed to bail out for __GFP_RETRY_MAYFAIL if
      there is no progress and we have already passed the OOM point.

      This means that all the reclaim opportunities have been exhausted except
      the most disruptive one (the OOM killer), and a user-defined fallback
      behavior is more sensible than continuing to retry in the page allocator.
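
      As a hedged illustration (a hypothetical caller, not part of this patch),
      this is the kind of pattern the new flag is meant for: try hard for a
      large allocation, but let the caller handle failure instead of relying on
      the OOM killer:

          #include <linux/slab.h>

          static int build_big_table(unsigned long **table, unsigned int nr)
          {
                  /* Try hard, but allow failure: no OOM killer, no endless loop. */
                  *table = kcalloc(nr, sizeof(**table),
                                   GFP_KERNEL | __GFP_RETRY_MAYFAIL);
                  if (!*table)
                          return -ENOMEM; /* caller degrades gracefully */
                  return 0;
          }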
      
      [akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c]
      [mhocko@suse.com: semantic fix]
        Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz
      [mhocko@kernel.org: address other thing spotted by Vlastimil]
        Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alex Belits <alex.belits@cavium.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: David Daney <david.daney@cavium.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: NeilBrown <neilb@suse.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dcda9b04
  3. 18 May 2017, 1 commit
    • vhost_net: try batch dequing from skb array · c67df11f
      Authored by Jason Wang
      We used to dequeue one skb at a time during recvmsg() from the skb_array;
      this could be inefficient because of poor cache utilization and the
      spinlock taken for each packet. This patch tries to batch them by calling
      the batch dequeuing helpers explicitly on the exported skb array and
      passing the skbs back through msg_control for the underlying socket to
      finish the userspace copying. Batch dequeuing is also a requirement for
      further batching improvements on the receive path.
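
      A hedged sketch of the batching idea with illustrative names (the real
      patch uses its own per-queue structure inside vhost_net; the batched
      consumer helper comes from the skb_array work this builds on):

          #include <linux/skb_array.h>

          #define RX_BATCH 64

          struct rx_cache {
                  struct sk_buff *skbs[RX_BATCH];
                  int head, tail;
          };

          static struct sk_buff *cached_dequeue(struct skb_array *ring,
                                                struct rx_cache *c)
          {
                  /* Refill the local cache with one locked batch operation
                   * instead of touching the ring spinlock per packet. */
                  if (c->head == c->tail) {
                          c->head = 0;
                          c->tail = skb_array_consume_batched(ring, c->skbs,
                                                              RX_BATCH);
                          if (!c->tail)
                                  return NULL;
                  }
                  return c->skbs[c->head++];
          }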
      
      Tests were done by pktgen on tap with XDP1 in guest. Host is Intel(R)
      Xeon(R) CPU E5-2650 0 @ 2.00GHz.
      
      rx batch | pps
      ---------+---------------------
             0 | 2.25 Mpps
             1 | 2.33 Mpps (+3.56%)
             4 | 2.33 Mpps (+3.56%)
            16 | 2.35 Mpps (+4.44%)
            64 | 2.42 Mpps (+7.56%)  <- Default rx batching
           128 | 2.40 Mpps (+6.67%)
           256 | 2.38 Mpps (+5.78%)
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      c67df11f
  4. 09 May 2017, 1 commit
  5. 02 Mar 2017, 2 commits
  6. 12 Feb 2017, 1 commit
  7. 19 Jan 2017, 1 commit
  8. 16 Nov 2016, 1 commit
  9. 02 Aug 2016, 2 commits
    • vhost: new device IOTLB API · 6b1e6cc7
      Authored by Jason Wang
      This patch tries to implement a device IOTLB for vhost. This could be
      used with a userspace (qemu) implementation of DMA remapping
      to emulate an IOMMU for the guest.
      
      The idea is simple: cache the translations in a software device IOTLB
      (implemented as an interval tree) in vhost and use the vhost_net
      file descriptor for reporting IOTLB misses and IOTLB
      updates/invalidations. When vhost hits an IOTLB miss, the fault
      address, size and access can be read from the file. After userspace
      finishes the translation, it writes the translated address to the
      vhost_net file to update the device IOTLB.
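
      A hedged sketch of the userspace side of this protocol (field and
      constant names follow the vhost uapi described above and should be
      treated as approximate; translate_iova() is an assumed stand-in for the
      emulated IOMMU lookup, and error handling is minimal):

          #include <linux/vhost.h>
          #include <unistd.h>

          extern __u64 translate_iova(__u64 iova, __u64 size); /* assumed vIOMMU hook */

          static void serve_one_iotlb_miss(int vhost_net_fd)
          {
                  struct vhost_msg msg;

                  /* A read blocks until the device reports a miss: the fault
                   * iova, size and requested access are in msg.iotlb. */
                  if (read(vhost_net_fd, &msg, sizeof(msg)) != sizeof(msg))
                          return;
                  if (msg.type != VHOST_IOTLB_MSG ||
                      msg.iotlb.type != VHOST_IOTLB_MISS)
                          return;

                  /* Resolve through the emulated IOMMU, then push the mapping
                   * back to update the device IOTLB. */
                  msg.iotlb.uaddr = translate_iova(msg.iotlb.iova, msg.iotlb.size);
                  msg.iotlb.perm  = VHOST_ACCESS_RW; /* grant full access for simplicity */
                  msg.iotlb.type  = VHOST_IOTLB_UPDATE;
                  write(vhost_net_fd, &msg, sizeof(msg));
          }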
      
      When the device IOTLB is enabled by setting VIRTIO_F_IOMMU_PLATFORM, all
      vq addresses set by ioctl are treated as iova instead of virtual
      addresses, and accesses can only be done through the IOTLB instead of
      direct userspace memory access. Before each round of vq processing, all
      vq metadata is prefetched into the device IOTLB to make sure no
      translation fault happens during vq processing.
      
      In most cases, virtqueues are contiguous even in virtual address space.
      The IOTLB translation for the virtqueue itself may make it a little
      slower; we might add a fast-path cache on top of this patch.
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      [mst: use virtio feature bit: VHOST_F_DEVICE_IOTLB -> VIRTIO_F_IOMMU_PLATFORM ]
      [mst: fix build warnings ]
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      [ weiyj.lk: missing unlock on error ]
      Signed-off-by: Wei Yongjun <weiyj.lk@gmail.com>
      6b1e6cc7
    • vhost: convert pre sorted vhost memory array to interval tree · a9709d68
      Authored by Jason Wang
      The current pre-sorted memory region array has some limitations for the
      future device IOTLB conversion:

      1) adding and removing a single region needs extra work, and it is
         expected to be slow because of sorting or memory re-allocation.
      2) removing a large range which may intersect several regions of
         different sizes needs extra work.
      3) a replacement policy like LRU needs tricks.
      
      To overcome the above shortcomings, this patch converts it to an interval
      tree, which can easily address the above issues with almost no extra
      work.
      
      The patch could be used for:

      - Extending the current API to let userspace send only diffs of the
        memory table.
      - Simplifying the device IOTLB implementation.
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      a9709d68
  10. 01 Jul 2016, 1 commit
    • tun: switch to use skb array for tx · 1576d986
      Authored by Jason Wang
      We used to queue tx packets in sk_receive_queue; this is less
      efficient since it requires spinlocks to synchronize between producer
      and consumer.
      
      This patch tries to address this by:
      
      - switch from sk_receive_queue to an skb_array, and resize it when
        tx_queue_len is changed.
      - introduce a new proto_ops peek_len which is used to peek the skb
        length.
      - implement a tun version of peek_len for vhost_net to use, and convert
        vhost_net to use peek_len where possible (a rough sketch of that path
        follows this list).
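
      A hedged sketch of the vhost_net side referenced above (simplified; the
      real helper has more bookkeeping, but the shape is the point):

          static int peek_head_len(struct sock *sk, struct socket *sock)
          {
                  struct sk_buff *head;
                  unsigned long flags;
                  int len = 0;

                  /* New fast path: the device (tun) reports the length of the
                   * head packet straight from its skb array, no queue lock. */
                  if (sock->ops->peek_len)
                          return sock->ops->peek_len(sock);

                  /* Old path: peek the first skb on sk_receive_queue under
                   * its lock. */
                  spin_lock_irqsave(&sk->sk_receive_queue.lock, flags);
                  head = skb_peek(&sk->sk_receive_queue);
                  if (head)
                          len = head->len;
                  spin_unlock_irqrestore(&sk->sk_receive_queue.lock, flags);

                  return len;
          }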
      
      Pktgen tests show about a 15.3% improvement in guest receive pps for
      small buffers:
      
      Before: ~1300000pps
      After : ~1500000pps
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      1576d986
  11. 08 Jun 2016, 1 commit
    • vhost_net: stop polling socket during rx processing · 8241a1e4
      Authored by Jason Wang
      We don't stop polling the rx socket during rx processing; this leads to
      unnecessary wakeups from the underlying net devices (e.g.
      sock_def_readable() from tun), and rx is slowed down this way. This
      patch avoids that by stopping socket polling during rx processing. A
      small drawback is that this introduces some overhead in the light-load
      case because of the extra start/stop polling, but a single netperf
      TCP_RR does not notice any change. In a super heavy load case, e.g.
      using pktgen to inject packets into the guest, we get about an ~8.8%
      improvement in pps:
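
      A hedged sketch of the reshaped receive handler (heavily simplified:
      buffer setup, recvmsg() and error handling are only hinted at; the
      disable/enable bracketing is the point of the patch):

          static void handle_rx(struct vhost_net *net)
          {
                  struct vhost_net_virtqueue *nvq = &net->vqs[VHOST_NET_VQ_RX];
                  struct vhost_virtqueue *vq = &nvq->vq;
                  struct socket *sock;

                  mutex_lock(&vq->mutex);
                  sock = vq->private_data;
                  if (!sock)
                          goto out;

                  /* New: stop polling the socket up front so that packets
                   * arriving while we are already draining do not trigger
                   * extra wakeups. */
                  vhost_net_disable_vq(net, vq);

                  /* ... peek length, get rx buffers, recvmsg(), add used ... */

                  /* Re-arm polling only when we are about to stop processing. */
                  vhost_net_enable_vq(net, vq);
          out:
                  mutex_unlock(&vq->mutex);
          }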
      
      before: ~1240000 pkt/s
      after:  ~1350000 pkt/s
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      8241a1e4
  12. 11 Mar 2016, 1 commit
  13. 02 Mar 2016, 1 commit
    • vhost: rename vhost_init_used() · 80f7d030
      Authored by Greg Kurz
      Looking at how callers use this, maybe we should just rename init_used
      to vhost_vq_init_access. The _used suffix was a hint that we
      access the vq used ring. But maybe what callers care about is
      that it must be called after access_ok.
      
      Also, this function manipulates the vq->is_le field which isn't related
      to the vq used ring.
      
      This patch simply renames vhost_init_used() to vhost_vq_init_access() as
      suggested by Michael.
      
      No behaviour change.
      Signed-off-by: Greg Kurz <gkurz@linux.vnet.ibm.com>
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      80f7d030
  14. 16 Sep 2015, 1 commit
  15. 12 Apr 2015, 1 commit
  16. 03 Mar 2015, 1 commit
  17. 28 Feb 2015, 2 commits
  18. 16 Feb 2015, 1 commit
  19. 05 Feb 2015, 1 commit
  20. 04 Feb 2015, 2 commits
  21. 14 Jan 2015, 1 commit
  22. 07 Jan 2015, 1 commit
  23. 10 Dec 2014, 1 commit
    • put iov_iter into msghdr · c0371da6
      Authored by Al Viro
      Note that the code _using_ ->msg_iter at that point will be very
      unhappy with anything other than unshifted iovec-backed iov_iter.
      We still need to convert users to proper primitives.
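
      As a hedged illustration of what this means for in-kernel callers (a
      hypothetical helper, not code from this patch): instead of stuffing
      msg_iov/msg_iovlen directly, the iovec is now wrapped in an iov_iter
      inside the msghdr:

          static void fill_recv_msghdr(struct msghdr *msg, struct iovec *iov,
                                       void *buf, size_t len)
          {
                  iov->iov_base = buf;
                  iov->iov_len  = len;
                  /* previously: msg->msg_iov = iov; msg->msg_iovlen = 1; */
                  iov_iter_init(&msg->msg_iter, READ, iov, 1, len);
          }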
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
      c0371da6
  24. 09 Dec 2014, 4 commits
  25. 23 Jun 2014, 1 commit
  26. 09 Jun 2014, 3 commits
    • vhost: move memory pointer to VQs · 47283bef
      Authored by Michael S. Tsirkin
      commit 2ae76693b8bcabf370b981cd00c36cd41d33fabc
          vhost: replace rcu with mutex
      replaced the rcu sync for memory accesses with VQ mutex lock/unlock.
      This is correct since all accesses are under the VQ mutex, but incomplete:
      we still do useless rcu lock/unlock operations, and someone might copy
      this code into some other context where it won't be right. This use of
      RCU is also non-standard and hard to understand.
      Let's copy the pointer to each VQ structure; this way
      the access rules become straightforward, and there's
      no need for RCU anymore.
      Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      47283bef
    • vhost: move acked_features to VQs · ea16c514
      Authored by Michael S. Tsirkin
      Refactor code to make sure features are only accessed
      under the VQ mutex. This makes everything simpler; there is
      no need for RCU here anymore.
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      ea16c514
    • vhost-net: extend device allocation to vmalloc · 23cc5a99
      Authored by Michael S. Tsirkin
      Michael Mueller provided a patch to reduce the size of the
      vhost-net structure, as some allocations could fail under
      memory pressure/fragmentation. We are still left with
      high-order allocations though.

      This patch handles the problem at the core level, allowing
      vhost structures to use vmalloc() if kmalloc() fails.

      As vmalloc() adds overhead on a critical network path, add __GFP_REPEAT
      to the kzalloc() flags to do this fallback only when really needed.
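
      Roughly, inside the device open path that allocates the vhost-net
      structure (a sketch of the description above, not the literal hunk):

          struct vhost_net *n = kzalloc(sizeof(*n), GFP_KERNEL | __GFP_REPEAT);
          if (!n) {
                  /* Only when the physically contiguous allocation really
                   * fails do we pay the vmalloc() overhead on this critical
                   * network path. */
                  n = vzalloc(sizeof(*n));
                  if (!n)
                          return -ENOMEM;
          }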
      
      People are still looking at cleaner ways to handle the problem
      at the API level, probably passing in multiple iovecs.
      This hack seems consistent with approaches
      taken since then by drivers/vhost/scsi.c and net/core/dev.c
      
      Based on patch by Romain Francoise.
      
      Cc: Michael Mueller <mimu@linux.vnet.ibm.com>
      Signed-off-by: Romain Francoise <romain@orebokech.com>
      Acked-by: Michael S. Tsirkin <mst@redhat.com>
      23cc5a99
  27. 02 Apr 2014, 1 commit
  28. 29 Mar 2014, 2 commits
  29. 14 Feb 2014, 2 commits