1. 04 Aug 2017 (1 commit)
    • sock: enable MSG_ZEROCOPY · 1f8b977a
      Willem de Bruijn committed
      Prepare the datapath for refcounted ubuf_info. Clone ubuf_info with
      skb_zerocopy_clone() wherever needed due to skb split, merge, resize
      or clone.
      
      Split skb_orphan_frags into two variants. The split, merge, .. paths
      support reference counted zerocopy buffers, so do not do a deep copy.
      Add skb_orphan_frags_rx for paths that may loop packets to receive
      sockets. That is not allowed, as it may cause unbounded latency.
      Deep copy all zerocopy buffers, ref-counted or not, in this path.
      
      The exact locations to modify were chosen by exhaustively searching
      through all code that might modify skb_frag references and/or the
      SKBTX_DEV_ZEROCOPY tx_flags bit.
      
      The changes err on the safe side, in two ways.
      
      (1) legacy ubuf_info paths virtio and tap are not modified. They keep
          a 1:1 ubuf_info to sk_buff relationship. Calls to skb_orphan_frags
          still call skb_copy_ubufs and thus copy frags in this case.
      
      (2) not all copies deep in the stack are addressed yet. skb_shift,
          skb_split and skb_try_coalesce can be refined to avoid copying.
          These are not in the hot path and this patch is hairy enough as
          is, so that is left for future refinement.
      Signed-off-by: Willem de Bruijn <willemb@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
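
      As a sketch of the split described above (simplified from the helpers
      this patch touches; not a verbatim hunk):

          /* Split/merge/resize paths: ref-counted ubuf_info needs no deep copy. */
          static inline int skb_orphan_frags(struct sk_buff *skb, gfp_t gfp_mask)
          {
                  if (likely(!skb_zcopy(skb)))
                          return 0;
                  if (skb_uarg(skb)->callback == sock_zerocopy_callback)
                          return 0;
                  return skb_copy_ubufs(skb, gfp_mask);
          }

          /* Paths that may loop packets to receive sockets: always deep copy. */
          static inline int skb_orphan_frags_rx(struct sk_buff *skb, gfp_t gfp_mask)
          {
                  if (likely(!skb_zcopy(skb)))
                          return 0;
                  return skb_copy_ubufs(skb, gfp_mask);
          }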
  2. 30 Jul 2017 (1 commit)
  3. 13 Jul 2017 (1 commit)
    • mm, tree wide: replace __GFP_REPEAT by __GFP_RETRY_MAYFAIL with more useful semantic · dcda9b04
      Michal Hocko committed
      __GFP_REPEAT was designed to allow retry-but-eventually-fail semantics
      in the page allocator.  This has been true, but only for allocation
      requests larger than PAGE_ALLOC_COSTLY_ORDER.  It has always been
      ignored for smaller sizes.  This is a bit unfortunate because there is
      no way to express the same semantic for those requests, and they are
      considered too important to fail, so they might end up looping in the
      page allocator forever, similarly to GFP_NOFAIL requests.
      
      Now that the whole tree has been cleaned up and accidental or misled
      usage of the __GFP_REPEAT flag has been removed for !costly requests,
      we can give the original flag a better name and, more importantly, a
      more useful semantic.  Let's rename it to __GFP_RETRY_MAYFAIL, which
      tells the user that the allocator will try really hard but there is no
      promise of success.  This works independently of the order and
      overrides the default allocator behavior.  Page allocator users have
      several levels of guarantee vs. cost options (take GFP_KERNEL as an
      example):
      
       - GFP_KERNEL & ~__GFP_RECLAIM - optimistic allocation without _any_
         attempt to free memory at all. The most lightweight mode, which
         doesn't even kick background reclaim. Should be used carefully
         because it might deplete the memory and the next user might hit the
         more aggressive reclaim.
      
       - GFP_KERNEL & ~__GFP_DIRECT_RECLAIM (or GFP_NOWAIT) - optimistic
         allocation without any attempt to free memory from the current
         context but can wake kswapd to reclaim memory if the zone is below
         the low watermark. Can be used from either atomic contexts or when
         the request is a performance optimization and there is another
         fallback for a slow path.
      
       - (GFP_KERNEL|__GFP_HIGH) & ~__GFP_DIRECT_RECLAIM (aka GFP_ATOMIC) -
         non-sleeping allocation with an expensive fallback so it can access
         some portion of memory reserves. Usually used from interrupt/bh
         context with an expensive slow path fallback.
      
       - GFP_KERNEL - both background and direct reclaim are allowed and the
         _default_ page allocator behavior is used. That means that !costly
         allocation requests are basically nofail but there is no guarantee of
         that behavior so failures have to be checked properly by callers
         (e.g. OOM killer victim is allowed to fail currently).
      
       - GFP_KERNEL | __GFP_NORETRY - overrides the default allocator behavior
         and all allocation requests fail early rather than cause disruptive
         reclaim (one round of reclaim in this implementation). The OOM killer
         is not invoked.
      
       - GFP_KERNEL | __GFP_RETRY_MAYFAIL - overrides the default allocator
         behavior and all allocation requests try really hard. The request
         will fail if the reclaim cannot make any progress. The OOM killer
         won't be triggered.
      
       - GFP_KERNEL | __GFP_NOFAIL - overrides the default allocator behavior
         and all allocation requests will loop endlessly until they succeed.
         This might be really dangerous especially for larger orders.
      
      Existing users of __GFP_REPEAT are changed to __GFP_RETRY_MAYFAIL
      because that is the semantic they already relied on.  No new users are
      added.  __alloc_pages_slowpath is changed to bail out for
      __GFP_RETRY_MAYFAIL if there is no progress and we have already passed
      the OOM point.

      This means that all the reclaim opportunities have been exhausted
      except the most disruptive one (the OOM killer), and a user-defined
      fallback behavior is more sensible than continuing to retry in the
      page allocator.
      
      [akpm@linux-foundation.org: fix arch/sparc/kernel/mdesc.c]
      [mhocko@suse.com: semantic fix]
        Link: http://lkml.kernel.org/r/20170626123847.GM11534@dhcp22.suse.cz
      [mhocko@kernel.org: address other thing spotted by Vlastimil]
        Link: http://lkml.kernel.org/r/20170626124233.GN11534@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20170623085345.11304-3-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Vlastimil Babka <vbabka@suse.cz>
      Cc: Alex Belits <alex.belits@cavium.com>
      Cc: Chris Wilson <chris@chris-wilson.co.uk>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Darrick J. Wong <darrick.wong@oracle.com>
      Cc: David Daney <david.daney@cavium.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: NeilBrown <neilb@suse.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
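
      A hedged illustration of the intended __GFP_RETRY_MAYFAIL usage
      pattern (a kvmalloc-style fallback; buf and size are hypothetical):

          /* Try hard for a physically contiguous buffer, but accept failure
           * (no OOM killer) and fall back to vmalloc. */
          buf = kmalloc(size, GFP_KERNEL | __GFP_RETRY_MAYFAIL);
          if (!buf)
                  buf = vmalloc(size);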
  4. 20 Jun 2017 (1 commit)
    • sched/wait: Rename wait_queue_t => wait_queue_entry_t · ac6424b9
      Ingo Molnar committed
      Rename:
      
      	wait_queue_t		=>	wait_queue_entry_t
      
      'wait_queue_t' was always a slight misnomer: its name implies that it's a "queue",
      but in reality it's a queue *entry*. The 'real' queue is the wait queue head,
      which had to carry the name.
      
      Start sorting this out by renaming it to 'wait_queue_entry_t'.
      
      This also allows the real structure name 'struct __wait_queue' to
      lose its double underscore and become 'struct wait_queue_entry',
      which is the more canonical nomenclature for such data types.
      
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
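
      For illustration, a typical declaration before and after the rename
      (hypothetical driver snippet, not taken from this patch):

          /* before */
          wait_queue_t wait;
          init_waitqueue_entry(&wait, current);

          /* after: same code, clearer type name */
          wait_queue_entry_t wait;
          init_waitqueue_entry(&wait, current);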
  5. 09 Jun 2017 (1 commit)
  6. 18 May 2017 (2 commits)
    • vhost/vsock: use static minor number · f4660cc9
      Stefan Hajnoczi committed
      Vhost-vsock is a software device so there is no probe call that causes
      the driver to register its misc char device node.  This creates a
      chicken and egg problem: userspace applications must open
      /dev/vhost-vsock to use the driver but the file doesn't exist until the
      kernel module has been loaded.
      
      Use the devname modalias mechanism so that /dev/vhost-vsock is created
      at boot.  The vhost_vsock kernel module is automatically loaded when
      the first application opens /dev/vhost-vsock.
      
      Note that the "reserved for local use" range in
      Documentation/admin-guide/devices.txt is incorrect.  The userio driver
      already occupies part of that range.  I've updated the documentation
      accordingly.
      
      Cc: device@lanana.org
      Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
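
      The devname modalias mechanism boils down to a misc device with a
      fixed minor plus two aliases; roughly (simplified from the driver):

          static struct miscdevice vhost_vsock_misc = {
                  .minor = VHOST_VSOCK_MINOR, /* static, not MISC_DYNAMIC_MINOR */
                  .name  = "vhost-vsock",
                  .fops  = &vhost_vsock_fops,
          };

          MODULE_ALIAS_MISCDEV(VHOST_VSOCK_MINOR);
          MODULE_ALIAS("devname:vhost-vsock");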
    • vhost_net: try batch dequeuing from skb array · c67df11f
      Jason Wang committed
      We used to dequeue one skb from the skb_array during each recvmsg().
      This could be inefficient because of poor cache utilization and the
      spinlock touched for each packet. This patch batches the operation by
      calling batch dequeuing helpers explicitly on the exported skb array
      and passing the skbs back through msg_control for the underlying
      socket to finish the userspace copying. Batch dequeuing is also a
      requirement for further batching improvements on the receive path.
      
      Tests were done by pktgen on tap with XDP1 in the guest. The host is
      an Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz.
      
      rx batch | pps
      ---------+--------------------
             0 | 2.25Mpps
             1 | 2.33Mpps (+3.56%)
             4 | 2.33Mpps (+3.56%)
            16 | 2.35Mpps (+4.44%)
            64 | 2.42Mpps (+7.56%)  <- default rx batching
           128 | 2.40Mpps (+6.67%)
           256 | 2.38Mpps (+5.78%)
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
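
      The consumer side then looks roughly like this (a sketch; the batch
      size constant and the per-skb handler are illustrative):

          struct sk_buff *batch[VHOST_RX_BATCH];
          int i, n;

          /* One lock acquisition amortized over up to VHOST_RX_BATCH skbs. */
          n = skb_array_consume_batched(rx_array, batch, VHOST_RX_BATCH);
          for (i = 0; i < n; i++)
                  handle_rx_skb(batch[i]); /* hypothetical per-skb handler */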
  7. 09 May 2017 (1 commit)
  8. 25 Apr 2017 (1 commit)
  9. 22 Mar 2017 (1 commit)
  10. 02 Mar 2017 (4 commits)
    • sched/headers: Prepare to move signal wakeup & sigpending methods from <linux/sched.h> into <linux/sched/signal.h> · 174cd4b1
      Ingo Molnar committed
      
      Fix up affected files that include this signal functionality via sched.h.
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched/headers: Prepare for new header dependencies before moving code to <linux/sched/mm.h> · 6e84f315
      Ingo Molnar committed
      We are going to split <linux/sched/mm.h> out of <linux/sched.h>, which
      will have to be picked up from other headers and a couple of .c files.
      
      Create a trivial placeholder <linux/sched/mm.h> file that just
      maps to <linux/sched.h> to make this patch obviously correct and
      bisectable.
      
      The APIs that are going to be moved first are:
      
         mm_alloc()
         __mmdrop()
         mmdrop()
         mmdrop_async_fn()
         mmdrop_async()
         mmget_not_zero()
         mmput()
         mmput_async()
         get_task_mm()
         mm_access()
         mm_release()
      
      Include the new header in the files that are going to need it.
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
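
      At this stage the placeholder header is essentially just (paraphrased):

          /* include/linux/sched/mm.h: placeholder until the code moves here. */
          #ifndef _LINUX_SCHED_MM_H
          #define _LINUX_SCHED_MM_H

          #include <linux/sched.h>

          #endif /* _LINUX_SCHED_MM_H */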
    • sched/headers: Prepare for new header dependencies before moving code to <linux/sched/clock.h> · e6017571
      Ingo Molnar committed
      We are going to split <linux/sched/clock.h> out of <linux/sched.h>, which
      will have to be picked up from other headers and .c files.
      
      Create a trivial placeholder <linux/sched/clock.h> file that just
      maps to <linux/sched.h> to make this patch obviously correct and
      bisectable.
      
      Include the new header in the files that are going to need it.
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • vhost: introduce O(1) vq metadata cache · f8894913
      Jason Wang committed
      When the device IOTLB is enabled, all address translations are stored
      in an interval tree. The O(lgN) search time can be slow for virtqueue
      metadata (avail, used and descriptors), since it is accessed much more
      often than other addresses. So this patch introduces an O(1) array
      which points to the interval tree nodes that store the translations of
      the vq metadata. The array is updated during vq IOTLB prefetching and
      is reset on each invalidation and TLB update. Each time we want to
      access vq metadata, this small array is queried before the interval
      tree. This is sufficient for static mappings but not dynamic mappings;
      we could do optimizations on top.
      
      Tests were done with l2fwd in the guest (2M hugepages):
      
         noiommu  | before        | after
      tx 1.32Mpps | 1.06Mpps(82%) | 1.30Mpps(98%)
      rx 2.33Mpps | 1.46Mpps(63%) | 2.29Mpps(98%)
      
      We can almost reach the same performance as noiommu mode.
      Signed-off-by: Jason Wang <jasowang@redhat.com>
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
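
      Conceptually, the lookup fast path becomes (a sketch with hypothetical
      names, not the literal kernel code):

          /* In struct vhost_virtqueue: one cache slot per metadata type. */
          struct vhost_umem_node *meta_cache[VHOST_NUM_ADDRS];

          static struct vhost_umem_node *
          meta_lookup(struct vhost_virtqueue *vq, int type, u64 addr)
          {
                  struct vhost_umem_node *node = vq->meta_cache[type];

                  if (node && addr >= node->start && addr <= node->last)
                          return node;                /* O(1) hit */
                  return iotlb_tree_search(vq, addr); /* O(lgN) fallback */
          }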
  11. 28 Feb 2017 (1 commit)
  12. 12 Feb 2017 (2 commits)
  13. 04 Feb 2017 (1 commit)
    • vhost: fix initialization for vq->is_le · cda8bba0
      Halil Pasic committed
      Currently, under certain circumstances vhost_init_is_le does just a
      part of the initialization job and depends on vhost_reset_is_le being
      called too. For this reason vhost_vq_init_access used to call
      vhost_reset_is_le when vq->private_data is NULL. This is not only
      counter-intuitive but also a real problem, because it breaks
      vhost_net. The bug was introduced to vhost_net with commit 2751c988
      ("vhost: cross-endian support for legacy devices"). The symptom is
      corruption of the vq's used.idx field (virtio) after
      VHOST_NET_SET_BACKEND is issued as part of the vhost shutdown on a vq
      with pending descriptors.
      
      Let us make sure the outcome of vhost_init_is_le never depends on the
      state it is actually supposed to initialize, and fix vhost_net by
      removing the reset from vhost_vq_init_access.

      With the above, there is no reason for vhost_reset_is_le to do just
      half of the job. Let us make vhost_reset_is_le reinitialize is_le.
      Signed-off-by: Halil Pasic <pasic@linux.vnet.ibm.com>
      Reported-by: Michael A. Tebolt <miket@us.ibm.com>
      Reported-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
      Fixes: commit 2751c988 ("vhost: cross-endian support for legacy devices")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Reviewed-by: Greg Kurz <groug@kaod.org>
      Tested-by: Michael A. Tebolt <miket@us.ibm.com>
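
      The resulting reset helper is small; roughly (paraphrasing the fix):

          static void vhost_reset_is_le(struct vhost_virtqueue *vq)
          {
                  /* Fully reinitialize is_le instead of doing half the job. */
                  vq->is_le = virtio_legacy_is_little_endian();
          }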
  14. 25 Jan 2017 (1 commit)
  15. 20 Jan 2017 (2 commits)
    • vhost/scsi: silence uninitialized variable warning · 532e15af
      Dan Carpenter committed
      This is to silence an uninitialized variable warning in debug output.
      The problem is this line:
      
      	pr_debug("vhost_get_vq_desc: head: %d, out: %u in: %u\n",
      		 head, out, in);
      
      If "head == vq->num" is true on the first iteration then "out" and "in"
      aren't initialized.  We handle that a few lines after the printk.  I was
      tempted to just delete the pr_debug(), but decided to initialize them
      to zero instead.
      
      Also checkpatch.pl complains if variables are declared as just
      "unsigned" without the "int".
      Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
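
      The fix itself is a one-line change consistent with the description
      above (sketch):

          /* Zero-init so the pr_debug() never prints garbage; spell out
           * "unsigned int" to keep checkpatch.pl quiet. */
          unsigned int out = 0, in = 0;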
    • vhost: scsi: constify target_core_fabric_ops structures · 1d822a40
      Bhumika Goyal committed
      Declare target_core_fabric_ops structures as const, as they are only
      passed as an argument to the functions target_register_template and
      target_unregister_template. The arguments are of type const struct
      target_core_fabric_ops *, so target_core_fabric_ops structures having
      this property can be declared const.
      Done using Coccinelle:
      
      @r disable optional_qualifier@
      identifier i;
      position p;
      @@
      static struct target_core_fabric_ops i@p={...};
      
      @ok@
      position p;
      identifier r.i;
      @@
      (
      target_register_template(&i@p)
      |
      target_unregister_template(&i@p)
      )
      @bad@
      position p!={r.p,ok.p};
      identifier r.i;
      @@
      i@p
      
      @depends on !bad disable optional_qualifier@
      identifier r.i;
      @@
      +const
      struct target_core_fabric_ops i;
      
      File size before: drivers/vhost/scsi.o
         text	   data	    bss	    dec	    hex	filename
        18063	   2985	     40	  21088	   5260	drivers/vhost/scsi.o
      
      File size after: drivers/vhost/scsi.o
         text	   data	    bss	    dec	    hex	filename
        18479	   2601	     40	  21120	   5280	drivers/vhost/scsi.o
      Signed-off-by: Bhumika Goyal <bhumirks@gmail.com>
      Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
      Acked-by: Jason Wang <jasowang@redhat.com>
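
      The net effect on the driver is the added qualifier; illustrative
      before/after (the structure name is assumed from drivers/vhost/scsi.c):

          /* before: lands in .data, writable at runtime */
          static struct target_core_fabric_ops vhost_scsi_ops = { /* ... */ };

          /* after: lands in rodata, hence the text/data shift above */
          static const struct target_core_fabric_ops vhost_scsi_ops = { /* ... */ };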
  16. 19 1月, 2017 2 次提交
  17. 16 12月, 2016 4 次提交
  18. 15 12月, 2016 2 次提交
  19. 09 12月, 2016 3 次提交
  20. 06 12月, 2016 1 次提交
    • [iov_iter] new primitives - copy_from_iter_full() and friends · cbbd26b8
      Al Viro committed
      copy_from_iter_full(), copy_from_iter_full_nocache() and
      csum_and_copy_from_iter_full() - counterparts of copy_from_iter()
      et al., advancing the iterator only in case of a successful full copy
      and returning whether it was successful or not.

      Convert some obvious users.  *NOTE* - do not blindly assume that
      something is a good candidate for these unless you are sure that
      not advancing the iov_iter in the failure case is the right thing
      here.  Anything that does short-read/short-write kind of stuff
      (or is in a loop, etc.) is unlikely to be a good one.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
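
      A typical conversion looks like this (a sketch, not a specific hunk
      from the patch; hdr and from are illustrative):

          /* before: the iterator advances even on a short copy */
          if (copy_from_iter(&hdr, sizeof(hdr), from) != sizeof(hdr))
                  return -EFAULT;

          /* after: advances only on a successful full copy */
          if (!copy_from_iter_full(&hdr, sizeof(hdr), from))
                  return -EFAULT;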
  21. 16 Nov 2016 (1 commit)
  22. 31 Aug 2016 (1 commit)
  23. 23 Aug 2016 (1 commit)
  24. 15 Aug 2016 (1 commit)
  25. 09 Aug 2016 (1 commit)
  26. 02 Aug 2016 (2 commits)