1. 26 April 2017 (7 commits)
    • svcrdma: Clean up RDMA_ERROR path · 6b19cc5c
      Committed by Chuck Lever
      Now that svc_rdma_sendto has been renovated, svc_rdma_send_error can
      be refactored to reduce code duplication and remove C structure-
      based XDR encoding. It is also relocated to the source file that
      contains its only caller.
      
      This is a refactoring change only.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
      6b19cc5c
    • svcrdma: Use rdma_rw API in RPC reply path · 9a6a180b
      Committed by Chuck Lever
      The current svcrdma sendto code path posts one RDMA Write WR at a
      time. Each of these Writes typically carries a small number of pages
      (for instance, up to 30 pages for mlx4 devices). That means a 1MB
      NFS READ reply requires 9 ib_post_send() calls for the Write WRs,
      and one for the Send WR carrying the actual RPC Reply message.
      
      Instead, use the new rdma_rw API. The details of Write WR chain
      construction and memory registration are taken care of in the RDMA
      core. svcrdma can focus on the details of the RPC-over-RDMA
      protocol. This gives three main benefits:
      
      1. All Write WRs for one RDMA segment are posted in a single chain,
      so as few as one ib_post_send() is needed for each Write chunk.
      
      2. The Write path can now use FRWR to register the Write buffers.
      If the device's maximum page list depth is large, this means a
      single Write WR is needed for each RPC's Write chunk data.
      
      3. The new code introduces support for RPCs that carry both a Write
      list and a Reply chunk. This combination can be used for an NFSv4
      READ where the data payload is large, and thus is removed from the
      Payload Stream, but the Payload Stream is still larger than the
      inline threshold.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
      9a6a180b
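
      A minimal sketch of the rdma_rw usage described above; the function
      and parameter names are illustrative, not the actual svcrdma code.
      One rdma_rw_ctx covers a whole Write chunk, and rdma_rw_ctx_post()
      emits the entire WR chain through a single ib_post_send() internally:

        #include <rdma/rw.h>

        /* ctx must stay allocated until the chunk's completion fires. */
        static int post_write_chunk(struct rdma_rw_ctx *ctx, struct ib_qp *qp,
                                    u8 port_num, struct scatterlist *sgl,
                                    u32 sg_cnt, u64 remote_addr, u32 rkey,
                                    struct ib_cqe *cqe)
        {
                int ret;

                /* Build (and, on FRWR-capable devices, register) every
                 * Write WR needed to push the SG list to the client. */
                ret = rdma_rw_ctx_init(ctx, qp, port_num, sgl, sg_cnt, 0,
                                       remote_addr, rkey, DMA_TO_DEVICE);
                if (ret < 0)
                        return ret;

                /* One post for the whole chain; cqe fires once, when the
                 * last WR in the chain completes. */
                return rdma_rw_ctx_post(ctx, qp, port_num, cqe, NULL);
        }
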
    • svcrdma: Introduce local rdma_rw API helpers · f13193f5
      Committed by Chuck Lever
      The plan is to replace the local bespoke code that constructs and
      posts RDMA Read and Write Work Requests with calls to the rdma_rw
      API. This shares code with other RDMA-enabled ULPs that manage the
      gory details of buffer registration and posting Work Requests.
      
      Some design notes:
      
       o The structure of RPC-over-RDMA transport headers is flexible,
         allowing multiple segments per Reply with arbitrary alignment,
         each with a unique R_key. Write and Send WRs continue to be
         built and posted in separate code paths. However, one whole
         chunk (with one or more RDMA segments apiece) gets exactly
         one ib_post_send and one work completion.
      
       o svc_xprt reference counting is modified, since a chain of
         rdma_rw_ctx structs generates one completion, no matter how
         many Write WRs are posted.
      
       o The current code builds the transport header as it is
         constructing Write WRs. I've replaced that with marshaling of transport
         header data items in a separate step. This is because the exact
         structure of client-provided segments may not align with the
         components of the server's reply xdr_buf, or the pages in the
         page list. Thus parts of each client-provided segment may be
         written at different points in the send path.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
      f13193f5
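
      The "one completion per chunk" design note can be pictured with a
      hypothetical per-chunk context; the struct and function names below
      are made up for this sketch:

        #include <linux/slab.h>
        #include <rdma/rw.h>

        struct chunk_ctxt {
                struct ib_cqe      cqe;     /* one cqe for the whole chain */
                struct rdma_rw_ctx rw_ctx;
                /* ... segment and page bookkeeping ... */
        };

        static void chunk_write_done(struct ib_cq *cq, struct ib_wc *wc)
        {
                struct chunk_ctxt *cc = container_of(wc->wr_cqe,
                                                     struct chunk_ctxt, cqe);

                /* Exactly one completion arrives here no matter how many
                 * Write WRs the chain contained, so the svc_xprt reference
                 * taken for the chunk is dropped exactly once. */
                kfree(cc);
        }
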
    • svcrdma: Eliminate RPCRDMA_SQ_DEPTH_MULT · b623589d
      Committed by Chuck Lever
      The Send Queue depth is temporarily reduced to 1 SQE per credit. The
      new rdma_rw API does an internal computation, during QP creation, to
      increase the depth of the Send Queue to handle RDMA Read and Write
      operations.
      
      This change has to come before the NFSD code paths are updated to
      use the rdma_rw API. Without this patch, rdma_rw_init_qp() increases
      the size of the SQ too much, resulting in memory allocation failures
      during QP creation.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
      b623589d
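
      A hedged sketch of the sizing change (values and the helper name are
      illustrative): with cap.max_rdma_ctxs set, rdma_rw_init_qp() runs
      during QP creation and grows max_send_wr itself to cover Read/Write
      WRs, so the caller now asks for only one SQE per credit:

        #include <rdma/rdma_cm.h>

        static int create_qp_sketch(struct rdma_cm_id *cm_id, struct ib_pd *pd,
                                    u32 credits, u32 rw_ctxs)
        {
                struct ib_qp_init_attr qp_attr = { };

                qp_attr.qp_type = IB_QPT_RC;
                /* was: credits * RPCRDMA_SQ_DEPTH_MULT */
                qp_attr.cap.max_send_wr = credits;
                qp_attr.cap.max_recv_wr = credits;
                qp_attr.cap.max_rdma_ctxs = rw_ctxs; /* core adds SQ depth */
                /* ... CQs, SGE limits, qp context ... */
                return rdma_create_qp(cm_id, pd, &qp_attr);
        }
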
    • svcrdma: Add svc_rdma_map_reply_hdr() · 6e6092ca
      Committed by Chuck Lever
      Introduce a helper to DMA-map a reply's transport header before
      sending it. This will in part replace the map vector cache.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
      6e6092ca
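
      The shape of such a helper, sketched here; apart from the ib_dma_*
      calls, the names are illustrative:

        #include <rdma/ib_verbs.h>

        static int map_reply_hdr(struct ib_device *dev, struct ib_sge *sge,
                                 struct page *hdr_page, unsigned int hdr_len,
                                 u32 lkey)
        {
                /* DMA-map the page holding the transport header ... */
                sge->addr = ib_dma_map_page(dev, hdr_page, 0, hdr_len,
                                            DMA_TO_DEVICE);
                if (ib_dma_mapping_error(dev, sge->addr))
                        return -EIO;

                /* ... and describe it in the Send WR's first SGE. */
                sge->length = hdr_len;
                sge->lkey = lkey;
                return 0;
        }
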
    • svcrdma: Move send_wr to svc_rdma_op_ctxt · 17f5f7f5
      Committed by Chuck Lever
      Clean up: Move the ib_send_wr off the stack, and move common code
      to post a Send Work Request into a helper.
      
      This is a refactoring change only.
      Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
      17f5f7f5
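
      The pattern, sketched with illustrative structure and field names
      rather than the exact svc_rdma_op_ctxt layout:

        #include <rdma/ib_verbs.h>

        struct op_ctxt {
                struct ib_send_wr send_wr;  /* formerly a stack variable */
                struct ib_sge     sge[1];
                struct ib_cqe     cqe;
        };

        /* The common Send-posting code, hoisted into one helper. */
        static int post_send(struct ib_qp *qp, struct op_ctxt *ctxt,
                             unsigned int num_sge)
        {
                struct ib_send_wr *bad_wr;

                ctxt->send_wr.next = NULL;
                ctxt->send_wr.wr_cqe = &ctxt->cqe;
                ctxt->send_wr.sg_list = ctxt->sge;
                ctxt->send_wr.num_sge = num_sge;
                ctxt->send_wr.opcode = IB_WR_SEND;
                ctxt->send_wr.send_flags = IB_SEND_SIGNALED;
                return ib_post_send(qp, &ctxt->send_wr, &bad_wr);
        }
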
    • nfsd: check for oversized NFSv2/v3 arguments · 51f56777
      Committed by J. Bruce Fields
      A client can append random data to the end of an NFSv2 or NFSv3 RPC call
      without our complaining; we'll just stop parsing at the end of the
      expected data and ignore the rest.
      
      Encoded arguments and replies are stored together in an array of pages,
      and if a call is too large it could leave inadequate space for the
      reply.  This is normally OK because NFS RPCs typically have either
      short arguments and long replies (like READ) or long arguments and short
      replies (like WRITE).  But a client that sends an incorrectly long call
      can violate those assumptions.  This was observed to cause crashes.
      
      So, insist that the argument not be any longer than we expect.
      
      Also, several operations increment rq_next_page in the decode routine
      before checking the argument size, which can leave rq_next_page pointing
      well past the end of the page array, causing trouble later in
      svc_free_pages.
      
      As a followup we may also want to rewrite the encoding routines to check
      more carefully that they aren't running off the end of the page array.
      Reported-by: Tuomas Haanpää <thaan@synopsys.com>
      Reported-by: Ari Kauppi <ari@synopsys.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: J. Bruce Fields <bfields@redhat.com>
      51f56777
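
      The shape of the fix, sketched; this is not the exact nfsd code, and
      the helper name and error value are illustrative:

        #include <linux/sunrpc/svc.h>

        static int decode_checked(struct svc_rqst *rqstp, size_t max_argsize)
        {
                size_t arglen = rqstp->rq_arg.head[0].iov_len +
                                rqstp->rq_arg.page_len;

                /* Insist the call is no longer than the procedure's
                 * worst-case argument; real code rejects it as garbage. */
                if (arglen > max_argsize)
                        return -EINVAL;

                /* ... normal XDR argument decoding follows ... */
                return 0;
        }
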
  2. 19 April 2017 (1 commit)
  3. 15 April 2017 (1 commit)
    • block: fix bio_will_gap() for first bvec with offset · 5a8d75a1
      Committed by Ming Lei
      Commit 729204ef ("block: relax check on sg gap") allows us to merge
      bios if both are physically contiguous.  This can merge a huge number
      of small bios; mkfs.ntfs running time, for example, decreases to
      roughly a tenth of what it was.
      
      But if one rq starts with a non-aligned buffer (the 1st bvec's bv_offset
      is non-zero) and we allow the merge, it is quite difficult to respect
      the sg gap limit, especially the max segment size, or we risk an
      unaligned virt boundary.  This patch avoids the issue by disallowing
      the merge if the req starts with an unaligned buffer.
      
      Also add comments to explain why the merged segment can't end in an
      unaligned virt boundary.
      
      Fixes: 729204ef ("block: relax check on sg gap")
      Tested-by: Johannes Thumshirn <jthumshirn@suse.de>
      Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      
      Rewrote parts of the commit message and comments.
      Signed-off-by: Jens Axboe <axboe@fb.com>
      5a8d75a1
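
      Simplified from the real bio_will_gap(), the added guard looks
      roughly like this (the helper name is illustrative):

        #include <linux/bio.h>
        #include <linux/blkdev.h>

        static bool first_bvec_gap(struct request_queue *q, struct bio *prev)
        {
                struct bio_vec pb;

                bio_get_first_bvec(prev, &pb);

                /* A non-aligned leading offset would make the merged
                 * segment straddle the virt boundary, so count it as a
                 * gap and refuse the merge. */
                return pb.bv_offset & queue_virt_boundary(q);
        }
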
  4. 14 April 2017 (1 commit)
  5. 08 April 2017 (2 commits)
  6. 07 April 2017 (3 commits)
  7. 05 April 2017 (1 commit)
    • mfd: cros-ec: Fix host command buffer size · b2376407
      Committed by Vic Yang
      For SPI, we can get up to 32 additional bytes of response preamble.
      The current overhead (2 bytes) may cause problems when we try to receive
      a big response. Update it to 32 bytes.
      
      Without this fix we could see a kernel BUG when we receive a big response
      from the Chrome EC when it is connected via SPI.
      Signed-off-by: Vic Yang <victoryang@google.com>
      Tested-by: Enric Balletbo i Serra <enric.balletbo@collabora.com>
      Signed-off-by: Lee Jones <lee.jones@linaro.org>
      b2376407
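
      The sizing logic, sketched; the constant and helper names are
      illustrative, not the driver's:

        #include <linux/slab.h>

        /* Worst-case bytes the EC may prepend to an SPI response. */
        #define EC_RESPONSE_PREAMBLE_MAX 32   /* was 2: too small for SPI */

        static void *alloc_ec_response_buf(size_t max_response, gfp_t gfp)
        {
                /* Leave headroom for the preamble ahead of the payload. */
                return kzalloc(EC_RESPONSE_PREAMBLE_MAX + max_response, gfp);
        }
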
  8. 04 April 2017 (2 commits)
  9. 03 April 2017 (2 commits)
  10. 02 April 2017 (1 commit)
  11. 01 April 2017 (3 commits)
  12. 28 March 2017 (1 commit)
  13. 24 March 2017 (1 commit)
  14. 23 March 2017 (1 commit)
  15. 22 March 2017 (5 commits)
  16. 17 March 2017 (4 commits)
    • cgroup, kthread: close race window where new kthreads can be migrated to non-root cgroups · 77f88796
      Committed by Tejun Heo
      Creation of a kthread goes through a couple of interlocked stages between
      the kthread itself and its creator.  Once the new kthread starts
      running, it initializes itself and wakes up the creator.  The creator
      then can further configure the kthread and then let it start doing its
      job by waking it up.
      
      In this configuration-by-creator stage, the creator is the only one
      that can wake it up, but the kthread is visible to userland.  When
      altering the kthread's attributes from userland is allowed, this is
      fine; however, for cases where CPU affinity is critical,
      kthread_bind() is used to first disable affinity changes from userland
      and then set the affinity.  This also prevents the kthread from being
      migrated into non-root cgroups as that can affect the CPU affinity and
      many other things.
      
      Unfortunately, the cgroup side of protection is racy.  While the
      PF_NO_SETAFFINITY flag prevents further migrations, userland can win
      the race before the creator sets the flag with kthread_bind() and put
      the kthread in a non-root cgroup, which can lead to all sorts of
      problems including incorrect CPU affinity and starvation.
      
      This bug got triggered by userland, which periodically tries to migrate
      all processes in the root cpuset cgroup to a non-root one.  Per-cpu
      workqueue workers got caught while being created and ended up with
      incorrect CPU affinity, breaking concurrency management and sometimes
      stalling workqueue execution.
      
      This patch adds task->no_cgroup_migration, which prevents the task from
      being migrated by userland.  kthreadd starts with the flag set, making
      every child kthread start in the root cgroup with migration
      disallowed.  The flag is cleared after the kthread finishes
      initialization by which time PF_NO_SETAFFINITY is set if the kthread
      should stay in the root cgroup.
      
      It'd be better to wait for the initialization instead of failing, but I
      couldn't think of a way of implementing that without adding either a
      new PF flag, or sleeping and retrying from the waiting side.  Even if
      userland depends on changing cgroup membership of a kthread, it either
      has to be synchronized with kthread_create() or periodically repeat,
      so it's unlikely that this would break anything.
      
      v2: Switch to a simpler implementation using a new task_struct bit
          field suggested by Oleg.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Suggested-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Reported-and-debugged-by: Chris Mason <clm@fb.com>
      Cc: stable@vger.kernel.org # v4.3+ (we can't close the race on < v4.3)
      Signed-off-by: Tejun Heo <tj@kernel.org>
      77f88796
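
      The mechanism, sketched: the task_struct bit is the one the patch
      adds, while the two helpers here are illustrative rather than the
      patch's actual API:

        #include <linux/sched.h>

        /* In struct task_struct: unsigned no_cgroup_migration:1; */

        static inline void block_cgroup_migration(struct task_struct *p)
        {
                /* Set in kthreadd, inherited by every child kthread. */
                p->no_cgroup_migration = 1;
        }

        static inline bool cgroup_may_migrate(struct task_struct *p)
        {
                /* The cgroup migration path refuses tasks with the bit
                 * set; the kthread clears it once initialization (and any
                 * kthread_bind()) is done. */
                return !p->no_cgroup_migration;
        }
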
    • net/mlx4_core: Avoid delays during VF driver device shutdown · 4cbe4dac
      Committed by Jack Morgenstein
      Some Hypervisors detach VFs from VMs by instantly causing an FLR event
      to be generated for a VF.
      
      In the mlx4 case, this will cause that VF's comm channel to be disabled
      before the VM has an opportunity to invoke the VF device's "shutdown"
      method.
      
      For such Hypervisors, there is a race condition between the VF's
      shutdown method and its internal-error detection/reset thread.
      
      The internal-error detection/reset thread (which runs every 5 seconds) also
      detects a disabled comm channel. If the internal-error detection/reset
      flow wins the race, we still get delays (while that flow tries repeatedly
      to detect comm-channel recovery).
      
      The cited commit fixed the command timeout problem when the
      internal-error detection/reset flow loses the race.
      
      This commit avoids the unneeded delays when the internal-error
      detection/reset flow wins.
      
      Fixes: d585df1c ("net/mlx4_core: Avoid command timeouts during VF driver device shutdown")
      Signed-off-by: Jack Morgenstein <jackm@dev.mellanox.co.il>
      Reported-by: Simon Xiao <sixiao@microsoft.com>
      Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
      4cbe4dac
    • drivers core: remove assert_held_device_hotplug() · 15c9e10d
      Committed by Heiko Carstens
      The last caller of assert_held_device_hotplug() is gone, so remove it again.
      
      Link: http://lkml.kernel.org/r/20170314125226.16779-3-heiko.carstens@de.ibm.com
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Acked-by: Dan Williams <dan.j.williams@intel.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Ben Hutchings <ben@decadent.org.uk>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Sebastian Ott <sebott@linux.vnet.ibm.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      15c9e10d
    • kasan: add a prototype of task_struct to avoid warning · 5be9b730
      Committed by Masami Hiramatsu
      Add a prototype of task_struct to fix the warning below on arm64.
      
        In file included from arch/arm64/kernel/probes/kprobes.c:19:0:
        include/linux/kasan.h:81:132: error: 'struct task_struct' declared inside parameter list will not be visible outside of this definition or declaration [-Werror]
         static inline void kasan_unpoison_task_stack(struct task_struct *task) {}
      
      As with the other types (kmem_cache, page, and vm_struct), this adds a
      prototype of the task_struct data structure at the top of kasan.h.
      
      [arnd] A related warning was fixed before, but now appears on a
      different line in the same file in v4.11-rc2.  The patch from Masami
      Hiramatsu still seems appropriate, so let's take his version.
      
      Fixes: 71af2ed5 ("kasan, sched/headers: Remove <linux/sched.h> from <linux/kasan.h>")
      Link: https://patchwork.kernel.org/patch/9569839/
      Link: http://lkml.kernel.org/r/20170313141517.3397802-1-arnd@arndb.de
      Signed-off-by: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
      Acked-by: Alexander Potapenko <glider@google.com>
      Acked-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5be9b730
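
      The fix itself is essentially one more forward declaration at the
      top of include/linux/kasan.h, next to the existing ones:

        struct kmem_cache;
        struct page;
        struct vm_struct;
        struct task_struct;   /* added: the inline stubs can now name it */

        static inline void kasan_unpoison_task_stack(struct task_struct *task) {}
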
  17. 16 March 2017 (4 commits)
    • crypto: ccp - Assign DMA commands to the channel's CCP · 7c468447
      Committed by Gary R Hook
      The CCP driver generally uses a round-robin approach when
      assigning operations to available CCPs. For the DMA engine,
      however, the DMA mappings of the SGs are associated with a
      specific CCP. When an IOMMU is enabled, the IOMMU is
      programmed based on this specific device.
      
      If the DMA operations are not performed by that specific
      CCP then addressing errors and I/O page faults will occur.
      
      Update the CCP driver to allow a specific CCP device to be
      requested for an operation and use this in the DMA engine
      support.
      
      Cc: <stable@vger.kernel.org> # 4.9.x-
      Signed-off-by: Gary R Hook <gary.hook@amd.com>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      7c468447
    • vmbus: remove hv_event_tasklet_disable/enable · dad72a1d
      Committed by Dexuan Cui
      With the recent introduction of the per-channel tasklet, we need to
      update the way we handle these 3 concurrency issues:
      
      1. hv_process_channel_removal -> percpu_channel_deq vs.
         vmbus_chan_sched -> list_for_each_entry(..., percpu_list);
      
      2. vmbus_process_offer -> percpu_channel_enq/deq vs. vmbus_chan_sched.
      
      3. vmbus_close_internal vs. the per-channel tasklet vmbus_on_event;
      
      The first 2 issues can be handled by Stephen's recent patch
      "vmbus: use rcu for per-cpu channel list", and the third issue
      can be handled by calling tasklet_disable in vmbus_close_internal here.
      
      We don't need the original hv_event_tasklet_disable/enable since we
      now use a per-channel tasklet instead of the previous per-CPU tasklet,
      and in fact we must remove them due to a side effect now:
      vmbus_process_offer -> hv_event_tasklet_enable -> tasklet_schedule will
      start the per-channel callback prematurely, causing a NULL dereference
      (the channel may not have been properly configured to run the callback yet).
      
      Fixes: 631e63a9 ("vmbus: change to per channel tasklet")
      Signed-off-by: Dexuan Cui <decui@microsoft.com>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Tested-by: Vitaly Kuznetsov <vkuznets@redhat.com>
      Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      dad72a1d
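
      The vmbus_close_internal() side of the fix, heavily simplified; only
      the tasklet calls are the point here:

        #include <linux/hyperv.h>
        #include <linux/interrupt.h>

        static void close_channel_sketch(struct vmbus_channel *channel)
        {
                /* Waits for a running vmbus_on_event() to finish and keeps
                 * it from being scheduled again while the channel is torn
                 * down. */
                tasklet_disable(&channel->callback_event);

                /* ... send CHANNELMSG_CLOSECHANNEL, free the ring ... */

                tasklet_enable(&channel->callback_event);
        }
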
    • vmbus: use rcu for per-cpu channel list · 8200f208
      Committed by Stephen Hemminger
      The per-cpu channel list is now referred to in the interrupt
      routine. This is mostly safe since the host will not normally generate
      an interrupt when a channel is being deleted, but if it did there
      would be a use-after-free problem.
      
      To solve this, use RCU protection on the per-cpu list.
      
      Fixes: 631e63a9 ("vmbus: change to per channel tasklet")
      Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com>
      Signed-off-by: K. Y. Srinivasan <kys@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      8200f208
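
      A simplified sketch of the two sides of the RCU conversion (not the
      patch verbatim):

        #include <linux/hyperv.h>
        #include <linux/rculist.h>

        /* Interrupt path: walk the per-cpu list under RCU. */
        static void sched_channels(struct hv_per_cpu_context *hv_cpu)
        {
                struct vmbus_channel *channel;

                rcu_read_lock();
                list_for_each_entry_rcu(channel, &hv_cpu->chan_list,
                                        percpu_list)
                        tasklet_schedule(&channel->callback_event);
                rcu_read_unlock();
        }

        /* Removal path: unlink, then wait out readers before the channel
         * memory can be reused. */
        static void deq_channel_sketch(struct vmbus_channel *channel)
        {
                list_del_rcu(&channel->percpu_list);
                synchronize_rcu();
        }
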
    • fscrypt: eliminate ->prepare_context() operation · 94840e3c
      Committed by Eric Biggers
      The only use of the ->prepare_context() fscrypt operation was to allow
      ext4 to evict inline data from the inode before ->set_context().
      However, there is no reason why this cannot be done as simply the first
      step in ->set_context(), and in fact it makes more sense to do it that
      way because then the policy modes and flags get validated before any
      real work is done.  Therefore, merge ext4_prepare_context() into
      ext4_set_context(), and remove ->prepare_context().
      Signed-off-by: Eric Biggers <ebiggers@google.com>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      94840e3c
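
      The resulting shape of ext4_set_context(), simplified to show the
      reordering (assumes the fs/ext4 context for ext4_convert_inline_data):

        #include <linux/fs.h>
        #include "ext4.h"

        static int ext4_set_context(struct inode *inode, const void *ctx,
                                    size_t len, void *fs_data)
        {
                int res;

                /* Formerly ext4_prepare_context(): evict inline data
                 * before any encryption state is committed. */
                res = ext4_convert_inline_data(inode);
                if (res)
                        return res;

                /* ... validate policy modes/flags, then write the
                 * encryption xattr ... */
                return 0;
        }
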