1. 18 Oct, 2018 1 commit
  2. 17 Oct, 2018 1 commit
  3. 12 Oct, 2018 1 commit
  4. 08 Oct, 2018 1 commit
  5. 28 Sep, 2018 4 commits
  6. 27 Sep, 2018 3 commits
    • G
      bcache: add separate workqueue for journal_write to avoid deadlock · 0f843e65
      Guoju Fang committed
      After a write to the SSD completes, bcache schedules journal_write work
      on system_wq, a public system workqueue without the WQ_MEM_RECLAIM
      flag. system_wq is also a bound wq, and there may be no idle kworker on
      the current processor. Creating a new kworker may unfortunately need to
      reclaim memory first, by shrinking cache and slab used by vfs, which
      depends on the bcache device. That's a deadlock.
      
      This patch creates a new workqueue for journal_write with the
      WQ_MEM_RECLAIM flag. Its rescuer thread will work to avoid the deadlock.
      Signed-off-by: Guoju Fang <fangguoju@gmail.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Coly Li <colyli@suse.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • B
      xen/blkfront: When purging persistent grants, keep them in the buffer · f151ba98
      Boris Ostrovsky committed
      Commit a46b5367 ("xen/blkfront: cleanup stale persistent grants")
      added support for purging persistent grants when they are not in use. As
      part of the purge, the grants were removed from the grant buffer. This
      eventually causes the buffer to become empty, with BUG_ON triggered in
      get_free_grant(). This can be observed even on an idle system, within
      20-30 minutes.
      
      We should keep the grants in the buffer when purging, and only free the
      grant ref.
      
      Fixes: a46b5367 ("xen/blkfront: cleanup stale persistent grants")
      Reviewed-by: Juergen Gross <jgross@suse.com>
      Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • D
      block: fix deadline elevator drain for zoned block devices · 854f31cc
      Damien Le Moal committed
      When the deadline scheduler is used with a zoned block device, writes
      to a zone will be dispatched one at a time. This causes the warning
      message:
      
      deadline: forced dispatching is broken (nr_sorted=X), please report this
      
      to be displayed when switching to another elevator with the legacy I/O
      path while write requests to a zone are being retained in the scheduler
      queue.
      
      Prevent this message from being displayed when executing
      elv_drain_elevator() for a zoned block device. __blk_drain_queue() will
      loop until all writes are dispatched and completed, resulting in the
      desired elevator queue drain without extensive modifications to the
      deadline code itself to handle forced-dispatch calls.
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Fixes: 8dc8146f ("deadline-iosched: Introduce zone locking support")
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  7. 26 Sep, 2018 2 commits
  8. 22 Sep, 2018 1 commit
    • O
      block: use nanosecond resolution for iostat · b57e99b4
      Omar Sandoval committed
      Klaus Kusche reported that the I/O busy time in /proc/diskstats was not
      updating properly on 4.18. This is because we started using ktime to
      track elapsed time, and we convert nanoseconds to jiffies when we update
      the partition counter. However, this gets rounded down, so any I/Os that
      take less than a jiffy are not accounted for. Previously in this case,
      the value of jiffies would sometimes increment while we were doing I/O,
      so at least some I/Os were accounted for.
      
      Let's convert the stats to use nanoseconds internally. We still report
      milliseconds as before, now more accurately than ever. The value is
      still truncated to 32 bits for backwards compatibility.
      
      Fixes: 522a7775 ("block: consolidate struct request timestamp fields")
      Cc: stable@vger.kernel.org
      Reported-by: Klaus Kusche <klaus.kusche@computerix.info>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
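The rounding problem described in commit b57e99b4 can be illustrated with a small userspace sketch. It is not the kernel code: the function names are hypothetical, and the tick rate of 250 Hz (one jiffy = 4 ms) is an assumption chosen for the example. It compares converting each I/O duration to jiffies before accumulating with accumulating nanoseconds and converting once at report time.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for illustration only: HZ = 250, so one jiffy is 4 ms. */
#define SKETCH_HZ        250ULL
#define NSEC_PER_SEC     1000000000ULL
#define NSEC_PER_MSEC    1000000ULL
#define NSEC_PER_JIFFY   (NSEC_PER_SEC / SKETCH_HZ)
#define MSEC_PER_JIFFY   (1000ULL / SKETCH_HZ)

/* Old scheme: each duration is rounded down to whole jiffies first,
 * so any I/O shorter than one jiffy contributes nothing. */
static uint64_t busy_ms_via_jiffies(const uint64_t *dur_ns, int n)
{
    uint64_t jiffies = 0;

    assert(n >= 0);
    for (int i = 0; i < n; i++)
        jiffies += dur_ns[i] / NSEC_PER_JIFFY;
    return jiffies * MSEC_PER_JIFFY;
}

/* New scheme: accumulate nanoseconds, convert to milliseconds once,
 * and truncate the reported value to 32 bits. */
static uint32_t busy_ms_via_ns(const uint64_t *dur_ns, int n)
{
    uint64_t total_ns = 0;

    assert(n >= 0);
    for (int i = 0; i < n; i++)
        total_ns += dur_ns[i];
    return (uint32_t)(total_ns / NSEC_PER_MSEC);
}
```

With 1000 I/Os of 100 microseconds each, the first function reports 0 ms of busy time while the second reports 100 ms.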
  9. 20 Sep, 2018 3 commits
  10. 17 Sep, 2018 1 commit
  11. 13 Sep, 2018 1 commit
    • J
      null_blk: fix zoned support for non-rq based operation · b228ba1c
      Jens Axboe committed
      The support added for zones in null_blk seems to assume that only
      rq-based operation is possible. But this depends on the queue_mode
      setting: if this is set to 0, then cmd->bio is what we need to be
      operating on. Right now any attempt to load null_blk with queue_mode=0
      will insta-crash, since cmd->rq is NULL and null_handle_cmd() assumes it
      to always be set.
      
      Make the zoned code deal with bios instead, or pass in the
      appropriate sector/nr_sectors instead.
      
      Fixes: ca4b2a01 ("null_blk: add zone support")
      Tested-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  12. 12 Sep, 2018 1 commit
  13. 10 Sep, 2018 1 commit
  14. 07 Sep, 2018 1 commit
  15. 06 Sep, 2018 2 commits
  16. 05 Sep, 2018 1 commit
  17. 01 Sep, 2018 3 commits
    • D
      blkcg: use tryget logic when associating a blkg with a bio · 31118850
      Dennis Zhou (Facebook) committed
      There is a very small chance that a bio gets caught up in a really
      unfortunate race between a task migration, a cgroup exiting, and the
      bio itself trying to associate with a blkg. This is due to css offlining
      being performed after the css->refcnt is killed, which triggers removal
      of blkgs that reach their blkg->refcnt of 0.
      
      To avoid this, association with a blkg should use tryget and fall back
      to using the root_blkg.
      
      Fixes: 08e18eab ("block: add bi_blkg to the bio for cgroups")
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
      Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
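The tryget-with-fallback idea from commit 31118850 can be sketched in userspace with C11 atomics. This is a minimal model, not the kernel implementation: `struct blkg_sketch` and both function names are hypothetical, and the sketch assumes the root object is never torn down while bios exist.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a blkg; only the reference count matters. */
struct blkg_sketch {
    atomic_int refcnt;   /* 0 means teardown has already started */
};

/* Take a reference only if the count is still non-zero. */
static bool blkg_sketch_tryget(struct blkg_sketch *blkg)
{
    int old = atomic_load(&blkg->refcnt);

    while (old > 0) {
        /* On failure, 'old' is refreshed and the loop re-checks it. */
        if (atomic_compare_exchange_weak(&blkg->refcnt, &old, old + 1))
            return true;
    }
    return false;
}

/* Associate with 'blkg' if it is still alive, otherwise with 'root'.
 * Assumption: 'root' outlives every caller, so a plain increment is safe. */
static struct blkg_sketch *blkg_sketch_associate(struct blkg_sketch *blkg,
                                                 struct blkg_sketch *root)
{
    assert(root != NULL);
    if (blkg != NULL && blkg_sketch_tryget(blkg))
        return blkg;
    atomic_fetch_add(&root->refcnt, 1);
    return root;
}
```

The point of the compare-and-exchange loop is that a count that has already reached zero is never raised again, which is the case the commit message describes as a race.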
    • D
      blkcg: delay blkg destruction until after writeback has finished · 59b57717
      Dennis Zhou (Facebook) committed
      Currently, blkcg destruction relies on a sequence of events:
        1. Destruction starts. blkcg_css_offline() is called and blkgs
           release their reference to the blkcg. This immediately destroys
           the cgwbs (writeback).
        2. With blkgs giving up their reference, the blkcg ref count should
           become zero and eventually call blkcg_css_free() which finally
           frees the blkcg.
      
      Jiufei Xue reported that there is a race between blkcg_bio_issue_check()
      and cgroup_rmdir(). To remedy this, blkg destruction becomes contingent
      on the completion of all writeback associated with the blkcg. A count of
      the number of cgwbs is maintained and once that goes to zero, blkg
      destruction can follow. This should prevent premature blkg destruction
      related to writeback.
      
      The new process for blkcg cleanup is as follows:
        1. Destruction starts. blkcg_css_offline() is called which offlines
           writeback. Blkg destruction is delayed on the cgwb_refcnt count to
           avoid punting potentially large amounts of outstanding writeback
           to root while maintaining any ongoing policies. Here, the base
           cgwb_refcnt is put back.
        2. When the cgwb_refcnt becomes zero, blkcg_destroy_blkgs() is called
           and handles destruction of blkgs. This is where the css reference
           held by each blkg is released.
        3. Once the blkcg ref count goes to zero, blkcg_css_free() is called.
           This finally frees the blkcg.
      
      It seems in the past blk-throttle didn't do the most understandable
      things with taking data from a blkg while associating with current. So,
      the simplification and unification of what blk-throttle is doing caused
      this.
      
      Fixes: 08e18eab ("block: add bi_blkg to the bio for cgroups")
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
      Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
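The new cleanup order described in commit 59b57717 can be modelled with a small single-threaded sketch. The names `cgwb_refcnt`, `blkcg_css_offline()` and `blkcg_destroy_blkgs()` come from the commit message; the struct and the `_sketch` helpers are hypothetical, and plain integers stand in for the kernel's reference counting.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the state the commit message talks about. */
struct blkcg_sketch {
    int  cgwb_refcnt;      /* starts at 1: the base reference */
    bool blkgs_destroyed;  /* stands in for blkcg_destroy_blkgs() having run */
};

static void blkcg_sketch_init(struct blkcg_sketch *b)
{
    b->cgwb_refcnt = 1;
    b->blkgs_destroyed = false;
}

/* A cgwb (writeback context) is created for this blkcg. */
static void cgwb_sketch_get(struct blkcg_sketch *b)
{
    assert(b->cgwb_refcnt > 0);
    b->cgwb_refcnt++;
}

/* A cgwb is released; the last put triggers blkg destruction. */
static void cgwb_sketch_put(struct blkcg_sketch *b)
{
    assert(b->cgwb_refcnt > 0);
    if (--b->cgwb_refcnt == 0)
        b->blkgs_destroyed = true;
}

/* Step 1 of the new process: offlining only puts back the base
 * reference, so blkgs survive while writeback is outstanding. */
static void css_offline_sketch(struct blkcg_sketch *b)
{
    cgwb_sketch_put(b);
}
```

With one cgwb outstanding, offlining leaves the blkgs in place, and they are destroyed only when that cgwb is put.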
    • D
      Revert "blk-throttle: fix race between blkcg_bio_issue_check() and cgroup_rmdir()" · 6b065462
      Dennis Zhou (Facebook) committed
      This reverts commit 4c699480.
      
      Destroying blkgs is tricky because of the nature of the relationship. A
      blkg should go away when either a blkcg or a request_queue goes away.
      However, blkgs pin the blkcg to ensure they remain valid. To break this
      cycle, when a blkcg is offlined, blkgs put back their css ref. This
      eventually lets css_free() get called which frees the blkcg.
      
      The above commit (4c699480) breaks this order of events by trying to
      destroy blkgs in css_free(). As the blkgs still hold references to the
      blkcg, css_free() is never called.
      
      The race between blkcg_bio_issue_check() and cgroup_rmdir() will be
      addressed in the following patch by delaying destruction of a blkg until
      all writeback associated with the blkcg has been finished.
      
      Fixes: 4c699480 ("blk-throttle: fix race between blkcg_bio_issue_check() and cgroup_rmdir()")
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
      Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  18. 30 Aug, 2018 1 commit
  19. 29 Aug, 2018 1 commit
  20. 28 Aug, 2018 10 commits
    • C
    • J
      nvme-fcloop: Fix dropped LS's to removed target port · afd299ca
      James Smart committed
      When a targetport is removed from the config, fcloop will avoid calling
      the LS done() routine thinking the targetport is gone. This leaves the
      initiator reset/reconnect hanging as it waits for a status on the
      Create_Association LS for the reconnect.
      
      Change the filter in the LS callback path. If tport is null (set when
      failed validation before "sending to remote port"), be sure to call
      done. This was the main bug. But, continue the logic that only calls
      done if tport was set but there is no remoteport (e.g. case where
      remoteport has been removed, thus host doesn't expect a completion).
      Signed-off-by: James Smart <james.smart@broadcom.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • M
      nvme-pci: add a memory barrier to nvme_dbbuf_update_and_check_event · f1ed3df2
      Michal Wnukowski committed
      In many architectures loads may be reordered with older stores to
      different locations.  In the nvme driver the following two operations
      could be reordered:
      
       - Write shadow doorbell (dbbuf_db) into memory.
       - Read EventIdx (dbbuf_ei) from memory.
      
      This can result in a potential race condition between the driver and
      the VM host processing requests (if the given virtual NVMe controller
      supports the shadow doorbell).  If that occurs, the NVMe controller may
      decide to wait for an MMIO doorbell from the guest operating system, and
      the guest driver may decide not to issue an MMIO doorbell on any
      subsequent commands.
      
      This issue is purely timing-dependent, so there is no easy way to
      reproduce it. Currently the easiest known approach is to run "Oracle IO
      Numbers" (orion), which is shipped with Oracle DB:
      
      orion -run advanced -num_large 0 -size_small 8 -type rand -simulate \
      	concat -write 40 -duration 120 -matrix row -testname nvme_test
      
      Where nvme_test is a .lun file that contains a list of NVMe block
      devices to run the test against. Limiting the number of vCPUs assigned
      to a given VM instance seems to increase the chances of this bug
      occurring. In a test environment with a VM that had 4 NVMe drives and
      1 vCPU assigned, the virtual NVMe controller hang could be observed
      within 10-20 minutes. That corresponds to about 400-500k IO operations
      processed (or about 100GB of IO reads/writes).
      
      The Orion tool was used for validation and set to run in a loop for 36
      hours (the equivalent of pushing 550M IO operations). No issues were
      observed. That suggests that the patch fixes the issue.
      
      Fixes: f9f38e33 ("nvme: improve performance for virtual NVMe devices")
      Signed-off-by: Michal Wnukowski <wnukowski@google.com>
      Reviewed-by: Keith Busch <keith.busch@intel.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      [hch: updated changelog and comment a bit]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
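The ordering requirement from commit f1ed3df2 (the store to the shadow doorbell must not be reordered after the load of EventIdx) can be sketched in userspace with C11 atomics. This is an illustration under stated assumptions, not the driver code: the function names and the 16-bit atomic types are hypothetical, and a sequentially consistent fence stands in for the kernel's full memory barrier.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* True if event_idx falls in [old, new_idx), computed modulo 2^16.
 * In that case the other side asked to be notified for this update. */
static bool need_event_sketch(uint16_t event_idx, uint16_t new_idx,
                              uint16_t old)
{
    return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old);
}

/* Publish a new doorbell value in the shadow buffer, then decide whether
 * a real (MMIO) doorbell write is still needed. */
static bool dbbuf_update_and_check_sketch(uint16_t value,
                                          _Atomic uint16_t *dbbuf_db,
                                          _Atomic uint16_t *dbbuf_ei)
{
    uint16_t old, ei;

    assert(dbbuf_db && dbbuf_ei);
    old = atomic_load_explicit(dbbuf_db, memory_order_relaxed);
    atomic_store_explicit(dbbuf_db, value, memory_order_relaxed);

    /* Full fence: without it, the EventIdx load below may be satisfied
     * before the shadow doorbell store becomes visible to the other side. */
    atomic_thread_fence(memory_order_seq_cst);

    ei = atomic_load_explicit(dbbuf_ei, memory_order_relaxed);
    return need_event_sketch(ei, value, old);
}
```

If the fence were missing, both sides could read a stale value from the other and each conclude that no notification is needed, which matches the hang described in the commit message.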
    • J
      block: bsg: move atomic_t ref_count variable to refcount API · db193954
      John Pittman committed
      Currently, variable ref_count within the bsg_device struct is of
      type atomic_t.  For variables being used as reference counters,
      the refcount API should be used instead of atomic.  The newer
      refcount API works to prevent counter overflows and use-after-free
      bugs.  So, move this variable from the atomic API to refcount,
      potentially avoiding the issues mentioned.
      Signed-off-by: John Pittman <jpittman@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
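Why a dedicated reference-count type helps, as commit db193954 argues, can be shown with a simplified single-threaded model. This is not the kernel's refcount implementation: `struct refcount_sketch` and its helpers are hypothetical, and the only behaviours modelled are refusing to revive a zero count and saturating instead of wrapping.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical, single-threaded model of a saturating reference count. */
struct refcount_sketch {
    unsigned int refs;
};

/* Increment, but never revive a count of zero (that would be a
 * use-after-free) and never wrap past UINT_MAX (overflow). */
static void refcount_sketch_inc(struct refcount_sketch *r)
{
    assert(r != 0);
    if (r->refs == 0 || r->refs == UINT_MAX)
        return;
    r->refs++;
}

/* Decrement and report whether this was the last reference.
 * A saturated counter stays saturated, so the object is leaked
 * rather than freed while references may still exist. */
static bool refcount_sketch_dec_and_test(struct refcount_sketch *r)
{
    assert(r != 0);
    if (r->refs == 0 || r->refs == UINT_MAX)
        return false;
    return --r->refs == 0;
}
```

A plain atomic counter would wrap to zero on overflow and could be incremented again after reaching zero; the model above rules out both.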
    • C
      block: remove unnecessary condition check · 62d2a194
      Chengguang Xu committed
      kmem_cache_destroy() can handle NULL pointer correctly, so there is
      no need to check e->icq_cache before calling kmem_cache_destroy().
      Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • L
      ata: ftide010: Add a quirk for SQ201 · 46cb52ad
      Linus Walleij committed
      The DMA is broken on this specific device for some unknown
      reason (probably badly designed or plain broken interface
      electronics) and will only work with PIO. Other users of
      the same hardware do not have this problem.
      
      Add a specific quirk so that this Gemini device gets
      DMA turned off. Also fix up some code around passing the
      port information around in probe while we're at it.
      Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • J
      blk-wbt: remove dead code · b0a84beb
      Jens Axboe committed
      We already note and mark discard and swap IO from bio_to_wbt_flags().
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • J
      Merge branch 'stable/for-jens-4.19' of... · 057d3ccf
      Jens Axboe committed
      Merge branch 'stable/for-jens-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen into for-linus
      
      Pull Xen block driver fixes from Konrad:
      
      "Fix for flushing out persistent pages at a deterministic rate"
      
      * 'stable/for-jens-4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen:
        xen/blkback: remove unused pers_gnts_lock from struct xen_blkif_ring
        xen/blkback: move persistent grants flags to bool
        xen/blkfront: reorder tests in xlblk_init()
        xen/blkfront: cleanup stale persistent grants
        xen/blkback: don't keep persistent grants too long
    • J
      blk-wbt: improve waking of tasks · 38cfb5a4
      Jens Axboe committed
      We have two potential issues:
      
      1) After commit 2887e41b, we only wake one process at a time when
         we finish an IO. We really want to wake up as many tasks as can
         queue IO. Before this commit, we woke up everyone, which could cause
         a thundering herd issue.
      
      2) A task can potentially consume two wakeups, causing us to (in
         practice) miss a wakeup.
      
      Fix both by providing our own wakeup function, which stops
      __wake_up_common() from waking up more tasks if we fail to get a
      queueing token. With the strict ordering we have on the wait list, this
      wakes the right tasks and the right number of tasks.
      
      Based on a patch from Jianchao Wang <jianchao.w.wang@oracle.com>.
      Tested-by: Agarwal, Anchal <anchalag@amazon.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
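The wakeup policy described in commit 38cfb5a4 can be approximated by a small single-threaded simulation. The struct, the function names, and the array-based wait list are all hypothetical; the sketch only models the rule that waiters are visited in FIFO order, each wakeup takes a queueing token on the waiter's behalf, and the walk stops at the first waiter that cannot get one.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the throttling state: tokens are free while
 * inflight < limit. */
struct wbt_sketch {
    int inflight;
    int limit;
};

/* Try to take one queueing token. */
static bool wbt_sketch_get_token(struct wbt_sketch *w)
{
    if (w->inflight >= w->limit)
        return false;
    w->inflight++;
    return true;
}

/* Walk the wait list in FIFO order. Each waiter is woken only if a token
 * could be taken for it; the first failure stops the walk, so no task is
 * woken without being able to queue IO. Returns the number woken. */
static int wbt_sketch_wake(struct wbt_sketch *w, bool *woken, int nr_waiters)
{
    int nr_woken = 0;

    assert(nr_waiters >= 0);
    for (int i = 0; i < nr_waiters; i++) {
        if (woken[i])
            continue;           /* already holds a token */
        if (!wbt_sketch_get_token(w))
            break;              /* out of tokens: stop waking */
        woken[i] = true;
        nr_woken++;
    }
    return nr_woken;
}
```

Because the token is taken during the wakeup itself, a woken task cannot absorb a second wakeup, and the number of tasks woken equals the number of tokens that were free.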
    • J
      blk-wbt: abstract out end IO completion handler · 061a5427
      Jens Axboe committed
      Prep patch for calling the handler from a different context,
      no functional changes in this patch.
      Tested-by: Agarwal, Anchal <anchalag@amazon.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>