1. 27 2月, 2020 1 次提交
  2. 25 2月, 2020 1 次提交
    • M
      blk-mq: insert passthrough request into hctx->dispatch directly · 01e99aec
      Ming Lei 提交于
      For some reason, device may be in one situation which can't handle
      FS request, so STS_RESOURCE is always returned and the FS request
      will be added to hctx->dispatch. However passthrough request may
      be required at that time for fixing the problem. If passthrough
      request is added to scheduler queue, there isn't any chance for
      blk-mq to dispatch it given we prioritize requests in hctx->dispatch.
      Then the FS IO request may never be completed, and IO hang is caused.
      
      So passthrough request has to be added to hctx->dispatch directly
      for fixing the IO hang.
      
      Fix this issue by inserting passthrough request into hctx->dispatch
      directly together withing adding FS request to the tail of
      hctx->dispatch in blk_mq_dispatch_rq_list(). Actually we add FS request
      to tail of hctx->dispatch at default, see blk_mq_request_bypass_insert().
      
      Then it becomes consistent with original legacy IO request
      path, in which passthrough request is always added to q->queue_head.
      
      Cc: Dongli Zhang <dongli.zhang@oracle.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Ewan D. Milne <emilne@redhat.com>
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      01e99aec
  3. 07 10月, 2019 1 次提交
  4. 11 7月, 2019 1 次提交
    • D
      block: Disable write plugging for zoned block devices · b49773e7
      Damien Le Moal 提交于
      Simultaneously writing to a sequential zone of a zoned block device
      from multiple contexts requires mutual exclusion for BIO issuing to
      ensure that writes happen sequentially. However, even for a well
      behaved user correctly implementing such synchronization, BIO plugging
      may interfere and result in BIOs from the different contextx to be
      reordered if plugging is done outside of the mutual exclusion section,
      e.g. the plug was started by a function higher in the call chain than
      the function issuing BIOs.
      
               Context A                     Context B
      
         | blk_start_plug()
         | ...
         | seq_write_zone()
           | mutex_lock(zone)
           | bio-0->bi_iter.bi_sector = zone->wp
           | zone->wp += bio_sectors(bio-0)
           | submit_bio(bio-0)
           | bio-1->bi_iter.bi_sector = zone->wp
           | zone->wp += bio_sectors(bio-1)
           | submit_bio(bio-1)
           | mutex_unlock(zone)
           | return
         | -----------------------> | seq_write_zone()
        				| mutex_lock(zone)
           				| bio-2->bi_iter.bi_sector = zone->wp
           				| zone->wp += bio_sectors(bio-2)
      				| submit_bio(bio-2)
      				| mutex_unlock(zone)
         | <------------------------- |
         | blk_finish_plug()
      
      In the above example, despite the mutex synchronization ensuring the
      correct BIO issuing order 0, 1, 2, context A BIOs 0 and 1 end up being
      issued after BIO 2 of context B, when the plug is released with
      blk_finish_plug().
      
      While this problem can be addressed using the blk_flush_plug_list()
      function (in the above example, the call must be inserted before the
      zone mutex lock is released), a simple generic solution in the block
      layer avoid this additional code in all zoned block device user code.
      The simple generic solution implemented with this patch is to introduce
      the internal helper function blk_mq_plug() to access the current
      context plug on BIO submission. This helper returns the current plug
      only if the target device is not a zoned block device or if the BIO to
      be plugged is not a write operation. Otherwise, the caller context plug
      is ignored and NULL returned, resulting is all writes to zoned block
      device to never be plugged.
      Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      b49773e7
  5. 03 7月, 2019 1 次提交
  6. 04 5月, 2019 1 次提交
    • M
      blk-mq: free hw queue's resource in hctx's release handler · c7e2d94b
      Ming Lei 提交于
      Once blk_cleanup_queue() returns, tags shouldn't be used any more,
      because blk_mq_free_tag_set() may be called. Commit 45a9c9d9
      ("blk-mq: Fix a use-after-free") fixes this issue exactly.
      
      However, that commit introduces another issue. Before 45a9c9d9,
      we are allowed to run queue during cleaning up queue if the queue's
      kobj refcount is held. After that commit, queue can't be run during
      queue cleaning up, otherwise oops can be triggered easily because
      some fields of hctx are freed by blk_mq_free_queue() in blk_cleanup_queue().
      
      We have invented ways for addressing this kind of issue before, such as:
      
      	8dc765d4 ("SCSI: fix queue cleanup race before queue initialization is done")
      	c2856ae2 ("blk-mq: quiesce queue before freeing queue")
      
      But still can't cover all cases, recently James reports another such
      kind of issue:
      
      	https://marc.info/?l=linux-scsi&m=155389088124782&w=2
      
      This issue can be quite hard to address by previous way, given
      scsi_run_queue() may run requeues for other LUNs.
      
      Fixes the above issue by freeing hctx's resources in its release handler, and this
      way is safe becasue tags isn't needed for freeing such hctx resource.
      
      This approach follows typical design pattern wrt. kobject's release handler.
      
      Cc: Dongli Zhang <dongli.zhang@oracle.com>
      Cc: James Smart <james.smart@broadcom.com>
      Cc: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: linux-scsi@vger.kernel.org,
      Cc: Martin K . Petersen <martin.petersen@oracle.com>,
      Cc: Christoph Hellwig <hch@lst.de>,
      Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>,
      Reported-by: NJames Smart <james.smart@broadcom.com>
      Fixes: 45a9c9d9 ("blk-mq: Fix a use-after-free")
      Cc: stable@vger.kernel.org
      Reviewed-by: NHannes Reinecke <hare@suse.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Tested-by: NJames Smart <james.smart@broadcom.com>
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      c7e2d94b
  7. 05 4月, 2019 1 次提交
    • B
      block: Revert v5.0 blk_mq_request_issue_directly() changes · fd9c40f6
      Bart Van Assche 提交于
      blk_mq_try_issue_directly() can return BLK_STS*_RESOURCE for requests that
      have been queued. If that happens when blk_mq_try_issue_directly() is called
      by the dm-mpath driver then dm-mpath will try to resubmit a request that is
      already queued and a kernel crash follows. Since it is nontrivial to fix
      blk_mq_request_issue_directly(), revert the blk_mq_request_issue_directly()
      changes that went into kernel v5.0.
      
      This patch reverts the following commits:
      * d6a51a97 ("blk-mq: replace and kill blk_mq_request_issue_directly") # v5.0.
      * 5b7a6f12 ("blk-mq: issue directly with bypass 'false' in blk_mq_sched_insert_requests") # v5.0.
      * 7f556a44 ("blk-mq: refactor the code of issue request directly") # v5.0.
      
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Johannes Thumshirn <jthumshirn@suse.de>
      Cc: James Smart <james.smart@broadcom.com>
      Cc: Dongli Zhang <dongli.zhang@oracle.com>
      Cc: Laurence Oberman <loberman@redhat.com>
      Cc: <stable@vger.kernel.org>
      Reported-by: NLaurence Oberman <loberman@redhat.com>
      Tested-by: NLaurence Oberman <loberman@redhat.com>
      Fixes: 7f556a44 ("blk-mq: refactor the code of issue request directly") # v5.0.
      Signed-off-by: NBart Van Assche <bvanassche@acm.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      fd9c40f6
  8. 25 3月, 2019 1 次提交
  9. 21 3月, 2019 1 次提交
  10. 09 2月, 2019 1 次提交
  11. 01 2月, 2019 2 次提交
  12. 18 12月, 2018 1 次提交
    • M
      blk-mq: fix dispatch from sw queue · c16d6b5a
      Ming Lei 提交于
      When a request is added to rq list of sw queue(ctx), the rq may be from
      a different type of hctx, especially after multi queue mapping is
      introduced.
      
      So when dispach request from sw queue via blk_mq_flush_busy_ctxs() or
      blk_mq_dequeue_from_ctx(), one request belonging to other queue type of
      hctx can be dispatched to current hctx in case that read queue or poll
      queue is enabled.
      
      This patch fixes this issue by introducing per-queue-type list.
      
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      
      Changed by me to not use separately cacheline aligned lists, just
      place them all in the same cacheline where we had just the one list
      and lock before.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      c16d6b5a
  13. 17 12月, 2018 1 次提交
  14. 16 12月, 2018 1 次提交
  15. 10 12月, 2018 1 次提交
  16. 05 12月, 2018 1 次提交
  17. 30 11月, 2018 1 次提交
  18. 21 11月, 2018 1 次提交
  19. 08 11月, 2018 7 次提交
  20. 18 7月, 2018 1 次提交
    • M
      blk-mq: issue directly if hw queue isn't busy in case of 'none' · 6ce3dd6e
      Ming Lei 提交于
      In case of 'none' io scheduler, when hw queue isn't busy, it isn't
      necessary to enqueue request to sw queue and dequeue it from
      sw queue because request may be submitted to hw queue asap without
      extra cost, meantime there shouldn't be much request in sw queue,
      and we don't need to worry about effect on IO merge.
      
      There are still some single hw queue SCSI HBAs(HPSA, megaraid_sas, ...)
      which may connect high performance devices, so 'none' is often required
      for obtaining good performance.
      
      This patch improves IOPS and decreases CPU unilization on megaraid_sas,
      per Kashyap's test.
      
      Cc: Kashyap Desai <kashyap.desai@broadcom.com>
      Cc: Laurence Oberman <loberman@redhat.com>
      Cc: Omar Sandoval <osandov@fb.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: Hannes Reinecke <hare@suse.de>
      Reported-by: NKashyap Desai <kashyap.desai@broadcom.com>
      Tested-by: NKashyap Desai <kashyap.desai@broadcom.com>
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      6ce3dd6e
  21. 09 7月, 2018 2 次提交
  22. 29 5月, 2018 1 次提交
    • K
      blk-mq: Remove generation seqeunce · 12f5b931
      Keith Busch 提交于
      This patch simplifies the timeout handling by relying on the request
      reference counting to ensure the iterator is operating on an inflight
      and truly timed out request. Since the reference counting prevents the
      tag from being reallocated, the block layer no longer needs to prevent
      drivers from completing their requests while the timeout handler is
      operating on it: a driver completing a request is allowed to proceed to
      the next state without additional syncronization with the block layer.
      
      This also removes any need for generation sequence numbers since the
      request lifetime is prevented from being reallocated as a new sequence
      while timeout handling is operating on it.
      
      To enables this a refcount is added to struct request so that request
      users can be sure they're operating on the same request without it
      changing while they're processing it.  The request's tag won't be
      released for reuse until both the timeout handler and the completion
      are done with it.
      Signed-off-by: NKeith Busch <keith.busch@intel.com>
      [hch: slight cleanups, added back submission side hctx lock, use cmpxchg
       for completions]
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      12f5b931
  23. 26 4月, 2018 1 次提交
  24. 25 4月, 2018 1 次提交
  25. 11 4月, 2018 1 次提交
    • M
      blk-mq: Revert "blk-mq: reimplement blk_mq_hw_queue_mapped" · 2434af79
      Ming Lei 提交于
      This reverts commit 127276c6.
      
      When all CPUs of one hw queue become offline, there still may have IOs
      not completed from this hctx. But blk_mq_hw_queue_mapped() is called in
      blk_mq_queue_tag_busy_iter(), which is used for iterating request in timeout
      handler, timeout event will be missed on the inactive hctx, then request may
      never be completed.
      
      Also the replementation of blk_mq_hw_queue_mapped() doesn't match the helper's
      name any more, and it should have been named as blk_mq_hw_queue_active().
      
      Even other callers need further verification about this reimplemenation.
      
      So revert this patch now, and we can improve hw queue activate/inactivate event
      after adequent researching and test.
      
      Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Reported-by: NJens Axboe <axboe@kernel.dk>
      Fixes: 	127276c6 ("blk-mq: reimplement blk_mq_hw_queue_mapped")
      Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      2434af79
  26. 10 4月, 2018 1 次提交
  27. 20 1月, 2018 1 次提交
  28. 18 1月, 2018 1 次提交
    • M
      blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback · 396eaf21
      Ming Lei 提交于
      blk_insert_cloned_request() is called in the fast path of a dm-rq driver
      (e.g. blk-mq request-based DM mpath).  blk_insert_cloned_request() uses
      blk_mq_request_bypass_insert() to directly append the request to the
      blk-mq hctx->dispatch_list of the underlying queue.
      
      1) This way isn't efficient enough because the hctx spinlock is always
      used.
      
      2) With blk_insert_cloned_request(), we completely bypass underlying
      queue's elevator and depend on the upper-level dm-rq driver's elevator
      to schedule IO.  But dm-rq currently can't get the underlying queue's
      dispatch feedback at all.  Without knowing whether a request was issued
      or not (e.g. due to underlying queue being busy) the dm-rq elevator will
      not be able to provide effective IO merging (as a side-effect of dm-rq
      currently blindly destaging a request from its elevator only to requeue
      it after a delay, which kills any opportunity for merging).  This
      obviously causes very bad sequential IO performance.
      
      Fix this by updating blk_insert_cloned_request() to use
      blk_mq_request_direct_issue().  blk_mq_request_direct_issue() allows a
      request to be issued directly to the underlying queue and returns the
      dispatch feedback (blk_status_t).  If blk_mq_request_direct_issue()
      returns BLK_SYS_RESOURCE the dm-rq driver will now use DM_MAPIO_REQUEUE
      to _not_ destage the request.  Whereby preserving the opportunity to
      merge IO.
      
      With this, request-based DM's blk-mq sequential IO performance is vastly
      improved (as much as 3X in mpath/virtio-scsi testing).
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      [blk-mq.c changes heavily influenced by Ming Lei's initial solution, but
      they were refactored to make them less fragile and easier to read/review]
      Signed-off-by: NMike Snitzer <snitzer@redhat.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      396eaf21
  29. 10 1月, 2018 3 次提交
    • T
      blk-mq: remove REQ_ATOM_STARTED · 5a61c363
      Tejun Heo 提交于
      After the recent updates to use generation number and state based
      synchronization, we can easily replace REQ_ATOM_STARTED usages by
      adding an extra state to distinguish completed but not yet freed
      state.
      
      Add MQ_RQ_COMPLETE and replace REQ_ATOM_STARTED usages with
      blk_mq_rq_state() tests.  REQ_ATOM_STARTED no longer has any users
      left and is removed.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      5a61c363
    • T
      blk-mq: make blk_abort_request() trigger timeout path · 358f70da
      Tejun Heo 提交于
      With issue/complete and timeout paths now using the generation number
      and state based synchronization, blk_abort_request() is the only one
      which depends on REQ_ATOM_COMPLETE for arbitrating completion.
      
      There's no reason for blk_abort_request() to be a completely separate
      path.  This patch makes blk_abort_request() piggyback on the timeout
      path instead of trying to terminate the request directly.
      
      This removes the last dependency on REQ_ATOM_COMPLETE in blk-mq.
      
      Note that this makes blk_abort_request() asynchronous - it initiates
      abortion but the actual termination will happen after a short while,
      even when the caller owns the request.  AFAICS, SCSI and ATA should be
      fine with that and I think mtip32xx and dasd should be safe but not
      completely sure.  It'd be great if people who know the drivers take a
      look.
      
      v2: - Add comment explaining the lack of synchronization around
            ->deadline update as requested by Bart.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Asai Thambi SP <asamymuthupa@micron.com>
      Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
      Cc: Jan Hoeppner <hoeppner@linux.vnet.ibm.com>
      Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      358f70da
    • T
      blk-mq: replace timeout synchronization with a RCU and generation based scheme · 1d9bd516
      Tejun Heo 提交于
      Currently, blk-mq timeout path synchronizes against the usual
      issue/completion path using a complex scheme involving atomic
      bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
      rules.  Unfortunately, it contains quite a few holes.
      
      There's a complex dancing around REQ_ATOM_STARTED and
      REQ_ATOM_COMPLETE between issue/completion and timeout paths; however,
      they don't have a synchronization point across request recycle
      instances and it isn't clear what the barriers add.
      blk_mq_check_expired() can easily read STARTED from N-2'th iteration,
      deadline from N-1'th, blk_mark_rq_complete() against Nth instance.
      
      In fact, it's pretty easy to make blk_mq_check_expired() terminate a
      later instance of a request.  If we induce 5 sec delay before
      time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
      2s, and issue back-to-back large IOs, blk-mq starts timing out
      requests spuriously pretty quickly.  Nothing actually timed out.  It
      just made the call on a recycle instance of a request and then
      terminated a later instance long after the original instance finished.
      The scenario isn't theoretical either.
      
      This patch replaces the broken synchronization mechanism with a RCU
      and generation number based one.
      
      1. Each request has a u64 generation + state value, which can be
         updated only by the request owner.  Whenever a request becomes
         in-flight, the generation number gets bumped up too.  This provides
         the basis for the timeout path to distinguish different recycle
         instances of the request.
      
         Also, marking a request in-flight and setting its deadline are
         protected with a seqcount so that the timeout path can fetch both
         values coherently.
      
      2. The timeout path fetches the generation, state and deadline.  If
         the verdict is timeout, it records the generation into a dedicated
         request abortion field and does RCU wait.
      
      3. The completion path is also protected by RCU (from the previous
         patch) and checks whether the current generation number and state
         match the abortion field.  If so, it skips completion.
      
      4. The timeout path, after RCU wait, scans requests again and
         terminates the ones whose generation and state still match the ones
         requested for abortion.
      
         By now, the timeout path knows that either the generation number
         and state changed if it lost the race or the completion will yield
         to it and can safely timeout the request.
      
      While it's more lines of code, it's conceptually simpler, doesn't
      depend on direct use of subtle memory ordering or coherence, and
      hopefully doesn't terminate the wrong instance.
      
      While this change makes REQ_ATOM_COMPLETE synchronization unnecessary
      between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't
      removed yet as it's still used in other places.  Future patches will
      move all state tracking to the new mechanism and remove all bitops in
      the hot paths.
      
      Note that this patch adds a comment explaining a race condition in
      BLK_EH_RESET_TIMER path.  The race has always been there and this
      patch doesn't change it.  It's just documenting the existing race.
      
      v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao.
          - s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter.
          - READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter.
      
      v3: - Fixed possible extended seqcount / u64_stats_sync read looping
            spotted by Peter.
          - MQ_RQ_IDLE was incorrectly being set in complete_request instead
            of free_request.  Fixed.
      
      v4: - Rebased on top of hctx_lock() refactoring patch.
          - Added comment explaining the use of hctx_lock() in completion path.
      
      v5: - Added comments requested by Bart.
          - Note the addition of BLK_EH_RESET_TIMER race condition in the
            commit message.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      1d9bd516
  30. 11 11月, 2017 1 次提交
    • J
      blk-mq: only run the hardware queue if IO is pending · 79f720a7
      Jens Axboe 提交于
      Currently we are inconsistent in when we decide to run the queue. Using
      blk_mq_run_hw_queues() we check if the hctx has pending IO before
      running it, but we don't do that from the individual queue run function,
      blk_mq_run_hw_queue(). This results in a lot of extra and pointless
      queue runs, potentially, on flush requests and (much worse) on tag
      starvation situations. This is observable just looking at top output,
      with lots of kworkers active. For the !async runs, it just adds to the
      CPU overhead of blk-mq.
      
      Move the has-pending check into the run function instead of having
      callers do it.
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      79f720a7