1. 28 Aug 2019, 1 commit
  2. 05 Aug 2019, 4 commits
    • blk-mq: add callback of .cleanup_rq · 226b4fc7
      Committed by Ming Lei
      SCSI maintains its own driver private data hooked off of each SCSI
      request, and the private data won't be freed after scsi_queue_rq()
      returns BLK_STS_RESOURCE or BLK_STS_DEV_RESOURCE. An upper layer driver
      (e.g. dm-rq) may need to retry these SCSI requests, before SCSI has
      fully dispatched them, due to a lower level SCSI driver's resource
      limitation identified in scsi_queue_rq(). Currently SCSI's per-request
      private data is leaked when the upper layer driver (dm-rq) frees and
      then retries these requests in response to BLK_STS_RESOURCE or
      BLK_STS_DEV_RESOURCE returns from scsi_queue_rq().
      
      This usecase is so specialized that it doesn't warrant training an
      existing blk-mq interface (e.g. blk_mq_free_request) to allow SCSI to
      account for freeing its driver private data -- doing so would add an
      extra branch for handling a special case that all other consumers of
      SCSI (and blk-mq) won't ever need to worry about.
      
      So the most pragmatic way forward is to delegate freeing SCSI driver
      private data to the upper layer driver (dm-rq).  Do so by adding a
      new .cleanup_rq callback and calling a new blk_mq_cleanup_rq() method
      from dm-rq.  A following commit will implement the .cleanup_rq() hook
      in scsi_mq_ops.
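      
      As a rough sketch of the shape of this interface: the inline helper below
      mirrors the description above, while the dm-rq-style caller is purely
      hypothetical and only illustrates where an upper layer driver would invoke it.
      
          /* blk-mq side: run the driver's optional cleanup hook, if it has one */
          static inline void blk_mq_cleanup_rq(struct request *rq)
          {
              if (rq->q->mq_ops->cleanup_rq)
                  rq->q->mq_ops->cleanup_rq(rq);
          }
          
          /* hypothetical dm-rq-style caller: let the lower driver (SCSI) free its
           * per-request private data before the clone is released for a retry */
          static void dm_release_clone_for_retry(struct request *clone)
          {
              blk_mq_cleanup_rq(clone);
              blk_put_request(clone);
          }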
      
      Cc: Ewan D. Milne <emilne@redhat.com>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Hannes Reinecke <hare@suse.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Cc: dm-devel@redhat.com
      Cc: <stable@vger.kernel.org>
      Fixes: 396eaf21 ("blk-mq: improve DM's blk-mq IO merging via blk_insert_cloned_request feedback")
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-mq: remove blk_mq_complete_request_sync · a87ccce0
      Committed by Ming Lei
      blk_mq_tagset_wait_completed_request() is now used to wait for a
      completed request's ->complete() fn to finish, so it is no longer
      necessary to use blk_mq_complete_request_sync().
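      
      A sketch of the resulting teardown pattern; the helper name below and the
      exact call sites are assumptions, not part of this patch:
      
          /* hypothetical teardown helper: abort in-flight requests, then wait
           * until every aborted request's ->complete() handler has finished */
          static void cancel_and_wait(struct blk_mq_tag_set *set,
                                      busy_tag_iter_fn *cancel_fn, void *priv)
          {
              blk_mq_tagset_busy_iter(set, cancel_fn, priv);
              blk_mq_tagset_wait_completed_request(set);
          }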
      
      Cc: Max Gurtovoy <maxg@mellanox.com>
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-mq: introduce blk_mq_tagset_wait_completed_request() · f9934a80
      Committed by Ming Lei
      blk-mq may schedule to call a queue's complete function on a remote CPU
      via IPI, but it doesn't provide any way to synchronize against a request's
      complete fn. The current queue freeze interface can't provide that
      synchronization because aborted requests stay in the blk-mq queues during EH.
      
      In some drivers' EH (such as NVMe's), the hardware queue's resources may be
      freed and re-allocated. If a completed request's complete fn ends up running
      after the hardware queue's resources have been released, a kernel crash is
      triggered.
      
      Prepare for fixing this kind of issue by introducing
      blk_mq_tagset_wait_completed_request().
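      
      A rough sketch of how such a wait can be built on top of the existing tagset
      iterator; this is close to, but not necessarily identical to, the actual
      implementation:
      
          static bool blk_mq_count_completed_rq(struct request *rq, void *data,
                                                bool reserved)
          {
              unsigned int *count = data;
          
              if (blk_mq_request_completed(rq))
                  (*count)++;
              return true;
          }
          
          void blk_mq_tagset_wait_completed_request(struct blk_mq_tag_set *tagset)
          {
              /* poll until no busy request is still sitting in the completed
               * state, i.e. every ->complete() fn has finished running */
              while (true) {
                  unsigned int count = 0;
          
                  blk_mq_tagset_busy_iter(tagset, blk_mq_count_completed_rq, &count);
                  if (!count)
                      break;
                  msleep(5);
              }
          }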
      
      Cc: Max Gurtovoy <maxg@mellanox.com>
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-mq: introduce blk_mq_request_completed() · aa306ab7
      Committed by Ming Lei
      NVMe needs this function to decide whether a request that is about to be
      aborted has already been completed in the normal IO path.
      
      So introduce it.
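      
      The helper is presumably little more than a check of the request's blk-mq
      state; a minimal sketch:
      
          /*
           * True if the request already went through the normal completion path
           * (its blk-mq state is MQ_RQ_COMPLETE), so an abort/cancel path can
           * skip it instead of completing it a second time.
           */
          bool blk_mq_request_completed(struct request *rq)
          {
              return blk_mq_rq_state(rq) == MQ_RQ_COMPLETE;
          }
          EXPORT_SYMBOL_GPL(blk_mq_request_completed);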
      
      Cc: Max Gurtovoy <maxg@mellanox.com>
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  3. 21 Jun 2019, 1 commit
    • block: remove the bi_phys_segments field in struct bio · 14ccb66b
      Committed by Christoph Hellwig
      We only need the number of segments in the blk-mq submission path.
      Remove the field from struct bio, and return it from a variant of
      blk_queue_split instead, so that it can be passed as an argument to
      those functions that need the value.
      
      This also means we stop recounting segments except for cloning
      and partial segments.
      
      To keep the number of arguments in this hot path down, remove the
      pointless struct request_queue argument from any of the functions
      that had it and grew a nr_segs argument.
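      
      The submission path then looks roughly as follows. This is a sketch of the
      resulting interface, not the literal diff, and the function name is
      illustrative only:
      
          static blk_qc_t submit_bio_sketch(struct request_queue *q, struct bio *bio)
          {
              unsigned int nr_segs;
          
              /* split if needed and get the segment count back, instead of
               * stashing it in the now-removed bio->bi_phys_segments */
              __blk_queue_split(q, &bio, &nr_segs);
          
              /* nr_segs is handed down to whoever needs it, so segments are
               * not recounted along the way */
              if (blk_mq_sched_bio_merge(q, bio, nr_segs))
                  return BLK_QC_T_NONE;
          
              /* ... allocate a request and map the bio using nr_segs ... */
              return BLK_QC_T_NONE;
          }
      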
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  4. 04 May 2019, 1 commit
    • blk-mq: always free hctx after request queue is freed · 2f8f1336
      Committed by Ming Lei
      In the normal queue cleanup path, the hctx is released after the request
      queue is freed, see blk_mq_release().
      
      However, in __blk_mq_update_nr_hw_queues(), an hctx may be freed because
      the number of hw queues is shrinking. This easily causes a use-after-free:
      one implicit rule is that it is safe to call almost all block layer APIs
      as long as the request queue is alive; an hctx may be retrieved by one such
      API and then freed by blk_mq_update_nr_hw_queues(), at which point the
      use-after-free is triggered.
      
      Fix this issue by always freeing the hctx after the request queue is
      released. If some hctxs are removed in blk_mq_update_nr_hw_queues(),
      put them on a new per-queue list and try to reuse them when the numa
      node is matched.
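      
      The reuse might look roughly like this; the list and lock names here follow
      the description above and should be treated as assumptions:
      
          /* prefer an hctx that was shrunk away earlier and sits on the same
           * NUMA node, instead of allocating a brand-new one */
          static struct blk_mq_hw_ctx *blk_mq_get_unused_hctx(struct request_queue *q,
                                                              int node)
          {
              struct blk_mq_hw_ctx *hctx = NULL, *tmp;
          
              spin_lock(&q->unused_hctx_lock);
              list_for_each_entry(tmp, &q->unused_hctx_list, hctx_list) {
                  if (tmp->numa_node == node) {
                      hctx = tmp;
                      break;
                  }
              }
              if (hctx)
                  list_del_init(&hctx->hctx_list);
              spin_unlock(&q->unused_hctx_lock);
          
              return hctx;    /* NULL means: allocate a fresh hctx */
          }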
      
      Cc: Dongli Zhang <dongli.zhang@oracle.com>
      Cc: James Smart <james.smart@broadcom.com>
      Cc: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: linux-scsi@vger.kernel.org
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: James E.J. Bottomley <jejb@linux.vnet.ibm.com>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Tested-by: James Smart <james.smart@broadcom.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  5. 10 Apr 2019, 1 commit
    • blk-mq: introduce blk_mq_complete_request_sync() · 1b8f21b7
      Committed by Ming Lei
      NVMe's error handler follows the typical steps of tearing down the
      hardware to recover the controller:
      
      1) stop blk_mq hw queues
      2) stop the real hw queues
      3) cancel in-flight requests via
      	blk_mq_tagset_busy_iter(tags, cancel_request, ...)
      cancel_request():
      	mark the request as aborted
      	blk_mq_complete_request(req);
      4) destroy real hw queues
      
      However, there may be a race between #3 and #4, because blk_mq_complete_request()
      may run q->mq_ops->complete(rq) remotely and asynchronously, and
      ->complete(rq) may run after #4.
      
      This patch introduces blk_mq_complete_request_sync() for fixing the
      above race.
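      
      A sketch of the cancel_request() side with the new helper; the callback
      below is illustrative, and how the abort is recorded is driver specific:
      
          /* invoked for each in-flight request by blk_mq_tagset_busy_iter() */
          static bool cancel_request(struct request *req, void *data, bool reserved)
          {
              /* driver-specific: record that this request was aborted */
          
              /*
               * Complete synchronously: ->complete(rq) runs on this CPU before
               * we return, so it cannot run after the real hw queues have been
               * destroyed in step #4.
               */
              blk_mq_complete_request_sync(req);
              return true;
          }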
      
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: James Smart <james.smart@broadcom.com>
      Cc: linux-nvme@lists.infradead.org
      Reviewed-by: Keith Busch <keith.busch@intel.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  6. 21 Mar 2019, 1 commit
  7. 19 Mar 2019, 1 commit
  8. 15 Feb 2019, 1 commit
  9. 19 Dec 2018, 1 commit
  10. 18 Dec 2018, 1 commit
  11. 05 Dec 2018, 1 commit
  12. 30 Nov 2018, 1 commit
    • blk-mq: add mq_ops->commit_rqs() · d666ba98
      Committed by Jens Axboe
      blk-mq passes information to the hardware about any given request being
      the last that we will issue in this sequence. The point is that hardware
      can defer costly doorbell type writes to the last request. But if we run
      into errors issuing a sequence of requests, we may never send the request
      with bd->last == true set. For that case, we need a hook that tells the
      hardware that nothing else is coming right now.
      
      For failures returned by the driver's ->queue_rq() hook, the driver is
      responsible for flushing pending requests, if it uses bd->last to
      optimize that part. This works like before, no changes there.
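      
      A sketch of the dispatch-side usage; the wrapper below is illustrative, the
      real call sits inside the dispatch path itself:
      
          /* after ->queue_rq() failed mid-sequence, the bd->last request may never
           * be issued, so ask the driver to commit what it already queued */
          static void blk_mq_commit_rqs_sketch(struct blk_mq_hw_ctx *hctx)
          {
              struct request_queue *q = hctx->queue;
          
              if (q->mq_ops->commit_rqs)
                  q->mq_ops->commit_rqs(hctx);    /* e.g. ring the doorbell once */
          }
      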
      Reviewed-by: Omar Sandoval <osandov@fb.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  13. 27 Nov 2018, 2 commits
  14. 26 Nov 2018, 1 commit
  15. 09 Nov 2018, 2 commits
  16. 08 Nov 2018, 8 commits
  17. 16 Oct 2018, 1 commit
  18. 25 Jul 2018, 1 commit
  19. 09 Jul 2018, 2 commits
  20. 14 Jun 2018, 1 commit
  21. 31 May 2018, 1 commit
  22. 25 Apr 2018, 1 commit
  23. 10 Apr 2018, 1 commit
  24. 10 Jan 2018, 2 commits
    • blk-mq: rename blk_mq_hw_ctx->queue_rq_srcu to ->srcu · 05707b64
      Committed by Tejun Heo
      The RCU protection has been expanded to cover both queueing and
      completion paths making ->queue_rq_srcu a misnomer.  Rename it to
      ->srcu as suggested by Bart.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-mq: replace timeout synchronization with a RCU and generation based scheme · 1d9bd516
      Committed by Tejun Heo
      Currently, blk-mq timeout path synchronizes against the usual
      issue/completion path using a complex scheme involving atomic
      bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence
      rules.  Unfortunately, it contains quite a few holes.
      
      There's a complex dancing around REQ_ATOM_STARTED and
      REQ_ATOM_COMPLETE between issue/completion and timeout paths; however,
      they don't have a synchronization point across request recycle
      instances and it isn't clear what the barriers add.
      blk_mq_check_expired() can easily read STARTED from N-2'th iteration,
      deadline from N-1'th, blk_mark_rq_complete() against Nth instance.
      
      In fact, it's pretty easy to make blk_mq_check_expired() terminate a
      later instance of a request.  If we induce 5 sec delay before
      time_after_eq() test in blk_mq_check_expired(), shorten the timeout to
      2s, and issue back-to-back large IOs, blk-mq starts timing out
      requests spuriously pretty quickly.  Nothing actually timed out.  It
      just made the call on a recycle instance of a request and then
      terminated a later instance long after the original instance finished.
      The scenario isn't theoretical either.
      
      This patch replaces the broken synchronization mechanism with a RCU
      and generation number based one.
      
      1. Each request has a u64 generation + state value, which can be
         updated only by the request owner.  Whenever a request becomes
         in-flight, the generation number gets bumped up too.  This provides
         the basis for the timeout path to distinguish different recycle
         instances of the request.
      
         Also, marking a request in-flight and setting its deadline are
         protected with a seqcount so that the timeout path can fetch both
         values coherently.
      
      2. The timeout path fetches the generation, state and deadline.  If
         the verdict is timeout, it records the generation into a dedicated
         request abortion field and does RCU wait.
      
      3. The completion path is also protected by RCU (from the previous
         patch) and checks whether the current generation number and state
         match the abortion field.  If so, it skips completion.
      
      4. The timeout path, after RCU wait, scans requests again and
         terminates the ones whose generation and state still match the ones
         requested for abortion.
      
         By now, the timeout path knows that either the generation number
         and state changed if it lost the race or the completion will yield
         to it and can safely timeout the request.
      
      While it's more lines of code, it's conceptually simpler, doesn't
      depend on direct use of subtle memory ordering or coherence, and
      hopefully doesn't terminate the wrong instance.
      
      While this change makes REQ_ATOM_COMPLETE synchronization unnecessary
      between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't
      removed yet as it's still used in other places.  Future patches will
      move all state tracking to the new mechanism and remove all bitops in
      the hot paths.
      
      Note that this patch adds a comment explaining a race condition in
      BLK_EH_RESET_TIMER path.  The race has always been there and this
      patch doesn't change it.  It's just documenting the existing race.
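      
      A sketch of the generation + state packing from point 1 above; the field and
      constant names follow this description and the details are approximate:
      
          enum mq_rq_state {
              MQ_RQ_IDLE      = 0,
              MQ_RQ_IN_FLIGHT = 1,
              MQ_RQ_COMPLETE  = 2,
          };
          
          /* the low bits of rq->gstate hold the state, the remaining bits a
           * generation counter bumped on every transition to MQ_RQ_IN_FLIGHT */
          #define MQ_RQ_STATE_BITS  2
          #define MQ_RQ_STATE_MASK  ((1 << MQ_RQ_STATE_BITS) - 1)
          #define MQ_RQ_GEN_INC     (1 << MQ_RQ_STATE_BITS)
          
          static inline enum mq_rq_state blk_mq_rq_state(struct request *rq)
          {
              return READ_ONCE(rq->gstate) & MQ_RQ_STATE_MASK;
          }
          
          static inline void blk_mq_rq_update_state(struct request *rq,
                                                    enum mq_rq_state state)
          {
              u64 old_val = READ_ONCE(rq->gstate);
              u64 new_val = (old_val & ~(u64)MQ_RQ_STATE_MASK) | state;
          
              if (state == MQ_RQ_IN_FLIGHT)
                  new_val += MQ_RQ_GEN_INC;    /* new recycle instance */
          
              /* only the request owner updates gstate, so a plain store suffices */
              WRITE_ONCE(rq->gstate, new_val);
          }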
      
      v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao.
          - s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter.
          - READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter.
      
      v3: - Fixed possible extended seqcount / u64_stats_sync read looping
            spotted by Peter.
          - MQ_RQ_IDLE was incorrectly being set in complete_request instead
            of free_request.  Fixed.
      
      v4: - Rebased on top of hctx_lock() refactoring patch.
          - Added comment explaining the use of hctx_lock() in completion path.
      
      v5: - Added comments requested by Bart.
          - Note the addition of BLK_EH_RESET_TIMER race condition in the
            commit message.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  25. 11 Nov 2017, 2 commits
    • blk-mq: only run the hardware queue if IO is pending · 79f720a7
      Committed by Jens Axboe
      Currently we are inconsistent in when we decide to run the queue. Using
      blk_mq_run_hw_queues() we check if the hctx has pending IO before
      running it, but we don't do that from the individual queue run function,
      blk_mq_run_hw_queue(). This results in a lot of extra and pointless
      queue runs, potentially, on flush requests and (much worse) on tag
      starvation situations. This is observable just looking at top output,
      with lots of kworkers active. For the !async runs, it just adds to the
      CPU overhead of blk-mq.
      
      Move the has-pending check into the run function instead of having
      callers do it.
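      
      After the change, the single-queue run looks roughly like this (a simplified
      sketch; quiesce handling and the delayed-run plumbing are omitted):
      
          bool blk_mq_run_hw_queue(struct blk_mq_hw_ctx *hctx, bool async)
          {
              /* the same has-pending check blk_mq_run_hw_queues() already did */
              if (!blk_mq_hctx_has_pending(hctx))
                  return false;
          
              __blk_mq_delay_run_hw_queue(hctx, async, 0);
              return true;
          }
      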
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, nvme: Introduce blk_mq_req_flags_t · 9a95e4ef
      Committed by Bart Van Assche
      Several block layer and NVMe core functions accept a combination
      of BLK_MQ_REQ_* flags through the 'flags' argument but there is
      no verification at compile time whether the right type of block
      layer flags is passed. Make it possible for sparse to verify this.
      This patch does not change any functionality.
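      
      Concretely, the flag type becomes a __bitwise typedef that sparse can check,
      roughly along these lines (a sketch, not the full diff):
      
          typedef __u32 __bitwise blk_mq_req_flags_t;
          
          /* the existing BLK_MQ_REQ_* flags become values of that type, e.g.: */
          #define BLK_MQ_REQ_NOWAIT   ((__force blk_mq_req_flags_t)(1 << 0))
          #define BLK_MQ_REQ_RESERVED ((__force blk_mq_req_flags_t)(1 << 1))
          
          /* allocators take the typed argument, so passing a plain unsigned int
           * flag word now triggers a sparse warning */
          struct request *blk_mq_alloc_request(struct request_queue *q,
                                               unsigned int op,
                                               blk_mq_req_flags_t flags);
      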
      Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
      Reviewed-by: Hannes Reinecke <hare@suse.com>
      Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Cc: linux-nvme@lists.infradead.org
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Johannes Thumshirn <jthumshirn@suse.de>
      Cc: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>