1. 10 Oct 2020, 1 commit
  2. 07 Oct 2020, 1 commit
      block: Consider only dispatched requests for inflight statistic · a926c7af
      Authored by Gabriel Krisman Bertazi
      According to Documentation/block/stat.rst, inflight should not include
      I/O requests that are in the queue but not yet dispatched to the device.
      However, blk-mq identifies as inflight any request that has a tag
      allocated, which, for queues without an elevator, happens at request
      allocation time and before it is queued in the ctx (the default case in
      blk_mq_submit_bio).
      
      In addition, the current behavior differs between queues with an
      elevator and queues without one, since for the former the driver tag is
      allocated at dispatch time.  A more precise approach would be to only
      consider requests with state MQ_RQ_IN_FLIGHT.
      
      This effectively reverts commit 6131837b ("blk-mq: count allocated
      but not started requests in iostats inflight") to make blk-mq behavior
      consistent with itself (the elevator case) and with the original
      documentation, but it differs from the behavior used by the legacy path.
      
      This version differs from v1 by using blk_mq_rq_state() to access the
      state attribute.  We avoid using blk_mq_request_started(), which was
      suggested, since we don't want to include MQ_RQ_COMPLETE.
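      As a rough illustration (a sketch only, not the actual hunk; the helper
      name below is hypothetical), the accounting now tests the blk-mq state
      directly:

              /*
               * Count a request as inflight only once it has actually been
               * dispatched, i.e. its state is MQ_RQ_IN_FLIGHT.  MQ_RQ_COMPLETE
               * is deliberately excluded, which is why blk_mq_request_started()
               * is not used here.
               */
              static bool rq_counts_as_inflight(struct request *rq)
              {
                      return blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT;
              }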
      Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
      Cc: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  3. 06 Oct 2020, 1 commit
  4. 28 Sep 2020, 1 commit
      blk-mq: add cond_resched() in __blk_mq_alloc_rq_maps() · 8229cca8
      Authored by Xianting Tian
      We found that blk_mq_alloc_rq_maps() takes a long time in kernel space
      when testing nvme device hot-plugging. The test and analysis are below.
      
      Debug code,
      1, blk_mq_alloc_rq_maps():
              u64 start, end;
              depth = set->queue_depth;
              start = ktime_get_ns();
              pr_err("[%d:%s switch:%ld,%ld] queue depth %d, nr_hw_queues %d\n",
                              current->pid, current->comm, current->nvcsw, current->nivcsw,
                              set->queue_depth, set->nr_hw_queues);
              do {
                      err = __blk_mq_alloc_rq_maps(set);
                      if (!err)
                              break;
      
                      set->queue_depth >>= 1;
                      if (set->queue_depth < set->reserved_tags + BLK_MQ_TAG_MIN) {
                              err = -ENOMEM;
                              break;
                      }
              } while (set->queue_depth);
              end = ktime_get_ns();
              pr_err("[%d:%s switch:%ld,%ld] all hw queues init cost time %lld ns\n",
                              current->pid, current->comm,
                              current->nvcsw, current->nivcsw, end - start);
      
      2, __blk_mq_alloc_rq_maps():
              u64 start, end;
              for (i = 0; i < set->nr_hw_queues; i++) {
                      start = ktime_get_ns();
                      if (!__blk_mq_alloc_rq_map(set, i))
                              goto out_unwind;
                      end = ktime_get_ns();
                      pr_err("hw queue %d init cost time %lld ns\n", i, end - start);
              }
      
      Testing nvme hot-plugging with the above debug code, we found it costs
      more than 3ms in kernel space without being scheduled out when
      allocating rqs for all 16 hw queues with depth 1023; each hw queue costs
      about 140-250us. The time consumed grows with the number of hw queues
      and the queue depth. In an extreme case, if __blk_mq_alloc_rq_maps()
      returns -ENOMEM, it retries with "queue_depth >>= 1" and even more time
      is consumed.
      	[  428.428771] nvme nvme0: pci function 10000:01:00.0
      	[  428.428798] nvme 10000:01:00.0: enabling device (0000 -> 0002)
      	[  428.428806] pcieport 10000:00:00.0: can't derive routing for PCI INT A
      	[  428.428809] nvme 10000:01:00.0: PCI INT A: no GSI
      	[  432.593374] [4688:kworker/u33:8 switch:663,2] queue depth 30, nr_hw_queues 1
      	[  432.593404] hw queue 0 init cost time 22883 ns
      	[  432.593408] [4688:kworker/u33:8 switch:663,2] all hw queues init cost time 35960 ns
      	[  432.595953] nvme nvme0: 16/0/0 default/read/poll queues
      	[  432.595958] [4688:kworker/u33:8 switch:700,2] queue depth 1023, nr_hw_queues 16
      	[  432.596203] hw queue 0 init cost time 242630 ns
      	[  432.596441] hw queue 1 init cost time 235913 ns
      	[  432.596659] hw queue 2 init cost time 216461 ns
      	[  432.596877] hw queue 3 init cost time 215851 ns
      	[  432.597107] hw queue 4 init cost time 228406 ns
      	[  432.597336] hw queue 5 init cost time 227298 ns
      	[  432.597564] hw queue 6 init cost time 224633 ns
      	[  432.597785] hw queue 7 init cost time 219954 ns
      	[  432.597937] hw queue 8 init cost time 150930 ns
      	[  432.598082] hw queue 9 init cost time 143496 ns
      	[  432.598231] hw queue 10 init cost time 147261 ns
      	[  432.598397] hw queue 11 init cost time 164522 ns
      	[  432.598542] hw queue 12 init cost time 143401 ns
      	[  432.598692] hw queue 13 init cost time 148934 ns
      	[  432.598841] hw queue 14 init cost time 147194 ns
      	[  432.598991] hw queue 15 init cost time 148942 ns
      	[  432.598993] [4688:kworker/u33:8 switch:700,2] all hw queues init cost time 3035099 ns
      	[  432.602611]  nvme0n1: p1
      
      So use this patch to trigger a reschedule between each hw queue init,
      to avoid starving other threads. __blk_mq_alloc_rq_maps() is not
      executed in atomic context, so it is safe to call cond_resched().
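      A minimal sketch of the loop with the added reschedule point (assuming
      the shape of __blk_mq_alloc_rq_maps() at the time; not the verbatim
      hunk):

              static int __blk_mq_alloc_rq_maps(struct blk_mq_tag_set *set)
              {
                      int i;

                      for (i = 0; i < set->nr_hw_queues; i++) {
                              if (!__blk_mq_alloc_rq_map(set, i))
                                      goto out_unwind;
                              /*
                               * Yield between hw queues so a long series of
                               * allocations cannot starve other runnable threads.
                               */
                              cond_resched();
                      }

                      return 0;

              out_unwind:
                      while (--i >= 0)
                              blk_mq_free_rq_map(set->tags[i]);

                      return -ENOMEM;
              }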
      Signed-off-by: Xianting Tian <tian.xianting@h3c.com>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  5. 11 Sep 2020, 1 commit
      blk-mq: always allow reserved allocation in hctx_may_queue · 28500850
      Authored by Ming Lei
      NVMe shares a tagset between the fabric queue and the admin queue, or
      between connect_q and the NS queue, so hctx_may_queue() can be called
      when allocating requests for these queues.
      
      Tags can be reserved in these tagsets. Before error recovery there are
      often lots of in-flight requests which can't be completed, and a new
      reserved request may be needed in the error recovery path. However,
      hctx_may_queue() can keep returning false because there are too many
      in-flight requests which can't be completed during error handling.
      Finally, nothing can make progress.
      
      Fix this issue by always allowing reserved tag allocation in
      hctx_may_queue(). This is reasonable because reserved tags are supposed
      to always be available.
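      The shape of the change is roughly as below (a hedged sketch of the tag
      allocation path, not the exact upstream diff; details may differ):

              static int __blk_mq_get_tag(struct blk_mq_alloc_data *data,
                                          struct sbitmap_queue *bt)
              {
                      /*
                       * Reserved tags are expected to always be available, so
                       * the shared-tagset fairness limit is only applied to
                       * normal tag allocations.
                       */
                      if (!(data->flags & BLK_MQ_REQ_RESERVED) &&
                          !hctx_may_queue(data->hctx, bt))
                              return BLK_MQ_NO_TAG;

                      return __sbitmap_queue_get(bt);
              }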
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Cc: David Milburn <dmilburn@redhat.com>
      Cc: Ewan D. Milne <emilne@redhat.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  6. 04 Sep 2020, 8 commits
  7. 22 Aug 2020, 1 commit
      blk-mq: insert request not through ->queue_rq into sw/scheduler queue · db03f88f
      Authored by Ming Lei
      Commit c616cbee ("blk-mq: punt failed direct issue to dispatch list")
      was supposed to add only requests which have been through ->queue_rq()
      to the hw queue dispatch list, however it also adds requests that ran
      out of budget or driver tag. This basically bypasses request merging,
      causes too many requests to be dispatched to the LLD, and unnecessarily
      increases %system.
      
      Fix this issue by inserting requests that have not been through
      ->queue_rq() into the sw/scheduler queue; this is safe because
      ->queue_rq() has not been called on these requests yet.
      
      High %system can be observed on Azure storvsc devices, and even soft
      lockups are observed. This patch reduces %system during heavy
      sequential IO and at the same time decreases the soft lockup risk.
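      A rough sketch of the insert path described above (the helper name is
      hypothetical; only the choice of insert function reflects the idea of
      the patch):

              /*
               * Called only for a request that has NOT been through
               * ->queue_rq(), e.g. direct issue bailed out for lack of budget
               * or driver tag.
               */
              static void insert_unissued_request(struct request *rq, bool run_queue)
              {
                      /*
                       * Go through the sw/scheduler queue so the request can
                       * still be merged, instead of bypassing it straight to
                       * hctx->dispatch.
                       */
                      blk_mq_sched_insert_request(rq, false, run_queue, false);
              }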
      
      Fixes: c616cbee ("blk-mq: punt failed direct issue to dispatch list")
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  8. 17 Aug 2020, 2 commits
      blk-mq: order adding requests to hctx->dispatch and checking SCHED_RESTART · d7d8535f
      Authored by Ming Lei
      The SCHED_RESTART code path is relied on to re-run the queue for
      requests sitting in hctx->dispatch. Meanwhile, the SCHED_RESTART flag
      is checked when adding requests to hctx->dispatch.
      
      Memory barriers have to be used to order the following two pairs of
      operations:
      
      1) adding requests to hctx->dispatch and checking SCHED_RESTART in
      blk_mq_dispatch_rq_list(), and

      2) clearing SCHED_RESTART and checking whether there is a request in
      hctx->dispatch in blk_mq_sched_restart().
      
      Without the added memory barrier, either:

      1) blk_mq_sched_restart() may miss requests added to hctx->dispatch
      while blk_mq_dispatch_rq_list() observes SCHED_RESTART and does not
      re-run the queue on the dispatch side,

      or

      2) blk_mq_dispatch_rq_list() still sees SCHED_RESTART and does not
      re-run the queue on the dispatch side, while the check for requests in
      hctx->dispatch from blk_mq_sched_restart() is missed.
      
      An IO hang in the ltp/fs_fill test was reported by the kernel test
      robot:

      	https://lkml.org/lkml/2020/7/26/77

      It turns out to be caused by the above out-of-order operations, and the
      IO hang can no longer be observed after applying this patch.
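      The dispatch-side ordering looks roughly like the sketch below
      (illustrative placement only, assuming the dispatch-list splice in
      blk_mq_dispatch_rq_list(); not the exact upstream hunk):

              spin_lock(&hctx->lock);
              list_splice_tail_init(list, &hctx->dispatch);
              spin_unlock(&hctx->lock);

              /*
               * Order the list addition above against the SCHED_RESTART check
               * below, pairing with the ordering on the blk_mq_sched_restart()
               * side, so that at least one of the two sides sees the other's
               * update and re-runs the queue.
               */
              smp_mb();

              if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
                      blk_mq_run_hw_queue(hctx, true);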
      
      Fixes: bd166ef1 ("blk-mq-sched: add framework for MQ capable IO schedulers")
      Reported-by: kernel test robot <rong.a.chen@intel.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Jeffery <djeffery@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      block: blk-mq.c: fix @at_head kernel-doc warning · 26bfeb26
      Authored by Randy Dunlap
      Fix a kernel-doc warning in block/blk-mq.c:
      
      ../block/blk-mq.c:1844: warning: Function parameter or member 'at_head' not described in 'blk_mq_request_bypass_insert'
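      The fix amounts to describing the missing parameter in the kernel-doc
      comment, roughly as follows (wording approximate):

              /**
               * blk_mq_request_bypass_insert - Insert a request at dispatch list.
               * @rq: Pointer to request to be inserted.
               * @at_head: true if the request should be inserted at the head of the list.
               * @run_queue: If we should run the hardware queue after inserting the request.
               */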
      
      Fixes: 01e99aec ("blk-mq: insert passthrough request into hctx->dispatch directly")
      Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
      Cc: André Almeida <andrealmeid@collabora.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: linux-block@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  9. 01 Aug 2020, 1 commit
  10. 28 Jul 2020, 1 commit
  11. 10 Jul 2020, 1 commit
  12. 09 Jul 2020, 2 commits
  13. 07 Jul 2020, 1 commit
      blk-mq: consider non-idle request as "inflight" in blk_mq_rq_inflight() · 05a4fed6
      Authored by Ming Lei
      dm-multipath is the only user of blk_mq_queue_inflight().  When
      dm-multipath calls blk_mq_queue_inflight() to check if it has
      outstanding IO, it can get a false negative.  The reason is that
      blk_mq_rq_inflight() doesn't consider requests that are no longer
      MQ_RQ_IN_FLIGHT but are now MQ_RQ_COMPLETE (->complete hasn't been
      called or hasn't finished yet) as "inflight".
      
      This causes request-based dm-multipath's dm_wait_for_completion() to
      return before all outstanding dm-multipath requests have actually
      completed.  This breaks DM multipath's suspend functionality because
      blk-mq requests complete after DM's suspend has finished -- which
      shouldn't happen.
      
      Fix this by considering any request not in the MQ_RQ_IDLE state
      (so either MQ_RQ_COMPLETE or MQ_RQ_IN_FLIGHT) as "inflight" in
      blk_mq_rq_inflight().
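      Sketched, the check boils down to the following (illustrative only;
      names other than the state helpers are hypothetical):

              /*
               * A request counts as inflight for blk_mq_queue_inflight() as
               * soon as it has left MQ_RQ_IDLE, i.e. it is either
               * MQ_RQ_IN_FLIGHT or MQ_RQ_COMPLETE (completion not finished).
               */
              static bool rq_is_outstanding(struct request *rq)
              {
                      return blk_mq_rq_state(rq) != MQ_RQ_IDLE;
              }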
      
      Fixes: 3c94d83c ("blk-mq: change blk_mq_queue_busy() to blk_mq_queue_inflight()")
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  14. 02 Jul 2020, 1 commit
  15. 01 Jul 2020, 5 commits
  16. 30 Jun 2020, 6 commits
  17. 29 Jun 2020, 2 commits
  18. 24 Jun 2020, 4 commits