1. 24 May 2021, 5 commits
  2. 14 May 2021, 2 commits
  3. 16 Apr 2021, 1 commit
    • blk-mq: bypass IO scheduler's limit_depth for passthrough request · 8d663f34
      Committed by Lin Feng
      Commit 01e99aec ("blk-mq: insert passthrough request into
      hctx->dispatch directly") gives high priority to passthrough requests and
      bypasses the underlying IO scheduler. But since we allocate a tag for such
      a request, it still runs the IO scheduler's limit_depth callback, while
      what we really want is to give the full sbitmap depth to such a request
      when acquiring an available tag.
      blktrace shows PC requests (dmraid -s -c -i) hitting bfq's limit_depth:
        8,0    2        0     0.000000000 39952 1,0  m   N bfq [bfq_limit_depth] wr_busy 0 sync 0 depth 8
        8,0    2        1     0.000008134 39952  D   R 4 [dmraid]
        8,0    2        2     0.000021538    24  C   R [0]
        8,0    2        0     0.000035442 39952 1,0  m   N bfq [bfq_limit_depth] wr_busy 0 sync 0 depth 8
        8,0    2        3     0.000038813 39952  D   R 24 [dmraid]
        8,0    2        4     0.000044356    24  C   R [0]
      
      This patch introduces a new wrapper to keep the code from getting ugly.
      Signed-off-by: Lin Feng <linf@wangsu.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Link: https://lore.kernel.org/r/20210415033920.213963-1-linf@wangsu.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      8d663f34
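      A minimal sketch of the idea in the commit above (assumed shape, not the
      verbatim diff; the helper name blk_op_is_passthrough() and its placement
      in __blk_mq_alloc_request() are reconstructions): classify passthrough
      operations with a small wrapper, and skip the elevator's limit_depth
      callback for them in the tag allocation path.

      static inline bool blk_op_is_passthrough(unsigned int op)
      {
              op &= REQ_OP_MASK;
              /* SCSI and driver-private commands bypass the IO scheduler. */
              return op == REQ_OP_SCSI_IN || op == REQ_OP_SCSI_OUT ||
                     op == REQ_OP_DRV_IN || op == REQ_OP_DRV_OUT;
      }

      /* In the tag allocation path: let the elevator shrink the usable tag
       * depth only for normal fs requests, never for passthrough requests,
       * which go straight to hctx->dispatch anyway. */
      if (e && !op_is_flush(data->cmd_flags) &&
          !blk_op_is_passthrough(data->cmd_flags) &&
          e->type->ops.limit_depth &&
          !(data->flags & BLK_MQ_REQ_RESERVED))
              e->type->ops.limit_depth(data->cmd_flags, data);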
  4. 09 Apr 2021, 1 commit
  5. 05 Mar 2021, 3 commits
  6. 12 Feb 2021, 2 commits
  7. 25 Jan 2021, 3 commits
    • blk-mq: Improve performance of non-mq IO schedulers with multiple HW queues · b6e68ee8
      Committed by Jan Kara
      Currently, when a non-mq-aware IO scheduler (BFQ, mq-deadline) is used for
      a queue with multiple HW queues, the performance is rather bad. The
      problem is that these IO schedulers use queue-wide locking and their
      dispatch function does not respect the hctx it is passed in; it returns
      any request it finds appropriate. Thus locality of request access is
      broken and dispatch from multiple CPUs just contends on the IO scheduler's
      locks. For these IO schedulers there's little point in dispatching from
      multiple CPUs. Instead, always dispatch only from a single CPU to limit
      contention.
      
      Below is a comparison of dbench runs on an XFS filesystem where the storage
      is a RAID card with 64 HW queues and a single rotating disk attached to it.
      BFQ is used as the IO scheduler:
      
            clients           MQ                     SQ             MQ-Patched
      Amean 1      39.12 (0.00%)       43.29 * -10.67%*       36.09 *   7.74%*
      Amean 2     128.58 (0.00%)      101.30 *  21.22%*       96.14 *  25.23%*
      Amean 4     577.42 (0.00%)      494.47 *  14.37%*      508.49 *  11.94%*
      Amean 8     610.95 (0.00%)      363.86 *  40.44%*      362.12 *  40.73%*
      Amean 16    391.78 (0.00%)      261.49 *  33.25%*      282.94 *  27.78%*
      Amean 32    324.64 (0.00%)      267.71 *  17.54%*      233.00 *  28.23%*
      Amean 64    295.04 (0.00%)      253.02 *  14.24%*      242.37 *  17.85%*
      Amean 512 10281.61 (0.00%)    10211.16 *   0.69%*    10447.53 *  -1.61%*
      
      Numbers are times, so lower is better. MQ is the stock 5.10-rc6 kernel. SQ
      is the same kernel with megaraid_sas.host_tagset_enable=0 so that the card
      advertises just a single HW queue. MQ-Patched is a kernel with this
      patch applied.
      
      You can see that multiple hardware queues heavily hurt performance in
      combination with BFQ. The patch restores the performance.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      b6e68ee8
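      Roughly how the single-CPU dispatch can work, reconstructed from the
      description above; the helper and flag names (blk_mq_get_sq_hctx(),
      ELEVATOR_F_MQ_AWARE) are best-effort assumptions, not a quote of the
      patch. When the elevator is not hctx-aware, blk_mq_run_hw_queues() only
      runs one preferred hctx instead of all of them, so the queue-wide
      scheduler lock is taken from a single CPU.

      void blk_mq_run_hw_queues(struct request_queue *q, bool async)
      {
              struct blk_mq_hw_ctx *hctx, *sq_hctx = NULL;
              int i;

              /* Queue-wide schedulers (bfq, mq-deadline) ignore the hctx they
               * are asked to dispatch for, so funnel them through one hctx. */
              if (q->elevator &&
                  !(q->elevator->type->elevator_features & ELEVATOR_F_MQ_AWARE))
                      sq_hctx = blk_mq_get_sq_hctx(q);  /* hctx for this CPU */

              queue_for_each_hw_ctx(q, hctx, i) {
                      if (blk_mq_hctx_stopped(hctx))
                              continue;
                      /* Still run other hctxs that hold requests which
                       * bypassed the scheduler (hctx->dispatch). */
                      if (!sq_hctx || sq_hctx == hctx ||
                          !list_empty_careful(&hctx->dispatch))
                              blk_mq_run_hw_queue(hctx, async);
              }
      }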
    • Revert "blk-mq, elevator: Count requests per hctx to improve performance" · 5ac83c64
      Committed by Jan Kara
      This reverts commit b445547e.
      
      Since both mq-deadline and BFQ completely ignore the hctx they are passed
      in their dispatch function and dispatch whatever request they deem fit,
      checking whether any request for a particular hctx is queued is just
      pointless, since we'll very likely get a request from a different hctx
      anyway. In the following commit we'll deal with lock contention in these
      IO schedulers in the presence of multiple HW queues in a different way.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      5ac83c64
    • block: store a block_device pointer in struct bio · 309dca30
      Committed by Christoph Hellwig
      Replace the gendisk pointer in struct bio with a pointer to the newly
      improved struct block_device. From that the gendisk can be trivially
      accessed with an extra indirection, but it also allows us to directly
      look up all information related to partition remapping.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Acked-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      309dca30
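      For users of struct bio the change looks roughly like this (sketch only;
      field names bi_bdev, bd_disk and bd_start_sect are as in later mainline
      kernels):

      static void example(struct bio *bio)
      {
              /* Old layout: the bio carried the gendisk (bio->bi_disk) and a
               * partition number directly. */

              /* New layout: the bio points at the block_device it was
               * submitted to (the partition itself); the gendisk is one
               * indirection away, and partition-remapping information such as
               * the start sector is reachable directly. */
              struct block_device *bdev = bio->bi_bdev;
              struct gendisk *disk = bdev->bd_disk;
              sector_t start = bdev->bd_start_sect;   /* partition offset */
      }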
  8. 18 Dec 2020, 1 commit
  9. 17 Dec 2020, 1 commit
  10. 13 Dec 2020, 2 commits
  11. 10 Dec 2020, 2 commits
  12. 08 Dec 2020, 2 commits
    • block: disable iopoll for split bio · cc29e1bf
      Committed by Jeffle Xu
      iopoll is intended for small, latency-sensitive IO. It doesn't work well
      for big IO, especially when it needs to be split into multiple bios. In
      this case, the cookie returned by __submit_bio_noacct_mq() is actually the
      cookie of the last split bio. The completion of *this* last split bio by
      iopoll doesn't mean the whole original bio has completed; callers of
      iopoll still need to wait for the completion of the other split bios.
      
      Besides, bio splitting may cause more trouble for iopoll, which isn't
      supposed to be used for big IO in the first place.
      
      iopoll for a split bio may cause a race if CPU migration happens during
      bio submission. Since the returned cookie is that of the last split bio,
      polling on the corresponding hardware queue doesn't help complete the
      other split bios if they are enqueued into different hardware queues.
      Since interrupts are disabled for polling queues, the completion of these
      other split bios then depends on the timeout mechanism, thus causing a
      potential hang.
      
      iopoll for a split bio may also cause a hang for sync polling. Currently
      both blkdev and iomap-based filesystems (ext4/xfs, etc.) support sync
      polling in their direct IO routines. These routines submit the bio
      without the REQ_NOWAIT flag set and then start sync polling in the
      current process context. The process may hang in blk_mq_get_tag() if the
      submitted bio has to be split into multiple bios that rapidly exhaust the
      queue depth: the process is waiting for the completion of previously
      allocated requests, which would only be reaped by the polling it has not
      yet started, thus causing a deadlock.
      
      To avoid the subtle troubles described above, just disable iopoll for
      split bios and return BLK_QC_T_NONE in this case. The side effect is that
      non-HIPRI IO also returns BLK_QC_T_NONE now; this should be acceptable
      since the returned cookie is never used for non-HIPRI IO.
      Suggested-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      cc29e1bf
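      A sketch of the mechanics as described above, assuming the REQ_HIPRI flag
      and BLK_QC_T_NONE cookie of kernels from that period; the condition name
      bio_was_split and the placement of the hunks are stand-ins, not the
      actual diff.

      /* In the bio splitting path: a cookie can only represent one bio, so
       * once a bio has been split, stop treating it as a polled (HIPRI) bio. */
      if (bio_was_split)
              bio->bi_opf &= ~REQ_HIPRI;

      /* In the submission path: anything that is not HIPRI hands back a
       * cookie that tells callers there is nothing to poll for. */
      if (!(bio->bi_opf & REQ_HIPRI))
              cookie = BLK_QC_T_NONE;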
    • blk-mq: skip hybrid polling if iopoll doesn't spin · f6f371f7
      Committed by Pavel Begunkov
      If blk_poll() is not going to spin (i.e. @spin=false), it must also not
      sleep in hybrid polling, otherwise it might be pretty surprising for
      users trying to do a quick check and expecting no-wait behaviour.
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      f6f371f7
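      Conceptually the change is a single guard in the poll path; a hedged
      sketch (the hybrid-poll helper name is assumed from kernels of that era):

      /* Hybrid polling sleeps for part of the expected completion time before
       * spinning. If the caller asked for a non-spinning poll (@spin == false)
       * it expects a quick, non-blocking check, so never enter the hybrid
       * sleep in that case. */
      if (spin && blk_mq_poll_hybrid(q, hctx, cookie))
              return 1;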
  13. 05 Dec 2020, 2 commits
  14. 03 Dec 2020, 1 commit
    • block: fix inflight statistics of part0 · b0d97557
      Committed by Jeffle Xu
      The inflight count of partition 0 doesn't include inflight IOs to its
      sub-partitions, since blk-mq currently calculates the inflight count of a
      specific partition by simply comparing the value of the partition pointer.
      
      Thus the following case is possible:
      
      $ cat /sys/block/vda/inflight
             0        0
      $ cat /sys/block/vda/vda1/inflight
             0      128
      
      A single-queue device (on a previous kernel version, e.g. v3.10) doesn't
      have this issue:
      
      $ cat /sys/block/sda/sda3/inflight
             0       33
      $ cat /sys/block/sda/inflight
             0       33
      
      Partition 0 should be handled specially since it represents the whole
      disk. This issue was introduced by commit bf0ddaba ("blk-mq: fix
      sysfs inflight counter").
      
      Besides, this patch also fixes the inflight statistics of part 0 in
      /proc/diskstats. Before this patch, the inflight statistics of part 0
      did not include those of the sub-partitions. (I have marked the
      'inflight' field with asterisks.)
      
      $ cat /proc/diskstats
       259       0 nvme0n1 45974469 0 367814768 6445794 1 0 1 0 *0* 111062 6445794 0 0 0 0 0 0
       259       2 nvme0n1p1 45974058 0 367797952 6445727 0 0 0 0 *33* 111001 6445727 0 0 0 0 0 0
      
      This was introduced by commit f299b7c7 ("blk-mq: provide internal
      in-flight variant").
      
      Fixes: bf0ddaba ("blk-mq: fix sysfs inflight counter")
      Fixes: f299b7c7 ("blk-mq: provide internal in-flight variant")
      Signed-off-by: Jeffle Xu <jefflexu@linux.alibaba.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      [axboe: adapt for 5.11 partition change]
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      b0d97557
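      A sketch of the special case for part0; the iterator signature and the
      mq_inflight naming are simplified, and bdev_is_partition() is assumed as
      the whole-disk test in the 5.11 partition model.

      /* Per-request check used by the inflight iterator (simplified). */
      static bool check_inflight(struct request *rq, struct mq_inflight *mi)
      {
              /*
               * Partition 0 is the whole disk, so any in-flight request on
               * this disk counts toward it; any other partition still needs
               * an exact pointer match.
               */
              if (rq->part && blk_do_io_stat(rq) &&
                  (!bdev_is_partition(mi->part) || rq->part == mi->part) &&
                  blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT)
                      mi->inflight[rq_data_dir(rq)]++;
              return true;
      }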
  15. 02 Dec 2020, 1 commit
  16. 24 Nov 2020, 1 commit
  17. 11 Nov 2020, 1 commit
  18. 24 Oct 2020, 1 commit
  19. 20 Oct 2020, 1 commit
    • blk-mq: remove the calling of local_memory_node() · 576e85c5
      Committed by Xianting Tian
      We don't need to check whether the node is a memoryless NUMA node before
      calling the allocator interface. SLUB (and SLAB, SLOB) relies on the page
      allocator to pick a node, and the page allocator deals with memoryless
      nodes just fine: it has zonelists constructed for every possible node,
      and it will automatically fall back to the node closest to the requested
      one, as long as __GFP_THISNODE is not enforced, of course.
      
      The code comment in SLAB's kmem_cache_alloc_node() also states this:
       * Fallback to other node is possible if __GFP_THISNODE is not set.
      
      The blk-mq code doesn't set __GFP_THISNODE, so we can remove the calls to
      local_memory_node().
      Signed-off-by: Xianting Tian <tian.xianting@h3c.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      576e85c5
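      The change itself is mechanical; a hedged before/after sketch on a
      hypothetical allocation site (the call to kzalloc_node() here is
      illustrative, not a specific blk-mq hunk):

      /* Before: the target node was "sanitized" first. */
      hctx = kzalloc_node(sizeof(*hctx), GFP_KERNEL,
                          local_memory_node(cpu_to_node(cpu)));

      /* After: pass the requested node straight through. Without
       * __GFP_THISNODE the page allocator walks that node's zonelist and
       * falls back from a memoryless node to the nearest node with memory. */
      hctx = kzalloc_node(sizeof(*hctx), GFP_KERNEL, cpu_to_node(cpu));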
  20. 10 Oct 2020, 1 commit
  21. 08 Oct 2020, 1 commit
    • dm: fix request-based DM to not bounce through indirect dm_submit_bio · 681cc5e8
      Committed by Mike Snitzer
      It is unnecessary to force request-based DM to call into the bio-based
      dm_submit_bio (via the indirect disk->fops->submit_bio) only to have it
      then call blk_mq_submit_bio().
      
      Fix this by establishing a request-based DM block_device_operations
      (dm_rq_blk_dops, which doesn't have .submit_bio) and updating
      dm_setup_md_queue() to set md->disk->fops to it for
      DM_TYPE_REQUEST_BASED.
      
      Remove the DM_TYPE_REQUEST_BASED conditional from dm_submit_bio and
      unexport blk_mq_submit_bio.
      
      Fixes: c62b37d9 ("block: move ->make_request_fn to struct block_device_operations")
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      681cc5e8
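      The shape of the fix, abridged; the callbacks listed are the usual DM
      block_device_operations entries and the table contents are not reproduced
      in full, so treat this as a sketch rather than the actual dm.c hunk.

      /* Request-based DM: no .submit_bio, so submitted bios go straight into
       * blk-mq instead of bouncing through dm_submit_bio(). */
      static const struct block_device_operations dm_rq_blk_dops = {
              .open    = dm_blk_open,
              .release = dm_blk_close,
              .ioctl   = dm_blk_ioctl,
              .getgeo  = dm_blk_getgeo,
              .owner   = THIS_MODULE,
      };

      /* In dm_setup_md_queue(): for DM_TYPE_REQUEST_BASED, point the disk at
       * the request-based ops table. */
      if (type == DM_TYPE_REQUEST_BASED)
              md->disk->fops = &dm_rq_blk_dops;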
  22. 07 Oct 2020, 1 commit
    • block: Consider only dispatched requests for inflight statistic · a926c7af
      Committed by Gabriel Krisman Bertazi
      According to Documentation/block/stat.rst, inflight should not include
      I/O requests that are in the queue but not yet dispatched to the device,
      yet blk-mq identifies as inflight any request that has a tag allocated,
      which, for queues without an elevator, happens at request allocation time
      and before it is queued in the ctx (the default case in blk_mq_submit_bio).
      
      In addition, the current behavior differs between queues with an elevator
      and queues without one, since for the former the driver tag is allocated
      at dispatch time. A more precise approach is to only consider requests in
      the MQ_RQ_IN_FLIGHT state.
      
      This effectively reverts commit 6131837b ("blk-mq: count allocated
      but not started requests in iostats inflight") to consolidate blk-mq's
      behavior with itself (the elevator case) and with the original
      documentation, though it differs from the behavior of the legacy path.
      
      This version differs from v1 by using blk_mq_rq_state to access the
      state attribute.  Avoid using blk_mq_request_started, which was
      suggested, since we don't want to include MQ_RQ_COMPLETE.
      Signed-off-by: Gabriel Krisman Bertazi <krisman@collabora.com>
      Cc: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      a926c7af
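      As described, the core of the change is one extra condition in the
      per-request accounting callback; a short sketch with field names borrowed
      from the surrounding blk-mq statistics code:

      /* Only requests that have actually been dispatched to the driver are
       * in flight; allocated-but-not-started requests no longer count. */
      if (rq->part == mi->part && blk_mq_rq_state(rq) == MQ_RQ_IN_FLIGHT)
              mi->inflight[rq_data_dir(rq)]++;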
  23. 06 Oct 2020, 1 commit
  24. 29 Sep 2020, 1 commit
    • blk-mq: call commit_rqs while list empty but error happen · 632bfb63
      Committed by yangerkun
      blk-mq should call commit_rqs once 'bd.last != true' and no more requests
      will come (so that, e.g., virtscsi can kick the virtqueue). We already do
      that in blk_mq_dispatch_rq_list/blk_mq_try_issue_list_directly when the
      list is not empty and 'queued > 0'. However, the same situation arises
      when the last request in the list goes through queue_rq and returns an
      error such as BLK_STS_IOERR, which does not requeue the request: the list
      ends up empty, but commit_rqs still needs to be called (otherwise the
      requests already handed to virtscsi will sit until they time out, unless
      some other request kicks the virtqueue).
      
      We found this problem by running fsstress while quickly and repeatedly
      offlining/onlining the virtscsi device.
      
      Fixes: d666ba98 ("blk-mq: add mq_ops->commit_rqs()")
      Reported-by: zhangyi (F) <yi.zhang@huawei.com>
      Signed-off-by: yangerkun <yangerkun@huawei.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      632bfb63
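      A sketch of the tail of the dispatch loop this is about; the variable
      names (list, errors, queued) follow the commit's own description and the
      condition is paraphrased rather than quoted.

      /* We stopped issuing with bd.last == false either because requests are
       * left on the list or because queue_rq() failed on the final one. In
       * both cases the driver never saw a "last" request, so it still needs
       * commit_rqs() to kick the hardware (e.g. the virtqueue). */
      if ((!list_empty(list) || errors) && q->mq_ops->commit_rqs && queued)
              q->mq_ops->commit_rqs(hctx);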
  25. 28 Sep 2020, 1 commit
    • blk-mq: add cond_resched() in __blk_mq_alloc_rq_maps() · 8229cca8
      Committed by Xianting Tian
      We found that blk_mq_alloc_rq_maps() takes a long time in kernel space
      when testing nvme device hot-plugging. The test and analysis are below.
      
      Debug code:
      1. blk_mq_alloc_rq_maps():
              u64 start, end;
              depth = set->queue_depth;
              start = ktime_get_ns();
              pr_err("[%d:%s switch:%ld,%ld] queue depth %d, nr_hw_queues %d\n",
                              current->pid, current->comm, current->nvcsw, current->nivcsw,
                              set->queue_depth, set->nr_hw_queues);
              do {
                      err = __blk_mq_alloc_rq_maps(set);
                      if (!err)
                              break;
      
                      set->queue_depth >>= 1;
                      if (set->queue_depth < set->reserved_tags + BLK_MQ_TAG_MIN) {
                              err = -ENOMEM;
                              break;
                      }
              } while (set->queue_depth);
              end = ktime_get_ns();
              pr_err("[%d:%s switch:%ld,%ld] all hw queues init cost time %lld ns\n",
                              current->pid, current->comm,
                              current->nvcsw, current->nivcsw, end - start);
      
      2. __blk_mq_alloc_rq_maps():
              u64 start, end;
              for (i = 0; i < set->nr_hw_queues; i++) {
                      start = ktime_get_ns();
                      if (!__blk_mq_alloc_rq_map(set, i))
                              goto out_unwind;
                      end = ktime_get_ns();
                      pr_err("hw queue %d init cost time %lld ns\n", i, end - start);
              }
      
      Testing nvme hot-plugging with the above debug code, we found it costs
      more than 3 ms in kernel space, without being scheduled out, to allocate
      rqs for all 16 hw queues with depth 1023; each hw queue costs about
      140-250 us. The cost grows as the number of hw queues and the queue depth
      increase. In an extreme case, if __blk_mq_alloc_rq_maps() returns -ENOMEM,
      it retries with "queue_depth >>= 1" and even more time is consumed.
      	[  428.428771] nvme nvme0: pci function 10000:01:00.0
      	[  428.428798] nvme 10000:01:00.0: enabling device (0000 -> 0002)
      	[  428.428806] pcieport 10000:00:00.0: can't derive routing for PCI INT A
      	[  428.428809] nvme 10000:01:00.0: PCI INT A: no GSI
      	[  432.593374] [4688:kworker/u33:8 switch:663,2] queue depth 30, nr_hw_queues 1
      	[  432.593404] hw queue 0 init cost time 22883 ns
      	[  432.593408] [4688:kworker/u33:8 switch:663,2] all hw queues init cost time 35960 ns
      	[  432.595953] nvme nvme0: 16/0/0 default/read/poll queues
      	[  432.595958] [4688:kworker/u33:8 switch:700,2] queue depth 1023, nr_hw_queues 16
      	[  432.596203] hw queue 0 init cost time 242630 ns
      	[  432.596441] hw queue 1 init cost time 235913 ns
      	[  432.596659] hw queue 2 init cost time 216461 ns
      	[  432.596877] hw queue 3 init cost time 215851 ns
      	[  432.597107] hw queue 4 init cost time 228406 ns
      	[  432.597336] hw queue 5 init cost time 227298 ns
      	[  432.597564] hw queue 6 init cost time 224633 ns
      	[  432.597785] hw queue 7 init cost time 219954 ns
      	[  432.597937] hw queue 8 init cost time 150930 ns
      	[  432.598082] hw queue 9 init cost time 143496 ns
      	[  432.598231] hw queue 10 init cost time 147261 ns
      	[  432.598397] hw queue 11 init cost time 164522 ns
      	[  432.598542] hw queue 12 init cost time 143401 ns
      	[  432.598692] hw queue 13 init cost time 148934 ns
      	[  432.598841] hw queue 14 init cost time 147194 ns
      	[  432.598991] hw queue 15 init cost time 148942 ns
      	[  432.598993] [4688:kworker/u33:8 switch:700,2] all hw queues init cost time 3035099 ns
      	[  432.602611]  nvme0n1: p1
      
      So this patch triggers a reschedule between each hw queue init to avoid
      starving other threads. __blk_mq_alloc_rq_maps() is not executed in atomic
      context, so it is safe to call cond_resched().
      Signed-off-by: Xianting Tian <tian.xianting@h3c.com>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      8229cca8
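      The fix itself is a one-liner in the per-hctx allocation loop shown in
      the debug code above; a sketch with the unwind path omitted:

      for (i = 0; i < set->nr_hw_queues; i++) {
              if (!__blk_mq_alloc_rq_map(set, i))
                      goto out_unwind;        /* unwind path omitted */
              /*
               * Allocating one deep hw queue takes 100-250 us here; yield
               * between iterations so 16 queues with depth 1023 don't pin
               * the CPU for several milliseconds.
               */
              cond_resched();
      }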
  26. 11 Sep 2020, 1 commit
    • blk-mq: always allow reserved allocation in hctx_may_queue · 28500850
      Committed by Ming Lei
      NVMe shares a tagset between the fabrics queue and the admin queue, or
      between connect_q and the NS queue, so hctx_may_queue() can be called when
      allocating requests for these queues.
      
      Tags can be reserved in these tagsets. Before error recovery there are
      often lots of in-flight requests which can't be completed, and a new
      reserved request may be needed in the error recovery path. However,
      hctx_may_queue() can keep returning false because there are too many
      in-flight requests which can't be completed during error handling.
      In the end, nothing can proceed.
      
      Fix this issue by always allowing reserved tag allocation in
      hctx_may_queue(). This is reasonable because reserved tags are supposed
      to always be available.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Cc: David Milburn <dmilburn@redhat.com>
      Cc: Ewan D. Milne <emilne@redhat.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      28500850
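      A sketch of where the fix as described would sit, assuming the
      tag-allocation helper of that era; function and field names are
      best-effort reconstructions, not the exact diff.

      static int __blk_mq_get_tag(struct blk_mq_alloc_data *data,
                                  struct sbitmap_queue *bt)
      {
              /*
               * hctx_may_queue() throttles allocations against the number of
               * active users of a shared tagset. Reserved tags exist exactly
               * so that forward progress (e.g. error recovery) is always
               * possible, so the throttle must never reject them.
               */
              if (!(data->flags & BLK_MQ_REQ_RESERVED) &&
                  !hctx_may_queue(data->hctx, bt))
                      return BLK_MQ_NO_TAG;

              if (data->shallow_depth)
                      return __sbitmap_queue_get_shallow(bt, data->shallow_depth);

              return __sbitmap_queue_get(bt);
      }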