1. 12 Nov 2021, 1 commit
  2. 05 Nov 2021, 1 commit
    • block: move queue enter logic into blk_mq_submit_bio() · 900e0807
      Authored by Jens Axboe
      Retain the old logic for the fops based submit, but for our internal
      blk_mq_submit_bio(), move the queue entering logic into the core
      function itself.
      
      We need to be a bit careful if going into the scheduler, as the scheduler
      or the queue mappings can change arbitrarily before we have entered the
      queue. Have the bio scheduler mapping do that separately; it's a very
      cheap operation compared to actually doing the merge locking and lookups.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      [axboe: update to check merge post submit_bio_checks() doing remap...]
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      900e0807
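      A minimal sketch of the entry pattern described above (the helper
      bio_merge_with_scheduler() is hypothetical and the signatures are
      simplified stand-ins, not the actual kernel code):

          /* Sketch only: simplified, not the real blk_mq_submit_bio(). */
          static void sketch_blk_mq_submit_bio(struct request_queue *q,
                                               struct bio *bio)
          {
                  /* The core function now takes the queue reference itself. */
                  if (blk_queue_enter(q, BLK_MQ_REQ_NOWAIT))
                          return;

                  /*
                   * Only consult the elevator once the queue has been entered;
                   * the scheduler and the queue mappings may change arbitrarily
                   * until then, so the merge check is redone at this point.
                   */
                  if (bio_merge_with_scheduler(q, bio)) {   /* hypothetical */
                          blk_queue_exit(q);
                          return;
                  }

                  /* ... allocate the request and plug or issue it ... */
          }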
  3. 30 Oct 2021, 1 commit
  4. 22 Oct 2021, 1 commit
  5. 21 Oct 2021, 1 commit
  6. 19 Oct 2021, 1 commit
  7. 18 Oct 2021, 8 commits
  8. 28 Jul 2021, 1 commit
  9. 25 Jun 2021, 1 commit
    • blk: Fix lock inversion between ioc lock and bfqd lock · fd2ef39c
      Authored by Jan Kara
      Lockdep complains about lock inversion between ioc->lock and bfqd->lock:
      
      bfqd -> ioc:
       put_io_context+0x33/0x90 -> ioc->lock grabbed
       blk_mq_free_request+0x51/0x140
       blk_put_request+0xe/0x10
       blk_attempt_req_merge+0x1d/0x30
       elv_attempt_insert_merge+0x56/0xa0
       blk_mq_sched_try_insert_merge+0x4b/0x60
       bfq_insert_requests+0x9e/0x18c0 -> bfqd->lock grabbed
       blk_mq_sched_insert_requests+0xd6/0x2b0
       blk_mq_flush_plug_list+0x154/0x280
       blk_finish_plug+0x40/0x60
       ext4_writepages+0x696/0x1320
       do_writepages+0x1c/0x80
       __filemap_fdatawrite_range+0xd7/0x120
       sync_file_range+0xac/0xf0
      
      ioc->bfqd:
       bfq_exit_icq+0xa3/0xe0 -> bfqd->lock grabbed
       put_io_context_active+0x78/0xb0 -> ioc->lock grabbed
       exit_io_context+0x48/0x50
       do_exit+0x7e9/0xdd0
       do_group_exit+0x54/0xc0
      
      To avoid this inversion we change blk_mq_sched_try_insert_merge() to not
      free the merged request but rather leave that up to the caller, similarly
      to blk_mq_sched_try_merge(). And in bfq_insert_requests() we make sure
      to free all the merged requests after dropping bfqd->lock.
      
      Fixes: aee69d78 ("block, bfq: introduce the BFQ-v0 I/O scheduler as an extra scheduler")
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Acked-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20210623093634.27879-3-jack@suse.cz
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      fd2ef39c
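      A rough sketch of the resulting ordering (try_insert_merge() below stands
      in for blk_mq_sched_try_insert_merge(); the real bfq_insert_request() is
      more involved, so treat this as an illustration only):

          static void sketch_bfq_insert_request(struct bfq_data *bfqd,
                                                struct request *rq)
          {
                  LIST_HEAD(free);        /* merged requests, freed later */

                  spin_lock_irq(&bfqd->lock);

                  /*
                   * The merge helper no longer frees the merged request;
                   * it only moves it onto @free for the caller.
                   */
                  if (try_insert_merge(bfqd, rq, &free))
                          goto unlock;

                  /* ... normal insertion into the bfq queues ... */

          unlock:
                  spin_unlock_irq(&bfqd->lock);

                  /*
                   * Freeing may take ioc->lock, which is safe here because
                   * bfqd->lock has already been dropped, so ioc->lock is
                   * never acquired while bfqd->lock is held.
                   */
                  blk_mq_free_requests(&free);
          }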
  10. 18 Jun 2021, 2 commits
  11. 04 Jun 2021, 1 commit
    • block: Do not pull requests from the scheduler when we cannot dispatch them · 61347154
      Authored by Jan Kara
      Provided the device driver does not implement dispatch budget accounting
      (which only SCSI does), the loop in __blk_mq_do_dispatch_sched() pulls
      requests from the IO scheduler as long as it is willing to give out any.
      That defeats scheduling heuristics inside the scheduler by creating the
      false impression that the device can take more IO when it in fact
      cannot.
      
      For example, with the BFQ IO scheduler on top of a virtio-blk device,
      setting the blkio cgroup weight has barely any impact on the observed
      throughput of async IO because __blk_mq_do_dispatch_sched() always sucks
      out all the IO queued in BFQ. BFQ first submits IO from higher weight
      cgroups but when that is all dispatched, it will give out IO of lower
      weight cgroups as well. And then we have to wait for all this IO to be
      dispatched to the disk (which means a lot of it actually has to complete)
      before the IO scheduler is queried again for dispatching more requests.
      This completely destroys any service differentiation.
      
      So grab a request tag for a request pulled out of the IO scheduler already
      in __blk_mq_do_dispatch_sched(), and do not pull any more requests if we
      cannot get it, because we are unlikely to be able to dispatch it. That
      way only a single request is going to wait in the dispatch list for some
      tag to free.
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20210603104721.6309-1-jack@suse.cz
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      61347154
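      Condensed, the loop now looks roughly like the sketch below (the helpers
      marked hypothetical are stand-ins; this is not the real
      __blk_mq_do_dispatch_sched()):

          static void sketch_do_dispatch_sched(struct blk_mq_hw_ctx *hctx)
          {
                  struct elevator_queue *e = hctx->queue->elevator;
                  struct request *rq;

                  for (;;) {
                          rq = e->type->ops.dispatch_request(hctx);
                          if (!rq)
                                  break;

                          /*
                           * Grab the driver tag right away.  If none is
                           * available we are unlikely to be able to dispatch,
                           * so stop pulling requests out of the scheduler;
                           * only this single request waits on the dispatch
                           * list for a tag to free up.
                           */
                          if (!blk_mq_get_driver_tag(rq)) {
                                  park_on_dispatch_list(hctx, rq);  /* hypothetical */
                                  break;
                          }

                          if (!dispatch_one_request(hctx, rq))      /* hypothetical */
                                  break;
                  }
          }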
  12. 24 May 2021, 1 commit
  13. 11 May 2021, 1 commit
    • kyber: fix out of bounds access when preempted · efed9a33
      Authored by Omar Sandoval
      __blk_mq_sched_bio_merge() gets the ctx and hctx for the current CPU and
      passes the hctx to ->bio_merge(). kyber_bio_merge() then gets the ctx
      for the current CPU again and uses that to get the corresponding Kyber
      context in the passed hctx. However, the thread may be preempted between
      the two calls to blk_mq_get_ctx(), and the ctx returned the second time
      may no longer correspond to the passed hctx. This "works" accidentally
      most of the time, but it can cause us to read garbage if the second ctx
      came from an hctx with more ctx's than the first one (i.e., if
      ctx->index_hw[hctx->type] > hctx->nr_ctx).
      
      This manifested as this UBSAN array index out of bounds error reported
      by Jakub:
      
      UBSAN: array-index-out-of-bounds in ../kernel/locking/qspinlock.c:130:9
      index 13106 is out of range for type 'long unsigned int [128]'
      Call Trace:
       dump_stack+0xa4/0xe5
       ubsan_epilogue+0x5/0x40
       __ubsan_handle_out_of_bounds.cold.13+0x2a/0x34
       queued_spin_lock_slowpath+0x476/0x480
       do_raw_spin_lock+0x1c2/0x1d0
       kyber_bio_merge+0x112/0x180
       blk_mq_submit_bio+0x1f5/0x1100
       submit_bio_noacct+0x7b0/0x870
       submit_bio+0xc2/0x3a0
       btrfs_map_bio+0x4f0/0x9d0
       btrfs_submit_data_bio+0x24e/0x310
       submit_one_bio+0x7f/0xb0
       submit_extent_page+0xc4/0x440
       __extent_writepage_io+0x2b8/0x5e0
       __extent_writepage+0x28d/0x6e0
       extent_write_cache_pages+0x4d7/0x7a0
       extent_writepages+0xa2/0x110
       do_writepages+0x8f/0x180
       __writeback_single_inode+0x99/0x7f0
       writeback_sb_inodes+0x34e/0x790
       __writeback_inodes_wb+0x9e/0x120
       wb_writeback+0x4d2/0x660
       wb_workfn+0x64d/0xa10
       process_one_work+0x53a/0xa80
       worker_thread+0x69/0x5b0
       kthread+0x20b/0x240
       ret_from_fork+0x1f/0x30
      
      Only Kyber uses the hctx, so fix it by passing the request_queue to
      ->bio_merge() instead. BFQ and mq-deadline just use that, and Kyber can
      map the queues itself to avoid the mismatch.
      
      Fixes: a6088845 ("block: kyber: make kyber more friendly with merging")
      Reported-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Link: https://lore.kernel.org/r/c7598605401a48d5cfeadebb678abd10af22b83f.1620691329.git.osandov@fb.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      efed9a33
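      The race and the fix, sketched (simplified; the field and helper usage
      follows the description above rather than the exact kyber-iosched.c
      code):

          /* Before: the ctx is looked up a second time and, after preemption,
           * may belong to a different hctx than the one passed in. */
          static bool buggy_bio_merge(struct blk_mq_hw_ctx *hctx, struct bio *bio)
          {
                  struct blk_mq_ctx *ctx = blk_mq_get_ctx(hctx->queue);
                  struct kyber_hctx_data *khd = hctx->sched_data;
                  /* ctx->index_hw[hctx->type] may exceed hctx->nr_ctx -> OOB */
                  struct kyber_ctx_queue *kcq = &khd->kcqs[ctx->index_hw[hctx->type]];

                  /* ... attempt the merge against kcq ... */
                  return false;
          }

          /* After: take the request_queue and derive a matching ctx/hctx pair
           * in one place, so the index always stays within hctx->nr_ctx. */
          static bool fixed_bio_merge(struct request_queue *q, struct bio *bio)
          {
                  struct blk_mq_ctx *ctx = blk_mq_get_ctx(q);
                  struct blk_mq_hw_ctx *hctx = blk_mq_map_queue(q, bio->bi_opf, ctx);
                  struct kyber_hctx_data *khd = hctx->sched_data;
                  struct kyber_ctx_queue *kcq = &khd->kcqs[ctx->index_hw[hctx->type]];

                  /* ... attempt the merge against kcq ... */
                  return false;
          }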
  14. 09 Apr 2021, 1 commit
  15. 05 Mar 2021, 1 commit
  16. 02 Mar 2021, 1 commit
  17. 22 Feb 2021, 1 commit
  18. 05 Dec 2020, 1 commit
  19. 10 Oct 2020, 1 commit
    • blk-mq: get rid of the dead flush handle code path · c7281524
      Authored by Yufen Yu
      After commit 923218f6 ("blk-mq: don't allocate driver tag upfront
      for flush rq"), blk_mq_submit_bio() will call blk_insert_flush()
      directly to handle flush requests rather than blk_mq_sched_insert_request()
      when an elevator is in use.
      
      Then, all flush requests either have the RQF_FLUSH_SEQ flag set when
      blk_mq_sched_insert_request() is called, or have been inserted into
      hctx->dispatch. So, remove the dead code path.
      Signed-off-by: Yufen Yu <yuyufen@huawei.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      c7281524
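      For reference, the submit-side routing referred to above looks roughly
      like this sketch (not the literal kernel code):

          /* Inside the submit path, after the request has been allocated: */
          if (op_is_flush(bio->bi_opf)) {
                  /* Flush requests bypass the elevator entirely. */
                  blk_insert_flush(rq);
                  blk_mq_run_hw_queue(hctx, true);
          } else {
                  /* Everything else may go through the IO scheduler. */
                  blk_mq_sched_insert_request(rq, false, true, true);
          }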
  20. 06 Oct 2020, 1 commit
  21. 08 Sep 2020, 1 commit
  22. 04 Sep 2020, 2 commits
    • blk-mq: Facilitate a shared sbitmap per tagset · 32bc15af
      Authored by John Garry
      Some SCSI HBAs (such as HPSA, megaraid, mpt3sas, hisi_sas_v3 ..) support
      multiple reply queues with single hostwide tags.
      
      In addition, these drivers want to use interrupt assignment in
      pci_alloc_irq_vectors(PCI_IRQ_AFFINITY). However, as discussed in [0],
      CPU hotplug may cause in-flight IO completion to not be serviced when an
      interrupt is shut down. That problem is solved in commit bf0beec0
      ("blk-mq: drain I/O when all CPUs in a hctx are offline").
      
      However, to take advantage of that blk-mq feature, the HBA HW queues are
      required to be mapped to the blk-mq hctx's; to do that, the HBA HW
      queues need to be exposed to the upper layer.
      
      In making that transition, the per-SCSI command request tags are no
      longer unique per Scsi host - they are just unique per hctx. As such, the
      HBA LLDD would have to generate this tag internally, which has a certain
      performance overhead.
      
      However, another problem is that blk-mq assumes the host may accept
      (Scsi_host.can_queue * #hw queue) commands. In commit 6eb045e0 ("scsi:
      core: avoid host-wide host_busy counter for scsi_mq"), the Scsi host busy
      counter was removed, which would stop the LLDD being sent more than
      .can_queue commands; however, it should still be ensured that the block
      layer does not issue more than .can_queue commands to the Scsi host.
      
      To solve this problem, introduce a shared sbitmap per blk_mq_tag_set,
      which may be requested at init time.
      
      New flag BLK_MQ_F_TAG_HCTX_SHARED should be set when requesting the
      tagset to indicate whether the shared sbitmap should be used.
      
      Even when BLK_MQ_F_TAG_HCTX_SHARED is set, a full set of tags and requests
      are still allocated per hctx; the reason for this is that if tags and
      requests were only allocated for a single hctx - like hctx0 - it may break
      block drivers which expect a request to be associated with a specific hctx,
      i.e. not always hctx0. This will introduce extra memory usage.
      
      This change is based on work originally from Ming Lei in [1] and from
      Bart's suggestion in [2].
      
      [0] https://lore.kernel.org/linux-block/alpine.DEB.2.21.1904051331270.1802@nanos.tec.linutronix.de/
      [1] https://lore.kernel.org/linux-block/20190531022801.10003-1-ming.lei@redhat.com/
      [2] https://lore.kernel.org/linux-block/ff77beff-5fd9-9f05-12b6-826922bace1f@huawei.com/T/#m3db0a602f095cbcbff27e9c884d6b4ae826144be
      Signed-off-by: John Garry <john.garry@huawei.com>
      Tested-by: Don Brace <don.brace@microsemi.com> #SCSI resv cmds patches used
      Tested-by: Douglas Gilbert <dgilbert@interlog.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      32bc15af
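      Roughly, an LLDD opting in would set the new flag when building its tag
      set; the sketch below uses made-up driver fields (hba, my_mq_ops) purely
      for illustration:

          struct blk_mq_tag_set *set = &hba->tag_set;

          memset(set, 0, sizeof(*set));
          set->ops          = &my_mq_ops;
          set->nr_hw_queues = hba->nr_reply_queues;  /* expose the HW queues */
          set->queue_depth  = hba->can_queue;        /* host-wide limit */
          set->numa_node    = NUMA_NO_NODE;
          set->flags        = BLK_MQ_F_SHOULD_MERGE | BLK_MQ_F_TAG_HCTX_SHARED;

          if (blk_mq_alloc_tag_set(set))
                  return -ENODEV;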
    • blk-mq: Pass flags for tag init/free · 1c0706a7
      Authored by John Garry
      Pass hctx/tagset flags argument down to blk_mq_init_tags() and
      blk_mq_free_tags() for selective init/free.
      
      For now, make it include the alloc policy flag, which can be evaluated
      when needed (in blk_mq_init_tags()).
      Signed-off-by: John Garry <john.garry@huawei.com>
      Tested-by: Douglas Gilbert <dgilbert@interlog.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      1c0706a7
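      In effect, the tag init path can derive what it needs from the flags
      itself; a hedged sketch of the direction of the change (simplified
      signature and body, not the exact kernel function):

          struct blk_mq_tags *sketch_init_tags(unsigned int nr_tags,
                                               unsigned int reserved_tags,
                                               int node, unsigned int flags)
          {
                  /* The alloc policy is extracted where it is actually used. */
                  int alloc_policy = BLK_MQ_FLAG_TO_ALLOC_POLICY(flags);

                  /* ... allocate struct blk_mq_tags and its sbitmaps,
                   *     honouring alloc_policy (FIFO vs round-robin) ... */
                  return NULL;    /* placeholder in this sketch */
          }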
  23. 02 Sep 2020, 4 commits
  24. 17 Aug 2020, 1 commit
    • blk-mq: order adding requests to hctx->dispatch and checking SCHED_RESTART · d7d8535f
      Authored by Ming Lei
      The SCHED_RESTART code path is relied on to re-run the queue for
      dispatching requests in hctx->dispatch. Meanwhile, the SCHED_RESTART
      flag is checked when adding requests to hctx->dispatch.
      
      Memory barriers have to be used for ordering the following two pairs of
      operations:
      
      1) adding requests to hctx->dispatch and checking SCHED_RESTART in
      blk_mq_dispatch_rq_list()
      
      2) clearing SCHED_RESTART and checking if there is request in hctx->dispatch
      in blk_mq_sched_restart().
      
      Without the added memory barriers, either:
      
      1) blk_mq_sched_restart() may miss requests added to hctx->dispatch while
      blk_mq_dispatch_rq_list() observes SCHED_RESTART, and the queue is not
      run on the dispatch side,
      
      or
      
      2) blk_mq_dispatch_rq_list() still sees SCHED_RESTART, and the queue is
      not run on the dispatch side, while the check for requests in
      hctx->dispatch from blk_mq_sched_restart() is missed.
      
      An IO hang in the ltp/fs_fill test was reported by the kernel test robot:
      
      	https://lkml.org/lkml/2020/7/26/77
      
      It turns out it is caused by the above out-of-order operations, and the
      IO hang can't be observed any more after applying this patch.
      
      Fixes: bd166ef1 ("blk-mq-sched: add framework for MQ capable IO schedulers")
      Reported-by: kernel test robot <rong.a.chen@intel.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Cc: Bart Van Assche <bvanassche@acm.org>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Jeffery <djeffery@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      d7d8535f
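      The pairing looks roughly like the sketch below (condensed from the
      description above; not the literal kernel code):

          /* Dispatch side, as in blk_mq_dispatch_rq_list(): */
          static void sketch_dispatch_side(struct blk_mq_hw_ctx *hctx,
                                           struct list_head *list)
          {
                  spin_lock(&hctx->lock);
                  list_splice_tail_init(list, &hctx->dispatch);
                  spin_unlock(&hctx->lock);

                  /*
                   * Order adding requests to hctx->dispatch against the
                   * SCHED_RESTART check below; pairs with the barrier on
                   * the restart side.
                   */
                  smp_mb();

                  if (!test_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state))
                          blk_mq_run_hw_queue(hctx, true);
          }

          /* Restart side, as in blk_mq_sched_restart(): */
          static void sketch_restart_side(struct blk_mq_hw_ctx *hctx)
          {
                  clear_bit(BLK_MQ_S_SCHED_RESTART, &hctx->state);

                  /*
                   * Order clearing SCHED_RESTART against the subsequent check
                   * of hctx->dispatch inside blk_mq_run_hw_queue(); pairs with
                   * the barrier on the dispatch side.
                   */
                  smp_mb();

                  blk_mq_run_hw_queue(hctx, true);
          }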
  25. 01 Aug 2020, 1 commit
  26. 10 Jul 2020, 1 commit
  27. 30 Jun 2020, 2 commits