1. 24 May 2021, 9 commits
  2. 20 May 2021, 1 commit
  3. 14 May 2021, 3 commits
  4. 12 May 2021, 2 commits
    • block, bfq: avoid circular stable merges · 7ea96eef
      Authored by Paolo Valente
      BFQ may merge a new bfq_queue, stably, with the last bfq_queue
      created. In particular, BFQ first waits a little bit for some I/O to
      flow inside the new queue, say Q2, if this is needed to understand
      whether it is better or worse to merge Q2 with the last queue created,
      say Q1. This delayed stable merge is performed by assigning
      bic->stable_merge_bfqq = Q1, for the bic associated with Q1.
      
      Yet, while waiting for some I/O to flow in Q2, a non-stable queue
      merge of Q2 with Q1 may happen, causing the bic previously associated
      with Q2 to be associated with exactly Q1 (bic->bfqq = Q1). After that,
      Q2 and Q1 may happen to be split, and, in the split, Q1 may happen to
      be recycled as a non-shared bfq_queue. In that case, Q1 may then
      happen to undergo a stable merge with the bfq_queue pointed by
      bic->stable_merge_bfqq. Yet bic->stable_merge_bfqq still points to
      Q1. So Q1 would be merged with itself.
      
      This commit fixes this error by intercepting this situation and
      cancelling the scheduled stable merge (a sketch of the guard follows
      below).
      
      Fixes: 430a67f9 ("block, bfq: merge bursts of newly-created queues")
      Signed-off-by: Pietro Pedroni <pedroni.pietro.96@gmail.com>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Link: https://lore.kernel.org/r/20210512094352.85545-2-paolo.valente@linaro.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      7ea96eef
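      A minimal C sketch of the guard described above (simplified stand-in
      types and a hypothetical helper name, not the actual bfq code): before
      performing a scheduled delayed stable merge, check whether the recorded
      merge target is the queue itself and, if so, cancel the merge.

      #include <stddef.h>

      /* Sketch only: simplified stand-ins for the bfq data structures. */
      struct bfq_queue;                            /* opaque here */

      struct bfq_io_cq {                           /* simplified "bic" */
              struct bfq_queue *bfqq;              /* currently associated queue */
              struct bfq_queue *stable_merge_bfqq; /* scheduled merge target */
      };

      /* Hypothetical helper: returns the queue to use after the check. */
      static struct bfq_queue *
      bfq_do_or_cancel_stable_merge(struct bfq_io_cq *bic, struct bfq_queue *bfqq)
      {
              if (bic->stable_merge_bfqq == bfqq) {
                      /* Q1 would be merged with itself: cancel the schedule. */
                      bic->stable_merge_bfqq = NULL;
                      return bfqq;
              }
              /* ... otherwise proceed with the delayed stable merge ... */
              return bic->stable_merge_bfqq;
      }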
    • blk-iocost: fix weight updates of inner active iocgs · e9f4eee9
      Authored by Tejun Heo
      When the weight of an active iocg is updated, weight_updated() is called,
      which in turn calls __propagate_weights() to update the active and inuse
      weights so that the effective hierarchical weights are updated accordingly.
      
      The current implementation is incorrect for inner active nodes. For an
      active leaf iocg, inuse can be any value between 1 and active and the
      difference represents how much the iocg is donating. When weight is updated,
      as long as inuse is clamped between 1 and the new weight, we're alright and
      this is what __propagate_weights() currently implements.
      
      However, that's not how an active inner node's inuse is set. An inner node's
      inuse is solely determined by the ratio between the sums of inuse's and
      active's of its children, i.e. they're the results of propagating the leaves'
      active and inuse weights upwards. __propagate_weights() incorrectly applies
      the same clamping as for a leaf when an active inner node's weight is
      updated. Consider a hierarchy which looks like the following with saturating
      workloads in AA and BB.
      
           R
         /   \
        A     B
        |     |
       AA     BB
      
      1. For both A and B, active=100, inuse=100, hwa=0.5, hwi=0.5.
      
      2. echo 200 > A/io.weight
      
      3. __propagate_weights() updates A's active to 200 and leaves inuse at 100 as
         it's already between 1 and the new active, making A:active=200,
         A:inuse=100. As R's active_sum is updated along with A's active,
         A:hwa=2/3, B:hwa=1/3. However, because the inuses didn't change, the
         hwi's remain unchanged at 0.5.
      
      4. The weight of A is now twice that of B but AA and BB still have the same
         hwi of 0.5 and thus are doing the same amount of IOs.
      
      Fix it by making __propagate_weights() always calculate the inuse of an
      active inner iocg based on the ratio of child_inuse_sum to child_active_sum
      (see the sketch below).
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: Dan Schatzberg <dschatzberg@fb.com>
      Fixes: 7caa4715 ("blkcg: implement blk-iocost")
      Cc: stable@vger.kernel.org # v5.4+
      Link: https://lore.kernel.org/r/YJsxnLZV1MnBcqjj@slm.duckdns.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      e9f4eee9
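      The corrected rule can be pictured with a small standalone C sketch
      (hypothetical helper name, not the kernel implementation): an active
      inner iocg's inuse is derived from its children's inuse/active sums
      rather than clamped like a leaf.

      #include <stdio.h>

      /* Hypothetical helper illustrating the corrected inner-node rule:
       * inuse = active * child_inuse_sum / child_active_sum. */
      static unsigned int inner_inuse(unsigned int active,
                                      unsigned int child_inuse_sum,
                                      unsigned int child_active_sum)
      {
              if (!child_active_sum)
                      return active;
              return active * child_inuse_sum / child_active_sum;
      }

      int main(void)
      {
              /* Step 2 of the example above: echo 200 > A/io.weight.
               * AA is saturating, so AA:active == AA:inuse == 100. */
              printf("A:inuse = %u\n", inner_inuse(200, 100, 100));
              /* Prints 200: A's inuse follows its children instead of
               * staying clamped at 100, so the hwi's now reflect 2:1. */
              return 0;
      }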
  5. 11 May 2021, 1 commit
    • kyber: fix out of bounds access when preempted · efed9a33
      Authored by Omar Sandoval
      __blk_mq_sched_bio_merge() gets the ctx and hctx for the current CPU and
      passes the hctx to ->bio_merge(). kyber_bio_merge() then gets the ctx
      for the current CPU again and uses that to get the corresponding Kyber
      context in the passed hctx. However, the thread may be preempted between
      the two calls to blk_mq_get_ctx(), and the ctx returned the second time
      may no longer correspond to the passed hctx. This "works" accidentally
      most of the time, but it can cause us to read garbage if the second ctx
      came from an hctx with more ctx's than the first one (i.e., if
      ctx->index_hw[hctx->type] > hctx->nr_ctx).
      
      This manifested as this UBSAN array index out of bounds error reported
      by Jakub:
      
      UBSAN: array-index-out-of-bounds in ../kernel/locking/qspinlock.c:130:9
      index 13106 is out of range for type 'long unsigned int [128]'
      Call Trace:
       dump_stack+0xa4/0xe5
       ubsan_epilogue+0x5/0x40
       __ubsan_handle_out_of_bounds.cold.13+0x2a/0x34
       queued_spin_lock_slowpath+0x476/0x480
       do_raw_spin_lock+0x1c2/0x1d0
       kyber_bio_merge+0x112/0x180
       blk_mq_submit_bio+0x1f5/0x1100
       submit_bio_noacct+0x7b0/0x870
       submit_bio+0xc2/0x3a0
       btrfs_map_bio+0x4f0/0x9d0
       btrfs_submit_data_bio+0x24e/0x310
       submit_one_bio+0x7f/0xb0
       submit_extent_page+0xc4/0x440
       __extent_writepage_io+0x2b8/0x5e0
       __extent_writepage+0x28d/0x6e0
       extent_write_cache_pages+0x4d7/0x7a0
       extent_writepages+0xa2/0x110
       do_writepages+0x8f/0x180
       __writeback_single_inode+0x99/0x7f0
       writeback_sb_inodes+0x34e/0x790
       __writeback_inodes_wb+0x9e/0x120
       wb_writeback+0x4d2/0x660
       wb_workfn+0x64d/0xa10
       process_one_work+0x53a/0xa80
       worker_thread+0x69/0x5b0
       kthread+0x20b/0x240
       ret_from_fork+0x1f/0x30
      
      Only Kyber uses the hctx, so fix it by passing the request_queue to
      ->bio_merge() instead. BFQ and mq-deadline just use that, and Kyber can
      map the queues itself to avoid the mismatch (a sketch of the changed
      callback follows below).
      
      Fixes: a6088845 ("block: kyber: make kyber more friendly with merging")
      Reported-by: Jakub Kicinski <kuba@kernel.org>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Link: https://lore.kernel.org/r/c7598605401a48d5cfeadebb678abd10af22b83f.1620691329.git.osandov@fb.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      efed9a33
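      A simplified sketch of the interface change described above (stand-in
      types, not the exact kernel declarations): ->bio_merge() now receives
      the request_queue, and the one scheduler that needs the (ctx, hctx)
      pair, Kyber, derives both itself on the CPU it is actually running on.

      #include <stdbool.h>

      struct request_queue;                        /* opaque stand-ins */
      struct bio;

      /* Sketch of the callback after the change; previously it took the
       * struct blk_mq_hw_ctx * chosen earlier by the caller, which could
       * be stale after a preemption. */
      struct bio_merge_ops_sketch {
              bool (*bio_merge)(struct request_queue *q, struct bio *bio,
                                unsigned int nr_segs);
      };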
  6. 09 May 2021, 1 commit
  7. 07 May 2021, 1 commit
  8. 04 May 2021, 1 commit
    • bio: limit bio max size · cd2c7545
      Authored by Changheun Lee
      A bio can grow up to 4GB when multi-page bvecs are enabled, but
      sometimes this leads to inefficient behavior. In the case of large-chunk
      direct I/O - a 32MB chunk read in user space - all pages for the 32MB
      are merged into one bio structure if the pages' physical addresses are
      contiguous. This delays submission until the merge is complete, so the
      bio max size should be limited to a proper size.
      
      When a 32MB chunk read with the direct I/O option comes from userspace,
      the current kernel behavior in the do_direct_IO() loop looks like the
      timeline below.
      
       | bio merge for 32MB. total 8,192 pages are merged.
       | total elapsed time is over 2ms.
       |------------------ ... ----------------------->|
                                                       | 8,192 pages merged a bio.
                                                       | at this time, first bio submit is done.
                                                       | 1 bio is split to 32 read request and issue.
                                                       |--------------->
                                                        |--------------->
                                                         |--------------->
                                                                    ......
                                                                         |--------------->
                                                                          |--------------->|
                                total 19ms elapsed to complete 32MB read done from device. |
      
      If the bio max size is limited to 1MB, the behavior changes as shown below.
      
       | bio merge for 1MB. 256 pages are merged for each bio.
       | total 32 bio will be made.
       | total elapsed time is over 2ms. it's same.
       | but, first bio submit timing is fast. about 100us.
       |--->|--->|--->|---> ... -->|--->|--->|--->|--->|
            | 256 pages merged a bio.
            | at this time, first bio submit is done.
            | and 1 read request is issued for 1 bio.
            |--------------->
                 |--------------->
                      |--------------->
                                            ......
                                                       |--------------->
                                                        |--------------->|
              total 17ms elapsed to complete 32MB read done from device. |
      
      As a result, read requests are issued sooner when the bio max size is
      limited. With the current kernel behavior and multi-page bvecs, a
      super-large bio can be created, which delays the issue of the first I/O
      request (a sketch of the size cap is shown below).
      Signed-off-by: Changheun Lee <nanich.lee@samsung.com>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Link: https://lore.kernel.org/r/20210503095203.29076-1-nanich.lee@samsung.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      cd2c7545
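      A minimal sketch of the size cap idea (hypothetical names and constant,
      not the kernel's actual merge path): stop adding pages to the current
      bio once a configured maximum would be exceeded, so the first bio gets
      submitted early instead of after the full 32MB merge.

      #include <stdbool.h>
      #include <stdint.h>

      #define BIO_MAX_BYTES_SKETCH (1u << 20)      /* assumed 1MB cap */

      /* Hypothetical helper: may another page_len bytes be merged into a
       * bio that currently holds bio_bytes?  If not, the caller submits
       * the bio and starts a new one instead of merging further. */
      static bool bio_may_grow(uint32_t bio_bytes, uint32_t page_len)
      {
              return bio_bytes + page_len <= BIO_MAX_BYTES_SKETCH;
      }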
  9. 01 May 2021, 1 commit
    • cgroup: rstat: punt root-level optimization to individual controllers · dc26532a
      Authored by Johannes Weiner
      Current users of the rstat code can source root-level statistics from
      the native counters of their respective subsystem, allowing them to
      forego aggregation at the root level.  This optimization is currently
      implemented inside the generic rstat code, which doesn't track the root
      cgroup and doesn't invoke the subsystem flush callbacks on it.
      
      However, the memory controller cannot do this optimization, because
      cgroup1 breaks out memory specifically for the local level, including at
      the root level.  In preparation for the memory controller switching to
      rstat, move the optimization from rstat core to the controllers.
      
      Afterwards, rstat will always track the root cgroup for changes and
      invoke the subsystem callbacks on it; and it's up to the subsystem to
      special-case and skip aggregation of the root cgroup if it can source
      this information through other, cheaper means.
      
      This is the case for the io controller and the cgroup base stats.  Their
      respective flush callbacks now check whether the parent is the root
      cgroup and, if so, skip the unnecessary upward propagation (see the
      sketch below).
      
      The extra cost of tracking the root cgroup is negligible: on stat
      changes, we actually remove a branch that checks for the root.  The
      queueing for a flush touches only per-cpu data, and only the first stat
      change since a flush requires a (per-cpu) lock.
      
      Link: https://lkml.kernel.org/r/20210209163304.77088-6-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Michal Koutný <mkoutny@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dc26532a
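      A simplified sketch of the per-controller special case described above
      (stand-in types and field names, not the actual rstat or iostat code):
      the flush callback still runs for every cgroup, but aggregation into
      the parent is skipped when that parent is the root, whose figures come
      from cheaper native counters.

      struct cgroup_sketch {
              struct cgroup_sketch *parent;        /* NULL for the root cgroup */
              unsigned long stat;                  /* flushed value */
              unsigned long stat_pending;          /* per-flush delta */
      };

      static void controller_flush_one(struct cgroup_sketch *cgrp)
      {
              struct cgroup_sketch *parent = cgrp->parent;

              cgrp->stat += cgrp->stat_pending;

              /* Skip the upward propagation when the parent is the root:
               * root-level stats are sourced from native counters instead. */
              if (parent && parent->parent)
                      parent->stat_pending += cgrp->stat_pending;

              cgrp->stat_pending = 0;
      }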
  10. 26 April 2021, 1 commit
  11. 22 April 2021, 1 commit
  12. 17 April 2021, 1 commit
    • blk-mq: Fix spurious debugfs directory creation during initialization · 1e91e28e
      Authored by Saravanan D
      blk_mq_debugfs_register_sched_hctx(), called from the
      device_add_disk()->elevator_init_mq()->blk_mq_init_sched()
      initialization sequence, does not have the relevant parent directory
      set up yet and thus spuriously attempts to create a "sched" directory
      under the root of the debugfs mount for every hw queue detected on the
      block device.
      
      dmesg
      ...
      debugfs: Directory 'sched' with parent '/' already present!
      debugfs: Directory 'sched' with parent '/' already present!
      .
      .
      debugfs: Directory 'sched' with parent '/' already present!
      ...
      
      The parent debugfs directory for hw queues gets properly set up by
      device_add_disk()->blk_register_queue()->blk_mq_debugfs_register()
      ->blk_mq_debugfs_register_hctx() later in the block device
      initialization sequence.
      
      A simple check for debugfs_dir has been added to thwart premature
      debugfs directory/file creation attempts (see the sketch below).
      Signed-off-by: Saravanan D <saravanand@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      1e91e28e
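      The added check can be pictured with a short sketch (simplified struct
      and hypothetical function name; the real code works on the
      request_queue's debugfs fields): bail out when the parent debugfs
      directory has not been created yet, so nothing is ever created under
      the debugfs root.

      struct mq_debugfs_sketch {
              void *debugfs_dir;                   /* parent dir, NULL until the
                                                      queue is registered */
              void *sched_debugfs_dir;
      };

      static void register_sched_hctx_sketch(struct mq_debugfs_sketch *q)
      {
              if (!q->debugfs_dir)
                      return;                      /* too early: no parent yet */

              /* ... create the "sched" directory under q->debugfs_dir ... */
      }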
  13. 16 April 2021, 2 commits
    • bfq/mq-deadline: remove redundant check for passthrough request · 7687b38a
      Authored by Lin Feng
      Since commit 01e99aec ("blk-mq: insert passthrough request into
      hctx->dispatch directly"), passthrough requests should not appear in an
      IO scheduler any more, so the blk_rq_is_passthrough check in the add-on
      IO schedulers is redundant.
      
      (Note: this patch passes a generic IO load test with HDDs under a SAS
      controller and HDDs under an AHCI controller, but obviously does not
      cover everything. It is not certain whether a passthrough request can
      still escape into an IO scheduler from blk_mq_sched_insert_requests,
      which is used by blk_mq_flush_plug_list and has lots of indirect
      callers.)
      Signed-off-by: Lin Feng <linf@wangsu.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      7687b38a
    • blk-mq: bypass IO scheduler's limit_depth for passthrough request · 8d663f34
      Authored by Lin Feng
      Commit 01e99aec ("blk-mq: insert passthrough request into
      hctx->dispatch directly") gives high priority to passthrough requests
      and bypasses the underlying IO scheduler. But when we allocate a tag for
      such a request, it still runs the IO scheduler's limit_depth callback,
      while what we really want is to give such requests the full sbitmap
      depth for acquiring an available tag.
      blktrace shows PC requests (dmraid -s -c -i) hitting bfq's limit_depth:
        8,0    2        0     0.000000000 39952 1,0  m   N bfq [bfq_limit_depth] wr_busy 0 sync 0 depth 8
        8,0    2        1     0.000008134 39952  D   R 4 [dmraid]
        8,0    2        2     0.000021538    24  C   R [0]
        8,0    2        0     0.000035442 39952 1,0  m   N bfq [bfq_limit_depth] wr_busy 0 sync 0 depth 8
        8,0    2        3     0.000038813 39952  D   R 24 [dmraid]
        8,0    2        4     0.000044356    24  C   R [0]
      
      This patch introduces a new wrapper to make the code less ugly (see the
      sketch below).
      Signed-off-by: Lin Feng <linf@wangsu.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Link: https://lore.kernel.org/r/20210415033920.213963-1-linf@wangsu.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      8d663f34
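      A sketch of the kind of wrapper described above (hypothetical names,
      simplified types, not the kernel's tag-allocation code): only consult
      the scheduler's limit_depth callback for requests that actually go
      through the scheduler, leaving passthrough requests the full sbitmap
      depth.

      #include <stdbool.h>

      struct alloc_data_sketch {
              bool passthrough;                    /* e.g. a SCSI PC command */
              unsigned int depth;                  /* tag depth to allocate from */
      };

      typedef void (*limit_depth_fn)(struct alloc_data_sketch *data);

      /* Hypothetical wrapper: the scheduler may shrink the depth only for
       * scheduled requests; passthrough requests keep the full depth. */
      static void limit_depth_if_scheduled(limit_depth_fn limit_depth,
                                           struct alloc_data_sketch *data)
      {
              if (limit_depth && !data->passthrough)
                      limit_depth(data);
      }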
  14. 14 April 2021, 1 commit
  15. 12 April 2021, 3 commits
  16. 09 April 2021, 11 commits