1. 20 July 2022, 1 commit
    • block: don't allow the same type rq_qos add more than once · 14a6e2eb
      Committed by Jinke Han
      In our test of iocost, we encountered some list add/del corruptions of
      inner_walk list in ioc_timer_fn.
      
      The reason can be described as follows:
      
      cpu 0					cpu 1
      ioc_qos_write				ioc_qos_write
      
      ioc = q_to_ioc(queue);
      if (!ioc) {
              ioc = kzalloc();
      					ioc = q_to_ioc(queue);
      					if (!ioc) {
      						ioc = kzalloc();
      						...
      						rq_qos_add(q, rqos);
      					}
              ...
              rq_qos_add(q, rqos);
              ...
      }
      
      When the io.cost.qos file is written by two CPUs concurrently, the rq_qos
      may be added to one disk twice. In that case, there will be two iocs
      enabled and running on one disk, each owning different iocgs on its active
      list. In ioc_timer_fn(), because the iocgs from the two iocs share the
      same root iocg, the two iocs may overwrite each other's view of the root
      iocg's walk_list, which leads to list add/del corruptions when building or
      destroying the inner_walk list.
      
      So far, the blk-rq-qos framework has implicitly assumed at most one
      instance of each rq_qos type per queue. This patch makes that assumption
      explicit and also fixes the crash above.
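
      One way to make the one-instance-per-type rule explicit is a duplicate
      check in rq_qos_add(), which can simply refuse to register a second
      instance of the same type. A minimal sketch of the idea (assuming the
      existing rq_qos_id() helper that walks q->rq_qos looking for a matching
      id), not the verbatim patch:

        static inline int rq_qos_add(struct request_queue *q, struct rq_qos *rqos)
        {
            /* No IO may be in flight while the rq_qos list is edited. */
            blk_mq_freeze_queue(q);

            spin_lock_irq(&q->queue_lock);
            if (rq_qos_id(q, rqos->id))     /* same type already registered */
                goto ebusy;
            rqos->next = q->rq_qos;
            q->rq_qos = rqos;
            spin_unlock_irq(&q->queue_lock);

            blk_mq_unfreeze_queue(q);
            return 0;

        ebusy:
            spin_unlock_irq(&q->queue_lock);
            blk_mq_unfreeze_queue(q);
            return -EBUSY;                  /* callers must check the return value */
        }
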
      Signed-off-by: Jinke Han <hanjinke.666@bytedance.com>
      Reviewed-by: Muchun Song <songmuchun@bytedance.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Cc: <stable@vger.kernel.org>
      Link: https://lore.kernel.org/r/20220720093616.70584-1-hanjinke.666@bytedance.com
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  2. 13 July 2022, 1 commit
  3. 27 May 2022, 1 commit
    • blk-iolatency: Fix inflight count imbalances and IO hangs on offline · 8a177a36
      Committed by Tejun Heo
      iolatency needs to track the number of inflight IOs per cgroup. As this
      tracking can be expensive, it is disabled when no cgroup has iolatency
      configured for the device. To ensure that the inflight counters stay
      balanced, iolatency_set_limit() freezes the request_queue while manipulating
      the enabled counter, which ensures that no IO is in flight and thus all
      counters are zero.
      
      Unfortunately, iolatency_set_limit() isn't the only place where the enabled
      counter is manipulated. iolatency_pd_offline() can also decrement the counter
      and trigger disabling. As this disabling happens without freezing the queue,
      it can easily happen while some IOs are in flight and thus leak the counts.
      
      This can easily be demonstrated by turning on iolatency for one empty
      cgroup while IOs are in flight in other cgroups, and then removing the
      cgroup. Note that iolatency shouldn't have been enabled elsewhere in the
      system, to ensure that removing the cgroup disables iolatency for the
      whole device.
      
      The following keeps flipping on and off iolatency on sda:
      
        echo +io > /sys/fs/cgroup/cgroup.subtree_control
        while true; do
            mkdir -p /sys/fs/cgroup/test
            echo '8:0 target=100000' > /sys/fs/cgroup/test/io.latency
            sleep 1
            rmdir /sys/fs/cgroup/test
            sleep 1
        done
      
      and there's concurrent fio generating direct rand reads:
      
        fio --name test --filename=/dev/sda --direct=1 --rw=randread \
            --runtime=600 --time_based --iodepth=256 --numjobs=4 --bs=4k
      
      while monitoring with the following drgn script:
      
        while True:
          for css in css_for_each_descendant_pre(prog['blkcg_root'].css.address_of_()):
              for pos in hlist_for_each(container_of(css, 'struct blkcg', 'css').blkg_list):
                  blkg = container_of(pos, 'struct blkcg_gq', 'blkcg_node')
                  pd = blkg.pd[prog['blkcg_policy_iolatency'].plid]
                  if pd.value_() == 0:
                      continue
                  iolat = container_of(pd, 'struct iolatency_grp', 'pd')
                  inflight = iolat.rq_wait.inflight.counter.value_()
                  if inflight:
                      print(f'inflight={inflight} {disk_name(blkg.q.disk).decode("utf-8")} '
                            f'{cgroup_path(css.cgroup).decode("utf-8")}')
          time.sleep(1)
      
      The monitoring output looks like the following:
      
        inflight=1 sda /user.slice
        inflight=1 sda /user.slice
        ...
        inflight=14 sda /user.slice
        inflight=13 sda /user.slice
        inflight=17 sda /user.slice
        inflight=15 sda /user.slice
        inflight=18 sda /user.slice
        inflight=17 sda /user.slice
        inflight=20 sda /user.slice
        inflight=19 sda /user.slice <- fio stopped, inflight stuck at 19
        inflight=19 sda /user.slice
        inflight=19 sda /user.slice
      
      If a cgroup with a stuck inflight count ends up getting throttled, the
      throttled IOs will never be issued, as there is no completion event to wake
      them up, leading to an indefinite hang.
      
      This patch fixes the bug by unifying enable handling into a work item which
      is automatically kicked off from iolatency_set_min_lat_nsec() which is
      called from both iolatency_set_limit() and iolatency_pd_offline() paths.
      Punting to a work item is necessary as iolatency_pd_offline() is called
      under spinlocks while freezing a request_queue requires a sleepable context.
      
      This also simplifies the code, reducing LOC (excluding comments), and avoids
      the unnecessary freezes which used to happen whenever a cgroup's latency
      target was newly set or cleared.
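
      A rough sketch of the unified enable handling (field names such as
      enable_cnt, enabled and enable_work follow the patch's struct blk_iolatency,
      but treat this as illustrative rather than the verbatim implementation):

        /* All enable/disable transitions funnel through one work item. */
        static void blkiolatency_enable_work_fn(struct work_struct *work)
        {
            struct blk_iolatency *blkiolat =
                container_of(work, struct blk_iolatency, enable_work);
            bool enabled;

            /*
             * The work is guaranteed to run after the latest ->enable_cnt
             * update, so acting on the current value is sufficient.
             */
            enabled = atomic_read(&blkiolat->enable_cnt);
            if (enabled != blkiolat->enabled) {
                blk_mq_freeze_queue(blkiolat->rqos.q);   /* sleepable context */
                blkiolat->enabled = enabled;
                blk_mq_unfreeze_queue(blkiolat->rqos.q);
            }
        }

      iolatency_set_min_lat_nsec() then only adjusts the enable count and kicks
      off this work, whether it is called from iolatency_set_limit() or
      iolatency_pd_offline().
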
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Liu Bo <bo.liu@linux.alibaba.com>
      Fixes: 8c772a9b ("blk-iolatency: fix IO hang due to negative inflight counter")
      Cc: stable@vger.kernel.org # v5.0+
      Link: https://lore.kernel.org/r/Yn9ScX6Nx2qIiQQi@slm.duckdns.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  4. 17 May 2022, 1 commit
  5. 15 March 2022, 1 commit
    • block: fix rq-qos breakage from skipping rq_qos_done_bio() · aa1b46dc
      Committed by Tejun Heo
      a647a524 ("block: don't call rq_qos_ops->done_bio if the bio isn't
      tracked") made bio_endio() skip rq_qos_done_bio() if BIO_TRACKED is not set.
      While this fixed a potential oops, it also broke blk-iocost by skipping the
      done_bio callback for merged bios.
      
      Before, whether a bio went through rq_qos_throttle() or rq_qos_merge(),
      rq_qos_done_bio() would be called on the bio on completion, with BIO_TRACKED
      distinguishing the former from the latter. After that change,
      rq_qos_done_bio() is no longer called for bios which went through
      rq_qos_merge(). This royally confuses blk-iocost, as the merged bios never
      finish and are considered perpetually in-flight.
      
      One reliably reproducible failure mode is an intermediate cgroup getting
      stuck active, preventing its children from being activated due to the
      leaf-only rule, leading to loss of control. The following is from the
      resctl-bench protection scenario, which emulates isolating a web-server-like
      workload from a memory bomb, run on an iocost configuration which should
      yield a reasonable level of protection.
      
        # cat /sys/block/nvme2n1/device/model
        Samsung SSD 970 PRO 512GB
        # cat /sys/fs/cgroup/io.cost.model
        259:0 ctrl=user model=linear rbps=834913556 rseqiops=93622 rrandiops=102913 wbps=618985353 wseqiops=72325 wrandiops=71025
        # cat /sys/fs/cgroup/io.cost.qos
        259:0 enable=1 ctrl=user rpct=95.00 rlat=18776 wpct=95.00 wlat=8897 min=60.00 max=100.00
        # resctl-bench -m 29.6G -r out.json run protection::scenario=mem-hog,loops=1
        ...
        Memory Hog Summary
        ==================
      
        IO Latency: R p50=242u:336u/2.5m p90=794u:1.4m/7.5m p99=2.7m:8.0m/62.5m max=8.0m:36.4m/350m
                    W p50=221u:323u/1.5m p90=709u:1.2m/5.5m p99=1.5m:2.5m/9.5m max=6.9m:35.9m/350m
      
        Isolation and Request Latency Impact Distributions:
      
                      min   p01   p05   p10   p25   p50   p75   p90   p95   p99   max  mean stdev
        isol%       15.90 15.90 15.90 40.05 57.24 59.07 60.01 74.63 74.63 90.35 90.35 58.12 15.82
        lat-imp%        0     0     0     0     0  4.55 14.68 15.54 233.5 548.1 548.1 53.88 143.6
      
        Result: isol=58.12:15.82% lat_imp=53.88%:143.6 work_csv=100.0% missing=3.96%
      
      The isolation result of 58.12% is close to what this device would show
      without any IO control.
      
      Fix it by introducing a new flag BIO_QOS_MERGED to mark merged bios and
      calling rq_qos_done_bio() on them too. For consistency and clarity, rename
      BIO_TRACKED to BIO_QOS_THROTTLED. The flag checks are moved into
      rq_qos_done_bio() so that they sit next to the code paths that set the flags.
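
      In sketch form, the completion-side check becomes roughly the following
      (assuming the usual q->rq_qos list head and the existing __rq_qos_done_bio()
      walker in blk-rq-qos.h):

        /* Complete qos accounting for throttled *and* merged bios. */
        static inline void rq_qos_done_bio(struct request_queue *q, struct bio *bio)
        {
            if (q->rq_qos &&
                (bio_flagged(bio, BIO_QOS_THROTTLED) ||
                 bio_flagged(bio, BIO_QOS_MERGED)))
                __rq_qos_done_bio(q->rq_qos, bio);
        }

      with rq_qos_throttle() setting BIO_QOS_THROTTLED and rq_qos_merge() setting
      BIO_QOS_MERGED on the bios they track.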
      
      With the patch applied, the above same benchmark shows:
      
        # resctl-bench -m 29.6G -r out.json run protection::scenario=mem-hog,loops=1
        ...
        Memory Hog Summary
        ==================
      
        IO Latency: R p50=123u:84.4u/985u p90=322u:256u/2.5m p99=1.6m:1.4m/9.5m max=11.1m:36.0m/350m
                    W p50=429u:274u/995u p90=1.7m:1.3m/4.5m p99=3.4m:2.7m/11.5m max=7.9m:5.9m/26.5m
      
        Isolation and Request Latency Impact Distributions:
      
                      min   p01   p05   p10   p25   p50   p75   p90   p95   p99   max  mean stdev
        isol%       84.91 84.91 89.51 90.73 92.31 94.49 96.36 98.04 98.71 100.0 100.0 94.42  2.81
        lat-imp%        0     0     0     0     0  2.81  5.73 11.11 13.92 17.53 22.61  4.10  4.68
      
        Result: isol=94.42:2.81% lat_imp=4.10%:4.68 work_csv=58.34% missing=0%
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Fixes: a647a524 ("block: don't call rq_qos_ops->done_bio if the bio isn't tracked")
      Cc: stable@vger.kernel.org # v5.15+
      Cc: Ming Lei <ming.lei@redhat.com>
      Cc: Yu Kuai <yukuai3@huawei.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Link: https://lore.kernel.org/r/Yi7rdrzQEHjJLGKB@slm.duckdns.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  6. 12 February 2022, 1 commit
  7. 18 October 2021, 1 commit
  8. 17 August 2021, 1 commit
  9. 06 August 2021, 1 commit
  10. 08 October 2020, 1 commit
  11. 01 July 2020, 1 commit
  12. 29 August 2019, 2 commits
  13. 17 July 2019, 1 commit
  14. 06 July 2019, 1 commit
    • blk-iolatency: fix STS_AGAIN handling · c9b3007f
      Committed by Dennis Zhou
      The iolatency controller is based on rq_qos. It increments on
      rq_qos_throttle() and decrements on either rq_qos_cleanup() or
      rq_qos_done_bio(). a3fb01ba fixes the double accounting issue where
      blk_mq_make_request() may call both rq_qos_cleanup() and
      rq_qos_done_bio() on REQ_NO_WAIT. So checking STS_AGAIN prevents the
      double decrement.
      
      The above works upstream, as the only way we can get STS_AGAIN is from
      blk_mq_get_request() failing. The STS_AGAIN handling isn't a real
      problem there, as the bio_endio() skipping only happens on reserved tag
      allocation failures, which can only be caused by driver bugs and which
      already trigger a WARN.
      
      However, the fix creates a not-so-great dependency on how STS_AGAIN can
      be propagated. Internally, we (Facebook) carry a patch that kills readahead
      if a cgroup is IO-congested or a fatal signal is pending. Combined with the
      fact that chained bios propagate their bi_status to the parent if it is not
      already set, this can cause the parent bio to not clean up properly even
      though it was successful. This consequently leaks the inflight counter and
      can hang all IOs under that blkg.
      
      To nip the adverse interaction early, this removes the rq_qos_cleanup()
      callback in iolatency in favor of always cleaning up on the
      rq_qos_done_bio() path.
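
      Conceptually the change amounts to dropping the .cleanup hook from
      iolatency's rq_qos_ops, so the inflight count is only ever dropped on the
      done_bio path. A sketch (handler names as in blk-iolatency.c):

        static struct rq_qos_ops blkcg_iolatency_ops = {
            .throttle = blkcg_iolatency_throttle,
            /* .cleanup removed: the inflight counter is now only decremented in
             * blkcg_iolatency_done_bio(), which runs for every tracked bio. */
            .done_bio = blkcg_iolatency_done_bio,
            .exit     = blkcg_iolatency_exit,
        };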
      
      Fixes: a3fb01ba ("blk-iolatency: only account submitted bios")
      Debugged-by: Tejun Heo <tj@kernel.org>
      Debugged-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  15. 20 June 2019, 1 commit
    • blk-iolatency: only account submitted bios · a3fb01ba
      Committed by Dennis Zhou
      As is, iolatency recognizes done_bio and cleanup as ending paths. If a
      bio is marked REQ_NOWAIT and fails to get a request, the bio is
      cleaned up via rq_qos_cleanup() and ended in bio_wouldblock_error().
      This results in underflowing the inflight counter. Fix this by only
      accounting bios that were actually submitted.
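
      As the STS_AGAIN commit above describes, the check boils down to bailing
      out of the completion hook for bios that never made it past request
      allocation. A minimal sketch of that check at the top of iolatency's
      done_bio handler:

        static void blkcg_iolatency_done_bio(struct rq_qos *rqos, struct bio *bio)
        {
            /* A REQ_NOWAIT bio that failed allocation was already handled by
             * rq_qos_cleanup(); decrementing again would underflow inflight. */
            if (bio->bi_status == BLK_STS_AGAIN)
                return;
            /* ... normal inflight accounting and latency recording ... */
        }
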
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  16. 16 June 2019, 1 commit
  17. 01 May 2019, 1 commit
  18. 21 March 2019, 1 commit
  19. 09 February 2019, 2 commits
    • Blk-iolatency: warn on negative inflight IO counter · 391f552a
      Committed by Liu Bo
      This is to catch any unexpected negative value of inflight IO counter.
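
      A minimal sketch of the check, at the point where the per-cgroup inflight
      counter is dropped on the completion path (rqw being the group's struct
      rq_wait):

        int inflight = atomic_dec_return(&rqw->inflight);

        WARN_ON_ONCE(inflight < 0);     /* flag unexpected underflow */
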
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-iolatency: fix IO hang due to negative inflight counter · 8c772a9b
      Committed by Liu Bo
      Our test reported the following stack, and the vmcore showed that the
      ->inflight counter was -1.
      
      [ffffc9003fcc38d0] __schedule at ffffffff8173d95d
      [ffffc9003fcc3958] schedule at ffffffff8173de26
      [ffffc9003fcc3970] io_schedule at ffffffff810bb6b6
      [ffffc9003fcc3988] blkcg_iolatency_throttle at ffffffff813911cb
      [ffffc9003fcc3a20] rq_qos_throttle at ffffffff813847f3
      [ffffc9003fcc3a48] blk_mq_make_request at ffffffff8137468a
      [ffffc9003fcc3b08] generic_make_request at ffffffff81368b49
      [ffffc9003fcc3b68] submit_bio at ffffffff81368d7d
      [ffffc9003fcc3bb8] ext4_io_submit at ffffffffa031be00 [ext4]
      [ffffc9003fcc3c00] ext4_writepages at ffffffffa03163de [ext4]
      [ffffc9003fcc3d68] do_writepages at ffffffff811c49ae
      [ffffc9003fcc3d78] __filemap_fdatawrite_range at ffffffff811b6188
      [ffffc9003fcc3e30] filemap_write_and_wait_range at ffffffff811b6301
      [ffffc9003fcc3e60] ext4_sync_file at ffffffffa030cee8 [ext4]
      [ffffc9003fcc3ea8] vfs_fsync_range at ffffffff8128594b
      [ffffc9003fcc3ee8] do_fsync at ffffffff81285abd
      [ffffc9003fcc3f18] sys_fsync at ffffffff81285d50
      [ffffc9003fcc3f28] do_syscall_64 at ffffffff81003c04
      [ffffc9003fcc3f50] entry_SYSCALL_64_after_swapgs at ffffffff81742b8e
      
      The ->inflight counter may be negative (-1) if
      
      1) blk-iolatency was disabled when the IO was issued,
      
      2) blk-iolatency was enabled before this IO reached its endio,
      
      3) the ->inflight counter is decreased from 0 to -1 in endio()
      
      In fact the hang can be easily reproduced by the below script,
      
      H=/sys/fs/cgroup/unified/
      P=/sys/fs/cgroup/unified/test
      
      echo "+io" > $H/cgroup.subtree_control
      mkdir -p $P
      
      echo $$ > $P/cgroup.procs
      
      xfs_io -f -d -c "pwrite 0 4k" /dev/sdg
      
      echo "`cat /sys/block/sdg/dev` target=1000000" > $P/io.latency
      
      xfs_io -f -d -c "pwrite 0 4k" /dev/sdg
      
      This fixes the problem by freezing the queue so that, while iolatency is
      being enabled or disabled, there is no inflight rq running.
      
      Note that quiescing the queue is not needed, as this only updates the
      iolatency configuration, which the request_queue's dispatch path doesn't
      care about.
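
      A rough sketch of the idea in iolatency_set_limit() (the enable/disable
      delta variable is paraphrased here; the point is that the enabled counter
      only changes with the queue frozen):

        if (enable_delta) {
            /* No rq can be in flight with a stale enabled/disabled view. */
            blk_mq_freeze_queue(blkg->q);

            if (enable_delta == 1)
                atomic_inc(&blkiolat->enabled);
            else if (enable_delta == -1)
                atomic_dec(&blkiolat->enabled);

            blk_mq_unfreeze_queue(blkg->q);
        }
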
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  20. 18 December 2018, 1 commit
    • block: fix blk-iolatency accounting underflow · 13369816
      Committed by Dennis Zhou
      The blk-iolatency controller measures the time from rq_qos_throttle() to
      rq_qos_done_bio() and attributes this time to the first bio that needs
      to create the request. This means if a bio is plug-mergeable or
      bio-mergeable, it gets to bypass the blk-iolatency controller.
      
      The recent series [1], which tags all bios with blkgs, undermined how
      iolatency determined which bios it was charging and which it should process
      in rq_qos_done_bio(). Because all bios are now being tagged, the atomic_t
      inflight count in struct rq_wait underflowed, resulting in a stall.
      
      This patch adds a new flag BIO_TRACKED to let controllers know that a
      bio is going through the rq_qos path. blk-iolatency now checks if this
      flag is set to see if it should process the bio in rq_qos_done_bio().
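
      In simplified sketch form (argument lists trimmed), the flag is set on the
      throttle path and consumed on completion:

        void rq_qos_throttle(struct request_queue *q, struct bio *bio)
        {
            /* Mark bios that actually entered the rq_qos path. */
            bio_set_flag(bio, BIO_TRACKED);
            if (q->rq_qos)
                __rq_qos_throttle(q->rq_qos, bio);
        }

        static void blkcg_iolatency_done_bio(struct rq_qos *rqos, struct bio *bio)
        {
            if (!bio_flagged(bio, BIO_TRACKED))
                return;     /* bio never went through rq_qos; nothing to undo */
            /* ... inflight accounting and latency recording ... */
        }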
      
      Overloading BIO_QUEUE_ENTERED works, but makes the flag rules confusing.
      BIO_THROTTLED was another candidate, but that flag is set for all bios
      that have gone through the blk-throttle code. Overloading a flag comes with
      the burden of making sure that when either implementation changes, a
      change in the setting rules for one doesn't cause a bug in the other. So
      here, we unfortunately opt for adding a new flag.
      
      [1] https://lore.kernel.org/lkml/20181205171039.73066-1-dennis@kernel.org/
      
      Fixes: 5cdf2e3f ("blkcg: associate blkg when associating a device")
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  21. 08 December 2018, 8 commits
  22. 16 November 2018, 2 commits
  23. 02 November 2018, 1 commit
  24. 27 October 2018, 1 commit
  25. 29 September 2018, 5 commits
    • blk-iolatency: keep track of previous windows stats · 451bb7c3
      Committed by Josef Bacik
      We apply a smoothing to the scale changes in order to keep sawtooth
      behavior from occurring.  However, our window for checking whether we've
      missed our target can sometimes be shorter than the smoothing interval
      (500ms), especially on faster drives like SSDs.  In order to deal with
      this, keep a running tally of the previous intervals that we threw away
      because we had already done a scale event recently.
      
      This is needed for the SSD case, as these low-latency drives will have
      bursts of latency, and if latency happens to be fine for the window that
      directly follows the opening of the scale window, we could unthrottle even
      though we were missing our target in previous windows.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-iolatency: use a percentile approach for ssd's · 1fa2840e
      Committed by Josef Bacik
      We use an average latency approach for determining if we're missing our
      latency target.  This works well for rotational storage, where we have
      generally consistent latencies, but SSDs and other low-latency devices show
      more spiky behavior, which means we often won't throttle misbehaving groups
      because a lot of IO completes at drastically faster times than our latency
      target.  Instead, keep track of how many IOs miss our target and how many
      IOs are done in our time window.  If the p(90) latency is above our target,
      then we know we need to throttle.  With this change in place we are seeing
      the same throttling behavior with our testcase on SSDs as we see with
      rotational drives.
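
      The check reduces to comparing the number of missed IOs in the window
      against a tenth of the window's total, i.e. throttle when more than 10% of
      IOs missed the target. A sketch of the SSD branch (struct and field names
      follow blk-iolatency's latency_stat, but treat this as illustrative):

        static bool latency_sum_ok(struct iolatency_grp *iolat,
                                   struct latency_stat *stat)
        {
            if (iolat->ssd) {
                u64 thresh = div64_u64(stat->ps.total, 10);

                thresh = max(thresh, 1ULL);
                /* p90 is within target iff fewer than 10% of IOs missed it. */
                return stat->ps.missed < thresh;
            }
            /* Rotational devices keep using the mean-latency comparison. */
            return stat->rqs.mean <= iolat->min_lat_nsec;
        }
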
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-iolatency: deal with small samples · 22ed8a93
      Committed by Josef Bacik
      There is logic to keep cgroups that haven't done a lot of IO in the most
      recent scale window from being punished for over-active higher-priority
      groups.  However, for things like SSDs, where the windows are pretty short,
      we'll end up with small numbers of samples, so 5% of the samples will come
      out to 0 if there aren't enough.  Make the floor 1 sample to keep us from
      improperly bailing out of scaling down.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-iolatency: deal with nr_requests == 1 · 9f60511a
      Committed by Josef Bacik
      Hitting the case where blk_queue_depth() returned 1 uncovered the fact
      that iolatency doesn't actually handle this case properly: it simply
      doesn't scale down anybody.  For this case we should go straight into
      applying the time delay, which we weren't doing.  Since we already limit
      the floor at 1 request, this if statement is not needed, and removing it
      allows us to set our depth to 1, which allows us to apply the delay if
      needed.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blk-iolatency: use q->nr_requests directly · ff4cee08
      Committed by Josef Bacik
      We were using blk_queue_depth() assuming that it would return nr_requests,
      but we hit a case in production on drives that had to have NCQ turned off
      in order for them to not shit the bed, which resulted in a queue depth of
      1 even though nr_requests was much larger.  iolatency really only cares
      about requests we are allowed to queue up, as any IO that gets onto the
      request list is going to be serviced soonish, so we want to be throttling
      before the bio gets onto the request list.  To make iolatency work as
      expected, simply use q->nr_requests instead of blk_queue_depth(), as that
      is what we actually care about.
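
      The change itself is essentially a one-liner where iolatency sizes its
      throttling depth; sketched here against the rq_depth field of struct
      iolatency_grp (treat the exact location as an assumption):

        /* Size the throttling depth from the request list, not the device's
         * advertised queue depth (which NCQ-off drives report as 1). */
        iolat->rq_depth.queue_depth = blkg->q->nr_requests;
        /* previously: iolat->rq_depth.queue_depth = blk_queue_depth(blkg->q); */
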
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  26. 22 September 2018, 1 commit