1. 29 Aug 2019 (2 commits)
  2. 17 Jul 2019 (1 commit)
  3. 06 Jul 2019 (1 commit)
    • D
      blk-iolatency: fix STS_AGAIN handling · c9b3007f
      Authored by Dennis Zhou
      The iolatency controller is based on rq_qos. It increments on
      rq_qos_throttle() and decrements on either rq_qos_cleanup() or
      rq_qos_done_bio(). a3fb01ba fixes the double accounting issue where
      blk_mq_make_request() may call both rq_qos_cleanup() and
      rq_qos_done_bio() on REQ_NOWAIT. So checking STS_AGAIN prevents the
      double decrement.
      
      The above works upstream, as the only way we can get STS_AGAIN is from
      blk_mq_get_request() failing. The STS_AGAIN handling isn't a real
      problem, as bio_endio() skipping only happens on reserved tag allocation
      failures, which can only be caused by driver bugs and already trigger a
      WARN.
      
      However, the fix creates a not-so-great dependency on how STS_AGAIN can
      be propagated. Internally, we (Facebook) carry a patch that kills
      readahead if a cgroup is io congested or a fatal signal is pending.
      Combined with the fact that chained bios propagate their bi_status to
      the parent if it is not already set, this can cause the parent bio to
      not clean up properly even though it was successful. This consequently
      leaks the inflight counter and can hang all IOs under that blkg.
      
      To nip the adverse interaction early, this removes the rq_qos_cleanup()
      callback in iolatency in favor of cleaning up always on the
      rq_qos_done_bio() path.
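The accounting invariant the fix restores can be sketched as a tiny userspace model (all names here are illustrative, not the kernel's): once the cleanup callback no longer touches the counter, only the done path decrements, so a bio that happens to traverse both paths is decremented exactly once.

```c
#include <assert.h>

/* Illustrative userspace model, not kernel code: after the fix, the
 * cleanup callback is a no-op for accounting, so a bio that runs both
 * cleanup and done_bio is decremented exactly once. */
struct model_iolat { int inflight; };

static void model_throttle(struct model_iolat *m) { m->inflight++; }

/* rq_qos_cleanup() analogue: no longer touches the counter */
static void model_cleanup(struct model_iolat *m) { (void)m; }

/* rq_qos_done_bio() analogue: the single decrement point */
static void model_done_bio(struct model_iolat *m) { m->inflight--; }
```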
      
      Fixes: a3fb01ba ("blk-iolatency: only account submitted bios")
      Debugged-by: Tejun Heo <tj@kernel.org>
      Debugged-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      c9b3007f
  4. 20 Jun 2019 (1 commit)
    • D
      blk-iolatency: only account submitted bios · a3fb01ba
      Authored by Dennis Zhou
      As is, iolatency recognizes done_bio and cleanup as ending paths. If a
      request is marked REQ_NOWAIT and fails to get a request, the bio is
      cleaned up via rq_qos_cleanup() and ended in bio_wouldblock_error().
      This results in underflowing the inflight counter. Fix this by only
      accounting bios that were actually submitted.
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      a3fb01ba
  5. 16 Jun 2019 (1 commit)
  6. 01 May 2019 (1 commit)
  7. 21 Mar 2019 (1 commit)
  8. 09 Feb 2019 (2 commits)
    • L
      Blk-iolatency: warn on negative inflight IO counter · 391f552a
      Authored by Liu Bo
      This is to catch any unexpected negative value of inflight IO counter.
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      391f552a
    • L
      blk-iolatency: fix IO hang due to negative inflight counter · 8c772a9b
      Authored by Liu Bo
      Our test reported the following stack, and the vmcore showed that the
      ->inflight counter was -1.
      
      [ffffc9003fcc38d0] __schedule at ffffffff8173d95d
      [ffffc9003fcc3958] schedule at ffffffff8173de26
      [ffffc9003fcc3970] io_schedule at ffffffff810bb6b6
      [ffffc9003fcc3988] blkcg_iolatency_throttle at ffffffff813911cb
      [ffffc9003fcc3a20] rq_qos_throttle at ffffffff813847f3
      [ffffc9003fcc3a48] blk_mq_make_request at ffffffff8137468a
      [ffffc9003fcc3b08] generic_make_request at ffffffff81368b49
      [ffffc9003fcc3b68] submit_bio at ffffffff81368d7d
      [ffffc9003fcc3bb8] ext4_io_submit at ffffffffa031be00 [ext4]
      [ffffc9003fcc3c00] ext4_writepages at ffffffffa03163de [ext4]
      [ffffc9003fcc3d68] do_writepages at ffffffff811c49ae
      [ffffc9003fcc3d78] __filemap_fdatawrite_range at ffffffff811b6188
      [ffffc9003fcc3e30] filemap_write_and_wait_range at ffffffff811b6301
      [ffffc9003fcc3e60] ext4_sync_file at ffffffffa030cee8 [ext4]
      [ffffc9003fcc3ea8] vfs_fsync_range at ffffffff8128594b
      [ffffc9003fcc3ee8] do_fsync at ffffffff81285abd
      [ffffc9003fcc3f18] sys_fsync at ffffffff81285d50
      [ffffc9003fcc3f28] do_syscall_64 at ffffffff81003c04
      [ffffc9003fcc3f50] entry_SYSCALL_64_after_swapgs at ffffffff81742b8e
      
      The ->inflight counter may be negative (-1) if
      
      1) blk-iolatency was disabled when the IO was issued,
      
      2) blk-iolatency was enabled before this IO reached its endio,
      
      3) the ->inflight counter is decreased from 0 to -1 in endio()
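      The three-step sequence above can be modeled in a few lines of userspace C (illustrative only, not kernel structures): an IO issued while accounting is disabled never increments the counter, but if accounting is enabled before its endio, the completion still decrements it.

```c
#include <stdbool.h>

/* Illustrative userspace model, not kernel code: toggling accounting
 * between issue and completion underflows the inflight counter. */
struct model_blkg { bool iolat_enabled; int inflight; };

static void model_issue(struct model_blkg *g)
{
    if (g->iolat_enabled)   /* issued while disabled: no increment */
        g->inflight++;
}

static void model_endio(struct model_blkg *g)
{
    if (g->iolat_enabled)   /* enabled by completion time: decrement */
        g->inflight--;
}
```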
      
      In fact, the hang can easily be reproduced with the script below,
      
      H=/sys/fs/cgroup/unified/
      P=/sys/fs/cgroup/unified/test
      
      echo "+io" > $H/cgroup.subtree_control
      mkdir -p $P
      
      echo $$ > $P/cgroup.procs
      
      xfs_io -f -d -c "pwrite 0 4k" /dev/sdg
      
      echo "`cat /sys/block/sdg/dev` target=1000000" > $P/io.latency
      
      xfs_io -f -d -c "pwrite 0 4k" /dev/sdg
      
      This fixes the problem by freezing the queue so that no rq is in
      flight while iolatency is being enabled or disabled.
      
      Note that quiesce_queue is not needed, as this only updates the
      iolatency configuration, which the request_queue's dispatch path does
      not care about.
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      8c772a9b
  9. 18 Dec 2018 (1 commit)
    • D
      block: fix blk-iolatency accounting underflow · 13369816
      Authored by Dennis Zhou
      The blk-iolatency controller measures the time from rq_qos_throttle() to
      rq_qos_done_bio() and attributes this time to the first bio that needs
      to create the request. This means if a bio is plug-mergeable or
      bio-mergeable, it gets to bypass the blk-iolatency controller.
      
      The recent series [1], which tags all bios with blkgs, undermined how
      iolatency was determining which bios it was charging and should process
      in rq_qos_done_bio(). Because all bios are now tagged, the atomic_t
      inflight count in struct rq_wait underflowed, resulting in a stall.
      
      This patch adds a new flag BIO_TRACKED to let controllers know that a
      bio is going through the rq_qos path. blk-iolatency now checks if this
      flag is set to see if it should process the bio in rq_qos_done_bio().
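      The shape of the check can be sketched with a userspace model (illustrative names, not the kernel's actual structures): the flag is set only when a bio actually passes through the throttle path, and completion accounting is skipped for bios that never did, such as merged bios.

```c
#include <stdbool.h>

/* Illustrative model of the BIO_TRACKED idea, not kernel code. */
struct model_bio { bool tracked; };
struct model_rqw { int inflight; };

static void model_rqw_throttle(struct model_rqw *rqw, struct model_bio *bio)
{
    bio->tracked = true;    /* tag the bio as having gone through rq_qos */
    rqw->inflight++;
}

static void model_rqw_done_bio(struct model_rqw *rqw, struct model_bio *bio)
{
    if (!bio->tracked)      /* merged bios were never throttled: skip */
        return;
    rqw->inflight--;
}
```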
      
      Overloading BIO_QUEUE_ENTERED works, but makes the flag rules confusing.
      BIO_THROTTLED was another candidate, but the flag is set for all bios
      that have gone through blk-throttle code. Overloading a flag comes with
      the burden of making sure that when either implementation changes, a
      change in setting rules for one doesn't cause a bug in the other. So
      here, we unfortunately opt for adding a new flag.
      
      [1] https://lore.kernel.org/lkml/20181205171039.73066-1-dennis@kernel.org/
      
      Fixes: 5cdf2e3f ("blkcg: associate blkg when associating a device")
      Signed-off-by: Dennis Zhou <dennis@kernel.org>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      13369816
  10. 08 Dec 2018 (8 commits)
  11. 16 Nov 2018 (2 commits)
  12. 02 Nov 2018 (1 commit)
  13. 27 Oct 2018 (1 commit)
  14. 29 Sep 2018 (5 commits)
    • J
      blk-iolatency: keep track of previous windows stats · 451bb7c3
      Authored by Josef Bacik
      We apply a smoothing to the scale changes in order to keep sawtoothy
      behavior from occurring.  However, our window for checking if we've
      missed our target can sometimes be shorter than the smoothing interval
      (500ms), especially on faster drives like ssd's.  In order to deal with
      this, keep a running tally of the previous intervals that we threw away
      because we had already done a scale event recently.
      
      This is needed for the ssd case, as these low latency drives will have
      bursts of latency, and if the window that directly follows the opening
      of the scale window happens to be ok, we could unthrottle even though
      we were missing our target in previous windows.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      451bb7c3
    • J
      blk-iolatency: use a percentile approach for ssd's · 1fa2840e
      Authored by Josef Bacik
      We use an average latency approach for determining if we're missing our
      latency target.  This works well for rotational storage where we have
      generally consistent latencies, but for ssd's and other low latency
      devices you have more of a spikey behavior, which means we often won't
      throttle misbehaving groups because a lot of IO completes at drastically
      faster times than our latency target.  Instead keep track of how many
      IO's miss our target and how many IO's are done in our time window.  If
      the p(90) latency is above our target then we know we need to throttle.
      With this change in place we are seeing the same throttling behavior
      with our testcase on ssd's as we see with rotational drives.
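      The p(90) criterion described above can be sketched as a miss-counting check (a hedged model, not the kernel's actual calculation): if more than 10% of the IOs completed in the window exceeded the target latency, the 90th-percentile latency must be above the target.

```c
#include <stdbool.h>

/* Illustrative model of the p(90) check, not the kernel's actual math:
 * if more than 10% of the window's IOs missed the latency target, then
 * the 90th-percentile latency is above the target. */
static bool p90_above_target(unsigned long missed, unsigned long total)
{
    if (total == 0)
        return false;           /* no samples, nothing to conclude */
    return missed * 10 > total; /* strictly more than 10% missed */
}
```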
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      1fa2840e
    • J
      blk-iolatency: deal with small samples · 22ed8a93
      Authored by Josef Bacik
      There is logic to keep cgroups that haven't done a lot of IO in the most
      recent scale window from being punished for over-active higher priority
      groups.  However for things like ssd's where the windows are pretty
      short we'll end up with small numbers of samples, so 5% of samples will
      come out to 0 if there aren't enough.  Make the floor 1 sample to keep
      us from improperly bailing out of scaling down.
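      The 5%-of-samples floor can be sketched as follows (an illustrative helper, not the kernel's code): integer division makes 5% of a small sample count come out to 0, so the threshold is clamped to at least one sample.

```c
/* Illustrative: require 5% of the window's samples, but never fewer
 * than 1, so tiny ssd windows still produce a usable threshold. */
static unsigned long min_samples(unsigned long total)
{
    unsigned long thresh = total / 20;  /* 5% of samples */
    return thresh ? thresh : 1;         /* floor of 1 sample */
}
```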
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      22ed8a93
    • J
      blk-iolatency: deal with nr_requests == 1 · 9f60511a
      Authored by Josef Bacik
      Hitting the case where blk_queue_depth() returned 1 uncovered the fact
      that iolatency doesn't actually handle this case properly; it simply
      doesn't scale down anybody.  For this case we should go straight into
      applying the time delay, which we weren't doing.  Since we already
      limit the floor at 1 request, this if statement is not needed, and
      removing it allows us to set our depth to 1, which allows us to apply
      the delay if needed.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      9f60511a
    • J
      blk-iolatency: use q->nr_requests directly · ff4cee08
      Authored by Josef Bacik
      We were using blk_queue_depth() assuming that it would return
      nr_requests, but we hit a case in production on drives that had to have
      NCQ turned off in order for them to not shit the bed, which resulted in
      a qd of 1 even though nr_requests was much larger.  iolatency really
      only cares about the requests we are allowed to queue up, as any io
      that gets onto the request list is going to be serviced soonish, so we
      want to be throttling before the bio gets onto the request list.  To
      make iolatency work as expected, simply use q->nr_requests instead of
      blk_queue_depth(), as that is what we actually care about.
      Signed-off-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      ff4cee08
  15. 22 Sep 2018 (5 commits)
  16. 14 Sep 2018 (1 commit)
  17. 02 Aug 2018 (1 commit)
  18. 01 Aug 2018 (1 commit)
  19. 17 Jul 2018 (2 commits)
    • J
      blk-iolatency: truncate our current time · 71e9690b
      Authored by Josef Bacik
      In our longer tests we noticed that some boxes would degrade to the
      point of uselessness.  This is because we truncate the current time
      when saving it in our bio, but we were using the raw current time to
      subtract from.  So once the box had been up a certain amount of time,
      it would appear as if our IOs were taking several years to complete.
      Fix this by truncating the current time so it matches the issue time.
      Verified this worked by running with this patch for a week on our test
      tier.
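      The mismatch can be sketched as follows. The 51-bit width mirrors upstream's BIO_ISSUE_TIME_BITS, but treat the constant and all names here as assumptions of this model rather than the kernel's exact code.

```c
#include <stdint.h>

/* bio_issue keeps only the low bits of the issue timestamp.  Subtracting
 * a truncated issue time from an untruncated "now" blows up once uptime
 * crosses the truncated range; truncating "now" the same way fixes it. */
#define MODEL_TIME_MASK ((1ULL << 51) - 1)  /* assumed 51-bit time field */

static uint64_t model_issue_time(uint64_t now_ns)
{
    return now_ns & MODEL_TIME_MASK;        /* what gets stored in the bio */
}

static uint64_t model_latency_ns(uint64_t now_ns, uint64_t issue_ns)
{
    /* the fix: truncate "now" so it matches the stored issue time */
    return model_issue_time(now_ns) - issue_ns;
}
```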
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      71e9690b
    • J
      blk-iolatency: don't change the latency window · d607eefa
      Authored by Josef Bacik
      Early versions of these patches had us waiting for seconds at a time
      during submission, so we had to adjust the timing window we monitored
      for latency.  Now we don't do things like that, so this code is
      unnecessary.
      Signed-off-by: Josef Bacik <jbacik@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      d607eefa
  20. 11 Jul 2018 (2 commits)