1. 23 Aug, 2022 2 commits
  2. 15 Jul, 2022 1 commit
  3. 19 May, 2022 1 commit
    • bfq: Relax waker detection for shared queues · f9506673
      Committed by Jan Kara
      Currently we look for a waker only if the current queue has no
      requests. This makes sense for bfq queues with a single process.
      However, for shared queues with a larger number of processes, the
      condition that the queue has no requests is difficult to meet: often
      at least one process has some request in flight although all the
      others are waiting for the waker to do the work, and this harms
      throughput. Relax the "no queued request for bfq queue" condition to
      "the current task has no queued requests yet". For this, we also need
      to start tracking the number of requests in flight for each task.
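
The relaxed condition can be pictured with a minimal sketch. The structures, the `requests_in_flight` field, and both helper names below are hypothetical and chosen only for illustration; they are not the identifiers used in the patch.

```c
#include <stdbool.h>

/* Hypothetical per-task I/O state (illustrative name and field). */
struct task_io_ctx {
    unsigned int requests_in_flight; /* requests this task still has pending */
};

/* Hypothetical queue shared by several tasks. */
struct shared_queue {
    unsigned int queued_requests; /* requests queued by all tasks together */
};

/* Old condition: look for a waker only when the whole queue is empty. */
static bool may_detect_waker_old(const struct shared_queue *q)
{
    return q->queued_requests == 0;
}

/* Relaxed condition: only the current task must have nothing pending. */
static bool may_detect_waker_relaxed(const struct task_io_ctx *task)
{
    return task->requests_in_flight == 0;
}
```

With a shared queue that still holds a request from some other task, the old check fails while the relaxed check succeeds for a task that has nothing in flight.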
      
      This patch (together with the following one) restores the performance
      for dbench with 128 clients that regressed with commit c65e6fd4
      ("bfq: Do not let waker requests skip proper accounting") because
      this commit makes requests of wakers properly enter BFQ queues and thus
      these queues become ineligible for the old waker detection logic.
      Dbench results:
      
               Vanilla 5.18-rc3        5.18-rc3 + revert      5.18-rc3 patched
      Mean     1237.36 (   0.00%)      950.16 *  23.21%*      988.35 *  20.12%*
      
      Numbers are time to complete workload so lower is better.
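
The percentages in the table appear to be the reduction in completion time relative to the vanilla kernel. A small sketch of that arithmetic, with a hypothetical helper name:

```c
/* Relative reduction in completion time versus a baseline, in percent.
 * Hypothetical helper, used only to reproduce the table's arithmetic. */
static double improvement_percent(double baseline, double measured)
{
    return (baseline - measured) / baseline * 100.0;
}
```

Calling it with the table's values (1237.36 as baseline, then 950.16 and 988.35) gives roughly 23.21 and 20.12.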
      
      Fixes: c65e6fd4 ("bfq: Do not let waker requests skip proper accounting")
      Signed-off-by: Jan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20220519105235.31397-1-jack@suse.cz
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  4. 03 May, 2022 1 commit
  5. 18 Apr, 2022 3 commits
  6. 18 Feb, 2022 1 commit
  7. 12 Feb, 2022 1 commit
  8. 29 Nov, 2021 4 commits
  9. 25 Aug, 2021 1 commit
  10. 18 Aug, 2021 1 commit
  11. 26 Mar, 2021 1 commit
    • block, bfq: merge bursts of newly-created queues · 430a67f9
      Committed by Paolo Valente
      Many throughput-sensitive workloads are made of several parallel I/O
      flows, with all flows generated by the same application, or more
      generically by the same task (e.g., system boot). The most
      counterproductive action with these workloads is plugging I/O dispatch
      when one of the bfq_queues associated with these flows remains
      temporarily empty.
      
      To avoid this plugging, BFQ has been using a burst-handling mechanism
      for years now. This mechanism has proven effective for throughput, and
      not detrimental for service guarantees. This commit pushes this
      mechanism a little bit further, building on the following two facts.
      
      First, all the I/O flows of the same application or task contribute
      to the execution/completion of that common application or task. So the
      performance figures that matter are total throughput of the flows and
      task-wide I/O latency.  In particular, these flows do not need to be
      protected from each other, in terms of individual bandwidth or
      latency.
      
      Second, the above fact holds regardless of the number of flows.
      
      Putting these two facts together, this commit stably merges the
      bfq_queues associated with these I/O flows, i.e., with the processes
      that generate these I/O flows, regardless of how many processes are
      involved.
      
      To decide whether a set of bfq_queues is actually associated with the
      I/O flows of a common application or task, and to merge these queues
      stably, this commit operates as follows: given a bfq_queue, say Q2,
      currently being created, and the last bfq_queue, say Q1, created
      before Q2, Q2 is merged stably with Q1 if
      - very little time has elapsed since Q1 was created
      - Q2 has the same ioprio as Q1
      - Q2 belongs to the same group as Q1
      
      Merging bfq_queues also reduces scheduling overhead. A fio test with
      ten random readers on /dev/nullb shows a throughput boost of 40%, with
      a quadcore. Since BFQ's execution time amounts to ~50% of the total
      per-request processing time, the above throughput boost implies that
      BFQ's overhead is reduced by more than 50%.
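
The "more than 50%" figure can be checked with a short derivation. It assumes that throughput on /dev/nullb is inversely proportional to the total per-request processing time, and that the non-BFQ part of that time is unchanged by the commit; neither assumption is stated explicitly in the message.

```latex
% t: total per-request processing time before the commit
% b: BFQ's share of it, r: the rest
\begin{align*}
  b &= 0.5\,t, \qquad r = 0.5\,t \\
  t' &= \frac{t}{1.4} \approx 0.714\,t
      && \text{(40\% throughput boost)} \\
  b' &= t' - r \approx 0.214\,t \\
  1 - \frac{b'}{b} &\approx 1 - \frac{0.214}{0.5} \approx 0.57
\end{align*}
```

Under these assumptions BFQ's own per-request time drops by roughly 57%.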
      Tested-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Link: https://lore.kernel.org/r/20210304174627.161-7-paolo.valente@linaro.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  12. 26 Jan, 2021 4 commits
  13. 18 Aug, 2020 1 commit
    • bfq: fix blkio cgroup leakage v4 · 2de791ab
      Committed by Dmitry Monakhov
      Changes from v1:
          - update commit description with proper ref-accounting justification
      
      commit db37a34c ("block, bfq: get a ref to a group when adding it to a service tree")
      introduced a leak of bfq_group and blkcg_gq objects because of a
      get/put imbalance.
      In fact, the whole idea of the original commit is wrong, because a
      bfq_group entity cannot disappear under us: it is referenced by its
      child bfq_queue entities from here:
       -> bfq_init_entity()
          ->bfqg_and_blkg_get(bfqg);
          ->entity->parent = bfqg->my_entity
      
       -> bfq_put_queue(bfqq)
          FINAL_PUT
          ->bfqg_and_blkg_put(bfqq_group(bfqq))
          ->kmem_cache_free(bfq_pool, bfqq);
      
      So the parent entity cannot disappear while a child entity is in the
      tree, and child entities already have proper protection.
      This patch reverts commit db37a34c ("block, bfq: get a ref to a group when adding it to a service tree")
      
      bfq_group leak trace caused by bad commit:
      -> blkg_alloc
         -> bfq_pd_alloc
           -> bfqg_get (+1)
      ->bfq_activate_bfqq
        ->bfq_activate_requeue_entity
          -> __bfq_activate_entity
             ->bfq_get_entity
               ->bfqg_and_blkg_get (+1)  <==== : Note1
      ->bfq_del_bfqq_busy
        ->bfq_deactivate_entity+0x53/0xc0 [bfq]
          ->__bfq_deactivate_entity+0x1b8/0x210 [bfq]
            -> bfq_forget_entity(is_in_service = true)
      	 entity->on_st_or_in_serv = false   <=== :Note2
      	 if (is_in_service)
      	     return;  ==> do not touch reference
      -> blkcg_css_offline
       -> blkcg_destroy_blkgs
        -> blkg_destroy
         -> bfq_pd_offline
          -> __bfq_deactivate_entity
               if (!entity->on_st_or_in_serv) /* true, because of (Note2) */
      		return false;
       -> bfq_pd_free
          -> bfqg_put() (-1, but bfqg->ref == 2) because of (Note2)
      So bfq_group and blkcg_gq will leak forever; see the test case below.
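
The get/put imbalance in the trace can be modeled with a toy reference counter. This is a simplified sketch with hypothetical names, not the kernel code: it only counts the references that the trace marks with (+1) and (-1).

```c
#include <assert.h>

/* Toy stand-in for a refcounted bfq_group. */
struct group {
    int ref;
};

static void group_get(struct group *g)
{
    g->ref++;
}

/* Drops one reference and returns how many remain (0 means "freeable"). */
static int group_put(struct group *g)
{
    assert(g->ref > 0);
    return --g->ref;
}

/* Replays the trace: allocation (+1), activation (+1), deactivation,
 * final put (-1). When the entity is deactivated while in service, the
 * forget path returns early and the activation reference is never dropped. */
static int refs_after_lifecycle(int deactivated_in_service)
{
    struct group g = { .ref = 0 };

    group_get(&g);             /* reference taken at allocation */
    group_get(&g);             /* extra reference taken on activation (Note1) */
    if (!deactivated_in_service)
        group_put(&g);         /* normal forget path drops it again */
    return group_put(&g);      /* final put when the group is freed */
}
```

In this model `refs_after_lifecycle(1)` leaves one reference behind, which corresponds to the leaked group, while `refs_after_lifecycle(0)` reaches zero.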
      
      ##TESTCASE_BEGIN:
      #!/bin/bash
      
      max_iters=${1:-100}
      #prep cgroup mounts
      mount -t tmpfs cgroup_root /sys/fs/cgroup
      mkdir /sys/fs/cgroup/blkio
      mount -t cgroup -o blkio none /sys/fs/cgroup/blkio
      
      # Prepare blkdev
      grep blkio /proc/cgroups
      truncate -s 1M img
      losetup /dev/loop0 img
      echo bfq > /sys/block/loop0/queue/scheduler
      
      grep blkio /proc/cgroups
      for ((i=0;i<max_iters;i++))
      do
          mkdir -p /sys/fs/cgroup/blkio/a
          echo 0 > /sys/fs/cgroup/blkio/a/cgroup.procs
          dd if=/dev/loop0 bs=4k count=1 of=/dev/null iflag=direct 2> /dev/null
          echo 0 > /sys/fs/cgroup/blkio/cgroup.procs
          rmdir /sys/fs/cgroup/blkio/a
          grep blkio /proc/cgroups
      done
      ##TESTCASE_END:
      
      Fixes: db37a34c ("block, bfq: get a ref to a group when adding it to a service tree")
      Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Signed-off-by: Dmitry Monakhov <dmtrmonakhov@yandex-team.ru>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  14. 22 Mar, 2020 1 commit
  15. 03 Feb, 2020 3 commits
  16. 21 Nov, 2019 1 commit
  17. 08 Nov, 2019 2 commits
  18. 07 Sep, 2019 1 commit
    • bfq: Add per-device weight · 795fe54c
      Committed by Fam Zheng
      This adds to BFQ the missing per-device weight interfaces:
      blkio.bfq.weight_device on legacy and io.bfq.weight on unified. The
      implementation pretty closely resembles what we had in CFQ and the parsing code
      is basically reused.
      
      Tests
      =====
      
      Using two cgroups and three block devices, with weights set up as:
      
      Cgroup          test1           test2
      ============================================
      default         100             500
      sda             500             100
      sdb             default         default
      sdc             200             200
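
The table implies a simple resolution rule: a per-device weight overrides the cgroup's default weight, and devices without an override fall back to the default. A minimal sketch of that rule, with hypothetical types and names (the actual parsing and storage code is not shown in the commit message):

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical per-device override entry. */
struct dev_weight {
    const char *dev;
    unsigned int weight;
};

/* Hypothetical per-cgroup weight configuration. */
struct cgroup_weights {
    unsigned int default_weight;
    const struct dev_weight *overrides;
    size_t n_overrides;
};

/* Returns the per-device weight if one is set, else the default. */
static unsigned int effective_weight(const struct cgroup_weights *cg,
                                     const char *dev)
{
    for (size_t i = 0; i < cg->n_overrides; i++)
        if (strcmp(cg->overrides[i].dev, dev) == 0)
            return cg->overrides[i].weight;
    return cg->default_weight;
}
```

For sdb, where both cgroups use their defaults, this yields weights 100 and 500, i.e. a 1:5 ratio, which is close to the 213 vs 1054 MiB/s reported in the cgroup v1 run.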
      
      cgroup v1 runs
      --------------
      
          sda.test1.out:   READ: bw=913MiB/s
          sda.test2.out:   READ: bw=183MiB/s
      
          sdb.test1.out:   READ: bw=213MiB/s
          sdb.test2.out:   READ: bw=1054MiB/s
      
          sdc.test1.out:   READ: bw=650MiB/s
          sdc.test2.out:   READ: bw=650MiB/s
      
      cgroup v2 runs
      --------------
      
          sda.test1.out:   READ: bw=915MiB/s
          sda.test2.out:   READ: bw=184MiB/s
      
          sdb.test1.out:   READ: bw=216MiB/s
          sdb.test2.out:   READ: bw=1069MiB/s
      
          sdc.test1.out:   READ: bw=621MiB/s
          sdc.test2.out:   READ: bw=622MiB/s
      Signed-off-by: Fam Zheng <zhengfeiran@bytedance.com>
      Acked-by: Tejun Heo <tj@kernel.org>
      Reviewed-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  19. 25 Jun, 2019 1 commit
    • block, bfq: detect wakers and unconditionally inject their I/O · 13a857a4
      Committed by Paolo Valente
      A bfq_queue Q may happen to be synchronized with another
      bfq_queue Q2, i.e., the I/O of Q2 may need to be completed for Q to
      receive new I/O. We call Q2 a "waker queue".
      
      If I/O plugging is being performed for Q, and Q is not receiving any
      more I/O because of the above synchronization, then, thanks to BFQ's
      injection mechanism, the waker queue is likely to get served before
      the I/O-plugging timeout fires.
      
      Unfortunately, this fact may not be sufficient to guarantee a high
      throughput during the I/O plugging, because the inject limit for Q may
      be too low to guarantee a lot of injected I/O. In addition, the
      duration of the plugging, i.e., the time before Q finally receives new
      I/O, may not be minimized, because the waker queue may happen to be
      served only after other queues.
      
      To address these issues, this commit introduces the explicit detection
      of the waker queue, and the unconditional injection of a pending I/O
      request of the waker queue on each invocation of
      bfq_dispatch_request().
      
      One may be concerned that this systematic injection of I/O from the
      waker queue delays the service of Q's I/O. Fortunately, it doesn't. On
      the contrary, Q's next I/O is brought forward dramatically, because it
      is not blocked for milliseconds.
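
A much simplified sketch of the idea follows. The structure and function names are hypothetical, and the real selection logic inside bfq_dispatch_request() handles many more cases; the sketch only shows that, while Q is plugged and empty, a pending request of its detected waker queue is chosen without consulting any inject limit.

```c
#include <stddef.h>

/* Hypothetical, minimal view of a queue. */
struct queue {
    int pending;         /* number of queued requests */
    struct queue *waker; /* detected waker queue, or NULL if none */
};

/* Returns the queue to take the next request from, or NULL to keep
 * plugging (i.e. dispatch nothing for now). */
static struct queue *pick_dispatch_source(struct queue *in_service)
{
    if (in_service->pending > 0)
        return in_service;          /* Q itself has I/O to serve */
    if (in_service->waker && in_service->waker->pending > 0)
        return in_service->waker;   /* inject the waker's I/O unconditionally */
    return NULL;                    /* nothing to inject: keep plugging */
}
```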
      Reported-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
      Tested-by: Srivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  20. 21 Jun, 2019 2 commits
  21. 01 May, 2019 1 commit
  22. 10 Apr, 2019 1 commit
    • block, bfq: fix use after free in bfq_bfqq_expire · eed47d19
      Committed by Paolo Valente
      The function bfq_bfqq_expire() invokes the function
      __bfq_bfqq_expire(), and the latter may free the in-service bfq-queue.
      If this happens, then no other instruction of bfq_bfqq_expire() must
      be executed, or a use-after-free will occur.
      
      Based on the assumption that __bfq_bfqq_expire() invokes
      bfq_put_queue() on the in-service bfq-queue exactly once, the queue is
      assumed to be freed if its refcounter is equal to one right before
      invoking __bfq_bfqq_expire().
      
      But since commit 9dee8b3b ("block, bfq: fix queue removal from
      weights tree") this assumption is false. __bfq_bfqq_expire() may also
      invoke bfq_weights_tree_remove() and, since that commit, the latter
      function may also invoke bfq_put_queue(). So __bfq_bfqq_expire()
      may invoke bfq_put_queue() twice, and this is the actual case where
      the in-service queue may happen to be freed.
      
      To address this issue, this commit moves the check on the refcounter
      of the queue right around the last bfq_put_queue() that may be invoked
      on the queue.
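
The difference between guessing up front and checking at the last put can be sketched with a toy refcounted object. All names here are hypothetical and the control flow is far simpler than the real expire path.

```c
#include <stdbool.h>
#include <stdlib.h>

/* Simplified stand-in for a refcounted bfq_queue. */
struct queue {
    int ref;
};

/* Drops one reference; returns true if that was the last one and the
 * queue has been freed. */
static bool queue_put(struct queue *q)
{
    if (--q->ref == 0) {
        free(q);
        return true;
    }
    return false;
}

/* Old-style guess made before expiring: "freed iff ref == 1". */
static bool old_guess_freed(int ref_before_expire)
{
    return ref_before_expire == 1;
}

/* Hypothetical expire path that drops one reference, or two when the
 * queue is also removed from the weights tree. The "was it freed?"
 * answer comes from whichever put turns out to be the last one. */
static bool expire(struct queue *q, bool remove_from_weights_tree)
{
    if (remove_from_weights_tree && queue_put(q))
        return true;        /* already freed: do not touch q again */
    return queue_put(q);
}
```

With two references and a weights-tree removal, the old guess says "not freed" although both references are dropped, whereas `expire()` reports the free correctly.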
      
      Fixes: 9dee8b3b ("block, bfq: fix queue removal from weights tree")
      Reported-by: Dmitrii Tcvetkov <demfloro@demfloro.ru>
      Reported-by: Douglas Anderson <dianders@chromium.org>
      Tested-by: Dmitrii Tcvetkov <demfloro@demfloro.ru>
      Tested-by: Douglas Anderson <dianders@chromium.org>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  23. 09 Apr, 2019 1 commit
  24. 01 Apr, 2019 4 commits
    • block, bfq: save & resume weight on a queue merge/split · fffca087
      Committed by Francesco Pollicino
      bfq saves the state of a queue each time a merge occurs, to be
      able to resume such a state when the queue is associated again
      with its original process, on a split.
      
      Unfortunately, bfq does not also save & restore the weight of the
      queue. If the weight is not correctly resumed when the queue is
      recycled, then the weight of the recycled queue could differ
      from the weight of the original queue.
      
      This commit adds the missing save & resume of the weight.
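
A minimal sketch of the save/resume pattern, with hypothetical structures: `other_state` is a placeholder for whatever else is saved, and only `weight` corresponds to the field this commit is about.

```c
/* Hypothetical live state of a queue. */
struct queue {
    unsigned int weight;
    unsigned int other_state; /* placeholder for the rest of the state */
};

/* Hypothetical snapshot taken when the queue is merged. */
struct saved_state {
    unsigned int weight;
    unsigned int other_state;
};

/* Called on a merge: remember the queue's state, including its weight. */
static void save_on_merge(const struct queue *q, struct saved_state *s)
{
    s->other_state = q->other_state;
    s->weight = q->weight;  /* the piece that was previously missing */
}

/* Called on a split: restore the state for the original process. */
static void resume_on_split(struct queue *q, const struct saved_state *s)
{
    q->other_state = s->other_state;
    q->weight = s->weight;
}
```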
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Signed-off-by: Francesco Pollicino <fra.fra.800@gmail.com>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, bfq: print SHARED instead of pid for shared queues in logs · 1e66413c
      Committed by Francesco Pollicino
      The function "bfq_log_bfqq" prints the pid of the process
      associated with the queue passed as input.
      
      Unfortunately, if the queue is shared, then more than one process
      is associated with the queue. The pid that gets printed in this
      case is the pid of one of the associated processes.
      Which process gets printed depends on the exact sequence of merge
      events the queue underwent. So printing such a pid is rather
      useless and, above all, often rather confusing, because it
      reports an arbitrary pid among those of the associated processes.
      
      This commit addresses this issue by printing SHARED instead of a pid
      if the queue is shared.
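
The change can be sketched as a small formatting helper. The function name and signature are hypothetical, and the exact log format used by bfq_log_bfqq is not reproduced here; only the "SHARED instead of pid" choice is illustrated.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Writes the identifier to show in a log line for a queue: the literal
 * "SHARED" for a shared queue, otherwise the pid of its only process.
 * Returns the number of characters snprintf() would have written. */
static int format_queue_id(char *buf, size_t len, int pid, bool shared)
{
    if (shared)
        return snprintf(buf, len, "SHARED");
    return snprintf(buf, len, "%d", pid);
}
```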
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Signed-off-by: Francesco Pollicino <fra.fra.800@gmail.com>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, bfq: do not merge queues on flash storage with queueing · 8cacc5ab
      Committed by Paolo Valente
      To boost throughput with a set of processes doing interleaved I/O
      (i.e., a set of processes whose individual I/O is random, but whose
      merged cumulative I/O is sequential), BFQ merges the queues associated
      with these processes, i.e., redirects the I/O of these processes into a
      common, shared queue. In the shared queue, I/O requests are ordered by
      their position on the medium, thus sequential I/O gets dispatched to
      the device when the shared queue is served.
      
      Queue merging costs execution time, because, to detect which queues to
      merge, BFQ must maintain a list of the head I/O requests of active
      queues, ordered by request positions. Measurements showed that this
      costs about 10% of BFQ's total per-request processing time.
      
      Request processing time becomes more and more critical as the speed of
      the underlying storage device grows. Yet, fortunately, queue merging
      is basically useless on the very devices that are so fast as to make
      request processing time critical. To reach a high throughput, these
      devices must have many requests queued at the same time. But, in this
      configuration, the internal scheduling algorithms of these devices
      also do the job of queue merging: they reorder requests so as to
      obtain an I/O pattern that is as sequential as possible. As a
      consequence, with
      processes doing interleaved I/O, the throughput reached by one such
      device is likely to be the same, with and without queue merging.
      
      In view of this fact, this commit disables queue merging, and all
      related housekeeping, for non-rotational devices with internal
      queueing. The total, single-lock-protected, per-request processing
      time of BFQ drops to, e.g., 1.9 us on an Intel Core i7-2760QM@2.40GHz
      (time measured with simple code instrumentation, and using the
      throughput-sync.sh script of the S suite [1], in performance-profiling
      mode). To put this result into context, the total,
      single-lock-protected, per-request execution time of the lightest I/O
      scheduler available in blk-mq, mq-deadline, is 0.7 us (mq-deadline is
      ~800 LOC, against ~10500 LOC for BFQ).
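
The condition under which merging is switched off can be written as a one-line predicate. The structure and field names are hypothetical stand-ins for "non-rotational" and "has internal queueing"; how the kernel detects these properties is not described in the message.

```c
#include <stdbool.h>

/* Hypothetical device properties. */
struct dev_props {
    bool non_rotational;    /* e.g. flash storage */
    bool internal_queueing; /* device queues many requests internally */
};

/* Queue merging (and its housekeeping) stays enabled unless the device
 * is both non-rotational and capable of internal queueing. */
static bool queue_merging_enabled(const struct dev_props *d)
{
    return !(d->non_rotational && d->internal_queueing);
}
```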
      
      Disabling merging provides a further, remarkable benefit in terms of
      throughput. Merging tends to make many workloads artificially more
      uneven, mainly because of shared queues remaining non-empty for
      incomparably more time than normal queues. So, if, e.g., one of the
      queues in a set of merged queues has a higher weight than a normal
      queue, then the shared queue may inherit such a high weight and, by
      staying almost always active, may force BFQ to perform I/O plugging
      most of the time. This evidently makes it harder for BFQ to let the
      device reach a high throughput.
      
      As a practical example of this problem, and of the benefits of this
      commit, we measured again the throughput in the nasty scenario
      considered in previous commit messages: dbench test (in the Phoronix
      suite), with 6 clients, on a filesystem with journaling, and with the
      journaling daemon enjoying a higher weight than normal processes. With
      this commit, the throughput grows from ~150 MB/s to ~200 MB/s on a
      PLEXTOR PX-256M5 SSD. This is the same peak throughput reached by any
      of the other I/O schedulers. As such, this is also likely to be the
      maximum possible throughput reachable with this workload on this
      device, because I/O is mostly random, and the other schedulers
      basically just pass I/O requests to the drive as fast as possible.
      
      [1] https://github.com/Algodev-github/S
      
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Tested-by: Francesco Pollicino <fra.fra.800@gmail.com>
      Signed-off-by: Alessio Masola <alessio.masola@gmail.com>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, bfq: tune service injection basing on request service times · 2341d662
      Committed by Paolo Valente
      The processes associated with a bfq_queue, say Q, may happen to
      generate their cumulative I/O at a lower rate than the rate at which
      the device could serve the same I/O. This is rather probable, e.g., if
      only one process is associated with Q and the device is an SSD. It
      results in Q becoming often empty while in service. If BFQ is not
      allowed to switch to another queue when Q becomes empty, then, during
      the service of Q, there will be frequent "service holes", i.e., time
      intervals during which Q gets empty and the device can only consume
      the I/O already queued in its hardware queues. This easily causes
      considerable losses of throughput.
      
      To counter this problem, BFQ implements a request injection mechanism,
      which tries to fill the above service holes with I/O requests taken
      from other bfq_queues. The hard part in this mechanism is finding the
      right amount of I/O to inject, so as to both boost throughput and not
      break Q's bandwidth and latency guarantees. To this end, the current
      version of this mechanism measures the bandwidth enjoyed by Q while it
      is being served, and tries to inject the maximum possible amount of
      extra service that does not cause Q's bandwidth to decrease too
      much.
      
      This solution has an important shortcoming. For bandwidth measurements
      to be stable and reliable, Q must remain in service for a much longer
      time than that needed to serve a single I/O request. Unfortunately,
      this does not hold with many workloads. This commit addresses this
      issue by changing the way the amount of injection allowed is
      dynamically computed. It tunes injection as a function of the service
      times of single I/O requests of Q, instead of Q's
      bandwidth. Single-request service times are evidently meaningful even
      if Q gets very few I/O requests completed while it is in service.
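
One way to picture tuning by single-request service times is a small feedback rule: compare the service time of a request of Q observed while injection was active against a baseline measured without injection, and lower or raise the allowed injection accordingly. The sketch below is an assumption-laden illustration, not the algorithm of this commit: the names, the 1.5x threshold, and the step size of one are all hypothetical.

```c
#include <stdint.h>

/* Hypothetical per-queue injection state. */
struct inject_state {
    uint64_t baseline_serv_ns; /* service time of a request with no injection */
    unsigned int limit;        /* max requests that may currently be injected */
    unsigned int max_limit;    /* upper bound for the limit */
};

/* Adjusts the limit after a request of Q completes while injection
 * was active. Threshold and step are illustrative assumptions. */
static void update_inject_limit(struct inject_state *s,
                                uint64_t serv_ns_with_injection)
{
    uint64_t threshold = s->baseline_serv_ns + s->baseline_serv_ns / 2;

    if (serv_ns_with_injection >= threshold) {
        if (s->limit > 0)
            s->limit--;     /* injection hurt Q too much: back off */
    } else if (s->limit < s->max_limit) {
        s->limit++;         /* Q barely noticed: allow more injection */
    }
}
```

Because each update needs only one completed request of Q, a rule of this kind still works when Q stays in service too briefly for a bandwidth measurement to be reliable.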
      
      As a testbed for this new solution, we measured the throughput reached
      by BFQ for one of the nastiest workloads and configurations for this
      scheduler: the workload generated by the dbench test (in the Phoronix
      suite), with 6 clients, on a filesystem with journaling, and with the
      journaling daemon enjoying a higher weight than normal processes.
      With this commit, the throughput grows from ~100 MB/s to ~150 MB/s on
      a PLEXTOR PX-256M5.
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Tested-by: Francesco Pollicino <fra.fra.800@gmail.com>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>