1. 26 1月, 2021 3 次提交
  2. 18 8月, 2020 1 次提交
    • D
      bfq: fix blkio cgroup leakage v4 · 2de791ab
      Dmitry Monakhov 提交于
      Changes from v1:
          - update commit description with proper ref-accounting justification
      
      commit db37a34c ("block, bfq: get a ref to a group when adding it to a service tree")
      introduce leak forbfq_group and blkcg_gq objects because of get/put
      imbalance.
      In fact whole idea of original commit is wrong because bfq_group entity
      can not dissapear under us because it is referenced by child bfq_queue's
      entities from here:
       -> bfq_init_entity()
          ->bfqg_and_blkg_get(bfqg);
          ->entity->parent = bfqg->my_entity
      
       -> bfq_put_queue(bfqq)
          FINAL_PUT
          ->bfqg_and_blkg_put(bfqq_group(bfqq))
          ->kmem_cache_free(bfq_pool, bfqq);
      
      So parent entity can not disappear while child entity is in tree,
      and child entities already has proper protection.
      This patch revert commit db37a34c ("block, bfq: get a ref to a group when adding it to a service tree")
      
      bfq_group leak trace caused by bad commit:
      -> blkg_alloc
         -> bfq_pq_alloc
           -> bfqg_get (+1)
      ->bfq_activate_bfqq
        ->bfq_activate_requeue_entity
          -> __bfq_activate_entity
             ->bfq_get_entity
               ->bfqg_and_blkg_get (+1)  <==== : Note1
      ->bfq_del_bfqq_busy
        ->bfq_deactivate_entity+0x53/0xc0 [bfq]
          ->__bfq_deactivate_entity+0x1b8/0x210 [bfq]
            -> bfq_forget_entity(is_in_service = true)
      	 entity->on_st_or_in_serv = false   <=== :Note2
      	 if (is_in_service)
      	     return;  ==> do not touch reference
      -> blkcg_css_offline
       -> blkcg_destroy_blkgs
        -> blkg_destroy
         -> bfq_pd_offline
          -> __bfq_deactivate_entity
               if (!entity->on_st_or_in_serv) /* true, because (Note2)
      		return false;
       -> bfq_pd_free
          -> bfqg_put() (-1, byt bfqg->ref == 2) because of (Note2)
      So bfq_group and blkcg_gq  will leak forever, see test-case below.
      
      ##TESTCASE_BEGIN:
      #!/bin/bash
      
      max_iters=${1:-100}
      #prep cgroup mounts
      mount -t tmpfs cgroup_root /sys/fs/cgroup
      mkdir /sys/fs/cgroup/blkio
      mount -t cgroup -o blkio none /sys/fs/cgroup/blkio
      
      # Prepare blkdev
      grep blkio /proc/cgroups
      truncate -s 1M img
      losetup /dev/loop0 img
      echo bfq > /sys/block/loop0/queue/scheduler
      
      grep blkio /proc/cgroups
      for ((i=0;i<max_iters;i++))
      do
          mkdir -p /sys/fs/cgroup/blkio/a
          echo 0 > /sys/fs/cgroup/blkio/a/cgroup.procs
          dd if=/dev/loop0 bs=4k count=1 of=/dev/null iflag=direct 2> /dev/null
          echo 0 > /sys/fs/cgroup/blkio/cgroup.procs
          rmdir /sys/fs/cgroup/blkio/a
          grep blkio /proc/cgroups
      done
      ##TESTCASE_END:
      
      Fixes: db37a34c ("block, bfq: get a ref to a group when adding it to a service tree")
      Tested-by: NOleksandr Natalenko <oleksandr@natalenko.name>
      Signed-off-by: NDmitry Monakhov <dmtrmonakhov@yandex-team.ru>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      2de791ab
  3. 22 3月, 2020 1 次提交
  4. 03 2月, 2020 3 次提交
  5. 21 11月, 2019 1 次提交
  6. 08 11月, 2019 2 次提交
  7. 07 9月, 2019 1 次提交
    • F
      bfq: Add per-device weight · 795fe54c
      Fam Zheng 提交于
      This adds to BFQ the missing per-device weight interfaces:
      blkio.bfq.weight_device on legacy and io.bfq.weight on unified. The
      implementation pretty closely resembles what we had in CFQ and the parsing code
      is basically reused.
      
      Tests
      =====
      
      Using two cgroups and three block devices, having weights setup as:
      
      Cgroup          test1           test2
      ============================================
      default         100             500
      sda             500             100
      sdb             default         default
      sdc             200             200
      
      cgroup v1 runs
      --------------
      
          sda.test1.out:   READ: bw=913MiB/s
          sda.test2.out:   READ: bw=183MiB/s
      
          sdb.test1.out:   READ: bw=213MiB/s
          sdb.test2.out:   READ: bw=1054MiB/s
      
          sdc.test1.out:   READ: bw=650MiB/s
          sdc.test2.out:   READ: bw=650MiB/s
      
      cgroup v2 runs
      --------------
      
          sda.test1.out:   READ: bw=915MiB/s
          sda.test2.out:   READ: bw=184MiB/s
      
          sdb.test1.out:   READ: bw=216MiB/s
          sdb.test2.out:   READ: bw=1069MiB/s
      
          sdc.test1.out:   READ: bw=621MiB/s
          sdc.test2.out:   READ: bw=622MiB/s
      Signed-off-by: NFam Zheng <zhengfeiran@bytedance.com>
      Acked-by: NTejun Heo <tj@kernel.org>
      Reviewed-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      795fe54c
  8. 25 6月, 2019 1 次提交
    • P
      block, bfq: detect wakers and unconditionally inject their I/O · 13a857a4
      Paolo Valente 提交于
      A bfq_queue Q may happen to be synchronized with another
      bfq_queue Q2, i.e., the I/O of Q2 may need to be completed for Q to
      receive new I/O. We call Q2 "waker queue".
      
      If I/O plugging is being performed for Q, and Q is not receiving any
      more I/O because of the above synchronization, then, thanks to BFQ's
      injection mechanism, the waker queue is likely to get served before
      the I/O-plugging timeout fires.
      
      Unfortunately, this fact may not be sufficient to guarantee a high
      throughput during the I/O plugging, because the inject limit for Q may
      be too low to guarantee a lot of injected I/O. In addition, the
      duration of the plugging, i.e., the time before Q finally receives new
      I/O, may not be minimized, because the waker queue may happen to be
      served only after other queues.
      
      To address these issues, this commit introduces the explicit detection
      of the waker queue, and the unconditional injection of a pending I/O
      request of the waker queue on each invocation of
      bfq_dispatch_request().
      
      One may be concerned that this systematic injection of I/O from the
      waker queue delays the service of Q's I/O. Fortunately, it doesn't. On
      the contrary, next Q's I/O is brought forward dramatically, for it is
      not blocked for milliseconds.
      Reported-by: NSrivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
      Tested-by: NSrivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      13a857a4
  9. 21 6月, 2019 2 次提交
  10. 01 5月, 2019 1 次提交
  11. 10 4月, 2019 1 次提交
    • P
      block, bfq: fix use after free in bfq_bfqq_expire · eed47d19
      Paolo Valente 提交于
      The function bfq_bfqq_expire() invokes the function
      __bfq_bfqq_expire(), and the latter may free the in-service bfq-queue.
      If this happens, then no other instruction of bfq_bfqq_expire() must
      be executed, or a use-after-free will occur.
      
      Basing on the assumption that __bfq_bfqq_expire() invokes
      bfq_put_queue() on the in-service bfq-queue exactly once, the queue is
      assumed to be freed if its refcounter is equal to one right before
      invoking __bfq_bfqq_expire().
      
      But, since commit 9dee8b3b ("block, bfq: fix queue removal from
      weights tree") this assumption is false. __bfq_bfqq_expire() may also
      invoke bfq_weights_tree_remove() and, since commit 9dee8b3b
      ("block, bfq: fix queue removal from weights tree"), also
      the latter function may invoke bfq_put_queue(). So __bfq_bfqq_expire()
      may invoke bfq_put_queue() twice, and this is the actual case where
      the in-service queue may happen to be freed.
      
      To address this issue, this commit moves the check on the refcounter
      of the queue right around the last bfq_put_queue() that may be invoked
      on the queue.
      
      Fixes: 9dee8b3b ("block, bfq: fix queue removal from weights tree")
      Reported-by: NDmitrii Tcvetkov <demfloro@demfloro.ru>
      Reported-by: NDouglas Anderson <dianders@chromium.org>
      Tested-by: NDmitrii Tcvetkov <demfloro@demfloro.ru>
      Tested-by: NDouglas Anderson <dianders@chromium.org>
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      eed47d19
  12. 09 4月, 2019 1 次提交
  13. 01 4月, 2019 5 次提交
    • F
      block, bfq: save & resume weight on a queue merge/split · fffca087
      Francesco Pollicino 提交于
      bfq saves the state of a queue each time a merge occurs, to be
      able to resume such a state when the queue is associated again
      with its original process, on a split.
      
      Unfortunately bfq does not save & restore also the weight of the
      queue. If the weight is not correctly resumed when the queue is
      recycled, then the weight of the recycled queue could differ
      from the weight of the original queue.
      
      This commit adds the missing save & resume of the weight.
      Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: NOleksandr Natalenko <oleksandr@natalenko.name>
      Signed-off-by: NFrancesco Pollicino <fra.fra.800@gmail.com>
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      fffca087
    • F
      block, bfq: print SHARED instead of pid for shared queues in logs · 1e66413c
      Francesco Pollicino 提交于
      The function "bfq_log_bfqq" prints the pid of the process
      associated with the queue passed as input.
      
      Unfortunately, if the queue is shared, then more than one process
      is associated with the queue. The pid that gets printed in this
      case is the pid of one of the associated processes.
      Which process gets printed depends on the exact sequence of merge
      events the queue underwent. So printing such a pid is rather
      useless and above all is often rather confusing because it
      reports a random pid between those of the associated processes.
      
      This commit addresses this issue by printing SHARED instead of a pid
      if the queue is shared.
      Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: NOleksandr Natalenko <oleksandr@natalenko.name>
      Signed-off-by: NFrancesco Pollicino <fra.fra.800@gmail.com>
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      1e66413c
    • P
      block, bfq: do not merge queues on flash storage with queueing · 8cacc5ab
      Paolo Valente 提交于
      To boost throughput with a set of processes doing interleaved I/O
      (i.e., a set of processes whose individual I/O is random, but whose
      merged cumulative I/O is sequential), BFQ merges the queues associated
      with these processes, i.e., redirects the I/O of these processes into a
      common, shared queue. In the shared queue, I/O requests are ordered by
      their position on the medium, thus sequential I/O gets dispatched to
      the device when the shared queue is served.
      
      Queue merging costs execution time, because, to detect which queues to
      merge, BFQ must maintain a list of the head I/O requests of active
      queues, ordered by request positions. Measurements showed that this
      costs about 10% of BFQ's total per-request processing time.
      
      Request processing time becomes more and more critical as the speed of
      the underlying storage device grows. Yet, fortunately, queue merging
      is basically useless on the very devices that are so fast to make
      request processing time critical. To reach a high throughput, these
      devices must have many requests queued at the same time. But, in this
      configuration, the internal scheduling algorithms of these devices do
      also the job of queue merging: they reorder requests so as to obtain
      as much as possible a sequential I/O pattern. As a consequence, with
      processes doing interleaved I/O, the throughput reached by one such
      device is likely to be the same, with and without queue merging.
      
      In view of this fact, this commit disables queue merging, and all
      related housekeeping, for non-rotational devices with internal
      queueing. The total, single-lock-protected, per-request processing
      time of BFQ drops to, e.g., 1.9 us on an Intel Core i7-2760QM@2.40GHz
      (time measured with simple code instrumentation, and using the
      throughput-sync.sh script of the S suite [1], in performance-profiling
      mode). To put this result into context, the total,
      single-lock-protected, per-request execution time of the lightest I/O
      scheduler available in blk-mq, mq-deadline, is 0.7 us (mq-deadline is
      ~800 LOC, against ~10500 LOC for BFQ).
      
      Disabling merging provides a further, remarkable benefit in terms of
      throughput. Merging tends to make many workloads artificially more
      uneven, mainly because of shared queues remaining non empty for
      incomparably more time than normal queues. So, if, e.g., one of the
      queues in a set of merged queues has a higher weight than a normal
      queue, then the shared queue may inherit such a high weight and, by
      staying almost always active, may force BFQ to perform I/O plugging
      most of the time. This evidently makes it harder for BFQ to let the
      device reach a high throughput.
      
      As a practical example of this problem, and of the benefits of this
      commit, we measured again the throughput in the nasty scenario
      considered in previous commit messages: dbench test (in the Phoronix
      suite), with 6 clients, on a filesystem with journaling, and with the
      journaling daemon enjoying a higher weight than normal processes. With
      this commit, the throughput grows from ~150 MB/s to ~200 MB/s on a
      PLEXTOR PX-256M5 SSD. This is the same peak throughput reached by any
      of the other I/O schedulers. As such, this is also likely to be the
      maximum possible throughput reachable with this workload on this
      device, because I/O is mostly random, and the other schedulers
      basically just pass I/O requests to the drive as fast as possible.
      
      [1] https://github.com/Algodev-github/STested-by: NHolger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: NOleksandr Natalenko <oleksandr@natalenko.name>
      Tested-by: NFrancesco Pollicino <fra.fra.800@gmail.com>
      Signed-off-by: NAlessio Masola <alessio.masola@gmail.com>
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      8cacc5ab
    • P
      block, bfq: tune service injection basing on request service times · 2341d662
      Paolo Valente 提交于
      The processes associated with a bfq_queue, say Q, may happen to
      generate their cumulative I/O at a lower rate than the rate at which
      the device could serve the same I/O. This is rather probable, e.g., if
      only one process is associated with Q and the device is an SSD. It
      results in Q becoming often empty while in service. If BFQ is not
      allowed to switch to another queue when Q becomes empty, then, during
      the service of Q, there will be frequent "service holes", i.e., time
      intervals during which Q gets empty and the device can only consume
      the I/O already queued in its hardware queues. This easily causes
      considerable losses of throughput.
      
      To counter this problem, BFQ implements a request injection mechanism,
      which tries to fill the above service holes with I/O requests taken
      from other bfq_queues. The hard part in this mechanism is finding the
      right amount of I/O to inject, so as to both boost throughput and not
      break Q's bandwidth and latency guarantees. To this goal, the current
      version of this mechanism measures the bandwidth enjoyed by Q while it
      is being served, and tries to inject the maximum possible amount of
      extra service that does not cause Q's bandwidth to decrease too
      much.
      
      This solution has an important shortcoming. For bandwidth measurements
      to be stable and reliable, Q must remain in service for a much longer
      time than that needed to serve a single I/O request. Unfortunately,
      this does not hold with many workloads. This commit addresses this
      issue by changing the way the amount of injection allowed is
      dynamically computed. It tunes injection as a function of the service
      times of single I/O requests of Q, instead of Q's
      bandwidth. Single-request service times are evidently meaningful even
      if Q gets very few I/O requests completed while it is in service.
      
      As a testbed for this new solution, we measured the throughput reached
      by BFQ for one of the nastiest workloads and configurations for this
      scheduler: the workload generated by the dbench test (in the Phoronix
      suite), with 6 clients, on a filesystem with journaling, and with the
      journaling daemon enjoying a higher weight than normal processes.
      With this commit, the throughput grows from ~100 MB/s to ~150 MB/s on
      a PLEXTOR PX-256M5.
      Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: NOleksandr Natalenko <oleksandr@natalenko.name>
      Tested-by: NFrancesco Pollicino <fra.fra.800@gmail.com>
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      2341d662
    • P
      block, bfq: do not idle for lowest-weight queues · fb53ac6c
      Paolo Valente 提交于
      In most cases, it is detrimental for throughput to plug I/O dispatch
      when the in-service bfq_queue becomes temporarily empty (plugging is
      performed to wait for the possible arrival, soon, of new I/O from the
      in-service queue). There is however a case where plugging is needed
      for service guarantees. If a bfq_queue, say Q, has a higher weight
      than some other active bfq_queue, and is sync, i.e., contains sync
      I/O, then, to guarantee that Q does receive a higher share of the
      throughput than other lower-weight queues, it is necessary to plug I/O
      dispatch when Q remains temporarily empty while being served.
      
      For this reason, BFQ performs I/O plugging when some active bfq_queue
      has a higher weight than some other active bfq_queue. But this is
      unnecessarily overkill. In fact, if the in-service bfq_queue actually
      has a weight lower than or equal to the other queues, then the queue
      *must not* be guaranteed a higher share of the throughput than the
      other queues. So, not plugging I/O cannot cause any harm to the
      queue. And can boost throughput.
      
      Taking advantage of this fact, this commit does not plug I/O for sync
      bfq_queues with a weight lower than or equal to the weights of the
      other queues. Here is an example of the resulting throughput boost
      with the dbench workload, which is particularly nasty for BFQ. With
      the dbench test in the Phoronix suite, BFQ reaches its lowest total
      throughput with 6 clients on a filesystem with journaling, in case the
      journaling daemon has a higher weight than normal processes. Before
      this commit, the total throughput was ~80 MB/sec on a PLEXTOR PX-256M5,
      after this commit it is ~100 MB/sec.
      Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: NOleksandr Natalenko <oleksandr@natalenko.name>
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      fb53ac6c
  14. 01 2月, 2019 2 次提交
    • P
      block, bfq: fix in-service-queue check for queue merging · 058fdecc
      Paolo Valente 提交于
      When a new I/O request arrives for a bfq_queue, say Q, bfq checks
      whether that request is close to
      (a) the head request of some other queue waiting to be served, or
      (b) the last request dispatched for the in-service queue (in case Q
      itself is not the in-service queue)
      
      If a queue, say Q2, is found for which the above condition holds, then
      bfq merges Q and Q2, to hopefully get a more sequential I/O in the
      resulting merged queue, and thus a possibly higher throughput.
      
      Case (b) is checked by comparing the new request for Q with the last
      request dispatched, assuming that the latter necessarily belonged to the
      in-service queue. Unfortunately, this assumption is no longer always
      correct, since commit d0edc247 ("block, bfq: inject other-queue I/O
      into seeky idle queues on NCQ flash").
      
      When the assumption does not hold, queues that must not be merged may be
      merged, causing unexpected loss of control on per-queue service
      guarantees.
      
      This commit solves this problem by adding an extra field, which stores
      the actual last request dispatched for the in-service queue, and by
      using this new field to correctly check case (b).
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      058fdecc
    • P
      block, bfq: consider also ioprio classes in symmetry detection · 73d58118
      Paolo Valente 提交于
      In asymmetric scenarios, i.e., when some bfq_queue or bfq_group needs to
      be guaranteed a different bandwidth than other bfq_queues or bfq_groups,
      these service guaranteed can be provided only by plugging I/O dispatch,
      completely or partially, when the queue in service remains temporarily
      empty. A case where asymmetry is particularly strong is when some active
      bfq_queues belong to a higher-priority class than some other active
      bfq_queues. Unfortunately, this important case is not considered at all
      in the code for detecting asymmetric scenarios. This commit adds the
      missing logic.
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      73d58118
  15. 07 12月, 2018 1 次提交
    • P
      block, bfq: fix decrement of num_active_groups · ba7aeae5
      Paolo Valente 提交于
      Since commit '2d29c9f8 ("block, bfq: improve asymmetric scenarios
      detection")', if there are process groups with I/O requests waiting for
      completion, then BFQ tags the scenario as 'asymmetric'. This detection
      is needed for preserving service guarantees (for details, see comments
      on the computation * of the variable asymmetric_scenario in the
      function bfq_better_to_idle).
      
      Unfortunately, commit '2d29c9f8 ("block, bfq: improve asymmetric
      scenarios detection")' contains an error exactly in the updating of
      the number of groups with I/O requests waiting for completion: if a
      group has more than one descendant process, then the above number of
      groups, which is renamed from num_active_groups to a more appropriate
      num_groups_with_pending_reqs by this commit, may happen to be wrongly
      decremented multiple times, namely every time one of the descendant
      processes gets all its pending I/O requests completed.
      
      A correct, complete solution should work as follows. Consider a group
      that is inactive, i.e., that has no descendant process with pending
      I/O inside BFQ queues. Then suppose that num_groups_with_pending_reqs
      is still accounting for this group, because the group still has some
      descendant process with some I/O request still in
      flight. num_groups_with_pending_reqs should be decremented when the
      in-flight request of the last descendant process is finally completed
      (assuming that nothing else has changed for the group in the meantime,
      in terms of composition of the group and active/inactive state of
      child groups and processes). To accomplish this, an additional
      pending-request counter must be added to entities, and must be
      updated correctly.
      
      To avoid this additional field and operations, this commit resorts to
      the following tradeoff between simplicity and accuracy: for an
      inactive group that is still counted in num_groups_with_pending_reqs,
      this commit decrements num_groups_with_pending_reqs when the first
      descendant process of the group remains with no request waiting for
      completion.
      
      This simplified scheme provides a fix to the unbalanced decrements
      introduced by 2d29c9f8. Since this error was also caused by lack
      of comments on this non-trivial issue, this commit also adds related
      comments.
      
      Fixes: 2d29c9f8 ("block, bfq: improve asymmetric scenarios detection")
      Reported-by: NSteven Barrett <steven@liquorix.net>
      Tested-by: NSteven Barrett <steven@liquorix.net>
      Tested-by: NLucjan Lucjanov <lucjan.lucjanov@gmail.com>
      Reviewed-by: NFederico Motta <federico@willer.it>
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      ba7aeae5
  16. 14 10月, 2018 1 次提交
    • F
      block, bfq: improve asymmetric scenarios detection · 2d29c9f8
      Federico Motta 提交于
      bfq defines as asymmetric a scenario where an active entity, say E
      (representing either a single bfq_queue or a group of other entities),
      has a higher weight than some other entities.  If the entity E does sync
      I/O in such a scenario, then bfq plugs the dispatch of the I/O of the
      other entities in the following situation: E is in service but
      temporarily has no pending I/O request.  In fact, without this plugging,
      all the times that E stops being temporarily idle, it may find the
      internal queues of the storage device already filled with an
      out-of-control number of extra requests, from other entities. So E may
      have to wait for the service of these extra requests, before finally
      having its own requests served. This may easily break service
      guarantees, with E getting less than its fair share of the device
      throughput.  Usually, the end result is that E gets the same fraction of
      the throughput as the other entities, instead of getting more, according
      to its higher weight.
      
      Yet there are two other more subtle cases where E, even if its weight is
      actually equal to or even lower than the weight of any other active
      entities, may get less than its fair share of the throughput in case the
      above I/O plugging is not performed:
      1. other entities issue larger requests than E;
      2. other entities contain more active child entities than E (or in
         general tend to have more backlog than E).
      
      In the first case, other entities may get more service than E because
      they get larger requests, than those of E, served during the temporary
      idle periods of E.  In the second case, other entities get more service
      because, by having many child entities, they have many requests ready
      for dispatching while E is temporarily idle.
      
      This commit addresses this issue by extending the definition of
      asymmetric scenario: a scenario is asymmetric when
      - active entities representing bfq_queues have differentiated weights,
        as in the original definition
      or (inclusive)
      - one or more entities representing groups of entities are active.
      
      This broader definition makes sure that I/O plugging will be performed
      in all the above cases, provided that there is at least one active
      group.  Of course, this definition is very coarse, so it will trigger
      I/O plugging also in cases where it is not needed, such as, e.g.,
      multiple active entities with just one child each, and all with the same
      I/O-request size.  The reason for this coarse definition is just that a
      finer-grained definition would be rather heavy to compute.
      
      On the opposite end, even this new definition does not trigger I/O
      plugging in all cases where there is no active group, and all bfq_queues
      have the same weight.  So, in these cases some unfairness may occur if
      there are asymmetries in I/O-request sizes.  We made this choice because
      I/O plugging may lower throughput, and probably a user that has not
      created any group cares more about throughput than about perfect
      fairness.  At any rate, as for possible applications that may care about
      service guarantees, bfq already guarantees a high responsiveness and a
      low latency to soft real-time applications automatically.
      Signed-off-by: NFederico Motta <federico@willer.it>
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      2d29c9f8
  17. 15 9月, 2018 1 次提交
    • P
      block, bfq: inject other-queue I/O into seeky idle queues on NCQ flash · d0edc247
      Paolo Valente 提交于
      The Achilles' heel of BFQ is its failing to reach a high throughput
      with sync random I/O on flash storage with internal queueing, in case
      the processes doing I/O have differentiated weights.
      
      The cause of this failure is as follows. If at least two processes do
      sync I/O, and have a different weight from each other, then BFQ plugs
      I/O dispatching every time one of these processes, while it is being
      served, remains temporarily without pending I/O requests. This
      plugging is necessary to guarantee that every process enjoys a
      bandwidth proportional to its weight; but it empties the internal
      queue(s) of the drive. And this kills throughput with random I/O. So,
      if some processes have differentiated weights and do both sync and
      random I/O, the end result is a throughput collapse.
      
      This commit tries to counter this problem by injecting the service of
      other processes, in a controlled way, while the process in service
      happens to have no I/O. This injection is performed only if the medium
      is non rotational and performs internal queueing, and the process in
      service does random I/O (service injection might be beneficial for
      sequential I/O too, we'll work on that).
      
      As an example of the benefits of this commit, on a PLEXTOR PX-256M5S
      SSD, and with five processes having differentiated weights and doing
      sync random 4KB I/O, this commit makes the throughput with bfq grow by
      400%, from 25 to 100MB/s. This higher throughput is 10MB/s lower than
      that reached with none. As some less random I/O is added to the mix,
      the throughput becomes equal to or higher than that with none.
      
      This commit is a very first attempt to recover throughput without
      losing control, and certainly has many limitations. One is, e.g., that
      the processes whose service is injected are not chosen so as to
      distribute the extra bandwidth they receive in accordance to their
      weights. Thus there might be loss of weighted fairness in some
      cases. Anyway, this loss concerns extra service, which would not have
      been received at all without this commit. Other limitations and issues
      will probably show up with usage.
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      d0edc247
  18. 09 7月, 2018 1 次提交
    • P
      block, bfq: add/remove entity weights correctly · 0471559c
      Paolo Valente 提交于
      To keep I/O throughput high as often as possible, BFQ performs
      I/O-dispatch plugging (aka device idling) only when beneficial exactly
      for throughput, or when needed for service guarantees (low latency,
      fairness). An important case where the latter condition holds is when
      the scenario is 'asymmetric' in terms of weights: i.e., when some
      bfq_queue or whole group of queues has a higher weight, and thus has
      to receive more service, than other queues or groups. Without dispatch
      plugging, lower-weight queues/groups may unjustly steal bandwidth to
      higher-weight queues/groups.
      
      To detect asymmetric scenarios, BFQ checks some sufficient
      conditions. One of these conditions is that active groups have
      different weights. BFQ controls this condition by maintaining a
      special set of unique weights of active groups
      (group_weights_tree). To this purpose, in the function
      bfq_active_insert/bfq_active_extract BFQ adds/removes the weight of a
      group to/from this set.
      
      Unfortunately, the function bfq_active_extract may happen to be
      invoked also for a group that is still active (to preserve the correct
      update of the next queue to serve, see comments in function
      bfq_no_longer_next_in_service() for details). In this case, removing
      the weight of the group makes the set group_weights_tree
      inconsistent. Service-guarantee violations follow.
      
      This commit addresses this issue by moving group_weights_tree
      insertions from their previous location (in bfq_active_insert) into
      the function __bfq_activate_entity, and by moving group_weights_tree
      extractions from bfq_active_extract to when the entity that represents
      a group remains throughly idle, i.e., with no request either enqueued
      or dispatched.
      Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: NOleksandr Natalenko <oleksandr@natalenko.name>
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      0471559c
  19. 31 5月, 2018 1 次提交
    • P
      block, bfq: remove slow-system class · e24f1c24
      Paolo Valente 提交于
      BFQ computes the duration of weight raising for interactive
      applications automatically, using some reference parameters. In
      particular, BFQ uses the best durations (see comments in the code for
      how these durations have been assessed) for two classes of systems:
      slow and fast ones. Examples of slow systems are old phones or systems
      using micro HDDs. Fast systems are all the remaining ones. Using these
      parameters, BFQ computes the actual duration of the weight raising,
      for the system at hand, as a function of the relative speed of the
      system w.r.t. the speed of a reference system, belonging to the same
      class of systems as the system at hand.
      
      This slow vs fast differentiation proved to be useful in the past, but
      happens to have little meaning with current hardware. Even worse, it
      does cause problems in virtual systems, where the speed of the system
      can vary frequently, and so widely to just confuse the class-detection
      mechanism, and, as we have verified experimentally, to cause BFQ to
      compute non-sensical weight-raising durations.
      
      This commit addresses this issue by removing the slow class and the
      class-detection mechanism.
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      e24f1c24
  20. 11 5月, 2018 1 次提交
  21. 09 5月, 2018 1 次提交
  22. 27 3月, 2018 1 次提交
  23. 18 1月, 2018 2 次提交
    • P
      block, bfq: limit sectors served with interactive weight raising · 8a8747dc
      Paolo Valente 提交于
      To maximise responsiveness, BFQ raises the weight, and performs device
      idling, for bfq_queues associated with processes deemed as
      interactive. In particular, weight raising has a maximum duration,
      equal to the time needed to start a large application. If a
      weight-raised process goes on doing I/O beyond this maximum duration,
      it loses weight-raising.
      
      This mechanism is evidently vulnerable to the following false
      positives: I/O-bound applications that will go on doing I/O for much
      longer than the duration of weight-raising. These applications have
      basically no benefit from being weight-raised at the beginning of
      their I/O. On the opposite end, while being weight-raised, these
      applications
      a) unjustly steal throughput to applications that may truly need
      low latency;
      b) make BFQ uselessly perform device idling; device idling results
      in loss of device throughput with most flash-based storage, and may
      increase latencies when used purposelessly.
      
      This commit adds a countermeasure to reduce both the above
      problems. To introduce this countermeasure, we provide the following
      extra piece of information (full details in the comments added by this
      commit). During the start-up of the large application used as a
      reference to set the duration of weight-raising, involved processes
      transfer at most ~110K sectors each. Accordingly, a process initially
      deemed as interactive has no right to be weight-raised any longer,
      once transferred 110K sectors or more.
      
      Basing on this consideration, this commit early-ends weight-raising
      for a bfq_queue if the latter happens to have received an amount of
      service at least equal to 110K sectors (actually, a little bit more,
      to keep a safety margin). I/O-bound applications that reach a high
      throughput, such as file copy, get to this threshold much before the
      allowed weight-raising period finishes. Thus this early ending of
      weight-raising reduces the amount of time during which these
      applications cause the problems described above.
      Tested-by: NOleksandr Natalenko <oleksandr@natalenko.name>
      Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com>
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      8a8747dc
    • P
      block, bfq: limit tags for writes and async I/O · a52a69ea
      Paolo Valente 提交于
      Asynchronous I/O can easily starve synchronous I/O (both sync reads
      and sync writes), by consuming all request tags. Similarly, storms of
      synchronous writes, such as those that sync(2) may trigger, can starve
      synchronous reads. In their turn, these two problems may also cause
      BFQ to loose control on latency for interactive and soft real-time
      applications. For example, on a PLEXTOR PX-256M5S SSD, LibreOffice
      Writer takes 0.6 seconds to start if the device is idle, but it takes
      more than 45 seconds (!) if there are sequential writes in the
      background.
      
      This commit addresses this issue by limiting the maximum percentage of
      tags that asynchronous I/O requests and synchronous write requests can
      consume. In particular, this commit grants a higher threshold to
      synchronous writes, to prevent the latter from being starved by
      asynchronous I/O.
      
      According to the above test, LibreOffice Writer now starts in about
      1.2 seconds on average, regardless of the background workload, and
      apart from some rare outlier. To check this improvement, run, e.g.,
      sudo ./comm_startup_lat.sh bfq 5 5 seq 10 "lowriter --terminate_after_init"
      for the comm_startup_lat benchmark in the S suite [1].
      
      [1] https://github.com/Algodev-github/STested-by: NOleksandr Natalenko <oleksandr@natalenko.name>
      Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com>
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      a52a69ea
  24. 06 1月, 2018 1 次提交
    • P
      block, bfq: let a queue be merged only shortly after starting I/O · 7b8fa3b9
      Paolo Valente 提交于
      In BFQ and CFQ, two processes are said to be cooperating if they do
      I/O in such a way that the union of their I/O requests yields a
      sequential I/O pattern. To get such a sequential I/O pattern out of
      the non-sequential pattern of each cooperating process, BFQ and CFQ
      merge the queues associated with these processes. In more detail,
      cooperating processes, and thus their associated queues, usually
      start, or restart, to do I/O shortly after each other. This is the
      case, e.g., for the I/O threads of KVM/QEMU and of the dump
      utility. Basing on this assumption, this commit allows a bfq_queue to
      be merged only during a short time interval (100ms) after it starts,
      or re-starts, to do I/O.  This filtering provides two important
      benefits.
      
      First, it greatly reduces the probability that two non-cooperating
      processes have their queues merged by mistake, if they just happen to
      do I/O close to each other for a short time interval. These spurious
      merges cause loss of service guarantees. A low-weight bfq_queue may
      unjustly get more than its expected share of the throughput: if such a
      low-weight queue is merged with a high-weight queue, then the I/O for
      the low-weight queue is served as if the queue had a high weight. This
      may damage other high-weight queues unexpectedly.  For instance,
      because of this issue, lxterminal occasionally took 7.5 seconds to
      start, instead of 6.5 seconds, when some sequential readers and
      writers did I/O in the background on a FUJITSU MHX2300BT HDD.  The
      reason is that the bfq_queues associated with some of the readers or
      the writers were merged with the high-weight queues of some processes
      that had to do some urgent but little I/O. The readers then exploited
      the inherited high weight for all or most of their I/O, during the
      start-up of terminal. The filtering introduced by this commit
      eliminated any outlier caused by spurious queue merges in our start-up
      time tests.
      
      This filtering also provides a little boost of the throughput
      sustainable by BFQ: 3-4%, depending on the CPU. The reason is that,
      once a bfq_queue cannot be merged any longer, this commit makes BFQ
      stop updating the data needed to handle merging for the queue.
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NAngelo Ruocco <angeloruocco90@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      7b8fa3b9
  25. 15 11月, 2017 1 次提交
  26. 31 8月, 2017 1 次提交
    • P
      block, bfq: make lookup_next_entity push up vtime on expirations · 80294c3b
      Paolo Valente 提交于
      To provide a very smooth service, bfq starts to serve a bfq_queue
      only if the queue is 'eligible', i.e., if the same queue would
      have started to be served in the ideal, perfectly fair system that
      bfq simulates internally. This is obtained by associating each
      queue with a virtual start time, and by computing a special system
      virtual time quantity: a queue is eligible only if the system
      virtual time has reached the virtual start time of the
      queue. Finally, bfq guarantees that, when a new queue must be set
      in service, there is always at least one eligible entity for each
      active parent entity in the scheduler. To provide this guarantee,
      the function __bfq_lookup_next_entity pushes up, for each parent
      entity on which it is invoked, the system virtual time to the
      minimum among the virtual start times of the entities in the
      active tree for the parent entity (more precisely, the push up
      occurs if the system virtual time happens to be lower than all
      such virtual start times).
      
      There is however a circumstance in which __bfq_lookup_next_entity
      cannot push up the system virtual time for a parent entity, even
      if the system virtual time is lower than the virtual start times
      of all the child entities in the active tree. It happens if one of
      the child entities is in service. In fact, in such a case, there
      is already an eligible entity, the in-service one, even if it may
      not be not present in the active tree (because in-service entities
      may be removed from the active tree).
      
      Unfortunately, in the last re-design of the
      hierarchical-scheduling engine, the reset of the pointer to the
      in-service entity for a given parent entity--reset to be done as a
      consequence of the expiration of the in-service entity--always
      happens after the function __bfq_lookup_next_entity has been
      invoked. This causes the function to think that there is still an
      entity in service for the parent entity, and then that the system
      virtual time cannot be pushed up, even if actually such a
      no-more-in-service entity has already been properly reinserted
      into the active tree (or in some other tree if no more
      active). Yet, the system virtual time *had* to be pushed up, to be
      ready to correctly choose the next queue to serve. Because of the
      lack of this push up, bfq may wrongly set in service a queue that
      had been speculatively pre-computed as the possible
      next-in-service queue, but that would no more be the one to serve
      after the expiration and the reinsertion into the active trees of
      the previously in-service entities.
      
      This commit addresses this issue by making
      __bfq_lookup_next_entity properly push up the system virtual time
      if an expiration is occurring.
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Tested-by: NLee Tibbert <lee.tibbert@gmail.com>
      Tested-by: NOleksandr Natalenko <oleksandr@natalenko.name>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      80294c3b
  27. 11 8月, 2017 1 次提交
    • P
      block,bfq: refactor device-idling logic · d5be3fef
      Paolo Valente 提交于
      The logic that decides whether to idle the device is scattered across
      three functions. Almost all of the logic is in the function
      bfq_bfqq_may_idle, but (1) part of the decision is made in
      bfq_update_idle_window, and (2) the function bfq_bfqq_must_idle may
      switch off idling regardless of the output of bfq_bfqq_may_idle. In
      addition, both bfq_update_idle_window and bfq_bfqq_must_idle make
      their decisions as a function of parameters that are used, for similar
      purposes, also in bfq_bfqq_may_idle. This commit addresses these
      issues by moving all the logic into bfq_bfqq_may_idle.
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      d5be3fef
  28. 30 7月, 2017 1 次提交
    • P
      block, bfq: consider also in_service_entity to state whether an entity is active · 46d556e6
      Paolo Valente 提交于
      Groups of BFQ queues are represented by generic entities in BFQ. When
      a queue belonging to a parent entity is deactivated, the parent entity
      may need to be deactivated too, in case the deactivated queue was the
      only active queue for the parent entity. This deactivation may need to
      be propagated upwards if the entity belongs, in its turn, to a further
      higher-level entity, and so on. In particular, the upward propagation
      of deactivation stops at the first parent entity that remains active
      even if one of its child entities has been deactivated.
      
      To decide whether the last non-deactivation condition holds for a
      parent entity, BFQ checks whether the field next_in_service is still
      not NULL for the parent entity, after the deactivation of one of its
      child entity. If it is not NULL, then there are certainly other active
      entities in the parent entity, and deactivations can stop.
      
      Unfortunately, this check misses a corner case: if in_service_entity
      is not NULL, then next_in_service may happen to be NULL, although the
      parent entity is evidently active. This happens if: 1) the entity
      pointed by in_service_entity is the only active entity in the parent
      entity, and 2) according to the definition of next_in_service, the
      in_service_entity cannot be considered as next_in_service. See the
      comments on the definition of next_in_service for details on this
      second point.
      
      Hitting the above corner case causes crashes.
      
      To address this issue, this commit:
      1) Extends the above check on only next_in_service to controlling both
      next_in_service and in_service_entity (if any of them is not NULL,
      then no further deactivation is performed)
      2) Improves the (important) comments on how next_in_service is defined
      and updated; in particular it fixes a few rather obscure paragraphs
      Reported-by: NEric Wheeler <bfq-sched@lists.ewheeler.net>
      Reported-by: NRick Yiu <rick_yiu@htc.com>
      Reported-by: NTom X Nguyen <tom81094@gmail.com>
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Tested-by: NEric Wheeler <bfq-sched@lists.ewheeler.net>
      Tested-by: NRick Yiu <rick_yiu@htc.com>
      Tested-by: NLaurentiu Nicola <lnicola@dend.ro>
      Tested-by: NTom X Nguyen <tom81094@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      46d556e6