1. 25 6月, 2019 7 次提交
    • P
      block, bfq: re-schedule empty queues if they deserve I/O plugging · 3726112e
      Paolo Valente 提交于
      Consider, on one side, a bfq_queue Q that remains empty while in
      service, and, on the other side, the pending I/O of bfq_queues that,
      according to their timestamps, have to be served after Q.  If an
      uncontrolled amount of I/O from the latter bfq_queues were dispatched
      while Q is waiting for its new I/O to arrive, then Q's bandwidth
      guarantees would be violated. To prevent this, I/O dispatch is plugged
      until Q receives new I/O (except for a properly controlled amount of
      injected I/O). Unfortunately, preemption breaks I/O-dispatch plugging,
      for the following reason.
      
      Preemption is performed in two steps. First, Q is expired and
      re-scheduled. Second, the new bfq_queue to serve is chosen. The first
      step is needed by the second, as the second can be performed only
      after Q's timestamps have been properly updated (done in the
      expiration step), and Q has been re-queued for service. This
      dependency is a consequence of the way how BFQ's scheduling algorithm
      is currently implemented.
      
      But Q is not re-scheduled at all in the first step, because Q is
      empty. As a consequence, an uncontrolled amount of I/O may be
      dispatched until Q becomes non empty again. This breaks Q's service
      guarantees.
      
      This commit addresses this issue by re-scheduling Q even if it is
      empty. This in turn breaks the assumption that all scheduled queues
      are non empty. Then a few extra checks are now needed.
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      3726112e
    • P
      block, bfq: preempt lower-weight or lower-priority queues · 96a291c3
      Paolo Valente 提交于
      BFQ enqueues the I/O coming from each process into a separate
      bfq_queue, and serves bfq_queues one at a time. Each bfq_queue may be
      served for at most timeout_sync milliseconds (default: 125 ms). This
      service scheme is prone to the following inaccuracy.
      
      While a bfq_queue Q1 is in service, some empty bfq_queue Q2 may
      receive I/O, and, according to BFQ's scheduling policy, may become the
      right bfq_queue to serve, in place of the currently in-service
      bfq_queue. In this respect, postponing the service of Q2 to after the
      service of Q1 finishes may delay the completion of Q2's I/O, compared
      with an ideal service in which all non-empty bfq_queues are served in
      parallel, and every non-empty bfq_queue is served at a rate
      proportional to the bfq_queue's weight. This additional delay is equal
      at most to the time Q1 may unjustly remain in service before switching
      to Q2.
      
      If Q1 and Q2 have the same weight, then this time is most likely
      negligible compared with the completion time to be guaranteed to Q2's
      I/O. In addition, first, one of the reasons why BFQ may want to serve
      Q1 for a while is that this boosts throughput and, second, serving Q1
      longer reduces BFQ's overhead. As a conclusion, it is usually better
      not to preempt Q1 if both Q1 and Q2 have the same weight.
      
      In contrast, as Q2's weight or priority becomes higher and higher
      compared with that of Q1, the above delay becomes larger and larger,
      compared with the I/O completion times that have to be guaranteed to
      Q2 according to Q2's weight. So reducing this delay may be more
      important than avoiding the costs of preempting Q1.
      
      Accordingly, this commit preempts Q1 if Q2 has a higher weight or a
      higher priority than Q1. Preemption causes Q1 to be re-scheduled, and
      triggers a new choice of the next bfq_queue to serve. If Q2 really is
      the next bfq_queue to serve, then Q2 will be set in service
      immediately.
      
      This change reduces the component of the I/O latency caused by the
      above delay by about 80%. For example, on an (old) PLEXTOR PX-256M5
      SSD, the maximum latency reported by fio drops from 15.1 to 3.2 ms for
      a process doing sporadic random reads while another process is doing
      continuous sequential reads.
      Signed-off-by: NNicola Bottura <bottura.nicola95@gmail.com>
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      96a291c3
    • P
      block, bfq: detect wakers and unconditionally inject their I/O · 13a857a4
      Paolo Valente 提交于
      A bfq_queue Q may happen to be synchronized with another
      bfq_queue Q2, i.e., the I/O of Q2 may need to be completed for Q to
      receive new I/O. We call Q2 "waker queue".
      
      If I/O plugging is being performed for Q, and Q is not receiving any
      more I/O because of the above synchronization, then, thanks to BFQ's
      injection mechanism, the waker queue is likely to get served before
      the I/O-plugging timeout fires.
      
      Unfortunately, this fact may not be sufficient to guarantee a high
      throughput during the I/O plugging, because the inject limit for Q may
      be too low to guarantee a lot of injected I/O. In addition, the
      duration of the plugging, i.e., the time before Q finally receives new
      I/O, may not be minimized, because the waker queue may happen to be
      served only after other queues.
      
      To address these issues, this commit introduces the explicit detection
      of the waker queue, and the unconditional injection of a pending I/O
      request of the waker queue on each invocation of
      bfq_dispatch_request().
      
      One may be concerned that this systematic injection of I/O from the
      waker queue delays the service of Q's I/O. Fortunately, it doesn't. On
      the contrary, next Q's I/O is brought forward dramatically, for it is
      not blocked for milliseconds.
      Reported-by: NSrivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
      Tested-by: NSrivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      13a857a4
    • P
      block, bfq: bring forward seek&think time update · a3f9bce3
      Paolo Valente 提交于
      Until the base value for request service times gets finally computed
      for a bfq_queue, the inject limit for that queue does depend on the
      think-time state (short|long) of the queue. A timely update of the
      think time then guarantees a quicker activation or deactivation of the
      injection. Fortunately, the think time of a bfq_queue is updated in
      the same code path as the inject limit; but after the inject limit.
      
      This commits moves the update of the think time before the update of
      the inject limit. For coherence, it moves the update of the seek time
      too.
      Reported-by: NSrivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
      Tested-by: NSrivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      a3f9bce3
    • P
      block, bfq: update base request service times when possible · 24792ad0
      Paolo Valente 提交于
      I/O injection gets reduced if it increases the request service times
      of the victim queue beyond a certain threshold.  The threshold, in its
      turn, is computed as a function of the base service time enjoyed by
      the queue when it undergoes no injection.
      
      As a consequence, for injection to work properly, the above base value
      has to be accurate. In this respect, such a value may vary over
      time. For example, it varies if the size or the spatial locality of
      the I/O requests in the queue change. It is then important to update
      this value whenever possible. This commit performs this update.
      Reported-by: NSrivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
      Tested-by: NSrivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      24792ad0
    • P
      block, bfq: fix rq_in_driver check in bfq_update_inject_limit · db599f9e
      Paolo Valente 提交于
      One of the cases where the parameters for injection may be updated is
      when there are no more in-flight I/O requests. The number of in-flight
      requests is stored in the field bfqd->rq_in_driver of the descriptor
      bfqd of the device. So, the controlled condition is
      bfqd->rq_in_driver == 0.
      
      Unfortunately, this is wrong because, the instruction that checks this
      condition is in the code path that handles the completion of a
      request, and, in particular, the instruction is executed before
      bfqd->rq_in_driver is decremented in such a code path.
      
      This commit fixes this issue by just replacing 0 with 1 in the
      comparison.
      Reported-by: NSrivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
      Tested-by: NSrivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      db599f9e
    • P
      block, bfq: reset inject limit when think-time state changes · 766d6141
      Paolo Valente 提交于
      Until the base value of the request service times gets finally
      computed for a bfq_queue, the inject limit does depend on the
      think-time state (short|long). The limit must be 0 or 1 if the think
      time is deemed, respectively, as short or long. However, such a check
      and possible limit update is performed only periodically, once per
      second. So, to make the injection mechanism much more reactive, this
      commit performs the update also every time the think-time state
      changes.
      
      In addition, in the following special case, this commit lets the
      inject limit of a bfq_queue bfqq remain equal to 1 even if bfqq's
      think time is short: bfqq's I/O is synchronized with that of some
      other queue, i.e., bfqq may receive new I/O only after the I/O of the
      other queue is completed. Keeping the inject limit to 1 allows the
      blocking I/O to be served while bfqq is in service. And this is very
      convenient both for bfqq and for the total throughput, as explained
      in detail in the comments in bfq_update_has_short_ttime().
      Reported-by: NSrivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
      Tested-by: NSrivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu>
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      766d6141
  2. 21 6月, 2019 2 次提交
  3. 01 5月, 2019 1 次提交
  4. 14 4月, 2019 1 次提交
  5. 10 4月, 2019 1 次提交
    • P
      block, bfq: fix use after free in bfq_bfqq_expire · eed47d19
      Paolo Valente 提交于
      The function bfq_bfqq_expire() invokes the function
      __bfq_bfqq_expire(), and the latter may free the in-service bfq-queue.
      If this happens, then no other instruction of bfq_bfqq_expire() must
      be executed, or a use-after-free will occur.
      
      Basing on the assumption that __bfq_bfqq_expire() invokes
      bfq_put_queue() on the in-service bfq-queue exactly once, the queue is
      assumed to be freed if its refcounter is equal to one right before
      invoking __bfq_bfqq_expire().
      
      But, since commit 9dee8b3b ("block, bfq: fix queue removal from
      weights tree") this assumption is false. __bfq_bfqq_expire() may also
      invoke bfq_weights_tree_remove() and, since commit 9dee8b3b
      ("block, bfq: fix queue removal from weights tree"), also
      the latter function may invoke bfq_put_queue(). So __bfq_bfqq_expire()
      may invoke bfq_put_queue() twice, and this is the actual case where
      the in-service queue may happen to be freed.
      
      To address this issue, this commit moves the check on the refcounter
      of the queue right around the last bfq_put_queue() that may be invoked
      on the queue.
      
      Fixes: 9dee8b3b ("block, bfq: fix queue removal from weights tree")
      Reported-by: NDmitrii Tcvetkov <demfloro@demfloro.ru>
      Reported-by: NDouglas Anderson <dianders@chromium.org>
      Tested-by: NDmitrii Tcvetkov <demfloro@demfloro.ru>
      Tested-by: NDouglas Anderson <dianders@chromium.org>
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      eed47d19
  6. 09 4月, 2019 1 次提交
  7. 01 4月, 2019 9 次提交
    • F
      block, bfq: save & resume weight on a queue merge/split · fffca087
      Francesco Pollicino 提交于
      bfq saves the state of a queue each time a merge occurs, to be
      able to resume such a state when the queue is associated again
      with its original process, on a split.
      
      Unfortunately bfq does not save & restore also the weight of the
      queue. If the weight is not correctly resumed when the queue is
      recycled, then the weight of the recycled queue could differ
      from the weight of the original queue.
      
      This commit adds the missing save & resume of the weight.
      Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: NOleksandr Natalenko <oleksandr@natalenko.name>
      Signed-off-by: NFrancesco Pollicino <fra.fra.800@gmail.com>
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      fffca087
    • F
      block, bfq: print SHARED instead of pid for shared queues in logs · 1e66413c
      Francesco Pollicino 提交于
      The function "bfq_log_bfqq" prints the pid of the process
      associated with the queue passed as input.
      
      Unfortunately, if the queue is shared, then more than one process
      is associated with the queue. The pid that gets printed in this
      case is the pid of one of the associated processes.
      Which process gets printed depends on the exact sequence of merge
      events the queue underwent. So printing such a pid is rather
      useless and above all is often rather confusing because it
      reports a random pid between those of the associated processes.
      
      This commit addresses this issue by printing SHARED instead of a pid
      if the queue is shared.
      Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: NOleksandr Natalenko <oleksandr@natalenko.name>
      Signed-off-by: NFrancesco Pollicino <fra.fra.800@gmail.com>
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      1e66413c
    • P
      block, bfq: always protect newly-created queues from existing active queues · 84a74689
      Paolo Valente 提交于
      If many bfq_queues belonging to the same group happen to be created
      shortly after each other, then the processes associated with these
      queues have typically a common goal. In particular, bursts of queue
      creations are usually caused by services or applications that spawn
      many parallel threads/processes. Examples are systemd during boot, or
      git grep. If there are no other active queues, then, to help these
      processes get their job done as soon as possible, the best thing to do
      is to reach a high throughput. To this goal, it is usually better to
      not grant either weight-raising or device idling to the queues
      associated with these processes. And this is exactly what BFQ
      currently does.
      
      There is however a drawback: if, in contrast, some other queues are
      already active, then the newly created queues must be protected from
      the I/O flowing through the already existing queues. In this case, the
      best thing to do is the opposite as in the other case: it is much
      better to grant weight-raising and device idling to the newly-created
      queues, if they deserve it. This commit addresses this issue by doing
      so if there are already other active queues.
      
      This change also helps eliminating false positives, which occur when
      the newly-created queues do not belong to an actual large burst of
      creations, but some background task (e.g., a service) happens to
      trigger the creation of new queues in the middle, i.e., very close to
      when the victim queues are created. These false positive may cause
      total loss of control on process latencies.
      Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: NOleksandr Natalenko <oleksandr@natalenko.name>
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      84a74689
    • P
      block, bfq: do not tag totally seeky queues as soft rt · 7074f076
      Paolo Valente 提交于
      Sync random I/O is likely to be confused with soft real-time I/O,
      because it is characterized by limited throughput and apparently
      isochronous arrival pattern. To avoid false positives, this commits
      prevents bfq_queues containing only random (seeky) I/O from being
      tagged as soft real-time.
      Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: NOleksandr Natalenko <oleksandr@natalenko.name>
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      7074f076
    • P
      block, bfq: do not merge queues on flash storage with queueing · 8cacc5ab
      Paolo Valente 提交于
      To boost throughput with a set of processes doing interleaved I/O
      (i.e., a set of processes whose individual I/O is random, but whose
      merged cumulative I/O is sequential), BFQ merges the queues associated
      with these processes, i.e., redirects the I/O of these processes into a
      common, shared queue. In the shared queue, I/O requests are ordered by
      their position on the medium, thus sequential I/O gets dispatched to
      the device when the shared queue is served.
      
      Queue merging costs execution time, because, to detect which queues to
      merge, BFQ must maintain a list of the head I/O requests of active
      queues, ordered by request positions. Measurements showed that this
      costs about 10% of BFQ's total per-request processing time.
      
      Request processing time becomes more and more critical as the speed of
      the underlying storage device grows. Yet, fortunately, queue merging
      is basically useless on the very devices that are so fast to make
      request processing time critical. To reach a high throughput, these
      devices must have many requests queued at the same time. But, in this
      configuration, the internal scheduling algorithms of these devices do
      also the job of queue merging: they reorder requests so as to obtain
      as much as possible a sequential I/O pattern. As a consequence, with
      processes doing interleaved I/O, the throughput reached by one such
      device is likely to be the same, with and without queue merging.
      
      In view of this fact, this commit disables queue merging, and all
      related housekeeping, for non-rotational devices with internal
      queueing. The total, single-lock-protected, per-request processing
      time of BFQ drops to, e.g., 1.9 us on an Intel Core i7-2760QM@2.40GHz
      (time measured with simple code instrumentation, and using the
      throughput-sync.sh script of the S suite [1], in performance-profiling
      mode). To put this result into context, the total,
      single-lock-protected, per-request execution time of the lightest I/O
      scheduler available in blk-mq, mq-deadline, is 0.7 us (mq-deadline is
      ~800 LOC, against ~10500 LOC for BFQ).
      
      Disabling merging provides a further, remarkable benefit in terms of
      throughput. Merging tends to make many workloads artificially more
      uneven, mainly because of shared queues remaining non empty for
      incomparably more time than normal queues. So, if, e.g., one of the
      queues in a set of merged queues has a higher weight than a normal
      queue, then the shared queue may inherit such a high weight and, by
      staying almost always active, may force BFQ to perform I/O plugging
      most of the time. This evidently makes it harder for BFQ to let the
      device reach a high throughput.
      
      As a practical example of this problem, and of the benefits of this
      commit, we measured again the throughput in the nasty scenario
      considered in previous commit messages: dbench test (in the Phoronix
      suite), with 6 clients, on a filesystem with journaling, and with the
      journaling daemon enjoying a higher weight than normal processes. With
      this commit, the throughput grows from ~150 MB/s to ~200 MB/s on a
      PLEXTOR PX-256M5 SSD. This is the same peak throughput reached by any
      of the other I/O schedulers. As such, this is also likely to be the
      maximum possible throughput reachable with this workload on this
      device, because I/O is mostly random, and the other schedulers
      basically just pass I/O requests to the drive as fast as possible.
      
      [1] https://github.com/Algodev-github/STested-by: NHolger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: NOleksandr Natalenko <oleksandr@natalenko.name>
      Tested-by: NFrancesco Pollicino <fra.fra.800@gmail.com>
      Signed-off-by: NAlessio Masola <alessio.masola@gmail.com>
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      8cacc5ab
    • P
      block, bfq: tune service injection basing on request service times · 2341d662
      Paolo Valente 提交于
      The processes associated with a bfq_queue, say Q, may happen to
      generate their cumulative I/O at a lower rate than the rate at which
      the device could serve the same I/O. This is rather probable, e.g., if
      only one process is associated with Q and the device is an SSD. It
      results in Q becoming often empty while in service. If BFQ is not
      allowed to switch to another queue when Q becomes empty, then, during
      the service of Q, there will be frequent "service holes", i.e., time
      intervals during which Q gets empty and the device can only consume
      the I/O already queued in its hardware queues. This easily causes
      considerable losses of throughput.
      
      To counter this problem, BFQ implements a request injection mechanism,
      which tries to fill the above service holes with I/O requests taken
      from other bfq_queues. The hard part in this mechanism is finding the
      right amount of I/O to inject, so as to both boost throughput and not
      break Q's bandwidth and latency guarantees. To this goal, the current
      version of this mechanism measures the bandwidth enjoyed by Q while it
      is being served, and tries to inject the maximum possible amount of
      extra service that does not cause Q's bandwidth to decrease too
      much.
      
      This solution has an important shortcoming. For bandwidth measurements
      to be stable and reliable, Q must remain in service for a much longer
      time than that needed to serve a single I/O request. Unfortunately,
      this does not hold with many workloads. This commit addresses this
      issue by changing the way the amount of injection allowed is
      dynamically computed. It tunes injection as a function of the service
      times of single I/O requests of Q, instead of Q's
      bandwidth. Single-request service times are evidently meaningful even
      if Q gets very few I/O requests completed while it is in service.
      
      As a testbed for this new solution, we measured the throughput reached
      by BFQ for one of the nastiest workloads and configurations for this
      scheduler: the workload generated by the dbench test (in the Phoronix
      suite), with 6 clients, on a filesystem with journaling, and with the
      journaling daemon enjoying a higher weight than normal processes.
      With this commit, the throughput grows from ~100 MB/s to ~150 MB/s on
      a PLEXTOR PX-256M5.
      Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: NOleksandr Natalenko <oleksandr@natalenko.name>
      Tested-by: NFrancesco Pollicino <fra.fra.800@gmail.com>
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      2341d662
    • P
      block, bfq: do not idle for lowest-weight queues · fb53ac6c
      Paolo Valente 提交于
      In most cases, it is detrimental for throughput to plug I/O dispatch
      when the in-service bfq_queue becomes temporarily empty (plugging is
      performed to wait for the possible arrival, soon, of new I/O from the
      in-service queue). There is however a case where plugging is needed
      for service guarantees. If a bfq_queue, say Q, has a higher weight
      than some other active bfq_queue, and is sync, i.e., contains sync
      I/O, then, to guarantee that Q does receive a higher share of the
      throughput than other lower-weight queues, it is necessary to plug I/O
      dispatch when Q remains temporarily empty while being served.
      
      For this reason, BFQ performs I/O plugging when some active bfq_queue
      has a higher weight than some other active bfq_queue. But this is
      unnecessarily overkill. In fact, if the in-service bfq_queue actually
      has a weight lower than or equal to the other queues, then the queue
      *must not* be guaranteed a higher share of the throughput than the
      other queues. So, not plugging I/O cannot cause any harm to the
      queue. And can boost throughput.
      
      Taking advantage of this fact, this commit does not plug I/O for sync
      bfq_queues with a weight lower than or equal to the weights of the
      other queues. Here is an example of the resulting throughput boost
      with the dbench workload, which is particularly nasty for BFQ. With
      the dbench test in the Phoronix suite, BFQ reaches its lowest total
      throughput with 6 clients on a filesystem with journaling, in case the
      journaling daemon has a higher weight than normal processes. Before
      this commit, the total throughput was ~80 MB/sec on a PLEXTOR PX-256M5,
      after this commit it is ~100 MB/sec.
      Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: NOleksandr Natalenko <oleksandr@natalenko.name>
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      fb53ac6c
    • P
      block, bfq: increase idling for weight-raised queues · 778c02a2
      Paolo Valente 提交于
      If a sync bfq_queue has a higher weight than some other queue, and
      remains temporarily empty while in service, then, to preserve the
      bandwidth share of the queue, it is necessary to plug I/O dispatching
      until a new request arrives for the queue. In addition, a timeout
      needs to be set, to avoid waiting for ever if the process associated
      with the queue has actually finished its I/O.
      
      Even with the above timeout, the device is however not fed with new
      I/O for a while, if the process has finished its I/O. If this happens
      often, then throughput drops and latencies grow. For this reason, the
      timeout is kept rather low: 8 ms is the current default.
      
      Unfortunately, such a low value may cause, on the opposite end, a
      violation of bandwidth guarantees for a process that happens to issue
      new I/O too late. The higher the system load, the higher the
      probability that this happens to some process. This is a problem in
      scenarios where service guarantees matter more than throughput. One
      important case are weight-raised queues, which need to be granted a
      very high fraction of the bandwidth.
      
      To address this issue, this commit lower-bounds the plugging timeout
      for weight-raised queues to 20 ms. This simple change provides
      relevant benefits. For example, on a PLEXTOR PX-256M5S, with which
      gnome-terminal starts in 0.6 seconds if there is no other I/O in
      progress, the same applications starts in
      - 0.8 seconds, instead of 1.2 seconds, if ten files are being read
        sequentially in parallel
      - 1 second, instead of 2 seconds, if, in parallel, five files are
        being read sequentially, and five more files are being written
        sequentially
      Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: NOleksandr Natalenko <oleksandr@natalenko.name>
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      778c02a2
    • K
      block/bfq: fix ifdef for CONFIG_BFQ_GROUP_IOSCHED=y · 42b1bd33
      Konstantin Khlebnikov 提交于
      Replace BFQ_GROUP_IOSCHED_ENABLED with CONFIG_BFQ_GROUP_IOSCHED.
      Code under these ifdefs never worked, something might be broken.
      
      Fixes: 0471559c ("block, bfq: add/remove entity weights correctly")
      Fixes: 73d58118 ("block, bfq: consider also ioprio classes in symmetry detection")
      Reviewed-by: NHolger Hoffstätte <holger@applied-asynchrony.com>
      Signed-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      42b1bd33
  8. 01 2月, 2019 14 次提交
    • P
      block, bfq: fix in-service-queue check for queue merging · 058fdecc
      Paolo Valente 提交于
      When a new I/O request arrives for a bfq_queue, say Q, bfq checks
      whether that request is close to
      (a) the head request of some other queue waiting to be served, or
      (b) the last request dispatched for the in-service queue (in case Q
      itself is not the in-service queue)
      
      If a queue, say Q2, is found for which the above condition holds, then
      bfq merges Q and Q2, to hopefully get a more sequential I/O in the
      resulting merged queue, and thus a possibly higher throughput.
      
      Case (b) is checked by comparing the new request for Q with the last
      request dispatched, assuming that the latter necessarily belonged to the
      in-service queue. Unfortunately, this assumption is no longer always
      correct, since commit d0edc247 ("block, bfq: inject other-queue I/O
      into seeky idle queues on NCQ flash").
      
      When the assumption does not hold, queues that must not be merged may be
      merged, causing unexpected loss of control on per-queue service
      guarantees.
      
      This commit solves this problem by adding an extra field, which stores
      the actual last request dispatched for the in-service queue, and by
      using this new field to correctly check case (b).
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      058fdecc
    • P
      block, bfq: do not overcharge writes in asymmetric scenarios · 02a6d787
      Paolo Valente 提交于
      Writes tend to starve reads. bfq counters this problem by overcharging
      writes with an inflated service w.r.t. the actual service (number of
      sector written) they receive.
      
      Yet his overcharging is useless, and actually causes unfairness in the
      opposite direction, when bfq happens to be enforcing strong I/O control.
      bfq does this enforcing when the scenario is asymmetric, i.e., when some
      bfq_queue or group of bfq_queues is to be granted a different bandwidth
      than some other bfq_queue or group of bfq_queues. So, in such a
      scenario, this commit disables write overcharging.
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      02a6d787
    • P
      block, bfq: port commit "cfq-iosched: improve hw_tag detection" · b3c34981
      Paolo Valente 提交于
      The original commit is commit 1a1238a7 ("cfq-iosched: improve hw_tag
      detection") and has the following commit message:
      
      If active queue hasn't enough requests and idle window opens, cfq will
      not dispatch sufficient requests to hardware. In such situation, current
      code will zero hw_tag. But this is because cfq doesn't dispatch enough
      requests instead of hardware queue doesn't work. Don't zero hw_tag in
      such case.
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      b3c34981
    • P
      block, bfq: reduce threshold for detecting command queueing · a3c92560
      Paolo Valente 提交于
      bfq simple heuristic from cfq for detecting whether the drive performs
      command queueing: check whether the average number of in-flight requests
      is above a given threshold. Unfortunately this heuristic does fail to
      detect queueing (on drives with queueing) if processes doing I/O are few
      and issue I/O with a low depth.
      
      To reduce false negatives, this commit lowers the threshold.
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      a3c92560
    • P
      block, bfq: fix queue removal from weights tree · 9dee8b3b
      Paolo Valente 提交于
      bfq maintains an ordered list, through a red-black tree, of unique
      weights of active bfq_queues. This list is used to detect whether there
      are active queues with differentiated weights. The weight of a queue is
      removed from the list when both the following two conditions become
      true:
      
      (1) the bfq_queue is flagged as inactive
      (2) the has no in-flight request any longer;
      
      Unfortunately, in the rare cases where condition (2) becomes true before
      condition (1), the removal fails, because the function to remove the
      weight of the queue (bfq_weights_tree_remove) is rightly invoked in the
      path that deactivates the bfq_queue, but mistakenly invoked *before* the
      function that actually performs the deactivation (bfq_deactivate_bfqq).
      
      This commits moves the invocation of bfq_weights_tree_remove for
      condition (1) to after bfq_deactivate_bfqq. As a consequence of this
      move, it is necessary to add a further reference to the queue when the
      weight of a queue is added, because the queue might otherwise be freed
      before bfq_weights_tree_remove is invoked. This commit adds this
      reference and makes all related modifications.
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      9dee8b3b
    • P
      block, bfq: fix sequential rq detection in rate estimation · d87447d8
      Paolo Valente 提交于
      In bfq_update_peak_rate, to check whether an I/O request rq is
      sequential, only the seek distance of rq w.r.t. the last request
      dispatched is controlled. This is not sufficient for non-rotational
      storage, where the size of rq is at least as relevant. This commit adds
      the missing control.
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      d87447d8
    • P
      block, bfq: unconditionally plug I/O in asymmetric scenarios · 530c4cbb
      Paolo Valente 提交于
      bfq detects the creation of multiple bfq_queues shortly after each
      other, namely a burst of queue creations in the terminology used in the
      code. If the burst is large, then no queue in the burst is granted
      - either I/O-dispatch plugging when the queue remains temporarily idle
        while in service;
      - or weight raising, because it causes even longer plugging.
      
      In fact, such a plugging tends to lower throughput, while these bursts
      are typically due to applications or services that spawn multiple
      processes, to reach a common goal as soon as possible. Examples are a
      "git grep" or the booting of a system.
      
      Unfortunately, disabling plugging may cause a loss of service guarantees
      in asymmetric scenarios, i.e., if queue weights are differentiated or if
      more than one group is active.
      
      This commit addresses this issue by no longer disabling I/O-dispatch
      plugging for queues in large bursts.
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      530c4cbb
    • P
      block, bfq: do not plug I/O of in-service queue when harmful · ac8b0cb4
      Paolo Valente 提交于
      If the in-service bfq_queue is sync and remains temporarily idle, then
      I/O dispatching (from other queues) may be plugged. It may be dome for
      two reasons: either to boost throughput, or to preserve the bandwidth
      share of the in-service queue. In the first case, if the I/O of the
      in-service queue, when it finally arrives, consists only of one small
      I/O request, then it makes sense to plug even the I/O of the in-service
      queue. In fact, serving such a small request immediately is likely to
      lower throughput instead of boosting it, whereas waiting a little bit is
      likely to let that request grow, thanks to request merging, and become
      more profitable in terms of throughput (this is likely to happen exactly
      because the I/O of the queue has been detected to boost throughput).
      
      On the opposite end, if I/O dispatching is being plugged only to
      preserve the bandwidth of the in-service queue, then it would be better
      not to plug also the I/O of the in-service queue, because such a
      plugging is likely to cause only loss of bandwidth for the queue.
      
      Unfortunately, no distinction is made between the two cases, and the I/O
      of the in-service queue is always plugged in case just a small I/O
      request arrives. This commit draws this missing distinction and does not
      perform harmful plugging.
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      ac8b0cb4
    • P
      block, bfq: split function bfq_better_to_idle · 05c2f5c3
      Paolo Valente 提交于
      This is a preparatory commit for commits that need to check only one of
      the two main reasons for idling. This change should also improve the
      quality of the code a little bit, by splitting a function that contains
      very long, non-trivial and little related comments.
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      05c2f5c3
    • P
      block, bfq: consider also ioprio classes in symmetry detection · 73d58118
      Paolo Valente 提交于
      In asymmetric scenarios, i.e., when some bfq_queue or bfq_group needs to
      be guaranteed a different bandwidth than other bfq_queues or bfq_groups,
      these service guaranteed can be provided only by plugging I/O dispatch,
      completely or partially, when the queue in service remains temporarily
      empty. A case where asymmetry is particularly strong is when some active
      bfq_queues belong to a higher-priority class than some other active
      bfq_queues. Unfortunately, this important case is not considered at all
      in the code for detecting asymmetric scenarios. This commit adds the
      missing logic.
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      73d58118
    • P
      block, bfq: remove case of redirected bic from insert_request · 03e565e4
      Paolo Valente 提交于
      Before commit 18e5a57d ("block, bfq: postpone rq preparation to
      insert or merge"), the destination queue for a request was chosen by a
      different hook than the one that then inserted the request. So, between
      the execution of the two hooks, the bic of the process generating the
      request could happen to be redirected to a different bfq_queue. As a
      consequence, the destination bfq_queue stored in the request could be
      wrong. Such an event does not need to ba handled any longer.
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      03e565e4
    • P
      block, bfq: make sure queue budgets are not below service received · f3218ad8
      Paolo Valente 提交于
      With some unlucky sequences of events, the function bfq_updated_next_req
      updates the current budget of a bfq_queue to a lower value than the
      service received by the queue using such a budget. Unfortunately, if
      this happens, then the return value of the function bfq_bfqq_budget_left
      becomes inconsistent. This commit solves this problem by lower-bounding
      the budget computed in bfq_updated_next_req to the service currently
      charged to the queue.
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      f3218ad8
    • P
      block, bfq: avoid selecting a queue w/o budget · 218cb897
      Paolo Valente 提交于
      To boost throughput on devices with internal queueing and in scenarios
      where device idling is not strictly needed, bfq immediately starts
      serving a new bfq_queue if the in-service bfq_queue remains without
      pending I/O, even if new I/O may arrive soon for the latter queue. Then,
      if such I/O actually arrives soon, bfq preempts the new in-service
      bfq_queue so as to give the previous queue a chance to go on being
      served (in case the previous queue should actually be the one to be
      served, according to its timestamps).
      
      However, the in-service bfq_queue, say Q, may also be without further
      budget when it remains also pending I/O. Since bfq changes budgets
      dynamically to fit the needs of bfq_queues, this happens more often than
      one may expect. If this happens, then there is no point in trying to go
      on serving Q when new I/O arrives for it soon: Q would be expired
      immediately after being selected for service. This would only cause
      useless overhead. This commit avoids such a useless selection.
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      218cb897
    • P
      block, bfq: do not consider interactive queues in srt filtering · 20cd3245
      Paolo Valente 提交于
      The speed at which a bfq_queue receives I/O is one of the parameters by
      which bfq decides whether the queue is soft real-time (i.e., whether the
      queue contains the I/O of a soft real-time application). In particular,
      when a bfq_queue remains without outstanding I/O requests, bfq computes
      the minimum time instant, named soft_rt_next_start, at which the next
      request of the queue may arrive for the queue to be deemed as soft real
      time.
      
      Unfortunately this filtering may cause problems with a queue in
      interactive weight raising. In fact, such a queue may be conveying the
      I/O needed to load a soft real-time application. The latter will
      actually exhibit a soft real-time I/O pattern after it finally starts
      doing its job. But, if soft_rt_next_start is updated for an interactive
      bfq_queue, and the queue has received a lot of service before remaining
      with no outstanding request (likely to happen on a fast device), then
      soft_rt_next_start is assigned such a high value that, for a very long
      time, the queue is prevented from being possibly considered as soft real
      time.
      
      This commit removes the updating of soft_rt_next_start for bfq_queues in
      interactive weight raising.
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      20cd3245
  9. 08 12月, 2018 1 次提交
    • D
      blkcg: fix ref count issue with bio_blkcg() using task_css · 0fe061b9
      Dennis Zhou 提交于
      The bio_blkcg() function turns out to be inconsistent and consequently
      dangerous to use. The first part returns a blkcg where a reference is
      owned by the bio meaning it does not need to be rcu protected. However,
      the third case, the last line, is problematic:
      
      	return css_to_blkcg(task_css(current, io_cgrp_id));
      
      This can race against task migration and the cgroup dying. It is also
      semantically different as it must be called rcu protected and is
      susceptible to failure when trying to get a reference to it.
      
      This patch adds association ahead of calling bio_blkcg() rather than
      after. This makes association a required and explicit step along the
      code paths for calling bio_blkcg(). In blk-iolatency, association is
      moved above the bio_blkcg() call to ensure it will not return %NULL.
      
      BFQ uses the old bio_blkcg() function, but I do not want to address it
      in this series due to the complexity. I have created a private version
      documenting the inconsistency and noting not to use it.
      Signed-off-by: NDennis Zhou <dennis@kernel.org>
      Acked-by: NTejun Heo <tj@kernel.org>
      Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      0fe061b9
  10. 07 12月, 2018 1 次提交
    • P
      block, bfq: fix decrement of num_active_groups · ba7aeae5
      Paolo Valente 提交于
      Since commit '2d29c9f8 ("block, bfq: improve asymmetric scenarios
      detection")', if there are process groups with I/O requests waiting for
      completion, then BFQ tags the scenario as 'asymmetric'. This detection
      is needed for preserving service guarantees (for details, see comments
      on the computation * of the variable asymmetric_scenario in the
      function bfq_better_to_idle).
      
      Unfortunately, commit '2d29c9f8 ("block, bfq: improve asymmetric
      scenarios detection")' contains an error exactly in the updating of
      the number of groups with I/O requests waiting for completion: if a
      group has more than one descendant process, then the above number of
      groups, which is renamed from num_active_groups to a more appropriate
      num_groups_with_pending_reqs by this commit, may happen to be wrongly
      decremented multiple times, namely every time one of the descendant
      processes gets all its pending I/O requests completed.
      
      A correct, complete solution should work as follows. Consider a group
      that is inactive, i.e., that has no descendant process with pending
      I/O inside BFQ queues. Then suppose that num_groups_with_pending_reqs
      is still accounting for this group, because the group still has some
      descendant process with some I/O request still in
      flight. num_groups_with_pending_reqs should be decremented when the
      in-flight request of the last descendant process is finally completed
      (assuming that nothing else has changed for the group in the meantime,
      in terms of composition of the group and active/inactive state of
      child groups and processes). To accomplish this, an additional
      pending-request counter must be added to entities, and must be
      updated correctly.
      
      To avoid this additional field and operations, this commit resorts to
      the following tradeoff between simplicity and accuracy: for an
      inactive group that is still counted in num_groups_with_pending_reqs,
      this commit decrements num_groups_with_pending_reqs when the first
      descendant process of the group remains with no request waiting for
      completion.
      
      This simplified scheme provides a fix to the unbalanced decrements
      introduced by 2d29c9f8. Since this error was also caused by lack
      of comments on this non-trivial issue, this commit also adds related
      comments.
      
      Fixes: 2d29c9f8 ("block, bfq: improve asymmetric scenarios detection")
      Reported-by: NSteven Barrett <steven@liquorix.net>
      Tested-by: NSteven Barrett <steven@liquorix.net>
      Tested-by: NLucjan Lucjanov <lucjan.lucjanov@gmail.com>
      Reviewed-by: NFederico Motta <federico@willer.it>
      Signed-off-by: NPaolo Valente <paolo.valente@linaro.org>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      ba7aeae5
  11. 16 11月, 2018 1 次提交
  12. 08 11月, 2018 1 次提交