1. 09 Jul, 2018: 3 commits
    • block, bfq: fix service being wrongly set to zero in case of preemption · 9fae8dd5
      Committed by Paolo Valente
      If
      - a bfq_queue Q preempts another queue, because one request of Q
      arrives in time,
      - but, after this preemption, Q is not the queue that is set in service,
      then Q->entity.service is set to 0 when Q is eventually set in
      service. But Q should have continued receiving service with its old
      budget (which is why preemption has occurred) and its old service.
      
      This commit addresses this issue by resetting service only on the
      real expiration of the queue.
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
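      A toy model of why the reset point matters. The struct and function
      names below are hypothetical, not BFQ's real data structures: a queue
      that preempted keeps its old budget, so zeroing its service when it is
      set in service hands it more room than it is owed.

```c
/* Simplified, hypothetical model of per-queue service accounting;
 * these are not BFQ's real data structures. */
struct model_queue {
    unsigned long budget;  /* sectors the queue may receive in its slot */
    unsigned long service; /* sectors already received in this slot */
};

/* Sectors the queue may still receive before exhausting its budget. */
static unsigned long budget_left(const struct model_queue *q)
{
    return q->service >= q->budget ? 0 : q->budget - q->service;
}

/* Old behavior: service zeroed whenever the queue is set in service,
 * even if it only preempted and kept its old budget. */
static void set_in_service_old(struct model_queue *q)
{
    q->service = 0;
}

/* Fixed behavior: service is zeroed only when the queue really expires. */
static void expire_fixed(struct model_queue *q)
{
    q->service = 0;
}
```

      In this sketch, a queue with budget 100 that already received 40
      sectors should have 60 left; the old reset would give it 100 again.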
    • block, bfq: do not expire a queue that will deserve dispatch plugging · 4420b095
      Committed by Paolo Valente
      For some bfq_queues, BFQ plugs I/O dispatching when the queue becomes
      idle, and keeps the plug until a new request of the queue arrives, or
      a timeout fires. BFQ does so either to boost throughput or to preserve
      service guarantees for the queue.
      
      More precisely, for such a queue, plugging starts when the queue
      happens to have either no request enqueued, or no request in flight,
      that is, no request already dispatched but not yet completed.
      
      Conversely, BFQ may happen to expire a queue with no request
      enqueued, without doing any plugging, if the queue still has some
      request in flight. Unfortunately, such a premature expiration causes
      the queue to lose its chance to enjoy dispatch plugging a moment
      later, i.e., when its in-flight requests finally get completed. This
      breaks service guarantees for the queue.
      
      This commit prevents BFQ from expiring an empty queue if the latter
      still has in-flight requests.
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
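      The rule can be read as a predicate on two counters. This is a minimal
      sketch with hypothetical names; the real check lives inside BFQ's
      expiration logic and considers more state.

```c
#include <stdbool.h>

/* Hypothetical per-queue counters. */
struct model_queue {
    unsigned int queued;    /* requests enqueued, not yet dispatched */
    unsigned int in_flight; /* requests dispatched, not yet completed */
};

/* May an empty queue be expired right now? Expiring it while requests
 * are still in flight would forfeit the dispatch plugging it deserves
 * once those requests complete. */
static bool may_expire_empty_queue(const struct model_queue *q)
{
    return q->queued == 0 && q->in_flight == 0;
}
```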
    • block, bfq: add/remove entity weights correctly · 0471559c
      Committed by Paolo Valente
      To keep I/O throughput high as often as possible, BFQ performs
      I/O-dispatch plugging (aka device idling) only when this is beneficial
      for throughput, or when it is needed for service guarantees (low
      latency, fairness). An important case where the latter condition holds
      is when the scenario is 'asymmetric' in terms of weights: i.e., when
      some bfq_queue or whole group of queues has a higher weight, and thus
      has to receive more service, than other queues or groups. Without
      dispatch plugging, lower-weight queues/groups may unjustly steal
      bandwidth from higher-weight queues/groups.
      
      To detect asymmetric scenarios, BFQ checks some sufficient
      conditions. One of these conditions is that active groups have
      different weights. BFQ checks this condition by maintaining a
      special set of unique weights of active groups
      (group_weights_tree). To this end, in the functions
      bfq_active_insert/bfq_active_extract BFQ adds/removes the weight of a
      group to/from this set.
      
      Unfortunately, the function bfq_active_extract may happen to be
      invoked also for a group that is still active (to preserve the correct
      update of the next queue to serve, see comments in function
      bfq_no_longer_next_in_service() for details). In this case, removing
      the weight of the group makes the set group_weights_tree
      inconsistent. Service-guarantee violations follow.
      
      This commit addresses this issue by moving group_weights_tree
      insertions from their previous location (in bfq_active_insert) into
      the function __bfq_activate_entity, and by moving group_weights_tree
      extractions from bfq_active_extract to when the entity that represents
      a group remains thoroughly idle, i.e., with no request either enqueued
      or dispatched.
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
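      A counted set of weights captures the idea behind group_weights_tree:
      each distinct weight is stored once with the number of active groups
      using it, and the scenario is asymmetric when more than one distinct
      weight is present. This sketch uses a small array instead of BFQ's
      red-black tree, and all names are hypothetical. Removing the weight of
      a group that is still active, as the old code could do, makes the set
      report symmetry that does not exist.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_WEIGHTS 16

/* Hypothetical stand-in for group_weights_tree. */
struct weight_set {
    unsigned int weight[MAX_WEIGHTS];
    unsigned int count[MAX_WEIGHTS]; /* active groups with that weight */
    size_t n;                        /* number of distinct weights */
};

/* Called when a group becomes active. */
static bool weight_set_add(struct weight_set *s, unsigned int w)
{
    for (size_t i = 0; i < s->n; i++) {
        if (s->weight[i] == w) {
            s->count[i]++;
            return true;
        }
    }
    if (s->n == MAX_WEIGHTS)
        return false;
    s->weight[s->n] = w;
    s->count[s->n] = 1;
    s->n++;
    return true;
}

/* Should only be called once a group is thoroughly idle. */
static void weight_set_remove(struct weight_set *s, unsigned int w)
{
    for (size_t i = 0; i < s->n; i++) {
        if (s->weight[i] == w) {
            if (--s->count[i] == 0) {
                /* Last group with this weight: drop the entry. */
                s->n--;
                s->weight[i] = s->weight[s->n];
                s->count[i] = s->count[s->n];
            }
            return;
        }
    }
}

/* Active groups with different weights => asymmetric scenario. */
static bool weights_asymmetric(const struct weight_set *s)
{
    return s->n > 1;
}
```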
  2. 31 May, 2018: 7 commits
    • block, bfq: prevent soft_rt_next_start from being stuck at infinity · f6c3ca0e
      Committed by Davide Sapienza
      BFQ can deem a bfq_queue as soft real-time only if the queue
      - periodically becomes completely idle, i.e., empty and with
        no still-outstanding I/O request;
      - after becoming idle, gets new I/O only after a special reference
        time soft_rt_next_start.
      
      In this respect, after commit "block, bfq: consider also past I/O in
      soft real-time detection", the value of soft_rt_next_start can never
      decrease. This causes a problem with the following special updating
      case for soft_rt_next_start: to prevent queues that are not completely
      idle from being wrongly detected as soft real-time (when they become
      non-empty again), soft_rt_next_start is temporarily set to infinity
      for empty queues with still outstanding I/O requests. But, if such an
      update is actually performed, then, because of the above commit,
      soft_rt_next_start will be stuck at infinity forever, and the queue
      will have no more chance to be considered soft real-time.
      
      On slow systems, this problem does cause actual soft real-time
      applications to be occasionally not detected as such.
      
      This commit addresses this issue by eliminating the pushing of
      soft_rt_next_start to infinity, and by changing the way non-empty
      queues are prevented from being wrongly detected as soft
      real-time. Simply, a queue that becomes non-empty again can now be
      detected as soft real-time only if it has no outstanding I/O request.
      Signed-off-by: Davide Sapienza <sapienza.dav@gmail.com>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
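      The new eligibility rule, sketched as a predicate evaluated when an empty
      queue receives new I/O. The struct and field layout are hypothetical,
      and the plain comparison ignores jiffies wraparound, which kernel code
      handles with time_after()-style helpers.

```c
#include <stdbool.h>

/* Hypothetical inputs to the soft real-time check. */
struct srt_state {
    unsigned long now;                /* current time, in jiffies */
    unsigned long soft_rt_next_start; /* earliest arrival allowed */
    unsigned int in_flight;           /* dispatched, not yet completed */
};

static bool may_be_soft_rt(const struct srt_state *s)
{
    /* No "push soft_rt_next_start to infinity" trick: a queue with
     * outstanding I/O is simply not eligible this time, and
     * soft_rt_next_start keeps a finite value for later checks. */
    return s->in_flight == 0 && s->now >= s->soft_rt_next_start;
}
```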
    • block, bfq: increase weight-raising duration for interactive apps · d450542e
      Committed by Davide Sapienza
      The maximum possible duration of the weight-raising period for
      interactive applications is limited to 13 seconds, as this is the time
      needed to load the largest application that we considered when tuning
      weight raising. Unfortunately, in such an evaluation, we did not
      consider the case of very slow virtual machines.
      
      For example, on a QEMU/KVM virtual machine
      - running in a slow PC;
      - with a virtual disk stacked on a slow low-end 5400rpm HDD;
      - serving a heavy I/O workload, such as the sequential reading of
      several files;
      mplayer takes 23 seconds to start, if constantly weight-raised.
      
      To address this issue, this commit conservatively sets the upper limit
      for weight-raising duration to 25 seconds.
      Signed-off-by: Davide Sapienza <sapienza.dav@gmail.com>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
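      The effect of the new cap on the mplayer example, as a minimal sketch.
      Only the 13 s and 25 s limits come from the commit message; the helper
      name is hypothetical.

```c
/* Upper limits on the weight-raising duration, in milliseconds. */
#define WR_MAX_DURATION_OLD_MS 13000UL
#define WR_MAX_DURATION_NEW_MS 25000UL

/* Clamp a desired weight-raising duration to the given upper limit. */
static unsigned long cap_wr_duration(unsigned long wanted_ms,
                                     unsigned long limit_ms)
{
    return wanted_ms > limit_ms ? limit_ms : wanted_ms;
}
```

      With the old limit, a 23-second start-up would lose weight raising
      after 13 seconds; with the new one it stays raised for the whole
      start-up.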
    • block, bfq: remove slow-system class · e24f1c24
      Committed by Paolo Valente
      BFQ computes the duration of weight raising for interactive
      applications automatically, using some reference parameters. In
      particular, BFQ uses the best durations (see comments in the code for
      how these durations have been assessed) for two classes of systems:
      slow and fast ones. Examples of slow systems are old phones or systems
      using micro HDDs. Fast systems are all the remaining ones. Using these
      parameters, BFQ computes the actual duration of the weight raising,
      for the system at hand, as a function of the relative speed of the
      system w.r.t. the speed of a reference system, belonging to the same
      class of systems as the system at hand.
      
      This slow vs fast differentiation proved to be useful in the past, but
      happens to have little meaning with current hardware. Even worse, it
      does cause problems in virtual systems, where the speed of the system
      can vary frequently, and so widely as to confuse the class-detection
      mechanism and, as we have verified experimentally, to cause BFQ to
      compute nonsensical weight-raising durations.
      
      This commit addresses this issue by removing the slow class and the
      class-detection mechanism.
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
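      A sketch of computing the duration from the relative speed of the
      system, with a single reference pair instead of one per class. It
      assumes the duration scales inversely with the measured peak rate; the
      struct, field names and units are hypothetical.

```c
#include <stdint.h>

/* Hypothetical single reference: a system with peak rate ref_rate needs
 * ref_duration_ms of weight raising. */
struct wr_reference {
    uint64_t ref_rate;        /* reference peak rate, arbitrary units */
    uint64_t ref_duration_ms; /* weight-raising duration at ref_rate */
};

/* Assumed model: duration inversely proportional to the measured rate. */
static uint64_t wr_duration_ms(const struct wr_reference *r,
                               uint64_t peak_rate)
{
    if (peak_rate == 0)
        return r->ref_duration_ms; /* no estimate yet: use the reference */
    return r->ref_rate * r->ref_duration_ms / peak_rate;
}
```

      A system twice as fast as the reference gets half the duration, and
      one half as fast gets twice the duration, with no class detection
      that a fluctuating virtual disk could confuse.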
    • block, bfq: add description of weight-raising heuristics · 4029eef1
      Committed by Paolo Valente
      A description of how weight raising works is missing in BFQ
      sources. In addition, the code for handling weight raising is
      scattered across a few functions. This makes it rather hard to
      understand the mechanism and its rationale. This commit adds such a
      description at the beginning of the main source file.
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, bfq: remove the removal of 'next' rq in bfq_requests_merged · ac857e0d
      Committed by Filippo Muzzini
      Since bfq_finish_request() is always called on the request 'next',
      after bfq_requests_merged() is finished, and bfq_finish_request()
      removes 'next' from its bfq_queue if needed, it isn't necessary to do
      such a removal in advance in bfq_requests_merged().
      
      This commit removes such a useless 'next' removal.
      Signed-off-by: Filippo Muzzini <filippo.muzzini@outlook.it>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, bfq: remove wrong check in bfq_requests_merged · 8abfa4d6
      Committed by Paolo Valente
      The request rq passed to the function bfq_requests_merged is always in
      a bfq_queue, so the check !RB_EMPTY_NODE(&rq->rb_node) at the
      beginning of bfq_requests_merged always succeeds, and the control
      flow systematically skips to the end of the function.  This implies
      that the body of the function is never executed, i.e., the
      repositioning of rq is never performed.
      
      Conversely, a check is missing in the body of the function:
      'next' must be removed only if it is inside a bfq_queue.
      
      This commit removes the wrong check on rq, and adds the missing check
      on 'next'. In addition, this commit adds comments on
      bfq_requests_merged.
      Signed-off-by: Filippo Muzzini <filippo.muzzini@outlook.it>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
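      A user-space model of the control-flow change. In this sketch, a
      non-NULL q field stands in for !RB_EMPTY_NODE(&rq->rb_node), i.e., "the
      request sits in a bfq_queue"; all types and function names are
      hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical queue: only counts its queued requests. */
struct model_bfqq {
    unsigned int queued;
};

/* Hypothetical request: q != NULL means it is in a bfq_queue. */
struct model_rq {
    struct model_bfqq *q;
    bool repositioned;
};

/* Old control flow: rq is always in a queue, so the function always
 * bails out and its body is dead code. */
static void requests_merged_old(struct model_rq *rq, struct model_rq *next)
{
    if (rq->q != NULL)
        return;
    rq->repositioned = true;
    next->q->queued--; /* no check that next is in a queue */
    next->q = NULL;
}

/* New control flow: no check on rq; next is removed only if queued. */
static void requests_merged_new(struct model_rq *rq, struct model_rq *next)
{
    rq->repositioned = true;
    if (next->q != NULL) {
        next->q->queued--;
        next->q = NULL;
    }
}
```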
    • block, bfq: remove wrong lock in bfq_requests_merged · a12bffeb
      Committed by Filippo Muzzini
      In bfq_requests_merged(), there is a deadlock because the lock
      bfqq->bfqd->lock is held by the calling function, but the code of
      this function tries to grab the lock again.
      
      This deadlock is currently hidden by another bug (fixed by the next commit
      for this source file), which causes the body of bfq_requests_merged()
      to be never executed.
      
      This commit removes the deadlock by removing the lock/unlock pair.
      Signed-off-by: Filippo Muzzini <filippo.muzzini@outlook.it>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
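      The deadlock pattern, modeled with a hypothetical non-recursive lock
      that reports failure where a real spinlock would spin forever. This is
      an illustration of re-acquiring a lock the caller already holds, not
      the kernel's locking API.

```c
#include <stdbool.h>

/* Hypothetical non-recursive lock model. */
struct model_lock {
    bool held;
};

/* Returns false where a real spin_lock() on a held lock would hang. */
static bool model_lock_acquire(struct model_lock *l)
{
    if (l->held)
        return false;
    l->held = true;
    return true;
}

static void model_lock_release(struct model_lock *l)
{
    l->held = false;
}

/* The caller takes the scheduler lock, then invokes the merge hook. */
static bool run_with_lock_held(struct model_lock *l,
                               bool (*hook)(struct model_lock *))
{
    model_lock_acquire(l);
    bool ok = hook(l);
    model_lock_release(l);
    return ok;
}

/* Old hook: tries to take the same lock again. */
static bool hook_old(struct model_lock *l)
{
    if (!model_lock_acquire(l))
        return false; /* deadlock in the real code */
    model_lock_release(l);
    return true;
}

/* New hook: no lock/unlock pair; it runs under the caller's lock. */
static bool hook_new(struct model_lock *l)
{
    return l->held;
}
```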
  3. 11 May, 2018: 5 commits
  4. 09 May, 2018: 1 commit
    • block: consolidate struct request timestamp fields · 522a7775
      Committed by Omar Sandoval
      Currently, struct request has four timestamp fields:
      
      - A start time, set at get_request time, in jiffies, used for iostats
      - An I/O start time, set at start_request time, in ktime nanoseconds,
        used for blk-stats (i.e., wbt, kyber, hybrid polling)
      - Another start time and another I/O start time, used for cfq and bfq
      
      These can all be consolidated into one start time and one I/O start
      time, both in ktime nanoseconds, shaving off up to 16 bytes from struct
      request depending on the kernel config.
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
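      A sketch of where the saving comes from. Both layouts are hypothetical
      and heavily simplified: the real struct request has many more fields,
      and its exact layout depends on the kernel config and architecture.

```c
#include <stdint.h>

/* Hypothetical "before" layout: four timestamps, mixed units. */
struct req_times_old {
    unsigned long start_time;     /* jiffies, for iostats */
    uint64_t io_start_time_ns;    /* ktime ns, for blk-stats */
    uint64_t sched_start_ns;      /* cfq/bfq start time */
    uint64_t sched_io_start_ns;   /* cfq/bfq I/O start time */
};

/* Hypothetical "after" layout: one start time and one I/O start time,
 * both in ktime nanoseconds. */
struct req_times_new {
    uint64_t start_time_ns;
    uint64_t io_start_time_ns;
};
```

      On a typical 64-bit build these two structs differ by 16 bytes; on
      32-bit builds the difference can be smaller, matching the "up to 16
      bytes" above.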
  5. 18 Apr, 2018: 1 commit
  6. 27 Mar, 2018: 1 commit
  7. 08 Feb, 2018: 1 commit
    • block, bfq: add requeue-request hook · a7877390
      Committed by Paolo Valente
      Commit 'a6a252e6 ("blk-mq-sched: decide how to handle flush rq via
      RQF_FLUSH_SEQ")' makes all non-flush re-prepared requests for a device
      be re-inserted into the active I/O scheduler for that device. As a
      consequence, I/O schedulers may get the same request inserted again,
      even several times, without a finish_request invoked on that request
      before each re-insertion.
      
      This fact is the cause of the failure reported in [1]. For an I/O
      scheduler, every re-insertion of the same re-prepared request is
      equivalent to the insertion of a new request. For schedulers like
      mq-deadline or kyber, this fact causes no harm. In contrast, it
      confuses a stateful scheduler like BFQ, which keeps state for an I/O
      request, until the finish_request hook is invoked on the request. In
      particular, BFQ may get stuck, waiting forever for the number of
      request dispatches, of the same request, to be balanced by an equal
      number of request completions (while there will be only one completion for
      that request). In this state, BFQ may refuse to serve I/O requests
      from other bfq_queues. The hang reported in [1] then follows.
      
      However, the above re-prepared requests undergo a requeue, thus the
      requeue_request hook of the active elevator is invoked for these
      requests, if set. This commit then addresses the above issue by
      properly implementing the hook requeue_request in BFQ.
      
      [1] https://marc.info/?l=linux-block&m=151211117608676
      Reported-by: Ivan Kozik <ivan@ludios.org>
      Reported-by: Alban Browaeys <alban.browaeys@gmail.com>
      Tested-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Serena Ziviani <ziviani.serena@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
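      A toy model of the imbalance and of how a requeue hook restores it. A
      single counter stands in for the per-request state BFQ keeps; the names
      are hypothetical and are not the elevator's real hook signatures.

```c
#include <stdbool.h>

/* Hypothetical count of dispatches not yet balanced by a completion. */
struct model_sched {
    unsigned int pending;
};

/* Every (re-)insertion followed by a dispatch looks like a new request. */
static void insert_and_dispatch(struct model_sched *s)
{
    s->pending++;
}

/* finish_request: the request completed. */
static void finish_request(struct model_sched *s)
{
    if (s->pending)
        s->pending--;
}

/* requeue_request hook: drop the state kept for the request before it
 * is re-inserted, so re-insertions no longer accumulate. */
static void requeue_request(struct model_sched *s)
{
    if (s->pending)
        s->pending--;
}

static bool balanced(const struct model_sched *s)
{
    return s->pending == 0;
}
```

      Without the hook, two dispatches of the same request and one
      completion leave the model waiting forever; with the hook, the
      counter returns to zero.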
  8. 18 Jan, 2018: 2 commits
    • block, bfq: limit sectors served with interactive weight raising · 8a8747dc
      Committed by Paolo Valente
      To maximise responsiveness, BFQ raises the weight, and performs device
      idling, for bfq_queues associated with processes deemed as
      interactive. In particular, weight raising has a maximum duration,
      equal to the time needed to start a large application. If a
      weight-raised process goes on doing I/O beyond this maximum duration,
      it loses weight-raising.
      
      This mechanism is evidently vulnerable to the following false
      positives: I/O-bound applications that will go on doing I/O for much
      longer than the duration of weight-raising. These applications have
      basically no benefit from being weight-raised at the beginning of
      their I/O. On the opposite end, while being weight-raised, these
      applications
      a) unjustly steal throughput from applications that may truly need
      low latency;
      b) make BFQ uselessly perform device idling; device idling results
      in loss of device throughput with most flash-based storage, and may
      increase latencies when used purposelessly.
      
      This commit adds a countermeasure to reduce both the above
      problems. To introduce this countermeasure, we provide the following
      extra piece of information (full details in the comments added by this
      commit). During the start-up of the large application used as a
      reference to set the duration of weight-raising, involved processes
      transfer at most ~110K sectors each. Accordingly, a process initially
      deemed as interactive has no right to be weight-raised any longer,
      once transferred 110K sectors or more.
      
      Based on this consideration, this commit early-ends weight-raising
      for a bfq_queue if the latter happens to have received an amount of
      service at least equal to 110K sectors (actually, a little bit more,
      to keep a safety margin). I/O-bound applications that reach a high
      throughput, such as file copy, get to this threshold well before the
      allowed weight-raising period finishes. Thus this early ending of
      weight-raising reduces the amount of time during which these
      applications cause the problems described above.
      Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
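      A sketch of the early end of weight raising. The commit gives ~110K
      sectors plus a safety margin; the value 120000 below is an assumption
      for illustration, and the struct and function names are hypothetical.

```c
#include <stdbool.h>

/* Assumed threshold: ~110K sectors plus a safety margin. */
#define MAX_SERVICE_FROM_WR 120000UL

struct wr_queue {
    bool weight_raised;
    unsigned long service_from_wr; /* sectors served while raised */
};

/* Charge served sectors; end interactive weight raising early once the
 * queue has received at least the threshold. */
static void charge_service(struct wr_queue *q, unsigned long sectors)
{
    if (!q->weight_raised)
        return;
    q->service_from_wr += sectors;
    if (q->service_from_wr >= MAX_SERVICE_FROM_WR)
        q->weight_raised = false;
}
```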
    • block, bfq: limit tags for writes and async I/O · a52a69ea
      Committed by Paolo Valente
      Asynchronous I/O can easily starve synchronous I/O (both sync reads
      and sync writes), by consuming all request tags. Similarly, storms of
      synchronous writes, such as those that sync(2) may trigger, can starve
      synchronous reads. In their turn, these two problems may also cause
      BFQ to lose control over latency for interactive and soft real-time
      applications. For example, on a PLEXTOR PX-256M5S SSD, LibreOffice
      Writer takes 0.6 seconds to start if the device is idle, but it takes
      more than 45 seconds (!) if there are sequential writes in the
      background.
      
      This commit addresses this issue by limiting the maximum percentage of
      tags that asynchronous I/O requests and synchronous write requests can
      consume. In particular, this commit grants a higher threshold to
      synchronous writes, to prevent the latter from being starved by
      asynchronous I/O.
      
      According to the above test, LibreOffice Writer now starts in about
      1.2 seconds on average, regardless of the background workload, and
      apart from some rare outliers. To check this improvement, run, e.g.,
      sudo ./comm_startup_lat.sh bfq 5 5 seq 10 "lowriter --terminate_after_init"
      for the comm_startup_lat benchmark in the S suite [1].
      
      [1] https://github.com/Algodev-github/S
      Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
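      A sketch of per-class tag limits. The commit only says that sync writes
      get a higher share than async I/O; the 50% and 75% values and the
      function name are assumptions made for illustration.

```c
#include <stdbool.h>

/* Assumed shares of the tag space. */
#define ASYNC_TAG_PERCENT      50U
#define SYNC_WRITE_TAG_PERCENT 75U

/* Maximum number of tags a request of the given kind may consume. */
static unsigned int tag_limit(unsigned int total_tags, bool sync, bool write)
{
    if (!sync)
        return total_tags * ASYNC_TAG_PERCENT / 100;
    if (write)
        return total_tags * SYNC_WRITE_TAG_PERCENT / 100;
    return total_tags; /* sync reads are not limited */
}
```

      With 64 tags, async I/O could hold at most 32 and sync writes at most
      48, so some tags always remain for sync reads.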
  9. 10 Jan, 2018: 2 commits
  10. 09 Jan, 2018: 1 commit
  11. 06 Jan, 2018: 7 commits
    • block, bfq: remove batches of confusing ifdefs · 9b25bd03
      Committed by Paolo Valente
      Commit a33801e8 ("block, bfq: move debug blkio stats behind
      CONFIG_DEBUG_BLK_CGROUP") introduced two batches of confusing ifdefs:
      one reported in [1], plus a similar one in another function. This
      commit removes both batches, in the way suggested in [1].
      
      [1] https://www.spinics.net/lists/linux-block/msg20043.html
      
      Fixes: a33801e8 ("block, bfq: move debug blkio stats behind CONFIG_DEBUG_BLK_CGROUP")
      Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
      Tested-by: Luca Miccio <lucmiccio@gmail.com>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, bfq: consider also past I/O in soft real-time detection · a34b0244
      Committed by Paolo Valente
      BFQ privileges the I/O of soft real-time applications, such as video
      players, to guarantee these applications a high bandwidth and a low
      latency. In this respect, it is not easy to correctly detect when an
      application is soft real-time. A particularly nasty false positive is
      that of an I/O-bound application that occasionally happens to meet all
      requirements to be deemed as soft real-time. After being detected as
      soft real-time, such an application monopolizes the device. Fortunately,
      BFQ will realize soon that the application is actually not soft
      real-time and suspend every privilege. Yet, the application may happen
      again to be wrongly detected as soft real-time, and so on.
      
      As highlighted by our tests, this problem causes BFQ to occasionally
      fail to guarantee a high responsiveness, in the presence of heavy
      background I/O workloads. The reason is that the background workload
      happens to be detected as soft real-time, more or less frequently,
      during the execution of the interactive task under test. To give an
      idea, because of this problem, Libreoffice Writer occasionally takes 8
      seconds, instead of 3, to start up, if there are sequential reads and
      writes in the background, on a Kingston SSDNow V300.
      
      This commit addresses this issue by leveraging the following facts.
      
      The reason why some applications are detected as soft real-time despite
      all BFQ checks to avoid false positives, is simply that, during high
      CPU or storage-device load, I/O-bound applications may happen to do
      I/O slowly enough to meet all soft real-time requirements, and pass
      all BFQ extra checks. Yet, this happens only for limited time periods:
      slow-speed time intervals are usually interspersed between other time
      intervals during which these applications do I/O at a very high speed.
      To exploit these facts, this commit introduces a little change, in the
      detection of soft real-time behavior, to systematically consider also
      the recent past: the higher the speed was in the recent past, the
      later next I/O should arrive for the application to be considered as
      soft real-time. At the beginning of a slow-speed interval, the minimum
      arrival time allowed for the next I/O usually happens to still be so
      high as to fall *after* the end of the slow-speed period itself. As a
      consequence, the application does not risk being deemed as soft
      real-time during the slow-speed interval. Then, during the next
      high-speed interval, the application cannot, evidently, be deemed as
      soft real-time (exactly because of its speed), and so on.
      
      This extra filtering proved to be rather effective: in the above test,
      the frequency of false positives became so low that the start-up time
      was 3 seconds in all iterations (apart from occasional outliers,
      caused by page-cache-management issues, which are out of the scope of
      this commit, and cannot be solved by an I/O scheduler).
      Tested-by: Lee Tibbert <lee.tibbert@gmail.com>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Angelo Ruocco <angeloruocco90@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
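      A sketch of how the recent past can be folded into the threshold: the
      more service the queue received since it last became busy, the later
      the next I/O must arrive, and the new value never drops below the
      previous one. The formula and names are a simplified, hypothetical
      model, not BFQ's exact computation; max_softrt_rate must be nonzero.

```c
#include <stdint.h>

/* Hypothetical inputs, in abstract time and sector units. */
struct srt_inputs {
    uint64_t last_busy_start;  /* when the queue last became busy */
    uint64_t service;          /* sectors served since then */
    uint64_t max_softrt_rate;  /* sectors per time unit, nonzero */
    uint64_t prev_next_start;  /* previous soft_rt_next_start */
};

/* Earliest time at which new I/O may arrive for the queue to still look
 * soft real-time. The max with the previous value lets a recent
 * high-speed phase keep pushing the threshold forward. */
static uint64_t soft_rt_next_start(const struct srt_inputs *in)
{
    uint64_t t = in->last_busy_start + in->service / in->max_softrt_rate;
    return t > in->prev_next_start ? t : in->prev_next_start;
}
```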
    • block, bfq: remove superfluous check in queue-merging setup · 4403e4e4
      Committed by Angelo Ruocco
      When two or more processes do I/O in a way that their requests are
      sequential with respect to one another, BFQ merges the bfq_queues associated
      with the processes. This way the overall I/O pattern becomes sequential,
      and thus there is a boost in throughput.
      These cooperating processes usually start or restart doing I/O shortly
      after each other. So, in order to avoid merging non-cooperating processes,
      BFQ ensures that none of these queues has been in weight raising for too
      long.
      
      In this respect, from commit "block, bfq-sq, bfq-mq: let a queue be merged
      only shortly after being created", BFQ checks whether any queue (and not
      only weight-raised ones) is doing I/O continuously for too long to be
      merged.
      
      This new additional check makes the first one useless: a queue doing
      I/O for long enough, if weight-raised, is also a queue in
      weight raising for too long to be merged. Accordingly, this commit
      removes the first check.
      Signed-off-by: Angelo Ruocco <angeloruocco90@gmail.com>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, bfq: let a queue be merged only shortly after starting I/O · 7b8fa3b9
      Committed by Paolo Valente
      In BFQ and CFQ, two processes are said to be cooperating if they do
      I/O in such a way that the union of their I/O requests yields a
      sequential I/O pattern. To get such a sequential I/O pattern out of
      the non-sequential pattern of each cooperating process, BFQ and CFQ
      merge the queues associated with these processes. In more detail,
      cooperating processes, and thus their associated queues, usually
      start, or restart, to do I/O shortly after each other. This is the
      case, e.g., for the I/O threads of KVM/QEMU and of the dump
      utility. Based on this assumption, this commit allows a bfq_queue to
      be merged only during a short time interval (100ms) after it starts,
      or re-starts, to do I/O.  This filtering provides two important
      benefits.
      
      First, it greatly reduces the probability that two non-cooperating
      processes have their queues merged by mistake, if they just happen to
      do I/O close to each other for a short time interval. These spurious
      merges cause loss of service guarantees. A low-weight bfq_queue may
      unjustly get more than its expected share of the throughput: if such a
      low-weight queue is merged with a high-weight queue, then the I/O for
      the low-weight queue is served as if the queue had a high weight. This
      may damage other high-weight queues unexpectedly.  For instance,
      because of this issue, lxterminal occasionally took 7.5 seconds to
      start, instead of 6.5 seconds, when some sequential readers and
      writers did I/O in the background on a FUJITSU MHX2300BT HDD.  The
      reason is that the bfq_queues associated with some of the readers or
      the writers were merged with the high-weight queues of some processes
      that had to do some urgent but little I/O. The readers then exploited
      the inherited high weight for all or most of their I/O, during the
      start-up of the terminal. The filtering introduced by this commit
      eliminated any outlier caused by spurious queue merges in our start-up
      time tests.
      
      This filtering also provides a little boost of the throughput
      sustainable by BFQ: 3-4%, depending on the CPU. The reason is that,
      once a bfq_queue cannot be merged any longer, this commit makes BFQ
      stop updating the data needed to handle merging for the queue.
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Angelo Ruocco <angeloruocco90@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
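      The 100 ms merge window, as a minimal predicate. Only the 100 ms value
      comes from the commit message; the millisecond timestamps and the names
      are hypothetical (kernel code would typically work in jiffies).

```c
#include <stdbool.h>
#include <stdint.h>

/* Merge window from the commit message. */
#define MERGE_TIME_LIMIT_MS 100U

/* Hypothetical per-queue timestamp. */
struct merge_queue {
    uint64_t first_io_ms; /* when the queue started (or restarted) I/O */
};

/* Once this returns true, the queue is no longer a merge candidate, and
 * merge-related bookkeeping for it can stop. */
static bool too_late_for_merging(const struct merge_queue *q, uint64_t now_ms)
{
    return now_ms >= q->first_io_ms &&
           now_ms - q->first_io_ms > MERGE_TIME_LIMIT_MS;
}
```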
    • block, bfq: check low_latency flag in bfq_bfqq_save_state() · 1be6e8a9
      Committed by Angelo Ruocco
      A just-created bfq_queue will certainly be deemed as interactive on
      the arrival of its first I/O request, if the low_latency flag is
      set. Yet, if the queue is merged with another queue on the arrival of
      its first I/O request, it will not have the chance to be flagged as
      interactive. Nevertheless, if the queue is then split soon enough, it
      has to be flagged as interactive after the split.
      
      To handle this early-merge scenario correctly, BFQ saves the state of
      the queue, on the merge, as if the latter had already been deemed
      interactive. So, if the queue is split soon, it will get
      weight-raised, because the previous state of the queue is resumed on
      the split.
      
      Unfortunately, in the act of saving the state of the newly-created
      queue, BFQ doesn't check whether the low_latency flag is set, and this
      causes early-merged queues to be then weight-raised, on queue splits,
      even if low_latency is off. This commit addresses this problem by
      adding the missing check.
      Signed-off-by: Angelo Ruocco <angeloruocco90@gmail.com>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
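      A sketch of the corrected save step for an early merge. The structs and
      the save_state() helper are hypothetical simplifications of
      bfq_bfqq_save_state(), reduced to the single flag this commit is about.

```c
#include <stdbool.h>

/* Hypothetical queue model. */
struct model_bfqq {
    bool just_created;  /* merged on the arrival of its first request */
    bool weight_raised;
};

/* Hypothetical saved state, restored if the queue is split soon. */
struct saved_state {
    bool weight_raised;
};

static void save_state(struct saved_state *st, const struct model_bfqq *q,
                       bool low_latency)
{
    if (q->just_created)
        /* Treat the queue as if already deemed interactive, but only
         * when low_latency is on (the check added by this commit). */
        st->weight_raised = low_latency;
    else
        st->weight_raised = q->weight_raised;
}
```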
    • block, bfq: add missing rq_pos_tree update on rq removal · 05e90283
      Committed by Paolo Valente
      If two processes do I/O close to each other, then BFQ merges the
      bfq_queues associated with these processes, to get a more sequential
      I/O, and thus a higher throughput.  In this respect, to detect whether
      two processes are doing I/O close to each other, BFQ keeps a list of
      the head-of-line I/O requests of all active bfq_queues.  The list is
      ordered by initial sectors, and implemented through a red-black tree
      (rq_pos_tree).
      
      Unfortunately, the update of the rq_pos_tree was incomplete, because
      the tree was not updated on the removal of the head-of-line I/O
      request of a bfq_queue, in case the queue did not remain empty. This
      commit adds the missing update.
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Angelo Ruocco <angeloruocco90@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block, bfq: increase threshold to deem I/O as random · f0ba5ea2
      Committed by Paolo Valente
      If two processes do I/O close to each other, i.e., are cooperating
      processes in BFQ (and CFQ's) nomenclature, then BFQ merges their
      associated bfq_queues, so as to get sequential I/O from the union of
      the I/O requests of the processes, and thus reach a higher
      throughput. A merged queue is then split if its I/O stops being
      sequential. In this respect, BFQ deems the I/O of a bfq_queue as
      (mostly) sequential only if less than 4 I/O requests are random, out
      of the last 32 requests inserted into the queue.
      
      Unfortunately, extensive testing (with the interleaved_io benchmark of
      the S suite [1], and with real applications spawning cooperating
      processes) has clearly shown that, with such a low threshold, only a
      rather low I/O throughput may be reached when several cooperating
      processes do I/O. In particular, the outcome of each test run was
      bimodal: if queue merging occurred and was stable during the test,
      then the throughput was close to the peak rate of the storage device,
      otherwise the throughput was arbitrarily low (usually around 1/10 of
      the peak rate with a rotational device). The probability of getting the
      unlucky outcomes grew with the number of cooperating processes: it was
      already significant with 5 processes, and close to one with 7 or more
      processes.
      
      The cause of the low throughput in the unlucky runs was that the
      merged queues containing the I/O of these cooperating processes were
      soon split, because they contained more random I/O requests than those
      tolerated by the 4/32 threshold, but
      - that I/O would have however allowed the storage device to reach
        peak throughput or almost peak throughput;
      - in contrast, the I/O of these processes, if served individually
        (from separate queues) yielded a rather low throughput.
      
      So we repeated our tests with increasing values of the threshold,
      until we found the minimum value (19) for which we obtained maximum
      throughput, reliably, with up to at least 9 cooperating
      processes. Then we checked that the use of that higher threshold value
      did not cause any regression for any other benchmark in the suite [1].
      This commit raises the threshold to such a higher value.
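      A minimal sketch of the 4/32 versus 19/32 rule, assuming the history
      of the last 32 inserted requests is kept as a 32-bit bitmap (one bit
      per request, 1 = random). The names record_request, count_random and
      io_is_random are hypothetical, and whether BFQ compares with >= or >
      is an assumption here; the sketch follows the "fewer than 4 out of
      32" wording above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical thresholds: how many random requests, out of the last 32,
 * make the queue's I/O count as random. */
#define RANDOM_THRESH_OLD 4   /* previous rule: 4 out of 32 */
#define RANDOM_THRESH_NEW 19  /* value chosen by this commit */

/* Shift in one bit per inserted request: 1 = random, 0 = sequential.
 * Bits older than the last 32 requests fall off the top. */
static uint32_t record_request(uint32_t history, bool is_random)
{
    return (history << 1) | (is_random ? 1u : 0u);
}

/* Portable population count of the 32-bit history. */
static int count_random(uint32_t history)
{
    int n = 0;
    while (history) {
        history &= history - 1;  /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Deem the I/O random when at least thresh of the last 32 requests
 * were random. */
static bool io_is_random(uint32_t history, int thresh)
{
    return count_random(history) >= thresh;
}
```

      With, say, 7 random requests out of the last 32, the old threshold
      would deem the merged queue random (and split it), while the new one
      keeps it merged.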
      
      [1] https://github.com/Algodev-github/S
      Signed-off-by: Angelo Ruocco <angeloruocco90@gmail.com>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      f0ba5ea2
  12. 15 Nov, 2017: 3 commits
    • block, bfq: move debug blkio stats behind CONFIG_DEBUG_BLK_CGROUP · a33801e8
      Luca Miccio committed
      BFQ currently creates, and updates, its own instance of the whole
      set of blkio statistics that cfq creates. Yet, from the comments
      of Tejun Heo in [1], it turned out that most of these statistics
      are meant/useful only for debugging. This commit makes BFQ create
      these debugging statistics only if the option
      CONFIG_DEBUG_BLK_CGROUP is set.
      
      By doing so, this commit also gives BFQ a significant performance
      boost. The reason is that, if CONFIG_DEBUG_BLK_CGROUP is not set, then
      BFQ has to update far fewer statistics, and, in particular, not the
      heaviest ones to update. To give an idea of the benefits, if
      CONFIG_DEBUG_BLK_CGROUP is not set, then, on an Intel i7-4850HQ, and
      with 8 threads doing random I/O in parallel on null_blk (configured
      with 0 latency), the throughput of BFQ grows from 310 to 400 KIOPS
      (+30%). We have measured similar or even much higher boosts with other
      CPUs: e.g., +45% with an ARM Cortex-A53 Octa-core. Our results have
      been obtained and can be reproduced very easily with the script in [1].
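      A hypothetical sketch of the compile-time split described above:
      basic statistics are always updated, while debugging-only fields and
      their updates exist only when CONFIG_DEBUG_BLK_CGROUP is defined. The
      struct and function names, and the use of an average-queue-size
      counter as the example debugging statistic, are illustrative
      assumptions, not BFQ's code.

```c
#include <stdint.h>

/* Build with -DCONFIG_DEBUG_BLK_CGROUP to compile in the debug stats. */
struct group_stats {
    uint64_t bytes;                  /* basic stat, always maintained */
#ifdef CONFIG_DEBUG_BLK_CGROUP
    uint64_t avg_queue_size_sum;     /* debugging-only stats */
    uint64_t avg_queue_size_samples;
#endif
};

static void stats_update_dispatch(struct group_stats *st, uint64_t bytes,
                                  uint64_t queue_size)
{
    st->bytes += bytes;
#ifdef CONFIG_DEBUG_BLK_CGROUP
    /* Debugging-only bookkeeping: compiled out unless the option is set. */
    st->avg_queue_size_sum += queue_size;
    st->avg_queue_size_samples++;
#else
    (void)queue_size;                /* unused without debug stats */
#endif
}
```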
      
      [1] https://www.spinics.net/lists/linux-block/msg18943.html
      Suggested-by: Tejun Heo <tj@kernel.org>
      Suggested-by: Ulf Hansson <ulf.hansson@linaro.org>
      Tested-by: Lee Tibbert <lee.tibbert@gmail.com>
      Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Signed-off-by: Luca Miccio <lucmiccio@gmail.com>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      a33801e8
    • block, bfq: update blkio stats outside the scheduler lock · 24bfd19b
      Paolo Valente committed
      bfq invokes various blkg_*stats_* functions to update the statistics
      contained in the special files blkio.bfq.* in the blkio controller
      groups, i.e., the I/O accounting related to the proportional-share
      policy provided by bfq. The execution of these functions takes a
      considerable percentage, about 40%, of the total per-request execution
      time of bfq (i.e., of the sum of the execution time of all the bfq
      functions that have to be executed to process an I/O request from its
      creation to its destruction). This reduces the request-processing
      rate sustainable by bfq noticeably, even on a multicore CPU. In fact,
      the bfq functions that invoke blkg_*stats_* functions cannot be
      executed in parallel with the rest of the code of bfq, because both
      are executed under the same per-device scheduler lock.
      
      To reduce this slowdown, this commit moves, wherever possible, the
      invocation of these functions (more precisely, of the bfq functions
      that invoke blkg_*stats_* functions) outside the critical sections
      protected by the scheduler lock.
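      The pattern of moving statistics updates out of the scheduler's
      critical section can be sketched in user space as follows. This is an
      illustration under stated assumptions: a C11 atomic_flag spinlock
      stands in for kernel spinlocks, queue_lock stands in for the
      request_queue lock that, as the NOTE in this message explains, still
      protects the stats updates, and all names are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Minimal spinlock on C11 atomic_flag (stand-in for kernel spinlocks). */
typedef struct { atomic_flag flag; } spinlock;

static void spin_lock(spinlock *l)
{
    while (atomic_flag_test_and_set_explicit(&l->flag, memory_order_acquire))
        ;  /* busy-wait until the holder releases the lock */
}

static void spin_unlock(spinlock *l)
{
    atomic_flag_clear_explicit(&l->flag, memory_order_release);
}

/* Hypothetical scheduler: names are illustrative, not BFQ's. */
struct sched {
    spinlock sched_lock;        /* per-device scheduler lock */
    spinlock queue_lock;        /* stands in for the request_queue lock */
    unsigned queued;            /* scheduler state, under sched_lock */
    uint64_t stats_dispatched;  /* blkio-like stat, under queue_lock */
};

static void sched_init(struct sched *s, unsigned queued)
{
    atomic_flag_clear(&s->sched_lock.flag);
    atomic_flag_clear(&s->queue_lock.flag);
    s->queued = queued;
    s->stats_dispatched = 0;
}

/* Dispatch one request: scheduling under sched_lock, stats after it. */
static int sched_dispatch(struct sched *s)
{
    int dispatched = 0;

    spin_lock(&s->sched_lock);
    if (s->queued > 0) {
        s->queued--;
        dispatched = 1;  /* remember what to account; don't account yet */
    }
    spin_unlock(&s->sched_lock);

    /* The stats update no longer serializes with scheduler code, but
     * still runs under queue_lock so the stats owner cannot vanish. */
    if (dispatched) {
        spin_lock(&s->queue_lock);
        s->stats_dispatched++;
        spin_unlock(&s->queue_lock);
    }
    return dispatched;
}
```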
      
      With this change, and with all blkio.bfq.* statistics enabled, the
      throughput grows, e.g., from 250 to 310 KIOPS (+25%) on an Intel
      i7-4850HQ, with 8 threads doing random I/O in parallel on
      null_blk, with the latter configured with 0 latency. We obtained the
      same or higher throughput boosts, up to +30%, with other processors
      (some figures are reported in the documentation). For our tests, we
      used the script [1], with which our results can be easily reproduced.
      
      NOTE. This commit still protects the invocation of blkg_*stats_*
      functions with the request_queue lock, because the group these
      functions are invoked on may otherwise disappear before or while these
      functions are executed. Fortunately, tests without even this lock
      show, by difference, that the serialization caused by this lock has
      only a small impact (at most ~5% of throughput reduction).
      
      [1] https://github.com/Algodev-github/IOSpeed
      Tested-by: Lee Tibbert <lee.tibbert@gmail.com>
      Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Luca Miccio <lucmiccio@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      24bfd19b
    • block, bfq: add missing invocations of bfqg_stats_update_io_add/remove · 614822f8
      Luca Miccio committed
      bfqg_stats_update_io_add and bfqg_stats_update_io_remove are to be
      invoked, respectively, when an I/O request enters and when an I/O
      request exits the scheduler. Unfortunately, bfq does not fully comply
      with this scheme, because it does not invoke these functions for
      requests that are inserted into or extracted from its priority
      dispatch list. This commit fixes this mistake.
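      A minimal sketch of the balanced accounting this fix restores,
      assuming a simplified scheduler in which a request sits either on the
      priority dispatch list or in a bfq_queue. stats_io_add and
      stats_io_remove are hypothetical stand-ins for
      bfqg_stats_update_io_add and bfqg_stats_update_io_remove.

```c
#include <stdbool.h>

/* Hypothetical per-group stat: requests currently inside the scheduler. */
struct group_stats { int queued; };

/* Hypothetical stand-ins for bfqg_stats_update_io_add()/_remove(). */
static void stats_io_add(struct group_stats *st)    { st->queued++; }
static void stats_io_remove(struct group_stats *st) { st->queued--; }

struct sched {
    int dispatch_list_len;  /* requests on the priority dispatch list */
    int bfqq_len;           /* requests queued in bfq_queues */
    struct group_stats st;
};

/* A request enters the scheduler: account it on both paths. */
static void insert_request(struct sched *s, bool at_dispatch_list)
{
    if (at_dispatch_list)
        s->dispatch_list_len++;
    else
        s->bfqq_len++;
    stats_io_add(&s->st);  /* no longer skipped for the dispatch list */
}

/* A request leaves the scheduler: symmetric accounting. */
static void remove_request(struct sched *s, bool from_dispatch_list)
{
    if (from_dispatch_list)
        s->dispatch_list_len--;
    else
        s->bfqq_len--;
    stats_io_remove(&s->st);
}
```

      The invariant the fix aims at is that the stat always equals the
      number of requests actually held by the scheduler, whichever path
      they took.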
      Tested-by: Lee Tibbert <lee.tibbert@gmail.com>
      Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Luca Miccio <lucmiccio@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      614822f8
  13. 09 Oct, 2017: 2 commits
    • block, bfq: fix unbalanced decrements of burst size · 99fead8d
      Paolo Valente committed
      The commit "block, bfq: decrease burst size when queues in burst
      exit" introduced the decrement of burst_size on the removal of a
      bfq_queue from the burst list. Unfortunately, this decrement can
      happen to be performed even when burst_size is already equal to 0,
      because of unbalanced decrements. What follows describes the cause
      of these unbalanced decrements, namely a wrong assumption, and how
      this wrong assumption leads to them.
      
      The wrong assumption is that a bfq_queue can exit only if the process
      associated with the bfq_queue has exited. This is false, because a
      bfq_queue, say Q, may exit also as a consequence of a merge with
      another bfq_queue. In this case, Q exits because the I/O of its
      associated process has been redirected to another bfq_queue.
      
      The decrement unbalance occurs because Q may then be re-created after
      a split, and added back to the current burst list, *without*
      incrementing burst_size. burst_size is not incremented because Q is
      not a new bfq_queue added to the burst list, but a bfq_queue only
      temporarily removed from the list, and, before the commit "bfq-sq,
      bfq-mq: decrease burst size when queues in burst exit", burst_size was
      not decremented when Q was removed.
      
      This commit addresses this issue by just checking whether the exiting
      bfq_queue is a merged bfq_queue, and, in that case, not decrementing
      burst_size. Unfortunately, this still leaves room for unbalanced
      decrements, in the following rarer case: on a split, the bfq_queue
      happens to be inserted into a different burst list than the one it was
      removed from when merged. If this happens, the number of elements in
      the new burst list becomes higher than burst_size (by one). When the
      bfq_queue then exits, it is of course not in a merged state any
      longer, thus burst_size is decremented, which results in an unbalanced
      decrement. To handle this sporadic, unlucky case in a simple way,
      this commit also checks that burst_size is larger than 0 before
      decrementing it.
      
      Finally, this commit removes a useless, extra check: the check that
      the bfq_queue is sync, performed before checking whether the bfq_queue
      is in the burst list. This extra check is redundant, because only sync
      bfq_queues can be inserted into the burst list.
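      The two checks described above can be sketched as follows. The
      structs and field names are hypothetical simplifications; in
      particular, the merged flag stands for however BFQ recognizes that a
      queue is being freed because of a merge rather than a process exit.

```c
#include <stdbool.h>

/* Hypothetical, simplified state; not BFQ's actual structures. */
struct queue {
    bool in_burst_list;  /* queue is currently on the burst list */
    bool merged;         /* queue is going away because of a merge */
};

struct sched_data {
    int burst_size;      /* queues counted in the current burst */
};

/* Called when a queue exits. No separate "is sync" test is needed,
 * because only sync queues are ever inserted into the burst list. */
static void on_queue_exit(struct sched_data *d, struct queue *q)
{
    if (!q->in_burst_list)
        return;
    q->in_burst_list = false;  /* always unlink from the burst list */

    /* A merged queue may come back on a split without incrementing
     * burst_size, so its removal must not decrement burst_size. */
    if (q->merged)
        return;

    /* Rarer case: after a split the queue was re-inserted into a
     * different burst list, so never let burst_size go below zero. */
    if (d->burst_size > 0)
        d->burst_size--;
}
```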
      
      Fixes: 7cb04004 ("block, bfq: decrease burst size when queues in burst exit")
      Reported-by: Philip Müller <philm@manjaro.org>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Angelo Ruocco <angeloruocco90@gmail.com>
      Tested-by: Philip Müller <philm@manjaro.org>
      Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Tested-by: Lee Tibbert <lee.tibbert@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      99fead8d
    • block,bfq: Disable writeback throttling · b5dc5d4d
      Luca Miccio committed
      Similarly to CFQ, BFQ has its own write-throttling heuristics, and it
      is better not to combine them with further write-throttling
      heuristics of a different nature.
      So this commit disables write-back throttling for a device if BFQ
      is used as I/O scheduler for that device.
      Signed-off-by: Luca Miccio <lucmiccio@gmail.com>
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Tested-by: Lee Tibbert <lee.tibbert@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      b5dc5d4d
  14. 03 Oct, 2017: 4 commits
    • block, bfq: decrease burst size when queues in burst exit · 7cb04004
      Paolo Valente committed
      If many queues belonging to the same group happen to be created
      shortly after each other, then the concurrent processes associated
      with these queues typically have a common goal, and they get it done
      as soon as possible if not hampered by device idling. Examples are
      processes spawned by git grep, or by systemd during boot. As for
      device idling, this mechanism is currently necessary for weight
      raising to succeed in its goal: privileging I/O. In view of these
      facts, BFQ does not provide the above queues with either weight
      raising or device idling.
      
      On the other hand, a burst of queue creations may be caused also by
      the start-up of a complex application. In this case, these queues
      usually need to be served one after the other, and as quickly as possible,
      to maximise responsiveness. Therefore, in this case the best strategy
      is to weight-raise all the queues created during the burst, i.e., the
      exact opposite of the strategy for the above case.
      
      To distinguish between the two cases, BFQ uses an empirical burst-size
      threshold, found through extensive tests and monitoring of daily
      usage. Only large bursts, i.e., bursts with a size above this
      threshold, are considered as generated by a high number of parallel
      processes. In this respect, upstart-based boot proved to be rather
      hard to detect as generating a large burst of queue creations, because
      with upstart most of the queues created in a burst exit *before* the
      next queues in the same burst are created. To address this issue, I
      changed the burst-detection mechanism so as to not decrease the size
      of the current burst even if one of the queues in the burst is
      eliminated.
      
      Unfortunately, this missing decrease causes false positives on very
      fast systems: on the start-up of a complex application, such as
      libreoffice writer, so many queues are created, served and exited
      shortly after each other, that a large burst of queue creations is
      wrongly detected as occurring. These false positives just disappear if
      the size of a burst is decreased when one of the queues in the burst
      exits. This commit restores the missing burst-size decrease, relying
      on the fact that upstart is apparently unlikely to be used on systems
      running this and future versions of the kernel.
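      A simplified sketch of burst-size accounting with the restored
      decrement. BURST_INTERVAL_MS and LARGE_BURST_THRESH are assumed,
      illustrative values (this message does not state them), and the
      functions are hypothetical. The sketch only shows why decrementing on
      exit keeps a sequence of short-lived queues from being counted as a
      large burst; the "> 0" guard anticipates the later fix for unbalanced
      decrements.

```c
#include <stdbool.h>

#define BURST_INTERVAL_MS  180  /* assumed max gap between creations */
#define LARGE_BURST_THRESH 8    /* assumed size above which a burst is large */

struct burst_state {
    long last_creation_ms;  /* time of the most recent queue creation */
    int  burst_size;        /* queues currently counted in the burst */
    bool large_burst;       /* current burst deemed large */
};

static void on_queue_created(struct burst_state *b, long now_ms)
{
    if (now_ms - b->last_creation_ms <= BURST_INTERVAL_MS) {
        b->burst_size++;         /* close to the previous creation */
    } else {
        b->burst_size = 1;       /* too far apart: start a new burst */
        b->large_burst = false;
    }
    b->last_creation_ms = now_ms;
    if (b->burst_size > LARGE_BURST_THRESH)
        b->large_burst = true;   /* no weight raising, no device idling */
}

/* A queue of the current burst exits: it stops counting. */
static void on_burst_queue_exit(struct burst_state *b)
{
    if (b->burst_size > 0)
        b->burst_size--;
}
```

      With this decrement, ten queues created 10 ms apart, each exiting
      before the next one starts, never push burst_size above 1; without
      exits, the same ten creations are flagged as a large burst.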
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Mauro Andreolini <mauro.andreolini@unimore.it>
      Signed-off-by: Angelo Ruocco <angeloruocco90@gmail.com>
      Tested-by: Mirko Montanari <mirkomontanari91@gmail.com>
      Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Tested-by: Lee Tibbert <lee.tibbert@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      7cb04004
    • block, bfq: let early-merged queues be weight-raised on split too · 894df937
      Paolo Valente committed
      A just-created bfq_queue, say Q, may happen to be merged with another
      bfq_queue on the very first invocation of the function
      __bfq_insert_request. In such a case, even if Q would clearly deserve
      interactive weight raising (as it has just been created), the function
      bfq_add_request is never invoked for Q, and thus weight raising is
      never activated for Q. As a consequence, when the state of Q
      is saved for a possible future restore, after a split of Q from the
      other bfq_queue(s), such a state happens to be (unjustly)
      non-weight-raised. Then the bfq_queue will not enjoy any weight
      raising on the split, even if it should still be in an interactive
      weight-raising period when the split occurs.
      
      This commit solves this problem as follows, for a just-created
      bfq_queue that is being early-merged: it stores directly, in the saved
      state of the bfq_queue, the weight-raising state that would have been
      assigned to the bfq_queue if not early-merged.
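      A hypothetical sketch of storing the would-be weight-raising state in
      the saved state on an early merge, and of restoring it on a split.
      INTERACTIVE_WR_COEFF, INTERACTIVE_WR_DURATION and all field names are
      assumptions for illustration; times are abstract ticks.

```c
#include <stdbool.h>

#define INTERACTIVE_WR_COEFF    30u    /* assumed interactive wr coefficient */
#define INTERACTIVE_WR_DURATION 6000ul /* assumed wr duration, in ticks */

/* Hypothetical saved state, restored when the queue is split. */
struct saved_state {
    unsigned      wr_coeff;        /* 1 = not weight-raised */
    unsigned long wr_cur_max_time; /* duration of the saved wr period */
    unsigned long last_wr_start;   /* when that period started */
};

struct queue {
    bool          just_created;
    unsigned      wr_coeff;
    unsigned long wr_cur_max_time;
    unsigned long last_wr_start;
    struct saved_state saved;
};

/* q is merged into another queue: save its wr state for a later split. */
static void save_state_on_merge(struct queue *q, unsigned long now)
{
    if (q->just_created) {
        /* Early merge: q never went through the add-request path, so its
         * own wr_coeff is still 1. Save the interactive wr state it
         * would have been given instead. */
        q->saved.wr_coeff = INTERACTIVE_WR_COEFF;
        q->saved.wr_cur_max_time = INTERACTIVE_WR_DURATION;
        q->saved.last_wr_start = now;
    } else {
        q->saved.wr_coeff = q->wr_coeff;
        q->saved.wr_cur_max_time = q->wr_cur_max_time;
        q->saved.last_wr_start = q->last_wr_start;
    }
    q->just_created = false;
}

/* q is split again: resume weight raising only if the period is not over. */
static void restore_state_on_split(struct queue *q, unsigned long now)
{
    if (q->saved.wr_coeff > 1 &&
        now - q->saved.last_wr_start < q->saved.wr_cur_max_time) {
        q->wr_coeff = q->saved.wr_coeff;
        q->wr_cur_max_time = q->saved.wr_cur_max_time;
        q->last_wr_start = q->saved.last_wr_start;
    } else {
        q->wr_coeff = 1;
    }
}
```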
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Tested-by: Angelo Ruocco <angeloruocco90@gmail.com>
      Tested-by: Mirko Montanari <mirkomontanari91@gmail.com>
      Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Tested-by: Lee Tibbert <lee.tibbert@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      894df937
    • block, bfq: check and switch back to interactive wr also on queue split · 3e2bdd6d
      Paolo Valente committed
      As already explained in the message of commit "block, bfq: fix
      wrong init of saved start time for weight raising", if a soft
      real-time weight-raising period happens to be nested in a larger
      interactive weight-raising period, then BFQ restores the interactive
      weight raising at the end of the soft real-time weight raising. In
      particular, BFQ checks whether the latter has ended only on request
      dispatches.
      
      Unfortunately, the above scheme fails to restore interactive weight
      raising in the following corner case: if a bfq_queue, say Q,
      1) is merged with another bfq_queue while it is in a nested soft
      real-time weight-raising period. The weight-raising state of Q is
      then saved, and not considered any longer until a split occurs.
      2) is split from the other bfq_queue(s) at a time instant when its
      soft real-time weight raising is already finished.
      On the split, while resuming the previous, soft real-time
      weight-raised state of the bfq_queue Q, BFQ checks whether the
      current soft real-time weight-raising period is actually over. If so,
      BFQ switches weight raising off for Q, *without* checking whether the
      soft real-time period was actually nested in a not-yet-finished
      interactive weight-raising period.
      
      This commit addresses this issue by adding the above missing check in
      bfq_queue splits, and restoring interactive weight raising if needed.
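      The missing check can be sketched like this, under the same kind of
      assumptions: hypothetical field names, abstract ticks, and assumed
      constants for the interactive weight-raising coefficient and
      duration. On a split, if the current soft real-time period is over,
      the sketch first asks whether the enclosing interactive period is
      still running, and switches back to it if so.

```c
#include <stdbool.h>

#define INTERACTIVE_WR_COEFF    30u    /* assumed */
#define INTERACTIVE_WR_DURATION 6000ul /* assumed, in ticks */

/* Hypothetical wr state of a queue being resumed on a split. */
struct wr_state {
    unsigned      wr_coeff;             /* 1 = not weight-raised */
    bool          soft_rt;              /* current wr period is soft rt */
    unsigned long last_wr_start_finish; /* start of current wr period */
    unsigned long wr_cur_max_time;      /* duration of current wr period */
    unsigned long wr_start_at_switch_to_srt; /* start of enclosing
                                                interactive period */
};

static void resume_wr_on_split(struct wr_state *s, unsigned long now)
{
    if (s->wr_coeff == 1 ||
        now - s->last_wr_start_finish < s->wr_cur_max_time)
        return;  /* not raised, or current wr period still running */

    if (s->soft_rt &&
        now - s->wr_start_at_switch_to_srt < INTERACTIVE_WR_DURATION) {
        /* The added check: soft rt wr is over, but it was nested in an
         * interactive period that is still running, so switch back. */
        s->wr_coeff = INTERACTIVE_WR_COEFF;
        s->soft_rt = false;
        s->last_wr_start_finish = s->wr_start_at_switch_to_srt;
        s->wr_cur_max_time = INTERACTIVE_WR_DURATION;
    } else {
        s->wr_coeff = 1;  /* weight raising really ends */
        s->soft_rt = false;
    }
}
```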
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Tested-by: Angelo Ruocco <angeloruocco90@gmail.com>
      Tested-by: Mirko Montanari <mirkomontanari91@gmail.com>
      Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Tested-by: Lee Tibbert <lee.tibbert@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      3e2bdd6d
    • block, bfq: fix wrong init of saved start time for weight raising · 4baa8bb1
      Paolo Valente committed
      This commit fixes a bug that causes bfq to fail to guarantee a high
      responsiveness on some drives, if there is heavy random read+write I/O
      in the background. More precisely, such a failure allowed this bug to
      be found [1], but the bug may well cause other yet unreported
      anomalies.
      
      BFQ raises the weight of the bfq_queues associated with soft real-time
      applications, to privilege the I/O, and thus reduce latency, for these
      applications. This mechanism is named soft-real-time weight raising in
      BFQ. A soft real-time period may happen to be nested in an
      interactive weight raising period, i.e., it may happen that, when a
      bfq_queue switches to a soft real-time weight-raised state, the
      bfq_queue is already being weight-raised because deemed interactive
      too. In this case, BFQ saves, in the special variable
      wr_start_at_switch_to_srt, the time instant when the interactive
      weight-raising period started for the bfq_queue, i.e., the time
      instant when BFQ started to deem the bfq_queue interactive. This value
      is then used to check whether the interactive weight-raising period
      would still be in progress when the soft real-time weight-raising
      period ends. If so, interactive weight raising is restored for the
      bfq_queue. This restore is useful, in particular, because it prevents
      bfq_queues from losing their interactive weight raising prematurely,
      as a consequence of spurious, short-lived soft real-time
      weight-raising periods caused by wrong detections as soft real-time.
      
      If, instead, a bfq_queue switches to soft-real-time weight raising
      while it *is not* already in an interactive weight-raising period,
      then the variable wr_start_at_switch_to_srt has no meaning during the
      following soft real-time weight-raising period. Unfortunately, the
      handling of this case is wrong in BFQ: not only is the variable not
      flagged somehow as meaningless, but it is also set to the time when
      the switch to soft real-time weight-raising occurs. This may cause an
      interactive weight-raising period to be considered mistakenly as still
      in progress, and thus a spurious interactive weight-raising period to
      start for the bfq_queue, at the end of the soft-real-time
      weight-raising period. In particular, the spurious interactive
      weight-raising period will be considered as still in progress if the
      soft-real-time weight-raising period does not last very long. The
      bfq_queue will then be wrongly privileged and, if I/O bound, will
      unjustly steal bandwidth from truly interactive or soft real-time
      bfq_queues, harming responsiveness and low latency.
      
      This commit fixes this issue by just setting wr_start_at_switch_to_srt
      to minus infinity (farthest past time instant according to jiffies
      macros): when the soft-real-time weight-raising period ends, certainly
      no interactive weight-raising period will be considered as still in
      progress.
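      A minimal sketch of the fix, assuming signed 64-bit tick counters in
      which INT64_MIN stands in for "minus infinity" (BFQ uses the
      farthest-past time instant expressible with jiffies; INT64_MIN is a
      simplification). Names and the interactive duration are
      hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

#define MINUS_INFINITY          INT64_MIN /* sentinel: "never started" */
#define INTERACTIVE_WR_DURATION 6000      /* assumed, in ticks */

/* Hypothetical wr state; not BFQ's actual fields. */
struct wr_state {
    bool    interactive_wr;            /* in an interactive wr period */
    int64_t interactive_wr_start;      /* start of that period */
    int64_t wr_start_at_switch_to_srt; /* saved on switch to soft rt */
};

static void switch_to_soft_rt(struct wr_state *s)
{
    if (s->interactive_wr)
        s->wr_start_at_switch_to_srt = s->interactive_wr_start;
    else
        s->wr_start_at_switch_to_srt = MINUS_INFINITY; /* was "now" */
}

/* When soft rt wr ends: would the interactive period still be running?
 * Compare the start against now - duration, so the sentinel is only
 * compared and never used in arithmetic. */
static bool interactive_still_running(const struct wr_state *s, int64_t now)
{
    return s->wr_start_at_switch_to_srt > now - INTERACTIVE_WR_DURATION;
}
```

      Had the sentinel been the switch time (the old behaviour), a short
      soft real-time period would end with the check above returning true,
      which is exactly the spurious interactive period described earlier.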
      
      [1] Background I/O Type: Random - Background I/O mix: Reads and writes
      - Application to start: LibreOffice Writer in
      http://www.phoronix.com/scan.php?page=news_item&px=Linux-4.13-IO-Laptop
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Angelo Ruocco <angeloruocco90@gmail.com>
      Tested-by: Oleksandr Natalenko <oleksandr@natalenko.name>
      Tested-by: Lee Tibbert <lee.tibbert@gmail.com>
      Tested-by: Mirko Montanari <mirkomontanari91@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      4baa8bb1