- 08 8月, 2019 1 次提交
-
-
由 Paolo Valente 提交于
Since commit 13a857a4 ("block, bfq: detect wakers and unconditionally inject their I/O"), BFQ stores, in a per-device pointer last_completed_rq_bfqq, the last bfq_queue that had an I/O request completed. If some bfq_queue receives new I/O right after the last request of last_completed_rq_bfqq has been completed, then last_completed_rq_bfqq may be a waker bfq_queue. But if the bfq_queue last_completed_rq_bfqq points to is freed, then last_completed_rq_bfqq becomes a dangling reference. This commit resets last_completed_rq_bfqq if the pointed bfq_queue is freed. Fixes: 13a857a4 ("block, bfq: detect wakers and unconditionally inject their I/O") Reported-by: NDouglas Anderson <dianders@chromium.org> Tested-by: NDouglas Anderson <dianders@chromium.org> Signed-off-by: NPaolo Valente <paolo.valente@linaro.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 18 7月, 2019 1 次提交
-
-
由 Paolo Valente 提交于
Consider a sync bfq_queue Q that remains empty while in service, and suppose that, when this happens, there is a fair amount of already in-flight I/O not belonging to Q. In such a situation, I/O dispatching may need to be plugged (until new I/O arrives for Q), for the following reason. The drive may decide to serve in-flight non-Q's I/O requests before Q's ones, thereby delaying the arrival of new I/O requests for Q (recall that Q is sync). If I/O-dispatching is not plugged, then, while Q remains empty, a basically uncontrolled amount of I/O from other queues may be dispatched too, possibly causing the service of Q's I/O to be delayed even longer in the drive. This problem gets more and more serious as the speed and the queue depth of the drive grow, because, as these two quantities grow, the probability to find no queue busy but many requests in flight grows too. If Q has the same weight and priority as the other queues, then the above delay is unlikely to cause any issue, because all queues tend to undergo the same treatment. So, since not plugging I/O dispatching is convenient for throughput, it is better not to plug. Things change in case Q has a higher weight or priority than some other queue, because Q's service guarantees may simply be violated. For this reason, commit 1de0c4cd ("block, bfq: reduce idling only in symmetric scenarios") does plug I/O in such an asymmetric scenario. Plugging minimizes the delay induced by already in-flight I/O, and enables Q to recover the bandwidth it may lose because of this delay. Yet the above commit does not cover the case of weight-raised queues, for efficiency concerns. For weight-raised queues, I/O-dispatch plugging is activated simply if not all bfq_queues are weight-raised. But this check does not handle the case of in-flight requests, because a bfq_queue may become non busy *before* all its in-flight requests are completed. This commit performs I/O-dispatch plugging for weight-raised queues if there are some in-flight requests. As a practical example of the resulting recover of control, under write load on a Samsung SSD 970 PRO, gnome-terminal starts in 1.5 seconds after this fix, against 15 seconds before the fix (as a reference, gnome-terminal takes about 35 seconds to start with any of the other I/O schedulers). Fixes: 1de0c4cd ("block, bfq: reduce idling only in symmetric scenarios") Signed-off-by: NPaolo Valente <paolo.valente@linaro.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 15 7月, 2019 1 次提交
-
-
由 Mauro Carvalho Chehab 提交于
Rename the block documentation files to ReST, add an index for them and adjust in order to produce a nice html output via the Sphinx build system. At its new index.rst, let's add a :orphan: while this is not linked to the main index.rst file, in order to avoid build warnings. Signed-off-by: NMauro Carvalho Chehab <mchehab+samsung@kernel.org>
-
- 28 6月, 2019 1 次提交
-
-
由 Douglas Anderson 提交于
In reboot tests on several devices we were seeing a "use after free" when slub_debug or KASAN was enabled. The kernel complained about: Unable to handle kernel paging request at virtual address 6b6b6c2b ...which is a classic sign of use after free under slub_debug. The stack crawl in kgdb looked like: 0 test_bit (addr=<optimized out>, nr=<optimized out>) 1 bfq_bfqq_busy (bfqq=<optimized out>) 2 bfq_select_queue (bfqd=<optimized out>) 3 __bfq_dispatch_request (hctx=<optimized out>) 4 bfq_dispatch_request (hctx=<optimized out>) 5 0xc056ef00 in blk_mq_do_dispatch_sched (hctx=0xed249440) 6 0xc056f728 in blk_mq_sched_dispatch_requests (hctx=0xed249440) 7 0xc0568d24 in __blk_mq_run_hw_queue (hctx=0xed249440) 8 0xc0568d94 in blk_mq_run_work_fn (work=<optimized out>) 9 0xc024c5c4 in process_one_work (worker=0xec6d4640, work=0xed249480) 10 0xc024cff4 in worker_thread (__worker=0xec6d4640) Digging in kgdb, it could be found that, though bfqq looked fine, bfqq->bic had been freed. Through further digging, I postulated that perhaps it is illegal to access a "bic" (AKA an "icq") after bfq_exit_icq() had been called because the "bic" can be freed at some point in time after this call is made. I confirmed that there certainly were cases where the exact crashing code path would access the "bic" after bfq_exit_icq() had been called. Sspecifically I set the "bfqq->bic" to (void *)0x7 and saw that the bic was 0x7 at the time of the crash. To understand a bit more about why this crash was fairly uncommon (I saw it only once in a few hundred reboots), you can see that much of the time bfq_exit_icq_fbqq() fully frees the bfqq and thus it can't access the ->bic anymore. The only case it doesn't is if bfq_put_queue() sees a reference still held. However, even in the case when bfqq isn't freed, the crash is still rare. Why? I tracked what happened to the "bic" after the exit routine. It doesn't get freed right away. Rather, put_io_context_active() eventually called put_io_context() which queued up freeing on a workqueue. The freeing then actually happened later than that through call_rcu(). Despite all these delays, some extra debugging showed that all the hoops could be jumped through in time and the memory could be freed causing the original crash. Phew! To make a long story short, assuming it truly is illegal to access an icq after the "exit_icq" callback is finished, this patch is needed. Cc: stable@vger.kernel.org Reviewed-by: NPaolo Valente <paolo.valente@unimore.it> Signed-off-by: NDouglas Anderson <dianders@chromium.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 27 6月, 2019 1 次提交
-
-
由 Douglas Anderson 提交于
Some debug code suggested by Paolo was tripping when I did reboot stress tests. Specifically in bfq_bfqq_resume_state() "bic->saved_wr_start_at_switch_to_srt" was later than the current value of "jiffies". A bit of debugging showed that "bic->saved_wr_start_at_switch_to_srt" was actually 0 and a bit more debugging showed that was because we had run through the "unlikely" case in the bfq_bfqq_save_state() function. Let's init "saved_wr_start_at_switch_to_srt" in the unlikely case to something sane. NOTE: this fixes no known real-world errors. Reviewed-by: NPaolo Valente <paolo.valente@linaro.org> Reviewed-by: NGuenter Roeck <groeck@chromium.org> Signed-off-by: NDouglas Anderson <dianders@chromium.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 26 6月, 2019 1 次提交
-
-
由 Paolo Valente 提交于
By mistake, there is a '&' instead of a '==' in the definition of the macro BFQQ_TOTALLY_SEEKY. This commit replaces the wrong operator with the correct one. Fixes: 7074f076 ("block, bfq: do not tag totally seeky queues as soft rt") Signed-off-by: NPaolo Valente <paolo.valente@linaro.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 25 6月, 2019 7 次提交
-
-
由 Paolo Valente 提交于
Consider, on one side, a bfq_queue Q that remains empty while in service, and, on the other side, the pending I/O of bfq_queues that, according to their timestamps, have to be served after Q. If an uncontrolled amount of I/O from the latter bfq_queues were dispatched while Q is waiting for its new I/O to arrive, then Q's bandwidth guarantees would be violated. To prevent this, I/O dispatch is plugged until Q receives new I/O (except for a properly controlled amount of injected I/O). Unfortunately, preemption breaks I/O-dispatch plugging, for the following reason. Preemption is performed in two steps. First, Q is expired and re-scheduled. Second, the new bfq_queue to serve is chosen. The first step is needed by the second, as the second can be performed only after Q's timestamps have been properly updated (done in the expiration step), and Q has been re-queued for service. This dependency is a consequence of the way how BFQ's scheduling algorithm is currently implemented. But Q is not re-scheduled at all in the first step, because Q is empty. As a consequence, an uncontrolled amount of I/O may be dispatched until Q becomes non empty again. This breaks Q's service guarantees. This commit addresses this issue by re-scheduling Q even if it is empty. This in turn breaks the assumption that all scheduled queues are non empty. Then a few extra checks are now needed. Signed-off-by: NPaolo Valente <paolo.valente@linaro.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Paolo Valente 提交于
BFQ enqueues the I/O coming from each process into a separate bfq_queue, and serves bfq_queues one at a time. Each bfq_queue may be served for at most timeout_sync milliseconds (default: 125 ms). This service scheme is prone to the following inaccuracy. While a bfq_queue Q1 is in service, some empty bfq_queue Q2 may receive I/O, and, according to BFQ's scheduling policy, may become the right bfq_queue to serve, in place of the currently in-service bfq_queue. In this respect, postponing the service of Q2 to after the service of Q1 finishes may delay the completion of Q2's I/O, compared with an ideal service in which all non-empty bfq_queues are served in parallel, and every non-empty bfq_queue is served at a rate proportional to the bfq_queue's weight. This additional delay is equal at most to the time Q1 may unjustly remain in service before switching to Q2. If Q1 and Q2 have the same weight, then this time is most likely negligible compared with the completion time to be guaranteed to Q2's I/O. In addition, first, one of the reasons why BFQ may want to serve Q1 for a while is that this boosts throughput and, second, serving Q1 longer reduces BFQ's overhead. As a conclusion, it is usually better not to preempt Q1 if both Q1 and Q2 have the same weight. In contrast, as Q2's weight or priority becomes higher and higher compared with that of Q1, the above delay becomes larger and larger, compared with the I/O completion times that have to be guaranteed to Q2 according to Q2's weight. So reducing this delay may be more important than avoiding the costs of preempting Q1. Accordingly, this commit preempts Q1 if Q2 has a higher weight or a higher priority than Q1. Preemption causes Q1 to be re-scheduled, and triggers a new choice of the next bfq_queue to serve. If Q2 really is the next bfq_queue to serve, then Q2 will be set in service immediately. This change reduces the component of the I/O latency caused by the above delay by about 80%. For example, on an (old) PLEXTOR PX-256M5 SSD, the maximum latency reported by fio drops from 15.1 to 3.2 ms for a process doing sporadic random reads while another process is doing continuous sequential reads. Signed-off-by: NNicola Bottura <bottura.nicola95@gmail.com> Signed-off-by: NPaolo Valente <paolo.valente@linaro.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Paolo Valente 提交于
A bfq_queue Q may happen to be synchronized with another bfq_queue Q2, i.e., the I/O of Q2 may need to be completed for Q to receive new I/O. We call Q2 "waker queue". If I/O plugging is being performed for Q, and Q is not receiving any more I/O because of the above synchronization, then, thanks to BFQ's injection mechanism, the waker queue is likely to get served before the I/O-plugging timeout fires. Unfortunately, this fact may not be sufficient to guarantee a high throughput during the I/O plugging, because the inject limit for Q may be too low to guarantee a lot of injected I/O. In addition, the duration of the plugging, i.e., the time before Q finally receives new I/O, may not be minimized, because the waker queue may happen to be served only after other queues. To address these issues, this commit introduces the explicit detection of the waker queue, and the unconditional injection of a pending I/O request of the waker queue on each invocation of bfq_dispatch_request(). One may be concerned that this systematic injection of I/O from the waker queue delays the service of Q's I/O. Fortunately, it doesn't. On the contrary, next Q's I/O is brought forward dramatically, for it is not blocked for milliseconds. Reported-by: NSrivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu> Tested-by: NSrivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu> Signed-off-by: NPaolo Valente <paolo.valente@linaro.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Paolo Valente 提交于
Until the base value for request service times gets finally computed for a bfq_queue, the inject limit for that queue does depend on the think-time state (short|long) of the queue. A timely update of the think time then guarantees a quicker activation or deactivation of the injection. Fortunately, the think time of a bfq_queue is updated in the same code path as the inject limit; but after the inject limit. This commits moves the update of the think time before the update of the inject limit. For coherence, it moves the update of the seek time too. Reported-by: NSrivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu> Tested-by: NSrivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu> Signed-off-by: NPaolo Valente <paolo.valente@linaro.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Paolo Valente 提交于
I/O injection gets reduced if it increases the request service times of the victim queue beyond a certain threshold. The threshold, in its turn, is computed as a function of the base service time enjoyed by the queue when it undergoes no injection. As a consequence, for injection to work properly, the above base value has to be accurate. In this respect, such a value may vary over time. For example, it varies if the size or the spatial locality of the I/O requests in the queue change. It is then important to update this value whenever possible. This commit performs this update. Reported-by: NSrivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu> Tested-by: NSrivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu> Signed-off-by: NPaolo Valente <paolo.valente@linaro.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Paolo Valente 提交于
One of the cases where the parameters for injection may be updated is when there are no more in-flight I/O requests. The number of in-flight requests is stored in the field bfqd->rq_in_driver of the descriptor bfqd of the device. So, the controlled condition is bfqd->rq_in_driver == 0. Unfortunately, this is wrong because, the instruction that checks this condition is in the code path that handles the completion of a request, and, in particular, the instruction is executed before bfqd->rq_in_driver is decremented in such a code path. This commit fixes this issue by just replacing 0 with 1 in the comparison. Reported-by: NSrivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu> Tested-by: NSrivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu> Signed-off-by: NPaolo Valente <paolo.valente@linaro.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Paolo Valente 提交于
Until the base value of the request service times gets finally computed for a bfq_queue, the inject limit does depend on the think-time state (short|long). The limit must be 0 or 1 if the think time is deemed, respectively, as short or long. However, such a check and possible limit update is performed only periodically, once per second. So, to make the injection mechanism much more reactive, this commit performs the update also every time the think-time state changes. In addition, in the following special case, this commit lets the inject limit of a bfq_queue bfqq remain equal to 1 even if bfqq's think time is short: bfqq's I/O is synchronized with that of some other queue, i.e., bfqq may receive new I/O only after the I/O of the other queue is completed. Keeping the inject limit to 1 allows the blocking I/O to be served while bfqq is in service. And this is very convenient both for bfqq and for the total throughput, as explained in detail in the comments in bfq_update_has_short_ttime(). Reported-by: NSrivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu> Tested-by: NSrivatsa S. Bhat (VMware) <srivatsa@csail.mit.edu> Signed-off-by: NPaolo Valente <paolo.valente@linaro.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 21 6月, 2019 2 次提交
-
-
由 Christoph Hellwig 提交于
This option is entirely bfq specific, give it an appropinquate name. Also make it depend on CONFIG_BFQ_GROUP_IOSCHED in Kconfig, as all the functionality already does so anyway. Acked-by: NTejun Heo <tj@kernel.org> Acked-by: NPaolo Valente <paolo.valente@linaro.org> Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Christoph Hellwig 提交于
We only need the number of segments in the blk-mq submission path. Remove the field from struct bio, and return it from a variant of blk_queue_split instead of that it can passed as an argument to those functions that need the value. This also means we stop recounting segments except for cloning and partial segments. To keep the number of arguments in this how path down remove pointless struct request_queue arguments from any of the functions that had it and grew a nr_segs argument. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 01 5月, 2019 1 次提交
-
-
由 Christoph Hellwig 提交于
All these files have some form of the usual GPLv2 or later boilerplate. Switch them to use SPDX tags instead. Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 14 4月, 2019 1 次提交
-
-
由 Jens Axboe 提交于
A previous commit moved the shallow depth and BFQ depth map calculations to be done at init time, moving it outside of the hotter IO path. This potentially causes hangs if the users changes the depth of the scheduler map, by writing to the 'nr_requests' sysfs file for that device. Add a blk-mq-sched hook that allows blk-mq to inform the scheduler if the depth changes, so that the scheduler can update its internal state. Tested-by: NKai Krakow <kai@kaishome.de> Reported-by: NPaolo Valente <paolo.valente@linaro.org> Fixes: f0635b8a ("bfq: calculate shallow depths at init time") Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 10 4月, 2019 1 次提交
-
-
由 Paolo Valente 提交于
The function bfq_bfqq_expire() invokes the function __bfq_bfqq_expire(), and the latter may free the in-service bfq-queue. If this happens, then no other instruction of bfq_bfqq_expire() must be executed, or a use-after-free will occur. Basing on the assumption that __bfq_bfqq_expire() invokes bfq_put_queue() on the in-service bfq-queue exactly once, the queue is assumed to be freed if its refcounter is equal to one right before invoking __bfq_bfqq_expire(). But, since commit 9dee8b3b ("block, bfq: fix queue removal from weights tree") this assumption is false. __bfq_bfqq_expire() may also invoke bfq_weights_tree_remove() and, since commit 9dee8b3b ("block, bfq: fix queue removal from weights tree"), also the latter function may invoke bfq_put_queue(). So __bfq_bfqq_expire() may invoke bfq_put_queue() twice, and this is the actual case where the in-service queue may happen to be freed. To address this issue, this commit moves the check on the refcounter of the queue right around the last bfq_put_queue() that may be invoked on the queue. Fixes: 9dee8b3b ("block, bfq: fix queue removal from weights tree") Reported-by: NDmitrii Tcvetkov <demfloro@demfloro.ru> Reported-by: NDouglas Anderson <dianders@chromium.org> Tested-by: NDmitrii Tcvetkov <demfloro@demfloro.ru> Tested-by: NDouglas Anderson <dianders@chromium.org> Signed-off-by: NPaolo Valente <paolo.valente@linaro.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 09 4月, 2019 1 次提交
-
-
由 Angelo Ruocco 提交于
Some of the comments in the bfq files had typos. This patch fixes them. Signed-off-by: NAngelo Ruocco <angeloruocco90@gmail.com> Signed-off-by: NPaolo Valente <paolo.valente@linaro.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 01 4月, 2019 9 次提交
-
-
由 Francesco Pollicino 提交于
bfq saves the state of a queue each time a merge occurs, to be able to resume such a state when the queue is associated again with its original process, on a split. Unfortunately bfq does not save & restore also the weight of the queue. If the weight is not correctly resumed when the queue is recycled, then the weight of the recycled queue could differ from the weight of the original queue. This commit adds the missing save & resume of the weight. Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com> Tested-by: NOleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: NFrancesco Pollicino <fra.fra.800@gmail.com> Signed-off-by: NPaolo Valente <paolo.valente@linaro.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Francesco Pollicino 提交于
The function "bfq_log_bfqq" prints the pid of the process associated with the queue passed as input. Unfortunately, if the queue is shared, then more than one process is associated with the queue. The pid that gets printed in this case is the pid of one of the associated processes. Which process gets printed depends on the exact sequence of merge events the queue underwent. So printing such a pid is rather useless and above all is often rather confusing because it reports a random pid between those of the associated processes. This commit addresses this issue by printing SHARED instead of a pid if the queue is shared. Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com> Tested-by: NOleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: NFrancesco Pollicino <fra.fra.800@gmail.com> Signed-off-by: NPaolo Valente <paolo.valente@linaro.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Paolo Valente 提交于
If many bfq_queues belonging to the same group happen to be created shortly after each other, then the processes associated with these queues have typically a common goal. In particular, bursts of queue creations are usually caused by services or applications that spawn many parallel threads/processes. Examples are systemd during boot, or git grep. If there are no other active queues, then, to help these processes get their job done as soon as possible, the best thing to do is to reach a high throughput. To this goal, it is usually better to not grant either weight-raising or device idling to the queues associated with these processes. And this is exactly what BFQ currently does. There is however a drawback: if, in contrast, some other queues are already active, then the newly created queues must be protected from the I/O flowing through the already existing queues. In this case, the best thing to do is the opposite as in the other case: it is much better to grant weight-raising and device idling to the newly-created queues, if they deserve it. This commit addresses this issue by doing so if there are already other active queues. This change also helps eliminating false positives, which occur when the newly-created queues do not belong to an actual large burst of creations, but some background task (e.g., a service) happens to trigger the creation of new queues in the middle, i.e., very close to when the victim queues are created. These false positive may cause total loss of control on process latencies. Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com> Tested-by: NOleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: NPaolo Valente <paolo.valente@linaro.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Paolo Valente 提交于
Sync random I/O is likely to be confused with soft real-time I/O, because it is characterized by limited throughput and apparently isochronous arrival pattern. To avoid false positives, this commits prevents bfq_queues containing only random (seeky) I/O from being tagged as soft real-time. Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com> Tested-by: NOleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: NPaolo Valente <paolo.valente@linaro.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Paolo Valente 提交于
To boost throughput with a set of processes doing interleaved I/O (i.e., a set of processes whose individual I/O is random, but whose merged cumulative I/O is sequential), BFQ merges the queues associated with these processes, i.e., redirects the I/O of these processes into a common, shared queue. In the shared queue, I/O requests are ordered by their position on the medium, thus sequential I/O gets dispatched to the device when the shared queue is served. Queue merging costs execution time, because, to detect which queues to merge, BFQ must maintain a list of the head I/O requests of active queues, ordered by request positions. Measurements showed that this costs about 10% of BFQ's total per-request processing time. Request processing time becomes more and more critical as the speed of the underlying storage device grows. Yet, fortunately, queue merging is basically useless on the very devices that are so fast to make request processing time critical. To reach a high throughput, these devices must have many requests queued at the same time. But, in this configuration, the internal scheduling algorithms of these devices do also the job of queue merging: they reorder requests so as to obtain as much as possible a sequential I/O pattern. As a consequence, with processes doing interleaved I/O, the throughput reached by one such device is likely to be the same, with and without queue merging. In view of this fact, this commit disables queue merging, and all related housekeeping, for non-rotational devices with internal queueing. The total, single-lock-protected, per-request processing time of BFQ drops to, e.g., 1.9 us on an Intel Core i7-2760QM@2.40GHz (time measured with simple code instrumentation, and using the throughput-sync.sh script of the S suite [1], in performance-profiling mode). To put this result into context, the total, single-lock-protected, per-request execution time of the lightest I/O scheduler available in blk-mq, mq-deadline, is 0.7 us (mq-deadline is ~800 LOC, against ~10500 LOC for BFQ). Disabling merging provides a further, remarkable benefit in terms of throughput. Merging tends to make many workloads artificially more uneven, mainly because of shared queues remaining non empty for incomparably more time than normal queues. So, if, e.g., one of the queues in a set of merged queues has a higher weight than a normal queue, then the shared queue may inherit such a high weight and, by staying almost always active, may force BFQ to perform I/O plugging most of the time. This evidently makes it harder for BFQ to let the device reach a high throughput. As a practical example of this problem, and of the benefits of this commit, we measured again the throughput in the nasty scenario considered in previous commit messages: dbench test (in the Phoronix suite), with 6 clients, on a filesystem with journaling, and with the journaling daemon enjoying a higher weight than normal processes. With this commit, the throughput grows from ~150 MB/s to ~200 MB/s on a PLEXTOR PX-256M5 SSD. This is the same peak throughput reached by any of the other I/O schedulers. As such, this is also likely to be the maximum possible throughput reachable with this workload on this device, because I/O is mostly random, and the other schedulers basically just pass I/O requests to the drive as fast as possible. [1] https://github.com/Algodev-github/STested-by: NHolger Hoffstätte <holger@applied-asynchrony.com> Tested-by: NOleksandr Natalenko <oleksandr@natalenko.name> Tested-by: NFrancesco Pollicino <fra.fra.800@gmail.com> Signed-off-by: NAlessio Masola <alessio.masola@gmail.com> Signed-off-by: NPaolo Valente <paolo.valente@linaro.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Paolo Valente 提交于
The processes associated with a bfq_queue, say Q, may happen to generate their cumulative I/O at a lower rate than the rate at which the device could serve the same I/O. This is rather probable, e.g., if only one process is associated with Q and the device is an SSD. It results in Q becoming often empty while in service. If BFQ is not allowed to switch to another queue when Q becomes empty, then, during the service of Q, there will be frequent "service holes", i.e., time intervals during which Q gets empty and the device can only consume the I/O already queued in its hardware queues. This easily causes considerable losses of throughput. To counter this problem, BFQ implements a request injection mechanism, which tries to fill the above service holes with I/O requests taken from other bfq_queues. The hard part in this mechanism is finding the right amount of I/O to inject, so as to both boost throughput and not break Q's bandwidth and latency guarantees. To this goal, the current version of this mechanism measures the bandwidth enjoyed by Q while it is being served, and tries to inject the maximum possible amount of extra service that does not cause Q's bandwidth to decrease too much. This solution has an important shortcoming. For bandwidth measurements to be stable and reliable, Q must remain in service for a much longer time than that needed to serve a single I/O request. Unfortunately, this does not hold with many workloads. This commit addresses this issue by changing the way the amount of injection allowed is dynamically computed. It tunes injection as a function of the service times of single I/O requests of Q, instead of Q's bandwidth. Single-request service times are evidently meaningful even if Q gets very few I/O requests completed while it is in service. As a testbed for this new solution, we measured the throughput reached by BFQ for one of the nastiest workloads and configurations for this scheduler: the workload generated by the dbench test (in the Phoronix suite), with 6 clients, on a filesystem with journaling, and with the journaling daemon enjoying a higher weight than normal processes. With this commit, the throughput grows from ~100 MB/s to ~150 MB/s on a PLEXTOR PX-256M5. Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com> Tested-by: NOleksandr Natalenko <oleksandr@natalenko.name> Tested-by: NFrancesco Pollicino <fra.fra.800@gmail.com> Signed-off-by: NPaolo Valente <paolo.valente@linaro.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Paolo Valente 提交于
In most cases, it is detrimental for throughput to plug I/O dispatch when the in-service bfq_queue becomes temporarily empty (plugging is performed to wait for the possible arrival, soon, of new I/O from the in-service queue). There is however a case where plugging is needed for service guarantees. If a bfq_queue, say Q, has a higher weight than some other active bfq_queue, and is sync, i.e., contains sync I/O, then, to guarantee that Q does receive a higher share of the throughput than other lower-weight queues, it is necessary to plug I/O dispatch when Q remains temporarily empty while being served. For this reason, BFQ performs I/O plugging when some active bfq_queue has a higher weight than some other active bfq_queue. But this is unnecessarily overkill. In fact, if the in-service bfq_queue actually has a weight lower than or equal to the other queues, then the queue *must not* be guaranteed a higher share of the throughput than the other queues. So, not plugging I/O cannot cause any harm to the queue. And can boost throughput. Taking advantage of this fact, this commit does not plug I/O for sync bfq_queues with a weight lower than or equal to the weights of the other queues. Here is an example of the resulting throughput boost with the dbench workload, which is particularly nasty for BFQ. With the dbench test in the Phoronix suite, BFQ reaches its lowest total throughput with 6 clients on a filesystem with journaling, in case the journaling daemon has a higher weight than normal processes. Before this commit, the total throughput was ~80 MB/sec on a PLEXTOR PX-256M5, after this commit it is ~100 MB/sec. Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com> Tested-by: NOleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: NPaolo Valente <paolo.valente@linaro.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Paolo Valente 提交于
If a sync bfq_queue has a higher weight than some other queue, and remains temporarily empty while in service, then, to preserve the bandwidth share of the queue, it is necessary to plug I/O dispatching until a new request arrives for the queue. In addition, a timeout needs to be set, to avoid waiting for ever if the process associated with the queue has actually finished its I/O. Even with the above timeout, the device is however not fed with new I/O for a while, if the process has finished its I/O. If this happens often, then throughput drops and latencies grow. For this reason, the timeout is kept rather low: 8 ms is the current default. Unfortunately, such a low value may cause, on the opposite end, a violation of bandwidth guarantees for a process that happens to issue new I/O too late. The higher the system load, the higher the probability that this happens to some process. This is a problem in scenarios where service guarantees matter more than throughput. One important case are weight-raised queues, which need to be granted a very high fraction of the bandwidth. To address this issue, this commit lower-bounds the plugging timeout for weight-raised queues to 20 ms. This simple change provides relevant benefits. For example, on a PLEXTOR PX-256M5S, with which gnome-terminal starts in 0.6 seconds if there is no other I/O in progress, the same applications starts in - 0.8 seconds, instead of 1.2 seconds, if ten files are being read sequentially in parallel - 1 second, instead of 2 seconds, if, in parallel, five files are being read sequentially, and five more files are being written sequentially Tested-by: NHolger Hoffstätte <holger@applied-asynchrony.com> Tested-by: NOleksandr Natalenko <oleksandr@natalenko.name> Signed-off-by: NPaolo Valente <paolo.valente@linaro.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Konstantin Khlebnikov 提交于
Replace BFQ_GROUP_IOSCHED_ENABLED with CONFIG_BFQ_GROUP_IOSCHED. Code under these ifdefs never worked, something might be broken. Fixes: 0471559c ("block, bfq: add/remove entity weights correctly") Fixes: 73d58118 ("block, bfq: consider also ioprio classes in symmetry detection") Reviewed-by: NHolger Hoffstätte <holger@applied-asynchrony.com> Signed-off-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 01 2月, 2019 12 次提交
-
-
由 Paolo Valente 提交于
When a new I/O request arrives for a bfq_queue, say Q, bfq checks whether that request is close to (a) the head request of some other queue waiting to be served, or (b) the last request dispatched for the in-service queue (in case Q itself is not the in-service queue) If a queue, say Q2, is found for which the above condition holds, then bfq merges Q and Q2, to hopefully get a more sequential I/O in the resulting merged queue, and thus a possibly higher throughput. Case (b) is checked by comparing the new request for Q with the last request dispatched, assuming that the latter necessarily belonged to the in-service queue. Unfortunately, this assumption is no longer always correct, since commit d0edc247 ("block, bfq: inject other-queue I/O into seeky idle queues on NCQ flash"). When the assumption does not hold, queues that must not be merged may be merged, causing unexpected loss of control on per-queue service guarantees. This commit solves this problem by adding an extra field, which stores the actual last request dispatched for the in-service queue, and by using this new field to correctly check case (b). Signed-off-by: NPaolo Valente <paolo.valente@linaro.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Paolo Valente 提交于
Writes tend to starve reads. bfq counters this problem by overcharging writes with an inflated service w.r.t. the actual service (number of sector written) they receive. Yet his overcharging is useless, and actually causes unfairness in the opposite direction, when bfq happens to be enforcing strong I/O control. bfq does this enforcing when the scenario is asymmetric, i.e., when some bfq_queue or group of bfq_queues is to be granted a different bandwidth than some other bfq_queue or group of bfq_queues. So, in such a scenario, this commit disables write overcharging. Signed-off-by: NPaolo Valente <paolo.valente@linaro.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Paolo Valente 提交于
The original commit is commit 1a1238a7 ("cfq-iosched: improve hw_tag detection") and has the following commit message: If active queue hasn't enough requests and idle window opens, cfq will not dispatch sufficient requests to hardware. In such situation, current code will zero hw_tag. But this is because cfq doesn't dispatch enough requests instead of hardware queue doesn't work. Don't zero hw_tag in such case. Signed-off-by: NPaolo Valente <paolo.valente@linaro.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Paolo Valente 提交于
bfq simple heuristic from cfq for detecting whether the drive performs command queueing: check whether the average number of in-flight requests is above a given threshold. Unfortunately this heuristic does fail to detect queueing (on drives with queueing) if processes doing I/O are few and issue I/O with a low depth. To reduce false negatives, this commit lowers the threshold. Signed-off-by: NPaolo Valente <paolo.valente@linaro.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Paolo Valente 提交于
bfq maintains an ordered list, through a red-black tree, of unique weights of active bfq_queues. This list is used to detect whether there are active queues with differentiated weights. The weight of a queue is removed from the list when both the following two conditions become true: (1) the bfq_queue is flagged as inactive (2) the has no in-flight request any longer; Unfortunately, in the rare cases where condition (2) becomes true before condition (1), the removal fails, because the function to remove the weight of the queue (bfq_weights_tree_remove) is rightly invoked in the path that deactivates the bfq_queue, but mistakenly invoked *before* the function that actually performs the deactivation (bfq_deactivate_bfqq). This commits moves the invocation of bfq_weights_tree_remove for condition (1) to after bfq_deactivate_bfqq. As a consequence of this move, it is necessary to add a further reference to the queue when the weight of a queue is added, because the queue might otherwise be freed before bfq_weights_tree_remove is invoked. This commit adds this reference and makes all related modifications. Signed-off-by: NPaolo Valente <paolo.valente@linaro.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Paolo Valente 提交于
In bfq_update_peak_rate, to check whether an I/O request rq is sequential, only the seek distance of rq w.r.t. the last request dispatched is controlled. This is not sufficient for non-rotational storage, where the size of rq is at least as relevant. This commit adds the missing control. Signed-off-by: NPaolo Valente <paolo.valente@linaro.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Paolo Valente 提交于
bfq detects the creation of multiple bfq_queues shortly after each other, namely a burst of queue creations in the terminology used in the code. If the burst is large, then no queue in the burst is granted - either I/O-dispatch plugging when the queue remains temporarily idle while in service; - or weight raising, because it causes even longer plugging. In fact, such a plugging tends to lower throughput, while these bursts are typically due to applications or services that spawn multiple processes, to reach a common goal as soon as possible. Examples are a "git grep" or the booting of a system. Unfortunately, disabling plugging may cause a loss of service guarantees in asymmetric scenarios, i.e., if queue weights are differentiated or if more than one group is active. This commit addresses this issue by no longer disabling I/O-dispatch plugging for queues in large bursts. Signed-off-by: NPaolo Valente <paolo.valente@linaro.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Paolo Valente 提交于
If the in-service bfq_queue is sync and remains temporarily idle, then I/O dispatching (from other queues) may be plugged. It may be dome for two reasons: either to boost throughput, or to preserve the bandwidth share of the in-service queue. In the first case, if the I/O of the in-service queue, when it finally arrives, consists only of one small I/O request, then it makes sense to plug even the I/O of the in-service queue. In fact, serving such a small request immediately is likely to lower throughput instead of boosting it, whereas waiting a little bit is likely to let that request grow, thanks to request merging, and become more profitable in terms of throughput (this is likely to happen exactly because the I/O of the queue has been detected to boost throughput). On the opposite end, if I/O dispatching is being plugged only to preserve the bandwidth of the in-service queue, then it would be better not to plug also the I/O of the in-service queue, because such a plugging is likely to cause only loss of bandwidth for the queue. Unfortunately, no distinction is made between the two cases, and the I/O of the in-service queue is always plugged in case just a small I/O request arrives. This commit draws this missing distinction and does not perform harmful plugging. Signed-off-by: NPaolo Valente <paolo.valente@linaro.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Paolo Valente 提交于
This is a preparatory commit for commits that need to check only one of the two main reasons for idling. This change should also improve the quality of the code a little bit, by splitting a function that contains very long, non-trivial and little related comments. Signed-off-by: NPaolo Valente <paolo.valente@linaro.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Paolo Valente 提交于
In asymmetric scenarios, i.e., when some bfq_queue or bfq_group needs to be guaranteed a different bandwidth than other bfq_queues or bfq_groups, these service guaranteed can be provided only by plugging I/O dispatch, completely or partially, when the queue in service remains temporarily empty. A case where asymmetry is particularly strong is when some active bfq_queues belong to a higher-priority class than some other active bfq_queues. Unfortunately, this important case is not considered at all in the code for detecting asymmetric scenarios. This commit adds the missing logic. Signed-off-by: NPaolo Valente <paolo.valente@linaro.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Paolo Valente 提交于
Before commit 18e5a57d ("block, bfq: postpone rq preparation to insert or merge"), the destination queue for a request was chosen by a different hook than the one that then inserted the request. So, between the execution of the two hooks, the bic of the process generating the request could happen to be redirected to a different bfq_queue. As a consequence, the destination bfq_queue stored in the request could be wrong. Such an event does not need to ba handled any longer. Signed-off-by: NPaolo Valente <paolo.valente@linaro.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Paolo Valente 提交于
With some unlucky sequences of events, the function bfq_updated_next_req updates the current budget of a bfq_queue to a lower value than the service received by the queue using such a budget. Unfortunately, if this happens, then the return value of the function bfq_bfqq_budget_left becomes inconsistent. This commit solves this problem by lower-bounding the budget computed in bfq_updated_next_req to the service currently charged to the queue. Signed-off-by: NPaolo Valente <paolo.valente@linaro.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-