- 06 1月, 2018 11 次提交
-
-
由 Paolo Valente 提交于
In BFQ and CFQ, two processes are said to be cooperating if they do I/O in such a way that the union of their I/O requests yields a sequential I/O pattern. To get such a sequential I/O pattern out of the non-sequential pattern of each cooperating process, BFQ and CFQ merge the queues associated with these processes. In more detail, cooperating processes, and thus their associated queues, usually start, or restart, to do I/O shortly after each other. This is the case, e.g., for the I/O threads of KVM/QEMU and of the dump utility. Basing on this assumption, this commit allows a bfq_queue to be merged only during a short time interval (100ms) after it starts, or re-starts, to do I/O. This filtering provides two important benefits. First, it greatly reduces the probability that two non-cooperating processes have their queues merged by mistake, if they just happen to do I/O close to each other for a short time interval. These spurious merges cause loss of service guarantees. A low-weight bfq_queue may unjustly get more than its expected share of the throughput: if such a low-weight queue is merged with a high-weight queue, then the I/O for the low-weight queue is served as if the queue had a high weight. This may damage other high-weight queues unexpectedly. For instance, because of this issue, lxterminal occasionally took 7.5 seconds to start, instead of 6.5 seconds, when some sequential readers and writers did I/O in the background on a FUJITSU MHX2300BT HDD. The reason is that the bfq_queues associated with some of the readers or the writers were merged with the high-weight queues of some processes that had to do some urgent but little I/O. The readers then exploited the inherited high weight for all or most of their I/O, during the start-up of terminal. The filtering introduced by this commit eliminated any outlier caused by spurious queue merges in our start-up time tests. This filtering also provides a little boost of the throughput sustainable by BFQ: 3-4%, depending on the CPU. The reason is that, once a bfq_queue cannot be merged any longer, this commit makes BFQ stop updating the data needed to handle merging for the queue. Signed-off-by: NPaolo Valente <paolo.valente@linaro.org> Signed-off-by: NAngelo Ruocco <angeloruocco90@gmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Angelo Ruocco 提交于
A just-created bfq_queue will certainly be deemed as interactive on the arrival of its first I/O request, if the low_latency flag is set. Yet, if the queue is merged with another queue on the arrival of its first I/O request, it will not have the chance to be flagged as interactive. Nevertheless, if the queue is then split soon enough, it has to be flagged as interactive after the split. To handle this early-merge scenario correctly, BFQ saves the state of the queue, on the merge, as if the latter had already been deemed interactive. So, if the queue is split soon, it will get weight-raised, because the previous state of the queue is resumed on the split. Unfortunately, in the act of saving the state of the newly-created queue, BFQ doesn't check whether the low_latency flag is set, and this causes early-merged queues to be then weight-raised, on queue splits, even if low_latency is off. This commit addresses this problem by adding the missing check. Signed-off-by: NAngelo Ruocco <angeloruocco90@gmail.com> Signed-off-by: NPaolo Valente <paolo.valente@linaro.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Paolo Valente 提交于
If two processes do I/O close to each other, then BFQ merges the bfq_queues associated with these processes, to get a more sequential I/O, and thus a higher throughput. In this respect, to detect whether two processes are doing I/O close to each other, BFQ keeps a list of the head-of-line I/O requests of all active bfq_queues. The list is ordered by initial sectors, and implemented through a red-black tree (rq_pos_tree). Unfortunately, the update of the rq_pos_tree was incomplete, because the tree was not updated on the removal of the head-of-line I/O request of a bfq_queue, in case the queue did not remain empty. This commit adds the missing update. Signed-off-by: NPaolo Valente <paolo.valente@linaro.org> Signed-off-by: NAngelo Ruocco <angeloruocco90@gmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Paolo Valente 提交于
If two processes do I/O close to each other, i.e., are cooperating processes in BFQ (and CFQ'S) nomenclature, then BFQ merges their associated bfq_queues, so as to get sequential I/O from the union of the I/O requests of the processes, and thus reach a higher throughput. A merged queue is then split if its I/O stops being sequential. In this respect, BFQ deems the I/O of a bfq_queue as (mostly) sequential only if less than 4 I/O requests are random, out of the last 32 requests inserted into the queue. Unfortunately, extensive testing (with the interleaved_io benchmark of the S suite [1], and with real applications spawning cooperating processes) has clearly shown that, with such a low threshold, only a rather low I/O throughput may be reached when several cooperating processes do I/O. In particular, the outcome of each test run was bimodal: if queue merging occurred and was stable during the test, then the throughput was close to the peak rate of the storage device, otherwise the throughput was arbitrarily low (usually around 1/10 of the peak rate with a rotational device). The probability to get the unlucky outcomes grew with the number of cooperating processes: it was already significant with 5 processes, and close to one with 7 or more processes. The cause of the low throughput in the unlucky runs was that the merged queues containing the I/O of these cooperating processes were soon split, because they contained more random I/O requests than those tolerated by the 4/32 threshold, but - that I/O would have however allowed the storage device to reach peak throughput or almost peak throughput; - in contrast, the I/O of these processes, if served individually (from separate queues) yielded a rather low throughput. So we repeated our tests with increasing values of the threshold, until we found the minimum value (19) for which we obtained maximum throughput, reliably, with at least up to 9 cooperating processes. Then we checked that the use of that higher threshold value did not cause any regression for any other benchmark in the suite [1]. This commit raises the threshold to such a higher value. [1] https://github.com/Algodev-github/SSigned-off-by: NAngelo Ruocco <angeloruocco90@gmail.com> Signed-off-by: NPaolo Valente <paolo.valente@linaro.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Damien Le Moal 提交于
Introduce zone write locking to avoid write request reordering with zoned block devices. This is achieved using a finer selection of the next request to dispatch: 1) Any non-write request is always allowed to proceed. 2) Any write to a conventional zone is always allowed to proceed. 3) For a write to a sequential zone, the zone lock is first checked. a) If the zone is not locked, the write is allowed to proceed after its target zone is locked. b) If the zone is locked, the write request is skipped and the next request in the dispatch queue tested (back to step 1). For a write request that has locked its target zone, the zone is unlocked either when the request completes and the method deadline_request_completed() is called, or when the request is requeued using the method deadline_add_request(). Requests targeting a locked zone are always left in the scheduler queue to preserve the initial write order. If no write request can be dispatched, allow reads to be dispatched even if the write batch is not done. If the device used is not a zoned block device, or if zoned block device support is disabled, this patch does not modify deadline behavior. Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Damien Le Moal 提交于
Avoid directly referencing the next_rq and fifo_list arrays using the helper functions deadline_next_request() and deadline_fifo_request() to facilitate changes in the dispatch request selection in deadline_dispatch_requests() for zoned block devices. While at it, also remove the unnecessary forward declaration of the function deadline_move_request(). Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Damien Le Moal 提交于
Introduce zone write locking to avoid write request reordering with zoned block devices. This is achieved using a finer selection of the next request to dispatch: 1) Any non-write request is always allowed to proceed. 2) Any write to a conventional zone is always allowed to proceed. 3) For a write to a sequential zone, the zone lock is first checked. a) If the zone is not locked, the write is allowed to proceed after its target zone is locked. b) If the zone is locked, the write request is skipped and the next request in the dispatch queue tested (back to step 1). For a write request that has locked its target zone, the zone is unlocked either when the request completes with a call to the method deadline_request_completed() or when the request is requeued using dd_insert_request(). Requests targeting a locked zone are always left in the scheduler queue to preserve the lba ordering for write requests. If no write request can be dispatched, allow reads to be dispatched even if the write batch is not done. If the device used is not a zoned block device, or if zoned block device support is disabled, this patch does not modify mq-deadline behavior. Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Damien Le Moal 提交于
Avoid directly referencing the next_rq and fifo_list arrays using the helper functions deadline_next_request() and deadline_fifo_request() to facilitate changes in the dispatch request selection in __dd_dispatch_request() for zoned block devices. Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com> Reviewed-by: NBart Van Assche <Bart.VanAssche@wdc.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Christoph Hellwig 提交于
Components relying only on the request_queue structure for accessing block devices (e.g. I/O schedulers) have a limited knowledged of the device characteristics. In particular, the device capacity cannot be easily discovered, which for a zoned block device also result in the inability to easily know the number of zones of the device (the zone size is indicated by the chunk_sectors field of the queue limits). Introduce the nr_zones field to the request_queue structure to simplify access to this information. Also, add the bitmap seq_zone_bitmap which indicates which zones of the device are sequential zones (write preferred or write required) and the bitmap seq_zones_wlock which indicates if a zone is write locked, that is, if a write request targeting a zone was dispatched to the device. These fields are initialized by the low level block device driver (sd.c for ZBC/ZAC disks). They are not initialized by stacking drivers (device mappers) handling zoned block devices (e.g. dm-linear). Using this, I/O schedulers can introduce zone write locking to control request dispatching to a zoned block device and avoid write request reordering by limiting to at most a single write request per zone outside of the scheduler at any time. Based on previous patches from Damien Le Moal. Signed-off-by: NChristoph Hellwig <hch@lst.de> [Damien] * Fixed comments and identation in blkdev.h * Changed helper functions * Fixed this commit message Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com> Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Bart Van Assche 提交于
Call bdev_get_queue(bdev) after bdev->bd_disk has been initialized instead of just before that pointer has been initialized. This patch avoids that the following command pktsetup 1 /dev/sr0 triggers the following kernel crash: BUG: unable to handle kernel NULL pointer dereference at 0000000000000548 IP: pkt_setup_dev+0x2db/0x670 [pktcdvd] CPU: 2 PID: 724 Comm: pktsetup Not tainted 4.15.0-rc4-dbg+ #1 Call Trace: pkt_ctl_ioctl+0xce/0x1c0 [pktcdvd] do_vfs_ioctl+0x8e/0x670 SyS_ioctl+0x3c/0x70 entry_SYSCALL_64_fastpath+0x23/0x9a Reported-by: NMaciej S. Szmigiero <mail@maciej.szmigiero.name> Fixes: commit ca18d6f7 ("block: Make most scsi_req_init() calls implicit") Signed-off-by: NBart Van Assche <bart.vanassche@wdc.com> Tested-by: NMaciej S. Szmigiero <mail@maciej.szmigiero.name> Cc: Maciej S. Szmigiero <mail@maciej.szmigiero.name> Cc: <stable@vger.kernel.org> # v4.13 Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Bart Van Assche 提交于
Commit 523e1d39 ("block: make gendisk hold a reference to its queue") modified add_disk() and disk_release() but did not update any of the error paths that trigger a put_disk() call after disk->queue has been assigned. That introduced the following behavior in the pktcdvd driver if pkt_new_dev() fails: Kernel BUG at 00000000e98fd882 [verbose debug info unavailable] Since disk_release() calls blk_put_queue() anyway if disk->queue != NULL, fix this by removing the blk_cleanup_queue() call from the pkt_setup_dev() error path. Fixes: commit 523e1d39 ("block: make gendisk hold a reference to its queue") Signed-off-by: NBart Van Assche <bart.vanassche@wdc.com> Cc: Tejun Heo <tj@kernel.org> Cc: Maciej S. Szmigiero <mail@maciej.szmigiero.name> Cc: <stable@vger.kernel.org> # v3.2 Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 05 1月, 2018 26 次提交
-
-
由 Matias Bjørling 提交于
Shorten function to simply return the value of the if statement. Signed-off-by: NMatias Bjørling <m@bjorling.me> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Javier González 提交于
Since pblk registers its own block device, the iostat accounting is not automatically done for us. Therefore, add the necessary accounting logic to satisfy the iostat interface. Signed-off-by: NJavier González <javier@cnexlabs.com> Signed-off-by: NMatias Bjørling <m@bjorling.me> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Javier González 提交于
Add the instance name to the information printed out on target creation. Signed-off-by: NJavier González <javier@cnexlabs.com> Signed-off-by: NMatias Bjørling <m@bjorling.me> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Javier González 提交于
Refactor the way we free the write buffer to ensure that all entries get freed in case of an error on the init sequence. Signed-off-by: NJavier González <javier@cnexlabs.com> Signed-off-by: NMatias Bjørling <m@bjorling.me> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Javier González 提交于
When creating the write thread, ensure that the kthread has been created before initializing the timer responsible from kicking it. Otherwise, if the kthread creation fails or gets killed from used space, we risk kicking an empty thread structure. Also, since the kthread creation can be interrupted form user space, adapt the error path to not report an error when this happens, since it is intentional that the instance creation is aborted. Signed-off-by: NJavier González <javier@cnexlabs.com> Updated source to reflect the new timer_setup API. Signed-off-by: NMatias Bjørling <m@bjorling.me> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Javier González 提交于
On scan recovery, reads can fail. This happens because the first page for each line is read in order to determined if the line has been used (and thus needs to be recovered), or not. This can lead to "empty page" read errors. Since these errors are normal, do not log them, as they are confusing when reviewing the logs. Signed-off-by: NJavier González <javier@cnexlabs.com> Signed-off-by: NMatias Bjørling <m@bjorling.me> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Javier González 提交于
On recovery, do not stop L2P recovery if reads report high ECC error as the data is still available. Signed-off-by: NJavier González <javier@cnexlabs.com> Signed-off-by: NMatias Bjørling <m@bjorling.me> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Javier González 提交于
Allow to set the over-provision percentage on target creation. In case that the value is not provided, fall back to the default value set by the target. In pblk, set the default OP to 11% of the total size of the device Signed-off-by: NJavier González <javier@cnexlabs.com> Signed-off-by: NHans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: NMatias Bjørling <m@bjorling.me> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Javier González 提交于
Until now, pblk's rate-limiter has used a heuristic to reserve space for GC I/O given that the over-provision area was fixed. In preparation for allowing to define the over-provision area on target creation, define a dedicated free_block counter in the rate-limiter to track the number of blocks being used for user data. Signed-off-by: NJavier González <javier@cnexlabs.com> Signed-off-by: NHans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: NMatias Bjørling <m@bjorling.me> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Hans Holmberg 提交于
pblk_gc_stop just sets pblk->gc->gc_active to zero, ignoring the flush parameter. This is plain confusing, so remove the function and set the gc active flag at the call points instead. Signed-off-by: NHans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: NJavier González <javier@cnexlabs.com> Signed-off-by: NMatias Bjørling <m@bjorling.me> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Hans Holmberg 提交于
Unless we protect flush pointer updates with a lock, we risk resetting new flush points before we've synced all sectors up to that point. This patch protects new flush points with the same spin lock that is being held when advancing the sync pointer and resetting completed flush points. Signed-off-by: NHans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: NJavier González <javier@cnexlabs.com> Signed-off-by: NMatias Bjørling <m@bjorling.me> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Hans Holmberg 提交于
Move completion of syncs and clearing of flush points to the write completion path - this ensures that the data has been comitted to the media before completing bios containing syncs. Signed-off-by: NHans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: NJavier González <javier@cnexlabs.com> Signed-off-by: NMatias Bjørling <m@bjorling.me> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Hans Holmberg 提交于
Sync point is a really confusing name for keeping track of the last entry that needs to be flushed so change the name to to flush_point instead. Signed-off-by: NHans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: NJavier González <javier@cnexlabs.com> Signed-off-by: NMatias Bjørling <m@bjorling.me> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Hans Holmberg 提交于
Currently pblk_recov_get_lba list does two separate things: it checks the consistency of the emeta and extracts the lba list. This patch separates the consistency check to make the code easier to read and to prepare for version checks of the line emeta persistent data format version. Signed-off-by: NHans Holmberg <hans.holmberg@cnexlabs.com> Signed-off-by: NJavier González <javier@cnexlabs.com> Signed-off-by: NMatias Bjørling <m@bjorling.me> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Javier González 提交于
Signed-off-by: NJavier González <javier@cnexlabs.com> Signed-off-by: NMatias Bjørling <m@bjorling.me> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Javier González 提交于
Through time, we have generated some redundant helper functions. Refactor them to eliminate redundant and unnecessary code. Also, reorder them to improve readability Signed-off-by: NJavier González <javier@cnexlabs.com> Signed-off-by: NMatias Bjørling <m@bjorling.me> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Javier González 提交于
Until now, target unique naming is only guaranteed per device. This is ok from a lightnvm perspective, but not from a sysfs one, since groups will collide regardless of the underlying device. Check that names are unique across all lightnvm-capable devices. Signed-off-by: NJavier González <javier@cnexlabs.com> Signed-off-by: NMatias Bjørling <m@bjorling.me> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Javier González 提交于
Refactor target type lookup to use/not use locks explicitly instead of using a hidden parameter to make the function locking. Signed-off-by: NJavier González <javier@cnexlabs.com> Signed-off-by: NMatias Bjørling <m@bjorling.me> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Matias Bjørling 提交于
Prepare for the 2.0 revision by adapting the geometry structures to coexist with the 1.2 revision. Signed-off-by: NMatias Bjørling <m@bjorling.me> Reviewed-by: NJavier González <javier@cnexlabs.com> Signed-off-by: NMatias Bjørling <m@bjorling.me> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Matias Bjørling 提交于
The lower page table is unused. All page tables reported by 1.2 devices are all reporting a sequential 1:1 page mapping. This is also not used going forward with the 2.0 revision. Signed-off-by: NMatias Bjørling <m@bjorling.me> Reviewed-by: NJavier González <javier@cnexlabs.com> Signed-off-by: NMatias Bjørling <m@bjorling.me> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Javier González 提交于
Remove the wait filed in nvm_rq. It is not used anymore, as targets rely on the functionality provided by the LightNVM subsystem when sending sync I/O. Signed-off-by: NJavier González <javier@cnexlabs.com> Signed-off-by: NMatias Bjørling <m@bjorling.me> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Matias Bjørling 提交于
Now that rrpc have been removed. Also remove the hybrid 1.2 support from the core. Signed-off-by: NMatias Bjørling <m@bjorling.me> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Matias Bjørling 提交于
Now that rrpc has been removed, the only users of the ppa helpers is pblk. However, pblk already defines similar functions. Switch pblk to use the internal ones, and remove the generic ppa helpers. Signed-off-by: NMatias Bjørling <m@bjorling.me> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Matias Bjørling 提交于
The hybrid mode for 1.2 revision was deprecated, and have no users. Remove to make it easier to move to the 2.0 revision. Signed-off-by: NMatias Bjørling <m@bjorling.me> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Matias Bjørling 提交于
With rrpc to be removed, the null_blk lightnvm support is no longer functional. Remove the lightnvm implementation and maybe add it to another module in the future if someone takes on the challenge. Signed-off-by: NMatias Bjørling <m@bjorling.me> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Liu Bo 提交于
Commit de148297 ("blk-mq: introduce .get_budget and .put_budget in blk_mq_ops") changes the function to return bool type, and then commit 1f460b63 ("blk-mq: don't restart queue when .get_budget returns BLK_STS_RESOURCE") changes it back to void, but the comment remains. Signed-off-by: NLiu Bo <bo.li.liu@oracle.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 23 12月, 2017 1 次提交
-
-
由 Jens Axboe 提交于
Even with a number of waitqueues, we can get into a situation where we are heavily contended on the waitqueue lock. I got a report on spc1 where we're spending seconds doing this. Arguably the use case is nasty, I reproduce it with one device and 1000 threads banging on the device. But that doesn't mean we shouldn't be handling it better. What ends up happening is that a thread will fail to get a tag, add itself to the waitqueue, and subsequently get woken up when a tag is freed - only to find itself going back to sleep on the waitqueue. Instead of waking all threads, use an exclusive wait and wake up our sbitmap batch count instead. This seems to work well for me (massive improvement for this use case), and it survives basic testing. But I haven't fully verified it yet. An additional improvement is running the queue and checking for a new tag BEFORE needing to add ourselves to the waitqueue. Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 18 12月, 2017 2 次提交
-
-
由 Linus Torvalds 提交于
-
由 Kees Cook 提交于
This reverts commit 04e35f44. SELinux runs with secureexec for all non-"noatsecure" domain transitions, which means lots of processes end up hitting the stack hard-limit change that was introduced in order to fix a race with prlimit(). That race fix will need to be redesigned. Reported-by: NLaura Abbott <labbott@redhat.com> Reported-by: NTomáš Trnka <trnka@scm.com> Cc: stable@vger.kernel.org Signed-off-by: NKees Cook <keescook@chromium.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-