- 09 July 2018, 9 commits
-
-
By Josef Bacik

blkcg-qos is going to do essentially what wbt does, only on a cgroup basis. Break out the common code that will be shared between blkcg-qos and wbt into blk-rq-qos.* so they can both utilize the same infrastructure.

Signed-off-by: Josef Bacik <jbacik@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Ming Lei

It isn't efficient to dequeue requests one by one from the sw queue, but we have to do that when the queue is busy for better merge performance. This patch uses an Exponential Weighted Moving Average (EWMA) to figure out if the queue is busy, and only dequeues requests one by one from the sw queue when it is.

Fixes: b347689f ("blk-mq-sched: improve dispatching from sw queue")
Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Reported-by: Kashyap Desai <kashyap.desai@broadcom.com>
Tested-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
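To illustrate the idea, here is a minimal, runnable userspace C sketch of an EWMA busy indicator; the constants and names are invented for the example and are not the kernel's exact code:

    #include <stdbool.h>
    #include <stdio.h>

    /* Illustrative constants, assumed for this sketch. */
    #define EWMA_WEIGHT  8   /* old value keeps 7/8 of its weight */
    #define EWMA_FACTOR  4   /* fixed-point shift so the average keeps precision */

    static unsigned int dispatch_busy;  /* non-zero means "treat queue as busy" */

    /* Fold one dispatch result into the running average. */
    static void update_dispatch_busy(bool busy)
    {
        unsigned int ewma = dispatch_busy;

        ewma *= EWMA_WEIGHT - 1;
        if (busy)
            ewma += 1 << EWMA_FACTOR;
        ewma /= EWMA_WEIGHT;

        dispatch_busy = ewma;
    }

    int main(void)
    {
        /* A burst of busy dispatches raises the average ... */
        for (int i = 0; i < 8; i++)
            update_dispatch_busy(true);
        printf("after busy burst: %u\n", dispatch_busy);

        /* ... and idle dispatches decay it back toward zero. */
        for (int i = 0; i < 8; i++)
            update_dispatch_busy(false);
        printf("after idle burst: %u\n", dispatch_busy);
        return 0;
    }

The point of the averaging is hysteresis: one busy dispatch doesn't flip the queue into one-by-one mode, and one idle dispatch doesn't flip it back.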
-
By Ming Lei

list_splice_tail_init() is much faster than inserting each request one by one, given that all requests in 'list' belong to the same sw queue and ctx->lock is required to insert requests.

Cc: Laurence Oberman <loberman@redhat.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Tested-by: Kashyap Desai <kashyap.desai@broadcom.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
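The win is that a splice moves the whole batch with a constant number of pointer updates under a single ctx->lock round trip, instead of one locked list_add_tail() per request. A self-contained userspace model of the list operation (a stand-in for the kernel's <linux/list.h>, not the kernel code itself):

    #include <stddef.h>
    #include <stdio.h>

    struct list_head { struct list_head *next, *prev; };

    static void INIT_LIST_HEAD(struct list_head *h) { h->next = h->prev = h; }

    static void list_add_tail(struct list_head *n, struct list_head *head)
    {
        n->prev = head->prev;
        n->next = head;
        head->prev->next = n;
        head->prev = n;
    }

    /* Move all entries of 'list' to the tail of 'head' and re-init 'list':
     * four pointer stores, no matter how many requests are on 'list'. */
    static void list_splice_tail_init(struct list_head *list, struct list_head *head)
    {
        if (list->next == list)        /* nothing to splice */
            return;

        struct list_head *first = list->next;
        struct list_head *last = list->prev;

        first->prev = head->prev;
        head->prev->next = first;
        last->next = head;
        head->prev = last;
        INIT_LIST_HEAD(list);
    }

    struct req { int tag; struct list_head node; };

    int main(void)
    {
        struct list_head staged, rq_list;
        struct req reqs[3] = { { .tag = 1 }, { .tag = 2 }, { .tag = 3 } };

        INIT_LIST_HEAD(&staged);
        INIT_LIST_HEAD(&rq_list);
        for (int i = 0; i < 3; i++)
            list_add_tail(&reqs[i].node, &staged);

        /* under ctx->lock in the kernel: one splice instead of three inserts */
        list_splice_tail_init(&staged, &rq_list);

        for (struct list_head *p = rq_list.next; p != &rq_list; p = p->next) {
            struct req *rq = (struct req *)((char *)p - offsetof(struct req, node));
            printf("rq tag %d\n", rq->tag);
        }
        return 0;
    }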
-
By Minwoo Im

Fix a typo in the blk_mq_alloc_tag_set() comment: "if if it too large" -> "if it's too large".

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Minwoo Im

set->mq_map is currently cleared if something goes wrong when establishing a queue map in blk-mq-pci.c. It's also cleared before updating a queue map in blk_mq_update_queue_map(). This patch provides a helper to clear set->mq_map so the intent is explicit.

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
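A sketch of such a helper, in kernel style (treat the exact name and body as an approximation of the patch rather than a verbatim quote):

    static inline void blk_mq_clear_mq_map(struct blk_mq_tag_set *set)
    {
        unsigned int cpu;

        /* map every possible CPU back to hw queue 0 */
        for_each_possible_cpu(cpu)
            set->mq_map[cpu] = 0;
    }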
-
By Ming Lei

We have to remove synchronize_rcu() from blk_queue_cleanup(), otherwise long delays can occur during LUN probing. To remove it, we have to avoid iterating set->tag_list in the IO path, e.g. in blk_mq_sched_restart().

This patch reverts 5b79413946d (Revert "blk-mq: don't handle TAG_SHARED in restart"). Given we have fixed enough IO hang issues, there isn't any reason to restart all queues sharing one tag set any more, for the following reasons:

1) The blk-mq core can deal with the shared-tags case well via blk_mq_get_driver_tag(), which can wake up queues waiting for a driver tag.

2) SCSI is a bit special because it may return BLK_STS_RESOURCE if the queue, target or host isn't ready, but SCSI's built-in restart can cover all these well; see scsi_end_request(): the queue will be rerun after any request initiated from this host/target is completed.

In my test on scsi_debug (8 LUNs), this patch may improve IOPS by 20% ~ 30% when running I/O on these 8 LUNs concurrently.

Fixes: 705cda97 ("blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list")
Cc: Omar Sandoval <osandov@fb.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Martin K. Petersen <martin.petersen@oracle.com>
Cc: linux-scsi@vger.kernel.org
Reported-by: Andrew Jones <drjones@redhat.com>
Tested-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Ming Lei

Now hctx->lock is only acquired when adding hctx->dispatch_wait to a wait queue, but not held when removing it from the wait queue. An IO hang can be observed easily if SCHED RESTART is disabled; that means RESTART now exists just to fix the issue in blk_mq_mark_tag_wait().

This patch fixes the issue by introducing hctx->dispatch_wait_lock and holding it when removing hctx->dispatch_wait in blk_mq_dispatch_wake(), since we need to avoid acquiring hctx->lock in irq context.

Fixes: eb619fdb ("blk-mq: fix issue with shared tag queue re-running")
Cc: Christoph Hellwig <hch@lst.de>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Tested-by: Andrew Jones <drjones@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
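A kernel-style sketch of the resulting wake path (simplified, not the verbatim patch): the new dispatch_wait_lock protects removal from the wait queue in a context where hctx->lock cannot be used:

    static int blk_mq_dispatch_wake(wait_queue_entry_t *wait, unsigned mode,
                                    int flags, void *key)
    {
        struct blk_mq_hw_ctx *hctx =
            container_of(wait, struct blk_mq_hw_ctx, dispatch_wait);

        /* runs in irq context, so hctx->lock must not be taken here */
        spin_lock(&hctx->dispatch_wait_lock);
        list_del_init(&wait->entry);
        spin_unlock(&hctx->dispatch_wait_lock);

        blk_mq_run_hw_queue(hctx, true);
        return 1;
    }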
-
By Ming Lei

'hctx' won't be changed at all, so it's not necessary to pass '**hctx' to blk_mq_mark_tag_wait().

Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Tested-by: Andrew Jones <drjones@redhat.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Ming Lei

We never pass 'wait' as true to blk_mq_get_driver_tag(), and hence we never change '**hctx' either. The last use of these went away with the flush cleanup, commit 0c2a6fe4. So clean up the usage and remove the two extra parameters.

Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Tested-by: Andrew Jones <drjones@redhat.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 29 June 2018, 1 commit
-
-
By Jens Axboe

Some devices have different queue limits depending on the type of IO. A classic case is SATA NCQ, where some commands can queue, but others cannot. If we have NCQ commands inflight and encounter a non-queueable command, the driver returns busy. Currently we attempt to dispatch more from the scheduler if we were able to queue some commands. But for the case where we ended up stopping due to BUSY, we should not attempt to retrieve more from the scheduler. If we do, we can get into a situation where we attempt to queue a non-queueable command, get BUSY, then successfully retrieve more commands from the scheduler and queue those. This can repeat forever, starving the non-queueable command indefinitely.

Fix this by NOT attempting to pull more commands from the scheduler if we get a BUSY return. This should also be more optimal in terms of letting requests stay in the scheduler for as long as possible, if we get a BUSY due to the regular out-of-tags condition.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 24 June 2018, 1 commit
-
-
By Bart Van Assche

Make sure that RQF_TIMED_OUT is cleared when a request is reused after a block driver timeout handler has returned BLK_EH_DONE.

Fixes: da661267 ("blk-mq: don't time out requests again that are in the timeout handler")
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Andrew Randrianasulu <randrianasulu@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 14 June 2018, 1 commit
-
-
By Christoph Hellwig

We can currently call the timeout handler again on a request that has already been handed over to the timeout handler. Prevent that with a new flag.

Fixes: 12f5b931 ("blk-mq: Remove generation seqeunce")
Reported-by: Andrew Randrianasulu <randrianasulu@gmail.com>
Tested-by: Andrew Randrianasulu <randrianasulu@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
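Conceptually, the flag makes the expiry check idempotent; a simplified sketch of the gate (approximate names, not the verbatim patch):

    static bool blk_mq_req_expired(struct request *rq, unsigned long now)
    {
        /* already handed over to the timeout handler once: don't repeat */
        if (rq->rq_flags & RQF_TIMED_OUT)
            return false;

        return time_after_eq(now, blk_rq_deadline(rq));
    }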
-
- 13 June 2018, 1 commit
-
-
By Kees Cook

The kzalloc_node() function has a 2-factor argument form, kcalloc_node(). This patch replaces cases of:

    kzalloc_node(a * b, gfp, node)

with:

    kcalloc_node(a * b, gfp, node)

as well as handling cases of:

    kzalloc_node(a * b * c, gfp, node)

with:

    kzalloc_node(array3_size(a, b, c), gfp, node)

as it's slightly less ugly than:

    kcalloc_node(array_size(a, b), c, gfp, node)

This does, however, attempt to ignore constant size factors like:

    kzalloc_node(4 * 1024, gfp, node)

though any constants defined via macros get caught up in the conversion.

Any factors with a sizeof() of "unsigned char", "char", and "u8" were dropped, since they're redundant.

The Coccinelle script used for this was:

// Fix redundant parens around sizeof(). @@ type TYPE; expression THING, E; @@ ( kzalloc_node( - (sizeof(TYPE)) * E + sizeof(TYPE) * E , ...) | kzalloc_node( - (sizeof(THING)) * E + sizeof(THING) * E , ...) )
// Drop single-byte sizes and redundant parens. @@ expression COUNT; typedef u8; typedef __u8; @@ ( kzalloc_node( - sizeof(u8) * (COUNT) + COUNT , ...) | kzalloc_node( - sizeof(__u8) * (COUNT) + COUNT , ...) | kzalloc_node( - sizeof(char) * (COUNT) + COUNT , ...) | kzalloc_node( - sizeof(unsigned char) * (COUNT) + COUNT , ...) | kzalloc_node( - sizeof(u8) * COUNT + COUNT , ...) | kzalloc_node( - sizeof(__u8) * COUNT + COUNT , ...) | kzalloc_node( - sizeof(char) * COUNT + COUNT , ...) | kzalloc_node( - sizeof(unsigned char) * COUNT + COUNT , ...) )
// 2-factor product with sizeof(type/expression) and identifier or constant. @@ type TYPE; expression THING; identifier COUNT_ID; constant COUNT_CONST; @@ ( - kzalloc_node + kcalloc_node ( - sizeof(TYPE) * (COUNT_ID) + COUNT_ID, sizeof(TYPE) , ...) | - kzalloc_node + kcalloc_node ( - sizeof(TYPE) * COUNT_ID + COUNT_ID, sizeof(TYPE) , ...) | - kzalloc_node + kcalloc_node ( - sizeof(TYPE) * (COUNT_CONST) + COUNT_CONST, sizeof(TYPE) , ...) | - kzalloc_node + kcalloc_node ( - sizeof(TYPE) * COUNT_CONST + COUNT_CONST, sizeof(TYPE) , ...) | - kzalloc_node + kcalloc_node ( - sizeof(THING) * (COUNT_ID) + COUNT_ID, sizeof(THING) , ...) | - kzalloc_node + kcalloc_node ( - sizeof(THING) * COUNT_ID + COUNT_ID, sizeof(THING) , ...) | - kzalloc_node + kcalloc_node ( - sizeof(THING) * (COUNT_CONST) + COUNT_CONST, sizeof(THING) , ...) | - kzalloc_node + kcalloc_node ( - sizeof(THING) * COUNT_CONST + COUNT_CONST, sizeof(THING) , ...) )
// 2-factor product, only identifiers. @@ identifier SIZE, COUNT; @@ - kzalloc_node + kcalloc_node ( - SIZE * COUNT + COUNT, SIZE , ...)
// 3-factor product with 1 sizeof(type) or sizeof(expression), with
// redundant parens removed. @@ expression THING; identifier STRIDE, COUNT; type TYPE; @@ ( kzalloc_node( - sizeof(TYPE) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc_node( - sizeof(TYPE) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc_node( - sizeof(TYPE) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc_node( - sizeof(TYPE) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(TYPE)) , ...) | kzalloc_node( - sizeof(THING) * (COUNT) * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kzalloc_node( - sizeof(THING) * (COUNT) * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kzalloc_node( - sizeof(THING) * COUNT * (STRIDE) + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) | kzalloc_node( - sizeof(THING) * COUNT * STRIDE + array3_size(COUNT, STRIDE, sizeof(THING)) , ...) )
// 3-factor product with 2 sizeof(variable), with redundant parens removed. @@ expression THING1, THING2; identifier COUNT; type TYPE1, TYPE2; @@ ( kzalloc_node( - sizeof(TYPE1) * sizeof(TYPE2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | kzalloc_node( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(TYPE2)) , ...) | kzalloc_node( - sizeof(THING1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | kzalloc_node( - sizeof(THING1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(THING1), sizeof(THING2)) , ...) | kzalloc_node( - sizeof(TYPE1) * sizeof(THING2) * COUNT + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) | kzalloc_node( - sizeof(TYPE1) * sizeof(THING2) * (COUNT) + array3_size(COUNT, sizeof(TYPE1), sizeof(THING2)) , ...) )
// 3-factor product, only identifiers, with redundant parens removed. @@ identifier STRIDE, SIZE, COUNT; @@ ( kzalloc_node( - (COUNT) * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc_node( - COUNT * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc_node( - COUNT * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc_node( - (COUNT) * (STRIDE) * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc_node( - COUNT * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc_node( - (COUNT) * STRIDE * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc_node( - (COUNT) * (STRIDE) * (SIZE) + array3_size(COUNT, STRIDE, SIZE) , ...) | kzalloc_node( - COUNT * STRIDE * SIZE + array3_size(COUNT, STRIDE, SIZE) , ...) )
// Any remaining multi-factor products, first at least 3-factor products,
// when they're not all constants... @@ expression E1, E2, E3; constant C1, C2, C3; @@ ( kzalloc_node(C1 * C2 * C3, ...) | kzalloc_node( - (E1) * E2 * E3 + array3_size(E1, E2, E3) , ...) | kzalloc_node( - (E1) * (E2) * E3 + array3_size(E1, E2, E3) , ...) | kzalloc_node( - (E1) * (E2) * (E3) + array3_size(E1, E2, E3) , ...) | kzalloc_node( - E1 * E2 * E3 + array3_size(E1, E2, E3) , ...) )
// And then all remaining 2 factors products when they're not all constants,
// keeping sizeof() as the second factor argument. @@ expression THING, E1, E2; type TYPE; constant C1, C2, C3; @@ ( kzalloc_node(sizeof(THING) * C2, ...) | kzalloc_node(sizeof(TYPE) * C2, ...) | kzalloc_node(C1 * C2 * C3, ...) | kzalloc_node(C1 * C2, ...) | - kzalloc_node + kcalloc_node ( - sizeof(TYPE) * (E2) + E2, sizeof(TYPE) , ...) | - kzalloc_node + kcalloc_node ( - sizeof(TYPE) * E2 + E2, sizeof(TYPE) , ...) | - kzalloc_node + kcalloc_node ( - sizeof(THING) * (E2) + E2, sizeof(THING) , ...) | - kzalloc_node + kcalloc_node ( - sizeof(THING) * E2 + E2, sizeof(THING) , ...) | - kzalloc_node + kcalloc_node ( - (E1) * E2 + E1, E2 , ...) | - kzalloc_node + kcalloc_node ( - (E1) * (E2) + E1, E2 , ...) | - kzalloc_node + kcalloc_node ( - E1 * E2 + E1, E2 , ...) )

Signed-off-by: Kees Cook <keescook@chromium.org>
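The point of the 2-factor form is overflow checking: kcalloc_node(), like userspace calloc() (with glibc), fails cleanly when count * size would overflow, while an open-coded multiplication silently wraps and under-allocates. A runnable userspace illustration:

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        size_t n = SIZE_MAX / 2 + 2;   /* n * 2 wraps around to a tiny value */

        /* open-coded multiplication wraps, so an allocator handed
         * 'wrapped' would "succeed" with a far-too-small buffer */
        size_t wrapped = n * 2;
        printf("n * 2 wraps to %zu\n", wrapped);

        /* calloc checks the multiplication and refuses */
        void *p = calloc(n, 2);
        printf("calloc(n, 2) -> %p\n", p);   /* NULL on overflow */
        free(p);
        return 0;
    }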
-
- 11 June 2018, 1 commit
-
-
By Roman Pen

It is not allowed to reinit the q->tag_set_list list entry while an RCU grace period has not completed yet, otherwise the following soft lockup in blk_mq_sched_restart() happens:

[ 1064.252652] watchdog: BUG: soft lockup - CPU#12 stuck for 23s! [fio:9270]
[ 1064.254445] task: ffff99b912e8b900 task.stack: ffffa6d54c758000
[ 1064.254613] RIP: 0010:blk_mq_sched_restart+0x96/0x150
[ 1064.256510] Call Trace:
[ 1064.256664]  <IRQ>
[ 1064.256824]  blk_mq_free_request+0xea/0x100
[ 1064.256987]  msg_io_conf+0x59/0xd0 [ibnbd_client]
[ 1064.257175]  complete_rdma_req+0xf2/0x230 [ibtrs_client]
[ 1064.257340]  ? ibtrs_post_recv_empty+0x4d/0x70 [ibtrs_core]
[ 1064.257502]  ibtrs_clt_rdma_done+0xd1/0x1e0 [ibtrs_client]
[ 1064.257669]  ib_create_qp+0x321/0x380 [ib_core]
[ 1064.257841]  ib_process_cq_direct+0xbd/0x120 [ib_core]
[ 1064.258007]  irq_poll_softirq+0xb7/0xe0
[ 1064.258165]  __do_softirq+0x106/0x2a2
[ 1064.258328]  irq_exit+0x92/0xa0
[ 1064.258509]  do_IRQ+0x4a/0xd0
[ 1064.258660]  common_interrupt+0x7a/0x7a
[ 1064.258818]  </IRQ>

Meanwhile another context frees another queue, but with the same set of shared tags:

[ 1288.201183] INFO: task bash:5910 blocked for more than 180 seconds.
[ 1288.201833] bash D 0 5910 5820 0x00000000
[ 1288.202016] Call Trace:
[ 1288.202315]  schedule+0x32/0x80
[ 1288.202462]  schedule_timeout+0x1e5/0x380
[ 1288.203838]  wait_for_completion+0xb0/0x120
[ 1288.204137]  __wait_rcu_gp+0x125/0x160
[ 1288.204287]  synchronize_sched+0x6e/0x80
[ 1288.204770]  blk_mq_free_queue+0x74/0xe0
[ 1288.204922]  blk_cleanup_queue+0xc7/0x110
[ 1288.205073]  ibnbd_clt_unmap_device+0x1bc/0x280 [ibnbd_client]
[ 1288.205389]  ibnbd_clt_unmap_dev_store+0x169/0x1f0 [ibnbd_client]
[ 1288.205548]  kernfs_fop_write+0x109/0x180
[ 1288.206328]  vfs_write+0xb3/0x1a0
[ 1288.206476]  SyS_write+0x52/0xc0
[ 1288.206624]  do_syscall_64+0x68/0x1d0
[ 1288.206774]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2

What happened is the following:

1. There are several MQ queues with shared tags.
2. One queue is about to be freed and the task is now in blk_mq_del_queue_tag_set().
3. Another CPU is in blk_mq_sched_restart() and loops over all queues in the tag list in order to find an hctx to restart.

Because the linked list entry was modified in blk_mq_del_queue_tag_set() without properly waiting for a grace period, blk_mq_sched_restart() never ends, spinning in list_for_each_entry_rcu_rr(), thus the soft lockup.

The fix is simple: reinit the list entry only after an RCU grace period has elapsed.

Fixes: 705cda97 ("blk-mq: Make it safe to use RCU to iterate over blk_mq_tag_set.tag_list")
Cc: stable@vger.kernel.org
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: linux-block@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
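The shape of the fix, as a kernel-style sketch (simplified; the shared-tag bookkeeping in the real function is omitted):

    static void blk_mq_del_queue_tag_set(struct request_queue *q)
    {
        struct blk_mq_tag_set *set = q->tag_set;

        mutex_lock(&set->tag_list_lock);
        list_del_rcu(&q->tag_set_list);
        mutex_unlock(&set->tag_list_lock);

        /* wait for list_for_each_entry_rcu_rr() readers to drain ... */
        synchronize_rcu();
        /* ... and only then is it safe to reinit the entry */
        INIT_LIST_HEAD(&q->tag_set_list);
    }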
-
- 05 June 2018, 1 commit
-
-
By Jianchao Wang

If a hardware queue is stopped, it should not be run again before being explicitly started. Ignore stopped queues in blk_mq_run_work_fn(), fixing a regression recently introduced when the START_ON_RUN bit was removed.

Fixes: 15fe8a90 ("blk-mq: remove blk_mq_delay_queue()")
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
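A simplified sketch of the check (kernel style, not the verbatim patch):

    static void blk_mq_run_work_fn(struct work_struct *work)
    {
        struct blk_mq_hw_ctx *hctx =
            container_of(work, struct blk_mq_hw_ctx, run_work.work);

        /* the queue may have been stopped after this work was queued;
         * a stopped hctx must stay quiet until explicitly restarted */
        if (blk_mq_hctx_stopped(hctx))
            return;

        __blk_mq_run_hw_queue(hctx);
    }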
-
- 01 June 2018, 2 commits
-
-
By Christoph Hellwig

There is almost no shared logic, which leads to a very confusing code flow.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Christoph Hellwig

Both callers take the lock just around the function call, so move it into the function. Also remove the now pointless blk_mq_sched_init wrapper.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Damien Le Moal <damien.lemoal@wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 29 May 2018, 5 commits
-
-
By Christoph Hellwig

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Christoph Hellwig

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Christoph Hellwig

The name BLK_EH_NOT_HANDLED implies nothing happened, but very often that is not what is happening: the driver has already completed the command. Fix the symbolic name to reflect that a little better.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Keith Busch

This patch simplifies timeout handling by relying on the request reference counting to ensure the iterator is operating on an inflight and truly timed-out request. Since the reference counting prevents the tag from being reallocated, the block layer no longer needs to prevent drivers from completing their requests while the timeout handler is operating on them: a driver completing a request is allowed to proceed to the next state without additional synchronization with the block layer.

This also removes any need for generation sequence numbers, since the request lifetime is protected from being reallocated as a new sequence while timeout handling is operating on it.

To enable this, a refcount is added to struct request so that request users can be sure they're operating on the same request without it changing while they're processing it. The request's tag won't be released for reuse until both the timeout handler and the completion are done with it.

Signed-off-by: Keith Busch <keith.busch@intel.com>
[hch: slight cleanups, added back submission side hctx lock, use cmpxchg for completions]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
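The core idea in a runnable userspace sketch (the struct and names here are a toy, not the kernel's): a reference count keeps the request alive for whichever of the two parties, normal completion or the timeout handler, finishes last, so neither has to synchronize with the other before the tag is reused:

    #include <stdatomic.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct request {
        atomic_int ref;
        int tag;
    };

    /* Called by both the completion path and the timeout handler when
     * they are done with the request; the last one to drop the
     * reference releases the tag. */
    static void req_put(struct request *rq)
    {
        if (atomic_fetch_sub(&rq->ref, 1) == 1) {
            printf("tag %d released for reuse\n", rq->tag);
            free(rq);
        }
    }

    int main(void)
    {
        struct request *rq = malloc(sizeof(*rq));
        rq->tag = 42;
        /* one reference for the normal I/O lifetime ... */
        atomic_init(&rq->ref, 1);
        /* ... plus one taken by the timeout handler while it inspects
         * the (possibly concurrently completing) request */
        atomic_fetch_add(&rq->ref, 1);

        req_put(rq);  /* completion finishes first: tag NOT yet released */
        req_put(rq);  /* timeout handler done: now the tag can be reused */
        return 0;
    }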
-
By Keith Busch

The block layer had been setting the state to in-flight prior to updating the timer. This is the wrong order, since the timeout handler could observe the in-flight state with the older timeout, believing the request had expired when in fact it is just getting started.

Signed-off-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
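In code terms the fix is just an ordering change in the request start path; a sketch (function names approximate the then-current blk-mq code):

    /* before (racy): publish in-flight state, then arm the timer --
     * the timeout handler may see the in-flight state paired with a
     * stale deadline and fire immediately */

    /* after: arm the timer first, then publish the in-flight state */
    blk_add_timer(rq);
    blk_mq_rq_update_state(rq, MQ_RQ_IN_FLIGHT);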
-
- 22 May 2018, 1 commit
-
-
By huhai

When dispatch_rq_from_ctx is called, in the vast majority of cases the ctx->rq_list is not empty.

Signed-off-by: huhai <huhai@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 18 May 2018, 1 commit
-
-
By huhai

When the number of hardware queues is changed, the drivers will call blk_mq_update_nr_hw_queues() to remap hardware queues. This changes the ctx mappings, but the current code doesn't clear the ->dispatch_from hint. This can result in dispatch_from pointing to a ctx that isn't mapped to the hctx any more.

Fixes: b347689f ("blk-mq-sched: improve dispatching from sw queue")
Signed-off-by: huhai <huhai@kylinos.cn>
Reviewed-by: Ming Lei <ming.lei@redhat.com>

Moved the placement of the clearing to where we clear other items pertaining to the existing mapping, added Fixes line, and reworded the commit message.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 16 May 2018, 1 commit
-
-
By huhai

We can use blk_mq_sched_insert_request() even if we don't have an IO scheduler attached, since that case will end up being exactly the same as what blk_mq_queue_io() does now.

Signed-off-by: huhai <huhai@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 11 May 2018, 1 commit
-
-
By Jens Axboe

It's not useful; they are internal and/or error-handling recovery commands.

Acked-by: Paolo Valente <paolo.valente@linaro.org>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 09 May 2018, 4 commits
-
-
By Omar Sandoval

Currently, struct request has four timestamp fields:

- A start time, set at get_request time, in jiffies, used for iostats
- An I/O start time, set at start_request time, in ktime nanoseconds, used for blk-stats (i.e., wbt, kyber, hybrid polling)
- Another start time and another I/O start time, used for cfq and bfq

These can all be consolidated into one start time and one I/O start time, both in ktime nanoseconds, shaving off up to 16 bytes from struct request depending on the kernel config.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Omar Sandoval

We want this next to blk_account_io_done() for the next change, so that we can call ktime_get() only once for both.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Omar Sandoval

struct blk_issue_stat squashes three things into one u64:

- The time the driver started working on a request
- The original size of the request (for the io.low controller)
- Flags for writeback throttling

It turns out that on x86_64, we have a 4-byte hole in struct request which we can fill with the non-timestamp fields from blk_issue_stat, simplifying things quite a bit.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
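The "hole" here is ordinary alignment padding; a runnable userspace illustration of the same trick (field names invented for the demo, sizes shown for a typical 64-bit ABI):

    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* A 4-byte hole appears between 'flags' and 'start_ns' because the
     * 64-bit member must be 8-byte aligned. */
    struct padded {
        uint32_t flags;
        /* 4 bytes of invisible padding here */
        uint64_t start_ns;
    };

    /* Filling the hole with a real member costs no extra space. */
    struct packed_by_hand {
        uint32_t flags;
        uint32_t io_size;   /* lives in what used to be padding */
        uint64_t start_ns;
    };

    int main(void)
    {
        printf("padded:         %zu bytes\n", sizeof(struct padded));          /* 16 */
        printf("packed_by_hand: %zu bytes\n", sizeof(struct packed_by_hand));  /* still 16 */
        printf("hole before start_ns: %zu bytes\n",
               offsetof(struct padded, start_ns) - sizeof(uint32_t));          /* 4 */
        return 0;
    }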
-
By Omar Sandoval

issue_stat is going to go away, so first make writeback throttling take the containing request, update the internal wbt helpers accordingly, and change rwb->sync_cookie to be the request pointer instead of the issue_stat pointer. No functional change.

Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 26 April 2018, 2 commits
-
-
By Omar Sandoval

When the blk-mq inflight implementation was added, /proc/diskstats was converted to use it, but /sys/block/$dev/inflight was not. Fix it by adding another helper to count in-flight requests by data direction.

Fixes: f299b7c7 ("blk-mq: provide internal in-flight variant")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Omar Sandoval

In the legacy block case, we increment the counter right after we allocate the request, not when the driver handles it. In both the legacy and blk-mq cases, part_inc_in_flight() is called from blk_account_io_start() right after we've allocated the request. blk-mq only considers started requests as inflight, but this is inconsistent with the legacy definition and with the intention in the code. This removes the started condition and instead counts all allocated requests.

Fixes: f299b7c7 ("blk-mq: provide internal in-flight variant")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 25 April 2018, 1 commit
-
-
By Ming Lei

This reverts commit 37c7c6c7. Turns out some drivers (mostly FC drivers) may not use managed IRQ affinity and have their own customized .map_queues, so keep this code to avoid regressions.

Reported-by: Laurence Oberman <loberman@redhat.com>
Tested-by: Laurence Oberman <loberman@redhat.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Tested-by: Stefan Haberland <sth@linux.vnet.ibm.com>
Cc: Ewan Milne <emilne@redhat.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 17 April 2018, 1 commit
-
-
By Jianchao Wang

rq->gstate and rq->aborted_gstate are both zero before rqs are allocated. If we have a small timeout, when the timer fires there could be rqs that have never been allocated, and also rqs that have been allocated but not yet initialized and started. At that moment, rq->gstate and rq->aborted_gstate are both 0, so blk_mq_terminate_expired() will decide the rq has timed out and invoke .timeout early.

For SCSI, this causes scsi_times_out() to be invoked before the scsi_cmnd is initialized; scsi_cmnd->device is still NULL at that point, and we get a crash.

Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Martin Steigerwald <Martin@Lichtvoll.de>
Cc: stable@vger.kernel.org
Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 10 April 2018, 6 commits
-
-
By Ming Lei

Firstly, since commit 4b855ad3 ("blk-mq: Create hctx for each present CPU"), blk-mq doesn't remap queues any more after the CPU topology is changed.

Secondly, set->nr_hw_queues can't be bigger than nr_cpu_ids, and now we map all possible CPUs to hw queues, so at least one CPU is mapped to each hctx.

So the queue mapping has become static and fixed, just like a percpu variable, and we don't need to handle queue remapping any more.

Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Ming Lei

There are several reasons for removing the check:

1) blk_mq_hw_queue_mapped() always returns true now, since each hctx is mapped by at least one CPU

2) when there isn't any online CPU mapped to this hctx, there won't be any IO queued to this CPU; blk_mq_run_hw_queue() only runs the queue if there is IO queued to this hctx

3) if __blk_mq_delay_run_hw_queue() is called by blk_mq_delay_run_hw_queue(), which is run from blk_mq_dispatch_rq_list() or scsi_mq_get_budget(), the hctx to be handled has to be mapped

Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Ming Lei

No driver uses this interface any more, so remove it.

Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Ming Lei

This patch introduces the helper blk_mq_hw_queue_first_cpu() for figuring out the hctx's first CPU, avoiding code duplication.

Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
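A kernel-style sketch of such a helper (approximate, not the verbatim patch); the offline fallback ties in with the next_cpu fix at the end of this series:

    static int blk_mq_hw_queue_first_cpu(struct blk_mq_hw_ctx *hctx)
    {
        int cpu;

        /* prefer the first online CPU mapped to this hctx ... */
        cpu = cpumask_first_and(hctx->cpumask, cpu_online_mask);
        /* ... but fall back to the first mapped CPU if all are offline */
        if (cpu >= nr_cpu_ids)
            cpu = cpumask_first(hctx->cpumask);

        return cpu;
    }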
-
By Ming Lei

This patch figures out the final selected CPU and then writes it to hctx->next_cpu once, so that intermediate values of next_cpu are never observed from other dispatch paths.

Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Ming Lei

Since commit 20e4d813 ("blk-mq: simplify queue mapping & schedule with each possisble CPU"), one hctx can be mapped to only offline CPUs, and then hctx->next_cpu can be set wrongly.

This patch fixes the issue by making hctx->next_cpu point to the first CPU in hctx->cpumask if all CPUs in hctx->cpumask are offline.

Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Fixes: 20e4d813 ("blk-mq: simplify queue mapping & schedule with each possisble CPU")
Cc: stable@vger.kernel.org
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-