- 27 Feb, 2020 1 commit
-
-
Committed by John Garry
The struct blk_mq_hw_ctx pointer argument in blk_mq_put_tag(), blk_mq_poll_nsecs(), and blk_mq_poll_hybrid_sleep() is unused, so remove it.

Overall object code size shows a minor reduction.

before:
   text    data     bss     dec     hex filename
  27306    1312       0   28618    6fca block/blk-mq.o
   4303     272       0    4575    11df block/blk-mq-tag.o

after:
  27282    1312       0   28594    6fb2 block/blk-mq.o
   4311     272       0    4583    11e7 block/blk-mq-tag.o

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: John Garry <john.garry@huawei.com>
--
This minor patch had been carried as part of the blk-mq shared tags RFC, I'd rather not carry it anymore as it required rebasing, so now or never..
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 25 Feb, 2020 1 commit
-
-
Committed by Ming Lei
For some reason, a device may be in a state in which it can't handle FS requests, so STS_RESOURCE is always returned and the FS request is added to hctx->dispatch. However, a passthrough request may be required at that time to fix the problem. If the passthrough request is added to the scheduler queue, blk-mq has no chance to dispatch it, given that requests in hctx->dispatch are prioritized. Then the FS IO request may never be completed, and an IO hang results.

So the passthrough request has to be added to hctx->dispatch directly to fix the IO hang.

Fix this issue by inserting passthrough requests into hctx->dispatch directly, together with adding FS requests to the tail of hctx->dispatch in blk_mq_dispatch_rq_list(). Actually, FS requests are added to the tail of hctx->dispatch by default, see blk_mq_request_bypass_insert().

This makes it consistent with the original legacy IO request path, in which passthrough requests are always added to q->queue_head.

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ewan D. Milne <emilne@redhat.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 07 Oct, 2019 1 commit
-
-
Committed by Pavel Begunkov
blk_mq_request_completed() and blk_mq_request_started() are short, so inline them.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 11 Jul, 2019 1 commit
-
-
Committed by Damien Le Moal
Simultaneously writing to a sequential zone of a zoned block device from multiple contexts requires mutual exclusion for BIO issuing to ensure that writes happen sequentially. However, even for a well behaved user correctly implementing such synchronization, BIO plugging may interfere and result in BIOs from the different contexts being reordered if plugging is done outside of the mutual exclusion section, e.g. the plug was started by a function higher in the call chain than the function issuing BIOs.

        Context A                           Context B

   | blk_start_plug()
   | ...
   | seq_write_zone()
     | mutex_lock(zone)
     | bio-0->bi_iter.bi_sector = zone->wp
     | zone->wp += bio_sectors(bio-0)
     | submit_bio(bio-0)
     | bio-1->bi_iter.bi_sector = zone->wp
     | zone->wp += bio_sectors(bio-1)
     | submit_bio(bio-1)
     | mutex_unlock(zone)
     | return
   | -----------------------> | seq_write_zone()
                                | mutex_lock(zone)
                                | bio-2->bi_iter.bi_sector = zone->wp
                                | zone->wp += bio_sectors(bio-2)
                                | submit_bio(bio-2)
                                | mutex_unlock(zone)
   | <------------------------- |
   | blk_finish_plug()

In the above example, despite the mutex synchronization ensuring the correct BIO issuing order 0, 1, 2, context A BIOs 0 and 1 end up being issued after BIO 2 of context B, when the plug is released with blk_finish_plug().

While this problem can be addressed using the blk_flush_plug_list() function (in the above example, the call must be inserted before the zone mutex lock is released), a simple generic solution in the block layer avoids this additional code in all zoned block device user code. The simple generic solution implemented with this patch is to introduce the internal helper function blk_mq_plug() to access the current context plug on BIO submission. This helper returns the current plug only if the target device is not a zoned block device or if the BIO to be plugged is not a write operation. Otherwise, the caller context plug is ignored and NULL is returned, resulting in writes to zoned block devices never being plugged.

Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
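The plug selection rule described in this commit message can be sketched in user-space C. This is only a minimal model under stated assumptions, not the kernel code: the types `sketch_queue` and `sketch_plug`, the enum `sketch_op`, and the function `sketch_mq_plug()` are hypothetical stand-ins, since the commit message does not give the real signature of blk_mq_plug().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel types; not the real definitions. */
enum sketch_op { SKETCH_OP_READ, SKETCH_OP_WRITE };

struct sketch_plug { int nr_plugged; };

struct sketch_queue { bool is_zoned; };

/*
 * Return the caller's plug unless the BIO is a write to a zoned device,
 * in which case plugging is skipped by returning NULL.
 */
static struct sketch_plug *sketch_mq_plug(const struct sketch_queue *q,
                                          enum sketch_op op,
                                          struct sketch_plug *current_plug)
{
    assert(q != NULL);
    if (!q->is_zoned || op != SKETCH_OP_WRITE)
        return current_plug;
    return NULL;
}
```

With this shape, a submission path only has to treat a NULL plug as "issue immediately", so zoned writes keep the order established under the zone mutex.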
-
- 03 Jul, 2019 1 commit
-
-
Committed by Bart Van Assche
No code that occurs between blk_mq_get_ctx() and blk_mq_put_ctx() depends on preemption being disabled for its correctness. Since removing the CPU preemption calls does not measurably affect performance, simplify the blk-mq code by removing the blk_mq_put_ctx() function and also by not disabling preemption in blk_mq_get_ctx().

Cc: Hannes Reinecke <hare@suse.com>
Cc: Omar Sandoval <osandov@fb.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 04 May, 2019 1 commit
-
-
Committed by Ming Lei
Once blk_cleanup_queue() returns, tags shouldn't be used any more, because blk_mq_free_tag_set() may be called. Commit 45a9c9d9 ("blk-mq: Fix a use-after-free") fixes exactly this issue.

However, that commit introduces another issue. Before 45a9c9d9, we were allowed to run the queue during queue cleanup if the queue's kobj refcount was held. After that commit, the queue can't be run during queue cleanup, otherwise an oops can be triggered easily because some fields of the hctx are freed by blk_mq_free_queue() in blk_cleanup_queue().

We have invented ways of addressing this kind of issue before, such as:

8dc765d4 ("SCSI: fix queue cleanup race before queue initialization is done")
c2856ae2 ("blk-mq: quiesce queue before freeing queue")

But they still can't cover all cases; recently James reported another issue of this kind:

https://marc.info/?l=linux-scsi&m=155389088124782&w=2

This issue can be quite hard to address in the previous way, given that scsi_run_queue() may run requeues for other LUNs.

Fix the above issue by freeing the hctx's resources in its release handler. This is safe because tags aren't needed for freeing such hctx resources.

This approach follows the typical design pattern for a kobject's release handler.

Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: James Smart <james.smart@broadcom.com>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: linux-scsi@vger.kernel.org
Cc: Martin K . Petersen <martin.petersen@oracle.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: James E . J . Bottomley <jejb@linux.vnet.ibm.com>
Reported-by: James Smart <james.smart@broadcom.com>
Fixes: 45a9c9d9 ("blk-mq: Fix a use-after-free")
Cc: stable@vger.kernel.org
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 05 Apr, 2019 1 commit
-
-
Committed by Bart Van Assche
blk_mq_try_issue_directly() can return BLK_STS*_RESOURCE for requests that have been queued. If that happens when blk_mq_try_issue_directly() is called by the dm-mpath driver, then dm-mpath will try to resubmit a request that is already queued and a kernel crash follows. Since it is nontrivial to fix blk_mq_request_issue_directly(), revert the blk_mq_request_issue_directly() changes that went into kernel v5.0.

This patch reverts the following commits:
* d6a51a97 ("blk-mq: replace and kill blk_mq_request_issue_directly") # v5.0.
* 5b7a6f12 ("blk-mq: issue directly with bypass 'false' in blk_mq_sched_insert_requests") # v5.0.
* 7f556a44 ("blk-mq: refactor the code of issue request directly") # v5.0.

Cc: Christoph Hellwig <hch@infradead.org>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: Jianchao Wang <jianchao.w.wang@oracle.com>
Cc: Hannes Reinecke <hare@suse.com>
Cc: Johannes Thumshirn <jthumshirn@suse.de>
Cc: James Smart <james.smart@broadcom.com>
Cc: Dongli Zhang <dongli.zhang@oracle.com>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: <stable@vger.kernel.org>
Reported-by: Laurence Oberman <loberman@redhat.com>
Tested-by: Laurence Oberman <loberman@redhat.com>
Fixes: 7f556a44 ("blk-mq: refactor the code of issue request directly") # v5.0.
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 25 Mar, 2019 1 commit
-
-
Committed by Yufen Yu
Except for their arguments, blk_mq_put_driver_tag_hctx() and blk_mq_put_driver_tag() are the same. We can just use the 'request' argument to put the tag via blk_mq_put_driver_tag(). Then we can remove the unused blk_mq_put_driver_tag_hctx().

Signed-off-by: Yufen Yu <yuyufen@huawei.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 21 Mar, 2019 1 commit
-
-
Committed by Bart Van Assche
This function is not used outside the block layer core. Hence unexport it.

Cc: Christoph Hellwig <hch@lst.de>
Cc: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Bart Van Assche <bvanassche@acm.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 09 Feb, 2019 1 commit
-
-
Committed by Liu Bo
As the prototype has been defined in "include/linux/blk-mq.h", the one in "block/blk-mq.h" can be removed.

Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 01 Feb, 2019 2 commits
-
-
Committed by Jianchao Wang
Currently, we check whether the hctx type is supported every time in the hot path. This is not necessary: when mapping swqueues, we can save the default hctx into ctx->hctxs if the type is not supported, and then use ctx->hctxs[type] directly. We also needn't check whether polling is enabled, because the caller clears REQ_HIPRI in that case.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Jianchao Wang
Currently, the queue mapping result is saved in a two-dimensional array. In the hot path, to get a hctx, we need to do the following:

  q->queue_hw_ctx[q->tag_set->map[type].mq_map[cpu]]

This isn't very efficient. We could save the queue mapping result directly in the ctx, per hctx type, like this:

  ctx->hctxs[type]

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
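The difference between the two lookups can be illustrated with a small user-space model. It is a sketch under assumptions, not kernel code: the `sketch_*` types and functions are hypothetical, and the sizes `SKETCH_NR_TYPES` and `SKETCH_NR_CPUS` are picked only for the example. The slow path indexes a per-type CPU map and then the hardware queue array; the mapping step does that once and caches one pointer per type in the software queue context.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NR_TYPES 3   /* assumed number of hctx types */
#define SKETCH_NR_CPUS  4   /* assumed CPU count for the model */

struct sketch_hctx { unsigned int queue_num; };

/* Per-type CPU -> hardware queue index table (the two-dimensional map). */
struct sketch_map { unsigned int mq_map[SKETCH_NR_CPUS]; };

/* Software queue context with one cached hctx pointer per type. */
struct sketch_ctx {
    unsigned int cpu;
    struct sketch_hctx *hctxs[SKETCH_NR_TYPES];
};

/* Slow lookup: index the per-type map by CPU, then the hctx array. */
static struct sketch_hctx *sketch_lookup_slow(struct sketch_hctx *hw,
                                              const struct sketch_map *maps,
                                              unsigned int type,
                                              unsigned int cpu)
{
    assert(type < SKETCH_NR_TYPES && cpu < SKETCH_NR_CPUS);
    return &hw[maps[type].mq_map[cpu]];
}

/* Done once at mapping time: cache the result in the ctx. */
static void sketch_map_ctx(struct sketch_ctx *ctx, struct sketch_hctx *hw,
                           const struct sketch_map *maps)
{
    for (unsigned int type = 0; type < SKETCH_NR_TYPES; type++)
        ctx->hctxs[type] = sketch_lookup_slow(hw, maps, type, ctx->cpu);
}
```

After `sketch_map_ctx()` has run, the hot path reads `ctx->hctxs[type]`, a single array access, instead of chasing three levels of indirection per request.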
-
- 18 Dec, 2018 1 commit
-
-
Committed by Ming Lei
When a request is added to the rq list of a sw queue (ctx), the rq may be from a different type of hctx, especially after multi queue mapping was introduced.

So when dispatching requests from the sw queue via blk_mq_flush_busy_ctxs() or blk_mq_dequeue_from_ctx(), a request belonging to another queue type of hctx can be dispatched to the current hctx if the read queue or poll queue is enabled.

This patch fixes this issue by introducing a per-queue-type list.

Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>

Changed by me to not use separately cacheline aligned lists, just place them all in the same cacheline where we had just the one list and lock before.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 17 Dec, 2018 1 commit
-
-
Committed by Christoph Hellwig
We should check if a given queue map actually has queues enabled before dispatching to it. This allows drivers to not initialize optional but unused map types, which subsequently will allow fixing problems with queue map rebuilds for that case.

Reviewed-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 16 Dec, 2018 1 commit
-
-
Committed by Jianchao Wang
Replace blk_mq_request_issue_directly with blk_mq_try_issue_directly in blk_insert_cloned_request and kill it, as nobody uses it any more.

Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 10 Dec, 2018 1 commit
-
-
Committed by Mikulas Patocka
The previous patches deleted all the code that needed the second value returned from part_in_flight - now the kernel only uses the first value. Consequently, part_in_flight (and blk_mq_in_flight) may be changed so that it only returns one value.

This patch just refactors the code, there's no functional change.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 05 Dec, 2018 1 commit
-
-
Committed by Christoph Hellwig
Having another indirect call in the fast path doesn't really help in our post-spectre world. Also, having too many queue types is just going to create confusion, so I'd rather manage them centrally.

Note that the queue type naming and ordering changes a bit - the first index now is the default queue for everything not explicitly marked, the optional ones are read and poll queues.

Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 30 Nov, 2018 1 commit
-
-
Committed by Jens Axboe
If we are issuing a list of requests, we know if we're at the last one. If we fail issuing, ensure that we call ->commit_rqs() to flush any potential previous requests.

Reviewed-by: Omar Sandoval <osandov@fb.com>
Reviewed-by: Ming Lei <ming.lei@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 21 Nov, 2018 1 commit
-
-
Committed by Ming Lei
Even though .mq_kobj, ctx->kobj and q->kobj share the same lifetime from the block layer's view, actually they don't, because userspace may grab one kobject at any time via sysfs.

This patch fixes the issue with the following approach:

1) introduce 'struct blk_mq_ctxs' for holding .mq_kobj and managing all ctxs

2) free all allocated ctxs and the 'blk_mq_ctxs' instance in the release handler of .mq_kobj

3) grab one ref of .mq_kobj before initializing each ctx->kobj, so that .mq_kobj is always released after all ctxs are freed.

This patch fixes a kernel panic during booting when DEBUG_KOBJECT_RELEASE is enabled.

Reported-by: Guenter Roeck <linux@roeck-us.net>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 08 Nov, 2018 7 commits
-
-
Committed by Jens Axboe
We call blk_mq_map_queue() a lot, at least two times for each request per IO, sometimes more. Since we now have an indirect call as well in that function, cache the mapping so we don't have to re-call blk_mq_map_queue() for the same request multiple times.

Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Jens Axboe
Add support for the tag set carrying multiple queue maps, and for the driver to inform blk-mq how many it wishes to support through setting set->nr_maps.

This adds an mq_ops helper for drivers that support more than 1 map, mq_ops->rq_flags_to_type(). The function takes request/bio flags and CPU, and returns a queue map index for that. We then use the type information in blk_mq_map_queue() to index the map set.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
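The optional hook described in this message can be modeled as a function pointer that a driver may or may not provide. The following is a user-space sketch under assumptions: `sketch_ops`, `sketch_set`, `SKETCH_FLAG_READ` and the helper names are hypothetical, and the hook signature is simplified to plain integers rather than the kernel's types.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_FLAG_READ 0x1u   /* hypothetical request flag */

/* Hypothetical driver ops with the optional flags-to-type hook. */
struct sketch_ops {
    unsigned int (*rq_flags_to_type)(unsigned int flags, unsigned int cpu);
};

struct sketch_set {
    unsigned int nr_maps;           /* how many queue maps the driver set up */
    const struct sketch_ops *ops;
};

/* Example driver hook: reads use map 1, everything else map 0. */
static unsigned int sketch_driver_flags_to_type(unsigned int flags,
                                                unsigned int cpu)
{
    (void)cpu;   /* unused in this example */
    return (flags & SKETCH_FLAG_READ) ? 1u : 0u;
}

/* Pick the map index; drivers without the hook always get map 0. */
static unsigned int sketch_map_index(const struct sketch_set *set,
                                     unsigned int flags, unsigned int cpu)
{
    unsigned int type = 0;

    if (set->ops && set->ops->rq_flags_to_type)
        type = set->ops->rq_flags_to_type(flags, cpu);
    assert(type < set->nr_maps);
    return type;
}
```

Single-map drivers pay only a NULL check in this model, while multi-map drivers decide the index from the request flags.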
-
Committed by Jens Axboe
The mapping used to be dependent on just the CPU location, but now it's a tuple of (type, cpu) instead. This is a prep patch for allowing a single software queue to map to multiple hardware queues. No functional changes in this patch.

This changes the software queue count to an unsigned short to save a bit of space. We can still support 64K-1 CPUs, which should be enough. Add a check to catch a wrap.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
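The wrap check mentioned above amounts to refusing a count that does not fit in an unsigned short. A minimal sketch, assuming a hypothetical helper `sketch_set_nr_ctx()` (the commit message does not say where or how the kernel performs the check):

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: store a software queue count in an unsigned short
 * and report failure if the value does not fit (i.e. it would wrap).
 */
static bool sketch_set_nr_ctx(unsigned short *nr_ctx, unsigned int count)
{
    assert(nr_ctx != NULL);
    if (count > USHRT_MAX)
        return false;
    *nr_ctx = (unsigned short)count;
    return true;
}
```

On platforms where unsigned short is 16 bits wide, `USHRT_MAX` is 65535, which matches the "64K-1" limit quoted in the message.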
-
Committed by Jens Axboe
Prep patch for being able to place requests based not just on CPU location, but also on the type of request.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Jens Axboe
Doesn't do anything right now, but it's needed as a prep patch to get the interfaces right.

While in there, correct the blk_mq_map_queue() CPU type to an unsigned int.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Jens Axboe
This is in preparation for allowing multiple sets of maps per queue, if so desired.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Jens Axboe
It's just a pointer to set->mq_map, use that instead. Move the assignment a bit earlier, so we always know it's valid.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.com>
Reviewed-by: Bart Van Assche <bvanassche@acm.org>
Reviewed-by: Keith Busch <keith.busch@intel.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 18 Jul, 2018 1 commit
-
-
Committed by Ming Lei
In case of the 'none' io scheduler, when the hw queue isn't busy, it isn't necessary to enqueue a request to the sw queue and dequeue it again, because the request may be submitted to the hw queue asap without extra cost. Meanwhile, there shouldn't be many requests in the sw queue, so we don't need to worry about the effect on IO merging.

There are still some single hw queue SCSI HBAs (HPSA, megaraid_sas, ...) which may connect high performance devices, so 'none' is often required for obtaining good performance.

This patch improves IOPS and decreases CPU utilization on megaraid_sas, per Kashyap's test.

Cc: Kashyap Desai <kashyap.desai@broadcom.com>
Cc: Laurence Oberman <loberman@redhat.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Hannes Reinecke <hare@suse.de>
Reported-by: Kashyap Desai <kashyap.desai@broadcom.com>
Tested-by: Kashyap Desai <kashyap.desai@broadcom.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 09 Jul, 2018 2 commits
-
-
Committed by Minwoo Im
set->mq_map is currently cleared if something goes wrong when establishing a queue map in blk-mq-pci.c. It's also cleared before updating a queue map in blk_mq_update_queue_map().

This patch provides an API to clear set->mq_map to make this explicit.

Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Ming Lei
We never pass 'wait' as true to blk_mq_get_driver_tag(), and hence we never change '**hctx' either. The last use of these went away with the flush cleanup, commit 0c2a6fe4.

So clean up the usage and remove the two extra parameters.

Cc: Bart Van Assche <bart.vanassche@wdc.com>
Cc: Christoph Hellwig <hch@lst.de>
Tested-by: Andrew Jones <drjones@redhat.com>
Reviewed-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 29 May, 2018 1 commit
-
-
Committed by Keith Busch
This patch simplifies the timeout handling by relying on the request reference counting to ensure the iterator is operating on an inflight and truly timed out request. Since the reference counting prevents the tag from being reallocated, the block layer no longer needs to prevent drivers from completing their requests while the timeout handler is operating on it: a driver completing a request is allowed to proceed to the next state without additional synchronization with the block layer.

This also removes any need for generation sequence numbers, since the request is prevented from being reallocated as a new sequence while timeout handling is operating on it.

To enable this, a refcount is added to struct request so that request users can be sure they're operating on the same request without it changing while they're processing it. The request's tag won't be released for reuse until both the timeout handler and the completion are done with it.

Signed-off-by: Keith Busch <keith.busch@intel.com>
[hch: slight cleanups, added back submission side hctx lock, use cmpxchg for completions]
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
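The lifetime rule in the last paragraph (the tag is released only after both the timeout handler and the completion are done) can be shown with a single-threaded model. This is a sketch under assumptions: `sketch_request` and both helpers are hypothetical, and a plain `int` stands in for the atomic reference counter a real implementation would need.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical request model: a tag plus a reference count. */
struct sketch_request {
    int ref;            /* number of users still holding the request */
    bool tag_in_use;    /* true until the tag may be reallocated */
};

/* Take a reference only if the request is still live (ref > 0). */
static bool sketch_ref_get_not_zero(struct sketch_request *rq)
{
    if (rq->ref == 0)
        return false;
    rq->ref++;
    return true;
}

/* Drop a reference; the tag is released when the last user is done. */
static void sketch_ref_put(struct sketch_request *rq)
{
    assert(rq->ref > 0);
    if (--rq->ref == 0)
        rq->tag_in_use = false;
}
```

A timeout iterator in this model first calls `sketch_ref_get_not_zero()`; if that fails the request has already been freed and is skipped, and if it succeeds the tag cannot be reused until the iterator calls `sketch_ref_put()`.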
-
- 26 Apr, 2018 1 commit
-
-
Committed by Omar Sandoval
When the blk-mq inflight implementation was added, /proc/diskstats was converted to use it, but /sys/block/$dev/inflight was not. Fix it by adding another helper to count in-flight requests by data direction.

Fixes: f299b7c7 ("blk-mq: provide internal in-flight variant")
Signed-off-by: Omar Sandoval <osandov@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 25 Apr, 2018 1 commit
-
-
Committed by Linus Walleij
As it came up in discussion on the mailing list that the semantic meaning of 'blk_mq_ctx' and 'blk_mq_hw_ctx' isn't completely obvious to everyone, let's add some minimal kerneldoc for a starter.

Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 11 Apr, 2018 1 commit
-
-
Committed by Ming Lei
This reverts commit 127276c6.

When all CPUs of one hw queue become offline, there may still be IOs from this hctx that have not completed. But blk_mq_hw_queue_mapped() is called in blk_mq_queue_tag_busy_iter(), which is used for iterating requests in the timeout handler, so timeout events will be missed on the inactive hctx, and requests may never be completed.

Also, the reimplementation of blk_mq_hw_queue_mapped() doesn't match the helper's name any more; it should have been named blk_mq_hw_queue_active().

The other callers also need further verification against this reimplementation.

So revert this patch now; hw queue activate/inactivate handling can be improved after adequate research and testing.

Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Reported-by: Jens Axboe <axboe@kernel.dk>
Fixes: 127276c6 ("blk-mq: reimplement blk_mq_hw_queue_mapped")
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 10 Apr, 2018 1 commit
-
-
Committed by Ming Lei
Now the actual meaning of "queue mapped" is that there is at least one online CPU mapped to this hctx, so implement blk_mq_hw_queue_mapped() in this way.

Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
Tested-by: Christian Borntraeger <borntraeger@de.ibm.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 20 Jan, 2018 1 commit
-
-
Committed by Bart Van Assche
Most blk-mq functions have a name that follows the pattern blk_mq_${action}. However, the function name blk_mq_request_direct_issue is an exception. Hence rename this function. This patch does not change any functionality.

Reviewed-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Bart Van Assche <bart.vanassche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 18 Jan, 2018 1 commit
-
-
Committed by Ming Lei
blk_insert_cloned_request() is called in the fast path of a dm-rq driver (e.g. blk-mq request-based DM mpath). blk_insert_cloned_request() uses blk_mq_request_bypass_insert() to directly append the request to the blk-mq hctx->dispatch_list of the underlying queue.

1) This way isn't efficient enough because the hctx spinlock is always used.

2) With blk_insert_cloned_request(), we completely bypass the underlying queue's elevator and depend on the upper-level dm-rq driver's elevator to schedule IO. But dm-rq currently can't get the underlying queue's dispatch feedback at all. Without knowing whether a request was issued or not (e.g. due to the underlying queue being busy), the dm-rq elevator will not be able to provide effective IO merging (as a side-effect of dm-rq currently blindly destaging a request from its elevator only to requeue it after a delay, which kills any opportunity for merging). This obviously causes very bad sequential IO performance.

Fix this by updating blk_insert_cloned_request() to use blk_mq_request_direct_issue(). blk_mq_request_direct_issue() allows a request to be issued directly to the underlying queue and returns the dispatch feedback (blk_status_t). If blk_mq_request_direct_issue() returns BLK_STS_RESOURCE, the dm-rq driver will now use DM_MAPIO_REQUEUE to _not_ destage the request, thereby preserving the opportunity to merge IO.

With this, request-based DM's blk-mq sequential IO performance is vastly improved (as much as 3X in mpath/virtio-scsi testing).

Signed-off-by: Ming Lei <ming.lei@redhat.com>
[blk-mq.c changes heavily influenced by Ming Lei's initial solution, but they were refactored to make them less fragile and easier to read/review]
Signed-off-by: Mike Snitzer <snitzer@redhat.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
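The feedback loop described in this message (issue directly, and requeue in the upper layer when the lower queue reports it is busy) can be modeled in a few lines. This is a user-space sketch under assumptions: every `sketch_*` / `SKETCH_*` name is hypothetical and only mirrors the roles of blk_status_t, DM_MAPIO_REQUEUE and the underlying queue; "free slots" is an invented stand-in for whatever makes the real queue busy.

```c
#include <assert.h>

/* Hypothetical status and mapping results mirroring the roles above. */
enum sketch_status { SKETCH_STS_OK, SKETCH_STS_RESOURCE };
enum sketch_mapio { SKETCH_MAPIO_SUBMITTED, SKETCH_MAPIO_REQUEUE };

/* Model of the underlying queue: it can accept a bounded number of requests. */
struct sketch_lower_queue { int free_slots; };

/* Direct issue: report busy instead of silently queueing the request. */
static enum sketch_status sketch_direct_issue(struct sketch_lower_queue *q)
{
    assert(q->free_slots >= 0);
    if (q->free_slots == 0)
        return SKETCH_STS_RESOURCE;
    q->free_slots--;
    return SKETCH_STS_OK;
}

/* Upper layer: keep the request staged when the lower queue is busy. */
static enum sketch_mapio sketch_dm_map_request(struct sketch_lower_queue *q)
{
    if (sketch_direct_issue(q) == SKETCH_STS_RESOURCE)
        return SKETCH_MAPIO_REQUEUE;
    return SKETCH_MAPIO_SUBMITTED;
}
```

Because the busy status propagates upward, the request stays in the upper layer's elevator, where later IO can still be merged with it.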
-
- 10 Jan, 2018 3 commits
-
-
Committed by Tejun Heo
After the recent updates to use generation number and state based synchronization, we can easily replace REQ_ATOM_STARTED usages by adding an extra state to distinguish the completed but not yet freed state.

Add MQ_RQ_COMPLETE and replace REQ_ATOM_STARTED usages with blk_mq_rq_state() tests. REQ_ATOM_STARTED no longer has any users left and is removed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Tejun Heo
With issue/complete and timeout paths now using the generation number and state based synchronization, blk_abort_request() is the only one which depends on REQ_ATOM_COMPLETE for arbitrating completion.

There's no reason for blk_abort_request() to be a completely separate path. This patch makes blk_abort_request() piggyback on the timeout path instead of trying to terminate the request directly.

This removes the last dependency on REQ_ATOM_COMPLETE in blk-mq.

Note that this makes blk_abort_request() asynchronous - it initiates abortion but the actual termination will happen after a short while, even when the caller owns the request. AFAICS, SCSI and ATA should be fine with that and I think mtip32xx and dasd should be safe but not completely sure. It'd be great if people who know the drivers take a look.

v2: - Add comment explaining the lack of synchronization around ->deadline update as requested by Bart.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Asai Thambi SP <asamymuthupa@micron.com>
Cc: Stefan Haberland <sth@linux.vnet.ibm.com>
Cc: Jan Hoeppner <hoeppner@linux.vnet.ibm.com>
Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Tejun Heo
Currently, the blk-mq timeout path synchronizes against the usual issue/completion path using a complex scheme involving atomic bitflags, REQ_ATOM_*, memory barriers and subtle memory coherence rules. Unfortunately, it contains quite a few holes.

There's a complex dance around REQ_ATOM_STARTED and REQ_ATOM_COMPLETE between issue/completion and timeout paths; however, they don't have a synchronization point across request recycle instances and it isn't clear what the barriers add. blk_mq_check_expired() can easily read STARTED from N-2'th iteration, deadline from N-1'th, blk_mark_rq_complete() against Nth instance.

In fact, it's pretty easy to make blk_mq_check_expired() terminate a later instance of a request. If we induce a 5 sec delay before the time_after_eq() test in blk_mq_check_expired(), shorten the timeout to 2s, and issue back-to-back large IOs, blk-mq starts timing out requests spuriously pretty quickly. Nothing actually timed out. It just made the call on a recycle instance of a request and then terminated a later instance long after the original instance finished. The scenario isn't theoretical either.

This patch replaces the broken synchronization mechanism with a RCU and generation number based one.

1. Each request has a u64 generation + state value, which can be updated only by the request owner. Whenever a request becomes in-flight, the generation number gets bumped up too. This provides the basis for the timeout path to distinguish different recycle instances of the request.

   Also, marking a request in-flight and setting its deadline are protected with a seqcount so that the timeout path can fetch both values coherently.

2. The timeout path fetches the generation, state and deadline. If the verdict is timeout, it records the generation into a dedicated request abortion field and does RCU wait.

3. The completion path is also protected by RCU (from the previous patch) and checks whether the current generation number and state match the abortion field. If so, it skips completion.

4. The timeout path, after RCU wait, scans requests again and terminates the ones whose generation and state still match the ones requested for abortion.

   By now, the timeout path knows that either the generation number and state changed if it lost the race or the completion will yield to it and can safely timeout the request.

While it's more lines of code, it's conceptually simpler, doesn't depend on direct use of subtle memory ordering or coherence, and hopefully doesn't terminate the wrong instance.

While this change makes REQ_ATOM_COMPLETE synchronization unnecessary between issue/complete and timeout paths, REQ_ATOM_COMPLETE isn't removed yet as it's still used in other places. Future patches will move all state tracking to the new mechanism and remove all bitops in the hot paths.

Note that this patch adds a comment explaining a race condition in the BLK_EH_RESET_TIMER path. The race has always been there and this patch doesn't change it. It's just documenting the existing race.

v2: - Fixed BLK_EH_RESET_TIMER handling as pointed out by Jianchao.
    - s/request->gstate_seqc/request->gstate_seq/ as suggested by Peter.
    - READ_ONCE() added in blk_mq_rq_update_state() as suggested by Peter.

v3: - Fixed possible extended seqcount / u64_stats_sync read looping spotted by Peter.
    - MQ_RQ_IDLE was incorrectly being set in complete_request instead of free_request. Fixed.

v4: - Rebased on top of hctx_lock() refactoring patch.
    - Added comment explaining the use of hctx_lock() in completion path.

v5: - Added comments requested by Bart.
    - Note the addition of BLK_EH_RESET_TIMER race condition in the commit message.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: "jianchao.wang" <jianchao.w.wang@oracle.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Bart Van Assche <Bart.VanAssche@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
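The core of steps 1 to 3 above, a combined generation + state word that lets the timeout path name one specific recycle instance, can be sketched as follows. This is a single-threaded model under assumptions: the bit layout, the `sketch_*` names, and the choice of an all-ones initial abort value are hypothetical, and the RCU wait and the seqcount protecting the deadline are left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: low 2 bits hold the state, the rest the generation. */
#define SKETCH_STATE_BITS 2
#define SKETCH_STATE_MASK ((UINT64_C(1) << SKETCH_STATE_BITS) - 1)

enum sketch_state { SKETCH_IDLE = 0, SKETCH_IN_FLIGHT = 1 };

struct sketch_request {
    uint64_t gstate;          /* generation + state */
    uint64_t aborted_gstate;  /* value recorded by the timeout path */
};

/* Start idle at generation 0, with an abort value that never matches. */
static void sketch_init(struct sketch_request *rq)
{
    rq->gstate = 0;
    rq->aborted_gstate = UINT64_MAX;  /* state bits 0b11 are never used */
}

/* Change state; entering in-flight also bumps the generation. */
static void sketch_update_state(struct sketch_request *rq,
                                enum sketch_state new_state)
{
    uint64_t gen = rq->gstate & ~SKETCH_STATE_MASK;

    assert(((uint64_t)new_state & ~SKETCH_STATE_MASK) == 0);
    if (new_state == SKETCH_IN_FLIGHT)
        gen += UINT64_C(1) << SKETCH_STATE_BITS;
    rq->gstate = gen | (uint64_t)new_state;
}

/* Timeout path: mark the current instance as the one to abort. */
static void sketch_mark_aborted(struct sketch_request *rq)
{
    rq->aborted_gstate = rq->gstate;
}

/* Completion path: proceed only if this instance was not claimed. */
static bool sketch_may_complete(const struct sketch_request *rq)
{
    return rq->gstate != rq->aborted_gstate;
}
```

Once the request is recycled, its generation differs from the recorded abort value, so a stale timeout verdict can no longer be applied to the new instance.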
-
- 11 Nov, 2017 1 commit
-
-
Committed by Jens Axboe
Currently we are inconsistent in when we decide to run the queue. Using blk_mq_run_hw_queues() we check if the hctx has pending IO before running it, but we don't do that from the individual queue run function, blk_mq_run_hw_queue(). This results in a lot of extra and pointless queue runs, potentially, on flush requests and (much worse) on tag starvation situations. This is observable just looking at top output, with lots of kworkers active. For the !async runs, it just adds to the CPU overhead of blk-mq.

Move the has-pending check into the run function instead of having callers do it.

Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
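The refactoring described above moves one check from the callers into the callee. A minimal user-space sketch, assuming hypothetical `sketch_*` names and a simplified notion of "pending" (a plain counter rather than the kernel's real bookkeeping):

```c
#include <assert.h>
#include <stdbool.h>

struct sketch_hctx {
    int pending;      /* requests waiting to be dispatched */
    int runs;         /* how many times the queue was actually run */
};

static bool sketch_hctx_has_pending(const struct sketch_hctx *hctx)
{
    return hctx->pending > 0;
}

/* Run one queue; the pending check lives here, not in the callers. */
static bool sketch_run_hw_queue(struct sketch_hctx *hctx)
{
    if (!sketch_hctx_has_pending(hctx))
        return false;
    hctx->runs++;
    hctx->pending = 0;   /* model: everything is dispatched */
    return true;
}

/* Run all queues; no per-caller check is needed any more. */
static void sketch_run_hw_queues(struct sketch_hctx *hctxs, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        sketch_run_hw_queue(&hctxs[i]);
}
```

Every caller of the single-queue function now gets the same behavior, so an idle queue is skipped regardless of which path asked for the run.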
-