- 02 9月, 2020 40 次提交
-
-
由 Ming Lei 提交于
to #28991349 commit 5938870247be4453ef6602c7ce467bebb48113c8 upstream Now almost all .map_queues() implementation based on managed irq affinity doesn't update queue mapping and it just retrieves the old built mapping, so if nr_hw_queues is changed, the mapping talbe includes stale mapping. And only blk_mq_map_queues() may rebuild the mapping talbe. One case is that we limit .nr_hw_queues as 1 in case of kdump kernel. However, drivers often builds queue mapping before allocating tagset via pci_alloc_irq_vectors_affinity(), but set->nr_hw_queues can be set as 1 in case of kdump kernel, so wrong queue mapping is used, and kernel panic[1] is observed during booting. This patch fixes the kernel panic triggerd on nvme by rebulding the mapping table via blk_mq_map_queues(). [1] kernel panic log [ 4.438371] nvme nvme0: 16/0/0 default/read/poll queues [ 4.443277] BUG: unable to handle kernel NULL pointer dereference at 0000000000000098 [ 4.444681] PGD 0 P4D 0 [ 4.445367] Oops: 0000 [#1] SMP NOPTI [ 4.446342] CPU: 3 PID: 201 Comm: kworker/u33:10 Not tainted 4.20.0-rc5-00664-g5eb02f7ee1eb-dirty #459 [ 4.447630] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-2.fc27 04/01/2014 [ 4.448689] Workqueue: nvme-wq nvme_scan_work [nvme_core] [ 4.449368] RIP: 0010:blk_mq_map_swqueue+0xfb/0x222 [ 4.450596] Code: 04 f5 20 28 ef 81 48 89 c6 39 55 30 76 93 89 d0 48 c1 e0 04 48 03 83 f8 05 00 00 48 8b 00 42 8b 3c 28 48 8b 43 58 48 8b 04 f8 <48> 8b b8 98 00 00 00 4c 0f a3 37 72 42 f0 4c 0f ab 37 66 8b b8 f6 [ 4.453132] RSP: 0018:ffffc900023b3cd8 EFLAGS: 00010286 [ 4.454061] RAX: 0000000000000000 RBX: ffff888174448000 RCX: 0000000000000001 [ 4.456480] RDX: 0000000000000001 RSI: ffffe8feffc506c0 RDI: 0000000000000001 [ 4.458750] RBP: ffff88810722d008 R08: ffff88817647a880 R09: 0000000000000002 [ 4.464580] R10: ffffc900023b3c10 R11: 0000000000000004 R12: ffff888174448538 [ 4.467803] R13: 0000000000000004 R14: 0000000000000001 R15: 0000000000000001 [ 4.469220] FS: 0000000000000000(0000) GS:ffff88817bac0000(0000) knlGS:0000000000000000 [ 4.471554] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 4.472464] CR2: 0000000000000098 CR3: 0000000174e4e001 CR4: 0000000000760ee0 [ 4.474264] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 4.476007] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 4.477061] PKRU: 55555554 [ 4.477464] Call Trace: [ 4.478731] blk_mq_init_allocated_queue+0x36a/0x3ad [ 4.479595] blk_mq_init_queue+0x32/0x4e [ 4.480178] nvme_validate_ns+0x98/0x623 [nvme_core] [ 4.480963] ? nvme_submit_sync_cmd+0x1b/0x20 [nvme_core] [ 4.481685] ? nvme_identify_ctrl.isra.8+0x70/0xa0 [nvme_core] [ 4.482601] nvme_scan_work+0x23a/0x29b [nvme_core] [ 4.483269] ? _raw_spin_unlock_irqrestore+0x25/0x38 [ 4.483930] ? try_to_wake_up+0x38d/0x3b3 [ 4.484478] ? process_one_work+0x179/0x2fc [ 4.485118] process_one_work+0x1d3/0x2fc [ 4.485655] ? rescuer_thread+0x2ae/0x2ae [ 4.486196] worker_thread+0x1e9/0x2be [ 4.486841] kthread+0x115/0x11d [ 4.487294] ? kthread_park+0x76/0x76 [ 4.487784] ret_from_fork+0x3a/0x50 [ 4.488322] Modules linked in: nvme nvme_core qemu_fw_cfg virtio_scsi ip_tables [ 4.489428] Dumping ftrace buffer: [ 4.489939] (ftrace buffer empty) [ 4.490492] CR2: 0000000000000098 [ 4.491052] ---[ end trace 03cd268ad5a86ff7 ]--- Cc: Christoph Hellwig <hch@lst.de> Cc: linux-nvme@lists.infradead.org Cc: David Milburn <dmilburn@redhat.com> Signed-off-by: NMing Lei <ming.lei@redhat.com> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Ming Lei 提交于
to #28991349 commit c45b1fa2433c65e44bdf48f513cb37289f3116b9 upstream When -ENOSPC is returned from pci_alloc_irq_vectors_affinity(), we still try to allocate multiple irq vectors again, so irq queues covers the admin queue actually. But we don't consider that, then number of the allocated irq vector may be same with sum of io_queues[HCTX_TYPE_DEFAULT] and io_queues[HCTX_TYPE_READ], this way is obviously wrong, and finally breaks nvme_pci_map_queues(), and warning from pci_irq_get_affinity() is triggered. IRQ queues should cover admin queues, this patch makes this point explicitely in nvme_calc_io_queues(). We got severl boot failure internal report on aarch64, so please consider to fix it in v4.20. Fixes: 6451fe73fa0f ("nvme: fix irq vs io_queue calculations") Signed-off-by: NMing Lei <ming.lei@redhat.com> Reviewed-by: NKeith Busch <keith.busch@intel.com> Tested-by: Nfin4478 <fin4478@hotmail.com> Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Jens Axboe 提交于
to #28991349 commit 6451fe73fa0f542a49bfacd7205b88a597897f58 upstream Guenter reported an boot hang issue on HPPA after we default to 0 poll queues. We have two issues in the queue count calculations: 1) We don't separate the poll queues from the read/write queues. This is important, since the former doesn't need interrupts. 2) The adjust logic is broken. Adjust the poll queue count before doing nvme_calc_io_queues(). The poll queue count is only limited by the IO queue count we were able to get from the controller, not failures in the IRQ allocation loop. This leaves nvme_calc_io_queues() just adjusting the read/write queue map. Reported-by: NReported-by: Guenter Roeck <linux@roeck-us.net> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Christoph Hellwig 提交于
to #28991349 commit e20ba6e1da029136ded295f33076483d65ddf50a upstream Having another indirect all in the fast path doesn't really help in our post-spectre world. Also having too many queue type is just going to create confusion, so I'd rather manage them centrally. Note that the queue type naming and ordering changes a bit - the first index now is the default queue for everything not explicitly marked, the optional ones are read and poll queues. Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Jens Axboe 提交于
to #28991349 commit dabcefab45d36ecb5a22f16577bb0f298876a22d upstream If we have separate poll queues, we know that they aren't using interrupts. Hence we don't need to disable interrupts around finding completions. Provide a separate set of blk_mq_ops for such devices. Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Ming Lei 提交于
to #28991349 commit 07b35eb5a364fa59f88f65e6c786192f2c9163be upstream Type of each element in queue mapping table is 'unsigned int, intead of 'struct blk_mq_queue_map)', so fix it. Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Mike Snitzer <snitzer@redhat.com> Cc: Christoph Hellwig <hch@lst.de> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NMing Lei <ming.lei@redhat.com> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Jens Axboe 提交于
to #28991349 commit a4668d9ba4be1ca9f4a39798ba3419fdfef0750d upstream We need a better way of configuring this, and given that polling is (still) a bit niche, let's default to using 0 poll queues. That way we'll have the same read/write/poll behavior as 4.20, and users that want to test/use polling are required to do manual configuration of the number of poll queues. Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Jens Axboe 提交于
to #28991349 commit db29eb059cdc571f9d75cec4a41b9884b3b8286a upstream At least on SPARC, if MSI/MSI-X isn't supported, we get EINVAL if we ask for more than one vector. This isn't covered by our ENOSPC check. If we get EINVAL, decrease our ask to just one vector, instead of bailing out in error. Fixes: 3b6592f70ad7 ("nvme: utilize two queue maps, one for reads and one for writes") Reported-by: NGuenter Roeck <linux@roeck-us.net> Tested-by: NGuenter Roeck <linux@roeck-us.net> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Jens Axboe 提交于
to #28991349 commit 30e066286e232772cad72c87008a958e23e40a33 upstream NVMe always asks for io_queues + 1 worth of IRQ vectors, which means that even when we scale all the way down, we still ask for 2 vectors and get -ENOSPC in return if the system can't support more than 1. Getting just 1 vector is fine, it just means that we'll have 1 IO queue and 1 admin queue, with a shared vector between them. Check for this case and don't add our + 1 if it happens. Fixes: 3b6592f70ad7 ("nvme: utilize two queue maps, one for reads and one for writes") Reported-by: NGuenter Roeck <linux@roeck-us.net> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Ming Lei 提交于
to #28991349 commit 556f36e90dbe7dded81f4fac084d2bc8a2458330 upstream Spread queues among present CPUs first, then building mapping on other non-present CPUs. So we can minimize count of dead queues which are mapped by un-present CPUs only. Then bad IO performance can be avoided by unbalanced mapping between present CPUs and queues. The similar policy has been applied on Managed IRQ affinity. Cc: Yi Zhang <yi.zhang@redhat.com> Reported-by: NYi Zhang <yi.zhang@redhat.com> Reviewed-by: NBob Liu <bob.liu@oracle.com> Signed-off-by: NMing Lei <ming.lei@redhat.com> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Jens Axboe 提交于
to #28991349 commit 4b04cc6a8f86c4842314def22332de1f15de8523 upstream Adds support for defining a variable number of poll queues, currently configurable with the 'poll_queues' module parameter. Defaults to a single poll queue. And now we finally have poll support without triggering interrupts! Reviewed-by: NHannes Reinecke <hare@suse.com> Reviewed-by: NKeith Busch <keith.busch@intel.com> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Bart Van Assche 提交于
to #28991349 commit e895fedf12dc0663a925b54eb0961fc927208097 upstream This patch avoids that the compiler complains about 'ret' being set but not being used when building with W=1. Fixes: 3b6592f70ad7 ("nvme: utilize two queue maps, one for reads and one for writes") # v5.0-rc1 Signed-off-by: NBart Van Assche <bvanassche@acm.org> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Jens Axboe 提交于
to #28991349 commit 3b6592f70ad7b4c24dd3eb2ac9bbe3353d02c992 upstream NVMe does round-robin between queues by default, which means that sharing a queue map for both reads and writes can be problematic in terms of read servicing. It's much easier to flood the queue with writes and reduce the read servicing. Implement two queue maps, one for reads and one for writes. The write queue count is configurable through the 'write_queues' parameter. By default, we retain the previous behavior of having a single queue set, shared between reads and writes. Setting 'write_queues' to a non-zero value will create two queue sets, one for reads and one for writes, the latter using the configurable number of queues (hardware queue counts permitting). Reviewed-by: NHannes Reinecke <hare@suse.com> Reviewed-by: NKeith Busch <keith.busch@intel.com> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Jens Axboe 提交于
to #28991349 commit 843477d4cc5c4bb4e346c561ecd3b9d0bd67e8c8 upstream Add a queue offset to the tag map. This enables users to map iteratively, for each queue map type they support. Bump maximum number of supported maps to 2, we're now fully able to support more than 1 map. Reviewed-by: NHannes Reinecke <hare@suse.com> Reviewed-by: NKeith Busch <keith.busch@intel.com> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Jens Axboe 提交于
to #28991349 commit 3110fc79606fb6003949246c6fb325dd43445273 upstream Currently we only look at the software queue, but with support for multiple maps, we should also look at the hardware queue. This is important since we'll flush out the request list if either the software queue or hardware queue don't match. This sorts by software queue first, then hardware queue if that differs. Finally we sort by request location like before. This minimizes the flush points per plug list. Reviewed-by: NKeith Busch <keith.busch@intel.com> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NHannes Reinecke <hare@suse.com> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Jens Axboe 提交于
to #28991349 commit 67cae4c948a5311121905a2a8740c50daf7f6478 upstream It's somewhat strange to have a list insertion function that relies on the fact that the caller has mapped things correctly. Pass in the hardware queue directly for insertion, which makes for a much cleaner interface and implementation. Reviewed-by: NKeith Busch <keith.busch@intel.com> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NHannes Reinecke <hare@suse.com> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Jens Axboe 提交于
to #28991349 commit ea4f995ee8b8f0578b3319949f2edd5d812fdb0a upstream We call blk_mq_map_queue() a lot, at least two times for each request per IO, sometimes more. Since we now have an indirect call as well in that function. cache the mapping so we don't have to re-call blk_mq_map_queue() for the same request multiple times. Reviewed-by: NKeith Busch <keith.busch@intel.com> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NHannes Reinecke <hare@suse.com> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Jens Axboe 提交于
to #28991349 commit 392546aed22009060911f76b6ea24520e2f8b50f upstream With multiple maps, nr_cpu_ids is no longer the maximum number of hardware queues we support on a given devices. The initializer of the tag_set can have set ->nr_hw_queues larger than the available number of CPUs, since we can exceed that with multiple queue maps. Reviewed-by: NHannes Reinecke <hare@suse.com> Reviewed-by: NBart Van Assche <bvanassche@acm.org> Reviewed-by: NKeith Busch <keith.busch@intel.com> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Jens Axboe 提交于
to #28991349 commit b3c661b15d5ab11d982e58bee23e05c1780528a1 upstream Add support for the tag set carrying multiple queue maps, and for the driver to inform blk-mq how many it wishes to support through setting set->nr_maps. This adds an mq_ops helper for drivers that support more than 1 map, mq_ops->rq_flags_to_type(). The function takes request/bio flags and CPU, and returns a queue map index for that. We then use the type information in blk_mq_map_queue() to index the map set. Reviewed-by: NHannes Reinecke <hare@suse.com> Reviewed-by: NKeith Busch <keith.busch@intel.com> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Jens Axboe 提交于
to #28991349 commit a783b81820fe3532809c98371ec904dfdb0ea9e5 upstream It can be useful for a user to verify what type a given hardware queue is, expose this information in sysfs. Reviewed-by: NHannes Reinecke <hare@suse.com> Reviewed-by: NBart Van Assche <bvanassche@acm.org> Reviewed-by: NKeith Busch <keith.busch@intel.com> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Jens Axboe 提交于
to #28991349 commit f31967f0e455d08d3ea1d2f849bf62dafc92dbf4 upstream The mapping used to be dependent on just the CPU location, but now it's a tuple of (type, cpu) instead. This is a prep patch for allowing a single software queue to map to multiple hardware queues. No functional changes in this patch. This changes the software queue count to an unsigned short to save a bit of space. We can still support 64K-1 CPUs, which should be enough. Add a check to catch a wrap. Reviewed-by: NHannes Reinecke <hare@suse.com> Reviewed-by: NKeith Busch <keith.busch@intel.com> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Ming Lei 提交于
to #28991349 commit 7809167da5c86fd6bf309b33dee7a797e263342f upstream We need to pass bio->bi_opf after bio intergrity preparing, otherwise the flag of REQ_INTEGRITY may not be set on the allocated request, then breaks block integrity. Fixes: f9afca4d367b ("blk-mq: pass in request/bio flags to queue mapping") Cc: Hannes Reinecke <hare@suse.com> Cc: Keith Busch <keith.busch@intel.com> Signed-off-by: NMing Lei <ming.lei@redhat.com> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Jens Axboe 提交于
to #28991349 commit f9afca4d367b8c915f28d29fcaba7460640403ff upstream Prep patch for being able to place request based not just on CPU location, but also on the type of request. Reviewed-by: NHannes Reinecke <hare@suse.com> Reviewed-by: NKeith Busch <keith.busch@intel.com> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Jens Axboe 提交于
to #28991349 commit ff2c56609d9b1f0739ae3a3bfdb78191d01e4192 upstream Doesn't do anything right now, but it's needed as a prep patch to get the interfaces right. While in there, correct the blk_mq_map_queue() CPU type to an unsigned int. Reviewed-by: NHannes Reinecke <hare@suse.com> Reviewed-by: NKeith Busch <keith.busch@intel.com> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Jens Axboe 提交于
to #28991349 commit ed76e329d74a4b15ac0f5fd3adbd52ec0178a134 upstream This is in preparation for allowing multiple sets of maps per queue, if so desired. Reviewed-by: NHannes Reinecke <hare@suse.com> Reviewed-by: NBart Van Assche <bvanassche@acm.org> Reviewed-by: NKeith Busch <keith.busch@intel.com> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Jens Axboe 提交于
to #28991349 commit a8908939af569ce2419f43fd56eeaf003bc3d85d upstream It's just a pointer to set->mq_map, use that instead. Move the assignment a bit earlier, so we always know it's valid. Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NHannes Reinecke <hare@suse.com> Reviewed-by: NBart Van Assche <bvanassche@acm.org> Reviewed-by: NKeith Busch <keith.busch@intel.com> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Jens Axboe 提交于
to #28991349 commit 6da4b3ab9a6e9b1b5f90322ab3fa3a7dd18edb19 upstream A driver may have a need to allocate multiple sets of MSI/MSI-X interrupts, and have them appropriately affinitized. Add support for defining a number of sets in the irq_affinity structure, of varying sizes, and get each set affinitized correctly across the machine. [ tglx: Minor changelog tweaks ] Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NMing Lei <ming.lei@redhat.com> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Reviewed-by: NHannes Reinecke <hare@suse.com> Reviewed-by: NMing Lei <ming.lei@redhat.com> Reviewed-by: NKeith Busch <keith.busch@intel.com> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Cc: linux-block@vger.kernel.org Link: https://lkml.kernel.org/r/20181102145951.31979-5-ming.lei@redhat.comSigned-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Ming Lei 提交于
to #28991349 commit 060746d9e394084b7401e7532f2de528ecbfb521 upstream No functional change. Prepares for support of allocating and affinitizing sets of interrupts, in which each set of interrupts needs a full two stage spreading. The first vector argument is necessary for this so the affinitizing starts from the first vector of each set. [ tglx: Minor changelog tweaks ] Signed-off-by: NMing Lei <ming.lei@redhat.com> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Cc: Hannes Reinecke <hare@suse.com> Cc: Keith Busch <keith.busch@intel.com> Cc: Sagi Grimberg <sagi@grimberg.me> Link: https://lkml.kernel.org/r/20181102145951.31979-4-ming.lei@redhat.comSigned-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Ming Lei 提交于
to #28991349 commit 5c903e108d0b005cf59904ca3520934fca4b9439 upstream No functional change. Prepares for supporting allocating and affinitizing interrupt sets. [ tglx: Minor changelog tweaks ] Signed-off-by: NMing Lei <ming.lei@redhat.com> Signed-off-by: NThomas Gleixner <tglx@linutronix.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Cc: Hannes Reinecke <hare@suse.com> Cc: Keith Busch <keith.busch@intel.com> Cc: Sagi Grimberg <sagi@grimberg.me> Link: https://lkml.kernel.org/r/20181102145951.31979-3-ming.lei@redhat.comSigned-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Xiaoguang Wang 提交于
to #28991349 This reverts commit a3d72a0c79fac0e113bbeb85e1e19b3b3568e2f5. Previously we just backported this patch partly, now we revert it temporarily and will backport it in later patches formally. Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Pavel Begunkov 提交于
to #29441901 commit 681fda8d27a66f7e65ff7f2d200d7635e64a8d05 upstream. io_recvmsg() doesn't free memory allocated for struct io_buffer. This can causes a leak when used with automatic buffer selection. Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com> Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
-
由 Tejun Heo 提交于
to #29361128 commit 0b80f9866e6bbfb905140ed8787ff2af03652c0c upstream. abs_vdebt is an atomic_64 which tracks how much over budget a given cgroup is and controls the activation of use_delay mechanism. Once a cgroup goes over budget from forced IOs, it has to pay it back with its future budget. The progress guarantee on debt paying comes from the iocg being active - active iocgs are processed by the periodic timer, which ensures that as time passes the debts dissipate and the iocg returns to normal operation. However, both iocg activation and vdebt handling are asynchronous and a sequence like the following may happen. 1. The iocg is in the process of being deactivated by the periodic timer. 2. A bio enters ioc_rqos_throttle(), calls iocg_activate() which returns without anything because it still sees that the iocg is already active. 3. The iocg is deactivated. 4. The bio from #2 is over budget but needs to be forced. It increases abs_vdebt and goes over the threshold and enables use_delay. 5. IO control is enabled for the iocg's subtree and now IOs are attributed to the descendant cgroups and the iocg itself no longer issues IOs. This leaves the iocg with stuck abs_vdebt - it has debt but inactive and no further IOs which can activate it. This can end up unduly punishing all the descendants cgroups. The usual throttling path has the same issue - the iocg must be active while throttled to ensure that future event will wake it up - and solves the problem by synchronizing the throttling path with a spinlock. abs_vdebt handling is another form of overage handling and shares a lot of characteristics including the fact that it isn't in the hottest path. This patch fixes the above and other possible races by strictly synchronizing abs_vdebt and use_delay handling with iocg->waitq.lock. Signed-off-by: NTejun Heo <tj@kernel.org> Reported-by: NVlad Dmitriev <vvd@fb.com> Cc: stable@vger.kernel.org # v5.4+ Fixes: e1518f63f246 ("blk-iocost: Don't let merges push vtime into the future") Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Tejun Heo 提交于
to #29361128 commit 21f3cfeab304fc07b90d93d98d4d2f62110fe6b2 upstream. Wrapping numbers in strings is used by some to work around bit-width issues in some enviroments. The problem isn't innate to json and the workaround seems to cause more integration problems than help. Let's drop the string wrapping. Signed-off-by: NTejun Heo <tj@kernel.org> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Tejun Heo 提交于
to #29361128 commit f4fe3ea636385a51f1dfbb27c387a04b12b919e9 upstream. This is to help external tools to decide whether iocost_monitor has all its requirements met or not based on the exit status of an -i0 run. Signed-off-by: NTejun Heo <tj@kernel.org> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Tejun Heo 提交于
to #29361128 commit cd006509b0a93cb7ee9d9fd50ae274098997a460 upstream. On each IO completion, iocost decides whether the IO met or missed its latency target. Currently, the targets are fixed numbers per IO type. While this can be good enough for loose latency targets way higher than typical completion latencies, the effect of IO size makes it difficult to tighten the latency target - a target adequate for 4k IOs might be too tight for 512k IOs and vice-versa. iocost already has all the necessary information to account for different IO sizes when testing whether the latency target is met as iocost can calculate the size vtime cost of a given IO. This patch updates the completion path to calculate the size vtime cost of the IO, deduct the nsec equivalent from the observed latency and use the adjusted value to decide whether the target is met. This makes latency targets independent from IO size and enables determining adequate latency targets with fixed size fio runs. Signed-off-by: NTejun Heo <tj@kernel.org> Cc: Andy Newell <newella@fb.com> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Hou Tao 提交于
to #29361128 commit 3d24430694077313c75c6b89f618db09943621e4 upstream. Currently rq->data_len will be decreased by partial completion or zeroed by completion, so when blk_stat_add() is invoked, data_len will be zero and there will never be samples in poll_cb because blk_mq_poll_stats_bkt() will return -1 if data_len is zero. We could move blk_stat_add() back to __blk_mq_complete_request(), but that would make the effort of trying to call ktime_get_ns() once in vain. Instead we can reuse throtl_size field, and use it for both block stats and block throttle, and adjust the logic in blk_mq_poll_stats_bkt() accordingly. Fixes: 4bc6339a ("block: move blk_stat_add() to __blk_mq_end_request()") Tested-by: NPavel Begunkov <asml.silence@gmail.com> Signed-off-by: NHou Tao <houtao1@huawei.com> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Tejun Heo 提交于
to #29361128 commit 54c52e10dc9b939084a7e6e3d32ce8fd8dee7898 upstream. The use_delay mechanism was introduced by blk-iolatency to hold memory allocators accountable for the reclaim and other shared IOs they cause. The duration of the delay is dynamically balanced between iolatency increasing the value on each target miss and it auto-decaying as time passes and threads get delayed on it. While this works well for iolatency, iocost's control model isn't compatible with it. There is no repeated "violation" events which can be balanced against auto-decaying. iocost instead knows how much a given cgroup is over budget and wants to prevent that cgroup from issuing IOs while over budget. Until now, iocost has been adding the cost of force-issued IOs. However, this doesn't reflect the amount which is already over budget and is simply not enough to counter the auto-decaying allowing anon-memory leaking low priority cgroup to go over its alloted share of IOs. As auto-decaying doesn't make much sense for iocost, this patch introduces a different mode of operation for use_delay - when blkcg_set_delay() are used insted of blkcg_add/use_delay(), the delay duration is not auto-decayed until it is explicitly cleared with blkcg_clear_delay(). iocost is updated to keep the delay duration synchronized to the budget overage amount. With this change, iocost can effectively police cgroups which generate significant amount of force-issued IOs. Signed-off-by: NTejun Heo <tj@kernel.org> Cc: Josef Bacik <josef@toxicpanda.com> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Waiman Long 提交于
to #29361128 commmit d6c8e949a35d6906d6c03a50e9a9cdf4e494528a upstream. Systemtap 4.2 is unable to correctly interpret the "u32 (*missed_ppm)[2]" argument of the iocost_ioc_vrate_adj trace entry defined in include/trace/events/iocost.h leading to the following error: /tmp/stapAcz0G0/stap_c89c58b83cea1724e26395efa9ed4939_6321_aux_6.c:78:8: error: expected ‘;’, ‘,’ or ‘)’ before ‘*’ token , u32[]* __tracepoint_arg_missed_ppm That argument type is indeed rather complex and hard to read. Looking at block/blk-iocost.c. It is just a 2-entry u32 array. By simplifying the argument to a simple "u32 *missed_ppm" and adjusting the trace entry accordingly, the compilation error was gone. Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost") Acked-by: NSteven Rostedt (VMware) <rostedt@goodmis.org> Acked-by: NTejun Heo <tj@kernel.org> Signed-off-by: NWaiman Long <longman@redhat.com> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Weiping Zhang 提交于
to #29361128 commit fa800d73c8d0d36b1f5929198371f421b69e610e upstream. Acked-by: NTejun Heo <tj@kernel.org> Signed-off-by: NWeiping Zhang <zhangweiping@didiglobal.com> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-
由 Tejun Heo 提交于
to 29361128 commit dcd6589b11d3b1e71f516a87a7b9646ed356b4c0 upstream. vtimes may wrap and time_before/after64() should be used to determine whether a given vtime is before or after another. iocg_is_idle() was incorrectly using plain "<" comparison do determine whether done_vtime is before vtime. Here, the only thing we're interested in is whether done_vtime matches vtime which indicates that there's nothing in flight. Let's test for inequality instead. Signed-off-by: NTejun Heo <tj@kernel.org> Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost") Cc: stable@vger.kernel.org # v5.4+ Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com> Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
-