- 21 Jun, 2019 7 commits
-
-
Committed by Chaitanya Kulkarni
Remove the status parameter of nvme_remove_dead_ctrl(), which is only used for printing. Move the print message to the function where the actual error occurs. Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Committed by Minwoo Im
If the state change to NVME_CTRL_CONNECTING fails, dmesg looks like this: [ 293.689160] nvme nvme0: failed to mark controller CONNECTING [ 293.689160] nvme nvme0: Removing after probe failure status: 0 The first line describes the situation, but the second line is misleading because a status of 0 normally means the previous operation succeeded. This patch makes it report the proper error value on failure: [ 25.932367] nvme nvme0: failed to mark controller CONNECTING [ 25.932369] nvme nvme0: Removing after probe failure status: -16 The situation is easy to reproduce with: root@target:~# rmmod nvme && modprobe nvme && rmmod nvme Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Committed by Chaitanya Kulkarni
This patch removes the confusing assignment of the variable result at declaration and sets the value in error cases next to the places where the actual error happens. It also sets result to -ENODEV when the final ctrl state transition fails in nvme_reset_work(). Without this assignment, result would still hold 0 from nvme_setup_io_queue(), and on failure of the final state transition 0 would be passed to nvme_remove_dead_ctrl(). Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
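The pattern this commit describes can be sketched in plain C. This is a minimal illustration, not the driver's code: `demo_ctrl`, `demo_enable()`, `demo_setup_io_queues()` and `demo_reset_work()` are hypothetical stand-ins, and `-EIO` is an arbitrary error chosen for the earlier steps.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcome switches standing in for the real driver steps. */
struct demo_ctrl {
    bool enable_ok;
    bool io_queues_ok;
    bool final_state_ok;
};

static int demo_enable(const struct demo_ctrl *c)
{
    return c->enable_ok ? 0 : -EIO;
}

static int demo_setup_io_queues(const struct demo_ctrl *c)
{
    return c->io_queues_ok ? 0 : -EIO;
}

/* Returns 0 on success or the negative errno of the step that failed. */
static int demo_reset_work(const struct demo_ctrl *c)
{
    int result; /* deliberately not initialized at declaration */

    assert(c != NULL);

    result = demo_enable(c);
    if (result)
        goto out;

    result = demo_setup_io_queues(c);
    if (result)
        goto out;

    /* result is 0 here, so a failed transition needs its own error code. */
    if (!c->final_state_ok) {
        result = -ENODEV;
        goto out;
    }
    return 0;

out:
    /* The real driver would report result and remove the controller here. */
    return result;
}
```

Because every failure site assigns `result` itself, the error path can never report a stale 0 left over from an earlier successful call.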
-
Committed by Minwoo Im
If "irq_queues" is greater than num_possible_cpus(), nvme_calc_irq_sets() can produce an irq set_size for HCTX_TYPE_DEFAULT that is larger than can be afforded. 2039 affd->set_size[HCTX_TYPE_DEFAULT] = nrirqs - nr_read_queues; This might cause a WARN() in irq_build_affinity_masks() like [1]: 220 if (nr_present < numvecs) 221 WARN_ON(nr_present + nr_others < numvecs); This patch prevents the WARN() by adjusting the max_vector value in nvme_setup_irqs(). [1] WARN messages when running modprobe nvme write_queues=32 poll_queues=0: root@target:~/nvme# nproc 8 root@target:~/nvme# modprobe nvme write_queues=32 poll_queues=0 [ 17.925326] nvme nvme0: pci function 0000:00:04.0 [ 17.940601] WARNING: CPU: 3 PID: 1030 at kernel/irq/affinity.c:221 irq_create_affinity_masks+0x222/0x330 [ 17.940602] Modules linked in: nvme nvme_core [last unloaded: nvme] [ 17.940605] CPU: 3 PID: 1030 Comm: kworker/u17:4 Tainted: G W 5.1.0+ #156 [ 17.940605] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014 [ 17.940608] Workqueue: nvme-reset-wq nvme_reset_work [nvme] [ 17.940609] RIP: 0010:irq_create_affinity_masks+0x222/0x330 [ 17.940611] Code: 4c 8d 4c 24 28 4c 8d 44 24 30 e8 c9 fa ff ff 89 44 24 18 e8 c0 38 fa ff 8b 44 24 18 44 8b 54 24 1c 5a 44 01 d0 41 39 c4 76 02 <0f> 0b 48 89 df 44 01 e5 e8 f1 ce 10 00 48 8b 34 24 44 89 f0 44 01 [ 17.940611] RSP: 0018:ffffc90002277c50 EFLAGS: 00010216 [ 17.940612] RAX: 0000000000000008 RBX: ffff88807ca48860 RCX: 0000000000000000 [ 17.940612] RDX: ffff88807bc03800 RSI: 0000000000000020 RDI: 0000000000000000 [ 17.940613] RBP: 0000000000000001 R08: ffffc90002277c78 R09: ffffc90002277c70 [ 17.940613] R10: 0000000000000008 R11: 0000000000000001 R12: 0000000000000020 [ 17.940614] R13: 0000000000025d08 R14: 0000000000000001 R15: ffff88807bc03800 [ 17.940614] FS: 0000000000000000(0000) GS:ffff88807db80000(0000) knlGS:0000000000000000 [ 17.940616] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 17.940617] CR2: 00005635e583f790 CR3: 000000000240a000 CR4: 00000000000006e0 [ 17.940617] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 17.940618] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 17.940618] Call Trace: [ 17.940622] __pci_enable_msix_range+0x215/0x540 [ 17.940623] ? kernfs_put+0x117/0x160 [ 17.940625] pci_alloc_irq_vectors_affinity+0x74/0x110 [ 17.940626] nvme_reset_work+0xc30/0x1397 [nvme] [ 17.940628] ? __switch_to_asm+0x34/0x70 [ 17.940628] ? __switch_to_asm+0x40/0x70 [ 17.940629] ? __switch_to_asm+0x34/0x70 [ 17.940630] ? __switch_to_asm+0x40/0x70 [ 17.940630] ? __switch_to_asm+0x34/0x70 [ 17.940631] ? __switch_to_asm+0x40/0x70 [ 17.940632] ? nvme_irq_check+0x30/0x30 [nvme] [ 17.940633] process_one_work+0x20b/0x3e0 [ 17.940634] worker_thread+0x1f9/0x3d0 [ 17.940635] ? cancel_delayed_work+0xa0/0xa0 [ 17.940636] kthread+0x117/0x120 [ 17.940637] ? kthread_stop+0xf0/0xf0 [ 17.940638] ret_from_fork+0x3a/0x50 [ 17.940639] ---[ end trace aca8a131361cd42a ]--- [ 17.942124] nvme nvme0: 7/1/0 default/read/poll queues Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
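The arithmetic behind the warning can be sketched as follows. This is an illustration under assumptions, not the driver's implementation: `demo_cap_vectors()` assumes the adjustment amounts to capping the requested vector count at the CPU count, and the branch structure of `demo_calc_sets()` is a guess that merely reproduces the `set_size[HCTX_TYPE_DEFAULT] = nrirqs - nr_read_queues` line quoted above. All names are hypothetical.

```c
#include <assert.h>

struct demo_sets {
    unsigned int nr_default; /* size of the default (write) set */
    unsigned int nr_read;    /* size of the read set */
};

/* Assumed fix: never ask for more vectors than there are possible CPUs. */
static unsigned int demo_cap_vectors(unsigned int requested,
                                     unsigned int possible_cpus)
{
    return requested > possible_cpus ? possible_cpus : requested;
}

/* Split nrirqs vectors into a default set and a read set. */
static struct demo_sets demo_calc_sets(unsigned int nrirqs,
                                       unsigned int write_queues)
{
    struct demo_sets s;

    assert(nrirqs > 0);

    if (nrirqs == 1 || write_queues == 0)
        s.nr_read = 0;                     /* one shared set */
    else if (write_queues >= nrirqs)
        s.nr_read = 1;                     /* keep at least one read queue */
    else
        s.nr_read = nrirqs - write_queues;

    s.nr_default = nrirqs - s.nr_read;
    return s;
}
```

With 8 CPUs and write_queues=32, capping first gives 8 vectors and a 7/1 split, which is consistent with the "7/1/0 default/read/poll queues" line in the log. Feeding an uncapped count (33 is a made-up value) into the same split yields a default set of 32, far more than 8 CPUs can cover.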
-
Committed by Minwoo Im
queue_count_set() appears to have been provided to limit the number of queue entries for write/poll queues. However, queue_count_set() has been doing nothing but a parameter check, even though it calls num_possible_cpus(), which has no effect there. This patch removes the entire queue_count_ops from write_queues and poll_queues. Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Committed by Minwoo Im
poll_queues will be zero even without zero initialization here. Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Committed by Keith Busch
The nvme pci driver prepares its devices for power loss during suspend by shutting down the controllers. The power setting is deferred to the pci driver's power management before the platform removes power. The suspend-to-idle mode, however, does not remove power. NVMe devices that implement host managed power settings can achieve lower power and better transition latencies than using generic PCI power settings. Try to use this feature if the platform is not involved with the suspend. If successful, restore the previous power state on resume. Tested-by: Kai-Heng Feng <kai.heng.feng@canonical.com> Tested-by: Mario Limonciello <mario.limonciello@dell.com> Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Sagi Grimberg <sagi@grimberg.me> [hch: fixed the compilation for the !CONFIG_PM_SLEEP case] Signed-off-by: Christoph Hellwig <hch@lst.de>
-
- 06 Jun, 2019 1 commit
-
-
Committed by Christoph Hellwig
NVMe uses PRPs (or optionally unlimited SGLs) for data transfers and has no specific limit for a single DMA segment. Limiting the size will cause problems because the block layer assumes PRP-ish devices using a virt boundary mask don't have a segment limit. And while this is true, we also really need to tell the DMA mapping layer about it, otherwise dma-debug will trip over it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reported-by: Sebastian Ott <sebott@linux.ibm.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 23 May, 2019 1 commit
-
-
Committed by Keith Busch
If a device is providing a single IRQ vector, the IO queue will share that vector with the admin queue. This is an unmanaged vector, so it does not have a valid PCI IRQ affinity. Avoid trying to extract a managed affinity in this case and let blk-mq set up the cpu:queue mapping instead. Otherwise we'd hit the following warning when the device is using MSI: WARNING: CPU: 4 PID: 7 at drivers/pci/msi.c:1272 pci_irq_get_affinity+0x66/0x80 Modules linked in: nvme nvme_core serio_raw CPU: 4 PID: 7 Comm: kworker/u16:0 Tainted: G W 5.2.0-rc1+ #494 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014 Workqueue: nvme-reset-wq nvme_reset_work [nvme] RIP: 0010:pci_irq_get_affinity+0x66/0x80 Code: 0b 31 c0 c3 83 e2 10 48 c7 c0 b0 83 35 91 74 2a 48 8b 87 d8 03 00 00 48 85 c0 74 0e 48 8b 50 30 48 85 d2 74 05 39 70 14 77 05 <0f> 0b 31 c0 c3 48 63 f6 48 8d 04 76 48 8d 04 c2 f3 c3 48 8b 40 30 RSP: 0000:ffffb5abc01d3cc8 EFLAGS: 00010246 RAX: ffff9536786a39c0 RBX: 0000000000000000 RCX: 0000000000000080 RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9536781ed000 RBP: ffff95367346a008 R08: ffff95367d43f080 R09: ffff953678c07800 R10: ffff953678164800 R11: 0000000000000000 R12: 0000000000000000 R13: ffff9536781ed000 R14: 00000000ffffffff R15: ffff95367346a008 FS: 0000000000000000(0000) GS:ffff95367d400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fdf814a3ff0 CR3: 000000001a20f000 CR4: 00000000000006e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: blk_mq_pci_map_queues+0x37/0xd0 nvme_pci_map_queues+0x80/0xb0 [nvme] blk_mq_alloc_tag_set+0x133/0x2f0 nvme_reset_work+0x105d/0x1590 [nvme] process_one_work+0x291/0x530 worker_thread+0x218/0x3d0 ? process_one_work+0x530/0x530 kthread+0x111/0x130 ? kthread_park+0x90/0x90 ret_from_fork+0x1f/0x30 ---[ end trace 74587339d93c83c0 ]--- Fixes: 22b55601 ("nvme-pci: Separate IO and admin queue IRQ vectors") Reported-by: Iván Chavero <ichavero@chavero.com.mx> Reviewed-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Keith Busch <keith.busch@intel.com>
-
- 18 May, 2019 4 commits
-
-
Committed by Keith Busch
A controller with multiple namespaces may have multiple request_queues with their own timeout work. If a controller fails with IO outstanding to different namespaces, each request queue may attempt to handle it, so ensure there is no previously scheduled timeout work executing prior to starting controller initialization by synchronizing with each queue. Reviewed-by: Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com>
-
Committed by Keith Busch
The reset_work waits for queued IO to complete before setting the controller to live. If any of these times out and requeues, we won't be able to restart the controller because the reset_work is already running. Flush all entered requests to a failed completion if a timeout occurs in the connecting state, and ensure the controller can't transition to the live state after we've unblocked it from waiting for completions. Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com>
-
Committed by Keith Busch
The reset state doesn't dispatch commands that it needs to wait for anymore. If a timeout occurs in this state, the reset work is already disabling the controller, so just reset the request's timer. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com>
-
Committed by Keith Busch
If a controller disabling didn't start a freeze, don't wait for the operation to complete. Reviewed-by: Ming Lei <ming.lei@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Keith Busch <keith.busch@intel.com>
-
- 13 May, 2019 2 commits
-
-
Committed by Gustavo A. R. Silva
In preparation for enabling -Wimplicit-fallthrough, mark switch cases where we are expecting to fall through. This patch fixes the following warning: drivers/nvme/host/pci.c: In function ‘nvme_timeout’: drivers/nvme/host/pci.c:1298:12: warning: this statement may fall through [-Wimplicit-fallthrough=] shutdown = true; ~~~~~~~~~^~~~~~ drivers/nvme/host/pci.c:1299:2: note: here case NVME_CTRL_CONNECTING: ^~~~ Warning level 3 was used: -Wimplicit-fallthrough=3 This patch is part of the ongoing efforts to enable -Wimplicit-fallthrough. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
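A small standalone example of the annotation this commit adds. The states, actions and `demo_timeout()` are hypothetical names for illustration. The only point is the comment placed directly before the next `case` label, which GCC recognizes at `-Wimplicit-fallthrough=3`.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical states and actions, named for illustration only. */
enum demo_state { DEMO_LIVE, DEMO_DELETING, DEMO_CONNECTING };
enum demo_action { DEMO_RESET_TIMER, DEMO_DISABLE, DEMO_DISABLE_AND_SHUTDOWN };

static enum demo_action demo_timeout(enum demo_state state)
{
    bool shutdown = false;

    switch (state) {
    case DEMO_DELETING:
        shutdown = true;
        /* fall through */
    case DEMO_CONNECTING:
        /* Shared handling: both states disable the controller. */
        return shutdown ? DEMO_DISABLE_AND_SHUTDOWN : DEMO_DISABLE;
    default:
        return DEMO_RESET_TIMER;
    }
}
```

Without the comment, compiling with `-Wimplicit-fallthrough=3` reports the same "this statement may fall through" warning quoted in the commit message.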
-
Committed by Maxim Levitsky
The spec states: "The settings are not retained across a Controller Level Reset". Therefore the driver must enable the shadow doorbell after each reset. This was caught while testing the nvme driver over the upcoming nvme-mdev device. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Minwoo Im <minwoo.im@samsung.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
- 01 May, 2019 6 commits
-
-
Committed by Christoph Hellwig
Most commands aren't PCIe specific, so move the size checking for them to core.c. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
-
Committed by Minwoo Im
Every NVMe command has a fixed size of 64 bytes, which is enforced with BUILD_BUG_ON(). The remaining command structures in linux/nvme.h also need to be checked here. Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
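The same kind of compile-time size check can be reproduced in userspace with C11 `static_assert`, which plays the role BUILD_BUG_ON() plays in the kernel. The struct below is a hypothetical, illustrative layout that adds up to 64 bytes; it is not copied from linux/nvme.h.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative 64-byte command layout; field names are assumptions. */
struct demo_rw_command {
    uint8_t  opcode;
    uint8_t  flags;
    uint16_t command_id;
    uint32_t nsid;
    uint64_t rsvd2;
    uint64_t metadata;
    uint64_t prp1;
    uint64_t prp2;
    uint64_t slba;
    uint16_t length;
    uint16_t control;
    uint32_t dsmgmt;
    uint32_t reftag;
    uint16_t apptag;
    uint16_t appmask;
};

/* Fails the build, not the run, if a field is added or resized by mistake. */
static_assert(sizeof(struct demo_rw_command) == 64,
              "NVMe commands must be exactly 64 bytes");
```

Since all fields are naturally aligned, the compiler inserts no padding, and any accidental change to the layout breaks compilation immediately.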
-
Committed by Minwoo Im
Variable "n" is assigned once kstrtoint() succeeds. Otherwise it is never read, because kstrtoint() returns an error and the function returns early. Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Committed by Keith Busch
Just like IO queues, the admin queue also will not be restarted after a controller shutdown. Unquiesce this queue so that we do not block request dispatch on a permanently disabled controller. Reported-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Committed by Keith Busch
We do not restart a controller in a deleting state for timeout errors. When in this state, unblock potential request dispatchers with failed completions by shutting down the controller on timeout detection. Reported-by: Yufen Yu <yuyufen@huawei.com> Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Committed by Klaus Birkelund Jensen
The shortcut for single segment SGL requests did not set the PSDT field to mark the request as using SGLs. Fixes: 29791057 ("nvme-pci: optimize mapping single segment requests using SGLs") Signed-off-by: Klaus Birkelund Jensen <klaus.jensen@cnexlabs.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
- 05 Apr, 2019 13 commits
-
-
Committed by Christoph Hellwig
Remove two pointless local variables, remove a ret assignment that is never used, and move the use_sgl initialization closer to where it is used. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
-
Committed by Christoph Hellwig
If the controller supports SGLs we can take another shortcut for single segment requests, given that we can always map those without another indirection structure, and thus don't need to create a scatterlist structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
-
Committed by Christoph Hellwig
If a request is single segment and fits into one or two PRP entries we do not have to create a scatterlist for it, but can just map the bio_vec directly. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
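The "fits into one or two PRP entries" condition can be worked through with a small sketch. It assumes a physically contiguous segment and a power-of-two controller page size. `demo_prps` and `demo_map_simple()` are hypothetical names, not the driver's helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct demo_prps {
    uint64_t prp1; /* address of the first byte of the segment */
    uint64_t prp2; /* start of the second page, or 0 if unused */
};

/*
 * Returns true and fills *out if a single contiguous segment can be
 * described by at most two PRP entries, false if a PRP list is needed.
 */
static bool demo_map_simple(uint64_t dma_addr, uint32_t len,
                            uint32_t page_size, struct demo_prps *out)
{
    uint32_t offset;
    uint32_t first_len;

    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);

    offset = (uint32_t)(dma_addr & (page_size - 1));
    first_len = page_size - offset; /* bytes covered by the first entry */

    if ((uint64_t)offset + len > 2ull * page_size)
        return false;

    out->prp1 = dma_addr;
    out->prp2 = len > first_len ? dma_addr + first_len : 0;
    return true;
}
```

For example, with a 4096-byte page, a 4096-byte segment starting at offset 512 spans two pages (3584 bytes in the first, 512 in the second) and still fits. An 8192-byte segment at the same offset would touch three pages and does not.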
-
Committed by Christoph Hellwig
We'll have a better way to optimize for small I/O that doesn't require it soon, so remove the existing inline_sg case to make that optimization easier to implement. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
-
Committed by Christoph Hellwig
This prepares for some bigger changes to the data mapping helpers. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
-
Committed by Christoph Hellwig
We always have exactly one segment, so we can simply call dma_map_bvec. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
-
Committed by Christoph Hellwig
This mirrors how nvme_map_pci is called and will allow simplifying some checks in nvme_unmap_pci later on. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
-
Committed by Christoph Hellwig
This means we now have a function that undoes everything nvme_map_data does and we can simplify the error handling a bit. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
-
Committed by Christoph Hellwig
Cleaning up the command setup isn't related to unmapping data, and disentangling them will simplify error handling a bit down the road. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
-
Committed by Christoph Hellwig
nvme_init_iod should really be split into two parts: initializing a few general iod fields, which can easily be done at the beginning of nvme_queue_rq, and allocating the scatterlist if needed, which logically belongs in nvme_map_data with the code making use of it. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
-
Committed by Keith Busch
Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Committed by Keith Busch
We don't need to save the dma device as it's not used in the hot path and hasn't been in a long time. Shrink struct nvme_queue by removing this unnecessary member. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
Committed by Keith Busch
A negative value for the cq_vector used to mean the queue is either disabled or a polled queue. However, we have a queue enabled flag, so the cq_vector had been serving double duty. Don't overload the meaning of cq_vector. Use a flag specific to the polled queues instead. Signed-off-by: Keith Busch <keith.busch@intel.com> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Christoph Hellwig <hch@lst.de>
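A minimal sketch of the idea: a dedicated flag answers "is this queue polled?", so the vector field no longer needs a negative sentinel. The flag names, `demo_queue` and both helpers are hypothetical; the kernel uses its own bit operations and flag definitions.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-queue state bits. */
enum {
    DEMO_Q_ENABLED = 1u << 0,
    DEMO_Q_POLLED  = 1u << 1
};

struct demo_queue {
    unsigned int flags;
    unsigned int cq_vector; /* only meaningful for interrupt-driven queues */
};

static bool demo_queue_is_polled(const struct demo_queue *q)
{
    return (q->flags & DEMO_Q_POLLED) != 0;
}

/* A queue needs its interrupt vector only if it is enabled and not polled. */
static bool demo_queue_uses_irq(const struct demo_queue *q)
{
    return (q->flags & DEMO_Q_ENABLED) != 0 && !demo_queue_is_polled(q);
}
```

With the state kept in flags, `cq_vector` can be an unsigned index, and the "disabled" and "polled" conditions can be tested independently.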
-
- 14 Mar, 2019 1 commit
-
-
Committed by Christoph Hellwig
Qemu started out with a broken implementation of Write Zeroes written by yours truly. Disable Write Zeroes on qemu for now; eventually we need to go back and make all the qemu quirks version specific, but that is left for another time. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Keith Busch <keith.busch@intel.com> Tested-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 20 Feb, 2019 2 commits
-
-
Committed by Christoph Hellwig
Update license to use SPDX-License-Identifier instead of verbose license text. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
-
Committed by Bart Van Assche
This patch keeps the compiler from complaining about 'ret' being set but not used when building with W=1. Fixes: 3b6592f7 ("nvme: utilize two queue maps, one for reads and one for writes") # v5.0-rc1 Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
-
- 18 Feb, 2019 2 commits
-
-
Committed by Ming Lei
The NVME PCI driver contains a tedious mechanism for interrupt allocation, which is necessary to adjust the number and size of interrupt sets to the maximum available number of interrupts, which depends on the underlying PCI capabilities and the available CPU resources. It works around the former shortcomings of the PCI and core interrupt allocation mechanisms in combination with interrupt sets. The PCI interrupt allocation function accepts a maximum and a minimum number of interrupts to be allocated and tries to allocate as many as possible. This worked without driver interaction as long as there was only a single set of interrupts to handle. With the addition of support for multiple interrupt sets in the generic affinity spreading logic, which is invoked from the PCI interrupt allocation, the adaptive loop in the PCI interrupt allocation did not work for multiple interrupt sets. The reason is that depending on the total number of interrupts which the PCI allocation adaptive loop tries to allocate in each step, the number and the size of the interrupt sets need to be adapted as well. Due to the way the interrupt sets support was implemented there was no way for the PCI interrupt allocation code or the core affinity spreading mechanism to invoke a driver specific function for adapting the interrupt sets configuration. As a consequence the driver had to implement another adaptive loop around the PCI interrupt allocation function, calling it with maximum and minimum interrupts set to the same value. This ensured that the allocation either succeeded or immediately failed without any attempt to adjust the number of interrupts in the PCI code. The core code now allows drivers to provide a callback to recalculate the number and the size of interrupt sets during PCI interrupt allocation, which in turn allows the PCI interrupt allocation function to be called in the same way as with a single set of interrupts. The PCI code handles the adaptive loop and the interrupt affinity spreading mechanism invokes the driver callback to adapt the interrupt set configuration to the current loop value. This replaces the adaptive loop in the driver completely. Implement the NVME specific callback which adjusts the interrupt sets configuration and remove the adaptive allocation loop. [ tglx: Simplify the callback further and restore the dropped adjustment of number of sets ] Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Bjorn Helgaas <helgaas@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Cc: Sagi Grimberg <sagi@grimberg.me> Cc: linux-nvme@lists.infradead.org Cc: linux-pci@vger.kernel.org Cc: Keith Busch <keith.busch@intel.com> Cc: Sumit Saxena <sumit.saxena@broadcom.com> Cc: Kashyap Desai <kashyap.desai@broadcom.com> Cc: Shivasharan Srikanteshwara <shivasharan.srikanteshwara@broadcom.com> Link: https://lkml.kernel.org/r/20190216172228.602546658@linutronix.de
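The division of labor described here, with the core running the adaptive loop and the driver supplying a callback that recomputes the sets for each candidate count, can be sketched in userspace C. Everything below is a simplified model with hypothetical names: `demo_affinity` loosely imitates the shape of the structure the message talks about, `available` stands in for what the platform can actually provide, and the half/half split in `demo_calc_sets()` is arbitrary.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_SETS 4

struct demo_affinity {
    unsigned int nr_sets;
    unsigned int set_size[DEMO_MAX_SETS];
    /* Driver callback: recompute the sets for nvecs vectors. */
    void (*calc_sets)(struct demo_affinity *aff, unsigned int nvecs);
};

/* Driver side: split nvecs into a default set and a read set. */
static void demo_calc_sets(struct demo_affinity *aff, unsigned int nvecs)
{
    unsigned int nr_read = nvecs > 1 ? nvecs / 2 : 0;

    aff->nr_sets = nr_read ? 2 : 1;
    aff->set_size[0] = nvecs - nr_read;
    aff->set_size[1] = nr_read;
}

/*
 * Core side: try max_vecs down to min_vecs, letting the driver adapt its
 * sets to each candidate. Returns the allocated count, or -1 on failure.
 */
static int demo_alloc_vectors(struct demo_affinity *aff,
                              unsigned int min_vecs, unsigned int max_vecs,
                              unsigned int available)
{
    unsigned int n;

    assert(aff != NULL && min_vecs >= 1 && min_vecs <= max_vecs);

    for (n = max_vecs; n >= min_vecs; n--) {
        if (aff->calc_sets)
            aff->calc_sets(aff, n);
        if (n <= available)
            return (int)n; /* sets already match this count */
    }
    return -1;
}
```

Because the callback runs inside the loop, the driver can pass distinct minimum and maximum values instead of wrapping the allocator in its own retry loop with both set to the same number.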
-
Committed by Ming Lei
The interrupt affinity spreading mechanism supports spreading out affinities for one or more interrupt sets. An interrupt set contains one or more interrupts. Each set is mapped to a specific functionality of a device, e.g. general I/O queues and read I/O queues of multiqueue block devices. The number of interrupts per set is defined by the driver. It depends on the total number of available interrupts for the device, which is determined by the PCI capabilities and the availability of underlying CPU resources, and the number of queues which the device provides and the driver wants to instantiate. The driver passes initial configuration for the interrupt allocation via a pointer to struct irq_affinity. Right now the allocation mechanism is complex as it requires a loop in the driver to determine the maximum number of interrupts which are provided by the PCI capabilities and the underlying CPU resources. This loop would have to be replicated in every driver which wants to utilize this mechanism. That's unwanted code duplication and error prone. In order to move this into generic facilities it is required to have a mechanism, which allows the recalculation of the interrupt sets and their size, in the core code. As the core code does not have any knowledge about the underlying device, a driver specific callback will be added to struct affinity_desc, which will be invoked by the core code. The callback will get the number of available interrupts as an argument, so the driver can calculate the corresponding number and size of interrupt sets. To support this, two modifications for the handling of struct irq_affinity are required: 1) The (optional) interrupt sets size information is contained in a separate array of integers and struct irq_affinity contains a pointer to it. This is cumbersome and as the maximum number of interrupt sets is small, there is no reason to have separate storage. Moving the size array into struct affinity_desc avoids indirections and makes the code simpler. 2) At the moment the struct irq_affinity pointer which is handed in from the driver and passed through to several core functions is marked 'const'. With the upcoming callback to recalculate the number and size of interrupt sets, it's necessary to remove the 'const' qualifier. Otherwise the callback would not be able to update the data. Implement #1 and store the interrupt sets size in 'struct irq_affinity'. No functional change. [ tglx: Fixed the memcpy() size so it won't copy beyond the size of the source. Fixed the kernel doc comments for struct irq_affinity and de-'This patch'-ed the changelog ] Signed-off-by: Ming Lei <ming.lei@redhat.com> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Acked-by: Marc Zyngier <marc.zyngier@arm.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Bjorn Helgaas <helgaas@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Cc: Sagi Grimberg <sagi@grimberg.me> Cc: linux-nvme@lists.infradead.org Cc: linux-pci@vger.kernel.org Cc: Keith Busch <keith.busch@intel.com> Cc: Sumit Saxena <sumit.saxena@broadcom.com> Cc: Kashyap Desai <kashyap.desai@broadcom.com> Cc: Shivasharan Srikanteshwara <shivasharan.srikanteshwara@broadcom.com> Link: https://lkml.kernel.org/r/20190216172228.423723127@linutronix.de
-
- 12 Feb, 2019 1 commit
-
-
Committed by Keith Busch
The reset work holds a mutex to prevent races with removal modifying the same resources, but was unlocking only on success. Unlock on failure too. Fixes: 5c959d73 ("nvme-pci: fix rapid add remove sequence") Signed-off-by: Keith Busch <keith.busch@intel.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
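One common way to guarantee the unlock on every path is a single exit label, sketched below. The lock is modeled by a plain flag so the example stays self-contained. `demo_lock`, `demo_acquire()`, `demo_release()` and `demo_reset()` are hypothetical, and the actual patch may structure its error path differently.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy lock: a flag is enough to show whether it is left held. */
struct demo_lock {
    bool held;
};

static void demo_acquire(struct demo_lock *l)
{
    assert(!l->held);
    l->held = true;
}

static void demo_release(struct demo_lock *l)
{
    assert(l->held);
    l->held = false;
}

/* Returns 0 on success, -EIO if the guarded step fails. */
static int demo_reset(struct demo_lock *l, bool step_ok)
{
    int result = 0;

    demo_acquire(l);

    if (!step_ok) {
        result = -EIO;
        goto out_unlock; /* the failure path must not skip the unlock */
    }

    /* ... remaining work done under the lock would go here ... */

out_unlock:
    demo_release(l);
    return result;
}
```

If the failure branch returned directly instead of jumping to `out_unlock`, the lock would stay held and any later attempt to take it, for example from a removal path, would block forever.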
-