- 29 7月, 2020 3 次提交
-
-
由 Sagi Grimberg 提交于
A deadlock happens in the following scenario with multipath: 1) scan_work(nvme0) detects a new nsid while nvme0 is an optimized path to it, path nvme1 happens to be inaccessible. 2) Before scan_work is complete nvme0 disconnect is initiated nvme_delete_ctrl_sync() sets nvme0 state to NVME_CTRL_DELETING 3) scan_work(1) attempts to submit IO, but nvme_path_is_optimized() observes nvme0 is not LIVE. Since nvme1 is a possible path IO is requeued and scan_work hangs. -- Workqueue: nvme-wq nvme_scan_work [nvme_core] kernel: Call Trace: kernel: __schedule+0x2b9/0x6c0 kernel: schedule+0x42/0xb0 kernel: io_schedule+0x16/0x40 kernel: do_read_cache_page+0x438/0x830 kernel: read_cache_page+0x12/0x20 kernel: read_dev_sector+0x27/0xc0 kernel: read_lba+0xc1/0x220 kernel: efi_partition+0x1e6/0x708 kernel: check_partition+0x154/0x244 kernel: rescan_partitions+0xae/0x280 kernel: __blkdev_get+0x40f/0x560 kernel: blkdev_get+0x3d/0x140 kernel: __device_add_disk+0x388/0x480 kernel: device_add_disk+0x13/0x20 kernel: nvme_mpath_set_live+0x119/0x140 [nvme_core] kernel: nvme_update_ns_ana_state+0x5c/0x60 [nvme_core] kernel: nvme_set_ns_ana_state+0x1e/0x30 [nvme_core] kernel: nvme_parse_ana_log+0xa1/0x180 [nvme_core] kernel: nvme_mpath_add_disk+0x47/0x90 [nvme_core] kernel: nvme_validate_ns+0x396/0x940 [nvme_core] kernel: nvme_scan_work+0x24f/0x380 [nvme_core] kernel: process_one_work+0x1db/0x380 kernel: worker_thread+0x249/0x400 kernel: kthread+0x104/0x140 -- 4) Delete also hangs in flush_work(ctrl->scan_work) from nvme_remove_namespaces(). Similiarly a deadlock with ana_work may happen: if ana_work has started and calls nvme_mpath_set_live and device_add_disk, it will trigger I/O. When we trigger disconnect I/O will block because our accessible (optimized) path is disconnecting, but the alternate path is inaccessible, so I/O blocks. Then disconnect tries to flush the ana_work and hangs. [ 605.550896] Workqueue: nvme-wq nvme_ana_work [nvme_core] [ 605.552087] Call Trace: [ 605.552683] __schedule+0x2b9/0x6c0 [ 605.553507] schedule+0x42/0xb0 [ 605.554201] io_schedule+0x16/0x40 [ 605.555012] do_read_cache_page+0x438/0x830 [ 605.556925] read_cache_page+0x12/0x20 [ 605.557757] read_dev_sector+0x27/0xc0 [ 605.558587] amiga_partition+0x4d/0x4c5 [ 605.561278] check_partition+0x154/0x244 [ 605.562138] rescan_partitions+0xae/0x280 [ 605.563076] __blkdev_get+0x40f/0x560 [ 605.563830] blkdev_get+0x3d/0x140 [ 605.564500] __device_add_disk+0x388/0x480 [ 605.565316] device_add_disk+0x13/0x20 [ 605.566070] nvme_mpath_set_live+0x5e/0x130 [nvme_core] [ 605.567114] nvme_update_ns_ana_state+0x2c/0x30 [nvme_core] [ 605.568197] nvme_update_ana_state+0xca/0xe0 [nvme_core] [ 605.569360] nvme_parse_ana_log+0xa1/0x180 [nvme_core] [ 605.571385] nvme_read_ana_log+0x76/0x100 [nvme_core] [ 605.572376] nvme_ana_work+0x15/0x20 [nvme_core] [ 605.573330] process_one_work+0x1db/0x380 [ 605.574144] worker_thread+0x4d/0x400 [ 605.574896] kthread+0x104/0x140 [ 605.577205] ret_from_fork+0x35/0x40 [ 605.577955] INFO: task nvme:14044 blocked for more than 120 seconds. [ 605.579239] Tainted: G OE 5.3.5-050305-generic #201910071830 [ 605.580712] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 605.582320] nvme D 0 14044 14043 0x00000000 [ 605.583424] Call Trace: [ 605.583935] __schedule+0x2b9/0x6c0 [ 605.584625] schedule+0x42/0xb0 [ 605.585290] schedule_timeout+0x203/0x2f0 [ 605.588493] wait_for_completion+0xb1/0x120 [ 605.590066] __flush_work+0x123/0x1d0 [ 605.591758] __cancel_work_timer+0x10e/0x190 [ 605.593542] cancel_work_sync+0x10/0x20 [ 605.594347] nvme_mpath_stop+0x2f/0x40 [nvme_core] [ 605.595328] nvme_stop_ctrl+0x12/0x50 [nvme_core] [ 605.596262] nvme_do_delete_ctrl+0x3f/0x90 [nvme_core] [ 605.597333] nvme_sysfs_delete+0x5c/0x70 [nvme_core] [ 605.598320] dev_attr_store+0x17/0x30 Fix this by introducing a new state: NVME_CTRL_DELETE_NOIO, which will indicate the phase of controller deletion where I/O cannot be allowed to access the namespace. NVME_CTRL_DELETING still allows mpath I/O to be issued to the bottom device, and only after we flush the ana_work and scan_work (after nvme_stop_ctrl and nvme_prep_remove_namespaces) we change the state to NVME_CTRL_DELETING_NOIO. Also we prevent ana_work from re-firing by aborting early if we are not LIVE, so we should be safe here. In addition, change the transport drivers to follow the updated state machine. Fixes: 0d0b660f ("nvme: add ANA support") Reported-by: NAnton Eidelman <anton@lightbitslabs.com> Signed-off-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
由 Sagi Grimberg 提交于
We are starting to see some non-trivial states so lets start documenting them. Signed-off-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
由 Chaitanya Kulkarni 提交于
Saving the nvme controller's page size was from a time when the driver tried to use different sized pages, but this value is always set to a constant, and has been this way for some time. Remove the 'page_size' field and replace its usage with the constant value. This also lets the compiler make some micro-optimizations in the io path, and that's always a good thing. Signed-off-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
- 08 7月, 2020 3 次提交
-
-
由 Keith Busch 提交于
Add support for NVM Express Zoned Namespaces (ZNS) Command Set defined in NVM Express TP4053. Zoned namespaces are discovered based on their Command Set Identifier reported in the namespaces Namespace Identification Descriptor list. A successfully discovered Zoned Namespace will be registered with the block layer as a host managed zoned block device with Zone Append command support. A namespace that does not support append is not supported by the driver. Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com> Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: NHannes Reinecke <hare@suse.de> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NJavier González <javier.gonz@samsung.com> Reviewed-by: NHimanshu Madhani <himanshu.madhani@oracle.com> Signed-off-by: NHans Holmberg <hans.holmberg@wdc.com> Signed-off-by: NDmitry Fomichev <dmitry.fomichev@wdc.com> Signed-off-by: NAjay Joshi <ajay.joshi@wdc.com> Signed-off-by: NAravind Ramesh <aravind.ramesh@wdc.com> Signed-off-by: NNiklas Cassel <niklas.cassel@wdc.com> Signed-off-by: NMatias Bjørling <matias.bjorling@wdc.com> Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com> Signed-off-by: NKeith Busch <keith.busch@wdc.com> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
由 Keith Busch 提交于
The Commands Supported and Effects log page was extended with a CSI field that enables the host to query the log page for each command set supported. Retrieve this log page for each command set that an attached namespace supports, and save a pointer to that log in the namespace head. Reviewed-by: NMatias Bjørling <matias.bjorling@wdc.com> Reviewed-by: NJavier González <javier.gonz@samsung.com> Reviewed-by: NHimanshu Madhani <himanshu.madhani@oracle.com> Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com> Reviewed-by: NHannes Reinecke <hare@suse.de> Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: NDaniel Wagner <dwagner@suse.de> Signed-off-by: NKeith Busch <keith.busch@wdc.com> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
由 Niklas Cassel 提交于
Implements support for multiple I/O Command Sets. NVMe TP 4056 introduces a method to enumerate multiple command sets per namespace. If the command set is exposed, this method for enumeration will be used instead of the traditional method that uses the CC.CSS register command set register for command set identification. For namespaces where the Command Set Identifier is not supported or recognized, the specific namespace will not be created. Reviewed-by: NJavier González <javier.gonz@samsung.com> Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com> Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: NMatias Bjørling <matias.bjorling@wdc.com> Reviewed-by: NDaniel Wagner <dwagner@suse.de> Reviewed-by: NHimanshu Madhani <himanshu.madhani@oracle.com> Reviewed-by: NHannes Reinecke <hare@suse.de> Signed-off-by: NNiklas Cassel <niklas.cassel@wdc.com> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
- 01 7月, 2020 1 次提交
-
-
由 Christoph Hellwig 提交于
The make_request_fn is a little weird in that it sits directly in struct request_queue instead of an operation vector. Replace it with a block_device_operations method called submit_bio (which describes much better what it does). Also remove the request_queue argument to it, as the queue can be derived pretty trivially from the bio. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 25 6月, 2020 1 次提交
-
-
由 Anton Eidelman 提交于
In the following scenario scan_work and ana_work will deadlock: When scan_work calls nvme_mpath_add_disk() this holds ana_lock and invokes nvme_parse_ana_log(), which may issue IO in device_add_disk() and hang waiting for an accessible path. While nvme_mpath_set_live() only called when nvme_state_is_live(), a transition may cause NVME_SC_ANA_TRANSITION and requeue the IO. Since nvme_mpath_set_live() holds ns->head->lock, an ana_work on ANY ctrl will not be able to complete nvme_mpath_set_live() on the same ns->head, which is required in order to update the new accessible path and remove NVME_NS_ANA_PENDING.. Therefore IO never completes: deadlock [1]. Fix: Move device_add_disk out of the head->lock and protect it with an atomic test_and_set for a new NVME_NS_HEAD_HAS_DISK bit. [1]: kernel: INFO: task kworker/u8:2:160 blocked for more than 120 seconds. kernel: Tainted: G OE 5.3.5-050305-generic #201910071830 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. kernel: kworker/u8:2 D 0 160 2 0x80004000 kernel: Workqueue: nvme-wq nvme_ana_work [nvme_core] kernel: Call Trace: kernel: __schedule+0x2b9/0x6c0 kernel: schedule+0x42/0xb0 kernel: schedule_preempt_disabled+0xe/0x10 kernel: __mutex_lock.isra.0+0x182/0x4f0 kernel: __mutex_lock_slowpath+0x13/0x20 kernel: mutex_lock+0x2e/0x40 kernel: nvme_update_ns_ana_state+0x22/0x60 [nvme_core] kernel: nvme_update_ana_state+0xca/0xe0 [nvme_core] kernel: nvme_parse_ana_log+0xa1/0x180 [nvme_core] kernel: nvme_read_ana_log+0x76/0x100 [nvme_core] kernel: nvme_ana_work+0x15/0x20 [nvme_core] kernel: process_one_work+0x1db/0x380 kernel: worker_thread+0x4d/0x400 kernel: kthread+0x104/0x140 kernel: ret_from_fork+0x35/0x40 kernel: INFO: task kworker/u8:4:439 blocked for more than 120 seconds. kernel: Tainted: G OE 5.3.5-050305-generic #201910071830 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. kernel: kworker/u8:4 D 0 439 2 0x80004000 kernel: Workqueue: nvme-wq nvme_scan_work [nvme_core] kernel: Call Trace: kernel: __schedule+0x2b9/0x6c0 kernel: schedule+0x42/0xb0 kernel: io_schedule+0x16/0x40 kernel: do_read_cache_page+0x438/0x830 kernel: read_cache_page+0x12/0x20 kernel: read_dev_sector+0x27/0xc0 kernel: read_lba+0xc1/0x220 kernel: efi_partition+0x1e6/0x708 kernel: check_partition+0x154/0x244 kernel: rescan_partitions+0xae/0x280 kernel: __blkdev_get+0x40f/0x560 kernel: blkdev_get+0x3d/0x140 kernel: __device_add_disk+0x388/0x480 kernel: device_add_disk+0x13/0x20 kernel: nvme_mpath_set_live+0x119/0x140 [nvme_core] kernel: nvme_update_ns_ana_state+0x5c/0x60 [nvme_core] kernel: nvme_mpath_add_disk+0xbe/0x100 [nvme_core] kernel: nvme_validate_ns+0x396/0x940 [nvme_core] kernel: nvme_scan_work+0x256/0x390 [nvme_core] kernel: process_one_work+0x1db/0x380 kernel: worker_thread+0x4d/0x400 kernel: kthread+0x104/0x140 kernel: ret_from_fork+0x35/0x40 Fixes: 0d0b660f ("nvme: add ANA support") Signed-off-by: NAnton Eidelman <anton@lightbitslabs.com> Signed-off-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
- 24 6月, 2020 2 次提交
-
-
由 Christoph Hellwig 提交于
Use the new blk_mq_complete_request_remote helper to avoid an indirect function call in the completion fast path. Reviewed-by: NDaniel Wagner <dwagner@suse.de> Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Christoph Hellwig 提交于
Move the call to blk_should_fake_timeout out of blk_mq_complete_request and into the drivers, skipping call sites that are obvious error handlers, and remove the now superflous blk_mq_force_complete_rq helper. This ensures we don't keep injecting errors into completions that just terminate the Linux request after the hardware has been reset or the command has been aborted. Reviewed-by: NDaniel Wagner <dwagner@suse.de> Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 05 6月, 2020 1 次提交
-
-
由 Christoph Hellwig 提交于
The status can be trivially derived from the bio itself. That also avoid callers like NVMe to incorrectly pass a blk_status_t instead of the errno, and the overhead of translating the blk_status_t to the errno in the I/O completion fast path when no tracing is enabled. Fixes: 35fe0d12 ("nvme: trace bio completion") Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 27 5月, 2020 6 次提交
-
-
由 Israel Rukshin 提交于
SGL size of metadata is usually small. Thus, 1 inline sg should cover most cases. The macro will be used for pre-allocate a single SGL entry for metadata. The preallocation of small inline SGLs depends on SG_CHAIN capability so if the ARCH doesn't support SG_CHAIN, use the runtime allocation for the SGL. This patch is a preparation for adding metadata (T10-PI) over fabric support. Signed-off-by: NIsrael Rukshin <israelr@mellanox.com> Signed-off-by: NMax Gurtovoy <maxg@mellanox.com> Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com> Reviewed-by: NJames Smart <james.smart@broadcom.com> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
由 Max Gurtovoy 提交于
This patch doesn't change any logic, and is needed as a preparation for adding PI support for fabrics drivers that will use an extended LBA format for metadata and will support more than 1 integrity segment. Signed-off-by: NMax Gurtovoy <maxg@mellanox.com> Signed-off-by: NIsrael Rukshin <israelr@mellanox.com> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com> Reviewed-by: NJames Smart <james.smart@broadcom.com> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
由 James Smart 提交于
Move the nvme_ns_has_pi() inline from core.c to the nvme.h header. This allows use by the transports. Signed-off-by: NJames Smart <jsmart2021@gmail.com> [maxg: added a comment for nvme_ns_has_pi()] Signed-off-by: NMax Gurtovoy <maxg@mellanox.com> Reviewed-by: NIsrael Rukshin <israelr@mellanox.com> Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
由 Max Gurtovoy 提交于
This is a preparation for adding support for metadata in fabric controllers. New flag will imply that NVMe namespace supports getting metadata that was originally generated by host's block layer. Signed-off-by: NMax Gurtovoy <maxg@mellanox.com> Reviewed-by: NIsrael Rukshin <israelr@mellanox.com> Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com> Reviewed-by: NJames Smart <james.smart@broadcom.com> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
由 Max Gurtovoy 提交于
Replace the specific ext boolean (that implies on extended LBA format) with a feature in the new namespace features flag. This is a preparation for adding more namespace features (such as metadata specific features). Signed-off-by: NMax Gurtovoy <maxg@mellanox.com> Reviewed-by: NIsrael Rukshin <israelr@mellanox.com> Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com> Reviewed-by: NJames Smart <james.smart@broadcom.com> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
由 Dan Carpenter 提交于
The nvme_put_ctrl() is implemented earlier as an inline function so this declaration isn't required. Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
- 10 5月, 2020 3 次提交
-
-
由 Keith Busch 提交于
Move the quirked chunk_sectors setting to the same location as noiob so one place registers this setting. And since the noiob value is only used locally, remove the member from struct nvme_ns. Signed-off-by: NKeith Busch <kbusch@kernel.org> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Keith Busch 提交于
Reject a new shared namespace if a duplicate unshared namespace exists. Signed-off-by: NKeith Busch <kbusch@kernel.org> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Keith Busch 提交于
Various nvme commands use a zeroes based number of dwords field. Create a helper function to convert byte lengths to this format. Signed-off-by: NKeith Busch <kbusch@kernel.org> Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com> Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 26 3月, 2020 2 次提交
-
-
由 Israel Rukshin 提交于
Calling nvme_sysfs_delete() when the controller is in the middle of creation may cause several bugs. If the controller is in NEW state we remove delete_controller file and don't delete the controller. The user will not be able to use nvme disconnect command on that controller again, although the controller may be active. Other bugs may happen if the controller is in the middle of create_ctrl callback and nvme_do_delete_ctrl() starts. For example, freeing I/O tagset at nvme_do_delete_ctrl() before it was allocated at create_ctrl callback. To fix all those races don't allow the user to delete the controller before it was fully created. Signed-off-by: NIsrael Rukshin <israelr@mellanox.com> Reviewed-by: NMax Gurtovoy <maxg@mellanox.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NKeith Busch <kbusch@kernel.org>
-
由 John Meneghini 提交于
The nvme multipath error handling defaults to controller reset if the error is unknown. There are, however, no existing nvme status codes that indicate a reset should be used, and resetting causes unnecessary disruption to the rest of IO. Change nvme's error handling to first check if failover should happen. If not, let the normal error handling take over rather than reset the controller. Based-on-a-patch-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NHannes Reinecke <hare@suse.de> Signed-off-by: NJohn Meneghini <johnm@netapp.com> Signed-off-by: NKeith Busch <kbusch@kernel.org>
-
- 27 11月, 2019 1 次提交
-
-
由 Israel Rukshin 提交于
nvme_rdma_alloc_tagset() preallocates a big buffer for the IO SGL based on SG_CHUNK_SIZE. Modern DMA engines are often capable of dealing with very big segments so the SG_CHUNK_SIZE is often too big. SG_CHUNK_SIZE results in a static 4KB SGL allocation per command. If a controller has lots of deep queues, preallocation for the sg list can consume substantial amounts of memory. For nvme-rdma, nr_hw_queues can be 128 and each queue's depth 128. This means the resulting preallocation for the data SGL is 128*128*4K = 64MB per controller. Switch to runtime allocation for SGL for lists longer than 2 entries. This is the approach used by NVMe PCI so it should be reasonable for NVMeOF as well. Runtime SGL allocation has always been the case for the legacy I/O path so this is nothing new. The preallocated small SGL depends on SG_CHAIN so if the ARCH doesn't support SG_CHAIN, use only runtime allocation for the SGL. We didn't notice of a performance degradation, since for small IOs we'll use the inline SG and for the bigger IOs the allocation of a bigger SGL from slab is fast enough. Suggested-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NMax Gurtovoy <maxg@mellanox.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NIsrael Rukshin <israelr@mellanox.com> Signed-off-by: NKeith Busch <kbusch@kernel.org>
-
- 22 11月, 2019 1 次提交
-
-
由 Akinobu Mita 提交于
This adds a new quirk NVME_QUIRK_NO_TEMP_THRESH_CHANGE to avoid changing the value of the temperature threshold feature for specific devices that show undesirable behavior. Guenter reported: "On my Intel NVME drive (SSDPEKKW512G7), writing any minimum limit on the Composite temperature sensor results in a temperature warning, and that warning is sticky until I reset the controller. It doesn't seem to matter which temperature I write; writing -273000 has the same result." The Intel NVMe has the latest firmware version installed, so this isn't a problem that was ever fixed. Reported-by: NGuenter Roeck <linux@roeck-us.net> Cc: Keith Busch <kbusch@kernel.org> Cc: Jens Axboe <axboe@fb.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Sagi Grimberg <sagi@grimberg.me> Cc: Jean Delvare <jdelvare@suse.com> Reviewed-by: NGuenter Roeck <linux@roeck-us.net> Tested-by: NGuenter Roeck <linux@roeck-us.net> Signed-off-by: NAkinobu Mita <akinobu.mita@gmail.com> Signed-off-by: NKeith Busch <kbusch@kernel.org>
-
- 12 11月, 2019 1 次提交
-
-
由 Guenter Roeck 提交于
nvme devices report temperature information in the controller information (for limits) and in the smart log. Currently, the only means to retrieve this information is the nvme command line interface, which requires super-user privileges. At the same time, it would be desirable to be able to use NVMe temperature information for thermal control. This patch adds support to read NVMe temperatures from the kernel using the hwmon API and adds temperature zones for NVMe drives. The thermal subsystem can use this information to set thermal policies, and userspace can access it using libsensors and/or the "sensors" command. Example output from the "sensors" command: nvme0-pci-0100 Adapter: PCI adapter Composite: +39.0°C (high = +85.0°C, crit = +85.0°C) Sensor 1: +39.0°C Sensor 2: +41.0°C Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NGuenter Roeck <linux@roeck-us.net> Signed-off-by: NKeith Busch <kbusch@kernel.org>
-
- 05 11月, 2019 3 次提交
-
-
由 Damien Le Moal 提交于
Introduce the new helper function nvme_lba_to_sect() to convert a device logical block number to a 512B sector number. Use this new helper in obvious places, cleaning up the code. Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com> Signed-off-by: NKeith Busch <kbusch@kernel.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Damien Le Moal 提交于
Rename nvme_block_nr() to nvme_sect_to_lba() and use SECTOR_SHIFT instead of its hard coded value 9. Also add a comment to decribe this helper. Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com> Signed-off-by: NKeith Busch <kbusch@kernel.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Israel Rukshin 提交于
This function improves code readability and reduces code duplication. Signed-off-by: NIsrael Rukshin <israelr@mellanox.com> Signed-off-by: NMax Gurtovoy <maxg@mellanox.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NKeith Busch <kbusch@kernel.org> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 14 10月, 2019 2 次提交
-
-
由 Keith Busch 提交于
Prevent simultaneous controller disabling/enabling tasks from interfering with each other through a function to wait until the task successfully transitioned the controller to the RESETTING state. This ensures disabling the controller will not be interrupted by another reset path, otherwise a concurrent reset may leave the controller in the wrong state. Tested-by: NEdmund Nadolski <edmund.nadolski@intel.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NKeith Busch <kbusch@kernel.org>
-
由 Keith Busch 提交于
The admin only state was intended to fence off actions that don't apply to a non-IO capable controller. The only actual user of this is the scan_work, and pci was the only transport to ever set this state. The consequence of having this state is placing an additional burden on every other action that applies to both live and admin only controllers. Remove the admin only state and place the admin only burden on the only place that actually cares: scan_work. This also prepares to make it easier to temporarily pause a LIVE state so that we don't need to remember which state the controller had been in prior to the pause. Tested-by: NEdmund Nadolski <edmund.nadolski@intel.com> Reviewed-by: NJames Smart <james.smart@broadcom.com> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NKeith Busch <kbusch@kernel.org>
-
- 27 9月, 2019 1 次提交
-
-
由 Keith Busch 提交于
This isn't specific to fabrics. Signed-off-by: NKeith Busch <kbusch@kernel.org> Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
-
- 30 8月, 2019 6 次提交
-
-
由 Sagi Grimberg 提交于
We have a fundamental issue that fabric commands use the admin_q. The reason is, that admin-connect, register reads and writes and admin commands cannot be guaranteed ordering while we are running controller resets. For example, when we reset a controller we perform: 1. disable the controller 2. teardown the admin queue 3. re-establish the admin queue 4. enable the controller In order to perform (3), we need to unquiesce the admin queue, however we may have some admin commands that are already pending on the quiesced admin_q and will immediate execute when we unquiesce it before we execute (4). The host must not send admin commands to the controller before enabling the controller. To fix this, we have the fabric commands (admin connect and property get/set, but not I/O queue connect) use a separate fabrics_q and make sure to quiesce the admin_q before we disable the controller, and unquiesce it only after we enable the controller. This fixes the error prints from nvmet in a controller reset storm test: kernel: nvmet: got cmd 6 while CC.EN == 0 on qid = 0 Which indicate that the host is sending an admin command when the controller is not enabled. Reviewed-by: NJames Smart <james.smart@broadcom.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
-
由 Benjamin Herrenschmidt 提交于
Another issue with the Apple T2 based 2018 controllers seem to be that they blow up (and shut the machine down) if there's a tag collision between the IO queue and the Admin queue. My suspicion is that they use our tags for their internal tracking and don't mix them with the queue id. They also seem to not like when tags go beyond the IO queue depth, ie 128 tags. This adds a quirk that marks tags 0..31 of the IO queue reserved Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org> Reviewed-by: NMing Lei <ming.lei@redhat.com> Acked-by: NKeith Busch <keith.busch@intel.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
-
由 Benjamin Herrenschmidt 提交于
Based on reverse engineering and original patch by Paul Pawlowski <paul@mrarm.io> This adds support for Apple weird implementation of NVME in their 2018 or later machines. It accounts for the twice-as-big SQ entries for the IO queues, and the fact that only interrupt vector 0 appears to function properly. Signed-off-by: NBenjamin Herrenschmidt <benh@kernel.crashing.org> Reviewed-by: NMinwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
-
由 Hannes Reinecke 提交于
When native multipathing is enabled we cannot enable blktrace for the underlying paths, so any completion is never traced. Signed-off-by: NHannes Reinecke <hare@suse.com> [fixed-up by Mikhail for non-multipath-build] Signed-off-by: NMikhail Skorzhinskii <mskorzhinskiy@solarflare.com> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
-
由 Sagi Grimberg 提交于
All seem to call it with ctrl->cap so no need to pass it at all. Reviewed-by: NMinwoo Im <minwoo.im.dev@gmail.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
-
由 Sagi Grimberg 提交于
nvme_enable_ctrl reads the cap register right after, so no need to do that locally in the transport driver. Have sqsize setting in nvme_init_identify. Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
-
- 21 8月, 2019 1 次提交
-
-
由 Mario Limonciello 提交于
One of the components in LiteON CL1 device has limitations that can be encountered based upon boundary race conditions using the nvme bus specific suspend to idle flow. When this situation occurs the drive doesn't resume properly from suspend-to-idle. LiteON has confirmed this problem and fixed in the next firmware version. As this firmware is already in the field, avoid running nvme specific suspend to idle flow. Fixes: d916b1be ("nvme-pci: use host managed power state for suspend") Link: http://lists.infradead.org/pipermail/linux-nvme/2019-July/thread.htmlSigned-off-by: NMario Limonciello <mario.limonciello@dell.com> Signed-off-by: NCharles Hyde <charles.hyde@dellteam.com> Reviewed-by: NKeith Busch <kbusch@kernel.org> Signed-off-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 01 8月, 2019 2 次提交
-
-
由 Sagi Grimberg 提交于
With multipath enabled, nvme_scan_work() can read from the device (through nvme_mpath_add_disk()) and hang [1]. However, with fabrics, once ctrl->state is set to NVME_CTRL_DELETING, the reads will hang (see nvmf_check_ready()) and the mpath stack device make_request will block if head->list is not empty. However, when the head->list consistst of only DELETING/DEAD controllers, we should actually not block, but rather fail immediately. In addition, before we go ahead and remove the namespaces, make sure to clear the current path and kick the requeue list so that the request will fast fail upon requeuing. [1]: -- INFO: task kworker/u4:3:166 blocked for more than 120 seconds. Not tainted 5.2.0-rc6-vmlocalyes-00005-g808c8c2dc0cf #316 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. kworker/u4:3 D 0 166 2 0x80004000 Workqueue: nvme-wq nvme_scan_work Call Trace: __schedule+0x851/0x1400 schedule+0x99/0x210 io_schedule+0x21/0x70 do_read_cache_page+0xa57/0x1330 read_cache_page+0x4a/0x70 read_dev_sector+0xbf/0x380 amiga_partition+0xc4/0x1230 check_partition+0x30f/0x630 rescan_partitions+0x19a/0x980 __blkdev_get+0x85a/0x12f0 blkdev_get+0x2a5/0x790 __device_add_disk+0xe25/0x1250 device_add_disk+0x13/0x20 nvme_mpath_set_live+0x172/0x2b0 nvme_update_ns_ana_state+0x130/0x180 nvme_set_ns_ana_state+0x9a/0xb0 nvme_parse_ana_log+0x1c3/0x4a0 nvme_mpath_add_disk+0x157/0x290 nvme_validate_ns+0x1017/0x1bd0 nvme_scan_work+0x44d/0x6a0 process_one_work+0x7d7/0x1240 worker_thread+0x8e/0xff0 kthread+0x2c3/0x3b0 ret_from_fork+0x35/0x40 INFO: task kworker/u4:1:1034 blocked for more than 120 seconds. Not tainted 5.2.0-rc6-vmlocalyes-00005-g808c8c2dc0cf #316 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. kworker/u4:1 D 0 1034 2 0x80004000 Workqueue: nvme-delete-wq nvme_delete_ctrl_work Call Trace: __schedule+0x851/0x1400 schedule+0x99/0x210 schedule_timeout+0x390/0x830 wait_for_completion+0x1a7/0x310 __flush_work+0x241/0x5d0 flush_work+0x10/0x20 nvme_remove_namespaces+0x85/0x3d0 nvme_do_delete_ctrl+0xb4/0x1e0 nvme_delete_ctrl_work+0x15/0x20 process_one_work+0x7d7/0x1240 worker_thread+0x8e/0xff0 kthread+0x2c3/0x3b0 ret_from_fork+0x35/0x40 -- Reported-by: NLogan Gunthorpe <logang@deltatee.com> Tested-by: NLogan Gunthorpe <logang@deltatee.com> Reviewed-by: NLogan Gunthorpe <logang@deltatee.com> Reviewed-by: NMing Lei <ming.lei@redhat.com> Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
-
由 Sagi Grimberg 提交于
When the user issues a command with side effects, we will end up freezing the namespace request queue when updating disk info (and the same for the corresponding mpath disk node). However, we are not freezing the mpath node request queue, which means that mpath I/O can still come in and block on blk_queue_enter (called from nvme_ns_head_make_request -> direct_make_request). This is a deadlock, because blk_queue_enter will block until the inner namespace request queue is unfroze, but that process is blocked because the namespace revalidation is trying to update the mpath disk info and freeze its request queue (which will never complete because of the I/O that is blocked on blk_queue_enter). Fix this by freezing all the subsystem nsheads request queues before executing the passthru command. Given that these commands are infrequent we should not worry about this temporary I/O freeze to keep things sane. Here is the matching hang traces: -- [ 374.465002] INFO: task systemd-udevd:17994 blocked for more than 122 seconds. [ 374.472975] Not tainted 5.2.0-rc3-mpdebug+ #42 [ 374.478522] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 374.487274] systemd-udevd D 0 17994 1 0x00000000 [ 374.493407] Call Trace: [ 374.496145] __schedule+0x2ef/0x620 [ 374.500047] schedule+0x38/0xa0 [ 374.503569] blk_queue_enter+0x139/0x220 [ 374.507959] ? remove_wait_queue+0x60/0x60 [ 374.512540] direct_make_request+0x60/0x130 [ 374.517219] nvme_ns_head_make_request+0x11d/0x420 [nvme_core] [ 374.523740] ? generic_make_request_checks+0x307/0x6f0 [ 374.529484] generic_make_request+0x10d/0x2e0 [ 374.534356] submit_bio+0x75/0x140 [ 374.538163] ? guard_bio_eod+0x32/0xe0 [ 374.542361] submit_bh_wbc+0x171/0x1b0 [ 374.546553] block_read_full_page+0x1ed/0x330 [ 374.551426] ? check_disk_change+0x70/0x70 [ 374.556008] ? scan_shadow_nodes+0x30/0x30 [ 374.560588] blkdev_readpage+0x18/0x20 [ 374.564783] do_read_cache_page+0x301/0x860 [ 374.569463] ? blkdev_writepages+0x10/0x10 [ 374.574037] ? prep_new_page+0x88/0x130 [ 374.578329] ? get_page_from_freelist+0xa2f/0x1280 [ 374.583688] ? __alloc_pages_nodemask+0x179/0x320 [ 374.588947] read_cache_page+0x12/0x20 [ 374.593142] read_dev_sector+0x2d/0xd0 [ 374.597337] read_lba+0x104/0x1f0 [ 374.601046] find_valid_gpt+0xfa/0x720 [ 374.605243] ? string_nocheck+0x58/0x70 [ 374.609534] ? find_valid_gpt+0x720/0x720 [ 374.614016] efi_partition+0x89/0x430 [ 374.618113] ? string+0x48/0x60 [ 374.621632] ? snprintf+0x49/0x70 [ 374.625339] ? find_valid_gpt+0x720/0x720 [ 374.629828] check_partition+0x116/0x210 [ 374.634214] rescan_partitions+0xb6/0x360 [ 374.638699] __blkdev_reread_part+0x64/0x70 [ 374.643377] blkdev_reread_part+0x23/0x40 [ 374.647860] blkdev_ioctl+0x48c/0x990 [ 374.651956] block_ioctl+0x41/0x50 [ 374.655766] do_vfs_ioctl+0xa7/0x600 [ 374.659766] ? locks_lock_inode_wait+0xb1/0x150 [ 374.664832] ksys_ioctl+0x67/0x90 [ 374.668539] __x64_sys_ioctl+0x1a/0x20 [ 374.672732] do_syscall_64+0x5a/0x1c0 [ 374.676828] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [ 374.738474] INFO: task nvmeadm:49141 blocked for more than 123 seconds. [ 374.745871] Not tainted 5.2.0-rc3-mpdebug+ #42 [ 374.751419] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. [ 374.760170] nvmeadm D 0 49141 36333 0x00004080 [ 374.766301] Call Trace: [ 374.769038] __schedule+0x2ef/0x620 [ 374.772939] schedule+0x38/0xa0 [ 374.776452] blk_mq_freeze_queue_wait+0x59/0x100 [ 374.781614] ? remove_wait_queue+0x60/0x60 [ 374.786192] blk_mq_freeze_queue+0x1a/0x20 [ 374.790773] nvme_update_disk_info.isra.57+0x5f/0x350 [nvme_core] [ 374.797582] ? nvme_identify_ns.isra.50+0x71/0xc0 [nvme_core] [ 374.804006] __nvme_revalidate_disk+0xe5/0x110 [nvme_core] [ 374.810139] nvme_revalidate_disk+0xa6/0x120 [nvme_core] [ 374.816078] ? nvme_submit_user_cmd+0x11e/0x320 [nvme_core] [ 374.822299] nvme_user_cmd+0x264/0x370 [nvme_core] [ 374.827661] nvme_dev_ioctl+0x112/0x1d0 [nvme_core] [ 374.833114] do_vfs_ioctl+0xa7/0x600 [ 374.837117] ? __audit_syscall_entry+0xdd/0x130 [ 374.842184] ksys_ioctl+0x67/0x90 [ 374.845891] __x64_sys_ioctl+0x1a/0x20 [ 374.850082] do_syscall_64+0x5a/0x1c0 [ 374.854178] entry_SYSCALL_64_after_hwframe+0x44/0xa9 -- Reported-by: NJames Puthukattukaran <james.puthukattukaran@oracle.com> Tested-by: NJames Puthukattukaran <james.puthukattukaran@oracle.com> Reviewed-by: NKeith Busch <kbusch@kernel.org> Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
-