提交 · b1ae1a238900474a9f51431c0f7f169ade1faa19 · openeuler / Kernel

27 11月, 2019 2 次提交

nvme-fc: Avoid preallocating big SGL for data · b1ae1a23

由 Israel Rukshin 提交于 11月 24, 2019

nvme_fc_create_io_queues() preallocates a big buffer for the IO SGL based
on SG_CHUNK_SIZE.

Modern DMA engines are often capable of dealing with very big segments so
the SG_CHUNK_SIZE is often too big. SG_CHUNK_SIZE results in a static 4KB
SGL allocation per command.

If a controller has lots of deep queues, preallocation for the sg list can
consume substantial amounts of memory. For nvme-fc, nr_hw_queues can be
128 and each queue's depth 128. This means the resulting preallocation
for the data SGL is 128*128*4K = 64MB per controller.

Switch to runtime allocation for SGL for lists longer than 2 entries. This
is the approach used by NVMe PCI so it should be reasonable for NVMeOF as
well. Runtime SGL allocation has always been the case for the legacy I/O
path so this is nothing new.
Reviewed-by: NMax Gurtovoy <maxg@mellanox.com>
Reviewed-by: NJames Smart <james.smart@broadcom.com>
Signed-off-by: NIsrael Rukshin <israelr@mellanox.com>
Signed-off-by: NKeith Busch <kbusch@kernel.org>

b1ae1a23

nvme-rdma: Avoid preallocating big SGL for data · 38e18002

由 Israel Rukshin 提交于 11月 24, 2019

nvme_rdma_alloc_tagset() preallocates a big buffer for the IO SGL based
on SG_CHUNK_SIZE.

Modern DMA engines are often capable of dealing with very big segments so
the SG_CHUNK_SIZE is often too big. SG_CHUNK_SIZE results in a static 4KB
SGL allocation per command.

If a controller has lots of deep queues, preallocation for the sg list can
consume substantial amounts of memory. For nvme-rdma, nr_hw_queues can be
128 and each queue's depth 128. This means the resulting preallocation
for the data SGL is 128*128*4K = 64MB per controller.

Switch to runtime allocation for SGL for lists longer than 2 entries. This
is the approach used by NVMe PCI so it should be reasonable for NVMeOF as
well. Runtime SGL allocation has always been the case for the legacy I/O
path so this is nothing new.

The preallocated small SGL depends on SG_CHAIN so if the ARCH doesn't
support SG_CHAIN, use only runtime allocation for the SGL.

We didn't notice of a performance degradation, since for small IOs we'll
use the inline SG and for the bigger IOs the allocation of a bigger SGL
from slab is fast enough.
Suggested-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NMax Gurtovoy <maxg@mellanox.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NIsrael Rukshin <israelr@mellanox.com>
Signed-off-by: NKeith Busch <kbusch@kernel.org>

38e18002

22 11月, 2019 2 次提交

nvme: hwmon: add quirk to avoid changing temperature threshold · 6c6aa2f2

由 Akinobu Mita 提交于 11月 15, 2019

This adds a new quirk NVME_QUIRK_NO_TEMP_THRESH_CHANGE to avoid changing
the value of the temperature threshold feature for specific devices that
show undesirable behavior.

Guenter reported:

"On my Intel NVME drive (SSDPEKKW512G7), writing any minimum limit on the
Composite temperature sensor results in a temperature warning, and that
warning is sticky until I reset the controller.

It doesn't seem to matter which temperature I write; writing -273000 has
the same result."

The Intel NVMe has the latest firmware version installed, so this isn't
a problem that was ever fixed.
Reported-by: NGuenter Roeck <linux@roeck-us.net>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Jens Axboe <axboe@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Jean Delvare <jdelvare@suse.com>
Reviewed-by: NGuenter Roeck <linux@roeck-us.net>
Tested-by: NGuenter Roeck <linux@roeck-us.net>
Signed-off-by: NAkinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: NKeith Busch <kbusch@kernel.org>

6c6aa2f2

nvme: hwmon: provide temperature min and max values for each sensor · 52deba0f

由 Akinobu Mita 提交于 11月 15, 2019

According to the NVMe specification, the over temperature threshold and
under temperature threshold features shall be implemented for Composite
Temperature if a non-zero WCTEMP field value is reported in the Identify
Controller data structure.  The features are also implemented for all
implemented temperature sensors (i.e., all Temperature Sensor fields that
report a non-zero value).

This provides the over temperature threshold and under temperature
threshold for each sensor as temperature min and max values of hwmon
sysfs attributes.

The WCTEMP is already provided as a temperature max value for Composite
Temperature, but this change isn't incompatible.  Because the default
value of the over temperature threshold for Composite Temperature is
the WCTEMP.

Now the alarm attribute for Composite Temperature indicates one of the
temperature is outside of a temperature threshold.  Because there is only
a single bit in Critical Warning field that indicates a temperature is
outside of a threshold.

Example output from the "sensors" command:

nvme-pci-0100
Adapter: PCI adapter
Composite:    +33.9°C  (low  = -273.1°C, high = +69.8°C)
                       (crit = +79.8°C)
Sensor 1:     +34.9°C  (low  = -273.1°C, high = +65261.8°C)
Sensor 2:     +31.9°C  (low  = -273.1°C, high = +65261.8°C)
Sensor 5:     +47.9°C  (low  = -273.1°C, high = +65261.8°C)

This also adds helper macros for kelvin from/to milli Celsius conversion,
and replaces the repeated code in hwmon.c.

Cc: Keith Busch <kbusch@kernel.org>
Cc: Jens Axboe <axboe@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Jean Delvare <jdelvare@suse.com>
Reviewed-by: NGuenter Roeck <linux@roeck-us.net>
Tested-by: NGuenter Roeck <linux@roeck-us.net>
Signed-off-by: NAkinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: NKeith Busch <kbusch@kernel.org>

52deba0f

13 11月, 2019 1 次提交

nvme: Discard workaround for non-conformant devices · 530436c4

由 Eduard Hasenleithner 提交于 11月 12, 2019

Users observe IOMMU related errors when performing discard on nvme from
non-compliant nvme devices reading beyond the end of the DMA mapped
ranges to discard.

Two different variants of this behavior have been observed: SM22XX
controllers round up the read size to a multiple of 512 bytes, and Phison
E12 unconditionally reads the maximum discard size allowed by the spec
(256 segments or 4kB).

Make nvme_setup_discard unconditionally allocate the maximum DSM buffer
so the driver DMA maps a memory range that will always succeed.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=202665 many
Signed-off-by: NEduard Hasenleithner <eduard@hasenleithner.at>
[changelog, use existing define, kernel coding style]
Signed-off-by: NKeith Busch <kbusch@kernel.org>

530436c4

12 11月, 2019 1 次提交

nvme: Add hardware monitoring support · 400b6a7b

由 Guenter Roeck 提交于 11月 06, 2019

nvme devices report temperature information in the controller information
(for limits) and in the smart log. Currently, the only means to retrieve
this information is the nvme command line interface, which requires
super-user privileges.

At the same time, it would be desirable to be able to use NVMe temperature
information for thermal control.

This patch adds support to read NVMe temperatures from the kernel using the
hwmon API and adds temperature zones for NVMe drives. The thermal subsystem
can use this information to set thermal policies, and userspace can access
it using libsensors and/or the "sensors" command.

Example output from the "sensors" command:

nvme0-pci-0100
Adapter: PCI adapter
Composite:    +39.0°C  (high = +85.0°C, crit = +85.0°C)
Sensor 1:     +39.0°C
Sensor 2:     +41.0°C
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NGuenter Roeck <linux@roeck-us.net>
Signed-off-by: NKeith Busch <kbusch@kernel.org>

400b6a7b

05 11月, 2019 13 次提交

nvme-multipath: fix crash in nvme_mpath_clear_ctrl_paths · 763303a8

由 Anton Eidelman 提交于 11月 01, 2019

nvme_mpath_clear_ctrl_paths() iterates through
the ctrl->namespaces list while holding ctrl->scan_lock.
This does not seem to be the correct way of protecting
from concurrent list modification.

Specifically, nvme_scan_work() sorts ctrl->namespaces
AFTER unlocking scan_lock.

This may result in the following (rare) crash in ctrl disconnect
during scan_work:

    BUG: kernel NULL pointer dereference, address: 0000000000000050
    Oops: 0000 [#1] SMP PTI
    CPU: 0 PID: 3995 Comm: nvme 5.3.5-050305-generic
    RIP: 0010:nvme_mpath_clear_current_path+0xe/0x90 [nvme_core]
    ...
    Call Trace:
     nvme_mpath_clear_ctrl_paths+0x3c/0x70 [nvme_core]
     nvme_remove_namespaces+0x35/0xe0 [nvme_core]
     nvme_do_delete_ctrl+0x47/0x90 [nvme_core]
     nvme_sysfs_delete+0x49/0x60 [nvme_core]
     dev_attr_store+0x17/0x30
     sysfs_kf_write+0x3e/0x50
     kernfs_fop_write+0x11e/0x1a0
     __vfs_write+0x1b/0x40
     vfs_write+0xb9/0x1a0
     ksys_write+0x67/0xe0
     __x64_sys_write+0x1a/0x20
     do_syscall_64+0x5a/0x130
     entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7f8d02bfb154

Fix:
After taking scan_lock in nvme_mpath_clear_ctrl_paths()
down_read(&ctrl->namespaces_rwsem) as well to make list traversal safe.
This will not cause deadlocks because taking scan_lock never happens
while holding the namespaces_rwsem.
Moreover, scan work downs namespaces_rwsem in the same order.

Alternative: sort ctrl->namespaces in nvme_scan_work()
while still holding the scan_lock.
This would leave nvme_mpath_clear_ctrl_paths() without correct protection
against ctrl->namespaces modification by anyone other than scan_work.
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NAnton Eidelman <anton@lightbitslabs.com>
Signed-off-by: NKeith Busch <kbusch@kernel.org>

763303a8

nvme-rdma: fix a segmentation fault during module unload · 9ad9e8d6

由 Max Gurtovoy 提交于 10月 29, 2019

In case there are controllers that are not associated with any RDMA
device (e.g. during unsuccessful reconnection) and the user will unload
the module, these controllers will not be freed and will access already
freed memory. The same logic appears in other fabric drivers as well.

Fixes: 87fd1253 ("nvme-rdma: remove redundant reference between ib_device and tagset")
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NMax Gurtovoy <maxg@mellanox.com>
Signed-off-by: NKeith Busch <kbusch@kernel.org>

9ad9e8d6

nvme: Fix parsing of ANA log page · 64fab729

由 Prabhath Sajeepa 提交于 10月 28, 2019

Check validity of offset into ANA log buffer before accessing
nvme_ana_group_desc. This check ensures the size of ANA log buffer >=
offset + sizeof(nvme_ana_group_desc)
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NPrabhath Sajeepa <psajeepa@purestorage.com>
Signed-off-by: NKeith Busch <kbusch@kernel.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

64fab729

nvme-pci: Spelling s/resdicovered/rediscovered/ · 05d3046f

由 Geert Uytterhoeven 提交于 10月 24, 2019

Fix misspelling of "rediscovered".
Signed-off-by: NGeert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: NKeith Busch <kbusch@kernel.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

05d3046f

nvme: Introduce nvme_lba_to_sect() · e08f2ae8

由 Damien Le Moal 提交于 10月 21, 2019

Introduce the new helper function nvme_lba_to_sect() to convert a device
logical block number to a 512B sector number. Use this new helper in
obvious places, cleaning up the code.
Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: NKeith Busch <kbusch@kernel.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

e08f2ae8

nvme: Cleanup and rename nvme_block_nr() · 314d48dd

由 Damien Le Moal 提交于 10月 21, 2019

Rename nvme_block_nr() to nvme_sect_to_lba() and use SECTOR_SHIFT
instead of its hard coded value 9. Also add a comment to decribe this
helper.
Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NDamien Le Moal <damien.lemoal@wdc.com>
Signed-off-by: NKeith Busch <kbusch@kernel.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

314d48dd

nvme: move common call to nvme_cleanup_cmd to core layer · 16686f3a

由 Max Gurtovoy 提交于 10月 13, 2019

nvme_cleanup_cmd should be called for each call to nvme_setup_cmd
(symmetrical functions). Move the call for nvme_cleanup_cmd to the common
core layer and call it during nvme_complete_rq for the good flow. For
error flow, each transport will call nvme_cleanup_cmd independently. Also
take care of a special case of path failure, where we call
nvme_complete_rq without doing nvme_setup_cmd.
Signed-off-by: NMax Gurtovoy <maxg@mellanox.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NKeith Busch <kbusch@kernel.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

16686f3a

nvme: introduce "Command Aborted By host" status code · 2dc3947b

由 Max Gurtovoy 提交于 10月 13, 2019

Fix the status code of canceled requests initiated by the host according
to TP4028 (Status Code 0x371):
"Command Aborted By host: The command was aborted as a result of host
action (e.g., the host disconnected the Fabric connection)."

Also in a multipath environment, unless otherwise specified, errors of
this type (path related) should be retried using a different path, if
one is available.
Signed-off-by: NMax Gurtovoy <maxg@mellanox.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NKeith Busch <kbusch@kernel.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

2dc3947b

nvme: introduce nvme_is_aen_req function · 58a8df67

由 Israel Rukshin 提交于 10月 13, 2019

This function improves code readability and reduces code duplication.
Signed-off-by: NIsrael Rukshin <israelr@mellanox.com>
Signed-off-by: NMax Gurtovoy <maxg@mellanox.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NKeith Busch <kbusch@kernel.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

58a8df67

nvme-fc: ensure association_id is cleared regardless of a Disconnect LS · bcde5f0f

由 James Smart 提交于 9月 27, 2019

Code today only clears the association_id if a Disconnect LS is transmit.

Remove ambiguity and unconditionally clear the association_id if the
association has been terminated.
Signed-off-by: NJames Smart <jsmart2021@gmail.com>
Signed-off-by: NKeith Busch <kbusch@kernel.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

bcde5f0f

nvme-fc: clarify error messages · 7db39484

由 James Smart 提交于 9月 27, 2019

Change wording on a couple of messages to clarify what happened.
Signed-off-by: NEwan D. Milne <emilne@redhat.com>
Signed-off-by: NJames Smart <jsmart2021@gmail.com>
Signed-off-by: NKeith Busch <kbusch@kernel.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

7db39484

nvme-fc: Set new cmd set indicator in nvme-fc cmnd iu · 44fbf3bb

由 James Smart 提交于 9月 27, 2019

Set the new category field in the FC-NVME CMND_IU based on queue number.
Signed-off-by: NJames Smart <jsmart2021@gmail.com>
Signed-off-by: NKeith Busch <kbusch@kernel.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

44fbf3bb

nvme-fc and nvmet-fc: sync with FC-NVME-2 header changes · 53b2b2f5

由 James Smart 提交于 9月 27, 2019

Sync sources with revised structure and field names to correspond with
FC-NVME-2 header sync-up.

Tested interoperability with success:
- prior initiator with new target
- prior target with new initiator
- new on new
Signed-off-by: NJames Smart <jsmart2021@gmail.com>
Signed-off-by: NKeith Busch <kbusch@kernel.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

53b2b2f5

29 10月, 2019 3 次提交

nvme-multipath: remove unused groups_only mode in ana log · 86cccfbf

由 Anton Eidelman 提交于 10月 18, 2019

groups_only mode in nvme_read_ana_log() is no longer used: remove it.
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NAnton Eidelman <anton@lightbitslabs.com>
Signed-off-by: NKeith Busch <kbusch@kernel.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

86cccfbf

nvme-multipath: fix possible io hang after ctrl reconnect · af8fd042

由 Anton Eidelman 提交于 10月 18, 2019

The following scenario results in an IO hang:
1) ctrl completes a request with NVME_SC_ANA_TRANSITION.
   NVME_NS_ANA_PENDING bit in ns->flags is set and ana_work is triggered.
2) ana_work: nvme_read_ana_log() tries to get the ANA log page from the ctrl.
   This fails because ctrl disconnects.
   Therefore nvme_update_ns_ana_state() is not called
   and NVME_NS_ANA_PENDING bit in ns->flags is not cleared.
3) ctrl reconnects: nvme_mpath_init(ctrl,...) calls
   nvme_read_ana_log(ctrl, groups_only=true).
   However, nvme_update_ana_state() does not update namespaces
   because nr_nsids = 0 (due to groups_only mode).
4) scan_work calls nvme_validate_ns() finds the ns and re-validates OK.

Result:
The ctrl is now live but NVME_NS_ANA_PENDING bit in ns->flags is still set.
Consequently ctrl will never be considered a viable path by __nvme_find_path().
IO will hang if ctrl is the only or the last path to the namespace.

More generally, while ctrl is reconnecting, its ANA state may change.
And because nvme_mpath_init() requests ANA log in groups_only mode,
these changes are not propagated to the existing ctrl namespaces.
This may result in a mal-function or an IO hang.

Solution:
nvme_mpath_init() will nvme_read_ana_log() with groups_only set to false.
This will not harm the new ctrl case (no namespaces present),
and will make sure the ANA state of namespaces gets updated after reconnect.

Note: Another option would be for nvme_mpath_init() to invoke
nvme_parse_ana_log(..., nvme_set_ns_ana_state) for each existing namespace.
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NAnton Eidelman <anton@lightbitslabs.com>
Signed-off-by: NKeith Busch <kbusch@kernel.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

af8fd042

net: use skb_queue_empty_lockless() in busy poll contexts · 3f926af3

由 Eric Dumazet 提交于 10月 23, 2019

Busy polling usually runs without locks.
Let's use skb_queue_empty_lockless() instead of skb_queue_empty()

Also uses READ_ONCE() in __skb_try_recv_datagram() to address
a similar potential problem.
Signed-off-by: NEric Dumazet <edumazet@google.com>
Signed-off-by: NDavid S. Miller <davem@davemloft.net>

3f926af3

18 10月, 2019 1 次提交

nvme-pci: Set the prp2 correctly when using more than 4k page · a4f40484

由 Kevin Hao 提交于 10月 18, 2019

In the current code, the nvme is using a fixed 4k PRP entry size,
but if the kernel use a page size which is more than 4k, we should
consider the situation that the bv_offset may be larger than the
dev->ctrl.page_size. Otherwise we may miss setting the prp2 and then
cause the command can't be executed correctly.

Fixes: dff824b2 ("nvme-pci: optimize mapping of small single segment requests")
Cc: stable@vger.kernel.org
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NKevin Hao <haokexin@gmail.com>
Signed-off-by: NKeith Busch <kbusch@kernel.org>

a4f40484

15 10月, 2019 1 次提交

nvme-tcp: fix possible leakage during error flow · 28a4cac4

由 Max Gurtovoy 提交于 10月 13, 2019

During nvme_tcp_setup_cmd_pdu error flow, one must call nvme_cleanup_cmd
since it's symmetric to nvme_setup_cmd.
Signed-off-by: NMax Gurtovoy <maxg@mellanox.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NKeith Busch <kbusch@kernel.org>

28a4cac4

14 10月, 2019 6 次提交

nvme-tcp: Initialize sk->sk_ll_usec only with NET_RX_BUSY_POLL · ac1c4e18

由 Sebastian Andrzej Siewior 提交于 10月 10, 2019

The access to sk->sk_ll_usec should be hidden behind
CONFIG_NET_RX_BUSY_POLL like the definition of sk_ll_usec.

Put access to ->sk_ll_usec behind CONFIG_NET_RX_BUSY_POLL.

Fixes: 1a9460ce ("nvme-tcp: support simple polling")
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: NKeith Busch <kbusch@kernel.org>

ac1c4e18

nvme: Wait for reset state when required · c1ac9a4b

由 Keith Busch 提交于 9月 04, 2019

Prevent simultaneous controller disabling/enabling tasks from interfering
with each other through a function to wait until the task successfully
transitioned the controller to the RESETTING state. This ensures disabling
the controller will not be interrupted by another reset path, otherwise
a concurrent reset may leave the controller in the wrong state.
Tested-by: NEdmund Nadolski <edmund.nadolski@intel.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NKeith Busch <kbusch@kernel.org>

c1ac9a4b

nvme: Prevent resets during paused controller state · 4c75f877

由 Keith Busch 提交于 9月 06, 2019

A paused controller is doing critical internal activation work in the
background. Prevent subsequent controller resets from occurring during
this period by setting the controller state to RESETTING first. A helper
function, nvme_try_sched_reset_work(), is introduced for these paths so
they may continue with scheduling the reset_work after they've completed
their uninterruptible critical section.
Tested-by: NEdmund Nadolski <edmund.nadolski@intel.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NKeith Busch <kbusch@kernel.org>

4c75f877

nvme: Restart request timers in resetting state · 92b98e88

由 Keith Busch 提交于 9月 05, 2019

A controller in the resetting state has not yet completed its recovery
actions. The pci and fc transports were already handling this, so update
the remaining transports to not attempt additional recovery in this
state. Instead, just restart the request timer.
Tested-by: NEdmund Nadolski <edmund.nadolski@intel.com>
Reviewed-by: NJames Smart <james.smart@broadcom.com>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NKeith Busch <kbusch@kernel.org>

92b98e88

nvme: Remove ADMIN_ONLY state · 5d02a5c1

由 Keith Busch 提交于 9月 03, 2019

The admin only state was intended to fence off actions that don't
apply to a non-IO capable controller. The only actual user of this is
the scan_work, and pci was the only transport to ever set this state.
The consequence of having this state is placing an additional burden on
every other action that applies to both live and admin only controllers.

Remove the admin only state and place the admin only burden on the only
place that actually cares: scan_work.

This also prepares to make it easier to temporarily pause a LIVE state
so that we don't need to remember which state the controller had been in
prior to the pause.
Tested-by: NEdmund Nadolski <edmund.nadolski@intel.com>
Reviewed-by: NJames Smart <james.smart@broadcom.com>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NKeith Busch <kbusch@kernel.org>

5d02a5c1

nvme-pci: Free tagset if no IO queues · 770597ec

由 Keith Busch 提交于 9月 05, 2019

If a controller becomes degraded after a reset, we will not be able to
perform any IO. We currently teardown previously created request
queues and namespaces, but we had kept the unusable tagset. Free
it after all queues using it have been released.
Tested-by: NEdmund Nadolski <edmund.nadolski@intel.com>
Reviewed-by: NJames Smart <james.smart@broadcom.com>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NKeith Busch <kbusch@kernel.org>

770597ec

05 10月, 2019 2 次提交

nvme: retain split access workaround for capability reads · 3a8ecc93

由 Ard Biesheuvel 提交于 10月 03, 2019

Commit 7fd8930f

  "nvme: add a common helper to read Identify Controller data"

has re-introduced an issue that we have attempted to work around in the
past, in commit a310acd7 ("NVMe: use split lo_hi_{read,write}q").

The problem is that some PCIe NVMe controllers do not implement 64-bit
outbound accesses correctly, which is why the commit above switched
to using lo_hi_[read|write]q for all 64-bit BAR accesses occuring in
the code.

In the mean time, the NVMe subsystem has been refactored, and now calls
into the PCIe support layer for NVMe via a .reg_read64() method, which
fails to use lo_hi_readq(), and thus reintroduces the problem that the
workaround above aimed to address.

Given that, at the moment, .reg_read64() is only used to read the
capability register [which is known to tolerate split reads], let's
switch .reg_read64() to lo_hi_readq() as well.

This fixes a boot issue on some ARM boxes with NVMe behind a Synopsys
DesignWare PCIe host controller.

Fixes: 7fd8930f ("nvme: add a common helper to read Identify Controller data")
Signed-off-by: NArd Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>

3a8ecc93

nvme: fix possible deadlock when nvme_update_formats fails · 6abff1b9

由 Sagi Grimberg 提交于 10月 02, 2019

nvme_update_formats may fail to revalidate the namespace and
attempt to remove the namespace. This may lead to a deadlock
as nvme_ns_remove will attempt to acquire the subsystem lock
which is already acquired by the passthru command with effects.

Move the invalid namepsace removal to after the passthru command
releases the subsystem lock.
Reported-by: NJudy Brock <judy.brock@samsung.com>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>

6abff1b9

28 9月, 2019 1 次提交

nvme-rdma: fix possible use-after-free in connect timeout · 67b483dd

由 Sagi Grimberg 提交于 9月 24, 2019

If the connect times out, we may have already destroyed the
queue in the timeout handler, so test if the queue is still
allocated in the connect error handler.
Reported-by: NYi Zhang <yi.zhang@redhat.com>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>

67b483dd

27 9月, 2019 1 次提交

nvme: Move ctrl sqsize to generic space · f968688f

由 Keith Busch 提交于 9月 26, 2019

This isn't specific to fabrics.
Signed-off-by: NKeith Busch <kbusch@kernel.org>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>

f968688f

26 9月, 2019 6 次提交

nvme: Add ctrl attributes for queue_count and sqsize · 2b1ff255

由 James Smart 提交于 9月 24, 2019

Current controller interrogation requires a lot of guesswork
on how many io queues were created and what the io sq size is.
The numbers are dependent upon core/fabric defaults, connect
arguments, and target responses.

Add sysfs attributes for queue_count and sqsize.
Signed-off-by: NJames Smart <jsmart2021@gmail.com>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>

2b1ff255

nvme: allow 64-bit results in passthru commands · 65e68edc

由 Marta Rybczynska 提交于 9月 24, 2019

It is not possible to get 64-bit results from the passthru commands,
what prevents from getting for the Capabilities (CAP) property value.

As a result, it is not possible to implement IOL's NVMe Conformance
test 4.3 Case 1 for Fabrics targets [1] (page 123).

This issue has been already discussed [2], but without a solution.

This patch solves the problem by adding new ioctls with a new
passthru structure, including 64-bit results. The older ioctls stay
unchanged.

[1] https://www.iol.unh.edu/sites/default/files/testsuites/nvme/UNH-IOL_NVMe_Conformance_Test_Suite_v11.0.pdf
[2] http://lists.infradead.org/pipermail/linux-nvme/2018-June/018791.htmlSigned-off-by: NMarta Rybczynska <marta.rybczynska@kalray.eu>
Reviewed-by: NKeith Busch <kbusch@kernel.org>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>

65e68edc

nvme: Add quirk for Kingston NVME SSD running FW E8FK11.T · 19ea025e

由 Jian-Hong Pan 提交于 9月 24, 2019

Kingston NVME SSD with firmware version E8FK11.T has no interrupt after
resume with actions related to suspend to idle. This patch applied
NVME_QUIRK_SIMPLE_SUSPEND quirk to fix this issue.

Fixes: d916b1be ("nvme-pci: use host managed power state for suspend")
Buglink: https://bugzilla.kernel.org/show_bug.cgi?id=204887Signed-off-by: NJian-Hong Pan <jian-hong@endlessm.com>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>

19ea025e

Added QUIRKs for ADATA XPG SX8200 Pro 512GB · f03e42c6

由 Gabriel Craciunescu 提交于 9月 23, 2019

Booting with default_ps_max_latency_us >6000 makes the device fail.
Also SUBNQN is NULL and gives a warning on each boot/resume.
 $ nvme id-ctrl /dev/nvme0 | grep ^subnqn
   subnqn    : (null)

I use this device with an Acer Nitro 5 (AN515-43-R8BF) Laptop.
To be sure is not a Laptop issue only, I tested the device on
my server board  with the same results.
( with 2x,4x link on the board and 4x link on a PCI-E card ).
Signed-off-by: NGabriel Craciunescu <nix.or.die@gmail.com>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>

f03e42c6

nvme-rdma: Fix max_hw_sectors calculation · ff13c1b8

由 Max Gurtovoy 提交于 9月 21, 2019

By default, the NVMe/RDMA driver should support max io_size of 1MiB (or
upto the maximum supported size by the HCA). Currently, one will see that
/sys/class/block/<bdev>/queue/max_hw_sectors_kb is 1020 instead of 1024.

A non power of 2 value can cause performance degradation due to
unnecessary splitting of IO requests and unoptimized allocation units.

The number of pages per MR has been fixed here, so there is no longer any
need to reduce max_sectors by 1.
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NMax Gurtovoy <maxg@mellanox.com>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>

ff13c1b8

nvme: fix an error code in nvme_init_subsystem() · bc4f6e06

由 Dan Carpenter 提交于 9月 23, 2019

"ret" should be a negative error code here, but it's either success or
possibly uninitialized.

Fixes: 32fd90c4 ("nvme: change locking for the per-subsystem controller list")
Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: NKeith Busch <kbusch@kernel.org>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>

bc4f6e06

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功