Commits · bf392a5dc02a9b796f3da89fc5bb42856aca64cb · openeuler / Kernel

26 Mar, 2020 8 commits

nvme-pci: Remove tag from process cq · bf392a5d

Authored by Keith Busch on Mar 02, 2020

The only user for tagged completion was for timeout handling. That user,
though, really only cares if the timed out command is completed, which
we can safely check within the timeout handler.

Remove the tag check to simplify completion handling.
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>

bf392a5d

nvme-pci: slimmer CQ head update · e2a366a4

Authored by Alexey Dobriyan on Feb 28, 2020

Update CQ head with pre-increment operator. This saves subtraction of 1
and a few registers.

Also update phase with "^= 1". This generates only one RMW instruction.

ffffffff815ba150 <nvme_update_cq_head>:
ffffffff815ba150: 0f b7 47 70 movzx eax,WORD PTR [rdi+0x70]
ffffffff815ba154: 83 c0 01 add eax,0x1
ffffffff815ba157: 66 89 47 70 mov WORD PTR [rdi+0x70],ax
ffffffff815ba15b: 66 3b 47 68 cmp ax,WORD PTR [rdi+0x68]
ffffffff815ba15f: 74 01 je ffffffff815ba162 <nvme_update_cq_head+0x12>
ffffffff815ba161: c3 ret
ffffffff815ba162: 31 c0 xor eax,eax
ffffffff815ba164: 80 77 74 01 ===> xor BYTE PTR [rdi+0x74],0x1
ffffffff815ba168: 66 89 47 70 mov WORD PTR [rdi+0x70],ax
ffffffff815ba16c: c3 ret
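
In C, the pattern that a listing like the one above is compiled from might look like the following sketch. The struct and field names here are assumptions made for illustration, not the driver's actual definitions.

```c
#include <stdint.h>

/* Hypothetical, minimal stand-in for the completion-queue state. */
struct cq_state {
    uint16_t depth;   /* number of entries in the queue */
    uint16_t head;    /* index of the next entry to consume */
    uint8_t  phase;   /* expected phase bit, flips on every wrap */
};

/* Advance the head with a pre-increment; on wrap, reset it and flip the phase. */
static void cq_update_head(struct cq_state *cq)
{
    if (++cq->head == cq->depth) {
        cq->head = 0;
        cq->phase ^= 1;   /* single read-modify-write on the phase byte */
    }
}
```

Comparing the incremented head against the depth avoids computing `depth - 1`, and toggling with `^= 1` avoids a separate load, negate and store.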

add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-119 (-119)
Function                 old     new   delta
nvme_poll                690     678     -12
nvme_dev_disable        1230    1177     -53
nvme_irq                 613     559     -54
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>

e2a366a4

nvme: Check for readiness more quickly, to speed up boot time · 3e98c244

Authored by Josh Triplett on Feb 28, 2020

After initialization, nvme_wait_ready checks for readiness every 100ms,
even though the drive may be ready far sooner than that. This delays
system boot by hundreds of milliseconds. Reduce the delay, checking for
readiness every millisecond instead.
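
The effect of the polling interval can be sketched with a simulated clock. This is a minimal model under stated assumptions, not the kernel's nvme_wait_ready(): `struct fake_ctrl`, `ctrl_is_ready()` and `wait_ready()` are hypothetical names, and advancing `now_ms` stands in for sleeping.

```c
#include <stdbool.h>

/* Hypothetical device model: becomes ready once ready_at_ms has elapsed. */
struct fake_ctrl {
    unsigned int now_ms;       /* simulated clock */
    unsigned int ready_at_ms;  /* time at which the device reports ready */
};

static bool ctrl_is_ready(const struct fake_ctrl *c)
{
    return c->now_ms >= c->ready_at_ms;
}

/*
 * Poll for readiness every interval_ms until timeout_ms expires.
 * Returns the simulated time at which readiness was observed,
 * or -1 on timeout.
 */
static int wait_ready(struct fake_ctrl *c, unsigned int interval_ms,
                      unsigned int timeout_ms)
{
    while (!ctrl_is_ready(c)) {
        if (c->now_ms >= timeout_ms)
            return -1;
        c->now_ms += interval_ms;   /* stands in for sleeping interval_ms */
    }
    return (int)c->now_ms;
}
```

In this model, a device that is ready after 5 ms is noticed at 100 ms with a 100 ms interval, but at 5 ms with a 1 ms interval.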

Boot-time tests on an AWS c5.12xlarge:

Before:
[    0.546936] initcall nvme_init+0x0/0x5b returned 0 after 37 usecs
...
[    0.764178] nvme nvme0: 2/0/0 default/read/poll queues
[    0.768424]  nvme0n1: p1
[    0.774132] EXT4-fs (nvme0n1p1): mounted filesystem with ordered data mode. Opts: (null)
[    0.774146] VFS: Mounted root (ext4 filesystem) on device 259:1.
...
[    0.788141] Run /sbin/init as init process

After:
[    0.537088] initcall nvme_init+0x0/0x5b returned 0 after 37 usecs
...
[    0.543457] nvme nvme0: 2/0/0 default/read/poll queues
[    0.548473]  nvme0n1: p1
[    0.554339] EXT4-fs (nvme0n1p1): mounted filesystem with ordered data mode. Opts: (null)
[    0.554344] VFS: Mounted root (ext4 filesystem) on device 259:1.
...
[    0.567931] Run /sbin/init as init process
Signed-off-by: Josh Triplett <josh@joshtriplett.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>

3e98c244

nvme: log additional message for controller status · 94d2e705

Authored by Rupesh Girase on Feb 27, 2020

Log the controller status to help determine whether an issue
lies within the kernel nvme subsystem or the controller is unhealthy.
Signed-off-by: Rupesh Girase <rgirase@redhat.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulakrni@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>

94d2e705

nvme: code cleanup nvme_identify_ns_desc() · ad95a613

Authored by Chaitanya Kulkarni on Feb 19, 2020

The function nvme_identify_ns_desc() has 3 levels of nesting, which makes
the error message exceed 80 chars per line. This is not aligned with
the kernel coding standards and the rest of the NVMe subsystem code.

Add a helper function to move the processing of the log when the
command is successful, reducing the nesting and keeping the
code < 80 chars per line.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>

ad95a613

nvme: Don't deter users from enabling hwmon support · 22891450

Authored by Jean Delvare on Feb 11, 2020

I see no good reason for the "If unsure, say N" advice in the description
of the NVME_HWMON configuration option. It is not dangerous, it does
not select any other option, and has a fairly low overhead.

As the option is already not enabled by default, further suggesting
hesitant users to not enable it is not useful anyway. Unlike some other
options where the description alone may not be sufficient for users to
make a decision, NVME_HWMON is pretty simple to grasp in my opinion,
so just let the user do what they want.
Signed-off-by: Jean Delvare <jdelvare@suse.de>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>

22891450

nvme: expose hostid via sysfs for fabrics controllers · 45fb19f7

Authored by Sagi Grimberg on Feb 07, 2020

We allow userspace to connect with a custom hostid which is useful for
certain use-cases. However there is no way to tell what is the hostid
used to connect to a given controller.

Expose this so userspace can correlate controllers based on hostid.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>

45fb19f7

nvme: expose hostnqn via sysfs for fabrics controllers · 76171c6c

Authored by Sagi Grimberg on Feb 07, 2020

We allow userspace to connect with a custom hostnqn which is useful for
certain use-cases. However there is no way to tell what is the hostnqn
used to connect to a given controller.

Expose this so userspace can correlate controllers based on hostnqn.
Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>

76171c6c

05 Mar, 2020 2 commits

nvme-tcp: Set SO_PRIORITY for all host sockets · 9912ade3

Authored by Wunderlich, Mark on Jan 16, 2020

Enable ability to associate all sockets related to NVMf TCP traffic
to a priority group that will perform optimized network processing for
this traffic class. Maintain initial default behavior of using priority
of zero.
Signed-off-by: Kiran Patil <kiran.patil@intel.com>
Signed-off-by: Mark Wunderlich <mark.wunderlich@intel.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>

9912ade3

nvme: remove unused return code from nvme_alloc_ns · adce7e98

Authored by Edmund Nadolski on Nov 27, 2019

The return code of nvme_alloc_ns is never used, so change it
to void.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Edmund Nadolski <edmund.nadolski@intel.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>

adce7e98

28 Feb, 2020 1 commit

nvme-pci: Hold cq_poll_lock while completing CQEs · 9515743b

Authored by Bijan Mottahedeh on Feb 26, 2020

Completions need to be consumed in the same order the controller submitted
them, otherwise future completion entries may overwrite ones we haven't
handled yet. Hold the nvme queue's poll lock while completing new CQEs to
prevent another thread from freeing command tags for reuse out-of-order.

Fixes: dabcefab ("nvme: provide optimized poll function for separate poll queues")
Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Keith Busch <kbusch@kernel.org>

9515743b

21 Feb, 2020 1 commit

nvme-multipath: Fix memory leak with ana_log_buf · 3b783090

Authored by Logan Gunthorpe on Feb 20, 2020

kmemleak reports a memory leak with the ana_log_buf allocated by
nvme_mpath_init():

unreferenced object 0xffff888120e94000 (size 8208):
  comm "nvme", pid 6884, jiffies 4295020435 (age 78786.312s)
    hex dump (first 32 bytes):
      00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00  ................
      01 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00  ................
    backtrace:
      [<00000000e2360188>] kmalloc_order+0x97/0xc0
      [<0000000079b18dd4>] kmalloc_order_trace+0x24/0x100
      [<00000000f50c0406>] __kmalloc+0x24c/0x2d0
      [<00000000f31a10b9>] nvme_mpath_init+0x23c/0x2b0
      [<000000005802589e>] nvme_init_identify+0x75f/0x1600
      [<0000000058ef911b>] nvme_loop_configure_admin_queue+0x26d/0x280
      [<00000000673774b9>] nvme_loop_create_ctrl+0x2a7/0x710
      [<00000000f1c7a233>] nvmf_dev_write+0xc66/0x10b9
      [<000000004199f8d0>] __vfs_write+0x50/0xa0
      [<0000000065466fef>] vfs_write+0xf3/0x280
      [<00000000b0db9a8b>] ksys_write+0xc6/0x160
      [<0000000082156b91>] __x64_sys_write+0x43/0x50
      [<00000000c34fbb6d>] do_syscall_64+0x77/0x2f0
      [<00000000bbc574c9>] entry_SYSCALL_64_after_hwframe+0x49/0xbe

nvme_mpath_init() is called by nvme_init_identify() which is called in
multiple places (nvme_reset_work(), nvme_passthru_end(), etc). This
means nvme_mpath_init() may be called multiple times before
nvme_mpath_uninit() (which is only called on nvme_free_ctrl()).

When nvme_mpath_init() is called multiple times, it overwrites the
ana_log_buf pointer with a new allocation, thus leaking the previous
allocation.

To fix this, free ana_log_buf before allocating a new one.
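
The free-before-allocate pattern can be sketched in userspace C as follows. This is an illustration under assumptions: `struct ctrl` and both helpers are hypothetical, and malloc()/free() stand in for the kernel allocator.

```c
#include <stdlib.h>

/* Hypothetical controller state holding one (re)allocated buffer. */
struct ctrl {
    void  *log_buf;
    size_t log_buf_size;
};

/*
 * Init routine that may run several times before the matching uninit.
 * Releasing the previous buffer first keeps repeated calls from leaking it.
 * Returns 0 on success, -1 if the allocation fails.
 */
static int ctrl_init_log_buf(struct ctrl *c, size_t size)
{
    free(c->log_buf);            /* free(NULL) is a no-op on the first call */
    c->log_buf = malloc(size);
    if (!c->log_buf) {
        c->log_buf_size = 0;
        return -1;
    }
    c->log_buf_size = size;
    return 0;
}

/* Release the buffer and clear the pointer so a later init starts clean. */
static void ctrl_uninit_log_buf(struct ctrl *c)
{
    free(c->log_buf);
    c->log_buf = NULL;
    c->log_buf_size = 0;
}
```

The sketch requires that the structure is zero-initialized before the first init call, so that the first free() sees a NULL pointer.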

Fixes: 0d0b660f ("nvme: add ANA support")
Cc: <stable@vger.kernel.org>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>

3b783090

20 Feb, 2020 1 commit

nvme: Fix uninitialized-variable warning · 15755854

Authored by Keith Busch on Feb 20, 2020

gcc may detect a false positive on nvme using an uninitialized variable
if setting features fails. Since this is not a fast path, explicitly
initialize this variable to suppress the warning.
Reported-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>

15755854

19 Feb, 2020 2 commits

nvme-pci: Use single IRQ vector for old Apple models · 98f7b86a

Authored by Andy Shevchenko on Feb 12, 2020

People reported that old Apple machines are not working properly
if a non-first IRQ vector is in use.

Set a quirk for those models to limit IRQ usage to the first vector only.

Based on original patch by GitHub user npx001.

Link: https://github.com/Dunedan/mbp-2016-linux/issues/9
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Leif Liddy <leif.liddy@gmail.com>
Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>

98f7b86a

nvme/pci: Add sleep quirk for Samsung and Toshiba drives · 1fae37ac

Authored by Shyjumon N on Feb 06, 2020

The Samsung SSD SM981/PM981 and Toshiba SSD KBG40ZNT256G on the Lenovo
C640 platform experience runtime resume issues when the SSDs are kept in
sleep/suspend mode for a long time.

This patch applies the 'Simple Suspend' quirk to these configurations.
With this patch, the issue was not observed in a 1+ day test.
Reviewed-by: Jon Derrick <jonathan.derrick@intel.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Shyjumon N <shyjumon.n@intel.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>

1fae37ac

15 Feb, 2020 4 commits

nvme: fix the parameter order for nvme_get_log in nvme_get_fw_slot_info · f25372ff

Authored by Yi Zhang on Feb 14, 2020

The nvme fw-activate operation produces the warning log below;
fix it by correcting the parameter order.

[  113.231513] nvme nvme0: Get FW SLOT INFO log error

Fixes: 0e98719b ("nvme: simplify the API for getting log pages")
Reported-by: Sujith Pandel <sujith_pandel@dell.com>
Reviewed-by: David Milburn <dmilburn@redhat.com>
Signed-off-by: Yi Zhang <yi.zhang@redhat.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

f25372ff

nvme/pci: move cqe check after device shutdown · fa46c6fb

Authored by Keith Busch on Feb 13, 2020

Many users have reported nvme triggered irq_startup() warnings during
shutdown. The driver uses the nvme queue's irq to synchronize scanning
for completions, and enabling an interrupt affined to only offline CPUs
triggers the alarming warning.

Move the final CQE check to after disabling the device and all
registered interrupts have been torn down so that we do not have any
IRQ to synchronize.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=206509
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

fa46c6fb

nvme: prevent warning triggered by nvme_stop_keep_alive · 97b2512a

Authored by Nigel Kirkland on Feb 10, 2020

Delayed keep alive work is queued on system workqueue and may be cancelled
via nvme_stop_keep_alive from nvme_reset_wq, nvme_fc_wq or nvme_wq.

Check_flush_dependency detects mismatched attributes between the work-queue
context used to cancel the keep alive work and system-wq. Specifically
system-wq does not have the WQ_MEM_RECLAIM flag, whereas the contexts used
to cancel keep alive work have WQ_MEM_RECLAIM flag.

Example warning:

  workqueue: WQ_MEM_RECLAIM nvme-reset-wq:nvme_fc_reset_ctrl_work [nvme_fc]
	is flushing !WQ_MEM_RECLAIM events:nvme_keep_alive_work [nvme_core]

To avoid the flags mismatch, delayed keep alive work is queued on nvme_wq.

However this creates a secondary concern where work and a request to cancel
that work may be in the same work queue - namely err_work in the rdma and
tcp transports, which will want to flush/cancel the keep alive work which
will now be on nvme_wq.

After reviewing the transports, it looks like err_work can be moved to
nvme_reset_wq. In fact that aligns them better with transition into
RESETTING and performing related reset work in nvme_reset_wq.

Change nvme-rdma and nvme-tcp to perform err_work in nvme_reset_wq.
Signed-off-by: Nigel Kirkland <nigel.kirkland@broadcom.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

97b2512a

nvme/tcp: fix bug on double requeue when send fails · 2d570a7c

Authored by Anton Eidelman on Feb 10, 2020

When nvme_tcp_io_work() fails to send to socket due to
connection close/reset, error_recovery work is triggered
from nvme_tcp_state_change() socket callback.
This cancels all the active requests in the tagset,
which requeues them.

The failed request, however, was ended and thus requeued
individually as well unless send returned -EPIPE.
Another return code to be treated the same way is -ECONNRESET.

Double requeue caused BUG_ON(blk_queued_rq(rq))
in blk_mq_requeue_request() from either the individual requeue
of the failed request or the bulk requeue from
blk_mq_tagset_busy_iter(, nvme_cancel_request, );
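
The decision described above can be sketched as a small predicate. This is a simplified model, not the driver's code: the helper name is hypothetical, and it only considers hard send errors, assuming the caller has already filtered out transient results.

```c
#include <errno.h>
#include <stdbool.h>

/*
 * Decide whether a request whose send failed should be failed individually.
 * For -EPIPE and -ECONNRESET the connection is gone and error recovery will
 * cancel every outstanding request, so handling this one here as well would
 * requeue it twice.
 */
static bool should_fail_request_individually(int send_ret)
{
    if (send_ret >= 0)
        return false;   /* send succeeded, nothing to fail */
    if (send_ret == -EPIPE || send_ret == -ECONNRESET)
        return false;   /* left to the bulk cancel path */
    return true;
}
```

Leaving the connection-loss cases to the bulk cancel path means each request is requeued at most once.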
Signed-off-by: Anton Eidelman <anton@lightbitslabs.com>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

2d570a7c

04 Feb, 2020 1 commit

nvme-pci: remove nvmeq->tags · cfa27356

Authored by Christoph Hellwig on Jan 30, 2020

There is no real need to have a pointer to the tagset in
struct nvme_queue, as we only need it in a single place, and that place
can derive the used tagset from the device and qid trivially.  This
fixes a problem with stale pointer exposure when tagsets are reset,
and also shrinks the nvme_queue structure.  It also matches what most
other transports have done since day 1.
Reported-by: Edmund Nadolski <edmund.nadolski@intel.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>

cfa27356

01 Feb, 2020 1 commit

nvme: hwmon: switch to use <linux/units.h> helpers · 7724cd2b

Authored by Akinobu Mita on Jan 30, 2020

This switches the nvme driver to use kelvin_to_millicelsius() and
millicelsius_to_kelvin() in <linux/units.h>.
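
As an illustration of what such conversions compute, here is a userspace sketch. The function names are deliberately different from the kernel helpers, and the rounding behavior shown is an assumption of this sketch rather than a statement about the kernel implementation.

```c
/* Absolute zero expressed in millidegrees Celsius (0 K = -273.15 C). */
#define ABS_ZERO_MILLICELSIUS (-273150L)

/* Convert whole kelvin to millidegrees Celsius. */
static long kelvin_to_mc(long kelvin)
{
    return kelvin * 1000L + ABS_ZERO_MILLICELSIUS;
}

/*
 * Convert millidegrees Celsius to kelvin, rounded to the nearest kelvin.
 * Assumes mc is not below absolute zero.
 */
static long mc_to_kelvin(long mc)
{
    return (mc - ABS_ZERO_MILLICELSIUS + 500L) / 1000L;
}
```

For example, 300 K corresponds to 26850 millidegrees Celsius, and the largest 16-bit value, 65535 K, corresponds to 65261850 millidegrees Celsius.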

Link: http://lkml.kernel.org/r/1576386975-7941-8-git-send-email-akinobu.mita@gmail.com
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Keith Busch <kbusch@kernel.org>
Reviewed-by: Andy Shevchenko <andy.shevchenko@gmail.com>
Cc: Sujith Thomas <sujith.thomas@intel.com>
Cc: Darren Hart <dvhart@infradead.org>
Cc: Andy Shevchenko <andy@infradead.org>
Cc: Zhang Rui <rui.zhang@intel.com>
Cc: Daniel Lezcano <daniel.lezcano@linaro.org>
Cc: Amit Kucheria <amit.kucheria@verdurent.com>
Cc: Jean Delvare <jdelvare@suse.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Jens Axboe <axboe@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Cc: Hartmut Knaack <knaack.h@gmx.de>
Cc: Johannes Berg <johannes.berg@intel.com>
Cc: Jonathan Cameron <jic23@kernel.org>
Cc: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Cc: Kalle Valo <kvalo@codeaurora.org>
Cc: Lars-Peter Clausen <lars@metafoo.de>
Cc: Luca Coelho <luciano.coelho@intel.com>
Cc: Peter Meerwald-Stadler <pmeerw@pmeerw.net>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

7724cd2b

10 Jan, 2020 1 commit

nvme: Translate more status codes to blk_status_t · 35038bff

Authored by Keith Busch on Dec 06, 2019

Decode interrupted command and not ready namespace nvme status codes to
BLK_STS_TARGET. These are not generic IO errors and should use a non-path
specific error so that it can use the non-failover retry path.
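
The mapping idea can be sketched with stand-in enums. Everything named here is hypothetical: the enumerators below are illustrative placeholders, not the NVMe status codes or the block layer's blk_status_t values.

```c
/* Illustrative device-side status values (placeholders, not real codes). */
enum dev_status {
    DEV_OK,
    DEV_CMD_INTERRUPTED,
    DEV_NS_NOT_READY,
    DEV_OTHER_ERROR
};

/* Illustrative block-layer results (placeholders). */
enum blk_result {
    RES_OK,      /* success */
    RES_TARGET,  /* target-side error, retried without failing over */
    RES_IOERR    /* generic I/O error */
};

/* Translate a device status into the result class the upper layer acts on. */
static enum blk_result translate_status(enum dev_status status)
{
    switch (status) {
    case DEV_OK:
        return RES_OK;
    case DEV_CMD_INTERRUPTED:
    case DEV_NS_NOT_READY:
        return RES_TARGET;
    default:
        return RES_IOERR;
    }
}
```

Grouping the two conditions under one non-path-specific result keeps them out of the generic I/O error bucket.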
Reported-by: John Meneghini <John.Meneghini@netapp.com>
Cc: Hannes Reinecke <hare@suse.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

35038bff

07 Jan, 2020 1 commit

block: Allow t10-pi to be modular · a754bd5f

Authored by Herbert Xu on Dec 23, 2019

Currently t10-pi can only be built into the block layer which via
crc-t10dif pulls in a whole chunk of the Crypto API.  In fact all
users of t10-pi work as modules and there is no reason for it to
always be built-in.

This patch adds a new hidden option for t10-pi that is selected
automatically based on BLK_DEV_INTEGRITY and whether the users
of t10-pi are built-in or not.
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

a754bd5f

07 Dec, 2019 3 commits

nvme/pci: Fix read queue count · 7e4c6b9a

Authored by Keith Busch on Dec 06, 2019

If nvme.write_queues equals the number of CPUs, the driver had decreased
the number of interrupts available such that there could only be one read
queue even if the controller could support more. Remove the interrupt
count reduction in this case. The driver wouldn't request more IRQs than
it wants queues anyway.
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Keith Busch <kbusch@kernel.org>

7e4c6b9a

nvme/pci Limit write queue sizes to possible cpus · 17c33167

Authored by Keith Busch on Dec 07, 2019

The driver can never use more queues of any type than the number of
possible CPUs, so a higher value causes the driver to allocate more
memory for IO queues than it could ever use. Limit the parameter at
module load time to the number of possible cpus.
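
The capping itself amounts to a simple clamp, sketched below. The helper name and the idea of passing the CPU count as a parameter are assumptions of this sketch; the driver obtains the number of possible CPUs from the kernel.

```c
/* Cap a requested queue count at the number of CPUs that could ever exist. */
static unsigned int clamp_queue_count(unsigned int requested,
                                      unsigned int possible_cpus)
{
    return requested > possible_cpus ? possible_cpus : requested;
}
```

With 8 possible CPUs, a request for 64 queues is reduced to 8, while a request for 4 is left unchanged.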
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Keith Busch <kbusch@kernel.org>

17c33167

nvme/pci: Fix write and poll queue types · 3f68baf7

Authored by Keith Busch on Dec 07, 2019

The number of poll or write queues should never be negative. Use unsigned
types so that it's not possible to have the driver not allocate
any queues.
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Keith Busch <kbusch@kernel.org>

3f68baf7

03 Dec, 2019 2 commits

nvme/pci: Remove last_cq_head · f6c4d97b

Authored by Keith Busch on Dec 03, 2019

We had been saving the last_cq_head seen from an interrupt so that a
polled queue wouldn't mistakenly trigger spurious interrupt detection. We
don't poll interrupt driven queues any more, so saving this value is
pointless.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>

f6c4d97b

nvme: Namepace identification descriptor list is optional · 22802bf7

Authored by Keith Busch on Dec 03, 2019

Although NVM Express specification 1.3 requires a controller claiming to
be 1.3 or higher to implement Identify CNS 03h (Namespace Identification
Descriptor list), the driver doesn't really need this identification in
order to use a namespace. The code had already documented in comments
that an error from this command is not to be treated as fatal.

Return success if the controller provided any response to a
namespace identification descriptors command.

Fixes: 538af88e ("nvme: make nvme_report_ns_ids propagate error back")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=205679
Reported-by: Ingo Brunberg <ingo_brunberg@web.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: stable@vger.kernel.org # 5.4+
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Keith Busch <kbusch@kernel.org>

22802bf7

27 Nov, 2019 7 commits

Revert "nvme: Add quirk for Kingston NVME SSD running FW E8FK11.T" · 655e7aee

Authored by Jian-Hong Pan on Oct 31, 2019

Since e045fa29 ("PCI/MSI: Fix incorrect MSI-X masking on resume") is
merged, we can revert the previous quirk now.

This reverts commit 19ea025e.

Buglink: https://bugzilla.kernel.org/show_bug.cgi?id=204887
Fixes: 19ea025e ("nvme: Add quirk for Kingston NVME SSD running FW E8FK11.T")
Link: https://lore.kernel.org/r/20191031093408.9322-1-jian-hong@endlessm.com
Signed-off-by: Jian-Hong Pan <jian-hong@endlessm.com>
Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Cc: stable@vger.kernel.org

655e7aee

nvme-fc: fix double-free scenarios on hw queues · c869e494

Authored by James Smart on Nov 21, 2019

If an error occurs on one of the ios used for creating an
association, the creating routine has error paths that are
invoked by the command failure and the error paths will free
up the controller resources created to that point.

But... the io was ultimately determined by an asynchronous
completion routine that detected the error and which
unconditionally invokes the error_recovery path which calls
delete_association. Delete association deletes all outstanding
io then tears down the controller resources. So the
create_association thread can be running in parallel with
the error_recovery thread. What was seen was the LLDD received
a call to delete a queue, causing the LLDD to do a free of a
resource, then the transport called the delete queue again
causing the driver to repeat the free call. The second free
routine corrupted the allocator. The transport shouldn't be
making the duplicate call, and the delete queue is just one
of the resources being freed.

The fix relies on the fact that the create_association path is
completely serialized, with one command at a time. So the
failed io completion will always be seen by the create_association
path and, as of the failure, there are no ios to terminate and there
is no reason to be manipulating queue freeze states, etc.
The serialized condition stays true until the controller is
transitioned to the LIVE state. Thus the fix is to change the
error recovery path to check the controller state and only
invoke the teardown path if not already in the CONNECTING state.
Reviewed-by: Himanshu Madhani <hmadhani@marvell.com>
Reviewed-by: Ewan D. Milne <emilne@redhat.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>

c869e494

nvme: else following return is not needed · c80b36cd

Authored by Edmund Nadolski on Nov 25, 2019

Remove unnecessary keyword in nvme_create_queue().
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Edmund Nadolski <edmund.nadolski@intel.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>

c80b36cd

nvme: add error message on mismatching controller ids · a8157ff3

Authored by James Smart on Nov 21, 2019

We've seen a few devices that return different controller id's to
the Fabric Connect command vs the Identify(controller) command. It's
currently hard to identify this failure by existing error messages. It
comes across as a (re)connect attempt in the transport that fails with
a -22 (-EINVAL) status. The issue is compounded by older kernels either not
having the controller id check or having the identify command overwrite the
fabrics controller id value before it was checked. Both resulted in cases
where the devices appeared fine until more recent kernels.

Clarify the reject by adding an error message on controller id mismatches.
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Reviewed-by: Ewan D. Milne <emilne@redhat.com>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>

a8157ff3

nvme_fc: add module to ops template to allow module references · 863fbae9

Authored by James Smart on Nov 14, 2019

In nvme-fc, it's possible to have connected active controllers
and as no references are taken on the LLDD, the LLDD can be
unloaded.  The controller would enter a reconnect state and as
long as the LLDD resumed within the reconnect timeout, the
controller would resume.  But if a namespace on the controller
is the root device, allowing the driver to unload can be problematic.
To reload the driver, it may require new io to the boot device,
and as it's no longer connected we get into a catch-22 that
eventually fails, and the system locks up.

Fix this issue by taking a module reference for every connected
controller (which is what the core layer did to the transport
module). Reference is cleared when the controller is removed.
Acked-by: Himanshu Madhani <hmadhani@marvell.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: James Smart <jsmart2021@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>

863fbae9

nvme-fc: Avoid preallocating big SGL for data · b1ae1a23

Authored by Israel Rukshin on Nov 24, 2019

nvme_fc_create_io_queues() preallocates a big buffer for the IO SGL based
on SG_CHUNK_SIZE.

Modern DMA engines are often capable of dealing with very big segments so
the SG_CHUNK_SIZE is often too big. SG_CHUNK_SIZE results in a static 4KB
SGL allocation per command.

If a controller has lots of deep queues, preallocation for the sg list can
consume substantial amounts of memory. For nvme-fc, nr_hw_queues can be
128 and each queue's depth 128. This means the resulting preallocation
for the data SGL is 128*128*4K = 64MB per controller.

Switch to runtime allocation for SGL for lists longer than 2 entries. This
is the approach used by NVMe PCI so it should be reasonable for NVMeOF as
well. Runtime SGL allocation has always been the case for the legacy I/O
path so this is nothing new.
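
The memory figure quoted above can be reproduced with a short calculation. The helper below is a hypothetical illustration of the arithmetic only and is not part of the driver.

```c
#include <stddef.h>

/* Total bytes preallocated when every command gets a fixed-size SGL. */
static size_t sgl_prealloc_bytes(size_t nr_hw_queues, size_t queue_depth,
                                 size_t bytes_per_cmd)
{
    return nr_hw_queues * queue_depth * bytes_per_cmd;
}
```

With 128 queues, a depth of 128 and 4096 bytes per command, the result is 67108864 bytes, i.e. 64MB per controller.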
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: James Smart <james.smart@broadcom.com>
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>

b1ae1a23

nvme-rdma: Avoid preallocating big SGL for data · 38e18002

Authored by Israel Rukshin on Nov 24, 2019

nvme_rdma_alloc_tagset() preallocates a big buffer for the IO SGL based
on SG_CHUNK_SIZE.

Modern DMA engines are often capable of dealing with very big segments so
the SG_CHUNK_SIZE is often too big. SG_CHUNK_SIZE results in a static 4KB
SGL allocation per command.

If a controller has lots of deep queues, preallocation for the sg list can
consume substantial amounts of memory. For nvme-rdma, nr_hw_queues can be
128 and each queue's depth 128. This means the resulting preallocation
for the data SGL is 128*128*4K = 64MB per controller.

Switch to runtime allocation for SGL for lists longer than 2 entries. This
is the approach used by NVMe PCI so it should be reasonable for NVMeOF as
well. Runtime SGL allocation has always been the case for the legacy I/O
path so this is nothing new.

The preallocated small SGL depends on SG_CHAIN so if the ARCH doesn't
support SG_CHAIN, use only runtime allocation for the SGL.

We didn't notice a performance degradation, since for small IOs we'll
use the inline SG and for the bigger IOs the allocation of a bigger SGL
from slab is fast enough.
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Israel Rukshin <israelr@mellanox.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>

38e18002

22 Nov, 2019 2 commits

nvme: hwmon: add quirk to avoid changing temperature threshold · 6c6aa2f2

Authored by Akinobu Mita on Nov 15, 2019

This adds a new quirk NVME_QUIRK_NO_TEMP_THRESH_CHANGE to avoid changing
the value of the temperature threshold feature for specific devices that
show undesirable behavior.

Guenter reported:

"On my Intel NVME drive (SSDPEKKW512G7), writing any minimum limit on the
Composite temperature sensor results in a temperature warning, and that
warning is sticky until I reset the controller.

It doesn't seem to matter which temperature I write; writing -273000 has
the same result."

The Intel NVMe has the latest firmware version installed, so this isn't
a problem that was ever fixed.
Reported-by: Guenter Roeck <linux@roeck-us.net>
Cc: Keith Busch <kbusch@kernel.org>
Cc: Jens Axboe <axboe@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Jean Delvare <jdelvare@suse.com>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>

6c6aa2f2

nvme: hwmon: provide temperature min and max values for each sensor · 52deba0f

Authored by Akinobu Mita on Nov 15, 2019

According to the NVMe specification, the over temperature threshold and
under temperature threshold features shall be implemented for Composite
Temperature if a non-zero WCTEMP field value is reported in the Identify
Controller data structure.  The features are also implemented for all
implemented temperature sensors (i.e., all Temperature Sensor fields that
report a non-zero value).

This provides the over temperature threshold and under temperature
threshold for each sensor as temperature min and max values of hwmon
sysfs attributes.

The WCTEMP is already provided as a temperature max value for Composite
Temperature, but this change isn't incompatible, because the default
value of the over temperature threshold for Composite Temperature is
the WCTEMP.

Now the alarm attribute for Composite Temperature indicates that one of the
temperatures is outside of a temperature threshold, because there is only
a single bit in the Critical Warning field that indicates a temperature is
outside of a threshold.

Example output from the "sensors" command:

nvme-pci-0100
Adapter: PCI adapter
Composite:    +33.9°C  (low  = -273.1°C, high = +69.8°C)
                       (crit = +79.8°C)
Sensor 1:     +34.9°C  (low  = -273.1°C, high = +65261.8°C)
Sensor 2:     +31.9°C  (low  = -273.1°C, high = +65261.8°C)
Sensor 5:     +47.9°C  (low  = -273.1°C, high = +65261.8°C)

This also adds helper macros for kelvin from/to milli Celsius conversion,
and replaces the repeated code in hwmon.c.

Cc: Keith Busch <kbusch@kernel.org>
Cc: Jens Axboe <axboe@fb.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Sagi Grimberg <sagi@grimberg.me>
Cc: Jean Delvare <jdelvare@suse.com>
Reviewed-by: Guenter Roeck <linux@roeck-us.net>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>

52deba0f

13 Nov, 2019 1 commit

nvme: Discard workaround for non-conformant devices · 530436c4

Authored by Eduard Hasenleithner on Nov 12, 2019

Users observe IOMMU related errors when performing discard on nvme, caused
by non-compliant nvme devices reading beyond the end of the DMA mapped
ranges to discard.

Two different variants of this behavior have been observed: SM22XX
controllers round up the read size to a multiple of 512 bytes, and Phison
E12 unconditionally reads the maximum discard size allowed by the spec
(256 segments or 4kB).

Make nvme_setup_discard unconditionally allocate the maximum DSM buffer
so the driver DMA maps a memory range that will always succeed.
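
The 4kB figure follows from 256 ranges of 16 bytes each. The sketch below illustrates the size calculation; the struct and macro names are chosen for this example and are assumptions, though the field layout follows the 16-byte range entry of a Dataset Management command.

```c
#include <stddef.h>
#include <stdint.h>

/* One discard range as carried by a Dataset Management command (16 bytes). */
struct dsm_range {
    uint32_t cattr;  /* context attributes */
    uint32_t nlb;    /* number of logical blocks */
    uint64_t slba;   /* starting LBA */
};

/* Maximum number of ranges allowed in a single command. */
#define DSM_MAX_RANGES 256

/* Size of a buffer large enough for the maximum number of ranges. */
static size_t dsm_max_buffer_size(void)
{
    return DSM_MAX_RANGES * sizeof(struct dsm_range);
}
```

Always mapping this maximum size means a device that reads past the ranges actually in use still stays inside the mapped buffer.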

Link: https://bugzilla.kernel.org/show_bug.cgi?id=202665 many
Signed-off-by: Eduard Hasenleithner <eduard@hasenleithner.at>
[changelog, use existing define, kernel coding style]
Signed-off-by: Keith Busch <kbusch@kernel.org>

530436c4

12 Nov, 2019 1 commit

nvme: Add hardware monitoring support · 400b6a7b

Authored by Guenter Roeck on Nov 06, 2019

nvme devices report temperature information in the controller information
(for limits) and in the smart log. Currently, the only means to retrieve
this information is the nvme command line interface, which requires
super-user privileges.

At the same time, it would be desirable to be able to use NVMe temperature
information for thermal control.

This patch adds support to read NVMe temperatures from the kernel using the
hwmon API and adds temperature zones for NVMe drives. The thermal subsystem
can use this information to set thermal policies, and userspace can access
it using libsensors and/or the "sensors" command.

Example output from the "sensors" command:

nvme0-pci-0100
Adapter: PCI adapter
Composite:    +39.0°C  (high = +85.0°C, crit = +85.0°C)
Sensor 1:     +39.0°C
Sensor 2:     +41.0°C
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Keith Busch <kbusch@kernel.org>

400b6a7b

05 Nov, 2019 1 commit

nvme-multipath: fix crash in nvme_mpath_clear_ctrl_paths · 763303a8

Authored by Anton Eidelman on Nov 01, 2019

nvme_mpath_clear_ctrl_paths() iterates through
the ctrl->namespaces list while holding ctrl->scan_lock.
This does not seem to be the correct way of protecting
from concurrent list modification.

Specifically, nvme_scan_work() sorts ctrl->namespaces
AFTER unlocking scan_lock.

This may result in the following (rare) crash in ctrl disconnect
during scan_work:

    BUG: kernel NULL pointer dereference, address: 0000000000000050
    Oops: 0000 [#1] SMP PTI
    CPU: 0 PID: 3995 Comm: nvme 5.3.5-050305-generic
    RIP: 0010:nvme_mpath_clear_current_path+0xe/0x90 [nvme_core]
    ...
    Call Trace:
     nvme_mpath_clear_ctrl_paths+0x3c/0x70 [nvme_core]
     nvme_remove_namespaces+0x35/0xe0 [nvme_core]
     nvme_do_delete_ctrl+0x47/0x90 [nvme_core]
     nvme_sysfs_delete+0x49/0x60 [nvme_core]
     dev_attr_store+0x17/0x30
     sysfs_kf_write+0x3e/0x50
     kernfs_fop_write+0x11e/0x1a0
     __vfs_write+0x1b/0x40
     vfs_write+0xb9/0x1a0
     ksys_write+0x67/0xe0
     __x64_sys_write+0x1a/0x20
     do_syscall_64+0x5a/0x130
     entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7f8d02bfb154

Fix:
After taking scan_lock in nvme_mpath_clear_ctrl_paths()
down_read(&ctrl->namespaces_rwsem) as well to make list traversal safe.
This will not cause deadlocks because taking scan_lock never happens
while holding the namespaces_rwsem.
Moreover, scan work downs namespaces_rwsem in the same order.

Alternative: sort ctrl->namespaces in nvme_scan_work()
while still holding the scan_lock.
This would leave nvme_mpath_clear_ctrl_paths() without correct protection
against ctrl->namespaces modification by anyone other than scan_work.
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Anton Eidelman <anton@lightbitslabs.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>

763303a8

openeuler / Kernel synced successfully more than 1 year ago