提交 · a277654bafb51fb8b4cf23550f15926bb02536f4 · openeuler / Kernel

20 10月, 2021 1 次提交

nvme: add APIs for stopping/starting admin queue · a277654b

由 Ming Lei 提交于 10月 14, 2021

Add two APIs for stopping and starting admin queue.
Signed-off-by: NMing Lei <ming.lei@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211014081710.1871747-2-ming.lei@redhat.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

a277654b

19 10月, 2021 1 次提交

nvme: add support for batched completion of polled IO · c234a653

由 Jens Axboe 提交于 10月 08, 2021

Take advantage of struct io_comp_batch, if passed in to the nvme poll
handler. If it's set, rather than complete each request individually
inline, store them in the io_comp_batch list. We only do so for requests
that will complete successfully, anything else will be completed inline as
before.
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

c234a653

18 10月, 2021 2 次提交

block: rename REQ_HIPRI to REQ_POLLED · 6ce913fe

由 Christoph Hellwig 提交于 10月 12, 2021

Unlike the RWF_HIPRI userspace ABI which is intentionally kept vague,
the bio flag is specific to the polling implementation, so rename and
document it properly.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Tested-by: NMark Wunderlich <mark.wunderlich@intel.com>
Link: https://lore.kernel.org/r/20211012111226.760968-12-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

6ce913fe

block: move integrity handling out of <linux/blkdev.h> · fe45e630

由 Christoph Hellwig 提交于 9月 20, 2021

Split the integrity/metadata handling definitions out into a new header.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210920123328.1399408-17-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

fe45e630

14 10月, 2021 1 次提交

nvme: fix per-namespace chardev deletion · be5eb933

由 Adam Manzanares 提交于 10月 13, 2021

Decrease reference count of chardevice during char device deletion in
order to fix a memory leak.  Add a release callabck for the device
associated chardev and move ida_simple_remove into the release function.

Fixes: 2637baed ("nvme: introduce generic per-namespace chardev")
Reported-by: NYi Zhang <yi.zhang@redhat.com>
Suggested-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NAdam Manzanares <a.manzanares@samsung.com>
Reviewed-by: NJavier GonzÃ¡lez <javier@javigon.com>
Tested-by: NYi Zhang <yi.zhang@redhat.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

be5eb933

28 9月, 2021 1 次提交

nvme: add command id quirk for apple controllers · a2941f6a

由 Keith Busch 提交于 9月 27, 2021

Some apple controllers use the command id as an index to implementation
specific data structures and will fail if the value is out of bounds.
The nvme driver's recently introduced command sequence number breaks
this controller.

Provide a quirk so these spec incompliant controllers can function as
before. The driver will not have the ability to detect bad completions
when this quirk is used, but we weren't previously checking this anyway.

The quirk bit was selected so that it can readily apply to stable.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=214509
Cc: Sven Peter <sven@svenpeter.dev>
Reported-by: NOrlando Chamberlain <redecorating@protonmail.com>
Reported-by: NAditya Garg <gargaditya08@live.com>
Signed-off-by: NKeith Busch <kbusch@kernel.org>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Tested-by: NSven Peter <sven@svenpeter.dev>
Link: https://lore.kernel.org/r/20210927154306.387437-1-kbusch@kernel.orgSigned-off-by: NJens Axboe <axboe@kernel.dk>

a2941f6a

21 9月, 2021 1 次提交

nvme: keep ctrl->namespaces ordered · 298ba0e3

由 Christoph Hellwig 提交于 9月 14, 2021

Various places in the nvme code that rely on ctrl->namespace to be
ordered.  Ensure that the namespae is inserted into the list at the
right position from the start instead of sorting it after the fact.

Fixes: 540c801c ("NVMe: Implement namespace list scanning")
Reported-by: NAnton Eidelman <anton.eidelman@gmail.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NKeith Busch <kbusch@kernel.org>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NChaitanya Kulkarni <kch@nvidia.com>
Reviewed-by: NDamien Le Moal <damien.lemoal@wdc.com>

298ba0e3

15 9月, 2021 1 次提交

nvme: remove the call to nvme_update_disk_info in nvme_ns_remove · 9da4c727

由 Christoph Hellwig 提交于 9月 14, 2021

There is no need to explicitly unregister the integrity profile when
deleting the gendisk.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Link: https://lore.kernel.org/r/20210914070657.87677-4-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

9da4c727

13 9月, 2021 1 次提交

nvme: avoid race in shutdown namespace removal · 9edceaf4

由 Daniel Wagner 提交于 9月 02, 2021

When we remove the siblings entry, we update ns->head->list, hence we
can't separate the removal and test for being empty. They have to be
in the same critical section to avoid a race.

To avoid breaking the refcounting imbalance again, add a list empty
check to nvme_find_ns_head.

Fixes: 5396fdac ("nvme: fix refcounting imbalance when all paths are down")
Signed-off-by: NDaniel Wagner <dwagner@suse.de>
Reviewed-by: NHannes Reinecke <hare@suse.de>
Tested-by: NHannes Reinecke <hare@suse.de>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

9edceaf4

06 9月, 2021 5 次提交

nvme: add error handling support for add_disk() · ab3994f6

由 Luis Chamberlain 提交于 8月 30, 2021

We never checked for errors on add_disk() as this function
returned void. Now that this is fixed, use the shiny new
error handling.
Signed-off-by: NLuis Chamberlain <mcgrof@kernel.org>
Reviewed-by: NHannes Reinecke <hare@suse.de>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

ab3994f6

nvme: only call synchronize_srcu when clearing current path · 041bd1a1

由 Daniel Wagner 提交于 9月 01, 2021

The function nmve_mpath_clear_current_path returns true if the current
path has changed. In this case we have to wait for all concurrent
submissions to finish. But if we didn't change the current path, there
is no point in waiting for another RCU period to finish.
Signed-off-by: NDaniel Wagner <dwagner@suse.de>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

041bd1a1

nvme: update keep alive interval when kato is modified · b58da2d2

由 Tatsuya Sasaki 提交于 9月 01, 2021

Currently the connection between host and NVMe-oF target gets
disconnected by keep-alive timeout when a user connects to a target
with a relatively large kato value and then sets the smaller kato
with a set features command (e.g. connects with 60 seconds kato value
and then sets 10 seconds kato value).

The cause is that keep alive command interval on the host, which is
defined as unsigned int kato in nvme_ctrl structure, does not follow
the kato value changes.

This patch updates the keep alive interval in the following steps when
the kato is modified by a set features command: stops the keep alive
work queue, then sets the kato as new timer value and re-start the queue.
Signed-off-by: NTatsuya Sasaki <tatsuya6.sasaki@kioxia.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

b58da2d2

nvme: move nvme_multi_css into nvme.h · 43dc9878

由 Adam Manzanares 提交于 8月 26, 2021

Preparatory patch in order to reuse nvme_multi_css in the nvme target
code.
Signed-off-by: NAdam Manzanares <a.manzanares@samsung.com>
Reviewed-by: NKeith Busch <kbusch@kernel.org>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

43dc9878

nvme-multipath: revalidate paths during rescan · e7d65803

由 Hannes Reinecke 提交于 8月 24, 2021

When triggering a rescan due to a namespace resize we will be
receiving AENs on every controller, triggering a rescan of all
attached namespaces. If multipath is active only the current path and
the ns_head disk will be updated, the other paths will still refer to
the old size until AENs for the remaining controllers are received.

If I/O comes in before that it might be routed to one of the old
paths, triggering an I/O failure with 'access beyond end of device'.
With this patch the old paths are skipped from multipath path
selection until the controller serving these paths has been rescanned.
Signed-off-by: NHannes Reinecke <hare@suse.de>
[dwagner: - introduce NVME_NS_READY flag instead of NVME_NS_INVALIDATE
          - use 'revalidate' instead of 'invalidate' which
	    follows the zoned device code path.
	  - clear NVME_NS_READY before clearing current_path]
Signed-off-by: NDaniel Wagner <dwagner@suse.de>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

e7d65803

24 8月, 2021 1 次提交

nvme: use blk_mq_alloc_disk · 5f432cce

由 Christoph Hellwig 提交于 8月 16, 2021

Switch to use the blk_mq_alloc_disk helper for allocating the
request_queue and gendisk.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NKeith Busch <kbusch@kernel.org>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Link: https://lore.kernel.org/r/20210816131910.615153-2-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

5f432cce

17 8月, 2021 1 次提交

nvme: use bvec_virt · 3973e15f

由 Christoph Hellwig 提交于 8月 04, 2021

Use bvec_virt instead of open coding it.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NKeith Busch <kbusch@kernel.org>
Link: https://lore.kernel.org/r/20210804095634.460779-16-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

3973e15f

16 8月, 2021 1 次提交

nvme: code command_id with a genctr for use-after-free validation · e7006de6

由 Sagi Grimberg 提交于 6月 16, 2021

We cannot detect a (perhaps buggy) controller that is sending us
a completion for a request that was already completed (for example
sending a completion twice), this phenomenon was seen in the wild
a few times.

So to protect against this, we use the upper 4 msbits of the nvme sqe
command_id to use as a 4-bit generation counter and verify it matches
the existing request generation that is incrementing on every execution.

The 16-bit command_id structure now is constructed by:
| xxxx | xxxxxxxxxxxx |
  gen    request tag

This means that we are giving up some possible queue depth as 12 bits
allow for a maximum queue depth of 4095 instead of 65536, however we
never create such long queues anyways so no real harm done.
Suggested-by: NKeith Busch <kbusch@kernel.org>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
Acked-by: NKeith Busch <kbusch@kernel.org>
Reviewed-by: NHannes Reinecke <hare@suse.de>
Reviewed-by: NDaniel Wagner <dwagner@suse.de>
Tested-by: NDaniel Wagner <dwagner@suse.de>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

e7006de6

15 8月, 2021 1 次提交

remove the lightnvm subsystem · 9ea9b9c4

由 Christoph Hellwig 提交于 8月 12, 2021

Lightnvm supports the OCSSD 1.x and 2.0 specs which were early attempts
to produce Open Channel SSDs and never made it into the NVMe spec
proper. They have since been superceeded by NVMe enhancements such
as ZNS support. Remove the support per the deprecation schedule.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210812132308.38486-1-hch@lst.deReviewed-by: NMatias Bjørling <mb@lightnvm.io>
Reviewed-by: NJavier González <javier@javigon.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

9ea9b9c4

13 8月, 2021 2 次提交

block: remove GENHD_FL_UP · 50b4aecf

由 Christoph Hellwig 提交于 8月 09, 2021

Just check inode_unhashed on the whole device bdev inode instead,
and provide a helper to check for that information.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210809064028.1198327-9-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

50b4aecf

nvme: remove the GENHD_FL_UP check in nvme_ns_remove · 5eba2005

由 Christoph Hellwig 提交于 8月 09, 2021

Early probe failure never reaches nvme_ns_remove, so GENHD_FL_UP must
be set at this point. Remove the check.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210809064028.1198327-4-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

5eba2005

10 8月, 2021 1 次提交

block: pass a gendisk to blk_queue_update_readahead · 471aa704

由 Christoph Hellwig 提交于 8月 09, 2021

.. and rename the function to disk_update_readahead.  This is in
preparation for moving the BDI from the request_queue to the gendisk.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NJan Kara <jack@suse.cz>
Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
Link: https://lore.kernel.org/r/20210809141744.1203023-3-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

471aa704

21 7月, 2021 2 次提交

nvme: set the PRACT bit when using Write Zeroes with T10 PI · aaeb7bb0

由 Christoph Hellwig 提交于 7月 21, 2021

When using Write Zeroes on a namespace that has protection
information enabled they behavior without the PRACT bit
counter-intuitive and will generally lead to validation failures
when reading the written blocks.  Fix this by always setting the
PRACT bit that generates matching PI data on the fly.

Fixes: 6e02318e ("nvme: add support for the Write Zeroes command")
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NKeith Busch <kbusch@kernel.org>
Reviewed-by: NMartin K. Petersen <martin.petersen@oracle.com>

aaeb7bb0

nvme: fix refcounting imbalance when all paths are down · 5396fdac

由 Hannes Reinecke 提交于 7月 16, 2021

When the last path to a ns_head drops the current code
removes the ns_head from the subsystem list, but will only
delete the disk itself if the last reference to the ns_head
drops. This is causing an refcounting imbalance eg when
applications have a reference to the disk, as then they'll
never get notified that the disk is in fact dead.
This patch moves the call 'del_gendisk' into nvme_mpath_check_last_path(),
ensuring that the disk can be properly removed and applications get the
appropriate notifications.
Signed-off-by: NHannes Reinecke <hare@suse.de>
Reviewed-by: NKeith Busch <kbusch@kernel.org>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

5396fdac

01 7月, 2021 2 次提交

nvme: use return value from blk_execute_rq() · ae5e6886

由 Keith Busch 提交于 6月 10, 2021

We don't have an nvme status to report if the driver's .queue_rq()
returns an error without dispatching the requested nvme command. Check
the return value from blk_execute_rq() for all passthrough commands so
the caller may know their command was not successful.

If the command is from the target passthrough interface and fails to
dispatch, synthesize the response back to the host as a internal target
error.
Signed-off-by: NKeith Busch <kbusch@kernel.org>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Link: https://lore.kernel.org/r/20210610214437.641245-5-kbusch@kernel.orgSigned-off-by: NJens Axboe <axboe@kernel.dk>

ae5e6886

nvme: use blk_execute_rq() for passthrough commands · be42a33b

由 Keith Busch 提交于 6月 10, 2021

The generic blk_execute_rq() knows how to handle polled completions. Use
that instead of implementing an nvme specific handler.
Signed-off-by: NKeith Busch <kbusch@kernel.org>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Link: https://lore.kernel.org/r/20210610214437.641245-3-kbusch@kernel.orgSigned-off-by: NJens Axboe <axboe@kernel.dk>

be42a33b

17 6月, 2021 1 次提交

nvme: remove zeroout memset call for struct · cc72c442

由 Chaitanya Kulkarni 提交于 6月 16, 2021

Declare and initialize structure variables to zero values so that we can
remove zeroout memset calls in the host/core.c.
Signed-off-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

cc72c442

03 6月, 2021 4 次提交

nvme: remove nvme_{get,put}_ns_from_disk · f1cf35e1

由 Christoph Hellwig 提交于 5月 19, 2021

Now that only one caller is left remove the helpers by restructuring
nvme_pr_command so that it has two helpers for sending a command of to a
given nsid using either the ns_head for multipath, or the namespace
stored in the gendisk.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NKeith Busch <kbusch@kernel.org>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>

f1cf35e1

nvme: split nvme_report_zones · 8b4fb0f9

由 Christoph Hellwig 提交于 5月 19, 2021

Split multipath support out of nvme_report_zones into a separate helper
and simplify the non-multipath version as a result.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NKeith Busch <kbusch@kernel.org>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>

8b4fb0f9

nvme-tcp: allow selecting the network interface for connections · 3ede8f72

由 Martin Belanger 提交于 5月 20, 2021

In our application, we need a way to force TCP connections to go out a
specific IP interface instead of letting Linux select the interface
based on the routing tables.

Add the 'host-iface' option to allow specifying the interface to use.
When the option host-iface is specified, the driver uses the specified
interface to set the option SO_BINDTODEVICE on the TCP socket before
connecting.

This new option is needed in addtion to the existing host-traddr for
the following reasons:

Specifying an IP interface by its associated IP address is less
intuitive than specifying the actual interface name and, in some cases,
simply doesn't work. That's because the association between interfaces
and IP addresses is not predictable. IP addresses can be changed or can
change by themselves over time (e.g. DHCP). Interface names are
predictable [1] and will persist over time. Consider the following
configuration.

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state ...
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 100.0.0.100/24 scope global lo
       valid_lft forever preferred_lft forever
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc ...
    link/ether 08:00:27:21:65:ec brd ff:ff:ff:ff:ff:ff
    inet 100.0.0.100/24 scope global enp0s3
       valid_lft forever preferred_lft forever
3: enp0s8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc ...
    link/ether 08:00:27:4f:95:5c brd ff:ff:ff:ff:ff:ff
    inet 100.0.0.100/24 scope global enp0s8
       valid_lft forever preferred_lft forever

The above is a VM that I configured with the same IP address
(100.0.0.100) on all interfaces. Doing a reverse lookup to identify the
unique interface associated with 100.0.0.100 does not work here. And
this is why the option host_iface is required. I understand that the
above config does not represent a standard host system, but I'm using
this to prove a point: "We can never know how users will configure
their systems". By te way, The above configuration is perfectly fine
by Linux.

The current TCP implementation for host_traddr performs a
bind()-before-connect(). This is a common construct to set the source
IP address on a TCP socket before connecting. This has no effect on how
Linux selects the interface for the connection. That's because Linux
uses the Weak End System model as described in RFC1122 [2]. On the other
hand, setting the Source IP Address has benefits and should be supported
by linux-nvme. In fact, setting the Source IP Address is a mandatory
FedGov requirement (e.g. connection to a RADIUS/TACACS+ server).
Consider the following configuration.

$ ip addr list dev enp0s8
3: enp0s8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc ...
    link/ether 08:00:27:4f:95:5c brd ff:ff:ff:ff:ff:ff
    inet 192.168.56.101/24 brd 192.168.56.255 scope global enp0s8
       valid_lft 426sec preferred_lft 426sec
    inet 192.168.56.102/24 scope global secondary enp0s8
       valid_lft forever preferred_lft forever
    inet 192.168.56.103/24 scope global secondary enp0s8
       valid_lft forever preferred_lft forever
    inet 192.168.56.104/24 scope global secondary enp0s8
       valid_lft forever preferred_lft forever

Here we can see that several addresses are associated with interface
enp0s8. By default, Linux always selects the default IP address,
192.168.56.101, as the source address when connecting over interface
enp0s8. Some users, however, want the ability to specify a different
source address (e.g., 192.168.56.102, 192.168.56.103, ...). The option
host_traddr can be used as-is to perform this function.

In conclusion, I believe that we need 2 options for TCP connections.
One that can be used to specify an interface (host-iface). And one that
can be used to set the source address (host-traddr). Users should be
allowed to use one or the other, or both, or none. Of course, the
documentation for host_traddr will need some clarification. It should
state that when used for TCP connection, this option only sets the
source address. And the documentation for host_iface should say that
this option is only available for TCP connections.

References:
[1] https://www.freedesktop.org/wiki/Software/systemd/PredictableNetworkInterfaceNames/
[2] https://tools.ietf.org/html/rfc1122

Tested both IPv4 and IPv6 connections.
Signed-off-by: NMartin Belanger <martin.belanger@dell.com>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NHannes Reinecke <hare@suse.de>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

3ede8f72

nvme: extend and modify the APST configuration algorithm · ebd8a93a

由 Alexey Bogoslavsky 提交于 4月 28, 2021

The algorithm that was used until now for building the APST configuration
table has been found to produce entries with excessively long ITPT
(idle time prior to transition) for devices declaring relatively long
entry and exit latencies for non-operational power states. This leads
to unnecessary waste of power and, as a result, failure to pass
mandatory power consumption tests on Chromebook platforms.

The new algorithm is based on two predefined ITPT values and two
predefined latency tolerances. Based on these values, as well as on
exit and entry latencies reported by the device, the algorithm looks
for up to 2 suitable non-operational power states to use as primary
and secondary APST transition targets. The predefined values are
supplied to the nvme driver as module parameters:

 - apst_primary_timeout_ms (default: 100)
 - apst_secondary_timeout_ms (default: 2000)
 - apst_primary_latency_tol_us (default: 15000)
 - apst_secondary_latency_tol_us (default: 100000)

The algorithm echoes the approach used by Intel's and Microsoft's drivers
on Windows. The specific default parameter values are also based on those
drivers. Yet, this patch doesn't introduce the ability to dynamically
regenerate the APST table in the event of switching the power source from
AC to battery and back. Adding this functionality may be considered in the
future. In the meantime, the timeouts and tolerances reflect a compromise
between values used by Microsoft for AC and battery scenarios.

In most NVMe devices the new algorithm causes them to implement a more
aggressive power saving policy. While beneficial in most cases, this
sometimes comes at the price of a higher IO processing latency in certain
scenarios as well as at the price of a potential impact on the drive's
endurance (due to more frequent context saving when entering deep non-
operational states). So in order to provide a fallback for systems where
these regressions cannot be tolerated, the patch allows to revert to
the legacy behavior by setting either apst_primary_timeout_ms or
apst_primary_latency_tol_us parameter to 0. Eventually (and possibly after
fine tuning the default values of the module parameters) the legacy behavior
can be removed.

TESTING.

The new algorithm has been extensively tested. Initially, simulations were
used to compare APST tables generated by old and new algorithms for a wide
range of devices. After that, power consumption, performance and latencies
were measured under different workloads on devices from multiple vendors
(WD, Intel, Samsung, Hynix, Kioxia). Below is the description of the tests
and the findings.

General observations.
The effect the patch has on the APST table varies depending on the entry and
exit latencies advertised by the devices. For some devices, the effect is
negligible (e.g. Kioxia KBG40ZNS), for some significant, making the
transitions to PS3 and PS4 much quicker (e.g. WD SN530, Intel 760P), or making
the sleep deeper, PS4 rather than PS3 after a similar amount of time (e.g.
SK Hynix BC511). For some devices (e.g. Samsung PM991) the effect is mixed:
the initial transition happens after a longer idle time, but takes the device
to a lower power state.

Workflows.
In order to evaluate the patch's effect on the power consumption and latency,
7 workflows were used for each device. The workflows were designed to test
the scenarios where significant differences between the old and new behaviors
are most likely. Each workflow was tested twice: with the new and with the
old APST table generation implementation. Power consumption, performance and
latency were measured in the process. The following workflows were used:
1) Consecutive write at the maximum rate with IO depth of 2, with no pauses
2) Repeated pattern of 1000 consecutive writes of 4K packets followed by 50ms
   idle time
3) Repeated pattern of 1000 consecutive writes of 4K packets followed by 150ms
   idle time
4) Repeated pattern of 1000 consecutive writes of 4K packets followed by 500ms
   idle time
5) Repeated pattern of 1000 consecutive writes of 4K packets followed by 1.5s
   idle time
6) Repeated pattern of 1000 consecutive writes of 4K packets followed by 5s
   idle time
7) Repeated pattern of a single random read of a 4K packet followed by 150ms
   idle time

Power consumption
Actual power consumption measurements produced predictable results in
accordance with the APST mechanism's theory of operation.
Devices with long entry and exit latencies such as WD SN530 showed huge
improvement on scenarios 4,5 and 6 of up to 62%. Devices such as Kioxia
KBG40ZNS where the resulting APST table looks virtually identical with
both legacy and new algorithms, showed little or no change in the average power
consumption on all workflows. Devices with extra short latencies such as
Samsung PM991 showed moderate increase in power consumption of up to 18% in
worst case scenarios.
In addition, on Intel and Samsung devices a more complex impact was observed
on scenarios 3, 4 and 7. Our understanding is that due to longer stay in deep
non-operational states between the writes the devices start performing background
operations leading to an increase of power consumption. With the old APST tables
part of these operations are delayed until the scenario is over and a longer idle
period begins, but eventually this extra power is consumed anyway.

Performance.
In terms of performance measured on sustained write or read scenarios, the
effect of the patch is minimal as in this case the device doesn't enter low power
states.

Latency
As expected, in devices where the patch causes a more aggressive power saving
policy (e.g. WD SN530, Intel 760P), an increase in latency was observed in
certain scenarios. Workflow number 7, specifically designed to simulate the
worst case scenario as far as latency is concerned, indeed shows a sharp
increase in average latency (~2ms -> ~53ms on Intel 760P and 0.6 -> 10ms on
WD SN530). The latency increase on other workloads and other devices is much
milder or non-existent.
Signed-off-by: NAlexey Bogoslavsky <alexey.bogoslavsky@wdc.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

ebd8a93a

01 6月, 2021 1 次提交

block: automatically enable GENHD_FL_EXT_DEVT · 0d1feb72

由 Christoph Hellwig 提交于 5月 21, 2021

Automatically set the GENHD_FL_EXT_DEVT flag for all disks allocated
without an explicit number of minors.  This is what all new block
drivers should do, so make sure it is the default without boilerplate
code.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NHannes Reinecke <hare@suse.de>
Reviewed-by: NLuis Chamberlain <mcgrof@kernel.org>
Reviewed-by: NUlf Hansson <ulf.hansson@linaro.org>
Link: https://lore.kernel.org/r/20210521055116.1053587-4-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

0d1feb72

25 5月, 2021 1 次提交

nvme: fix potential memory leaks in nvme_cdev_add · 3596a065

由 Guoqing Jiang 提交于 5月 21, 2021

We need to call put_device if cdev_device_add failed, otherwise
kmemleak has below report.

[<0000000024c71758>] kmem_cache_alloc_trace+0x233/0x480
[<00000000ad2813ed>] device_add+0x7ff/0xe10
[<0000000035bc54c4>] cdev_device_add+0x72/0xa0
[<000000006c9aa1e8>] nvme_cdev_add+0xa9/0xf0 [nvme_core]
[<000000003c4d492d>] nvme_mpath_set_live+0x251/0x290 [nvme_core]
[<00000000889a58da>] nvme_mpath_add_disk+0x268/0x320 [nvme_core]
[<00000000192e7161>] nvme_alloc_ns+0x669/0xac0 [nvme_core]
[<000000007a1a6041>] nvme_validate_or_alloc_ns+0x156/0x280 [nvme_core]
[<000000003a763c35>] nvme_scan_work+0x221/0x3c0 [nvme_core]
[<000000009ff10706>] process_one_work+0x5cf/0xb10
[<000000000644ee25>] worker_thread+0x7a/0x680
[<00000000285ebd2f>] kthread+0x1c6/0x210
[<00000000e297c6ea>] ret_from_fork+0x22/0x30

Fixes: 2637baed ("nvme: introduce generic per-namespace chardev")
Signed-off-by: NGuoqing Jiang <jiangguoqing@kylinos.cn>
Reviewed-by: NJavier González <javier.gonz@samsung.com>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

3596a065

12 5月, 2021 1 次提交

nvme-multipath: fix double initialization of ANA state · 5e1f6899

由 Christoph Hellwig 提交于 4月 29, 2021

nvme_init_identify and thus nvme_mpath_init can be called multiple
times and thus must not overwrite potentially initialized or in-use
fields.  Split out a helper for the basic initialization when the
controller is initialized and make sure the init_identify path does
not blindly change in-use data structures.

Fixes: 0d0b660f ("nvme: add ANA support")
Reported-by: NMartin Wilck <mwilck@suse.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NKeith Busch <kbusch@kernel.org>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NHannes Reinecke <hare@suse.de>

5e1f6899

04 5月, 2021 4 次提交

nvme: move the fabrics queue ready check routines to core · a9715744

由 Tao Chiu 提交于 4月 26, 2021

queue_rq() in pci only checks if the dispatched queue (nvmeq) is ready,
e.g. not being suspended. Since nvme_alloc_admin_tags() in reset flow
restarts the admin queue, users are able to submit admin commands to a
controller before reset_work() completes. Commands submitted under this
condition may interfere with commands that performs identify, IO queue
setup in reset_work(), and may result in a hang described in the
following patch.

As seen in the fabrics, user commands are prevented from being executed
under inproper controller states. We may reuse this logic to maintain a
clear admin queue during reset_work().
Signed-off-by: NTao Chiu <taochiu@synology.com>
Signed-off-by: NCody Wong <codywong@synology.com>
Reviewed-by: NLeon Chien <leonchien@synology.com>
Reviewed-by: NKeith Busch <kbusch@kernel.org>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

a9715744

nvme: avoid memset for passthrough requests · 51ad06cd

由 Kanchan Joshi 提交于 4月 27, 2021

nvme_clear_nvme_request() clears the nvme_command, which is unncessary
for passthrough requests as nvme_command is overwritten immediately.
Move clearing part from this helper to the caller, so that double memset
for passthrough requests is avoided.
Signed-off-by: NKanchan Joshi <joshi.k@samsung.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

51ad06cd

nvme: add nvme_get_ns helper · 4c74d1f8

由 Kanchan Joshi 提交于 4月 27, 2021

Add a helper to avoid opencoding ns->kref increment.
Decrement is already done via nvme_put_ns helper.
Signed-off-by: NKanchan Joshi <joshi.k@samsung.com>
Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

4c74d1f8

nvme: fix controller ioctl through ns_head · 48145b62

由 Minwoo Im 提交于 4月 22, 2021

In multipath case, we should consider namespace attachment with
controllers in a subsystem when we find out the live controller for the
namespace.  This patch manually reverted the commit 3557a440
("nvme: don't bother to look up a namespace for controller ioctls") with
few more updates to nvme_ns_head_chr_ioctl which has been newly updated.

Fixes: 3557a440 ("nvme: don't bother to look up a namespace for
controller ioctls")
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: NMinwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

48145b62

22 4月, 2021 3 次提交

nvme: introduce generic per-namespace chardev · 2637baed

由 Minwoo Im 提交于 4月 21, 2021

Userspace has not been allowed to I/O to device that's failed to
be initialized.  This patch introduces generic per-namespace character
device to allow userspace to I/O regardless the block device is there or
not.

The chardev naming convention will similar to the existing blkdev naming,
using a ng prefix instead of nvme, i.e.

	- /dev/ngXnY

It also supports multipath which means it will not expose chardev for the
hidden namespace blkdevs (e.g., nvmeXcYnZ).  If /dev/ngXnY is created for
a ns_head, then I/O request will be routed to a specific controller
selected by the iopolicy of the subsystem.
Signed-off-by: NMinwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: NJavier González <javier.gonz@samsung.com>
Reviewed-by: NKeith Busch <kbusch@kernel.org>
Tested-by: NKanchan Joshi <joshi.k@samsung.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

2637baed

nvme: cleanup nvme_configure_apst · 60df5de9

由 Christoph Hellwig 提交于 4月 09, 2021

Remove a level of indentation from the main code implementating the table
search by using a goto for the APST not supported case.  Also move the
main comment above the function.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NNiklas Cassel <niklas.cassel@wdc.com>

60df5de9

nvme: do not try to reconfigure APST when the controller is not live · 53fe2a30

由 Christoph Hellwig 提交于 4月 09, 2021

Do not call nvme_configure_apst when the controller is not live, given
that nvme_configure_apst will fail due the lack of an admin queue when
the controller is being torn down and nvme_set_latency_tolerance is
called from dev_pm_qos_hide_latency_tolerance.

Fixes: 510a405d("nvme: fix memory leak for power latency tolerance")
Reported-by: NPeng Liu <liupeng17@lenovo.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NKeith Busch <kbusch@kernel.org>

53fe2a30

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功