提交 · 3ede8f72a9a2825efca23a3552e80a1202ea88fd · openeuler / Kernel

03 6月, 2021 2 次提交

nvme-tcp: allow selecting the network interface for connections · 3ede8f72

由 Martin Belanger 提交于 5月 20, 2021

In our application, we need a way to force TCP connections to go out a
specific IP interface instead of letting Linux select the interface
based on the routing tables.

Add the 'host-iface' option to allow specifying the interface to use.
When the option host-iface is specified, the driver uses the specified
interface to set the option SO_BINDTODEVICE on the TCP socket before
connecting.

This new option is needed in addtion to the existing host-traddr for
the following reasons:

Specifying an IP interface by its associated IP address is less
intuitive than specifying the actual interface name and, in some cases,
simply doesn't work. That's because the association between interfaces
and IP addresses is not predictable. IP addresses can be changed or can
change by themselves over time (e.g. DHCP). Interface names are
predictable [1] and will persist over time. Consider the following
configuration.

1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state ...
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 100.0.0.100/24 scope global lo
       valid_lft forever preferred_lft forever
2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc ...
    link/ether 08:00:27:21:65:ec brd ff:ff:ff:ff:ff:ff
    inet 100.0.0.100/24 scope global enp0s3
       valid_lft forever preferred_lft forever
3: enp0s8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc ...
    link/ether 08:00:27:4f:95:5c brd ff:ff:ff:ff:ff:ff
    inet 100.0.0.100/24 scope global enp0s8
       valid_lft forever preferred_lft forever

The above is a VM that I configured with the same IP address
(100.0.0.100) on all interfaces. Doing a reverse lookup to identify the
unique interface associated with 100.0.0.100 does not work here. And
this is why the option host_iface is required. I understand that the
above config does not represent a standard host system, but I'm using
this to prove a point: "We can never know how users will configure
their systems". By te way, The above configuration is perfectly fine
by Linux.

The current TCP implementation for host_traddr performs a
bind()-before-connect(). This is a common construct to set the source
IP address on a TCP socket before connecting. This has no effect on how
Linux selects the interface for the connection. That's because Linux
uses the Weak End System model as described in RFC1122 [2]. On the other
hand, setting the Source IP Address has benefits and should be supported
by linux-nvme. In fact, setting the Source IP Address is a mandatory
FedGov requirement (e.g. connection to a RADIUS/TACACS+ server).
Consider the following configuration.

$ ip addr list dev enp0s8
3: enp0s8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc ...
    link/ether 08:00:27:4f:95:5c brd ff:ff:ff:ff:ff:ff
    inet 192.168.56.101/24 brd 192.168.56.255 scope global enp0s8
       valid_lft 426sec preferred_lft 426sec
    inet 192.168.56.102/24 scope global secondary enp0s8
       valid_lft forever preferred_lft forever
    inet 192.168.56.103/24 scope global secondary enp0s8
       valid_lft forever preferred_lft forever
    inet 192.168.56.104/24 scope global secondary enp0s8
       valid_lft forever preferred_lft forever

Here we can see that several addresses are associated with interface
enp0s8. By default, Linux always selects the default IP address,
192.168.56.101, as the source address when connecting over interface
enp0s8. Some users, however, want the ability to specify a different
source address (e.g., 192.168.56.102, 192.168.56.103, ...). The option
host_traddr can be used as-is to perform this function.

In conclusion, I believe that we need 2 options for TCP connections.
One that can be used to specify an interface (host-iface). And one that
can be used to set the source address (host-traddr). Users should be
allowed to use one or the other, or both, or none. Of course, the
documentation for host_traddr will need some clarification. It should
state that when used for TCP connection, this option only sets the
source address. And the documentation for host_iface should say that
this option is only available for TCP connections.

References:
[1] https://www.freedesktop.org/wiki/Software/systemd/PredictableNetworkInterfaceNames/
[2] https://tools.ietf.org/html/rfc1122

Tested both IPv4 and IPv6 connections.
Signed-off-by: NMartin Belanger <martin.belanger@dell.com>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NHannes Reinecke <hare@suse.de>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

3ede8f72

nvme: extend and modify the APST configuration algorithm · ebd8a93a

由 Alexey Bogoslavsky 提交于 4月 28, 2021

The algorithm that was used until now for building the APST configuration
table has been found to produce entries with excessively long ITPT
(idle time prior to transition) for devices declaring relatively long
entry and exit latencies for non-operational power states. This leads
to unnecessary waste of power and, as a result, failure to pass
mandatory power consumption tests on Chromebook platforms.

The new algorithm is based on two predefined ITPT values and two
predefined latency tolerances. Based on these values, as well as on
exit and entry latencies reported by the device, the algorithm looks
for up to 2 suitable non-operational power states to use as primary
and secondary APST transition targets. The predefined values are
supplied to the nvme driver as module parameters:

 - apst_primary_timeout_ms (default: 100)
 - apst_secondary_timeout_ms (default: 2000)
 - apst_primary_latency_tol_us (default: 15000)
 - apst_secondary_latency_tol_us (default: 100000)

The algorithm echoes the approach used by Intel's and Microsoft's drivers
on Windows. The specific default parameter values are also based on those
drivers. Yet, this patch doesn't introduce the ability to dynamically
regenerate the APST table in the event of switching the power source from
AC to battery and back. Adding this functionality may be considered in the
future. In the meantime, the timeouts and tolerances reflect a compromise
between values used by Microsoft for AC and battery scenarios.

In most NVMe devices the new algorithm causes them to implement a more
aggressive power saving policy. While beneficial in most cases, this
sometimes comes at the price of a higher IO processing latency in certain
scenarios as well as at the price of a potential impact on the drive's
endurance (due to more frequent context saving when entering deep non-
operational states). So in order to provide a fallback for systems where
these regressions cannot be tolerated, the patch allows to revert to
the legacy behavior by setting either apst_primary_timeout_ms or
apst_primary_latency_tol_us parameter to 0. Eventually (and possibly after
fine tuning the default values of the module parameters) the legacy behavior
can be removed.

TESTING.

The new algorithm has been extensively tested. Initially, simulations were
used to compare APST tables generated by old and new algorithms for a wide
range of devices. After that, power consumption, performance and latencies
were measured under different workloads on devices from multiple vendors
(WD, Intel, Samsung, Hynix, Kioxia). Below is the description of the tests
and the findings.

General observations.
The effect the patch has on the APST table varies depending on the entry and
exit latencies advertised by the devices. For some devices, the effect is
negligible (e.g. Kioxia KBG40ZNS), for some significant, making the
transitions to PS3 and PS4 much quicker (e.g. WD SN530, Intel 760P), or making
the sleep deeper, PS4 rather than PS3 after a similar amount of time (e.g.
SK Hynix BC511). For some devices (e.g. Samsung PM991) the effect is mixed:
the initial transition happens after a longer idle time, but takes the device
to a lower power state.

Workflows.
In order to evaluate the patch's effect on the power consumption and latency,
7 workflows were used for each device. The workflows were designed to test
the scenarios where significant differences between the old and new behaviors
are most likely. Each workflow was tested twice: with the new and with the
old APST table generation implementation. Power consumption, performance and
latency were measured in the process. The following workflows were used:
1) Consecutive write at the maximum rate with IO depth of 2, with no pauses
2) Repeated pattern of 1000 consecutive writes of 4K packets followed by 50ms
   idle time
3) Repeated pattern of 1000 consecutive writes of 4K packets followed by 150ms
   idle time
4) Repeated pattern of 1000 consecutive writes of 4K packets followed by 500ms
   idle time
5) Repeated pattern of 1000 consecutive writes of 4K packets followed by 1.5s
   idle time
6) Repeated pattern of 1000 consecutive writes of 4K packets followed by 5s
   idle time
7) Repeated pattern of a single random read of a 4K packet followed by 150ms
   idle time

Power consumption
Actual power consumption measurements produced predictable results in
accordance with the APST mechanism's theory of operation.
Devices with long entry and exit latencies such as WD SN530 showed huge
improvement on scenarios 4,5 and 6 of up to 62%. Devices such as Kioxia
KBG40ZNS where the resulting APST table looks virtually identical with
both legacy and new algorithms, showed little or no change in the average power
consumption on all workflows. Devices with extra short latencies such as
Samsung PM991 showed moderate increase in power consumption of up to 18% in
worst case scenarios.
In addition, on Intel and Samsung devices a more complex impact was observed
on scenarios 3, 4 and 7. Our understanding is that due to longer stay in deep
non-operational states between the writes the devices start performing background
operations leading to an increase of power consumption. With the old APST tables
part of these operations are delayed until the scenario is over and a longer idle
period begins, but eventually this extra power is consumed anyway.

Performance.
In terms of performance measured on sustained write or read scenarios, the
effect of the patch is minimal as in this case the device doesn't enter low power
states.

Latency
As expected, in devices where the patch causes a more aggressive power saving
policy (e.g. WD SN530, Intel 760P), an increase in latency was observed in
certain scenarios. Workflow number 7, specifically designed to simulate the
worst case scenario as far as latency is concerned, indeed shows a sharp
increase in average latency (~2ms -> ~53ms on Intel 760P and 0.6 -> 10ms on
WD SN530). The latency increase on other workloads and other devices is much
milder or non-existent.
Signed-off-by: NAlexey Bogoslavsky <alexey.bogoslavsky@wdc.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

ebd8a93a

12 5月, 2021 1 次提交

nvme-multipath: fix double initialization of ANA state · 5e1f6899

由 Christoph Hellwig 提交于 4月 29, 2021

nvme_init_identify and thus nvme_mpath_init can be called multiple
times and thus must not overwrite potentially initialized or in-use
fields.  Split out a helper for the basic initialization when the
controller is initialized and make sure the init_identify path does
not blindly change in-use data structures.

Fixes: 0d0b660f ("nvme: add ANA support")
Reported-by: NMartin Wilck <mwilck@suse.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NKeith Busch <kbusch@kernel.org>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NHannes Reinecke <hare@suse.de>

5e1f6899

04 5月, 2021 4 次提交

nvme: move the fabrics queue ready check routines to core · a9715744

由 Tao Chiu 提交于 4月 26, 2021

queue_rq() in pci only checks if the dispatched queue (nvmeq) is ready,
e.g. not being suspended. Since nvme_alloc_admin_tags() in reset flow
restarts the admin queue, users are able to submit admin commands to a
controller before reset_work() completes. Commands submitted under this
condition may interfere with commands that performs identify, IO queue
setup in reset_work(), and may result in a hang described in the
following patch.

As seen in the fabrics, user commands are prevented from being executed
under inproper controller states. We may reuse this logic to maintain a
clear admin queue during reset_work().
Signed-off-by: NTao Chiu <taochiu@synology.com>
Signed-off-by: NCody Wong <codywong@synology.com>
Reviewed-by: NLeon Chien <leonchien@synology.com>
Reviewed-by: NKeith Busch <kbusch@kernel.org>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

a9715744

nvme: avoid memset for passthrough requests · 51ad06cd

由 Kanchan Joshi 提交于 4月 27, 2021

nvme_clear_nvme_request() clears the nvme_command, which is unncessary
for passthrough requests as nvme_command is overwritten immediately.
Move clearing part from this helper to the caller, so that double memset
for passthrough requests is avoided.
Signed-off-by: NKanchan Joshi <joshi.k@samsung.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

51ad06cd

nvme: add nvme_get_ns helper · 4c74d1f8

由 Kanchan Joshi 提交于 4月 27, 2021

Add a helper to avoid opencoding ns->kref increment.
Decrement is already done via nvme_put_ns helper.
Signed-off-by: NKanchan Joshi <joshi.k@samsung.com>
Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

4c74d1f8

nvme: fix controller ioctl through ns_head · 48145b62

由 Minwoo Im 提交于 4月 22, 2021

In multipath case, we should consider namespace attachment with
controllers in a subsystem when we find out the live controller for the
namespace.  This patch manually reverted the commit 3557a440
("nvme: don't bother to look up a namespace for controller ioctls") with
few more updates to nvme_ns_head_chr_ioctl which has been newly updated.

Fixes: 3557a440 ("nvme: don't bother to look up a namespace for
controller ioctls")
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: NMinwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

48145b62

22 4月, 2021 5 次提交

nvme: introduce generic per-namespace chardev · 2637baed

由 Minwoo Im 提交于 4月 21, 2021

Userspace has not been allowed to I/O to device that's failed to
be initialized.  This patch introduces generic per-namespace character
device to allow userspace to I/O regardless the block device is there or
not.

The chardev naming convention will similar to the existing blkdev naming,
using a ng prefix instead of nvme, i.e.

	- /dev/ngXnY

It also supports multipath which means it will not expose chardev for the
hidden namespace blkdevs (e.g., nvmeXcYnZ).  If /dev/ngXnY is created for
a ns_head, then I/O request will be routed to a specific controller
selected by the iopolicy of the subsystem.
Signed-off-by: NMinwoo Im <minwoo.im.dev@gmail.com>
Signed-off-by: NJavier González <javier.gonz@samsung.com>
Reviewed-by: NKeith Busch <kbusch@kernel.org>
Tested-by: NKanchan Joshi <joshi.k@samsung.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

2637baed

nvme: cleanup nvme_configure_apst · 60df5de9

由 Christoph Hellwig 提交于 4月 09, 2021

Remove a level of indentation from the main code implementating the table
search by using a goto for the APST not supported case.  Also move the
main comment above the function.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NNiklas Cassel <niklas.cassel@wdc.com>

60df5de9

nvme: do not try to reconfigure APST when the controller is not live · 53fe2a30

由 Christoph Hellwig 提交于 4月 09, 2021

Do not call nvme_configure_apst when the controller is not live, given
that nvme_configure_apst will fail due the lack of an admin queue when
the controller is being torn down and nvme_set_latency_tolerance is
called from dev_pm_qos_hide_latency_tolerance.

Fixes: 510a405d("nvme: fix memory leak for power latency tolerance")
Reported-by: NPeng Liu <liupeng17@lenovo.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NKeith Busch <kbusch@kernel.org>

53fe2a30

nvme: add 'kato' sysfs attribute · 74c22990

由 Hannes Reinecke 提交于 4月 16, 2021

Add a 'kato' controller sysfs attribute to display the current
keep-alive timeout value (if any). This allows userspace to identify
persistent discovery controllers, as these will have a non-zero
KATO value.
Signed-off-by: NHannes Reinecke <hare@suse.de>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

74c22990

nvme: sanitize KATO setting · a70b81bd

由 Hannes Reinecke 提交于 4月 16, 2021

According to the NVMe base spec the KATO commands should be sent
at half of the KATO interval, to properly account for round-trip
times.
As we now will only ever send one KATO command per connection we
can easily use the recommended values.
This also fixes a potential issue where the request timeout for
the KATO command does not match the value in the connect command,
which might be causing spurious connection drops from the target.
Signed-off-by: NHannes Reinecke <hare@suse.de>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

a70b81bd

15 4月, 2021 14 次提交

nvme: fix NULL derefence in nvme_ctrl_fast_io_fail_tmo_show/store · d6609084

由 Gopal Tiwari 提交于 4月 14, 2021

Adding entry for dev_attr_fast_io_fail_tmo to avoid the kernel crash
while reading and writing the fast_io_fail_tmo.

Fixes: 09fbed63 (nvme: export fast_io_fail_tmo to sysfs)
Signed-off-by: NGopal Tiwari <gtiwari@redhat.com>
Reviewed-by: NKeith Busch <kbusch@kernel.org>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

d6609084

nvme: let namespace probing continue for unsupported features · a9e0e6bc

由 Christoph Hellwig 提交于 4月 07, 2021

Instead of failing to scan the namespace entirely when unsupported
features are detected, just mark the gendisk hidden but allow other
access like the upcoming per-namespace character device.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NKeith Busch <kbusch@kernel.org>
Reviewed-by: NJavier González <javier.gonz@samsung.com>

a9e0e6bc

nvme: factor out nvme_ns_open and nvme_ns_release helpers · f5b9a51d

由 Christoph Hellwig 提交于 4月 07, 2021

These will be reused for the per-namespace character devices.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NKeith Busch <kbusch@kernel.org>
Reviewed-by: NJavier González <javier.gonz@samsung.com>
Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>

f5b9a51d

nvme: move nvme_ns_head_ops to multipath.c · 1496bd49

由 Christoph Hellwig 提交于 4月 07, 2021

Move the multipath block_device_operations to multipath.c, where they
belong.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NKeith Busch <kbusch@kernel.org>
Reviewed-by: NJavier González <javier.gonz@samsung.com>
Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>

1496bd49

nvme: factor out a nvme_tryget_ns_head helper · 871ca3ef

由 Christoph Hellwig 提交于 4月 07, 2021

Add a helper to avoid opencoding ns_head->ref manipulations.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NKeith Busch <kbusch@kernel.org>
Reviewed-by: NJavier González <javier.gonz@samsung.com>
Reviewed-by: NKanchan Joshi <joshi.k@samsung.com>
Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>

871ca3ef

nvme: move the ioctl code to a separate file · 2405252a

由 Christoph Hellwig 提交于 4月 10, 2021

Split out the ioctl code from core.c into a new file.  Also update
copyrights while we're at it.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NKeith Busch <kbusch@kernel.org>
Reviewed-by: NJavier González <javier.gonz@samsung.com>
Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>

2405252a

nvme: don't bother to look up a namespace for controller ioctls · 3557a440

由 Christoph Hellwig 提交于 8月 14, 2020

Don't bother to look up a namespace just to drop if after retreiving the
controller for the multipath case.  Just look up a live controller for
the subsystem directly.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NKeith Busch <kbusch@kernel.org>
Reviewed-by: NJavier González <javier.gonz@samsung.com>

3557a440

nvme: simplify block device ioctl handling for the !multipath case · 2f907f7f

由 Christoph Hellwig 提交于 8月 14, 2020

Only use the existing ioctl handler for the multipath case, and add a
simpler one that reverts to the pre-multipath case for not shared
use case.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NKeith Busch <kbusch@kernel.org>
Reviewed-by: NJavier González <javier.gonz@samsung.com>

2f907f7f

nvme: simplify the compat ioctl handling · 89b3d6e6

由 Christoph Hellwig 提交于 4月 08, 2021

Don't bother defining a separate compat_ioctl handler, and just handle
the NVME_IOCTL_SUBMIT_IO32 case inline.  Also only defined it for those
ABIs (currently just i386 vs x86_64) that are affected.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NKeith Busch <kbusch@kernel.org>
Reviewed-by: NJavier González <javier.gonz@samsung.com>

89b3d6e6

nvme: factor out a nvme_ns_ioctl helper · a5d737f1

由 Christoph Hellwig 提交于 8月 14, 2020

Factor out a helper for the namespace based ioctls.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NKeith Busch <kbusch@kernel.org>
Reviewed-by: NJavier González <javier.gonz@samsung.com>
Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>

a5d737f1

nvme: pass a user pointer to nvme_nvm_ioctl · d7790d37

由 Christoph Hellwig 提交于 8月 14, 2020

Pass the proper user pointer instead of the not all that useful integer
representation.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NKeith Busch <kbusch@kernel.org>
Reviewed-by: NJavier González <javier.gonz@samsung.com>

d7790d37

nvme: cleanup setting the disk name · 9953ab0c

由 Christoph Hellwig 提交于 4月 07, 2021

Return false from nvme_set_disk_name and let the caller set the
non-multipath name instead of duplicating the naming information in two
places.  Also remove the pointless local variables for the disk name
and flags and the not needed ctrl argument.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NKeith Busch <kbusch@kernel.org>
Reviewed-by: NJavier González <javier.gonz@samsung.com>

9953ab0c

nvme: add a nvme_ns_head_multipath helper · 30897388

由 Minwoo Im 提交于 4月 07, 2021

Move the multipath gendisk out of #ifdef CONFIG_NVME_MULTIPATH and add
a new nvme_ns_head_multipath that uses it to check if a ns_head has
a multipath device associated with it.
Signed-off-by: NMinwoo Im <minwoo.im.dev@gmail.com>
[hch: added the IS_ENABLED, converted a few existing users]
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NKeith Busch <kbusch@kernel.org>
Reviewed-by: NJavier González <javier.gonz@samsung.com>
Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>

30897388

nvme: remove single trailing whitespace · 95d54bd1

由 Niklas Cassel 提交于 4月 10, 2021

There is a single trailing whitespace in core.c.
Since this is just a single whitespace, the chances of this affecting
backports to stable should be quite low, so let's just remove it.
Signed-off-by: NNiklas Cassel <niklas.cassel@wdc.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

95d54bd1

09 4月, 2021 1 次提交

treewide: Change list_sort to use const pointers · 4f0f586b

由 Sami Tolvanen 提交于 4月 08, 2021

list_sort() internally casts the comparison function passed to it
to a different type with constant struct list_head pointers, and
uses this pointer to call the functions, which trips indirect call
Control-Flow Integrity (CFI) checking.

Instead of removing the consts, this change defines the
list_cmp_func_t type and changes the comparison function types of
all list_sort() callers to use const pointers, thus avoiding type
mismatches.
Suggested-by: NNick Desaulniers <ndesaulniers@google.com>
Signed-off-by: NSami Tolvanen <samitolvanen@google.com>
Reviewed-by: NNick Desaulniers <ndesaulniers@google.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NKees Cook <keescook@chromium.org>
Tested-by: NNick Desaulniers <ndesaulniers@google.com>
Tested-by: NNathan Chancellor <nathan@kernel.org>
Signed-off-by: NKees Cook <keescook@chromium.org>
Link: https://lore.kernel.org/r/20210408182843.1754385-10-samitolvanen@google.com

4f0f586b

06 4月, 2021 3 次提交

nvme: fix handling of large MDTS values · 8609c63f

由 Bart Van Assche 提交于 4月 02, 2021

Instead of triggering an integer overflow and undefined behavior if MDTS is
large, set max_hw_sectors to UINT_MAX.
Signed-off-by: NBart Van Assche <bvanassche@acm.org>
Reviewed-by: NKeith Busch <kbusch@kernel.org>
[hch: rebased to account for the new nvme_mps_to_sectors helper]
Signed-off-by: NChristoph Hellwig <hch@lst.de>

8609c63f

nvme: implement non-mdts command limits · 5befc7c2

由 Keith Busch 提交于 3月 24, 2021

Commands that access LBA contents without a data transfer between the
host historically have not had a spec defined upper limit. The driver
set the queue constraints for such commands to the max data transfer
size just to be safe, but this artificial constraint frequently limits
devices below their capabilities.

The NVMe Workgroup ratified TP4040 defines how a controller may
advertise their non-MDTS limits. Use these if provided and default to
the current constraints if not. Since the Dataset Management command
limits are defined in logical blocks, but without a namespace to tell us
the logical block size, the code defaults to the safe 512b size.
Signed-off-by: NKeith Busch <kbusch@kernel.org>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

5befc7c2

nvme: disallow passthru cmd from targeting a nsid != nsid of the block dev · c881a23f

由 Niklas Cassel 提交于 3月 26, 2021

When a passthru command targets a specific namespace, the ns parameter to
nvme_user_cmd()/nvme_user_cmd64() is set. However, there is currently no
validation that the nsid specified in the passthru command targets the
namespace/nsid represented by the block device that the ioctl was
performed on.

Add a check that validates that the nsid in the passthru command matches
that of the supplied namespace.
Signed-off-by: NNiklas Cassel <niklas.cassel@wdc.com>
Reviewed-by: NJavier GonzÃ¡lez <javier@javigon.com>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NKanchan Joshi <joshi.k@samsung.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

c881a23f

03 4月, 2021 10 次提交

nvme: export fast_io_fail_tmo to sysfs · 09fbed63

由 Daniel Wagner 提交于 4月 01, 2021

Commit 8c4dfea9 ("nvme-fabrics: reject I/O to offline device")
introduced fast_io_fail_tmo but didn't export the value to sysfs. The
value can be set during the 'nvme connect'. Export the timeout value
to user space via sysfs to allow runtime configuration.

Cc: Victor Gladkov <Victor.Gladkov@kioxia.com>
Signed-off-by: NDaniel Wagner <dwagner@suse.de>
Reviewed-by: NEwan D. Milne <emilne@redhat.com>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Reviewed-by: NHimanshu Madhani <himanshu.madhaani@oracle.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

09fbed63

nvme: remove superfluous else in nvme_ctrl_loss_tmo_store · 25a64e4e

由 Daniel Wagner 提交于 4月 01, 2021

If there is an error we will leave the function early. So there
is no need for an else. Remove it.
Signed-off-by: NDaniel Wagner <dwagner@suse.de>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

25a64e4e

nvme: use sysfs_emit instead of sprintf · bff4bcf3

由 Daniel Wagner 提交于 4月 01, 2021

sysfs_emit is the recommended API to use for formatting strings to be
returned to user space. It is equivalent to scnprintf and aware of the
PAGE_SIZE buffer size.
Suggested-by: NChaitanya Kulkarni <Chaitanya.Kulkarni@wdc.com>
Signed-off-by: NDaniel Wagner <dwagner@suse.de>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

bff4bcf3

nvme: warn of unhandled effects only once · ed4a854b

由 Keith Busch 提交于 3月 17, 2021

We don't need to repeatedly spam the kernel logs with the same warning
about unhandled passthrough IO effects. Just one warning is sufficient
to observe this condition occurs.
Signed-off-by: NKeith Busch <kbusch@kernel.org>
Reviewed-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

ed4a854b

nvme: use driver pdu command for passthrough · f4b9e6c9

由 Keith Busch 提交于 3月 17, 2021

All nvme transport drivers preallocate an nvme command for each request.
Assume to use that command for nvme_setup_cmd() instead of requiring
drivers pass a pointer to it. All nvme drivers must initialize the
generic nvme_request 'cmd' to point to the transport's preallocated
nvme_command.

The generic nvme_request cmd pointer had previously been used only as a
temporary copy for passthrough commands. Since it now points to the
command that gets dispatched, passthrough commands must directly set it
up prior to executing the request.
Signed-off-by: NKeith Busch <kbusch@kernel.org>
Reviewed-by: NJens Axboe <axboe@kernel.dk>
Reviewed-by: NHimanshu Madhani <himanshu.madhani@oracle.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

f4b9e6c9

nvme: add new line after variable declatation · f1c772d5

由 Chaitanya Kulkarni 提交于 2月 28, 2021

Add a new line in functions nvme_pr_preempt(), nvme_pr_clear(), and
nvme_pr_release() after variable declaration which follows the rest of
the code in the nvme/host/core.c.

No functional change(s) in this patch.
Signed-off-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

f1c772d5

nvme: don't check nvme_req flags for new req · c03fd85d

由 Chaitanya Kulkarni 提交于 2月 28, 2021

nvme_clear_request() has a check for flag REQ_DONTPREP and it is called
from nvme_init_request() and nvme_setuo_cmd().

The function nvme_init_request() is called from nvme_alloc_request()
and nvme_alloc_request_qid(). From these two callers new request is
allocated everytime. For newly allocated request RQF_DONTPREP is never
set. Since after getting a tag, block layer sets the req->rq_flags == 0
and never sets the REQ_DONTPREP when returning the request :-

nvme_alloc_request()
	blk_mq_alloc_request()
		blk_mq_rq_ctx_init()
			rq->rq_flags = 0 <----

nvme_alloc_request_qid()
	blk_mq_alloc_request_hctx()
		blk_mq_rq_ctx_init()
			rq->rq_flags = 0 <----

The block layer does set req->rq_flags but REQ_DONTPREP is not one of
them and that is set by the driver.

That means we can unconditinally set the REQ_DONTPREP value to the
rq->rq_flags when nvme_init_request()->nvme_clear_request() is called
from above two callers.

Move the check for REQ_DONTPREP from nvme_clear_nvme_request() into
nvme_setup_cmd().

This is needed since nvme_alloc_request() now gets called from fast
path when NVMeOF target is configured with passthru backend to avoid
unnecessary checks in the fast path.
Signed-off-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

c03fd85d

nvme: mark nvme_setup_passsthru() inline · 7a366046

由 Chaitanya Kulkarni 提交于 2月 28, 2021

Since nvmet_setup_passthru() function falls in fast path when called
from the NVMeOF passthru backend, make it inline.
Signed-off-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

7a366046

nvme: split init identify into helper · 44ef5611

由 Chaitanya Kulkarni 提交于 2月 28, 2021

The function nvme_init_ctrl_finish() (formerly nvme_init_identify()) has
grown over the period of time about ~200 lines given the size of nvme id
ctrl data structure.

Move the nvme_id_ctrl data structure related initilzation into helper
nvme_init_identify() and call it from nvme_init_ctrl_finish().

When we move the code into nvme_init_identify() change the local
variable i from int to unsigned int and remove the duplicate kfree()
after nvme_mpath_init() and jump to the label out_free if
nvme_mpath_ini() fails.
Signed-off-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

44ef5611

nvme: rename nvme_init_identify() · f21c4769

由 Chaitanya Kulkarni 提交于 2月 28, 2021

This is a prep patch so that we can move the identify data structure
related code initialization from nvme_init_identify() into a helper.

Rename the function nvmet_init_identify() to nvmet_init_ctrl_finish().

Next patch will move the nvme_id_ctrl related initialization from newly
renamed function nvme_init_ctrl_finish() into the nvme_init_identify()
helper.
Signed-off-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
Signed-off-by: NChristoph Hellwig <hch@lst.de>

f21c4769

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功