1. 17 Jun 2021, 12 commits
  2. 16 Jun 2021, 6 commits
  3. 03 Jun 2021, 17 commits
    • nvmet: remove a superfluous variable · 346ac785
      Authored by Chaitanya Kulkarni
      Remove the superfluous variable "bdev", which is only used once in
      nvmet_bdev_alloc_bip(), and use req->ns->bdev, which is used everywhere
      else in the code to access the nvmet request's bdev.
      Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      346ac785
    • nvmet: move ka_work initialization to nvmet_alloc_ctrl · f6e8bd59
      Authored by Amit Engel
      Initialize the keep-alive work only once, as part of alloc_ctrl, and
      not each time nvmet_start_keep_alive_timer is called.
      Signed-off-by: Amit Engel <amit.engel@dell.com>
      Reviewed-by: Hou Pu <houpu.main@gmail.com>
      f6e8bd59
    • nvme: remove nvme_{get,put}_ns_from_disk · f1cf35e1
      Authored by Christoph Hellwig
      Now that only one caller is left, remove the helpers by restructuring
      nvme_pr_command so that it has two helpers for sending a command off
      to a given nsid using either the ns_head for multipath, or the
      namespace stored in the gendisk.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      f1cf35e1
    • nvme: split nvme_report_zones · 8b4fb0f9
      Authored by Christoph Hellwig
      Split multipath support out of nvme_report_zones into a separate helper
      and simplify the non-multipath version as a result.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      8b4fb0f9
    • nvme: move the CSI sanity check into nvme_ns_report_zones · d8ca66e8
      Authored by Christoph Hellwig
      Move the CSI check into nvme_ns_report_zones to clean up the code
      a little bit and prepare for further refactoring.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      d8ca66e8
    • nvme: add a sparse annotation to nvme_ns_head_ctrl_ioctl · 85b790a7
      Authored by Christoph Hellwig
      Add the __releases annotation to tell sparse that nvme_ns_head_ctrl_ioctl
      is expected to unlock head->srcu.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      85b790a7
    • nvme: open code nvme_put_ns_from_disk in nvme_ns_head_ctrl_ioctl · 3e7d1a55
      Authored by Christoph Hellwig
      nvme_ns_head_ctrl_ioctl is always used on multipath nodes, so just call
      srcu_read_unlock directly.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      3e7d1a55
    • nvme: open code nvme_{get,put}_ns_from_disk in nvme_ns_head_ioctl · 86b4284d
      Authored by Christoph Hellwig
      nvme_ns_head_ioctl is always used on multipath nodes, so there is no
      need to deal with the de-multiplexers.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      86b4284d
    • nvme: open code nvme_put_ns_from_disk in nvme_ns_head_chr_ioctl · f423c85c
      Authored by Christoph Hellwig
      nvme_ns_head_chr_ioctl is always used on multipath nodes, so just call
      srcu_read_unlock and consolidate the two unlock paths.
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      f423c85c
    • nvme-fabrics: remove extra braces · 97ba6931
      Authored by Chaitanya Kulkarni
      No need to use the braces around the ~ operator.

      No functionality change in this patch.
      Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      97ba6931
    • nvme-fabrics: remove an extra comment · 6f860c92
      Authored by Chaitanya Kulkarni
      Remove the comment at the end of the switch, which is not needed as
      the function is small enough.

      No functionality change in this patch.
      Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      6f860c92
    • nvme-fabrics: remove extra new lines in the switch · 63d20f54
      Authored by Chaitanya Kulkarni
      Remove the extra blank lines in the switch block, which are not common
      practice in kernel code.

      No functionality change in this patch.
      Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      63d20f54
    • nvme-fabrics: fix the kerneldoc comment for nvmf_log_connect_error() · 25e1de8c
      Authored by Chaitanya Kulkarni
      Fix the comment style to match the existing code.

      No functionality change in this patch.
      Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      25e1de8c
    • nvme-tcp: allow selecting the network interface for connections · 3ede8f72
      Authored by Martin Belanger
      In our application, we need a way to force TCP connections to go out a
      specific IP interface instead of letting Linux select the interface
      based on the routing tables.
      
      Add the 'host-iface' option to allow specifying the interface to use.
      When the option host-iface is specified, the driver uses the specified
      interface to set the option SO_BINDTODEVICE on the TCP socket before
      connecting.
      
      This new option is needed in addition to the existing host-traddr for
      the following reasons:
      
      Specifying an IP interface by its associated IP address is less
      intuitive than specifying the actual interface name and, in some cases,
      simply doesn't work. That's because the association between interfaces
      and IP addresses is not predictable. IP addresses can be changed or can
      change by themselves over time (e.g. DHCP). Interface names are
      predictable [1] and will persist over time. Consider the following
      configuration.
      
      1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state ...
          link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
          inet 100.0.0.100/24 scope global lo
             valid_lft forever preferred_lft forever
      2: enp0s3: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc ...
          link/ether 08:00:27:21:65:ec brd ff:ff:ff:ff:ff:ff
          inet 100.0.0.100/24 scope global enp0s3
             valid_lft forever preferred_lft forever
      3: enp0s8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc ...
          link/ether 08:00:27:4f:95:5c brd ff:ff:ff:ff:ff:ff
          inet 100.0.0.100/24 scope global enp0s8
             valid_lft forever preferred_lft forever
      
      The above is a VM that I configured with the same IP address
      (100.0.0.100) on all interfaces. Doing a reverse lookup to identify the
      unique interface associated with 100.0.0.100 does not work here. And
      this is why the option host_iface is required. I understand that the
      above config does not represent a standard host system, but I'm using
      this to prove a point: "We can never know how users will configure
      their systems". By the way, the above configuration is perfectly fine
      as far as Linux is concerned.
      
      The current TCP implementation for host_traddr performs a
      bind()-before-connect(). This is a common construct to set the source
      IP address on a TCP socket before connecting. This has no effect on how
      Linux selects the interface for the connection. That's because Linux
      uses the Weak End System model as described in RFC1122 [2]. On the other
      hand, setting the Source IP Address has benefits and should be supported
      by linux-nvme. In fact, setting the Source IP Address is a mandatory
      FedGov requirement (e.g. connection to a RADIUS/TACACS+ server).
      Consider the following configuration.
      
      $ ip addr list dev enp0s8
      3: enp0s8: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc ...
          link/ether 08:00:27:4f:95:5c brd ff:ff:ff:ff:ff:ff
          inet 192.168.56.101/24 brd 192.168.56.255 scope global enp0s8
             valid_lft 426sec preferred_lft 426sec
          inet 192.168.56.102/24 scope global secondary enp0s8
             valid_lft forever preferred_lft forever
          inet 192.168.56.103/24 scope global secondary enp0s8
             valid_lft forever preferred_lft forever
          inet 192.168.56.104/24 scope global secondary enp0s8
             valid_lft forever preferred_lft forever
      
      Here we can see that several addresses are associated with interface
      enp0s8. By default, Linux always selects the default IP address,
      192.168.56.101, as the source address when connecting over interface
      enp0s8. Some users, however, want the ability to specify a different
      source address (e.g., 192.168.56.102, 192.168.56.103, ...). The option
      host_traddr can be used as-is to perform this function.
      
      In conclusion, I believe that we need 2 options for TCP connections.
      One that can be used to specify an interface (host-iface). And one that
      can be used to set the source address (host-traddr). Users should be
      allowed to use one or the other, or both, or none. Of course, the
      documentation for host_traddr will need some clarification. It should
      state that when used for TCP connection, this option only sets the
      source address. And the documentation for host_iface should say that
      this option is only available for TCP connections.
      
      References:
      [1] https://www.freedesktop.org/wiki/Software/systemd/PredictableNetworkInterfaceNames/
      [2] https://tools.ietf.org/html/rfc1122
      
      Tested both IPv4 and IPv6 connections.
      Signed-off-by: Martin Belanger <martin.belanger@dell.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      3ede8f72
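      The distinction between the two options can be sketched with plain
      sockets. This is a hedged illustration of the mechanisms described
      above, not the driver code; the helper name and its arguments are
      hypothetical:

      ```python
      import socket

      def make_tcp_socket(host_traddr=None, host_iface=None):
          # Hypothetical helper illustrating the two fabrics options:
          # host_iface pins the egress interface, host_traddr sets the source IP.
          s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
          if host_iface is not None:
              # SO_BINDTODEVICE forces all traffic out of the named interface,
              # overriding the routing tables. Requires CAP_NET_RAW on Linux.
              s.setsockopt(socket.SOL_SOCKET, socket.SO_BINDTODEVICE,
                           host_iface.encode())
          if host_traddr is not None:
              # bind()-before-connect() only sets the source address; under
              # Linux's weak end system model (RFC 1122) it does not influence
              # which interface the connection goes out of.
              s.bind((host_traddr, 0))  # port 0 = pick an ephemeral port
          return s

      # Unprivileged demonstration of the host-traddr path:
      s = make_tcp_socket(host_traddr="127.0.0.1")
      print(s.getsockname()[0])  # the source address is now pinned
      s.close()
      ```

      Either knob, both, or neither can be applied before connect(),
      matching the "one or the other, or both, or none" semantics above.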
    • nvme-pci: look for StorageD3Enable on companion ACPI device instead · e21e0243
      Authored by Mario Limonciello
      The documentation around the StorageD3Enable property hints that it
      should be made on the PCI device.  This is where newer AMD systems set
      the property and it's required for S0i3 support.
      
      So rather than look for nodes of the root port only present on Intel
      systems, switch to the companion ACPI device for all systems.
      David Box from Intel indicated this should work on Intel as well.
      
      Link: https://lore.kernel.org/linux-nvme/YK6gmAWqaRmvpJXb@google.com/T/#m900552229fa455867ee29c33b854845fce80ba70
      Link: https://docs.microsoft.com/en-us/windows-hardware/design/component-guidelines/power-management-for-storage-hardware-devices-intro
      Fixes: df4f9bc4 ("nvme-pci: add support for ACPI StorageD3Enable property")
      Suggested-by: Liang Prike <Prike.Liang@amd.com>
      Acked-by: Raul E Rangel <rrangel@chromium.org>
      Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
      Reviewed-by: David E. Box <david.e.box@linux.intel.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      e21e0243
    • nvme: extend and modify the APST configuration algorithm · ebd8a93a
      Authored by Alexey Bogoslavsky
      The algorithm that was used until now for building the APST configuration
      table has been found to produce entries with excessively long ITPT
      (idle time prior to transition) for devices declaring relatively long
      entry and exit latencies for non-operational power states. This leads
      to unnecessary waste of power and, as a result, failure to pass
      mandatory power consumption tests on Chromebook platforms.
      
      The new algorithm is based on two predefined ITPT values and two
      predefined latency tolerances. Based on these values, as well as on
      exit and entry latencies reported by the device, the algorithm looks
      for up to 2 suitable non-operational power states to use as primary
      and secondary APST transition targets. The predefined values are
      supplied to the nvme driver as module parameters:
      
       - apst_primary_timeout_ms (default: 100)
       - apst_secondary_timeout_ms (default: 2000)
       - apst_primary_latency_tol_us (default: 15000)
       - apst_secondary_latency_tol_us (default: 100000)
      
      The algorithm echoes the approach used by Intel's and Microsoft's drivers
      on Windows. The specific default parameter values are also based on those
      drivers. Yet, this patch doesn't introduce the ability to dynamically
      regenerate the APST table in the event of switching the power source from
      AC to battery and back. Adding this functionality may be considered in the
      future. In the meantime, the timeouts and tolerances reflect a compromise
      between values used by Microsoft for AC and battery scenarios.
      
      In most NVMe devices the new algorithm causes them to implement a more
      aggressive power saving policy. While beneficial in most cases, this
      sometimes comes at the price of a higher IO processing latency in certain
      scenarios as well as at the price of a potential impact on the drive's
      endurance (due to more frequent context saving when entering deep non-
      operational states). So in order to provide a fallback for systems where
      these regressions cannot be tolerated, the patch allows reverting to
      the legacy behavior by setting either the apst_primary_timeout_ms or
      the apst_primary_latency_tol_us parameter to 0. Eventually (and possibly after
      fine tuning the default values of the module parameters) the legacy behavior
      can be removed.
      
      TESTING.
      
      The new algorithm has been extensively tested. Initially, simulations were
      used to compare APST tables generated by old and new algorithms for a wide
      range of devices. After that, power consumption, performance and latencies
      were measured under different workloads on devices from multiple vendors
      (WD, Intel, Samsung, Hynix, Kioxia). Below is the description of the tests
      and the findings.
      
      General observations.
      The effect the patch has on the APST table varies depending on the entry and
      exit latencies advertised by the devices. For some devices, the effect is
      negligible (e.g. Kioxia KBG40ZNS), for some significant, making the
      transitions to PS3 and PS4 much quicker (e.g. WD SN530, Intel 760P), or making
      the sleep deeper, PS4 rather than PS3 after a similar amount of time (e.g.
      SK Hynix BC511). For some devices (e.g. Samsung PM991) the effect is mixed:
      the initial transition happens after a longer idle time, but takes the device
      to a lower power state.
      
      Workflows.
      In order to evaluate the patch's effect on the power consumption and latency,
      7 workflows were used for each device. The workflows were designed to test
      the scenarios where significant differences between the old and new behaviors
      are most likely. Each workflow was tested twice: with the new and with the
      old APST table generation implementation. Power consumption, performance and
      latency were measured in the process. The following workflows were used:
      1) Consecutive write at the maximum rate with IO depth of 2, with no pauses
      2) Repeated pattern of 1000 consecutive writes of 4K packets followed by 50ms
         idle time
      3) Repeated pattern of 1000 consecutive writes of 4K packets followed by 150ms
         idle time
      4) Repeated pattern of 1000 consecutive writes of 4K packets followed by 500ms
         idle time
      5) Repeated pattern of 1000 consecutive writes of 4K packets followed by 1.5s
         idle time
      6) Repeated pattern of 1000 consecutive writes of 4K packets followed by 5s
         idle time
      7) Repeated pattern of a single random read of a 4K packet followed by 150ms
         idle time
      
      Power consumption
      Actual power consumption measurements produced predictable results in
      accordance with the APST mechanism's theory of operation.
      Devices with long entry and exit latencies such as WD SN530 showed huge
      improvement on scenarios 4, 5 and 6 of up to 62%. Devices such as the
      Kioxia KBG40ZNS, where the resulting APST table looks virtually identical
      under both the legacy and the new algorithm, showed little or no change
      in the average power consumption on all workflows. Devices with extra
      short latencies such as the Samsung PM991 showed a moderate increase in
      power consumption of up to 18% in worst-case scenarios.
      In addition, on Intel and Samsung devices a more complex impact was observed
      on scenarios 3, 4 and 7. Our understanding is that due to longer stay in deep
      non-operational states between the writes the devices start performing background
      operations leading to an increase of power consumption. With the old APST tables
      part of these operations are delayed until the scenario is over and a longer idle
      period begins, but eventually this extra power is consumed anyway.
      
      Performance.
      In terms of performance measured on sustained write or read scenarios, the
      effect of the patch is minimal as in this case the device doesn't enter low power
      states.
      
      Latency
      As expected, in devices where the patch causes a more aggressive power saving
      policy (e.g. WD SN530, Intel 760P), an increase in latency was observed in
      certain scenarios. Workflow number 7, specifically designed to simulate the
      worst case scenario as far as latency is concerned, indeed shows a sharp
      increase in average latency (~2ms -> ~53ms on Intel 760P and 0.6 -> 10ms on
      WD SN530). The latency increase on other workloads and other devices is much
      milder or non-existent.
      Signed-off-by: Alexey Bogoslavsky <alexey.bogoslavsky@wdc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      ebd8a93a
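      The parameter-driven target selection described above can be sketched
      as follows. This is a hedged model of the algorithm, not the driver
      code, and the state-table field names are hypothetical:

      ```python
      def pick_apst_targets(states,
                            apst_primary_timeout_ms=100,
                            apst_secondary_timeout_ms=2000,
                            apst_primary_latency_tol_us=15000,
                            apst_secondary_latency_tol_us=100000):
          # For each (timeout, tolerance) pair, pick the deepest
          # non-operational power state whose entry + exit latency fits
          # within the tolerance. `states` is ordered shallow -> deep;
          # the dict field names are hypothetical stand-ins.
          targets = []
          pairs = ((apst_primary_timeout_ms, apst_primary_latency_tol_us),
                   (apst_secondary_timeout_ms, apst_secondary_latency_tol_us))
          for timeout_ms, tol_us in pairs:
              best = None
              for st in states:
                  if not st["non_operational"]:
                      continue
                  if st["entry_lat_us"] + st["exit_lat_us"] <= tol_us:
                      best = st["ps"]
              # Skip a secondary target identical to the primary one.
              if best is not None and all(ps != best for ps, _ in targets):
                  targets.append((best, timeout_ms))
          return targets

      # Example: PS3 is cheap to enter/exit, PS4 is deep but slow.
      states = [
          {"ps": 3, "non_operational": True,
           "entry_lat_us": 2000, "exit_lat_us": 2000},
          {"ps": 4, "non_operational": True,
           "entry_lat_us": 15000, "exit_lat_us": 15000},
      ]
      print(pick_apst_targets(states))  # [(3, 100), (4, 2000)]
      ```

      With the defaults this yields a quick transition to a shallow state
      and a slower transition to a deep one, mirroring the primary and
      secondary targets described in the commit message.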
    • nvme: remove redundant initialization of variable ret · 13ce7e62
      Authored by Colin Ian King
      The variable ret is being initialized with a value that is never read;
      it is updated later on. The assignment is redundant and can be removed.
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      13ce7e62
  4. 19 May 2021, 5 commits
    • nvme-fc: clear q_live at beginning of association teardown · a7d13914
      Authored by James Smart
      The __nvmf_check_ready() routine used to bounce all filesystem I/O if
      the controller state isn't LIVE.  However, a later patch changed the
      logic so that the rejection ends up being based on the queue live
      check.  The FC
      transport has a slightly different sequence from rdma and tcp for
      shutting down queues/marking them non-live.  FC marks its queue non-live
      after aborting all ios and waiting for their termination, leaving a
      rather large window for filesystem io to continue to hit the transport.
      Unfortunately this resulted in filesystem I/O or applications seeing I/O
      errors.
      
      Change the FC transport to mark the queues non-live at the first sign of
      teardown for the association (when I/O is initially terminated).
      
      Fixes: 73a53799 ("nvme-fabrics: allow to queue requests for live queues")
      Signed-off-by: James Smart <jsmart2021@gmail.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      a7d13914
    • nvme-tcp: rerun io_work if req_list is not empty · a0fdd141
      Authored by Keith Busch
      A possible race condition exists where a request to send data that is
      enqueued from nvme_tcp_handle_r2t() will not be observed by
      nvme_tcp_send_all() if the latter happens to be running. The driver
      relies on io_work to send the enqueued request when it runs again, but the
      concurrently running nvme_tcp_send_all() may not have released the
      send_mutex at that time. If no future commands are enqueued to re-kick
      the io_work, the request will timeout in the SEND_H2C state, resulting
      in a timeout error like:
      
        nvme nvme0: queue 1: timeout request 0x3 type 6
      
      Ensure the io_work continues to run as long as the req_list is not empty.
      
      Fixes: db5ad6b7 ("nvme-tcp: try to send request in queue_rq context")
      Signed-off-by: Keith Busch <kbusch@kernel.org>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      a0fdd141
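      The fix's control flow can be modeled outside the driver. A minimal
      sketch, with a toy queue whose `req_list` and `send_mutex` stand in
      for the real structures:

      ```python
      import threading
      from collections import deque

      class ToyQueue:
          # Toy stand-in for an nvme-tcp queue: req_list holds pending
          # requests, send_mutex serializes the actual sending.
          def __init__(self):
              self.req_list = deque()
              self.send_mutex = threading.Lock()
              self.sent = []

          def try_send(self):
              # Like the queue_rq-context fast path: give up if another
              # context is already sending (that context may not observe
              # requests enqueued after it sampled the list).
              if not self.send_mutex.acquire(blocking=False):
                  return
              try:
                  while self.req_list:
                      self.sent.append(self.req_list.popleft())
              finally:
                  self.send_mutex.release()

          def io_work(self):
              while True:
                  self.try_send()
                  # The fix: keep rerunning as long as req_list is
                  # non-empty, instead of returning and relying on a
                  # future command to kick io_work again.
                  if not self.req_list:
                      break

      q = ToyQueue()
      q.req_list.extend(["req1", "req2", "req3"])
      q.io_work()
      print(q.sent)  # ['req1', 'req2', 'req3']
      ```

      In the real driver the rerun is a requeue of the work item rather
      than a loop, but the invariant is the same: io_work does not stop
      while req_list is non-empty.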
    • nvme-tcp: fix possible use-after-completion · 825619b0
      Authored by Sagi Grimberg
      Commit db5ad6b7 ("nvme-tcp: try to send request in queue_rq
      context") added a second context that may perform a network send.
      This means that now RX and TX are not serialized in nvme_tcp_io_work
      and can run concurrently.
      
      While there is correct mutual exclusion in the TX path (where the
      send_mutex protects the queue's socket send activity), RX activity,
      and more specifically request completion, may run concurrently.
      
      This means we must guarantee that any mutation of request state
      related to its lifetime (such as the bytes sent) is not performed
      after a completion may have arrived back (and been processed).
      
      The race may trigger when a request completion arrives, is processed
      _and_ the request is reused as a fresh new one, exactly in the (relatively short)
      window between the last data payload sent and before the request iov_iter
      is advanced.
      
      Consider the following race:
      1. 16K write request is queued
      2. The nvme command and the data is sent to the controller (in-capsule
         or solicited by r2t)
      3. After the last payload is sent but before the req.iter is advanced,
         the controller sends back a completion.
      4. The completion is processed, the request is completed, and reused
         to transfer a new request (write or read)
      5. The new request is queued, and the driver resets the request
         parameters (nvme_tcp_setup_cmd_pdu).
      6. Now context in (2) resumes execution and advances the req.iter
      
      ==> use-after-completion as this is already a new request.
      
      Fix this by making sure the request is not advanced after the last
      data payload send, knowing that a completion may have arrived already.
      
      An alternative solution would have been to delay the request completion
      or state change waiting for reference counting on the TX path, but besides
      adding atomic operations to the hot-path, it may present challenges in
      multi-stage R2T scenarios where a r2t handler needs to be deferred to
      an async execution.
      Reported-by: Narayan Ayalasomayajula <narayan.ayalasomayajula@wdc.com>
      Tested-by: Anil Mishra <anil.mishra@wdc.com>
      Reviewed-by: Keith Busch <kbusch@kernel.org>
      Cc: stable@vger.kernel.org # v5.8+
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      825619b0
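      The ordering constraint behind the fix can be sketched as follows.
      This is a hedged model: the dict fields and the send callback are
      hypothetical stand-ins for the request iov_iter and the socket send:

      ```python
      def send_request_data(req, send):
          # Advance the request's progress state *before* issuing each
          # send: once the final payload is on the wire, a completion may
          # arrive and the request may be recycled, so the request must
          # not be touched after the last send call.
          while req["remaining"]:
              chunk = min(req["remaining"], req["max_seg"])
              req["remaining"] -= chunk  # safe: done before the data is sent
              send(chunk)                # last call hands off the request

      sent = []
      req = {"remaining": 16384, "max_seg": 4096}  # the 16K write from step 1
      send_request_data(req, sent.append)
      print(sent)  # [4096, 4096, 4096, 4096]
      ```

      Advancing state first closes the window in step 6 of the race above:
      by the time the controller can complete the request, there is nothing
      left for the sending context to mutate.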
    • nvme-loop: fix memory leak in nvme_loop_create_ctrl() · 03504e3b
      Authored by Wu Bo
      When creating a loop ctrl in nvme_loop_create_ctrl(), if nvme_init_ctrl()
      fails, the loop ctrl should be freed before jumping to the "out" label.

      Fixes: 3a85a5de ("nvme-loop: add a NVMe loopback host driver")
      Signed-off-by: Wu Bo <wubo40@huawei.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      03504e3b
    • nvmet: fix memory leak in nvmet_alloc_ctrl() · fec356a6
      Authored by Wu Bo
      When creating a ctrl in nvmet_alloc_ctrl(), if the cntlid_min is larger
      than the cntlid_max of the subsystem, the code jumps to the
      "out_free_changed_ns_list" label, but ctrl->sqs is never freed.
      Fix this by jumping to the "out_free_sqs" label instead.

      Fixes: 94a39d61 ("nvmet: make ctrl-id configurable")
      Signed-off-by: Wu Bo <wubo40@huawei.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      fec356a6