1. 29 7月, 2020 11 次提交
    • L
      nvme: clear any SGL flags in passthru commands · 2bf5d3bb
      Logan Gunthorpe 提交于
      The host driver should decide whether to use SGLs or PRPs and they
      currently assume the flags are cleared after the call to
      nvme_setup_cmd(). However, passed-through commands may erroneously
      set these bits; so clear them for all cases.
      Signed-off-by: NLogan Gunthorpe <logang@deltatee.com>
      Reviewed-by: NKeith Busch <kbusch@kernel.org>
      Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      2bf5d3bb
    • J
      nvme-fc: set max_segments to lldd max value · 23748076
      James Smart 提交于
      Currently the FC transport is set max_hw_sectors based on the lldds
      max sgl segment count. However, the block queue max segments is
      set based on the controller's max_segments count, which the transport
      does not set.  As such, the lldd is receiving sgl lists that are
      exceeding its max segment count.
      
      Set the controller max segment count and derive max_hw_sectors from
      the max segment count.
      Signed-off-by: NJames Smart <jsmart2021@gmail.com>
      Reviewed-by: NMax Gurtovoy <maxg@mellanox.com>
      Reviewed-by: NHimanshu Madhani <himanshu.madhani@oracle.com>
      Reviewed-by: NEwan D. Milne <emilne@redhat.com>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      23748076
    • S
      nvme-hwmon: log the controller device name · 653303f2
      Sagi Grimberg 提交于
      Stay consistent with the rest of the driver
      Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      653303f2
    • S
      nvme: fix deadlock in disconnect during scan_work and/or ana_work · ecca390e
      Sagi Grimberg 提交于
      A deadlock happens in the following scenario with multipath:
      1) scan_work(nvme0) detects a new nsid while nvme0
          is an optimized path to it, path nvme1 happens to be
          inaccessible.
      
      2) Before scan_work is complete nvme0 disconnect is initiated
          nvme_delete_ctrl_sync() sets nvme0 state to NVME_CTRL_DELETING
      
      3) scan_work(1) attempts to submit IO,
          but nvme_path_is_optimized() observes nvme0 is not LIVE.
          Since nvme1 is a possible path IO is requeued and scan_work hangs.
      
      --
      Workqueue: nvme-wq nvme_scan_work [nvme_core]
      kernel: Call Trace:
      kernel:  __schedule+0x2b9/0x6c0
      kernel:  schedule+0x42/0xb0
      kernel:  io_schedule+0x16/0x40
      kernel:  do_read_cache_page+0x438/0x830
      kernel:  read_cache_page+0x12/0x20
      kernel:  read_dev_sector+0x27/0xc0
      kernel:  read_lba+0xc1/0x220
      kernel:  efi_partition+0x1e6/0x708
      kernel:  check_partition+0x154/0x244
      kernel:  rescan_partitions+0xae/0x280
      kernel:  __blkdev_get+0x40f/0x560
      kernel:  blkdev_get+0x3d/0x140
      kernel:  __device_add_disk+0x388/0x480
      kernel:  device_add_disk+0x13/0x20
      kernel:  nvme_mpath_set_live+0x119/0x140 [nvme_core]
      kernel:  nvme_update_ns_ana_state+0x5c/0x60 [nvme_core]
      kernel:  nvme_set_ns_ana_state+0x1e/0x30 [nvme_core]
      kernel:  nvme_parse_ana_log+0xa1/0x180 [nvme_core]
      kernel:  nvme_mpath_add_disk+0x47/0x90 [nvme_core]
      kernel:  nvme_validate_ns+0x396/0x940 [nvme_core]
      kernel:  nvme_scan_work+0x24f/0x380 [nvme_core]
      kernel:  process_one_work+0x1db/0x380
      kernel:  worker_thread+0x249/0x400
      kernel:  kthread+0x104/0x140
      --
      
      4) Delete also hangs in flush_work(ctrl->scan_work)
          from nvme_remove_namespaces().
      
      Similiarly a deadlock with ana_work may happen: if ana_work has started
      and calls nvme_mpath_set_live and device_add_disk, it will
      trigger I/O. When we trigger disconnect I/O will block because
      our accessible (optimized) path is disconnecting, but the alternate
      path is inaccessible, so I/O blocks. Then disconnect tries to flush
      the ana_work and hangs.
      
      [  605.550896] Workqueue: nvme-wq nvme_ana_work [nvme_core]
      [  605.552087] Call Trace:
      [  605.552683]  __schedule+0x2b9/0x6c0
      [  605.553507]  schedule+0x42/0xb0
      [  605.554201]  io_schedule+0x16/0x40
      [  605.555012]  do_read_cache_page+0x438/0x830
      [  605.556925]  read_cache_page+0x12/0x20
      [  605.557757]  read_dev_sector+0x27/0xc0
      [  605.558587]  amiga_partition+0x4d/0x4c5
      [  605.561278]  check_partition+0x154/0x244
      [  605.562138]  rescan_partitions+0xae/0x280
      [  605.563076]  __blkdev_get+0x40f/0x560
      [  605.563830]  blkdev_get+0x3d/0x140
      [  605.564500]  __device_add_disk+0x388/0x480
      [  605.565316]  device_add_disk+0x13/0x20
      [  605.566070]  nvme_mpath_set_live+0x5e/0x130 [nvme_core]
      [  605.567114]  nvme_update_ns_ana_state+0x2c/0x30 [nvme_core]
      [  605.568197]  nvme_update_ana_state+0xca/0xe0 [nvme_core]
      [  605.569360]  nvme_parse_ana_log+0xa1/0x180 [nvme_core]
      [  605.571385]  nvme_read_ana_log+0x76/0x100 [nvme_core]
      [  605.572376]  nvme_ana_work+0x15/0x20 [nvme_core]
      [  605.573330]  process_one_work+0x1db/0x380
      [  605.574144]  worker_thread+0x4d/0x400
      [  605.574896]  kthread+0x104/0x140
      [  605.577205]  ret_from_fork+0x35/0x40
      [  605.577955] INFO: task nvme:14044 blocked for more than 120 seconds.
      [  605.579239]       Tainted: G           OE     5.3.5-050305-generic #201910071830
      [  605.580712] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [  605.582320] nvme            D    0 14044  14043 0x00000000
      [  605.583424] Call Trace:
      [  605.583935]  __schedule+0x2b9/0x6c0
      [  605.584625]  schedule+0x42/0xb0
      [  605.585290]  schedule_timeout+0x203/0x2f0
      [  605.588493]  wait_for_completion+0xb1/0x120
      [  605.590066]  __flush_work+0x123/0x1d0
      [  605.591758]  __cancel_work_timer+0x10e/0x190
      [  605.593542]  cancel_work_sync+0x10/0x20
      [  605.594347]  nvme_mpath_stop+0x2f/0x40 [nvme_core]
      [  605.595328]  nvme_stop_ctrl+0x12/0x50 [nvme_core]
      [  605.596262]  nvme_do_delete_ctrl+0x3f/0x90 [nvme_core]
      [  605.597333]  nvme_sysfs_delete+0x5c/0x70 [nvme_core]
      [  605.598320]  dev_attr_store+0x17/0x30
      
      Fix this by introducing a new state: NVME_CTRL_DELETE_NOIO, which will
      indicate the phase of controller deletion where I/O cannot be allowed
      to access the namespace. NVME_CTRL_DELETING still allows mpath I/O to
      be issued to the bottom device, and only after we flush the ana_work
      and scan_work (after nvme_stop_ctrl and nvme_prep_remove_namespaces)
      we change the state to NVME_CTRL_DELETING_NOIO. Also we prevent ana_work
      from re-firing by aborting early if we are not LIVE, so we should be safe
      here.
      
      In addition, change the transport drivers to follow the updated state
      machine.
      
      Fixes: 0d0b660f ("nvme: add ANA support")
      Reported-by: NAnton Eidelman <anton@lightbitslabs.com>
      Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      ecca390e
    • S
      nvme: document nvme controller states · 4212f4e9
      Sagi Grimberg 提交于
      We are starting to see some non-trivial states
      so lets start documenting them.
      Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      4212f4e9
    • Y
      nvme-rdma: use new shared CQ mechanism · 287f329e
      Yamin Friedman 提交于
      Has the driver use shared CQs providing ~10%-20% improvement as seen in
      the patch introducing shared CQs. Instead of opening a CQ for each QP
      per controller connected, a CQ for each QP will be provided by the RDMA
      core driver that will be shared between the QPs on that core reducing
      interrupt overhead.
      Signed-off-by: NYamin Friedman <yaminf@mellanox.com>
      Signed-off-by: NMax Gurtovoy <maxg@mellanox.com>
      Reviewed-by: NOr Gerlitz <ogerlitz@mellanox.com>
      Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      287f329e
    • D
      nvme-pci: add support for ACPI StorageD3Enable property · df4f9bc4
      David E. Box 提交于
      This patch implements a solution for a BIOS hack used on some currently
      shipping Intel systems to change driver power management policy for PCIe
      NVMe drives. Some newer Intel platforms, like some Comet Lake systems,
      require that PCIe devices use D3 when doing suspend-to-idle in order to
      allow the platform to realize maximum power savings. This is particularly
      needed to support ATX power supply shutdown on desktop systems. In order to
      ensure this happens for root ports with storage devices, Microsoft
      apparently created this ACPI _DSD property as a way to influence their
      driver policy. To my knowledge this property has not been discussed with
      the NVME specification body.
      
      Though the solution is not ideal, it addresses a problem that also affects
      Linux since the NVMe driver's default policy of using NVMe APST during
      suspend-to-idle prevents the PCI root port from going to D3 and leads to
      higher power consumption for these platforms. The power consumption
      difference may be negligible on laptop systems, but many watts on desktop
      systems when the ATX power supply is blocked from powering down.
      
      The patch creates a new nvme_acpi_storage_d3 function to check for the
      StorageD3Enable property during probe and enables D3 as a quirk if set.  It
      also provides a 'noacpi' module parameter to allow skipping the quirk if
      needed.
      
      Tested with:
       - PM961 NVMe SED Samsung 512GB
       - INTEL SSDPEKKF512G8
      
      Link: https://docs.microsoft.com/en-us/windows-hardware/design/component-guidelines/power-management-for-storage-hardware-devices-introSigned-off-by: NDavid E. Box <david.e.box@linux.intel.com>
      Reviewed-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      df4f9bc4
    • C
      nvme-pci: use max of PRP or SGL for iod size · b13c6393
      Chaitanya Kulkarni 提交于
      >From the initial implementation of NVMe SGL kernel support
      commit a7a7cbe3 ("nvme-pci: add SGL support") with addition of the
      commit 943e942e ("nvme-pci: limit max IO size and segments to avoid
      high order allocations") now there is only caller left for
      nvme_pci_iod_alloc_size() which statically passes true for last
      parameter that calculates allocation size based on SGL since we need
      size of biggest command supported for mempool allocation.
      
      This patch modifies the helper functions nvme_pci_iod_alloc_size() such
      that it is now uses maximum of PRP and SGL size for iod allocation size
      calculation.
      Signed-off-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      b13c6393
    • C
      nvme-core: replace ctrl page size with a macro · 6c3c05b0
      Chaitanya Kulkarni 提交于
      Saving the nvme controller's page size was from a time when the driver
      tried to use different sized pages, but this value is always set to
      a constant, and has been this way for some time. Remove the 'page_size'
      field and replace its usage with the constant value.
      
      This also lets the compiler make some micro-optimizations in the io
      path, and that's always a good thing.
      Signed-off-by: NChaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      6c3c05b0
    • B
      nvme: remove redundant validation in nvme_start_ctrl() · 5887450b
      Baolin Wang 提交于
      We've already validated the 'kato' in nvme_start_keep_alive(), thus no
      need to validate it again in nvme_start_ctrl(). Remove it.
      Signed-off-by: NBaolin Wang <baolin.wang@linux.alibaba.com>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      5887450b
    • D
      nvme: remove an unnecessary condition · eca9e827
      Dan Carpenter 提交于
      "v" is an unsigned int so it can't be more than UINT_MAX.  Removing this
      check makes it easier to preserve the error code as well.
      Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      eca9e827
  2. 16 7月, 2020 2 次提交
  3. 09 7月, 2020 1 次提交
  4. 08 7月, 2020 19 次提交
  5. 02 7月, 2020 2 次提交
  6. 01 7月, 2020 4 次提交
  7. 25 6月, 2020 1 次提交
    • S
      nvme-multipath: fix bogus request queue reference put · c3124466
      Sagi Grimberg 提交于
      The mpath disk node takes a reference on the request mpath
      request queue when adding live path to the mpath gendisk.
      However if we connected to an inaccessible path device_add_disk
      is not called, so if we disconnect and remove the mpath gendisk
      we endup putting an reference on the request queue that was
      never taken [1].
      
      Fix that to check if we ever added a live path (using
      NVME_NS_HEAD_HAS_DISK flag) and if not, clear the disk->queue
      reference.
      
      [1]:
      ------------[ cut here ]------------
      refcount_t: underflow; use-after-free.
      WARNING: CPU: 1 PID: 1372 at lib/refcount.c:28 refcount_warn_saturate+0xa6/0xf0
      CPU: 1 PID: 1372 Comm: nvme Tainted: G           O      5.7.0-rc2+ #3
      Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1 04/01/2014
      RIP: 0010:refcount_warn_saturate+0xa6/0xf0
      RSP: 0018:ffffb29e8053bdc0 EFLAGS: 00010282
      RAX: 0000000000000000 RBX: ffff8b7a2f4fc060 RCX: 0000000000000007
      RDX: 0000000000000007 RSI: 0000000000000092 RDI: ffff8b7a3ec99980
      RBP: ffff8b7a2f4fc000 R08: 00000000000002e1 R09: 0000000000000004
      R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
      R13: fffffffffffffff2 R14: ffffb29e8053bf08 R15: ffff8b7a320e2da0
      FS:  00007f135d4ca800(0000) GS:ffff8b7a3ec80000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00005651178c0c30 CR3: 000000003b650005 CR4: 0000000000360ee0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      Call Trace:
       disk_release+0xa2/0xc0
       device_release+0x28/0x80
       kobject_put+0xa5/0x1b0
       nvme_put_ns_head+0x26/0x70 [nvme_core]
       nvme_put_ns+0x30/0x60 [nvme_core]
       nvme_remove_namespaces+0x9b/0xe0 [nvme_core]
       nvme_do_delete_ctrl+0x43/0x5c [nvme_core]
       nvme_sysfs_delete.cold+0x8/0xd [nvme_core]
       kernfs_fop_write+0xc1/0x1a0
       vfs_write+0xb6/0x1a0
       ksys_write+0x5f/0xe0
       do_syscall_64+0x52/0x1a0
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      Reported-by: NAnton Eidelman <anton@lightbitslabs.com>
      Tested-by: NAnton Eidelman <anton@lightbitslabs.com>
      Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      c3124466