1. 26 March 2020 (8 commits)
  2. 05 March 2020 (2 commits)
  3. 28 February 2020 (1 commit)
  4. 21 February 2020 (1 commit)
    • nvme-multipath: Fix memory leak with ana_log_buf · 3b783090
      Committed by Logan Gunthorpe
      kmemleak reports a memory leak with the ana_log_buf allocated by
      nvme_mpath_init():
      
      unreferenced object 0xffff888120e94000 (size 8208):
        comm "nvme", pid 6884, jiffies 4295020435 (age 78786.312s)
          hex dump (first 32 bytes):
            00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00  ................
            01 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00  ................
          backtrace:
            [<00000000e2360188>] kmalloc_order+0x97/0xc0
            [<0000000079b18dd4>] kmalloc_order_trace+0x24/0x100
            [<00000000f50c0406>] __kmalloc+0x24c/0x2d0
            [<00000000f31a10b9>] nvme_mpath_init+0x23c/0x2b0
            [<000000005802589e>] nvme_init_identify+0x75f/0x1600
            [<0000000058ef911b>] nvme_loop_configure_admin_queue+0x26d/0x280
            [<00000000673774b9>] nvme_loop_create_ctrl+0x2a7/0x710
            [<00000000f1c7a233>] nvmf_dev_write+0xc66/0x10b9
            [<000000004199f8d0>] __vfs_write+0x50/0xa0
            [<0000000065466fef>] vfs_write+0xf3/0x280
            [<00000000b0db9a8b>] ksys_write+0xc6/0x160
            [<0000000082156b91>] __x64_sys_write+0x43/0x50
            [<00000000c34fbb6d>] do_syscall_64+0x77/0x2f0
            [<00000000bbc574c9>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      nvme_mpath_init() is called by nvme_init_identify() which is called in
      multiple places (nvme_reset_work(), nvme_passthru_end(), etc). This
      means nvme_mpath_init() may be called multiple times before
      nvme_mpath_uninit() (which is only called on nvme_free_ctrl()).
      
      When nvme_mpath_init() is called multiple times, it overwrites the
      ana_log_buf pointer with a new allocation, thus leaking the previous
      allocation.
      
      To fix this, free ana_log_buf before allocating a new one (see
      the sketch below).
      
      Fixes: 0d0b660f ("nvme: add ANA support")
      Cc: <stable@vger.kernel.org>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Logan Gunthorpe <logang@deltatee.com>
      Signed-off-by: Keith Busch <kbusch@kernel.org>
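
      A hedged C sketch of the fix (ana_log_buf and ana_log_size are the
      nvme_ctrl fields used by the multipath code; surrounding error
      handling elided):

          /*
           * Free any buffer left over from a previous nvme_mpath_init()
           * call before allocating a new one.  kfree(NULL) is a no-op,
           * so the first call through this path is unaffected.
           */
          kfree(ctrl->ana_log_buf);
          ctrl->ana_log_buf = kmalloc(ctrl->ana_log_size, GFP_KERNEL);
          if (!ctrl->ana_log_buf)
                  return -ENOMEM;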
  5. 20 February 2020 (1 commit)
  6. 19 February 2020 (2 commits)
  7. 15 February 2020 (4 commits)
  8. 04 February 2020 (1 commit)
  9. 01 February 2020 (1 commit)
  10. 10 January 2020 (1 commit)
  11. 07 January 2020 (1 commit)
    • block: Allow t10-pi to be modular · a754bd5f
      Committed by Herbert Xu
      Currently t10-pi can only be built into the block layer, and via
      crc-t10dif it pulls in a whole chunk of the Crypto API. In fact
      all users of t10-pi work as modules, and there is no reason for
      it to always be built-in.
      
      This patch adds a new hidden option for t10-pi that is selected
      automatically based on BLK_DEV_INTEGRITY and on whether the
      users of t10-pi are built-in or not (sketched below).
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
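
      A hedged Kconfig-style sketch of the mechanism (symbol names are
      illustrative, not necessarily the upstream ones): a hidden
      tristate that each user selects, so t10-pi becomes a module
      whenever all of its users are modular.

          config BLK_T10_PI
                  tristate        # no prompt, so the option stays hidden
                  select CRC_T10DIF

          # a user would then do, e.g.:
          #       select BLK_T10_PI if BLK_DEV_INTEGRITY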
  12. 07 December 2019 (3 commits)
  13. 03 December 2019 (2 commits)
  14. 27 November 2019 (7 commits)
    • Revert "nvme: Add quirk for Kingston NVME SSD running FW E8FK11.T" · 655e7aee
      Committed by Jian-Hong Pan
      Now that e045fa29 ("PCI/MSI: Fix incorrect MSI-X masking on
      resume") has been merged, we can revert the previous quirk.
      
      This reverts commit 19ea025e.
      
      Buglink: https://bugzilla.kernel.org/show_bug.cgi?id=204887
      Fixes: 19ea025e ("nvme: Add quirk for Kingston NVME SSD running FW E8FK11.T")
      Link: https://lore.kernel.org/r/20191031093408.9322-1-jian-hong@endlessm.com
      Signed-off-by: Jian-Hong Pan <jian-hong@endlessm.com>
      Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
      Acked-by: Christoph Hellwig <hch@lst.de>
      Cc: stable@vger.kernel.org
    • nvme-fc: fix double-free scenarios on hw queues · c869e494
      Committed by James Smart
      If an error occurs on one of the ios used for creating an
      association, the creating routine has error paths that are
      invoked by the command failure, and those paths free up the
      controller resources created to that point.

      However, the io failure is ultimately detected by an asynchronous
      completion routine, which unconditionally invokes the
      error_recovery path; that path calls delete_association.
      Delete association deletes all outstanding io and then tears
      down the controller resources, so the create_association thread
      can run in parallel with the error_recovery thread. What was
      seen was the LLDD receiving a call to delete a queue and freeing
      the corresponding resource, then the transport calling delete
      queue again, causing the driver to repeat the free. The second
      free corrupted the allocator. The transport shouldn't make the
      duplicate call, and the delete queue is just one of the
      resources being freed.
      
      The fix rests on the observation that the create_association
      path is completely serialized, with one command outstanding at a
      time. The failed io completion is therefore always seen by the
      create_association path, and at the point of failure there are
      no ios to terminate and no reason to manipulate queue freeze
      states, etc. This serialized condition holds until the
      controller transitions to the LIVE state. Thus the fix is to
      change the error recovery path to check the controller state and
      only invoke the teardown path if the controller is not already
      in the CONNECTING state (see the sketch below).
      Reviewed-by: Himanshu Madhani <hmadhani@marvell.com>
      Reviewed-by: Ewan D. Milne <emilne@redhat.com>
      Signed-off-by: James Smart <jsmart2021@gmail.com>
      Signed-off-by: Keith Busch <kbusch@kernel.org>
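
      A hedged C sketch of the check added to the error-recovery path
      (delete_association naming follows the description above; exact
      identifiers assumed):

          /*
           * While CONNECTING, the serialized create_association path
           * owns error handling for its single outstanding command, so
           * skip the teardown here to avoid freeing queue resources
           * twice.
           */
          if (ctrl->ctrl.state == NVME_CTRL_CONNECTING)
                  return;
          nvme_fc_delete_association(ctrl);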
    • nvme: else following return is not needed · c80b36cd
      Committed by Edmund Nadolski
      Remove an unnecessary else keyword following a return in
      nvme_create_queue().
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Edmund Nadolski <edmund.nadolski@intel.com>
      Signed-off-by: Keith Busch <kbusch@kernel.org>
    • nvme: add error message on mismatching controller ids · a8157ff3
      Committed by James Smart
      We've seen a few devices that return different controller ids to
      the Fabrics Connect command vs the Identify (controller) command.
      This failure is currently hard to identify from the existing
      error messages; it shows up as a (re)connect attempt in the
      transport that fails with a -22 (-EINVAL) status. The issue is
      compounded by older kernels either lacking the controller id
      check or letting the identify command overwrite the fabrics
      controller id value before it was checked. Both resulted in
      devices that appeared fine until more recent kernels.

      Clarify the rejection by adding an error message on controller
      id mismatches (see the sketch below).
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Hannes Reinecke <hare@suse.de>
      Reviewed-by: Ewan D. Milne <emilne@redhat.com>
      Signed-off-by: James Smart <jsmart2021@gmail.com>
      Signed-off-by: Keith Busch <kbusch@kernel.org>
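
      A hedged C sketch of the added check (field names follow the NVMe
      core's identify path; the exact message text is an assumption):

          if (ctrl->cntlid != le16_to_cpu(id->cntlid)) {
                  dev_err(ctrl->device,
                          "Mismatching cntlid: Connect %u vs Identify %u, rejecting\n",
                          ctrl->cntlid, le16_to_cpu(id->cntlid));
                  ret = -EINVAL;
                  goto out_free;
          }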
    • nvme_fc: add module to ops template to allow module references · 863fbae9
      Committed by James Smart
      In nvme-fc it's possible to have connected, active controllers
      while no references are taken on the LLDD, so the LLDD can be
      unloaded. The controller would enter a reconnect state, and as
      long as the LLDD resumed within the reconnect timeout, the
      controller would resume. But if a namespace on the controller is
      the root device, allowing the driver to unload is problematic:
      reloading the driver may require new io to the boot device, and
      as it's no longer connected we get into a catch-22 that
      eventually fails, and the system locks up.

      Fix this issue by taking a module reference for every connected
      controller (which is what the core layer does for the transport
      module). The reference is dropped when the controller is removed
      (see the sketch below).
      Acked-by: Himanshu Madhani <hmadhani@marvell.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: James Smart <jsmart2021@gmail.com>
      Signed-off-by: Keith Busch <kbusch@kernel.org>
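
      A hedged C sketch of the reference handling (the ops-template
      'module' member is what this patch adds; surrounding identifiers
      are assumptions):

          /* on controller creation: pin the LLDD */
          if (!try_module_get(lport->ops->module)) {
                  ret = -ENODEV;
                  goto out_fail;
          }

          /* on controller removal: drop the reference */
          module_put(lport->ops->module);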
    • nvme-fc: Avoid preallocating big SGL for data · b1ae1a23
      Committed by Israel Rukshin
      nvme_fc_create_io_queues() preallocates a big buffer for the IO SGL based
      on SG_CHUNK_SIZE.
      
      Modern DMA engines are often capable of dealing with very big segments so
      the SG_CHUNK_SIZE is often too big. SG_CHUNK_SIZE results in a static 4KB
      SGL allocation per command.
      
      If a controller has lots of deep queues, preallocation for the sg list can
      consume substantial amounts of memory. For nvme-fc, nr_hw_queues can be
      128 and each queue's depth 128. This means the resulting preallocation
      for the data SGL is 128*128*4K = 64MB per controller.
      
      Switch to runtime SGL allocation for lists longer than 2 entries
      (see the sketch below). This is the approach used by NVMe PCI,
      so it should be reasonable for NVMeOF as well. Runtime SGL
      allocation has always been the case for the legacy I/O path, so
      this is nothing new.
      Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
      Reviewed-by: James Smart <james.smart@broadcom.com>
      Signed-off-by: Israel Rukshin <israelr@mellanox.com>
      Signed-off-by: Keith Busch <kbusch@kernel.org>
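
      A hedged C sketch of the runtime allocation (constant name
      illustrative; sg_alloc_table_chained() takes the size of the
      small preallocated chunk as its last parameter):

          #define NVME_FC_INLINE_SG_CNT   2   /* tiny inline SGL only */

          ret = sg_alloc_table_chained(&freq->sg_table,
                                       blk_rq_nr_phys_segments(rq),
                                       freq->first_sgl,
                                       NVME_FC_INLINE_SG_CNT);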
    • nvme-rdma: Avoid preallocating big SGL for data · 38e18002
      Committed by Israel Rukshin
      nvme_rdma_alloc_tagset() preallocates a big buffer for the IO SGL based
      on SG_CHUNK_SIZE.
      
      Modern DMA engines are often capable of dealing with very big segments so
      the SG_CHUNK_SIZE is often too big. SG_CHUNK_SIZE results in a static 4KB
      SGL allocation per command.
      
      If a controller has lots of deep queues, preallocation for the sg list can
      consume substantial amounts of memory. For nvme-rdma, nr_hw_queues can be
      128 and each queue's depth 128. This means the resulting preallocation
      for the data SGL is 128*128*4K = 64MB per controller.
      
      Switch to runtime SGL allocation for lists longer than 2 entries.
      This is the approach used by NVMe PCI, so it should be reasonable
      for NVMeOF as well. Runtime SGL allocation has always been the
      case for the legacy I/O path, so this is nothing new.

      The preallocated small SGL depends on SG_CHAIN, so if the
      architecture doesn't support SG_CHAIN, use only runtime
      allocation for the SGL (see the sketch below).

      We didn't notice a performance degradation, since small IOs use
      the inline SGL, and for bigger IOs the allocation of a larger
      SGL from the slab is fast enough.
      Suggested-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Israel Rukshin <israelr@mellanox.com>
      Signed-off-by: Keith Busch <kbusch@kernel.org>
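
      A hedged C sketch of the SG_CHAIN fallback (constant name
      illustrative): without arch support for chained SGLs there is no
      small inline SGL at all, and every list is allocated at runtime.

          #ifdef CONFIG_ARCH_NO_SG_CHAIN
          #define NVME_RDMA_INLINE_SG_CNT 0
          #else
          #define NVME_RDMA_INLINE_SG_CNT 2
          #endif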
  15. 22 November 2019 (2 commits)
    • nvme: hwmon: add quirk to avoid changing temperature threshold · 6c6aa2f2
      Committed by Akinobu Mita
      This adds a new quirk NVME_QUIRK_NO_TEMP_THRESH_CHANGE to avoid changing
      the value of the temperature threshold feature for specific devices that
      show undesirable behavior.
      
      Guenter reported:
      
      "On my Intel NVME drive (SSDPEKKW512G7), writing any minimum limit on the
      Composite temperature sensor results in a temperature warning, and that
      warning is sticky until I reset the controller.
      
      It doesn't seem to matter which temperature I write; writing -273000 has
      the same result."
      
      The Intel NVMe drive has the latest firmware version installed,
      so this isn't a problem that has since been fixed.
      Reported-by: Guenter Roeck <linux@roeck-us.net>
      Cc: Keith Busch <kbusch@kernel.org>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Cc: Jean Delvare <jdelvare@suse.com>
      Reviewed-by: Guenter Roeck <linux@roeck-us.net>
      Tested-by: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Signed-off-by: Keith Busch <kbusch@kernel.org>
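
      A hedged C sketch of how such a quirk is typically honored in the
      threshold store path (surrounding code assumed):

          if (ctrl->quirks & NVME_QUIRK_NO_TEMP_THRESH_CHANGE)
                  return -EOPNOTSUPP;   /* refuse to change the threshold */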
    • nvme: hwmon: provide temperature min and max values for each sensor · 52deba0f
      Committed by Akinobu Mita
      According to the NVMe specification, the over temperature threshold and
      under temperature threshold features shall be implemented for Composite
      Temperature if a non-zero WCTEMP field value is reported in the Identify
      Controller data structure.  The features are also implemented for all
      implemented temperature sensors (i.e., all Temperature Sensor fields that
      report a non-zero value).
      
      This provides the over temperature threshold and under temperature
      threshold for each sensor as temperature min and max values of hwmon
      sysfs attributes.
      
      The WCTEMP is already provided as the temperature max value for
      Composite Temperature, but this change isn't incompatible,
      because the default value of the over temperature threshold for
      Composite Temperature is the WCTEMP.

      Now the alarm attribute for Composite Temperature indicates that
      one of the temperatures is outside a threshold, because there is
      only a single bit in the Critical Warning field that indicates a
      temperature is outside a threshold.
      
      Example output from the "sensors" command:
      
      nvme-pci-0100
      Adapter: PCI adapter
      Composite:    +33.9°C  (low  = -273.1°C, high = +69.8°C)
                             (crit = +79.8°C)
      Sensor 1:     +34.9°C  (low  = -273.1°C, high = +65261.8°C)
      Sensor 2:     +31.9°C  (low  = -273.1°C, high = +65261.8°C)
      Sensor 5:     +47.9°C  (low  = -273.1°C, high = +65261.8°C)
      
      This also adds helper macros for kelvin to/from millidegree
      Celsius conversion (sketched below), replacing the repeated code
      in hwmon.c.
      
      Cc: Keith Busch <kbusch@kernel.org>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Cc: Jean Delvare <jdelvare@suse.com>
      Reviewed-by: Guenter Roeck <linux@roeck-us.net>
      Tested-by: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Signed-off-by: Keith Busch <kbusch@kernel.org>
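
      A hedged C sketch of the conversion helpers mentioned above (NVMe
      reports temperatures in kelvin while hwmon expects millidegrees
      Celsius; helper names are illustrative):

          #define ABSOLUTE_ZERO_MCELSIUS  (-273150)

          static inline long kelvin_to_millicelsius(long t)
          {
                  return t * 1000 + ABSOLUTE_ZERO_MCELSIUS;
          }

          static inline long millicelsius_to_kelvin(long t)
          {
                  return (t - ABSOLUTE_ZERO_MCELSIUS) / 1000;
          }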
  16. 13 November 2019 (1 commit)
  17. 12 November 2019 (1 commit)
    • nvme: Add hardware monitoring support · 400b6a7b
      Committed by Guenter Roeck
      NVMe devices report temperature information in the controller
      data (for limits) and in the SMART log. Currently, the only
      means to retrieve this information is the nvme command line
      interface, which requires super-user privileges.
      
      At the same time, it would be desirable to be able to use NVMe temperature
      information for thermal control.
      
      This patch adds support to read NVMe temperatures from the kernel using the
      hwmon API and adds temperature zones for NVMe drives. The thermal subsystem
      can use this information to set thermal policies, and userspace can access
      it using libsensors and/or the "sensors" command.
      
      Example output from the "sensors" command:
      
      nvme0-pci-0100
      Adapter: PCI adapter
      Composite:    +39.0°C  (high = +85.0°C, crit = +85.0°C)
      Sensor 1:     +39.0°C
      Sensor 2:     +41.0°C
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: Keith Busch <kbusch@kernel.org>
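
      A hedged C sketch of registering through the standard hwmon API
      (the chip-info structure and the data pointer are assumptions
      about this patch's internals):

          struct device *hwmon;

          hwmon = hwmon_device_register_with_info(ctrl->dev, "nvme", data,
                                                  &nvme_hwmon_chip_info,
                                                  NULL);
          if (IS_ERR(hwmon))
                  dev_warn(ctrl->dev, "Failed to instantiate hwmon device\n");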
  18. 05 November 2019 (1 commit)
    • nvme-multipath: fix crash in nvme_mpath_clear_ctrl_paths · 763303a8
      Committed by Anton Eidelman
      nvme_mpath_clear_ctrl_paths() iterates through
      the ctrl->namespaces list while holding ctrl->scan_lock.
      This does not seem to be the correct way of protecting
      from concurrent list modification.
      
      Specifically, nvme_scan_work() sorts ctrl->namespaces
      AFTER unlocking scan_lock.
      
      This may result in the following (rare) crash in ctrl disconnect
      during scan_work:
      
          BUG: kernel NULL pointer dereference, address: 0000000000000050
          Oops: 0000 [#1] SMP PTI
          CPU: 0 PID: 3995 Comm: nvme 5.3.5-050305-generic
          RIP: 0010:nvme_mpath_clear_current_path+0xe/0x90 [nvme_core]
          ...
          Call Trace:
           nvme_mpath_clear_ctrl_paths+0x3c/0x70 [nvme_core]
           nvme_remove_namespaces+0x35/0xe0 [nvme_core]
           nvme_do_delete_ctrl+0x47/0x90 [nvme_core]
           nvme_sysfs_delete+0x49/0x60 [nvme_core]
           dev_attr_store+0x17/0x30
           sysfs_kf_write+0x3e/0x50
           kernfs_fop_write+0x11e/0x1a0
           __vfs_write+0x1b/0x40
           vfs_write+0xb9/0x1a0
           ksys_write+0x67/0xe0
           __x64_sys_write+0x1a/0x20
           do_syscall_64+0x5a/0x130
           entry_SYSCALL_64_after_hwframe+0x44/0xa9
          RIP: 0033:0x7f8d02bfb154
      
      Fix:
      After taking scan_lock in nvme_mpath_clear_ctrl_paths(), also
      take down_read(&ctrl->namespaces_rwsem) to make the list
      traversal safe (see the sketch below). This will not cause
      deadlocks, because scan_lock is never taken while holding
      namespaces_rwsem; moreover, scan work takes namespaces_rwsem in
      the same order.

      Alternative: sort ctrl->namespaces in nvme_scan_work() while
      still holding scan_lock. This would leave
      nvme_mpath_clear_ctrl_paths() without correct protection against
      ctrl->namespaces modification by anything other than scan_work.
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Anton Eidelman <anton@lightbitslabs.com>
      Signed-off-by: Keith Busch <kbusch@kernel.org>
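
      A hedged C sketch of the fixed traversal (identifiers taken from
      the description above):

          mutex_lock(&ctrl->scan_lock);
          down_read(&ctrl->namespaces_rwsem);   /* same order as scan work */
          list_for_each_entry(ns, &ctrl->namespaces, list)
                  nvme_mpath_clear_current_path(ns);
          up_read(&ctrl->namespaces_rwsem);
          mutex_unlock(&ctrl->scan_lock);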