1. 27 Nov, 2019 2 commits
    • nvme-fc: Avoid preallocating big SGL for data · b1ae1a23
      Committed by Israel Rukshin
      nvme_fc_create_io_queues() preallocates a big buffer for the IO SGL based
      on SG_CHUNK_SIZE.
      
      Modern DMA engines are often capable of dealing with very big segments so
      the SG_CHUNK_SIZE is often too big. SG_CHUNK_SIZE results in a static 4KB
      SGL allocation per command.
      
      If a controller has lots of deep queues, preallocation for the sg list can
      consume substantial amounts of memory. For nvme-fc, nr_hw_queues can be
      128 and each queue's depth 128. This means the resulting preallocation
      for the data SGL is 128*128*4K = 64MB per controller.
      
      Switch to runtime allocation for SGL for lists longer than 2 entries. This
      is the approach used by NVMe PCI so it should be reasonable for NVMeOF as
      well. Runtime SGL allocation has always been the case for the legacy I/O
      path so this is nothing new.
      Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
      Reviewed-by: James Smart <james.smart@broadcom.com>
      Signed-off-by: Israel Rukshin <israelr@mellanox.com>
      Signed-off-by: Keith Busch <kbusch@kernel.org>
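
The memory figure quoted in the commit message above can be checked with a small calculation. This is only a sketch: the helper name is hypothetical, and the 32-byte entry size is an assumption derived from the message itself (a 4KB SGL divided over SG_CHUNK_SIZE, assumed here to be 128 entries); the real size of a scatterlist entry depends on the architecture and kernel configuration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 4KB per command / 128 entries = 32 bytes per SGL entry. */
#define MODEL_SG_ENTRY_BYTES 32u

/*
 * Hypothetical helper: total bytes preallocated for data SGLs on one
 * controller when every command in every queue reserves sgl_bytes_per_cmd.
 */
static size_t model_sgl_prealloc_bytes(size_t nr_hw_queues,
                                       size_t queue_depth,
                                       size_t sgl_bytes_per_cmd)
{
    assert(nr_hw_queues > 0 && queue_depth > 0);
    return nr_hw_queues * queue_depth * sgl_bytes_per_cmd;
}
```

With 128 queues of depth 128, a 4096-byte SGL per command gives 64MB, while two inline entries of the assumed 32 bytes each give 1MB.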
    • nvme-rdma: Avoid preallocating big SGL for data · 38e18002
      Committed by Israel Rukshin
      nvme_rdma_alloc_tagset() preallocates a big buffer for the IO SGL based
      on SG_CHUNK_SIZE.
      
      Modern DMA engines are often capable of dealing with very big segments so
      the SG_CHUNK_SIZE is often too big. SG_CHUNK_SIZE results in a static 4KB
      SGL allocation per command.
      
      If a controller has lots of deep queues, preallocation for the sg list can
      consume substantial amounts of memory. For nvme-rdma, nr_hw_queues can be
      128 and each queue's depth 128. This means the resulting preallocation
      for the data SGL is 128*128*4K = 64MB per controller.
      
      Switch to runtime allocation for SGL for lists longer than 2 entries. This
      is the approach used by NVMe PCI so it should be reasonable for NVMeOF as
      well. Runtime SGL allocation has always been the case for the legacy I/O
      path so this is nothing new.
      
      The preallocated small SGL depends on SG_CHAIN so if the ARCH doesn't
      support SG_CHAIN, use only runtime allocation for the SGL.
      
      We didn't notice any performance degradation, since for small IOs we'll
      use the inline SG and for the bigger IOs the allocation of a bigger SGL
      from slab is fast enough.
      Suggested-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Israel Rukshin <israelr@mellanox.com>
      Signed-off-by: Keith Busch <kbusch@kernel.org>
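
The allocation policy described in the commit message above (a small inline SGL for lists of up to 2 entries, runtime allocation for longer lists, and runtime allocation only when the architecture lacks SG_CHAIN) can be modelled as follows. This is a minimal sketch with hypothetical names, not the driver's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Number of SGL entries kept inline per command when chaining is available. */
#define MODEL_INLINE_SG_CNT 2u

/* Without SG_CHAIN support the inline SGL cannot be used at all. */
static size_t model_inline_sg_cnt(bool arch_supports_sg_chain)
{
    return arch_supports_sg_chain ? MODEL_INLINE_SG_CNT : 0u;
}

/* True when a request with nents entries needs a runtime SGL allocation. */
static bool model_needs_runtime_sgl(size_t nents, bool arch_supports_sg_chain)
{
    assert(nents > 0);
    return nents > model_inline_sg_cnt(arch_supports_sg_chain);
}
```

Small IOs (1 or 2 entries) stay on the inline path when chaining is supported; everything else takes the runtime allocation path.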
  2. 22 Nov, 2019 2 commits
    • nvme: hwmon: add quirk to avoid changing temperature threshold · 6c6aa2f2
      Committed by Akinobu Mita
      This adds a new quirk NVME_QUIRK_NO_TEMP_THRESH_CHANGE to avoid changing
      the value of the temperature threshold feature for specific devices that
      show undesirable behavior.
      
      Guenter reported:
      
      "On my Intel NVME drive (SSDPEKKW512G7), writing any minimum limit on the
      Composite temperature sensor results in a temperature warning, and that
      warning is sticky until I reset the controller.
      
      It doesn't seem to matter which temperature I write; writing -273000 has
      the same result."
      
      The Intel NVMe has the latest firmware version installed, so this isn't
      a problem that was ever fixed.
      Reported-by: Guenter Roeck <linux@roeck-us.net>
      Cc: Keith Busch <kbusch@kernel.org>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Cc: Jean Delvare <jdelvare@suse.com>
      Reviewed-by: Guenter Roeck <linux@roeck-us.net>
      Tested-by: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Signed-off-by: Keith Busch <kbusch@kernel.org>
    • nvme: hwmon: provide temperature min and max values for each sensor · 52deba0f
      Committed by Akinobu Mita
      According to the NVMe specification, the over temperature threshold and
      under temperature threshold features shall be implemented for Composite
      Temperature if a non-zero WCTEMP field value is reported in the Identify
      Controller data structure.  The features are also implemented for all
      implemented temperature sensors (i.e., all Temperature Sensor fields that
      report a non-zero value).
      
      This provides the over temperature threshold and under temperature
      threshold for each sensor as temperature min and max values of hwmon
      sysfs attributes.
      
      The WCTEMP is already provided as a temperature max value for Composite
      Temperature, but this change is not incompatible, because the default
      value of the over temperature threshold for Composite Temperature is
      the WCTEMP.
      
      Now the alarm attribute for Composite Temperature indicates that one of
      the temperatures is outside of a temperature threshold, because there is
      only a single bit in the Critical Warning field that indicates a
      temperature is outside of a threshold.
      
      Example output from the "sensors" command:
      
      nvme-pci-0100
      Adapter: PCI adapter
      Composite:    +33.9°C  (low  = -273.1°C, high = +69.8°C)
                             (crit = +79.8°C)
      Sensor 1:     +34.9°C  (low  = -273.1°C, high = +65261.8°C)
      Sensor 2:     +31.9°C  (low  = -273.1°C, high = +65261.8°C)
      Sensor 5:     +47.9°C  (low  = -273.1°C, high = +65261.8°C)
      
      This also adds helper macros for kelvin from/to milli Celsius conversion,
      and replaces the repeated code in hwmon.c.
      
      Cc: Keith Busch <kbusch@kernel.org>
      Cc: Jens Axboe <axboe@fb.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Cc: Jean Delvare <jdelvare@suse.com>
      Reviewed-by: Guenter Roeck <linux@roeck-us.net>
      Tested-by: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: Akinobu Mita <akinobu.mita@gmail.com>
      Signed-off-by: Keith Busch <kbusch@kernel.org>
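
The Kelvin to milli-Celsius conversion mentioned in the commit message above can be sketched in plain C. The names below are hypothetical and this is not the driver's code; the only fact relied on is the standard offset 0 K = -273.15°C, which is consistent with the sample output (a low limit of 0 K shows as -273.1°C, and a high limit of 65535 K shows as +65261.8°C).

```c
#include <assert.h>

/* 0 K expressed in milli-degrees Celsius (0 K = -273.15 C). */
#define MODEL_ZERO_KELVIN_MILLICELSIUS (-273150L)

/* Signed division rounded to the nearest integer; d must be positive. */
static long model_div_round_closest(long n, long d)
{
    assert(d > 0);
    return (n >= 0) ? (n + d / 2) / d : (n - d / 2) / d;
}

/* Whole kelvins to milli-degrees Celsius. */
static long model_kelvin_to_millicelsius(long kelvin)
{
    return kelvin * 1000L + MODEL_ZERO_KELVIN_MILLICELSIUS;
}

/* Milli-degrees Celsius to the nearest whole kelvin. */
static long model_millicelsius_to_kelvin(long millicelsius)
{
    return model_div_round_closest(
        millicelsius - MODEL_ZERO_KELVIN_MILLICELSIUS, 1000L);
}
```

Rounding to the nearest kelvin matters in the write direction, because a threshold written through sysfs in milli-degrees Celsius rarely falls on a whole kelvin.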
  3. 13 Nov, 2019 1 commit
  4. 12 Nov, 2019 1 commit
    • nvme: Add hardware monitoring support · 400b6a7b
      Committed by Guenter Roeck
      NVMe devices report temperature information in the controller information
      (for limits) and in the smart log. Currently, the only means to retrieve
      this information is the nvme command line interface, which requires
      super-user privileges.
      
      At the same time, it would be desirable to be able to use NVMe temperature
      information for thermal control.
      
      This patch adds support to read NVMe temperatures from the kernel using the
      hwmon API and adds temperature zones for NVMe drives. The thermal subsystem
      can use this information to set thermal policies, and userspace can access
      it using libsensors and/or the "sensors" command.
      
      Example output from the "sensors" command:
      
      nvme0-pci-0100
      Adapter: PCI adapter
      Composite:    +39.0°C  (high = +85.0°C, crit = +85.0°C)
      Sensor 1:     +39.0°C
      Sensor 2:     +41.0°C
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Guenter Roeck <linux@roeck-us.net>
      Signed-off-by: Keith Busch <kbusch@kernel.org>
  5. 05 Nov, 2019 27 commits
  6. 29 Oct, 2019 3 commits
    • nvme-multipath: remove unused groups_only mode in ana log · 86cccfbf
      Committed by Anton Eidelman
      groups_only mode in nvme_read_ana_log() is no longer used: remove it.
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Anton Eidelman <anton@lightbitslabs.com>
      Signed-off-by: Keith Busch <kbusch@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • nvme-multipath: fix possible io hang after ctrl reconnect · af8fd042
      Committed by Anton Eidelman
      The following scenario results in an IO hang:
      1) ctrl completes a request with NVME_SC_ANA_TRANSITION.
         NVME_NS_ANA_PENDING bit in ns->flags is set and ana_work is triggered.
      2) ana_work: nvme_read_ana_log() tries to get the ANA log page from the ctrl.
         This fails because ctrl disconnects.
         Therefore nvme_update_ns_ana_state() is not called
         and NVME_NS_ANA_PENDING bit in ns->flags is not cleared.
      3) ctrl reconnects: nvme_mpath_init(ctrl,...) calls
         nvme_read_ana_log(ctrl, groups_only=true).
         However, nvme_update_ana_state() does not update namespaces
         because nr_nsids = 0 (due to groups_only mode).
      4) scan_work calls nvme_validate_ns(), which finds the ns and re-validates it OK.
      
      Result:
      The ctrl is now live but NVME_NS_ANA_PENDING bit in ns->flags is still set.
      Consequently ctrl will never be considered a viable path by __nvme_find_path().
      IO will hang if ctrl is the only or the last path to the namespace.
      
      More generally, while ctrl is reconnecting, its ANA state may change.
      And because nvme_mpath_init() requests ANA log in groups_only mode,
      these changes are not propagated to the existing ctrl namespaces.
      This may result in a malfunction or an IO hang.
      
      Solution:
      nvme_mpath_init() will call nvme_read_ana_log() with groups_only set to false.
      This will not harm the new ctrl case (no namespaces present),
      and will make sure the ANA state of namespaces gets updated after reconnect.
      
      Note: Another option would be for nvme_mpath_init() to invoke
      nvme_parse_ana_log(..., nvme_set_ns_ana_state) for each existing namespace.
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Anton Eidelman <anton@lightbitslabs.com>
      Signed-off-by: Keith Busch <kbusch@kernel.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
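
The hang scenario in the commit message above comes down to one flag bit that is never cleared. The following toy model illustrates it; all names are hypothetical stand-ins for the kernel symbols mentioned in the message, and the model assumes the ANA log read itself succeeds after reconnect.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the NVME_NS_ANA_PENDING bit in ns->flags. */
#define MODEL_NS_ANA_PENDING (1u << 0)

struct model_ns {
    unsigned int flags;
};

/*
 * Models a successful ANA log read. In groups_only mode no namespace IDs
 * are reported (nr_nsids == 0), so no namespace state is updated and the
 * pending bit stays set.
 */
static void model_read_ana_log(struct model_ns *ns, size_t nr_ns,
                               bool groups_only)
{
    size_t i;

    assert(ns != NULL || nr_ns == 0);
    if (groups_only)
        return;
    for (i = 0; i < nr_ns; i++)
        ns[i].flags &= ~MODEL_NS_ANA_PENDING;
}

/* A path is only considered for IO when no ANA update is pending. */
static bool model_path_usable(const struct model_ns *ns)
{
    return !(ns->flags & MODEL_NS_ANA_PENDING);
}
```

A namespace that enters reconnect with the pending bit set stays unusable after a groups_only read and becomes usable again after a full read, which is the behaviour the fix relies on.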
    • net: use skb_queue_empty_lockless() in busy poll contexts · 3f926af3
      Committed by Eric Dumazet
      Busy polling usually runs without locks.
      Let's use skb_queue_empty_lockless() instead of skb_queue_empty().
      
      Also use READ_ONCE() in __skb_try_recv_datagram() to address
      a similar potential problem.
      Signed-off-by: Eric Dumazet <edumazet@google.com>
      Signed-off-by: David S. Miller <davem@davemloft.net>
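
The idea behind a lockless emptiness check, as described in the commit message above, can be illustrated in userspace C. This sketch uses a C11 relaxed atomic load as an approximate stand-in for the kernel's READ_ONCE(); the type and function names are hypothetical, and the writer side is assumed to run under a lock that is not shown.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Circular singly linked list; an empty queue's head points at itself. */
struct model_node {
    struct model_node *_Atomic next;
};

static void model_queue_init(struct model_node *head)
{
    assert(head != NULL);
    atomic_store_explicit(&head->next, head, memory_order_relaxed);
}

/*
 * Reader side: a single untorn load of head->next, taken without the
 * queue lock. The answer may be stale by the time the caller acts on it,
 * which is acceptable for a busy-poll style hint.
 */
static bool model_queue_empty_lockless(struct model_node *head)
{
    return atomic_load_explicit(&head->next, memory_order_relaxed) == head;
}

/* Writer side: assumed to be called with the queue lock held. */
static void model_queue_push_front(struct model_node *head,
                                   struct model_node *node)
{
    struct model_node *first =
        atomic_load_explicit(&head->next, memory_order_relaxed);

    atomic_store_explicit(&node->next, first, memory_order_relaxed);
    atomic_store_explicit(&head->next, node, memory_order_relaxed);
}
```

The check only reports whether the queue looked empty at the moment of the load; any code that then dequeues still has to take the lock.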
  7. 18 Oct, 2019 1 commit
  8. 15 Oct, 2019 2 commits
  9. 14 Oct, 2019 1 commit