1. 27 March 2019, 1 commit
    • srcu: Remove cleanup_srcu_struct_quiesced() · f5ad3991
      Committed by Paul E. McKenney
      The cleanup_srcu_struct_quiesced() function was added because NVME
      used WQ_MEM_RECLAIM workqueues and SRCU did not, which meant that
      NVME workqueues waiting on SRCU workqueues could result in deadlocks
      during low-memory conditions.  However, SRCU now also has WQ_MEM_RECLAIM
      workqueues, so there is no longer a potential for deadlock.  Furthermore,
      it turns out to be extremely hard to use cleanup_srcu_struct_quiesced()
      correctly due to the fact that SRCU callback invocation accesses the
      srcu_struct structure's per-CPU data area just after callbacks are
      invoked.  Therefore, the usual practice of using srcu_barrier() to wait
      for callbacks to be invoked before invoking cleanup_srcu_struct_quiesced()
      fails because SRCU's callback-invocation workqueue handler might be
      delayed, which can result in cleanup_srcu_struct_quiesced() being invoked
      (and thus freeing the per-CPU data) before the SRCU's callback-invocation
      workqueue handler is finished using that per-CPU data.  Nor is this a
      theoretical problem: KASAN emitted use-after-free warnings because of
      this problem on actual runs.
      
      In short, NVME can now safely invoke cleanup_srcu_struct(), which
      avoids the use-after-free scenario.  And cleanup_srcu_struct_quiesced()
      is quite difficult to use safely.  This commit therefore removes
      cleanup_srcu_struct_quiesced(), switching its sole user back to
      cleanup_srcu_struct().  This effectively reverts the following pair
      of commits:
      
      f7194ac3 ("srcu: Add cleanup_srcu_struct_quiesced()")
      4317228a ("nvme: Avoid flush dependency in delete controller flow")
      Reported-by: Bart Van Assche <bvanassche@acm.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.ibm.com>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Tested-by: Bart Van Assche <bvanassche@acm.org>
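
      For reference, the resulting teardown pattern for an SRCU-protected
      object looks roughly like the sketch below (generic code, not the
      NVMe driver itself; struct my_dev and my_dev_destroy() are made-up
      names for illustration):

          #include <linux/srcu.h>
          #include <linux/slab.h>

          struct my_dev {
                  struct srcu_struct srcu;
                  /* ... */
          };

          static void my_dev_destroy(struct my_dev *dev)
          {
                  /* Wait for previously queued SRCU callbacks to be invoked. */
                  srcu_barrier(&dev->srcu);
                  /*
                   * SRCU's own workqueues are WQ_MEM_RECLAIM now, so
                   * cleanup_srcu_struct() can be called directly here,
                   * even from a memory-reclaim path; no quiesced variant
                   * is needed.
                   */
                  cleanup_srcu_struct(&dev->srcu);
                  kfree(dev);
          }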
  2. 14 March 2019, 13 commits
  3. 21 February 2019, 1 commit
  4. 20 February 2019, 13 commits
  5. 18 February 2019, 2 commits
    • nvme-pci: Simplify interrupt allocation · 612b7286
      Committed by Ming Lei
      The NVME PCI driver contains a tedious mechanism for interrupt
      allocation, which is necessary to adjust the number and size of interrupt
      sets to the maximum available number of interrupts which depends on the
      underlying PCI capabilities and the available CPU resources.
      
      It works around the former shortcomings of the PCI and core interrupt
      allocation mechanisms in combination with interrupt sets.
      
      The PCI interrupt allocation function lets the caller provide a maximum
      and a minimum number of interrupts to be allocated and tries to allocate
      as many as possible. This worked without driver interaction as long as
      there was only a single set of interrupts to handle.
      
      With the addition of support for multiple interrupt sets in the generic
      affinity spreading logic, which is invoked from the PCI interrupt
      allocation, the adaptive loop in the PCI interrupt allocation did not
      work for multiple interrupt sets. The reason is that depending on the
      total number of interrupts which the PCI allocation adaptive loop tries
      to allocate in each step, the number and the size of the interrupt sets
      need to be adapted as well. Due to the way the interrupt sets support was
      implemented there was no way for the PCI interrupt allocation code or the
      core affinity spreading mechanism to invoke a driver specific function
      for adapting the interrupt sets configuration.
      
      As a consequence the driver had to implement another adaptive loop
      around the PCI interrupt allocation function, calling it with the
      maximum and minimum number of interrupts set to the same value. This
      ensured that the allocation either succeeded or immediately failed
      without any attempt to adjust the number of interrupts in the PCI code.
      
      The core code now allows drivers to provide a callback to recalculate the
      number and the size of interrupt sets during PCI interrupt allocation,
      which in turn allows the PCI interrupt allocation function to be called
      in the same way as with a single set of interrupts. The PCI code handles
      the adaptive loop and the interrupt affinity spreading mechanism invokes
      the driver callback to adapt the interrupt set configuration to the
      current loop value. This replaces the adaptive loop in the driver
      completely.
      
      Implement the NVME specific callback which adjusts the interrupt sets
      configuration and remove the adaptive allocation loop.
      
      [ tglx: Simplify the callback further and restore the dropped adjustment of
        	number of sets ]
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Bjorn Helgaas <helgaas@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: linux-block@vger.kernel.org
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Cc: linux-nvme@lists.infradead.org
      Cc: linux-pci@vger.kernel.org
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Sumit Saxena <sumit.saxena@broadcom.com>
      Cc: Kashyap Desai <kashyap.desai@broadcom.com>
      Cc: Shivasharan Srikanteshwara <shivasharan.srikanteshwara@broadcom.com>
      Link: https://lkml.kernel.org/r/20190216172228.602546658@linutronix.de
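
      As a rough sketch of the resulting driver-side pattern (the 50/50 set
      split and the mydrv_* names below are illustrative, not the NVMe
      heuristic):

          #include <linux/interrupt.h>
          #include <linux/pci.h>

          /* Invoked by the affinity spreading code for every vector count
           * the PCI adaptive loop tries; nvecs is the number of vectors
           * available for spreading across the sets.
           */
          static void mydrv_calc_irq_sets(struct irq_affinity *affd,
                                          unsigned int nvecs)
          {
                  affd->nr_sets = 2;
                  affd->set_size[0] = nvecs / 2;          /* e.g. default queues */
                  affd->set_size[1] = nvecs - nvecs / 2;  /* e.g. read queues */
          }

          static int mydrv_setup_irqs(struct pci_dev *pdev, unsigned int max_vecs)
          {
                  struct irq_affinity affd = {
                          .pre_vectors = 1,               /* admin queue vector */
                          .calc_sets   = mydrv_calc_irq_sets,
                  };

                  /* The PCI core runs the min/max adaptive loop itself now. */
                  return pci_alloc_irq_vectors_affinity(pdev, 2, max_vecs,
                                  PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY, &affd);
          }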
      
    • genirq/affinity: Store interrupt sets size in struct irq_affinity · 9cfef55b
      Committed by Ming Lei
      The interrupt affinity spreading mechanism supports spreading out
      affinities for one or more interrupt sets. An interrupt set contains one
      or more interrupts. Each set is mapped to a specific functionality of a
      device, e.g. general I/O queues and read I/O queues of multiqueue block
      devices.
      
      The number of interrupts per set is defined by the driver. It depends on
      the total number of available interrupts for the device, which is
      determined by the PCI capabilities and the availability of underlying CPU
      resources, and the number of queues which the device provides and the
      driver wants to instantiate.
      
      The driver passes initial configuration for the interrupt allocation via
      a pointer to struct irq_affinity.
      
      Right now the allocation mechanism is complex as it requires a loop in
      the driver to determine the maximum number of interrupts which are
      provided by the PCI capabilities and the underlying CPU resources. This
      loop would have to be replicated in every driver which wants to utilize
      this mechanism. That is unwanted code duplication and is error prone.
      
      In order to move this into generic facilities it is required to have a
      mechanism, which allows the recalculation of the interrupt sets and
      their size, in the core code. As the core code does not have any
      knowledge about the underlying device, a driver specific callback will
      be added to struct irq_affinity, which will be invoked by the core
      code. The callback will get the number of available interrupts as an
      argument, so the driver can calculate the corresponding number and size
      of interrupt sets.
      
      To support this, two modifications for the handling of struct irq_affinity
      are required:
      
      1) The (optional) interrupt sets size information is contained in a
         separate array of integers and struct irq_affinity contains a
         pointer to it.
      
         This is cumbersome and as the maximum number of interrupt sets is small,
         there is no reason to have separate storage. Moving the size array into
         struct irq_affinity avoids indirections and makes the code simpler.
      
      2) At the moment the struct irq_affinity pointer which is handed in from
         the driver and passed through to several core functions is marked
         'const'.
      
         With the upcoming callback to recalculate the number and size of
         interrupt sets, it's necessary to remove the 'const'
         qualifier. Otherwise the callback would not be able to update the data.
      
      Implement #1 and store the interrupt sets size in 'struct irq_affinity'.
      
      No functional change.
      
      [ tglx: Fixed the memcpy() size so it won't copy beyond the size of the
        	source. Fixed the kernel doc comments for struct irq_affinity and
        	de-'This patch'-ed the changelog ]
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Bjorn Helgaas <helgaas@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: linux-block@vger.kernel.org
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Cc: linux-nvme@lists.infradead.org
      Cc: linux-pci@vger.kernel.org
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Sumit Saxena <sumit.saxena@broadcom.com>
      Cc: Kashyap Desai <kashyap.desai@broadcom.com>
      Cc: Shivasharan Srikanteshwara <shivasharan.srikanteshwara@broadcom.com>
      Link: https://lkml.kernel.org/r/20190216172228.423723127@linutronix.de
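
      A hedged sketch of the resulting layout (see include/linux/interrupt.h
      for the authoritative definition; field types and order here are
      illustrative):

          #define IRQ_AFFINITY_MAX_SETS  4

          struct irq_affinity {
                  unsigned int    pre_vectors;    /* not spread, e.g. admin queue */
                  unsigned int    post_vectors;   /* not spread */
                  unsigned int    nr_sets;        /* number of interrupt sets */
                  /* #1: set sizes stored inline instead of behind a pointer */
                  unsigned int    set_size[IRQ_AFFINITY_MAX_SETS];
                  /* the follow-up patch adds the calc_sets() callback here */
          };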
      
  6. 12 February 2019, 1 commit
  7. 06 February 2019, 2 commits
  8. 04 February 2019, 2 commits
  9. 24 January 2019, 4 commits
    • nvme-multipath: drop optimization for static ANA group IDs · 78a61cd4
      Committed by Hannes Reinecke
      Bit 6 in the ANACAP field is used to indicate that the ANA group ID
      doesn't change while the namespace is attached to the controller.
      There is an optimisation in the code to only allocate space
      for the ANA group header, as the namespace list won't change and
      hence would not need to be refreshed.
      However, this optimisation was never carried over to the actual
      workflow, which always assumes that the buffer is large enough
      to hold the ANA header _and_ the namespace list.
      So drop this optimisation and always allocate enough space.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
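
      The "enough space" allocation amounts to roughly the following sketch
      (the helper and its parameter names are illustrative):

          #include <linux/nvme.h>

          /* Worst case: the ANA log header, one descriptor per possible ANA
           * group, and one 32-bit namespace ID entry per attached namespace.
           */
          static size_t ana_log_buf_size(unsigned int max_groups,
                                         unsigned int max_namespaces)
          {
                  return sizeof(struct nvme_ana_rsp_hdr) +
                         max_groups * sizeof(struct nvme_ana_group_desc) +
                         max_namespaces * sizeof(__le32);
          }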
    • nvme-rdma: rework queue maps handling · b1064d3e
      Committed by Sagi Grimberg
      If the device supports fewer queues than provided (if the device has
      fewer completion vectors), we might hit a bug because we ignore that in
      nvme_rdma_map_queues (we override the maps' nr_queues with the user
      options).
      
      Instead, keep track of how many default/read/poll queues we actually
      allocated (rather than how many the user asked for) and use that to
      assign our queue mappings.
      
      Fixes: b65bb777 ("nvme-rdma: support separate queue maps for read and write")
      Reported-by: Saleem, Shiraz <shiraz.saleem@intel.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
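
      The idea, as a sketch (struct mydrv_ctrl and its io_queues[] counters
      are illustrative stand-ins for the driver's own bookkeeping):

          #include <linux/blk-mq.h>

          struct mydrv_ctrl {
                  /* how many queues of each type were actually allocated */
                  unsigned int io_queues[HCTX_MAX_TYPES];
          };

          static int mydrv_map_queues(struct blk_mq_tag_set *set)
          {
                  struct mydrv_ctrl *ctrl = set->driver_data;

                  /* Size the maps from what was allocated, not from user opts. */
                  set->map[HCTX_TYPE_DEFAULT].nr_queues =
                                  ctrl->io_queues[HCTX_TYPE_DEFAULT];
                  set->map[HCTX_TYPE_DEFAULT].queue_offset = 0;
                  set->map[HCTX_TYPE_READ].nr_queues =
                                  ctrl->io_queues[HCTX_TYPE_READ];
                  set->map[HCTX_TYPE_READ].queue_offset =
                                  ctrl->io_queues[HCTX_TYPE_DEFAULT];

                  blk_mq_map_queues(&set->map[HCTX_TYPE_DEFAULT]);
                  blk_mq_map_queues(&set->map[HCTX_TYPE_READ]);
                  return 0;
          }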
    • nvme-tcp: fix timeout handler · 39d57757
      Committed by Sagi Grimberg
      Currently, we have several problems with the timeout
      handler:
      1. If we time out on the controller establishment flow, we will hang
      because we don't execute the error recovery (and we shouldn't, because
      the create_ctrl flow needs to fail and clean up on its own).
      2. We might also hang if we get a disconnect on a queue while the
      controller is already deleting. This racy flow can cause the controller
      disable/shutdown admin command to hang.
      
      We cannot complete a timed-out request from the timeout handler without
      mutual exclusion from the teardown flow (e.g. nvme_rdma_error_recovery_work).
      So we serialize it in the timeout handler and tear down the I/O and admin
      queues to guarantee that no one races with us in completing the
      request.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
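
      The shape of the fix, as a sketch (the mydrv_* helpers are illustrative,
      not the actual nvme-tcp functions; the same pattern applies to the
      nvme-rdma fix below):

          #include <linux/blk-mq.h>

          struct mydrv_ctrl;
          static struct mydrv_ctrl *mydrv_rq_to_ctrl(struct request *rq);
          static bool mydrv_ctrl_is_live(struct mydrv_ctrl *ctrl);
          static void mydrv_teardown_io_and_admin_queues(struct mydrv_ctrl *ctrl);
          static void mydrv_start_error_recovery(struct mydrv_ctrl *ctrl);

          static enum blk_eh_timer_return mydrv_timeout(struct request *rq,
                                                        bool reserved)
          {
                  struct mydrv_ctrl *ctrl = mydrv_rq_to_ctrl(rq);

                  if (!mydrv_ctrl_is_live(ctrl)) {
                          /*
                           * Controller is connecting or being torn down: tear
                           * the queues down here, which serializes against the
                           * normal teardown path, then complete the request
                           * ourselves.
                           */
                          mydrv_teardown_io_and_admin_queues(ctrl);
                          blk_mq_complete_request(rq);
                          return BLK_EH_DONE;
                  }

                  /* Live controller: let error recovery handle the request. */
                  mydrv_start_error_recovery(ctrl);
                  return BLK_EH_RESET_TIMER;
          }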
    • nvme-rdma: fix timeout handler · 4c174e63
      Committed by Sagi Grimberg
      Currently, we have several problems with the timeout
      handler:
      1. If we time out on the controller establishment flow, we will hang
      because we don't execute the error recovery (and we shouldn't, because
      the create_ctrl flow needs to fail and clean up on its own).
      2. We might also hang if we get a disconnect on a queue while the
      controller is already deleting. This racy flow can cause the controller
      disable/shutdown admin command to hang.
      
      We cannot complete a timed-out request from the timeout handler without
      mutual exclusion from the teardown flow (e.g. nvme_rdma_error_recovery_work).
      So we serialize it in the timeout handler and tear down the I/O and admin
      queues to guarantee that no one races with us in completing the
      request.
      Reported-by: Jaesoo Lee <jalee@purestorage.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  10. 17 January 2019, 1 commit
    • nvme-pci: fix nvme_setup_irqs() · c45b1fa2
      Committed by Ming Lei
      When -ENOSPC is returned from pci_alloc_irq_vectors_affinity(), we still
      retry the allocation of multiple IRQ vectors, and in that case the IRQ
      vectors actually have to cover the admin queue as well. But we don't
      take that into account, so the number of allocated IRQ vectors may end
      up equal to the sum of io_queues[HCTX_TYPE_DEFAULT] and
      io_queues[HCTX_TYPE_READ]. This is obviously wrong: it breaks
      nvme_pci_map_queues() and triggers a warning from pci_irq_get_affinity().
      
      The IRQ queues should cover the admin queue; this patch makes that
      explicit in nvme_calc_io_queues().
      
      We got several internal reports of boot failures on aarch64, so please
      consider fixing this in v4.20.
      
      Fixes: 6451fe73 ("nvme: fix irq vs io_queue calculations")
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Keith Busch <keith.busch@intel.com>
      Tested-by: fin4478 <fin4478@hotmail.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
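
      The accounting, as a sketch (an illustrative helper, not the exact
      nvme_calc_io_queues(); the even split is made up):

          #include <linux/blk-mq.h>

          /* nvecs is what pci_alloc_irq_vectors_affinity() actually returned
           * and includes the admin queue's vector, so only nvecs - 1 vectors
           * may be split between the DEFAULT and READ sets.
           */
          static void mydrv_calc_io_queues(unsigned int nvecs,
                                           unsigned int io_queues[HCTX_MAX_TYPES])
          {
                  unsigned int nr_io = nvecs - 1; /* reserve the admin vector */

                  /* Illustrative even split; the real driver also honours the
                   * write_queues/poll_queues parameters.
                   */
                  io_queues[HCTX_TYPE_DEFAULT] = nr_io - nr_io / 2;
                  io_queues[HCTX_TYPE_READ] = nr_io / 2;
          }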