1. 29 Mar, 2019 2 commits
  2. 14 Mar, 2019 13 commits
  3. 21 Feb, 2019 1 commit
  4. 20 Feb, 2019 13 commits
  5. 18 Feb, 2019 2 commits
    • nvme-pci: Simplify interrupt allocation · 612b7286
      Committed by Ming Lei
      The NVME PCI driver contains a tedious mechanism for interrupt
      allocation, which is necessary to adjust the number and size of interrupt
      sets to the maximum available number of interrupts which depends on the
      underlying PCI capabilities and the available CPU resources.
      
      It works around the former shortcomings of the PCI and core interrupt
      allocation mechanisms in combination with interrupt sets.
      
      The PCI interrupt allocation function lets the caller provide a maximum
      and a minimum number of interrupts to be allocated and tries to allocate
      as many as possible. This worked without driver interaction as long as
      there was only a single set of interrupts to handle.
      
      With the addition of support for multiple interrupt sets in the generic
      affinity spreading logic, which is invoked from the PCI interrupt
      allocation, the adaptive loop in the PCI interrupt allocation did not
      work for multiple interrupt sets. The reason is that depending on the
      total number of interrupts which the PCI allocation adaptive loop tries
      to allocate in each step, the number and the size of the interrupt sets
      need to be adapted as well. Due to the way the interrupt sets support was
      implemented there was no way for the PCI interrupt allocation code or the
      core affinity spreading mechanism to invoke a driver specific function
      for adapting the interrupt sets configuration.
      
      As a consequence the driver had to implement another adaptive loop around
      the PCI interrupt allocation function and call it with maximum and
      minimum interrupts set to the same value. This ensured that the
      allocation either succeeded or immediately failed without any attempt to
      adjust the number of interrupts in the PCI code.
      
      The core code now allows drivers to provide a callback to recalculate the
      number and the size of interrupt sets during PCI interrupt allocation,
      which in turn allows the PCI interrupt allocation function to be called
      in the same way as with a single set of interrupts. The PCI code handles
      the adaptive loop and the interrupt affinity spreading mechanism invokes
      the driver callback to adapt the interrupt set configuration to the
      current loop value. This replaces the adaptive loop in the driver
      completely.
      
      Implement the NVME specific callback which adjusts the interrupt sets
      configuration and remove the adaptive allocation loop.
      
      [ tglx: Simplify the callback further and restore the dropped adjustment of
              number of sets ]
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Bjorn Helgaas <helgaas@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: linux-block@vger.kernel.org
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Cc: linux-nvme@lists.infradead.org
      Cc: linux-pci@vger.kernel.org
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Sumit Saxena <sumit.saxena@broadcom.com>
      Cc: Kashyap Desai <kashyap.desai@broadcom.com>
      Cc: Shivasharan Srikanteshwara <shivasharan.srikanteshwara@broadcom.com>
      Link: https://lkml.kernel.org/r/20190216172228.602546658@linutronix.de
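
The division of labour described in this commit message can be modelled in user space. The following is a minimal sketch, not kernel code: `struct affinity_model`, `struct drv_model`, `drv_calc_sets`, `alloc_vectors_model` and the `hw_limit` parameter are hypothetical names invented for illustration, and the split policy (read queues first, at least one default queue) is an assumption. It only shows the shape of the idea: a generic loop walks from the maximum towards the minimum and asks the driver callback to recompute the sets for each candidate count.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAX_SETS 4 /* assumed small fixed limit, for illustration only */

/* Simplified stand-in for the affinity descriptor a driver hands to the core. */
struct affinity_model {
    unsigned int nr_sets;
    unsigned int set_size[MODEL_MAX_SETS];
    void (*calc_sets)(struct affinity_model *affd, unsigned int nvecs);
    void *priv;
};

/* Hypothetical driver state: how many queues should be dedicated to reads. */
struct drv_model {
    unsigned int want_read_queues;
};

/* Driver callback: split nvecs into a default set and an optional read set. */
static void drv_calc_sets(struct affinity_model *affd, unsigned int nvecs)
{
    struct drv_model *drv = affd->priv;
    unsigned int nr_read = drv->want_read_queues;

    if (nvecs <= 1 || nr_read == 0)
        nr_read = 0;             /* everything shares a single set */
    else if (nr_read >= nvecs)
        nr_read = nvecs - 1;     /* keep at least one default queue */

    affd->set_size[0] = nvecs - nr_read;
    affd->set_size[1] = nr_read;
    affd->nr_sets = nr_read ? 2 : 1;
}

/*
 * Generic adaptive loop: try max_vecs down to min_vecs. hw_limit models the
 * number of vectors the platform can actually grant. Returns the number of
 * vectors obtained, or -1 if even min_vecs cannot be satisfied.
 */
static int alloc_vectors_model(struct affinity_model *affd,
                               unsigned int min_vecs, unsigned int max_vecs,
                               unsigned int hw_limit)
{
    assert(affd->calc_sets != NULL);

    for (unsigned int n = max_vecs; n >= min_vecs && n > 0; n--) {
        affd->calc_sets(affd, n); /* driver adapts sets to this attempt */
        if (n <= hw_limit)
            return (int)n;
    }
    return -1;
}
```

In this model the driver no longer loops itself; it only answers the question "given this many vectors, how should they be split?", which is the role the commit message assigns to the callback.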
      
    • genirq/affinity: Store interrupt sets size in struct irq_affinity · 9cfef55b
      Committed by Ming Lei
      The interrupt affinity spreading mechanism supports spreading out
      affinities for one or more interrupt sets. An interrupt set contains one
      or more interrupts. Each set is mapped to a specific functionality of a
      device, e.g. general I/O queues and read I/O queues of multiqueue block
      devices.
      
      The number of interrupts per set is defined by the driver. It depends on
      the total number of available interrupts for the device, which is
      determined by the PCI capabilities and the availability of underlying CPU
      resources, and the number of queues which the device provides and the
      driver wants to instantiate.
      
      The driver passes initial configuration for the interrupt allocation via
      a pointer to struct irq_affinity.
      
      Right now the allocation mechanism is complex as it requires a loop in
      the driver to determine the maximum number of interrupts which are
      provided by the PCI capabilities and the underlying CPU resources.
      This loop would have to be replicated in every driver which wants to
      utilize this mechanism. That's unwanted code duplication and error
      prone.
      
      In order to move this into generic facilities it is required to have a
      mechanism, which allows the recalculation of the interrupt sets and
      their size, in the core code. As the core code does not have any
      knowledge about the underlying device, a driver specific callback will
      be added to struct irq_affinity, which will be invoked by the core
      code. The callback will get the number of available interrupts as an
      argument, so the driver can calculate the corresponding number and size
      of interrupt sets.
      
      To support this, two modifications for the handling of struct irq_affinity
      are required:
      
      1) The (optional) interrupt sets size information is contained in a
         separate array of integers and struct irq_affinity contains a
         pointer to it.
      
         This is cumbersome and as the maximum number of interrupt sets is small,
         there is no reason to have separate storage. Moving the size array into
         struct irq_affinity avoids indirections and makes the code simpler.
      
      2) At the moment the struct irq_affinity pointer which is handed in from
         the driver and passed through to several core functions is marked
         'const'.
      
         With the upcoming callback to recalculate the number and size of
         interrupt sets, it's necessary to remove the 'const'
         qualifier. Otherwise the callback would not be able to update the data.
      
      Implement #1 and store the interrupt sets size in 'struct irq_affinity'.
      
      No functional change.
      
      [ tglx: Fixed the memcpy() size so it won't copy beyond the size of the
              source. Fixed the kernel doc comments for struct irq_affinity and
              de-'This patch'-ed the changelog ]
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Bjorn Helgaas <helgaas@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: linux-block@vger.kernel.org
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Cc: linux-nvme@lists.infradead.org
      Cc: linux-pci@vger.kernel.org
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Sumit Saxena <sumit.saxena@broadcom.com>
      Cc: Kashyap Desai <kashyap.desai@broadcom.com>
      Cc: Shivasharan Srikanteshwara <shivasharan.srikanteshwara@broadcom.com>
      Link: https://lkml.kernel.org/r/20190216172228.423723127@linutronix.de
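
Modification 1) above is a change of storage layout, which a small user-space sketch can make concrete. The struct and function names below (`affinity_before`, `affinity_after`, `affinity_embed`, `affinity_total`) and the bound `MODEL_MAX_SETS` are hypothetical and chosen only for illustration; the real kernel definitions may differ in field names and types.

```c
#include <assert.h>
#include <string.h>

#define MODEL_MAX_SETS 4 /* assumed small fixed upper bound */

/* Before: the set sizes live in separate storage that the driver owns. */
struct affinity_before {
    int nr_sets;
    int *sets;
};

/* After: the set sizes are embedded, so no second allocation or pointer. */
struct affinity_after {
    unsigned int nr_sets;
    unsigned int set_size[MODEL_MAX_SETS];
};

/*
 * Copy the old layout into the new one. Only nr_sets entries are read from
 * the source, never the full capacity of the destination array.
 * Returns 0 on success, -1 if the source does not fit or is invalid.
 */
static int affinity_embed(struct affinity_after *dst,
                          const struct affinity_before *src)
{
    if (src->nr_sets < 0 || src->nr_sets > MODEL_MAX_SETS)
        return -1;

    memset(dst, 0, sizeof(*dst));
    dst->nr_sets = (unsigned int)src->nr_sets;
    for (int i = 0; i < src->nr_sets; i++) {
        if (src->sets[i] < 0)
            return -1;
        dst->set_size[i] = (unsigned int)src->sets[i];
    }
    return 0;
}

/* Total number of interrupts across all sets. */
static unsigned int affinity_total(const struct affinity_after *affd)
{
    unsigned int total = 0;

    assert(affd->nr_sets <= MODEL_MAX_SETS);
    for (unsigned int i = 0; i < affd->nr_sets; i++)
        total += affd->set_size[i];
    return total;
}
```

With the array embedded, a callback that receives a non-const pointer to the struct can rewrite both `nr_sets` and the sizes in place, which is what modification 2) prepares for.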
      
  6. 12 Feb, 2019 1 commit
  7. 06 Feb, 2019 2 commits
  8. 04 Feb, 2019 2 commits
  9. 24 Jan, 2019 4 commits
    • nvme-multipath: drop optimization for static ANA group IDs · 78a61cd4
      Committed by Hannes Reinecke
      Bit 6 in the ANACAP field is used to indicate that the ANA group ID
      doesn't change while the namespace is attached to the controller.
      There is an optimisation in the code to only allocate space
      for the ANA group header, as the namespace list won't change and
      hence would not need to be refreshed.
      However, this optimisation was never carried over to the actual
      workflow, which always assumes that the buffer is large enough
      to hold the ANA header _and_ the namespace list.
      So drop this optimisation and always allocate enough space.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Hannes Reinecke <hare@suse.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
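
The sizing problem behind this commit can be sketched with a little arithmetic. The constants below (a 16-byte log header, 32-byte group descriptors and 4-byte namespace IDs) are assumptions about the ANA log page layout, and the function names are hypothetical; the point is only that a buffer sized without the namespace list is too small for code that always parses header, descriptors and namespace IDs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed on-wire sizes, for illustration only. */
#define ANA_HDR_SIZE        16u
#define ANA_GROUP_DESC_SIZE 32u
#define ANA_NSID_SIZE        4u

/* The dropped optimisation: room for header and group descriptors only. */
static size_t ana_log_size_static(uint32_t nr_ana_groups)
{
    return ANA_HDR_SIZE + (size_t)nr_ana_groups * ANA_GROUP_DESC_SIZE;
}

/* Always reserve room for the namespace ID list as well. */
static size_t ana_log_size_full(uint32_t nr_ana_groups, uint32_t max_namespaces)
{
    size_t base = ana_log_size_static(nr_ana_groups);

    /* Guard the addition against overflow where size_t is 32 bits wide. */
    assert(max_namespaces <= (SIZE_MAX - base) / ANA_NSID_SIZE);
    return base + (size_t)max_namespaces * ANA_NSID_SIZE;
}

/* Does a buffer hold a log with the given groups and total namespace IDs? */
static bool ana_log_fits(size_t buf_size, uint32_t nr_groups, uint32_t nr_nsids)
{
    return buf_size >= ana_log_size_full(nr_groups, nr_nsids);
}
```

Under these assumptions, a buffer from `ana_log_size_static(4)` holds 144 bytes and cannot take a log that lists even ten namespace IDs, whereas the full size always can as long as the namespace count stays within `max_namespaces`.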
    • nvme-rdma: rework queue maps handling · b1064d3e
      Committed by Sagi Grimberg
      If the device supports fewer queues than requested (if the device has
      fewer completion vectors), we might hit a bug due to the fact that we
      ignore that in nvme_rdma_map_queues (we override the maps nr_queues with
      user opts).
      
      Instead, keep track of how many default/read/poll queues we actually
      allocated (rather than how many the user asked for) and use that to
      assign our queue mappings.
      
      Fixes: b65bb777 ("nvme-rdma: support separate queue maps for read and write")
      Reported-by: Saleem, Shiraz <shiraz.saleem@intel.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
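
The idea of mapping from granted rather than requested queue counts can be sketched as follows. This is not the driver's code: `queue_request`, `queue_grant`, `split_queues`, `map_offset` and the hand-out order (default, then read, then poll) are hypothetical and chosen only to illustrate the bookkeeping.

```c
#include <assert.h>

enum map_type { MAP_DEFAULT, MAP_READ, MAP_POLL, MAP_TYPES };

/* What the user asked for. */
struct queue_request {
    unsigned int nr_default;
    unsigned int nr_read;
    unsigned int nr_poll;
};

/* What was actually allocated per map type. */
struct queue_grant {
    unsigned int nr[MAP_TYPES];
};

static unsigned int min_u(unsigned int a, unsigned int b)
{
    return a < b ? a : b;
}

/* Hand out the granted queues in a fixed order, clamping each request. */
static void split_queues(const struct queue_request *req, unsigned int granted,
                         struct queue_grant *out)
{
    out->nr[MAP_DEFAULT] = min_u(req->nr_default, granted);
    granted -= out->nr[MAP_DEFAULT];
    out->nr[MAP_READ] = min_u(req->nr_read, granted);
    granted -= out->nr[MAP_READ];
    out->nr[MAP_POLL] = min_u(req->nr_poll, granted);
}

/* First queue index of a map type, derived from the granted counts only. */
static unsigned int map_offset(const struct queue_grant *grant, int type)
{
    unsigned int off = 0;

    assert(type >= 0 && type < MAP_TYPES);
    for (int t = 0; t < type; t++)
        off += grant->nr[t];
    return off;
}
```

Because offsets and counts both come from `queue_grant`, a map can never point past the queues that exist, which is the failure mode the commit message describes when user options are used directly.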
    • nvme-tcp: fix timeout handler · 39d57757
      Committed by Sagi Grimberg
      Currently, we have several problems with the timeout
      handler:
      1. If we time out on the controller establishment flow, we will hang
      because we don't execute the error recovery (and we shouldn't because
      the create_ctrl flow needs to fail and clean up on its own)
      2. We might also hang if we get a disconnect on a queue while the
      controller is already deleting. This racy flow can cause the controller
      disable/shutdown admin command to hang.
      
      We cannot complete a timed out request from the timeout handler without
      mutual exclusion from the teardown flow (e.g. nvme_rdma_error_recovery_work).
      So we serialize it in the timeout handler and tear down the I/O and admin
      queues to guarantee that no one races with us in completing the
      request.
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
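
The decision this commit message describes can be modelled as a small state check. The sketch below is a user-space illustration under stated assumptions: the enum values, `struct ctrl_model` and `timeout_model` are hypothetical names, and the two boolean flags stand in for the real teardown and error recovery work. It deliberately leaves out the locking that the message says is needed to serialize against the teardown flow.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical controller states and handler results, for illustration. */
enum ctrl_state { CTRL_CONNECTING, CTRL_LIVE, CTRL_RESETTING, CTRL_DELETING };
enum eh_result { EH_DONE, EH_RESET_TIMER };

struct ctrl_model {
    enum ctrl_state state;
    bool queues_torn_down;   /* stands in for tearing down I/O and admin queues */
    bool recovery_scheduled; /* stands in for kicking error recovery */
};

/*
 * Timeout decision: while the controller is being set up or deleted, error
 * recovery will not run, so the handler tears the queues down itself and
 * reports the request as done. On a live controller it defers to error
 * recovery and lets the timer run again.
 */
static enum eh_result timeout_model(struct ctrl_model *ctrl)
{
    assert(ctrl != NULL);

    if (ctrl->state != CTRL_LIVE) {
        ctrl->queues_torn_down = true;
        return EH_DONE;
    }

    ctrl->recovery_scheduled = true;
    return EH_RESET_TIMER;
}
```

Both hang scenarios in the list above fall into the first branch: the establishment flow and the deleting flow are non-live states in which nobody else would complete the timed out request.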
    • nvme-rdma: fix timeout handler · 4c174e63
      Committed by Sagi Grimberg
      Currently, we have several problems with the timeout
      handler:
      1. If we time out on the controller establishment flow, we will hang
      because we don't execute the error recovery (and we shouldn't because
      the create_ctrl flow needs to fail and clean up on its own)
      2. We might also hang if we get a disconnect on a queue while the
      controller is already deleting. This racy flow can cause the controller
      disable/shutdown admin command to hang.
      
      We cannot complete a timed out request from the timeout handler without
      mutual exclusion from the teardown flow (e.g. nvme_rdma_error_recovery_work).
      So we serialize it in the timeout handler and tear down the I/O and admin
      queues to guarantee that no one races with us in completing the
      request.
      Reported-by: Jaesoo Lee <jalee@purestorage.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>