1. 23 May, 2019 (1 commit)
    • nvme-pci: use blk-mq mapping for unmanaged irqs · cb9e0e50
      Committed by Keith Busch
      If a device is providing a single IRQ vector, the IO queue will share
      that vector with the admin queue. This is an unmanaged vector, so it does
      not have a valid PCI IRQ affinity. Avoid trying to extract a managed
      affinity in this case and let blk-mq set up the cpu:queue mapping instead.
      Otherwise we'd hit the following warning when the device is using MSI:
      
       WARNING: CPU: 4 PID: 7 at drivers/pci/msi.c:1272 pci_irq_get_affinity+0x66/0x80
       Modules linked in: nvme nvme_core serio_raw
       CPU: 4 PID: 7 Comm: kworker/u16:0 Tainted: G        W         5.2.0-rc1+ #494
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
       Workqueue: nvme-reset-wq nvme_reset_work [nvme]
       RIP: 0010:pci_irq_get_affinity+0x66/0x80
       Code: 0b 31 c0 c3 83 e2 10 48 c7 c0 b0 83 35 91 74 2a 48 8b 87 d8 03 00 00 48 85 c0 74 0e 48 8b 50 30 48 85 d2 74 05 39 70 14 77 05 <0f> 0b 31 c0 c3 48 63 f6 48 8d 04 76 48 8d 04 c2 f3 c3 48 8b 40 30
       RSP: 0000:ffffb5abc01d3cc8 EFLAGS: 00010246
       RAX: ffff9536786a39c0 RBX: 0000000000000000 RCX: 0000000000000080
       RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9536781ed000
       RBP: ffff95367346a008 R08: ffff95367d43f080 R09: ffff953678c07800
       R10: ffff953678164800 R11: 0000000000000000 R12: 0000000000000000
       R13: ffff9536781ed000 R14: 00000000ffffffff R15: ffff95367346a008
       FS:  0000000000000000(0000) GS:ffff95367d400000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00007fdf814a3ff0 CR3: 000000001a20f000 CR4: 00000000000006e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        blk_mq_pci_map_queues+0x37/0xd0
        nvme_pci_map_queues+0x80/0xb0 [nvme]
        blk_mq_alloc_tag_set+0x133/0x2f0
        nvme_reset_work+0x105d/0x1590 [nvme]
        process_one_work+0x291/0x530
        worker_thread+0x218/0x3d0
        ? process_one_work+0x530/0x530
        kthread+0x111/0x130
        ? kthread_park+0x90/0x90
        ret_from_fork+0x1f/0x30
       ---[ end trace 74587339d93c83c0 ]---
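
      A minimal sketch of the mapping decision described above, roughly
      following the shape of nvme_pci_map_queues() in this era of the driver
      (the surrounding loop and variable handling are assumptions, not the
      literal upstream diff):

        /*
         * offset is 0 when the device exposes only one vector, i.e. the IO
         * queue shares the unmanaged admin vector and carries no managed
         * PCI IRQ affinity.
         */
        static int nvme_pci_map_queues(struct blk_mq_tag_set *set)
        {
                struct nvme_dev *dev = set->driver_data;
                int i, qoff, offset = queue_irq_offset(dev);

                for (i = 0, qoff = 0; i < set->nr_maps; i++) {
                        struct blk_mq_queue_map *map = &set->map[i];

                        map->nr_queues = dev->io_queues[i];
                        map->queue_offset = qoff;
                        /* No managed affinity for poll queues or for the
                         * shared unmanaged vector: let blk-mq build the
                         * cpu:queue mapping itself. */
                        if (i != HCTX_TYPE_POLL && offset)
                                blk_mq_pci_map_queues(map, to_pci_dev(dev->dev),
                                                      offset);
                        else
                                blk_mq_map_queues(map);
                        qoff += map->nr_queues;
                        offset += map->nr_queues;
                }
                return 0;
        }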
      
      Fixes: 22b55601 ("nvme-pci: Separate IO and admin queue IRQ vectors")
      Reported-by: Iván Chavero <ichavero@chavero.com.mx>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Keith Busch <keith.busch@intel.com>
  2. 18 May, 2019 (4 commits)
  3. 13 May, 2019 (2 commits)
  4. 01 May, 2019 (6 commits)
  5. 05 Apr, 2019 (13 commits)
  6. 14 Mar, 2019 (1 commit)
  7. 20 Feb, 2019 (2 commits)
  8. 18 Feb, 2019 (2 commits)
    • nvme-pci: Simplify interrupt allocation · 612b7286
      Committed by Ming Lei
      The NVME PCI driver contains a tedious mechanism for interrupt
      allocation, which is necessary to adjust the number and size of interrupt
      sets to the maximum available number of interrupts which depends on the
      underlying PCI capabilities and the available CPU resources.
      
      It works around the former shortcomings of the PCI and core interrupt
      allocation mechanisms in combination with interrupt sets.

      The PCI interrupt allocation function allows the caller to provide a
      maximum and a minimum number of interrupts to be allocated and tries to
      allocate as many as possible. This worked without driver interaction as
      long as there was only a single set of interrupts to handle.
      
      With the addition of support for multiple interrupt sets in the generic
      affinity spreading logic, which is invoked from the PCI interrupt
      allocation, the adaptive loop in the PCI interrupt allocation did not
      work for multiple interrupt sets. The reason is that depending on the
      total number of interrupts which the PCI allocation adaptive loop tries
      to allocate in each step, the number and the size of the interrupt sets
      need to be adapted as well. Due to the way the interrupt sets support was
      implemented there was no way for the PCI interrupt allocation code or the
      core affinity spreading mechanism to invoke a driver specific function
      for adapting the interrupt sets configuration.
      
      As a consequence the driver had to implement another adaptive loop around
      the PCI interrupt allocation function and calling that with maximum and
      minimum interrupts set to the same value. This ensured that the
      allocation either succeeded or immediately failed without any attempt to
      adjust the number of interrupts in the PCI code.
      
      The core code now allows drivers to provide a callback to recalculate the
      number and the size of interrupt sets during PCI interrupt allocation,
      which in turn allows the PCI interrupt allocation function to be called
      in the same way as with a single set of interrupts. The PCI code handles
      the adaptive loop and the interrupt affinity spreading mechanism invokes
      the driver callback to adapt the interrupt set configuration to the
      current loop value. This replaces the adaptive loop in the driver
      completely.
      
      Implement the NVME specific callback which adjusts the interrupt sets
      configuration and remove the adaptive allocation loop.
      
      [ tglx: Simplify the callback further and restore the dropped adjustment of
        	number of sets ]
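
      A hedged sketch of the resulting driver side, assuming the callback
      shape added by this series: a ->calc_sets() hook in struct irq_affinity
      that the affinity spreading code invokes with the vector count currently
      being attempted (nvme_nr_read_queues() below is a hypothetical helper
      standing in for the read/write split logic):

        static void nvme_calc_irq_sets(struct irq_affinity *affd,
                                       unsigned int nrirqs)
        {
                struct nvme_dev *dev = affd->priv;
                /* hypothetical helper for the read/write queue split */
                unsigned int nr_read = nvme_nr_read_queues(dev, nrirqs);

                affd->set_size[HCTX_TYPE_DEFAULT] = nrirqs - nr_read;
                affd->set_size[HCTX_TYPE_READ] = nr_read;
                affd->nr_sets = nr_read ? 2 : 1;
        }

        static int nvme_setup_irqs(struct nvme_dev *dev, unsigned int nr_io_queues)
        {
                struct irq_affinity affd = {
                        .pre_vectors = 1,               /* admin queue vector */
                        .calc_sets   = nvme_calc_irq_sets,
                        .priv        = dev,
                };

                /* A single call: the PCI core runs the adaptive loop and
                 * invokes ->calc_sets() for every attempted vector count,
                 * so no driver-side retry loop is needed any more. */
                return pci_alloc_irq_vectors_affinity(to_pci_dev(dev->dev), 1,
                                nr_io_queues + 1,
                                PCI_IRQ_ALL_TYPES | PCI_IRQ_AFFINITY, &affd);
        }
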
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Bjorn Helgaas <helgaas@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: linux-block@vger.kernel.org
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Cc: linux-nvme@lists.infradead.org
      Cc: linux-pci@vger.kernel.org
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Sumit Saxena <sumit.saxena@broadcom.com>
      Cc: Kashyap Desai <kashyap.desai@broadcom.com>
      Cc: Shivasharan Srikanteshwara <shivasharan.srikanteshwara@broadcom.com>
      Link: https://lkml.kernel.org/r/20190216172228.602546658@linutronix.de
      
    • genirq/affinity: Store interrupt sets size in struct irq_affinity · 9cfef55b
      Committed by Ming Lei
      The interrupt affinity spreading mechanism supports spreading out
      affinities for one or more interrupt sets. An interrupt set contains one
      or more interrupts. Each set is mapped to a specific functionality of a
      device, e.g. general I/O queues and read I/O queues of multiqueue block
      devices.
      
      The number of interrupts per set is defined by the driver. It depends on
      the total number of available interrupts for the device, which is
      determined by the PCI capabilities and the availability of underlying CPU
      resources, and the number of queues which the device provides and the
      driver wants to instantiate.
      
      The driver passes initial configuration for the interrupt allocation via
      a pointer to struct irq_affinity.
      
      Right now the allocation mechanism is complex as it requires a loop in
      the driver to determine the maximum number of interrupts which
      are provided by the PCI capabilities and the underlying CPU resources.
      This loop would have to be replicated in every driver which wants to
      utilize this mechanism. That's unwanted code duplication and error
      prone.
      
      In order to move this into generic facilities it is required to have a
      mechanism, which allows the recalculation of the interrupt sets and
      their size, in the core code. As the core code does not have any
      knowledge about the underlying device, a driver specific callback will
      be added to struct irq_affinity, which will be invoked by the core
      code. The callback will get the number of available interrupts as an
      argument, so the driver can calculate the corresponding number and size
      of interrupt sets.
      
      To support this, two modifications for the handling of struct irq_affinity
      are required:
      
      1) The (optional) interrupt sets size information is contained in a
         separate array of integers and struct irq_affinity contains a
         pointer to it.
      
         This is cumbersome and as the maximum number of interrupt sets is small,
         there is no reason to have separate storage. Moving the size array into
   struct irq_affinity avoids indirections and makes the code simpler.
      
      2) At the moment the struct irq_affinity pointer which is handed in from
         the driver and passed through to several core functions is marked
         'const'.
      
         With the upcoming callback to recalculate the number and size of
         interrupt sets, it's necessary to remove the 'const'
         qualifier. Otherwise the callback would not be able to update the data.
      
      Implement #1 and store the interrupt sets size in 'struct irq_affinity'.
      
      No functional change.
      
      [ tglx: Fixed the memcpy() size so it won't copy beyond the size of the
        	source. Fixed the kernel doc comments for struct irq_affinity and
        	de-'This patch'-ed the changelog ]
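
      For reference, a sketch of the resulting layout described in #1 (field
      names are assumed from this series; IRQ_AFFINITY_MAX_SETS is the small
      fixed upper bound on the number of sets, and the calc_sets callback is
      the one added by the companion patch):

        struct irq_affinity {
                unsigned int    pre_vectors;    /* not spread, e.g. admin vector */
                unsigned int    post_vectors;   /* not spread */
                unsigned int    nr_sets;        /* number of interrupt sets */
                unsigned int    set_size[IRQ_AFFINITY_MAX_SETS]; /* embedded, no pointer */
                void            (*calc_sets)(struct irq_affinity *, unsigned int nvecs);
                void            *priv;
        };
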
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Bjorn Helgaas <helgaas@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: linux-block@vger.kernel.org
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Cc: linux-nvme@lists.infradead.org
      Cc: linux-pci@vger.kernel.org
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Sumit Saxena <sumit.saxena@broadcom.com>
      Cc: Kashyap Desai <kashyap.desai@broadcom.com>
      Cc: Shivasharan Srikanteshwara <shivasharan.srikanteshwara@broadcom.com>
      Link: https://lkml.kernel.org/r/20190216172228.423723127@linutronix.de
      
  9. 12 Feb, 2019 (1 commit)
  10. 06 Feb, 2019 (1 commit)
  11. 17 Jan, 2019 (1 commit)
    • nvme-pci: fix nvme_setup_irqs() · c45b1fa2
      Committed by Ming Lei
      When -ENOSPC is returned from pci_alloc_irq_vectors_affinity(), we
      still retry the allocation of multiple irq vectors, and in that case
      the irq queues actually have to cover the admin queue as well. That is
      not taken into account, so the number of allocated irq vectors may end
      up being the same as the sum of io_queues[HCTX_TYPE_DEFAULT] and
      io_queues[HCTX_TYPE_READ]. This is obviously wrong, eventually breaks
      nvme_pci_map_queues(), and triggers the warning from
      pci_irq_get_affinity().

      IRQ queues should cover the admin queue, so make this explicit in
      nvme_calc_io_queues().

      We got several internal reports of boot failures on aarch64, so please
      consider fixing this in v4.20.
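
      A hedged sketch of the accounting this change makes explicit (the helper
      shape and the read/write split below are illustrative, not the exact
      upstream code):

        static void nvme_calc_io_queues(struct nvme_dev *dev,
                                        unsigned int irq_queues)
        {
                unsigned int io_vecs;

                if (irq_queues <= 2) {
                        /* One IO vector besides the admin one (or a shared
                         * vector): everything goes to the default set. */
                        dev->io_queues[HCTX_TYPE_DEFAULT] = 1;
                        dev->io_queues[HCTX_TYPE_READ] = 0;
                        return;
                }

                /* One of the allocated vectors covers the admin queue, so
                 * only the remainder may be split between the default and
                 * read IO queue sets. */
                io_vecs = irq_queues - 1;
                dev->io_queues[HCTX_TYPE_READ] = io_vecs / 2;
                dev->io_queues[HCTX_TYPE_DEFAULT] = io_vecs - io_vecs / 2;
        }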
      
      Fixes: 6451fe73 ("nvme: fix irq vs io_queue calculations")
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Reviewed-by: Keith Busch <keith.busch@intel.com>
      Tested-by: fin4478 <fin4478@hotmail.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  12. 10 Jan, 2019 (5 commits)
    • nvme: introduce NVME_QUIRK_IGNORE_DEV_SUBNQN · 6299358d
      Committed by James Dingwall
      If a device provides an NQN it is expected to be globally unique.
      Unfortunately some firmware revisions for Intel 760p/Pro 7600p devices did
      not satisfy this requirement.  In these circumstances if a system has >1
      affected device then only one device is enabled.  If this quirk is enabled
      then the device-supplied subnqn is ignored and we fall back to generating
      one as if the field were empty.  In this case we also suppress the version
      check so we don't print a warning when the quirk is enabled.
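
      Conceptually the quirk gates the NQN handling as sketched below (assumed
      to resemble nvme_init_subnqn(); nvme_generate_fallback_nqn() is a
      hypothetical stand-in for the existing fallback generation):

        if (!(ctrl->quirks & NVME_QUIRK_IGNORE_DEV_SUBNQN)) {
                nqnlen = strnlen(id->subnqn, NVMF_NQN_SIZE);
                if (nqnlen > 0 && nqnlen < NVMF_NQN_SIZE) {
                        /* Trust the device supplied NQN. */
                        strlcpy(subsys->subnqn, id->subnqn, NVMF_NQN_SIZE);
                        return;
                }
                /* Only warn about a missing SUBNQN when the quirk is off. */
                if (ctrl->vs >= NVME_VS(1, 2, 1))
                        dev_warn(ctrl->device,
                                 "missing or invalid SUBNQN field.\n");
        }
        /* Quirked or empty subnqn: fall back to a locally generated NQN. */
        nvme_generate_fallback_nqn(subsys, ctrl);
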
      Reviewed-by: Keith Busch <keith.busch@intel.com>
      Signed-off-by: James Dingwall <james@dingwall.me.uk>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme-pci: fix out of bounds access in nvme_cqe_pending · dcca1662
      Committed by Hongbo Yao
      There is an out of bounds array access in nvme_cqe_pending().

      When irq_thread is enabled for the nvme interrupt, there is a race
      between updating and reading nvmeq->cq_head.

      nvmeq->cq_head is updated in nvme_update_cq_head(). If nvmeq->cq_head
      transiently equals nvmeq->q_depth, before its value is set back to zero,
      and nvme_cqe_pending() uses that value as an array index, the index is
      out of bounds.
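
      A sketch of the kind of update that avoids the transient out-of-range
      value (assumed to reflect the reworked nvme_update_cq_head(); the
      concurrent reader is nvme_cqe_pending(), which uses cq_head as a CQE
      array index):

        static inline void nvme_update_cq_head(struct nvme_queue *nvmeq)
        {
                /* Never store q_depth into cq_head, not even transiently:
                 * a concurrent nvme_cqe_pending() would use it as an index
                 * into the completion queue array and read past the end. */
                if (nvmeq->cq_head == nvmeq->q_depth - 1) {
                        nvmeq->cq_head = 0;
                        nvmeq->cq_phase = !nvmeq->cq_phase;
                } else {
                        nvmeq->cq_head++;
                }
        }
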
      Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
      [hch: slight coding style update]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme-pci: rerun irq setup on IO queue init errors · 8fae268b
      Committed by Keith Busch
      If the driver is unable to create a subset of IO queues for any reason,
      the read/write and polled queue sets will not match the actual allocated
      hardware contexts. This leaves gaps in the CPU affinity mappings and
      causes the following kernel panic after blk_mq_map_queue_type() returns
      a NULL hctx.
      
        BUG: unable to handle kernel NULL pointer dereference at 0000000000000198
        #PF error: [normal kernel read fault]
        PGD 0 P4D 0
        Oops: 0000 [#1] SMP
        CPU: 64 PID: 1171 Comm: kworker/u259:1 Not tainted 4.20.0+ #241
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-2.fc27 04/01/2014
        Workqueue: nvme-wq nvme_scan_work [nvme_core]
        RIP: 0010:blk_mq_init_allocated_queue+0x2d9/0x440
        RSP: 0018:ffffb1bf0abc3cd0 EFLAGS: 00010286
        RAX: 000000000000001f RBX: ffff8ea744cf0718 RCX: 0000000000000000
        RDX: 0000000000000002 RSI: 000000000000007c RDI: ffffffff9109a820
        RBP: ffff8ea7565f7008 R08: 000000000000001f R09: 000000000000003f
        R10: ffffb1bf0abc3c00 R11: 0000000000000000 R12: 000000000001d008
        R13: ffff8ea7565f7008 R14: 000000000000003f R15: 0000000000000001
        FS:  0000000000000000(0000) GS:ffff8ea757200000(0000) knlGS:0000000000000000
        CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        CR2: 0000000000000198 CR3: 0000000013058000 CR4: 00000000000006e0
        DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        Call Trace:
         blk_mq_init_queue+0x35/0x60
         nvme_validate_ns+0xc6/0x7c0 [nvme_core]
         ? nvme_identify_ctrl.isra.56+0x7e/0xc0 [nvme_core]
         nvme_scan_work+0xc8/0x340 [nvme_core]
         ? __wake_up_common+0x6d/0x120
         ? try_to_wake_up+0x55/0x410
         process_one_work+0x1e9/0x3d0
         worker_thread+0x2d/0x3d0
         ? process_one_work+0x3d0/0x3d0
         kthread+0x111/0x130
         ? kthread_park+0x90/0x90
         ret_from_fork+0x1f/0x30
        Modules linked in: nvme nvme_core serio_raw
        CR2: 0000000000000198
      
      Fix by re-running the interrupt vector setup from scratch using a reduced
      count that may be successful until the created queues match the irq
      affinity plus polling queue sets.
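
      A hedged sketch of the retry described above (control flow only; the
      helpers follow the driver's naming, but this is not the literal diff):

        retry:
                result = nvme_setup_irqs(dev, nr_io_queues);
                if (result <= 0)
                        return -EIO;
                /* dev->max_qid is sized here from the allocated irq vectors
                 * plus the poll queue set (details elided). */

                /* ... admin/IO queue creation elided ... */

                /* Fewer IO queues came online than the irq and poll sets
                 * were sized for: free the vectors and redo the spreading
                 * with the count that actually worked, so every hctx gets a
                 * valid cpu mapping. */
                if (dev->online_queues - 1 < dev->max_qid) {
                        nr_io_queues = dev->online_queues - 1;
                        nvme_disable_io_queues(dev);
                        nvme_suspend_io_queues(dev);
                        goto retry;
                }
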
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme-pci: use the same attributes when freeing host_mem_desc_bufs. · cc667f6d
      Committed by Liviu Dudau
      When using HMB the PCIe host driver allocates host_mem_desc_bufs using
      dma_alloc_attrs() but frees them using dma_free_coherent(). Use the
      correct dma_free_attrs() function to free the buffers.
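
      In other words, allocation and free must go through the same attrs-aware
      API. A minimal sketch of the pairing (size, attrs and variable names are
      illustrative):

        unsigned long attrs = DMA_ATTR_NO_KERNEL_MAPPING | DMA_ATTR_NO_WARN;

        /* Allocation side: remember the attrs that were used ... */
        void *bufs = dma_alloc_attrs(dev->dev, chunk_size, &dma_addr,
                                     GFP_KERNEL, attrs);

        /* ... and free with dma_free_attrs() and the very same attrs rather
         * than dma_free_coherent(), which is dma_free_attrs() with attrs 0. */
        dma_free_attrs(dev->dev, chunk_size, bufs, dma_addr, attrs);
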
      Signed-off-by: Liviu Dudau <liviu@dudau.co.uk>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
    • nvme-pci: fix the wrong setting of nr_maps · c61e678f
      Committed by Jianchao Wang
      We only set the nr_maps to 3 if poll queues are supported.
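
      A sketch of the intended tag set setup (assumed shape around
      nvme_dev_add(); not the literal diff):

        dev->tagset.nr_maps = 2;                /* default + read */
        if (dev->io_queues[HCTX_TYPE_POLL])
                dev->tagset.nr_maps++;          /* poll map only when poll queues exist */
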
      Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
  13. 08 Jan, 2019 (1 commit)