1. 10 Jul 2019, 1 commit
  2. 21 Jun 2019, 7 commits
    • nvme-pci: clean up nvme_remove_dead_ctrl a bit · 7c1ce408
      By Chaitanya Kulkarni
      Remove the status parameter of nvme_remove_dead_ctrl(), which is only
      used for printing it.

      Move the print message into the function where the actual error occurs.
      Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      7c1ce408
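
      A minimal C sketch of the shape of this cleanup; the helper body is
      abbreviated and illustrative, not the full driver code:

        /* Before: status exists only so it can be printed here. */
        static void nvme_remove_dead_ctrl(struct nvme_dev *dev, int status)
        {
                dev_warn(dev->ctrl.device,
                         "Removing after probe failure status: %d\n", status);
                nvme_get_ctrl(&dev->ctrl);
                nvme_dev_disable(dev, false);
                nvme_kill_queues(&dev->ctrl);
        }

        /* After: each caller prints its own error where it occurs, and the
         * parameter is gone. */
        static void nvme_remove_dead_ctrl(struct nvme_dev *dev)
        {
                nvme_get_ctrl(&dev->ctrl);
                nvme_dev_disable(dev, false);
                nvme_kill_queues(&dev->ctrl);
        }
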
    • nvme-pci: properly report state change failure in nvme_reset_work · cee6c269
      By Minwoo Im
      If the state change to NVME_CTRL_CONNECTING fails, the dmesg output
      looks like:
      
        [  293.689160] nvme nvme0: failed to mark controller CONNECTING
        [  293.689160] nvme nvme0: Removing after probe failure status: 0
      
      Although the first line indicates the failure, the second line is
      misleading because the status is 0, which normally means the previous
      operation succeeded.
      
      This patch makes it report the proper error value when the state
      change fails:
        [   25.932367] nvme nvme0: failed to mark controller CONNECTING
        [   25.932369] nvme nvme0: Removing after probe failure status: -16
      
      This situation can easily be reproduced with:
        root@target:~# rmmod nvme && modprobe nvme && rmmod nvme
      Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
      Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      cee6c269
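
      A hedged sketch of the resulting error path in nvme_reset_work(); the
      choice of -EBUSY is an assumption consistent with the -16 in the log
      above:

        if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_CONNECTING)) {
                dev_warn(dev->ctrl.device,
                         "failed to mark controller CONNECTING\n");
                result = -EBUSY;        /* previously left at 0 */
                goto out;
        }
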
    • nvme-pci: set the errno on ctrl state change error · e71afda4
      By Chaitanya Kulkarni
      This patch removes the confusing assignment of the variable result at
      the time of declaration and sets the value in the error cases, next to
      the places where the actual error occurs.

      Here we also set result to -ENODEV when the final ctrl state transition
      in nvme_reset_work() fails. Without this assignment, result would still
      hold 0 from nvme_setup_io_queues(), and on failure that 0 would be
      passed to nvme_remove_dead_ctrl() from the final state transition.
      Signed-off-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      e71afda4
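
      A sketch of the pattern this patch moves to, assuming the simplified
      control flow below:

        int result;     /* no misleading initializer at declaration */

        /* ... probe/reset steps run, each setting result on its own
         * failure ... */

        if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_LIVE)) {
                dev_warn(dev->ctrl.device,
                         "failed to mark controller live state\n");
                result = -ENODEV;       /* would otherwise stay 0 */
                goto out;
        }
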
    • nvme-pci: adjust irq max_vector using num_possible_cpus() · dad77d63
      By Minwoo Im
      If the "irq_queues" are greater than num_possible_cpus(),
      nvme_calc_irq_sets() can have irq set_size for HCTX_TYPE_DEFAULT greater
      than it can be afforded.
      2039         affd->set_size[HCTX_TYPE_DEFAULT] = nrirqs - nr_read_queues;
      
      This might trigger the WARN() in irq_build_affinity_masks() [1]:
      220         if (nr_present < numvecs)
      221                 WARN_ON(nr_present + nr_others < numvecs);
      
      This patch prevents the WARN() by adjusting the max_vector value in
      nvme_setup_irqs(); a sketch of the adjustment follows this entry.
      
      [1] WARN messages when modprobe nvme write_queues=32 poll_queues=0:
      root@target:~/nvme# nproc
      8
      root@target:~/nvme# modprobe nvme write_queues=32 poll_queues=0
      [   17.925326] nvme nvme0: pci function 0000:00:04.0
      [   17.940601] WARNING: CPU: 3 PID: 1030 at kernel/irq/affinity.c:221 irq_create_affinity_masks+0x222/0x330
      [   17.940602] Modules linked in: nvme nvme_core [last unloaded: nvme]
      [   17.940605] CPU: 3 PID: 1030 Comm: kworker/u17:4 Tainted: G        W         5.1.0+ #156
      [   17.940605] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
      [   17.940608] Workqueue: nvme-reset-wq nvme_reset_work [nvme]
      [   17.940609] RIP: 0010:irq_create_affinity_masks+0x222/0x330
      [   17.940611] Code: 4c 8d 4c 24 28 4c 8d 44 24 30 e8 c9 fa ff ff 89 44 24 18 e8 c0 38 fa ff 8b 44 24 18 44 8b 54 24 1c 5a 44 01 d0 41 39 c4 76 02 <0f> 0b 48 89 df 44 01 e5 e8 f1 ce 10 00 48 8b 34 24 44 89 f0 44 01
      [   17.940611] RSP: 0018:ffffc90002277c50 EFLAGS: 00010216
      [   17.940612] RAX: 0000000000000008 RBX: ffff88807ca48860 RCX: 0000000000000000
      [   17.940612] RDX: ffff88807bc03800 RSI: 0000000000000020 RDI: 0000000000000000
      [   17.940613] RBP: 0000000000000001 R08: ffffc90002277c78 R09: ffffc90002277c70
      [   17.940613] R10: 0000000000000008 R11: 0000000000000001 R12: 0000000000000020
      [   17.940614] R13: 0000000000025d08 R14: 0000000000000001 R15: ffff88807bc03800
      [   17.940614] FS:  0000000000000000(0000) GS:ffff88807db80000(0000) knlGS:0000000000000000
      [   17.940616] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [   17.940617] CR2: 00005635e583f790 CR3: 000000000240a000 CR4: 00000000000006e0
      [   17.940617] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [   17.940618] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [   17.940618] Call Trace:
      [   17.940622]  __pci_enable_msix_range+0x215/0x540
      [   17.940623]  ? kernfs_put+0x117/0x160
      [   17.940625]  pci_alloc_irq_vectors_affinity+0x74/0x110
      [   17.940626]  nvme_reset_work+0xc30/0x1397 [nvme]
      [   17.940628]  ? __switch_to_asm+0x34/0x70
      [   17.940628]  ? __switch_to_asm+0x40/0x70
      [   17.940629]  ? __switch_to_asm+0x34/0x70
      [   17.940630]  ? __switch_to_asm+0x40/0x70
      [   17.940630]  ? __switch_to_asm+0x34/0x70
      [   17.940631]  ? __switch_to_asm+0x40/0x70
      [   17.940632]  ? nvme_irq_check+0x30/0x30 [nvme]
      [   17.940633]  process_one_work+0x20b/0x3e0
      [   17.940634]  worker_thread+0x1f9/0x3d0
      [   17.940635]  ? cancel_delayed_work+0xa0/0xa0
      [   17.940636]  kthread+0x117/0x120
      [   17.940637]  ? kthread_stop+0xf0/0xf0
      [   17.940638]  ret_from_fork+0x3a/0x50
      [   17.940639] ---[ end trace aca8a131361cd42a ]---
      [   17.942124] nvme nvme0: 7/1/0 default/read/poll queues
      Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      dad77d63
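
      A hedged sketch of the clamp described above; the exact placement in
      nvme_setup_irqs() and the surrounding computation are assumptions, and
      the "+ 1" accounts for the admin queue vector:

        unsigned int irq_queues = nr_io_queues - this_p_queues + 1;

        /*
         * Never request more IO-queue vectors than there are possible
         * CPUs; the surplus could not be spread across CPUs and would
         * trigger the WARN() shown above.
         */
        irq_queues = min_t(unsigned int, irq_queues,
                           num_possible_cpus() + 1);
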
    • nvme-pci: remove queue_count_ops for write_queues and poll_queues · 483178f3
      By Minwoo Im
      queue_count_set() appears to have been provided to limit the number of
      queue entries for the write/poll queues. But it has been doing nothing
      but a parameter check, and even its num_possible_cpus() comparison is a
      no-op.

      This patch removes the entire queue_count_ops from write_queues and
      poll_queues.
      Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      483178f3
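
      A hedged before/after sketch, assuming the parameter was previously
      registered through a custom kernel_param_ops whose set() only
      validated the input:

        /* Before: dedicated ops doing nothing beyond a range check. */
        static const struct kernel_param_ops queue_count_ops = {
                .set = queue_count_set,
                .get = param_get_int,
        };
        module_param_cb(write_queues, &queue_count_ops, &write_queues, 0644);

        /* After: a plain module parameter suffices. */
        module_param(write_queues, int, 0644);
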
    • nvme-pci: remove unnecessary zero for static var · a232ea0e
      By Minwoo Im
      poll_queues will be zero even without zero initialization here, since
      static variables are implicitly zero-initialized.
      Signed-off-by: Minwoo Im <minwoo.im.dev@gmail.com>
      Reviewed-by: Chaitanya Kulkarni <chaitanya.kulkarni@wdc.com>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      a232ea0e
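
      This relies on a C language guarantee: objects with static storage
      duration are zero-initialized (they land in .bss), so the explicit
      initializer is redundant:

        /* Before: the "= 0" adds nothing. */
        static int poll_queues = 0;

        /* After: equivalent, since statics are zero-initialized. */
        static int poll_queues;
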
    • nvme-pci: use host managed power state for suspend · d916b1be
      By Keith Busch
      The nvme pci driver prepares its devices for power loss during suspend
      by shutting down the controllers. The power setting is deferred to the
      PCI driver's power management before the platform removes power. The
      suspend-to-idle mode, however, does not remove power.
      
      NVMe devices that implement host managed power settings can achieve
      lower power and better transition latencies than using generic PCI power
      settings. Try to use this feature if the platform is not involved with
      the suspend. If successful, restore the previous power state on resume.
      Tested-by: Kai-Heng Feng <kai.heng.feng@canonical.com>
      Tested-by: Mario Limonciello <mario.limonciello@dell.com>
      Reviewed-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Sagi Grimberg <sagi@grimberg.me>
      [hch: fixed the compilation for the !CONFIG_PM_SLEEP case]
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      d916b1be
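
      A hedged sketch of the feature call at the core of this change;
      NVME_FEAT_POWER_MGMT and nvme_set_features() are existing kernel
      interfaces, but the suspend/resume wiring shown in the comment is
      abbreviated:

        static int nvme_set_power_state(struct nvme_ctrl *ctrl, u32 ps)
        {
                return nvme_set_features(ctrl, NVME_FEAT_POWER_MGMT, ps,
                                         NULL, 0, NULL);
        }

        /*
         * Suspend-to-idle path (sketch): save the current power state,
         * enter the deepest host-managed state instead of shutting the
         * controller down, and restore the saved state on resume.
         */
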
  3. 06 Jun 2019, 1 commit
  4. 23 May 2019, 1 commit
    • nvme-pci: use blk-mq mapping for unmanaged irqs · cb9e0e50
      By Keith Busch
      If a device provides only a single IRQ vector, the IO queue will share
      that vector with the admin queue. This is an unmanaged vector, so it
      does not have a valid PCI IRQ affinity. Avoid trying to extract a
      managed affinity in this case and let blk-mq set up the cpu:queue
      mapping instead. Otherwise we'd hit the following warning when the
      device is using MSI:
      
       WARNING: CPU: 4 PID: 7 at drivers/pci/msi.c:1272 pci_irq_get_affinity+0x66/0x80
       Modules linked in: nvme nvme_core serio_raw
       CPU: 4 PID: 7 Comm: kworker/u16:0 Tainted: G        W         5.2.0-rc1+ #494
       Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
       Workqueue: nvme-reset-wq nvme_reset_work [nvme]
       RIP: 0010:pci_irq_get_affinity+0x66/0x80
       Code: 0b 31 c0 c3 83 e2 10 48 c7 c0 b0 83 35 91 74 2a 48 8b 87 d8 03 00 00 48 85 c0 74 0e 48 8b 50 30 48 85 d2 74 05 39 70 14 77 05 <0f> 0b 31 c0 c3 48 63 f6 48 8d 04 76 48 8d 04 c2 f3 c3 48 8b 40 30
       RSP: 0000:ffffb5abc01d3cc8 EFLAGS: 00010246
       RAX: ffff9536786a39c0 RBX: 0000000000000000 RCX: 0000000000000080
       RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9536781ed000
       RBP: ffff95367346a008 R08: ffff95367d43f080 R09: ffff953678c07800
       R10: ffff953678164800 R11: 0000000000000000 R12: 0000000000000000
       R13: ffff9536781ed000 R14: 00000000ffffffff R15: ffff95367346a008
       FS:  0000000000000000(0000) GS:ffff95367d400000(0000) knlGS:0000000000000000
       CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
       CR2: 00007fdf814a3ff0 CR3: 000000001a20f000 CR4: 00000000000006e0
       DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
       DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
       Call Trace:
        blk_mq_pci_map_queues+0x37/0xd0
        nvme_pci_map_queues+0x80/0xb0 [nvme]
        blk_mq_alloc_tag_set+0x133/0x2f0
        nvme_reset_work+0x105d/0x1590 [nvme]
        process_one_work+0x291/0x530
        worker_thread+0x218/0x3d0
        ? process_one_work+0x530/0x530
        kthread+0x111/0x130
        ? kthread_park+0x90/0x90
        ret_from_fork+0x1f/0x30
       ---[ end trace 74587339d93c83c0 ]---
      
      Fixes: 22b55601 ("nvme-pci: Separate IO and admin queue IRQ vectors")
      Reported-by: Iván Chavero <ichavero@chavero.com.mx>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Keith Busch <keith.busch@intel.com>
      cb9e0e50
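
      A sketch of the resulting decision, assuming the loop context of
      nvme_pci_map_queues(), where offset (from queue_irq_offset()) is 0
      exactly when a single vector is shared with the admin queue:

        /*
         * With a single shared (unmanaged) vector there is no PCI IRQ
         * affinity to consult, so fall back to the generic mapping.
         */
        if (i != HCTX_TYPE_POLL && offset)
                blk_mq_pci_map_queues(map, to_pci_dev(dev->dev), offset);
        else
                blk_mq_map_queues(map);
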
  5. 18 May 2019, 4 commits
  6. 13 May 2019, 2 commits
  7. 01 May 2019, 6 commits
  8. 05 Apr 2019, 13 commits
  9. 14 Mar 2019, 1 commit
  10. 20 Feb 2019, 2 commits
  11. 18 Feb 2019, 2 commits
    • nvme-pci: Simplify interrupt allocation · 612b7286
      By Ming Lei
      The NVMe PCI driver contains a tedious mechanism for interrupt
      allocation, which is necessary to adjust the number and size of the
      interrupt sets to the maximum available number of interrupts, which
      depends on the underlying PCI capabilities and the available CPU
      resources.

      It works around former shortcomings of the PCI and core interrupt
      allocation mechanisms in combination with interrupt sets.
      
      The PCI interrupt allocation function allows the caller to provide a
      maximum and a minimum number of interrupts to be allocated and tries to
      allocate as many as possible. This worked without driver interaction as
      long as there was only a single set of interrupts to handle.
      
      With the addition of support for multiple interrupt sets in the generic
      affinity spreading logic, which is invoked from the PCI interrupt
      allocation, the adaptive loop in the PCI interrupt allocation did not
      work for multiple interrupt sets. The reason is that depending on the
      total number of interrupts which the PCI allocation adaptive loop tries
      to allocate in each step, the number and the size of the interrupt sets
      need to be adapted as well. Due to the way the interrupt sets support was
      implemented there was no way for the PCI interrupt allocation code or the
      core affinity spreading mechanism to invoke a driver specific function
      for adapting the interrupt sets configuration.
      
      As a consequence the driver had to implement another adaptive loop around
      the PCI interrupt allocation function and calling that with maximum and
      minimum interrupts set to the same value. This ensured that the
      allocation either succeeded or immediately failed without any attempt to
      adjust the number of interrupts in the PCI code.
      
      The core code now allows drivers to provide a callback to recalculate the
      number and the size of interrupt sets during PCI interrupt allocation,
      which in turn allows the PCI interrupt allocation function to be called
      in the same way as with a single set of interrupts. The PCI code handles
      the adaptive loop and the interrupt affinity spreading mechanism invokes
      the driver callback to adapt the interrupt set configuration to the
      current loop value. This replaces the adaptive loop in the driver
      completely.
      
      Implement the NVME specific callback which adjusts the interrupt sets
      configuration and remove the adaptive allocation loop.
      
      [ tglx: Simplify the callback further and restore the dropped adjustment of
        	number of sets ]
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Bjorn Helgaas <helgaas@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: linux-block@vger.kernel.org
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Cc: linux-nvme@lists.infradead.org
      Cc: linux-pci@vger.kernel.org
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Sumit Saxena <sumit.saxena@broadcom.com>
      Cc: Kashyap Desai <kashyap.desai@broadcom.com>
      Cc: Shivasharan Srikanteshwara <shivasharan.srikanteshwara@broadcom.com>
      Link: https://lkml.kernel.org/r/20190216172228.602546658@linutronix.de
      
      612b7286
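
      A hedged sketch of the callback shape this change introduces,
      consistent with the set_size assignment quoted in the dad77d63 entry
      above; the queue-split policy shown is a simplified assumption, not
      the driver's exact logic:

        static void nvme_calc_irq_sets(struct irq_affinity *affd,
                                       unsigned int nrirqs)
        {
                struct nvme_dev *dev = affd->priv;
                unsigned int nr_read_queues = 0;

                /* Simplified split policy (assumption): give the remainder
                 * to reads when more than one vector is available. */
                if (nrirqs > 1 && write_queues && write_queues < nrirqs)
                        nr_read_queues = nrirqs - write_queues;

                affd->set_size[HCTX_TYPE_DEFAULT] = nrirqs - nr_read_queues;
                affd->set_size[HCTX_TYPE_READ] = nr_read_queues;
                affd->nr_sets = 2;
        }

        /* Registered once; the PCI core's adaptive loop re-invokes the
         * callback as it varies the total vector count. */
        struct irq_affinity affd = {
                .pre_vectors = 1,               /* admin queue */
                .calc_sets   = nvme_calc_irq_sets,
                .priv        = dev,
        };
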
    • genirq/affinity: Store interrupt sets size in struct irq_affinity · 9cfef55b
      By Ming Lei
      The interrupt affinity spreading mechanism supports spreading out
      affinities for one or more interrupt sets. An interrupt set contains
      one or more interrupts. Each set is mapped to a specific functionality
      of a device, e.g. general I/O queues and read I/O queues of multiqueue
      block devices.
      
      The number of interrupts per set is defined by the driver. It depends on
      the total number of available interrupts for the device, which is
      determined by the PCI capabilities and the availability of underlying CPU
      resources, and the number of queues which the device provides and the
      driver wants to instantiate.
      
      The driver passes initial configuration for the interrupt allocation via
      a pointer to struct irq_affinity.
      
      Right now the allocation mechanism is complex as it requires a loop in
      the driver to determine the maximum number of interrupts which are
      provided by the PCI capabilities and the underlying CPU resources.
      This loop would have to be replicated in every driver which wants to
      utilize this mechanism. That's unwanted code duplication and error
      prone.
      
      In order to move this into generic facilities it is required to have a
      mechanism which allows the recalculation of the interrupt sets and
      their size in the core code. As the core code does not have any
      knowledge about the underlying device, a driver specific callback will
      be added to struct irq_affinity, which will be invoked by the core
      code. The callback will get the number of available interrupts as an
      argument, so the driver can calculate the corresponding number and size
      of interrupt sets.
      
      To support this, two modifications for the handling of struct irq_affinity
      are required:
      
      1) The (optional) interrupt sets size information is contained in a
         separate array of integers and struct irq_affinity contains a
         pointer to it.
      
         This is cumbersome, and as the maximum number of interrupt sets is
         small, there is no reason to have separate storage. Moving the size
         array into struct irq_affinity avoids indirections and makes the
         code simpler.
      
      2) At the moment the struct irq_affinity pointer which is handed in from
         the driver and passed through to several core functions is marked
         'const'.
      
         With the upcoming callback to recalculate the number and size of
         interrupt sets, it's necessary to remove the 'const'
         qualifier. Otherwise the callback would not be able to update the data.
      
      Implement #1 and store the interrupt sets size in 'struct irq_affinity'.
      
      No functional change.
      
      [ tglx: Fixed the memcpy() size so it won't copy beyond the size of the
        	source. Fixed the kernel doc comments for struct irq_affinity and
        	de-'This patch'-ed the changelog ]
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Marc Zyngier <marc.zyngier@arm.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Bjorn Helgaas <helgaas@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: linux-block@vger.kernel.org
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Cc: linux-nvme@lists.infradead.org
      Cc: linux-pci@vger.kernel.org
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Sumit Saxena <sumit.saxena@broadcom.com>
      Cc: Kashyap Desai <kashyap.desai@broadcom.com>
      Cc: Shivasharan Srikanteshwara <shivasharan.srikanteshwara@broadcom.com>
      Link: https://lkml.kernel.org/r/20190216172228.423723127@linutronix.de
      
      9cfef55b
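
      A sketch of modification #1, with field types and the exact "before"
      layout abbreviated as assumptions; IRQ_AFFINITY_MAX_SETS is the small
      fixed maximum mentioned above:

        /* Before: the sizes live behind a driver-owned pointer. */
        struct irq_affinity {
                unsigned int    pre_vectors;
                unsigned int    post_vectors;
                unsigned int    nr_sets;
                unsigned int    *sets;
        };

        /* After: the small fixed-maximum array is embedded, removing the
         * indirection. */
        struct irq_affinity {
                unsigned int    pre_vectors;
                unsigned int    post_vectors;
                unsigned int    nr_sets;
                unsigned int    set_size[IRQ_AFFINITY_MAX_SETS];
        };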