1. 03 10月, 2018 1 次提交
  2. 28 8月, 2018 1 次提交
    • M
      nvme-pci: add a memory barrier to nvme_dbbuf_update_and_check_event · f1ed3df2
      Michal Wnukowski 提交于
      In many architectures loads may be reordered with older stores to
      different locations.  In the nvme driver the following two operations
      could be reordered:
      
       - Write shadow doorbell (dbbuf_db) into memory.
       - Read EventIdx (dbbuf_ei) from memory.
      
      This can result in a potential race condition between driver and VM host
      processing requests (if given virtual NVMe controller has a support for
      shadow doorbell).  If that occurs, then the NVMe controller may decide to
      wait for MMIO doorbell from guest operating system, and guest driver may
      decide not to issue MMIO doorbell on any of subsequent commands.
      
      This issue is purely timing-dependent one, so there is no easy way to
      reproduce it. Currently the easiest known approach is to run "Oracle IO
      Numbers" (orion) that is shipped with Oracle DB:
      
      orion -run advanced -num_large 0 -size_small 8 -type rand -simulate \
      	concat -write 40 -duration 120 -matrix row -testname nvme_test
      
      Where nvme_test is a .lun file that contains a list of NVMe block
      devices to run test against. Limiting number of vCPUs assigned to given
      VM instance seems to increase chances for this bug to occur. On test
      environment with VM that got 4 NVMe drives and 1 vCPU assigned the
      virtual NVMe controller hang could be observed within 10-20 minutes.
      That correspond to about 400-500k IO operations processed (or about
      100GB of IO read/writes).
      
      Orion tool was used as a validation and set to run in a loop for 36
      hours (equivalent of pushing 550M IO operations). No issues were
      observed. That suggest that the patch fixes the issue.
      
      Fixes: f9f38e33 ("nvme: improve performance for virtual NVMe devices")
      Signed-off-by: NMichal Wnukowski <wnukowski@google.com>
      Reviewed-by: NKeith Busch <keith.busch@intel.com>
      Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
      [hch: updated changelog and comment a bit]
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      f1ed3df2
  3. 30 7月, 2018 1 次提交
  4. 23 7月, 2018 1 次提交
  5. 12 7月, 2018 1 次提交
    • K
      nvme-pci: fix memory leak on probe failure · b6e44b4c
      Keith Busch 提交于
      The nvme driver specific structures need to be initialized prior to
      enabling the generic controller so we can unwind on failure with out
      using the reference counting callbacks so that 'probe' and 'remove'
      can be symmetric.
      
      The newly added iod_mempool is the only resource that was being
      allocated out of order, and a failure there would leak the generic
      controller memory. This patch just moves that allocation above the
      controller initialization.
      
      Fixes: 943e942e ("nvme-pci: limit max IO size and segments to avoid high order allocations")
      Reported-by: NWeiping Zhang <zwp10758@gmail.com>
      Signed-off-by: NKeith Busch <keith.busch@intel.com>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      b6e44b4c
  6. 22 6月, 2018 1 次提交
    • J
      nvme-pci: limit max IO size and segments to avoid high order allocations · 943e942e
      Jens Axboe 提交于
      nvme requires an sg table allocation for each request. If the request
      is large, then the allocation can become quite large. For instance,
      with our default software settings of 1280KB IO size, we'll need
      10248 bytes of sg table. That turns into a 2nd order allocation,
      which we can't always guarantee. If we fail the allocation, blk-mq
      will retry it later. But there's no guarantee that we'll EVER be
      able to allocate that much contigious memory.
      
      Limit the IO size such that we never need more than a single page
      of memory. That's a lot faster and more reliable. Then back that
      allocation with a mempool, so that we know we'll always be able
      to succeed the allocation at some point.
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Acked-by: NKeith Busch <keith.busch@intel.com>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      943e942e
  7. 21 6月, 2018 1 次提交
  8. 09 6月, 2018 6 次提交
  9. 30 5月, 2018 2 次提交
  10. 29 5月, 2018 1 次提交
  11. 25 5月, 2018 2 次提交
  12. 21 5月, 2018 1 次提交
    • J
      nvme-pci: fix race between poll and IRQ completions · 68fa9dbe
      Jens Axboe 提交于
      If polling completions are racing with the IRQ triggered by a
      completion, the IRQ handler will find no work and return IRQ_NONE.
      This can trigger complaints about spurious interrupts:
      
      [  560.169153] irq 630: nobody cared (try booting with the "irqpoll" option)
      [  560.175988] CPU: 40 PID: 0 Comm: swapper/40 Not tainted 4.17.0-rc2+ #65
      [  560.175990] Hardware name: Intel Corporation S2600STB/S2600STB, BIOS SE5C620.86B.00.01.0010.010920180151 01/09/2018
      [  560.175991] Call Trace:
      [  560.175994]  <IRQ>
      [  560.176005]  dump_stack+0x5c/0x7b
      [  560.176010]  __report_bad_irq+0x30/0xc0
      [  560.176013]  note_interrupt+0x235/0x280
      [  560.176020]  handle_irq_event_percpu+0x51/0x70
      [  560.176023]  handle_irq_event+0x27/0x50
      [  560.176026]  handle_edge_irq+0x6d/0x180
      [  560.176031]  handle_irq+0xa5/0x110
      [  560.176036]  do_IRQ+0x41/0xc0
      [  560.176042]  common_interrupt+0xf/0xf
      [  560.176043]  </IRQ>
      [  560.176050] RIP: 0010:cpuidle_enter_state+0x9b/0x2b0
      [  560.176052] RSP: 0018:ffffa0ed4659fe98 EFLAGS: 00000246 ORIG_RAX: ffffffffffffffdd
      [  560.176055] RAX: ffff9527beb20a80 RBX: 000000826caee491 RCX: 000000000000001f
      [  560.176056] RDX: 000000826caee491 RSI: 00000000335206ee RDI: 0000000000000000
      [  560.176057] RBP: 0000000000000001 R08: 00000000ffffffff R09: 0000000000000008
      [  560.176059] R10: ffffa0ed4659fe78 R11: 0000000000000001 R12: ffff9527beb29358
      [  560.176060] R13: ffffffffa235d4b8 R14: 0000000000000000 R15: 000000826caed593
      [  560.176065]  ? cpuidle_enter_state+0x8b/0x2b0
      [  560.176071]  do_idle+0x1f4/0x260
      [  560.176075]  cpu_startup_entry+0x6f/0x80
      [  560.176080]  start_secondary+0x184/0x1d0
      [  560.176085]  secondary_startup_64+0xa5/0xb0
      [  560.176088] handlers:
      [  560.178387] [<00000000efb612be>] nvme_irq [nvme]
      [  560.183019] Disabling IRQ #630
      
      A previous commit removed ->cqe_seen that was handling this case,
      but we need to handle this a bit differently due to completions
      now running outside the queue lock. Return IRQ_HANDLED from the
      IRQ handler, if the completion ring head was moved since we last
      saw it.
      
      Fixes: 5cb525c8 ("nvme-pci: handle completions outside of the queue lock")
      Reported-by: NKeith Busch <keith.busch@intel.com>
      Reviewed-by: NKeith Busch <keith.busch@intel.com>
      Tested-by: NKeith Busch <keith.busch@intel.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      68fa9dbe
  13. 19 5月, 2018 6 次提交
  14. 12 5月, 2018 2 次提交
  15. 07 5月, 2018 1 次提交
  16. 02 5月, 2018 1 次提交
  17. 27 4月, 2018 2 次提交
  18. 25 4月, 2018 1 次提交
  19. 12 4月, 2018 3 次提交
  20. 28 3月, 2018 1 次提交
  21. 26 3月, 2018 3 次提交
  22. 02 3月, 2018 1 次提交
    • M
      nvme: pci: pass max vectors as num_possible_cpus() to pci_alloc_irq_vectors · 16ccfff2
      Ming Lei 提交于
      84676c1f ("genirq/affinity: assign vectors to all possible CPUs")
      has switched to do irq vectors spread among all possible CPUs, so
      pass num_possible_cpus() as max vecotrs to be assigned.
      
      For example, in a 8 cores system, 0~3 online, 4~8 offline/not present,
      see 'lscpu':
      
              [ming@box]$lscpu
              Architecture:          x86_64
              CPU op-mode(s):        32-bit, 64-bit
              Byte Order:            Little Endian
              CPU(s):                4
              On-line CPU(s) list:   0-3
              Thread(s) per core:    1
              Core(s) per socket:    2
              Socket(s):             2
              NUMA node(s):          2
              ...
              NUMA node0 CPU(s):     0-3
              NUMA node1 CPU(s):
              ...
      
      1) before this patch, follows the allocated vectors and their affinity:
      	irq 47, cpu list 0,4
      	irq 48, cpu list 1,6
      	irq 49, cpu list 2,5
      	irq 50, cpu list 3,7
      
      2) after this patch, follows the allocated vectors and their affinity:
      	irq 43, cpu list 0
      	irq 44, cpu list 1
      	irq 45, cpu list 2
      	irq 46, cpu list 3
      	irq 47, cpu list 4
      	irq 48, cpu list 6
      	irq 49, cpu list 5
      	irq 50, cpu list 7
      
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: Sagi Grimberg <sagi@grimberg.me>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Christoph Hellwig <hch@lst.de>
      Signed-off-by: NMing Lei <ming.lei@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NKeith Busch <keith.busch@intel.com>
      16ccfff2