1. 07 4月, 2017 3 次提交
    • O
      blk-mq-sched: fix crash in switch error path · 54d5329d
      Omar Sandoval 提交于
      In elevator_switch(), if blk_mq_init_sched() fails, we attempt to fall
      back to the original scheduler. However, at this point, we've already
      torn down the original scheduler's tags, so this causes a crash. Doing
      the fallback like the legacy elevator path is much harder for mq, so fix
      it by just falling back to none, instead.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      54d5329d
    • O
      blk-mq-sched: set up scheduler tags when bringing up new queues · 93252632
      Omar Sandoval 提交于
      If a new hardware queue is added at runtime, we don't allocate scheduler
      tags for it, leading to a crash. This hooks up the scheduler framework
      to blk_mq_{init,exit}_hctx() to make sure everything gets properly
      initialized/freed.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      93252632
    • O
      blk-mq: use the right hctx when getting a driver tag fails · 81380ca1
      Omar Sandoval 提交于
      While dispatching requests, if we fail to get a driver tag, we mark the
      hardware queue as waiting for a tag and put the requests on a
      hctx->dispatch list to be run later when a driver tag is freed. However,
      blk_mq_dispatch_rq_list() may dispatch requests from multiple hardware
      queues if using a single-queue scheduler with a multiqueue device. If
      blk_mq_get_driver_tag() fails, it doesn't update the hardware queue we
      are processing. This means we end up using the hardware queue of the
      previous request, which may or may not be the same as that of the
      current request. If it isn't, the wrong hardware queue will end up
      waiting for a tag, and the requests will be on the wrong dispatch list,
      leading to a hang.
      
      The fix is twofold:
      
      1. Make sure we save which hardware queue we were trying to get a
         request for in blk_mq_get_driver_tag() regardless of whether it
         succeeds or not.
      2. Make blk_mq_dispatch_rq_list() take a request_queue instead of a
         blk_mq_hw_queue to make it clear that it must handle multiple
         hardware queues, since I've already messed this up on a couple of
         occasions.
      
      This didn't appear in testing with nvme and mq-deadline because nvme has
      more driver tags than the default number of scheduler tags. However,
      with the blk_mq_update_nr_hw_queues() fix, it showed up with nbd.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      81380ca1
  2. 30 3月, 2017 1 次提交
  3. 25 3月, 2017 1 次提交
  4. 22 3月, 2017 1 次提交
    • M
      blk-mq: don't complete un-started request in timeout handler · 95a49603
      Ming Lei 提交于
      When iterating busy requests in timeout handler,
      if the STARTED flag of one request isn't set, that means
      the request is being processed in block layer or driver, and
      isn't submitted to hardware yet.
      
      In current implementation of blk_mq_check_expired(),
      if the request queue becomes dying, un-started requests are
      handled as being completed/freed immediately. This way is
      wrong, and can cause rq corruption or double allocation[1][2],
      when doing I/O and removing&resetting NVMe device at the sametime.
      
      This patch fixes several issues reported by Yi Zhang.
      
      [1]. oops log 1
      [  581.789754] ------------[ cut here ]------------
      [  581.789758] kernel BUG at block/blk-mq.c:374!
      [  581.789760] invalid opcode: 0000 [#1] SMP
      [  581.789761] Modules linked in: vfat fat ipmi_ssif intel_rapl sb_edac
      edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm nvme
      irqbypass crct10dif_pclmul nvme_core crc32_pclmul ghash_clmulni_intel
      intel_cstate ipmi_si mei_me ipmi_devintf intel_uncore sg ipmi_msghandler
      intel_rapl_perf iTCO_wdt mei iTCO_vendor_support mxm_wmi lpc_ich dcdbas shpchp
      pcspkr acpi_power_meter wmi nfsd auth_rpcgss nfs_acl lockd dm_multipath grace
      sunrpc ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper
      syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ahci libahci
      crc32c_intel tg3 libata megaraid_sas i2c_core ptp fjes pps_core dm_mirror
      dm_region_hash dm_log dm_mod
      [  581.789796] CPU: 1 PID: 1617 Comm: kworker/1:1H Not tainted 4.10.0.bz1420297+ #4
      [  581.789797] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.2.5 09/06/2016
      [  581.789804] Workqueue: kblockd blk_mq_timeout_work
      [  581.789806] task: ffff8804721c8000 task.stack: ffffc90006ee4000
      [  581.789809] RIP: 0010:blk_mq_end_request+0x58/0x70
      [  581.789810] RSP: 0018:ffffc90006ee7d50 EFLAGS: 00010202
      [  581.789811] RAX: 0000000000000001 RBX: ffff8802e4195340 RCX: ffff88028e2f4b88
      [  581.789812] RDX: 0000000000001000 RSI: 0000000000001000 RDI: 0000000000000000
      [  581.789813] RBP: ffffc90006ee7d60 R08: 0000000000000003 R09: ffff88028e2f4b00
      [  581.789814] R10: 0000000000001000 R11: 0000000000000001 R12: 00000000fffffffb
      [  581.789815] R13: ffff88042abe5780 R14: 000000000000002d R15: ffff88046fbdff80
      [  581.789817] FS:  0000000000000000(0000) GS:ffff88047fc00000(0000) knlGS:0000000000000000
      [  581.789818] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  581.789819] CR2: 00007f64f403a008 CR3: 000000014d078000 CR4: 00000000001406e0
      [  581.789820] Call Trace:
      [  581.789825]  blk_mq_check_expired+0x76/0x80
      [  581.789828]  bt_iter+0x45/0x50
      [  581.789830]  blk_mq_queue_tag_busy_iter+0xdd/0x1f0
      [  581.789832]  ? blk_mq_rq_timed_out+0x70/0x70
      [  581.789833]  ? blk_mq_rq_timed_out+0x70/0x70
      [  581.789840]  ? __switch_to+0x140/0x450
      [  581.789841]  blk_mq_timeout_work+0x88/0x170
      [  581.789845]  process_one_work+0x165/0x410
      [  581.789847]  worker_thread+0x137/0x4c0
      [  581.789851]  kthread+0x101/0x140
      [  581.789853]  ? rescuer_thread+0x3b0/0x3b0
      [  581.789855]  ? kthread_park+0x90/0x90
      [  581.789860]  ret_from_fork+0x2c/0x40
      [  581.789861] Code: 48 85 c0 74 0d 44 89 e6 48 89 df ff d0 5b 41 5c 5d c3 48
      8b bb 70 01 00 00 48 85 ff 75 0f 48 89 df e8 7d f0 ff ff 5b 41 5c 5d c3 <0f>
      0b e8 71 f0 ff ff 90 eb e9 0f 1f 40 00 66 2e 0f 1f 84 00 00
      [  581.789882] RIP: blk_mq_end_request+0x58/0x70 RSP: ffffc90006ee7d50
      [  581.789889] ---[ end trace bcaf03d9a14a0a70 ]---
      
      [2]. oops log2
      [ 6984.857362] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
      [ 6984.857372] IP: nvme_queue_rq+0x6e6/0x8cd [nvme]
      [ 6984.857373] PGD 0
      [ 6984.857374]
      [ 6984.857376] Oops: 0000 [#1] SMP
      [ 6984.857379] Modules linked in: ipmi_ssif vfat fat intel_rapl sb_edac
      edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm
      irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ipmi_si iTCO_wdt
      iTCO_vendor_support mxm_wmi ipmi_devintf intel_cstate sg dcdbas intel_uncore
      mei_me intel_rapl_perf mei pcspkr lpc_ich ipmi_msghandler shpchp
      acpi_power_meter wmi nfsd auth_rpcgss dm_multipath nfs_acl lockd grace sunrpc
      ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea
      sysfillrect crc32c_intel sysimgblt fb_sys_fops ttm nvme drm nvme_core ahci
      libahci i2c_core tg3 libata ptp megaraid_sas pps_core fjes dm_mirror
      dm_region_hash dm_log dm_mod
      [ 6984.857416] CPU: 7 PID: 1635 Comm: kworker/7:1H Not tainted
      4.10.0-2.el7.bz1420297.x86_64 #1
      [ 6984.857417] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.2.5 09/06/2016
      [ 6984.857427] Workqueue: kblockd blk_mq_run_work_fn
      [ 6984.857429] task: ffff880476e3da00 task.stack: ffffc90002e90000
      [ 6984.857432] RIP: 0010:nvme_queue_rq+0x6e6/0x8cd [nvme]
      [ 6984.857433] RSP: 0018:ffffc90002e93c50 EFLAGS: 00010246
      [ 6984.857434] RAX: 0000000000000000 RBX: ffff880275646600 RCX: 0000000000001000
      [ 6984.857435] RDX: 0000000000000fff RSI: 00000002fba2a000 RDI: ffff8804734e6950
      [ 6984.857436] RBP: ffffc90002e93d30 R08: 0000000000002000 R09: 0000000000001000
      [ 6984.857437] R10: 0000000000001000 R11: 0000000000000000 R12: ffff8804741d8000
      [ 6984.857438] R13: 0000000000000040 R14: ffff880475649f80 R15: ffff8804734e6780
      [ 6984.857439] FS:  0000000000000000(0000) GS:ffff88047fcc0000(0000) knlGS:0000000000000000
      [ 6984.857440] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 6984.857442] CR2: 0000000000000010 CR3: 0000000001c09000 CR4: 00000000001406e0
      [ 6984.857443] Call Trace:
      [ 6984.857451]  ? mempool_free+0x2b/0x80
      [ 6984.857455]  ? bio_free+0x4e/0x60
      [ 6984.857459]  blk_mq_dispatch_rq_list+0xf5/0x230
      [ 6984.857462]  blk_mq_process_rq_list+0x133/0x170
      [ 6984.857465]  __blk_mq_run_hw_queue+0x8c/0xa0
      [ 6984.857467]  blk_mq_run_work_fn+0x12/0x20
      [ 6984.857473]  process_one_work+0x165/0x410
      [ 6984.857475]  worker_thread+0x137/0x4c0
      [ 6984.857478]  kthread+0x101/0x140
      [ 6984.857480]  ? rescuer_thread+0x3b0/0x3b0
      [ 6984.857481]  ? kthread_park+0x90/0x90
      [ 6984.857489]  ret_from_fork+0x2c/0x40
      [ 6984.857490] Code: 8b bd 70 ff ff ff 89 95 50 ff ff ff 89 8d 58 ff ff ff 44
      89 95 60 ff ff ff e8 b7 dd 12 e1 8b 95 50 ff ff ff 48 89 85 68 ff ff ff <4c>
      8b 48 10 44 8b 58 18 8b 8d 58 ff ff ff 44 8b 95 60 ff ff ff
      [ 6984.857511] RIP: nvme_queue_rq+0x6e6/0x8cd [nvme] RSP: ffffc90002e93c50
      [ 6984.857512] CR2: 0000000000000010
      [ 6984.895359] ---[ end trace 2d7ceb528432bf83 ]---
      
      Cc: stable@vger.kernel.org
      Reported-by: NYi Zhang <yizhan@redhat.com>
      Tested-by: NYi Zhang <yizhan@redhat.com>
      Reviewed-by: NBart Van Assche <bart.vanassche@sandisk.com>
      Reviewed-by: NHannes Reinecke <hare@suse.com>
      Signed-off-by: NMing Lei <tom.leiming@gmail.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      95a49603
  5. 15 3月, 2017 1 次提交
  6. 09 3月, 2017 4 次提交
  7. 03 3月, 2017 1 次提交
    • J
      blk-mq: ensure that bd->last is always set correctly · 113285b4
      Jens Axboe 提交于
      When drivers are called with a request in blk-mq, blk-mq flags the
      state such that the driver knows if this is the last request in
      this call chain or not. The driver can then use that information
      to defer kicking off IO until bd->last is true. However, with blk-mq
      and scheduling, we need to allocate a driver tag for a request before
      it can be issued. If we fail to allocate such a tag, we could end up
      in the situation where the last request issued did not have
      bd->last == true set. This can then cause a driver hang.
      
      This fixes a hang with virtio-blk, which uses bd->last as a hint
      on whether to kick the queue or not.
      Reported-by: NChris Mason <clm@fb.com>
      Tested-by: NChris Mason <clm@fb.com>
      Reviewed-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      113285b4
  8. 02 3月, 2017 8 次提交
  9. 24 2月, 2017 1 次提交
    • O
      blk-mq: use sbq wait queues instead of restart for driver tags · da55f2cc
      Omar Sandoval 提交于
      Commit 50e1dab8 ("blk-mq-sched: fix starvation for multiple hardware
      queues and shared tags") fixed one starvation issue for shared tags.
      However, we can still get into a situation where we fail to allocate a
      tag because all tags are allocated but we don't have any pending
      requests on any hardware queue.
      
      One solution for this would be to restart all queues that share a tag
      map, but that really sucks. Ideally, we could just block and wait for a
      tag, but that isn't always possible from blk_mq_dispatch_rq_list().
      
      However, we can still use the struct sbitmap_queue wait queues with a
      custom callback instead of blocking. This has a few benefits:
      
      1. It avoids iterating over all hardware queues when completing an I/O,
         which the current restart code has to do.
      2. It benefits from the existing rolling wakeup code.
      3. It avoids punting to another thread just to have it block.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      da55f2cc
  10. 18 2月, 2017 2 次提交
  11. 11 2月, 2017 1 次提交
  12. 09 2月, 2017 2 次提交
  13. 03 2月, 2017 1 次提交
  14. 01 2月, 2017 1 次提交
    • J
      blk-mq: don't fail allocating driver tag for stopped hw queue · 12d70958
      Jens Axboe 提交于
      We rely on blk_mq_get_driver_tag() not failing if 'wait' is true,
      but it currently fails in that case if the queue happens to be
      stopped at the time of the call.
      
      We don't need to check for stopped here, it's just assigning
      the tag. If the queue is stopped, we'll handle it when
      attempting to run the queue.
      
      This fixes a stall/crash on flush intensive workloads, where
      we proceed to process a flush that doesn't have a valid tag
      assigned.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      12d70958
  15. 28 1月, 2017 3 次提交
  16. 27 1月, 2017 7 次提交
  17. 25 1月, 2017 1 次提交
  18. 21 1月, 2017 1 次提交
    • J
      blk-mq: allow resize of scheduler requests · 70f36b60
      Jens Axboe 提交于
      Add support for growing the tags associated with a hardware queue, for
      the scheduler tags. Currently we only support resizing within the
      limits of the original depth, change that so we can grow it as well by
      allocating and replacing the existing scheduler tag set.
      
      This is similar to how we could increase the software queue depth with
      the legacy IO stack and schedulers.
      Signed-off-by: NJens Axboe <axboe@fb.com>
      Reviewed-by: NOmar Sandoval <osandov@fb.com>
      70f36b60