1. 07 4月, 2017 4 次提交
    • O
      blk-mq-sched: fix crash in switch error path · 54d5329d
      Omar Sandoval 提交于
      In elevator_switch(), if blk_mq_init_sched() fails, we attempt to fall
      back to the original scheduler. However, at this point, we've already
      torn down the original scheduler's tags, so this causes a crash. Doing
      the fallback like the legacy elevator path is much harder for mq, so fix
      it by just falling back to none, instead.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      54d5329d
    • O
      blk-mq-sched: set up scheduler tags when bringing up new queues · 93252632
      Omar Sandoval 提交于
      If a new hardware queue is added at runtime, we don't allocate scheduler
      tags for it, leading to a crash. This hooks up the scheduler framework
      to blk_mq_{init,exit}_hctx() to make sure everything gets properly
      initialized/freed.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      93252632
    • O
      blk-mq-sched: refactor scheduler initialization · 6917ff0b
      Omar Sandoval 提交于
      Preparation cleanup for the next couple of fixes, push
      blk_mq_sched_setup() and e->ops.mq.init_sched() into a helper.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      6917ff0b
    • O
      blk-mq: use the right hctx when getting a driver tag fails · 81380ca1
      Omar Sandoval 提交于
      While dispatching requests, if we fail to get a driver tag, we mark the
      hardware queue as waiting for a tag and put the requests on a
      hctx->dispatch list to be run later when a driver tag is freed. However,
      blk_mq_dispatch_rq_list() may dispatch requests from multiple hardware
      queues if using a single-queue scheduler with a multiqueue device. If
      blk_mq_get_driver_tag() fails, it doesn't update the hardware queue we
      are processing. This means we end up using the hardware queue of the
      previous request, which may or may not be the same as that of the
      current request. If it isn't, the wrong hardware queue will end up
      waiting for a tag, and the requests will be on the wrong dispatch list,
      leading to a hang.
      
      The fix is twofold:
      
      1. Make sure we save which hardware queue we were trying to get a
         request for in blk_mq_get_driver_tag() regardless of whether it
         succeeds or not.
      2. Make blk_mq_dispatch_rq_list() take a request_queue instead of a
         blk_mq_hw_queue to make it clear that it must handle multiple
         hardware queues, since I've already messed this up on a couple of
         occasions.
      
      This didn't appear in testing with nvme and mq-deadline because nvme has
      more driver tags than the default number of scheduler tags. However,
      with the blk_mq_update_nr_hw_queues() fix, it showed up with nbd.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      81380ca1
  2. 30 3月, 2017 1 次提交
  3. 25 3月, 2017 1 次提交
  4. 22 3月, 2017 1 次提交
    • M
      blk-mq: don't complete un-started request in timeout handler · 95a49603
      Ming Lei 提交于
      When iterating busy requests in timeout handler,
      if the STARTED flag of one request isn't set, that means
      the request is being processed in block layer or driver, and
      isn't submitted to hardware yet.
      
      In current implementation of blk_mq_check_expired(),
      if the request queue becomes dying, un-started requests are
      handled as being completed/freed immediately. This way is
      wrong, and can cause rq corruption or double allocation[1][2],
      when doing I/O and removing&resetting NVMe device at the sametime.
      
      This patch fixes several issues reported by Yi Zhang.
      
      [1]. oops log 1
      [  581.789754] ------------[ cut here ]------------
      [  581.789758] kernel BUG at block/blk-mq.c:374!
      [  581.789760] invalid opcode: 0000 [#1] SMP
      [  581.789761] Modules linked in: vfat fat ipmi_ssif intel_rapl sb_edac
      edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm nvme
      irqbypass crct10dif_pclmul nvme_core crc32_pclmul ghash_clmulni_intel
      intel_cstate ipmi_si mei_me ipmi_devintf intel_uncore sg ipmi_msghandler
      intel_rapl_perf iTCO_wdt mei iTCO_vendor_support mxm_wmi lpc_ich dcdbas shpchp
      pcspkr acpi_power_meter wmi nfsd auth_rpcgss nfs_acl lockd dm_multipath grace
      sunrpc ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper
      syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ahci libahci
      crc32c_intel tg3 libata megaraid_sas i2c_core ptp fjes pps_core dm_mirror
      dm_region_hash dm_log dm_mod
      [  581.789796] CPU: 1 PID: 1617 Comm: kworker/1:1H Not tainted 4.10.0.bz1420297+ #4
      [  581.789797] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.2.5 09/06/2016
      [  581.789804] Workqueue: kblockd blk_mq_timeout_work
      [  581.789806] task: ffff8804721c8000 task.stack: ffffc90006ee4000
      [  581.789809] RIP: 0010:blk_mq_end_request+0x58/0x70
      [  581.789810] RSP: 0018:ffffc90006ee7d50 EFLAGS: 00010202
      [  581.789811] RAX: 0000000000000001 RBX: ffff8802e4195340 RCX: ffff88028e2f4b88
      [  581.789812] RDX: 0000000000001000 RSI: 0000000000001000 RDI: 0000000000000000
      [  581.789813] RBP: ffffc90006ee7d60 R08: 0000000000000003 R09: ffff88028e2f4b00
      [  581.789814] R10: 0000000000001000 R11: 0000000000000001 R12: 00000000fffffffb
      [  581.789815] R13: ffff88042abe5780 R14: 000000000000002d R15: ffff88046fbdff80
      [  581.789817] FS:  0000000000000000(0000) GS:ffff88047fc00000(0000) knlGS:0000000000000000
      [  581.789818] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  581.789819] CR2: 00007f64f403a008 CR3: 000000014d078000 CR4: 00000000001406e0
      [  581.789820] Call Trace:
      [  581.789825]  blk_mq_check_expired+0x76/0x80
      [  581.789828]  bt_iter+0x45/0x50
      [  581.789830]  blk_mq_queue_tag_busy_iter+0xdd/0x1f0
      [  581.789832]  ? blk_mq_rq_timed_out+0x70/0x70
      [  581.789833]  ? blk_mq_rq_timed_out+0x70/0x70
      [  581.789840]  ? __switch_to+0x140/0x450
      [  581.789841]  blk_mq_timeout_work+0x88/0x170
      [  581.789845]  process_one_work+0x165/0x410
      [  581.789847]  worker_thread+0x137/0x4c0
      [  581.789851]  kthread+0x101/0x140
      [  581.789853]  ? rescuer_thread+0x3b0/0x3b0
      [  581.789855]  ? kthread_park+0x90/0x90
      [  581.789860]  ret_from_fork+0x2c/0x40
      [  581.789861] Code: 48 85 c0 74 0d 44 89 e6 48 89 df ff d0 5b 41 5c 5d c3 48
      8b bb 70 01 00 00 48 85 ff 75 0f 48 89 df e8 7d f0 ff ff 5b 41 5c 5d c3 <0f>
      0b e8 71 f0 ff ff 90 eb e9 0f 1f 40 00 66 2e 0f 1f 84 00 00
      [  581.789882] RIP: blk_mq_end_request+0x58/0x70 RSP: ffffc90006ee7d50
      [  581.789889] ---[ end trace bcaf03d9a14a0a70 ]---
      
      [2]. oops log2
      [ 6984.857362] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
      [ 6984.857372] IP: nvme_queue_rq+0x6e6/0x8cd [nvme]
      [ 6984.857373] PGD 0
      [ 6984.857374]
      [ 6984.857376] Oops: 0000 [#1] SMP
      [ 6984.857379] Modules linked in: ipmi_ssif vfat fat intel_rapl sb_edac
      edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm
      irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ipmi_si iTCO_wdt
      iTCO_vendor_support mxm_wmi ipmi_devintf intel_cstate sg dcdbas intel_uncore
      mei_me intel_rapl_perf mei pcspkr lpc_ich ipmi_msghandler shpchp
      acpi_power_meter wmi nfsd auth_rpcgss dm_multipath nfs_acl lockd grace sunrpc
      ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea
      sysfillrect crc32c_intel sysimgblt fb_sys_fops ttm nvme drm nvme_core ahci
      libahci i2c_core tg3 libata ptp megaraid_sas pps_core fjes dm_mirror
      dm_region_hash dm_log dm_mod
      [ 6984.857416] CPU: 7 PID: 1635 Comm: kworker/7:1H Not tainted
      4.10.0-2.el7.bz1420297.x86_64 #1
      [ 6984.857417] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.2.5 09/06/2016
      [ 6984.857427] Workqueue: kblockd blk_mq_run_work_fn
      [ 6984.857429] task: ffff880476e3da00 task.stack: ffffc90002e90000
      [ 6984.857432] RIP: 0010:nvme_queue_rq+0x6e6/0x8cd [nvme]
      [ 6984.857433] RSP: 0018:ffffc90002e93c50 EFLAGS: 00010246
      [ 6984.857434] RAX: 0000000000000000 RBX: ffff880275646600 RCX: 0000000000001000
      [ 6984.857435] RDX: 0000000000000fff RSI: 00000002fba2a000 RDI: ffff8804734e6950
      [ 6984.857436] RBP: ffffc90002e93d30 R08: 0000000000002000 R09: 0000000000001000
      [ 6984.857437] R10: 0000000000001000 R11: 0000000000000000 R12: ffff8804741d8000
      [ 6984.857438] R13: 0000000000000040 R14: ffff880475649f80 R15: ffff8804734e6780
      [ 6984.857439] FS:  0000000000000000(0000) GS:ffff88047fcc0000(0000) knlGS:0000000000000000
      [ 6984.857440] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [ 6984.857442] CR2: 0000000000000010 CR3: 0000000001c09000 CR4: 00000000001406e0
      [ 6984.857443] Call Trace:
      [ 6984.857451]  ? mempool_free+0x2b/0x80
      [ 6984.857455]  ? bio_free+0x4e/0x60
      [ 6984.857459]  blk_mq_dispatch_rq_list+0xf5/0x230
      [ 6984.857462]  blk_mq_process_rq_list+0x133/0x170
      [ 6984.857465]  __blk_mq_run_hw_queue+0x8c/0xa0
      [ 6984.857467]  blk_mq_run_work_fn+0x12/0x20
      [ 6984.857473]  process_one_work+0x165/0x410
      [ 6984.857475]  worker_thread+0x137/0x4c0
      [ 6984.857478]  kthread+0x101/0x140
      [ 6984.857480]  ? rescuer_thread+0x3b0/0x3b0
      [ 6984.857481]  ? kthread_park+0x90/0x90
      [ 6984.857489]  ret_from_fork+0x2c/0x40
      [ 6984.857490] Code: 8b bd 70 ff ff ff 89 95 50 ff ff ff 89 8d 58 ff ff ff 44
      89 95 60 ff ff ff e8 b7 dd 12 e1 8b 95 50 ff ff ff 48 89 85 68 ff ff ff <4c>
      8b 48 10 44 8b 58 18 8b 8d 58 ff ff ff 44 8b 95 60 ff ff ff
      [ 6984.857511] RIP: nvme_queue_rq+0x6e6/0x8cd [nvme] RSP: ffffc90002e93c50
      [ 6984.857512] CR2: 0000000000000010
      [ 6984.895359] ---[ end trace 2d7ceb528432bf83 ]---
      
      Cc: stable@vger.kernel.org
      Reported-by: NYi Zhang <yizhan@redhat.com>
      Tested-by: NYi Zhang <yizhan@redhat.com>
      Reviewed-by: NBart Van Assche <bart.vanassche@sandisk.com>
      Reviewed-by: NHannes Reinecke <hare@suse.com>
      Signed-off-by: NMing Lei <tom.leiming@gmail.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      95a49603
  5. 16 3月, 2017 1 次提交
  6. 15 3月, 2017 1 次提交
  7. 13 3月, 2017 1 次提交
  8. 12 3月, 2017 1 次提交
    • N
      blk: Ensure users for current->bio_list can see the full list. · f5fe1b51
      NeilBrown 提交于
      Commit 79bd9959 ("blk: improve order of bio handling in generic_make_request()")
      changed current->bio_list so that it did not contain *all* of the
      queued bios, but only those submitted by the currently running
      make_request_fn.
      
      There are two places which walk the list and requeue selected bios,
      and others that check if the list is empty.  These are no longer
      correct.
      
      So redefine current->bio_list to point to an array of two lists, which
      contain all queued bios, and adjust various code to test or walk both
      lists.
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Fixes: 79bd9959 ("blk: improve order of bio handling in generic_make_request()")
      Signed-off-by: NJens Axboe <axboe@fb.com>
      f5fe1b51
  9. 09 3月, 2017 8 次提交
    • N
      blk: improve order of bio handling in generic_make_request() · 79bd9959
      NeilBrown 提交于
      To avoid recursion on the kernel stack when stacked block devices
      are in use, generic_make_request() will, when called recursively,
      queue new requests for later handling.  They will be handled when the
      make_request_fn for the current bio completes.
      
      If any bios are submitted by a make_request_fn, these will ultimately
      be handled seqeuntially.  If the handling of one of those generates
      further requests, they will be added to the end of the queue.
      
      This strict first-in-first-out behaviour can lead to deadlocks in
      various ways, normally because a request might need to wait for a
      previous request to the same device to complete.  This can happen when
      they share a mempool, and can happen due to interdependencies
      particular to the device.  Both md and dm have examples where this happens.
      
      These deadlocks can be erradicated by more selective ordering of bios.
      Specifically by handling them in depth-first order.  That is: when the
      handling of one bio generates one or more further bios, they are
      handled immediately after the parent, before any siblings of the
      parent.  That way, when generic_make_request() calls make_request_fn
      for some particular device, we can be certain that all previously
      submited requests for that device have been completely handled and are
      not waiting for anything in the queue of requests maintained in
      generic_make_request().
      
      An easy way to achieve this would be to use a last-in-first-out stack
      instead of a queue.  However this will change the order of consecutive
      bios submitted by a make_request_fn, which could have unexpected consequences.
      Instead we take a slightly more complex approach.
      A fresh queue is created for each call to a make_request_fn.  After it completes,
      any bios for a different device are placed on the front of the main queue, followed
      by any bios for the same device, followed by all bios that were already on
      the queue before the make_request_fn was called.
      This provides the depth-first approach without reordering bios on the same level.
      
      This, by itself, it not enough to remove all deadlocks.  It just makes
      it possible for drivers to take the extra step required themselves.
      
      To avoid deadlocks, drivers must never risk waiting for a request
      after submitting one to generic_make_request.  This includes never
      allocing from a mempool twice in the one call to a make_request_fn.
      
      A common pattern in drivers is to call bio_split() in a loop, handling
      the first part and then looping around to possibly split the next part.
      Instead, a driver that finds it needs to split a bio should queue
      (with generic_make_request) the second part, handle the first part,
      and then return.  The new code in generic_make_request will ensure the
      requests to underlying bios are processed first, then the second bio
      that was split off.  If it splits again, the same process happens.  In
      each case one bio will be completely handled before the next one is attempted.
      
      With this is place, it should be possible to disable the
      punt_bios_to_recover() recovery thread for many block devices, and
      eventually it may be possible to remove it completely.
      
      Ref: http://www.spinics.net/lists/raid/msg54680.htmlTested-by: NJinpu Wang <jinpu.wang@profitbricks.com>
      Inspired-by: NLars Ellenberg <lars.ellenberg@linbit.com>
      Signed-off-by: NNeilBrown <neilb@suse.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      79bd9959
    • J
      Revert "scsi, block: fix duplicate bdi name registration crashes" · c01228db
      Jan Kara 提交于
      This reverts commit 0dba1314. It causes
      leaking of device numbers for SCSI when SCSI registers multiple gendisks
      for one request_queue in succession. It can be easily reproduced using
      Omar's script [1] on kernel with CONFIG_DEBUG_TEST_DRIVER_REMOVE.
      Furthermore the protection provided by this commit is not needed anymore
      as the problem it was fixing got also fixed by commit 165a5e22
      "block: Move bdi_unregister() to del_gendisk()".
      
      [1]: http://marc.info/?l=linux-block&m=148554717109098&w=2Signed-off-by: NJan Kara <jack@suse.cz>
      Acked-by: NDan Williams <dan.j.williams@intel.com>
      Tested-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      c01228db
    • J
      block: Make del_gendisk() safer for disks without queues · 90f16fdd
      Jan Kara 提交于
      Commit 165a5e22 "block: Move bdi_unregister() to del_gendisk()"
      added disk->queue dereference to del_gendisk(). Although del_gendisk()
      is not supposed to be called without disk->queue valid and
      blk_unregister_queue() warns in that case, this change will make it oops
      instead. Return to the old more robust behavior of just warning when
      del_gendisk() gets called for gendisk with disk->queue being NULL.
      Reported-by: NDan Carpenter <dan.carpenter@oracle.com>
      Signed-off-by: NJan Kara <jack@suse.cz>
      Tested-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      90f16fdd
    • J
      block/sed: Fix opal user range check and unused variables · b0bfdfc2
      Jon Derrick 提交于
      Fixes check that the opal user is within the range, and cleans up unused
      method variables.
      Signed-off-by: NJon Derrick <jonathan.derrick@intel.com>
      Reviewed-by: NScott Bauer <scott.bauer@intel.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      b0bfdfc2
    • M
      blk-mq: free hctx->cpumask in release handler of hctx's kobject · 01388df3
      Ming Lei 提交于
      It is obviously that hctx->cpumask is per hctx, and both
      share same lifetime, so this patch moves freeing of hctx->cpumask
      into release handler of hctx's kobject.
      Signed-off-by: NMing Lei <tom.leiming@gmail.com>
      Tested-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      01388df3
    • M
      blk-mq: make lifetime consistent between hctx and its kobject · 6c8b232e
      Ming Lei 提交于
      This patch removes kobject_put() over hctx in __blk_mq_unregister_dev(),
      and trys to keep lifetime consistent between hctx and hctx's kobject.
      
      Now blk_mq_sysfs_register() and blk_mq_sysfs_unregister() become
      totally symmetrical, and kobject's refcounter drops to zero just
      when the hctx is freed.
      Signed-off-by: NMing Lei <tom.leiming@gmail.com>
      Tested-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      6c8b232e
    • M
      blk-mq: make lifetime consitent between q/ctx and its kobject · 7ea5fe31
      Ming Lei 提交于
      Currently from kobject view, both q->mq_kobj and ctx->kobj can
      be released during one cycle of blk_mq_register_dev() and
      blk_mq_unregister_dev(). Actually, sw queue's lifetime is
      same with its request queue's, which is covered by request_queue->kobj.
      
      So we don't need to call kobject_put() for the two kinds of
      kobject in __blk_mq_unregister_dev(), instead we do that
      in release handler of request queue.
      Signed-off-by: NMing Lei <tom.leiming@gmail.com>
      Tested-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      7ea5fe31
    • M
      blk-mq: initialize mq kobjects in blk_mq_init_allocated_queue() · 737f98cf
      Ming Lei 提交于
      Both q->mq_kobj and sw queues' kobjects should have been initialized
      once, instead of doing that each add_disk context.
      
      Also this patch removes clearing of ctx in blk_mq_init_cpu_queues()
      because percpu allocator fills zero to allocated variable.
      
      This patch fixes one issue[1] reported from Omar.
      
      [1] kernel wearning when doing unbind/bind on one scsi-mq device
      
      [   19.347924] kobject (ffff8800791ea0b8): tried to init an initialized object, something is seriously wrong.
      [   19.349781] CPU: 1 PID: 84 Comm: kworker/u8:1 Not tainted 4.10.0-rc7-00210-g53f39eeaa263 #34
      [   19.350686] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-20161122_114906-anatol 04/01/2014
      [   19.350920] Workqueue: events_unbound async_run_entry_fn
      [   19.350920] Call Trace:
      [   19.350920]  dump_stack+0x63/0x83
      [   19.350920]  kobject_init+0x77/0x90
      [   19.350920]  blk_mq_register_dev+0x40/0x130
      [   19.350920]  blk_register_queue+0xb6/0x190
      [   19.350920]  device_add_disk+0x1ec/0x4b0
      [   19.350920]  sd_probe_async+0x10d/0x1c0 [sd_mod]
      [   19.350920]  async_run_entry_fn+0x48/0x150
      [   19.350920]  process_one_work+0x1d0/0x480
      [   19.350920]  worker_thread+0x48/0x4e0
      [   19.350920]  kthread+0x101/0x140
      [   19.350920]  ? process_one_work+0x480/0x480
      [   19.350920]  ? kthread_create_on_node+0x60/0x60
      [   19.350920]  ret_from_fork+0x2c/0x40
      
      Cc: Omar Sandoval <osandov@osandov.com>
      Signed-off-by: NMing Lei <tom.leiming@gmail.com>
      Tested-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      737f98cf
  10. 03 3月, 2017 3 次提交
  11. 02 3月, 2017 14 次提交
  12. 28 2月, 2017 3 次提交
  13. 24 2月, 2017 1 次提交
    • O
      blk-mq-sched: separate mark hctx and queue restart operations · d38d3515
      Omar Sandoval 提交于
      In blk_mq_sched_dispatch_requests(), we call blk_mq_sched_mark_restart()
      after we dispatch requests left over on our hardware queue dispatch
      list. This is so we'll go back and dispatch requests from the scheduler.
      In this case, it's only necessary to restart the hardware queue that we
      are running; there's no reason to run other hardware queues just because
      we are using shared tags.
      
      So, split out blk_mq_sched_mark_restart() into two operations, one for
      just the hardware queue and one for the whole request queue. The core
      code only needs the hctx variant, but I/O schedulers will want to use
      both.
      
      This also requires adjusting blk_mq_sched_restart_queues() to always
      check the queue restart flag, not just when using shared tags.
      Signed-off-by: NOmar Sandoval <osandov@fb.com>
      Signed-off-by: NJens Axboe <axboe@fb.com>
      d38d3515