- 08 4月, 2017 5 次提交
-
-
由 Bart Van Assche 提交于
Since the next patch in this series will use RCU to iterate over tag_list, make this safe. Add lockdep_assert_held() statements in functions that iterate over tag_list to make clear that using list_for_each_entry() instead of list_for_each_entry_rcu() is fine in these functions. Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Omar Sandoval 提交于
Trivial cleanup. Signed-off-by: NOmar Sandoval <osandov@fb.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Omar Sandoval 提交于
Minor cleanup that makes it easier to figure out what's going on in the driver tag allocation failure path of blk_mq_dispatch_rq_list(). Signed-off-by: NOmar Sandoval <osandov@fb.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Bart Van Assche 提交于
To improve scalability, if hardware queues are shared, restart a single hardware queue in round-robin fashion. Rename blk_mq_sched_restart_queues() to reflect the new semantics. Remove blk_mq_sched_mark_restart_queue() because this function has no callers. Remove flag QUEUE_FLAG_RESTART because this patch removes the code that uses this flag. Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Bart Van Assche 提交于
Introduce a function that runs a hardware queue unconditionally after a delay. Note: there is already a function that stops and restarts a hardware queue after a delay, namely blk_mq_delay_queue(). This function will be used in the next patch in this series. Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Hannes Reinecke <hare@suse.de> Cc: Long Li <longli@microsoft.com> Cc: K. Y. Srinivasan <kys@microsoft.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 07 4月, 2017 4 次提交
-
-
由 Omar Sandoval 提交于
blk_mq_update_nr_hw_queues() used to remap hardware queues, which is the behavior that drivers expect. However, commit 4e68a011 changed blk_mq_queue_reinit() to not remap queues for the case of CPU hotplugging, inadvertently making blk_mq_update_nr_hw_queues() not remap queues as well. This breaks, for example, NBD's multi-connection mode, leaving the added hardware queues unused. Fix it by making blk_mq_update_nr_hw_queues() explicitly remap the queues. Fixes: 4e68a011 ("blk-mq: don't redistribute hardware queues on a CPU hotplug event") Reviewed-by: NKeith Busch <keith.busch@intel.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NOmar Sandoval <osandov@fb.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Omar Sandoval 提交于
In elevator_switch(), if blk_mq_init_sched() fails, we attempt to fall back to the original scheduler. However, at this point, we've already torn down the original scheduler's tags, so this causes a crash. Doing the fallback like the legacy elevator path is much harder for mq, so fix it by just falling back to none, instead. Signed-off-by: NOmar Sandoval <osandov@fb.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Omar Sandoval 提交于
If a new hardware queue is added at runtime, we don't allocate scheduler tags for it, leading to a crash. This hooks up the scheduler framework to blk_mq_{init,exit}_hctx() to make sure everything gets properly initialized/freed. Signed-off-by: NOmar Sandoval <osandov@fb.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Omar Sandoval 提交于
While dispatching requests, if we fail to get a driver tag, we mark the hardware queue as waiting for a tag and put the requests on a hctx->dispatch list to be run later when a driver tag is freed. However, blk_mq_dispatch_rq_list() may dispatch requests from multiple hardware queues if using a single-queue scheduler with a multiqueue device. If blk_mq_get_driver_tag() fails, it doesn't update the hardware queue we are processing. This means we end up using the hardware queue of the previous request, which may or may not be the same as that of the current request. If it isn't, the wrong hardware queue will end up waiting for a tag, and the requests will be on the wrong dispatch list, leading to a hang. The fix is twofold: 1. Make sure we save which hardware queue we were trying to get a request for in blk_mq_get_driver_tag() regardless of whether it succeeds or not. 2. Make blk_mq_dispatch_rq_list() take a request_queue instead of a blk_mq_hw_queue to make it clear that it must handle multiple hardware queues, since I've already messed this up on a couple of occasions. This didn't appear in testing with nvme and mq-deadline because nvme has more driver tags than the default number of scheduler tags. However, with the blk_mq_update_nr_hw_queues() fix, it showed up with nbd. Signed-off-by: NOmar Sandoval <osandov@fb.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 05 4月, 2017 1 次提交
-
-
由 Bart Van Assche 提交于
The block layer core sets blk_mq_queue_data.list but no block drivers read that member. Hence remove it and also the code that is used to set this member. Signed-off-by: NBart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 31 3月, 2017 1 次提交
-
-
由 Jens Axboe 提交于
Commit a4d907b6 unified the single and multi queue request handlers, but in the process, it also screwed up the locking balance and calls blk_mq_try_issue_directly() with the ctx preempt lock held. This is a problem for drivers that have set BLK_MQ_F_BLOCKING, since now they can't reliably sleep. While in there, protect against similar issues in the future, by adding a might_sleep() trigger in the BLOCKING path for direct issue or queue run. Reported-by: NJosef Bacik <josef@toxicpanda.com> Tested-by: NJosef Bacik <josef@toxicpanda.com> Fixes: a4d907b6 ("blk-mq: streamline blk_mq_make_request") Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 30 3月, 2017 5 次提交
-
-
由 Minchan Kim 提交于
In blk_mq_alloc_request_hctx, blk_mq_sched_get_request doesn't get sw context so we don't need to put the context with blk_mq_put_ctx. Unless, we will see preempt counter underflow. Cc: Omar Sandoval <osandov@fb.com> Signed-off-by: NMinchan Kim <minchan@kernel.org> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Minchan Kim 提交于
In blk_mq_alloc_request_hctx, blk_mq_sched_get_request doesn't get sw context so we don't need to put the context with blk_mq_put_ctx. Unless, we will see preempt counter underflow. Cc: Omar Sandoval <osandov@fb.com> Signed-off-by: NMinchan Kim <minchan@kernel.org> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Jens Axboe 提交于
Currently we return true in blk_mq_dispatch_rq_list() if we queued IO successfully, but we really want to return whether or not the we made progress. Progress includes if we got an error return. If we don't, this can lead to a hang in blk_mq_sched_dispatch_requests() when a driver is draining IO by returning BLK_MQ_QUEUE_ERROR instead of manually ending the IO in error and return BLK_MQ_QUEUE_OK. Tested-by: NJosef Bacik <josef@toxicpanda.com> Reviewed-by: NBart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: NOmar Sandoval <osandov@fb.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Josef Bacik 提交于
When try to issue a request directly and we fail we will requeue the request, but call blk_mq_end_request() as well. This leads to the completed request being on a queuelist and getting ended twice, which causes list corruption in schedulers and other shenanigans. Signed-off-by: NJosef Bacik <jbacik@fb.com> Reviewed-by: NMing Lei <tom.leiming@gmail.com> Reviewed-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Jens Axboe 提交于
Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 29 3月, 2017 4 次提交
-
-
由 Omar Sandoval 提交于
blk_alloc_queue_node() already allocates q->stats, so blk_mq_init_allocated_queue() is overwriting it with a new allocation. Fixes: a83b576c ("block: fix stacked driver stats init and free") Reviewed-by: NMing Lei <tom.leiming@gmail.com> Signed-off-by: NOmar Sandoval <osandov@fb.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Ming Lei 提交于
As the .q_usage_counter is used by both legacy and mq path, we need to block new I/O if queue becomes dead in blk_queue_enter(). So rename it and we can use this function in both paths. Reviewed-by: NBart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: NHannes Reinecke <hare@suse.com> Signed-off-by: NMing Lei <tom.leiming@gmail.com> Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Ming Lei 提交于
This patch adds comment on two races related with timeout handler: - requeue from queue busy vs. timeout - rq free & reallocation vs. timeout Both the races themselves and current solution aren't explicit enough, so add comments on them. Cc: Bart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: NHannes Reinecke <hare@suse.com> Signed-off-by: NMing Lei <tom.leiming@gmail.com> Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Ming Lei 提交于
When iterating busy requests in timeout handler, if the STARTED flag of one request isn't set, that means the request is being processed in block layer or driver, and isn't submitted to hardware yet. In current implementation of blk_mq_check_expired(), if the request queue becomes dying, un-started requests are handled as being completed/freed immediately. This way is wrong, and can cause rq corruption or double allocation[1][2], when doing I/O and removing&resetting NVMe device at the sametime. This patch fixes several issues reported by Yi Zhang. [1]. oops log 1 [ 581.789754] ------------[ cut here ]------------ [ 581.789758] kernel BUG at block/blk-mq.c:374! [ 581.789760] invalid opcode: 0000 [#1] SMP [ 581.789761] Modules linked in: vfat fat ipmi_ssif intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm nvme irqbypass crct10dif_pclmul nvme_core crc32_pclmul ghash_clmulni_intel intel_cstate ipmi_si mei_me ipmi_devintf intel_uncore sg ipmi_msghandler intel_rapl_perf iTCO_wdt mei iTCO_vendor_support mxm_wmi lpc_ich dcdbas shpchp pcspkr acpi_power_meter wmi nfsd auth_rpcgss nfs_acl lockd dm_multipath grace sunrpc ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ahci libahci crc32c_intel tg3 libata megaraid_sas i2c_core ptp fjes pps_core dm_mirror dm_region_hash dm_log dm_mod [ 581.789796] CPU: 1 PID: 1617 Comm: kworker/1:1H Not tainted 4.10.0.bz1420297+ #4 [ 581.789797] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.2.5 09/06/2016 [ 581.789804] Workqueue: kblockd blk_mq_timeout_work [ 581.789806] task: ffff8804721c8000 task.stack: ffffc90006ee4000 [ 581.789809] RIP: 0010:blk_mq_end_request+0x58/0x70 [ 581.789810] RSP: 0018:ffffc90006ee7d50 EFLAGS: 00010202 [ 581.789811] RAX: 0000000000000001 RBX: ffff8802e4195340 RCX: ffff88028e2f4b88 [ 581.789812] RDX: 0000000000001000 RSI: 0000000000001000 RDI: 0000000000000000 [ 581.789813] RBP: ffffc90006ee7d60 R08: 0000000000000003 R09: ffff88028e2f4b00 [ 581.789814] R10: 0000000000001000 R11: 0000000000000001 R12: 00000000fffffffb [ 581.789815] R13: ffff88042abe5780 R14: 000000000000002d R15: ffff88046fbdff80 [ 581.789817] FS: 0000000000000000(0000) GS:ffff88047fc00000(0000) knlGS:0000000000000000 [ 581.789818] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 581.789819] CR2: 00007f64f403a008 CR3: 000000014d078000 CR4: 00000000001406e0 [ 581.789820] Call Trace: [ 581.789825] blk_mq_check_expired+0x76/0x80 [ 581.789828] bt_iter+0x45/0x50 [ 581.789830] blk_mq_queue_tag_busy_iter+0xdd/0x1f0 [ 581.789832] ? blk_mq_rq_timed_out+0x70/0x70 [ 581.789833] ? blk_mq_rq_timed_out+0x70/0x70 [ 581.789840] ? __switch_to+0x140/0x450 [ 581.789841] blk_mq_timeout_work+0x88/0x170 [ 581.789845] process_one_work+0x165/0x410 [ 581.789847] worker_thread+0x137/0x4c0 [ 581.789851] kthread+0x101/0x140 [ 581.789853] ? rescuer_thread+0x3b0/0x3b0 [ 581.789855] ? kthread_park+0x90/0x90 [ 581.789860] ret_from_fork+0x2c/0x40 [ 581.789861] Code: 48 85 c0 74 0d 44 89 e6 48 89 df ff d0 5b 41 5c 5d c3 48 8b bb 70 01 00 00 48 85 ff 75 0f 48 89 df e8 7d f0 ff ff 5b 41 5c 5d c3 <0f> 0b e8 71 f0 ff ff 90 eb e9 0f 1f 40 00 66 2e 0f 1f 84 00 00 [ 581.789882] RIP: blk_mq_end_request+0x58/0x70 RSP: ffffc90006ee7d50 [ 581.789889] ---[ end trace bcaf03d9a14a0a70 ]--- [2]. oops log2 [ 6984.857362] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010 [ 6984.857372] IP: nvme_queue_rq+0x6e6/0x8cd [nvme] [ 6984.857373] PGD 0 [ 6984.857374] [ 6984.857376] Oops: 0000 [#1] SMP [ 6984.857379] Modules linked in: ipmi_ssif vfat fat intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ipmi_si iTCO_wdt iTCO_vendor_support mxm_wmi ipmi_devintf intel_cstate sg dcdbas intel_uncore mei_me intel_rapl_perf mei pcspkr lpc_ich ipmi_msghandler shpchp acpi_power_meter wmi nfsd auth_rpcgss dm_multipath nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect crc32c_intel sysimgblt fb_sys_fops ttm nvme drm nvme_core ahci libahci i2c_core tg3 libata ptp megaraid_sas pps_core fjes dm_mirror dm_region_hash dm_log dm_mod [ 6984.857416] CPU: 7 PID: 1635 Comm: kworker/7:1H Not tainted 4.10.0-2.el7.bz1420297.x86_64 #1 [ 6984.857417] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.2.5 09/06/2016 [ 6984.857427] Workqueue: kblockd blk_mq_run_work_fn [ 6984.857429] task: ffff880476e3da00 task.stack: ffffc90002e90000 [ 6984.857432] RIP: 0010:nvme_queue_rq+0x6e6/0x8cd [nvme] [ 6984.857433] RSP: 0018:ffffc90002e93c50 EFLAGS: 00010246 [ 6984.857434] RAX: 0000000000000000 RBX: ffff880275646600 RCX: 0000000000001000 [ 6984.857435] RDX: 0000000000000fff RSI: 00000002fba2a000 RDI: ffff8804734e6950 [ 6984.857436] RBP: ffffc90002e93d30 R08: 0000000000002000 R09: 0000000000001000 [ 6984.857437] R10: 0000000000001000 R11: 0000000000000000 R12: ffff8804741d8000 [ 6984.857438] R13: 0000000000000040 R14: ffff880475649f80 R15: ffff8804734e6780 [ 6984.857439] FS: 0000000000000000(0000) GS:ffff88047fcc0000(0000) knlGS:0000000000000000 [ 6984.857440] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 6984.857442] CR2: 0000000000000010 CR3: 0000000001c09000 CR4: 00000000001406e0 [ 6984.857443] Call Trace: [ 6984.857451] ? mempool_free+0x2b/0x80 [ 6984.857455] ? bio_free+0x4e/0x60 [ 6984.857459] blk_mq_dispatch_rq_list+0xf5/0x230 [ 6984.857462] blk_mq_process_rq_list+0x133/0x170 [ 6984.857465] __blk_mq_run_hw_queue+0x8c/0xa0 [ 6984.857467] blk_mq_run_work_fn+0x12/0x20 [ 6984.857473] process_one_work+0x165/0x410 [ 6984.857475] worker_thread+0x137/0x4c0 [ 6984.857478] kthread+0x101/0x140 [ 6984.857480] ? rescuer_thread+0x3b0/0x3b0 [ 6984.857481] ? kthread_park+0x90/0x90 [ 6984.857489] ret_from_fork+0x2c/0x40 [ 6984.857490] Code: 8b bd 70 ff ff ff 89 95 50 ff ff ff 89 8d 58 ff ff ff 44 89 95 60 ff ff ff e8 b7 dd 12 e1 8b 95 50 ff ff ff 48 89 85 68 ff ff ff <4c> 8b 48 10 44 8b 58 18 8b 8d 58 ff ff ff 44 8b 95 60 ff ff ff [ 6984.857511] RIP: nvme_queue_rq+0x6e6/0x8cd [nvme] RSP: ffffc90002e93c50 [ 6984.857512] CR2: 0000000000000010 [ 6984.895359] ---[ end trace 2d7ceb528432bf83 ]--- Cc: stable@vger.kernel.org Reported-by: NYi Zhang <yizhan@redhat.com> Tested-by: NYi Zhang <yizhan@redhat.com> Reviewed-by: NBart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: NHannes Reinecke <hare@suse.com> Signed-off-by: NMing Lei <tom.leiming@gmail.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 28 3月, 2017 1 次提交
-
-
由 Shaohua Li 提交于
Currently there is no way to know the request size when the request is finished. Next patch will need this info. We could add extra field to record the size, but blk_issue_stat has enough space to record it, so this patch just overloads blk_issue_stat. With this, we will have 49bits to track time, which still is very long time. Signed-off-by: NShaohua Li <shli@fb.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 25 3月, 2017 1 次提交
-
-
由 Jens Axboe 提交于
Currently we return true in blk_mq_dispatch_rq_list() if we queued IO successfully, but we really want to return whether or not the we made progress. Progress includes if we got an error return. If we don't, this can lead to a hang in blk_mq_sched_dispatch_requests() when a driver is draining IO by returning BLK_MQ_QUEUE_ERROR instead of manually ending the IO in error and return BLK_MQ_QUEUE_OK. Tested-by: NJosef Bacik <josef@toxicpanda.com> Reviewed-by: NBart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: NOmar Sandoval <osandov@fb.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 23 3月, 2017 5 次提交
-
-
由 Christoph Hellwig 提交于
Turn the different ways of merging or issuing I/O into a series of if/else statements instead of the current maze of gotos. Note that this means we pin the CPU a little longer for some cases as the CTX put is moved to common code at the end of the function. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Christoph Hellwig 提交于
Now that we have a nice direct issue heper this helps simplifying the code a bit, and also gets rid of the old_rq variable. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Christoph Hellwig 提交于
Rename blk_mq_try_issue_directly to __blk_mq_try_issue_directly and add a new wrapper that takes care of RCU / SRCU locking to avoid having boileplate code in the caller which would get duplicated with new callers. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NBart Van Assche <bart.vanassche@sandisk.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Christoph Hellwig 提交于
They are mostly the same code anyway - this just one small conditional for the plug case that is different for both variants. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Christoph Hellwig 提交于
This flag was never used since it was introduced. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NBart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 22 3月, 2017 4 次提交
-
-
由 Ming Lei 提交于
When iterating busy requests in timeout handler, if the STARTED flag of one request isn't set, that means the request is being processed in block layer or driver, and isn't submitted to hardware yet. In current implementation of blk_mq_check_expired(), if the request queue becomes dying, un-started requests are handled as being completed/freed immediately. This way is wrong, and can cause rq corruption or double allocation[1][2], when doing I/O and removing&resetting NVMe device at the sametime. This patch fixes several issues reported by Yi Zhang. [1]. oops log 1 [ 581.789754] ------------[ cut here ]------------ [ 581.789758] kernel BUG at block/blk-mq.c:374! [ 581.789760] invalid opcode: 0000 [#1] SMP [ 581.789761] Modules linked in: vfat fat ipmi_ssif intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm nvme irqbypass crct10dif_pclmul nvme_core crc32_pclmul ghash_clmulni_intel intel_cstate ipmi_si mei_me ipmi_devintf intel_uncore sg ipmi_msghandler intel_rapl_perf iTCO_wdt mei iTCO_vendor_support mxm_wmi lpc_ich dcdbas shpchp pcspkr acpi_power_meter wmi nfsd auth_rpcgss nfs_acl lockd dm_multipath grace sunrpc ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ahci libahci crc32c_intel tg3 libata megaraid_sas i2c_core ptp fjes pps_core dm_mirror dm_region_hash dm_log dm_mod [ 581.789796] CPU: 1 PID: 1617 Comm: kworker/1:1H Not tainted 4.10.0.bz1420297+ #4 [ 581.789797] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.2.5 09/06/2016 [ 581.789804] Workqueue: kblockd blk_mq_timeout_work [ 581.789806] task: ffff8804721c8000 task.stack: ffffc90006ee4000 [ 581.789809] RIP: 0010:blk_mq_end_request+0x58/0x70 [ 581.789810] RSP: 0018:ffffc90006ee7d50 EFLAGS: 00010202 [ 581.789811] RAX: 0000000000000001 RBX: ffff8802e4195340 RCX: ffff88028e2f4b88 [ 581.789812] RDX: 0000000000001000 RSI: 0000000000001000 RDI: 0000000000000000 [ 581.789813] RBP: ffffc90006ee7d60 R08: 0000000000000003 R09: ffff88028e2f4b00 [ 581.789814] R10: 0000000000001000 R11: 0000000000000001 R12: 00000000fffffffb [ 581.789815] R13: ffff88042abe5780 R14: 000000000000002d R15: ffff88046fbdff80 [ 581.789817] FS: 0000000000000000(0000) GS:ffff88047fc00000(0000) knlGS:0000000000000000 [ 581.789818] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 581.789819] CR2: 00007f64f403a008 CR3: 000000014d078000 CR4: 00000000001406e0 [ 581.789820] Call Trace: [ 581.789825] blk_mq_check_expired+0x76/0x80 [ 581.789828] bt_iter+0x45/0x50 [ 581.789830] blk_mq_queue_tag_busy_iter+0xdd/0x1f0 [ 581.789832] ? blk_mq_rq_timed_out+0x70/0x70 [ 581.789833] ? blk_mq_rq_timed_out+0x70/0x70 [ 581.789840] ? __switch_to+0x140/0x450 [ 581.789841] blk_mq_timeout_work+0x88/0x170 [ 581.789845] process_one_work+0x165/0x410 [ 581.789847] worker_thread+0x137/0x4c0 [ 581.789851] kthread+0x101/0x140 [ 581.789853] ? rescuer_thread+0x3b0/0x3b0 [ 581.789855] ? kthread_park+0x90/0x90 [ 581.789860] ret_from_fork+0x2c/0x40 [ 581.789861] Code: 48 85 c0 74 0d 44 89 e6 48 89 df ff d0 5b 41 5c 5d c3 48 8b bb 70 01 00 00 48 85 ff 75 0f 48 89 df e8 7d f0 ff ff 5b 41 5c 5d c3 <0f> 0b e8 71 f0 ff ff 90 eb e9 0f 1f 40 00 66 2e 0f 1f 84 00 00 [ 581.789882] RIP: blk_mq_end_request+0x58/0x70 RSP: ffffc90006ee7d50 [ 581.789889] ---[ end trace bcaf03d9a14a0a70 ]--- [2]. oops log2 [ 6984.857362] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010 [ 6984.857372] IP: nvme_queue_rq+0x6e6/0x8cd [nvme] [ 6984.857373] PGD 0 [ 6984.857374] [ 6984.857376] Oops: 0000 [#1] SMP [ 6984.857379] Modules linked in: ipmi_ssif vfat fat intel_rapl sb_edac edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ipmi_si iTCO_wdt iTCO_vendor_support mxm_wmi ipmi_devintf intel_cstate sg dcdbas intel_uncore mei_me intel_rapl_perf mei pcspkr lpc_ich ipmi_msghandler shpchp acpi_power_meter wmi nfsd auth_rpcgss dm_multipath nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect crc32c_intel sysimgblt fb_sys_fops ttm nvme drm nvme_core ahci libahci i2c_core tg3 libata ptp megaraid_sas pps_core fjes dm_mirror dm_region_hash dm_log dm_mod [ 6984.857416] CPU: 7 PID: 1635 Comm: kworker/7:1H Not tainted 4.10.0-2.el7.bz1420297.x86_64 #1 [ 6984.857417] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.2.5 09/06/2016 [ 6984.857427] Workqueue: kblockd blk_mq_run_work_fn [ 6984.857429] task: ffff880476e3da00 task.stack: ffffc90002e90000 [ 6984.857432] RIP: 0010:nvme_queue_rq+0x6e6/0x8cd [nvme] [ 6984.857433] RSP: 0018:ffffc90002e93c50 EFLAGS: 00010246 [ 6984.857434] RAX: 0000000000000000 RBX: ffff880275646600 RCX: 0000000000001000 [ 6984.857435] RDX: 0000000000000fff RSI: 00000002fba2a000 RDI: ffff8804734e6950 [ 6984.857436] RBP: ffffc90002e93d30 R08: 0000000000002000 R09: 0000000000001000 [ 6984.857437] R10: 0000000000001000 R11: 0000000000000000 R12: ffff8804741d8000 [ 6984.857438] R13: 0000000000000040 R14: ffff880475649f80 R15: ffff8804734e6780 [ 6984.857439] FS: 0000000000000000(0000) GS:ffff88047fcc0000(0000) knlGS:0000000000000000 [ 6984.857440] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 6984.857442] CR2: 0000000000000010 CR3: 0000000001c09000 CR4: 00000000001406e0 [ 6984.857443] Call Trace: [ 6984.857451] ? mempool_free+0x2b/0x80 [ 6984.857455] ? bio_free+0x4e/0x60 [ 6984.857459] blk_mq_dispatch_rq_list+0xf5/0x230 [ 6984.857462] blk_mq_process_rq_list+0x133/0x170 [ 6984.857465] __blk_mq_run_hw_queue+0x8c/0xa0 [ 6984.857467] blk_mq_run_work_fn+0x12/0x20 [ 6984.857473] process_one_work+0x165/0x410 [ 6984.857475] worker_thread+0x137/0x4c0 [ 6984.857478] kthread+0x101/0x140 [ 6984.857480] ? rescuer_thread+0x3b0/0x3b0 [ 6984.857481] ? kthread_park+0x90/0x90 [ 6984.857489] ret_from_fork+0x2c/0x40 [ 6984.857490] Code: 8b bd 70 ff ff ff 89 95 50 ff ff ff 89 8d 58 ff ff ff 44 89 95 60 ff ff ff e8 b7 dd 12 e1 8b 95 50 ff ff ff 48 89 85 68 ff ff ff <4c> 8b 48 10 44 8b 58 18 8b 8d 58 ff ff ff 44 8b 95 60 ff ff ff [ 6984.857511] RIP: nvme_queue_rq+0x6e6/0x8cd [nvme] RSP: ffffc90002e93c50 [ 6984.857512] CR2: 0000000000000010 [ 6984.895359] ---[ end trace 2d7ceb528432bf83 ]--- Cc: stable@vger.kernel.org Reported-by: NYi Zhang <yizhan@redhat.com> Tested-by: NYi Zhang <yizhan@redhat.com> Reviewed-by: NBart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: NHannes Reinecke <hare@suse.com> Signed-off-by: NMing Lei <tom.leiming@gmail.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Omar Sandoval 提交于
Currently, statistics are gathered in ~0.13s windows, and users grab the statistics whenever they need them. This is not ideal for both in-tree users: 1. Writeback throttling wants its own dynamically sized window of statistics. Since the blk-stats statistics are reset after every window and the wbt windows don't line up with the blk-stats windows, wbt doesn't see every I/O. 2. Polling currently grabs the statistics on every I/O. Again, depending on how the window lines up, we may miss some I/Os. It's also unnecessary overhead to get the statistics on every I/O; the hybrid polling heuristic would be just as happy with the statistics from the previous full window. This reworks the blk-stats infrastructure to be callback-based: users register a callback that they want called at a given time with all of the statistics from the window during which the callback was active. Users can dynamically bucketize the statistics. wbt and polling both currently use read vs. write, but polling can be extended to further subdivide based on request size. The callbacks are kept on an RCU list, and each callback has percpu stats buffers. There will only be a few users, so the overhead on the I/O completion side is low. The stats flushing is also simplified considerably: since the timer function is responsible for clearing the statistics, we don't have to worry about stale statistics. wbt is a trivial conversion. After the conversion, the windowing problem mentioned above is fixed. For polling, we register an extra callback that caches the previous window's statistics in the struct request_queue for the hybrid polling heuristic to use. Since we no longer have a single stats buffer for the request queue, this also removes the sysfs and debugfs stats entries. To replace those, we add a debugfs entry for the poll statistics. Signed-off-by: NOmar Sandoval <osandov@fb.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Omar Sandoval 提交于
The stats buckets will become generic soon, so make the existing users use the common READ and WRITE definitions instead of one internal to blk-stat. Signed-off-by: NOmar Sandoval <osandov@fb.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Omar Sandoval 提交于
We always call wbt_exit() from blk_release_queue(), so these are unnecessary. Signed-off-by: NOmar Sandoval <osandov@fb.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 15 3月, 2017 1 次提交
-
-
由 Jens Axboe 提交于
If we have scheduling enabled, we jump directly to insert-and-run. That's fine, but we run the queue async and we don't pass in information on whether we can block from this context or not. Fixup both these cases. Reviewed-by: NBart Van Assche <bart.vanassche@sandisk.com> Reviewed-by: NOmar Sandoval <osandov@fb.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 09 3月, 2017 4 次提交
-
-
由 Ming Lei 提交于
It is obviously that hctx->cpumask is per hctx, and both share same lifetime, so this patch moves freeing of hctx->cpumask into release handler of hctx's kobject. Signed-off-by: NMing Lei <tom.leiming@gmail.com> Tested-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Ming Lei 提交于
This patch removes kobject_put() over hctx in __blk_mq_unregister_dev(), and trys to keep lifetime consistent between hctx and hctx's kobject. Now blk_mq_sysfs_register() and blk_mq_sysfs_unregister() become totally symmetrical, and kobject's refcounter drops to zero just when the hctx is freed. Signed-off-by: NMing Lei <tom.leiming@gmail.com> Tested-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Ming Lei 提交于
Currently from kobject view, both q->mq_kobj and ctx->kobj can be released during one cycle of blk_mq_register_dev() and blk_mq_unregister_dev(). Actually, sw queue's lifetime is same with its request queue's, which is covered by request_queue->kobj. So we don't need to call kobject_put() for the two kinds of kobject in __blk_mq_unregister_dev(), instead we do that in release handler of request queue. Signed-off-by: NMing Lei <tom.leiming@gmail.com> Tested-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Ming Lei 提交于
Both q->mq_kobj and sw queues' kobjects should have been initialized once, instead of doing that each add_disk context. Also this patch removes clearing of ctx in blk_mq_init_cpu_queues() because percpu allocator fills zero to allocated variable. This patch fixes one issue[1] reported from Omar. [1] kernel wearning when doing unbind/bind on one scsi-mq device [ 19.347924] kobject (ffff8800791ea0b8): tried to init an initialized object, something is seriously wrong. [ 19.349781] CPU: 1 PID: 84 Comm: kworker/u8:1 Not tainted 4.10.0-rc7-00210-g53f39eeaa263 #34 [ 19.350686] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-20161122_114906-anatol 04/01/2014 [ 19.350920] Workqueue: events_unbound async_run_entry_fn [ 19.350920] Call Trace: [ 19.350920] dump_stack+0x63/0x83 [ 19.350920] kobject_init+0x77/0x90 [ 19.350920] blk_mq_register_dev+0x40/0x130 [ 19.350920] blk_register_queue+0xb6/0x190 [ 19.350920] device_add_disk+0x1ec/0x4b0 [ 19.350920] sd_probe_async+0x10d/0x1c0 [sd_mod] [ 19.350920] async_run_entry_fn+0x48/0x150 [ 19.350920] process_one_work+0x1d0/0x480 [ 19.350920] worker_thread+0x48/0x4e0 [ 19.350920] kthread+0x101/0x140 [ 19.350920] ? process_one_work+0x480/0x480 [ 19.350920] ? kthread_create_on_node+0x60/0x60 [ 19.350920] ret_from_fork+0x2c/0x40 Cc: Omar Sandoval <osandov@osandov.com> Signed-off-by: NMing Lei <tom.leiming@gmail.com> Tested-by: NPeter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 03 3月, 2017 1 次提交
-
-
由 Jens Axboe 提交于
When drivers are called with a request in blk-mq, blk-mq flags the state such that the driver knows if this is the last request in this call chain or not. The driver can then use that information to defer kicking off IO until bd->last is true. However, with blk-mq and scheduling, we need to allocate a driver tag for a request before it can be issued. If we fail to allocate such a tag, we could end up in the situation where the last request issued did not have bd->last == true set. This can then cause a driver hang. This fixes a hang with virtio-blk, which uses bd->last as a hint on whether to kick the queue or not. Reported-by: NChris Mason <clm@fb.com> Tested-by: NChris Mason <clm@fb.com> Reviewed-by: NOmar Sandoval <osandov@fb.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
- 02 3月, 2017 3 次提交
-
-
由 Keith Busch 提交于
A driver may wish to take corrective action if queued requests do not complete within a set time. Signed-off-by: NKeith Busch <keith.busch@intel.com> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Keith Busch 提交于
Drivers can start a freeze, so this provides a way to wait for frozen. Signed-off-by: NKeith Busch <keith.busch@intel.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NSagi Grimberg <sagi@grimberg.me> Signed-off-by: NJens Axboe <axboe@fb.com>
-
由 Omar Sandoval 提交于
No functional difference, it just makes a little more sense to update the tag map where we actually allocate the tag. Signed-off-by: NOmar Sandoval <osandov@fb.com> Signed-off-by: NJens Axboe <axboe@fb.com> Tested-by: NSagi Grimberg <sagi@grimberg.me>
-