提交 · bf4907c05e615f6a1811d61c58d56da52f7e9954 · openeuler / Kernel

31 3月, 2017 1 次提交

blk-mq: fix schedule-under-preempt for blocking drivers · bf4907c0

由 Jens Axboe 提交于 3月 30, 2017

Commit a4d907b6 unified the single and multi queue request handlers,
but in the process, it also screwed up the locking balance and calls
blk_mq_try_issue_directly() with the ctx preempt lock held. This is a
problem for drivers that have set BLK_MQ_F_BLOCKING, since now they
can't reliably sleep.

While in there, protect against similar issues in the future, by adding
a might_sleep() trigger in the BLOCKING path for direct issue or queue
run.
Reported-by: NJosef Bacik <josef@toxicpanda.com>
Tested-by: NJosef Bacik <josef@toxicpanda.com>
Fixes: a4d907b6 ("blk-mq: streamline blk_mq_make_request")
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@fb.com>

bf4907c0

30 3月, 2017 4 次提交

block: do not put mq context in blk_mq_alloc_request_hctx · 3e06eb3d

由 Minchan Kim 提交于 3月 30, 2017

In blk_mq_alloc_request_hctx, blk_mq_sched_get_request doesn't
get sw context so we don't need to put the context with
blk_mq_put_ctx. Unless, we will see preempt counter underflow.

Cc: Omar Sandoval <osandov@fb.com>
Signed-off-by: NMinchan Kim <minchan@kernel.org>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NJens Axboe <axboe@fb.com>

3e06eb3d

blk-mq: include errors in did_work calculation · 3e8a7069

由 Jens Axboe 提交于 3月 24, 2017

Currently we return true in blk_mq_dispatch_rq_list() if we queued IO
successfully, but we really want to return whether or not the we made
progress. Progress includes if we got an error return.  If we don't,
this can lead to a hang in blk_mq_sched_dispatch_requests() when a
driver is draining IO by returning BLK_MQ_QUEUE_ERROR instead of
manually ending the IO in error and return BLK_MQ_QUEUE_OK.
Tested-by: NJosef Bacik <josef@toxicpanda.com>
Reviewed-by: NBart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

3e8a7069

block-mq: don't re-queue if we get a queue error · b58e1769

由 Josef Bacik 提交于 3月 28, 2017

When try to issue a request directly and we fail we will requeue the
request, but call blk_mq_end_request() as well.  This leads to the
completed request being on a queuelist and getting ended twice, which
causes list corruption in schedulers and other shenanigans.
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Reviewed-by: NMing Lei <tom.leiming@gmail.com>
Reviewed-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NJens Axboe <axboe@fb.com>

b58e1769

J
blk-mq: fix a typo and a spelling mistake · 48b99c9d
由 Jens Axboe 提交于 3月 29, 2017
```
Signed-off-by: NJens Axboe <axboe@fb.com>
```
48b99c9d

29 3月, 2017 4 次提交

blk-mq: fix leak of q->stats · 0c9539a4

由 Omar Sandoval 提交于 3月 28, 2017

blk_alloc_queue_node() already allocates q->stats, so
blk_mq_init_allocated_queue() is overwriting it with a new allocation.

Fixes: a83b576c ("block: fix stacked driver stats init and free")
Reviewed-by: NMing Lei <tom.leiming@gmail.com>
Signed-off-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

0c9539a4

block: rename blk_mq_freeze_queue_start() · 1671d522

由 Ming Lei 提交于 3月 27, 2017

As the .q_usage_counter is used by both legacy and
mq path, we need to block new I/O if queue becomes
dead in blk_queue_enter().

So rename it and we can use this function in both
paths.
Reviewed-by: NBart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NMing Lei <tom.leiming@gmail.com>
Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: NJens Axboe <axboe@fb.com>

1671d522

blk-mq: comment on races related with timeout handler · d9d149a3

由 Ming Lei 提交于 3月 27, 2017

This patch adds comment on two races related with
timeout handler:

- requeue from queue busy vs. timeout
- rq free & reallocation vs. timeout

Both the races themselves and current solution aren't
explicit enough, so add comments on them.

Cc: Bart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NMing Lei <tom.leiming@gmail.com>
Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: NJens Axboe <axboe@fb.com>

d9d149a3

blk-mq: don't complete un-started request in timeout handler · a4ef8e56

由 Ming Lei 提交于 3月 22, 2017

When iterating busy requests in timeout handler,
if the STARTED flag of one request isn't set, that means
the request is being processed in block layer or driver, and
isn't submitted to hardware yet.

In current implementation of blk_mq_check_expired(),
if the request queue becomes dying, un-started requests are
handled as being completed/freed immediately. This way is
wrong, and can cause rq corruption or double allocation[1][2],
when doing I/O and removing&resetting NVMe device at the sametime.

This patch fixes several issues reported by Yi Zhang.

[1]. oops log 1
[  581.789754] ------------[ cut here ]------------
[  581.789758] kernel BUG at block/blk-mq.c:374!
[  581.789760] invalid opcode: 0000 [#1] SMP
[  581.789761] Modules linked in: vfat fat ipmi_ssif intel_rapl sb_edac
edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm nvme
irqbypass crct10dif_pclmul nvme_core crc32_pclmul ghash_clmulni_intel
intel_cstate ipmi_si mei_me ipmi_devintf intel_uncore sg ipmi_msghandler
intel_rapl_perf iTCO_wdt mei iTCO_vendor_support mxm_wmi lpc_ich dcdbas shpchp
pcspkr acpi_power_meter wmi nfsd auth_rpcgss nfs_acl lockd dm_multipath grace
sunrpc ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper
syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ahci libahci
crc32c_intel tg3 libata megaraid_sas i2c_core ptp fjes pps_core dm_mirror
dm_region_hash dm_log dm_mod
[  581.789796] CPU: 1 PID: 1617 Comm: kworker/1:1H Not tainted 4.10.0.bz1420297+ #4
[  581.789797] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.2.5 09/06/2016
[  581.789804] Workqueue: kblockd blk_mq_timeout_work
[  581.789806] task: ffff8804721c8000 task.stack: ffffc90006ee4000
[  581.789809] RIP: 0010:blk_mq_end_request+0x58/0x70
[  581.789810] RSP: 0018:ffffc90006ee7d50 EFLAGS: 00010202
[  581.789811] RAX: 0000000000000001 RBX: ffff8802e4195340 RCX: ffff88028e2f4b88
[  581.789812] RDX: 0000000000001000 RSI: 0000000000001000 RDI: 0000000000000000
[  581.789813] RBP: ffffc90006ee7d60 R08: 0000000000000003 R09: ffff88028e2f4b00
[  581.789814] R10: 0000000000001000 R11: 0000000000000001 R12: 00000000fffffffb
[  581.789815] R13: ffff88042abe5780 R14: 000000000000002d R15: ffff88046fbdff80
[  581.789817] FS:  0000000000000000(0000) GS:ffff88047fc00000(0000) knlGS:0000000000000000
[  581.789818] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  581.789819] CR2: 00007f64f403a008 CR3: 000000014d078000 CR4: 00000000001406e0
[  581.789820] Call Trace:
[  581.789825]  blk_mq_check_expired+0x76/0x80
[  581.789828]  bt_iter+0x45/0x50
[  581.789830]  blk_mq_queue_tag_busy_iter+0xdd/0x1f0
[  581.789832]  ? blk_mq_rq_timed_out+0x70/0x70
[  581.789833]  ? blk_mq_rq_timed_out+0x70/0x70
[  581.789840]  ? __switch_to+0x140/0x450
[  581.789841]  blk_mq_timeout_work+0x88/0x170
[  581.789845]  process_one_work+0x165/0x410
[  581.789847]  worker_thread+0x137/0x4c0
[  581.789851]  kthread+0x101/0x140
[  581.789853]  ? rescuer_thread+0x3b0/0x3b0
[  581.789855]  ? kthread_park+0x90/0x90
[  581.789860]  ret_from_fork+0x2c/0x40
[  581.789861] Code: 48 85 c0 74 0d 44 89 e6 48 89 df ff d0 5b 41 5c 5d c3 48
8b bb 70 01 00 00 48 85 ff 75 0f 48 89 df e8 7d f0 ff ff 5b 41 5c 5d c3 <0f>
0b e8 71 f0 ff ff 90 eb e9 0f 1f 40 00 66 2e 0f 1f 84 00 00
[  581.789882] RIP: blk_mq_end_request+0x58/0x70 RSP: ffffc90006ee7d50
[  581.789889] ---[ end trace bcaf03d9a14a0a70 ]---

[2]. oops log2
[ 6984.857362] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
[ 6984.857372] IP: nvme_queue_rq+0x6e6/0x8cd [nvme]
[ 6984.857373] PGD 0
[ 6984.857374]
[ 6984.857376] Oops: 0000 [#1] SMP
[ 6984.857379] Modules linked in: ipmi_ssif vfat fat intel_rapl sb_edac
edac_core x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm
irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ipmi_si iTCO_wdt
iTCO_vendor_support mxm_wmi ipmi_devintf intel_cstate sg dcdbas intel_uncore
mei_me intel_rapl_perf mei pcspkr lpc_ich ipmi_msghandler shpchp
acpi_power_meter wmi nfsd auth_rpcgss dm_multipath nfs_acl lockd grace sunrpc
ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea
sysfillrect crc32c_intel sysimgblt fb_sys_fops ttm nvme drm nvme_core ahci
libahci i2c_core tg3 libata ptp megaraid_sas pps_core fjes dm_mirror
dm_region_hash dm_log dm_mod
[ 6984.857416] CPU: 7 PID: 1635 Comm: kworker/7:1H Not tainted
4.10.0-2.el7.bz1420297.x86_64 #1
[ 6984.857417] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.2.5 09/06/2016
[ 6984.857427] Workqueue: kblockd blk_mq_run_work_fn
[ 6984.857429] task: ffff880476e3da00 task.stack: ffffc90002e90000
[ 6984.857432] RIP: 0010:nvme_queue_rq+0x6e6/0x8cd [nvme]
[ 6984.857433] RSP: 0018:ffffc90002e93c50 EFLAGS: 00010246
[ 6984.857434] RAX: 0000000000000000 RBX: ffff880275646600 RCX: 0000000000001000
[ 6984.857435] RDX: 0000000000000fff RSI: 00000002fba2a000 RDI: ffff8804734e6950
[ 6984.857436] RBP: ffffc90002e93d30 R08: 0000000000002000 R09: 0000000000001000
[ 6984.857437] R10: 0000000000001000 R11: 0000000000000000 R12: ffff8804741d8000
[ 6984.857438] R13: 0000000000000040 R14: ffff880475649f80 R15: ffff8804734e6780
[ 6984.857439] FS:  0000000000000000(0000) GS:ffff88047fcc0000(0000) knlGS:0000000000000000
[ 6984.857440] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6984.857442] CR2: 0000000000000010 CR3: 0000000001c09000 CR4: 00000000001406e0
[ 6984.857443] Call Trace:
[ 6984.857451]  ? mempool_free+0x2b/0x80
[ 6984.857455]  ? bio_free+0x4e/0x60
[ 6984.857459]  blk_mq_dispatch_rq_list+0xf5/0x230
[ 6984.857462]  blk_mq_process_rq_list+0x133/0x170
[ 6984.857465]  __blk_mq_run_hw_queue+0x8c/0xa0
[ 6984.857467]  blk_mq_run_work_fn+0x12/0x20
[ 6984.857473]  process_one_work+0x165/0x410
[ 6984.857475]  worker_thread+0x137/0x4c0
[ 6984.857478]  kthread+0x101/0x140
[ 6984.857480]  ? rescuer_thread+0x3b0/0x3b0
[ 6984.857481]  ? kthread_park+0x90/0x90
[ 6984.857489]  ret_from_fork+0x2c/0x40
[ 6984.857490] Code: 8b bd 70 ff ff ff 89 95 50 ff ff ff 89 8d 58 ff ff ff 44
89 95 60 ff ff ff e8 b7 dd 12 e1 8b 95 50 ff ff ff 48 89 85 68 ff ff ff <4c>
8b 48 10 44 8b 58 18 8b 8d 58 ff ff ff 44 8b 95 60 ff ff ff
[ 6984.857511] RIP: nvme_queue_rq+0x6e6/0x8cd [nvme] RSP: ffffc90002e93c50
[ 6984.857512] CR2: 0000000000000010
[ 6984.895359] ---[ end trace 2d7ceb528432bf83 ]---

Cc: stable@vger.kernel.org
Reported-by: NYi Zhang <yizhan@redhat.com>
Tested-by: NYi Zhang <yizhan@redhat.com>
Reviewed-by: NBart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: NHannes Reinecke <hare@suse.com>
Signed-off-by: NMing Lei <tom.leiming@gmail.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

a4ef8e56

28 3月, 2017 1 次提交

block: track request size in blk_issue_stat · 88eeca49

由 Shaohua Li 提交于 3月 27, 2017

Currently there is no way to know the request size when the request is
finished. Next patch will need this info. We could add extra field to
record the size, but blk_issue_stat has enough space to record it, so
this patch just overloads blk_issue_stat. With this, we will have 49bits
to track time, which still is very long time.
Signed-off-by: NShaohua Li <shli@fb.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

88eeca49

23 3月, 2017 5 次提交

blk-mq: streamline blk_mq_make_request · a4d907b6

由 Christoph Hellwig 提交于 3月 22, 2017

Turn the different ways of merging or issuing I/O into a series of if/else
statements instead of the current maze of gotos.  Note that this means we
pin the CPU a little longer for some cases as the CTX put is moved to
common code at the end of the function.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@fb.com>

a4d907b6

blk-mq: split the plug and sync cases in blk_mq_make_request · 2299722c

由 Christoph Hellwig 提交于 3月 22, 2017

Now that we have a nice direct issue heper this helps simplifying
the code a bit, and also gets rid of the old_rq variable.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@fb.com>

2299722c

blk-mq: improve blk_mq_try_issue_directly · 5eb6126e

由 Christoph Hellwig 提交于 3月 22, 2017

Rename blk_mq_try_issue_directly to __blk_mq_try_issue_directly and add a
new wrapper that takes care of RCU / SRCU locking to avoid having
boileplate code in the caller which would get duplicated with new callers.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NBart Van Assche <bart.vanassche@sandisk.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

5eb6126e

blk-mq: merge mq and sq make_request instances · 254d259d

由 Christoph Hellwig 提交于 3月 22, 2017

They are mostly the same code anyway - this just one small conditional
for the plug case that is different for both variants.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@fb.com>

254d259d

blk-mq: remove BLK_MQ_F_DEFER_ISSUE · 7642747d

由 Christoph Hellwig 提交于 3月 22, 2017

This flag was never used since it was introduced.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NBart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: NJohannes Thumshirn <jthumshirn@suse.de>
Signed-off-by: NJens Axboe <axboe@fb.com>

7642747d

22 3月, 2017 3 次提交

blk-stat: convert to callback-based statistics reporting · 34dbad5d

由 Omar Sandoval 提交于 3月 21, 2017

Currently, statistics are gathered in ~0.13s windows, and users grab the
statistics whenever they need them. This is not ideal for both in-tree
users:

1. Writeback throttling wants its own dynamically sized window of
   statistics. Since the blk-stats statistics are reset after every
   window and the wbt windows don't line up with the blk-stats windows,
   wbt doesn't see every I/O.
2. Polling currently grabs the statistics on every I/O. Again, depending
   on how the window lines up, we may miss some I/Os. It's also
   unnecessary overhead to get the statistics on every I/O; the hybrid
   polling heuristic would be just as happy with the statistics from the
   previous full window.

This reworks the blk-stats infrastructure to be callback-based: users
register a callback that they want called at a given time with all of
the statistics from the window during which the callback was active.
Users can dynamically bucketize the statistics. wbt and polling both
currently use read vs. write, but polling can be extended to further
subdivide based on request size.

The callbacks are kept on an RCU list, and each callback has percpu
stats buffers. There will only be a few users, so the overhead on the
I/O completion side is low. The stats flushing is also simplified
considerably: since the timer function is responsible for clearing the
statistics, we don't have to worry about stale statistics.

wbt is a trivial conversion. After the conversion, the windowing problem
mentioned above is fixed.

For polling, we register an extra callback that caches the previous
window's statistics in the struct request_queue for the hybrid polling
heuristic to use.

Since we no longer have a single stats buffer for the request queue,
this also removes the sysfs and debugfs stats entries. To replace those,
we add a debugfs entry for the poll statistics.
Signed-off-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

34dbad5d

blk-stat: use READ and WRITE instead of BLK_STAT_{READ,WRITE} · fa2e39cb

由 Omar Sandoval 提交于 3月 21, 2017

The stats buckets will become generic soon, so make the existing users
use the common READ and WRITE definitions instead of one internal to
blk-stat.
Signed-off-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

fa2e39cb

block: remove extra calls to wbt_exit() · 0315b159

由 Omar Sandoval 提交于 3月 21, 2017

We always call wbt_exit() from blk_release_queue(), so these are
unnecessary.
Signed-off-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

0315b159

15 3月, 2017 1 次提交

blk-mq-sched: don't run the queue async from blk_mq_try_issue_directly() · 9c621104

由 Jens Axboe 提交于 3月 14, 2017

If we have scheduling enabled, we jump directly to insert-and-run.
That's fine, but we run the queue async and we don't pass in information
on whether we can block from this context or not. Fixup both these
cases.
Reviewed-by: NBart Van Assche <bart.vanassche@sandisk.com>
Reviewed-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

9c621104

09 3月, 2017 4 次提交

blk-mq: free hctx->cpumask in release handler of hctx's kobject · 01388df3

由 Ming Lei 提交于 2月 22, 2017

It is obviously that hctx->cpumask is per hctx, and both
share same lifetime, so this patch moves freeing of hctx->cpumask
into release handler of hctx's kobject.
Signed-off-by: NMing Lei <tom.leiming@gmail.com>
Tested-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: NJens Axboe <axboe@fb.com>

01388df3

blk-mq: make lifetime consistent between hctx and its kobject · 6c8b232e

由 Ming Lei 提交于 2月 22, 2017

This patch removes kobject_put() over hctx in __blk_mq_unregister_dev(),
and trys to keep lifetime consistent between hctx and hctx's kobject.

Now blk_mq_sysfs_register() and blk_mq_sysfs_unregister() become
totally symmetrical, and kobject's refcounter drops to zero just
when the hctx is freed.
Signed-off-by: NMing Lei <tom.leiming@gmail.com>
Tested-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: NJens Axboe <axboe@fb.com>

6c8b232e

blk-mq: make lifetime consitent between q/ctx and its kobject · 7ea5fe31

由 Ming Lei 提交于 2月 22, 2017

Currently from kobject view, both q->mq_kobj and ctx->kobj can
be released during one cycle of blk_mq_register_dev() and
blk_mq_unregister_dev(). Actually, sw queue's lifetime is
same with its request queue's, which is covered by request_queue->kobj.

So we don't need to call kobject_put() for the two kinds of
kobject in __blk_mq_unregister_dev(), instead we do that
in release handler of request queue.
Signed-off-by: NMing Lei <tom.leiming@gmail.com>
Tested-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: NJens Axboe <axboe@fb.com>

7ea5fe31

blk-mq: initialize mq kobjects in blk_mq_init_allocated_queue() · 737f98cf

由 Ming Lei 提交于 2月 22, 2017

Both q->mq_kobj and sw queues' kobjects should have been initialized
once, instead of doing that each add_disk context.

Also this patch removes clearing of ctx in blk_mq_init_cpu_queues()
because percpu allocator fills zero to allocated variable.

This patch fixes one issue[1] reported from Omar.

[1] kernel wearning when doing unbind/bind on one scsi-mq device

[   19.347924] kobject (ffff8800791ea0b8): tried to init an initialized object, something is seriously wrong.
[   19.349781] CPU: 1 PID: 84 Comm: kworker/u8:1 Not tainted 4.10.0-rc7-00210-g53f39eeaa263 #34
[   19.350686] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.1-20161122_114906-anatol 04/01/2014
[   19.350920] Workqueue: events_unbound async_run_entry_fn
[   19.350920] Call Trace:
[   19.350920]  dump_stack+0x63/0x83
[   19.350920]  kobject_init+0x77/0x90
[   19.350920]  blk_mq_register_dev+0x40/0x130
[   19.350920]  blk_register_queue+0xb6/0x190
[   19.350920]  device_add_disk+0x1ec/0x4b0
[   19.350920]  sd_probe_async+0x10d/0x1c0 [sd_mod]
[   19.350920]  async_run_entry_fn+0x48/0x150
[   19.350920]  process_one_work+0x1d0/0x480
[   19.350920]  worker_thread+0x48/0x4e0
[   19.350920]  kthread+0x101/0x140
[   19.350920]  ? process_one_work+0x480/0x480
[   19.350920]  ? kthread_create_on_node+0x60/0x60
[   19.350920]  ret_from_fork+0x2c/0x40

Cc: Omar Sandoval <osandov@osandov.com>
Signed-off-by: NMing Lei <tom.leiming@gmail.com>
Tested-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: NJens Axboe <axboe@fb.com>

737f98cf

03 3月, 2017 1 次提交

blk-mq: ensure that bd->last is always set correctly · 113285b4

由 Jens Axboe 提交于 3月 02, 2017

When drivers are called with a request in blk-mq, blk-mq flags the
state such that the driver knows if this is the last request in
this call chain or not. The driver can then use that information
to defer kicking off IO until bd->last is true. However, with blk-mq
and scheduling, we need to allocate a driver tag for a request before
it can be issued. If we fail to allocate such a tag, we could end up
in the situation where the last request issued did not have
bd->last == true set. This can then cause a driver hang.

This fixes a hang with virtio-blk, which uses bd->last as a hint
on whether to kick the queue or not.
Reported-by: NChris Mason <clm@fb.com>
Tested-by: NChris Mason <clm@fb.com>
Reviewed-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

113285b4

02 3月, 2017 8 次提交

blk-mq: Provide freeze queue timeout · f91328c4

由 Keith Busch 提交于 3月 01, 2017

A driver may wish to take corrective action if queued requests do not
complete within a set time.
Signed-off-by: NKeith Busch <keith.busch@intel.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

f91328c4

blk-mq: Export blk_mq_freeze_queue_wait · 6bae363e

由 Keith Busch 提交于 3月 01, 2017

Drivers can start a freeze, so this provides a way to wait for frozen.
Signed-off-by: NKeith Busch <keith.busch@intel.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NJens Axboe <axboe@fb.com>

6bae363e

blk-mq: move update of tags->rqs to __blk_mq_alloc_request() · 562bef42

由 Omar Sandoval 提交于 2月 27, 2017

No functional difference, it just makes a little more sense to update
the tag map where we actually allocate the tag.
Signed-off-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NJens Axboe <axboe@fb.com>
Tested-by: NSagi Grimberg <sagi@grimberg.me>

562bef42

blk-mq: make blk_mq_alloc_request_hctx() allocate a scheduler request · 6d2809d5

由 Omar Sandoval 提交于 2月 27, 2017

blk_mq_alloc_request_hctx() allocates a driver request directly, unlike
its blk_mq_alloc_request() counterpart. It also crashes because it
doesn't update the tags->rqs map.

Fix it by making it allocate a scheduler request.
Reported-by: NSagi Grimberg <sagi@grimberg.me>
Signed-off-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NJens Axboe <axboe@fb.com>
Tested-by: NSagi Grimberg <sagi@grimberg.me>

6d2809d5

blk-mq-sched: Allocate sched reserved tags as specified in the original queue tagset · 415b806d

由 Sagi Grimberg 提交于 2月 27, 2017

Signed-off-by: NSagi Grimberg <sagi@grimberg.me>

Modified by me to also check at driver tag allocation time if the
original request was reserved, so we can be sure to allocate a
properly reserved tag at that point in time, too.
Signed-off-by: NJens Axboe <axboe@fb.com>

415b806d

blk-mq: allocate blk_mq_tags and requests in correct node · 59f082e4

由 Shaohua Li 提交于 2月 01, 2017

blk_mq_tags/requests of specific hardware queue are mostly used in
specific cpus, which might not be in the same numa node as disk. For
example, a nvme card is in node 0. half hardware queue will be used by
node 0, the other node 1.
Signed-off-by: NShaohua Li <shli@fb.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@fb.com>

59f082e4

sched/headers: Prepare to move signal wakeup & sigpending methods from... · 174cd4b1

由 Ingo Molnar 提交于 2月 02, 2017

sched/headers: Prepare to move signal wakeup & sigpending methods from <linux/sched.h> into <linux/sched/signal.h>

Fix up affected files that include this signal functionality via sched.h.
Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: NIngo Molnar <mingo@kernel.org>

174cd4b1

sched/headers: Prepare for new header dependencies before moving code to <linux/sched/topology.h> · 105ab3d8

由 Ingo Molnar 提交于 2月 01, 2017

We are going to split <linux/sched/topology.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder <linux/sched/topology.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.

Include the new header in the files that are going to need it.
Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: NIngo Molnar <mingo@kernel.org>

105ab3d8

24 2月, 2017 1 次提交

blk-mq: use sbq wait queues instead of restart for driver tags · da55f2cc

由 Omar Sandoval 提交于 2月 22, 2017

Commit 50e1dab8 ("blk-mq-sched: fix starvation for multiple hardware
queues and shared tags") fixed one starvation issue for shared tags.
However, we can still get into a situation where we fail to allocate a
tag because all tags are allocated but we don't have any pending
requests on any hardware queue.

One solution for this would be to restart all queues that share a tag
map, but that really sucks. Ideally, we could just block and wait for a
tag, but that isn't always possible from blk_mq_dispatch_rq_list().

However, we can still use the struct sbitmap_queue wait queues with a
custom callback instead of blocking. This has a few benefits:

1. It avoids iterating over all hardware queues when completing an I/O,
   which the current restart code has to do.
2. It benefits from the existing rolling wakeup code.
3. It avoids punting to another thread just to have it block.
Signed-off-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

da55f2cc

18 2月, 2017 2 次提交

blk-mq: don't special case flush inserts for blk-mq-sched · 0c2a6fe4

由 Jens Axboe 提交于 2月 17, 2017

The current request insertion machinery works just fine for
directly inserting flushes, so no need to special case
this anymore.
Signed-off-by: NJens Axboe <axboe@fb.com>
Reviewed-by: NOmar Sandoval <osandov@fb.com>

0c2a6fe4

blk-mq: have blk_mq_dispatch_rq_list() return if we queued IO or not · 2aa0f21d

由 Jens Axboe 提交于 2月 17, 2017

Currently we're almost there, but if we dispatch nothing, then we
still return success.
Signed-off-by: NJens Axboe <axboe@fb.com>
Reviewed-by: NOmar Sandoval <osandov@fb.com>

2aa0f21d

11 2月, 2017 1 次提交

block: set make_request_fn manually in blk_mq_update_nr_hw_queues · f6f94300

由 Josef Bacik 提交于 2月 10, 2017

Calling blk_queue_make_request resets a bunch of settings on the
request_queue, but all we really want is to update the make_request_fn,
so do this directly so we don't lose things like the logical and
physical block sizes.
Signed-off-by: NJosef Bacik <jbacik@fb.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

f6f94300

09 2月, 2017 2 次提交

block: optionally merge discontiguous discard bios into a single request · 1e739730

由 Christoph Hellwig 提交于 2月 08, 2017

Add a new merge strategy that merges discard bios into a request until the
maximum number of discard ranges (or the maximum discard size) is reached
from the plug merging code.  I/O scheduler merging is not wired up yet
but might also be useful, although not for fast devices like NVMe which
are the only user for now.

Note that for now we don't support limiting the size of each discard range,
but if needed that can be added later.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@fb.com>

1e739730

block: enumify ELEVATOR_*_MERGE · 34fe7c05

由 Christoph Hellwig 提交于 2月 08, 2017

Switch these constants to an enum, and make let the compiler ensure that
all callers of blk_try_merge and elv_merge handle all potential values.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NJens Axboe <axboe@fb.com>

34fe7c05

03 2月, 2017 1 次提交

block: use same block debugfs directory for blk-mq and blktrace · 18fbda91

由 Omar Sandoval 提交于 1月 31, 2017

When I added the blk-mq debugging information to debugfs, I didn't
notice that blktrace also creates a "block" directory in debugfs. Make
them use the same dentry, now created in the core block code. Based on a
patch from Jens.
Signed-off-by: NOmar Sandoval <osandov@fb.com>
Signed-off-by: NJens Axboe <axboe@fb.com>

18fbda91

01 2月, 2017 1 次提交

blk-mq: don't fail allocating driver tag for stopped hw queue · 12d70958

由 Jens Axboe 提交于 1月 31, 2017

We rely on blk_mq_get_driver_tag() not failing if 'wait' is true,
but it currently fails in that case if the queue happens to be
stopped at the time of the call.

We don't need to check for stopped here, it's just assigning
the tag. If the queue is stopped, we'll handle it when
attempting to run the queue.

This fixes a stall/crash on flush intensive workloads, where
we proceed to process a flush that doesn't have a valid tag
assigned.
Signed-off-by: NJens Axboe <axboe@fb.com>

12d70958

openeuler / Kernel 大约 1 年 前同步成功

openeuler / Kernel
大约 1 年前同步成功