1. 23 May 2017, 1 commit
  2. 10 May 2017, 1 commit
    • blk-mq: NVMe 512B/4K+T10 DIF/DIX format returns I/O error on dd with split op · f36ea50c
      Authored by Wen Xiong
      When formatting NVMe to 512B/4K + T10 DIF/DIX, dd with a split op returns
      "Input/output error". It looks like the block layer splits the bio after
      calling bio_integrity_prep(bio). This patch fixes the issue.
      
      Below is how we debugged this issue:
      (1) format nvme to a 4K block size with type 2 DIF
      (2) dd with a block size bigger than 1024k and oflag=direct
      dd: error writing '/dev/nvme0n1': Input/output error
      
      We added some debug code to the nvme device driver. It showed us that the
      first op and the second op have the same bi and PI address, which is not
      correct.
      
      1st op: nvme0n1 Op:Wr slba 0x505 length 0x100, PI ctrl=0x1400,
      	dsmgmt=0x0, AT=0x0 & RT=0x505
      	Guard 0x00b1, AT 0x0000, RT physical 0x00000505 RT virtual 0x00002828
      
      2nd op: nvme0n1 Op:Wr slba 0x605 length 0x1, PI ctrl=0x1400, dsmgmt=0x0,
      	AT=0x0 & RT=0x605  ==> This op fails, as do the subsequent 5 retries.
      	Guard 0x00b1, AT 0x0000, RT physical 0x00000605 RT virtual 0x00002828
      
      With the fix, it showed us that both the first op and the second op have
      the correct bi and PI addresses.
      
      1st op: nvme2n1 Op:Wr slba 0x505 length 0x100, PI ctrl=0x1400,
      	dsmgmt=0x0, AT=0x0 & RT=0x505
      	Guard 0x5ccb, AT 0x0000, RT physical 0x00000505 RT virtual
      	0x00002828
      2nd op: nvme2n1 Op:Wr slba 0x605 length 0x1, PI ctrl=0x1400, dsmgmt=0x0,
      	AT=0x0 & RT=0x605
      	Guard 0xab4c, AT 0x0000, RT physical 0x00000605 RT virtual
      	0x00003028
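
      Based on the description above (the split happening after bio_integrity_prep()),
      the likely shape of the fix is to split first and attach the protection
      information afterwards. A minimal sketch of that ordering in the make_request
      path, using 4.11-era helpers; the exact context and signatures are assumptions,
      not a quote of the patch:

        static blk_qc_t blk_mq_make_request(struct request_queue *q, struct bio *bio)
        {
        	blk_queue_bounce(q, &bio);

        	/* Split first: each resulting bio then runs through
        	 * bio_integrity_prep() on its own and gets its own PI buffer. */
        	blk_queue_split(q, &bio, q->bio_split);

        	/* Attach protection information only after the split; doing it
        	 * before is what made both halves share one PI address above. */
        	if (bio_integrity_enabled(bio) && bio_integrity_prep(bio)) {
        		bio_io_error(bio);
        		return BLK_QC_T_NONE;
        	}

        	/* ... normal request mapping and dispatch continues here ... */
        	return BLK_QC_T_NONE;
        }
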
      Signed-off-by: Wen Xiong <wenxiong@linux.vnet.ibm.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      f36ea50c
  3. 08 May 2017, 2 commits
    • blk-mq: make __blk_mq_stop_hw_queues static · ebd76857
      Authored by Colin Ian King
      Making __blk_mq_stop_hw_queues static fixes the following sparse warning:
      
        block/blk-mq.c:6: warning: symbol '__blk_mq_stop_hw_queues' was not
        declared. Should it be static?
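
      For reference, the shape of the fix is simply giving the helper internal
      linkage; a sketch with the body approximated (the real definition lives in
      block/blk-mq.c):

        /* static: only used inside block/blk-mq.c, so sparse stops asking
         * for an external declaration. Body approximated, not quoted. */
        static void __blk_mq_stop_hw_queues(struct request_queue *q, bool sync)
        {
        	struct blk_mq_hw_ctx *hctx;
        	int i;

        	queue_for_each_hw_ctx(q, hctx, i)
        		__blk_mq_stop_hw_queue(hctx, sync);
        }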
      
      Fixes: 2719aa21 ("blk-mq: don't use sync workqueue flushing from drivers")
      Signed-off-by: Colin Ian King <colin.king@canonical.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      ebd76857
    • block/mq: fix potential deadlock during cpu hotplug · 51d638b1
      Authored by Wanpeng Li
      This can be triggered by hot-unplugging one CPU.
      
      ======================================================
       [ INFO: possible circular locking dependency detected ]
       4.11.0+ #17 Not tainted
       -------------------------------------------------------
       step_after_susp/2640 is trying to acquire lock:
        (all_q_mutex){+.+...}, at: [<ffffffffb33f95b8>] blk_mq_queue_reinit_work+0x18/0x110
      
       but task is already holding lock:
        (cpu_hotplug.lock){+.+.+.}, at: [<ffffffffb306d04f>] cpu_hotplug_begin+0x7f/0xe0
      
       which lock already depends on the new lock.
      
       the existing dependency chain (in reverse order) is:
      
       -> #1 (cpu_hotplug.lock){+.+.+.}:
              lock_acquire+0x11c/0x230
              __mutex_lock+0x92/0x990
              mutex_lock_nested+0x1b/0x20
              get_online_cpus+0x64/0x80
              blk_mq_init_allocated_queue+0x3a0/0x4e0
              blk_mq_init_queue+0x3a/0x60
              loop_add+0xe5/0x280
              loop_init+0x124/0x177
              do_one_initcall+0x53/0x1c0
              kernel_init_freeable+0x1e3/0x27f
              kernel_init+0xe/0x100
              ret_from_fork+0x31/0x40
      
       -> #0 (all_q_mutex){+.+...}:
              __lock_acquire+0x189a/0x18a0
              lock_acquire+0x11c/0x230
              __mutex_lock+0x92/0x990
              mutex_lock_nested+0x1b/0x20
              blk_mq_queue_reinit_work+0x18/0x110
              blk_mq_queue_reinit_dead+0x1c/0x20
              cpuhp_invoke_callback+0x1f2/0x810
              cpuhp_down_callbacks+0x42/0x80
              _cpu_down+0xb2/0xe0
              freeze_secondary_cpus+0xb6/0x390
              suspend_devices_and_enter+0x3b3/0xa40
              pm_suspend+0x129/0x490
              state_store+0x82/0xf0
              kobj_attr_store+0xf/0x20
              sysfs_kf_write+0x45/0x60
              kernfs_fop_write+0x135/0x1c0
              __vfs_write+0x37/0x160
              vfs_write+0xcd/0x1d0
              SyS_write+0x58/0xc0
              do_syscall_64+0x8f/0x710
              return_from_SYSCALL_64+0x0/0x7a
      
       other info that might help us debug this:
      
        Possible unsafe locking scenario:
      
              CPU0                    CPU1
              ----                    ----
         lock(cpu_hotplug.lock);
                                      lock(all_q_mutex);
                                      lock(cpu_hotplug.lock);
         lock(all_q_mutex);
      
        *** DEADLOCK ***
      
       8 locks held by step_after_susp/2640:
        #0:  (sb_writers#6){.+.+.+}, at: [<ffffffffb3244aed>] vfs_write+0x1ad/0x1d0
        #1:  (&of->mutex){+.+.+.}, at: [<ffffffffb32d3a51>] kernfs_fop_write+0x101/0x1c0
        #2:  (s_active#166){.+.+.+}, at: [<ffffffffb32d3a59>] kernfs_fop_write+0x109/0x1c0
        #3:  (pm_mutex){+.+...}, at: [<ffffffffb30d2ecd>] pm_suspend+0x21d/0x490
        #4:  (acpi_scan_lock){+.+.+.}, at: [<ffffffffb34dc3d7>] acpi_scan_lock_acquire+0x17/0x20
        #5:  (cpu_add_remove_lock){+.+.+.}, at: [<ffffffffb306d6d7>] freeze_secondary_cpus+0x27/0x390
        #6:  (cpu_hotplug.dep_map){++++++}, at: [<ffffffffb306cfd5>] cpu_hotplug_begin+0x5/0xe0
        #7:  (cpu_hotplug.lock){+.+.+.}, at: [<ffffffffb306d04f>] cpu_hotplug_begin+0x7f/0xe0
      
       stack backtrace:
       CPU: 3 PID: 2640 Comm: step_after_susp Not tainted 4.11.0+ #17
       Hardware name: Dell Inc. OptiPlex 7040/0JCTF8, BIOS 1.4.9 09/12/2016
       Call Trace:
        dump_stack+0x99/0xce
        print_circular_bug+0x1fa/0x270
        __lock_acquire+0x189a/0x18a0
        lock_acquire+0x11c/0x230
        ? lock_acquire+0x11c/0x230
        ? blk_mq_queue_reinit_work+0x18/0x110
        ? blk_mq_queue_reinit_work+0x18/0x110
        __mutex_lock+0x92/0x990
        ? blk_mq_queue_reinit_work+0x18/0x110
        ? kmem_cache_free+0x2cb/0x330
        ? anon_transport_class_unregister+0x20/0x20
        ? blk_mq_queue_reinit_work+0x110/0x110
        mutex_lock_nested+0x1b/0x20
        ? mutex_lock_nested+0x1b/0x20
        blk_mq_queue_reinit_work+0x18/0x110
        blk_mq_queue_reinit_dead+0x1c/0x20
        cpuhp_invoke_callback+0x1f2/0x810
        ? __flow_cache_shrink+0x160/0x160
        cpuhp_down_callbacks+0x42/0x80
        _cpu_down+0xb2/0xe0
        freeze_secondary_cpus+0xb6/0x390
        suspend_devices_and_enter+0x3b3/0xa40
        ? rcu_read_lock_sched_held+0x79/0x80
        pm_suspend+0x129/0x490
        state_store+0x82/0xf0
        kobj_attr_store+0xf/0x20
        sysfs_kf_write+0x45/0x60
        kernfs_fop_write+0x135/0x1c0
        __vfs_write+0x37/0x160
        ? rcu_read_lock_sched_held+0x79/0x80
        ? rcu_sync_lockdep_assert+0x2f/0x60
        ? __sb_start_write+0xd9/0x1c0
        ? vfs_write+0x1ad/0x1d0
        vfs_write+0xcd/0x1d0
        SyS_write+0x58/0xc0
        ? rcu_read_lock_sched_held+0x79/0x80
        do_syscall_64+0x8f/0x710
        ? trace_hardirqs_on_thunk+0x1a/0x1c
        entry_SYSCALL64_slow_path+0x25/0x25
      
      The CPU hotplug path holds cpu_hotplug.lock and then reinits all existing
      blk-mq queues under all_q_mutex; however, blk_mq_init_allocated_queue()
      acquires these two locks in the inverse order. This is due to commit
      eabe0659 ("block/mq: Cure cpu hotplug lock inversion"), which fixes a CPU
      hotplug lock inversion caused by the hotplug rework. That rework is still
      work in progress and lives in a -tip branch, so mainline cannot yet trigger
      the splat it cures, but the commit does break Linus's tree in the merge
      window. This patch therefore reverts the lock order and avoids the splat
      on Linus's tree.
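
      A minimal sketch of the restored ordering at the tail of
      blk_mq_init_allocated_queue(), based on the description above (statements
      abbreviated; treat the exact context as an assumption):

        	/* Take the CPU hotplug read lock first, then all_q_mutex, matching
        	 * the hotplug path (cpu_hotplug.lock -> all_q_mutex) so the circular
        	 * dependency reported above cannot form. */
        	get_online_cpus();
        	mutex_lock(&all_q_mutex);

        	list_add_tail(&q->all_q_node, &all_q_list);
        	blk_mq_add_queue_tag_set(set, q);
        	blk_mq_map_swqueue(q, cpu_online_mask);

        	mutex_unlock(&all_q_mutex);
        	put_online_cpus();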
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      51d638b1
  4. 04 May 2017, 3 commits
    • blk-mq: untangle debugfs and sysfs · 9c1051aa
      Authored by Omar Sandoval
      Originally, I tied debugfs registration/unregistration together with
      sysfs. There's no reason to do this, and it's getting in the way of
      letting schedulers define their own debugfs attributes. Instead, tie the
      debugfs registration to the lifetime of the structures themselves.
      
      The saner lifetimes mean we can also get rid of the extra mq directory
      and move everything one level up. I.e., nvme0n1/mq/hctx0/tags is now
      just nvme0n1/hctx0/tags.
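
      The idea, roughly: create and remove the debugfs directories where the
      queue and hctx structures themselves are set up and torn down, not from
      the sysfs register/unregister paths. A hedged sketch (helper names are
      assumptions based on later kernels, not quoted code):

        static int blk_mq_init_hctx(struct request_queue *q, struct blk_mq_tag_set *set,
        			    struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
        {
        	/* ... existing hctx setup ... */

        	/* debugfs entries now live exactly as long as the hctx; the
        	 * directory is <disk>/hctx0/... with no intermediate mq/ level */
        	blk_mq_debugfs_register_hctx(q, hctx);
        	return 0;
        }

        static void blk_mq_exit_hctx(struct request_queue *q, struct blk_mq_tag_set *set,
        			     struct blk_mq_hw_ctx *hctx, unsigned int hctx_idx)
        {
        	blk_mq_debugfs_unregister_hctx(hctx);
        	/* ... existing hctx teardown ... */
        }
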
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      9c1051aa
    • block/mq: Cure cpu hotplug lock inversion · eabe0659
      Authored by Peter Zijlstra
      By poking at /debug/sched_features I triggered the following splat:
      
       [] ======================================================
       [] WARNING: possible circular locking dependency detected
       [] 4.11.0-00873-g964c8b7-dirty #694 Not tainted
       [] ------------------------------------------------------
       [] bash/2109 is trying to acquire lock:
       []  (cpu_hotplug_lock.rw_sem){++++++}, at: [<ffffffff8120cb8b>] static_key_slow_dec+0x1b/0x50
       []
       [] but task is already holding lock:
       []  (&sb->s_type->i_mutex_key#4){+++++.}, at: [<ffffffff81140216>] sched_feat_write+0x86/0x170
       []
       [] which lock already depends on the new lock.
       []
       []
       [] the existing dependency chain (in reverse order) is:
       []
       [] -> #2 (&sb->s_type->i_mutex_key#4){+++++.}:
       []        lock_acquire+0x100/0x210
       []        down_write+0x28/0x60
       []        start_creating+0x5e/0xf0
       []        debugfs_create_dir+0x13/0x110
       []        blk_mq_debugfs_register+0x21/0x70
       []        blk_mq_register_dev+0x64/0xd0
       []        blk_register_queue+0x6a/0x170
       []        device_add_disk+0x22d/0x440
       []        loop_add+0x1f3/0x280
       []        loop_init+0x104/0x142
       []        do_one_initcall+0x43/0x180
       []        kernel_init_freeable+0x1de/0x266
       []        kernel_init+0xe/0x100
       []        ret_from_fork+0x31/0x40
       []
       [] -> #1 (all_q_mutex){+.+.+.}:
       []        lock_acquire+0x100/0x210
       []        __mutex_lock+0x6c/0x960
       []        mutex_lock_nested+0x1b/0x20
       []        blk_mq_init_allocated_queue+0x37c/0x4e0
       []        blk_mq_init_queue+0x3a/0x60
       []        loop_add+0xe5/0x280
       []        loop_init+0x104/0x142
       []        do_one_initcall+0x43/0x180
       []        kernel_init_freeable+0x1de/0x266
       []        kernel_init+0xe/0x100
       []        ret_from_fork+0x31/0x40
      
       []  *** DEADLOCK ***
       []
       [] 3 locks held by bash/2109:
       []  #0:  (sb_writers#11){.+.+.+}, at: [<ffffffff81292bcd>] vfs_write+0x17d/0x1a0
       []  #1:  (debugfs_srcu){......}, at: [<ffffffff8155a90d>] full_proxy_write+0x5d/0xd0
       []  #2:  (&sb->s_type->i_mutex_key#4){+++++.}, at: [<ffffffff81140216>] sched_feat_write+0x86/0x170
       []
       [] stack backtrace:
       [] CPU: 9 PID: 2109 Comm: bash Not tainted 4.11.0-00873-g964c8b7-dirty #694
       [] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.02.0002.122320131210 12/23/2013
       [] Call Trace:
      
       []  lock_acquire+0x100/0x210
       []  get_online_cpus+0x2a/0x90
       []  static_key_slow_dec+0x1b/0x50
       []  static_key_disable+0x20/0x30
       []  sched_feat_write+0x131/0x170
       []  full_proxy_write+0x97/0xd0
       []  __vfs_write+0x28/0x120
       []  vfs_write+0xb5/0x1a0
       []  SyS_write+0x49/0xa0
       []  entry_SYSCALL_64_fastpath+0x23/0xc2
      
      This is because of the cpu hotplug lock rework. Break the chain at #1
      by reversing the lock acquisition order. This way i_mutex_key#4 no
      longer depends on cpu_hotplug_lock and things are good.
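
      As a short sketch, the reversed ordering this commit introduces in
      blk_mq_init_allocated_queue() would look roughly as follows (abbreviated
      and assumed, not quoted; note that the 08 May entry above, 51d638b1,
      later restores the opposite order for mainline):

        	/* all_q_mutex before the CPU hotplug lock: breaks the chain at #1,
        	 * so i_mutex_key#4 no longer depends on cpu_hotplug_lock. */
        	mutex_lock(&all_q_mutex);
        	get_online_cpus();

        	list_add_tail(&q->all_q_node, &all_q_list);
        	blk_mq_add_queue_tag_set(set, q);
        	blk_mq_map_swqueue(q, cpu_online_mask);

        	put_online_cpus();
        	mutex_unlock(&all_q_mutex);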
      
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      eabe0659
    • blk-mq: don't use sync workqueue flushing from drivers · 2719aa21
      Authored by Jens Axboe
      A previous commit introduced the sync flush, which we need from
      internal callers like blk_mq_quiesce_queue(). However, we also
      call the stop helpers from drivers, particularly from ->queue_rq()
      when we have to stop processing for a bit. We can't block from
      those locations, and we don't have to guarantee that we're
      fully flushed.
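
      A hedged sketch of the resulting split, where only internal callers ask
      for the synchronous cancel (names follow the helper referenced by the
      ebd76857 entry above; the exact code may differ):

        static void __blk_mq_stop_hw_queue(struct blk_mq_hw_ctx *hctx, bool sync)
        {
        	if (sync)
        		cancel_delayed_work_sync(&hctx->run_work);	/* may sleep */
        	else
        		cancel_delayed_work(&hctx->run_work);	/* safe from ->queue_rq() */

        	set_bit(BLK_MQ_S_STOPPED, &hctx->state);
        }

        void blk_mq_stop_hw_queue(struct blk_mq_hw_ctx *hctx)
        {
        	__blk_mq_stop_hw_queue(hctx, false);	/* driver-facing: must not block */
        }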
      
      Fixes: 9f993737 ("blk-mq: unify hctx delayed_run_work and run_work")
      Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      2719aa21
  5. 03 May 2017, 1 commit
  6. 02 May 2017, 1 commit
  7. 28 Apr 2017, 2 commits
    • blk-mq: unify hctx delay_work and run_work · 21c6e939
      Authored by Jens Axboe
      The only difference between ->run_work and ->delay_work, is that
      the latter is used to defer running a queue. This is done by
      marking the queue stopped, and scheduling ->delay_work to run
      sometime in the future. While the queue is stopped, direct runs
      or runs through ->run_work will not run the queue.
      
      If we combine the handlers, then we need to handle two things:
      
      1) If a delayed/stopped run is scheduled, then we should not run
         the queue before that has been completed.
      2) If a queue is delayed/stopped, the handler needs to restart
         the queue. Normally a run of a queue with the stopped bit set
         would be a no-op.
      
      Case 1 is handled by modifying a currently pending queue run
      to the deadline set by the caller of blk_mq_delay_queue().
      Subsequent attempts to queue a queue run will find the work
      item already pending, and direct runs will see a stopped queue
      as before.
      
      Case 2 is handled by adding a new bit, BLK_MQ_S_START_ON_RUN,
      that tells the work handler that it should clear a stopped
      queue and run the handler.
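
      As a sketch of case 2, the unified work handler might look like this
      (reconstructed from the description above; not a quote of the patch):

        static void blk_mq_run_work_fn(struct work_struct *work)
        {
        	struct blk_mq_hw_ctx *hctx =
        		container_of(work, struct blk_mq_hw_ctx, run_work.work);

        	if (test_bit(BLK_MQ_S_STOPPED, &hctx->state)) {
        		/* plain stopped queue: a run is a no-op, as before */
        		if (!test_bit(BLK_MQ_S_START_ON_RUN, &hctx->state))
        			return;

        		/* case 2: this run was scheduled via blk_mq_delay_queue(),
        		 * so clear the stop and restart the queue */
        		clear_bit(BLK_MQ_S_START_ON_RUN, &hctx->state);
        		clear_bit(BLK_MQ_S_STOPPED, &hctx->state);
        	}

        	__blk_mq_run_hw_queue(hctx);
        }
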
      Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      21c6e939
    • blk-mq: unify hctx delayed_run_work and run_work · 9f993737
      Authored by Jens Axboe
      They serve the exact same purpose. Get rid of the non-delayed
      work variant, and just run it without delay for the normal case.
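
      In other words, a single delayed-work item now covers both paths, with the
      immediate case simply scheduled at a zero delay. A rough sketch (helper
      names approximate; treat them as assumptions):

        static void __blk_mq_delay_run_hw_queue(struct blk_mq_hw_ctx *hctx,
        					unsigned long msecs)
        {
        	/* msecs == 0 is the old run_work case, msecs > 0 the delayed one */
        	kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
        					 &hctx->run_work,
        					 msecs_to_jiffies(msecs));
        }
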
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Bart Van Assche <Bart.VanAssche@sandisk.com>
      Reviewed-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      9f993737
  8. 22 Apr 2017, 1 commit
    • blk-mq: Fix preempt count imbalance · abc25a69
      Authored by Bart Van Assche
      Avoid that the following kernel bug gets triggered:
      
      BUG: sleeping function called from invalid context at ./include/linux/buffer_head.h:349
      in_atomic(): 1, irqs_disabled(): 0, pid: 8019, name: find
      CPU: 10 PID: 8019 Comm: find Tainted: G        W I     4.11.0-rc4-dbg+ #2
      Call Trace:
       dump_stack+0x68/0x93
       ___might_sleep+0x16e/0x230
       __might_sleep+0x4a/0x80
       __ext4_get_inode_loc+0x1e0/0x4e0
       ext4_iget+0x70/0xbc0
       ext4_iget_normal+0x2f/0x40
       ext4_lookup+0xb6/0x1f0
       lookup_slow+0x104/0x1e0
       walk_component+0x19a/0x330
       path_lookupat+0x4b/0x100
       filename_lookup+0x9a/0x110
       user_path_at_empty+0x36/0x40
       vfs_statx+0x67/0xc0
       SYSC_newfstatat+0x20/0x40
       SyS_newfstatat+0xe/0x10
       entry_SYSCALL_64_fastpath+0x18/0xad
      
      This happens because the big if/else in blk_mq_make_request() doesn't
      have a final else section that also drops the ctx. Add that.
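
      The shape of the added branch, as a sketch (the surrounding branches and
      the exact arguments of the insert call are abbreviated assumptions):

        	} else {
        		/* final else: drop the ctx here too, otherwise the preempt
        		 * count stays elevated and code that may later sleep trips
        		 * the BUG quoted above */
        		blk_mq_put_ctx(data.ctx);
        		blk_mq_bio_to_request(rq, bio);
        		blk_mq_sched_insert_request(rq, false, true, true, true);
        	}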
      
      Fixes: b00c53e8 ("blk-mq: fix schedule-while-atomic with scheduler attached")
      Signed-off-by: Bart Van Assche <bart.vanassche@sandisk.com>
      Cc: Omar Sandoval <osandov@fb.com>
      
      Added a bit more to the commit log.
      Signed-off-by: Jens Axboe <axboe@fb.com>
      abc25a69
  9. 21 Apr 2017, 9 commits
  10. 20 Apr 2017, 1 commit
  11. 15 Apr 2017, 2 commits
  12. 08 Apr 2017, 6 commits
  13. 07 Apr 2017, 4 commits
    • blk-mq: remap queues when adding/removing hardware queues · ebe8bddb
      Authored by Omar Sandoval
      blk_mq_update_nr_hw_queues() used to remap hardware queues, which is the
      behavior that drivers expect. However, commit 4e68a011 changed
      blk_mq_queue_reinit() to not remap queues for the case of CPU
      hotplugging, inadvertently making blk_mq_update_nr_hw_queues() not remap
      queues as well. This breaks, for example, NBD's multi-connection mode,
      leaving the added hardware queues unused. Fix it by making
      blk_mq_update_nr_hw_queues() explicitly remap the queues.
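
      A minimal sketch of the explicit remap (helper names as in the blk-mq
      code of that period; treat the exact calls as assumptions):

        void blk_mq_update_nr_hw_queues(struct blk_mq_tag_set *set, int nr_hw_queues)
        {
        	struct request_queue *q;

        	/* ... freeze queues, update set->nr_hw_queues, realloc hctxs ... */

        	list_for_each_entry(q, &set->tag_list, tag_set_list)
        		blk_mq_queue_reinit(q, cpu_online_mask);	/* remap sw->hw queues */

        	/* ... unfreeze queues ... */
        }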
      
      Fixes: 4e68a011 ("blk-mq: don't redistribute hardware queues on a CPU hotplug event")
      Reviewed-by: Keith Busch <keith.busch@intel.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      ebe8bddb
    • blk-mq-sched: fix crash in switch error path · 54d5329d
      Authored by Omar Sandoval
      In elevator_switch(), if blk_mq_init_sched() fails, we attempt to fall
      back to the original scheduler. However, at this point, we've already
      torn down the original scheduler's tags, so this causes a crash. Doing
      the fallback the way the legacy elevator path does is much harder for mq,
      so fix it by just falling back to none instead.
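
      A rough sketch of the idea only (error handling abbreviated; treat the
      details as assumptions):

        	/* mq path of elevator_switch(): on failure, do not try to re-attach
        	 * the old elevator (its tags are already gone); leave the queue
        	 * with no scheduler, i.e. "none". */
        	err = blk_mq_init_sched(q, new_e);
        	if (err) {
        		pr_warn("elv: switch to %s failed, queue now using none\n",
        			new_e->elevator_name);
        		return err;
        	}
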
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      54d5329d
    • blk-mq-sched: set up scheduler tags when bringing up new queues · 93252632
      Authored by Omar Sandoval
      If a new hardware queue is added at runtime, we don't allocate scheduler
      tags for it, leading to a crash. This hooks up the scheduler framework
      to blk_mq_{init,exit}_hctx() to make sure everything gets properly
      initialized/freed.
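
      A sketch of the hookup (the helper names are assumptions based on the
      blk-mq-sched API of this period, not quoted from the patch):

        	/* in blk_mq_init_hctx(): allocate scheduler tags for the new hctx */
        	if (blk_mq_sched_init_hctx(q, hctx, hctx_idx))
        		goto exit_hctx;		/* unwind the rest of the hctx setup */

        	/* in blk_mq_exit_hctx(): release the scheduler tags again */
        	blk_mq_sched_exit_hctx(q, hctx, hctx_idx);
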
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      93252632
    • blk-mq: use the right hctx when getting a driver tag fails · 81380ca1
      Authored by Omar Sandoval
      While dispatching requests, if we fail to get a driver tag, we mark the
      hardware queue as waiting for a tag and put the requests on a
      hctx->dispatch list to be run later when a driver tag is freed. However,
      blk_mq_dispatch_rq_list() may dispatch requests from multiple hardware
      queues if using a single-queue scheduler with a multiqueue device. If
      blk_mq_get_driver_tag() fails, it doesn't update the hardware queue we
      are processing. This means we end up using the hardware queue of the
      previous request, which may or may not be the same as that of the
      current request. If it isn't, the wrong hardware queue will end up
      waiting for a tag, and the requests will be on the wrong dispatch list,
      leading to a hang.
      
      The fix is twofold:
      
      1. Make sure we save which hardware queue we were trying to get a
         request for in blk_mq_get_driver_tag() regardless of whether it
         succeeds or not.
      2. Make blk_mq_dispatch_rq_list() take a request_queue instead of a
         blk_mq_hw_queue to make it clear that it must handle multiple
         hardware queues, since I've already messed this up on a couple of
         occasions.
      
      This didn't appear in testing with nvme and mq-deadline because nvme has
      more driver tags than the default number of scheduler tags. However,
      with the blk_mq_update_nr_hw_queues() fix, it showed up with nbd.
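
      A sketch of point 1, where the caller's hctx pointer is updated regardless
      of the outcome (abbreviated; shared-tag handling omitted and the details
      treated as assumptions):

        bool blk_mq_get_driver_tag(struct request *rq, struct blk_mq_hw_ctx **hctx,
        			   bool wait)
        {
        	struct blk_mq_alloc_data data = {
        		.q = rq->q,
        		.hctx = blk_mq_map_queue(rq->q, rq->mq_ctx->cpu),
        		.flags = wait ? 0 : BLK_MQ_REQ_NOWAIT,
        	};

        	rq->tag = blk_mq_get_tag(&data);

        	/* previously only done on success: now the caller always learns
        	 * which hctx was used, even when no tag could be allocated */
        	if (hctx)
        		*hctx = data.hctx;

        	return rq->tag != -1;
        }
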
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@fb.com>
      81380ca1
  14. 05 Apr 2017, 1 commit
  15. 31 Mar 2017, 1 commit
  16. 30 Mar 2017, 4 commits