1. 15 Jun 2019, 1 commit
  2. 31 May 2019, 2 commits
    • block: sed-opal: fix IOC_OPAL_ENABLE_DISABLE_MBR · 2b18febc
      Committed by David Kozub
      [ Upstream commit 78bf47353b0041865564deeed257a54f047c2fdc ]
      
      The implementation of IOC_OPAL_ENABLE_DISABLE_MBR handled the value
      opal_mbr_data.enable_disable incorrectly: enable_disable is expected
      to be one of OPAL_MBR_ENABLE(0) or OPAL_MBR_DISABLE(1). enable_disable
      was passed directly to set_mbr_done and set_mbr_enable_disable, where
      it was interpreted as either OPAL_TRUE(1) or OPAL_FALSE(0). The end
      result was that calling IOC_OPAL_ENABLE_DISABLE_MBR with OPAL_MBR_ENABLE
      actually disabled the shadow MBR and vice versa.
      
      This patch adds correct conversion from OPAL_MBR_DISABLE/ENABLE to
      OPAL_FALSE/TRUE. The change affects existing programs using
      IOC_OPAL_ENABLE_DISABLE_MBR but this is typically used only once when
      setting up an Opal drive.
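
      The essence of the fix is a one-line conversion before the value is
      passed on. A minimal sketch of the idea (simplified; not the exact
      upstream diff):

        /* Map the ioctl constant onto the boolean the lower layers expect:
         * OPAL_MBR_ENABLE(0) -> OPAL_TRUE(1),
         * OPAL_MBR_DISABLE(1) -> OPAL_FALSE(0). */
        u8 enable_disable = opal_mbr->enable_disable == OPAL_MBR_ENABLE ?
                OPAL_TRUE : OPAL_FALSE;

        /* enable_disable, not the raw ioctl value, is then passed to
         * set_mbr_done() and set_mbr_enable_disable(). */
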
      Acked-by: Jon Derrick <jonathan.derrick@intel.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Scott Bauer <sbauer@plzdonthack.me>
      Signed-off-by: David Kozub <zub@linux.fjfi.cvut.cz>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
    • block: fix use-after-free on gendisk · ad393793
      Committed by Yufen Yu
      [ Upstream commit 2c88e3c7ec32d7a40cc7c9b4a487cf90e4671bdd ]
      
      Commit 2da78092 ("block: Fix dev_t minor allocation lifetime")
      specifically moved the blk_free_devt(dev->devt) call to part_release()
      to avoid reallocating the device number before the device is fully
      shut down.
      
      However, it can cause a use-after-free on the gendisk in get_gendisk().
      We use an md device as an example to show the race:
      
      Process1                Worker                  Process2
      md_free
                                                      blkdev_open
      del_gendisk
        add delete_partition_work_fn() to wq
                                                      __blkdev_get
                                                      get_gendisk
      put_disk
        disk_release
          kfree(disk)
                                                      find part from ext_devt_idr
                                                      get_disk_and_module(disk)
                                                        use-after-free

                              delete_partition_work_fn
                                put_device(part)
                                  part_release
                                    remove part from ext_devt_idr
      
      Before the <devt, hd_struct pointer> pair is removed from ext_devt_idr
      by delete_partition_work_fn(), we can still look up the devt and then
      reach the gendisk through the hd_struct pointer. If the gendisk has
      already been freed at that point, this is the use-after-free on the
      gendisk in get_gendisk().
      
      We fix this by adding a new helper blk_invalidate_devt() in
      delete_partition() and del_gendisk(). It replaces hd_struct
      pointer in idr with value 'NULL', and deletes the entry from
      idr in part_release() as we do now.
      
      Thanks to Jan Kara for providing the solution and more clear comments
      for the code.
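
      A minimal sketch of what such a helper looks like (assuming the
      existing ext_devt_idr/ext_devt_lock machinery in block/genhd.c;
      simplified from the upstream change):

        /* Invalidate the idr entry so lookups fail, but keep the devt
         * reserved until part_release() finally removes it. */
        void blk_invalidate_devt(dev_t devt)
        {
                if (MAJOR(devt) == BLOCK_EXT_MAJOR) {
                        spin_lock_bh(&ext_devt_lock);
                        idr_replace(&ext_devt_idr, NULL,
                                    blk_mangle_minor(MINOR(devt)));
                        spin_unlock_bh(&ext_devt_lock);
                }
        }

      get_gendisk() then finds a NULL entry for a dying device and fails the
      lookup instead of dereferencing freed memory.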
      
      Fixes: 2da78092 ("block: Fix dev_t minor allocation lifetime")
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Reviewed-by: Bart Van Assche <bvanassche@acm.org>
      Reviewed-by: Keith Busch <keith.busch@intel.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Suggested-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Yufen Yu <yuyufen@huawei.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  3. 17 May 2019, 1 commit
  4. 08 May 2019, 2 commits
  5. 20 Apr 2019, 1 commit
  6. 17 Apr 2019, 1 commit
  7. 06 Apr 2019, 1 commit
    • block, bfq: fix in-service-queue check for queue merging · 06666a19
      Committed by Paolo Valente
      [ Upstream commit 058fdecc6de7cdecbf4c59b851e80eb2d6c5295f ]
      
      When a new I/O request arrives for a bfq_queue, say Q, bfq checks
      whether that request is close to
      (a) the head request of some other queue waiting to be served, or
      (b) the last request dispatched for the in-service queue (in case Q
      itself is not the in-service queue)
      
      If a queue, say Q2, is found for which the above condition holds, then
      bfq merges Q and Q2, hopefully obtaining more sequential I/O in the
      resulting merged queue, and thus possibly higher throughput.
      
      Case (b) is checked by comparing the new request for Q with the last
      request dispatched, assuming that the latter necessarily belonged to the
      in-service queue. Unfortunately, this assumption is no longer always
      correct, since commit d0edc2473be9 ("block, bfq: inject other-queue I/O
      into seeky idle queues on NCQ flash").
      
      When the assumption does not hold, queues that must not be merged may be
      merged, causing unexpected loss of control on per-queue service
      guarantees.
      
      This commit solves this problem by adding an extra field, which stores
      the actual last request dispatched for the in-service queue, and by
      using this new field to correctly check case (b).
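
      A sketch of the idea (illustrative; the field name is an assumption,
      but the upstream patch records the position of the last request
      dispatched for the in-service queue and uses it in the check):

        /* In struct bfq_data: position of the last request dispatched
         * for the in-service queue (illustrative name). */
        sector_t in_serv_last_pos;

        /* Case (b) then compares against this field instead of assuming
         * the last dispatched request belonged to the in-service queue: */
        if (bfq_rq_close_to_sector(io_struct, request,
                                   bfqd->in_serv_last_pos))
                /* ... Q and the in-service queue are merge candidates ... */
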
      Signed-off-by: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  8. 24 Mar 2019, 1 commit
    • blk-mq: insert rq with DONTPREP to hctx dispatch list when requeue · 29452f66
      Committed by Jianchao Wang
      [ Upstream commit aef1897cd36dcf5e296f1d2bae7e0d268561b685 ]
      
      When a request is requeued with RQF_DONTPREP set, it already contains
      driver-specific data, so insert it into the hctx dispatch list to
      avoid any merging. Taking SCSI as an example, here is the trace event
      log (no I/O scheduler; with one, RQF_STARTED would prevent merging):
      
         kworker/0:1H-339   [000] ...1  2037.209289: block_rq_insert: 8,0 R 4096 () 32768 + 8 [kworker/0:1H]
      scsi_inert_test-1987  [000] ....  2037.220465: block_bio_queue: 8,0 R 32776 + 8 [scsi_inert_test]
      scsi_inert_test-1987  [000] ...2  2037.220466: block_bio_backmerge: 8,0 R 32776 + 8 [scsi_inert_test]
         kworker/0:1H-339   [000] ....  2047.220913: block_rq_issue: 8,0 R 8192 () 32768 + 16 [kworker/0:1H]
      scsi_inert_test-1996  [000] ..s1  2047.221007: block_rq_complete: 8,0 R () 32768 + 8 [0]
      scsi_inert_test-1996  [000] .Ns1  2047.221045: block_rq_requeue: 8,0 R () 32776 + 8 [0]
         kworker/0:1H-339   [000] ...1  2047.221054: block_rq_insert: 8,0 R 4096 () 32776 + 8 [kworker/0:1H]
         kworker/0:1H-339   [000] ...1  2047.221056: block_rq_issue: 8,0 R 4096 () 32776 + 8 [kworker/0:1H]
      scsi_inert_test-1986  [000] ..s1  2047.221119: block_rq_complete: 8,0 R () 32776 + 8 [0]
      
      (32768 + 8) was requeued by scsi_queue_insert and had RQF_DONTPREP
      set. It was then merged with (32776 + 8) and issued. Due to
      RQF_DONTPREP, the sdb only contained the (32768 + 8) part, so only
      that part was completed. Luckily, scsi_io_completion detected this and
      requeued the remaining part, so we didn't get corrupted data. However,
      the requeue of (32776 + 8) is not expected.
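
      A sketch of the fix in blk_mq_requeue_work() (simplified):

        if (rq->rq_flags & RQF_DONTPREP)
                /* Driver-private data is already set up: bypass the
                 * scheduler so the request can never be merged. */
                blk_mq_request_bypass_insert(rq, false);
        else
                blk_mq_sched_insert_request(rq, true, false, false);
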
      Suggested-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  9. 14 Mar 2019, 1 commit
    • blk-iolatency: fix IO hang due to negative inflight counter · 6d482bc5
      Committed by Liu Bo
      [ Upstream commit 8c772a9bfc7c07c76f4a58b58910452fbb20843b ]
      
      Our test reported the following stack, and the vmcore showed that the
      ->inflight counter was -1.
      
      [ffffc9003fcc38d0] __schedule at ffffffff8173d95d
      [ffffc9003fcc3958] schedule at ffffffff8173de26
      [ffffc9003fcc3970] io_schedule at ffffffff810bb6b6
      [ffffc9003fcc3988] blkcg_iolatency_throttle at ffffffff813911cb
      [ffffc9003fcc3a20] rq_qos_throttle at ffffffff813847f3
      [ffffc9003fcc3a48] blk_mq_make_request at ffffffff8137468a
      [ffffc9003fcc3b08] generic_make_request at ffffffff81368b49
      [ffffc9003fcc3b68] submit_bio at ffffffff81368d7d
      [ffffc9003fcc3bb8] ext4_io_submit at ffffffffa031be00 [ext4]
      [ffffc9003fcc3c00] ext4_writepages at ffffffffa03163de [ext4]
      [ffffc9003fcc3d68] do_writepages at ffffffff811c49ae
      [ffffc9003fcc3d78] __filemap_fdatawrite_range at ffffffff811b6188
      [ffffc9003fcc3e30] filemap_write_and_wait_range at ffffffff811b6301
      [ffffc9003fcc3e60] ext4_sync_file at ffffffffa030cee8 [ext4]
      [ffffc9003fcc3ea8] vfs_fsync_range at ffffffff8128594b
      [ffffc9003fcc3ee8] do_fsync at ffffffff81285abd
      [ffffc9003fcc3f18] sys_fsync at ffffffff81285d50
      [ffffc9003fcc3f28] do_syscall_64 at ffffffff81003c04
      [ffffc9003fcc3f50] entry_SYSCALL_64_after_swapgs at ffffffff81742b8e
      
      The ->inflight counter may be negative (-1) if
      
      1) blk-iolatency was disabled when the IO was issued,
      
      2) blk-iolatency was enabled before this IO reached its endio,
      
      3) the ->inflight counter is decreased from 0 to -1 in endio()
      
      In fact, the hang can easily be reproduced by the script below:
      
      H=/sys/fs/cgroup/unified/
      P=/sys/fs/cgroup/unified/test
      
      echo "+io" > $H/cgroup.subtree_control
      mkdir -p $P
      
      echo $$ > $P/cgroup.procs
      
      xfs_io -f -d -c "pwrite 0 4k" /dev/sdg
      
      echo "`cat /sys/block/sdg/dev` target=1000000" > $P/io.latency
      
      xfs_io -f -d -c "pwrite 0 4k" /dev/sdg
      
      This fixes the problem by freezing the queue so that no request is in
      flight while iolatency is being enabled or disabled.

      Note that quiescing the queue is not needed: this only updates the
      iolatency configuration, which the dispatch path does not care about.
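
      A minimal sketch of the approach taken when the io.latency
      configuration changes (simplified):

        blk_mq_freeze_queue(blkg->q);   /* drains all inflight requests */
        /* ... enable or disable iolatency accounting; nothing can race
         *     with the ->inflight counter while the queue is frozen ... */
        blk_mq_unfreeze_queue(blkg->q);
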
      Signed-off-by: Liu Bo <bo.liu@linux.alibaba.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  10. 20 Feb 2019, 1 commit
    • blk-mq: fix a hung issue when fsync · 80c8452a
      Committed by Jianchao Wang
      [ Upstream commit 85bd6e61f34dffa8ec2dc75ff3c02ee7b2f1cbce ]
      
      Florian reported an I/O hang when calling fsync(). It is triggered by
      the following race condition:
      
      data + post flush         a flush
      
      blk_flush_complete_seq
        case REQ_FSEQ_DATA
          blk_flush_queue_rq
          issued to driver      blk_mq_dispatch_rq_list
                                  try to issue a flush req
                                  failed due to NON-NCQ command
                                  .queue_rq return BLK_STS_DEV_RESOURCE
      
      request completion
        req->end_io // doesn't check RESTART
        mq_flush_data_end_io
          case REQ_FSEQ_POSTFLUSH
            blk_kick_flush
              do nothing because previous flush
              has not been completed
           blk_mq_run_hw_queue
                                    insert rq to hctx->dispatch
                                    due to RESTART is still set, do nothing
      
      To fix this, replace the blk_mq_run_hw_queue in mq_flush_data_end_io
      with blk_mq_sched_restart to check and clear the RESTART flag.
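
      A sketch of the substitution (simplified):

        /* In mq_flush_data_end_io(), after blk_kick_flush(): */
        blk_mq_sched_restart(hctx);   /* was: blk_mq_run_hw_queue(hctx, true);
                                       * restart checks and clears the
                                       * RESTART flag first */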
      
      Fixes: bd166ef1 ("blk-mq-sched: add framework for MQ capable IO schedulers")
      Reported-by: Florian Stecker <m19@florianstecker.de>
      Tested-by: Florian Stecker <m19@florianstecker.de>
      Signed-off-by: Jianchao Wang <jianchao.w.wang@oracle.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
  11. 23 Jan 2019, 1 commit
    • block: use rcu_work instead of call_rcu to avoid sleep in softirq · 4cc66cc4
      Committed by Yufen Yu
      commit 94a2c3a32b62e868dc1e3d854326745a7f1b8c7a upstream.
      
      We recently got the following stack from syzkaller:
      
      BUG: sleeping function called from invalid context at mm/slab.h:361
      in_atomic(): 1, irqs_disabled(): 0, pid: 6644, name: blkid
      INFO: lockdep is turned off.
      CPU: 1 PID: 6644 Comm: blkid Not tainted 4.4.163-514.55.6.9.x86_64+ #76
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
       0000000000000000 5ba6a6b879e50c00 ffff8801f6b07b10 ffffffff81cb2194
       0000000041b58ab3 ffffffff833c7745 ffffffff81cb2080 5ba6a6b879e50c00
       0000000000000000 0000000000000001 0000000000000004 0000000000000000
      Call Trace:
       <IRQ>  [<ffffffff81cb2194>] __dump_stack lib/dump_stack.c:15 [inline]
       <IRQ>  [<ffffffff81cb2194>] dump_stack+0x114/0x1a0 lib/dump_stack.c:51
       [<ffffffff8129a981>] ___might_sleep+0x291/0x490 kernel/sched/core.c:7675
       [<ffffffff8129ac33>] __might_sleep+0xb3/0x270 kernel/sched/core.c:7637
       [<ffffffff81794c13>] slab_pre_alloc_hook mm/slab.h:361 [inline]
       [<ffffffff81794c13>] slab_alloc_node mm/slub.c:2610 [inline]
       [<ffffffff81794c13>] slab_alloc mm/slub.c:2692 [inline]
       [<ffffffff81794c13>] kmem_cache_alloc_trace+0x2c3/0x5c0 mm/slub.c:2709
       [<ffffffff81cbe9a7>] kmalloc include/linux/slab.h:479 [inline]
       [<ffffffff81cbe9a7>] kzalloc include/linux/slab.h:623 [inline]
       [<ffffffff81cbe9a7>] kobject_uevent_env+0x2c7/0x1150 lib/kobject_uevent.c:227
       [<ffffffff81cbf84f>] kobject_uevent+0x1f/0x30 lib/kobject_uevent.c:374
       [<ffffffff81cbb5b9>] kobject_cleanup lib/kobject.c:633 [inline]
       [<ffffffff81cbb5b9>] kobject_release+0x229/0x440 lib/kobject.c:675
       [<ffffffff81cbb0a2>] kref_sub include/linux/kref.h:73 [inline]
       [<ffffffff81cbb0a2>] kref_put include/linux/kref.h:98 [inline]
       [<ffffffff81cbb0a2>] kobject_put+0x72/0xd0 lib/kobject.c:692
       [<ffffffff8216f095>] put_device+0x25/0x30 drivers/base/core.c:1237
       [<ffffffff81c4cc34>] delete_partition_rcu_cb+0x1d4/0x2f0 block/partition-generic.c:232
       [<ffffffff813c08bc>] __rcu_reclaim kernel/rcu/rcu.h:118 [inline]
       [<ffffffff813c08bc>] rcu_do_batch kernel/rcu/tree.c:2705 [inline]
       [<ffffffff813c08bc>] invoke_rcu_callbacks kernel/rcu/tree.c:2973 [inline]
       [<ffffffff813c08bc>] __rcu_process_callbacks kernel/rcu/tree.c:2940 [inline]
       [<ffffffff813c08bc>] rcu_process_callbacks+0x59c/0x1c70 kernel/rcu/tree.c:2957
       [<ffffffff8120f509>] __do_softirq+0x299/0xe20 kernel/softirq.c:273
       [<ffffffff81210496>] invoke_softirq kernel/softirq.c:350 [inline]
       [<ffffffff81210496>] irq_exit+0x216/0x2c0 kernel/softirq.c:391
       [<ffffffff82c2cd7b>] exiting_irq arch/x86/include/asm/apic.h:652 [inline]
       [<ffffffff82c2cd7b>] smp_apic_timer_interrupt+0x8b/0xc0 arch/x86/kernel/apic/apic.c:926
       [<ffffffff82c2bc25>] apic_timer_interrupt+0xa5/0xb0 arch/x86/entry/entry_64.S:746
       <EOI>  [<ffffffff814cbf40>] ? audit_kill_trees+0x180/0x180
       [<ffffffff8187d2f7>] fd_install+0x57/0x80 fs/file.c:626
       [<ffffffff8180989e>] do_sys_open+0x45e/0x550 fs/open.c:1043
       [<ffffffff818099c2>] SYSC_open fs/open.c:1055 [inline]
       [<ffffffff818099c2>] SyS_open+0x32/0x40 fs/open.c:1050
       [<ffffffff82c299e1>] entry_SYSCALL_64_fastpath+0x1e/0x9a
      
      In softirq context, the RCU callback delete_partition_rcu_cb() is
      invoked, and it may allocate memory with kzalloc using the GFP_KERNEL
      flag. If the allocation cannot be satisfied immediately, it may sleep.
      However, that is not allowed in softirq context.
      
      Although we found this problem on Linux 4.4, the latest kernel version
      seems to have this problem as well. It is very similar to the
      previous one:
      	https://lkml.org/lkml/2018/7/9/391
      
      Fix it by using RCU workqueue, which allows sleep.
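
      A sketch of the conversion (simplified; the hd_struct's rcu_head is
      replaced by a struct rcu_work so the callback runs in process
      context, where sleeping is allowed):

        INIT_RCU_WORK(&part->rcu_work, delete_partition_work_fn);
        queue_rcu_work(system_wq, &part->rcu_work);
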
      Reviewed-by: Paul E. McKenney <paulmck@linux.ibm.com>
      Signed-off-by: Yufen Yu <yuyufen@huawei.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  12. 13 Jan 2019, 2 commits
    • block: mq-deadline: Fix write completion handling · 6353c0a0
      Committed by Damien Le Moal
      commit 7211aef86f79583e59b88a0aba0bc830566f7e8e upstream.
      
      For a zoned block device using mq-deadline, if a write request for a
      zone is received while another write was already dispatched for the same
      zone, dd_dispatch_request() will return NULL and the newly inserted
      write request is kept in the scheduler queue waiting for the ongoing
      zone write to complete. With this behavior, when no other request has
      been dispatched, rq_list in blk_mq_sched_dispatch_requests() is empty
      and blk_mq_sched_mark_restart_hctx() is not called. This in turn leads
      the __blk_mq_free_request() call of blk_mq_sched_restart() to not run
      the queue when the already dispatched write request completes. The
      newly inserted write request stays stuck in the scheduler queue until
      eventually another request is submitted.
      
      This problem does not affect SCSI disks, as the SCSI stack handles
      queue restarts on request completion. However, it can be triggered
      with the null_blk driver with zoned mode enabled.
      
      Fix this by always requesting a queue restart in dd_dispatch_request()
      if no request was dispatched while WRITE requests are queued.
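
      A sketch of the check added to dd_dispatch_request() (simplified):

        rq = __dd_dispatch_request(dd);
        if (!rq && blk_queue_is_zoned(hctx->queue) &&
            !list_empty(&dd->fifo_list[WRITE]))
                /* Nothing dispatched but writes are queued: make sure the
                 * queue is run again when the ongoing zone write
                 * completes. */
                blk_mq_sched_mark_restart_hctx(hctx);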
      
      Fixes: 5700f691 ("mq-deadline: Introduce zone locking support")
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      
      Add missing export of blk_mq_sched_restart()
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • block: deactivate blk_stat timer in wbt_disable_default() · 69e9b285
      Committed by Ming Lei
      commit 544fbd16a461a318cd80537d1331c0df5c6cf930 upstream.
      
      rwb_enabled() can't be changed when there is any inflight IO.
      
      wbt_disable_default() may set rwb->wb_normal to zero; however, the
      blk_stat timer may still be pending, and the timer function will then
      update rwb->wb_normal again.
      
      This patch introduces blk_stat_deactivate() and applies it in
      wbt_disable_default(), fixing the following IO hang triggered when
      running parted and switching the io scheduler:
      
      [  369.937806] INFO: task parted:3645 blocked for more than 120 seconds.
      [  369.938941]       Not tainted 4.20.0-rc6-00284-g906c801e5248 #498
      [  369.939797] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [  369.940768] parted          D    0  3645   3239 0x00000000
      [  369.941500] Call Trace:
      [  369.941874]  ? __schedule+0x6d9/0x74c
      [  369.942392]  ? wbt_done+0x5e/0x5e
      [  369.942864]  ? wbt_cleanup_cb+0x16/0x16
      [  369.943404]  ? wbt_done+0x5e/0x5e
      [  369.943874]  schedule+0x67/0x78
      [  369.944298]  io_schedule+0x12/0x33
      [  369.944771]  rq_qos_wait+0xb5/0x119
      [  369.945193]  ? karma_partition+0x1c2/0x1c2
      [  369.945691]  ? wbt_cleanup_cb+0x16/0x16
      [  369.946151]  wbt_wait+0x85/0xb6
      [  369.946540]  __rq_qos_throttle+0x23/0x2f
      [  369.947014]  blk_mq_make_request+0xe6/0x40a
      [  369.947518]  generic_make_request+0x192/0x2fe
      [  369.948042]  ? submit_bio+0x103/0x11f
      [  369.948486]  ? __radix_tree_lookup+0x35/0xb5
      [  369.949011]  submit_bio+0x103/0x11f
      [  369.949436]  ? blkg_lookup_slowpath+0x25/0x44
      [  369.949962]  submit_bio_wait+0x53/0x7f
      [  369.950469]  blkdev_issue_flush+0x8a/0xae
      [  369.951032]  blkdev_fsync+0x2f/0x3a
      [  369.951502]  do_fsync+0x2e/0x47
      [  369.951887]  __x64_sys_fsync+0x10/0x13
      [  369.952374]  do_syscall_64+0x89/0x149
      [  369.952819]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
      [  369.953492] RIP: 0033:0x7f95a1e729d4
      [  369.953996] Code: Bad RIP value.
      [  369.954456] RSP: 002b:00007ffdb570dd48 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
      [  369.955506] RAX: ffffffffffffffda RBX: 000055c2139c6be0 RCX: 00007f95a1e729d4
      [  369.956389] RDX: 0000000000000001 RSI: 0000000000001261 RDI: 0000000000000004
      [  369.957325] RBP: 0000000000000002 R08: 0000000000000000 R09: 000055c2139c6ce0
      [  369.958199] R10: 0000000000000000 R11: 0000000000000246 R12: 000055c2139c0380
      [  369.959143] R13: 0000000000000004 R14: 0000000000000100 R15: 0000000000000008
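
      A sketch of the helper and its use (simplified):

        /* Stop the blk_stat callback's timer so it can no longer rearm
         * and rewrite rwb->wb_normal behind our back. */
        static inline void blk_stat_deactivate(struct blk_stat_callback *cb)
        {
                del_timer_sync(&cb->timer);
        }

        /* wbt_disable_default() calls blk_stat_deactivate(rwb->cb)
         * before clearing rwb->wb_normal. */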
      
      Cc: stable@vger.kernel.org
      Cc: Paolo Valente <paolo.valente@linaro.org>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  13. 20 Dec 2018, 1 commit
  14. 08 Dec 2018, 2 commits
    • blk-mq: punt failed direct issue to dispatch list · 55cbeea7
      Committed by Jens Axboe
      commit c616cbee97aed4bc6178f148a7240206dcdb85a6 upstream.
      
      After the direct dispatch corruption fix, we permanently disallow direct
      dispatch of non read/write requests. This works fine off the normal IO
      path, as they will be retried like any other failed direct dispatch
      request. But for the blk_insert_cloned_request() that only DM uses to
      bypass the bottom level scheduler, we always first attempt direct
      dispatch. For some types of requests, that's now a permanent failure,
      and no amount of retrying will make that succeed. This results in a
      livelock.
      
      Instead of making special cases for what we can direct issue, and now
      having to deal with DM solving the livelock while still retaining a BUSY
      condition feedback loop, always just add a request that has been through
      ->queue_rq() to the hardware queue dispatch list. These are safe to use
      as no merging can take place there. Additionally, if requests do have
      prepped data from drivers, we aren't dependent on them not sharing space
      in the request structure to safely add them to the IO scheduler lists.
      
      This basically reverts ffe81d45322c and is based on a patch from Ming,
      but with the list insert case covered as well.
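
      A sketch of the resulting error handling in the direct-issue path
      (simplified):

        ret = __blk_mq_try_issue_directly(hctx, rq, cookie, false);
        if (ret == BLK_STS_RESOURCE || ret == BLK_STS_DEV_RESOURCE)
                /* The request has been through ->queue_rq() once: park it
                 * on the hctx dispatch list, where no merging can occur. */
                blk_mq_request_bypass_insert(rq, true);
        else if (ret != BLK_STS_OK)
                blk_mq_end_request(rq, ret);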
      
      Fixes: ffe81d45322c ("blk-mq: fix corruption with direct issue")
      Cc: stable@vger.kernel.org
      Suggested-by: Ming Lei <ming.lei@redhat.com>
      Reported-by: Bart Van Assche <bvanassche@acm.org>
      Tested-by: Ming Lei <ming.lei@redhat.com>
      Acked-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
    • blk-mq: fix corruption with direct issue · 724ff9cb
      Committed by Jens Axboe
      commit ffe81d45322cc3cb140f0db080a4727ea284661e upstream.
      
      If we attempt a direct issue to a SCSI device, and it returns BUSY, then
      we queue the request up normally. However, the SCSI layer may have
      already setup SG tables etc for this particular command. If we later
      merge with this request, then the old tables are no longer valid. Once
      we issue the IO, we only read/write the original part of the request,
      not the new state of it.
      
      This causes data corruption, and is most often noticed as the file
      system complaining that just-read data is invalid:
      
      [  235.934465] EXT4-fs error (device sda1): ext4_iget:4831: inode #7142: comm dpkg-query: bad extra_isize 24937 (inode size 256)
      
      because most of it is garbage...
      
      This doesn't happen from the normal issue path, as we will simply defer
      the request to the hardware queue dispatch list if we fail. Once it's on
      the dispatch list, we never merge with it.
      
      Fix this from the direct issue path by flagging the request as
      REQ_NOMERGE so we don't change the size of it before issue.
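
      A sketch of the flagging in the direct-issue BUSY path (simplified):

        case BLK_STS_RESOURCE:
        case BLK_STS_DEV_RESOURCE:
                /* The driver (e.g. SCSI) may have set up SG tables for
                 * this request already; forbid merging so its size can
                 * no longer change before it is issued again. */
                rq->cmd_flags |= REQ_NOMERGE;
                __blk_mq_requeue_request(rq);
                break;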
      
      See also:
        https://bugzilla.kernel.org/show_bug.cgi?id=201685
      Tested-by: Guenter Roeck <linux@roeck-us.net>
      Fixes: 6ce3dd6e ("blk-mq: issue directly if hw queue isn't busy in case of 'none'")
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  15. 01 Dec 2018, 1 commit
  16. 27 Nov 2018, 1 commit
  17. 21 Nov 2018, 1 commit
    • SCSI: fix queue cleanup race before queue initialization is done · 410306a0
      Committed by Ming Lei
      commit 8dc765d438f1e42b3e8227b3b09fad7d73f4ec9a upstream.
      
      c2856ae2 ("blk-mq: quiesce queue before freeing queue") already
      fixed this race; however, the implied synchronize_rcu() in
      blk_mq_quiesce_queue() can slow down LUN probing a lot, causing a
      performance regression.

      Then 1311326c ("blk-mq: avoid to synchronize rcu inside blk_cleanup_queue()")
      tried to quiesce the queue, to avoid the unnecessary synchronize_rcu(),
      only when queue initialization is done, because it is common to see
      lots of nonexistent LUNs that need to be probed.
      
      However, it turns out that it isn't safe to quiesce the queue only
      when queue initialization is done: when a SCSI command completes, the
      submitter of the command can be woken up immediately, and the SCSI
      device may then be removed while the run-queue work in
      scsi_end_request() is still in progress, causing a kernel panic.
      
      In the Red Hat QE lab, there are several reports of this kind of
      kernel panic triggered during boot.
      
      This patch addresses the issue by grabbing a queue usage counter
      reference while freeing the request and running the queue afterwards.
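
      A sketch of the idea in scsi_end_request() (simplified):

        /* Pin the queue so blk_cleanup_queue() cannot complete while the
         * request is freed and the queue is run. */
        percpu_ref_get(&q->q_usage_counter);

        __blk_mq_end_request(req, error);
        blk_mq_run_hw_queues(q, true);

        percpu_ref_put(&q->q_usage_counter);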
      
      Fixes: 1311326c ("blk-mq: avoid to synchronize rcu inside blk_cleanup_queue()")
      Cc: Andrew Jones <drjones@redhat.com>
      Cc: Bart Van Assche <bart.vanassche@wdc.com>
      Cc: linux-scsi@vger.kernel.org
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: James E.J. Bottomley <jejb@linux.vnet.ibm.com>
      Cc: stable <stable@vger.kernel.org>
      Cc: jianchao.wang <jianchao.w.wang@oracle.com>
      Signed-off-by: Ming Lei <ming.lei@redhat.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
  18. 14 Nov 2018, 4 commits
  19. 18 Oct 2018, 1 commit
  20. 12 Oct 2018, 1 commit
  21. 28 Sep 2018, 1 commit
  22. 27 Sep 2018, 1 commit
    • block: fix deadline elevator drain for zoned block devices · 854f31cc
      Committed by Damien Le Moal
      When the deadline scheduler is used with a zoned block device, writes
      to a zone will be dispatched one at a time. This causes the warning
      message:
      
      deadline: forced dispatching is broken (nr_sorted=X), please report this
      
      to be displayed when switching to another elevator with the legacy I/O
      path while write requests to a zone are being retained in the scheduler
      queue.
      
      Prevent this message from being displayed when executing
      elv_drain_elevator() for a zoned block device. __blk_drain_queue() will
      loop until all writes are dispatched and completed, resulting in the
      desired elevator queue drain without extensive modifications to the
      deadline code itself to handle forced-dispatch calls.
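
      A sketch of the check in elv_drain_elevator() (simplified):

        /* Zoned devices legitimately hold back writes to a locked zone,
         * so one-at-a-time dispatch is expected there - don't warn. */
        if (q->nr_sorted && !blk_queue_is_zoned(q) && printed++ < 10)
                printk(KERN_ERR "%s: forced dispatching is broken "
                       "(nr_sorted=%u), please report this\n",
                       q->elevator->type->elevator_name, q->nr_sorted);
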
      Signed-off-by: Damien Le Moal <damien.lemoal@wdc.com>
      Fixes: 8dc8146f ("deadline-iosched: Introduce zone locking support")
      Cc: stable@vger.kernel.org
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  23. 26 Sep 2018, 1 commit
  24. 22 Sep 2018, 1 commit
    • block: use nanosecond resolution for iostat · b57e99b4
      Committed by Omar Sandoval
      Klaus Kusche reported that the I/O busy time in /proc/diskstats was not
      updating properly on 4.18. This is because we started using ktime to
      track elapsed time, and we convert nanoseconds to jiffies when we update
      the partition counter. However, this gets rounded down, so any I/Os that
      take less than a jiffy are not accounted for. Previously in this case,
      the value of jiffies would sometimes increment while we were doing I/O,
      so at least some I/Os were accounted for.
      
      Let's convert the stats to use nanoseconds internally. We still report
      milliseconds as before, now more accurately than ever. The value is
      still truncated to 32 bits for backwards compatibility.
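
      A sketch of the read-out side (simplified; counters now hold
      nanoseconds and are converted only when reported):

        /* Convert a nanosecond-resolution counter to the milliseconds
         * userspace expects; truncation to 32 bits happens as before. */
        #define part_stat_read_msecs(part, which)                       \
                div_u64(part_stat_read(part, which), NSEC_PER_MSEC)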
      
      Fixes: 522a7775 ("block: consolidate struct request timestamp fields")
      Cc: stable@vger.kernel.org
      Reported-by: Klaus Kusche <klaus.kusche@computerix.info>
      Signed-off-by: Omar Sandoval <osandov@fb.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  25. 12 Sep 2018, 1 commit
  26. 07 Sep 2018, 1 commit
  27. 06 Sep 2018, 1 commit
  28. 01 Sep 2018, 3 commits
    • blkcg: use tryget logic when associating a blkg with a bio · 31118850
      Committed by Dennis Zhou (Facebook)
      There is a very small chance of a bio getting caught up in a really
      unfortunate race between a task migration, a cgroup exiting, and the
      bio itself trying to associate with a blkg. This is due to css
      offlining being performed after the css->refcnt is killed, which
      triggers removal of blkgs that reach a blkg->refcnt of 0.
      
      To avoid this, association with a blkg should use tryget and fallback to
      using the root_blkg.
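
      A sketch of the tryget logic (illustrative; this assumes the blkg
      refcount is still a plain atomic_t in this era):

        /* Take a reference only if the blkg has not already started
         * dying (refcnt has not hit zero). */
        static inline struct blkcg_gq *blkg_try_get(struct blkcg_gq *blkg)
        {
                if (atomic_inc_not_zero(&blkg->refcnt))
                        return blkg;
                return NULL;
        }

      When the tryget fails, the bio is associated with q->root_blkg
      instead, which lives as long as the request_queue itself.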
      
      Fixes: 08e18eab ("block: add bi_blkg to the bio for cgroups")
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
      Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • blkcg: delay blkg destruction until after writeback has finished · 59b57717
      Committed by Dennis Zhou (Facebook)
      Currently, blkcg destruction relies on a sequence of events:
        1. Destruction starts. blkcg_css_offline() is called and blkgs
           release their reference to the blkcg. This immediately destroys
           the cgwbs (writeback).
        2. With blkgs giving up their reference, the blkcg ref count should
           become zero and eventually call blkcg_css_free() which finally
           frees the blkcg.
      
      Jiufei Xue reported that there is a race between blkcg_bio_issue_check()
      and cgroup_rmdir(). To remedy this, blkg destruction becomes contingent
      on the completion of all writeback associated with the blkcg. A count of
      the number of cgwbs is maintained and once that goes to zero, blkg
      destruction can follow. This should prevent premature blkg destruction
      related to writeback.
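
      A minimal sketch of that gating, assuming a refcount_t cgwb_refcnt in
      struct blkcg (the name matches the steps listed below):

        /* Each cgwb holds one count; the base reference is put back in
         * blkcg_css_offline(). The final put triggers blkg destruction. */
        static inline void blkcg_cgwb_put(struct blkcg *blkcg)
        {
                if (refcount_dec_and_test(&blkcg->cgwb_refcnt))
                        blkcg_destroy_blkgs(blkcg);
        }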
      
      The new process for blkcg cleanup is as follows:
        1. Destruction starts. blkcg_css_offline() is called which offlines
           writeback. Blkg destruction is delayed on the cgwb_refcnt count to
           avoid punting potentially large amounts of outstanding writeback
           to root while maintaining any ongoing policies. Here, the base
           cgwb_refcnt is put back.
        2. When the cgwb_refcnt becomes zero, blkcg_destroy_blkgs() is called
           and handles destruction of blkgs. This is where the css reference
           held by each blkg is released.
        3. Once the blkcg ref count goes to zero, blkcg_css_free() is called.
           This finally frees the blkg.
      
      It seems that in the past blk-throttle didn't do the most
      understandable things when taking data from a blkg while associating
      with current; the simplification and unification of what blk-throttle
      is doing is what caused this.
      
      Fixes: 08e18eab ("block: add bi_blkg to the bio for cgroups")
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
      Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Josef Bacik <josef@toxicpanda.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
    • Revert "blk-throttle: fix race between blkcg_bio_issue_check() and cgroup_rmdir()" · 6b065462
      Committed by Dennis Zhou (Facebook)
      This reverts commit 4c699480.
      
      Destroying blkgs is tricky because of the nature of the relationship. A
      blkg should go away when either a blkcg or a request_queue goes away.
      However, blkgs pin the blkcg to ensure they remain valid. To break this
      cycle, when a blkcg is offlined, blkgs put back their css ref. This
      eventually lets css_free() get called which frees the blkcg.
      
      The above commit (4c699480) breaks this order of events by trying to
      destroy blkgs in css_free(). As the blkgs still hold references to the
      blkcg, css_free() is never called.
      
      The race between blkcg_bio_issue_check() and cgroup_rmdir() will be
      addressed in the following patch by delaying destruction of a blkg until
      all writeback associated with the blkcg has been finished.
      
      Fixes: 4c699480 ("blk-throttle: fix race between blkcg_bio_issue_check() and cgroup_rmdir()")
      Reviewed-by: Josef Bacik <josef@toxicpanda.com>
      Signed-off-by: Dennis Zhou <dennisszhou@gmail.com>
      Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Cc: Joseph Qi <joseph.qi@linux.alibaba.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
  29. 28 Aug 2018, 3 commits