1. 20 Jan 2022, 1 commit
      md: Fix undefined behaviour in is_mddev_idle · 406295a3
      zhangwensheng committed
      hulk inclusion
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4QXS1?from=project-issue
      CVE: NA
      
      --------------------------------
      
      UBSAN reports this problem:
      
      [ 5984.281385] UBSAN: Undefined behaviour in drivers/md/md.c:8175:15
      [ 5984.281390] signed integer overflow:
      [ 5984.281393] -2147483291 - 2072033152 cannot be represented in type 'int'
      [ 5984.281400] CPU: 25 PID: 1854 Comm: md101_resync Kdump: loaded Not tainted 4.19.90
      [ 5984.281404] Hardware name: Huawei TaiShan 200 (Model 5280)/BC82AMDDA
      [ 5984.281406] Call trace:
      [ 5984.281415]  dump_backtrace+0x0/0x310
      [ 5984.281418]  show_stack+0x28/0x38
      [ 5984.281425]  dump_stack+0xec/0x15c
      [ 5984.281430]  ubsan_epilogue+0x18/0x84
      [ 5984.281434]  handle_overflow+0x14c/0x19c
      [ 5984.281439]  __ubsan_handle_sub_overflow+0x34/0x44
      [ 5984.281445]  is_mddev_idle+0x338/0x3d8
      [ 5984.281449]  md_do_sync+0x1bb8/0x1cf8
      [ 5984.281452]  md_thread+0x220/0x288
      [ 5984.281457]  kthread+0x1d8/0x1e0
      [ 5984.281461]  ret_from_fork+0x10/0x18
      
      When the stat accum of the disk is greater than INT_MAX, its value
      becomes negative after casting to 'int', which may lead to overflow
      after subtracting a positive number. In the same way, when the value
      of sync_io is greater than INT_MAX, overflow may also occur. These
      situations lead to undefined behavior.
      
      Moreover, if the stat accum of the disk is close to INT_MAX when
      creating the raid array, the initial value of last_events will be set
      close to INT_MAX when mddev initializes its IO event counters, so
      'curr_events - rdev->last_events > 64' will always be false during
      synchronization. If all the disks of the mddev are in this state,
      is_mddev_idle() will always return 1, which may make non-sync IO
      very slow.
      
      To address these problems, use a 64-bit signed integer type for
      sync_io, last_events, and curr_events.
      Signed-off-by: zhangwensheng <zhangwensheng5@huawei.com>
      Reviewed-by: Tao Hou <houtao1@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
  2. 14 Jan 2022, 2 commits
  3. 12 Jan 2022, 7 commits
  4. 10 Jan 2022, 7 commits
  5. 23 Dec 2021, 1 commit
      md/raid1: fix a race between removing rdev and access conf->mirrors[i].rdev · ceff49d9
      Yufen Yu committed
      hulk inclusion
      category: bugfix
      bugzilla: https://gitee.com/openeuler/kernel/issues/I4JYYO?from=project-issue
      CVE: NA
      
      ---------------------------
      
      We get a NULL pointer dereference oops when test raid1 as follow:
      
      mdadm -CR /dev/md1 -l 1 -n 2 /dev/sd[ab]
      
      mdadm /dev/md1 -f /dev/sda
      mdadm /dev/md1 -r /dev/sda
      mdadm /dev/md1 -a /dev/sda
      sleep 5
      mdadm /dev/md1 -f /dev/sdb
      mdadm /dev/md1 -r /dev/sdb
      mdadm /dev/md1 -a /dev/sdb
      
      After a disk (/dev/sda) has been removed, we add the disk to the
      raid array again, which triggers the recovery action.
      Since the rdev's current state is 'spare', read/write bio can
      be issued to the disk.
      
      Then we set the other disk (/dev/sdb) faulty. Since the raid
      array is now in a degraded state and /dev/sdb is the only
      'In_sync' disk, raid1_error() will return without successfully
      setting the disk faulty.
      
      However, that can interrupt the recovery action, and
      md_check_recovery() will try to call remove_and_add_spares()
      to remove the spare disk. The race condition between
      remove_and_add_spares() and raid1_write_request() shown below
      can then cause a NULL pointer dereference on conf->mirrors[i].rdev:
      
      raid1_write_request()   md_check_recovery    raid1_error()
      rcu_read_lock()
      rdev != NULL
      !test_bit(Faulty, &rdev->flags)
      
                                                 conf->recovery_disabled=
                                                   mddev->recovery_disabled;
                                                  return busy
      
                              remove_and_add_spares
                              raid1_remove_disk
                              rdev->nr_pending == 0
      
      atomic_inc(&rdev->nr_pending);
      rcu_read_unlock()
      
                              p->rdev=NULL
      
      conf->mirrors[i].rdev->data_offset
      NULL pointer deref!!!
      
                              if (!test_bit(RemoveSynchronized,
                                &rdev->flags))
                               synchronize_rcu();
                               p->rdev=rdev
      
      To fix the race condition, we add a new flag 'WantRemove' for rdev.
      Before accessing conf->mirrors[i].rdev, we need to ensure that the
      rdev does not have the 'WantRemove' bit set.
      
      Link: https://marc.info/?l=linux-raid&m=156412052717709&w=2
      Reported-by: Zou Wei <zou_wei@huawei.com>
      Signed-off-by: Yufen Yu <yuyufen@huawei.com>
      Conflict:
              drivers/md/md.h
      Signed-off-by: Laibin Qiu <qiulaibin@huawei.com>
      Reviewed-by: yuyufen <yuyufen@huawei.com>
      Reviewed-by: Jason Yan <yanaijie@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
  6. 06 Dec 2021, 1 commit
  7. 15 Nov 2021, 3 commits
  8. 21 Oct 2021, 1 commit
      dm crypt: Avoid percpu_counter spinlock contention in crypt_page_alloc() · 29975cf5
      Arne Welzel committed
      stable inclusion
      from stable-5.10.67
      commit 7509c4cb7c8050177da9ee5e053c0c3d55bb66b7
      bugzilla: 182619 https://gitee.com/openeuler/kernel/issues/I4EWO7
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=7509c4cb7c8050177da9ee5e053c0c3d55bb66b7
      
      --------------------------------
      
      commit 528b16bf upstream.
      
      On systems with many cores using dm-crypt, heavy spinlock contention in
      percpu_counter_compare() can be observed when the page allocation limit
      for a given device is reached or close to being reached. This is due
      to percpu_counter_compare() taking a spinlock to compute an exact
      result on potentially many CPUs at the same time.
      
      Switch to non-exact comparison of allocated and allowed pages by using
      the value returned by percpu_counter_read_positive() to avoid taking
      the percpu_counter spinlock.
      
      This may over/under estimate the actual number of allocated pages by at
      most (batch-1) * num_online_cpus().
      
      Currently, batch is bounded by 32. The system on which this issue was
      first observed has 256 CPUs and 512GB of RAM. With a 4k page size, this
      change may over/under estimate by 31MB. With ~10G (2%) allowed dm-crypt
      allocations, this seems an acceptable error. Certainly preferred over
      running into the spinlock contention.
      
      This behavior was reproduced on an EC2 c5.24xlarge instance with 96 CPUs
      and 192GB RAM as follows, but can be provoked on systems with fewer CPUs
      as well.
      
       * Disable swap
       * Tune vm settings to promote regular writeback
           $ echo 50 > /proc/sys/vm/dirty_expire_centisecs
           $ echo 25 > /proc/sys/vm/dirty_writeback_centisecs
           $ echo $((128 * 1024 * 1024)) > /proc/sys/vm/dirty_background_bytes
      
       * Create 8 dmcrypt devices based on files on a tmpfs
       * Create and mount an ext4 filesystem on each crypt devices
       * Run stress-ng --hdd 8 within one of above filesystems
      
      Total %system usage collected from sysstat goes to ~35%. Write throughput
      on the underlying loop device is ~2GB/s. perf profiling an individual
      kworker kcryptd thread shows the following profile, indicating spinlock
      contention in percpu_counter_compare():
      
          99.98%     0.00%  kworker/u193:46  [kernel.kallsyms]  [k] ret_from_fork
            |
            --ret_from_fork
              kthread
              worker_thread
              |
               --99.92%--process_one_work
                  |
                  |--80.52%--kcryptd_crypt
                  |    |
                  |    |--62.58%--mempool_alloc
                  |    |  |
                  |    |   --62.24%--crypt_page_alloc
                  |    |     |
                  |    |      --61.51%--__percpu_counter_compare
                  |    |        |
                  |    |         --61.34%--__percpu_counter_sum
                  |    |           |
                  |    |           |--58.68%--_raw_spin_lock_irqsave
                  |    |           |  |
                  |    |           |   --58.30%--native_queued_spin_lock_slowpath
                  |    |           |
                  |    |            --0.69%--cpumask_next
                  |    |                |
                  |    |                 --0.51%--_find_next_bit
                  |    |
                  |    |--10.61%--crypt_convert
                  |    |          |
                  |    |          |--6.05%--xts_crypt
                  ...
      
      After applying this patch and running the same test, %system usage is
      lowered to ~7% and write throughput on the loop device increases
      to ~2.7GB/s. perf report shows mempool_alloc() as ~8% rather than ~62%
      in the profile and not hitting the percpu_counter() spinlock anymore.
      
          |--8.15%--mempool_alloc
          |    |
          |    |--3.93%--crypt_page_alloc
          |    |    |
          |    |     --3.75%--__alloc_pages
          |    |         |
          |    |          --3.62%--get_page_from_freelist
          |    |              |
          |    |               --3.22%--rmqueue_bulk
          |    |                   |
          |    |                    --2.59%--_raw_spin_lock
          |    |                      |
          |    |                       --2.57%--native_queued_spin_lock_slowpath
          |    |
          |     --3.05%--_raw_spin_lock_irqsave
          |               |
          |                --2.49%--native_queued_spin_lock_slowpath
      Suggested-by: DJ Gregor <dj@corelight.com>
      Reviewed-by: Mikulas Patocka <mpatocka@redhat.com>
      Signed-off-by: Arne Welzel <arne.welzel@corelight.com>
      Fixes: 5059353d ("dm crypt: limit the number of allocated pages")
      Cc: stable@vger.kernel.org
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
  9. 19 Oct 2021, 2 commits
  10. 15 Oct 2021, 8 commits
  11. 03 Jul 2021, 2 commits
  12. 15 Jun 2021, 1 commit
  13. 03 Jun 2021, 4 commits
      dm verity: allow only one error handling mode · 4ae8420c
      JeongHyeon Lee committed
      mainline inclusion
      from mainline-v5.13-rc1
      commit 219a9b5e
      category: bugfix
      bugzilla: 51874
      CVE: NA
      
      Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=219a9b5e738b75a6a5e9effe1d72f60037a2f131
      
      -----------------------------------------------
      
      If more than one error handling mode is requested during DM verity
      table load, the last requested mode will be used.
      
      Change this to impose stricter checking so that the table load will
      fail if more than one error handling mode is requested.
      Signed-off-by: JeongHyeon Lee <jhs2.lee@samsung.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Luo Meng <luomeng12@huawei.com>
      Reviewed-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      dm snapshot: fix crash with transient storage and zero chunk size · b712cd09
      Mikulas Patocka committed
      stable inclusion
      from stable-5.10.40
      commit 2a61f0ccb756f966f7d04aa149635c843f821ad3
      bugzilla: 51882
      CVE: NA
      
      --------------------------------
      
      commit c699a0db upstream.
      
      The following commands will crash the kernel:
      
      modprobe brd rd_size=1048576
      dmsetup create o --table "0 `blockdev --getsize /dev/ram0` snapshot-origin /dev/ram0"
      dmsetup create s --table "0 `blockdev --getsize /dev/ram0` snapshot /dev/ram0 /dev/ram1 N 0"
      
      The reason is that when we test for zero chunk size, we jump to the label
      bad_read_metadata without setting the "r" variable. The function
      snapshot_ctr destroys all the structures and then exits with "r == 0". The
      kernel then crashes because it falsely believes that snapshot_ctr
      succeeded.
      
      In order to fix the bug, we set the variable "r" to -EINVAL.
      Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      md: Fix missing unused status line of /proc/mdstat · 1e099dfb
      Jan Glauber committed
      stable inclusion
      from stable-5.10.37
      commit 0035a4704557ba66824c08d5759d6e743747410b
      bugzilla: 51868
      CVE: NA
      
      --------------------------------
      
      commit 7abfabaf upstream.
      
      Reading /proc/mdstat with a read buffer size that would not
      fit the unused status line in the first read will skip this
      line from the output.
      
      So 'dd if=/proc/mdstat bs=64 2>/dev/null' will not print something
      like: unused devices: <none>
      
      Don't return NULL immediately in start() for v=2 but call
      show() once to print the status line also for multiple reads.
      
      Cc: stable@vger.kernel.org
      Fixes: 1f4aace6 ("fs/seq_file.c: simplify seq_file iteration code and interface")
      Signed-off-by: Jan Glauber <jglauber@digitalocean.com>
      Signed-off-by: Song Liu <songliubraving@fb.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      md: md_open returns -EBUSY when entering racing area · 640134e4
      Zhao Heming committed
      stable inclusion
      from stable-5.10.37
      commit b70b7ec500892f8bc12ffc6f60a3af6fd61d3a8b
      bugzilla: 51868
      CVE: NA
      
      --------------------------------
      
      commit 6a4db2a6 upstream.
      
      commit d3374825 ("md: make devices disappear when they are no longer
      needed.") introduced protection between mddev creating & removing.
      md_open shouldn't create an mddev when the all_mddevs list doesn't
      contain it. With the current code logic, it is very easy to trigger a
      soft lockup in a non-preempt env.
      
      This patch changes md_open's return value from -ERESTARTSYS to -EBUSY,
      which breaks the infinite retry when md_open enters the racing area.
      
      This patch only partly fixes the soft lockup issue; the full fix needs
      mddev_find to be split into two functions: mddev_find &
      mddev_find_or_alloc, with md_open calling the new mddev_find (which
      only does the searching job).
      
      For more detail, please refer to Christoph's "split mddev_find" patch
      in later commits.
      
      *** env ***
      kvm-qemu VM 2C1G with 2 iscsi luns
      kernel should be non-preempt
      
      *** script ***
      
      The issue triggers almost every time with the below script:
      
      ```
      1  node1="mdcluster1"
      2  node2="mdcluster2"
      3
      4  mdadm -Ss
      5  ssh ${node2} "mdadm -Ss"
      6  wipefs -a /dev/sda /dev/sdb
      7  mdadm -CR /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda \
         /dev/sdb --assume-clean
      8
      9  for i in {1..10}; do
      10    echo ==== $i ====;
      11
      12    echo "test  ...."
      13    ssh ${node2} "mdadm -A /dev/md0 /dev/sda /dev/sdb"
      14    sleep 1
      15
      16    echo "clean  ....."
      17    ssh ${node2} "mdadm -Ss"
      18 done
      ```
      
      I use an mdcluster env to trigger the soft lockup, but it isn't an
      mdcluster-special bug. Stopping an md array in an mdcluster env does
      more work than for a non-cluster array, which leaves a large enough
      time gap to allow the kernel to run md_open.
      
      *** stack ***
      
      ```
      [  884.226509]  mddev_put+0x1c/0xe0 [md_mod]
      [  884.226515]  md_open+0x3c/0xe0 [md_mod]
      [  884.226518]  __blkdev_get+0x30d/0x710
      [  884.226520]  ? bd_acquire+0xd0/0xd0
      [  884.226522]  blkdev_get+0x14/0x30
      [  884.226524]  do_dentry_open+0x204/0x3a0
      [  884.226531]  path_openat+0x2fc/0x1520
      [  884.226534]  ? seq_printf+0x4e/0x70
      [  884.226536]  do_filp_open+0x9b/0x110
      [  884.226542]  ? md_release+0x20/0x20 [md_mod]
      [  884.226543]  ? seq_read+0x1d8/0x3e0
      [  884.226545]  ? kmem_cache_alloc+0x18a/0x270
      [  884.226547]  ? do_sys_open+0x1bd/0x260
      [  884.226548]  do_sys_open+0x1bd/0x260
      [  884.226551]  do_syscall_64+0x5b/0x1e0
      [  884.226554]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      ```
      
      *** rootcause ***
      
      "mdadm -A" (or other array assemble commands) will start a daemon
      "mdadm --monitor" by default. When "mdadm -Ss" is running, the stop
      action will wake up "mdadm --monitor". The "--monitor" daemon will
      immediately get info from /proc/mdstat. At this time the mddev still
      exists in the kernel, so /proc/mdstat still shows the md device, which
      makes "mdadm --monitor" open /dev/md0.
      
      The previous "mdadm -Ss" is a removing action, while the "mdadm
      --monitor" open action triggers md_open, which is a creating action.
      A race occurs.
      
      ```
      <thread 1>: "mdadm -Ss"
      md_release
        mddev_put deletes mddev from all_mddevs
        queue_work for mddev_delayed_delete
        at this time, "/dev/md0" is still available for opening
      
      <thread 2>: "mdadm --monitor ..."
      md_open
       + mddev_find can't find mddev of /dev/md0, and create a new mddev and
       |    return.
       + trigger "if (mddev->gendisk != bdev->bd_disk)" and return
            -ERESTARTSYS.
      ```
      
      In a non-preempt kernel, <thread 2> occupies the current CPU, and
      mddev_delayed_delete, which was created in <thread 1>, can't be
      scheduled either.
      
      In a preempt kernel, the above racing can also trigger, but the kernel
      doesn't allow one thread to run on a CPU all the time: after <thread 2>
      has run for some time, the later "mdadm -A" (refer to script line 13
      above) will call md_alloc to alloc a new gendisk for the mddev. This
      breaks the md_open statement "if (mddev->gendisk != bdev->bd_disk)"
      and returns 0 to the caller, so the soft lockup is broken.
      
      Cc: stable@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Zhao Heming <heming.zhao@suse.com>
      Signed-off-by: Song Liu <song@kernel.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>