- 28 June 2022, 6 commits
-
Submitted by Takashi Iwai

stable inclusion from stable-v5.10.109
commit db03abd0dae07396559fd94b1a8ef54903be2073
bugzilla: https://gitee.com/openeuler/kernel/issues/I574AE
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=db03abd0dae07396559fd94b1a8ef54903be2073

--------------------------------

commit 455c5653 upstream.

This is essentially a revert of the commit dc865fb9 ("ASoC: sti: Use snd_pcm_stop_xrun() helper"), which converted the manual snd_pcm_stop() calls to snd_pcm_stop_xrun(). The commit above introduced a deadlock, as snd_pcm_stop_xrun() itself takes the PCM stream lock while the caller already holds it. Since the conversion was done only for consistency and the open-coded snd_pcm_stop() call with the XRUN state is correct usage, revert the commit as the fix.

Fixes: dc865fb9 ("ASoC: sti: Use snd_pcm_stop_xrun() helper")
Reported-by: Daniel Palmer <daniel@0x0f.com>
Cc: Arnaud POULIQUEN <arnaud.pouliquen@st.com>
Cc: <stable@vger.kernel.org>
Link: https://lore.kernel.org/r/20220315091319.3351522-1-daniel@0x0f.com
Signed-off-by: Takashi Iwai <tiwai@suse.de>
Reviewed-by: Arnaud Pouliquen <arnaud.pouliquen@foss.st.com>
Link: https://lore.kernel.org/r/20220315164158.19804-1-tiwai@suse.de
Signed-off-by: Mark Brown <broonie@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yu Liao <liaoyu15@huawei.com>
Reviewed-by: Wei Li <liwei391@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
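A minimal sketch of the locking pattern at issue, assuming the caller already holds the PCM stream lock (illustrative, not the exact sti driver code):

	/* snd_pcm_stop_xrun() takes the stream lock itself, so calling
	 * it from a context that already holds the lock self-deadlocks. */
	snd_pcm_stream_lock(substream);
	/* ... overrun handling that must keep the lock held ... */
	/* BAD:  snd_pcm_stop_xrun(substream);  -- re-takes the stream lock */
	/* OK:   the lock is already held, call snd_pcm_stop() directly    */
	snd_pcm_stop(substream, SNDRV_PCM_STATE_XRUN);
	snd_pcm_stream_unlock(substream);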
-
Submitted by Oliver Graute

stable inclusion from stable-v5.10.109
commit 56dc187b35d5a0ac9d08560684721abf3aefa4df
bugzilla: https://gitee.com/openeuler/kernel/issues/I574AE
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=56dc187b35d5a0ac9d08560684721abf3aefa4df

--------------------------------

commit b6821b0d upstream.

In rare cases the display is flipped or mirrored. This was observed more often in a low-temperature environment. A clean reset on init_display() should help to get the registers into a sane state.

Fixes: ef8f3177 ("staging: fbtft: use init function instead of init sequence")
Cc: stable@vger.kernel.org
Signed-off-by: Oliver Graute <oliver.graute@kococonnector.com>
Link: https://lore.kernel.org/r/20220210085322.15676-1-oliver.graute@kococonnector.com
[sudip: adjust context]
Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yu Liao <liaoyu15@huawei.com>
Reviewed-by: Wei Li <liwei391@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Tadeusz Struk

stable inclusion from stable-v5.10.109
commit 351493858ebc192c4526182f4c5819466e345659
bugzilla: https://gitee.com/openeuler/kernel/issues/I574AE
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=351493858ebc192c4526182f4c5819466e345659

--------------------------------

commit 2e8e4c8f upstream.

When an invalid (non-existing) handle is used in a TPM command that uses the resource manager interface (/dev/tpmrm0), the resource manager tries to load it from its internal cache, but fails, and tpm_dev_transmit returns an -EINVAL error to the caller. The existing async handler doesn't handle these error cases currently, and the condition in the poll handler never returns a mask with EPOLLIN set. The result is that the poll call blocks and the application gets stuck until the user_read_timer wakes it up after 120 sec.

Change the tpm_dev_async_work function to handle error conditions returned from tpm_dev_transmit so that they are also reflected in the poll mask and a correct error code can be passed back to the caller.

Cc: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: <linux-integrity@vger.kernel.org>
Cc: <stable@vger.kernel.org>
Cc: <linux-kernel@vger.kernel.org>
Fixes: 9e1b74a6 ("tpm: add support for nonblocking operation")
Tested-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Tadeusz Struk <tstruk@gmail.com>
Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org>
Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
Cc: Tadeusz Struk <tadeusz.struk@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yu Liao <liaoyu15@huawei.com>
Reviewed-by: Wei Li <liwei391@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
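A hedged sketch of the key change in tpm_dev_async_work() as the commit describes it (surrounding locking and field details abridged and approximate):

	ret = tpm_dev_transmit(priv->chip, priv->space, priv->data_buffer,
			       sizeof(priv->data_buffer));
	tpm_put_ops(priv->chip);
	/* Store negative error codes too, not only a positive response
	 * size, so poll() sees a non-zero response_length and reports
	 * EPOLLIN instead of blocking until user_read_timer fires. */
	if (ret != 0) {
		priv->response_length = ret;
		mod_timer(&priv->user_read_timer, jiffies + (120 * HZ));
	}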
-
Submitted by Michal Koutný

stable inclusion from stable-v5.10.109
commit ea21245cdcab3f2b46aecd421ac5f5753a1cf88d
bugzilla: https://gitee.com/openeuler/kernel/issues/I574AE
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=ea21245cdcab3f2b46aecd421ac5f5753a1cf88d

--------------------------------

commit 467a726b upstream.

The idea is to check: a) the owning user_ns of cgroup_ns, b) capabilities in init_user_ns.

The commit 24f60085 ("cgroup-v1: Require capabilities to set release_agent") got this wrong in the write handler of release_agent, since it checked the user_ns of the opener (which may differ from the owning user_ns of cgroup_ns). Secondly, to avoid a possibly confused deputy, the capability of the opener must be checked.

Fixes: 24f60085 ("cgroup-v1: Require capabilities to set release_agent")
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/stable/20220216121142.GB30035@blackbody.suse.cz/
Signed-off-by: Michal Koutný <mkoutny@suse.com>
Reviewed-by: Masami Ichikawa (CIP) <masami.ichikawa@cybertrust.co.jp>
Signed-off-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yu Liao <liaoyu15@huawei.com>
Reviewed-by: Wei Li <liwei391@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
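A hedged sketch of the corrected check in the release_agent write handler, following the upstream fix (context and field names may differ slightly in this tree):

	/* a) the cgroup namespace must be owned by init_user_ns,
	 * b) the *opener's* credentials must carry CAP_SYS_ADMIN there,
	 *    to avoid a confused deputy via a passed-on file descriptor */
	ctx = of->priv;
	if (ctx->ns->user_ns != &init_user_ns ||
	    !file_ns_capable(of->file, &init_user_ns, CAP_SYS_ADMIN))
		return -EPERM;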
-
Submitted by Chen Li

stable inclusion from stable-v5.10.109
commit 9eeaa2d7d58ae7fe66bdb016a03fe251c48fe222
bugzilla: https://gitee.com/openeuler/kernel/issues/I574AE
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=9eeaa2d7d58ae7fe66bdb016a03fe251c48fe222

--------------------------------

commit 839a534f upstream.

In d_make_root, when we fail to allocate a dentry for the root inode, we iput the root inode and return NULL from this function. So we do not need to release this inode again at d_make_root's caller.

Signed-off-by: Chen Li <chenli@uniontech.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
Cc: Tadeusz Struk <tadeusz.struk@linaro.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yu Liao <liaoyu15@huawei.com>
Reviewed-by: Wei Li <liwei391@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
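A short sketch of the caller-side pattern this implies (a generic fill_super-style error path; the label name is illustrative):

	sb->s_root = d_make_root(root_inode);
	if (!sb->s_root) {
		err = -ENOMEM;
		goto out_error;	/* no iput(root_inode) here:
				 * d_make_root() already dropped the
				 * inode when dentry allocation failed */
	}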
-
Submitted by Tadeusz Struk

stable inclusion from stable-v5.10.109
commit ae8ec5eabb1a0e672e054ef50374f3d8508d6828
bugzilla: https://gitee.com/openeuler/kernel/issues/I574AE
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=ae8ec5eabb1a0e672e054ef50374f3d8508d6828

--------------------------------

commit 5e34af41 upstream.

Syzbot found a kernel bug in the ipv6 stack:
LINK: https://syzkaller.appspot.com/bug?id=205d6f11d72329ab8d62a610c44c5e7e25415580

The reproducer triggers it by sending a crafted message via a sendmmsg() call, which triggers skb_over_panic and crashes the kernel:

	skbuff: skb_over_panic: text:ffffffff84647fb4 len:65575 put:65575
	head:ffff888109ff0000 data:ffff888109ff0088 tail:0x100af end:0xfec0
	dev:<NULL>

Update the check that prevents an invalid packet with an MTU equal to the fragment header size from eating up all the space for the payload.

The reproducer can be found here:
LINK: https://syzkaller.appspot.com/text?tag=ReproC&x=1648c83fb00000

Reported-by: syzbot+e223cf47ec8ae183f2a0@syzkaller.appspotmail.com
Signed-off-by: Tadeusz Struk <tadeusz.struk@linaro.org>
Acked-by: Willem de Bruijn <willemb@google.com>
Link: https://lore.kernel.org/r/20220310232538.1044947-1-tadeusz.struk@linaro.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Yu Liao <liaoyu15@huawei.com>
Reviewed-by: Wei Li <liwei391@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
- 22 June 2022, 34 commits
-
Submitted by Huaixin Chang

mainline inclusion from mainline-v5.15-rc4
commit d73df887
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I5CPWE
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d73df887b6b8174dfbb7f5f878fbd1e0e2eb3f08

--------------------------------

Basic description of usage and effect for CFS Bandwidth Control Burst.

Co-developed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Co-developed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20210830032215.16302-3-changhuaixin@linux.alibaba.com
Signed-off-by: Hui Tang <tanghui20@huawei.com>
Reviewed-by: Chen Hui <judy.chenhui@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Huaixin Chang

mainline inclusion from mainline-v5.15-rc4
commit bcb1704a
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I5CPWE
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=bcb1704a1ed2de580a46f28922e223a65f16e0f5

--------------------------------

Two new statistics are introduced to show the internals of the burst feature and explain why burst helps or not:

	nr_bursts:  number of periods in which a bandwidth burst occurs
	burst_time: cumulative wall-time (in nanoseconds) that any CPU has
	            used above quota in the respective periods

Co-developed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Co-developed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20210830032215.16302-2-changhuaixin@linux.alibaba.com
Signed-off-by: Hui Tang <tanghui20@huawei.com>
Reviewed-by: Chen Hui <judy.chenhui@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Huaixin Chang

mainline inclusion from mainline-v5.13-rc6
commit f4183717
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I5CPWE
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f4183717b370ad28dd0c0d74760142b20e6e7931

--------------------------------

The CFS bandwidth controller limits CPU requests of a task group to quota during each period. However, parallel workloads might be bursty so that they get throttled even when their average utilization is under quota. And they are latency sensitive at the same time so that throttling them is undesired.

We borrow time now against our future underrun, at the cost of increased interference against the other system users. All nicely bounded.

Traditional (UP-EDF) bandwidth control is something like:

	(U = \Sum u_i) <= 1

This guarantees both that every deadline is met and that the system is stable. After all, if U were > 1, then for every second of walltime, we'd have to run more than a second of program time, and obviously miss our deadline, but the next deadline will be further out still; there is never time to catch up, unbounded fail.

This work observes that a workload doesn't always execute the full quota; this enables one to describe u_i as a statistical distribution. For example, have u_i = {x,e}_i, where x is the p(95) and x+e the p(100) (the traditional WCET). This effectively allows u to be smaller, increasing the efficiency (we can pack more tasks in the system), but at the cost of missing deadlines when all the odds line up. However, it does maintain stability, since every overrun must be paired with an underrun as long as our x is above the average.

That is, suppose we have 2 tasks, both specifying a p(95) value; then we have a p(95)*p(95) = 90.25% chance both tasks are within their quota and everything is good. At the same time we have a p(5)p(5) = 0.25% chance both tasks will exceed their quota at the same time (guaranteed deadline fail). Somewhere in between there's a threshold where one exceeds and the other doesn't underrun enough to compensate; this depends on the specific CDFs. At the same time, we can say that the worst case deadline miss will be \Sum e_i; that is, there is a bounded tardiness (under the assumption that x+e is indeed WCET).

The benefit of burst is seen when testing with schbench. The default values of kernel.sched_cfs_bandwidth_slice_us (5ms) and CONFIG_HZ (1000) are used.

	mkdir /sys/fs/cgroup/cpu/test
	echo $$ > /sys/fs/cgroup/cpu/test/cgroup.procs
	echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_quota_us
	echo 100000 > /sys/fs/cgroup/cpu/test/cpu.cfs_burst_us

	./schbench -m 1 -t 3 -r 20 -c 80000 -R 10

The average CPU usage is at 80%. I ran this 10 times, and got long tail latency 6 times and got throttled 8 times. Tail latencies are shown below, and it wasn't the worst case.

	Latency percentiles (usec)
		50.0000th: 19872
		75.0000th: 21344
		90.0000th: 22176
		95.0000th: 22496
		*99.0000th: 22752
		99.5000th: 22752
		99.9000th: 22752
		min=0, max=22727
	rps: 9.90 p95 (usec) 22496 p99 (usec) 22752
	p95/cputime 28.12% p99/cputime 28.44%

The interference when using burst is valued by the possibility of missing the deadline and the average WCET. Test results showed that when there are many cgroups or the CPU is under-utilized, the interference is limited. More details are shown in:
https://lore.kernel.org/lkml/5371BD36-55AE-4F71-B9D7-B86DC32E3D2B@linux.alibaba.com/

Co-developed-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Signed-off-by: Shanpei Chen <shanpeic@linux.alibaba.com>
Co-developed-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Tianchen Ding <dtcccc@linux.alibaba.com>
Signed-off-by: Huaixin Chang <changhuaixin@linux.alibaba.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Ben Segall <bsegall@google.com>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20210621092800.23714-2-changhuaixin@linux.alibaba.com
Signed-off-by: Hui Tang <tanghui20@huawei.com>
Reviewed-by: Chen Hui <judy.chenhui@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by tatataeki

hulk inclusion
category: feature
bugzilla: https://gitee.com/openeuler/kernel/issues/I4MC3F
CVE: NA

----------------------------------

Multiple operations on cgroups in cgroup v1 depend on the status of the cgroup. The status of the current cgroup can be displayed in cgroup v2, but it cannot be displayed in cgroup v1, so a flag_stat file is added to the memory cgroup to display the status of the current cgroup and its sub-cgroups.

Testing result:

List the status of user.slice:

	[root@test user.slice]# cat memory.flag_stat
	NO_REF 0
	ONLINE 1
	RELEASED 0
	VISIBLE 1
	DYING 0
	CHILD_NO_REF 0
	CHILD_ONLINE 1
	CHILD_RELEASED 0
	CHILD_VISIBLE 1
	CHILD_DYING 0

Create a new cgroup in user.slice:

	[root@test user.slice]# mkdir user-test

List the status of user.slice after the operation above:

	[root@test user.slice]# cat memory.flag_stat
	NO_REF 0
	ONLINE 1
	RELEASED 0
	VISIBLE 1
	DYING 0
	CHILD_NO_REF 0
	CHILD_ONLINE 2
	CHILD_RELEASED 0
	CHILD_VISIBLE 2
	CHILD_DYING 0

Signed-off-by: tatataeki <shengzeyu19_98@163.com>
Reviewed-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Gou Hao

uniontech inclusion
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I40JRR
CVE: NA

-------------------

After allocating the sbi->persisters memory, dep_init() calls dep_fini() when an error happens. Because sbi->persisters is not zeroed, dep_fini() can be called with sbi->persisters[] uninitialized, and thus kthread_stop() can be called with a random value.

Signed-off-by: Gou Hao <gouhao@uniontech.com>
Reviewed-by: Yu Kuai <yukuai3@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
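A hedged sketch of the fix this describes: zero the array at allocation time so the error path only stops threads that were actually started (names follow the commit text; the surrounding filesystem code may differ):

	/* zeroed allocation: every not-yet-created persister slot is NULL */
	sbi->persisters = kcalloc(num_persisters, sizeof(*sbi->persisters),
				  GFP_KERNEL);
	if (!sbi->persisters)
		return -ENOMEM;

	/* ... on a later error, dep_fini() can now safely do: */
	for (i = 0; i < num_persisters; i++) {
		if (sbi->persisters[i]) {	/* skip uninitialized slots */
			kthread_stop(sbi->persisters[i]);
			sbi->persisters[i] = NULL;
		}
	}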
-
Submitted by Namjae Jeon

mainline inclusion from mainline-v5.19-rc1
commit f26967b9
category: bugfix
bugzilla: 186929, https://gitee.com/src-openeuler/kernel/issues/I5D82L
CVE: CVE-2022-1973
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=f26967b9f7a830e228bb13fb41bd516ddd9d789d

----------------------------------------------------------------

log_read_rst() returns an ENOMEM error when there is not enough memory. In this case, if info is returned without initialization, the caller attempts to kfree the uninitialized info->r_page pointer. This patch moves the memset initialization code to before the log_read_rst() call.

Reported-by: Gerald Lee <sundaywind2004@gmail.com>
Signed-off-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Konstantin Komarov <almaz.alexandrovich@paragon-software.com>
Signed-off-by: ZhaoLong Wang <wangzhaolong1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
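A sketch of the ordering fix described above (the surrounding ntfs3 code is abridged; the point is only zero-then-read):

	struct restart_info info;

	memset(&info, 0, sizeof(struct restart_info));	/* zero first...   */
	err = log_read_rst(log, lsn, true, &info);	/* ...then read    */
	if (err)
		goto out;	/* cleanup's kfree(info.r_page) now sees a
				 * NULL pointer instead of stack garbage  */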
-
Submitted by ChenXiaoSong

hulk inclusion
category: bugfix
bugzilla: 186345, https://gitee.com/openeuler/kernel/issues/I4T2WV
CVE: NA

--------------------------------

This reverts commit ce368536.

filemap_sample_wb_err() will return 0 if nobody has seen the error yet; then filemap_check_wb_err() will return the unchanged writeback error, and the async write() becomes a sync write().

Reproducer:

	nfs server                      | nfs client
	--------------------------------|----------------------------------------------
	# No space left on server       |
	fallocate -l 100G /server/nospc |
	                                | mount -t nfs $nfs_server_ip:/ /mnt
	                                |
	                                | # Expected error: No space left on device
	                                | dd if=/dev/zero of=/mnt/file count=1 ibs=1K
	                                |
	                                | # Release space on mountpoint
	                                | rm /mnt/nospc
	                                |
	                                | # Very very slow
	                                | dd if=/dev/zero of=/mnt/file count=1 ibs=1K

Signed-off-by: ChenXiaoSong <chenxiaosong2@huawei.com>
Reviewed-by: Zhang Xiaoxu <zhangxiaoxu5@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Pavel Reichl

mainline inclusion from stable-v5.13-rc1
commit 92cf7d36
category: bugfix
bugzilla: 186908, https://gitee.com/openeuler/kernel/issues/I4KIAO
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=92cf7d36384b99d5a57bf4422904a3c16dc4527a

--------------------------------

Skip the warnings about a mount option being deprecated if we are remounting and the deprecated option's state is not changing.

Bug: https://bugzilla.kernel.org/show_bug.cgi?id=211605
Fix-suggested-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Pavel Reichl <preichl@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Pavel Reichl

mainline inclusion from stable-v5.13-rc1
commit 0f98b4ec
category: bugfix
bugzilla: 186908, https://gitee.com/openeuler/kernel/issues/I4KIAO
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0f98b4ece18da9d8287bb4cc4e8f78b8760ea0d0

--------------------------------

Rename the mp variable to parsing_mp so it is easy to distinguish between the current mount point handle and the handle for the mount point whose mount options are being parsed.

Suggested-by: Eric Sandeen <sandeen@redhat.com>
Signed-off-by: Pavel Reichl <preichl@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
Conflicts:
	fs/xfs/xfs_super.c
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Xiyu Yang

mainline inclusion from mainline-v5.16-rc1
commit 31d21d21
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5C8IW
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=31d21d219b51dcfb16e18427eddae5394d402820

--------------------------------

The refcount_t type and corresponding API can protect refcounters from accidental underflow and overflow, and further use-after-free situations.

Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/1626674355-55795-1-git-send-email-xiyuyang19@fudan.edu.cn
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Li Nan <linan122@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
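A hedged sketch of the conversion pattern such commits apply (upstream this one targets an ext4 I/O-end reference count; the struct name and release helper are abridged from the commit reference):

	#include <linux/refcount.h>

	struct ext4_io_end {
		refcount_t count;	/* was: atomic_t count; */
		/* ... */
	};

	/* init:  atomic_set(&io_end->count, 1)        becomes: */
	refcount_set(&io_end->count, 1);

	/* get:   atomic_inc(&io_end->count)           becomes: */
	refcount_inc(&io_end->count);

	/* put:   atomic_dec_and_test(&io_end->count)  becomes: */
	if (refcount_dec_and_test(&io_end->count))
		ext4_release_io_end(io_end);	/* last reference gone */

The refcount API saturates on overflow and warns on underflow, which is what closes the use-after-free window the commit mentions.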
-
Submitted by Michael Ellerman

mainline inclusion from mainline-v5.19-rc2
commit 8e127844
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I5C43D?from=project-issue
CVE: CVE-2022-32981
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?id=8e1278444446fc97778a5e5c99bca1ce0bbc5ec9

--------------------------------

The ptrace PEEKUSR/POKEUSR (aka PEEKUSER/POKEUSER) API allows a process to read/write registers of another process.

To get/set a register, the API takes an index into an imaginary address space called the "USER area", where the registers of the process are laid out in some fashion. The kernel then maps that index to a particular register in its own data structures and gets/sets the value.

The API only allows a single machine-word to be read/written at a time. So 4 bytes on 32-bit kernels and 8 bytes on 64-bit kernels.

The way floating point registers (FPRs) are addressed is somewhat complicated, because double precision float values are 64-bit even on 32-bit CPUs. That means on 32-bit kernels each FPR occupies two word-sized locations in the USER area. On 64-bit kernels each FPR occupies one word-sized location in the USER area.

Internally the kernel stores the FPRs in an array of u64s, or if VSX is enabled, an array of pairs of u64s where one half of each pair stores the FPR. Which half of the pair stores the FPR depends on the kernel's endianness.

To handle the different layouts of the FPRs depending on VSX/no-VSX and big/little endian, the TS_FPR() macro was introduced.

Unfortunately the TS_FPR() macro does not take into account the fact that the addressing of each FPR differs between 32-bit and 64-bit kernels. It just takes the index into the "USER area" passed from userspace and indexes into the fp_state.fpr array.

On 32-bit there are 64 indexes that address FPRs, but only 32 entries in the fp_state.fpr array, meaning the user can read/write 256 bytes past the end of the array. Because the fp_state sits in the middle of the thread_struct there are various fields that can be overwritten, including some pointers. As such it may be exploitable.

It has also been observed to cause systems to hang or otherwise misbehave when using gdbserver, and is probably the root cause of this report which could not be easily reproduced:
https://lore.kernel.org/linuxppc-dev/dc38afe9-6b78-f3f5-666b-986939e40fc6@keymile.com/

Rather than trying to make the TS_FPR() macro even more complicated to fix the bug, or add more macros, instead add a special case for 32-bit kernels. This is more obvious and hopefully avoids a similar bug happening again in future.

Note that because 32-bit kernels never have VSX enabled the code doesn't need to consider TS_FPRWIDTH/OFFSET at all. Add a BUILD_BUG_ON() to ensure that 32-bit && VSX is never enabled.

Fixes: 87fec051 ("powerpc: PTRACE_PEEKUSR/PTRACE_POKEUSER of FPR registers in little endian builds")
Cc: stable@vger.kernel.org # v3.13+
Reported-by: Ariel Miculas <ariel.miculas@belden.com>
Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://lore.kernel.org/r/20220609133245.573565-1-mpe@ellerman.id.au
Signed-off-by: Yipeng Zou <zouyipeng@huawei.com>
Reviewed-by: Zhang Jianhua <chris.zjh@huawei.com>
Reviewed-by: Liao Chang <liaochang1@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
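A hedged sketch of the 32-bit special case in the PEEKUSR path, close to the upstream fix (the real code lives in the powerpc ptrace FPR helpers; context abridged):

	unsigned int fpidx = index - PT_FPR0;

	if (fpidx < (PT_FPSCR - PT_FPR0)) {
		if (IS_ENABLED(CONFIG_PPC32))
			/* On 32-bit the USER-area index counts 32-bit
			 * words, and each 64-bit FPR spans two of them,
			 * so index the FPR array as raw u32 words
			 * instead of going through TS_FPR(). */
			*data = ((u32 *)child->thread.fp_state.fpr)[fpidx];
		else
			memcpy(data, &child->thread.TS_FPR(fpidx),
			       sizeof(u64));
	} else {
		*data = child->thread.fp_state.fpscr;
	}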
-
Submitted by Wenpeng Liang

mainline inclusion from mainline-v5.18-rc1
commit 73f7e056
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5A9XK
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?id=73f7e05609ec

Abstract the alloc_cqc() into several parts and separate the process unrelated to allocating the CQC.

Link: https://lore.kernel.org/r/20220302064830.61706-10-liangwenpeng@huawei.com
Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Zhengfeng Luo <luozhengfeng@h-partners.com>
Reviewed-by: Yangyang Li <liyangyang20@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Chengchang Tang

mainline inclusion from mainline-v5.18-rc1
commit b65afbd2
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5A9XK
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?id=b65afbd2a05c

Abstract the alloc_srqc() into several parts and separate the alloc_srqn() from the alloc_srqc().

Link: https://lore.kernel.org/r/20220302064830.61706-9-liangwenpeng@huawei.com
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Zhengfeng Luo <luozhengfeng@h-partners.com>
Reviewed-by: Yangyang Li <liyangyang20@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Wenpeng Liang

mainline inclusion from mainline-v5.18-rc1
commit 904de76c
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5A9XK
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?id=904de76c42b7

hns_roce_alloc_cmd_mailbox() never returns NULL, so the check should be IS_ERR(). And the error code should be converted into the function's return value.

Link: https://lore.kernel.org/r/20220302064830.61706-8-liangwenpeng@huawei.com
Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Zhengfeng Luo <luozhengfeng@h-partners.com>
Reviewed-by: Yangyang Li <liyangyang20@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
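A minimal sketch of the corrected pattern, assuming a caller that previously tested for NULL:

	mailbox = hns_roce_alloc_cmd_mailbox(hr_dev);
	/* the allocator returns an ERR_PTR() on failure, never NULL,
	 * so a NULL check can never fire */
	if (IS_ERR(mailbox))
		return PTR_ERR(mailbox);	/* propagate the real error */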
-
Submitted by Chengchang Tang

mainline inclusion from mainline-v5.18-rc1
commit cf7f8f5c
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5A9XK
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?id=cf7f8f5c1c54

Remove duplicate code for creating and destroying hardware contexts via mailbox.

Link: https://lore.kernel.org/r/20220302064830.61706-7-liangwenpeng@huawei.com
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Zhengfeng Luo <luozhengfeng@h-partners.com>
Reviewed-by: Yangyang Li <liyangyang20@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Chengchang Tang

mainline inclusion from mainline-v5.18-rc1
commit 162e29fe
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5A9XK
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?id=162e29feabba

The current mailbox functions have too many parameters, making the code difficult to maintain. So construct a new structure, mbox_msg, to pass the information needed by the mailbox.

Link: https://lore.kernel.org/r/20220302064830.61706-6-liangwenpeng@huawei.com
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Zhengfeng Luo <luozhengfeng@h-partners.com>
Reviewed-by: Yangyang Li <liyangyang20@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Wenpeng Liang

mainline inclusion from mainline-v5.18-rc1
commit e50cda2b
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5A9XK
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?id=e50cda2b9f84

The "op" field of the mailbox occupies 8 bits, so the parameter "op" should be of type u8.

Link: https://lore.kernel.org/r/20220302064830.61706-5-liangwenpeng@huawei.com
Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Zhengfeng Luo <luozhengfeng@h-partners.com>
Reviewed-by: Yangyang Li <liyangyang20@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Wenpeng Liang

mainline inclusion from mainline-v5.18-rc1
commit 479dc93b
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5A9XK
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?id=479dc93ba75d

The parameter "out_param" of the mailbox is always null when the context is destroyed, so remove the function parameter "mailbox".

Link: https://lore.kernel.org/r/20220302064830.61706-4-liangwenpeng@huawei.com
Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Zhengfeng Luo <luozhengfeng@h-partners.com>
Reviewed-by: Yangyang Li <liyangyang20@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Chengchang Tang

mainline inclusion from mainline-v5.18-rc1
commit 0018ed4b
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5A9XK
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?id=0018ed4bb07f

The value of the function parameter "timeout" is unique. Therefore, it is unnecessary to specify the "timeout" value each time, so remove it.

Link: https://lore.kernel.org/r/20220302064830.61706-3-liangwenpeng@huawei.com
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Haoyue Xu <xuhaoyue1@hisilicon.com>
Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Zhengfeng Luo <luozhengfeng@h-partners.com>
Reviewed-by: Yangyang Li <liyangyang20@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Chengchang Tang

mainline inclusion from mainline-v5.18-rc1
commit 5a32949d
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I5A9XK
CVE: NA
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma.git/commit/?id=5a32949d81cc

The parameter "op_modifier" is only used for HIP06. It is useless for HIP08 and later versions. After removing HIP06, this parameter is no longer used, so remove it.

Link: https://lore.kernel.org/r/20220302064830.61706-2-liangwenpeng@huawei.com
Signed-off-by: Chengchang Tang <tangchengchang@huawei.com>
Signed-off-by: Haoyue Xu <xuhaoyue1@hisilicon.com>
Signed-off-by: Wenpeng Liang <liangwenpeng@huawei.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Jason Gunthorpe <jgg@nvidia.com>
Signed-off-by: Zhengfeng Luo <luozhengfeng@h-partners.com>
Reviewed-by: Yangyang Li <liyangyang20@huawei.com>
Acked-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Willy Tarreau

mainline inclusion from mainline-v5.18-rc6
commit 4c2c8f03
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I5C3A9
CVE: CVE-2022-32296
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=4c2c8f03a5ab7cb04ec64724d7d176d00bcc91e5

--------------------------------

Moshe Kol, Amit Klein, and Yossi Gilad reported being able to accurately identify a client by forcing it to emit only 40 times more connections than there are entries in the table_perturb[] table. The previous two improvements, consisting of resalting the secret every 10s and adding randomness to each port selection, only slightly improved the situation, and the current value of 2^8 was too small, as it's not very difficult to make a client emit 10k connections in less than 10 seconds.

Thus we're increasing the perturb table from 2^8 to 2^16 so that the same precision now requires 2.6M connections, which is more difficult in this time frame and harder to hide as a background activity. The impact is that the table now uses 256 kB instead of 1 kB, which could mostly affect devices making frequent outgoing connections. However, such components usually target a small set of destinations (load balancers, database clients, perf assessment tools), and in practice only a few entries will be visited, like before.

A live test at 1 million connections per second showed no performance difference from the previous value.

Reported-by: Moshe Kol <moshe.kol@mail.huji.ac.il>
Reported-by: Yossi Gilad <yossi.gilad@mail.huji.ac.il>
Reported-by: Amit Klein <aksecurity@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Conflicts:
	net/ipv4/inet_hashtables.c
Signed-off-by: Baisong Zhong <zhongbaisong@huawei.com>
Reviewed-by: Wei Yongjun <weiyongjun1@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
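A hedged sketch of the change's core: growing the perturbation table from 2^8 to 2^16 u32 entries (constant names follow the upstream patch; the allocation details in this tree may differ given the noted conflicts):

	/* was 8: 256 entries (1 kB); now 16: 64K entries (256 kB) */
	#define INET_TABLE_PERTURB_SHIFT 16
	#define INET_TABLE_PERTURB_SIZE (1 << INET_TABLE_PERTURB_SHIFT)
	static u32 *table_perturb;

	/* boot-time allocation sized for the larger table */
	table_perturb = alloc_large_system_hash("Table-perturb",
						sizeof(*table_perturb),
						INET_TABLE_PERTURB_SIZE,
						0, 0, NULL, NULL,
						INET_TABLE_PERTURB_SIZE,
						INET_TABLE_PERTURB_SIZE);

With u32 entries this matches the sizes quoted in the message: 2^8 * 4 B = 1 kB before, 2^16 * 4 B = 256 kB after.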
-
Submitted by Eric Dumazet

stable inclusion from stable-v5.10.119
commit 33f1b4a27abced7ae0f740d2ec3040debf7c4b3c
category: bugfix
bugzilla: https://gitee.com/src-openeuler/kernel/issues/I5C3A9
CVE: CVE-2022-32296
Reference: https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=33f1b4a27abced7ae0f740d2ec3040debf7c4b3c

--------------------------------

commit 190cc824 upstream.

RFC 6056 (Recommendations for Transport-Protocol Port Randomization) provides a good summary of why source port selection needs extra care. David Dworken reminded us that linux implements Algorithm 3 as described in RFC 6056 3.3.3.

Quoting David:

	In the context of the web, this creates an interesting info leak
	where websites can count how many TCP connections a user's
	computer is establishing over time. For example, this allows a
	website to count exactly how many subresources a third party
	website loaded.

	This also allows:
	- Distinguishing between different users behind a VPN based on
	  distinct source port ranges.
	- Tracking users over time across multiple networks.
	- Covert communication channels between different browsers/browser
	  profiles running on the same computer
	- Tracking what applications are running on a computer based on
	  the pattern of how fast source ports are getting incremented.

Section 3.3.4 describes an enhancement that reduces attackers' ability to use the basic information currently stored in the shared 'u32 hint'. This change also decreases the collision rate when multiple applications need to connect() to different destinations.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: David Dworken <ddworken@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Stefan Ghinea <stefan.ghinea@windriver.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Conflicts:
	net/ipv4/inet_hashtables.c
Signed-off-by: Baisong Zhong <zhongbaisong@huawei.com>
Reviewed-by: Wei Yongjun <weiyongjun1@huawei.com>
Reviewed-by: Xiu Jianfeng <xiujianfeng@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Baokun Li

hulk inclusion
category: bugfix
bugzilla: 186777, https://gitee.com/openeuler/kernel/issues/I5C568
CVE: NA

--------------------------------

ext4_mb_normalize_request() can move the logical start of allocated blocks to reduce fragmentation and better utilize preallocation. However, the logical block requested as the start of the allocation (ac->ac_o_ex.fe_logical) should always be covered by the allocated blocks, so we should check for that by changing "and" to "or" in the assertion.

Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
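A sketch of the corrected assertion in ext4_mb_normalize_request(), following the corresponding upstream fix: complain when the chosen range [start, start + size) fails to cover the originally requested logical block on either side.

	if (start + size <= ac->ac_o_ex.fe_logical ||	/* was: && */
	    start > ac->ac_o_ex.fe_logical) {
		ext4_msg(ac->ac_sb, KERN_ERR,
			 "start %lu, size %lu, fe_logical %lu",
			 (unsigned long)start, (unsigned long)size,
			 (unsigned long)ac->ac_o_ex.fe_logical);
		BUG();
	}

With "&&" the two conditions can never both hold (a range cannot end before and start after the same block), so the old assertion was dead code; "||" makes each violation trip it.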
-
Submitted by Baokun Li

hulk inclusion
category: bugfix
bugzilla: 186777, https://gitee.com/openeuler/kernel/issues/I5C568
CVE: NA

--------------------------------

Hulk Robot reported a BUG_ON:

==================================================================
kernel BUG at fs/ext4/mballoc.c:3211!
[...]
RIP: 0010:ext4_mb_mark_diskspace_used.cold+0x85/0x136f
[...]
Call Trace:
 ext4_mb_new_blocks+0x9df/0x5d30
 ext4_ext_map_blocks+0x1803/0x4d80
 ext4_map_blocks+0x3a4/0x1a10
 ext4_writepages+0x126d/0x2c30
 do_writepages+0x7f/0x1b0
 __filemap_fdatawrite_range+0x285/0x3b0
 file_write_and_wait_range+0xb1/0x140
 ext4_sync_file+0x1aa/0xca0
 vfs_fsync_range+0xfb/0x260
 do_fsync+0x48/0xa0
[...]
==================================================================

The above issue may happen as follows:

do_fsync
 vfs_fsync_range
  ext4_sync_file
   file_write_and_wait_range
    __filemap_fdatawrite_range
     do_writepages
      ext4_writepages
       mpage_map_and_submit_extent
        mpage_map_one_extent
         ext4_map_blocks
          ext4_mb_new_blocks
           ext4_mb_normalize_request
            >>> start + size <= ac->ac_o_ex.fe_logical
           ext4_mb_regular_allocator
            ext4_mb_simple_scan_group
             ext4_mb_use_best_found
              ext4_mb_new_preallocation
               ext4_mb_new_inode_pa
                ext4_mb_use_inode_pa
                 >>> set ac->ac_b_ex.fe_len <= 0
          ext4_mb_mark_diskspace_used
           >>> BUG_ON(ac->ac_b_ex.fe_len <= 0);

We can easily reproduce this problem with the following commands:

	fallocate -l100M disk
	mkfs.ext4 -b 1024 -g 256 disk
	mount disk /mnt
	fsstress -d /mnt -l 0 -n 1000 -p 1

The size must be smaller than or equal to EXT4_BLOCKS_PER_GROUP. Therefore, "start + size <= ac->ac_o_ex.fe_logical" may occur when the size is truncated. So start should be the start position of the group where ac_o_ex.fe_logical is located after alignment. In addition, when the value of fe_logical or EXT4_BLOCKS_PER_GROUP is very large, the value calculated by start_off is more accurate.

Fixes: cd648b8a ("ext4: trim allocation requests to group size")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Baokun Li <libaokun1@huawei.com>
Reviewed-by: Zhang Yi <yi.zhang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Zhengchao Shao

hulk inclusion
category: bugfix
bugzilla: 186807, https://gitee.com/openeuler/kernel/issues/I5ATLD
CVE: NA

--------------------------------

When we clean up a namespace, we have to notify every netdevice that the device is down. If too many devices are registered, the notifications take too much CPU time and cause a CPU soft-lockup. The reproducer follows:

	NIFS=50
	for ((i=0; i<$NIFS; i++))
	do
		ip netns add dummy-ns$i
		ip netns exec dummy-ns$i ip link set lo up
	done

	for ((j=0; j<$NIFS; j++))
	do
		for ((i=0; i<1000; i++))
		do
			if=eth$j$i
			ip netns exec dummy-ns$j ip link add $if type dummy
			ip netns exec dummy-ns$j ip link set $if up
		done
	done

	for ((i=0; i<$NIFS; i++))
	do
		ip netns del dummy-ns$i
	done

The test results in the following stack, so the cleanup work must sleep for a while when notifying device down/change:

watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [kworker/u8:5:288]
Modules linked in:
CPU: 0 PID: 288 Comm: kworker/u8:5 Tainted: G B 5.10.0+ #5
Hardware name: linux,dummy-virt (DT)
Workqueue: netns cleanup_net
pstate: 20000005 (nzCv daif -PAN -UAO -TCO BTYPE=--)
pc : atomic_set include/asm-generic/atomic-instrumented.h:46 [inline]
pc : __alloc_skb+0x268/0x450 net/core/skbuff.c:241
lr : atomic_set include/asm-generic/atomic-instrumented.h:46 [inline]
lr : __alloc_skb+0x268/0x450 net/core/skbuff.c:241
sp : ffff000015607610
x29: ffff000015607610 x28: 00000000ffffffff
x27: 0000000000000001 x26: ffff0000cc9400e0
x25: ffff0000c745c1be x24: 1fffe00002ac0ed0
x23: 0000000000000000 x22: ffff0000cc9400c0
x21: ffff0000c745c234 x20: ffff0000cc940000
x19: ffff0000c745c140 x18: 0000000000000000
x17: 0000000000000000 x16: 0000000000000000
x15: 0000000000000000 x14: 1fffe00002ac0f00
x13: 0000000000000000 x12: ffff80001992801d
x11: 1fffe0001992801c x10: ffff80001992801c
x9 : dfffa00000000000 x8 : ffff0000cc9400e3
x7 : 0000000000000001 x6 : ffff80001992801c
x5 : ffff0000cc9400e0 x4 : dfffa00000000000
x3 : ffffa00011529a78 x2 : 0000000000000003
x1 : 0000000000000000 x0 : ffff0000cc9400e0
Call trace:
 atomic_set include/asm-generic/atomic-instrumented.h:46 [inline]
 __alloc_skb+0x268/0x450 net/core/skbuff.c:241
 alloc_skb include/linux/skbuff.h:1107 [inline]
 nlmsg_new include/net/netlink.h:958 [inline]
 rtmsg_ifa+0xf4/0x1e0 net/ipv4/devinet.c:1900
 __inet_del_ifa+0x328/0x650 net/ipv4/devinet.c:427
 inet_del_ifa net/ipv4/devinet.c:465 [inline]
 inetdev_destroy net/ipv4/devinet.c:318 [inline]
 inetdev_event+0x2ac/0xac0 net/ipv4/devinet.c:1599
 notifier_call_chain kernel/notifier.c:83 [inline]
 raw_notifier_call_chain+0x94/0xd0 kernel/notifier.c:410
 call_netdevice_notifiers_info+0x9c/0x14c net/core/dev.c:2047
 call_netdevice_notifiers_extack net/core/dev.c:2059 [inline]
 call_netdevice_notifiers net/core/dev.c:2073 [inline]
 rollback_registered_many+0x3d0/0x7dc net/core/dev.c:9558
 unregister_netdevice_many+0x40/0x1b0 net/core/dev.c:10779
 default_device_exit_batch+0x24c/0x2a0 net/core/dev.c:11262
 ops_exit_list+0xb4/0xd0 net/core/net_namespace.c:192
 cleanup_net+0x2b8/0x540 net/core/net_namespace.c:608
 process_one_work+0x3ec/0xa40 kernel/workqueue.c:2279
 worker_thread+0x110/0x8b0 kernel/workqueue.c:2425
 kthread+0x1ac/0x1fc kernel/kthread.c:313
 ret_from_fork+0x10/0x18 arch/arm64/kernel/entry.S:1034

Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Reviewed-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
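A hedged sketch of the mitigation the message calls for: yielding the CPU periodically while notifying devices during namespace cleanup (illustrative placement; the actual patch may hook a different loop):

	list_for_each_entry(dev, head, unreg_list) {
		/* tell protocols this device is going down */
		call_netdevice_notifiers(NETDEV_DOWN, dev);
		/* with tens of thousands of netdevices, give the
		 * scheduler a chance between notifications so the
		 * cleanup kworker doesn't trip the soft-lockup watchdog */
		cond_resched();
	}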
-
Submitted by Coly Li

mainline inclusion from v5.19-rc1
commit a1a2d8f0
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue
CVE: N/A

-----------------------------------------

The kworker routine update_writeback_rate() is scheduled to update the writeback rate every 5 seconds by default. Before calling __update_writeback_rate() to do the real job, the semaphore dc->writeback_lock should be held by the kworker routine.

At the same time, the bcache writeback thread routine bch_writeback_thread() also needs to hold dc->writeback_lock before flushing dirty data back into the backing device. If the dirty data set is large, it might take a very long time for bch_writeback_thread() to scan all dirty buckets and release dc->writeback_lock. In such a case, update_writeback_rate() can be starved for long enough that the kernel reports a soft lockup warning started like:

	watchdog: BUG: soft lockup - CPU#246 stuck for 23s! [kworker/246:31:179713]

Such a soft lockup condition is unnecessary, because after the writeback thread finishes its job and releases dc->writeback_lock, the kworker update_writeback_rate() may continue to work and everything is fine indeed.

This patch avoids the unnecessary soft lockup by the following method,
- Add a new member to struct cached_dev - dc->rate_update_retry (0 by default).
- In update_writeback_rate(), call down_read_trylock(&dc->writeback_lock) first; if it fails then lock contention happens.
- If dc->rate_update_retry <= BCH_WBRATE_UPDATE_MAX_SKIPS (15), don't acquire the lock and reschedule the kworker for the next try.
- If dc->rate_update_retry > BCH_WBRATE_UPDATE_MAX_SKIPS, don't retry anymore and call down_read(&dc->writeback_lock) to wait for the lock.

By the above method, in the worst case update_writeback_rate() may retry for 1+ minutes before blocking on dc->writeback_lock by calling down_read(). For a 4TB cache device with 1TB dirty data, 90%+ of the unnecessary soft lockup warning messages can be avoided.

When retrying to acquire dc->writeback_lock in update_writeback_rate(), of course the writeback rate cannot be updated. It is fair, because when the kworker is blocked on the lock contention of dc->writeback_lock, the writeback rate cannot be updated either.

This change follows Jens Axboe's suggestion for a clearer and simpler version.

Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20220528124550.32834-2-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
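A hedged sketch of the retry logic described in the list above (field and constant names per the commit text; the surrounding kworker boilerplate is abridged):

	static void update_writeback_rate(struct work_struct *work)
	{
		struct cached_dev *dc =
			container_of(to_delayed_work(work),
				     struct cached_dev,
				     writeback_rate_update);
		/* ... */
		if (!down_read_trylock(&dc->writeback_lock)) {
			dc->rate_update_retry++;
			if (dc->rate_update_retry <=
			    BCH_WBRATE_UPDATE_MAX_SKIPS) {
				/* contended: skip this round, try again */
				schedule_delayed_work(
					&dc->writeback_rate_update,
					dc->writeback_rate_update_seconds * HZ);
				return;
			}
			/* too many skips: block on the lock as before */
			down_read(&dc->writeback_lock);
			dc->rate_update_retry = 0;
		}
		/* __update_writeback_rate(dc);
		 * up_read(&dc->writeback_lock); ... */
	}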
-
Submitted by Jia-Ju Bai

mainline inclusion from v5.19-rc1
commit 40f567bb
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue
CVE: N/A

------------------------------------------

The call to kzalloc() in detached_dev_do_request() can fail, so its return value should be checked.

Fixes: bc082a55 ("bcache: fix inaccurate io state for detached bcache devices")
Reported-by: TOTE Robot <oslab@tsinghua.edu.cn>
Signed-off-by: Jia-Ju Bai <baijiaju1990@gmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20220527152818.27545-4-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
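A sketch of the missing-check fix in detached_dev_do_request() (error-path details hedged; the idea is to fail the bio rather than dereference a NULL pointer):

	ddip = kzalloc(sizeof(struct detached_dev_io_private), GFP_NOIO);
	if (!ddip) {
		bio->bi_status = BLK_STS_RESOURCE;
		bio->bi_end_io(bio);	/* complete the I/O with an error */
		return;
	}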
-
Submitted by Coly Li

mainline inclusion from v5.19-rc1
commit 7d6b902e
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue
CVE: N/A

---------------------------------------

The local variables check_state (in bch_btree_check()) and state (in bch_sectors_dirty_init()) should be fully filled with 0, because before they were allocated on the stack, they were dynamically allocated with kzalloc().

Signed-off-by: Coly Li <colyli@suse.de>
Link: https://lore.kernel.org/r/20220527152818.27545-2-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
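A minimal sketch of the point: kzalloc() returned zeroed memory implicitly, while stack variables contain garbage unless zeroed explicitly.

	struct btree_check_state check_state;

	/* was: check_state = kzalloc(sizeof(*check_state), GFP_KERNEL);
	 * on the stack, the zeroing must be explicit: */
	memset(&check_state, 0, sizeof(struct btree_check_state));

	/* same for "struct bch_dirty_init_state state"
	 * in bch_sectors_dirty_init() */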
-
Submitted by Coly Li

mainline inclusion from v5.19-rc1
commit 32feee36
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue
CVE: N/A

---------------------------------

The journal no-space deadlock was reported from time to time. Such a deadlock can happen in the following situation.

When all journal buckets are fully filled by active jsets under heavy write I/O load, the cache set registration (after a reboot) will load all active jsets and insert them into the btree again (which is called journal replay). If a journaled bkey is inserted into a btree node and results in a btree node split, a new journal request might be triggered. For example, if the btree grows one more level after the node split, the root node record in the cache device super block will be upgraded by bch_journal_meta() from bch_btree_set_root(). But there is no space in the journal buckets; the journal replay has to wait for a new journal bucket to be reclaimed after at least one journal bucket has been replayed. This is one example of how the journal no-space deadlock happens.

The solution to avoid the deadlock is to reserve 1 journal bucket at run time, and only permit the reserved journal bucket to be used during the cache set registration procedure for things like journal replay. Then the journal space will never be fully filled, and there is no chance for the journal no-space deadlock to happen anymore.

This patch adds a new member "bool do_reserve" to struct journal. It is initialized to 0 (false) when struct journal is allocated, and set to 1 (true) by bch_journal_space_reserve() when all initialization is done in run_cache_set(). At run time, when journal_reclaim() tries to allocate a new journal bucket, free_journal_buckets() is called to check whether there are enough free journal buckets to use. If there is only 1 free journal bucket and journal->do_reserve is 1 (true), the last bucket is reserved and free_journal_buckets() will return 0 to indicate no free journal bucket. Then journal_reclaim() will give up and try next time to see whether there is a free journal bucket to allocate. By this method, there is always 1 journal bucket reserved at run time.

During the cache set registration, journal->do_reserve is 0 (false), so the reserved journal bucket can be used to avoid the no-space deadlock.

Reported-by: Nikhil Kshirsagar <nkshirsagar@gmail.com>
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20220524102336.10684-5-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
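A hedged sketch of the reservation check described above, close to the upstream helper (field names per the commit text; the ring-buffer index arithmetic is the usual cur_idx/discard_idx wraparound):

	static unsigned int free_journal_buckets(struct cache_set *c)
	{
		struct journal *j = &c->journal;
		struct cache *ca = c->cache;
		struct journal_device *ja = &ca->journal;
		unsigned int n;

		/* in case njournal_buckets is not a power of 2 */
		if (ja->cur_idx >= ja->discard_idx)
			n = ca->sb.njournal_buckets +
			    ja->discard_idx - ja->cur_idx;
		else
			n = ja->discard_idx - ja->cur_idx;

		/* hold one bucket back once do_reserve is set at run time;
		 * during registration (do_reserve == 0) it stays usable */
		if (n > (1 + j->do_reserve))
			return n - (1 + j->do_reserve);

		return 0;
	}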
-
Submitted by Coly Li

mainline inclusion from v5.19-rc1
commit 80db4e47
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue
CVE: N/A

----------------------------------------

After making bch_sectors_dirty_init() multithreaded, the existing incremental dirty sector counting in bch_root_node_dirty_init() doesn't release btree occupation after iterating 500000 (INIT_KEYS_EACH_TIME) bkeys. Because a read lock is added on the btree root node to prevent the btree from being split during the dirty sectors counting, other I/O requesters have no chance to gain the write lock even after restarting bcache_btree(). That is to say, the incremental dirty sectors counting is incompatible with the multithreaded bch_sectors_dirty_init(). We have to choose one and drop the other.

In my testing, with 512-byte random writes, I generated 1.2T of dirty data and a btree with 400K nodes. With a single thread and incremental dirty sectors counting, it takes 30+ minutes to register the backing device. With multithreaded dirty sectors counting, the backing device registration can be accomplished within 2 minutes.

The 30+ minutes vs. 2 minutes difference made me decide to keep the multithreaded bch_sectors_dirty_init() and drop the incremental dirty sectors counting. This is what this patch does. But INIT_KEYS_EACH_TIME is kept; in sectors_dirty_init_fn() the CPU will be released by cond_resched() after every INIT_KEYS_EACH_TIME keys iterated. This is to avoid the watchdog reporting a bogus soft lockup warning.

Fixes: b144e45f ("bcache: make bch_sectors_dirty_init() to be multithreaded")
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20220524102336.10684-4-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Coly Li

mainline inclusion from v5.19-rc1
commit 4dc34ae1
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue
CVE: N/A

-------------------------------------------

Commit b144e45f ("bcache: make bch_sectors_dirty_init() to be multithreaded") makes bch_sectors_dirty_init() much faster when counting dirty sectors by iterating all dirty keys in the btree. But it isn't in ideal shape yet; it can still be improved.

This patch does the following changes to improve the current parallel dirty keys iteration on the btree,
- Add a read lock to the root node when multiple threads iterate the btree, to prevent the root node from getting split by I/Os from other registered bcache devices.
- Remove the local variable "char name[32]" and generate the kernel thread name string directly when calling kthread_run().
- Allocate "struct bch_dirty_init_state state" directly on the stack and avoid the unnecessary dynamic memory allocation for it.
- Decrease BCH_DIRTY_INIT_THRD_MAX from 64 to 12, which is enough indeed.
- Increase &state->started to count a created kernel thread only after it has been created successfully.
- When waiting for all dirty key counting threads to finish, use wait_event() to replace wait_event_interruptible().

With the above changes, the code is clearer, and some potential error conditions are avoided.

Fixes: b144e45f ("bcache: make bch_sectors_dirty_init() to be multithreaded")
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20220524102336.10684-3-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Coly Li

mainline inclusion from v5.19-rc1
commit 62253644
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue
CVE: N/A

----------------------------------------

Commit 8e710227 ("bcache: make bch_btree_check() to be multithreaded") makes bch_btree_check() much faster when checking all btree nodes during cache device registration. But it isn't in ideal shape yet; it can still be improved.

This patch does the following things to improve the current parallel btree node check by multiple threads in bch_btree_check(),
- Add a read lock to the root node while checking all the btree nodes with multiple threads. Although currently it is not mandatory, it is good to have a read lock in the code logic.
- Remove the local variable 'char name[32]', and generate the kernel thread name string directly when calling kthread_run().
- Allocate the local variable "struct btree_check_state check_state" on the stack and avoid unnecessary dynamic memory allocation for it.
- Reduce BCH_BTR_CHKTHREAD_MAX from 64 to 12, which is enough indeed.
- Increase check_state->started to count a created kernel thread only after it has been created successfully.
- When waiting for all checking kernel threads to finish, use wait_event() to replace wait_event_interruptible().

With this change, the code is clearer, and some potential error conditions are avoided.

Fixes: 8e710227 ("bcache: make bch_btree_check() to be multithreaded")
Signed-off-by: Coly Li <colyli@suse.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20220524102336.10684-2-colyli@suse.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
-
Submitted by Mingzhe Zou

mainline inclusion from v5.18-rc1
commit 887554ab
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue
CVE: N/A

----------------------------------------

When multiple threads check btree nodes in parallel, the main thread waits for all threads to stop or for the CACHE_SET_IO_DISABLE flag:

	wait_event_interruptible(check_state->wait,
				 atomic_read(&check_state->started) == 0 ||
				 test_bit(CACHE_SET_IO_DISABLE, &c->flags));

However, bch_btree_node_read and bch_btree_node_read_done may call bch_cache_set_error, and then CACHE_SET_IO_DISABLE will be set. If the flag is already set, the main thread returns an error while some threads may still be running and read a NULL pointer, and the kernel will crash.

This patch changes the event wait condition: the main thread must wait for all threads to stop.

Fixes: 8e710227 ("bcache: make bch_btree_check() to be multithreaded")
Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn>
Cc: stable@vger.kernel.org # v5.7+
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
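A hedged before/after sketch of the condition change (the exact wait primitive in the tree may differ; the point is dropping the early-exit flag test):

	/* before: could return while checker threads were still running */
	wait_event_interruptible(check_state->wait,
				 atomic_read(&check_state->started) == 0 ||
				 test_bit(CACHE_SET_IO_DISABLE, &c->flags));

	/* after: always wait until every checker thread has stopped */
	wait_event_interruptible(check_state->wait,
				 atomic_read(&check_state->started) == 0);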
-
Submitted by Mingzhe Zou

mainline inclusion from v5.18-rc1
commit 7b1002f7
category: bugfix
bugzilla: https://gitee.com/openeuler/kernel/issues/I59A5L?from=project-issue
CVE: N/A

--------------------------------------

When attaching a cached device (a.k.a. backing device) to a cache device, bch_sectors_dirty_init() is called to count dirty sectors and stripes (see what bcache_dev_sectors_dirty_add() does) on the cache device.

When bcache_dev_sectors_dirty_add() is called, a set_bit(stripe, d->full_dirty_stripes) or clear_bit(stripe, d->full_dirty_stripes) operation will always be performed. In full_dirty_stripes, each bit represents stripe_size (8192) sectors (512B), so 1 bit = 4MB (8192 * 512), and each CPU cache line = 64B = 512 bits, covering 2048MB of data.

When 20 threads process a cached disk with 100G of dirty data, a single thread processes about 23M at a time, and 20 threads total 460M. The full_dirty_stripes bits corresponding to this 460M of data are likely to fall in the same CPU cache line. When one of these threads performs a set_bit or clear_bit operation, the same CPU cache line of the other threads becomes invalid and must be read from main memory again. Compared with a single thread, the time of a bcache_dev_sectors_dirty_add() call is increased by about 50 times in our test (100G dirty data, 20 threads, bcache_dev_sectors_dirty_add() called more than 20 million times).

This patch tries test_bit before the set_bit or clear_bit operation. Therefore, a lot of forced set and clear operations are avoided, and most bcache_dev_sectors_dirty_add() calls will only read the CPU cache line.

Signed-off-by: Mingzhe Zou <mingzhe.zou@easystack.cn>
Signed-off-by: Coly Li <colyli@suse.de>
Reviewed-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
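A sketch of the test-before-write pattern from the commit: only dirty the shared cache line when the bit actually needs to change (surrounding stripe bookkeeping abridged).

	if (sectors_dirty == d->stripe_size) {
		/* read-only fast path: skip set_bit if already set */
		if (!test_bit(stripe, d->full_dirty_stripes))
			set_bit(stripe, d->full_dirty_stripes);
	} else {
		/* likewise, skip clear_bit if already clear */
		if (test_bit(stripe, d->full_dirty_stripes))
			clear_bit(stripe, d->full_dirty_stripes);
	}

Unconditional set_bit/clear_bit are atomic read-modify-write operations that invalidate the cache line on every other CPU even when the bit value doesn't change; the extra test_bit turns the common case into a plain shared read.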
-