Commits · 98498945f1d3ab19fc5904ccc4ea31133d9f2776 · openeuler / Kernel

09 Apr, 2021 40 commits

ext4: save error info to sb through journal if available · 98498945

Committed by Jan Kara on Mar 25, 2021

mainline inclusion
from mainline-v5.11-rc4
commit 2d01ddc8
category: bugfix
bugzilla: 50839
CVE: NA

-----------------------------------------------

If journalling is still working at the moment we get to writing error
information to the superblock we cannot write directly to the superblock
as such write could race with journalled update of the superblock and
cause journal checksum failures, writing inconsistent information to the
journal or other problems. We cannot journal the superblock directly
from the error handling functions as we are running in uncertain context
and could deadlock so just punt journalled superblock update to a
workqueue.

Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20201216101844.22917-5-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>


ext4: protect superblock modifications with a buffer lock · 6d70f424

Committed by Jan Kara on Mar 25, 2021

mainline inclusion
from mainline-v5.11-rc4
commit 05c2c00f
category: bugfix
bugzilla: 50839
CVE: NA

-----------------------------------------------

Protect all superblock modifications (including checksum computation)
with a superblock buffer lock. That way we are sure computed checksum
matches current superblock contents (a mismatch could cause checksum
failures in nojournal mode or if an unjournalled superblock update races
with a journalled one). Also we avoid modifying superblock contents
while it is being written out (which can cause DIF/DIX failures if we
are running in nojournal mode).

Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20201216101844.22917-4-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>

conflicts:
fs/ext4/file.c
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
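
The locking rule described in this commit can be illustrated with a minimal userspace sketch. It is not the ext4 code: `struct toy_sb`, `toy_sb_csum()` and the pthread mutex are assumptions standing in for the superblock buffer, the crc32c checksum and the buffer lock. The point it shows is that the field update and the checksum recomputation sit in one critical section, and that a writer copies the buffer under the same lock.

```c
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an on-disk superblock; not the ext4 layout. */
struct toy_sb {
	uint32_t free_blocks;
	uint32_t error_count;
	uint32_t checksum;	/* covers every field above */
};

/* Stand-in for the superblock buffer lock. */
static pthread_mutex_t sb_lock = PTHREAD_MUTEX_INITIALIZER;

/* Toy checksum over the bytes before the checksum field. */
static uint32_t toy_sb_csum(const struct toy_sb *sb)
{
	const unsigned char *p = (const unsigned char *)sb;
	uint32_t sum = 0;

	for (size_t i = 0; i < offsetof(struct toy_sb, checksum); i++)
		sum = sum * 31 + p[i];
	return sum;
}

/* Modify a field and recompute the checksum in one critical section. */
static void toy_sb_set_free_blocks(struct toy_sb *sb, uint32_t n)
{
	pthread_mutex_lock(&sb_lock);
	sb->free_blocks = n;
	sb->checksum = toy_sb_csum(sb);
	pthread_mutex_unlock(&sb_lock);
}

/* A writer copies the buffer under the same lock, so it never sees
 * new contents paired with an old checksum. */
static struct toy_sb toy_sb_snapshot(struct toy_sb *sb)
{
	struct toy_sb copy;

	pthread_mutex_lock(&sb_lock);
	copy = *sb;
	pthread_mutex_unlock(&sb_lock);
	return copy;
}
```

Any snapshot taken this way satisfies `snap.checksum == toy_sb_csum(&snap)`, which is the invariant the commit message is after.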


ext4: drop sync argument of ext4_commit_super() · ee80bf03

Committed by Jan Kara on Mar 25, 2021

mainline inclusion
from mainline-v5.11-rc4
commit 4392fbc4
category: bugfix
bugzilla: 50839
CVE: NA

-----------------------------------------------

Everybody passes 1 as sync argument of ext4_commit_super(). Just drop
it.

Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20201216101844.22917-3-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>


ext4: combine ext4_handle_error() and save_error_info() · acf13bf3

Committed by Jan Kara on Mar 25, 2021

mainline inclusion
from mainline-v5.11-rc4
commit e789ca0c
category: bugfix
bugzilla: 50839
CVE: NA

-----------------------------------------------

save_error_info() is always called together with ext4_handle_error().
Combine them into a single call and move unconditional bits out of
save_error_info() into ext4_handle_error().

Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20201216101844.22917-2-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>


ext4: defer saving error info from atomic context · f40c120c

Committed by Jan Kara on Mar 25, 2021

mainline inclusion
from mainline-v5.11-rc1
commit c92dc856
category: bugfix
bugzilla: 50839
CVE: NA

-----------------------------------------------

When filesystem inconsistency is detected with group locked, we
currently try to modify superblock to store error there without
blocking. However this can cause superblock checksum failures (or
DIF/DIX failure) when the superblock is just being written out.

Make error handling code just store error information in ext4_sb_info
structure and copy it to on-disk superblock only in ext4_commit_super().
In case of error happening with group locked, we just postpone the
superblock flushing to a workqueue.

[ Added fixup so that s_first_error_* does not get updated after
  the file system is remounted.
  Also added fix for syzbot failure.  - Ted ]

Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20201127113405.26867-8-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: Hillf Danton <hdanton@sina.com>
Reported-by: syzbot+9043030c040ce1849a60@syzkaller.appspotmail.com
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
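
The split this commit describes (record the error in memory, copy it to the on-disk superblock only at commit time) might look roughly like the following sketch. All names here (`toy_sb_info`, `toy_disk_sb`, `toy_record_error()`, `toy_commit_super()`) are hypothetical and only loosely modelled on ext4_sb_info and ext4_commit_super(); the workqueue that triggers the commit from atomic context is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical in-memory state, loosely modelled on ext4_sb_info. */
struct toy_sb_info {
	bool     error_pending;	/* set by the error path, cleared on commit */
	uint32_t error_code;
	uint32_t error_line;
};

/* Hypothetical on-disk superblock fields. */
struct toy_disk_sb {
	uint32_t error_count;
	uint32_t last_error_code;
	uint32_t last_error_line;
};

/* Error path: touches memory only, so it cannot race with a write-out
 * of the on-disk buffer. */
static void toy_record_error(struct toy_sb_info *sbi, uint32_t code,
			     uint32_t line)
{
	sbi->error_code = code;
	sbi->error_line = line;
	sbi->error_pending = true;
}

/* Commit path: the single place where the on-disk copy is updated. */
static void toy_commit_super(struct toy_sb_info *sbi, struct toy_disk_sb *es)
{
	if (sbi->error_pending) {
		es->error_count++;
		es->last_error_code = sbi->error_code;
		es->last_error_line = sbi->error_line;
		sbi->error_pending = false;
	}
	/* Checksum computation and the actual write would follow here. */
}
```

Because the error path never writes to the on-disk structure, it is safe to call from a context that must not block; only the deferred commit needs to sleep.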


ext4: simplify ext4 error translation · 6603d2ef

Committed by Jan Kara on Mar 25, 2021

mainline inclusion
from mainline-v5.11-rc1
commit 02a7780e
category: bugfix
bugzilla: 50839
CVE: NA

-----------------------------------------------

We convert errno's to ext4 on-disk format error codes in
save_error_info(). Add a function and a bit of macro magic to make this
simpler.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20201127113405.26867-7-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
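
The "function and a bit of macro magic" approach can be sketched as a lookup table whose rows are generated by token pasting. The `TOY_ERR_*` names and their values below are illustrative assumptions, not the ext4 on-disk codes, and the table is deliberately short.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical on-disk codes; the values are illustrative, not ext4's. */
enum toy_err_code {
	TOY_ERR_UNKNOWN = 1,
	TOY_ERR_EIO,
	TOY_ERR_ENOMEM,
	TOY_ERR_ENOSPC,
};

/* Build a table row from the errno name alone: the first use of 'err'
 * expands to the numeric errno, the pasted one names the on-disk code. */
#define TOY_ERR_TRANSLATE(err) { err, TOY_ERR_##err }

static const struct {
	int errno_val;
	int code;
} toy_err_table[] = {
	TOY_ERR_TRANSLATE(EIO),
	TOY_ERR_TRANSLATE(ENOMEM),
	TOY_ERR_TRANSLATE(ENOSPC),
};

/* Map a positive errno value to its on-disk code. */
static int toy_errno_to_code(int err)
{
	size_t n = sizeof(toy_err_table) / sizeof(toy_err_table[0]);

	for (size_t i = 0; i < n; i++)
		if (toy_err_table[i].errno_val == err)
			return toy_err_table[i].code;
	return TOY_ERR_UNKNOWN;
}
```

Adding a new translation is then a one-line change to the table, with no switch statement to keep in sync.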


ext4: move functions in super.c · fc004c92

Committed by Jan Kara on Mar 25, 2021

mainline inclusion
from mainline-v5.11-rc1
commit 40676623
category: bugfix
bugzilla: 50839
CVE: NA

-----------------------------------------------

Just move error info related functions in super.c close to
ext4_handle_error(). We'll want to combine save_error_info() with
ext4_handle_error() and this makes change more obvious and saves a
forward declaration as well. No functional change.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20201127113405.26867-6-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>


ext4: make ext4_abort() use __ext4_error() · 590f38ce

Committed by Jan Kara on Mar 25, 2021

mainline inclusion
from mainline-v5.11-rc1
commit 014c9caa
category: bugfix
bugzilla: 50839
CVE: NA

-----------------------------------------------

The only difference between __ext4_abort() and __ext4_error() is that
the former one ignores errors=continue mount option. Unify the code to
reduce duplication.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20201127113405.26867-5-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>


ext4: standardize error message in ext4_protect_reserved_inode() · 563d7a8b

Committed by Jan Kara on Mar 25, 2021

mainline inclusion
from mainline-v5.11-rc1
commit 93c20bc3
category: bugfix
bugzilla: 50839
CVE: NA

-----------------------------------------------

We use __ext4_error() when ext4_protect_reserved_inode() finds
filesystem corruption. However EXT4_ERROR_INODE_ERR() is perfectly
capable of reporting all the needed information. So just use that.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20201127113405.26867-4-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>


ext4: remove redundant sb checksum recomputation · 8ad485f8

Committed by Jan Kara on Mar 25, 2021

mainline inclusion
from mainline-v5.11-rc1
commit 81414b4d
category: bugfix
bugzilla: 50839
CVE: NA

-----------------------------------------------

Superblock is written out either through ext4_commit_super() or through
ext4_handle_dirty_super(). In both cases we recompute the checksum so it
is not necessary to recompute it after updating superblock free inodes &
blocks counters.

Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/20201127113405.26867-3-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Ye Bin <yebin10@huawei.com>
Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>


binfmt_misc: fix possible deadlock in bm_register_write · c584dce0

Committed by Lior Ribak on Mar 27, 2021

stable inclusion
from stable-5.10.24
commit 5ab9464a2a3c538eedbb438f1802f2fd98d0953f
bugzilla: 51348

--------------------------------

commit e7850f4d upstream.

There is a deadlock in bm_register_write:

First, in the beginning of the function, a lock is taken on the binfmt_misc
root inode with inode_lock(d_inode(root)).

Then, if the user used the MISC_FMT_OPEN_FILE flag, the function will call
open_exec on the user-provided interpreter.

open_exec will call a path lookup, and if the path lookup process includes
the root of binfmt_misc, it will try to take a shared lock on its inode
again, but it is already locked, and the code will get stuck in a deadlock.

To reproduce the bug:
$ echo ":iiiii:E::ii::/proc/sys/fs/binfmt_misc/bla:F" > /proc/sys/fs/binfmt_misc/register

backtrace of where the lock occurs (#5):
0 schedule () at ./arch/x86/include/asm/current.h:15
1 0xffffffff81b51237 in rwsem_down_read_slowpath (sem=0xffff888003b202e0, count=<optimized out>, state=state@entry=2) at kernel/locking/rwsem.c:992
2 0xffffffff81b5150a in __down_read_common (state=2, sem=<optimized out>) at kernel/locking/rwsem.c:1213
3 __down_read (sem=<optimized out>) at kernel/locking/rwsem.c:1222
4 down_read (sem=<optimized out>) at kernel/locking/rwsem.c:1355
5 0xffffffff811ee22a in inode_lock_shared (inode=<optimized out>) at ./include/linux/fs.h:783
6 open_last_lookups (op=0xffffc9000022fe34, file=0xffff888004098600, nd=0xffffc9000022fd10) at fs/namei.c:3177
7 path_openat (nd=nd@entry=0xffffc9000022fd10, op=op@entry=0xffffc9000022fe34, flags=flags@entry=65) at fs/namei.c:3366
8 0xffffffff811efe1c in do_filp_open (dfd=<optimized out>, pathname=pathname@entry=0xffff8880031b9000, op=op@entry=0xffffc9000022fe34) at fs/namei.c:3396
9 0xffffffff811e493f in do_open_execat (fd=fd@entry=-100, name=name@entry=0xffff8880031b9000, flags=<optimized out>, flags@entry=0) at fs/exec.c:913
10 0xffffffff811e4a92 in open_exec (name=<optimized out>) at fs/exec.c:948
11 0xffffffff8124aa84 in bm_register_write (file=<optimized out>, buffer=<optimized out>, count=19, ppos=<optimized out>) at fs/binfmt_misc.c:682
12 0xffffffff811decd2 in vfs_write (file=file@entry=0xffff888004098500, buf=buf@entry=0xa758d0 ":iiiii:E::ii::i:CF\n", count=count@entry=19, pos=pos@entry=0xffffc9000022ff10) at fs/read_write.c:603
13 0xffffffff811defda in ksys_write (fd=<optimized out>, buf=0xa758d0 ":iiiii:E::ii::i:CF\n", count=19) at fs/read_write.c:658
14 0xffffffff81b49813 in do_syscall_64 (nr=<optimized out>, regs=0xffffc9000022ff58) at arch/x86/entry/common.c:46
15 0xffffffff81c0007c in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:120

To solve the issue, the open_exec call is moved to before the write
lock is taken by bm_register_write.

Link: https://lkml.kernel.org/r/20210228224414.95962-1-liorribak@gmail.com
Fixes: 948b701a ("binfmt_misc: add persistent opened binary handler for containers")
Signed-off-by: Lior Ribak <liorribak@gmail.com>
Acked-by: Helge Deller <deller@gmx.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Acked-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
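
The reordering described in this commit can be modelled in a few lines of userspace C. This is only a sketch of the ordering, with assumed names throughout: the two flags stand in for the binfmt_misc root inode lock, and `toy_open_exec()` stands in for the path lookup that would block in the kernel if the lock were already held.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

static bool root_locked;	/* models the binfmt_misc root inode lock */
static bool open_saw_lock;	/* set if a lookup ran while the lock was held */

/* Stand-in for open_exec(): in the kernel, a lookup that reaches the
 * locked inode would block here instead of setting a flag. */
static int toy_open_exec(const char *path)
{
	if (root_locked)
		open_saw_lock = true;
	return path != NULL ? 0 : -ENOENT;
}

/* Fixed ordering: resolve the interpreter first, then take the lock. */
static int toy_register(const char *interpreter)
{
	int err = toy_open_exec(interpreter);	/* no lock held yet */

	if (err)
		return err;

	root_locked = true;	/* inode_lock() stand-in */
	/* ... create the new entry under the lock ... */
	root_locked = false;	/* inode_unlock() stand-in */
	return 0;
}
```

With this ordering `open_saw_lock` can never become true, which corresponds to the lookup never running under the write lock.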


configfs: fix a use-after-free in __configfs_open_file · 37af504b

Committed by Daiyue Zhang on Mar 27, 2021

stable inclusion
from stable-5.10.24
commit 109720342efd6ace3d2e8f34a25ea65036bb1d3b
bugzilla: 51348

--------------------------------

[ Upstream commit 14fbbc82 ]

Commit b0841eef ("configfs: provide exclusion between IO and removals")
uses ->frag_dead to mark the fragment state, so there is no need for an extra
refcount on config_item when opening a file. The configfs_get_config_item()
call was removed from __configfs_open_file(), but the matching
config_item_put() was not. The refcount on config_item therefore becomes
unbalanced, causing use-after-free issues in cases like this:

Test:
1. Mount configfs on /config with read-only items:
drwxrwx--- 289 root   root            0 2021-04-01 11:55 /config
drwxr-xr-x   2 root   root            0 2021-04-01 11:54 /config/a
--w--w--w-   1 root   root         4096 2021-04-01 11:53 /config/a/1.txt
......

2. Then run:
for file in /config
do
echo $file
grep -R 'key' $file
done

3. __configfs_open_file() will be called in parallel; the first call
to run will do:
if (file->f_mode & FMODE_READ) {
	if (!(inode->i_mode & S_IRUGO))
		goto out_put_module;
			config_item_put(buffer->item);
				kref_put()
					package_details_release()
						kfree()

the other one will run into use-after-free issues like this:
BUG: KASAN: use-after-free in __configfs_open_file+0x1bc/0x3b0
Read of size 8 at addr fffffff155f02480 by task grep/13096
CPU: 0 PID: 13096 Comm: grep VIP: 00 Tainted: G        W       4.14.116-kasan #1
TGID: 13096 Comm: grep
Call trace:
dump_stack+0x118/0x160
kasan_report+0x22c/0x294
__asan_load8+0x80/0x88
__configfs_open_file+0x1bc/0x3b0
configfs_open_file+0x28/0x34
do_dentry_open+0x2cc/0x5c0
vfs_open+0x80/0xe0
path_openat+0xd8c/0x2988
do_filp_open+0x1c4/0x2fc
do_sys_open+0x23c/0x404
SyS_openat+0x38/0x48

Allocated by task 2138:
kasan_kmalloc+0xe0/0x1ac
kmem_cache_alloc_trace+0x334/0x394
packages_make_item+0x4c/0x180
configfs_mkdir+0x358/0x740
vfs_mkdir2+0x1bc/0x2e8
SyS_mkdirat+0x154/0x23c
el0_svc_naked+0x34/0x38

Freed by task 13096:
kasan_slab_free+0xb8/0x194
kfree+0x13c/0x910
package_details_release+0x524/0x56c
kref_put+0xc4/0x104
config_item_put+0x24/0x34
__configfs_open_file+0x35c/0x3b0
configfs_open_file+0x28/0x34
do_dentry_open+0x2cc/0x5c0
vfs_open+0x80/0xe0
path_openat+0xd8c/0x2988
do_filp_open+0x1c4/0x2fc
do_sys_open+0x23c/0x404
SyS_openat+0x38/0x48
el0_svc_naked+0x34/0x38

To fix this issue, remove the config_item_put in
__configfs_open_file to balance the refcount of config_item.

Fixes: b0841eef ("configfs: provide exclusion between IO and removals")
Signed-off-by: Daiyue Zhang <zhangdaiyue1@huawei.com>
Signed-off-by: Yi Chen <chenyi77@huawei.com>
Signed-off-by: Ge Qiu <qiuge@huawei.com>
Reviewed-by: Chao Yu <yuchao0@huawei.com>
Acked-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Acked-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
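
The refcount imbalance described above can be reproduced with a toy model. Everything here is hypothetical (`toy_item`, `toy_item_put()`, the two open functions); it assumes the item starts with a single reference owned by whoever created it, and that open takes no reference of its own.

```c
#include <errno.h>
#include <stdbool.h>

/* Toy refcounted object; 'freed' stands in for the release callback. */
struct toy_item {
	int  refcount;
	bool freed;
};

static void toy_item_put(struct toy_item *it)
{
	if (--it->refcount == 0)
		it->freed = true;
}

/* Buggy shape: the error path drops a reference that open never took. */
static int toy_open_buggy(struct toy_item *it, bool readable)
{
	if (!readable) {
		toy_item_put(it);	/* unbalanced put */
		return -EACCES;
	}
	return 0;
}

/* Fixed shape: no get on entry, so no put on the error path. */
static int toy_open_fixed(struct toy_item *it, bool readable)
{
	(void)it;
	if (!readable)
		return -EACCES;
	return 0;
}
```

In the buggy shape a single failed open frees an item that its creator still references, so the next access is a use-after-free; the fixed shape leaves the count untouched.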


NFSv4.2: fix return value of _nfs4_get_security_label() · 7a9cf8a4

Committed by Ondrej Mosnacek on Mar 27, 2021

stable inclusion
from stable-5.10.24
commit caa86901c863e7c3646d189f2deb9e844afd0568
bugzilla: 51348

--------------------------------

[ Upstream commit 53cb2454 ]

An xattr 'get' handler is expected to return the length of the value on
success, yet _nfs4_get_security_label() (and consequently also
nfs4_xattr_get_nfs4_label(), which is used as an xattr handler) returns
just 0 on success.

Fix this by returning label.len instead, which contains the length of
the result.

Fixes: aa9c2669 ("NFS: Client implementation of Labeled-NFS")
Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
Reviewed-by: James Morris <jamorris@linux.microsoft.com>
Reviewed-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Acked-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
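
The return-value contract mentioned in this commit (a 'get' handler returns the value length on success) can be shown with a small userspace sketch. `toy_label_get()` is a hypothetical function, not the NFS handler; its size-query and -ERANGE behaviour follow the getxattr(2) conventions rather than anything stated in the commit.

```c
#include <errno.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical 'get' handler: copy the value out and return its length. */
static ssize_t toy_label_get(const char *label, size_t label_len,
			     void *buf, size_t buflen)
{
	if (buflen == 0)
		return (ssize_t)label_len;	/* size query only */
	if (buflen < label_len)
		return -ERANGE;			/* caller's buffer is too small */

	memcpy(buf, label, label_len);
	return (ssize_t)label_len;	/* not 0: callers treat this as the value size */
}
```

A handler that returned 0 here would make every successful read look like an empty value to the caller.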


NFS: Don't gratuitously clear the inode cache when lookup failed · d506f8e3

Committed by Trond Myklebust on Mar 27, 2021

stable inclusion
from stable-5.10.24
commit e181960ec51d5fa089d6e8e2478febe01ca8be04
bugzilla: 51348

--------------------------------

[ Upstream commit 47397915 ]

The fact that the lookup revalidation failed does not mean that the
inode contents have changed.

Fixes: 5ceb9d7f ("NFS: Refactor nfs_lookup_revalidate()")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Acked-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>


NFS: Don't revalidate the directory permissions on a lookup failure · 636f5615

Committed by Trond Myklebust on Mar 27, 2021

stable inclusion
from stable-5.10.24
commit dd756d05bee58077ea0239861022ca83e7d8d23d
bugzilla: 51348

--------------------------------

[ Upstream commit 82e7ca13 ]

There should be no reason to expect the directory permissions to change
just because the directory contents changed or a negative lookup timed
out. So let's avoid doing a full call to nfs_mark_for_revalidate() in
that case.

Furthermore, if this is a negative dentry, and we haven't actually done
a new lookup, then we have no reason yet to believe the directory has
changed at all. So let's remove the gratuitous directory inode
invalidation altogether when called from
nfs_lookup_revalidate_negative().

Reported-by: Geert Jansen <gerardu@amazon.com>
Fixes: 5ceb9d7f ("NFS: Refactor nfs_lookup_revalidate()")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Acked-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>


cifs: do not send close in compound create+close requests · 7afed434

Committed by Paulo Alcantara on Mar 27, 2021

stable inclusion
from stable-5.10.24
commit d308202c1b96024a2f3325642f5e087cf997b5d9
bugzilla: 51348

--------------------------------

commit 04ad69c3 upstream.

In case of interrupted syscalls, prevent sending CLOSE commands for
compound CREATE+CLOSE requests by introducing a
CIFS_CP_CREATE_CLOSE_OP flag that tells lower layers not to send
a CLOSE command for the MIDs corresponding to the compound
CREATE+CLOSE request.

A simple reproducer:

    #!/bin/bash

    mount //server/share /mnt -o username=foo,password=***
    tc qdisc add dev eth0 root netem delay 450ms
    stat -f /mnt &>/dev/null & pid=$!
    sleep 0.01
    kill $pid
    tc qdisc del dev eth0 root
    umount /mnt

Before patch:

    ...
    6 0.256893470 192.168.122.2 → 192.168.122.15 SMB2 402 Create Request File: ;GetInfo Request FS_INFO/FileFsFullSizeInformation;Close Request
    7 0.257144491 192.168.122.15 → 192.168.122.2 SMB2 498 Create Response File: ;GetInfo Response;Close Response
    9 0.260798209 192.168.122.2 → 192.168.122.15 SMB2 146 Close Request File:
   10 0.260841089 192.168.122.15 → 192.168.122.2 SMB2 130 Close Response, Error: STATUS_FILE_CLOSED
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
CC: <stable@vger.kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Acked-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>


block: Try to handle busy underlying device on discard · 3ccf9c92

Committed by Jan Kara on Mar 27, 2021

stable inclusion
from stable-5.10.24
commit d44c9780ed40db88626c9354868eab72159c7a7f
bugzilla: 51348

--------------------------------

commit 56887cff upstream.

Commit 384d87ef ("block: Do not discard buffers under a mounted
filesystem") made paths issuing discard or zeroout requests to the
underlying device try to grab block device in exclusive mode. If that
failed we returned EBUSY to userspace. This however caused unexpected
fallout in userspace where e.g. FUSE filesystems issue discard requests
from userspace daemons although the device is open exclusively by the
kernel. Also shrinking of logical volume by LVM issues discard requests
to a device which may be claimed exclusively because there's another LV
on the same PV. So to avoid these userspace regressions, fall back to
invalidate_inode_pages2_range() instead of returning EBUSY to userspace
and return EBUSY only if that call fails as well (meaning that there's
indeed someone using the particular device range we are trying to
discard).

Link: https://bugzilla.kernel.org/show_bug.cgi?id=211167
Fixes: 384d87ef ("block: Do not discard buffers under a mounted filesystem")
CC: stable@vger.kernel.org
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Acked-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
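
The fallback order described in this commit can be sketched as follows. The structure and helper names are assumptions made for illustration: `toy_claim_exclusive()` stands in for grabbing the device in exclusive mode and `toy_invalidate_range()` for invalidate_inode_pages2_range(); neither reflects the real block-layer signatures.

```c
#include <errno.h>
#include <stdbool.h>

/* Toy device state used to drive the two outcomes. */
struct toy_bdev {
	bool exclusively_held;	/* someone else holds the exclusive claim */
	bool range_in_use;	/* cached pages in the range cannot be dropped */
};

/* Stand-in for claiming the device exclusively. */
static int toy_claim_exclusive(const struct toy_bdev *b)
{
	return b->exclusively_held ? -EBUSY : 0;
}

/* Stand-in for invalidating the cached pages of the target range. */
static int toy_invalidate_range(const struct toy_bdev *b)
{
	return b->range_in_use ? -EBUSY : 0;
}

/* Try the exclusive claim first; if that fails, fall back to range
 * invalidation and report EBUSY only if the fallback fails too. */
static int toy_prepare_discard(const struct toy_bdev *b)
{
	if (toy_claim_exclusive(b) == 0)
		return 0;
	return toy_invalidate_range(b);
}
```

A device that is claimed by someone else but whose target range is idle now succeeds, which covers the FUSE and LVM cases mentioned in the message.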


ext4: don't try to processed freed blocks until mballoc is initialized · 93e5d414

Committed by Theodore Ts'o on Mar 27, 2021

stable inclusion
from stable-5.10.24
commit 64578f9417e1e3482f3e4492496772fca130f526
bugzilla: 51348

--------------------------------

[ Upstream commit 027f14f5 ]

If we try to make any changes via the journal after the journal is
initialized but before the multi-block allocator is initialized,
we will end up dereferencing a NULL pointer when the journal commit
callback function calls ext4_process_freed_data().

The proximate cause of this failure was commit 2d01ddc8 ("ext4:
save error info to sb through journal if available") since file system
corruption problems detected before the call to ext4_mb_init() would
result in a journal commit before we aborted the mount of the file
system.... and we would then trigger the NULL pointer deref.

Link: https://lore.kernel.org/r/YAm8qH/0oo2ofSMR@mit.edu
Reported-by: Murphy Zhou <jencce.kernel@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Acked-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>


udf: fix silent AED tagLocation corruption · 53387051

Committed by Steven J. Magnani on Mar 27, 2021

stable inclusion
from stable-5.10.24
commit 82d6c12899e2645bd17d6f9c7d494f360e1089e1
bugzilla: 51348

--------------------------------

[ Upstream commit 63c9e47a ]

When extending a file, udf_do_extend_file() may enter the following empty
indirect extent. At the end of udf_do_extend_file() we revert prev_epos
to point to the last written extent. However, if we end up not adding any
further extent in udf_do_extend_file(), the revert points prev_epos
into the header area of the AED, and subsequent updates of the extents
(in udf_update_extents()) will corrupt the header.

Make sure that we do not follow indirect extent if we are not going to
add any more extents so that returning back to the last written extent
works correctly.

Link: https://lore.kernel.org/r/20210107234116.6190-2-magnani@ieee.org
Signed-off-by: Steven J. Magnani <magnani@ieee.org>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Sasha Levin <sashal@kernel.org>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Acked-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>


cifs: fix credit accounting for extra channel · fb386d65

Committed by Aurelien Aptel on Mar 27, 2021

stable inclusion
from stable-5.10.24
commit 3370a84d781ca5227682bd6e747aaefb6dcc8e21
bugzilla: 51348

--------------------------------

commit a249cc8b upstream.

With multichannel, operations like the queries
from "ls -lR" can cause all credits to be used and
errors to be returned since max_credits was not
being set correctly on the secondary channels and
thus the client was requesting 0 credits incorrectly
in some cases (which can lead to not having
enough credits to perform any operation on that
channel).

Signed-off-by: Aurelien Aptel <aaptel@suse.com>
CC: <stable@vger.kernel.org> # v5.8+
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Acked-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>


cifs: return proper error code in statfs(2) · 9ae15b79

Committed by Paulo Alcantara on Mar 27, 2021

stable inclusion
from stable-5.10.24
commit 3d0bbd97eb6f32bcc1365252aa04a8984bab5007
bugzilla: 51348

--------------------------------

commit 14302ee3 upstream.

In cifs_statfs(), if server->ops->queryfs is not NULL, then we should
use its return value rather than always returning 0.  Instead, use rc
variable as it is properly set to 0 in case there is no
server->ops->queryfs.

Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Reviewed-by: Aurelien Aptel <aaptel@suse.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
CC: <stable@vger.kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Acked-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
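
The shape of this fix can be illustrated with a small sketch built around an optional callback. `struct toy_ops`, `toy_statfs()` and `toy_queryfs_fail()` are hypothetical names; the real cifs_statfs() takes different arguments and does more work.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical operations table with an optional queryfs hook. */
struct toy_ops {
	int (*queryfs)(void);
};

/* Example hook that always fails, standing in for a server error. */
static int toy_queryfs_fail(void)
{
	return -EIO;
}

static int toy_statfs(const struct toy_ops *ops)
{
	int rc = 0;	/* stays 0 when there is no queryfs hook */

	if (ops->queryfs)
		rc = ops->queryfs();

	return rc;	/* the buggy shape returned 0 unconditionally */
}
```

Returning `rc` keeps the no-hook case at 0 while letting a failing hook reach the caller.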


mount: fix mounting of detached mounts onto targets that reside on shared mounts · 3ef215f4

Committed by Christian Brauner on Mar 27, 2021

stable inclusion
from stable-5.10.24
commit 36e1efcdc54274d03e67ed6a9d5c1c2a2e77e947
bugzilla: 51348

--------------------------------

commit ee2e3f50 upstream.

Creating a series of detached mounts, attaching them to the filesystem,
and unmounting them can be used to trigger an integer overflow in
ns->mounts causing the kernel to block any new mounts in count_mounts()
and returning ENOSPC because it falsely assumes that the maximum number
of mounts in the mount namespace has been reached, i.e. it thinks it
can't fit the new mounts into the mount namespace anymore.

Depending on the number of mounts in your system, this can be reproduced
on any kernel that supports open_tree() and move_mount() by compiling
and running the following program:

  /* SPDX-License-Identifier: LGPL-2.1+ */

  #define _GNU_SOURCE
  #include <errno.h>
  #include <fcntl.h>
  #include <getopt.h>
  #include <limits.h>
  #include <stdbool.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/mount.h>
  #include <sys/stat.h>
  #include <sys/syscall.h>
  #include <sys/types.h>
  #include <unistd.h>

  /* open_tree() */
  #ifndef OPEN_TREE_CLONE
  #define OPEN_TREE_CLONE 1
  #endif

  #ifndef OPEN_TREE_CLOEXEC
  #define OPEN_TREE_CLOEXEC O_CLOEXEC
  #endif

  #ifndef __NR_open_tree
          #if defined __alpha__
                  #define __NR_open_tree 538
          #elif defined _MIPS_SIM
                  #if _MIPS_SIM == _MIPS_SIM_ABI32        /* o32 */
                          #define __NR_open_tree 4428
                  #endif
                  #if _MIPS_SIM == _MIPS_SIM_NABI32       /* n32 */
                          #define __NR_open_tree 6428
                  #endif
                  #if _MIPS_SIM == _MIPS_SIM_ABI64        /* n64 */
                          #define __NR_open_tree 5428
                  #endif
          #elif defined __ia64__
                  #define __NR_open_tree (428 + 1024)
          #else
                  #define __NR_open_tree 428
          #endif
  #endif

  /* move_mount() */
  #ifndef MOVE_MOUNT_F_EMPTY_PATH
  #define MOVE_MOUNT_F_EMPTY_PATH 0x00000004 /* Empty from path permitted */
  #endif

  #ifndef __NR_move_mount
          #if defined __alpha__
                  #define __NR_move_mount 539
          #elif defined _MIPS_SIM
                  #if _MIPS_SIM == _MIPS_SIM_ABI32        /* o32 */
                          #define __NR_move_mount 4429
                  #endif
                  #if _MIPS_SIM == _MIPS_SIM_NABI32       /* n32 */
                          #define __NR_move_mount 6429
                  #endif
                  #if _MIPS_SIM == _MIPS_SIM_ABI64        /* n64 */
                          #define __NR_move_mount 5429
                  #endif
          #elif defined __ia64__
                  #define __NR_move_mount (429 + 1024)
          #else
                  #define __NR_move_mount 429
          #endif
  #endif

  static inline int sys_open_tree(int dfd, const char *filename, unsigned int flags)
  {
          return syscall(__NR_open_tree, dfd, filename, flags);
  }

  static inline int sys_move_mount(int from_dfd, const char *from_pathname, int to_dfd,
                                   const char *to_pathname, unsigned int flags)
  {
          return syscall(__NR_move_mount, from_dfd, from_pathname, to_dfd, to_pathname, flags);
  }

  static bool is_shared_mountpoint(const char *path)
  {
          bool shared = false;
          FILE *f = NULL;
          char *line = NULL;
          int i;
          size_t len = 0;

          f = fopen("/proc/self/mountinfo", "re");
          if (!f)
                  return false;

          while (getline(&line, &len, f) > 0) {
                  char *slider1, *slider2;

                  for (slider1 = line, i = 0; slider1 && i < 4; i++)
                          slider1 = strchr(slider1 + 1, ' ');

                  if (!slider1)
                          continue;

                  slider2 = strchr(slider1 + 1, ' ');
                  if (!slider2)
                          continue;

                  *slider2 = '\0';
                  if (strcmp(slider1 + 1, path) == 0) {
                          /* This is the path. Is it shared? */
                          slider1 = strchr(slider2 + 1, ' ');
                          if (slider1 && strstr(slider1, "shared:")) {
                                  shared = true;
                                  break;
                          }
                  }
          }
          fclose(f);
          free(line);

          return shared;
  }

  static void usage(void)
  {
          const char *text = "mount-new [--recursive] <base-dir>\n";
          fprintf(stderr, "%s", text);
          _exit(EXIT_SUCCESS);
  }

  #define exit_usage(format, ...)                              \
          ({                                                   \
                  fprintf(stderr, format "\n", ##__VA_ARGS__); \
                  usage();                                     \
          })

  #define exit_log(format, ...)                                \
          ({                                                   \
                  fprintf(stderr, format "\n", ##__VA_ARGS__); \
                  exit(EXIT_FAILURE);                          \
          })

  static const struct option longopts[] = {
          {"help",        no_argument,            0,      'a'},
          { NULL,         no_argument,            0,       0 },
  };

  int main(int argc, char *argv[])
  {
          int exit_code = EXIT_SUCCESS, index = 0;
          int dfd, fd_tree, new_argc, ret;
          char *base_dir;
          char *const *new_argv;
          char target[PATH_MAX];

          while ((ret = getopt_long_only(argc, argv, "", longopts, &index)) != -1) {
                  switch (ret) {
                  case 'a':
                          /* fallthrough */
                  default:
                          usage();
                  }
          }

          new_argv = &argv[optind];
          new_argc = argc - optind;
          if (new_argc < 1)
                  exit_usage("Missing base directory\n");
          base_dir = new_argv[0];

          if (*base_dir != '/')
                  exit_log("Please specify an absolute path");

          /* Ensure that target is a shared mountpoint. */
          if (!is_shared_mountpoint(base_dir))
                  exit_log("Please ensure that \"%s\" is a shared mountpoint", base_dir);

          dfd = open(base_dir, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
          if (dfd < 0)
                  exit_log("%m - Failed to open base directory \"%s\"", base_dir);

          ret = mkdirat(dfd, "detached-move-mount", 0755);
          if (ret < 0)
                  exit_log("%m - Failed to create required temporary directories");

          ret = snprintf(target, sizeof(target), "%s/detached-move-mount", base_dir);
          if (ret < 0 || (size_t)ret >= sizeof(target))
                  exit_log("%m - Failed to assemble target path");

          /*
           * Having a mount table with 10000 mounts is already quite excessive
           * and should account even for weird test systems.
           */
          for (size_t i = 0; i < 10000; i++) {
                  fd_tree = sys_open_tree(dfd, "detached-move-mount",
                                          OPEN_TREE_CLONE |
                                          OPEN_TREE_CLOEXEC |
                                          AT_EMPTY_PATH);
                  if (fd_tree < 0) {
                          fprintf(stderr, "%m - Failed to open %d(detached-move-mount)", dfd);
                          exit_code = EXIT_FAILURE;
                          break;
                  }

                  ret = sys_move_mount(fd_tree, "", dfd, "detached-move-mount", MOVE_MOUNT_F_EMPTY_PATH);
                  if (ret < 0) {
                          if (errno == ENOSPC)
                                  fprintf(stderr, "%m - Buggy mount counting");
                          else
                                  fprintf(stderr, "%m - Failed to attach mount to %d(detached-move-mount)", dfd);
                          exit_code = EXIT_FAILURE;
                          break;
                  }
                  close(fd_tree);

                  ret = umount2(target, MNT_DETACH);
                  if (ret < 0) {
                          fprintf(stderr, "%m - Failed to unmount %s", target);
                          exit_code = EXIT_FAILURE;
                          break;
                  }
          }

          (void)unlinkat(dfd, "detached-move-mount", AT_REMOVEDIR);
          close(dfd);

          exit(exit_code);
  }

and wait for the kernel to refuse any new mounts by returning ENOSPC.
How many iterations are needed depends on the number of mounts in your
system. Assuming you have something like 50 mounts on a standard system
it should be almost instantaneous.

The root cause of this is that detached mounts aren't handled correctly
when source and target mount are identical and reside on a shared mount.
This results in a broken mount tree where the detached source itself is
propagated, something that propagation prevents for regular bind-mounts
and new mounts. This ultimately leads to a miscalculation of the number
of mounts in the mount namespace.

Detached mounts created via
open_tree(fd, path, OPEN_TREE_CLONE)
are essentially like an unattached new mount, or an unattached
bind-mount. They can then later on be attached to the filesystem via
move_mount() which calls into attach_recursive_mount(). Part of
attaching it to the filesystem is making sure that mounts get correctly
propagated in case the destination mountpoint is MS_SHARED, i.e. is a
shared mountpoint. This is done by calling into propagate_mnt() which
walks the list of peers calling propagate_one() on each mount in this
list making sure it receives the propagation event.
The propagate_one() function thereby skips both new mounts and bind
mounts to not propagate them "into themselves". Both are identified by
checking whether the mount is already attached to any mount namespace in
mnt->mnt_ns. This is what the IS_MNT_NEW() helper is responsible for.

However, detached mounts have an anonymous mount namespace attached to
them, stashed in mnt->mnt_ns, which means that IS_MNT_NEW() doesn't
realize they need to be skipped. The mount therefore propagates "into
itself", breaking the mount table and causing a disconnect between the
number of mounts recorded as being beneath or reachable from the target
mountpoint and the number of mounts actually recorded/counted in
ns->mounts. This ultimately causes an overflow, which in turn prevents
any new mounts by triggering the ENOSPC issue.
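
The check described above can be modelled in a few lines of userspace C.
This is only a sketch and not the kernel code: the struct names, the
helper names and the convention that a sequence number of 0 marks an
anonymous namespace are assumptions made for illustration. It shows why
a check that only looks for a missing namespace misses detached mounts,
and what a check that is aware of them could look like.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for kernel structures; not the real definitions. */
struct mnt_namespace_model {
        unsigned long seq;      /* assumption: 0 marks an anonymous namespace */
};

struct mount_model {
        struct mnt_namespace_model *mnt_ns;     /* NULL if not attached */
};

/* The check as described: a mount is "new" if it has no namespace at all. */
static bool is_mnt_new_old(const struct mount_model *m)
{
        return m->mnt_ns == NULL;
}

/* Hypothetical helper: true for the anonymous namespace of a detached mount. */
static bool is_anon_ns_model(const struct mnt_namespace_model *ns)
{
        return ns->seq == 0;
}

/* Sketch of a check that also treats detached mounts as "new". */
static bool is_mnt_new_fixed(const struct mount_model *m)
{
        return m->mnt_ns == NULL || is_anon_ns_model(m->mnt_ns);
}
```

With a detached mount whose mnt_ns points at an anonymous namespace,
is_mnt_new_old() returns false, so the mount would not be skipped, while
is_mnt_new_fixed() returns true.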

So teach propagation to handle detached mounts by making it aware of
them. I've been tracking this issue down for the last couple of days and
then verified that the fix is correct by unmounting everything in my
current mount table, leaving only /proc and /sys mounted, and running
the reproducer above overnight while verifying the number of mounts
counted in ns->mounts. With this fix the counts are correct and the
ENOSPC issue can't be reproduced.

This change will only have an effect on mounts created with the new
mount API since detached mounts cannot be created with the old mount API
so regressions are extremely unlikely.

Link: https://lore.kernel.org/r/20210306101010.243666-1-christian.brauner@ubuntu.com
Fixes: 2db154b3 ("vfs: syscall: Add move_mount(2) to move mounts around")
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Cc: <stable@vger.kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Chen Jun <chenjun102@huawei.com>
Acked-by: Weilong Chen <chenweilong@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

3ef215f4

arm64/mpam: resctrl: Refresh cpu mask for handling cpuhp · 867ae5b2

Committed by Wang ShaoBo on Feb 26, 2021

hulk inclusion
category: bugfix
bugzilla: 48265
CVE: NA

--------------------------------

This fixes two problems:

1) When a cpu goes offline, we should clear it from the cpu mask of all
   associated resctrl groups, not only the default group.

2) When a cpu comes online, we should set it in the cpu mask of the
   default group and, if cdp is on, update the default group's cpus to
   the default state. This operation fills the code and data fields of
   the mpam sysregs with appropriate values.
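
The two mask updates can be sketched as follows. This is a simplified
userspace model under stated assumptions: the group layout, the helper
names and the use of a 64-bit word as a cpu mask are hypothetical, group
0 stands in for the default group, and the cdp-related sysreg update is
left out.

```c
#include <stddef.h>
#include <stdint.h>

#define NR_GROUPS 4     /* hypothetical number of groups; index 0 is the default group */

struct group_model {
        uint64_t cpu_mask;      /* bit n set: cpu n belongs to this group (n < 64) */
};

/* Offline: clear the cpu from every group, not only from the default group. */
static void cpu_offline(struct group_model *groups, size_t n, unsigned int cpu)
{
        for (size_t i = 0; i < n; i++)
                groups[i].cpu_mask &= ~(UINT64_C(1) << cpu);
}

/* Online: the cpu starts out in the default group. */
static void cpu_online(struct group_model *groups, unsigned int cpu)
{
        groups[0].cpu_mask |= UINT64_C(1) << cpu;
}
```

Calling cpu_offline(groups, NR_GROUPS, 1) and then cpu_online(groups, 1)
moves cpu 1 out of whichever group owned it and into the default group.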

Fixes: 2e2c511ff49d ("arm64/mpam: resctrl: Handle cpuhp and resctrl_dom allocation")
Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
Reviewed-by: Jian Cheng <cj.chengjian@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: Cheng Jian <cj.chengjian@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

867ae5b2

arm64/mpam: resctrl: Use resctrl_group_init_alloc() for default group · 96a27f9d

Committed by Wang ShaoBo on Feb 26, 2021

hulk inclusion
category: bugfix
bugzilla: 48265
CVE: NA

--------------------------------

When a resource supports several configurable control types, a stale
value from before a remount gets written back into the default group
after remounting.

e.g.
    > mount -t resctrl resctrl /sys/fs/resctrl/ -o mbMax,mbMin && cd resctrl/
    > echo 'MBMIN:0=2;1=2;2=2;3=2' > schemata
    > cat schemata
      L3:0=7fff;1=7fff;2=7fff;3=7fff
      MBMAX:0=100;1=100;2=100;3=100
      MBMIN:0=2;1=2;2=2;3=2
    > cd .. && umount /sys/fs/resctrl/
    > mount -t resctrl resctrl /sys/fs/resctrl/ -o mbMax,mbMin && cd resctrl/ && cat schemata
      L3:0=7fff;1=7fff;2=7fff;3=7fff
      MBMAX:0=100;1=100;2=100;3=100
      MBMIN:0=0;1=0;2=0;3=0
    > echo 'MBMAX:0=10;1=10;2=10;3=10' > schemata
    > cat schemata
      L3:0=7fff;1=7fff;2=7fff;3=7fff
      MBMAX:0=10;1=10;2=10;3=10
      MBMIN:0=2;1=2;2=2;3=2  # stale historical value is wrongly written back

When the schemata sysfile is written, the call path looks like this:

resctrl_group_schemata_write()
  -=> resctrl_update_groups_config()
         -=> resctrl_group_update_domains()
               -=> resctrl_group_update_domain_ctrls()
                { .../*refresh new_ctrl array of supported conf type once for each resource*/ }

We should refresh the new_ctrl field in struct resctrl_staged_config via
resctrl_group_init_alloc() before calling resctrl_group_update_domain_ctrls().

Fixes: 6b2471f089be ("arm64/mpam: resctrl: Support priority and hardlimit(Memory bandwidth) configuration")
Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
Reviewed-by: Cheng Jian <cj.chengjian@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

96a27f9d

arm64/mpam: resctrl: Add proper error handling to resctrl_mount() · 10e4e43b

Committed by Wang ShaoBo on Feb 26, 2021

hulk inclusion
category: bugfix
bugzilla: 48265
CVE: NA

--------------------------------

This function is called only when the resctrl sysfs is mounted. For
proper error handling we need to destroy the schemata list if any of the
steps following its creation fail.
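
The unwinding described here follows the usual goto-based cleanup
pattern. The sketch below is a userspace model, not the real
resctrl_mount(): every name in it is hypothetical and the later mount
steps are reduced to a single step that can be forced to fail.

```c
#include <stdbool.h>

/* Hypothetical mount state, tracking only what matters for the sketch. */
struct mount_state_model {
        bool schemata_list_ready;
        bool mounted;
};

static int schemata_list_init_model(struct mount_state_model *s)
{
        s->schemata_list_ready = true;
        return 0;
}

static void schemata_list_destroy_model(struct mount_state_model *s)
{
        s->schemata_list_ready = false;
}

/* Stand-in for the steps that follow creation of the schemata list. */
static int later_step_model(bool fail)
{
        return fail ? -1 : 0;
}

static int resctrl_mount_model(struct mount_state_model *s, bool fail_later)
{
        int ret;

        ret = schemata_list_init_model(s);
        if (ret)
                return ret;

        ret = later_step_model(fail_later);
        if (ret)
                goto out_destroy;       /* undo the list creation on failure */

        s->mounted = true;
        return 0;

out_destroy:
        schemata_list_destroy_model(s);
        return ret;
}
```

On the failure path the model returns a non-zero code and leaves no
schemata list behind, which is the property the commit asks for.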

Fixes: 7e9b5caeefff ("arm64/mpam: resctrl: Add helpers for init and destroy schemata list")
Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
Reviewed-by: Cheng Jian <cj.chengjian@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

10e4e43b

arm64/mpam: Use fs_context to parse mount options · 100e2317

Committed by Wang ShaoBo on Feb 26, 2021

hulk inclusion
category: bugfix
bugzilla: 48265
CVE: NA

--------------------------------

Use fs_context to parse mount options. The old parsing done by
parse_rdtgroupfs_options() will be obsoleted and removed.

Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
Reviewed-by: Cheng Jian <cj.chengjian@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

100e2317

arm64/mpam: Set per-cpu's closid to none zero for cdp · cae569b3

Committed by Wang ShaoBo on Feb 26, 2021

hulk inclusion
category: feature
feature: ARM MPAM support
bugzilla: 48265
CVE: NA

--------------------------------

Sometimes monitoring will have such anomalies:

e.g.
    > cd /sys/fs/resctrl/ && grep . mon_data/*
      mon_data/mon_L3CODE_00:14336
      mon_data/mon_L3CODE_01:344064
      mon_data/mon_L3CODE_02:2048
      mon_data/mon_L3CODE_03:27648
      mon_data/mon_L3DATA_00:0  # L3DATA's monitoring data is always 0
      mon_data/mon_L3DATA_01:0
      mon_data/mon_L3DATA_02:0
      mon_data/mon_L3DATA_03:0
      mon_data/mon_MB_00:392
      mon_data/mon_MB_01:552
      mon_data/mon_MB_02:160
      mon_data/mon_MB_03:0

If cdp is on, tasks in the resctrl default group with closid=0 and rmid=0
do not get the proper partid_i/pmg_i and partid_d/pmg_d filled into the
MPAMx_ELx sysregs by mpam_sched_in(), which is called by __switch_to().
This is because the current cpu's default closid and rmid are also equal
to 0, so the operation that modifies the configuration is skipped.

Set the per-cpu default closid to a non-zero value and call
update_closid_rmid() to update each cpu's MPAMx_ELx sysregs with the
proper partid and pmg when mounting the resctrl sysfs. This looks like a
practical method.

Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
Reviewed-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Reviewed-by: Cheng Jian <cj.chengjian@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

cae569b3

arm64/mpam: Split header files into suitable location · 0c564931

Committed by Wang ShaoBo on Feb 26, 2021

hulk inclusion
category: feature
feature: ARM MPAM support
bugzilla: 48265
CVE: NA

--------------------------------

So far, the declarations shared by resctrlfs.c and the mpam core module
files under the kernel/mpam directory are scattered across mpam.h and
resctrl.h. They are organized like this:

-- asm/
   +-- resctrl.h        +
   +-- mpam.h           |    +
   +-- mpam_resource.h  |    |    +
                        |    |    |
-- fs/                  |    |    +-> mpam/
   +-- resctrlfs.c <----+----+------> +-- mpam_resctrl.c ...

We move the declarations shared by resctrlfs.c and mpam/ to resctrl.h,
split the remaining declarations into mpam_internal.h, and move
mpam_resource.h to the mpam/ directory. The result is organized like
this:

-- asm/
   +-- mpam.h           +----> export to other modules(e.g. SMMU master io)
   +-- resctrl.h        +
                        |
-- mpam/                |
   +-- mpam_internal.h  |    +
   +-- mpam_resource.h  |    |    +
                        |    |    |
-- fs/                  |    +----+-> mpam/
   +-- resctrlfs.c <----+-----------> +-- mpam_resctrl.c ...

In this way we can build a clearer framework for MPAM usage.

Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
Reviewed-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Reviewed-by: Cheng Jian <cj.chengjian@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

0c564931

arm64/mpam: resctrl: Export resource's properties to info directory · 9d39dad1

Committed by Wang ShaoBo on Feb 26, 2021

hulk inclusion
category: feature
feature: ARM MPAM support
bugzilla: 48265
CVE: NA

--------------------------------

Some resource properties such as closid and rmid are exported as in
Intel-RDT in our resctrl design, but there are two main differences.
The first concerns MB (Memory Bandwidth): for us, MB is also divided into
two directories, MB and MB_MON, to show the properties of the control and
monitor types respectively, the same as for LxCache. The second is that
we add a features sysfile under each resource's directory, which
indicates the properties of the control types of the corresponding
resource, for instance MB hardlimit.

e.g.
    > mount -t resctrl resctrl /sys/fs/resctrl -o mbHdl
    > cd /sys/fs/resctrl/ && cat info/MB/features
      mbHdl@1  # indicates that the upper bound of the MBHDL setting is 1
    > cat schemata
      L3:0=7fff;1=7fff;2=7fff;3=7fff
      MB:0=100;1=100;2=100;3=100
      MBHDL:0=1;1=1;2=1;3=1

Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
Reviewed-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Reviewed-by: Cheng Jian <cj.chengjian@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

9d39dad1

arm64/mpam: Add wait queue for monitor alloc and free · 7d3cd1a2

Committed by Wang ShaoBo on Feb 26, 2021

hulk inclusion
category: feature
feature: ARM MPAM support
bugzilla: 48265
CVE: NA

--------------------------------

For MPAM, a rmid can do monitoring work only when a monitor resource is
allocated to it, so we adopt a mechanism for dynamic allocation and
recycling of monitor resources. This differs from Intel-RDT, which
creates a kworker thread that dynamically monitors cache usage and checks
whether it is below an adjustable threshold before freeing a rmid. We
have found that this method affects cpu utilization in many cases, and
sometimes this impact is unacceptable.

Our method is simple. As the number of monitors varies between
resources, we provide two lists: one stores rmids which have exclusive
monitor resources and the other stores rmids whose monitor resource is
shared. The shared monitor id is always 0. It works like this: if a new
rmid applies for a resource monitor that is in use, we put this rmid at
the tail of the latter list and temporarily give it the default monitor
id 0 until someone releases an available monitor resource. If the new
rmid gets the monitor resources it needs for all resources, it is put
into the exclusive list.

This implements LRU allocation of monitor resources and gives users
partial control over allocation and release. If the number of resctrl
groups can be bounded, or the user doesn't need to monitor too many
groups simultaneously, this is a more appropriate approach for user
deployment. It also avoids the risk of inaccurate monitoring when too
many groups are monitored at the same time.

Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
Reviewed-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Reviewed-by: Cheng Jian <cj.chengjian@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

7d3cd1a2

arm64/mpam: Remap reqpartid,pmg to rmid and intpartid to closid · 0b16164d

Committed by Wang ShaoBo on Feb 26, 2021

hulk inclusion
category: feature
feature: ARM MPAM support
bugzilla: 48265
CVE: NA

--------------------------------

So far we use sd_closid, consisting of {reqpartid, intpartid}, to label
each resctrl group, both ctrlgroup and mongroup. This handles the case
where the number of reqpartid exceeds that of intpartid, which always
happens when intpartid narrowing is supported; otherwise the two are
equal in number. So we use the excess reqpartid to (1) indicate how
configurations can be synchronized from the configuration indexed by
intpartid and (2) take part in the monitor role.

But reqpartid in (2) and pmg are still scattered, and so far we have no
proper way to describe how to use the two together. To ensure both
resources can be fully utilized, and borrowing from Intel-RDT's design,
which uses rmid for monitoring, a rmid remap matrix is introduced for
transforming partid and pmg to rmid. This matrix is organized like this:

                 [bitmap entry indexed by partid]
                       [col pos is partid]

                     [0]  [1]  [2]  [3]  [4]  [5]
   occ->bitmap[:0]    1    0    0    1    1    1
        bitmap[:1]    1    0    0    1    1    1
        bitmap[:2]    1    1    1    1    1    1
        bitmap[:3]    1    1    1    1    1    1
[row pos-1 is pmg]

Calculate rmid = partid + NR_partid * pmg

occ indicates whether a bitmap has been used by a partid, because a
given partid should not be paired with a duplicated pmg for monitoring.
This design saves a lot of space and, compared with using a list, reduces
the time complexity of allocating and freeing a rmid from
O(NR_partid) * O(NR_pmg) to O(NR_partid) + O(log(NR_pmg)).

This way we get a contiguous rmid set with upper bound (NR_pmg *
NR_partid - 1). Given an rmid, we can tell whether it is valid by
checking whether it falls within this range.
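
The arithmetic behind this mapping can be sketched in C. The formula
rmid = partid + NR_partid * pmg and the upper bound are taken from the
text above. The struct and function names are hypothetical, and the
inverse mapping is simply derived from the formula; it is not claimed to
match the kernel helper's implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical container for the two dimensions of the remap matrix. */
struct rmid_layout {
        uint32_t nr_partid;     /* number of columns, must be non-zero */
        uint32_t nr_pmg;        /* number of rows */
};

/* rmid = partid + NR_partid * pmg */
static uint32_t rmid_from_partid_pmg(const struct rmid_layout *l,
                                     uint32_t partid, uint32_t pmg)
{
        return partid + l->nr_partid * pmg;
}

/* A rmid is valid if it lies in [0, NR_pmg * NR_partid - 1]. */
static bool rmid_is_valid(const struct rmid_layout *l, uint32_t rmid)
{
        return rmid < l->nr_partid * l->nr_pmg;
}

/* Inverse mapping derived from the formula above. */
static void rmid_to_partid_pmg(const struct rmid_layout *l, uint32_t rmid,
                               uint32_t *partid, uint32_t *pmg)
{
        *partid = rmid % l->nr_partid;
        *pmg = rmid / l->nr_partid;
}
```

For the 6 x 4 matrix drawn above, partid 3 with pmg 2 maps to rmid
3 + 6 * 2 = 15, and the largest valid rmid is 6 * 4 - 1 = 23.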

A rmid implies the reqpartid, so we can use the relevant helpers to
obtain the reqpartid for sd_closid@reqpartid and accomplish the
configuration sync. This also makes closid simpler, as it can consist of
the intpartid index only, and each resctrl group gets to own consecutive
rmids.

This also has further-reaching effects. For instance, MPAM also supports
SMMU io using partid and pmg, so the SMMU driver can use a single helper,
mpam_rmid_to_partid_pmg(), to complete this remap process for a rmid
passed in from user space.

Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
Reviewed-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Reviewed-by: Cheng Jian <cj.chengjian@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

0b16164d

arm64/mpam: Using software-defined id for rdtgroup instead of 32-bit integer · 26a57873

Committed by Wang ShaoBo on Feb 26, 2021

hulk inclusion
category: feature
feature: ARM MPAM support
bugzilla: 48265
CVE: NA

--------------------------------

Currently we use partid and pmg (Performance Monitoring Group) to filter
performance events so that the performance of a particular partid and
pmg can be monitored. But pmg seems of little use except as a filter
combined with partid. Especially when the number of pmg varies between
MPAM resources, it is difficult to allocate pmg resources when creating
a new mon group in the resctrl sysfs, and a lot may even be wasted.

So we use a software-defined sd_closid instead of a 32-bit integer to
label each rdtgroup (including mon groups). sd_closid includes an
intpartid for allocation and a reqpartid for synchronizing configuration
and monitoring. Since MPAM has a narrowing feature, we also introduce
the concepts we call hw_reqpartid and hw_intpartid. When narrowing is
not supported, the number of intpartid and reqpartid equals hw_reqpartid;
otherwise intpartid and reqpartid depend on the minimum number of
hw_reqpartid and hw_intpartid supported across different resources. This
way we not only solve the problem above but can also use spare reqpartid
for creating new mon groups. Additionally, pmg is still preferred when
it is available.

e.g.
     hw_intpartid:  0  1  2  3  4  5  6  7
     hw_reqpartid:  0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15

                    |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
                                            |  |  |  |  |  |  |  |
resctrl ctrl group: p0 p1 p2 p3 p4 p5 p6 p7 |  |  |  |  |  |  |  |
                          |     |
resctrl mon group:        |     +-----------------------m4 m5 m6 m7
                          +-----------------m0 m1 m2 m3
                          In this case, use extra reqpartid to create
                          m0, m1, m2, m3 mon group for p2 ctrl group,
                          and m4, m5, m6, m7 for p4.

As reqpartid supports both allocation and monitoring filtering, under
this design we should synchronize the config of a ctrl group with its
child mon groups. Each mon group's configuration, indexed by a reqpartid
and called the slave, closely follows that of its parent ctrl group,
called the master, whenever the configuration changes. In addition, we
let task_struct keep both intpartid and reqpartid, so that we can tell
whether tasks belong to the same ctrl group through intpartid, and change
the cpu's partid by writing reqpartid to MPAMx_ELx when switching tasks.
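
A minimal sketch of this labelling and of the configuration sync,
assuming a flat configuration table indexed by reqpartid. The struct
layout, the function names and the table are hypothetical and only model
the idea; they are not the kernel's data structures.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the software-defined id described above. */
struct sd_closid_model {
        uint32_t intpartid;     /* identifies the ctrl group */
        uint32_t reqpartid;     /* indexes configuration and monitoring */
};

/* Two groups (or tasks) belong to the same ctrl group if intpartid matches. */
static bool same_ctrl_group(const struct sd_closid_model *a,
                            const struct sd_closid_model *b)
{
        return a->intpartid == b->intpartid;
}

/*
 * Copy the parent ctrl group's configuration to every group that shares
 * its intpartid, so that child mon groups follow their parent.
 */
static void sync_children_config(uint32_t *config_by_reqpartid,
                                 const struct sd_closid_model *parent,
                                 const struct sd_closid_model *groups,
                                 size_t n)
{
        for (size_t i = 0; i < n; i++) {
                if (same_ctrl_group(parent, &groups[i]))
                        config_by_reqpartid[groups[i].reqpartid] =
                                config_by_reqpartid[parent->reqpartid];
        }
}
```

In the diagram above, p2 would be {intpartid 2, reqpartid 2} and its mon
groups m0 to m3 would share intpartid 2 while using reqpartid 8 to 11, so
a change to p2's entry is copied to entries 8 to 11.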

Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
Reviewed-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Reviewed-by: Cheng Jian <cj.chengjian@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

26a57873

arm64/mpam: resctrl: Remove unnecessary CONFIG_ARM64 · 17c6439e

Committed by Wang ShaoBo on Feb 26, 2021

hulk inclusion
category: feature
feature: ARM MPAM support
bugzilla: 48265
CVE: NA

--------------------------------

Code in resctrlfs.c is not shared with x86 RDT currently, but may be
updated to support both in the future, so remove unrelated CONFIG for
now to make code clearer.

Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
Reviewed-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Reviewed-by: Cheng Jian <cj.chengjian@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

17c6439e

arm64/mpam: resctrl: Remove ctrlmon sysfile · 037915d2

Committed by Wang ShaoBo on Feb 26, 2021

hulk inclusion
category: feature
feature: ARM MPAM support
bugzilla: 48265
CVE: NA

--------------------------------

We redesign the monitoring process for users, as the following
illustrates:

  e.g. before rewriting:
     mount /sys/fs/resctrl && cd /sys/fs/resctrl
     mkdir p1 && cd p1
     echo 1 > ctrlmon  # this allocates a monitor resource for this group
     ...               # associate task/cpu with this group
     grep . mon_data/* # get monitor data from mon_data directory
  e.g. after rewriting:
     mount /sys/fs/resctrl && cd /sys/fs/resctrl
     mkdir p1 && cd p1 # automatically allocates a monitoring resource
     ...               # associate task/cpu with this group
     grep . mon_data/* # directly get monitor data

ctrlmon is used for manually allocating a monitor resource for monitoring
a specified group (labeled by partid and pmg). We delete ctrlmon because
this action is redundant.

With ctrlmon, the user has to know which group has been allocated an
available monitor resource, and only after that monitor resource is
released can it be reallocated to a new group. This is redundant and
unnecessary: a monitor resource is used only while monitoring takes
place, so a spare monitor resource can be allocated to multiple groups
and take effect when monitoring happens.

Some restrictions should be known, however. A monitor resource for
monitoring cache occupancy may need to be kept for a long time, until it
is no longer needed or the occupancy drops below a threshold, similar to
how the Intel-RDT limbo list works. Otherwise the monitoring result may
be unexpectedly small when you force-switch a mon resource from one
group to another.

We provide a simple LRU mon resource allocation mechanism, but so far it
just assigns monitors according to the order in which groups were
created. This is incomplete and needs subsequent improvement.

Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
Reviewed-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Reviewed-by: Cheng Jian <cj.chengjian@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

037915d2

arm64/mpam: resctrl: Update closid alloc and free process with bitmap · bf24b8a8

Committed by Wang ShaoBo on Feb 26, 2021

hulk inclusion
category: feature
feature: ARM MPAM support
bugzilla: 48265
CVE: NA

--------------------------------

Replace the u32 bitmask with a bitmap for closid allocation, because the
number of closids may be too large for 32 bits.

This also supports cdp: when cdp is enabled, two closids are assigned at
a time, giving closid to code LxCache and closid+1 to data LxDATA. The
free process does the same.
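
A bitmap-based allocator of this kind can be sketched in userspace C as
follows. The names, the limit of 128 ids and the choice to hand out
cdp pairs at even ids are assumptions made for the sketch, not details
taken from the patch.

```c
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_CLOSID      128     /* hypothetical limit, deliberately above 32 */
#define BITS_PER_WORD   (sizeof(unsigned long) * CHAR_BIT)
#define NR_WORDS        ((MAX_CLOSID + BITS_PER_WORD - 1) / BITS_PER_WORD)

struct closid_map {
        unsigned long used[NR_WORDS];   /* one bit per closid */
};

static bool closid_test(const struct closid_map *m, unsigned int id)
{
        return (m->used[id / BITS_PER_WORD] >> (id % BITS_PER_WORD)) & 1UL;
}

static void closid_set(struct closid_map *m, unsigned int id)
{
        m->used[id / BITS_PER_WORD] |= 1UL << (id % BITS_PER_WORD);
}

static void closid_clear(struct closid_map *m, unsigned int id)
{
        m->used[id / BITS_PER_WORD] &= ~(1UL << (id % BITS_PER_WORD));
}

/*
 * Returns the first free closid, or -1 if none is left. With cdp the
 * caller gets closid (code) and closid + 1 (data) as a pair.
 */
static int closid_alloc_model(struct closid_map *m, bool cdp)
{
        unsigned int step = cdp ? 2 : 1;

        for (unsigned int id = 0; id + step <= MAX_CLOSID; id += step) {
                if (closid_test(m, id) || (cdp && closid_test(m, id + 1)))
                        continue;
                closid_set(m, id);
                if (cdp)
                        closid_set(m, id + 1);
                return (int)id;
        }
        return -1;
}

/* Frees closid, and closid + 1 as well when cdp is enabled. */
static void closid_free_model(struct closid_map *m, int closid, bool cdp)
{
        if (closid < 0 || closid + (cdp ? 1 : 0) >= MAX_CLOSID)
                return;
        closid_clear(m, (unsigned int)closid);
        if (cdp)
                closid_clear(m, (unsigned int)closid + 1);
}
```

Starting from an empty map with cdp enabled, two allocations return 0
and 2 and mark ids 0 to 3 as used; freeing closid 0 releases ids 0 and 1
together.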

Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
Reviewed-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Reviewed-by: Cheng Jian <cj.chengjian@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

bf24b8a8

arm64/mpam: resctrl: Use resctrl_group_init_alloc() to init schema list · 62530c01

Committed by Wang ShaoBo on Feb 26, 2021

hulk inclusion
category: feature
feature: ARM MPAM support
bugzilla: 48265
CVE: NA

--------------------------------

Add a schema list for each rdt domain. We use this list to store changes
from schemata rows instead of the previous ctrlval array that lives in
the resctrl resource structure. When the resctrl sysfs is mounted, we
reset all resources' configuration to default via
resctrl_group_update_domains().

Currently each row in the schemata sysfile occupies one list node; this
may be extended later to refine the control types.

Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
Reviewed-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Reviewed-by: Cheng Jian <cj.chengjian@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

62530c01

arm64/mpam: resctrl: Add helpers for init and destroy schemata list · 7f72b99b

Committed by Wang ShaoBo on Feb 26, 2021

hulk inclusion
category: feature
feature: ARM MPAM support
bugzilla: 48265
CVE: NA

--------------------------------

Initialize the schemata list when the resctrl sysfs is mounted and
destroy it on umount. Each list node contains the value updated by a
schemata row (in the resctrl sysfs).

Part of the code is borrowed from 250656171d95 ("x86/resctrl: Stop using
Lx CODE/DATA resources"), which explains:

  Now that CDP enable/disable is global, and the closid offset correction
  is based on the configuration being applied, we are using different
  hw_closid slots in the ctrl array for CODE/DATA schema. This lets
  us merge them using the same Lx resource twice for CDP's CODE/DATA
  schema. This keeps the illusion of separate caches in the resctrl code.

  When CDP is enabled for a cache, create two schema generating the names
  and setting the configuration type.

  We can now remove the initialisation of the illusionary hw_resources:
  'cdp_capable' just requires setting a flag, resctrl knows what to do
  from there.

Link: http://www.linux-arm.org/git?p=linux-jm.git;a=commit;h=250656171d95dea079cc661098a0984e7237aa25
Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
Reviewed-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Reviewed-by: Cheng Jian <cj.chengjian@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

7f72b99b

arm64/mpam: resctrl: Rebuild configuration and monitoring pipeline · aeffdf37

Committed by Wang ShaoBo on Feb 26, 2021

hulk inclusion
category: feature
feature: ARM MPAM support
bugzilla: 48265
CVE: NA

--------------------------------

We now bridge the resctrl intermediate processing module and the mpam
devices module, so a large block of code involved in the configuration
and monitoring process needs to be modified.

We change the previous method of writing the MSCs' registers directly;
this job is handed over to helpers offered by the mpam devices module
instead. When a configuration or monitoring action happens, each
domain's ctrlval array, changed by resctrl sysfs input, is updated into
the mpam config structure that lives in each mpam component structure,
and the relevant helpers provided by the mpam devices module then
accomplish the remaining work.

Comparison:
                   configuration or monitoring

                     old               new
                      +                 +
                      |                 |
                      |       +---------+------------+
                      |       | intermediate helpers |
                      |       +---------+------------+
                      |                 |
                      |                 |
                   +--+-----------------+----+
                   | [reading  writing MMIO] |
                   +-------------------------+

So far we have nearly connected the path between the resctrl sysfs and
the mpam devices module, but the work is still incomplete and some
follow-up changes are needed.

This also moves relevant structures such as struct mongroup to a
suitable place.

Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
Reviewed-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Reviewed-by: Cheng Jian <cj.chengjian@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

aeffdf37

arm64/mpam: Fix unreset resources when mkdir ctrl group or umount resctrl · 88520d23

Committed by Wang ShaoBo on Feb 26, 2021

hulk inclusion
category: bugfix
bugzilla: 48265
CVE: NA

--------------------------------

There are two problems related to schemata:

1) After rmdir of a group followed by mkdir of a new group under the
   resctrl root directory, the new group still inherits the schemata
   configuration of the old one.
   e.g.
       > mount -t resctrl resctrl /sys/fs/resctrl
       > cd /sys/fs/resctrl
       > mkdir p1 && cd p1
       > echo 'L3:0=7f' > schemata
       > cd .. && rmdir p1 && mkdir p1 && cd p1
       > cat schemata
         L3:0=7f;1=7fff;2=7fff;3=7fff
         MB:0=100;1=100;2=100;3=100

2) The old configuration also persists after umount of /sys/fs/resctrl
   and a remount.
   e.g.
       > mount -t resctrl resctrl /sys/fs/resctrl
       > cd /sys/fs/resctrl
       > echo 'L3:0=7f' > schemata
       > umount /sys/fs/resctrl
       > mount -t resctrl resctrl /sys/fs/resctrl
       > cat schemata
         L3:0=7f;1=7fff;2=7fff;3=7fff
         MB:0=100;1=100;2=100;3=100

First, we make each resctrl resource obtain its corresponding default
configuration. NOTE: we use zero to initialize the L3 default value
instead of the max cpbm bits, as a zero configuration equals the maximum
configuration for L3 MSCs. We use the max-percentage mask of the max
bandwidth to generate the maximum configuration for MB.

Then we reset the resources' configuration settings to their default
values and return the MSCs to the default state when mkdir or umount
happens.

Fixes: caf75b6b2540 ("resctrlfs: mpam: init struct for mpam")
Fixes: 916dd9321e3c ("resctrlfs: init support resctrlfs")
Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: Cheng Jian <cj.chengjian@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

88520d23

arm64/mpam: cleanup the source file's licence · 1fef4872

Committed by Xie XiuQi on Feb 26, 2021

hulk inclusion
category: bugfix
bugzilla: 48265
CVE: NA

--------------------------------

Rewrite the licence of the mpam feature's source files.

Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
Reviewed-by: Cheng Jian <cj.chengjian@huawei.com>
Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>

1fef4872
