1. 09 Apr, 2021 40 commits
    • J
      ext4: save error info to sb through journal if available · 98498945
      Jan Kara committed
      mainline inclusion
      from mainline-v5.11-rc4
      commit 2d01ddc8
      category: bugfix
      bugzilla: 50839
      CVE: NA
      
      -----------------------------------------------
      
      If journalling is still working when we get to writing error
      information to the superblock, we cannot write directly to the
      superblock: such a write could race with a journalled update of the
      superblock and cause journal checksum failures, inconsistent
      information in the journal, or other problems. We also cannot journal
      the superblock directly from the error handling functions, as we are
      running in an uncertain context and could deadlock. So just punt the
      journalled superblock update to a workqueue.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20201216101844.22917-5-jack@suse.cz
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Ye Bin <yebin10@huawei.com>
      Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • J
      ext4: protect superblock modifications with a buffer lock · 6d70f424
      Jan Kara committed
      mainline inclusion
      from mainline-v5.11-rc4
      commit 05c2c00f
      category: bugfix
      bugzilla: 50839
      CVE: NA
      
      -----------------------------------------------
      
      Protect all superblock modifications (including checksum computation)
      with a superblock buffer lock. That way we are sure computed checksum
      matches current superblock contents (a mismatch could cause checksum
      failures in nojournal mode or if an unjournalled superblock update races
      with a journalled one). Also we avoid modifying superblock contents
      while it is being written out (which can cause DIF/DIX failures if we
      are running in nojournal mode).
      Signed-off-by: Jan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20201216101844.22917-4-jack@suse.cz
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      
      conflicts:
      fs/ext4/file.c
      Signed-off-by: Ye Bin <yebin10@huawei.com>
      Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • J
      ext4: drop sync argument of ext4_commit_super() · ee80bf03
      Jan Kara committed
      mainline inclusion
      from mainline-v5.11-rc4
      commit 4392fbc4
      category: bugfix
      bugzilla: 50839
      CVE: NA
      
      -----------------------------------------------
      
      Everybody passes 1 as sync argument of ext4_commit_super(). Just drop
      it.
      Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20201216101844.22917-3-jack@suse.cz
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Ye Bin <yebin10@huawei.com>
      Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • J
      ext4: combine ext4_handle_error() and save_error_info() · acf13bf3
      Jan Kara committed
      mainline inclusion
      from mainline-v5.11-rc4
      commit e789ca0c
      category: bugfix
      bugzilla: 50839
      CVE: NA
      
      -----------------------------------------------
      
      save_error_info() is always called together with ext4_handle_error().
      Combine them into a single call and move unconditional bits out of
      save_error_info() into ext4_handle_error().
      Signed-off-by: Jan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20201216101844.22917-2-jack@suse.cz
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Ye Bin <yebin10@huawei.com>
      Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • J
      ext4: defer saving error info from atomic context · f40c120c
      Jan Kara committed
      mainline inclusion
      from mainline-v5.11-rc1
      commit c92dc856
      category: bugfix
      bugzilla: 50839
      CVE: NA
      
      -----------------------------------------------
      
      When filesystem inconsistency is detected with group locked, we
      currently try to modify superblock to store error there without
      blocking. However this can cause superblock checksum failures (or
      DIF/DIX failure) when the superblock is just being written out.
      
      Make error handling code just store error information in ext4_sb_info
      structure and copy it to on-disk superblock only in ext4_commit_super().
      In case of error happening with group locked, we just postpone the
      superblock flushing to a workqueue.
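
A minimal userspace sketch of the deferral pattern described above, assuming a simplified layout: the struct and function names (mem_err_info, disk_sb, record_error, commit_super) are hypothetical stand-ins rather than the kernel's definitions. The recording step touches only in-memory state, and the on-disk copy is updated later from a context that may block.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical in-memory error record, standing in for fields kept in ext4_sb_info. */
struct mem_err_info {
	bool     pending;          /* an error was recorded but not yet flushed */
	uint32_t error_count;      /* errors recorded since the last flush */
	uint32_t first_error_code; /* 0 means "no first error recorded yet" */
	uint32_t last_error_code;
};

/* Hypothetical stand-in for the on-disk superblock error fields. */
struct disk_sb {
	uint32_t error_count;
	uint32_t first_error_code;
	uint32_t last_error_code;
};

/* Usable where blocking is not allowed: only in-memory state is modified. */
static void record_error(struct mem_err_info *ei, uint32_t code)
{
	ei->pending = true;
	ei->error_count++;
	ei->last_error_code = code;
	if (ei->first_error_code == 0)
		ei->first_error_code = code;
}

/* Runs later in a context that may block, for example from a work item. */
static void commit_super(struct mem_err_info *ei, struct disk_sb *sb)
{
	if (!ei->pending)
		return;
	sb->error_count += ei->error_count;
	sb->last_error_code = ei->last_error_code;
	/* The first-error field is written once and never overwritten. */
	if (sb->first_error_code == 0)
		sb->first_error_code = ei->first_error_code;
	ei->error_count = 0;
	ei->pending = false;
}
```

Calling record_error() twice and then commit_super() once leaves the on-disk copy untouched until the commit step runs.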
      
      [ Added fixup so that s_first_error_* does not get updated after
        the file system is remounted.
        Also added fix for syzbot failure.  - Ted ]
      Signed-off-by: Jan Kara <jack@suse.cz>
      Link: https://lore.kernel.org/r/20201127113405.26867-8-jack@suse.cz
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Cc: Hillf Danton <hdanton@sina.com>
      Reported-by: syzbot+9043030c040ce1849a60@syzkaller.appspotmail.com
      Signed-off-by: Ye Bin <yebin10@huawei.com>
      Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • J
      ext4: simplify ext4 error translation · 6603d2ef
      Jan Kara committed
      mainline inclusion
      from mainline-v5.11-rc1
      commit 02a7780e
      category: bugfix
      bugzilla: 50839
      CVE: NA
      
      -----------------------------------------------
      
      We convert errno's to ext4 on-disk format error codes in
      save_error_info(). Add a function and a bit of macro magic to make this
      simpler.
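
A minimal sketch of how a translation function plus a token-pasting macro can map errno values to on-disk codes. The DISK_ERR_* names and their numeric values are hypothetical and chosen only for illustration; this is not the code added by the commit.

```c
#include <errno.h>

/* Hypothetical on-disk error codes; the numeric values are illustrative only. */
enum disk_err {
	DISK_ERR_UNKNOWN = 1,
	DISK_ERR_EIO,
	DISK_ERR_ENOMEM,
	DISK_ERR_ENOSPC,
};

/*
 * Expand one errno name into a switch case. The operand of ## is pasted
 * as written (DISK_ERR_EIO), while the plain use of err in the case label
 * expands to the numeric errno value.
 */
#define DISK_ERR_TRANSLATE(err) case err: return DISK_ERR_##err

static int errno_to_disk_code(int err)
{
	switch (err) {
	DISK_ERR_TRANSLATE(EIO);
	DISK_ERR_TRANSLATE(ENOMEM);
	DISK_ERR_TRANSLATE(ENOSPC);
	}
	return DISK_ERR_UNKNOWN; /* anything without a dedicated code */
}
```

Adding a new mapping then takes one enum entry and one DISK_ERR_TRANSLATE() line.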
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Andreas Dilger <adilger@dilger.ca>
      Link: https://lore.kernel.org/r/20201127113405.26867-7-jack@suse.cz
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Ye Bin <yebin10@huawei.com>
      Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • J
      ext4: move functions in super.c · fc004c92
      Jan Kara committed
      mainline inclusion
      from mainline-v5.11-rc1
      commit 40676623
      category: bugfix
      bugzilla: 50839
      CVE: NA
      
      -----------------------------------------------
      
      Just move error info related functions in super.c close to
      ext4_handle_error(). We'll want to combine save_error_info() with
      ext4_handle_error() and this makes the change more obvious and saves a
      forward declaration as well. No functional change.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Andreas Dilger <adilger@dilger.ca>
      Link: https://lore.kernel.org/r/20201127113405.26867-6-jack@suse.cz
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Ye Bin <yebin10@huawei.com>
      Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • J
      ext4: make ext4_abort() use __ext4_error() · 590f38ce
      Jan Kara committed
      mainline inclusion
      from mainline-v5.11-rc1
      commit 014c9caa
      category: bugfix
      bugzilla: 50839
      CVE: NA
      
      -----------------------------------------------
      
      The only difference between __ext4_abort() and __ext4_error() is that
      the former one ignores errors=continue mount option. Unify the code to
      reduce duplication.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Andreas Dilger <adilger@dilger.ca>
      Link: https://lore.kernel.org/r/20201127113405.26867-5-jack@suse.cz
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Ye Bin <yebin10@huawei.com>
      Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • J
      ext4: standardize error message in ext4_protect_reserved_inode() · 563d7a8b
      Jan Kara committed
      mainline inclusion
      from mainline-v5.11-rc1
      commit 93c20bc3
      category: bugfix
      bugzilla: 50839
      CVE: NA
      
      -----------------------------------------------
      
      We use __ext4_error() when ext4_protect_reserved_inode() finds
      filesystem corruption. However EXT4_ERROR_INODE_ERR() is perfectly
      capable of reporting all the needed information. So just use that.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Andreas Dilger <adilger@dilger.ca>
      Link: https://lore.kernel.org/r/20201127113405.26867-4-jack@suse.cz
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Ye Bin <yebin10@huawei.com>
      Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • J
      ext4: remove redundant sb checksum recomputation · 8ad485f8
      Jan Kara committed
      mainline inclusion
      from mainline-v5.11-rc1
      commit 81414b4d
      category: bugfix
      bugzilla: 50839
      CVE: NA
      
      -----------------------------------------------
      
      Superblock is written out either through ext4_commit_super() or through
      ext4_handle_dirty_super(). In both cases we recompute the checksum so it
      is not necessary to recompute it after updating superblock free inodes &
      blocks counters.
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Andreas Dilger <adilger@dilger.ca>
      Link: https://lore.kernel.org/r/20201127113405.26867-3-jack@suse.cz
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Ye Bin <yebin10@huawei.com>
      Reviewed-by: zhangyi (F) <yi.zhang@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • L
      binfmt_misc: fix possible deadlock in bm_register_write · c584dce0
      Lior Ribak committed
      stable inclusion
      from stable-5.10.24
      commit 5ab9464a2a3c538eedbb438f1802f2fd98d0953f
      bugzilla: 51348
      
      --------------------------------
      
      commit e7850f4d upstream.
      
      There is a deadlock in bm_register_write:
      
      First, in the beginning of the function, a lock is taken on the binfmt_misc
      root inode with inode_lock(d_inode(root)).
      
      Then, if the user used the MISC_FMT_OPEN_FILE flag, the function will call
      open_exec on the user-provided interpreter.
      
      open_exec will call a path lookup, and if the path lookup process includes
      the root of binfmt_misc, it will try to take a shared lock on its inode
      again, but it is already locked, and the code will get stuck in a deadlock.
      
      To reproduce the bug:
      $ echo ":iiiii:E::ii::/proc/sys/fs/binfmt_misc/bla:F" > /proc/sys/fs/binfmt_misc/register
      
      Backtrace showing where the lock is taken (frame #5):
      0  schedule () at ./arch/x86/include/asm/current.h:15
      1  0xffffffff81b51237 in rwsem_down_read_slowpath (sem=0xffff888003b202e0, count=<optimized out>, state=state@entry=2) at kernel/locking/rwsem.c:992
      2  0xffffffff81b5150a in __down_read_common (state=2, sem=<optimized out>) at kernel/locking/rwsem.c:1213
      3  __down_read (sem=<optimized out>) at kernel/locking/rwsem.c:1222
      4  down_read (sem=<optimized out>) at kernel/locking/rwsem.c:1355
      5  0xffffffff811ee22a in inode_lock_shared (inode=<optimized out>) at ./include/linux/fs.h:783
      6  open_last_lookups (op=0xffffc9000022fe34, file=0xffff888004098600, nd=0xffffc9000022fd10) at fs/namei.c:3177
      7  path_openat (nd=nd@entry=0xffffc9000022fd10, op=op@entry=0xffffc9000022fe34, flags=flags@entry=65) at fs/namei.c:3366
      8  0xffffffff811efe1c in do_filp_open (dfd=<optimized out>, pathname=pathname@entry=0xffff8880031b9000, op=op@entry=0xffffc9000022fe34) at fs/namei.c:3396
      9  0xffffffff811e493f in do_open_execat (fd=fd@entry=-100, name=name@entry=0xffff8880031b9000, flags=<optimized out>, flags@entry=0) at fs/exec.c:913
      10 0xffffffff811e4a92 in open_exec (name=<optimized out>) at fs/exec.c:948
      11 0xffffffff8124aa84 in bm_register_write (file=<optimized out>, buffer=<optimized out>, count=19, ppos=<optimized out>) at fs/binfmt_misc.c:682
      12 0xffffffff811decd2 in vfs_write (file=file@entry=0xffff888004098500, buf=buf@entry=0xa758d0 ":iiiii:E::ii::i:CF\n", count=count@entry=19, pos=pos@entry=0xffffc9000022ff10) at fs/read_write.c:603
      13 0xffffffff811defda in ksys_write (fd=<optimized out>, buf=0xa758d0 ":iiiii:E::ii::i:CF\n", count=19) at fs/read_write.c:658
      14 0xffffffff81b49813 in do_syscall_64 (nr=<optimized out>, regs=0xffffc9000022ff58) at arch/x86/entry/common.c:46
      15 0xffffffff81c0007c in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:120
      
      To solve the issue, move the open_exec call to before the point where
      bm_register_write takes the write lock.
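
A toy userspace model of the ordering problem, assuming a non-recursive lock: toy_lock, lookup, register_old and register_new are hypothetical names, and a failed toy_trylock() stands in for the point where the real code would block forever.

```c
#include <stdbool.h>

/* Toy non-recursive lock: a second acquisition while held can never succeed. */
struct toy_lock {
	bool held;
};

static bool toy_trylock(struct toy_lock *l)
{
	if (l->held)
		return false;
	l->held = true;
	return true;
}

static void toy_unlock(struct toy_lock *l)
{
	l->held = false;
}

/* Stand-in for a path lookup that needs the same lock as its caller. */
static bool lookup(struct toy_lock *root)
{
	if (!toy_trylock(root))
		return false; /* a blocking lock would hang here */
	toy_unlock(root);
	return true;
}

/* Old order: take the lock, then perform the lookup while holding it. */
static bool register_old(struct toy_lock *root)
{
	bool ok;

	if (!toy_trylock(root))
		return false;
	ok = lookup(root);
	toy_unlock(root);
	return ok;
}

/* New order: perform the lookup first, then take the lock. */
static bool register_new(struct toy_lock *root)
{
	if (!lookup(root))
		return false;
	if (!toy_trylock(root))
		return false;
	toy_unlock(root);
	return true;
}
```

In this model register_old() always fails because the lookup runs with the lock already held, while register_new() completes.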
      
      Link: https://lkml.kernel.org/r/20210228224414.95962-1-liorribak@gmail.com
      Fixes: 948b701a ("binfmt_misc: add persistent opened binary handler for containers")
      Signed-off-by: Lior Ribak <liorribak@gmail.com>
      Acked-by: Helge Deller <deller@gmx.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • D
      configfs: fix a use-after-free in __configfs_open_file · 37af504b
      Daiyue Zhang committed
      stable inclusion
      from stable-5.10.24
      commit 109720342efd6ace3d2e8f34a25ea65036bb1d3b
      bugzilla: 51348
      
      --------------------------------
      
      [ Upstream commit 14fbbc82 ]
      
      Commit b0841eef ("configfs: provide exclusion between IO and removals")
      uses ->frag_dead to mark the fragment state, so there is no need to take
      an extra refcount on config_item when opening a file. That commit removed
      the configfs_get_config_item call from __configfs_open_file but left the
      matching config_item_put in place. The refcount on config_item therefore
      becomes unbalanced, causing use-after-free issues in cases like this:
      
      Test:
      1. Mount configfs on /config with read-only items:
      drwxrwx--- 289 root   root            0 2021-04-01 11:55 /config
      drwxr-xr-x   2 root   root            0 2021-04-01 11:54 /config/a
      --w--w--w-   1 root   root         4096 2021-04-01 11:53 /config/a/1.txt
      ......
      
      2. Then run:
      for file in /config
      do
      echo $file
      grep -R 'key' $file
      done
      
      3. __configfs_open_file will be called in parallel; the first call
      to run will do:
      if (file->f_mode & FMODE_READ) {
      	if (!(inode->i_mode & S_IRUGO))
      		goto out_put_module;
      			config_item_put(buffer->item);
      				kref_put()
      					package_details_release()
      						kfree()
      
      the other one will run into use-after-free issues like this:
      BUG: KASAN: use-after-free in __configfs_open_file+0x1bc/0x3b0
      Read of size 8 at addr fffffff155f02480 by task grep/13096
      CPU: 0 PID: 13096 Comm: grep VIP: 00 Tainted: G        W       4.14.116-kasan #1
      TGID: 13096 Comm: grep
      Call trace:
      dump_stack+0x118/0x160
      kasan_report+0x22c/0x294
      __asan_load8+0x80/0x88
      __configfs_open_file+0x1bc/0x3b0
      configfs_open_file+0x28/0x34
      do_dentry_open+0x2cc/0x5c0
      vfs_open+0x80/0xe0
      path_openat+0xd8c/0x2988
      do_filp_open+0x1c4/0x2fc
      do_sys_open+0x23c/0x404
      SyS_openat+0x38/0x48
      
      Allocated by task 2138:
      kasan_kmalloc+0xe0/0x1ac
      kmem_cache_alloc_trace+0x334/0x394
      packages_make_item+0x4c/0x180
      configfs_mkdir+0x358/0x740
      vfs_mkdir2+0x1bc/0x2e8
      SyS_mkdirat+0x154/0x23c
      el0_svc_naked+0x34/0x38
      
      Freed by task 13096:
      kasan_slab_free+0xb8/0x194
      kfree+0x13c/0x910
      package_details_release+0x524/0x56c
      kref_put+0xc4/0x104
      config_item_put+0x24/0x34
      __configfs_open_file+0x35c/0x3b0
      configfs_open_file+0x28/0x34
      do_dentry_open+0x2cc/0x5c0
      vfs_open+0x80/0xe0
      path_openat+0xd8c/0x2988
      do_filp_open+0x1c4/0x2fc
      do_sys_open+0x23c/0x404
      SyS_openat+0x38/0x48
      el0_svc_naked+0x34/0x38
      
      To fix this issue, remove the config_item_put in
      __configfs_open_file to balance the refcount of config_item.
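
A toy model of the refcount imbalance, under the assumption that the item starts with a single owner reference. struct item, item_get, item_put, open_unbalanced and open_balanced are hypothetical names; the freed flag marks the point where a real release callback would free the memory.

```c
#include <stdbool.h>

/* Toy reference-counted item; freed is set where a real release would kfree(). */
struct item {
	int  refs;
	bool freed;
};

static void item_get(struct item *it)
{
	it->refs++;
}

static void item_put(struct item *it)
{
	if (--it->refs == 0)
		it->freed = true;
}

/* Buggy shape: the error path drops a reference the open path never took. */
static int open_unbalanced(struct item *it, bool readable)
{
	if (!readable) {
		item_put(it);
		return -1;
	}
	return 0;
}

/* Fixed shape: no get on entry, so no put on the error path either. */
static int open_balanced(struct item *it, bool readable)
{
	(void)it;
	return readable ? 0 : -1;
}
```

With one owner reference, a single failed open_unbalanced() call already drops the count to zero, so any later access by the owner would be a use-after-free.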
      
      Fixes: b0841eef ("configfs: provide exclusion between IO and removals")
      Signed-off-by: Daiyue Zhang <zhangdaiyue1@huawei.com>
      Signed-off-by: Yi Chen <chenyi77@huawei.com>
      Signed-off-by: Ge Qiu <qiuge@huawei.com>
      Reviewed-by: Chao Yu <yuchao0@huawei.com>
      Acked-by: Al Viro <viro@zeniv.linux.org.uk>
      Signed-off-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • O
      NFSv4.2: fix return value of _nfs4_get_security_label() · 7a9cf8a4
      Ondrej Mosnacek committed
      stable inclusion
      from stable-5.10.24
      commit caa86901c863e7c3646d189f2deb9e844afd0568
      bugzilla: 51348
      
      --------------------------------
      
      [ Upstream commit 53cb2454 ]
      
      An xattr 'get' handler is expected to return the length of the value on
      success, yet _nfs4_get_security_label() (and consequently also
      nfs4_xattr_get_nfs4_label(), which is used as an xattr handler) returns
      just 0 on success.
      
      Fix this by returning label.len instead, which contains the length of
      the result.
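
A minimal userspace sketch of the return convention for an xattr-style 'get' callback, modelled on getxattr(2) semantics. struct label and get_label are hypothetical; only the idea of returning the label length on success comes from the text above.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical label holder; only the len field is named in the commit text. */
struct label {
	const char *data;
	size_t len;
};

/*
 * Sketch of an xattr-style "get" callback:
 *  - size == 0 is a size query and returns the value length,
 *  - a buffer that is too small yields -ERANGE,
 *  - success returns the number of bytes in the value, not 0.
 */
static ssize_t get_label(const struct label *label, void *buf, size_t size)
{
	if (size == 0)
		return (ssize_t)label->len;
	if (size < label->len)
		return -ERANGE;
	memcpy(buf, label->data, label->len);
	return (ssize_t)label->len;
}
```

A caller that treats the return value as the number of valid bytes would see an empty value if the callback returned 0 on success.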
      
      Fixes: aa9c2669 ("NFS: Client implementation of Labeled-NFS")
      Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
      Reviewed-by: James Morris <jamorris@linux.microsoft.com>
      Reviewed-by: Paul Moore <paul@paul-moore.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • T
      NFS: Don't gratuitously clear the inode cache when lookup failed · d506f8e3
      Trond Myklebust committed
      stable inclusion
      from stable-5.10.24
      commit e181960ec51d5fa089d6e8e2478febe01ca8be04
      bugzilla: 51348
      
      --------------------------------
      
      [ Upstream commit 47397915 ]
      
      The fact that the lookup revalidation failed does not mean that the
      inode contents have changed.
      
      Fixes: 5ceb9d7f ("NFS: Refactor nfs_lookup_revalidate()")
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • T
      NFS: Don't revalidate the directory permissions on a lookup failure · 636f5615
      Trond Myklebust committed
      stable inclusion
      from stable-5.10.24
      commit dd756d05bee58077ea0239861022ca83e7d8d23d
      bugzilla: 51348
      
      --------------------------------
      
      [ Upstream commit 82e7ca13 ]
      
      There should be no reason to expect the directory permissions to change
      just because the directory contents changed or a negative lookup timed
      out. So let's avoid doing a full call to nfs_mark_for_revalidate() in
      that case.
      Furthermore, if this is a negative dentry, and we haven't actually done
      a new lookup, then we have no reason yet to believe the directory has
      changed at all. So let's remove the gratuitous directory inode
      invalidation altogether when called from
      nfs_lookup_revalidate_negative().
      Reported-by: Geert Jansen <gerardu@amazon.com>
      Fixes: 5ceb9d7f ("NFS: Refactor nfs_lookup_revalidate()")
      Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
      Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • P
      cifs: do not send close in compound create+close requests · 7afed434
      Paulo Alcantara committed
      stable inclusion
      from stable-5.10.24
      commit d308202c1b96024a2f3325642f5e087cf997b5d9
      bugzilla: 51348
      
      --------------------------------
      
      commit 04ad69c3 upstream.
      
      In case of interrupted syscalls, prevent sending CLOSE commands for
      compound CREATE+CLOSE requests by introducing a
      CIFS_CP_CREATE_CLOSE_OP flag that tells lower layers not to send a
      CLOSE command for the MIDs corresponding to the compound
      CREATE+CLOSE request.
      
      A simple reproducer:
      
          #!/bin/bash
      
          mount //server/share /mnt -o username=foo,password=***
          tc qdisc add dev eth0 root netem delay 450ms
          stat -f /mnt &>/dev/null & pid=$!
          sleep 0.01
          kill $pid
          tc qdisc del dev eth0 root
          umount /mnt
      
      Before patch:
      
          ...
          6 0.256893470 192.168.122.2 → 192.168.122.15 SMB2 402 Create Request File: ;GetInfo Request FS_INFO/FileFsFullSizeInformation;Close Request
          7 0.257144491 192.168.122.15 → 192.168.122.2 SMB2 498 Create Response File: ;GetInfo Response;Close Response
          9 0.260798209 192.168.122.2 → 192.168.122.15 SMB2 146 Close Request File:
         10 0.260841089 192.168.122.15 → 192.168.122.2 SMB2 130 Close Response, Error: STATUS_FILE_CLOSED
      Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
      Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
      Reviewed-by: Aurelien Aptel <aaptel@suse.com>
      CC: <stable@vger.kernel.org>
      Signed-off-by: Steve French <stfrench@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • J
      block: Try to handle busy underlying device on discard · 3ccf9c92
      Jan Kara committed
      stable inclusion
      from stable-5.10.24
      commit d44c9780ed40db88626c9354868eab72159c7a7f
      bugzilla: 51348
      
      --------------------------------
      
      commit 56887cff upstream.
      
      Commit 384d87ef ("block: Do not discard buffers under a mounted
      filesystem") made paths issuing discard or zeroout requests to the
      underlying device try to grab block device in exclusive mode. If that
      failed we returned EBUSY to userspace. This however caused unexpected
      fallout in userspace where e.g. FUSE filesystems issue discard requests
      from userspace daemons although the device is open exclusively by the
      kernel. Also shrinking of logical volume by LVM issues discard requests
      to a device which may be claimed exclusively because there's another LV
      on the same PV. So to avoid these userspace regressions, fall back to
      invalidate_inode_pages2_range() instead of returning EBUSY to userspace
      and return EBUSY only if that call fails as well (meaning that there's
      indeed someone using the particular device range we are trying to
      discard).
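
A simplified sketch of the fallback decision described above. struct dev_state, invalidate_range and prepare_discard are hypothetical stand-ins; the real code also drops cached pages when the exclusive claim succeeds, which this model leaves out.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical device state standing in for the real claim and page-cache checks. */
struct dev_state {
	bool exclusively_claimed; /* someone else holds the device exclusively */
	bool range_in_use;        /* cached pages in the range cannot be dropped */
};

/* Stand-in for invalidate_inode_pages2_range(): fails only if the range is busy. */
static int invalidate_range(const struct dev_state *d)
{
	return d->range_in_use ? -EBUSY : 0;
}

/* Decide whether a discard or zeroout request may proceed. */
static int prepare_discard(const struct dev_state *d)
{
	if (!d->exclusively_claimed)
		return 0;            /* the exclusive open would succeed */
	return invalidate_range(d);  /* fallback; -EBUSY only if this fails too */
}
```

Only the combination of an existing exclusive claim and a busy range ends in -EBUSY; an exclusive claim alone no longer does.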
      
      Link: https://bugzilla.kernel.org/show_bug.cgi?id=211167
      Fixes: 384d87ef ("block: Do not discard buffers under a mounted filesystem")
      CC: stable@vger.kernel.org
      Signed-off-by: Jan Kara <jack@suse.cz>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • T
      ext4: don't try to processed freed blocks until mballoc is initialized · 93e5d414
      Theodore Ts'o committed
      stable inclusion
      from stable-5.10.24
      commit 64578f9417e1e3482f3e4492496772fca130f526
      bugzilla: 51348
      
      --------------------------------
      
      [ Upstream commit 027f14f5 ]
      
      If we try to make any changes via the journal after the journal is
      initialized but before the multi-block allocator is initialized,
      we will end up dereferencing a NULL pointer when the journal commit
      callback function calls ext4_process_freed_data().
      
      The proximate cause of this failure was commit 2d01ddc8 ("ext4:
      save error info to sb through journal if available") since file system
      corruption problems detected before the call to ext4_mb_init() would
      result in a journal commit before we aborted the mount of the file
      system, and we would then trigger the NULL pointer dereference.
      
      Link: https://lore.kernel.org/r/YAm8qH/0oo2ofSMR@mit.edu
      Reported-by: Murphy Zhou <jencce.kernel@gmail.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • S
      udf: fix silent AED tagLocation corruption · 53387051
      Steven J. Magnani committed
      stable inclusion
      from stable-5.10.24
      commit 82d6c12899e2645bd17d6f9c7d494f360e1089e1
      bugzilla: 51348
      
      --------------------------------
      
      [ Upstream commit 63c9e47a ]
      
      When extending a file, udf_do_extend_file() may enter a following empty
      indirect extent. At the end of udf_do_extend_file() we revert prev_epos
      to point to the last written extent. However if we end up not adding any
      further extent in udf_do_extend_file(), the reverting points prev_epos
      into the header area of the AED and following updates of the extents
      (in udf_update_extents()) will corrupt the header.
      
      Make sure that we do not follow an indirect extent if we are not going to
      add any more extents so that returning back to the last written extent
      works correctly.
      
      Link: https://lore.kernel.org/r/20210107234116.6190-2-magnani@ieee.org
      Signed-off-by: Steven J. Magnani <magnani@ieee.org>
      Signed-off-by: Jan Kara <jack@suse.cz>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • A
      cifs: fix credit accounting for extra channel · fb386d65
      Aurelien Aptel committed
      stable inclusion
      from stable-5.10.24
      commit 3370a84d781ca5227682bd6e747aaefb6dcc8e21
      bugzilla: 51348
      
      --------------------------------
      
      commit a249cc8b upstream.
      
      With multichannel, operations like the queries
      from "ls -lR" can cause all credits to be used and
      errors to be returned since max_credits was not
      being set correctly on the secondary channels and
      thus the client was requesting 0 credits incorrectly
      in some cases (which can lead to not having
      enough credits to perform any operation on that
      channel).
      Signed-off-by: Aurelien Aptel <aaptel@suse.com>
      CC: <stable@vger.kernel.org> # v5.8+
      Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
      Signed-off-by: Steve French <stfrench@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • P
      cifs: return proper error code in statfs(2) · 9ae15b79
      Paulo Alcantara committed
      stable inclusion
      from stable-5.10.24
      commit 3d0bbd97eb6f32bcc1365252aa04a8984bab5007
      bugzilla: 51348
      
      --------------------------------
      
      commit 14302ee3 upstream.
      
      In cifs_statfs(), if server->ops->queryfs is not NULL, then we should
      use its return value rather than always returning 0.  Return the rc
      variable instead, as it is already set to 0 when there is no
      server->ops->queryfs.
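
A minimal sketch of the return-value pattern described above, assuming a simplified callback signature. struct server_ops, statfs_sketch and the two queryfs_* callbacks are hypothetical; only the queryfs member name and the rc variable come from the commit text.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical operations table; only the queryfs member mirrors the commit text. */
struct server_ops {
	int (*queryfs)(void *ctx);
};

static int statfs_sketch(const struct server_ops *ops, void *ctx)
{
	int rc = 0; /* stays 0 when there is no queryfs callback */

	if (ops->queryfs)
		rc = ops->queryfs(ctx);

	return rc; /* propagate the callback's result instead of a hard-coded 0 */
}

/* Hypothetical callbacks used to exercise both outcomes. */
static int queryfs_ok(void *ctx)
{
	(void)ctx;
	return 0;
}

static int queryfs_fail(void *ctx)
{
	(void)ctx;
	return -EIO;
}
```

With a hard-coded return of 0, the failing callback's -EIO would never reach the caller.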
      Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
      Reviewed-by: Aurelien Aptel <aaptel@suse.com>
      Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
      CC: <stable@vger.kernel.org>
      Signed-off-by: Steve French <stfrench@microsoft.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
    • C
      mount: fix mounting of detached mounts onto targets that reside on shared mounts · 3ef215f4
      Christian Brauner committed
      stable inclusion
      from stable-5.10.24
      commit 36e1efcdc54274d03e67ed6a9d5c1c2a2e77e947
      bugzilla: 51348
      
      --------------------------------
      
      commit ee2e3f50 upstream.
      
      Creating a series of detached mounts, attaching them to the filesystem,
      and unmounting them can be used to trigger an integer overflow in
      ns->mounts causing the kernel to block any new mounts in count_mounts()
      and returning ENOSPC because it falsely assumes that the maximum number
      of mounts in the mount namespace has been reached, i.e. it thinks it
      can't fit the new mounts into the mount namespace anymore.
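
A toy model of how a wrapped counter can make a limit check reject every new mount. It assumes the counter is an unsigned int that gets decremented once more than it was incremented; struct mnt_ns_sketch, count_mounts_sketch and detach_unaccounted are hypothetical names, not the kernel's code.

```c
#include <errno.h>

/* Hypothetical namespace with an unsigned mount counter and a limit. */
struct mnt_ns_sketch {
	unsigned int mounts;
	unsigned int max;
};

/* Refuse new mounts once the counter would wrap or exceed the limit. */
static int count_mounts_sketch(const struct mnt_ns_sketch *ns,
			       unsigned int new_mounts)
{
	unsigned int sum = ns->mounts + new_mounts;

	if (sum < ns->mounts || sum > ns->max)
		return -ENOSPC;
	return 0;
}

/* An unmatched decrement wraps an unsigned counter instead of going negative. */
static void detach_unaccounted(struct mnt_ns_sketch *ns)
{
	ns->mounts--;
}
```

After one unmatched decrement from zero the counter holds the largest unsigned value, so even a single new mount is refused with -ENOSPC.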
      
      Depending on the number of mounts in your system, this can be reproduced
      on any kernel that supports open_tree() and move_mount() by compiling
      and running the following program:
      
        /* SPDX-License-Identifier: LGPL-2.1+ */
      
        #define _GNU_SOURCE
        #include <errno.h>
        #include <fcntl.h>
        #include <getopt.h>
        #include <limits.h>
        #include <stdbool.h>
        #include <stdio.h>
        #include <stdlib.h>
        #include <string.h>
        #include <sys/mount.h>
        #include <sys/stat.h>
        #include <sys/syscall.h>
        #include <sys/types.h>
        #include <unistd.h>
      
        /* open_tree() */
        #ifndef OPEN_TREE_CLONE
        #define OPEN_TREE_CLONE 1
        #endif
      
        #ifndef OPEN_TREE_CLOEXEC
        #define OPEN_TREE_CLOEXEC O_CLOEXEC
        #endif
      
        #ifndef __NR_open_tree
                #if defined __alpha__
                        #define __NR_open_tree 538
                #elif defined _MIPS_SIM
                        #if _MIPS_SIM == _MIPS_SIM_ABI32        /* o32 */
                                #define __NR_open_tree 4428
                        #endif
                        #if _MIPS_SIM == _MIPS_SIM_NABI32       /* n32 */
                                #define __NR_open_tree 6428
                        #endif
                        #if _MIPS_SIM == _MIPS_SIM_ABI64        /* n64 */
                                #define __NR_open_tree 5428
                        #endif
                #elif defined __ia64__
                        #define __NR_open_tree (428 + 1024)
                #else
                        #define __NR_open_tree 428
                #endif
        #endif
      
        /* move_mount() */
        #ifndef MOVE_MOUNT_F_EMPTY_PATH
        #define MOVE_MOUNT_F_EMPTY_PATH 0x00000004 /* Empty from path permitted */
        #endif
      
        #ifndef __NR_move_mount
                #if defined __alpha__
                        #define __NR_move_mount 539
                #elif defined _MIPS_SIM
                        #if _MIPS_SIM == _MIPS_SIM_ABI32        /* o32 */
                                #define __NR_move_mount 4429
                        #endif
                        #if _MIPS_SIM == _MIPS_SIM_NABI32       /* n32 */
                                #define __NR_move_mount 6429
                        #endif
                        #if _MIPS_SIM == _MIPS_SIM_ABI64        /* n64 */
                                #define __NR_move_mount 5429
                        #endif
                #elif defined __ia64__
                        #define __NR_move_mount (429 + 1024)
                #else
                        #define __NR_move_mount 429
                #endif
        #endif
      
        static inline int sys_open_tree(int dfd, const char *filename, unsigned int flags)
        {
                return syscall(__NR_open_tree, dfd, filename, flags);
        }
      
        static inline int sys_move_mount(int from_dfd, const char *from_pathname, int to_dfd,
                                         const char *to_pathname, unsigned int flags)
        {
                return syscall(__NR_move_mount, from_dfd, from_pathname, to_dfd, to_pathname, flags);
        }
      
        static bool is_shared_mountpoint(const char *path)
        {
                bool shared = false;
                FILE *f = NULL;
                char *line = NULL;
                int i;
                size_t len = 0;
      
                f = fopen("/proc/self/mountinfo", "re");
                if (!f)
                        return false;
      
                while (getline(&line, &len, f) > 0) {
                        char *slider1, *slider2;
      
                        for (slider1 = line, i = 0; slider1 && i < 4; i++)
                                slider1 = strchr(slider1 + 1, ' ');
      
                        if (!slider1)
                                continue;
      
                        slider2 = strchr(slider1 + 1, ' ');
                        if (!slider2)
                                continue;
      
                        *slider2 = '\0';
                        if (strcmp(slider1 + 1, path) == 0) {
                                /* This is the path. Is it shared? */
                                slider1 = strchr(slider2 + 1, ' ');
                                if (slider1 && strstr(slider1, "shared:")) {
                                        shared = true;
                                        break;
                                }
                        }
                }
                fclose(f);
                free(line);
      
                return shared;
        }
      
        static void usage(void)
        {
                const char *text = "mount-new [--recursive] <base-dir>\n";
                fprintf(stderr, "%s", text);
                _exit(EXIT_SUCCESS);
        }
      
        #define exit_usage(format, ...)                              \
                ({                                                   \
                        fprintf(stderr, format "\n", ##__VA_ARGS__); \
                        usage();                                     \
                })
      
        #define exit_log(format, ...)                                \
                ({                                                   \
                        fprintf(stderr, format "\n", ##__VA_ARGS__); \
                        exit(EXIT_FAILURE);                          \
                })
      
        static const struct option longopts[] = {
                {"help",        no_argument,            0,      'a'},
                { NULL,         no_argument,            0,       0 },
        };
      
        int main(int argc, char *argv[])
        {
                int exit_code = EXIT_SUCCESS, index = 0;
                int dfd, fd_tree, new_argc, ret;
                char *base_dir;
                char *const *new_argv;
                char target[PATH_MAX];
      
                while ((ret = getopt_long_only(argc, argv, "", longopts, &index)) != -1) {
                        switch (ret) {
                        case 'a':
                                /* fallthrough */
                        default:
                                usage();
                        }
                }
      
                new_argv = &argv[optind];
                new_argc = argc - optind;
                if (new_argc < 1)
                        exit_usage("Missing base directory");
                base_dir = new_argv[0];
      
                if (*base_dir != '/')
                        exit_log("Please specify an absolute path");
      
                /* Ensure that target is a shared mountpoint. */
                if (!is_shared_mountpoint(base_dir))
                        exit_log("Please ensure that \"%s\" is a shared mountpoint", base_dir);
      
                dfd = open(base_dir, O_RDONLY | O_DIRECTORY | O_CLOEXEC);
                if (dfd < 0)
                        exit_log("%m - Failed to open base directory \"%s\"", base_dir);
      
                ret = mkdirat(dfd, "detached-move-mount", 0755);
                if (ret < 0)
                        exit_log("%m - Failed to create required temporary directories");
      
                ret = snprintf(target, sizeof(target), "%s/detached-move-mount", base_dir);
                if (ret < 0 || (size_t)ret >= sizeof(target))
                        exit_log("%m - Failed to assemble target path");
      
                /*
                 * Having a mount table with 10000 mounts is already quite excessive
                 * and should account even for weird test systems.
                 */
                for (size_t i = 0; i < 10000; i++) {
                        fd_tree = sys_open_tree(dfd, "detached-move-mount",
                                                OPEN_TREE_CLONE |
                                                OPEN_TREE_CLOEXEC |
                                                AT_EMPTY_PATH);
                        if (fd_tree < 0) {
                                fprintf(stderr, "%m - Failed to open %d(detached-move-mount)\n", dfd);
                                exit_code = EXIT_FAILURE;
                                break;
                        }
      
                        ret = sys_move_mount(fd_tree, "", dfd, "detached-move-mount", MOVE_MOUNT_F_EMPTY_PATH);
                        if (ret < 0) {
                                if (errno == ENOSPC)
                                        fprintf(stderr, "%m - Buggy mount counting\n");
                                else
                                        fprintf(stderr, "%m - Failed to attach mount to %d(detached-move-mount)\n", dfd);
                                exit_code = EXIT_FAILURE;
                                break;
                        }
                        close(fd_tree);
      
                        ret = umount2(target, MNT_DETACH);
                        if (ret < 0) {
                                fprintf(stderr, "%m - Failed to unmount %s\n", target);
                                exit_code = EXIT_FAILURE;
                                break;
                        }
                }
      
                (void)unlinkat(dfd, "detached-move-mount", AT_REMOVEDIR);
                close(dfd);
      
                exit(exit_code);
        }
      
      and wait for the kernel to refuse any new mounts by returning ENOSPC.
      How many iterations are needed depends on the number of mounts in your
      system. Assuming you have something like 50 mounts on a standard system
      it should be almost instantaneous.
      
      The root cause of this is that detached mounts aren't handled correctly
      when source and target mount are identical and reside on a shared mount.
      This causes a broken mount tree in which the detached source itself is
      propagated, something that propagation prevents for regular bind-mounts
      and new mounts. This ultimately leads to a miscalculation of the number
      of mounts in the mount namespace.
      
      Detached mounts created via
      open_tree(fd, path, OPEN_TREE_CLONE)
      are essentially like an unattached new mount, or an unattached
      bind-mount. They can then later on be attached to the filesystem via
      move_mount() which calls into attach_recursive_mount(). Part of
      attaching it to the filesystem is making sure that mounts get correctly
      propagated in case the destination mountpoint is MS_SHARED, i.e. is a
      shared mountpoint. This is done by calling into propagate_mnt() which
      walks the list of peers calling propagate_one() on each mount in this
      list making sure it receives the propagation event.
      The propagate_one() function thereby skips both new mounts and bind
      mounts to not propagate them "into themselves". Both are identified by
      checking whether the mount is already attached to any mount namespace in
      mnt->mnt_ns. This is what the IS_MNT_NEW() helper is responsible for.
      
      However, detached mounts have an anonymous mount namespace attached to
      them stashed in mnt->mnt_ns which means that IS_MNT_NEW() doesn't
      realize they need to be skipped causing the mount to propagate "into
      itself" breaking the mount table and causing a disconnect between the
      number of mounts recorded as being beneath or reachable from the target
      mountpoint and the number of mounts actually recorded/counted in
      ns->mounts ultimately causing an overflow which in turn prevents any new
      mounts via the ENOSPC issue.
      
      So teach propagation to handle detached mounts by making it aware of
      them. I've been tracking this issue down for the last couple of days and
      then verifying that the fix is correct by
      unmounting everything in my current mount table leaving only /proc and
      /sys mounted and running the reproducer above overnight verifying the
      number of mounts counted in ns->mounts. With this fix the counts are
      correct and the ENOSPC issue can't be reproduced.
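
      The difference between the two checks described above can be modelled
      in user space. This is only a sketch under stated assumptions: the
      struct and function names ending in _model, _old and _fixed are
      hypothetical stand-ins, not the kernel's definitions, and treating a
      sequence number of 0 as the marker of an anonymous namespace is an
      assumption of the sketch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures; only the fields used here. */
struct mnt_namespace_model {
	unsigned long long seq;	/* assumption: 0 marks an anonymous namespace */
};

struct mount_model {
	struct mnt_namespace_model *mnt_ns;	/* NULL for a mount with no namespace */
};

/* Assumption of this sketch: an anonymous namespace has seq == 0. */
static bool is_anon_ns_model(const struct mnt_namespace_model *ns)
{
	return ns->seq == 0;
}

/* Check as described for the unfixed case: only a missing namespace means "new". */
static bool is_mnt_new_old(const struct mount_model *m)
{
	return m->mnt_ns == NULL;
}

/* Check that also treats detached mounts (anonymous namespace) as "new". */
static bool is_mnt_new_fixed(const struct mount_model *m)
{
	return m->mnt_ns == NULL || is_anon_ns_model(m->mnt_ns);
}
```

      With the first check a detached mount is not skipped during
      propagation; with the second it is skipped like any other new mount.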
      
      This change will only have an effect on mounts created with the new
      mount API since detached mounts cannot be created with the old mount API
      so regressions are extremely unlikely.
      
      Link: https://lore.kernel.org/r/20210306101010.243666-1-christian.brauner@ubuntu.com
      Fixes: 2db154b3 ("vfs: syscall: Add move_mount(2) to move mounts around")
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Cc: <stable@vger.kernel.org>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Signed-off-by: Chen Jun <chenjun102@huawei.com>
      Acked-by: Weilong Chen <chenweilong@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      3ef215f4
    • W
      arm64/mpam: resctrl: Refresh cpu mask for handling cpuhp · 867ae5b2
      Wang ShaoBo committed
      hulk inclusion
      category: bugfix
      bugzilla: 48265
      CVE: NA
      
      --------------------------------
      
      This fixes two problems:
      
      1) When a cpu goes offline, we should clear it from the cpu mask of all
         associated resctrl groups, not only the default group.

      2) When a cpu comes online, we should set it in the cpu mask of the
         default group and, if cdp is on, reset the default group's cpus to
         the default state. This fills the code and data fields of the mpam
         sysregs with appropriate values.
      
      Fixes: 2e2c511ff49d ("arm64/mpam: resctrl: Handle cpuhp and resctrl_dom allocation")
      Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
      Reviewed-by: Jian Cheng <cj.chengjian@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Cheng Jian <cj.chengjian@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      867ae5b2
    • W
      arm64/mpam: resctrl: Use resctrl_group_init_alloc() for default group · 96a27f9d
      Wang ShaoBo committed
      hulk inclusion
      category: bugfix
      bugzilla: 48265
      CVE: NA
      
      --------------------------------
      
      When configuring different control types for a single resource is
      supported, a stale value from before the remount is written back to the
      default group after remounting.
      
      e.g.
          > mount -t resctrl resctrl /sys/fs/resctrl/ -o mbMax,mbMin && cd resctrl/
          > echo 'MBMIN:0=2;1=2;2=2;3=2' > schemata
          > cat schemata
            L3:0=7fff;1=7fff;2=7fff;3=7fff
            MBMAX:0=100;1=100;2=100;3=100
            MBMIN:0=2;1=2;2=2;3=2
          > cd .. && umount /sys/fs/resctrl/
          > mount -t resctrl resctrl /sys/fs/resctrl/ -o mbMax,mbMin && cd resctrl/ && cat schemata
            L3:0=7fff;1=7fff;2=7fff;3=7fff
            MBMAX:0=100;1=100;2=100;3=100
            MBMIN:0=0;1=0;2=0;3=0
          > echo 'MBMAX:0=10;1=10;2=10;3=10' > schemata
          > cat schemata
            L3:0=7fff;1=7fff;2=7fff;3=7fff
            MBMAX:0=10;1=10;2=10;3=10
            MBMIN:0=2;1=2;2=2;3=2  # wrong: stale value from before the remount
      
      When writing the schemata sysfile, the call path looks like this:
      
      resctrl_group_schemata_write()
        -=> resctrl_update_groups_config()
               -=> resctrl_group_update_domains()
                     -=> resctrl_group_update_domain_ctrls()
                      { .../*refresh new_ctrl array of supported conf type once for each resource*/ }
      
      We should refresh new_ctrl field in struct resctrl_staged_config by
      resctrl_group_init_alloc() before calling resctrl_group_update_domain_ctrls().
      
      Fixes: 6b2471f089be ("arm64/mpam: resctrl: Support priority and hardlimit(Memory bandwidth) configuration")
      Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
      Reviewed-by: Cheng Jian <cj.chengjian@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      96a27f9d
    • W
      arm64/mpam: resctrl: Add proper error handling to resctrl_mount() · 10e4e43b
      Wang ShaoBo committed
      hulk inclusion
      category: bugfix
      bugzilla: 48265
      CVE: NA
      
      --------------------------------
      
      This function is called only when we mount resctrl sysfs. For proper
      error handling we need to destroy the schemata list if any of the steps
      following its creation fails.
      
      Fixes: 7e9b5caeefff ("arm64/mpam: resctrl: Add helpers for init and destroy schemata list")
      Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
      Reviewed-by: Cheng Jian <cj.chengjian@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      10e4e43b
    • W
      arm64/mpam: Use fs_context to parse mount options · 100e2317
      Wang ShaoBo committed
      hulk inclusion
      category: bugfix
      bugzilla: 48265
      CVE: NA
      
      --------------------------------
      
      Use fs_context to parse mount options. The old parsing path through
      parse_rdtgroupfs_options() becomes obsolete and will be removed.
      Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
      Reviewed-by: Cheng Jian <cj.chengjian@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      100e2317
    • W
      arm64/mpam: Set per-cpu's closid to none zero for cdp · cae569b3
      Wang ShaoBo committed
      hulk inclusion
      category: feature
      feature: ARM MPAM support
      bugzilla: 48265
      CVE: NA
      
      --------------------------------
      
      Sometimes monitoring will have such anomalies:
      
      e.g.
          > cd /sys/fs/resctrl/ && grep . mon_data/*
            mon_data/mon_L3CODE_00:14336
            mon_data/mon_L3CODE_01:344064
            mon_data/mon_L3CODE_02:2048
            mon_data/mon_L3CODE_03:27648
            mon_data/mon_L3DATA_00:0  # L3DATA's monitoring data is always 0
            mon_data/mon_L3DATA_01:0
            mon_data/mon_L3DATA_02:0
            mon_data/mon_L3DATA_03:0
            mon_data/mon_MB_00:392
            mon_data/mon_MB_01:552
            mon_data/mon_MB_02:160
            mon_data/mon_MB_03:0
      
      If cdp is on, tasks in the resctrl default group with closid=0 and rmid=0
      don't know how to fill the proper partid_i/pmg_i and partid_d/pmg_d into
      the MPAMx_ELx sysregs in mpam_sched_in(), called by __switch_to(). This is
      because the current cpu's default closid and rmid are also equal to 0, so
      the operation that modifies the configuration is skipped.

      Set the per-cpu default closid to a non-zero value and call
      update_closid_rmid() when mounting resctrl sysfs to update each cpu's
      MPAMx_ELx sysregs with the proper partid and pmg; this looks like a
      practical method.
      Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
      Reviewed-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
      Reviewed-by: Cheng Jian <cj.chengjian@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      cae569b3
    • W
      arm64/mpam: Split header files into suitable location · 0c564931
      Wang ShaoBo committed
      hulk inclusion
      category: feature
      feature: ARM MPAM support
      bugzilla: 48265
      CVE: NA
      
      --------------------------------
      
      So far the declarations shared by resctrlfs.c and the mpam core module
      files under the kernel/mpam directory are scattered across mpam.h and
      resctrl.h, organized like this:
      
      -- asm/
         +-- resctrl.h        +
         +-- mpam.h           |    +
         +-- mpam_resource.h  |    |    +
                              |    |    |
      -- fs/                  |    |    +-> mpam/
         +-- resctrlfs.c <----+----+------> +-- mpam_resctrl.c ...
      
      We move the declarations shared by resctrlfs.c and mpam/ to resctrl.h
      and split the other declarations into mpam_internal.h, and also move
      mpam_resource.h to the mpam/ directory. The result is organized like
      this:
      
      -- asm/
         +-- mpam.h           +----> export to other modules(e.g. SMMU master io)
         +-- resctrl.h        +
                              |
      -- mpam/                |
         +-- mpam_internal.h  |    +
         +-- mpam_resource.h  |    |    +
                              |    |    |
      -- fs/                  |    +----+-> mpam/
         +-- resctrlfs.c <----+-----------> +-- mpam_resctrl.c ...
      
      This gives us a clearer framework for MPAM usage.
      Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
      Reviewed-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
      Reviewed-by: Cheng Jian <cj.chengjian@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      0c564931
    • W
      arm64/mpam: resctrl: Export resource's properties to info directory · 9d39dad1
      Wang ShaoBo committed
      hulk inclusion
      category: feature
      feature: ARM MPAM support
      bugzilla: 48265
      CVE: NA
      
      --------------------------------
      
      Some resource properties such as closid and rmid are exported like
      Intel-RDT in our resctrl design, but there are two main differences.
      One is MB (Memory Bandwidth): for us MB is also divided into two
      directories, MB and MB_MON, to show the respective properties of the
      control and monitor types, the same as for LxCache. The other is that we
      adopt a features sysfile under each resource's directory, which indicates
      the properties of the control types of the corresponding resource, for
      instance MB hardlimit.
      
      e.g.
          > mount -t resctrl resctrl /sys/fs/resctrl -o mbHdl
          > cd /sys/fs/resctrl/ && cat info/MB/features
            mbHdl@1  # indicates that the upper bound of the MBHDL setting is 1
          > cat schemata
            L3:0=7fff;1=7fff;2=7fff;3=7fff
            MB:0=100;1=100;2=100;3=100
            MBHDL:0=1;1=1;2=1;3=1
      Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
      Reviewed-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
      Reviewed-by: Cheng Jian <cj.chengjian@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      9d39dad1
    • W
      arm64/mpam: Add wait queue for monitor alloc and free · 7d3cd1a2
      Wang ShaoBo committed
      hulk inclusion
      category: feature
      feature: ARM MPAM support
      bugzilla: 48265
      CVE: NA
      
      --------------------------------
      
      For MPAM, a rmid can do monitoring work only with a monitor resource
      allocated, so we adopt a mechanism for dynamic allocation and recycling
      of monitor resources. This differs from Intel-RDT, which creates a
      kworker thread that dynamically monitors cache usage and checks whether
      it is below an adjustable threshold before freeing a rmid. We have found
      that this method affects cpu utilization in many cases, sometimes to an
      unacceptable degree.

      Our method is simple. As the number of monitors varies between
      resources, we provide two lists: one stores rmids that have an exclusive
      monitor resource and the other stores rmids that share a monitor
      resource; the shared monitor id is always 0. It works like this: if a
      new rmid applies for a resource monitor that is in use, we put the rmid
      at the tail of the latter list and temporarily give it the default
      monitor id 0 until someone releases an available monitor resource. If
      the new rmid gets the monitor resources it needs for all resources, it
      is put into the exclusive list.

      This implements LRU allocation of monitor resources and gives users
      partial control over allocation and release. If the number of resctrl
      groups can be bounded, or the user doesn't need to monitor too many
      groups simultaneously, this is a more appropriate approach for user
      deployment. It also avoids the risk of inaccurate monitoring when too
      many groups are monitored at the same time.
      Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
      Reviewed-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
      Reviewed-by: Cheng Jian <cj.chengjian@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      7d3cd1a2
    • W
      arm64/mpam: Remap reqpartid,pmg to rmid and intpartid to closid · 0b16164d
      Wang ShaoBo committed
      hulk inclusion
      category: feature
      feature: ARM MPAM support
      bugzilla: 48265
      CVE: NA
      
      --------------------------------
      
      So far we use sd_closid, consisting of {reqpartid, intpartid}, to label
      each resctrl group, including ctrl groups and mon groups. This handles
      the case where the number of reqpartid exceeds the number of intpartid,
      which always happens when intpartid narrowing is supported; otherwise
      the two are equal in number. So we use the excess reqpartid (1) to
      indicate how configurations are synchronized from the configuration
      indexed by intpartid and (2) to take part in monitoring.

      But the reqpartid in (2) together with pmg is still scattered, and so far
      we have no proper way to explain how to use the two together. To ensure
      these resources can be fully utilized, and borrowing the idea from
      Intel-RDT's design, which uses rmid for monitoring, a rmid remap matrix
      is provided for transforming partid and pmg to rmid. This matrix is
      organized like this:
      
                       [bitmap entry indexed by partid]
                             [col pos is partid]
      
                           [0]  [1]  [2]  [3]  [4]  [5]
         occ->bitmap[:0]    1    0    0    1    1    1
              bitmap[:1]    1    0    0    1    1    1
              bitmap[:2]    1    1    1    1    1    1
              bitmap[:3]    1    1    1    1    1    1
      [row pos-1 is pmg]
      
      Calculate rmid = partid + NR_partid * pmg
      
      occ represents whether this bitmap has been used by a partid, because a
      given partid should not be paired with a duplicated pmg for monitoring.
      This design saves a lot of space and also decreases the time complexity
      of allocating and freeing a rmid from O(NR_partid) * O(NR_pmg) to
      O(NR_partid) + O(log(NR_pmg)) compared with using a list.

      In this way we get a contiguous rmid set with upper bound (NR_pmg *
      NR_partid - 1); given a rmid, we can tell whether it is valid by
      checking whether it falls within this range.
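
      The mapping rmid = partid + NR_partid * pmg and the range check can be
      sketched as follows. This is a minimal illustration only: struct
      rmid_layout and the rmid_encode(), rmid_decode() and rmid_valid() names
      are hypothetical and are not the helpers added by this commit.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical description of the matrix dimensions. */
struct rmid_layout {
	unsigned int nr_partid;	/* number of columns (partid) */
	unsigned int nr_pmg;	/* number of rows (pmg) */
};

/* rmid = partid + NR_partid * pmg */
static unsigned int rmid_encode(const struct rmid_layout *l,
				unsigned int partid, unsigned int pmg)
{
	assert(partid < l->nr_partid && pmg < l->nr_pmg);
	return partid + l->nr_partid * pmg;
}

/* A rmid is valid if it lies in [0, NR_pmg * NR_partid - 1]. */
static bool rmid_valid(const struct rmid_layout *l, unsigned int rmid)
{
	return rmid < l->nr_partid * l->nr_pmg;
}

/* Inverse of rmid_encode(): recover the column (partid) and row (pmg). */
static void rmid_decode(const struct rmid_layout *l, unsigned int rmid,
			unsigned int *partid, unsigned int *pmg)
{
	assert(rmid_valid(l, rmid));
	*partid = rmid % l->nr_partid;
	*pmg = rmid / l->nr_partid;
}
```

      For the 6 x 4 matrix drawn above, partid 3 with pmg 2 maps to rmid
      3 + 6 * 2 = 15, and the largest valid rmid is 6 * 4 - 1 = 23.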
      
      rmid implies the reqpartid info, so we can use the relevant helpers to
      get the reqpartid for sd_closid@reqpartid and accomplish the
      configuration sync. This also makes closid simpler, as it can consist
      of the intpartid index only, and each resctrl group owns consecutive
      rmids.

      This also has further consequences. For instance, MPAM also supports
      SMMU io using partid and pmg, so we can use a single helper
      mpam_rmid_to_partid_pmg() in the SMMU driver to complete this remap
      process for rmid input from user space.
      Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
      Reviewed-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
      Reviewed-by: Cheng Jian <cj.chengjian@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      0b16164d
    • W
      arm64/mpam: Using software-defined id for rdtgroup instead of 32-bit integer · 26a57873
      Wang ShaoBo committed
      hulk inclusion
      category: feature
      feature: ARM MPAM support
      bugzilla: 48265
      CVE: NA
      
      --------------------------------
      
      Currently we use partid and pmg (Performance Monitoring Group) to filter
      performance events so that the performance of a particular partid and
      pmg can be monitored. But pmg is of little use except as a filter
      together with partid, and since the number of pmg varies between MPAM
      resources, it is difficult to allocate pmg when creating a new mon group
      in resctrl sysfs, and a lot of it is wasted.

      So we use a software-defined sd_closid instead of a 32-bit integer to
      label each rdtgroup (including mon groups). sd_closid includes intpartid
      for allocation and reqpartid for synchronizing configuration and
      monitoring. Given that MPAM has the narrowing feature, it also includes
      the concepts we named hw_reqpartid and hw_intpartid. When narrowing is
      not supported, the number of intpartid and reqpartid equals that of
      hw_reqpartid; otherwise the number of intpartid and reqpartid is related
      to the minimum numbers of hw_reqpartid and hw_intpartid supported across
      the different resources. In this way we not only solve the problem above
      but can also use spare reqpartid for creating new mon groups.
      Additionally, pmg is still preferred when it is available.
      
      e.g.
           hw_intpartid:  0  1  2  3  4  5  6  7
           hw_reqpartid:  0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15
      
                          |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |
                                                  |  |  |  |  |  |  |  |
      resctrl ctrl group: p0 p1 p2 p3 p4 p5 p6 p7 |  |  |  |  |  |  |  |
                                |     |
      resctrl mon group:        |     +-----------------------m4 m5 m6 m7
                                +-----------------m0 m1 m2 m3
                                In this case, use extra reqpartid to create
                                m0, m1, m2, m3 mon group for p2 ctrl group,
                                and m4, m5, m6, m7 for p4.
      
      As reqpartid serves both allocation and monitoring filtering, under this
      design we must synchronize the configuration of a ctrl group with its
      child mon groups: each mon group's configuration, indexed by a reqpartid
      called the slave, closely follows that of its parent ctrl group, called
      the master, whenever the configuration changes. In addition, we let
      task_struct keep both intpartid and reqpartid, so we can tell through
      intpartid whether tasks belong to the same ctrl group, and change the
      cpu's partid by writing MPAMx_ELx with reqpartid on task switch.
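
      The roles of the two ids can be sketched as follows. The struct layout
      and the function names are hypothetical illustrations of the
      description above; the real sd_closid definition is not shown in this
      log.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout of the software-defined id described above. */
struct sd_closid_sketch {
	uint32_t intpartid;	/* identifies the ctrl group (allocation) */
	uint32_t reqpartid;	/* used for config sync and monitoring */
};

/* Two tasks are in the same ctrl group if their intpartid matches. */
static bool same_ctrl_group(const struct sd_closid_sketch *a,
			    const struct sd_closid_sketch *b)
{
	return a->intpartid == b->intpartid;
}

/* On task switch, the partid to program is taken from reqpartid. */
static uint32_t partid_for_switch(const struct sd_closid_sketch *id)
{
	return id->reqpartid;
}
```

      For example, a mon group created under ctrl group p2 with a spare
      reqpartid of 8 would carry {intpartid = 2, reqpartid = 8}: it compares
      equal to p2 by intpartid while a different partid is programmed for
      its tasks.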
      Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
      Reviewed-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
      Reviewed-by: Cheng Jian <cj.chengjian@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      26a57873
    • W
      arm64/mpam: resctrl: Remove unnecessary CONFIG_ARM64 · 17c6439e
      Wang ShaoBo committed
      hulk inclusion
      category: feature
      feature: ARM MPAM support
      bugzilla: 48265
      CVE: NA
      
      --------------------------------
      
      Code in resctrlfs.c is not shared with x86 RDT currently, but may be
      updated to support both in the future, so remove unrelated CONFIG for
      now to make code clearer.
      Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
      Reviewed-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
      Reviewed-by: Cheng Jian <cj.chengjian@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      17c6439e
    • W
      arm64/mpam: resctrl: Remove ctrlmon sysfile · 037915d2
      Wang ShaoBo committed
      hulk inclusion
      category: feature
      feature: ARM MPAM support
      bugzilla: 48265
      CVE: NA
      
      --------------------------------
      
      We redesign the monitoring process for the user, as the following
      illustrates:

        e.g. before rewriting:
           mount /sys/fs/resctrl && cd /sys/fs/resctrl
           mkdir p1 && cd p1
           echo 1 > ctrlmon  # this allocates a monitor resource for this group
           ...               # associate task/cpu with this group
           grep . mon_data/* # get monitor data from mon_data directory
        e.g. after rewriting:
           mount /sys/fs/resctrl && cd /sys/fs/resctrl
           mkdir p1 && cd p1 # automatically allocates a monitor resource
           ...               # associate task/cpu with this group
           grep . mon_data/* # directly get monitor data
      
      ctrlmon was used to manually allocate a monitor resource for a
      specified group (labeled by partid and pmg). We delete ctrlmon because
      this manual step is redundant.

      With ctrlmon, the user had to know which group held a monitor resource,
      and that resource could be reallocated to a new group only after it had
      been released. This is redundant and unnecessary: a monitor resource is
      used only while monitoring is actually in progress, so an idle monitor
      resource can be allocated to multiple groups and take effect when
      monitoring happens.

      Some restrictions should be noted, however. A monitor resource used for
      cache-occupancy monitoring may need to be held for a long time, until
      it is no longer needed or the occupancy falls below a threshold (as the
      Intel RDT limbo list does). Otherwise the monitoring result may be
      unexpectedly small when a monitor resource is forcibly switched from
      one group to another.

      We provide a simple LRU monitor resource allocation mechanism, but so
      far it just assigns a monitor according to the order in which groups
      were created. This is incomplete and needs subsequent improvement.
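
      The creation-order assignment described above could look roughly like
      the following. This is only a sketch, not the patch's code: the pool
      size NUM_MONITORS, the limit MAX_GROUPS, the helper names, and the
      wrap-around that lets several groups share one monitor are all
      assumptions made for illustration.

```c
#include <assert.h>

#define NUM_MONITORS 4U   /* hypothetical size of the monitor pool */
#define MAX_GROUPS   16U  /* hypothetical limit on the number of groups */

static unsigned int group_monitor[MAX_GROUPS]; /* monitor used by each group */
static unsigned int groups_created;

/* Called on mkdir: returns the new group's index, or -1 if the table is full. */
int group_create(void)
{
    if (groups_created >= MAX_GROUPS)
        return -1;
    /* Creation order picks the monitor; wrapping around shares monitors. */
    group_monitor[groups_created] = groups_created % NUM_MONITORS;
    return (int)groups_created++;
}

/* Returns the monitor index assigned to an existing group. */
unsigned int group_get_monitor(int group)
{
    assert(group >= 0 && (unsigned int)group < groups_created);
    return group_monitor[group];
}
```

      With four monitors, the fifth group created ends up sharing monitor 0
      with the first group, which is where the cache-occupancy caveat above
      becomes relevant.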
      Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
      Reviewed-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
      Reviewed-by: Cheng Jian <cj.chengjian@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      037915d2
    • W
      arm64/mpam: resctrl: Update closid alloc and free process with bitmap · bf24b8a8
      Wang ShaoBo committed
      hulk inclusion
      category: feature
      feature: ARM MPAM support
      bugzilla: 48265
      CVE: NA
      
      --------------------------------
      
      Replace the u32 bitmask with a bitmap for closid allocation, because
      the number of closids may be too large to fit in 32 bits.

      This also supports CDP: when CDP is enabled, two closids are allocated
      at a time, closid for code (LxCODE) and closid+1 for data (LxDATA), and
      both are released together on free.
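
      As a rough illustration of the idea (a userspace sketch, not the
      kernel code in this patch), a bitmap-based allocator that hands out
      closid and closid+1 together under CDP might look like this. The
      limit MAX_CLOSID and all function names are hypothetical; the kernel
      itself would use its own bitmap helpers.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <string.h>

#define MAX_CLOSID    128U  /* hypothetical limit, larger than 32 on purpose */
#define BITS_PER_WORD ((unsigned int)(sizeof(unsigned long) * CHAR_BIT))
#define BITMAP_WORDS  ((MAX_CLOSID + BITS_PER_WORD - 1U) / BITS_PER_WORD)

static unsigned long closid_used[BITMAP_WORDS]; /* bit set = closid in use */

static bool closid_test(unsigned int id)
{
    return (closid_used[id / BITS_PER_WORD] >> (id % BITS_PER_WORD)) & 1UL;
}

static void closid_mark(unsigned int id, bool used)
{
    unsigned long mask = 1UL << (id % BITS_PER_WORD);

    if (used)
        closid_used[id / BITS_PER_WORD] |= mask;
    else
        closid_used[id / BITS_PER_WORD] &= ~mask;
}

/* Mark every closid as free. */
void closid_init(void)
{
    memset(closid_used, 0, sizeof(closid_used));
}

/*
 * Allocate one closid, or an aligned pair (closid for code, closid+1 for
 * data) when CDP is enabled. Returns the first closid, or -1 if none is free.
 */
int closid_alloc(bool cdp_enabled)
{
    unsigned int step = cdp_enabled ? 2U : 1U;

    for (unsigned int id = 0; id + step <= MAX_CLOSID; id += step) {
        if (closid_test(id) || (cdp_enabled && closid_test(id + 1U)))
            continue;
        closid_mark(id, true);
        if (cdp_enabled)
            closid_mark(id + 1U, true);
        return (int)id;
    }
    return -1;
}

/* Release a closid, and its data partner closid+1 when CDP is enabled. */
void closid_free(int closid, bool cdp_enabled)
{
    unsigned int last = (unsigned int)closid + (cdp_enabled ? 1U : 0U);

    assert(closid >= 0 && last < MAX_CLOSID);
    closid_mark((unsigned int)closid, false);
    if (cdp_enabled)
        closid_mark(last, false);
}
```

      Stepping by two under CDP keeps every code/data pair aligned, so a
      freed pair can always be reused as a pair.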
      Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
      Reviewed-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
      Reviewed-by: Cheng Jian <cj.chengjian@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      bf24b8a8
    • W
      arm64/mpam: resctrl: Use resctrl_group_init_alloc() to init schema list · 62530c01
      Wang ShaoBo committed
      hulk inclusion
      category: feature
      feature: ARM MPAM support
      bugzilla: 48265
      CVE: NA
      
      --------------------------------
      
      Add a schema list for each rdt domain. We use this list to store
      changes from schemata rows, replacing the ctrlval array that previously
      lived in the resctrl resource structure. When resctrl sysfs is mounted,
      all resources' configurations are reset to their defaults by
      resctrl_group_update_domains().

      Currently each row in the schemata sysfile occupies one list node; this
      may be extended later to cover more control types.
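
      To make the "one list node per schemata row" idea concrete, here is a
      minimal userspace sketch. The structure layout, field names, and
      helper names are assumptions for illustration only and do not mirror
      the structures added by this patch (the kernel would use list_head).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One node per schemata row; the fields are hypothetical. */
struct schema_node {
    char name[16];             /* row name, e.g. "L3" or "MB" */
    unsigned int new_ctrl;     /* value staged from the schemata row */
    struct schema_node *next;
};

struct schema_list {
    struct schema_node *head;
    unsigned int count;
};

/* Set up an empty list, e.g. when resctrl is mounted. */
void schema_list_init(struct schema_list *l)
{
    l->head = NULL;
    l->count = 0;
}

/* Add a row with its default value; returns 0 on success, -1 on failure. */
int schema_list_add(struct schema_list *l, const char *name,
                    unsigned int default_ctrl)
{
    struct schema_node *n = calloc(1, sizeof(*n));

    if (!n)
        return -1;
    /* calloc zeroed the buffer, so the copy stays NUL-terminated. */
    strncpy(n->name, name, sizeof(n->name) - 1);
    n->new_ctrl = default_ctrl;
    n->next = l->head;
    l->head = n;
    l->count++;
    return 0;
}

/* Look up a row by name; returns NULL if it is not present. */
struct schema_node *schema_list_find(struct schema_list *l, const char *name)
{
    for (struct schema_node *n = l->head; n; n = n->next)
        if (strcmp(n->name, name) == 0)
            return n;
    return NULL;
}

/* Free every node, e.g. when resctrl is unmounted. */
void schema_list_destroy(struct schema_list *l)
{
    while (l->head) {
        struct schema_node *n = l->head;

        l->head = n->next;
        free(n);
    }
    l->count = 0;
}
```

      A write to a schemata row would then update only the matching node's
      staged value, leaving the other rows untouched.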
      Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
      Reviewed-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
      Reviewed-by: Cheng Jian <cj.chengjian@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      62530c01
    • W
      arm64/mpam: resctrl: Add helpers for init and destroy schemata list · 7f72b99b
      Wang ShaoBo committed
      hulk inclusion
      category: feature
      feature: ARM MPAM support
      bugzilla: 48265
      CVE: NA
      
      --------------------------------
      
      Initialize the schemata list when resctrl sysfs is mounted and destroy
      it on umount. Each list node contains the value updated by one schemata
      row (in resctrl sysfs).

      Part of the code is borrowed from 250656171d95 ("x86/resctrl: Stop
      using Lx CODE/DATA resources"), whose commit message explains:
      
        Now that CDP enable/disable is global, and the closid offset correction
        is based on the configuration being applied, we are using different
        hw_closid slots in the ctrl array for CODE/DATA schema. This lets
        us merge them using the same Lx resource twice for CDP's CODE/DATA
        schema. This keeps the illusion of separate caches in the resctrl code.
      
        When CDP is enabled for a cache, create two schema generating the names
        and setting the configuration type.
      
        We can now remove the initialisation of the illusionary hw_resources:
        'cdp_capable' just requires setting a flag, resctrl knows what to do
        from there.
      
      Link: http://www.linux-arm.org/git?p=linux-jm.git;a=commit;h=250656171d95dea079cc661098a0984e7237aa25
      Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
      Reviewed-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
      Reviewed-by: Cheng Jian <cj.chengjian@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      7f72b99b
    • W
      arm64/mpam: resctrl: Rebuild configuration and monitoring pipeline · aeffdf37
      Wang ShaoBo committed
      hulk inclusion
      category: feature
      feature: ARM MPAM support
      bugzilla: 48265
      CVE: NA
      
      --------------------------------
      
      We now bridge the resctrl intermediate processing module and the mpam
      devices module, so a large block of code involved in the configuration
      and monitoring process needs to be modified.

      We change the previous method of writing MSCs' registers directly.
      These jobs are handed over to helpers offered by the mpam devices
      module instead: when a configuration or monitoring action happens, each
      domain's ctrlval array changed by resctrl sysfs input is updated into
      the mpam config structure living in each mpam component structure, and
      the relevant helpers provided by the mpam devices module then complete
      the remaining work.

      Comparison:
                         configuration or monitoring
      
                           old               new
                            +                 +
                            |                 |
                            |       +---------+------------+
                            |       | intermediate helpers |
                            |       +---------+------------+
                            |                 |
                            |                 |
                         +--+-----------------+----+
                         | [reading  writing MMIO] |
                         +-------------------------+
      
      So far we have nearly finished connecting resctrl sysfs with the mpam
      devices module, but the work is still incomplete and some follow-up
      changes are needed.

      This also moves relevant structures such as struct mongroup to a more
      suitable place.
      Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
      Reviewed-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
      Reviewed-by: Cheng Jian <cj.chengjian@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      aeffdf37
    • W
      arm64/mpam: Fix unreset resources when mkdir ctrl group or umount resctrl · 88520d23
      Wang ShaoBo committed
      hulk inclusion
      category: bugfix
      bugzilla: 48265
      CVE: NA
      
      --------------------------------
      
      There are two problems related to schemata:
      
      1) After a group is removed (rmdir) and a new group is then created
         (mkdir) under the resctrl root directory, the new group still
         inherits the schemata configuration of the old one.
         e.g.
             > mount -t resctrl resctrl /sys/fs/resctrl
             > cd /sys/fs/resctrl
             > mkdir p1 && cd p1
             > echo 'L3:0=7f' > schemata
             > cd .. && rmdir p1 && mkdir p1 && cd p1
             > cat schemata
               L3:0=7f;1=7fff;2=7fff;3=7fff
               MB:0=100;1=100;2=100;3=100
      
      2) The stale configuration also persists after /sys/fs/resctrl is
         unmounted and mounted again.
         e.g.
             > mount -t resctrl resctrl /sys/fs/resctrl
             > cd /sys/fs/resctrl
             > echo 'L3:0=7f' > schemata
             > umount /sys/fs/resctrl
             > mount -t resctrl resctrl /sys/fs/resctrl
             > cat schemata
               L3:0=7f;1=7fff;2=7fff;3=7fff
               MB:0=100;1=100;2=100;3=100
      
      First, we make each resctrl resource obtain its corresponding default
      configuration. NOTE: we use zero as the L3 default value instead of the
      max cpbm bits, as a zero configuration is equivalent to the maximum
      configuration for L3 MSCs. For MB, we use the max-percentage mask of
      the max bandwidth to generate the maximum configuration.

      Then we reset the resources' configuration settings to their default
      values and restore the MSCs to their default state when mkdir or umount
      happens.
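
      The reset described above can be pictured with the following sketch.
      It only models the stored per-domain values, not the MSC programming,
      and every name in it (struct res, res_set_default(), res_reset(),
      NUM_DOMAINS) is hypothetical rather than taken from the patch.

```c
#include <assert.h>

#define NUM_DOMAINS 4U  /* four domains, as in the example output above */

enum res_kind { RES_L3, RES_MB };

struct res {
    enum res_kind kind;
    unsigned int default_ctrl;       /* default picked per resource kind */
    unsigned int ctrl[NUM_DOMAINS];  /* current per-domain configuration */
};

/*
 * Pick the default: zero for L3 (treated as the maximum configuration by
 * L3 MSCs, per the note above), the maximum bandwidth percentage for MB.
 */
void res_set_default(struct res *r, unsigned int max_bw_percent)
{
    r->default_ctrl = (r->kind == RES_L3) ? 0U : max_bw_percent;
}

/* Reset every domain to the default, as done on mkdir or umount. */
void res_reset(struct res *r)
{
    for (unsigned int d = 0; d < NUM_DOMAINS; d++)
        r->ctrl[d] = r->default_ctrl;
}
```

      Calling res_reset() when a group is created or when resctrl is
      unmounted ensures that a value such as 7f written for domain 0 no
      longer leaks into the next group or the next mount.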
      
      Fixes: caf75b6b2540 ("resctrlfs: mpam: init struct for mpam")
      Fixes: 916dd9321e3c ("resctrlfs: init support resctrlfs")
      Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Cheng Jian <cj.chengjian@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      88520d23
    • X
      arm64/mpam: cleanup the source file's licence · 1fef4872
      Xie XiuQi committed
      hulk inclusion
      category: bugfix
      bugzilla: 48265
      CVE: NA
      
      --------------------------------
      
      Rewrite the licence headers of the mpam feature's source files.
      Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Wang ShaoBo <bobo.shaobowang@huawei.com>
      Reviewed-by: Cheng Jian <cj.chengjian@huawei.com>
      Signed-off-by: Zheng Zengkai <zhengzengkai@huawei.com>
      1fef4872