1. 19 1月, 2023 3 次提交
    • C
      fs: move mnt_idmap · 3707d84c
      Christian Brauner 提交于
      Now that we converted everything to just rely on struct mnt_idmap move it all
      into a separate file. This ensure that no code can poke around in struct
      mnt_idmap without any dedicated helpers and makes it easier to extend it in the
      future. Filesystems will now not be able to conflate mount and filesystem
      idmappings as they are two distinct types and require distinct helpers that
      cannot be used interchangeably. We are now also able to extend struct mnt_idmap
      as we see fit.
      Acked-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NChristian Brauner (Microsoft) <brauner@kernel.org>
      3707d84c
    • C
      fs: port privilege checking helpers to mnt_idmap · 9452e93e
      Christian Brauner 提交于
      Convert to struct mnt_idmap.
      
      Last cycle we merged the necessary infrastructure in
      256c8aed ("fs: introduce dedicated idmap type for mounts").
      This is just the conversion to struct mnt_idmap.
      
      Currently we still pass around the plain namespace that was attached to a
      mount. This is in general pretty convenient but it makes it easy to
      conflate namespaces that are relevant on the filesystem with namespaces
      that are relevent on the mount level. Especially for non-vfs developers
      without detailed knowledge in this area this can be a potential source for
      bugs.
      
      Once the conversion to struct mnt_idmap is done all helpers down to the
      really low-level helpers will take a struct mnt_idmap argument instead of
      two namespace arguments. This way it becomes impossible to conflate the two
      eliminating the possibility of any bugs. All of the vfs and all filesystems
      only operate on struct mnt_idmap.
      Acked-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NChristian Brauner (Microsoft) <brauner@kernel.org>
      9452e93e
    • C
      fs: port ->permission() to pass mnt_idmap · 4609e1f1
      Christian Brauner 提交于
      Convert to struct mnt_idmap.
      
      Last cycle we merged the necessary infrastructure in
      256c8aed ("fs: introduce dedicated idmap type for mounts").
      This is just the conversion to struct mnt_idmap.
      
      Currently we still pass around the plain namespace that was attached to a
      mount. This is in general pretty convenient but it makes it easy to
      conflate namespaces that are relevant on the filesystem with namespaces
      that are relevent on the mount level. Especially for non-vfs developers
      without detailed knowledge in this area this can be a potential source for
      bugs.
      
      Once the conversion to struct mnt_idmap is done all helpers down to the
      really low-level helpers will take a struct mnt_idmap argument instead of
      two namespace arguments. This way it becomes impossible to conflate the two
      eliminating the possibility of any bugs. All of the vfs and all filesystems
      only operate on struct mnt_idmap.
      Acked-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NChristian Brauner (Microsoft) <brauner@kernel.org>
      4609e1f1
  2. 01 11月, 2022 1 次提交
  3. 20 10月, 2022 2 次提交
    • C
      xattr: use posix acl api · 318e6685
      Christian Brauner 提交于
      In previous patches we built a new posix api solely around get and set
      inode operations. Now that we have all the pieces in place we can switch
      the system calls and the vfs over to only rely on this api when
      interacting with posix acls. This finally removes all type unsafety and
      type conversion issues explained in detail in [1] that we aim to get rid
      of.
      
      With the new posix acl api we immediately translate into an appropriate
      kernel internal struct posix_acl format both when getting and setting
      posix acls. This is a stark contrast to before were we hacked unsafe raw
      values into the uapi struct that was stored in a void pointer relying
      and having filesystems and security modules hack around in the uapi
      struct as well.
      
      Link: https://lore.kernel.org/all/20220801145520.1532837-1-brauner@kernel.org [1]
      Signed-off-by: NChristian Brauner (Microsoft) <brauner@kernel.org>
      318e6685
    • C
      internal: add may_write_xattr() · 56851bc9
      Christian Brauner 提交于
      Split out the generic checks whether an inode allows writing xattrs. Since
      security.* and system.* xattrs don't have any restrictions and we're going
      to split out posix acls into a dedicated api we will use this helper to
      check whether we can write posix acls.
      Signed-off-by: NChristian Brauner (Microsoft) <brauner@kernel.org>
      56851bc9
  4. 18 10月, 2022 3 次提交
    • C
      attr: use consistent sgid stripping checks · ed5a7047
      Christian Brauner 提交于
      Currently setgid stripping in file_remove_privs()'s should_remove_suid()
      helper is inconsistent with other parts of the vfs. Specifically, it only
      raises ATTR_KILL_SGID if the inode is S_ISGID and S_IXGRP but not if the
      inode isn't in the caller's groups and the caller isn't privileged over the
      inode although we require this already in setattr_prepare() and
      setattr_copy() and so all filesystem implement this requirement implicitly
      because they have to use setattr_{prepare,copy}() anyway.
      
      But the inconsistency shows up in setgid stripping bugs for overlayfs in
      xfstests (e.g., generic/673, generic/683, generic/685, generic/686,
      generic/687). For example, we test whether suid and setgid stripping works
      correctly when performing various write-like operations as an unprivileged
      user (fallocate, reflink, write, etc.):
      
      echo "Test 1 - qa_user, non-exec file $verb"
      setup_testfile
      chmod a+rws $junk_file
      commit_and_check "$qa_user" "$verb" 64k 64k
      
      The test basically creates a file with 6666 permissions. While the file has
      the S_ISUID and S_ISGID bits set it does not have the S_IXGRP set. On a
      regular filesystem like xfs what will happen is:
      
      sys_fallocate()
      -> vfs_fallocate()
         -> xfs_file_fallocate()
            -> file_modified()
               -> __file_remove_privs()
                  -> dentry_needs_remove_privs()
                     -> should_remove_suid()
                  -> __remove_privs()
                     newattrs.ia_valid = ATTR_FORCE | kill;
                     -> notify_change()
                        -> setattr_copy()
      
      In should_remove_suid() we can see that ATTR_KILL_SUID is raised
      unconditionally because the file in the test has S_ISUID set.
      
      But we also see that ATTR_KILL_SGID won't be set because while the file
      is S_ISGID it is not S_IXGRP (see above) which is a condition for
      ATTR_KILL_SGID being raised.
      
      So by the time we call notify_change() we have attr->ia_valid set to
      ATTR_KILL_SUID | ATTR_FORCE. Now notify_change() sees that
      ATTR_KILL_SUID is set and does:
      
      ia_valid = attr->ia_valid |= ATTR_MODE
      attr->ia_mode = (inode->i_mode & ~S_ISUID);
      
      which means that when we call setattr_copy() later we will definitely
      update inode->i_mode. Note that attr->ia_mode still contains S_ISGID.
      
      Now we call into the filesystem's ->setattr() inode operation which will
      end up calling setattr_copy(). Since ATTR_MODE is set we will hit:
      
      if (ia_valid & ATTR_MODE) {
              umode_t mode = attr->ia_mode;
              vfsgid_t vfsgid = i_gid_into_vfsgid(mnt_userns, inode);
              if (!vfsgid_in_group_p(vfsgid) &&
                  !capable_wrt_inode_uidgid(mnt_userns, inode, CAP_FSETID))
                      mode &= ~S_ISGID;
              inode->i_mode = mode;
      }
      
      and since the caller in the test is neither capable nor in the group of the
      inode the S_ISGID bit is stripped.
      
      But assume the file isn't suid then ATTR_KILL_SUID won't be raised which
      has the consequence that neither the setgid nor the suid bits are stripped
      even though it should be stripped because the inode isn't in the caller's
      groups and the caller isn't privileged over the inode.
      
      If overlayfs is in the mix things become a bit more complicated and the bug
      shows up more clearly. When e.g., ovl_setattr() is hit from
      ovl_fallocate()'s call to file_remove_privs() then ATTR_KILL_SUID and
      ATTR_KILL_SGID might be raised but because the check in notify_change() is
      questioning the ATTR_KILL_SGID flag again by requiring S_IXGRP for it to be
      stripped the S_ISGID bit isn't removed even though it should be stripped:
      
      sys_fallocate()
      -> vfs_fallocate()
         -> ovl_fallocate()
            -> file_remove_privs()
               -> dentry_needs_remove_privs()
                  -> should_remove_suid()
               -> __remove_privs()
                  newattrs.ia_valid = ATTR_FORCE | kill;
                  -> notify_change()
                     -> ovl_setattr()
                        // TAKE ON MOUNTER'S CREDS
                        -> ovl_do_notify_change()
                           -> notify_change()
                        // GIVE UP MOUNTER'S CREDS
           // TAKE ON MOUNTER'S CREDS
           -> vfs_fallocate()
              -> xfs_file_fallocate()
                 -> file_modified()
                    -> __file_remove_privs()
                       -> dentry_needs_remove_privs()
                          -> should_remove_suid()
                       -> __remove_privs()
                          newattrs.ia_valid = attr_force | kill;
                          -> notify_change()
      
      The fix for all of this is to make file_remove_privs()'s
      should_remove_suid() helper to perform the same checks as we already
      require in setattr_prepare() and setattr_copy() and have notify_change()
      not pointlessly requiring S_IXGRP again. It doesn't make any sense in the
      first place because the caller must calculate the flags via
      should_remove_suid() anyway which would raise ATTR_KILL_SGID.
      
      While we're at it we move should_remove_suid() from inode.c to attr.c
      where it belongs with the rest of the iattr helpers. Especially since it
      returns ATTR_KILL_S{G,U}ID flags. We also rename it to
      setattr_should_drop_suidgid() to better reflect that it indicates both
      setuid and setgid bit removal and also that it returns attr flags.
      
      Running xfstests with this doesn't report any regressions. We should really
      try and use consistent checks.
      Reviewed-by: NAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: NChristian Brauner (Microsoft) <brauner@kernel.org>
      ed5a7047
    • C
      attr: add setattr_should_drop_sgid() · 72ae017c
      Christian Brauner 提交于
      The current setgid stripping logic during write and ownership change
      operations is inconsistent and strewn over multiple places. In order to
      consolidate it and make more consistent we'll add a new helper
      setattr_should_drop_sgid(). The function retains the old behavior where
      we remove the S_ISGID bit unconditionally when S_IXGRP is set but also
      when it isn't set and the caller is neither in the group of the inode
      nor privileged over the inode.
      
      We will use this helper both in write operation permission removal such
      as file_remove_privs() as well as in ownership change operations.
      Reviewed-by: NAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: NChristian Brauner (Microsoft) <brauner@kernel.org>
      72ae017c
    • C
      attr: add in_group_or_capable() · 11c2a870
      Christian Brauner 提交于
      In setattr_{copy,prepare}() we need to perform the same permission
      checks to determine whether we need to drop the setgid bit or not.
      Instead of open-coding it twice add a simple helper the encapsulates the
      logic. We will reuse this helpers to make dropping the setgid bit during
      write operations more consistent in a follow up patch.
      Reviewed-by: NAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: NChristian Brauner (Microsoft) <brauner@kernel.org>
      11c2a870
  5. 29 9月, 2022 1 次提交
    • A
      [coredump] don't use __kernel_write() on kmap_local_page() · 06bbaa6d
      Al Viro 提交于
      passing kmap_local_page() result to __kernel_write() is unsafe -
      random ->write_iter() might (and 9p one does) get unhappy when
      passed ITER_KVEC with pointer that came from kmap_local_page().
      
      Fix by providing a variant of __kernel_write() that takes an iov_iter
      from caller (__kernel_write() becomes a trivial wrapper) and adding
      dump_emit_page() that parallels dump_emit(), except that instead of
      __kernel_write() it uses __kernel_write_iter() with ITER_BVEC source.
      
      Fixes: 3159ed57 "fs/coredump: use kmap_local_page()"
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      06bbaa6d
  6. 02 9月, 2022 1 次提交
  7. 16 8月, 2022 1 次提交
    • A
      locks: fix TOCTOU race when granting write lease · d6da19c9
      Amir Goldstein 提交于
      Thread A trying to acquire a write lease checks the value of i_readcount
      and i_writecount in check_conflicting_open() to verify that its own fd
      is the only fd referencing the file.
      
      Thread B trying to open the file for read will call break_lease() in
      do_dentry_open() before incrementing i_readcount, which leaves a small
      window where thread A can acquire the write lease and then thread B
      completes the open of the file for read without breaking the write lease
      that was acquired by thread A.
      
      Fix this race by incrementing i_readcount before checking for existing
      leases, same as the case with i_writecount.
      
      Use a helper put_file_access() to decrement i_readcount or i_writecount
      in do_dentry_open() and __fput().
      
      Fixes: 387e3746 ("locks: eliminate false positive conflicts for write lease")
      Reviewed-by: NJeff Layton <jlayton@kernel.org>
      Signed-off-by: NAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      d6da19c9
  8. 20 5月, 2022 1 次提交
  9. 15 5月, 2022 1 次提交
    • A
      Unify the primitives for file descriptor closing · 6319194e
      Al Viro 提交于
      Currently we have 3 primitives for removing an opened file from descriptor
      table - pick_file(), __close_fd_get_file() and close_fd_get_file().  Their
      calling conventions are rather odd and there's a code duplication for no
      good reason.  They can be unified -
      
      1) have __range_close() cap max_fd in the very beginning; that way
      we don't need separate way for pick_file() to report being past the end
      of descriptor table.
      
      2) make {__,}close_fd_get_file() return file (or NULL) directly, rather
      than returning it via struct file ** argument.  Don't bother with
      (bogus) return value - nobody wants that -ENOENT.
      
      3) make pick_file() return NULL on unopened descriptor - the only caller
      that used to care about the distinction between descriptor past the end
      of descriptor table and finding NULL in descriptor table doesn't give
      a damn after (1).
      
      4) lift ->files_lock out of pick_file()
      
      That actually simplifies the callers, as well as the primitives themselves.
      Code duplication is also gone...
      Reviewed-by: NChristian Brauner (Microsoft) <brauner@kernel.org>
      Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
      6319194e
  10. 25 4月, 2022 2 次提交
  11. 14 3月, 2022 1 次提交
  12. 11 3月, 2022 1 次提交
    • S
      io-uring: Make statx API stable · 1b6fe6e0
      Stefan Roesch 提交于
      One of the key architectual tenets is to keep the parameters for
      io-uring stable. After the call has been submitted, its value can
      be changed. Unfortunaltely this is not the case for the current statx
      implementation.
      
      IO-Uring change:
      This changes replaces the const char * filename pointer in the io_statx
      structure with a struct filename *. In addition it also creates the
      filename object during the prepare phase.
      
      With this change, the opcode also needs to invoke cleanup, so the
      filename object gets freed after processing the request.
      
      fs change:
      This replaces the const char* __user filename parameter in the two
      functions do_statx and vfs_statx with a struct filename *. In addition
      to be able to correctly construct a filename object a new helper
      function getname_statx_lookup_flags is introduced. The function makes
      sure that do_statx and vfs_statx is invoked with the correct lookup flags.
      Signed-off-by: NStefan Roesch <shr@fb.com>
      Link: https://lore.kernel.org/r/20220225185326.1373304-2-shr@fb.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
      1b6fe6e0
  13. 31 1月, 2022 1 次提交
  14. 17 12月, 2021 1 次提交
  15. 10 11月, 2021 1 次提交
    • J
      vfs: keep inodes with page cache off the inode shrinker LRU · 51b8c1fe
      Johannes Weiner 提交于
      Historically (pre-2.5), the inode shrinker used to reclaim only empty
      inodes and skip over those that still contained page cache.  This caused
      problems on highmem hosts: struct inode could put fill lowmem zones
      before the cache was getting reclaimed in the highmem zones.
      
      To address this, the inode shrinker started to strip page cache to
      facilitate reclaiming lowmem.  However, this comes with its own set of
      problems: the shrinkers may drop actively used page cache just because
      the inodes are not currently open or dirty - think working with a large
      git tree.  It further doesn't respect cgroup memory protection settings
      and can cause priority inversions between containers.
      
      Nowadays, the page cache also holds non-resident info for evicted cache
      pages in order to detect refaults.  We've come to rely heavily on this
      data inside reclaim for protecting the cache workingset and driving swap
      behavior.  We also use it to quantify and report workload health through
      psi.  The latter in turn is used for fleet health monitoring, as well as
      driving automated memory sizing of workloads and containers, proactive
      reclaim and memory offloading schemes.
      
      The consequences of dropping page cache prematurely is that we're seeing
      subtle and not-so-subtle failures in all of the above-mentioned
      scenarios, with the workload generally entering unexpected thrashing
      states while losing the ability to reliably detect it.
      
      To fix this on non-highmem systems at least, going back to rotating
      inodes on the LRU isn't feasible.  We've tried (commit a76cf1a4
      ("mm: don't reclaim inodes with many attached pages")) and failed
      (commit 69056ee6 ("Revert "mm: don't reclaim inodes with many
      attached pages"")).
      
      The issue is mostly that shrinker pools attract pressure based on their
      size, and when objects get skipped the shrinkers remember this as
      deferred reclaim work.  This accumulates excessive pressure on the
      remaining inodes, and we can quickly eat into heavily used ones, or
      dirty ones that require IO to reclaim, when there potentially is plenty
      of cold, clean cache around still.
      
      Instead, this patch keeps populated inodes off the inode LRU in the
      first place - just like an open file or dirty state would.  An otherwise
      clean and unused inode then gets queued when the last cache entry
      disappears.  This solves the problem without reintroducing the reclaim
      issues, and generally is a bit more scalable than having to wade through
      potentially hundreds of thousands of busy inodes.
      
      Locking is a bit tricky because the locks protecting the inode state
      (i_lock) and the inode LRU (lru_list.lock) don't nest inside the
      irq-safe page cache lock (i_pages.xa_lock).  Page cache deletions are
      serialized through i_lock, taken before the i_pages lock, to make sure
      depopulated inodes are queued reliably.  Additions may race with
      deletions, but we'll check again in the shrinker.  If additions race
      with the shrinker itself, we're protected by the i_lock: if find_inode()
      or iput() win, the shrinker will bail on the elevated i_count or
      I_REFERENCED; if the shrinker wins and goes ahead with the inode, it
      will set I_FREEING and inhibit further igets(), which will cause the
      other side to create a new instance of the inode instead.
      
      Link: https://lkml.kernel.org/r/20210614211904.14420-4-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Dave Chinner <david@fromorbit.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      51b8c1fe
  16. 22 10月, 2021 2 次提交
  17. 07 9月, 2021 1 次提交
  18. 24 8月, 2021 4 次提交
  19. 17 8月, 2021 1 次提交
  20. 22 7月, 2021 1 次提交
    • P
      cgroup1: fix leaked context root causing sporadic NULL deref in LTP · 1e7107c5
      Paul Gortmaker 提交于
      Richard reported sporadic (roughly one in 10 or so) null dereferences and
      other strange behaviour for a set of automated LTP tests.  Things like:
      
         BUG: kernel NULL pointer dereference, address: 0000000000000008
         #PF: supervisor read access in kernel mode
         #PF: error_code(0x0000) - not-present page
         PGD 0 P4D 0
         Oops: 0000 [#1] PREEMPT SMP PTI
         CPU: 0 PID: 1516 Comm: umount Not tainted 5.10.0-yocto-standard #1
         Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-48-gd9c812dda519-prebuilt.qemu.org 04/01/2014
         RIP: 0010:kernfs_sop_show_path+0x1b/0x60
      
      ...or these others:
      
         RIP: 0010:do_mkdirat+0x6a/0xf0
         RIP: 0010:d_alloc_parallel+0x98/0x510
         RIP: 0010:do_readlinkat+0x86/0x120
      
      There were other less common instances of some kind of a general scribble
      but the common theme was mount and cgroup and a dubious dentry triggering
      the NULL dereference.  I was only able to reproduce it under qemu by
      replicating Richard's setup as closely as possible - I never did get it
      to happen on bare metal, even while keeping everything else the same.
      
      In commit 71d883c3 ("cgroup_do_mount(): massage calling conventions")
      we see this as a part of the overall change:
      
         --------------
                 struct cgroup_subsys *ss;
         -       struct dentry *dentry;
      
         [...]
      
         -       dentry = cgroup_do_mount(&cgroup_fs_type, fc->sb_flags, root,
         -                                CGROUP_SUPER_MAGIC, ns);
      
         [...]
      
         -       if (percpu_ref_is_dying(&root->cgrp.self.refcnt)) {
         -               struct super_block *sb = dentry->d_sb;
         -               dput(dentry);
         +       ret = cgroup_do_mount(fc, CGROUP_SUPER_MAGIC, ns);
         +       if (!ret && percpu_ref_is_dying(&root->cgrp.self.refcnt)) {
         +               struct super_block *sb = fc->root->d_sb;
         +               dput(fc->root);
                         deactivate_locked_super(sb);
                         msleep(10);
                         return restart_syscall();
                 }
         --------------
      
      In changing from the local "*dentry" variable to using fc->root, we now
      export/leave that dentry pointer in the file context after doing the dput()
      in the unlikely "is_dying" case.   With LTP doing a crazy amount of back to
      back mount/unmount [testcases/bin/cgroup_regression_5_1.sh] the unlikely
      becomes slightly likely and then bad things happen.
      
      A fix would be to not leave the stale reference in fc->root as follows:
      
         --------------
                        dput(fc->root);
        +               fc->root = NULL;
                        deactivate_locked_super(sb);
         --------------
      
      ...but then we are just open-coding a duplicate of fc_drop_locked() so we
      simply use that instead.
      
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Zefan Li <lizefan.x@bytedance.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: stable@vger.kernel.org      # v5.1+
      Reported-by: NRichard Purdie <richard.purdie@linuxfoundation.org>
      Fixes: 71d883c3 ("cgroup_do_mount(): massage calling conventions")
      Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      1e7107c5
  21. 08 4月, 2021 1 次提交
  22. 02 2月, 2021 1 次提交
  23. 26 1月, 2021 1 次提交
  24. 24 1月, 2021 1 次提交
  25. 10 12月, 2020 1 次提交
  26. 02 12月, 2020 2 次提交
  27. 23 9月, 2020 1 次提交
  28. 31 7月, 2020 2 次提交