• Y
    fs: move S_ISGID stripping into the vfs_*() helpers · 1639a49c
    Yang Xu 提交于
    Move setgid handling out of individual filesystems and into the VFS
    itself to stop the proliferation of setgid inheritance bugs.
    
    Creating files that have both the S_IXGRP and S_ISGID bit raised in
    directories that themselves have the S_ISGID bit set requires additional
    privileges to avoid security issues.
    
    When a filesystem creates a new inode it needs to take care that the
    caller is either in the group of the newly created inode or they have
    CAP_FSETID in their current user namespace and are privileged over the
    parent directory of the new inode. If any of these two conditions is
    true then the S_ISGID bit can be raised for an S_IXGRP file and if not
    it needs to be stripped.
    
    However, there are several key issues with the current implementation:
    
    * S_ISGID stripping logic is entangled with umask stripping.
    
      If a filesystem doesn't support or enable POSIX ACLs then umask
      stripping is done directly in the vfs before calling into the
      filesystem.
      If the filesystem does support POSIX ACLs then unmask stripping may be
      done in the filesystem itself when calling posix_acl_create().
    
      Since umask stripping has an effect on S_ISGID inheritance, e.g., by
      stripping the S_IXGRP bit from the file to be created and all relevant
      filesystems have to call posix_acl_create() before inode_init_owner()
      where we currently take care of S_ISGID handling S_ISGID handling is
      order dependent. IOW, whether or not you get a setgid bit depends on
      POSIX ACLs and umask and in what order they are called.
    
      Note that technically filesystems are free to impose their own
      ordering between posix_acl_create() and inode_init_owner() meaning
      that there's additional ordering issues that influence S_SIGID
      inheritance.
    
    * Filesystems that don't rely on inode_init_owner() don't get S_ISGID
      stripping logic.
    
      While that may be intentional (e.g. network filesystems might just
      defer setgid stripping to a server) it is often just a security issue.
    
    This is not just ugly it's unsustainably messy especially since we do
    still have bugs in this area years after the initial round of setgid
    bugfixes.
    
    So the current state is quite messy and while we won't be able to make
    it completely clean as posix_acl_create() is still a filesystem specific
    call we can improve the S_SIGD stripping situation quite a bit by
    hoisting it out of inode_init_owner() and into the vfs creation
    operations. This means we alleviate the burden for filesystems to handle
    S_ISGID stripping correctly and can standardize the ordering between
    S_ISGID and umask stripping in the vfs.
    
    We add a new helper vfs_prepare_mode() so S_ISGID handling is now done
    in the VFS before umask handling. This has S_ISGID handling is
    unaffected unaffected by whether umask stripping is done by the VFS
    itself (if no POSIX ACLs are supported or enabled) or in the filesystem
    in posix_acl_create() (if POSIX ACLs are supported).
    
    The vfs_prepare_mode() helper is called directly in vfs_*() helpers that
    create new filesystem objects. We need to move them into there to make
    sure that filesystems like overlayfs hat have callchains like:
    
    sys_mknod()
    -> do_mknodat(mode)
       -> .mknod = ovl_mknod(mode)
          -> ovl_create(mode)
             -> vfs_mknod(mode)
    
    get S_ISGID stripping done when calling into lower filesystems via
    vfs_*() creation helpers. Moving vfs_prepare_mode() into e.g.
    vfs_mknod() takes care of that. This is in any case semantically cleaner
    because S_ISGID stripping is VFS security requirement.
    
    Security hooks so far have seen the mode with the umask applied but
    without S_ISGID handling done. The relevant hooks are called outside of
    vfs_*() creation helpers so by calling vfs_prepare_mode() from vfs_*()
    helpers the security hooks would now see the mode without umask
    stripping applied. For now we fix this by passing the mode with umask
    settings applied to not risk any regressions for LSM hooks. IOW, nothing
    changes for LSM hooks. It is worth pointing out that security hooks
    never saw the mode that is seen by the filesystem when actually creating
    the file. They have always been completely misplaced for that to work.
    
    The following filesystems use inode_init_owner() and thus relied on
    S_ISGID stripping: spufs, 9p, bfs, btrfs, ext2, ext4, f2fs, hfsplus,
    hugetlbfs, jfs, minix, nilfs2, ntfs3, ocfs2, omfs, overlayfs, ramfs,
    reiserfs, sysv, ubifs, udf, ufs, xfs, zonefs, bpf, tmpfs.
    
    All of the above filesystems end up calling inode_init_owner() when new
    filesystem objects are created through the ->mkdir(), ->mknod(),
    ->create(), ->tmpfile(), ->rename() inode operations.
    
    Since directories always inherit the S_ISGID bit with the exception of
    xfs when irix_sgid_inherit mode is turned on S_ISGID stripping doesn't
    apply. The ->symlink() and ->link() inode operations trivially inherit
    the mode from the target and the ->rename() inode operation inherits the
    mode from the source inode. All other creation inode operations will get
    S_ISGID handling via vfs_prepare_mode() when called from their relevant
    vfs_*() helpers.
    
    In addition to this there are filesystems which allow the creation of
    filesystem objects through ioctl()s or - in the case of spufs -
    circumventing the vfs in other ways. If filesystem objects are created
    through ioctl()s the vfs doesn't know about it and can't apply regular
    permission checking including S_ISGID logic. Therfore, a filesystem
    relying on S_ISGID stripping in inode_init_owner() in their ioctl()
    callpath will be affected by moving this logic into the vfs. We audited
    those filesystems:
    
    * btrfs allows the creation of filesystem objects through various
      ioctls(). Snapshot creation literally takes a snapshot and so the mode
      is fully preserved and S_ISGID stripping doesn't apply.
    
      Creating a new subvolum relies on inode_init_owner() in
      btrfs_new_subvol_inode() but only creates directories and doesn't
      raise S_ISGID.
    
    * ocfs2 has a peculiar implementation of reflinks. In contrast to e.g.
      xfs and btrfs FICLONE/FICLONERANGE ioctl() that is only concerned with
      the actual extents ocfs2 uses a separate ioctl() that also creates the
      target file.
    
      Iow, ocfs2 circumvents the vfs entirely here and did indeed rely on
      inode_init_owner() to strip the S_ISGID bit. This is the only place
      where a filesystem needs to call mode_strip_sgid() directly but this
      is self-inflicted pain.
    
    * spufs doesn't go through the vfs at all and doesn't use ioctl()s
      either. Instead it has a dedicated system call spufs_create() which
      allows the creation of filesystem objects. But spufs only creates
      directories and doesn't allo S_SIGID bits, i.e. it specifically only
      allows 0777 bits.
    
    * bpf uses vfs_mkobj() but also doesn't allow S_ISGID bits to be created.
    
    The patch will have an effect on ext2 when the EXT2_MOUNT_GRPID mount
    option is used, on ext4 when the EXT4_MOUNT_GRPID mount option is used,
    and on xfs when the XFS_FEAT_GRPID mount option is used. When any of
    these filesystems are mounted with their respective GRPID option then
    newly created files inherit the parent directories group
    unconditionally. In these cases non of the filesystems call
    inode_init_owner() and thus did never strip the S_ISGID bit for newly
    created files. Moving this logic into the VFS means that they now get
    the S_ISGID bit stripped. This is a user visible change. If this leads
    to regressions we will either need to figure out a better way or we need
    to revert. However, given the various setgid bugs that we found just in
    the last two years this is a regression risk we should take.
    
    Associated with this change is a new set of fstests to enforce the
    semantics for all new filesystems.
    
    Link: https://lore.kernel.org/ceph-devel/20220427092201.wvsdjbnc7b4dttaw@wittgenstein [1]
    Link: e014f37d ("xfs: use setattr_copy to set vfs inode attributes") [2]
    Link: 01ea173e ("xfs: fix up non-directory creation in SGID directories") [3]
    Link: fd84bfdd ("ceph: fix up non-directory creation in SGID directories") [4]
    Link: https://lore.kernel.org/r/1657779088-2242-3-git-send-email-xuyang2018.jy@fujitsu.comSuggested-by: NDave Chinner <david@fromorbit.com>
    Suggested-by: NChristian Brauner (Microsoft) <brauner@kernel.org>
    Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
    Reviewed-and-Tested-by: NJeff Layton <jlayton@kernel.org>
    Signed-off-by: NYang Xu <xuyang2018.jy@fujitsu.com>
    [<brauner@kernel.org>: rewrote commit message]
    Signed-off-by: NChristian Brauner (Microsoft) <brauner@kernel.org>
    1639a49c
namei.c 137.4 KB