1. 24 Jan, 2021 35 commits
    • C
      ext4: support idmapped mounts · 14f3db55
      Christian Brauner committed
      Enable idmapped mounts for ext4. All dedicated helpers we need for this
      exist. So this basically just means we're passing down the
      user_namespace argument from the VFS methods to the relevant helpers.
      
      Let's create a simple example where we idmap an ext4 filesystem:
      
       root@f2-vm:~# truncate -s 5G ext4.img
      
       root@f2-vm:~# mkfs.ext4 ./ext4.img
       mke2fs 1.45.5 (07-Jan-2020)
       Discarding device blocks: done
       Creating filesystem with 1310720 4k blocks and 327680 inodes
       Filesystem UUID: 3fd91794-c6ca-4b0f-9964-289a000919cf
       Superblock backups stored on blocks:
               32768, 98304, 163840, 229376, 294912, 819200, 884736
      
       Allocating group tables: done
       Writing inode tables: done
       Creating journal (16384 blocks): done
       Writing superblocks and filesystem accounting information: done
      
       root@f2-vm:~# losetup -f --show ./ext4.img
       /dev/loop0
      
       root@f2-vm:~# mount /dev/loop0 /mnt
      
       root@f2-vm:~# ls -al /mnt/
       total 24
       drwxr-xr-x  3 root root  4096 Oct 28 13:34 .
       drwxr-xr-x 30 root root  4096 Oct 28 13:22 ..
       drwx------  2 root root 16384 Oct 28 13:34 lost+found
      
       # Let's create an idmapped mount at /idmapped1 where we map uid and gid
       # 0 to uid and gid 1000
       root@f2-vm:/# ./mount-idmapped --map-mount b:0:1000:1 /mnt/ /idmapped1/
      
       root@f2-vm:/# ls -al /idmapped1/
       total 24
       drwxr-xr-x  3 ubuntu ubuntu  4096 Oct 28 13:34 .
       drwxr-xr-x 30 root   root    4096 Oct 28 13:22 ..
       drwx------  2 ubuntu ubuntu 16384 Oct 28 13:34 lost+found
      
       # Let's create an idmapped mount at /idmapped2 where we map uid and gid
       # 0 to uid and gid 2000
       root@f2-vm:/# ./mount-idmapped --map-mount b:0:2000:1 /mnt/ /idmapped2/
      
       root@f2-vm:/# ls -al /idmapped2/
       total 24
       drwxr-xr-x  3 2000 2000  4096 Oct 28 13:34 .
       drwxr-xr-x 31 root root  4096 Oct 28 13:39 ..
       drwx------  2 2000 2000 16384 Oct 28 13:34 lost+found
      
      Let's create another example where we idmap the rootfs filesystem
      without a mapping for uid 0 and gid 0:
      
       # Create an idmapped mount for the full POSIX range of rootfs under
       # /mnt, but without a mapping for uid 0, to reduce the attack surface
      
       root@f2-vm:/# ./mount-idmapped --map-mount b:1:1:65536 / /mnt/
      
       # Since we don't have a mapping for uid and gid 0 all files owned by
       # uid and gid 0 should show up as uid and gid 65534:
       root@f2-vm:/# ls -al /mnt/
       total 664
       drwxr-xr-x 31 nobody nogroup   4096 Oct 28 13:39 .
       drwxr-xr-x 31 root   root      4096 Oct 28 13:39 ..
       lrwxrwxrwx  1 nobody nogroup      7 Aug 25 07:44 bin -> usr/bin
       drwxr-xr-x  4 nobody nogroup   4096 Oct 28 13:17 boot
       drwxr-xr-x  2 nobody nogroup   4096 Aug 25 07:48 dev
       drwxr-xr-x 81 nobody nogroup   4096 Oct 28 04:00 etc
       drwxr-xr-x  4 nobody nogroup   4096 Oct 28 04:00 home
       lrwxrwxrwx  1 nobody nogroup      7 Aug 25 07:44 lib -> usr/lib
       lrwxrwxrwx  1 nobody nogroup      9 Aug 25 07:44 lib32 -> usr/lib32
       lrwxrwxrwx  1 nobody nogroup      9 Aug 25 07:44 lib64 -> usr/lib64
       lrwxrwxrwx  1 nobody nogroup     10 Aug 25 07:44 libx32 -> usr/libx32
       drwx------  2 nobody nogroup  16384 Aug 25 07:47 lost+found
       drwxr-xr-x  2 nobody nogroup   4096 Aug 25 07:44 media
       drwxr-xr-x 31 nobody nogroup   4096 Oct 28 13:39 mnt
       drwxr-xr-x  2 nobody nogroup   4096 Aug 25 07:44 opt
       drwxr-xr-x  2 nobody nogroup   4096 Apr 15  2020 proc
       drwx--x--x  6 nobody nogroup   4096 Oct 28 13:34 root
       drwxr-xr-x  2 nobody nogroup   4096 Aug 25 07:46 run
       lrwxrwxrwx  1 nobody nogroup      8 Aug 25 07:44 sbin -> usr/sbin
       drwxr-xr-x  2 nobody nogroup   4096 Aug 25 07:44 srv
       drwxr-xr-x  2 nobody nogroup   4096 Apr 15  2020 sys
       drwxrwxrwt 10 nobody nogroup   4096 Oct 28 13:19 tmp
       drwxr-xr-x 14 nobody nogroup   4096 Oct 20 13:00 usr
       drwxr-xr-x 12 nobody nogroup   4096 Aug 25 07:45 var
      
       # Since we do have a mapping for uid and gid 1000 all files owned by
       # uid and gid 1000 should simply show up as uid and gid 1000:
       root@f2-vm:/# ls -al /mnt/home/ubuntu/
       total 40
       drwxr-xr-x 3 ubuntu ubuntu  4096 Oct 28 00:43 .
       drwxr-xr-x 4 nobody nogroup 4096 Oct 28 04:00 ..
       -rw------- 1 ubuntu ubuntu  2936 Oct 28 12:26 .bash_history
       -rw-r--r-- 1 ubuntu ubuntu   220 Feb 25  2020 .bash_logout
       -rw-r--r-- 1 ubuntu ubuntu  3771 Feb 25  2020 .bashrc
       -rw-r--r-- 1 ubuntu ubuntu   807 Feb 25  2020 .profile
       -rw-r--r-- 1 ubuntu ubuntu     0 Oct 16 16:11 .sudo_as_admin_successful
       -rw------- 1 ubuntu ubuntu  1144 Oct 28 00:43 .viminfo
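
The listings above can be modelled with a small lookup. The following is a minimal sketch, not the kernel's implementation: `struct id_extent`, `map_fs_id_to_mnt()` and `OVERFLOW_ID` are hypothetical names, and the extent fields are assumed to follow the from:to:count order of the `--map-mount` arguments shown above. Ids without a mapping are reported as 65534, matching the `nobody`/`nogroup` entries in the rootfs example.

```c
#include <stdint.h>

/* One extent of an idmapping, in the from:to:count order used above. */
struct id_extent {
	uint32_t fs_first;   /* first id as stored by the filesystem */
	uint32_t mnt_first;  /* first id as seen through the idmapped mount */
	uint32_t count;      /* number of consecutive ids covered */
};

/* Id reported when no mapping exists (65534 in the listings above). */
#define OVERFLOW_ID 65534u

/* Map a filesystem id to the id visible through the idmapped mount. */
static uint32_t map_fs_id_to_mnt(const struct id_extent *e, uint32_t fs_id)
{
	if (fs_id >= e->fs_first && fs_id - e->fs_first < e->count)
		return e->mnt_first + (fs_id - e->fs_first);
	return OVERFLOW_ID;
}
```

With the extent `{0, 1000, 1}` a file owned by id 0 is shown as 1000, and with `{1, 1, 65536}` a file owned by id 0 has no mapping while id 1000 is shown unchanged.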
      
      Link: https://lore.kernel.org/r/20210121131959.646623-39-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-ext4@vger.kernel.org
      Cc: linux-fsdevel@vger.kernel.org
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • C
      fat: handle idmapped mounts · 4b789936
      Christian Brauner committed
      Let fat handle idmapped mounts. This allows the same fat mount to
      appear in multiple locations with different id mappings. It makes it
      possible to expose a vfat-formatted USB stick to multiple users with
      different ids on the host or in user namespaces, allowing for DAC
      permissions:
      
      mount -o uid=1000,gid=1000 /dev/sdb /mnt
      
      u1001@f2-vm:/lower1$ ls -ln /mnt/
      total 4
      -rwxr-xr-x 1 1000 1000 4 Oct 28 03:44 aaa
      -rwxr-xr-x 1 1000 1000 0 Oct 28 01:09 bbb
      -rwxr-xr-x 1 1000 1000 0 Oct 28 01:10 ccc
      -rwxr-xr-x 1 1000 1000 0 Oct 28 03:46 ddd
      -rwxr-xr-x 1 1000 1000 0 Oct 28 04:01 eee
      
      mount-idmapped --map-mount b:1000:1001:1
      
      u1001@f2-vm:/lower1$ ls -ln /lower1/
      total 4
      -rwxr-xr-x 1 1001 1001 4 Oct 28 03:44 aaa
      -rwxr-xr-x 1 1001 1001 0 Oct 28 01:09 bbb
      -rwxr-xr-x 1 1001 1001 0 Oct 28 01:10 ccc
      -rwxr-xr-x 1 1001 1001 0 Oct 28 03:46 ddd
      -rwxr-xr-x 1 1001 1001 0 Oct 28 04:01 eee
      
      u1001@f2-vm:/lower1$ touch /lower1/fff
      
      u1001@f2-vm:/lower1$ ls -ln /lower1/fff
      -rwxr-xr-x 1 1001 1001 0 Oct 28 04:03 /lower1/fff
      
      u1001@f2-vm:/lower1$ ls -ln /mnt/fff
      -rwxr-xr-x 1 1000 1000 0 Oct 28 04:03 /mnt/fff
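
The last two listings show the reverse direction: a file created by id 1001 through /lower1 is stored as id 1000. A minimal sketch of that reverse lookup follows. `struct id_extent` and `mnt_id_to_fs()` are hypothetical names used for illustration only, and treating a missing mapping as a plain failure is an assumption about how an unmapped caller would be handled.

```c
#include <stdbool.h>
#include <stdint.h>

/* One extent of an idmapping, in the from:to:count order used above. */
struct id_extent {
	uint32_t fs_first;   /* first id as stored by the filesystem */
	uint32_t mnt_first;  /* first id as seen through the idmapped mount */
	uint32_t count;      /* number of consecutive ids covered */
};

/*
 * Translate an id used through the idmapped mount back to the id the
 * filesystem stores. Returns false when the id has no mapping.
 */
static bool mnt_id_to_fs(const struct id_extent *e, uint32_t mnt_id,
			 uint32_t *fs_id)
{
	if (mnt_id < e->mnt_first || mnt_id - e->mnt_first >= e->count)
		return false;
	*fs_id = e->fs_first + (mnt_id - e->mnt_first);
	return true;
}
```

For the extent `{1000, 1001, 1}`, id 1001 translates back to 1000, while id 1000 used through the mount has no reverse mapping.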
      
      Link: https://lore.kernel.org/r/20210121131959.646623-38-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • C
      fs: introduce MOUNT_ATTR_IDMAP · 9caccd41
      Christian Brauner committed
      Introduce a new bind mount property to allow idmapping mounts. The
      MOUNT_ATTR_IDMAP flag can be set via the new mount_setattr() syscall
      together with a file descriptor referring to a user namespace.
      
      The user namespace referenced by the namespace file descriptor will be
      attached to the bind mount. All interactions with the filesystem going
      through that mount will be mapped according to the mapping specified in
      the user namespace attached to it.
      
      Using user namespaces to mark mounts means we can reuse all the existing
      infrastructure in the kernel that already exists to handle idmappings
      and can also use this for permission checking to allow unprivileged users
      to create idmapped mounts in the future.
      
      Idmapping a mount is decoupled from the caller's user and mount
      namespace. This means idmapped mounts can be created in the initial
      user namespace which is an important use-case for systemd-homed,
      portable usb-sticks between systems, sharing data between the initial
      user namespace and unprivileged containers, and other use-cases that
      have been brought up. For example, assume a home directory where all
      files are owned by uid and gid 1000 and the home directory is brought to
      a new laptop where the user has id 12345. The system administrator can
      simply create a mount of this home directory with a mapping of
      1000:12345:1 and other mappings to indicate the ids should be kept.
      (With this it is e.g. also possible to create idmapped mounts on the
      host with an identity mapping 1:1:100000 where the root user is not
      mapped. A user with root access that e.g. has been pivot rooted into
      such a mount on the host will not be able to execute, read, write, or
      create files as root.)
      
      Given that mapping a mount is decoupled from the caller's user namespace
      a sufficiently privileged process such as a container manager can set up
      an idmapped mount for the container and the container can simply pivot
      root to it. There's no need for the container to do anything. The mount
      will appear correctly mapped independent of the user namespace the
      container uses. This means we don't need to mark a mount as idmappable.
      
      In order to create an idmapped mount the caller must currently be
      privileged in the user namespace of the superblock the mount belongs to.
      Once a mount has been idmapped we don't allow it to change its mapping.
      This keeps permission checking and life-cycle management simple. Users
      wanting to change the idmapping can always create a new detached mount
      with a different idmapping.
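
A minimal userspace sketch of the flow described above, under stated assumptions: `idmap_mount()` is a hypothetical helper, `userns_fd` is assumed to be an open file descriptor for a user namespace (for example from /proc/<pid>/ns/user), and the raw syscall numbers and fallback definitions are only there for older headers. It clones a detached mount with open_tree(), attaches the user namespace with mount_setattr(), and attaches the result with move_mount().

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/types.h>
#include <linux/mount.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallbacks for older headers; values are assumptions taken from the uapi. */
#ifndef AT_EMPTY_PATH
#define AT_EMPTY_PATH 0x1000
#endif
#ifndef MOUNT_ATTR_IDMAP
#define MOUNT_ATTR_IDMAP 0x00100000
#endif
#ifndef SYS_mount_setattr
#define SYS_mount_setattr 442
#endif
#ifndef MOUNT_ATTR_SIZE_VER0
struct mount_attr {
	__u64 attr_set;
	__u64 attr_clr;
	__u64 propagation;
	__u64 userns_fd;
};
#endif

/* Create an idmapped bind mount of @source at @target. Returns 0 or -1. */
static int idmap_mount(int userns_fd, const char *source, const char *target)
{
	struct mount_attr attr = {
		.attr_set = MOUNT_ATTR_IDMAP,
		.userns_fd = (__u64)userns_fd,
	};
	int ret = -1;

	/* Detached copy of the mount; its mapping can be set exactly once. */
	int tree_fd = (int)syscall(SYS_open_tree, AT_FDCWD, source,
				   OPEN_TREE_CLONE | OPEN_TREE_CLOEXEC);
	if (tree_fd < 0)
		return -1;

	if (syscall(SYS_mount_setattr, tree_fd, "", AT_EMPTY_PATH,
		    &attr, sizeof(attr)) == 0 &&
	    syscall(SYS_move_mount, tree_fd, "", AT_FDCWD, target,
		    MOVE_MOUNT_F_EMPTY_PATH) == 0)
		ret = 0;

	close(tree_fd);
	return ret;
}
```

The caller needs the privileges described above; without them, or with an invalid `userns_fd`, the helper returns -1 and leaves `errno` set by the failing syscall.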
      
      Link: https://lore.kernel.org/r/20210121131959.646623-36-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Mauricio Vásquez Bernal <mauricio@kinvolk.io>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • C
      fs: add mount_setattr() · 2a186721
      Christian Brauner committed
      This implements the missing mount_setattr() syscall. While the new mount
      API allows changing the properties of a superblock, there is currently
      no way to change the properties of a mount or a mount tree using file
      descriptors, which the new mount API is based on. In addition, the old
      mount API has the restriction that mount options cannot be applied
      recursively. This hasn't changed since changing mount options on a
      per-mount basis was implemented in [1] and has been a frequent request
      not just for convenience but also for security reasons. The legacy
      mount syscall is unable to accommodate this behavior without introducing
      a whole new set of flags because MS_REC | MS_REMOUNT | MS_BIND |
      MS_RDONLY | MS_NOEXEC | [...] only apply the mount option to the topmost
      mount. Changing MS_REC to apply to the whole mount tree would mean
      introducing a significant uapi change and would likely cause significant
      regressions.
      
      The new mount_setattr() syscall allows mount options to be cleared and
      set recursively in one shot. Multiple calls to change mount options
      requesting the same changes are idempotent:
      
      int mount_setattr(int dfd, const char *path, unsigned flags,
                        struct mount_attr *uattr, size_t usize);
      
      Flags to modify path resolution behavior are specified in the @flags
      argument. Currently, AT_EMPTY_PATH, AT_RECURSIVE, AT_SYMLINK_NOFOLLOW,
      and AT_NO_AUTOMOUNT are supported. If useful, additional lookup flags to
      restrict path resolution as introduced with openat2() might be supported
      in the future.
      
      The mount_setattr() syscall can be expected to grow over time and is
      designed with extensibility in mind. It follows the extensible syscall
      pattern we have used with other syscalls such as openat2(), clone3(),
      sched_{set,get}attr(), and others.
      The set of mount options is passed in the uapi struct mount_attr which
      currently has the following layout:
      
      struct mount_attr {
      	__u64 attr_set;
      	__u64 attr_clr;
      	__u64 propagation;
      	__u64 userns_fd;
      };
      
      The @attr_set and @attr_clr members are used to set and clear mount
      options, respectively. This way a user can e.g. request that a set of
      flags is to be raised, such as turning mounts read-only by raising
      MOUNT_ATTR_RDONLY in
      @attr_set while at the same time requesting that another set of flags is
      to be lowered such as removing noexec from a mount tree by specifying
      MOUNT_ATTR_NOEXEC in @attr_clr.
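
The combination described above can be sketched from userspace as follows. This is an illustration under assumptions, not a reference implementation: `ro_exec_attr()` and `make_tree_ro_exec()` are hypothetical helpers, the syscall is invoked through syscall() in case the C library has no wrapper, and the fallback definitions only apply to older headers.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <linux/types.h>
#include <linux/mount.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Fallbacks for older headers; values are assumptions taken from the uapi. */
#ifndef AT_RECURSIVE
#define AT_RECURSIVE 0x8000
#endif
#ifndef SYS_mount_setattr
#define SYS_mount_setattr 442
#endif
#ifndef MOUNT_ATTR_SIZE_VER0
struct mount_attr {
	__u64 attr_set;
	__u64 attr_clr;
	__u64 propagation;
	__u64 userns_fd;
};
#endif

/* Request: raise read-only, lower noexec, leave everything else alone. */
static struct mount_attr ro_exec_attr(void)
{
	struct mount_attr attr = {
		.attr_set = MOUNT_ATTR_RDONLY,
		.attr_clr = MOUNT_ATTR_NOEXEC,
	};
	return attr;
}

/* Apply the request to @path and every mount below it. Returns 0 or -1. */
static int make_tree_ro_exec(int dfd, const char *path)
{
	struct mount_attr attr = ro_exec_attr();

	return (int)syscall(SYS_mount_setattr, dfd, path, AT_RECURSIVE,
			    &attr, sizeof(attr));
}
```

A call such as `make_tree_ro_exec(AT_FDCWD, "/mnt")` would apply both changes to the whole tree in one step, given sufficient privileges.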
      
      Note, since the MOUNT_ATTR_<atime> values are an enum starting from 0,
      not a bitmap, users wanting to transition to a different atime setting
      cannot simply specify the atime setting in @attr_set, but must also
      specify MOUNT_ATTR__ATIME in the @attr_clr field. So we ensure that
      MOUNT_ATTR__ATIME can't be partially set in @attr_clr and that @attr_set
      can't have any atime bits set if MOUNT_ATTR__ATIME isn't set in
      @attr_clr.
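
The two atime constraints above can be written as a small predicate. This is a sketch of the rule as stated in the text, not the kernel's code; `atime_request_valid()` is a hypothetical name, and the MOUNT_ATTR_* constants are assumed to come from the uapi header <linux/mount.h>.

```c
#include <stdbool.h>
#include <stdint.h>
#include <linux/mount.h>

/*
 * Check the atime rule: MOUNT_ATTR__ATIME must be cleared either fully or
 * not at all, and an atime value may only be set if it is fully cleared.
 */
static bool atime_request_valid(uint64_t attr_set, uint64_t attr_clr)
{
	uint64_t clr_atime = attr_clr & MOUNT_ATTR__ATIME;

	/* Partially clearing the atime field is rejected. */
	if (clr_atime != 0 && clr_atime != MOUNT_ATTR__ATIME)
		return false;

	/* Setting atime bits requires clearing the whole field first. */
	if ((attr_set & MOUNT_ATTR__ATIME) != 0 && clr_atime == 0)
		return false;

	return true;
}
```

Switching to noatime is therefore expressed as MOUNT_ATTR_NOATIME in @attr_set together with MOUNT_ATTR__ATIME in @attr_clr.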
      
      The @propagation field lets callers specify the propagation type of a
      mount tree. Propagation is a single property that has four different
      settings and as such is not really a flag argument but an enum.
      Specifically, it would be unclear what setting and clearing propagation
      settings in combination would amount to. The legacy mount() syscall thus
      forbids the combination of multiple propagation settings too. The goal
      is to keep the semantics of mount propagation somewhat simple as they
      are overly complex as it is.
      
      The @userns_fd field lets the user specify a user namespace whose idmapping
      becomes the idmapping of the mount. This is implemented and explained in
      detail in the next patch.
      
      [1]: commit 2e4b7fcd ("[PATCH] r/o bind mounts: honor mount writer counts at remount")
      
      Link: https://lore.kernel.org/r/20210121131959.646623-35-christian.brauner@ubuntu.com
      Cc: David Howells <dhowells@redhat.com>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Cc: linux-api@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • C
      fs: add attr_flags_to_mnt_flags helper · 5b490500
      Christian Brauner committed
      Add a simple helper to translate uapi MOUNT_ATTR_* flags to MNT_* flags
      which we will use in follow-up patches too.
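
A userspace sketch of what such a translation helper might look like. The SKETCH_MNT_* constants stand in for the kernel-internal MNT_* flags and their values are illustrative assumptions, as is the choice to translate only the plain on/off flags and leave the atime enum aside. `sketch_attr_flags_to_mnt_flags()` is a hypothetical name.

```c
#include <stdint.h>
#include <linux/mount.h>

/* Stand-ins for kernel-internal MNT_* flags (illustrative values). */
#define SKETCH_MNT_NOSUID     0x01u
#define SKETCH_MNT_NODEV      0x02u
#define SKETCH_MNT_NOEXEC     0x04u
#define SKETCH_MNT_NODIRATIME 0x10u
#define SKETCH_MNT_READONLY   0x40u

/* Translate uapi MOUNT_ATTR_* on/off flags to internal-style flags. */
static unsigned int sketch_attr_flags_to_mnt_flags(uint64_t attr_flags)
{
	unsigned int mnt_flags = 0;

	if (attr_flags & MOUNT_ATTR_RDONLY)
		mnt_flags |= SKETCH_MNT_READONLY;
	if (attr_flags & MOUNT_ATTR_NOSUID)
		mnt_flags |= SKETCH_MNT_NOSUID;
	if (attr_flags & MOUNT_ATTR_NODEV)
		mnt_flags |= SKETCH_MNT_NODEV;
	if (attr_flags & MOUNT_ATTR_NOEXEC)
		mnt_flags |= SKETCH_MNT_NOEXEC;
	if (attr_flags & MOUNT_ATTR_NODIRATIME)
		mnt_flags |= SKETCH_MNT_NODIRATIME;

	return mnt_flags;
}
```

Keeping the translation in one place means every caller that accepts MOUNT_ATTR_* flags produces the same internal flag set.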
      
      Link: https://lore.kernel.org/r/20210121131959.646623-34-christian.brauner@ubuntu.com
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Suggested-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • C
      fs: split out functions to hold writers · fbdc2f6c
      Christian Brauner committed
      When a mount is marked read-only we set MNT_WRITE_HOLD on it if there
      aren't currently any active writers. Split this logic out into simple
      helpers that we can use in follow-up patches.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-33-christian.brauner@ubuntu.com
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Suggested-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • C
      namespace: only take read lock in do_reconfigure_mnt() · e58ace1a
      Christian Brauner committed
      do_reconfigure_mnt() used to take the down_write(&sb->s_umount) lock
      which seems unnecessary since we're not changing the superblock. We're
      only checking whether it is already read-only. Setting other mount
      attributes is protected by lock_mount_hash() afaict and not by s_umount.
      
      The history of down_write(&sb->s_umount) lock being taken when setting
      mount attributes dates back to the introduction of MNT_READONLY in [2].
      This introduced the concept of having read-only mounts in contrast to
      just having a read-only superblock. When it got introduced it was simply
      plumbed into do_remount() which already took down_write(&sb->s_umount)
      because it was only used to actually change the superblock before [2].
      Afaict, it would've already been possible back then to only use
      down_read(&sb->s_umount) for MS_BIND | MS_REMOUNT since actual mount
      options were protected by the vfsmount lock already. But that would've
      meant special casing the locking for MS_BIND | MS_REMOUNT in
      do_remount() which people might not have considered worth it.
      Then in [1] MS_BIND | MS_REMOUNT mount option changes were split out of
      do_remount() into do_reconfigure_mnt() but the down_write(&sb->s_umount)
      lock was simply copied over.
      Now that this is a separate helper, only take the
      down_read(&sb->s_umount) lock since we're only interested in checking
      whether the super block is currently read-only and blocking any writers
      from changing it. Essentially, checking that the super block is
      read-only has the advantage that we can avoid having to go into the
      slowpath and through MNT_WRITE_HOLD and can simply set the read-only
      flag on the mount in set_mount_attributes().
      
      [1]: commit 43f5e655 ("vfs: Separate changing mount flags full remount")
      [2]: commit 2e4b7fcd ("[PATCH] r/o bind mounts: honor mount writer counts at remount")
      
      Link: https://lore.kernel.org/r/20210121131959.646623-32-christian.brauner@ubuntu.com
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • C
      mount: make {lock,unlock}_mount_hash() static · d033cb67
      Christian Brauner committed
      The lock_mount_hash() and unlock_mount_hash() helpers are never called
      outside a single file. Remove them from the header and make them static
      to reflect this fact. There's no need to have them callable from other
      places right now, as Christoph observed.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-31-christian.brauner@ubuntu.com
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Suggested-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • C
      namespace: take lock_mount_hash() directly when changing flags · 68847c94
      Christian Brauner committed
      Changing mount options always ends up taking lock_mount_hash() but when
      MNT_READONLY is requested and neither the mount nor the superblock are
      MNT_READONLY we end up taking the lock, dropping it, and retaking it to
      change the other mount attributes. Instead, let's acquire the lock once
      when changing the mount attributes. This simplifies the locking in these
      codepaths, makes them easier to reason about, and avoids having to
      reacquire the lock right after dropping it.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-30-christian.brauner@ubuntu.com
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • C
      nfs: do not export idmapped mounts · 899bf2ce
      Christian Brauner committed
      Prevent nfs from exporting idmapped mounts until we have ported it to
      support exporting idmapped mounts.
      
      Link: https://lore.kernel.org/linux-api/20210123130958.3t6kvgkl634njpsm@wittgenstein
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: "J. Bruce Fields" <bfields@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • C
      overlayfs: do not mount on top of idmapped mounts · 029a52ad
      Christian Brauner committed
      Prevent overlayfs from being mounted on top of idmapped mounts.
      Stacking filesystems need to be prevented from being mounted on top of
      idmapped mounts until they have been converted to handle this.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-29-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: James Morris <jamorris@linux.microsoft.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • C
      ecryptfs: do not mount on top of idmapped mounts · 0f16ff0f
      Christian Brauner committed
      Prevent ecryptfs from being mounted on top of idmapped mounts.
      Stacking filesystems need to be prevented from being mounted on top of
      idmapped mounts until they have been converted to handle this.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-28-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: James Morris <jamorris@linux.microsoft.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • C
      ima: handle idmapped mounts · a2d2329e
      Christian Brauner committed
      IMA sometimes accesses the inode's i_uid and compares it against the
      rules' fowner. Enable IMA to handle idmapped mounts by passing down the
      mount's user namespace. We simply make use of the helpers we introduced
      before. If the initial user namespace is passed nothing changes so
      non-idmapped mounts will see identical behavior as before.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-27-christian.brauner@ubuntu.com
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • C
      fs: make helpers idmap mount aware · 549c7297
      Christian Brauner committed
      Extend some inode methods with an additional user namespace argument. A
      filesystem that is aware of idmapped mounts will receive the user
      namespace the mount has been marked with. This can be used for
      additional permission checking and also to enable filesystems to
      translate between uids and gids if they need to. We have implemented all
      relevant helpers in earlier patches.
      
      As requested we simply extend the existing inode methods instead of
      introducing new ones. This is a little more code churn but it's mostly
      mechanical and doesn't leave us with additional inode methods.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-25-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • C
      exec: handle idmapped mounts · 1ab29965
      Christian Brauner committed
      When executing a setuid binary the kernel will verify in bprm_fill_uid()
      that the inode has a mapping in the caller's user namespace before
      setting the caller's uid and gid. Let bprm_fill_uid() handle idmapped
      mounts. If the inode is accessed through an idmapped mount it is mapped
      according to the mount's user namespace. Afterwards the checks are
      identical to non-idmapped mounts. If the initial user namespace is
      passed nothing changes so non-idmapped mounts will see identical
      behavior as before.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-24-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: James Morris <jamorris@linux.microsoft.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • C
      would_dump: handle idmapped mounts · 435ac621
      Christian Brauner committed
      When determining whether or not to create a coredump the vfs will verify
      that the caller is privileged over the inode. Make the would_dump()
      helper handle idmapped mounts by passing down the mount's user namespace
      of the exec file. If the initial user namespace is passed nothing
      changes so non-idmapped mounts will see identical behavior as before.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-23-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • C
      ioctl: handle idmapped mounts · 0f5d220b
      Christian Brauner committed
      Enable generic ioctls to handle idmapped mounts by passing down the
      mount's user namespace. If the initial user namespace is passed nothing
      changes so non-idmapped mounts will see identical behavior as before.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-22-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: James Morris <jamorris@linux.microsoft.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • C
      init: handle idmapped mounts · b816dd5d
      Christian Brauner committed
      Enable the init helpers to handle idmapped mounts by passing down the
      mount's user namespace. If the initial user namespace is passed nothing
      changes so non-idmapped mounts will see identical behavior as before.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-21-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • C
      fcntl: handle idmapped mounts · 9eccd12c
      Christian Brauner committed
      Enable the setfl() helper to handle idmapped mounts by passing down the
      mount's user namespace. If the initial user namespace is passed nothing
      changes so non-idmapped mounts will see identical behavior as before.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-20-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: James Morris <jamorris@linux.microsoft.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • C
      utimes: handle idmapped mounts · d06c26f1
      Christian Brauner committed
      Enable the vfs_utimes() helper to handle idmapped mounts by passing down
      the mount's user namespace. If the initial user namespace is passed
      nothing changes so non-idmapped mounts will see identical behavior as
      before.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-19-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: James Morris <jamorris@linux.microsoft.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • C
      open: handle idmapped mounts · b8b546a0
      Christian Brauner committed
      For core file operations such as changing directories or chrooting,
      determining file access, changing mode or ownership the vfs will verify
      that the caller is privileged over the inode. Extend the various helpers
      to handle idmapped mounts. If the inode is accessed through an idmapped
      mount, map it into the mount's user namespace. Afterwards the permission
      checks are identical to non-idmapped mounts. When changing file
      ownership we need to map the uid and gid from the mount's user
      namespace. If the initial user namespace is passed nothing changes so
      non-idmapped mounts will see identical behavior as before.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-17-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: James Morris <jamorris@linux.microsoft.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • C
      open: handle idmapped mounts in do_truncate() · 643fe55a
      Christian Brauner committed
      When truncating files the vfs will verify that the caller is privileged
      over the inode. Extend it to handle idmapped mounts. If the inode is
      accessed through an idmapped mount it is mapped according to the mount's
      user namespace. Afterwards the permission checks are identical to
      non-idmapped mounts. If the initial user namespace is passed nothing
      changes so non-idmapped mounts will see identical behavior as before.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-16-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • C
      namei: prepare for idmapped mounts · 6521f891
      Christian Brauner committed
      The various vfs_*() helpers are called by filesystems or by the vfs
      itself to perform core operations such as create, link, mkdir, mknod, rename,
      rmdir, tmpfile and unlink. Enable them to handle idmapped mounts. If the
      inode is accessed through an idmapped mount, map it into the
      mount's user namespace and pass it down. Afterwards the checks and
      operations are identical to non-idmapped mounts. If the initial user
      namespace is passed nothing changes so non-idmapped mounts will see
      identical behavior as before.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-15-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      6521f891
    • C
      namei: introduce struct renamedata · 9fe61450
      Christian Brauner committed
      In order to handle idmapped mounts we will extend the vfs rename helper
      to take two new arguments in follow-up patches. Since this operation
      already takes a number of arguments, add a simple struct renamedata and
      make the current helper use it before we extend it.
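
      The idea of bundling a long argument list into one struct can be
      sketched in userspace C. The types and the function below
      (struct rename_args, do_rename_sketch() and the *_sketch types) are
      hypothetical stand-ins for illustration, not the kernel's definitions:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for kernel objects; not the real kernel types. */
struct dir_sketch { const char *path; };
struct entry_sketch { const char *name; };

/*
 * All rename arguments live in one struct, so a later change can add
 * members without touching the signature seen by every caller.
 */
struct rename_args {
	struct dir_sketch *old_dir;
	struct entry_sketch *old_entry;
	struct dir_sketch *new_dir;
	struct entry_sketch *new_entry;
	unsigned int flags;
};

/* Returns 0 if all required members are set, -1 otherwise. */
static int do_rename_sketch(const struct rename_args *rd)
{
	assert(rd != NULL); /* passing no argument struct is a caller bug */

	if (!rd->old_dir || !rd->old_entry || !rd->new_dir || !rd->new_entry)
		return -1;

	/* A real helper would perform the rename here. */
	return 0;
}
```

      A caller fills the struct with designated initializers and passes a
      single pointer, so adding two more members later only affects the
      callers that need to set them.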
      
      Link: https://lore.kernel.org/r/20210121131959.646623-14-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      9fe61450
    • C
      namei: handle idmapped mounts in may_*() helpers · ba73d987
      Christian Brauner committed
      The may_follow_link(), may_linkat(), may_lookup(), may_open(),
      may_o_create(), may_create_in_sticky(), may_delete(), and may_create()
      helpers determine whether the caller is privileged enough to perform the
      associated operations. Let them handle idmapped mounts by mapping the
      inode or fsids according to the mount's user namespace. Afterwards the
      checks are identical to non-idmapped inodes. The patch takes care to
      retrieve the mount's user namespace right before performing permission
      checks and passing it down into the filesystem, so the user namespace
      can't change in between because someone idmaps a mount that is currently
      not idmapped. If the initial user namespace is passed nothing changes so
      non-idmapped mounts will see identical behavior as before.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-13-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: James Morris <jamorris@linux.microsoft.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      ba73d987
    • C
      stat: handle idmapped mounts · 0d56a451
      Christian Brauner committed
      The generic_fillattr() helper fills in the basic attributes associated
      with an inode. Enable it to handle idmapped mounts. If the inode is
      accessed through an idmapped mount, map it into the mount's user
      namespace before we store the uid and gid. If the initial user namespace
      is passed nothing changes so non-idmapped mounts will see identical
      behavior as before.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-12-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: James Morris <jamorris@linux.microsoft.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      0d56a451
    • C
      commoncap: handle idmapped mounts · 71bc356f
      Christian Brauner committed
      When interacting with user namespace and non-user namespace aware
      filesystem capabilities the vfs will perform various security checks to
      determine whether or not the filesystem capabilities can be used by the
      caller, whether they need to be removed and so on. The main
      infrastructure for this resides in the capability codepaths but they are
      called through the LSM security infrastructure even though they are not
      technically an LSM or optional. This extends the existing security hooks
      security_inode_removexattr(), security_inode_killpriv(),
      security_inode_getsecurity() to pass down the mount's user namespace and
      makes them aware of idmapped mounts.
      
      In order to actually get filesystem capabilities from disk the
      capability infrastructure exposes the get_vfs_caps_from_disk() helper.
      For user namespace aware filesystem capabilities a root uid is stored
      alongside the capabilities.
      
      In order to determine whether the caller can make use of the filesystem
      capability or whether it needs to be ignored it is translated according
      to the superblock's user namespace. If it can be translated to uid 0
      according to that id mapping the caller can use the filesystem
      capabilities stored on disk. If we are accessing the inode that holds
      the filesystem capabilities through an idmapped mount we map the root
      uid according to the mount's user namespace. Afterwards the checks are
      identical to non-idmapped mounts: reading filesystem caps from disk
      enforces that the root uid associated with the filesystem capability
      must have a mapping in the superblock's user namespace and that the
      caller is either in the same user namespace or is a descendant of the
      superblock's user namespace. For filesystems that are mountable inside
      user namespaces the caller can just mount the filesystem and won't
      usually need to idmap it. If they do want to idmap it they can create an
      idmapped mount and mark it with a user namespace they created and which
      is thus a descendant of s_user_ns. For filesystems that are not
      mountable inside user namespaces the descendant rule is trivially true
      because the s_user_ns will be the initial user namespace.
      
      If the initial user namespace is passed nothing changes so non-idmapped
      mounts will see identical behavior as before.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-11-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Acked-by: James Morris <jamorris@linux.microsoft.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      71bc356f
    • T
      xattr: handle idmapped mounts · c7c7a1a1
      Tycho Andersen committed
      When interacting with extended attributes the vfs verifies that the
      caller is privileged over the inode with which the extended attribute is
      associated. For posix access and posix default extended attributes a uid
      or gid can be stored on-disk. Let the functions handle posix extended
      attributes on idmapped mounts. If the inode is accessed through an
      idmapped mount we need to map it according to the mount's user
      namespace. Afterwards the checks are identical to non-idmapped mounts.
      This has no effect for e.g. security xattrs since they don't store uids
      or gids and don't perform permission checks on them like posix acls do.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-10-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: James Morris <jamorris@linux.microsoft.com>
      Signed-off-by: Tycho Andersen <tycho@tycho.pizza>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      c7c7a1a1
    • C
      acl: handle idmapped mounts · e65ce2a5
      Christian Brauner committed
      The posix acl permission checking helpers determine whether a caller is
      privileged over an inode according to the acls associated with the
      inode. Add helpers that make it possible to handle acls on idmapped
      mounts.
      
      The vfs and the filesystems targeted by this first iteration make use of
      posix_acl_fix_xattr_from_user() and posix_acl_fix_xattr_to_user() to
      translate basic posix access and default permissions such as the
      ACL_USER and ACL_GROUP type according to the initial user namespace (or
      the superblock's user namespace) to and from the caller's current user
      namespace. Adapt these two helpers to handle idmapped mounts whereby we
      either map from or into the mount's user namespace depending on the
      direction in which we're translating.
      Similarly, cap_convert_nscap() is used by the vfs to translate user
      namespace and non-user namespace aware filesystem capabilities from the
      superblock's user namespace to the caller's user namespace. Enable it to
      handle idmapped mounts by accounting for the mount's user namespace.
      
      In addition the filesystems targeted in the first iteration of this patch
      series make use of the posix_acl_chmod() and posix_acl_update_mode()
      helpers. Both helpers perform permission checks on the target inode. Let
      them handle idmapped mounts. These two helpers are called when posix
      acls are set by the respective filesystems. To handle this case we extend
      the ->set() method to take an additional user namespace argument to pass
      the mount's user namespace down.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-9-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      e65ce2a5
    • C
      attr: handle idmapped mounts · 2f221d6f
      Christian Brauner committed
      When file attributes are changed most filesystems rely on the
      setattr_prepare(), setattr_copy(), and notify_change() helpers for
      initialization and permission checking. Let them handle idmapped mounts.
      If the inode is accessed through an idmapped mount, map it into the
      mount's user namespace. Afterwards the checks are identical to
      non-idmapped mounts. If the initial user namespace is passed nothing
      changes so non-idmapped mounts will see identical behavior as before.
      
      Helpers that perform checks on the ia_uid and ia_gid fields in struct
      iattr assume that ia_uid and ia_gid are intended values and have already
      been mapped correctly at the userspace-kernelspace boundary as we
      already do today. If the initial user namespace is passed nothing
      changes so non-idmapped mounts will see identical behavior as before.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-8-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      2f221d6f
    • C
      inode: make init and permission helpers idmapped mount aware · 21cb47be
      Christian Brauner committed
      The inode_owner_or_capable() helper determines whether the caller is the
      owner of the inode or is capable with respect to that inode. Allow it to
      handle idmapped mounts. If the inode is accessed through an idmapped
      mount, map it according to the mount's user namespace. Afterwards the checks
      are identical to non-idmapped mounts. If the initial user namespace is
      passed nothing changes so non-idmapped mounts will see identical
      behavior as before.
      
      Similarly, allow the inode_init_owner() helper to handle idmapped
      mounts. It initializes a new inode on idmapped mounts by mapping the
      fsuid and fsgid of the caller from the mount's user namespace. If the
      initial user namespace is passed nothing changes so non-idmapped mounts
      will see identical behavior as before.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-7-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: James Morris <jamorris@linux.microsoft.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      21cb47be
    • C
      namei: make permission helpers idmapped mount aware · 47291baa
      Christian Brauner committed
      The two helpers inode_permission() and generic_permission() are used by
      the vfs to perform basic permission checking by verifying that the
      caller is privileged over an inode. In order to handle idmapped mounts
      we extend the two helpers with an additional user namespace argument.
      On idmapped mounts the two helpers will make sure to map the inode
      according to the mount's user namespace and then perform identical
      permission checks to inode_permission() and generic_permission(). If the
      initial user namespace is passed nothing changes so non-idmapped mounts
      will see identical behavior as before.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-6-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: James Morris <jamorris@linux.microsoft.com>
      Acked-by: Serge Hallyn <serge@hallyn.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      47291baa
    • C
      capability: handle idmapped mounts · 0558c1bf
      Christian Brauner committed
      In order to determine whether a caller holds privilege over a given
      inode the capability framework exposes the two helpers
      privileged_wrt_inode_uidgid() and capable_wrt_inode_uidgid(). The former
      verifies that the inode has a mapping in the caller's user namespace and
      the latter additionally verifies that the caller has the requested
      capability in their current user namespace.
      If the inode is accessed through an idmapped mount, map it into the
      mount's user namespace. Afterwards the checks are identical to
      non-idmapped inodes. If the initial user namespace is passed all
      operations are a nop so non-idmapped mounts will not see a change in
      behavior.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-5-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: James Morris <jamorris@linux.microsoft.com>
      Acked-by: Serge Hallyn <serge@hallyn.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      0558c1bf
    • C
      fs: add file and path permissions helpers · 02f92b38
      Christian Brauner committed
      Add two simple helpers to check permissions on a file and path
      respectively and convert over some callers. It simplifies quite a few
      codepaths and also reduces the churn in later patches quite a bit.
      Christoph also correctly points out that this makes codepaths (e.g.
      ioctls) way easier to follow that would otherwise have to do more
      complex argument passing than necessary.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-4-christian.brauner@ubuntu.com
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Suggested-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: James Morris <jamorris@linux.microsoft.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      02f92b38
    • C
      mount: attach mappings to mounts · a6435940
      Christian Brauner committed
      In order to support per-mount idmappings vfsmounts are marked with user
      namespaces. The idmapping of the user namespace will be used to map the
      ids of vfs objects when they are accessed through that mount. By default
      all vfsmounts are marked with the initial user namespace. The initial
      user namespace is used to indicate that a mount is not idmapped. All
      operations behave as before.
      
      Based on prior discussions we want to attach the whole user namespace
      and not just a dedicated idmapping struct. This allows us to reuse all
      the helpers that already exist for dealing with idmappings instead of
      introducing a whole new range of helpers. In addition, if we decide in
      the future that we are confident enough to enable unprivileged users to
      set up idmapped mounts the permission checking can take into account
      whether the caller is privileged in the user namespace the mount is
      currently marked with.
      Later patches enforce that once a mount has been idmapped it can't be
      remapped. This keeps permission checking and life-cycle management
      simple. Users wanting to change the idmapping can always create a new
      detached mount with a different idmapping.
      
      Add a new mnt_userns member to vfsmount and two simple helpers to
      retrieve the mnt_userns from vfsmounts and files.
      
      The idea to attach user namespaces to vfsmounts has been floated around
      in various forms at Linux Plumbers in ~2018 with the original idea
      tracing back to a discussion in 2017 at a conference in St. Petersburg
      between Christoph, Tycho, and myself.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-2-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      a6435940
  2. 17 Jan, 2021 2 commits
  3. 16 Jan, 2021 3 commits
    • J
      io_uring: ensure finish_wait() is always called in __io_uring_task_cancel() · a8d13dbc
      Jens Axboe committed
      If we enter with requests pending and perform cancelations, we'll have
      a different inflight count before and after calling prepare_to_wait().
      This causes the loop to restart. If we actually ended up canceling
      everything, or everything completed in-between, then we'll break out
      of the loop without calling finish_wait() on the waitqueue. This can
      trigger a warning on exit_signals(), as we leave the task state in
      TASK_UNINTERRUPTIBLE.
      
      Put a finish_wait() after the loop to catch that case.
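
      The control flow described above can be modelled in userspace. This is
      a simplified model loosely following the description, not the kernel
      code: every name ending in _sketch or starting with sketch_ is
      hypothetical, the task state is a plain enum, and cancelation is
      assumed to complete every pending request.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum task_state_sketch { SKETCH_RUNNING, SKETCH_UNINTERRUPTIBLE };

struct task_sketch {
	int inflight;                 /* number of pending requests */
	enum task_state_sketch state; /* modelled task state */
};

static void sketch_prepare_to_wait(struct task_sketch *t)
{
	t->state = SKETCH_UNINTERRUPTIBLE;
}

static void sketch_finish_wait(struct task_sketch *t)
{
	t->state = SKETCH_RUNNING;
}

/* Models a cancelation that completes every pending request. */
static void sketch_cancel_all(struct task_sketch *t)
{
	t->inflight = 0;
}

/* Runs the modelled cancel loop and returns the task state on exit. */
static enum task_state_sketch cancel_loop_sketch(struct task_sketch *t,
						 bool with_fix)
{
	assert(t != NULL && t->inflight >= 0);

	for (;;) {
		int inflight = t->inflight;

		if (!inflight)
			break;
		sketch_cancel_all(t);
		sketch_prepare_to_wait(t);
		/* Count changed since it was sampled: restart the loop. */
		if (inflight != t->inflight)
			continue;
		/*
		 * A real implementation would sleep here. Not reached in
		 * this model because cancelation always completes everything.
		 */
		t->inflight = 0;
		sketch_finish_wait(t);
	}
	if (with_fix)
		sketch_finish_wait(t); /* the call added after the loop */
	return t->state;
}
```

      Starting with three pending requests, the model leaves the loop in
      SKETCH_UNINTERRUPTIBLE when with_fix is false and in SKETCH_RUNNING
      when with_fix is true.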
      
      Cc: stable@vger.kernel.org # 5.9+
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      a8d13dbc
    • D
      ext4: remove expensive flush on fast commit · e9f53353
      Daejun Park committed
      Fast commit sets REQ_FUA and REQ_PREFLUSH on every fast commit block
      when barriers are enabled. During recovery, however, ext4 checks the
      CRC value stored in the tail, so it is sufficient to set REQ_FUA and
      REQ_PREFLUSH only on the block that contains the tail.
      Signed-off-by: Daejun Park <daejun7.park@samsung.com>
      Reviewed-by: Harshad Shirwadkar <harshadshirwadkar@gmail.com>
      Link: https://lore.kernel.org/r/20210106013242epcms2p5b6b4ed8ca86f29456fdf56aa580e74b4@epcms2p5
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      e9f53353
    • Y
      ext4: fix bug for rename with RENAME_WHITEOUT · 6b4b8e6b
      yangerkun committed
      We got a "deleted inode referenced" warning during our fsstress test. The
      bug can be reproduced easily with the following steps:
      
        cd /dev/shm
        mkdir test/
        fallocate -l 128M img
        mkfs.ext4 -b 1024 img
        mount img test/
        dd if=/dev/zero of=test/foo bs=1M count=128
        mkdir test/dir/ && cd test/dir/
        for ((i=0;i<1000;i++)); do touch file$i; done # consume all blocks
        # renameat2() is a syscall, shown here as pseudocode; ext4_add_entry()
        # in ext4_rename() will return ENOSPC
        cd ~ && renameat2(AT_FDCWD, "/dev/shm/test/dir/file1", AT_FDCWD,
          "/dev/shm/test/dir/dst_file", RENAME_WHITEOUT)
        cd /dev/shm/ && umount test/ && mount img test/ && ls -li test/dir/file1

      We will get the output:
        "ls: cannot access 'test/dir/file1': Structure needs cleaning"
      and dmesg shows:
        "EXT4-fs error (device loop0): ext4_lookup:1626: inode #2049: comm ls:
        deleted inode referenced: 139"
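
      The renameat2() step in the reproducer is a system call rather than a
      shell command. A minimal C helper for issuing it could look like the
      sketch below; rename_whiteout() is a hypothetical name, and the
      fallback definition of RENAME_WHITEOUT is only used if the system
      headers do not already provide it.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>       /* AT_FDCWD */
#include <stddef.h>
#include <sys/syscall.h> /* SYS_renameat2 */
#include <unistd.h>      /* syscall() */

#ifndef RENAME_WHITEOUT
#define RENAME_WHITEOUT (1 << 2) /* same value as in the Linux UAPI headers */
#endif

/*
 * Rename oldpath to newpath and ask the filesystem to leave a whiteout
 * at oldpath. Returns 0 on success, -1 with errno set on failure.
 */
static int rename_whiteout(const char *oldpath, const char *newpath)
{
	assert(oldpath != NULL && newpath != NULL);
	return (int)syscall(SYS_renameat2, AT_FDCWD, oldpath,
			    AT_FDCWD, newpath, RENAME_WHITEOUT);
}
```

      For the reproducer the call would be
      rename_whiteout("/dev/shm/test/dir/file1", "/dev/shm/test/dir/dst_file"),
      run as root like the mount commands; on the full filesystem it is
      expected to fail with errno set to ENOSPC.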
      
      ext4_rename will create a special inode for the whiteout and use its 'ino'
      to replace the 'ino' in the source file's dir entry. If an error happens
      later (the error above was the ENOSPC returned from ext4_add_entry in
      ext4_rename since all space had been consumed), the cleanup drops the
      nlink for the whiteout but forgets to restore 'ino' to that of the source
      file. This triggers the bug described above.

      Signed-off-by: yangerkun <yangerkun@huawei.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Cc: stable@vger.kernel.org
      Fixes: cd808dec ("ext4: support RENAME_WHITEOUT")
      Link: https://lore.kernel.org/r/20210105062857.3566-1-yangerkun@huawei.com
      Signed-off-by: Theodore Ts'o <tytso@mit.edu>
      6b4b8e6b