1. 24 Sep, 2022 1 commit
  2. 02 Aug, 2022 1 commit
  3. 27 Jul, 2022 2 commits
    • Y
      ovl: fix some kernel-doc comments · 9c5dd803
      Yang Li committed
      Remove warnings reported by scripts/kernel-doc,
      which are triggered by building with 'make W=1'.
      fs/overlayfs/super.c:311: warning: Function parameter or member 'dentry'
      not described in 'ovl_statfs'
      fs/overlayfs/super.c:311: warning: Excess function parameter 'sb'
      description in 'ovl_statfs'
      fs/overlayfs/super.c:357: warning: Function parameter or member 'm' not
      described in 'ovl_show_options'
      fs/overlayfs/super.c:357: warning: Function parameter or member 'dentry'
      not described in 'ovl_show_options'
      Reported-by: Abaci Robot <abaci@linux.alibaba.com>
      Signed-off-by: Yang Li <yang.lee@linux.alibaba.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      9c5dd803
    • M
      ovl: warn if trusted xattr creation fails · b10b85fe
      Miklos Szeredi committed
      When mounting overlayfs in an unprivileged user namespace, trusted xattr
      creation will fail.  This will lead to failures in some file operations,
      e.g. in the following situation:
      
        mkdir lower upper work merged
        mkdir lower/directory
        mount -toverlay -olowerdir=lower,upperdir=upper,workdir=work none merged
        rmdir merged/directory
        mkdir merged/directory
      
      The last mkdir will fail:
      
        mkdir: cannot create directory 'merged/directory': Input/output error
      
      The cause for these failures is currently extremely non-obvious and hard to
      debug.  Hence, warn the user and suggest using the userxattr mount option,
      if it is not already supplied and xattr creation fails during the
      self-check.
      Reported-by: Alois Wohlschlager <alois1@gmx-topmail.de>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      b10b85fe
  4. 16 Jul, 2022 1 commit
  5. 08 Jul, 2022 1 commit
    • C
      ovl: turn of SB_POSIXACL with idmapped layers temporarily · 4a47c638
      Christian Brauner committed
      This cycle we added support for mounting overlayfs on top of idmapped
      mounts.  Recently I've started looking into potential corner cases when
      trying to add additional tests and I noticed that reporting for POSIX ACLs
      is currently wrong when using idmapped layers with overlayfs mounted on top
      of it.
      
      I have sent out a patch that fixes this and makes POSIX ACLs work
      correctly but the patch is a bit bigger and we're already at -rc5 so I
      recommend we simply don't raise SB_POSIXACL when idmapped layers are
      used. Then we can fix the VFS part described below for the next merge
      window so we can have good exposure in -next.
      
      I'm going to give a rather detailed explanation to both the origin of the
      problem and mention the solution so people know what's going on.
      
      Let's assume the user creates the following directory layout and they have
      a rootfs /var/lib/lxc/c2/rootfs. The files in this rootfs are owned as you
      would expect files on your host system to be owned. For example, ~/.bashrc
      for your regular user would be owned by 1000:1000 and /root/.bashrc would
      be owned by 0:0. IOW, this is just a regular boring filesystem tree on an
      ext4 or xfs filesystem.
      
      The user chooses to set POSIX ACLs using the setfacl binary granting the
      user with uid 4 read, write, and execute permissions for their .bashrc
      file:
      
              setfacl -m u:4:rwx /var/lib/lxc/c2/rootfs/home/ubuntu/.bashrc
      
      Now they want to expose the whole rootfs to a container using an idmapped
      mount. So they first create:
      
              mkdir -pv /vol/contpool/{ctrover,merge,lowermap,overmap}
              mkdir -pv /vol/contpool/ctrover/{over,work}
              chown 10000000:10000000 /vol/contpool/ctrover/{over,work}
      
      The user now creates an idmapped mount for the rootfs:
      
              mount-idmapped/mount-idmapped --map-mount=b:0:10000000:65536 \
                                            /var/lib/lxc/c2/rootfs \
                                            /vol/contpool/lowermap
      
      For example, this makes
      /var/lib/lxc/c2/rootfs/home/ubuntu/.bashrc, which is owned by uid and gid
      1000, appear as owned by uid and gid 10001000 at
      /vol/contpool/lowermap/home/ubuntu/.bashrc.
      
      Assume the user wants to expose these idmapped mounts through an overlayfs
      mount to a container.
      
             mount -t overlay overlay                      \
                   -o lowerdir=/vol/contpool/lowermap,     \
                      upperdir=/vol/contpool/overmap/over, \
                      workdir=/vol/contpool/overmap/work   \
                   /vol/contpool/merge
      
      The user can do this in two ways:
      
      (1) Mount overlayfs in the initial user namespace and expose it to the
          container.
      
      (2) Mount overlayfs on top of the idmapped mounts inside of the container's
          user namespace.
      
      Let's assume the user chooses option (1) and mounts overlayfs on the
      host and then changes into a container which uses the idmapping
      0:10000000:65536, which is the same one used for the two idmapped mounts.
      
      Now the user tries to retrieve the POSIX ACLs using the getfacl command
      
              getfacl -n /vol/contpool/lowermap/home/ubuntu/.bashrc
      
      and to their surprise they see:
      
              # file: vol/contpool/merge/home/ubuntu/.bashrc
              # owner: 1000
              # group: 1000
              user::rw-
              user:4294967295:rwx
              group::r--
              mask::rwx
              other::r--
      
      indicating the uid wasn't correctly translated according to the idmapped
      mount. The problem is how we currently translate POSIX ACLs. Let's inspect
      the callchain in this example:
      
              idmapped mount /vol/contpool/merge:      0:10000000:65536
              caller's idmapping:                      0:10000000:65536
              overlayfs idmapping (ofs->creator_cred): 0:0:4k /* initial idmapping */
      
              sys_getxattr()
              -> path_getxattr()
                 -> getxattr()
                    -> do_getxattr()
                        |> vfs_getxattr()
                        |  -> __vfs_getxattr()
                        |     -> handler->get == ovl_posix_acl_xattr_get()
                        |        -> ovl_xattr_get()
                        |           -> vfs_getxattr()
                        |              -> __vfs_getxattr()
                        |                 -> handler->get() /* lower filesystem callback */
                        |> posix_acl_fix_xattr_to_user()
                           {
                                    4 = make_kuid(&init_user_ns, 4);
                                    4 = mapped_kuid_fs(&init_user_ns /* no idmapped mount */, 4);
                                    /* FAILURE */
                                   -1 = from_kuid(0:10000000:65536 /* caller's idmapping */, 4);
                           }
      
      If the user chooses to use option (2) and mounts overlayfs on top of
      idmapped mounts inside the container things don't look that much better:
      
              idmapped mount /vol/contpool/merge:      0:10000000:65536
              caller's idmapping:                      0:10000000:65536
              overlayfs idmapping (ofs->creator_cred): 0:10000000:65536
      
              sys_getxattr()
              -> path_getxattr()
                 -> getxattr()
                    -> do_getxattr()
                        |> vfs_getxattr()
                        |  -> __vfs_getxattr()
                        |     -> handler->get == ovl_posix_acl_xattr_get()
                        |        -> ovl_xattr_get()
                        |           -> vfs_getxattr()
                        |              -> __vfs_getxattr()
                        |                 -> handler->get() /* lower filesystem callback */
                        |> posix_acl_fix_xattr_to_user()
                           {
                                    4 = make_kuid(&init_user_ns, 4);
                                    4 = mapped_kuid_fs(&init_user_ns, 4);
                                    /* FAILURE */
                                   -1 = from_kuid(0:10000000:65536 /* caller's idmapping */, 4);
                           }
      
      As is easily seen the problem arises because the idmapping of the lower
      mount isn't taken into account as all of this happens in do_getxattr(). But
      do_getxattr() is always called on an overlayfs mount and inode and thus
      cannot possibly take the idmapping of the lower layers into account.
      
      This problem is similar for fscaps but there the translation happens as
      part of vfs_getxattr() already. Let's walk through an fscaps overlayfs
      callchain:
      
              setcap 'cap_net_raw+ep' /var/lib/lxc/c2/rootfs/home/ubuntu/.bashrc
      
      The expected outcome here is that we'll receive the cap_net_raw capability
      as we are able to map the uid associated with the fscap to 0 within our
      container.  IOW, we want to see 0 as the result of the idmapping
      translations.
      
      If the user chooses option (1) we get the following callchain for fscaps:
      
              idmapped mount /vol/contpool/merge:      0:10000000:65536
              caller's idmapping:                      0:10000000:65536
              overlayfs idmapping (ofs->creator_cred): 0:0:4k /* initial idmapping */
      
              sys_getxattr()
              -> path_getxattr()
                 -> getxattr()
                    -> do_getxattr()
                         -> vfs_getxattr()
                            -> xattr_getsecurity()
                               -> security_inode_getsecurity()                                       ________________________________
                                  -> cap_inode_getsecurity()                                         |                              |
                                     {                                                               V                              |
                                              10000000 = make_kuid(0:0:4k /* overlayfs idmapping */, 10000000);                     |
                                              10000000 = mapped_kuid_fs(0:0:4k /* no idmapped mount */, 10000000);                  |
                                                     /* Expected result is 0 and thus that we own the fscap. */                     |
                                                     0 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000000);            |
                                     }                                                                                              |
                                     -> vfs_getxattr_alloc()                                                                        |
                                        -> handler->get == ovl_other_xattr_get()                                                    |
                                           -> vfs_getxattr()                                                                        |
                                              -> xattr_getsecurity()                                                                |
                                                 -> security_inode_getsecurity()                                                    |
                                                    -> cap_inode_getsecurity()                                                      |
                                                       {                                                                            |
                                                                      0 = make_kuid(0:0:4k /* lower s_user_ns */, 0);               |
                                                               10000000 = mapped_kuid_fs(0:10000000:65536 /* idmapped mount */, 0); |
                                                               10000000 = from_kuid(0:0:4k /* overlayfs idmapping */, 10000000);    |
                                                               |____________________________________________________________________|
                                                       }
                                                       -> vfs_getxattr_alloc()
                                                          -> handler->get == /* lower filesystem callback */
      
      And if the user chooses option (2) we get:
      
              idmapped mount /vol/contpool/merge:      0:10000000:65536
              caller's idmapping:                      0:10000000:65536
              overlayfs idmapping (ofs->creator_cred): 0:10000000:65536
      
              sys_getxattr()
              -> path_getxattr()
                 -> getxattr()
                    -> do_getxattr()
                         -> vfs_getxattr()
                            -> xattr_getsecurity()
                               -> security_inode_getsecurity()                                                _______________________________
                                  -> cap_inode_getsecurity()                                                  |                             |
                                     {                                                                        V                             |
                                             10000000 = make_kuid(0:10000000:65536 /* overlayfs idmapping */, 0);                           |
                                             10000000 = mapped_kuid_fs(0:0:4k /* no idmapped mount */, 10000000);                           |
                                                     /* Expected result is 0 and thus that we own the fscap. */                             |
                                                    0 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000000);                     |
                                     }                                                                                                      |
                                     -> vfs_getxattr_alloc()                                                                                |
                                        -> handler->get == ovl_other_xattr_get()                                                            |
                                          |-> vfs_getxattr()                                                                                |
                                              -> xattr_getsecurity()                                                                        |
                                                 -> security_inode_getsecurity()                                                            |
                                                    -> cap_inode_getsecurity()                                                              |
                                                       {                                                                                    |
                                                                       0 = make_kuid(0:0:4k /* lower s_user_ns */, 0);                      |
                                                                10000000 = mapped_kuid_fs(0:10000000:65536 /* idmapped mount */, 0);        |
                                                                       0 = from_kuid(0:10000000:65536 /* overlayfs idmapping */, 10000000); |
                                                                       |____________________________________________________________________|
                                                       }
                                                       -> vfs_getxattr_alloc()
                                                          -> handler->get == /* lower filesystem callback */
      
      We can see how the translation happens correctly in those cases as the
      conversion happens within the vfs_getxattr() helper.
      
      For POSIX ACLs we need to do something similar. However, in contrast to
      fscaps we cannot apply the fix directly to the kernel internal posix acl
      data structure as this would alter the cached values and would also require
      a rework of how we currently deal with POSIX ACLs in general which almost
      never take the filesystem idmapping into account (the notable exception
      being FUSE but even there the implementation is special) and instead
      retrieve the raw values based on the initial idmapping.
      
      The correct values are then generated right before returning to
      userspace. The fix for this is to move taking the mount's idmapping into
      account directly in vfs_getxattr() instead of having it be part of
      posix_acl_fix_xattr_to_user().
      
      To this end we simply move the idmapped mount translation into a separate
      step performed in vfs_{g,s}etxattr() instead of in
      posix_acl_fix_xattr_{from,to}_user().
      
      To see how this fixes things let's go back to the original example. Assume
      the user chose option (1) and mounted overlayfs on top of idmapped mounts
      on the host:
      
              idmapped mount /vol/contpool/merge:      0:10000000:65536
              caller's idmapping:                      0:10000000:65536
              overlayfs idmapping (ofs->creator_cred): 0:0:4k /* initial idmapping */
      
              sys_getxattr()
              -> path_getxattr()
                 -> getxattr()
                    -> do_getxattr()
                        |> vfs_getxattr()
                        |  |> __vfs_getxattr()
                        |  |  -> handler->get == ovl_posix_acl_xattr_get()
                        |  |     -> ovl_xattr_get()
                        |  |        -> vfs_getxattr()
                        |  |           |> __vfs_getxattr()
                        |  |           |  -> handler->get() /* lower filesystem callback */
                        |  |           |> posix_acl_getxattr_idmapped_mnt()
                        |  |              {
                        |  |                              4 = make_kuid(&init_user_ns, 4);
                        |  |                       10000004 = mapped_kuid_fs(0:10000000:65536 /* lower idmapped mount */, 4);
                        |  |                       10000004 = from_kuid(&init_user_ns, 10000004);
                        |  |                       |_______________________
                        |  |              }                               |
                        |  |                                              |
                        |  |> posix_acl_getxattr_idmapped_mnt()           |
                        |     {                                           |
                        |                                                 V
                        |             10000004 = make_kuid(&init_user_ns, 10000004);
                        |             10000004 = mapped_kuid_fs(&init_user_ns /* no idmapped mount */, 10000004);
                        |             10000004 = from_kuid(&init_user_ns, 10000004);
                        |     }       |_________________________________________________
                        |                                                              |
                        |                                                              |
                        |> posix_acl_fix_xattr_to_user()                               |
                           {                                                           V
                                       10000004 = make_kuid(0:0:4k /* init_user_ns */, 10000004);
                                              /* SUCCESS */
                                              4 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000004);
                           }
      
      And similarly if the user chooses option (2) and mounts overlayfs on top of
      idmapped mounts inside the container:
      
              idmapped mount /vol/contpool/merge:      0:10000000:65536
              caller's idmapping:                      0:10000000:65536
              overlayfs idmapping (ofs->creator_cred): 0:10000000:65536
      
              sys_getxattr()
              -> path_getxattr()
                 -> getxattr()
                    -> do_getxattr()
                        |> vfs_getxattr()
                        |  |> __vfs_getxattr()
                        |  |  -> handler->get == ovl_posix_acl_xattr_get()
                        |  |     -> ovl_xattr_get()
                        |  |        -> vfs_getxattr()
                        |  |           |> __vfs_getxattr()
                        |  |           |  -> handler->get() /* lower filesystem callback */
                        |  |           |> posix_acl_getxattr_idmapped_mnt()
                        |  |              {
                        |  |                              4 = make_kuid(&init_user_ns, 4);
                        |  |                       10000004 = mapped_kuid_fs(0:10000000:65536 /* lower idmapped mount */, 4);
                        |  |                       10000004 = from_kuid(&init_user_ns, 10000004);
                        |  |                       |_______________________
                        |  |              }                               |
                        |  |                                              |
                        |  |> posix_acl_getxattr_idmapped_mnt()           |
                        |     {                                           V
                        |             10000004 = make_kuid(&init_user_ns, 10000004);
                        |             10000004 = mapped_kuid_fs(&init_user_ns /* no idmapped mount */, 10000004);
                        |             10000004 = from_kuid(&init_user_ns, 10000004);
                        |             |_________________________________________________
                        |     }                                                        |
                        |                                                              |
                        |> posix_acl_fix_xattr_to_user()                               |
                           {                                                           V
                                       10000004 = make_kuid(0:0:4k /* init_user_ns */, 10000004);
                                              /* SUCCESS */
                                              4 = from_kuid(0:10000000:65536 /* caller's idmappings */, 10000004);
                           }
      
      The last remaining problem we need to fix here is ovl_get_acl(). During
      ovl_permission() overlayfs will call:
      
              ovl_permission()
              -> generic_permission()
                 -> acl_permission_check()
                    -> check_acl()
                       -> get_acl()
                          -> inode->i_op->get_acl() == ovl_get_acl()
                             -> get_acl() /* on the underlying filesystem */
                                -> inode->i_op->get_acl() == /* lower filesystem callback */
                       -> posix_acl_permission()
      
      passing through the get_acl request to the underlying filesystem. This will
      retrieve the acls stored in the lower filesystem without taking the
      idmapping of the underlying mount into account as this would mean altering
      the cached values for the lower filesystem. The simple solution is to have
      ovl_get_acl() simply duplicate the ACLs, update the values according to the
      idmapped mount and return it to acl_permission_check() so it can be used in
      posix_acl_permission(). Since overlayfs doesn't cache ACLs they'll be
      released right after.
      
      Link: https://github.com/brauner/mount-idmapped/issues/9
      Cc: Seth Forshee <sforshee@digitalocean.com>
      Cc: Amir Goldstein <amir73il@gmail.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Cc: linux-unionfs@vger.kernel.org
      Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
      Fixes: bc70682a ("ovl: support idmapped layers")
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      4a47c638
  6. 28 Apr, 2022 6 commits
  7. 23 Mar, 2022 1 commit
  8. 04 Dec, 2021 1 commit
  9. 04 Nov, 2021 1 commit
    • M
      ovl: fix warning in ovl_create_real() · 1f5573cf
      Miklos Szeredi committed
      Syzbot triggered the following warning in ovl_workdir_create() ->
      ovl_create_real():
      
      	if (!err && WARN_ON(!newdentry->d_inode)) {
      
      The reason is that the cgroup2 filesystem returns from mkdir without
      instantiating the new dentry.
      
      Weird filesystems such as this will be rejected by overlayfs at a later
      stage during setup, but to prevent such a warning, call ovl_mkdir_real()
      directly from ovl_workdir_create() and reject this case early.
      
      Reported-and-tested-by: syzbot+75eab84fd0af9e8bf66b@syzkaller.appspotmail.com
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      1f5573cf
  10. 17 Aug, 2021 2 commits
  11. 12 Apr, 2021 6 commits
  12. 28 Jan, 2021 2 commits
    • S
      ovl: implement volatile-specific fsync error behaviour · 335d3fc5
      Sargun Dhillon committed
      Overlayfs's volatile option allows the user to bypass all forced sync calls
      to the upperdir filesystem. This comes at the cost of safety. We can never
      ensure that the user's data is intact, but we can make a best effort to
      expose whether or not the data is likely to be in a bad state.
      
      The best way to handle this for the time being is that if an overlayfs's
      upperdir experiences an error after a volatile mount occurs, that error
      will be returned on fsync, fdatasync, sync, and syncfs. This is
      contrary to the traditional behaviour of the VFS, which fails the call
      once, and only raises an error again if a subsequent fsync error has
      occurred and been raised by the filesystem.
      
      One awkward aspect of the patch is that we have to manually set the
      superblock's errseq_t after the sync_fs callback as opposed to just
      returning an error from syncfs. This is because the call chain looks
      something like this:
      
      sys_syncfs ->
      	sync_filesystem ->
      		__sync_filesystem ->
      			/* The return value is ignored here */
      			sb->s_op->sync_fs(sb)
      			_sync_blockdev
      		/* Where the VFS fetches the error to raise to userspace */
      		errseq_check_and_advance
      
      Because of this we call errseq_set every time the sync_fs callback occurs.
      Due to the nature of this seen / unseen dichotomy, if the upperdir is in an
      inconsistent state at the initial mount time, overlayfs will refuse to
      mount, as overlayfs cannot get a snapshot of the upperdir's errseq that
      will increment on error until the user calls syncfs.
      Signed-off-by: Sargun Dhillon <sargun@sargun.me>
      Suggested-by: Amir Goldstein <amir73il@gmail.com>
      Reviewed-by: Amir Goldstein <amir73il@gmail.com>
      Fixes: c86243b0 ("ovl: provide a mount option "volatile"")
      Cc: stable@vger.kernel.org
      Reviewed-by: Vivek Goyal <vgoyal@redhat.com>
      Reviewed-by: Jeff Layton <jlayton@kernel.org>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      335d3fc5
    • M
      ovl: add warning on user_ns mismatch · 9efb069d
      Miklos Szeredi committed
      Currently there's no way to create an overlay filesystem outside of the
      current user namespace.  Make sure that if this assumption changes it
      doesn't go unnoticed.
      Reported-by: "Eric W. Biederman" <ebiederm@xmission.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      9efb069d
  13. 24 Jan, 2021 7 commits
    • C
      overlayfs: do not mount on top of idmapped mounts · 029a52ad
      Christian Brauner committed
      Prevent overlayfs from being mounted on top of idmapped mounts.
      Stacking filesystems need to be prevented from being mounted on top of
      idmapped mounts until they have been converted to handle this.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-29-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: James Morris <jamorris@linux.microsoft.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      029a52ad
    • C
      fs: make helpers idmap mount aware · 549c7297
      Christian Brauner committed
      Extend some inode methods with an additional user namespace argument. A
      filesystem that is aware of idmapped mounts will receive the user
      namespace the mount has been marked with. This can be used for
      additional permission checking and also to enable filesystems to
      translate between uids and gids if they need to. We have implemented all
      relevant helpers in earlier patches.
      
      As requested we simply extend the existing inode methods instead of
      introducing new ones. This is a little more code churn but it's mostly
      mechanical and doesn't leave us with additional inode methods.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-25-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      549c7297
    • T
      xattr: handle idmapped mounts · c7c7a1a1
      Tycho Andersen committed
      When interacting with extended attributes the vfs verifies that the
      caller is privileged over the inode with which the extended attribute is
      associated. For posix access and posix default extended attributes a uid
      or gid can be stored on-disk. Let the functions handle posix extended
      attributes on idmapped mounts. If the inode is accessed through an
      idmapped mount we need to map it according to the mount's user
      namespace. Afterwards the checks are identical to non-idmapped mounts.
      This has no effect for e.g. security xattrs since they don't store uids
      or gids and don't perform permission checks on them like posix acls do.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-10-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: James Morris <jamorris@linux.microsoft.com>
      Signed-off-by: Tycho Andersen <tycho@tycho.pizza>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      c7c7a1a1
    • C
      acl: handle idmapped mounts · e65ce2a5
      Christian Brauner committed
      The posix acl permission checking helpers determine whether a caller is
      privileged over an inode according to the acls associated with the
      inode. Add helpers that make it possible to handle acls on idmapped
      mounts.
      
      The vfs and the filesystems targeted by this first iteration make use of
      posix_acl_fix_xattr_from_user() and posix_acl_fix_xattr_to_user() to
      translate basic posix access and default permissions such as the
      ACL_USER and ACL_GROUP type according to the initial user namespace (or
      the superblock's user namespace) to and from the caller's current user
      namespace. Adapt these two helpers to handle idmapped mounts whereby we
      either map from or into the mount's user namespace depending on in which
      direction we're translating.
      Similarly, cap_convert_nscap() is used by the vfs to translate user
      namespace and non-user namespace aware filesystem capabilities from the
      superblock's user namespace to the caller's user namespace. Enable it to
      handle idmapped mounts by accounting for the mount's user namespace.
      
      In addition the filesystems targeted in the first iteration of this patch
      series make use of the posix_acl_chmod() and posix_acl_update_mode()
      helpers. Both helpers perform permission checks on the target inode. Let
      them handle idmapped mounts. These two helpers are called when posix
      acls are set by the respective filesystems. To handle this case we extend
      the ->set() method to take an additional user namespace argument to pass
      the mount's user namespace down.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-9-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      e65ce2a5
    • C
      attr: handle idmapped mounts · 2f221d6f
      Christian Brauner committed
      When file attributes are changed most filesystems rely on the
      setattr_prepare(), setattr_copy(), and notify_change() helpers for
      initialization and permission checking. Let them handle idmapped mounts.
      If the inode is accessed through an idmapped mount map it into the
      mount's user namespace. Afterwards the checks are identical to
      non-idmapped mounts. If the initial user namespace is passed nothing
      changes so non-idmapped mounts will see identical behavior as before.
      
      Helpers that perform checks on the ia_uid and ia_gid fields in struct
      iattr assume that ia_uid and ia_gid are intended values and have already
      been mapped correctly at the userspace-kernelspace boundary as we
      already do today. If the initial user namespace is passed nothing
      changes so non-idmapped mounts will see identical behavior as before.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-8-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      2f221d6f
    • C
      inode: make init and permission helpers idmapped mount aware · 21cb47be
      Christian Brauner committed
      The inode_owner_or_capable() helper determines whether the caller is the
      owner of the inode or is capable with respect to that inode. Allow it to
      handle idmapped mounts. If the inode is accessed through an idmapped
      mount, map it according to the mount's user namespace. Afterwards the checks
      are identical to non-idmapped mounts. If the initial user namespace is
      passed nothing changes so non-idmapped mounts will see identical
      behavior as before.
      
      Similarly, allow the inode_init_owner() helper to handle idmapped
      mounts. It initializes a new inode on idmapped mounts by mapping the
      fsuid and fsgid of the caller from the mount's user namespace. If the
      initial user namespace is passed nothing changes so non-idmapped mounts
      will see identical behavior as before.
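
The range arithmetic behind such a mapping can be sketched in userspace. This is only an illustration, assuming a single extent in the "inside-first outside-first count" form shown in /proc/<pid>/uid_map; the function names are hypothetical and are not kernel helpers, and the sketch does not claim which direction a given kernel helper applies.

```shell
#!/bin/sh
# Userspace illustration only: models ONE uid_map-style extent
# (inside-first, outside-first, count). Names are hypothetical.

# Map an id from the outside range to its inside value.
# Prints nothing and returns 1 when the id has no mapping.
map_id_down() {
    inside=$1 outside=$2 count=$3 id=$4
    if [ "$id" -ge "$outside" ] && [ "$id" -lt $((outside + count)) ]; then
        echo $((inside + id - outside))
    else
        return 1
    fi
}

# Map an id from the inside range to its outside value.
map_id_up() {
    inside=$1 outside=$2 count=$3 id=$4
    if [ "$id" -ge "$inside" ] && [ "$id" -lt $((inside + count)) ]; then
        echo $((outside + id - inside))
    else
        return 1
    fi
}

# Example extent: ids 0..65535 inside correspond to 100000..165535 outside.
map_id_down 0 100000 65536 100005                  # prints 5
map_id_up   0 100000 65536 5                       # prints 100005
map_id_down 0 100000 65536 42 || echo "unmapped"   # prints unmapped
```

An identity extent (inside-first equal to outside-first) leaves every id in range unchanged, which loosely mirrors the statement that passing the initial user namespace changes nothing.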
      
      Link: https://lore.kernel.org/r/20210121131959.646623-7-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: James Morris <jamorris@linux.microsoft.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      21cb47be
    • C
      capability: handle idmapped mounts · 0558c1bf
      Christian Brauner committed
      In order to determine whether a caller holds privilege over a given
      inode the capability framework exposes the two helpers
      privileged_wrt_inode_uidgid() and capable_wrt_inode_uidgid(). The former
      verifies that the inode has a mapping in the caller's user namespace and
      the latter additionally verifies that the caller has the requested
      capability in their current user namespace.
      If the inode is accessed through an idmapped mount map it into the
      mount's user namespace. Afterwards the checks are identical to
      non-idmapped inodes. If the initial user namespace is passed all
      operations are a nop so non-idmapped mounts will not see a change in
      behavior.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-5-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: James Morris <jamorris@linux.microsoft.com>
      Acked-by: Serge Hallyn <serge@hallyn.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      0558c1bf
  14. 14 Dec, 2020 2 commits
    • M
      ovl: unprivileged mounts · 459c7c56
      Miklos Szeredi committed
      Enable unprivileged user namespace mounts of overlayfs.  Overlayfs's
      permission model (*) ensures that the mounter itself cannot gain additional
      privileges by the act of creating an overlayfs mount.
      
      This feature request is coming from the "rootless" container crowd.
      
      (*) Documentation/filesystems/overlayfs.txt#Permission model
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      459c7c56
    • M
      ovl: user xattr · 2d2f2d73
      Miklos Szeredi committed
      Optionally allow using "user.overlay." namespace instead of
      "trusted.overlay."
      
      This is necessary for overlayfs to be able to be mounted in an unprivileged
      namespace.
      
      Make the option explicit, since it makes the filesystem format be
      incompatible.
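
As a rough illustration of the namespace switch, the sketch below picks the prefix for an overlay private xattr. The helper name is hypothetical and this is not the kernel implementation; "opaque" is used only as an example attribute name.

```shell
#!/bin/sh
# Hypothetical helper: choose the xattr namespace for an overlay private
# attribute. $1 is "userxattr" when that mount option is in effect.
ovl_xattr_name() {
    case $1 in
        userxattr) prefix="user.overlay." ;;
        *)         prefix="trusted.overlay." ;;
    esac
    echo "${prefix}$2"
}

ovl_xattr_name userxattr opaque   # prints user.overlay.opaque
ovl_xattr_name "" opaque          # prints trusted.overlay.opaque
```

Because the attribute names differ, layers written under one namespace are not readable as overlay metadata under the other, which is presumably why the option has to be requested explicitly, e.g. with -o userxattr at mount time.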
      
      Disable redirect_dir and metacopy options, because these would allow
      privilege escalation through direct manipulation of the
      "user.overlay.redirect" or "user.overlay.metacopy" xattrs.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      Reviewed-by: Amir Goldstein <amir73il@gmail.com>
      2d2f2d73
  15. 12 Nov, 2020 2 commits
    • M
      ovl: expand warning in ovl_d_real() · cef4cbff
      Miklos Szeredi committed
      There was a syzbot report with this warning but insufficient information...
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      cef4cbff
    • P
      ovl: introduce new "uuid=off" option for inodes index feature · 5830fb6b
      Pavel Tikhomirov committed
      This replaces uuid with null in overlayfs file handles and thus relaxes
      uuid checks for overlay index feature. It is only possible in case there is
      only one filesystem for all the work/upper/lower directories and bare file
      handles from this backing filesystem are unique. Otherwise, when we have
      multiple filesystems, let's just fall back to "uuid=on", which is
      equivalent to how it worked before with all uuid checks.
      
      This is needed when overlayfs is/was mounted in a container with index
      enabled (e.g.: to be able to resolve inotify watch file handles on it to
      paths in CRIU), and this container is copied and started alongside with the
      original one. This way the "copy" container can't have the same uuid on the
      superblock and mounting the overlayfs from it later would fail.
      
      Here is an example of the problem on top of loop+ext4:
      
      dd if=/dev/zero of=loopbackfile.img bs=100M count=10
      losetup -fP loopbackfile.img
      losetup -a
        #/dev/loop0: [64768]:35 (/loop-test/loopbackfile.img)
      mkfs.ext4 loopbackfile.img
      mkdir loop-mp
      mount -o loop /dev/loop0 loop-mp
      mkdir loop-mp/{lower,upper,work,merged}
      mount -t overlay overlay -oindex=on,lowerdir=loop-mp/lower,\
      upperdir=loop-mp/upper,workdir=loop-mp/work loop-mp/merged
      umount loop-mp/merged
      umount loop-mp
      e2fsck -f /dev/loop0
      tune2fs -U random /dev/loop0
      
      mount -o loop /dev/loop0 loop-mp
      mount -t overlay overlay -oindex=on,lowerdir=loop-mp/lower,\
      upperdir=loop-mp/upper,workdir=loop-mp/work loop-mp/merged
        #mount: /loop-test/loop-mp/merged:
        #mount(2) system call failed: Stale file handle.
      
      If you just change the uuid of the backing filesystem, overlay no longer
      mounts. In Virtuozzo we copy container disks (ploops) when we create a
      copy of a container, and we require the fs uuid to be unique for the new
      container.
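
For comparison with the failing sequence above, a mount line that adds the new option might look like the following. This is a dry-run sketch that only prints the command, since mounting needs privileges; the helper name is hypothetical and the directory layout is taken from the example. The commit message does not say whether an index created under "uuid=on" stays usable after switching, so treat that as untested.

```shell
#!/bin/sh
# Dry-run sketch: print the overlay mount command instead of running it.
# $1 is the base directory holding lower/upper/work/merged, $2 is on|off.
ovl_mount_cmd() {
    base=$1 uuid=$2
    opts="index=on,uuid=${uuid},lowerdir=${base}/lower"
    opts="${opts},upperdir=${base}/upper,workdir=${base}/work"
    echo "mount -t overlay overlay -o ${opts} ${base}/merged"
}

ovl_mount_cmd loop-mp off
```

Per the description, "uuid=off" only applies when all layers live on a single filesystem; with several filesystems the behaviour falls back to "uuid=on".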
      Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
      Reviewed-by: Amir Goldstein <amir73il@gmail.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      5830fb6b
  16. 02 Sep, 2020 4 commits
    • M
      ovl: pass ovl_fs down to functions accessing private xattrs · 610afc0b
      Miklos Szeredi committed
      This paves the way for optionally using the "user.overlay." xattr
      namespace.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      610afc0b
    • M
      ovl: drop flags argument from ovl_do_setxattr() · 26150ab5
      Miklos Szeredi committed
      All callers pass zero flags to ovl_do_setxattr().  So drop this argument.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      26150ab5
    • M
      ovl: adhere to the vfs_ vs. ovl_do_ conventions for xattrs · 71097047
      Miklos Szeredi committed
      Call ovl_do_*xattr() when accessing an overlay private xattr, vfs_*xattr()
      otherwise.
      
      This has an effect on debug output, which is made more consistent by this
      patch.
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      71097047
    • V
      ovl: provide a mount option "volatile" · c86243b0
      Vivek Goyal committed
      Container folks are complaining that dnf/yum issues too many syncs while
      installing packages and that this slows down the image build. The build
      requirement is such that they don't care if a node goes down while a build
      is still going on. In that case, they will simply throw away the unfinished
      layer and start a new build. So they don't care about syncing intermediate
      state to the disk and hence don't want to pay the price associated with sync.
      
      So they are asking for mount options where they can disable sync on overlay
      mount point.
      
      They primarily seem to have two use cases.
      
      - For building images, they will mount overlay with nosync and then sync
        upper layer after unmounting overlay and reuse upper as lower for next
        layer.
      
      - For running containers, they don't seem to care about syncing upper layer
        because if node goes down, they will simply throw away upper layer and
        create a fresh one.
      
      So this patch provides a mount option "volatile" which disables all forms
      of sync. Now it is caller's responsibility to throw away upper if system
      crashes or shuts down and start fresh.
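
The image-build use case could be sketched as the sequence below. It is a dry run that prints the commands rather than executing them, because mounting needs privileges. The helper name, the directory names and the build command are illustrative assumptions, and "sync -f" assumes a GNU coreutils sync that supports syncing the filesystem containing a path.

```shell
#!/bin/sh
# Dry-run sketch of a build on a "volatile" overlay. Nothing is executed;
# $1 is whatever command populates the merged directory (hypothetical).
volatile_build_plan() {
    build_cmd=$1
    echo "mount -t overlay overlay -o volatile,lowerdir=lower,upperdir=upper,workdir=work merged"
    echo "$build_cmd"
    echo "umount merged"
    echo "sync -f upper"   # flush the filesystem holding upper before reuse
}

volatile_build_plan "./build-layer.sh merged"
```

The explicit sync after unmounting matters because "volatile" disables syncing through the overlay itself; if the system crashes before that point, upper is to be discarded rather than reused.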
      
      With "volatile", I am seeing roughly a 20% speedup in my VM where I am just
      installing emacs in an image. Installation time drops from 31 seconds to 25
      seconds when the option is used. This is for the case of building on top
      of an image where all packages are already cached. That way I take the
      network operations latency out of the measurement.
      
      Giuseppe is also looking to cut down on the number of iops done on the disk.
      He reports that in the cloud their VMs are often throttled if they cross
      the limit. This option can help by reducing the number of iops (by
      cutting down on frequent syncs and writebacks).
      Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
      Signed-off-by: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
      c86243b0