1. 24 9月, 2022 1 次提交
  2. 28 7月, 2022 1 次提交
  3. 16 7月, 2022 1 次提交
    • C
      acl: move idmapped mount fixup into vfs_{g,s}etxattr() · 0c5fd887
      Christian Brauner 提交于
      This cycle we added support for mounting overlayfs on top of idmapped mounts.
      Recently I've started looking into potential corner cases when trying to add
      additional tests and I noticed that reporting for POSIX ACLs is currently wrong
      when using idmapped layers with overlayfs mounted on top of it.
      
      I'm going to give a rather detailed explanation to both the origin of the
      problem and the solution.
      
      Let's assume the user creates the following directory layout and they have a
      rootfs /var/lib/lxc/c1/rootfs. The files in this rootfs are owned as you would
      expect files on your host system to be owned. For example, ~/.bashrc for your
      regular user would be owned by 1000:1000 and /root/.bashrc would be owned by
      0:0. IOW, this is just regular boring filesystem tree on an ext4 or xfs
      filesystem.
      
      The user chooses to set POSIX ACLs using the setfacl binary granting the user
      with uid 4 read, write, and execute permissions for their .bashrc file:
      
              setfacl -m u:4:rwx /var/lib/lxc/c2/rootfs/home/ubuntu/.bashrc
      
      Now they to expose the whole rootfs to a container using an idmapped mount. So
      they first create:
      
              mkdir -pv /vol/contpool/{ctrover,merge,lowermap,overmap}
              mkdir -pv /vol/contpool/ctrover/{over,work}
              chown 10000000:10000000 /vol/contpool/ctrover/{over,work}
      
      The user now creates an idmapped mount for the rootfs:
      
              mount-idmapped/mount-idmapped --map-mount=b:0:10000000:65536 \
                                            /var/lib/lxc/c2/rootfs \
                                            /vol/contpool/lowermap
      
      This for example makes it so that /var/lib/lxc/c2/rootfs/home/ubuntu/.bashrc
      which is owned by uid and gid 1000 as being owned by uid and gid 10001000 at
      /vol/contpool/lowermap/home/ubuntu/.bashrc.
      
      Assume the user wants to expose these idmapped mounts through an overlayfs
      mount to a container.
      
             mount -t overlay overlay                      \
                   -o lowerdir=/vol/contpool/lowermap,     \
                      upperdir=/vol/contpool/overmap/over, \
                      workdir=/vol/contpool/overmap/work   \
                   /vol/contpool/merge
      
      The user can do this in two ways:
      
      (1) Mount overlayfs in the initial user namespace and expose it to the
          container.
      (2) Mount overlayfs on top of the idmapped mounts inside of the container's
          user namespace.
      
      Let's assume the user chooses the (1) option and mounts overlayfs on the host
      and then changes into a container which uses the idmapping 0:10000000:65536
      which is the same used for the two idmapped mounts.
      
      Now the user tries to retrieve the POSIX ACLs using the getfacl command
      
              getfacl -n /vol/contpool/lowermap/home/ubuntu/.bashrc
      
      and to their surprise they see:
      
              # file: vol/contpool/merge/home/ubuntu/.bashrc
              # owner: 1000
              # group: 1000
              user::rw-
              user:4294967295:rwx
              group::r--
              mask::rwx
              other::r--
      
      indicating the the uid wasn't correctly translated according to the idmapped
      mount. The problem is how we currently translate POSIX ACLs. Let's inspect the
      callchain in this example:
      
              idmapped mount /vol/contpool/merge:      0:10000000:65536
              caller's idmapping:                      0:10000000:65536
              overlayfs idmapping (ofs->creator_cred): 0:0:4k /* initial idmapping */
      
              sys_getxattr()
              -> path_getxattr()
                 -> getxattr()
                    -> do_getxattr()
                        |> vfs_getxattr()
                        |  -> __vfs_getxattr()
                        |     -> handler->get == ovl_posix_acl_xattr_get()
                        |        -> ovl_xattr_get()
                        |           -> vfs_getxattr()
                        |              -> __vfs_getxattr()
                        |                 -> handler->get() /* lower filesystem callback */
                        |> posix_acl_fix_xattr_to_user()
                           {
                                    4 = make_kuid(&init_user_ns, 4);
                                    4 = mapped_kuid_fs(&init_user_ns /* no idmapped mount */, 4);
                                    /* FAILURE */
                                   -1 = from_kuid(0:10000000:65536 /* caller's idmapping */, 4);
                           }
      
      If the user chooses to use option (2) and mounts overlayfs on top of idmapped
      mounts inside the container things don't look that much better:
      
              idmapped mount /vol/contpool/merge:      0:10000000:65536
              caller's idmapping:                      0:10000000:65536
              overlayfs idmapping (ofs->creator_cred): 0:10000000:65536
      
              sys_getxattr()
              -> path_getxattr()
                 -> getxattr()
                    -> do_getxattr()
                        |> vfs_getxattr()
                        |  -> __vfs_getxattr()
                        |     -> handler->get == ovl_posix_acl_xattr_get()
                        |        -> ovl_xattr_get()
                        |           -> vfs_getxattr()
                        |              -> __vfs_getxattr()
                        |                 -> handler->get() /* lower filesystem callback */
                        |> posix_acl_fix_xattr_to_user()
                           {
                                    4 = make_kuid(&init_user_ns, 4);
                                    4 = mapped_kuid_fs(&init_user_ns, 4);
                                    /* FAILURE */
                                   -1 = from_kuid(0:10000000:65536 /* caller's idmapping */, 4);
                           }
      
      As is easily seen the problem arises because the idmapping of the lower mount
      isn't taken into account as all of this happens in do_gexattr(). But
      do_getxattr() is always called on an overlayfs mount and inode and thus cannot
      possible take the idmapping of the lower layers into account.
      
      This problem is similar for fscaps but there the translation happens as part of
      vfs_getxattr() already. Let's walk through an fscaps overlayfs callchain:
      
              setcap 'cap_net_raw+ep' /var/lib/lxc/c2/rootfs/home/ubuntu/.bashrc
      
      The expected outcome here is that we'll receive the cap_net_raw capability as
      we are able to map the uid associated with the fscap to 0 within our container.
      IOW, we want to see 0 as the result of the idmapping translations.
      
      If the user chooses option (1) we get the following callchain for fscaps:
      
              idmapped mount /vol/contpool/merge:      0:10000000:65536
              caller's idmapping:                      0:10000000:65536
              overlayfs idmapping (ofs->creator_cred): 0:0:4k /* initial idmapping */
      
              sys_getxattr()
              -> path_getxattr()
                 -> getxattr()
                    -> do_getxattr()
                         -> vfs_getxattr()
                            -> xattr_getsecurity()
                               -> security_inode_getsecurity()                                       ________________________________
                                  -> cap_inode_getsecurity()                                         |                              |
                                     {                                                               V                              |
                                              10000000 = make_kuid(0:0:4k /* overlayfs idmapping */, 10000000);                     |
                                              10000000 = mapped_kuid_fs(0:0:4k /* no idmapped mount */, 10000000);                  |
                                                     /* Expected result is 0 and thus that we own the fscap. */                     |
                                                     0 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000000);            |
                                     }                                                                                              |
                                     -> vfs_getxattr_alloc()                                                                        |
                                        -> handler->get == ovl_other_xattr_get()                                                    |
                                           -> vfs_getxattr()                                                                        |
                                              -> xattr_getsecurity()                                                                |
                                                 -> security_inode_getsecurity()                                                    |
                                                    -> cap_inode_getsecurity()                                                      |
                                                       {                                                                            |
                                                                      0 = make_kuid(0:0:4k /* lower s_user_ns */, 0);               |
                                                               10000000 = mapped_kuid_fs(0:10000000:65536 /* idmapped mount */, 0); |
                                                               10000000 = from_kuid(0:0:4k /* overlayfs idmapping */, 10000000);    |
                                                               |____________________________________________________________________|
                                                       }
                                                       -> vfs_getxattr_alloc()
                                                          -> handler->get == /* lower filesystem callback */
      
      And if the user chooses option (2) we get:
      
              idmapped mount /vol/contpool/merge:      0:10000000:65536
              caller's idmapping:                      0:10000000:65536
              overlayfs idmapping (ofs->creator_cred): 0:10000000:65536
      
              sys_getxattr()
              -> path_getxattr()
                 -> getxattr()
                    -> do_getxattr()
                         -> vfs_getxattr()
                            -> xattr_getsecurity()
                               -> security_inode_getsecurity()                                                _______________________________
                                  -> cap_inode_getsecurity()                                                  |                             |
                                     {                                                                        V                             |
                                             10000000 = make_kuid(0:10000000:65536 /* overlayfs idmapping */, 0);                           |
                                             10000000 = mapped_kuid_fs(0:0:4k /* no idmapped mount */, 10000000);                           |
                                                     /* Expected result is 0 and thus that we own the fscap. */                             |
                                                    0 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000000);                     |
                                     }                                                                                                      |
                                     -> vfs_getxattr_alloc()                                                                                |
                                        -> handler->get == ovl_other_xattr_get()                                                            |
                                          |-> vfs_getxattr()                                                                                |
                                              -> xattr_getsecurity()                                                                        |
                                                 -> security_inode_getsecurity()                                                            |
                                                    -> cap_inode_getsecurity()                                                              |
                                                       {                                                                                    |
                                                                       0 = make_kuid(0:0:4k /* lower s_user_ns */, 0);                      |
                                                                10000000 = mapped_kuid_fs(0:10000000:65536 /* idmapped mount */, 0);        |
                                                                       0 = from_kuid(0:10000000:65536 /* overlayfs idmapping */, 10000000); |
                                                                       |____________________________________________________________________|
                                                       }
                                                       -> vfs_getxattr_alloc()
                                                          -> handler->get == /* lower filesystem callback */
      
      We can see how the translation happens correctly in those cases as the
      conversion happens within the vfs_getxattr() helper.
      
      For POSIX ACLs we need to do something similar. However, in contrast to fscaps
      we cannot apply the fix directly to the kernel internal posix acl data
      structure as this would alter the cached values and would also require a rework
      of how we currently deal with POSIX ACLs in general which almost never take the
      filesystem idmapping into account (the noteable exception being FUSE but even
      there the implementation is special) and instead retrieve the raw values based
      on the initial idmapping.
      
      The correct values are then generated right before returning to userspace. The
      fix for this is to move taking the mount's idmapping into account directly in
      vfs_getxattr() instead of having it be part of posix_acl_fix_xattr_to_user().
      
      To this end we split out two small and unexported helpers
      posix_acl_getxattr_idmapped_mnt() and posix_acl_setxattr_idmapped_mnt(). The
      former to be called in vfs_getxattr() and the latter to be called in
      vfs_setxattr().
      
      Let's go back to the original example. Assume the user chose option (1) and
      mounted overlayfs on top of idmapped mounts on the host:
      
              idmapped mount /vol/contpool/merge:      0:10000000:65536
              caller's idmapping:                      0:10000000:65536
              overlayfs idmapping (ofs->creator_cred): 0:0:4k /* initial idmapping */
      
              sys_getxattr()
              -> path_getxattr()
                 -> getxattr()
                    -> do_getxattr()
                        |> vfs_getxattr()
                        |  |> __vfs_getxattr()
                        |  |  -> handler->get == ovl_posix_acl_xattr_get()
                        |  |     -> ovl_xattr_get()
                        |  |        -> vfs_getxattr()
                        |  |           |> __vfs_getxattr()
                        |  |           |  -> handler->get() /* lower filesystem callback */
                        |  |           |> posix_acl_getxattr_idmapped_mnt()
                        |  |              {
                        |  |                              4 = make_kuid(&init_user_ns, 4);
                        |  |                       10000004 = mapped_kuid_fs(0:10000000:65536 /* lower idmapped mount */, 4);
                        |  |                       10000004 = from_kuid(&init_user_ns, 10000004);
                        |  |                       |_______________________
                        |  |              }                               |
                        |  |                                              |
                        |  |> posix_acl_getxattr_idmapped_mnt()           |
                        |     {                                           |
                        |                                                 V
                        |             10000004 = make_kuid(&init_user_ns, 10000004);
                        |             10000004 = mapped_kuid_fs(&init_user_ns /* no idmapped mount */, 10000004);
                        |             10000004 = from_kuid(&init_user_ns, 10000004);
                        |     }       |_________________________________________________
                        |                                                              |
                        |                                                              |
                        |> posix_acl_fix_xattr_to_user()                               |
                           {                                                           V
                                       10000004 = make_kuid(0:0:4k /* init_user_ns */, 10000004);
                                              /* SUCCESS */
                                              4 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000004);
                           }
      
      And similarly if the user chooses option (1) and mounted overayfs on top of
      idmapped mounts inside the container:
      
              idmapped mount /vol/contpool/merge:      0:10000000:65536
              caller's idmapping:                      0:10000000:65536
              overlayfs idmapping (ofs->creator_cred): 0:10000000:65536
      
              sys_getxattr()
              -> path_getxattr()
                 -> getxattr()
                    -> do_getxattr()
                        |> vfs_getxattr()
                        |  |> __vfs_getxattr()
                        |  |  -> handler->get == ovl_posix_acl_xattr_get()
                        |  |     -> ovl_xattr_get()
                        |  |        -> vfs_getxattr()
                        |  |           |> __vfs_getxattr()
                        |  |           |  -> handler->get() /* lower filesystem callback */
                        |  |           |> posix_acl_getxattr_idmapped_mnt()
                        |  |              {
                        |  |                              4 = make_kuid(&init_user_ns, 4);
                        |  |                       10000004 = mapped_kuid_fs(0:10000000:65536 /* lower idmapped mount */, 4);
                        |  |                       10000004 = from_kuid(&init_user_ns, 10000004);
                        |  |                       |_______________________
                        |  |              }                               |
                        |  |                                              |
                        |  |> posix_acl_getxattr_idmapped_mnt()           |
                        |     {                                           V
                        |             10000004 = make_kuid(&init_user_ns, 10000004);
                        |             10000004 = mapped_kuid_fs(&init_user_ns /* no idmapped mount */, 10000004);
                        |             10000004 = from_kuid(0(&init_user_ns, 10000004);
                        |             |_________________________________________________
                        |     }                                                        |
                        |                                                              |
                        |> posix_acl_fix_xattr_to_user()                               |
                           {                                                           V
                                       10000004 = make_kuid(0:0:4k /* init_user_ns */, 10000004);
                                              /* SUCCESS */
                                              4 = from_kuid(0:10000000:65536 /* caller's idmappings */, 10000004);
                           }
      
      The last remaining problem we need to fix here is ovl_get_acl(). During
      ovl_permission() overlayfs will call:
      
              ovl_permission()
              -> generic_permission()
                 -> acl_permission_check()
                    -> check_acl()
                       -> get_acl()
                          -> inode->i_op->get_acl() == ovl_get_acl()
                              > get_acl() /* on the underlying filesystem)
                                ->inode->i_op->get_acl() == /*lower filesystem callback */
                       -> posix_acl_permission()
      
      passing through the get_acl request to the underlying filesystem. This will
      retrieve the acls stored in the lower filesystem without taking the idmapping
      of the underlying mount into account as this would mean altering the cached
      values for the lower filesystem. So we block using ACLs for now until we
      decided on a nice way to fix this. Note this limitation both in the
      documentation and in the code.
      
      The most straightforward solution would be to have ovl_get_acl() simply
      duplicate the ACLs, update the values according to the idmapped mount and
      return it to acl_permission_check() so it can be used in posix_acl_permission()
      forgetting them afterwards. This is a bit heavy handed but fairly
      straightforward otherwise.
      
      Link: https://github.com/brauner/mount-idmapped/issues/9
      Link: https://lore.kernel.org/r/20220708090134.385160-2-brauner@kernel.org
      Cc: Seth Forshee <sforshee@digitalocean.com>
      Cc: Amir Goldstein <amir73il@gmail.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Cc: Miklos Szeredi <mszeredi@redhat.com>
      Cc: linux-unionfs@vger.kernel.org
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: NSeth Forshee <sforshee@digitalocean.com>
      Signed-off-by: NChristian Brauner (Microsoft) <brauner@kernel.org>
      0c5fd887
  4. 27 6月, 2022 1 次提交
    • C
      attr: port attribute changes to new types · b27c82e1
      Christian Brauner 提交于
      Now that we introduced new infrastructure to increase the type safety
      for filesystems supporting idmapped mounts port the first part of the
      vfs over to them.
      
      This ports the attribute changes codepaths to rely on the new better
      helpers using a dedicated type.
      
      Before this change we used to take a shortcut and place the actual
      values that would be written to inode->i_{g,u}id into struct iattr. This
      had the advantage that we moved idmappings mostly out of the picture
      early on but it made reasoning about changes more difficult than it
      should be.
      
      The filesystem was never explicitly told that it dealt with an idmapped
      mount. The transition to the value that needed to be stored in
      inode->i_{g,u}id appeared way too early and increased the probability of
      bugs in various codepaths.
      
      We know place the same value in struct iattr no matter if this is an
      idmapped mount or not. The vfs will only deal with type safe
      vfs{g,u}id_t. This makes it massively safer to perform permission checks
      as the type will tell us what checks we need to perform and what helpers
      we need to use.
      
      Fileystems raising FS_ALLOW_IDMAP can't simply write ia_vfs{g,u}id to
      inode->i_{g,u}id since they are different types. Instead they need to
      use the dedicated vfs{g,u}id_to_k{g,u}id() helpers that map the
      vfs{g,u}id into the filesystem.
      
      The other nice effect is that filesystems like overlayfs don't need to
      care about idmappings explicitly anymore and can simply set up struct
      iattr accordingly directly.
      
      Link: https://lore.kernel.org/lkml/CAHk-=win6+ahs1EwLkcq8apqLi_1wXFWbrPf340zYEhObpz4jA@mail.gmail.com [1]
      Link: https://lore.kernel.org/r/20220621141454.2914719-9-brauner@kernel.org
      Cc: Seth Forshee <sforshee@digitalocean.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      CC: linux-fsdevel@vger.kernel.org
      Reviewed-by: NSeth Forshee <sforshee@digitalocean.com>
      Signed-off-by: NChristian Brauner (Microsoft) <brauner@kernel.org>
      b27c82e1
  5. 28 4月, 2022 10 次提交
  6. 04 11月, 2021 1 次提交
    • M
      ovl: fix warning in ovl_create_real() · 1f5573cf
      Miklos Szeredi 提交于
      Syzbot triggered the following warning in ovl_workdir_create() ->
      ovl_create_real():
      
      	if (!err && WARN_ON(!newdentry->d_inode)) {
      
      The reason is that the cgroup2 filesystem returns from mkdir without
      instantiating the new dentry.
      
      Weird filesystems such as this will be rejected by overlayfs at a later
      stage during setup, but to prevent such a warning, call ovl_mkdir_real()
      directly from ovl_workdir_create() and reject this case early.
      
      Reported-and-tested-by: syzbot+75eab84fd0af9e8bf66b@syzkaller.appspotmail.com
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      1f5573cf
  7. 19 8月, 2021 1 次提交
  8. 17 8月, 2021 4 次提交
  9. 12 4月, 2021 4 次提交
  10. 28 1月, 2021 1 次提交
    • S
      ovl: implement volatile-specific fsync error behaviour · 335d3fc5
      Sargun Dhillon 提交于
      Overlayfs's volatile option allows the user to bypass all forced sync calls
      to the upperdir filesystem. This comes at the cost of safety. We can never
      ensure that the user's data is intact, but we can make a best effort to
      expose whether or not the data is likely to be in a bad state.
      
      The best way to handle this in the time being is that if an overlayfs's
      upperdir experiences an error after a volatile mount occurs, that error
      will be returned on fsync, fdatasync, sync, and syncfs. This is
      contradictory to the traditional behaviour of VFS which fails the call
      once, and only raises an error if a subsequent fsync error has occurred,
      and been raised by the filesystem.
      
      One awkward aspect of the patch is that we have to manually set the
      superblock's errseq_t after the sync_fs callback as opposed to just
      returning an error from syncfs. This is because the call chain looks
      something like this:
      
      sys_syncfs ->
      	sync_filesystem ->
      		__sync_filesystem ->
      			/* The return value is ignored here
      			sb->s_op->sync_fs(sb)
      			_sync_blockdev
      		/* Where the VFS fetches the error to raise to userspace */
      		errseq_check_and_advance
      
      Because of this we call errseq_set every time the sync_fs callback occurs.
      Due to the nature of this seen / unseen dichotomy, if the upperdir is an
      inconsistent state at the initial mount time, overlayfs will refuse to
      mount, as overlayfs cannot get a snapshot of the upperdir's errseq that
      will increment on error until the user calls syncfs.
      Signed-off-by: NSargun Dhillon <sargun@sargun.me>
      Suggested-by: NAmir Goldstein <amir73il@gmail.com>
      Reviewed-by: NAmir Goldstein <amir73il@gmail.com>
      Fixes: c86243b0 ("ovl: provide a mount option "volatile"")
      Cc: stable@vger.kernel.org
      Reviewed-by: NVivek Goyal <vgoyal@redhat.com>
      Reviewed-by: NJeff Layton <jlayton@kernel.org>
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      335d3fc5
  11. 24 1月, 2021 4 次提交
  12. 14 12月, 2020 1 次提交
    • M
      ovl: user xattr · 2d2f2d73
      Miklos Szeredi 提交于
      Optionally allow using "user.overlay." namespace instead of
      "trusted.overlay."
      
      This is necessary for overlayfs to be able to be mounted in an unprivileged
      namepsace.
      
      Make the option explicit, since it makes the filesystem format be
      incompatible.
      
      Disable redirect_dir and metacopy options, because these would allow
      privilege escalation through direct manipulation of the
      "user.overlay.redirect" or "user.overlay.metacopy" xattrs.
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      Reviewed-by: NAmir Goldstein <amir73il@gmail.com>
      2d2f2d73
  13. 12 11月, 2020 1 次提交
  14. 06 10月, 2020 1 次提交
  15. 02 9月, 2020 5 次提交
  16. 16 7月, 2020 1 次提交
  17. 04 6月, 2020 1 次提交
  18. 03 6月, 2020 1 次提交
    • A
      ovl: fix out of bounds access warning in ovl_check_fb_len() · 522f6e6c
      Amir Goldstein 提交于
      syzbot reported out of bounds memory access from open_by_handle_at()
      with a crafted file handle that looks like this:
      
        { .handle_bytes = 2, .handle_type = OVL_FILEID_V1 }
      
      handle_bytes gets rounded down to 0 and we end up calling:
        ovl_check_fh_len(fh, 0) => ovl_check_fb_len(fh + 3, -3)
      
      But fh buffer is only 2 bytes long, so accessing struct ovl_fb at
      fh + 3 is illegal.
      
      Fixes: cbe7fba8 ("ovl: make sure that real fid is 32bit aligned in memory")
      Reported-and-tested-by: syzbot+61958888b1c60361a791@syzkaller.appspotmail.com
      Cc: <stable@vger.kernel.org> # v5.5
      Signed-off-by: NAmir Goldstein <amir73il@gmail.com>
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      522f6e6c