1. 25 7月, 2022 15 次提交
  2. 21 7月, 2022 2 次提交
  3. 16 7月, 2022 4 次提交
  4. 15 7月, 2022 1 次提交
  5. 14 7月, 2022 4 次提交
  6. 12 7月, 2022 6 次提交
  7. 09 7月, 2022 4 次提交
    • J
      io_uring: check that we have a file table when allocating update slots · d785a773
      Jens Axboe 提交于
      If IORING_FILE_INDEX_ALLOC is set asking for an allocated slot, the
      helper doesn't check if we actually have a file table or not. The non
      alloc path does do that correctly, and returns -ENXIO if we haven't set
      one up.
      
      Do the same for the allocated path, avoiding a NULL pointer dereference
      when trying to find a free bit.
      
      Fixes: a7c41b46 ("io_uring: let IORING_OP_FILES_UPDATE support choosing fixed file slots")
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      d785a773
    • N
      btrfs: zoned: drop optimization of zone finish · b3a3b025
      Naohiro Aota 提交于
      We have an optimization in do_zone_finish() to send REQ_OP_ZONE_FINISH only
      when necessary, i.e. we don't send REQ_OP_ZONE_FINISH when we assume we
      wrote fully into the zone.
      
      The assumption is determined by "alloc_offset == capacity". This condition
      won't work if the last ordered extent is canceled due to some errors. In
      that case, we consider the zone is deactivated without sending the finish
      command while it's still active.
      
      This inconstancy results in activating another block group while we cannot
      really activate the underlying zone, which causes the active zone exceeds
      errors like below.
      
          BTRFS error (device nvme3n2): allocation failed flags 1, wanted 520192 tree-log 0, relocation: 0
          nvme3n2: I/O Cmd(0x7d) @ LBA 160432128, 127 blocks, I/O Error (sct 0x1 / sc 0xbd) MORE DNR
          active zones exceeded error, dev nvme3n2, sector 0 op 0xd:(ZONE_APPEND) flags 0x4800 phys_seg 1 prio class 0
          nvme3n2: I/O Cmd(0x7d) @ LBA 160432128, 127 blocks, I/O Error (sct 0x1 / sc 0xbd) MORE DNR
          active zones exceeded error, dev nvme3n2, sector 0 op 0xd:(ZONE_APPEND) flags 0x4800 phys_seg 1 prio class 0
      
      Fix the issue by removing the optimization for now.
      
      Fixes: 8376d9e1 ("btrfs: zoned: finish superblock zone once no space left for new SB")
      Reviewed-by: NJohannes Thumshirn <johannes.thumshirn@wdc.com>
      Signed-off-by: NNaohiro Aota <naohiro.aota@wdc.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      b3a3b025
    • C
      btrfs: zoned: fix a leaked bioc in read_zone_info · 29634578
      Christoph Hellwig 提交于
      The bioc would leak on the normal completion path and also on the RAID56
      check (but that one won't happen in practice due to the invalid
      combination with zoned mode).
      
      Fixes: 7db1c5d1 ("btrfs: zoned: support dev-replace in zoned filesystems")
      CC: stable@vger.kernel.org # 5.16+
      Reviewed-by: NAnand Jain <anand.jain@oracle.com>
      Signed-off-by: NChristoph Hellwig <hch@lst.de>
      [ update changelog ]
      Reviewed-by: NDavid Sterba <dsterba@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      29634578
    • F
      btrfs: return -EAGAIN for NOWAIT dio reads/writes on compressed and inline extents · a4527e18
      Filipe Manana 提交于
      When doing a direct IO read or write, we always return -ENOTBLK when we
      find a compressed extent (or an inline extent) so that we fallback to
      buffered IO. This however is not ideal in case we are in a NOWAIT context
      (io_uring for example), because buffered IO can block and we currently
      have no support for NOWAIT semantics for buffered IO, so if we need to
      fallback to buffered IO we should first signal the caller that we may
      need to block by returning -EAGAIN instead.
      
      This behaviour can also result in short reads being returned to user
      space, which although it's not incorrect and user space should be able
      to deal with partial reads, it's somewhat surprising and even some popular
      applications like QEMU (Link tag #1) and MariaDB (Link tag #2) don't
      deal with short reads properly (or at all).
      
      The short read case happens when we try to read from a range that has a
      non-compressed and non-inline extent followed by a compressed extent.
      After having read the first extent, when we find the compressed extent we
      return -ENOTBLK from btrfs_dio_iomap_begin(), which results in iomap to
      treat the request as a short read, returning 0 (success) and waiting for
      previously submitted bios to complete (this happens at
      fs/iomap/direct-io.c:__iomap_dio_rw()). After that, and while at
      btrfs_file_read_iter(), we call filemap_read() to use buffered IO to
      read the remaining data, and pass it the number of bytes we were able to
      read with direct IO. Than at filemap_read() if we get a page fault error
      when accessing the read buffer, we return a partial read instead of an
      -EFAULT error, because the number of bytes previously read is greater
      than zero.
      
      So fix this by returning -EAGAIN for NOWAIT direct IO when we find a
      compressed or an inline extent.
      Reported-by: NDominique MARTINET <dominique.martinet@atmark-techno.com>
      Link: https://lore.kernel.org/linux-btrfs/YrrFGO4A1jS0GI0G@atmark-techno.com/
      Link: https://jira.mariadb.org/browse/MDEV-27900?focusedCommentId=216582&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-216582Tested-by: NDominique MARTINET <dominique.martinet@atmark-techno.com>
      CC: stable@vger.kernel.org # 5.10+
      Reviewed-by: NChristoph Hellwig <hch@lst.de>
      Signed-off-by: NFilipe Manana <fdmanana@suse.com>
      Signed-off-by: NDavid Sterba <dsterba@suse.com>
      a4527e18
  8. 08 7月, 2022 2 次提交
    • C
      ovl: turn of SB_POSIXACL with idmapped layers temporarily · 4a47c638
      Christian Brauner 提交于
      This cycle we added support for mounting overlayfs on top of idmapped
      mounts.  Recently I've started looking into potential corner cases when
      trying to add additional tests and I noticed that reporting for POSIX ACLs
      is currently wrong when using idmapped layers with overlayfs mounted on top
      of it.
      
      I have sent out an patch that fixes this and makes POSIX ACLs work
      correctly but the patch is a bit bigger and we're already at -rc5 so I
      recommend we simply don't raise SB_POSIXACL when idmapped layers are
      used. Then we can fix the VFS part described below for the next merge
      window so we can have good exposure in -next.
      
      I'm going to give a rather detailed explanation to both the origin of the
      problem and mention the solution so people know what's going on.
      
      Let's assume the user creates the following directory layout and they have
      a rootfs /var/lib/lxc/c1/rootfs. The files in this rootfs are owned as you
      would expect files on your host system to be owned. For example, ~/.bashrc
      for your regular user would be owned by 1000:1000 and /root/.bashrc would
      be owned by 0:0. IOW, this is just regular boring filesystem tree on an
      ext4 or xfs filesystem.
      
      The user chooses to set POSIX ACLs using the setfacl binary granting the
      user with uid 4 read, write, and execute permissions for their .bashrc
      file:
      
              setfacl -m u:4:rwx /var/lib/lxc/c2/rootfs/home/ubuntu/.bashrc
      
      Now they to expose the whole rootfs to a container using an idmapped
      mount. So they first create:
      
              mkdir -pv /vol/contpool/{ctrover,merge,lowermap,overmap}
              mkdir -pv /vol/contpool/ctrover/{over,work}
              chown 10000000:10000000 /vol/contpool/ctrover/{over,work}
      
      The user now creates an idmapped mount for the rootfs:
      
              mount-idmapped/mount-idmapped --map-mount=b:0:10000000:65536 \
                                            /var/lib/lxc/c2/rootfs \
                                            /vol/contpool/lowermap
      
      This for example makes it so that
      /var/lib/lxc/c2/rootfs/home/ubuntu/.bashrc which is owned by uid and gid
      1000 as being owned by uid and gid 10001000 at
      /vol/contpool/lowermap/home/ubuntu/.bashrc.
      
      Assume the user wants to expose these idmapped mounts through an overlayfs
      mount to a container.
      
             mount -t overlay overlay                      \
                   -o lowerdir=/vol/contpool/lowermap,     \
                      upperdir=/vol/contpool/overmap/over, \
                      workdir=/vol/contpool/overmap/work   \
                   /vol/contpool/merge
      
      The user can do this in two ways:
      
      (1) Mount overlayfs in the initial user namespace and expose it to the
          container.
      
      (2) Mount overlayfs on top of the idmapped mounts inside of the container's
          user namespace.
      
      Let's assume the user chooses the (1) option and mounts overlayfs on the
      host and then changes into a container which uses the idmapping
      0:10000000:65536 which is the same used for the two idmapped mounts.
      
      Now the user tries to retrieve the POSIX ACLs using the getfacl command
      
              getfacl -n /vol/contpool/lowermap/home/ubuntu/.bashrc
      
      and to their surprise they see:
      
              # file: vol/contpool/merge/home/ubuntu/.bashrc
              # owner: 1000
              # group: 1000
              user::rw-
              user:4294967295:rwx
              group::r--
              mask::rwx
              other::r--
      
      indicating the uid wasn't correctly translated according to the idmapped
      mount. The problem is how we currently translate POSIX ACLs. Let's inspect
      the callchain in this example:
      
              idmapped mount /vol/contpool/merge:      0:10000000:65536
              caller's idmapping:                      0:10000000:65536
              overlayfs idmapping (ofs->creator_cred): 0:0:4k /* initial idmapping */
      
              sys_getxattr()
              -> path_getxattr()
                 -> getxattr()
                    -> do_getxattr()
                        |> vfs_getxattr()
                        |  -> __vfs_getxattr()
                        |     -> handler->get == ovl_posix_acl_xattr_get()
                        |        -> ovl_xattr_get()
                        |           -> vfs_getxattr()
                        |              -> __vfs_getxattr()
                        |                 -> handler->get() /* lower filesystem callback */
                        |> posix_acl_fix_xattr_to_user()
                           {
                                    4 = make_kuid(&init_user_ns, 4);
                                    4 = mapped_kuid_fs(&init_user_ns /* no idmapped mount */, 4);
                                    /* FAILURE */
                                   -1 = from_kuid(0:10000000:65536 /* caller's idmapping */, 4);
                           }
      
      If the user chooses to use option (2) and mounts overlayfs on top of
      idmapped mounts inside the container things don't look that much better:
      
              idmapped mount /vol/contpool/merge:      0:10000000:65536
              caller's idmapping:                      0:10000000:65536
              overlayfs idmapping (ofs->creator_cred): 0:10000000:65536
      
              sys_getxattr()
              -> path_getxattr()
                 -> getxattr()
                    -> do_getxattr()
                        |> vfs_getxattr()
                        |  -> __vfs_getxattr()
                        |     -> handler->get == ovl_posix_acl_xattr_get()
                        |        -> ovl_xattr_get()
                        |           -> vfs_getxattr()
                        |              -> __vfs_getxattr()
                        |                 -> handler->get() /* lower filesystem callback */
                        |> posix_acl_fix_xattr_to_user()
                           {
                                    4 = make_kuid(&init_user_ns, 4);
                                    4 = mapped_kuid_fs(&init_user_ns, 4);
                                    /* FAILURE */
                                   -1 = from_kuid(0:10000000:65536 /* caller's idmapping */, 4);
                           }
      
      As is easily seen the problem arises because the idmapping of the lower
      mount isn't taken into account as all of this happens in do_gexattr(). But
      do_getxattr() is always called on an overlayfs mount and inode and thus
      cannot possible take the idmapping of the lower layers into account.
      
      This problem is similar for fscaps but there the translation happens as
      part of vfs_getxattr() already. Let's walk through an fscaps overlayfs
      callchain:
      
              setcap 'cap_net_raw+ep' /var/lib/lxc/c2/rootfs/home/ubuntu/.bashrc
      
      The expected outcome here is that we'll receive the cap_net_raw capability
      as we are able to map the uid associated with the fscap to 0 within our
      container.  IOW, we want to see 0 as the result of the idmapping
      translations.
      
      If the user chooses option (1) we get the following callchain for fscaps:
      
              idmapped mount /vol/contpool/merge:      0:10000000:65536
              caller's idmapping:                      0:10000000:65536
              overlayfs idmapping (ofs->creator_cred): 0:0:4k /* initial idmapping */
      
              sys_getxattr()
              -> path_getxattr()
                 -> getxattr()
                    -> do_getxattr()
                         -> vfs_getxattr()
                            -> xattr_getsecurity()
                               -> security_inode_getsecurity()                                       ________________________________
                                  -> cap_inode_getsecurity()                                         |                              |
                                     {                                                               V                              |
                                              10000000 = make_kuid(0:0:4k /* overlayfs idmapping */, 10000000);                     |
                                              10000000 = mapped_kuid_fs(0:0:4k /* no idmapped mount */, 10000000);                  |
                                                     /* Expected result is 0 and thus that we own the fscap. */                     |
                                                     0 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000000);            |
                                     }                                                                                              |
                                     -> vfs_getxattr_alloc()                                                                        |
                                        -> handler->get == ovl_other_xattr_get()                                                    |
                                           -> vfs_getxattr()                                                                        |
                                              -> xattr_getsecurity()                                                                |
                                                 -> security_inode_getsecurity()                                                    |
                                                    -> cap_inode_getsecurity()                                                      |
                                                       {                                                                            |
                                                                      0 = make_kuid(0:0:4k /* lower s_user_ns */, 0);               |
                                                               10000000 = mapped_kuid_fs(0:10000000:65536 /* idmapped mount */, 0); |
                                                               10000000 = from_kuid(0:0:4k /* overlayfs idmapping */, 10000000);    |
                                                               |____________________________________________________________________|
                                                       }
                                                       -> vfs_getxattr_alloc()
                                                          -> handler->get == /* lower filesystem callback */
      
      And if the user chooses option (2) we get:
      
              idmapped mount /vol/contpool/merge:      0:10000000:65536
              caller's idmapping:                      0:10000000:65536
              overlayfs idmapping (ofs->creator_cred): 0:10000000:65536
      
              sys_getxattr()
              -> path_getxattr()
                 -> getxattr()
                    -> do_getxattr()
                         -> vfs_getxattr()
                            -> xattr_getsecurity()
                               -> security_inode_getsecurity()                                                _______________________________
                                  -> cap_inode_getsecurity()                                                  |                             |
                                     {                                                                        V                             |
                                             10000000 = make_kuid(0:10000000:65536 /* overlayfs idmapping */, 0);                           |
                                             10000000 = mapped_kuid_fs(0:0:4k /* no idmapped mount */, 10000000);                           |
                                                     /* Expected result is 0 and thus that we own the fscap. */                             |
                                                    0 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000000);                     |
                                     }                                                                                                      |
                                     -> vfs_getxattr_alloc()                                                                                |
                                        -> handler->get == ovl_other_xattr_get()                                                            |
                                          |-> vfs_getxattr()                                                                                |
                                              -> xattr_getsecurity()                                                                        |
                                                 -> security_inode_getsecurity()                                                            |
                                                    -> cap_inode_getsecurity()                                                              |
                                                       {                                                                                    |
                                                                       0 = make_kuid(0:0:4k /* lower s_user_ns */, 0);                      |
                                                                10000000 = mapped_kuid_fs(0:10000000:65536 /* idmapped mount */, 0);        |
                                                                       0 = from_kuid(0:10000000:65536 /* overlayfs idmapping */, 10000000); |
                                                                       |____________________________________________________________________|
                                                       }
                                                       -> vfs_getxattr_alloc()
                                                          -> handler->get == /* lower filesystem callback */
      
      We can see how the translation happens correctly in those cases as the
      conversion happens within the vfs_getxattr() helper.
      
      For POSIX ACLs we need to do something similar. However, in contrast to
      fscaps we cannot apply the fix directly to the kernel internal posix acl
      data structure as this would alter the cached values and would also require
      a rework of how we currently deal with POSIX ACLs in general which almost
      never take the filesystem idmapping into account (the noteable exception
      being FUSE but even there the implementation is special) and instead
      retrieve the raw values based on the initial idmapping.
      
      The correct values are then generated right before returning to
      userspace. The fix for this is to move taking the mount's idmapping into
      account directly in vfs_getxattr() instead of having it be part of
      posix_acl_fix_xattr_to_user().
      
      To this end we simply move the idmapped mount translation into a separate
      step performed in vfs_{g,s}etxattr() instead of in
      posix_acl_fix_xattr_{from,to}_user().
      
      To see how this fixes things let's go back to the original example. Assume
      the user chose option (1) and mounted overlayfs on top of idmapped mounts
      on the host:
      
              idmapped mount /vol/contpool/merge:      0:10000000:65536
              caller's idmapping:                      0:10000000:65536
              overlayfs idmapping (ofs->creator_cred): 0:0:4k /* initial idmapping */
      
              sys_getxattr()
              -> path_getxattr()
                 -> getxattr()
                    -> do_getxattr()
                        |> vfs_getxattr()
                        |  |> __vfs_getxattr()
                        |  |  -> handler->get == ovl_posix_acl_xattr_get()
                        |  |     -> ovl_xattr_get()
                        |  |        -> vfs_getxattr()
                        |  |           |> __vfs_getxattr()
                        |  |           |  -> handler->get() /* lower filesystem callback */
                        |  |           |> posix_acl_getxattr_idmapped_mnt()
                        |  |              {
                        |  |                              4 = make_kuid(&init_user_ns, 4);
                        |  |                       10000004 = mapped_kuid_fs(0:10000000:65536 /* lower idmapped mount */, 4);
                        |  |                       10000004 = from_kuid(&init_user_ns, 10000004);
                        |  |                       |_______________________
                        |  |              }                               |
                        |  |                                              |
                        |  |> posix_acl_getxattr_idmapped_mnt()           |
                        |     {                                           |
                        |                                                 V
                        |             10000004 = make_kuid(&init_user_ns, 10000004);
                        |             10000004 = mapped_kuid_fs(&init_user_ns /* no idmapped mount */, 10000004);
                        |             10000004 = from_kuid(&init_user_ns, 10000004);
                        |     }       |_________________________________________________
                        |                                                              |
                        |                                                              |
                        |> posix_acl_fix_xattr_to_user()                               |
                           {                                                           V
                                       10000004 = make_kuid(0:0:4k /* init_user_ns */, 10000004);
                                              /* SUCCESS */
                                              4 = from_kuid(0:10000000:65536 /* caller's idmapping */, 10000004);
                           }
      
      And similarly if the user chooses option (1) and mounted overayfs on top of
      idmapped mounts inside the container:
      
              idmapped mount /vol/contpool/merge:      0:10000000:65536
              caller's idmapping:                      0:10000000:65536
              overlayfs idmapping (ofs->creator_cred): 0:10000000:65536
      
              sys_getxattr()
              -> path_getxattr()
                 -> getxattr()
                    -> do_getxattr()
                        |> vfs_getxattr()
                        |  |> __vfs_getxattr()
                        |  |  -> handler->get == ovl_posix_acl_xattr_get()
                        |  |     -> ovl_xattr_get()
                        |  |        -> vfs_getxattr()
                        |  |           |> __vfs_getxattr()
                        |  |           |  -> handler->get() /* lower filesystem callback */
                        |  |           |> posix_acl_getxattr_idmapped_mnt()
                        |  |              {
                        |  |                              4 = make_kuid(&init_user_ns, 4);
                        |  |                       10000004 = mapped_kuid_fs(0:10000000:65536 /* lower idmapped mount */, 4);
                        |  |                       10000004 = from_kuid(&init_user_ns, 10000004);
                        |  |                       |_______________________
                        |  |              }                               |
                        |  |                                              |
                        |  |> posix_acl_getxattr_idmapped_mnt()           |
                        |     {                                           V
                        |             10000004 = make_kuid(&init_user_ns, 10000004);
                        |             10000004 = mapped_kuid_fs(&init_user_ns /* no idmapped mount */, 10000004);
                        |             10000004 = from_kuid(0(&init_user_ns, 10000004);
                        |             |_________________________________________________
                        |     }                                                        |
                        |                                                              |
                        |> posix_acl_fix_xattr_to_user()                               |
                           {                                                           V
                                       10000004 = make_kuid(0:0:4k /* init_user_ns */, 10000004);
                                              /* SUCCESS */
                                              4 = from_kuid(0:10000000:65536 /* caller's idmappings */, 10000004);
                           }
      
      The last remaining problem we need to fix here is ovl_get_acl(). During
      ovl_permission() overlayfs will call:
      
              ovl_permission()
              -> generic_permission()
                 -> acl_permission_check()
                    -> check_acl()
                       -> get_acl()
                          -> inode->i_op->get_acl() == ovl_get_acl()
                              > get_acl() /* on the underlying filesystem)
                                ->inode->i_op->get_acl() == /*lower filesystem callback */
                       -> posix_acl_permission()
      
      passing through the get_acl request to the underlying filesystem. This will
      retrieve the acls stored in the lower filesystem without taking the
      idmapping of the underlying mount into account as this would mean altering
      the cached values for the lower filesystem. The simple solution is to have
      ovl_get_acl() simply duplicate the ACLs, update the values according to the
      idmapped mount and return it to acl_permission_check() so it can be used in
      posix_acl_permission(). Since overlayfs doesn't cache ACLs they'll be
      released right after.
      
      Link: https://github.com/brauner/mount-idmapped/issues/9
      Cc: Seth Forshee <sforshee@digitalocean.com>
      Cc: Amir Goldstein <amir73il@gmail.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Cc: linux-unionfs@vger.kernel.org
      Signed-off-by: NChristian Brauner (Microsoft) <brauner@kernel.org>
      Fixes: bc70682a ("ovl: support idmapped layers")
      Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
      4a47c638
    • P
      io_uring: explicit sqe padding for ioctl commands · bdb2c48e
      Pavel Begunkov 提交于
      32 bit sqe->cmd_op is an union with 64 bit values. It's always a good
      idea to do padding explicitly. Also zero check it in prep, so it can be
      used in the future if needed without compatibility concerns.
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      Link: https://lore.kernel.org/r/e6b95a05e970af79000435166185e85b196b2ba2.1657202417.git.asml.silence@gmail.com
      [axboe: turn bitwise OR into logical variant]
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      bdb2c48e
  9. 05 7月, 2022 2 次提交