1. 19 Jan, 2023 2 commits
    • fs: move mnt_idmap · 3707d84c
      Christian Brauner committed
      Now that we converted everything to just rely on struct mnt_idmap move it all
      into a separate file. This ensures that no code can poke around in struct
      mnt_idmap without any dedicated helpers and makes it easier to extend it in the
      future. Filesystems will now not be able to conflate mount and filesystem
      idmappings as they are two distinct types and require distinct helpers that
      cannot be used interchangeably. We are now also able to extend struct mnt_idmap
      as we see fit.
      Acked-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
    • fs: port i_{g,u}id_into_vfs{g,u}id() to mnt_idmap · e67fe633
      Christian Brauner committed
      Convert to struct mnt_idmap.
      Remove legacy file_mnt_user_ns() and mnt_user_ns().
      
      Last cycle we merged the necessary infrastructure in
      256c8aed ("fs: introduce dedicated idmap type for mounts").
      This is just the conversion to struct mnt_idmap.
      
      Currently we still pass around the plain namespace that was attached to a
      mount. This is in general pretty convenient but it makes it easy to
      conflate namespaces that are relevant on the filesystem with namespaces
      that are relevant on the mount level. Especially for non-vfs developers
      without detailed knowledge in this area this can be a potential source for
      bugs.
      
      Once the conversion to struct mnt_idmap is done all helpers down to the
      really low-level helpers will take a struct mnt_idmap argument instead of
      two namespace arguments. This way it becomes impossible to conflate the two,
      eliminating the possibility of such bugs. All of the vfs and all filesystems
      only operate on struct mnt_idmap.
      Acked-by: Dave Chinner <dchinner@redhat.com>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
  2. 25 Nov, 2022 1 commit
    • copy_mnt_ns(): handle a corner case (overmounted mntns bindings) saner · 61d8e426
      Al Viro committed
      copy_mnt_ns() has the old tree copied, with mntns binding *and* anything
      bound on top of them skipped.  Then it proceeds to walk both trees in
      parallel.  Unfortunately, it doesn't get the "skip the stuff we'd skipped
      when copying" quite right.  Consequences are minor (the ->mnt_root
      comparison will return the situation to sanity pretty soon and the worst
      we get is the unexpected subset of opened non-directories being switched
      to new namespace), but it's confusing enough and it's not hard to get
      the expected behaviour...
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
  3. 01 Nov, 2022 1 commit
    • fs: introduce dedicated idmap type for mounts · 256c8aed
      Christian Brauner committed
      Last cycle we've already made the interaction with idmapped mounts more
      robust and type safe by introducing the vfs{g,u}id_t type. This cycle we
      concluded the conversion and removed the legacy helpers.
      
      Currently we still pass around the plain namespace that was attached to
      a mount. This is in general pretty convenient but it makes it easy to
      conflate filesystem and mount namespaces and what different roles they
      have to play. Especially for filesystem developers without much
      experience in this area this is an easy source for bugs.
      
      Instead of passing the plain namespace we introduce a dedicated type
      struct mnt_idmap and replace the pointer with a pointer to a struct
      mnt_idmap. There are no semantic or size changes for the mount struct
      caused by this.
      
      We then start converting all places aware of idmapped mounts to rely on
      struct mnt_idmap. Once the conversion is done all helpers down to the
      really low-level make_vfs{g,u}id() and from_vfs{g,u}id() will take a
      struct mnt_idmap argument instead of two namespace arguments. This way
      it becomes impossible to conflate the two, thus eliminating the
      possibility of such bugs. Fwiw, I fixed some issues in that area in
      ntfs3 and ksmbd a while ago. Afterwards, only low-level
      code can ultimately use the associated namespace for any permission
      checks. Even most of the vfs can be ultimately completely oblivious
      about this and filesystems will never interact with it directly in any
      form in the future.
      
      A struct mnt_idmap currently encompasses a simple refcount and a pointer
      to the relevant namespace the mount is idmapped to. If a mount isn't
      idmapped then it will point to a static nop_mnt_idmap. If it is an
      idmapped mount it will point to a new struct mnt_idmap. As usual there
      are no allocations or anything happening for non-idmapped mounts.
      Everything is carefully written to be a nop for non-idmapped mounts as
      has always been the case.
      
      If an idmapped mount or mount tree is created a new struct mnt_idmap is
      allocated and a reference taken on the relevant namespace. For each
      mount in a mount tree that gets idmapped or a mount that inherits the
      idmap when it is cloned the reference count on the associated struct
      mnt_idmap is bumped. Just a reminder that we only allow a mount to
      change its idmapping a single time and only if it hasn't already been
      attached to the filesystems and has no active writers.
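
      The two paragraphs above describe a refcounted object with a shared
      static fallback. A minimal userspace sketch of that pattern follows. It
      is not the kernel implementation: every name in it (user_ns_model,
      mnt_idmap_model, nop_idmap_model, idmap_alloc, idmap_get, idmap_put) is
      hypothetical, and plain ints stand in for the kernel's refcount types.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the namespace a mount is idmapped to. */
struct user_ns_model {
	int refs;
};

/* Model of the described layout: a refcount plus a namespace pointer. */
struct mnt_idmap_model {
	struct user_ns_model *owner;
	int count;
};

/* Shared static object for non-idmapped mounts: never allocated or freed. */
static struct user_ns_model init_ns_model = { .refs = 1 };
static struct mnt_idmap_model nop_idmap_model = {
	.owner = &init_ns_model,
	.count = 1,
};

/* Creating an idmapped mount allocates and pins the namespace. */
static struct mnt_idmap_model *idmap_alloc(struct user_ns_model *ns)
{
	struct mnt_idmap_model *m = malloc(sizeof(*m));

	if (!m)
		return NULL;
	ns->refs++;
	m->owner = ns;
	m->count = 1;
	return m;
}

/* Cloning or inheriting bumps the count; the static nop object is a no-op. */
static struct mnt_idmap_model *idmap_get(struct mnt_idmap_model *m)
{
	if (m != &nop_idmap_model)
		m->count++;
	return m;
}

/* Dropping the last reference releases the namespace and the object. */
static void idmap_put(struct mnt_idmap_model *m)
{
	if (m == &nop_idmap_model)
		return;
	assert(m->count > 0);
	if (--m->count == 0) {
		m->owner->refs--;
		free(m);
	}
}
```

      In this model a non-idmapped mount only ever stores a pointer to
      nop_idmap_model, so get and put on it touch no counters and allocate
      nothing.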
      
      The actual changes are fairly straightforward. This will have huge
      benefits for maintenance and security in the long run even if it causes
      some churn. I'm aware that there's some cost for all of you. And I'll
      commit to doing this work and make this as painless as I can.
      
      Note that this also makes it possible to extend struct mnt_idmap in
      the future. For example, it would be possible to place the namespace
      pointer in an anonymous union together with an idmapping struct. This
      would allow us to expose an api to userspace that would let it specify
      idmappings directly instead of having to go through the detour of
      setting up namespaces at all.
      
      This just adds the infrastructure and doesn't do any conversions.
      Reviewed-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org>
      Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
  4. 17 Aug, 2022 1 commit
  5. 06 Jul, 2022 1 commit
  6. 20 May, 2022 1 commit
  7. 12 May, 2022 1 commit
    • fs: hold writers when changing mount's idmapping · e1bbcd27
      Christian Brauner committed
      Hold writers when changing a mount's idmapping to make it more robust.
      
      The vfs layer takes care to retrieve the idmapping of a mount once
      ensuring that the idmapping used for vfs permission checking is
      identical to the idmapping passed down to the filesystem.
      
      For ioctl codepaths the filesystem itself is responsible for taking the
      idmapping into account if they need to. While all filesystems with
      FS_ALLOW_IDMAP raised take the same precautions as the vfs we should
      enforce it explicitly by making sure there are no active writers on the
      relevant mount while changing the idmapping.
      
      This is similar to turning a mount ro with the difference that in
      contrast to turning a mount ro changing the idmapping can only ever be
      done once while a mount can transition between ro and rw as much as it
      wants.
      
      This is a minor user-visible change. But it is extremely unlikely to
      matter. The caller must've created a detached mount via OPEN_TREE_CLONE
      and then handed that O_PATH fd to another process or thread which then
      must've gotten a writable fd for that mount and started creating files
      in there while the caller is still changing mount properties. While not
      impossible it will be an extremely rare corner-case and should in
      general be considered a bug in the application. Consider making a mount
      MOUNT_ATTR_NOEXEC or MOUNT_ATTR_NODEV while allowing someone else to
      perform lookups or exec'ing in parallel by handing them a copy of the
      OPEN_TREE_CLONE fd or another fd beneath that mount.
      
      Link: https://lore.kernel.org/r/20220510095840.152264-1-brauner@kernel.org
      Cc: Seth Forshee <seth.forshee@digitalocean.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
  8. 21 Apr, 2022 1 commit
  9. 23 Mar, 2022 1 commit
  10. 16 Mar, 2022 1 commit
  11. 14 Feb, 2022 6 commits
  12. 31 Jan, 2022 1 commit
  13. 22 Jan, 2022 1 commit
    • fs: move namespace sysctls and declare fs base directory · ab171b95
      Luis Chamberlain committed
      This moves the namespace sysctls to their own file as part of the
      kernel/sysctl.c spring cleaning.
      
      Since we have now removed all sysctls for "fs", we have to declare the
      base directory in the filesystem code. We do that using the new helper,
      which reduces boilerplate code.
      
      We rename init_fs_shared_sysctls() to init_fs_sysctls() to reflect that
      now fs/sysctls.c is taking on the burden of being the first to register
      the base directory as well.
      
      Lastly, since init code will load in the order in which we link it we
      have to move the sysctl code to be linked in early, so that its early
      init routine runs prior to other fs code.  This way, other filesystem
      code can register their own sysctls using the helpers after this:
      
        * register_sysctl_init()
        * register_sysctl()
      
      Link: https://lkml.kernel.org/r/20211129211943.640266-3-mcgrof@kernel.org
      Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com>
      Cc: Antti Palosaari <crope@iki.fi>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Eric Biggers <ebiggers@google.com>
      Cc: Iurii Zaikin <yzaikin@google.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Lukas Middendorf <kernel@tuxforce.de>
      Cc: Masami Hiramatsu <mhiramat@kernel.org>
      Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
      Cc: Stephen Kitt <steve@sk2.org>
      Cc: Xiaoming Ni <nixiaoming@huawei.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  14. 31 Dec, 2021 1 commit
  15. 05 Dec, 2021 1 commit
    • fs: support mapped mounts of mapped filesystems · bd303368
      Christian Brauner committed
      In previous patches we added new and modified existing helpers to handle
      idmapped mounts of filesystems mounted with an idmapping. In this final
      patch we convert all relevant places in the vfs to actually pass the
      filesystem's idmapping into these helpers.
      
      With this the vfs is in shape to handle idmapped mounts of filesystems
      mounted with an idmapping. Note that this is just the generic
      infrastructure. Actually adding support for idmapped mounts to a
      filesystem mountable with an idmapping is follow-up work.
      
      In this patch we extend the definition of an idmapped mount from a mount
      that has an idmapping other than the initial idmapping attached to it to
      a mount that has an idmapping attached to it which is not the same as the
      idmapping the filesystem was mounted with.
      
      As before we do not allow the initial idmapping to be attached to a
      mount. In addition this patch prevents the idmapping the filesystem was
      mounted with from being attached to a mount created based on this
      filesystem.
      
      This has multiple reasons and advantages. First, attaching the initial
      idmapping or the filesystem's idmapping doesn't make much sense as in
      both cases the values of the i_{g,u}id and other places where k{g,u}ids
      are used do not change. Second, a user that really wants to do this for
      whatever reason can just create a separate dedicated identical idmapping
      to attach to the mount. Third, we can continue to use the initial
      idmapping as an indicator that a mount is not idmapped allowing us to
      continue to keep passing the initial idmapping into the mapping helpers
      to tell them that something isn't an idmapped mount even if the
      filesystem is mounted with an idmapping.
      
      Link: https://lore.kernel.org/r/20211123114227.3124056-11-brauner@kernel.org (v1)
      Link: https://lore.kernel.org/r/20211130121032.3753852-11-brauner@kernel.org (v2)
      Link: https://lore.kernel.org/r/20211203111707.3901969-11-brauner@kernel.org
      Cc: Seth Forshee <sforshee@digitalocean.com>
      Cc: Amir Goldstein <amir73il@gmail.com>
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Seth Forshee <sforshee@digitalocean.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
  16. 04 Dec, 2021 1 commit
  17. 26 Nov, 2021 1 commit
  18. 04 Sep, 2021 2 commits
    • memcg: enable accounting for new namespaces and struct nsproxy · 30acd0bd
      Vasily Averin committed
      A container admin can create new namespaces and force the kernel to
      allocate up to several pages of memory for the namespaces and their
      associated structures.
      
      Net and uts namespaces have enabled accounting for such allocations.  It
      makes sense to account for the remaining ones to restrict the host's memory
      consumption from inside the memcg-limited container.
      
      Link: https://lkml.kernel.org/r/5525bcbf-533e-da27-79b7-158686c64e13@virtuozzo.com
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Acked-by: Serge Hallyn <serge@hallyn.com>
      Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
      Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Dmitry Safonov <0x7f454c46@gmail.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jiri Slaby <jirislaby@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Yutian Yang <nglaive@gmail.com>
      Cc: Zefan Li <lizefan.x@bytedance.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • memcg: enable accounting for mnt_cache entries · 79f6540b
      Vasily Averin committed
      Patch series "memcg accounting from OpenVZ", v7.
      
      OpenVZ has used memory accounting for 20+ years, since the v2.2.x Linux kernels.
      Initially we used our own accounting subsystem, then partially committed
      it to upstream, and a few years ago switched to cgroups v1.  Now we're
      rebasing again, revising our old patches and trying to push them upstream.
      
      We try to protect the host system from any misuse of kernel memory
      allocation triggered by untrusted users inside the containers.
      
      Patch-set is addressed mostly to cgroups maintainers and cgroups@ mailing
      list, though I would be very grateful for any comments from maintainers
      of affected subsystems or other people added in cc:
      
      Compared to the upstream, we additionally account the following kernel objects:
      - network devices and their Tx/Rx queues
      - ipv4/v6 addresses and routing-related objects
      - inet_bind_bucket cache objects
      - VLAN group arrays
      - ipv6/sit: ip_tunnel_prl
      - scm_fp_list objects used by SCM_RIGHTS messages of Unix sockets
      - nsproxy and the namespace objects themselves
      - IPC objects: semaphores, message queues and shared memory segments
      - mounts
      - pollfd and select bits arrays
      - signals and posix timers
      - file lock
      - fasync_struct used by the file lease code and driver's fasync queues
      - tty objects
      - per-mm LDT
      
      We have incorrect/incomplete/obsolete accounting for a few other kernel
      objects: sk_filter, af_packets, netlink and xt_counters for iptables.
      They require rework and will probably be dropped altogether.
      
      Also we're going to add an accounting for nft, however it is not ready
      yet.
      
      We have not tested performance on upstream. However, our performance team
      compared our current RHEL7-based production kernel and reports that it
      is at least no worse than the corresponding original RHEL7 kernel.
      
      This patch (of 10):
      
      The kernel allocates ~400 bytes of 'struct mount' for any new mount.
      Creating a new mount namespace clones most of the parent mounts, and this
      can be repeated many times.  Additionally, each mount allocates up to
      PATH_MAX=4096 bytes for mnt->mnt_devname.
      
      It makes sense to account for these allocations to restrict the host's
      memory consumption from inside the memcg-limited container.
      
      Link: https://lkml.kernel.org/r/045db11f-4a45-7c9b-2664-5b32c2b44943@virtuozzo.com
      Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Yutian Yang <nglaive@gmail.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Dmitry Safonov <0x7f454c46@gmail.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "J. Bruce Fields" <bfields@fieldses.org>
      Cc: Jeff Layton <jlayton@kernel.org>
      Cc: Jens Axboe <axboe@kernel.dk>
      Cc: Jiri Slaby <jirislaby@kernel.org>
      Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Serge Hallyn <serge@hallyn.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Zefan Li <lizefan.x@bytedance.com>
      Cc: Borislav Petkov <bp@suse.de>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  19. 23 Aug, 2021 1 commit
    • fs: remove mandatory file locking support · f7e33bdb
      Jeff Layton committed
      We added CONFIG_MANDATORY_FILE_LOCKING in 2015, and soon after turned it
      off in Fedora and RHEL8. Several other distros have followed suit.
      
      I've heard of one problem in all that time: Someone migrated from an
      older distro that supported "-o mand" to one that didn't, and the host
      had a fstab entry with "mand" in it which broke on reboot. They didn't
      actually _use_ mandatory locking so they just removed the mount option
      and moved on.
      
      This patch rips out mandatory locking support wholesale from the kernel,
      along with the Kconfig option and the Documentation file. It also
      changes the mount code to ignore the "mand" mount option instead of
      erroring out, and to throw a big, ugly warning.
      Signed-off-by: Jeff Layton <jlayton@kernel.org>
  20. 21 Aug, 2021 1 commit
  21. 10 Aug, 2021 1 commit
  22. 26 Jul, 2021 1 commit
    • move_mount: allow to add a mount into an existing group · 9ffb14ef
      Pavel Tikhomirov committed
      Previously a sharing group (shared and master ids pair) could only be
      inherited when a mount was created via bind mount. This patch adds the
      ability to add an existing private mount into an existing sharing group.
      
      With this functionality one can first create the desired mount tree from
      only private mounts (without the need to care about undesired mount
      propagation or mount creation order implied by sharing group
      dependencies), and then set up any desired mount sharing between
      those mounts in the tree as needed.
      
      This allows CRIU to restore any set of mount namespaces, mount trees and
      sharing group trees for a container.
      
      We have many issues with restoring mounts in CRIU related to sharing
      groups and propagation:
      - reverse sharing groups vs mount tree order requires complex mounts
        reordering which mostly implies also using some temporary mounts
        (please see https://lkml.org/lkml/2021/3/23/569 for more info)
      
      - mount() syscall creates tons of mounts due to propagation
      - mount re-parenting due to propagation
      - "Mount Trap" due to propagation
      - "Non Uniform" propagation, meaning that with different tricks with
        mount order and temporary children-"lock" mounts one can create mount
        trees which can't be restored without those tricks
        (see https://www.linuxplumbersconf.org/event/7/contributions/640/)
      
      With this new functionality we can resolve all the problems with
      propagation at once.
      
      Link: https://lore.kernel.org/r/20210715100714.120228-1-ptikhomirov@virtuozzo.com
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Alexander Viro <viro@zeniv.linux.org.uk>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Mattias Nissler <mnissler@chromium.org>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: linux-fsdevel@vger.kernel.org
      Cc: linux-api@vger.kernel.org
      Cc: lkml <linux-kernel@vger.kernel.org>
      Co-developed-by: Andrei Vagin <avagin@gmail.com>
      Acked-by: Christian Brauner <christian.brauner@ubuntu.com>
      Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
      Signed-off-by: Andrei Vagin <avagin@gmail.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
  23. 01 Jun, 2021 1 commit
    • mount: Support "nosymfollow" in new mount api · dd8b477f
      Christian Brauner committed
      Commit dab741e0 ("Add a "nosymfollow" mount option.") added support
      for the "nosymfollow" mount option, which blocks following symlinks
      when resolving paths. So far the mount option was only available in the
      old mount api. Make it available in the new mount api as well. A bonus is
      that it can be applied to a whole subtree, not just a single mount.
      
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Mattias Nissler <mnissler@chromium.org>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Ross Zwisler <zwisler@google.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
  24. 12 May, 2021 1 commit
    • fs/mount_setattr: tighten permission checks · 2ca4dcc4
      Christian Brauner committed
      We currently don't have any filesystems that support idmapped mounts
      which are mountable inside a user namespace. That was a deliberate
      decision for now as a userns root can just mount the filesystem
      themselves. So enforce this restriction explicitly until there's a real
      use-case for this. This way we can notice it and will have a chance to
      adapt and audit our translation helpers and fstests appropriately if we
      need to support such filesystems.
      
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: stable@vger.kernel.org
      Cc: linux-fsdevel@vger.kernel.org
      Suggested-by: Seth Forshee <seth.forshee@canonical.com>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
  25. 01 Apr, 2021 1 commit
  26. 24 Jan, 2021 8 commits
    • fs: introduce MOUNT_ATTR_IDMAP · 9caccd41
      Christian Brauner committed
      Introduce a new bind mount property to allow idmapping mounts. The
      MOUNT_ATTR_IDMAP flag can be set via the new mount_setattr() syscall
      together with a file descriptor referring to a user namespace.
      
      The user namespace referenced by the namespace file descriptor will be
      attached to the bind mount. All interactions with the filesystem going
      through that mount will be mapped according to the mapping specified in
      the user namespace attached to it.
      
      Using user namespaces to mark mounts means we can reuse all the existing
      infrastructure in the kernel that already exists to handle idmappings
      and can also use this for permission checking to allow unprivileged users
      to create idmapped mounts in the future.
      
      Idmapping a mount is decoupled from the caller's user and mount
      namespace. This means idmapped mounts can be created in the initial
      user namespace which is an important use-case for systemd-homed,
      portable usb-sticks between systems, sharing data between the initial
      user namespace and unprivileged containers, and other use-cases that
      have been brought up. For example, assume a home directory where all
      files are owned by uid and gid 1000 and the home directory is brought to
      a new laptop where the user has id 12345. The system administrator can
      simply create a mount of this home directory with a mapping of
      1000:12345:1 and other mappings to indicate the ids should be kept.
      (With this it is e.g. also possible to create idmapped mounts on the
      host with an identity mapping 1:1:100000 where the root user is not
      mapped. A user with root access that e.g. has been pivot rooted into
      such a mount on the host will not be able to execute, read, write, or
      create files as root.)
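
      The two mappings in the paragraph above can be worked through with a
      small extent lookup. This is a sketch under stated assumptions, not the
      kernel's mapping code: it reads "1000:12345:1" as source id, target id,
      count, and the names id_extent, map_id and ID_UNMAPPED are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sentinel meaning "this id has no mapping". */
#define ID_UNMAPPED UINT32_MAX

/* One mapping range, read here as from:to:count (an assumption). */
struct id_extent {
	uint32_t from;
	uint32_t to;
	uint32_t count;
};

/* Translate id through the first extent that covers it. */
static uint32_t map_id(const struct id_extent *ext, size_t n, uint32_t id)
{
	assert(ext != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		/* id - from cannot underflow because id >= from is checked first. */
		if (id >= ext[i].from && id - ext[i].from < ext[i].count)
			return ext[i].to + (id - ext[i].from);
	}
	return ID_UNMAPPED;
}
```

      With the single extent {1000, 12345, 1}, id 1000 translates to 12345.
      With {1, 1, 100000}, ids 1 through 100000 map to themselves while id 0
      has no mapping, which is the "root user is not mapped" case.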
      
      Given that mapping a mount is decoupled from the caller's user namespace
      a sufficiently privileged process such as a container manager can set up
      an idmapped mount for the container and the container can simply pivot
      root to it. There's no need for the container to do anything. The mount
      will appear correctly mapped independent of the user namespace the
      container uses. This means we don't need to mark a mount as idmappable.
      
      In order to create an idmapped mount the caller must currently be
      privileged in the user namespace of the superblock the mount belongs to.
      Once a mount has been idmapped we don't allow it to change its mapping.
      This keeps permission checking and life-cycle management simple. Users
      wanting to change the idmapping can always create a new detached mount
      with a different idmapping.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-36-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Mauricio Vásquez Bernal <mauricio@kinvolk.io>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
    • fs: add mount_setattr() · 2a186721
      Christian Brauner committed
      This implements the missing mount_setattr() syscall. While the new mount
      api allows changing the properties of a superblock there is currently
      no way to change the properties of a mount or a mount tree using file
      descriptors which the new mount api is based on. In addition the old
      mount api has the restriction that mount options cannot be applied
      recursively. This hasn't changed since changing mount options on a
      per-mount basis was implemented in [1] and has been a frequent request
      not just for convenience but also for security reasons. The legacy
      mount syscall is unable to accommodate this behavior without introducing
      a whole new set of flags because MS_REC | MS_REMOUNT | MS_BIND |
      MS_RDONLY | MS_NOEXEC | [...] only apply the mount option to the topmost
      mount. Changing MS_REC to apply to the whole mount tree would mean
      introducing a significant uapi change and would likely cause significant
      regressions.
      
      The new mount_setattr() syscall allows recursively clearing and setting
      mount options in one shot. Multiple calls to change mount options
      requesting the same changes are idempotent:
      
      int mount_setattr(int dfd, const char *path, unsigned flags,
                        struct mount_attr *uattr, size_t usize);
      
      Flags to modify path resolution behavior are specified in the @flags
      argument. Currently, AT_EMPTY_PATH, AT_RECURSIVE, AT_SYMLINK_NOFOLLOW,
      and AT_NO_AUTOMOUNT are supported. If useful, additional lookup flags to
      restrict path resolution as introduced with openat2() might be supported
      in the future.
      
      The mount_setattr() syscall can be expected to grow over time and is
      designed with extensibility in mind. It follows the extensible syscall
      pattern we have used with other syscalls such as openat2(), clone3(),
      sched_{set,get}attr(), and others.
      
      The set of mount options is passed in the uapi struct mount_attr which
      currently has the following layout:
      
      struct mount_attr {
      	__u64 attr_set;
      	__u64 attr_clr;
      	__u64 propagation;
      	__u64 userns_fd;
      };
      
      The @attr_set and @attr_clr members are used to set and clear mount
      options. This way a user can e.g. request that a set of flags is to be
      raised such as turning mounts readonly by raising MOUNT_ATTR_RDONLY in
      @attr_set while at the same time requesting that another set of flags is
      to be lowered such as removing noexec from a mount tree by specifying
      MOUNT_ATTR_NOEXEC in @attr_clr.
      
      Note, since the MOUNT_ATTR_<atime> values are an enum starting from 0,
      not a bitmap, users wanting to transition to a different atime setting
      cannot simply specify the atime setting in @attr_set, but must also
      specify MOUNT_ATTR__ATIME in the @attr_clr field. So we ensure that
      MOUNT_ATTR__ATIME can't be partially set in @attr_clr and that @attr_set
      can't have any atime bits set if MOUNT_ATTR__ATIME isn't set in
      @attr_clr.
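
      The atime rule above can be expressed as a small validity check. This
      is a sketch, not the kernel's validation code: atime_request_ok is a
      hypothetical name and the ATTR_* constants are local copies assumed to
      match the uapi MOUNT_ATTR_* atime values.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local copies of the atime field and its values (assumed uapi values). */
#define ATTR__ATIME      0x00000070u
#define ATTR_RELATIME    0x00000000u
#define ATTR_NOATIME     0x00000010u
#define ATTR_STRICTATIME 0x00000020u

/* The atime setting is an enum starting from 0, not a bitmap. */
static_assert(ATTR_RELATIME == 0, "relatime is the zero value of the enum");

/* Return true if the atime parts of attr_set/attr_clr are consistent. */
static bool atime_request_ok(uint64_t attr_set, uint64_t attr_clr)
{
	uint64_t clr = attr_clr & ATTR__ATIME;
	uint64_t set = attr_set & ATTR__ATIME;

	/* Without clearing the atime field, no atime bit may be set. */
	if (clr == 0)
		return set == 0;
	/* The atime field must be cleared as a whole, never partially. */
	if (clr != ATTR__ATIME)
		return false;
	/* With the whole field cleared, exactly one known value is allowed. */
	return set == ATTR_RELATIME || set == ATTR_NOATIME ||
	       set == ATTR_STRICTATIME;
}
```

      Because the relatime value is 0, a request that clears the whole atime
      field and sets no atime bits selects relatime in this model.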
      
      The @propagation field lets callers specify the propagation type of a
      mount tree. Propagation is a single property that has four different
      settings and as such is not really a flag argument but an enum.
      Specifically, it would be unclear what setting and clearing propagation
      settings in combination would amount to. The legacy mount() syscall thus
      forbids the combination of multiple propagation settings too. The goal
      is to keep the semantics of mount propagation somewhat simple as they
      are overly complex as it is.
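
      Treating propagation as an enum amounts to accepting at most one of the
      four settings per request. A minimal sketch of such a check follows.
      check_propagation() and PROPAGATION_MASK are hypothetical names, the
      MS_* values are the ones used by the legacy mount() syscall, and
      treating 0 as "leave propagation unchanged" is an assumption of the
      sketch.

```c
#include <errno.h>
#include <stdint.h>

/* Propagation settings with the values used by the legacy mount() ABI. */
#define MS_UNBINDABLE	(1 << 17)
#define MS_PRIVATE	(1 << 18)
#define MS_SLAVE	(1 << 19)
#define MS_SHARED	(1 << 20)

#define PROPAGATION_MASK \
	(MS_UNBINDABLE | MS_PRIVATE | MS_SLAVE | MS_SHARED)

/* Hypothetical check: returns 0 if valid, -EINVAL otherwise. */
int check_propagation(uint64_t propagation)
{
	/* Bits outside the four known settings are rejected. */
	if (propagation & ~(uint64_t)PROPAGATION_MASK)
		return -EINVAL;

	/* More than one bit set means two settings were combined. */
	if (propagation & (propagation - 1))
		return -EINVAL;

	return 0;
}
```

      A request for MS_SHARED alone passes, while MS_SHARED | MS_PRIVATE is
      rejected because it has no well-defined meaning.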
      
      The @userns_fd field lets users specify a user namespace whose idmapping
      becomes the idmapping of the mount. This is implemented and explained in
      detail in the next patch.
      
      [1]: commit 2e4b7fcd ("[PATCH] r/o bind mounts: honor mount writer counts at remount")
      
      Link: https://lore.kernel.org/r/20210121131959.646623-35-christian.brauner@ubuntu.com
      Cc: David Howells <dhowells@redhat.com>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Cc: linux-api@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      2a186721
    • C
      fs: add attr_flags_to_mnt_flags helper · 5b490500
      Christian Brauner committed
      Add a simple helper to translate uapi MOUNT_ATTR_* flags to MNT_* flags
      which we will use in follow-up patches too.
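
      A userspace sketch of what such a translation helper might look like
      is shown below. It is not the code added by this patch: the constant
      values are copied from the uapi and kernel headers for illustration,
      and covering only the simple on/off flags, with atime left out, is an
      assumption of the sketch.

```c
#include <stdint.h>

/* uapi flags, values as in <linux/mount.h>. */
#define MOUNT_ATTR_RDONLY	0x00000001
#define MOUNT_ATTR_NOSUID	0x00000002
#define MOUNT_ATTR_NODEV	0x00000004
#define MOUNT_ATTR_NOEXEC	0x00000008
#define MOUNT_ATTR_NODIRATIME	0x00000080

/* Kernel-internal flags, values mirrored here for the sketch. */
#define MNT_NOSUID	0x01
#define MNT_NODEV	0x02
#define MNT_NOEXEC	0x04
#define MNT_NODIRATIME	0x10
#define MNT_READONLY	0x40

/* Translate each uapi bit to its MNT_* counterpart, one by one. */
unsigned int attr_flags_to_mnt_flags(uint64_t attr_flags)
{
	unsigned int mnt_flags = 0;

	if (attr_flags & MOUNT_ATTR_RDONLY)
		mnt_flags |= MNT_READONLY;
	if (attr_flags & MOUNT_ATTR_NOSUID)
		mnt_flags |= MNT_NOSUID;
	if (attr_flags & MOUNT_ATTR_NODEV)
		mnt_flags |= MNT_NODEV;
	if (attr_flags & MOUNT_ATTR_NOEXEC)
		mnt_flags |= MNT_NOEXEC;
	if (attr_flags & MOUNT_ATTR_NODIRATIME)
		mnt_flags |= MNT_NODIRATIME;

	return mnt_flags;
}
```

      The explicit bit-by-bit mapping is needed because the two flag sets do
      not share numeric values: MOUNT_ATTR_RDONLY is 0x1, for instance, while
      MNT_READONLY is 0x40.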
      
      Link: https://lore.kernel.org/r/20210121131959.646623-34-christian.brauner@ubuntu.com
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Suggested-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      5b490500
    • C
      fs: split out functions to hold writers · fbdc2f6c
      Christian Brauner committed
      When a mount is marked read-only we set MNT_WRITE_HOLD on it if there
      aren't currently any active writers. Split this logic out into simple
      helpers that we can use in follow-up patches.
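
      The shape of such helpers can be illustrated with a single-threaded
      toy model. Everything here is an assumption made for illustration: the
      toy_* names are hypothetical, the flag values are only mirrored from
      the kernel headers, the hold is raised before the writer count is
      checked, and the memory barriers and per-CPU counters a real
      implementation needs are left out.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MNT_READONLY	0x40
#define MNT_WRITE_HOLD	0x80

/* Toy stand-in for a mount: a flag word plus a count of active writers. */
struct toy_mount {
	unsigned int flags;
	int writers;
};

/* Raise the hold and report -EBUSY if somebody is still writing. */
int toy_hold_writers(struct toy_mount *mnt)
{
	assert(mnt != NULL);
	mnt->flags |= MNT_WRITE_HOLD;
	return mnt->writers > 0 ? -EBUSY : 0;
}

/* Drop the hold again so that new writers may proceed. */
void toy_unhold_writers(struct toy_mount *mnt)
{
	assert(mnt != NULL);
	mnt->flags &= ~MNT_WRITE_HOLD;
}

/* Mark the mount read-only only if no writer was active under the hold. */
int toy_make_readonly(struct toy_mount *mnt)
{
	int ret = toy_hold_writers(mnt);

	if (ret == 0)
		mnt->flags |= MNT_READONLY;
	toy_unhold_writers(mnt);
	return ret;
}
```

      Splitting hold and unhold into their own functions lets a caller keep
      the hold across several steps instead of only around the read-only
      transition.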
      
      Link: https://lore.kernel.org/r/20210121131959.646623-33-christian.brauner@ubuntu.com
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Suggested-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      fbdc2f6c
    • C
      namespace: only take read lock in do_reconfigure_mnt() · e58ace1a
      Christian Brauner committed
      do_reconfigure_mnt() used to take the down_write(&sb->s_umount) lock
      which seems unnecessary since we're not changing the superblock. We're
      only checking whether it is already read-only. Setting other mount
      attributes is protected by lock_mount_hash() afaict and not by s_umount.
      
      The history of down_write(&sb->s_umount) lock being taken when setting
      mount attributes dates back to the introduction of MNT_READONLY in [2].
      This introduced the concept of having read-only mounts in contrast to
      just having a read-only superblock. When it got introduced it was simply
      plumbed into do_remount() which already took down_write(&sb->s_umount)
      because it was only used to actually change the superblock before [2].
      Afaict, it would've already been possible back then to only use
      down_read(&sb->s_umount) for MS_BIND | MS_REMOUNT since actual mount
      options were protected by the vfsmount lock already. But that would've
      meant special casing the locking for MS_BIND | MS_REMOUNT in
      do_remount() which people might not have considered worth it.
      Then in [1] MS_BIND | MS_REMOUNT mount option changes were split out of
      do_remount() into do_reconfigure_mnt() but the down_write(&sb->s_umount)
      lock was simply copied over.
      Now that this is a separate helper, only take the
      down_read(&sb->s_umount) lock, since we're only interested in checking
      whether the superblock is currently read-only and in blocking any
      writers from changing it. Checking that the superblock is read-only
      has the advantage that we can avoid the slowpath through
      MNT_WRITE_HOLD and simply set the read-only flag on the mount in
      set_mount_attributes().
      
      [1]: commit 43f5e655 ("vfs: Separate changing mount flags full remount")
      [2]: commit 2e4b7fcd ("[PATCH] r/o bind mounts: honor mount writer counts at remount")
      
      Link: https://lore.kernel.org/r/20210121131959.646623-32-christian.brauner@ubuntu.com
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      e58ace1a
    • C
      mount: make {lock,unlock}_mount_hash() static · d033cb67
      Christian Brauner committed
      The lock_mount_hash() and unlock_mount_hash() helpers are never called
      outside a single file. Remove them from the header and make them static
      to reflect this fact. There's no need to have them callable from other
      places right now, as Christoph observed.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-31-christian.brauner@ubuntu.com
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Suggested-by: Christoph Hellwig <hch@lst.de>
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      d033cb67
    • C
      namespace: take lock_mount_hash() directly when changing flags · 68847c94
      Christian Brauner committed
      Changing mount options always ends up taking lock_mount_hash(), but when
      MNT_READONLY is requested and neither the mount nor the superblock is
      read-only, we end up taking the lock, dropping it, and retaking it to
      change the other mount attributes. Instead, let's acquire the lock once
      when changing the mount attributes. This simplifies the locking in these
      codepaths, makes them easier to reason about, and avoids having to
      reacquire the lock right after dropping it.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-30-christian.brauner@ubuntu.com
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      68847c94
    • C
      mount: attach mappings to mounts · a6435940
      Christian Brauner committed
      In order to support per-mount idmappings vfsmounts are marked with user
      namespaces. The idmapping of the user namespace will be used to map the
      ids of vfs objects when they are accessed through that mount. By default
      all vfsmounts are marked with the initial user namespace. The initial
      user namespace is used to indicate that a mount is not idmapped. All
      operations behave as before.
      
      Based on prior discussions we want to attach the whole user namespace
      and not just a dedicated idmapping struct. This allows us to reuse all
      the helpers that already exist for dealing with idmappings instead of
      introducing a whole new range of helpers. In addition, if we decide in
      the future that we are confident enough to enable unprivileged users to
      set up idmapped mounts, the permission checking can take into account
      whether the caller is privileged in the user namespace the mount is
      currently marked with.
      Later patches enforce that once a mount has been idmapped it can't be
      remapped. This keeps permission checking and life-cycle management
      simple. Users wanting to change the idmapping can always create a new
      detached mount with a different idmapping.
      
      Add a new mnt_userns member to vfsmount and two simple helpers to
      retrieve the mnt_userns from vfsmounts and files.
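
      The new member and the two accessors can be pictured with the
      following userspace model. The structs are heavily simplified stand-ins
      rather than the kernel definitions, the direct mnt pointer in struct
      file is a simplification, and mount_is_idmapped() is a hypothetical
      helper added only to show the "initial user namespace means not
      idmapped" convention described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel types. */
struct user_namespace {
	int level;
};

/* Stand-in for the initial user namespace. */
struct user_namespace init_user_ns = { 0 };

struct vfsmount {
	struct user_namespace *mnt_userns;	/* the new member */
};

struct file {
	struct vfsmount *mnt;	/* mount the file was opened through */
};

/* Retrieve the user namespace a mount is marked with. */
struct user_namespace *mnt_user_ns(const struct vfsmount *mnt)
{
	assert(mnt != NULL);
	return mnt->mnt_userns;
}

/* Retrieve the user namespace of the mount a file was opened through. */
struct user_namespace *file_mnt_user_ns(const struct file *file)
{
	assert(file != NULL);
	return mnt_user_ns(file->mnt);
}

/* Hypothetical: a mount is idmapped unless it carries init_user_ns. */
bool mount_is_idmapped(const struct vfsmount *mnt)
{
	return mnt_user_ns(mnt) != &init_user_ns;
}
```

      In this model every mount that is created with mnt_userns pointing at
      init_user_ns behaves as before, and only mounts marked with another
      user namespace take part in id translation.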
      
      The idea to attach user namespaces to vfsmounts has been floated around
      in various forms at Linux Plumbers in ~2018 with the original idea
      tracing back to a discussion in 2017 at a conference in St. Petersburg
      between Christoph, Tycho, and myself.
      
      Link: https://lore.kernel.org/r/20210121131959.646623-2-christian.brauner@ubuntu.com
      Cc: Christoph Hellwig <hch@lst.de>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: linux-fsdevel@vger.kernel.org
      Reviewed-by: Christoph Hellwig <hch@lst.de>
      Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
      a6435940