- 19 1月, 2023 2 次提交
-
-
由 Christian Brauner 提交于
Now that we converted everything to just rely on struct mnt_idmap move it all into a separate file. This ensure that no code can poke around in struct mnt_idmap without any dedicated helpers and makes it easier to extend it in the future. Filesystems will now not be able to conflate mount and filesystem idmappings as they are two distinct types and require distinct helpers that cannot be used interchangeably. We are now also able to extend struct mnt_idmap as we see fit. Acked-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NChristian Brauner (Microsoft) <brauner@kernel.org>
-
由 Christian Brauner 提交于
Convert to struct mnt_idmap. Remove legacy file_mnt_user_ns() and mnt_user_ns(). Last cycle we merged the necessary infrastructure in 256c8aed ("fs: introduce dedicated idmap type for mounts"). This is just the conversion to struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevent on the mount level. Especially for non-vfs developers without detailed knowledge in this area this can be a potential source for bugs. Once the conversion to struct mnt_idmap is done all helpers down to the really low-level helpers will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two eliminating the possibility of any bugs. All of the vfs and all filesystems only operate on struct mnt_idmap. Acked-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NChristian Brauner (Microsoft) <brauner@kernel.org>
-
- 25 11月, 2022 1 次提交
-
-
由 Al Viro 提交于
copy_mnt_ns() has the old tree copied, with mntns binding *and* anything bound on top of them skipped. Then it proceeds to walk both trees in parallel. Unfortunately, it doesn't get the "skip the stuff we'd skipped when copying" quite right. Consequences are minor (the ->mnt_root comparison will return the situation to sanity pretty soon and the worst we get is the unexpected subset of opened non-directories being switched to new namespace), but it's confusing enough and it's not hard to get the expected behaviour... Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 01 11月, 2022 1 次提交
-
-
由 Christian Brauner 提交于
Last cycle we've already made the interaction with idmapped mounts more robust and type safe by introducing the vfs{g,u}id_t type. This cycle we concluded the conversion and removed the legacy helpers. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate filesystem and mount namespaces and what different roles they have to play. Especially for filesystem developers without much experience in this area this is an easy source for bugs. Instead of passing the plain namespace we introduce a dedicated type struct mnt_idmap and replace the pointer with a pointer to a struct mnt_idmap. There are no semantic or size changes for the mount struct caused by this. We then start converting all places aware of idmapped mounts to rely on struct mnt_idmap. Once the conversion is done all helpers down to the really low-level make_vfs{g,u}id() and from_vfs{g,u}id() will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two, removing and thus eliminating the possibility of any bugs. Fwiw, I fixed some issues in that area a while ago in ntfs3 and ksmbd in the past. Afterwards, only low-level code can ultimately use the associated namespace for any permission checks. Even most of the vfs can be ultimately completely oblivious about this and filesystems will never interact with it directly in any form in the future. A struct mnt_idmap currently encompasses a simple refcount and a pointer to the relevant namespace the mount is idmapped to. If a mount isn't idmapped then it will point to a static nop_mnt_idmap. If it is an idmapped mount it will point to a new struct mnt_idmap. As usual there are no allocations or anything happening for non-idmapped mounts. Everthing is carefully written to be a nop for non-idmapped mounts as has always been the case. If an idmapped mount or mount tree is created a new struct mnt_idmap is allocated and a reference taken on the relevant namespace. For each mount in a mount tree that gets idmapped or a mount that inherits the idmap when it is cloned the reference count on the associated struct mnt_idmap is bumped. Just a reminder that we only allow a mount to change it's idmapping a single time and only if it hasn't already been attached to the filesystems and has no active writers. The actual changes are fairly straightforward. This will have huge benefits for maintenance and security in the long run even if it causes some churn. I'm aware that there's some cost for all of you. And I'll commit to doing this work and make this as painless as I can. Note that this also makes it possible to extend struct mount_idmap in the future. For example, it would be possible to place the namespace pointer in an anonymous union together with an idmapping struct. This would allow us to expose an api to userspace that would let it specify idmappings directly instead of having to go through the detour of setting up namespaces at all. This just adds the infrastructure and doesn't do any conversions. Reviewed-by: NSeth Forshee (DigitalOcean) <sforshee@kernel.org> Signed-off-by: NChristian Brauner (Microsoft) <brauner@kernel.org>
-
- 17 8月, 2022 1 次提交
-
-
由 Seth Forshee 提交于
Idmapped mounts should not allow a user to map file ownsership into a range of ids which is not under the control of that user. However, we currently don't check whether the mounter is privileged wrt to the target user namespace. Currently no FS_USERNS_MOUNT filesystems support idmapped mounts, thus this is not a problem as only CAP_SYS_ADMIN in init_user_ns is allowed to set up idmapped mounts. But this could change in the future, so add a check to refuse to create idmapped mounts when the mounter does not have CAP_SYS_ADMIN in the target user namespace. Fixes: bd303368 ("fs: support mapped mounts of mapped filesystems") Signed-off-by: NSeth Forshee <sforshee@digitalocean.com> Reviewed-by: NChristian Brauner (Microsoft) <brauner@kernel.org> Link: https://lore.kernel.org/r/20220816164752.2595240-1-sforshee@digitalocean.comSigned-off-by: NChristian Brauner (Microsoft) <brauner@kernel.org>
-
- 06 7月, 2022 1 次提交
-
-
由 Al Viro 提交于
The tricky case (__legitimize_mnt() failing after having grabbed a reference) can be trivially dealt with by leaving nd->path.mnt non-NULL, for terminate_walk() to drop it. legitimize_mnt() becomes static after that. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 20 5月, 2022 1 次提交
-
-
由 Al Viro 提交于
It's done once per (mount-related) syscall and there's no point whatsoever making it inline. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 12 5月, 2022 1 次提交
-
-
由 Christian Brauner 提交于
Hold writers when changing a mount's idmapping to make it more robust. The vfs layer takes care to retrieve the idmapping of a mount once ensuring that the idmapping used for vfs permission checking is identical to the idmapping passed down to the filesystem. For ioctl codepaths the filesystem itself is responsible for taking the idmapping into account if they need to. While all filesystems with FS_ALLOW_IDMAP raised take the same precautions as the vfs we should enforce it explicitly by making sure there are no active writers on the relevant mount while changing the idmapping. This is similar to turning a mount ro with the difference that in contrast to turning a mount ro changing the idmapping can only ever be done once while a mount can transition between ro and rw as much as it wants. This is a minor user-visible change. But it is extremely unlikely to matter. The caller must've created a detached mount via OPEN_TREE_CLONE and then handed that O_PATH fd to another process or thread which then must've gotten a writable fd for that mount and started creating files in there while the caller is still changing mount properties. While not impossible it will be an extremely rare corner-case and should in general be considered a bug in the application. Consider making a mount MOUNT_ATTR_NOEXEC or MOUNT_ATTR_NODEV while allowing someone else to perform lookups or exec'ing in parallel by handing them a copy of the OPEN_TREE_CLONE fd or another fd beneath that mount. Link: https://lore.kernel.org/r/20220510095840.152264-1-brauner@kernel.org Cc: Seth Forshee <seth.forshee@digitalocean.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Signed-off-by: NChristian Brauner (Microsoft) <brauner@kernel.org>
-
- 21 4月, 2022 1 次提交
-
-
由 Christian Brauner 提交于
After mnt_hold_writers() has been called we will always have set MNT_WRITE_HOLD and consequently we always need to pair mnt_hold_writers() with mnt_unhold_writers(). After the recent cleanup in [1] where Al switched from a do-while to a for loop the cleanup currently fails to unset MNT_WRITE_HOLD for the first mount that was changed. Fix this and make sure that the first mount will be cleaned up and add some comments to make it more obvious. Link: https://lore.kernel.org/lkml/0000000000007cc21d05dd0432b8@google.com Link: https://lore.kernel.org/lkml/00000000000080e10e05dd043247@google.com Link: https://lore.kernel.org/r/20220420131925.2464685-1-brauner@kernel.org Fixes: e257039f ("mount_setattr(): clean the control flow and calling conventions") [1] Cc: Hillf Danton <hdanton@sina.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Reported-by: syzbot+10a16d1c43580983f6a2@syzkaller.appspotmail.com Reported-by: syzbot+306090cfa3294f0bbfb3@syzkaller.appspotmail.com Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NChristian Brauner (Microsoft) <brauner@kernel.org>
-
- 23 3月, 2022 1 次提交
-
-
由 Anthony Iliopoulos 提交于
Commit f8b92ba6 ("mount: Add mount warning for impending timestamp expiry") introduced a mount warning regarding filesystem timestamp limits, that is printed upon each writable mount or remount. This can result in a lot of unnecessary messages in the kernel log in setups where filesystems are being frequently remounted (or mounted multiple times). Avoid this by setting a superblock flag which indicates that the warning has been emitted at least once for any particular mount, as suggested in [1]. Link: https://lore.kernel.org/CAHk-=wim6VGnxQmjfK_tDg6fbHYKL4EFkmnTjVr9QnRqjDBAeA@mail.gmail.com/ [1] Link: https://lkml.kernel.org/r/20220119202934.26495-1-ailiop@suse.comSigned-off-by: NAnthony Iliopoulos <ailiop@suse.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Acked-by: NChristian Brauner <christian.brauner@ubuntu.com> Reviewed-by: NDarrick J. Wong <djwong@kernel.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Deepa Dinamani <deepa.kernel@gmail.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 16 3月, 2022 1 次提交
-
-
由 Al Viro 提交于
separate the "cleanup" and "apply" codepaths (they have almost no overlap), fold the "cleanup" into "prepare" (which eliminates the need of ->revert) and make loops more idiomatic. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 14 2月, 2022 6 次提交
-
-
由 Christian Brauner 提交于
Simplify the control flow of mount_setattr_{prepare,commit} so they become easier to follow. We kept using both an integer error variable that was passed by pointer as well as a pointer as an indicator for whether or not we need to revert or commit the prepared changes. Simplify this and just use the pointer. If we successfully changed properties the revert pointer will be NULL and if we failed to change properties it will indicate where we failed and thus need to stop reverting. Link: https://lore.kernel.org/r/20220203131411.3093040-8-brauner@kernel.org Cc: Seth Forshee <seth.forshee@digitalocean.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NChristian Brauner <brauner@kernel.org>
-
由 Christian Brauner 提交于
Remove sb_prepare_remount_readonly()'s open-coded mnt_hold_writers() implementation with the real helper we introduced in commit fbdc2f6c ("fs: split out functions to hold writers"). Link: https://lore.kernel.org/r/20220203131411.3093040-7-brauner@kernel.org Cc: Seth Forshee <seth.forshee@digitalocean.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NChristian Brauner <brauner@kernel.org>
-
由 Christian Brauner 提交于
In order to determine whether we need to call mnt_unhold_writers() in mount_setattr_commit() we currently do not just check whether MNT_WRITE_HOLD is set but also if a read-only mount was requested. However, checking whether MNT_WRITE_HOLD is set is enough. Setting MNT_WRITE_HOLD requires lock_mount_hash() to be held and it must be unset before calling unlock_mount_hash(). This guarantees that if we see MNT_WRITE_HOLD we know that we were the ones who set it earlier. We don't need to care about why we set it. Plus, leaving this additional read-only check in makes the code more confusing because it implies that MNT_WRITE_HOLD could've been set by another thread when it really can't. Remove it and update the associated comment. Link: https://lore.kernel.org/r/20220203131411.3093040-6-brauner@kernel.org Cc: Seth Forshee <seth.forshee@digitalocean.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NChristian Brauner <brauner@kernel.org>
-
由 Christian Brauner 提交于
Add a tiny helper that lets us simplify the control-flow and can be used in the next patch to avoid adding another condition open-coded into mount_setattr_prepare(). Instead we can add it into the new helper. Link: https://lore.kernel.org/r/20220203131411.3093040-5-brauner@kernel.org Cc: Seth Forshee <seth.forshee@digitalocean.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NChristian Brauner <brauner@kernel.org>
-
由 Christian Brauner 提交于
When I introduced mnt_{hold,unhold}_writers() in commit fbdc2f6c ("fs: split out functions to hold writers") I did not add kernel doc for them. Fix this and introduce proper documentation. Link: https://lore.kernel.org/r/20220203131411.3093040-4-brauner@kernel.org Fixes: fbdc2f6c ("fs: split out functions to hold writers") Cc: Seth Forshee <seth.forshee@digitalocean.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NChristian Brauner <brauner@kernel.org>
-
由 Al Viro 提交于
Wraparound checks in there are redundant (x + y < x and x + y < y are equivalent when x and y are both unsigned int). IMO more straightforward code would be better here... Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 31 1月, 2022 1 次提交
-
-
由 Al Viro 提交于
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 22 1月, 2022 1 次提交
-
-
由 Luis Chamberlain 提交于
This moves the namespace sysctls to its own file as part of the kernel/sysctl.c spring cleaning Since we have now removed all sysctls for "fs", we now have to declare it on the filesystem code, we do that using the new helper, which reduces boiler plate code. We rename init_fs_shared_sysctls() to init_fs_sysctls() to reflect that now fs/sysctls.c is taking on the burden of being the first to register the base directory as well. Lastly, since init code will load in the order in which we link it we have to move the sysctl code to be linked in early, so that its early init routine runs prior to other fs code. This way, other filesystem code can register their own sysctls using the helpers after this: * register_sysctl_init() * register_sysctl() Link: https://lkml.kernel.org/r/20211129211943.640266-3-mcgrof@kernel.orgSigned-off-by: NLuis Chamberlain <mcgrof@kernel.org> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Anil S Keshavamurthy <anil.s.keshavamurthy@intel.com> Cc: Antti Palosaari <crope@iki.fi> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Eric Biederman <ebiederm@xmission.com> Cc: Eric Biggers <ebiggers@google.com> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Lukas Middendorf <kernel@tuxforce.de> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Stephen Kitt <steve@sk2.org> Cc: Xiaoming Ni <nixiaoming@huawei.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 31 12月, 2021 1 次提交
-
-
由 Christian Brauner 提交于
Make sure that finish_mount_kattr() is called after mount_kattr was succesfully built in both the success and failure case to prevent leaking any references we took when we built it. We returned early if path lookup failed thereby risking to leak an additional reference we took when building mount_kattr when an idmapped mount was requested. Cc: linux-fsdevel@vger.kernel.org Cc: stable@vger.kernel.org Fixes: 9caccd41 ("fs: introduce MOUNT_ATTR_IDMAP") Signed-off-by: NChristian Brauner <christian.brauner@ubuntu.com> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 05 12月, 2021 1 次提交
-
-
由 Christian Brauner 提交于
In previous patches we added new and modified existing helpers to handle idmapped mounts of filesystems mounted with an idmapping. In this final patch we convert all relevant places in the vfs to actually pass the filesystem's idmapping into these helpers. With this the vfs is in shape to handle idmapped mounts of filesystems mounted with an idmapping. Note that this is just the generic infrastructure. Actually adding support for idmapped mounts to a filesystem mountable with an idmapping is follow-up work. In this patch we extend the definition of an idmapped mount from a mount that that has the initial idmapping attached to it to a mount that has an idmapping attached to it which is not the same as the idmapping the filesystem was mounted with. As before we do not allow the initial idmapping to be attached to a mount. In addition this patch prevents that the idmapping the filesystem was mounted with can be attached to a mount created based on this filesystem. This has multiple reasons and advantages. First, attaching the initial idmapping or the filesystem's idmapping doesn't make much sense as in both cases the values of the i_{g,u}id and other places where k{g,u}ids are used do not change. Second, a user that really wants to do this for whatever reason can just create a separate dedicated identical idmapping to attach to the mount. Third, we can continue to use the initial idmapping as an indicator that a mount is not idmapped allowing us to continue to keep passing the initial idmapping into the mapping helpers to tell them that something isn't an idmapped mount even if the filesystem is mounted with an idmapping. Link: https://lore.kernel.org/r/20211123114227.3124056-11-brauner@kernel.org (v1) Link: https://lore.kernel.org/r/20211130121032.3753852-11-brauner@kernel.org (v2) Link: https://lore.kernel.org/r/20211203111707.3901969-11-brauner@kernel.org Cc: Seth Forshee <sforshee@digitalocean.com> Cc: Amir Goldstein <amir73il@gmail.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> CC: linux-fsdevel@vger.kernel.org Reviewed-by: NSeth Forshee <sforshee@digitalocean.com> Signed-off-by: NChristian Brauner <christian.brauner@ubuntu.com>
-
- 04 12月, 2021 1 次提交
-
-
由 Christian Brauner 提交于
Multiple places open-code the same check to determine whether a given mount is idmapped. Introduce a simple helper function that can be used instead. This allows us to get rid of the fragile open-coding. We will later change the check that is used to determine whether a given mount is idmapped. Introducing a helper allows us to do this in a single place instead of doing it for multiple places. Link: https://lore.kernel.org/r/20211123114227.3124056-2-brauner@kernel.org (v1) Link: https://lore.kernel.org/r/20211130121032.3753852-2-brauner@kernel.org (v2) Link: https://lore.kernel.org/r/20211203111707.3901969-2-brauner@kernel.org Cc: Seth Forshee <sforshee@digitalocean.com> Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> CC: linux-fsdevel@vger.kernel.org Reviewed-by: NAmir Goldstein <amir73il@gmail.com> Reviewed-by: NSeth Forshee <sforshee@digitalocean.com> Signed-off-by: NChristian Brauner <christian.brauner@ubuntu.com>
-
- 26 11月, 2021 1 次提交
-
-
The MNT_WRITE_HOLD flag is used to hold back any new writers while the mount point is about to be made read-only. __mnt_want_write() then loops with disabled preemption until this flag disappears. Callers of mnt_hold_writers() (which sets the flag) hold the spinlock_t of mount_lock (seqlock_t) which disables preemption on !PREEMPT_RT and ensures the task is not scheduled away so that the spinning side spins for a long time. On PREEMPT_RT the spinlock_t does not disable preemption and so it is possible that the task setting MNT_WRITE_HOLD is preempted by task with higher priority which then spins infinitely waiting for MNT_WRITE_HOLD to get removed. Acquire mount_lock::lock which is held by setter of MNT_WRITE_HOLD. This will PI-boost the owner and wait until the lock is dropped and which means that MNT_WRITE_HOLD is cleared again. Link: https://lore.kernel.org/r/20211025152218.opvcqfku2lhqvp4o@linutronix.de Link: https://lore.kernel.org/r/20211125120711.dgbsienyrsxfzpoi@linutronix.deAcked-by: NChristian Brauner <christian.brauner@ubuntu.com> Signed-off-by: NSebastian Andrzej Siewior <bigeasy@linutronix.de>
-
- 04 9月, 2021 2 次提交
-
-
由 Vasily Averin 提交于
Container admin can create new namespaces and force kernel to allocate up to several pages of memory for the namespaces and its associated structures. Net and uts namespaces have enabled accounting for such allocations. It makes sense to account for rest ones to restrict the host's memory consumption from inside the memcg-limited container. Link: https://lkml.kernel.org/r/5525bcbf-533e-da27-79b7-158686c64e13@virtuozzo.comSigned-off-by: NVasily Averin <vvs@virtuozzo.com> Acked-by: NSerge Hallyn <serge@hallyn.com> Acked-by: NChristian Brauner <christian.brauner@ubuntu.com> Acked-by: NKirill Tkhai <ktkhai@virtuozzo.com> Reviewed-by: NShakeel Butt <shakeelb@google.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrei Vagin <avagin@gmail.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Borislav Petkov <bp@suse.de> Cc: Dmitry Safonov <0x7f454c46@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Jeff Layton <jlayton@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jiri Slaby <jirislaby@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Roman Gushchin <guro@fb.com> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Yutian Yang <nglaive@gmail.com> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
由 Vasily Averin 提交于
Patch series "memcg accounting from OpenVZ", v7. OpenVZ uses memory accounting 20+ years since v2.2.x linux kernels. Initially we used our own accounting subsystem, then partially committed it to upstream, and a few years ago switched to cgroups v1. Now we're rebasing again, revising our old patches and trying to push them upstream. We try to protect the host system from any misuse of kernel memory allocation triggered by untrusted users inside the containers. Patch-set is addressed mostly to cgroups maintainers and cgroups@ mailing list, though I would be very grateful for any comments from maintainersi of affected subsystems or other people added in cc: Compared to the upstream, we additionally account the following kernel objects: - network devices and its Tx/Rx queues - ipv4/v6 addresses and routing-related objects - inet_bind_bucket cache objects - VLAN group arrays - ipv6/sit: ip_tunnel_prl - scm_fp_list objects used by SCM_RIGHTS messages of Unix sockets - nsproxy and namespace objects itself - IPC objects: semaphores, message queues and share memory segments - mounts - pollfd and select bits arrays - signals and posix timers - file lock - fasync_struct used by the file lease code and driver's fasync queues - tty objects - per-mm LDT We have an incorrect/incomplete/obsoleted accounting for few other kernel objects: sk_filter, af_packets, netlink and xt_counters for iptables. They require rework and probably will be dropped at all. Also we're going to add an accounting for nft, however it is not ready yet. We have not tested performance on upstream, however, our performance team compares our current RHEL7-based production kernel and reports that they are at least not worse as the according original RHEL7 kernel. This patch (of 10): The kernel allocates ~400 bytes of 'struct mount' for any new mount. Creating a new mount namespace clones most of the parent mounts, and this can be repeated many times. Additionally, each mount allocates up to PATH_MAX=4096 bytes for mnt->mnt_devname. It makes sense to account for these allocations to restrict the host's memory consumption from inside the memcg-limited container. Link: https://lkml.kernel.org/r/045db11f-4a45-7c9b-2664-5b32c2b44943@virtuozzo.comSigned-off-by: NVasily Averin <vvs@virtuozzo.com> Reviewed-by: NShakeel Butt <shakeelb@google.com> Acked-by: NChristian Brauner <christian.brauner@ubuntu.com> Cc: Tejun Heo <tj@kernel.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Vladimir Davydov <vdavydov.dev@gmail.com> Cc: Roman Gushchin <guro@fb.com> Cc: Yutian Yang <nglaive@gmail.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrei Vagin <avagin@gmail.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dmitry Safonov <0x7f454c46@gmail.com> Cc: "Eric W. Biederman" <ebiederm@xmission.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Jeff Layton <jlayton@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Jiri Slaby <jirislaby@kernel.org> Cc: Kirill Tkhai <ktkhai@virtuozzo.com> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Serge Hallyn <serge@hallyn.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Zefan Li <lizefan.x@bytedance.com> Cc: Borislav Petkov <bp@suse.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 23 8月, 2021 1 次提交
-
-
由 Jeff Layton 提交于
We added CONFIG_MANDATORY_FILE_LOCKING in 2015, and soon after turned it off in Fedora and RHEL8. Several other distros have followed suit. I've heard of one problem in all that time: Someone migrated from an older distro that supported "-o mand" to one that didn't, and the host had a fstab entry with "mand" in it which broke on reboot. They didn't actually _use_ mandatory locking so they just removed the mount option and moved on. This patch rips out mandatory locking support wholesale from the kernel, along with the Kconfig option and the Documentation file. It also changes the mount code to ignore the "mand" mount option instead of erroring out, and to throw a big, ugly warning. Signed-off-by: NJeff Layton <jlayton@kernel.org>
-
- 21 8月, 2021 1 次提交
-
-
由 Jeff Layton 提交于
We've had CONFIG_MANDATORY_FILE_LOCKING since 2015 and a lot of distros have disabled it. Warn the stragglers that still use "-o mand" that we'll be dropping support for that mount option. Cc: stable@vger.kernel.org Signed-off-by: NJeff Layton <jlayton@kernel.org>
-
- 10 8月, 2021 1 次提交
-
-
由 Miklos Szeredi 提交于
Add the following checks from __do_loopback() to clone_private_mount() as well: - verify that the mount is in the current namespace - verify that there are no locked children Reported-by: NAlois Wohlschlager <alois1@gmx-topmail.de> Fixes: c771d683 ("vfs: introduce clone_private_mount()") Cc: <stable@vger.kernel.org> # v3.18 Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
-
- 26 7月, 2021 1 次提交
-
-
由 Pavel Tikhomirov 提交于
Previously a sharing group (shared and master ids pair) can be only inherited when mount is created via bindmount. This patch adds an ability to add an existing private mount into an existing sharing group. With this functionality one can first create the desired mount tree from only private mounts (without the need to care about undesired mount propagation or mount creation order implied by sharing group dependencies), and next then setup any desired mount sharing between those mounts in tree as needed. This allows CRIU to restore any set of mount namespaces, mount trees and sharing group trees for a container. We have many issues with restoring mounts in CRIU related to sharing groups and propagation: - reverse sharing groups vs mount tree order requires complex mounts reordering which mostly implies also using some temporary mounts (please see https://lkml.org/lkml/2021/3/23/569 for more info) - mount() syscall creates tons of mounts due to propagation - mount re-parenting due to propagation - "Mount Trap" due to propagation - "Non Uniform" propagation, meaning that with different tricks with mount order and temporary children-"lock" mounts one can create mount trees which can't be restored without those tricks (see https://www.linuxplumbersconf.org/event/7/contributions/640/) With this new functionality we can resolve all the problems with propagation at once. Link: https://lore.kernel.org/r/20210715100714.120228-1-ptikhomirov@virtuozzo.com Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <christian.brauner@ubuntu.com> Cc: Mattias Nissler <mnissler@chromium.org> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Andrei Vagin <avagin@gmail.com> Cc: linux-fsdevel@vger.kernel.org Cc: linux-api@vger.kernel.org Cc: lkml <linux-kernel@vger.kernel.org> Co-developed-by: NAndrei Vagin <avagin@gmail.com> Acked-by: NChristian Brauner <christian.brauner@ubuntu.com> Signed-off-by: NPavel Tikhomirov <ptikhomirov@virtuozzo.com> Signed-off-by: NAndrei Vagin <avagin@gmail.com> Signed-off-by: NChristian Brauner <christian.brauner@ubuntu.com>
-
- 01 6月, 2021 1 次提交
-
-
由 Christian Brauner 提交于
Commit dab741e0 ("Add a "nosymfollow" mount option.") added support for the "nosymfollow" mount option allowing to block following symlinks when resolving paths. The mount option so far was only available in the old mount api. Make it available in the new mount api as well. Bonus is that it can be applied to a whole subtree not just a single mount. Cc: Christoph Hellwig <hch@lst.de> Cc: Mattias Nissler <mnissler@chromium.org> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Ross Zwisler <zwisler@google.com> Signed-off-by: NChristian Brauner <christian.brauner@ubuntu.com>
-
- 12 5月, 2021 1 次提交
-
-
由 Christian Brauner 提交于
We currently don't have any filesystems that support idmapped mounts which are mountable inside a user namespace. That was a deliberate decision for now as a userns root can just mount the filesystem themselves. So enforce this restriction explicitly until there's a real use-case for this. This way we can notice it and will have a chance to adapt and audit our translation helpers and fstests appropriately if we need to support such filesystems. Cc: Christoph Hellwig <hch@lst.de> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: stable@vger.kernel.org CC: linux-fsdevel@vger.kernel.org Suggested-by: NSeth Forshee <seth.forshee@canonical.com> Signed-off-by: NChristian Brauner <christian.brauner@ubuntu.com>
-
- 01 4月, 2021 1 次提交
-
-
由 Randy Dunlap 提交于
Fix kernel-doc warnings in fs/namespace.c: ./fs/namespace.c:1379: warning: Function parameter or member 'm' not described in 'may_umount_tree' ./fs/namespace.c:1379: warning: Excess function parameter 'mnt' description in 'may_umount_tree' ./fs/namespace.c:1950: warning: Function parameter or member 'path' not described in 'clone_private_mount' Also convert path_is_mountpoint() comments to kernel-doc. Signed-off-by: NRandy Dunlap <rdunlap@infradead.org> Allegedly-acked-by: NAl Viro <viro@zeniv.linux.org.uk> Link: https://lore.kernel.org/r/20210318025227.4162-1-rdunlap@infradead.orgSigned-off-by: NJonathan Corbet <corbet@lwn.net>
-
- 24 1月, 2021 8 次提交
-
-
由 Christian Brauner 提交于
Introduce a new mount bind mount property to allow idmapping mounts. The MOUNT_ATTR_IDMAP flag can be set via the new mount_setattr() syscall together with a file descriptor referring to a user namespace. The user namespace referenced by the namespace file descriptor will be attached to the bind mount. All interactions with the filesystem going through that mount will be mapped according to the mapping specified in the user namespace attached to it. Using user namespaces to mark mounts means we can reuse all the existing infrastructure in the kernel that already exists to handle idmappings and can also use this for permission checking to allow unprivileged user to create idmapped mounts in the future. Idmapping a mount is decoupled from the caller's user and mount namespace. This means idmapped mounts can be created in the initial user namespace which is an important use-case for systemd-homed, portable usb-sticks between systems, sharing data between the initial user namespace and unprivileged containers, and other use-cases that have been brought up. For example, assume a home directory where all files are owned by uid and gid 1000 and the home directory is brought to a new laptop where the user has id 12345. The system administrator can simply create a mount of this home directory with a mapping of 1000:12345:1 and other mappings to indicate the ids should be kept. (With this it is e.g. also possible to create idmapped mounts on the host with an identity mapping 1:1:100000 where the root user is not mapped. A user with root access that e.g. has been pivot rooted into such a mount on the host will be not be able to execute, read, write, or create files as root.) Given that mapping a mount is decoupled from the caller's user namespace a sufficiently privileged process such as a container manager can set up an idmapped mount for the container and the container can simply pivot root to it. There's no need for the container to do anything. The mount will appear correctly mapped independent of the user namespace the container uses. This means we don't need to mark a mount as idmappable. In order to create an idmapped mount the caller must currently be privileged in the user namespace of the superblock the mount belongs to. Once a mount has been idmapped we don't allow it to change its mapping. This keeps permission checking and life-cycle management simple. Users wanting to change the idmapped can always create a new detached mount with a different idmapping. Link: https://lore.kernel.org/r/20210121131959.646623-36-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Mauricio Vásquez Bernal <mauricio@kinvolk.io> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NChristian Brauner <christian.brauner@ubuntu.com>
-
由 Christian Brauner 提交于
This implements the missing mount_setattr() syscall. While the new mount api allows to change the properties of a superblock there is currently no way to change the properties of a mount or a mount tree using file descriptors which the new mount api is based on. In addition the old mount api has the restriction that mount options cannot be applied recursively. This hasn't changed since changing mount options on a per-mount basis was implemented in [1] and has been a frequent request not just for convenience but also for security reasons. The legacy mount syscall is unable to accommodate this behavior without introducing a whole new set of flags because MS_REC | MS_REMOUNT | MS_BIND | MS_RDONLY | MS_NOEXEC | [...] only apply the mount option to the topmost mount. Changing MS_REC to apply to the whole mount tree would mean introducing a significant uapi change and would likely cause significant regressions. The new mount_setattr() syscall allows to recursively clear and set mount options in one shot. Multiple calls to change mount options requesting the same changes are idempotent: int mount_setattr(int dfd, const char *path, unsigned flags, struct mount_attr *uattr, size_t usize); Flags to modify path resolution behavior are specified in the @flags argument. Currently, AT_EMPTY_PATH, AT_RECURSIVE, AT_SYMLINK_NOFOLLOW, and AT_NO_AUTOMOUNT are supported. If useful, additional lookup flags to restrict path resolution as introduced with openat2() might be supported in the future. The mount_setattr() syscall can be expected to grow over time and is designed with extensibility in mind. It follows the extensible syscall pattern we have used with other syscalls such as openat2(), clone3(), sched_{set,get}attr(), and others. The set of mount options is passed in the uapi struct mount_attr which currently has the following layout: struct mount_attr { __u64 attr_set; __u64 attr_clr; __u64 propagation; __u64 userns_fd; }; The @attr_set and @attr_clr members are used to clear and set mount options. This way a user can e.g. request that a set of flags is to be raised such as turning mounts readonly by raising MOUNT_ATTR_RDONLY in @attr_set while at the same time requesting that another set of flags is to be lowered such as removing noexec from a mount tree by specifying MOUNT_ATTR_NOEXEC in @attr_clr. Note, since the MOUNT_ATTR_<atime> values are an enum starting from 0, not a bitmap, users wanting to transition to a different atime setting cannot simply specify the atime setting in @attr_set, but must also specify MOUNT_ATTR__ATIME in the @attr_clr field. So we ensure that MOUNT_ATTR__ATIME can't be partially set in @attr_clr and that @attr_set can't have any atime bits set if MOUNT_ATTR__ATIME isn't set in @attr_clr. The @propagation field lets callers specify the propagation type of a mount tree. Propagation is a single property that has four different settings and as such is not really a flag argument but an enum. Specifically, it would be unclear what setting and clearing propagation settings in combination would amount to. The legacy mount() syscall thus forbids the combination of multiple propagation settings too. The goal is to keep the semantics of mount propagation somewhat simple as they are overly complex as it is. The @userns_fd field lets user specify a user namespace whose idmapping becomes the idmapping of the mount. This is implemented and explained in detail in the next patch. [1]: commit 2e4b7fcd ("[PATCH] r/o bind mounts: honor mount writer counts at remount") Link: https://lore.kernel.org/r/20210121131959.646623-35-christian.brauner@ubuntu.com Cc: David Howells <dhowells@redhat.com> Cc: Aleksa Sarai <cyphar@cyphar.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Cc: linux-api@vger.kernel.org Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NChristian Brauner <christian.brauner@ubuntu.com>
-
由 Christian Brauner 提交于
Add a simple helper to translate uapi MOUNT_ATTR_* flags to MNT_* flags which we will use in follow-up patches too. Link: https://lore.kernel.org/r/20210121131959.646623-34-christian.brauner@ubuntu.com Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Suggested-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NChristian Brauner <christian.brauner@ubuntu.com>
-
由 Christian Brauner 提交于
When a mount is marked read-only we set MNT_WRITE_HOLD on it if there aren't currently any active writers. Split this logic out into simple helpers that we can use in follow-up patches. Link: https://lore.kernel.org/r/20210121131959.646623-33-christian.brauner@ubuntu.com Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Suggested-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NChristian Brauner <christian.brauner@ubuntu.com>
-
由 Christian Brauner 提交于
do_reconfigure_mnt() used to take the down_write(&sb->s_umount) lock which seems unnecessary since we're not changing the superblock. We're only checking whether it is already read-only. Setting other mount attributes is protected by lock_mount_hash() afaict and not by s_umount. The history of down_write(&sb->s_umount) lock being taken when setting mount attributes dates back to the introduction of MNT_READONLY in [2]. This introduced the concept of having read-only mounts in contrast to just having a read-only superblock. When it got introduced it was simply plumbed into do_remount() which already took down_write(&sb->s_umount) because it was only used to actually change the superblock before [2]. Afaict, it would've already been possible back then to only use down_read(&sb->s_umount) for MS_BIND | MS_REMOUNT since actual mount options were protected by the vfsmount lock already. But that would've meant special casing the locking for MS_BIND | MS_REMOUNT in do_remount() which people might not have considered worth it. Then in [1] MS_BIND | MS_REMOUNT mount option changes were split out of do_remount() into do_reconfigure_mnt() but the down_write(&sb->s_umount) lock was simply copied over. Now that we have this be a separate helper only take the down_read(&sb->s_umount) lock since we're only interested in checking whether the super block is currently read-only and blocking any writers from changing it. Essentially, checking that the super block is read-only has the advantage that we can avoid having to go into the slowpath and through MNT_WRITE_HOLD and can simply set the read-only flag on the mount in set_mount_attributes(). [1]: commit 43f5e655 ("vfs: Separate changing mount flags full remount") [2]: commit 2e4b7fcd ("[PATCH] r/o bind mounts: honor mount writer counts at remount") Link: https://lore.kernel.org/r/20210121131959.646623-32-christian.brauner@ubuntu.com Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NChristian Brauner <christian.brauner@ubuntu.com>
-
由 Christian Brauner 提交于
The lock_mount_hash() and unlock_mount_hash() helpers are never called outside a single file. Remove them from the header and make them static to reflect this fact. There's no need to have them callable from other places right now, as Christoph observed. Link: https://lore.kernel.org/r/20210121131959.646623-31-christian.brauner@ubuntu.com Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Suggested-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NChristian Brauner <christian.brauner@ubuntu.com>
-
由 Christian Brauner 提交于
Changing mount options always ends up taking lock_mount_hash() but when MNT_READONLY is requested and neither the mount nor the superblock are MNT_READONLY we end up taking the lock, dropping it, and retaking it to change the other mount attributes. Instead, let's acquire the lock once when changing the mount attributes. This simplifies the locking in these codepath, makes them easier to reason about and avoids having to reacquire the lock right after dropping it. Link: https://lore.kernel.org/r/20210121131959.646623-30-christian.brauner@ubuntu.com Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NChristian Brauner <christian.brauner@ubuntu.com>
-
由 Christian Brauner 提交于
In order to support per-mount idmappings vfsmounts are marked with user namespaces. The idmapping of the user namespace will be used to map the ids of vfs objects when they are accessed through that mount. By default all vfsmounts are marked with the initial user namespace. The initial user namespace is used to indicate that a mount is not idmapped. All operations behave as before. Based on prior discussions we want to attach the whole user namespace and not just a dedicated idmapping struct. This allows us to reuse all the helpers that already exist for dealing with idmappings instead of introducing a whole new range of helpers. In addition, if we decide in the future that we are confident enough to enable unprivileged users to setup idmapped mounts the permission checking can take into account whether the caller is privileged in the user namespace the mount is currently marked with. Later patches enforce that once a mount has been idmapped it can't be remapped. This keeps permission checking and life-cycle management simple. Users wanting to change the idmapped can always create a new detached mount with a different idmapping. Add a new mnt_userns member to vfsmount and two simple helpers to retrieve the mnt_userns from vfsmounts and files. The idea to attach user namespaces to vfsmounts has been floated around in various forms at Linux Plumbers in ~2018 with the original idea tracing back to a discussion in 2017 at a conference in St. Petersburg between Christoph, Tycho, and myself. Link: https://lore.kernel.org/r/20210121131959.646623-2-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NChristian Brauner <christian.brauner@ubuntu.com>
-