- 19 1月, 2023 3 次提交
-
-
由 Christian Brauner 提交于
Now that we converted everything to just rely on struct mnt_idmap move it all into a separate file. This ensure that no code can poke around in struct mnt_idmap without any dedicated helpers and makes it easier to extend it in the future. Filesystems will now not be able to conflate mount and filesystem idmappings as they are two distinct types and require distinct helpers that cannot be used interchangeably. We are now also able to extend struct mnt_idmap as we see fit. Acked-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NChristian Brauner (Microsoft) <brauner@kernel.org>
-
由 Christian Brauner 提交于
Convert to struct mnt_idmap. Last cycle we merged the necessary infrastructure in 256c8aed ("fs: introduce dedicated idmap type for mounts"). This is just the conversion to struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevent on the mount level. Especially for non-vfs developers without detailed knowledge in this area this can be a potential source for bugs. Once the conversion to struct mnt_idmap is done all helpers down to the really low-level helpers will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two eliminating the possibility of any bugs. All of the vfs and all filesystems only operate on struct mnt_idmap. Acked-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NChristian Brauner (Microsoft) <brauner@kernel.org>
-
由 Christian Brauner 提交于
Convert to struct mnt_idmap. Last cycle we merged the necessary infrastructure in 256c8aed ("fs: introduce dedicated idmap type for mounts"). This is just the conversion to struct mnt_idmap. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem with namespaces that are relevent on the mount level. Especially for non-vfs developers without detailed knowledge in this area this can be a potential source for bugs. Once the conversion to struct mnt_idmap is done all helpers down to the really low-level helpers will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two eliminating the possibility of any bugs. All of the vfs and all filesystems only operate on struct mnt_idmap. Acked-by: NDave Chinner <dchinner@redhat.com> Reviewed-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NChristian Brauner (Microsoft) <brauner@kernel.org>
-
- 01 11月, 2022 1 次提交
-
-
由 Christian Brauner 提交于
Convert an initial portion to rely on struct mnt_idmap by converting the high level xattr helpers. Reviewed-by: NSeth Forshee (DigitalOcean) <sforshee@kernel.org> Signed-off-by: NChristian Brauner (Microsoft) <brauner@kernel.org>
-
- 20 10月, 2022 2 次提交
-
-
由 Christian Brauner 提交于
In previous patches we built a new posix api solely around get and set inode operations. Now that we have all the pieces in place we can switch the system calls and the vfs over to only rely on this api when interacting with posix acls. This finally removes all type unsafety and type conversion issues explained in detail in [1] that we aim to get rid of. With the new posix acl api we immediately translate into an appropriate kernel internal struct posix_acl format both when getting and setting posix acls. This is a stark contrast to before were we hacked unsafe raw values into the uapi struct that was stored in a void pointer relying and having filesystems and security modules hack around in the uapi struct as well. Link: https://lore.kernel.org/all/20220801145520.1532837-1-brauner@kernel.org [1] Signed-off-by: NChristian Brauner (Microsoft) <brauner@kernel.org>
-
由 Christian Brauner 提交于
Split out the generic checks whether an inode allows writing xattrs. Since security.* and system.* xattrs don't have any restrictions and we're going to split out posix acls into a dedicated api we will use this helper to check whether we can write posix acls. Signed-off-by: NChristian Brauner (Microsoft) <brauner@kernel.org>
-
- 18 10月, 2022 3 次提交
-
-
由 Christian Brauner 提交于
Currently setgid stripping in file_remove_privs()'s should_remove_suid() helper is inconsistent with other parts of the vfs. Specifically, it only raises ATTR_KILL_SGID if the inode is S_ISGID and S_IXGRP but not if the inode isn't in the caller's groups and the caller isn't privileged over the inode although we require this already in setattr_prepare() and setattr_copy() and so all filesystem implement this requirement implicitly because they have to use setattr_{prepare,copy}() anyway. But the inconsistency shows up in setgid stripping bugs for overlayfs in xfstests (e.g., generic/673, generic/683, generic/685, generic/686, generic/687). For example, we test whether suid and setgid stripping works correctly when performing various write-like operations as an unprivileged user (fallocate, reflink, write, etc.): echo "Test 1 - qa_user, non-exec file $verb" setup_testfile chmod a+rws $junk_file commit_and_check "$qa_user" "$verb" 64k 64k The test basically creates a file with 6666 permissions. While the file has the S_ISUID and S_ISGID bits set it does not have the S_IXGRP set. On a regular filesystem like xfs what will happen is: sys_fallocate() -> vfs_fallocate() -> xfs_file_fallocate() -> file_modified() -> __file_remove_privs() -> dentry_needs_remove_privs() -> should_remove_suid() -> __remove_privs() newattrs.ia_valid = ATTR_FORCE | kill; -> notify_change() -> setattr_copy() In should_remove_suid() we can see that ATTR_KILL_SUID is raised unconditionally because the file in the test has S_ISUID set. But we also see that ATTR_KILL_SGID won't be set because while the file is S_ISGID it is not S_IXGRP (see above) which is a condition for ATTR_KILL_SGID being raised. So by the time we call notify_change() we have attr->ia_valid set to ATTR_KILL_SUID | ATTR_FORCE. Now notify_change() sees that ATTR_KILL_SUID is set and does: ia_valid = attr->ia_valid |= ATTR_MODE attr->ia_mode = (inode->i_mode & ~S_ISUID); which means that when we call setattr_copy() later we will definitely update inode->i_mode. Note that attr->ia_mode still contains S_ISGID. Now we call into the filesystem's ->setattr() inode operation which will end up calling setattr_copy(). Since ATTR_MODE is set we will hit: if (ia_valid & ATTR_MODE) { umode_t mode = attr->ia_mode; vfsgid_t vfsgid = i_gid_into_vfsgid(mnt_userns, inode); if (!vfsgid_in_group_p(vfsgid) && !capable_wrt_inode_uidgid(mnt_userns, inode, CAP_FSETID)) mode &= ~S_ISGID; inode->i_mode = mode; } and since the caller in the test is neither capable nor in the group of the inode the S_ISGID bit is stripped. But assume the file isn't suid then ATTR_KILL_SUID won't be raised which has the consequence that neither the setgid nor the suid bits are stripped even though it should be stripped because the inode isn't in the caller's groups and the caller isn't privileged over the inode. If overlayfs is in the mix things become a bit more complicated and the bug shows up more clearly. When e.g., ovl_setattr() is hit from ovl_fallocate()'s call to file_remove_privs() then ATTR_KILL_SUID and ATTR_KILL_SGID might be raised but because the check in notify_change() is questioning the ATTR_KILL_SGID flag again by requiring S_IXGRP for it to be stripped the S_ISGID bit isn't removed even though it should be stripped: sys_fallocate() -> vfs_fallocate() -> ovl_fallocate() -> file_remove_privs() -> dentry_needs_remove_privs() -> should_remove_suid() -> __remove_privs() newattrs.ia_valid = ATTR_FORCE | kill; -> notify_change() -> ovl_setattr() // TAKE ON MOUNTER'S CREDS -> ovl_do_notify_change() -> notify_change() // GIVE UP MOUNTER'S CREDS // TAKE ON MOUNTER'S CREDS -> vfs_fallocate() -> xfs_file_fallocate() -> file_modified() -> __file_remove_privs() -> dentry_needs_remove_privs() -> should_remove_suid() -> __remove_privs() newattrs.ia_valid = attr_force | kill; -> notify_change() The fix for all of this is to make file_remove_privs()'s should_remove_suid() helper to perform the same checks as we already require in setattr_prepare() and setattr_copy() and have notify_change() not pointlessly requiring S_IXGRP again. It doesn't make any sense in the first place because the caller must calculate the flags via should_remove_suid() anyway which would raise ATTR_KILL_SGID. While we're at it we move should_remove_suid() from inode.c to attr.c where it belongs with the rest of the iattr helpers. Especially since it returns ATTR_KILL_S{G,U}ID flags. We also rename it to setattr_should_drop_suidgid() to better reflect that it indicates both setuid and setgid bit removal and also that it returns attr flags. Running xfstests with this doesn't report any regressions. We should really try and use consistent checks. Reviewed-by: NAmir Goldstein <amir73il@gmail.com> Signed-off-by: NChristian Brauner (Microsoft) <brauner@kernel.org>
-
由 Christian Brauner 提交于
The current setgid stripping logic during write and ownership change operations is inconsistent and strewn over multiple places. In order to consolidate it and make more consistent we'll add a new helper setattr_should_drop_sgid(). The function retains the old behavior where we remove the S_ISGID bit unconditionally when S_IXGRP is set but also when it isn't set and the caller is neither in the group of the inode nor privileged over the inode. We will use this helper both in write operation permission removal such as file_remove_privs() as well as in ownership change operations. Reviewed-by: NAmir Goldstein <amir73il@gmail.com> Signed-off-by: NChristian Brauner (Microsoft) <brauner@kernel.org>
-
由 Christian Brauner 提交于
In setattr_{copy,prepare}() we need to perform the same permission checks to determine whether we need to drop the setgid bit or not. Instead of open-coding it twice add a simple helper the encapsulates the logic. We will reuse this helpers to make dropping the setgid bit during write operations more consistent in a follow up patch. Reviewed-by: NAmir Goldstein <amir73il@gmail.com> Signed-off-by: NChristian Brauner (Microsoft) <brauner@kernel.org>
-
- 29 9月, 2022 1 次提交
-
-
由 Al Viro 提交于
passing kmap_local_page() result to __kernel_write() is unsafe - random ->write_iter() might (and 9p one does) get unhappy when passed ITER_KVEC with pointer that came from kmap_local_page(). Fix by providing a variant of __kernel_write() that takes an iov_iter from caller (__kernel_write() becomes a trivial wrapper) and adding dump_emit_page() that parallels dump_emit(), except that instead of __kernel_write() it uses __kernel_write_iter() with ITER_BVEC source. Fixes: 3159ed57 "fs/coredump: use kmap_local_page()" Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 02 9月, 2022 1 次提交
-
-
由 Al Viro 提交于
Reviewed-by: NChristian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 16 8月, 2022 1 次提交
-
-
由 Amir Goldstein 提交于
Thread A trying to acquire a write lease checks the value of i_readcount and i_writecount in check_conflicting_open() to verify that its own fd is the only fd referencing the file. Thread B trying to open the file for read will call break_lease() in do_dentry_open() before incrementing i_readcount, which leaves a small window where thread A can acquire the write lease and then thread B completes the open of the file for read without breaking the write lease that was acquired by thread A. Fix this race by incrementing i_readcount before checking for existing leases, same as the case with i_writecount. Use a helper put_file_access() to decrement i_readcount or i_writecount in do_dentry_open() and __fput(). Fixes: 387e3746 ("locks: eliminate false positive conflicts for write lease") Reviewed-by: NJeff Layton <jlayton@kernel.org> Signed-off-by: NAmir Goldstein <amir73il@gmail.com> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 20 5月, 2022 1 次提交
-
-
由 Al Viro 提交于
It's done once per (mount-related) syscall and there's no point whatsoever making it inline. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 15 5月, 2022 1 次提交
-
-
由 Al Viro 提交于
Currently we have 3 primitives for removing an opened file from descriptor table - pick_file(), __close_fd_get_file() and close_fd_get_file(). Their calling conventions are rather odd and there's a code duplication for no good reason. They can be unified - 1) have __range_close() cap max_fd in the very beginning; that way we don't need separate way for pick_file() to report being past the end of descriptor table. 2) make {__,}close_fd_get_file() return file (or NULL) directly, rather than returning it via struct file ** argument. Don't bother with (bogus) return value - nobody wants that -ENOENT. 3) make pick_file() return NULL on unopened descriptor - the only caller that used to care about the distinction between descriptor past the end of descriptor table and finding NULL in descriptor table doesn't give a damn after (1). 4) lift ->files_lock out of pick_file() That actually simplifies the callers, as well as the primitives themselves. Code duplication is also gone... Reviewed-by: NChristian Brauner (Microsoft) <brauner@kernel.org> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 25 4月, 2022 2 次提交
-
-
由 Stefan Roesch 提交于
This splits off do_getxattr function from the getxattr function. This will allow io_uring to call it from its io worker. Signed-off-by: NStefan Roesch <shr@fb.com> Acked-by: NChristian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20220323154420.3301504-3-shr@fb.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Stefan Roesch 提交于
This splits of the setup part of the function setxattr in its own dedicated function called setxattr_copy. In addition it also exposes a new function called do_setxattr for making the setxattr call. This makes it possible to call these two functions from io_uring in the processing of an xattr request. Signed-off-by: NStefan Roesch <shr@fb.com> Acked-by: NChristian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20220323154420.3301504-2-shr@fb.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
- 14 3月, 2022 1 次提交
-
-
由 Omar Sandoval 提交于
I'm adding btrfs ioctls to read and write compressed data, and rather than duplicating the checks in rw_verify_area(), let's just export it. Reviewed-by: NJosef Bacik <josef@toxicpanda.com> Signed-off-by: NOmar Sandoval <osandov@fb.com> Reviewed-by: NDavid Sterba <dsterba@suse.com> Signed-off-by: NDavid Sterba <dsterba@suse.com>
-
- 11 3月, 2022 1 次提交
-
-
由 Stefan Roesch 提交于
One of the key architectual tenets is to keep the parameters for io-uring stable. After the call has been submitted, its value can be changed. Unfortunaltely this is not the case for the current statx implementation. IO-Uring change: This changes replaces the const char * filename pointer in the io_statx structure with a struct filename *. In addition it also creates the filename object during the prepare phase. With this change, the opcode also needs to invoke cleanup, so the filename object gets freed after processing the request. fs change: This replaces the const char* __user filename parameter in the two functions do_statx and vfs_statx with a struct filename *. In addition to be able to correctly construct a filename object a new helper function getname_statx_lookup_flags is introduced. The function makes sure that do_statx and vfs_statx is invoked with the correct lookup flags. Signed-off-by: NStefan Roesch <shr@fb.com> Link: https://lore.kernel.org/r/20220225185326.1373304-2-shr@fb.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
- 31 1月, 2022 1 次提交
-
-
由 Al Viro 提交于
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 17 12月, 2021 1 次提交
-
-
由 Matthew Wilcox (Oracle) 提交于
There are no plans to convert buffer_head infrastructure to use large folios, but __block_write_begin_int() is called from iomap, and it's more convenient and less error-prone if we pass in a folio from iomap. It also has a nice saving of almost 200 bytes of code from removing repeated calls to compound_head(). Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
-
- 10 11月, 2021 1 次提交
-
-
由 Johannes Weiner 提交于
Historically (pre-2.5), the inode shrinker used to reclaim only empty inodes and skip over those that still contained page cache. This caused problems on highmem hosts: struct inode could put fill lowmem zones before the cache was getting reclaimed in the highmem zones. To address this, the inode shrinker started to strip page cache to facilitate reclaiming lowmem. However, this comes with its own set of problems: the shrinkers may drop actively used page cache just because the inodes are not currently open or dirty - think working with a large git tree. It further doesn't respect cgroup memory protection settings and can cause priority inversions between containers. Nowadays, the page cache also holds non-resident info for evicted cache pages in order to detect refaults. We've come to rely heavily on this data inside reclaim for protecting the cache workingset and driving swap behavior. We also use it to quantify and report workload health through psi. The latter in turn is used for fleet health monitoring, as well as driving automated memory sizing of workloads and containers, proactive reclaim and memory offloading schemes. The consequences of dropping page cache prematurely is that we're seeing subtle and not-so-subtle failures in all of the above-mentioned scenarios, with the workload generally entering unexpected thrashing states while losing the ability to reliably detect it. To fix this on non-highmem systems at least, going back to rotating inodes on the LRU isn't feasible. We've tried (commit a76cf1a4 ("mm: don't reclaim inodes with many attached pages")) and failed (commit 69056ee6 ("Revert "mm: don't reclaim inodes with many attached pages"")). The issue is mostly that shrinker pools attract pressure based on their size, and when objects get skipped the shrinkers remember this as deferred reclaim work. This accumulates excessive pressure on the remaining inodes, and we can quickly eat into heavily used ones, or dirty ones that require IO to reclaim, when there potentially is plenty of cold, clean cache around still. Instead, this patch keeps populated inodes off the inode LRU in the first place - just like an open file or dirty state would. An otherwise clean and unused inode then gets queued when the last cache entry disappears. This solves the problem without reintroducing the reclaim issues, and generally is a bit more scalable than having to wade through potentially hundreds of thousands of busy inodes. Locking is a bit tricky because the locks protecting the inode state (i_lock) and the inode LRU (lru_list.lock) don't nest inside the irq-safe page cache lock (i_pages.xa_lock). Page cache deletions are serialized through i_lock, taken before the i_pages lock, to make sure depopulated inodes are queued reliably. Additions may race with deletions, but we'll check again in the shrinker. If additions race with the shrinker itself, we're protected by the i_lock: if find_inode() or iput() win, the shrinker will bail on the elevated i_count or I_REFERENCED; if the shrinker wins and goes ahead with the inode, it will set I_FREEING and inhibit further igets(), which will cause the other side to create a new instance of the inode instead. Link: https://lkml.kernel.org/r/20210614211904.14420-4-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Cc: Tejun Heo <tj@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
-
- 22 10月, 2021 2 次提交
-
-
由 Christoph Hellwig 提交于
Get rid of the indirections and just provide a sync_bdevs helper for the generic sync code. Signed-off-by: NChristoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20211019062530.2174626-8-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Christoph Hellwig 提交于
Instead offer a new sync_blockdev_nowait helper for the !wait case. This new helper is exported as it will grow modular callers in a bit. Signed-off-by: NChristoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20211019062530.2174626-3-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
- 07 9月, 2021 1 次提交
-
-
由 Christoph Hellwig 提交于
Move it together with the rest of the block layer. Signed-off-by: NChristoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20210907141303.1371844-3-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
- 24 8月, 2021 4 次提交
-
-
由 Dmitry Kadashev 提交于
IORING_OP_LINKAT behaves like linkat(2) and takes the same flags and arguments. In some internal places 'hardlink' is used instead of 'link' to avoid confusion with the SQE links. Name 'link' conflicts with the existing 'link' member of io_kiocb. Acked-by: NLinus Torvalds <torvalds@linux-foundation.org> Suggested-by: NChristian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/io-uring/20210514145259.wtl4xcsp52woi6ab@wittgenstein/Signed-off-by: NDmitry Kadashev <dkadashev@gmail.com> Acked-by: NChristian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20210708063447.3556403-12-dkadashev@gmail.com [axboe: add splice_fd_in check] Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Dmitry Kadashev 提交于
IORING_OP_SYMLINKAT behaves like symlinkat(2) and takes the same flags and arguments. Acked-by: NLinus Torvalds <torvalds@linux-foundation.org> Suggested-by: NChristian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/io-uring/20210514145259.wtl4xcsp52woi6ab@wittgenstein/Signed-off-by: NDmitry Kadashev <dkadashev@gmail.com> Acked-by: NChristian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20210708063447.3556403-11-dkadashev@gmail.com [axboe: add splice_fd_in check] Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Dmitry Kadashev 提交于
Update the following to return int rather than long, for uniformity with the rest of the do_* helpers in namei.c: * do_rmdir() * do_unlinkat() * do_mkdirat() * do_mknodat() * do_symlinkat() Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <christian.brauner@ubuntu.com> Acked-by: NLinus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/io-uring/20210514143202.dmzfcgz5hnauy7ze@wittgenstein/Signed-off-by: NDmitry Kadashev <dkadashev@gmail.com> Acked-by: NChristian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20210708063447.3556403-9-dkadashev@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Dmitry Kadashev 提交于
Pass in the struct filename pointers instead of the user string, and update the three callers to do the same. This is heavily based on commit dbea8d345177 ("fs: make do_renameat2() take struct filename"). This behaves like do_unlinkat() and do_renameat2(). Cc: Al Viro <viro@zeniv.linux.org.uk> Acked-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NDmitry Kadashev <dkadashev@gmail.com> Acked-by: NChristian Brauner <christian.brauner@ubuntu.com> Link: https://lore.kernel.org/r/20210708063447.3556403-4-dkadashev@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>
-
- 17 8月, 2021 1 次提交
-
-
由 Christoph Hellwig 提交于
__block_write_begin_int never modifies the passed in iomap, so mark it const. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NDarrick J. Wong <djwong@kernel.org> Signed-off-by: NDarrick J. Wong <djwong@kernel.org>
-
- 22 7月, 2021 1 次提交
-
-
由 Paul Gortmaker 提交于
Richard reported sporadic (roughly one in 10 or so) null dereferences and other strange behaviour for a set of automated LTP tests. Things like: BUG: kernel NULL pointer dereference, address: 0000000000000008 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: 0000 [#1] PREEMPT SMP PTI CPU: 0 PID: 1516 Comm: umount Not tainted 5.10.0-yocto-standard #1 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-48-gd9c812dda519-prebuilt.qemu.org 04/01/2014 RIP: 0010:kernfs_sop_show_path+0x1b/0x60 ...or these others: RIP: 0010:do_mkdirat+0x6a/0xf0 RIP: 0010:d_alloc_parallel+0x98/0x510 RIP: 0010:do_readlinkat+0x86/0x120 There were other less common instances of some kind of a general scribble but the common theme was mount and cgroup and a dubious dentry triggering the NULL dereference. I was only able to reproduce it under qemu by replicating Richard's setup as closely as possible - I never did get it to happen on bare metal, even while keeping everything else the same. In commit 71d883c3 ("cgroup_do_mount(): massage calling conventions") we see this as a part of the overall change: -------------- struct cgroup_subsys *ss; - struct dentry *dentry; [...] - dentry = cgroup_do_mount(&cgroup_fs_type, fc->sb_flags, root, - CGROUP_SUPER_MAGIC, ns); [...] - if (percpu_ref_is_dying(&root->cgrp.self.refcnt)) { - struct super_block *sb = dentry->d_sb; - dput(dentry); + ret = cgroup_do_mount(fc, CGROUP_SUPER_MAGIC, ns); + if (!ret && percpu_ref_is_dying(&root->cgrp.self.refcnt)) { + struct super_block *sb = fc->root->d_sb; + dput(fc->root); deactivate_locked_super(sb); msleep(10); return restart_syscall(); } -------------- In changing from the local "*dentry" variable to using fc->root, we now export/leave that dentry pointer in the file context after doing the dput() in the unlikely "is_dying" case. With LTP doing a crazy amount of back to back mount/unmount [testcases/bin/cgroup_regression_5_1.sh] the unlikely becomes slightly likely and then bad things happen. A fix would be to not leave the stale reference in fc->root as follows: -------------- dput(fc->root); + fc->root = NULL; deactivate_locked_super(sb); -------------- ...but then we are just open-coding a duplicate of fc_drop_locked() so we simply use that instead. Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Tejun Heo <tj@kernel.org> Cc: Zefan Li <lizefan.x@bytedance.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: stable@vger.kernel.org # v5.1+ Reported-by: NRichard Purdie <richard.purdie@linuxfoundation.org> Fixes: 71d883c3 ("cgroup_do_mount(): massage calling conventions") Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com> Signed-off-by: NTejun Heo <tj@kernel.org>
-
- 08 4月, 2021 1 次提交
-
-
由 Al Viro 提交于
... and provide file_open_root_mnt(), using the root of given mount. Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 02 2月, 2021 1 次提交
-
-
由 Jens Axboe 提交于
Assumes current->files->file_lock is already held on invocation. Helps the caller check the file before removing the fd, if it needs to. Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 26 1月, 2021 1 次提交
-
-
由 Al Viro 提交于
no point going through the intermediate pipe Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 24 1月, 2021 1 次提交
-
-
由 Christian Brauner 提交于
The may_follow_link(), may_linkat(), may_lookup(), may_open(), may_o_create(), may_create_in_sticky(), may_delete(), and may_create() helpers determine whether the caller is privileged enough to perform the associated operations. Let them handle idmapped mounts by mapping the inode or fsids according to the mount's user namespace. Afterwards the checks are identical to non-idmapped inodes. The patch takes care to retrieve the mount's user namespace right before performing permission checks and passing it down into the fileystem so the user namespace can't change in between by someone idmapping a mount that is currently not idmapped. If the initial user namespace is passed nothing changes so non-idmapped mounts will see identical behavior as before. Link: https://lore.kernel.org/r/20210121131959.646623-13-christian.brauner@ubuntu.com Cc: Christoph Hellwig <hch@lst.de> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: linux-fsdevel@vger.kernel.org Reviewed-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NJames Morris <jamorris@linux.microsoft.com> Signed-off-by: NChristian Brauner <christian.brauner@ubuntu.com>
-
- 10 12月, 2020 1 次提交
-
-
由 Jens Axboe 提交于
Pass in the struct filename pointers instead of the user string, and update the three callers to do the same. This behaves like do_unlinkat(), which also takes a filename struct and puts it when it is done. Converting callers is then trivial. Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 02 12月, 2020 2 次提交
-
-
由 Christoph Hellwig 提交于
Switch the block device lookup interfaces to directly work with a dev_t so that struct block_device references are only acquired by the blkdev_get variants (and the blk-cgroup special case). This means that we now don't need an extra reference in the inode and can generally simplify handling of struct block_device to keep the lookups contained in the core block layer code. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NJan Kara <jack@suse.cz> Reviewed-by: NHannes Reinecke <hare@suse.de> Acked-by: NTejun Heo <tj@kernel.org> Acked-by: Coly Li <colyli@suse.de> [bcache] Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
由 Christoph Hellwig 提交于
Just open code the wait in the only caller of both functions. Signed-off-by: NChristoph Hellwig <hch@lst.de> Reviewed-by: NJan Kara <jack@suse.cz> Reviewed-by: NHannes Reinecke <hare@suse.de> Signed-off-by: NJens Axboe <axboe@kernel.dk>
-
- 23 9月, 2020 1 次提交
-
-
由 Christoph Hellwig 提交于
compat_sys_mount is identical to the regular sys_mount now, so remove it and use the native version everywhere. Signed-off-by: NChristoph Hellwig <hch@lst.de> Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
-
- 31 7月, 2020 2 次提交
-
-
由 Christoph Hellwig 提交于
Add a simple helper to mknod with a kernel space file name and switch the early init code over to it. Remove the now unused ksys_mknod. Signed-off-by: NChristoph Hellwig <hch@lst.de>
-
由 Christoph Hellwig 提交于
Add a simple helper to mkdir with a kernel space file name and switch the early init code over to it. Remove the now unused ksys_mkdir. Signed-off-by: NChristoph Hellwig <hch@lst.de>
-