提交 · b308570957332422032e34b0abb9233790344e2b · openeuler / Kernel

19 11月, 2022 1 次提交

vfs: vfs_tmpfile: ensure O_EXCL flag is enforced · 406c706c

由 Peter Griffin 提交于 11月 03, 2022

If O_EXCL is *not* specified, then linkat() can be
used to link the temporary file into the filesystem.
If O_EXCL is specified then linkat() should fail (-1).

After commit 863f144f ("vfs: open inside ->tmpfile()")
the O_EXCL flag is no longer honored by the vfs layer for
tmpfile, which means the file can be linked even if O_EXCL
flag is specified, which is a change in behaviour for
userspace!

The open flags was previously passed as a parameter, so it
was uneffected by the changes to file->f_flags caused by
finish_open(). This patch fixes the issue by storing
file->f_flags in a local variable so the O_EXCL test
logic is restored.

This regression was detected by Android CTS Bionic fcntl()
tests running on android-mainline [1].

[1] https://android.googlesource.com/platform/bionic/+/
    refs/heads/master/tests/fcntl_test.cpp#352

Fixes: 863f144f ("vfs: open inside ->tmpfile()")
Acked-by: NMiklos Szeredi <mszeredi@redhat.com>
Tested-by: NWill McVicker <willmcvicker@google.com>
Signed-off-by: NPeter Griffin <peter.griffin@linaro.org>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

406c706c

04 10月, 2022 1 次提交

mm: fs: initialize fsdata passed to write_begin/write_end interface · 1468c6f4

由 Alexander Potapenko 提交于 9月 15, 2022

Functions implementing the a_ops->write_end() interface accept the `void
*fsdata` parameter that is supposed to be initialized by the corresponding
a_ops->write_begin() (which accepts `void **fsdata`).

However not all a_ops->write_begin() implementations initialize `fsdata`
unconditionally, so it may get passed uninitialized to a_ops->write_end(),
resulting in undefined behavior.

Fix this by initializing fsdata with NULL before the call to
write_begin(), rather than doing so in all possible a_ops implementations.

This patch covers only the following cases found by running x86 KMSAN
under syzkaller:

 - generic_perform_write()
 - cont_expand_zero() and generic_cont_expand_simple()
 - page_symlink()

Other cases of passing uninitialized fsdata may persist in the codebase.

Link: https://lkml.kernel.org/r/20220915150417.722975-43-glider@google.comSigned-off-by: NAlexander Potapenko <glider@google.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Andrey Konovalov <andreyknvl@gmail.com>
Cc: Andrey Konovalov <andreyknvl@google.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Christoph Lameter <cl@linux.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Eric Biggers <ebiggers@google.com>
Cc: Eric Biggers <ebiggers@kernel.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ilya Leoshkevich <iii@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Marco Elver <elver@google.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Pekka Enberg <penberg@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vegard Nossum <vegard.nossum@oracle.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>

1468c6f4

24 9月, 2022 4 次提交

vfs: open inside ->tmpfile() · 863f144f

由 Miklos Szeredi 提交于 9月 24, 2022

This is in preparation for adding tmpfile support to fuse, which requires
that the tmpfile creation and opening are done as a single operation.

Replace the 'struct dentry *' argument of i_op->tmpfile with
'struct file *'.

Call finish_open_simple() as the last thing in ->tmpfile() instances (may
be omitted in the error case).

Change d_tmpfile() argument to 'struct file *' as well to make callers more
readable.
Reviewed-by: NChristian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

863f144f

vfs: move open right after ->tmpfile() · 9751b338

由 Miklos Szeredi 提交于 9月 24, 2022

Create a helper finish_open_simple() that opens the file with the original
dentry.  Handle the error case here as well to simplify callers.

Call this helper right after ->tmpfile() is called.

Next patch will change the tmpfile API and move this call into tmpfile
instances.
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

9751b338

vfs: make vfs_tmpfile() static · 3e9d4c59

由 Miklos Szeredi 提交于 9月 24, 2022

No callers outside of fs/namei.c anymore.
Reviewed-by: NChristian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

3e9d4c59

vfs: add vfs_tmpfile_open() helper · 22873dea

由 Miklos Szeredi 提交于 9月 24, 2022

This helper unifies tmpfile creation with opening.

Existing vfs_tmpfile() callers outside of fs/namei.c will be converted to
using this helper.  There are two such callers: cachefile and overlayfs.

The cachefiles code currently uses the open_with_fake_path() helper to open
the tmpfile, presumably to disable accounting of the open file.  Overlayfs
uses tmpfile for copy_up, which means these struct file instances will be
short lived, hence it doesn't really matter if they are accounted or not.
Disable accounting in this helper too, which should be okay for both
callers.

Add MAY_OPEN permission checking for consistency.  Like for create(2)
read/write permissions are not checked.
Reviewed-by: NChristian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

22873dea

02 9月, 2022 2 次提交

nd_jump_link(): constify path · ea4af4aa

由 Al Viro 提交于 8月 04, 2022

Reviewed-by: NChristian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

ea4af4aa

may_linkat(): constify path · 8996682b

由 Al Viro 提交于 8月 04, 2022

Reviewed-by: NChristian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

8996682b

21 7月, 2022 1 次提交

fs: move S_ISGID stripping into the vfs_*() helpers · 1639a49c

由 Yang Xu 提交于 7月 14, 2022

Move setgid handling out of individual filesystems and into the VFS
itself to stop the proliferation of setgid inheritance bugs.

Creating files that have both the S_IXGRP and S_ISGID bit raised in
directories that themselves have the S_ISGID bit set requires additional
privileges to avoid security issues.

When a filesystem creates a new inode it needs to take care that the
caller is either in the group of the newly created inode or they have
CAP_FSETID in their current user namespace and are privileged over the
parent directory of the new inode. If any of these two conditions is
true then the S_ISGID bit can be raised for an S_IXGRP file and if not
it needs to be stripped.

However, there are several key issues with the current implementation:

* S_ISGID stripping logic is entangled with umask stripping.

  If a filesystem doesn't support or enable POSIX ACLs then umask
  stripping is done directly in the vfs before calling into the
  filesystem.
  If the filesystem does support POSIX ACLs then unmask stripping may be
  done in the filesystem itself when calling posix_acl_create().

  Since umask stripping has an effect on S_ISGID inheritance, e.g., by
  stripping the S_IXGRP bit from the file to be created and all relevant
  filesystems have to call posix_acl_create() before inode_init_owner()
  where we currently take care of S_ISGID handling S_ISGID handling is
  order dependent. IOW, whether or not you get a setgid bit depends on
  POSIX ACLs and umask and in what order they are called.

  Note that technically filesystems are free to impose their own
  ordering between posix_acl_create() and inode_init_owner() meaning
  that there's additional ordering issues that influence S_SIGID
  inheritance.

* Filesystems that don't rely on inode_init_owner() don't get S_ISGID
  stripping logic.

  While that may be intentional (e.g. network filesystems might just
  defer setgid stripping to a server) it is often just a security issue.

This is not just ugly it's unsustainably messy especially since we do
still have bugs in this area years after the initial round of setgid
bugfixes.

So the current state is quite messy and while we won't be able to make
it completely clean as posix_acl_create() is still a filesystem specific
call we can improve the S_SIGD stripping situation quite a bit by
hoisting it out of inode_init_owner() and into the vfs creation
operations. This means we alleviate the burden for filesystems to handle
S_ISGID stripping correctly and can standardize the ordering between
S_ISGID and umask stripping in the vfs.

We add a new helper vfs_prepare_mode() so S_ISGID handling is now done
in the VFS before umask handling. This has S_ISGID handling is
unaffected unaffected by whether umask stripping is done by the VFS
itself (if no POSIX ACLs are supported or enabled) or in the filesystem
in posix_acl_create() (if POSIX ACLs are supported).

The vfs_prepare_mode() helper is called directly in vfs_*() helpers that
create new filesystem objects. We need to move them into there to make
sure that filesystems like overlayfs hat have callchains like:

sys_mknod()
-> do_mknodat(mode)
   -> .mknod = ovl_mknod(mode)
      -> ovl_create(mode)
         -> vfs_mknod(mode)

get S_ISGID stripping done when calling into lower filesystems via
vfs_*() creation helpers. Moving vfs_prepare_mode() into e.g.
vfs_mknod() takes care of that. This is in any case semantically cleaner
because S_ISGID stripping is VFS security requirement.

Security hooks so far have seen the mode with the umask applied but
without S_ISGID handling done. The relevant hooks are called outside of
vfs_*() creation helpers so by calling vfs_prepare_mode() from vfs_*()
helpers the security hooks would now see the mode without umask
stripping applied. For now we fix this by passing the mode with umask
settings applied to not risk any regressions for LSM hooks. IOW, nothing
changes for LSM hooks. It is worth pointing out that security hooks
never saw the mode that is seen by the filesystem when actually creating
the file. They have always been completely misplaced for that to work.

The following filesystems use inode_init_owner() and thus relied on
S_ISGID stripping: spufs, 9p, bfs, btrfs, ext2, ext4, f2fs, hfsplus,
hugetlbfs, jfs, minix, nilfs2, ntfs3, ocfs2, omfs, overlayfs, ramfs,
reiserfs, sysv, ubifs, udf, ufs, xfs, zonefs, bpf, tmpfs.

All of the above filesystems end up calling inode_init_owner() when new
filesystem objects are created through the ->mkdir(), ->mknod(),
->create(), ->tmpfile(), ->rename() inode operations.

Since directories always inherit the S_ISGID bit with the exception of
xfs when irix_sgid_inherit mode is turned on S_ISGID stripping doesn't
apply. The ->symlink() and ->link() inode operations trivially inherit
the mode from the target and the ->rename() inode operation inherits the
mode from the source inode. All other creation inode operations will get
S_ISGID handling via vfs_prepare_mode() when called from their relevant
vfs_*() helpers.

In addition to this there are filesystems which allow the creation of
filesystem objects through ioctl()s or - in the case of spufs -
circumventing the vfs in other ways. If filesystem objects are created
through ioctl()s the vfs doesn't know about it and can't apply regular
permission checking including S_ISGID logic. Therfore, a filesystem
relying on S_ISGID stripping in inode_init_owner() in their ioctl()
callpath will be affected by moving this logic into the vfs. We audited
those filesystems:

* btrfs allows the creation of filesystem objects through various
  ioctls(). Snapshot creation literally takes a snapshot and so the mode
  is fully preserved and S_ISGID stripping doesn't apply.

  Creating a new subvolum relies on inode_init_owner() in
  btrfs_new_subvol_inode() but only creates directories and doesn't
  raise S_ISGID.

* ocfs2 has a peculiar implementation of reflinks. In contrast to e.g.
  xfs and btrfs FICLONE/FICLONERANGE ioctl() that is only concerned with
  the actual extents ocfs2 uses a separate ioctl() that also creates the
  target file.

  Iow, ocfs2 circumvents the vfs entirely here and did indeed rely on
  inode_init_owner() to strip the S_ISGID bit. This is the only place
  where a filesystem needs to call mode_strip_sgid() directly but this
  is self-inflicted pain.

* spufs doesn't go through the vfs at all and doesn't use ioctl()s
  either. Instead it has a dedicated system call spufs_create() which
  allows the creation of filesystem objects. But spufs only creates
  directories and doesn't allo S_SIGID bits, i.e. it specifically only
  allows 0777 bits.

* bpf uses vfs_mkobj() but also doesn't allow S_ISGID bits to be created.

The patch will have an effect on ext2 when the EXT2_MOUNT_GRPID mount
option is used, on ext4 when the EXT4_MOUNT_GRPID mount option is used,
and on xfs when the XFS_FEAT_GRPID mount option is used. When any of
these filesystems are mounted with their respective GRPID option then
newly created files inherit the parent directories group
unconditionally. In these cases non of the filesystems call
inode_init_owner() and thus did never strip the S_ISGID bit for newly
created files. Moving this logic into the VFS means that they now get
the S_ISGID bit stripped. This is a user visible change. If this leads
to regressions we will either need to figure out a better way or we need
to revert. However, given the various setgid bugs that we found just in
the last two years this is a regression risk we should take.

Associated with this change is a new set of fstests to enforce the
semantics for all new filesystems.

Link: https://lore.kernel.org/ceph-devel/20220427092201.wvsdjbnc7b4dttaw@wittgenstein [1]
Link: e014f37d ("xfs: use setattr_copy to set vfs inode attributes") [2]
Link: 01ea173e ("xfs: fix up non-directory creation in SGID directories") [3]
Link: fd84bfdd ("ceph: fix up non-directory creation in SGID directories") [4]
Link: https://lore.kernel.org/r/1657779088-2242-3-git-send-email-xuyang2018.jy@fujitsu.comSuggested-by: NDave Chinner <david@fromorbit.com>
Suggested-by: NChristian Brauner (Microsoft) <brauner@kernel.org>
Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
Reviewed-and-Tested-by: NJeff Layton <jlayton@kernel.org>
Signed-off-by: NYang Xu <xuyang2018.jy@fujitsu.com>
[<brauner@kernel.org>: rewrote commit message]
Signed-off-by: NChristian Brauner (Microsoft) <brauner@kernel.org>

1639a49c

19 7月, 2022 1 次提交

fs: Add missing umask strip in vfs_tmpfile · ac6800e2

由 Yang Xu 提交于 7月 14, 2022

All creation paths except for O_TMPFILE handle umask in the vfs directly
if the filesystem doesn't support or enable POSIX ACLs. If the filesystem
does then umask handling is deferred until posix_acl_create().
Because, O_TMPFILE misses umask handling in the vfs it will not honor
umask settings. Fix this by adding the missing umask handling.

Link: https://lore.kernel.org/r/1657779088-2242-2-git-send-email-xuyang2018.jy@fujitsu.com
Fixes: 60545d0d ("[O_TMPFILE] it's still short a few helpers, but infrastructure should be OK now...")
Cc: <stable@vger.kernel.org> # 4.19+
Reported-by: NChristian Brauner (Microsoft) <brauner@kernel.org>
Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
Reviewed-and-Tested-by: NJeff Layton <jlayton@kernel.org>
Acked-by: NChristian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: NYang Xu <xuyang2018.jy@fujitsu.com>
Signed-off-by: NChristian Brauner (Microsoft) <brauner@kernel.org>

ac6800e2

07 7月, 2022 6 次提交

A
step_into(): move fetching ->d_inode past handle_mounts() · 3bd8bc89
由 Al Viro 提交于 7月 03, 2022
```
... and lose messing with it in __follow_mount_rcu()
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
3bd8bc89

lookup_fast(): don't bother with inode · 4cb64024

由 Al Viro 提交于 7月 03, 2022

Note that validation of ->d_seq after ->d_inode fetch is gone, along
with fetching of ->d_inode itself.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

4cb64024

A
follow_dotdot{,_rcu}(): don't bother with inode · b16c001d
由 Al Viro 提交于 7月 03, 2022
```
step_into() will fetch it, TYVM.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
b16c001d

step_into(): lose inode argument · a4f5b521

由 Al Viro 提交于 7月 03, 2022

make handle_mounts() always fetch it.  This is just the first step -
the callers of step_into() will stop trying to calculate the sucker,
etc.

The passed value should be equal to dentry->d_inode in all cases;
in RCU mode - fetched after we'd sampled ->d_seq.  Might as well
fetch it here.  We do need to validate ->d_seq, which duplicates
the check currently done in lookup_fast(); that duplication will
go away shortly.

After that change handle_mounts() always ignores the initial value of
*inode and always sets it on success.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

a4f5b521

namei: stash the sampled ->d_seq into nameidata · 03fa86e9

由 Al Viro 提交于 7月 04, 2022

New field: nd->next_seq.  Set to 0 outside of RCU mode, holds the sampled
value for the next dentry to be considered.  Used instead of an arseload
of local variables, arguments, etc.

step_into() has lost seq argument; nd->next_seq is used, so dentry passed
to it must be the one ->next_seq is about.

There are two requirements for RCU pathwalk:
	1) it should not give a hard failure (other than -ECHILD) unless
non-RCU pathwalk might fail that way given suitable timings.
	2) it should not succeed unless non-RCU pathwalk might succeed
with the same end location given suitable timings.

The use of seq numbers is the way we achieve that.  Invariant we want
to maintain is:
	if RCU pathwalk can reach the state with given nd->path, nd->inode
and nd->seq after having traversed some part of pathname, it must be possible
for non-RCU pathwalk to reach the same nd->path and nd->inode after having
traversed the same part of pathname, and observe the nd->path.dentry->d_seq
equal to what RCU pathwalk has in nd->seq

	For transition from parent to child, we sample child's ->d_seq
and verify that parent's ->d_seq remains unchanged.  Anything that
disrupts parent-child relationship would've bumped ->d_seq on both.
	For transitions from child to parent we sample parent's ->d_seq
and verify that child's ->d_seq has not changed.  Same reasoning as
for the previous case applies.
	For transition from mountpoint to root of mounted we sample
the ->d_seq of root and verify that nobody has touched mount_lock since
the beginning of pathwalk.  That guarantees that mount we'd found had
been there all along, with these mountpoint and root of the mounted.
It would be possible for a non-RCU pathwalk to reach the previous state,
find the same mount and observe its root at the moment we'd sampled
->d_seq of that
	For transitions from root of mounted to mountpoint we sample
->d_seq of mountpoint and verify that mount_lock had not been touched
since the beginning of pathwalk.  The same reasoning as in the
previous case applies.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

03fa86e9

namei: move clearing LOOKUP_RCU towards rcu_read_unlock() · 6e180327

由 Al Viro 提交于 7月 06, 2022

try_to_unlazy()/try_to_unlazy_next() drop LOOKUP_RCU in the
very beginning and do rcu_read_unlock() only at the very end.
However, nothing done in between even looks at the flag in
question; might as well clear it at the same time we unlock.

Note that try_to_unlazy_next() used to call legitimize_mnt(),
which might drop/regain rcu_read_lock() in some cases.  This
is no longer true, so we really have rcu_read_lock() held
all along until the end.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

6e180327

06 7月, 2022 4 次提交

switch try_to_unlazy_next() to __legitimize_mnt() · 7e4745a0

由 Al Viro 提交于 7月 05, 2022

The tricky case (__legitimize_mnt() failing after having grabbed
a reference) can be trivially dealt with by leaving nd->path.mnt
non-NULL, for terminate_walk() to drop it.

legitimize_mnt() becomes static after that.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

7e4745a0

follow_dotdot{,_rcu}(): change calling conventions · 51c6546c

由 Al Viro 提交于 7月 04, 2022

Instead of returning NULL when we are in root, just make it return
the current position (and set *seqp and *inodep accordingly).
That collapses the calls of step_into() in handle_dots()
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

51c6546c

namei: get rid of pointless unlikely(read_seqcount_retry(...)) · 82ef0698

由 Al Viro 提交于 7月 05, 2022

read_seqcount_retry() et.al. are inlined and there's enough annotations
for compiler to figure out that those are unlikely to return non-zero.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

82ef0698

__follow_mount_rcu(): verify that mount_lock remains unchanged · 20aac6c6

由 Al Viro 提交于 7月 04, 2022

Validate mount_lock seqcount as soon as we cross into mount in RCU
mode.  Sure, ->mnt_root is pinned and will remain so until we
do rcu_read_unlock() anyway, and we will eventually fail to unlazy if
the mount_lock had been touched, but we might run into a hard error
(e.g. -ENOENT) before trying to unlazy.  And it's possible to end
up with RCU pathwalk racing with rename() and umount() in a way
that would fail with -ENOENT while non-RCU pathwalk would've
succeeded with any timings.

Once upon a time we hadn't needed that, but analysis had been subtle,
brittle and went out of window as soon as RENAME_EXCHANGE had been
added.

It's narrow, hard to hit and won't get you anything other than
stray -ENOENT that could be arranged in much easier way with the
same priveleges, but it's a bug all the same.

Cc: stable@kernel.org
X-sky-is-falling: unlikely
Fixes: da1ce067 "vfs: add cross-rename"
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

20aac6c6

20 5月, 2022 3 次提交

namei: cleanup double word in comment · 30476f7e

由 Tom Rix 提交于 1月 25, 2022

Remove the second 'to'.
Signed-off-by: NTom Rix <trix@redhat.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

30476f7e

get rid of dead code in legitimize_root() · 52dba645

由 Al Viro 提交于 1月 07, 2022

Combination of LOOKUP_IS_SCOPED and NULL nd->root.mnt is impossible
after successful path_init().  All places where ->root.mnt might
become NULL do that only if LOOKUP_IS_SCOPED is not there and
path_init() itself can return success without setting nd->root
only if ND_ROOT_PRESET had been set (in which case nd->root
had been set by caller and never changed) or if the name had
been a relative one *and* none of the bits in LOOKUP_IS_SCOPED
had been present.

Since all calls of legitimize_root() must be downstream of successful
path_init(), the check for !nd->root.mnt && (nd->flags & LOOKUP_IS_SCOPED)
is pure paranoia.

FWIW, it had been discussed (and agreed upon) with Aleksa back when
scoped lookups had been merged; looks like that had fallen through the
cracks back then.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

52dba645

fs/namei.c:reserve_stack(): tidy up the call of try_to_unlazy() · e5ca024e

由 Al Viro 提交于 1月 07, 2022

!foo() != 0 is a strange way to spell !foo(); fallout from
"fs: make unlazy_walk() error handling consistent"...
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

e5ca024e

14 5月, 2022 1 次提交

proc/sysctl: make protected_* world readable · c7031c14

由 Julius Hemanth Pitti 提交于 5月 13, 2022

protected_* files have 600 permissions which prevents non-superuser from
reading them.

Container like "AWS greengrass" refuse to launch unless
protected_hardlinks and protected_symlinks are set.  When containers like
these run with "userns-remap" or "--user" mapping container's root to
non-superuser on host, they fail to run due to denied read access to these
files.

As these protections are hardly a secret, and do not possess any security
risk, making them world readable.

Though above greengrass usecase needs read access to only
protected_hardlinks and protected_symlinks files, setting all other
protected_* files to 644 to keep consistency.

Link: http://lkml.kernel.org/r/20200709235115.56954-1-jpitti@cisco.com
Fixes: 800179c9 ("fs: add link restrictions")
Signed-off-by: NJulius Hemanth Pitti <jpitti@cisco.com>
Acked-by: NKees Cook <keescook@chromium.org>
Acked-by: NLuis Chamberlain <mcgrof@kernel.org>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>

c7031c14

09 5月, 2022 3 次提交

namei: Call aops write_begin() and write_end() directly · 27a77913

由 Matthew Wilcox (Oracle) 提交于 3月 03, 2022

pagecache_write_begin() and pagecache_write_end() are now trivial
wrappers, so call the aops directly.
Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: NChristoph Hellwig <hch@lst.de>

27a77913

namei: Convert page_symlink() to use memalloc_nofs_save() · 2d878178

由 Matthew Wilcox (Oracle) 提交于 2月 22, 2022

Stop using AOP_FLAG_NOFS in favour of the scoped memory API.
Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: NChristoph Hellwig <hch@lst.de>

2d878178

namei: Merge page_symlink() and __page_symlink() · 56f5746c

由 Matthew Wilcox (Oracle) 提交于 2月 22, 2022

There are no callers of __page_symlink() left, so we can remove that
entry point.
Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NChristian Brauner <brauner@kernel.org>

56f5746c

28 4月, 2022 1 次提交

fs: add two trivial lookup helpers · 00675017

由 Christian Brauner 提交于 4月 04, 2022

Similar to the addition of lookup_one() add a version of
lookup_one_unlocked() and lookup_one_positive_unlocked() that take
idmapped mounts into account. This is required to port overlay to
support idmapped base layers.

Cc: <linux-fsdevel@vger.kernel.org>
Tested-by: NGiuseppe Scrivano <gscrivan@redhat.com>
Reviewed-by: NAmir Goldstein <amir73il@gmail.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NChristian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>

00675017

15 4月, 2022 1 次提交

VFS: filename_create(): fix incorrect intent. · b3d4650d

由 NeilBrown 提交于 4月 14, 2022

When asked to create a path ending '/', but which is not to be a
directory (LOOKUP_DIRECTORY not set), filename_create() will never try
to create the file.  If it doesn't exist, -ENOENT is reported.

However, it still passes LOOKUP_CREATE|LOOKUP_EXCL to the filesystems
->lookup() function, even though there is no intent to create.  This is
misleading and can cause incorrect behaviour.

If you try

   ln -s foo /path/dir/

where 'dir' is a directory on an NFS filesystem which is not currently
known in the dcache, this will fail with ENOENT.

But as the name is not in the dcache, nfs_lookup gets called with
LOOKUP_CREATE|LOOKUP_EXCL and so it returns NULL without performing any
lookup, with the expectation that a subsequent call to create the target
will be made, and the lookup can be combined with the creation.  In the
case with a trailing '/' and no LOOKUP_DIRECTORY, that call is never
made.  Instead filename_create() sees that the dentry is not (yet)
positive and returns -ENOENT - even though the directory actually
exists.

So only set LOOKUP_CREATE|LOOKUP_EXCL if there really is an intent to
create, and use the absence of these flags to decide if -ENOENT should
be returned.

Note that filename_parentat() is only interested in LOOKUP_REVAL, so we
split that out and store it in 'reval_flag'.  __lookup_hash() then gets
reval_flag combined with whatever create flags were determined to be
needed.
Reviewed-by: NDavid Disseldorp <ddiss@suse.de>
Reviewed-by: NJeff Layton <jlayton@kernel.org>
Signed-off-by: NNeilBrown <neilb@suse.de>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b3d4650d

24 1月, 2022 1 次提交

fsnotify: invalidate dcache before IN_DELETE event · a37d9a17

由 Amir Goldstein 提交于 1月 20, 2022

Apparently, there are some applications that use IN_DELETE event as an
invalidation mechanism and expect that if they try to open a file with
the name reported with the delete event, that it should not contain the
content of the deleted file.

Commit 49246466 ("fsnotify: move fsnotify_nameremove() hook out of
d_delete()") moved the fsnotify delete hook before d_delete() so fsnotify
will have access to a positive dentry.

This allowed a race where opening the deleted file via cached dentry
is now possible after receiving the IN_DELETE event.

To fix the regression, create a new hook fsnotify_delete() that takes
the unlinked inode as an argument and use a helper d_delete_notify() to
pin the inode, so we can pass it to fsnotify_delete() after d_delete().

Backporting hint: this regression is from v5.3. Although patch will
apply with only trivial conflicts to v5.4 and v5.10, it won't build,
because fsnotify_delete() implementation is different in each of those
versions (see fsnotify_link()).

A follow up patch will fix the fsnotify_unlink/rmdir() calls in pseudo
filesystem that do not need to call d_delete().

Link: https://lore.kernel.org/r/20220120215305.282577-1-amir73il@gmail.comReported-by: NIvan Delalande <colona@arista.com>
Link: https://lore.kernel.org/linux-fsdevel/YeNyzoDM5hP5LtGW@visor/
Fixes: 49246466 ("fsnotify: move fsnotify_nameremove() hook out of d_delete()")
Cc: stable@vger.kernel.org # v5.3+
Signed-off-by: NAmir Goldstein <amir73il@gmail.com>
Signed-off-by: NJan Kara <jack@suse.cz>

a37d9a17

22 1月, 2022 1 次提交

fs: move namei sysctls to its own file · 9c011be1

由 Luis Chamberlain 提交于 1月 21, 2022

kernel/sysctl.c is a kitchen sink where everyone leaves their dirty
dishes, this makes it very difficult to maintain.

To help with this maintenance let's start by moving sysctls to places
where they actually belong.  The proc sysctl maintainers do not want to
know what sysctl knobs you wish to add for your own piece of code, we
just care about the core logic.

So move namei's own sysctl knobs to its own file.

Other than the move we also avoid initializing two static variables to 0
as this is not needed:

  * sysctl_protected_symlinks
  * sysctl_protected_hardlinks

Link: https://lkml.kernel.org/r/20211129205548.605569-8-mcgrof@kernel.orgSigned-off-by: NLuis Chamberlain <mcgrof@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Cc: Antti Palosaari <crope@iki.fi>
Cc: Eric Biederman <ebiederm@xmission.com>
Cc: Iurii Zaikin <yzaikin@google.com>
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Jeff Layton <jlayton@kernel.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Lukas Middendorf <kernel@tuxforce.de>
Cc: Stephen Kitt <steve@sk2.org>
Cc: Xiaoming Ni <nixiaoming@huawei.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

9c011be1

07 1月, 2022 1 次提交

vfs, cachefiles: Mark a backing file in use with an inode flag · 1bd9c4e4

由 David Howells 提交于 11月 18, 2021

Use an inode flag, S_KERNEL_FILE, to mark that a backing file is in use by
the kernel to prevent cachefiles or other kernel services from interfering
with that file.

Alter rmdir to reject attempts to remove a directory marked with this flag.
This is used by cachefiles to prevent cachefilesd from removing them.

Using S_SWAPFILE instead isn't really viable as that has other effects in
the I/O paths.

Changes
=======
ver #3:
 - Check for the object pointer being NULL in the tracepoints rather than
   the caller.
Signed-off-by: NDavid Howells <dhowells@redhat.com>
Reviewed-by: NJeff Layton <jlayton@kernel.org>
cc: linux-cachefs@redhat.com
Link: https://lore.kernel.org/r/163819630256.215744.4815885535039369574.stgit@warthog.procyon.org.uk/ # v1
Link: https://lore.kernel.org/r/163906931596.143852.8642051223094013028.stgit@warthog.procyon.org.uk/ # v2
Link: https://lore.kernel.org/r/163967141000.1823006.12920680657559677789.stgit@warthog.procyon.org.uk/ # v3
Link: https://lore.kernel.org/r/164021541207.640689.564689725898537127.stgit@warthog.procyon.org.uk/ # v4

1bd9c4e4

27 10月, 2021 1 次提交

fs: remove leftover comments from mandatory locking removal · 482e0007

由 Jeff Layton 提交于 10月 26, 2021

Stragglers from commit f7e33bdb ("fs: remove mandatory file locking
support").
Signed-off-by: NJeff Layton <jlayton@kernel.org>

482e0007

08 9月, 2021 5 次提交

putname(): IS_ERR_OR_NULL() is wrong here · ea47ab11

由 Al Viro 提交于 9月 07, 2021

Mixing NULL and ERR_PTR() just in case is a Bad Idea(tm).  For
struct filename the former is wrong - failures are reported
as ERR_PTR(...), not as NULL.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

ea47ab11

namei: Standardize callers of filename_create() · b4a4f213

由 Stephen Brennan 提交于 9月 01, 2021

filename_create() has two variants, one which drops the caller's
reference to filename (filename_create) and one which does
not (__filename_create). This can be confusing as it's unusual to drop a
caller's reference. Remove filename_create, rename __filename_create
to filename_create, and convert all callers.

Link: https://lore.kernel.org/linux-fsdevel/f6238254-35bd-7e97-5b27-21050c745874@oracle.com/
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NStephen Brennan <stephen.s.brennan@oracle.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

b4a4f213

namei: Standardize callers of filename_lookup() · 794ebcea

由 Stephen Brennan 提交于 9月 01, 2021

filename_lookup() has two variants, one which drops the caller's
reference to filename (filename_lookup), and one which does
not (__filename_lookup). This can be confusing as it's unusual to drop a
caller's reference. Remove filename_lookup, rename __filename_lookup
to filename_lookup, and convert all callers. The cost is a few slightly
longer functions, but the clarity is greater.

[AV: consuming a reference is not at all unusual, actually; look at
e.g. do_mkdirat(), for example.  It's more that we want non-consuming
variant for close relative of that function...]

Link: https://lore.kernel.org/linux-fsdevel/YS+dstZ3xfcLxhoB@zeniv-ca.linux.org.uk/
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NStephen Brennan <stephen.s.brennan@oracle.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

794ebcea

rename __filename_parentat() to filename_parentat() · c5f563f9

由 Al Viro 提交于 9月 07, 2021

... in separate commit, to avoid noise in previous one
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

c5f563f9

namei: Fix use after free in kern_path_locked · 0766ec82

由 Stephen Brennan 提交于 9月 01, 2021

In 0ee50b47 ("namei: change filename_parentat() calling
conventions"), filename_parentat() was made to always call putname() on
the  filename before returning, and kern_path_locked() was migrated to
this calling convention.  However, kern_path_locked() uses the "last"
parameter to lookup and potentially create a new dentry. The last
parameter contains the last component of the path and points within the
filename, which was recently freed at the end of filename_parentat().
Thus, when kern_path_locked() calls __lookup_hash(), it is using the
filename after it has already been freed.

In other words, these calling conventions had been wrong for the
only remaining caller of filename_parentat().  Everything else
is using __filename_parentat(), which does not drop the reference;
so should kern_path_locked().

Switch kern_path_locked() to use of __filename_parentat() and move
getting/dropping struct filename into wrapper.  Remove filename_parentat(),
now that we have no remaining callers.

Fixes: 0ee50b47 ("namei: change filename_parentat() calling conventions")
Link: https://lore.kernel.org/linux-fsdevel/YS9D4AlEsaCxLFV0@infradead.org/
Link: https://lore.kernel.org/linux-fsdevel/YS+csMTV2tTXKg3s@zeniv-ca.linux.org.uk/
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Reported-by: syzbot+fb0d60a179096e8c2731@syzkaller.appspotmail.com
Signed-off-by: NStephen Brennan <stephen.s.brennan@oracle.com>
Co-authored-by: NDmitry Kadashev <dkadashev@gmail.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

0766ec82

04 9月, 2021 1 次提交

fs, mm: fix race in unlinking swapfile · 51cc3a66

由 Hugh Dickins 提交于 9月 02, 2021

We had a recurring situation in which admin procedures setting up
swapfiles would race with test preparation clearing away swapfiles; and
just occasionally that got stuck on a swapfile "(deleted)" which could
never be swapped off.  That is not supposed to be possible.

2.6.28 commit f9454548 ("don't unlink an active swapfile") admitted
that it was leaving a race window open: now close it.

may_delete() makes the IS_SWAPFILE check (amongst many others) before
inode_lock has been taken on target: now repeat just that simple check in
vfs_unlink() and vfs_rename(), after taking inode_lock.

Which goes most of the way to fixing the race, but swapon() must also
check after it acquires inode_lock, that the file just opened has not
already been unlinked.

Link: https://lkml.kernel.org/r/e17b91ad-a578-9a15-5e3-4989e0f999b5@google.com
Fixes: f9454548 ("don't unlink an active swapfile")
Signed-off-by: NHugh Dickins <hughd@google.com>
Reviewed-by: NJan Kara <jack@suse.cz>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

51cc3a66

24 8月, 2021 1 次提交

io_uring: add support for IORING_OP_LINKAT · cf30da90

由 Dmitry Kadashev 提交于 7月 08, 2021

IORING_OP_LINKAT behaves like linkat(2) and takes the same flags and
arguments.

In some internal places 'hardlink' is used instead of 'link' to avoid
confusion with the SQE links. Name 'link' conflicts with the existing
'link' member of io_kiocb.
Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
Suggested-by: NChristian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/io-uring/20210514145259.wtl4xcsp52woi6ab@wittgenstein/Signed-off-by: NDmitry Kadashev <dkadashev@gmail.com>
Acked-by: NChristian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20210708063447.3556403-12-dkadashev@gmail.com
[axboe: add splice_fd_in check]
Signed-off-by: NJens Axboe <axboe@kernel.dk>

cf30da90

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功