提交 · 3707d84c13670bf09b4a9a4dc6733326d8344b31 · openeuler / Kernel

19 1月, 2023 3 次提交

由 Christian Brauner 提交于 1月 13, 2023

Now that we converted everything to just rely on struct mnt_idmap move it all
into a separate file. This ensure that no code can poke around in struct
mnt_idmap without any dedicated helpers and makes it easier to extend it in the
future. Filesystems will now not be able to conflate mount and filesystem
idmappings as they are two distinct types and require distinct helpers that
cannot be used interchangeably. We are now also able to extend struct mnt_idmap
as we see fit.
Acked-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NChristian Brauner (Microsoft) <brauner@kernel.org>

3707d84c

fs: port privilege checking helpers to mnt_idmap · 9452e93e

由 Christian Brauner 提交于 1月 13, 2023

Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.
Acked-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NChristian Brauner (Microsoft) <brauner@kernel.org>

9452e93e

fs: port ->permission() to pass mnt_idmap · 4609e1f1

由 Christian Brauner 提交于 1月 13, 2023

Convert to struct mnt_idmap.

Last cycle we merged the necessary infrastructure in
256c8aed ("fs: introduce dedicated idmap type for mounts").
This is just the conversion to struct mnt_idmap.

Currently we still pass around the plain namespace that was attached to a
mount. This is in general pretty convenient but it makes it easy to
conflate namespaces that are relevant on the filesystem with namespaces
that are relevent on the mount level. Especially for non-vfs developers
without detailed knowledge in this area this can be a potential source for
bugs.

Once the conversion to struct mnt_idmap is done all helpers down to the
really low-level helpers will take a struct mnt_idmap argument instead of
two namespace arguments. This way it becomes impossible to conflate the two
eliminating the possibility of any bugs. All of the vfs and all filesystems
only operate on struct mnt_idmap.
Acked-by: NDave Chinner <dchinner@redhat.com>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NChristian Brauner (Microsoft) <brauner@kernel.org>

4609e1f1

01 11月, 2022 1 次提交

acl: conver higher-level helpers to rely on mnt_idmap · 5a6f52d2

由 Christian Brauner 提交于 10月 28, 2022

Convert an initial portion to rely on struct mnt_idmap by converting the
high level xattr helpers.
Reviewed-by: NSeth Forshee (DigitalOcean) <sforshee@kernel.org>
Signed-off-by: NChristian Brauner (Microsoft) <brauner@kernel.org>

5a6f52d2

20 10月, 2022 2 次提交

xattr: use posix acl api · 318e6685

由 Christian Brauner 提交于 9月 22, 2022

In previous patches we built a new posix api solely around get and set
inode operations. Now that we have all the pieces in place we can switch
the system calls and the vfs over to only rely on this api when
interacting with posix acls. This finally removes all type unsafety and
type conversion issues explained in detail in [1] that we aim to get rid
of.

With the new posix acl api we immediately translate into an appropriate
kernel internal struct posix_acl format both when getting and setting
posix acls. This is a stark contrast to before were we hacked unsafe raw
values into the uapi struct that was stored in a void pointer relying
and having filesystems and security modules hack around in the uapi
struct as well.

Link: https://lore.kernel.org/all/20220801145520.1532837-1-brauner@kernel.org [1]
Signed-off-by: NChristian Brauner (Microsoft) <brauner@kernel.org>

318e6685

internal: add may_write_xattr() · 56851bc9

由 Christian Brauner 提交于 9月 29, 2022

Split out the generic checks whether an inode allows writing xattrs. Since
security.* and system.* xattrs don't have any restrictions and we're going
to split out posix acls into a dedicated api we will use this helper to
check whether we can write posix acls.
Signed-off-by: NChristian Brauner (Microsoft) <brauner@kernel.org>

56851bc9

18 10月, 2022 3 次提交

attr: use consistent sgid stripping checks · ed5a7047

由 Christian Brauner 提交于 10月 17, 2022

Currently setgid stripping in file_remove_privs()'s should_remove_suid()
helper is inconsistent with other parts of the vfs. Specifically, it only
raises ATTR_KILL_SGID if the inode is S_ISGID and S_IXGRP but not if the
inode isn't in the caller's groups and the caller isn't privileged over the
inode although we require this already in setattr_prepare() and
setattr_copy() and so all filesystem implement this requirement implicitly
because they have to use setattr_{prepare,copy}() anyway.

But the inconsistency shows up in setgid stripping bugs for overlayfs in
xfstests (e.g., generic/673, generic/683, generic/685, generic/686,
generic/687). For example, we test whether suid and setgid stripping works
correctly when performing various write-like operations as an unprivileged
user (fallocate, reflink, write, etc.):

echo "Test 1 - qa_user, non-exec file $verb"
setup_testfile
chmod a+rws $junk_file
commit_and_check "$qa_user" "$verb" 64k 64k

The test basically creates a file with 6666 permissions. While the file has
the S_ISUID and S_ISGID bits set it does not have the S_IXGRP set. On a
regular filesystem like xfs what will happen is:

sys_fallocate()
-> vfs_fallocate()
   -> xfs_file_fallocate()
      -> file_modified()
         -> __file_remove_privs()
            -> dentry_needs_remove_privs()
               -> should_remove_suid()
            -> __remove_privs()
               newattrs.ia_valid = ATTR_FORCE | kill;
               -> notify_change()
                  -> setattr_copy()

In should_remove_suid() we can see that ATTR_KILL_SUID is raised
unconditionally because the file in the test has S_ISUID set.

But we also see that ATTR_KILL_SGID won't be set because while the file
is S_ISGID it is not S_IXGRP (see above) which is a condition for
ATTR_KILL_SGID being raised.

So by the time we call notify_change() we have attr->ia_valid set to
ATTR_KILL_SUID | ATTR_FORCE. Now notify_change() sees that
ATTR_KILL_SUID is set and does:

ia_valid = attr->ia_valid |= ATTR_MODE
attr->ia_mode = (inode->i_mode & ~S_ISUID);

which means that when we call setattr_copy() later we will definitely
update inode->i_mode. Note that attr->ia_mode still contains S_ISGID.

Now we call into the filesystem's ->setattr() inode operation which will
end up calling setattr_copy(). Since ATTR_MODE is set we will hit:

if (ia_valid & ATTR_MODE) {
        umode_t mode = attr->ia_mode;
        vfsgid_t vfsgid = i_gid_into_vfsgid(mnt_userns, inode);
        if (!vfsgid_in_group_p(vfsgid) &&
            !capable_wrt_inode_uidgid(mnt_userns, inode, CAP_FSETID))
                mode &= ~S_ISGID;
        inode->i_mode = mode;
}

and since the caller in the test is neither capable nor in the group of the
inode the S_ISGID bit is stripped.

But assume the file isn't suid then ATTR_KILL_SUID won't be raised which
has the consequence that neither the setgid nor the suid bits are stripped
even though it should be stripped because the inode isn't in the caller's
groups and the caller isn't privileged over the inode.

If overlayfs is in the mix things become a bit more complicated and the bug
shows up more clearly. When e.g., ovl_setattr() is hit from
ovl_fallocate()'s call to file_remove_privs() then ATTR_KILL_SUID and
ATTR_KILL_SGID might be raised but because the check in notify_change() is
questioning the ATTR_KILL_SGID flag again by requiring S_IXGRP for it to be
stripped the S_ISGID bit isn't removed even though it should be stripped:

sys_fallocate()
-> vfs_fallocate()
   -> ovl_fallocate()
      -> file_remove_privs()
         -> dentry_needs_remove_privs()
            -> should_remove_suid()
         -> __remove_privs()
            newattrs.ia_valid = ATTR_FORCE | kill;
            -> notify_change()
               -> ovl_setattr()
                  // TAKE ON MOUNTER'S CREDS
                  -> ovl_do_notify_change()
                     -> notify_change()
                  // GIVE UP MOUNTER'S CREDS
     // TAKE ON MOUNTER'S CREDS
     -> vfs_fallocate()
        -> xfs_file_fallocate()
           -> file_modified()
              -> __file_remove_privs()
                 -> dentry_needs_remove_privs()
                    -> should_remove_suid()
                 -> __remove_privs()
                    newattrs.ia_valid = attr_force | kill;
                    -> notify_change()

The fix for all of this is to make file_remove_privs()'s
should_remove_suid() helper to perform the same checks as we already
require in setattr_prepare() and setattr_copy() and have notify_change()
not pointlessly requiring S_IXGRP again. It doesn't make any sense in the
first place because the caller must calculate the flags via
should_remove_suid() anyway which would raise ATTR_KILL_SGID.

While we're at it we move should_remove_suid() from inode.c to attr.c
where it belongs with the rest of the iattr helpers. Especially since it
returns ATTR_KILL_S{G,U}ID flags. We also rename it to
setattr_should_drop_suidgid() to better reflect that it indicates both
setuid and setgid bit removal and also that it returns attr flags.

Running xfstests with this doesn't report any regressions. We should really
try and use consistent checks.
Reviewed-by: NAmir Goldstein <amir73il@gmail.com>
Signed-off-by: NChristian Brauner (Microsoft) <brauner@kernel.org>

ed5a7047

attr: add setattr_should_drop_sgid() · 72ae017c

由 Christian Brauner 提交于 10月 17, 2022

The current setgid stripping logic during write and ownership change
operations is inconsistent and strewn over multiple places. In order to
consolidate it and make more consistent we'll add a new helper
setattr_should_drop_sgid(). The function retains the old behavior where
we remove the S_ISGID bit unconditionally when S_IXGRP is set but also
when it isn't set and the caller is neither in the group of the inode
nor privileged over the inode.

We will use this helper both in write operation permission removal such
as file_remove_privs() as well as in ownership change operations.
Reviewed-by: NAmir Goldstein <amir73il@gmail.com>
Signed-off-by: NChristian Brauner (Microsoft) <brauner@kernel.org>

72ae017c

attr: add in_group_or_capable() · 11c2a870

由 Christian Brauner 提交于 10月 17, 2022

In setattr_{copy,prepare}() we need to perform the same permission
checks to determine whether we need to drop the setgid bit or not.
Instead of open-coding it twice add a simple helper the encapsulates the
logic. We will reuse this helpers to make dropping the setgid bit during
write operations more consistent in a follow up patch.
Reviewed-by: NAmir Goldstein <amir73il@gmail.com>
Signed-off-by: NChristian Brauner (Microsoft) <brauner@kernel.org>

11c2a870

29 9月, 2022 1 次提交

[coredump] don't use __kernel_write() on kmap_local_page() · 06bbaa6d

由 Al Viro 提交于 9月 26, 2022

passing kmap_local_page() result to __kernel_write() is unsafe -
random ->write_iter() might (and 9p one does) get unhappy when
passed ITER_KVEC with pointer that came from kmap_local_page().

Fix by providing a variant of __kernel_write() that takes an iov_iter
from caller (__kernel_write() becomes a trivial wrapper) and adding
dump_emit_page() that parallels dump_emit(), except that instead of
__kernel_write() it uses __kernel_write_iter() with ITER_BVEC source.

Fixes: 3159ed57 "fs/coredump: use kmap_local_page()"
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

06bbaa6d

02 9月, 2022 1 次提交

may_linkat(): constify path · 8996682b

由 Al Viro 提交于 8月 04, 2022

Reviewed-by: NChristian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

8996682b

16 8月, 2022 1 次提交

locks: fix TOCTOU race when granting write lease · d6da19c9

由 Amir Goldstein 提交于 8月 16, 2022

Thread A trying to acquire a write lease checks the value of i_readcount
and i_writecount in check_conflicting_open() to verify that its own fd
is the only fd referencing the file.

Thread B trying to open the file for read will call break_lease() in
do_dentry_open() before incrementing i_readcount, which leaves a small
window where thread A can acquire the write lease and then thread B
completes the open of the file for read without breaking the write lease
that was acquired by thread A.

Fix this race by incrementing i_readcount before checking for existing
leases, same as the case with i_writecount.

Use a helper put_file_access() to decrement i_readcount or i_writecount
in do_dentry_open() and __fput().

Fixes: 387e3746 ("locks: eliminate false positive conflicts for write lease")
Reviewed-by: NJeff Layton <jlayton@kernel.org>
Signed-off-by: NAmir Goldstein <amir73il@gmail.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

d6da19c9

20 5月, 2022 1 次提交

uninline may_mount() and don't opencode it in fspick(2)/fsopen(2) · a5f85d78

由 Al Viro 提交于 3月 01, 2022

It's done once per (mount-related) syscall and there's no point
whatsoever making it inline.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

a5f85d78

15 5月, 2022 1 次提交

Unify the primitives for file descriptor closing · 6319194e

由 Al Viro 提交于 5月 12, 2022

Currently we have 3 primitives for removing an opened file from descriptor
table - pick_file(), __close_fd_get_file() and close_fd_get_file().  Their
calling conventions are rather odd and there's a code duplication for no
good reason.  They can be unified -

1) have __range_close() cap max_fd in the very beginning; that way
we don't need separate way for pick_file() to report being past the end
of descriptor table.

2) make {__,}close_fd_get_file() return file (or NULL) directly, rather
than returning it via struct file ** argument.  Don't bother with
(bogus) return value - nobody wants that -ENOENT.

3) make pick_file() return NULL on unopened descriptor - the only caller
that used to care about the distinction between descriptor past the end
of descriptor table and finding NULL in descriptor table doesn't give
a damn after (1).

4) lift ->files_lock out of pick_file()

That actually simplifies the callers, as well as the primitives themselves.
Code duplication is also gone...
Reviewed-by: NChristian Brauner (Microsoft) <brauner@kernel.org>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

6319194e

25 4月, 2022 2 次提交

fs: split off do_getxattr from getxattr · c975cad9

由 Stefan Roesch 提交于 4月 24, 2022

This splits off do_getxattr function from the getxattr function. This will
allow io_uring to call it from its io worker.
Signed-off-by: NStefan Roesch <shr@fb.com>
Acked-by: NChristian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20220323154420.3301504-3-shr@fb.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

c975cad9

fs: split off setxattr_copy and do_setxattr function from setxattr · 1a91794c

由 Stefan Roesch 提交于 4月 24, 2022

This splits of the setup part of the function setxattr in its own
dedicated function called setxattr_copy. In addition it also exposes a new
function called do_setxattr for making the setxattr call.

This makes it possible to call these two functions from io_uring in the
processing of an xattr request.
Signed-off-by: NStefan Roesch <shr@fb.com>
Acked-by: NChristian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20220323154420.3301504-2-shr@fb.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

1a91794c

14 3月, 2022 1 次提交

fs: export rw_verify_area() · 87112933

由 Omar Sandoval 提交于 9月 04, 2019

I'm adding btrfs ioctls to read and write compressed data, and rather
than duplicating the checks in rw_verify_area(), let's just export it.
Reviewed-by: NJosef Bacik <josef@toxicpanda.com>
Signed-off-by: NOmar Sandoval <osandov@fb.com>
Reviewed-by: NDavid Sterba <dsterba@suse.com>
Signed-off-by: NDavid Sterba <dsterba@suse.com>

87112933

11 3月, 2022 1 次提交

io-uring: Make statx API stable · 1b6fe6e0

由 Stefan Roesch 提交于 2月 25, 2022

One of the key architectual tenets is to keep the parameters for
io-uring stable. After the call has been submitted, its value can
be changed. Unfortunaltely this is not the case for the current statx
implementation.

IO-Uring change:
This changes replaces the const char * filename pointer in the io_statx
structure with a struct filename *. In addition it also creates the
filename object during the prepare phase.

With this change, the opcode also needs to invoke cleanup, so the
filename object gets freed after processing the request.

fs change:
This replaces the const char* __user filename parameter in the two
functions do_statx and vfs_statx with a struct filename *. In addition
to be able to correctly construct a filename object a new helper
function getname_statx_lookup_flags is introduced. The function makes
sure that do_statx and vfs_statx is invoked with the correct lookup flags.
Signed-off-by: NStefan Roesch <shr@fb.com>
Link: https://lore.kernel.org/r/20220225185326.1373304-2-shr@fb.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

1b6fe6e0

31 1月, 2022 1 次提交
- A
  constify struct path argument of finish_automount()/do_add_mount() · 1e2d8464
  由 Al Viro 提交于 6月 18, 2021
```
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
  1e2d8464
17 12月, 2021 1 次提交

fs/buffer: Convert __block_write_begin_int() to take a folio · d1bd0b4e

由 Matthew Wilcox (Oracle) 提交于 11月 03, 2021

There are no plans to convert buffer_head infrastructure to use large
folios, but __block_write_begin_int() is called from iomap, and it's
more convenient and less error-prone if we pass in a folio from iomap.
It also has a nice saving of almost 200 bytes of code from removing
repeated calls to compound_head().
Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org>
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NDarrick J. Wong <djwong@kernel.org>

d1bd0b4e

10 11月, 2021 1 次提交

vfs: keep inodes with page cache off the inode shrinker LRU · 51b8c1fe

由 Johannes Weiner 提交于 11月 08, 2021

Historically (pre-2.5), the inode shrinker used to reclaim only empty
inodes and skip over those that still contained page cache.  This caused
problems on highmem hosts: struct inode could put fill lowmem zones
before the cache was getting reclaimed in the highmem zones.

To address this, the inode shrinker started to strip page cache to
facilitate reclaiming lowmem.  However, this comes with its own set of
problems: the shrinkers may drop actively used page cache just because
the inodes are not currently open or dirty - think working with a large
git tree.  It further doesn't respect cgroup memory protection settings
and can cause priority inversions between containers.

Nowadays, the page cache also holds non-resident info for evicted cache
pages in order to detect refaults.  We've come to rely heavily on this
data inside reclaim for protecting the cache workingset and driving swap
behavior.  We also use it to quantify and report workload health through
psi.  The latter in turn is used for fleet health monitoring, as well as
driving automated memory sizing of workloads and containers, proactive
reclaim and memory offloading schemes.

The consequences of dropping page cache prematurely is that we're seeing
subtle and not-so-subtle failures in all of the above-mentioned
scenarios, with the workload generally entering unexpected thrashing
states while losing the ability to reliably detect it.

To fix this on non-highmem systems at least, going back to rotating
inodes on the LRU isn't feasible.  We've tried (commit a76cf1a4
("mm: don't reclaim inodes with many attached pages")) and failed
(commit 69056ee6 ("Revert "mm: don't reclaim inodes with many
attached pages"")).

The issue is mostly that shrinker pools attract pressure based on their
size, and when objects get skipped the shrinkers remember this as
deferred reclaim work.  This accumulates excessive pressure on the
remaining inodes, and we can quickly eat into heavily used ones, or
dirty ones that require IO to reclaim, when there potentially is plenty
of cold, clean cache around still.

Instead, this patch keeps populated inodes off the inode LRU in the
first place - just like an open file or dirty state would.  An otherwise
clean and unused inode then gets queued when the last cache entry
disappears.  This solves the problem without reintroducing the reclaim
issues, and generally is a bit more scalable than having to wade through
potentially hundreds of thousands of busy inodes.

Locking is a bit tricky because the locks protecting the inode state
(i_lock) and the inode LRU (lru_list.lock) don't nest inside the
irq-safe page cache lock (i_pages.xa_lock).  Page cache deletions are
serialized through i_lock, taken before the i_pages lock, to make sure
depopulated inodes are queued reliably.  Additions may race with
deletions, but we'll check again in the shrinker.  If additions race
with the shrinker itself, we're protected by the i_lock: if find_inode()
or iput() win, the shrinker will bail on the elevated i_count or
I_REFERENCED; if the shrinker wins and goes ahead with the inode, it
will set I_FREEING and inhibit further igets(), which will cause the
other side to create a new instance of the inode instead.

Link: https://lkml.kernel.org/r/20210614211904.14420-4-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
Cc: Roman Gushchin <guro@fb.com>
Cc: Tejun Heo <tj@kernel.org>
Cc: Dave Chinner <david@fromorbit.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

51b8c1fe

22 10月, 2021 2 次提交

block: simplify the block device syncing code · 1e03a36b

由 Christoph Hellwig 提交于 10月 19, 2021

Get rid of the indirections and just provide a sync_bdevs
helper for the generic sync code.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211019062530.2174626-8-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

1e03a36b

block: remove __sync_blockdev · 70164eb6

由 Christoph Hellwig 提交于 10月 19, 2021

Instead offer a new sync_blockdev_nowait helper for the !wait case.
This new helper is exported as it will grow modular callers in a bit.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20211019062530.2174626-3-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

70164eb6

07 9月, 2021 1 次提交

block: move fs/block_dev.c to block/bdev.c · 0dca4462

由 Christoph Hellwig 提交于 9月 07, 2021

Move it together with the rest of the block layer.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20210907141303.1371844-3-hch@lst.deSigned-off-by: NJens Axboe <axboe@kernel.dk>

0dca4462

24 8月, 2021 4 次提交

io_uring: add support for IORING_OP_LINKAT · cf30da90

由 Dmitry Kadashev 提交于 7月 08, 2021

IORING_OP_LINKAT behaves like linkat(2) and takes the same flags and
arguments.

In some internal places 'hardlink' is used instead of 'link' to avoid
confusion with the SQE links. Name 'link' conflicts with the existing
'link' member of io_kiocb.
Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
Suggested-by: NChristian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/io-uring/20210514145259.wtl4xcsp52woi6ab@wittgenstein/Signed-off-by: NDmitry Kadashev <dkadashev@gmail.com>
Acked-by: NChristian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20210708063447.3556403-12-dkadashev@gmail.com
[axboe: add splice_fd_in check]
Signed-off-by: NJens Axboe <axboe@kernel.dk>

cf30da90

io_uring: add support for IORING_OP_SYMLINKAT · 7a8721f8

由 Dmitry Kadashev 提交于 7月 08, 2021

IORING_OP_SYMLINKAT behaves like symlinkat(2) and takes the same flags
and arguments.
Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
Suggested-by: NChristian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/io-uring/20210514145259.wtl4xcsp52woi6ab@wittgenstein/Signed-off-by: NDmitry Kadashev <dkadashev@gmail.com>
Acked-by: NChristian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20210708063447.3556403-11-dkadashev@gmail.com
[axboe: add splice_fd_in check]
Signed-off-by: NJens Axboe <axboe@kernel.dk>

7a8721f8

namei: update do_*() helpers to return ints · 45f30dab

由 Dmitry Kadashev 提交于 7月 08, 2021

Update the following to return int rather than long, for uniformity with
the rest of the do_* helpers in namei.c:

* do_rmdir()
* do_unlinkat()
* do_mkdirat()
* do_mknodat()
* do_symlinkat()

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Christian Brauner <christian.brauner@ubuntu.com>
Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/io-uring/20210514143202.dmzfcgz5hnauy7ze@wittgenstein/Signed-off-by: NDmitry Kadashev <dkadashev@gmail.com>
Acked-by: NChristian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20210708063447.3556403-9-dkadashev@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

45f30dab

namei: make do_mkdirat() take struct filename · 584d3226

由 Dmitry Kadashev 提交于 7月 08, 2021

Pass in the struct filename pointers instead of the user string, and
update the three callers to do the same. This is heavily based on
commit dbea8d345177 ("fs: make do_renameat2() take struct filename").

This behaves like do_unlinkat() and do_renameat2().

Cc: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: NDmitry Kadashev <dkadashev@gmail.com>
Acked-by: NChristian Brauner <christian.brauner@ubuntu.com>
Link: https://lore.kernel.org/r/20210708063447.3556403-4-dkadashev@gmail.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

584d3226

17 8月, 2021 1 次提交

fs: mark the iomap argument to __block_write_begin_int const · 6d49cc85

由 Christoph Hellwig 提交于 8月 10, 2021

__block_write_begin_int never modifies the passed in iomap, so mark it
const.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NDarrick J. Wong <djwong@kernel.org>
Signed-off-by: NDarrick J. Wong <djwong@kernel.org>

6d49cc85

22 7月, 2021 1 次提交

cgroup1: fix leaked context root causing sporadic NULL deref in LTP · 1e7107c5

由 Paul Gortmaker 提交于 6月 16, 2021

Richard reported sporadic (roughly one in 10 or so) null dereferences and
other strange behaviour for a set of automated LTP tests.  Things like:

   BUG: kernel NULL pointer dereference, address: 0000000000000008
   #PF: supervisor read access in kernel mode
   #PF: error_code(0x0000) - not-present page
   PGD 0 P4D 0
   Oops: 0000 [#1] PREEMPT SMP PTI
   CPU: 0 PID: 1516 Comm: umount Not tainted 5.10.0-yocto-standard #1
   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-48-gd9c812dda519-prebuilt.qemu.org 04/01/2014
   RIP: 0010:kernfs_sop_show_path+0x1b/0x60

...or these others:

   RIP: 0010:do_mkdirat+0x6a/0xf0
   RIP: 0010:d_alloc_parallel+0x98/0x510
   RIP: 0010:do_readlinkat+0x86/0x120

There were other less common instances of some kind of a general scribble
but the common theme was mount and cgroup and a dubious dentry triggering
the NULL dereference.  I was only able to reproduce it under qemu by
replicating Richard's setup as closely as possible - I never did get it
to happen on bare metal, even while keeping everything else the same.

In commit 71d883c3 ("cgroup_do_mount(): massage calling conventions")
we see this as a part of the overall change:

   --------------
           struct cgroup_subsys *ss;
   -       struct dentry *dentry;

   [...]

   -       dentry = cgroup_do_mount(&cgroup_fs_type, fc->sb_flags, root,
   -                                CGROUP_SUPER_MAGIC, ns);

   [...]

   -       if (percpu_ref_is_dying(&root->cgrp.self.refcnt)) {
   -               struct super_block *sb = dentry->d_sb;
   -               dput(dentry);
   +       ret = cgroup_do_mount(fc, CGROUP_SUPER_MAGIC, ns);
   +       if (!ret && percpu_ref_is_dying(&root->cgrp.self.refcnt)) {
   +               struct super_block *sb = fc->root->d_sb;
   +               dput(fc->root);
                   deactivate_locked_super(sb);
                   msleep(10);
                   return restart_syscall();
           }
   --------------

In changing from the local "*dentry" variable to using fc->root, we now
export/leave that dentry pointer in the file context after doing the dput()
in the unlikely "is_dying" case.   With LTP doing a crazy amount of back to
back mount/unmount [testcases/bin/cgroup_regression_5_1.sh] the unlikely
becomes slightly likely and then bad things happen.

A fix would be to not leave the stale reference in fc->root as follows:

   --------------
                  dput(fc->root);
  +               fc->root = NULL;
                  deactivate_locked_super(sb);
   --------------

...but then we are just open-coding a duplicate of fc_drop_locked() so we
simply use that instead.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: stable@vger.kernel.org      # v5.1+
Reported-by: NRichard Purdie <richard.purdie@linuxfoundation.org>
Fixes: 71d883c3 ("cgroup_do_mount(): massage calling conventions")
Signed-off-by: NPaul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: NTejun Heo <tj@kernel.org>

1e7107c5

08 4月, 2021 1 次提交

switch file_open_root() to struct path · ffb37ca3

由 Al Viro 提交于 4月 01, 2021

... and provide file_open_root_mnt(), using the root of given mount.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

ffb37ca3

02 2月, 2021 1 次提交

fs: provide locked helper variant of close_fd_get_file() · 53dec2ea

由 Jens Axboe 提交于 1月 19, 2021

Assumes current->files->file_lock is already held on invocation. Helps
the caller check the file before removing the fd, if it needs to.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

53dec2ea

26 1月, 2021 1 次提交
- A
  teach sendfile(2) to handle send-to-pipe directly · b964bf53
  由 Al Viro 提交于 1月 25, 2021
```
no point going through the intermediate pipe
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
```
  b964bf53
24 1月, 2021 1 次提交

namei: handle idmapped mounts in may_*() helpers · ba73d987

由 Christian Brauner 提交于 1月 21, 2021

The may_follow_link(), may_linkat(), may_lookup(), may_open(),
may_o_create(), may_create_in_sticky(), may_delete(), and may_create()
helpers determine whether the caller is privileged enough to perform the
associated operations. Let them handle idmapped mounts by mapping the
inode or fsids according to the mount's user namespace. Afterwards the
checks are identical to non-idmapped inodes. The patch takes care to
retrieve the mount's user namespace right before performing permission
checks and passing it down into the fileystem so the user namespace
can't change in between by someone idmapping a mount that is currently
not idmapped. If the initial user namespace is passed nothing changes so
non-idmapped mounts will see identical behavior as before.

Link: https://lore.kernel.org/r/20210121131959.646623-13-christian.brauner@ubuntu.com
Cc: Christoph Hellwig <hch@lst.de>
Cc: David Howells <dhowells@redhat.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: linux-fsdevel@vger.kernel.org
Reviewed-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NJames Morris <jamorris@linux.microsoft.com>
Signed-off-by: NChristian Brauner <christian.brauner@ubuntu.com>

ba73d987

10 12月, 2020 1 次提交

fs: make do_renameat2() take struct filename · e886663c

由 Jens Axboe 提交于 9月 26, 2020

Pass in the struct filename pointers instead of the user string, and
update the three callers to do the same.

This behaves like do_unlinkat(), which also takes a filename struct and
puts it when it is done. Converting callers is then trivial.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

e886663c

02 12月, 2020 2 次提交

block: remove i_bdev · 4e7b5671

由 Christoph Hellwig 提交于 11月 23, 2020

Switch the block device lookup interfaces to directly work with a dev_t
so that struct block_device references are only acquired by the
blkdev_get variants (and the blk-cgroup special case).  This means that
we now don't need an extra reference in the inode and can generally
simplify handling of struct block_device to keep the lookups contained
in the core block layer code.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NJan Kara <jack@suse.cz>
Reviewed-by: NHannes Reinecke <hare@suse.de>
Acked-by: NTejun Heo <tj@kernel.org>
Acked-by: Coly Li <colyli@suse.de>		[bcache]
Signed-off-by: NJens Axboe <axboe@kernel.dk>

4e7b5671

fs: remove get_super_thawed and get_super_exclusive_thawed · 60b49885

由 Christoph Hellwig 提交于 11月 16, 2020

Just open code the wait in the only caller of both functions.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Reviewed-by: NJan Kara <jack@suse.cz>
Reviewed-by: NHannes Reinecke <hare@suse.de>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

60b49885

23 9月, 2020 1 次提交

fs: remove compat_sys_mount · 028abd92

由 Christoph Hellwig 提交于 9月 17, 2020

compat_sys_mount is identical to the regular sys_mount now, so remove it
and use the native version everywhere.
Signed-off-by: NChristoph Hellwig <hch@lst.de>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>

028abd92

31 7月, 2020 2 次提交

init: add an init_mknod helper · 5fee64fc

由 Christoph Hellwig 提交于 7月 22, 2020

Add a simple helper to mknod with a kernel space file name and switch
the early init code over to it.  Remove the now unused ksys_mknod.
Signed-off-by: NChristoph Hellwig <hch@lst.de>

5fee64fc

init: add an init_mkdir helper · 83ff98c3

由 Christoph Hellwig 提交于 7月 22, 2020

Add a simple helper to mkdir with a kernel space file name and switch
the early init code over to it.  Remove the now unused ksys_mkdir.
Signed-off-by: NChristoph Hellwig <hch@lst.de>

83ff98c3

openeuler / Kernel 大约 1 年 前同步成功

openeuler / Kernel
大约 1 年前同步成功