- 21 February 2021, 1 commit
-
-
By Al Viro
After switching to non-RCU mode, we want nd->depth to match the number of entries in nd->stack[] that need eventual path_put(). legitimize_links() takes care of that on failures; unfortunately, the failure exits added for LOOKUP_CACHED do not. We could add that logic to those failure exits, both in try_to_unlazy() and in try_to_unlazy_next(), but since both checks are immediately followed by legitimize_links(), and there are no calls of legitimize_links() other than those two, it's easier to move the check (and the required handling of nd->depth on failure) into legitimize_links() itself.

[caught by Jens: ... and since we are zeroing ->depth here, we need to do drop_links() first]

Fixes: 6c6ec2b0 ("fs: add support for LOOKUP_CACHED")
Tested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-
- 5 January 2021, 4 commits
-
-
By Jens Axboe
Now that we support non-blocking path resolution internally, expose it via openat2() in the struct open_how ->resolve flags. This allows applications using openat2() to limit path resolution to the extent that it is already cached. If the lookup cannot be satisfied in a non-blocking manner, openat2(2) will return -1/-EAGAIN.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
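For illustration, a minimal userspace sketch of this interface, assuming the resolve flag landed as RESOLVE_CACHED in linux/openat2.h and calling the raw syscall, since libc wrappers for openat2() are not universal; the path is arbitrary:

    #define _GNU_SOURCE
    #include <errno.h>
    #include <fcntl.h>            /* O_RDONLY, AT_FDCWD */
    #include <linux/openat2.h>    /* struct open_how, RESOLVE_CACHED */
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
            struct open_how how = {
                    .flags   = O_RDONLY,
                    .resolve = RESOLVE_CACHED,  /* fail instead of blocking */
            };
            int fd = syscall(SYS_openat2, AT_FDCWD, "/etc/hostname",
                             &how, sizeof(how));

            if (fd < 0 && errno == EAGAIN) {
                    /* Lookup not fully cached: retry without the flag,
                     * from a context where blocking is acceptable. */
                    how.resolve = 0;
                    fd = syscall(SYS_openat2, AT_FDCWD, "/etc/hostname",
                                 &how, sizeof(how));
            }
            if (fd < 0) {
                    perror("openat2");
                    return 1;
            }
            close(fd);
            return 0;
    }

On a dcache miss the first call fails fast with EAGAIN and the caller decides where the blocking retry runs, which mirrors how io_uring uses the underlying LOOKUP_CACHED support internally.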
-
By Jens Axboe
io_uring always punts opens to async context, since there's no control over whether the lookup blocks or not. Add LOOKUP_CACHED to support doing just the fast RCU-based lookups, which we know will not block. If we can do a cached path resolution of the filename, then we don't have to always punt lookups to a worker.

During path resolution, we always do LOOKUP_RCU first. If that fails and we terminate LOOKUP_RCU, then fail a LOOKUP_CACHED attempt as well.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-
By Al Viro
Same as for the previous commit: instead of 0/-ECHILD, make it return true/false, and rename it to try_to_unlazy_next().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-
By Jens Axboe
Most callers check for a non-zero return and assume it's -ECHILD (which it always will be). One caller uses the actual error return. Clean this up and make it fully consistent by having unlazy_walk() return a bool instead. Rename it to try_to_unlazy() and return true on success and false on failure. That's easier to read. No functional changes in this patch.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
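The resulting calling convention, sketched against a hypothetical caller (kernel-internal fragment, shown for illustration only):

    /* before: 0 on success, -ECHILD on failure, "if (err)" style */
    if (unlazy_walk(nd))
            return -ECHILD;

    /* after: bool, true on success, so the failure path reads naturally */
    if (!try_to_unlazy(nd))
            return -ECHILD;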
-
- 4 January 2021, 2 commits
-
-
By Steven Rostedt (VMware)
Running my yearly branch profiling code, it detected a 100% wrong branch condition in fs/namei.c, in lookup_fast(). The code in question has:

    status = d_revalidate(dentry, nd->flags);
    if (likely(status > 0))
            return dentry;
    if (unlazy_child(nd, dentry, seq))
            return ERR_PTR(-ECHILD);
    if (unlikely(status == -ECHILD))
            /* we'd been told to redo it in non-rcu mode */
            status = d_revalidate(dentry, nd->flags);

If the status of the d_revalidate() is greater than zero, the function finishes. Otherwise, if unlazy_child() fails, it returns -ECHILD. After those two checks, the status is compared to -ECHILD, as that is what is returned if the original d_revalidate() needed to be redone in non-RCU mode. Notably, this path is only reached under the condition:

    if (nd->flags & LOOKUP_RCU) {

and most d_revalidate() implementations have:

    if (flags & LOOKUP_RCU)
            return -ECHILD;

It appears that this is the only case in which the if statement triggers on two of my machines running in production. Since the outcome depends on which filesystem mix is configured in the running kernel, simply remove the unlikely() from the if statement.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-
By Al Viro
Use vfs_open() instead.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-
- 28 December 2020, 1 commit
-
-
By Linus Torvalds
Since commit 36e2c742 ("fs: don't allow splice read/write without explicit ops") we've required that file operation structures explicitly enable splice support, rather than falling back to the default handlers.

Most /proc files use the indirect 'struct proc_ops' to describe their file operations, and were fixed up to support splice earlier in commits 40be821d..b24c30c6, but the mountinfo files interact with the VFS directly using their own 'struct file_operations' and got missed as a result.

This adds the necessary support for splice to work for /proc/*/mountinfo and friends.

Reported-by: Joan Bruguera Micó <joanbrugueram@gmail.com>
Reported-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=209971
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 23 December 2020, 5 commits
-
-
By Xiaoguang Wang
io_iopoll_complete() does not hold completion_lock to complete polled io, so in io_wq_submit_work() we cannot call io_req_complete() directly to complete polled io; otherwise there may be concurrent access to the cqring, defer_list, etc., which is not safe. Commit dad1b124 ("io_uring: always let io_iopoll_complete() complete polled io") fixed this issue, but Pavel reported that IOPOLL, apart from rw, can do buf reg/unreg requests (IORING_OP_PROVIDE_BUFFERS or IORING_OP_REMOVE_BUFFERS), so the fix is not good. Given that io_iopoll_complete() is always called under uring_lock, here for polled io we can also take uring_lock to fix this issue.

Fixes: dad1b124 ("io_uring: always let io_iopoll_complete() complete polled io")
Cc: <stable@vger.kernel.org> # 5.5+
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
[axboe: don't deref 'req' after completing it]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Pavel Begunkov
Once we have created a file for the current context during setup, we should not call io_ring_ctx_wait_and_kill() directly, as it will be done by fput(file).

Cc: stable@vger.kernel.org # 5.10
Reported-by: syzbot+c9937dfb2303a5f18640@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
[axboe: fix unused 'ret' for !CONFIG_UNIX]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Lei Chen
ext4_bio_write_page() does not need the wbc parameter, since its io parameter contains the io_wbc field. io->io_wbc is initialized by ext4_io_submit_init(), which is called in the ext4_writepages() and ext4_writepage() functions prior to ext4_bio_write_page(). Therefore, when ext4_bio_write_page() is called, the wbc info is already included in the io parameter.

Signed-off-by: Lei Chen <lennychen@tencent.com>
Link: https://lore.kernel.org/r/1607669664-25656-1-git-send-email-lennychen@tencent.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-
By Chunguang Xu
Commit cfd73237 ("ext4: add prefetching for block allocation bitmaps") introduced block bitmap prefetch, and expects to read the block bitmaps of a flex_bg in a single IO. However, it seems to ignore the value range of s_log_groups_per_flex. In the scenario where the value of s_log_groups_per_flex is greater than 27, s_mb_prefetch or s_mb_prefetch_limit will overflow, causing a divide-by-zero exception. In addition, the logic for calculating nr is also flawed: the size of a flexbg is fixed during a single mount, but s_mb_prefetch can be modified, which allows nr to fall outside the required range of [1, flexbg_size].

To solve this problem, we need to set an upper limit on s_mb_prefetch. Since we expect to load the block bitmaps of a flex_bg in a single IO, we can determine a reasonable upper limit from the IO limit parameters. After consideration, we chose BLK_MAX_SEGMENT_SIZE. This is a good choice for solving the divide-by-zero problem while avoiding performance degradation.

[ Some minor code simplifications to make the changes easy to follow -- TYT ]

Reported-by: Tosk Robot <tencent_os_robot@tencent.com>
Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Reviewed-by: Samuel Liao <samuelliao@tencent.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/1607051143-24508-1-git-send-email-brookxu@tencent.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-
By Jan Kara
When filesystem inconsistency is detected with the group locked, we currently try to modify the superblock to store the error there without blocking. However, this can cause superblock checksum failures (or DIF/DIX failures) when the superblock is just being written out.

Make the error handling code just store the error information in the ext4_sb_info structure and copy it to the on-disk superblock only in ext4_commit_super(). In case of an error happening with the group locked, we just postpone the superblock flushing to a workqueue.

[ Added fixup so that s_first_error_* does not get updated after the file system is remounted. Also added fix for syzbot failure. - Ted ]

Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20201127113405.26867-8-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: Hillf Danton <hdanton@sina.com>
Reported-by: syzbot+9043030c040ce1849a60@syzkaller.appspotmail.com
-
- 22 December 2020, 5 commits
-
-
By Christoph Hellwig
Update copyrights for files that have gotten some major rewrites lately.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Christoph Hellwig
There is no point in duplicating the file name in the comment at the top of the file.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Artem Labazov
The table for Unicode upcase conversion requires an order-5 allocation, which may fail on a highly-fragmented system:

    pool-udisksd: page allocation failure: order:5, mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null), cpuset=/,mems_allowed=0
    CPU: 4 PID: 3756880 Comm: pool-udisksd Tainted: G U 5.8.10-200.fc32.x86_64 #1
    Hardware name: Dell Inc. XPS 13 9360/0PVG6D, BIOS 2.13.0 11/14/2019
    Call Trace:
     dump_stack+0x6b/0x88
     warn_alloc.cold+0x75/0xd9
     ? _cond_resched+0x16/0x40
     ? __alloc_pages_direct_compact+0x144/0x150
     __alloc_pages_slowpath.constprop.0+0xcfa/0xd30
     ? __schedule+0x28a/0x840
     ? __wait_on_bit_lock+0x92/0xa0
     __alloc_pages_nodemask+0x2df/0x320
     kmalloc_order+0x1b/0x80
     kmalloc_order_trace+0x1d/0xa0
     exfat_create_upcase_table+0x115/0x390 [exfat]
     exfat_fill_super+0x3ef/0x7f0 [exfat]
     ? sget_fc+0x1d0/0x240
     ? exfat_init_fs_context+0x120/0x120 [exfat]
     get_tree_bdev+0x15c/0x250
     vfs_get_tree+0x25/0xb0
     do_mount+0x7c3/0xaf0
     ? copy_mount_options+0xab/0x180
     __x64_sys_mount+0x8e/0xd0
     do_syscall_64+0x4d/0x90
     entry_SYSCALL_64_after_hwframe+0x44/0xa9

Make the driver use kvcalloc() to eliminate the issue.

Fixes: 370e812b ("exfat: add nls operations")
Cc: stable@vger.kernel.org # v5.7+
Signed-off-by: Artem Labazov <123321artyom@gmail.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
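The shape of the fix, sketched with illustrative identifiers (UTBL_COUNT and utbl are placeholders, not the exact exfat patch): kvcalloc() transparently falls back to vmalloc() when physically contiguous pages are scarce, so the order-5 allocation no longer has to fail on a fragmented system.

    /* before: needs 32 physically contiguous pages (order-5) */
    utbl = kmalloc_array(UTBL_COUNT, sizeof(unsigned short), GFP_KERNEL);

    /* after: falls back to vmalloc() under memory fragmentation */
    utbl = kvcalloc(UTBL_COUNT, sizeof(unsigned short), GFP_KERNEL);
    if (!utbl)
            return -ENOMEM;

    /* ...and every matching kfree() must become kvfree() */
    kvfree(utbl);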
-
By Al Viro
This is one of the cases where we need to use d_real(): we are using more than the name of the dentry here. ->d_sb is used as well, so in case of hostfs being used as a layer we get the wrong superblock.

Reported-by: Johannes Berg <johannes@sipsolutions.net>
Tested-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-
By Pavel Begunkov
xa_store() may fail, check the result.

Cc: stable@vger.kernel.org # 5.10
Fixes: 0f212204 ("io_uring: don't rely on weak ->files references")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 21 December 2020, 4 commits
-
-
By Pavel Begunkov
Get rid of TASK_UNINTERRUPTIBLE and waiting with finish_wait before going for the next iteration in __io_uring_task_cancel(), because __io_uring_files_cancel() doesn't expect scheduling to be disallowed.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Pavel Begunkov
It might happen that __io_uring_cancel_task_requests() cancels nothing, yet there are task_works pending. We need to always run them.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Jens Axboe
io_uring no longer issues full cancelations on the io-wq, so remove any remnants of this code and the IO_WQ_BIT_CANCEL flag.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Jens Axboe
Before IORING_SETUP_ATTACH_WQ, we could just cancel everything on the io-wq when exiting. But that's not the case if they are shared, so cancel for the specific ctx instead.

Cc: stable@vger.kernel.org
Fixes: 24369c2e ("io_uring: add io-wq workqueue sharing")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 20 December 2020, 10 commits
-
-
By Willem de Bruijn
Add syscall epoll_pwait2, an epoll_wait variant with nsec resolution that replaces int timeout with struct timespec. It is equivalent otherwise.

    int epoll_pwait2(int fd, struct epoll_event *events,
                     int maxevents,
                     const struct timespec *timeout,
                     const sigset_t *sigset);

The underlying hrtimer is already programmed with nsec resolution. pselect and ppoll also set nsec resolution timeouts with timespec.

The sigset_t in epoll_pwait has a compat variant. epoll_pwait2 needs the same. For timespec, only support this new interface on 2038-aware platforms that define __kernel_timespec_t. So no CONFIG_COMPAT_32BIT_TIME.

Link: https://lkml.kernel.org/r/20201121144401.3727659-3-willemdebruijn.kernel@gmail.com
Signed-off-by: Willem de Bruijn <willemb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
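A hedged usage sketch, calling the raw syscall in case libc has no wrapper yet. It assumes headers new enough to define SYS_epoll_pwait2 and a 64-bit platform where the libc struct timespec matches the kernel's __kernel_timespec; the raw ABI takes a trailing sigsetsize argument, ignored here because the mask is NULL:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/epoll.h>
    #include <sys/syscall.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
            int epfd = epoll_create1(0);
            struct epoll_event events[8];
            /* 1.5 ms: finer-grained than epoll_wait()'s int milliseconds */
            struct timespec timeout = { .tv_sec = 0, .tv_nsec = 1500000 };
            int n;

            if (epfd < 0) {
                    perror("epoll_create1");
                    return 1;
            }
            n = syscall(SYS_epoll_pwait2, epfd, events, 8, &timeout,
                        NULL /* sigmask */, 0 /* sigsetsize */);
            if (n < 0)
                    perror("epoll_pwait2");
            else
                    printf("%d events\n", n);
            close(epfd);
            return 0;
    }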
-
By Willem de Bruijn
Patch series "add epoll_pwait2 syscall", v4.

Enable nanosecond timeouts for epoll. Analogous to pselect and ppoll, introduce an epoll_wait syscall variant that takes a struct timespec instead of an int timeout.

This patch (of 4): Make epoll more consistent with select/poll: pass along the timeout as a timespec64 pointer. In anticipation of additional changes affecting all three polling mechanisms:

- add the epoll_pwait2 syscall with timespec semantics, and share the poll_select_set_timeout implementation.
- compute slack before conversion to absolute time, to save one ktime_get_ts64 call.

Link: https://lkml.kernel.org/r/20201121144401.3727659-1-willemdebruijn.kernel@gmail.com
Link: https://lkml.kernel.org/r/20201121144401.3727659-2-willemdebruijn.kernel@gmail.com
Signed-off-by: Willem de Bruijn <willemb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By Soheil Hassas Yeganeh
We call ep_events_available() under lock when timeout is 0, and then call it without locks in the loop for the other cases.

Instead, call ep_events_available() without lock for all cases. For non-zero timeouts, we will recheck after adding the thread to the wait queue. For zero-timeout cases, by definition, the user is opportunistically polling and will have to call epoll_wait again in the future.

Note that this lock was kept in c5a282e9 because the whole loop was historically under lock.

This patch results in a 1% CPU/RPC reduction in RPC benchmarks.

Link: https://lkml.kernel.org/r/20201106231635.3528496-9-soheil.kdev@gmail.com
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Khazhismel Kumykov <khazhy@google.com>
Cc: Guantao Liu <guantaol@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By Soheil Hassas Yeganeh
The existing loop is pointless, and the labels make it really hard to follow the structure. Replace that control structure with a simple loop that returns when there are new events, there is a signal, or the thread has timed out.

Link: https://lkml.kernel.org/r/20201106231635.3528496-8-soheil.kdev@gmail.com
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Khazhismel Kumykov <khazhy@google.com>
Cc: Guantao Liu <guantaol@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By Soheil Hassas Yeganeh
This is a no-op change which simplifies the follow-up patches.

Link: https://lkml.kernel.org/r/20201106231635.3528496-7-soheil.kdev@gmail.com
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Khazhismel Kumykov <khazhy@google.com>
Cc: Guantao Liu <guantaol@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By Soheil Hassas Yeganeh
ep_events_available() is called multiple times around the busy-loop logic, even though the logic is generally not used. ep_reset_busy_poll_napi_id() is similarly always called, even when the busy loop is not used.

Eliminate ep_reset_busy_poll_napi_id() and inline it inside ep_busy_loop(). Make ep_busy_loop() return whether there are any events available after the busy loop. This will eliminate unnecessary loads and branches, and simplifies the loop.

Link: https://lkml.kernel.org/r/20201106231635.3528496-6-soheil.kdev@gmail.com
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Khazhismel Kumykov <khazhy@google.com>
Cc: Guantao Liu <guantaol@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By Soheil Hassas Yeganeh
This is a no-op change, made simply to keep the code more coherent.

Link: https://lkml.kernel.org/r/20201106231635.3528496-5-soheil.kdev@gmail.com
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Khazhismel Kumykov <khazhy@google.com>
Cc: Guantao Liu <guantaol@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By Soheil Hassas Yeganeh
To simplify the code, pull the check for fatal signals into ep_send_events(). ep_send_events() is called only from ep_poll().

Note that, previously, we were always checking for fatal signals, whereas now the check happens only if eavail is true. This should be fine because the goal of that check is to quickly return from epoll_wait() when there is a pending fatal signal.

Link: https://lkml.kernel.org/r/20201106231635.3528496-4-soheil.kdev@gmail.com
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Suggested-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Khazhismel Kumykov <khazhy@google.com>
Cc: Guantao Liu <guantaol@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By Soheil Hassas Yeganeh
Check for signals before locking ep->lock, and immediately return -EINTR if there is any signal pending. This saves a few loads, stores, and branches from the hot path and simplifies the loop structure for follow-up patches.

Link: https://lkml.kernel.org/r/20201106231635.3528496-3-soheil.kdev@gmail.com
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Khazhismel Kumykov <khazhy@google.com>
Cc: Guantao Liu <guantaol@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By Soheil Hassas Yeganeh
Patch series "simplify ep_poll".

This patch series is a followup based on the suggestions and feedback by Linus: https://lkml.kernel.org/r/CAHk-=wizk=OxUyQPbO8MS41w2Pag1kniUV5WdD5qWL-gq1kjDA@mail.gmail.com

The first patch in the series is a fix for the epoll race in the presence of timeouts, so that it can be cleanly backported to all affected stable kernels. The rest of the patch series simplifies the ep_poll() implementation. Some of these simplifications result in minor performance enhancements as well. We have kept these changes under self-tests and internal benchmarks for a few days, and there are minor (1-2%) performance enhancements as a result.

This patch (of 8): After abc610e0 ("fs/epoll: avoid barrier after an epoll_wait(2) timeout"), we break out of the ep_poll loop upon timeout without checking whether there are any new events available. Prior to that patch series we always called ep_events_available() after exiting the loop. This can cause races and missed wakeups. For example, consider the following scenario reported by Guantao Liu.

Suppose we have an eventfd added using EPOLLET to an epollfd.

- Thread 1: Sleeps for just below 5 ms and then writes to the eventfd.
- Thread 2: Calls epoll_wait with a timeout of 5 ms. If it sees an event of the eventfd, it will write back on that fd.
- Thread 3: Calls epoll_wait with a negative timeout.

Prior to abc610e0, it is guaranteed that Thread 3 will wake up, either by Thread 1 or Thread 2. After abc610e0, Thread 3 can be blocked indefinitely if Thread 2 sees a timeout right before the write to the eventfd by Thread 1: Thread 2 will be woken up from schedule_hrtimeout_range and, with eavail 0, it will not call ep_send_events().

To fix this issue:
1) Simplify the timed_out case as suggested by Linus.
2) While holding the lock, recheck whether the thread was woken up after its timeout was reached.

Note that (2) is different from Linus' original suggestion: it does not set "eavail = ep_events_available(ep)" to avoid unnecessary contention (when there are too many timed-out threads and a small number of events), as well as races mentioned in the discussion thread.

This is the first patch in the series so that the backport to stable releases is straightforward.

Link: https://lkml.kernel.org/r/20201106231635.3528496-1-soheil.kdev@gmail.com
Link: https://lkml.kernel.org/r/CAHk-=wizk=OxUyQPbO8MS41w2Pag1kniUV5WdD5qWL-gq1kjDA@mail.gmail.com
Link: https://lkml.kernel.org/r/20201106231635.3528496-2-soheil.kdev@gmail.com
Fixes: abc610e0 ("fs/epoll: avoid barrier after an epoll_wait(2) timeout")
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Tested-by: Guantao Liu <guantaol@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Guantao Liu <guantaol@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Khazhismel Kumykov <khazhy@google.com>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 19 December 2020, 4 commits
-
-
By Christian Brauner
After introducing CLOSE_RANGE_CLOEXEC, syzbot reported a crash when CLOSE_RANGE_CLOEXEC is specified in conjunction with CLOSE_RANGE_UNSHARE. When CLOSE_RANGE_UNSHARE is specified, the caller will receive a private file descriptor table in case their file descriptor table is currently shared.

For the case where the caller has requested all file descriptors to be actually closed via e.g. close_range(3, ~0U, 0), the kernel knows that the caller does not need any of the file descriptors anymore and will optimize the close operation by only copying all files in the range from 0 to 3 and no others. However, if the caller requested CLOSE_RANGE_CLOEXEC together with CLOSE_RANGE_UNSHARE, the caller wants to still make use of the file descriptors, so the kernel needs to copy all of them and can't optimize.

The original patch didn't account for this and thus could cause oopses, as evidenced by the syzbot report, because it assumed that all fds had been copied. Fix this by handling the CLOSE_RANGE_CLOEXEC case.

syzbot reported:

    ==================================================================
    BUG: KASAN: null-ptr-deref in instrument_atomic_read include/linux/instrumented.h:71 [inline]
    BUG: KASAN: null-ptr-deref in atomic64_read include/asm-generic/atomic-instrumented.h:837 [inline]
    BUG: KASAN: null-ptr-deref in atomic_long_read include/asm-generic/atomic-long.h:29 [inline]
    BUG: KASAN: null-ptr-deref in filp_close+0x22/0x170 fs/open.c:1274
    Read of size 8 at addr 0000000000000077 by task syz-executor511/8522

    CPU: 1 PID: 8522 Comm: syz-executor511 Not tainted 5.10.0-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
     __dump_stack lib/dump_stack.c:79 [inline]
     dump_stack+0x107/0x163 lib/dump_stack.c:120
     __kasan_report mm/kasan/report.c:549 [inline]
     kasan_report.cold+0x5/0x37 mm/kasan/report.c:562
     check_memory_region_inline mm/kasan/generic.c:186 [inline]
     check_memory_region+0x13d/0x180 mm/kasan/generic.c:192
     instrument_atomic_read include/linux/instrumented.h:71 [inline]
     atomic64_read include/asm-generic/atomic-instrumented.h:837 [inline]
     atomic_long_read include/asm-generic/atomic-long.h:29 [inline]
     filp_close+0x22/0x170 fs/open.c:1274
     close_files fs/file.c:402 [inline]
     put_files_struct fs/file.c:417 [inline]
     put_files_struct+0x1cc/0x350 fs/file.c:414
     exit_files+0x12a/0x170 fs/file.c:435
     do_exit+0xb4f/0x2a00 kernel/exit.c:818
     do_group_exit+0x125/0x310 kernel/exit.c:920
     get_signal+0x428/0x2100 kernel/signal.c:2792
     arch_do_signal_or_restart+0x2a8/0x1eb0 arch/x86/kernel/signal.c:811
     handle_signal_work kernel/entry/common.c:147 [inline]
     exit_to_user_mode_loop kernel/entry/common.c:171 [inline]
     exit_to_user_mode_prepare+0x124/0x200 kernel/entry/common.c:201
     __syscall_exit_to_user_mode_work kernel/entry/common.c:291 [inline]
     syscall_exit_to_user_mode+0x19/0x50 kernel/entry/common.c:302
     entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x447039
    Code: Unable to access opcode bytes at RIP 0x44700f.
    RSP: 002b:00007f1b1225cdb8 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
    RAX: 0000000000000001 RBX: 00000000006dbc28 RCX: 0000000000447039
    RDX: 00000000000f4240 RSI: 0000000000000081 RDI: 00000000006dbc2c
    RBP: 00000000006dbc20 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000246 R12: 00000000006dbc2c
    R13: 00007fff223b6bef R14: 00007f1b1225d9c0 R15: 00000000006dbc2c
    ==================================================================

syzbot has tested the proposed patch and the reproducer did not trigger any issue.

Tested on:

    commit: 10f7cddd selftests/core: add regression test for CLOSE_RAN..
    git tree: git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git vfs
    kernel config: https://syzkaller.appspot.com/x/.config?x=5d42216b510180e3
    dashboard link: https://syzkaller.appspot.com/bug?extid=96cfd2b22b3213646a93
    compiler: gcc (GCC) 10.1.0-syz 20200507

Reported-and-tested-by: syzbot+96cfd2b22b3213646a93@syzkaller.appspotmail.com
Fixes: 582f1fb6 ("fs, close_range: add flag CLOSE_RANGE_CLOEXEC")
Cc: Giuseppe Scrivano <gscrivan@redhat.com>
Cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20201217213303.722643-1-christian.brauner@ubuntu.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
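For reference, a hedged userspace sketch of the flag combination at issue, assuming headers new enough to define SYS_close_range and CLOSE_RANGE_CLOEXEC:

    #define _GNU_SOURCE
    #include <linux/close_range.h>  /* CLOSE_RANGE_UNSHARE, CLOSE_RANGE_CLOEXEC */
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
            /* Mark every fd >= 3 close-on-exec in a private (unshared) fd
             * table. The fds remain usable afterwards, so the kernel must
             * copy all of them rather than take the "close everything from
             * fd 3 on" shortcut described above. */
            if (syscall(SYS_close_range, 3U, ~0U,
                        CLOSE_RANGE_UNSHARE | CLOSE_RANGE_CLOEXEC) < 0) {
                    perror("close_range");
                    return 1;
            }
            return 0;
    }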
-
By Pavel Begunkov
Doing a vectored buf-select read with 0 iovecs passed is meaningless and utterly broken; forbid it.

Cc: <stable@vger.kernel.org> # 5.7+
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Boris Protopopov
Fix passing of the additional security info via version operations. Force a new open when getting the SACL, and avoid reuse of files that were previously opened without sufficient privileges to access SACLs.

Signed-off-by: Boris Protopopov <pboris@amazon.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
-
By Boris Protopopov
Add the SYSTEM_SECURITY access flag and use it with smb2 when opening files for getting/setting SACLs. Add the "system.cifs_ntsd_full" extended attribute to allow user-space access to the functionality. Avoid multiple server calls when setting owner, DACL, and SACL.

Signed-off-by: Boris Protopopov <pboris@amazon.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
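A hedged sketch of fetching the new attribute from user space with the standard xattr API; the buffer size is arbitrary, and the returned blob is a packed NT security descriptor including the SACL:

    #include <stdio.h>
    #include <sys/xattr.h>

    int main(int argc, char **argv)
    {
            char buf[4096];  /* arbitrary; real descriptors vary in size */
            ssize_t len;

            if (argc < 2) {
                    fprintf(stderr, "usage: %s <file-on-cifs-mount>\n", argv[0]);
                    return 1;
            }
            len = getxattr(argv[1], "system.cifs_ntsd_full", buf, sizeof(buf));
            if (len < 0) {
                    perror("getxattr");
                    return 1;
            }
            printf("read %zd bytes of NT security descriptor (incl. SACL)\n", len);
            return 0;
    }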
-
- 18 December 2020, 4 commits
-
-
By Pavel Begunkov
The purpose of io_uring_cancel_files() is to wait for all requests matching ->files to go/be cancelled. We should first drop the files of a request in io_req_drop_files() and only then make it undiscoverable for io_uring_cancel_files(): first drop, then delete from the list. It's ok to leave req->id->files dangling, because it's not dereferenced by the cancellation code, only compared against. The task would potentially go to sleep and be awoken by the wake_up() that follows in io_req_drop_files().

Fixes: 0f212204 ("io_uring: don't rely on weak ->files references")
Cc: <stable@vger.kernel.org> # 5.5+
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Xiaoguang Wang
The first time a req is punted to io-wq, we initialize io_wq_work's list to be NULL and then insert the req into io_wqe->work_list. If this req is not inserted at the tail of io_wqe->work_list, this req's io_wq_work list will point to another req's io_wq_work. In the split-bio case, this req may be inserted into io_wqe->work_list repeatedly; once we insert it at the tail of io_wqe->work_list for the second time, io_wq_work->list->next will be an invalid pointer, which then results in many strange errors: panic, kernel soft-lockup, rcu stall, etc.

In my vm, the kernel does not have commit cc29e1bf ("block: disable iopoll for split bio"), and the below fio job can reproduce this bug steadily:

    [global]
    name=iouring-sqpoll-iopoll-1
    ioengine=io_uring
    iodepth=128
    numjobs=1
    thread
    rw=randread
    direct=1
    registerfiles=1
    hipri=1
    bs=4m
    size=100M
    runtime=120
    time_based
    group_reporting
    randrepeat=0

    [device]
    directory=/home/feiman.wxg/mntpoint/  # an ext4 mount point

If we have commit cc29e1bf ("block: disable iopoll for split bio"), there will be no split-bio case for polled io, but I think we still need to fix this list corruption; it should probably also go to stable branches.

To fix this corruption, if a req is inserted at the tail of io_wqe->work_list, initialize req->io_wq_work->list->next to be NULL.

Cc: stable@vger.kernel.org
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Samuel Cabrero
The patch 7d6535b7 ("cifs: Simplify reconnect code when dfs upcall is enabled") leads to the following static checker warning:

    fs/cifs/connect.c:160 reconn_set_next_dfs_target()
    error: 'server->hostname' dereferencing possible ERR_PTR()

Avoid dereferencing the error pointer by returning early on the error condition.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Samuel Cabrero <scabrero@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
-
By Dan Carpenter
This code is slightly nicer if we flip the cifs_sockaddr_equal() around and pull all the code in one tab.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Samuel Cabrero <scabrero@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
-