- 21 February 2021, 1 commit
-
-
By Al Viro
After switching to non-RCU mode, we want nd->depth to match the number of entries in nd->stack[] that need eventual path_put(). legitimize_links() takes care of that on failures; unfortunately, the failure exits added for LOOKUP_CACHED do not. We could add that logic to those failure exits, both in try_to_unlazy() and in try_to_unlazy_next(), but since both checks are immediately followed by legitimize_links(), and there are no calls of legitimize_links() other than those two, it's easier to move the check (and the required handling of nd->depth on failure) into legitimize_links() itself.

[caught by Jens: ... and since we are zeroing ->depth here, we need to do drop_links() first]

Fixes: 6c6ec2b0 ("fs: add support for LOOKUP_CACHED")
Tested-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-
- 5 January 2021, 4 commits
-
-
By Jens Axboe
Now that we support non-blocking path resolution internally, expose it via openat2() in the struct open_how ->resolve flags. This allows applications using openat2() to limit path resolution to the extent that it is already cached. If the lookup cannot be satisfied in a non-blocking manner, openat2(2) will return -1/-EAGAIN.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
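For illustration, a minimal userspace sketch of this interface, assuming the resolve flag landed as RESOLVE_CACHED in linux/openat2.h and calling the raw syscall, since libc wrappers for openat2() are not universal; the path is arbitrary:

    #define _GNU_SOURCE
    #include <errno.h>
    #include <fcntl.h>            /* O_RDONLY, AT_FDCWD */
    #include <linux/openat2.h>    /* struct open_how, RESOLVE_CACHED */
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
            struct open_how how = {
                    .flags   = O_RDONLY,
                    .resolve = RESOLVE_CACHED,  /* fail instead of blocking */
            };
            int fd = syscall(SYS_openat2, AT_FDCWD, "/etc/hostname",
                             &how, sizeof(how));

            if (fd < 0 && errno == EAGAIN) {
                    /* Lookup not fully cached: retry without the flag,
                     * from a context where blocking is acceptable. */
                    how.resolve = 0;
                    fd = syscall(SYS_openat2, AT_FDCWD, "/etc/hostname",
                                 &how, sizeof(how));
            }
            if (fd < 0) {
                    perror("openat2");
                    return 1;
            }
            close(fd);
            return 0;
    }

On a dcache miss the first call fails fast with EAGAIN and the caller decides where the blocking retry runs, which mirrors how io_uring uses the underlying LOOKUP_CACHED support internally.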
-
By Jens Axboe
io_uring always punts opens to async context, since there's no control over whether the lookup blocks or not. Add LOOKUP_CACHED to support doing just the fast RCU-based lookups, which we know will not block. If we can do a cached path resolution of the filename, then we don't have to always punt lookups to a worker.

During path resolution, we always do LOOKUP_RCU first. If that fails and we terminate LOOKUP_RCU, then fail a LOOKUP_CACHED attempt as well.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-
By Al Viro
Same as for the previous commit: instead of 0/-ECHILD, make it return true/false, and rename it to try_to_unlazy_next().

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-
By Jens Axboe
Most callers check for a non-zero return and assume it's -ECHILD (which it always will be). One caller uses the actual error return. Clean this up and make it fully consistent by having unlazy_walk() return a bool instead. Rename it to try_to_unlazy() and return true on success and false on failure. That's easier to read. No functional changes in this patch.

Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
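The resulting calling convention, sketched against a hypothetical caller (kernel-internal fragment, shown for illustration only):

    /* before: 0 on success, -ECHILD on failure, "if (err)" style */
    if (unlazy_walk(nd))
            return -ECHILD;

    /* after: bool, true on success, so the failure path reads naturally */
    if (!try_to_unlazy(nd))
            return -ECHILD;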
-
- 4 January 2021, 2 commits
-
-
By Steven Rostedt (VMware)
Running my yearly branch profiling code, it detected a 100% wrong branch condition in fs/namei.c, in lookup_fast(). The code in question has:

    status = d_revalidate(dentry, nd->flags);
    if (likely(status > 0))
            return dentry;
    if (unlazy_child(nd, dentry, seq))
            return ERR_PTR(-ECHILD);
    if (unlikely(status == -ECHILD))
            /* we'd been told to redo it in non-rcu mode */
            status = d_revalidate(dentry, nd->flags);

If the status of the d_revalidate() is greater than zero, the function finishes. Otherwise, if unlazy_child() fails, it returns -ECHILD. After those two checks, the status is compared to -ECHILD, as that is what is returned if the original d_revalidate() needed to be redone in non-RCU mode. Notably, this path is only reached under the condition:

    if (nd->flags & LOOKUP_RCU) {

and most d_revalidate() implementations have:

    if (flags & LOOKUP_RCU)
            return -ECHILD;

It appears that this is the only case in which the if statement triggers on two of my machines running in production. Since the outcome depends on which filesystem mix is configured in the running kernel, simply remove the unlikely() from the if statement.

Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-
By Al Viro
Use vfs_open() instead.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-
- 28 December 2020, 1 commit
-
-
By Linus Torvalds
Since commit 36e2c742 ("fs: don't allow splice read/write without explicit ops") we've required that file operation structures explicitly enable splice support, rather than falling back to the default handlers.

Most /proc files use the indirect 'struct proc_ops' to describe their file operations, and were fixed up to support splice earlier in commits 40be821d..b24c30c6, but the mountinfo files interact with the VFS directly using their own 'struct file_operations' and got missed as a result.

This adds the necessary support for splice to work for /proc/*/mountinfo and friends.

Reported-by: Joan Bruguera Micó <joanbrugueram@gmail.com>
Reported-by: Jussi Kivilinna <jussi.kivilinna@iki.fi>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=209971
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Christoph Hellwig <hch@lst.de>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 23 December 2020, 5 commits
-
-
By Xiaoguang Wang
io_iopoll_complete() does not hold completion_lock to complete polled io, so in io_wq_submit_work() we cannot call io_req_complete() directly to complete polled io; otherwise there may be concurrent access to the cqring, defer_list, etc., which is not safe. Commit dad1b124 ("io_uring: always let io_iopoll_complete() complete polled io") fixed this issue, but Pavel reported that IOPOLL, apart from rw, can do buf reg/unreg requests (IORING_OP_PROVIDE_BUFFERS or IORING_OP_REMOVE_BUFFERS), so the fix is not good. Given that io_iopoll_complete() is always called under uring_lock, here for polled io we can also take uring_lock to fix this issue.

Fixes: dad1b124 ("io_uring: always let io_iopoll_complete() complete polled io")
Cc: <stable@vger.kernel.org> # 5.5+
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
[axboe: don't deref 'req' after completing it]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Pavel Begunkov
Once we have created a file for the current context during setup, we should not call io_ring_ctx_wait_and_kill() directly, as it will be done by fput(file).

Cc: stable@vger.kernel.org # 5.10
Reported-by: syzbot+c9937dfb2303a5f18640@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
[axboe: fix unused 'ret' for !CONFIG_UNIX]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Lei Chen
ext4_bio_write_page() does not need the wbc parameter, since its io parameter contains the io_wbc field. io->io_wbc is initialized by ext4_io_submit_init(), which is called in the ext4_writepages() and ext4_writepage() functions prior to ext4_bio_write_page(). Therefore, when ext4_bio_write_page() is called, the wbc info is already included in the io parameter.

Signed-off-by: Lei Chen <lennychen@tencent.com>
Link: https://lore.kernel.org/r/1607669664-25656-1-git-send-email-lennychen@tencent.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-
By Chunguang Xu
Commit cfd73237 ("ext4: add prefetching for block allocation bitmaps") introduced block bitmap prefetch, and expects to read the block bitmaps of a flex_bg in a single IO. However, it seems to ignore the value range of s_log_groups_per_flex. In the scenario where the value of s_log_groups_per_flex is greater than 27, s_mb_prefetch or s_mb_prefetch_limit will overflow, causing a divide-by-zero exception. In addition, the logic for calculating nr is also flawed: the size of a flexbg is fixed during a single mount, but s_mb_prefetch can be modified, which allows nr to fall outside the required range of [1, flexbg_size].

To solve this problem, we need to set an upper limit on s_mb_prefetch. Since we expect to load the block bitmaps of a flex_bg in a single IO, we can determine a reasonable upper limit from the IO limit parameters. After consideration, we chose BLK_MAX_SEGMENT_SIZE. This is a good choice for solving the divide-by-zero problem while avoiding performance degradation.

[ Some minor code simplifications to make the changes easy to follow -- TYT ]

Reported-by: Tosk Robot <tencent_os_robot@tencent.com>
Signed-off-by: Chunguang Xu <brookxu@tencent.com>
Reviewed-by: Samuel Liao <samuelliao@tencent.com>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Link: https://lore.kernel.org/r/1607051143-24508-1-git-send-email-brookxu@tencent.com
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
-
By Jan Kara
When filesystem inconsistency is detected with the group locked, we currently try to modify the superblock to store the error there without blocking. However, this can cause superblock checksum failures (or DIF/DIX failures) when the superblock is just being written out.

Make the error handling code just store the error information in the ext4_sb_info structure and copy it to the on-disk superblock only in ext4_commit_super(). In case of an error happening with the group locked, we just postpone the superblock flushing to a workqueue.

[ Added fixup so that s_first_error_* does not get updated after the file system is remounted. Also added fix for syzbot failure. - Ted ]

Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20201127113405.26867-8-jack@suse.cz
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: Hillf Danton <hdanton@sina.com>
Reported-by: syzbot+9043030c040ce1849a60@syzkaller.appspotmail.com
-
- 22 December 2020, 5 commits
-
-
By Christoph Hellwig
Update copyrights for files that have gotten some major rewrites lately.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Christoph Hellwig
There is no point in duplicating the file name in the comment at the top of the file.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Artem Labazov
The table for Unicode upcase conversion requires an order-5 allocation, which may fail on a highly-fragmented system:

    pool-udisksd: page allocation failure: order:5, mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null), cpuset=/,mems_allowed=0
    CPU: 4 PID: 3756880 Comm: pool-udisksd Tainted: G U 5.8.10-200.fc32.x86_64 #1
    Hardware name: Dell Inc. XPS 13 9360/0PVG6D, BIOS 2.13.0 11/14/2019
    Call Trace:
     dump_stack+0x6b/0x88
     warn_alloc.cold+0x75/0xd9
     ? _cond_resched+0x16/0x40
     ? __alloc_pages_direct_compact+0x144/0x150
     __alloc_pages_slowpath.constprop.0+0xcfa/0xd30
     ? __schedule+0x28a/0x840
     ? __wait_on_bit_lock+0x92/0xa0
     __alloc_pages_nodemask+0x2df/0x320
     kmalloc_order+0x1b/0x80
     kmalloc_order_trace+0x1d/0xa0
     exfat_create_upcase_table+0x115/0x390 [exfat]
     exfat_fill_super+0x3ef/0x7f0 [exfat]
     ? sget_fc+0x1d0/0x240
     ? exfat_init_fs_context+0x120/0x120 [exfat]
     get_tree_bdev+0x15c/0x250
     vfs_get_tree+0x25/0xb0
     do_mount+0x7c3/0xaf0
     ? copy_mount_options+0xab/0x180
     __x64_sys_mount+0x8e/0xd0
     do_syscall_64+0x4d/0x90
     entry_SYSCALL_64_after_hwframe+0x44/0xa9

Make the driver use kvcalloc() to eliminate the issue.

Fixes: 370e812b ("exfat: add nls operations")
Cc: stable@vger.kernel.org # v5.7+
Signed-off-by: Artem Labazov <123321artyom@gmail.com>
Signed-off-by: Namjae Jeon <namjae.jeon@samsung.com>
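The shape of the fix, sketched with illustrative identifiers (UTBL_COUNT and utbl are placeholders, not the exact exfat patch): kvcalloc() transparently falls back to vmalloc() when physically contiguous pages are scarce, so the order-5 allocation no longer has to fail on a fragmented system.

    /* before: needs 32 physically contiguous pages (order-5) */
    utbl = kmalloc_array(UTBL_COUNT, sizeof(unsigned short), GFP_KERNEL);

    /* after: falls back to vmalloc() under memory fragmentation */
    utbl = kvcalloc(UTBL_COUNT, sizeof(unsigned short), GFP_KERNEL);
    if (!utbl)
            return -ENOMEM;

    /* ...and every matching kfree() must become kvfree() */
    kvfree(utbl);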
-
By Al Viro
This is one of the cases where we need to use d_real(): we are using more than the name of the dentry here. ->d_sb is used as well, so in case of hostfs being used as a layer we get the wrong superblock.

Reported-by: Johannes Berg <johannes@sipsolutions.net>
Tested-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
-
By Pavel Begunkov
xa_store() may fail, check the result.

Cc: stable@vger.kernel.org # 5.10
Fixes: 0f212204 ("io_uring: don't rely on weak ->files references")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 21 December 2020, 4 commits
-
-
By Pavel Begunkov
Get rid of TASK_UNINTERRUPTIBLE and waiting with finish_wait before going for the next iteration in __io_uring_task_cancel(), because __io_uring_files_cancel() doesn't expect scheduling to be disallowed.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Pavel Begunkov
It might happen that __io_uring_cancel_task_requests() cancels nothing, yet there are task_works pending. We need to always run them.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Jens Axboe
io_uring no longer issues full cancelations on the io-wq, so remove any remnants of this code and the IO_WQ_BIT_CANCEL flag.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Jens Axboe
Before IORING_SETUP_ATTACH_WQ, we could just cancel everything on the io-wq when exiting. But that's not the case if they are shared, so cancel for the specific ctx instead.

Cc: stable@vger.kernel.org
Fixes: 24369c2e ("io_uring: add io-wq workqueue sharing")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 20 December 2020, 10 commits
-
-
By Willem de Bruijn
Add syscall epoll_pwait2, an epoll_wait variant with nsec resolution that replaces int timeout with struct timespec. It is equivalent otherwise.

    int epoll_pwait2(int fd, struct epoll_event *events,
                     int maxevents,
                     const struct timespec *timeout,
                     const sigset_t *sigset);

The underlying hrtimer is already programmed with nsec resolution. pselect and ppoll also set nsec resolution timeouts with timespec.

The sigset_t in epoll_pwait has a compat variant. epoll_pwait2 needs the same. For timespec, only support this new interface on 2038-aware platforms that define __kernel_timespec_t. So no CONFIG_COMPAT_32BIT_TIME.

Link: https://lkml.kernel.org/r/20201121144401.3727659-3-willemdebruijn.kernel@gmail.com
Signed-off-by: Willem de Bruijn <willemb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
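A hedged usage sketch, calling the raw syscall in case libc has no wrapper yet. It assumes headers new enough to define SYS_epoll_pwait2 and a 64-bit platform where the libc struct timespec matches the kernel's __kernel_timespec; the raw ABI takes a trailing sigsetsize argument, ignored here because the mask is NULL:

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/epoll.h>
    #include <sys/syscall.h>
    #include <time.h>
    #include <unistd.h>

    int main(void)
    {
            int epfd = epoll_create1(0);
            struct epoll_event events[8];
            /* 1.5 ms: finer-grained than epoll_wait()'s int milliseconds */
            struct timespec timeout = { .tv_sec = 0, .tv_nsec = 1500000 };
            int n;

            if (epfd < 0) {
                    perror("epoll_create1");
                    return 1;
            }
            n = syscall(SYS_epoll_pwait2, epfd, events, 8, &timeout,
                        NULL /* sigmask */, 0 /* sigsetsize */);
            if (n < 0)
                    perror("epoll_pwait2");
            else
                    printf("%d events\n", n);
            close(epfd);
            return 0;
    }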
-
By Willem de Bruijn
Patch series "add epoll_pwait2 syscall", v4.

Enable nanosecond timeouts for epoll. Analogous to pselect and ppoll, introduce an epoll_wait syscall variant that takes a struct timespec instead of an int timeout.

This patch (of 4): Make epoll more consistent with select/poll: pass along the timeout as a timespec64 pointer. In anticipation of additional changes affecting all three polling mechanisms:

- add the epoll_pwait2 syscall with timespec semantics, and share the poll_select_set_timeout implementation.
- compute slack before conversion to absolute time, to save one ktime_get_ts64 call.

Link: https://lkml.kernel.org/r/20201121144401.3727659-1-willemdebruijn.kernel@gmail.com
Link: https://lkml.kernel.org/r/20201121144401.3727659-2-willemdebruijn.kernel@gmail.com
Signed-off-by: Willem de Bruijn <willemb@google.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By Soheil Hassas Yeganeh
We call ep_events_available() under lock when timeout is 0, and then call it without locks in the loop for the other cases.

Instead, call ep_events_available() without lock for all cases. For non-zero timeouts, we will recheck after adding the thread to the wait queue. For zero-timeout cases, by definition, the user is opportunistically polling and will have to call epoll_wait again in the future.

Note that this lock was kept in c5a282e9 because the whole loop was historically under lock.

This patch results in a 1% CPU/RPC reduction in RPC benchmarks.

Link: https://lkml.kernel.org/r/20201106231635.3528496-9-soheil.kdev@gmail.com
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Suggested-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Khazhismel Kumykov <khazhy@google.com>
Cc: Guantao Liu <guantaol@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By Soheil Hassas Yeganeh
The existing loop is pointless, and the labels make it really hard to follow the structure. Replace that control structure with a simple loop that returns when there are new events, there is a signal, or the thread has timed out.

Link: https://lkml.kernel.org/r/20201106231635.3528496-8-soheil.kdev@gmail.com
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Khazhismel Kumykov <khazhy@google.com>
Cc: Guantao Liu <guantaol@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By Soheil Hassas Yeganeh
This is a no-op change which simplifies the follow-up patches.

Link: https://lkml.kernel.org/r/20201106231635.3528496-7-soheil.kdev@gmail.com
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Khazhismel Kumykov <khazhy@google.com>
Cc: Guantao Liu <guantaol@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By Soheil Hassas Yeganeh
ep_events_available() is called multiple times around the busy-loop logic, even though the logic is generally not used. ep_reset_busy_poll_napi_id() is similarly always called, even when the busy loop is not used.

Eliminate ep_reset_busy_poll_napi_id() and inline it inside ep_busy_loop(). Make ep_busy_loop() return whether there are any events available after the busy loop. This will eliminate unnecessary loads and branches, and simplifies the loop.

Link: https://lkml.kernel.org/r/20201106231635.3528496-6-soheil.kdev@gmail.com
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Khazhismel Kumykov <khazhy@google.com>
Cc: Guantao Liu <guantaol@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By Soheil Hassas Yeganeh
This is a no-op change, made simply to keep the code more coherent.

Link: https://lkml.kernel.org/r/20201106231635.3528496-5-soheil.kdev@gmail.com
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Khazhismel Kumykov <khazhy@google.com>
Cc: Guantao Liu <guantaol@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By Soheil Hassas Yeganeh
To simplify the code, pull the check for fatal signals into ep_send_events(). ep_send_events() is called only from ep_poll().

Note that, previously, we were always checking for fatal signals, whereas now the check happens only if eavail is true. This should be fine because the goal of that check is to quickly return from epoll_wait() when there is a pending fatal signal.

Link: https://lkml.kernel.org/r/20201106231635.3528496-4-soheil.kdev@gmail.com
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Suggested-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Khazhismel Kumykov <khazhy@google.com>
Cc: Guantao Liu <guantaol@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By Soheil Hassas Yeganeh
Check for signals before locking ep->lock, and immediately return -EINTR if there is any signal pending. This saves a few loads, stores, and branches from the hot path and simplifies the loop structure for follow-up patches.

Link: https://lkml.kernel.org/r/20201106231635.3528496-3-soheil.kdev@gmail.com
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Khazhismel Kumykov <khazhy@google.com>
Cc: Guantao Liu <guantaol@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
By Soheil Hassas Yeganeh
Patch series "simplify ep_poll".

This patch series is a followup based on the suggestions and feedback by Linus: https://lkml.kernel.org/r/CAHk-=wizk=OxUyQPbO8MS41w2Pag1kniUV5WdD5qWL-gq1kjDA@mail.gmail.com

The first patch in the series is a fix for the epoll race in the presence of timeouts, so that it can be cleanly backported to all affected stable kernels. The rest of the patch series simplifies the ep_poll() implementation. Some of these simplifications result in minor performance enhancements as well. We have kept these changes under self-tests and internal benchmarks for a few days, and there are minor (1-2%) performance enhancements as a result.

This patch (of 8): After abc610e0 ("fs/epoll: avoid barrier after an epoll_wait(2) timeout"), we break out of the ep_poll loop upon timeout without checking whether there are any new events available. Prior to that patch series we always called ep_events_available() after exiting the loop. This can cause races and missed wakeups. For example, consider the following scenario reported by Guantao Liu.

Suppose we have an eventfd added using EPOLLET to an epollfd.

- Thread 1: Sleeps for just below 5 ms and then writes to the eventfd.
- Thread 2: Calls epoll_wait with a timeout of 5 ms. If it sees an event of the eventfd, it will write back on that fd.
- Thread 3: Calls epoll_wait with a negative timeout.

Prior to abc610e0, it is guaranteed that Thread 3 will wake up, either by Thread 1 or Thread 2. After abc610e0, Thread 3 can be blocked indefinitely if Thread 2 sees a timeout right before the write to the eventfd by Thread 1: Thread 2 will be woken up from schedule_hrtimeout_range and, with eavail 0, it will not call ep_send_events().

To fix this issue:
1) Simplify the timed_out case as suggested by Linus.
2) While holding the lock, recheck whether the thread was woken up after its timeout was reached.

Note that (2) is different from Linus' original suggestion: it does not set "eavail = ep_events_available(ep)" to avoid unnecessary contention (when there are too many timed-out threads and a small number of events), as well as races mentioned in the discussion thread.

This is the first patch in the series so that the backport to stable releases is straightforward.

Link: https://lkml.kernel.org/r/20201106231635.3528496-1-soheil.kdev@gmail.com
Link: https://lkml.kernel.org/r/CAHk-=wizk=OxUyQPbO8MS41w2Pag1kniUV5WdD5qWL-gq1kjDA@mail.gmail.com
Link: https://lkml.kernel.org/r/20201106231635.3528496-2-soheil.kdev@gmail.com
Fixes: abc610e0 ("fs/epoll: avoid barrier after an epoll_wait(2) timeout")
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Tested-by: Guantao Liu <guantaol@google.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Reported-by: Guantao Liu <guantaol@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Khazhismel Kumykov <khazhy@google.com>
Reviewed-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
- 19 December 2020, 4 commits
-
-
By Christian Brauner
After introducing CLOSE_RANGE_CLOEXEC, syzbot reported a crash when CLOSE_RANGE_CLOEXEC is specified in conjunction with CLOSE_RANGE_UNSHARE. When CLOSE_RANGE_UNSHARE is specified, the caller will receive a private file descriptor table in case their file descriptor table is currently shared.

For the case where the caller has requested all file descriptors to be actually closed via e.g. close_range(3, ~0U, 0), the kernel knows that the caller does not need any of the file descriptors anymore and will optimize the close operation by only copying all files in the range from 0 to 3 and no others. However, if the caller requested CLOSE_RANGE_CLOEXEC together with CLOSE_RANGE_UNSHARE, the caller wants to still make use of the file descriptors, so the kernel needs to copy all of them and can't optimize.

The original patch didn't account for this and thus could cause oopses, as evidenced by the syzbot report, because it assumed that all fds had been copied. Fix this by handling the CLOSE_RANGE_CLOEXEC case.

syzbot reported:

    ==================================================================
    BUG: KASAN: null-ptr-deref in instrument_atomic_read include/linux/instrumented.h:71 [inline]
    BUG: KASAN: null-ptr-deref in atomic64_read include/asm-generic/atomic-instrumented.h:837 [inline]
    BUG: KASAN: null-ptr-deref in atomic_long_read include/asm-generic/atomic-long.h:29 [inline]
    BUG: KASAN: null-ptr-deref in filp_close+0x22/0x170 fs/open.c:1274
    Read of size 8 at addr 0000000000000077 by task syz-executor511/8522

    CPU: 1 PID: 8522 Comm: syz-executor511 Not tainted 5.10.0-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
     __dump_stack lib/dump_stack.c:79 [inline]
     dump_stack+0x107/0x163 lib/dump_stack.c:120
     __kasan_report mm/kasan/report.c:549 [inline]
     kasan_report.cold+0x5/0x37 mm/kasan/report.c:562
     check_memory_region_inline mm/kasan/generic.c:186 [inline]
     check_memory_region+0x13d/0x180 mm/kasan/generic.c:192
     instrument_atomic_read include/linux/instrumented.h:71 [inline]
     atomic64_read include/asm-generic/atomic-instrumented.h:837 [inline]
     atomic_long_read include/asm-generic/atomic-long.h:29 [inline]
     filp_close+0x22/0x170 fs/open.c:1274
     close_files fs/file.c:402 [inline]
     put_files_struct fs/file.c:417 [inline]
     put_files_struct+0x1cc/0x350 fs/file.c:414
     exit_files+0x12a/0x170 fs/file.c:435
     do_exit+0xb4f/0x2a00 kernel/exit.c:818
     do_group_exit+0x125/0x310 kernel/exit.c:920
     get_signal+0x428/0x2100 kernel/signal.c:2792
     arch_do_signal_or_restart+0x2a8/0x1eb0 arch/x86/kernel/signal.c:811
     handle_signal_work kernel/entry/common.c:147 [inline]
     exit_to_user_mode_loop kernel/entry/common.c:171 [inline]
     exit_to_user_mode_prepare+0x124/0x200 kernel/entry/common.c:201
     __syscall_exit_to_user_mode_work kernel/entry/common.c:291 [inline]
     syscall_exit_to_user_mode+0x19/0x50 kernel/entry/common.c:302
     entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x447039
    Code: Unable to access opcode bytes at RIP 0x44700f.
    RSP: 002b:00007f1b1225cdb8 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
    RAX: 0000000000000001 RBX: 00000000006dbc28 RCX: 0000000000447039
    RDX: 00000000000f4240 RSI: 0000000000000081 RDI: 00000000006dbc2c
    RBP: 00000000006dbc20 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000246 R12: 00000000006dbc2c
    R13: 00007fff223b6bef R14: 00007f1b1225d9c0 R15: 00000000006dbc2c
    ==================================================================

syzbot has tested the proposed patch and the reproducer did not trigger any issue.

Tested on:

    commit: 10f7cddd selftests/core: add regression test for CLOSE_RAN..
    git tree: git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux.git vfs
    kernel config: https://syzkaller.appspot.com/x/.config?x=5d42216b510180e3
    dashboard link: https://syzkaller.appspot.com/bug?extid=96cfd2b22b3213646a93
    compiler: gcc (GCC) 10.1.0-syz 20200507

Reported-and-tested-by: syzbot+96cfd2b22b3213646a93@syzkaller.appspotmail.com
Fixes: 582f1fb6 ("fs, close_range: add flag CLOSE_RANGE_CLOEXEC")
Cc: Giuseppe Scrivano <gscrivan@redhat.com>
Cc: linux-fsdevel@vger.kernel.org
Link: https://lore.kernel.org/r/20201217213303.722643-1-christian.brauner@ubuntu.com
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
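For reference, a hedged userspace sketch of the flag combination at issue, assuming headers new enough to define SYS_close_range and CLOSE_RANGE_CLOEXEC:

    #define _GNU_SOURCE
    #include <linux/close_range.h>  /* CLOSE_RANGE_UNSHARE, CLOSE_RANGE_CLOEXEC */
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void)
    {
            /* Mark every fd >= 3 close-on-exec in a private (unshared) fd
             * table. The fds remain usable afterwards, so the kernel must
             * copy all of them rather than take the "close everything from
             * fd 3 on" shortcut described above. */
            if (syscall(SYS_close_range, 3U, ~0U,
                        CLOSE_RANGE_UNSHARE | CLOSE_RANGE_CLOEXEC) < 0) {
                    perror("close_range");
                    return 1;
            }
            return 0;
    }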
-
By Pavel Begunkov
Doing a vectored buf-select read with 0 iovecs passed is meaningless and utterly broken; forbid it.

Cc: <stable@vger.kernel.org> # 5.7+
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Boris Protopopov
Fix passing of the additional security info via version operations. Force a new open when getting the SACL, and avoid reuse of files that were previously opened without sufficient privileges to access SACLs.

Signed-off-by: Boris Protopopov <pboris@amazon.com>
Reviewed-by: Shyam Prasad N <sprasad@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
-
By Boris Protopopov
Add the SYSTEM_SECURITY access flag and use it with smb2 when opening files for getting/setting SACLs. Add the "system.cifs_ntsd_full" extended attribute to allow user-space access to the functionality. Avoid multiple server calls when setting owner, DACL, and SACL.

Signed-off-by: Boris Protopopov <pboris@amazon.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
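A hedged sketch of fetching the new attribute from user space with the standard xattr API; the buffer size is arbitrary, and the returned blob is a packed NT security descriptor including the SACL:

    #include <stdio.h>
    #include <sys/xattr.h>

    int main(int argc, char **argv)
    {
            char buf[4096];  /* arbitrary; real descriptors vary in size */
            ssize_t len;

            if (argc < 2) {
                    fprintf(stderr, "usage: %s <file-on-cifs-mount>\n", argv[0]);
                    return 1;
            }
            len = getxattr(argv[1], "system.cifs_ntsd_full", buf, sizeof(buf));
            if (len < 0) {
                    perror("getxattr");
                    return 1;
            }
            printf("read %zd bytes of NT security descriptor (incl. SACL)\n", len);
            return 0;
    }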
-
- 18 December 2020, 4 commits
-
-
By Pavel Begunkov
The purpose of io_uring_cancel_files() is to wait for all requests matching ->files to go/be cancelled. We should first drop the files of a request in io_req_drop_files() and only then make it undiscoverable for io_uring_cancel_files(): first drop, then delete from the list. It's ok to leave req->id->files dangling, because it's not dereferenced by the cancellation code, only compared against. The task would potentially go to sleep and be awoken by the wake_up() that follows in io_req_drop_files().

Fixes: 0f212204 ("io_uring: don't rely on weak ->files references")
Cc: <stable@vger.kernel.org> # 5.5+
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Xiaoguang Wang
The first time a req is punted to io-wq, we initialize io_wq_work's list to be NULL and then insert the req into io_wqe->work_list. If this req is not inserted at the tail of io_wqe->work_list, this req's io_wq_work list will point to another req's io_wq_work. In the split-bio case, this req may be inserted into io_wqe->work_list repeatedly; once we insert it at the tail of io_wqe->work_list for the second time, io_wq_work->list->next will be an invalid pointer, which then results in many strange errors: panic, kernel soft-lockup, rcu stall, etc.

In my vm, the kernel does not have commit cc29e1bf ("block: disable iopoll for split bio"), and the below fio job can reproduce this bug steadily:

    [global]
    name=iouring-sqpoll-iopoll-1
    ioengine=io_uring
    iodepth=128
    numjobs=1
    thread
    rw=randread
    direct=1
    registerfiles=1
    hipri=1
    bs=4m
    size=100M
    runtime=120
    time_based
    group_reporting
    randrepeat=0

    [device]
    directory=/home/feiman.wxg/mntpoint/  # an ext4 mount point

If we have commit cc29e1bf ("block: disable iopoll for split bio"), there will be no split-bio case for polled io, but I think we still need to fix this list corruption; it should probably also go to stable branches.

To fix this corruption, if a req is inserted at the tail of io_wqe->work_list, initialize req->io_wq_work->list->next to be NULL.

Cc: stable@vger.kernel.org
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
By Samuel Cabrero
The patch 7d6535b7 ("cifs: Simplify reconnect code when dfs upcall is enabled") leads to the following static checker warning:

    fs/cifs/connect.c:160 reconn_set_next_dfs_target()
    error: 'server->hostname' dereferencing possible ERR_PTR()

Avoid dereferencing the error pointer by returning early on the error condition.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Samuel Cabrero <scabrero@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
-
By Dan Carpenter
This code is slightly nicer if we flip the cifs_sockaddr_equal() around and pull all the code in one tab.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Samuel Cabrero <scabrero@suse.de>
Signed-off-by: Steve French <stfrench@microsoft.com>
-