- 02 Feb, 2021 15 commits
-
-
Committed by Pavel Begunkov
personality_idr is usually synchronised by uring_lock; the exception is removing personalities in io_ring_ctx_wait_and_kill(), which is legitimate as refs are killed by that point, but it would still be more resilient to do it under the lock.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Pavel Begunkov
It's awkward to pass a return value into a function just for it to return it back. Check it at the caller site and clean up io_resubmit_prep() a bit.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Pavel Begunkov
The hot path is IO completing on the first try. Reshuffle io_rw_reissue() so it's checked first.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Bijan Mottahedeh
Make the percpu ref release function names consistent between rsrc data and nodes.

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Bijan Mottahedeh
Create common alloc/free fixed_rsrc_data routines for both files and buffers.

Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
[remove buffer part]
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Bijan Mottahedeh
Create common routines to be used for both files/buffers registration.

[remove io_sqe_rsrc_set_node substitution]
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
[merge, quiesce only for files]
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Pavel Begunkov
A simple prep patch that allows setting refnode callbacks after the node has been allocated. This is needed to 1) keep ourselves away from high-level functions where they are not pretty and not necessary, and 2) amortise ref_node allocation in the future, e.g. for updates.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Bijan Mottahedeh
Split alloc_fixed_file_ref_node into resource generic/specific parts, to be leveraged for fixed buffers.

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Bijan Mottahedeh
Encapsulate resource reference locking into separate routines.

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Bijan Mottahedeh
Uplevel ref_list and make it common to all resources. This allows one common ref_list to be used for both files and buffers in upcoming patches.

Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Bijan Mottahedeh
Generalize io_queue_rsrc_removal to handle both files and buffers.

Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
[remove io_mapped_ubuf from rsrc tables/etc. for now]
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Bijan Mottahedeh
This is a prep rename patch for subsequent patches to generalize file registration.

[io_uring_rsrc_update:: rename fds -> data]
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
[leave io_uring_files_update as struct]
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Bijan Mottahedeh
Move allocation of buffer management structures and validation of buffers into separate routines.

Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Bijan Mottahedeh
Split io_sqe_buffer_register into two routines:

- io_sqe_buffer_register() registers a single buffer
- io_sqe_buffers_register() iterates over all user-specified buffers

Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Jens Axboe
Instead of being pessimistic and assuming that path lookup will block, use LOOKUP_CACHED to attempt just a cached lookup. This ensures that the fast path is always done inline, and we only punt to async context if IO is needed to satisfy the lookup.

For forced nonblock open attempts, mark the file O_NONBLOCK over the actual ->open() call as well. We can safely clear this again before doing fd_install(), so it'll never be user visible that we fiddled with it.

This greatly improves the performance of file open where the dentry is already cached:

Cached    5.10-git     5.10-git+LOOKUP_CACHED    Speedup
---------------------------------------------------------------
33%       1,014,975    900,474                   1.1x
89%         545,466    292,937                   1.9x
100%        435,636    151,475                   2.9x

The more cache-hot we are, the more the inline LOOKUP_CACHED optimization helps. This is unsurprising and expected, as a thread offload becomes a more dominant part of the total overhead. If we look at io_uring tracing, doing an IORING_OP_OPENAT on a file that isn't in the dentry cache will yield:

275.550481: io_uring_create: ring 00000000ddda6278, fd 3 sq size 8, cq size 16, flags 0
275.550491: io_uring_submit_sqe: ring 00000000ddda6278, op 18, data 0x0, non block 1, sq_thread 0
275.550498: io_uring_queue_async_work: ring 00000000ddda6278, request 00000000c0267d17, flags 69760, normal queue, work 000000003d683991
275.550502: io_uring_cqring_wait: ring 00000000ddda6278, min_events 1
275.550556: io_uring_complete: ring 00000000ddda6278, user_data 0x0, result 4

which shows a failed nonblock lookup, then a punt to a worker, and then we complete with fd == 4. This takes 65 usec in total. Re-running the same test case again:

281.253956: io_uring_create: ring 0000000008207252, fd 3 sq size 8, cq size 16, flags 0
281.253967: io_uring_submit_sqe: ring 0000000008207252, op 18, data 0x0, non block 1, sq_thread 0
281.253973: io_uring_complete: ring 0000000008207252, user_data 0x0, result 4

shows the same request completing inline, also returning fd == 4. This takes 6 usec.

Signed-off-by: Jens Axboe <axboe@kernel.dk>
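The inline-first, punt-on-miss behaviour described in this commit can be modelled in userspace. The following is a minimal sketch and not kernel code: `demo_cache`, `lookup_cached`, `lookup_slow`, and `demo_open` are hypothetical names, and the "cache" is just an array of strings standing in for the dentry cache.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>

#define DEMO_CACHE_SLOTS 4

/* Hypothetical stand-in for the dentry cache: a small array of known names. */
struct demo_cache {
	const char *names[DEMO_CACHE_SLOTS];
	int count;
};

/* Outcome of one open attempt: the slot found and whether a worker was needed. */
struct demo_result {
	int slot;
	bool punted;
};

/* Cached-only lookup: never blocks, returns -EAGAIN on a miss. */
static int lookup_cached(const struct demo_cache *cache, const char *name)
{
	for (int i = 0; i < cache->count; i++)
		if (strcmp(cache->names[i], name) == 0)
			return i;
	return -EAGAIN;
}

/* Slow lookup: models the blocking path, which also populates the cache. */
static int lookup_slow(struct demo_cache *cache, const char *name)
{
	if (cache->count == DEMO_CACHE_SLOTS)
		return -ENOMEM;
	cache->names[cache->count] = name;
	return cache->count++;
}

/* Try the inline fast path first; fall back to the "worker" only on a miss. */
static struct demo_result demo_open(struct demo_cache *cache, const char *name)
{
	struct demo_result res;

	assert(cache != NULL && name != NULL);
	res.slot = lookup_cached(cache, name);
	res.punted = false;
	if (res.slot == -EAGAIN) {
		res.punted = true;
		res.slot = lookup_slow(cache, name);
	}
	return res;
}
```

Calling `demo_open()` twice with the same name mirrors the two traces: the first call misses and is punted, the second completes inline.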
-
- 29 Jan, 2021 6 commits
-
-
Committed by Ronnie Sahlberg
The new mount API requires additional changes to how DFS is handled. Additional testing of DFS uncovered problems with domain-based DFS referrals, which this patch addresses (a follow-on patch addresses DFS links).

Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Paulo Alcantara (SUSE) <pc@cjr.nz>
Signed-off-by: Steve French <stfrench@microsoft.com>
-
Committed by Pavel Begunkov
What 84965ff8 ("io_uring: if we see flush on exit, cancel related tasks") really wants is to cancel all relevant REQ_F_INFLIGHT requests reliably. That can be achieved by io_uring_cancel_files(), but we'll miss it when calling io_uring_cancel_task_requests(files=NULL) from io_uring_flush(), because it will go through __io_uring_cancel_task_requests().

Just always call io_uring_cancel_files() during cancel; it's good enough for now.

Cc: stable@vger.kernel.org # 5.9+
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Steve French
During additional testing of the updated cifs.ko with the new mount API support, we found a few additional cases where we were logging errors, but not returning them to the user.

For example:
a) invalid security mechanisms
b) invalid cache options
c) unsupported rdma
d) invalid smb dialect requested

Fixes: 24e0a1ef ("cifs: switch to new mount api")
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
-
Committed by Pavel Begunkov
WARNING: CPU: 0 PID: 21359 at fs/io_uring.c:9042 io_uring_cancel_task_requests+0xe55/0x10c0 fs/io_uring.c:9042
Call Trace:
 io_uring_flush+0x47b/0x6e0 fs/io_uring.c:9227
 filp_close+0xb4/0x170 fs/open.c:1295
 close_files fs/file.c:403 [inline]
 put_files_struct fs/file.c:418 [inline]
 put_files_struct+0x1cc/0x350 fs/file.c:415
 exit_files+0x7e/0xa0 fs/file.c:435
 do_exit+0xc22/0x2ae0 kernel/exit.c:820
 do_group_exit+0x125/0x310 kernel/exit.c:922
 get_signal+0x427/0x20f0 kernel/signal.c:2773
 arch_do_signal_or_restart+0x2a8/0x1eb0 arch/x86/kernel/signal.c:811
 handle_signal_work kernel/entry/common.c:147 [inline]
 exit_to_user_mode_loop kernel/entry/common.c:171 [inline]
 exit_to_user_mode_prepare+0x148/0x250 kernel/entry/common.c:201
 __syscall_exit_to_user_mode_work kernel/entry/common.c:291 [inline]
 syscall_exit_to_user_mode+0x19/0x50 kernel/entry/common.c:302
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Now that io_uring_cancel_task_requests() can be called directly rather than only through file notes, remove a WARN_ONCE() there that gives us false positives. That check is not very important and we catch it in other places.

Fixes: 84965ff8 ("io_uring: if we see flush on exit, cancel related tasks")
Cc: stable@vger.kernel.org # 5.9+
Reported-by: syzbot+3e3d9bd0c6ce9efbc3ef@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Pavel Begunkov
kernel BUG at lib/list_debug.c:29!
Call Trace:
 __list_add include/linux/list.h:67 [inline]
 list_add include/linux/list.h:86 [inline]
 io_file_get+0x8cc/0xdb0 fs/io_uring.c:6466
 __io_splice_prep+0x1bc/0x530 fs/io_uring.c:3866
 io_splice_prep fs/io_uring.c:3920 [inline]
 io_req_prep+0x3546/0x4e80 fs/io_uring.c:6081
 io_queue_sqe+0x609/0x10d0 fs/io_uring.c:6628
 io_submit_sqe fs/io_uring.c:6705 [inline]
 io_submit_sqes+0x1495/0x2720 fs/io_uring.c:6953
 __do_sys_io_uring_enter+0x107d/0x1f30 fs/io_uring.c:9353
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

io_file_get() may be called from splice, and so REQ_F_INFLIGHT may already be set.

Fixes: 02a13674 ("io_uring: account io_uring internal files as REQ_F_INFLIGHT")
Cc: stable@vger.kernel.org # 5.9+
Reported-by: syzbot+6879187cf57845801267@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Steve French
The "prefixpath" mount option needs to be ignored, which was missed in the recent conversion to the new mount API (prefixpath would be set by the mount helper if mounting a subdirectory of the root of a share, e.g. //server/share/subdir).

Fixes: 24e0a1ef ("cifs: switch to new mount api")
Suggested-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
-
- 28 Jan, 2021 3 commits
-
-
Committed by Adam Harvey
In 24e0a1ef, the noauto and auto options were missed when migrating to the new mount API. As a result, users with noauto in their fstab mount options are now unable to mount cifs filesystems, as they'll receive an "Unknown parameter" error.

This restores the old behaviour of ignoring noauto and auto if they're given.

Fixes: 24e0a1ef ("cifs: switch to new mount api")
Signed-off-by: Adam Harvey <adam@adamharvey.name>
Reviewed-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
-
Committed by Hao Xu
Abaci reported the following warning:

[ 27.073425] do not call blocking ops when !TASK_RUNNING; state=1 set at [] prepare_to_wait_exclusive+0x3a/0xc0
[ 27.075805] WARNING: CPU: 0 PID: 951 at kernel/sched/core.c:7853 __might_sleep+0x80/0xa0
[ 27.077604] Modules linked in:
[ 27.078379] CPU: 0 PID: 951 Comm: a.out Not tainted 5.11.0-rc3+ #1
[ 27.079637] Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
[ 27.080852] RIP: 0010:__might_sleep+0x80/0xa0
[ 27.081835] Code: 65 48 8b 04 25 80 71 01 00 48 8b 90 c0 15 00 00 48 8b 70 18 48 c7 c7 08 39 95 82 c6 05 f9 5f de 08 01 48 89 d1 e8 00 c6 fa ff 0b eb bf 41 0f b6 f5 48 c7 c7 40 23 c9 82 e8 f3 48 ec 00 eb a7
[ 27.084521] RSP: 0018:ffffc90000fe3ce8 EFLAGS: 00010286
[ 27.085350] RAX: 0000000000000000 RBX: ffffffff82956083 RCX: 0000000000000000
[ 27.086348] RDX: ffff8881057a0000 RSI: ffffffff8118cc9e RDI: ffff88813bc28570
[ 27.087598] RBP: 00000000000003a7 R08: 0000000000000001 R09: 0000000000000001
[ 27.088819] R10: ffffc90000fe3e00 R11: 00000000fffef9f0 R12: 0000000000000000
[ 27.089819] R13: 0000000000000000 R14: ffff88810576eb80 R15: ffff88810576e800
[ 27.091058] FS: 00007f7b144cf740(0000) GS:ffff88813bc00000(0000) knlGS:0000000000000000
[ 27.092775] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 27.093796] CR2: 00000000022da7b8 CR3: 000000010b928002 CR4: 00000000003706f0
[ 27.094778] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 27.095780] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 27.097011] Call Trace:
[ 27.097685] __mutex_lock+0x5d/0xa30
[ 27.098565] ? prepare_to_wait_exclusive+0x71/0xc0
[ 27.099412] ? io_cqring_overflow_flush.part.101+0x6d/0x70
[ 27.100441] ? lockdep_hardirqs_on_prepare+0xe9/0x1c0
[ 27.101537] ? _raw_spin_unlock_irqrestore+0x2d/0x40
[ 27.102656] ? trace_hardirqs_on+0x46/0x110
[ 27.103459] ? io_cqring_overflow_flush.part.101+0x6d/0x70
[ 27.104317] io_cqring_overflow_flush.part.101+0x6d/0x70
[ 27.105113] io_cqring_wait+0x36e/0x4d0
[ 27.105770] ? find_held_lock+0x28/0xb0
[ 27.106370] ? io_uring_remove_task_files+0xa0/0xa0
[ 27.107076] __x64_sys_io_uring_enter+0x4fb/0x640
[ 27.107801] ? rcu_read_lock_sched_held+0x59/0xa0
[ 27.108562] ? lockdep_hardirqs_on_prepare+0xe9/0x1c0
[ 27.109684] ? syscall_enter_from_user_mode+0x26/0x70
[ 27.110731] do_syscall_64+0x2d/0x40
[ 27.111296] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 27.112056] RIP: 0033:0x7f7b13dc8239
[ 27.112663] Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 3d 01 f0 ff ff 73 01 c3 48 8b 0d 27 ec 2c 00 f7 d8 64 89 01 48
[ 27.115113] RSP: 002b:00007ffd6d7f5c88 EFLAGS: 00000286 ORIG_RAX: 00000000000001aa
[ 27.116562] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f7b13dc8239
[ 27.117961] RDX: 000000000000478e RSI: 0000000000000000 RDI: 0000000000000003
[ 27.118925] RBP: 00007ffd6d7f5cb0 R08: 0000000020000040 R09: 0000000000000008
[ 27.119773] R10: 0000000000000001 R11: 0000000000000286 R12: 0000000000400480
[ 27.120614] R13: 00007ffd6d7f5d90 R14: 0000000000000000 R15: 0000000000000000
[ 27.121490] irq event stamp: 5635
[ 27.121946] hardirqs last enabled at (5643): [] console_unlock+0x5c4/0x740
[ 27.123476] hardirqs last disabled at (5652): [] console_unlock+0x4e7/0x740
[ 27.125192] softirqs last enabled at (5272): [] __do_softirq+0x3c5/0x5aa
[ 27.126430] softirqs last disabled at (5267): [] asm_call_irq_on_stack+0xf/0x20
[ 27.127634] ---[ end trace 289d7e28fa60f928 ]---

This is caused by calling io_cqring_overflow_flush(), which may sleep, after calling prepare_to_wait_exclusive(), which sets the task state to TASK_INTERRUPTIBLE.

Reported-by: Abaci <abaci@linux.alibaba.com>
Fixes: 6c503150 ("io_uring: patch up IOPOLL overflow_flush sync")
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Hao Xu <haoxu@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
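The ordering problem behind this warning can be illustrated with a small userspace model. This is a sketch under stated assumptions, not the kernel patch itself: `demo_task`, `demo_blocking_op`, and `demo_prepare_to_wait` are hypothetical names that only mimic the roles of the task state, a sleeping call such as mutex_lock(), and prepare_to_wait_exclusive(). Doing the blocking work before changing the task state is one way to avoid the warning; the commit message does not spell out the exact change made.

```c
#include <assert.h>
#include <stddef.h>

enum demo_state { DEMO_RUNNING, DEMO_INTERRUPTIBLE };

/* Hypothetical userspace model of a task; not the kernel's task_struct. */
struct demo_task {
	enum demo_state state;
	int warnings;   /* how many times the sleep check fired */
};

/*
 * Models a blocking call such as mutex_lock(): it complains if the task
 * has already announced that it is about to sleep on a wait queue.
 */
static void demo_blocking_op(struct demo_task *task)
{
	assert(task != NULL);
	if (task->state != DEMO_RUNNING)
		task->warnings++;
	task->state = DEMO_RUNNING;   /* sleeping and waking resets the state */
}

/* Models prepare_to_wait_exclusive(): only marks the task as about to sleep. */
static void demo_prepare_to_wait(struct demo_task *task)
{
	task->state = DEMO_INTERRUPTIBLE;
}

/* Problematic order: blocking work runs after the state was changed. */
static int demo_wait_flush_after(struct demo_task *task)
{
	demo_prepare_to_wait(task);
	demo_blocking_op(task);
	return task->warnings;
}

/* Safer order: finish the blocking work first, then prepare to wait. */
static int demo_wait_flush_before(struct demo_task *task)
{
	demo_blocking_op(task);
	demo_prepare_to_wait(task);
	return task->warnings;
}
```

In the model, the first ordering records one warning and also loses the "about to sleep" state, while the second records none and leaves the task correctly marked.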
-
Committed by Maxim Mikityanskiy
The cited commit introduced a serious regression in SATA write speed, as found by bisecting. This patch reverts that commit, which restores write speed to the values observed before it.

The performance tests were done on a Helios4 NAS (2nd batch) with 4 HDDs (WD8003FFBX) using dd (bs=1M count=2000). "Direct" is a test with a single HDD, the rest are different RAID levels built over the first partitions of 4 HDDs. Test results are in MB/s, R is read, W is write.

                | Direct | RAID0 | RAID10 f2 | RAID10 n2 | RAID6
----------------+--------+-------+-----------+-----------+--------
9011495c        | R:256  | R:313 | R:276     | R:313     | R:323
(before faulty) | W:254  | W:253 | W:195     | W:204     | W:117
----------------+--------+-------+-----------+-----------+--------
5ff9f192        | R:257  | R:398 | R:312     | R:344     | R:391
(faulty commit) | W:154  | W:122 | W:67.7    | W:66.6    | W:67.2
----------------+--------+-------+-----------+-----------+--------
5.10.10         | R:256  | R:401 | R:312     | R:356     | R:375
unpatched       | W:149  | W:123 | W:64      | W:64.1    | W:61.5
----------------+--------+-------+-----------+-----------+--------
5.10.10         | R:255  | R:396 | R:312     | R:340     | R:393
patched         | W:247  | W:274 | W:220     | W:225     | W:121

Applying this patch doesn't hurt read performance, while improving the write speed by 1.5x - 3.5x (more impact on RAID tests). The write speed is restored to the state before the faulty commit, and is even a bit higher in RAID tests (which aren't HDD-bound on this device); that is likely related to other optimizations done between the faulty commit and 5.10.10, which also improved the read speed.

Signed-off-by: Maxim Mikityanskiy <maxtram95@gmail.com>
Fixes: 5ff9f192 ("block: simplify set_init_blocksize")
Cc: Christoph Hellwig <hch@lst.de>
Cc: Jens Axboe <axboe@kernel.dk>
Acked-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 27 Jan, 2021 2 commits
-
-
Committed by Pavel Begunkov
Joseph reports the following deadlock:

CPU0:
...
io_kill_linked_timeout  // &ctx->completion_lock
io_commit_cqring
__io_queue_deferred
__io_queue_async_work
io_wq_enqueue
io_wqe_enqueue  // &wqe->lock

CPU1:
...
__io_uring_files_cancel
io_wq_cancel_cb
io_wqe_cancel_pending_work  // &wqe->lock
io_cancel_task_cb  // &ctx->completion_lock

Only __io_queue_deferred() calls queue_async_work() while holding ctx->completion_lock, so enqueue drained requests via io_req_task_queue() instead.

Cc: stable@vger.kernel.org # 5.9+
Reported-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Tested-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Pavel Begunkov
do not call blocking ops when !TASK_RUNNING; state=2 set at [<00000000ced9dbfc>] prepare_to_wait+0x1f4/0x3b0 kernel/sched/wait.c:262
WARNING: CPU: 1 PID: 19888 at kernel/sched/core.c:7853 __might_sleep+0xed/0x100 kernel/sched/core.c:7848
RIP: 0010:__might_sleep+0xed/0x100 kernel/sched/core.c:7848
Call Trace:
 __mutex_lock_common+0xc4/0x2ef0 kernel/locking/mutex.c:935
 __mutex_lock kernel/locking/mutex.c:1103 [inline]
 mutex_lock_nested+0x1a/0x20 kernel/locking/mutex.c:1118
 io_wq_submit_work+0x39a/0x720 fs/io_uring.c:6411
 io_run_cancel fs/io-wq.c:856 [inline]
 io_wqe_cancel_pending_work fs/io-wq.c:990 [inline]
 io_wq_cancel_cb+0x614/0xcb0 fs/io-wq.c:1027
 io_uring_cancel_files fs/io_uring.c:8874 [inline]
 io_uring_cancel_task_requests fs/io_uring.c:8952 [inline]
 __io_uring_files_cancel+0x115d/0x19e0 fs/io_uring.c:9038
 io_uring_files_cancel include/linux/io_uring.h:51 [inline]
 do_exit+0x2e6/0x2490 kernel/exit.c:780
 do_group_exit+0x168/0x2d0 kernel/exit.c:922
 get_signal+0x16b5/0x2030 kernel/signal.c:2770
 arch_do_signal_or_restart+0x8e/0x6a0 arch/x86/kernel/signal.c:811
 handle_signal_work kernel/entry/common.c:147 [inline]
 exit_to_user_mode_loop kernel/entry/common.c:171 [inline]
 exit_to_user_mode_prepare+0xac/0x1e0 kernel/entry/common.c:201
 __syscall_exit_to_user_mode_work kernel/entry/common.c:291 [inline]
 syscall_exit_to_user_mode+0x48/0x190 kernel/entry/common.c:302
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Rewrite io_uring_cancel_files() to mimic __io_uring_task_cancel()'s counting scheme, so it does all the heavy work before setting TASK_UNINTERRUPTIBLE.

Cc: stable@vger.kernel.org # 5.9+
Reported-by: syzbot+f655445043a26a7cfab8@syzkaller.appspotmail.com
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
[axboe: fix inverted task check]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
- 26 Jan, 2021 6 commits
-
-
Committed by Pavel Begunkov
If the tctx inflight number hasn't changed because of cancellation, __io_uring_task_cancel() will continue leaving the task in TASK_UNINTERRUPTIBLE state, which is not expected by __io_uring_files_cancel(). Ensure we always call finish_wait() before retrying.

Cc: stable@vger.kernel.org # 5.9+
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Miklos Szeredi
Prior to commit 7c03e2cd ("vfs: move cap_convert_nscap() call into vfs_setxattr()") the translation of nscap->rootid did not take stacked filesystems (overlayfs and ecryptfs) into account.

That patch fixed the overlay case, but made the ecryptfs case worse.

Restore the old behavior for ecryptfs that existed before the overlayfs fix. This does not fix ecryptfs's handling of complex user namespace setups, but it does make sure existing setups don't regress.

Reported-by: Eric W. Biederman <ebiederm@xmission.com>
Cc: Tyler Hicks <code@tyhicks.com>
Fixes: 7c03e2cd ("vfs: move cap_convert_nscap() call into vfs_setxattr()")
Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
Signed-off-by: Tyler Hicks <code@tyhicks.com>
-
Committed by Johannes Berg
After commit 36e2c742 ("fs: don't allow splice read/write without explicit ops") sendfile() could no longer send data from a real file to a pipe, breaking for example certain cgit setups (e.g. when running behind fcgiwrap), because in this case cgit will try to do exactly this: sendfile() to a pipe.

Fix this by using iter_file_splice_write for the splice_write method of pipes, as suggested by Christoph.

Cc: stable@vger.kernel.org
Fixes: 36e2c742 ("fs: don't allow splice read/write without explicit ops")
Suggested-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Tested-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
-
Committed by Filipe Manana
After a sudden power failure we may end up with a space cache on disk that is not valid and needs to be rebuilt from scratch.

If that happens, during log replay when we attempt to pin an extent buffer from a log tree, at btrfs_pin_extent_for_log_replay(), we do not wait for the space cache to be rebuilt through the call to:

    btrfs_cache_block_group(cache, 1);

That is because that only waits for the task (work queue job) that loads the space cache to change the cache state from BTRFS_CACHE_FAST to any other value. That is ok when the space cache on disk exists and is valid, but when the cache is not valid and needs to be rebuilt, it ends up returning as soon as the cache state changes to BTRFS_CACHE_STARTED (done at caching_thread()).

So this means that we can end up trying to unpin a range which is not yet marked as free in the block group. This results in the call to btrfs_remove_free_space() returning -EINVAL to btrfs_pin_extent_for_log_replay(), which in turn makes the log replay fail as well as mounting the filesystem. More specifically the -EINVAL comes from free_space_cache.c:remove_from_bitmap(); because the requested range is not marked as free space (ones in the bitmap), we have the following condition triggered:

    static noinline int remove_from_bitmap(struct btrfs_free_space_ctl *ctl,
    (...)
        if (ret < 0 || search_start != *offset)
            return -EINVAL;
    (...)

It's the "search_start != *offset" that results in the condition being evaluated to true.

When this happens we get the following in dmesg/syslog:

[72383.415114] BTRFS: device fsid 32b95b69-0ea9-496a-9f02-3f5a56dc9322 devid 1 transid 1432 /dev/sdb scanned by mount (3816007)
[72383.417837] BTRFS info (device sdb): disk space caching is enabled
[72383.418536] BTRFS info (device sdb): has skinny extents
[72383.423846] BTRFS info (device sdb): start tree-log replay
[72383.426416] BTRFS warning (device sdb): block group 30408704 has wrong amount of free space
[72383.427686] BTRFS warning (device sdb): failed to load free space cache for block group 30408704, rebuilding it now
[72383.454291] BTRFS: error (device sdb) in btrfs_recover_log_trees:6203: errno=-22 unknown (Failed to pin buffers while recovering log root tree.)
[72383.456725] BTRFS: error (device sdb) in btrfs_replay_log:2253: errno=-22 unknown (Failed to recover log tree)
[72383.460241] BTRFS error (device sdb): open_ctree failed

We also mark the range for the extent buffer in the excluded extents io tree. That is fine when the space cache is valid on disk and we can load it, in which case it causes no problems.

However, for the case where we need to rebuild the space cache, because it is either invalid or it is missing, having the extent buffer range marked in the excluded extents io tree leads to a -EINVAL failure from the call to btrfs_remove_free_space(), resulting in the log replay and mount failing. This is because by having the range marked in the excluded extents io tree, the caching thread ends up never adding the range of the extent buffer as free space in the block group, since the calls to add_new_free_space(), called from load_extent_tree_free(), filter out any ranges that are marked as excluded extents.

So fix this by making sure that during log replay we wait for the caching task to finish completely when we need to rebuild a space cache, and also drop the need to mark the extent buffer range in the excluded extents io tree, as well as clearing ranges from that tree at btrfs_finish_extent_commit().

This started to happen with some frequency on large filesystems having block groups with a lot of fragmentation since the recent commit e747853c ("btrfs: load free space cache asynchronously"), but in fact the issue has been there for years; it was just much less likely to happen.

Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-
Committed by Su Yue
This effectively reverts commit d5c82388 ("btrfs: convert data_seqcount to seqcount_mutex_t").

While running fstests on a 32-bit test box, many tests failed because of warnings in dmesg. One of those warnings (btrfs/003):

[66.441317] WARNING: CPU: 6 PID: 9251 at include/linux/seqlock.h:279 btrfs_remove_chunk+0x58b/0x7b0 [btrfs]
[66.441446] CPU: 6 PID: 9251 Comm: btrfs Tainted: G O 5.11.0-rc4-custom+ #5
[66.441449] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.14.0-1 04/01/2014
[66.441451] EIP: btrfs_remove_chunk+0x58b/0x7b0 [btrfs]
[66.441472] EAX: 00000000 EBX: 00000001 ECX: c576070c EDX: c6b15803
[66.441475] ESI: 10000000 EDI: 00000000 EBP: c56fbcfc ESP: c56fbc70
[66.441477] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010246
[66.441481] CR0: 80050033 CR2: 05c8da20 CR3: 04b20000 CR4: 00350ed0
[66.441485] Call Trace:
[66.441510] btrfs_relocate_chunk+0xb1/0x100 [btrfs]
[66.441529] ? btrfs_lookup_block_group+0x17/0x20 [btrfs]
[66.441562] btrfs_balance+0x8ed/0x13b0 [btrfs]
[66.441586] ? btrfs_ioctl_balance+0x333/0x3c0 [btrfs]
[66.441619] ? __this_cpu_preempt_check+0xf/0x11
[66.441643] btrfs_ioctl_balance+0x333/0x3c0 [btrfs]
[66.441664] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
[66.441683] btrfs_ioctl+0x414/0x2ae0 [btrfs]
[66.441700] ? __lock_acquire+0x35f/0x2650
[66.441717] ? lockdep_hardirqs_on+0x87/0x120
[66.441720] ? lockdep_hardirqs_on_prepare+0xd0/0x1e0
[66.441724] ? call_rcu+0x2d3/0x530
[66.441731] ? __might_fault+0x41/0x90
[66.441736] ? kvm_sched_clock_read+0x15/0x50
[66.441740] ? sched_clock+0x8/0x10
[66.441745] ? sched_clock_cpu+0x13/0x180
[66.441750] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
[66.441750] ? btrfs_ioctl_get_supported_features+0x30/0x30 [btrfs]
[66.441768] __ia32_sys_ioctl+0x165/0x8a0
[66.441773] ? __this_cpu_preempt_check+0xf/0x11
[66.441785] ? __might_fault+0x89/0x90
[66.441791] __do_fast_syscall_32+0x54/0x80
[66.441796] do_fast_syscall_32+0x32/0x70
[66.441801] do_SYSENTER_32+0x15/0x20
[66.441805] entry_SYSENTER_32+0x9f/0xf2
[66.441808] EIP: 0xab7b5549
[66.441814] EAX: ffffffda EBX: 00000003 ECX: c4009420 EDX: bfa91f5c
[66.441816] ESI: 00000003 EDI: 00000001 EBP: 00000000 ESP: bfa91e98
[66.441818] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 007b EFLAGS: 00000292
[66.441833] irq event stamp: 42579
[66.441835] hardirqs last enabled at (42585): [<c60eb065>] console_unlock+0x495/0x590
[66.441838] hardirqs last disabled at (42590): [<c60eafd5>] console_unlock+0x405/0x590
[66.441840] softirqs last enabled at (41698): [<c601b76c>] call_on_stack+0x1c/0x60
[66.441843] softirqs last disabled at (41681): [<c601b76c>] call_on_stack+0x1c/0x60

========================================================================
btrfs_remove_chunk+0x58b/0x7b0:
__seqprop_mutex_assert at linux/./include/linux/seqlock.h:279
(inlined by) btrfs_device_set_bytes_used at linux/fs/btrfs/volumes.h:212
(inlined by) btrfs_remove_chunk at linux/fs/btrfs/volumes.c:2994
========================================================================

The warning is produced by lockdep_assert_held() in __seqprop_mutex_assert() if CONFIG_LOCKDEP is enabled. And volumes.c:2994 is btrfs_device_set_bytes_used() with the mutex lock fs_info->chunk_mutex already held.

After adding some debug prints, the cause was found: many __alloc_device() calls are made with a NULL @fs_info (during the scanning ioctl). Inside the function, btrfs_device_data_ordered_init() is expanded to seqcount_mutex_init(). In this scenario, its second parameter info->chunk_mutex is &NULL->chunk_mutex, which unexpectedly equals offsetof(struct btrfs_fs_info, chunk_mutex). Thus, seqcount_mutex_init() is called in the wrong way, and later the btrfs_device_get/set helpers trigger lockdep warnings.

The device and filesystem object lifetimes are different and we'd have to synchronize initialization of the btrfs_device::data_seqcount with the fs_info, possibly using some additional synchronization. It would still not prevent concurrent access to the seqcount lock when it's used for read and initialization.

Commit d5c82388 ("btrfs: convert data_seqcount to seqcount_mutex_t") does not mention a particular problem being fixed, so the revert should not cause any harm and we'll get the lockdep warning fixed.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=210139
Reported-by: Erhard F <erhard_f@mailbox.org>
Fixes: d5c82388 ("btrfs: convert data_seqcount to seqcount_mutex_t")
CC: stable@vger.kernel.org # 5.10
CC: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Su Yue <l@damenly.su>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
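The observation that &NULL->chunk_mutex equals offsetof(struct btrfs_fs_info, chunk_mutex) is plain address arithmetic, and can be shown with a small sketch. The types below are hypothetical stand-ins (`demo_fs_info`, `demo_mutex`, and the 64-byte padding are invented for illustration, not the real btrfs layout). Taking a member address through a NULL pointer is formally undefined in ISO C, so the sketch performs the same arithmetic on integers with offsetof instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel types; the layout is illustrative only. */
struct demo_mutex {
	int locked;
};

struct demo_fs_info {
	char padding[64];               /* other fields that precede the mutex */
	struct demo_mutex chunk_mutex;
};

/*
 * Mirrors the arithmetic behind &fs_info->chunk_mutex: the member address
 * is the base address plus the member offset. Nothing is dereferenced, so
 * a zero base simply yields the offset itself, a small non-NULL value.
 */
static uintptr_t member_address(uintptr_t base)
{
	return base + offsetof(struct demo_fs_info, chunk_mutex);
}
```

With a zero base the result is a non-zero "pointer", which is why a later NULL check on the derived lock pointer would not catch the problem.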
-
Committed by Josef Bacik
While running btrfs/011 in a loop I would often ASSERT() while trying to add a new free space entry that already existed, or get an EEXIST while adding a new block to the extent tree, which is another indication of double allocation.

This occurs because when we do the free space tree population, we create the new root and then populate the tree and commit the transaction. The problem is when you create a new root, the root node and commit root node are the same. During this initial transaction commit we will run all of the delayed refs that were paused during the free space tree generation, and thus begin to cache block groups. While caching block groups the caching thread will be reading from the main root for the free space tree, so as we make allocations we'll be changing the free space tree, which can cause us to add the same range twice, which results in either the ASSERT(ret != -EEXIST); in __btrfs_add_free_space, or in a variety of different errors when running delayed refs because of a double allocation.

Fix this by marking the fs_info as unsafe to load the free space tree, and fall back on the old slow method. We could be smarter than this, for example caching the block group while we're populating the free space tree, but since this is a serious problem I've opted for the simplest solution.

CC: stable@vger.kernel.org # 4.9+
Fixes: a5ed9182 ("Btrfs: implement the free space B-tree")
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
-
- 25 Jan, 2021 8 commits
-
-
Committed by Trond Myklebust
If a layoutget ends up being reordered w.r.t. a layoutreturn, e.g. due to a layoutget-on-open not knowing a priori which file to lock, then we must assume the layout is no longer being considered valid state by the server. Incrementally improve our ability to reject such states by using the cached old stateid in conjunction with the plh_barrier to try to identify them. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
-
Committed by Trond Myklebust
When we're scheduling a layoutreturn, we need to ignore any further incoming layouts with sequence ids that are going to be affected by the layout return. Fixes: 44ea8dfc ("NFS/pnfs: Reference the layout cred in pnfs_prepare_layoutreturn()") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
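Rejecting layouts by sequence id comes down to a wraparound-safe comparison against a barrier value. The following is a minimal userspace sketch of that idea, not the code from the patch: `seqid_is_newer()` and `layout_is_blocked()` are illustrative names, and treating a barrier of 0 as "no barrier set" is an assumption made for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Serial-number comparison: true if s1 is newer than s2 even when the
 * 32-bit counter has wrapped. Relies on the usual two's-complement
 * conversion from uint32_t to int32_t.
 */
static bool seqid_is_newer(uint32_t s1, uint32_t s2)
{
	return (int32_t)(s1 - s2) > 0;
}

/*
 * An incoming layout is ignored when a barrier is set and the layout's
 * sequence id is not newer than that barrier.
 * Assumption: barrier == 0 means no layoutreturn is pending.
 */
static bool layout_is_blocked(uint32_t incoming_seqid, uint32_t barrier)
{
	return barrier != 0 && !seqid_is_newer(incoming_seqid, barrier);
}
```

With a barrier of 5, an incoming sequence id of 3 or 5 is blocked while 6 is accepted; with no barrier set, nothing is blocked.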
-
Committed by Trond Myklebust
If the server returns a new stateid that does not match the one in our cache, then try to return the one we hold instead of just invalidating it on the client side. This ensures that both client and server will agree that the stateid is invalid. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
-
Committed by Trond Myklebust
If the server returns a new stateid that does not match the one in our cache, then pnfs_layout_process() will leak the layout segments returned by pnfs_mark_layout_stateid_invalid(). Fixes: 9888d837 ("pNFS: Force a retry of LAYOUTGET if the stateid doesn't match our cache") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
-
Committed by Jens Axboe
This normally doesn't cause any extra harm, but it does mean that we'll increment the eventfd notification count, if one has been registered with the ring. This can confuse applications when they see more notifications on the eventfd side than are available in the ring. Do the nice thing and only increment this count if we actually posted (or even overflowed) events. Reported-and-tested-by: Dan Melnic <dmm@fb.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
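The rule "notify only if something was posted" can be shown with a toy model. This is a sketch under assumptions, not io_uring code: `toy_ring` and `toy_flush_overflow()` are hypothetical, and the eventfd is modelled as a plain counter. The sketch covers only the case where pending completions are moved into the ring.

```c
#include <stdbool.h>

/* Hypothetical model of a completion ring with an optional eventfd. */
struct toy_ring {
	unsigned int cq_free;		/* free slots in the completion ring */
	unsigned int overflow;		/* completions waiting for a slot */
	unsigned int eventfd_count;	/* notifications signalled so far */
	bool has_eventfd;		/* an eventfd was registered */
};

/*
 * Move pending completions into the ring while there is room.
 * The eventfd is signalled only when at least one completion was
 * actually posted, so the notification count never gets ahead of
 * what the application can find in the ring.
 */
static unsigned int toy_flush_overflow(struct toy_ring *r)
{
	unsigned int posted = 0;

	while (r->overflow > 0 && r->cq_free > 0) {
		r->overflow--;
		r->cq_free--;
		posted++;
	}

	if (posted > 0 && r->has_eventfd)
		r->eventfd_count++;

	return posted;
}
```

Calling the function on a ring with nothing pending returns 0 and leaves `eventfd_count` unchanged, which is the behaviour the message above asks for.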
-
Committed by Jens Axboe
Ensure we match tasks that belong to a dead or dying task as well, as we need to reap those in addition to those belonging to the exiting task. Cc: stable@vger.kernel.org # 5.9+ Reported-by: Josef Grieb <josef.grieb@gmail.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
-
Committed by Xiaoming Ni
process_sysctl_arg() does not check whether val is empty before invoking strlen(val). If the command line parameter is incorrectly configured and val is empty, an oops is triggered. For example, if "hung_task_panic=1" is incorrectly written as "hung_task_panic", an oops is triggered. The call stack is as follows: Kernel command line: .... hung_task_panic ...... Call trace: __pi_strlen+0x10/0x98 parse_args+0x278/0x344 do_sysctl_args+0x8c/0xfc kernel_init+0x5c/0xf4 ret_from_fork+0x10/0x30 To fix it, check whether "val" is empty when "param" is a sysctl field. Error codes are returned in the failure branch, and error logs are generated by parse_args(). Link: https://lkml.kernel.org/r/20210118133029.28580-1-nixiaoming@huawei.com Fixes: 3db978d4 ("kernel/sysctl: support setting sysctl parameters from kernel command line") Signed-off-by: Xiaoming Ni <nixiaoming@huawei.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: Iurii Zaikin <yzaikin@google.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Heiner Kallweit <hkallweit1@gmail.com> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: <stable@vger.kernel.org> [5.8+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
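The shape of the fix, checking the value before measuring it, might look like the userspace sketch below. It is not the kernel function: `sysctl_arg_len()` is a hypothetical helper, and the assumptions that sysctl parameters carry a "sysctl." prefix here and that a parameter written without "=value" arrives with a NULL value pointer are made only for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for a command-line parameter handler.
 * Returns 0 on success and -EINVAL when a sysctl parameter was
 * given without a value. On success for a sysctl parameter, *len
 * receives the length of the value.
 */
static int sysctl_arg_len(const char *param, const char *val, size_t *len)
{
	static const char prefix[] = "sysctl.";

	assert(param != NULL && len != NULL);

	/* Parameters outside the sysctl namespace are left alone. */
	if (strncmp(param, prefix, sizeof(prefix) - 1) != 0)
		return 0;

	/* "name" without "=value": there is no string to measure. */
	if (val == NULL)
		return -EINVAL;

	/* Only now is it safe to call strlen(). */
	*len = strlen(val);
	return 0;
}
```

Returning an error code rather than silently skipping the parameter leaves it to the caller to report the malformed option, which matches the message's note that the error log is produced by parse_args().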
-
Committed by Jens Axboe
We need to actively cancel anything that introduces a potential circular loop, where io_uring holds a reference to itself. If the file in question is an io_uring file, then add the request to the inflight list. Cc: stable@vger.kernel.org # 5.9+ Signed-off-by: Jens Axboe <axboe@kernel.dk>
-