- 15 4月, 2021 40 次提交
-
-
由 Jens Axboe 提交于
mainline inclusion from mainline-5.3-rc1 commit aa1fa28f category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA --------------------------- This is done through IORING_OP_RECVMSG. This opcode uses the same sqe->msg_flags that IORING_OP_SENDMSG added, and we pass in the msghdr struct in the sqe->addr field as well. We use MSG_DONTWAIT to force an inline fast path if recvmsg() doesn't block, and punt to async execution if it would have. Acked-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Nyangerkun <yangerkun@huawei.com> Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com> Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>
-
由 Jens Axboe 提交于
mainline inclusion from mainline-5.3-rc1 commit 0fa03c62 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA --------------------------- This is done through IORING_OP_SENDMSG. There's a new sqe->msg_flags for the flags argument, and the msghdr struct is passed in the sqe->addr field. We use MSG_DONTWAIT to force an inline fast path if sendmsg() doesn't block, and punt to async execution if it would have. Acked-by: NDavid S. Miller <davem@davemloft.net> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Nyangerkun <yangerkun@huawei.com> Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com> Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>
-
由 Jens Axboe 提交于
mainline inclusion from mainline-5.3-rc1 commit 9e645e11 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA --------------------------- With SQE links, we can create chains of dependent SQEs. One example would be queueing an SQE that's a read from one file descriptor, with the linked SQE being a write to another with the same set of buffers. An SQE link will not stall the pipeline, it'll just ensure that dependent SQEs aren't issued before the previous link has completed. Any error at submission or completion time will break the chain of SQEs. For completions, this also includes short reads or writes, as the next SQE could depend on the previous one being fully completed. Any SQE in a chain that gets canceled due to any of the above errors, will get an CQE fill with -ECANCELED as the error value. Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Nyangerkun <yangerkun@huawei.com> Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com> Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>
-
由 Jens Axboe 提交于
mainline inclusion from mainline-5.3-rc1 commit 9d93a3f5 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA --------------------------- We can encounter a short read when we're doing buffered reads and the data is partially cached. Right now we just return the short read, but that forces the application to read that CQE, then issue another SQE to finish the read. That read will not be cached, and hence will result in an async punt. It's more efficient to do that async punt from within the kernel, as that will the not need two round trips more to the kernel. Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Nyangerkun <yangerkun@huawei.com> Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com> Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>
-
由 Jens Axboe 提交于
mainline inclusion from mainline-5.3-rc1 commit 87e5e6da category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA --------------------------- Currently these functions return < 0 on error, and 0 for success. Change that so that we return < 0 on error, but number of bytes for success. Some callers already treat the return value that way, others need a slight tweak. Signed-off-by: NJens Axboe <axboe@kernel.dk> Conflicts: include/linux/uio.h [ Patch d05f4435("iov_iter: introduce hash_and_copy_to_iter helper") is not appplied. ] Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Nyangerkun <yangerkun@huawei.com> Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com> Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>
-
由 Jens Axboe 提交于
mainline inclusion from mainline-5.2-rc7 commit 60c112b0 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA --------------------------- Stephen reports: I hit the following General Protection Fault when testing io_uring via the io_uring engine in fio. This was on a VM running 5.2-rc5 and the latest version of fio. The issue occurs for both null_blk and fake NVMe drives. I have not tested bare metal or real NVMe SSDs. The fio script used is given below. [io_uring] time_based=1 runtime=60 filename=/dev/nvme2n1 (note /dev/nullb0 also fails) ioengine=io_uring bs=4k rw=readwrite direct=1 fixedbufs=1 sqthread_poll=1 sqthread_poll_cpu=0 general protection fault: 0000 [#1] SMP PTI CPU: 0 PID: 872 Comm: io_uring-sq Not tainted 5.2.0-rc5-cpacket-io-uring #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014 RIP: 0010:fput_many+0x7/0x90 Code: 01 48 85 ff 74 17 55 48 89 e5 53 48 8b 1f e8 a0 f9 ff ff 48 85 db 48 89 df 75 f0 5b 5d f3 c3 0f 1f 40 00 0f 1f 44 00 00 89 f6 <f0> 48 29 77 38 74 01 c3 55 48 89 e5 53 48 89 fb 65 48 \ RSP: 0018:ffffadeb817ebc50 EFLAGS: 00010246 RAX: 0000000000000004 RBX: ffff8f46ad477480 RCX: 0000000000001805 RDX: 0000000000000000 RSI: 0000000000000001 RDI: f18b51b9a39552b5 RBP: ffffadeb817ebc58 R08: ffff8f46b7a318c0 R09: 000000000000015d R10: ffffadeb817ebce8 R11: 0000000000000020 R12: ffff8f46ad4cd000 R13: 00000000fffffff7 R14: ffffadeb817ebe30 R15: 0000000000000004 FS: 0000000000000000(0000) GS:ffff8f46b7a00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055828f0bbbf0 CR3: 0000000232176004 CR4: 00000000003606f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: ? fput+0x13/0x20 io_free_req+0x20/0x40 io_put_req+0x1b/0x20 io_submit_sqe+0x40a/0x680 ? __switch_to_asm+0x34/0x70 ? __switch_to_asm+0x40/0x70 io_submit_sqes+0xb9/0x160 ? io_submit_sqes+0xb9/0x160 ? __switch_to_asm+0x40/0x70 ? __switch_to_asm+0x34/0x70 ? __schedule+0x3f2/0x6a0 ? __switch_to_asm+0x34/0x70 io_sq_thread+0x1af/0x470 ? __switch_to_asm+0x34/0x70 ? wait_woken+0x80/0x80 ? __switch_to+0x85/0x410 ? __switch_to_asm+0x40/0x70 ? __switch_to_asm+0x34/0x70 ? __schedule+0x3f2/0x6a0 kthread+0x105/0x140 ? io_submit_sqes+0x160/0x160 ? kthread+0x105/0x140 ? io_submit_sqes+0x160/0x160 ? kthread_destroy_worker+0x50/0x50 ret_from_fork+0x35/0x40 which occurs because using a kernel side submission thread isn't valid without using fixed files (registered through io_uring_register()). This causes io_uring to put the request after logging an error, but before the file field is set in the request. If it happens to be non-zero, we attempt to fput() garbage. Fix this by ensuring that req->file is initialized when the request is allocated. Cc: stable@vger.kernel.org # 5.1+ Reported-by: NStephen Bates <sbates@raithlin.com> Tested-by: NStephen Bates <sbates@raithlin.com> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Nyangerkun <yangerkun@huawei.com> Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com> Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>
-
由 Eric Biggers 提交于
mainline inclusion from mainline-5.2-rc5 commit 355e8d26 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA --------------------------- Opening and closing an io_uring instance leaks a UNIX domain socket inode. This is because the ->file of the io_uring instance's internal UNIX domain socket is set to point to the io_uring file, but then sock_release() sees the non-NULL ->file and assumes the inode reference is held by the file so doesn't call iput(). That's not the case here, since the reference is still meant to be held by the socket; the actual inode of the io_uring file is different. Fix this leak by NULL-ing out ->file before releasing the socket. Reported-by: syzbot+111cb28d9f583693aefa@syzkaller.appspotmail.com Fixes: 2b188cc1 ("Add io_uring IO interface") Cc: <stable@vger.kernel.org> # v5.1+ Signed-off-by: NEric Biggers <ebiggers@google.com> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Nyangerkun <yangerkun@huawei.com> Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com> Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>
-
由 Pavel Begunkov 提交于
mainline inclusion from mainline-5.2-rc3 commit a278682d category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA --------------------------- If io_copy_iov() fails, it will break the loop and report success, albeit partially completed operation. Signed-off-by: NPavel Begunkov <asml.silence@gmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Nyangerkun <yangerkun@huawei.com> Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com> Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>
-
由 Jens Axboe 提交于
mainline inclusion from mainline-5.2-rc2 commit 004d564f category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA --------------------------- Various fixes and changes have been applied to liburing since we copied some select bits to the kernel testing/examples part, sync up with liburing to get those changes. Most notable is the change that split the CQE reading into the peek and seen event, instead of being just a single function. Also fixes an unsigned wrap issue in io_uring_submit(), leak of 'fd' in setup if we fail, and various other little issues. Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Nyangerkun <yangerkun@huawei.com> Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com> Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>
-
由 Jens Axboe 提交于
mainline inclusion from mainline-5.2-rc2 commit 486f0692 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA --------------------------- Currently fails with: io_uring-bench.o: In function `main': /home/axboe/git/linux-block/tools/io_uring/io_uring-bench.c:560: undefined reference to `pthread_create' /home/axboe/git/linux-block/tools/io_uring/io_uring-bench.c:588: undefined reference to `pthread_join' collect2: error: ld returned 1 exit status Makefile:11: recipe for target 'io_uring-bench' failed make: *** [io_uring-bench] Error 1 Move -lpthread to the end. Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Nyangerkun <yangerkun@huawei.com> Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com> Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>
-
由 Oleg Nesterov 提交于
mainline inclusion from mainline-5.3-rc1 commit ac301020 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA --------------------------- Now that restore_saved_sigmask_unless() is always called with the same argument right before poll_select_copy_remaining() we can move it into poll_select_copy_remaining() and make it the only caller of restore() in fs/select.c. The patch also renames poll_select_copy_remaining(), poll_select_finish() looks better after this change. kern_select() doesn't use set_user_sigmask(), so in this case poll_select_finish() does restore_saved_sigmask_unless() "for no reason". But this won't hurt, and WARN_ON(!TIF_SIGPENDING) is still valid. Link: http://lkml.kernel.org/r/20190606140915.GC13440@redhat.comSigned-off-by: NOleg Nesterov <oleg@redhat.com> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: David Laight <David.Laight@aculab.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Deepa Dinamani <deepa.kernel@gmail.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Eric Wong <e@80x24.org> Cc: Jason Baron <jbaron@akamai.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Nyangerkun <yangerkun@huawei.com> Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com> Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>
-
由 Oleg Nesterov 提交于
mainline inclusion from mainline-5.3-rc1 commit 8cf8b553 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA --------------------------- do_poll() returns -EINTR if interrupted and after that all its callers have to translate it into -ERESTARTNOHAND. Change do_poll() to return -ERESTARTNOHAND and update (simplify) the callers. Note that this also unifies all users of restore_saved_sigmask_unless(), see the next patch. Linus: : The *right* return value will actually be then chosen by : poll_select_copy_remaining(), which will turn ERESTARTNOHAND to EINTR : when it can't update the timeout. : : Except for the cases that use restart_block and do that instead and : don't have the whole timeout restart issue as a result. Link: http://lkml.kernel.org/r/20190606140852.GB13440@redhat.comSigned-off-by: NOleg Nesterov <oleg@redhat.com> Acked-by: NLinus Torvalds <torvalds@linux-foundation.org> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Arnd Bergmann <arnd@arndb.de> Cc: David Laight <David.Laight@aculab.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Deepa Dinamani <deepa.kernel@gmail.com> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Eric Wong <e@80x24.org> Cc: Jason Baron <jbaron@akamai.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com> Conflicts: fs/select.c Signed-off-by: Nyangerkun <yangerkun@huawei.com> Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com> Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>
-
由 Oleg Nesterov 提交于
mainline inclusion from mainline-5.3-rc1 commit b772434b category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA --------------------------- task->saved_sigmask and ->restore_sigmask are only used in the ret-from- syscall paths. This means that set_user_sigmask() can save ->blocked in ->saved_sigmask and do set_restore_sigmask() to indicate that ->blocked was modified. This way the callers do not need 2 sigset_t's passed to set/restore and restore_user_sigmask() renamed to restore_saved_sigmask_unless() turns into the trivial helper which just calls restore_saved_sigmask(). Link: http://lkml.kernel.org/r/20190606113206.GA9464@redhat.comSigned-off-by: NOleg Nesterov <oleg@redhat.com> Cc: Deepa Dinamani <deepa.kernel@gmail.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Eric Wong <e@80x24.org> Cc: Jason Baron <jbaron@akamai.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: David Laight <David.Laight@aculab.com> Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Conflicts: fs/select.c [ Patch 9afc5eee("y2038: globally rename compat_time to old_time32") is not applied. ] [ Patch 8dabe724("y2038: syscalls: rename y2038 compat syscalls") is not applied. ] Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Nyangerkun <yangerkun@huawei.com> Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com> Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>
-
由 Oleg Nesterov 提交于
mainline inclusion from mainline-5.2-rc7 commit 97abc889 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA --------------------------- This is the minimal fix for stable, I'll send cleanups later. Commit 854a6ed5 ("signal: Add restore_user_sigmask()") introduced the visible change which breaks user-space: a signal temporary unblocked by set_user_sigmask() can be delivered even if the caller returns success or timeout. Change restore_user_sigmask() to accept the additional "interrupted" argument which should be used instead of signal_pending() check, and update the callers. Eric said: : For clarity. I don't think this is required by posix, or fundamentally to : remove the races in select. It is what linux has always done and we have : applications who care so I agree this fix is needed. : : Further in any case where the semantic change that this patch rolls back : (aka where allowing a signal to be delivered and the select like call to : complete) would be advantage we can do as well if not better by using : signalfd. : : Michael is there any chance we can get this guarantee of the linux : implementation of pselect and friends clearly documented. The guarantee : that if the system call completes successfully we are guaranteed that no : signal that is unblocked by using sigmask will be delivered? Link: http://lkml.kernel.org/r/20190604134117.GA29963@redhat.com Fixes: 854a6ed5 ("signal: Add restore_user_sigmask()") Signed-off-by: NOleg Nesterov <oleg@redhat.com> Reported-by: NEric Wong <e@80x24.org> Tested-by: NEric Wong <e@80x24.org> Acked-by: N"Eric W. Biederman" <ebiederm@xmission.com> Acked-by: NArnd Bergmann <arnd@arndb.de> Acked-by: NDeepa Dinamani <deepa.kernel@gmail.com> Cc: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Jason Baron <jbaron@akamai.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Al Viro <viro@ZenIV.linux.org.uk> Cc: David Laight <David.Laight@ACULAB.COM> Cc: <stable@vger.kernel.org> [5.0+] Signed-off-by: NAndrew Morton <akpm@linux-foundation.org> Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org> Conflicts: fs/aio.c [ Patch 9afc5eee("y2038: globally rename compat_time to old_time32") is not applied. ] Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Nyangerkun <yangerkun@huawei.com> Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com> Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>
-
由 Jackie Liu 提交于
mainline inclusion from mainline-5.2-rc1 commit fdb288a6 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA --------------------------- The previous patch has ensured that io_cqring_events contain smp_rmb memory barriers, Now we can use wait_event_interruptible to keep the code simple. Signed-off-by: NJackie Liu <liuyun01@kylinos.cn> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Nyangerkun <yangerkun@huawei.com> Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com> Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>
-
由 Jackie Liu 提交于
mainline inclusion from mainline-5.2-rc1 commit dc6ce4bc category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA --------------------------- Whenever smp_rmb is required to use io_cqring_events, keep smp_rmb inside the function io_cqring_events. Signed-off-by: NJackie Liu <liuyun01@kylinos.cn> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Nyangerkun <yangerkun@huawei.com> Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com> Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>
-
由 Roman Penyaev 提交于
mainline inclusion from mainline-5.2-rc1 commit 2bbcd6d3 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA --------------------------- This fixes couple of races which lead to infinite wait of park completion with the following backtraces: [20801.303319] Call Trace: [20801.303321] ? __schedule+0x284/0x650 [20801.303323] schedule+0x33/0xc0 [20801.303324] schedule_timeout+0x1bc/0x210 [20801.303326] ? schedule+0x3d/0xc0 [20801.303327] ? schedule_timeout+0x1bc/0x210 [20801.303329] ? preempt_count_add+0x79/0xb0 [20801.303330] wait_for_completion+0xa5/0x120 [20801.303331] ? wake_up_q+0x70/0x70 [20801.303333] kthread_park+0x48/0x80 [20801.303335] io_finish_async+0x2c/0x70 [20801.303336] io_ring_ctx_wait_and_kill+0x95/0x180 [20801.303338] io_uring_release+0x1c/0x20 [20801.303339] __fput+0xad/0x210 [20801.303341] task_work_run+0x8f/0xb0 [20801.303342] exit_to_usermode_loop+0xa0/0xb0 [20801.303343] do_syscall_64+0xe0/0x100 [20801.303349] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [20801.303380] Call Trace: [20801.303383] ? __schedule+0x284/0x650 [20801.303384] schedule+0x33/0xc0 [20801.303386] io_sq_thread+0x38a/0x410 [20801.303388] ? __switch_to_asm+0x40/0x70 [20801.303390] ? wait_woken+0x80/0x80 [20801.303392] ? _raw_spin_lock_irqsave+0x17/0x40 [20801.303394] ? io_submit_sqes+0x120/0x120 [20801.303395] kthread+0x112/0x130 [20801.303396] ? kthread_create_on_node+0x60/0x60 [20801.303398] ret_from_fork+0x35/0x40 o kthread_park() waits for park completion, so io_sq_thread() loop should check kthread_should_park() along with khread_should_stop(), otherwise if kthread_park() is called before prepare_to_wait() the following schedule() never returns: CPU#0 CPU#1 io_sq_thread_stop(): io_sq_thread(): while(!kthread_should_stop() && !ctx->sqo_stop) { ctx->sqo_stop = 1; kthread_park() prepare_to_wait(); if (kthread_should_stop() { } schedule(); <<< nobody checks park flag, <<< so schedule and never return o if the flag ctx->sqo_stop is observed by the io_sq_thread() loop it is quite possible, that kthread_should_park() check and the following kthread_parkme() is never called, because kthread_park() has not been yet called, but few moments later is is called and waits there for park completion, which never happens, because kthread has already exited: CPU#0 CPU#1 io_sq_thread_stop(): io_sq_thread(): ctx->sqo_stop = 1; while(!kthread_should_stop() && !ctx->sqo_stop) { <<< observe sqo_stop and exit the loop } if (kthread_should_park()) kthread_parkme(); <<< never called, since was <<< never parked kthread_park() <<< waits forever for park completion In the current patch we quit the loop by only kthread_should_park() check (kthread_park() is synchronous, so kthread_should_stop() is never observed), and we abandon ->sqo_stop flag, since it is racy. At the end of the io_sq_thread() we unconditionally call parmke(), since we've exited the loop by the park flag. Signed-off-by: NRoman Penyaev <rpenyaev@suse.de> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-block@vger.kernel.org Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Nyangerkun <yangerkun@huawei.com> Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com> Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>
-
由 Jens Axboe 提交于
mainline inclusion from mainline-5.2-rc1 commit c71ffb67 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA --------------------------- We always pass in 0 for the cqe flags argument, since the support for "this read hit page cache" hint was dropped. Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Nyangerkun <yangerkun@huawei.com> Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com> Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>
-
由 Jens Axboe 提交于
mainline inclusion from mainline-5.2-rc1 commit 44a9bd18 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA --------------------------- The test case we have is rightfully failing with the current kernel: io_uring_setup(1, 0x7ffe2cafebe0), flags: IORING_SETUP_SQPOLL|IORING_SETUP_SQ_AFF, resv: 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000, sq_thread_cpu: 4 expected -1, got 3 This is in a vm, and CPU3 is the last valid one, hence asking for 4 should fail the setup with -EINVAL, not succeed. The problem is that we're using array_index_nospec() with nr_cpu_ids as the index, hence we wrap and end up using CPU0 instead of CPU4. This makes the setup succeed where it should be failing. We don't need to use array_index_nospec() as we're not indexing any array with this. Instead just compare with nr_cpu_ids directly. This is fine as we're checking with cpu_online() afterwards. Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Nyangerkun <yangerkun@huawei.com> Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com> Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>
-
由 Stefan Bühler 提交于
mainline inclusion from mainline-5.2-rc1 commit e2033e33 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA --------------------------- When punting to workers the SQE gets copied after the initial try. There is a race condition between reading SQE data for the initial try and copying it for punting it to the workers. For example io_rw_done calls kiocb->ki_complete even if it was prepared for IORING_OP_FSYNC (and would be NULL). The easiest solution for now is to alway prepare again in the worker. req->file is safe to prepare though as long as it is checked before use. Signed-off-by: NStefan Bühler <source@stbuehler.de> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Nyangerkun <yangerkun@huawei.com> Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com> Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>
-
由 Shenghui Wang 提交于
mainline inclusion from mainline-5.2-rc1 commit 7889f44d category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA --------------------------- This issue is found by running liburing/test/io_uring_setup test. When test run, the testcase "attempt to bind to invalid cpu" would not pass with messages like: io_uring_setup(1, 0xbfc2f7c8), \ flags: IORING_SETUP_SQPOLL|IORING_SETUP_SQ_AFF, \ resv: 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000, \ sq_thread_cpu: 2 expected -1, got 3 FAIL On my system, there is: CPU(s) possible : 0-3 CPU(s) online : 0-1 CPU(s) offline : 2-3 CPU(s) present : 0-1 The sq_thread_cpu 2 is offline on my system, so the bind should fail. But cpu_possible() will pass the check. We shouldn't be able to bind to an offline cpu. Use cpu_online() to do the check. After the change, the testcase run as expected: EINVAL will be returned for cpu offlined. Reviewed-by: NJeff Moyer <jmoyer@redhat.com> Signed-off-by: NShenghui Wang <shhuiw@foxmail.com> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Nyangerkun <yangerkun@huawei.com> Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com> Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>
-
由 Colin Ian King 提交于
mainline inclusion from mainline-5.2-rc1 commit efeb862b category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA --------------------------- Currently variable ret is declared in a while-loop code block that shadows another variable ret. When an error occurs in the while-loop the error return in ret is not being set in the outer code block and so the error check on ret is always going to be checking on the wrong ret variable resulting in check that is always going to be true and a premature return occurs. Fix this by removing the declaration of the inner while-loop variable ret so that shadowing does not occur. Addresses-Coverity: ("'Constant' variable guards dead code") Fixes: 6b06314c ("io_uring: add file set registration") Signed-off-by: NColin Ian King <colin.king@canonical.com> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Nyangerkun <yangerkun@huawei.com> Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com> Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>
-
由 Stefan Bühler 提交于
mainline inclusion from mainline-5.2-rc1 commit 5dcf877f category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA --------------------------- No need to set it in io_poll_add; io_poll_complete doesn't use it to set the result in the CQE. Signed-off-by: NStefan Bühler <source@stbuehler.de> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Nyangerkun <yangerkun@huawei.com> Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com> Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>
-
由 Jens Axboe 提交于
mainline inclusion from mainline-5.2-rc1 commit 9b402849 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA --------------------------- Allow registration of an eventfd, which will trigger an event every time a completion event happens for this io_uring instance. Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Nyangerkun <yangerkun@huawei.com> Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com> Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>
-
由 Jens Axboe 提交于
mainline inclusion from mainline-5.2-rc1 commit 5d17b4a4 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA --------------------------- This behaves just like sync_file_range(2) does. Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Nyangerkun <yangerkun@huawei.com> Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com> Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>
-
由 Jens Axboe 提交于
mainline inclusion from mainline-5.2-rc1 commit de0617e4 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA --------------------------- There are no ordering constraints between the submission and completion side of io_uring. But sometimes that would be useful to have. One common example is doing an fsync, for instance, and have it ordered with previous writes. Without support for that, the application must do this tracking itself. This adds a general SQE flag, IOSQE_IO_DRAIN. If a command is marked with this flag, then it will not be issued before previous commands have completed, and subsequent commands submitted after the drain will not be issued before the drain is started.. If there are no pending commands, setting this flag will not change the behavior of the issue of the command. Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Nyangerkun <yangerkun@huawei.com> Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com> Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>
-
由 Jens Axboe 提交于
mainline inclusion from mainline-5.2-rc1 commit 22f96b38 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA --------------------------- This just pulls out the ksys_sync_file_range() code to work on a struct file instead of an fd, so we can use it elsewhere. Signed-off-by: NJens Axboe <axboe@kernel.dk> Conflicts: fs/sync.c include/linux/fs.h [ Patch c553ea4f("fs/sync.c: sync_file_range(2) may use WB_SYNC_ALL writeback") applied earlier. ] Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Nyangerkun <yangerkun@huawei.com> Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com> Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>
-
由 Mark Rutland 提交于
mainline inclusion from mainline-5.1 commit d4ef6475 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA --------------------------- In io_sqe_buffer_register() we allocate a number of arrays based on the iov_len from the user-provided iov. While we limit iov_len to SZ_1G, we can still attempt to allocate arrays exceeding MAX_ORDER. On a 64-bit system with 4KiB pages, for an iov where iov_base = 0x10 and iov_len = SZ_1G, we'll calculate that nr_pages = 262145. When we try to allocate a corresponding array of (16-byte) bio_vecs, requiring 4194320 bytes, which is greater than 4MiB. This results in SLUB warning that we're trying to allocate greater than MAX_ORDER, and failing the allocation. Avoid this by using kvmalloc() for allocations dependent on the user-provided iov_len. At the same time, fix a leak of imu->bvec when registration fails. Full splat from before this patch: WARNING: CPU: 1 PID: 2314 at mm/page_alloc.c:4595 __alloc_pages_nodemask+0x7ac/0x2938 mm/page_alloc.c:4595 Kernel panic - not syncing: panic_on_warn set ... CPU: 1 PID: 2314 Comm: syz-executor326 Not tainted 5.1.0-rc7-dirty #4 Hardware name: linux,dummy-virt (DT) Call trace: dump_backtrace+0x0/0x2f0 include/linux/compiler.h:193 show_stack+0x20/0x30 arch/arm64/kernel/traps.c:158 __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x110/0x190 lib/dump_stack.c:113 panic+0x384/0x68c kernel/panic.c:214 __warn+0x2bc/0x2c0 kernel/panic.c:571 report_bug+0x228/0x2d8 lib/bug.c:186 bug_handler+0xa0/0x1a0 arch/arm64/kernel/traps.c:956 call_break_hook arch/arm64/kernel/debug-monitors.c:301 [inline] brk_handler+0x1d4/0x388 arch/arm64/kernel/debug-monitors.c:316 do_debug_exception+0x1a0/0x468 arch/arm64/mm/fault.c:831 el1_dbg+0x18/0x8c __alloc_pages_nodemask+0x7ac/0x2938 mm/page_alloc.c:4595 alloc_pages_current+0x164/0x278 mm/mempolicy.c:2132 alloc_pages include/linux/gfp.h:509 [inline] kmalloc_order+0x20/0x50 mm/slab_common.c:1231 kmalloc_order_trace+0x30/0x2b0 mm/slab_common.c:1243 kmalloc_large include/linux/slab.h:480 [inline] __kmalloc+0x3dc/0x4f0 mm/slub.c:3791 kmalloc_array include/linux/slab.h:670 [inline] io_sqe_buffer_register fs/io_uring.c:2472 [inline] __io_uring_register fs/io_uring.c:2962 [inline] __do_sys_io_uring_register fs/io_uring.c:3008 [inline] __se_sys_io_uring_register fs/io_uring.c:2990 [inline] __arm64_sys_io_uring_register+0x9e0/0x1bc8 fs/io_uring.c:2990 __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline] invoke_syscall arch/arm64/kernel/syscall.c:47 [inline] el0_svc_common.constprop.0+0x148/0x2e0 arch/arm64/kernel/syscall.c:83 el0_svc_handler+0xdc/0x100 arch/arm64/kernel/syscall.c:129 el0_svc+0x8/0xc arch/arm64/kernel/entry.S:948 SMP: stopping secondary CPUs Dumping ftrace buffer: (ftrace buffer empty) Kernel Offset: disabled CPU features: 0x002,23000438 Memory Limit: none Rebooting in 1 seconds.. Fixes: edafccee ("io_uring: add support for pre-mapped user IO buffers") Signed-off-by: NMark Rutland <mark.rutland@arm.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Jens Axboe <axboe@kernel.dk> Cc: linux-fsdevel@vger.kernel.org Cc: linux-block@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Nyangerkun <yangerkun@huawei.com> Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com> Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>
-
由 Jens Axboe 提交于
mainline inclusion from mainline-5.1 commit 817869d2 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA --------------------------- If we don't end up actually calling submit in io_sq_wq_submit_work(), we still need to drop the submit reference to the request. If we don't, then we can leak the request. This can happen if we race with ring shutdown while flushing the workqueue for requests that require use of the mm_struct. Fixes: e65ef56d ("io_uring: use regular request ref counts") Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Nyangerkun <yangerkun@huawei.com> Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com> Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>
-
由 Mark Rutland 提交于
mainline inclusion from mainline-5.1 commit 52e04ef4 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA --------------------------- If io_allocate_scq_urings() fails to allocate an sq_* region, it will call io_mem_free() for any previously allocated regions, but leave dangling pointers to these regions in the ctx. Any regions which have not yet been allocated are left NULL. Note that when returning -EOVERFLOW, the previously allocated sq_ring is not freed, which appears to be an unintentional leak. When io_allocate_scq_urings() fails, io_uring_create() will call io_ring_ctx_wait_and_kill(), which calls io_mem_free() on all the sq_* regions, assuming the pointers are valid and not NULL. This can result in pages being freed multiple times, which has been observed to corrupt the page state, leading to subsequent fun. This can also result in virt_to_page() on NULL, resulting in the use of bogus page addresses, and yet more subsequent fun. The latter can be detected with CONFIG_DEBUG_VIRTUAL on arm64. Adding a cleanup path to io_allocate_scq_urings() complicates the logic, so let's leave it to io_ring_ctx_free() to consistently free these pointers, and simplify the io_allocate_scq_urings() error paths. Full splats from before this patch below. Note that the pointer logged by the DEBUG_VIRTUAL "non-linear address" warning has been hashed, and is actually NULL. [ 26.098129] page:ffff80000e949a00 count:0 mapcount:-128 mapping:0000000000000000 index:0x0 [ 26.102976] flags: 0x63fffc000000() [ 26.104373] raw: 000063fffc000000 ffff80000e86c188 ffff80000ea3df08 0000000000000000 [ 26.108917] raw: 0000000000000000 0000000000000001 00000000ffffff7f 0000000000000000 [ 26.137235] page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0) [ 26.143960] ------------[ cut here ]------------ [ 26.146020] kernel BUG at include/linux/mm.h:547! [ 26.147586] Internal error: Oops - BUG: 0 [#1] PREEMPT SMP [ 26.149163] Modules linked in: [ 26.150287] Process syz-executor.21 (pid: 20204, stack limit = 0x000000000e9cefeb) [ 26.153307] CPU: 2 PID: 20204 Comm: syz-executor.21 Not tainted 5.1.0-rc7-00004-g7d30b2ea43d6 #18 [ 26.156566] Hardware name: linux,dummy-virt (DT) [ 26.158089] pstate: 40400005 (nZcv daif +PAN -UAO) [ 26.159869] pc : io_mem_free+0x9c/0xa8 [ 26.161436] lr : io_mem_free+0x9c/0xa8 [ 26.162720] sp : ffff000013003d60 [ 26.164048] x29: ffff000013003d60 x28: ffff800025048040 [ 26.165804] x27: 0000000000000000 x26: ffff800025048040 [ 26.167352] x25: 00000000000000c0 x24: ffff0000112c2820 [ 26.169682] x23: 0000000000000000 x22: 0000000020000080 [ 26.171899] x21: ffff80002143b418 x20: ffff80002143b400 [ 26.174236] x19: ffff80002143b280 x18: 0000000000000000 [ 26.176607] x17: 0000000000000000 x16: 0000000000000000 [ 26.178997] x15: 0000000000000000 x14: 0000000000000000 [ 26.181508] x13: 00009178a5e077b2 x12: 0000000000000001 [ 26.183863] x11: 0000000000000000 x10: 0000000000000980 [ 26.186437] x9 : ffff000013003a80 x8 : ffff800025048a20 [ 26.189006] x7 : ffff8000250481c0 x6 : ffff80002ffe9118 [ 26.191359] x5 : ffff80002ffe9118 x4 : 0000000000000000 [ 26.193863] x3 : ffff80002ffefe98 x2 : 44c06ddd107d1f00 [ 26.196642] x1 : 0000000000000000 x0 : 000000000000003e [ 26.198892] Call trace: [ 26.199893] io_mem_free+0x9c/0xa8 [ 26.201155] io_ring_ctx_wait_and_kill+0xec/0x180 [ 26.202688] io_uring_setup+0x6c4/0x6f0 [ 26.204091] __arm64_sys_io_uring_setup+0x18/0x20 [ 26.205576] el0_svc_common.constprop.0+0x7c/0xe8 [ 26.207186] el0_svc_handler+0x28/0x78 [ 26.208389] el0_svc+0x8/0xc [ 26.209408] Code: aa0203e0 d0006861 9133a021 97fcdc3c (d4210000) [ 26.211995] ---[ end trace bdb81cd43a21e50d ]--- [ 81.770626] ------------[ cut here ]------------ [ 81.825015] virt_to_phys used for non-linear address: 000000000d42f2c7 ( (null)) [ 81.827860] WARNING: CPU: 1 PID: 30171 at arch/arm64/mm/physaddr.c:15 __virt_to_phys+0x48/0x68 [ 81.831202] Modules linked in: [ 81.832212] CPU: 1 PID: 30171 Comm: syz-executor.20 Not tainted 5.1.0-rc7-00004-g7d30b2ea43d6 #19 [ 81.835616] Hardware name: linux,dummy-virt (DT) [ 81.836863] pstate: 60400005 (nZCv daif +PAN -UAO) [ 81.838727] pc : __virt_to_phys+0x48/0x68 [ 81.840572] lr : __virt_to_phys+0x48/0x68 [ 81.842264] sp : ffff80002cf67c70 [ 81.843858] x29: ffff80002cf67c70 x28: ffff800014358e18 [ 81.846463] x27: 0000000000000000 x26: 0000000020000080 [ 81.849148] x25: 0000000000000000 x24: ffff80001bb01f40 [ 81.851986] x23: ffff200011db06c8 x22: ffff2000127e3c60 [ 81.854351] x21: ffff800014358cc0 x20: ffff800014358d98 [ 81.856711] x19: 0000000000000000 x18: 0000000000000000 [ 81.859132] x17: 0000000000000000 x16: 0000000000000000 [ 81.861586] x15: 0000000000000000 x14: 0000000000000000 [ 81.863905] x13: 0000000000000000 x12: ffff1000037603e9 [ 81.866226] x11: 1ffff000037603e8 x10: 0000000000000980 [ 81.868776] x9 : ffff80002cf67840 x8 : ffff80001bb02920 [ 81.873272] x7 : ffff1000037603e9 x6 : ffff80001bb01f47 [ 81.875266] x5 : ffff1000037603e9 x4 : dfff200000000000 [ 81.876875] x3 : ffff200010087528 x2 : ffff1000059ecf58 [ 81.878751] x1 : 44c06ddd107d1f00 x0 : 0000000000000000 [ 81.880453] Call trace: [ 81.881164] __virt_to_phys+0x48/0x68 [ 81.882919] io_mem_free+0x18/0x110 [ 81.886585] io_ring_ctx_wait_and_kill+0x13c/0x1f0 [ 81.891212] io_uring_setup+0xa60/0xad0 [ 81.892881] __arm64_sys_io_uring_setup+0x2c/0x38 [ 81.894398] el0_svc_common.constprop.0+0xac/0x150 [ 81.896306] el0_svc_handler+0x34/0x88 [ 81.897744] el0_svc+0x8/0xc [ 81.898715] ---[ end trace b4a703802243cbba ]--- Fixes: 2b188cc1 ("Add io_uring IO interface") Signed-off-by: NMark Rutland <mark.rutland@arm.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: linux-block@vger.kernel.org Cc: linux-fsdevel@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Nyangerkun <yangerkun@huawei.com> Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com> Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>
-
由 Mark Rutland 提交于
mainline inclusion from mainline-5.1 commit 975554b0 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA --------------------------- In io_sq_offload_start(), we call cpu_possible() on an unbounded cpu value from userspace. On v5.1-rc7 on arm64 with CONFIG_DEBUG_PER_CPU_MAPS, this results in a splat: WARNING: CPU: 1 PID: 27601 at include/linux/cpumask.h:121 cpu_max_bits_warn include/linux/cpumask.h:121 [inline] There was an attempt to fix this in commit: 917257da ("io_uring: only test SQPOLL cpu after we've verified it") ... by adding a check after the cpu value had been limited to NR_CPU_IDS using array_index_nospec(). However, this left an unbound check at the start of the function, for which the warning still fires. Let's fix this correctly by checking that the cpu value is bound by nr_cpu_ids before passing it to cpu_possible(). Note that only nr_cpu_ids of a cpumask are guaranteed to exist at runtime, and nr_cpu_ids can be significantly smaller than NR_CPUs. For example, an arm64 defconfig has NR_CPUS=256, while my test VM has 4 vCPUs. Following the intent from the commit message for 917257da, the check is moved under the SQ_AFF branch, which is the only branch where the cpu values is consumed. The check is performed before bounding the value with array_index_nospec() so that we don't silently accept bogus cpu values from userspace, where array_index_nospec() would force these values to 0. I suspect we can remove the array_index_nospec() call entirely, but I've conservatively left that in place, updated to use nr_cpu_ids to match the prior check. Tested on arm64 with the Syzkaller reproducer: https://syzkaller.appspot.com/bug?extid=cd714a07c6de2bc34293 https://syzkaller.appspot.com/x/repro.syz?x=15d8b397200000 Full splat from before this patch: WARNING: CPU: 1 PID: 27601 at include/linux/cpumask.h:121 cpu_max_bits_warn include/linux/cpumask.h:121 [inline] WARNING: CPU: 1 PID: 27601 at include/linux/cpumask.h:121 cpumask_check include/linux/cpumask.h:128 [inline] WARNING: CPU: 1 PID: 27601 at include/linux/cpumask.h:121 cpumask_test_cpu include/linux/cpumask.h:344 [inline] WARNING: CPU: 1 PID: 27601 at include/linux/cpumask.h:121 io_sq_offload_start fs/io_uring.c:2244 [inline] WARNING: CPU: 1 PID: 27601 at include/linux/cpumask.h:121 io_uring_create fs/io_uring.c:2864 [inline] WARNING: CPU: 1 PID: 27601 at include/linux/cpumask.h:121 io_uring_setup+0x1108/0x15a0 fs/io_uring.c:2916 Kernel panic - not syncing: panic_on_warn set ... CPU: 1 PID: 27601 Comm: syz-executor.0 Not tainted 5.1.0-rc7 #3 Hardware name: linux,dummy-virt (DT) Call trace: dump_backtrace+0x0/0x2f0 include/linux/compiler.h:193 show_stack+0x20/0x30 arch/arm64/kernel/traps.c:158 __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x110/0x190 lib/dump_stack.c:113 panic+0x384/0x68c kernel/panic.c:214 __warn+0x2bc/0x2c0 kernel/panic.c:571 report_bug+0x228/0x2d8 lib/bug.c:186 bug_handler+0xa0/0x1a0 arch/arm64/kernel/traps.c:956 call_break_hook arch/arm64/kernel/debug-monitors.c:301 [inline] brk_handler+0x1d4/0x388 arch/arm64/kernel/debug-monitors.c:316 do_debug_exception+0x1a0/0x468 arch/arm64/mm/fault.c:831 el1_dbg+0x18/0x8c cpu_max_bits_warn include/linux/cpumask.h:121 [inline] cpumask_check include/linux/cpumask.h:128 [inline] cpumask_test_cpu include/linux/cpumask.h:344 [inline] io_sq_offload_start fs/io_uring.c:2244 [inline] io_uring_create fs/io_uring.c:2864 [inline] io_uring_setup+0x1108/0x15a0 fs/io_uring.c:2916 __do_sys_io_uring_setup fs/io_uring.c:2929 [inline] __se_sys_io_uring_setup fs/io_uring.c:2926 [inline] __arm64_sys_io_uring_setup+0x50/0x70 fs/io_uring.c:2926 __invoke_syscall arch/arm64/kernel/syscall.c:35 [inline] invoke_syscall arch/arm64/kernel/syscall.c:47 [inline] el0_svc_common.constprop.0+0x148/0x2e0 arch/arm64/kernel/syscall.c:83 el0_svc_handler+0xdc/0x100 arch/arm64/kernel/syscall.c:129 el0_svc+0x8/0xc arch/arm64/kernel/entry.S:948 SMP: stopping secondary CPUs Dumping ftrace buffer: (ftrace buffer empty) Kernel Offset: disabled CPU features: 0x002,23000438 Memory Limit: none Rebooting in 1 seconds.. Fixes: 917257da ("io_uring: only test SQPOLL cpu after we've verified it") Signed-off-by: NMark Rutland <mark.rutland@arm.com> Cc: Jens Axboe <axboe@kernel.dk> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: linux-block@vger.kernel.org Cc: linux-fsdevel@vger.kernel.org Cc: linux-kernel@vger.kernel.org Simplied the logic Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Nyangerkun <yangerkun@huawei.com> Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com> Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>
-
由 Jens Axboe 提交于
mainline inclusion from mainline-5.1 commit 5c8b0b54 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA --------------------------- Currently we only post a cqe if we get an error OUTSIDE of submission. For submission, we return the error directly through io_uring_enter(). This is a bit awkward for applications, and it makes more sense to always post a cqe with an error, if the error happens on behalf of an sqe. This changes submission behavior a bit. io_uring_enter() returns -ERROR for an error, and > 0 for number of sqes submitted. Before this change, if you wanted to submit 8 entries and had an error on the 5th entry, io_uring_enter() would return 4 (for number of entries successfully submitted) and rewind the sqring. The application would then have to peek at the sqring and figure out what was wrong with the head sqe, and then skip it itself. With this change, we'll return 5 since we did consume 5 sqes, and the last sqe (with the error) will result in a cqe being posted with the error. This makes the logic easier to handle in the application, and it cleans up the submission part. Suggested-by: NStefan Bühler <source@stbuehler.de> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Nyangerkun <yangerkun@huawei.com> Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com> Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>
-
由 Stefan Bühler 提交于
mainline inclusion from mainline-5.1 commit 62977281 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA --------------------------- There is no operation to order with afterwards, and removing the flag is not critical in any way. There will always be a "race condition" where the application will trigger IORING_ENTER_SQ_WAKEUP when it isn't actually needed. Signed-off-by: NStefan Bühler <source@stbuehler.de> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Nyangerkun <yangerkun@huawei.com> Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com> Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>
-
由 Stefan Bühler 提交于
mainline inclusion from mainline-5.1 commit b841f195 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA --------------------------- smp_store_release in io_commit_sqring already orders the store to dropped before the update to SQ head. Signed-off-by: NStefan Bühler <source@stbuehler.de> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Nyangerkun <yangerkun@huawei.com> Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com> Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>
-
由 Stefan Bühler 提交于
mainline inclusion from mainline-5.1 commit 82ab082c category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA --------------------------- There is no operation before to order with. Signed-off-by: NStefan Bühler <source@stbuehler.de> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Nyangerkun <yangerkun@huawei.com> Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com> Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>
-
由 Stefan Bühler 提交于
mainline inclusion from mainline-5.1 commit 9e4c15a3 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA --------------------------- There is no operation afterwards to order with. Signed-off-by: NStefan Bühler <source@stbuehler.de> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Nyangerkun <yangerkun@huawei.com> Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com> Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>
-
由 Stefan Bühler 提交于
mainline inclusion from mainline-5.1 commit 115e12e5 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA --------------------------- The memory operations before reading cq head are unrelated and we don't care about their order. Document that the control dependency in combination with READ_ONCE and WRITE_ONCE forms a barrier we need. Signed-off-by: NStefan Bühler <source@stbuehler.de> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Nyangerkun <yangerkun@huawei.com> Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com> Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>
-
由 Stefan Bühler 提交于
mainline inclusion from mainline-5.1 commit 4f7067c3 category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA --------------------------- wq_has_sleeper has a full barrier internally. The smp_rmb barrier in io_uring_poll synchronizes with it. Signed-off-by: NStefan Bühler <source@stbuehler.de> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Nyangerkun <yangerkun@huawei.com> Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com> Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>
-
由 Stefan Bühler 提交于
mainline inclusion from mainline-5.1 commit 1e84b97b category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA --------------------------- The application reading the CQ ring needs a barrier to pair with the smp_store_release in io_commit_cqring, not the barrier after it. Also a write barrier *after* writing something (but not *before* writing anything interesting) doesn't order anything, so an smp_wmb() after writing SQ tail is not needed. Additionally consider reading SQ head and writing CQ tail in the notes. Also add some clarifications how the various other fields in the ring buffers are used. Signed-off-by: NStefan Bühler <source@stbuehler.de> Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Nyangerkun <yangerkun@huawei.com> Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com> Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>
-
由 Stefan Bühler 提交于
mainline inclusion from mainline-5.1 commit 8449eeda category: feature bugzilla: https://bugzilla.openeuler.org/show_bug.cgi?id=27 CVE: NA --------------------------- Not all request types set REQ_F_FORCE_NONBLOCK when they needed async punting; reverse logic instead and set REQ_F_NOWAIT if request mustn't be punted. Signed-off-by: NStefan Bühler <source@stbuehler.de> Merged with my previous patch for this. Signed-off-by: NJens Axboe <axboe@kernel.dk> Signed-off-by: NZhihao Cheng <chengzhihao1@huawei.com> Signed-off-by: Nyangerkun <yangerkun@huawei.com> Reviewed-by: Nzhangyi (F) <yi.zhang@huawei.com> Signed-off-by: NCheng Jian <cj.chengjian@huawei.com>
-