提交 · 9ffcf1ac125bd0c272faec74f6b14472f5fab3fb · openanolis / cloud-kernel

02 9月, 2020 36 次提交

fuse: delete dentry if timeout is zero · 9ffcf1ac

由 Miklos Szeredi 提交于 8月 15, 2018

task #28910367
commit 8fab010644363f8f80194322aa7a81e38c867af3 upstream

Don't hold onto dentry in lru list if need to re-lookup it anyway at next
access.  Only do this if explicitly enabled, otherwise it could result in
performance regression.

More advanced version of this patch would periodically flush out dentries
from the lru which have gone stale.
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
Signed-off-by: NLiu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

9ffcf1ac

fuse: Export fuse_dequeue_forget() function · 09db4841

由 Vivek Goyal 提交于 5月 31, 2019

task #28910367
commit 4388c5aac4bae5c83a2c66882043942002ba09a2 upstream

stacked file systems like virtio-fs do not have to play directly with
forget list data structures. There is a helper function use that instead.

Rename dequeue_forget() to fuse_dequeue_forget() and export it so that
stacked filesystems can use it.
Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
Signed-off-by: NLiu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

09db4841

fuse: export fuse_get_unique() · f96b6dd6

由 Stefan Hajnoczi 提交于 6月 22, 2018

task #28910367
commit 79d96efffda7597b41968d5d8813b39fc2965f1b upstream

virtio-fs will need unique IDs for FORGET requests from outside
fs/fuse/dev.c.  Make the symbol visible.
Signed-off-by: NStefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: NLiu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

f96b6dd6

fuse: Separate fuse device allocation and installation in fuse_conn · 0213de76

由 Vivek Goyal 提交于 3月 06, 2019

task #28910367
commit 0cd1eb9a4160a96e0ec9b93b2e7b489f449bf22d upstream

As of now fuse_dev_alloc() both allocates a fuse device and installs it
in fuse_conn list. fuse_dev_alloc() can fail if fuse_device allocation
fails.

virtio-fs needs to initialize multiple fuse devices (one per virtio
queue). It initializes one fuse device as part of call to
fuse_fill_super_common() and rest of the devices are allocated and
installed after that.

But, we can't affort to fail after calling fuse_fill_super_common() as
we don't have a way to undo all the actions done by fuse_fill_super_common().
So to avoid failures after the call to fuse_fill_super_common(),
pre-allocate all fuse devices early and install them into fuse connection
later.

This patch provides two separate helpers for fuse device allocation and
fuse device installation in fuse_conn.
Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
Signed-off-by: NLiu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

0213de76

fuse: add fuse_iqueue_ops callbacks · 57f16587

由 Stefan Hajnoczi 提交于 6月 18, 2018

task #28910367
commit ae3aad77f46fbba56eff7141b2fc49870b60827e upstream

The /dev/fuse device uses fiq->waitq and fasync to signal that requests
are available.  These mechanisms do not apply to virtio-fs.  This patch
introduces callbacks so alternative behavior can be used.

Note that queue_interrupt() changes along these lines:

  spin_lock(&fiq->waitq.lock);
  wake_up_locked(&fiq->waitq);
+ kill_fasync(&fiq->fasync, SIGIO, POLL_IN);
  spin_unlock(&fiq->waitq.lock);
- kill_fasync(&fiq->fasync, SIGIO, POLL_IN);

Since queue_request() and queue_forget() also call kill_fasync() inside
the spinlock this should be safe.
Signed-off-by: NStefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: NLiu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

57f16587

fuse: Export fuse_send_init_request() · 6769b1fd

由 Vivek Goyal 提交于 3月 06, 2019

task #28910367
commit 95a84cdb11c26315a6d34664846f82c438c961a1 upstream

This will be used by virtio-fs to send init request to fuse server after
initialization of virt queues.
Signed-off-by: NVivek Goyal <vgoyal@redhat.com>
Signed-off-by: NLiu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

6769b1fd

fuse: export fuse_len_args() · 609c1cf3

由 Stefan Hajnoczi 提交于 6月 21, 2018

task #28910367
commit 14d46d7abc3973a47e8eb0eb5eb87ee8d910a505 upstream

virtio-fs will need to query the length of fuse_arg lists.  Make the
symbol visible.
Signed-off-by: NStefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: NLiu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

609c1cf3

fuse: export fuse_end_request() · 63b1ffab

由 Stefan Hajnoczi 提交于 6月 21, 2018

task #28910367
commit 04ec5af0776e9baefed59891f12adbcb5fa71a23 upstream

virtio-fs will need to complete requests from outside fs/fuse/dev.c.
Make the symbol visible.
Signed-off-by: NStefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: NLiu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

63b1ffab

fuse: extract fuse_fill_super_common() · 0fdc23c2

由 Stefan Hajnoczi 提交于 6月 13, 2018

task #28910367
commit 0cc2656cdb0b1f234e6d29378cb061e29d7522bc upstream

fuse_fill_super() includes code to process the fd= option and link the
struct fuse_dev to the fd's struct file.  In virtio-fs there is no file
descriptor because /dev/fuse is not used.

This patch extracts fuse_fill_super_common() so that both classic fuse
and virtio-fs can share the code to initialize a mount.

parse_fuse_opt() is also extracted so that the fuse_fill_super_common()
caller has access to the mount options.  This allows classic fuse to
handle the fd= option outside fuse_fill_super_common().
Signed-off-by: NStefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: NLiu Bo <bo.liu@linux.alibaba.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

0fdc23c2

mm: convert PG_balloon to PG_offline · 59df23d6

由 David Hildenbrand 提交于 3月 05, 2019

task #29077503
commit ca215086b14b89a0e70fc211314944aa6ce50020 upstream
pages inflated in virtio-balloon.  Nowadays, it is only a marker that a
page is part of virtio-balloon and therefore logically offline.
We also want to make use of this flag in other balloon drivers - for
inflated pages or when onlining a section but keeping some pages offline
(e.g.  used right now by XEN and Hyper-V via set_online_page_callback()).

We are going to expose this flag to dump tools like makedumpfile.  But
instead of exposing PG_balloon, let's generalize the concept of marking
pages as logically offline, so it can be reused for other purposes later
on.

Rename PG_balloon to PG_offline.  This is an indicator that the page is
logically offline, the content stale and that it should not be touched
(e.g.  a hypervisor would have to allocate backing storage in order for
the guest to dump an unused page).  We can then e.g.  exclude such pages
from dumps.

We replace and reuse KPF_BALLOON (23), as this shouldn't really harm
(and for now the semantics stay the same).  In following patches, we
will make use of this bit also in other balloon drivers.  While at it,
document PGTABLE.

[akpm@linux-foundation.org: fix comment text, per David]
Link: http://lkml.kernel.org/r/20181119101616.8901-3-david@redhat.comSigned-off-by: NDavid Hildenbrand <david@redhat.com>
Acked-by: NKonstantin Khlebnikov <koct9i@gmail.com>
Acked-by: NMichael S. Tsirkin <mst@redhat.com>
Acked-by: NPankaj gupta <pagupta@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
Cc: Christian Hansen <chansen3@cisco.com>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Stephen Rothwell <sfr@canb.auug.org.au>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Pavel Tatashin <pasha.tatashin@oracle.com>
Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Miles Chen <miles.chen@mediatek.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Kazuhito Hagio <k-hagio@ab.jp.nec.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Baoquan He <bhe@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Dave Young <dyoung@redhat.com>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Julien Freche <jfreche@vmware.com>
Cc: Kairui Song <kasong@redhat.com>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Len Brown <len.brown@intel.com>
Cc: Lianbo Jiang <lijiang@redhat.com>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Nadav Amit <namit@vmware.com>
Cc: Omar Sandoval <osandov@fb.com>
Cc: Pavel Machek <pavel@ucw.cz>
Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Stefano Stabellini <sstabellini@kernel.org>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: Xavier Deguillard <xdeguillard@vmware.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

(cherry picked from ccommit ca215086b14b89a0e70fc211314944aa6ce50020)
Signed-off-by: NAlex Shi <alex.shi@linux.alibaba.com>
Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>

59df23d6

io_uring: fix recvmsg memory leak with buffer selection · fe15220e

由 Pavel Begunkov 提交于 7月 15, 2020

to #29441901

commit 681fda8d27a66f7e65ff7f2d200d7635e64a8d05 upstream.

io_recvmsg() doesn't free memory allocated for struct io_buffer. This can
causes a leak when used with automatic buffer selection.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

fe15220e

alinux: sched: Add cpu_stress to show system-wide task waiting · ab81d2d9

由 Yihao Wu 提交于 6月 01, 2020

to #28739709

/proc/loadavg can reflex the waiting tasks over a period of time
to some extent. But to become a SLI requires better precision and
quicker response. Furthermore, I/O block is not concerned here,
and bandwidth control is excluded from cpu_stress.

This patch adds a new interface /proc/cpu_stress. It's based on
task runtime tracking so we don't need to deal with complex state
transition. And because task runtime tracking is done in most
scheduler events, the precision is quite enough.

Like loadavg, cpu_stress has 3 average windows too (1,5,15 min)
Signed-off-by: NYihao Wu <wuyihao@linux.alibaba.com>
Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>

ab81d2d9

ovl: inode reference leak in ovl_is_inuse true case. · b247d8a6

由 youngjun 提交于 6月 16, 2020

to #29273482

When "ovl_is_inuse" true case, trap inode reference not put. plus adding
the comment explaining sequence of ovl_is_inuse after ovl_setup_trap.

Fixes: 0be0bfd2de9d ("ovl: fix regression caused by overlapping layers detection")
Cc: <stable@vger.kernel.org> # v4.19+
Reviewed-by: NAmir Goldstein <amir73il@gmail.com>
Signed-off-by: Nyoungjun <her0gyugyu@gmail.com>
Signed-off-by: NMiklos Szeredi <mszeredi@redhat.com>
Link: https://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs.git/commit/?h=overlayfs-next&id=24f14009b8f1754ec2ae4c168940c01259b0f88aSigned-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

b247d8a6

io_uring: fix not initialised work->flags · 109831ff

由 Pavel Begunkov 提交于 7月 12, 2020

to #29276773

commit 16d598030a37853a7a6b4384cad19c9c0af2f021 upstream.

59960b9deb535 ("io_uring: fix lazy work init") tried to fix missing
io_req_init_async(), but left out work.flags and hash. Do it earlier.

Fixes: 7cdaf587de7c ("io_uring: avoid whole io_wq_work copy for requests completed inline")
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

109831ff

io_uring: fix missing msg_name assignment · 589ee219

由 Pavel Begunkov 提交于 7月 12, 2020

to #29276773

commit dd821e0c95a64b5923a0c57f07d3f7563553e756 upstream.

Ensure to set msg.msg_name for the async portion of send/recvmsg,
as the header copy will copy to/from it.

Cc: stable@vger.kernel.org # v5.5+
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

589ee219

io_uring: account user memory freed when exit has been queued · 8c9ebb73

由 Jens Axboe 提交于 7月 10, 2020

to #29276773

commit 309fc03a3284af62eb6082fb60327045a1dabf57 upstream.

We currently account the memory after the exit work has been run, but
that leaves a gap where a process has closed its ring and until the
memory has been accounted as freed. If the memlocked ulimit is
borderline, then that can introduce spurious setup errors returning
-ENOMEM because the free work hasn't been run yet.

Account this as freed when we close the ring, as not to expose a tiny
gap where setting up a new ring can fail.

Fixes: 85faa7b8346e ("io_uring: punt final io_ring_ctx wait-and-free to workqueue")
Cc: stable@vger.kernel.org # v5.7
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

8c9ebb73

io_uring: fix memleak in io_sqe_files_register() · c1f5f815

由 Yang Yingliang 提交于 7月 10, 2020

to #29276773

commit 667e57da358f61b6966e12e925a69e42d912e8bb upstream.

I got a memleak report when doing some fuzz test:

BUG: memory leak
unreferenced object 0x607eeac06e78 (size 8):
  comm "test", pid 295, jiffies 4294735835 (age 31.745s)
  hex dump (first 8 bytes):
    00 00 00 00 00 00 00 00                          ........
  backtrace:
    [<00000000932632e6>] percpu_ref_init+0x2a/0x1b0
    [<0000000092ddb796>] __io_uring_register+0x111d/0x22a0
    [<00000000eadd6c77>] __x64_sys_io_uring_register+0x17b/0x480
    [<00000000591b89a6>] do_syscall_64+0x56/0xa0
    [<00000000864a281d>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Call percpu_ref_exit() on error path to avoid
refcount memleak.

Fixes: 05f3fb3c5397 ("io_uring: avoid ring quiesce for fixed file set unregister and update")
Cc: stable@vger.kernel.org
Reported-by: NHulk Robot <hulkci@huawei.com>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

c1f5f815

io_uring: fix memleak in __io_sqe_files_update() · 4b392775

由 Yang Yingliang 提交于 7月 09, 2020

to #29276773

commit f3bd9dae3708a0ff6b067e766073ffeb853301f9 upstream.

I got a memleak report when doing some fuzz test:

BUG: memory leak
unreferenced object 0xffff888113e02300 (size 488):
comm "syz-executor401", pid 356, jiffies 4294809529 (age 11.954s)
hex dump (first 32 bytes):
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
a0 a4 ce 19 81 88 ff ff 60 ce 09 0d 81 88 ff ff ........`.......
backtrace:
[<00000000129a84ec>] kmem_cache_zalloc include/linux/slab.h:659 [inline]
[<00000000129a84ec>] __alloc_file+0x25/0x310 fs/file_table.c:101
[<000000003050ad84>] alloc_empty_file+0x4f/0x120 fs/file_table.c:151
[<000000004d0a41a3>] alloc_file+0x5e/0x550 fs/file_table.c:193
[<000000002cb242f0>] alloc_file_pseudo+0x16a/0x240 fs/file_table.c:233
[<00000000046a4baa>] anon_inode_getfile fs/anon_inodes.c:91 [inline]
[<00000000046a4baa>] anon_inode_getfile+0xac/0x1c0 fs/anon_inodes.c:74
[<0000000035beb745>] __do_sys_perf_event_open+0xd4a/0x2680 kernel/events/core.c:11720
[<0000000049009dc7>] do_syscall_64+0x56/0xa0 arch/x86/entry/common.c:359
[<00000000353731ca>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

BUG: memory leak
unreferenced object 0xffff8881152dd5e0 (size 16):
comm "syz-executor401", pid 356, jiffies 4294809529 (age 11.954s)
hex dump (first 16 bytes):
01 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 ................
backtrace:
[<0000000074caa794>] kmem_cache_zalloc include/linux/slab.h:659 [inline]
[<0000000074caa794>] lsm_file_alloc security/security.c:567 [inline]
[<0000000074caa794>] security_file_alloc+0x32/0x160 security/security.c:1440
[<00000000c6745ea3>] __alloc_file+0xba/0x310 fs/file_table.c:106
[<000000003050ad84>] alloc_empty_file+0x4f/0x120 fs/file_table.c:151
[<000000004d0a41a3>] alloc_file+0x5e/0x550 fs/file_table.c:193
[<000000002cb242f0>] alloc_file_pseudo+0x16a/0x240 fs/file_table.c:233
[<00000000046a4baa>] anon_inode_getfile fs/anon_inodes.c:91 [inline]
[<00000000046a4baa>] anon_inode_getfile+0xac/0x1c0 fs/anon_inodes.c:74
[<0000000035beb745>] __do_sys_perf_event_open+0xd4a/0x2680 kernel/events/core.c:11720
[<0000000049009dc7>] do_syscall_64+0x56/0xa0 arch/x86/entry/common.c:359
[<00000000353731ca>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

If io_sqe_file_register() failed, we need put the file that get by fget()
to avoid the memleak.

Fixes: c3a31e605620 ("io_uring: add support for IORING_REGISTER_FILES_UPDATE")
Cc: stable@vger.kernel.org
Reported-by: NHulk Robot <hulkci@huawei.com>
Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

4b392775

vfs, afs, ext4: Make the inode hash table RCU searchable · dc136109

由 David Howells 提交于 12月 01, 2017

task #29263287

commit 3f19b2ab97a97b413c24b66c67ae16daa4f56c35 upstream

Make the inode hash table RCU searchable so that searches that want to
access or modify an inode without taking a ref on that inode can do so
without taking the inode hash table lock.

The main thing this requires is some RCU annotation on the list
manipulation operations.  Inodes are already freed by RCU in most cases.

Users of this interface must take care as the inode may be still under
construction or may be being torn down around them.

There are at least three instances where this can be of use:

 (1) Testing whether the inode number iunique() is going to return is
     currently unique (the iunique_lock is still held).

 (2) Ext4 date stamp updating.

 (3) AFS callback breaking.
Signed-off-by: NDavid Howells <dhowells@redhat.com>
Acked-by: NKonstantin Khlebnikov <khlebnikov@yandex-team.ru>
cc: linux-ext4@vger.kernel.org
cc: linux-afs@lists.infradead.org
[jeffle: resolve collision in afs_break_one_callback since code base change]
Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

dc136109

io_uring: export cq overflow status to userspace · c7f865a4

由 Xiaoguang Wang 提交于 7月 09, 2020

to #29233603

commit 6d5f904904608a9cd32854d7d0a4dd65b27f9935 upstream

For those applications which are not willing to use io_uring_enter()
to reap and handle cqes, they may completely rely on liburing's
io_uring_peek_cqe(), but if cq ring has overflowed, currently because
io_uring_peek_cqe() is not aware of this overflow, it won't enter
kernel to flush cqes, below test program can reveal this bug:

static void test_cq_overflow(struct io_uring *ring)
{
        struct io_uring_cqe *cqe;
        struct io_uring_sqe *sqe;
        int issued = 0;
        int ret = 0;

        do {
                sqe = io_uring_get_sqe(ring);
                if (!sqe) {
                        fprintf(stderr, "get sqe failed\n");
                        break;;
                }
                ret = io_uring_submit(ring);
                if (ret <= 0) {
                        if (ret != -EBUSY)
                                fprintf(stderr, "sqe submit failed: %d\n", ret);
                        break;
                }
                issued++;
        } while (ret > 0);
        assert(ret == -EBUSY);

        printf("issued requests: %d\n", issued);

        while (issued) {
                ret = io_uring_peek_cqe(ring, &cqe);
                if (ret) {
                        if (ret != -EAGAIN) {
                                fprintf(stderr, "peek completion failed: %s\n",
                                        strerror(ret));
                                break;
                        }
                        printf("left requets: %d\n", issued);
                        continue;
                }
                io_uring_cqe_seen(ring, cqe);
                issued--;
                printf("left requets: %d\n", issued);
        }
}

int main(int argc, char *argv[])
{
        int ret;
        struct io_uring ring;

        ret = io_uring_queue_init(16, &ring, 0);
        if (ret) {
                fprintf(stderr, "ring setup failed: %d\n", ret);
                return 1;
        }

        test_cq_overflow(&ring);
        return 0;
}

To fix this issue, export cq overflow status to userspace by adding new
IORING_SQ_CQ_OVERFLOW flag, then helper functions() in liburing, such as
io_uring_peek_cqe, can be aware of this cq overflow and do flush accordingly.
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

c7f865a4

io_uring: fix current->mm NULL dereference on exit · 1f9ef808

由 Pavel Begunkov 提交于 6月 25, 2020

to #29197839

commit d60b5fbc1ce8210759b568da49d149b868e7c6d3 upstream.

Don't reissue requests from io_iopoll_reap_events(), the task may not
have mm, which ends up with NULL. It's better to kill everything off on
exit anyway.

[  677.734670] RIP: 0010:io_iopoll_complete+0x27e/0x630
...
[  677.734679] Call Trace:
[  677.734695]  ? __send_signal+0x1f2/0x420
[  677.734698]  ? _raw_spin_unlock_irqrestore+0x24/0x40
[  677.734699]  ? send_signal+0xf5/0x140
[  677.734700]  io_iopoll_getevents+0x12f/0x1a0
[  677.734702]  io_iopoll_reap_events.part.0+0x5e/0xa0
[  677.734703]  io_ring_ctx_wait_and_kill+0x132/0x1c0
[  677.734704]  io_uring_release+0x20/0x30
[  677.734706]  __fput+0xcd/0x230
[  677.734707]  ____fput+0xe/0x10
[  677.734709]  task_work_run+0x67/0xa0
[  677.734710]  do_exit+0x35d/0xb70
[  677.734712]  do_group_exit+0x43/0xa0
[  677.734713]  get_signal+0x140/0x900
[  677.734715]  do_signal+0x37/0x780
[  677.734717]  ? enqueue_hrtimer+0x41/0xb0
[  677.734718]  ? recalibrate_cpu_khz+0x10/0x10
[  677.734720]  ? ktime_get+0x3e/0xa0
[  677.734721]  ? lapic_next_deadline+0x26/0x30
[  677.734723]  ? tick_program_event+0x4d/0x90
[  677.734724]  ? __hrtimer_get_next_event+0x4d/0x80
[  677.734726]  __prepare_exit_to_usermode+0x126/0x1c0
[  677.734741]  prepare_exit_to_usermode+0x9/0x40
[  677.734742]  idtentry_exit_cond_rcu+0x4c/0x60
[  677.734743]  sysvec_reschedule_ipi+0x92/0x160
[  677.734744]  ? asm_sysvec_reschedule_ipi+0xa/0x20
[  677.734745]  asm_sysvec_reschedule_ipi+0x12/0x20
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: NLiu Bo <bo.liu@linux.alibaba.com>

1f9ef808

io_uring: fix hanging iopoll in case of -EAGAIN · d52d1291

由 Pavel Begunkov 提交于 6月 25, 2020

to #29197839

commit cd664b0e35cb1202f40c259a1a5ea791d18c879d upstream.

io_do_iopoll() won't do anything with a request unless
req->iopoll_completed is set. So io_complete_rw_iopoll() has to set
it, otherwise io_do_iopoll() will poll a file again and again even
though the request of interest was completed long time ago.

Also, remove -EAGAIN check from io_issue_sqe() as it races with
the changed lines. The request will take the long way and be
resubmitted from io_iopoll*().

Fixes: bbde017a32b3 ("io_uring: add memory barrier to synchronize io_kiocb's result and iopoll_completed")
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Reviewed-by: NLiu Bo <bo.liu@linux.alibaba.com>

d52d1291

io_uring: fix io_sq_thread no schedule when busy · 73cef744

由 Xuan Zhuo 提交于 6月 23, 2020

fix #27804498

commit b772f07add1c0b22e02c0f1e96f647560679d3a9 upstream

When the user consumes and generates sqe at a fast rate,
io_sqring_entries can always get sqe, and ret will not be equal to -EBUSY,
so that io_sq_thread will never call cond_resched or schedule, and then
we will get the following system error prompt:

rcu: INFO: rcu_sched self-detected stall on CPU
or
watchdog: BUG: soft lockup-CPU#23 stuck for 112s! [io_uring-sq:1863]

This patch checks whether need to call cond_resched() by checking
the need_resched() function every cycle.
Suggested-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NXuan Zhuo <xuanzhuo@linux.alibaba.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

73cef744

io_uring: fix possible race condition against REQ_F_NEED_CLEANUP · 00eaddd4

由 Xiaoguang Wang 提交于 6月 28, 2020

to #28736503

commit 6f2cc1664db20676069cff27a461ccc97dbfd114 upstream

In io_read() or io_write(), when io request is submitted successfully,
it'll go through the below sequence:

    kfree(iovec);
    req->flags &= ~REQ_F_NEED_CLEANUP;
    return ret;

But clearing REQ_F_NEED_CLEANUP might be unsafe. The io request may
already have been completed, and then io_complete_rw_iopoll()
and io_complete_rw() will be called, both of which will also modify
req->flags if needed. This causes a race condition, with concurrent
non-atomic modification of req->flags.

To eliminate this race, in io_read() or io_write(), if io request is
submitted successfully, we don't remove REQ_F_NEED_CLEANUP flag. If
REQ_F_NEED_CLEANUP is set, we'll leave __io_req_aux_free() to the
iovec cleanup work correspondingly.

Cc: stable@vger.kernel.org
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

00eaddd4

io_uring: reap poll completions while waiting for refs to drop on exit · 8ec093aa

由 Jens Axboe 提交于 6月 28, 2020

to #28736503

commit 56952e91acc93ed624fe9da840900defb75f1323 upstream

If we're doing polled IO and end up having requests being submitted
async, then completions can come in while we're waiting for refs to
drop. We need to reap these manually, as nobody else will be looking
for them.

Break the wait into 1/20th of a second time waits, and check for done
poll completions if we time out. Otherwise we can have done poll
completions sitting in ctx->poll_list, which needs us to reap them but
we're just waiting for them.

Cc: stable@vger.kernel.org
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

8ec093aa

io_uring: name sq thread and ref completions · 64d14fe5

由 Jens Axboe 提交于 6月 28, 2020

to #28736503

commit 0f158b4cf20e7983d5b33878a6aad118cfac4f05 upstream

We used to have three completions, now we just have two. With the two,
let's not allocate them dynamically, just embed then in the ctx and
name them appropriately.
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

64d14fe5

io_uring: acquire 'mm' for task_work for SQPOLL · f0f8d236

由 Jens Axboe 提交于 6月 28, 2020

to #28736503

commit 9d8426a09195e2dcf2aa249de2aaadd792d491c7 upstream

If we're unlucky with timing, we could be running task_work after
having dropped the memory context in the sq thread. Since dropping
the context requires a runnable task state, we cannot reliably drop
it as part of our check-for-work loop in io_sq_thread(). Instead,
abstract out the mm acquire for the sq thread into a helper, and call
it from the async task work handler.

Cc: stable@vger.kernel.org # v5.7
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

f0f8d236

io_uring: add memory barrier to synchronize io_kiocb's result and iopoll_completed · a1db6be6

由 Xiaoguang Wang 提交于 6月 28, 2020

to #28736503

commit bbde017a32b32d2fa8d5fddca25fade20132abf8 upstream

In io_complete_rw_iopoll(), stores to io_kiocb's result and iopoll
completed are two independent store operations, to ensure that once
iopoll_completed is ture and then req->result must been perceived by
the cpu executing io_do_iopoll(), proper memory barrier should be used.

And in io_do_iopoll(), we check whether req->result is EAGAIN, if it is,
we'll need to issue this io request using io-wq again. In order to just
issue a single smp_rmb() on the completion side, move the re-submit work
to io_iopoll_complete().

Cc: stable@vger.kernel.org
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
[axboe: don't set ->iopoll_completed for -EAGAIN retry]
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

a1db6be6

io_uring: don't fail links for EAGAIN error in IOPOLL mode · b0dc0883

由 Xiaoguang Wang 提交于 6月 28, 2020

to #28736503

commit 2d7d67920e5c8e0854df23ca77da2dd5880ce5dd upstream

In IOPOLL mode, for EAGAIN error, we'll try to submit io request
again using io-wq, so don't fail rest of links if this io request
has links.

Cc: stable@vger.kernel.org
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

b0dc0883

io_uring: cancel by ->task not pid · aa04321d

由 Pavel Begunkov 提交于 6月 28, 2020

to #28736503

commit 801dd57bd1d8c2c253f43635a3045bfa32a810b3 upstream

For an exiting process it tries to cancel all its inflight requests. Use
req->task to match such instead of work.pid. We always have req->task
set, and it will be valid because we're matching only current exiting
task.

Also, remove work.pid and everything related, it's useless now.
Reported-by: NEric W. Biederman <ebiederm@xmission.com>
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

aa04321d

io_uring: lazy get task · 090d6c4e

由 Pavel Begunkov 提交于 6月 28, 2020

to #28736503

commit 4dd2824d6d5914949b5fe589538bc2622d84c5dd upstream

There will be multiple places where req->task is used, so refcount-pin
it lazily with introduced *io_{get,put}_req_task(). We need to always
have valid ->task for cancellation reasons, but don't care about pinning
it in some cases. That's why it sets req->task in io_req_init() and
implements get/put laziness with a flag.

This also removes using @current from polling io_arm_poll_handler(),
etc., but doesn't change observable behaviour.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

090d6c4e

io_uring: batch cancel in io_uring_cancel_files() · 037c3f51

由 Pavel Begunkov 提交于 6月 28, 2020

to #28736503

commit 67c4d9e693e3bb7fb968af24e3584f821a78ba56 upstream

Instead of waiting for each request one by one, first try to cancel all
of them in a batched manner, and then go over inflight_list/etc to reap
leftovers.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

037c3f51

io_uring: cancel all task's requests on exit · f4c5149f

由 Pavel Begunkov 提交于 6月 28, 2020

to #28736503

commit 44e728b8aae0bb6d4229129083974f9dea43f50b upstream

If a process is going away, io_uring_flush() will cancel only 1
request with a matching pid. Cancel all of them
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

f4c5149f

io-wq: add an option to cancel all matched reqs · 4b786826

由 Pavel Begunkov 提交于 6月 28, 2020

to #28736503

commit 4f26bda1522c35d2701fc219368c7101c17005c1 upstream

This adds support for cancelling all io-wq works matching a predicate.
It isn't used yet, so no change in observable behaviour.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

4b786826

io-wq: reorder cancellation pending -> running · 570a7c05

由 Pavel Begunkov 提交于 6月 28, 2020

to #28736503

commit f4c2665e33f48904f2766d644df33fb3fd54b5ec upstream

Go all over all pending lists and cancel works there, and only then
try to match running requests. No functional changes here, just a
preparation for bulk cancellation.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

570a7c05

io_uring: fix lazy work init · 819f9648

由 Pavel Begunkov 提交于 6月 28, 2020

to #28736503

commit 59960b9deb5354e4cdb0b6ed3a3b653a2b4eb602 upstream

Don't leave garbage in req.work before punting async on -EAGAIN
in io_iopoll_queue().

[  140.922099] general protection fault, probably for non-canonical
     address 0xdead000000000100: 0000 [#1] PREEMPT SMP PTI
...
[  140.922105] RIP: 0010:io_worker_handle_work+0x1db/0x480
...
[  140.922114] Call Trace:
[  140.922118]  ? __next_timer_interrupt+0xe0/0xe0
[  140.922119]  io_wqe_worker+0x2a9/0x360
[  140.922121]  ? _raw_spin_unlock_irqrestore+0x24/0x40
[  140.922124]  kthread+0x12c/0x170
[  140.922125]  ? io_worker_handle_work+0x480/0x480
[  140.922126]  ? kthread_park+0x90/0x90
[  140.922127]  ret_from_fork+0x22/0x30

Fixes: 7cdaf587de7c ("io_uring: avoid whole io_wq_work copy for requests completed inline")
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

819f9648

29 6月, 2020 4 次提交

io_uring: avoid unnecessary io_wq_work copy for fast poll feature · b2607733

由 Xiaoguang Wang 提交于 6月 10, 2020

to #28736503

commit 405a5d2b2762f2a9813efdee93274d4e7bf607a1 upstream

Basically IORING_OP_POLL_ADD command and async armed poll handlers
for regular commands don't touch io_wq_work, so only REQ_F_WORK_INITIALIZED
is set, can we do io_wq_work copy and restore.
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

b2607733

io_uring: allow POLL_ADD with double poll_wait() users · cfbe7e8b

由 Jens Axboe 提交于 5月 15, 2020

to #28736503

commit 18bceab101adde8f38de76016bc77f3f25cf22f4 upstream

Some file descriptors use separate waitqueues for their f_ops->poll()
handler, most commonly one for read and one for write. The io_uring
poll implementation doesn't work with that, as the 2nd poll_wait()
call will cause the io_uring poll request to -EINVAL.

This affects (at least) tty devices and /dev/random as well. This is a
big problem for event loops where some file descriptors work, and others
don't.

With this fix, io_uring handles multiple waitqueues.
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

cfbe7e8b

io_uring: fix io_kiocb.flags modification race in IOPOLL mode · d3101fc3

由 Xiaoguang Wang 提交于 6月 11, 2020

to #28736503

commit 65a6543da386838f935d2f03f452c5c0acff2a68 upstream

While testing io_uring in arm, we found sometimes io_sq_thread() keeps
polling io requests even though there are not inflight io requests in
block layer. After some investigations, found a possible race about
io_kiocb.flags, see below race codes:
  1) in the end of io_write() or io_read()
    req->flags &= ~REQ_F_NEED_CLEANUP;
    kfree(iovec);
    return ret;

  2) in io_complete_rw_iopoll()
    if (res != -EAGAIN)
        req->flags |= REQ_F_IOPOLL_COMPLETED;

In IOPOLL mode, io requests still maybe completed by interrupt, then
above codes are not safe, concurrent modifications to req->flags, which
is not protected by lock or is not atomic modifications. I also had
disassemble io_complete_rw_iopoll() in arm:
   req->flags |= REQ_F_IOPOLL_COMPLETED;
   0xffff000008387b18 <+76>:    ldr     w0, [x19,#104]
   0xffff000008387b1c <+80>:    orr     w0, w0, #0x1000
   0xffff000008387b20 <+84>:    str     w0, [x19,#104]

Seems that the "req->flags |= REQ_F_IOPOLL_COMPLETED;" is  load and
modification, two instructions, which obviously is not atomic.

To fix this issue, add a new iopoll_completed in io_kiocb to indicate
whether io request is completed.
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

d3101fc3

io_uring: async task poll trigger cleanup · 9c90c6a4

由 Jens Axboe 提交于 5月 17, 2020

to #28736503

commit 310672552f4aea2ad50704711aa3cdd45f5441e9 upstream

If the request is still hashed in io_async_task_func(), then it cannot
have been canceled and it's pointless to check. So save that check.
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

9c90c6a4

openanolis / cloud-kernel 1 年多 前同步成功

openanolis / cloud-kernel
1 年多前同步成功