提交 · 78e9fdaad9d28c990a3d27ec3327a2139b3672e0 · openanolis / cloud-kernel

27 5月, 2020 33 次提交

io_uring: add support for IORING_OP_CONNECT · 78e9fdaa

由 Jens Axboe 提交于 11月 23, 2019

to #26323578

commit f8e85cf255ad57d65eeb9a9d0e59e3dec55bdd9e upstream.

This allows an application to call connect() in an async fashion. Like
other opcodes, we first try a non-blocking connect, then punt to async
context if we have to.

Note that we can still return -EINPROGRESS, and in that case the caller
should use IORING_OP_POLL_ADD to do an async wait for completion of the
connect request (just like for regular connect(2), except we can do it
async here too).
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

78e9fdaa

io_uring: only !null ptr to io_issue_sqe() · 420ea251

由 Pavel Begunkov 提交于 11月 21, 2019

to #26323578

commit f9bd67f69af56d712bfd498f5ad9cf7bb177d600 upstream.

Pass only non-null @nxt to io_issue_sqe() and handle it at the caller's
side. And propagate it.

- kiocb_done() is only called from io_read() and io_write(), which are
only called from io_issue_sqe(), so it's @nxt != NULL

- io_put_req_find_next() is called either with explicitly non-null local
nxt, or from one of the functions in io_issue_sqe() switch (or their
callees).
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

420ea251

io_uring: simplify io_req_link_next() · 543fc5fe

由 Pavel Begunkov 提交于 11月 21, 2019

to #26323578

commit b18fdf71e01fba29a804d63f8c1e2ed61011170d upstream.

"if (nxt)" is always true, as it was checked in the while's condition.
io_wq_current_is_worker() is unnecessary, as non-async callers don't
pass nxt, so io_queue_async_work() will be called for them anyway.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

543fc5fe

io_uring: pass only !null to io_req_find_next() · 5ecc65f5

由 Pavel Begunkov 提交于 11月 21, 2019

to #26323578

commit 944e58bfeda0e9b97cd611adafc823c78e0bc464 upstream.

Make io_req_find_next() and io_req_link_next() to accept only non-null
nxt, and handle it in callers.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

5ecc65f5

io_uring: remove io_free_req_find_next() · b0a3bc34

由 Pavel Begunkov 提交于 11月 21, 2019

to #26323578

commit 70cf9f3270a5c5148e93a526dc1e51965259e70c upstream.

There is only one one-liner user of io_free_req_find_next(). Inline it.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

b0a3bc34

io_uring: add likely/unlikely in io_get_sqring() · e492d6d5

由 Pavel Begunkov 提交于 11月 21, 2019

to #26323578

commit 9835d6fafba58e6d9386a6d5af800789bdb52e5b upstream.

The number of SQEs to submit is specified by a user, so io_get_sqring()
in most of the cases succeeds. Hint compilers about that.

Checking ASM genereted by gcc 9.2.0 for x64, there is one branch
misprediction.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

e492d6d5

io_uring: rename __io_submit_sqe() · 9811e2ae

由 Pavel Begunkov 提交于 11月 21, 2019

to #26323578

commit d732447fed7d6b4c22907f630cd25d574bae5276 upstream.

__io_submit_sqe() is issuing requests, so call it as
such. Moreover, it ends by calling io_iopoll_req_issued().

Rename it and make terminology clearer.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

9811e2ae

io_uring: improve trace_io_uring_defer() trace point · 55ce4ccf

由 Jens Axboe 提交于 11月 21, 2019

to #26323578

commit 915967f69c591b34c5a18d6618af021a81ffd700 upstream.

We don't have shadow requests anymore, so get rid of the shadow
argument. Add the user_data argument, as that's often useful to easily
match up requests, instead of having to look at request pointers.
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

55ce4ccf

io_uring: drain next sqe instead of shadowing · 007d7c0a

由 Pavel Begunkov 提交于 11月 21, 2019

to #26323578

commit 1b4a51b6d03d21f55effbcf609ba5526d87d9e9d upstream.

There's an issue with the shadow drain logic in that we drop the
completion lock after deciding to defer a request, then re-grab it later
and assume that the state is still the same. In the mean time, someone
else completing a request could have found and issued it. This can cause
a stall in the queue, by having a shadow request inserted that nobody is
going to drain.

Additionally, if we fail allocating the shadow request, we simply ignore
the drain.

Instead of using a shadow request, defer the next request/link instead.
This also has the following advantages:

- removes semi-duplicated code
- doesn't allocate memory for shadows
- works better if only the head marked for drain
- doesn't need complex synchronisation

On the flip side, it removes the shadow->seq ==
last_drain_in_in_link->seq optimization. That shouldn't be a common
case, and can always be added back, if needed.

Fixes: 4fe2c963154c ("io_uring: add support for link with drain")
Cc: Jackie Liu <liuyun01@kylinos.cn>
Reported-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

007d7c0a

io_uring: close lookup gap for dependent next work · f4e3d2b8

由 Jens Axboe 提交于 11月 20, 2019

to #26323578

commit b76da70fc3759df13e0991706451f1a2e06ba19e upstream.

When we find new work to process within the work handler, we queue the
linked timeout before we have issued the new work. This can be
problematic for very short timeouts, as we have a window where the new
work isn't visible.

Allow the work handler to store a callback function for this in the work
item, and flag it with IO_WQ_WORK_CB if the caller has done so. If that
is set, then io-wq will call the callback when it has setup the new work
item.
Reported-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

f4e3d2b8

io_uring: allow finding next link independent of req reference count · d98ccaf3

由 Jens Axboe 提交于 11月 20, 2019

to #26323578

commit 4d7dd462971405c65bfb3821dbb6b9ce13b5e8d6 upstream.

We currently try and start the next link when we put the request, and
only if we were going to free it. This means that the optimization to
continue executing requests from the same context often fails, as we're
not putting the final reference.

Add REQ_F_LINK_NEXT to keep track of this, and allow io_uring to find the
next request more efficiently.
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

d98ccaf3

io_uring: io_allocate_scq_urings() should return a sane state · 4e56986f

由 Jens Axboe 提交于 11月 20, 2019

to #26323578

commit eb065d301e8c83643367bdb0898becc364046bda upstream.

We currently rely on the ring destroy on cleaning things up in case of
failure, but io_allocate_scq_urings() can leave things half initialized
if only parts of it fails.

Be nice and return with either everything setup in success, or return an
error with things nicely cleaned up.

Reported-by: syzbot+0d818c0d39399188f393@syzkaller.appspotmail.com
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

4e56986f

io_uring: Always REQ_F_FREE_SQE for allocated sqe · f9d30ea7

由 Pavel Begunkov 提交于 11月 19, 2019

to #26323578

commit bbad27b2f622fa26d107f8a72c0cd5cc102dc56e upstream.

Always mark requests with allocated sqe and deallocate it in
__io_free_req(). It's easier to follow and doesn't add edge cases.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

f9d30ea7

io_uring: io_fail_links() should only consider first linked timeout · 5234d231

由 Jens Axboe 提交于 11月 19, 2019

to #26323578

commit 5d960724b0cb0d12469d1c62912e4a8c09c9fd92 upstream.

We currently clear the linked timeout field if we cancel such a timeout,
but we should only attempt to cancel if it's the first one we see.
Others should simply be freed like other requests, as they haven't
been started yet.
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

5234d231

io_uring: Fix leaking linked timeouts · 76e1b540

由 Pavel Begunkov 提交于 11月 19, 2019

to #26323578

commit 09fbb0a83ec6ab5a4037766261c031151985fff6 upstream.

let have a dependant link: REQ -> LINK_TIMEOUT -> LINK_TIMEOUT

1. submission stage: submission references for REQ and LINK_TIMEOUT
are dropped. So, references respectively (1,1,2)

2. io_put(REQ) + FAIL_LINKS stage: calls io_fail_links(), which for all
linked timeouts will call cancel_timeout() and drop 1 reference.
So, references after: (0,0,1). That's a leak.

Make it treat only the first linked timeout as such, and pass others
through __io_double_put_req().
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

76e1b540

io_uring: remove redundant check · 965269a5

由 Pavel Begunkov 提交于 11月 19, 2019

to #26323578

commit f70193d6d8cad4cc614223fef349e6ea9d48c61f upstream.

Pass any IORING_OP_LINK_TIMEOUT request further, where it will
eventually fail in io_issue_sqe().
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

965269a5

io_uring: break links for failed defer · a162e056

由 Pavel Begunkov 提交于 11月 19, 2019

to #26323578

commit d3b35796b1e3f118017491d621f624e0de7ff9fb upstream.

If io_req_defer() failed, it needs to cancel a dependant link.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

a162e056

io-wq: remove extra space characters · 76333268

由 Dan Carpenter 提交于 11月 19, 2019

to #26323578

commit b2e9c7d64b7ecacc1d0f15a6af88a73cab7d8db9 upstream.

These lines are indented an extra space character.
Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

76333268

io_uring: request cancellations should break links · 653a3c13

由 Jens Axboe 提交于 11月 18, 2019

to #26323578

commit fba38c272a0385148935d6443cb9dc68cf1f37a7 upstream.

We currently don't explicitly break links if a request is cancelled, but
we should. Add explicitly link breakage for all types of request
cancellations that we support.
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

653a3c13

io_uring: correct poll cancel and linked timeout expiration completion · 95cecacf

由 Jens Axboe 提交于 11月 18, 2019

to #26323578

commit b0dd8a412699afe3420a08f841333f3474ad45c5 upstream.

Currently a poll request fills a completion entry of 0, even if it got
cancelled. This is odd, and it makes it harder to support with chains.
Ensure that it returns -ECANCELED in the completions events if it got
cancelled, and furthermore ensure that the linked timeout that triggered
it completes with -ETIME if we did indeed trigger the completions
through a timeout.
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

95cecacf

io_uring: remove dead REQ_F_SEQ_PREV flag · 468927d3

由 Jens Axboe 提交于 11月 15, 2019

to #26323578

commit e0e328c4b330712e45ba799dc589bda751323110 upstream.

With the conversion to io-wq, we no longer use that flag. Kill it.

Fixes: 561fb04a6a22 ("io_uring: replace workqueue usage with io-wq")
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

468927d3

io_uring: fix sequencing issues with linked timeouts · 9f421418

由 Jens Axboe 提交于 11月 14, 2019

to #26323578

commit 94ae5e77a9150a8c6c57432e2db290c6868ddfad upstream.

We have an issue with timeout links that are deeper in the submit chain,
because we only handle it upfront, not from later submissions. Move the
prep + issue of the timeout link to the async work prep handler, and do
it normally for non-async queue. If we validate and prepare the timeout
links upfront when we first see them, there's nothing stopping us from
supporting any sort of nesting.

Fixes: 2665abfd757f ("io_uring: add support for linked SQE timeouts")
Reported-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

9f421418

io_uring: make req->timeout be dynamically allocated · 2daf4b5c

由 Jens Axboe 提交于 11月 15, 2019

to #26323578

commit ad8a48acc23cb13cbf4332ebabb867b1baa81842 upstream.

There are a few reasons for this:

- As a prep to improving the linked timeout logic
- io_timeout is the biggest member in the io_kiocb opcode union

This also enables a few cleanups, like unifying the timer setup between
IORING_OP_TIMEOUT and IORING_OP_LINK_TIMEOUT, and not needing multiple
arguments to the link/prep helpers.
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

2daf4b5c

io_uring: make io_double_put_req() use normal completion path · a9a99776

由 Jens Axboe 提交于 11月 14, 2019

to #26323578

commit 978db57e2c329fc612ff669cab9bf0007efd3ca3 upstream.

If we don't use the normal completion path, we may skip killing links
that should be errored and freed. Add __io_double_put_req() for use
within the completion path itself, other calls should just use
io_double_put_req().
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

a9a99776

io_uring: cleanup return values from the queueing functions · 94453214

由 Jens Axboe 提交于 11月 14, 2019

to #26323578

commit 0e0702dac26b282603261f04a62711a2d9aac17b upstream.

__io_queue_sqe(), io_queue_sqe(), io_queue_link_head() all return 0/err,
but the caller doesn't care since the errors are handled inline. Clean
these up and just make them void.
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

94453214

io_uring: io_async_cancel() should pass in 'nxt' request pointer · a5a701a4

由 Jens Axboe 提交于 11月 14, 2019

to #26323578

commit 95a5bbae05ef1ec1cceb8c1b04a482aa0b7c177c upstream.

If we have a linked request, this enables us to pass it back directly
without having to go through async context.
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

a5a701a4

io_uring: make POLL_ADD/POLL_REMOVE scale better · f7860a3c

由 Jens Axboe 提交于 11月 14, 2019

to #26323578

commit eac406c61cd0ec8fe7970ca46ddf23e40a86b579 upstream.

One of the obvious use cases for these commands is networking, where
it's not uncommon to have tons of sockets open and polled for. The
current implementation uses a list for insertion and lookup, which works
fine for file based use cases where the count is usually low, it breaks
down somewhat for higher number of files / sockets. A test case with
30k sockets being polled for and cancelled takes:

real    0m6.968s
user    0m0.002s
sys     0m6.936s

with the patch it takes:

real    0m0.233s
user    0m0.010s
sys     0m0.176s

If you go to 50k sockets, it gets even more abysmal with the current
code:

real    0m40.602s
user    0m0.010s
sys     0m40.555s

with the patch it takes:

real    0m0.398s
user    0m0.000s
sys     0m0.341s

Change is pretty straight forward, just replace the cancel_list with
a red/black tree instead.
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

f7860a3c

io-wq: remove now redundant struct io_wq_nulls_list · f24ee8ad

由 Jens Axboe 提交于 11月 14, 2019

to #26323578

commit 021d1cdda3875bf35edac9133335f622d7910abc upstream.

Since we don't iterate these lists anymore after commit:

e61df66c69b1 ("io-wq: ensure free/busy list browsing see all items")

we don't need to retain the nulls value we use for them. That means it's
pretty pointless to wrap the hlist_nulls_head in a structure, so get rid
of it.
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

f24ee8ad

io-wq: ensure free/busy list browsing see all items · ed0788d3

由 Jens Axboe 提交于 11月 13, 2019

to #26323578

commit e61df66c69b11bc050d233dc95714a6339192c28 upstream.

We have two lists for workers in io-wq, a busy and a free list. For
certain operations we want to browse all workers, and we currently do
that by browsing the two separate lists. But since these lists are RCU
protected, we can potentially miss workers if they move between the two
lists while we're browsing them.

Add a third list, all_list, that simply holds all workers. A worker is
added to that list when it starts, and removed when it exits. This makes
the worker iteration cleaner, too.
Reported-by: NPaul E. McKenney <paulmck@kernel.org>
Reviewed-by: NPaul E. McKenney <paulmck@kernel.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

ed0788d3

io-wq: ensure we have a stable view of ->cur_work for cancellations · 63209d2f

由 Jens Axboe 提交于 11月 13, 2019

to #26323578

commit 36c2f9223e84c1aa84bfba90cb2e74b517c92a54 upstream.

worker->cur_work is currently protected by the lock of the wqe that the
worker belongs to. When we send a signal to a worker, we need a stable
view of ->cur_work, so we need to hold that lock. But this doesn't work
so well, since we have the opposite order potentially on queueing work.
If POLL_ADD is used with a signalfd, then io_poll_wake() is called with
the signal lock, and that sometimes needs to insert work items.

Add a specific worker lock that protects the current work item. Then we
can guarantee that the task we're sending a signal is currently
processing the exact work we think it is.
Reported-by: NPaul E. McKenney <paulmck@kernel.org>
Reviewed-by: NPaul E. McKenney <paulmck@kernel.org>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

63209d2f

io_uring: Fix getting file for non-fd opcodes · d6fd7772

由 Pavel Begunkov 提交于 11月 14, 2019

to #26323578

commit a320e9fa1e2680116d165b9369dfa41d7cc1e1d1 upstream.

For timeout requests and bunch of others io_uring tries to grab a file
with specified fd, which is usually stdin/fd=0.
Update io_op_needs_file()
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

d6fd7772

io_uring: introduce req_need_defer() · 54ad42f9

由 Bob Liu 提交于 11月 13, 2019

to #26323578

commit 9d858b21483981db9c0cb4b184d4cdeb4bc525c2 upstream.

Makes the code easier to read.
Signed-off-by: NBob Liu <bob.liu@oracle.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

54ad42f9

io_uring: clean up io_uring_cancel_files() · 6ecd8586

由 Bob Liu 提交于 11月 13, 2019

to #26323578

commit 2f6d9b9d6357ede64a29437676884ee263039910 upstream.

We don't use the return value anymore, drop it. Also drop the
unecessary double cancel_req value check.
Signed-off-by: NBob Liu <bob.liu@oracle.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>

6ecd8586

14 5月, 2020 1 次提交

alinux: quota: fix unused label warning in dquot_load_quota_inode() · 4dc24f04

由 Jeffle Xu 提交于 5月 14, 2020

fix #27211210

Fix the compile warning caused by the unused label 'out' since
commit ec6880e8 ("new helper: lookup_positive_unlocked()").

Fixes: ec6880e8 ("new helper: lookup_positive_unlocked()")
Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

4dc24f04

06 5月, 2020 1 次提交

fs/namespace.c: fix mountpoint reference counter race · a14d8836

由 Piotr Krysiuk 提交于 4月 27, 2020

to #24913189

commit f511dc75d22e0c000fc70b54f670c2c17f5fba9a stable-4.19.

A race condition between threads updating mountpoint reference counter
affects longterm releases 4.4.220, 4.9.220, 4.14.177 and 4.19.118.

The mountpoint reference counter corruption may occur when:
* one thread increments m_count member of struct mountpoint
  [under namespace_sem, but not holding mount_lock]
    pivot_root()
* another thread simultaneously decrements the same m_count
  [under mount_lock, but not holding namespace_sem]
    put_mountpoint()
      unhash_mnt()
        umount_mnt()
          mntput_no_expire()

To fix this race condition, grab mount_lock before updating m_count in
pivot_root().

Reference: CVE-2020-12114
Cc: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NPiotr Krysiuk <piotras@gmail.com>
Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>

a14d8836

30 4月, 2020 2 次提交

xfs: disable map_sync for async flush · df842278

由 Pankaj Gupta 提交于 7月 05, 2019

fix #27138800

commit b21fec414095d966789581c1466fb2f55de33bfe upstream.

Dont support 'MAP_SYNC' with non-DAX files and DAX files
with asynchronous dax_device. Virtio pmem provides
asynchronous host page cache flush mechanism. We don't
support 'MAP_SYNC' with virtio pmem and xfs.
Signed-off-by: NPankaj Gupta <pagupta@redhat.com>
Reviewed-by: NDarrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>
Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>

df842278

ext4: disable map_sync for async flush · f20196ac

由 Pankaj Gupta 提交于 7月 05, 2019

fix #27138800

commit e46bfc3f03d7894c0eb47c7d754c38bafe39e197 upstream.

Dont support 'MAP_SYNC' with non-DAX files and DAX files
with asynchronous dax_device. Virtio pmem provides
asynchronous host page cache flush mechanism. We don't
support 'MAP_SYNC' with virtio pmem and ext4.
Signed-off-by: NPankaj Gupta <pagupta@redhat.com>
Reviewed-by: NJan Kara <jack@suse.cz>
Signed-off-by: NDan Williams <dan.j.williams@intel.com>
Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>

f20196ac

29 4月, 2020 3 次提交

fix autofs regression caused by follow_managed() changes · 23b92780

由 Al Viro 提交于 1月 14, 2020

fix #27211210

commit 508c8772760d4ef9c1a044519b564710c3684fc5 upstream.

we need to reload ->d_flags after the call of ->d_manage() - the thing
might've been called with dentry still negative and have the damn thing
turned positive while we'd waited.

Fixes: d41efb522e90 "fs/namei.c: pull positivity check into follow_managed()"
Reported-by: NIan Kent <raven@themaw.net>
Tested-by: NIan Kent <raven@themaw.net>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

23b92780

fs/namei.c: fix missing barriers when checking positivity · e2bdaed8

由 Al Viro 提交于 11月 12, 2019

fix #27211210

commit 2fa6b1e01a9b1a54769c394f06cd72c3d12a2d48 upstream.

Pinned negative dentries can, generally, be made positive
by another thread.  Conditions that prevent that are
	* ->d_lock on dentry in question
	* parent directory held at least shared
	* nobody else could have observed the address of dentry
Most of the places working with those fall into one of those
categories; however, d_lookup() and friends need to be used
with some care.  Fortunately, there's not a lot of call sites,
and with few exceptions all of those fall under one of the
cases above.

Exceptions are all in fs/namei.c - in lookup_fast(), lookup_dcache()
and mountpoint_last().  Another one is lookup_slow() - there
dcache lookup is done with parent held shared, but the result
is used after we'd drop the lock.  The same happens in do_last() -
the lookup (in lookup_one()) is done with parent locked, but
result is used after unlocking.

lookup_fast(), do_last() and mountpoint_last() flat-out reject
negatives.

Most of lookup_dcache() calls are made with parent locked at least
shared; the only exception is lookup_one_len_unlocked().  It might
return pinned negative, needs serious care from callers.  Fortunately,
almost nobody calls it directly anymore; all but two callers have
converted to lookup_positive_unlocked(), which rejects negatives.

lookup_slow() is called by the same lookup_one_len_unlocked() (see
above), mountpoint_last() and walk_component().  In those two negatives
are rejected.

In other words, there is a small set of places where we need to
check carefully if a pinned potentially negative dentry is, in
fact, positive.  After that check we want to be sure that both
->d_inode and type bits in ->d_flags are stable and observed.
The set consists of follow_managed() (where the rejection happens
for lookup_fast(), walk_component() and do_last()), last_mountpoint()
and lookup_positive_unlocked().

Solution:
	1) transition from negative to positive (in __d_set_inode_and_type())
stores ->d_inode, then uses smp_store_release() to set ->d_flags type bits.
	2) aforementioned 3 places in fs/namei.c fetch ->d_flags with
smp_load_acquire() and bugger off if it type bits say "negative".
That way anyone downstream of those checks has dentry know positive pinned,
with ->d_inode and type bits of ->d_flags stable and observed.

I considered splitting off d_lookup_positive(), so that the checks could
be done right there, under ->d_lock.  However, that leads to massive
duplication of rather subtle code in fs/namei.c and fs/dcache.c.  It's
worse than it might seem, thanks to autofs ->d_manage() getting involved ;-/
No matter what, autofs_d_manage()/autofs_d_automount() must live with
the possibility of pinned negative dentry passed their way, becoming
positive under them - that's the intended behaviour when lookup comes
in the middle of automount in progress, so we can't keep them out of
the area that has to deal with those, more's the pity...
Reported-by: NRitesh Harjani <riteshh@linux.ibm.com>
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

e2bdaed8

fix dget_parent() fastpath race · 2e197f58

由 Al Viro 提交于 10月 31, 2019

fix #27211210

commit e84009336711d2bba885fc9cea66348ddfce3758 upstream.

We are overoptimistic about taking the fast path there; seeing
the same value in ->d_parent after having grabbed a reference
to that parent does *not* mean that it has remained our parent
all along.

That wouldn't be a big deal (in the end it is our parent and
we have grabbed the reference we are about to return), but...
the situation with barriers is messed up.

We might have hit the following sequence:

d is a dentry of /tmp/a/b
CPU1:					CPU2:
parent = d->d_parent (i.e. dentry of /tmp/a)
					rename /tmp/a/b to /tmp/b
					rmdir /tmp/a, making its dentry negative
grab reference to parent,
end up with cached parent->d_inode (NULL)
					mkdir /tmp/a, rename /tmp/b to /tmp/a/b
recheck d->d_parent, which is back to original
decide that everything's fine and return the reference we'd got.

The trouble is, caller (on CPU1) will observe dget_parent()
returning an apparently negative dentry.  It actually is positive,
but CPU1 has stale ->d_inode cached.

Use d->d_seq to see if it has been moved instead of rechecking ->d_parent.
NOTE: we are *NOT* going to retry on any kind of ->d_seq mismatch;
we just go into the slow path in such case.  We don't wait for ->d_seq
to become even either - again, if we are racing with renames, we
can bloody well go to slow path anyway.
Signed-off-by: NAl Viro <viro@zeniv.linux.org.uk>
Signed-off-by: NJeffle Xu <jefflexu@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

2e197f58

openanolis / cloud-kernel 1 年多 前同步成功

openanolis / cloud-kernel
1 年多前同步成功