提交 · 4d9237e32c5db4f07f749a7ff1dd9b366bf3600e · openeuler / Kernel

16 3月, 2022 1 次提交

io_uring: recycle apoll_poll entries · 4d9237e3

由 Jens Axboe 提交于 3月 15, 2022

Particularly for networked workloads, io_uring intensively uses its
poll based backend to get a notification when data/space is available.
Profiling workloads, we see 3-4% of alloc+free that is directly attributed
to just the apoll allocation and free (and the rest being skb alloc+free).

For the fast path, we have ctx->uring_lock held already for both issue
and the inline completions, and we can utilize that to avoid any extra
locking needed to have a basic recycling cache for the apoll entries on
both the alloc and free side.

Double poll still requires an allocation. But those are rare and not
a fast path item.

With the simple cache in place, we see a 3-4% reduction in overhead for
the workload.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

4d9237e3

12 3月, 2022 1 次提交

io_uring: remove duplicated member check for io_msg_ring_prep() · f3b6a41e

由 Jens Axboe 提交于 3月 12, 2022

Julia and the kernel test robot report that the prep handling for this
command inadvertently checks one field twice:

fs/io_uring.c:4338:42-56: duplicated argument to && or ||

Get rid of it.
Reported-by: Nkernel test robot <lkp@intel.com>
Reported-by: NJulia Lawall <julia.lawall@lip6.fr>
Fixes: 4f57f06c ("io_uring: add support for IORING_OP_MSG_RING command")
Signed-off-by: NJens Axboe <axboe@kernel.dk>

f3b6a41e

11 3月, 2022 7 次提交

io_uring: allow submissions to continue on error · bcbb7bf6

由 Jens Axboe 提交于 3月 10, 2022

By default, io_uring will stop submitting a batch of requests if we run
into an error submitting a request. This isn't strictly necessary, as
the error result is passed out-of-band via a CQE anyway. And it can be
a bit confusing for some applications.

Provide a way to setup a ring that will continue submitting on error,
when the error CQE has been posted.

There's still one case that will break out of submission. If we fail
allocating a request, then we'll still return -ENOMEM. We could in theory
post a CQE for that condition too even if we never got a request. Leave
that for a potential followup.
Reported-by: NDylan Yudaken <dylany@fb.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

bcbb7bf6

io_uring: recycle provided buffers if request goes async · b1c62645

由 Jens Axboe 提交于 3月 09, 2022

If we are using provided buffers, it's less than useful to have a buffer
selected and pinned if a request needs to go async or arms poll for
notification trigger on when we can process it.

Recycle the buffer in those events, so we don't pin it for the duration
of the request.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

b1c62645

io_uring: ensure reads re-import for selected buffers · 2be2eb02

由 Jens Axboe 提交于 3月 10, 2022

If we drop buffers between scheduling a retry, then we need to re-import
when we start the request again.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

2be2eb02

io_uring: retry early for reads if we can poll · 9af177ee

由 Jens Axboe 提交于 3月 09, 2022

Most of the logic in io_read() deals with regular files, and in some ways
it would make sense to split the handling into S_IFREG and others. But
at least for retry, we don't need to bother setting up a bunch of state
just to abort in the loop later. In particular, don't bother forcing
setup of async data for a normal non-vectored read when we don't need it.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

9af177ee

io_uring: Add support for napi_busy_poll · adc8682e

由 Olivier Langlois 提交于 3月 08, 2022

The sqpoll thread can be used for performing the napi busy poll in a
similar way that it does io polling for file systems supporting direct
access bypassing the page cache.

The other way that io_uring can be used for napi busy poll is by
calling io_uring_enter() to get events.

If the user specify a timeout value, it is distributed between polling
and sleeping by using the systemwide setting
/proc/sys/net/core/busy_poll.

The changes have been tested with this program:
https://github.com/lano1106/io_uring_udp_ping

and the result is:
Without sqpoll:
NAPI busy loop disabled:
rtt min/avg/max/mdev = 40.631/42.050/58.667/1.547 us
NAPI busy loop enabled:
rtt min/avg/max/mdev = 30.619/31.753/61.433/1.456 us

With sqpoll:
NAPI busy loop disabled:
rtt min/avg/max/mdev = 42.087/44.438/59.508/1.533 us
NAPI busy loop enabled:
rtt min/avg/max/mdev = 35.779/37.347/52.201/0.924 us
Co-developed-by: NHao Xu <haoxu@linux.alibaba.com>
Signed-off-by: NHao Xu <haoxu@linux.alibaba.com>
Signed-off-by: NOlivier Langlois <olivier@trillion01.com>
Link: https://lore.kernel.org/r/810bd9408ffc510ff08269e78dca9df4af0b9e4e.1646777484.git.olivier@trillion01.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

adc8682e

io_uring: minor io_cqring_wait() optimization · 950e79dd

由 Olivier Langlois 提交于 3月 08, 2022

Move up the block manipulating the sig variable to execute code
that may encounter an error and exit first before continuing
executing the rest of the function and avoid useless computations
Signed-off-by: NOlivier Langlois <olivier@trillion01.com>
Link: https://lore.kernel.org/r/84513f7cc1b1fb31d8f4cb910aee033391d036b4.1646777484.git.olivier@trillion01.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

950e79dd

io_uring: add support for IORING_OP_MSG_RING command · 4f57f06c

由 Jens Axboe 提交于 3月 10, 2022

This adds support for IORING_OP_MSG_RING, which allows an SQE to signal
another ring. That allows either waking up someone waiting on the ring,
or even passing a 64-bit value via the user_data field in the CQE.

sqe->fd must contain the fd of a ring that should receive the CQE.
sqe->off will be propagated to the cqe->user_data on the target ring,
and sqe->len will be propagated to cqe->res. The results CQE will have
IORING_CQE_F_MSG set in its flags, to indicate that this CQE was generated
from a messaging request rather than a SQE issued locally on that ring.
This effectively allows passing a 64-bit and a 32-bit quantify between
the two rings.

This request type has the following request specific error cases:

- -EBADFD. Set if the sqe->fd doesn't point to a file descriptor that is
  of the io_uring type.
- -EOVERFLOW. Set if we were not able to deliver a request to the target
  ring.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

4f57f06c

10 3月, 2022 18 次提交

io_uring: speedup provided buffer handling · cc3cec83

由 Jens Axboe 提交于 3月 08, 2022

In testing high frequency workloads with provided buffers, we spend a
lot of time in allocating and freeing the buffer units themselves.
Rather than repeatedly free and alloc them, add a recycling cache
instead. There are two caches:

- ctx->io_buffers_cache. This is the one we grab from in the submission
  path, and it's protected by ctx->uring_lock. For inline completions,
  we can recycle straight back to this cache and not need any extra
  locking.

- ctx->io_buffers_comp. If we're not under uring_lock, then we use this
  list to recycle buffers. It's protected by the completion_lock.

On adding a new buffer, check io_buffers_cache. If it's empty, check if
we can splice entries from the io_buffers_comp_cache.

This reduces about 5-10% of overhead from provided buffers, bringing it
pretty close to the non-provided path.
Signed-off-by: NJens Axboe <axboe@kernel.dk>

cc3cec83

io_uring: add support for registering ring file descriptors · e7a6c00d

由 Jens Axboe 提交于 3月 04, 2022

Lots of workloads use multiple threads, in which case the file table is
shared between them. This makes getting and putting the ring file
descriptor for each io_uring_enter(2) system call more expensive, as it
involves an atomic get and put for each call.

Similarly to how we allow registering normal file descriptors to avoid
this overhead, add support for an io_uring_register(2) API that allows
to register the ring fds themselves:

1) IORING_REGISTER_RING_FDS - takes an array of io_uring_rsrc_update
   structs, and registers them with the task.
2) IORING_UNREGISTER_RING_FDS - takes an array of io_uring_src_update
   structs, and unregisters them.

When a ring fd is registered, it is internally represented by an offset.
This offset is returned to the application, and the application then
uses this offset and sets IORING_ENTER_REGISTERED_RING for the
io_uring_enter(2) system call. This works just like using a registered
file descriptor, rather than a real one, in an SQE, where
IOSQE_FIXED_FILE gets set to tell io_uring that we're using an internal
offset/descriptor rather than a real file descriptor.

In initial testing, this provides a nice bump in performance for
threaded applications in real world cases where the batch count (eg
number of requests submitted per io_uring_enter(2) invocation) is low.
In a microbenchmark, submitting NOP requests, we see the following
increases in performance:

Requests per syscall	Baseline	Registered	Increase
----------------------------------------------------------------
1			 ~7030K		 ~8080K		+15%
2			~13120K		~14800K		+13%
4			~22740K		~25300K		+11%
Co-developed-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

e7a6c00d

io_uring: documentation fixup · 63c36549

由 Dylan Yudaken 提交于 2月 24, 2022

Fix incorrect name reference in comment. ki_filp does not exist in the
struct, but file does.
Signed-off-by: NDylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220224105157.1332353-1-dylany@fb.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

63c36549

io_uring: do not recalculate ppos unnecessarily · b4aec400

由 Dylan Yudaken 提交于 2月 22, 2022

There is a slight optimisation to be had by calculating the correct pos
pointer inside io_kiocb_update_pos and then using that later.

It seems code size drops by a bit:
000000000000a1b0 0000000000000400 t io_read
000000000000a5b0 0000000000000319 t io_write

vs
000000000000a1b0 00000000000003f6 t io_read
000000000000a5b0 0000000000000310 t io_write
Signed-off-by: NDylan Yudaken <dylany@fb.com>
Reviewed-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

b4aec400

io_uring: update kiocb->ki_pos at execution time · d34e1e5b

由 Dylan Yudaken 提交于 2月 22, 2022

Update kiocb->ki_pos at execution time rather than in io_prep_rw().
io_prep_rw() happens before the job is enqueued to a worker and so the
offset might be read multiple times before being executed once.

Ensures that the file position in a set of _linked_ SQEs will be only
obtained after earlier SQEs have completed, and so will include their
incremented file position.
Signed-off-by: NDylan Yudaken <dylany@fb.com>
Reviewed-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

d34e1e5b

io_uring: remove duplicated calls to io_kiocb_ppos · af9c45ec

由 Dylan Yudaken 提交于 2月 22, 2022

io_kiocb_ppos is called in both branches, and it seems that the compiler
does not fuse this. Fusing removes a few bytes from loop_rw_iter.

Before:
$ nm -S fs/io_uring.o | grep loop_rw_iter
0000000000002430 0000000000000124 t loop_rw_iter

After:
$ nm -S fs/io_uring.o | grep loop_rw_iter
0000000000002430 000000000000010d t loop_rw_iter
Signed-off-by: NDylan Yudaken <dylany@fb.com>
Reviewed-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>

af9c45ec

io_uring: Remove unneeded test in io_run_task_work_sig() · c5020bc8

由 Olivier Langlois 提交于 2月 16, 2022

Avoid testing TIF_NOTIFY_SIGNAL twice by calling task_sigpending()
directly from io_run_task_work_sig()
Signed-off-by: NOlivier Langlois <olivier@trillion01.com>
Link: https://lore.kernel.org/r/bd7c0495f7656e803e5736708591bb665e6eaacd.1645041650.git.olivier@trillion01.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

c5020bc8

io-uring: Make tracepoints consistent. · 502c87d6

由 Stefan Roesch 提交于 2月 14, 2022

This makes the io-uring tracepoints consistent. Where it makes sense
the tracepoints start with the following four fields:
- context (ring)
- request
- user_data
- opcode.
Signed-off-by: NStefan Roesch <shr@fb.com>
Link: https://lore.kernel.org/r/20220214180430.70572-3-shr@fb.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

502c87d6

io-uring: add __fill_cqe function · d5ec1dfa

由 Stefan Roesch 提交于 2月 14, 2022

This introduces the __fill_cqe function. This is necessary
to correctly issue the io_uring_complete tracepoint.
Signed-off-by: NStefan Roesch <shr@fb.com>
Link: https://lore.kernel.org/r/20220214180430.70572-2-shr@fb.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

d5ec1dfa

io-wq: use IO_WQ_ACCT_NR rather than hardcoded number · 86127bb1

由 Hao Xu 提交于 2月 06, 2022

It's better to use the defined enum stuff not the hardcoded number to
define array.
Signed-off-by: NHao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220206095241.121485-4-haoxu@linux.alibaba.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

86127bb1

io-wq: reduce acct->lock crossing functions lock/unlock · e13fb1fe

由 Hao Xu 提交于 2月 06, 2022

reduce acct->lock lock and unlock in different functions to make the
code clearer.
Signed-off-by: NHao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220206095241.121485-3-haoxu@linux.alibaba.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

e13fb1fe

io-wq: decouple work_list protection from the big wqe->lock · 42abc95f

由 Hao Xu 提交于 2月 06, 2022

wqe->lock is abused, it now protects acct->work_list, hash stuff,
nr_workers, wqe->free_list and so on. Lets first get the work_list out
of the wqe-lock mess by introduce a specific lock for work list. This
is the first step to solve the huge contension between work insertion
and work consumption.
good thing:
  - split locking for bound and unbound work list
  - reduce contension between work_list visit and (worker's)free_list.

For the hash stuff, since there won't be a work with same file in both
bound and unbound work list, thus they won't visit same hash entry. it
works well to use the new lock to protect hash stuff.

Results:
set max_unbound_worker = 4, test with echo-server:
nice -n -15 ./io_uring_echo_server -p 8081 -f -n 1000 -l 16
(-n connection, -l workload)
before this patch:
Samples: 2M of event 'cycles:ppp', Event count (approx.): 1239982111074
Overhead  Command          Shared Object         Symbol
  28.59%  iou-wrk-10021    [kernel.vmlinux]      [k] native_queued_spin_lock_slowpath
   8.89%  io_uring_echo_s  [kernel.vmlinux]      [k] native_queued_spin_lock_slowpath
   6.20%  iou-wrk-10021    [kernel.vmlinux]      [k] _raw_spin_lock
   2.45%  io_uring_echo_s  [kernel.vmlinux]      [k] io_prep_async_work
   2.36%  iou-wrk-10021    [kernel.vmlinux]      [k] _raw_spin_lock_irqsave
   2.29%  iou-wrk-10021    [kernel.vmlinux]      [k] io_worker_handle_work
   1.29%  io_uring_echo_s  [kernel.vmlinux]      [k] io_wqe_enqueue
   1.06%  iou-wrk-10021    [kernel.vmlinux]      [k] io_wqe_worker
   1.06%  io_uring_echo_s  [kernel.vmlinux]      [k] _raw_spin_lock
   1.03%  iou-wrk-10021    [kernel.vmlinux]      [k] __schedule
   0.99%  iou-wrk-10021    [kernel.vmlinux]      [k] tcp_sendmsg_locked

with this patch:
Samples: 1M of event 'cycles:ppp', Event count (approx.): 708446691943
Overhead  Command          Shared Object         Symbol
  16.86%  iou-wrk-10893    [kernel.vmlinux]      [k] native_queued_spin_lock_slowpat
   9.10%  iou-wrk-10893    [kernel.vmlinux]      [k] _raw_spin_lock
   4.53%  io_uring_echo_s  [kernel.vmlinux]      [k] native_queued_spin_lock_slowpat
   2.87%  iou-wrk-10893    [kernel.vmlinux]      [k] io_worker_handle_work
   2.57%  iou-wrk-10893    [kernel.vmlinux]      [k] _raw_spin_lock_irqsave
   2.56%  io_uring_echo_s  [kernel.vmlinux]      [k] io_prep_async_work
   1.82%  io_uring_echo_s  [kernel.vmlinux]      [k] _raw_spin_lock
   1.33%  iou-wrk-10893    [kernel.vmlinux]      [k] io_wqe_worker
   1.26%  io_uring_echo_s  [kernel.vmlinux]      [k] try_to_wake_up

spin_lock failure from 25.59% + 8.89% =  34.48% to 16.86% + 4.53% = 21.39%
TPS is similar, while cpu usage is from almost 400% to 350%
Signed-off-by: NHao Xu <haoxu@linux.alibaba.com>
Link: https://lore.kernel.org/r/20220206095241.121485-2-haoxu@linux.alibaba.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

42abc95f

io_uring: Fix use of uninitialized ret in io_eventfd_register() · f0a4e62b

由 Nathan Chancellor 提交于 2月 07, 2022

Clang warns:

  fs/io_uring.c:9396:9: warning: variable 'ret' is uninitialized when used here [-Wuninitialized]
          return ret;
                 ^~~
  fs/io_uring.c:9373:13: note: initialize the variable 'ret' to silence this warning
          int fd, ret;
                     ^
                      = 0
  1 warning generated.

Just return 0 directly and reduce the scope of ret to the if statement,
as that is the only place that it is used, which is how the function was
before the fixes commit.

Fixes: 1a75fac9a0f9 ("io_uring: avoid ring quiesce while registering/unregistering eventfd")
Link: https://github.com/ClangBuiltLinux/linux/issues/1579Signed-off-by: NNathan Chancellor <nathan@kernel.org>
Reviewed-by: NNick Desaulniers <ndesaulniers@google.com>
Link: https://lore.kernel.org/r/20220207162410.1013466-1-nathan@kernel.orgSigned-off-by: NJens Axboe <axboe@kernel.dk>

f0a4e62b

io_uring: remove ring quiesce for io_uring_register · 8bb649ee

由 Usama Arif 提交于 2月 04, 2022

None of the opcodes in io_uring_register use ring quiesce anymore. Hence
io_register_op_must_quiesce always returns false and io_ctx_quiesce is
never called.
Signed-off-by: NUsama Arif <usama.arif@bytedance.com>
Link: https://lore.kernel.org/r/20220204145117.1186568-6-usama.arif@bytedance.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

8bb649ee

io_uring: avoid ring quiesce while registering restrictions and enabling rings · ff16cfcf

由 Usama Arif 提交于 2月 04, 2022

IORING_SETUP_R_DISABLED prevents submitting requests and so there will be
no requests until IORING_REGISTER_ENABLE_RINGS is called. And
IORING_REGISTER_RESTRICTIONS works only before
IORING_REGISTER_ENABLE_RINGS is called. Hence ring quiesce is not needed
for these opcodes.
Signed-off-by: NUsama Arif <usama.arif@bytedance.com>
Link: https://lore.kernel.org/r/20220204145117.1186568-5-usama.arif@bytedance.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

ff16cfcf

io_uring: avoid ring quiesce while registering async eventfd · c75312dd

由 Usama Arif 提交于 2月 04, 2022

This is done using the RCU data structure (io_ev_fd). eventfd_async is
moved from io_ring_ctx to io_ev_fd which is RCU protected hence avoiding
ring quiesce which is much more expensive than an RCU lock. The place
where eventfd_async is read is already under rcu_read_lock so there is no
extra RCU read-side critical section needed.
Signed-off-by: NUsama Arif <usama.arif@bytedance.com>
Link: https://lore.kernel.org/r/20220204145117.1186568-4-usama.arif@bytedance.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

c75312dd

io_uring: avoid ring quiesce while registering/unregistering eventfd · 77bc59b4

由 Usama Arif 提交于 2月 04, 2022

This is done by creating a new RCU data structure (io_ev_fd) as part of
io_ring_ctx that holds the eventfd_ctx.

The function io_eventfd_signal is executed under rcu_read_lock with a
single rcu_dereference to io_ev_fd so that if another thread unregisters
the eventfd while io_eventfd_signal is still being executed, the
eventfd_signal for which io_eventfd_signal was called completes
successfully.

The process of registering/unregistering eventfd is already done under
uring_lock so multiple threads won't enter a race condition while
registering/unregistering eventfd.

With the above approach ring quiesce can be avoided which is much more
expensive then using RCU lock. On the system tested, io_uring_register
with IORING_REGISTER_EVENTFD takes less than 1ms with RCU lock, compared
to 15ms before with ring quiesce.
Signed-off-by: NUsama Arif <usama.arif@bytedance.com>
Link: https://lore.kernel.org/r/20220204145117.1186568-3-usama.arif@bytedance.com
[axboe: long line fixups]
Signed-off-by: NJens Axboe <axboe@kernel.dk>

77bc59b4

io_uring: remove trace for eventfd · 2757be22

由 Usama Arif 提交于 2月 04, 2022

The information on whether eventfd is registered is not very useful and
would result in the tracepoint being enclosed in an rcu_readlock in a
later patch that tries to avoid ring quiesce for registering eventfd.
Suggested-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NUsama Arif <usama.arif@bytedance.com>
Link: https://lore.kernel.org/r/20220204145117.1186568-2-usama.arif@bytedance.comSigned-off-by: NJens Axboe <axboe@kernel.dk>

2757be22

07 3月, 2022 5 次提交

L

Linux 5.17-rc7 · ffb217a1
由 Linus Torvalds 提交于 3月 06, 2022

ffb217a1

Merge tag 'for-5.17-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux · 3ee65c0f

由 Linus Torvalds 提交于 3月 06, 2022

Pull btrfs fixes from David Sterba:
 "A few more fixes for various problems that have user visible effects
  or seem to be urgent:

   - fix corruption when combining DIO and non-blocking io_uring over
     multiple extents (seen on MariaDB)

   - fix relocation crash due to premature return from commit

   - fix quota deadlock between rescan and qgroup removal

   - fix item data bounds checks in tree-checker (found on a fuzzed
     image)

   - fix fsync of prealloc extents after EOF

   - add missing run of delayed items after unlink during log replay

   - don't start relocation until snapshot drop is finished

   - fix reversed condition for subpage writers locking

   - fix warning on page error"

* tag 'for-5.17-rc6-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: fallback to blocking mode when doing async dio over multiple extents
  btrfs: add missing run of delayed items after unlink during log replay
  btrfs: qgroup: fix deadlock between rescan worker and remove qgroup
  btrfs: fix relocation crash due to premature return from btrfs_commit_transaction()
  btrfs: do not start relocation until in progress drops are done
  btrfs: tree-checker: use u64 for item data end to avoid overflow
  btrfs: do not WARN_ON() if we have PageError set
  btrfs: fix lost prealloc extents beyond eof after full fsync
  btrfs: subpage: fix a wrong check on subpage->writers

3ee65c0f

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm · f81664f7

由 Linus Torvalds 提交于 3月 06, 2022

Pull kvm fixes from Paolo Bonzini:
 "x86 guest:

   - Tweaks to the paravirtualization code, to avoid using them when
     they're pointless or harmful

  x86 host:

   - Fix for SRCU lockdep splat

   - Brown paper bag fix for the propagation of errno"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
  KVM: x86: pull kvm->srcu read-side to kvm_arch_vcpu_ioctl_run
  KVM: x86/mmu: Passing up the error state of mmu_alloc_shadow_roots()
  KVM: x86: Yield to IPI target vCPU only if it is busy
  x86/kvmclock: Fix Hyper-V Isolated VM's boot issue when vCPUs > 64
  x86/kvm: Don't waste memory if kvmclock is disabled
  x86/kvm: Don't use PV TLB/yield when mwait is advertised

f81664f7

Merge tag 'powerpc-5.17-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · 9bdeaca1

由 Linus Torvalds 提交于 3月 06, 2022

Pull powerpc fix from Michael Ellerman:
 "Fix build failure when CONFIG_PPC_64S_HASH_MMU is not set.

  Thanks to Murilo Opsfelder Araujo, and Erhard F"

* tag 'powerpc-5.17-5' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/64s: Fix build failure when CONFIG_PPC_64S_HASH_MMU is not set

9bdeaca1

Merge tag 'trace-v5.17-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace · f40a33f5

由 Linus Torvalds 提交于 3月 06, 2022

Pull tracing fixes from Steven Rostedt:

 - Fix sorting on old "cpu" value in histograms

 - Fix return value of __setup() boot parameter handlers

* tag 'trace-v5.17-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
  tracing: Fix return value of __setup handlers
  tracing/histogram: Fix sorting on old "cpu" value

f40a33f5

06 3月, 2022 8 次提交

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input · dcde98da

由 Linus Torvalds 提交于 3月 05, 2022

Pull input updates from Dmitry Torokhov:

 - a fixup for Goodix touchscreen driver allowing it to work on certain
   Cherry Trail devices

 - a fix for imbalanced enable/disable regulator in Elam touchpad driver
   that became apparent when used with Asus TF103C 2-in-1 dock

 - a couple new input keycodes used on newer keyboards

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/dtor/input:
  HID: add mapping for KEY_ALL_APPLICATIONS
  HID: add mapping for KEY_DICTATE
  Input: elan_i2c - fix regulator enable count imbalance after suspend/resume
  Input: elan_i2c - move regulator_[en|dis]able() out of elan_[en|dis]able_power()
  Input: goodix - workaround Cherry Trail devices with a bogus ACPI Interrupt() resource
  Input: goodix - use the new soc_intel_is_byt() helper
  Input: samsung-keypad - properly state IOMEM dependency

dcde98da

Merge branch 'akpm' (patches from Andrew) · 0014404f

由 Linus Torvalds 提交于 3月 05, 2022

Merge misc fixes from Andrew Morton:
 "8 patches.

  Subsystems affected by this patch series: mm (hugetlb, pagemap, and
  userfaultfd), memfd, selftests, and kconfig"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
  configs/debug: set CONFIG_DEBUG_INFO=y properly
  proc: fix documentation and description of pagemap
  kselftest/vm: fix tests build with old libc
  memfd: fix F_SEAL_WRITE after shmem huge page allocated
  mm: fix use-after-free when anon vma name is used after vma is freed
  mm: prevent vm_area_struct::anon_name refcount saturation
  mm: refactor vm_area_struct::anon_vma_name usage code
  selftests/vm: cleanup hugetlb file after mremap test

0014404f

Merge tag 's390-5.17-5' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux · f9026e19

由 Linus Torvalds 提交于 3月 05, 2022

Pull s390 fixes from Vasily Gorbik:

 - Fix HAVE_DYNAMIC_FTRACE_WITH_ARGS implementation by providing correct
   switching between ftrace_caller/ftrace_regs_caller and supplying
   pt_regs only when ftrace_regs_caller is activated.

 - Fix exception table sorting.

 - Fix breakage of kdump tooling by preserving metadata it cannot
   function without.

* tag 's390-5.17-5' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
  s390/extable: fix exception table sorting
  s390/ftrace: fix arch_ftrace_get_regs implementation
  s390/ftrace: fix ftrace_caller/ftrace_regs_caller generation
  s390/setup: preserve memory at OLDMEM_BASE and OLDMEM_SIZE

f9026e19

configs/debug: set CONFIG_DEBUG_INFO=y properly · d1eff16d

由 Qian Cai 提交于 3月 04, 2022

CONFIG_DEBUG_INFO can't be set by user directly, so set
CONFIG_DEBUG_INFO_DWARF_TOOLCHAIN_DEFAULT=y instead.

Otherwise, we end up with no debuginfo in vmlinux which is a big no-no
for kernel debugging.

Link: https://lkml.kernel.org/r/20220301202920.18488-1-quic_qiancai@quicinc.comSigned-off-by: NQian Cai <quic_qiancai@quicinc.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

d1eff16d

proc: fix documentation and description of pagemap · dd21bfa4

由 Yun Zhou 提交于 3月 04, 2022

Since bit 57 was exported for uffd-wp write-protected (commit
fb8e37f3: "mm/pagemap: export uffd-wp protection information"),
fixing it can reduce some unnecessary confusion.

Link: https://lkml.kernel.org/r/20220301044538.3042713-1-yun.zhou@windriver.com
Fixes: fb8e37f3 ("mm/pagemap: export uffd-wp protection information")
Signed-off-by: NYun Zhou <yun.zhou@windriver.com>
Reviewed-by: NPeter Xu <peterx@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Tiberiu A Georgescu <tiberiu.georgescu@nutanix.com>
Cc: Florian Schmidt <florian.schmidt@nutanix.com>
Cc: Ivan Teterevkov <ivan.teterevkov@nutanix.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Yang Shi <shy828301@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Colin Cross <ccross@google.com>
Cc: Alistair Popple <apopple@nvidia.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

dd21bfa4

kselftest/vm: fix tests build with old libc · b773827e

由 Chengming Zhou 提交于 3月 04, 2022

The error message when I build vm tests on debian10 (GLIBC 2.28):

    userfaultfd.c: In function `userfaultfd_pagemap_test':
    userfaultfd.c:1393:37: error: `MADV_PAGEOUT' undeclared (first use
    in this function); did you mean `MADV_RANDOM'?
      if (madvise(area_dst, test_pgsize, MADV_PAGEOUT))
                                         ^~~~~~~~~~~~
                                         MADV_RANDOM

This patch includes these newer definitions from UAPI linux/mman.h, is
useful to fix tests build on systems without these definitions in glibc
sys/mman.h.

Link: https://lkml.kernel.org/r/20220227055330.43087-2-zhouchengming@bytedance.comSigned-off-by: NChengming Zhou <zhouchengming@bytedance.com>
Reviewed-by: NShuah Khan <skhan@linuxfoundation.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

b773827e

memfd: fix F_SEAL_WRITE after shmem huge page allocated · f2b277c4

由 Hugh Dickins 提交于 3月 04, 2022

Wangyong reports: after enabling tmpfs filesystem to support transparent
hugepage with the following command:

  echo always > /sys/kernel/mm/transparent_hugepage/shmem_enabled

the docker program tries to add F_SEAL_WRITE through the following
command, but it fails unexpectedly with errno EBUSY:

  fcntl(5, F_ADD_SEALS, F_SEAL_WRITE) = -1.

That is because memfd_tag_pins() and memfd_wait_for_pins() were never
updated for shmem huge pages: checking page_mapcount() against
page_count() is hopeless on THP subpages - they need to check
total_mapcount() against page_count() on THP heads only.

Make memfd_tag_pins() (compared > 1) as strict as memfd_wait_for_pins()
(compared != 1): either can be justified, but given the non-atomic
total_mapcount() calculation, it is better now to be strict.  Bear in
mind that total_mapcount() itself scans all of the THP subpages, when
choosing to take an XA_CHECK_SCHED latency break.

Also fix the unlikely xa_is_value() case in memfd_wait_for_pins(): if a
page has been swapped out since memfd_tag_pins(), then its refcount must
have fallen, and so it can safely be untagged.

Link: https://lkml.kernel.org/r/a4f79248-df75-2c8c-3df-ba3317ccb5da@google.comSigned-off-by: NHugh Dickins <hughd@google.com>
Reported-by: NZeal Robot <zealci@zte.com.cn>
Reported-by: Nwangyong <wang.yong12@zte.com.cn>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: CGEL ZTE <cgel.zte@gmail.com>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Song Liu <songliubraving@fb.com>
Cc: Yang Yang <yang.yang29@zte.com.cn>
Cc: <stable@vger.kernel.org>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

f2b277c4

mm: fix use-after-free when anon vma name is used after vma is freed · 942341dc

由 Suren Baghdasaryan 提交于 3月 04, 2022

When adjacent vmas are being merged it can result in the vma that was
originally passed to madvise_update_vma being destroyed.  In the current
implementation, the name parameter passed to madvise_update_vma points
directly to vma->anon_name and it is used after the call to vma_merge.
In the cases when vma_merge merges the original vma and destroys it,
this might result in UAF.  For that the original vma would have to hold
the anon_vma_name with the last reference.  The following vma would need
to contain a different anon_vma_name object with the same string.  Such
scenario is shown below:

madvise_vma_behavior(vma)
  madvise_update_vma(vma, ..., anon_name == vma->anon_name)
    vma_merge(vma)
      __vma_adjust(vma) <-- merges vma with adjacent one
        vm_area_free(vma) <-- frees the original vma
    replace_vma_anon_name(anon_name) <-- UAF of vma->anon_name

Fix this by raising the name refcount and stabilizing it.

Link: https://lkml.kernel.org/r/20220224231834.1481408-3-surenb@google.com
Link: https://lkml.kernel.org/r/20220223153613.835563-3-surenb@google.com
Fixes: 9a10064f ("mm: add a field to store names for private anonymous memory")
Signed-off-by: NSuren Baghdasaryan <surenb@google.com>
Reported-by: syzbot+aa7b3d4b35f9dc46a366@syzkaller.appspotmail.com
Acked-by: NMichal Hocko <mhocko@suse.com>
Cc: Alexey Gladkov <legion@kernel.org>
Cc: Chris Hyser <chris.hyser@oracle.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Colin Cross <ccross@google.com>
Cc: Cyrill Gorcunov <gorcunov@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Peter Collingbourne <pcc@google.com>
Cc: Sasha Levin <sashal@kernel.org>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Xiaofeng Cao <caoxiaofeng@yulong.com>
Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>

942341dc

openeuler / Kernel 1 年多 前同步成功

openeuler / Kernel
1 年多前同步成功