Commits · f4db7182e0de981a3f1b356e0cf43c6815423055 · openeuler / Kernel

27 Jun, 2020 3 commits

io-wq: return next work from ->do_work() directly · f4db7182

Committed by Pavel Begunkov on Jun 25, 2020

It's easier to return next work from ->do_work() than
having an in-out argument. Looks nicer and easier to compile.
Also, merge io_wq_assign_next() into its only user.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
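
The difference between an in-out argument and a plain return value can be sketched as follows. This is a simplified illustration, not the io-wq code: `struct work_item`, `do_work_inout()`, `do_work_ret()` and `run_chain()` are hypothetical names invented for this example.

```c
#include <stddef.h>

/* Hypothetical stand-in for a queued work item (not the kernel's struct). */
struct work_item {
	int id;
	struct work_item *next_linked;	/* dependent work to run next, or NULL */
	int done;
};

/* Old style: follow-up work is reported through an in-out argument. */
void do_work_inout(struct work_item **workptr)
{
	struct work_item *work = *workptr;

	work->done = 1;
	*workptr = work->next_linked;	/* caller has to re-read the pointer */
}

/* New style: the handler simply returns the next work item, or NULL. */
struct work_item *do_work_ret(struct work_item *work)
{
	work->done = 1;
	return work->next_linked;
}

/* Worker loop built on the return-value style; returns how many items ran. */
int run_chain(struct work_item *work)
{
	int ran = 0;

	while (work) {
		work = do_work_ret(work);
		ran++;
	}
	return ran;
}
```

With the return-value form, the loop in `run_chain()` reads as a plain iteration, and the handler no longer writes through a pointer owned by its caller.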

io-wq: compact io-wq flags numbers · e883a79d

Committed by Pavel Begunkov on Jun 25, 2020

Renumber the IO_WQ flags so that they occupy adjacent bits.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
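
As a rough illustration of what "adjacent bits" means, the sketch below compares a flag set with gaps against a compacted one. The flag names and values are made up for this example and are not the actual IO_WQ definitions; `bits_are_adjacent()` is likewise a hypothetical helper.

```c
/* Hypothetical flags with gaps between the used bits. */
enum sparse_flags {
	SPARSE_CANCEL	= 1 << 0,
	SPARSE_HASHED	= 1 << 2,	/* bit 1 unused */
	SPARSE_UNBOUND	= 1 << 5,	/* bits 3-4 unused */
};

/* The same hypothetical flags renumbered onto adjacent bits. */
enum compact_flags {
	COMPACT_CANCEL	= 1 << 0,
	COMPACT_HASHED	= 1 << 1,
	COMPACT_UNBOUND	= 1 << 2,
};

/*
 * Returns 1 if the set bits of mask form one contiguous run starting
 * at bit 0, i.e. mask has the form 2^k - 1 for some k >= 1.
 */
int bits_are_adjacent(unsigned int mask)
{
	return mask != 0 && (mask & (mask + 1)) == 0;
}
```

OR-ing the compact flags gives 0x7, which passes the check, while the sparse set gives 0x25 and does not.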

io_uring: use task_work for links if possible · c40f6379

Committed by Jens Axboe on Jun 25, 2020

Currently links are always done in an async fashion, unless we catch them
inline after we successfully complete a request without having to resort
to blocking. This isn't necessarily the most efficient approach; it would
be better if we could just use the task_work handling for this.

Outside of saving an async jump, we can also do less prep work for these
kinds of requests.

Running dependent links from the task_work handler yields some nice
performance benefits. As an example, examples/link-cp from the liburing
repository uses read+write links to implement a copy operation. Without
this patch, a cache cold 4G file read from a VM runs in about 3
seconds:

$ time examples/link-cp /data/file /dev/null

real	0m2.986s
user	0m0.051s
sys	0m2.843s

and a subsequent cache hot run looks like this:

$ time examples/link-cp /data/file /dev/null

real	0m0.898s
user	0m0.069s
sys	0m0.797s

With this patch in place, the cold case takes about 2.4 seconds:

$ time examples/link-cp /data/file /dev/null

real	0m2.400s
user	0m0.020s
sys	0m2.366s

and the cache hot case looks like this:

$ time examples/link-cp /data/file /dev/null

real	0m0.676s
user	0m0.010s
sys	0m0.665s

As expected, the (mostly) cache hot case yields the biggest improvement,
running about 25% faster with this change, while the cache cold case
yields about a 20% increase in performance. Outside of the performance
increase, we're using less CPU as well, as we're not using the async
offload threads at all for this anymore.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
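
The quoted percentages can be checked against the `real` timings in the commit message. A minimal sketch, where `time_reduction_pct()` is a helper name invented for this example:

```c
/*
 * Percentage reduction in wall-clock time when a run that took
 * `before` seconds now takes `after` seconds.
 */
double time_reduction_pct(double before, double after)
{
	return (before - after) / before * 100.0;
}
```

With the timings above, the cache cold run drops from 2.986s to 2.400s (roughly 19.6%) and the cache hot run from 0.898s to 0.676s (roughly 24.7%), in line with the "about 20%" and "about 25%" figures.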

25 Jun, 2020 6 commits

io_uring: enable READ/WRITE to use deferred completions · a1d7c393

Committed by Jens Axboe on Jun 22, 2020

A bit more surgery required here, as completions are generally done
through the kiocb->ki_complete() callback, even if they complete inline.
This enables the regular read/write path to use the io_comp_state
logic to batch inline completions.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: pass in completion state to appropriate issue side handlers · 229a7b63

Committed by Jens Axboe on Jun 22, 2020

Provide the completion state to the handlers that we know can complete
inline, so they can utilize this for batching completions.

Cap the max batch count at 32. This should be enough to provide a good
amortization of the cost of the lock+commit dance for completions, while
still being low enough not to cause any real latency issues for SQPOLL
applications.

Xuan Zhuo <xuanzhuo@linux.alibaba.com> reports that this changes his
profile from:

17.97% [kernel] [k] copy_user_generic_unrolled
13.92% [kernel] [k] io_commit_cqring
11.04% [kernel] [k] __io_cqring_fill_event
10.33% [kernel] [k] udp_recvmsg
 5.94% [kernel] [k] skb_release_data
 4.31% [kernel] [k] udp_rmem_release
 2.68% [kernel] [k] __check_object_size
 2.24% [kernel] [k] __slab_free
 2.22% [kernel] [k] _raw_spin_lock_bh
 2.21% [kernel] [k] kmem_cache_free
 2.13% [kernel] [k] free_pcppages_bulk
 1.83% [kernel] [k] io_submit_sqes
 1.38% [kernel] [k] page_frag_free
 1.31% [kernel] [k] inet_recvmsg

to

19.99% [kernel] [k] copy_user_generic_unrolled
11.63% [kernel] [k] skb_release_data
 9.36% [kernel] [k] udp_rmem_release
 8.64% [kernel] [k] udp_recvmsg
 6.21% [kernel] [k] __slab_free
 4.39% [kernel] [k] __check_object_size
 3.64% [kernel] [k] free_pcppages_bulk
 2.41% [kernel] [k] kmem_cache_free
 2.00% [kernel] [k] io_submit_sqes
 1.95% [kernel] [k] page_frag_free
 1.54% [kernel] [k] io_put_req
[...]
 0.07% [kernel] [k] io_commit_cqring
 0.44% [kernel] [k] __io_cqring_fill_event
Signed-off-by: Jens Axboe <axboe@kernel.dk>
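
The batching idea with a cap of 32 can be sketched in user space as follows. This is only an illustration of the amortization pattern, not the io_uring implementation: `struct comp_batch`, `batch_add()` and `batch_flush()` are hypothetical names, and the "lock+commit" step is reduced to a counter.

```c
#define MAX_BATCH 32	/* cap taken from the commit message */

/* Hypothetical batch state; the kernel's io_comp_state looks different. */
struct comp_batch {
	int pending;	/* completions queued but not yet committed */
	int flushes;	/* how many times the expensive commit step ran */
	int completed;	/* total completions made visible */
};

/* Commit everything that is pending in one go. */
void batch_flush(struct comp_batch *b)
{
	if (!b->pending)
		return;
	/* In the kernel, the lock + commit would happen once here. */
	b->completed += b->pending;
	b->pending = 0;
	b->flushes++;
}

/* Queue one inline completion, flushing when the cap is reached. */
void batch_add(struct comp_batch *b)
{
	if (++b->pending >= MAX_BATCH)
		batch_flush(b);
}
```

In this sketch, 70 inline completions followed by a final `batch_flush()` cost three commit steps instead of 70, while no completion waits behind more than 31 others.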

io_uring: pass down completion state on the issue side · f13fad7b

Committed by Jens Axboe on Jun 22, 2020

No functional changes in this patch, just in preparation for having the
completion state be available on the issue side. Later on, this will
allow requests that complete inline to be completed in batches.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: add 'io_comp_state' to struct io_submit_state · 013538bd

Committed by Jens Axboe on Jun 22, 2020

No functional changes in this patch, just in preparation for passing back
pending completions to the caller and completing them in a batched
fashion.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: provide generic io_req_complete() helper · e1e16097

Committed by Jens Axboe on Jun 22, 2020

We have lots of callers of:

io_cqring_add_event(req, result);
io_put_req(req);

Provide a helper that does this for us. It helps clean up the code, and
also provides a more convenient location for us to change the completion
handling.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
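
The shape of such a helper can be sketched with toy types. The function names `io_cqring_add_event()`, `io_put_req()` and `io_req_complete()` come from the commit message, but `struct req` and the function bodies below are stand-ins written for this example, not the kernel code.

```c
/* Toy request; the real request structure is far larger. */
struct req {
	int refs;	/* reference count */
	int result;	/* result posted to the completion ring */
	int posted;	/* set once a completion event was added */
};

/* Stand-in: record the completion event for this request. */
void io_cqring_add_event(struct req *req, int result)
{
	req->result = result;
	req->posted = 1;
}

/* Stand-in: drop one reference to the request. */
void io_put_req(struct req *req)
{
	req->refs--;
}

/* The helper: one call replaces the repeated two-line pattern. */
void io_req_complete(struct req *req, int result)
{
	io_cqring_add_event(req, result);
	io_put_req(req);
}
```

Because every caller now goes through one function, a later change to how completions are handled only has to touch `io_req_complete()`.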

io_uring: fix NULL-mm for linked reqs · d3cac64c

Committed by Pavel Begunkov on Jun 25, 2020

__io_queue_sqe() tries to handle all requests of a link,
so it's not enough to grab mm in io_sq_thread_acquire_mm()
based just on the head.

Don't check req->needs_mm and do it always.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>

22 Jun, 2020 31 commits

io_uring: kill NULL checks for submit state · f6b6c7d6

Committed by Pavel Begunkov on Jun 21, 2020

After recent changes, io_submit_sqes() always passes a valid submit state,
so kill the leftover NULL checks for it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: set @poll->file after @poll init · b90cd197

Committed by Pavel Begunkov on Jun 21, 2020

It's good practice to modify the fields of a struct after, not before,
it is initialised. Even though io_init_poll_iocb() doesn't touch
poll->file, call it first.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: remove REQ_F_MUST_PUNT · 24c74678

Committed by Pavel Begunkov on Jun 21, 2020

REQ_F_MUST_PUNT may look good and clear, but it's the same
as not having REQ_F_NOWAIT set. That rather creates more confusion.
Moreover, it doesn't even affect any behaviour (e.g. see the patch
removing it from io_{read,write}).

Kill the flag and update already outdated comments.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: remove setting REQ_F_MUST_PUNT in rw · 62ef7316

Committed by Pavel Begunkov on Jun 21, 2020

io_{read,write}() {
	...
copy_iov: // prep async
	if (!(flags & REQ_F_NOWAIT) && !file_can_poll(file))
		flags |= REQ_F_MUST_PUNT;
}

REQ_F_MUST_PUNT there is pointless, because if it happens then
REQ_F_NOWAIT is known to be _not_ set, and the request will take the
async path in __io_queue_sqe() anyway. The file_can_poll() check
is also repeated in arm_poll*(), so it isn't needed here.

Remove the mentioned assignment of REQ_F_MUST_PUNT in preparation
for killing the flag.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

Merge branch 'async-buffered.8' into for-5.9/io_uring · 895aa7b1

Committed by Jens Axboe on Jun 21, 2020

Pull in async buffered reads branch.

* async-buffered.8:
  io_uring: support true async buffered reads, if file provides it
  mm: add kiocb_wait_page_queue_init() helper
  btrfs: flag files as supporting buffered async reads
  xfs: flag files as supporting buffered async reads
  block: flag block devices as supporting IOCB_WAITQ
  fs: add FMODE_BUF_RASYNC
  mm: support async buffered reads in generic_file_buffered_read()
  mm: add support for async page locking
  mm: abstract out wake_page_match() from wake_page_function()
  mm: allow read-ahead with IOCB_NOWAIT set
  io_uring: re-issue block requests that failed because of resources
  io_uring: catch -EIO from buffered issue request failure
  io_uring: always plug for any number of IOs
  block: provide plug based way of signaling forced no-wait semantics

io_uring: support true async buffered reads, if file provides it · bcf5a063

Committed by Jens Axboe on May 22, 2020

If the file is flagged with FMODE_BUF_RASYNC, then we don't have to punt
the buffered read to an io-wq worker. Instead we can rely on page
unlocking callbacks to support retry based async IO. This is a lot more
efficient than doing async thread offload.

The retry is done similarly to how we handle poll based retry. From
the unlock callback, we simply queue the retry to a task_work based
handler.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

mm: add kiocb_wait_page_queue_init() helper · d1932dc3

Committed by Jens Axboe on May 22, 2020

Checks if the file supports it, and initializes the values that we need.
Caller passes in 'data' pointer, if any, and the callback function to
be used.
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

btrfs: flag files as supporting buffered async reads · 8730f12b

Committed by Jens Axboe on May 22, 2020

btrfs uses generic_file_read_iter(), which already supports this.
Acked-by: Chris Mason <clm@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

xfs: flag files as supporting buffered async reads · f89fb730

Committed by Jens Axboe on May 22, 2020

XFS uses generic_file_read_iter(), which already supports this.
Acked-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: flag block devices as supporting IOCB_WAITQ · a304f074

Committed by Jens Axboe on May 22, 2020

Signed-off-by: Jens Axboe <axboe@kernel.dk>

fs: add FMODE_BUF_RASYNC · c2a25ec0

Committed by Jens Axboe on May 22, 2020

If set, this indicates that the file system supports IOCB_WAITQ for
buffered reads.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

mm: support async buffered reads in generic_file_buffered_read() · 1a0a7853

Committed by Jens Axboe on May 22, 2020

Use the async page locking infrastructure, if IOCB_WAITQ is set in the
passed in iocb. The caller must expect an -EIOCBQUEUED return value,
which means that IO is started but not done yet. This is similar to how
O_DIRECT signals the same operation. Once the callback is received by
the caller for IO completion, the caller must retry the operation.
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

mm: add support for async page locking · dd3e6d50

Committed by Jens Axboe on May 22, 2020

Normally waiting for a page to become unlocked, or locking the page,
requires waiting for IO to complete. Add support for lock_page_async()
and wait_on_page_locked_async(), which are callback based instead. This
allows a caller to get notified when a page becomes unlocked, rather
than wait for it.

We add a new iocb field, ki_waitq, to pass in the necessary data for this
to happen. We can unionize this with ki_cookie, since that is only used
for polled IO. Polled IO can never co-exist with async callbacks, as it is
(by definition) polled completions. struct wait_page_key is made public,
and we define struct wait_page_async as the interface between the caller
and the core.
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

mm: abstract out wake_page_match() from wake_page_function() · c7510ab2

Committed by Jens Axboe on May 23, 2020

No functional changes in this patch, just in preparation for allowing
more callers.
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

mm: allow read-ahead with IOCB_NOWAIT set · 2e85abf0

Committed by Jens Axboe on May 22, 2020

The read-ahead shouldn't block, so allow it to be done even if
IOCB_NOWAIT is set in the kiocb.
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: re-issue block requests that failed because of resources · b63534c4

Committed by Jens Axboe on Jun 04, 2020

Mark the plug with nowait == true, which will cause requests to avoid
blocking on request allocation. If they would have blocked, we catch
them and reissue them from a task_work based handler.

Normally we can catch -EAGAIN directly, but the hard case is for split
requests. As an example, the application issues a 512KB request. The
block core will split this into 128KB chunks if that's the max size for
the device. The first request issues just fine, but we run into -EAGAIN
for some later splits of the same request. As the bio is split, we don't
get to see the -EAGAIN until one of the actual reads completes, and hence
we cannot handle it inline as part of submission.

This does potentially cause re-reads of parts of the range, as the whole
request is reissued. There's currently no better way to handle this.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
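
The arithmetic behind the split example can be sketched as follows. Both helpers are hypothetical and written only for this illustration; `reread_bytes()` additionally assumes that every split before the failing one completed successfully, which the commit message does not state.

```c
/* Number of pieces a request is split into for a given max size. */
unsigned int split_count(unsigned int req_bytes, unsigned int max_bytes)
{
	return (req_bytes + max_bytes - 1) / max_bytes;
}

/*
 * Bytes that were already read once and get read again when the whole
 * request is reissued after the split at index failed_idx (0-based) hit
 * -EAGAIN. Assumption: all earlier splits completed successfully.
 */
unsigned int reread_bytes(unsigned int max_bytes, unsigned int failed_idx)
{
	return failed_idx * max_bytes;
}
```

For the 512KB request with a 128KB device limit, this gives four splits; if the third one fails under the stated assumption, reissuing the whole request reads the first 256KB a second time.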

io_uring: catch -EIO from buffered issue request failure · 4503b767

Committed by Jens Axboe on Jun 01, 2020

-EIO bubbles up like -EAGAIN if we fail to allocate a request at the
lower level. Play it safe and treat it like -EAGAIN in terms of sync
retry, to avoid passing back an errant -EIO.

Catch some of these early for block based files, as non-mq devices
generally do not support NOWAIT. That saves us some overhead by
not first trying, then retrying from async context. We can go straight
to async punt instead.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: always plug for any number of IOs · ac8691c4

Committed by Jens Axboe on Jun 01, 2020

Currently we only plug if we're doing more than two requests. We're going
to be relying on always having the plug there to pass down information,
so plug unconditionally.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

block: provide plug based way of signaling forced no-wait semantics · 5a473e83

Committed by Jens Axboe on Jun 04, 2020

Provide a way for the caller to specify that IO should be marked
with REQ_NOWAIT to avoid blocking on allocation.
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: separate reporting of ring pages from registered pages · 2e0464d4

Committed by Bijan Mottahedeh on Jun 16, 2020

Ring pages are not pinned so it is more appropriate to report them
as locked.
Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: report pinned memory usage · 30975825

Committed by Bijan Mottahedeh on Jun 16, 2020

Always report pinned memory usage, regardless of whether the locked
memory limit is enforced.
Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: rename ctx->account_mem field · aad5d8da

Committed by Bijan Mottahedeh on Jun 16, 2020

Rename account_mem to limit_mem to clarify its purpose.
Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: add wrappers for memory accounting · a087e2b5

Committed by Bijan Mottahedeh on Jun 16, 2020

Facilitate separation of locked memory usage reporting vs. limiting for
upcoming patches. No functional changes.
Signed-off-by: Bijan Mottahedeh <bijan.mottahedeh@oracle.com>
[axboe: kill unnecessary () around return in io_account_mem()]
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: use EPOLLEXCLUSIVE flag to avoid thundering herd type behavior · a31eb4a2

Committed by Jiufei Xue on Jun 17, 2020

Applications can pass this flag in to avoid a thundering herd on accept.
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>

io_uring: change the poll type to be 32-bits · 5769a351

Committed by Jiufei Xue on Jun 17, 2020

Poll events should be 32 bits wide to cover EPOLLEXCLUSIVE.

Explicitly word-swap poll32_events on big endian to make sure the ABI
is not changed. We call this feature IORING_FEAT_POLL_32BITS;
applications that want to use EPOLLEXCLUSIVE should check the feature bit
first.
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
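
The word swap mentioned above exchanges the two 16-bit halves of a 32-bit value. The reasoning, as far as the commit message implies it: when a 16-bit field is widened to 32 bits at the same offset, a big-endian machine sees the old 16 bits in the upper half of the new word, so swapping the halves keeps old binaries working. A minimal sketch, where `swap_halfwords32()` is a name chosen for this example rather than the kernel's helper:

```c
#include <stdint.h>

/* Exchange the upper and lower 16-bit halves of a 32-bit value. */
uint32_t swap_halfwords32(uint32_t x)
{
	return (x << 16) | (x >> 16);
}
```

For instance, a legacy 16-bit event mask that a big-endian reader would see as 0x00010000 becomes 0x00000001 after the swap, and applying the swap twice returns the original value.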

Linux 5.8-rc2 · 48778464

Committed by Linus Torvalds on Jun 21, 2020

Merge tag 'selinux-pr-20200621' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux · 817d914d

Committed by Linus Torvalds on Jun 21, 2020

Pull SELinux fixes from Paul Moore:
 "Three small patches to fix problems in the SELinux code, all found via
  clang.

  Two patches fix potential double-free conditions and one fixes an
  undefined return value"

* tag 'selinux-pr-20200621' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/selinux:
  selinux: fix undefined return of cond_evaluate_expr
  selinux: fix a double free in cond_read_node()/cond_read_list()
  selinux: fix double free

Merge tag 'pinctrl-v5.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl · 16f4aa9b

Committed by Linus Torvalds on Jun 21, 2020

Pull pin control fixes from Linus Walleij:
 "Some early fixes collected during the first week after the merge
  window, all pretty self-evident, with the details below. The revert is
  the crucial thing.

   - Fix a warning on the Qualcomm SPMI GPIO chip being instantiated
     twice without a unique irqchip struct

   - Use the noirq variants of the suspend and resume callbacks in the
     Tegra driver

   - Clean up the error path on the MCP23s08 driver

   - Revert the use of devm_of_iomap() in the Freescale driver as it was
     regressing the platform

   - Add some missing pins in the Qualcomm IPQ6018 driver

   - Fix a simple documentation bug in the pinctrl-single driver"

* tag 'pinctrl-v5.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/linusw/linux-pinctrl:
  pinctrl: single: fix function name in documentation
  pinctrl: qcom: ipq6018 Add missing pins in qpic pin group
  Revert "pinctrl: freescale: imx: Use 'devm_of_iomap()' to avoid a resource leak in case of error in 'imx_pinctrl_probe()'"
  pinctrl: mcp23s08: Split to three parts: fix ptr_ret.cocci warnings
  pinctrl: tegra: Use noirq suspend/resume callbacks
  pinctrl: qcom: spmi-gpio: fix warning about irq chip reusage

Merge tag 'kbuild-fixes-v5.8' of... · be9160a9

Committed by Linus Torvalds on Jun 21, 2020

Merge tag 'kbuild-fixes-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild fixes from Masahiro Yamada:

 - fix -gz=zlib compiler option test for CONFIG_DEBUG_INFO_COMPRESSED

 - improve cc-option in scripts/Kbuild.include to clean up temp files

 - improve cc-option in scripts/Kconfig.include for more reliable
   compile option test

 - do not copy modules.builtin by 'make install' because it would break
   existing systems

 - use 'userprogs' syntax for watch_queue sample

* tag 'kbuild-fixes-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  samples: watch_queue: build sample program for target architecture
  Revert "Makefile: install modules.builtin even if CONFIG_MODULES=n"
  scripts: Fix typo in headers_install.sh
  kconfig: unify cc-option and as-option
  kbuild: improve cc-option to clean up all temporary files
  Makefile: Improve compressed debug info support detection

Merge tag 'powerpc-5.8-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux · 75613939

Committed by Linus Torvalds on Jun 21, 2020

Pull powerpc fixes from Michael Ellerman:

 - One fix for the interrupt rework we did last release which broke
   KVM-PR

 - Three commits fixing some fallout from the READ_ONCE() changes
   interacting badly with our 8xx 16K pages support, which uses a pte_t
   that is a structure of 4 actual PTEs

 - A cleanup of the 8xx pte_update() to use the newly added pmd_off()

 - A fix for a crash when handling an oops if CONFIG_DEBUG_VIRTUAL is
   enabled

 - A minor fix for the SPU syscall generation

Thanks to Aneesh Kumar K.V, Christian Zigotzky, Christophe Leroy, Mike
Rapoport, Nicholas Piggin.

* tag 'powerpc-5.8-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
  powerpc/8xx: Provide ptep_get() with 16k pages
  mm: Allow arches to provide ptep_get()
  mm/gup: Use huge_ptep_get() in gup_hugepte()
  powerpc/syscalls: Use the number when building SPU syscall table
  powerpc/8xx: use pmd_off() to access a PMD entry in pte_update()
  powerpc/64s: Fix KVM interrupt using wrong save area
  powerpc: Fix kernel crash in show_instructions() w/DEBUG_VIRTUAL

Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6 · 93bbca27

Committed by Linus Torvalds on Jun 21, 2020

Pull crypto fixes from Herbert Xu:

 - NULL dereference in octeontx

 - PM reference imbalance in ks-sa

 - deadlock in crypto manager

 - memory leak in drbg

 - missing socket limit check on receive SG list size in algif_skcipher

 - typos in caam

 - warnings in ccp and hisilicon

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
  crypto: drbg - always try to free Jitter RNG instance
  crypto: marvell/octeontx - Fix a potential NULL dereference
  crypto: algboss - don't wait during notifier callback
  crypto: caam - fix typos
  crypto: ccp - Fix sparse warnings in sev-dev
  crypto: hisilicon - Cap block size at 2^31
  crypto: algif_skcipher - Cap recv SG list at ctx->used
  hwrng: ks-sa - Fix runtime PM imbalance on error

openeuler / Kernel · synced successfully more than 1 year ago
