提交 · 819f9648828a1f9771a7a13ee3b32b6631cdf6d0 · openanolis / cloud-kernel

02 9月, 2020 21 次提交

由 Pavel Begunkov 提交于 6月 28, 2020

to #28736503

commit 59960b9deb5354e4cdb0b6ed3a3b653a2b4eb602 upstream

Don't leave garbage in req.work before punting async on -EAGAIN
in io_iopoll_queue().

[  140.922099] general protection fault, probably for non-canonical
     address 0xdead000000000100: 0000 [#1] PREEMPT SMP PTI
...
[  140.922105] RIP: 0010:io_worker_handle_work+0x1db/0x480
...
[  140.922114] Call Trace:
[  140.922118]  ? __next_timer_interrupt+0xe0/0xe0
[  140.922119]  io_wqe_worker+0x2a9/0x360
[  140.922121]  ? _raw_spin_unlock_irqrestore+0x24/0x40
[  140.922124]  kthread+0x12c/0x170
[  140.922125]  ? io_worker_handle_work+0x480/0x480
[  140.922126]  ? kthread_park+0x90/0x90
[  140.922127]  ret_from_fork+0x22/0x30

Fixes: 7cdaf587de7c ("io_uring: avoid whole io_wq_work copy for requests completed inline")
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

819f9648

alinux: config: add N1 config to aarch64 · 4a9a2a96

由 Bin Yu 提交于 6月 30, 2020

task #28924046

This patch enables extra hardware errata for
ARM64 N1 platform and just for aarch64, not for x86.
Signed-off-by: NBin Yu <jkchen@linux.alibaba.com>
Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>

4a9a2a96

arm64: Silence clang warning on mismatched value/register sizes · dcbb4923

由 Catalin Marinas 提交于 10月 28, 2019

task #28924046

[ Upstream commit 27a22fbdeedd6c5c451cf5f830d51782bf50c3a2 ]

Clang reports a warning on the __tlbi(aside1is, 0) macro expansion since
the value size does not match the register size specified in the inline
asm. Construct the ASID value using the __TLBI_VADDR() macro.

Fixes: 222fc0c8503d ("arm64: compat: Workaround Neoverse-N1 #1542419 for compat user-space")
Reported-by: NNathan Chancellor <natechancellor@gmail.com>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
Signed-off-by: NBin Yu <jkchen@linux.alibaba.com>
Reviewed-by: NBaolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Nzou cao <zoucao@linux.alibaba.com>

dcbb4923

arm64: compat: Workaround Neoverse-N1 #1542419 for compat user-space · 98aeb744

由 James Morse 提交于 10月 17, 2019

task #28924046

[ Upstream commit 222fc0c8503d98cec3cb2bac2780cdd21a6e31c0 ]

Compat user-space is unable to perform ICIMVAU instructions from
user-space. Instead it uses a compat-syscall. Add the workaround for
Neoverse-N1 #1542419 to this code path.
Signed-off-by: NJames Morse <james.morse@arm.com>
Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
Signed-off-by: NBin Yu <jkchen@linux.alibaba.com>
Reviewed-by: NBaolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Nzou cao <zoucao@linux.alibaba.com>

98aeb744

arm64: Fake the IminLine size on systems affected by Neoverse-N1 #1542419 · 46c9f81c

由 James Morse 提交于 10月 17, 2019

task #28924046

[ Upstream commit ee9d90be9dda ]
Systems affected by Neoverse-N1 #1542419 support DIC so do not need to
perform icache maintenance once new instructions are cleaned to the PoU.
For the errata workaround, the kernel hides DIC from user-space, so that
the unnecessary cache maintenance can be trapped by firmware.

To reduce the number of traps, produce a fake IminLine value based on
PAGE_SIZE.
Signed-off-by: NJames Morse <james.morse@arm.com>
Reviewed-by: NSuzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
Signed-off-by: NBin Yu <jkchen@linux.alibaba.com>
Reviewed-by: NBaolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Nzou cao <zoucao@linux.alibaba.com>

46c9f81c

arm64: errata: Hide CTR_EL0.DIC on systems affected by Neoverse-N1 #1542419 · c1191e5c

由 James Morse 提交于 10月 17, 2019

task #28924046

[ Upstream commit 05460849c3b5 ]
Cores affected by Neoverse-N1 #1542419 could execute a stale instruction
when a branch is updated to point to freshly generated instructions.

To workaround this issue we need user-space to issue unnecessary
icache maintenance that we can trap. Start by hiding CTR_EL0.DIC.

Backport change:
Modify the ARM64_WORKAROUND_1542419 from 45 to 37
Reviewed-by: NSuzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: NJames Morse <james.morse@arm.com>
Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
Signed-off-by: NBin Yu <jkchen@linux.alibaba.com>
Reviewed-by: NBaolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Nzou cao <zoucao@linux.alibaba.com>

c1191e5c

arm64: cpufeature: Trap CTR_EL0 access only where it is necessary · 63447ec2

由 Suzuki K Poulose 提交于 10月 09, 2018

task #28924046

[ Upstream commit 4afe8e79da92 ]
When there is a mismatch in the CTR_EL0 field, we trap
access to CTR from EL0 on all CPUs to expose the safe
value. However, we could skip trapping on a CPU which
matches the safe value.

Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: NSuzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
Signed-off-by: NBin Yu <jkchen@linux.alibaba.com>
Reviewed-by: NBaolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Nzou cao <zoucao@linux.alibaba.com>

63447ec2

perf tools: Add PMU event JSON files for ARM Cortex-A76 and, Neoverse N1. · 9faaf0ee

由 James Clark 提交于 9月 02, 2019

task #28924046

[ Upstream commit 9e282b739466 ]

The source of the event codes and description text was the Neoverse N1
technical reference manual at:

  http://infocenter.arm.com/help/topic/com.arm.doc.100616_0301_01_en/neoverse_n1_trm_100616_0301_01_en.pdf

The Cortex-A76 shares the same event IDs as the Neoverse N1 and they
can be viewed at:

  https://static.docs.arm.com/100798/0400/cortex_a76_trm_100798_0400_00_en.pdfSigned-off-by: NJames Clark <james.clark@arm.com>
Cc: "linux-perf-users@vger.kernel.org" <linux-perf-users@vger.kernel.org>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Jeremy Linton <jeremy.linton@arm.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Suzuki Poulouse <suzuki.poulose@arm.com>
Cc: james clark <james.clark@arm.com>
Cc: nd <nd@arm.com>
Link: http://lore.kernel.org/lkml/20190902160713.1425-2-james.clark@arm.comSigned-off-by: NArnaldo Carvalho de Melo <acme@redhat.com>
Signed-off-by: NBin Yu <jkchen@linux.alibaba.com>
Reviewed-by: NBaolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Nzou cao <zoucao@linux.alibaba.com>

9faaf0ee

arm64: Update silicon-errata.txt for Neoverse-N1 #1349291 · 52e57fc5

由 James Morse 提交于 6月 18, 2019

task #28924046

[ Upstream commit 3276cc248964 ]

Neoverse-N1 affected by #1349291 may report an Uncontained RAS Error
as Unrecoverable. The kernel's architecture code already considers
Unrecoverable errors as fatal as without kernel-first support no
further error-handling is possible.

Now that KVM attributes SError to the host/guest more precisely
the host's architecture code will always handle host errors that
become pending during world-switch.
Errors misclassified by this errata that affected the guest will be
re-injected to the guest as an implementation-defined SError, which can
be uncontained.

Until kernel-first support is implemented, no workaround is needed
for this issue.
Signed-off-by: NJames Morse <james.morse@arm.com>
Signed-off-by: NMarc Zyngier <marc.zyngier@arm.com>
Signed-off-by: NBin Yu <jkchen@linux.alibaba.com>
Reviewed-by: NBaolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Nzou cao <zoucao@linux.alibaba.com>

52e57fc5

arm64: Handle erratum 1418040 as a superset of erratum 1188873 · ca26fded

由 Marc Zyngier 提交于 5月 23, 2019

task #28924046

[ Upstream commit a5325089bd05 ]

We already mitigate erratum 1188873 affecting Cortex-A76 and
Neoverse-N1 r0p0 to r2p0. It turns out that revisions r0p0 to
r3p1 of the same cores are affected by erratum 1418040, which
has the same workaround as 1188873.

Let's expand the range of affected revisions to match 1418040,
and repaint all occurences of 1188873 to 1418040. Whilst we're
there, do a bit of reformating in silicon-errata.txt and drop
a now unnecessary dependency on ARM_ARCH_TIMER_OOL_WORKAROUND.
Signed-off-by: NMarc Zyngier <marc.zyngier@arm.com>
Signed-off-by: NWill Deacon <will.deacon@arm.com>
Signed-off-by: NBin Yu <jkchen@linux.alibaba.com>
Reviewed-by: NBaolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Nzou cao <zoucao@linux.alibaba.com>

ca26fded

arm64: Restrict ARM64_ERRATUM_1188873 mitigation to AArch32 · 6ea79e78

由 Marc Zyngier 提交于 4月 15, 2019

task #28924046

[Upstream commit 0f80cad3124f986d0e46c14d46b8da06d87a2bf4]

We currently deal with ARM64_ERRATUM_1188873 by always trapping EL0
accesses for both instruction sets. Although nothing wrong comes out
of that, people trying to squeeze the last drop of performance from
buggy HW find this over the top. Oh well.

Let's change the mitigation by flipping the counter enable bit
on return to userspace. Non-broken HW gets an extra branch on
the fast path, which is hopefully not the end of the world.
The arch timer workaround is also removed.
Acked-by: NDaniel Lezcano <daniel.lezcano@linaro.org>
Signed-off-by: NMarc Zyngier <marc.zyngier@arm.com>
Signed-off-by: NWill Deacon <will.deacon@arm.com>
Reviewed-by: NBaolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Nzou cao <zoucao@linux.alibaba.com>

6ea79e78

arm64: Apply ARM64_ERRATUM_1188873 to Neoverse-N1 · b429dc25

由 Marc Zyngier 提交于 4月 15, 2019

task #28924046

[ Upstream commit 6989303a3b2d864fd8e17d3fa3365d3e9649a598 ]

Neoverse-N1 is also affected by ARM64_ERRATUM_1188873, so let's
add it to the list of affected CPUs.
Signed-off-by: NMarc Zyngier <marc.zyngier@arm.com>
[will: Update silicon-errata.txt]
Signed-off-by: NWill Deacon <will.deacon@arm.com>
Signed-off-by: NBin Yu <jkchen@linux.alibaba.com>
Reviewed-by: NBaolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Nzou cao <zoucao@linux.alibaba.com>

b429dc25

arm64: Make ARM64_ERRATUM_1188873 depend on COMPAT · 5d14f280

由 Marc Zyngier 提交于 4月 15, 2019

task #28924046

[Upstream commit c2b5bba3967a ]

Since ARM64_ERRATUM_1188873 only affects AArch32 EL0, it makes some
sense that it should depend on COMPAT.
Signed-off-by: NMarc Zyngier <marc.zyngier@arm.com>
Signed-off-by: NWill Deacon <will.deacon@arm.com>
Signed-off-by: NBin Yu <jkchen@linux.alibaba.com>
Reviewed-by: NBaolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Nzou cao <zoucao@linux.alibaba.com>

5d14f280

arm64: arch_timer: avoid unused function warning · 5be9c8fa

由 Arnd Bergmann 提交于 10月 02, 2018

task #28924046

[Upstream commit 040f340134751d73bd03ee92fabb992946c55b3d]

arm64_1188873_read_cntvct_el0() is protected by the correct
CONFIG_ARM64_ERRATUM_1188873 #ifdef, but the only reference to it is
also inside of an CONFIG_ARM_ARCH_TIMER_OOL_WORKAROUND section,
and causes a warning if that is disabled:

drivers/clocksource/arm_arch_timer.c:323:20: error: 'arm64_1188873_read_cntvct_el0' defined but not used [-Werror=unused-function]

Since the erratum requires that we always apply the workaround
in the timer driver, select that symbol as we do for SoC
specific errata.

Fixes: 95b861a4a6d9 ("arm64: arch_timer: Add workaround for ARM erratum 1188873")
Acked-by: NMarc Zyngier <marc.zyngier@arm.com>
Signed-off-by: NArnd Bergmann <arnd@arndb.de>
Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
Signed-off-by: NBin Yu <jkchen@linux.alibaba.com>
Reviewed-by: NBaolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Nzou cao <zoucao@linux.alibaba.com>

5be9c8fa

arm64: arch_timerq: Add workaround for ARM erratum 1188873 · 101ddbde

由 Marc Zyngier 提交于 9月 27, 2018

task #28924046

[ Upstream commit 95b861a4a6d94f64d5242605569218160ebacdbe ]

When running on Cortex-A76, a timer access from an AArch32 EL0
task may end up with a corrupted value or register. The workaround for
this is to trap these accesses at EL1/EL2 and execute them there.

This only affects versions r0p0, r1p0 and r2p0 of the CPU.

Backport change:
The patch modifies ARM64_WORKAROUND_1188873 from 35 to 36 and
the ARM_CPU_PART_CORTEX_A76 is deleted because a previous patch
has been modified.
Acked-by: NMark Rutland <mark.rutland@arm.com>
Signed-off-by: NMarc Zyngier <marc.zyngier@arm.com>
Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
Signed-off-by: NBin Yu <jkchen@linux.alibaba.com>
Reviewed-by: NBaolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Nzou cao <zoucao@linux.alibaba.com>

101ddbde

arm64: Add part number for Neoverse N1 · 0c455077

由 Marc Zyngier 提交于 4月 15, 2019

task #28924046

[ Upstream commit 0cf57b86859c]

New CPU, new part number. You know the drill.
Signed-off-by: NMarc Zyngier <marc.zyngier@arm.com>
Signed-off-by: NWill Deacon <will.deacon@arm.com>
Signed-off-by: NBin Yu <jkchen@linux.alibaba.com>
Reviewed-by: NBaolin Wang <baolin.wang@linux.alibaba.com>
Reviewed-by: Nzou cao <zoucao@linux.alibaba.com>

0c455077

configs: enable conntrack_zone option · fce8bc83

由 Zhiyuan Hou 提交于 6月 03, 2020

fix #28925962

In faas's VPC network mode, we should enable this feature to support
multi-tenancy's snat on a node.
Signed-off-by: NZhiyuan Hou <zhiyuan2048@linux.alibaba.com>
Acked-by: NDust Li <dust.li@linux.alibaba.com>

fce8bc83

vfio-pci: Invalidate mmaps and block MMIO access on disabled memory · aa30f9c9

由 Alex Williamson 提交于 4月 22, 2020

to #28892961

commit abafbc551fddede3e0a08dee1dcde08fc0eb8476 upstream.

Accessing the disabled memory space of a PCI device would typically
result in a master abort response on conventional PCI, or an
unsupported request on PCI express.  The user would generally see
these as a -1 response for the read return data and the write would be
silently discarded, possibly with an uncorrected, non-fatal AER error
triggered on the host.  Some systems however take it upon themselves
to bring down the entire system when they see something that might
indicate a loss of data, such as this discarded write to a disabled
memory space.

To avoid this, we want to try to block the user from accessing memory
spaces while they're disabled.  We start with a semaphore around the
memory enable bit, where writers modify the memory enable state and
must be serialized, while readers make use of the memory region and
can access in parallel.  Writers include both direct manipulation via
the command register, as well as any reset path where the internal
mechanics of the reset may both explicitly and implicitly disable
memory access, and manipulation of the MSI-X configuration, where the
MSI-X vector table resides in MMIO space of the device.  Readers
include the read and write file ops to access the vfio device fd
offsets as well as memory mapped access.  In the latter case, we make
use of our new vma list support to zap, or invalidate, those memory
mappings in order to force them to be faulted back in on access.

Our semaphore usage will stall user access to MMIO spaces across
internal operations like reset, but the user might experience new
behavior when trying to access the MMIO space while disabled via the
PCI command register.  Access via read or write while disabled will
return -EIO and access via memory maps will result in a SIGBUS.  This
is expected to be compatible with known use cases and potentially
provides better error handling capabilities than present in the
hardware, while avoiding the more readily accessible and severe
platform error responses that might otherwise occur.

Fixes: CVE-2020-12888
Reviewed-by: NPeter Xu <peterx@redhat.com>
Signed-off-by: NAlex Williamson <alex.williamson@redhat.com>

[ shile: fixed conflicts in
	drivers/vfio/pci/vfio_pci.c
	drivers/vfio/pci/vfio_pci_private.h ]
Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

aa30f9c9

vfio-pci: Fault mmaps to enable vma tracking · c3da9844

由 Alex Williamson 提交于 4月 28, 2020

to #28892961

commit 11c4cd07ba111a09f49625f9e4c851d83daf0a22 upstream.

Rather than calling remap_pfn_range() when a region is mmap'd, setup
a vm_ops handler to support dynamic faulting of the range on access.
This allows us to manage a list of vmas actively mapping the area that
we can later use to invalidate those mappings.  The open callback
invalidates the vma range so that all tracking is inserted in the
fault handler and removed in the close handler.
Reviewed-by: NPeter Xu <peterx@redhat.com>
Signed-off-by: NAlex Williamson <alex.williamson@redhat.com>

Fixes: CVE-2020-12888
[ shile: fixed conflicts in vfio_pci_private.h ]
Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

c3da9844

alinux: introduce deferred_meminit boot parameter · 05f6ed40

由 chenxiangzuo 提交于 6月 29, 2020

fix #27418285

We introduce a boot parametter 'deferred_meminit' for defer
page init feature. Default it is disabled, and we can pass
'deferred_meminit' to enable it.
Signed-off-by: Nchenxiangzuo <cxz18821786681@linux.alibaba.com>
Reviewed-by: NXu Yu <xuyu@linux.alibaba.com>
Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Acked-by: NShile Zhang <shile.zhang@linux.alibaba.com>

05f6ed40

cpufreq: intel_pstate: Fix compilation for !CONFIG_ACPI · 55952de7

由 Dominik Brodowski 提交于 10月 23, 2018

fix #29051137

commit 5906056e52e9ee5e130d880443e83016f892b5dd upstream

While at it, add a few comments which config options #ifdef
and #else statements refer to.

Fixes: 86d333a8cc7f (cpufreq: intel_pstate: Add base_frequency attribute)
Signed-off-by: NDominik Brodowski <linux@dominikbrodowski.net>
Acked-by: NSrinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
Signed-off-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
Signed-off-by: NTianjia Zhang <tianjia.zhang@linux.alibaba.com>
Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>

55952de7

29 6月, 2020 19 次提交

io_uring: avoid unnecessary io_wq_work copy for fast poll feature · b2607733

由 Xiaoguang Wang 提交于 6月 10, 2020

to #28736503

commit 405a5d2b2762f2a9813efdee93274d4e7bf607a1 upstream

Basically IORING_OP_POLL_ADD command and async armed poll handlers
for regular commands don't touch io_wq_work, so only REQ_F_WORK_INITIALIZED
is set, can we do io_wq_work copy and restore.
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

b2607733

io_uring: allow POLL_ADD with double poll_wait() users · cfbe7e8b

由 Jens Axboe 提交于 5月 15, 2020

to #28736503

commit 18bceab101adde8f38de76016bc77f3f25cf22f4 upstream

Some file descriptors use separate waitqueues for their f_ops->poll()
handler, most commonly one for read and one for write. The io_uring
poll implementation doesn't work with that, as the 2nd poll_wait()
call will cause the io_uring poll request to -EINVAL.

This affects (at least) tty devices and /dev/random as well. This is a
big problem for event loops where some file descriptors work, and others
don't.

With this fix, io_uring handles multiple waitqueues.
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

cfbe7e8b

io_uring: fix io_kiocb.flags modification race in IOPOLL mode · d3101fc3

由 Xiaoguang Wang 提交于 6月 11, 2020

to #28736503

commit 65a6543da386838f935d2f03f452c5c0acff2a68 upstream

While testing io_uring in arm, we found sometimes io_sq_thread() keeps
polling io requests even though there are not inflight io requests in
block layer. After some investigations, found a possible race about
io_kiocb.flags, see below race codes:
  1) in the end of io_write() or io_read()
    req->flags &= ~REQ_F_NEED_CLEANUP;
    kfree(iovec);
    return ret;

  2) in io_complete_rw_iopoll()
    if (res != -EAGAIN)
        req->flags |= REQ_F_IOPOLL_COMPLETED;

In IOPOLL mode, io requests still maybe completed by interrupt, then
above codes are not safe, concurrent modifications to req->flags, which
is not protected by lock or is not atomic modifications. I also had
disassemble io_complete_rw_iopoll() in arm:
   req->flags |= REQ_F_IOPOLL_COMPLETED;
   0xffff000008387b18 <+76>:    ldr     w0, [x19,#104]
   0xffff000008387b1c <+80>:    orr     w0, w0, #0x1000
   0xffff000008387b20 <+84>:    str     w0, [x19,#104]

Seems that the "req->flags |= REQ_F_IOPOLL_COMPLETED;" is  load and
modification, two instructions, which obviously is not atomic.

To fix this issue, add a new iopoll_completed in io_kiocb to indicate
whether io request is completed.
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

d3101fc3

io_uring: async task poll trigger cleanup · 9c90c6a4

由 Jens Axboe 提交于 5月 17, 2020

to #28736503

commit 310672552f4aea2ad50704711aa3cdd45f5441e9 upstream

If the request is still hashed in io_async_task_func(), then it cannot
have been canceled and it's pointless to check. So save that check.
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

9c90c6a4

io_uring: avoid whole io_wq_work copy for requests completed inline · 1d6d088b

由 Xiaoguang Wang 提交于 6月 10, 2020

to #28736503

commit 7cdaf587de7c6f494b8433fded19f7728e70e1ef upstream

If requests can be submitted and completed inline, we don't need to
initialize whole io_wq_work in io_init_req(), which is an expensive
operation, add a new 'REQ_F_WORK_INITIALIZED' to determine whether
io_wq_work is initialized and add a helper io_req_init_async(), users
must call io_req_init_async() for the first time touching any members
of io_wq_work.

I use /dev/nullb0 to evaluate performance improvement in my physical
machine:
  modprobe null_blk nr_devices=1 completion_nsec=0
  sudo taskset -c 60 fio  -name=fiotest -filename=/dev/nullb0 -iodepth=128
  -thread -rw=read -ioengine=io_uring -direct=1 -bs=4k -size=100G -numjobs=1
  -time_based -runtime=120

before this patch:
Run status group 0 (all jobs):
   READ: bw=724MiB/s (759MB/s), 724MiB/s-724MiB/s (759MB/s-759MB/s),
   io=84.8GiB (91.1GB), run=120001-120001msec

With this patch:
Run status group 0 (all jobs):
   READ: bw=761MiB/s (798MB/s), 761MiB/s-761MiB/s (798MB/s-798MB/s),
   io=89.2GiB (95.8GB), run=120001-120001msec

About 5% improvement.
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

1d6d088b

io_uring: allow O_NONBLOCK async retry · c0810993

由 Jens Axboe 提交于 6月 09, 2020

to #28736503

commit c5b856255cbc3b664d686a83fa9397a835e063de upstream

We can assume that O_NONBLOCK is always honored, even if we don't
have a ->read/write_iter() for the file type. Also unify the read/write
checking for allowing async punt, having the write side factoring in the
REQ_F_NOWAIT flag as well.

Cc: stable@vger.kernel.org
Fixes: 490e89676a52 ("io_uring: only force async punt if poll based retry can't handle it")
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

c0810993

io_wq: add per-wq work handler instead of per work · c349b387

由 Pavel Begunkov 提交于 6月 08, 2020

to #28736503

commit f5fa38c59cb0b40633dee5cdf7465801be3e4928 upstream

io_uring is the only user of io-wq, and now it uses only io-wq callback
for all its requests, namely io_wq_submit_work(). Instead of storing
work->runner callback in each instance of io_wq_work, keep it in io-wq
itself.

pros:
- reduces io_wq_work size
- more robust -- ->func won't be invalidated with mem{cpy,set}(req)
- helps other work
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

c349b387

io_uring: don't arm a timeout through work.func · efc07284

由 Pavel Begunkov 提交于 6月 08, 2020

to #28736503

commit d4c81f38522f3e7f4be1b472ef9988d0ed7f3696 upstream

Remove io_link_work_cb() -- the last custom work.func.
Not the prettiest thing, but works. Instead of queueing a linked timeout
in io_link_work_cb() mark a request with REQ_F_QUEUE_TIMEOUT and do
enqueueing based on the flag in io_wq_submit_work().
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

efc07284

io_uring: remove custom ->func handlers · 7e2014fa

由 Pavel Begunkov 提交于 6月 08, 2020

to #28736503

commit ac45abc0e2a8ed16ecc0eea039fe762ddfefbcad upstream

In preparation of getting rid of work.func, this removes almost all
custom instances of it, leaving only io_wq_submit_work() and
io_link_work_cb(). And the last one will be dealt later.

Nothing fancy, just routinely remove *_finish() function and inline
what's left. E.g. remove io_fsync_finish() + inline __io_fsync() into
io_fsync().

As no users of io_req_cancelled() are left, delete it as well. The patch
adds extra switch lookup on cold-ish path, but that's overweighted by
nice diffstat and other benefits of the following patches.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

7e2014fa

io_uring: don't derive close state from ->func · c6ed5079

由 Pavel Begunkov 提交于 6月 08, 2020

to #28736503

commit 3af73b286ccee493dc055fc58da02b2dc7a5304d upstream

Relying on having a specific work.func is dangerous, even if an opcode
handler set it itself. E.g. io_wq_assign_next() can modify it.

io_close() sets a custom work.func to indicate that
__close_fd_get_file() was already called. Fortunately, there is no bugs
with io_wq_assign_next() and close yet.

Still, do it safe and always be prepared to be called through
io_wq_submit_work(). Zero req->close.put_file in prep, and call
__close_fd_get_file() IFF it's NULL.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

c6ed5079

io_uring: use kvfree() in io_sqe_buffer_register() · 0885bdfb

由 Denis Efremov 提交于 6月 05, 2020

to #28736503

commit a8c73c1a614f6da6c0b04c393f87447e28cb6de4 upstream

Use kvfree() to free the pages and vmas, since they are allocated by
kvmalloc_array() in a loop.

Fixes: d4ef647510b1 ("io_uring: avoid page allocation warnings")
Signed-off-by: NDenis Efremov <efremov@linux.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20200605093203.40087-1-efremov@linux.comSigned-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

0885bdfb

io_uring: validate the full range of provided buffers for access · f48b6274

由 Bijan Mottahedeh 提交于 6月 04, 2020

to #28736503

commit efe68c1ca8f49e8c06afd74b699411bfbb8ba1ff upstream

Account for the number of provided buffers when validating the address
range.
Signed-off-by: NBijan Mottahedeh <bijan.mottahedeh@oracle.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

f48b6274

io_uring: re-set iov base/len for buffer select retry · 616bb4fd

由 Jens Axboe 提交于 6月 04, 2020

to #28736503

commit dddb3e26f6d88c5344d28cb5ff9d3d6fa05c4f7a upstream

We already have the buffer selected, but we should set the iter list
again.

Cc: stable@vger.kernel.org # v5.7
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

616bb4fd

io_uring: move send/recv IOPOLL check into prep · 43770e26

由 Pavel Begunkov 提交于 6月 03, 2020

to #28736503

commit d2b6f48b691ed67569786c332f0173b918d3fd1b upstream

Fail recv/send in case of IORING_SETUP_IOPOLL earlier during prep,
so it'd be done only once. Removes duplication as well
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

43770e26

io_uring: deduplicate io_openat{,2}_prep() · 6ead77bf

由 Pavel Begunkov 提交于 6月 03, 2020

to #28736503

commit ec65fea5a8d7a82d3137dd2a44197eb577da111f upstream

io_openat_prep() and io_openat2_prep() are identical except for how
struct open_how is built. Deduplicate it with a helper.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

6ead77bf

io_uring: do build_open_how() only once · d81e76a7

由 Pavel Begunkov 提交于 6月 03, 2020

to #28736503

commit 25e72d1012b30bdff712b563e6141a4f311d28d6 upstream

build_open_how() is just adjusting open_flags/mode. Do it once during
prep. It looks better than storing raw values for the future.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

d81e76a7

io_uring: fix {SQ,IO}POLL with unsupported opcodes · a17a35d5

由 Pavel Begunkov 提交于 6月 03, 2020

to #28736503

commit 3232dd02af65f2d01be641120d2a710176b0c7a7 upstream

IORING_SETUP_IOPOLL is defined only for read/write, other opcodes should
be disallowed, otherwise it'll get an error as below. Also refuse
open/close with SQPOLL, as the polling thread wouldn't know which file
table to use.

RIP: 0010:io_iopoll_getevents+0x111/0x5a0
Call Trace:
 ? _raw_spin_unlock_irqrestore+0x24/0x40
 ? do_send_sig_info+0x64/0x90
 io_iopoll_reap_events.part.0+0x5e/0xa0
 io_ring_ctx_wait_and_kill+0x132/0x1c0
 io_uring_release+0x20/0x30
 __fput+0xcd/0x230
 ____fput+0xe/0x10
 task_work_run+0x67/0xa0
 do_exit+0x353/0xb10
 ? handle_mm_fault+0xd4/0x200
 ? syscall_trace_enter+0x18c/0x2c0
 do_group_exit+0x43/0xa0
 __x64_sys_exit_group+0x18/0x20
 do_syscall_64+0x60/0x1e0
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
[axboe: allow provide/remove buffers and files update]
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

a17a35d5

io_uring: disallow close of ring itself · da73d5e3

由 Jens Axboe 提交于 6月 02, 2020

to #28736503

commit fd2206e4e97b5bae422d9f2f9ebbc79bc97e44a5 upstream

A previous commit enabled this functionality, which also enabled O_PATH
to work correctly with io_uring. But we can't safely close the ring
itself, as the file handle isn't reference counted inside
io_uring_enter(). Instead of jumping through hoops to enable ring
closure, add a "soft" ->needs_file option, ->needs_file_no_error. This
enables O_PATH file descriptors to work, but still catches the case of
trying to close the ring itself.
Reported-by: NJann Horn <jannh@google.com>
Fixes: 904fbcb115c8 ("io_uring: remove 'fd is io_uring' from close path")
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

da73d5e3

io_uring: fix overflowed reqs cancellation · edba7ee1

由 Pavel Begunkov 提交于 5月 30, 2020

to #28736503

commit 7b53d59859bc932b37895d2d37388e7fa29af7a5 upstream

Overflowed requests in io_uring_cancel_files() should be shed only of
inflight and overflowed refs. All other left references are owned by
someone else.

If refcount_sub_and_test() fails, it will go further and put put extra
ref, don't do that. Also, don't need to do io_wq_cancel_work()
for overflowed reqs, they will be let go shortly anyway.
Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
Signed-off-by: NJens Axboe <axboe@kernel.dk>
Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>

edba7ee1

openanolis / cloud-kernel 1 年多 前同步成功

openanolis / cloud-kernel
1 年多前同步成功