1. 27 December 2019, 40 commits
    • locking/qspinlock: Introduce starvation avoidance into CNA · 47835be6
      Committed by Alex Kogan
      hulk inclusion
      category: feature
      bugzilla: 13227
      CVE: NA
      
      -------------------------------------------------
      
      Choose the next lock holder among spinning threads running on the same
      node with high probability rather than always. With small probability,
      hand the lock to the first thread in the secondary queue or, if that
      queue is empty, to the immediate successor of the current lock holder
      in the main queue.  Thus, assuming no failures while threads hold the
      lock, every thread would be able to acquire the lock after a bounded
      number of lock transitions, with high probability.
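      
      As a rough illustration of the probabilistic choice described above (the
      constant and helper name below are illustrative, not the actual patch):
      
        #include <stdbool.h>
        #include <stdlib.h>
      
        /*
         * Illustrative sketch only: keep the lock on the current NUMA node
         * with high probability (here 255/256); with the remaining small
         * probability hand it to the head of the secondary queue, or to the
         * main-queue successor if that queue is empty, so that every waiter
         * is served after a bounded expected number of handovers.
         */
        static bool cna_keep_lock_local(void)
        {
                return (rand() & 0xff) != 0;
        }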
      Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
      Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
      Signed-off-by: Wei Li <liwei391@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      47835be6
    • locking/qspinlock: Introduce CNA into the slow path of qspinlock · 2636acee
      Committed by Alex Kogan
      hulk inclusion
      category: feature
      bugzilla: 13227
      CVE: NA
      
      -------------------------------------------------
      
      In CNA, spinning threads are organized in two queues, a main queue for
      threads running on the same node as the current lock holder, and a
      secondary queue for threads running on other nodes. At the unlock time,
      the lock holder scans the main queue looking for a thread running on
      the same node. If found (call it thread T), all threads in the main queue
      between the current lock holder and T are moved to the end of the
      secondary queue, and the lock is passed to T. If such T is not found, the
      lock is passed to the first node in the secondary queue. Finally, if the
      secondary queue is empty, the lock is passed to the next thread in the
      main queue. For more details, see https://arxiv.org/abs/1810.05600.
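      
      A simplified sketch of that unlock-time policy follows; the type and
      helper names are illustrative and not the actual kernel symbols:
      
        struct cna_node {
                struct cna_node *next;          /* successor in the main queue */
                int              numa_node;
        };
      
        /* Assumed helpers, declarations only. */
        struct cna_node *find_successor_on_same_node(struct cna_node *me);
        void move_skipped_waiters_to_secondary(struct cna_node *me,
                                               struct cna_node *t);
        void pass_lock_to(struct cna_node *waiter);
      
        static void cna_pass_lock(struct cna_node *me,
                                  struct cna_node *secondary_head)
        {
                /* Scan the main queue for a waiter on the lock holder's node. */
                struct cna_node *t = find_successor_on_same_node(me);
      
                if (t) {
                        /* Waiters between 'me' and 't' run on other nodes. */
                        move_skipped_waiters_to_secondary(me, t);
                        pass_lock_to(t);
                } else if (secondary_head) {
                        pass_lock_to(secondary_head);
                } else {
                        pass_lock_to(me->next);
                }
        }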
      
      Note that this variant of CNA may introduce starvation by continuously
      passing the lock to threads running on the same node. This issue
      will be addressed later in the series.
      
      Enabling CNA is controlled via a new configuration option
      (NUMA_AWARE_SPINLOCKS). The CNA variant is patched in
      at the boot time only if we run a multi-node machine, and the new
      config is enabled. For the time being, the patching requires
      CONFIG_PARAVIRT_SPINLOCKS to be enabled as well.
      However, this should be resolved once static_call() is available.
      Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
      Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
      Signed-off-by: Wei Li <liwei391@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      2636acee
    • locking/qspinlock: Refactor the qspinlock slow path · 6ebce3eb
      Committed by Alex Kogan
      hulk inclusion
      category: feature
      bugzilla: 13227
      CVE: NA
      
      -------------------------------------------------
      
      Move some of the code manipulating the spin lock into separate functions.
      This would allow easier integration of alternative ways to manipulate
      that lock.
      Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
      Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
      Signed-off-by: Wei Li <liwei391@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      6ebce3eb
    • locking/qspinlock: Make arch_mcs_spin_unlock_contended more generic · fea85437
      Committed by Alex Kogan
      hulk inclusion
      category: feature
      bugzilla: 13227
      CVE: NA
      
      -------------------------------------------------
      
      The arch_mcs_spin_unlock_contended macro should accept the value to be
      stored into the lock argument as another argument. This allows using the
      same macro in cases where the value to be stored is different from 1.
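      
      Roughly, the generic definition in kernel/locking/mcs_spinlock.h changes
      along these lines (sketch; the exact patch may differ):
      
        /* Previously: #define arch_mcs_spin_unlock_contended(l) smp_store_release((l), 1) */
      
        /* Now the caller passes the value to store, so a value other than 1
         * can be handed to the next waiter: */
        #define arch_mcs_spin_unlock_contended(l, val)                  \
                smp_store_release((l), (val))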
      Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
      Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
      Signed-off-by: Wei Li <liwei391@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      fea85437
    • locking/qspinlock: Remove unnecessary BUG_ON() call · 3b113191
      Committed by Waiman Long
      mainline inclusion
      from mainline-5.1-rc1
      commit 733000c7ffd9
      category: bugfix
      bugzilla: 13227
      CVE: NA
      
      -------------------------------------------------
      
      With the > 4 nesting levels case handled by the commit:
      
        d682b596d993 ("locking/qspinlock: Handle > 4 slowpath nesting levels")
      
      the BUG_ON() call in encode_tail() will never actually be triggered.
      
      Remove it.
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Will Deacon <will.deacon@arm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/1551057253-3231-1-git-send-email-longman@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Wei Li <liwei391@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      3b113191
    • locking/qspinlock_stat: Track the no MCS node available case · bd3656c2
      Committed by Waiman Long
      mainline inclusion
      from mainline-5.1-rc1
      commit 412f34a82ccf
      category: bugfix
      bugzilla: 13227
      CVE: NA
      
      -------------------------------------------------
      
      Track the number of slowpath locking operations that are being done
      without any MCS node available, as well as renaming lock_index[123] to
      make them more descriptive.
      
      Using these stat counters is one way to find out if a code path is
      being exercised.
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: SRINIVAS <srinivas.eeda@oracle.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Zhenzhong Duan <zhenzhong.duan@oracle.com>
      Link: https://lkml.kernel.org/r/1548798828-16156-3-git-send-email-longman@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Wei Li <liwei391@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      bd3656c2
    • locking/qspinlock: Handle > 4 slowpath nesting levels · 13fd11b5
      Committed by Waiman Long
      mainline inclusion
      from mainline-5.1-rc1
      commit d682b596d993
      category: bugfix
      bugzilla: 13227
      CVE: NA
      
      -------------------------------------------------
      
      Four queue nodes per CPU are allocated to enable up to 4 nesting levels
      using the per-CPU nodes. Nested NMIs are possible on some architectures.
      Still, it is very unlikely that we will ever hit more than 4 nesting
      levels with contention in the slowpath.
      
      When that rare condition happens, however, it is likely that the system
      will hang or crash shortly after that. It is not good and we need to
      handle this exception case.
      
      This is done by spinning directly on the lock using repeated trylock.
      This alternative code path should only be used when there are nested
      NMIs. Assuming that the locks used by those NMI handlers will not be
      heavily contended, simple TAS locking should work out.
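      
      The fallback inside queued_spin_lock_slowpath() looks roughly like the
      fragment below (a sketch of the mainline change, not a verbatim quote):
      
        /*
         * If all four per-CPU MCS nodes are already in use (idx >= MAX_NODES),
         * don't queue at all; just spin on the lock word with trylock until
         * the lock is acquired.
         */
        if (unlikely(idx >= MAX_NODES)) {
                while (!queued_spin_trylock(lock))
                        cpu_relax();
                goto release;
        }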
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Will Deacon <will.deacon@arm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: SRINIVAS <srinivas.eeda@oracle.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Zhenzhong Duan <zhenzhong.duan@oracle.com>
      Link: https://lkml.kernel.org/r/1548798828-16156-2-git-send-email-longman@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Wei Li <liwei391@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      13fd11b5
    • locking/pvqspinlock: Extend node size when pvqspinlock is configured · c579875d
      Committed by Waiman Long
      mainline inclusion
      from mainline-4.20-rc1
      commit 0fa809ca7f81
      category: bugfix
      bugzilla: 13227
      CVE: NA
      
      -------------------------------------------------
      
      The qspinlock code supports up to 4 levels of slowpath nesting using
      four per-CPU mcs_spinlock structures. For 64-bit architectures, they
      fit nicely in one 64-byte cacheline.
      
      For para-virtualized (PV) qspinlocks it needs to store more information
      in the per-CPU node structure than there is space for. It uses a trick
      to use a second cacheline to hold the extra information that it needs.
      So PV qspinlock needs to access two extra cachelines for its information
      whereas the native qspinlock code only needs one extra cacheline.
      
      Freshly added counter profiling of the qspinlock code, however, revealed
      that it was very rare to use more than two levels of slowpath nesting.
      So it doesn't make sense to penalize PV qspinlock code in order to have
      four mcs_spinlock structures in the same cacheline to optimize for a case
      in the native qspinlock code that rarely happens.
      
      Extend the per-CPU node structure to have two more long words when PV
      qspinlock locks are configured to hold the extra data that it needs.
      
      As a result, the PV qspinlock code will enjoy the same benefit of using
      just one extra cacheline like the native counterpart, for most cases.
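      
      The resulting per-CPU node layout can be sketched as follows (based on
      the description above; the PV-only padding lives alongside the MCS node):
      
        struct qnode {
                struct mcs_spinlock mcs;
        #ifdef CONFIG_PARAVIRT_SPINLOCKS
                long reserved[2];       /* extra words used only by PV qspinlock */
        #endif
        };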
      
      [ mingo: Minor changelog edits. ]
      Signed-off-by: Waiman Long <longman@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Link: http://lkml.kernel.org/r/1539697507-28084-2-git-send-email-longman@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Wei Li <liwei391@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      c579875d
    • locking/qspinlock_stat: Count instances of nested lock slowpaths · c9bd1fb6
      Committed by Waiman Long
      mainline inclusion
      from mainline-4.20-rc1
      commit 1222109a5363
      category: bugfix
      bugzilla: 13227
      CVE: NA
      
      -------------------------------------------------
      
      Queued spinlock supports up to 4 levels of lock slowpath nesting -
      user context, soft IRQ, hard IRQ and NMI. However, we are not sure how
      often the nesting happens.
      
      So add 3 more per-CPU stat counters to track the number of instances where
      nesting index goes to 1, 2 and 3 respectively.
      
      On a dual-socket 64-core 128-thread Zen server, the following were the
      new stat counter values under different circumstances:
      
               State                         slowpath   index1   index2   index3
               -----                         --------   ------   ------   -------
        After bootup                         1,012,150    82       0        0
        After parallel build + perf-top    125,195,009    82       0        0
      
      So the chance of having more than 2 levels of nesting is extremely low.
      
      [ mingo: Minor changelog edits. ]
      Signed-off-by: Waiman Long <longman@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Link: http://lkml.kernel.org/r/1539697507-28084-1-git-send-email-longman@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Wei Li <liwei391@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      c9bd1fb6
    • arm64: Force SSBS on context switch · c2f1181d
      Committed by Marc Zyngier
      mainline inclusion
      from mainline-v5.3-rc2
      commit cbdf8a189a66001c36007bf0f5c975d0376c5c3a
      category: feature
      bugzilla: 20806
      CVE: NA
      
      -------------------------------------------------
      
      On a CPU that doesn't support SSBS, PSTATE[12] is RES0.  In a system
      where only some of the CPUs implement SSBS, we end up losing track of
      the SSBS bit across task migration.
      
      To address this issue, let's force the SSBS bit on context switch.
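      
      A simplified sketch of the idea follows; the mitigation check and the
      PSTATE bit names are illustrative and may not match the upstream helpers
      exactly:
      
        static void ssbs_thread_switch(struct task_struct *next)
        {
                struct pt_regs *regs = task_pt_regs(next);
      
                /* If the SSBD mitigation must stay enabled for this task,
                 * leave SSBS clear (hypothetical helper name). */
                if (ssbd_mitigation_required(next))
                        return;
      
                /* Otherwise re-assert SSBS in the saved PSTATE, in case the
                 * previous CPU treated PSTATE[12] as RES0 and lost the bit. */
                if (compat_user_mode(regs))
                        regs->pstate |= PSR_AA32_SSBS_BIT;
                else if (user_mode(regs))
                        regs->pstate |= PSR_SSBS_BIT;
        }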
      
      Fixes: 8f04e8e6e29c ("arm64: ssbd: Add support for PSTATE.SSBS rather than trapping to EL3")
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      [will: inverted logic and added comments]
      Signed-off-by: Will Deacon <will@kernel.org>
      Conflicts:
        arch/arm64/kernel/process.c
      [yyl: adjust context]
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      c2f1181d
    • arm64: fix SSBS sanitization · 0c708103
      Committed by Mark Rutland
      mainline inclusion
      from mainline-v5.0-rc8
      commit f54dada8274643e3ff4436df0ea124aeedc43cae
      category: feature
      bugzilla: 20806
      CVE: NA
      
      -------------------------------------------------
      
      In valid_user_regs() we treat SSBS as a RES0 bit, and consequently it is
      unexpectedly cleared when we restore a sigframe or fiddle with GPRs via
      ptrace.
      
      This patch fixes valid_user_regs() to account for this, updating the
      function to refer to the latest ARM ARM (ARM DDI 0487D.a). For AArch32
      tasks, SSBS appears in bit 23 of SPSR_EL1, matching its position in the
      AArch32-native PSR format, and we don't need to translate it as we have
      to for DIT.
      
      There are no other bit assignments that we need to account for today.
      As the recent documentation describes the DIT bit, we can drop our
      comment regarding DIT.
      
      While removing SSBS from the RES0 masks, existing inconsistent
      whitespace is corrected.
      
      Fixes: d71be2b6c0e19180 ("arm64: cpufeature: Detect SSBS and advertise to userspace")
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      0c708103
    • arm64: cpu: Move errata and feature enable callbacks closer to callers · a37ad0cd
      Committed by Will Deacon
      mainline inclusion
      from mainline-v4.20-rc1
      commit b8925ee2e12d1cb9a11d6f28b5814f2bfa59dce1
      category: feature
      bugzilla: 20806
      CVE: NA
      
      -------------------------------------------------
      
      The cpu errata and feature enable callbacks are only called via their
      respective arm64_cpu_capabilities structure and therefore shouldn't
      exist in the global namespace.
      
      Move the PAN, RAS and cache maintenance emulation enable callbacks into
      the same files as their corresponding arm64_cpu_capabilities structures,
      making them static in the process.
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Conflicts:
        arch/arm64/kernel/cpu_errata.c
        arch/arm64/kernel/cpufeature.c
      [yyl: adjust context]
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Xuefeng Wang <wxf.wang@hisilicon.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      a37ad0cd
    • KVM: arm64: Set SCTLR_EL2.DSSBS if SSBD is forcefully disabled and !vhe · 1eea4492
      Committed by Will Deacon
      mainline inclusion
      from mainline-v4.20-rc1
      commit 7c36447ae5a090729e7b129f24705bb231a07e0b
      category: feature
      bugzilla: 20806
      CVE: NA
      
      -------------------------------------------------
      
      When running without VHE, it is necessary to set SCTLR_EL2.DSSBS if SSBD
      has been forcefully disabled on the kernel command-line.
      Acked-by: Christoffer Dall <christoffer.dall@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Xuefeng Wang <wxf.wang@hisilicon.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      1eea4492
    • arm64: ssbd: Add support for PSTATE.SSBS rather than trapping to EL3 · 09bb7481
      Committed by Will Deacon
      mainline inclusion
      from mainline-v4.20-rc1
      commit 8f04e8e6e29c93421a95b61cad62e3918425eac7
      category: feature
      bugzilla: 20806
      CVE: NA
      
      -------------------------------------------------
      
      On CPUs with support for PSTATE.SSBS, the kernel can toggle the SSBD
      state without needing to call into firmware.
      
      This patch hooks into the existing SSBD infrastructure so that SSBS is
      used on CPUs that support it, but it's all made horribly complicated by
      the very real possibility of big/little systems that don't uniformly
      provide the new capability.
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      
      Conflicts:
        arch/arm64/kernel/process.c
        arch/arm64/kernel/ssbd.c
        arch/arm64/kernel/cpufeature.c
      [yyl: adjust context]
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Xuefeng Wang <wxf.wang@hisilicon.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      09bb7481
    • arm64: ssbd: Drop #ifdefs for PR_SPEC_STORE_BYPASS · 1438602c
      Committed by Will Deacon
      mainline inclusion
      from mainline-v4.20-rc1
      commit 2d1b2a91d56b19636b740ea70c8399d1df249f20
      category: feature
      bugzilla: 20806
      CVE: NA
      
      -------------------------------------------------
      
      Now that we're all merged nicely into mainline, there's no need to check
      to see if PR_SPEC_STORE_BYPASS is defined.
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Xuefeng Wang <wxf.wang@hisilicon.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      1438602c
    • arm64: cpufeature: Detect SSBS and advertise to userspace · be185032
      Committed by Will Deacon
      mainline inclusion
      from mainline-v4.20-rc1
      commit d71be2b6c0e19180b5f80a6d42039cc074a693a2
      category: feature
      bugzilla: 20806
      CVE: NA
      
      -------------------------------------------------
      
      Armv8.5 introduces a new PSTATE bit known as Speculative Store Bypass
      Safe (SSBS) which can be used as a mitigation against Spectre variant 4.
      
      Additionally, a CPU may provide instructions to manipulate PSTATE.SSBS
      directly, so that userspace can toggle the SSBS control without trapping
      to the kernel.
      
      This patch probes for the existence of SSBS and advertises the new
      instructions to userspace if they exist.
      Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Conflicts:
        arch/arm64/kernel/cpufeature.c
        arch/arm64/include/asm/cpucaps.h
      [yyl: adjust context]
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Xuefeng Wang <wxf.wang@hisilicon.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      be185032
    • arm64: Fix silly typo in comment · 9f3c5929
      Committed by Will Deacon
      mainline inclusion
      from mainline-v4.20-rc1
      commit ca7f686ac9fe87a9175696a8744e095ab9749c49
      category: feature
      bugzilla: 20806
      CVE: NA
      
      -------------------------------------------------
      
      I was passing through and figured I'd fix this up:
      
      	featuer -> feature
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Xuefeng Wang <wxf.wang@hisilicon.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      9f3c5929
    • ksm: replace jhash2 with xxhash · ac7353e5
      Committed by Timofey Titovets
      mainline inclusion
      from mainline-5.0-rc1
      commit 59e1a2f4bf83744e748636415fde7d1e9f557e05
      category: performance
      bugzilla: 13231
      CVE: NA
      
      ------------------------------------------------
      
      Replace jhash2 with xxhash.
      
      Perf numbers:
      Intel(R) Xeon(R) CPU E5-2420 v2 @ 2.20GHz
      ksm: crc32c   hash() 12081 MB/s
      ksm: xxh64    hash()  8770 MB/s
      ksm: xxh32    hash()  4529 MB/s
      ksm: jhash2   hash()  1569 MB/s
      
      Sioh Lee did some testing:
      
      crc32c_intel: 1084.10ns
      crc32c (no hardware acceleration): 7012.51ns
      xxhash32: 2227.75ns
      xxhash64: 1413.16ns
      jhash2: 5128.30ns
      
      As jhash2 will always be slower (for data sizes like PAGE_SIZE), don't use
      it in ksm at all.
      
      Use only xxhash for now, because using crc32c requires the cryptoapi to be
      initialized first - and that needs some tricky solution to work well in all
      situations.
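      
      After this change, ksm's checksum helper looks roughly like:
      
        static u32 calc_checksum(struct page *page)
        {
                u32 checksum;
                void *addr = kmap_atomic(page);
      
                checksum = xxhash(addr, PAGE_SIZE, 0);  /* was: jhash2(...) */
                kunmap_atomic(addr);
                return checksum;
        }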
      
      Link: http://lkml.kernel.org/r/20181023182554.23464-3-nefelim4ag@gmail.com
      Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
      Signed-off-by: leesioh <solee@os.korea.ac.kr>
      Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: zhong jiang <zhongjiang@huawei.com>
      Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      ac7353e5
    • xxHash: create arch dependent 32/64-bit xxhash() · b41feb0c
      Committed by Timofey Titovets
      mainline inclusion
      from mainline-5.0-rc1
      commit 0b9df58b79fa283fbedc0fb6a8e248599444bacc
      category: performance
      bugzilla: 13231
      CVE: NA
      
      ------------------------------------------------
      
      Patch series "Currently used jhash are slow enough and replace it allow as
      to make KSM", v8.
      
      Speed (in kernel):
              ksm: crc32c   hash() 12081 MB/s
              ksm: xxh64    hash()  8770 MB/s
              ksm: xxh32    hash()  4529 MB/s
              ksm: jhash2   hash()  1569 MB/s
      
      Sioh Lee's testing (copy from other mail):
      
      Test platform: openstack cloud platform (NEWTON version)
      Experiment node: openstack based cloud compute node (CPU: xeon E5-2620 v3, memory 64gb)
      VM: (2 VCPU, RAM 4GB, DISK 20GB) * 4
      Linux kernel: 4.14 (latest version)
      KSM setup - sleep_millisecs: 200ms, pages_to_scan: 200
      
      Experiment process:
      Firstly, we turn off KSM and launch 4 VMs.  Then we turn on the KSM and
      measure the checksum computation time until full_scans become two.
      
      The experimental results (the experimental value is the average of the measured values)
      crc32c_intel: 1084.10ns
      crc32c (no hardware acceleration): 7012.51ns
      xxhash32: 2227.75ns
      xxhash64: 1413.16ns
      jhash2: 5128.30ns
      
      In summary, the result shows that crc32c_intel has advantages over all of
      the hash functions used in the experiment (decreased by 84.54% compared
      to crc32c, 78.86% compared to jhash2, 51.33% compared to xxhash32, and
      23.28% compared to xxhash64). The results are similar to those of Timofey.
      
      But use only xxhash for now, because using crc32c requires the cryptoapi
      to be initialized first - and that needs some tricky solution to work well
      in all situations.
      
      So:
      
      - The first patch implements compile-time pickup of the fastest xxhash
        implementation for the target platform.
      
      - The second patch replaces jhash2 with xxhash.
      
      This patch (of 2):
      
      xxh32() - fast on both 32/64-bit platforms
      xxh64() - fast only on 64-bit platforms
      
      Create xxhash() which will pick up the fastest version at compile time.
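      
      The new helper in <linux/xxhash.h> is essentially a compile-time switch
      on the word size (sketch):
      
        static inline unsigned long xxhash(const void *input, size_t length,
                                           uint64_t seed)
        {
        #if BITS_PER_LONG == 64
                return xxh64(input, length, seed);
        #else
                return xxh32(input, length, seed);
        #endif
        }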
      
      Link: http://lkml.kernel.org/r/20181023182554.23464-2-nefelim4ag@gmail.com
      Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
      Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: leesioh <solee@os.korea.ac.kr>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: zhong jiang <zhongjiang@huawei.com>
      Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      b41feb0c
    • arm64/neon: Disable -Wincompatible-pointer-types when building with Clang · e9fabe0c
      Committed by Nathan Chancellor
      mainline inclusion
      from mainline-5.0
      commit 0738c8b5915c
      category: bugfix
      bugzilla: 11024
      CVE: NA
      
      -------------------------------------------------
      
      After commit cc9f8349cb33 ("arm64: crypto: add NEON accelerated XOR
      implementation"), Clang builds for arm64 started failing with the
      following error message.
      
      arch/arm64/lib/xor-neon.c:58:28: error: incompatible pointer types
      assigning to 'const unsigned long *' from 'uint64_t *' (aka 'unsigned
      long long *') [-Werror,-Wincompatible-pointer-types]
                      v3 = veorq_u64(vld1q_u64(dp1 +  6), vld1q_u64(dp2 + 6));
                                               ^~~~~~~~
      /usr/lib/llvm-9/lib/clang/9.0.0/include/arm_neon.h:7538:47: note:
      expanded from macro 'vld1q_u64'
        __ret = (uint64x2_t) __builtin_neon_vld1q_v(__p0, 51); \
                                                    ^~~~
      
      There has been quite a bit of debate and triage that has gone into
      figuring out what the proper fix is, viewable at the link below, which
      is still ongoing. Ard suggested disabling this warning with Clang with a
      pragma so no neon code will have this type of error. While this is not
      at all an ideal solution, this build error is the only thing preventing
      KernelCI from having successful arm64 defconfig and allmodconfig builds
      on linux-next. Getting continuous integration running is more important
      so new warnings/errors or boot failures can be caught and fixed quickly.
      
      Link: https://github.com/ClangBuiltLinux/linux/issues/283
      Suggested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      (cherry picked from commit 0738c8b5915c7eaf1e6007b441008e8f3b460443)
      Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      e9fabe0c
    • arm64: crypto: add NEON accelerated XOR implementation · 2b99d865
      Committed by Jackie Liu
      mainline inclusion
      from mainline-5.0-rc1
      commit: cc9f8349cb33965120a96c12e05d00676162eb7f
      category: feature
      feature: NEON accelerated XOR
      bugzilla: 11024
      CVE: NA
      
      --------------------------------------------------
      
      This is a NEON acceleration method that can improve
      performance by approximately 20%. I got the following
      data from CentOS 7.5 on Huawei's HISI1616 chip:
      
      [ 93.837726] xor: measuring software checksum speed
      [ 93.874039]   8regs  : 7123.200 MB/sec
      [ 93.914038]   32regs : 7180.300 MB/sec
      [ 93.954043]   arm64_neon: 9856.000 MB/sec
      [ 93.954047] xor: using function: arm64_neon (9856.000 MB/sec)
      
      I believe this code can bring some optimization for
      all arm64 platforms. Thanks to Ard Biesheuvel for his suggestions.
      Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
      Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      2b99d865
    • arm64/neon: add workaround for ambiguous C99 stdint.h types · 28777f93
      Committed by Jackie Liu
      mainline inclusion
      from mainline-5.0-rc1
      commit: 21e28547f613e7ba4f4cb6831a3ead2d723fdf7b
      category: feature
      feature: NEON accelerated XOR
      bugzilla: 11024
      CVE: NA
      
      --------------------------------------------------
      
      In a way similar to ARM commit 09096f6a ("ARM: 7822/1: add workaround
      for ambiguous C99 stdint.h types"), this patch redefines the macros that
      are used in stdint.h so its definitions of uint64_t and int64_t are
      compatible with those of the kernel.
      
      This patch comes from: https://patchwork.kernel.org/patch/3540001/
      Written by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      
      We mark this file as a private file so we don't have to override asm/types.h.
      Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
      Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      28777f93
    • arm64/lib: improve CRC32 performance for deep pipelines · fce68364
      Committed by Ard Biesheuvel
      mainline inclusion
      from mainline-5.0
      commit: efdb25efc7645b326cd5eb82be5feeabe167c24e
      category: perf
      bugzilla: 20886
      CVE: NA
      
      lib/crc32test result:
      
      [root@localhost build]# rmmod crc32test && insmod lib/crc32test.ko &&
      dmesg | grep cycles
      [83170.153209] CPU7: use cycles 26243990
      [83183.122137] CPU7: use cycles 26151290
      [83309.691628] CPU7: use cycles 26122830
      [83312.415559] CPU7: use cycles 26232600
      [83313.191479] CPU8: use cycles 26082350
      
      rmmod crc32test && insmod lib/crc32test.ko && dmesg | grep cycles
      [ 1023.539931] CPU25: use cycles 12256730
      [ 1024.850360] CPU24: use cycles 12249680
      [ 1025.463622] CPU25: use cycles 12253330
      [ 1025.862925] CPU25: use cycles 12269720
      [ 1026.376038] CPU26: use cycles 12222480
      
      Based on 13702:
      arm64/lib: improve CRC32 performance for deep pipelines
      crypto: arm64/crc32 - remove PMULL based CRC32 driver
      arm64/lib: add accelerated crc32 routines
      arm64: cpufeature: add feature for CRC32 instructions
      lib/crc32: make core crc32() routines weak so they can be overridden
      
      ----------------------------------------------
      
      Improve the performance of the crc32() asm routines by getting rid of
      most of the branches and small sized loads on the common path.
      
      Instead, use a branchless code path involving overlapping 16 byte
      loads to process the first (length % 32) bytes, and process the
      remainder using a loop that processes 32 bytes at a time.
      
      Tested using the following test program:
      
        #include <stdlib.h>
      
        extern void crc32_le(unsigned short, char const*, int);
      
        int main(void)
        {
          static const char buf[4096];
      
          srand(20181126);
      
          for (int i = 0; i < 100 * 1000 * 1000; i++)
            crc32_le(0, buf, rand() % 1024);
      
          return 0;
        }
      
      On Cortex-A53 and Cortex-A57, the performance regresses but only very
      slightly. On Cortex-A72 however, the performance improves from
      
        $ time ./crc32
      
        real  0m10.149s
        user  0m10.149s
        sys   0m0.000s
      
      to
      
        $ time ./crc32
      
        real  0m7.915s
        user  0m7.915s
        sys   0m0.000s
      
      Cc: Rui Sun <sunrui26@huawei.com>
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      fce68364
    • lib/crc32.c: mark crc32_le_base/__crc32c_le_base aliases as __pure · 0066ba62
      Committed by Miguel Ojeda
      mainline inclusion
      from mainline-5.0
      commit ff98e20ef2081b8620dada28fc2d4fb24ca0abf2
      category: bugfix
      bugzilla: 13702
      CVE: NA
      
      -------------------------------------------------
      
      The upcoming GCC 9 release extends the -Wmissing-attributes warnings
      (enabled by -Wall) to C and aliases: it warns when particular function
      attributes are missing in the aliases but not in their target.
      
      In particular, it triggers here because crc32_le_base/__crc32c_le_base
      aren't __pure while their target crc32_le/__crc32c_le are.
      
      These aliases are used by architectures as a fallback in accelerated
      versions of CRC32. See commit 9784d82db3eb ("lib/crc32: make core crc32()
      routines weak so they can be overridden").
      
      Therefore, being fallbacks, it is likely that even if the aliases
      were called from C, there wouldn't be any optimizations possible.
      Currently, the only user is arm64, which calls this from asm.
      
      Still, marking the aliases as __pure makes sense and is a good idea
      for documentation purposes and possible future optimizations,
      which also silences the warning.
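      
      The alias declarations in lib/crc32.c then carry the same attribute as
      their targets, roughly:
      
        u32 __pure crc32_le_base(u32, unsigned char const *, size_t)
                __alias(crc32_le);
        u32 __pure __crc32c_le_base(u32, unsigned char const *, size_t)
                __alias(__crc32c_le);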
      Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Tested-by: Laura Abbott <labbott@redhat.com>
      Signed-off-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
      (cherry picked from commit ff98e20ef2081b8620dada28fc2d4fb24ca0abf2)
      Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      0066ba62
    • crypto: arm64/crc32 - remove PMULL based CRC32 driver · 0e584ca5
      Committed by Ard Biesheuvel
      mainline inclusion
      from mainline-4.20-rc1
      commit: 598b7d41e544322c8c4f3737ee8ddf905a44175e
      category: feature
      feature: accelerated crc32 routines
      bugzilla: 13702
      CVE: NA
      
      --------------------------------------------------
      
      Now that the scalar fallbacks have been moved out of this driver into
      the core crc32()/crc32c() routines, we are left with a CRC32 crypto API
      driver for arm64 that is based only on 64x64 polynomial multiplication,
      which is an optional instruction in the ARMv8 architecture, and is less
      and less likely to be available on cores that do not also implement the
      CRC32 instructions, given that those are mandatory in the architecture
      as of ARMv8.1.
      
      Since the scalar instructions do not require the special handling that
      SIMD instructions do, and since they turn out to be considerably faster
      on some cores (Cortex-A53) as well, there is really no point in keeping
      this code around so let's just remove it.
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      0e584ca5
    • arm64/lib: add accelerated crc32 routines · 11fa09f2
      Committed by Ard Biesheuvel
      mainline inclusion
      from mainline-4.20-rc1
      commit: 7481cddf29ede204b475facc40e6f65459939881
      category: feature
      feature: accelerated crc32 routines
      bugzilla: 13702
      CVE: NA
      
      --------------------------------------------------
      
      Unlike crc32c(), which is wired up to the crypto API internally so the
      optimal driver is selected based on the platform's capabilities,
      crc32_le() is implemented as a library function using a slice-by-8 table
      based C implementation. Even though few of the call sites may be
      bottlenecks, calling a time variant implementation with a non-negligible
      D-cache footprint is a bit of a waste, given that ARMv8.1 and up mandates
      support for the CRC32 instructions that were optional in ARMv8.0, but are
      already widely available, even on the Cortex-A53 based Raspberry Pi.
      
      So implement routines that use these instructions if available, and fall
      back to the existing generic routines otherwise. The selection is based
      on alternatives patching.
      
      Note that this unconditionally selects CONFIG_CRC32 as a builtin. Since
      CRC32 is relied upon by core functionality such as CONFIG_OF_FLATTREE,
      this just codifies the status quo.
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      11fa09f2
    • arm64: cpufeature: add feature for CRC32 instructions · 3a8984a9
      Committed by Ard Biesheuvel
      mainline inclusion
      from mainline-4.20-rc1
      commit: 86d0dd34eafffbc76a81aba6ae2d71927d3835a8
      category: feature
      feature: accelerated crc32 routines
      bugzilla: 13702
      CVE: NA
      
      --------------------------------------------------
      
      Add a CRC32 feature bit and wire it up to the CPU id register so we
      will be able to use alternatives patching for CRC32 operations.
      Acked-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      
      Conflicts:
      	arch/arm64/include/asm/cpucaps.h
      	arch/arm64/kernel/cpufeature.c
      Signed-off-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      3a8984a9
    • lib/crc32: make core crc32() routines weak so they can be overridden · a216345b
      Committed by Ard Biesheuvel
      mainline inclusion
      from mainline-4.20-rc1
      commit: 9784d82db3eb3de7851e5a3f4a2481607de2452c
      category: feature
      feature: accelerated crc32 routines
      bugzilla: 13702
      CVE: NA
      
      --------------------------------------------------
      
      Allow architectures to drop in accelerated CRC32 routines by making
      the crc32_le/__crc32c_le entry points weak, and exposing non-weak
      aliases for them that may be used by the accelerated versions as
      fallbacks in case the instructions they rely upon are not available.
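      
      The arrangement in lib/crc32.c can be sketched with declarations only (an
      architecture's accelerated crc32_le() overrides the weak symbol and may
      call the non-weak *_base alias as its fallback):
      
        /* Weak entry point: an arch can provide its own definition. */
        u32 __pure __weak crc32_le(u32 crc, unsigned char const *p, size_t len);
      
        /* Non-weak alias to the generic implementation, kept available as a
         * fallback for the accelerated version. */
        u32 crc32_le_base(u32, unsigned char const *, size_t) __alias(crc32_le);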
      Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      a216345b
    • pci: do not save 'PCI_BRIDGE_CTL_BUS_RESET' · 36b45f28
      Committed by Xiongfeng Wang
      hulk inclusion
      category: bugfix
      bugzilla: 20702
      CVE: NA
      ---------------------------
      
      When I inject a PCIe fatal error into a Mellanox netdevice, 'dmesg'
      shows the device is recovered successfully, but 'lspci' doesn't show the
      device. I checked the configuration space of the slot where the
      netdevice is inserted and found out the bit 'PCI_BRIDGE_CTL_BUS_RESET'
      is set. Later, I found out this is because the bit is saved in
      'saved_config_space' of 'struct pci_dev' when 'pci_pm_runtime_suspend()'
      is called, and 'PCI_BRIDGE_CTL_BUS_RESET' is then set every time we
      restore the configuration space.
      
      This patch avoids saving the bit 'PCI_BRIDGE_CTL_BUS_RESET' when we save
      the configuration space of a bridge.
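      
      An illustrative sketch of the approach (not the actual hulk patch; the
      helper name is hypothetical):
      
        /* Mask out the Secondary Bus Reset bit while saving a bridge's
         * config space, so a later restore cannot leave the bus in reset. */
        static void pci_save_bridge_control(struct pci_dev *dev, u16 *ctl)
        {
                pci_read_config_word(dev, PCI_BRIDGE_CONTROL, ctl);
                *ctl &= ~PCI_BRIDGE_CTL_BUS_RESET;
        }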
      Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      36b45f28
    • config: enable CONFIG_KTASK in hulk_defconfig and storage_ci_defconfig · 58622cdd
      Committed by Hongbo Yao
      hulk inclusion
      category: feature
      bugzilla: 13228
      CVE: NA
      ---------------------------
      
      enable CONFIG_KTASK in hulk_defconfig and storage_ci_defconfig
      Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      58622cdd
    • ktask: change the chunk size for some ktask thread functions · 135e9232
      Committed by Hongbo Yao
      hulk inclusion
      category: feature
      bugzilla: 13228
      CVE: NA
      
      ---------------------------
      
      This patch fixes some issues in the original series.
      1) PMD_SIZE chunks have made thread finishing times too spread out
      in some cases, so KTASK_MEM_CHUNK (128M) seems to be a reasonable compromise.
      2) If hugepagesz=1G, then pages_per_huge_page = 1G / 4K = 256,
      and using KTASK_MEM_CHUNK would cause the ktask thread count to be 1, which
      will not improve the performance of clearing a gigantic page.
      Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Tested-by: Hongbo Yao <yaohongbo@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      135e9232
    • hugetlbfs: parallelize hugetlbfs_fallocate with ktask · 4733c59f
      Committed by Daniel Jordan
      hulk inclusion
      category: feature
      bugzilla: 13228
      CVE: NA
      ---------------------------
      
      hugetlbfs_fallocate preallocates huge pages to back a file in a
      hugetlbfs filesystem.  The time to call this function grows linearly
      with size.
      
      ktask performs well with its default thread count of 4; higher thread
      counts are given for context only.
      
      Machine: Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz, 288 CPUs, 1T memory
      Test:    fallocate(1) a file on a hugetlbfs filesystem
      
      nthread   speedup   size (GiB)   min time (s)   stdev
            1                    200         127.53    2.19
            2     3.09x          200          41.30    2.11
            4     5.72x          200          22.29    0.51
            8     9.45x          200          13.50    2.58
           16     9.74x          200          13.09    1.64
      
            1                    400         193.09    2.47
            2     2.14x          400          90.31    3.39
            4     3.84x          400          50.32    0.44
            8     5.11x          400          37.75    1.23
           16     6.12x          400          31.54    3.13
      
      The primary bottleneck for better scaling at higher thread counts is
      hugetlb_fault_mutex_table[hash].  perf showed L1-dcache-loads increase
      with 8 threads and again sharply with 16 threads, and a CPU counter
      profile showed that 31% of the L1d misses were on
      hugetlb_fault_mutex_table[hash] in the 16-thread case.
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Tested-by: Hongbo Yao <yaohongbo@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      4733c59f
    • mm: parallelize clear_gigantic_page · ae0cd4d4
      Committed by Daniel Jordan
      hulk inclusion
      category: feature
      bugzilla: 13228
      CVE: NA
      ---------------------------
      
      Parallelize clear_gigantic_page, which zeroes any page size larger than
      8M (e.g. 1G on x86).
      
      Performance results (the default number of threads is 4; higher thread
      counts shown for context only):
      
      Machine:  Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz, 288 CPUs, 1T memory
      Test:     Clear a range of gigantic pages (triggered via fallocate)
      
      nthread   speedup   size (GiB)   min time (s)   stdev
            1                    100          41.13    0.03
            2     2.03x          100          20.26    0.14
            4     4.28x          100           9.62    0.09
            8     8.39x          100           4.90    0.05
           16    10.44x          100           3.94    0.03
      
            1                    200          89.68    0.35
            2     2.21x          200          40.64    0.18
            4     4.64x          200          19.33    0.32
            8     8.99x          200           9.98    0.04
           16    11.27x          200           7.96    0.04
      
            1                    400         188.20    1.57
            2     2.30x          400          81.84    0.09
            4     4.63x          400          40.62    0.26
            8     8.92x          400          21.09    0.50
           16    11.78x          400          15.97    0.25
      
            1                    800         434.91    1.81
            2     2.54x          800         170.97    1.46
            4     4.98x          800          87.38    1.91
            8    10.15x          800          42.86    2.59
           16    12.99x          800          33.48    0.83
      
      The speedups are mostly due to the fact that more threads can use more
      memory bandwidth.  The loop we're stressing on the x86 chip in this test
      is clear_page_erms, which tops out at a bandwidth of 2550 MiB/s with one
      thread.  We get the same bandwidth per thread for 2, 4, or 8 threads,
      but at 16 threads the per-thread bandwidth drops to 1420 MiB/s.
      
      However, the performance also improves over a single thread because of
      the ktask threads' NUMA awareness (ktask migrates worker threads to the
      node local to the work being done).  This becomes a bigger factor as the
      amount of pages to zero grows to include memory from multiple nodes, so
      that speedups increase as the size increases.
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Tested-by: Hongbo Yao <yaohongbo@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      ae0cd4d4
    • mm: parallelize deferred struct page initialization within each node · eb761d65
      Committed by Daniel Jordan
      hulk inclusion
      category: feature
      bugzilla: 13228
      CVE: NA
      ---------------------------
      
      Deferred struct page initialization currently runs one thread per node,
      but this is a bottleneck during boot on big machines, so use ktask
      within each pgdatinit thread to parallelize the struct page
      initialization, allowing the system to take better advantage of its
      memory bandwidth.
      
      Because the system is not fully up yet and most CPUs are idle, use more
      than the default maximum number of ktask threads.  The kernel doesn't
      know the memory bandwidth of a given system to get the most efficient
      number of threads, so there's some guesswork involved.  In testing, a
      reasonable value turned out to be about a quarter of the CPUs on the
      node.
      
      __free_pages_core used to increase the zone's managed page count by the
      number of pages being freed.  To accommodate multiple threads, however,
      account the number of freed pages with an atomic shared across the ktask
      threads and bump the managed page count with it after ktask is finished.
      
      Test:    Boot the machine with deferred struct page init three times
      
      Machine: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz, 88 CPUs, 503G memory,
               2 sockets
      
      kernel                   speedup   max time per   stdev
                                         node (ms)
      
      baseline (4.15-rc2)                        5860     8.6
      ktask                      9.56x            613    12.4
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Tested-by: Hongbo Yao <yaohongbo@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      eb761d65
    • mm: enlarge type of offset argument in mem_map_offset and mem_map_next · 228e4183
      Committed by Daniel Jordan
      hulk inclusion
      category: feature
      bugzilla: 13228
      CVE: NA
      ---------------------------
      
      Changes the type of 'offset' from int to unsigned long in both
      mem_map_offset and mem_map_next.
      
      This facilitates ktask's use of mem_map_next with its unsigned long
      types to avoid silent truncation when these unsigned longs are passed as
      ints.
      
      It also fixes the preexisting truncation of 'offset' from unsigned long
      to int by the sole caller of mem_map_offset, follow_hugetlb_page.
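      
      The prototypes after the change look roughly like this (mm/internal.h):
      
        /* 'offset' widened from int to unsigned long in both helpers. */
        static inline struct page *mem_map_offset(struct page *base,
                                                  unsigned long offset)
        {
                if (unlikely(offset >= MAX_ORDER_NR_PAGES))
                        return nth_page(base, offset);
                return base + offset;
        }
      
        static inline struct page *mem_map_next(struct page *iter,
                                                struct page *base,
                                                unsigned long offset);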
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Tested-by: Hongbo Yao <yaohongbo@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      228e4183
    • vfio: relieve mmap_sem reader cacheline bouncing by holding it longer · 22705b26
      Committed by Daniel Jordan
      hulk inclusion
      category: feature
      bugzilla: 13228
      CVE: NA
      ---------------------------
      
      Profiling shows significant time being spent on atomic ops in mmap_sem
      reader acquisition.  mmap_sem is taken and dropped for every single base
      page during pinning, so this is not surprising.
      
      Reduce the number of times mmap_sem is taken by holding for longer,
      which relieves atomic cacheline bouncing.
      
      Results for all VFIO page pinning patches
      -----------------------------------------
      
      The test measures the time from qemu invocation to the start of guest
      boot.  The guest uses kvm with 320G memory backed with THP.  320G fits
      in a node on the test machine used here, so there was no thrashing in
      reclaim because of __GFP_THISNODE in THP allocations[1].
      
      CPU:              2 nodes * 24 cores/node * 2 threads/core = 96 CPUs
                        Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz
      memory:           754G split evenly between nodes
      scaling_governor: performance
      
           patch 6                  patch 8              patch 9 (this one)
           -----------------------  ----------------- ---------------------
      thr  speedup average sec  speedup   average sec  speedup   average sec
        1          65.0(± 0.6%)           65.2(± 0.5%)           65.5(± 0.4%)
        2   1.5x   42.8(± 5.8%)   1.8x    36.0(± 0.9%)    1.9x   34.4(± 0.3%)
        3   1.9x   35.0(±11.3%)   2.5x    26.4(± 4.2%)    2.8x   23.7(± 0.2%)
        4   2.3x   28.5(± 1.3%)   3.1x    21.2(± 2.8%)    3.6x   18.3(± 0.3%)
        5   2.5x   26.2(± 1.5%)   3.6x    17.9(± 0.9%)    4.3x   15.1(± 0.3%)
        6   2.7x   24.5(± 1.8%)   4.0x    16.5(± 3.0%)    5.1x   12.9(± 0.1%)
        7   2.8x   23.5(± 4.9%)   4.2x    15.4(± 2.7%)    5.7x   11.5(± 0.6%)
        8   2.8x   22.8(± 1.8%)   4.2x    15.5(± 4.7%)    6.4x   10.3(± 0.8%)
       12   3.2x   20.2(± 1.4%)   4.4x    14.7(± 2.9%)    8.6x    7.6(± 0.6%)
       16   3.3x   20.0(± 0.7%)   4.3x    15.4(± 1.3%)   10.2x    6.4(± 0.6%)
      
      At patch 6, lock_stat showed long reader wait time on mmap_sem writers,
      leading to patch 8.
      
      At patch 8, profiling revealed the issue with mmap_sem described above.
      
      Across all three patches, performance consistently improves as the
      thread count increases.  The one exception is the antiscaling with
      nthr=16 in patch 8: those mmap_sem atomics are really bouncing around
      the machine.
      
      The performance with patch 9 looks pretty good overall.  I'm working on
      finding the next bottleneck, and this is where it stopped:  When
      nthr=16, the obvious issue profiling showed was contention on the split
      PMD page table lock when pages are faulted in during the pinning (>2% of
      the time).
      
      A split PMD lock protects a PUD_SIZE-ed amount of page table mappings
      (1G on x86), so if threads were operating on smaller chunks and
      contending in the same PUD_SIZE range, this could be the source of
      contention.  However, when nthr=16, threads operate on 5G chunks (320G /
      16 threads / (1<<KTASK_LOAD_BAL_SHIFT)), so this wasn't the cause, and
      aligning the chunks on PUD_SIZE boundaries didn't help either.
      
      The runtime is short (6.4 seconds), so the next theory was that the
      threads were finishing at different times, but probes showed the
      threads all returned within a millisecond of each other.
      
      Kernel probes turned up a few smaller VFIO page pin calls besides the
      heavy 320G call.  The chunk size given to ktask (PMD_SIZE) could affect
      the thread count and chunk size for these smaller calls, so the chunk
      size was increased from 2M to 1G.  This made the split PMD contention
      disappear, but with little change in the runtime.  More digging
      required.
      
      [1] lkml.kernel.org/r/20180925120326.24392-1-mhocko@kernel.org
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Tested-by: Hongbo Yao <yaohongbo@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      22705b26
    • D
      vfio: remove unnecessary mmap_sem writer acquisition around locked_vm · 68a4b481
      Committed by Daniel Jordan
      hulk inclusion
      category: feature
      bugzilla: 13228
      CVE: NA
      ---------------------------
      
      Now that mmap_sem is no longer required for modifying locked_vm, remove
      it in the VFIO code.
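
      Schematically, the change removes a writer critical section that
      existed only to bump a counter (a sketch, not the exact diff; the
      RLIMIT_MEMLOCK check is elided):

        /* Before: mmap_sem taken as writer just to update locked_vm. */
        down_write(&mm->mmap_sem);
        mm->locked_vm += npage;
        up_write(&mm->mmap_sem);

        /* After: locked_vm is an atomic_long_t, so no mmap_sem needed. */
        atomic_long_add(npage, &mm->locked_vm);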
      
      [XXX Can be sent separately, along with similar conversions in the other
      places mmap_sem was taken for locked_vm.  While at it, could make
      similar changes to pinned_vm.]
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Tested-by: Hongbo Yao <yaohongbo@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      68a4b481
    • D
      mm: change locked_vm's type from unsigned long to atomic_long_t · 53f4e528
      Committed by Daniel Jordan
      hulk inclusion
      category: feature
      bugzilla: 13228
      CVE: NA
      ---------------------------
      
      Currently, mmap_sem must be held as writer to modify the locked_vm field
      in mm_struct.
      
      This creates a bottleneck when multithreading VFIO page pinning because
      each thread holds the mmap_sem as reader for the majority of the pinning
      time but also takes mmap_sem as writer regularly, for short times, when
      modifying locked_vm.
      
      The problem gets worse when other workloads compete for CPU with ktask
      threads doing page pinning: the other workloads can force a ktask
      thread that holds mmap_sem as writer off the CPU, blocking the ktask
      threads trying to take mmap_sem as reader for an excessively long time
      (the mmap_sem reader wait time grows linearly with the thread count).
      
      Requiring mmap_sem for locked_vm also abuses mmap_sem by making it
      protect data that could be synchronized separately.
      
      So, decouple locked_vm from mmap_sem by making locked_vm an
      atomic_long_t.  locked_vm's old type was unsigned long and changing it
      to a signed type makes it lose half its capacity, but that's only a
      concern for 32-bit systems and LONG_MAX * PAGE_SIZE is 8T on x86 in that
      case, so there's headroom.
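
      One possible pattern for a call site once locked_vm is atomic (a
      sketch only; note that the add-then-check can transiently overshoot
      the limit before backing out):

        unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
        long locked;
        int ret = 0;

        locked = atomic_long_add_return(npage, &mm->locked_vm);
        if (locked > limit && !capable(CAP_IPC_LOCK)) {
                atomic_long_sub(npage, &mm->locked_vm);
                ret = -ENOMEM;
        }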
      
      Now that mmap_sem is not taken as writer here, ktask threads holding
      mmap_sem as reader can run more often.  Performance results appear later
      in the series.
      
      On powerpc, this was only cross-compile tested.
      
      [XXX Can send separately.]
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Tested-by: Hongbo Yao <yaohongbo@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      53f4e528
    • D
      vfio: parallelize vfio_pin_map_dma · b0908eee
      Committed by Daniel Jordan
      hulk inclusion
      category: feature
      bugzilla: 13228
      CVE: NA
      ---------------------------
      
      When starting a large-memory kvm guest, it takes an excessively long
      time to start the boot process because qemu must pin all guest pages to
      accommodate DMA when VFIO is in use.  Currently just one CPU is
      responsible for the page pinning, which usually boils down to page
      clearing time-wise, so the ways to optimize this are buying a faster
      CPU ;-) or using more of the CPUs you already have.
      
      Parallelize with ktask.  Refactor so that workqueue workers pin with
      the mm of the calling thread, and add an undo callback so that ktask
      can unwind partially pinned work when an error occurs during page
      pinning.
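
      Roughly, the call might take the following shape.  This is only a
      sketch: ktask_run(), DEFINE_KTASK_CTL() and ktask_ctl_set_undo_func()
      are taken loosely from the ktask series and may not match the exact
      signatures here, and chunk_func(), undo_func() and struct pin_args are
      made-up placeholders.

        /*
         * chunk_func pins one [start, end) chunk of the mapping using the
         * mm passed via args (the caller's mm, not the worker's), and
         * undo_func unpins a chunk so ktask can roll back partial work if
         * another chunk fails.
         */
        struct pin_args args = { .mm = current->mm, .dma = dma };
        struct ktask_ctl ctl = DEFINE_KTASK_CTL(chunk_func, &args, PMD_SIZE);

        ktask_ctl_set_undo_func(&ctl, undo_func);
        ret = ktask_run((void *)dma->vaddr, dma->size, &ctl);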
      
      Performance results appear later in the series.
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Tested-by: Hongbo Yao <yaohongbo@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      b0908eee
    • D
      workqueue, ktask: renice helper threads to prevent starvation · cdc79c13
      Committed by Daniel Jordan
      hulk inclusion
      category: feature
      bugzilla: 13228
      CVE: NA
      ---------------------------
      
      With ktask helper threads running at MAX_NICE, it's possible for one or
      more of them to begin chunks of the task and then have their CPU time
      constrained by higher priority threads.  The main ktask thread, running
      at normal priority, may finish all available chunks of the task and then
      wait on the MAX_NICE helpers to finish the last in-progress chunks, for
      longer than it would have if no helpers were used.
      
      Avoid this by having the main thread assign its priority to each
      unfinished helper one at a time so that on a heavily loaded system,
      exactly one thread in a given ktask call is running at the main thread's
      priority.  At least one thread to ensure forward progress, and at most
      one thread to limit excessive multithreading.
      
      Since the workqueue interface, on which ktask is built, does not provide
      access to worker threads, ktask can't adjust their priorities directly,
      so add a new interface to allow a previously-queued work item to run at
      a different priority than the one controlled by the corresponding
      workqueue's 'nice' attribute.  The worker assigned to the work item will
      run the work at the given priority, temporarily overriding the worker's
      priority.
      
      The interface is flush_work_at_nice, which ensures the given work item's
      assigned worker runs at the specified nice level and waits for the work
      item to finish.
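
      Usage might look something like the sketch below (assuming a signature
      along the lines of flush_work_at_nice(struct work_struct *work, long
      nice); helper_works[] and nr_helpers are made-up placeholders):

        /*
         * Once the main ktask thread runs out of chunks, flush each
         * helper's work item at the main thread's nice level.  Because
         * each flush waits for that work to finish, at most one helper at
         * a time runs at the boosted priority.
         */
        int nice = task_nice(current);
        int i;

        for (i = 0; i < nr_helpers; i++)
                flush_work_at_nice(&helper_works[i], nice);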
      
      An alternative choice would have been to simply requeue the work item to
      a pool with workers of the new priority, but this doesn't seem feasible
      because a worker may have already started executing the work and there's
      currently no way to interrupt it midway through.  The proposed interface
      solves this issue because a worker's priority can be adjusted while it's
      executing the work.
      
      TODO:  flush_work_at_nice is a proof-of-concept only, and it may be
      desired to have the interface set the work's nice without also waiting
      for it to finish.  It's implemented in the flush path for this RFC
      because it was fairly simple to write ;-)
      
      I ran tests similar to the ones in the last patch with a couple of
      differences:
       - The non-ktask workload uses 8 CPUs instead of 7 to compete with the
         main ktask thread as well as the ktask helpers, so that when the main
         thread finishes, its CPU is completely occupied by the non-ktask
         workload, meaning MAX_NICE helpers can't run as often.
       - The non-ktask workload starts before the ktask workload, rather
         than after, to maximize the chance that it starves helpers.
      
      Runtimes in seconds.
      
      Case 1: Synthetic, worst-case CPU contention
      
       ktask_test - a tight loop doing integer multiplication to max out on CPU;
                    used for testing only, does not appear in this series
       stress-ng  - cpu stressor ("-c --cpu-method ackerman --cpu-ops 1200");
      
                   8_ktask_thrs           8_ktask_thrs
                   w/o_renice(stdev)   with_renice  (stdev)  1_ktask_thr(stdev)
                   ------------------------------------------------------------
        ktask_test    41.98  ( 0.22)         25.15  ( 2.98)      30.40  ( 0.61)
        stress-ng     44.79  ( 1.11)         46.37  ( 0.69)      53.29  ( 1.91)
      
      Without renicing, ktask_test finishes just after stress-ng does because
      stress-ng needs to free up CPUs for the helpers to finish (ktask_test
      shows a shorter runtime than stress-ng because ktask_test was started
      later).  Renicing lets ktask_test finish 40% sooner, and running the
      same amount of work in ktask_test with 1 thread instead of 8 finishes in
      a comparable amount of time, though longer than "with_renice" because
      MAX_NICE threads still get some CPU time, and the effect over 8 threads
      adds up.
      
      stress-ng's total runtime gets a little longer going from no renicing to
      renicing, as expected, because each reniced ktask thread takes more CPU
      time than before when the helpers were starved.
      
      Running with one ktask thread, stress-ng's reported walltime goes up
      because that single thread interferes with fewer stress-ng threads,
      but with more impact, causing a greater spread in the time it takes for
      individual stress-ng threads to finish.  Averages of the per-thread
      stress-ng times from "with_renice" to "1_ktask_thr" come out roughly
      the same, though, 43.81 and 43.89 respectively.  So the total runtime
      of stress-ng summed across all threads is unaffected, but the
      wall-clock time stress-ng takes to finish all of its threads actually
      improves when the ktask_test work is spread over more threads.
      
      Case 2: Real-world CPU contention
      
       ktask_vfio - VFIO page pin a 32G kvm guest
       usemem     - faults in 86G of anonymous THP per thread, PAGE_SIZE stride;
                    used to mimic the page clearing that dominates in ktask_vfio
                    so that usemem competes for the same system resources
      
                   8_ktask_thrs           8_ktask_thrs
                   w/o_renice  (stdev)   with_renice  (stdev)  1_ktask_thr(stdev)
                   --------------------------------------------------------------
        ktask_vfio    18.59  ( 0.19)         14.62  ( 2.03)      16.24  ( 0.90)
            usemem    47.54  ( 0.89)         48.18  ( 0.77)      49.70  ( 1.20)
      
      These results are similar to case 1's, though the differences between
      the times are not as pronounced because ktask_vfio ran for a shorter
      time than usemem.
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Tested-by: Hongbo Yao <yaohongbo@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      cdc79c13