- 27 December 2019: 40 commits
-
-
Submitted by Hanjun Guo

hulk inclusion
category: feature
bugzilla: 13227
CVE: NA

-------------------------------------------------

Set the NUMA-aware qspinlock off by default; it can be enabled by passing
using_numa_aware_qspinlock on the boot command line.

Signed-off-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Wei Li <liwei391@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
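[Editor's note] A minimal sketch of how such a boot switch is typically wired up with early_param(); the flag and handler names below are illustrative, not taken from the actual patch.

#include <linux/init.h>
#include <linux/types.h>

/* Illustrative only: a flag the qspinlock setup code could consult
 * before patching in the NUMA-aware (CNA) slowpath. */
static bool numa_aware_qspinlock_requested __initdata;

static int __init numa_aware_qspinlock_setup(char *str)
{
    numa_aware_qspinlock_requested = true;
    return 0;
}
early_param("using_numa_aware_qspinlock", numa_aware_qspinlock_setup);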
-
Submitted by Hanjun Guo

hulk inclusion
category: feature
bugzilla: 13227
CVE: NA

-------------------------------------------------

Enabling CNA is controlled via a new configuration option
(NUMA_AWARE_SPINLOCKS). Add arm64 support for it.

Signed-off-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Wei Li <liwei391@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Alex Kogan

hulk inclusion
category: feature
bugzilla: 13227
CVE: NA

-------------------------------------------------

This optimization reduces the probability that threads will be shuffled
between the main and secondary queues when the secondary queue is empty.
It helps when the lock is only lightly contended.

Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
Signed-off-by: Wei Li <liwei391@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Alex Kogan

hulk inclusion
category: feature
bugzilla: 13227
CVE: NA

-------------------------------------------------

Choose the next lock holder among spinning threads running on the same
node with high probability rather than always. With small probability,
hand the lock to the first thread in the secondary queue or, if that
queue is empty, to the immediate successor of the current lock holder
in the main queue. Thus, assuming no failures while threads hold the
lock, every thread will be able to acquire the lock after a bounded
number of lock transitions, with high probability.

Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
Signed-off-by: Wei Li <liwei391@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
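[Editor's note] A sketch of the probabilistic decision described above, assuming a simple pseudo-random threshold; the constant and helper name are illustrative and do not come from the patch.

#include <linux/random.h>
#include <linux/types.h>

/* Illustrative: keep the lock on the current node most of the time, but
 * with probability 1/INTRA_NODE_HANDOFF_DENOM hand it off through the
 * secondary queue so remote waiters cannot starve indefinitely. */
#define INTRA_NODE_HANDOFF_DENOM    0x10000    /* example value */

static inline bool cna_keep_lock_on_node(void)
{
    return prandom_u32() % INTRA_NODE_HANDOFF_DENOM != 0;
}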
-
Submitted by Alex Kogan

hulk inclusion
category: feature
bugzilla: 13227
CVE: NA

-------------------------------------------------

In CNA, spinning threads are organized in two queues: a main queue for
threads running on the same node as the current lock holder, and a
secondary queue for threads running on other nodes. At unlock time, the
lock holder scans the main queue looking for a thread running on the
same node. If one is found (call it thread T), all threads in the main
queue between the current lock holder and T are moved to the end of the
secondary queue, and the lock is passed to T. If no such T is found, the
lock is passed to the first node in the secondary queue. Finally, if the
secondary queue is empty, the lock is passed to the next thread in the
main queue. For more details, see https://arxiv.org/abs/1810.05600.

Note that this variant of CNA may introduce starvation by continuously
passing the lock to threads running on the same node. This issue is
addressed later in the series.

Enabling CNA is controlled via a new configuration option
(NUMA_AWARE_SPINLOCKS). The CNA variant is patched in at boot time only
if we run on a multi-node machine and the new config option is enabled.
For the time being, the patching also requires CONFIG_PARAVIRT_SPINLOCKS
to be enabled; this should be resolved once static_call() is available.

Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
Signed-off-by: Wei Li <liwei391@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
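[Editor's note] An illustrative, self-contained sketch of the unlock-time scan-and-splice step described above; the node structure and helper are simplified stand-ins, not the kernel's MCS/CNA code.

#include <stddef.h>

struct cna_node {
    struct cna_node *next;
    int             numa_node;
};

/*
 * Find the first waiter after 'holder' that runs on the holder's node.
 * Waiters skipped over on the way are spliced onto the tail of the
 * secondary queue (sec_head/sec_tail). Returns the chosen successor,
 * or NULL if no same-node waiter exists.
 */
static struct cna_node *cna_find_successor(struct cna_node *holder,
                                           struct cna_node **sec_head,
                                           struct cna_node **sec_tail)
{
    struct cna_node *prev = holder, *cur = holder->next;

    while (cur && cur->numa_node != holder->numa_node) {
        prev = cur;
        cur = cur->next;
    }
    if (!cur || prev == holder)
        return cur;             /* nothing to splice */

    /* move holder->next .. prev to the end of the secondary queue */
    if (*sec_tail)
        (*sec_tail)->next = holder->next;
    else
        *sec_head = holder->next;
    *sec_tail = prev;
    prev->next = NULL;

    return cur;
}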
-
Submitted by Alex Kogan

hulk inclusion
category: feature
bugzilla: 13227
CVE: NA

-------------------------------------------------

Move some of the code manipulating the spin lock into separate
functions. This allows easier integration of alternative ways to
manipulate that lock.

Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
Signed-off-by: Wei Li <liwei391@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Alex Kogan

hulk inclusion
category: feature
bugzilla: 13227
CVE: NA

-------------------------------------------------

The arch_mcs_spin_unlock_contended macro should accept the value to be
stored into the lock argument as another argument. This allows the same
macro to be used in cases where the value to be stored is different
from 1.

Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
Signed-off-by: Wei Li <liwei391@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
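[Editor's note] A sketch of the generic fallback after this change, assuming the usual smp_store_release()-based definition: the stored value becomes a parameter instead of being hard-coded to 1.

#include <asm/barrier.h>

/* Generic fallback; architectures may provide their own definition. */
#ifndef arch_mcs_spin_unlock_contended
#define arch_mcs_spin_unlock_contended(l, val)    \
    smp_store_release((l), (val))
#endif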
-
Submitted by Waiman Long

mainline inclusion
from mainline-5.1-rc1
commit 733000c7ffd9
category: bugfix
bugzilla: 13227
CVE: NA

-------------------------------------------------

With the > 4 nesting levels case handled by the commit:

  d682b596d993 ("locking/qspinlock: Handle > 4 slowpath nesting levels")

the BUG_ON() call in encode_tail() will never actually be triggered.
Remove it.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/1551057253-3231-1-git-send-email-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Wei Li <liwei391@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Waiman Long

mainline inclusion
from mainline-5.1-rc1
commit 412f34a82ccf
category: bugfix
bugzilla: 13227
CVE: NA

-------------------------------------------------

Track the number of slowpath locking operations that are being done
without any MCS node available, and rename lock_index[123] to make the
counters more descriptive. Using these stat counters is one way to find
out if a code path is being exercised.

Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: James Morse <james.morse@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: SRINIVAS <srinivas.eeda@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Cc: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Link: https://lkml.kernel.org/r/1548798828-16156-3-git-send-email-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Wei Li <liwei391@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Waiman Long

mainline inclusion
from mainline-5.1-rc1
commit d682b596d993
category: bugfix
bugzilla: 13227
CVE: NA

-------------------------------------------------

Four queue nodes per CPU are allocated to enable up to 4 nesting levels
using the per-CPU nodes. Nested NMIs are possible on some architectures.
Still, it is very unlikely that we will ever hit more than 4 nested
levels with contention in the slowpath. When that rare condition
happens, however, the system is likely to hang or crash shortly
afterwards. That is not good and we need to handle this exception case.

This is done by spinning directly on the lock using repeated trylock.
This alternative code path should only be used when there are nested
NMIs. Assuming that the locks used by those NMI handlers will not be
heavily contended, a simple TAS locking should work out.

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will.deacon@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: James Morse <james.morse@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: SRINIVAS <srinivas.eeda@oracle.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Zhenzhong Duan <zhenzhong.duan@oracle.com>
Link: https://lkml.kernel.org/r/1548798828-16156-2-git-send-email-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Wei Li <liwei391@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
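[Editor's note] A simplified sketch of the fallback described above, not the verbatim kernel code: once the per-CPU MCS nodes are exhausted, the waiter degrades to test-and-set style spinning on trylock instead of queueing. queued_spin_trylock() is the kernel's existing qspinlock primitive; the wrapper function is illustrative.

#include <linux/compiler.h>
#include <asm/processor.h>
#include <asm/qspinlock.h>

#define MAX_NODES    4    /* four per-CPU MCS nodes */

/*
 * Returns true if the lock was taken via the fallback path (nesting
 * level exceeded MAX_NODES), false if normal MCS queueing should be
 * used instead.
 */
static bool qspinlock_nested_fallback(struct qspinlock *lock, int idx)
{
    if (likely(idx < MAX_NODES))
        return false;

    while (!queued_spin_trylock(lock))
        cpu_relax();

    return true;
}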
-
Submitted by Waiman Long

mainline inclusion
from mainline-4.20-rc1
commit 0fa809ca7f81
category: bugfix
bugzilla: 13227
CVE: NA

-------------------------------------------------

The qspinlock code supports up to 4 levels of slowpath nesting using
four per-CPU mcs_spinlock structures. For 64-bit architectures, they fit
nicely in one 64-byte cacheline.

For para-virtualized (PV) qspinlocks, more information needs to be
stored in the per-CPU node structure than there is space for. It uses a
trick of claiming a second cacheline to hold the extra information it
needs. So the PV qspinlock code needs to access two extra cachelines for
its information, whereas the native qspinlock code only needs one extra
cacheline.

Freshly added counter profiling of the qspinlock code, however, revealed
that it was very rare to use more than two levels of slowpath nesting.
So it doesn't make sense to penalize the PV qspinlock code in order to
have four mcs_spinlock structures in the same cacheline to optimize for
a case in the native qspinlock code that rarely happens.

Extend the per-CPU node structure with two more long words when PV
qspinlocks are configured, to hold the extra data they need. As a
result, the PV qspinlock code will enjoy the same benefit of using just
one extra cacheline like its native counterpart, in most cases.

[ mingo: Minor changelog edits. ]

Signed-off-by: Waiman Long <longman@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: http://lkml.kernel.org/r/1539697507-28084-2-git-send-email-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Wei Li <liwei391@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
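[Editor's note] The resulting per-CPU node layout looks roughly like the sketch below (field and array names may differ slightly from the kernel's): the two extra long words exist only when PV spinlocks are configured, so the native case keeps four nodes in a single 64-byte cacheline.

#include <linux/percpu.h>

struct mcs_spinlock {
    struct mcs_spinlock *next;
    int locked;    /* 1 if lock acquired */
    int count;     /* nesting count */
};

struct qnode {
    struct mcs_spinlock mcs;
#ifdef CONFIG_PARAVIRT_SPINLOCKS
    long reserved[2];    /* extra PV state, grows the node to 32 bytes */
#endif
};

/* 4 nodes per CPU: one 64-byte cacheline natively, two with PV. */
static DEFINE_PER_CPU_ALIGNED(struct qnode, qnodes[4]);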
-
Submitted by Waiman Long

mainline inclusion
from mainline-4.20-rc1
commit 1222109a
category: bugfix
bugzilla: 13227
CVE: NA

-------------------------------------------------

Queued spinlocks support up to 4 levels of lock slowpath nesting: user
context, soft IRQ, hard IRQ and NMI. However, we are not sure how often
the nesting happens. So add 3 more per-CPU stat counters to track the
number of instances where the nesting index goes to 1, 2 and 3
respectively.

On a dual-socket 64-core 128-thread Zen server, the new stat counter
values were as follows under different circumstances:

  State                            slowpath     index1  index2  index3
  -----                            --------     ------  ------  ------
  After bootup                       1,012,150      82       0       0
  After parallel build + perf-top  125,195,009      82       0       0

So the chance of having more than 2 levels of nesting is extremely low.

[ mingo: Minor changelog edits. ]

Signed-off-by: Waiman Long <longman@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Will Deacon <will.deacon@arm.com>
Link: http://lkml.kernel.org/r/1539697507-28084-1-git-send-email-longman@redhat.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Wei Li <liwei391@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Marc Zyngier

mainline inclusion
from mainline-v5.3-rc2
commit cbdf8a189a66001c36007bf0f5c975d0376c5c3a
category: feature
bugzilla: 20806
CVE: NA

-------------------------------------------------

On a CPU that doesn't support SSBS, PSTATE[12] is RES0. In a system
where only some of the CPUs implement SSBS, we end up losing track of
the SSBS bit across task migration.

To address this issue, let's force the SSBS bit on context switch.

Fixes: 8f04e8e6e29c ("arm64: ssbd: Add support for PSTATE.SSBS rather than trapping to EL3")
Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
[will: inverted logic and added comments]
Signed-off-by: Will Deacon <will@kernel.org>
Conflicts:
	arch/arm64/kernel/process.c
[yyl: adjust context]
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
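[Editor's note] A simplified sketch of the idea, under the assumption that the saved pstate of the incoming task is patched on each switch; the helper name is illustrative and the real patch adds further checks (for example for kernel threads and the per-task SSBD state).

#include <linux/sched.h>
#include <asm/cpufeature.h>
#include <asm/processor.h>
#include <asm/ptrace.h>

/* Illustrative only: re-assert SSBS in the saved pstate of 'next' when
 * this CPU does not implement SSBS, so the bit is not lost when a task
 * migrates from an SSBS-capable CPU to one where PSTATE[12] is RES0. */
static void ssbs_thread_switch_sketch(struct task_struct *next)
{
    struct pt_regs *regs = task_pt_regs(next);

    if (this_cpu_has_cap(ARM64_SSBS))
        return;    /* hardware preserves the bit for us */

    if (compat_user_mode(regs))
        regs->pstate |= PSR_AA32_SSBS_BIT;
    else
        regs->pstate |= PSR_SSBS_BIT;
}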
-
Submitted by Mark Rutland

mainline inclusion
from mainline-v5.0-rc8
commit f54dada8274643e3ff4436df0ea124aeedc43cae
category: feature
bugzilla: 20806
CVE: NA

-------------------------------------------------

In valid_user_regs() we treat SSBS as a RES0 bit, and consequently it is
unexpectedly cleared when we restore a sigframe or fiddle with GPRs via
ptrace.

This patch fixes valid_user_regs() to account for this, updating the
function to refer to the latest ARM ARM (ARM DDI 0487D.a). For AArch32
tasks, SSBS appears in bit 23 of SPSR_EL1, matching its position in the
AArch32-native PSR format, and we don't need to translate it as we have
to for DIT.

There are no other bit assignments that we need to account for today.
As the recent documentation describes the DIT bit, we can drop our
comment regarding DIT.

While removing SSBS from the RES0 masks, existing inconsistent
whitespace is corrected.

Fixes: d71be2b6c0e19180 ("arm64: cpufeature: Detect SSBS and advertise to userspace")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Cc: Will Deacon <will.deacon@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Will Deacon

mainline inclusion
from mainline-v4.20-rc1
commit b8925ee2
category: feature
bugzilla: 20806
CVE: NA

-------------------------------------------------

The cpu errata and feature enable callbacks are only called via their
respective arm64_cpu_capabilities structure and therefore shouldn't
exist in the global namespace. Move the PAN, RAS and cache maintenance
emulation enable callbacks into the same files as their corresponding
arm64_cpu_capabilities structures, making them static in the process.

Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Conflicts:
	arch/arm64/kernel/cpu_errata.c
	arch/arm64/kernel/cpufeature.c
[yyl: adjust context]
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: Xuefeng Wang <wxf.wang@hisilicon.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Will Deacon

mainline inclusion
from mainline-v4.20-rc1
commit 7c36447a
category: feature
bugzilla: 20806
CVE: NA

-------------------------------------------------

When running without VHE, it is necessary to set SCTLR_EL2.DSSBS if SSBD
has been forcefully disabled on the kernel command line.

Acked-by: Christoffer Dall <christoffer.dall@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: Xuefeng Wang <wxf.wang@hisilicon.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Will Deacon

mainline inclusion
from mainline-v4.20-rc1
commit 8f04e8e6
category: feature
bugzilla: 20806
CVE: NA

-------------------------------------------------

On CPUs with support for PSTATE.SSBS, the kernel can toggle the SSBD
state without needing to call into firmware.

This patch hooks into the existing SSBD infrastructure so that SSBS is
used on CPUs that support it, but it's all made horribly complicated by
the very real possibility of big/little systems that don't uniformly
provide the new capability.

Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Conflicts:
	arch/arm64/kernel/process.c
	arch/arm64/kernel/ssbd.c
	arch/arm64/kernel/cpufeature.c
[yyl: adjust context]
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: Xuefeng Wang <wxf.wang@hisilicon.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Will Deacon

mainline inclusion
from mainline-v4.20-rc1
commit 2d1b2a91d56b19636b740ea70c8399d1df249f20
category: feature
bugzilla: 20806
CVE: NA

-------------------------------------------------

Now that we're all merged nicely into mainline, there's no need to check
whether PR_SPEC_STORE_BYPASS is defined.

Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: Xuefeng Wang <wxf.wang@hisilicon.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Will Deacon

mainline inclusion
from mainline-v4.20-rc1
commit d71be2b6c0e19180b5f80a6d42039cc074a693a2
category: feature
bugzilla: 20806
CVE: NA

-------------------------------------------------

Armv8.5 introduces a new PSTATE bit known as Speculative Store Bypass
Safe (SSBS) which can be used as a mitigation against Spectre variant 4.

Additionally, a CPU may provide instructions to manipulate PSTATE.SSBS
directly, so that userspace can toggle the SSBS control without trapping
to the kernel.

This patch probes for the existence of SSBS and advertises the new
instructions to userspace if they exist.

Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Conflicts:
	arch/arm64/kernel/cpufeature.c
	arch/arm64/include/asm/cpucaps.h
[yyl: adjust context]
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: Xuefeng Wang <wxf.wang@hisilicon.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Will Deacon

mainline inclusion
from mainline-v4.20-rc1
commit ca7f686ac9fe87a9175696a8744e095ab9749c49
category: feature
bugzilla: 20806
CVE: NA

-------------------------------------------------

I was passing through and figured I'd fix this up:

  featuer -> feature

Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Reviewed-by: Xuefeng Wang <wxf.wang@hisilicon.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Timofey Titovets

mainline inclusion
from mainline-5.0-rc1
commit 59e1a2f4bf83744e748636415fde7d1e9f557e05
category: performance
bugzilla: 13231
CVE: NA

------------------------------------------------

Replace jhash2 with xxhash.

Perf numbers:
Intel(R) Xeon(R) CPU E5-2420 v2 @ 2.20GHz
ksm: crc32c hash() 12081 MB/s
ksm: xxh64  hash()  8770 MB/s
ksm: xxh32  hash()  4529 MB/s
ksm: jhash2 hash()  1569 MB/s

Sioh Lee did some testing:

crc32c_intel: 1084.10ns
crc32c (no hardware acceleration): 7012.51ns
xxhash32: 2227.75ns
xxhash64: 1413.16ns
jhash2: 5128.30ns

As jhash2 will always be slower (for data sizes like PAGE_SIZE), don't
use it in ksm at all.

Use only xxhash for now, because to use crc32c the cryptoapi must be
initialized first, and that requires a somewhat tricky solution to work
well in all situations.

Link: http://lkml.kernel.org/r/20181023182554.23464-3-nefelim4ag@gmail.com
Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Signed-off-by: leesioh <solee@os.korea.ac.kr>
Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
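[Editor's note] A sketch of what KSM's per-page checksum looks like with xxhash() in place of jhash2(); the seed value here is illustrative and may not match the one the patch actually uses.

#include <linux/highmem.h>
#include <linux/xxhash.h>

static u32 calc_checksum_sketch(struct page *page)
{
    u32 checksum;
    void *addr = kmap_atomic(page);

    /* Hash the whole page; xxhash() picks xxh32/xxh64 at compile time. */
    checksum = xxhash(addr, PAGE_SIZE, 0);
    kunmap_atomic(addr);

    return checksum;
}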
-
Submitted by Timofey Titovets

mainline inclusion
from mainline-5.0-rc1
commit 0b9df58b
category: performance
bugzilla: 13231
CVE: NA

------------------------------------------------

Patch series "Currently used jhash are slow enough and replace it allow
as to make KSM", v8.

Speed (in kernel):
ksm: crc32c hash() 12081 MB/s
ksm: xxh64  hash()  8770 MB/s
ksm: xxh32  hash()  4529 MB/s
ksm: jhash2 hash()  1569 MB/s

Sioh Lee's testing (copied from another mail):

Test platform: openstack cloud platform (NEWTON version)
Experiment node: openstack based cloud compute node
                 (CPU: xeon E5-2620 v3, memory 64gb)
VM: (2 VCPU, RAM 4GB, DISK 20GB) * 4
Linux kernel: 4.14 (latest version)
KSM setup - sleep_millisecs: 200ms, pages_to_scan: 200

Experiment process: First, we turn off KSM and launch 4 VMs. Then we
turn on KSM and measure the checksum computation time until full_scans
becomes two.

The experimental results (each value is the average of the measured
values):

crc32c_intel: 1084.10ns
crc32c (no hardware acceleration): 7012.51ns
xxhash32: 2227.75ns
xxhash64: 1413.16ns
jhash2: 5128.30ns

In summary, the results show that crc32c_intel has an advantage over all
of the hash functions used in the experiment (decreased by 84.54%
compared to crc32c, 78.86% compared to jhash2, 51.33% compared to
xxhash32, 23.28% compared to xxhash64). The results are similar to those
of Timofey.

But use only xxhash for now, because to use crc32c the cryptoapi must be
initialized first, which requires a somewhat tricky solution to work
well in all situations.

So:
- The first patch implements a compile-time pick of the fastest xxhash
  implementation for the target platform.
- The second patch replaces jhash2 with xxhash.

This patch (of 2):

xxh32() - fast on both 32/64-bit platforms
xxh64() - fast only on 64-bit platforms

Create xxhash(), which will pick the fastest version at compile time.

Link: http://lkml.kernel.org/r/20181023182554.23464-2-nefelim4ag@gmail.com
Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: leesioh <solee@os.korea.ac.kr>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
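[Editor's note] The compile-time selection described above boils down to a wrapper along these lines; this is a sketch of what lands in include/linux/xxhash.h, not necessarily the exact code. xxh32()/xxh64() are the kernel's existing xxHash primitives declared in the same header.

#include <linux/types.h>

static inline unsigned long xxhash(const void *input, size_t length,
                                   uint64_t seed)
{
#if BITS_PER_LONG == 64
    /* 64-bit platforms: the 64-bit variant is the fast one. */
    return xxh64(input, length, seed);
#else
    return xxh32(input, length, seed);
#endif
}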
-
Submitted by Nathan Chancellor

mainline inclusion
from mainline-5.0
commit 0738c8b5915c
category: bugfix
bugzilla: 11024
CVE: NA

-------------------------------------------------

After commit cc9f8349cb33 ("arm64: crypto: add NEON accelerated XOR
implementation"), Clang builds for arm64 started failing with the
following error message:

  arch/arm64/lib/xor-neon.c:58:28: error: incompatible pointer types
  assigning to 'const unsigned long *' from 'uint64_t *'
  (aka 'unsigned long long *') [-Werror,-Wincompatible-pointer-types]
                  v3 = veorq_u64(vld1q_u64(dp1 + 6), vld1q_u64(dp2 + 6));
                                           ^~~~~~~~
  /usr/lib/llvm-9/lib/clang/9.0.0/include/arm_neon.h:7538:47: note:
  expanded from macro 'vld1q_u64'
    __ret = (uint64x2_t) __builtin_neon_vld1q_v(__p0, 51); \
                                                ^~~~

There has been quite a bit of debate and triage to figure out what the
proper fix is, viewable at the link below, which is still ongoing. Ard
suggested disabling this warning with Clang with a pragma so no NEON
code will have this type of error. While this is not at all an ideal
solution, this build error is the only thing preventing KernelCI from
having successful arm64 defconfig and allmodconfig builds on linux-next.
Getting continuous integration running is more important so new
warnings/errors or boot failures can be caught and fixed quickly.

Link: https://github.com/ClangBuiltLinux/linux/issues/283
Suggested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Will Deacon <will.deacon@arm.com>
(cherry picked from commit 0738c8b5915c7eaf1e6007b441008e8f3b460443)
Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
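[Editor's note] The suggested approach amounts to a pragma along these lines; the exact file and guard used by the real patch may differ, so treat this as a sketch of the idea only.

/* Only Clang emits -Wincompatible-pointer-types for these NEON
 * intrinsic wrappers, so silence it there and leave GCC untouched. */
#ifdef __clang__
#pragma clang diagnostic ignored "-Wincompatible-pointer-types"
#endif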
-
Submitted by Jackie Liu

mainline inclusion
from mainline-5.0-rc1
commit: cc9f8349cb33965120a96c12e05d00676162eb7f
category: feature
feature: NEON accelerated XOR
bugzilla: 11024
CVE: NA

--------------------------------------------------

This is a NEON acceleration method that can improve performance by
approximately 20%. I got the following data from CentOS 7.5 on Huawei's
HISI1616 chip:

  [ 93.837726] xor: measuring software checksum speed
  [ 93.874039] 8regs      : 7123.200 MB/sec
  [ 93.914038] 32regs     : 7180.300 MB/sec
  [ 93.954043] arm64_neon : 9856.000 MB/sec
  [ 93.954047] xor: using function: arm64_neon (9856.000 MB/sec)

I believe this code can bring some optimization for all arm64 platforms.
Thanks to Ard Biesheuvel for his suggestions.

Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
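[Editor's note] A user-space illustration of the technique, assuming GCC or Clang on arm64 with <arm_neon.h>: XOR one buffer into another 64 bytes per iteration using vld1q_u64/veorq_u64/vst1q_u64. It mirrors the shape of the kernel's 2-source XOR routine but is not that code.

#include <arm_neon.h>
#include <stdint.h>

/* dst ^= src; 'bytes' must be a multiple of 64 (four 128-bit lanes). */
static void xor_neon_2(unsigned long bytes, uint64_t *dst, const uint64_t *src)
{
    long lines = bytes / (sizeof(uint64x2_t) * 4);

    do {
        uint64x2_t v0 = veorq_u64(vld1q_u64(dst + 0), vld1q_u64(src + 0));
        uint64x2_t v1 = veorq_u64(vld1q_u64(dst + 2), vld1q_u64(src + 2));
        uint64x2_t v2 = veorq_u64(vld1q_u64(dst + 4), vld1q_u64(src + 4));
        uint64x2_t v3 = veorq_u64(vld1q_u64(dst + 6), vld1q_u64(src + 6));

        vst1q_u64(dst + 0, v0);
        vst1q_u64(dst + 2, v1);
        vst1q_u64(dst + 4, v2);
        vst1q_u64(dst + 6, v3);

        dst += 8;
        src += 8;
    } while (--lines > 0);
}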
-
Submitted by Jackie Liu

mainline inclusion
from mainline-5.0-rc1
commit: 21e28547f613e7ba4f4cb6831a3ead2d723fdf7b
category: feature
feature: NEON accelerated XOR
bugzilla: 11024
CVE: NA

--------------------------------------------------

In a way similar to ARM commit 09096f6a ("ARM: 7822/1: add workaround
for ambiguous C99 stdint.h types"), this patch redefines the macros that
are used in stdint.h so its definitions of uint64_t and int64_t are
compatible with those of the kernel.

This patch comes from: https://patchwork.kernel.org/patch/3540001/
Written by: Ard Biesheuvel <ard.biesheuvel@linaro.org>

We mark this file as a private file and don't have to override
asm/types.h.

Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Ard Biesheuvel

mainline inclusion
from mainline-5.0
commit: efdb25efc7645b326cd5eb82be5feeabe167c24e
category: perf
bugzilla: 20886
CVE: NA

lib/crc32test results:

  [root@localhost build]# rmmod crc32test && insmod lib/crc32test.ko && dmesg | grep cycles
  [83170.153209] CPU7: use cycles 26243990
  [83183.122137] CPU7: use cycles 26151290
  [83309.691628] CPU7: use cycles 26122830
  [83312.415559] CPU7: use cycles 26232600
  [83313.191479] CPU8: use cycles 26082350

  rmmod crc32test && insmod lib/crc32test.ko && dmesg | grep cycles
  [ 1023.539931] CPU25: use cycles 12256730
  [ 1024.850360] CPU24: use cycles 12249680
  [ 1025.463622] CPU25: use cycles 12253330
  [ 1025.862925] CPU25: use cycles 12269720
  [ 1026.376038] CPU26: use cycles 12222480

Based on 13702:
arm64/lib: improve CRC32 performance for deep pipelines
crypto: arm64/crc32 - remove PMULL based CRC32 driver
arm64/lib: add accelerated crc32 routines
arm64: cpufeature: add feature for CRC32 instructions
lib/crc32: make core crc32() routines weak so they can be overridden

----------------------------------------------

Improve the performance of the crc32() asm routines by getting rid of
most of the branches and small sized loads on the common path.

Instead, use a branchless code path involving overlapping 16 byte loads
to process the first (length % 32) bytes, and process the remainder
using a loop that processes 32 bytes at a time.

Tested using the following test program:

  #include <stdlib.h>

  extern void crc32_le(unsigned short, char const*, int);

  int main(void)
  {
    static const char buf[4096];

    srand(20181126);

    for (int i = 0; i < 100 * 1000 * 1000; i++)
      crc32_le(0, buf, rand() % 1024);

    return 0;
  }

On Cortex-A53 and Cortex-A57, the performance regresses but only very
slightly. On Cortex-A72 however, the performance improves from

  $ time ./crc32
  real  0m10.149s
  user  0m10.149s
  sys   0m0.000s

to

  $ time ./crc32
  real  0m7.915s
  user  0m7.915s
  sys   0m0.000s

Cc: Rui Sun <sunrui26@huawei.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Miguel Ojeda

mainline inclusion
from mainline-5.0
commit ff98e20ef2081b8620dada28fc2d4fb24ca0abf2
category: bugfix
bugzilla: 13702
CVE: NA

-------------------------------------------------

The upcoming GCC 9 release extends the -Wmissing-attributes warnings
(enabled by -Wall) to C and aliases: it warns when particular function
attributes are missing in the aliases but not in their target.

In particular, it triggers here because crc32_le_base/__crc32c_le_base
aren't __pure while their target crc32_le/__crc32c_le are.

These aliases are used by architectures as a fallback in accelerated
versions of CRC32. See commit 9784d82db3eb ("lib/crc32: make core
crc32() routines weak so they can be overridden").

Therefore, being fallbacks, it is likely that even if the aliases were
called from C, there wouldn't be any optimizations possible. Currently,
the only user is arm64, which calls this from asm.

Still, marking the aliases as __pure makes sense and is a good idea for
documentation purposes and possible future optimizations, and it also
silences the warning.

Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Tested-by: Laura Abbott <labbott@redhat.com>
Signed-off-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
(cherry picked from commit ff98e20ef2081b8620dada28fc2d4fb24ca0abf2)
Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Ard Biesheuvel

mainline inclusion
from mainline-4.20-rc1
commit: 598b7d41e544322c8c4f3737ee8ddf905a44175e
category: feature
feature: accelerated crc32 routines
bugzilla: 13702
CVE: NA

--------------------------------------------------

Now that the scalar fallbacks have been moved out of this driver into
the core crc32()/crc32c() routines, we are left with a CRC32 crypto API
driver for arm64 that is based only on 64x64 polynomial multiplication,
which is an optional instruction in the ARMv8 architecture, and is less
and less likely to be available on cores that do not also implement the
CRC32 instructions, given that those are mandatory in the architecture
as of ARMv8.1.

Since the scalar instructions do not require the special handling that
SIMD instructions do, and since they turn out to be considerably faster
on some cores (Cortex-A53) as well, there is really no point in keeping
this code around, so let's just remove it.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Ard Biesheuvel

mainline inclusion
from mainline-4.20-rc1
commit: 7481cddf29ede204b475facc40e6f65459939881
category: feature
feature: accelerated crc32 routines
bugzilla: 13702
CVE: NA

--------------------------------------------------

Unlike crc32c(), which is wired up to the crypto API internally so the
optimal driver is selected based on the platform's capabilities,
crc32_le() is implemented as a library function using a slice-by-8
table based C implementation. Even though few of the call sites may be
bottlenecks, calling a time variant implementation with a non-negligible
D-cache footprint is a bit of a waste, given that ARMv8.1 and up mandate
support for the CRC32 instructions that were optional in ARMv8.0, but
are already widely available, even on the Cortex-A53 based Raspberry Pi.

So implement routines that use these instructions if available, and fall
back to the existing generic routines otherwise. The selection is based
on alternatives patching.

Note that this unconditionally selects CONFIG_CRC32 as a builtin. Since
CRC32 is relied upon by core functionality such as CONFIG_OF_FLATTREE,
this just codifies the status quo.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
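[Editor's note] For reference, the hardware CRC32 instructions this patch relies on can be exercised from user space through the ACLE intrinsics in <arm_acle.h> (build with -march=armv8-a+crc). This is only an illustration of the instructions, not the kernel's alternatives-patched routine.

#include <arm_acle.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

static uint32_t crc32_le_hw(uint32_t crc, const unsigned char *p, size_t len)
{
    /* 8 bytes per CRC32X instruction, then mop up byte by byte. */
    while (len >= 8) {
        uint64_t v;

        memcpy(&v, p, sizeof(v));
        crc = __crc32d(crc, v);
        p += 8;
        len -= 8;
    }
    while (len--)
        crc = __crc32b(crc, *p++);

    return crc;
}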
-
Submitted by Ard Biesheuvel

mainline inclusion
from mainline-4.20-rc1
commit: 86d0dd34eafffbc76a81aba6ae2d71927d3835a8
category: feature
feature: accelerated crc32 routines
bugzilla: 13702
CVE: NA

--------------------------------------------------

Add a CRC32 feature bit and wire it up to the CPU id register so we will
be able to use alternatives patching for CRC32 operations.

Acked-by: Will Deacon <will.deacon@arm.com>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Conflicts:
	arch/arm64/include/asm/cpucaps.h
	arch/arm64/kernel/cpufeature.c
Signed-off-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Ard Biesheuvel

mainline inclusion
from mainline-4.20-rc1
commit: 9784d82db3eb3de7851e5a3f4a2481607de2452c
category: feature
feature: accelerated crc32 routines
bugzilla: 13702
CVE: NA

--------------------------------------------------

Allow architectures to drop in accelerated CRC32 routines by making the
crc32_le/__crc32c_le entry points weak, and by exposing non-weak aliases
for them that may be used by the accelerated versions as fallbacks in
case the instructions they rely upon are not available.

Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
Signed-off-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
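[Editor's note] A sketch of the weak-symbol arrangement this describes, for the little-endian entry point as it would sit inside lib/crc32.c. The generic helper and table names follow that file's conventions; the exact code may differ.

#include <linux/compiler.h>
#include <linux/crc32.h>
#include <linux/export.h>

/* The core routine is weak so an architecture can override it with an
 * accelerated version... */
u32 __pure __weak crc32_le(u32 crc, unsigned char const *p, size_t len)
{
    return crc32_le_generic(crc, p, len, crc32table_le, CRC32_POLY_LE);
}
EXPORT_SYMBOL(crc32_le);

/* ...while the non-weak alias stays available as the fallback that the
 * accelerated code can call when the CRC32 instructions are absent. */
u32 __pure crc32_le_base(u32, unsigned char const *, size_t)
    __alias(crc32_le);
EXPORT_SYMBOL(crc32_le_base);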
-
Submitted by Xiongfeng Wang

hulk inclusion
category: bugfix
bugzilla: 20702
CVE: NA

---------------------------

When I inject a PCIe fatal error into a Mellanox netdevice, 'dmesg'
shows the device is recovered successfully, but 'lspci' doesn't show the
device. I checked the configuration space of the slot where the
netdevice is inserted and found that the bit 'PCI_BRIDGE_CTL_BUS_RESET'
is set. Later, I found that this is because the bit is saved in
'saved_config_space' of 'struct pci_dev' when 'pci_pm_runtime_suspend()'
is called, and 'PCI_BRIDGE_CTL_BUS_RESET' is then set every time we
restore the configuration space.

This patch avoids saving the bit 'PCI_BRIDGE_CTL_BUS_RESET' when we save
the configuration space of a bridge.

Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
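[Editor's note] A sketch of the idea (not the actual hulk patch): when the bridge's config space is saved, mask the Secondary Bus Reset bit out of the saved copy of the Bridge Control register so a later restore cannot leave the bus held in reset.

#include <linux/pci.h>

static void pci_save_bridge_control_masked(struct pci_dev *dev)
{
    int i = PCI_BRIDGE_CONTROL / 4;    /* dword index into saved space */
    u32 val;

    pci_read_config_dword(dev, i * 4, &val);

    /* Bridge Control sits in the upper 16 bits of this dword; never
     * carry the bus-reset bit into the saved state. */
    val &= ~((u32)PCI_BRIDGE_CTL_BUS_RESET << 16);

    dev->saved_config_space[i] = val;
}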
-
Submitted by Hongbo Yao

hulk inclusion
category: feature
bugzilla: 13228
CVE: NA

---------------------------

Enable CONFIG_KTASK in hulk_defconfig and storage_ci_defconfig.

Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Hongbo Yao

hulk inclusion
category: feature
bugzilla: 13228
CVE: NA

---------------------------

This patch fixes some issues in the original series:

1) PMD_SIZE chunks have made thread finishing times too spread out in
   some cases, so KTASK_MEM_CHUNK (128M) seems to be a reasonable
   compromise.
2) If hugepagesz=1G, then pages_per_huge_page = 1G / 4K = 256, and using
   KTASK_MEM_CHUNK will cause the number of ktask threads to be 1, which
   will not improve the performance of clearing a gigantic page.

Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
Tested-by: Hongbo Yao <yaohongbo@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Daniel Jordan

hulk inclusion
category: feature
bugzilla: 13228
CVE: NA

---------------------------

hugetlbfs_fallocate preallocates huge pages to back a file in a
hugetlbfs filesystem. The time to call this function grows linearly
with size.

ktask performs well with its default thread count of 4; higher thread
counts are given for context only.

Machine: Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz, 288 CPUs, 1T memory
Test:    fallocate(1) a file on a hugetlbfs filesystem

  nthread   speedup   size (GiB)   min time (s)   stdev
        1                    200         127.53    2.19
        2     3.09x          200          41.30    2.11
        4     5.72x          200          22.29    0.51
        8     9.45x          200          13.50    2.58
       16     9.74x          200          13.09    1.64

        1                    400         193.09    2.47
        2     2.14x          400          90.31    3.39
        4     3.84x          400          50.32    0.44
        8     5.11x          400          37.75    1.23
       16     6.12x          400          31.54    3.13

The primary bottleneck for better scaling at higher thread counts is
hugetlb_fault_mutex_table[hash]. perf showed L1-dcache-loads increase
with 8 threads and again sharply with 16 threads, and a CPU counter
profile showed that 31% of the L1d misses were on
hugetlb_fault_mutex_table[hash] in the 16-thread case.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
Tested-by: Hongbo Yao <yaohongbo@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Daniel Jordan

hulk inclusion
category: feature
bugzilla: 13228
CVE: NA

---------------------------

Parallelize clear_gigantic_page, which zeroes any page size larger than
8M (e.g. 1G on x86).

Performance results (the default number of threads is 4; higher thread
counts are shown for context only):

Machine: Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz, 288 CPUs, 1T memory
Test:    Clear a range of gigantic pages (triggered via fallocate)

  nthread   speedup   size (GiB)   min time (s)   stdev
        1                    100          41.13    0.03
        2     2.03x          100          20.26    0.14
        4     4.28x          100           9.62    0.09
        8     8.39x          100           4.90    0.05
       16    10.44x          100           3.94    0.03

        1                    200          89.68    0.35
        2     2.21x          200          40.64    0.18
        4     4.64x          200          19.33    0.32
        8     8.99x          200           9.98    0.04
       16    11.27x          200           7.96    0.04

        1                    400         188.20    1.57
        2     2.30x          400          81.84    0.09
        4     4.63x          400          40.62    0.26
        8     8.92x          400          21.09    0.50
       16    11.78x          400          15.97    0.25

        1                    800         434.91    1.81
        2     2.54x          800         170.97    1.46
        4     4.98x          800          87.38    1.91
        8    10.15x          800          42.86    2.59
       16    12.99x          800          33.48    0.83

The speedups are mostly due to the fact that more threads can use more
memory bandwidth. The loop we're stressing on the x86 chip in this test
is clear_page_erms, which tops out at a bandwidth of 2550 MiB/s with one
thread. We get the same bandwidth per thread for 2, 4, or 8 threads, but
at 16 threads the per-thread bandwidth drops to 1420 MiB/s.

However, the performance also improves over a single thread because of
the ktask threads' NUMA awareness (ktask migrates worker threads to the
node local to the work being done). This becomes a bigger factor as the
amount of pages to zero grows to include memory from multiple nodes, so
that speedups increase as the size increases.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
Tested-by: Hongbo Yao <yaohongbo@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Daniel Jordan

hulk inclusion
category: feature
bugzilla: 13228
CVE: NA

---------------------------

Deferred struct page initialization currently runs one thread per node,
but this is a bottleneck during boot on big machines, so use ktask
within each pgdatinit thread to parallelize the struct page
initialization, allowing the system to take better advantage of its
memory bandwidth.

Because the system is not fully up yet and most CPUs are idle, use more
than the default maximum number of ktask threads. The kernel doesn't
know the memory bandwidth of a given system to get the most efficient
number of threads, so there's some guesswork involved. In testing, a
reasonable value turned out to be about a quarter of the CPUs on the
node.

__free_pages_core used to increase the zone's managed page count by the
number of pages being freed. To accommodate multiple threads, however,
account the number of freed pages with an atomic shared across the
ktask threads and bump the managed page count with it after ktask is
finished.

Test:    Boot the machine with deferred struct page init three times
Machine: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz, 88 CPUs, 503G memory,
         2 sockets

  kernel                speedup   max time per node (ms)   stdev
  baseline (4.15-rc2)                              5860      8.6
  ktask                   9.56x                     613     12.4

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
Tested-by: Hongbo Yao <yaohongbo@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Daniel Jordan

hulk inclusion
category: feature
bugzilla: 13228
CVE: NA

---------------------------

Change the type of 'offset' from int to unsigned long in both
mem_map_offset and mem_map_next.

This facilitates ktask's use of mem_map_next with its unsigned long
types, avoiding silent truncation when these unsigned longs are passed
as ints. It also fixes the preexisting truncation of 'offset' from
unsigned long to int by the sole caller of mem_map_offset,
follow_hugetlb_page.

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
Tested-by: Hongbo Yao <yaohongbo@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
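[Editor's note] The change amounts to widening the parameter in these mm/internal.h helpers; a simplified sketch of the resulting mem_map_offset (not verbatim kernel code):

#include <linux/mm.h>

/* Return the page 'offset' pages after 'base', handling the case where
 * the range spans discontiguous sections of mem_map. */
static inline struct page *mem_map_offset(struct page *base,
                                          unsigned long offset)
{
    if (unlikely(offset >= MAX_ORDER_NR_PAGES))
        return nth_page(base, offset);
    return base + offset;
}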
-
Submitted by Daniel Jordan

hulk inclusion
category: feature
bugzilla: 13228
CVE: NA

---------------------------

Profiling shows significant time being spent on atomic ops in mmap_sem
reader acquisition. mmap_sem is taken and dropped for every single base
page during pinning, so this is not surprising.

Reduce the number of times mmap_sem is taken by holding it for longer,
which relieves atomic cacheline bouncing.

Results for all VFIO page pinning patches
-----------------------------------------

The test measures the time from qemu invocation to the start of guest
boot. The guest uses kvm with 320G memory backed with THP. 320G fits in
a node on the test machine used here, so there was no thrashing in
reclaim because of __GFP_THISNODE in THP allocations[1].

CPU:    2 nodes * 24 cores/node * 2 threads/core = 96 CPUs
        Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz
memory: 754G split evenly between nodes
scaling_governor: performance

       patch 6                patch 8               patch 9 (this one)
       ---------------------  --------------------  ---------------------
  thr  speedup  average sec   speedup  average sec  speedup  average sec
    1           65.0(± 0.6%)           65.2(± 0.5%)          65.5(± 0.4%)
    2  1.5x     42.8(± 5.8%)  1.8x     36.0(± 0.9%)  1.9x    34.4(± 0.3%)
    3  1.9x     35.0(±11.3%)  2.5x     26.4(± 4.2%)  2.8x    23.7(± 0.2%)
    4  2.3x     28.5(± 1.3%)  3.1x     21.2(± 2.8%)  3.6x    18.3(± 0.3%)
    5  2.5x     26.2(± 1.5%)  3.6x     17.9(± 0.9%)  4.3x    15.1(± 0.3%)
    6  2.7x     24.5(± 1.8%)  4.0x     16.5(± 3.0%)  5.1x    12.9(± 0.1%)
    7  2.8x     23.5(± 4.9%)  4.2x     15.4(± 2.7%)  5.7x    11.5(± 0.6%)
    8  2.8x     22.8(± 1.8%)  4.2x     15.5(± 4.7%)  6.4x    10.3(± 0.8%)
   12  3.2x     20.2(± 1.4%)  4.4x     14.7(± 2.9%)  8.6x     7.6(± 0.6%)
   16  3.3x     20.0(± 0.7%)  4.3x     15.4(± 1.3%) 10.2x     6.4(± 0.6%)

At patch 6, lock_stat showed long reader wait time on mmap_sem writers,
leading to patch 8. At patch 8, profiling revealed the issue with
mmap_sem described above.

Across all three patches, performance consistently improves as the
thread count increases. The one exception is the antiscaling with
nthr=16 in patch 8: those mmap_sem atomics are really bouncing around
the machine.

The performance with patch 9 looks pretty good overall. I'm working on
finding the next bottleneck, and this is where it stopped:

When nthr=16, the obvious issue profiling showed was contention on the
split PMD page table lock when pages are faulted in during the pinning
(>2% of the time). A split PMD lock protects a PUD_SIZE-ed amount of
page table mappings (1G on x86), so if threads were operating on smaller
chunks and contending in the same PUD_SIZE range, this could be the
source of contention. However, when nthr=16, threads operate on 5G
chunks (320G / 16 threads / (1<<KTASK_LOAD_BAL_SHIFT)), so this wasn't
the cause, and aligning the chunks on PUD_SIZE boundaries didn't help
either.

The time is short (6.4 seconds), so the next theory was threads
finishing at different times, but probes showed the threads all
returned within less than a millisecond of each other.

Kernel probes turned up a few smaller VFIO page pin calls besides the
heavy 320G call. The chunk size given (PMD_SIZE) could affect thread
count and chunk size for these, so the chunk size was increased from 2M
to 1G. This caused the split PMD contention to disappear, but with
little change in the runtime. More digging required.

[1] lkml.kernel.org/r/20180925120326.24392-1-mhocko@kernel.org

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
Tested-by: Hongbo Yao <yaohongbo@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-
Submitted by Daniel Jordan

hulk inclusion
category: feature
bugzilla: 13228
CVE: NA

---------------------------

Now that mmap_sem is no longer required for modifying locked_vm, remove
it in the VFIO code.

[XXX Can be sent separately, along with similar conversions in the other
places mmap_sem was taken for locked_vm. While at it, could make similar
changes to pinned_vm.]

Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
Tested-by: Hongbo Yao <yaohongbo@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
-