1. 27 December 2019, 40 commits
    • locking/qspinlock: Introduce starvation avoidance into CNA · 47835be6
      Committed by Alex Kogan
      hulk inclusion
      category: feature
      bugzilla: 13227
      CVE: NA
      
      -------------------------------------------------
      
      Choose the next lock holder among spinning threads running on the same
      node with high probability rather than always. With small probability,
      hand the lock to the first thread in the secondary queue or, if that
      queue is empty, to the immediate successor of the current lock holder
      in the main queue.  Thus, assuming no failures while threads hold the
      lock, every thread would be able to acquire the lock after a bounded
      number of lock transitions, with high probability.
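      
      As a rough illustration of the probabilistic choice described above (the
      constant and helper name below are illustrative, not the actual patch):
      
        #include <stdbool.h>
        #include <stdlib.h>
      
        /*
         * Illustrative sketch only: keep the lock on the current NUMA node
         * with high probability (here 255/256); with the remaining small
         * probability hand it to the head of the secondary queue, or to the
         * main-queue successor if that queue is empty, so that every waiter
         * is served after a bounded expected number of handovers.
         */
        static bool cna_keep_lock_local(void)
        {
                return (rand() & 0xff) != 0;
        }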
      Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
      Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
      Signed-off-by: Wei Li <liwei391@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      47835be6
    • locking/qspinlock: Introduce CNA into the slow path of qspinlock · 2636acee
      Committed by Alex Kogan
      hulk inclusion
      category: feature
      bugzilla: 13227
      CVE: NA
      
      -------------------------------------------------
      
      In CNA, spinning threads are organized in two queues, a main queue for
      threads running on the same node as the current lock holder, and a
      secondary queue for threads running on other nodes. At the unlock time,
      the lock holder scans the main queue looking for a thread running on
      the same node. If found (call it thread T), all threads in the main queue
      between the current lock holder and T are moved to the end of the
      secondary queue, and the lock is passed to T. If such T is not found, the
      lock is passed to the first node in the secondary queue. Finally, if the
      secondary queue is empty, the lock is passed to the next thread in the
      main queue. For more details, see https://arxiv.org/abs/1810.05600.
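      
      A simplified sketch of that unlock-time policy follows; the type and
      helper names are illustrative and not the actual kernel symbols:
      
        struct cna_node {
                struct cna_node *next;          /* successor in the main queue */
                int              numa_node;
        };
      
        /* Assumed helpers, declarations only. */
        struct cna_node *find_successor_on_same_node(struct cna_node *me);
        void move_skipped_waiters_to_secondary(struct cna_node *me,
                                               struct cna_node *t);
        void pass_lock_to(struct cna_node *waiter);
      
        static void cna_pass_lock(struct cna_node *me,
                                  struct cna_node *secondary_head)
        {
                /* Scan the main queue for a waiter on the lock holder's node. */
                struct cna_node *t = find_successor_on_same_node(me);
      
                if (t) {
                        /* Waiters between 'me' and 't' run on other nodes. */
                        move_skipped_waiters_to_secondary(me, t);
                        pass_lock_to(t);
                } else if (secondary_head) {
                        pass_lock_to(secondary_head);
                } else {
                        pass_lock_to(me->next);
                }
        }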
      
      Note that this variant of CNA may introduce starvation by continuously
      passing the lock to threads running on the same node. This issue
      will be addressed later in the series.
      
      Enabling CNA is controlled via a new configuration option
      (NUMA_AWARE_SPINLOCKS). The CNA variant is patched in
      at the boot time only if we run a multi-node machine, and the new
      config is enabled. For the time being, the patching requires
      CONFIG_PARAVIRT_SPINLOCKS to be enabled as well.
      However, this should be resolved once static_call() is available.
      Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
      Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
      Signed-off-by: Wei Li <liwei391@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      2636acee
    • locking/qspinlock: Refactor the qspinlock slow path · 6ebce3eb
      Committed by Alex Kogan
      hulk inclusion
      category: feature
      bugzilla: 13227
      CVE: NA
      
      -------------------------------------------------
      
      Move some of the code manipulating the spin lock into separate functions.
      This would allow easier integration of alternative ways to manipulate
      that lock.
      Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
      Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
      Signed-off-by: Wei Li <liwei391@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      6ebce3eb
    • locking/qspinlock: Make arch_mcs_spin_unlock_contended more generic · fea85437
      Committed by Alex Kogan
      hulk inclusion
      category: feature
      bugzilla: 13227
      CVE: NA
      
      -------------------------------------------------
      
      The arch_mcs_spin_unlock_contended macro should accept the value to be
      stored into the lock argument as another argument. This allows using the
      same macro in cases where the value to be stored is different from 1.
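      
      Roughly, the generic definition in kernel/locking/mcs_spinlock.h changes
      along these lines (sketch; the exact patch may differ):
      
        /* Previously: #define arch_mcs_spin_unlock_contended(l) smp_store_release((l), 1) */
      
        /* Now the caller passes the value to store, so a value other than 1
         * can be handed to the next waiter: */
        #define arch_mcs_spin_unlock_contended(l, val)                  \
                smp_store_release((l), (val))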
      Signed-off-by: Alex Kogan <alex.kogan@oracle.com>
      Reviewed-by: Steve Sistare <steven.sistare@oracle.com>
      Signed-off-by: Wei Li <liwei391@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      fea85437
    • locking/qspinlock: Remove unnecessary BUG_ON() call · 3b113191
      Committed by Waiman Long
      mainline inclusion
      from mainline-5.1-rc1
      commit 733000c7ffd9
      category: bugfix
      bugzilla: 13227
      CVE: NA
      
      -------------------------------------------------
      
      With the > 4 nesting levels case handled by the commit:
      
        d682b596d993 ("locking/qspinlock: Handle > 4 slowpath nesting levels")
      
      the BUG_ON() call in encode_tail() will never actually be triggered.
      
      Remove it.
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Will Deacon <will.deacon@arm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Link: https://lkml.kernel.org/r/1551057253-3231-1-git-send-email-longman@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Wei Li <liwei391@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      3b113191
    • locking/qspinlock_stat: Track the no MCS node available case · bd3656c2
      Committed by Waiman Long
      mainline inclusion
      from mainline-5.1-rc1
      commit 412f34a82ccf
      category: bugfix
      bugzilla: 13227
      CVE: NA
      
      -------------------------------------------------
      
      Track the number of slowpath locking operations that are being done
      without any MCS node available, as well as renaming lock_index[123] to
      make them more descriptive.
      
      Using these stat counters is one way to find out if a code path is
      being exercised.
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: SRINIVAS <srinivas.eeda@oracle.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Zhenzhong Duan <zhenzhong.duan@oracle.com>
      Link: https://lkml.kernel.org/r/1548798828-16156-3-git-send-email-longman@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Wei Li <liwei391@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      bd3656c2
    • locking/qspinlock: Handle > 4 slowpath nesting levels · 13fd11b5
      Committed by Waiman Long
      mainline inclusion
      from mainline-5.1-rc1
      commit d682b596d993
      category: bugfix
      bugzilla: 13227
      CVE: NA
      
      -------------------------------------------------
      
      Four queue nodes per CPU are allocated to enable up to 4 nesting levels
      using the per-CPU nodes. Nested NMIs are possible on some architectures.
      Still, it is very unlikely that we will ever hit more than 4 nesting
      levels with contention in the slowpath.
      
      When that rare condition happens, however, it is likely that the system
      will hang or crash shortly after that. It is not good and we need to
      handle this exception case.
      
      This is done by spinning directly on the lock using repeated trylock.
      This alternative code path should only be used when there are nested
      NMIs. Assuming that the locks used by those NMI handlers will not be
      heavily contended, simple TAS locking should work out.
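      
      The fallback inside queued_spin_lock_slowpath() looks roughly like the
      fragment below (a sketch of the mainline change, not a verbatim quote):
      
        /*
         * If all four per-CPU MCS nodes are already in use (idx >= MAX_NODES),
         * don't queue at all; just spin on the lock word with trylock until
         * the lock is acquired.
         */
        if (unlikely(idx >= MAX_NODES)) {
                while (!queued_spin_trylock(lock))
                        cpu_relax();
                goto release;
        }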
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Waiman Long <longman@redhat.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Acked-by: Will Deacon <will.deacon@arm.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: James Morse <james.morse@arm.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: SRINIVAS <srinivas.eeda@oracle.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Zhenzhong Duan <zhenzhong.duan@oracle.com>
      Link: https://lkml.kernel.org/r/1548798828-16156-2-git-send-email-longman@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Wei Li <liwei391@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      13fd11b5
    • locking/pvqspinlock: Extend node size when pvqspinlock is configured · c579875d
      Committed by Waiman Long
      mainline inclusion
      from mainline-4.20-rc1
      commit 0fa809ca7f81
      category: bugfix
      bugzilla: 13227
      CVE: NA
      
      -------------------------------------------------
      
      The qspinlock code supports up to 4 levels of slowpath nesting using
      four per-CPU mcs_spinlock structures. For 64-bit architectures, they
      fit nicely in one 64-byte cacheline.
      
      For para-virtualized (PV) qspinlocks it needs to store more information
      in the per-CPU node structure than there is space for. It uses a trick
      to use a second cacheline to hold the extra information that it needs.
      So PV qspinlock needs to access two extra cachelines for its information
      whereas the native qspinlock code only needs one extra cacheline.
      
      Freshly added counter profiling of the qspinlock code, however, revealed
      that it was very rare to use more than two levels of slowpath nesting.
      So it doesn't make sense to penalize PV qspinlock code in order to have
      four mcs_spinlock structures in the same cacheline to optimize for a case
      in the native qspinlock code that rarely happens.
      
      Extend the per-CPU node structure to have two more long words when PV
      qspinlock locks are configured to hold the extra data that it needs.
      
      As a result, the PV qspinlock code will enjoy the same benefit of using
      just one extra cacheline like the native counterpart, for most cases.
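      
      The resulting per-CPU node layout can be sketched as follows (based on
      the description above; the PV-only padding lives alongside the MCS node):
      
        struct qnode {
                struct mcs_spinlock mcs;
        #ifdef CONFIG_PARAVIRT_SPINLOCKS
                long reserved[2];       /* extra words used only by PV qspinlock */
        #endif
        };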
      
      [ mingo: Minor changelog edits. ]
      Signed-off-by: Waiman Long <longman@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Link: http://lkml.kernel.org/r/1539697507-28084-2-git-send-email-longman@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Wei Li <liwei391@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      c579875d
    • locking/qspinlock_stat: Count instances of nested lock slowpaths · c9bd1fb6
      Committed by Waiman Long
      mainline inclusion
      from mainline-4.20-rc1
      commit 1222109a5363
      category: bugfix
      bugzilla: 13227
      CVE: NA
      
      -------------------------------------------------
      
      Queued spinlock supports up to 4 levels of lock slowpath nesting -
      user context, soft IRQ, hard IRQ and NMI. However, we are not sure how
      often the nesting happens.
      
      So add 3 more per-CPU stat counters to track the number of instances where
      nesting index goes to 1, 2 and 3 respectively.
      
      On a dual-socket 64-core 128-thread Zen server, the following were the
      new stat counter values under different circumstances:
      
               State                         slowpath   index1   index2   index3
               -----                         --------   ------   ------   -------
        After bootup                         1,012,150    82       0        0
        After parallel build + perf-top    125,195,009    82       0        0
      
      So the chance of having more than 2 levels of nesting is extremely low.
      
      [ mingo: Minor changelog edits. ]
      Signed-off-by: Waiman Long <longman@redhat.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Will Deacon <will.deacon@arm.com>
      Link: http://lkml.kernel.org/r/1539697507-28084-1-git-send-email-longman@redhat.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Wei Li <liwei391@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      c9bd1fb6
    • arm64: Force SSBS on context switch · c2f1181d
      Committed by Marc Zyngier
      mainline inclusion
      from mainline-v5.3-rc2
      commit cbdf8a189a66001c36007bf0f5c975d0376c5c3a
      category: feature
      bugzilla: 20806
      CVE: NA
      
      -------------------------------------------------
      
      On a CPU that doesn't support SSBS, PSTATE[12] is RES0.  In a system
      where only some of the CPUs implement SSBS, we end up losing track of
      the SSBS bit across task migration.
      
      To address this issue, let's force the SSBS bit on context switch.
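      
      A simplified sketch of the idea follows; the mitigation check and the
      PSTATE bit names are illustrative and may not match the upstream helpers
      exactly:
      
        static void ssbs_thread_switch(struct task_struct *next)
        {
                struct pt_regs *regs = task_pt_regs(next);
      
                /* If the SSBD mitigation must stay enabled for this task,
                 * leave SSBS clear (hypothetical helper name). */
                if (ssbd_mitigation_required(next))
                        return;
      
                /* Otherwise re-assert SSBS in the saved PSTATE, in case the
                 * previous CPU treated PSTATE[12] as RES0 and lost the bit. */
                if (compat_user_mode(regs))
                        regs->pstate |= PSR_AA32_SSBS_BIT;
                else if (user_mode(regs))
                        regs->pstate |= PSR_SSBS_BIT;
        }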
      
      Fixes: 8f04e8e6e29c ("arm64: ssbd: Add support for PSTATE.SSBS rather than trapping to EL3")
      Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
      [will: inverted logic and added comments]
      Signed-off-by: Will Deacon <will@kernel.org>
      Conflicts:
        arch/arm64/kernel/process.c
      [yyl: adjust context]
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      c2f1181d
    • arm64: fix SSBS sanitization · 0c708103
      Committed by Mark Rutland
      mainline inclusion
      from mainline-v5.0-rc8
      commit f54dada8274643e3ff4436df0ea124aeedc43cae
      category: feature
      bugzilla: 20806
      CVE: NA
      
      -------------------------------------------------
      
      In valid_user_regs() we treat SSBS as a RES0 bit, and consequently it is
      unexpectedly cleared when we restore a sigframe or fiddle with GPRs via
      ptrace.
      
      This patch fixes valid_user_regs() to account for this, updating the
      function to refer to the latest ARM ARM (ARM DDI 0487D.a). For AArch32
      tasks, SSBS appears in bit 23 of SPSR_EL1, matching its position in the
      AArch32-native PSR format, and we don't need to translate it as we have
      to for DIT.
      
      There are no other bit assignments that we need to account for today.
      As the recent documentation describes the DIT bit, we can drop our
      comment regarding DIT.
      
      While removing SSBS from the RES0 masks, existing inconsistent
      whitespace is corrected.
      
      Fixes: d71be2b6c0e19180 ("arm64: cpufeature: Detect SSBS and advertise to userspace")
      Signed-off-by: Mark Rutland <mark.rutland@arm.com>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      0c708103
    • arm64: cpu: Move errata and feature enable callbacks closer to callers · a37ad0cd
      Committed by Will Deacon
      mainline inclusion
      from mainline-v4.20-rc1
      commit b8925ee2e12d1cb9a11d6f28b5814f2bfa59dce1
      category: feature
      bugzilla: 20806
      CVE: NA
      
      -------------------------------------------------
      
      The cpu errata and feature enable callbacks are only called via their
      respective arm64_cpu_capabilities structure and therefore shouldn't
      exist in the global namespace.
      
      Move the PAN, RAS and cache maintenance emulation enable callbacks into
      the same files as their corresponding arm64_cpu_capabilities structures,
      making them static in the process.
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Conflicts:
        arch/arm64/kernel/cpu_errata.c
        arch/arm64/kernel/cpufeature.c
      [yyl: adjust context]
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Xuefeng Wang <wxf.wang@hisilicon.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      a37ad0cd
    • KVM: arm64: Set SCTLR_EL2.DSSBS if SSBD is forcefully disabled and !vhe · 1eea4492
      Committed by Will Deacon
      mainline inclusion
      from mainline-v4.20-rc1
      commit 7c36447ae5a090729e7b129f24705bb231a07e0b
      category: feature
      bugzilla: 20806
      CVE: NA
      
      -------------------------------------------------
      
      When running without VHE, it is necessary to set SCTLR_EL2.DSSBS if SSBD
      has been forcefully disabled on the kernel command-line.
      Acked-by: Christoffer Dall <christoffer.dall@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Xuefeng Wang <wxf.wang@hisilicon.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      1eea4492
    • arm64: ssbd: Add support for PSTATE.SSBS rather than trapping to EL3 · 09bb7481
      Committed by Will Deacon
      mainline inclusion
      from mainline-v4.20-rc1
      commit 8f04e8e6e29c93421a95b61cad62e3918425eac7
      category: feature
      bugzilla: 20806
      CVE: NA
      
      -------------------------------------------------
      
      On CPUs with support for PSTATE.SSBS, the kernel can toggle the SSBD
      state without needing to call into firmware.
      
      This patch hooks into the existing SSBD infrastructure so that SSBS is
      used on CPUs that support it, but it's all made horribly complicated by
      the very real possibility of big/little systems that don't uniformly
      provide the new capability.
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      
      Conflicts:
        arch/arm64/kernel/process.c
        arch/arm64/kernel/ssbd.c
        arch/arm64/kernel/cpufeature.c
      [yyl: adjust context]
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Xuefeng Wang <wxf.wang@hisilicon.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      09bb7481
    • arm64: ssbd: Drop #ifdefs for PR_SPEC_STORE_BYPASS · 1438602c
      Committed by Will Deacon
      mainline inclusion
      from mainline-v4.20-rc1
      commit 2d1b2a91d56b19636b740ea70c8399d1df249f20
      category: feature
      bugzilla: 20806
      CVE: NA
      
      -------------------------------------------------
      
      Now that we're all merged nicely into mainline, there's no need to check
      to see if PR_SPEC_STORE_BYPASS is defined.
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Xuefeng Wang <wxf.wang@hisilicon.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      1438602c
    • arm64: cpufeature: Detect SSBS and advertise to userspace · be185032
      Committed by Will Deacon
      mainline inclusion
      from mainline-v4.20-rc1
      commit d71be2b6c0e19180b5f80a6d42039cc074a693a2
      category: feature
      bugzilla: 20806
      CVE: NA
      
      -------------------------------------------------
      
      Armv8.5 introduces a new PSTATE bit known as Speculative Store Bypass
      Safe (SSBS) which can be used as a mitigation against Spectre variant 4.
      
      Additionally, a CPU may provide instructions to manipulate PSTATE.SSBS
      directly, so that userspace can toggle the SSBS control without trapping
      to the kernel.
      
      This patch probes for the existence of SSBS and advertises the new
      instructions to userspace if they exist.
      Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Conflicts:
        arch/arm64/kernel/cpufeature.c
        arch/arm64/include/asm/cpucaps.h
      [yyl: adjust context]
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Xuefeng Wang <wxf.wang@hisilicon.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      be185032
    • arm64: Fix silly typo in comment · 9f3c5929
      Committed by Will Deacon
      mainline inclusion
      from mainline-v4.20-rc1
      commit ca7f686ac9fe87a9175696a8744e095ab9749c49
      category: feature
      bugzilla: 20806
      CVE: NA
      
      -------------------------------------------------
      
      I was passing through and figured I'd fix this up:
      
      	featuer -> feature
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Xuefeng Wang <wxf.wang@hisilicon.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      9f3c5929
    • ksm: replace jhash2 with xxhash · ac7353e5
      Committed by Timofey Titovets
      mainline inclusion
      from mainline-5.0-rc1
      commit 59e1a2f4bf83744e748636415fde7d1e9f557e05
      category: performance
      bugzilla: 13231
      CVE: NA
      
      ------------------------------------------------
      
      Replace jhash2 with xxhash.
      
      Perf numbers:
      Intel(R) Xeon(R) CPU E5-2420 v2 @ 2.20GHz
      ksm: crc32c   hash() 12081 MB/s
      ksm: xxh64    hash()  8770 MB/s
      ksm: xxh32    hash()  4529 MB/s
      ksm: jhash2   hash()  1569 MB/s
      
      Sioh Lee did some testing:
      
      crc32c_intel: 1084.10ns
      crc32c (no hardware acceleration): 7012.51ns
      xxhash32: 2227.75ns
      xxhash64: 1413.16ns
      jhash2: 5128.30ns
      
      As jhash2 will always be slower (for data sizes like PAGE_SIZE), don't use
      it in ksm at all.
      
      Use only xxhash for now, because using crc32c requires the cryptoapi to be
      initialized first - and that needs some tricky solution to work well in all
      situations.
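      
      After this change, ksm's checksum helper looks roughly like:
      
        static u32 calc_checksum(struct page *page)
        {
                u32 checksum;
                void *addr = kmap_atomic(page);
      
                checksum = xxhash(addr, PAGE_SIZE, 0);  /* was: jhash2(...) */
                kunmap_atomic(addr);
                return checksum;
        }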
      
      Link: http://lkml.kernel.org/r/20181023182554.23464-3-nefelim4ag@gmail.com
      Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
      Signed-off-by: leesioh <solee@os.korea.ac.kr>
      Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: zhong jiang <zhongjiang@huawei.com>
      Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      ac7353e5
    • xxHash: create arch dependent 32/64-bit xxhash() · b41feb0c
      Committed by Timofey Titovets
      mainline inclusion
      from mainline-5.0-rc1
      commit 0b9df58b79fa283fbedc0fb6a8e248599444bacc
      category: performance
      bugzilla: 13231
      CVE: NA
      
      ------------------------------------------------
      
      Patch series "Currently used jhash are slow enough and replace it allow as
      to make KSM", v8.
      
      Speed (in kernel):
              ksm: crc32c   hash() 12081 MB/s
              ksm: xxh64    hash()  8770 MB/s
              ksm: xxh32    hash()  4529 MB/s
              ksm: jhash2   hash()  1569 MB/s
      
      Sioh Lee's testing (copy from other mail):
      
      Test platform: openstack cloud platform (NEWTON version)
      Experiment node: openstack based cloud compute node (CPU: xeon E5-2620 v3, memory 64gb)
      VM: (2 VCPU, RAM 4GB, DISK 20GB) * 4
      Linux kernel: 4.14 (latest version)
      KSM setup - sleep_millisecs: 200ms, pages_to_scan: 200
      
      Experiment process:
      Firstly, we turn off KSM and launch 4 VMs.  Then we turn on the KSM and
      measure the checksum computation time until full_scans become two.
      
      The experimental results (the experimental value is the average of the measured values)
      crc32c_intel: 1084.10ns
      crc32c (no hardware acceleration): 7012.51ns
      xxhash32: 2227.75ns
      xxhash64: 1413.16ns
      jhash2: 5128.30ns
      
      In summary, the result shows that crc32c_intel has advantages over all of
      the hash functions used in the experiment (decreased by 84.54% compared
      to crc32c, 78.86% compared to jhash2, 51.33% compared to xxhash32, and
      23.28% compared to xxhash64). The results are similar to those of Timofey.
      
      But use only xxhash for now, because using crc32c requires the cryptoapi
      to be initialized first - and that needs some tricky solution to work well
      in all situations.
      
      So:
      
      - The first patch implements compile-time pickup of the fastest xxhash
        implementation for the target platform.
      
      - The second patch replaces jhash2 with xxhash.
      
      This patch (of 2):
      
      xxh32() - fast on both 32/64-bit platforms
      xxh64() - fast only on 64-bit platforms
      
      Create xxhash() which will pick up the fastest version at compile time.
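      
      The new helper in <linux/xxhash.h> is essentially a compile-time switch
      on the word size (sketch):
      
        static inline unsigned long xxhash(const void *input, size_t length,
                                           uint64_t seed)
        {
        #if BITS_PER_LONG == 64
                return xxh64(input, length, seed);
        #else
                return xxh32(input, length, seed);
        #endif
        }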
      
      Link: http://lkml.kernel.org/r/20181023182554.23464-2-nefelim4ag@gmail.com
      Signed-off-by: Timofey Titovets <nefelim4ag@gmail.com>
      Reviewed-by: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: leesioh <solee@os.korea.ac.kr>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: zhong jiang <zhongjiang@huawei.com>
      Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      b41feb0c
    • arm64/neon: Disable -Wincompatible-pointer-types when building with Clang · e9fabe0c
      Committed by Nathan Chancellor
      mainline inclusion
      from mainline-5.0
      commit 0738c8b5915c
      category: bugfix
      bugzilla: 11024
      CVE: NA
      
      -------------------------------------------------
      
      After commit cc9f8349cb33 ("arm64: crypto: add NEON accelerated XOR
      implementation"), Clang builds for arm64 started failing with the
      following error message.
      
      arch/arm64/lib/xor-neon.c:58:28: error: incompatible pointer types
      assigning to 'const unsigned long *' from 'uint64_t *' (aka 'unsigned
      long long *') [-Werror,-Wincompatible-pointer-types]
                      v3 = veorq_u64(vld1q_u64(dp1 +  6), vld1q_u64(dp2 + 6));
                                               ^~~~~~~~
      /usr/lib/llvm-9/lib/clang/9.0.0/include/arm_neon.h:7538:47: note:
      expanded from macro 'vld1q_u64'
        __ret = (uint64x2_t) __builtin_neon_vld1q_v(__p0, 51); \
                                                    ^~~~
      
      There has been quite a bit of debate and triage that has gone into
      figuring out what the proper fix is, viewable at the link below, which
      is still ongoing. Ard suggested disabling this warning with Clang with a
      pragma so no neon code will have this type of error. While this is not
      at all an ideal solution, this build error is the only thing preventing
      KernelCI from having successful arm64 defconfig and allmodconfig builds
      on linux-next. Getting continuous integration running is more important
      so new warnings/errors or boot failures can be caught and fixed quickly.
      
      Link: https://github.com/ClangBuiltLinux/linux/issues/283
      Suggested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      (cherry picked from commit 0738c8b5915c7eaf1e6007b441008e8f3b460443)
      Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      e9fabe0c
    • arm64: crypto: add NEON accelerated XOR implementation · 2b99d865
      Committed by Jackie Liu
      mainline inclusion
      from mainline-5.0-rc1
      commit: cc9f8349cb33965120a96c12e05d00676162eb7f
      category: feature
      feature: NEON accelerated XOR
      bugzilla: 11024
      CVE: NA
      
      --------------------------------------------------
      
      This is a NEON acceleration method that can improve
      performance by approximately 20%. I got the following
      data from CentOS 7.5 on Huawei's HISI1616 chip:
      
      [ 93.837726] xor: measuring software checksum speed
      [ 93.874039]   8regs  : 7123.200 MB/sec
      [ 93.914038]   32regs : 7180.300 MB/sec
      [ 93.954043]   arm64_neon: 9856.000 MB/sec
      [ 93.954047] xor: using function: arm64_neon (9856.000 MB/sec)
      
      I believe this code can bring some optimization for
      all arm64 platforms. Thanks to Ard Biesheuvel for his suggestions.
      Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
      Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      2b99d865
    • arm64/neon: add workaround for ambiguous C99 stdint.h types · 28777f93
      Committed by Jackie Liu
      mainline inclusion
      from mainline-5.0-rc1
      commit: 21e28547f613e7ba4f4cb6831a3ead2d723fdf7b
      category: feature
      feature: NEON accelerated XOR
      bugzilla: 11024
      CVE: NA
      
      --------------------------------------------------
      
      In a way similar to ARM commit 09096f6a ("ARM: 7822/1: add workaround
      for ambiguous C99 stdint.h types"), this patch redefines the macros that
      are used in stdint.h so its definitions of uint64_t and int64_t are
      compatible with those of the kernel.
      
      This patch comes from: https://patchwork.kernel.org/patch/3540001/
      Written by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      
      We mark this file as a private file so we don't have to override asm/types.h.
      Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
      Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      28777f93
    • arm64/lib: improve CRC32 performance for deep pipelines · fce68364
      Committed by Ard Biesheuvel
      mainline inclusion
      from mainline-5.0
      commit: efdb25efc7645b326cd5eb82be5feeabe167c24e
      category: perf
      bugzilla: 20886
      CVE: NA
      
      lib/crc32test result:
      
      [root@localhost build]# rmmod crc32test && insmod lib/crc32test.ko &&
      dmesg | grep cycles
      [83170.153209] CPU7: use cycles 26243990
      [83183.122137] CPU7: use cycles 26151290
      [83309.691628] CPU7: use cycles 26122830
      [83312.415559] CPU7: use cycles 26232600
      [83313.191479] CPU8: use cycles 26082350
      
      rmmod crc32test && insmod lib/crc32test.ko && dmesg | grep cycles
      [ 1023.539931] CPU25: use cycles 12256730
      [ 1024.850360] CPU24: use cycles 12249680
      [ 1025.463622] CPU25: use cycles 12253330
      [ 1025.862925] CPU25: use cycles 12269720
      [ 1026.376038] CPU26: use cycles 12222480
      
      Based on 13702:
      arm64/lib: improve CRC32 performance for deep pipelines
      crypto: arm64/crc32 - remove PMULL based CRC32 driver
      arm64/lib: add accelerated crc32 routines
      arm64: cpufeature: add feature for CRC32 instructions
      lib/crc32: make core crc32() routines weak so they can be overridden
      
      ----------------------------------------------
      
      Improve the performance of the crc32() asm routines by getting rid of
      most of the branches and small sized loads on the common path.
      
      Instead, use a branchless code path involving overlapping 16 byte
      loads to process the first (length % 32) bytes, and process the
      remainder using a loop that processes 32 bytes at a time.
      
      Tested using the following test program:
      
        #include <stdlib.h>
      
        extern void crc32_le(unsigned short, char const*, int);
      
        int main(void)
        {
          static const char buf[4096];
      
          srand(20181126);
      
          for (int i = 0; i < 100 * 1000 * 1000; i++)
            crc32_le(0, buf, rand() % 1024);
      
          return 0;
        }
      
      On Cortex-A53 and Cortex-A57, the performance regresses but only very
      slightly. On Cortex-A72 however, the performance improves from
      
        $ time ./crc32
      
        real  0m10.149s
        user  0m10.149s
        sys   0m0.000s
      
      to
      
        $ time ./crc32
      
        real  0m7.915s
        user  0m7.915s
        sys   0m0.000s
      
      Cc: Rui Sun <sunrui26@huawei.com>
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      fce68364
    • lib/crc32.c: mark crc32_le_base/__crc32c_le_base aliases as __pure · 0066ba62
      Committed by Miguel Ojeda
      mainline inclusion
      from mainline-5.0
      commit ff98e20ef2081b8620dada28fc2d4fb24ca0abf2
      category: bugfix
      bugzilla: 13702
      CVE: NA
      
      -------------------------------------------------
      
      The upcoming GCC 9 release extends the -Wmissing-attributes warnings
      (enabled by -Wall) to C and aliases: it warns when particular function
      attributes are missing in the aliases but not in their target.
      
      In particular, it triggers here because crc32_le_base/__crc32c_le_base
      aren't __pure while their target crc32_le/__crc32c_le are.
      
      These aliases are used by architectures as a fallback in accelerated
      versions of CRC32. See commit 9784d82db3eb ("lib/crc32: make core crc32()
      routines weak so they can be overridden").
      
      Therefore, being fallbacks, it is likely that even if the aliases
      were called from C, there wouldn't be any optimizations possible.
      Currently, the only user is arm64, which calls this from asm.
      
      Still, marking the aliases as __pure makes sense and is a good idea
      for documentation purposes and possible future optimizations,
      which also silences the warning.
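      
      The alias declarations in lib/crc32.c then carry the same attribute as
      their targets, roughly:
      
        u32 __pure crc32_le_base(u32, unsigned char const *, size_t)
                __alias(crc32_le);
        u32 __pure __crc32c_le_base(u32, unsigned char const *, size_t)
                __alias(__crc32c_le);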
      Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Tested-by: Laura Abbott <labbott@redhat.com>
      Signed-off-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
      (cherry picked from commit ff98e20ef2081b8620dada28fc2d4fb24ca0abf2)
      Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      0066ba62
    • crypto: arm64/crc32 - remove PMULL based CRC32 driver · 0e584ca5
      Committed by Ard Biesheuvel
      mainline inclusion
      from mainline-4.20-rc1
      commit: 598b7d41e544322c8c4f3737ee8ddf905a44175e
      category: feature
      feature: accelerated crc32 routines
      bugzilla: 13702
      CVE: NA
      
      --------------------------------------------------
      
      Now that the scalar fallbacks have been moved out of this driver into
      the core crc32()/crc32c() routines, we are left with a CRC32 crypto API
      driver for arm64 that is based only on 64x64 polynomial multiplication,
      which is an optional instruction in the ARMv8 architecture, and is less
      and less likely to be available on cores that do not also implement the
      CRC32 instructions, given that those are mandatory in the architecture
      as of ARMv8.1.
      
      Since the scalar instructions do not require the special handling that
      SIMD instructions do, and since they turn out to be considerably faster
      on some cores (Cortex-A53) as well, there is really no point in keeping
      this code around so let's just remove it.
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      0e584ca5
    • arm64/lib: add accelerated crc32 routines · 11fa09f2
      Committed by Ard Biesheuvel
      mainline inclusion
      from mainline-4.20-rc1
      commit: 7481cddf29ede204b475facc40e6f65459939881
      category: feature
      feature: accelerated crc32 routines
      bugzilla: 13702
      CVE: NA
      
      --------------------------------------------------
      
      Unlike crc32c(), which is wired up to the crypto API internally so the
      optimal driver is selected based on the platform's capabilities,
      crc32_le() is implemented as a library function using a slice-by-8 table
      based C implementation. Even though few of the call sites may be
      bottlenecks, calling a time variant implementation with a non-negligible
      D-cache footprint is a bit of a waste, given that ARMv8.1 and up mandates
      support for the CRC32 instructions that were optional in ARMv8.0, but are
      already widely available, even on the Cortex-A53 based Raspberry Pi.
      
      So implement routines that use these instructions if available, and fall
      back to the existing generic routines otherwise. The selection is based
      on alternatives patching.
      
      Note that this unconditionally selects CONFIG_CRC32 as a builtin. Since
      CRC32 is relied upon by core functionality such as CONFIG_OF_FLATTREE,
      this just codifies the status quo.
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      11fa09f2
    • arm64: cpufeature: add feature for CRC32 instructions · 3a8984a9
      Committed by Ard Biesheuvel
      mainline inclusion
      from mainline-4.20-rc1
      commit: 86d0dd34eafffbc76a81aba6ae2d71927d3835a8
      category: feature
      feature: accelerated crc32 routines
      bugzilla: 13702
      CVE: NA
      
      --------------------------------------------------
      
      Add a CRC32 feature bit and wire it up to the CPU id register so we
      will be able to use alternatives patching for CRC32 operations.
      Acked-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      
      Conflicts:
      	arch/arm64/include/asm/cpucaps.h
      	arch/arm64/kernel/cpufeature.c
      Signed-off-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      3a8984a9
    • lib/crc32: make core crc32() routines weak so they can be overridden · a216345b
      Committed by Ard Biesheuvel
      mainline inclusion
      from mainline-4.20-rc1
      commit: 9784d82db3eb3de7851e5a3f4a2481607de2452c
      category: feature
      feature: accelerated crc32 routines
      bugzilla: 13702
      CVE: NA
      
      --------------------------------------------------
      
      Allow architectures to drop in accelerated CRC32 routines by making
      the crc32_le/__crc32c_le entry points weak, and exposing non-weak
      aliases for them that may be used by the accelerated versions as
      fallbacks in case the instructions they rely upon are not available.
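      
      The arrangement in lib/crc32.c can be sketched with declarations only (an
      architecture's accelerated crc32_le() overrides the weak symbol and may
      call the non-weak *_base alias as its fallback):
      
        /* Weak entry point: an arch can provide its own definition. */
        u32 __pure __weak crc32_le(u32 crc, unsigned char const *p, size_t len);
      
        /* Non-weak alias to the generic implementation, kept available as a
         * fallback for the accelerated version. */
        u32 crc32_le_base(u32, unsigned char const *, size_t) __alias(crc32_le);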
      Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      a216345b
    • pci: do not save 'PCI_BRIDGE_CTL_BUS_RESET' · 36b45f28
      Committed by Xiongfeng Wang
      hulk inclusion
      category: bugfix
      bugzilla: 20702
      CVE: NA
      ---------------------------
      
      When I inject a PCIe fatal error into a Mellanox netdevice, 'dmesg'
      shows the device is recovered successfully, but 'lspci' doesn't show the
      device. I checked the configuration space of the slot where the
      netdevice is inserted and found out the bit 'PCI_BRIDGE_CTL_BUS_RESET'
      is set. Later, I found out this is because the bit is saved in
      'saved_config_space' of 'struct pci_dev' when 'pci_pm_runtime_suspend()'
      is called, and 'PCI_BRIDGE_CTL_BUS_RESET' is then set every time we
      restore the configuration space.
      
      This patch avoids saving the bit 'PCI_BRIDGE_CTL_BUS_RESET' when we save
      the configuration space of a bridge.
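      
      An illustrative sketch of the approach (not the actual hulk patch; the
      helper name is hypothetical):
      
        /* Mask out the Secondary Bus Reset bit while saving a bridge's
         * config space, so a later restore cannot leave the bus in reset. */
        static void pci_save_bridge_control(struct pci_dev *dev, u16 *ctl)
        {
                pci_read_config_word(dev, PCI_BRIDGE_CONTROL, ctl);
                *ctl &= ~PCI_BRIDGE_CTL_BUS_RESET;
        }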
      Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      36b45f28
    • config: enable CONFIG_KTASK in hulk_defconfig and storage_ci_defconfig · 58622cdd
      Committed by Hongbo Yao
      hulk inclusion
      category: feature
      bugzilla: 13228
      CVE: NA
      ---------------------------
      
      enable CONFIG_KTASK in hulk_defconfig and storage_ci_defconfig
      Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      58622cdd
    • ktask: change the chunk size for some ktask thread functions · 135e9232
      Committed by Hongbo Yao
      hulk inclusion
      category: feature
      bugzilla: 13228
      CVE: NA
      
      ---------------------------
      
      This patch fixes some issues in the original series.
      1) PMD_SIZE chunks have made thread finishing times too spread out
      in some cases, so KTASK_MEM_CHUNK (128M) seems to be a reasonable compromise.
      2) If hugepagesz=1G, then pages_per_huge_page = 1G / 4K = 256,
      and using KTASK_MEM_CHUNK would cause the ktask thread count to be 1, which
      will not improve the performance of clearing a gigantic page.
      Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Tested-by: Hongbo Yao <yaohongbo@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      135e9232
    • hugetlbfs: parallelize hugetlbfs_fallocate with ktask · 4733c59f
      Committed by Daniel Jordan
      hulk inclusion
      category: feature
      bugzilla: 13228
      CVE: NA
      ---------------------------
      
      hugetlbfs_fallocate preallocates huge pages to back a file in a
      hugetlbfs filesystem.  The time to call this function grows linearly
      with size.
      
      ktask performs well with its default thread count of 4; higher thread
      counts are given for context only.
      
      Machine: Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz, 288 CPUs, 1T memory
      Test:    fallocate(1) a file on a hugetlbfs filesystem
      
      nthread   speedup   size (GiB)   min time (s)   stdev
            1                    200         127.53    2.19
            2     3.09x          200          41.30    2.11
            4     5.72x          200          22.29    0.51
            8     9.45x          200          13.50    2.58
           16     9.74x          200          13.09    1.64
      
            1                    400         193.09    2.47
            2     2.14x          400          90.31    3.39
            4     3.84x          400          50.32    0.44
            8     5.11x          400          37.75    1.23
           16     6.12x          400          31.54    3.13
      
      The primary bottleneck for better scaling at higher thread counts is
      hugetlb_fault_mutex_table[hash].  perf showed L1-dcache-loads increase
      with 8 threads and again sharply with 16 threads, and a CPU counter
      profile showed that 31% of the L1d misses were on
      hugetlb_fault_mutex_table[hash] in the 16-thread case.
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Tested-by: Hongbo Yao <yaohongbo@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      4733c59f
    • mm: parallelize clear_gigantic_page · ae0cd4d4
      Committed by Daniel Jordan
      hulk inclusion
      category: feature
      bugzilla: 13228
      CVE: NA
      ---------------------------
      
      Parallelize clear_gigantic_page, which zeroes any page size larger than
      8M (e.g. 1G on x86).
      
      Performance results (the default number of threads is 4; higher thread
      counts shown for context only):
      
      Machine:  Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz, 288 CPUs, 1T memory
      Test:     Clear a range of gigantic pages (triggered via fallocate)
      
      nthread   speedup   size (GiB)   min time (s)   stdev
            1                    100          41.13    0.03
            2     2.03x          100          20.26    0.14
            4     4.28x          100           9.62    0.09
            8     8.39x          100           4.90    0.05
           16    10.44x          100           3.94    0.03
      
            1                    200          89.68    0.35
            2     2.21x          200          40.64    0.18
            4     4.64x          200          19.33    0.32
            8     8.99x          200           9.98    0.04
           16    11.27x          200           7.96    0.04
      
            1                    400         188.20    1.57
            2     2.30x          400          81.84    0.09
            4     4.63x          400          40.62    0.26
            8     8.92x          400          21.09    0.50
           16    11.78x          400          15.97    0.25
      
            1                    800         434.91    1.81
            2     2.54x          800         170.97    1.46
            4     4.98x          800          87.38    1.91
            8    10.15x          800          42.86    2.59
           16    12.99x          800          33.48    0.83
      
      The speedups are mostly due to the fact that more threads can use more
      memory bandwidth.  The loop we're stressing on the x86 chip in this test
      is clear_page_erms, which tops out at a bandwidth of 2550 MiB/s with one
      thread.  We get the same bandwidth per thread for 2, 4, or 8 threads,
      but at 16 threads the per-thread bandwidth drops to 1420 MiB/s.
      
      However, the performance also improves over a single thread because of
      the ktask threads' NUMA awareness (ktask migrates worker threads to the
      node local to the work being done).  This becomes a bigger factor as the
      amount of pages to zero grows to include memory from multiple nodes, so
      that speedups increase as the size increases.
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Tested-by: Hongbo Yao <yaohongbo@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      ae0cd4d4
    • mm: parallelize deferred struct page initialization within each node · eb761d65
      Committed by Daniel Jordan
      hulk inclusion
      category: feature
      bugzilla: 13228
      CVE: NA
      ---------------------------
      
      Deferred struct page initialization currently runs one thread per node,
      but this is a bottleneck during boot on big machines, so use ktask
      within each pgdatinit thread to parallelize the struct page
      initialization, allowing the system to take better advantage of its
      memory bandwidth.
      
      Because the system is not fully up yet and most CPUs are idle, use more
      than the default maximum number of ktask threads.  The kernel doesn't
      know the memory bandwidth of a given system to get the most efficient
      number of threads, so there's some guesswork involved.  In testing, a
      reasonable value turned out to be about a quarter of the CPUs on the
      node.
      
      __free_pages_core used to increase the zone's managed page count by the
      number of pages being freed.  To accommodate multiple threads, however,
      account the number of freed pages with an atomic shared across the ktask
      threads and bump the managed page count with it after ktask is finished.
      
      Test:    Boot the machine with deferred struct page init three times
      
      Machine: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz, 88 CPUs, 503G memory,
               2 sockets
      
      kernel                   speedup   max time per   stdev
                                         node (ms)
      
      baseline (4.15-rc2)                        5860     8.6
      ktask                      9.56x            613    12.4
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Tested-by: Hongbo Yao <yaohongbo@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      eb761d65
    • mm: enlarge type of offset argument in mem_map_offset and mem_map_next · 228e4183
      Committed by Daniel Jordan
      hulk inclusion
      category: feature
      bugzilla: 13228
      CVE: NA
      ---------------------------
      
      Changes the type of 'offset' from int to unsigned long in both
      mem_map_offset and mem_map_next.
      
      This facilitates ktask's use of mem_map_next with its unsigned long
      types to avoid silent truncation when these unsigned longs are passed as
      ints.
      
      It also fixes the preexisting truncation of 'offset' from unsigned long
      to int by the sole caller of mem_map_offset, follow_hugetlb_page.
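      
      The prototypes after the change look roughly like this (mm/internal.h):
      
        /* 'offset' widened from int to unsigned long in both helpers. */
        static inline struct page *mem_map_offset(struct page *base,
                                                  unsigned long offset)
        {
                if (unlikely(offset >= MAX_ORDER_NR_PAGES))
                        return nth_page(base, offset);
                return base + offset;
        }
      
        static inline struct page *mem_map_next(struct page *iter,
                                                struct page *base,
                                                unsigned long offset);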
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Tested-by: Hongbo Yao <yaohongbo@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      228e4183
    • vfio: relieve mmap_sem reader cacheline bouncing by holding it longer · 22705b26
      Committed by Daniel Jordan
      hulk inclusion
      category: feature
      bugzilla: 13228
      CVE: NA
      ---------------------------
      
      Profiling shows significant time being spent on atomic ops in mmap_sem
      reader acquisition.  mmap_sem is taken and dropped for every single base
      page during pinning, so this is not surprising.
      
      Reduce the number of times mmap_sem is taken by holding for longer,
      which relieves atomic cacheline bouncing.
      
      Results for all VFIO page pinning patches
      -----------------------------------------
      
      The test measures the time from qemu invocation to the start of guest
      boot.  The guest uses kvm with 320G memory backed with THP.  320G fits
      in a node on the test machine used here, so there was no thrashing in
      reclaim because of __GFP_THISNODE in THP allocations[1].
      
      CPU:              2 nodes * 24 cores/node * 2 threads/core = 96 CPUs
                        Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz
      memory:           754G split evenly between nodes
      scaling_governor: performance
      
           patch 6                  patch 8              patch 9 (this one)
           -----------------------  ----------------- ---------------------
      thr  speedup average sec  speedup   average sec  speedup   average sec
        1          65.0(± 0.6%)           65.2(± 0.5%)           65.5(± 0.4%)
        2   1.5x   42.8(± 5.8%)   1.8x    36.0(± 0.9%)    1.9x   34.4(± 0.3%)
        3   1.9x   35.0(±11.3%)   2.5x    26.4(± 4.2%)    2.8x   23.7(± 0.2%)
        4   2.3x   28.5(± 1.3%)   3.1x    21.2(± 2.8%)    3.6x   18.3(± 0.3%)
        5   2.5x   26.2(± 1.5%)   3.6x    17.9(± 0.9%)    4.3x   15.1(± 0.3%)
        6   2.7x   24.5(± 1.8%)   4.0x    16.5(± 3.0%)    5.1x   12.9(± 0.1%)
        7   2.8x   23.5(± 4.9%)   4.2x    15.4(± 2.7%)    5.7x   11.5(± 0.6%)
        8   2.8x   22.8(± 1.8%)   4.2x    15.5(± 4.7%)    6.4x   10.3(± 0.8%)
       12   3.2x   20.2(± 1.4%)   4.4x    14.7(± 2.9%)    8.6x    7.6(± 0.6%)
       16   3.3x   20.0(± 0.7%)   4.3x    15.4(± 1.3%)   10.2x    6.4(± 0.6%)
      
      At patch 6, lock_stat showed long reader wait time on mmap_sem writers,
      leading to patch 8.
      
      At patch 8, profiling revealed the issue with mmap_sem described above.
      
      Across all three patches, performance consistently improves as the
      thread count increases.  The one exception is the antiscaling with
      nthr=16 in patch 8: those mmap_sem atomics are really bouncing around
      the machine.
      
      The performance with patch 9 looks pretty good overall.  I'm working on
      finding the next bottleneck, and this is where it stopped:  When
      nthr=16, the obvious issue profiling showed was contention on the split
      PMD page table lock when pages are faulted in during the pinning (>2% of
      the time).
      
      A split PMD lock protects a PUD_SIZE-ed amount of page table mappings
      (1G on x86), so if threads were operating on smaller chunks and
      contending in the same PUD_SIZE range, this could be the source of
      contention.  However, when nthr=16, threads operate on 5G chunks (320G /
      16 threads / (1<<KTASK_LOAD_BAL_SHIFT)), so this wasn't the cause, and
      aligning the chunks on PUD_SIZE boundaries didn't help either.
      
      The runtime is short (6.4 seconds), so the next theory was that the
      threads were finishing at different times, but probes showed the
      threads all returned within a millisecond of each other.
      
      Kernel probes turned up a few smaller VFIO page pin calls besides the
      heavy 320G call.  The chunk size given to ktask (PMD_SIZE) could affect
      the thread count and chunk size for these smaller calls, so the chunk
      size was increased from 2M to 1G.  This made the split PMD contention
      disappear, but with little change in the runtime.  More digging
      required.
      
      [1] lkml.kernel.org/r/20180925120326.24392-1-mhocko@kernel.org
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Tested-by: Hongbo Yao <yaohongbo@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      22705b26
    • D
      vfio: remove unnecessary mmap_sem writer acquisition around locked_vm · 68a4b481
      Committed by Daniel Jordan
      hulk inclusion
      category: feature
      bugzilla: 13228
      CVE: NA
      ---------------------------
      
      Now that mmap_sem is no longer required for modifying locked_vm, remove
      it in the VFIO code.
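
      Schematically, the change removes a writer critical section that
      existed only to bump a counter (a sketch, not the exact diff; the
      RLIMIT_MEMLOCK check is elided):

        /* Before: mmap_sem taken as writer just to update locked_vm. */
        down_write(&mm->mmap_sem);
        mm->locked_vm += npage;
        up_write(&mm->mmap_sem);

        /* After: locked_vm is an atomic_long_t, so no mmap_sem needed. */
        atomic_long_add(npage, &mm->locked_vm);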
      
      [XXX Can be sent separately, along with similar conversions in the other
      places mmap_sem was taken for locked_vm.  While at it, could make
      similar changes to pinned_vm.]
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Tested-by: Hongbo Yao <yaohongbo@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      68a4b481
    • D
      mm: change locked_vm's type from unsigned long to atomic_long_t · 53f4e528
      Committed by Daniel Jordan
      hulk inclusion
      category: feature
      bugzilla: 13228
      CVE: NA
      ---------------------------
      
      Currently, mmap_sem must be held as writer to modify the locked_vm field
      in mm_struct.
      
      This creates a bottleneck when multithreading VFIO page pinning because
      each thread holds the mmap_sem as reader for the majority of the pinning
      time but also takes mmap_sem as writer regularly, for short times, when
      modifying locked_vm.
      
      The problem gets worse when other workloads compete for CPU with ktask
      threads doing page pinning: the other workloads can force a ktask
      thread that holds mmap_sem as writer off the CPU, blocking the ktask
      threads trying to take mmap_sem as reader for an excessively long time
      (the mmap_sem reader wait time grows linearly with the thread count).
      
      Requiring mmap_sem for locked_vm also abuses mmap_sem by making it
      protect data that could be synchronized separately.
      
      So, decouple locked_vm from mmap_sem by making locked_vm an
      atomic_long_t.  locked_vm's old type was unsigned long and changing it
      to a signed type makes it lose half its capacity, but that's only a
      concern for 32-bit systems and LONG_MAX * PAGE_SIZE is 8T on x86 in that
      case, so there's headroom.
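
      One possible pattern for a call site once locked_vm is atomic (a
      sketch only; note that the add-then-check can transiently overshoot
      the limit before backing out):

        unsigned long limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
        long locked;
        int ret = 0;

        locked = atomic_long_add_return(npage, &mm->locked_vm);
        if (locked > limit && !capable(CAP_IPC_LOCK)) {
                atomic_long_sub(npage, &mm->locked_vm);
                ret = -ENOMEM;
        }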
      
      Now that mmap_sem is not taken as writer here, ktask threads holding
      mmap_sem as reader can run more often.  Performance results appear later
      in the series.
      
      On powerpc, this was only cross-compile tested.
      
      [XXX Can send separately.]
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Tested-by: Hongbo Yao <yaohongbo@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      53f4e528
    • D
      vfio: parallelize vfio_pin_map_dma · b0908eee
      Committed by Daniel Jordan
      hulk inclusion
      category: feature
      bugzilla: 13228
      CVE: NA
      ---------------------------
      
      When starting a large-memory kvm guest, it takes an excessively long
      time to start the boot process because qemu must pin all guest pages to
      accommodate DMA when VFIO is in use.  Currently just one CPU is
      responsible for the page pinning, which usually boils down to page
      clearing time-wise, so the ways to optimize this are buying a faster
      CPU ;-) or using more of the CPUs you already have.
      
      Parallelize with ktask.  Refactor so that workqueue workers pin with
      the mm of the calling thread, and add an undo callback so that ktask
      can unwind partially pinned work when an error occurs during page
      pinning.
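
      Roughly, the call might take the following shape.  This is only a
      sketch: ktask_run(), DEFINE_KTASK_CTL() and ktask_ctl_set_undo_func()
      are taken loosely from the ktask series and may not match the exact
      signatures here, and chunk_func(), undo_func() and struct pin_args are
      made-up placeholders.

        /*
         * chunk_func pins one [start, end) chunk of the mapping using the
         * mm passed via args (the caller's mm, not the worker's), and
         * undo_func unpins a chunk so ktask can roll back partial work if
         * another chunk fails.
         */
        struct pin_args args = { .mm = current->mm, .dma = dma };
        struct ktask_ctl ctl = DEFINE_KTASK_CTL(chunk_func, &args, PMD_SIZE);

        ktask_ctl_set_undo_func(&ctl, undo_func);
        ret = ktask_run((void *)dma->vaddr, dma->size, &ctl);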
      
      Performance results appear later in the series.
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Tested-by: Hongbo Yao <yaohongbo@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      b0908eee
    • D
      workqueue, ktask: renice helper threads to prevent starvation · cdc79c13
      Committed by Daniel Jordan
      hulk inclusion
      category: feature
      bugzilla: 13228
      CVE: NA
      ---------------------------
      
      With ktask helper threads running at MAX_NICE, it's possible for one or
      more of them to begin chunks of the task and then have their CPU time
      constrained by higher priority threads.  The main ktask thread, running
      at normal priority, may finish all available chunks of the task and then
      wait on the MAX_NICE helpers to finish the last in-progress chunks, for
      longer than it would have if no helpers were used.
      
      Avoid this by having the main thread assign its priority to each
      unfinished helper one at a time so that on a heavily loaded system,
      exactly one thread in a given ktask call is running at the main thread's
      priority.  At least one thread to ensure forward progress, and at most
      one thread to limit excessive multithreading.
      
      Since the workqueue interface, on which ktask is built, does not provide
      access to worker threads, ktask can't adjust their priorities directly,
      so add a new interface to allow a previously-queued work item to run at
      a different priority than the one controlled by the corresponding
      workqueue's 'nice' attribute.  The worker assigned to the work item will
      run the work at the given priority, temporarily overriding the worker's
      priority.
      
      The interface is flush_work_at_nice, which ensures the given work item's
      assigned worker runs at the specified nice level and waits for the work
      item to finish.
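
      Usage might look something like the sketch below (assuming a signature
      along the lines of flush_work_at_nice(struct work_struct *work, long
      nice); helper_works[] and nr_helpers are made-up placeholders):

        /*
         * Once the main ktask thread runs out of chunks, flush each
         * helper's work item at the main thread's nice level.  Because
         * each flush waits for that work to finish, at most one helper at
         * a time runs at the boosted priority.
         */
        int nice = task_nice(current);
        int i;

        for (i = 0; i < nr_helpers; i++)
                flush_work_at_nice(&helper_works[i], nice);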
      
      An alternative choice would have been to simply requeue the work item to
      a pool with workers of the new priority, but this doesn't seem feasible
      because a worker may have already started executing the work and there's
      currently no way to interrupt it midway through.  The proposed interface
      solves this issue because a worker's priority can be adjusted while it's
      executing the work.
      
      TODO:  flush_work_at_nice is a proof-of-concept only, and it may be
      desired to have the interface set the work's nice without also waiting
      for it to finish.  It's implemented in the flush path for this RFC
      because it was fairly simple to write ;-)
      
      I ran tests similar to the ones in the last patch with a couple of
      differences:
       - The non-ktask workload uses 8 CPUs instead of 7 to compete with the
         main ktask thread as well as the ktask helpers, so that when the main
         thread finishes, its CPU is completely occupied by the non-ktask
         workload, meaning MAX_NICE helpers can't run as often.
       - The non-ktask workload starts before the ktask workload, rather
         than after, to maximize the chance that it starves helpers.
      
      Runtimes in seconds.
      
      Case 1: Synthetic, worst-case CPU contention
      
       ktask_test - a tight loop doing integer multiplication to max out on CPU;
                    used for testing only, does not appear in this series
       stress-ng  - cpu stressor ("-c --cpu-method ackerman --cpu-ops 1200");
      
                   8_ktask_thrs           8_ktask_thrs
                   w/o_renice(stdev)   with_renice  (stdev)  1_ktask_thr(stdev)
                   ------------------------------------------------------------
        ktask_test    41.98  ( 0.22)         25.15  ( 2.98)      30.40  ( 0.61)
        stress-ng     44.79  ( 1.11)         46.37  ( 0.69)      53.29  ( 1.91)
      
      Without renicing, ktask_test finishes just after stress-ng does because
      stress-ng needs to free up CPUs for the helpers to finish (ktask_test
      shows a shorter runtime than stress-ng because ktask_test was started
      later).  Renicing lets ktask_test finish 40% sooner, and running the
      same amount of work in ktask_test with 1 thread instead of 8 finishes in
      a comparable amount of time, though longer than "with_renice" because
      MAX_NICE threads still get some CPU time, and the effect over 8 threads
      adds up.
      
      stress-ng's total runtime gets a little longer going from no renicing to
      renicing, as expected, because each reniced ktask thread takes more CPU
      time than before when the helpers were starved.
      
      Running with one ktask thread, stress-ng's reported walltime goes up
      because that single thread interferes with fewer stress-ng threads,
      but with more impact, causing a greater spread in the time it takes for
      individual stress-ng threads to finish.  Averages of the per-thread
      stress-ng times from "with_renice" to "1_ktask_thr" come out roughly
      the same, though, 43.81 and 43.89 respectively.  So the total runtime
      of stress-ng summed across all threads is unaffected, but the
      wall-clock time stress-ng takes to finish all of its threads actually
      improves when the ktask_test work is spread over more threads.
      
      Case 2: Real-world CPU contention
      
       ktask_vfio - VFIO page pin a 32G kvm guest
       usemem     - faults in 86G of anonymous THP per thread, PAGE_SIZE stride;
                    used to mimic the page clearing that dominates in ktask_vfio
                    so that usemem competes for the same system resources
      
                   8_ktask_thrs           8_ktask_thrs
                   w/o_renice  (stdev)   with_renice  (stdev)  1_ktask_thr(stdev)
                   --------------------------------------------------------------
        ktask_vfio    18.59  ( 0.19)         14.62  ( 2.03)      16.24  ( 0.90)
            usemem    47.54  ( 0.89)         48.18  ( 0.77)      49.70  ( 1.20)
      
      These results are similar to case 1's, though the differences between
      the times are not as pronounced because ktask_vfio ran for a shorter
      time than usemem.
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Tested-by: Hongbo Yao <yaohongbo@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      cdc79c13