1. 27 Dec 2019, 40 commits
    • arm64/neon: Disable -Wincompatible-pointer-types when building with Clang · e9fabe0c
      Nathan Chancellor committed
      mainline inclusion
      from mainline-5.0
      commit 0738c8b5
      category: bugfix
      bugzilla: 11024
      CVE: NA
      
      -------------------------------------------------
      
      After commit cc9f8349 ("arm64: crypto: add NEON accelerated XOR
      implementation"), Clang builds for arm64 started failing with the
      following error message.
      
      arch/arm64/lib/xor-neon.c:58:28: error: incompatible pointer types
      assigning to 'const unsigned long *' from 'uint64_t *' (aka 'unsigned
      long long *') [-Werror,-Wincompatible-pointer-types]
                      v3 = veorq_u64(vld1q_u64(dp1 +  6), vld1q_u64(dp2 + 6));
                                               ^~~~~~~~
      /usr/lib/llvm-9/lib/clang/9.0.0/include/arm_neon.h:7538:47: note:
      expanded from macro 'vld1q_u64'
        __ret = (uint64x2_t) __builtin_neon_vld1q_v(__p0, 51); \
                                                    ^~~~
      
      There has been quite a bit of debate and triage that has gone into
      figuring out what the proper fix is, viewable at the link below, which
      is still ongoing. Ard suggested disabling this warning for Clang with a
      pragma so that no NEON code triggers this type of error. While this is not
      at all an ideal solution, this build error is the only thing preventing
      KernelCI from having successful arm64 defconfig and allmodconfig builds
      on linux-next. Getting continuous integration running is more important
      so new warnings/errors or boot failures can be caught and fixed quickly.
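
      As an illustration, a pragma-based workaround of this kind looks roughly
      like the snippet below; the exact file and guard used by the merged patch
      may differ, so treat this as a sketch rather than the verbatim hunk:

        /* Tell Clang not to warn about the const uint64_t * vs unsigned long *
         * mismatch inside the NEON intrinsics; GCC does not know this option. */
        #ifdef __clang__
        #pragma clang diagnostic ignored "-Wincompatible-pointer-types"
        #endif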
      
      Link: https://github.com/ClangBuiltLinux/linux/issues/283
      Suggested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      (cherry picked from commit 0738c8b5)
      Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      e9fabe0c
    • arm64: crypto: add NEON accelerated XOR implementation · 2b99d865
      Jackie Liu committed
      mainline inclusion
      from mainline-5.0-rc1
      commit: cc9f8349
      category: feature
      feature: NEON accelerated XOR
      bugzilla: 11024
      CVE: NA
      
      --------------------------------------------------
      
      This is a NEON acceleration method that can improve
      performance by approximately 20%. I got the following
      data from CentOS 7.5 on Huawei's HISI1616 chip:
      
      [ 93.837726] xor: measuring software checksum speed
      [ 93.874039]   8regs  : 7123.200 MB/sec
      [ 93.914038]   32regs : 7180.300 MB/sec
      [ 93.954043]   arm64_neon: 9856.000 MB/sec
      [ 93.954047] xor: using function: arm64_neon (9856.000 MB/sec)
      
      I believe this code can bring some optimization for
      all arm64 platforms. Thanks to Ard Biesheuvel for his suggestions.
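
      For reference, the hot loop of such a NEON XOR routine has roughly the
      following shape (a simplified two-source sketch, not the exact kernel
      code):

        #include <arm_neon.h>

        static void xor_neon_2(unsigned long bytes, unsigned long *p1,
                               const unsigned long *p2)
        {
            uint64_t *dp1 = (uint64_t *)p1;
            const uint64_t *dp2 = (const uint64_t *)p2;
            long lines = bytes / (sizeof(uint64x2_t) * 4);

            do {
                /* p1 ^= p2, 64 bytes per iteration */
                uint64x2_t v0 = veorq_u64(vld1q_u64(dp1 + 0), vld1q_u64(dp2 + 0));
                uint64x2_t v1 = veorq_u64(vld1q_u64(dp1 + 2), vld1q_u64(dp2 + 2));
                uint64x2_t v2 = veorq_u64(vld1q_u64(dp1 + 4), vld1q_u64(dp2 + 4));
                uint64x2_t v3 = veorq_u64(vld1q_u64(dp1 + 6), vld1q_u64(dp2 + 6));

                vst1q_u64(dp1 + 0, v0);
                vst1q_u64(dp1 + 2, v1);
                vst1q_u64(dp1 + 4, v2);
                vst1q_u64(dp1 + 6, v3);

                dp1 += 8;
                dp2 += 8;
            } while (--lines > 0);
        }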
      Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
      Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      2b99d865
    • arm64/neon: add workaround for ambiguous C99 stdint.h types · 28777f93
      Jackie Liu committed
      mainline inclusion
      from mainline-5.0-rc1
      commit: 21e28547
      category: feature
      feature: NEON accelerated XOR
      bugzilla: 11024
      CVE: NA
      
      --------------------------------------------------
      
      In a way similar to ARM commit 09096f6a ("ARM: 7822/1: add workaround
      for ambiguous C99 stdint.h types"), this patch redefines the macros that
      are used in stdint.h so its definitions of uint64_t and int64_t are
      compatible with those of the kernel.
      
      This patch comes from: https://patchwork.kernel.org/patch/3540001/
      Written by: Ard Biesheuvel <ard.biesheuvel@linaro.org>

      We mark this file as a private file, so we don't have to override asm/types.h.
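
      The workaround itself is a handful of preprocessor overrides applied
      before arm_neon.h pulls in stdint.h, roughly as follows (shape taken from
      the ARM commit cited above; treat it as a sketch):

        /* In the kernel, u64/s64 are [un]signed long long, not [un]signed long.
         * Redefine the compiler-provided macros so the uint64_t/int64_t used by
         * arm_neon.h match the kernel's types. */
        #ifdef __INT64_TYPE__
        #undef __INT64_TYPE__
        #define __INT64_TYPE__          __signed__ long long
        #endif

        #ifdef __UINT64_TYPE__
        #undef __UINT64_TYPE__
        #define __UINT64_TYPE__         unsigned long long
        #endif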
      Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
      Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      28777f93
    • arm64/lib: improve CRC32 performance for deep pipelines · fce68364
      Ard Biesheuvel committed
      mainline inclusion
      from mainline-5.0
      commit: efdb25efc7645b326cd5eb82be5feeabe167c24e
      category: perf
      bugzilla: 20886
      CVE: NA
      
      lib/crc32test result:
      
      [root@localhost build]# rmmod crc32test && insmod lib/crc32test.ko &&
      dmesg | grep cycles
      [83170.153209] CPU7: use cycles 26243990
      [83183.122137] CPU7: use cycles 26151290
      [83309.691628] CPU7: use cycles 26122830
      [83312.415559] CPU7: use cycles 26232600
      [83313.191479] CPU8: use cycles 26082350
      
      rmmod crc32test && insmod lib/crc32test.ko && dmesg | grep cycles
      [ 1023.539931] CPU25: use cycles 12256730
      [ 1024.850360] CPU24: use cycles 12249680
      [ 1025.463622] CPU25: use cycles 12253330
      [ 1025.862925] CPU25: use cycles 12269720
      [ 1026.376038] CPU26: use cycles 12222480
      
      Based on 13702:
      arm64/lib: improve CRC32 performance for deep pipelines
      crypto: arm64/crc32 - remove PMULL based CRC32 driver
      arm64/lib: add accelerated crc32 routines
      arm64: cpufeature: add feature for CRC32 instructions
      lib/crc32: make core crc32() routines weak so they can be overridden
      
      ----------------------------------------------
      
      Improve the performance of the crc32() asm routines by getting rid of
      most of the branches and small sized loads on the common path.
      
      Instead, use a branchless code path involving overlapping 16 byte
      loads to process the first (length % 32) bytes, and process the
      remainder using a loop that processes 32 bytes at a time.
      
      Tested using the following test program:
      
        #include <stdlib.h>
      
        extern void crc32_le(unsigned short, char const*, int);
      
        int main(void)
        {
          static const char buf[4096];
      
          srand(20181126);
      
          for (int i = 0; i < 100 * 1000 * 1000; i++)
            crc32_le(0, buf, rand() % 1024);
      
          return 0;
        }
      
      On Cortex-A53 and Cortex-A57, the performance regresses but only very
      slightly. On Cortex-A72 however, the performance improves from
      
        $ time ./crc32
      
        real  0m10.149s
        user  0m10.149s
        sys   0m0.000s
      
      to
      
        $ time ./crc32
      
        real  0m7.915s
        user  0m7.915s
        sys   0m0.000s
      
      Cc: Rui Sun <sunrui26@huawei.com>
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      fce68364
    • lib/crc32.c: mark crc32_le_base/__crc32c_le_base aliases as __pure · 0066ba62
      Miguel Ojeda committed
      mainline inclusion
      from mainline-5.0
      commit ff98e20e
      category: bugfix
      bugzilla: 13702
      CVE: NA
      
      -------------------------------------------------
      
      The upcoming GCC 9 release extends the -Wmissing-attributes warnings
      (enabled by -Wall) to C and aliases: it warns when particular function
      attributes are missing in the aliases but not in their target.
      
      In particular, it triggers here because crc32_le_base/__crc32c_le_base
      aren't __pure while their target crc32_le/__crc32c_le are.
      
      These aliases are used by architectures as a fallback in accelerated
      versions of CRC32. See commit 9784d82d ("lib/crc32: make core crc32()
      routines weak so they can be overridden").
      
      Therefore, being fallbacks, it is likely that even if the aliases
      were called from C, there wouldn't be any optimizations possible.
      Currently, the only user is arm64, which calls this from asm.
      
      Still, marking the aliases as __pure makes sense and is a good idea
      for documentation purposes and possible future optimizations,
      which also silences the warning.
      Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Tested-by: Laura Abbott <labbott@redhat.com>
      Signed-off-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
      (cherry picked from commit ff98e20e)
      Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      0066ba62
    • crypto: arm64/crc32 - remove PMULL based CRC32 driver · 0e584ca5
      Ard Biesheuvel committed
      mainline inclusion
      from mainline-4.20-rc1
      commit: 598b7d41
      category: feature
      feature: accelerated crc32 routines
      bugzilla: 13702
      CVE: NA
      
      --------------------------------------------------
      
      Now that the scalar fallbacks have been moved out of this driver into
      the core crc32()/crc32c() routines, we are left with a CRC32 crypto API
      driver for arm64 that is based only on 64x64 polynomial multiplication,
      which is an optional instruction in the ARMv8 architecture, and is less
      and less likely to be available on cores that do not also implement the
      CRC32 instructions, given that those are mandatory in the architecture
      as of ARMv8.1.
      
      Since the scalar instructions do not require the special handling that
      SIMD instructions do, and since they turn out to be considerably faster
      on some cores (Cortex-A53) as well, there is really no point in keeping
      this code around so let's just remove it.
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      0e584ca5
    • arm64/lib: add accelerated crc32 routines · 11fa09f2
      Ard Biesheuvel committed
      mainline inclusion
      from mainline-4.20-rc1
      commit: 7481cddf29ede204b475facc40e6f65459939881
      category: feature
      feature: accelerated crc32 routines
      bugzilla: 13702
      CVE: NA
      
      --------------------------------------------------
      
      Unlike crc32c(), which is wired up to the crypto API internally so the
      optimal driver is selected based on the platform's capabilities,
      crc32_le() is implemented as a library function using a slice-by-8 table
      based C implementation. Even though few of the call sites may be
      bottlenecks, calling a time variant implementation with a non-negligible
      D-cache footprint is a bit of a waste, given that ARMv8.1 and up mandates
      support for the CRC32 instructions that were optional in ARMv8.0, but are
      already widely available, even on the Cortex-A53 based Raspberry Pi.
      
      So implement routines that use these instructions if available, and fall
      back to the existing generic routines otherwise. The selection is based
      on alternatives patching.
      
      Note that this unconditionally selects CONFIG_CRC32 as a builtin. Since
      CRC32 is relied upon by core functionality such as CONFIG_OF_FLATTREE,
      this just codifies the status quo.
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      11fa09f2
    • arm64: cpufeature: add feature for CRC32 instructions · 3a8984a9
      Ard Biesheuvel committed
      mainline inclusion
      from mainline-4.20-rc1
      commit: 86d0dd34eafffbc76a81aba6ae2d71927d3835a8
      category: feature
      feature: accelerated crc32 routines
      bugzilla: 13702
      CVE: NA
      
      --------------------------------------------------
      
      Add a CRC32 feature bit and wire it up to the CPU id register so we
      will be able to use alternatives patching for CRC32 operations.
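
      The wiring boils down to one more entry in the arm64 cpufeature table,
      roughly of the following shape (field values assumed from the ID register
      layout; not the verbatim hunk):

        {
            .desc = "CRC32 instructions",
            .capability = ARM64_HAS_CRC32,
            .type = ARM64_CPUCAP_SYSTEM_FEATURE,
            .matches = has_cpuid_feature,
            .sys_reg = SYS_ID_AA64ISAR0_EL1,
            .field_pos = ID_AA64ISAR0_CRC32_SHIFT,
            .min_field_value = 1,
        },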
      Acked-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

      Conflicts:
      	arch/arm64/include/asm/cpucaps.h
      	arch/arm64/kernel/cpufeature.c
      Signed-off-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      3a8984a9
    • lib/crc32: make core crc32() routines weak so they can be overridden · a216345b
      Ard Biesheuvel committed
      mainline inclusion
      from mainline-4.20-rc1
      commit: 9784d82d
      category: feature
      feature: accelerated crc32 routines
      bugzilla: 13702
      CVE: NA
      
      --------------------------------------------------
      
      Allow architectures to drop in accelerated CRC32 routines by making
      the crc32_le/__crc32c_le entry points weak, and exposing non-weak
      aliases for them that may be used by the accelerated versions as
      fallbacks in case the instructions they rely upon are not available.
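
      In lib/crc32.c this amounts to a pattern of roughly the following shape
      (a sketch; the real definitions also cover the big-endian and crc32c
      variants):

        u32 __pure __weak crc32_le(u32 crc, unsigned char const *p, size_t len)
        {
            return crc32_le_generic(crc, p, len,
                                    (const u32 (*)[256])crc32table_le, CRC32_POLY_LE);
        }

        /* non-weak alias, usable as a fallback by accelerated overrides */
        u32 __pure crc32_le_base(u32, unsigned char const *, size_t)
            __alias(crc32_le);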
      Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      a216345b
    • pci: do not save 'PCI_BRIDGE_CTL_BUS_RESET' · 36b45f28
      Xiongfeng Wang committed
      hulk inclusion
      category: bugfix
      bugzilla: 20702
      CVE: NA
      ---------------------------
      
      When I injected a PCIe fatal error into a Mellanox netdevice, 'dmesg'
      showed the device was recovered successfully, but 'lspci' didn't show the
      device. I checked the configuration space of the slot where the
      netdevice is inserted and found out the bit 'PCI_BRIDGE_CTL_BUS_RESET'
      is set. Later, I found out it is because this bit is saved in
      'saved_config_space' of 'struct pci_dev' when 'pci_pm_runtime_suspend()'
      is called. And 'PCI_BRIDGE_CTL_BUS_RESET' is set every time we restore
      the configuration space.
      
      This patch avoids saving the bit 'PCI_BRIDGE_CTL_BUS_RESET' when we save
      the configuration space of a bridge.
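
      A minimal sketch of that idea (illustrative only; the helper name and the
      exact hunk in this patch are assumptions):

        /* When snapshotting a bridge's config space, drop the secondary bus
         * reset bit so a later restore cannot leave the bus held in reset. */
        static void pci_sanitise_saved_bridge_ctl(struct pci_dev *bridge)
        {
            int idx = PCI_BRIDGE_CONTROL / 4;          /* dword index: 0x3e / 4 = 15 */
            int shift = (PCI_BRIDGE_CONTROL % 4) * 8;  /* bits 16..31 of that dword */

            bridge->saved_config_space[idx] &=
                    ~((u32)PCI_BRIDGE_CTL_BUS_RESET << shift);
        }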
      Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      36b45f28
    • config: enable CONFIG_KTASK in hulk_defconfig and storage_ci_defconfig · 58622cdd
      Hongbo Yao committed
      hulk inclusion
      category: feature
      bugzilla: 13228
      CVE: NA
      ---------------------------
      
      enable CONFIG_KTASK in hulk_defconfig and storage_ci_defconfig
      Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      58622cdd
    • ktask: change the chunk size for some ktask thread functions · 135e9232
      Hongbo Yao committed
      hulk inclusion
      category: feature
      bugzilla: 13228
      CVE: NA
      
      ---------------------------
      
      This patch fixes some issues in the original series:
      1) PMD_SIZE chunks have made thread finishing times too spread out
      in some cases, so KTASK_MEM_CHUNK (128M) seems to be a reasonable compromise.
      2) If hugepagesz=1G, then pages_per_huge_page = 1G / 4K = 256, and
      using KTASK_MEM_CHUNK would cause the ktask thread count to be 1, which
      would not improve the performance of clearing gigantic pages.
      Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Tested-by: Hongbo Yao <yaohongbo@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      135e9232
    • hugetlbfs: parallelize hugetlbfs_fallocate with ktask · 4733c59f
      Daniel Jordan committed
      hulk inclusion
      category: feature
      bugzilla: 13228
      CVE: NA
      ---------------------------
      
      hugetlbfs_fallocate preallocates huge pages to back a file in a
      hugetlbfs filesystem.  The time to call this function grows linearly
      with size.
      
      ktask performs well with its default thread count of 4; higher thread
      counts are given for context only.
      
      Machine: Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz, 288 CPUs, 1T memory
      Test:    fallocate(1) a file on a hugetlbfs filesystem
      
      nthread   speedup   size (GiB)   min time (s)   stdev
            1                    200         127.53    2.19
            2     3.09x          200          41.30    2.11
            4     5.72x          200          22.29    0.51
            8     9.45x          200          13.50    2.58
           16     9.74x          200          13.09    1.64
      
            1                    400         193.09    2.47
            2     2.14x          400          90.31    3.39
            4     3.84x          400          50.32    0.44
            8     5.11x          400          37.75    1.23
           16     6.12x          400          31.54    3.13
      
      The primary bottleneck for better scaling at higher thread counts is
      hugetlb_fault_mutex_table[hash].  perf showed L1-dcache-loads increase
      with 8 threads and again sharply with 16 threads, and a CPU counter
      profile showed that 31% of the L1d misses were on
      hugetlb_fault_mutex_table[hash] in the 16-thread case.
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Tested-by: Hongbo Yao <yaohongbo@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      4733c59f
    • mm: parallelize clear_gigantic_page · ae0cd4d4
      Daniel Jordan committed
      hulk inclusion
      category: feature
      bugzilla: 13228
      CVE: NA
      ---------------------------
      
      Parallelize clear_gigantic_page, which zeroes any page size larger than
      8M (e.g. 1G on x86).
      
      Performance results (the default number of threads is 4; higher thread
      counts shown for context only):
      
      Machine:  Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz, 288 CPUs, 1T memory
      Test:     Clear a range of gigantic pages (triggered via fallocate)
      
      nthread   speedup   size (GiB)   min time (s)   stdev
            1                    100          41.13    0.03
            2     2.03x          100          20.26    0.14
            4     4.28x          100           9.62    0.09
            8     8.39x          100           4.90    0.05
           16    10.44x          100           3.94    0.03
      
            1                    200          89.68    0.35
            2     2.21x          200          40.64    0.18
            4     4.64x          200          19.33    0.32
            8     8.99x          200           9.98    0.04
           16    11.27x          200           7.96    0.04
      
            1                    400         188.20    1.57
            2     2.30x          400          81.84    0.09
            4     4.63x          400          40.62    0.26
            8     8.92x          400          21.09    0.50
           16    11.78x          400          15.97    0.25
      
            1                    800         434.91    1.81
            2     2.54x          800         170.97    1.46
            4     4.98x          800          87.38    1.91
            8    10.15x          800          42.86    2.59
           16    12.99x          800          33.48    0.83
      
      The speedups are mostly due to the fact that more threads can use more
      memory bandwidth.  The loop we're stressing on the x86 chip in this test
      is clear_page_erms, which tops out at a bandwidth of 2550 MiB/s with one
      thread.  We get the same bandwidth per thread for 2, 4, or 8 threads,
      but at 16 threads the per-thread bandwidth drops to 1420 MiB/s.
      
      However, the performance also improves over a single thread because of
      the ktask threads' NUMA awareness (ktask migrates worker threads to the
      node local to the work being done).  This becomes a bigger factor as the
      amount of pages to zero grows to include memory from multiple nodes, so
      that speedups increase as the size increases.
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Tested-by: Hongbo Yao <yaohongbo@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      ae0cd4d4
    • mm: parallelize deferred struct page initialization within each node · eb761d65
      Daniel Jordan committed
      hulk inclusion
      category: feature
      bugzilla: 13228
      CVE: NA
      ---------------------------
      
      Deferred struct page initialization currently runs one thread per node,
      but this is a bottleneck during boot on big machines, so use ktask
      within each pgdatinit thread to parallelize the struct page
      initialization, allowing the system to take better advantage of its
      memory bandwidth.
      
      Because the system is not fully up yet and most CPUs are idle, use more
      than the default maximum number of ktask threads.  The kernel doesn't
      know the memory bandwidth of a given system to get the most efficient
      number of threads, so there's some guesswork involved.  In testing, a
      reasonable value turned out to be about a quarter of the CPUs on the
      node.
      
      __free_pages_core used to increase the zone's managed page count by the
      number of pages being freed.  To accommodate multiple threads, however,
      account the number of freed pages with an atomic shared across the ktask
      threads and bump the managed page count with it after ktask is finished.
      
      Test:    Boot the machine with deferred struct page init three times
      
      Machine: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz, 88 CPUs, 503G memory,
               2 sockets
      
      kernel                   speedup   max time per   stdev
                                         node (ms)
      
      baseline (4.15-rc2)                        5860     8.6
      ktask                      9.56x            613    12.4
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Tested-by: Hongbo Yao <yaohongbo@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      eb761d65
    • mm: enlarge type of offset argument in mem_map_offset and mem_map_next · 228e4183
      Daniel Jordan committed
      hulk inclusion
      category: feature
      bugzilla: 13228
      CVE: NA
      ---------------------------
      
      Changes the type of 'offset' from int to unsigned long in both
      mem_map_offset and mem_map_next.
      
      This facilitates ktask's use of mem_map_next with its unsigned long
      types to avoid silent truncation when these unsigned longs are passed as
      ints.
      
      It also fixes the preexisting truncation of 'offset' from unsigned long
      to int by the sole caller of mem_map_offset, follow_hugetlb_page.
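
      The change amounts to widening the offset parameter of the two helpers,
      roughly as follows (a sketch of the new prototypes only, not the full
      bodies):

        static inline struct page *mem_map_offset(struct page *base,
                                                  unsigned long offset);
        static inline struct page *mem_map_next(struct page *iter,
                                                struct page *base,
                                                unsigned long offset);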
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Tested-by: Hongbo Yao <yaohongbo@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      228e4183
    • vfio: relieve mmap_sem reader cacheline bouncing by holding it longer · 22705b26
      Daniel Jordan committed
      hulk inclusion
      category: feature
      bugzilla: 13228
      CVE: NA
      ---------------------------
      
      Profiling shows significant time being spent on atomic ops in mmap_sem
      reader acquisition.  mmap_sem is taken and dropped for every single base
      page during pinning, so this is not surprising.
      
      Reduce the number of times mmap_sem is taken by holding for longer,
      which relieves atomic cacheline bouncing.
      
      Results for all VFIO page pinning patches
      -----------------------------------------
      
      The test measures the time from qemu invocation to the start of guest
      boot.  The guest uses kvm with 320G memory backed with THP.  320G fits
      in a node on the test machine used here, so there was no thrashing in
      reclaim because of __GFP_THISNODE in THP allocations[1].
      
      CPU:              2 nodes * 24 cores/node * 2 threads/core = 96 CPUs
                        Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz
      memory:           754G split evenly between nodes
      scaling_governor: performance
      
           patch 6                  patch 8              patch 9 (this one)
           -----------------------  ----------------- ---------------------
      thr  speedup average sec  speedup   average sec  speedup   average sec
        1          65.0(± 0.6%)           65.2(± 0.5%)           65.5(± 0.4%)
        2   1.5x   42.8(± 5.8%)   1.8x    36.0(± 0.9%)    1.9x   34.4(± 0.3%)
        3   1.9x   35.0(±11.3%)   2.5x    26.4(± 4.2%)    2.8x   23.7(± 0.2%)
        4   2.3x   28.5(± 1.3%)   3.1x    21.2(± 2.8%)    3.6x   18.3(± 0.3%)
        5   2.5x   26.2(± 1.5%)   3.6x    17.9(± 0.9%)    4.3x   15.1(± 0.3%)
        6   2.7x   24.5(± 1.8%)   4.0x    16.5(± 3.0%)    5.1x   12.9(± 0.1%)
        7   2.8x   23.5(± 4.9%)   4.2x    15.4(± 2.7%)    5.7x   11.5(± 0.6%)
        8   2.8x   22.8(± 1.8%)   4.2x    15.5(± 4.7%)    6.4x   10.3(± 0.8%)
       12   3.2x   20.2(± 1.4%)   4.4x    14.7(± 2.9%)    8.6x    7.6(± 0.6%)
       16   3.3x   20.0(± 0.7%)   4.3x    15.4(± 1.3%)   10.2x    6.4(± 0.6%)
      
      At patch 6, lock_stat showed long reader wait time on mmap_sem writers,
      leading to patch 8.
      
      At patch 8, profiling revealed the issue with mmap_sem described above.
      
      Across all three patches, performance consistently improves as the
      thread count increases.  The one exception is the antiscaling with
      nthr=16 in patch 8: those mmap_sem atomics are really bouncing around
      the machine.
      
      The performance with patch 9 looks pretty good overall.  I'm working on
      finding the next bottleneck, and this is where it stopped:  When
      nthr=16, the obvious issue profiling showed was contention on the split
      PMD page table lock when pages are faulted in during the pinning (>2% of
      the time).
      
      A split PMD lock protects a PUD_SIZE-ed amount of page table mappings
      (1G on x86), so if threads were operating on smaller chunks and
      contending in the same PUD_SIZE range, this could be the source of
      contention.  However, when nthr=16, threads operate on 5G chunks (320G /
      16 threads / (1<<KTASK_LOAD_BAL_SHIFT)), so this wasn't the cause, and
      aligning the chunks on PUD_SIZE boundaries didn't help either.
      
      The time is short (6.4 seconds), so the next theory was threads
      finishing at different times, but probes showed the threads all returned
      within less than a millisecond of each other.
      
      Kernel probes turned up a few smaller VFIO page pin calls besides the
      heavy 320G call.  The chunk size given (PMD_SIZE) could affect thread
      count and chunk size for these, so chunk size was increased from 2M to
      1G.  This caused the split PMD contention to disappear, but with little
      change in the runtime.  More digging required.
      
      [1] lkml.kernel.org/r/20180925120326.24392-1-mhocko@kernel.org
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Tested-by: Hongbo Yao <yaohongbo@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      22705b26
    • vfio: remove unnecessary mmap_sem writer acquisition around locked_vm · 68a4b481
      Daniel Jordan committed
      hulk inclusion
      category: feature
      bugzilla: 13228
      CVE: NA
      ---------------------------
      
      Now that mmap_sem is no longer required for modifying locked_vm, remove
      it in the VFIO code.
      
      [XXX Can be sent separately, along with similar conversions in the other
      places mmap_sem was taken for locked_vm.  While at it, could make
      similar changes to pinned_vm.]
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Tested-by: Hongbo Yao <yaohongbo@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      68a4b481
    • mm: change locked_vm's type from unsigned long to atomic_long_t · 53f4e528
      Daniel Jordan committed
      hulk inclusion
      category: feature
      bugzilla: 13228
      CVE: NA
      ---------------------------
      
      Currently, mmap_sem must be held as writer to modify the locked_vm field
      in mm_struct.
      
      This creates a bottleneck when multithreading VFIO page pinning because
      each thread holds the mmap_sem as reader for the majority of the pinning
      time but also takes mmap_sem as writer regularly, for short times, when
      modifying locked_vm.
      
      The problem gets worse when other workloads compete for CPU with ktask
      threads doing page pinning because the other workloads force ktask
      threads that hold mmap_sem as writer off the CPU, blocking ktask threads
      trying to get mmap_sem as reader for an excessively long time (the
      mmap_sem reader wait time grows linearly with the thread count).
      
      Requiring mmap_sem for locked_vm also abuses mmap_sem by making it
      protect data that could be synchronized separately.
      
      So, decouple locked_vm from mmap_sem by making locked_vm an
      atomic_long_t.  locked_vm's old type was unsigned long and changing it
      to a signed type makes it lose half its capacity, but that's only a
      concern for 32-bit systems and LONG_MAX * PAGE_SIZE is 8T on x86 in that
      case, so there's headroom.
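
      The conversion follows a simple before/after pattern at call sites (an
      illustrative sketch, not a specific hunk from this series):

        /* before: plain unsigned long, writers must hold mmap_sem for write */
        down_write(&mm->mmap_sem);
        mm->locked_vm += npages;
        up_write(&mm->mmap_sem);

        /* after: atomic_long_t, updated without taking mmap_sem as writer */
        atomic_long_add(npages, &mm->locked_vm);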
      
      Now that mmap_sem is not taken as writer here, ktask threads holding
      mmap_sem as reader can run more often.  Performance results appear later
      in the series.
      
      On powerpc, this was cross-compiled-tested only.
      
      [XXX Can send separately.]
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Tested-by: Hongbo Yao <yaohongbo@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      53f4e528
    • vfio: parallelize vfio_pin_map_dma · b0908eee
      Daniel Jordan committed
      hulk inclusion
      category: feature
      bugzilla: 13228
      CVE: NA
      ---------------------------
      
      When starting a large-memory kvm guest, it takes an excessively long
      time to start the boot process because qemu must pin all guest pages to
      accommodate DMA when VFIO is in use.  Currently just one CPU is
      responsible for the page pinning, which usually boils down to page
      clearing time-wise, so the ways to optimize this are buying a faster
      CPU ;-) or using more of the CPUs you already have.
      
      Parallelize with ktask.  Refactor so workqueue workers pin with the mm
      of the calling thread, and to enable an undo callback for ktask to
      handle errors during page pinning.
      
      Performance results appear later in the series.
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Tested-by: Hongbo Yao <yaohongbo@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      b0908eee
    • workqueue, ktask: renice helper threads to prevent starvation · cdc79c13
      Daniel Jordan committed
      hulk inclusion
      category: feature
      bugzilla: 13228
      CVE: NA
      ---------------------------
      
      With ktask helper threads running at MAX_NICE, it's possible for one or
      more of them to begin chunks of the task and then have their CPU time
      constrained by higher priority threads.  The main ktask thread, running
      at normal priority, may finish all available chunks of the task and then
      wait on the MAX_NICE helpers to finish the last in-progress chunks, for
      longer than it would have if no helpers were used.
      
      Avoid this by having the main thread assign its priority to each
      unfinished helper one at a time so that on a heavily loaded system,
      exactly one thread in a given ktask call is running at the main thread's
      priority.  At least one thread to ensure forward progress, and at most
      one thread to limit excessive multithreading.
      
      Since the workqueue interface, on which ktask is built, does not provide
      access to worker threads, ktask can't adjust their priorities directly,
      so add a new interface to allow a previously-queued work item to run at
      a different priority than the one controlled by the corresponding
      workqueue's 'nice' attribute.  The worker assigned to the work item will
      run the work at the given priority, temporarily overriding the worker's
      priority.
      
      The interface is flush_work_at_nice, which ensures the given work item's
      assigned worker runs at the specified nice level and waits for the work
      item to finish.
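
      In other words, the interface boils down to something like the following
      (the exact signature is an assumption based on the description above,
      mirroring flush_work(); the field name in the usage line is hypothetical):

        /* Make the worker assigned to @work run at @nice, then wait for it. */
        bool flush_work_at_nice(struct work_struct *work, long nice);

        /* main ktask thread, lending its priority to one helper at a time */
        flush_work_at_nice(&kw->kw_work, task_nice(current));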
      
      An alternative choice would have been to simply requeue the work item to
      a pool with workers of the new priority, but this doesn't seem feasible
      because a worker may have already started executing the work and there's
      currently no way to interrupt it midway through.  The proposed interface
      solves this issue because a worker's priority can be adjusted while it's
      executing the work.
      
      TODO:  flush_work_at_nice is a proof-of-concept only, and it may be
      desired to have the interface set the work's nice without also waiting
      for it to finish.  It's implemented in the flush path for this RFC
      because it was fairly simple to write ;-)
      
      I ran tests similar to the ones in the last patch with a couple of
      differences:
       - The non-ktask workload uses 8 CPUs instead of 7 to compete with the
         main ktask thread as well as the ktask helpers, so that when the main
         thread finishes, its CPU is completely occupied by the non-ktask
         workload, meaning MAX_NICE helpers can't run as often.
       - The non-ktask workload starts before the ktask workload, rather
         than after, to maximize the chance that it starves helpers.
      
      Runtimes in seconds.
      
      Case 1: Synthetic, worst-case CPU contention
      
       ktask_test - a tight loop doing integer multiplication to max out on CPU;
                    used for testing only, does not appear in this series
       stress-ng  - cpu stressor ("-c --cpu-method ackerman --cpu-ops 1200");
      
                   8_ktask_thrs           8_ktask_thrs
                   w/o_renice(stdev)   with_renice  (stdev)  1_ktask_thr(stdev)
                   ------------------------------------------------------------
        ktask_test    41.98  ( 0.22)         25.15  ( 2.98)      30.40  ( 0.61)
        stress-ng     44.79  ( 1.11)         46.37  ( 0.69)      53.29  ( 1.91)
      
      Without renicing, ktask_test finishes just after stress-ng does because
      stress-ng needs to free up CPUs for the helpers to finish (ktask_test
      shows a shorter runtime than stress-ng because ktask_test was started
      later).  Renicing lets ktask_test finish 40% sooner, and running the
      same amount of work in ktask_test with 1 thread instead of 8 finishes in
      a comparable amount of time, though longer than "with_renice" because
      MAX_NICE threads still get some CPU time, and the effect over 8 threads
      adds up.
      
      stress-ng's total runtime gets a little longer going from no renicing to
      renicing, as expected, because each reniced ktask thread takes more CPU
      time than before when the helpers were starved.
      
      Running with one ktask thread, stress-ng's reported walltime goes up
      because that single thread interferes with fewer stress-ng threads,
      but with more impact, causing a greater spread in the time it takes for
      individual stress-ng threads to finish.  Averages of the per-thread
      stress-ng times from "with_renice" to "1_ktask_thr" come out roughly
      the same, though, 43.81 and 43.89 respectively.  So the total runtime of
      stress-ng across all threads is unaffected, but the time stress-ng takes
      to finish running its threads completely actually improves by spreading
      the ktask_test work over more threads.
      
      Case 2: Real-world CPU contention
      
       ktask_vfio - VFIO page pin a 32G kvm guest
       usemem     - faults in 86G of anonymous THP per thread, PAGE_SIZE stride;
                    used to mimic the page clearing that dominates in ktask_vfio
                    so that usemem competes for the same system resources
      
                   8_ktask_thrs           8_ktask_thrs
                   w/o_renice  (stdev)   with_renice  (stdev)  1_ktask_thr(stdev)
                   --------------------------------------------------------------
        ktask_vfio    18.59  ( 0.19)         14.62  ( 2.03)      16.24  ( 0.90)
            usemem    47.54  ( 0.89)         48.18  ( 0.77)      49.70  ( 1.20)
      
      These results are similar to case 1's, though the differences between
      times are not quite as pronounced because ktask_vfio ran shorter
      compared to usemem.
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Tested-by: Hongbo Yao <yaohongbo@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      cdc79c13
    • ktask: run helper threads at MAX_NICE · b9ca5261
      Daniel Jordan committed
      hulk inclusion
      category: feature
      bugzilla: 13228
      CVE: NA
      ---------------------------
      
      Multithreading may speed long-running kernel tasks, but overly
      optimistic parallelization can go wrong if too many helper threads are
      started on an already-busy system.  Such helpers can degrade the
      performance of other tasks, so they should be sensitive to current CPU
      utilization[1].
      
      To achieve this, run helpers at MAX_NICE so that their CPU time is
      proportional to idle CPU time.  The main thread that called into ktask
      naturally runs at its original priority so that it can make progress on
      a heavily loaded system, as it would if ktask were not in the picture.
      
      I tested two different cases in which a non-ktask and a ktask workload
      compete for the same CPUs with the goal of showing that normal priority
      (i.e. nice=0) ktask helpers cause the non-ktask workload to run more
      slowly, whereas MAX_NICE ktask helpers don't.
      
      Testing notes:
        - Each case was run using 8 CPUs on a large two-socket server, with a
          cpumask allowing all test threads to run anywhere within the 8.
        - The non-ktask workload used 7 threads and the ktask workload used 8
          threads to evaluate how much ktask helpers, rather than the main ktask
          thread, disturbed the non-ktask workload.
        - The non-ktask workload was started after the ktask workload and run
          for less time to maximize the chances that the non-ktask workload would
          be disturbed.
        - Runtimes in seconds.
      
      Case 1: Synthetic, worst-case CPU contention
      
          ktask_test - a tight loop doing integer multiplication to max out on
      		 CPU; used for testing only, does not appear in this series
          stress-ng  - cpu stressor ("-c --cpu-method ackerman --cpu-ops 1200");
      
                       stress-ng
                       alone  (stdev)   max_nice  (stdev)   normal_prio  (stdev)
                       ---------------------------------------------------------
          ktask_test                       96.87  ( 1.09)         90.81  ( 0.29)
          stress-ng    43.04  ( 0.00)      43.58  ( 0.01)         75.86  ( 0.39)
      
      This case shows MAX_NICE helpers make a significant difference compared
      to normal priority helpers, with stress-ng taking 76% longer to finish
      when competing with normal priority ktask threads than when run by
      itself, but only 1% longer when run with MAX_NICE helpers.  The 1% comes
      from the small amount of CPU time MAX_NICE threads are given despite
      their low priority.
      
      Case 2: Real-world CPU contention
      
          ktask_vfio - VFIO page pin a 175G kvm guest
          usemem     - faults in 25G of anonymous THP per thread, PAGE_SIZE
      		 stride; used to mimic the page clearing that dominates in
      		 ktask_vfio so that usemem competes for the same system
      		 resources
      
                       usemem
                       alone  (stdev)   max_nice  (stdev)   normal_prio  (stdev)
                      ---------------------------------------------------------
          ktask_vfio                      14.74  ( 0.04)          9.93  ( 0.09)
              usemem  10.45  ( 0.04)      10.75  ( 0.04)         14.14  ( 0.07)
      
      In the more realistic case 2, the effect is similar although not as
      pronounced.  The usemem threads take 35% longer to finish with normal
      priority ktask threads than when run alone, but only 3% longer when
      MAX_NICE is used.
      
      All ktask users outside of VFIO boil down to page clearing, so I imagine
      the results would be similar for them.
      
      [1] lkml.kernel.org/r/20171206143509.GG7515@dhcp22.suse.cz
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Tested-by: Hongbo Yao <yaohongbo@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      b9ca5261
    • ktask: add undo support · 69cc5b58
      Daniel Jordan committed
      hulk inclusion
      category: feature
      bugzilla: 13228
      CVE: NA
      ---------------------------
      
      Tasks can fail midway through their work.  To recover, the finished
      chunks of work need to be undone in a task-specific way.
      
      Allow ktask clients to pass an "undo" callback that is responsible for
      undoing one chunk of work.  To avoid multiple levels of error handling,
      do not allow the callback to fail.  For simplicity and because it's a
      slow path, undoing is not multithreaded.
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Tested-by: Hongbo Yao <yaohongbo@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      69cc5b58
    • ktask: multithread CPU-intensive kernel work · c48676ef
      Daniel Jordan committed
      hulk inclusion
      category: feature
      bugzilla: 13228
      CVE: NA
      ---------------------------
      
      A single CPU can spend an excessive amount of time in the kernel
      operating on large amounts of data.  Often these situations arise
      during initialization- and destruction-related tasks, where the data
      involved scales with system size. These long-running jobs can slow
      startup and shutdown of applications and the system itself while extra
      CPUs sit idle.
      
      To ensure that applications and the kernel continue to perform well as
      core counts and memory sizes increase, harness these idle CPUs to
      complete such jobs more quickly.
      
      ktask is a generic framework for parallelizing CPU-intensive work in the
      kernel.  The API is generic enough to add concurrency to many different
      kinds of tasks--for example, zeroing a range of pages or evicting a list
      of inodes--and aims to save its clients the trouble of splitting up the
      work, choosing the number of threads to use, maintaining an efficient
      concurrency level, starting these threads, and load balancing the work
      between them.
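
      For orientation, a client of the framework ends up looking roughly like
      the sketch below (approximated from the ktask documentation added later
      in this series; exact names and signatures may differ):

        /* thread function: process one chunk [start, end) of the work */
        static int clear_chunk(void *start, void *end, void *arg)
        {
            /* ... do one chunk's worth of work ... */
            return KTASK_RETURN_SUCCESS;
        }

        DEFINE_KTASK_CTL(ctl, clear_chunk, NULL, KTASK_MEM_CHUNK);
        int error = ktask_run(work_start, work_count, &ctl);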
      
      The Documentation patch earlier in this series, from which the above was
      swiped, has more background.
      
      Inspired by work from Pavel Tatashin, Steve Sistare, and Jonathan Adams.
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Suggested-by: Pavel Tatashin <Pavel.Tatashin@microsoft.com>
      Suggested-by: Steve Sistare <steven.sistare@oracle.com>
      Suggested-by: Jonathan Adams <jwadams@google.com>
      Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Tested-by: Hongbo Yao <yaohongbo@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      c48676ef
    • ktask: add documentation · 8c9ab6a2
      Daniel Jordan committed
      hulk inclusion
      category: feature
      bugzilla: 13228
      CVE: NA
      ---------------------------
      
      Motivates and explains the ktask API for kernel clients.
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Tested-by: Hongbo Yao <yaohongbo@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      8c9ab6a2
    • ext4: fix suspicious RCU usage warning in ext4_release_system_zone · 36f2d824
      zhangyi (F) committed
      hulk inclusion
      category: bugfix
      bugzilla: 18685
      CVE: NA
      
      -----------------------------
      
      The rcu_dereference() should be used under rcu_read_lock(), or else it
      will complain that it may be a suspicious RCU usage.
      
       WARNING: suspicious RCU usage
       [...]
       -----------------------------
       fs/ext4/block_validity.c:331 suspicious rcu_dereference_check() usage!
       [...]
      
      Because ext4_release_system_zone() always runs under the protection of
      sb->s_umount, the proper fix is to switch to
      rcu_dereference_protected() instead.
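
      The general rule being applied here (an illustrative pattern, not the
      ext4 hunk itself):

        /* reader side: must sit inside an RCU read-side critical section */
        rcu_read_lock();
        p = rcu_dereference(gp);
        /* ... use p ... */
        rcu_read_unlock();

        /* teardown/update side already serialized by an external lock:
         * document that instead of taking the RCU read lock */
        p = rcu_dereference_protected(gp, lockdep_is_held(&gp_lock));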
      
      Fixes: fb9fd3ade129be ("ext4: fix potential use after free in system zone via remount with noblock_validity")
      Reviewed-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com>
      Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Yi Zhang <yi.zhang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      36f2d824
    • ext4: fix integer overflow when calculating commit interval · 3e3e50fe
      zhangyi (F) committed
      hulk inclusion
      category: bugfix
      bugzilla: 16625
      CVE: NA
      ---------------------------
      
      If the user specifies a large enough value for the "commit=" option, it may
      trigger a signed integer overflow, which may cause sbi->s_commit_interval to
      become an unexpected large or small value, zero in particular.
      
      UBSAN: Undefined behaviour in ../fs/ext4/super.c:1592:31
      signed integer overflow:
      536870912 * 1000 cannot be represented in type 'int'
      [...]
      Call trace:
      [...]
      [<ffffff9008a2d120>] ubsan_epilogue+0x34/0x9c lib/ubsan.c:166
      [<ffffff9008a2d8b8>] handle_overflow+0x228/0x280 lib/ubsan.c:197
      [<ffffff9008a2d95c>] __ubsan_handle_mul_overflow+0x4c/0x68 lib/ubsan.c:218
      [<ffffff90086d070c>] handle_mount_opt fs/ext4/super.c:1592 [inline]
      [<ffffff90086d070c>] parse_options+0x1724/0x1a40 fs/ext4/super.c:1773
      [<ffffff90086d51c4>] ext4_remount+0x2ec/0x14a0 fs/ext4/super.c:4834
      [...]
      
      Although it is not a big deal, still silence the UBSAN warning by limiting
      the input value.
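
      Concretely, s_commit_interval is computed as HZ * arg in int arithmetic,
      and with HZ=1000 an input of 536870912 gives 536870912000, which cannot
      be represented in 32 bits. A bound check of roughly the following shape
      is enough (illustrative, not the verbatim hunk):

        if (arg > INT_MAX / HZ) {
            ext4_msg(sb, KERN_ERR,
                     "Invalid commit interval %d, "
                     "must be smaller than %d", arg, INT_MAX / HZ);
            return -1;
        }
        sbi->s_commit_interval = HZ * arg;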
      
      Link: https://patchwork.ozlabs.org/patch/1153233
      Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com>
      Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Yi Zhang <yi.zhang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      3e3e50fe
    • arm64: set all the CPU as present for dtb booting with 'CONFIG_ACPI' enabled · 015f6d7e
      Xiongfeng Wang committed
      hulk inclusion
      category: bugfix
      bugzilla: 20799
      CVE: NA
      ---------------------------
      
      The patch referenced in the Fixes tag below didn't consider the situation
      where we boot the system using a device tree while 'CONFIG_ACPI' is
      enabled. In this situation, we also need to set all the CPUs as present.
      
      Fixes: 280637f70ab5 ("arm64: mark all the GICC nodes in MADT as possible cpu")
      Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
      Reviewed-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      015f6d7e
    • xfs: fix missing ILOCK unlock when xfs_setattr_nonsize fails due to EDQUOT · 6eefee3a
      Darrick J. Wong committed
      mainline inclusion
      from mainline-v5.3-rc6
      commit 1fb254aa
      category: bugfix
      bugzilla: 13690
      CVE: CVE-2019-15538
      
      -------------------------------------------------
      
      Benjamin Moody reported to Debian that XFS partially wedges when a chgrp
      fails on account of being out of disk quota.  I ran his reproducer
      script:
      
      (and then as user dummy)
      
      $ dd if=/dev/urandom bs=1M count=50 of=/mnt/dummy/foo
      $ chgrp plugdev /mnt/dummy/foo
      
      and saw:
      Reviewed-by: zhengbin <zhengbin13@huawei.com>
      
      ================================================
      WARNING: lock held when returning to user space!
      5.3.0-rc5 #rc5 Tainted: G        W
      ------------------------------------------------
      chgrp/47006 is leaving the kernel with locks still held!
      1 lock held by chgrp/47006:
       #0: 000000006664ea2d (&xfs_nondir_ilock_class){++++}, at: xfs_ilock+0xd2/0x290 [xfs]
      
      ...which is clearly caused by xfs_setattr_nonsize failing to unlock the
      ILOCK after the xfs_qm_vop_chown_reserve call fails.  Add the missing
      unlock.
      
      Reported-by: benjamin.moody@gmail.com
      Fixes: 253f4911 ("xfs: better xfs_trans_alloc interface")
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Tested-by: Salvatore Bonaccorso <carnil@debian.org>
      Reviewed-by: zhengbin <zhengbin13@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      6eefee3a
    • dm space map metadata: fix missing store of apply_bops() return value · 1967da12
      ZhangXiaoxu committed
      mainline inclusion
      from mainline-v5.3-rc6
      commit ae148243
      category: bugfix
      bugzilla: 20701
      CVE: NA
      
      -------------------------------------------------
      
      In commit 6096d91a ("dm space map metadata: fix occasional leak
      of a metadata block on resize"), we refactored the commit logic into a new
      function 'apply_bops'.  But when that logic was replaced in out(), the
      return value was not stored.  This may lead to out() returning a wrong
      value to the caller.
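
      The fix is essentially a one-liner in out(), roughly (illustrative):

        if (smm->recursion_count == 1)
            r = apply_bops(smm);    /* previously the result was dropped */

        smm->recursion_count--;

        return r;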
      
      Fixes: 6096d91a ("dm space map metadata: fix occasional leak of a metadata block on resize")
      Cc: stable@vger.kernel.org
      Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com>
      Reviewed-by: Yufen Yu <yuyufen@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      1967da12
    • dm btree: fix order of block initialization in btree_split_beneath · 19f55ba1
      ZhangXiaoxu committed
      mainline inclusion
      from mainline-v5.3-rc6
      commit e4f9d601
      category: bugfix
      bugzilla: 20701
      CVE: NA
      
      -------------------------------------------------
      
      When btree_split_beneath() splits a node to two new children, it will
      allocate two blocks: left and right.  If right block's allocation
      failed, the left block will be unlocked and marked dirty.  If this
      happened, the left block's content is zero, because it wasn't
      initialized with the btree struct before the attempt to allocate the
      right block.  Upon return, when flushing the left block to disk, the
      validator will fail when checking this block.  Then a BUG_ON is raised.
      
      Fix this by completely initializing the left block before allocating and
      initializing the right block.
      
      Fixes: 4dcb8b57 ("dm btree: fix leak of bufio-backed block in btree_split_beneath error path")
      Cc: stable@vger.kernel.org
      Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com>
      Reviewed-by: Yufen Yu <yuyufen@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      19f55ba1
    • net: hns3: some functions can be static. · fff3596e
      liaoguojia committed
      driver inclusion
      category: bugfix
      bugzilla: NA
      CVE: NA
      
      These functions are used only in the file where they are declared,
      so add the static keyword to them to reduce coupling.
      The functions include:
      hclge_func_reset_sync_vf().

      This patch is a supplement to patch 26380194.
      Fix tag: 26380194 ("some functions can be static")
      
      Feature or Bugfix:Bugfix
      Signed-off-by: liaoguojia <liaoguojia@huawei.com>
      Reviewed-by: lipeng <lipeng321@huawei.com>
      Reviewed-by: Yunsheng Lin <linyunsheng@huawei.com>
      Reviewed-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      fff3596e
    • net: hns3: Fixed incorrect type in assignment. · 8bc358e9
      liaoguojia committed
      driver inclusion
      category: bugfix
      bugzilla: NA
      CVE: NA
      
      This patch fixes some incorrect types in assignments reported by sparse.
      The sparse warnings are listed below; a typical fix pattern is sketched after the list:
      - warning : restricted __le16 degrades to integer
      - warning : cast from restricted __le32
      - warning : expected restricted __le32
      - warning : cast from restricted __be32
      - warning : cast from restricted __be16
      - warning : cast to restricted __le16
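
      The usual way to address such warnings is to convert explicitly with
      the kernel endianness helpers instead of mixing __le16/__le32 values
      with plain integers; a minimal sketch with hypothetical field names,
      not the driver's actual code:

      struct rx_desc {
              __le16 vlan_tag;        /* stored little-endian in the descriptor */
              __le32 pkt_len;
      };

      static void fill_desc(struct rx_desc *desc, u16 tag, u32 *len)
      {
              /* convert to CPU byte order before doing arithmetic on it */
              *len = le32_to_cpu(desc->pkt_len);

              /* convert to little-endian before writing into the descriptor */
              desc->vlan_tag = cpu_to_le16(tag);
      }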
      
      Feature or Bugfix:Bugfix
      Signed-off-by: Nliaoguojia <liaoguojia@huawei.com>
      Reviewed-by: Nlipeng <lipeng321@huawei.com>
      Reviewed-by: NYunsheng Lin <linyunsheng@huawei.com>
      Reviewed-by: NYang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      8bc358e9
    • S
      net: hns3: fix HCLGE_SWITCH_ALW_LPBK_B set error · 5cc236fd
      Committed by shenjian
      driver inclusion
      category: bugfix
      bugzilla: NA
      CVE: NA
      
      When doing the loopback test, HCLGE_SWITCH_ALW_LPBK_B should be set
      instead of HCLGE_SWITCH_ALW_LCL_LPBK_B; otherwise the loopback test
      may fail.
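
      A minimal sketch of the one-bit change this describes; the command
      structure and variable names here are assumptions for illustration,
      not the driver's exact code:

      /* enable the "allow loopback" bit rather than "allow local loopback" */
      hnae3_set_bit(req->switch_param, HCLGE_SWITCH_ALW_LPBK_B, enable ? 1 : 0);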
      
      Feature or Bugfix:Bugfix
      Signed-off-by: Nshenjian (K) <shenjian15@huawei.com>
      Reviewed-by: Nhuangdaode <huangdaode@hisilicon.com>
      Reviewed-by: NYunsheng Lin <linyunsheng@huawei.com>
      Reviewed-by: NYang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      5cc236fd
    • S
      net: hns3: add support for spoof check setting · fea4e7dc
      Committed by shenjian
      driver inclusion
      category: bugfix
      bugzilla: NA
      CVE: NA
      
      This patch adds support for spoof check configuration for VFs.
      When it is enabled, "spoof checking" is done for both the MAC address
      and VLAN. For each VF, the HW ensures that the source MAC address
      (or VLAN) of every outgoing packet exists in the MAC-list (or
      VLAN-list) configured for RX filtering for that VF. If not,
      the packet is dropped.
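
      A conceptual sketch of the check the hardware performs per VF when
      spoof checking is enabled; all names here are hypothetical, and the
      real check is done in hardware, not in driver code:

      static bool vf_egress_allowed(const struct vf_filter *vf,
                                    const u8 *src_mac, u16 vlan_id)
      {
              /* source MAC must be in the VF's RX-filter MAC list */
              if (!mac_list_contains(vf->mac_list, src_mac))
                      return false;

              /* a tagged packet's VLAN must be in the VF's VLAN list */
              if (vlan_id && !vlan_list_contains(vf->vlan_list, vlan_id))
                      return false;

              return true;    /* otherwise the packet is dropped */
      }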
      
      Feature or Bugfix:Bugfix
      Signed-off-by: Nshenjian (K) <shenjian15@huawei.com>
      Reviewed-by: Nlipeng <lipeng321@huawei.com>
      Reviewed-by: NYunsheng Lin <linyunsheng@huawei.com>
      Reviewed-by: NYang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      fea4e7dc
    • A
      arm64/numa: Report correct memblock range for the dummy node · 65246909
      Committed by Anshuman Khandual
      mainline inclusion
      from mainline-v4.20-rc5
      commit 77cfe950
      category: bugfix
      bugzilla: 5611
      CVE: NA
      
      The dummy node ID is marked into all memory ranges on the system. So the
      dummy node really extends the entire memblock.memory. Hence report correct
      extent information for the dummy node using memblock range helper functions
      instead of the range [0LLU, PFN_PHYS(max_pfn) - 1)].
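
      A sketch of what reporting the dummy node with memblock range helpers
      looks like (the log text is illustrative, not necessarily the exact
      upstream line):

      pr_info("NUMA: Faking a node at [mem %#018Lx-%#018Lx]\n",
              memblock_start_of_DRAM(), memblock_end_of_DRAM() - 1);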
      
      Fixes: 1a2db300 ("arm64, numa: Add NUMA support for arm64 platforms")
      Acked-by: NPunit Agrawal <punit.agrawal@arm.com>
      Signed-off-by: NAnshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: NXuefeng Wang <wxf.wang@hisilicon.com>
      Reviewed-by: NZhen Lei <thunder.leizhen@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      65246909
    • D
      bpf, arm64: fix getting subprog addr from aux for calls · f7dcdf17
      Committed by Daniel Borkmann
      mainline inclusion
      from mainline-v4.20-rc5
      commit 8c11ea5c
      category: bugfix
      bugzilla: 5654
      CVE: NA
      
      The arm64 JIT has the same issue as ppc64 JIT in that the relative BPF
      to BPF call offset can be too far away from core kernel in that relative
      encoding into imm is not sufficient and could potentially be truncated,
      see also fd045f6c ("arm64: add support for module PLTs") which adds
      spill-over space for module_alloc() and therefore bpf_jit_binary_alloc().
      Therefore, use the recently added bpf_jit_get_func_addr() helper for
      properly fetching the address through prog->aux->func[off]->bpf_func
      instead. This also has the benefit to optimize normal helper calls since
      their address can use the optimized emission. Tested on Cavium ThunderX
      CN8890.
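
      A hedged sketch of how a JIT can resolve a call target through the
      helper; emit_call() and the surrounding variables are illustrative,
      not the arm64 JIT's exact code:

      u64 func_addr;
      bool func_addr_fixed;
      int ret;

      /* resolves prog->aux->func[off]->bpf_func for BPF-to-BPF calls,
       * or __bpf_call_base + imm for normal helper calls */
      ret = bpf_jit_get_func_addr(prog, insn, extra_pass,
                                  &func_addr, &func_addr_fixed);
      if (ret < 0)
              return ret;

      /* a fixed (helper) address allows an optimized immediate emission */
      emit_call(func_addr, func_addr_fixed, ctx);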
      
      Fixes: db496944 ("bpf: arm64: add JIT support for multi-function programs")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NXuefeng Wang <wxf.wang@hisilicon.com>
      Reviewed-by: NHanjun Guo <guohanjun@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      f7dcdf17
    • D
      bpf, ppc64: generalize fetching subprog into bpf_jit_get_func_addr · 51778776
      Committed by Daniel Borkmann
      mainline inclusion
      from mainline-v4.20-rc5
      commit e2c95a61
      category: bugfix
      bugzilla: 5654
      CVE: NA
      
      Make fetching of the BPF call address from ppc64 JIT generic. ppc64
      was using a slightly different variant rather than through the insns'
      imm field encoding as the target address would not fit into that space.
      Therefore, the target subprog number was encoded into the insns' offset
      and fetched through fp->aux->func[off]->bpf_func instead. Given there
      are other JITs with this issue and the mechanism of fetching the address
      is JIT-generic, move it into the core as a helper instead. On the JIT
      side, we get information on whether the retrieved address is a fixed
      one, that is, not changing through JIT passes, or a dynamic one. For
      the former, JITs can optimize their imm emission because this doesn't
      change jump offsets throughout JIT process.
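
      A simplified sketch of the core of such a helper, based on the
      mechanism described above (subprog targets resolved through
      fp->aux->func[off], helpers through __bpf_call_base + imm); error
      handling is elided:

      if (insn->src_reg == BPF_PSEUDO_CALL) {
              /* BPF-to-BPF call: final subprog addresses are only known
               * in the extra pass, once all subprogs have been JITed */
              addr = extra_pass ? (u8 *)prog->aux->func[insn->off]->bpf_func
                                : NULL;
              *func_addr_fixed = false;
      } else {
              /* helper call: a fixed address inside the core kernel */
              addr = (u8 *)__bpf_call_base + insn->imm;
              *func_addr_fixed = true;
      }
      *func_addr = (unsigned long)addr;
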
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: NSandipan Das <sandipan@linux.ibm.com>
      Tested-by: NSandipan Das <sandipan@linux.ibm.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NXuefeng Wang <wxf.wang@hisilicon.com>
      Reviewed-by: NHanjun Guo <guohanjun@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      51778776
    • Y
      net: hns3: fix shaper parameter algorithm · ea3c3ca1
      Committed by Yonglong Liu
      driver inclusion
      category: bugfix
      bugzilla: NA
      CVE: NA
      
      The HNS3 driver uses an unsigned int to calculate the shaper
      parameter. When the configured bandwidth is small, e.g. 1M, the actual
      result on the chip side is 1.28M. This bug only appears when the
      bandwidth is less than 20M.
      
      This patch increments ir_s one more time when the calculated result
      equals the configured bandwidth, so that a value closer to the
      configured bandwidth can be obtained.
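
      A conceptual fragment following the description above; the variable
      names (ir, ir_calc, ir_s_calc) are assumptions, not the driver's exact
      code:

      /* when the computed rate lands exactly on the requested one, bump
       * ir_s once more so the encoded rate ends up closer to the request */
      if (ir_calc == ir)
              ir_s_calc++;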
      
      Feature or Bugfix:Bugfix
      Signed-off-by: NYonglong Liu <liuyonglong@huawei.com>
      Reviewed-by: Nlinyunsheng <linyunsheng@huawei.com>
      Reviewed-by: NYang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      ea3c3ca1
    • D
      xfs: fix off-by-one error in rtbitmap cross-reference · 80bcc5f0
      Committed by Darrick J. Wong
      mainline inclusion
      from mainline-5.1-rc1
      commit 87c9607d
      category: bugfix
      bugzilla: 18943
      CVE: NA
      
      ---------------------------
      
      Fix an off-by-one error in the realtime bitmap "is used" cross-reference
      helper function if the realtime extent size is a single block.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: Nyu kuai <yukuai3@huawei.com>
      Reviewed-by: Nzhengbin <zhengbin13@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      80bcc5f0