1. 27 Dec 2019, 40 commits
    • arm64/neon: Disable -Wincompatible-pointer-types when building with Clang · e9fabe0c
      Nathan Chancellor committed
      mainline inclusion
      from mainline-5.0
      commit 0738c8b5
      category: bugfix
      bugzilla: 11024
      CVE: NA
      
      -------------------------------------------------
      
      After commit cc9f8349 ("arm64: crypto: add NEON accelerated XOR
      implementation"), Clang builds for arm64 started failing with the
      following error message.
      
      arch/arm64/lib/xor-neon.c:58:28: error: incompatible pointer types
      assigning to 'const unsigned long *' from 'uint64_t *' (aka 'unsigned
      long long *') [-Werror,-Wincompatible-pointer-types]
                      v3 = veorq_u64(vld1q_u64(dp1 +  6), vld1q_u64(dp2 + 6));
                                               ^~~~~~~~
      /usr/lib/llvm-9/lib/clang/9.0.0/include/arm_neon.h:7538:47: note:
      expanded from macro 'vld1q_u64'
        __ret = (uint64x2_t) __builtin_neon_vld1q_v(__p0, 51); \
                                                    ^~~~
      
      There has been quite a bit of debate and triage that has gone into
      figuring out what the proper fix is, viewable at the link below, which
      is still ongoing. Ard suggested disabling this warning for Clang with a
      pragma so that no NEON code triggers this type of error. While this is not
      at all an ideal solution, this build error is the only thing preventing
      KernelCI from having successful arm64 defconfig and allmodconfig builds
      on linux-next. Getting continuous integration running is more important
      so new warnings/errors or boot failures can be caught and fixed quickly.
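
      As an illustration, a pragma-based workaround of this kind looks roughly
      like the snippet below; the exact file and guard used by the merged patch
      may differ, so treat this as a sketch rather than the verbatim hunk:

        /* Tell Clang not to warn about the const uint64_t * vs unsigned long *
         * mismatch inside the NEON intrinsics; GCC does not know this option. */
        #ifdef __clang__
        #pragma clang diagnostic ignored "-Wincompatible-pointer-types"
        #endif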
      
      Link: https://github.com/ClangBuiltLinux/linux/issues/283
      Suggested-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      (cherry picked from commit 0738c8b5)
      Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      e9fabe0c
    • arm64: crypto: add NEON accelerated XOR implementation · 2b99d865
      Jackie Liu committed
      mainline inclusion
      from mainline-5.0-rc1
      commit: cc9f8349
      category: feature
      feature: NEON accelerated XOR
      bugzilla: 11024
      CVE: NA
      
      --------------------------------------------------
      
      This is a NEON acceleration method that can improve
      performance by approximately 20%. I got the following
      data from CentOS 7.5 on Huawei's HISI1616 chip:
      
      [ 93.837726] xor: measuring software checksum speed
      [ 93.874039]   8regs  : 7123.200 MB/sec
      [ 93.914038]   32regs : 7180.300 MB/sec
      [ 93.954043]   arm64_neon: 9856.000 MB/sec
      [ 93.954047] xor: using function: arm64_neon (9856.000 MB/sec)
      
      I believe this code can bring some optimization for
      all arm64 platforms. Thanks to Ard Biesheuvel for his suggestions.
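
      For reference, the hot loop of such a NEON XOR routine has roughly the
      following shape (a simplified two-source sketch, not the exact kernel
      code):

        #include <arm_neon.h>

        static void xor_neon_2(unsigned long bytes, unsigned long *p1,
                               const unsigned long *p2)
        {
            uint64_t *dp1 = (uint64_t *)p1;
            const uint64_t *dp2 = (const uint64_t *)p2;
            long lines = bytes / (sizeof(uint64x2_t) * 4);

            do {
                /* p1 ^= p2, 64 bytes per iteration */
                uint64x2_t v0 = veorq_u64(vld1q_u64(dp1 + 0), vld1q_u64(dp2 + 0));
                uint64x2_t v1 = veorq_u64(vld1q_u64(dp1 + 2), vld1q_u64(dp2 + 2));
                uint64x2_t v2 = veorq_u64(vld1q_u64(dp1 + 4), vld1q_u64(dp2 + 4));
                uint64x2_t v3 = veorq_u64(vld1q_u64(dp1 + 6), vld1q_u64(dp2 + 6));

                vst1q_u64(dp1 + 0, v0);
                vst1q_u64(dp1 + 2, v1);
                vst1q_u64(dp1 + 4, v2);
                vst1q_u64(dp1 + 6, v3);

                dp1 += 8;
                dp2 += 8;
            } while (--lines > 0);
        }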
      Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
      Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      2b99d865
    • arm64/neon: add workaround for ambiguous C99 stdint.h types · 28777f93
      Jackie Liu committed
      mainline inclusion
      from mainline-5.0-rc1
      commit: 21e28547
      category: feature
      feature: NEON accelerated XOR
      bugzilla: 11024
      CVE: NA
      
      --------------------------------------------------
      
      In a way similar to ARM commit 09096f6a ("ARM: 7822/1: add workaround
      for ambiguous C99 stdint.h types"), this patch redefines the macros that
      are used in stdint.h so its definitions of uint64_t and int64_t are
      compatible with those of the kernel.
      
      This patch comes from: https://patchwork.kernel.org/patch/3540001/
      Written by: Ard Biesheuvel <ard.biesheuvel@linaro.org>

      We mark this file as a private file, so we don't have to override asm/types.h.
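
      The workaround itself is a handful of preprocessor overrides applied
      before arm_neon.h pulls in stdint.h, roughly as follows (shape taken from
      the ARM commit cited above; treat it as a sketch):

        /* In the kernel, u64/s64 are [un]signed long long, not [un]signed long.
         * Redefine the compiler-provided macros so the uint64_t/int64_t used by
         * arm_neon.h match the kernel's types. */
        #ifdef __INT64_TYPE__
        #undef __INT64_TYPE__
        #define __INT64_TYPE__          __signed__ long long
        #endif

        #ifdef __UINT64_TYPE__
        #undef __UINT64_TYPE__
        #define __UINT64_TYPE__         unsigned long long
        #endif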
      Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
      Reviewed-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      28777f93
    • arm64/lib: improve CRC32 performance for deep pipelines · fce68364
      Ard Biesheuvel committed
      mainline inclusion
      from mainline-5.0
      commit: efdb25efc7645b326cd5eb82be5feeabe167c24e
      category: perf
      bugzilla: 20886
      CVE: NA
      
      lib/crc32test result:
      
      [root@localhost build]# rmmod crc32test && insmod lib/crc32test.ko &&
      dmesg | grep cycles
      [83170.153209] CPU7: use cycles 26243990
      [83183.122137] CPU7: use cycles 26151290
      [83309.691628] CPU7: use cycles 26122830
      [83312.415559] CPU7: use cycles 26232600
      [83313.191479] CPU8: use cycles 26082350
      
      rmmod crc32test && insmod lib/crc32test.ko && dmesg | grep cycles
      [ 1023.539931] CPU25: use cycles 12256730
      [ 1024.850360] CPU24: use cycles 12249680
      [ 1025.463622] CPU25: use cycles 12253330
      [ 1025.862925] CPU25: use cycles 12269720
      [ 1026.376038] CPU26: use cycles 12222480
      
      Based on 13702:
      arm64/lib: improve CRC32 performance for deep pipelines
      crypto: arm64/crc32 - remove PMULL based CRC32 driver
      arm64/lib: add accelerated crc32 routines
      arm64: cpufeature: add feature for CRC32 instructions
      lib/crc32: make core crc32() routines weak so they can be overridden
      
      ----------------------------------------------
      
      Improve the performance of the crc32() asm routines by getting rid of
      most of the branches and small sized loads on the common path.
      
      Instead, use a branchless code path involving overlapping 16 byte
      loads to process the first (length % 32) bytes, and process the
      remainder using a loop that processes 32 bytes at a time.
      
      Tested using the following test program:
      
        #include <stdlib.h>
      
        extern void crc32_le(unsigned short, char const*, int);
      
        int main(void)
        {
          static const char buf[4096];
      
          srand(20181126);
      
          for (int i = 0; i < 100 * 1000 * 1000; i++)
            crc32_le(0, buf, rand() % 1024);
      
          return 0;
        }
      
      On Cortex-A53 and Cortex-A57, the performance regresses but only very
      slightly. On Cortex-A72 however, the performance improves from
      
        $ time ./crc32
      
        real  0m10.149s
        user  0m10.149s
        sys   0m0.000s
      
      to
      
        $ time ./crc32
      
        real  0m7.915s
        user  0m7.915s
        sys   0m0.000s
      
      Cc: Rui Sun <sunrui26@huawei.com>
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      fce68364
    • lib/crc32.c: mark crc32_le_base/__crc32c_le_base aliases as __pure · 0066ba62
      Miguel Ojeda committed
      mainline inclusion
      from mainline-5.0
      commit ff98e20e
      category: bugfix
      bugzilla: 13702
      CVE: NA
      
      -------------------------------------------------
      
      The upcoming GCC 9 release extends the -Wmissing-attributes warnings
      (enabled by -Wall) to C and aliases: it warns when particular function
      attributes are missing in the aliases but not in their target.
      
      In particular, it triggers here because crc32_le_base/__crc32c_le_base
      aren't __pure while their target crc32_le/__crc32c_le are.
      
      These aliases are used by architectures as a fallback in accelerated
      versions of CRC32. See commit 9784d82d ("lib/crc32: make core crc32()
      routines weak so they can be overridden").
      
      Therefore, being fallbacks, it is likely that even if the aliases
      were called from C, there wouldn't be any optimizations possible.
      Currently, the only user is arm64, which calls this from asm.
      
      Still, marking the aliases as __pure makes sense and is a good idea
      for documentation purposes and possible future optimizations,
      which also silences the warning.
      Acked-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Tested-by: Laura Abbott <labbott@redhat.com>
      Signed-off-by: Miguel Ojeda <miguel.ojeda.sandonis@gmail.com>
      (cherry picked from commit ff98e20e)
      Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      0066ba62
    • crypto: arm64/crc32 - remove PMULL based CRC32 driver · 0e584ca5
      Ard Biesheuvel committed
      mainline inclusion
      from mainline-4.20-rc1
      commit: 598b7d41
      category: feature
      feature: accelerated crc32 routines
      bugzilla: 13702
      CVE: NA
      
      --------------------------------------------------
      
      Now that the scalar fallbacks have been moved out of this driver into
      the core crc32()/crc32c() routines, we are left with a CRC32 crypto API
      driver for arm64 that is based only on 64x64 polynomial multiplication,
      which is an optional instruction in the ARMv8 architecture, and is less
      and less likely to be available on cores that do not also implement the
      CRC32 instructions, given that those are mandatory in the architecture
      as of ARMv8.1.
      
      Since the scalar instructions do not require the special handling that
      SIMD instructions do, and since they turn out to be considerably faster
      on some cores (Cortex-A53) as well, there is really no point in keeping
      this code around so let's just remove it.
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      0e584ca5
    • arm64/lib: add accelerated crc32 routines · 11fa09f2
      Ard Biesheuvel committed
      mainline inclusion
      from mainline-4.20-rc1
      commit: 7481cddf29ede204b475facc40e6f65459939881
      category: feature
      feature: accelerated crc32 routines
      bugzilla: 13702
      CVE: NA
      
      --------------------------------------------------
      
      Unlike crc32c(), which is wired up to the crypto API internally so the
      optimal driver is selected based on the platform's capabilities,
      crc32_le() is implemented as a library function using a slice-by-8 table
      based C implementation. Even though few of the call sites may be
      bottlenecks, calling a time variant implementation with a non-negligible
      D-cache footprint is a bit of a waste, given that ARMv8.1 and up mandates
      support for the CRC32 instructions that were optional in ARMv8.0, but are
      already widely available, even on the Cortex-A53 based Raspberry Pi.
      
      So implement routines that use these instructions if available, and fall
      back to the existing generic routines otherwise. The selection is based
      on alternatives patching.
      
      Note that this unconditionally selects CONFIG_CRC32 as a builtin. Since
      CRC32 is relied upon by core functionality such as CONFIG_OF_FLATTREE,
      this just codifies the status quo.
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      11fa09f2
    • arm64: cpufeature: add feature for CRC32 instructions · 3a8984a9
      Ard Biesheuvel committed
      mainline inclusion
      from mainline-4.20-rc1
      commit: 86d0dd34eafffbc76a81aba6ae2d71927d3835a8
      category: feature
      feature: accelerated crc32 routines
      bugzilla: 13702
      CVE: NA
      
      --------------------------------------------------
      
      Add a CRC32 feature bit and wire it up to the CPU id register so we
      will be able to use alternatives patching for CRC32 operations.
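
      The wiring boils down to one more entry in the arm64 cpufeature table,
      roughly of the following shape (field values assumed from the ID register
      layout; not the verbatim hunk):

        {
            .desc = "CRC32 instructions",
            .capability = ARM64_HAS_CRC32,
            .type = ARM64_CPUCAP_SYSTEM_FEATURE,
            .matches = has_cpuid_feature,
            .sys_reg = SYS_ID_AA64ISAR0_EL1,
            .field_pos = ID_AA64ISAR0_CRC32_SHIFT,
            .min_field_value = 1,
        },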
      Acked-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>

      Conflicts:
      	arch/arm64/include/asm/cpucaps.h
      	arch/arm64/kernel/cpufeature.c
      Signed-off-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      3a8984a9
    • lib/crc32: make core crc32() routines weak so they can be overridden · a216345b
      Ard Biesheuvel committed
      mainline inclusion
      from mainline-4.20-rc1
      commit: 9784d82d
      category: feature
      feature: accelerated crc32 routines
      bugzilla: 13702
      CVE: NA
      
      --------------------------------------------------
      
      Allow architectures to drop in accelerated CRC32 routines by making
      the crc32_le/__crc32c_le entry points weak, and exposing non-weak
      aliases for them that may be used by the accelerated versions as
      fallbacks in case the instructions they rely upon are not available.
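
      In lib/crc32.c this amounts to a pattern of roughly the following shape
      (a sketch; the real definitions also cover the big-endian and crc32c
      variants):

        u32 __pure __weak crc32_le(u32 crc, unsigned char const *p, size_t len)
        {
            return crc32_le_generic(crc, p, len,
                                    (const u32 (*)[256])crc32table_le, CRC32_POLY_LE);
        }

        /* non-weak alias, usable as a fallback by accelerated overrides */
        u32 __pure crc32_le_base(u32, unsigned char const *, size_t)
            __alias(crc32_le);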
      Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
      Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
      Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Xie XiuQi <xiexiuqi@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      a216345b
    • pci: do not save 'PCI_BRIDGE_CTL_BUS_RESET' · 36b45f28
      Xiongfeng Wang committed
      hulk inclusion
      category: bugfix
      bugzilla: 20702
      CVE: NA
      ---------------------------
      
      When I injected a PCIe fatal error into a Mellanox netdevice, 'dmesg'
      showed the device was recovered successfully, but 'lspci' didn't show the
      device. I checked the configuration space of the slot where the
      netdevice is inserted and found out the bit 'PCI_BRIDGE_CTL_BUS_RESET'
      is set. Later, I found out it is because this bit is saved in
      'saved_config_space' of 'struct pci_dev' when 'pci_pm_runtime_suspend()'
      is called. And 'PCI_BRIDGE_CTL_BUS_RESET' is set every time we restore
      the configuration space.
      
      This patch avoids saving the bit 'PCI_BRIDGE_CTL_BUS_RESET' when we save
      the configuration space of a bridge.
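
      A minimal sketch of that idea (illustrative only; the helper name and the
      exact hunk in this patch are assumptions):

        /* When snapshotting a bridge's config space, drop the secondary bus
         * reset bit so a later restore cannot leave the bus held in reset. */
        static void pci_sanitise_saved_bridge_ctl(struct pci_dev *bridge)
        {
            int idx = PCI_BRIDGE_CONTROL / 4;          /* dword index: 0x3e / 4 = 15 */
            int shift = (PCI_BRIDGE_CONTROL % 4) * 8;  /* bits 16..31 of that dword */

            bridge->saved_config_space[idx] &=
                    ~((u32)PCI_BRIDGE_CTL_BUS_RESET << shift);
        }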
      Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
      Reviewed-by: Hanjun Guo <guohanjun@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      36b45f28
    • config: enable CONFIG_KTASK in hulk_defconfig and storage_ci_defconfig · 58622cdd
      Hongbo Yao committed
      hulk inclusion
      category: feature
      bugzilla: 13228
      CVE: NA
      ---------------------------
      
      enable CONFIG_KTASK in hulk_defconfig and storage_ci_defconfig
      Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      58622cdd
    • ktask: change the chunk size for some ktask thread functions · 135e9232
      Hongbo Yao committed
      hulk inclusion
      category: feature
      bugzilla: 13228
      CVE: NA
      
      ---------------------------
      
      This patch fixes some issues in the original series:
      1) PMD_SIZE chunks have made thread finishing times too spread out
      in some cases, so KTASK_MEM_CHUNK (128M) seems to be a reasonable compromise.
      2) If hugepagesz=1G, then pages_per_huge_page = 1G / 4K = 256, and
      using KTASK_MEM_CHUNK would cause the ktask thread count to be 1, which
      would not improve the performance of clearing gigantic pages.
      Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Tested-by: Hongbo Yao <yaohongbo@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      135e9232
    • hugetlbfs: parallelize hugetlbfs_fallocate with ktask · 4733c59f
      Daniel Jordan committed
      hulk inclusion
      category: feature
      bugzilla: 13228
      CVE: NA
      ---------------------------
      
      hugetlbfs_fallocate preallocates huge pages to back a file in a
      hugetlbfs filesystem.  The time to call this function grows linearly
      with size.
      
      ktask performs well with its default thread count of 4; higher thread
      counts are given for context only.
      
      Machine: Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz, 288 CPUs, 1T memory
      Test:    fallocate(1) a file on a hugetlbfs filesystem
      
      nthread   speedup   size (GiB)   min time (s)   stdev
            1                    200         127.53    2.19
            2     3.09x          200          41.30    2.11
            4     5.72x          200          22.29    0.51
            8     9.45x          200          13.50    2.58
           16     9.74x          200          13.09    1.64
      
            1                    400         193.09    2.47
            2     2.14x          400          90.31    3.39
            4     3.84x          400          50.32    0.44
            8     5.11x          400          37.75    1.23
           16     6.12x          400          31.54    3.13
      
      The primary bottleneck for better scaling at higher thread counts is
      hugetlb_fault_mutex_table[hash].  perf showed L1-dcache-loads increase
      with 8 threads and again sharply with 16 threads, and a CPU counter
      profile showed that 31% of the L1d misses were on
      hugetlb_fault_mutex_table[hash] in the 16-thread case.
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Tested-by: Hongbo Yao <yaohongbo@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      4733c59f
    • mm: parallelize clear_gigantic_page · ae0cd4d4
      Daniel Jordan committed
      hulk inclusion
      category: feature
      bugzilla: 13228
      CVE: NA
      ---------------------------
      
      Parallelize clear_gigantic_page, which zeroes any page size larger than
      8M (e.g. 1G on x86).
      
      Performance results (the default number of threads is 4; higher thread
      counts shown for context only):
      
      Machine:  Intel(R) Xeon(R) CPU E7-8895 v3 @ 2.60GHz, 288 CPUs, 1T memory
      Test:     Clear a range of gigantic pages (triggered via fallocate)
      
      nthread   speedup   size (GiB)   min time (s)   stdev
            1                    100          41.13    0.03
            2     2.03x          100          20.26    0.14
            4     4.28x          100           9.62    0.09
            8     8.39x          100           4.90    0.05
           16    10.44x          100           3.94    0.03
      
            1                    200          89.68    0.35
            2     2.21x          200          40.64    0.18
            4     4.64x          200          19.33    0.32
            8     8.99x          200           9.98    0.04
           16    11.27x          200           7.96    0.04
      
            1                    400         188.20    1.57
            2     2.30x          400          81.84    0.09
            4     4.63x          400          40.62    0.26
            8     8.92x          400          21.09    0.50
           16    11.78x          400          15.97    0.25
      
            1                    800         434.91    1.81
            2     2.54x          800         170.97    1.46
            4     4.98x          800          87.38    1.91
            8    10.15x          800          42.86    2.59
           16    12.99x          800          33.48    0.83
      
      The speedups are mostly due to the fact that more threads can use more
      memory bandwidth.  The loop we're stressing on the x86 chip in this test
      is clear_page_erms, which tops out at a bandwidth of 2550 MiB/s with one
      thread.  We get the same bandwidth per thread for 2, 4, or 8 threads,
      but at 16 threads the per-thread bandwidth drops to 1420 MiB/s.
      
      However, the performance also improves over a single thread because of
      the ktask threads' NUMA awareness (ktask migrates worker threads to the
      node local to the work being done).  This becomes a bigger factor as the
      amount of pages to zero grows to include memory from multiple nodes, so
      that speedups increase as the size increases.
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Tested-by: Hongbo Yao <yaohongbo@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      ae0cd4d4
    • mm: parallelize deferred struct page initialization within each node · eb761d65
      Daniel Jordan committed
      hulk inclusion
      category: feature
      bugzilla: 13228
      CVE: NA
      ---------------------------
      
      Deferred struct page initialization currently runs one thread per node,
      but this is a bottleneck during boot on big machines, so use ktask
      within each pgdatinit thread to parallelize the struct page
      initialization, allowing the system to take better advantage of its
      memory bandwidth.
      
      Because the system is not fully up yet and most CPUs are idle, use more
      than the default maximum number of ktask threads.  The kernel doesn't
      know the memory bandwidth of a given system to get the most efficient
      number of threads, so there's some guesswork involved.  In testing, a
      reasonable value turned out to be about a quarter of the CPUs on the
      node.
      
      __free_pages_core used to increase the zone's managed page count by the
      number of pages being freed.  To accommodate multiple threads, however,
      account the number of freed pages with an atomic shared across the ktask
      threads and bump the managed page count with it after ktask is finished.
      
      Test:    Boot the machine with deferred struct page init three times
      
      Machine: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz, 88 CPUs, 503G memory,
               2 sockets
      
      kernel                   speedup   max time per   stdev
                                         node (ms)
      
      baseline (4.15-rc2)                        5860     8.6
      ktask                      9.56x            613    12.4
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Tested-by: Hongbo Yao <yaohongbo@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      eb761d65
    • mm: enlarge type of offset argument in mem_map_offset and mem_map_next · 228e4183
      Daniel Jordan committed
      hulk inclusion
      category: feature
      bugzilla: 13228
      CVE: NA
      ---------------------------
      
      Changes the type of 'offset' from int to unsigned long in both
      mem_map_offset and mem_map_next.
      
      This facilitates ktask's use of mem_map_next with its unsigned long
      types to avoid silent truncation when these unsigned longs are passed as
      ints.
      
      It also fixes the preexisting truncation of 'offset' from unsigned long
      to int by the sole caller of mem_map_offset, follow_hugetlb_page.
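
      The change amounts to widening the offset parameter of the two helpers,
      roughly as follows (a sketch of the new prototypes only, not the full
      bodies):

        static inline struct page *mem_map_offset(struct page *base,
                                                  unsigned long offset);
        static inline struct page *mem_map_next(struct page *iter,
                                                struct page *base,
                                                unsigned long offset);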
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Tested-by: Hongbo Yao <yaohongbo@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      228e4183
    • vfio: relieve mmap_sem reader cacheline bouncing by holding it longer · 22705b26
      Daniel Jordan committed
      hulk inclusion
      category: feature
      bugzilla: 13228
      CVE: NA
      ---------------------------
      
      Profiling shows significant time being spent on atomic ops in mmap_sem
      reader acquisition.  mmap_sem is taken and dropped for every single base
      page during pinning, so this is not surprising.
      
      Reduce the number of times mmap_sem is taken by holding for longer,
      which relieves atomic cacheline bouncing.
      
      Results for all VFIO page pinning patches
      -----------------------------------------
      
      The test measures the time from qemu invocation to the start of guest
      boot.  The guest uses kvm with 320G memory backed with THP.  320G fits
      in a node on the test machine used here, so there was no thrashing in
      reclaim because of __GFP_THISNODE in THP allocations[1].
      
      CPU:              2 nodes * 24 cores/node * 2 threads/core = 96 CPUs
                        Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz
      memory:           754G split evenly between nodes
      scaling_governor: performance
      
           patch 6                  patch 8              patch 9 (this one)
           -----------------------  ----------------- ---------------------
      thr  speedup average sec  speedup   average sec  speedup   average sec
        1          65.0(± 0.6%)           65.2(± 0.5%)           65.5(± 0.4%)
        2   1.5x   42.8(± 5.8%)   1.8x    36.0(± 0.9%)    1.9x   34.4(± 0.3%)
        3   1.9x   35.0(±11.3%)   2.5x    26.4(± 4.2%)    2.8x   23.7(± 0.2%)
        4   2.3x   28.5(± 1.3%)   3.1x    21.2(± 2.8%)    3.6x   18.3(± 0.3%)
        5   2.5x   26.2(± 1.5%)   3.6x    17.9(± 0.9%)    4.3x   15.1(± 0.3%)
        6   2.7x   24.5(± 1.8%)   4.0x    16.5(± 3.0%)    5.1x   12.9(± 0.1%)
        7   2.8x   23.5(± 4.9%)   4.2x    15.4(± 2.7%)    5.7x   11.5(± 0.6%)
        8   2.8x   22.8(± 1.8%)   4.2x    15.5(± 4.7%)    6.4x   10.3(± 0.8%)
       12   3.2x   20.2(± 1.4%)   4.4x    14.7(± 2.9%)    8.6x    7.6(± 0.6%)
       16   3.3x   20.0(± 0.7%)   4.3x    15.4(± 1.3%)   10.2x    6.4(± 0.6%)
      
      At patch 6, lock_stat showed long reader wait time on mmap_sem writers,
      leading to patch 8.
      
      At patch 8, profiling revealed the issue with mmap_sem described above.
      
      Across all three patches, performance consistently improves as the
      thread count increases.  The one exception is the antiscaling with
      nthr=16 in patch 8: those mmap_sem atomics are really bouncing around
      the machine.
      
      The performance with patch 9 looks pretty good overall.  I'm working on
      finding the next bottleneck, and this is where it stopped:  When
      nthr=16, the obvious issue profiling showed was contention on the split
      PMD page table lock when pages are faulted in during the pinning (>2% of
      the time).
      
      A split PMD lock protects a PUD_SIZE-ed amount of page table mappings
      (1G on x86), so if threads were operating on smaller chunks and
      contending in the same PUD_SIZE range, this could be the source of
      contention.  However, when nthr=16, threads operate on 5G chunks (320G /
      16 threads / (1<<KTASK_LOAD_BAL_SHIFT)), so this wasn't the cause, and
      aligning the chunks on PUD_SIZE boundaries didn't help either.
      
      The time is short (6.4 seconds), so the next theory was threads
      finishing at different times, but probes showed the threads all returned
      within less than a millisecond of each other.
      
      Kernel probes turned up a few smaller VFIO page pin calls besides the
      heavy 320G call.  The chunk size given (PMD_SIZE) could affect thread
      count and chunk size for these, so chunk size was increased from 2M to
      1G.  This caused the split PMD contention to disappear, but with little
      change in the runtime.  More digging required.
      
      [1] lkml.kernel.org/r/20180925120326.24392-1-mhocko@kernel.org
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Tested-by: Hongbo Yao <yaohongbo@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      22705b26
    • vfio: remove unnecessary mmap_sem writer acquisition around locked_vm · 68a4b481
      Daniel Jordan committed
      hulk inclusion
      category: feature
      bugzilla: 13228
      CVE: NA
      ---------------------------
      
      Now that mmap_sem is no longer required for modifying locked_vm, remove
      it in the VFIO code.
      
      [XXX Can be sent separately, along with similar conversions in the other
      places mmap_sem was taken for locked_vm.  While at it, could make
      similar changes to pinned_vm.]
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Tested-by: Hongbo Yao <yaohongbo@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      68a4b481
    • mm: change locked_vm's type from unsigned long to atomic_long_t · 53f4e528
      Daniel Jordan committed
      hulk inclusion
      category: feature
      bugzilla: 13228
      CVE: NA
      ---------------------------
      
      Currently, mmap_sem must be held as writer to modify the locked_vm field
      in mm_struct.
      
      This creates a bottleneck when multithreading VFIO page pinning because
      each thread holds the mmap_sem as reader for the majority of the pinning
      time but also takes mmap_sem as writer regularly, for short times, when
      modifying locked_vm.
      
      The problem gets worse when other workloads compete for CPU with ktask
      threads doing page pinning because the other workloads force ktask
      threads that hold mmap_sem as writer off the CPU, blocking ktask threads
      trying to get mmap_sem as reader for an excessively long time (the
      mmap_sem reader wait time grows linearly with the thread count).
      
      Requiring mmap_sem for locked_vm also abuses mmap_sem by making it
      protect data that could be synchronized separately.
      
      So, decouple locked_vm from mmap_sem by making locked_vm an
      atomic_long_t.  locked_vm's old type was unsigned long and changing it
      to a signed type makes it lose half its capacity, but that's only a
      concern for 32-bit systems and LONG_MAX * PAGE_SIZE is 8T on x86 in that
      case, so there's headroom.
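
      The conversion follows a simple before/after pattern at call sites (an
      illustrative sketch, not a specific hunk from this series):

        /* before: plain unsigned long, writers must hold mmap_sem for write */
        down_write(&mm->mmap_sem);
        mm->locked_vm += npages;
        up_write(&mm->mmap_sem);

        /* after: atomic_long_t, updated without taking mmap_sem as writer */
        atomic_long_add(npages, &mm->locked_vm);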
      
      Now that mmap_sem is not taken as writer here, ktask threads holding
      mmap_sem as reader can run more often.  Performance results appear later
      in the series.
      
      On powerpc, this was cross-compiled-tested only.
      
      [XXX Can send separately.]
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Tested-by: Hongbo Yao <yaohongbo@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      53f4e528
    • vfio: parallelize vfio_pin_map_dma · b0908eee
      Daniel Jordan committed
      hulk inclusion
      category: feature
      bugzilla: 13228
      CVE: NA
      ---------------------------
      
      When starting a large-memory kvm guest, it takes an excessively long
      time to start the boot process because qemu must pin all guest pages to
      accommodate DMA when VFIO is in use.  Currently just one CPU is
      responsible for the page pinning, which usually boils down to page
      clearing time-wise, so the ways to optimize this are buying a faster
      CPU ;-) or using more of the CPUs you already have.
      
      Parallelize with ktask.  Refactor so workqueue workers pin with the mm
      of the calling thread, and to enable an undo callback for ktask to
      handle errors during page pinning.
      
      Performance results appear later in the series.
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Tested-by: Hongbo Yao <yaohongbo@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      b0908eee
    • workqueue, ktask: renice helper threads to prevent starvation · cdc79c13
      Daniel Jordan committed
      hulk inclusion
      category: feature
      bugzilla: 13228
      CVE: NA
      ---------------------------
      
      With ktask helper threads running at MAX_NICE, it's possible for one or
      more of them to begin chunks of the task and then have their CPU time
      constrained by higher priority threads.  The main ktask thread, running
      at normal priority, may finish all available chunks of the task and then
      wait on the MAX_NICE helpers to finish the last in-progress chunks, for
      longer than it would have if no helpers were used.
      
      Avoid this by having the main thread assign its priority to each
      unfinished helper one at a time so that on a heavily loaded system,
      exactly one thread in a given ktask call is running at the main thread's
      priority.  At least one thread to ensure forward progress, and at most
      one thread to limit excessive multithreading.
      
      Since the workqueue interface, on which ktask is built, does not provide
      access to worker threads, ktask can't adjust their priorities directly,
      so add a new interface to allow a previously-queued work item to run at
      a different priority than the one controlled by the corresponding
      workqueue's 'nice' attribute.  The worker assigned to the work item will
      run the work at the given priority, temporarily overriding the worker's
      priority.
      
      The interface is flush_work_at_nice, which ensures the given work item's
      assigned worker runs at the specified nice level and waits for the work
      item to finish.
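
      In other words, the interface boils down to something like the following
      (the exact signature is an assumption based on the description above,
      mirroring flush_work(); the field name in the usage line is hypothetical):

        /* Make the worker assigned to @work run at @nice, then wait for it. */
        bool flush_work_at_nice(struct work_struct *work, long nice);

        /* main ktask thread, lending its priority to one helper at a time */
        flush_work_at_nice(&kw->kw_work, task_nice(current));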
      
      An alternative choice would have been to simply requeue the work item to
      a pool with workers of the new priority, but this doesn't seem feasible
      because a worker may have already started executing the work and there's
      currently no way to interrupt it midway through.  The proposed interface
      solves this issue because a worker's priority can be adjusted while it's
      executing the work.
      
      TODO:  flush_work_at_nice is a proof-of-concept only, and it may be
      desired to have the interface set the work's nice without also waiting
      for it to finish.  It's implemented in the flush path for this RFC
      because it was fairly simple to write ;-)
      
      I ran tests similar to the ones in the last patch with a couple of
      differences:
       - The non-ktask workload uses 8 CPUs instead of 7 to compete with the
         main ktask thread as well as the ktask helpers, so that when the main
         thread finishes, its CPU is completely occupied by the non-ktask
         workload, meaning MAX_NICE helpers can't run as often.
       - The non-ktask workload starts before the ktask workload, rather
         than after, to maximize the chance that it starves helpers.
      
      Runtimes in seconds.
      
      Case 1: Synthetic, worst-case CPU contention
      
       ktask_test - a tight loop doing integer multiplication to max out on CPU;
                    used for testing only, does not appear in this series
       stress-ng  - cpu stressor ("-c --cpu-method ackerman --cpu-ops 1200");
      
                   8_ktask_thrs           8_ktask_thrs
                   w/o_renice(stdev)   with_renice  (stdev)  1_ktask_thr(stdev)
                   ------------------------------------------------------------
        ktask_test    41.98  ( 0.22)         25.15  ( 2.98)      30.40  ( 0.61)
        stress-ng     44.79  ( 1.11)         46.37  ( 0.69)      53.29  ( 1.91)
      
      Without renicing, ktask_test finishes just after stress-ng does because
      stress-ng needs to free up CPUs for the helpers to finish (ktask_test
      shows a shorter runtime than stress-ng because ktask_test was started
      later).  Renicing lets ktask_test finish 40% sooner, and running the
      same amount of work in ktask_test with 1 thread instead of 8 finishes in
      a comparable amount of time, though longer than "with_renice" because
      MAX_NICE threads still get some CPU time, and the effect over 8 threads
      adds up.
      
      stress-ng's total runtime gets a little longer going from no renicing to
      renicing, as expected, because each reniced ktask thread takes more CPU
      time than before when the helpers were starved.
      
      Running with one ktask thread, stress-ng's reported walltime goes up
      because that single thread interferes with fewer stress-ng threads,
      but with more impact, causing a greater spread in the time it takes for
      individual stress-ng threads to finish.  Averages of the per-thread
      stress-ng times from "with_renice" to "1_ktask_thr" come out roughly
      the same, though, 43.81 and 43.89 respectively.  So the total runtime of
      stress-ng across all threads is unaffected, but the time stress-ng takes
      to finish running its threads completely actually improves by spreading
      the ktask_test work over more threads.
      
      Case 2: Real-world CPU contention
      
       ktask_vfio - VFIO page pin a 32G kvm guest
       usemem     - faults in 86G of anonymous THP per thread, PAGE_SIZE stride;
                    used to mimic the page clearing that dominates in ktask_vfio
                    so that usemem competes for the same system resources
      
                   8_ktask_thrs           8_ktask_thrs
                   w/o_renice  (stdev)   with_renice  (stdev)  1_ktask_thr(stdev)
                   --------------------------------------------------------------
        ktask_vfio    18.59  ( 0.19)         14.62  ( 2.03)      16.24  ( 0.90)
            usemem    47.54  ( 0.89)         48.18  ( 0.77)      49.70  ( 1.20)
      
      These results are similar to case 1's, though the differences between
      times are not quite as pronounced because ktask_vfio ran shorter
      compared to usemem.
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Tested-by: Hongbo Yao <yaohongbo@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      cdc79c13
    • ktask: run helper threads at MAX_NICE · b9ca5261
      Daniel Jordan committed
      hulk inclusion
      category: feature
      bugzilla: 13228
      CVE: NA
      ---------------------------
      
      Multithreading may speed long-running kernel tasks, but overly
      optimistic parallelization can go wrong if too many helper threads are
      started on an already-busy system.  Such helpers can degrade the
      performance of other tasks, so they should be sensitive to current CPU
      utilization[1].
      
      To achieve this, run helpers at MAX_NICE so that their CPU time is
      proportional to idle CPU time.  The main thread that called into ktask
      naturally runs at its original priority so that it can make progress on
      a heavily loaded system, as it would if ktask were not in the picture.
      
      I tested two different cases in which a non-ktask and a ktask workload
      compete for the same CPUs with the goal of showing that normal priority
      (i.e. nice=0) ktask helpers cause the non-ktask workload to run more
      slowly, whereas MAX_NICE ktask helpers don't.
      
      Testing notes:
        - Each case was run using 8 CPUs on a large two-socket server, with a
          cpumask allowing all test threads to run anywhere within the 8.
        - The non-ktask workload used 7 threads and the ktask workload used 8
          threads to evaluate how much ktask helpers, rather than the main ktask
          thread, disturbed the non-ktask workload.
        - The non-ktask workload was started after the ktask workload and run
          for less time to maximize the chances that the non-ktask workload would
          be disturbed.
        - Runtimes in seconds.
      
      Case 1: Synthetic, worst-case CPU contention
      
          ktask_test - a tight loop doing integer multiplication to max out on
      		 CPU; used for testing only, does not appear in this series
          stress-ng  - cpu stressor ("-c --cpu-method ackerman --cpu-ops 1200");
      
                       stress-ng
                       alone  (stdev)   max_nice  (stdev)   normal_prio  (stdev)
                       ---------------------------------------------------------
          ktask_test                       96.87  ( 1.09)         90.81  ( 0.29)
          stress-ng    43.04  ( 0.00)      43.58  ( 0.01)         75.86  ( 0.39)
      
      This case shows MAX_NICE helpers make a significant difference compared
      to normal priority helpers, with stress-ng taking 76% longer to finish
      when competing with normal priority ktask threads than when run by
      itself, but only 1% longer when run with MAX_NICE helpers.  The 1% comes
      from the small amount of CPU time MAX_NICE threads are given despite
      their low priority.
      
      Case 2: Real-world CPU contention
      
          ktask_vfio - VFIO page pin a 175G kvm guest
          usemem     - faults in 25G of anonymous THP per thread, PAGE_SIZE
      		 stride; used to mimic the page clearing that dominates in
      		 ktask_vfio so that usemem competes for the same system
      		 resources
      
                       usemem
                       alone  (stdev)   max_nice  (stdev)   normal_prio  (stdev)
                      ---------------------------------------------------------
          ktask_vfio                      14.74  ( 0.04)          9.93  ( 0.09)
              usemem  10.45  ( 0.04)      10.75  ( 0.04)         14.14  ( 0.07)
      
      In the more realistic case 2, the effect is similar although not as
      pronounced.  The usemem threads take 35% longer to finish with normal
      priority ktask threads than when run alone, but only 3% longer when
      MAX_NICE is used.
      
      All ktask users outside of VFIO boil down to page clearing, so I imagine
      the results would be similar for them.
      
      [1] lkml.kernel.org/r/20171206143509.GG7515@dhcp22.suse.cz
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Tested-by: Hongbo Yao <yaohongbo@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      b9ca5261
    • ktask: add undo support · 69cc5b58
      Daniel Jordan committed
      hulk inclusion
      category: feature
      bugzilla: 13228
      CVE: NA
      ---------------------------
      
      Tasks can fail midway through their work.  To recover, the finished
      chunks of work need to be undone in a task-specific way.
      
      Allow ktask clients to pass an "undo" callback that is responsible for
      undoing one chunk of work.  To avoid multiple levels of error handling,
      do not allow the callback to fail.  For simplicity and because it's a
      slow path, undoing is not multithreaded.
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Tested-by: Hongbo Yao <yaohongbo@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      69cc5b58
    • ktask: multithread CPU-intensive kernel work · c48676ef
      Daniel Jordan committed
      hulk inclusion
      category: feature
      bugzilla: 13228
      CVE: NA
      ---------------------------
      
      A single CPU can spend an excessive amount of time in the kernel
      operating on large amounts of data.  Often these situations arise
      during initialization- and destruction-related tasks, where the data
      involved scales with system size. These long-running jobs can slow
      startup and shutdown of applications and the system itself while extra
      CPUs sit idle.
      
      To ensure that applications and the kernel continue to perform well as
      core counts and memory sizes increase, harness these idle CPUs to
      complete such jobs more quickly.
      
      ktask is a generic framework for parallelizing CPU-intensive work in the
      kernel.  The API is generic enough to add concurrency to many different
      kinds of tasks--for example, zeroing a range of pages or evicting a list
      of inodes--and aims to save its clients the trouble of splitting up the
      work, choosing the number of threads to use, maintaining an efficient
      concurrency level, starting these threads, and load balancing the work
      between them.
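
      For orientation, a client of the framework ends up looking roughly like
      the sketch below (approximated from the ktask documentation added later
      in this series; exact names and signatures may differ):

        /* thread function: process one chunk [start, end) of the work */
        static int clear_chunk(void *start, void *end, void *arg)
        {
            /* ... do one chunk's worth of work ... */
            return KTASK_RETURN_SUCCESS;
        }

        DEFINE_KTASK_CTL(ctl, clear_chunk, NULL, KTASK_MEM_CHUNK);
        int error = ktask_run(work_start, work_count, &ctl);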
      
      The Documentation patch earlier in this series, from which the above was
      swiped, has more background.
      
      Inspired by work from Pavel Tatashin, Steve Sistare, and Jonathan Adams.
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Suggested-by: Pavel Tatashin <Pavel.Tatashin@microsoft.com>
      Suggested-by: Steve Sistare <steven.sistare@oracle.com>
      Suggested-by: Jonathan Adams <jwadams@google.com>
      Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Tested-by: Hongbo Yao <yaohongbo@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      c48676ef
    • ktask: add documentation · 8c9ab6a2
      Daniel Jordan committed
      hulk inclusion
      category: feature
      bugzilla: 13228
      CVE: NA
      ---------------------------
      
      Motivates and explains the ktask API for kernel clients.
      Signed-off-by: Daniel Jordan <daniel.m.jordan@oracle.com>
      Signed-off-by: Hongbo Yao <yaohongbo@huawei.com>
      Reviewed-by: Xie XiuQi <xiexiuqi@huawei.com>
      Tested-by: Hongbo Yao <yaohongbo@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      8c9ab6a2
    • ext4: fix suspicious RCU usage warning in ext4_release_system_zone · 36f2d824
      zhangyi (F) committed
      hulk inclusion
      category: bugfix
      bugzilla: 18685
      CVE: NA
      
      -----------------------------
      
      The rcu_dereference() should be used under rcu_read_lock(), or else it
      will complain that it may be a suspicious RCU usage.
      
       WARNING: suspicious RCU usage
       [...]
       -----------------------------
       fs/ext4/block_validity.c:331 suspicious rcu_dereference_check() usage!
       [...]
      
      Because ext4_release_system_zone() always runs under the protection of
      sb->s_umount, the proper fix is to switch to
      rcu_dereference_protected() instead.
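
      The general rule being applied here (an illustrative pattern, not the
      ext4 hunk itself):

        /* reader side: must sit inside an RCU read-side critical section */
        rcu_read_lock();
        p = rcu_dereference(gp);
        /* ... use p ... */
        rcu_read_unlock();

        /* teardown/update side already serialized by an external lock:
         * document that instead of taking the RCU read lock */
        p = rcu_dereference_protected(gp, lockdep_is_held(&gp_lock));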
      
      Fixes: fb9fd3ade129be ("ext4: fix potential use after free in system zone via remount with noblock_validity")
      Reviewed-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com>
      Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Yi Zhang <yi.zhang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      36f2d824
    • ext4: fix integer overflow when calculating commit interval · 3e3e50fe
      zhangyi (F) committed
      hulk inclusion
      category: bugfix
      bugzilla: 16625
      CVE: NA
      ---------------------------
      
      If the user specifies a large enough value for the "commit=" option, it may
      trigger a signed integer overflow, which may cause sbi->s_commit_interval to
      become an unexpected large or small value, zero in particular.
      
      UBSAN: Undefined behaviour in ../fs/ext4/super.c:1592:31
      signed integer overflow:
      536870912 * 1000 cannot be represented in type 'int'
      [...]
      Call trace:
      [...]
      [<ffffff9008a2d120>] ubsan_epilogue+0x34/0x9c lib/ubsan.c:166
      [<ffffff9008a2d8b8>] handle_overflow+0x228/0x280 lib/ubsan.c:197
      [<ffffff9008a2d95c>] __ubsan_handle_mul_overflow+0x4c/0x68 lib/ubsan.c:218
      [<ffffff90086d070c>] handle_mount_opt fs/ext4/super.c:1592 [inline]
      [<ffffff90086d070c>] parse_options+0x1724/0x1a40 fs/ext4/super.c:1773
      [<ffffff90086d51c4>] ext4_remount+0x2ec/0x14a0 fs/ext4/super.c:4834
      [...]
      
      Although it is not a big deal, still silence the UBSAN warning by limiting
      the input value.
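
      Concretely, s_commit_interval is computed as HZ * arg in int arithmetic,
      and with HZ=1000 an input of 536870912 gives 536870912000, which cannot
      be represented in 32 bits. A bound check of roughly the following shape
      is enough (illustrative, not the verbatim hunk):

        if (arg > INT_MAX / HZ) {
            ext4_msg(sb, KERN_ERR,
                     "Invalid commit interval %d, "
                     "must be smaller than %d", arg, INT_MAX / HZ);
            return -1;
        }
        sbi->s_commit_interval = HZ * arg;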
      
      Link: https://patchwork.ozlabs.org/patch/1153233
      Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
      Reviewed-by: Jan Kara <jack@suse.cz>
      Reviewed-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com>
      Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      Reviewed-by: Yi Zhang <yi.zhang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      3e3e50fe
    • arm64: set all the CPU as present for dtb booting with 'CONFIG_ACPI' enabled · 015f6d7e
      Xiongfeng Wang committed
      hulk inclusion
      category: bugfix
      bugzilla: 20799
      CVE: NA
      ---------------------------
      
      The patch referenced in the Fixes tag below didn't consider the situation
      where we boot the system using a device tree while 'CONFIG_ACPI' is
      enabled. In this situation, we also need to set all the CPUs as present.
      
      Fixes: 280637f70ab5 ("arm64: mark all the GICC nodes in MADT as possible cpu")
      Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
      Reviewed-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      015f6d7e
    • xfs: fix missing ILOCK unlock when xfs_setattr_nonsize fails due to EDQUOT · 6eefee3a
      Darrick J. Wong committed
      mainline inclusion
      from mainline-v5.3-rc6
      commit 1fb254aa
      category: bugfix
      bugzilla: 13690
      CVE: CVE-2019-15538
      
      -------------------------------------------------
      
      Benjamin Moody reported to Debian that XFS partially wedges when a chgrp
      fails on account of being out of disk quota.  I ran his reproducer
      script:
      
      (and then as user dummy)
      
      $ dd if=/dev/urandom bs=1M count=50 of=/mnt/dummy/foo
      $ chgrp plugdev /mnt/dummy/foo
      
      and saw:
      Reviewed-by: zhengbin <zhengbin13@huawei.com>
      
      ================================================
      WARNING: lock held when returning to user space!
      5.3.0-rc5 #rc5 Tainted: G        W
      ------------------------------------------------
      chgrp/47006 is leaving the kernel with locks still held!
      1 lock held by chgrp/47006:
       #0: 000000006664ea2d (&xfs_nondir_ilock_class){++++}, at: xfs_ilock+0xd2/0x290 [xfs]
      
      ...which is clearly caused by xfs_setattr_nonsize failing to unlock the
      ILOCK after the xfs_qm_vop_chown_reserve call fails.  Add the missing
      unlock.
      
      Reported-by: benjamin.moody@gmail.com
      Fixes: 253f4911 ("xfs: better xfs_trans_alloc interface")
      Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: Dave Chinner <dchinner@redhat.com>
      Tested-by: Salvatore Bonaccorso <carnil@debian.org>
      Reviewed-by: zhengbin <zhengbin13@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      6eefee3a
    • dm space map metadata: fix missing store of apply_bops() return value · 1967da12
      ZhangXiaoxu committed
      mainline inclusion
      from mainline-v5.3-rc6
      commit ae148243
      category: bugfix
      bugzilla: 20701
      CVE: NA
      
      -------------------------------------------------
      
      In commit 6096d91a ("dm space map metadata: fix occasional leak
      of a metadata block on resize"), we refactored the commit logic into a new
      function 'apply_bops'.  But when that logic was replaced in out(), the
      return value was not stored.  This may lead to out() returning a wrong
      value to the caller.
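
      The fix is essentially a one-liner in out(), roughly (illustrative):

        if (smm->recursion_count == 1)
            r = apply_bops(smm);    /* previously the result was dropped */

        smm->recursion_count--;

        return r;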
      
      Fixes: 6096d91a ("dm space map metadata: fix occasional leak of a metadata block on resize")
      Cc: stable@vger.kernel.org
      Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com>
      Reviewed-by: Yufen Yu <yuyufen@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      1967da12
    • dm btree: fix order of block initialization in btree_split_beneath · 19f55ba1
      ZhangXiaoxu committed
      mainline inclusion
      from mainline-v5.3-rc6
      commit e4f9d601
      category: bugfix
      bugzilla: 20701
      CVE: NA
      
      -------------------------------------------------
      
      When btree_split_beneath() splits a node to two new children, it will
      allocate two blocks: left and right.  If right block's allocation
      failed, the left block will be unlocked and marked dirty.  If this
      happened, the left block's content is zero, because it wasn't
      initialized with the btree struct before the attempt to allocate the
      right block.  Upon return, when flushing the left block to disk, the
      validator will fail when checking this block.  Then a BUG_ON is raised.
      
      Fix this by completely initializing the left block before allocating and
      initializing the right block.
      
      Fixes: 4dcb8b57 ("dm btree: fix leak of bufio-backed block in btree_split_beneath error path")
      Cc: stable@vger.kernel.org
      Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com>
      Signed-off-by: Mike Snitzer <snitzer@redhat.com>
      Signed-off-by: ZhangXiaoxu <zhangxiaoxu5@huawei.com>
      Reviewed-by: Yufen Yu <yuyufen@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      19f55ba1
    • net: hns3: some functions can be static. · fff3596e
      liaoguojia committed
      driver inclusion
      category: bugfix
      bugzilla: NA
      CVE: NA
      
      These functions are used only in the file where they are declared,
      so add the static keyword to them to reduce coupling.
      The functions include:
      hclge_func_reset_sync_vf().

      This patch is a supplement to patch 26380194.
      Fix tag: 26380194 ("some functions can be static")
      
      Feature or Bugfix:Bugfix
      Signed-off-by: liaoguojia <liaoguojia@huawei.com>
      Reviewed-by: lipeng <lipeng321@huawei.com>
      Reviewed-by: Yunsheng Lin <linyunsheng@huawei.com>
      Reviewed-by: Yang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
      fff3596e
    • net: hns3: Fixed incorrect type in assignment. · 8bc358e9
      liaoguojia committed
      driver inclusion
      category: bugfix
      bugzilla: NA
      CVE: NA
      
      This patch fixes some incorrect types in assignments reported by sparse.
      The sparse warnings are listed below; a typical fix pattern is sketched after the list:
      - warning : restricted __le16 degrades to integer
      - warning : cast from restricted __le32
      - warning : expected restricted __le32
      - warning : cast from restricted __be32
      - warning : cast from restricted __be16
      - warning : cast to restricted __le16
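
      The usual way to address such warnings is to convert explicitly with
      the kernel endianness helpers instead of mixing __le16/__le32 values
      with plain integers; a minimal sketch with hypothetical field names,
      not the driver's actual code:

      struct rx_desc {
              __le16 vlan_tag;        /* stored little-endian in the descriptor */
              __le32 pkt_len;
      };

      static void fill_desc(struct rx_desc *desc, u16 tag, u32 *len)
      {
              /* convert to CPU byte order before doing arithmetic on it */
              *len = le32_to_cpu(desc->pkt_len);

              /* convert to little-endian before writing into the descriptor */
              desc->vlan_tag = cpu_to_le16(tag);
      }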
      
      Feature or Bugfix:Bugfix
      Signed-off-by: Nliaoguojia <liaoguojia@huawei.com>
      Reviewed-by: Nlipeng <lipeng321@huawei.com>
      Reviewed-by: NYunsheng Lin <linyunsheng@huawei.com>
      Reviewed-by: NYang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      8bc358e9
    • S
      net: hns3: fix HCLGE_SWITCH_ALW_LPBK_B set error · 5cc236fd
      Committed by shenjian
      driver inclusion
      category: bugfix
      bugzilla: NA
      CVE: NA
      
      When doing the loopback test, HCLGE_SWITCH_ALW_LPBK_B should be set
      instead of HCLGE_SWITCH_ALW_LCL_LPBK_B; otherwise the loopback test
      may fail.
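
      A minimal sketch of the one-bit change this describes; the command
      structure and variable names here are assumptions for illustration,
      not the driver's exact code:

      /* enable the "allow loopback" bit rather than "allow local loopback" */
      hnae3_set_bit(req->switch_param, HCLGE_SWITCH_ALW_LPBK_B, enable ? 1 : 0);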
      
      Feature or Bugfix:Bugfix
      Signed-off-by: Nshenjian (K) <shenjian15@huawei.com>
      Reviewed-by: Nhuangdaode <huangdaode@hisilicon.com>
      Reviewed-by: NYunsheng Lin <linyunsheng@huawei.com>
      Reviewed-by: NYang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      5cc236fd
    • S
      net: hns3: add support for spoof check setting · fea4e7dc
      Committed by shenjian
      driver inclusion
      category: bugfix
      bugzilla: NA
      CVE: NA
      
      This patch adds support for spoof check configuration for VFs.
      When it is enabled, "spoof checking" is done for both the MAC address
      and VLAN. For each VF, the HW ensures that the source MAC address
      (or VLAN) of every outgoing packet exists in the MAC-list (or
      VLAN-list) configured for RX filtering for that VF. If not,
      the packet is dropped.
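
      A conceptual sketch of the check the hardware performs per VF when
      spoof checking is enabled; all names here are hypothetical, and the
      real check is done in hardware, not in driver code:

      static bool vf_egress_allowed(const struct vf_filter *vf,
                                    const u8 *src_mac, u16 vlan_id)
      {
              /* source MAC must be in the VF's RX-filter MAC list */
              if (!mac_list_contains(vf->mac_list, src_mac))
                      return false;

              /* a tagged packet's VLAN must be in the VF's VLAN list */
              if (vlan_id && !vlan_list_contains(vf->vlan_list, vlan_id))
                      return false;

              return true;    /* otherwise the packet is dropped */
      }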
      
      Feature or Bugfix:Bugfix
      Signed-off-by: Nshenjian (K) <shenjian15@huawei.com>
      Reviewed-by: Nlipeng <lipeng321@huawei.com>
      Reviewed-by: NYunsheng Lin <linyunsheng@huawei.com>
      Reviewed-by: NYang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      fea4e7dc
    • A
      arm64/numa: Report correct memblock range for the dummy node · 65246909
      Committed by Anshuman Khandual
      mainline inclusion
      from mainline-v4.20-rc5
      commit 77cfe950
      category: bugfix
      bugzilla: 5611
      CVE: NA
      
      The dummy node ID is marked into all memory ranges on the system. So the
      dummy node really extends the entire memblock.memory. Hence report correct
      extent information for the dummy node using memblock range helper functions
      instead of the range [0LLU, PFN_PHYS(max_pfn) - 1)].
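
      A sketch of what reporting the dummy node with memblock range helpers
      looks like (the log text is illustrative, not necessarily the exact
      upstream line):

      pr_info("NUMA: Faking a node at [mem %#018Lx-%#018Lx]\n",
              memblock_start_of_DRAM(), memblock_end_of_DRAM() - 1);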
      
      Fixes: 1a2db300 ("arm64, numa: Add NUMA support for arm64 platforms")
      Acked-by: NPunit Agrawal <punit.agrawal@arm.com>
      Signed-off-by: NAnshuman Khandual <anshuman.khandual@arm.com>
      Signed-off-by: NCatalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: NXuefeng Wang <wxf.wang@hisilicon.com>
      Reviewed-by: NZhen Lei <thunder.leizhen@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      65246909
    • D
      bpf, arm64: fix getting subprog addr from aux for calls · f7dcdf17
      Committed by Daniel Borkmann
      mainline inclusion
      from mainline-v4.20-rc5
      commit 8c11ea5c
      category: bugfix
      bugzilla: 5654
      CVE: NA
      
      The arm64 JIT has the same issue as ppc64 JIT in that the relative BPF
      to BPF call offset can be too far away from core kernel in that relative
      encoding into imm is not sufficient and could potentially be truncated,
      see also fd045f6c ("arm64: add support for module PLTs") which adds
      spill-over space for module_alloc() and therefore bpf_jit_binary_alloc().
      Therefore, use the recently added bpf_jit_get_func_addr() helper for
      properly fetching the address through prog->aux->func[off]->bpf_func
      instead. This also has the benefit to optimize normal helper calls since
      their address can use the optimized emission. Tested on Cavium ThunderX
      CN8890.
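
      A hedged sketch of how a JIT can resolve a call target through the
      helper; emit_call() and the surrounding variables are illustrative,
      not the arm64 JIT's exact code:

      u64 func_addr;
      bool func_addr_fixed;
      int ret;

      /* resolves prog->aux->func[off]->bpf_func for BPF-to-BPF calls,
       * or __bpf_call_base + imm for normal helper calls */
      ret = bpf_jit_get_func_addr(prog, insn, extra_pass,
                                  &func_addr, &func_addr_fixed);
      if (ret < 0)
              return ret;

      /* a fixed (helper) address allows an optimized immediate emission */
      emit_call(func_addr, func_addr_fixed, ctx);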
      
      Fixes: db496944 ("bpf: arm64: add JIT support for multi-function programs")
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NXuefeng Wang <wxf.wang@hisilicon.com>
      Reviewed-by: NHanjun Guo <guohanjun@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      f7dcdf17
    • D
      bpf, ppc64: generalize fetching subprog into bpf_jit_get_func_addr · 51778776
      Committed by Daniel Borkmann
      mainline inclusion
      from mainline-v4.20-rc5
      commit e2c95a61
      category: bugfix
      bugzilla: 5654
      CVE: NA
      
      Make fetching of the BPF call address from ppc64 JIT generic. ppc64
      was using a slightly different variant rather than through the insns'
      imm field encoding as the target address would not fit into that space.
      Therefore, the target subprog number was encoded into the insns' offset
      and fetched through fp->aux->func[off]->bpf_func instead. Given there
      are other JITs with this issue and the mechanism of fetching the address
      is JIT-generic, move it into the core as a helper instead. On the JIT
      side, we get information on whether the retrieved address is a fixed
      one, that is, not changing through JIT passes, or a dynamic one. For
      the former, JITs can optimize their imm emission because this doesn't
      change jump offsets throughout JIT process.
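
      A simplified sketch of the core of such a helper, based on the
      mechanism described above (subprog targets resolved through
      fp->aux->func[off], helpers through __bpf_call_base + imm); error
      handling is elided:

      if (insn->src_reg == BPF_PSEUDO_CALL) {
              /* BPF-to-BPF call: final subprog addresses are only known
               * in the extra pass, once all subprogs have been JITed */
              addr = extra_pass ? (u8 *)prog->aux->func[insn->off]->bpf_func
                                : NULL;
              *func_addr_fixed = false;
      } else {
              /* helper call: a fixed address inside the core kernel */
              addr = (u8 *)__bpf_call_base + insn->imm;
              *func_addr_fixed = true;
      }
      *func_addr = (unsigned long)addr;
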
      Signed-off-by: NDaniel Borkmann <daniel@iogearbox.net>
      Reviewed-by: NSandipan Das <sandipan@linux.ibm.com>
      Tested-by: NSandipan Das <sandipan@linux.ibm.com>
      Signed-off-by: NAlexei Starovoitov <ast@kernel.org>
      Signed-off-by: NXuefeng Wang <wxf.wang@hisilicon.com>
      Reviewed-by: NHanjun Guo <guohanjun@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      51778776
    • Y
      net: hns3: fix shaper parameter algorithm · ea3c3ca1
      Committed by Yonglong Liu
      driver inclusion
      category: bugfix
      bugzilla: NA
      CVE: NA
      
      The HNS3 driver uses an unsigned int to calculate the shaper
      parameter. When the configured bandwidth is small, e.g. 1M, the actual
      result on the chip side is 1.28M. This bug only appears when the
      bandwidth is less than 20M.
      
      This patch increments ir_s one more time when the calculated result
      equals the configured bandwidth, so that a value closer to the
      configured bandwidth can be obtained.
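
      A conceptual fragment following the description above; the variable
      names (ir, ir_calc, ir_s_calc) are assumptions, not the driver's exact
      code:

      /* when the computed rate lands exactly on the requested one, bump
       * ir_s once more so the encoded rate ends up closer to the request */
      if (ir_calc == ir)
              ir_s_calc++;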
      
      Feature or Bugfix:Bugfix
      Signed-off-by: NYonglong Liu <liuyonglong@huawei.com>
      Reviewed-by: Nlinyunsheng <linyunsheng@huawei.com>
      Reviewed-by: NYang Yingliang <yangyingliang@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      ea3c3ca1
    • D
      xfs: fix off-by-one error in rtbitmap cross-reference · 80bcc5f0
      Committed by Darrick J. Wong
      mainline inclusion
      from mainline-5.1-rc1
      commit 87c9607d
      category: bugfix
      bugzilla: 18943
      CVE: NA
      
      ---------------------------
      
      Fix an off-by-one error in the realtime bitmap "is used" cross-reference
      helper function if the realtime extent size is a single block.
      Signed-off-by: NDarrick J. Wong <darrick.wong@oracle.com>
      Reviewed-by: NBrian Foster <bfoster@redhat.com>
      Signed-off-by: Nyu kuai <yukuai3@huawei.com>
      Reviewed-by: Nzhengbin <zhengbin13@huawei.com>
      Signed-off-by: NYang Yingliang <yangyingliang@huawei.com>
      80bcc5f0