1. 02 September 2020, 40 commits
    • x86/cpu/amd: Call init_amd_zn() on Family 19h processors too · 696fcb77
      Kim Phillips committed
      fix #29035100
      
      commit 753039ef8b2f1078e5bff8cd42f80578bf6385b0 upstream
      
      Family 19h CPUs are Zen-based and still share most architectural
      features with Family 17h CPUs, and therefore still need to call
      init_amd_zn() e.g., to set the RECLAIM_DISTANCE override.
      
      init_amd_zn() also sets X86_FEATURE_ZEN, which today is only used
      in amd_set_core_ssb_state(), which isn't called on some late
      model Family 17h CPUs, nor on any Family 19h CPUs:
      X86_FEATURE_AMD_SSBD replaces X86_FEATURE_LS_CFG_SSBD on those
      later model CPUs, where the SSBD mitigation is done via the
      SPEC_CTRL MSR instead of the LS_CFG MSR.
      
      Family 19h CPUs also don't have the erratum where the CPB feature
      bit isn't set, but that code can stay unchanged and run safely
      on Family 19h.
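      
      A minimal sketch of the dispatch this describes (the case labels are
      illustrative of the change, not the exact upstream diff):
      
          /* in init_amd(): both Zen-based families share init_amd_zn() */
          switch (c->x86) {
          case 0x17:
          case 0x19:
              init_amd_zn(c); /* sets X86_FEATURE_ZEN, RECLAIM_DISTANCE override */
              break;
          }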
      Signed-off-by: Kim Phillips <kim.phillips@amd.com>
      Signed-off-by: Borislav Petkov <bp@suse.de>
      Link: https://lkml.kernel.org/r/20200311191451.13221-1-kim.phillips@amd.com
      Signed-off-by: Peng Wang <rocking@linux.alibaba.com>
      Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
      696fcb77
    • perf/x86/amd: Add support for Large Increment per Cycle Events · 52ef78f4
      Kim Phillips committed
      fix #29035100
      
      commit 5738891229a25e9e678122a843cbf0466a456d0c upstream
      
      Description of hardware operation
      ---------------------------------
      
      The core AMD PMU has a 4-bit wide per-cycle increment for each
      performance monitor counter.  That works for most events, but
      now with AMD Family 17h and above processors, some events can
      occur more than 15 times in a cycle.  Those events are called
      "Large Increment per Cycle" events. In order to count these
      events, two adjacent h/w PMCs get their count signals merged
      to form 8 bits per cycle total.  In addition, the PERF_CTR count
      registers are merged to be able to count up to 64 bits.
      
      Normally, events like instructions retired get programmed on a single
      counter like so:
      
      PERF_CTL0 (MSR 0xc0010200) 0x000000000053ff0c # event 0x0c, umask 0xff
      PERF_CTR0 (MSR 0xc0010201) 0x0000800000000001 # r/w 48-bit count
      
      The next counter at MSRs 0xc0010202-3 remains unused, or can be used
      independently to count something else.
      
      When counting Large Increment per Cycle events, such as FLOPs,
      however, we now have to reserve the next counter and program the
      PERF_CTL (config) register with the Merge event (0xFFF), like so:
      
      PERF_CTL0 (msr 0xc0010200) 0x000000000053ff03 # FLOPs event, umask 0xff
      PERF_CTR0 (msr 0xc0010201) 0x0000800000000001 # rd 64-bit cnt, wr lo 48b
      PERF_CTL1 (msr 0xc0010202) 0x0000000f004000ff # Merge event, enable bit
      PERF_CTR1 (msr 0xc0010203) 0x0000000000000000 # wr hi 16-bits count
      
      The count is widened from the normal 48 bits to 64 bits by having the
      second counter carry the upper 16 bits of the count in the lower 16
      bits of its own counter register.
      
      The odd counter, e.g., PERF_CTL1, is programmed with the enabled Merge
      event before the even counter, PERF_CTL0.
      
      The Large Increment feature is available starting with Family 17h.
      For more details, search any Family 17h PPR for the "Large Increment
      per Cycle Events" section, e.g., section 2.1.15.3 on p. 173 in this
      version:
      
      https://www.amd.com/system/files/TechDocs/56176_ppr_Family_17h_Model_71h_B0_pub_Rev_3.06.zip
      
      Description of software operation
      ---------------------------------
      
      The following steps are taken in order to support reserving and
      enabling the extra counter for Large Increment per Cycle events:
      
      1. In the main x86 scheduler, we reduce the number of available
      counters by the number of Large Increment per Cycle events being
      scheduled, tracked by a new cpuc variable 'n_pair' and a new
      amd_put_event_constraints_f17h().  This improves the counter
      scheduler success rate.
      
      2. In perf_assign_events(), if a counter is assigned to a Large
      Increment event, we increment the current counter variable, so the
      counter used for the Merge event is removed from assignment
      consideration by upcoming event assignments.
      
      3. In find_counter(), if a counter has been found for the Large
      Increment event, we set the next counter as used, to prevent other
      events from using it.
      
      4. We perform steps 2 & 3 also in the x86 scheduler fastpath, i.e.,
      we add Merge event accounting to the existing used_mask logic.
      
      5. Finally, we add on the programming of Merge event to the
      neighbouring PMC counters in the counter enable/disable{_all}
      code paths.
      
      Currently, software does not support a single PMU with mixed 48- and
      64-bit counting, so Large increment event counts are limited to 48
      bits.  In set_period, we zero-out the upper 16 bits of the count, so
      the hardware doesn't copy them to the even counter's higher bits.
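      
      A minimal sketch of that masking (the helper and mask forms are
      illustrative, not the exact upstream code):
      
          /* Large Increment events still count 48 bits: clear bits 63:48
           * of the period so the spill into the odd counter stays zero */
          if (is_counter_pair(hwc))
              period &= GENMASK_ULL(47, 0);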
      
      Simple invocation example showing counting 8 FLOPs per 256-bit/%ymm
      vaddps instruction executed in a loop 100 million times:
      
      perf stat -e cpu/fp_ret_sse_avx_ops.all/,cpu/instructions/ <workload>
      
       Performance counter stats for '<workload>':
      
             800,000,000      cpu/fp_ret_sse_avx_ops.all/u
             300,042,101      cpu/instructions/u
      
      Prior to this patch, the reported SSE/AVX FLOPs retired count would
      be wrong.
      
      [peterz: lots of renames and edits to the code]
      Signed-off-by: Kim Phillips <kim.phillips@amd.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Peng Wang <rocking@linux.alibaba.com>
      Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
      52ef78f4
    • perf/x86/amd: Constrain Large Increment per Cycle events · e29ecc26
      Kim Phillips committed
      fix #29035100
      
      commit 471af006a747f1c535c8a8c6c0973c320fe01b22 upstream
      
      AMD Family 17h processors and above gain support for Large Increment
      per Cycle events.  Unfortunately there is no CPUID or equivalent bit
      that indicates whether the feature exists or not, so we continue to
      determine eligibility based on a CPU family number comparison.
      
      For Large Increment per Cycle events, we add a f17h-and-compatibles
      get_event_constraints_f17h() that returns an even counter bitmask:
      Large Increment per Cycle events can only be placed on PMCs 0, 2,
      and 4 out of the currently available 0-5.  The only currently
      public event that requires this feature to report valid counts
      is PMCx003 "Retired SSE/AVX Operations".
      
      Note that the CPU family logic in amd_core_pmu_init() is changed
      so as to be able to selectively add initialization for features
      available in ranges of backward-compatible CPU families.  This
      Large Increment per Cycle feature is expected to be retained
      in future families.
      
      A side-effect of assigning a new get_constraints function for f17h is
      that it disables the old (prior to f15h) amd_get_event_constraints
      implementation left enabled by commit e40ed154 ("perf/x86: Add perf
      support for AMD family-17h processors"), which is no longer
      necessary since those North Bridge event codes are obsolete.
      
      Also fix a spelling mistake whilst in the area (calulating ->
      calculating).
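      
      A hedged sketch of the f17h constraint hook (names and the exact
      constraint encoding are abbreviated; the real code lives in
      arch/x86/events/amd/core.c):
      
          static struct event_constraint *
          amd_get_event_constraints_f17h(struct cpu_hw_events *cpuc, int idx,
                                         struct perf_event *event)
          {
              /* pair_constraint's counter mask covers only the even
               * PMCs 0, 2 and 4, i.e. 0b010101 */
              if (amd_is_pair_event_code(&event->hw))
                  return &pair_constraint;
              return &unconstrained;
          }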
      
      Fixes: e40ed154 ("perf/x86: Add perf support for AMD family-17h processors")
      Signed-off-by: Kim Phillips <kim.phillips@amd.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/20191114183720.19887-2-kim.phillips@amd.com
      Signed-off-by: Peng Wang <rocking@linux.alibaba.com>
      Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
      e29ecc26
    • perf/x86: Add helper to obtain performance counter index · bfd2cd67
      Reinette Chatre committed
      fix #29035100
      
      commit 1182a49529edde899be4b4f0e1ab76e626976eb6 upstream
      
      perf_event_read_local() is the safest way to obtain measurements
      associated with performance events. In some cases the overhead
      introduced by perf_event_read_local() affects the measurements and the
      use of rdpmcl() is needed. rdpmcl() requires the index
      of the performance counter used, so a helper is introduced to
      determine the index used by a provided performance event.
      
      The index used by a performance event may change when interrupts are
      enabled. A check is added to ensure that the index is only accessed
      with interrupts disabled. Even with this check, the use of this
      counter needs to be done with care: it must be queried and used
      within the same interrupts-disabled section.
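      
      A hedged usage sketch honoring that requirement (the surrounding
      code is illustrative):
      
          unsigned long flags;
          u64 count;
          int index;
      
          local_irq_save(flags);
          index = x86_perf_rdpmc_index(event); /* valid only with IRQs off */
          rdpmcl(index, count);                /* read the PMC directly */
          local_irq_restore(flags);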
      
      This change introduces a new checkpatch warning:
      CHECK: extern prototypes should be avoided in .h files
      +extern int x86_perf_rdpmc_index(struct perf_event *event);
      
      This warning was discussed and designated as a false positive in
      http://lkml.kernel.org/r/20180919091759.GZ24124@hirez.programming.kicks-ass.net
      Suggested-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Reinette Chatre <reinette.chatre@intel.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: fenghua.yu@intel.com
      Cc: tony.luck@intel.com
      Cc: acme@kernel.org
      Cc: gavin.hindman@intel.com
      Cc: jithu.joseph@intel.com
      Cc: dave.hansen@intel.com
      Cc: hpa@zytor.com
      Link: https://lkml.kernel.org/r/b277ffa78a51254f5414f7b1bc1923826874566e.1537377064.git.reinette.chatre@intel.com
      Signed-off-by: Peng Wang <rocking@linux.alibaba.com>
      Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
      bfd2cd67
    • configs: enable AF_XDP socket by default · ab4f8724
      Dust Li committed
      to #29272054
      
      AF_XDP is a new address family that lets userspace applications
      communicate with XDP programs directly.
      One promising use case is UDP.
      
      It is enabled for both x86_64 and aarch64.
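      
      The resulting config fragment would look like this (assuming the
      standard kconfig symbol for AF_XDP sockets):
      
          CONFIG_XDP_SOCKETS=y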
      Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      ab4f8724
    • Intel: perf/x86/intel/uncore: Add Ice Lake server uncore support · c4af4e97
      Kan Liang committed
      fix #29130534
      
      commit 2b3b76b5ec67568da4bb475d3ce8a92ef494b5de upstream.
      
      Backport summary: Backport to kernel 4.19.57 for ICX uncore support.
      
      The uncore subsystem in Ice Lake server is similar to the previous
      server. There are some differences in config register encoding and pci
      device IDs. The uncore PMON units in Ice Lake server include Ubox,
      Chabox, IIO, IRP, M2PCIE, PCU, M2M, PCIE3 and IMC.
      
       - For CHA, the filter 1 register has been removed. The filter 0
         register can be used by any of the CHA events to be filtered by
         Thread/Core-ID. To do so, the control register's tid_en bit must
         be set to 1.
       - For IIO, there are some changes on event constraints. The MSR address
         and MSR offsets among counters are also changed.
       - For IRP, the MSR address and MSR offsets among counters are changed.
       - For M2PCIE, the counters are accessed by MSR now. Add new MSR address
         and MSR offsets. Change event constraints.
       - To determine the number of CHAs, have to read CAPID6(Low) and CAPID7
         (High) now.
       - For M2M, update the PCICFG address and Device ID.
       - For UPI, update the PCICFG address, Device ID and counter address.
       - For M3UPI, update the PCICFG address, Device ID, counter address and
         event constraints.
       - For IMC, update the formula to calculate the MMIO BAR address,
         which is MMIO_BASE + specific MEM_BAR offset.
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Link: https://lkml.kernel.org/r/1585842411-150452-1-git-send-email-kan.liang@linux.intel.com
      Signed-off-by: Yunying Sun <yunying.sun@intel.com>
      Signed-off-by: Peng Wang <rocking@linux.alibaba.com>
      Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
      c4af4e97
    • Intel: perf/x86/intel/uncore: Add box_offsets for free-running counters · 4b559ab4
      Kan Liang committed
      fix #29130534
      
      commit bc88a2fe216a51e8ab46d61f89d0c1b5a400470e upstream.
      
      Backport summary: Backport to kernel 4.19.57 for ICX uncore support.
      
      The offset between uncore boxes of free-running counters varies, e.g.
      IIO free-running counters on Ice Lake server.
      
      Add box_offsets, an array of offsets between adjacent uncore boxes.
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/1584470314-46657-1-git-send-email-kan.liang@linux.intel.com
      Signed-off-by: Yunying Sun <yunying.sun@intel.com>
      Signed-off-by: Peng Wang <rocking@linux.alibaba.com>
      Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
      4b559ab4
    • Intel: perf/x86/intel/uncore: Factor out __snr_uncore_mmio_init_box · 6b7f290f
      Kan Liang committed
      fix #29130534
      
      commit 3442a9ecb8e72a33c28a2b969b766c659830e410 upstream.
      
      Backport summary: Backport to kernel 4.19.57 for ICX uncore support.
      
      The IMC uncore unit in Ice Lake server can only be accessed by MMIO,
      similar to Snow Ridge.
      Factor out __snr_uncore_mmio_init_box so it can be shared with Ice
      Lake server in the following patch.
      
      No functional changes.
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/1584470314-46657-2-git-send-email-kan.liang@linux.intel.com
      Signed-off-by: Yunying Sun <yunying.sun@intel.com>
      Signed-off-by: Peng Wang <rocking@linux.alibaba.com>
      Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
      6b7f290f
    • Intel: perf/x86/intel/uncore: Add IMC uncore support for Snow Ridge · 28661c6a
      Kan Liang committed
      fix #29130534
      
      commit ee49532b38dd084650bf715eabe7e3828fb8d275 upstream.
      
      Backport summary: Backport to kernel 4.19.57 for ICX uncore support.
      
      The IMC uncore unit can only be accessed via MMIO on Snow Ridge.
      The MMIO space of IMC uncore is at the specified offsets from the
      MEM0_BAR. Add snr_uncore_get_mc_dev() to locate the PCI device with
      MMIO_BASE and MEM0_BAR register.
      
      Add new ops to access the IMC registers via MMIO.
      
      Add 3 new free running counters for clocks, read and write bandwidth.
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@kernel.org
      Cc: eranian@google.com
      Link: https://lkml.kernel.org/r/1556672028-119221-7-git-send-email-kan.liang@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Yunying Sun <yunying.sun@intel.com>
      Signed-off-by: Peng Wang <rocking@linux.alibaba.com>
      Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
      28661c6a
    • Intel: perf/x86/intel/uncore: Clean up client IMC · 4f42d8f8
      Kan Liang committed
      fix #29130534
      
      commit 07ce734dd8adc0f170d43c15a9b91b707a21b9d7 upstream.
      
      Backport summary: Backport to kernel 4.19.57 for ICX uncore support.
      
      The client IMC block is accessed by MMIO. Current code uses an informal
      way to access the block, which is not recommended.
      
      Clean up the code by using __iomem annotation and the accessor
      functions (read[lq]()).
      
      Move exit_box() and read_counter() to generic code, which can be shared
      with the server code later.
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@kernel.org
      Cc: eranian@google.com
      Link: https://lkml.kernel.org/r/1556672028-119221-6-git-send-email-kan.liang@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Yunying Sun <yunying.sun@intel.com>
      Signed-off-by: Peng Wang <rocking@linux.alibaba.com>
      Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
      4f42d8f8
    • Intel: perf/x86/intel/uncore: Support MMIO type uncore blocks · 4599feef
      Kan Liang committed
      fix #29130534
      
      commit 3da04b8a00dd6d39970b9e764b78c5dfb40ec013 upstream.
      
      Backport summary: Backport to kernel 4.19.57 for ICX uncore support.
      
      A new MMIO type uncore box is introduced on Snow Ridge server. The
      counters of MMIO type uncore box can only be accessed by MMIO.
      
      Add a new uncore type, uncore_mmio_uncores, for MMIO type uncore blocks.
      
      Support MMIO type uncore blocks in CPU hot plug. The MMIO space has to
      be mapped/unmapped for the first/last CPU. The context also needs to
      be migrated if the bound CPU changes.
      
      Add mmio_init() to init and register PMUs for MMIO type uncore blocks.
      
      Add a helper to calculate the box_ctl address.
      
      The helpers which calculate ctl/ctr can be shared with PCI type uncore
      blocks.
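      
      A minimal sketch of the map/unmap lifecycle this describes
      (BOX_MAP_SIZE and the function names here are illustrative, not the
      exact driver code):
      
          /* first CPU of the package: map the box's MMIO counter space */
          static void uncore_mmio_init_box(struct intel_uncore_box *box)
          {
              resource_size_t addr = uncore_mmio_box_ctl(box); /* box_ctl helper */
      
              box->io_addr = ioremap(addr, BOX_MAP_SIZE);
          }
      
          /* last CPU of the package: tear the mapping down again */
          static void uncore_mmio_exit_box(struct intel_uncore_box *box)
          {
              if (box->io_addr)
                  iounmap(box->io_addr);
          }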
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@kernel.org
      Cc: eranian@google.com
      Link: https://lkml.kernel.org/r/1556672028-119221-5-git-send-email-kan.liang@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Yunying Sun <yunying.sun@intel.com>
      Signed-off-by: Peng Wang <rocking@linux.alibaba.com>
      Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
      4599feef
    • Intel: perf/x86/intel/uncore: Factor out box ref/unref functions · 3df4e38a
      Kan Liang committed
      fix #29130534
      
      commit c8872d90e0a3651a096860d3241625ccfa1647e0 upstream.
      
      Backport summary: Backport to kernel 4.19.57 for ICX uncore support.
      
      For an uncore box which can only be accessed by MSR, its reference
      box->refcnt is updated in CPU hot plug. The uncore boxes need to be
      initialized and exited accordingly for the first/last CPU of a socket.
      
      Starting from the Snow Ridge server, a new type of uncore box is
      introduced, which can only be accessed by MMIO. The driver needs to
      map/unmap MMIO space for the first/last CPU of a socket.
      
      Extract the box ref/unref and init/exit code for reuse later.
      
      There is no functional change.
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@kernel.org
      Cc: eranian@google.com
      Link: https://lkml.kernel.org/r/1556672028-119221-4-git-send-email-kan.liang@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Yunying Sun <yunying.sun@intel.com>
      Signed-off-by: Peng Wang <rocking@linux.alibaba.com>
      Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
      3df4e38a
    • Intel: perf/x86/intel/uncore: Add uncore support for Snow Ridge server · 06253540
      Kan Liang committed
      fix #29130534
      
      commit 210cc5f9db7a5c66b7ca6290b7d35cc7db7e9dbd upstream.
      
      Backport summary: Backport to kernel 4.19.57 for ICX uncore support.
      
      The uncore subsystem on Snow Ridge is similar to the previous SKX server.
      The uncore units on Snow Ridge include Ubox, Chabox, IIO, IRP, M2PCIE,
      PCU, M2M, PCIE3 and IMC.
      
      - The config register encoding and pci device IDs are changed.
      - For CHA, the umask_ext and filter_tid fields are changed.
      - For IIO, the ch_mask and fc_mask fields are changed.
      - For M2M, the mask_ext field is changed.
      - Add new PCIe3 unit for PCIe3 root port which provides the interface
        between PCIe devices, plugged into the PCIe port, and the components
        (in M2IOSF).
      - IMC can only be accessed via MMIO on Snow Ridge now. Current common
        code doesn't support it yet. IMC will be supported in following
        patches.
      - There are 9 free running counters for IIO CLOCKS and bandwidth In.
      - The full uncore event list is not published yet. Event constraints
        are not included in this patch. They will be added later separately.
      Signed-off-by: Kan Liang <kan.liang@linux.intel.com>
      Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@kernel.org
      Cc: eranian@google.com
      Link: https://lkml.kernel.org/r/1556672028-119221-3-git-send-email-kan.liang@linux.intel.com
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Yunying Sun <yunying.sun@intel.com>
      Signed-off-by: Peng Wang <rocking@linux.alibaba.com>
      Reviewed-by: Artie Ding <artie.ding@linux.alibaba.com>
      06253540
    • alinux: block-throttle: only do io statistics if needed · b8a94ed8
      Xiaoguang Wang committed
      task #29063222
      
      The current blk throttle code always does io statistics even when
      users haven't specified valid throttle rules, which introduces
      significant overhead for applications that don't use the blk throttle
      function at all, and is worse on arm; see the perf data below,
      captured on arm:
      
      sudo taskset -c 66 fio -ioengine=io_uring -sqthread_poll=1 -hipri=1
      -sqthread_poll_cpu=65 -registerfiles=1 -fixedbufs=1 -direct=1
      -filename=/dev/nvme0n1 -bs=4k -iodepth=8 -rw=randwrite  -time_based
      -ramp_time=30 -runtime=60  -name="test"
      
      Samples: 25K of event 'cycles', Event count (approx.): 16586974662
      Overhead  Command      Shared Object      Symbol
         3.54%  io_uring-sq  [kernel.kallsyms]  [k] throtl_stats_update_completion
         0.89%  io_uring-sq  [kernel.kallsyms]  [k] throtl_bio_end_io
         0.66%  io_uring-sq  [kernel.kallsyms]  [k] blk_throtl_bio
         0.05%  io_uring-sq  [kernel.kallsyms]  [k] blk_throtl_stat_add
         0.05%  io_uring-sq  [kernel.kallsyms]  [k] throtl_track_latency
         0.01%  io_uring-sq  [kernel.kallsyms]  [k] blk_throtl_bio_endio
      
      Samples: 25K of event 'cycles', Event count (approx.): 16586974662
      Overhead  Command      Shared Object      Symbol
         1.62%  io_uring-sq  [kernel.kallsyms]  [k] io_submit_sqes
         1.06%  io_uring-sq  [kernel.kallsyms]  [k] io_issue_sqe
         0.32%  io_uring-sq  [kernel.kallsyms]  [k] __io_queue_sqe
         0.06%  io_uring-sq  [kernel.kallsyms]  [k] io_queue_sqe
      
      The above test doesn't set valid blk throttle rules, yet the overhead
      introduced by blk throttle is even bigger than that of many io_uring
      framework functions, which is not acceptable.
      
      To improve this, only do io statistics if users specify valid blk
      throttle rules, which also improves performance.
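      
      A minimal sketch of the gating this implies (blk_throtl_has_rules()
      is an illustrative helper; the kernel tracks a similar per-group
      has_rules state):
      
          void blk_throtl_bio_endio(struct bio *bio)
          {
              /* skip latency/statistics bookkeeping entirely when the
               * user never configured a valid throttle rule */
              if (!blk_throtl_has_rules(bio->bi_disk->queue))
                  return;
              /* existing statistics accounting continues here */
          }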
      
      Before this patch:
      clat (usec): min=5, max=6871, avg=18.70, stdev=17.89
       lat (usec): min=9, max=6871, avg=18.84, stdev=17.89
      WRITE: bw=1618MiB/s (1697MB/s), 1618MiB/s-1618MiB/s (1697MB/s-1697MB/s),
      io=94.8GiB (102GB), run=60001-60001msec
      
      With this patch:
      clat (usec): min=5, max=7554, avg=17.49, stdev=18.24
      lat (usec): min=9, max=7554, avg=17.62, stdev=18.24
       WRITE: bw=1727MiB/s (1810MB/s), 1727MiB/s-1727MiB/s
      (1810MB/s-1810MB/s), io=101GiB (109GB), run=60001-60001msec
      
      About 6.6% bps improvement and 6.4% latency reduction.
      Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      b8a94ed8
    • configs: disable CONFIG_REFCOUNT_FULL for release kernel · 897fa8eb
      Dust Li committed
      fix #29180329
      
      CONFIG_REFCOUNT_FULL is mainly used for debugging; for a release
      kernel, it's better to disable it. This patch disables it for both
      the x86 and aarch64 release kernels.
      
      CONFIG_REFCOUNT_FULL carries a pretty large performance penalty for
      will-it-scale:signal1_process when the process count is large.
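      
      The resulting fragment in both config files (a sketch of the kconfig
      change):
      
          # CONFIG_REFCOUNT_FULL is not set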
      Signed-off-by: Dust Li <dust.li@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      897fa8eb
    • configs: update configs to adapt nvdimm series · efa76a28
      Shile Zhang committed
      to #27305291
      
      Enabled the following configs for NVDIMM support:
      - CONFIG_ACPI_NFIT=m
      - CONFIG_NVDIMM_KEYS=y
      Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      efa76a28
    • libnvdimm/security: provide fix for secure-erase to use zero-key · 8828833e
      Dave Jiang committed
      to #27305291
      
      commit 037c8489ade669e0f09ad40d5b91e5e1159a14b1 upstream.
      
      Add a zero key in order to standardize hardware that wants a key of
      0's to be passed. Some platforms default to a zero-key with security
      enabled rather than allowing the OS to enable the security. The zero
      key allows us to manage those platforms as well. This also adds a fix
      to secure erase so it can use the zero key to do crypto erase. Some
      other security commands already use zero keys. This introduces a
      standard zero-key to allow unification of semantics across nvdimm
      security commands.
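      
      A minimal sketch of the idea (the struct name follows the libnvdimm
      convention but is illustrative here):
      
          /* an all-zero passphrase, passed when hardware expects a zero
           * key rather than "no key"; static storage zero-initializes it */
          static const struct nvdimm_key_data zero_key;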
      Signed-off-by: Dave Jiang <dave.jiang@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      8828833e
    • libnvdimm/security: Add documentation for nvdimm security support · b42cd6ea
      Dave Jiang committed
      to #27305291
      
      commit 1f4883f300da4f4d9d31eaa80f7debf6ce74843b upstream.
      
      Add theory of operation for the security support that's going into
      libnvdimm.
      Signed-off-by: Dave Jiang <dave.jiang@intel.com>
      Reviewed-by: Jing Lin <jing.lin@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      b42cd6ea
    • tools/testing/nvdimm: add Intel DSM 1.8 support for nfit_test · 55039eda
      Dave Jiang committed
      to #27305291
      
      commit ecaa4a97b3908be0bf3ad12181ae8c44d1816d40 upstream.
      
      Adding test support for the new Intel DSMs from v1.8. The ability to
      simulate master passphrase update and master secure erase has been
      added to nfit_test.
      Signed-off-by: Dave Jiang <dave.jiang@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      55039eda
    • tools/testing/nvdimm: Add overwrite support for nfit_test · 98c27ed4
      Dave Jiang committed
      to #27305291
      
      commit 926f74802cb1ce0ef0c3b9f806ea542beb57e50d upstream.
      
      With the implementation of Intel NVDIMM DSM overwrite, we add a unit
      test to nfit_test for testing the overwrite operation.
      Signed-off-by: Dave Jiang <dave.jiang@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      98c27ed4
    • tools/testing/nvdimm: Add test support for Intel nvdimm security DSMs · fdbc6db9
      Dave Jiang committed
      to #27305291
      
      commit 3c13e2ac747a37e683597d3d875f839f2bc150e1 upstream.
      
      Add nfit_test support for DSM functions "Get Security State",
      "Set Passphrase", "Disable Passphrase", "Unlock Unit", "Freeze Lock",
      and "Secure Erase" for the fake DIMMs.
      
      Also add a sysfs knob in order to put the DIMMs in the "locked"
      state. The order of testing DIMM unlocking would be:
      1a. Disable DIMM X.
      1b. Set Passphrase to DIMM X.
      2. Write to
      /sys/devices/platform/nfit_test.0/nfit_test_dimm/test_dimmX/lock_dimm
      3. Re-enable DIMM X.
      4. Check DIMM X state via sysfs "security" attribute for nmemX.
      Signed-off-by: Dave Jiang <dave.jiang@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      fdbc6db9
    • acpi/nfit, libnvdimm/security: add Intel DSM 1.8 master passphrase support · 8302c7d9
      Dave Jiang committed
      to #27305291
      
      commit 89fa9d8ea7bdfa841d19044485cec5f4171069e5 upstream.
      
      With Intel DSM 1.8 [1] two new security DSMs are introduced:
      enable/update master passphrase and master secure erase. The master
      passphrase allows a secure erase to be performed without the user
      passphrase that is set on the NVDIMM. The master_update and
      master_erase commands are added to the sysfs knob in order to
      initiate the DSMs. They are similar in operation to update and erase.
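      
      A hedged usage sketch of the new sysfs commands (the key IDs are
      placeholders):
      
          # enable or update the master passphrase on DIMM nmem0
          echo "master_update <old_keyid> <new_keyid>" > /sys/bus/nd/devices/nmem0/security
          # secure-erase using the master passphrase
          echo "master_erase <keyid>" > /sys/bus/nd/devices/nmem0/security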
      
      [1]: http://pmem.io/documents/NVDIMM_DSM_Interface-V1.8.pdf
      Signed-off-by: Dave Jiang <dave.jiang@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      8302c7d9
    • acpi/nfit, libnvdimm/security: Add security DSM overwrite support · a3e32b16
      Dave Jiang committed
      to #27305291
      
      commit 7d988097c546187ada602cc9bccd0f03d473eb8f upstream.
      
      Add support for the NVDIMM_FAMILY_INTEL "overwrite" capability as
      described by the Intel DSM spec v1.7. This will allow triggering of
      overwrite on Intel NVDIMMs. The overwrite operation can take tens of
      minutes. When the overwrite DSM is issued successfully, the NVDIMMs
      will be inaccessible. The kernel will do backoff polling to detect
      when the overwrite process is completed. According to the DSM spec
      v1.7, 128G NVDIMMs can take up to 15 minutes to perform overwrite,
      and larger DIMMs will take longer.
      
      Given that overwrite puts the DIMM in an indeterminate state until it
      completes introduce the NDD_SECURITY_OVERWRITE flag to prevent other
      operations from executing when overwrite is happening. The
      NDD_WORK_PENDING flag is added to denote that there is a device reference
      on the nvdimm device for an async workqueue thread context.
      Signed-off-by: Dave Jiang <dave.jiang@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      a3e32b16
    • acpi/nfit, libnvdimm: Add support for issue secure erase DSM to Intel nvdimm · 7c7f13d6
      Dave Jiang committed
      to #27305291
      
      commit 64e77c8c047fb91ea8c7800c1238108a72f0bf9c upstream.
      
      Add support to issue a secure erase DSM to the Intel nvdimm. The
      required passphrase is acquired from an encrypted key in the kernel user
      keyring. To trigger the action, "erase <keyid>" is written to the
      "security" sysfs attribute.
      Signed-off-by: Dave Jiang <dave.jiang@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      7c7f13d6
    • acpi/nfit, libnvdimm: Add enable/update passphrase support for Intel nvdimms · 5ab26ffc
      Dave Jiang committed
      to #27305291
      
      commit d2a4ac73f56a5d0709d28b41fec8d15e4500f8f1 upstream.
      
      Add support for enabling and updating the passphrase on Intel
      nvdimms. The passphrase is an encrypted key in the kernel user
      keyring. We trigger the update by writing "update <old_keyid>
      <new_keyid>" to the sysfs attribute "security". If no <old_keyid>
      exists (i.e. when enabling security), a 0 should be used.
      Signed-off-by: Dave Jiang <dave.jiang@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      5ab26ffc
    • acpi/nfit, libnvdimm: Add disable passphrase support to Intel nvdimm. · 24684577
      Dave Jiang committed
      to #27305291
      
      commit 03b65b22ada8115a7a7bfdf0789f6a94adfd6070 upstream.
      
      Add support to disable passphrase (security) for the Intel nvdimm. The
      passphrase used for disabling is pulled from an encrypted-key in the kernel
      user keyring. The action is triggered by writing "disable <keyid>" to the
      sysfs attribute "security".
      Signed-off-by: Dave Jiang <dave.jiang@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      24684577
    • acpi/nfit, libnvdimm: Add unlock of nvdimm support for Intel DIMMs · ba87cbfc
      Dave Jiang committed
      to #27305291
      
      commit 4c6926a23b76ea23403976290cd45a7a143f6500 upstream.
      
      Add support to unlock the dimm via the kernel key management APIs. The
      passphrase is expected to be pulled from userspace through keyutils.
      The key management and sysfs attributes are libnvdimm generic.
      
      Encrypted keys are used to protect the nvdimm passphrase at rest. The
      master key can be a trusted-key sealed in a TPM (preferred), or an
      encrypted-key (more flexible, but with more exposure to a potential
      attacker).
      Signed-off-by: Dave Jiang <dave.jiang@intel.com>
      Co-developed-by: Dan Williams <dan.j.williams@intel.com>
      Reported-by: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      ba87cbfc
    • acpi/nfit, libnvdimm: Add freeze security support to Intel nvdimm · 29fd2e68
      Dave Jiang committed
      to #27305291
      
      commit 37833fb7989a9d3c3e26354e6878e682c340d718 upstream.
      
      Add support for freeze security on Intel nvdimm. This locks out any
      changes to security for the DIMM until a hard reset of the DIMM is
      performed. This is triggered by writing "freeze" to the generic
      nvdimm/nmemX "security" sysfs attribute.
      Signed-off-by: Dave Jiang <dave.jiang@intel.com>
      Co-developed-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      29fd2e68
    • acpi/nfit, libnvdimm: Introduce nvdimm_security_ops · 3b8f481b
      Dave Jiang committed
      to #27305291
      
      commit f2989396553a0bd13f4b25f567a3dee3d722ce40 upstream.
      
      Some NVDIMMs, like the ones defined by the NVDIMM_FAMILY_INTEL command
      set, expose a security capability to lock the DIMMs at poweroff and
      require a passphrase to unlock them. The security model is derived from
      ATA security. In anticipation of other DIMMs implementing a similar
      scheme, and to abstract the core security implementation away from the
      device-specific details, introduce nvdimm_security_ops.
      
      Initially only a status retrieval operation, ->state(), is defined,
      along with the base infrastructure and definitions for future
      operations.
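      
      A minimal sketch of the initial ops table (signatures abbreviated;
      only ->state() exists at this point):
      
          struct nvdimm_security_ops {
              /* query the DIMM's current security state */
              enum nvdimm_security_state (*state)(struct nvdimm *nvdimm);
              /* unlock, freeze, erase, ... follow in later patches */
          };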
      Signed-off-by: Dave Jiang <dave.jiang@intel.com>
      Co-developed-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      3b8f481b
    • keys-encrypted: add nvdimm key format type to encrypted keys · 100129a4
      Dave Jiang committed
      to #27305291
      
      commit 9db67581b91d9e9e05c35570ac3f93872e6c84ca upstream.
      
      Add an nvdimm key format type to encrypted keys in order to limit the
      size of the key to 32 bytes.
      Signed-off-by: Dave Jiang <dave.jiang@intel.com>
      Acked-by: Mimi Zohar <zohar@linux.ibm.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      100129a4
    • keys: Export lookup_user_key to external users · 12aad331
      Dave Jiang committed
      to #27305291
      
      commit 76ef5e17252789da79db78341851922af0c16181 upstream.
      
      Export lookup_user_key() symbol in order to allow nvdimm passphrase
      update to retrieve user injected keys.
      Signed-off-by: Dave Jiang <dave.jiang@intel.com>
      Acked-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      12aad331
    • acpi/nfit, libnvdimm: Store dimm id as a member to struct nvdimm · 83d94276
      Dave Jiang committed
      to #27305291
      
      commit d6548ae4d16dc231dec22860c9c472bcb991fb15 upstream.
      
      The generated dimm id is needed for the sysfs attribute as well as being
      used as the identifier/description for the security key. Since it's
      constant and should never change, store it as a member of struct nvdimm.
      
      As nvdimm_create() continues to grow parameters relative to NFIT driver
      requirements, do not require other implementations to keep pace.
      Introduce __nvdimm_create() to carry the new parameters and keep
      nvdimm_create() with the long standing default api.
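      
      A sketch of the wrapper arrangement this describes (parameter lists
      follow the existing nvdimm_create() signature; the trailing dimm_id
      argument is the addition):
      
          /* new entry point carries the extra NFIT-specific argument */
          struct nvdimm *__nvdimm_create(struct nvdimm_bus *nvdimm_bus,
                  void *provider_data,
                  const struct attribute_group **groups,
                  unsigned long flags, unsigned long cmd_mask,
                  int num_flush, struct resource *flush_wpq,
                  const char *dimm_id);
      
          /* the long-standing api stays unchanged for other callers */
          static inline struct nvdimm *nvdimm_create(struct nvdimm_bus *nvdimm_bus,
                  void *provider_data,
                  const struct attribute_group **groups,
                  unsigned long flags, unsigned long cmd_mask,
                  int num_flush, struct resource *flush_wpq)
          {
              return __nvdimm_create(nvdimm_bus, provider_data, groups,
                      flags, cmd_mask, num_flush, flush_wpq, NULL);
          }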
      Signed-off-by: Dave Jiang <dave.jiang@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      
      [ Shile: fixed conflict in drivers/acpi/nfit/nfit.h ]
      Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      83d94276
    • acpi/nfit: Add support for Intel DSM 1.8 commands · d7258548
      Dave Jiang committed
      to #27305291
      
      commit b3ed2ce024c36054e51cca2eb31a1cdbe4a5f11e upstream.
      
      Add command definitions for the security commands defined in the
      Intel DSM specification v1.8 [1]. This includes "get security state",
      "set passphrase", "unlock unit", "freeze lock", "secure erase",
      "overwrite", "overwrite query", "master passphrase enable/disable",
      and "master erase". Since this adds several Intel definitions, move
      the relevant bits to their own header.
      
      These commands mutate physical data, but that manipulation is not cache
      coherent. The requirement to flush and invalidate caches makes these
      commands unsuitable to be called from userspace, so extra logic is added
      to detect and block these commands from being submitted via the ioctl
      command submission path.
      
      Lastly, the commands may contain sensitive key material that should not
      be dumped in a standard debug session. Update the nvdimm-command
      payload-dump facility to move security command payloads behind a
      default-off compile time switch.
      
      [1]: http://pmem.io/documents/NVDIMM_DSM_Interface-V1.8.pdf
      Signed-off-by: Dave Jiang <dave.jiang@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      
      [ Shile: fixed conflicts:
      This patch updates the file "drivers/acpi/nfit/intel.h". That header
      is introduced by commit 0ead111 ("acpi, nfit: Collect shutdown
      status") upstream, which also updates the test files. So fetch that
      part as well to fix the conflicts in:
      - tools/testing/nvdimm/test/nfit.c
      - tools/testing/nvdimm/test/nfit_test.h ]
      Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      d7258548
    • io_uring: fix current->mm NULL dereference on exit · 1f9ef808
      Pavel Begunkov committed
      to #29197839
      
      commit d60b5fbc1ce8210759b568da49d149b868e7c6d3 upstream.
      
      Don't reissue requests from io_iopoll_reap_events(): the task may no
      longer have an mm, which ends up dereferencing NULL. It's better to
      kill everything off on exit anyway.
      
      [  677.734670] RIP: 0010:io_iopoll_complete+0x27e/0x630
      ...
      [  677.734679] Call Trace:
      [  677.734695]  ? __send_signal+0x1f2/0x420
      [  677.734698]  ? _raw_spin_unlock_irqrestore+0x24/0x40
      [  677.734699]  ? send_signal+0xf5/0x140
      [  677.734700]  io_iopoll_getevents+0x12f/0x1a0
      [  677.734702]  io_iopoll_reap_events.part.0+0x5e/0xa0
      [  677.734703]  io_ring_ctx_wait_and_kill+0x132/0x1c0
      [  677.734704]  io_uring_release+0x20/0x30
      [  677.734706]  __fput+0xcd/0x230
      [  677.734707]  ____fput+0xe/0x10
      [  677.734709]  task_work_run+0x67/0xa0
      [  677.734710]  do_exit+0x35d/0xb70
      [  677.734712]  do_group_exit+0x43/0xa0
      [  677.734713]  get_signal+0x140/0x900
      [  677.734715]  do_signal+0x37/0x780
      [  677.734717]  ? enqueue_hrtimer+0x41/0xb0
      [  677.734718]  ? recalibrate_cpu_khz+0x10/0x10
      [  677.734720]  ? ktime_get+0x3e/0xa0
      [  677.734721]  ? lapic_next_deadline+0x26/0x30
      [  677.734723]  ? tick_program_event+0x4d/0x90
      [  677.734724]  ? __hrtimer_get_next_event+0x4d/0x80
      [  677.734726]  __prepare_exit_to_usermode+0x126/0x1c0
      [  677.734741]  prepare_exit_to_usermode+0x9/0x40
      [  677.734742]  idtentry_exit_cond_rcu+0x4c/0x60
      [  677.734743]  sysvec_reschedule_ipi+0x92/0x160
      [  677.734744]  ? asm_sysvec_reschedule_ipi+0xa/0x20
      [  677.734745]  asm_sysvec_reschedule_ipi+0x12/0x20
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
      1f9ef808
    • io_uring: fix hanging iopoll in case of -EAGAIN · d52d1291
      Pavel Begunkov committed
      to #29197839
      
      commit cd664b0e35cb1202f40c259a1a5ea791d18c879d upstream.
      
      io_do_iopoll() won't do anything with a request unless
      req->iopoll_completed is set. So io_complete_rw_iopoll() has to set
      it, otherwise io_do_iopoll() will poll a file again and again even
      though the request of interest was completed a long time ago.
      
      Also, remove -EAGAIN check from io_issue_sqe() as it races with
      the changed lines. The request will take the long way and be
      resubmitted from io_iopoll*().
      
      Fixes: bbde017a32b3 ("io_uring: add memory barrier to synchronize io_kiocb's result and iopoll_completed")
      Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Liu Bo <bo.liu@linux.alibaba.com>
      d52d1291
    • configs: arm64: keep the unified configs tuned for both arches · bc60596f
      Shile Zhang committed
      to #27182371
      
      Restore all the configs previously tuned for Cloud Kernel, to keep
      the configs unified for both x86_64 and ARM64.
      Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      bc60596f
    • configs: arm64: reconfig to sync with internal version · 7c5a791c
      Shile Zhang committed
      to #27182371
      
      Reconfigure ARM64 with the help of the Alibaba internal kernel to
      keep the kernel configs unified.
      Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      7c5a791c
    • mm, page_alloc: reset the zone->watermark_boost early · bd231c59
      Charan Teja Reddy committed
      to #28825456
      
      commit aa09259109583b98b9d9e7ed0d8eb1b880d1eb97 upstream.
      
      Updating the zone watermarks by any means, like min_free_kbytes,
      watermark_scale_factor etc, when ->watermark_boost is set will result
      in higher low and high watermarks than the user asked for.
      
      Below are the steps to reproduce the problem on system setup of Android
      kernel running on Snapdragon hardware.
      
      1) Default settings of the system are as below:
      
         #cat /proc/sys/vm/min_free_kbytes = 5162
         #cat /proc/zoneinfo | grep -e boost -e low -e "high " -e min -e Node
      	Node 0, zone   Normal
      		min      797
      		low      8340
      		high     8539
      
      2) Monitor the zone->watermark_boost (by adding a debug print in the
         kernel) and, whenever it is greater than zero, write back the same
         min_free_kbytes value obtained in step 1.
      
         #echo 5162 > /proc/sys/vm/min_free_kbytes
      
      3) Then read the zone watermarks in the system while the
         ->watermark_boost is zero.  This should show the same watermark
         values as step 1, but it shows higher values than asked for.
      
         #cat /proc/zoneinfo | grep -e boost -e low -e "high " -e min -e Node
      	Node 0, zone   Normal
      		min      797
      		low      21148
      		high     21347
      
      These higher values are because of updating the zone watermarks using the
      macro min_wmark_pages(zone) which also adds the zone->watermark_boost.
      
      	#define min_wmark_pages(z) (z->_watermark[WMARK_MIN] +
      					z->watermark_boost)
      
      So the steps that lead to the issue are:
      
      1) On the extfrag event, watermarks are boosted by storing the required
         value in ->watermark_boost.
      
      2) User tries to update the zone watermarks level in the system through
         min_free_kbytes or watermark_scale_factor.
      
      3) Later, when kswapd woke up, it resets the zone->watermark_boost to
         zero.
      
      In step 2), we use the min_wmark_pages() macro to store the watermarks
      in the zone structure thus the values are always offsetted by
      ->watermark_boost value. This can be avoided by resetting the
      ->watermark_boost to zero before it is used.
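      
      A minimal sketch of the fix (upstream it sits in
      __setup_per_zone_wmarks(), where the watermarks are recomputed; the
      variable names are abbreviated):
      
          /* drop any leftover boost before min_wmark_pages() & friends
           * are consulted, so the recomputed levels aren't offset by it */
          zone->watermark_boost = 0;
          zone->_watermark[WMARK_MIN] = new_min;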
      Signed-off-by: Charan Teja Reddy <charante@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: Baoquan He <bhe@redhat.com>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Link: http://lkml.kernel.org/r/1589457511-4255-1-git-send-email-charante@codeaurora.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      bd231c59
    • mm: limit boost_watermark on small zones · ab70cdb0
      Henry Willard committed
      to #28825456
      
      commit 14f69140ff9c92a0928547ceefb153a842e8492c upstream.
      
      Commit 1c30844d2dfe ("mm: reclaim small amounts of memory when an
      external fragmentation event occurs") adds a boost_watermark() function
      which increases the min watermark in a zone by at least
      pageblock_nr_pages or the number of pages in a page block.
      
      On Arm64, with 64K pages and 512M huge pages, this is 8192 pages or
      512M.  It does this regardless of the number of pages managed in the
      zone or the likelihood of success.
      
      This can put the zone immediately under water in terms of allocating
      pages from the zone, and can cause a small machine to fail immediately
      due to OoM.  Unlike set_recommended_min_free_kbytes(), which
      substantially increases min_free_kbytes and is tied to THP,
      boost_watermark() can be called even if THP is not active.
      
      The problem is most likely to appear on architectures such as Arm64
      where pageblock_nr_pages is very large.
      
      It is desirable to run the kdump capture kernel in as small a space as
      possible to avoid wasting memory.  In some architectures, such as Arm64,
      there are restrictions on where the capture kernel can run, and
      therefore, the space available.  A capture kernel running in 768M can
      fail due to OoM immediately after boost_watermark() sets the min in zone
      DMA32, where most of the memory is, to 512M.  It fails even though there
      is over 500M of free memory.  With boost_watermark() suppressed, the
      capture kernel can run successfully in 448M.
      
      This patch limits boost_watermark() to boosting a zone's min watermark
      only when there are enough pages that the boost will produce positive
      results.  In this case that is estimated to be four times as many pages
      as pageblock_nr_pages.
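      
      A sketch of the guard, matching that description (placed at the top
      of boost_watermark() in mm/page_alloc.c; the exact form of the early
      return depends on the kernel version):
      
          /* don't boost zones too small for the boost to produce
           * positive results */
          if ((pageblock_nr_pages * 4) > zone_managed_pages(zone))
              return;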
      
      Mel said:
      
      : There is no harm in marking it stable.  Clearly it does not happen very
      : often but it's not impossible.  32-bit x86 is a lot less common now
      : which would previously have been vulnerable to triggering this easily.
      : ppc64 has a larger base page size but typically only has one zone.
      : arm64 is likely the most vulnerable, particularly when CMA is
      : configured with a small movable zone.
      
      Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
      Signed-off-by: Henry Willard <henry.willard@oracle.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Reviewed-by: David Hildenbrand <david@redhat.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/1588294148-6586-1-git-send-email-henry.willard@oracle.com
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      
      [xuyu: expand zone_managed_pages function]
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      ab70cdb0
    • mm, vmscan: do not special-case slab reclaim when watermarks are boosted · fb4da0ed
      Mel Gorman committed
      to #28825456
      
      commit 28360f398778d7623a5ff8a8e90958c0d925e120 upstream.
      
      Dave Chinner reported a problem pointing a finger at commit 1c30844d2dfe
      ("mm: reclaim small amounts of memory when an external fragmentation
      event occurs").
      
      The report is extensive:
      
        https://lore.kernel.org/linux-mm/20190807091858.2857-1-david@fromorbit.com/
      
      and it's worth recording the most relevant parts (colorful language and
      typos included).
      
      	When running a simple, steady state 4kB file creation test to
      	simulate extracting tarballs larger than memory full of small
      	files into the filesystem, I noticed that once memory fills up
      	the cache balance goes to hell.
      
      	The workload is creating one dirty cached inode for every dirty
      	page, both of which should require a single IO each to clean and
      	reclaim, and creation of inodes is throttled by the rate at which
      	dirty writeback runs at (via balance dirty pages). Hence the ingest
      	rate of new cached inodes and page cache pages is identical and
      	steady. As a result, memory reclaim should quickly find a steady
      	balance between page cache and inode caches.
      
      	The moment memory fills, the page cache is reclaimed at a much
      	faster rate than the inode cache, and evidence suggests that
      	the inode cache shrinker is not being called when large batches
      	of pages are being reclaimed. In roughly the same time period
      	that it takes to fill memory with 50% pages and 50% slab caches,
      	memory reclaim reduces the page cache down to just dirty pages
      	and slab caches fill the entirety of memory.
      
      	The LRU is largely full of dirty pages, and we're getting spikes
      	of random writeback from memory reclaim so it's all going to shit.
      	Behaviour never recovers, the page cache remains pinned at just
      	dirty pages, and nothing I could tune would make any difference.
      	vfs_cache_pressure makes no difference - I would set it so high
      	it should trim the entire inode caches in a single pass, yet it
      	didn't do anything. It was clear from tracing and live telemetry
      	that the shrinkers were pretty much not running except when
      	there was absolutely no memory free at all, and then they did
      	the minimum necessary to free memory to make progress.
      
      	So I went looking at the code, trying to find places where pages
      	got reclaimed and the shrinkers weren't called. There's only one
      	- kswapd doing boosted reclaim as per commit 1c30844d2dfe ("mm:
      	reclaim small amounts of memory when an external fragmentation
      	event occurs").
      
      The watermark boosting introduced by the commit is triggered in response
      to an allocation "fragmentation event".  The boosting was not intended
      to target THP specifically and triggers even if THP is disabled.
      However, with Dave's perfectly reasonable workload, fragmentation events
      can be very common given the ratio of slab to page cache allocations so
      boosting remains active for long periods of time.
      
      As high-order allocations might use compaction and compaction cannot
      move slab pages the decision was made in the commit to special-case
      kswapd when watermarks are boosted -- kswapd avoids reclaiming slab as
      reclaiming slab does not directly help compaction.
      
      As Dave notes, this decision means that slab can be artificially
      protected for long periods of time and messes up the balance with slab
      and page caches.
      
      Removing the special casing can still indirectly help avoid
      fragmentation by avoiding fragmentation-causing events due to slab
      allocation as pages from a slab pageblock will have some slab objects
      freed.  Furthermore, with the special casing, reclaim behaviour is
      unpredictable as kswapd sometimes examines slab and sometimes does not
      in a manner that is tricky to tune or analyse.
      
      This patch removes the special casing.  The downside is that this is not
      a universal performance win.  Some benchmarks that depend on the
      residency of data when rereading metadata may see a regression when slab
      reclaim is restored to its original behaviour.  Similarly, some
      benchmarks that only read-once or write-once may perform better when
      page reclaim is too aggressive.  The primary upside is that slab
      shrinker is less surprising (arguably more sane but that's a matter of
      opinion), behaves consistently regardless of the fragmentation state of
      the system and properly obeys VM sysctls.
      
      A fsmark benchmark configuration was constructed similar to what Dave
      reported and is codified by the mmtests configuration
      config-io-fsmark-small-file-stream.  It was evaluated on a 1-socket
      machine to avoid dealing with NUMA-related issues and the timing of
      reclaim.  The storage was a Samsung Evo SSD and a freshly trimmed XFS
      filesystem was used for the test data.
      
      This is not an exact replication of Dave's setup.  The configuration
      scales its parameters depending on the memory size of the SUT to behave
      similarly across machines.  The parameters mean the first sample
      reported by fs_mark uses 50% of RAM, which will barely be throttled,
      so it looks like a big outlier.  Dave used fake NUMA to have multiple
      kswapd instances, which I didn't replicate.  Finally, the number of
      iterations differs from Dave's test as the target disk was not large
      enough.  While not identical, it should be representative.
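      
      For illustration only, a small-file fs_mark run of this style resembles
      the following invocation (the parameter values here are hypothetical,
      not the exact ones generated by the mmtests configuration):
      
      	fs_mark -D 10000 -S0 -n 65536 -s 0 -L 16 \
      		-d /mnt/test/0 -d /mnt/test/1 -d /mnt/test/2 -d /mnt/test/3
      
      Here -n is the number of files per loop, -s 0 creates zero-length files
      so the workload is metadata (i.e. slab) dominated, -S0 disables
      syncing, -L is the loop count, and each -d adds a worker directory.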
      
        fsmark
                                           5.3.0-rc3              5.3.0-rc3
                                             vanilla          shrinker-v1r1
        Min       1-files/sec     4444.80 (   0.00%)     4765.60 (   7.22%)
        1st-qrtle 1-files/sec     5005.10 (   0.00%)     5091.70 (   1.73%)
        2nd-qrtle 1-files/sec     4917.80 (   0.00%)     4855.60 (  -1.26%)
        3rd-qrtle 1-files/sec     4667.40 (   0.00%)     4831.20 (   3.51%)
        Max-1     1-files/sec    11421.50 (   0.00%)     9999.30 ( -12.45%)
        Max-5     1-files/sec    11421.50 (   0.00%)     9999.30 ( -12.45%)
        Max-10    1-files/sec    11421.50 (   0.00%)     9999.30 ( -12.45%)
        Max-90    1-files/sec     4649.60 (   0.00%)     4780.70 (   2.82%)
        Max-95    1-files/sec     4491.00 (   0.00%)     4768.20 (   6.17%)
        Max-99    1-files/sec     4491.00 (   0.00%)     4768.20 (   6.17%)
        Max       1-files/sec    11421.50 (   0.00%)     9999.30 ( -12.45%)
        Hmean     1-files/sec     5004.75 (   0.00%)     5075.96 (   1.42%)
        Stddev    1-files/sec     1778.70 (   0.00%)     1369.66 (  23.00%)
        CoeffVar  1-files/sec       33.70 (   0.00%)       26.05 (  22.71%)
        BHmean-99 1-files/sec     5053.72 (   0.00%)     5101.52 (   0.95%)
        BHmean-95 1-files/sec     5053.72 (   0.00%)     5101.52 (   0.95%)
        BHmean-90 1-files/sec     5107.05 (   0.00%)     5131.41 (   0.48%)
        BHmean-75 1-files/sec     5208.45 (   0.00%)     5206.68 (  -0.03%)
        BHmean-50 1-files/sec     5405.53 (   0.00%)     5381.62 (  -0.44%)
        BHmean-25 1-files/sec     6179.75 (   0.00%)     6095.14 (  -1.37%)
      
                         5.3.0-rc3      5.3.0-rc3
                           vanilla  shrinker-v1r1
        Duration User       501.82         497.29
        Duration System    4401.44        4424.08
        Duration Elapsed   8124.76        8358.05
      
      The max result shows a slight skew from a single large outlier, while
      the 1st, 2nd and 3rd quartiles are similar, indicating that the bulk
      of the results show little difference.  Note that an earlier version
      of the fsmark configuration showed a regression, but that included
      more samples taken while memory was still filling.
      
      Note that the elapsed time is higher.  Part of this is that the
      configuration includes the time to delete all the test files when the
      test completes -- the test automation handles the possibility of testing
      fsmark with multiple thread counts.  Without the patch, many of these
      objects would be memory resident, which is part of what the patch is
      addressing.
      
      There are other important observations that justify the patch.
      
      1. With the vanilla kernel, the number of dirty pages in the system is
         very low for much of the test. With this patch, dirty pages are
         generally kept at 10%, which matches vm.dirty_background_ratio and
         is the normal expected historical behaviour.
      
      2. With the vanilla kernel, the ratio of Slab/Pagecache is close to
         0.95 for much of the test, i.e. Slab is being left alone and
         dominating memory consumption. With the patch applied, the ratio
         varies between 0.35 and 0.45, with the bulk of the measured ratios
         roughly half way between those values. This is a different balance
         to what Dave reported, but it was at least consistent.
      
      3. Slabs are scanned throughout the entire test with the patch applied.
         The vanilla kernel has periods with no scan activity and then
         relatively massive spikes.
      
      4. Without the patch, kswapd scan rates are very variable. With the
         patch, the scan rates remain quite steady.
      
      5. Overall vmstats are closer to normal expectations:
      
      	                                5.3.0-rc3      5.3.0-rc3
      	                                  vanilla  shrinker-v1r1
          Ops Direct pages scanned             99388.00      328410.00
          Ops Kswapd pages scanned          45382917.00    33451026.00
          Ops Kswapd pages reclaimed        30869570.00    25239655.00
          Ops Direct pages reclaimed           74131.00        5830.00
          Ops Kswapd efficiency %                 68.02          75.45
          Ops Kswapd velocity                   5585.75        4002.25
          Ops Page reclaim immediate         1179721.00      430927.00
          Ops Slabs scanned                 62367361.00    73581394.00
          Ops Direct inode steals               2103.00        1002.00
          Ops Kswapd inode steals             570180.00     5183206.00
      
      	o Vanilla kernel is hitting direct reclaim more frequently,
      	  not very much in absolute terms but the fact the patch
      	  reduces it is interesting
      	o "Page reclaim immediate" in the vanilla kernel indicates
      	  dirty pages are being encountered at the tail of the LRU.
      	  This is generally bad and means in this case that the LRU
      	  is not long enough for dirty pages to be cleaned by the
      	  background flush in time. This is much reduced by the
      	  patch.
      	o With the patch, kswapd is reclaiming 10 times more slab
      	  pages than with the vanilla kernel. This is indicative
      	  of the watermark boosting over-protecting slab.
      
      A more complete set of tests, which formed part of the basis for
      introducing boosting, was also run; while there are some differences,
      they are well within tolerances.
      
      Bottom line: special-casing kswapd to avoid reclaiming slab makes
      behaviour unpredictable and can lead to abnormal results for normal
      workloads.
      
      This patch restores the expected behaviour that slab and page cache are
      balanced consistently for a workload with a steady allocation ratio of
      slab/pagecache pages.  It also means that workloads which favour the
      preservation of slab over pagecache can be tuned via
      vm.vfs_cache_pressure, whereas the vanilla kernel effectively ignores
      the parameter when boosting is active.
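      
      For reference, vm.vfs_cache_pressure works by scaling the object counts
      that the superblock shrinker reports to the reclaim core.  A simplified
      sketch of the mechanism (paraphrased from fs/super.c and
      include/linux/dcache.h, not the exact upstream code):
      
      	/* include/linux/dcache.h */
      	static inline unsigned long vfs_pressure_ratio(unsigned long val)
      	{
      		return mult_frac(val, sysctl_vfs_cache_pressure, 100);
      	}
      
      	/* fs/super.c: super_cache_count() */
      	total_objects += list_lru_shrink_count(&sb->s_dentry_lru, sc);
      	total_objects += list_lru_shrink_count(&sb->s_inode_lru, sc);
      	total_objects = vfs_pressure_ratio(total_objects);
      
      A value above 100 makes dentries and inodes look more numerous to the
      shrinker so they are reclaimed more aggressively, but only when the
      shrinkers actually run, which is why the vanilla kernel appeared to
      ignore the sysctl while boosting was active.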
      
      Link: http://lkml.kernel.org/r/20190808182946.GM2739@techsingularity.net
      Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org>	[5.0+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      fb4da0ed