1. 02 9月, 2020 40 次提交
    • K
      Intel: perf/x86/intel/uncore: Add box_offsets for free-running counters · 4b559ab4
      Kan Liang 提交于
      fix #29130534
      
      commit bc88a2fe216a51e8ab46d61f89d0c1b5a400470e upstream.
      
      Backport summary: Backport to kernel 4.19.57 for ICX uncore support.
      
      The offset between uncore boxes of free-running counters varies, e.g.
      IIO free-running counters on Ice Lake server.
      
      Add box_offsets, an array of offsets between adjacent uncore boxes.
      Signed-off-by: NKan Liang <kan.liang@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/1584470314-46657-1-git-send-email-kan.liang@linux.intel.comSigned-off-by: NYunying Sun <yunying.sun@intel.com>
      Signed-off-by: NPeng Wang <rocking@linux.alibaba.com>
      Reviewed-by: NArtie Ding <artie.ding@linux.alibaba.com>
      4b559ab4
    • K
      Intel: perf/x86/intel/uncore: Factor out __snr_uncore_mmio_init_box · 6b7f290f
      Kan Liang 提交于
      fix #29130534
      
      commit 3442a9ecb8e72a33c28a2b969b766c659830e410 upstream.
      
      Backport summary: Backport to kernel 4.19.57 for ICX uncore support.
      
      The IMC uncore unit in Ice Lake server can only be accessed by MMIO,
      which is similar as Snow Ridge.
      Factor out __snr_uncore_mmio_init_box which can be shared with Ice Lake
      server in the following patch.
      
      No functional changes.
      Signed-off-by: NKan Liang <kan.liang@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Link: https://lkml.kernel.org/r/1584470314-46657-2-git-send-email-kan.liang@linux.intel.comSigned-off-by: NYunying Sun <yunying.sun@intel.com>
      Signed-off-by: NPeng Wang <rocking@linux.alibaba.com>
      Reviewed-by: NArtie Ding <artie.ding@linux.alibaba.com>
      6b7f290f
    • K
      Intel: perf/x86/intel/uncore: Add IMC uncore support for Snow Ridge · 28661c6a
      Kan Liang 提交于
      fix #29130534
      
      commit ee49532b38dd084650bf715eabe7e3828fb8d275 upstream.
      
      Backport summary: Backport to kernel 4.19.57 for ICX uncore support.
      
      IMC uncore unit can only be accessed via MMIO on Snow Ridge.
      The MMIO space of IMC uncore is at the specified offsets from the
      MEM0_BAR. Add snr_uncore_get_mc_dev() to locate the PCI device with
      MMIO_BASE and MEM0_BAR register.
      
      Add new ops to access the IMC registers via MMIO.
      
      Add 3 new free running counters for clocks, read and write bandwidth.
      Signed-off-by: NKan Liang <kan.liang@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@kernel.org
      Cc: eranian@google.com
      Link: https://lkml.kernel.org/r/1556672028-119221-7-git-send-email-kan.liang@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NYunying Sun <yunying.sun@intel.com>
      Signed-off-by: NPeng Wang <rocking@linux.alibaba.com>
      Reviewed-by: NArtie Ding <artie.ding@linux.alibaba.com>
      28661c6a
    • K
      Intel: perf/x86/intel/uncore: Clean up client IMC · 4f42d8f8
      Kan Liang 提交于
      fix #29130534
      
      commit 07ce734dd8adc0f170d43c15a9b91b707a21b9d7 upstream.
      
      Backport summary: Backport to kernel 4.19.57 for ICX uncore support.
      
      The client IMC block is accessed by MMIO. Current code uses an informal
      way to access the block, which is not recommended.
      
      Clean up the code by using __iomem annotation and the accessor
      functions (read[lq]()).
      
      Move exit_box() and read_counter() to generic code, which can be shared
      with the server code later.
      Signed-off-by: NKan Liang <kan.liang@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@kernel.org
      Cc: eranian@google.com
      Link: https://lkml.kernel.org/r/1556672028-119221-6-git-send-email-kan.liang@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NYunying Sun <yunying.sun@intel.com>
      Signed-off-by: NPeng Wang <rocking@linux.alibaba.com>
      Reviewed-by: NArtie Ding <artie.ding@linux.alibaba.com>
      4f42d8f8
    • K
      Intel: perf/x86/intel/uncore: Support MMIO type uncore blocks · 4599feef
      Kan Liang 提交于
      fix #29130534
      
      commit 3da04b8a00dd6d39970b9e764b78c5dfb40ec013 upstream.
      
      Backport summary: Backport to kernel 4.19.57 for ICX uncore support.
      
      A new MMIO type uncore box is introduced on Snow Ridge server. The
      counters of MMIO type uncore box can only be accessed by MMIO.
      
      Add a new uncore type, uncore_mmio_uncores, for MMIO type uncore blocks.
      
      Support MMIO type uncore blocks in CPU hot plug. The MMIO space has to
      be map/unmap for the first/last CPU. The context also need to be
      migrated if the bind CPU changes.
      
      Add mmio_init() to init and register PMUs for MMIO type uncore blocks.
      
      Add a helper to calculate the box_ctl address.
      
      The helpers which calculate ctl/ctr can be shared with PCI type uncore
      blocks.
      Signed-off-by: NKan Liang <kan.liang@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@kernel.org
      Cc: eranian@google.com
      Link: https://lkml.kernel.org/r/1556672028-119221-5-git-send-email-kan.liang@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NYunying Sun <yunying.sun@intel.com>
      Signed-off-by: NPeng Wang <rocking@linux.alibaba.com>
      Reviewed-by: NArtie Ding <artie.ding@linux.alibaba.com>
      4599feef
    • K
      Intel: perf/x86/intel/uncore: Factor out box ref/unref functions · 3df4e38a
      Kan Liang 提交于
      fix #29130534
      
      commit c8872d90e0a3651a096860d3241625ccfa1647e0 upstream.
      
      Backport summary: Backport to kernel 4.19.57 for ICX uncore support.
      
      For uncore box which can only be accessed by MSR, its reference
      box->refcnt is updated in CPU hot plug. The uncore boxes need to be
      initalized and exited accordingly for the first/last CPU of a socket.
      
      Starts from Snow Ridge server, a new type of uncore box is introduced,
      which can only be accessed by MMIO. The driver needs to map/unmap
      MMIO space for the first/last CPU of a socket.
      
      Extract the codes of box ref/unref and init/exit for reuse later.
      
      There is no functional change.
      Signed-off-by: NKan Liang <kan.liang@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@kernel.org
      Cc: eranian@google.com
      Link: https://lkml.kernel.org/r/1556672028-119221-4-git-send-email-kan.liang@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NYunying Sun <yunying.sun@intel.com>
      Signed-off-by: NPeng Wang <rocking@linux.alibaba.com>
      Reviewed-by: NArtie Ding <artie.ding@linux.alibaba.com>
      3df4e38a
    • K
      Intel: perf/x86/intel/uncore: Add uncore support for Snow Ridge server · 06253540
      Kan Liang 提交于
      fix #29130534
      
      commit 210cc5f9db7a5c66b7ca6290b7d35cc7db7e9dbd upstream.
      
      Backport summary: Backport to kernel 4.19.57 for ICX uncore support.
      
      The uncore subsystem on Snow Ridge is similar as previous SKX server.
      The uncore units on Snow Ridge include Ubox, Chabox, IIO, IRP, M2PCIE,
      PCU, M2M, PCIE3 and IMC.
      
      - The config register encoding and pci device IDs are changed.
      - For CHA, the umask_ext and filter_tid fields are changed.
      - For IIO, the ch_mask and fc_mask fields are changed.
      - For M2M, the mask_ext field is changed.
      - Add new PCIe3 unit for PCIe3 root port which provides the interface
        between PCIe devices, plugged into the PCIe port, and the components
        (in M2IOSF).
      - IMC can only be accessed via MMIO on Snow Ridge now. Current common
        code doesn't support it yet. IMC will be supported in following
        patches.
      - There are 9 free running counters for IIO CLOCKS and bandwidth In.
      - Full uncore event list is not published yet. Event constrain is not
        included in this patch. It will be added later separately.
      Signed-off-by: NKan Liang <kan.liang@linux.intel.com>
      Signed-off-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: acme@kernel.org
      Cc: eranian@google.com
      Link: https://lkml.kernel.org/r/1556672028-119221-3-git-send-email-kan.liang@linux.intel.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NYunying Sun <yunying.sun@intel.com>
      Signed-off-by: NPeng Wang <rocking@linux.alibaba.com>
      Reviewed-by: NArtie Ding <artie.ding@linux.alibaba.com>
      06253540
    • X
      alinux: block-throttle: only do io statistics if needed · b8a94ed8
      Xiaoguang Wang 提交于
      task #29063222
      
      Current blk throttle codes always do io statistics even though users
      don't specify valid throttle rules, which will introduce significant
      overheads for applications that don't use blk throttle function and
      is wrose in arm, see below perf data captured in arm:
      
      sudo taskset -c 66 fio -ioengine=io_uring -sqthread_poll=1 -hipri=1
      -sqthread_poll_cpu=65 -registerfiles=1 -fixedbufs=1 -direct=1
      -filename=/dev/nvme0n1 -bs=4k -iodepth=8 -rw=randwrite  -time_based
      -ramp_time=30 -runtime=60  -name="test"
      
      Samples: 25K of event 'cycles', Event count (approx.): 16586974662
      Overhead  Command      Shared Object      Symbol
         3.54%  io_uring-sq  [kernel.kallsyms]  [k]
      throtl_stats_update_completion
         0.89%  io_uring-sq  [kernel.kallsyms]  [k] throtl_bio_end_io
         0.66%  io_uring-sq  [kernel.kallsyms]  [k] blk_throtl_bio
         0.05%  io_uring-sq  [kernel.kallsyms]  [k] blk_throtl_stat_add
         0.05%  io_uring-sq  [kernel.kallsyms]  [k] throtl_track_latency
         0.01%  io_uring-sq  [kernel.kallsyms]  [k] blk_throtl_bio_endio
      
      Samples: 25K of event 'cycles', Event count (approx.): 16586974662
      Overhead  Command      Shared Object      Symbol
         1.62%  io_uring-sq  [kernel.kallsyms]  [k] io_submit_sqes
         1.06%  io_uring-sq  [kernel.kallsyms]  [k] io_issue_sqe
         0.32%  io_uring-sq  [kernel.kallsyms]  [k] __io_queue_sqe
         0.06%  io_uring-sq  [kernel.kallsyms]  [k] io_queue_sqe
      
      Above test doesn't set valid blk throttle rules, but the overhead
      introduced by blk throttle is even bigger than many io_uring framework
      functions, which is not acceptable.
      
      To improve this issue, only do do io statistics if users specify valid
      blk throttle rules, and this will also improve performance.
      
      Before this patch:
      clat (usec): min=5, max=6871, avg=18.70, stdev=17.89
       lat (usec): min=9, max=6871, avg=18.84, stdev=17.89
      WRITE: bw=1618MiB/s (1697MB/s), 1618MiB/s-1618MiB/s (1697MB/s-1697MB/s),
      io=94.8GiB (102GB), run=60001-60001msec
      
      With this patch:
      clat (usec): min=5, max=7554, avg=17.49, stdev=18.24
      lat (usec): min=9, max=7554, avg=17.62, stdev=18.24
       WRITE: bw=1727MiB/s (1810MB/s), 1727MiB/s-1727MiB/s
      (1810MB/s-1810MB/s), io=101GiB (109GB), run=60001-60001msec
      
      About 6.6% bps improvement and 6.4% latency reduction.
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      b8a94ed8
    • D
      configs: disable CONFIG_REFCOUNT_FULL for release kernel · 897fa8eb
      Dust Li 提交于
      fix #29180329
      
      CONFIG_REFCOUNT_FULL is used for debugging mainly,
      for release kernel, it's better to diable it.
      This patch disables both x86 and aarch64 for release
      kernel.
      
      This has a pretty large performance penalty for
      will-it-scale:signal1_process when the process
      number are large.
      Signed-off-by: NDust Li <dust.li@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      897fa8eb
    • S
      configs: update configs to adapt nvdimm series · efa76a28
      Shile Zhang 提交于
      to #27305291
      
      Enabled the following configs for NVDIMM support:
      - CONFIG_ACPI_NFIT=m
      - CONFIG_NVDIMM_KEYS=y
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      efa76a28
    • D
      libnvdimm/security: provide fix for secure-erase to use zero-key · 8828833e
      Dave Jiang 提交于
      to #27305291
      
      commit 037c8489ade669e0f09ad40d5b91e5e1159a14b1 upstream.
      
      Add a zero key in order to standardize hardware that want a key of 0's to
      be passed. Some platforms defaults to a zero-key with security enabled
      rather than allow the OS to enable the security. The zero key would allow
      us to manage those platform as well. This also adds a fix to secure erase
      so it can use the zero key to do crypto erase. Some other security commands
      already use zero keys. This introduces a standard zero-key to allow
      unification of semantics cross nvdimm security commands.
      Signed-off-by: NDave Jiang <dave.jiang@intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      8828833e
    • D
      libnvdimm/security: Add documentation for nvdimm security support · b42cd6ea
      Dave Jiang 提交于
      to #27305291
      
      commit 1f4883f300da4f4d9d31eaa80f7debf6ce74843b upstream.
      
      Add theory of operation for the security support that's going into
      libnvdimm.
      Signed-off-by: NDave Jiang <dave.jiang@intel.com>
      Reviewed-by: NJing Lin <jing.lin@intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      b42cd6ea
    • D
      tools/testing/nvdimm: add Intel DSM 1.8 support for nfit_test · 55039eda
      Dave Jiang 提交于
      to #27305291
      
      commit ecaa4a97b3908be0bf3ad12181ae8c44d1816d40 upstream.
      
      Adding test support for new Intel DSM from v1.8. The ability of simulating
      master passphrase update and master secure erase have been added to
      nfit_test.
      Signed-off-by: NDave Jiang <dave.jiang@intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      55039eda
    • D
      tools/testing/nvdimm: Add overwrite support for nfit_test · 98c27ed4
      Dave Jiang 提交于
      to #27305291
      
      commit 926f74802cb1ce0ef0c3b9f806ea542beb57e50d upstream.
      
      With the implementation of Intel NVDIMM DSM overwrite, we are adding unit
      test to nfit_test for testing of overwrite operation.
      Signed-off-by: NDave Jiang <dave.jiang@intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      98c27ed4
    • D
      tools/testing/nvdimm: Add test support for Intel nvdimm security DSMs · fdbc6db9
      Dave Jiang 提交于
      to #27305291
      
      commit 3c13e2ac747a37e683597d3d875f839f2bc150e1 upstream.
      
      Add nfit_test support for DSM functions "Get Security State",
      "Set Passphrase", "Disable Passphrase", "Unlock Unit", "Freeze Lock",
      and "Secure Erase" for the fake DIMMs.
      
      Also adding a sysfs knob in order to put the DIMMs in "locked" state. The
      order of testing DIMM unlocking would be.
      1a. Disable DIMM X.
      1b. Set Passphrase to DIMM X.
      2. Write to
      /sys/devices/platform/nfit_test.0/nfit_test_dimm/test_dimmX/lock_dimm
      3. Renable DIMM X
      4. Check DIMM X state via sysfs "security" attribute for nmemX.
      Signed-off-by: NDave Jiang <dave.jiang@intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      fdbc6db9
    • D
      acpi/nfit, libnvdimm/security: add Intel DSM 1.8 master passphrase support · 8302c7d9
      Dave Jiang 提交于
      to #27305291
      
      commit 89fa9d8ea7bdfa841d19044485cec5f4171069e5 upstream.
      
      With Intel DSM 1.8 [1] two new security DSMs are introduced. Enable/update
      master passphrase and master secure erase. The master passphrase allows
      a secure erase to be performed without the user passphrase that is set on
      the NVDIMM. The commands of master_update and master_erase are added to
      the sysfs knob in order to initiate the DSMs. They are similar in opeartion
      mechanism compare to update and erase.
      
      [1]: http://pmem.io/documents/NVDIMM_DSM_Interface-V1.8.pdfSigned-off-by: NDave Jiang <dave.jiang@intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      8302c7d9
    • D
      acpi/nfit, libnvdimm/security: Add security DSM overwrite support · a3e32b16
      Dave Jiang 提交于
      to #27305291
      
      commit 7d988097c546187ada602cc9bccd0f03d473eb8f upstream.
      
      Add support for the NVDIMM_FAMILY_INTEL "ovewrite" capability as
      described by the Intel DSM spec v1.7. This will allow triggering of
      overwrite on Intel NVDIMMs. The overwrite operation can take tens of
      minutes. When the overwrite DSM is issued successfully, the NVDIMMs will
      be unaccessible. The kernel will do backoff polling to detect when the
      overwrite process is completed. According to the DSM spec v1.7, the 128G
      NVDIMMs can take up to 15mins to perform overwrite and larger DIMMs will
      take longer.
      
      Given that overwrite puts the DIMM in an indeterminate state until it
      completes introduce the NDD_SECURITY_OVERWRITE flag to prevent other
      operations from executing when overwrite is happening. The
      NDD_WORK_PENDING flag is added to denote that there is a device reference
      on the nvdimm device for an async workqueue thread context.
      Signed-off-by: NDave Jiang <dave.jiang@intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      a3e32b16
    • D
      acpi/nfit, libnvdimm: Add support for issue secure erase DSM to Intel nvdimm · 7c7f13d6
      Dave Jiang 提交于
      to #27305291
      
      commit 64e77c8c047fb91ea8c7800c1238108a72f0bf9c upstream.
      
      Add support to issue a secure erase DSM to the Intel nvdimm. The
      required passphrase is acquired from an encrypted key in the kernel user
      keyring. To trigger the action, "erase <keyid>" is written to the
      "security" sysfs attribute.
      Signed-off-by: NDave Jiang <dave.jiang@intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      7c7f13d6
    • D
      acpi/nfit, libnvdimm: Add enable/update passphrase support for Intel nvdimms · 5ab26ffc
      Dave Jiang 提交于
      to #27305291
      
      commit d2a4ac73f56a5d0709d28b41fec8d15e4500f8f1 upstream.
      
      Add support for enabling and updating passphrase on the Intel nvdimms.
      The passphrase is the an encrypted key in the kernel user keyring.
      We trigger the update via writing "update <old_keyid> <new_keyid>" to the
      sysfs attribute "security". If no <old_keyid> exists (for enabling
      security) then a 0 should be used.
      Signed-off-by: NDave Jiang <dave.jiang@intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      5ab26ffc
    • D
      acpi/nfit, libnvdimm: Add disable passphrase support to Intel nvdimm. · 24684577
      Dave Jiang 提交于
      to #27305291
      
      commit 03b65b22ada8115a7a7bfdf0789f6a94adfd6070 upstream.
      
      Add support to disable passphrase (security) for the Intel nvdimm. The
      passphrase used for disabling is pulled from an encrypted-key in the kernel
      user keyring. The action is triggered by writing "disable <keyid>" to the
      sysfs attribute "security".
      Signed-off-by: NDave Jiang <dave.jiang@intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      24684577
    • D
      acpi/nfit, libnvdimm: Add unlock of nvdimm support for Intel DIMMs · ba87cbfc
      Dave Jiang 提交于
      to #27305291
      
      commit 4c6926a23b76ea23403976290cd45a7a143f6500 upstream.
      
      Add support to unlock the dimm via the kernel key management APIs. The
      passphrase is expected to be pulled from userspace through keyutils.
      The key management and sysfs attributes are libnvdimm generic.
      
      Encrypted keys are used to protect the nvdimm passphrase at rest. The
      master key can be a trusted-key sealed in a TPM, preferred, or an
      encrypted-key, more flexible, but more exposure to a potential attacker.
      Signed-off-by: NDave Jiang <dave.jiang@intel.com>
      Co-developed-by: NDan Williams <dan.j.williams@intel.com>
      Reported-by: NRandy Dunlap <rdunlap@infradead.org>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      ba87cbfc
    • D
      acpi/nfit, libnvdimm: Add freeze security support to Intel nvdimm · 29fd2e68
      Dave Jiang 提交于
      to #27305291
      
      commit 37833fb7989a9d3c3e26354e6878e682c340d718 upstream.
      
      Add support for freeze security on Intel nvdimm. This locks out any
      changes to security for the DIMM until a hard reset of the DIMM is
      performed. This is triggered by writing "freeze" to the generic
      nvdimm/nmemX "security" sysfs attribute.
      Signed-off-by: NDave Jiang <dave.jiang@intel.com>
      Co-developed-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      29fd2e68
    • D
      acpi/nfit, libnvdimm: Introduce nvdimm_security_ops · 3b8f481b
      Dave Jiang 提交于
      to #27305291
      
      commit f2989396553a0bd13f4b25f567a3dee3d722ce40 upstream.
      
      Some NVDIMMs, like the ones defined by the NVDIMM_FAMILY_INTEL command
      set, expose a security capability to lock the DIMMs at poweroff and
      require a passphrase to unlock them. The security model is derived from
      ATA security. In anticipation of other DIMMs implementing a similar
      scheme, and to abstract the core security implementation away from the
      device-specific details, introduce nvdimm_security_ops.
      
      Initially only a status retrieval operation, ->state(), is defined,
      along with the base infrastructure and definitions for future
      operations.
      Signed-off-by: NDave Jiang <dave.jiang@intel.com>
      Co-developed-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      3b8f481b
    • D
      keys-encrypted: add nvdimm key format type to encrypted keys · 100129a4
      Dave Jiang 提交于
      to #27305291
      
      commit 9db67581b91d9e9e05c35570ac3f93872e6c84ca upstream.
      
      Adding nvdimm key format type to encrypted keys in order to limit the size
      of the key to 32bytes.
      Signed-off-by: NDave Jiang <dave.jiang@intel.com>
      Acked-by: NMimi Zohar <zohar@linux.ibm.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      100129a4
    • D
      keys: Export lookup_user_key to external users · 12aad331
      Dave Jiang 提交于
      to #27305291
      
      commit 76ef5e17252789da79db78341851922af0c16181 upstream.
      
      Export lookup_user_key() symbol in order to allow nvdimm passphrase
      update to retrieve user injected keys.
      Signed-off-by: NDave Jiang <dave.jiang@intel.com>
      Acked-by: NDavid Howells <dhowells@redhat.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      12aad331
    • D
      acpi/nfit, libnvdimm: Store dimm id as a member to struct nvdimm · 83d94276
      Dave Jiang 提交于
      to #27305291
      
      commit d6548ae4d16dc231dec22860c9c472bcb991fb15 upstream.
      
      The generated dimm id is needed for the sysfs attribute as well as being
      used as the identifier/description for the security key. Since it's
      constant and should never change, store it as a member of struct nvdimm.
      
      As nvdimm_create() continues to grow parameters relative to NFIT driver
      requirements, do not require other implementations to keep pace.
      Introduce __nvdimm_create() to carry the new parameters and keep
      nvdimm_create() with the long standing default api.
      Signed-off-by: NDave Jiang <dave.jiang@intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      
      [ Shile: fixed conflict in drivers/acpi/nfit/nfit.h ]
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      83d94276
    • D
      acpi/nfit: Add support for Intel DSM 1.8 commands · d7258548
      Dave Jiang 提交于
      to #27305291
      
      commit b3ed2ce024c36054e51cca2eb31a1cdbe4a5f11e upstream.
      
      Add command definition for security commands defined in Intel DSM
      specification v1.8 [1]. This includes "get security state", "set
      passphrase", "unlock unit", "freeze lock", "secure erase", "overwrite",
      "overwrite query", "master passphrase enable/disable", and "master
      erase", . Since this adds several Intel definitions, move the relevant
      bits to their own header.
      
      These commands mutate physical data, but that manipulation is not cache
      coherent. The requirement to flush and invalidate caches makes these
      commands unsuitable to be called from userspace, so extra logic is added
      to detect and block these commands from being submitted via the ioctl
      command submission path.
      
      Lastly, the commands may contain sensitive key material that should not
      be dumped in a standard debug session. Update the nvdimm-command
      payload-dump facility to move security command payloads behind a
      default-off compile time switch.
      
      [1]: http://pmem.io/documents/NVDIMM_DSM_Interface-V1.8.pdfSigned-off-by: NDave Jiang <dave.jiang@intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      
      [ Shile: fixed conflicts:
      This patch updated the file "drivers/acpi/nfit/intel.h". The header file is
      introduced by commit 0ead111 ("acpi, nfit: Collect shutdown status") in
      upstream, which also update the test files. So let's fetch this part to fix
      the conflict:
      - tools/testing/nvdimm/test/nfit.c
      - tools/testing/nvdimm/test/nfit_test.h ]
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      d7258548
    • P
      io_uring: fix current->mm NULL dereference on exit · 1f9ef808
      Pavel Begunkov 提交于
      to #29197839
      
      commit d60b5fbc1ce8210759b568da49d149b868e7c6d3 upstream.
      
      Don't reissue requests from io_iopoll_reap_events(), the task may not
      have mm, which ends up with NULL. It's better to kill everything off on
      exit anyway.
      
      [  677.734670] RIP: 0010:io_iopoll_complete+0x27e/0x630
      ...
      [  677.734679] Call Trace:
      [  677.734695]  ? __send_signal+0x1f2/0x420
      [  677.734698]  ? _raw_spin_unlock_irqrestore+0x24/0x40
      [  677.734699]  ? send_signal+0xf5/0x140
      [  677.734700]  io_iopoll_getevents+0x12f/0x1a0
      [  677.734702]  io_iopoll_reap_events.part.0+0x5e/0xa0
      [  677.734703]  io_ring_ctx_wait_and_kill+0x132/0x1c0
      [  677.734704]  io_uring_release+0x20/0x30
      [  677.734706]  __fput+0xcd/0x230
      [  677.734707]  ____fput+0xe/0x10
      [  677.734709]  task_work_run+0x67/0xa0
      [  677.734710]  do_exit+0x35d/0xb70
      [  677.734712]  do_group_exit+0x43/0xa0
      [  677.734713]  get_signal+0x140/0x900
      [  677.734715]  do_signal+0x37/0x780
      [  677.734717]  ? enqueue_hrtimer+0x41/0xb0
      [  677.734718]  ? recalibrate_cpu_khz+0x10/0x10
      [  677.734720]  ? ktime_get+0x3e/0xa0
      [  677.734721]  ? lapic_next_deadline+0x26/0x30
      [  677.734723]  ? tick_program_event+0x4d/0x90
      [  677.734724]  ? __hrtimer_get_next_event+0x4d/0x80
      [  677.734726]  __prepare_exit_to_usermode+0x126/0x1c0
      [  677.734741]  prepare_exit_to_usermode+0x9/0x40
      [  677.734742]  idtentry_exit_cond_rcu+0x4c/0x60
      [  677.734743]  sysvec_reschedule_ipi+0x92/0x160
      [  677.734744]  ? asm_sysvec_reschedule_ipi+0xa/0x20
      [  677.734745]  asm_sysvec_reschedule_ipi+0x12/0x20
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NLiu Bo <bo.liu@linux.alibaba.com>
      1f9ef808
    • P
      io_uring: fix hanging iopoll in case of -EAGAIN · d52d1291
      Pavel Begunkov 提交于
      to #29197839
      
      commit cd664b0e35cb1202f40c259a1a5ea791d18c879d upstream.
      
      io_do_iopoll() won't do anything with a request unless
      req->iopoll_completed is set. So io_complete_rw_iopoll() has to set
      it, otherwise io_do_iopoll() will poll a file again and again even
      though the request of interest was completed long time ago.
      
      Also, remove -EAGAIN check from io_issue_sqe() as it races with
      the changed lines. The request will take the long way and be
      resubmitted from io_iopoll*().
      
      Fixes: bbde017a32b3 ("io_uring: add memory barrier to synchronize io_kiocb's result and iopoll_completed")
      Signed-off-by: NPavel Begunkov <asml.silence@gmail.com>
      Signed-off-by: NJens Axboe <axboe@kernel.dk>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NLiu Bo <bo.liu@linux.alibaba.com>
      d52d1291
    • S
      configs: arm64: keep the unified configs tuned for both arches · bc60596f
      Shile Zhang 提交于
      to #27182371
      
      Restored all the tuned configs for Cloud Kernel before, to keep
      the unified configs for both x86_64 and ARM64.
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      bc60596f
    • S
      configs: arm64: reconfig to sync with internal version · 7c5a791c
      Shile Zhang 提交于
      to #27182371
      
      Reconfig the ARM64 with Alibaba internal kernel help to keep
      the unified kernel configs.
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      7c5a791c
    • C
      mm, page_alloc: reset the zone->watermark_boost early · bd231c59
      Charan Teja Reddy 提交于
      to #28825456
      
      commit aa09259109583b98b9d9e7ed0d8eb1b880d1eb97 upstream.
      
      Updating the zone watermarks by any means, like min_free_kbytes,
      water_mark_scale_factor etc, when ->watermark_boost is set will result in
      higher low and high watermarks than the user asked.
      
      Below are the steps to reproduce the problem on system setup of Android
      kernel running on Snapdragon hardware.
      
      1) Default settings of the system are as below:
      
         #cat /proc/sys/vm/min_free_kbytes = 5162
         #cat /proc/zoneinfo | grep -e boost -e low -e "high " -e min -e Node
      	Node 0, zone   Normal
      		min      797
      		low      8340
      		high     8539
      
      2) Monitor the zone->watermark_boost(by adding a debug print in the
         kernel) and whenever it is greater than zero value, write the same
         value of min_free_kbytes obtained from step 1.
      
         #echo 5162 > /proc/sys/vm/min_free_kbytes
      
      3) Then read the zone watermarks in the system while the
         ->watermark_boost is zero.  This should show the same values of
         watermarks as step 1 but shown a higher values than asked.
      
         #cat /proc/zoneinfo | grep -e boost -e low -e "high " -e min -e Node
      	Node 0, zone   Normal
      		min      797
      		low      21148
      		high     21347
      
      These higher values are because of updating the zone watermarks using the
      macro min_wmark_pages(zone) which also adds the zone->watermark_boost.
      
      	#define min_wmark_pages(z) (z->_watermark[WMARK_MIN] +
      					z->watermark_boost)
      
      So the steps that lead to the issue are:
      
      1) On the extfrag event, watermarks are boosted by storing the required
         value in ->watermark_boost.
      
      2) User tries to update the zone watermarks level in the system through
         min_free_kbytes or watermark_scale_factor.
      
      3) Later, when kswapd woke up, it resets the zone->watermark_boost to
         zero.
      
      In step 2), we use the min_wmark_pages() macro to store the watermarks
      in the zone structure thus the values are always offsetted by
      ->watermark_boost value. This can be avoided by resetting the
      ->watermark_boost to zero before it is used.
      Signed-off-by: NCharan Teja Reddy <charante@codeaurora.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NBaoquan He <bhe@redhat.com>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Link: http://lkml.kernel.org/r/1589457511-4255-1-git-send-email-charante@codeaurora.orgSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      bd231c59
    • H
      mm: limit boost_watermark on small zones · ab70cdb0
      Henry Willard 提交于
      to #28825456
      
      commit 14f69140ff9c92a0928547ceefb153a842e8492c upstream.
      
      Commit 1c30844d2dfe ("mm: reclaim small amounts of memory when an
      external fragmentation event occurs") adds a boost_watermark() function
      which increases the min watermark in a zone by at least
      pageblock_nr_pages or the number of pages in a page block.
      
      On Arm64, with 64K pages and 512M huge pages, this is 8192 pages or
      512M.  It does this regardless of the number of managed pages managed in
      the zone or the likelihood of success.
      
      This can put the zone immediately under water in terms of allocating
      pages from the zone, and can cause a small machine to fail immediately
      due to OoM.  Unlike set_recommended_min_free_kbytes(), which
      substantially increases min_free_kbytes and is tied to THP,
      boost_watermark() can be called even if THP is not active.
      
      The problem is most likely to appear on architectures such as Arm64
      where pageblock_nr_pages is very large.
      
      It is desirable to run the kdump capture kernel in as small a space as
      possible to avoid wasting memory.  In some architectures, such as Arm64,
      there are restrictions on where the capture kernel can run, and
      therefore, the space available.  A capture kernel running in 768M can
      fail due to OoM immediately after boost_watermark() sets the min in zone
      DMA32, where most of the memory is, to 512M.  It fails even though there
      is over 500M of free memory.  With boost_watermark() suppressed, the
      capture kernel can run successfully in 448M.
      
      This patch limits boost_watermark() to boosting a zone's min watermark
      only when there are enough pages that the boost will produce positive
      results.  In this case that is estimated to be four times as many pages
      as pageblock_nr_pages.
      
      Mel said:
      
      : There is no harm in marking it stable.  Clearly it does not happen very
      : often but it's not impossible.  32-bit x86 is a lot less common now
      : which would previously have been vulnerable to triggering this easily.
      : ppc64 has a larger base page size but typically only has one zone.
      : arm64 is likely the most vulnerable, particularly when CMA is
      : configured with a small movable zone.
      
      Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
      Signed-off-by: NHenry Willard <henry.willard@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Link: http://lkml.kernel.org/r/1588294148-6586-1-git-send-email-henry.willard@oracle.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      
      [xuyu: expand zone_managed_pages function]
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      ab70cdb0
    • M
      mm, vmscan: do not special-case slab reclaim when watermarks are boosted · fb4da0ed
      Mel Gorman 提交于
      to #28825456
      
      commit 28360f398778d7623a5ff8a8e90958c0d925e120 upstream.
      
      Dave Chinner reported a problem pointing a finger at commit 1c30844d2dfe
      ("mm: reclaim small amounts of memory when an external fragmentation
      event occurs").
      
      The report is extensive:
      
        https://lore.kernel.org/linux-mm/20190807091858.2857-1-david@fromorbit.com/
      
      and it's worth recording the most relevant parts (colorful language and
      typos included).
      
      	When running a simple, steady state 4kB file creation test to
      	simulate extracting tarballs larger than memory full of small
      	files into the filesystem, I noticed that once memory fills up
      	the cache balance goes to hell.
      
      	The workload is creating one dirty cached inode for every dirty
      	page, both of which should require a single IO each to clean and
      	reclaim, and creation of inodes is throttled by the rate at which
      	dirty writeback runs at (via balance dirty pages). Hence the ingest
      	rate of new cached inodes and page cache pages is identical and
      	steady. As a result, memory reclaim should quickly find a steady
      	balance between page cache and inode caches.
      
      	The moment memory fills, the page cache is reclaimed at a much
      	faster rate than the inode cache, and evidence suggests that
      	the inode cache shrinker is not being called when large batches
      	of pages are being reclaimed. In roughly the same time period
      	that it takes to fill memory with 50% pages and 50% slab caches,
      	memory reclaim reduces the page cache down to just dirty pages
      	and slab caches fill the entirety of memory.
      
      	The LRU is largely full of dirty pages, and we're getting spikes
      	of random writeback from memory reclaim so it's all going to shit.
      	Behaviour never recovers, the page cache remains pinned at just
      	dirty pages, and nothing I could tune would make any difference.
      	vfs_cache_pressure makes no difference - I would set it so high
      	it should trim the entire inode caches in a single pass, yet it
      	didn't do anything. It was clear from tracing and live telemetry
      	that the shrinkers were pretty much not running except when
      	there was absolutely no memory free at all, and then they did
      	the minimum necessary to free memory to make progress.
      
      	So I went looking at the code, trying to find places where pages
      	got reclaimed and the shrinkers weren't called. There's only one
      	- kswapd doing boosted reclaim as per commit 1c30844d2dfe ("mm:
      	reclaim small amounts of memory when an external fragmentation
      	event occurs").
      
      The watermark boosting introduced by the commit is triggered in response
      to an allocation "fragmentation event".  The boosting was not intended
      to target THP specifically and triggers even if THP is disabled.
      However, with Dave's perfectly reasonable workload, fragmentation events
      can be very common given the ratio of slab to page cache allocations so
      boosting remains active for long periods of time.
      
      As high-order allocations might use compaction and compaction cannot
      move slab pages the decision was made in the commit to special-case
      kswapd when watermarks are boosted -- kswapd avoids reclaiming slab as
      reclaiming slab does not directly help compaction.
      
      As Dave notes, this decision means that slab can be artificially
      protected for long periods of time and messes up the balance with slab
      and page caches.
      
      Removing the special casing can still indirectly help avoid
      fragmentation by avoiding fragmentation-causing events due to slab
      allocation as pages from a slab pageblock will have some slab objects
      freed.  Furthermore, with the special casing, reclaim behaviour is
      unpredictable as kswapd sometimes examines slab and sometimes does not
      in a manner that is tricky to tune or analyse.
      
      This patch removes the special casing.  The downside is that this is not
      a universal performance win.  Some benchmarks that depend on the
      residency of data when rereading metadata may see a regression when slab
      reclaim is restored to its original behaviour.  Similarly, some
      benchmarks that only read-once or write-once may perform better when
      page reclaim is too aggressive.  The primary upside is that slab
      shrinker is less surprising (arguably more sane but that's a matter of
      opinion), behaves consistently regardless of the fragmentation state of
      the system and properly obeys VM sysctls.
      
      A fsmark benchmark configuration was constructed similar to what Dave
      reported and is codified by the mmtest configuration
      config-io-fsmark-small-file-stream.  It was evaluated on a 1-socket
      machine to avoid dealing with NUMA-related issues and the timing of
      reclaim.  The storage was an SSD Samsung Evo and a fresh trimmed XFS
      filesystem was used for the test data.
      
      This is not an exact replication of Dave's setup.  The configuration
      scales its parameters depending on the memory size of the SUT to behave
      similarly across machines.  The parameters mean the first sample
      reported by fs_mark is using 50% of RAM which will barely be throttled
      and look like a big outlier.  Dave used fake NUMA to have multiple
      kswapd instances which I didn't replicate.  Finally, the number of
      iterations differ from Dave's test as the target disk was not large
      enough.  While not identical, it should be representative.
      
        fsmark
                                           5.3.0-rc3              5.3.0-rc3
                                             vanilla          shrinker-v1r1
        Min       1-files/sec     4444.80 (   0.00%)     4765.60 (   7.22%)
        1st-qrtle 1-files/sec     5005.10 (   0.00%)     5091.70 (   1.73%)
        2nd-qrtle 1-files/sec     4917.80 (   0.00%)     4855.60 (  -1.26%)
        3rd-qrtle 1-files/sec     4667.40 (   0.00%)     4831.20 (   3.51%)
        Max-1     1-files/sec    11421.50 (   0.00%)     9999.30 ( -12.45%)
        Max-5     1-files/sec    11421.50 (   0.00%)     9999.30 ( -12.45%)
        Max-10    1-files/sec    11421.50 (   0.00%)     9999.30 ( -12.45%)
        Max-90    1-files/sec     4649.60 (   0.00%)     4780.70 (   2.82%)
        Max-95    1-files/sec     4491.00 (   0.00%)     4768.20 (   6.17%)
        Max-99    1-files/sec     4491.00 (   0.00%)     4768.20 (   6.17%)
        Max       1-files/sec    11421.50 (   0.00%)     9999.30 ( -12.45%)
        Hmean     1-files/sec     5004.75 (   0.00%)     5075.96 (   1.42%)
        Stddev    1-files/sec     1778.70 (   0.00%)     1369.66 (  23.00%)
        CoeffVar  1-files/sec       33.70 (   0.00%)       26.05 (  22.71%)
        BHmean-99 1-files/sec     5053.72 (   0.00%)     5101.52 (   0.95%)
        BHmean-95 1-files/sec     5053.72 (   0.00%)     5101.52 (   0.95%)
        BHmean-90 1-files/sec     5107.05 (   0.00%)     5131.41 (   0.48%)
        BHmean-75 1-files/sec     5208.45 (   0.00%)     5206.68 (  -0.03%)
        BHmean-50 1-files/sec     5405.53 (   0.00%)     5381.62 (  -0.44%)
        BHmean-25 1-files/sec     6179.75 (   0.00%)     6095.14 (  -1.37%)
      
                           5.3.0-rc3   5.3.0-rc3
                             vanillashrinker-v1r1
        Duration User         501.82      497.29
        Duration System      4401.44     4424.08
        Duration Elapsed     8124.76     8358.05
      
      This is showing a slight skew for the max result representing a large
      outlier for the 1st, 2nd and 3rd quartile are similar indicating that
      the bulk of the results show little difference.  Note that an earlier
      version of the fsmark configuration showed a regression but that
      included more samples taken while memory was still filling.
      
      Note that the elapsed time is higher.  Part of this is that the
      configuration included time to delete all the test files when the test
      completes -- the test automation handles the possibility of testing
      fsmark with multiple thread counts.  Without the patch, many of these
      objects would be memory resident which is part of what the patch is
      addressing.
      
      There are other important observations that justify the patch.
      
      1. With the vanilla kernel, the number of dirty pages in the system is
         very low for much of the test. With this patch, dirty pages is
         generally kept at 10% which matches vm.dirty_background_ratio which
         is normal expected historical behaviour.
      
      2. With the vanilla kernel, the ratio of Slab/Pagecache is close to
         0.95 for much of the test i.e. Slab is being left alone and
         dominating memory consumption. With the patch applied, the ratio
         varies between 0.35 and 0.45 with the bulk of the measured ratios
         roughly half way between those values. This is a different balance to
         what Dave reported but it was at least consistent.
      
      3. Slabs are scanned throughout the entire test with the patch applied.
         The vanille kernel has periods with no scan activity and then
         relatively massive spikes.
      
      4. Without the patch, kswapd scan rates are very variable. With the
         patch, the scan rates remain quite steady.
      
      4. Overall vmstats are closer to normal expectations
      
      	                                5.3.0-rc3      5.3.0-rc3
      	                                  vanilla  shrinker-v1r1
          Ops Direct pages scanned             99388.00      328410.00
          Ops Kswapd pages scanned          45382917.00    33451026.00
          Ops Kswapd pages reclaimed        30869570.00    25239655.00
          Ops Direct pages reclaimed           74131.00        5830.00
          Ops Kswapd efficiency %                 68.02          75.45
          Ops Kswapd velocity                   5585.75        4002.25
          Ops Page reclaim immediate         1179721.00      430927.00
          Ops Slabs scanned                 62367361.00    73581394.00
          Ops Direct inode steals               2103.00        1002.00
          Ops Kswapd inode steals             570180.00     5183206.00
      
      	o Vanilla kernel is hitting direct reclaim more frequently,
      	  not very much in absolute terms but the fact the patch
      	  reduces it is interesting
      	o "Page reclaim immediate" in the vanilla kernel indicates
      	  dirty pages are being encountered at the tail of the LRU.
      	  This is generally bad and means in this case that the LRU
      	  is not long enough for dirty pages to be cleaned by the
      	  background flush in time. This is much reduced by the
      	  patch.
      	o With the patch, kswapd is reclaiming 10 times more slab
      	  pages than with the vanilla kernel. This is indicative
      	  of the watermark boosting over-protecting slab
      
      A more complete set of tests were run that were part of the basis for
      introducing boosting and while there are some differences, they are well
      within tolerances.
      
      Bottom line, the special casing kswapd to avoid slab behaviour is
      unpredictable and can lead to abnormal results for normal workloads.
      
      This patch restores the expected behaviour that slab and page cache is
      balanced consistently for a workload with a steady allocation ratio of
      slab/pagecache pages.  It also means that if there are workloads that
      favour the preservation of slab over pagecache that it can be tuned via
      vm.vfs_cache_pressure where as the vanilla kernel effectively ignores
      the parameter when boosting is active.
      
      Link: http://lkml.kernel.org/r/20190808182946.GM2739@techsingularity.net
      Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Reviewed-by: NDave Chinner <dchinner@redhat.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org>	[5.0+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      fb4da0ed
    • A
      mm/page_alloc.c: fix never set ALLOC_NOFRAGMENT flag · a274182f
      Andrey Ryabinin 提交于
      to #28825456
      
      commit 8118b82eb756e271929697e8ada5f637dc443af1 upstream.
      
      Commit 0a79cdad5eb2 ("mm: use alloc_flags to record if kswapd can wake")
      removed setting of the ALLOC_NOFRAGMENT flag.  Bring it back.
      
      The runtime effect is that ALLOC_NOFRAGMENT behaviour is restored so
      that allocations are spread across local zones to avoid fragmentation
      due to mixing pageblocks as long as possible.
      
      Link: http://lkml.kernel.org/r/20190423120806.3503-2-aryabinin@virtuozzo.com
      Fixes: 0a79cdad5eb2 ("mm: use alloc_flags to record if kswapd can wake")
      Signed-off-by: NAndrey Ryabinin <aryabinin@virtuozzo.com>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      a274182f
    • A
      mm/page_alloc.c: avoid potential NULL pointer dereference · 60eb3dfc
      Andrey Ryabinin 提交于
      to #28825456
      
      commit 8139ad043d632c0e9e12d760068a7a8e91659aa1 upstream.
      
      ac.preferred_zoneref->zone passed to alloc_flags_nofragment() can be NULL.
      'zone' pointer unconditionally derefernced in alloc_flags_nofragment().
      Bail out on NULL zone to avoid potential crash.  Currently we don't see
      any crashes only because alloc_flags_nofragment() has another bug which
      allows compiler to optimize away all accesses to 'zone'.
      
      Link: http://lkml.kernel.org/r/20190423120806.3503-1-aryabinin@virtuozzo.com
      Fixes: 6bb154504f8b ("mm, page_alloc: spread allocations across zones before introducing fragmentation")
      Signed-off-by: NAndrey Ryabinin <aryabinin@virtuozzo.com>
      Acked-by: NMel Gorman <mgorman@techsingularity.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      60eb3dfc
    • M
      mm: do not boost watermarks to avoid fragmentation for the DISCONTIG memory model · d5971569
      Mel Gorman 提交于
      to #28825456
      
      commit 24512228b7a3f412b5a51f189df302616b021c33 upstream.
      
      Mikulas Patocka reported that commit 1c30844d2dfe ("mm: reclaim small
      amounts of memory when an external fragmentation event occurs") "broke"
      memory management on parisc.
      
      The machine is not NUMA but the DISCONTIG model creates three pgdats
      even though it's a UMA machine for the following ranges
      
              0) Start 0x0000000000000000 End 0x000000003fffffff Size   1024 MB
              1) Start 0x0000000100000000 End 0x00000001bfdfffff Size   3070 MB
              2) Start 0x0000004040000000 End 0x00000040ffffffff Size   3072 MB
      
      Mikulas reported:
      
      	With the patch 1c30844d2, the kernel will incorrectly reclaim the
      	first zone when it fills up, ignoring the fact that there are two
      	completely free zones. Basiscally, it limits cache size to 1GiB.
      
      	For example, if I run:
      	# dd if=/dev/sda of=/dev/null bs=1M count=2048
      
      	- with the proper kernel, there should be "Buffers - 2GiB"
      	when this command finishes. With the patch 1c30844d2, buffers
      	will consume just 1GiB or slightly more, because the kernel was
      	incorrectly reclaiming them.
      
      The page allocator and reclaim makes assumptions that pgdats really
      represent NUMA nodes and zones represent ranges and makes decisions on
      that basis.  Watermark boosting for small pgdats leads to unexpected
      results even though this would have behaved reasonably on SPARSEMEM.
      
      DISCONTIG is essentially deprecated and even parisc plans to move to
      SPARSEMEM so there is no need to be fancy, this patch simply disables
      watermark boosting by default on DISCONTIGMEM.
      
      Link: http://lkml.kernel.org/r/20190419094335.GJ18914@techsingularity.net
      Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Reported-by: NMikulas Patocka <mpatocka@redhat.com>
      Tested-by: NMikulas Patocka <mpatocka@redhat.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: James Bottomley <James.Bottomley@hansenpartnership.com>
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      d5971569
    • M
      mm, page_alloc: fix a division by zero error when boosting watermarks v2 · 92bebf3c
      Mel Gorman 提交于
      to #28825456
      
      commit 94b3334cbebea34d56a7e6321c6fe9d89b309a49 upstream.
      
      Yury Norov reported that an arm64 KVM instance could not boot since
      after v5.0-rc1 and could addressed by reverting the patches
      
        1c30844d2dfe272d58c ("mm: reclaim small amounts of memory when an external
        73444bc4d8f92e46a20 ("mm, page_alloc: do not wake kswapd with zone lock held")
      
      The problem is that a division by zero error is possible if boosting
      occurs very early in boot if the system has very little memory.  This
      patch avoids the division by zero error.
      
      Link: http://lkml.kernel.org/r/20190213143012.GT9565@techsingularity.net
      Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Reported-by: NYury Norov <yury.norov@gmail.com>
      Tested-by: NYury Norov <yury.norov@gmail.com>
      Tested-by: NWill Deacon <will.deacon@arm.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      92bebf3c
    • M
      mm, page_alloc: do not wake kswapd with zone lock held · ba16c9c8
      Mel Gorman 提交于
      to #28825456
      
      commit 73444bc4d8f92e46a20cb6bd3342fc2ea75c6787 upstream.
      
      syzbot reported the following regression in the latest merge window and
      it was confirmed by Qian Cai that a similar bug was visible from a
      different context.
      
        ======================================================
        WARNING: possible circular locking dependency detected
        4.20.0+ #297 Not tainted
        ------------------------------------------------------
        syz-executor0/8529 is trying to acquire lock:
        000000005e7fb829 (&pgdat->kswapd_wait){....}, at:
        __wake_up_common_lock+0x19e/0x330 kernel/sched/wait.c:120
      
        but task is already holding lock:
        000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: spin_lock
        include/linux/spinlock.h:329 [inline]
        000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: rmqueue_bulk
        mm/page_alloc.c:2548 [inline]
        000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: __rmqueue_pcplist
        mm/page_alloc.c:3021 [inline]
        000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: rmqueue_pcplist
        mm/page_alloc.c:3050 [inline]
        000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: rmqueue
        mm/page_alloc.c:3072 [inline]
        000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at:
        get_page_from_freelist+0x1bae/0x52a0 mm/page_alloc.c:3491
      
      It appears to be a false positive in that the only way the lock ordering
      should be inverted is if kswapd is waking itself and the wakeup
      allocates debugging objects which should already be allocated if it's
      kswapd doing the waking.  Nevertheless, the possibility exists and so
      it's best to avoid the problem.
      
      This patch flags a zone as needing a kswapd using the, surprisingly,
      unused zone flag field.  The flag is read without the lock held to do
      the wakeup.  It's possible that the flag setting context is not the same
      as the flag clearing context or for small races to occur.  However, each
      race possibility is harmless and there is no visible degredation in
      fragmentation treatment.
      
      While zone->flag could have continued to be unused, there is potential
      for moving some existing fields into the flags field instead.
      Particularly read-mostly ones like zone->initialized and
      zone->contiguous.
      
      Link: http://lkml.kernel.org/r/20190103225712.GJ31517@techsingularity.net
      Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
      Reported-by: syzbot+93d94a001cfbce9e60e1@syzkaller.appspotmail.com
      Signed-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Tested-by: NQian Cai <cai@lca.pw>
      Cc: Dmitry Vyukov <dvyukov@google.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Michal Hocko <mhocko@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      
      Conflicts:
      	include/linux/mmzone.h
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      ba16c9c8
    • M
      mm: reclaim small amounts of memory when an external fragmentation event occurs · 9bcadc70
      Mel Gorman 提交于
      to #28825456
      
      commit 1c30844d2dfe272d58c8fc000960b835d13aa2ac upstream.
      
      An external fragmentation event was previously described as
      
          When the page allocator fragments memory, it records the event using
          the mm_page_alloc_extfrag event. If the fallback_order is smaller
          than a pageblock order (order-9 on 64-bit x86) then it's considered
          an event that will cause external fragmentation issues in the future.
      
      The kernel reduces the probability of such events by increasing the
      watermark sizes by calling set_recommended_min_free_kbytes early in the
      lifetime of the system.  This works reasonably well in general but if
      there are enough sparsely populated pageblocks then the problem can still
      occur as enough memory is free overall and kswapd stays asleep.
      
      This patch introduces a watermark_boost_factor sysctl that allows a zone
      watermark to be temporarily boosted when an external fragmentation causing
      events occurs.  The boosting will stall allocations that would decrease
      free memory below the boosted low watermark and kswapd is woken if the
      calling context allows to reclaim an amount of memory relative to the size
      of the high watermark and the watermark_boost_factor until the boost is
      cleared.  When kswapd finishes, it wakes kcompactd at the pageblock order
      to clean some of the pageblocks that may have been affected by the
      fragmentation event.  kswapd avoids any writeback, slab shrinkage and swap
      from reclaim context during this operation to avoid excessive system
      disruption in the name of fragmentation avoidance.  Care is taken so that
      kswapd will do normal reclaim work if the system is really low on memory.
      
      This was evaluated using the same workloads as "mm, page_alloc: Spread
      allocations across zones before introducing fragmentation".
      
      1-socket Skylake machine
      config-global-dhp__workload_thpfioscale XFS (no special madvise)
      4 fio threads, 1 THP allocating thread
      --------------------------------------
      
      4.20-rc3 extfrag events < order 9:   804694
      4.20-rc3+patch:                      408912 (49% reduction)
      4.20-rc3+patch1-4:                    18421 (98% reduction)
      
                                         4.20.0-rc3             4.20.0-rc3
                                       lowzone-v5r8             boost-v5r8
      Amean     fault-base-1      653.58 (   0.00%)      652.71 (   0.13%)
      Amean     fault-huge-1        0.00 (   0.00%)      178.93 * -99.00%*
      
                                    4.20.0-rc3             4.20.0-rc3
                                  lowzone-v5r8             boost-v5r8
      Percentage huge-1        0.00 (   0.00%)        5.12 ( 100.00%)
      
      Note that external fragmentation causing events are massively reduced by
      this path whether in comparison to the previous kernel or the vanilla
      kernel.  The fault latency for huge pages appears to be increased but that
      is only because THP allocations were successful with the patch applied.
      
      1-socket Skylake machine
      global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
      -----------------------------------------------------------------
      
      4.20-rc3 extfrag events < order 9:  291392
      4.20-rc3+patch:                     191187 (34% reduction)
      4.20-rc3+patch1-4:                   13464 (95% reduction)
      
      thpfioscale Fault Latencies
                                         4.20.0-rc3             4.20.0-rc3
                                       lowzone-v5r8             boost-v5r8
      Min       fault-base-1      912.00 (   0.00%)      905.00 (   0.77%)
      Min       fault-huge-1      127.00 (   0.00%)      135.00 (  -6.30%)
      Amean     fault-base-1     1467.55 (   0.00%)     1481.67 (  -0.96%)
      Amean     fault-huge-1     1127.11 (   0.00%)     1063.88 *   5.61%*
      
                                    4.20.0-rc3             4.20.0-rc3
                                  lowzone-v5r8             boost-v5r8
      Percentage huge-1       77.64 (   0.00%)       83.46 (   7.49%)
      
      As before, massive reduction in external fragmentation events, some jitter
      on latencies and an increase in THP allocation success rates.
      
      2-socket Haswell machine
      config-global-dhp__workload_thpfioscale XFS (no special madvise)
      4 fio threads, 5 THP allocating threads
      ----------------------------------------------------------------
      
      4.20-rc3 extfrag events < order 9:  215698
      4.20-rc3+patch:                     200210 (7% reduction)
      4.20-rc3+patch1-4:                   14263 (93% reduction)
      
                                         4.20.0-rc3             4.20.0-rc3
                                       lowzone-v5r8             boost-v5r8
      Amean     fault-base-5     1346.45 (   0.00%)     1306.87 (   2.94%)
      Amean     fault-huge-5     3418.60 (   0.00%)     1348.94 (  60.54%)
      
                                    4.20.0-rc3             4.20.0-rc3
                                  lowzone-v5r8             boost-v5r8
      Percentage huge-5        0.78 (   0.00%)        7.91 ( 910.64%)
      
      There is a 93% reduction in fragmentation causing events, there is a big
      reduction in the huge page fault latency and allocation success rate is
      higher.
      
      2-socket Haswell machine
      global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
      -----------------------------------------------------------------
      
      4.20-rc3 extfrag events < order 9: 166352
      4.20-rc3+patch:                    147463 (11% reduction)
      4.20-rc3+patch1-4:                  11095 (93% reduction)
      
      thpfioscale Fault Latencies
                                         4.20.0-rc3             4.20.0-rc3
                                       lowzone-v5r8             boost-v5r8
      Amean     fault-base-5     6217.43 (   0.00%)     7419.67 * -19.34%*
      Amean     fault-huge-5     3163.33 (   0.00%)     3263.80 (  -3.18%)
      
                                    4.20.0-rc3             4.20.0-rc3
                                  lowzone-v5r8             boost-v5r8
      Percentage huge-5       95.14 (   0.00%)       87.98 (  -7.53%)
      
      There is a large reduction in fragmentation events with some jitter around
      the latencies and success rates.  As before, the high THP allocation
      success rate does mean the system is under a lot of pressure.  However, as
      the fragmentation events are reduced, it would be expected that the
      long-term allocation success rate would be higher.
      
      Link: http://lkml.kernel.org/r/20181123114528.28802-5-mgorman@techsingularity.netSigned-off-by: NMel Gorman <mgorman@techsingularity.net>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Zi Yan <zi.yan@cs.rutgers.edu>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      9bcadc70