1. 15 Jan 2020, 7 commits
    • alinux: mm: memcontrol: support background async page reclaim · 6967792f
      Committed by Yang Shi
      Currently, when memory usage exceeds the memory cgroup limit, the memory
      cgroup can only do synchronous direct reclaim.  This may incur unexpected
      stalls in applications that are sensitive to latency.  Introduce a
      background async page reclaim mechanism, similar to what kswapd does.
      
      Define a memcg memory usage watermark by introducing the wmark_ratio
      interface, which ranges from 0 to 100 and represents a percentage of the
      max limit.  The wmark_high is calculated as (max * wmark_ratio / 100) and
      the wmark_low as (wmark_high - wmark_high >> 8), which is an empirical
      value.  If wmark_ratio is 0, the watermark is disabled and both wmark_low
      and wmark_high equal max; this is the default.
      
      If wmark_ratio is set, then when charging a page, if usage is greater
      than wmark_high (meaning the memcg's available memory is low), a work
      item is scheduled to do background page reclaim until memory usage is
      reduced to wmark_low, if possible.
      
      Define a dedicated unbound workqueue for scheduling the watermark
      reclaim work items.
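      As a rough illustration of the watermark arithmetic described above (the
      helper name and calling convention below are hypothetical, not the actual
      AliOS Cloud Kernel code):

          /* Hypothetical helper: derive both watermarks from wmark_ratio. */
          static void memcg_calc_wmarks(unsigned long max, unsigned int wmark_ratio,
                                        unsigned long *wmark_low,
                                        unsigned long *wmark_high)
          {
                  if (!wmark_ratio) {
                          /* wmark_ratio == 0: watermarks disabled, both equal max. */
                          *wmark_high = max;
                          *wmark_low = max;
                          return;
                  }
                  *wmark_high = max * wmark_ratio / 100;
                  /* Empirical gap of wmark_high/256 below wmark_high. */
                  *wmark_low = *wmark_high - (*wmark_high >> 8);
          }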
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      6967792f
    • alinux: mm: vmscan: make it sane reclaim if cgwb_v1 is enabled · 49a3b465
      Committed by Yang Shi
      AliOS Cloud Kernel has cgroup writeback support for v1, so reclaim can be
      treated as sane reclaim when cgwb_v1 is enabled.
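      A minimal sketch of the kind of check described above, assuming the
      4.19-era sane_reclaim() helper in mm/vmscan.c; treating cgwb_v1 as a
      simple boolean flag here is an illustrative assumption:

          static bool sane_reclaim(struct scan_control *sc)
          {
                  struct mem_cgroup *memcg = sc->target_mem_cgroup;

                  if (!memcg)
                          return true;
          #ifdef CONFIG_CGROUP_WRITEBACK
                  /* cgroup v2, or cgroup v1 with cgwb_v1 enabled, can do
                   * sane (writeback-aware) reclaim. */
                  if (cgroup_subsys_on_dfl(memory_cgrp_subsys) || cgwb_v1)
                          return true;
          #endif
                  return false;
          }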
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: Caspar Zhang <caspar@linux.alibaba.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      49a3b465
    • alinux: mm, memcg: fix possible soft lockup in try_charge · 2ced6afd
      Committed by Xu Yu
      When events such as direct reclaim and OOM occur intensively, a soft
      lockup is very likely to happen in instances with one vCPU and with
      kernel preemption disabled.

      An example soft lockup is as follows.
      
      [  160.555984] watchdog: BUG: soft lockup - CPU#0 stuck for 112s! [malloc:2188]
      [  160.557975] Modules linked in: button
      [  160.559495] CPU: 0 PID: 2188 Comm: malloc Not tainted 4.19.57-15.457.al7.x86_64 #1
      [  160.561546] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 3288b3c 04/01/2014
      [  160.563707] RIP: 0010:shrink_node+0x1ae/0x450
      [  160.565391] Code: 00 00 00 49 8b 4f 20 ba 01 00 00 00 4c 8b 74 24 10 4d 8b 47 28 49 8b 77 10 48 2b 4c 24 08 41 8b 7f 1c 4d8
      [  160.570747] RSP: 0000:ffff9d0ec07a3b58 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff13
      [  160.572889] RAX: ffff982ab6014330 RBX: ffff982ab6014000 RCX: 0000000000000000
      [  160.574992] RDX: 0000000000000001 RSI: ffff982ab6014000 RDI: ffff982ab6014000
      [  160.577106] RBP: ffff982afffb6000 R08: 0000000000000000 R09: ffff982ab6014000
      [  160.579219] R10: 0000000000000004 R11: 0000000000aaaaaa R12: 0000000000000000
      [  160.581326] R13: 0000000000000000 R14: 0000000000000000 R15: ffff9d0ec07a3c50
      [  160.583450] FS:  00007f8b414f7740(0000) GS:ffff982afda00000(0000) knlGS:0000000000000000
      [  160.585704] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  160.587662] CR2: 00007f8adb800010 CR3: 000000007ac9e001 CR4: 00000000003606b0
      [  160.589835] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  160.591971] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  160.594133] Call Trace:
      [  160.595602]  do_try_to_free_pages+0xcc/0x390
      [  160.597356]  try_to_free_mem_cgroup_pages+0xf9/0x1d0
      [  160.599198]  ? out_of_memory+0xb5/0x4a0
      [  160.600882]  try_charge+0x244/0x750
      [  160.602522]  ? __pagevec_lru_add_fn+0x1d0/0x330
      [  160.604310]  mem_cgroup_try_charge+0xb4/0x1d0
      [  160.606085]  mem_cgroup_try_charge_delay+0x1c/0x40
      [  160.607892]  do_anonymous_page+0xf7/0x540
      [  160.609574]  __handle_mm_fault+0x665/0xa00
      [  160.611233]  ? __switch_to_asm+0x35/0x70
      [  160.612838]  handle_mm_fault+0x122/0x1e0
      [  160.614407]  __do_page_fault+0x1b7/0x470
      [  160.615962]  do_page_fault+0x32/0x140
      [  160.617474]  ? async_page_fault+0x8/0x30
      [  160.619012]  async_page_fault+0x1e/0x30
      [  160.620526] RIP: 0033:0x40068e
      
      Fix it by adding cond_resched() in try_charge(), just before the goto
      retry after OOM_SUCCESS, in order to let the OOM killer free some memory
      first.
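      The placement looks roughly like this (4.19-era try_charge() in
      mm/memcontrol.c; the surrounding lines are paraphrased context and only
      the cond_resched() call is the fix itself):

          oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask,
                                      get_order(nr_pages * PAGE_SIZE));
          switch (oom_status) {
          case OOM_SUCCESS:
                  nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
                  oomed = true;
                  /* Yield the CPU so the OOM victim can exit and free memory,
                   * avoiding a soft lockup on 1-vCPU, non-preemptible guests. */
                  cond_resched();
                  goto retry;
          case OOM_FAILED:
                  goto force;
          default:
                  goto nomem;
          }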
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      2ced6afd
    • alinux: memcg: Point wb to root memcg/blkcg when offlining to avoid zombie · cc00f21e
      Committed by Xunlei Pang
      After turning off memcg kmem charging, we still suffer from various
      zombie memcg problems in production environments because of non-zero
      reference counts held by both page caches and the per-memcg writeback
      structure (bdi_writeback takes a reference).
      
      Even after we have reclaimed all the page caches of a zombie memcg, it
      still can't be dropped due to its bdi_writeback.
      
      bdi_writeback is further referenced by the inodes of files, so the memcg
      can't be truly released until those inodes are destroyed, which is quite
      unlikely in the short term.
      
      When a memcg is offlined, switch its bdi_writeback to the root and call
      css_put to formally release it.  We've tested this in a production
      environment and it yields pretty good results.
      
      Ditto for wb_blkcg_offline().
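      An illustrative sketch only (not the actual alinux patch; locking, list
      walking and the blkcg side are omitted) of re-pointing an offlining
      memcg's writeback structure at the root memcg so the dying css can
      finally be released:

          static void wb_switch_to_root_memcg(struct bdi_writeback *wb)
          {
                  struct cgroup_subsys_state *old_css = wb->memcg_css;

                  /* Hand the wb over to the root memcg's css... */
                  css_get(&root_mem_cgroup->css);
                  wb->memcg_css = &root_mem_cgroup->css;

                  /* ...and formally release the dying memcg's css. */
                  css_put(old_css);
          }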
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      cc00f21e
    • mm/memory-hotplug: Allow memory resources to be children · 567bed57
      Committed by Dave Hansen
      commit 2794129e902d8eb69413d884dc6404b8716ed9ed upstream
      
      The mm/resource.c code is used to manage the physical address
      space.  The current resource configuration can be viewed in
      /proc/iomem.  An example of this is at the bottom of this
      description.
      
      The nvdimm subsystem "owns" the physical address resources which
      map to persistent memory and has resources inserted for them as
      "Persistent Memory".  The best way to repurpose this for volatile
      use is to leave the existing resource in place, but add a "System
      RAM" resource underneath it. This clearly communicates the
      ownership relationship of this memory.
      
      The request_resource_conflict() API only deals with the
      top-level resources.  Replace it with __request_region() which
      will search for !IORESOURCE_BUSY areas lower in the resource
      tree than the top level.
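      The replacement call looks roughly like this (sketch of the
      register_memory_resource() path in mm/memory_hotplug.c, not the exact
      upstream diff):

          struct resource *res;

          /* Unlike request_resource_conflict(), __request_region() descends
           * into !IORESOURCE_BUSY resources, so "System RAM" can be inserted
           * as a child of an existing "Persistent Memory" resource. */
          res = __request_region(&iomem_resource, start, size, "System RAM",
                                 IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY);
          if (!res)
                  return ERR_PTR(-EEXIST);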
      
      We *could* also simply truncate the existing top-level
      "Persistent Memory" resource and take over the released address
      space.  But, this means that if we ever decide to hot-unplug the
      "RAM" and give it back, we need to recreate the original setup,
      which may mean going back to the BIOS tables.
      
      This should have no real effect on the existing collision
      detection because the areas that truly conflict should be marked
      IORESOURCE_BUSY.
      
      00000000-00000fff : Reserved
      00001000-0009fbff : System RAM
      0009fc00-0009ffff : Reserved
      000a0000-000bffff : PCI Bus 0000:00
      000c0000-000c97ff : Video ROM
      000c9800-000ca5ff : Adapter ROM
      000f0000-000fffff : Reserved
        000f0000-000fffff : System ROM
      00100000-9fffffff : System RAM
        01000000-01e071d0 : Kernel code
        01e071d1-027dfdff : Kernel data
        02dc6000-0305dfff : Kernel bss
      a0000000-afffffff : Persistent Memory (legacy)
        a0000000-a7ffffff : System RAM
      b0000000-bffdffff : System RAM
      bffe0000-bfffffff : Reserved
      c0000000-febfffff : PCI Bus 0000:00
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Dan Williams <dan.j.williams@intel.com>
      Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: linux-nvdimm@lists.01.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      567bed57
    • mm/resource: Move HMM pr_debug() deeper into resource code · 0297fb96
      Committed by Dave Hansen
      commit b926b7f3baecb2a855db629e6822e1a85212e91c upstream
      
      HMM consumes physical address space for its own use, even
      though nothing is mapped or accessible there.  It uses a
      special resource description (IORES_DESC_DEVICE_PRIVATE_MEMORY)
      to uniquely identify these areas.
      
      When HMM consumes address space, it makes a best guess about
      what to consume.  However, it is possible that a future memory
      or device hotplug can collide with the reserved area.  In the
      case of these conflicts, there is an error message in
      register_memory_resource().
      
      Later patches in this series move register_memory_resource()
      from using request_resource_conflict() to __request_region().
      Unfortunately, __request_region() does not return the conflict
      like the previous function did, which makes it impossible to
      check for IORES_DESC_DEVICE_PRIVATE_MEMORY in a conflicting
      resource.
      
      Instead of warning in register_memory_resource(), move the
      check into the core resource code itself (__request_region())
      where the conflicting resource _is_ available.  This has the
      added bonus of producing a warning in case of HMM conflicts
      with devices *or* RAM address space, as opposed to the RAM-
      only warnings that were there previously.
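      The shape of the relocated check is roughly (kernel/resource.c-style
      sketch, not the exact upstream hunk):

          /* Inside __request_region(), once a conflicting resource is found. */
          if (conflict->desc == IORES_DESC_DEVICE_PRIVATE_MEMORY)
                  pr_debug("Unaddressable device %s %pR conflicts with %pR\n",
                           conflict->name, conflict, res);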
      Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: Jerome Glisse <jglisse@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: linux-nvdimm@lists.01.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Signed-off-by: Dan Williams <dan.j.williams@intel.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Gavin Shan <shan.gavin@linux.alibaba.com>
      0297fb96
    • mm, memcg: throttle allocators when failing reclaim over memory.high · 6202ab24
      Committed by Chris Down
      commit 0e4b01df865935007bd712cbc8e7299005b28894 upstream.
      
      We're trying to use memory.high to limit workloads, but have found that
      containment can frequently fail completely and cause OOM situations
      outside of the cgroup.  This happens especially with swap space -- either
      when none is configured, or swap is full.  These failures often also don't
      have enough warning to allow one to react, whether for a human or for a
      daemon monitoring PSI.
      
      Here is output from a simple program showing how long it takes in usec
      (column 2) to allocate a megabyte of anonymous memory (column 1) when a
      cgroup is already beyond its memory high setting, and no swap is
      available:
      
          [root@ktst ~]# systemd-run -p MemoryHigh=100M -p MemorySwapMax=1 \
          > --wait -t timeout 300 /root/mdf
          [...]
          95  1035
          96  1038
          97  1000
          98  1036
          99  1048
          100 1590
          101 1968
          102 1776
          103 1863
          104 1757
          105 1921
          106 1893
          107 1760
          108 1748
          109 1843
          110 1716
          111 1924
          112 1776
          113 1831
          114 1766
          115 1836
          116 1588
          117 1912
          118 1802
          119 1857
          120 1731
          [...]
          [System OOM in 2-3 seconds]
      
      The delay does go up extremely marginally past the 100MB memory.high
      threshold, as now we spend time scanning before returning to usermode, but
      it's nowhere near enough to contain growth.  It also doesn't get worse the
      more pages you have, since it only considers nr_pages.
      
      The current situation goes against both the expectations of users of
      memory.high, and our intentions as cgroup v2 developers.  In
      cgroup-v2.txt, we claim that we will throttle and only under "extreme
      conditions" will memory.high protection be breached.  Likewise, cgroup v2
      users generally also expect that memory.high should throttle workloads as
      they exceed their high threshold.  However, as seen above, this isn't
      always how it works in practice -- even on banal setups like those with no
      swap, or where swap has become exhausted, we can end up with memory.high
      being breached and us having no weapons left in our arsenal to combat
      runaway growth with, since reclaim is futile.
      
      It's also hard for system monitoring software or users to tell how bad the
      situation is, as "high" events for the memcg may in some cases be benign,
      and in others be catastrophic.  The current status quo is that we fail
      containment in a way that doesn't provide any advance warning that things
      are about to go horribly wrong (for example, we are about to invoke the
      kernel OOM killer).
      
      This patch introduces explicit throttling when reclaim is failing to keep
      memcg size contained at the memory.high setting.  It does so by applying
      an exponential delay curve derived from the memcg's overage compared to
      memory.high.  In the normal case where the memcg is either below or only
      marginally over its memory.high setting, no throttling will be performed.
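      A simplified sketch of the delay calculation described above (the clamp
      constant is the MEMCG_MAX_HIGH_DELAY_JIFFIES mentioned later in this
      message; the fixed-point shifts and per-batch weighting of the real code
      are replaced by illustrative values):

          u64 overage, penalty_jiffies;

          if (usage <= high)
                  return 0;       /* at or below memory.high: no throttling */

          /* Overage as a fixed-point fraction of memory.high... */
          overage = ((u64)(usage - high) << 12) / (high ? high : 1);
          /* ...squared, so the delay climbs steeply as we overshoot further. */
          penalty_jiffies = (overage * overage * HZ) >> (12 + 8);

          /* Clamp so each allocation still makes forward progress. */
          return min_t(u64, penalty_jiffies, MEMCG_MAX_HIGH_DELAY_JIFFIES);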
      
      This composes well with system health monitoring and remediation, as these
      allocator delays are factored into PSI's memory pressure calculations.
      This both creates a mechanism for system administrators or applications
      consuming the PSI interface to trivially see that the memcg in question is
      struggling and use that to make more reasonable decisions, and permits
      them enough time to act.  Either of these can act with significantly more
      nuance than that we can provide using the system OOM killer.
      
      This is a similar idea to memory.oom_control in cgroup v1 which would put
      the cgroup to sleep if the threshold was violated, but it's also
      significantly improved as it results in visible memory pressure, and also
      doesn't schedule indefinitely, which previously made tracing and other
      introspection difficult (ie.  it's clamped at 2*HZ per allocation through
      MEMCG_MAX_HIGH_DELAY_JIFFIES).
      
      Contrast the previous results with a kernel with this patch:
      
          [root@ktst ~]# systemd-run -p MemoryHigh=100M -p MemorySwapMax=1 \
          > --wait -t timeout 300 /root/mdf
          [...]
          95  1002
          96  1000
          97  1002
          98  1003
          99  1000
          100 1043
          101 84724
          102 330628
          103 610511
          104 1016265
          105 1503969
          106 2391692
          107 2872061
          108 3248003
          109 4791904
          110 5759832
          111 6912509
          112 8127818
          113 9472203
          114 12287622
          115 12480079
          116 14144008
          117 15808029
          118 16384500
          119 16383242
          120 16384979
          [...]
      
      As you can see, in the normal case, memory allocation takes around 1000
      usec.  However, as we exceed our memory.high, things start to increase
      exponentially, but fairly leniently at first.  Our first megabyte over
      memory.high takes us 0.16 seconds, then the next is 0.46 seconds, then the
      next is almost an entire second.  This gets worse until we reach our
      eventual 2*HZ clamp per batch, resulting in 16 seconds per megabyte.
      However, this is still making forward progress, so permits tracing or
      further analysis with programs like GDB.
      
      We use an exponential curve for our delay penalty for a few reasons:
      
      1. We run mem_cgroup_handle_over_high to potentially do reclaim after
         we've already performed allocations, which means that temporarily
         going over memory.high by a small amount may be perfectly legitimate,
         even for compliant workloads. We don't want to unduly penalise such
         cases.
      2. An exponential curve (as opposed to a static or linear delay) allows
         ramping up memory pressure stats more gradually, which can be useful
         to work out that you have set memory.high too low, without destroying
         application performance entirely.
      
      This patch expands on earlier work by Johannes Weiner. Thanks!
      
      [akpm@linux-foundation.org: fix max() warning]
      [akpm@linux-foundation.org: fix __udivdi3 ref on 32-bit]
      [akpm@linux-foundation.org: fix it even more]
      [chris@chrisdown.name: fix 64-bit divide even more]
      Link: http://lkml.kernel.org/r/20190723180700.GA29459@chrisdown.name
      Signed-off-by: Chris Down <chris@chrisdown.name>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nathan Chancellor <natechancellor@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      6202ab24
  2. 02 Jan 2020, 5 commits
  3. 27 Dec 2019, 14 commits
    • mm: swap: check if swap backing device is congested or not · 0c22f660
      Committed by Yang Shi
      commit 8fd2e0b505d124bbb046ab15de0ff6f8d4babf56 upstream.
      Change SWP_FS to SWP_FILE.
      
      Swap readahead would read in a few pages regardless of whether the
      underlying device is busy or not.  It may incur a long waiting time if
      the device is congested, and it may also exacerbate the congestion.
      
      Use inode_read_congested() to check if the underlying device is busy or
      not like what file page readahead does.  Get inode from
      swap_info_struct.
      
      Although we could add inode information to swap_address_space
      (address_space->host), it may lead to unexpected side effects, e.g. it
      may break mapping_cap_account_dirty().  Using the inode from
      swap_info_struct seems simple and good enough.
      
      The check is only done in vma_cluster_readahead() since
      swap_vma_readahead() is only used for non-rotational devices, which are
      much less likely to suffer congestion than traditional HDDs.
      
      Although swap slots may be consecutive on a swap partition, they may
      still be fragmented on a swap file.  This check helps reduce excessive
      stalls in such cases.
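      The check amounts to roughly this (mm/swap_state.c-style sketch; exact
      placement and local variable names are illustrative):

          struct swap_info_struct *si = swp_swap_info(entry);
          struct inode *inode = si->swap_file->f_mapping->host;

          /* Skip readahead entirely if the backing device is congested. */
          if (inode_read_congested(inode))
                  goto skip;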
      
      The test with page_fault1 of will-it-scale (sometimes tracing may just
      show runtest.py, which is the wrapper script of page_fault1), which
      basically launches NR_CPU threads each generating 128MB of anonymous
      pages, on my virtual machine with a congested HDD shows that long-tail
      latency is reduced significantly.
      
      Without the patch
       page_fault1_thr-1490  [023]   129.311706: funcgraph_entry:      #57377.796 us |  do_swap_page();
       page_fault1_thr-1490  [023]   129.369103: funcgraph_entry:        5.642us   |  do_swap_page();
       page_fault1_thr-1490  [023]   129.369119: funcgraph_entry:      #1289.592 us |  do_swap_page();
       page_fault1_thr-1490  [023]   129.370411: funcgraph_entry:        4.957us   |  do_swap_page();
       page_fault1_thr-1490  [023]   129.370419: funcgraph_entry:        1.940us   |  do_swap_page();
       page_fault1_thr-1490  [023]   129.378847: funcgraph_entry:      #1411.385 us |  do_swap_page();
       page_fault1_thr-1490  [023]   129.380262: funcgraph_entry:        3.916us   |  do_swap_page();
       page_fault1_thr-1490  [023]   129.380275: funcgraph_entry:      #4287.751 us |  do_swap_page();
      
      With the patch
            runtest.py-1417  [020]   301.925911: funcgraph_entry:      #9870.146 us |  do_swap_page();
            runtest.py-1417  [020]   301.935785: funcgraph_entry:        9.802us   |  do_swap_page();
            runtest.py-1417  [020]   301.935799: funcgraph_entry:        3.551us   |  do_swap_page();
            runtest.py-1417  [020]   301.935806: funcgraph_entry:        2.142us   |  do_swap_page();
            runtest.py-1417  [020]   301.935853: funcgraph_entry:        6.938us   |  do_swap_page();
            runtest.py-1417  [020]   301.935864: funcgraph_entry:        3.765us   |  do_swap_page();
            runtest.py-1417  [020]   301.935871: funcgraph_entry:        3.600us   |  do_swap_page();
            runtest.py-1417  [020]   301.935878: funcgraph_entry:        7.202us   |  do_swap_page();
      
      [akpm@linux-foundation.org: code cleanup]
      [yang.shi@linux.alibaba.com: add comment]
        Link: http://lkml.kernel.org/r/bbc7bda7-62d0-df1a-23ef-d369e865bdca@linux.alibaba.com
      Link: http://lkml.kernel.org/r/1546543673-108536-1-git-send-email-yang.shi@linux.alibaba.com
      Signed-off-by: Hui Zhu <teawaterz@linux.alibaba.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Acked-by: Tim Chen <tim.c.chen@intel.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      0c22f660
    • zswap: use movable memory if zpool support allocate movable memory · 149e0aae
      Committed by Hui Zhu
      commit d2fcd82bb83aab47c6d63aa8c960cd5edb578065 upstream
      
      This is the third version that was updated according to the comments from
      Sergey Senozhatsky https://lkml.org/lkml/2019/5/29/73 and Shakeel Butt
      https://lkml.org/lkml/2019/6/4/973
      
      zswap compresses swap pages into a dynamically allocated RAM-based memory
      pool.  The memory pool can be zbud, z3fold or zsmalloc, all of which
      allocate unmovable pages.  This increases the number of unmovable page
      blocks, which is bad for anti-fragmentation.
      
      zsmalloc supports page migration if movable pages are requested:
              handle = zs_malloc(zram->mem_pool, comp_len,
                      GFP_NOIO | __GFP_HIGHMEM |
                      __GFP_MOVABLE);
      
      The commit "zpool: add malloc_support_movable to zpool_driver" adds
      zpool_malloc_support_movable(), which checks malloc_support_movable to
      determine whether a zpool supports allocating movable memory.
      
      This commit lets zswap allocate blocks with __GFP_HIGHMEM | __GFP_MOVABLE
      in the gfp mask if the zpool supports allocating movable memory.
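      The allocation-path change amounts to roughly the following (mm/zswap.c
      style sketch; variable names are illustrative):

          gfp_t gfp = __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM;

          /* Ask for movable highmem pages only when the zpool backend
           * (e.g. zsmalloc) can actually migrate them. */
          if (zpool_malloc_support_movable(entry->pool->zpool))
                  gfp |= __GFP_HIGHMEM | __GFP_MOVABLE;

          ret = zpool_malloc(entry->pool->zpool, hlen + dlen, gfp, &handle);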
      
      The following is a test log on a PC with 8G of memory and 2G of swap.
      
      Without this commit:
      ~# echo lz4 > /sys/module/zswap/parameters/compressor
      ~# echo zsmalloc > /sys/module/zswap/parameters/zpool
      ~# echo 1 > /sys/module/zswap/parameters/enabled
      ~# swapon /swapfile
      ~# cd /home/teawater/kernel/vm-scalability/
      /home/teawater/kernel/vm-scalability# export unit_size=$((9 * 1024 * 1024 * 1024))
      /home/teawater/kernel/vm-scalability# ./case-anon-w-seq
      2717908992 bytes / 4826062 usecs = 549973 KB/s
      2717908992 bytes / 4864201 usecs = 545661 KB/s
      2717908992 bytes / 4867015 usecs = 545346 KB/s
      2717908992 bytes / 4915485 usecs = 539968 KB/s
      397853 usecs to free memory
      357820 usecs to free memory
      421333 usecs to free memory
      420454 usecs to free memory
      /home/teawater/kernel/vm-scalability# cat /proc/pagetypeinfo
      Page block order: 9
      Pages per block:  512
      
      Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
      Node    0, zone      DMA, type    Unmovable      1      1      1      0      2      1      1      0      1      0      0
      Node    0, zone      DMA, type      Movable      0      0      0      0      0      0      0      0      0      1      3
      Node    0, zone      DMA, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone      DMA, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone      DMA, type          CMA      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone      DMA, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone    DMA32, type    Unmovable      6      5      8      6      6      5      4      1      1      1      0
      Node    0, zone    DMA32, type      Movable     25     20     20     19     22     15     14     11     11      5    767
      Node    0, zone    DMA32, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone    DMA32, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone    DMA32, type          CMA      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone    DMA32, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone   Normal, type    Unmovable   4753   5588   5159   4613   3712   2520   1448    594    188     11      0
      Node    0, zone   Normal, type      Movable     16      3    457   2648   2143   1435    860    459    223    224    296
      Node    0, zone   Normal, type  Reclaimable      0      0     44     38     11      2      0      0      0      0      0
      Node    0, zone   Normal, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone   Normal, type          CMA      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone   Normal, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
      
      Number of blocks type     Unmovable      Movable  Reclaimable   HighAtomic          CMA      Isolate
      Node 0, zone      DMA            1            7            0            0            0            0
      Node 0, zone    DMA32            4         1652            0            0            0            0
      Node 0, zone   Normal          931         1485           15            0            0            0
      
      With this commit:
      ~# echo lz4 > /sys/module/zswap/parameters/compressor
      ~# echo zsmalloc > /sys/module/zswap/parameters/zpool
      ~# echo 1 > /sys/module/zswap/parameters/enabled
      ~# swapon /swapfile
      ~# cd /home/teawater/kernel/vm-scalability/
      /home/teawater/kernel/vm-scalability# export unit_size=$((9 * 1024 * 1024 * 1024))
      /home/teawater/kernel/vm-scalability# ./case-anon-w-seq
      2717908992 bytes / 4689240 usecs = 566020 KB/s
      2717908992 bytes / 4760605 usecs = 557535 KB/s
      2717908992 bytes / 4803621 usecs = 552543 KB/s
      2717908992 bytes / 5069828 usecs = 523530 KB/s
      431546 usecs to free memory
      383397 usecs to free memory
      456454 usecs to free memory
      224487 usecs to free memory
      /home/teawater/kernel/vm-scalability# cat /proc/pagetypeinfo
      Page block order: 9
      Pages per block:  512
      
      Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
      Node    0, zone      DMA, type    Unmovable      1      1      1      0      2      1      1      0      1      0      0
      Node    0, zone      DMA, type      Movable      0      0      0      0      0      0      0      0      0      1      3
      Node    0, zone      DMA, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone      DMA, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone      DMA, type          CMA      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone      DMA, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone    DMA32, type    Unmovable     10      8     10      9     10      4      3      2      3      0      0
      Node    0, zone    DMA32, type      Movable     18     12     14     16     16     11      9      5      5      6    775
      Node    0, zone    DMA32, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      1
      Node    0, zone    DMA32, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone    DMA32, type          CMA      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone    DMA32, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone   Normal, type    Unmovable   2669   1236    452    118     37     14      4      1      2      3      0
      Node    0, zone   Normal, type      Movable   3850   6086   5274   4327   3510   2494   1520    934    438    220    470
      Node    0, zone   Normal, type  Reclaimable     56     93    155    124     47     31     17      7      3      0      0
      Node    0, zone   Normal, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone   Normal, type          CMA      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone   Normal, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
      
      Number of blocks type     Unmovable      Movable  Reclaimable   HighAtomic          CMA      Isolate
      Node 0, zone      DMA            1            7            0            0            0            0
      Node 0, zone    DMA32            4         1650            2            0            0            0
      Node 0, zone   Normal           79         2326           26            0            0            0
      
      You can see that the number of unmovable page blocks decreases with this
      commit applied.
      
      Link: http://lkml.kernel.org/r/20190605100630.13293-2-teawaterz@linux.alibaba.com
      Signed-off-by: Hui Zhu <teawaterz@linux.alibaba.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      149e0aae
    • zpool: add malloc_support_movable to zpool_driver · d3fc8a97
      Committed by Hui Zhu
      commit c165f25d23ecb2f9f121ced20435415b931219e2 upstream
      
      As a zpool_driver, zsmalloc can allocate movable memory because it
      supports migrating pages.  But zbud and z3fold cannot allocate movable
      memory.
      
      Add malloc_support_movable to zpool_driver.  If a zpool_driver supports
      allocating movable memory, set it to true.  Also add
      zpool_malloc_support_movable(), which checks malloc_support_movable to
      determine whether a zpool supports allocating movable memory.
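      In outline, the interface addition looks like this (include/linux/zpool.h
      and mm/zpool.c-style sketch, trimmed to the relevant parts):

          struct zpool_driver {
                  char *type;
                  bool malloc_support_movable;  /* true if malloc can return movable pages */
                  /* ... create/destroy/malloc/free and the other ops ... */
          };

          bool zpool_malloc_support_movable(struct zpool *zpool)
          {
                  return zpool->driver->malloc_support_movable;
          }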
      
      Link: http://lkml.kernel.org/r/20190605100630.13293-1-teawaterz@linux.alibaba.com
      Signed-off-by: Hui Zhu <teawaterz@linux.alibaba.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      d3fc8a97
    • userfaultfd: allow get_mempolicy(MPOL_F_NODE|MPOL_F_ADDR) to trigger userfaults · 56c691f6
      Committed by Andrea Arcangeli
      commit 3b9aadf7278d16d7bed4d5d808501065f70898d8 upstream.
      
      get_mempolicy(MPOL_F_NODE|MPOL_F_ADDR) called get_user_pages(), which
      would not wait for userfaults before failing, so it would hit SIGBUS
      instead.  Using get_user_pages_locked/unlocked instead allows
      get_mempolicy() to let userfaults resolve the fault and fill the hole
      before grabbing the node id of the page.
      
      If the user calls get_mempolicy() with MPOL_F_ADDR | MPOL_F_NODE for an
      address inside an area managed by uffd and there is no page at that
      address, the page allocation from within get_mempolicy() will fail
      because get_user_pages() does not allow for page fault retry required
      for uffd; the user will get SIGBUS.
      
      With this patch, the page fault will be resolved by the uffd and the
      get_mempolicy() will continue normally.
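      A sketch of the node lookup after the change (mm/mempolicy.c-style;
      simplified, with error handling trimmed):

          static int lookup_node(struct mm_struct *mm, unsigned long addr)
          {
                  struct page *p;
                  int err, locked = 1;

                  /* get_user_pages_locked() may drop mmap_sem and wait for a
                   * userfault to be resolved instead of failing with -EFAULT. */
                  err = get_user_pages_locked(addr & PAGE_MASK, 1, 0, &p, &locked);
                  if (err > 0) {
                          err = page_to_nid(p);
                          put_page(p);
                  }
                  if (locked)
                          up_read(&mm->mmap_sem);
                  return err;
          }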
      
      Background:
      
      Via code review: previously the syscall would have returned -EFAULT
      (vm_fault_to_errno); now it will block and wait for a userfault (if it's
      woken before the fault is resolved it'll still return -EFAULT).
      
      This way get_mempolicy will give a chance to an "unaware" app to be
      compliant with userfaults.
      
      The reason this change is visible is that becoming "userfault compliant"
      cannot regress anything: all other syscalls, including read(2)/write(2),
      had to become "userfault compliant" a long time ago (that's one of the
      things userfaultfd can do that PROT_NONE and trapping segfaults can't).
      
      So this is just one more syscall that becomes "userfault compliant" like
      all other major ones already are.
      
      This has been happening on virtio-bridge dpdk process which just called
      get_mempolicy on the guest space post live migration, but before the
      memory had a chance to be migrated to destination.
      
      I didn't run an strace to be able to show the -EFAULT going away, but
      I've the confirmation of the below debug aid information (only visible
      with CONFIG_DEBUG_VM=y) going away with the patch:
      
          [20116.371461] FAULT_FLAG_ALLOW_RETRY missing 0
          [20116.371464] CPU: 1 PID: 13381 Comm: vhost-events Not tainted 4.17.12-200.fc28.x86_64 #1
          [20116.371465] Hardware name: LENOVO 20FAS2BN0A/20FAS2BN0A, BIOS N1CET54W (1.22 ) 02/10/2017
          [20116.371466] Call Trace:
          [20116.371473]  dump_stack+0x5c/0x80
          [20116.371476]  handle_userfault.cold.37+0x1b/0x22
          [20116.371479]  ? remove_wait_queue+0x20/0x60
          [20116.371481]  ? poll_freewait+0x45/0xa0
          [20116.371483]  ? do_sys_poll+0x31c/0x520
          [20116.371485]  ? radix_tree_lookup_slot+0x1e/0x50
          [20116.371488]  shmem_getpage_gfp+0xce7/0xe50
          [20116.371491]  ? page_add_file_rmap+0x1a/0x2c0
          [20116.371493]  shmem_fault+0x78/0x1e0
          [20116.371495]  ? filemap_map_pages+0x3a1/0x450
          [20116.371498]  __do_fault+0x1f/0xc0
          [20116.371500]  __handle_mm_fault+0xe2e/0x12f0
          [20116.371502]  handle_mm_fault+0xda/0x200
          [20116.371504]  __get_user_pages+0x238/0x790
          [20116.371506]  get_user_pages+0x3e/0x50
          [20116.371510]  kernel_get_mempolicy+0x40b/0x700
          [20116.371512]  ? vfs_write+0x170/0x1a0
          [20116.371515]  __x64_sys_get_mempolicy+0x21/0x30
          [20116.371517]  do_syscall_64+0x5b/0x160
          [20116.371520]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      The above harmless debug message (not a kernel crash, just a
      dump_stack()) is shown with CONFIG_DEBUG_VM=y to more quickly identify
      and improve kernel spots that may have to become "userfaultfd
      compliant" like this one (without having to run an strace and search
      for syscall misbehavior).  Spots like the above are more closer to a
      kernel bug for the non-cooperative usages that Mike focuses on, than
      for for dpdk qemu-cooperative usages that reproduced it, but it's still
      nicer to get this fixed for dpdk too.
      
      The only part of the patch that caused me to think is the implementation
      issue of mpol_get, but it looks like it should work safely no matter
      what kind of mempolicy structure is involved (the default static policy
      also starts at 1, so it'll go to 2 and back to 1 without crashing
      anything at 0).
      
      [rppt@linux.vnet.ibm.com: changelog addition]
        http://lkml.kernel.org/r/20180904073718.GA26916@rapoport-lnx
      Link: http://lkml.kernel.org/r/20180831214848.23676-1-aarcange@redhat.com
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Reported-by: Maxime Coquelin <maxime.coquelin@redhat.com>
      Tested-by: Dr. David Alan Gilbert <dgilbert@redhat.com>
      Reviewed-by: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Shile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      56c691f6
    • psi: pressure stall information for CPU, memory, and IO · 8b2bf079
      Committed by Johannes Weiner
      commit eb414681d5a07d28d2ff90dc05f69ec6b232ebd2 upstream.
      
      When systems are overcommitted and resources become contended, it's hard
      to tell exactly the impact this has on workload productivity, or how close
      the system is to lockups and OOM kills.  In particular, when machines work
      multiple jobs concurrently, the impact of overcommit in terms of latency
      and throughput on the individual job can be enormous.
      
      In order to maximize hardware utilization without sacrificing individual
      job health or risk complete machine lockups, this patch implements a way
      to quantify resource pressure in the system.
      
      A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that
      expose the percentage of time the system is stalled on CPU, memory, or IO,
      respectively.  Stall states are aggregate versions of the per-task delay
      accounting delays:
      
             cpu: some tasks are runnable but not executing on a CPU
             memory: tasks are reclaiming, or waiting for swapin or thrashing cache
             io: tasks are waiting for io completions
      
      These percentages of walltime can be thought of as pressure percentages,
      and they give a general sense of system health and productivity loss
      incurred by resource overcommit.  They can also indicate when the system
      is approaching lockup scenarios and OOMs.
      
      To do this, psi keeps track of the task states associated with each CPU
      and samples the time they spend in stall states.  Every 2 seconds, the
      samples are averaged across CPUs - weighted by the CPUs' non-idle time to
      eliminate artifacts from unused CPUs - and translated into percentages of
      walltime.  A running average of those percentages is maintained over 10s,
      1m, and 5m periods (similar to the loadaverage).
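      For reference, a minimal userspace reader for the new interface (the file
      format is as documented in this series; this snippet is illustrative and
      not part of the patch):

          #include <stdio.h>

          int main(void)
          {
                  char line[256];
                  FILE *f = fopen("/proc/pressure/memory", "r");

                  if (!f)
                          return 1;
                  /* Each line looks like:
                   *   some avg10=0.00 avg60=0.00 avg300=0.00 total=0 */
                  while (fgets(line, sizeof(line), f))
                          fputs(line, stdout);
                  fclose(f);
                  return 0;
          }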
      
      [hannes@cmpxchg.org: doc fixlet, per Randy]
        Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org
      [hannes@cmpxchg.org: code optimization]
        Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org
      [hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter]
        Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org
      [hannes@cmpxchg.org: fix build]
        Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org
      Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: Daniel Drake <drake@endlessm.com>
      Tested-by: Suren Baghdasaryan <surenb@google.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <jweiner@fb.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Enderborg <peter.enderborg@sony.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      [Joseph: fix apply conflicts in task_struct]
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      8b2bf079
    • delayacct: track delays from thrashing cache pages · 2dde6f77
      Committed by Johannes Weiner
      commit b1d29ba82cf2bc784f4c963ddd6a2cf29e229b33 upstream.
      
      Delay accounting already measures the time a task spends in direct
      reclaim and waiting for swapin, but in low-memory situations tasks can
      spend a significant amount of their time waiting on thrashing page
      cache.  This isn't tracked right now.
      
      To know the full impact of memory contention on an individual task,
      measure the delay when waiting for a recently evicted active cache page to
      read back into memory.
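      The accounting hooks land around the wait for a thrashing page, roughly
      like this (mm/filemap.c-style sketch; the surrounding logic is
      paraphrased):

          bool thrashing = false;

          if (PageWorkingset(page)) {
                  delayacct_thrashing_start();
                  thrashing = true;
          }

          /* ... wait for the page to be read back in / unlocked ... */

          if (thrashing)
                  delayacct_thrashing_end();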
      
      Also update tools/accounting/getdelays.c:
      
           [hannes@computer accounting]$ sudo ./getdelays -d -p 1
           print delayacct stats ON
           PID     1
      
           CPU             count     real total  virtual total    delay total  delay average
                           50318      745000000      847346785      400533713          0.008ms
           IO              count    delay total  delay average
                             435      122601218              0ms
           SWAP            count    delay total  delay average
                               0              0              0ms
           RECLAIM         count    delay total  delay average
                               0              0              0ms
           THRASHING       count    delay total  delay average
                              19       12621439              0ms
      
      Link: http://lkml.kernel.org/r/20180828172258.3185-4-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: Daniel Drake <drake@endlessm.com>
      Tested-by: Suren Baghdasaryan <surenb@google.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <jweiner@fb.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Enderborg <peter.enderborg@sony.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      2dde6f77
    • mm: workingset: tell cache transitions from workingset thrashing · e2d3e3cb
      Committed by Johannes Weiner
      commit 1899ad18c6072d689896badafb81267b0a1092a4 upstream.
      
      Refaults happen during transitions between workingsets as well as in-place
      thrashing.  Knowing the difference between the two has a range of
      applications, including measuring the impact of memory shortage on the
      system performance, as well as the ability to smarter balance pressure
      between the filesystem cache and the swap-backed workingset.
      
      During workingset transitions, inactive cache refaults and pushes out
      established active cache.  When that active cache isn't stale, however,
      and also ends up refaulting, that's bonafide thrashing.
      
      Introduce a new page flag that tells on eviction whether the page has been
      active or not in its lifetime.  This bit is then stored in the shadow
      entry, to classify refaults as transitioning or thrashing.
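      Conceptually the classification works as follows (illustrative sketch;
      the real code packs the bit into the shadow entry via the workingset
      eviction/refault helpers, and WORKINGSET_WAS_ACTIVE is a hypothetical
      name for that bit):

          /* On eviction: remember whether the page was ever active. */
          if (PageWorkingset(page))
                  shadow |= WORKINGSET_WAS_ACTIVE;

          /* On refault: a refault of a previously-active page is thrashing,
           * otherwise it is a workingset transition. */
          if (shadow & WORKINGSET_WAS_ACTIVE) {
                  SetPageWorkingset(page);
                  inc_lruvec_state(lruvec, WORKINGSET_RESTORE);
          }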
      
      How many page->flags does this leave us with on 32-bit?
      
      	20 bits are always page flags
      
      	21 if you have an MMU
      
      	23 with the zone bits for DMA, Normal, HighMem, Movable
      
      	29 with the sparsemem section bits
      
      	30 if PAE is enabled
      
      	31 with this patch.
      
      So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA nodes.  If
      that's not enough, the system can switch to discontigmem and re-gain the 6
      or 7 sparsemem section bits.
      
      Link: http://lkml.kernel.org/r/20180828172258.3185-3-hannes@cmpxchg.org
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: Daniel Drake <drake@endlessm.com>
      Tested-by: Suren Baghdasaryan <surenb@google.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <jweiner@fb.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Enderborg <peter.enderborg@sony.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Joseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: Caspar Zhang <caspar@linux.alibaba.com>
      e2d3e3cb
    • mm: workingset: don't drop refault information prematurely · b027d193
      Committed by Johannes Weiner
      commit 95f9ab2d596e8cbb388315e78c82b9a131bf2928 upstream.
      
      Patch series "psi: pressure stall information for CPU, memory, and IO", v4.
      
      		Overview
      
      PSI reports the overall wallclock time in which the tasks in a system (or
      cgroup) wait for (contended) hardware resources.
      
      This helps users understand the resource pressure their workloads are
      under, which allows them to rootcause and fix throughput and latency
      problems caused by overcommitting, underprovisioning, suboptimal job
      placement in a grid; as well as anticipate major disruptions like OOM.
      
      		Real-world applications
      
      We're using the data collected by PSI (and its previous incarnation,
      memdelay) quite extensively at Facebook, and with several success stories.
      
      One usecase is avoiding OOM hangs/livelocks.  The reason these happen is
      because the OOM killer is triggered by reclaim not being able to free
      pages, but with fast flash devices there is *always* some clean and
      uptodate cache to reclaim; the OOM killer never kicks in, even as tasks
      spend 90% of the time thrashing the cache pages of their own executables.
      There is no situation where this ever makes sense in practice.  We wrote a
      <100 line POC python script to monitor memory pressure and kill stuff way
      before such pathological thrashing leads to full system losses that would
      require forcible hard resets.
      
      We've since extended and deployed this code into other places to guarantee
      latency and throughput SLAs, since they're usually violated way before the
      kernel OOM killer would ever kick in.
      
      It is available here: https://github.com/facebookincubator/oomd
      
      Eventually we probably want to trigger the in-kernel OOM killer based on
      extreme sustained pressure as well, so that Linux can avoid memory
      livelocks - which technically aren't deadlocks, but to the user
      indistinguishable from them - out of the box.  We'd continue using OOMD as
      the first line of defense to ensure workload health and implement complex
      kill policies that are beyond the scope of the kernel.
      
      We also use PSI memory pressure for loadshedding.  Our batch job
      infrastructure used to use heuristics based on various VM stats to
      anticipate OOM situations, with lackluster success.  We switched it to PSI
      and managed to anticipate and avoid OOM kills and lockups fairly reliably.
      The reduction of OOM outages in the worker pool raised the pool's
      aggregate productivity, and we were able to switch that service to smaller
      machines.
      
      Lastly, we use cgroups to isolate a machine's main workload from
      maintenance crap like package upgrades, logging, configuration, as well as
      to prevent multiple workloads on a machine from stepping on each others'
      toes.  We were not able to configure this properly without the pressure
      metrics; we would see latency or bandwidth drops, but it would often be
      hard to impossible to rootcause it post-mortem.
      
      We now log and graph pressure for the containers in our fleet and can
      trivially link latency spikes and throughput drops to shortages of
      specific resources after the fact, and fix the job config/scheduling.
      
      PSI has also received testing, feedback, and feature requests from Android
      and EndlessOS for the purpose of low-latency OOM killing, to intervene in
      pressure situations before the UI starts hanging.
      
      		How do you use this feature?
      
      A kernel with CONFIG_PSI=y will create a /proc/pressure directory with 3
      files: cpu, memory, and io.  If using cgroup2, cgroups will also have
      cpu.pressure, memory.pressure and io.pressure files, which simply
      aggregate task stalls at the cgroup level instead of system-wide.
      
      The cpu file contains one line:
      
      	some avg10=2.04 avg60=0.75 avg300=0.40 total=157656722
      
      The averages give the percentage of walltime in which one or more tasks
      are delayed on the runqueue while another task has the CPU.  They're
      recent averages over 10s, 1m, 5m windows, so you can tell short term
      trends from long term ones, similarly to the load average.
      
      The total= value gives the absolute stall time in microseconds.  This
      allows detecting latency spikes that might be too short to sway the
      running averages.  It also allows custom time averaging in case the
      10s/1m/5m windows aren't adequate for the usecase (or are too coarse with
      future hardware).
      
      What to make of this "some" metric?  If CPU utilization is at 100% and CPU
      pressure is 0, it means the system is perfectly utilized, with one
      runnable thread per CPU and nobody waiting.  At two or more runnable tasks
      per CPU, the system is 100% overcommitted and the pressure average will
      indicate as much.  From a utilization perspective this is a great state of
      course: no CPU cycles are being wasted, even when 50% of the threads were
      to go idle (as most workloads do vary).  From the perspective of the
      individual job it's not great, however, and they would do better with more
      resources.  Depending on what your priority and options are, raised "some"
      numbers may or may not require action.
      
      The memory file contains two lines:
      
      some avg10=70.24 avg60=68.52 avg300=69.91 total=3559632828
      full avg10=57.59 avg60=58.06 avg300=60.38 total=3300487258
      
      The some line is the same as for cpu, the time in which at least one task
      is stalled on the resource.  In the case of memory, this includes waiting
      on swap-in, page cache refaults and page reclaim.
      
      The full line, however, indicates time in which *nobody* is using the CPU
      productively due to pressure: all non-idle tasks are waiting for memory in
      one form or another.  Significant time spent in there is a good trigger
      for killing things, moving jobs to other machines, or dropping incoming
      requests, since neither the jobs nor the machine overall are making too
      much headway.
      
      The io file is similar to memory.  Because the block layer doesn't have a
      concept of hardware contention right now (how much longer is my IO request
      taking due to other tasks?), it reports CPU potential lost on all IO
      delays, not just the potential lost due to competition.
      
      		FAQ
      
      Q: How is PSI's CPU component different from the load average?
      
      A: There are several quirks in the load average that make it hard to
         impossible to tell how overcommitted the CPU really is.
      
         1. The load average is reported as a raw number of active tasks.
            You need to know how many CPUs there are in the system, how many
            CPUs the workload is allowed to use, then think about what the
            proportion between load and the number of CPUs mean for the
            tasks trying to run.
      
            PSI reports the percentage of wallclock time in which tasks are
            waiting for a CPU to run on. It doesn't matter how many CPUs are
            present or usable. The number always tells the quality of life
            of tasks in the system or in a particular cgroup.
      
         2. The shortest averaging window is 1m, which is extremely coarse,
            and it's sampled in 5s intervals. A *lot* can happen on a CPU in
            5 seconds. This *may* be able to identify persistent long-term
            trends and very clear and obvious overloads, but it's unusable
            for latency spikes and more subtle overutilization.
      
            PSI's shortest window is 10s. It also exports the cumulative
            stall times (in microseconds) of synchronously recorded events.
      
         3. On Linux, the load average for historical reasons includes all
            TASK_UNINTERRUPTIBLE tasks. This gives a broader sense of how
            busy the system is, but on the flipside it doesn't distinguish
            whether tasks are likely to contend over the CPU or IO - which
            obviously requires very different interventions from a sys admin
            or a job scheduler.
      
            PSI reports independent metrics for CPU and IO. You can tell
            which resource is making the tasks wait, but in conjunction
            still see how overloaded the system is overall.
      
      Q: What's the cost / performance impact of this feature?
      
      A: PSI's primary cost is in the scheduler, in particular task wakeups
         and sleeps.
      
         I benchmarked this code using Facebook's two most scheduling
         sensitive workloads: memcache and webserver. They handle a ton of
         small requests - lots of wakeups and sleeps with little actual work
         in between - so they tend to be canaries for scheduler regressions.
      
         In the tests, the boxes were handling live traffic over the course
         of several hours. Half the machines, the control, ran with
         CONFIG_PSI=n.
      
         For memcache I used eight machines total. They're 2-socket, 14
         core, 56 thread boxes. The test runs for half the test period,
         flips the test and control kernels on the hardware to rule out HW
         factors, DC location etc., then runs the other half of the test.
      
         For the webservers, I used 32 machines total. They're single
         socket, 16 core, 32 thread machines.
      
          During the memcache test, CPU load was nopsi=78.05% psi=78.98% in
          the first half and nopsi=77.52% psi=78.25% in the second half, so
          PSI added between 0.7 and 0.9 percentage points to the CPU load, a
          difference of about 1%.
      
         UPDATE: I re-ran this test with the v3 version of this patch set
         and the CPU utilization was equivalent between test and control.
      
         UPDATE: v4 is on par with v3.
      
         As far as end-to-end request latency from the client perspective
         goes, we don't sample those finely enough to capture the requests
         going to those particular machines during the test, but we know the
         p50 turnaround time in this workload is 54us, and perf bench sched
          pipe on those machines shows nopsi=5.232666 us/op and psi=5.587347
         us/op, so this doesn't add much here either.
      
         The profile for the pipe benchmark shows:
      
              0.87%  sched-pipe  [kernel.vmlinux]    [k] psi_group_change
              0.83%  perf.real   [kernel.vmlinux]    [k] psi_group_change
              0.82%  perf.real   [kernel.vmlinux]    [k] psi_task_change
              0.58%  sched-pipe  [kernel.vmlinux]    [k] psi_task_change
      
         The webserver load is running inside 4 nested cgroup levels. The
         CPU load with both nopsi and psi kernels was indistinguishable at
         81%.
      
         For comparison, we had to disable the cgroup cpu controller on the
         webservers because it added 4 percentage points to the CPU% during
         this same exact test.
      
         Versions of this accounting code now run on 80% of our fleet. None
         of our workloads have reported regressions during the rollout.
      
      Daniel Drake said:
      
      : I just retested the latest version at
      : http://git.cmpxchg.org/cgit.cgi/linux-psi.git (Linux 4.18) and the results
      : are great.
      :
      : Test setup:
      : Endless OS
      : GeminiLake N4200 low end laptop
      : 2GB RAM
      : swap (and zram swap) disabled
      :
      : Baseline test: open a handful of large-ish apps and several website
      : tabs in Google Chrome.
      :
      : Results: after a couple of minutes, system is excessively thrashing, mouse
      : cursor can barely be moved, UI is not responding to mouse clicks, so it's
      : impractical to recover from this situation as an ordinary user
      :
      : Add my simple killer:
      : https://gist.github.com/dsd/a8988bf0b81a6163475988120fe8d9cd
      :
      : Results: when the thrashing causes the UI to become sluggish, the killer
      : steps in and kills something (usually a chrome tab), and the system
      : remains usable.  I repeatedly opened more apps and more websites over a 15
      : minute period but I wasn't able to get the system to a point of UI
      : unresponsiveness.
      
      Suren said:
      
       : Backported to 4.9 and retested on an ARMv8 8-core system running Android.
      : Signals behave as expected reacting to memory pressure, no jumps in
      : "total" counters that would indicate an overflow/underflow issues.  Nicely
      : done!
      
      This patch (of 9):
      
      If we keep just enough refault information to match the *current* page
      cache during reclaim time, we could lose a lot of events when there is
      only a temporary spike in non-cache memory consumption that pushes out all
      the cache.  Once cache comes back, we won't see those refaults.  They
      might not be actionable for LRU aging, but we want to know about them for
      measuring memory pressure.
      
      [hannes@cmpxchg.org: switch to NUMA-aware lru and slab counters]
        Link: http://lkml.kernel.org/r/20181009184732.762-2-hannes@cmpxchg.org
       Link: http://lkml.kernel.org/r/20180828172258.3185-2-hannes@cmpxchg.org
       Signed-off-by: NJohannes Weiner <jweiner@fb.com>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NRik van Riel <riel@surriel.com>
      Tested-by: NDaniel Drake <drake@endlessm.com>
      Tested-by: NSuren Baghdasaryan <surenb@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Peter Enderborg <peter.enderborg@sony.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      b027d193
    • K
      alinux: writeback: memcg_blkcg_tree_lock can be static · 2307342e
       Committed by kbuild test robot
      Fixes: 60448d43 ("writeback: add memcg_blkcg_link tree")
      Signed-off-by: Nkbuild test robot <lkp@intel.com>
      Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      2307342e
    • J
      alinux: fs/writeback: wrap cgroup writeback v1 logic · 5e06ec32
       Committed by Joseph Qi
      Wrap cgroup writeback v1 logic to prevent build errors without
      CONFIG_CGROUPS or CONFIG_CGROUP_WRITEBACK.
      Reported-by: Nkbuild test robot <lkp@intel.com>
      Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      5e06ec32
    • J
      alinux: writeback: introduce cgwb_v1 boot param · 6f7afdf3
       Committed by Jiufei Xue
       So far writeback control is supported for the cgroup v1 interface.
       However, it also has some restrictions, so introduce a new kernel boot
       parameter to control the behavior, which is disabled by default. Users
       can enable writeback control for cgroup v1 by adding "cgwb_v1" to the
       kernel command line.
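
       The actual wiring is not shown in this log, but a boot flag like this
       is conventionally hooked up with the kernel's __setup() helper; the
       variable and handler names below are assumptions made only for
       illustration.

       /* Hedged sketch of a "cgwb_v1" boot flag, default off; names assumed. */
       static bool cgwb_v1_enabled __read_mostly;

       static int __init enable_cgwb_v1(char *s)
       {
               cgwb_v1_enabled = true;
               return 1;
       }
       __setup("cgwb_v1", enable_cgwb_v1);

       /* the cgroup v1 writeback paths would then be gated on cgwb_v1_enabled */
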
      Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      6f7afdf3
    • J
      alinux: fs/writeback: fix double free of blkcg_css · b8016b41
       Committed by Jiufei Xue
       We got a WARNING when releasing blkcg_css:
      
      [332489.681635] WARNING: CPU: 55 PID: 14859 at lib/list_debug.c:56 __list_del_entry+0x81/0xc0
      [332489.682191] list_del corruption, ffff883e6b94d450->prev is LIST_POISON2 (dead000000000200)
      ......
      [332489.683895] CPU: 55 PID: 14859 Comm: kworker/55:2 Tainted: G
      [332489.684477] Hardware name: Inspur SA5248M4/X10DRT-PS, BIOS 4.05A
      10/11/2016
      [332489.685061] Workqueue: cgroup_destroy css_release_work_fn
      [332489.685654]  ffffc9001d92bd28 ffffffff81380042 ffffc9001d92bd78
      0000000000000000
      [332489.686269]  ffffc9001d92bd68 ffffffff81088f8b 0000003800000000
      ffff883e6b94d4a0
      [332489.686867]  ffff883e6b94d400 ffffffff81ce8fe0 ffff88375b24f400
      ffff883e6b94d4a0
      [332489.687479] Call Trace:
      [332489.688078]  [<ffffffff81380042>] dump_stack+0x63/0x81
      [332489.688681]  [<ffffffff81088f8b>] __warn+0xcb/0xf0
      [332489.689276]  [<ffffffff8108900f>] warn_slowpath_fmt+0x5f/0x80
      [332489.689877]  [<ffffffff8139e7c1>] __list_del_entry+0x81/0xc0
      [332489.690481]  [<ffffffff81125552>] css_release_work_fn+0x42/0x140
      [332489.691090]  [<ffffffff810a2db9>] process_one_work+0x189/0x420
      [332489.691693]  [<ffffffff810a309e>] worker_thread+0x4e/0x4b0
      [332489.692293]  [<ffffffff810a3050>] ? process_one_work+0x420/0x420
      [332489.692905]  [<ffffffff810a9616>] kthread+0xe6/0x100
      [332489.693504]  [<ffffffff810a9530>] ? kthread_park+0x60/0x60
      [332489.694099]  [<ffffffff817184e1>] ret_from_fork+0x41/0x50
      [332489.694722] ---[ end trace 0cf869c4a5cfba87 ]---
      ......
      
       This is caused by calling css_get after the css has been killed by
       another thread, as described below:
      
                 Thread 1                       Thread 2
      cgroup_rmdir
        -> kill_css
          -> percpu_ref_kill_and_confirm
            -> css_killed_ref_fn
      
      css_killed_work_fn
        -> css_put
          -> css_release
                                              wb_get_create
      					  -> find_blkcg_css
      					    -> css_get
      					  -> css_put
      					    -> css_release (double free)
          -> css_release_workfn
            -> css_free_work_fn
             -> blkcg_css_free
      
       When the double free happens, it may free memory that is still in use
       by other threads and cause a kernel panic.
      
       Fix this by using css_tryget_online in find_blkcg_css, which will
       return false if the css has been killed.
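
       A sketch of the shape of the fix, with the surrounding find_blkcg_css()
       logic and locking elided; css_tryget_online() is the real cgroup API
       being relied on, while the error convention is assumed:

       /* Hedged sketch: instead of css_get(), which can pin a css that
        * kill_css() is already tearing down, take the reference conditionally
        * and bail out if the css is no longer online. */
       if (!css_tryget_online(blkcg_css))
               return ERR_PTR(-ENOENT);        /* assumed error convention */
       return blkcg_css;
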
      Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      b8016b41
     • J
      alinux: writeback: add memcg_blkcg_link tree · 7396367d
       Committed by Jiufei Xue
       Add a global radix tree to link the memcg and the blkcg that the user
       attaches tasks to when using cgroup v1; this link is used for cgroup
       writeback.
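
       Not the actual implementation, but a minimal sketch of such a link tree
       using the kernel's radix tree API; the key choice (the memcg css id)
       and the link structure are assumptions based on the commit titles in
       this series.

       /* Hedged sketch: map a memcg (by css id) to the blkcg that the same
        * tasks were attached to, so v1 writeback can find the matching blkcg. */
       struct memcg_blkcg_link {
               struct cgroup_subsys_state *memcg_css;
               struct cgroup_subsys_state *blkcg_css;
       };

       static RADIX_TREE(memcg_blkcg_tree, GFP_ATOMIC);
       static DEFINE_SPINLOCK(memcg_blkcg_tree_lock);

       static int link_memcg_blkcg(struct memcg_blkcg_link *link)
       {
               int err;

               spin_lock(&memcg_blkcg_tree_lock);
               err = radix_tree_insert(&memcg_blkcg_tree,
                                       link->memcg_css->id, link);
               spin_unlock(&memcg_blkcg_tree_lock);
               return err;
       }
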
      Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      7396367d
   4. 18 Dec, 2019 (2 commits)
    • M
      mm, thp, proc: report THP eligibility for each vma · c76adee3
       Committed by Michal Hocko
      [ Upstream commit 7635d9cbe8327e131a1d3d8517dc186c2796ce2e ]
      
      Userspace falls short when trying to find out whether a specific memory
      range is eligible for THP.  There are usecases that would like to know
      that
      http://lkml.kernel.org/r/alpine.DEB.2.21.1809251248450.50347@chino.kir.corp.google.com
      : This is used to identify heap mappings that should be able to fault thp
      : but do not, and they normally point to a low-on-memory or fragmentation
      : issue.
      
       The only way to deduce this now is to query for the hg resp.  nh flags
       and confront that state with the global setting.  Except that there is
       also PR_SET_THP_DISABLE that might change the picture.  So the final
       logic is not trivial.  Moreover the eligibility of the vma depends on
       the type of VMA as well.  In the past we have supported only anonymous
       memory VMAs but things have changed and shmem-based vmas are supported
       as well these days and the query logic gets even more complicated
       because the eligibility depends on the mount option and another global
       configuration knob.
      
      Simplify the current state and report the THP eligibility in
      /proc/<pid>/smaps for each existing vma.  Reuse
      transparent_hugepage_enabled for this purpose.  The original
      implementation of this function assumes that the caller knows that the vma
      itself is supported for THP so make the core checks into
      __transparent_hugepage_enabled and use it for existing callers.
       __show_smap just uses the new transparent_hugepage_enabled, which also
      checks the vma support status (please note that this one has to be out of
      line due to include dependency issues).
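
       For consumers of the new field, a hedged sketch of reading it: the
       program scans /proc/self/smaps, remembers the current mapping header
       and prints the ranges reported as THP eligible.  Only the THPeligible
       field name comes from this patch; the rest is illustrative.

       /* Hedged sketch: list mappings whose smaps entry says THPeligible: 1 */
       #include <stdio.h>
       #include <string.h>

       int main(void)
       {
               char line[512], vma[512] = "";
               unsigned long start, end;
               int elig;
               FILE *f = fopen("/proc/self/smaps", "r");

               if (!f)
                       return 1;
               while (fgets(line, sizeof(line), f)) {
                       if (sscanf(line, "%lx-%lx ", &start, &end) == 2)
                               strcpy(vma, line);       /* new mapping header */
                       else if (sscanf(line, "THPeligible: %d", &elig) == 1 && elig)
                               printf("THP eligible: %s", vma);
               }
               fclose(f);
               return 0;
       }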
      
      [mhocko@kernel.org: fix oops with NULL ->f_mapping]
        Link: http://lkml.kernel.org/r/20181224185106.GC16738@dhcp22.suse.cz
       Link: http://lkml.kernel.org/r/20181211143641.3503-3-mhocko@kernel.org
       Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Paul Oppenheimer <bepvte@gmail.com>
      Cc: William Kucharski <william.kucharski@oracle.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      c76adee3
    • C
      mm/shmem.c: cast the type of unmap_start to u64 · 32b02bfd
       Committed by Chen Jun
      commit aa71ecd8d86500da6081a72da6b0b524007e0627 upstream.
      
       On a 64-bit system, sb->s_maxbytes of the shmem filesystem is
       MAX_LFS_FILESIZE, which equals LLONG_MAX.
       
       If offset > LLONG_MAX - PAGE_SIZE while offset + len < LLONG_MAX, the
       request in shmem_fallocate will pass the check in vfs_fallocate:
      
      	/* Check for wrap through zero too */
      	if (((offset + len) > inode->i_sb->s_maxbytes) || ((offset + len) < 0))
      		return -EFBIG;
      
       loff_t unmap_start = round_up(offset, PAGE_SIZE) in shmem_fallocate
       then causes an overflow.
      
       Syzkaller reports an overflow problem in mm/shmem:
      
        UBSAN: Undefined behaviour in mm/shmem.c:2014:10
        signed integer overflow: '9223372036854775807 + 1' cannot be represented in type 'long long int'
        CPU: 0 PID:17076 Comm: syz-executor0 Not tainted 4.1.46+ #1
        Hardware name: linux, dummy-virt (DT)
        Call trace:
           dump_backtrace+0x0/0x2c8 arch/arm64/kernel/traps.c:100
           show_stack+0x20/0x30 arch/arm64/kernel/traps.c:238
           __dump_stack lib/dump_stack.c:15 [inline]
           ubsan_epilogue+0x18/0x70 lib/ubsan.c:164
           handle_overflow+0x158/0x1b0 lib/ubsan.c:195
           shmem_fallocate+0x6d0/0x820 mm/shmem.c:2104
           vfs_fallocate+0x238/0x428 fs/open.c:312
           SYSC_fallocate fs/open.c:335 [inline]
           SyS_fallocate+0x54/0xc8 fs/open.c:239
      
       Because unmap_start has overflowed into a negative value, the sign bit
       is propagated by the arithmetic right shift when calculating
       shmem_falloc.start:
      
          shmem_falloc.start = unmap_start >> PAGE_SHIFT.
      
       Fix it by casting unmap_start to u64 before the right shift.
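
       The effect of the cast can be reproduced entirely in userspace; the
       sketch below only mirrors the shift from the fix (on targets where
       right-shifting a negative value is an arithmetic shift, as it is for
       the kernel), nothing else from shmem is reproduced.

       /* Hedged sketch: an arithmetic right shift of the overflowed (negative)
        * loff_t drags the sign bit along; the u64 cast from the fix does not. */
       #include <stdio.h>
       #include <limits.h>

       int main(void)
       {
               /* what round_up(offset, PAGE_SIZE) wraps to near LLONG_MAX */
               long long unmap_start = LLONG_MIN;
               unsigned int page_shift = 12;   /* 4K pages */

               printf("signed shift: %llx\n",
                      (unsigned long long)(unmap_start >> page_shift));
               printf("cast to u64:  %llx\n",
                      (unsigned long long)unmap_start >> page_shift);
               return 0;
       }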
      
      This bug is found in LTS Linux 4.1.  It also seems to exist in mainline.
      
       Link: http://lkml.kernel.org/r/1573867464-5107-1-git-send-email-chenjun102@huawei.com
       Signed-off-by: NChen Jun <chenjun102@huawei.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      32b02bfd
   5. 13 Dec, 2019 (1 commit)
   6. 05 Dec, 2019 (4 commits)
   7. 01 Dec, 2019 (7 commits)
    • D
      mm/memory_hotplug: don't access uninitialized memmaps in shrink_zone_span() · 5779cbc9
       Committed by David Hildenbrand
      commit 7ce700bf11b5e2cb84e4352bbdf2123a7a239c84 upstream.
      
      Let's limit shrinking to !ZONE_DEVICE so we can fix the current code.
      We should never try to touch the memmap of offline sections where we
      could have uninitialized memmaps and could trigger BUGs when calling
      page_to_nid() on poisoned pages.
      
      There is no reliable way to distinguish an uninitialized memmap from an
      initialized memmap that belongs to ZONE_DEVICE, as we don't have
      anything like SECTION_IS_ONLINE we can use similar to
      pfn_to_online_section() for !ZONE_DEVICE memory.
      
      E.g., set_zone_contiguous() similarly relies on pfn_to_online_section()
      and will therefore never set a ZONE_DEVICE zone consecutive.  Stopping
      to shrink the ZONE_DEVICE therefore results in no observable changes,
      besides /proc/zoneinfo indicating different boundaries - something we
      can totally live with.
      
      Before commit d0dc12e8 ("mm/memory_hotplug: optimize memory
      hotplug"), the memmap was initialized with 0 and the node with the right
      value.  So the zone might be wrong but not garbage.  After that commit,
      both the zone and the node will be garbage when touching uninitialized
      memmaps.
      
      Toshiki reported a BUG (race between delayed initialization of
      ZONE_DEVICE memmaps without holding the memory hotplug lock and
      concurrent zone shrinking).
      
        https://lkml.org/lkml/2019/11/14/1040
      
      "Iteration of create and destroy namespace causes the panic as below:
      
            kernel BUG at mm/page_alloc.c:535!
            CPU: 7 PID: 2766 Comm: ndctl Not tainted 5.4.0-rc4 #6
            Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
            RIP: 0010:set_pfnblock_flags_mask+0x95/0xf0
            Call Trace:
             memmap_init_zone_device+0x165/0x17c
             memremap_pages+0x4c1/0x540
             devm_memremap_pages+0x1d/0x60
             pmem_attach_disk+0x16b/0x600 [nd_pmem]
             nvdimm_bus_probe+0x69/0x1c0
             really_probe+0x1c2/0x3e0
             driver_probe_device+0xb4/0x100
             device_driver_attach+0x4f/0x60
             bind_store+0xc9/0x110
             kernfs_fop_write+0x116/0x190
             vfs_write+0xa5/0x1a0
             ksys_write+0x59/0xd0
             do_syscall_64+0x5b/0x180
             entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
        While creating a namespace and initializing memmap, if you destroy the
        namespace and shrink the zone, it will initialize the memmap outside
        the zone and trigger VM_BUG_ON_PAGE(!zone_spans_pfn(page_zone(page),
        pfn), page) in set_pfnblock_flags_mask()."
      
      This BUG is also mitigated by this commit, where we for now stop to
      shrink the ZONE_DEVICE zone until we can do it in a safe and clean way.
      
      Link: http://lkml.kernel.org/r/20191006085646.5768-5-david@redhat.com
      Fixes: f1dd2cd1 ("mm, memory_hotplug: do not associate hotadded memory to zones until online")	[visible after d0dc12e8]
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Reported-by: NAneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
      Reported-by: NToshiki Fukasawa <t-fukasawa@vx.jp.nec.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Pavel Tatashin <pasha.tatashin@soleen.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Alexander Duyck <alexander.h.duyck@linux.intel.com>
      Cc: Alexander Potapenko <glider@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Christophe Leroy <christophe.leroy@c-s.fr>
      Cc: Damian Tometzki <damian.tometzki@gmail.com>
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Halil Pasic <pasic@linux.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Ira Weiny <ira.weiny@intel.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Jun Yao <yaojun8558363@gmail.com>
      Cc: Logan Gunthorpe <logang@deltatee.com>
      Cc: Mark Rutland <mark.rutland@arm.com>
      Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
      Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: Mike Rapoport <rppt@linux.ibm.com>
      Cc: Pankaj Gupta <pagupta@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Rich Felker <dalias@libc.org>
      Cc: Robin Murphy <robin.murphy@arm.com>
      Cc: Steve Capper <steve.capper@arm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Vasily Gorbik <gor@linux.ibm.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Wei Yang <richard.weiyang@gmail.com>
      Cc: Wei Yang <richardw.yang@linux.intel.com>
      Cc: Will Deacon <will@kernel.org>
      Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
      Cc: Yu Zhao <yuzhao@google.com>
      Cc: <stable@vger.kernel.org>	[4.13+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      5779cbc9
    • V
      mm/page_io.c: do not free shared swap slots · 006360ec
       Committed by Vinayak Menon
      [ Upstream commit 5df373e95689b9519b8557da7c5bd0db0856d776 ]
      
       The following race is observed, due to which a process faulting on a
       swap entry finds the page neither in the swapcache nor in swap.  This causes
      zram to give a zero filled page that gets mapped to the process,
      resulting in a user space crash later.
      
      Consider parent and child processes Pa and Pb sharing the same swap slot
      with swap_count 2.  Swap is on zram with SWP_SYNCHRONOUS_IO set.
      Virtual address 'VA' of Pa and Pb points to the shared swap entry.
      
      Pa                                       Pb
      
      fault on VA                              fault on VA
      do_swap_page                             do_swap_page
      lookup_swap_cache fails                  lookup_swap_cache fails
                                               Pb scheduled out
      swapin_readahead (deletes zram entry)
      swap_free (makes swap_count 1)
                                               Pb scheduled in
                                               swap_readpage (swap_count == 1)
                                               Takes SWP_SYNCHRONOUS_IO path
                                                zram entry absent
                                               zram gives a zero filled page
      
       Fix this by making sure that the swap slot is freed only when the swap
       count drops to one.
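
       A sketch of the idea rather than the exact diff: before letting the
       backend discard the slot, bail out while other references to the swap
       entry remain.  page_swapcount() is a real helper that reports the
       count; where exactly the check sits in mm/page_io.c is assumed here.

       /* Hedged sketch: only let the backend (e.g. zram) drop the slot when
        * this is the last reference to the swap entry. */
       if (page_swapcount(page) > 1)
               return;         /* another process may still need the data */
       /* ... otherwise proceed to notify the driver that the slot is free ... */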
      
      Link: http://lkml.kernel.org/r/1571743294-14285-1-git-send-email-vinmenon@codeaurora.org
      Fixes: aa8d22a1 ("mm: swap: SWP_SYNCHRONOUS_IO: skip swapcache only if swapped page has no other reference")
      Signed-off-by: NVinayak Menon <vinmenon@codeaurora.org>
      Suggested-by: NMinchan Kim <minchan@google.com>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      006360ec
    • R
      mm: handle no memcg case in memcg_kmem_charge() properly · 6a2245d8
       Committed by Roman Gushchin
      [ Upstream commit e68599a3c3ad0f3171a7cb4e48aa6f9a69381902 ]
      
      Mike Galbraith reported a regression caused by the commit 9b6f7e163cd0
      ("mm: rework memcg kernel stack accounting") on a system with
      "cgroup_disable=memory" boot option: the system panics with the following
      stack trace:
      
        BUG: unable to handle kernel NULL pointer dereference at 00000000000000f8
        PGD 0 P4D 0
        Oops: 0002 [#1] PREEMPT SMP PTI
        CPU: 0 PID: 1 Comm: systemd Not tainted 4.19.0-preempt+ #410
        Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20180531_142017-buildhw-08.phx2.fed4
        RIP: 0010:page_counter_try_charge+0x22/0xc0
        Code: 41 5d c3 c3 0f 1f 40 00 0f 1f 44 00 00 48 85 ff 0f 84 a7 00 00 00 41 56 48 89 f8 49 89 fe 49
        Call Trace:
         try_charge+0xcb/0x780
         memcg_kmem_charge_memcg+0x28/0x80
         memcg_kmem_charge+0x8b/0x1d0
         copy_process.part.41+0x1ca/0x2070
         _do_fork+0xd7/0x3d0
         do_syscall_64+0x5a/0x180
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
       The problem occurs because get_mem_cgroup_from_current() returns a NULL
       pointer if the memory controller is disabled.  Let's check for this
       case at the beginning of memcg_kmem_charge() and just return 0 if
       mem_cgroup_disabled() returns true.  This is how we handle this case in
       many other places in the memory controller code.
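
       The shape of the described check, with the rest of the function elided
       (the signature matches this kernel generation; treat it as a sketch
       rather than the verbatim patch):

       /* Hedged sketch: with cgroup_disable=memory there is no memcg to
        * charge, so report success instead of dereferencing a NULL memcg. */
       int memcg_kmem_charge(struct page *page, gfp_t gfp, int order)
       {
               if (mem_cgroup_disabled())
                       return 0;
               /* ... normal kmem charging path ... */
       }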
      
      Link: http://lkml.kernel.org/r/20181029215123.17830-1-guro@fb.com
      Fixes: 9b6f7e163cd0 ("mm: rework memcg kernel stack accounting")
      Signed-off-by: NRoman Gushchin <guro@fb.com>
      Reported-by: NMike Galbraith <efault@gmx.de>
      Acked-by: NRik van Riel <riel@surriel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      6a2245d8
    • D
      mm/memory_hotplug: fix online/offline_pages called w.o. mem_hotplug_lock · 17523d7a
       Committed by David Hildenbrand
      [ Upstream commit 381eab4a6ee81266f8dddc62e57376c7e584e5b8 ]
      
       There seem to be some problems as a result of 30467e0b ("mm, hotplug:
      fix concurrent memory hot-add deadlock"), which tried to fix a possible
      lock inversion reported and discussed in [1] due to the two locks
      	a) device_lock()
      	b) mem_hotplug_lock
      
      While add_memory() first takes b), followed by a) during
      bus_probe_device(), onlining of memory from user space first took a),
      followed by b), exposing a possible deadlock.
      
       In [1], it was decided not to make use of device_hotplug_lock, but
       rather to enforce a locking order.
      
      The problems I spotted related to this:
      
       1. Memory block device attributes: While .state first calls
          mem_hotplug_begin() and then calls device_online() - which takes
          device_lock() - .online no longer calls mem_hotplug_begin(), so it
          effectively calls online_pages() without mem_hotplug_lock.
      
      2. device_online() should be called under device_hotplug_lock, however
         onlining memory during add_memory() does not take care of that.
      
      In addition, I think there is also something wrong about the locking in
      
       3. arch/powerpc/platforms/powernv/memtrace.c calls offline_pages()
          without locks. This was introduced after 30467e0b. And skimming over
          the code, I assume it could need some more care in regards to locking
          (e.g. device_online() called without device_hotplug_lock). This will
          be addressed in the following patches.
      
      Now that we hold the device_hotplug_lock when
      - adding memory (e.g. via add_memory()/add_memory_resource())
      - removing memory (e.g. via remove_memory())
      - device_online()/device_offline()
      
      We can move mem_hotplug_lock usage back into
      online_pages()/offline_pages().
      
       Why is mem_hotplug_lock still needed? Essentially to make
       get_online_mems()/put_online_mems() very fast (relying on
       device_hotplug_lock would be very slow), and to serialize against the
       addition of memory that does not create memory block devices (hmm).
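
       Conceptually the result looks like the sketch below (signature as in
       this kernel generation, body elided): mem_hotplug_begin()/done()
       bracket online_pages()/offline_pages() again, while device_hotplug_lock
       is held further out by the callers.

       /* Hedged sketch of the nesting: device_hotplug_lock is taken by the
        * callers (sysfs .state/.online, add_memory()), mem_hotplug_lock here. */
       int online_pages(unsigned long pfn, unsigned long nr_pages, int online_type)
       {
               mem_hotplug_begin();    /* mem_hotplug_lock, writer side */
               /* ... actual onlining of the pfn range ... */
               mem_hotplug_done();
               return 0;
       }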
      
       [1] http://driverdev.linuxdriverproject.org/pipermail/driverdev-devel/2015-February/065324.html
      
      This patch is partly based on a patch by Vitaly Kuznetsov.
      
       Link: http://lkml.kernel.org/r/20180925091457.28651-4-david@redhat.com
       Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Reviewed-by: NPavel Tatashin <pavel.tatashin@microsoft.com>
      Reviewed-by: NRashmica Gupta <rashmica.g@gmail.com>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Rashmica Gupta <rashmica.g@gmail.com>
      Cc: Michael Neuling <mikey@neuling.org>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Kate Stewart <kstewart@linuxfoundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Oscar Salvador <osalvador@suse.de>
      Cc: YASUAKI ISHIMATSU <yasu.isimatu@gmail.com>
      Cc: Mathieu Malaterre <malat@debian.org>
      Cc: John Allen <jallen@linux.vnet.ibm.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      17523d7a
    • D
      mm/memory_hotplug: make add_memory() take the device_hotplug_lock · 02735d59
       Committed by David Hildenbrand
      [ Upstream commit 8df1d0e4a265f25dc1e7e7624ccdbcb4a6630c89 ]
      
       add_memory() currently does not take the device_hotplug_lock, however
       it is already called under the lock from
      	arch/powerpc/platforms/pseries/hotplug-memory.c
      	drivers/acpi/acpi_memhotplug.c
      to synchronize against CPU hot-remove and similar.
      
      In general, we should hold the device_hotplug_lock when adding memory to
       synchronize against online/offline requests (e.g.  from user space) - which
      already resulted in lock inversions due to device_lock() and
      mem_hotplug_lock - see 30467e0b ("mm, hotplug: fix concurrent memory
      hot-add deadlock").  add_memory()/add_memory_resource() will create memory
      block devices, so this really feels like the right thing to do.
      
      Holding the device_hotplug_lock makes sure that a memory block device
      can really only be accessed (e.g. via .online/.state) from user space,
      once the memory has been fully added to the system.
      
      The lock is not held yet in
      	drivers/xen/balloon.c
      	arch/powerpc/platforms/powernv/memtrace.c
      	drivers/s390/char/sclp_cmd.c
      	drivers/hv/hv_balloon.c
      So, let's either use the locked variants or take the lock.
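
       The locked-variant pattern referred to above looks roughly like the
       sketch below; the __add_memory() spelling of the lock-free variant is
       how later kernels name it and is an assumption here.

       /* Hedged sketch: the exported entry point takes device_hotplug_lock,
        * callers that already hold it use the lock-free variant instead. */
       int add_memory(int nid, u64 start, u64 size)
       {
               int rc;

               lock_device_hotplug();
               rc = __add_memory(nid, start, size);    /* caller must hold the lock */
               unlock_device_hotplug();

               return rc;
       }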
      
      Don't export add_memory_resource(), as it once was exported to be used by
      XEN, which is never built as a module.  If somebody requires it, we also
      have to export a locked variant (as device_hotplug_lock is never
      exported).
      
       Link: http://lkml.kernel.org/r/20180925091457.28651-3-david@redhat.com
       Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Reviewed-by: NPavel Tatashin <pavel.tatashin@microsoft.com>
      Reviewed-by: NRafael J. Wysocki <rafael.j.wysocki@intel.com>
      Reviewed-by: NRashmica Gupta <rashmica.g@gmail.com>
      Reviewed-by: NOscar Salvador <osalvador@suse.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Michael Ellerman <mpe@ellerman.id.au>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Cc: Len Brown <lenb@kernel.org>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: Nathan Fontenot <nfont@linux.vnet.ibm.com>
      Cc: John Allen <jallen@linux.vnet.ibm.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: Mathieu Malaterre <malat@debian.org>
      Cc: Pavel Tatashin <pavel.tatashin@microsoft.com>
      Cc: YASUAKI ISHIMATSU <yasu.isimatu@gmail.com>
      Cc: Balbir Singh <bsingharora@gmail.com>
      Cc: Haiyang Zhang <haiyangz@microsoft.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Kate Stewart <kstewart@linuxfoundation.org>
      Cc: "K. Y. Srinivasan" <kys@microsoft.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Michael Neuling <mikey@neuling.org>
      Cc: Philippe Ombredanne <pombredanne@nexb.com>
      Cc: Stephen Hemminger <sthemmin@microsoft.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      02735d59
    • D
      mm/gup_benchmark.c: prevent integer overflow in ioctl · 30598425
       Committed by Dan Carpenter
      [ Upstream commit 4b408c74ee5a0b74fc9265c2fe39b0e7dec7c056 ]
      
      The concern here is that "gup->size" is a u64 and "nr_pages" is unsigned
      long.  On 32 bit systems we could trick the kernel into allocating fewer
      pages than expected.
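
       The truncation is easy to demonstrate in userspace; this sketch only
       mirrors the u64-to-unsigned-long conversion the commit worries about,
       with a value chosen to show the wrap on a 32-bit target.

       /* Hedged sketch: on 32-bit, deriving an unsigned long page count from
        * a u64 size silently truncates, yielding far fewer pages than asked. */
       #include <stdio.h>
       #include <stdint.h>

       int main(void)
       {
               uint64_t size = (1ULL << 44) + 4096;    /* user-controlled value */
               unsigned long nr_pages = size / 4096;   /* wraps if long is 32-bit */

               printf("expected pages: %llu, got: %lu\n",
                      (unsigned long long)(size / 4096), nr_pages);
               return 0;
       }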
      
      Link: http://lkml.kernel.org/r/20181025061546.hnhkv33diogf2uis@kili.mountain
      Fixes: 64c349f4 ("mm: add infrastructure for get_user_pages_fast() benchmarking")
      Signed-off-by: NDan Carpenter <dan.carpenter@oracle.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Stephen Rothwell <sfr@canb.auug.org.au>
      Cc: Keith Busch <keith.busch@intel.com>
      Cc: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: YueHaibing <yuehaibing@huawei.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      30598425
    • A
      mm: thp: fix MADV_DONTNEED vs migrate_misplaced_transhuge_page race condition · 4291e97c
       Committed by Andrea Arcangeli
      [ Upstream commit d7c3393413fe7e7dc54498ea200ea94742d61e18 ]
      
      Patch series "migrate_misplaced_transhuge_page race conditions".
      
      Aaron found a new instance of the THP MADV_DONTNEED race against
      pmdp_clear_flush* variants, that was apparently left unfixed.
      
      While looking into the race found by Aaron, I may have found two more
      issues in migrate_misplaced_transhuge_page.
      
      These race conditions would not cause kernel instability, but they'd
      corrupt userland data or leave data non zero after MADV_DONTNEED.
      
      I did only minor testing, and I don't expect to be able to reproduce this
      (especially the lack of ->invalidate_range before migrate_page_copy,
      requires the latest iommu hardware or infiniband to reproduce).  The last
      patch is noop for x86 and it needs further review from maintainers of
      archs that implement flush_cache_range() (not in CC yet).
      
       To avoid confusion: it's not the first patch that introduces the bug
       fixed in the second patch; even before removing the
       pmdp_huge_clear_flush_notify, that _notify suffix was called after
       migrate_page_copy had already run.
      
      This patch (of 3):
      
      This is a corollary of ced10803 ("thp: fix MADV_DONTNEED vs.  numa
      balancing race"), 58ceeb6b ("thp: fix MADV_DONTNEED vs.  MADV_FREE
      race") and 5b7abeae ("thp: fix MADV_DONTNEED vs clear soft dirty
       race").
      
       When the above three fixes were posted, Dave asked
      https://lkml.kernel.org/r/929b3844-aec2-0111-fef7-8002f9d4e2b9@intel.com
      but apparently this was missed.
      
      The pmdp_clear_flush* in migrate_misplaced_transhuge_page() was introduced
      in a54a407f ("mm: Close races between THP migration and PMD numa
      clearing").
      
      The important part of such commit is only the part where the page lock is
      not released until the first do_huge_pmd_numa_page() finished disarming
      the pagenuma/protnone.
      
      The addition of pmdp_clear_flush() wasn't beneficial to such commit and
      there's no commentary about such an addition either.
      
      I guess the pmdp_clear_flush() in such commit was added just in case for
      safety, but it ended up introducing the MADV_DONTNEED race condition found
      by Aaron.
      
      At that point in time nobody thought of such kind of MADV_DONTNEED race
      conditions yet (they were fixed later) so the code may have looked more
      robust by adding the pmdp_clear_flush().
      
      This specific race condition won't destabilize the kernel, but it can
      confuse userland because after MADV_DONTNEED the memory won't be zeroed
      out.
      
      This also optimizes the code and removes a superfluous TLB flush.
      
      [akpm@linux-foundation.org: reflow comment to 80 cols, fix grammar and typo (beacuse)]
       Link: http://lkml.kernel.org/r/20181013002430.698-2-aarcange@redhat.com
       Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Reported-by: NAaron Tomlin <atomlin@redhat.com>
      Acked-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      4291e97c