1. 29 11月, 2019 1 次提交
    • X
      alios: mm, memcg: fix possible soft lockup in try_charge · 1f6142a0
      Xu Yu 提交于
      When events such as direct reclaim and oom occur intensively, soft
      lockup is very likely to happen in the instances with 1 vcpu and with
      kernel preempt disabled.
      
      The example soft lockup is as follows.
      
      [  160.555984] watchdog: BUG: soft lockup - CPU#0 stuck for 112s! [malloc:2188]
      [  160.557975] Modules linked in: button
      [  160.559495] CPU: 0 PID: 2188 Comm: malloc Not tainted 4.19.57-15.457.al7.x86_64 #1
      [  160.561546] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 3288b3c 04/01/2014
      [  160.563707] RIP: 0010:shrink_node+0x1ae/0x450
      [  160.565391] Code: 00 00 00 49 8b 4f 20 ba 01 00 00 00 4c 8b 74 24 10 4d 8b 47 28 49 8b 77 10 48 2b 4c 24 08 41 8b 7f 1c 4d8
      [  160.570747] RSP: 0000:ffff9d0ec07a3b58 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff13
      [  160.572889] RAX: ffff982ab6014330 RBX: ffff982ab6014000 RCX: 0000000000000000
      [  160.574992] RDX: 0000000000000001 RSI: ffff982ab6014000 RDI: ffff982ab6014000
      [  160.577106] RBP: ffff982afffb6000 R08: 0000000000000000 R09: ffff982ab6014000
      [  160.579219] R10: 0000000000000004 R11: 0000000000aaaaaa R12: 0000000000000000
      [  160.581326] R13: 0000000000000000 R14: 0000000000000000 R15: ffff9d0ec07a3c50
      [  160.583450] FS:  00007f8b414f7740(0000) GS:ffff982afda00000(0000) knlGS:0000000000000000
      [  160.585704] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  160.587662] CR2: 00007f8adb800010 CR3: 000000007ac9e001 CR4: 00000000003606b0
      [  160.589835] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  160.591971] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  160.594133] Call Trace:
      [  160.595602]  do_try_to_free_pages+0xcc/0x390
      [  160.597356]  try_to_free_mem_cgroup_pages+0xf9/0x1d0
      [  160.599198]  ? out_of_memory+0xb5/0x4a0
      [  160.600882]  try_charge+0x244/0x750
      [  160.602522]  ? __pagevec_lru_add_fn+0x1d0/0x330
      [  160.604310]  mem_cgroup_try_charge+0xb4/0x1d0
      [  160.606085]  mem_cgroup_try_charge_delay+0x1c/0x40
      [  160.607892]  do_anonymous_page+0xf7/0x540
      [  160.609574]  __handle_mm_fault+0x665/0xa00
      [  160.611233]  ? __switch_to_asm+0x35/0x70
      [  160.612838]  handle_mm_fault+0x122/0x1e0
      [  160.614407]  __do_page_fault+0x1b7/0x470
      [  160.615962]  do_page_fault+0x32/0x140
      [  160.617474]  ? async_page_fault+0x8/0x30
      [  160.619012]  async_page_fault+0x1e/0x30
      [  160.620526] RIP: 0033:0x40068e
      
      Fix it by adding cond_resched() in try_charge(), just before goto retry
      after OOM_SUCCESS, in order to let OOM free some memory first.
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      1f6142a0
  2. 28 11月, 2019 1 次提交
  3. 21 11月, 2019 1 次提交
  4. 20 11月, 2019 6 次提交
    • D
      mm/memory-hotplug: Allow memory resources to be children · 71009d66
      Dave Hansen 提交于
      commit 2794129e902d8eb69413d884dc6404b8716ed9ed upstream
      
      The mm/resource.c code is used to manage the physical address
      space.  The current resource configuration can be viewed in
      /proc/iomem.  An example of this is at the bottom of this
      description.
      
      The nvdimm subsystem "owns" the physical address resources which
      map to persistent memory and has resources inserted for them as
      "Persistent Memory".  The best way to repurpose this for volatile
      use is to leave the existing resource in place, but add a "System
      RAM" resource underneath it. This clearly communicates the
      ownership relationship of this memory.
      
      The request_resource_conflict() API only deals with the
      top-level resources.  Replace it with __request_region() which
      will search for !IORESOURCE_BUSY areas lower in the resource
      tree than the top level.
      
      We *could* also simply truncate the existing top-level
      "Persistent Memory" resource and take over the released address
      space.  But, this means that if we ever decide to hot-unplug the
      "RAM" and give it back, we need to recreate the original setup,
      which may mean going back to the BIOS tables.
      
      This should have no real effect on the existing collision
      detection because the areas that truly conflict should be marked
      IORESOURCE_BUSY.
      
      00000000-00000fff : Reserved
      00001000-0009fbff : System RAM
      0009fc00-0009ffff : Reserved
      000a0000-000bffff : PCI Bus 0000:00
      000c0000-000c97ff : Video ROM
      000c9800-000ca5ff : Adapter ROM
      000f0000-000fffff : Reserved
        000f0000-000fffff : System ROM
      00100000-9fffffff : System RAM
        01000000-01e071d0 : Kernel code
        01e071d1-027dfdff : Kernel data
        02dc6000-0305dfff : Kernel bss
      a0000000-afffffff : Persistent Memory (legacy)
        a0000000-a7ffffff : System RAM
      b0000000-bffdffff : System RAM
      bffe0000-bfffffff : Reserved
      c0000000-febfffff : PCI Bus 0000:00
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: NDan Williams <dan.j.williams@intel.com>
      Reviewed-by: NVishal Verma <vishal.l.verma@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: linux-nvdimm@lists.01.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com>
      71009d66
    • D
      mm/resource: Move HMM pr_debug() deeper into resource code · cdb8d31f
      Dave Hansen 提交于
      commit b926b7f3baecb2a855db629e6822e1a85212e91c upstream
      
      HMM consumes physical address space for its own use, even
      though nothing is mapped or accessible there.  It uses a
      special resource description (IORES_DESC_DEVICE_PRIVATE_MEMORY)
      to uniquely identify these areas.
      
      When HMM consumes address space, it makes a best guess about
      what to consume.  However, it is possible that a future memory
      or device hotplug can collide with the reserved area.  In the
      case of these conflicts, there is an error message in
      register_memory_resource().
      
      Later patches in this series move register_memory_resource()
      from using request_resource_conflict() to __request_region().
      Unfortunately, __request_region() does not return the conflict
      like the previous function did, which makes it impossible to
      check for IORES_DESC_DEVICE_PRIVATE_MEMORY in a conflicting
      resource.
      
      Instead of warning in register_memory_resource(), move the
      check into the core resource code itself (__request_region())
      where the conflicting resource _is_ available.  This has the
      added bonus of producing a warning in case of HMM conflicts
      with devices *or* RAM address space, as opposed to the RAM-
      only warnings that were there previously.
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: NJerome Glisse <jglisse@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: linux-nvdimm@lists.01.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com>
      cdb8d31f
    • C
      mm, memcg: throttle allocators when failing reclaim over memory.high · eda29cc0
      Chris Down 提交于
      commit 0e4b01df865935007bd712cbc8e7299005b28894 upstream.
      
      We're trying to use memory.high to limit workloads, but have found that
      containment can frequently fail completely and cause OOM situations
      outside of the cgroup.  This happens especially with swap space -- either
      when none is configured, or swap is full.  These failures often also don't
      have enough warning to allow one to react, whether for a human or for a
      daemon monitoring PSI.
      
      Here is output from a simple program showing how long it takes in usec
      (column 2) to allocate a megabyte of anonymous memory (column 1) when a
      cgroup is already beyond its memory high setting, and no swap is
      available:
      
          [root@ktst ~]# systemd-run -p MemoryHigh=100M -p MemorySwapMax=1 \
          > --wait -t timeout 300 /root/mdf
          [...]
          95  1035
          96  1038
          97  1000
          98  1036
          99  1048
          100 1590
          101 1968
          102 1776
          103 1863
          104 1757
          105 1921
          106 1893
          107 1760
          108 1748
          109 1843
          110 1716
          111 1924
          112 1776
          113 1831
          114 1766
          115 1836
          116 1588
          117 1912
          118 1802
          119 1857
          120 1731
          [...]
          [System OOM in 2-3 seconds]
      
      The delay does go up extremely marginally past the 100MB memory.high
      threshold, as now we spend time scanning before returning to usermode, but
      it's nowhere near enough to contain growth.  It also doesn't get worse the
      more pages you have, since it only considers nr_pages.
      
      The current situation goes against both the expectations of users of
      memory.high, and our intentions as cgroup v2 developers.  In
      cgroup-v2.txt, we claim that we will throttle and only under "extreme
      conditions" will memory.high protection be breached.  Likewise, cgroup v2
      users generally also expect that memory.high should throttle workloads as
      they exceed their high threshold.  However, as seen above, this isn't
      always how it works in practice -- even on banal setups like those with no
      swap, or where swap has become exhausted, we can end up with memory.high
      being breached and us having no weapons left in our arsenal to combat
      runaway growth with, since reclaim is futile.
      
      It's also hard for system monitoring software or users to tell how bad the
      situation is, as "high" events for the memcg may in some cases be benign,
      and in others be catastrophic.  The current status quo is that we fail
      containment in a way that doesn't provide any advance warning that things
      are about to go horribly wrong (for example, we are about to invoke the
      kernel OOM killer).
      
      This patch introduces explicit throttling when reclaim is failing to keep
      memcg size contained at the memory.high setting.  It does so by applying
      an exponential delay curve derived from the memcg's overage compared to
      memory.high.  In the normal case where the memcg is either below or only
      marginally over its memory.high setting, no throttling will be performed.
      
      This composes well with system health monitoring and remediation, as these
      allocator delays are factored into PSI's memory pressure calculations.
      This both creates a mechanism system administrators or applications
      consuming the PSI interface to trivially see that the memcg in question is
      struggling and use that to make more reasonable decisions, and permits
      them enough time to act.  Either of these can act with significantly more
      nuance than that we can provide using the system OOM killer.
      
      This is a similar idea to memory.oom_control in cgroup v1 which would put
      the cgroup to sleep if the threshold was violated, but it's also
      significantly improved as it results in visible memory pressure, and also
      doesn't schedule indefinitely, which previously made tracing and other
      introspection difficult (ie.  it's clamped at 2*HZ per allocation through
      MEMCG_MAX_HIGH_DELAY_JIFFIES).
      
      Contrast the previous results with a kernel with this patch:
      
          [root@ktst ~]# systemd-run -p MemoryHigh=100M -p MemorySwapMax=1 \
          > --wait -t timeout 300 /root/mdf
          [...]
          95  1002
          96  1000
          97  1002
          98  1003
          99  1000
          100 1043
          101 84724
          102 330628
          103 610511
          104 1016265
          105 1503969
          106 2391692
          107 2872061
          108 3248003
          109 4791904
          110 5759832
          111 6912509
          112 8127818
          113 9472203
          114 12287622
          115 12480079
          116 14144008
          117 15808029
          118 16384500
          119 16383242
          120 16384979
          [...]
      
      As you can see, in the normal case, memory allocation takes around 1000
      usec.  However, as we exceed our memory.high, things start to increase
      exponentially, but fairly leniently at first.  Our first megabyte over
      memory.high takes us 0.16 seconds, then the next is 0.46 seconds, then the
      next is almost an entire second.  This gets worse until we reach our
      eventual 2*HZ clamp per batch, resulting in 16 seconds per megabyte.
      However, this is still making forward progress, so permits tracing or
      further analysis with programs like GDB.
      
      We use an exponential curve for our delay penalty for a few reasons:
      
      1. We run mem_cgroup_handle_over_high to potentially do reclaim after
         we've already performed allocations, which means that temporarily
         going over memory.high by a small amount may be perfectly legitimate,
         even for compliant workloads. We don't want to unduly penalise such
         cases.
      2. An exponential curve (as opposed to a static or linear delay) allows
         ramping up memory pressure stats more gradually, which can be useful
         to work out that you have set memory.high too low, without destroying
         application performance entirely.
      
      This patch expands on earlier work by Johannes Weiner. Thanks!
      
      [akpm@linux-foundation.org: fix max() warning]
      [akpm@linux-foundation.org: fix __udivdi3 ref on 32-bit]
      [akpm@linux-foundation.org: fix it even more]
      [chris@chrisdown.name: fix 64-bit divide even more]
      Link: http://lkml.kernel.org/r/20190723180700.GA29459@chrisdown.nameSigned-off-by: NChris Down <chris@chrisdown.name>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nathan Chancellor <natechancellor@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      eda29cc0
    • Q
      mm/zsmalloc.c: fix a -Wunused-function warning · 68247716
      Qian Cai 提交于
      commit 2b38d01b4de8b1bbda7f5f7e91252609557635fc upstream
      
      set_zspage_inuse() was introduced in the commit 4f42047b ("zsmalloc:
      use accessor") but all the users of it were removed later by the commits,
      
      bdb0af7c ("zsmalloc: factor page chain functionality out")
      3783689a ("zsmalloc: introduce zspage structure")
      
      so the function can be safely removed now.
      
      Link: http://lkml.kernel.org/r/1568658408-19374-1-git-send-email-cai@lca.pwSigned-off-by: NQian Cai <cai@lca.pw>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NHui Zhu <teawaterz@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      68247716
    • J
      x86/mm: Split vmalloc_sync_all() · 73c092d2
      Joerg Roedel 提交于
      commit 1a0a610d5f056c6195ae9808962477a94d1d72c8 upstream.
      
      Commit 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in
      __purge_vmap_area_lazy()") introduced a call to vmalloc_sync_all() in the
      vunmap() code-path.  While this change was necessary to maintain
      correctness on x86-32-pae kernels, it also adds additional cycles for
      architectures that don't need it.
      
      Specifically on x86-64 with CONFIG_VMAP_STACK=y some people reported
      severe performance regressions in micro-benchmarks because it now also
      calls the x86-64 implementation of vmalloc_sync_all() on vunmap().  But
      the vmalloc_sync_all() implementation on x86-64 is only needed for newly
      created mappings.
      
      To avoid the unnecessary work on x86-64 and to gain the performance back,
      split up vmalloc_sync_all() into two functions:
      
      	* vmalloc_sync_mappings(), and
      	* vmalloc_sync_unmappings()
      
      Most call-sites to vmalloc_sync_all() only care about new mappings being
      synchronized.  The only exception is the new call-site added in the above
      mentioned commit.
      
      Shile Zhang directed us to a report of an 80% regression in reaim
      throughput.
      
      Link: http://lkml.kernel.org/r/20191009124418.8286-1-joro@8bytes.org
      Link: https://lists.01.org/hyperkitty/list/lkp@lists.01.org/thread/4D3JPPHBNOSPFK2KEPC6KGKS6J25AIDB/
      Link: http://lkml.kernel.org/r/20191113095530.228959-1-shile.zhang@linux.alibaba.com
      Fixes: 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()")
      Signed-off-by: NJoerg Roedel <jroedel@suse.de>
      Reported-by: Nkernel test robot <oliver.sang@intel.com>
      Reported-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	[GHES]
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      73c092d2
    • V
      zswap: do not map same object twice · 2a1613fd
      Vitaly Wool 提交于
      commit 068619e32ff6229a09407d267e36ea7710b96ea1 upstream
      
      zswap_writeback_entry() maps a handle to read swpentry first, and
      then in the most common case it would map the same handle again.
      This is ok when zbud is the backend since its mapping callback is
      plain and simple, but it slows things down for z3fold.
      
      Since there's hardly a point in unmapping a handle _that_ fast as
      zswap_writeback_entry() does when it reads swpentry, the
      suggestion is to keep the handle mapped till the end.
      
      Link: http://lkml.kernel.org/r/20190916004640.b453167d3556c4093af4cf7d@gmail.comSigned-off-by: NHui Zhu <teawaterz@linux.alibaba.com>
      Signed-off-by: NVitaly Wool <vitalywool@gmail.com>
      Reviewed-by: NDan Streetman <ddstreet@ieee.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      2a1613fd
  5. 19 11月, 2019 4 次提交
    • H
      mm/swap_state.c: simplify total_swapcache_pages() with get_swap_device() · a042c2e0
      Huang Ying 提交于
      commit 054f1d1faaed6a7930b77286d607ae45c01d0443 upstream.
      
      total_swapcache_pages() may race with swapper_spaces[] allocation and
      freeing.  Previously, this is protected with a swapper_spaces[] specific
      RCU mechanism.  To simplify the logic/code complexity, it is replaced with
      get/put_swap_device().  The code line number is reduced too.  Although not
      so important, the swapoff() performance improves too because one
      synchronize_rcu() call during swapoff() is deleted.
      
      [ying.huang@intel.com: fix bad swap file entry warning]
        Link: http://lkml.kernel.org/r/20190531024102.21723-1-ying.huang@intel.com
      Link: http://lkml.kernel.org/r/20190527082714.12151-1-ying.huang@intel.comSigned-off-by: NHui Zhu <teawaterz@linux.alibaba.com>
      Signed-off-by: N"Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Tested-by: NMike Kravetz <mike.kravetz@oracle.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Andrea Parri <andrea.parri@amarulasolutions.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      a042c2e0
    • H
      mm, swap: fix race between swapoff and some swap operations · 73c29467
      Huang Ying 提交于
      commit eb085574a7526c4375965c5fbf7e5b0c19cdd336 upstream.
      Change SWP_VALID to (1 << 12).
      
      When swapin is performed, after getting the swap entry information from
      the page table, system will swap in the swap entry, without any lock held
      to prevent the swap device from being swapoff.  This may cause the race
      like below,
      
      CPU 1				CPU 2
      -----				-----
      				do_swap_page
      				  swapin_readahead
      				    __read_swap_cache_async
      swapoff				      swapcache_prepare
        p->swap_map = NULL		        __swap_duplicate
      					  p->swap_map[?] /* !!! NULL pointer access */
      
      Because swapoff is usually done when system shutdown only, the race may
      not hit many people in practice.  But it is still a race need to be fixed.
      
      To fix the race, get_swap_device() is added to check whether the specified
      swap entry is valid in its swap device.  If so, it will keep the swap
      entry valid via preventing the swap device from being swapoff, until
      put_swap_device() is called.
      
      Because swapoff() is very rare code path, to make the normal path runs as
      fast as possible, rcu_read_lock/unlock() and synchronize_rcu() instead of
      reference count is used to implement get/put_swap_device().  >From
      get_swap_device() to put_swap_device(), RCU reader side is locked, so
      synchronize_rcu() in swapoff() will wait until put_swap_device() is
      called.
      
      In addition to swap_map, cluster_info, etc.  data structure in the struct
      swap_info_struct, the swap cache radix tree will be freed after swapoff,
      so this patch fixes the race between swap cache looking up and swapoff
      too.
      
      Races between some other swap cache usages and swapoff are fixed too via
      calling synchronize_rcu() between clearing PageSwapCache() and freeing
      swap cache data structure.
      
      Another possible method to fix this is to use preempt_off() +
      stop_machine() to prevent the swap device from being swapoff when its data
      structure is being accessed.  The overhead in hot-path of both methods is
      similar.  The advantages of RCU based method are,
      
      1. stop_machine() may disturb the normal execution code path on other
         CPUs.
      
      2. File cache uses RCU to protect its radix tree.  If the similar
         mechanism is used for swap cache too, it is easier to share code
         between them.
      
      3. RCU is used to protect swap cache in total_swapcache_pages() and
         exit_swap_address_space() already.  The two mechanisms can be
         merged to simplify the logic.
      
      Link: http://lkml.kernel.org/r/20190522015423.14418-1-ying.huang@intel.com
      Fixes: 235b6217 ("mm/swap: add cluster lock")
      Signed-off-by: NHui Zhu <teawaterz@linux.alibaba.com>
      Signed-off-by: N"Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: NAndrea Parri <andrea.parri@amarulasolutions.com>
      Not-nacked-by: NHugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      73c29467
    • Y
      mm: swap: check if swap backing device is congested or not · 533c7f15
      Yang Shi 提交于
      commit 8fd2e0b505d124bbb046ab15de0ff6f8d4babf56 upstream.
      Change SWP_FS to SWP_FILE.
      
      Swap readahead would read in a few pages regardless if the underlying
      device is busy or not.  It may incur long waiting time if the device is
      congested, and it may also exacerbate the congestion.
      
      Use inode_read_congested() to check if the underlying device is busy or
      not like what file page readahead does.  Get inode from
      swap_info_struct.
      
      Although we can add inode information in swap_address_space
      (address_space->host), it may lead some unexpected side effect, i.e.  it
      may break mapping_cap_account_dirty().  Using inode from
      swap_info_struct seems simple and good enough.
      
      Just does the check in vma_cluster_readahead() since
      swap_vma_readahead() is just used for non-rotational device which much
      less likely has congestion than traditional HDD.
      
      Although swap slots may be consecutive on swap partition, it still may
      be fragmented on swap file.  This check would help to reduce excessive
      stall for such case.
      
      The test with page_fault1 of will-it-scale (sometimes tracing may just
      show runtest.py that is the wrapper script of page_fault1), which
      basically launches NR_CPU threads to generate 128MB anonymous pages for
      each thread, on my virtual machine with congested HDD shows long tail
      latency is reduced significantly.
      
      Without the patch
       page_fault1_thr-1490  [023]   129.311706: funcgraph_entry:      #57377.796 us |  do_swap_page();
       page_fault1_thr-1490  [023]   129.369103: funcgraph_entry:        5.642us   |  do_swap_page();
       page_fault1_thr-1490  [023]   129.369119: funcgraph_entry:      #1289.592 us |  do_swap_page();
       page_fault1_thr-1490  [023]   129.370411: funcgraph_entry:        4.957us   |  do_swap_page();
       page_fault1_thr-1490  [023]   129.370419: funcgraph_entry:        1.940us   |  do_swap_page();
       page_fault1_thr-1490  [023]   129.378847: funcgraph_entry:      #1411.385 us |  do_swap_page();
       page_fault1_thr-1490  [023]   129.380262: funcgraph_entry:        3.916us   |  do_swap_page();
       page_fault1_thr-1490  [023]   129.380275: funcgraph_entry:      #4287.751 us |  do_swap_page();
      
      With the patch
            runtest.py-1417  [020]   301.925911: funcgraph_entry:      #9870.146 us |  do_swap_page();
            runtest.py-1417  [020]   301.935785: funcgraph_entry:        9.802us   |  do_swap_page();
            runtest.py-1417  [020]   301.935799: funcgraph_entry:        3.551us   |  do_swap_page();
            runtest.py-1417  [020]   301.935806: funcgraph_entry:        2.142us   |  do_swap_page();
            runtest.py-1417  [020]   301.935853: funcgraph_entry:        6.938us   |  do_swap_page();
            runtest.py-1417  [020]   301.935864: funcgraph_entry:        3.765us   |  do_swap_page();
            runtest.py-1417  [020]   301.935871: funcgraph_entry:        3.600us   |  do_swap_page();
            runtest.py-1417  [020]   301.935878: funcgraph_entry:        7.202us   |  do_swap_page();
      
      [akpm@linux-foundation.org: code cleanup]
      [yang.shi@linux.alibaba.com: add comment]
        Link: http://lkml.kernel.org/r/bbc7bda7-62d0-df1a-23ef-d369e865bdca@linux.alibaba.com
      Link: http://lkml.kernel.org/r/1546543673-108536-1-git-send-email-yang.shi@linux.alibaba.comSigned-off-by: NHui Zhu <teawaterz@linux.alibaba.com>
      Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
      Acked-by: NTim Chen <tim.c.chen@intel.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Hugh Dickins <hughd@google.com
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      533c7f15
    • W
      vmscan: return NODE_RECLAIM_NOSCAN in node_reclaim() when CONFIG_NUMA is n · e431b612
      Wei Yang 提交于
      commit 8b09549c2bfd9f3f8f4cdad74107ef4f4ff9cdd7 upstream.
      
      Commit fa5e084e ("vmscan: do not unconditionally treat zones that
      fail zone_reclaim() as full") changed the return value of
      node_reclaim().  The original return value 0 means NODE_RECLAIM_SOME
      after this commit.
      
      While the return value of node_reclaim() when CONFIG_NUMA is n is not
      changed.  This will leads to call zone_watermark_ok() again.
      
      This patch fixes the return value by adjusting to NODE_RECLAIM_NOSCAN.
      Since node_reclaim() is only called in page_alloc.c, move it to
      mm/internal.h.
      
      Link: http://lkml.kernel.org/r/20181113080436.22078-1-richard.weiyang@gmail.comSigned-off-by: NHui Zhu <teawaterz@linux.alibaba.com>
      Signed-off-by: NWei Yang <richard.weiyang@gmail.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NMatthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      e431b612
  6. 30 10月, 2019 14 次提交
    • H
      zswap: use movable memory if zpool support allocate movable memory · 0a943f50
      Hui Zhu 提交于
      commit d2fcd82bb83aab47c6d63aa8c960cd5edb578065 upstream
      
      This is the third version that was updated according to the comments from
      Sergey Senozhatsky https://lkml.org/lkml/2019/5/29/73 and Shakeel Butt
      https://lkml.org/lkml/2019/6/4/973
      
      zswap compresses swap pages into a dynamically allocated RAM-based memory
      pool.  The memory pool should be zbud, z3fold or zsmalloc.  All of them
      will allocate unmovable pages.  It will increase the number of unmovable
      page blocks that will bad for anti-fragment.
      
      zsmalloc support page migration if request movable page:
              handle = zs_malloc(zram->mem_pool, comp_len,
                      GFP_NOIO | __GFP_HIGHMEM |
                      __GFP_MOVABLE);
      
      And commit "zpool: Add malloc_support_movable to zpool_driver" add
      zpool_malloc_support_movable check malloc_support_movable to make sure if
      a zpool support allocate movable memory.
      
      This commit let zswap allocate block with gfp
      __GFP_HIGHMEM | __GFP_MOVABLE if zpool support allocate movable memory.
      
      Following part is test log in a pc that has 8G memory and 2G swap.
      
      Without this commit:
      ~# echo lz4 > /sys/module/zswap/parameters/compressor
      ~# echo zsmalloc > /sys/module/zswap/parameters/zpool
      ~# echo 1 > /sys/module/zswap/parameters/enabled
      ~# swapon /swapfile
      ~# cd /home/teawater/kernel/vm-scalability/
      /home/teawater/kernel/vm-scalability# export unit_size=$((9 * 1024 * 1024 * 1024))
      /home/teawater/kernel/vm-scalability# ./case-anon-w-seq
      2717908992 bytes / 4826062 usecs = 549973 KB/s
      2717908992 bytes / 4864201 usecs = 545661 KB/s
      2717908992 bytes / 4867015 usecs = 545346 KB/s
      2717908992 bytes / 4915485 usecs = 539968 KB/s
      397853 usecs to free memory
      357820 usecs to free memory
      421333 usecs to free memory
      420454 usecs to free memory
      /home/teawater/kernel/vm-scalability# cat /proc/pagetypeinfo
      Page block order: 9
      Pages per block:  512
      
      Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
      Node    0, zone      DMA, type    Unmovable      1      1      1      0      2      1      1      0      1      0      0
      Node    0, zone      DMA, type      Movable      0      0      0      0      0      0      0      0      0      1      3
      Node    0, zone      DMA, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone      DMA, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone      DMA, type          CMA      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone      DMA, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone    DMA32, type    Unmovable      6      5      8      6      6      5      4      1      1      1      0
      Node    0, zone    DMA32, type      Movable     25     20     20     19     22     15     14     11     11      5    767
      Node    0, zone    DMA32, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone    DMA32, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone    DMA32, type          CMA      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone    DMA32, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone   Normal, type    Unmovable   4753   5588   5159   4613   3712   2520   1448    594    188     11      0
      Node    0, zone   Normal, type      Movable     16      3    457   2648   2143   1435    860    459    223    224    296
      Node    0, zone   Normal, type  Reclaimable      0      0     44     38     11      2      0      0      0      0      0
      Node    0, zone   Normal, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone   Normal, type          CMA      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone   Normal, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
      
      Number of blocks type     Unmovable      Movable  Reclaimable   HighAtomic          CMA      Isolate
      Node 0, zone      DMA            1            7            0            0            0            0
      Node 0, zone    DMA32            4         1652            0            0            0            0
      Node 0, zone   Normal          931         1485           15            0            0            0
      
      With this commit:
      ~# echo lz4 > /sys/module/zswap/parameters/compressor
      ~# echo zsmalloc > /sys/module/zswap/parameters/zpool
      ~# echo 1 > /sys/module/zswap/parameters/enabled
      ~# swapon /swapfile
      ~# cd /home/teawater/kernel/vm-scalability/
      /home/teawater/kernel/vm-scalability# export unit_size=$((9 * 1024 * 1024 * 1024))
      /home/teawater/kernel/vm-scalability# ./case-anon-w-seq
      2717908992 bytes / 4689240 usecs = 566020 KB/s
      2717908992 bytes / 4760605 usecs = 557535 KB/s
      2717908992 bytes / 4803621 usecs = 552543 KB/s
      2717908992 bytes / 5069828 usecs = 523530 KB/s
      431546 usecs to free memory
      383397 usecs to free memory
      456454 usecs to free memory
      224487 usecs to free memory
      /home/teawater/kernel/vm-scalability# cat /proc/pagetypeinfo
      Page block order: 9
      Pages per block:  512
      
      Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
      Node    0, zone      DMA, type    Unmovable      1      1      1      0      2      1      1      0      1      0      0
      Node    0, zone      DMA, type      Movable      0      0      0      0      0      0      0      0      0      1      3
      Node    0, zone      DMA, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone      DMA, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone      DMA, type          CMA      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone      DMA, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone    DMA32, type    Unmovable     10      8     10      9     10      4      3      2      3      0      0
      Node    0, zone    DMA32, type      Movable     18     12     14     16     16     11      9      5      5      6    775
      Node    0, zone    DMA32, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      1
      Node    0, zone    DMA32, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone    DMA32, type          CMA      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone    DMA32, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone   Normal, type    Unmovable   2669   1236    452    118     37     14      4      1      2      3      0
      Node    0, zone   Normal, type      Movable   3850   6086   5274   4327   3510   2494   1520    934    438    220    470
      Node    0, zone   Normal, type  Reclaimable     56     93    155    124     47     31     17      7      3      0      0
      Node    0, zone   Normal, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone   Normal, type          CMA      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone   Normal, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
      
      Number of blocks type     Unmovable      Movable  Reclaimable   HighAtomic          CMA      Isolate
      Node 0, zone      DMA            1            7            0            0            0            0
      Node 0, zone    DMA32            4         1650            2            0            0            0
      Node 0, zone   Normal           79         2326           26            0            0            0
      
      You can see that the number of unmovable page blocks is decreased
      when the kernel has this commit.
      
      Link: http://lkml.kernel.org/r/20190605100630.13293-2-teawaterz@linux.alibaba.comSigned-off-by: NHui Zhu <teawaterz@linux.alibaba.com>
      Reviewed-by: NShakeel Butt <shakeelb@google.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      0a943f50
    • H
      zpool: add malloc_support_movable to zpool_driver · e874b6e5
      Hui Zhu 提交于
      commit c165f25d23ecb2f9f121ced20435415b931219e2 upstream
      
      As a zpool_driver, zsmalloc can allocate movable memory because it support
      migate pages.  But zbud and z3fold cannot allocate movable memory.
      
      Add malloc_support_movable to zpool_driver.  If a zpool_driver support
      allocate movable memory, set it to true.  And add
      zpool_malloc_support_movable check malloc_support_movable to make sure if
      a zpool support allocate movable memory.
      
      Link: http://lkml.kernel.org/r/20190605100630.13293-1-teawaterz@linux.alibaba.comSigned-off-by: NHui Zhu <teawaterz@linux.alibaba.com>
      Reviewed-by: NShakeel Butt <shakeelb@google.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      e874b6e5
    • D
      mm/page-writeback.c: fix range_cyclic writeback vs writepages deadlock · a432d982
      Dave Chinner 提交于
      commit 64081362e8ff4587b4554087f3cfc73d3e0a4cd7 upstream.
      
      We've recently seen a workload on XFS filesystems with a repeatable
      deadlock between background writeback and a multi-process application
      doing concurrent writes and fsyncs to a small range of a file.
      
      range_cyclic
      writeback		Process 1		Process 2
      
      xfs_vm_writepages
        write_cache_pages
          writeback_index = 2
          cycled = 0
          ....
          find page 2 dirty
          lock Page 2
          ->writepage
            page 2 writeback
            page 2 clean
            page 2 added to bio
          no more pages
      			write()
      			locks page 1
      			dirties page 1
      			locks page 2
      			dirties page 1
      			fsync()
      			....
      			xfs_vm_writepages
      			write_cache_pages
      			  start index 0
      			  find page 1 towrite
      			  lock Page 1
      			  ->writepage
      			    page 1 writeback
      			    page 1 clean
      			    page 1 added to bio
      			  find page 2 towrite
      			  lock Page 2
      			  page 2 is writeback
      			  <blocks>
      						write()
      						locks page 1
      						dirties page 1
      						fsync()
      						....
      						xfs_vm_writepages
      						write_cache_pages
      						  start index 0
      
          !done && !cycled
            sets index to 0, restarts lookup
          find page 1 dirty
      						  find page 1 towrite
      						  lock Page 1
      						  page 1 is writeback
      						  <blocks>
      
          lock Page 1
          <blocks>
      
      DEADLOCK because:
      
      	- process 1 needs page 2 writeback to complete to make
      	  enough progress to issue IO pending for page 1
      	- writeback needs page 1 writeback to complete so process 2
      	  can progress and unlock the page it is blocked on, then it
      	  can issue the IO pending for page 2
      	- process 2 can't make progress until process 1 issues IO
      	  for page 1
      
      The underlying cause of the problem here is that range_cyclic writeback is
      processing pages in descending index order as we hold higher index pages
      in a structure controlled from above write_cache_pages().  The
      write_cache_pages() caller needs to be able to submit these pages for IO
      before write_cache_pages restarts writeback at mapping index 0 to avoid
      wcp inverting the page lock/writeback wait order.
      
      generic_writepages() is not susceptible to this bug as it has no private
      context held across write_cache_pages() - filesystems using this
      infrastructure always submit pages in ->writepage immediately and so there
      is no problem with range_cyclic going back to mapping index 0.
      
      However:
      	mpage_writepages() has a private bio context,
      	exofs_writepages() has page_collect
      	fuse_writepages() has fuse_fill_wb_data
      	nfs_writepages() has nfs_pageio_descriptor
      	xfs_vm_writepages() has xfs_writepage_ctx
      
      All of these ->writepages implementations can hold pages under writeback
      in their private structures until write_cache_pages() returns, and hence
      they are all susceptible to this deadlock.
      
      Also worth noting is that ext4 has it's own bastardised version of
      write_cache_pages() and so it /may/ have an equivalent deadlock.  I looked
      at the code long enough to understand that it has a similar retry loop for
      range_cyclic writeback reaching the end of the file and then promptly ran
      away before my eyes bled too much.  I'll leave it for the ext4 developers
      to determine if their code is actually has this deadlock and how to fix it
      if it has.
      
      There's a few ways I can see avoid this deadlock.  There's probably more,
      but these are the first I've though of:
      
      1. get rid of range_cyclic altogether
      
      2. range_cyclic always stops at EOF, and we start again from
      writeback index 0 on the next call into write_cache_pages()
      
      2a. wcp also returns EAGAIN to ->writepages implementations to
      indicate range cyclic has hit EOF. writepages implementations can
      then flush the current context and call wpc again to continue. i.e.
      lift the retry into the ->writepages implementation
      
      3. range_cyclic uses trylock_page() rather than lock_page(), and it
      skips pages it can't lock without blocking. It will already do this
      for pages under writeback, so this seems like a no-brainer
      
      3a. all non-WB_SYNC_ALL writeback uses trylock_page() to avoid
      blocking as per pages under writeback.
      
      I don't think #1 is an option - range_cyclic prevents frequently
      dirtied lower file offset from starving background writeback of
      rarely touched higher file offsets.
      
      performance as going back to the start of the file implies an
      immediate seek. We'll have exactly the same number of seeks if we
      switch writeback to another inode, and then come back to this one
      later and restart from index 0.
      
      retry loop up into the wcp caller means we can issue IO on the
      pending pages before calling wcp again, and so avoid locking or
      waiting on pages in the wrong order. I'm not convinced we need to do
      this given that we get the same thing from #2 on the next writeback
      call from the writeback infrastructure.
      
      inversion problem, just prevents it from becoming a deadlock
      situation. I'd prefer we fix the inversion, not sweep it under the
      carpet like this.
      
      band-aid fix of #3.
      
      So it seems that the simplest way to fix this issue is to implement
      solution #2
      
      Link: http://lkml.kernel.org/r/20181005054526.21507-1-david@fromorbit.comSigned-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NJan Kara <jack@suse.de>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      a432d982
    • A
      userfaultfd: allow get_mempolicy(MPOL_F_NODE|MPOL_F_ADDR) to trigger userfaults · 6fc1da0b
      Andrea Arcangeli 提交于
      commit 3b9aadf7278d16d7bed4d5d808501065f70898d8 upstream.
      
      get_mempolicy(MPOL_F_NODE|MPOL_F_ADDR) called a get_user_pages that would
      not be waiting for userfaults before failing and it would hit on a SIGBUS
      instead.  Using get_user_pages_locked/unlocked instead will allow
      get_mempolicy to allow userfaults to resolve the fault and fill the hole,
      before grabbing the node id of the page.
      
      If the user calls get_mempolicy() with MPOL_F_ADDR | MPOL_F_NODE for an
      address inside an area managed by uffd and there is no page at that
      address, the page allocation from within get_mempolicy() will fail
      because get_user_pages() does not allow for page fault retry required
      for uffd; the user will get SIGBUS.
      
      With this patch, the page fault will be resolved by the uffd and the
      get_mempolicy() will continue normally.
      
      Background:
      
      Via code review, previously the syscall would have returned -EFAULT
      (vm_fault_to_errno), now it will block and wait for an userfault (if
      it's waken before the fault is resolved it'll still -EFAULT).
      
      This way get_mempolicy will give a chance to an "unaware" app to be
      compliant with userfaults.
      
      The reason this visible change is that becoming "userfault compliant"
      cannot regress anything: all other syscalls including read(2)/write(2)
      had to become "userfault compliant" long time ago (that's one of the
      things userfaultfd can do that PROT_NONE and trapping segfaults can't).
      
      So this is just one more syscall that become "userfault compliant" like
      all other major ones already were.
      
      This has been happening on virtio-bridge dpdk process which just called
      get_mempolicy on the guest space post live migration, but before the
      memory had a chance to be migrated to destination.
      
      I didn't run an strace to be able to show the -EFAULT going away, but
      I've the confirmation of the below debug aid information (only visible
      with CONFIG_DEBUG_VM=y) going away with the patch:
      
          [20116.371461] FAULT_FLAG_ALLOW_RETRY missing 0
          [20116.371464] CPU: 1 PID: 13381 Comm: vhost-events Not tainted 4.17.12-200.fc28.x86_64 #1
          [20116.371465] Hardware name: LENOVO 20FAS2BN0A/20FAS2BN0A, BIOS N1CET54W (1.22 ) 02/10/2017
          [20116.371466] Call Trace:
          [20116.371473]  dump_stack+0x5c/0x80
          [20116.371476]  handle_userfault.cold.37+0x1b/0x22
          [20116.371479]  ? remove_wait_queue+0x20/0x60
          [20116.371481]  ? poll_freewait+0x45/0xa0
          [20116.371483]  ? do_sys_poll+0x31c/0x520
          [20116.371485]  ? radix_tree_lookup_slot+0x1e/0x50
          [20116.371488]  shmem_getpage_gfp+0xce7/0xe50
          [20116.371491]  ? page_add_file_rmap+0x1a/0x2c0
          [20116.371493]  shmem_fault+0x78/0x1e0
          [20116.371495]  ? filemap_map_pages+0x3a1/0x450
          [20116.371498]  __do_fault+0x1f/0xc0
          [20116.371500]  __handle_mm_fault+0xe2e/0x12f0
          [20116.371502]  handle_mm_fault+0xda/0x200
          [20116.371504]  __get_user_pages+0x238/0x790
          [20116.371506]  get_user_pages+0x3e/0x50
          [20116.371510]  kernel_get_mempolicy+0x40b/0x700
          [20116.371512]  ? vfs_write+0x170/0x1a0
          [20116.371515]  __x64_sys_get_mempolicy+0x21/0x30
          [20116.371517]  do_syscall_64+0x5b/0x160
          [20116.371520]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      The above harmless debug message (not a kernel crash, just a
      dump_stack()) is shown with CONFIG_DEBUG_VM=y to more quickly identify
      and improve kernel spots that may have to become "userfaultfd
      compliant" like this one (without having to run an strace and search
      for syscall misbehavior).  Spots like the above are more closer to a
      kernel bug for the non-cooperative usages that Mike focuses on, than
      for for dpdk qemu-cooperative usages that reproduced it, but it's still
      nicer to get this fixed for dpdk too.
      
      The part of the patch that caused me to think is only the
      implementation issue of mpol_get, but it looks like it should work safe
      no matter the kind of mempolicy structure that is (the default static
      policy also starts at 1 so it'll go to 2 and back to 1 without crashing
      everything at 0).
      
      [rppt@linux.vnet.ibm.com: changelog addition]
        http://lkml.kernel.org/r/20180904073718.GA26916@rapoport-lnx
      Link: http://lkml.kernel.org/r/20180831214848.23676-1-aarcange@redhat.comSigned-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Reported-by: NMaxime Coquelin <maxime.coquelin@redhat.com>
      Tested-by: NDr. David Alan Gilbert <dgilbert@redhat.com>
      Reviewed-by: NMike Rapoport <rppt@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      6fc1da0b
    • J
      psi: pressure stall information for CPU, memory, and IO · 6175397b
      Johannes Weiner 提交于
      commit eb414681d5a07d28d2ff90dc05f69ec6b232ebd2 upstream.
      
      When systems are overcommitted and resources become contended, it's hard
      to tell exactly the impact this has on workload productivity, or how close
      the system is to lockups and OOM kills.  In particular, when machines work
      multiple jobs concurrently, the impact of overcommit in terms of latency
      and throughput on the individual job can be enormous.
      
      In order to maximize hardware utilization without sacrificing individual
      job health or risk complete machine lockups, this patch implements a way
      to quantify resource pressure in the system.
      
      A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that
      expose the percentage of time the system is stalled on CPU, memory, or IO,
      respectively.  Stall states are aggregate versions of the per-task delay
      accounting delays:
      
             cpu: some tasks are runnable but not executing on a CPU
             memory: tasks are reclaiming, or waiting for swapin or thrashing cache
             io: tasks are waiting for io completions
      
      These percentages of walltime can be thought of as pressure percentages,
      and they give a general sense of system health and productivity loss
      incurred by resource overcommit.  They can also indicate when the system
      is approaching lockup scenarios and OOMs.
      
      To do this, psi keeps track of the task states associated with each CPU
      and samples the time they spend in stall states.  Every 2 seconds, the
      samples are averaged across CPUs - weighted by the CPUs' non-idle time to
      eliminate artifacts from unused CPUs - and translated into percentages of
      walltime.  A running average of those percentages is maintained over 10s,
      1m, and 5m periods (similar to the loadaverage).
      
      [hannes@cmpxchg.org: doc fixlet, per Randy]
        Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org
      [hannes@cmpxchg.org: code optimization]
        Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org
      [hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter]
        Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org
      [hannes@cmpxchg.org: fix build]
        Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org
      Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: NDaniel Drake <drake@endlessm.com>
      Tested-by: NSuren Baghdasaryan <surenb@google.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <jweiner@fb.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Enderborg <peter.enderborg@sony.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      [Joseph: fix apply conflicts in task_struct]
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      6175397b
    • J
      delayacct: track delays from thrashing cache pages · 4b7c32ef
      Johannes Weiner 提交于
      commit b1d29ba82cf2bc784f4c963ddd6a2cf29e229b33 upstream.
      
      Delay accounting already measures the time a task spends in direct reclaim
      and waiting for swapin, but in low memory situations tasks spend can spend
      a significant amount of their time waiting on thrashing page cache.  This
      isn't tracked right now.
      
      To know the full impact of memory contention on an individual task,
      measure the delay when waiting for a recently evicted active cache page to
      read back into memory.
      
      Also update tools/accounting/getdelays.c:
      
           [hannes@computer accounting]$ sudo ./getdelays -d -p 1
           print delayacct stats ON
           PID     1
      
           CPU             count     real total  virtual total    delay total  delay average
                           50318      745000000      847346785      400533713          0.008ms
           IO              count    delay total  delay average
                             435      122601218              0ms
           SWAP            count    delay total  delay average
                               0              0              0ms
           RECLAIM         count    delay total  delay average
                               0              0              0ms
           THRASHING       count    delay total  delay average
                              19       12621439              0ms
      
      Link: http://lkml.kernel.org/r/20180828172258.3185-4-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: NDaniel Drake <drake@endlessm.com>
      Tested-by: NSuren Baghdasaryan <surenb@google.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <jweiner@fb.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Enderborg <peter.enderborg@sony.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      4b7c32ef
    • J
      mm: workingset: tell cache transitions from workingset thrashing · 41c9ca31
      Johannes Weiner 提交于
      commit 1899ad18c6072d689896badafb81267b0a1092a4 upstream.
      
      Refaults happen during transitions between workingsets as well as in-place
      thrashing.  Knowing the difference between the two has a range of
      applications, including measuring the impact of memory shortage on the
      system performance, as well as the ability to smarter balance pressure
      between the filesystem cache and the swap-backed workingset.
      
      During workingset transitions, inactive cache refaults and pushes out
      established active cache.  When that active cache isn't stale, however,
      and also ends up refaulting, that's bonafide thrashing.
      
      Introduce a new page flag that tells on eviction whether the page has been
      active or not in its lifetime.  This bit is then stored in the shadow
      entry, to classify refaults as transitioning or thrashing.
      
      How many page->flags does this leave us with on 32-bit?
      
      	20 bits are always page flags
      
      	21 if you have an MMU
      
      	23 with the zone bits for DMA, Normal, HighMem, Movable
      
      	29 with the sparsemem section bits
      
      	30 if PAE is enabled
      
      	31 with this patch.
      
      So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA nodes.  If
      that's not enough, the system can switch to discontigmem and re-gain the 6
      or 7 sparsemem section bits.
      
      Link: http://lkml.kernel.org/r/20180828172258.3185-3-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: NDaniel Drake <drake@endlessm.com>
      Tested-by: NSuren Baghdasaryan <surenb@google.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <jweiner@fb.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Enderborg <peter.enderborg@sony.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      41c9ca31
    • J
      mm: workingset: don't drop refault information prematurely · d9799717
      Johannes Weiner 提交于
      commit 95f9ab2d596e8cbb388315e78c82b9a131bf2928 upstream.
      
      Patch series "psi: pressure stall information for CPU, memory, and IO", v4.
      
      		Overview
      
      PSI reports the overall wallclock time in which the tasks in a system (or
      cgroup) wait for (contended) hardware resources.
      
      This helps users understand the resource pressure their workloads are
      under, which allows them to rootcause and fix throughput and latency
      problems caused by overcommitting, underprovisioning, suboptimal job
      placement in a grid; as well as anticipate major disruptions like OOM.
      
      		Real-world applications
      
      We're using the data collected by PSI (and its previous incarnation,
      memdelay) quite extensively at Facebook, and with several success stories.
      
      One usecase is avoiding OOM hangs/livelocks.  The reason these happen is
      because the OOM killer is triggered by reclaim not being able to free
      pages, but with fast flash devices there is *always* some clean and
      uptodate cache to reclaim; the OOM killer never kicks in, even as tasks
      spend 90% of the time thrashing the cache pages of their own executables.
      There is no situation where this ever makes sense in practice.  We wrote a
      <100 line POC python script to monitor memory pressure and kill stuff way
      before such pathological thrashing leads to full system losses that would
      require forcible hard resets.
      
      We've since extended and deployed this code into other places to guarantee
      latency and throughput SLAs, since they're usually violated way before the
      kernel OOM killer would ever kick in.
      
      It is available here: https://github.com/facebookincubator/oomd
      
      Eventually we probably want to trigger the in-kernel OOM killer based on
      extreme sustained pressure as well, so that Linux can avoid memory
      livelocks - which technically aren't deadlocks, but to the user
      indistinguishable from them - out of the box.  We'd continue using OOMD as
      the first line of defense to ensure workload health and implement complex
      kill policies that are beyond the scope of the kernel.
      
      We also use PSI memory pressure for loadshedding.  Our batch job
      infrastructure used to use heuristics based on various VM stats to
      anticipate OOM situations, with lackluster success.  We switched it to PSI
      and managed to anticipate and avoid OOM kills and lockups fairly reliably.
      The reduction of OOM outages in the worker pool raised the pool's
      aggregate productivity, and we were able to switch that service to smaller
      machines.
      
      Lastly, we use cgroups to isolate a machine's main workload from
      maintenance crap like package upgrades, logging, configuration, as well as
      to prevent multiple workloads on a machine from stepping on each others'
      toes.  We were not able to configure this properly without the pressure
      metrics; we would see latency or bandwidth drops, but it would often be
      hard to impossible to rootcause it post-mortem.
      
      We now log and graph pressure for the containers in our fleet and can
      trivially link latency spikes and throughput drops to shortages of
      specific resources after the fact, and fix the job config/scheduling.
      
      PSI has also received testing, feedback, and feature requests from Android
      and EndlessOS for the purpose of low-latency OOM killing, to intervene in
      pressure situations before the UI starts hanging.
      
      		How do you use this feature?
      
      A kernel with CONFIG_PSI=y will create a /proc/pressure directory with 3
      files: cpu, memory, and io.  If using cgroup2, cgroups will also have
      cpu.pressure, memory.pressure and io.pressure files, which simply
      aggregate task stalls at the cgroup level instead of system-wide.
      
      The cpu file contains one line:
      
      	some avg10=2.04 avg60=0.75 avg300=0.40 total=157656722
      
      The averages give the percentage of walltime in which one or more tasks
      are delayed on the runqueue while another task has the CPU.  They're
      recent averages over 10s, 1m, 5m windows, so you can tell short term
      trends from long term ones, similarly to the load average.
      
      The total= value gives the absolute stall time in microseconds.  This
      allows detecting latency spikes that might be too short to sway the
      running averages.  It also allows custom time averaging in case the
      10s/1m/5m windows aren't adequate for the usecase (or are too coarse with
      future hardware).
      
      What to make of this "some" metric?  If CPU utilization is at 100% and CPU
      pressure is 0, it means the system is perfectly utilized, with one
      runnable thread per CPU and nobody waiting.  At two or more runnable tasks
      per CPU, the system is 100% overcommitted and the pressure average will
      indicate as much.  From a utilization perspective this is a great state of
      course: no CPU cycles are being wasted, even when 50% of the threads were
      to go idle (as most workloads do vary).  From the perspective of the
      individual job it's not great, however, and they would do better with more
      resources.  Depending on what your priority and options are, raised "some"
      numbers may or may not require action.
      
      The memory file contains two lines:
      
      some avg10=70.24 avg60=68.52 avg300=69.91 total=3559632828
      full avg10=57.59 avg60=58.06 avg300=60.38 total=3300487258
      
      The some line is the same as for cpu, the time in which at least one task
      is stalled on the resource.  In the case of memory, this includes waiting
      on swap-in, page cache refaults and page reclaim.
      
      The full line, however, indicates time in which *nobody* is using the CPU
      productively due to pressure: all non-idle tasks are waiting for memory in
      one form or another.  Significant time spent in there is a good trigger
      for killing things, moving jobs to other machines, or dropping incoming
      requests, since neither the jobs nor the machine overall are making too
      much headway.
      
      The io file is similar to memory.  Because the block layer doesn't have a
      concept of hardware contention right now (how much longer is my IO request
      taking due to other tasks?), it reports CPU potential lost on all IO
      delays, not just the potential lost due to competition.
      
      		FAQ
      
      Q: How is PSI's CPU component different from the load average?
      
      A: There are several quirks in the load average that make it hard to
         impossible to tell how overcommitted the CPU really is.
      
         1. The load average is reported as a raw number of active tasks.
            You need to know how many CPUs there are in the system, how many
            CPUs the workload is allowed to use, then think about what the
            proportion between load and the number of CPUs mean for the
            tasks trying to run.
      
            PSI reports the percentage of wallclock time in which tasks are
            waiting for a CPU to run on. It doesn't matter how many CPUs are
            present or usable. The number always tells the quality of life
            of tasks in the system or in a particular cgroup.
      
         2. The shortest averaging window is 1m, which is extremely coarse,
            and it's sampled in 5s intervals. A *lot* can happen on a CPU in
            5 seconds. This *may* be able to identify persistent long-term
            trends and very clear and obvious overloads, but it's unusable
            for latency spikes and more subtle overutilization.
      
            PSI's shortest window is 10s. It also exports the cumulative
            stall times (in microseconds) of synchronously recorded events.
      
         3. On Linux, the load average for historical reasons includes all
            TASK_UNINTERRUPTIBLE tasks. This gives a broader sense of how
            busy the system is, but on the flipside it doesn't distinguish
            whether tasks are likely to contend over the CPU or IO - which
            obviously requires very different interventions from a sys admin
            or a job scheduler.
      
            PSI reports independent metrics for CPU and IO. You can tell
            which resource is making the tasks wait, but in conjunction
            still see how overloaded the system is overall.
      
      Q: What's the cost / performance impact of this feature?
      
      A: PSI's primary cost is in the scheduler, in particular task wakeups
         and sleeps.
      
         I benchmarked this code using Facebook's two most scheduling
         sensitive workloads: memcache and webserver. They handle a ton of
         small requests - lots of wakeups and sleeps with little actual work
         in between - so they tend to be canaries for scheduler regressions.
      
         In the tests, the boxes were handling live traffic over the course
         of several hours. Half the machines, the control, ran with
         CONFIG_PSI=n.
      
         For memcache I used eight machines total. They're 2-socket, 14
         core, 56 thread boxes. The test runs for half the test period,
         flips the test and control kernels on the hardware to rule out HW
         factors, DC location etc., then runs the other half of the test.
      
         For the webservers, I used 32 machines total. They're single
         socket, 16 core, 32 thread machines.
      
         During the memcache test, CPU load was nopsi=78.05% psi=78.98% in
         the first half and nopsi=77.52% psi=78.25%, so PSI added between
         0.7 and 0.9 percentage points to the CPU load, a difference of
         about 1%.
      
         UPDATE: I re-ran this test with the v3 version of this patch set
         and the CPU utilization was equivalent between test and control.
      
         UPDATE: v4 is on par with v3.
      
         As far as end-to-end request latency from the client perspective
         goes, we don't sample those finely enough to capture the requests
         going to those particular machines during the test, but we know the
         p50 turnaround time in this workload is 54us, and perf bench sched
         pipe on those machines show nopsi=5.232666 us/op and psi=5.587347
         us/op, so this doesn't add much here either.
      
         The profile for the pipe benchmark shows:
      
              0.87%  sched-pipe  [kernel.vmlinux]    [k] psi_group_change
              0.83%  perf.real   [kernel.vmlinux]    [k] psi_group_change
              0.82%  perf.real   [kernel.vmlinux]    [k] psi_task_change
              0.58%  sched-pipe  [kernel.vmlinux]    [k] psi_task_change
      
         The webserver load is running inside 4 nested cgroup levels. The
         CPU load with both nopsi and psi kernels was indistinguishable at
         81%.
      
         For comparison, we had to disable the cgroup cpu controller on the
         webservers because it added 4 percentage points to the CPU% during
         this same exact test.
      
         Versions of this accounting code now run on 80% of our fleet. None
         of our workloads have reported regressions during the rollout.
      
      Daniel Drake said:
      
      : I just retested the latest version at
      : http://git.cmpxchg.org/cgit.cgi/linux-psi.git (Linux 4.18) and the results
      : are great.
      :
      : Test setup:
      : Endless OS
      : GeminiLake N4200 low end laptop
      : 2GB RAM
      : swap (and zram swap) disabled
      :
      : Baseline test: open a handful of large-ish apps and several website
      : tabs in Google Chrome.
      :
      : Results: after a couple of minutes, system is excessively thrashing, mouse
      : cursor can barely be moved, UI is not responding to mouse clicks, so it's
      : impractical to recover from this situation as an ordinary user
      :
      : Add my simple killer:
      : https://gist.github.com/dsd/a8988bf0b81a6163475988120fe8d9cd
      :
      : Results: when the thrashing causes the UI to become sluggish, the killer
      : steps in and kills something (usually a chrome tab), and the system
      : remains usable.  I repeatedly opened more apps and more websites over a 15
      : minute period but I wasn't able to get the system to a point of UI
      : unresponsiveness.
      
      Suren said:
      
      : Backported to 4.9 and retested on ARMv8 8 code system running Android.
      : Signals behave as expected reacting to memory pressure, no jumps in
      : "total" counters that would indicate an overflow/underflow issues.  Nicely
      : done!
      
      This patch (of 9):
      
      If we keep just enough refault information to match the *current* page
      cache during reclaim time, we could lose a lot of events when there is
      only a temporary spike in non-cache memory consumption that pushes out all
      the cache.  Once cache comes back, we won't see those refaults.  They
      might not be actionable for LRU aging, but we want to know about them for
      measuring memory pressure.
      
      [hannes@cmpxchg.org: switch to NUMA-aware lru and slab counters]
        Link: http://lkml.kernel.org/r/20181009184732.762-2-hannes@cmpxchg.org
      Link: http://lkml.kernel.org/r/20180828172258.3185-2-hannes@cmpxchg.orgSigned-off-by: NJohannes Weiner <jweiner@fb.com>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NRik van Riel <riel@surriel.com>
      Tested-by: NDaniel Drake <drake@endlessm.com>
      Tested-by: NSuren Baghdasaryan <surenb@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Peter Enderborg <peter.enderborg@sony.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      d9799717
    • K
      writeback: memcg_blkcg_tree_lock can be static · 0019fa8c
      kbuild test robot 提交于
      Fixes: 60448d43 ("writeback: add memcg_blkcg_link tree")
      Signed-off-by: Nkbuild test robot <lkp@intel.com>
      Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      0019fa8c
    • J
      fs/writeback: wrap cgroup writeback v1 logic · 38485c5c
      Joseph Qi 提交于
      Wrap cgroup writeback v1 logic to prevent build errors without
      CONFIG_CGROUPS or CONFIG_CGROUP_WRITEBACK.
      Reported-by: Nkbuild test robot <lkp@intel.com>
      Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      38485c5c
    • J
      writeback: introduce cgwb_v1 boot param · f91c270d
      Jiufei Xue 提交于
      So far writeback control is supported for cgroup v1 interface. However
      it also has some restrictions, so introduce a new kernel boot parameter
      to control the behavior which is disabled by default. Users can enable
      the writeback control for cgroup v1 with the command line "cgwb_v1".
      Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      f91c270d
    • J
      fs/writeback: fix double free of blkcg_css · f935fb62
      Jiufei Xue 提交于
      We have gotten a WARNNING when releasing blkcg_css:
      
      [332489.681635] WARNING: CPU: 55 PID: 14859 at lib/list_debug.c:56 __list_del_entry+0x81/0xc0
      [332489.682191] list_del corruption, ffff883e6b94d450->prev is LIST_POISON2 (dead000000000200)
      ......
      [332489.683895] CPU: 55 PID: 14859 Comm: kworker/55:2 Tainted: G
      [332489.684477] Hardware name: Inspur SA5248M4/X10DRT-PS, BIOS 4.05A
      10/11/2016
      [332489.685061] Workqueue: cgroup_destroy css_release_work_fn
      [332489.685654]  ffffc9001d92bd28 ffffffff81380042 ffffc9001d92bd78
      0000000000000000
      [332489.686269]  ffffc9001d92bd68 ffffffff81088f8b 0000003800000000
      ffff883e6b94d4a0
      [332489.686867]  ffff883e6b94d400 ffffffff81ce8fe0 ffff88375b24f400
      ffff883e6b94d4a0
      [332489.687479] Call Trace:
      [332489.688078]  [<ffffffff81380042>] dump_stack+0x63/0x81
      [332489.688681]  [<ffffffff81088f8b>] __warn+0xcb/0xf0
      [332489.689276]  [<ffffffff8108900f>] warn_slowpath_fmt+0x5f/0x80
      [332489.689877]  [<ffffffff8139e7c1>] __list_del_entry+0x81/0xc0
      [332489.690481]  [<ffffffff81125552>] css_release_work_fn+0x42/0x140
      [332489.691090]  [<ffffffff810a2db9>] process_one_work+0x189/0x420
      [332489.691693]  [<ffffffff810a309e>] worker_thread+0x4e/0x4b0
      [332489.692293]  [<ffffffff810a3050>] ? process_one_work+0x420/0x420
      [332489.692905]  [<ffffffff810a9616>] kthread+0xe6/0x100
      [332489.693504]  [<ffffffff810a9530>] ? kthread_park+0x60/0x60
      [332489.694099]  [<ffffffff817184e1>] ret_from_fork+0x41/0x50
      [332489.694722] ---[ end trace 0cf869c4a5cfba87 ]---
      ......
      
      This is caused by calling css_get after the css is killed by another
      thread described below:
      
                 Thread 1                       Thread 2
      cgroup_rmdir
        -> kill_css
          -> percpu_ref_kill_and_confirm
            -> css_killed_ref_fn
      
      css_killed_work_fn
        -> css_put
          -> css_release
                                              wb_get_create
      					  -> find_blkcg_css
      					    -> css_get
      					  -> css_put
      					    -> css_release (double free)
          -> css_release_workfn
            -> css_free_work_fn
             -> blkcg_css_free
      
      When doublefree happened, it may free the memory still used by
      other threads and cause a kernel panic.
      
      Fix this by using css_tryget_online in find_blkcg_css while will return
      false if the css is killed.
      Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      f935fb62
    • J
      37231c89
    • J
      writeback: add memcg_blkcg_link tree · 86c80145
      Jiufei Xue 提交于
      Here we add a global radix tree to link memcg and blkcg that the user
      attach the tasks to when using cgroup v1, which is used for writeback
      cgroup.
      Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      86c80145
  7. 29 10月, 2019 6 次提交
    • J
      mm/memory-failure: poison read receives SIGKILL instead of SIGBUS if mmaped more than once · 30cff8ab
      Jane Chu 提交于
      commit 3d7fed4ad8ccb691d217efbb0f934e6a4df5ef91 upstream.
      
      Mmap /dev/dax more than once, then read the poison location using
      address from one of the mappings.  The other mappings due to not having
      the page mapped in will cause SIGKILLs delivered to the process.
      SIGKILL succeeds over SIGBUS, so user process loses the opportunity to
      handle the UE.
      
      Although one may add MAP_POPULATE to mmap(2) to work around the issue,
      MAP_POPULATE makes mapping 128GB of pmem several magnitudes slower, so
      isn't always an option.
      
      Details -
      
        ndctl inject-error --block=10 --count=1 namespace6.0
      
        ./read_poison -x dax6.0 -o 5120 -m 2
        mmaped address 0x7f5bb6600000
        mmaped address 0x7f3cf3600000
        doing local read at address 0x7f3cf3601400
        Killed
      
      Console messages in instrumented kernel -
      
        mce: Uncorrected hardware memory error in user-access at edbe201400
        Memory failure: tk->addr = 7f5bb6601000
        Memory failure: address edbe201: call dev_pagemap_mapping_shift
        dev_pagemap_mapping_shift: page edbe201: no PUD
        Memory failure: tk->size_shift == 0
        Memory failure: Unable to find user space address edbe201 in read_poison
        Memory failure: tk->addr = 7f3cf3601000
        Memory failure: address edbe201: call dev_pagemap_mapping_shift
        Memory failure: tk->size_shift = 21
        Memory failure: 0xedbe201: forcibly killing read_poison:22434 because of failure to unmap corrupted page
          => to deliver SIGKILL
        Memory failure: 0xedbe201: Killing read_poison:22434 due to hardware memory corruption
          => to deliver SIGBUS
      
      Link: http://lkml.kernel.org/r/1565112345-28754-3-git-send-email-jane.chu@oracle.comSigned-off-by: NJane Chu <jane.chu@oracle.com>
      Suggested-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: NDan Williams <dan.j.williams@intel.com>
      Acked-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      30cff8ab
    • D
      hugetlbfs: don't access uninitialized memmaps in pfn_range_valid_gigantic() · 91eec769
      David Hildenbrand 提交于
      commit f231fe4235e22e18d847e05cbe705deaca56580a upstream.
      
      Uninitialized memmaps contain garbage and in the worst case trigger
      kernel BUGs, especially with CONFIG_PAGE_POISONING.  They should not get
      touched.
      
      Let's make sure that we only consider online memory (managed by the
      buddy) that has initialized memmaps.  ZONE_DEVICE is not applicable.
      
      page_zone() will call page_to_nid(), which will trigger
      VM_BUG_ON_PGFLAGS(PagePoisoned(page), page) with CONFIG_PAGE_POISONING
      and CONFIG_DEBUG_VM_PGFLAGS when called on uninitialized memmaps.  This
      can be the case when an offline memory block (e.g., never onlined) is
      spanned by a zone.
      
      Note: As explained by Michal in [1], alloc_contig_range() will verify
      the range.  So it boils down to the wrong access in this function.
      
      [1] http://lkml.kernel.org/r/20180423000943.GO17484@dhcp22.suse.cz
      
      Link: http://lkml.kernel.org/r/20191015120717.4858-1-david@redhat.com
      Fixes: f1dd2cd1 ("mm, memory_hotplug: do not associate hotadded memory to zones until online")	[visible after d0dc12e8]
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Reported-by: NMichal Hocko <mhocko@kernel.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: <stable@vger.kernel.org>	[4.13+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      91eec769
    • Q
      mm/page_owner: don't access uninitialized memmaps when reading /proc/pagetypeinfo · f712e306
      Qian Cai 提交于
      commit a26ee565b6cd8dc2bf15ff6aa70bbb28f928b773 upstream.
      
      Uninitialized memmaps contain garbage and in the worst case trigger
      kernel BUGs, especially with CONFIG_PAGE_POISONING.  They should not get
      touched.
      
      For example, when not onlining a memory block that is spanned by a zone
      and reading /proc/pagetypeinfo with CONFIG_DEBUG_VM_PGFLAGS and
      CONFIG_PAGE_POISONING, we can trigger a kernel BUG:
      
        :/# echo 1 > /sys/devices/system/memory/memory40/online
        :/# echo 1 > /sys/devices/system/memory/memory42/online
        :/# cat /proc/pagetypeinfo > test.file
         page:fffff2c585200000 is uninitialized and poisoned
         raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
         raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
         page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
         There is not page extension available.
         ------------[ cut here ]------------
         kernel BUG at include/linux/mm.h:1107!
         invalid opcode: 0000 [#1] SMP NOPTI
      
      Please note that this change does not affect ZONE_DEVICE, because
      pagetypeinfo_showmixedcount_print() is called from
      mm/vmstat.c:pagetypeinfo_showmixedcount() only for populated zones, and
      ZONE_DEVICE is never populated (zone->present_pages always 0).
      
      [david@redhat.com: move check to outer loop, add comment, rephrase description]
      Link: http://lkml.kernel.org/r/20191011140638.8160-1-david@redhat.com
      Fixes: f1dd2cd1 ("mm, memory_hotplug: do not associate hotadded memory to zones until online") # visible after d0dc12e8Signed-off-by: NQian Cai <cai@lca.pw>
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
      Cc: Miles Chen <miles.chen@mediatek.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: <stable@vger.kernel.org>	[4.13+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f712e306
    • Q
      mm/slub: fix a deadlock in show_slab_objects() · bb6932c5
      Qian Cai 提交于
      commit e4f8e513c3d353c134ad4eef9fd0bba12406c7c8 upstream.
      
      A long time ago we fixed a similar deadlock in show_slab_objects() [1].
      However, it is apparently due to the commits like 01fb58bc ("slab:
      remove synchronous synchronize_sched() from memcg cache deactivation
      path") and 03afc0e2 ("slab: get_online_mems for
      kmem_cache_{create,destroy,shrink}"), this kind of deadlock is back by
      just reading files in /sys/kernel/slab which will generate a lockdep
      splat below.
      
      Since the "mem_hotplug_lock" here is only to obtain a stable online node
      mask while racing with NUMA node hotplug, in the worst case, the results
      may me miscalculated while doing NUMA node hotplug, but they shall be
      corrected by later reads of the same files.
      
        WARNING: possible circular locking dependency detected
        ------------------------------------------------------
        cat/5224 is trying to acquire lock:
        ffff900012ac3120 (mem_hotplug_lock.rw_sem){++++}, at:
        show_slab_objects+0x94/0x3a8
      
        but task is already holding lock:
        b8ff009693eee398 (kn->count#45){++++}, at: kernfs_seq_start+0x44/0xf0
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #2 (kn->count#45){++++}:
               lock_acquire+0x31c/0x360
               __kernfs_remove+0x290/0x490
               kernfs_remove+0x30/0x44
               sysfs_remove_dir+0x70/0x88
               kobject_del+0x50/0xb0
               sysfs_slab_unlink+0x2c/0x38
               shutdown_cache+0xa0/0xf0
               kmemcg_cache_shutdown_fn+0x1c/0x34
               kmemcg_workfn+0x44/0x64
               process_one_work+0x4f4/0x950
               worker_thread+0x390/0x4bc
               kthread+0x1cc/0x1e8
               ret_from_fork+0x10/0x18
      
        -> #1 (slab_mutex){+.+.}:
               lock_acquire+0x31c/0x360
               __mutex_lock_common+0x16c/0xf78
               mutex_lock_nested+0x40/0x50
               memcg_create_kmem_cache+0x38/0x16c
               memcg_kmem_cache_create_func+0x3c/0x70
               process_one_work+0x4f4/0x950
               worker_thread+0x390/0x4bc
               kthread+0x1cc/0x1e8
               ret_from_fork+0x10/0x18
      
        -> #0 (mem_hotplug_lock.rw_sem){++++}:
               validate_chain+0xd10/0x2bcc
               __lock_acquire+0x7f4/0xb8c
               lock_acquire+0x31c/0x360
               get_online_mems+0x54/0x150
               show_slab_objects+0x94/0x3a8
               total_objects_show+0x28/0x34
               slab_attr_show+0x38/0x54
               sysfs_kf_seq_show+0x198/0x2d4
               kernfs_seq_show+0xa4/0xcc
               seq_read+0x30c/0x8a8
               kernfs_fop_read+0xa8/0x314
               __vfs_read+0x88/0x20c
               vfs_read+0xd8/0x10c
               ksys_read+0xb0/0x120
               __arm64_sys_read+0x54/0x88
               el0_svc_handler+0x170/0x240
               el0_svc+0x8/0xc
      
        other info that might help us debug this:
      
        Chain exists of:
          mem_hotplug_lock.rw_sem --> slab_mutex --> kn->count#45
      
         Possible unsafe locking scenario:
      
               CPU0                    CPU1
               ----                    ----
          lock(kn->count#45);
                                       lock(slab_mutex);
                                       lock(kn->count#45);
          lock(mem_hotplug_lock.rw_sem);
      
         *** DEADLOCK ***
      
        3 locks held by cat/5224:
         #0: 9eff00095b14b2a0 (&p->lock){+.+.}, at: seq_read+0x4c/0x8a8
         #1: 0eff008997041480 (&of->mutex){+.+.}, at: kernfs_seq_start+0x34/0xf0
         #2: b8ff009693eee398 (kn->count#45){++++}, at:
        kernfs_seq_start+0x44/0xf0
      
        stack backtrace:
        Call trace:
         dump_backtrace+0x0/0x248
         show_stack+0x20/0x2c
         dump_stack+0xd0/0x140
         print_circular_bug+0x368/0x380
         check_noncircular+0x248/0x250
         validate_chain+0xd10/0x2bcc
         __lock_acquire+0x7f4/0xb8c
         lock_acquire+0x31c/0x360
         get_online_mems+0x54/0x150
         show_slab_objects+0x94/0x3a8
         total_objects_show+0x28/0x34
         slab_attr_show+0x38/0x54
         sysfs_kf_seq_show+0x198/0x2d4
         kernfs_seq_show+0xa4/0xcc
         seq_read+0x30c/0x8a8
         kernfs_fop_read+0xa8/0x314
         __vfs_read+0x88/0x20c
         vfs_read+0xd8/0x10c
         ksys_read+0xb0/0x120
         __arm64_sys_read+0x54/0x88
         el0_svc_handler+0x170/0x240
         el0_svc+0x8/0xc
      
      I think it is important to mention that this doesn't expose the
      show_slab_objects to use-after-free.  There is only a single path that
      might really race here and that is the slab hotplug notifier callback
      __kmem_cache_shrink (via slab_mem_going_offline_callback) but that path
      doesn't really destroy kmem_cache_node data structures.
      
      [1] http://lkml.iu.edu/hypermail/linux/kernel/1101.0/02850.html
      
      [akpm@linux-foundation.org: add comment explaining why we don't need mem_hotplug_lock]
      Link: http://lkml.kernel.org/r/1570192309-10132-1-git-send-email-cai@lca.pw
      Fixes: 01fb58bc ("slab: remove synchronous synchronize_sched() from memcg cache deactivation path")
      Fixes: 03afc0e2 ("slab: get_online_mems for kmem_cache_{create,destroy,shrink}")
      Signed-off-by: NQian Cai <cai@lca.pw>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      bb6932c5
    • D
      mm/memory-failure.c: don't access uninitialized memmaps in memory_failure() · 9792afbd
      David Hildenbrand 提交于
      commit 96c804a6ae8c59a9092b3d5dd581198472063184 upstream.
      
      We should check for pfn_to_online_page() to not access uninitialized
      memmaps.  Reshuffle the code so we don't have to duplicate the error
      message.
      
      Link: http://lkml.kernel.org/r/20191009142435.3975-3-david@redhat.comSigned-off-by: NDavid Hildenbrand <david@redhat.com>
      Fixes: f1dd2cd1 ("mm, memory_hotplug: do not associate hotadded memory to zones until online")	[visible after d0dc12e8]
      Acked-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org>	[4.13+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      9792afbd
    • M
      memfd: Fix locking when tagging pins · 99b45e7a
      Matthew Wilcox (Oracle) 提交于
      The RCU lock is insufficient to protect the radix tree iteration as
      a deletion from the tree can occur before we take the spinlock to
      tag the entry.  In 4.19, this has manifested as a bug with the following
      trace:
      
      kernel BUG at lib/radix-tree.c:1429!
      invalid opcode: 0000 [#1] SMP KASAN PTI
      CPU: 7 PID: 6935 Comm: syz-executor.2 Not tainted 4.19.36 #25
      Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
      RIP: 0010:radix_tree_tag_set+0x200/0x2f0 lib/radix-tree.c:1429
      Code: 00 00 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 89 44 24 10 e8 a3 29 7e fe 48 8b 44 24 10 48 0f ab 03 e9 d2 fe ff ff e8 90 29 7e fe <0f> 0b 48 c7 c7 e0 5a 87 84 e8 f0 e7 08 ff 4c 89 ef e8 4a ff ac fe
      RSP: 0018:ffff88837b13fb60 EFLAGS: 00010016
      RAX: 0000000000040000 RBX: ffff8883c5515d58 RCX: ffffffff82cb2ef0
      RDX: 0000000000000b72 RSI: ffffc90004cf2000 RDI: ffff8883c5515d98
      RBP: ffff88837b13fb98 R08: ffffed106f627f7e R09: ffffed106f627f7e
      R10: 0000000000000001 R11: ffffed106f627f7d R12: 0000000000000004
      R13: ffffea000d7fea80 R14: 1ffff1106f627f6f R15: 0000000000000002
      FS:  00007fa1b8df2700(0000) GS:ffff8883e2fc0000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 00007fa1b8df1db8 CR3: 000000037d4d2001 CR4: 0000000000160ee0
      Call Trace:
       memfd_tag_pins mm/memfd.c:51 [inline]
       memfd_wait_for_pins+0x2c5/0x12d0 mm/memfd.c:81
       memfd_add_seals mm/memfd.c:215 [inline]
       memfd_fcntl+0x33d/0x4a0 mm/memfd.c:247
       do_fcntl+0x589/0xeb0 fs/fcntl.c:421
       __do_sys_fcntl fs/fcntl.c:463 [inline]
       __se_sys_fcntl fs/fcntl.c:448 [inline]
       __x64_sys_fcntl+0x12d/0x180 fs/fcntl.c:448
       do_syscall_64+0xc8/0x580 arch/x86/entry/common.c:293
      
      The problem does not occur in mainline due to the XArray rewrite which
      changed the locking to exclude modification of the tree during iteration.
      At the time, nobody realised this was a bugfix.  Backport the locking
      changes to stable.
      
      Cc: stable@vger.kernel.org
      Reported-by: Nzhong jiang <zhongjiang@huawei.com>
      Signed-off-by: NMatthew Wilcox (Oracle) <willy@infradead.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      99b45e7a
  8. 18 10月, 2019 1 次提交
  9. 12 10月, 2019 1 次提交
    • K
      usercopy: Avoid HIGHMEM pfn warning · 12c6c4a5
      Kees Cook 提交于
      commit 314eed30ede02fa925990f535652254b5bad6b65 upstream.
      
      When running on a system with >512MB RAM with a 32-bit kernel built with:
      
      	CONFIG_DEBUG_VIRTUAL=y
      	CONFIG_HIGHMEM=y
      	CONFIG_HARDENED_USERCOPY=y
      
      all execve()s will fail due to argv copying into kmap()ed pages, and on
      usercopy checking the calls ultimately of virt_to_page() will be looking
      for "bad" kmap (highmem) pointers due to CONFIG_DEBUG_VIRTUAL=y:
      
       ------------[ cut here ]------------
       kernel BUG at ../arch/x86/mm/physaddr.c:83!
       invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
       CPU: 1 PID: 1 Comm: swapper/0 Not tainted 5.3.0-rc8 #6
       Hardware name: Dell Inc. Inspiron 1318/0C236D, BIOS A04 01/15/2009
       EIP: __phys_addr+0xaf/0x100
       ...
       Call Trace:
        __check_object_size+0xaf/0x3c0
        ? __might_sleep+0x80/0xa0
        copy_strings+0x1c2/0x370
        copy_strings_kernel+0x2b/0x40
        __do_execve_file+0x4ca/0x810
        ? kmem_cache_alloc+0x1c7/0x370
        do_execve+0x1b/0x20
        ...
      
      The check is from arch/x86/mm/physaddr.c:
      
      	VIRTUAL_BUG_ON((phys_addr >> PAGE_SHIFT) > max_low_pfn);
      
      Due to the kmap() in fs/exec.c:
      
      		kaddr = kmap(kmapped_page);
      	...
      	if (copy_from_user(kaddr+offset, str, bytes_to_copy)) ...
      
      Now we can fetch the correct page to avoid the pfn check. In both cases,
      hardened usercopy will need to walk the page-span checker (if enabled)
      to do sanity checking.
      Reported-by: NRandy Dunlap <rdunlap@infradead.org>
      Tested-by: NRandy Dunlap <rdunlap@infradead.org>
      Fixes: f5509cc1 ("mm: Hardened usercopy")
      Cc: Matthew Wilcox <willy@infradead.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Reviewed-by: NMatthew Wilcox (Oracle) <willy@infradead.org>
      Link: https://lore.kernel.org/r/201909171056.7F2FFD17@keescookSigned-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      12c6c4a5
  10. 05 10月, 2019 3 次提交
    • Y
      mm/compaction.c: clear total_{migrate,free}_scanned before scanning a new zone · 4d8bdf7f
      Yafang Shao 提交于
      [ Upstream commit a94b525241c0fff3598809131d7cfcfe1d572d8c ]
      
      total_{migrate,free}_scanned will be added to COMPACTMIGRATE_SCANNED and
      COMPACTFREE_SCANNED in compact_zone().  We should clear them before
      scanning a new zone.  In the proc triggered compaction, we forgot clearing
      them.
      
      [laoar.shao@gmail.com: introduce a helper compact_zone_counters_init()]
        Link: http://lkml.kernel.org/r/1563869295-25748-1-git-send-email-laoar.shao@gmail.com
      [akpm@linux-foundation.org: expand compact_zone_counters_init() into its single callsite, per mhocko]
      [vbabka@suse.cz: squash compact_zone() list_head init as well]
        Link: http://lkml.kernel.org/r/1fb6f7da-f776-9e42-22f8-bbb79b030b98@suse.cz
      [akpm@linux-foundation.org: kcompactd_do_work(): avoid unnecessary initialization of cc.zone]
      Link: http://lkml.kernel.org/r/1563789275-9639-1-git-send-email-laoar.shao@gmail.com
      Fixes: 7f354a54 ("mm, compaction: add vmstats for kcompactd work")
      Signed-off-by: NYafang Shao <laoar.shao@gmail.com>
      Signed-off-by: NVlastimil Babka <vbabka@suse.cz>
      Reviewed-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Yafang Shao <shaoyafang@didiglobal.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NSasha Levin <sashal@kernel.org>
      4d8bdf7f
    • M
      memcg, kmem: do not fail __GFP_NOFAIL charges · b4a734a5
      Michal Hocko 提交于
      commit e55d9d9bfb69405bd7615c0f8d229d8fafb3e9b8 upstream.
      
      Thomas has noticed the following NULL ptr dereference when using cgroup
      v1 kmem limit:
      BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
      PGD 0
      P4D 0
      Oops: 0000 [#1] PREEMPT SMP PTI
      CPU: 3 PID: 16923 Comm: gtk-update-icon Not tainted 4.19.51 #42
      Hardware name: Gigabyte Technology Co., Ltd. Z97X-Gaming G1/Z97X-Gaming G1, BIOS F9 07/31/2015
      RIP: 0010:create_empty_buffers+0x24/0x100
      Code: cd 0f 1f 44 00 00 0f 1f 44 00 00 41 54 49 89 d4 ba 01 00 00 00 55 53 48 89 fb e8 97 fe ff ff 48 89 c5 48 89 c2 eb 03 48 89 ca <48> 8b 4a 08 4c 09 22 48 85 c9 75 f1 48 89 6a 08 48 8b 43 18 48 8d
      RSP: 0018:ffff927ac1b37bf8 EFLAGS: 00010286
      RAX: 0000000000000000 RBX: fffff2d4429fd740 RCX: 0000000100097149
      RDX: 0000000000000000 RSI: 0000000000000082 RDI: ffff9075a99fbe00
      RBP: 0000000000000000 R08: fffff2d440949cc8 R09: 00000000000960c0
      R10: 0000000000000002 R11: 0000000000000000 R12: 0000000000000000
      R13: ffff907601f18360 R14: 0000000000002000 R15: 0000000000001000
      FS:  00007fb55b288bc0(0000) GS:ffff90761f8c0000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000000008 CR3: 000000007aebc002 CR4: 00000000001606e0
      Call Trace:
       create_page_buffers+0x4d/0x60
       __block_write_begin_int+0x8e/0x5a0
       ? ext4_inode_attach_jinode.part.82+0xb0/0xb0
       ? jbd2__journal_start+0xd7/0x1f0
       ext4_da_write_begin+0x112/0x3d0
       generic_perform_write+0xf1/0x1b0
       ? file_update_time+0x70/0x140
       __generic_file_write_iter+0x141/0x1a0
       ext4_file_write_iter+0xef/0x3b0
       __vfs_write+0x17e/0x1e0
       vfs_write+0xa5/0x1a0
       ksys_write+0x57/0xd0
       do_syscall_64+0x55/0x160
       entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      Tetsuo then noticed that this is because the __memcg_kmem_charge_memcg
      fails __GFP_NOFAIL charge when the kmem limit is reached.  This is a wrong
      behavior because nofail allocations are not allowed to fail.  Normal
      charge path simply forces the charge even if that means to cross the
      limit.  Kmem accounting should be doing the same.
      
      Link: http://lkml.kernel.org/r/20190906125608.32129-1-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Reported-by: NThomas Lindroth <thomas.lindroth@gmail.com>
      Debugged-by: NTetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
      Cc: Thomas Lindroth <thomas.lindroth@gmail.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      b4a734a5
    • T
      memcg, oom: don't require __GFP_FS when invoking memcg OOM killer · d40b3eaf
      Tetsuo Handa 提交于
      commit f9c645621a28e37813a1de96d9cbd89cde94a1e4 upstream.
      
      Masoud Sharbiani noticed that commit 29ef680a ("memcg, oom: move
      out_of_memory back to the charge path") broke memcg OOM called from
      __xfs_filemap_fault() path.  It turned out that try_charge() is retrying
      forever without making forward progress because mem_cgroup_oom(GFP_NOFS)
      cannot invoke the OOM killer due to commit 3da88fb3 ("mm, oom:
      move GFP_NOFS check to out_of_memory").
      
      Allowing forced charge due to being unable to invoke memcg OOM killer will
      lead to global OOM situation.  Also, just returning -ENOMEM will be risky
      because OOM path is lost and some paths (e.g.  get_user_pages()) will leak
      -ENOMEM.  Therefore, invoking memcg OOM killer (despite GFP_NOFS) will be
      the only choice we can choose for now.
      
      Until 29ef680a, we were able to invoke memcg OOM killer when
      GFP_KERNEL reclaim failed [1].  But since 29ef680a, we need to
      invoke memcg OOM killer when GFP_NOFS reclaim failed [2].  Although in the
      past we did invoke memcg OOM killer for GFP_NOFS [3], we might get
      pre-mature memcg OOM reports due to this patch.
      
      [1]
      
       leaker invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
       CPU: 0 PID: 2746 Comm: leaker Not tainted 4.18.0+ #19
       Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
       Call Trace:
        dump_stack+0x63/0x88
        dump_header+0x67/0x27a
        ? mem_cgroup_scan_tasks+0x91/0xf0
        oom_kill_process+0x210/0x410
        out_of_memory+0x10a/0x2c0
        mem_cgroup_out_of_memory+0x46/0x80
        mem_cgroup_oom_synchronize+0x2e4/0x310
        ? high_work_func+0x20/0x20
        pagefault_out_of_memory+0x31/0x76
        mm_fault_error+0x55/0x115
        ? handle_mm_fault+0xfd/0x220
        __do_page_fault+0x433/0x4e0
        do_page_fault+0x22/0x30
        ? page_fault+0x8/0x30
        page_fault+0x1e/0x30
       RIP: 0033:0x4009f0
       Code: 03 00 00 00 e8 71 fd ff ff 48 83 f8 ff 49 89 c6 74 74 48 89 c6 bf c0 0c 40 00 31 c0 e8 69 fd ff ff 45 85 ff 7e 21 31 c9 66 90 <41> 0f be 14 0e 01 d3 f7 c1 ff 0f 00 00 75 05 41 c6 04 0e 2a 48 83
       RSP: 002b:00007ffe29ae96f0 EFLAGS: 00010206
       RAX: 000000000000001b RBX: 0000000000000000 RCX: 0000000001ce1000
       RDX: 0000000000000000 RSI: 000000007fffffe5 RDI: 0000000000000000
       RBP: 000000000000000c R08: 0000000000000000 R09: 00007f94be09220d
       R10: 0000000000000002 R11: 0000000000000246 R12: 00000000000186a0
       R13: 0000000000000003 R14: 00007f949d845000 R15: 0000000002800000
       Task in /leaker killed as a result of limit of /leaker
       memory: usage 524288kB, limit 524288kB, failcnt 158965
       memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
       kmem: usage 2016kB, limit 9007199254740988kB, failcnt 0
       Memory cgroup stats for /leaker: cache:844KB rss:521136KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:132KB writeback:0KB inactive_anon:0KB active_anon:521224KB inactive_file:1012KB active_file:8KB unevictable:0KB
       Memory cgroup out of memory: Kill process 2746 (leaker) score 998 or sacrifice child
       Killed process 2746 (leaker) total-vm:536704kB, anon-rss:521176kB, file-rss:1208kB, shmem-rss:0kB
       oom_reaper: reaped process 2746 (leaker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      
      [2]
      
       leaker invoked oom-killer: gfp_mask=0x600040(GFP_NOFS), nodemask=(null), order=0, oom_score_adj=0
       CPU: 1 PID: 2746 Comm: leaker Not tainted 4.18.0+ #20
       Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
       Call Trace:
        dump_stack+0x63/0x88
        dump_header+0x67/0x27a
        ? mem_cgroup_scan_tasks+0x91/0xf0
        oom_kill_process+0x210/0x410
        out_of_memory+0x109/0x2d0
        mem_cgroup_out_of_memory+0x46/0x80
        try_charge+0x58d/0x650
        ? __radix_tree_replace+0x81/0x100
        mem_cgroup_try_charge+0x7a/0x100
        __add_to_page_cache_locked+0x92/0x180
        add_to_page_cache_lru+0x4d/0xf0
        iomap_readpages_actor+0xde/0x1b0
        ? iomap_zero_range_actor+0x1d0/0x1d0
        iomap_apply+0xaf/0x130
        iomap_readpages+0x9f/0x150
        ? iomap_zero_range_actor+0x1d0/0x1d0
        xfs_vm_readpages+0x18/0x20 [xfs]
        read_pages+0x60/0x140
        __do_page_cache_readahead+0x193/0x1b0
        ondemand_readahead+0x16d/0x2c0
        page_cache_async_readahead+0x9a/0xd0
        filemap_fault+0x403/0x620
        ? alloc_set_pte+0x12c/0x540
        ? _cond_resched+0x14/0x30
        __xfs_filemap_fault+0x66/0x180 [xfs]
        xfs_filemap_fault+0x27/0x30 [xfs]
        __do_fault+0x19/0x40
        __handle_mm_fault+0x8e8/0xb60
        handle_mm_fault+0xfd/0x220
        __do_page_fault+0x238/0x4e0
        do_page_fault+0x22/0x30
        ? page_fault+0x8/0x30
        page_fault+0x1e/0x30
       RIP: 0033:0x4009f0
       Code: 03 00 00 00 e8 71 fd ff ff 48 83 f8 ff 49 89 c6 74 74 48 89 c6 bf c0 0c 40 00 31 c0 e8 69 fd ff ff 45 85 ff 7e 21 31 c9 66 90 <41> 0f be 14 0e 01 d3 f7 c1 ff 0f 00 00 75 05 41 c6 04 0e 2a 48 83
       RSP: 002b:00007ffda45c9290 EFLAGS: 00010206
       RAX: 000000000000001b RBX: 0000000000000000 RCX: 0000000001a1e000
       RDX: 0000000000000000 RSI: 000000007fffffe5 RDI: 0000000000000000
       RBP: 000000000000000c R08: 0000000000000000 R09: 00007f6d061ff20d
       R10: 0000000000000002 R11: 0000000000000246 R12: 00000000000186a0
       R13: 0000000000000003 R14: 00007f6ce59b2000 R15: 0000000002800000
       Task in /leaker killed as a result of limit of /leaker
       memory: usage 524288kB, limit 524288kB, failcnt 7221
       memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
       kmem: usage 1944kB, limit 9007199254740988kB, failcnt 0
       Memory cgroup stats for /leaker: cache:3632KB rss:518232KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:518408KB inactive_file:3908KB active_file:12KB unevictable:0KB
       Memory cgroup out of memory: Kill process 2746 (leaker) score 992 or sacrifice child
       Killed process 2746 (leaker) total-vm:536704kB, anon-rss:518264kB, file-rss:1188kB, shmem-rss:0kB
       oom_reaper: reaped process 2746 (leaker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      
      [3]
      
       leaker invoked oom-killer: gfp_mask=0x50, order=0, oom_score_adj=0
       leaker cpuset=/ mems_allowed=0
       CPU: 1 PID: 3206 Comm: leaker Not tainted 3.10.0-957.27.2.el7.x86_64 #1
       Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
       Call Trace:
        [<ffffffffaf364147>] dump_stack+0x19/0x1b
        [<ffffffffaf35eb6a>] dump_header+0x90/0x229
        [<ffffffffaedbb456>] ? find_lock_task_mm+0x56/0xc0
        [<ffffffffaee32a38>] ? try_get_mem_cgroup_from_mm+0x28/0x60
        [<ffffffffaedbb904>] oom_kill_process+0x254/0x3d0
        [<ffffffffaee36c36>] mem_cgroup_oom_synchronize+0x546/0x570
        [<ffffffffaee360b0>] ? mem_cgroup_charge_common+0xc0/0xc0
        [<ffffffffaedbc194>] pagefault_out_of_memory+0x14/0x90
        [<ffffffffaf35d072>] mm_fault_error+0x6a/0x157
        [<ffffffffaf3717c8>] __do_page_fault+0x3c8/0x4f0
        [<ffffffffaf371925>] do_page_fault+0x35/0x90
        [<ffffffffaf36d768>] page_fault+0x28/0x30
       Task in /leaker killed as a result of limit of /leaker
       memory: usage 524288kB, limit 524288kB, failcnt 20628
       memory+swap: usage 524288kB, limit 9007199254740988kB, failcnt 0
       kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
       Memory cgroup stats for /leaker: cache:840KB rss:523448KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:523448KB inactive_file:464KB active_file:376KB unevictable:0KB
       Memory cgroup out of memory: Kill process 3206 (leaker) score 970 or sacrifice child
       Killed process 3206 (leaker) total-vm:536692kB, anon-rss:523304kB, file-rss:412kB, shmem-rss:0kB
      
      Bisected by Masoud Sharbiani.
      
      Link: http://lkml.kernel.org/r/cbe54ed1-b6ba-a056-8899-2dc42526371d@i-love.sakura.ne.jp
      Fixes: 3da88fb3 ("mm, oom: move GFP_NOFS check to out_of_memory") [necessary after 29ef680a]
      Signed-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reported-by: NMasoud Sharbiani <msharbiani@apple.com>
      Tested-by: NMasoud Sharbiani <msharbiani@apple.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: <stable@vger.kernel.org>	[4.19+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      d40b3eaf
  11. 16 9月, 2019 1 次提交
  12. 06 9月, 2019 1 次提交