1. 08 Dec 2019 (9 commits)
    • mm: move mem_cgroup_uncharge out of __page_cache_release() · 79044939
      Yang Shi authored
      commit 7ae88534cdd96235cd775c03b32a75009355740b upstream
      
      A later patch makes THP deferred split shrinker memcg aware, but it
      needs page->mem_cgroup information in THP destructor, which is called after
      mem_cgroup_uncharge() now.
      
      So move mem_cgroup_uncharge() from __page_cache_release() to the
      compound page destructor, which is called for both THP and other
      compound pages except HugeTLB.  And call it in __put_single_page() for
      order-0 pages.
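
      A minimal sketch of the intended placement (the function names exist in
      mm/swap.c and mm/page_alloc.c, but the exact upstream diff may differ):

          /* mm/page_alloc.c */
          void free_compound_page(struct page *page)
          {
                  mem_cgroup_uncharge(page);      /* THP and other compound
                                                     pages, except hugetlb */
                  __free_pages_ok(page, compound_order(page));
          }

          /* mm/swap.c */
          static void __put_single_page(struct page *page)
          {
                  __page_cache_release(page);
                  mem_cgroup_uncharge(page);      /* order-0 pages */
                  free_unref_page(page);
          }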
      
      Link: http://lkml.kernel.org/r/1565144277-36240-3-git-send-email-yang.shi@linux.alibaba.com
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Suggested-by: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      79044939
    • mm: thp: extract split_queue_* into a struct · c9acf2bd
      Yang Shi authored
      commit 364c1eebe453f06f0c1e837eb155a5725c9cd272 upstream
      
      Patch series "Make deferred split shrinker memcg aware", v6.
      
      Currently the THP deferred split shrinker is not memcg aware; this may
      cause premature OOM with some configurations.  For example, the test
      below easily runs into premature OOM:
      
      $ cgcreate -g memory:thp
      $ echo 4G > /sys/fs/cgroup/memory/thp/memory/limit_in_bytes
      $ cgexec -g memory:thp transhuge-stress 4000
      
      transhuge-stress comes from kernel selftest.
      
      It is easy to hit OOM, but there are still a lot of THPs on the deferred
      split queue; memcg direct reclaim can't touch them since the deferred
      split shrinker is not memcg aware.
      
      Make the deferred split shrinker memcg aware by introducing a per-memcg
      deferred split queue.  A THP is on either the per-node or the per-memcg
      deferred split queue, depending on whether it belongs to a memcg.  When
      the page is migrated to another memcg, it is moved to the target memcg's
      deferred split queue too.
      
      Reuse the second tail page's deferred_list for per memcg list since the
      same THP can't be on multiple deferred split queues.
      
      Make the deferred split shrinker not depend on memcg kmem since THPs are
      not slab objects.  It doesn't make sense to skip shrinking THPs just
      because memcg kmem is disabled.

      With the above changes, the test shown above doesn't trigger OOM even
      with cgroup.memory=nokmem.
      
      This patch (of 4):
      
      Put split_queue, split_queue_lock and split_queue_len into a struct in
      order to reduce code duplication when we convert deferred_split to memcg
      aware in the later patches.
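
      Roughly, the container introduced here looks like (a sketch matching the
      description above; member order may differ from the upstream patch):

          struct deferred_split {
                  spinlock_t split_queue_lock;
                  struct list_head split_queue;
                  unsigned long split_queue_len;
          };

      For now one instance is embedded in struct pglist_data; a later patch in
      the series adds one per mem_cgroup as well.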
      
      Link: http://lkml.kernel.org/r/1565144277-36240-2-git-send-email-yang.shi@linux.alibaba.com
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Suggested-by: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Acked-by: NKirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reviewed-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      c9acf2bd
    • alios: mm: Support kidled · fd952d8c
      Gavin Shan authored
      This enables scanning pages at a fixed interval to determine their
      access frequency (hot/cold).  The result is exported to user land per
      memory cgroup through "memory.idle_page_stats".  The design is
      highlighted below:
      
         * A kernel thread is spawn when this feature is enabled by writing
           non-zero value to "/sys/kernel/mm/kidled/scan_period_in_seconds".
           The thread sequentially scans the nodes and their pages that have
           been chained up in LRU list.
      
         * For each page, its corresponding age information is stored in the
           page flags or in an array in the node.  The age represents the
           number of scanning intervals in which the page wasn't accessed.
           The page flag (PG_idle) is leveraged: the page's age is increased
           by one if the idle flag isn't cleared in two consecutive scans,
           otherwise the page's age is reset (the aging rule is sketched right
           after this list).  The age information is also cleared when the
           page is freed, so that stale age information isn't fetched when the
           page is allocated again.
      
         * Initially, the flag is set while the access bit in the page's PTE
           is cleared by the thread.  In the next scanning period, the PTE
           access bit is synchronized with the page flag: the flag is cleared
           if the access bit is set, and kept otherwise.  For unmapped pages,
           the flag is cleared when the page is accessed.
      
         * Eventually, the page's aging information is accumulated into the
           unstable bucket of its corresponding memory cgroup as statistics.
           The unstable bucket is copied to the stable bucket once all pages
           in all nodes have been scanned.  The stable bucket is exported to
           user land through "memory.idle_page_stats".
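
      The per-page aging rule described above boils down to roughly the
      following (illustrative only; kidled_get_age()/kidled_set_age() are
      hypothetical helpers, not necessarily the names used by the patch):

          /* Called for each LRU page once per scan period. */
          static void kidled_scan_one_page(struct page *page)
          {
                  int age = kidled_get_age(page);  /* page flags or node array */

                  if (page_is_idle(page)) {
                          /* Not referenced since the previous scan: grow age. */
                          kidled_set_age(page, age + 1);
                  } else {
                          /* Referenced: reset the age, re-arm the idle flag. */
                          kidled_set_age(page, 0);
                          set_page_idle(page);
                  }
          }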
      
      TESTING
      =======
      
         * cgroup1, unmapped pagecache
      
           # dd if=/dev/zero of=/ext4/test.data oflag=direct bs=1M count=128
           #
           # echo 1 > /sys/kernel/mm/kidled/use_hierarchy
           # echo 15 > /sys/kernel/mm/kidled/scan_period_in_seconds
           # mkdir -p /cgroup/memory
           # mount -tcgroup -o memory /cgroup/memory
           # echo 1 > /cgroup/memory/memory.use_hierarchy
           # mkdir -p /cgroup/memory/test
           # echo 1 > /cgroup/memory/test/memory.use_hierarchy
           #
           # echo $$ > /cgroup/memory/test/cgroup.procs
           # dd if=/ext4/test.data of=/dev/null bs=1M count=128
           # < wait a few minutes >
           # cat /cgroup/memory/test/memory.idle_page_stats | grep cfei
           # cat /cgroup/memory/test/memory.idle_page_stats | grep cfei
             cfei   0   0   0   134217728   0   0   0   0
           # cat /cgroup/memory/memory.idle_page_stats | grep cfei
             cfei   0   0   0   134217728   0   0   0   0
      
         * cgroup1, mapped pagecache
      
           # < create same file and memory cgroups as above >
           #
           # echo $$ > /cgroup/memory/test/cgroup.procs
           # < run program to mmap the whole created file and access the area >
           # < wait a few minutes >
           # cat /cgroup/memory/test/memory.idle_page_stats | grep cfei
             cfei   0   134217728   0   0   0   0   0   0
           # cat /cgroup/memory/memory.idle_page_stats | grep cfei
             cfei   0   134217728   0   0   0   0   0   0
      
         * cgroup1, mapped and locked pagecache
      
           # < create same file and memory cgroups as above >
           #
           # echo $$ > /cgroup/memory/test/cgroup.procs
           # < run program to mmap the whole created file and mlock the area >
           # < wait a few minutes >
           # cat /cgroup/memory/test/memory.idle_page_stats | grep cfui
             cfui   0   134217728   0   0   0   0   0   0
           # cat /cgroup/memory/memory.idle_page_stats | grep cfui
             cfui   0   134217728   0   0   0   0   0   0
      
         * cgroup1, anonymous and locked area
      
           # < create memory cgroups as above >
           #
           # echo $$ > /cgroup/memory/test/cgroup.procs
           # < run program to mmap anonymous area and mlock it >
           # < wait a few minutes >
           # cat /cgroup/memory/test/memory.idle_page_stats | grep csui
             csui   0   0   134217728   0   0   0   0   0
           # cat /cgroup/memory/memory.idle_page_stats | grep csui
             csui   0   0   134217728   0   0   0   0   0
      
         * The above test cases were rerun with cgroup2 and the results showed
           no exceptions.  However, the cgroups are populated in a different
           way, as below:
      
           # mkdir -p /cgroup
           # mount -tcgroup2 none /cgroup
           # echo "+memory" > /cgroup/cgroup.subtree_control
           # mkdir -p /cgroup/test
      Signed-off-by: NGavin Shan <shan.gavin@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      fd952d8c
    • alios: mm: memcontrol: make distance between wmark_low and wmark_high configurable · 33ef4784
      Yang Shi authored
      Introduce a new interface, wmark_scale_factor, which defines the
      distance between wmark_high and wmark_low.  The unit is in fractions of
      10,000. The default value of 50 means the distance between wmark_high
      and wmark_low is 0.5% of the max limit of the cgroup.  The maximum value
      is 1000, or 10% of the max limit.
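
      The arithmetic works out roughly as below (a sketch, not the exact
      in-kernel code; the field and parameter names are illustrative):

          /* wmark_scale_factor is in units of 1/10000 of the max limit. */
          static unsigned long memcg_wmark_low(unsigned long max_limit,
                                               unsigned long wmark_high,
                                               unsigned int wmark_scale_factor)
          {
                  unsigned long distance = max_limit / 10000 * wmark_scale_factor;

                  /* Default factor of 50 => distance is 0.5% of the max limit. */
                  return wmark_high - distance;
          }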
      
      The distance between wmark_low and wmark_high has an impact on how hard
      memcg kswapd reclaims.
      Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
      33ef4784
    • alios: mm: vmscan: make memcg kswapd set memcg state to dirty or writeback · e10c247b
      Yang Shi authored
      The global kswapd can set the memory node to dirty or writeback if the
      current scan finds that all pages are unqueued dirty or under writeback.
      Then kswapd writes out dirty pages or waits for writeback to finish.
      The memcg kswapd behaves like the global kswapd, and it should set the
      dirty or writeback state on the memcg too if the same condition is met.
      
      Since direct reclaim can't write out page cache, the system depends on
      kswapd to write out dirty pages if the scan finds too many of them, in
      order to avoid premature OOM.  But if the page cache is dirtied too
      fast, writing out pages can't catch up with dirtying; it is the
      responsibility of dirty page balancing to throttle the dirtiers.
      Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
      e10c247b
    • alios: mm: memcontrol: treat memcg wmark reclaim work as kswapd · f7c87fa3
      Yang Shi authored
      Since background watermark reclaim is scheduled on a workqueue, it can
      do more work than direct reclaim, e.g. write out dirty pages.

      So add the PF_KSWAPD flag, so that current_is_kswapd() returns true
      for memcg background reclaim.  The condition "current_is_kswapd() &&
      !global_reclaim(sc)" is good enough to tell whether current is the
      global kswapd or memcg background reclaim.
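
      Conceptually the worker brackets the reclaim call like this (a sketch;
      wmark_reclaim_work_func() and wmark_work_to_memcg() are illustrative
      names, not taken from the patch):

          static void wmark_reclaim_work_func(struct work_struct *work)
          {
                  struct mem_cgroup *memcg = wmark_work_to_memcg(work);

                  current->flags |= PF_KSWAPD;   /* current_is_kswapd() == true */
                  try_to_free_mem_cgroup_pages(memcg, ULONG_MAX, GFP_KERNEL, true);
                  current->flags &= ~PF_KSWAPD;
          }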
      
      And, kswapd is not allowed to break memory.low protection for now, memcg
      kswapd should not break it either.
      Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
      f7c87fa3
    • alios: mm: memcontrol: add background reclaim support for cgroupv2 · 256b5d94
      Yang Shi authored
      Like v1, add background reclaim support for cgroup v2.  The interfaces
      are exactly the same as v1.  However, if the high limit is set up for
      v2, the watermarks are calculated from the high limit instead of the max
      limit.
      Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
      256b5d94
    • alios: mm: memcontrol: support background async page reclaim · 6b2ef082
      Yang Shi authored
      Currently, when memory usage exceeds the memory cgroup limit, the memory
      cgroup can only do synchronous direct reclaim.  This may incur
      unexpected stalls in applications which are sensitive to latency.
      Introduce a background async page reclaim mechanism, like what kswapd
      does.
      
      Define the memcg memory usage watermark by introducing the wmark_ratio
      interface, which ranges from 0 to 100 and represents a percentage of the
      max limit.  wmark_high is calculated as (max * wmark_ratio / 100) and
      wmark_low is (wmark_high - (wmark_high >> 8)), which is an empirical
      value.  If wmark_ratio is 0, the watermark is disabled and both
      wmark_low and wmark_high are max, which is the default.
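
      A sketch of the watermark setup described above (the mem_cgroup field
      names are illustrative, not necessarily those used by the patch):

          static void memcg_update_wmarks(struct mem_cgroup *memcg,
                                          unsigned long max,
                                          unsigned int wmark_ratio)
          {
                  if (!wmark_ratio) {
                          /* 0 == disabled: both watermarks sit at max. */
                          memcg->wmark_high = max;
                          memcg->wmark_low  = max;
                          return;
                  }

                  memcg->wmark_high = max * wmark_ratio / 100;
                  memcg->wmark_low  = memcg->wmark_high - (memcg->wmark_high >> 8);
          }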
      
      If wmark_ratio is set up, then when charging a page, if usage is greater
      than wmark_high (which means the available memory of the memcg is low),
      a work item is scheduled to do background page reclaim until memory
      usage is reduced to wmark_low, if possible.
      
      Define a dedicated unbound workqueue for scheduling water mark reclaim
      works.
      Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
      6b2ef082
    • alios: mm: vmscan: make it sane reclaim if cgwb_v1 is enabled · 76e0403d
      Yang Shi authored
      AliOS Cloud Kernel has cgroup writeback support for v1, so the reclaim could be
      treated as sane reclaim if cgwb_v1 is enabled.
      Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Reviewed-by: NCaspar Zhang <caspar@linux.alibaba.com>
      Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
      76e0403d
  2. 29 Nov 2019 (1 commit)
    • alios: mm, memcg: fix possible soft lockup in try_charge · 1f6142a0
      Xu Yu authored
      When events such as direct reclaim and OOM occur intensively, soft
      lockups are very likely to happen in instances with 1 vCPU and with
      kernel preemption disabled.
      
      The example soft lockup is as follows.
      
      [  160.555984] watchdog: BUG: soft lockup - CPU#0 stuck for 112s! [malloc:2188]
      [  160.557975] Modules linked in: button
      [  160.559495] CPU: 0 PID: 2188 Comm: malloc Not tainted 4.19.57-15.457.al7.x86_64 #1
      [  160.561546] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 3288b3c 04/01/2014
      [  160.563707] RIP: 0010:shrink_node+0x1ae/0x450
      [  160.565391] Code: 00 00 00 49 8b 4f 20 ba 01 00 00 00 4c 8b 74 24 10 4d 8b 47 28 49 8b 77 10 48 2b 4c 24 08 41 8b 7f 1c 4d8
      [  160.570747] RSP: 0000:ffff9d0ec07a3b58 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff13
      [  160.572889] RAX: ffff982ab6014330 RBX: ffff982ab6014000 RCX: 0000000000000000
      [  160.574992] RDX: 0000000000000001 RSI: ffff982ab6014000 RDI: ffff982ab6014000
      [  160.577106] RBP: ffff982afffb6000 R08: 0000000000000000 R09: ffff982ab6014000
      [  160.579219] R10: 0000000000000004 R11: 0000000000aaaaaa R12: 0000000000000000
      [  160.581326] R13: 0000000000000000 R14: 0000000000000000 R15: ffff9d0ec07a3c50
      [  160.583450] FS:  00007f8b414f7740(0000) GS:ffff982afda00000(0000) knlGS:0000000000000000
      [  160.585704] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      [  160.587662] CR2: 00007f8adb800010 CR3: 000000007ac9e001 CR4: 00000000003606b0
      [  160.589835] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      [  160.591971] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
      [  160.594133] Call Trace:
      [  160.595602]  do_try_to_free_pages+0xcc/0x390
      [  160.597356]  try_to_free_mem_cgroup_pages+0xf9/0x1d0
      [  160.599198]  ? out_of_memory+0xb5/0x4a0
      [  160.600882]  try_charge+0x244/0x750
      [  160.602522]  ? __pagevec_lru_add_fn+0x1d0/0x330
      [  160.604310]  mem_cgroup_try_charge+0xb4/0x1d0
      [  160.606085]  mem_cgroup_try_charge_delay+0x1c/0x40
      [  160.607892]  do_anonymous_page+0xf7/0x540
      [  160.609574]  __handle_mm_fault+0x665/0xa00
      [  160.611233]  ? __switch_to_asm+0x35/0x70
      [  160.612838]  handle_mm_fault+0x122/0x1e0
      [  160.614407]  __do_page_fault+0x1b7/0x470
      [  160.615962]  do_page_fault+0x32/0x140
      [  160.617474]  ? async_page_fault+0x8/0x30
      [  160.619012]  async_page_fault+0x1e/0x30
      [  160.620526] RIP: 0033:0x40068e
      
      Fix it by adding cond_resched() in try_charge(), just before the goto
      retry after OOM_SUCCESS, in order to let the OOM killer free some memory
      first.
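
      Roughly where the fix lands in try_charge() (a simplified sketch of the
      OOM_SUCCESS case, not the full switch statement):

          switch (oom_status) {
          case OOM_SUCCESS:
                  nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
                  /* Let the OOM victim run and actually free memory before
                   * retrying, otherwise a 1-vCPU, non-preempt instance can
                   * soft lockup in this loop. */
                  cond_resched();
                  goto retry;
          ...
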
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      1f6142a0
  3. 28 Nov 2019 (1 commit)
  4. 21 Nov 2019 (1 commit)
  5. 20 Nov 2019 (6 commits)
    • mm/memory-hotplug: Allow memory resources to be children · 71009d66
      Dave Hansen authored
      commit 2794129e902d8eb69413d884dc6404b8716ed9ed upstream
      
      The mm/resource.c code is used to manage the physical address
      space.  The current resource configuration can be viewed in
      /proc/iomem.  An example of this is at the bottom of this
      description.
      
      The nvdimm subsystem "owns" the physical address resources which
      map to persistent memory and has resources inserted for them as
      "Persistent Memory".  The best way to repurpose this for volatile
      use is to leave the existing resource in place, but add a "System
      RAM" resource underneath it. This clearly communicates the
      ownership relationship of this memory.
      
      The request_resource_conflict() API only deals with the
      top-level resources.  Replace it with __request_region() which
      will search for !IORESOURCE_BUSY areas lower in the resource
      tree than the top level.
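
      A sketch of what the hotplug path ends up doing (the flags shown are an
      assumption on my part, not copied from the patch):

          /* register_memory_resource()-style usage */
          static struct resource *add_system_ram(resource_size_t start,
                                                 resource_size_t size)
          {
                  /* Unlike request_resource_conflict(), __request_region()
                   * walks down the tree and can nest a "System RAM" child
                   * under a non-busy "Persistent Memory" resource. */
                  return __request_region(&iomem_resource, start, size,
                                          "System RAM",
                                          IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY);
          }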
      
      We *could* also simply truncate the existing top-level
      "Persistent Memory" resource and take over the released address
      space.  But, this means that if we ever decide to hot-unplug the
      "RAM" and give it back, we need to recreate the original setup,
      which may mean going back to the BIOS tables.
      
      This should have no real effect on the existing collision
      detection because the areas that truly conflict should be marked
      IORESOURCE_BUSY.
      
      00000000-00000fff : Reserved
      00001000-0009fbff : System RAM
      0009fc00-0009ffff : Reserved
      000a0000-000bffff : PCI Bus 0000:00
      000c0000-000c97ff : Video ROM
      000c9800-000ca5ff : Adapter ROM
      000f0000-000fffff : Reserved
        000f0000-000fffff : System ROM
      00100000-9fffffff : System RAM
        01000000-01e071d0 : Kernel code
        01e071d1-027dfdff : Kernel data
        02dc6000-0305dfff : Kernel bss
      a0000000-afffffff : Persistent Memory (legacy)
        a0000000-a7ffffff : System RAM
      b0000000-bffdffff : System RAM
      bffe0000-bfffffff : Reserved
      c0000000-febfffff : PCI Bus 0000:00
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: NDan Williams <dan.j.williams@intel.com>
      Reviewed-by: NVishal Verma <vishal.l.verma@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: linux-nvdimm@lists.01.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Bjorn Helgaas <bhelgaas@google.com>
      Cc: Yaowei Bai <baiyaowei@cmss.chinamobile.com>
      Cc: Takashi Iwai <tiwai@suse.de>
      Cc: Jerome Glisse <jglisse@redhat.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com>
      71009d66
    • mm/resource: Move HMM pr_debug() deeper into resource code · cdb8d31f
      Dave Hansen authored
      commit b926b7f3baecb2a855db629e6822e1a85212e91c upstream
      
      HMM consumes physical address space for its own use, even
      though nothing is mapped or accessible there.  It uses a
      special resource description (IORES_DESC_DEVICE_PRIVATE_MEMORY)
      to uniquely identify these areas.
      
      When HMM consumes address space, it makes a best guess about
      what to consume.  However, it is possible that a future memory
      or device hotplug can collide with the reserved area.  In the
      case of these conflicts, there is an error message in
      register_memory_resource().
      
      Later patches in this series move register_memory_resource()
      from using request_resource_conflict() to __request_region().
      Unfortunately, __request_region() does not return the conflict
      like the previous function did, which makes it impossible to
      check for IORES_DESC_DEVICE_PRIVATE_MEMORY in a conflicting
      resource.
      
      Instead of warning in register_memory_resource(), move the
      check into the core resource code itself (__request_region())
      where the conflicting resource _is_ available.  This has the
      added bonus of producing a warning in case of HMM conflicts
      with devices *or* RAM address space, as opposed to the RAM-
      only warnings that were there previously.
      Signed-off-by: NDave Hansen <dave.hansen@linux.intel.com>
      Reviewed-by: NJerome Glisse <jglisse@redhat.com>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Vishal Verma <vishal.l.verma@intel.com>
      Cc: Tom Lendacky <thomas.lendacky@amd.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: linux-nvdimm@lists.01.org
      Cc: linux-kernel@vger.kernel.org
      Cc: linux-mm@kvack.org
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Fengguang Wu <fengguang.wu@intel.com>
      Cc: Keith Busch <keith.busch@intel.com>
      Signed-off-by: NDan Williams <dan.j.williams@intel.com>
      Signed-off-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com>
      cdb8d31f
    • mm, memcg: throttle allocators when failing reclaim over memory.high · eda29cc0
      Chris Down authored
      commit 0e4b01df865935007bd712cbc8e7299005b28894 upstream.
      
      We're trying to use memory.high to limit workloads, but have found that
      containment can frequently fail completely and cause OOM situations
      outside of the cgroup.  This happens especially with swap space -- either
      when none is configured, or swap is full.  These failures often also don't
      have enough warning to allow one to react, whether for a human or for a
      daemon monitoring PSI.
      
      Here is output from a simple program showing how long it takes in usec
      (column 2) to allocate a megabyte of anonymous memory (column 1) when a
      cgroup is already beyond its memory high setting, and no swap is
      available:
      
          [root@ktst ~]# systemd-run -p MemoryHigh=100M -p MemorySwapMax=1 \
          > --wait -t timeout 300 /root/mdf
          [...]
          95  1035
          96  1038
          97  1000
          98  1036
          99  1048
          100 1590
          101 1968
          102 1776
          103 1863
          104 1757
          105 1921
          106 1893
          107 1760
          108 1748
          109 1843
          110 1716
          111 1924
          112 1776
          113 1831
          114 1766
          115 1836
          116 1588
          117 1912
          118 1802
          119 1857
          120 1731
          [...]
          [System OOM in 2-3 seconds]
      
      The delay does go up extremely marginally past the 100MB memory.high
      threshold, as now we spend time scanning before returning to usermode, but
      it's nowhere near enough to contain growth.  It also doesn't get worse the
      more pages you have, since it only considers nr_pages.
      
      The current situation goes against both the expectations of users of
      memory.high, and our intentions as cgroup v2 developers.  In
      cgroup-v2.txt, we claim that we will throttle and only under "extreme
      conditions" will memory.high protection be breached.  Likewise, cgroup v2
      users generally also expect that memory.high should throttle workloads as
      they exceed their high threshold.  However, as seen above, this isn't
      always how it works in practice -- even on banal setups like those with no
      swap, or where swap has become exhausted, we can end up with memory.high
      being breached and us having no weapons left in our arsenal to combat
      runaway growth with, since reclaim is futile.
      
      It's also hard for system monitoring software or users to tell how bad the
      situation is, as "high" events for the memcg may in some cases be benign,
      and in others be catastrophic.  The current status quo is that we fail
      containment in a way that doesn't provide any advance warning that things
      are about to go horribly wrong (for example, we are about to invoke the
      kernel OOM killer).
      
      This patch introduces explicit throttling when reclaim is failing to keep
      memcg size contained at the memory.high setting.  It does so by applying
      an exponential delay curve derived from the memcg's overage compared to
      memory.high.  In the normal case where the memcg is either below or only
      marginally over its memory.high setting, no throttling will be performed.
      
      This composes well with system health monitoring and remediation, as these
      allocator delays are factored into PSI's memory pressure calculations.
      This both creates a mechanism for system administrators or applications
      consuming the PSI interface to trivially see that the memcg in question
      is struggling and use that to make more reasonable decisions, and
      permits them enough time to act.  Either of these can act with
      significantly more nuance than we can provide using the system OOM
      killer.
      
      This is a similar idea to memory.oom_control in cgroup v1 which would put
      the cgroup to sleep if the threshold was violated, but it's also
      significantly improved as it results in visible memory pressure, and also
      doesn't schedule indefinitely, which previously made tracing and other
      introspection difficult (ie.  it's clamped at 2*HZ per allocation through
      MEMCG_MAX_HIGH_DELAY_JIFFIES).
      
      Contrast the previous results with a kernel with this patch:
      
          [root@ktst ~]# systemd-run -p MemoryHigh=100M -p MemorySwapMax=1 \
          > --wait -t timeout 300 /root/mdf
          [...]
          95  1002
          96  1000
          97  1002
          98  1003
          99  1000
          100 1043
          101 84724
          102 330628
          103 610511
          104 1016265
          105 1503969
          106 2391692
          107 2872061
          108 3248003
          109 4791904
          110 5759832
          111 6912509
          112 8127818
          113 9472203
          114 12287622
          115 12480079
          116 14144008
          117 15808029
          118 16384500
          119 16383242
          120 16384979
          [...]
      
      As you can see, in the normal case, memory allocation takes around 1000
      usec.  However, as we exceed our memory.high, things start to increase
      exponentially, but fairly leniently at first.  Our first megabyte over
      memory.high takes us 0.16 seconds, then the next is 0.46 seconds, then the
      next is almost an entire second.  This gets worse until we reach our
      eventual 2*HZ clamp per batch, resulting in 16 seconds per megabyte.
      However, this is still making forward progress, so permits tracing or
      further analysis with programs like GDB.
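
      A rough sketch of the penalty curve this describes (the real patch's
      scaling constants and helper names differ):

          static unsigned long high_delay_jiffies(unsigned long usage,
                                                  unsigned long high)
          {
                  u64 overage, delay;

                  if (!high || usage <= high)
                          return 0;

                  /* Overage as a fraction of memory.high, in 1/1024ths. */
                  overage = div64_u64((u64)(usage - high) << 10, high);

                  /* Quadratic penalty, clamped at 2*HZ
                   * (MEMCG_MAX_HIGH_DELAY_JIFFIES). */
                  delay = overage * overage * HZ >> 20;
                  return min_t(u64, delay, 2 * HZ);
          }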
      
      We use an exponential curve for our delay penalty for a few reasons:
      
      1. We run mem_cgroup_handle_over_high to potentially do reclaim after
         we've already performed allocations, which means that temporarily
         going over memory.high by a small amount may be perfectly legitimate,
         even for compliant workloads. We don't want to unduly penalise such
         cases.
      2. An exponential curve (as opposed to a static or linear delay) allows
         ramping up memory pressure stats more gradually, which can be useful
         to work out that you have set memory.high too low, without destroying
         application performance entirely.
      
      This patch expands on earlier work by Johannes Weiner. Thanks!
      
      [akpm@linux-foundation.org: fix max() warning]
      [akpm@linux-foundation.org: fix __udivdi3 ref on 32-bit]
      [akpm@linux-foundation.org: fix it even more]
      [chris@chrisdown.name: fix 64-bit divide even more]
      Link: http://lkml.kernel.org/r/20190723180700.GA29459@chrisdown.name
      Signed-off-by: Chris Down <chris@chrisdown.name>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Nathan Chancellor <natechancellor@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      eda29cc0
    • mm/zsmalloc.c: fix a -Wunused-function warning · 68247716
      Qian Cai authored
      commit 2b38d01b4de8b1bbda7f5f7e91252609557635fc upstream
      
      set_zspage_inuse() was introduced in the commit 4f42047b ("zsmalloc:
      use accessor") but all the users of it were removed later by the commits,
      
      bdb0af7c ("zsmalloc: factor page chain functionality out")
      3783689a ("zsmalloc: introduce zspage structure")
      
      so the function can be safely removed now.
      
      Link: http://lkml.kernel.org/r/1568658408-19374-1-git-send-email-cai@lca.pw
      Signed-off-by: Qian Cai <cai@lca.pw>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NHui Zhu <teawaterz@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      68247716
    • x86/mm: Split vmalloc_sync_all() · 73c092d2
      Joerg Roedel authored
      commit 1a0a610d5f056c6195ae9808962477a94d1d72c8 upstream.
      
      Commit 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in
      __purge_vmap_area_lazy()") introduced a call to vmalloc_sync_all() in the
      vunmap() code-path.  While this change was necessary to maintain
      correctness on x86-32-pae kernels, it also adds additional cycles for
      architectures that don't need it.
      
      Specifically on x86-64 with CONFIG_VMAP_STACK=y some people reported
      severe performance regressions in micro-benchmarks because it now also
      calls the x86-64 implementation of vmalloc_sync_all() on vunmap().  But
      the vmalloc_sync_all() implementation on x86-64 is only needed for newly
      created mappings.
      
      To avoid the unnecessary work on x86-64 and to gain the performance back,
      split up vmalloc_sync_all() into two functions:
      
      	* vmalloc_sync_mappings(), and
      	* vmalloc_sync_unmappings()
      
      Most call-sites to vmalloc_sync_all() only care about new mappings being
      synchronized.  The only exception is the new call-site added in the above
      mentioned commit.
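
      The generic side of the split is just a pair of weak stubs that
      architectures override as needed (sketch):

          /* mm/vmalloc.c */
          void __weak vmalloc_sync_mappings(void)
          {
          }

          void __weak vmalloc_sync_unmappings(void)
          {
          }

      Callers that only create mappings switch to vmalloc_sync_mappings(),
      while __purge_vmap_area_lazy() calls vmalloc_sync_unmappings(), which
      x86-64 can leave empty.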
      
      Shile Zhang directed us to a report of an 80% regression in reaim
      throughput.
      
      Link: http://lkml.kernel.org/r/20191009124418.8286-1-joro@8bytes.org
      Link: https://lists.01.org/hyperkitty/list/lkp@lists.01.org/thread/4D3JPPHBNOSPFK2KEPC6KGKS6J25AIDB/
      Link: http://lkml.kernel.org/r/20191113095530.228959-1-shile.zhang@linux.alibaba.com
      Fixes: 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()")
      Signed-off-by: NJoerg Roedel <jroedel@suse.de>
      Reported-by: Nkernel test robot <oliver.sang@intel.com>
      Reported-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>	[GHES]
      Cc: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Borislav Petkov <bp@alien8.de>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      73c092d2
    • zswap: do not map same object twice · 2a1613fd
      Vitaly Wool authored
      commit 068619e32ff6229a09407d267e36ea7710b96ea1 upstream
      
      zswap_writeback_entry() maps a handle to read swpentry first, and
      then in the most common case it would map the same handle again.
      This is ok when zbud is the backend since its mapping callback is
      plain and simple, but it slows things down for z3fold.
      
      Since there's hardly a point in unmapping a handle _that_ fast as
      zswap_writeback_entry() does when it reads swpentry, the
      suggestion is to keep the handle mapped till the end.
      
      Link: http://lkml.kernel.org/r/20190916004640.b453167d3556c4093af4cf7d@gmail.com
      Signed-off-by: Hui Zhu <teawaterz@linux.alibaba.com>
      Signed-off-by: Vitaly Wool <vitalywool@gmail.com>
      Reviewed-by: Dan Streetman <ddstreet@ieee.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      2a1613fd
  6. 19 Nov 2019 (4 commits)
    • mm/swap_state.c: simplify total_swapcache_pages() with get_swap_device() · a042c2e0
      Huang Ying authored
      commit 054f1d1faaed6a7930b77286d607ae45c01d0443 upstream.
      
      total_swapcache_pages() may race with swapper_spaces[] allocation and
      freeing.  Previously, this was protected with a swapper_spaces[]
      specific RCU mechanism.  To simplify the logic and code complexity, it
      is replaced with get/put_swap_device().  The line count is reduced too.
      Although not so important, swapoff() performance improves as well
      because one synchronize_rcu() call during swapoff() is deleted.
      
      [ying.huang@intel.com: fix bad swap file entry warning]
        Link: http://lkml.kernel.org/r/20190531024102.21723-1-ying.huang@intel.com
      Link: http://lkml.kernel.org/r/20190527082714.12151-1-ying.huang@intel.com
      Signed-off-by: Hui Zhu <teawaterz@linux.alibaba.com>
      Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Tested-by: NMike Kravetz <mike.kravetz@oracle.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Andrea Parri <andrea.parri@amarulasolutions.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      a042c2e0
    • mm, swap: fix race between swapoff and some swap operations · 73c29467
      Huang Ying authored
      commit eb085574a7526c4375965c5fbf7e5b0c19cdd336 upstream.
      Change SWP_VALID to (1 << 12).
      
      When swapin is performed, after getting the swap entry information from
      the page table, system will swap in the swap entry, without any lock held
      to prevent the swap device from being swapoff.  This may cause the race
      like below,
      
      CPU 1				CPU 2
      -----				-----
      				do_swap_page
      				  swapin_readahead
      				    __read_swap_cache_async
      swapoff				      swapcache_prepare
        p->swap_map = NULL		        __swap_duplicate
      					  p->swap_map[?] /* !!! NULL pointer access */
      
      Because swapoff is usually done only at system shutdown, the race may
      not hit many people in practice.  But it is still a race that needs to
      be fixed.
      
      To fix the race, get_swap_device() is added to check whether the specified
      swap entry is valid in its swap device.  If so, it will keep the swap
      entry valid via preventing the swap device from being swapoff, until
      put_swap_device() is called.
      
      Because swapoff() is a very rare code path, to make the normal path run
      as fast as possible, rcu_read_lock/unlock() and synchronize_rcu() are
      used instead of a reference count to implement get/put_swap_device().
      From get_swap_device() to put_swap_device(), the RCU read side is
      locked, so synchronize_rcu() in swapoff() will wait until
      put_swap_device() is called.
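
      The shape of the helper pair this describes (simplified sketch; the real
      get_swap_device() also validates the entry's type and offset):

          struct swap_info_struct *get_swap_device(swp_entry_t entry)
          {
                  struct swap_info_struct *si;

                  rcu_read_lock();        /* blocks swapoff's synchronize_rcu() */
                  si = swp_swap_info(entry);
                  if (!si || !(si->flags & SWP_VALID)) {
                          /* Device is being (or has been) swapped off. */
                          rcu_read_unlock();
                          return NULL;
                  }
                  return si;
          }

          void put_swap_device(struct swap_info_struct *si)
          {
                  rcu_read_unlock();
          }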
      
      In addition to swap_map, cluster_info, etc.  data structure in the struct
      swap_info_struct, the swap cache radix tree will be freed after swapoff,
      so this patch fixes the race between swap cache looking up and swapoff
      too.
      
      Races between some other swap cache usages and swapoff are fixed too via
      calling synchronize_rcu() between clearing PageSwapCache() and freeing
      swap cache data structure.
      
      Another possible method to fix this is to use preempt_off() +
      stop_machine() to prevent the swap device from being swapoff when its data
      structure is being accessed.  The overhead in hot-path of both methods is
      similar.  The advantages of RCU based method are,
      
      1. stop_machine() may disturb the normal execution code path on other
         CPUs.
      
      2. File cache uses RCU to protect its radix tree.  If the similar
         mechanism is used for swap cache too, it is easier to share code
         between them.
      
      3. RCU is used to protect swap cache in total_swapcache_pages() and
         exit_swap_address_space() already.  The two mechanisms can be
         merged to simplify the logic.
      
      Link: http://lkml.kernel.org/r/20190522015423.14418-1-ying.huang@intel.com
      Fixes: 235b6217 ("mm/swap: add cluster lock")
      Signed-off-by: NHui Zhu <teawaterz@linux.alibaba.com>
      Signed-off-by: N"Huang, Ying" <ying.huang@intel.com>
      Reviewed-by: NAndrea Parri <andrea.parri@amarulasolutions.com>
      Not-nacked-by: NHugh Dickins <hughd@google.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Michal Hocko <mhocko@suse.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tim Chen <tim.c.chen@linux.intel.com>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Jérôme Glisse <jglisse@redhat.com>
      Cc: Yang Shi <yang.shi@linux.alibaba.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Dave Jiang <dave.jiang@intel.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      73c29467
    • mm: swap: check if swap backing device is congested or not · 533c7f15
      Yang Shi authored
      commit 8fd2e0b505d124bbb046ab15de0ff6f8d4babf56 upstream.
      Change SWP_FS to SWP_FILE.
      
      Swap readahead reads in a few pages regardless of whether the underlying
      device is busy or not.  It may incur long waiting times if the device is
      congested, and it may also exacerbate the congestion.

      Use inode_read_congested() to check whether the underlying device is
      busy, like file page readahead does.  Get the inode from
      swap_info_struct.
      
      Although we could add the inode information to swap_address_space
      (address_space->host), it may lead to unexpected side effects, e.g. it
      may break mapping_cap_account_dirty().  Using the inode from
      swap_info_struct seems simple and good enough.
      
      The check is only done in vma_cluster_readahead() since
      swap_vma_readahead() is only used for non-rotational devices, which are
      much less likely to be congested than a traditional HDD.

      Although swap slots may be consecutive on a swap partition, they may
      still be fragmented in a swap file.  This check helps reduce excessive
      stalls in such cases.
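
      The check amounts to something like the following in the cluster
      readahead path (an illustrative helper, not the upstream diff;
      si->swap_file is how the backing inode is reached):

          static bool swap_backing_dev_congested(swp_entry_t entry)
          {
                  struct swap_info_struct *si = swp_swap_info(entry);
                  struct inode *inode = si->swap_file->f_mapping->host;

                  /* If the device is congested, skip readahead and just
                   * read the faulting page itself. */
                  return inode_read_congested(inode);
          }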
      
      The test with page_fault1 of will-it-scale (tracing may sometimes just
      show runtest.py, which is the wrapper script of page_fault1), which
      basically launches NR_CPU threads each generating 128MB of anonymous
      pages, on my virtual machine with a congested HDD shows that long tail
      latency is reduced significantly.
      
      Without the patch
       page_fault1_thr-1490  [023]   129.311706: funcgraph_entry:      #57377.796 us |  do_swap_page();
       page_fault1_thr-1490  [023]   129.369103: funcgraph_entry:        5.642us   |  do_swap_page();
       page_fault1_thr-1490  [023]   129.369119: funcgraph_entry:      #1289.592 us |  do_swap_page();
       page_fault1_thr-1490  [023]   129.370411: funcgraph_entry:        4.957us   |  do_swap_page();
       page_fault1_thr-1490  [023]   129.370419: funcgraph_entry:        1.940us   |  do_swap_page();
       page_fault1_thr-1490  [023]   129.378847: funcgraph_entry:      #1411.385 us |  do_swap_page();
       page_fault1_thr-1490  [023]   129.380262: funcgraph_entry:        3.916us   |  do_swap_page();
       page_fault1_thr-1490  [023]   129.380275: funcgraph_entry:      #4287.751 us |  do_swap_page();
      
      With the patch
            runtest.py-1417  [020]   301.925911: funcgraph_entry:      #9870.146 us |  do_swap_page();
            runtest.py-1417  [020]   301.935785: funcgraph_entry:        9.802us   |  do_swap_page();
            runtest.py-1417  [020]   301.935799: funcgraph_entry:        3.551us   |  do_swap_page();
            runtest.py-1417  [020]   301.935806: funcgraph_entry:        2.142us   |  do_swap_page();
            runtest.py-1417  [020]   301.935853: funcgraph_entry:        6.938us   |  do_swap_page();
            runtest.py-1417  [020]   301.935864: funcgraph_entry:        3.765us   |  do_swap_page();
            runtest.py-1417  [020]   301.935871: funcgraph_entry:        3.600us   |  do_swap_page();
            runtest.py-1417  [020]   301.935878: funcgraph_entry:        7.202us   |  do_swap_page();
      
      [akpm@linux-foundation.org: code cleanup]
      [yang.shi@linux.alibaba.com: add comment]
        Link: http://lkml.kernel.org/r/bbc7bda7-62d0-df1a-23ef-d369e865bdca@linux.alibaba.com
      Link: http://lkml.kernel.org/r/1546543673-108536-1-git-send-email-yang.shi@linux.alibaba.com
      Signed-off-by: Hui Zhu <teawaterz@linux.alibaba.com>
      Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com>
      Acked-by: Tim Chen <tim.c.chen@intel.com>
      Reviewed-by: NAndrew Morton <akpm@linux-foundation.org>
      Cc: Huang Ying <ying.huang@intel.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: Jens Axboe <axboe@kernel.dk>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      533c7f15
    • vmscan: return NODE_RECLAIM_NOSCAN in node_reclaim() when CONFIG_NUMA is n · e431b612
      Wei Yang authored
      commit 8b09549c2bfd9f3f8f4cdad74107ef4f4ff9cdd7 upstream.
      
      Commit fa5e084e ("vmscan: do not unconditionally treat zones that
      fail zone_reclaim() as full") changed the return value of
      node_reclaim().  The original return value 0 means NODE_RECLAIM_SOME
      after this commit.
      
      However, the return value of node_reclaim() when CONFIG_NUMA is n was
      not changed.  This leads to zone_watermark_ok() being called again.
      
      This patch fixes the return value by adjusting to NODE_RECLAIM_NOSCAN.
      Since node_reclaim() is only called in page_alloc.c, move it to
      mm/internal.h.
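
      That is, the !CONFIG_NUMA stub in mm/internal.h becomes something like:

          #ifdef CONFIG_NUMA
          extern int node_reclaim(struct pglist_data *pgdat, gfp_t mask,
                                  unsigned int order);
          #else
          static inline int node_reclaim(struct pglist_data *pgdat, gfp_t mask,
                                         unsigned int order)
          {
                  return NODE_RECLAIM_NOSCAN;     /* was 0 == NODE_RECLAIM_SOME */
          }
          #endif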
      
      Link: http://lkml.kernel.org/r/20181113080436.22078-1-richard.weiyang@gmail.com
      Signed-off-by: Hui Zhu <teawaterz@linux.alibaba.com>
      Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NMatthew Wilcox <willy@infradead.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      e431b612
  7. 30 Oct 2019 (14 commits)
    • zswap: use movable memory if zpool support allocate movable memory · 0a943f50
      Hui Zhu authored
      commit d2fcd82bb83aab47c6d63aa8c960cd5edb578065 upstream
      
      This is the third version that was updated according to the comments from
      Sergey Senozhatsky https://lkml.org/lkml/2019/5/29/73 and Shakeel Butt
      https://lkml.org/lkml/2019/6/4/973
      
      zswap compresses swap pages into a dynamically allocated RAM-based
      memory pool.  The memory pool can be zbud, z3fold or zsmalloc.  All of
      them allocate unmovable pages, which increases the number of unmovable
      page blocks and is bad for anti-fragmentation.
      
      zsmalloc supports page migration if movable pages are requested:
              handle = zs_malloc(zram->mem_pool, comp_len,
                      GFP_NOIO | __GFP_HIGHMEM |
                      __GFP_MOVABLE);
      
      And commit "zpool: Add malloc_support_movable to zpool_driver" add
      zpool_malloc_support_movable check malloc_support_movable to make sure if
      a zpool support allocate movable memory.
      
      This commit let zswap allocate block with gfp
      __GFP_HIGHMEM | __GFP_MOVABLE if zpool support allocate movable memory.
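
      In zswap's store path the allocation then looks roughly like this
      (sketch; the variable names approximate zswap_frontswap_store()):

          gfp_t gfp = __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM;

          if (zpool_malloc_support_movable(entry->pool->zpool))
                  gfp |= __GFP_HIGHMEM | __GFP_MOVABLE; /* zsmalloc can migrate */

          ret = zpool_malloc(entry->pool->zpool, hlen + dlen, gfp, &handle);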
      
      The following is a test log on a PC with 8G memory and 2G swap.
      
      Without this commit:
      ~# echo lz4 > /sys/module/zswap/parameters/compressor
      ~# echo zsmalloc > /sys/module/zswap/parameters/zpool
      ~# echo 1 > /sys/module/zswap/parameters/enabled
      ~# swapon /swapfile
      ~# cd /home/teawater/kernel/vm-scalability/
      /home/teawater/kernel/vm-scalability# export unit_size=$((9 * 1024 * 1024 * 1024))
      /home/teawater/kernel/vm-scalability# ./case-anon-w-seq
      2717908992 bytes / 4826062 usecs = 549973 KB/s
      2717908992 bytes / 4864201 usecs = 545661 KB/s
      2717908992 bytes / 4867015 usecs = 545346 KB/s
      2717908992 bytes / 4915485 usecs = 539968 KB/s
      397853 usecs to free memory
      357820 usecs to free memory
      421333 usecs to free memory
      420454 usecs to free memory
      /home/teawater/kernel/vm-scalability# cat /proc/pagetypeinfo
      Page block order: 9
      Pages per block:  512
      
      Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
      Node    0, zone      DMA, type    Unmovable      1      1      1      0      2      1      1      0      1      0      0
      Node    0, zone      DMA, type      Movable      0      0      0      0      0      0      0      0      0      1      3
      Node    0, zone      DMA, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone      DMA, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone      DMA, type          CMA      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone      DMA, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone    DMA32, type    Unmovable      6      5      8      6      6      5      4      1      1      1      0
      Node    0, zone    DMA32, type      Movable     25     20     20     19     22     15     14     11     11      5    767
      Node    0, zone    DMA32, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone    DMA32, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone    DMA32, type          CMA      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone    DMA32, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone   Normal, type    Unmovable   4753   5588   5159   4613   3712   2520   1448    594    188     11      0
      Node    0, zone   Normal, type      Movable     16      3    457   2648   2143   1435    860    459    223    224    296
      Node    0, zone   Normal, type  Reclaimable      0      0     44     38     11      2      0      0      0      0      0
      Node    0, zone   Normal, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone   Normal, type          CMA      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone   Normal, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
      
      Number of blocks type     Unmovable      Movable  Reclaimable   HighAtomic          CMA      Isolate
      Node 0, zone      DMA            1            7            0            0            0            0
      Node 0, zone    DMA32            4         1652            0            0            0            0
      Node 0, zone   Normal          931         1485           15            0            0            0
      
      With this commit:
      ~# echo lz4 > /sys/module/zswap/parameters/compressor
      ~# echo zsmalloc > /sys/module/zswap/parameters/zpool
      ~# echo 1 > /sys/module/zswap/parameters/enabled
      ~# swapon /swapfile
      ~# cd /home/teawater/kernel/vm-scalability/
      /home/teawater/kernel/vm-scalability# export unit_size=$((9 * 1024 * 1024 * 1024))
      /home/teawater/kernel/vm-scalability# ./case-anon-w-seq
      2717908992 bytes / 4689240 usecs = 566020 KB/s
      2717908992 bytes / 4760605 usecs = 557535 KB/s
      2717908992 bytes / 4803621 usecs = 552543 KB/s
      2717908992 bytes / 5069828 usecs = 523530 KB/s
      431546 usecs to free memory
      383397 usecs to free memory
      456454 usecs to free memory
      224487 usecs to free memory
      /home/teawater/kernel/vm-scalability# cat /proc/pagetypeinfo
      Page block order: 9
      Pages per block:  512
      
      Free pages count per migrate type at order       0      1      2      3      4      5      6      7      8      9     10
      Node    0, zone      DMA, type    Unmovable      1      1      1      0      2      1      1      0      1      0      0
      Node    0, zone      DMA, type      Movable      0      0      0      0      0      0      0      0      0      1      3
      Node    0, zone      DMA, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone      DMA, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone      DMA, type          CMA      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone      DMA, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone    DMA32, type    Unmovable     10      8     10      9     10      4      3      2      3      0      0
      Node    0, zone    DMA32, type      Movable     18     12     14     16     16     11      9      5      5      6    775
      Node    0, zone    DMA32, type  Reclaimable      0      0      0      0      0      0      0      0      0      0      1
      Node    0, zone    DMA32, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone    DMA32, type          CMA      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone    DMA32, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone   Normal, type    Unmovable   2669   1236    452    118     37     14      4      1      2      3      0
      Node    0, zone   Normal, type      Movable   3850   6086   5274   4327   3510   2494   1520    934    438    220    470
      Node    0, zone   Normal, type  Reclaimable     56     93    155    124     47     31     17      7      3      0      0
      Node    0, zone   Normal, type   HighAtomic      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone   Normal, type          CMA      0      0      0      0      0      0      0      0      0      0      0
      Node    0, zone   Normal, type      Isolate      0      0      0      0      0      0      0      0      0      0      0
      
      Number of blocks type     Unmovable      Movable  Reclaimable   HighAtomic          CMA      Isolate
      Node 0, zone      DMA            1            7            0            0            0            0
      Node 0, zone    DMA32            4         1650            2            0            0            0
      Node 0, zone   Normal           79         2326           26            0            0            0
      
      You can see that the number of unmovable page blocks is decreased
      when the kernel has this commit.
      
      Link: http://lkml.kernel.org/r/20190605100630.13293-2-teawaterz@linux.alibaba.com
      Signed-off-by: Hui Zhu <teawaterz@linux.alibaba.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      0a943f50
    • H
      zpool: add malloc_support_movable to zpool_driver · e874b6e5
      Hui Zhu 提交于
      commit c165f25d23ecb2f9f121ced20435415b931219e2 upstream
      
      As a zpool_driver, zsmalloc can allocate movable memory because it
      supports migrating pages.  But zbud and z3fold cannot allocate movable
      memory.

      Add malloc_support_movable to zpool_driver.  If a zpool_driver supports
      allocating movable memory, set it to true.  And add
      zpool_malloc_support_movable(), which checks malloc_support_movable, so
      callers can tell whether a zpool supports allocating movable memory.
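
      To make the field concrete, here is a minimal userspace-only sketch of
      the pattern (not the kernel code; the structs below are simplified
      stand-ins for the real zpool types): a driver advertises whether its
      allocator can hand out movable memory, and a helper in the style of
      zpool_malloc_support_movable() lets the caller decide whether movable
      allocations are safe.

        #include <stdbool.h>
        #include <stdio.h>

        /* Simplified stand-ins for the zpool structures described above. */
        struct zpool_driver {
            const char *type;
            bool malloc_support_movable;    /* the new field */
        };

        struct zpool {
            const struct zpool_driver *driver;
        };

        /* Helper in the spirit of zpool_malloc_support_movable(). */
        static bool zpool_malloc_support_movable(const struct zpool *pool)
        {
            return pool->driver->malloc_support_movable;
        }

        int main(void)
        {
            const struct zpool_driver zsmalloc = { "zsmalloc", true };
            const struct zpool_driver zbud = { "zbud", false };
            const struct zpool pools[] = { { &zsmalloc }, { &zbud } };

            for (unsigned i = 0; i < 2; i++)
                printf("%s: movable allocations %ssupported\n",
                       pools[i].driver->type,
                       zpool_malloc_support_movable(&pools[i]) ? "" : "not ");
            return 0;
        }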
      
      Link: http://lkml.kernel.org/r/20190605100630.13293-1-teawaterz@linux.alibaba.com
      Signed-off-by: NHui Zhu <teawaterz@linux.alibaba.com>
      Reviewed-by: NShakeel Butt <shakeelb@google.com>
      Cc: Dan Streetman <ddstreet@ieee.org>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Nitin Gupta <ngupta@vflare.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com>
      Cc: Seth Jennings <sjenning@redhat.com>
      Cc: Vitaly Wool <vitalywool@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      e874b6e5
    • D
      mm/page-writeback.c: fix range_cyclic writeback vs writepages deadlock · a432d982
      Dave Chinner 提交于
      commit 64081362e8ff4587b4554087f3cfc73d3e0a4cd7 upstream.
      
      We've recently seen a workload on XFS filesystems with a repeatable
      deadlock between background writeback and a multi-process application
      doing concurrent writes and fsyncs to a small range of a file.
      
      range_cyclic
      writeback		Process 1		Process 2
      
      xfs_vm_writepages
        write_cache_pages
          writeback_index = 2
          cycled = 0
          ....
          find page 2 dirty
          lock Page 2
          ->writepage
            page 2 writeback
            page 2 clean
            page 2 added to bio
          no more pages
      			write()
      			locks page 1
      			dirties page 1
      			locks page 2
      			dirties page 1
      			fsync()
      			....
      			xfs_vm_writepages
      			write_cache_pages
      			  start index 0
      			  find page 1 towrite
      			  lock Page 1
      			  ->writepage
      			    page 1 writeback
      			    page 1 clean
      			    page 1 added to bio
      			  find page 2 towrite
      			  lock Page 2
      			  page 2 is writeback
      			  <blocks>
      						write()
      						locks page 1
      						dirties page 1
      						fsync()
      						....
      						xfs_vm_writepages
      						write_cache_pages
      						  start index 0
      
          !done && !cycled
            sets index to 0, restarts lookup
          find page 1 dirty
      						  find page 1 towrite
      						  lock Page 1
      						  page 1 is writeback
      						  <blocks>
      
          lock Page 1
          <blocks>
      
      DEADLOCK because:
      
      	- process 1 needs page 2 writeback to complete to make
      	  enough progress to issue IO pending for page 1
      	- writeback needs page 1 writeback to complete so process 2
      	  can progress and unlock the page it is blocked on, then it
      	  can issue the IO pending for page 2
      	- process 2 can't make progress until process 1 issues IO
      	  for page 1
      
      The underlying cause of the problem here is that range_cyclic writeback is
      processing pages in descending index order as we hold higher index pages
      in a structure controlled from above write_cache_pages().  The
      write_cache_pages() caller needs to be able to submit these pages for IO
      before write_cache_pages restarts writeback at mapping index 0 to avoid
      wcp inverting the page lock/writeback wait order.
      
      generic_writepages() is not susceptible to this bug as it has no private
      context held across write_cache_pages() - filesystems using this
      infrastructure always submit pages in ->writepage immediately and so there
      is no problem with range_cyclic going back to mapping index 0.
      
      However:
      	mpage_writepages() has a private bio context,
      	exofs_writepages() has page_collect
      	fuse_writepages() has fuse_fill_wb_data
      	nfs_writepages() has nfs_pageio_descriptor
      	xfs_vm_writepages() has xfs_writepage_ctx
      
      All of these ->writepages implementations can hold pages under writeback
      in their private structures until write_cache_pages() returns, and hence
      they are all susceptible to this deadlock.
      
      Also worth noting is that ext4 has its own bastardised version of
      write_cache_pages() and so it /may/ have an equivalent deadlock.  I looked
      at the code long enough to understand that it has a similar retry loop for
      range_cyclic writeback reaching the end of the file and then promptly ran
      away before my eyes bled too much.  I'll leave it for the ext4 developers
      to determine if their code actually has this deadlock and how to fix it
      if it does.
      
      There are a few ways I can see to avoid this deadlock.  There are probably
      more, but these are the first I've thought of:
      
      1. get rid of range_cyclic altogether
      
      2. range_cyclic always stops at EOF, and we start again from
      writeback index 0 on the next call into write_cache_pages()
      
      2a. wcp also returns EAGAIN to ->writepages implementations to
      indicate range cyclic has hit EOF. writepages implementations can
      then flush the current context and call wpc again to continue. i.e.
      lift the retry into the ->writepages implementation
      
      3. range_cyclic uses trylock_page() rather than lock_page(), and it
      skips pages it can't lock without blocking. It will already do this
      for pages under writeback, so this seems like a no-brainer
      
      3a. all non-WB_SYNC_ALL writeback uses trylock_page() to avoid
      blocking as per pages under writeback.
      
      I don't think #1 is an option - range_cyclic prevents frequently
      dirtied lower file offset from starving background writeback of
      rarely touched higher file offsets.
      
      #2 is simple, and I don't think it will have any impact on
      performance as going back to the start of the file implies an
      immediate seek. We'll have exactly the same number of seeks if we
      switch writeback to another inode, and then come back to this one
      later and restart from index 0.
      
      #2a is pretty much "status quo without the deadlock".  Moving the
      retry loop up into the wcp caller means we can issue IO on the
      pending pages before calling wcp again, and so avoid locking or
      waiting on pages in the wrong order. I'm not convinced we need to do
      this given that we get the same thing from #2 on the next writeback
      call from the writeback infrastructure.
      
      #3 is really just a band-aid - it doesn't fix the access/wait
      inversion problem, just prevents it from becoming a deadlock
      situation. I'd prefer we fix the inversion, not sweep it under the
      carpet like this.
      
      #3a is really an optimisation that just so happens to include the
      band-aid fix of #3.
      
      So it seems that the simplest way to fix this issue is to implement
      solution #2
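
      As an illustration only, here is a tiny userspace model of solution #2
      (this is not the kernel implementation; the mapping struct and page
      array are invented for the example): the cyclic scan starts from the
      saved writeback_index, never wraps back to index 0 within the same
      call, and simply records index 0 as the start of the next call once it
      reaches EOF, so the caller can submit its held pages in between.

        #include <stdbool.h>
        #include <stdio.h>

        #define NPAGES 8

        struct mapping {
            bool dirty[NPAGES];
            unsigned long writeback_index;  /* start of the next cyclic scan */
        };

        /* One range_cyclic pass that stops at EOF (solution #2): dirty pages
         * below the start index are left for the next call, which begins at
         * index 0, instead of wrapping around inside this call. */
        static void write_cache_pages_model(struct mapping *m)
        {
            for (unsigned long i = m->writeback_index; i < NPAGES; i++) {
                if (m->dirty[i]) {
                    printf("  writing page %lu\n", i);
                    m->dirty[i] = false;
                }
            }
            m->writeback_index = 0;     /* restart from 0 on the next call */
        }

        int main(void)
        {
            struct mapping m = { .writeback_index = 2 };
            m.dirty[1] = m.dirty[2] = m.dirty[5] = true;

            printf("first call, starting at index %lu:\n", m.writeback_index);
            write_cache_pages_model(&m);    /* writes pages 2 and 5 only */

            printf("second call, starting at index %lu:\n", m.writeback_index);
            write_cache_pages_model(&m);    /* picks up page 1 */
            return 0;
        }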
      
      Link: http://lkml.kernel.org/r/20181005054526.21507-1-david@fromorbit.com
      Signed-off-by: NDave Chinner <dchinner@redhat.com>
      Reviewed-by: NJan Kara <jack@suse.de>
      Cc: Nicholas Piggin <npiggin@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      a432d982
    • A
      userfaultfd: allow get_mempolicy(MPOL_F_NODE|MPOL_F_ADDR) to trigger userfaults · 6fc1da0b
      Andrea Arcangeli 提交于
      commit 3b9aadf7278d16d7bed4d5d808501065f70898d8 upstream.
      
      get_mempolicy(MPOL_F_NODE|MPOL_F_ADDR) called a get_user_pages that would
      not wait for userfaults before failing, and it would hit a SIGBUS
      instead.  Using get_user_pages_locked/unlocked instead will allow
      get_mempolicy to let userfaults resolve the fault and fill the hole,
      before grabbing the node id of the page.
      
      If the user calls get_mempolicy() with MPOL_F_ADDR | MPOL_F_NODE for an
      address inside an area managed by uffd and there is no page at that
      address, the page allocation from within get_mempolicy() will fail
      because get_user_pages() does not allow for page fault retry required
      for uffd; the user will get SIGBUS.
      
      With this patch, the page fault will be resolved by the uffd and the
      get_mempolicy() will continue normally.
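
      For reference, the short program below (an illustration, not part of the
      patch; the file name and build line are made up, and it links against
      libnuma for the get_mempolicy() wrapper) shows the query being
      discussed: asking which NUMA node backs a mapped page.  On a
      userfaultfd-registered range with a missing page, this is the call that
      used to fail with -EFAULT and now waits for the fault to be resolved.

        /* Build: gcc -o nodeof nodeof.c -lnuma */
        #include <numaif.h>
        #include <stdio.h>
        #include <string.h>
        #include <sys/mman.h>
        #include <unistd.h>

        int main(void)
        {
            long page = sysconf(_SC_PAGESIZE);
            char *buf = mmap(NULL, page, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (buf == MAP_FAILED) {
                perror("mmap");
                return 1;
            }
            memset(buf, 0, page);       /* fault the page in first */

            int node = -1;
            /* MPOL_F_NODE | MPOL_F_ADDR: return the node id of the page
             * backing 'buf' rather than a policy value. */
            if (get_mempolicy(&node, NULL, 0, buf,
                              MPOL_F_NODE | MPOL_F_ADDR)) {
                perror("get_mempolicy");
                return 1;
            }
            printf("page at %p is on node %d\n", (void *)buf, node);
            return 0;
        }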
      
      Background:
      
      Via code review, previously the syscall would have returned -EFAULT
      (vm_fault_to_errno); now it will block and wait for a userfault (if
      it's woken before the fault is resolved it'll still return -EFAULT).
      
      This way get_mempolicy will give a chance to an "unaware" app to be
      compliant with userfaults.
      
      The reason this change is visible is that becoming "userfault compliant"
      cannot regress anything: all other syscalls including read(2)/write(2)
      had to become "userfault compliant" a long time ago (that's one of the
      things userfaultfd can do that PROT_NONE and trapping segfaults can't).

      So this is just one more syscall that becomes "userfault compliant", like
      all the other major ones already were.
      
      This has been happening on virtio-bridge dpdk process which just called
      get_mempolicy on the guest space post live migration, but before the
      memory had a chance to be migrated to destination.
      
      I didn't run an strace to be able to show the -EFAULT going away, but
      I've the confirmation of the below debug aid information (only visible
      with CONFIG_DEBUG_VM=y) going away with the patch:
      
          [20116.371461] FAULT_FLAG_ALLOW_RETRY missing 0
          [20116.371464] CPU: 1 PID: 13381 Comm: vhost-events Not tainted 4.17.12-200.fc28.x86_64 #1
          [20116.371465] Hardware name: LENOVO 20FAS2BN0A/20FAS2BN0A, BIOS N1CET54W (1.22 ) 02/10/2017
          [20116.371466] Call Trace:
          [20116.371473]  dump_stack+0x5c/0x80
          [20116.371476]  handle_userfault.cold.37+0x1b/0x22
          [20116.371479]  ? remove_wait_queue+0x20/0x60
          [20116.371481]  ? poll_freewait+0x45/0xa0
          [20116.371483]  ? do_sys_poll+0x31c/0x520
          [20116.371485]  ? radix_tree_lookup_slot+0x1e/0x50
          [20116.371488]  shmem_getpage_gfp+0xce7/0xe50
          [20116.371491]  ? page_add_file_rmap+0x1a/0x2c0
          [20116.371493]  shmem_fault+0x78/0x1e0
          [20116.371495]  ? filemap_map_pages+0x3a1/0x450
          [20116.371498]  __do_fault+0x1f/0xc0
          [20116.371500]  __handle_mm_fault+0xe2e/0x12f0
          [20116.371502]  handle_mm_fault+0xda/0x200
          [20116.371504]  __get_user_pages+0x238/0x790
          [20116.371506]  get_user_pages+0x3e/0x50
          [20116.371510]  kernel_get_mempolicy+0x40b/0x700
          [20116.371512]  ? vfs_write+0x170/0x1a0
          [20116.371515]  __x64_sys_get_mempolicy+0x21/0x30
          [20116.371517]  do_syscall_64+0x5b/0x160
          [20116.371520]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
      
      The above harmless debug message (not a kernel crash, just a
      dump_stack()) is shown with CONFIG_DEBUG_VM=y to more quickly identify
      and improve kernel spots that may have to become "userfaultfd
      compliant" like this one (without having to run an strace and search
      for syscall misbehavior).  Spots like the above are closer to a
      kernel bug for the non-cooperative usages that Mike focuses on than
      for the dpdk qemu-cooperative usages that reproduced it, but it's still
      nicer to get this fixed for dpdk too.
      
      The part of the patch that caused me to think is only the
      implementation issue of mpol_get, but it looks like it should work
      safely no matter what kind of mempolicy structure it is (the default
      static policy also starts at 1 so it'll go to 2 and back to 1 without
      crashing everything at 0).
      
      [rppt@linux.vnet.ibm.com: changelog addition]
        http://lkml.kernel.org/r/20180904073718.GA26916@rapoport-lnx
      Link: http://lkml.kernel.org/r/20180831214848.23676-1-aarcange@redhat.com
      Signed-off-by: NAndrea Arcangeli <aarcange@redhat.com>
      Reported-by: NMaxime Coquelin <maxime.coquelin@redhat.com>
      Tested-by: NDr. David Alan Gilbert <dgilbert@redhat.com>
      Reviewed-by: NMike Rapoport <rppt@linux.vnet.ibm.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      6fc1da0b
    • J
      psi: pressure stall information for CPU, memory, and IO · 6175397b
      Johannes Weiner 提交于
      commit eb414681d5a07d28d2ff90dc05f69ec6b232ebd2 upstream.
      
      When systems are overcommitted and resources become contended, it's hard
      to tell exactly the impact this has on workload productivity, or how close
      the system is to lockups and OOM kills.  In particular, when machines work
      multiple jobs concurrently, the impact of overcommit in terms of latency
      and throughput on the individual job can be enormous.
      
      In order to maximize hardware utilization without sacrificing individual
      job health or risk complete machine lockups, this patch implements a way
      to quantify resource pressure in the system.
      
      A kernel built with CONFIG_PSI=y creates files in /proc/pressure/ that
      expose the percentage of time the system is stalled on CPU, memory, or IO,
      respectively.  Stall states are aggregate versions of the per-task delay
      accounting delays:
      
             cpu: some tasks are runnable but not executing on a CPU
             memory: tasks are reclaiming, or waiting for swapin or thrashing cache
             io: tasks are waiting for io completions
      
      These percentages of walltime can be thought of as pressure percentages,
      and they give a general sense of system health and productivity loss
      incurred by resource overcommit.  They can also indicate when the system
      is approaching lockup scenarios and OOMs.
      
      To do this, psi keeps track of the task states associated with each CPU
      and samples the time they spend in stall states.  Every 2 seconds, the
      samples are averaged across CPUs - weighted by the CPUs' non-idle time to
      eliminate artifacts from unused CPUs - and translated into percentages of
      walltime.  A running average of those percentages is maintained over 10s,
      1m, and 5m periods (similar to the loadaverage).
      
      [hannes@cmpxchg.org: doc fixlet, per Randy]
        Link: http://lkml.kernel.org/r/20180828205625.GA14030@cmpxchg.org
      [hannes@cmpxchg.org: code optimization]
        Link: http://lkml.kernel.org/r/20180907175015.GA8479@cmpxchg.org
      [hannes@cmpxchg.org: rename psi_clock() to psi_update_work(), per Peter]
        Link: http://lkml.kernel.org/r/20180907145404.GB11088@cmpxchg.org
      [hannes@cmpxchg.org: fix build]
        Link: http://lkml.kernel.org/r/20180913014222.GA2370@cmpxchg.org
      Link: http://lkml.kernel.org/r/20180828172258.3185-9-hannes@cmpxchg.org
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: NDaniel Drake <drake@endlessm.com>
      Tested-by: NSuren Baghdasaryan <surenb@google.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <jweiner@fb.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Enderborg <peter.enderborg@sony.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      [Joseph: fix apply conflicts in task_struct]
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      6175397b
    • J
      delayacct: track delays from thrashing cache pages · 4b7c32ef
      Johannes Weiner 提交于
      commit b1d29ba82cf2bc784f4c963ddd6a2cf29e229b33 upstream.
      
      Delay accounting already measures the time a task spends in direct reclaim
      and waiting for swapin, but in low memory situations tasks can spend
      a significant amount of their time waiting on thrashing page cache.  This
      isn't tracked right now.
      
      To know the full impact of memory contention on an individual task,
      measure the delay when waiting for a recently evicted active cache page to
      read back into memory.
      
      Also update tools/accounting/getdelays.c:
      
           [hannes@computer accounting]$ sudo ./getdelays -d -p 1
           print delayacct stats ON
           PID     1
      
           CPU             count     real total  virtual total    delay total  delay average
                           50318      745000000      847346785      400533713          0.008ms
           IO              count    delay total  delay average
                             435      122601218              0ms
           SWAP            count    delay total  delay average
                               0              0              0ms
           RECLAIM         count    delay total  delay average
                               0              0              0ms
           THRASHING       count    delay total  delay average
                              19       12621439              0ms
      
      Link: http://lkml.kernel.org/r/20180828172258.3185-4-hannes@cmpxchg.org
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: NDaniel Drake <drake@endlessm.com>
      Tested-by: NSuren Baghdasaryan <surenb@google.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <jweiner@fb.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Enderborg <peter.enderborg@sony.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      4b7c32ef
    • J
      mm: workingset: tell cache transitions from workingset thrashing · 41c9ca31
      Johannes Weiner 提交于
      commit 1899ad18c6072d689896badafb81267b0a1092a4 upstream.
      
      Refaults happen during transitions between workingsets as well as in-place
      thrashing.  Knowing the difference between the two has a range of
      applications, including measuring the impact of memory shortage on the
      system performance, as well as the ability to smarter balance pressure
      between the filesystem cache and the swap-backed workingset.
      
      During workingset transitions, inactive cache refaults and pushes out
      established active cache.  When that active cache isn't stale, however,
      and also ends up refaulting, that's bonafide thrashing.
      
      Introduce a new page flag that tells on eviction whether the page has been
      active or not in its lifetime.  This bit is then stored in the shadow
      entry, to classify refaults as transitioning or thrashing.
      
      How many page->flags does this leave us with on 32-bit?
      
      	20 bits are always page flags
      
      	21 if you have an MMU
      
      	23 with the zone bits for DMA, Normal, HighMem, Movable
      
      	29 with the sparsemem section bits
      
      	30 if PAE is enabled
      
      	31 with this patch.
      
      So on 32-bit PAE, that leaves 1 bit for distinguishing two NUMA nodes.  If
      that's not enough, the system can switch to discontigmem and re-gain the 6
      or 7 sparsemem section bits.
      
      Link: http://lkml.kernel.org/r/20180828172258.3185-3-hannes@cmpxchg.org
      Signed-off-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Tested-by: NDaniel Drake <drake@endlessm.com>
      Tested-by: NSuren Baghdasaryan <surenb@google.com>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Johannes Weiner <jweiner@fb.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Enderborg <peter.enderborg@sony.com>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      41c9ca31
    • J
      mm: workingset: don't drop refault information prematurely · d9799717
      Johannes Weiner 提交于
      commit 95f9ab2d596e8cbb388315e78c82b9a131bf2928 upstream.
      
      Patch series "psi: pressure stall information for CPU, memory, and IO", v4.
      
      		Overview
      
      PSI reports the overall wallclock time in which the tasks in a system (or
      cgroup) wait for (contended) hardware resources.
      
      This helps users understand the resource pressure their workloads are
      under, which allows them to rootcause and fix throughput and latency
      problems caused by overcommitting, underprovisioning, suboptimal job
      placement in a grid; as well as anticipate major disruptions like OOM.
      
      		Real-world applications
      
      We're using the data collected by PSI (and its previous incarnation,
      memdelay) quite extensively at Facebook, and with several success stories.
      
      One usecase is avoiding OOM hangs/livelocks.  The reason these happen is
      because the OOM killer is triggered by reclaim not being able to free
      pages, but with fast flash devices there is *always* some clean and
      uptodate cache to reclaim; the OOM killer never kicks in, even as tasks
      spend 90% of the time thrashing the cache pages of their own executables.
      There is no situation where this ever makes sense in practice.  We wrote a
      <100 line POC python script to monitor memory pressure and kill stuff way
      before such pathological thrashing leads to full system losses that would
      require forcible hard resets.
      
      We've since extended and deployed this code into other places to guarantee
      latency and throughput SLAs, since they're usually violated way before the
      kernel OOM killer would ever kick in.
      
      It is available here: https://github.com/facebookincubator/oomd
      
      Eventually we probably want to trigger the in-kernel OOM killer based on
      extreme sustained pressure as well, so that Linux can avoid memory
      livelocks - which technically aren't deadlocks, but to the user
      indistinguishable from them - out of the box.  We'd continue using OOMD as
      the first line of defense to ensure workload health and implement complex
      kill policies that are beyond the scope of the kernel.
      
      We also use PSI memory pressure for loadshedding.  Our batch job
      infrastructure used to use heuristics based on various VM stats to
      anticipate OOM situations, with lackluster success.  We switched it to PSI
      and managed to anticipate and avoid OOM kills and lockups fairly reliably.
      The reduction of OOM outages in the worker pool raised the pool's
      aggregate productivity, and we were able to switch that service to smaller
      machines.
      
      Lastly, we use cgroups to isolate a machine's main workload from
      maintenance crap like package upgrades, logging, configuration, as well as
      to prevent multiple workloads on a machine from stepping on each others'
      toes.  We were not able to configure this properly without the pressure
      metrics; we would see latency or bandwidth drops, but it would often be
      hard to impossible to rootcause it post-mortem.
      
      We now log and graph pressure for the containers in our fleet and can
      trivially link latency spikes and throughput drops to shortages of
      specific resources after the fact, and fix the job config/scheduling.
      
      PSI has also received testing, feedback, and feature requests from Android
      and EndlessOS for the purpose of low-latency OOM killing, to intervene in
      pressure situations before the UI starts hanging.
      
      		How do you use this feature?
      
      A kernel with CONFIG_PSI=y will create a /proc/pressure directory with 3
      files: cpu, memory, and io.  If using cgroup2, cgroups will also have
      cpu.pressure, memory.pressure and io.pressure files, which simply
      aggregate task stalls at the cgroup level instead of system-wide.
      
      The cpu file contains one line:
      
      	some avg10=2.04 avg60=0.75 avg300=0.40 total=157656722
      
      The averages give the percentage of walltime in which one or more tasks
      are delayed on the runqueue while another task has the CPU.  They're
      recent averages over 10s, 1m, 5m windows, so you can tell short term
      trends from long term ones, similarly to the load average.
      
      The total= value gives the absolute stall time in microseconds.  This
      allows detecting latency spikes that might be too short to sway the
      running averages.  It also allows custom time averaging in case the
      10s/1m/5m windows aren't adequate for the usecase (or are too coarse with
      future hardware).
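
      As a small usage sketch (assuming only a kernel built with CONFIG_PSI=y
      and the line format shown above), a monitoring tool can read these
      fields directly; the program below parses the "some" line of
      /proc/pressure/cpu.

        #include <stdio.h>

        int main(void)
        {
            FILE *f = fopen("/proc/pressure/cpu", "r");
            float avg10, avg60, avg300;
            unsigned long long total;

            if (!f) {
                perror("/proc/pressure/cpu");
                return 1;
            }
            if (fscanf(f, "some avg10=%f avg60=%f avg300=%f total=%llu",
                       &avg10, &avg60, &avg300, &total) == 4)
                printf("cpu some: avg10=%.2f%% avg60=%.2f%% avg300=%.2f%% "
                       "total=%llu us\n", avg10, avg60, avg300, total);
            fclose(f);
            return 0;
        }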
      
      What to make of this "some" metric?  If CPU utilization is at 100% and CPU
      pressure is 0, it means the system is perfectly utilized, with one
      runnable thread per CPU and nobody waiting.  At two or more runnable tasks
      per CPU, the system is 100% overcommitted and the pressure average will
      indicate as much.  From a utilization perspective this is a great state of
      course: no CPU cycles are being wasted, even when 50% of the threads were
      to go idle (as most workloads do vary).  From the perspective of the
      individual job it's not great, however, and they would do better with more
      resources.  Depending on what your priority and options are, raised "some"
      numbers may or may not require action.
      
      The memory file contains two lines:
      
      some avg10=70.24 avg60=68.52 avg300=69.91 total=3559632828
      full avg10=57.59 avg60=58.06 avg300=60.38 total=3300487258
      
      The some line is the same as for cpu, the time in which at least one task
      is stalled on the resource.  In the case of memory, this includes waiting
      on swap-in, page cache refaults and page reclaim.
      
      The full line, however, indicates time in which *nobody* is using the CPU
      productively due to pressure: all non-idle tasks are waiting for memory in
      one form or another.  Significant time spent in there is a good trigger
      for killing things, moving jobs to other machines, or dropping incoming
      requests, since neither the jobs nor the machine overall are making too
      much headway.
      
      The io file is similar to memory.  Because the block layer doesn't have a
      concept of hardware contention right now (how much longer is my IO request
      taking due to other tasks?), it reports CPU potential lost on all IO
      delays, not just the potential lost due to competition.
      
      		FAQ
      
      Q: How is PSI's CPU component different from the load average?
      
      A: There are several quirks in the load average that make it hard to
         impossible to tell how overcommitted the CPU really is.
      
         1. The load average is reported as a raw number of active tasks.
            You need to know how many CPUs there are in the system, how many
            CPUs the workload is allowed to use, then think about what the
            proportion between load and the number of CPUs mean for the
            tasks trying to run.
      
            PSI reports the percentage of wallclock time in which tasks are
            waiting for a CPU to run on. It doesn't matter how many CPUs are
            present or usable. The number always tells the quality of life
            of tasks in the system or in a particular cgroup.
      
         2. The shortest averaging window is 1m, which is extremely coarse,
            and it's sampled in 5s intervals. A *lot* can happen on a CPU in
            5 seconds. This *may* be able to identify persistent long-term
            trends and very clear and obvious overloads, but it's unusable
            for latency spikes and more subtle overutilization.
      
            PSI's shortest window is 10s. It also exports the cumulative
            stall times (in microseconds) of synchronously recorded events.
      
         3. On Linux, the load average for historical reasons includes all
            TASK_UNINTERRUPTIBLE tasks. This gives a broader sense of how
            busy the system is, but on the flipside it doesn't distinguish
            whether tasks are likely to contend over the CPU or IO - which
            obviously requires very different interventions from a sys admin
            or a job scheduler.
      
            PSI reports independent metrics for CPU and IO. You can tell
            which resource is making the tasks wait, but in conjunction
            still see how overloaded the system is overall.
      
      Q: What's the cost / performance impact of this feature?
      
      A: PSI's primary cost is in the scheduler, in particular task wakeups
         and sleeps.
      
         I benchmarked this code using Facebook's two most scheduling
         sensitive workloads: memcache and webserver. They handle a ton of
         small requests - lots of wakeups and sleeps with little actual work
         in between - so they tend to be canaries for scheduler regressions.
      
         In the tests, the boxes were handling live traffic over the course
         of several hours. Half the machines, the control, ran with
         CONFIG_PSI=n.
      
         For memcache I used eight machines total. They're 2-socket, 14
         core, 56 thread boxes. The test runs for half the test period,
         flips the test and control kernels on the hardware to rule out HW
         factors, DC location etc., then runs the other half of the test.
      
         For the webservers, I used 32 machines total. They're single
         socket, 16 core, 32 thread machines.
      
         During the memcache test, CPU load was nopsi=78.05% psi=78.98% in
         the first half and nopsi=77.52% psi=78.25%, so PSI added between
         0.7 and 0.9 percentage points to the CPU load, a difference of
         about 1%.
      
         UPDATE: I re-ran this test with the v3 version of this patch set
         and the CPU utilization was equivalent between test and control.
      
         UPDATE: v4 is on par with v3.
      
         As far as end-to-end request latency from the client perspective
         goes, we don't sample those finely enough to capture the requests
         going to those particular machines during the test, but we know the
         p50 turnaround time in this workload is 54us, and perf bench sched
         pipe on those machines show nopsi=5.232666 us/op and psi=5.587347
         us/op, so this doesn't add much here either.
      
         The profile for the pipe benchmark shows:
      
              0.87%  sched-pipe  [kernel.vmlinux]    [k] psi_group_change
              0.83%  perf.real   [kernel.vmlinux]    [k] psi_group_change
              0.82%  perf.real   [kernel.vmlinux]    [k] psi_task_change
              0.58%  sched-pipe  [kernel.vmlinux]    [k] psi_task_change
      
         The webserver load is running inside 4 nested cgroup levels. The
         CPU load with both nopsi and psi kernels was indistinguishable at
         81%.
      
         For comparison, we had to disable the cgroup cpu controller on the
         webservers because it added 4 percentage points to the CPU% during
         this same exact test.
      
         Versions of this accounting code now run on 80% of our fleet. None
         of our workloads have reported regressions during the rollout.
      
      Daniel Drake said:
      
      : I just retested the latest version at
      : http://git.cmpxchg.org/cgit.cgi/linux-psi.git (Linux 4.18) and the results
      : are great.
      :
      : Test setup:
      : Endless OS
      : GeminiLake N4200 low end laptop
      : 2GB RAM
      : swap (and zram swap) disabled
      :
      : Baseline test: open a handful of large-ish apps and several website
      : tabs in Google Chrome.
      :
      : Results: after a couple of minutes, system is excessively thrashing, mouse
      : cursor can barely be moved, UI is not responding to mouse clicks, so it's
      : impractical to recover from this situation as an ordinary user
      :
      : Add my simple killer:
      : https://gist.github.com/dsd/a8988bf0b81a6163475988120fe8d9cd
      :
      : Results: when the thrashing causes the UI to become sluggish, the killer
      : steps in and kills something (usually a chrome tab), and the system
      : remains usable.  I repeatedly opened more apps and more websites over a 15
      : minute period but I wasn't able to get the system to a point of UI
      : unresponsiveness.
      
      Suren said:
      
      : Backported to 4.9 and retested on an ARMv8 8-core system running Android.
      : Signals behave as expected reacting to memory pressure, no jumps in
      : "total" counters that would indicate an overflow/underflow issues.  Nicely
      : done!
      
      This patch (of 9):
      
      If we keep just enough refault information to match the *current* page
      cache during reclaim time, we could lose a lot of events when there is
      only a temporary spike in non-cache memory consumption that pushes out all
      the cache.  Once cache comes back, we won't see those refaults.  They
      might not be actionable for LRU aging, but we want to know about them for
      measuring memory pressure.
      
      [hannes@cmpxchg.org: switch to NUMA-aware lru and slab counters]
        Link: http://lkml.kernel.org/r/20181009184732.762-2-hannes@cmpxchg.org
      Link: http://lkml.kernel.org/r/20180828172258.3185-2-hannes@cmpxchg.org
      Signed-off-by: NJohannes Weiner <jweiner@fb.com>
      Acked-by: NPeter Zijlstra (Intel) <peterz@infradead.org>
      Reviewed-by: NRik van Riel <riel@surriel.com>
      Tested-by: NDaniel Drake <drake@endlessm.com>
      Tested-by: NSuren Baghdasaryan <surenb@google.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vinayak Menon <vinmenon@codeaurora.org>
      Cc: Christopher Lameter <cl@linux.com>
      Cc: Peter Enderborg <peter.enderborg@sony.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      d9799717
    • K
      writeback: memcg_blkcg_tree_lock can be static · 0019fa8c
      kbuild test robot 提交于
      Fixes: 60448d43 ("writeback: add memcg_blkcg_link tree")
      Signed-off-by: Nkbuild test robot <lkp@intel.com>
      Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      0019fa8c
    • J
      fs/writeback: wrap cgroup writeback v1 logic · 38485c5c
      Joseph Qi 提交于
      Wrap cgroup writeback v1 logic to prevent build errors without
      CONFIG_CGROUPS or CONFIG_CGROUP_WRITEBACK.
      Reported-by: Nkbuild test robot <lkp@intel.com>
      Cc: Jiufei Xue <jiufei.xue@linux.alibaba.com>
      Signed-off-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      Acked-by: NCaspar Zhang <caspar@linux.alibaba.com>
      38485c5c
    • J
      writeback: introduce cgwb_v1 boot param · f91c270d
      Jiufei Xue 提交于
      So far writeback control is supported for the cgroup v1 interface. However,
      it also has some restrictions, so introduce a new kernel boot parameter
      to control the behavior, which is disabled by default. Users can enable
      writeback control for cgroup v1 with the command line parameter "cgwb_v1".
      Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      f91c270d
    • J
      fs/writeback: fix double free of blkcg_css · f935fb62
      Jiufei Xue 提交于
      We have gotten a WARNING when releasing blkcg_css:
      
      [332489.681635] WARNING: CPU: 55 PID: 14859 at lib/list_debug.c:56 __list_del_entry+0x81/0xc0
      [332489.682191] list_del corruption, ffff883e6b94d450->prev is LIST_POISON2 (dead000000000200)
      ......
      [332489.683895] CPU: 55 PID: 14859 Comm: kworker/55:2 Tainted: G
      [332489.684477] Hardware name: Inspur SA5248M4/X10DRT-PS, BIOS 4.05A
      10/11/2016
      [332489.685061] Workqueue: cgroup_destroy css_release_work_fn
      [332489.685654]  ffffc9001d92bd28 ffffffff81380042 ffffc9001d92bd78
      0000000000000000
      [332489.686269]  ffffc9001d92bd68 ffffffff81088f8b 0000003800000000
      ffff883e6b94d4a0
      [332489.686867]  ffff883e6b94d400 ffffffff81ce8fe0 ffff88375b24f400
      ffff883e6b94d4a0
      [332489.687479] Call Trace:
      [332489.688078]  [<ffffffff81380042>] dump_stack+0x63/0x81
      [332489.688681]  [<ffffffff81088f8b>] __warn+0xcb/0xf0
      [332489.689276]  [<ffffffff8108900f>] warn_slowpath_fmt+0x5f/0x80
      [332489.689877]  [<ffffffff8139e7c1>] __list_del_entry+0x81/0xc0
      [332489.690481]  [<ffffffff81125552>] css_release_work_fn+0x42/0x140
      [332489.691090]  [<ffffffff810a2db9>] process_one_work+0x189/0x420
      [332489.691693]  [<ffffffff810a309e>] worker_thread+0x4e/0x4b0
      [332489.692293]  [<ffffffff810a3050>] ? process_one_work+0x420/0x420
      [332489.692905]  [<ffffffff810a9616>] kthread+0xe6/0x100
      [332489.693504]  [<ffffffff810a9530>] ? kthread_park+0x60/0x60
      [332489.694099]  [<ffffffff817184e1>] ret_from_fork+0x41/0x50
      [332489.694722] ---[ end trace 0cf869c4a5cfba87 ]---
      ......
      
      This is caused by calling css_get after the css has been killed by another
      thread, as described below:
      
                 Thread 1                       Thread 2
      cgroup_rmdir
        -> kill_css
          -> percpu_ref_kill_and_confirm
            -> css_killed_ref_fn
      
      css_killed_work_fn
        -> css_put
          -> css_release
                                              wb_get_create
      					  -> find_blkcg_css
      					    -> css_get
      					  -> css_put
      					    -> css_release (double free)
          -> css_release_workfn
            -> css_free_work_fn
             -> blkcg_css_free
      
      When the double free happens, it may free memory still used by
      other threads and cause a kernel panic.

      Fix this by using css_tryget_online in find_blkcg_css, which will return
      false if the css has been killed.
      Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      f935fb62
    • J
      37231c89
    • J
      writeback: add memcg_blkcg_link tree · 86c80145
      Jiufei Xue 提交于
      Here we add a global radix tree to link the memcg and blkcg that the user
      attaches tasks to when using cgroup v1; it is used for cgroup
      writeback.
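
      Purely as a conceptual sketch (not the kernel code: the patch uses a
      radix tree and cgroup css references, while the array and integer ids
      below are invented stand-ins), the link can be thought of as a map from
      a memcg identifier to the blkcg its tasks were attached to, recorded at
      attach time and consulted when writeback needs a blkcg on a v1
      hierarchy.

        #include <stdio.h>

        /* Toy stand-in for the global tree: index = memcg id, value = blkcg
         * id.  The real patch keys a radix tree and stores css references,
         * but the lookup pattern is the same. */
        #define MAX_MEMCG 64
        static int memcg_to_blkcg[MAX_MEMCG];

        static void link_memcg_blkcg(int memcg_id, int blkcg_id)
        {
            memcg_to_blkcg[memcg_id] = blkcg_id;    /* recorded at attach time */
        }

        static int lookup_blkcg(int memcg_id)
        {
            return memcg_to_blkcg[memcg_id];        /* used at writeback time */
        }

        int main(void)
        {
            link_memcg_blkcg(3, 7);     /* task attached to memcg 3, blkcg 7 */
            printf("writeback for memcg 3 is charged to blkcg %d\n",
                   lookup_blkcg(3));
            return 0;
        }
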
      Signed-off-by: NJiufei Xue <jiufei.xue@linux.alibaba.com>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      86c80145
  8. 29 10月, 2019 4 次提交
    • J
      mm/memory-failure: poison read receives SIGKILL instead of SIGBUS if mmaped more than once · 30cff8ab
      Jane Chu 提交于
      commit 3d7fed4ad8ccb691d217efbb0f934e6a4df5ef91 upstream.
      
      Mmap /dev/dax more than once, then read the poison location using an
      address from one of the mappings.  The other mappings, due to not having
      the page mapped in, will cause SIGKILLs to be delivered to the process.
      SIGKILL succeeds over SIGBUS, so the user process loses the opportunity to
      handle the UE.

      Although one may add MAP_POPULATE to mmap(2) to work around the issue,
      MAP_POPULATE makes mapping 128GB of pmem several orders of magnitude
      slower, so it isn't always an option.
      
      Details -
      
        ndctl inject-error --block=10 --count=1 namespace6.0
      
        ./read_poison -x dax6.0 -o 5120 -m 2
        mmaped address 0x7f5bb6600000
        mmaped address 0x7f3cf3600000
        doing local read at address 0x7f3cf3601400
        Killed
      
      Console messages in instrumented kernel -
      
        mce: Uncorrected hardware memory error in user-access at edbe201400
        Memory failure: tk->addr = 7f5bb6601000
        Memory failure: address edbe201: call dev_pagemap_mapping_shift
        dev_pagemap_mapping_shift: page edbe201: no PUD
        Memory failure: tk->size_shift == 0
        Memory failure: Unable to find user space address edbe201 in read_poison
        Memory failure: tk->addr = 7f3cf3601000
        Memory failure: address edbe201: call dev_pagemap_mapping_shift
        Memory failure: tk->size_shift = 21
        Memory failure: 0xedbe201: forcibly killing read_poison:22434 because of failure to unmap corrupted page
          => to deliver SIGKILL
        Memory failure: 0xedbe201: Killing read_poison:22434 due to hardware memory corruption
          => to deliver SIGBUS
      
      Link: http://lkml.kernel.org/r/1565112345-28754-3-git-send-email-jane.chu@oracle.com
      Signed-off-by: NJane Chu <jane.chu@oracle.com>
      Suggested-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Reviewed-by: NDan Williams <dan.j.williams@intel.com>
      Acked-by: NNaoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      30cff8ab
    • D
      hugetlbfs: don't access uninitialized memmaps in pfn_range_valid_gigantic() · 91eec769
      David Hildenbrand 提交于
      commit f231fe4235e22e18d847e05cbe705deaca56580a upstream.
      
      Uninitialized memmaps contain garbage and in the worst case trigger
      kernel BUGs, especially with CONFIG_PAGE_POISONING.  They should not get
      touched.
      
      Let's make sure that we only consider online memory (managed by the
      buddy) that has initialized memmaps.  ZONE_DEVICE is not applicable.
      
      page_zone() will call page_to_nid(), which will trigger
      VM_BUG_ON_PGFLAGS(PagePoisoned(page), page) with CONFIG_PAGE_POISONING
      and CONFIG_DEBUG_VM_PGFLAGS when called on uninitialized memmaps.  This
      can be the case when an offline memory block (e.g., never onlined) is
      spanned by a zone.
      
      Note: As explained by Michal in [1], alloc_contig_range() will verify
      the range.  So it boils down to the wrong access in this function.
      
      [1] http://lkml.kernel.org/r/20180423000943.GO17484@dhcp22.suse.cz
      
      Link: http://lkml.kernel.org/r/20191015120717.4858-1-david@redhat.com
      Fixes: f1dd2cd1 ("mm, memory_hotplug: do not associate hotadded memory to zones until online")	[visible after d0dc12e8]
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Reported-by: NMichal Hocko <mhocko@kernel.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NMike Kravetz <mike.kravetz@oracle.com>
      Cc: Anshuman Khandual <anshuman.khandual@arm.com>
      Cc: <stable@vger.kernel.org>	[4.13+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      91eec769
    • Q
      mm/page_owner: don't access uninitialized memmaps when reading /proc/pagetypeinfo · f712e306
      Qian Cai 提交于
      commit a26ee565b6cd8dc2bf15ff6aa70bbb28f928b773 upstream.
      
      Uninitialized memmaps contain garbage and in the worst case trigger
      kernel BUGs, especially with CONFIG_PAGE_POISONING.  They should not get
      touched.
      
      For example, when not onlining a memory block that is spanned by a zone
      and reading /proc/pagetypeinfo with CONFIG_DEBUG_VM_PGFLAGS and
      CONFIG_PAGE_POISONING, we can trigger a kernel BUG:
      
        :/# echo 1 > /sys/devices/system/memory/memory40/online
        :/# echo 1 > /sys/devices/system/memory/memory42/online
        :/# cat /proc/pagetypeinfo > test.file
         page:fffff2c585200000 is uninitialized and poisoned
         raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
         raw: ffffffffffffffff ffffffffffffffff ffffffffffffffff ffffffffffffffff
         page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
         There is not page extension available.
         ------------[ cut here ]------------
         kernel BUG at include/linux/mm.h:1107!
         invalid opcode: 0000 [#1] SMP NOPTI
      
      Please note that this change does not affect ZONE_DEVICE, because
      pagetypeinfo_showmixedcount_print() is called from
      mm/vmstat.c:pagetypeinfo_showmixedcount() only for populated zones, and
      ZONE_DEVICE is never populated (zone->present_pages always 0).
      
      [david@redhat.com: move check to outer loop, add comment, rephrase description]
      Link: http://lkml.kernel.org/r/20191011140638.8160-1-david@redhat.com
      Fixes: f1dd2cd1 ("mm, memory_hotplug: do not associate hotadded memory to zones until online") # visible after d0dc12e8
      Signed-off-by: NQian Cai <cai@lca.pw>
      Signed-off-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: "Peter Zijlstra (Intel)" <peterz@infradead.org>
      Cc: Miles Chen <miles.chen@mediatek.com>
      Cc: Mike Rapoport <rppt@linux.vnet.ibm.com>
      Cc: Qian Cai <cai@lca.pw>
      Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      Cc: <stable@vger.kernel.org>	[4.13+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      f712e306
    • Q
      mm/slub: fix a deadlock in show_slab_objects() · bb6932c5
      Qian Cai 提交于
      commit e4f8e513c3d353c134ad4eef9fd0bba12406c7c8 upstream.
      
      A long time ago we fixed a similar deadlock in show_slab_objects() [1].
      However, apparently due to commits like 01fb58bc ("slab:
      remove synchronous synchronize_sched() from memcg cache deactivation
      path") and 03afc0e2 ("slab: get_online_mems for
      kmem_cache_{create,destroy,shrink}"), this kind of deadlock is back:
      just reading files in /sys/kernel/slab will generate the lockdep
      splat below.

      Since the "mem_hotplug_lock" here is only to obtain a stable online node
      mask while racing with NUMA node hotplug, in the worst case the results
      may be miscalculated while doing NUMA node hotplug, but they shall be
      corrected by later reads of the same files.
      
        WARNING: possible circular locking dependency detected
        ------------------------------------------------------
        cat/5224 is trying to acquire lock:
        ffff900012ac3120 (mem_hotplug_lock.rw_sem){++++}, at:
        show_slab_objects+0x94/0x3a8
      
        but task is already holding lock:
        b8ff009693eee398 (kn->count#45){++++}, at: kernfs_seq_start+0x44/0xf0
      
        which lock already depends on the new lock.
      
        the existing dependency chain (in reverse order) is:
      
        -> #2 (kn->count#45){++++}:
               lock_acquire+0x31c/0x360
               __kernfs_remove+0x290/0x490
               kernfs_remove+0x30/0x44
               sysfs_remove_dir+0x70/0x88
               kobject_del+0x50/0xb0
               sysfs_slab_unlink+0x2c/0x38
               shutdown_cache+0xa0/0xf0
               kmemcg_cache_shutdown_fn+0x1c/0x34
               kmemcg_workfn+0x44/0x64
               process_one_work+0x4f4/0x950
               worker_thread+0x390/0x4bc
               kthread+0x1cc/0x1e8
               ret_from_fork+0x10/0x18
      
        -> #1 (slab_mutex){+.+.}:
               lock_acquire+0x31c/0x360
               __mutex_lock_common+0x16c/0xf78
               mutex_lock_nested+0x40/0x50
               memcg_create_kmem_cache+0x38/0x16c
               memcg_kmem_cache_create_func+0x3c/0x70
               process_one_work+0x4f4/0x950
               worker_thread+0x390/0x4bc
               kthread+0x1cc/0x1e8
               ret_from_fork+0x10/0x18
      
        -> #0 (mem_hotplug_lock.rw_sem){++++}:
               validate_chain+0xd10/0x2bcc
               __lock_acquire+0x7f4/0xb8c
               lock_acquire+0x31c/0x360
               get_online_mems+0x54/0x150
               show_slab_objects+0x94/0x3a8
               total_objects_show+0x28/0x34
               slab_attr_show+0x38/0x54
               sysfs_kf_seq_show+0x198/0x2d4
               kernfs_seq_show+0xa4/0xcc
               seq_read+0x30c/0x8a8
               kernfs_fop_read+0xa8/0x314
               __vfs_read+0x88/0x20c
               vfs_read+0xd8/0x10c
               ksys_read+0xb0/0x120
               __arm64_sys_read+0x54/0x88
               el0_svc_handler+0x170/0x240
               el0_svc+0x8/0xc
      
        other info that might help us debug this:
      
        Chain exists of:
          mem_hotplug_lock.rw_sem --> slab_mutex --> kn->count#45
      
         Possible unsafe locking scenario:
      
               CPU0                    CPU1
               ----                    ----
          lock(kn->count#45);
                                       lock(slab_mutex);
                                       lock(kn->count#45);
          lock(mem_hotplug_lock.rw_sem);
      
         *** DEADLOCK ***
      
        3 locks held by cat/5224:
         #0: 9eff00095b14b2a0 (&p->lock){+.+.}, at: seq_read+0x4c/0x8a8
         #1: 0eff008997041480 (&of->mutex){+.+.}, at: kernfs_seq_start+0x34/0xf0
         #2: b8ff009693eee398 (kn->count#45){++++}, at:
        kernfs_seq_start+0x44/0xf0
      
        stack backtrace:
        Call trace:
         dump_backtrace+0x0/0x248
         show_stack+0x20/0x2c
         dump_stack+0xd0/0x140
         print_circular_bug+0x368/0x380
         check_noncircular+0x248/0x250
         validate_chain+0xd10/0x2bcc
         __lock_acquire+0x7f4/0xb8c
         lock_acquire+0x31c/0x360
         get_online_mems+0x54/0x150
         show_slab_objects+0x94/0x3a8
         total_objects_show+0x28/0x34
         slab_attr_show+0x38/0x54
         sysfs_kf_seq_show+0x198/0x2d4
         kernfs_seq_show+0xa4/0xcc
         seq_read+0x30c/0x8a8
         kernfs_fop_read+0xa8/0x314
         __vfs_read+0x88/0x20c
         vfs_read+0xd8/0x10c
         ksys_read+0xb0/0x120
         __arm64_sys_read+0x54/0x88
         el0_svc_handler+0x170/0x240
         el0_svc+0x8/0xc
      
      I think it is important to mention that this doesn't expose the
      show_slab_objects to use-after-free.  There is only a single path that
      might really race here and that is the slab hotplug notifier callback
      __kmem_cache_shrink (via slab_mem_going_offline_callback) but that path
      doesn't really destroy kmem_cache_node data structures.
      
      [1] http://lkml.iu.edu/hypermail/linux/kernel/1101.0/02850.html
      
      [akpm@linux-foundation.org: add comment explaining why we don't need mem_hotplug_lock]
      Link: http://lkml.kernel.org/r/1570192309-10132-1-git-send-email-cai@lca.pw
      Fixes: 01fb58bc ("slab: remove synchronous synchronize_sched() from memcg cache deactivation path")
      Fixes: 03afc0e2 ("slab: get_online_mems for kmem_cache_{create,destroy,shrink}")
      Signed-off-by: NQian Cai <cai@lca.pw>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NGreg Kroah-Hartman <gregkh@linuxfoundation.org>
      bb6932c5