1. 17 4月, 2020 1 次提交
  2. 16 4月, 2020 10 次提交
  3. 13 4月, 2020 7 次提交
  4. 09 4月, 2020 1 次提交
  5. 25 3月, 2020 1 次提交
  6. 18 3月, 2020 20 次提交
    • S
      mm: fix tick timer stall during deferred page init · bdfadace
      Shile Zhang 提交于
      commit 07447453db3aebb6a0917592f411a7122d12a8b9 upstream linux-next.
      
      When 'CONFIG_DEFERRED_STRUCT_PAGE_INIT' is set, 'pgdatinit' kthread will
      initialise the deferred pages with local interrupts disabled. It is
      introduced by commit 3a2d7fa8 ("mm: disable interrupts while
      initializing deferred pages").
      
      On machine with NCPUS <= 2, the 'pgdatinit' kthread could be bound to
      the boot CPU, which could caused the tick timer long time stall, system
      jiffies not be updated in time.
      
      The dmesg shown that:
      
          [    0.197975] node 0 initialised, 32170688 pages in 1ms
      
      Obviously, 1ms is unreasonable.
      
      Now, fix it by restore in the pending interrupts for every 32*1204 pages
      (128MB) initialized, give the chance to update the systemd jiffies.
      The reasonable demsg shown likes:
      
          [    1.069306] node 0 initialised, 32203456 pages in 894ms
      
      Link: http://lkml.kernel.org/r/20200311123848.118638-1-shile.zhang@linux.alibaba.com
      Fixes: 3a2d7fa8 ("mm: disable interrupts while initializing deferred pages")
      Signed-off-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Signed-off-by: NShile Zhang <shile.zhang@linux.alibaba.com>
      Co-developed-by: NKirill Tkhai <ktkhai@virtuozzo.com>
      Reviewed-by: NPavel Tatashin <pasha.tatashin@soleen.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NStephen Rothwell <sfr@canb.auug.org.au>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      bdfadace
    • X
      alinux: mm, memcg: export workingset counters on memcg v1 · 27393b9b
      Xu Yu 提交于
      This exports the workingset counters, i.e., workingset_refault,
      workingset_activate, workingset_restore, and workingset_nodereclaim, to
      memory cgroup v1.
      
      The stat collection of these counters is shared between memory cgroup v1
      and v2.  What this patch does is just to export them on memory cgroup v1.
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      27393b9b
    • X
      alinux: mm, memcg: abort priority oom if with oom victim · ec661706
      Xu Yu 提交于
      Explicitly abort mem_cgroup_select_bad_process in priority oom if there
      is already a task as oom victim without MMF_OOM_SKIP flag set.
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      ec661706
    • X
      alinux: mm, memcg: account number of processes in the css · 2061acd6
      Xu Yu 提交于
      Since commit e0205ae40f12 ("mm: memcontrol: use CSS_TASK_ITER_PROCS at
      mem_cgroup_scan_tasks()") made mem_cgroup_scan_tasks() to check only one
      thread from each thread group, we can make cgroup_subsys_state::nr_tasks
      to record only the thread group leader, i.e., process, instead of
      thread(s). Furthermore, this renames cgroup_subsys_state::nr_tasks to
      cgroup_subsys_state::nr_procs.
      
      Fixes: f061cd88 ("alinux: kernel: cgroup: account number of tasks in
      the css and its descendants")
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      2061acd6
    • T
      mm: memcontrol: use CSS_TASK_ITER_PROCS at mem_cgroup_scan_tasks() · 7bf04cbb
      Tetsuo Handa 提交于
      commit f168a9a54ec39b3f832c353733898b713b6b5c1f upstream.
      
      Since commit c03cd7738a83 ("cgroup: Include dying leaders with live
      threads in PROCS iterations") corrected how CSS_TASK_ITER_PROCS works,
      mem_cgroup_scan_tasks() can use CSS_TASK_ITER_PROCS in order to check
      only one thread from each thread group.
      
      [penguin-kernel@I-love.SAKURA.ne.jp: remove thread group leader check in oom_evaluate_task()]
        Link: http://lkml.kernel.org/r/1560853257-14934-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
      Link: http://lkml.kernel.org/r/c763afc8-f0ae-756a-56a7-395f625b95fc@i-love.sakura.ne.jpSigned-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NShakeel Butt <shakeelb@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      7bf04cbb
    • X
      alinux: mm, memcg: fix soft lockup in priority oom · fb4c5ea6
      Xu Yu 提交于
      Assuming that there is a memory cgroup tree as follows:
      
              A (use_priority_oom=1, limit=2.5G)
             / \
            /   C (priority=3, usage=1.5G)
           B (priority=0, usage=1G)
      
      As task in C (task-c) invokes oom-killer, task in B (task-b) is chosen
      and killed, and then task-c returns from mem_cgroup_oom and retries in
      try_charge.
      
      If memory page_counter of B has not been reset yet, leading to task-c
      invokes oom-killer again, the soft lockup may happen. In this situation,
      task-c keeps selecting bad process in B, while the only task-b in B has
      already been set PF_EXITING flag, which makes task-b skipped in
      css_task_iter_advance.
      
      Finally, task-c selected no bad process in B and keeps retrying, and
      task-b is stalled in synchronize_rcu when do_exit, exit_task_namespaces
      specifically.
      
      In a nutshell, the new behavior of css_task_iter_advance, i.e., commit
      c03cd7738a83 ("cgroup: Include dying leaders with live threads in PROCS
      iterations"), causes priority oom to misbehave.
      
      This fixes the soft lockup by accounting num_oom_skip of the victim
      memcg and its parents (sift up to oc->memcg), if no bad process is
      chosen from it.
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      fb4c5ea6
    • X
      alinux: mm, memcg: record latency of memcg wmark reclaim · 40969475
      Xu Yu 提交于
      The memcg background async page reclaim, a.k.a, memcg kswapd, is
      implemented with a dedicated unbound workqueue currently.
      
      However, memcg kswapd will run too frequently, resulting in high
      overhead, page cache thrashing, frequent dirty page writeback, etc., due
      to improper memcg memory.wmark_ratio, unreasonable memcg memor capacity,
      or even abnormal memcg memory usage.
      
      We need to find out the problematic memcg(s) where memcg kswapd
      introduces significant overhead.
      
      This records the latency of each run of memcg kswapd work, and then
      aggregates into the exstat of per memcg.
      Signed-off-by: NXu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      40969475
    • Z
      mm: fix trying to reclaim unevictable lru page when calling madvise_pageout · 88ec97ec
      zhong jiang 提交于
      commit 82072962973008201b817fae1896512977dd5083 upstream
      
      Recently, I hit the following issue when running upstream.
      
        kernel BUG at mm/vmscan.c:1521!
        invalid opcode: 0000 [#1] SMP KASAN PTI
        CPU: 0 PID: 23385 Comm: syz-executor.6 Not tainted 5.4.0-rc4+ #1
        RIP: 0010:shrink_page_list+0x12b6/0x3530 mm/vmscan.c:1521
        Call Trace:
         reclaim_pages+0x499/0x800 mm/vmscan.c:2188
         madvise_cold_or_pageout_pte_range+0x58a/0x710 mm/madvise.c:453
         walk_pmd_range mm/pagewalk.c:53 [inline]
         walk_pud_range mm/pagewalk.c:112 [inline]
         walk_p4d_range mm/pagewalk.c:139 [inline]
         walk_pgd_range mm/pagewalk.c:166 [inline]
         __walk_page_range+0x45a/0xc20 mm/pagewalk.c:261
         walk_page_range+0x179/0x310 mm/pagewalk.c:349
         madvise_pageout_page_range mm/madvise.c:506 [inline]
         madvise_pageout+0x1f0/0x330 mm/madvise.c:542
         madvise_vma mm/madvise.c:931 [inline]
         __do_sys_madvise+0x7d2/0x1600 mm/madvise.c:1113
         do_syscall_64+0x9f/0x4c0 arch/x86/entry/common.c:290
         entry_SYSCALL_64_after_hwframe+0x49/0xbe
      
      madvise_pageout() accesses the specified range of the vma and isolates
      them, then runs shrink_page_list() to reclaim its memory.  But it also
      isolates the unevictable pages to reclaim.  Hence, we can catch the
      cases in shrink_page_list().
      
      The root cause is that we scan the page tables instead of specific LRU
      list.  and so we need to filter out the unevictable lru pages from our
      end.
      
      Link: http://lkml.kernel.org/r/1572616245-18946-1-git-send-email-zhongjiang@huawei.com
      Fixes: 1a4e58cce84e ("mm: introduce MADV_PAGEOUT")
      Signed-off-by: Nzhong jiang <zhongjiang@huawei.com>
      Suggested-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>
      88ec97ec
    • M
      mm: factor out common parts between MADV_COLD and MADV_PAGEOUT · b6a18a3c
      Minchan Kim 提交于
      commit d616d5126503967bf365db0711ee3c78b356efe9 upstream
      
      There are many common parts between MADV_COLD and MADV_PAGEOUT.
      This patch factor them out to save code duplication.
      
      Link: http://lkml.kernel.org/r/20190726023435.214162-6-minchan@kernel.orgSigned-off-by: NMinchan Kim <minchan@kernel.org>
      Suggested-by: NJohannes Weiner <hannes@cmpxchg.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: kbuild test robot <lkp@intel.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>
      b6a18a3c
    • M
      mm: introduce MADV_PAGEOUT · 23757dcc
      Minchan Kim 提交于
      commit 1a4e58cce84ee88129d5d49c064bd2852b481357 upstream
      
      When a process expects no accesses to a certain memory range for a long
      time, it could hint kernel that the pages can be reclaimed instantly but
      data should be preserved for future use.  This could reduce workingset
      eviction so it ends up increasing performance.
      
      This patch introduces the new MADV_PAGEOUT hint to madvise(2) syscall.
      MADV_PAGEOUT can be used by a process to mark a memory range as not
      expected to be used for a long time so that kernel reclaims *any LRU*
      pages instantly.  The hint can help kernel in deciding which pages to
      evict proactively.
      
      A note: It doesn't apply SWAP_CLUSTER_MAX LRU page isolation limit
      intentionally because it's automatically bounded by PMD size.  If PMD
      size(e.g., 256) makes some trouble, we could fix it later by limit it to
      SWAP_CLUSTER_MAX[1].
      
      - man-page material
      
      MADV_PAGEOUT (since Linux x.x)
      
      Do not expect access in the near future so pages in the specified
      regions could be reclaimed instantly regardless of memory pressure.
      Thus, access in the range after successful operation could cause
      major page fault but never lose the up-to-date contents unlike
      MADV_DONTNEED. Pages belonging to a shared mapping are only processed
      if a write access is allowed for the calling process.
      
      MADV_PAGEOUT cannot be applied to locked pages, Huge TLB pages, or
      VM_PFNMAP pages.
      
      [1] https://lore.kernel.org/lkml/20190710194719.GS29695@dhcp22.suse.cz/
      
      [minchan@kernel.org: clear PG_active on MADV_PAGEOUT]
        Link: http://lkml.kernel.org/r/20190802200643.GA181880@google.com
      [akpm@linux-foundation.org: resolve conflicts with hmm.git]
      Link: http://lkml.kernel.org/r/20190726023435.214162-5-minchan@kernel.orgSigned-off-by: NMinchan Kim <minchan@kernel.org>
      Reported-by: Nkbuild test robot <lkp@intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>
      23757dcc
    • M
      mm: introduce MADV_COLD · 1af766e8
      Minchan Kim 提交于
      commit 9c276cc65a58faf98be8e56962745ec99ab87636 upstream
      
      Patch series "Introduce MADV_COLD and MADV_PAGEOUT", v7.
      
      - Background
      
      The Android terminology used for forking a new process and starting an app
      from scratch is a cold start, while resuming an existing app is a hot
      start.  While we continually try to improve the performance of cold
      starts, hot starts will always be significantly less power hungry as well
      as faster so we are trying to make hot start more likely than cold start.
      
      To increase hot start, Android userspace manages the order that apps
      should be killed in a process called ActivityManagerService.
      ActivityManagerService tracks every Android app or service that the user
      could be interacting with at any time and translates that into a ranked
      list for lmkd(low memory killer daemon).  They are likely to be killed by
      lmkd if the system has to reclaim memory.  In that sense they are similar
      to entries in any other cache.  Those apps are kept alive for
      opportunistic performance improvements but those performance improvements
      will vary based on the memory requirements of individual workloads.
      
      - Problem
      
      Naturally, cached apps were dominant consumers of memory on the system.
      However, they were not significant consumers of swap even though they are
      good candidate for swap.  Under investigation, swapping out only begins
      once the low zone watermark is hit and kswapd wakes up, but the overall
      allocation rate in the system might trip lmkd thresholds and cause a
      cached process to be killed(we measured performance swapping out vs.
      zapping the memory by killing a process.  Unsurprisingly, zapping is 10x
      times faster even though we use zram which is much faster than real
      storage) so kill from lmkd will often satisfy the high zone watermark,
      resulting in very few pages actually being moved to swap.
      
      - Approach
      
      The approach we chose was to use a new interface to allow userspace to
      proactively reclaim entire processes by leveraging platform information.
      This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
      that are known to be cold from userspace and to avoid races with lmkd by
      reclaiming apps as soon as they entered the cached state.  Additionally,
      it could provide many chances for platform to use much information to
      optimize memory efficiency.
      
      To achieve the goal, the patchset introduce two new options for madvise.
      One is MADV_COLD which will deactivate activated pages and the other is
      MADV_PAGEOUT which will reclaim private pages instantly.  These new
      options complement MADV_DONTNEED and MADV_FREE by adding non-destructive
      ways to gain some free memory space.  MADV_PAGEOUT is similar to
      MADV_DONTNEED in a way that it hints the kernel that memory region is not
      currently needed and should be reclaimed immediately; MADV_COLD is similar
      to MADV_FREE in a way that it hints the kernel that memory region is not
      currently needed and should be reclaimed when memory pressure rises.
      
      This patch (of 5):
      
      When a process expects no accesses to a certain memory range, it could
      give a hint to kernel that the pages can be reclaimed when memory pressure
      happens but data should be preserved for future use.  This could reduce
      workingset eviction so it ends up increasing performance.
      
      This patch introduces the new MADV_COLD hint to madvise(2) syscall.
      MADV_COLD can be used by a process to mark a memory range as not expected
      to be used in the near future.  The hint can help kernel in deciding which
      pages to evict early during memory pressure.
      
      It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves
      
      	active file page -> inactive file LRU
      	active anon page -> inacdtive anon LRU
      
      Unlike MADV_FREE, it doesn't move active anonymous pages to inactive file
      LRU's head because MADV_COLD is a little bit different symantic.
      MADV_FREE means it's okay to discard when the memory pressure because the
      content of the page is *garbage* so freeing such pages is almost zero
      overhead since we don't need to swap out and access afterward causes just
      minor fault.  Thus, it would make sense to put those freeable pages in
      inactive file LRU to compete other used-once pages.  It makes sense for
      implmentaion point of view, too because it's not swapbacked memory any
      longer until it would be re-dirtied.  Even, it could give a bonus to make
      them be reclaimed on swapless system.  However, MADV_COLD doesn't mean
      garbage so reclaiming them requires swap-out/in in the end so it's bigger
      cost.  Since we have designed VM LRU aging based on cost-model, anonymous
      cold pages would be better to position inactive anon's LRU list, not file
      LRU.  Furthermore, it would help to avoid unnecessary scanning if system
      doesn't have a swap device.  Let's start simpler way without adding
      complexity at this moment.  However, keep in mind, too that it's a caveat
      that workloads with a lot of pages cache are likely to ignore MADV_COLD on
      anonymous memory because we rarely age anonymous LRU lists.
      
      * man-page material
      
      MADV_COLD (since Linux x.x)
      
      Pages in the specified regions will be treated as less-recently-accessed
      compared to pages in the system with similar access frequencies.  In
      contrast to MADV_FREE, the contents of the region are preserved regardless
      of subsequent writes to pages.
      
      MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP
      pages.
      
      [akpm@linux-foundation.org: resolve conflicts with hmm.git]
      Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.orgSigned-off-by: NMinchan Kim <minchan@kernel.org>
      Reported-by: Nkbuild test robot <lkp@intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>
      1af766e8
    • M
      mm: change PAGEREF_RECLAIM_CLEAN with PAGE_REFRECLAIM · a0747c91
      Minchan Kim 提交于
      commit 8940b34a4e082ae11498ddae8432f2ac07685d1c upstream
      
      The local variable references in shrink_page_list is PAGEREF_RECLAIM_CLEAN
      as default.  It is for preventing to reclaim dirty pages when CMA try to
      migrate pages.  Strictly speaking, we don't need it because CMA didn't
      allow to write out by .may_writepage = 0 in reclaim_clean_pages_from_list.
      
      Moreover, it has a problem to prevent anonymous pages's swap out even
      though force_reclaim = true in shrink_page_list on upcoming patch.  So
      this patch makes references's default value to PAGEREF_RECLAIM and rename
      force_reclaim with ignore_references to make it more clear.
      
      This is a preparatory work for next patch.
      
      Link: http://lkml.kernel.org/r/20190726023435.214162-3-minchan@kernel.orgSigned-off-by: NMinchan Kim <minchan@kernel.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: kbuild test robot <lkp@intel.com>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>
      a0747c91
    • X
      alinux: mm: add proc interface to control context readahead · 2e38a0f2
      Xiaoguang Wang 提交于
      For some workloads whose io activities are mostly random, context
      readahead feature can introduce unnecessary io read operations, which
      will impact app's performance. Context readahead's algorithm is
      straightforward and not that smart.
      
      This patch adds "/proc/sys/vm/enable_context_readahead" to control
      whether to disable or enable this feature. Currently we enable context
      readahead default, user can echo 0 to /proc/sys/vm/enable_context_readahead
      to disable context readahead.
      
      We also have tested mongodb's performance in 'random point select' case,
      With context readahead enabled:
        mongodb eps 12409
      With context readahead disabled:
        mongodb eps 14443
      About 16% performance improvement.
      Signed-off-by: NXiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
      Reviewed-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      2e38a0f2
    • C
      alinux: doc: use unified official project name Cloud Kernel · a60721b9
      Caspar Zhang 提交于
      Cloud Kernel is the official name of our project, this patch unitizes
      the project names used in docs and comments.
      Signed-off-by: NCaspar Zhang <caspar@linux.alibaba.com>
      Acked-by: NJoseph Qi <joseph.qi@linux.alibaba.com>
      a60721b9
    • W
      alinux: mm: oom_kill: show killed task's cgroup info in global oom · 5028e358
      Wenwei Tao 提交于
      Some users want to know the killed task's cgroup info in global
      oom, this message would help them to make upper decision.
      Signed-off-by: NWenwei Tao <wenwei.tao@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      5028e358
    • W
      alinux: mm: memcontrol: enable oom.group on cgroup-v1 · 7d41295c
      Wenwei Tao 提交于
      Enable oom.group on cgroup-v1.
      Signed-off-by: NWenwei Tao <wenwei.tao@linux.alibaba.com>
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      7d41295c
    • W
      alinux: mm: memcontrol: introduce memcg priority oom · 52e375fc
      Wenwei Tao 提交于
      Under memory pressure reclaim and oom would happen, with multiple
      cgroups exist in one system, we might want some of their memory
      or tasks survived the reclaim and oom while there are other
      candidates.
      
      The @memory.low and @memory.min have make that happen during reclaim,
      this patch introduces memcg priority oom to meet above requirement in
      the oom.
      
      The priority is from 0 to 12, the higher number the higher priority.
      When oom happens it always choose victim from low priority memcg.
      And it works both for memcg oom and global oom, it can be enabled/disabled
      through @memory.use_priority_oom, for global oom through the root
      memcg's @memory.use_priority_oom, it is disabled by default.
      Signed-off-by: NWenwei Tao <wenwei.tao@linux.alibaba.com>
      Reviewed-by: NXunlei Pang <xlpang@linux.alibaba.com>
      52e375fc
    • X
      alinux: memcg: Account throttled time due to memory.wmark_min_adj · ef467b9d
      Xunlei Pang 提交于
      Accessing original memory.stat turned out to be one heavy
      operation which has been caused many real product problems.
      
      Introduce new cgroup memory.exstat, memory.exstat stands
      for "extra/extended memory.stat", which contains dedicated
      statistics from Alibaba Clould Kernel.
      
      memory.exstat is supposed to provide hierarchical statistics.
      
      Export its first "wmark_min_throttled_ms", and will add more
      like direct reclaim, direct compaction, etc.
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com>
      Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>
      ef467b9d
    • X
      alinux: memcg: Introduce memory.wmark_min_adj · 60be0f54
      Xunlei Pang 提交于
      In co-location environment, there are more or less some memory
      overcommitment, then BATCH tasks may break the shared global min
      watermark resulting in all types of applications falling into
      the direct reclaim slow path hurting the RT of LS tasks.
      (NOTE: BATCH tasks tolerate big latency spike even in seconds
      as long as doesn't hurt its overal throughput. While LS tasks
      are very Latency-Sensitive, they may time out or fail in case
      of sudden latency spike lasts like hundreds of ms typically.)
      
      Actually BATCH tasks are not sensitive to memory latency, they
      can be assigned a strict min watermark which is different from
      that of LS tasks(which can be aissgned a lenient min watermark
      accordingly), thus isolating each other in case of global memory
      allocation. This is kind of like the idea behind ALLOC_HARDER
      for rt_task(), see gfp_to_alloc_flags().
      
      memory.wmark_min_adj stands for memcg global WMARK_MIN adjustment,
      it is used to realize separate min watermarks above-mentioned for
      memcgs, its valid value is within [-25, 50], specifically:
      negative value means to be relative to [0, WMARK_MIN],
      positive value means to be relative to [WMARK_MIN, WMARK_LOW].
      For examples,
        -25 means "WMARK_MIN + (WMARK_MIN - 0) * (-25%)"
         50 means "WMARK_MIN + (WMARK_LOW - WMARK_MIN) * 50%"
      
      Note that the minimum -25 is what ALLOC_HARDER uses which is safe
      for us to adopt, and the maximum 50 is one experienced value.
      
      Negative memory.wmark_min_adj means high QoS requirements, it can
      allocate below the global WMARK_MIN, which is kind of like the idea
      behind ALLOC_HARDER, see gfp_to_alloc_flags().
      
      Positive memory.wmark_min_adj means low QoS requirements, thus when
      allocation broke memcg min watermark, it should trigger direct reclaim
      traditionally, and we trigger throttle instead to further prevent
      them from disturbing others.
      
      With this interface, we can assign positive values for BATCH memcgs
      and negative values for LS memcgs.
      
      memory.wmark_min_adj default value is 0, and inherit from its parent,
      Note that the final effective wmark_min_adj will consider all the
      hierarchical values, its value is the maximal(most conservative)
      wmark_min_adj along the hierarchy but excluding intermediate default
      values(zero).
      Reviewed-by: NYang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com>
      Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>
      60be0f54
    • X
      alinux: memcg: Provide users the ability to reap zombie memcgs · 63442ea9
      Xunlei Pang 提交于
      After memcg was deleted, page caches still reference to this memcg
      causing large number of dead(zombie) memcgs in the system. Then it
      slows down access to "/sys/fs/cgroup/cpu/memory.stat", etc due to
      tons of iterations, further causing various latencies.
      
      This patch introduces two ways to reclaim these zombie memcgs.
      1) Background kthread reaper
      Introduce a kernel thread "memcg_zombie_reaper" to reclaim zombie
      memcgs at background periodically.
      
      Several knobs are also added to control the reaper scan frequency:
      - /sys/kernel/mm/memcg_reaper/scan_interval
        The scan period in second. Default 5s.
      - /sys/kernel/mm/memcg_reaper/pages_scan
        The scan rate of pages per scan. Default 1310720(5GiB for 4KiB page).
      - /sys/kernel/mm/memcg_reaper/verbose
        Output some zombie memcg information for debug purpose. Default off.
      - /sys/kernel/mm/memcg_reaper/reap_background
        "on/off" switch. Default "0" means off. Write "1" to switch it on.
      
      2) One-shot trigger by users
      - /sys/kernel/mm/memcg_reaper/reap
        Write "1" to trigger one round of zombie memcg reaping, but without
        any guarantee, you may need to launch multiple rounds as needed.
      Reviewed-by: NGavin Shan <shan.gavin@linux.alibaba.com>
      Signed-off-by: NXunlei Pang <xlpang@linux.alibaba.com>
      63442ea9