1. 26 Apr, 2020 2 commits
  2. 22 Apr, 2020 1 commit
  3. 17 Apr, 2020 1 commit
  4. 18 Mar, 2020 4 commits
    • T
      mm: memcontrol: use CSS_TASK_ITER_PROCS at mem_cgroup_scan_tasks() · 7bf04cbb
      Committed by Tetsuo Handa
      commit f168a9a54ec39b3f832c353733898b713b6b5c1f upstream.
      
      Since commit c03cd7738a83 ("cgroup: Include dying leaders with live
      threads in PROCS iterations") corrected how CSS_TASK_ITER_PROCS works,
      mem_cgroup_scan_tasks() can use CSS_TASK_ITER_PROCS in order to check
      only one thread from each thread group.
      
      [penguin-kernel@I-love.SAKURA.ne.jp: remove thread group leader check in oom_evaluate_task()]
        Link: http://lkml.kernel.org/r/1560853257-14934-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
      Link: http://lkml.kernel.org/r/c763afc8-f0ae-756a-56a7-395f625b95fc@i-love.sakura.ne.jp
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Shakeel Butt <shakeelb@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Xu Yu <xuyu@linux.alibaba.com>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
    • M
      mm: introduce MADV_COLD · 1af766e8
      Committed by Minchan Kim
      commit 9c276cc65a58faf98be8e56962745ec99ab87636 upstream
      
      Patch series "Introduce MADV_COLD and MADV_PAGEOUT", v7.
      
      - Background
      
      The Android terminology used for forking a new process and starting an app
      from scratch is a cold start, while resuming an existing app is a hot
      start.  While we continually try to improve the performance of cold
      starts, hot starts will always be significantly less power hungry as well
      as faster so we are trying to make hot start more likely than cold start.
      
      To increase hot start, Android userspace manages the order that apps
      should be killed in a process called ActivityManagerService.
      ActivityManagerService tracks every Android app or service that the user
      could be interacting with at any time and translates that into a ranked
      list for lmkd (low memory killer daemon).  They are likely to be killed by
      lmkd if the system has to reclaim memory.  In that sense they are similar
      to entries in any other cache.  Those apps are kept alive for
      opportunistic performance improvements but those performance improvements
      will vary based on the memory requirements of individual workloads.
      
      - Problem
      
      Naturally, cached apps were dominant consumers of memory on the system.
      However, they were not significant consumers of swap even though they are
      good candidates for swap.  Under investigation, swapping out only begins
      once the low zone watermark is hit and kswapd wakes up, but the overall
      allocation rate in the system might trip lmkd thresholds and cause a
      cached process to be killed (we measured the performance of swapping out
      vs. zapping the memory by killing a process; unsurprisingly, zapping is
      10x faster even though we use zram, which is much faster than real
      storage), so a kill from lmkd will often satisfy the high zone watermark,
      resulting in very few pages actually being moved to swap.
      
      - Approach
      
      The approach we chose was to use a new interface to allow userspace to
      proactively reclaim entire processes by leveraging platform information.
      This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
      that are known to be cold from userspace and to avoid races with lmkd by
      reclaiming apps as soon as they entered the cached state.  Additionally,
      it gives the platform more opportunities to use its own information to
      optimize memory efficiency.
      
      To achieve the goal, the patchset introduces two new options for madvise.
      One is MADV_COLD which will deactivate activated pages and the other is
      MADV_PAGEOUT which will reclaim private pages instantly.  These new
      options complement MADV_DONTNEED and MADV_FREE by adding non-destructive
      ways to gain some free memory space.  MADV_PAGEOUT is similar to
      MADV_DONTNEED in a way that it hints the kernel that memory region is not
      currently needed and should be reclaimed immediately; MADV_COLD is similar
      to MADV_FREE in a way that it hints the kernel that memory region is not
      currently needed and should be reclaimed when memory pressure rises.
      
      This patch (of 5):
      
      When a process expects no accesses to a certain memory range, it could
      give a hint to kernel that the pages can be reclaimed when memory pressure
      happens but data should be preserved for future use.  This could reduce
      workingset eviction so it ends up increasing performance.
      
      This patch introduces the new MADV_COLD hint to the madvise(2) syscall.
      MADV_COLD can be used by a process to mark a memory range as not expected
      to be used in the near future.  The hint can help the kernel in deciding
      which pages to evict early during memory pressure.
      
      It works for every LRU page, like MADV_[DONTNEED|FREE].  IOW, it moves
      
      	active file page -> inactive file LRU
      	active anon page -> inactive anon LRU
      
      Unlike MADV_FREE, it doesn't move active anonymous pages to the inactive
      file LRU's head because MADV_COLD has slightly different semantics.
      MADV_FREE means it's okay to discard the page under memory pressure because
      the content of the page is *garbage*, so freeing such pages has almost zero
      overhead: we don't need to swap out, and a later access causes just a
      minor fault.  Thus, it makes sense to put those freeable pages in the
      inactive file LRU to compete with other used-once pages.  It makes sense
      from an implementation point of view, too, because the memory is no longer
      swap-backed until it is re-dirtied.  It even gives a bonus: such pages can
      be reclaimed on a swapless system.  However, MADV_COLD doesn't mean
      garbage, so reclaiming such pages requires swap-out/in in the end, which is
      a bigger cost.  Since we have designed VM LRU aging based on a cost model,
      anonymous cold pages are better positioned on the inactive anon LRU list,
      not the file LRU.  Furthermore, it helps to avoid unnecessary scanning if
      the system doesn't have a swap device.  Let's start the simpler way without
      adding complexity at this moment.  However, keep in mind the caveat that
      workloads with a lot of page cache are likely to ignore MADV_COLD on
      anonymous memory because we rarely age anonymous LRU lists.
      
      * man-page material
      
      MADV_COLD (since Linux x.x)
      
      Pages in the specified regions will be treated as less-recently-accessed
      compared to pages in the system with similar access frequencies.  In
      contrast to MADV_FREE, the contents of the region are preserved regardless
      of subsequent writes to pages.
      
      MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP
      pages.
      
      [akpm@linux-foundation.org: resolve conflicts with hmm.git]
      Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org
      Signed-off-by: Minchan Kim <minchan@kernel.org>
      Reported-by: kbuild test robot <lkp@intel.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Reviewed-by: Yang Shi <yang.shi@linux.alibaba.com>
      Signed-off-by: Xunlei Pang <xlpang@linux.alibaba.com>
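The MADV_COLD hint described in the commit above could be exercised from userspace roughly as follows. This is a minimal sketch, not code from the patch: `mark_cold()` and `cold_preserves_data()` are hypothetical helper names, and the fallback value 20 for `MADV_COLD` is an assumption for libc headers that predate the flag. On a kernel without MADV_COLD the hint is expected to fail with EINVAL, which the sketch treats as non-fatal.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Assumed fallback for older libc headers (value from the generic Linux UAPI). */
#ifndef MADV_COLD
#define MADV_COLD 20
#endif

/*
 * Hint that [addr, addr + len) is not expected to be used soon.
 * Returns 0 on success or a negative errno value on failure.
 */
int mark_cold(void *addr, size_t len)
{
	assert(addr != NULL && len > 0);
	if (madvise(addr, len, MADV_COLD) != 0)
		return -errno;
	return 0;
}

/*
 * Hypothetical demo helper: map anonymous memory, fill it, mark it cold,
 * then check that the contents are still intact.
 * Returns 1 if the data survived, 0 if not, -1 if mmap() failed.
 */
int cold_preserves_data(size_t len)
{
	unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	size_t i;
	int intact = 1;

	if (p == MAP_FAILED)
		return -1;
	memset(p, 0x5a, len);
	(void)mark_cold(p, len);	/* a rejected hint is not fatal */
	for (i = 0; i < len; i++)
		if (p[i] != 0x5a)
			intact = 0;
	munmap(p, len);
	return intact;
}
```

Calling `cold_preserves_data(1 << 16)` should report the data as intact whether or not the running kernel accepts the hint, which is the non-destructive property the commit message contrasts with MADV_FREE.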
    • W
      alinux: mm: oom_kill: show killed task's cgroup info in global oom · 5028e358
      Committed by Wenwei Tao
      Some users want to know the killed task's cgroup info in a global
      OOM; this message helps them make upper-layer decisions.
      
      Signed-off-by: Wenwei Tao <wenwei.tao@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
    • W
      alinux: mm: memcontrol: introduce memcg priority oom · 52e375fc
      Committed by Wenwei Tao
      Under memory pressure, reclaim and OOM can happen.  When multiple
      cgroups exist in one system, we might want the memory or tasks of
      some of them to survive reclaim and OOM while there are other
      candidates.
      
      @memory.low and @memory.min already make that happen during reclaim;
      this patch introduces memcg priority OOM to meet the same requirement
      for OOM.
      
      The priority ranges from 0 to 12; the higher the number, the higher the
      priority.  When an OOM happens, the victim is always chosen from a
      low-priority memcg.  This works for both memcg OOM and global OOM.  It
      can be enabled or disabled through @memory.use_priority_oom (for global
      OOM, through the root memcg's @memory.use_priority_oom) and is disabled
      by default.
      
      Signed-off-by: Wenwei Tao <wenwei.tao@linux.alibaba.com>
      Reviewed-by: Xunlei Pang <xlpang@linux.alibaba.com>
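The selection rule in the priority OOM commit above can be pictured with a small userspace model. This is only a sketch under stated assumptions, not the kernel implementation: `struct memcg_model` and `pick_victim_memcg()` are hypothetical names, the model always picks the single lowest priority, and breaking ties by larger memory usage is an assumption the commit message does not state.

```c
#include <assert.h>
#include <stddef.h>

#define MEMCG_PRIO_MAX 12	/* priorities run from 0 to 12 per the commit text */

/* Hypothetical userspace stand-in for a memory cgroup. */
struct memcg_model {
	const char *name;
	int priority;		/* 0..12, higher means better protected */
	unsigned long usage_kb;	/* used only as a tie-break in this sketch */
};

/*
 * Return the index of the cgroup the victim would be chosen from:
 * the lowest priority wins; ties go to the larger memory user
 * (the tie-break is an assumption of this sketch).
 * Returns -1 if the array is empty.
 */
int pick_victim_memcg(const struct memcg_model *cgs, size_t n)
{
	int best = -1;
	size_t i;

	for (i = 0; i < n; i++) {
		assert(cgs[i].priority >= 0 &&
		       cgs[i].priority <= MEMCG_PRIO_MAX);
		if (best < 0 ||
		    cgs[i].priority < cgs[best].priority ||
		    (cgs[i].priority == cgs[best].priority &&
		     cgs[i].usage_kb > cgs[best].usage_kb))
			best = (int)i;
	}
	return best;
}
```

With cgroups at priorities 12, 0, 0 and 6, the model returns one of the two priority-0 groups, so the priority-12 and priority-6 groups survive as long as a lower-priority candidate exists.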
  5. 05 10月, 2019 1 次提交
    • T
      memcg, oom: don't require __GFP_FS when invoking memcg OOM killer · d40b3eaf
      Committed by Tetsuo Handa
      commit f9c645621a28e37813a1de96d9cbd89cde94a1e4 upstream.
      
      Masoud Sharbiani noticed that commit 29ef680a ("memcg, oom: move
      out_of_memory back to the charge path") broke memcg OOM called from
      __xfs_filemap_fault() path.  It turned out that try_charge() is retrying
      forever without making forward progress because mem_cgroup_oom(GFP_NOFS)
      cannot invoke the OOM killer due to commit 3da88fb3 ("mm, oom:
      move GFP_NOFS check to out_of_memory").
      
      Allowing a forced charge because the memcg OOM killer cannot be invoked
      will lead to a global OOM situation.  Also, just returning -ENOMEM is risky
      because the OOM path is lost and some paths (e.g. get_user_pages()) will
      leak -ENOMEM.  Therefore, invoking the memcg OOM killer (despite GFP_NOFS)
      is the only option for now.
      
      Until 29ef680a, we were able to invoke memcg OOM killer when
      GFP_KERNEL reclaim failed [1].  But since 29ef680a, we need to
      invoke memcg OOM killer when GFP_NOFS reclaim failed [2].  Although in the
      past we did invoke memcg OOM killer for GFP_NOFS [3], we might get
      premature memcg OOM reports due to this patch.
      
      [1]
      
       leaker invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
       CPU: 0 PID: 2746 Comm: leaker Not tainted 4.18.0+ #19
       Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
       Call Trace:
        dump_stack+0x63/0x88
        dump_header+0x67/0x27a
        ? mem_cgroup_scan_tasks+0x91/0xf0
        oom_kill_process+0x210/0x410
        out_of_memory+0x10a/0x2c0
        mem_cgroup_out_of_memory+0x46/0x80
        mem_cgroup_oom_synchronize+0x2e4/0x310
        ? high_work_func+0x20/0x20
        pagefault_out_of_memory+0x31/0x76
        mm_fault_error+0x55/0x115
        ? handle_mm_fault+0xfd/0x220
        __do_page_fault+0x433/0x4e0
        do_page_fault+0x22/0x30
        ? page_fault+0x8/0x30
        page_fault+0x1e/0x30
       RIP: 0033:0x4009f0
       Code: 03 00 00 00 e8 71 fd ff ff 48 83 f8 ff 49 89 c6 74 74 48 89 c6 bf c0 0c 40 00 31 c0 e8 69 fd ff ff 45 85 ff 7e 21 31 c9 66 90 <41> 0f be 14 0e 01 d3 f7 c1 ff 0f 00 00 75 05 41 c6 04 0e 2a 48 83
       RSP: 002b:00007ffe29ae96f0 EFLAGS: 00010206
       RAX: 000000000000001b RBX: 0000000000000000 RCX: 0000000001ce1000
       RDX: 0000000000000000 RSI: 000000007fffffe5 RDI: 0000000000000000
       RBP: 000000000000000c R08: 0000000000000000 R09: 00007f94be09220d
       R10: 0000000000000002 R11: 0000000000000246 R12: 00000000000186a0
       R13: 0000000000000003 R14: 00007f949d845000 R15: 0000000002800000
       Task in /leaker killed as a result of limit of /leaker
       memory: usage 524288kB, limit 524288kB, failcnt 158965
       memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
       kmem: usage 2016kB, limit 9007199254740988kB, failcnt 0
       Memory cgroup stats for /leaker: cache:844KB rss:521136KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:132KB writeback:0KB inactive_anon:0KB active_anon:521224KB inactive_file:1012KB active_file:8KB unevictable:0KB
       Memory cgroup out of memory: Kill process 2746 (leaker) score 998 or sacrifice child
       Killed process 2746 (leaker) total-vm:536704kB, anon-rss:521176kB, file-rss:1208kB, shmem-rss:0kB
       oom_reaper: reaped process 2746 (leaker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      
      [2]
      
       leaker invoked oom-killer: gfp_mask=0x600040(GFP_NOFS), nodemask=(null), order=0, oom_score_adj=0
       CPU: 1 PID: 2746 Comm: leaker Not tainted 4.18.0+ #20
       Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
       Call Trace:
        dump_stack+0x63/0x88
        dump_header+0x67/0x27a
        ? mem_cgroup_scan_tasks+0x91/0xf0
        oom_kill_process+0x210/0x410
        out_of_memory+0x109/0x2d0
        mem_cgroup_out_of_memory+0x46/0x80
        try_charge+0x58d/0x650
        ? __radix_tree_replace+0x81/0x100
        mem_cgroup_try_charge+0x7a/0x100
        __add_to_page_cache_locked+0x92/0x180
        add_to_page_cache_lru+0x4d/0xf0
        iomap_readpages_actor+0xde/0x1b0
        ? iomap_zero_range_actor+0x1d0/0x1d0
        iomap_apply+0xaf/0x130
        iomap_readpages+0x9f/0x150
        ? iomap_zero_range_actor+0x1d0/0x1d0
        xfs_vm_readpages+0x18/0x20 [xfs]
        read_pages+0x60/0x140
        __do_page_cache_readahead+0x193/0x1b0
        ondemand_readahead+0x16d/0x2c0
        page_cache_async_readahead+0x9a/0xd0
        filemap_fault+0x403/0x620
        ? alloc_set_pte+0x12c/0x540
        ? _cond_resched+0x14/0x30
        __xfs_filemap_fault+0x66/0x180 [xfs]
        xfs_filemap_fault+0x27/0x30 [xfs]
        __do_fault+0x19/0x40
        __handle_mm_fault+0x8e8/0xb60
        handle_mm_fault+0xfd/0x220
        __do_page_fault+0x238/0x4e0
        do_page_fault+0x22/0x30
        ? page_fault+0x8/0x30
        page_fault+0x1e/0x30
       RIP: 0033:0x4009f0
       Code: 03 00 00 00 e8 71 fd ff ff 48 83 f8 ff 49 89 c6 74 74 48 89 c6 bf c0 0c 40 00 31 c0 e8 69 fd ff ff 45 85 ff 7e 21 31 c9 66 90 <41> 0f be 14 0e 01 d3 f7 c1 ff 0f 00 00 75 05 41 c6 04 0e 2a 48 83
       RSP: 002b:00007ffda45c9290 EFLAGS: 00010206
       RAX: 000000000000001b RBX: 0000000000000000 RCX: 0000000001a1e000
       RDX: 0000000000000000 RSI: 000000007fffffe5 RDI: 0000000000000000
       RBP: 000000000000000c R08: 0000000000000000 R09: 00007f6d061ff20d
       R10: 0000000000000002 R11: 0000000000000246 R12: 00000000000186a0
       R13: 0000000000000003 R14: 00007f6ce59b2000 R15: 0000000002800000
       Task in /leaker killed as a result of limit of /leaker
       memory: usage 524288kB, limit 524288kB, failcnt 7221
       memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
       kmem: usage 1944kB, limit 9007199254740988kB, failcnt 0
       Memory cgroup stats for /leaker: cache:3632KB rss:518232KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:518408KB inactive_file:3908KB active_file:12KB unevictable:0KB
       Memory cgroup out of memory: Kill process 2746 (leaker) score 992 or sacrifice child
       Killed process 2746 (leaker) total-vm:536704kB, anon-rss:518264kB, file-rss:1188kB, shmem-rss:0kB
       oom_reaper: reaped process 2746 (leaker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      
      [3]
      
       leaker invoked oom-killer: gfp_mask=0x50, order=0, oom_score_adj=0
       leaker cpuset=/ mems_allowed=0
       CPU: 1 PID: 3206 Comm: leaker Not tainted 3.10.0-957.27.2.el7.x86_64 #1
       Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
       Call Trace:
        [<ffffffffaf364147>] dump_stack+0x19/0x1b
        [<ffffffffaf35eb6a>] dump_header+0x90/0x229
        [<ffffffffaedbb456>] ? find_lock_task_mm+0x56/0xc0
        [<ffffffffaee32a38>] ? try_get_mem_cgroup_from_mm+0x28/0x60
        [<ffffffffaedbb904>] oom_kill_process+0x254/0x3d0
        [<ffffffffaee36c36>] mem_cgroup_oom_synchronize+0x546/0x570
        [<ffffffffaee360b0>] ? mem_cgroup_charge_common+0xc0/0xc0
        [<ffffffffaedbc194>] pagefault_out_of_memory+0x14/0x90
        [<ffffffffaf35d072>] mm_fault_error+0x6a/0x157
        [<ffffffffaf3717c8>] __do_page_fault+0x3c8/0x4f0
        [<ffffffffaf371925>] do_page_fault+0x35/0x90
        [<ffffffffaf36d768>] page_fault+0x28/0x30
       Task in /leaker killed as a result of limit of /leaker
       memory: usage 524288kB, limit 524288kB, failcnt 20628
       memory+swap: usage 524288kB, limit 9007199254740988kB, failcnt 0
       kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
       Memory cgroup stats for /leaker: cache:840KB rss:523448KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:523448KB inactive_file:464KB active_file:376KB unevictable:0KB
       Memory cgroup out of memory: Kill process 3206 (leaker) score 970 or sacrifice child
       Killed process 3206 (leaker) total-vm:536692kB, anon-rss:523304kB, file-rss:412kB, shmem-rss:0kB
      
      Bisected by Masoud Sharbiani.
      
      Link: http://lkml.kernel.org/r/cbe54ed1-b6ba-a056-8899-2dc42526371d@i-love.sakura.ne.jp
      Fixes: 3da88fb3 ("mm, oom: move GFP_NOFS check to out_of_memory") [necessary after 29ef680a]
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reported-by: Masoud Sharbiani <msharbiani@apple.com>
      Tested-by: Masoud Sharbiani <msharbiani@apple.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: <stable@vger.kernel.org>	[4.19+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
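The rule this commit settles on (a GFP_NOFS allocation still backs off in the global case, but a memcg OOM goes ahead) can be modelled in a few lines. This is a hedged userspace sketch, not the kernel code: `oom_should_back_off()` is a hypothetical name, and `MODEL_GFP_FS` is an illustrative bit chosen to match the 4.18-era traces quoted above, where GFP_NOFS is 0x600040 and GFP_HIGHUSER_MOVABLE is 0x6200ca.

```c
#include <assert.h>

/*
 * Illustrative flag value only: 0x80 is set in the GFP_HIGHUSER_MOVABLE
 * mask (0x6200ca) and clear in the GFP_NOFS mask (0x600040) shown in the
 * traces. Real kernels define __GFP_FS themselves.
 */
#define MODEL_GFP_FS 0x80u

/*
 * Returns 1 if the OOM killer should be skipped for this allocation.
 * A non-FS allocation backs off for a global OOM, while a memcg OOM
 * proceeds regardless of the FS bit.
 */
int oom_should_back_off(unsigned int gfp_mask, int is_memcg_oom)
{
	assert(is_memcg_oom == 0 || is_memcg_oom == 1);
	if (gfp_mask && !(gfp_mask & MODEL_GFP_FS) && !is_memcg_oom)
		return 1;
	return 0;
}
```

In this model, the mask from trace [2] (0x600040) backs off only when `is_memcg_oom` is 0, which corresponds to the try_charge() path no longer retrying forever without invoking the memcg OOM killer.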
  6. 06 4月, 2019 1 次提交
    • T
      mm,oom: don't kill global init via memory.oom.group · eed3ca0a
      Committed by Tetsuo Handa
      [ Upstream commit d342a0b38674867ea67fde47b0e1e60ffe9f17a2 ]
      
      Since setting global init process to some memory cgroup is technically
      possible, oom_kill_memcg_member() must check it.
      
        Tasks in /test1 are going to be killed due to memory.oom.group set
        Memory cgroup out of memory: Killed process 1 (systemd) total-vm:43400kB, anon-rss:1228kB, file-rss:3992kB, shmem-rss:0kB
        oom_reaper: reaped process 1 (systemd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
        Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000008b
      
      #include <stdio.h>
      #include <string.h>
      #include <unistd.h>
      #include <sys/types.h>
      #include <sys/stat.h>
      #include <fcntl.h>
      
      int main(int argc, char *argv[])
      {
      	static char buffer[10485760];
      	static int pipe_fd[2] = { EOF, EOF };
      	unsigned int i;
      	int fd;
      	char buf[64] = { };
      	if (pipe(pipe_fd))
      		return 1;
      	if (chdir("/sys/fs/cgroup/"))
      		return 1;
      	fd = open("cgroup.subtree_control", O_WRONLY);
      	write(fd, "+memory", 7);
      	close(fd);
      	mkdir("test1", 0755);
      	fd = open("test1/memory.oom.group", O_WRONLY);
      	write(fd, "1", 1);
      	close(fd);
      	fd = open("test1/cgroup.procs", O_WRONLY);
      	write(fd, "1", 1);
      	snprintf(buf, sizeof(buf) - 1, "%d", getpid());
      	write(fd, buf, strlen(buf));
      	close(fd);
      	snprintf(buf, sizeof(buf) - 1, "%lu", sizeof(buffer) * 5);
      	fd = open("test1/memory.max", O_WRONLY);
      	write(fd, buf, strlen(buf));
      	close(fd);
      	for (i = 0; i < 10; i++)
      		if (fork() == 0) {
      			char c;
      			close(pipe_fd[1]);
      			read(pipe_fd[0], &c, 1);
      			memset(buffer, 0, sizeof(buffer));
      			sleep(3);
      			_exit(0);
      		}
      	close(pipe_fd[0]);
      	close(pipe_fd[1]);
      	sleep(3);
      	return 0;
      }
      
      [   37.052923][ T9185] a.out invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
      [   37.056169][ T9185] CPU: 4 PID: 9185 Comm: a.out Kdump: loaded Not tainted 5.0.0-rc4-next-20190131 #280
      [   37.059205][ T9185] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
      [   37.062954][ T9185] Call Trace:
      [   37.063976][ T9185]  dump_stack+0x67/0x95
      [   37.065263][ T9185]  dump_header+0x51/0x570
      [   37.066619][ T9185]  ? trace_hardirqs_on+0x3f/0x110
      [   37.068171][ T9185]  ? _raw_spin_unlock_irqrestore+0x3d/0x70
      [   37.069967][ T9185]  oom_kill_process+0x18d/0x210
      [   37.071515][ T9185]  out_of_memory+0x11b/0x380
      [   37.072936][ T9185]  mem_cgroup_out_of_memory+0xb6/0xd0
      [   37.074601][ T9185]  try_charge+0x790/0x820
      [   37.076021][ T9185]  mem_cgroup_try_charge+0x42/0x1d0
      [   37.077629][ T9185]  mem_cgroup_try_charge_delay+0x11/0x30
      [   37.079370][ T9185]  do_anonymous_page+0x105/0x5e0
      [   37.080939][ T9185]  __handle_mm_fault+0x9cb/0x1070
      [   37.082485][ T9185]  handle_mm_fault+0x1b2/0x3a0
      [   37.083819][ T9185]  ? handle_mm_fault+0x47/0x3a0
      [   37.085181][ T9185]  __do_page_fault+0x255/0x4c0
      [   37.086529][ T9185]  do_page_fault+0x28/0x260
      [   37.087788][ T9185]  ? page_fault+0x8/0x30
      [   37.088978][ T9185]  page_fault+0x1e/0x30
      [   37.090142][ T9185] RIP: 0033:0x7f8b183aefe0
      [   37.091433][ T9185] Code: 20 f3 44 0f 7f 44 17 d0 f3 44 0f 7f 47 30 f3 44 0f 7f 44 17 c0 48 01 fa 48 83 e2 c0 48 39 d1 74 a3 66 0f 1f 84 00 00 00 00 00 <66> 44 0f 7f 01 66 44 0f 7f 41 10 66 44 0f 7f 41 20 66 44 0f 7f 41
      [   37.096917][ T9185] RSP: 002b:00007fffc5d329e8 EFLAGS: 00010206
      [   37.098615][ T9185] RAX: 00000000006010e0 RBX: 0000000000000008 RCX: 0000000000c30000
      [   37.100905][ T9185] RDX: 00000000010010c0 RSI: 0000000000000000 RDI: 00000000006010e0
      [   37.103349][ T9185] RBP: 0000000000000000 R08: 00007f8b188f4740 R09: 0000000000000000
      [   37.105797][ T9185] R10: 00007fffc5d32420 R11: 00007f8b183aef40 R12: 0000000000000005
      [   37.108228][ T9185] R13: 0000000000000000 R14: ffffffffffffffff R15: 0000000000000000
      [   37.110840][ T9185] memory: usage 51200kB, limit 51200kB, failcnt 125
      [   37.113045][ T9185] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
      [   37.115808][ T9185] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
      [   37.117660][ T9185] Memory cgroup stats for /test1: cache:0KB rss:49484KB rss_huge:30720KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:49700KB inactive_file:0KB active_file:0KB unevictable:0KB
      [   37.123371][ T9185] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/test1,task_memcg=/test1,task=a.out,pid=9188,uid=0
      [   37.128158][ T9185] Memory cgroup out of memory: Killed process 9188 (a.out) total-vm:14456kB, anon-rss:10324kB, file-rss:504kB, shmem-rss:0kB
      [   37.132710][ T9185] Tasks in /test1 are going to be killed due to memory.oom.group set
      [   37.132833][   T54] oom_reaper: reaped process 9188 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      [   37.135498][ T9185] Memory cgroup out of memory: Killed process 1 (systemd) total-vm:43400kB, anon-rss:1228kB, file-rss:3992kB, shmem-rss:0kB
      [   37.143434][ T9185] Memory cgroup out of memory: Killed process 9182 (a.out) total-vm:14456kB, anon-rss:76kB, file-rss:588kB, shmem-rss:0kB
      [   37.144328][   T54] oom_reaper: reaped process 1 (systemd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      [   37.147585][ T9185] Memory cgroup out of memory: Killed process 9183 (a.out) total-vm:14456kB, anon-rss:6228kB, file-rss:512kB, shmem-rss:0kB
      [   37.157222][ T9185] Memory cgroup out of memory: Killed process 9184 (a.out) total-vm:14456kB, anon-rss:6228kB, file-rss:508kB, shmem-rss:0kB
      [   37.157259][ T9185] Memory cgroup out of memory: Killed process 9185 (a.out) total-vm:14456kB, anon-rss:6228kB, file-rss:512kB, shmem-rss:0kB
      [   37.157291][ T9185] Memory cgroup out of memory: Killed process 9186 (a.out) total-vm:14456kB, anon-rss:4180kB, file-rss:508kB, shmem-rss:0kB
      [   37.157306][   T54] oom_reaper: reaped process 9183 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      [   37.157328][ T9185] Memory cgroup out of memory: Killed process 9187 (a.out) total-vm:14456kB, anon-rss:4180kB, file-rss:512kB, shmem-rss:0kB
      [   37.157452][ T9185] Memory cgroup out of memory: Killed process 9189 (a.out) total-vm:14456kB, anon-rss:6228kB, file-rss:512kB, shmem-rss:0kB
      [   37.158733][ T9185] Memory cgroup out of memory: Killed process 9190 (a.out) total-vm:14456kB, anon-rss:552kB, file-rss:512kB, shmem-rss:0kB
      [   37.160083][   T54] oom_reaper: reaped process 9186 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      [   37.160187][   T54] oom_reaper: reaped process 9189 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      [   37.206941][   T54] oom_reaper: reaped process 9185 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      [   37.212300][ T9185] Memory cgroup out of memory: Killed process 9191 (a.out) total-vm:14456kB, anon-rss:4180kB, file-rss:512kB, shmem-rss:0kB
      [   37.212317][   T54] oom_reaper: reaped process 9190 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      [   37.218860][ T9185] Memory cgroup out of memory: Killed process 9192 (a.out) total-vm:14456kB, anon-rss:1080kB, file-rss:512kB, shmem-rss:0kB
      [   37.227667][   T54] oom_reaper: reaped process 9192 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      [   37.292323][ T9193] abrt-hook-ccpp (9193) used greatest stack depth: 10480 bytes left
      [   37.351843][    T1] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000008b
      [   37.354833][    T1] CPU: 7 PID: 1 Comm: systemd Kdump: loaded Not tainted 5.0.0-rc4-next-20190131 #280
      [   37.357876][    T1] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
      [   37.361685][    T1] Call Trace:
      [   37.363239][    T1]  dump_stack+0x67/0x95
      [   37.365010][    T1]  panic+0xfc/0x2b0
      [   37.366853][    T1]  do_exit+0xd55/0xd60
      [   37.368595][    T1]  do_group_exit+0x47/0xc0
      [   37.370415][    T1]  get_signal+0x32a/0x920
      [   37.372449][    T1]  ? _raw_spin_unlock_irqrestore+0x3d/0x70
      [   37.374596][    T1]  do_signal+0x32/0x6e0
      [   37.376430][    T1]  ? exit_to_usermode_loop+0x26/0x9b
      [   37.378418][    T1]  ? prepare_exit_to_usermode+0xa8/0xd0
      [   37.380571][    T1]  exit_to_usermode_loop+0x3e/0x9b
      [   37.382588][    T1]  prepare_exit_to_usermode+0xa8/0xd0
      [   37.384594][    T1]  ? page_fault+0x8/0x30
      [   37.386453][    T1]  retint_user+0x8/0x18
      [   37.388160][    T1] RIP: 0033:0x7f42c06974a8
      [   37.389922][    T1] Code: Bad RIP value.
      [   37.391788][    T1] RSP: 002b:00007ffc3effd388 EFLAGS: 00010213
      [   37.394075][    T1] RAX: 000000000000000e RBX: 00007ffc3effd390 RCX: 0000000000000000
      [   37.396963][    T1] RDX: 000000000000002a RSI: 00007ffc3effd390 RDI: 0000000000000004
      [   37.399550][    T1] RBP: 00007ffc3effd680 R08: 0000000000000000 R09: 0000000000000000
      [   37.402334][    T1] R10: 00000000ffffffff R11: 0000000000000246 R12: 0000000000000001
      [   37.404890][    T1] R13: ffffffffffffffff R14: 0000000000000884 R15: 000056460b1ac3b0
      
      Link: http://lkml.kernel.org/r/201902010336.x113a4EO027170@www262.sakura.ne.jp
      Fixes: 3d8b38eb ("mm, oom: introduce memory.oom.group")
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Sasha Levin <sashal@kernel.org>
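The check this commit adds can be illustrated with a userspace model of a group kill. This is a sketch under assumptions rather than the kernel function: `struct task_model` and `kill_group_members()` are hypothetical, global init is approximated as pid 1, and skipping tasks that opted out with `OOM_SCORE_ADJ_MIN` is included as an assumed companion rule.

```c
#include <assert.h>
#include <stddef.h>

#define OOM_SCORE_ADJ_MIN (-1000)	/* same value as the kernel constant */

/* Hypothetical userspace stand-in for a task attached to a cgroup. */
struct task_model {
	int pid;
	int oom_score_adj;
	int killed;
};

/*
 * Model of a memory.oom.group kill: mark every member killed except
 * tasks that opted out via OOM_SCORE_ADJ_MIN and the global init
 * process (approximated as pid 1 here).
 * Returns the number of tasks killed.
 */
size_t kill_group_members(struct task_model *tasks, size_t n)
{
	size_t i, killed = 0;

	assert(tasks != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (tasks[i].pid == 1)
			continue;	/* never kill global init */
		if (tasks[i].oom_score_adj == OOM_SCORE_ADJ_MIN)
			continue;	/* task opted out of OOM kills */
		tasks[i].killed = 1;
		killed++;
	}
	return killed;
}
```

Run against a group containing pid 1 and a few `a.out` workers, as in the reproducer, the model kills only the workers, so the "Attempted to kill init!" panic in the log has no counterpart.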
  7. 07 2月, 2019 2 次提交
    • S
      mm, oom: fix use-after-free in oom_kill_process · b6f534ab
      Committed by Shakeel Butt
      commit cefc7ef3c87d02fc9307835868ff721ea12cc597 upstream.
      
      Syzbot instance running on upstream kernel found a use-after-free bug in
      oom_kill_process.  On further inspection it seems like the process
      selected to be oom-killed has exited even before reaching
      read_lock(&tasklist_lock) in oom_kill_process().  More specifically the
      tsk->usage is 1 which is due to get_task_struct() in oom_evaluate_task()
      and the put_task_struct within for_each_thread() frees the tsk and
      for_each_thread() tries to access the tsk.  The easiest fix is to do
      get/put across the for_each_thread() on the selected task.
      
      Now the next question is should we continue with the oom-kill as the
      previously selected task has exited? However before adding more
      complexity and heuristics, let's answer why we even look at the children
      of oom-kill selected task? The select_bad_process() has already selected
      the worst process in the system/memcg.  Due to race, the selected
      process might not be the worst at the kill time but does that matter?
      The userspace can use the oom_score_adj interface to prefer children to
      be killed before the parent.  I looked at the history but it seems like
      this is there before git history.
      
      Link: http://lkml.kernel.org/r/20190121215850.221745-1-shakeelb@google.com
      Reported-by: syzbot+7fbbfa368521945f0e3d@syzkaller.appspotmail.com
      Fixes: 6b0c81b3 ("mm, oom: reduce dependency on tasklist_lock")
      Signed-off-by: Shakeel Butt <shakeelb@google.com>
      Reviewed-by: Roman Gushchin <guro@fb.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
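The "get/put across the loop" fix described in this commit can be shown with a toy reference-count model. This is a sketch, not kernel code: `struct task_ref`, `task_get()`, `task_put()` and `iterate_children()` are hypothetical names, and a `freed` flag stands in for actually freeing memory so that the model can safely observe the bug.

```c
#include <assert.h>

/* Hypothetical stand-in for a refcounted task. */
struct task_ref {
	int usage;	/* reference count */
	int freed;	/* set when the count reaches zero */
};

void task_get(struct task_ref *t)
{
	assert(!t->freed);
	t->usage++;
}

void task_put(struct task_ref *t)
{
	assert(t->usage > 0);
	if (--t->usage == 0)
		t->freed = 1;	/* the real code would free the task here */
}

/*
 * Walk nr_steps loop iterations that all dereference 'p'. On the first
 * step the only reference held by the selector is dropped, as happens
 * when the victim is switched to a child. With 'pin' set, an extra
 * reference is held across the whole loop.
 * Returns 1 if 'p' stayed alive for every step, 0 otherwise.
 */
int iterate_children(struct task_ref *p, int nr_steps, int pin)
{
	int alive_throughout = 1;
	int i;

	if (pin)
		task_get(p);
	for (i = 0; i < nr_steps; i++) {
		if (p->freed) {		/* the loop would touch freed memory */
			alive_throughout = 0;
			break;
		}
		if (i == 0)
			task_put(p);	/* selector's reference is dropped */
	}
	if (pin)
		task_put(p);		/* release the pin after the loop */
	return alive_throughout;
}
```

Starting from `usage == 1`, the unpinned walk sees the task freed on its second step, while the pinned walk completes and releases the task only after the loop ends.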
    • T
      oom, oom_reaper: do not enqueue same task twice · 7e70ddc3
      Committed by Tetsuo Handa
      commit 9bcdeb51bd7d2ae9fe65ea4d60643d2aeef5bfe3 upstream.
      
      Arkadiusz reported that enabling memcg's group oom killing causes
      strange memcg statistics where there is no task in a memcg despite the
      number of tasks in that memcg not being 0.  It turned out that there is a
      bug in wake_oom_reaper() which allows enqueuing the same task twice, which
      makes it impossible to decrease the number of tasks in that memcg due to a
      refcount leak.
      
      This bug existed since the OOM reaper became invokable from
      task_will_free_mem(current) path in out_of_memory() in Linux 4.7,
      
        T1@P1     |T2@P1     |T3@P1     |OOM reaper
        ----------+----------+----------+------------
                                         # Processing an OOM victim in a different memcg domain.
                              try_charge()
                                mem_cgroup_out_of_memory()
                                  mutex_lock(&oom_lock)
                   try_charge()
                     mem_cgroup_out_of_memory()
                       mutex_lock(&oom_lock)
        try_charge()
          mem_cgroup_out_of_memory()
            mutex_lock(&oom_lock)
                                  out_of_memory()
                                    oom_kill_process(P1)
                                      do_send_sig_info(SIGKILL, @P1)
                                      mark_oom_victim(T1@P1)
                                      wake_oom_reaper(T1@P1) # T1@P1 is enqueued.
                                  mutex_unlock(&oom_lock)
                       out_of_memory()
                         mark_oom_victim(T2@P1)
                         wake_oom_reaper(T2@P1) # T2@P1 is enqueued.
                       mutex_unlock(&oom_lock)
            out_of_memory()
              mark_oom_victim(T1@P1)
              wake_oom_reaper(T1@P1) # T1@P1 is enqueued again due to oom_reaper_list == T2@P1 && T1@P1->oom_reaper_list == NULL.
            mutex_unlock(&oom_lock)
                                         # Completed processing an OOM victim in a different memcg domain.
                                         spin_lock(&oom_reaper_lock)
                                         # T1@P1 is dequeued.
                                         spin_unlock(&oom_reaper_lock)
      
      but memcg's group oom killing made it easier to trigger this bug by
      calling wake_oom_reaper() on the same task from one out_of_memory()
      request.
      
      Fix this bug using an approach used by commit 855b0183 ("oom,
      oom_reaper: disable oom_reaper for oom_kill_allocating_task").  As a
      side effect, this patch also avoids enqueuing multiple threads sharing
      memory via the task_will_free_mem(current) path.
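
      The double enqueue in the diagram above can be reproduced with a small
      userspace model. This is a sketch under assumptions: it supposes the
      fix amounts to a per-mm "already queued" flag, and the names
      reap_queued, wake_old() and wake_flagged() are hypothetical. The kernel
      would need an atomic test-and-set; the model is single-threaded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct mm { bool reap_queued; };   /* hypothetical per-mm "already queued" bit */

struct task {
    struct mm *mm;
    struct task *next;             /* models tsk->oom_reaper_list */
    int usage;                     /* models the task reference count */
};

static struct task *reaper_list;   /* models the global oom_reaper_list head */

static void enqueue(struct task *t)
{
    assert(t != NULL);
    t->usage++;                    /* reference taken for the reaper */
    t->next = reaper_list;
    reaper_list = t;
}

/* Old check: only catches the list head or a task that has a successor. */
static bool wake_old(struct task *t)
{
    if (t == reaper_list || t->next)
        return false;
    enqueue(t);
    return true;
}

/* Flag-based check: at most one enqueue per mm, whatever the list looks like. */
static bool wake_flagged(struct task *t)
{
    if (t->mm->reap_queued)
        return false;
    t->mm->reap_queued = true;
    enqueue(t);
    return true;
}
```

      Replaying T1, T2, T1 through wake_old() enqueues T1 twice because T1 is
      neither the list head nor linked to a successor; wake_flagged() rejects
      both the repeat and the second thread sharing the same mm.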
      
      Link: http://lkml.kernel.org/r/e865a044-2c10-9858-f4ef-254bc71d6cc2@i-love.sakura.ne.jp
      Link: http://lkml.kernel.org/r/5ee34fc6-1485-34f8-8790-903ddabaa809@i-love.sakura.ne.jp
      Fixes: af8e15cc ("oom, oom_reaper: do not enqueue task if it is on the oom_reaper_list head")
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reported-by: Arkadiusz Miskiewicz <arekm@maven.pl>
      Tested-by: Arkadiusz Miskiewicz <arekm@maven.pl>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Roman Gushchin <guro@fb.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Aleksa Sarai <asarai@suse.de>
      Cc: Jay Kamat <jgkamat@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
      7e70ddc3
  8. 05 Sep, 2018 2 commits
  9. 23 Aug, 2018 6 commits
    • R
      mm, oom: introduce memory.oom.group · 3d8b38eb
      Roman Gushchin authored
      For some workloads an intervention from the OOM killer can be painful.
      Killing a random task can bring the workload into an inconsistent state.
      
      Historically, there are two common solutions for this
      problem:
      1) enabling panic_on_oom,
      2) using a userspace daemon to monitor OOMs and kill
         all outstanding processes.
      
      Both approaches have their downsides: rebooting on each OOM is an obvious
      waste of capacity, and handling all in userspace is tricky and requires a
      userspace agent, which will monitor all cgroups for OOMs.
      
      In most cases an in-kernel after-OOM cleaning-up mechanism can eliminate
      the necessity of enabling panic_on_oom.  Also, it can simplify the cgroup
      management for userspace applications.
      
      This commit introduces a new knob for cgroup v2 memory controller:
      memory.oom.group.  The knob determines whether the cgroup should be
      treated as an indivisible workload by the OOM killer.  If set, all tasks
      belonging to the cgroup or to its descendants (if the memory cgroup is not
      a leaf cgroup) are killed together or not at all.
      
      To determine which cgroup has to be killed, we traverse the cgroup
      hierarchy from the victim task's cgroup up to the OOMing cgroup (or root),
      looking for the highest-level cgroup with memory.oom.group set.
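
      The traversal can be sketched as follows. struct cgroup and
      find_oom_group() are hypothetical names for illustration, and the
      sketch assumes the OOMing cgroup itself is included in the walk; it
      is not the kernel's implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct cgroup {
    struct cgroup *parent;   /* NULL for the root */
    bool oom_group;          /* models memory.oom.group */
};

/*
 * Walk from the victim's cgroup up to the OOMing cgroup (inclusive, an
 * assumption of this sketch) and return the highest-level cgroup with
 * oom_group set, or NULL if none is set on that path.
 */
static struct cgroup *find_oom_group(struct cgroup *victim,
                                     struct cgroup *oom_domain)
{
    struct cgroup *found = NULL;

    assert(victim != NULL);
    for (struct cgroup *cg = victim; cg; cg = cg->parent) {
        if (cg->oom_group)
            found = cg;      /* a later (higher) match overrides a lower one */
        if (cg == oom_domain)
            break;           /* never look above the OOMing cgroup */
    }
    return found;
}
```

      A NULL result means no cgroup on the path opted in, so only the
      selected task would be killed.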
      
      Tasks with the OOM protection (oom_score_adj set to -1000) are treated as
      an exception and are never killed.
      
      This patch doesn't change the OOM victim selection algorithm.
      
      Link: http://lkml.kernel.org/r/20180802003201.817-4-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3d8b38eb
    • R
      mm, oom: refactor oom_kill_process() · 5989ad7b
      Roman Gushchin authored
      Patch series "introduce memory.oom.group", v2.
      
      This is a tiny implementation of cgroup-aware OOM killer, which adds an
      ability to kill a cgroup as a single unit and so guarantee the integrity
      of the workload.
      
      Although it has only limited functionality in comparison to what now
      resides in the mm tree (it doesn't change the victim task selection
      algorithm, doesn't look at memory stats on cgroup level, etc), it's also
      much simpler and more straightforward.  So, hopefully, we can avoid having
      long debates here, as we had with the full implementation.
      
      As it doesn't prevent any further development, and implements a useful and
      complete feature, it looks like a sane way forward.
      
      This patch (of 2):
      
      oom_kill_process() consists of two logical parts: the first one is
      responsible for considering task's children as a potential victim and
      printing the debug information.  The second half is responsible for
      sending SIGKILL to all tasks sharing the mm struct with the given victim.
      
      This commit splits oom_kill_process() with the intention of re-using the
      second half: __oom_kill_process().
      
      The cgroup-aware OOM killer will kill multiple tasks belonging to the
      victim cgroup.  We don't need to print the debug information for each
      task, or play with task selection (considering task's children),
      so we can't use the existing oom_kill_process().
      
      Link: http://lkml.kernel.org/r/20171130152824.1591-2-guro@fb.com
      Link: http://lkml.kernel.org/r/20180802003201.817-3-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5989ad7b
    • M
      mm/oom_kill.c: clean up oom_reap_task_mm() · 431f42fd
      Michal Hocko authored
      Andrew has noticed some inconsistencies in oom_reap_task_mm.  Notably
      
       - Undocumented return value.
      
       - comment "failed to reap part..." is misleading - sounds like it's
         referring to something which happened in the past, is in fact
         referring to something which might happen in the future.
      
       - fails to call trace_finish_task_reaping() in one case
      
       - code duplication.
      
       - Increases mmap_sem hold time a little by moving
         trace_finish_task_reaping() inside the locked region.  So sue me ;)
      
       - Sharing the finish: path means that the trace event won't
         distinguish between the two sources of finishing.
      
      Add a short explanation for the return value and fix the rest by
      reorganizing the function a bit to have unified function exit paths.
      
      Link: http://lkml.kernel.org/r/20180724141747.GP28386@dhcp22.suse.cz
      Suggested-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      431f42fd
    • R
      mm, oom: describe task memory unit, larger PID pad · c3b78b11
      Rodrigo Freire authored
      The default page memory unit of OOM task dump events might not be
      intuitive and is potentially misleading for the uninitiated when debugging
      OOM events: these are pages and not kBs.  Add a small printk prior to the
      task dump informing that the memory units are actually memory _pages_.
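
      For readers decoding such a dump, the conversion is simple arithmetic.
      A minimal sketch, with hypothetical helper names and assuming the page
      size is a multiple of 1024 bytes:

```c
#include <assert.h>
#include <unistd.h>

/* Convert a page count, as printed in the OOM task dump, to KiB. */
static unsigned long pages_to_kib(unsigned long pages, unsigned long page_size)
{
    assert(page_size % 1024 == 0);
    return pages * (page_size / 1024);
}

/* Same conversion using the page size reported by the running system. */
static unsigned long pages_to_kib_here(unsigned long pages)
{
    return pages_to_kib(pages, (unsigned long)sysconf(_SC_PAGESIZE));
}
```

      With 4096-byte pages, a value of 25600 in the dump corresponds to
      102400 KiB (100 MiB).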
      
      Also extend the PID field to align on up to 7 characters.
      Reference https://lkml.org/lkml/2018/7/3/1201
      
      Link: http://lkml.kernel.org/r/c795eb5129149ed8a6345c273aba167ff1bbd388.1530715938.git.rfreire@redhat.com
      Signed-off-by: Rodrigo Freire <rfreire@redhat.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c3b78b11
    • M
      mm, oom: remove oom_lock from oom_reaper · af5679fb
      Michal Hocko authored
      oom_reaper used to rely on the oom_lock since e2fe1456 ("oom_reaper:
      close race with exiting task").  We do not really need the lock anymore
      though.  21292580 ("mm: oom: let oom_reap_task and exit_mmap run
      concurrently") has removed serialization with the exit path based on the
      mm reference count and so we do not really rely on the oom_lock anymore.
      
      Tetsuo was arguing that at least MMF_OOM_SKIP should be set under the lock
      to prevent races when the page allocator didn't manage to get the
      freed (reaped) memory in __alloc_pages_may_oom but sees the flag later
      on and moves on to another victim.  Although this is possible in principle,
      let's wait for it to actually happen in real life before we make the
      locking more complex again.
      
      Therefore remove the oom_lock for oom_reaper paths (both exit_mmap and
      oom_reap_task_mm).  The reaper serializes with exit_mmap by mmap_sem +
      MMF_OOM_SKIP flag.  There is no synchronization with out_of_memory path
      now.
      
      [mhocko@kernel.org: oom_reap_task_mm should return false when __oom_reap_task_mm did]
        Link: http://lkml.kernel.org/r/20180724141747.GP28386@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20180719075922.13784-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Suggested-by: David Rientjes <rientjes@google.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      af5679fb
    • M
      mm, oom: distinguish blockable mode for mmu notifiers · 93065ac7
      Michal Hocko authored
      There are several blockable mmu notifiers which might sleep in
      mmu_notifier_invalidate_range_start and that is a problem for the
      oom_reaper because it needs to guarantee forward progress so it cannot
      depend on any sleepable locks.
      
      Currently we simply back off and mark an oom victim with blockable mmu
      notifiers as done after a short sleep.  That can result in selecting a new
      oom victim prematurely because the previous one still hasn't torn its
      memory down yet.
      
      We can do much better though.  Even if mmu notifiers use sleepable locks
      there is no reason to automatically assume those locks are held.  Moreover,
      the majority of notifiers only care about a portion of the address space and
      there is absolutely zero reason to fail when we are unmapping an unrelated
      range.  Many notifiers do really block and wait for HW, which is harder to
      handle, and there we have to bail out.
      
      This patch handles the low hanging fruit.
      __mmu_notifier_invalidate_range_start gets a blockable flag and callbacks
      are not allowed to sleep if the flag is set to false.  This is achieved by
      using trylock instead of the sleepable lock for most callbacks and
      continuing as long as we do not block down the call chain.
      
      I think we can improve that even further because there is a common pattern
      to do a range lookup first and then do something about that.  The first
      part can be done without a sleeping lock in most cases AFAICS.
      
      The oom_reaper end then simply retries if there is at least one notifier
      which couldn't make any progress in !blockable mode.  A retry loop is
      already implemented to wait for the mmap_sem and this is basically the
      same thing.
      
      The simplest way for driver developers to test this code path is to wrap
      userspace code which uses these notifiers into a memcg and set the hard
      limit to hit the oom.  This can be done e.g.  after the test faults in all
      the mmu notifier managed memory and sets the hard limit to something really
      small.  Then we are looking for a proper process tear down.
      
      [akpm@linux-foundation.org: coding style fixes]
      [akpm@linux-foundation.org: minor code simplification]
      Link: http://lkml.kernel.org/r/20180716115058.5559-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Christian König <christian.koenig@amd.com> # AMD notifiers
      Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx and umem_odp
      Reported-by: David Rientjes <rientjes@google.com>
      Cc: "David (ChunMing) Zhou" <David1.Zhou@amd.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Doug Ledford <dledford@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
      Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
      Cc: Sudeep Dutt <sudeep.dutt@intel.com>
      Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
      Cc: Dimitri Sivanich <sivanich@sgi.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Felix Kuehling <felix.kuehling@amd.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      93065ac7
  10. 18 Aug, 2018 2 commits
  11. 22 Jul, 2018 1 commit
  12. 15 Jun, 2018 1 commit
  13. 08 Jun, 2018 1 commit
  14. 12 May, 2018 1 commit
    • D
      mm, oom: fix concurrent munlock and oom reaper unmap, v3 · 27ae357f
      David Rientjes authored
      Since exit_mmap() is done without the protection of mm->mmap_sem, it is
      possible for the oom reaper to concurrently operate on an mm until
      MMF_OOM_SKIP is set.
      
      This allows munlock_vma_pages_all() to concurrently run while the oom
      reaper is operating on a vma.  Since munlock_vma_pages_range() depends
      on clearing VM_LOCKED from vm_flags before actually doing the munlock to
      determine if any other vmas are locking the same memory, the check for
      VM_LOCKED in the oom reaper is racy.
      
      This is especially noticeable on architectures such as powerpc where
      clearing a huge pmd requires serialize_against_pte_lookup().  If the pmd
      is zapped by the oom reaper during follow_page_mask() after the check
      for pmd_none() is bypassed, this ends up dereferencing a NULL ptl or
      causing a kernel oops.
      
      Fix this by manually freeing all possible memory from the mm before
      doing the munlock and then setting MMF_OOM_SKIP.  The oom reaper can not
      run on the mm anymore so the munlock is safe to do in exit_mmap().  It
      also matches the logic that the oom reaper currently uses for
      determining when to set MMF_OOM_SKIP itself, so there's no new risk of
      excessive oom killing.
      
      This fixes CVE-2018-1000200.
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1804241526320.238665@chino.kir.corp.google.com
      Fixes: 21292580 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
      Signed-off-by: David Rientjes <rientjes@google.com>
      Suggested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>	[4.14+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      27ae357f
  15. 06 Apr, 2018 3 commits
  16. 01 Feb, 2018 1 commit
  17. 15 Dec, 2017 1 commit
    • M
      mm, oom_reaper: fix memory corruption · 4837fe37
      Michal Hocko authored
      David Rientjes has reported the following memory corruption while the
      oom reaper tries to unmap the victim's address space
      
        BUG: Bad page map in process oom_reaper  pte:6353826300000000 pmd:00000000
        addr:00007f50cab1d000 vm_flags:08100073 anon_vma:ffff9eea335603f0 mapping:          (null) index:7f50cab1d
        file:          (null) fault:          (null) mmap:          (null) readpage:          (null)
        CPU: 2 PID: 1001 Comm: oom_reaper
        Call Trace:
           unmap_page_range+0x1068/0x1130
           __oom_reap_task_mm+0xd5/0x16b
           oom_reaper+0xff/0x14c
           kthread+0xc1/0xe0
      
      Tetsuo Handa has noticed that the synchronization inside exit_mmap is
      insufficient.  We only synchronize with the oom reaper if
      tsk_is_oom_victim which is not true if the final __mmput is called from
      a different context than the oom victim exit path.  This can trivially
      happen from context of any task which has grabbed mm reference (e.g.  to
      read /proc/<pid>/ file which requires mm etc.).
      
      The race would look like this
      
        oom_reaper		oom_victim		task
      						mmget_not_zero
      			do_exit
      			  mmput
        __oom_reap_task_mm				mmput
        						  __mmput
      						    exit_mmap
      						      remove_vma
          unmap_page_range
      
      Fix this issue by providing a new mm_is_oom_victim() helper which
      operates on the mm struct rather than a task.  Any context which
      operates on a remote mm struct should use this helper in place of
      tsk_is_oom_victim.  The flag is set in mark_oom_victim and never cleared
      so it is stable in the exit_mmap path.
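
      The difference between the two checks can be modeled as follows. This
      is a simplified sketch: the struct layouts and the must_sync_old() and
      must_sync_new() helpers are hypothetical, and the real kernel state
      lives in other fields.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct mm { bool oom_victim; };          /* models a sticky per-mm flag */
struct task { struct mm *mm; bool oom_victim; };

static void mark_oom_victim(struct task *t)
{
    assert(t != NULL && t->mm != NULL);
    t->oom_victim = true;
    t->mm->oom_victim = true;            /* set once, never cleared */
}

/* Task-based check: answers "is the calling task an OOM victim?" */
static bool tsk_is_oom_victim(const struct task *t) { return t->oom_victim; }

/* mm-based check: answers "does this address space belong to an OOM victim?" */
static bool mm_is_oom_victim(const struct mm *mm) { return mm->oom_victim; }

/* Models the decision in exit_mmap(): must we synchronize with the reaper? */
static bool must_sync_old(const struct task *caller)
{
    return tsk_is_oom_victim(caller);
}

static bool must_sync_new(const struct mm *mm)
{
    return mm_is_oom_victim(mm);
}
```

      When an unrelated task drops the last reference to the victim's mm,
      the task-based check says no synchronization is needed, while the
      mm-based check still reports the mm as belonging to an OOM victim.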
      
      Debugged by Tetsuo Handa.
      
      Link: http://lkml.kernel.org/r/20171210095130.17110-1-mhocko@kernel.org
      Fixes: 21292580 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: David Rientjes <rientjes@google.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Andrea Argangeli <andrea@kernel.org>
      Cc: <stable@vger.kernel.org>	[4.14]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4837fe37
  18. 30 Nov, 2017 1 commit
    • W
      mm, oom_reaper: gather each vma to prevent leaking TLB entry · 687cb088
      Wang Nan authored
      tlb_gather_mmu(&tlb, mm, 0, -1) means gathering the whole virtual memory
      space.  In this case, tlb->fullmm is true.  Some archs like arm64
      don't flush the TLB when tlb->fullmm is true:
      
        commit 5a7862e8 ("arm64: tlbflush: avoid flushing when fullmm == 1").
      
      Which causes leaking of tlb entries.
      
      Will clarifies his patch:
       "Basically, we tag each address space with an ASID (PCID on x86) which
        is resident in the TLB. This means we can elide TLB invalidation when
        pulling down a full mm because we won't ever assign that ASID to
        another mm without doing TLB invalidation elsewhere (which actually
        just nukes the whole TLB).
      
        I think that means that we could potentially not fault on a kernel
        uaccess, because we could hit in the TLB"
      
      There could be a window between complete_signal() sending IPI to other
      cores and all threads sharing this mm actually being kicked off the cores.
      In this window, the oom reaper may call tlb_flush_mmu_tlbonly() to
      flush the TLB and then free pages.  However, due to the above problem, the
      TLB entries are not really flushed on arm64.  Other threads may still
      access these pages through TLB entries.  Moreover, a copy_to_user() can
      also write to these pages without generating a page fault, causing
      use-after-free bugs.
      
      This patch gathers each vma instead of gathering the full vm space.  In
      this case tlb->fullmm is not true.  The behavior of the oom reaper becomes
      similar to munmapping before do_exit, which should be safe for all archs.
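
      Why per-vma gathering avoids the full-mm case can be shown with a few
      lines. This sketch assumes that fullmm is derived from the range alone,
      with (0, -1) meaning the whole address space; range_is_fullmm() and
      count_fullmm_gathers() are hypothetical helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct vma { unsigned long start, end; };

/* Assumed convention: only the range (0, -1) denotes the whole mm. */
static bool range_is_fullmm(unsigned long start, unsigned long end)
{
    return !(start | (end + 1));   /* end + 1 wraps to 0 only for end == -1 */
}

/* Counts how many per-vma gathers would be treated as a full-mm teardown. */
static size_t count_fullmm_gathers(const struct vma *vmas, size_t n)
{
    size_t full = 0;

    assert(vmas != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (range_is_fullmm(vmas[i].start, vmas[i].end))
            full++;
    return full;
}
```

      No ordinary vma range satisfies the full-mm test, so gathering one vma
      at a time always takes the regular flush path.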
      
      Link: http://lkml.kernel.org/r/20171107095453.179940-1-wangnan0@huawei.com
      Fixes: aac45363 ("mm, oom: introduce oom reaper")
      Signed-off-by: Wang Nan <wangnan0@huawei.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      687cb088
  19. 16 Nov, 2017 6 commits
  20. 04 Oct, 2017 1 commit
    • M
      mm, oom_reaper: skip mm structs with mmu notifiers · 4d4bbd85
      Michal Hocko authored
      Andrea has noticed that the oom_reaper doesn't invalidate the range via
      mmu notifiers (mmu_notifier_invalidate_range_start/end) and that can
      corrupt the memory of the kvm guest for example.
      
      tlb_flush_mmu_tlbonly already invokes mmu notifiers but that is not
      sufficient as per Andrea:
      
       "mmu_notifier_invalidate_range cannot be used in replacement of
        mmu_notifier_invalidate_range_start/end. For KVM
        mmu_notifier_invalidate_range is a noop and rightfully so. A MMU
        notifier implementation has to implement either ->invalidate_range
        method or the invalidate_range_start/end methods, not both. And if you
        implement invalidate_range_start/end like KVM is forced to do, calling
        mmu_notifier_invalidate_range in common code is a noop for KVM.
      
        For those MMU notifiers that can get away only implementing
        ->invalidate_range, the ->invalidate_range is implicitly called by
        mmu_notifier_invalidate_range_end(). And only those secondary MMUs
        that share the same pagetable with the primary MMU (like AMD iommuv2)
        can get away only implementing ->invalidate_range"
      
      As the callback is allowed to sleep and the implementation is out of
      hand of the MM it is safer to simply bail out if there is an mmu
      notifier registered.  In order to not fail too early make the
      mm_has_notifiers check under the oom_lock and have a little nap before
      failing to give the current oom victim some more time to exit.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/20170913113427.2291-1-mhocko@kernel.org
      Fixes: aac45363 ("mm, oom: introduce oom reaper")
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4d4bbd85
  21. 07 Sep, 2017 1 commit