1. 07 11月, 2021 2 次提交
    • M
      mm, oom: do not trigger out_of_memory from the #PF · 60e2793d
      Michal Hocko 提交于
      Any allocation failure during the #PF path will return with VM_FAULT_OOM
      which in turn results in pagefault_out_of_memory.  This can happen for 2
      different reasons.  a) Memcg is out of memory and we rely on
      mem_cgroup_oom_synchronize to perform the memcg OOM handling or b)
      normal allocation fails.
      
      The latter is quite problematic because allocation paths already trigger
      out_of_memory and the page allocator tries really hard to not fail
      allocations.  Anyway, if the OOM killer has been already invoked there
      is no reason to invoke it again from the #PF path.  Especially when the
      OOM condition might be gone by that time and we have no way to find out
      other than allocate.
      
      Moreover if the allocation failed and the OOM killer hasn't been invoked
      then we are unlikely to do the right thing from the #PF context because
      we have already lost the allocation context and restictions and
      therefore might oom kill a task from a different NUMA domain.
      
      This all suggests that there is no legitimate reason to trigger
      out_of_memory from pagefault_out_of_memory so drop it.  Just to be sure
      that no #PF path returns with VM_FAULT_OOM without allocation print a
      warning that this is happening before we restart the #PF.
      
      [VvS: #PF allocation can hit into limit of cgroup v1 kmem controller.
      This is a local problem related to memcg, however, it causes unnecessary
      global OOM kills that are repeated over and over again and escalate into a
      real disaster.  This has been broken since kmem accounting has been
      introduced for cgroup v1 (3.8).  There was no kmem specific reclaim for
      the separate limit so the only way to handle kmem hard limit was to return
      with ENOMEM.  In upstream the problem will be fixed by removing the
      outdated kmem limit, however stable and LTS kernels cannot do it and are
      still affected.  This patch fixes the problem and should be backported
      into stable/LTS.]
      
      Link: https://lkml.kernel.org/r/f5fd8dd8-0ad4-c524-5f65-920b01972a42@virtuozzo.comSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NVasily Averin <vvs@virtuozzo.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Uladzislau Rezki <urezki@gmail.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      60e2793d
    • V
      mm, oom: pagefault_out_of_memory: don't force global OOM for dying tasks · 0b28179a
      Vasily Averin 提交于
      Patch series "memcg: prohibit unconditional exceeding the limit of dying tasks", v3.
      
      Memory cgroup charging allows killed or exiting tasks to exceed the hard
      limit.  It can be misused and allowed to trigger global OOM from inside
      a memcg-limited container.  On the other hand if memcg fails allocation,
      called from inside #PF handler it triggers global OOM from inside
      pagefault_out_of_memory().
      
      To prevent these problems this patchset:
       (a) removes execution of out_of_memory() from
           pagefault_out_of_memory(), becasue nobody can explain why it is
           necessary.
       (b) allow memcg to fail allocation of dying/killed tasks.
      
      This patch (of 3):
      
      Any allocation failure during the #PF path will return with VM_FAULT_OOM
      which in turn results in pagefault_out_of_memory which in turn executes
      out_out_memory() and can kill a random task.
      
      An allocation might fail when the current task is the oom victim and
      there are no memory reserves left.  The OOM killer is already handled at
      the page allocator level for the global OOM and at the charging level
      for the memcg one.  Both have much more information about the scope of
      allocation/charge request.  This means that either the OOM killer has
      been invoked properly and didn't lead to the allocation success or it
      has been skipped because it couldn't have been invoked.  In both cases
      triggering it from here is pointless and even harmful.
      
      It makes much more sense to let the killed task die rather than to wake
      up an eternally hungry oom-killer and send him to choose a fatter victim
      for breakfast.
      
      Link: https://lkml.kernel.org/r/0828a149-786e-7c06-b70a-52d086818ea3@virtuozzo.comSigned-off-by: NVasily Averin <vvs@virtuozzo.com>
      Suggested-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Mel Gorman <mgorman@techsingularity.net>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Uladzislau Rezki <urezki@gmail.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Vlastimil Babka <vbabka@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0b28179a
  2. 29 10月, 2021 1 次提交
    • S
      mm/oom_kill.c: prevent a race between process_mrelease and exit_mmap · 337546e8
      Suren Baghdasaryan 提交于
      Race between process_mrelease and exit_mmap, where free_pgtables is
      called while __oom_reap_task_mm is in progress, leads to kernel crash
      during pte_offset_map_lock call.  oom-reaper avoids this race by setting
      MMF_OOM_VICTIM flag and causing exit_mmap to take and release
      mmap_write_lock, blocking it until oom-reaper releases mmap_read_lock.
      
      Reusing MMF_OOM_VICTIM for process_mrelease would be the simplest way to
      fix this race, however that would be considered a hack.  Fix this race
      by elevating mm->mm_users and preventing exit_mmap from executing until
      process_mrelease is finished.  Patch slightly refactors the code to
      adapt for a possible mmget_not_zero failure.
      
      This fix has considerable negative impact on process_mrelease
      performance and will likely need later optimization.
      
      Link: https://lkml.kernel.org/r/20211022014658.263508-1-surenb@google.com
      Fixes: 884a7e59 ("mm: introduce process_mrelease system call")
      Signed-off-by: NSuren Baghdasaryan <surenb@google.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Christian Brauner <christian@brauner.io>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Hildenbrand <david@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Florian Weimer <fweimer@redhat.com>
      Cc: Jan Engelhardt <jengelh@inai.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      337546e8
  3. 04 9月, 2021 1 次提交
    • S
      mm: introduce process_mrelease system call · 884a7e59
      Suren Baghdasaryan 提交于
      In modern systems it's not unusual to have a system component monitoring
      memory conditions of the system and tasked with keeping system memory
      pressure under control.  One way to accomplish that is to kill
      non-essential processes to free up memory for more important ones.
      Examples of this are Facebook's OOM killer daemon called oomd and
      Android's low memory killer daemon called lmkd.
      
      For such system component it's important to be able to free memory quickly
      and efficiently.  Unfortunately the time process takes to free up its
      memory after receiving a SIGKILL might vary based on the state of the
      process (uninterruptible sleep), size and OPP level of the core the
      process is running.  A mechanism to free resources of the target process
      in a more predictable way would improve system's ability to control its
      memory pressure.
      
      Introduce process_mrelease system call that releases memory of a dying
      process from the context of the caller.  This way the memory is freed in a
      more controllable way with CPU affinity and priority of the caller.  The
      workload of freeing the memory will also be charged to the caller.  The
      operation is allowed only on a dying process.
      
      After previous discussions [1, 2, 3] the decision was made [4] to
      introduce a dedicated system call to cover this use case.
      
      The API is as follows,
      
                int process_mrelease(int pidfd, unsigned int flags);
      
              DESCRIPTION
                The process_mrelease() system call is used to free the memory of
                an exiting process.
      
                The pidfd selects the process referred to by the PID file
                descriptor.
                (See pidfd_open(2) for further information)
      
                The flags argument is reserved for future use; currently, this
                argument must be specified as 0.
      
              RETURN VALUE
                On success, process_mrelease() returns 0. On error, -1 is
                returned and errno is set to indicate the error.
      
              ERRORS
                EBADF  pidfd is not a valid PID file descriptor.
      
                EAGAIN Failed to release part of the address space.
      
                EINTR  The call was interrupted by a signal; see signal(7).
      
                EINVAL flags is not 0.
      
                EINVAL The memory of the task cannot be released because the
                       process is not exiting, the address space is shared
                       with another live process or there is a core dump in
                       progress.
      
                ENOSYS This system call is not supported, for example, without
                       MMU support built into Linux.
      
                ESRCH  The target process does not exist (i.e., it has terminated
                       and been waited on).
      
      [1] https://lore.kernel.org/lkml/20190411014353.113252-3-surenb@google.com/
      [2] https://lore.kernel.org/linux-api/20201113173448.1863419-1-surenb@google.com/
      [3] https://lore.kernel.org/linux-api/20201124053943.1684874-3-surenb@google.com/
      [4] https://lore.kernel.org/linux-api/20201223075712.GA4719@lst.de/
      
      Link: https://lkml.kernel.org/r/20210809185259.405936-1-surenb@google.comSigned-off-by: NSuren Baghdasaryan <surenb@google.com>
      Reviewed-by: NShakeel Butt <shakeelb@google.com>
      Acked-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NChristian Brauner <christian.brauner@ubuntu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Rik van Riel <riel@surriel.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Jann Horn <jannh@google.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Andy Lutomirski <luto@kernel.org>
      Cc: Christian Brauner <christian.brauner@ubuntu.com>
      Cc: Florian Weimer <fweimer@redhat.com>
      Cc: Jan Engelhardt <jengelh@inai.de>
      Cc: Tim Murray <timmurray@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      884a7e59
  4. 01 7月, 2021 1 次提交
  5. 11 5月, 2021 1 次提交
  6. 07 5月, 2021 1 次提交
  7. 06 5月, 2021 1 次提交
  8. 17 4月, 2021 1 次提交
  9. 25 2月, 2021 1 次提交
  10. 30 1月, 2021 2 次提交
  11. 16 12月, 2020 1 次提交
  12. 14 10月, 2020 1 次提交
    • S
      mm, oom_adj: don't loop through tasks in __set_oom_adj when not necessary · 67197a4f
      Suren Baghdasaryan 提交于
      Currently __set_oom_adj loops through all processes in the system to keep
      oom_score_adj and oom_score_adj_min in sync between processes sharing
      their mm.  This is done for any task with more that one mm_users, which
      includes processes with multiple threads (sharing mm and signals).
      However for such processes the loop is unnecessary because their signal
      structure is shared as well.
      
      Android updates oom_score_adj whenever a tasks changes its role
      (background/foreground/...) or binds to/unbinds from a service, making it
      more/less important.  Such operation can happen frequently.  We noticed
      that updates to oom_score_adj became more expensive and after further
      investigation found out that the patch mentioned in "Fixes" introduced a
      regression.  Using Pixel 4 with a typical Android workload, write time to
      oom_score_adj increased from ~3.57us to ~362us.  Moreover this regression
      linearly depends on the number of multi-threaded processes running on the
      system.
      
      Mark the mm with a new MMF_MULTIPROCESS flag bit when task is created with
      (CLONE_VM && !CLONE_THREAD && !CLONE_VFORK).  Change __set_oom_adj to use
      MMF_MULTIPROCESS instead of mm_users to decide whether oom_score_adj
      update should be synchronized between multiple processes.  To prevent
      races between clone() and __set_oom_adj(), when oom_score_adj of the
      process being cloned might be modified from userspace, we use
      oom_adj_mutex.  Its scope is changed to global.
      
      The combination of (CLONE_VM && !CLONE_THREAD) is rarely used except for
      the case of vfork().  To prevent performance regressions of vfork(), we
      skip taking oom_adj_mutex and setting MMF_MULTIPROCESS when CLONE_VFORK is
      specified.  Clearing the MMF_MULTIPROCESS flag (when the last process
      sharing the mm exits) is left out of this patch to keep it simple and
      because it is believed that this threading model is rare.  Should there
      ever be a need for optimizing that case as well, it can be done by hooking
      into the exit path, likely following the mm_update_next_owner pattern.
      
      With the combination of (CLONE_VM && !CLONE_THREAD && !CLONE_VFORK) being
      quite rare, the regression is gone after the change is applied.
      
      [surenb@google.com: v3]
        Link: https://lkml.kernel.org/r/20200902012558.2335613-1-surenb@google.com
      
      Fixes: 44a70ade ("mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj")
      Reported-by: NTim Murray <timmurray@google.com>
      Suggested-by: NMichal Hocko <mhocko@kernel.org>
      Signed-off-by: NSuren Baghdasaryan <surenb@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NChristian Brauner <christian.brauner@ubuntu.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Eugene Syromiatnikov <esyr@redhat.com>
      Cc: Christian Kellner <christian@kellner.me>
      Cc: Adrian Reber <areber@redhat.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Aleksa Sarai <cyphar@cyphar.com>
      Cc: Alexey Dobriyan <adobriyan@gmail.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: Alexey Gladkov <gladkov.alexey@gmail.com>
      Cc: Michel Lespinasse <walken@google.com>
      Cc: Daniel Jordan <daniel.m.jordan@oracle.com>
      Cc: Andrei Vagin <avagin@gmail.com>
      Cc: Bernd Edlinger <bernd.edlinger@hotmail.de>
      Cc: John Johansen <john.johansen@canonical.com>
      Cc: Yafang Shao <laoar.shao@gmail.com>
      Link: https://lkml.kernel.org/r/20200824153036.3201505-1-surenb@google.comDebugged-by: NMinchan Kim <minchan@kernel.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      67197a4f
  13. 13 8月, 2020 2 次提交
    • Y
      mm, oom: show process exiting information in __oom_kill_process() · 619b5b46
      Yafang Shao 提交于
      When the OOM killer finds a victim and tryies to kill it, if the victim is
      already exiting, the task mm will be NULL and no process will be killed.
      But the dump_header() has been already executed, so it will be strange to
      dump so much information without killing a process.  We'd better show some
      helpful information to indicate why this happens.
      Suggested-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NYafang Shao <laoar.shao@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: Qian Cai <cai@lca.pw>
      Link: http://lkml.kernel.org/r/20200721010127.17238-1-laoar.shao@gmail.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      619b5b46
    • Y
      mm, oom: make the calculation of oom badness more accurate · 9066e5cf
      Yafang Shao 提交于
      Recently we found an issue on our production environment that when memcg
      oom is triggered the oom killer doesn't chose the process with largest
      resident memory but chose the first scanned process.  Note that all
      processes in this memcg have the same oom_score_adj, so the oom killer
      should chose the process with largest resident memory.
      
      Bellow is part of the oom info, which is enough to analyze this issue.
      [7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037
      [7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0
      [7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0
      [...]
      [7516987.983293] [ pid ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
      [7516987.983510] [ 5740]     0  5740      257        1    32768        0          -998 pause
      [7516987.983574] [58804]     0 58804     4594      771    81920        0          -998 entry_point.bas
      [7516987.983577] [58908]     0 58908     7089      689    98304        0          -998 cron
      [7516987.983580] [58910]     0 58910    16235     5576   163840        0          -998 supervisord
      [7516987.983590] [59620]     0 59620    18074     1395   188416        0          -998 sshd
      [7516987.983594] [59622]     0 59622    18680     6679   188416        0          -998 python
      [7516987.983598] [59624]     0 59624  1859266     5161   548864        0          -998 odin-agent
      [7516987.983600] [59625]     0 59625   707223     9248   983040        0          -998 filebeat
      [7516987.983604] [59627]     0 59627   416433    64239   774144        0          -998 odin-log-agent
      [7516987.983607] [59631]     0 59631   180671    15012   385024        0          -998 python3
      [7516987.983612] [61396]     0 61396   791287     3189   352256        0          -998 client
      [7516987.983615] [61641]     0 61641  1844642    29089   946176        0          -998 client
      [7516987.983765] [ 9236]     0  9236     2642      467    53248        0          -998 php_scanner
      [7516987.983911] [42898]     0 42898    15543      838   167936        0          -998 su
      [7516987.983915] [42900]  1000 42900     3673      867    77824        0          -998 exec_script_vr2
      [7516987.983918] [42925]  1000 42925    36475    19033   335872        0          -998 python
      [7516987.983921] [57146]  1000 57146     3673      848    73728        0          -998 exec_script_J2p
      [7516987.983925] [57195]  1000 57195   186359    22958   491520        0          -998 python2
      [7516987.983928] [58376]  1000 58376   275764    14402   290816        0          -998 rosmaster
      [7516987.983931] [58395]  1000 58395   155166     4449   245760        0          -998 rosout
      [7516987.983935] [58406]  1000 58406 18285584  3967322 37101568        0          -998 data_sim
      [7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0
      [7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
      [7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      
      We can find that the first scanned process 5740 (pause) was killed, but
      its rss is only one page.  That is because, when we calculate the oom
      badness in oom_badness(), we always ignore the negtive point and convert
      all of these negtive points to 1.  Now as oom_score_adj of all the
      processes in this targeted memcg have the same value -998, the points of
      these processes are all negtive value.  As a result, the first scanned
      process will be killed.
      
      The oom_socre_adj (-998) in this memcg is set by kubelet, because it is a
      a Guaranteed pod, which has higher priority to prevent from being killed
      by system oom.
      
      To fix this issue, we should make the calculation of oom point more
      accurate.  We can achieve it by convert the chosen_point from 'unsigned
      long' to 'long'.
      
      [cai@lca.pw: reported a issue in the previous version]
      [mhocko@suse.com: fixed the issue reported by Cai]
      [mhocko@suse.com: add the comment in proc_oom_score()]
      [laoar.shao@gmail.com: v3]
        Link: http://lkml.kernel.org/r/1594396651-9931-1-git-send-email-laoar.shao@gmail.comSigned-off-by: NYafang Shao <laoar.shao@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Tested-by: NNaresh Kamboju <naresh.kamboju@linaro.org>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Qian Cai <cai@lca.pw>
      Link: http://lkml.kernel.org/r/1594309987-9919-1-git-send-email-laoar.shao@gmail.comSigned-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9066e5cf
  14. 08 8月, 2020 1 次提交
  15. 11 6月, 2020 1 次提交
  16. 10 6月, 2020 3 次提交
  17. 04 6月, 2020 1 次提交
  18. 01 2月, 2020 1 次提交
  19. 05 1月, 2020 1 次提交
  20. 26 9月, 2019 1 次提交
    • M
      mm: introduce MADV_COLD · 9c276cc6
      Minchan Kim 提交于
      Patch series "Introduce MADV_COLD and MADV_PAGEOUT", v7.
      
      - Background
      
      The Android terminology used for forking a new process and starting an app
      from scratch is a cold start, while resuming an existing app is a hot
      start.  While we continually try to improve the performance of cold
      starts, hot starts will always be significantly less power hungry as well
      as faster so we are trying to make hot start more likely than cold start.
      
      To increase hot start, Android userspace manages the order that apps
      should be killed in a process called ActivityManagerService.
      ActivityManagerService tracks every Android app or service that the user
      could be interacting with at any time and translates that into a ranked
      list for lmkd(low memory killer daemon).  They are likely to be killed by
      lmkd if the system has to reclaim memory.  In that sense they are similar
      to entries in any other cache.  Those apps are kept alive for
      opportunistic performance improvements but those performance improvements
      will vary based on the memory requirements of individual workloads.
      
      - Problem
      
      Naturally, cached apps were dominant consumers of memory on the system.
      However, they were not significant consumers of swap even though they are
      good candidate for swap.  Under investigation, swapping out only begins
      once the low zone watermark is hit and kswapd wakes up, but the overall
      allocation rate in the system might trip lmkd thresholds and cause a
      cached process to be killed(we measured performance swapping out vs.
      zapping the memory by killing a process.  Unsurprisingly, zapping is 10x
      times faster even though we use zram which is much faster than real
      storage) so kill from lmkd will often satisfy the high zone watermark,
      resulting in very few pages actually being moved to swap.
      
      - Approach
      
      The approach we chose was to use a new interface to allow userspace to
      proactively reclaim entire processes by leveraging platform information.
      This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
      that are known to be cold from userspace and to avoid races with lmkd by
      reclaiming apps as soon as they entered the cached state.  Additionally,
      it could provide many chances for platform to use much information to
      optimize memory efficiency.
      
      To achieve the goal, the patchset introduce two new options for madvise.
      One is MADV_COLD which will deactivate activated pages and the other is
      MADV_PAGEOUT which will reclaim private pages instantly.  These new
      options complement MADV_DONTNEED and MADV_FREE by adding non-destructive
      ways to gain some free memory space.  MADV_PAGEOUT is similar to
      MADV_DONTNEED in a way that it hints the kernel that memory region is not
      currently needed and should be reclaimed immediately; MADV_COLD is similar
      to MADV_FREE in a way that it hints the kernel that memory region is not
      currently needed and should be reclaimed when memory pressure rises.
      
      This patch (of 5):
      
      When a process expects no accesses to a certain memory range, it could
      give a hint to kernel that the pages can be reclaimed when memory pressure
      happens but data should be preserved for future use.  This could reduce
      workingset eviction so it ends up increasing performance.
      
      This patch introduces the new MADV_COLD hint to madvise(2) syscall.
      MADV_COLD can be used by a process to mark a memory range as not expected
      to be used in the near future.  The hint can help kernel in deciding which
      pages to evict early during memory pressure.
      
      It works for every LRU pages like MADV_[DONTNEED|FREE]. IOW, It moves
      
      	active file page -> inactive file LRU
      	active anon page -> inacdtive anon LRU
      
      Unlike MADV_FREE, it doesn't move active anonymous pages to inactive file
      LRU's head because MADV_COLD is a little bit different symantic.
      MADV_FREE means it's okay to discard when the memory pressure because the
      content of the page is *garbage* so freeing such pages is almost zero
      overhead since we don't need to swap out and access afterward causes just
      minor fault.  Thus, it would make sense to put those freeable pages in
      inactive file LRU to compete other used-once pages.  It makes sense for
      implmentaion point of view, too because it's not swapbacked memory any
      longer until it would be re-dirtied.  Even, it could give a bonus to make
      them be reclaimed on swapless system.  However, MADV_COLD doesn't mean
      garbage so reclaiming them requires swap-out/in in the end so it's bigger
      cost.  Since we have designed VM LRU aging based on cost-model, anonymous
      cold pages would be better to position inactive anon's LRU list, not file
      LRU.  Furthermore, it would help to avoid unnecessary scanning if system
      doesn't have a swap device.  Let's start simpler way without adding
      complexity at this moment.  However, keep in mind, too that it's a caveat
      that workloads with a lot of pages cache are likely to ignore MADV_COLD on
      anonymous memory because we rarely age anonymous LRU lists.
      
      * man-page material
      
      MADV_COLD (since Linux x.x)
      
      Pages in the specified regions will be treated as less-recently-accessed
      compared to pages in the system with similar access frequencies.  In
      contrast to MADV_FREE, the contents of the region are preserved regardless
      of subsequent writes to pages.
      
      MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP
      pages.
      
      [akpm@linux-foundation.org: resolve conflicts with hmm.git]
      Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.orgSigned-off-by: NMinchan Kim <minchan@kernel.org>
      Reported-by: Nkbuild test robot <lkp@intel.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: James E.J. Bottomley <James.Bottomley@HansenPartnership.com>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Ralf Baechle <ralf@linux-mips.org>
      Cc: Chris Zankel <chris@zankel.net>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Daniel Colascione <dancol@google.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Hillf Danton <hdanton@sina.com>
      Cc: Joel Fernandes (Google) <joel@joelfernandes.org>
      Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Cc: Oleksandr Natalenko <oleksandr@redhat.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Sonny Rao <sonnyrao@google.com>
      Cc: Suren Baghdasaryan <surenb@google.com>
      Cc: Tim Murray <timmurray@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9c276cc6
  21. 25 9月, 2019 5 次提交
    • M
      mm, oom: consider present pages for the node size · 1eb41bb0
      Michal Hocko 提交于
      constrained_alloc() calculates the size of the oom domain by using
      node_spanned_pages which is incorrect because this is the full range of
      the physical memory range that the numa node occupies rather than the
      memory that backs that range which is represented by node_present_pages.
      
      Sparsely populated nodes (e.g.  after memory hot remove or simply sparse
      due to memory layout) can have really a large difference between the two.
      This shouldn't really cause any real user observable problems because the
      oom calculates a ratio against totalpages and used memory cannot exceed
      present pages but it is confusing and wrong from code point of view.
      
      Link: http://lkml.kernel.org/r/20190829163443.899-1-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Reported-by: NDavid Hildenbrand <david@redhat.com>
      Reviewed-by: NDavid Hildenbrand <david@redhat.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1eb41bb0
    • Y
      mm/oom_kill.c: fix oom_cpuset_eligible() comment · f364f06b
      Yi Wang 提交于
      Commit ac311a14 ("oom: decouple mems_allowed from
      oom_unkillable_task") changed has_intersects_mems_allowed() to
      oom_cpuset_eligible(), but didn't change the comment.
      
      Link: http://lkml.kernel.org/r/1566959929-10638-1-git-send-email-wang.yi59@zte.com.cnSigned-off-by: NYi Wang <wang.yi59@zte.com.cn>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f364f06b
    • E
      mm/oom: add oom_score_adj and pgtables to Killed process message · 70cb6d26
      Edward Chron 提交于
      For an OOM event: print oom_score_adj value for the OOM Killed process to
      document what the oom score adjust value was at the time the process was
      OOM Killed.  The adjustment value can be set by user code and it affects
      the resulting oom_score so it is used to influence kill process selection.
      
      When eligible tasks are not printed (sysctl oom_dump_tasks = 0) printing
      this value is the only documentation of the value for the process being
      killed.  Having this value on the Killed process message is useful to
      document if a miscconfiguration occurred or to confirm that the
      oom_score_adj configuration applies as expected.
      
      An example which illustates both misconfiguration and validation that the
      oom_score_adj was applied as expected is:
      
      Aug 14 23:00:02 testserver kernel: Out of memory: Killed process 2692
       (systemd-udevd) total-vm:1056800kB, anon-rss:1052760kB, file-rss:4kB,
       shmem-rss:0kB pgtables:22kB oom_score_adj:1000
      
      The systemd-udevd is a critical system application that should have an
      oom_score_adj of -1000.  It was miconfigured to have a adjustment of 1000
      making it a highly favored OOM kill target process.  The output documents
      both the misconfiguration and the fact that the process was correctly
      targeted by OOM due to the miconfiguration.  This can be quite helpful for
      triage and problem determination.
      
      The addition of the pgtables_bytes shows page table usage by the process
      and is a useful measure of the memory size of the process.
      
      Link: http://lkml.kernel.org/r/20190822173157.1569-1-echron@arista.comSigned-off-by: NEdward Chron <echron@arista.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NDavid Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      70cb6d26
    • T
      memcg, oom: don't require __GFP_FS when invoking memcg OOM killer · f9c64562
      Tetsuo Handa 提交于
      Masoud Sharbiani noticed that commit 29ef680a ("memcg, oom: move
      out_of_memory back to the charge path") broke memcg OOM called from
      __xfs_filemap_fault() path.  It turned out that try_charge() is retrying
      forever without making forward progress because mem_cgroup_oom(GFP_NOFS)
      cannot invoke the OOM killer due to commit 3da88fb3 ("mm, oom:
      move GFP_NOFS check to out_of_memory").
      
      Allowing forced charge due to being unable to invoke memcg OOM killer will
      lead to global OOM situation.  Also, just returning -ENOMEM will be risky
      because OOM path is lost and some paths (e.g.  get_user_pages()) will leak
      -ENOMEM.  Therefore, invoking memcg OOM killer (despite GFP_NOFS) will be
      the only choice we can choose for now.
      
      Until 29ef680a, we were able to invoke memcg OOM killer when
      GFP_KERNEL reclaim failed [1].  But since 29ef680a, we need to
      invoke memcg OOM killer when GFP_NOFS reclaim failed [2].  Although in the
      past we did invoke memcg OOM killer for GFP_NOFS [3], we might get
      pre-mature memcg OOM reports due to this patch.
      
      [1]
      
       leaker invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
       CPU: 0 PID: 2746 Comm: leaker Not tainted 4.18.0+ #19
       Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
       Call Trace:
        dump_stack+0x63/0x88
        dump_header+0x67/0x27a
        ? mem_cgroup_scan_tasks+0x91/0xf0
        oom_kill_process+0x210/0x410
        out_of_memory+0x10a/0x2c0
        mem_cgroup_out_of_memory+0x46/0x80
        mem_cgroup_oom_synchronize+0x2e4/0x310
        ? high_work_func+0x20/0x20
        pagefault_out_of_memory+0x31/0x76
        mm_fault_error+0x55/0x115
        ? handle_mm_fault+0xfd/0x220
        __do_page_fault+0x433/0x4e0
        do_page_fault+0x22/0x30
        ? page_fault+0x8/0x30
        page_fault+0x1e/0x30
       RIP: 0033:0x4009f0
       Code: 03 00 00 00 e8 71 fd ff ff 48 83 f8 ff 49 89 c6 74 74 48 89 c6 bf c0 0c 40 00 31 c0 e8 69 fd ff ff 45 85 ff 7e 21 31 c9 66 90 <41> 0f be 14 0e 01 d3 f7 c1 ff 0f 00 00 75 05 41 c6 04 0e 2a 48 83
       RSP: 002b:00007ffe29ae96f0 EFLAGS: 00010206
       RAX: 000000000000001b RBX: 0000000000000000 RCX: 0000000001ce1000
       RDX: 0000000000000000 RSI: 000000007fffffe5 RDI: 0000000000000000
       RBP: 000000000000000c R08: 0000000000000000 R09: 00007f94be09220d
       R10: 0000000000000002 R11: 0000000000000246 R12: 00000000000186a0
       R13: 0000000000000003 R14: 00007f949d845000 R15: 0000000002800000
       Task in /leaker killed as a result of limit of /leaker
       memory: usage 524288kB, limit 524288kB, failcnt 158965
       memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
       kmem: usage 2016kB, limit 9007199254740988kB, failcnt 0
       Memory cgroup stats for /leaker: cache:844KB rss:521136KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:132KB writeback:0KB inactive_anon:0KB active_anon:521224KB inactive_file:1012KB active_file:8KB unevictable:0KB
       Memory cgroup out of memory: Kill process 2746 (leaker) score 998 or sacrifice child
       Killed process 2746 (leaker) total-vm:536704kB, anon-rss:521176kB, file-rss:1208kB, shmem-rss:0kB
       oom_reaper: reaped process 2746 (leaker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      
      [2]
      
       leaker invoked oom-killer: gfp_mask=0x600040(GFP_NOFS), nodemask=(null), order=0, oom_score_adj=0
       CPU: 1 PID: 2746 Comm: leaker Not tainted 4.18.0+ #20
       Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
       Call Trace:
        dump_stack+0x63/0x88
        dump_header+0x67/0x27a
        ? mem_cgroup_scan_tasks+0x91/0xf0
        oom_kill_process+0x210/0x410
        out_of_memory+0x109/0x2d0
        mem_cgroup_out_of_memory+0x46/0x80
        try_charge+0x58d/0x650
        ? __radix_tree_replace+0x81/0x100
        mem_cgroup_try_charge+0x7a/0x100
        __add_to_page_cache_locked+0x92/0x180
        add_to_page_cache_lru+0x4d/0xf0
        iomap_readpages_actor+0xde/0x1b0
        ? iomap_zero_range_actor+0x1d0/0x1d0
        iomap_apply+0xaf/0x130
        iomap_readpages+0x9f/0x150
        ? iomap_zero_range_actor+0x1d0/0x1d0
        xfs_vm_readpages+0x18/0x20 [xfs]
        read_pages+0x60/0x140
        __do_page_cache_readahead+0x193/0x1b0
        ondemand_readahead+0x16d/0x2c0
        page_cache_async_readahead+0x9a/0xd0
        filemap_fault+0x403/0x620
        ? alloc_set_pte+0x12c/0x540
        ? _cond_resched+0x14/0x30
        __xfs_filemap_fault+0x66/0x180 [xfs]
        xfs_filemap_fault+0x27/0x30 [xfs]
        __do_fault+0x19/0x40
        __handle_mm_fault+0x8e8/0xb60
        handle_mm_fault+0xfd/0x220
        __do_page_fault+0x238/0x4e0
        do_page_fault+0x22/0x30
        ? page_fault+0x8/0x30
        page_fault+0x1e/0x30
       RIP: 0033:0x4009f0
       Code: 03 00 00 00 e8 71 fd ff ff 48 83 f8 ff 49 89 c6 74 74 48 89 c6 bf c0 0c 40 00 31 c0 e8 69 fd ff ff 45 85 ff 7e 21 31 c9 66 90 <41> 0f be 14 0e 01 d3 f7 c1 ff 0f 00 00 75 05 41 c6 04 0e 2a 48 83
       RSP: 002b:00007ffda45c9290 EFLAGS: 00010206
       RAX: 000000000000001b RBX: 0000000000000000 RCX: 0000000001a1e000
       RDX: 0000000000000000 RSI: 000000007fffffe5 RDI: 0000000000000000
       RBP: 000000000000000c R08: 0000000000000000 R09: 00007f6d061ff20d
       R10: 0000000000000002 R11: 0000000000000246 R12: 00000000000186a0
       R13: 0000000000000003 R14: 00007f6ce59b2000 R15: 0000000002800000
       Task in /leaker killed as a result of limit of /leaker
       memory: usage 524288kB, limit 524288kB, failcnt 7221
       memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
       kmem: usage 1944kB, limit 9007199254740988kB, failcnt 0
       Memory cgroup stats for /leaker: cache:3632KB rss:518232KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:518408KB inactive_file:3908KB active_file:12KB unevictable:0KB
       Memory cgroup out of memory: Kill process 2746 (leaker) score 992 or sacrifice child
       Killed process 2746 (leaker) total-vm:536704kB, anon-rss:518264kB, file-rss:1188kB, shmem-rss:0kB
       oom_reaper: reaped process 2746 (leaker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      
      [3]
      
       leaker invoked oom-killer: gfp_mask=0x50, order=0, oom_score_adj=0
       leaker cpuset=/ mems_allowed=0
       CPU: 1 PID: 3206 Comm: leaker Not tainted 3.10.0-957.27.2.el7.x86_64 #1
       Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
       Call Trace:
        [<ffffffffaf364147>] dump_stack+0x19/0x1b
        [<ffffffffaf35eb6a>] dump_header+0x90/0x229
        [<ffffffffaedbb456>] ? find_lock_task_mm+0x56/0xc0
        [<ffffffffaee32a38>] ? try_get_mem_cgroup_from_mm+0x28/0x60
        [<ffffffffaedbb904>] oom_kill_process+0x254/0x3d0
        [<ffffffffaee36c36>] mem_cgroup_oom_synchronize+0x546/0x570
        [<ffffffffaee360b0>] ? mem_cgroup_charge_common+0xc0/0xc0
        [<ffffffffaedbc194>] pagefault_out_of_memory+0x14/0x90
        [<ffffffffaf35d072>] mm_fault_error+0x6a/0x157
        [<ffffffffaf3717c8>] __do_page_fault+0x3c8/0x4f0
        [<ffffffffaf371925>] do_page_fault+0x35/0x90
        [<ffffffffaf36d768>] page_fault+0x28/0x30
       Task in /leaker killed as a result of limit of /leaker
       memory: usage 524288kB, limit 524288kB, failcnt 20628
       memory+swap: usage 524288kB, limit 9007199254740988kB, failcnt 0
       kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
       Memory cgroup stats for /leaker: cache:840KB rss:523448KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:523448KB inactive_file:464KB active_file:376KB unevictable:0KB
       Memory cgroup out of memory: Kill process 3206 (leaker) score 970 or sacrifice child
       Killed process 3206 (leaker) total-vm:536692kB, anon-rss:523304kB, file-rss:412kB, shmem-rss:0kB
      
      Bisected by Masoud Sharbiani.
      
      Link: http://lkml.kernel.org/r/cbe54ed1-b6ba-a056-8899-2dc42526371d@i-love.sakura.ne.jp
      Fixes: 3da88fb3 ("mm, oom: move GFP_NOFS check to out_of_memory") [necessary after 29ef680a]
      Signed-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Reported-by: NMasoud Sharbiani <msharbiani@apple.com>
      Tested-by: NMasoud Sharbiani <msharbiani@apple.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: <stable@vger.kernel.org>	[4.19+]
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f9c64562
    • J
      mm/oom_kill.c: add task UID to info message on an oom kill · 8ac3f8fe
      Joel Savitz 提交于
      In the event of an oom kill, useful information about the killed process
      is printed to dmesg.  Users, especially system administrators, will find
      it useful to immediately see the UID of the process.
      
      We already print uid when dumping eligible tasks so it is not overly hard
      to find that information in the oom report.  However this information is
      unavailable when dumping of eligible tasks is disabled.
      
      In the following example, abuse_the_ram is the name of a program that
      attempts to iteratively allocate all available memory until it is stopped
      by force.
      
      Current message:
      
      Out of memory: Killed process 35389 (abuse_the_ram)
      total-vm:133718232kB, anon-rss:129624980kB, file-rss:0kB,
      shmem-rss:0kB
      
      Patched message:
      
      Out of memory: Killed process 2739 (abuse_the_ram),
      total-vm:133880028kB, anon-rss:129754836kB, file-rss:0kB,
      shmem-rss:0kB, UID:0
      
      [akpm@linux-foundation.org: s/UID %d/UID:%u/ in printk]
      Link: http://lkml.kernel.org/r/1560362273-534-1-git-send-email-jsavitz@redhat.comSigned-off-by: NJoel Savitz <jsavitz@redhat.com>
      Suggested-by: NDavid Rientjes <rientjes@google.com>
      Acked-by: NRafael Aquini <aquini@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8ac3f8fe
  22. 13 7月, 2019 5 次提交
    • T
      mm/oom_kill.c: remove redundant OOM score normalization in select_bad_process() · 2c207985
      Tetsuo Handa 提交于
      Since commit bbbe4802 ("mm, oom: remove 'prefer children over
      parent' heuristic") removed the
      
        "%s: Kill process %d (%s) score %u or sacrifice child\n"
      
      line, oc->chosen_points is no longer used after select_bad_process().
      
      Link: http://lkml.kernel.org/r/1560853435-15575-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jpSigned-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Shakeel Butt <shakeelb@google.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2c207985
    • S
      oom: decouple mems_allowed from oom_unkillable_task · ac311a14
      Shakeel Butt 提交于
      Commit ef08e3b4 ("[PATCH] cpusets: confine oom_killer to
      mem_exclusive cpuset") introduces a heuristic where a potential
      oom-killer victim is skipped if the intersection of the potential victim
      and the current (the process triggered the oom) is empty based on the
      reason that killing such victim most probably will not help the current
      allocating process.
      
      However the commit 7887a3da ("[PATCH] oom: cpuset hint") changed the
      heuristic to just decrease the oom_badness scores of such potential
      victim based on the reason that the cpuset of such processes might have
      changed and previously they may have allocated memory on mems where the
      current allocating process can allocate from.
      
      Unintentionally 7887a3da ("[PATCH] oom: cpuset hint") introduced a
      side effect as the oom_badness is also exposed to the user space through
      /proc/[pid]/oom_score, so, readers with different cpusets can read
      different oom_score of the same process.
      
      Later, commit 6cf86ac6 ("oom: filter tasks not sharing the same
      cpuset") fixed the side effect introduced by 7887a3da by moving the
      cpuset intersection back to only oom-killer context and out of
      oom_badness.  However the combination of ab290adb ("oom: make
      oom_unkillable_task() helper function") and 26ebc984 ("oom:
      /proc/<pid>/oom_score treat kernel thread honestly") unintentionally
      brought back the cpuset intersection check into the oom_badness
      calculation function.
      
      Other than doing cpuset/mempolicy intersection from oom_badness, the memcg
      oom context is also doing cpuset/mempolicy intersection which is quite
      wrong and is caught by syzcaller with the following report:
      
      kasan: CONFIG_KASAN_INLINE enabled
      kasan: GPF could be caused by NULL-ptr deref or user memory access
      general protection fault: 0000 [#1] PREEMPT SMP KASAN
      CPU: 0 PID: 28426 Comm: syz-executor.5 Not tainted 5.2.0-rc3-next-20190607
      Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
      Google 01/01/2011
      RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline]
      RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline]
      RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline]
      RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155
      Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00
      00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f
      85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff
      RSP: 0018:ffff888000127490 EFLAGS: 00010a03
      RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c
      RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001
      RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0
      R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007
      R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6
      FS:  00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000000000607304 CR3: 000000009237e000 CR4: 00000000001426f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
      Call Trace:
        oom_evaluate_task+0x49/0x520 mm/oom_kill.c:321
        mem_cgroup_scan_tasks+0xcc/0x180 mm/memcontrol.c:1169
        select_bad_process mm/oom_kill.c:374 [inline]
        out_of_memory mm/oom_kill.c:1088 [inline]
        out_of_memory+0x6b2/0x1280 mm/oom_kill.c:1035
        mem_cgroup_out_of_memory+0x1ca/0x230 mm/memcontrol.c:1573
        mem_cgroup_oom mm/memcontrol.c:1905 [inline]
        try_charge+0xfbe/0x1480 mm/memcontrol.c:2468
        mem_cgroup_try_charge+0x24d/0x5e0 mm/memcontrol.c:6073
        mem_cgroup_try_charge_delay+0x1f/0xa0 mm/memcontrol.c:6088
        do_huge_pmd_wp_page_fallback+0x24f/0x1680 mm/huge_memory.c:1201
        do_huge_pmd_wp_page+0x7fc/0x2160 mm/huge_memory.c:1359
        wp_huge_pmd mm/memory.c:3793 [inline]
        __handle_mm_fault+0x164c/0x3eb0 mm/memory.c:4006
        handle_mm_fault+0x3b7/0xa90 mm/memory.c:4053
        do_user_addr_fault arch/x86/mm/fault.c:1455 [inline]
        __do_page_fault+0x5ef/0xda0 arch/x86/mm/fault.c:1521
        do_page_fault+0x71/0x57d arch/x86/mm/fault.c:1552
        page_fault+0x1e/0x30 arch/x86/entry/entry_64.S:1156
      RIP: 0033:0x400590
      Code: 06 e9 49 01 00 00 48 8b 44 24 10 48 0b 44 24 28 75 1f 48 8b 14 24 48
      8b 7c 24 20 be 04 00 00 00 e8 f5 56 00 00 48 8b 74 24 08 <89> 06 e9 1e 01
      00 00 48 8b 44 24 08 48 8b 14 24 be 04 00 00 00 8b
      RSP: 002b:00007fff7bc49780 EFLAGS: 00010206
      RAX: 0000000000000001 RBX: 0000000000760000 RCX: 0000000000000000
      RDX: 0000000000000000 RSI: 000000002000cffc RDI: 0000000000000001
      RBP: fffffffffffffffe R08: 0000000000000000 R09: 0000000000000000
      R10: 0000000000000075 R11: 0000000000000246 R12: 0000000000760008
      R13: 00000000004c55f2 R14: 0000000000000000 R15: 00007fff7bc499b0
      Modules linked in:
      ---[ end trace a65689219582ffff ]---
      RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline]
      RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline]
      RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline]
      RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155
      Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00
      00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f
      85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff
      RSP: 0018:ffff888000127490 EFLAGS: 00010a03
      RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c
      RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001
      RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0
      R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007
      R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6
      FS:  00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
      CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
      CR2: 0000001b2f823000 CR3: 000000009237e000 CR4: 00000000001426f0
      DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
      DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
      
      The fix is to decouple the cpuset/mempolicy intersection check from
      oom_unkillable_task() and make sure cpuset/mempolicy intersection check is
      only done in the global oom context.
      
      [shakeelb@google.com: change function name and update comment]
        Link: http://lkml.kernel.org/r/20190628152421.198994-3-shakeelb@google.com
      Link: http://lkml.kernel.org/r/20190624212631.87212-3-shakeelb@google.comSigned-off-by: NShakeel Butt <shakeelb@google.com>
      Reported-by: syzbot+d0fc9d3c166bc5e4a94b@syzkaller.appspotmail.com
      Acked-by: NRoman Gushchin <guro@fb.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ac311a14
    • S
      mm, oom: remove redundant task_in_mem_cgroup() check · 6ba749ee
      Shakeel Butt 提交于
      oom_unkillable_task() can be called from three different contexts i.e.
      global OOM, memcg OOM and oom_score procfs interface.  At the moment
      oom_unkillable_task() does a task_in_mem_cgroup() check on the given
      process.  Since there is no reason to perform task_in_mem_cgroup()
      check for global OOM and oom_score procfs interface, those contexts
      provide NULL memcg and skips the task_in_mem_cgroup() check.  However
      for memcg OOM context, the oom_unkillable_task() is always called from
      mem_cgroup_scan_tasks() and thus task_in_mem_cgroup() check becomes
      redundant and effectively dead code.  So, just remove the
      task_in_mem_cgroup() check altogether.
      
      Link: http://lkml.kernel.org/r/20190624212631.87212-2-shakeelb@google.comSigned-off-by: NShakeel Butt <shakeelb@google.com>
      Signed-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: NRoman Gushchin <guro@fb.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6ba749ee
    • S
      mm, oom: refactor dump_tasks for memcg OOMs · 5eee7e1c
      Shakeel Butt 提交于
      dump_tasks() traverses all the existing processes even for the memcg OOM
      context which is not only unnecessary but also wasteful.  This imposes a
      long RCU critical section even from a contained context which can be quite
      disruptive.
      
      Change dump_tasks() to be aligned with select_bad_process and use
      mem_cgroup_scan_tasks to selectively traverse only processes of the target
      memcg hierarchy during memcg OOM.
      
      Link: http://lkml.kernel.org/r/20190617231207.160865-1-shakeelb@google.comSigned-off-by: NShakeel Butt <shakeelb@google.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NRoman Gushchin <guro@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Paul Jackson <pj@sgi.com>
      Cc: Nick Piggin <npiggin@suse.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5eee7e1c
    • T
      mm: memcontrol: use CSS_TASK_ITER_PROCS at mem_cgroup_scan_tasks() · f168a9a5
      Tetsuo Handa 提交于
      Since commit c03cd773 ("cgroup: Include dying leaders with live
      threads in PROCS iterations") corrected how CSS_TASK_ITER_PROCS works,
      mem_cgroup_scan_tasks() can use CSS_TASK_ITER_PROCS in order to check
      only one thread from each thread group.
      
      [penguin-kernel@I-love.SAKURA.ne.jp: remove thread group leader check in oom_evaluate_task()]
        Link: http://lkml.kernel.org/r/1560853257-14934-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
      Link: http://lkml.kernel.org/r/c763afc8-f0ae-756a-56a7-395f625b95fc@i-love.sakura.ne.jpSigned-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Reviewed-by: NShakeel Butt <shakeelb@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f168a9a5
  23. 29 6月, 2019 1 次提交
  24. 21 5月, 2019 1 次提交
  25. 15 5月, 2019 1 次提交
    • J
      mm/mmu_notifier: contextual information for event triggering invalidation · 6f4f13e8
      Jérôme Glisse 提交于
      CPU page table update can happens for many reasons, not only as a result
      of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also as
      a result of kernel activities (memory compression, reclaim, migration,
      ...).
      
      Users of mmu notifier API track changes to the CPU page table and take
      specific action for them.  While current API only provide range of virtual
      address affected by the change, not why the changes is happening.
      
      This patchset do the initial mechanical convertion of all the places that
      calls mmu_notifier_range_init to also provide the default MMU_NOTIFY_UNMAP
      event as well as the vma if it is know (most invalidation happens against
      a given vma).  Passing down the vma allows the users of mmu notifier to
      inspect the new vma page protection.
      
      The MMU_NOTIFY_UNMAP is always the safe default as users of mmu notifier
      should assume that every for the range is going away when that event
      happens.  A latter patch do convert mm call path to use a more appropriate
      events for each call.
      
      This is done as 2 patches so that no call site is forgotten especialy
      as it uses this following coccinelle patch:
      
      %<----------------------------------------------------------------------
      @@
      identifier I1, I2, I3, I4;
      @@
      static inline void mmu_notifier_range_init(struct mmu_notifier_range *I1,
      +enum mmu_notifier_event event,
      +unsigned flags,
      +struct vm_area_struct *vma,
      struct mm_struct *I2, unsigned long I3, unsigned long I4) { ... }
      
      @@
      @@
      -#define mmu_notifier_range_init(range, mm, start, end)
      +#define mmu_notifier_range_init(range, event, flags, vma, mm, start, end)
      
      @@
      expression E1, E3, E4;
      identifier I1;
      @@
      <...
      mmu_notifier_range_init(E1,
      +MMU_NOTIFY_UNMAP, 0, I1,
      I1->vm_mm, E3, E4)
      ...>
      
      @@
      expression E1, E2, E3, E4;
      identifier FN, VMA;
      @@
      FN(..., struct vm_area_struct *VMA, ...) {
      <...
      mmu_notifier_range_init(E1,
      +MMU_NOTIFY_UNMAP, 0, VMA,
      E2, E3, E4)
      ...> }
      
      @@
      expression E1, E2, E3, E4;
      identifier FN, VMA;
      @@
      FN(...) {
      struct vm_area_struct *VMA;
      <...
      mmu_notifier_range_init(E1,
      +MMU_NOTIFY_UNMAP, 0, VMA,
      E2, E3, E4)
      ...> }
      
      @@
      expression E1, E2, E3, E4;
      identifier FN;
      @@
      FN(...) {
      <...
      mmu_notifier_range_init(E1,
      +MMU_NOTIFY_UNMAP, 0, NULL,
      E2, E3, E4)
      ...> }
      ---------------------------------------------------------------------->%
      
      Applied with:
      spatch --all-includes --sp-file mmu-notifier.spatch fs/proc/task_mmu.c --in-place
      spatch --sp-file mmu-notifier.spatch --dir kernel/events/ --in-place
      spatch --sp-file mmu-notifier.spatch --dir mm --in-place
      
      Link: http://lkml.kernel.org/r/20190326164747.24405-6-jglisse@redhat.comSigned-off-by: NJérôme Glisse <jglisse@redhat.com>
      Reviewed-by: NRalph Campbell <rcampbell@nvidia.com>
      Reviewed-by: NIra Weiny <ira.weiny@intel.com>
      Cc: Christian König <christian.koenig@amd.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Jan Kara <jack@suse.cz>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Peter Xu <peterx@redhat.com>
      Cc: Felix Kuehling <Felix.Kuehling@amd.com>
      Cc: Jason Gunthorpe <jgg@mellanox.com>
      Cc: Ross Zwisler <zwisler@kernel.org>
      Cc: Dan Williams <dan.j.williams@intel.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Radim Krcmar <rkrcmar@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Christian Koenig <christian.koenig@amd.com>
      Cc: John Hubbard <jhubbard@nvidia.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6f4f13e8
  26. 06 3月, 2019 2 次提交
    • T
      mm,oom: don't kill global init via memory.oom.group · d342a0b3
      Tetsuo Handa 提交于
      Since setting global init process to some memory cgroup is technically
      possible, oom_kill_memcg_member() must check it.
      
        Tasks in /test1 are going to be killed due to memory.oom.group set
        Memory cgroup out of memory: Killed process 1 (systemd) total-vm:43400kB, anon-rss:1228kB, file-rss:3992kB, shmem-rss:0kB
        oom_reaper: reaped process 1 (systemd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
        Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000008b
      
      #include <stdio.h>
      #include <string.h>
      #include <unistd.h>
      #include <sys/types.h>
      #include <sys/stat.h>
      #include <fcntl.h>
      
      int main(int argc, char *argv[])
      {
      	static char buffer[10485760];
      	static int pipe_fd[2] = { EOF, EOF };
      	unsigned int i;
      	int fd;
      	char buf[64] = { };
      	if (pipe(pipe_fd))
      		return 1;
      	if (chdir("/sys/fs/cgroup/"))
      		return 1;
      	fd = open("cgroup.subtree_control", O_WRONLY);
      	write(fd, "+memory", 7);
      	close(fd);
      	mkdir("test1", 0755);
      	fd = open("test1/memory.oom.group", O_WRONLY);
      	write(fd, "1", 1);
      	close(fd);
      	fd = open("test1/cgroup.procs", O_WRONLY);
      	write(fd, "1", 1);
      	snprintf(buf, sizeof(buf) - 1, "%d", getpid());
      	write(fd, buf, strlen(buf));
      	close(fd);
      	snprintf(buf, sizeof(buf) - 1, "%lu", sizeof(buffer) * 5);
      	fd = open("test1/memory.max", O_WRONLY);
      	write(fd, buf, strlen(buf));
      	close(fd);
      	for (i = 0; i < 10; i++)
      		if (fork() == 0) {
      			char c;
      			close(pipe_fd[1]);
      			read(pipe_fd[0], &c, 1);
      			memset(buffer, 0, sizeof(buffer));
      			sleep(3);
      			_exit(0);
      		}
      	close(pipe_fd[0]);
      	close(pipe_fd[1]);
      	sleep(3);
      	return 0;
      }
      
      [   37.052923][ T9185] a.out invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
      [   37.056169][ T9185] CPU: 4 PID: 9185 Comm: a.out Kdump: loaded Not tainted 5.0.0-rc4-next-20190131 #280
      [   37.059205][ T9185] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
      [   37.062954][ T9185] Call Trace:
      [   37.063976][ T9185]  dump_stack+0x67/0x95
      [   37.065263][ T9185]  dump_header+0x51/0x570
      [   37.066619][ T9185]  ? trace_hardirqs_on+0x3f/0x110
      [   37.068171][ T9185]  ? _raw_spin_unlock_irqrestore+0x3d/0x70
      [   37.069967][ T9185]  oom_kill_process+0x18d/0x210
      [   37.071515][ T9185]  out_of_memory+0x11b/0x380
      [   37.072936][ T9185]  mem_cgroup_out_of_memory+0xb6/0xd0
      [   37.074601][ T9185]  try_charge+0x790/0x820
      [   37.076021][ T9185]  mem_cgroup_try_charge+0x42/0x1d0
      [   37.077629][ T9185]  mem_cgroup_try_charge_delay+0x11/0x30
      [   37.079370][ T9185]  do_anonymous_page+0x105/0x5e0
      [   37.080939][ T9185]  __handle_mm_fault+0x9cb/0x1070
      [   37.082485][ T9185]  handle_mm_fault+0x1b2/0x3a0
      [   37.083819][ T9185]  ? handle_mm_fault+0x47/0x3a0
      [   37.085181][ T9185]  __do_page_fault+0x255/0x4c0
      [   37.086529][ T9185]  do_page_fault+0x28/0x260
      [   37.087788][ T9185]  ? page_fault+0x8/0x30
      [   37.088978][ T9185]  page_fault+0x1e/0x30
      [   37.090142][ T9185] RIP: 0033:0x7f8b183aefe0
      [   37.091433][ T9185] Code: 20 f3 44 0f 7f 44 17 d0 f3 44 0f 7f 47 30 f3 44 0f 7f 44 17 c0 48 01 fa 48 83 e2 c0 48 39 d1 74 a3 66 0f 1f 84 00 00 00 00 00 <66> 44 0f 7f 01 66 44 0f 7f 41 10 66 44 0f 7f 41 20 66 44 0f 7f 41
      [   37.096917][ T9185] RSP: 002b:00007fffc5d329e8 EFLAGS: 00010206
      [   37.098615][ T9185] RAX: 00000000006010e0 RBX: 0000000000000008 RCX: 0000000000c30000
      [   37.100905][ T9185] RDX: 00000000010010c0 RSI: 0000000000000000 RDI: 00000000006010e0
      [   37.103349][ T9185] RBP: 0000000000000000 R08: 00007f8b188f4740 R09: 0000000000000000
      [   37.105797][ T9185] R10: 00007fffc5d32420 R11: 00007f8b183aef40 R12: 0000000000000005
      [   37.108228][ T9185] R13: 0000000000000000 R14: ffffffffffffffff R15: 0000000000000000
      [   37.110840][ T9185] memory: usage 51200kB, limit 51200kB, failcnt 125
      [   37.113045][ T9185] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
      [   37.115808][ T9185] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
      [   37.117660][ T9185] Memory cgroup stats for /test1: cache:0KB rss:49484KB rss_huge:30720KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:49700KB inactive_file:0KB active_file:0KB unevictable:0KB
      [   37.123371][ T9185] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/test1,task_memcg=/test1,task=a.out,pid=9188,uid=0
      [   37.128158][ T9185] Memory cgroup out of memory: Killed process 9188 (a.out) total-vm:14456kB, anon-rss:10324kB, file-rss:504kB, shmem-rss:0kB
      [   37.132710][ T9185] Tasks in /test1 are going to be killed due to memory.oom.group set
      [   37.132833][   T54] oom_reaper: reaped process 9188 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      [   37.135498][ T9185] Memory cgroup out of memory: Killed process 1 (systemd) total-vm:43400kB, anon-rss:1228kB, file-rss:3992kB, shmem-rss:0kB
      [   37.143434][ T9185] Memory cgroup out of memory: Killed process 9182 (a.out) total-vm:14456kB, anon-rss:76kB, file-rss:588kB, shmem-rss:0kB
      [   37.144328][   T54] oom_reaper: reaped process 1 (systemd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      [   37.147585][ T9185] Memory cgroup out of memory: Killed process 9183 (a.out) total-vm:14456kB, anon-rss:6228kB, file-rss:512kB, shmem-rss:0kB
      [   37.157222][ T9185] Memory cgroup out of memory: Killed process 9184 (a.out) total-vm:14456kB, anon-rss:6228kB, file-rss:508kB, shmem-rss:0kB
      [   37.157259][ T9185] Memory cgroup out of memory: Killed process 9185 (a.out) total-vm:14456kB, anon-rss:6228kB, file-rss:512kB, shmem-rss:0kB
      [   37.157291][ T9185] Memory cgroup out of memory: Killed process 9186 (a.out) total-vm:14456kB, anon-rss:4180kB, file-rss:508kB, shmem-rss:0kB
      [   37.157306][   T54] oom_reaper: reaped process 9183 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      [   37.157328][ T9185] Memory cgroup out of memory: Killed process 9187 (a.out) total-vm:14456kB, anon-rss:4180kB, file-rss:512kB, shmem-rss:0kB
      [   37.157452][ T9185] Memory cgroup out of memory: Killed process 9189 (a.out) total-vm:14456kB, anon-rss:6228kB, file-rss:512kB, shmem-rss:0kB
      [   37.158733][ T9185] Memory cgroup out of memory: Killed process 9190 (a.out) total-vm:14456kB, anon-rss:552kB, file-rss:512kB, shmem-rss:0kB
      [   37.160083][   T54] oom_reaper: reaped process 9186 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      [   37.160187][   T54] oom_reaper: reaped process 9189 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      [   37.206941][   T54] oom_reaper: reaped process 9185 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      [   37.212300][ T9185] Memory cgroup out of memory: Killed process 9191 (a.out) total-vm:14456kB, anon-rss:4180kB, file-rss:512kB, shmem-rss:0kB
      [   37.212317][   T54] oom_reaper: reaped process 9190 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      [   37.218860][ T9185] Memory cgroup out of memory: Killed process 9192 (a.out) total-vm:14456kB, anon-rss:1080kB, file-rss:512kB, shmem-rss:0kB
      [   37.227667][   T54] oom_reaper: reaped process 9192 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      [   37.292323][ T9193] abrt-hook-ccpp (9193) used greatest stack depth: 10480 bytes left
      [   37.351843][    T1] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000008b
      [   37.354833][    T1] CPU: 7 PID: 1 Comm: systemd Kdump: loaded Not tainted 5.0.0-rc4-next-20190131 #280
      [   37.357876][    T1] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
      [   37.361685][    T1] Call Trace:
      [   37.363239][    T1]  dump_stack+0x67/0x95
      [   37.365010][    T1]  panic+0xfc/0x2b0
      [   37.366853][    T1]  do_exit+0xd55/0xd60
      [   37.368595][    T1]  do_group_exit+0x47/0xc0
      [   37.370415][    T1]  get_signal+0x32a/0x920
      [   37.372449][    T1]  ? _raw_spin_unlock_irqrestore+0x3d/0x70
      [   37.374596][    T1]  do_signal+0x32/0x6e0
      [   37.376430][    T1]  ? exit_to_usermode_loop+0x26/0x9b
      [   37.378418][    T1]  ? prepare_exit_to_usermode+0xa8/0xd0
      [   37.380571][    T1]  exit_to_usermode_loop+0x3e/0x9b
      [   37.382588][    T1]  prepare_exit_to_usermode+0xa8/0xd0
      [   37.384594][    T1]  ? page_fault+0x8/0x30
      [   37.386453][    T1]  retint_user+0x8/0x18
      [   37.388160][    T1] RIP: 0033:0x7f42c06974a8
      [   37.389922][    T1] Code: Bad RIP value.
      [   37.391788][    T1] RSP: 002b:00007ffc3effd388 EFLAGS: 00010213
      [   37.394075][    T1] RAX: 000000000000000e RBX: 00007ffc3effd390 RCX: 0000000000000000
      [   37.396963][    T1] RDX: 000000000000002a RSI: 00007ffc3effd390 RDI: 0000000000000004
      [   37.399550][    T1] RBP: 00007ffc3effd680 R08: 0000000000000000 R09: 0000000000000000
      [   37.402334][    T1] R10: 00000000ffffffff R11: 0000000000000246 R12: 0000000000000001
      [   37.404890][    T1] R13: ffffffffffffffff R14: 0000000000000884 R15: 000056460b1ac3b0
      
      Link: http://lkml.kernel.org/r/201902010336.x113a4EO027170@www262.sakura.ne.jp
      Fixes: 3d8b38eb ("mm, oom: introduce memory.oom.group")
      Signed-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      d342a0b3
    • S
      mm, oom: remove 'prefer children over parent' heuristic · bbbe4802
      Shakeel Butt 提交于
      Since the start of the git history of Linux, the kernel after selecting
      the worst process to be oom-killed, prefer to kill its child (if the
      child does not share mm with the parent).  Later it was changed to
      prefer to kill a child who is worst.  If the parent is still the worst
      then the parent will be killed.
      
      This heuristic assumes that the children did less work than their parent
      and by killing one of them, the work lost will be less.  However this is
      very workload dependent.  If there is a workload which can benefit from
      this heuristic, can use oom_score_adj to prefer children to be killed
      before the parent.
      
      The select_bad_process() has already selected the worst process in the
      system/memcg.  There is no need to recheck the badness of its children
      and hoping to find a worse candidate.  That's a lot of unneeded racy
      work.  Also the heuristic is dangerous because it make fork bomb like
      workloads to recover much later because we constantly pick and kill
      processes which are not memory hogs.  So, let's remove this whole
      heuristic.
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/20190121215850.221745-2-shakeelb@google.comSigned-off-by: NShakeel Butt <shakeelb@google.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NRoman Gushchin <guro@fb.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      bbbe4802