1. 08 Oct, 2016 11 commits
    • M
      oom: warn if we go OOM for higher order and compaction is disabled · 9254990f
      Michal Hocko committed
      Since the lumpy reclaim is gone there is no source of higher order pages
      if CONFIG_COMPACTION=n except for the order-0 pages reclaim which is
      unreliable for that purpose to say the least.  Hitting an OOM for
      !costly higher order requests is therefore not all that hard to imagine.
      We are trying hard to avoid invoking the OOM killer as much as possible but
      there is simply no reliable way to detect whether more reclaim retries
      make sense.

      Disabling COMPACTION is not widespread but it seems that some users
      might have disabled the feature without realizing the full consequences
      (mostly along with disabling THP because compaction used to be mainly
      a THP thing).  This patch just adds a note if the OOM killer was
      triggered by a higher order request with compaction disabled.  This will
      help us identify possible misconfiguration right from the oom report
      which is easier than always keeping in mind that somebody might have
      disabled COMPACTION without a good reason.
      
      Link: http://lkml.kernel.org/r/20160830111632.GD23963@dhcp22.suse.cz
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9254990f
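The condition described in 9254990f can be modeled in a few lines of userspace C. This is only a sketch of the idea, not the kernel change itself: the function names, the boolean standing in for CONFIG_COMPACTION, and the message texts are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical predicate: a higher-order request (order > 0) hit OOM
 * on a build where compaction is not available. */
static bool oom_needs_compaction_note(unsigned int order, bool compaction_enabled)
{
	return order > 0 && !compaction_enabled;
}

/* Prints a simplified OOM report header and returns the number of
 * lines written. The message texts are illustrative only. */
static int dump_oom_header(const char *comm, unsigned int order,
			   bool compaction_enabled)
{
	int lines = 0;

	assert(comm != NULL);
	printf("%s invoked oom-killer: order=%u\n", comm, order);
	lines++;
	if (oom_needs_compaction_note(order, compaction_enabled)) {
		printf("note: compaction is disabled\n");
		lines++;
	}
	return lines;
}
```

An order-0 request never triggers the extra line in this model, which matches the commit's focus on higher order requests.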
    • M
      oom, oom_reaper: allow to reap mm shared by the kthreads · 1b51e65e
      Michal Hocko committed
      oom reaper was skipped for an mm which is shared with the kernel thread
      (aka use_mm()).  The primary concern was that such a kthread might want
      to read from the userspace memory and see zero page as a result of the
      oom reaper action.  This is no longer a problem after "mm: make sure
      that kthreads will not refault oom reaped memory" because any attempt to
      fault in when the MMF_UNSTABLE is set will result in SIGBUS and so the
      target user should see an error.  This means that we can finally allow
      the oom reaper to also reap tasks which share their mm with kthreads.
      
      Link: http://lkml.kernel.org/r/1472119394-11342-10-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1b51e65e
    • M
      mm: make sure that kthreads will not refault oom reaped memory · 3f70dc38
      Michal Hocko committed
      There are only a few use_mm() users in the kernel right now.  Most of them
      write to the target memory but the vhost driver relies on
      copy_from_user/get_user from a kernel thread context.  This makes it
      impossible to reap the memory of an oom victim which shares the mm with
      the vhost kernel thread because it could see a zero page unexpectedly
      and theoretically make an incorrect decision visible outside of the
      killed task context.
      
      To quote Michael S. Tsirkin:
      : Getting an error from __get_user and friends is handled gracefully.
      : Getting zero instead of a real value will cause userspace
      : memory corruption.
      
      The vhost kernel thread is bound to an open fd of the vhost device which
      is not tied to the mm owner life cycle in general.  The device fd can
      be inherited or passed over to another process which means that we
      really have to be careful about unexpected memory corruption because
      unlike for normal oom victims the result will be visible outside of the
      oom victim context.
      
      Make sure that no kthread context (users of use_mm) can ever see
      corrupted data because of the oom reaper and hook into the page fault
      path by checking MMF_UNSTABLE mm flag.  __oom_reap_task_mm will set the
      flag before it starts unmapping the address space while the flag is
      checked after the page fault has been handled.  If the flag is set then
      SIGBUS is triggered so any g-u-p user will get an error code.
      
      Regular tasks do not need this protection because all which share the mm
      are killed when the mm is reaped and so the corruption will not outlive
      them.
      
      This patch shouldn't have any visible effect at this moment because the
      OOM killer doesn't invoke oom reaper for tasks with mm shared with
      kthreads yet.
      
      Link: http://lkml.kernel.org/r/1472119394-11342-9-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: "Michael S. Tsirkin" <mst@redhat.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3f70dc38
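The ordering described in 3f70dc38 (the reaper sets the flag before unmapping, the fault path checks it after the fault was handled) can be sketched as a small userspace model. The struct layouts, the bit value, and the function names below are hypothetical; only the MMF_UNSTABLE name and the SIGBUS outcome come from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MMF_UNSTABLE_BIT (1u << 0)	/* hypothetical bit position */

enum fault_result { FAULT_OK = 0, FAULT_SIGBUS = 1 };

struct mm_model { unsigned int flags; };
struct task_model { bool is_kthread; struct mm_model *mm; };

/* The reaper marks the mm as unstable before it starts unmapping. */
static void reaper_begin(struct mm_model *mm)
{
	assert(mm != NULL);
	mm->flags |= MMF_UNSTABLE_BIT;
}

/* Called after the fault was handled: a kthread that borrowed an mm
 * which is being reaped gets an error instead of a zero page. */
static enum fault_result finish_fault(const struct task_model *tsk,
				      enum fault_result handled)
{
	assert(tsk != NULL && tsk->mm != NULL);
	if (tsk->is_kthread && (tsk->mm->flags & MMF_UNSTABLE_BIT))
		return FAULT_SIGBUS;
	return handled;
}
```

Regular tasks pass through unchanged in this model, reflecting the commit's point that they are killed together with the mm and so need no extra protection.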
    • T
      mm, oom: enforce exit_oom_victim on current task · 38531201
      Tetsuo Handa committed
      There are no users of exit_oom_victim on !current task anymore so enforce
      the API to always work on the current.
      
      Link: http://lkml.kernel.org/r/1472119394-11342-8-git-send-email-mhocko@kernel.org
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      38531201
    • M
      oom, suspend: fix oom_killer_disable vs. pm suspend properly · 7d2e7a22
      Michal Hocko committed
      Commit 74070542 ("oom, suspend: fix oom_reaper vs.
      oom_killer_disable race") has worked around an existing race between
      oom_killer_disable and oom_reaper by adding another round of
      try_to_freeze_tasks after the oom killer was disabled.  This was the
      easiest thing to do for a late 4.7 fix.  Let's fix it properly now.
      
      After "oom: keep mm of the killed task available" we no longer have to
      call exit_oom_victim from the oom reaper because we have stable mm
      available and hide the oom_reaped mm by MMF_OOM_SKIP flag.  So let's
      remove exit_oom_victim and the race described in the above commit
      doesn't exist anymore.
      
      Unfortunately this alone is not sufficient for the oom_killer_disable
      usecase because now we do not have any reliable way to reach
      exit_oom_victim (the victim might get stuck on a way to exit for an
      unbounded amount of time).  OOM killer can cope with that by checking mm
      flags and move on to another victim but we cannot do the same for
      oom_killer_disable as we would lose the guarantee of no further
      interference of the victim with the rest of the system.  What we can do
      instead is to cap the maximum time the oom_killer_disable waits for
      victims.  The only current user of this function (pm suspend) already
      has a concept of timeout for back off so we can reuse the same value
      there.
      
      Let's drop set_freezable for the oom_reaper kthread because it is no
      longer needed as the reaper doesn't wake or thaw any processes.
      
      Link: http://lkml.kernel.org/r/1472119394-11342-7-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7d2e7a22
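The bounded wait described in 7d2e7a22 might be modeled as follows. This is a userspace sketch under stated assumptions: the state struct, the tick callback, and the decision to re-enable the OOM killer when the timeout expires are all illustrative choices, not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model state: whether the OOM killer is disabled and
 * how many victims have not finished exiting yet. */
struct oom_state { bool disabled; int victims; };

typedef void (*tick_fn)(struct oom_state *st);

/* Demo helper: one victim finishes exiting per tick. */
static void victim_exits(struct oom_state *st)
{
	if (st->victims > 0)
		st->victims--;
}

/* Disable the OOM killer and wait at most `timeout` ticks for the
 * remaining victims. On timeout the killer is re-enabled (an
 * assumption of this sketch) and false is returned. */
static bool oom_killer_disable_model(struct oom_state *st, long timeout,
				     tick_fn tick)
{
	assert(st != NULL && timeout >= 0);
	st->disabled = true;
	for (long waited = 0; st->victims > 0; waited++) {
		if (waited >= timeout) {
			st->disabled = false;
			return false;
		}
		if (tick)
			tick(st);
	}
	return true;
}
```

A caller such as a suspend path would treat a false return as a reason to back off, which is the role the commit assigns to the existing pm suspend timeout.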
    • M
      mm, oom: get rid of signal_struct::oom_victims · 862e3073
      Michal Hocko committed
      After "oom: keep mm of the killed task available" we can safely detect
      an oom victim by checking task->signal->oom_mm so we do not need the
      signal_struct counter anymore so let's get rid of it.
      
      This alone wouldn't be sufficient for nommu archs because
      exit_oom_victim doesn't hide the process from the oom killer anymore.
      We can, however, mark the mm with a MMF flag in __mmput.  We can reuse
      MMF_OOM_REAPED and rename it to a more generic MMF_OOM_SKIP.
      
      Link: http://lkml.kernel.org/r/1472119394-11342-6-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      862e3073
    • M
      oom: keep mm of the killed task available · 26db62f1
      Michal Hocko committed
      oom_reap_task has to call exit_oom_victim in order to make sure that the
      oom victim will not block the oom killer forever.  This is, however,
      opening new problems (e.g. oom_killer_disable exclusion - see commit
      74070542 ("oom, suspend: fix oom_reaper vs.  oom_killer_disable
      race")).  exit_oom_victim should be only called from the victim's
      context ideally.
      
      One way to achieve this would be to rely on per mm_struct flags.  We
      already have MMF_OOM_REAPED to hide a task from the oom killer since
      "mm, oom: hide mm which is shared with kthread or global init". The
      problem is that the exit path:
      
        do_exit
          exit_mm
            tsk->mm = NULL;
            mmput
              __mmput
            exit_oom_victim
      
      doesn't guarantee that exit_oom_victim will get called in a bounded
      amount of time.  At least exit_aio depends on IO which might get blocked
      due to lack of memory and who knows what else is lurking there.
      
      This patch takes a different approach.  We remember tsk->mm into the
      signal_struct and bind it to the signal struct life time for all oom
      victims.  __oom_reap_task_mm as well as oom_scan_process_thread do not
      have to rely on find_lock_task_mm anymore and they will have a reliable
      reference to the mm struct.  As a result all the oom specific
      communication inside the OOM killer can be done via tsk->signal->oom_mm.
      
      Increasing the signal_struct for something as unlikely as the oom killer
      is far from ideal but this approach will make the code much more
      reasonable and long term we even might want to move task->mm into the
      signal_struct anyway.  In the next step we might want to make the oom
      killer exclusion and access to memory reserves completely independent
      which would be also nice.
      
      Link: http://lkml.kernel.org/r/1472119394-11342-4-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      26db62f1
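The approach in 26db62f1 (remember tsk->mm in the signal struct so the mm stays reachable after exit_mm clears tsk->mm) can be illustrated with a toy reference-counting model. All struct layouts and function names here are hypothetical; only the oom_mm field name and the lifetime rule come from the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct mm_model { int mm_count; };		/* simplified pin counter */
struct signal_model { struct mm_model *oom_mm; };
struct task_model { struct mm_model *mm; struct signal_model *signal; };

/* Remember the victim's mm in the signal struct and pin it once. */
static void mark_oom_victim_model(struct task_model *tsk)
{
	assert(tsk != NULL && tsk->signal != NULL && tsk->mm != NULL);
	if (tsk->signal->oom_mm == NULL) {
		tsk->mm->mm_count++;
		tsk->signal->oom_mm = tsk->mm;
	}
}

/* An OOM victim is detected through signal->oom_mm, not tsk->mm. */
static bool is_oom_victim_model(const struct task_model *tsk)
{
	return tsk->signal->oom_mm != NULL;
}

/* The exit path clears tsk->mm; the pinned reference survives. */
static void exit_mm_model(struct task_model *tsk)
{
	tsk->mm = NULL;
}

/* The pin is dropped only when the signal struct itself goes away. */
static void free_signal_model(struct signal_model *sig)
{
	if (sig->oom_mm != NULL) {
		sig->oom_mm->mm_count--;
		sig->oom_mm = NULL;
	}
}
```

Marking the same task twice pins the mm only once in this model, so the counter stays balanced against the single release in free_signal_model().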
    • T
      mm,oom_reaper: do not attempt to reap a task twice · 8496afab
      Tetsuo Handa committed
      "mm, oom_reaper: do not attempt to reap a task twice" tried to give the
      OOM reaper one more chance to retry using MMF_OOM_NOT_REAPABLE flag.
      But the usefulness of the flag is rather limited and actually never
      shown in practice.  If the flag is set, it means that the holder of
      mm->mmap_sem cannot call up_write() due to presumably being blocked at
      unkillable wait waiting for other thread's memory allocation.  But since
      one of threads sharing that mm will queue that mm immediately via
      task_will_free_mem() shortcut (otherwise, oom_badness() will select the
      same mm again due to oom_score_adj value unchanged), retrying
      MMF_OOM_NOT_REAPABLE mm is unlikely helpful.
      
      Let's always set MMF_OOM_REAPED.
      
      Link: http://lkml.kernel.org/r/1472119394-11342-3-git-send-email-mhocko@kernel.org
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8496afab
    • T
      mm,oom_reaper: reduce find_lock_task_mm() usage · 7ebffa45
      Tetsuo Handa committed
      Patch series "fortify oom killer even more", v2.
      
      This patch (of 9):
      
      __oom_reap_task() can be simplified a bit if it receives a valid mm from
      oom_reap_task() which also uses that mm when __oom_reap_task() failed.
      We can drop one find_lock_task_mm() call and also make the
      __oom_reap_task() code flow easier to follow.  Moreover, this will make
      later patch in the series easier to review.  Pinning mm's mm_count for
      longer time is not really harmful because this will not pin much memory.
      
      This patch doesn't introduce any functional change.
      
      Link: http://lkml.kernel.org/r/1472119394-11342-2-git-send-email-mhocko@kernel.org
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7ebffa45
    • M
      mm/oom_kill.c: fix task_will_free_mem() comment · 5870c2e1
      Michal Hocko committed
      Attempt to demystify the task_will_free_mem() loop.
      
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5870c2e1
    • V
      mm: oom: deduplicate victim selection code for memcg and global oom · 7c5f64f8
      Vladimir Davydov committed
      When selecting an oom victim, we use the same heuristic for both memory
      cgroup and global oom.  The only difference is the scope of tasks to
      select the victim from.  So we could just export an iterator over all
      memcg tasks and keep all oom related logic in oom_kill.c, but instead we
      duplicate pieces of it in memcontrol.c reusing some initially private
      functions of oom_kill.c in order to not duplicate all of it.  That looks
      ugly and error prone, because any modification of select_bad_process
      should also be propagated to mem_cgroup_out_of_memory.
      
      Let's rework this as follows: keep all oom heuristic related code private
      to oom_kill.c and make oom_kill.c use exported memcg functions when it's
      really necessary (like in case of iterating over memcg tasks).
      
      Link: http://lkml.kernel.org/r/1470056933-7505-1-git-send-email-vdavydov@virtuozzo.com
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      7c5f64f8
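The structure proposed in 7c5f64f8 (one victim-selection heuristic, with only the task iteration differing between the memcg and global cases) can be sketched as a callback-driven scan. The task layout, the use of a negative memcg id to mean "global", and every function name below are assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

struct task_model { int pid; int memcg_id; long badness; };

typedef int (*task_visitor)(const struct task_model *tsk, void *arg);

/* Iterator: visits every task when memcg_id < 0, otherwise only the
 * tasks of that memcg. Stops early if the visitor returns nonzero. */
static void scan_tasks(const struct task_model *tasks, size_t n,
		       int memcg_id, task_visitor fn, void *arg)
{
	for (size_t i = 0; i < n; i++) {
		if (memcg_id >= 0 && tasks[i].memcg_id != memcg_id)
			continue;
		if (fn(&tasks[i], arg))
			break;
	}
}

struct oom_choice { const struct task_model *chosen; long chosen_points; };

/* The single heuristic, shared by both scopes: keep the worst task. */
static int evaluate_task(const struct task_model *tsk, void *arg)
{
	struct oom_choice *oc = arg;

	if (tsk->badness > oc->chosen_points) {
		oc->chosen = tsk;
		oc->chosen_points = tsk->badness;
	}
	return 0;
}

static const struct task_model *select_victim(const struct task_model *tasks,
					      size_t n, int memcg_id)
{
	struct oom_choice oc = { NULL, 0 };

	assert(tasks != NULL || n == 0);
	scan_tasks(tasks, n, memcg_id, evaluate_task, &oc);
	return oc.chosen;
}
```

Because evaluate_task() is the only place that ranks tasks, a change to the heuristic applies to both scopes at once, which is the maintenance benefit the commit argues for.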
  2. 12 Aug, 2016 1 commit
  3. 29 Jul, 2016 8 commits
    • M
      mm, oom: tighten task_will_free_mem() locking · 091f362c
      Michal Hocko committed
      "mm, oom: fortify task_will_free_mem" has dropped task_lock around
      task_will_free_mem in oom_kill_process because it assumed that a
      potential race when the selected task exits will not be a problem as the
      oom_reaper will call exit_oom_victim.
      
      Tetsuo was objecting that nommu doesn't have oom_reaper so the race
      would be still possible.  The code would be racy and lockup prone
      theoretically in other aspects without the oom reaper anyway so I didn't
      consider this a big deal.  But it seems that further changes I am
      planning in this area will benefit from stable task->mm in this path as
      well.  So let's drop find_lock_task_mm from task_will_free_mem and call
      it from under task_lock as we did previously.  Just pull the task->mm !=
      NULL check inside the function.
      
      Link: http://lkml.kernel.org/r/1467201562-6709-1-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      091f362c
    • M
      mm, oom: hide mm which is shared with kthread or global init · a373966d
      Michal Hocko committed
      The only case where the oom_reaper is not triggered for the oom victim
      is when it shares the memory with a kernel thread (aka use_mm) or with
      the global init.  After "mm, oom: skip vforked tasks from being
      selected" the victim cannot be a vforked task of the global init so we
      are left with clone(CLONE_VM) (without CLONE_SIGHAND).  use_mm() users
      are quite rare as well.
      
      In order to help forward progress for the OOM killer, make sure that
      this really rare case will not get in the way - we do this by hiding the
      mm from the oom killer by setting MMF_OOM_REAPED flag for it.
      oom_scan_process_thread will ignore any TIF_MEMDIE task if it has
      MMF_OOM_REAPED flag set to catch these oom victims.
      
      After this patch we should guarantee forward progress for the OOM killer
      even when the selected victim is sharing memory with a kernel thread or
      global init as long as the victim's mm is still alive.
      
      Link: http://lkml.kernel.org/r/1466426628-15074-11-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a373966d
    • M
      mm, oom_reaper: do not attempt to reap a task more than twice · 11a410d5
      Michal Hocko committed
      oom_reaper relies on the mmap_sem for read to do its job.  Many places
      which might block readers have been converted to use down_write_killable
      and that has reduced chances of the contention a lot.  Some paths where
      the mmap_sem is held for write can take other locks and they might
      either be not prepared to fail due to fatal signal pending or too
      impractical to be changed.
      
      This patch introduces MMF_OOM_NOT_REAPABLE flag which gets set after the
      first attempt to reap a task's mm fails.  If the flag is present after
      the failure then we set MMF_OOM_REAPED to hide this mm from the oom
      killer completely so it can go and choose another victim.

      As a result a risk of OOM deadlock when the oom victim would be blocked
      indefinitely and so the oom killer cannot make any progress should be
      mitigated considerably while we still try really hard to perform all
      reclaim attempts and stay predictable in the behavior.
      
      Link: http://lkml.kernel.org/r/1466426628-15074-10-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      11a410d5
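The two-strike rule from 11a410d5 reduces to a tiny state machine over the mm flags. The sketch below models only the failure path as the commit message describes it; the bit values, struct, and function name are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MMF_OOM_REAPED_BIT		(1u << 0)	/* hypothetical bit values */
#define MMF_OOM_NOT_REAPABLE_BIT	(1u << 1)

struct mm_model { unsigned int flags; };

/* Record a failed reap attempt. The first failure only marks the mm
 * as not reapable; a second failure hides it from the OOM killer.
 * Returns true once the mm is hidden. */
static bool note_reap_failure(struct mm_model *mm)
{
	assert(mm != NULL);
	if (mm->flags & MMF_OOM_NOT_REAPABLE_BIT)
		mm->flags |= MMF_OOM_REAPED_BIT;
	else
		mm->flags |= MMF_OOM_NOT_REAPABLE_BIT;
	return (mm->flags & MMF_OOM_REAPED_BIT) != 0;
}
```

In this model the OOM killer stays blocked on the victim for at most two failed attempts before it is free to pick another one.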
    • M
      mm, oom: task_will_free_mem should skip oom_reaped tasks · 696453e6
      Michal Hocko committed
      The 0-day robot has encountered the following:
      
         Out of memory: Kill process 3914 (trinity-c0) score 167 or sacrifice child
         Killed process 3914 (trinity-c0) total-vm:55864kB, anon-rss:1512kB, file-rss:1088kB, shmem-rss:25616kB
         oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26488kB
         oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26900kB
         oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26900kB
         oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:27296kB
         oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:28148kB
      
      oom_reaper is trying to reap the same task again and again.
      
      This is possible only when the oom killer is bypassed because of
      task_will_free_mem because we skip over tasks with MMF_OOM_REAPED
      already set during select_bad_process.  Teach task_will_free_mem to skip
      over MMF_OOM_REAPED tasks as well because they will be unlikely to free
      anything more.
      
      Analyzed by Tetsuo Handa.
      
      Link: http://lkml.kernel.org/r/1466426628-15074-9-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      696453e6
    • M
      mm, oom: fortify task_will_free_mem() · 1af8bb43
      Michal Hocko committed
      task_will_free_mem is rather weak.  It doesn't really tell whether the
      task has a chance to drop its mm.  98748bd7 ("oom: consider
      multi-threaded tasks in task_will_free_mem") made a first step into making
      it more robust for multi-threaded applications so now we know that the
      whole process is going down and will probably drop the mm.
      
      This patch builds on top for more complex scenarios where mm is shared
      between different processes - CLONE_VM without CLONE_SIGHAND, or in kernel
      use_mm().
      
      Make sure that all processes sharing the mm are killed or exiting.  This
      will allow us to replace try_oom_reaper by wake_oom_reaper because
      task_will_free_mem implies the task is reapable now.  Therefore all paths
      which bypass the oom killer are now reapable and so they shouldn't lock up
      the oom killer.
      
      Link: http://lkml.kernel.org/r/1466426628-15074-8-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1af8bb43
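The rule stated in 1af8bb43 ("all processes sharing the mm are killed or exiting") can be sketched as a scan over a flat task table. The table representation, the integer mm id, and the single "dying" flag covering both killed and exiting tasks are simplifying assumptions of this model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical task record: which mm it uses and whether it is
 * already killed or exiting. */
struct task_model { int mm_id; bool dying; };

/* Returns true only if the task itself is dying and every other task
 * sharing its mm is dying as well, so the mm will really be dropped. */
static bool task_will_free_mem_model(const struct task_model *tasks,
				     size_t n, size_t self)
{
	assert(tasks != NULL && self < n);
	if (!tasks[self].dying)
		return false;
	for (size_t i = 0; i < n; i++) {
		if (i == self || tasks[i].mm_id != tasks[self].mm_id)
			continue;
		if (!tasks[i].dying)
			return false;
	}
	return true;
}
```

A live sharer (for example a CLONE_VM process without CLONE_SIGHAND) makes the function return false, so the caller cannot assume the memory is about to be freed.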
    • M
      mm, oom: kill all tasks sharing the mm · 97fd49c2
      Michal Hocko committed
      Currently oom_kill_process skips both the oom reaper and SIGKILL if a
      process sharing the same mm is unkillable via OOM_ADJUST_MIN.  After "mm,
      oom_adj: make sure processes sharing mm have same view of oom_score_adj"
      all such processes are sharing the same value so we shouldn't see such a
      task at all (oom_badness would rule them out).
      
      We can still encounter oom disabled vforked task which has to be killed as
      well if we want to have other tasks sharing the mm reapable because it can
      access the memory before doing exec.  Killing such a task should be
      acceptable because it is highly unlikely it has done anything useful
      because it cannot modify any memory before it calls exec.  An alternative
      would be to keep the task alive and skip the oom reaper and risk all the
      weird corner cases where the OOM killer cannot make forward progress
      because the oom victim hung somewhere on the way to exit.
      
      [rientjes@google.com - drop printk when OOM_SCORE_ADJ_MIN killed task
       the setting is inherently racy and we cannot do much about it without
       introducing locks in hot paths]
      Link: http://lkml.kernel.org/r/1466426628-15074-7-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      97fd49c2
    • M
      mm, oom: skip vforked tasks from being selected · b18dc5f2
      Michal Hocko committed
      vforked tasks are not really sitting on any memory.  They are sharing the
      mm with parent until they exec into a new code.  Until then it is just
      pinning the address space.  OOM killer will kill the vforked task along
      with its parent but we still can end up selecting vforked task when the
      parent wouldn't be selected.  E.g.  init doing vfork to launch a task or
      vforked being a child of oom unkillable task with an updated oom_score_adj
      to be killable.
      
      Add a new helper to check whether a task is in the vfork sharing memory
      with its parent and use it in oom_badness to skip over these tasks.
      
      Link: http://lkml.kernel.org/r/1466426628-15074-6-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      b18dc5f2
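The helper described in b18dc5f2 (is the task in vfork and still sharing memory with its parent?) and its use in the badness calculation can be modeled like this. The struct fields, the vfork_pending flag, and the use of rss pages as the score are assumptions made for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct mm_model { int id; };

struct task_model {
	struct mm_model *mm;
	struct task_model *real_parent;
	bool vfork_pending;	/* parent still waits for exec or exit */
	long rss_pages;
};

/* True while a vforked child still borrows its parent's mm. */
static bool in_vfork_model(const struct task_model *t)
{
	return t->vfork_pending && t->real_parent != NULL &&
	       t->real_parent->mm == t->mm;
}

/* Simplified badness: tasks without an mm or in vfork score 0, so
 * they are never selected as an OOM victim in this model. */
static long badness_model(const struct task_model *t)
{
	assert(t != NULL);
	if (t->mm == NULL || in_vfork_model(t))
		return 0;
	return t->rss_pages;
}
```

Once the child has its own mm (after exec) the pointer comparison fails and the task is scored like any other.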
    • M
      mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj · 44a70ade
      Michal Hocko committed
      oom_score_adj is shared for the thread groups (via struct signal) but this
      is not sufficient to cover processes sharing mm (CLONE_VM without
      CLONE_SIGHAND) and so we can easily end up in a situation when some
      processes update their oom_score_adj and confuse the oom killer.  In the
      worst case some of those processes might hide from the oom killer
      altogether via OOM_SCORE_ADJ_MIN while others are eligible.  OOM killer
      would then pick up those eligible but won't be allowed to kill others
      sharing the same mm so the mm wouldn't be released and neither would the memory.
      
      It would be ideal to have the oom_score_adj per mm_struct because that is
      the natural entity OOM killer considers.  But this will not work because
      some programs are doing
      
      	vfork()
      	set_oom_adj()
      	exec()
      
      We can achieve the same though.  oom_score_adj write handler can set the
      oom_score_adj for all processes sharing the same mm if the task is not in
      the middle of vfork.  As a result all the processes will share the same
      oom_score_adj.  The current implementation is rather pessimistic and
      checks all the existing processes by default if there is more than 1
      holder of the mm but we do not have any reliable way to check for external
      users yet.
      
      Link: http://lkml.kernel.org/r/1466426628-15074-5-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      44a70ade
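The write-handler behaviour described in 44a70ade (propagate oom_score_adj to every process sharing the mm unless the writer is in the middle of vfork) can be sketched over a flat task table. The table, the integer mm id, and the function name are hypothetical, and the real handler's locking is left out entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct task_model { int mm_id; bool in_vfork; int oom_score_adj; };

/* Sets oom_score_adj for the writer and, unless the writer is in the
 * middle of vfork, for every other task sharing its mm. Returns the
 * number of tasks updated. */
static size_t set_oom_score_adj_model(struct task_model *tasks, size_t n,
				      size_t writer, int value)
{
	size_t updated = 1;

	assert(tasks != NULL && writer < n);
	tasks[writer].oom_score_adj = value;
	if (tasks[writer].in_vfork)
		return updated;
	for (size_t i = 0; i < n; i++) {
		if (i == writer || tasks[i].mm_id != tasks[writer].mm_id)
			continue;
		tasks[i].oom_score_adj = value;
		updated++;
	}
	return updated;
}
```

The vfork exception keeps the vfork(), set_oom_adj(), exec() pattern from the commit message working: the child can change its own value without touching the parent it temporarily shares an mm with.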
  4. 27 Jul, 2016 4 commits
  5. 25 Jun, 2016 2 commits
  6. 04 Jun, 2016 1 commit
  7. 28 May, 2016 2 commits
    • M
      oom_reaper: close race with exiting task · e2fe1456
      Michal Hocko committed
      Tetsuo has reported:
        Out of memory: Kill process 443 (oleg's-test) score 855 or sacrifice child
        Killed process 443 (oleg's-test) total-vm:493248kB, anon-rss:423880kB, file-rss:4kB, shmem-rss:0kB
        sh invoked oom-killer: gfp_mask=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), order=0, oom_score_adj=0
        sh cpuset=/ mems_allowed=0
        CPU: 2 PID: 1 Comm: sh Not tainted 4.6.0-rc7+ #51
        Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
        Call Trace:
          dump_stack+0x85/0xc8
          dump_header+0x5b/0x394
        oom_reaper: reaped process 443 (oleg's-test), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      
      In other words:
      
        __oom_reap_task		exit_mm
          atomic_inc_not_zero
      				  tsk->mm = NULL
      				  mmput
      				    atomic_dec_and_test # > 0
      				  exit_oom_victim # New victim will be
      						  # selected
      				<OOM killer invoked>
      				  # no TIF_MEMDIE task so we can select a new one
          unmap_page_range # to release the memory
      
      The race exists even without the oom_reaper because anybody who pins the
      address space and gets preempted might race with exit_mm but oom_reaper
      made this race more probable.
      
      We can address the oom_reaper part by using oom_lock for __oom_reap_task
      because this would guarantee that a new oom victim will not be selected
      if the oom reaper might race with the exit path.  This doesn't solve the
      original issue, though, because somebody else still might be pinning
      mm_users and so __mmput won't be called to release the memory but that
      is not really reliably solvable because the task will get away from the
      oom sight as soon as it is unhashed from the task_list and so we cannot
      guarantee a new victim won't be selected.
      
      [akpm@linux-foundation.org: fix use of unused `mm', Per Stephen]
      [akpm@linux-foundation.org: coding-style fixes]
      Fixes: aac45363 ("mm, oom: introduce oom reaper")
      Link: http://lkml.kernel.org/r/1464271493-20008-1-git-send-email-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e2fe1456
    • V
      mm: oom: do not reap task if there are live threads in threadgroup · edd9f723
      Vladimir Davydov committed
      If the current process is exiting, we don't invoke oom killer, instead
      we give it access to memory reserves and try to reap its mm in case
      nobody is going to use it.  There's a mistake in the code performing
      this check - we just ignore any process of the same thread group no
      matter if it is exiting or not - see try_oom_reaper.  Fix it.
      
      Link: http://lkml.kernel.org/r/1464087628-7318-1-git-send-email-vdavydov@virtuozzo.com
      Fixes: 3ef22dff ("oom, oom_reaper: try to reap tasks which skip regular OOM killer path")
      Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      edd9f723
  8. 21 May, 2016 3 commits
    • T
      mm,oom: speed up select_bad_process() loop · f44666b0
      Tetsuo Handa authored
      Since commit 3a5dda7a ("oom: prevent unnecessary oom kills or kernel
      panics"), select_bad_process() is using for_each_process_thread().
      
      Since oom_unkillable_task() scans all threads in the caller's thread
      group and oom_task_origin() scans signal_struct of the caller's thread
      group, we don't need to call oom_unkillable_task() and oom_task_origin()
      on each thread.  Also, since !mm test will be done later at
      oom_badness(), we don't need to do !mm test on each thread.  Therefore,
      we only need to do TIF_MEMDIE test on each thread.
      
      Although the original code was correct, it was quite inefficient because
      each thread group was scanned num_threads times, which can be a lot
      for processes with many threads.  Even though OOM is an extremely cold
      path, it is always good to be as efficient as possible when we are
      inside rcu_read_lock() - aka unpreemptible context.
      
      If we track number of TIF_MEMDIE threads inside signal_struct, we don't
      need to do TIF_MEMDIE test on each thread.  This will allow
      select_bad_process() to use for_each_process().
      
      This patch adds a counter to signal_struct for tracking how many
      TIF_MEMDIE threads are in a given thread group, and checks it in
      oom_scan_process_thread() so that select_bad_process() can use
      for_each_process() rather than for_each_process_thread().
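
      The effect of the counter can be illustrated with a small user-space
      model.  This is a sketch under stated assumptions, not the kernel
      implementation: thread_model, group_model and the helper names are
      hypothetical, and the oom_victims field only stands in for whatever
      counter the patch adds to signal_struct.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-thread state: one flag standing in for TIF_MEMDIE. */
struct thread_model {
    int tif_memdie;
};

/* Hypothetical per-thread-group state, standing in for signal_struct. */
struct group_model {
    struct thread_model *threads;
    size_t nr_threads;
    int oom_victims; /* number of threads with tif_memdie set */
};

static void mark_victim(struct group_model *g, size_t i)
{
    if (!g->threads[i].tif_memdie) {
        g->threads[i].tif_memdie = 1;
        g->oom_victims++;
    }
}

static void unmark_victim(struct group_model *g, size_t i)
{
    if (g->threads[i].tif_memdie) {
        g->threads[i].tif_memdie = 0;
        assert(g->oom_victims > 0); /* counter must never go negative */
        g->oom_victims--;
    }
}

/* Old approach: test every thread; *visited reports how many were read. */
static int has_victim_scan(const struct group_model *g, size_t *visited)
{
    size_t i;

    *visited = 0;
    for (i = 0; i < g->nr_threads; i++) {
        (*visited)++;
        if (g->threads[i].tif_memdie)
            return 1;
    }
    return 0;
}

/* New approach: one read per thread group, independent of its size. */
static int has_victim_counter(const struct group_model *g)
{
    return g->oom_victims > 0;
}
```

      With the counter, the per-group check costs the same for a group of one
      thread and a group of thousands, which is what lets the outer loop walk
      processes instead of threads.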
      
      [mhocko@suse.com: do not blow the signal_struct size]
        Link: http://lkml.kernel.org/r/20160520075035.GF19172@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/201605182230.IDC73435.MVSOHLFOQFOJtF@I-love.SAKURA.ne.jp
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      f44666b0
    • M
      mm, oom_reaper: do not mmput synchronously from the oom reaper context · ec8d7c14
      Michal Hocko authored
      Tetsuo has correctly noted that the mmput slow path might get blocked
      waiting for another party (e.g. exit_aio waits for an IO).  If that
      happens, the oom_reaper would be put out of the way and would not be able
      to process the next oom victim.  We should strive to make this context
      as reliable and as independent of other subsystems as possible.
      
      Introduce mmput_async which will perform the slow path from an async
      (WQ) context.  This will delay the operation but that shouldn't be a
      problem because the oom_reaper has reclaimed the victim's address space
      for most cases as much as possible and the remaining context shouldn't
      bind too much memory anymore.  The only exception is when mmap_sem
      trylock has failed which shouldn't happen too often.
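
      The deferral can be sketched in user space as follows.  This is an
      illustration only: reaped_mm, pending_head and the function names are
      hypothetical, and a plain linked list drained by run_pending_work()
      stands in for the kernel workqueue.

```c
#include <stddef.h>

/* Hypothetical model of an address space with a user reference count. */
struct reaped_mm {
    int users;                     /* models mm_users */
    int torn_down;                 /* set once the slow path has run */
    struct reaped_mm *next_pending;
};

/* Hypothetical list of deferred teardowns, standing in for a workqueue. */
static struct reaped_mm *pending_head;

/* The potentially blocking slow path (e.g. waiting for IO to finish). */
static void slow_teardown(struct reaped_mm *mm)
{
    mm->torn_down = 1;
}

/* Synchronous put: the caller itself runs the slow path on the last ref. */
static void mm_put_sync(struct reaped_mm *mm)
{
    if (--mm->users == 0)
        slow_teardown(mm);
}

/* Asynchronous put: on the last ref, only queue the slow path. */
static void mm_put_async(struct reaped_mm *mm)
{
    if (--mm->users == 0) {
        mm->next_pending = pending_head;
        pending_head = mm;
    }
}

/* Worker context: run every queued teardown; returns how many ran. */
static int run_pending_work(void)
{
    int done = 0;

    while (pending_head) {
        struct reaped_mm *mm = pending_head;

        pending_head = mm->next_pending;
        mm->next_pending = NULL;
        slow_teardown(mm);
        done++;
    }
    return done;
}
```

      The caller of mm_put_async() never runs slow_teardown() itself, so in
      this model the reaper cannot get stuck behind whatever the slow path
      waits for; the cost is that the teardown happens later.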
      
      The issue is only theoretical but not impossible.
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      ec8d7c14
    • M
      mm, oom_reaper: hide oom reaped tasks from OOM killer more carefully · bb8a4b7f
      Michal Hocko authored
      Commit 36324a99 ("oom: clear TIF_MEMDIE after oom_reaper managed to
      unmap the address space") not only clears TIF_MEMDIE for the oom reaped
      task but also sets OOM_SCORE_ADJ_MIN for the target task to hide it from
      the oom killer.  This works in simple cases but it is not sufficient for
      (unlikely) cases where the mm is shared between independent processes
      (as they do not share signal struct).  If the mm had only a small amount
      of memory which could be reaped, then another task sharing the mm could
      be selected, and that wouldn't help to move out of the oom situation.
      
      Introduce the MMF_OOM_REAPED mm flag, which is checked in oom_badness
      (same as OOM_SCORE_ADJ_MIN); a task is skipped if the flag is set.  Set
      the flag after __oom_reap_task is done with a task.  This forces
      select_bad_process() to ignore all already oom reaped tasks and ensures
      that no such task is sacrificed for its parent.
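
      Why a flag on the mm helps where a per-process score adjustment does not
      can be shown with a small model.  This is a sketch, not the kernel's
      oom_badness: shared_mm, task_model and MM_REAPED_MODEL are hypothetical
      names, and the resident page count is used as the whole score for
      simplicity.

```c
#include <stddef.h>

#define SCORE_ADJ_MIN_MODEL (-1000)  /* same value as OOM_SCORE_ADJ_MIN */
#define MM_REAPED_MODEL     (1UL << 0) /* hypothetical bit for MMF_OOM_REAPED */

/* Hypothetical address space that several processes may share. */
struct shared_mm {
    unsigned long flags;
    unsigned long rss_pages;
};

/* Hypothetical process: its own score adjustment, a possibly shared mm. */
struct task_model {
    struct shared_mm *mm;
    int score_adj;
};

/* Returns 0 for tasks the modelled OOM killer must not select. */
static unsigned long badness_model(const struct task_model *t)
{
    if (t->mm == NULL)
        return 0;
    if (t->score_adj == SCORE_ADJ_MIN_MODEL)
        return 0;
    if (t->mm->flags & MM_REAPED_MODEL)
        return 0; /* already reaped: killing a sharer frees little */
    return t->mm->rss_pages;
}

/* Called once the reaper is done with the address space. */
static void mark_reaped(struct shared_mm *mm)
{
    mm->flags |= MM_REAPED_MODEL;
}
```

      Setting score_adj on one process hides only that process, whereas
      mark_reaped() hides every task_model that points at the same shared_mm.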
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      bb8a4b7f
  9. 20 May, 2016 3 commits
    • M
      mm, oom_reaper: clear TIF_MEMDIE for all tasks queued for oom_reaper · 449d777d
      Michal Hocko authored
      Right now the oom reaper will clear TIF_MEMDIE only for tasks which were
      successfully reaped.  This is the safest option because we know that
      such an oom victim would only block forward progress of the oom killer
      without a good reason because it is highly unlikely it would release
      much more memory.  Basically most of its memory has been already torn
      down.
      
      We can relax this assumption to catch more corner cases though.
      
      The first obvious one is when the oom victim clears its mm and gets
      stuck later on.  oom_reaper would back off on find_lock_task_mm returning
      NULL.  We can safely try to clear TIF_MEMDIE in this case because such a
      task would be ignored by the oom killer anyway.  Most of the time the
      flag would already be cleared by then anyway.
      
      The less obvious one is when the oom reaper fails due to mmap_sem
      contention.  Even if we clear TIF_MEMDIE for this task then it is not
      very likely that we would select another task too easily because we
      haven't reaped the last victim and so it would be still the #1
      candidate.  There is a rare race condition possible when the current
      victim terminates before the next select_bad_process but considering
      that oom_reap_task had retried several times before giving up then this
      sounds like a borderline thing.
      
      After this patch we should have a guarantee that the OOM killer will not
      be blocked for an unbounded amount of time in most cases.
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Raushaniya Maksudova <rmaksudova@parallels.com>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Daniel Vetter <daniel.vetter@intel.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      449d777d
    • M
      oom, oom_reaper: try to reap tasks which skip regular OOM killer path · 3ef22dff
      Michal Hocko authored
      If either the current task is already killed or PF_EXITING, or a selected
      task is PF_EXITING, then the oom killer is suppressed and so is the oom
      reaper.  This patch adds try_oom_reaper, which checks the given task and
      queues it for the oom reaper if that is safe to do, meaning that the
      task doesn't share the mm with a live process.
      
      This might help to release the memory pressure while the task tries to
      exit.
      
      [akpm@linux-foundation.org: fix nommu build]
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Raushaniya Maksudova <rmaksudova@parallels.com>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Daniel Vetter <daniel.vetter@intel.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3ef22dff
    • M
      mm, oom: move GFP_NOFS check to out_of_memory · 3da88fb3
      Michal Hocko authored
      __alloc_pages_may_oom is the central place to decide when the
      out_of_memory should be invoked.  This is a good approach for most
      checks there because they are page allocator specific and the allocation
      fails right after for all of them.
      
      The notable exception is GFP_NOFS context, which fakes
      did_some_progress and keeps the page allocator looping even though
      there couldn't have been any progress from the OOM killer.  This patch
      doesn't change this behavior because we are not ready to allow those
      allocation requests to fail yet (and maybe we will face the reality
      that we will never manage to safely fail these requests).  Instead, the
      __GFP_FS check is moved down to out_of_memory, where it prevents OOM
      victim selection.  There are two reasons for that:
      
        - OOM notifiers might release some memory even from this context,
          as none of the registered notifiers seems to be FS related
        - this might help a dying thread to get access to memory
          reserves and move on, which will make the behavior more
          consistent with the case when the task gets killed from a
          different context.
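
      The resulting order of checks can be sketched as a small decision
      function.  This is a user-space model under stated assumptions, not the
      kernel's out_of_memory(): oom_request, oom_outcome and GFP_FS_MODEL are
      hypothetical, and the model only captures the ordering described above
      (notifiers, then the dying-task shortcut, then the __GFP_FS check).

```c
#include <stdbool.h>

#define GFP_FS_MODEL 0x1u /* hypothetical bit standing in for __GFP_FS */

/* What the modelled out_of_memory() ended up doing. */
enum oom_outcome {
    OOM_NOTIFIER_FREED,    /* a notifier released memory */
    OOM_DYING_TASK_HELPED, /* a dying caller got access to reserves */
    OOM_SKIPPED_NOFS,      /* no victim selection without FS context */
    OOM_VICTIM_SELECTED
};

/* Hypothetical inputs describing one OOM invocation. */
struct oom_request {
    unsigned int gfp_mask;
    unsigned long notifier_freed; /* pages freed by OOM notifiers */
    bool current_is_dying;        /* caller is killed or exiting */
};

static enum oom_outcome out_of_memory_model(const struct oom_request *req)
{
    /* Reason 1: notifiers run even for allocations without FS context. */
    if (req->notifier_freed > 0)
        return OOM_NOTIFIER_FREED;

    /* Reason 2: a dying task is helped regardless of the gfp mask. */
    if (req->current_is_dying)
        return OOM_DYING_TASK_HELPED;

    /* Only now does the missing FS context suppress victim selection. */
    if (!(req->gfp_mask & GFP_FS_MODEL))
        return OOM_SKIPPED_NOFS;

    return OOM_VICTIM_SELECTED;
}
```

      Had the gfp check stayed in the caller, the first two branches would
      never be reached for GFP_NOFS requests in this model.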
      
      Keep a comment in __alloc_pages_may_oom to make sure we do not forget
      how GFP_NOFS is special and that we really want to do something about
      it.
      
      Note to the current oom_notifier users:
      
      The observable difference for you is that oom notifiers cannot depend on
      any fs locks because we could deadlock.  Not that this would be allowed
      today, because that would just lock up the machine in most cases and
      rule out the OOM killer along the way.  Another difference is that
      callbacks might be invoked sooner now because GFP_NOFS is a weaker
      reclaim context and so there could be reclaimable memory which is just
      not reachable now.  That would require GFP_NOFS-only loads, which are
      really rare, and more importantly the observable result would be the
      dropping of reconstructible objects and a potential performance drop,
      which is not such a big deal when we are struggling to fulfill other
      important allocation requests.
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Cc: Raushaniya Maksudova <rmaksudova@parallels.com>
      Cc: Michael S. Tsirkin <mst@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Daniel Vetter <daniel.vetter@intel.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3da88fb3
  10. 02 Apr, 2016 1 commit
  11. 26 Mar, 2016 4 commits