1. 02 3月, 2017 1 次提交
    • I
      sched/headers: Prepare for new header dependencies before moving code to <linux/sched/mm.h> · 6e84f315
      Ingo Molnar 提交于
      We are going to split <linux/sched/mm.h> out of <linux/sched.h>, which
      will have to be picked up from other headers and a couple of .c files.
      
      Create a trivial placeholder <linux/sched/mm.h> file that just
      maps to <linux/sched.h> to make this patch obviously correct and
      bisectable.
      
      The APIs that are going to be moved first are:
      
         mm_alloc()
         __mmdrop()
         mmdrop()
         mmdrop_async_fn()
         mmdrop_async()
         mmget_not_zero()
         mmput()
         mmput_async()
         get_task_mm()
         mm_access()
         mm_release()
      
      Include the new header in the files that are going to need it.
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: linux-kernel@vger.kernel.org
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      6e84f315
  2. 28 2月, 2017 1 次提交
  3. 25 2月, 2017 1 次提交
  4. 23 2月, 2017 5 次提交
  5. 08 10月, 2016 13 次提交
    • M
      oom: print nodemask in the oom report · 82e7d3ab
      Michal Hocko 提交于
      We have received a hard to explain oom report from a customer.  The oom
      triggered regardless there is a lot of free memory:
      
        PoolThread invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0
        PoolThread cpuset=/ mems_allowed=0-7
        Pid: 30055, comm: PoolThread Tainted: G           E X 3.0.101-80-default #1
        Call Trace:
          dump_trace+0x75/0x300
          dump_stack+0x69/0x6f
          dump_header+0x8e/0x110
          oom_kill_process+0xa6/0x350
          out_of_memory+0x2b7/0x310
          __alloc_pages_slowpath+0x7dd/0x820
          __alloc_pages_nodemask+0x1e9/0x200
          alloc_pages_vma+0xe1/0x290
          do_anonymous_page+0x13e/0x300
          do_page_fault+0x1fd/0x4c0
          page_fault+0x25/0x30
        [...]
        active_anon:1135959151 inactive_anon:1051962 isolated_anon:0
         active_file:13093 inactive_file:222506 isolated_file:0
         unevictable:262144 dirty:2 writeback:0 unstable:0
         free:432672819 slab_reclaimable:7917 slab_unreclaimable:95308
         mapped:261139 shmem:166297 pagetables:2228282 bounce:0
        [...]
        Node 0 DMA free:15896kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15672kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
        lowmem_reserve[]: 0 2892 775542 775542
        Node 0 DMA32 free:2783784kB min:28kB low:32kB high:40kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2961572kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
        lowmem_reserve[]: 0 0 772650 772650
        Node 0 Normal free:8120kB min:8160kB low:10200kB high:12240kB active_anon:779334960kB inactive_anon:2198744kB active_file:0kB inactive_file:180kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:791193600kB mlocked:131072kB dirty:0kB writeback:0kB mapped:372940kB shmem:361480kB slab_reclaimable:4536kB slab_unreclaimable:68472kB kernel_stack:10104kB pagetables:1414820kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:2280 all_unreclaimable? yes
        lowmem_reserve[]: 0 0 0 0
        Node 1 Normal free:476718144kB min:8192kB low:10240kB high:12288kB active_anon:307623696kB inactive_anon:283620kB active_file:10392kB inactive_file:69908kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:4kB writeback:0kB mapped:257208kB shmem:189896kB slab_reclaimable:3868kB slab_unreclaimable:44756kB kernel_stack:1848kB pagetables:1369432kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
        lowmem_reserve[]: 0 0 0 0
        Node 2 Normal free:386002452kB min:8192kB low:10240kB high:12288kB active_anon:398563752kB inactive_anon:68184kB active_file:10292kB inactive_file:29936kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:32084kB shmem:776kB slab_reclaimable:6888kB slab_unreclaimable:60056kB kernel_stack:8208kB pagetables:1282880kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
        lowmem_reserve[]: 0 0 0 0
        Node 3 Normal free:196406760kB min:8192kB low:10240kB high:12288kB active_anon:587445640kB inactive_anon:164396kB active_file:5716kB inactive_file:709844kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:291776kB shmem:111416kB slab_reclaimable:5152kB slab_unreclaimable:44516kB kernel_stack:2168kB pagetables:1455956kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
        lowmem_reserve[]: 0 0 0 0
        Node 4 Normal free:425338880kB min:8192kB low:10240kB high:12288kB active_anon:359695204kB inactive_anon:43216kB active_file:5748kB inactive_file:14772kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:24708kB shmem:1120kB slab_reclaimable:1884kB slab_unreclaimable:41060kB kernel_stack:1856kB pagetables:1100208kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
        lowmem_reserve[]: 0 0 0 0
        Node 5 Normal free:11140kB min:8192kB low:10240kB high:12288kB active_anon:784240872kB inactive_anon:1217164kB active_file:28kB inactive_file:48kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:11408kB shmem:0kB slab_reclaimable:2008kB slab_unreclaimable:49220kB kernel_stack:1360kB pagetables:531600kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1202 all_unreclaimable? yes
        lowmem_reserve[]: 0 0 0 0
        Node 6 Normal free:243395332kB min:8192kB low:10240kB high:12288kB active_anon:542015544kB inactive_anon:40208kB active_file:968kB inactive_file:8484kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:19992kB shmem:496kB slab_reclaimable:1672kB slab_unreclaimable:37052kB kernel_stack:2088kB pagetables:750264kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
        lowmem_reserve[]: 0 0 0 0
        Node 7 Normal free:10768kB min:8192kB low:10240kB high:12288kB active_anon:784916936kB inactive_anon:192316kB active_file:19228kB inactive_file:56852kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:4kB writeback:0kB mapped:34440kB shmem:4kB slab_reclaimable:5660kB slab_unreclaimable:36100kB kernel_stack:1328kB pagetables:1007968kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
        lowmem_reserve[]: 0 0 0 0
      
      So all nodes but Node 0 have a lot of free memory which should suggest
      that there is an available memory especially when mems_allowed=0-7.  One
      could speculate that a massive process has managed to terminate and free
      up a lot of memory while racing with the above allocation request.
      Although this is highly unlikely it cannot be ruled out.
      
      A further debugging, however shown that the faulting process had
      mempolicy (not cpuset) to bind to Node 0.  We cannot see that
      information from the report though.  mems_allowed turned out to be more
      confusing than really helpful.
      
      Fix this by always priting the nodemask.  It is either mempolicy mask
      (and non-null) or the one defined by the cpusets.  The new output for
      the above oom report would be
      
        PoolThread invoked oom-killer: gfp_mask=0x280da(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=0, order=0, oom_adj=0, oom_score_adj=0
      
      This patch doesn't touch show_mem and the node filtering based on the
      cpuset node mask because mempolicy is always a subset of cpusets and
      seeing the full cpuset oom context might be helpful for tunning more
      specific mempolicies inside cpusets (e.g.  when they turn out to be too
      restrictive).  To prevent from ugly ifdefs the mask is printed even for
      !NUMA configurations but this should be OK (a single node will be
      printed).
      
      Link: http://lkml.kernel.org/r/20160930214146.28600-1-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Reported-by: NSellami Abdelkader <abdelkader.sellami@sap.com>
      Acked-by: NVlastimil Babka <vbabka@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Sellami Abdelkader <abdelkader.sellami@sap.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      82e7d3ab
    • T
      mm: don't emit warning from pagefault_out_of_memory() · a104808e
      Tetsuo Handa 提交于
      Commit c32b3cbe ("oom, PM: make OOM detection in the freezer path
      raceless") inserted a WARN_ON() into pagefault_out_of_memory() in order
      to warn when we raced with disabling the OOM killer.
      
      Now, patch "oom, suspend: fix oom_killer_disable vs.  pm suspend
      properly" introduced a timeout for oom_killer_disable().  Even if we
      raced with disabling the OOM killer and the system is OOM livelocked,
      the OOM killer will be enabled eventually (in 20 seconds by default) and
      the OOM livelock will be solved.  Therefore, we no longer need to warn
      when we raced with disabling the OOM killer.
      
      Link: http://lkml.kernel.org/r/1473442120-7246-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jpSigned-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: NMichal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a104808e
    • M
      oom: warn if we go OOM for higher order and compaction is disabled · 9254990f
      Michal Hocko 提交于
      Since the lumpy reclaim is gone there is no source of higher order pages
      if CONFIG_COMPACTION=n except for the order-0 pages reclaim which is
      unreliable for that purpose to say the least.  Hitting an OOM for
      !costly higher order requests is therefore all not that hard to imagine.
      We are trying hard to not invoke OOM killer as much as possible but
      there is simply no reliable way to detect whether more reclaim retries
      make sense.
      
      Disabling COMPACTION is not widespread but it seems that some users
      might have disable the feature without realizing full consequences
      (mostly along with disabling THP because compaction used to be THP
      mainly thing).  This patch just adds a note if the OOM killer was
      triggered by higher order request with compaction disabled.  This will
      help us identifying possible misconfiguration right from the oom report
      which is easier than to always keep in mind that somebody might have
      disabled COMPACTION without a good reason.
      
      Link: http://lkml.kernel.org/r/20160830111632.GD23963@dhcp22.suse.czSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9254990f
    • M
      oom, oom_reaper: allow to reap mm shared by the kthreads · 1b51e65e
      Michal Hocko 提交于
      oom reaper was skipped for an mm which is shared with the kernel thread
      (aka use_mm()).  The primary concern was that such a kthread might want
      to read from the userspace memory and see zero page as a result of the
      oom reaper action.  This is no longer a problem after "mm: make sure
      that kthreads will not refault oom reaped memory" because any attempt to
      fault in when the MMF_UNSTABLE is set will result in SIGBUS and so the
      target user should see an error.  This means that we can finally allow
      oom reaper also to tasks which share their mm with kthreads.
      
      Link: http://lkml.kernel.org/r/1472119394-11342-10-git-send-email-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1b51e65e
    • M
      mm: make sure that kthreads will not refault oom reaped memory · 3f70dc38
      Michal Hocko 提交于
      There are only few use_mm() users in the kernel right now.  Most of them
      write to the target memory but vhost driver relies on
      copy_from_user/get_user from a kernel thread context.  This makes it
      impossible to reap the memory of an oom victim which shares the mm with
      the vhost kernel thread because it could see a zero page unexpectedly
      and theoretically make an incorrect decision visible outside of the
      killed task context.
      
      To quote Michael S. Tsirkin:
      : Getting an error from __get_user and friends is handled gracefully.
      : Getting zero instead of a real value will cause userspace
      : memory corruption.
      
      The vhost kernel thread is bound to an open fd of the vhost device which
      is not tight to the mm owner life cycle in general.  The device fd can
      be inherited or passed over to another process which means that we
      really have to be careful about unexpected memory corruption because
      unlike for normal oom victims the result will be visible outside of the
      oom victim context.
      
      Make sure that no kthread context (users of use_mm) can ever see
      corrupted data because of the oom reaper and hook into the page fault
      path by checking MMF_UNSTABLE mm flag.  __oom_reap_task_mm will set the
      flag before it starts unmapping the address space while the flag is
      checked after the page fault has been handled.  If the flag is set then
      SIGBUS is triggered so any g-u-p user will get a error code.
      
      Regular tasks do not need this protection because all which share the mm
      are killed when the mm is reaped and so the corruption will not outlive
      them.
      
      This patch shouldn't have any visible effect at this moment because the
      OOM killer doesn't invoke oom reaper for tasks with mm shared with
      kthreads yet.
      
      Link: http://lkml.kernel.org/r/1472119394-11342-9-git-send-email-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: N"Michael S. Tsirkin" <mst@redhat.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      3f70dc38
    • T
      mm, oom: enforce exit_oom_victim on current task · 38531201
      Tetsuo Handa 提交于
      There are no users of exit_oom_victim on !current task anymore so enforce
      the API to always work on the current.
      
      Link: http://lkml.kernel.org/r/1472119394-11342-8-git-send-email-mhocko@kernel.orgSigned-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      38531201
    • M
      oom, suspend: fix oom_killer_disable vs. pm suspend properly · 7d2e7a22
      Michal Hocko 提交于
      Commit 74070542 ("oom, suspend: fix oom_reaper vs.
      oom_killer_disable race") has workaround an existing race between
      oom_killer_disable and oom_reaper by adding another round of
      try_to_freeze_tasks after the oom killer was disabled.  This was the
      easiest thing to do for a late 4.7 fix.  Let's fix it properly now.
      
      After "oom: keep mm of the killed task available" we no longer have to
      call exit_oom_victim from the oom reaper because we have stable mm
      available and hide the oom_reaped mm by MMF_OOM_SKIP flag.  So let's
      remove exit_oom_victim and the race described in the above commit
      doesn't exist anymore if.
      
      Unfortunately this alone is not sufficient for the oom_killer_disable
      usecase because now we do not have any reliable way to reach
      exit_oom_victim (the victim might get stuck on a way to exit for an
      unbounded amount of time).  OOM killer can cope with that by checking mm
      flags and move on to another victim but we cannot do the same for
      oom_killer_disable as we would lose the guarantee of no further
      interference of the victim with the rest of the system.  What we can do
      instead is to cap the maximum time the oom_killer_disable waits for
      victims.  The only current user of this function (pm suspend) already
      has a concept of timeout for back off so we can reuse the same value
      there.
      
      Let's drop set_freezable for the oom_reaper kthread because it is no
      longer needed as the reaper doesn't wake or thaw any processes.
      
      Link: http://lkml.kernel.org/r/1472119394-11342-7-git-send-email-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7d2e7a22
    • M
      mm, oom: get rid of signal_struct::oom_victims · 862e3073
      Michal Hocko 提交于
      After "oom: keep mm of the killed task available" we can safely detect
      an oom victim by checking task->signal->oom_mm so we do not need the
      signal_struct counter anymore so let's get rid of it.
      
      This alone wouldn't be sufficient for nommu archs because
      exit_oom_victim doesn't hide the process from the oom killer anymore.
      We can, however, mark the mm with a MMF flag in __mmput.  We can reuse
      MMF_OOM_REAPED and rename it to a more generic MMF_OOM_SKIP.
      
      Link: http://lkml.kernel.org/r/1472119394-11342-6-git-send-email-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      862e3073
    • M
      oom: keep mm of the killed task available · 26db62f1
      Michal Hocko 提交于
      oom_reap_task has to call exit_oom_victim in order to make sure that the
      oom vicim will not block the oom killer for ever.  This is, however,
      opening new problems (e.g oom_killer_disable exclusion - see commit
      74070542 ("oom, suspend: fix oom_reaper vs.  oom_killer_disable
      race")).  exit_oom_victim should be only called from the victim's
      context ideally.
      
      One way to achieve this would be to rely on per mm_struct flags.  We
      already have MMF_OOM_REAPED to hide a task from the oom killer since
      "mm, oom: hide mm which is shared with kthread or global init". The
      problem is that the exit path:
      
        do_exit
          exit_mm
            tsk->mm = NULL;
            mmput
              __mmput
            exit_oom_victim
      
      doesn't guarantee that exit_oom_victim will get called in a bounded
      amount of time.  At least exit_aio depends on IO which might get blocked
      due to lack of memory and who knows what else is lurking there.
      
      This patch takes a different approach.  We remember tsk->mm into the
      signal_struct and bind it to the signal struct life time for all oom
      victims.  __oom_reap_task_mm as well as oom_scan_process_thread do not
      have to rely on find_lock_task_mm anymore and they will have a reliable
      reference to the mm struct.  As a result all the oom specific
      communication inside the OOM killer can be done via tsk->signal->oom_mm.
      
      Increasing the signal_struct for something as unlikely as the oom killer
      is far from ideal but this approach will make the code much more
      reasonable and long term we even might want to move task->mm into the
      signal_struct anyway.  In the next step we might want to make the oom
      killer exclusion and access to memory reserves completely independent
      which would be also nice.
      
      Link: http://lkml.kernel.org/r/1472119394-11342-4-git-send-email-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      26db62f1
    • T
      mm,oom_reaper: do not attempt to reap a task twice · 8496afab
      Tetsuo Handa 提交于
      "mm, oom_reaper: do not attempt to reap a task twice" tried to give the
      OOM reaper one more chance to retry using MMF_OOM_NOT_REAPABLE flag.
      But the usefulness of the flag is rather limited and actually never
      shown in practice.  If the flag is set, it means that the holder of
      mm->mmap_sem cannot call up_write() due to presumably being blocked at
      unkillable wait waiting for other thread's memory allocation.  But since
      one of threads sharing that mm will queue that mm immediately via
      task_will_free_mem() shortcut (otherwise, oom_badness() will select the
      same mm again due to oom_score_adj value unchanged), retrying
      MMF_OOM_NOT_REAPABLE mm is unlikely helpful.
      
      Let's always set MMF_OOM_REAPED.
      
      Link: http://lkml.kernel.org/r/1472119394-11342-3-git-send-email-mhocko@kernel.orgSigned-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8496afab
    • T
      mm,oom_reaper: reduce find_lock_task_mm() usage · 7ebffa45
      Tetsuo Handa 提交于
      Patch series "fortify oom killer even more", v2.
      
      This patch (of 9):
      
      __oom_reap_task() can be simplified a bit if it receives a valid mm from
      oom_reap_task() which also uses that mm when __oom_reap_task() failed.
      We can drop one find_lock_task_mm() call and also make the
      __oom_reap_task() code flow easier to follow.  Moreover, this will make
      later patch in the series easier to review.  Pinning mm's mm_count for
      longer time is not really harmful because this will not pin much memory.
      
      This patch doesn't introduce any functional change.
      
      Link: http://lkml.kernel.org/r/1472119394-11342-2-git-send-email-mhocko@kernel.orgSigned-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: NMichal Hocko <mhocko@suse.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7ebffa45
    • M
      mm/oom_kill.c: fix task_will_free_mem() comment · 5870c2e1
      Michal Hocko 提交于
      Attempt to demystify the task_will_free_mem() loop.
      
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      5870c2e1
    • V
      mm: oom: deduplicate victim selection code for memcg and global oom · 7c5f64f8
      Vladimir Davydov 提交于
      When selecting an oom victim, we use the same heuristic for both memory
      cgroup and global oom.  The only difference is the scope of tasks to
      select the victim from.  So we could just export an iterator over all
      memcg tasks and keep all oom related logic in oom_kill.c, but instead we
      duplicate pieces of it in memcontrol.c reusing some initially private
      functions of oom_kill.c in order to not duplicate all of it.  That looks
      ugly and error prone, because any modification of select_bad_process
      should also be propagated to mem_cgroup_out_of_memory.
      
      Let's rework this as follows: keep all oom heuristic related code private
      to oom_kill.c and make oom_kill.c use exported memcg functions when it's
      really necessary (like in case of iterating over memcg tasks).
      
      Link: http://lkml.kernel.org/r/1470056933-7505-1-git-send-email-vdavydov@virtuozzo.comSigned-off-by: NVladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: NJohannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7c5f64f8
  6. 12 8月, 2016 1 次提交
  7. 29 7月, 2016 8 次提交
    • M
      mm, oom: tighten task_will_free_mem() locking · 091f362c
      Michal Hocko 提交于
      "mm, oom: fortify task_will_free_mem" has dropped task_lock around
      task_will_free_mem in oom_kill_process bacause it assumed that a
      potential race when the selected task exits will not be a problem as the
      oom_reaper will call exit_oom_victim.
      
      Tetsuo was objecting that nommu doesn't have oom_reaper so the race
      would be still possible.  The code would be racy and lockup prone
      theoretically in other aspects without the oom reaper anyway so I didn't
      considered this a big deal.  But it seems that further changes I am
      planning in this area will benefit from stable task->mm in this path as
      well.  So let's drop find_lock_task_mm from task_will_free_mem and call
      it from under task_lock as we did previously.  Just pull the task->mm !=
      NULL check inside the function.
      
      Link: http://lkml.kernel.org/r/1467201562-6709-1-git-send-email-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Vladimir Davydov <vdavydov@parallels.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      091f362c
    • M
      mm, oom: hide mm which is shared with kthread or global init · a373966d
      Michal Hocko 提交于
      The only case where the oom_reaper is not triggered for the oom victim
      is when it shares the memory with a kernel thread (aka use_mm) or with
      the global init.  After "mm, oom: skip vforked tasks from being
      selected" the victim cannot be a vforked task of the global init so we
      are left with clone(CLONE_VM) (without CLONE_SIGHAND).  use_mm() users
      are quite rare as well.
      
      In order to help forward progress for the OOM killer, make sure that
      this really rare case will not get in the way - we do this by hiding the
      mm from the oom killer by setting MMF_OOM_REAPED flag for it.
      oom_scan_process_thread will ignore any TIF_MEMDIE task if it has
      MMF_OOM_REAPED flag set to catch these oom victims.
      
      After this patch we should guarantee forward progress for the OOM killer
      even when the selected victim is sharing memory with a kernel thread or
      global init as long as the victims mm is still alive.
      
      Link: http://lkml.kernel.org/r/1466426628-15074-11-git-send-email-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a373966d
    • M
      mm, oom_reaper: do not attempt to reap a task more than twice · 11a410d5
      Michal Hocko 提交于
      oom_reaper relies on the mmap_sem for read to do its job.  Many places
      which might block readers have been converted to use down_write_killable
      and that has reduced chances of the contention a lot.  Some paths where
      the mmap_sem is held for write can take other locks and they might
      either be not prepared to fail due to fatal signal pending or too
      impractical to be changed.
      
      This patch introduces MMF_OOM_NOT_REAPABLE flag which gets set after the
      first attempt to reap a task's mm fails.  If the flag is present after
      the failure then we set MMF_OOM_REAPED to hide this mm from the oom
      killer completely so it can go and chose another victim.
      
      As a result a risk of OOM deadlock when the oom victim would be blocked
      indefinetly and so the oom killer cannot make any progress should be
      mitigated considerably while we still try really hard to perform all
      reclaim attempts and stay predictable in the behavior.
      
      Link: http://lkml.kernel.org/r/1466426628-15074-10-git-send-email-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      11a410d5
    • M
      mm, oom: task_will_free_mem should skip oom_reaped tasks · 696453e6
      Michal Hocko 提交于
      The 0-day robot has encountered the following:
      
         Out of memory: Kill process 3914 (trinity-c0) score 167 or sacrifice child
         Killed process 3914 (trinity-c0) total-vm:55864kB, anon-rss:1512kB, file-rss:1088kB, shmem-rss:25616kB
         oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26488kB
         oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26900kB
         oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26900kB
         oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:27296kB
         oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:28148kB
      
      oom_reaper is trying to reap the same task again and again.
      
      This is possible only when the oom killer is bypassed because of
      task_will_free_mem because we skip over tasks with MMF_OOM_REAPED
      already set during select_bad_process.  Teach task_will_free_mem to skip
      over MMF_OOM_REAPED tasks as well because they will be unlikely to free
      anything more.
      
      Analyzed by Tetsuo Handa.
      
      Link: http://lkml.kernel.org/r/1466426628-15074-9-git-send-email-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      696453e6
    • M
      mm, oom: fortify task_will_free_mem() · 1af8bb43
      Michal Hocko 提交于
      task_will_free_mem is rather weak.  It doesn't really tell whether the
      task has chance to drop its mm.  98748bd7 ("oom: consider
      multi-threaded tasks in task_will_free_mem") made a first step into making
      it more robust for multi-threaded applications so now we know that the
      whole process is going down and probably drop the mm.
      
      This patch builds on top for more complex scenarios where mm is shared
      between different processes - CLONE_VM without CLONE_SIGHAND, or in kernel
      use_mm().
      
      Make sure that all processes sharing the mm are killed or exiting.  This
      will allow us to replace try_oom_reaper by wake_oom_reaper because
      task_will_free_mem implies the task is reapable now.  Therefore all paths
      which bypass the oom killer are now reapable and so they shouldn't lock up
      the oom killer.
      
      Link: http://lkml.kernel.org/r/1466426628-15074-8-git-send-email-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      1af8bb43
    • M
      mm, oom: kill all tasks sharing the mm · 97fd49c2
      Michal Hocko 提交于
      Currently oom_kill_process skips both the oom reaper and SIG_KILL if a
      process sharing the same mm is unkillable via OOM_ADJUST_MIN.  After "mm,
      oom_adj: make sure processes sharing mm have same view of oom_score_adj"
      all such processes are sharing the same value so we shouldn't see such a
      task at all (oom_badness would rule them out).
      
      We can still encounter oom disabled vforked task which has to be killed as
      well if we want to have other tasks sharing the mm reapable because it can
      access the memory before doing exec.  Killing such a task should be
      acceptable because it is highly unlikely it has done anything useful
      because it cannot modify any memory before it calls exec.  An alternative
      would be to keep the task alive and skip the oom reaper and risk all the
      weird corner cases where the OOM killer cannot make forward progress
      because the oom victim hung somewhere on the way to exit.
      
      [rientjes@google.com - drop printk when OOM_SCORE_ADJ_MIN killed task
       the setting is inherently racy and we cannot do much about it without
       introducing locks in hot paths]
      Link: http://lkml.kernel.org/r/1466426628-15074-7-git-send-email-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      97fd49c2
    • M
      mm, oom: skip vforked tasks from being selected · b18dc5f2
      Michal Hocko 提交于
      vforked tasks are not really sitting on any memory.  They are sharing the
      mm with parent until they exec into a new code.  Until then it is just
      pinning the address space.  OOM killer will kill the vforked task along
      with its parent but we still can end up selecting vforked task when the
      parent wouldn't be selected.  E.g.  init doing vfork to launch a task or
      vforked being a child of oom unkillable task with an updated oom_score_adj
      to be killable.
      
      Add a new helper to check whether a task is in the vfork sharing memory
      with its parent and use it in oom_badness to skip over these tasks.
      
      Link: http://lkml.kernel.org/r/1466426628-15074-6-git-send-email-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b18dc5f2
    • M
      mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj · 44a70ade
      Michal Hocko 提交于
      oom_score_adj is shared for the thread groups (via struct signal) but this
      is not sufficient to cover processes sharing mm (CLONE_VM without
      CLONE_SIGHAND) and so we can easily end up in a situation when some
      processes update their oom_score_adj and confuse the oom killer.  In the
      worst case some of those processes might hide from the oom killer
      altogether via OOM_SCORE_ADJ_MIN while others are eligible.  OOM killer
      would then pick up those eligible but won't be allowed to kill others
      sharing the same mm so the mm wouldn't release the mm and so the memory.
      
      It would be ideal to have the oom_score_adj per mm_struct because that is
      the natural entity OOM killer considers.  But this will not work because
      some programs are doing
      
      	vfork()
      	set_oom_adj()
      	exec()
      
      We can achieve the same though.  oom_score_adj write handler can set the
      oom_score_adj for all processes sharing the same mm if the task is not in
      the middle of vfork.  As a result all the processes will share the same
      oom_score_adj.  The current implementation is rather pessimistic and
      checks all the existing processes by default if there is more than 1
      holder of the mm but we do not have any reliable way to check for external
      users yet.
      
      Link: http://lkml.kernel.org/r/1466426628-15074-5-git-send-email-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Acked-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Vladimir Davydov <vdavydov@virtuozzo.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      44a70ade
  8. 27 7月, 2016 4 次提交
  9. 25 6月, 2016 2 次提交
  10. 04 6月, 2016 1 次提交
  11. 28 5月, 2016 2 次提交
    • M
      oom_reaper: close race with exiting task · e2fe1456
      Michal Hocko 提交于
      Tetsuo has reported:
        Out of memory: Kill process 443 (oleg's-test) score 855 or sacrifice child
        Killed process 443 (oleg's-test) total-vm:493248kB, anon-rss:423880kB, file-rss:4kB, shmem-rss:0kB
        sh invoked oom-killer: gfp_mask=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), order=0, oom_score_adj=0
        sh cpuset=/ mems_allowed=0
        CPU: 2 PID: 1 Comm: sh Not tainted 4.6.0-rc7+ #51
        Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
        Call Trace:
          dump_stack+0x85/0xc8
          dump_header+0x5b/0x394
        oom_reaper: reaped process 443 (oleg's-test), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
      
      In other words:
      
        __oom_reap_task		exit_mm
          atomic_inc_not_zero
      				  tsk->mm = NULL
      				  mmput
      				    atomic_dec_and_test # > 0
      				  exit_oom_victim # New victim will be
      						  # selected
      				<OOM killer invoked>
      				  # no TIF_MEMDIE task so we can select a new one
          unmap_page_range # to release the memory
      
      The race exists even without the oom_reaper because anybody who pins the
      address space and gets preempted might race with exit_mm but oom_reaper
      made this race more probable.
      
      We can address the oom_reaper part by using oom_lock for __oom_reap_task
      because this would guarantee that a new oom victim will not be selected
      if the oom reaper might race with the exit path.  This doesn't solve the
      original issue, though, because somebody else still might be pinning
      mm_users and so __mmput won't be called to release the memory but that
      is not really realiably solvable because the task will get away from the
      oom sight as soon as it is unhashed from the task_list and so we cannot
      guarantee a new victim won't be selected.
      
      [akpm@linux-foundation.org: fix use of unused `mm', Per Stephen]
      [akpm@linux-foundation.org: coding-style fixes]
      Fixes: aac45363 ("mm, oom: introduce oom reaper")
      Link: http://lkml.kernel.org/r/1464271493-20008-1-git-send-email-mhocko@kernel.orgSigned-off-by: NMichal Hocko <mhocko@suse.com>
      Reported-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e2fe1456
    • V
      mm: oom: do not reap task if there are live threads in threadgroup · edd9f723
      Vladimir Davydov 提交于
      If the current process is exiting, we don't invoke oom killer, instead
      we give it access to memory reserves and try to reap its mm in case
      nobody is going to use it.  There's a mistake in the code performing
      this check - we just ignore any process of the same thread group no
      matter if it is exiting or not - see try_oom_reaper.  Fix it.
      
      Link: http://lkml.kernel.org/r/1464087628-7318-1-git-send-email-vdavydov@virtuozzo.com
      Fixes: 3ef22dff ("oom, oom_reaper: try to reap tasks which skip regular OOM killer path")Signed-off-by: Vladimir Davydov <vdavydov@virtuozzo.com>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      edd9f723
  12. 21 5月, 2016 1 次提交
    • T
      mm,oom: speed up select_bad_process() loop · f44666b0
      Tetsuo Handa 提交于
      Since commit 3a5dda7a ("oom: prevent unnecessary oom kills or kernel
      panics"), select_bad_process() is using for_each_process_thread().
      
      Since oom_unkillable_task() scans all threads in the caller's thread
      group and oom_task_origin() scans signal_struct of the caller's thread
      group, we don't need to call oom_unkillable_task() and oom_task_origin()
      on each thread.  Also, since !mm test will be done later at
      oom_badness(), we don't need to do !mm test on each thread.  Therefore,
      we only need to do TIF_MEMDIE test on each thread.
      
      Although the original code was correct it was quite inefficient because
      each thread group was scanned num_threads times which can be a lot
      especially with processes with many threads.  Even though the OOM is
      extremely cold path it is always good to be as effective as possible
      when we are inside rcu_read_lock() - aka unpreemptible context.
      
      If we track number of TIF_MEMDIE threads inside signal_struct, we don't
      need to do TIF_MEMDIE test on each thread.  This will allow
      select_bad_process() to use for_each_process().
      
      This patch adds a counter to signal_struct for tracking how many
      TIF_MEMDIE threads are in a given thread group, and check it at
      oom_scan_process_thread() so that select_bad_process() can use
      for_each_process() rather than for_each_process_thread().
      
      [mhocko@suse.com: do not blow the signal_struct size]
        Link: http://lkml.kernel.org/r/20160520075035.GF19172@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/201605182230.IDC73435.MVSOHLFOQFOJtF@I-love.SAKURA.ne.jpSigned-off-by: NTetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: NMichal Hocko <mhocko@suse.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f44666b0