1. 05 Sep, 2018: 2 commits
  2. 23 Aug, 2018: 6 commits
    • mm, oom: introduce memory.oom.group · 3d8b38eb
      Roman Gushchin authored
      For some workloads an intervention from the OOM killer can be painful.
      Killing a random task can bring the workload into an inconsistent state.
      
      Historically, there are two common solutions for this
      problem:
      1) enabling panic_on_oom,
      2) using a userspace daemon to monitor OOMs and kill
         all outstanding processes.
      
      Both approaches have their downsides: rebooting on each OOM is an obvious
      waste of capacity, and handling everything in userspace is tricky and
      requires a userspace agent that monitors all cgroups for OOMs.
      
      In most cases an in-kernel after-OOM cleaning-up mechanism can eliminate
      the necessity of enabling panic_on_oom.  Also, it can simplify the cgroup
      management for userspace applications.
      
      This commit introduces a new knob for cgroup v2 memory controller:
      memory.oom.group.  The knob determines whether the cgroup should be
      treated as an indivisible workload by the OOM killer.  If set, all tasks
      belonging to the cgroup or to its descendants (if the memory cgroup is not
      a leaf cgroup) are killed together or not at all.
      
      To determine which cgroup has to be killed, we traverse the cgroup
      hierarchy from the victim task's cgroup up to the OOMing cgroup (or root),
      looking for the highest-level cgroup with memory.oom.group set.
      
      Tasks with the OOM protection (oom_score_adj set to -1000) are treated as
      an exception and are never killed.
      
      This patch doesn't change the OOM victim selection algorithm.
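
      A minimal userspace sketch of how this knob might be enabled (the cgroup
      v2 mount point and the "workload" group name below are assumptions, not
      part of this commit):

        /* Enable memory.oom.group for an assumed cgroup and join it. */
        #include <stdio.h>
        #include <unistd.h>

        static int write_str(const char *path, const char *val)
        {
                FILE *f = fopen(path, "w");

                if (!f || fputs(val, f) == EOF) {
                        if (f)
                                fclose(f);
                        return -1;
                }
                return fclose(f);
        }

        int main(void)
        {
                char pid[32];

                /* Treat the whole group (and descendants) as one OOM unit. */
                if (write_str("/sys/fs/cgroup/workload/memory.oom.group", "1"))
                        perror("memory.oom.group");

                /* Attach this process so it is covered by the group policy. */
                snprintf(pid, sizeof(pid), "%d", getpid());
                if (write_str("/sys/fs/cgroup/workload/cgroup.procs", pid))
                        perror("cgroup.procs");

                return 0;
        }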
      
      Link: http://lkml.kernel.org/r/20180802003201.817-4-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3d8b38eb
    • mm, oom: refactor oom_kill_process() · 5989ad7b
      Roman Gushchin authored
      Patch series "introduce memory.oom.group", v2.
      
      This is a tiny implementation of cgroup-aware OOM killer, which adds an
      ability to kill a cgroup as a single unit and so guarantee the integrity
      of the workload.
      
      Although it has only limited functionality in comparison to what now
      resides in the mm tree (it doesn't change the victim task selection
      algorithm, doesn't look at memory stats at the cgroup level, etc.), it's
      also much simpler and more straightforward.  So, hopefully, we can avoid
      having long debates here, as we had with the full implementation.
      
      As it doesn't prevent any further development and implements a useful and
      complete feature, it looks like a sane way forward.
      
      This patch (of 2):
      
      oom_kill_process() consists of two logical parts: the first one is
      responsible for considering the task's children as potential victims and
      printing the debug information.  The second half is responsible for
      sending SIGKILL to all tasks sharing the mm struct with the given victim.
      
      This commit splits oom_kill_process() with the intention to re-use the
      second half: __oom_kill_process().
      
      The cgroup-aware OOM killer will kill multiple tasks belonging to the
      victim cgroup.  We don't need to print the debug information for each
      task, or play with task selection (considering the task's children), so
      we can't use the existing oom_kill_process().
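
      A simplified sketch of the resulting shape (not the kernel's exact code;
      selection, locking and reporting details are elided):

        /* Kill half: reusable by a cgroup-aware caller for many tasks. */
        static void __oom_kill_process(struct task_struct *victim)
        {
                /* send SIGKILL to the victim and to every other task
                 * sharing the victim's mm, wake up the oom reaper, ... */
        }

        /* Selection/reporting half: pick a child if appropriate and dump
         * the usual debugging information, then delegate the killing. */
        static void oom_kill_process(struct oom_control *oc, const char *message)
        {
                struct task_struct *victim = oc->chosen;

                /* ... possibly replace victim with a suitable child ... */
                pr_err("%s: Kill process %d (%s)\n", message,
                       task_pid_nr(victim), victim->comm);

                __oom_kill_process(victim);
        }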
      
      Link: http://lkml.kernel.org/r/20171130152824.1591-2-guro@fb.com
      Link: http://lkml.kernel.org/r/20180802003201.817-3-guro@fb.com
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      5989ad7b
    • mm/oom_kill.c: clean up oom_reap_task_mm() · 431f42fd
      Michal Hocko authored
      Andrew has noticed some inconsistencies in oom_reap_task_mm.  Notably
      
       - Undocumented return value.
      
       - comment "failed to reap part..." is misleading - it sounds like it
         is referring to something which happened in the past, while it is
         in fact referring to something which might happen in the future.
      
       - fails to call trace_finish_task_reaping() in one case
      
       - code duplication.
      
       - Increases mmap_sem hold time a little by moving
         trace_finish_task_reaping() inside the locked region.  So sue me ;)
      
       - Sharing the finish: path means that the trace event won't
         distinguish between the two sources of finishing.
      
      Add a short explanation for the return value and fix the rest by
      reorganizing the function a bit to have unified function exit paths.
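
      Roughly, the reorganization funnels every path through shared exit
      labels (a sketch of the shape, not the exact resulting code):

        static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
        {
                bool ret = true;        /* "reaping done or not needed" */

                if (!down_read_trylock(&mm->mmap_sem)) {
                        trace_skip_task_reaping(tsk->pid);
                        return false;   /* caller should retry */
                }

                /* Somebody else (e.g. exit_mmap) already dealt with this mm. */
                if (test_bit(MMF_OOM_SKIP, &mm->flags))
                        goto out_unlock;

                trace_start_task_reaping(tsk->pid);

                /* Might fail to reap part of the address space: retry later. */
                ret = __oom_reap_task_mm(mm);
                if (!ret)
                        goto out_finish;

                /* success reporting (pr_info ...) would go here */
        out_finish:
                trace_finish_task_reaping(tsk->pid);
        out_unlock:
                up_read(&mm->mmap_sem);
                return ret;
        }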
      
      Link: http://lkml.kernel.org/r/20180724141747.GP28386@dhcp22.suse.cz
      Suggested-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Cc: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      431f42fd
    • mm, oom: describe task memory unit, larger PID pad · c3b78b11
      Rodrigo Freire authored
      The default memory unit of OOM task dump events (pages) might not be
      intuitive and can be misleading for the non-initiated when debugging
      OOM events: these are pages and not kBs.  Add a small printk prior to the
      task dump informing that the memory units are actually memory _pages_.
      
      Also extend the PID field to align on up to 7 characters.
      Reference: https://lkml.org/lkml/2018/7/3/1201
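
      The change boils down to something along these lines (a sketch; the
      exact header text and format string are not quoted from the patch):

        /* State the unit before dumping, and widen the pid column to 7. */
        pr_info("Tasks state (memory values in pages):\n");
        pr_info("[%7d] total_vm:%8lu rss:%8lu %s\n",
                task->pid, task->mm->total_vm, get_mm_rss(task->mm), task->comm);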
      
      Link: http://lkml.kernel.org/r/c795eb5129149ed8a6345c273aba167ff1bbd388.1530715938.git.rfreire@redhat.com
      Signed-off-by: Rodrigo Freire <rfreire@redhat.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Acked-by: Rafael Aquini <aquini@redhat.com>
      Cc: Michal Hocko <mhocko@kernel.org>
      Cc: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c3b78b11
    • mm, oom: remove oom_lock from oom_reaper · af5679fb
      Michal Hocko authored
      oom_reaper used to rely on the oom_lock since e2fe1456 ("oom_reaper:
      close race with exiting task").  We do not really need the lock anymore
      though.  21292580 ("mm: oom: let oom_reap_task and exit_mmap run
      concurrently") has removed serialization with the exit path based on the
      mm reference count and so we do not really rely on the oom_lock anymore.
      
      Tetsuo was arguing that at least MMF_OOM_SKIP should be set under the lock
      to prevent races where the page allocator didn't manage to get the
      freed (reaped) memory in __alloc_pages_may_oom but sees the flag later
      on and moves on to another victim.  Although this is possible in principle,
      let's wait for it to actually happen in real life before we make the
      locking more complex again.
      
      Therefore remove the oom_lock for oom_reaper paths (both exit_mmap and
      oom_reap_task_mm).  The reaper serializes with exit_mmap by mmap_sem +
      MMF_OOM_SKIP flag.  There is no synchronization with out_of_memory path
      now.
      
      [mhocko@kernel.org: oom_reap_task_mm should return false when __oom_reap_task_mm did]
        Link: http://lkml.kernel.org/r/20180724141747.GP28386@dhcp22.suse.cz
      Link: http://lkml.kernel.org/r/20180719075922.13784-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Suggested-by: David Rientjes <rientjes@google.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      af5679fb
    • mm, oom: distinguish blockable mode for mmu notifiers · 93065ac7
      Michal Hocko authored
      There are several blockable mmu notifiers which might sleep in
      mmu_notifier_invalidate_range_start, and that is a problem for the
      oom_reaper because it needs to guarantee forward progress and so cannot
      depend on any sleepable locks.
      
      Currently we simply back off and mark an oom victim with blockable mmu
      notifiers as done after a short sleep.  That can result in selecting a new
      oom victim prematurely because the previous one still hasn't torn its
      memory down yet.
      
      We can do much better though.  Even if mmu notifiers use sleepable locks,
      there is no reason to automatically assume those locks are held.  Moreover,
      the majority of notifiers only care about a portion of the address space,
      and there is absolutely zero reason to fail when we are unmapping an
      unrelated range.  Many notifiers do really block and wait for HW, which is
      harder to handle, and there we have to bail out.
      
      This patch handles the low-hanging fruit.
      __mmu_notifier_invalidate_range_start gets a blockable flag and callbacks
      are not allowed to sleep if the flag is set to false.  This is achieved by
      using trylock instead of the sleepable lock for most callbacks and
      continuing as long as we do not block down the call chain.
      
      I think we can improve that even further because there is a common pattern
      to do a range lookup first and then do something about that.  The first
      part can be done without a sleeping lock in most cases AFAICS.
      
      The oom_reaper end then simply retries if there is at least one notifier
      which couldn't make any progress in !blockable mode.  A retry loop is
      already implemented to wait for the mmap_sem and this is basically the
      same thing.
      
      The simplest way for driver developers to test this code path is to wrap
      userspace code which uses these notifiers into a memcg and set the hard
      limit to hit the oom.  This can be done e.g. by setting the hard limit to
      something really small after the test has faulted in all the
      mmu-notifier-managed memory.  Then we check for a proper process teardown.
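
      A hedged sketch of the callback pattern this enables (the structure and
      lock below are illustrative, not any particular driver; the bool
      parameter and -EAGAIN return follow the description above):

        struct demo_notifier {
                struct mmu_notifier mn;
                struct mutex lock;
        };

        static int demo_invalidate_range_start(struct mmu_notifier *mn,
                                               struct mm_struct *mm,
                                               unsigned long start,
                                               unsigned long end,
                                               bool blockable)
        {
                struct demo_notifier *dn = container_of(mn, struct demo_notifier, mn);

                if (blockable)
                        mutex_lock(&dn->lock);
                else if (!mutex_trylock(&dn->lock))
                        return -EAGAIN; /* non-blocking caller will retry */

                /* invalidate secondary mappings covering [start, end) here */

                mutex_unlock(&dn->lock);
                return 0;
        }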
      
      [akpm@linux-foundation.org: coding style fixes]
      [akpm@linux-foundation.org: minor code simplification]
      Link: http://lkml.kernel.org/r/20180716115058.5559-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Christian König <christian.koenig@amd.com> # AMD notifiers
      Acked-by: Leon Romanovsky <leonro@mellanox.com> # mlx and umem_odp
      Reported-by: David Rientjes <rientjes@google.com>
      Cc: "David (ChunMing) Zhou" <David1.Zhou@amd.com>
      Cc: Paolo Bonzini <pbonzini@redhat.com>
      Cc: Alex Deucher <alexander.deucher@amd.com>
      Cc: David Airlie <airlied@linux.ie>
      Cc: Jani Nikula <jani.nikula@linux.intel.com>
      Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
      Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
      Cc: Doug Ledford <dledford@redhat.com>
      Cc: Jason Gunthorpe <jgg@ziepe.ca>
      Cc: Mike Marciniszyn <mike.marciniszyn@intel.com>
      Cc: Dennis Dalessandro <dennis.dalessandro@intel.com>
      Cc: Sudeep Dutt <sudeep.dutt@intel.com>
      Cc: Ashutosh Dixit <ashutosh.dixit@intel.com>
      Cc: Dimitri Sivanich <sivanich@sgi.com>
      Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
      Cc: Juergen Gross <jgross@suse.com>
      Cc: "Jérôme Glisse" <jglisse@redhat.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Felix Kuehling <felix.kuehling@amd.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      93065ac7
  3. 18 Aug, 2018: 2 commits
  4. 22 Jul, 2018: 1 commit
  5. 15 Jun, 2018: 1 commit
  6. 08 Jun, 2018: 1 commit
  7. 12 May, 2018: 1 commit
    • mm, oom: fix concurrent munlock and oom reaper unmap, v3 · 27ae357f
      David Rientjes authored
      Since exit_mmap() is done without the protection of mm->mmap_sem, it is
      possible for the oom reaper to concurrently operate on an mm until
      MMF_OOM_SKIP is set.
      
      This allows munlock_vma_pages_all() to concurrently run while the oom
      reaper is operating on a vma.  Since munlock_vma_pages_range() depends
      on clearing VM_LOCKED from vm_flags before actually doing the munlock to
      determine if any other vmas are locking the same memory, the check for
      VM_LOCKED in the oom reaper is racy.
      
      This is especially noticeable on architectures such as powerpc where
      clearing a huge pmd requires serialize_against_pte_lookup().  If the pmd
      is zapped by the oom reaper during follow_page_mask() after the check
      for pmd_none() is bypassed, this ends up dereferencing a NULL ptl or
      causing a kernel oops.
      
      Fix this by manually freeing all possible memory from the mm before
      doing the munlock and then setting MMF_OOM_SKIP.  The oom reaper can not
      run on the mm anymore so the munlock is safe to do in exit_mmap().  It
      also matches the logic that the oom reaper currently uses for
      determining when to set MMF_OOM_SKIP itself, so there's no new risk of
      excessive oom killing.
      
      This fixes CVE-2018-1000200.
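
      A rough sketch of the resulting ordering in exit_mmap() (simplified;
      helper and flag names follow the surrounding commits, the rest of the
      teardown is elided):

        void exit_mmap(struct mm_struct *mm)
        {
                if (unlikely(mm_is_oom_victim(mm))) {
                        /* Free what the oom reaper would have freed ... */
                        __oom_reap_task_mm(mm);
                        /* ... then fence the reaper off for good. */
                        set_bit(MMF_OOM_SKIP, &mm->flags);
                        down_write(&mm->mmap_sem);      /* wait out a racing reaper */
                        up_write(&mm->mmap_sem);
                }

                /* munlock_vma_pages_all(), unmap_vmas(), free_pgtables() and
                 * friends can now run without the reaper touching this mm. */
        }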
      
      Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1804241526320.238665@chino.kir.corp.google.com
      Fixes: 21292580 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
      Signed-off-by: David Rientjes <rientjes@google.com>
      Suggested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>	[4.14+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      27ae357f
  8. 06 Apr, 2018: 3 commits
  9. 01 Feb, 2018: 1 commit
  10. 15 Dec, 2017: 1 commit
    • mm, oom_reaper: fix memory corruption · 4837fe37
      Michal Hocko authored
      David Rientjes has reported the following memory corruption while the
      oom reaper tries to unmap the victim's address space
      
        BUG: Bad page map in process oom_reaper  pte:6353826300000000 pmd:00000000
        addr:00007f50cab1d000 vm_flags:08100073 anon_vma:ffff9eea335603f0 mapping:          (null) index:7f50cab1d
        file:          (null) fault:          (null) mmap:          (null) readpage:          (null)
        CPU: 2 PID: 1001 Comm: oom_reaper
        Call Trace:
           unmap_page_range+0x1068/0x1130
           __oom_reap_task_mm+0xd5/0x16b
           oom_reaper+0xff/0x14c
           kthread+0xc1/0xe0
      
      Tetsuo Handa has noticed that the synchronization inside exit_mmap is
      insufficient.  We only synchronize with the oom reaper if
      tsk_is_oom_victim which is not true if the final __mmput is called from
      a different context than the oom victim exit path.  This can trivially
      happen from the context of any task which has grabbed an mm reference
      (e.g. to read a /proc/<pid>/ file, which requires the mm, etc.).
      
      The race would look like this
      
        oom_reaper		oom_victim		task
      						mmget_not_zero
      			do_exit
      			  mmput
        __oom_reap_task_mm				mmput
        						  __mmput
      						    exit_mmap
      						      remove_vma
          unmap_page_range
      
      Fix this issue by providing a new mm_is_oom_victim() helper which
      operates on the mm struct rather than a task.  Any context which
      operates on a remote mm struct should use this helper in place of
      tsk_is_oom_victim.  The flag is set in mark_oom_victim and never cleared
      so it is stable in the exit_mmap path.
      
      Debugged by Tetsuo Handa.
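
      The helper boils down to roughly the following (a sketch based on the
      description above; the exact flag name may differ):

        /* Operate on the mm, not the task, so any context that only holds
         * an mm reference (e.g. procfs readers) can test it safely. */
        static inline bool mm_is_oom_victim(struct mm_struct *mm)
        {
                return test_bit(MMF_OOM_VICTIM, &mm->flags);
        }

        /* mark_oom_victim() sets the bit once and it is never cleared:
         *      set_bit(MMF_OOM_VICTIM, &mm->flags);
         * so the check stays stable for the whole exit_mmap() path. */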
      
      Link: http://lkml.kernel.org/r/20171210095130.17110-1-mhocko@kernel.org
      Fixes: 21292580 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: David Rientjes <rientjes@google.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Andrea Argangeli <andrea@kernel.org>
      Cc: <stable@vger.kernel.org>	[4.14]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4837fe37
  11. 30 Nov, 2017: 1 commit
    • mm, oom_reaper: gather each vma to prevent leaking TLB entry · 687cb088
      Wang Nan authored
      tlb_gather_mmu(&tlb, mm, 0, -1) means gathering the whole virtual memory
      space.  In this case, tlb->fullmm is true.  Some archs like arm64 don't
      flush the TLB when tlb->fullmm is true:

        commit 5a7862e8 ("arm64: tlbflush: avoid flushing when fullmm == 1").

      This causes leaking of TLB entries.
      
      Will clarifies his patch:
       "Basically, we tag each address space with an ASID (PCID on x86) which
        is resident in the TLB. This means we can elide TLB invalidation when
        pulling down a full mm because we won't ever assign that ASID to
        another mm without doing TLB invalidation elsewhere (which actually
        just nukes the whole TLB).
      
        I think that means that we could potentially not fault on a kernel
        uaccess, because we could hit in the TLB"
      
      There could be a window between complete_signal() sending IPIs to other
      cores and all threads sharing this mm actually being kicked off those
      cores.  In this window, the oom reaper may call tlb_flush_mmu_tlbonly() to
      flush the TLB and then free pages.  However, due to the above problem, the
      TLB entries are not really flushed on arm64.  Other threads can still
      access these pages through stale TLB entries.  Moreover, a copy_to_user()
      can also write to these pages without generating a page fault, causing
      use-after-free bugs.
      
      This patch gathers each vma instead of gathering the full vm space.  In
      this case tlb->fullmm is not true.  The behavior of the oom reaper becomes
      similar to munmapping before do_exit, which should be safe for all archs.
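
      A sketch of the per-vma gathering (simplified from the reaper's unmap
      loop; the helper and mmu_gather calls follow the 4.14-era API and are
      best-effort, not quoted from the patch):

        struct mmu_gather tlb;
        struct vm_area_struct *vma;

        /* Gather and flush one vma at a time so tlb->fullmm stays false and
         * arm64 cannot elide the TLB invalidation. */
        for (vma = mm->mmap; vma; vma = vma->vm_next) {
                if (!can_madv_dontneed_vma(vma))
                        continue;

                tlb_gather_mmu(&tlb, mm, vma->vm_start, vma->vm_end);
                unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end, NULL);
                tlb_finish_mmu(&tlb, vma->vm_start, vma->vm_end);
        }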
      
      Link: http://lkml.kernel.org/r/20171107095453.179940-1-wangnan0@huawei.com
      Fixes: aac45363 ("mm, oom: introduce oom reaper")
      Signed-off-by: Wang Nan <wangnan0@huawei.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Cc: Minchan Kim <minchan@kernel.org>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Bob Liu <liubo95@huawei.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Roman Gushchin <guro@fb.com>
      Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      687cb088
  12. 16 Nov, 2017: 6 commits
  13. 04 Oct, 2017: 1 commit
    • mm, oom_reaper: skip mm structs with mmu notifiers · 4d4bbd85
      Michal Hocko authored
      Andrea has noticed that the oom_reaper doesn't invalidate the range via
      mmu notifiers (mmu_notifier_invalidate_range_start/end) and that can
      corrupt the memory of the kvm guest for example.
      
      tlb_flush_mmu_tlbonly already invokes mmu notifiers but that is not
      sufficient as per Andrea:
      
       "mmu_notifier_invalidate_range cannot be used in replacement of
        mmu_notifier_invalidate_range_start/end. For KVM
        mmu_notifier_invalidate_range is a noop and rightfully so. A MMU
        notifier implementation has to implement either ->invalidate_range
        method or the invalidate_range_start/end methods, not both. And if you
        implement invalidate_range_start/end like KVM is forced to do, calling
        mmu_notifier_invalidate_range in common code is a noop for KVM.
      
        For those MMU notifiers that can get away only implementing
        ->invalidate_range, the ->invalidate_range is implicitly called by
        mmu_notifier_invalidate_range_end(). And only those secondary MMUs
        that share the same pagetable with the primary MMU (like AMD iommuv2)
        can get away only implementing ->invalidate_range"
      
      As the callback is allowed to sleep and the implementation is out of the
      hands of the MM, it is safer to simply bail out if there is an mmu
      notifier registered.  In order not to fail too early, make the
      mm_has_notifiers check under the oom_lock and have a little nap before
      failing, to give the current oom victim some more time to exit.
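
      Inside the reaper this amounts to roughly the following (a sketch;
      surrounding locking and retry logic elided):

        /* Refuse to reap an mm with registered mmu notifiers; give the
         * victim a moment to exit on its own instead. */
        if (mm_has_notifiers(mm)) {
                up_read(&mm->mmap_sem);
                schedule_timeout_idle(HZ);
                goto unlock_oom;        /* drop oom_lock and bail out */
        }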
      
      [akpm@linux-foundation.org: coding-style fixes]
      Link: http://lkml.kernel.org/r/20170913113427.2291-1-mhocko@kernel.org
      Fixes: aac45363 ("mm, oom: introduce oom reaper")
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Andrea Arcangeli <aarcange@redhat.com>
      Reviewed-by: Andrea Arcangeli <aarcange@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4d4bbd85
  14. 07 Sep, 2017: 2 commits
    • mm: oom: let oom_reap_task and exit_mmap run concurrently · 21292580
      Andrea Arcangeli authored
      This is purely required because exit_aio() may block and exit_mmap() may
      never start, if the oom_reap_task cannot start running on a mm with
      mm_users == 0.
      
      At the same time if the OOM reaper doesn't wait at all for the memory of
      the current OOM candidate to be freed by exit_mmap->unmap_vmas, it would
      generate a spurious OOM kill.
      
      If it weren't for exit_aio or similar blocking functions in the last
      mmput, it would be enough to change oom_reap_task(), in the case it finds
      mm_users == 0, to wait for a timeout or to wait for __mmput to set
      MMF_OOM_SKIP itself; but exit_mmap is not the only problem here, so the
      concurrency of exit_mmap and oom_reap_task is apparently warranted.
      
      It's a non-standard runtime: exit_mmap() runs without mmap_sem, and
      oom_reap_task runs with the mmap_sem for reading as usual (kind of like
      MADV_DONTNEED).
      
      The race between the two is solved with a combination of
      tsk_is_oom_victim() (serialized by task_lock) and MMF_OOM_SKIP
      (serialized by a dummy down_write/up_write cycle on the same lines of
      the ksm_exit method).
      
      If oom_reap_task() may be running concurrently during exit_mmap,
      exit_mmap will wait for it to finish in down_write (before taking down mm
      structures that would make oom_reap_task fail with use after free).
      
      If exit_mmap comes first, oom_reap_task() will skip the mm because
      MMF_OOM_SKIP is already set; in turn all memory is already freed and,
      furthermore, the mm data structures may already have been taken down by
      free_pgtables.
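
      The exit_mmap() side of that serialization looks roughly like this (a
      sketch; at the time of this commit the check was tsk_is_oom_victim() on
      current, as described above):

        if (unlikely(tsk_is_oom_victim(current))) {
                /* All memory was freed by unmap_vmas() above; tell the
                 * reaper to skip this mm, and wait for a reaper that is
                 * already holding mmap_sem before free_pgtables() runs. */
                set_bit(MMF_OOM_SKIP, &mm->flags);
                down_write(&mm->mmap_sem);
                up_write(&mm->mmap_sem);
        }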
      
      [aarcange@redhat.com: incremental one liner]
        Link: http://lkml.kernel.org/r/20170726164319.GC29716@redhat.com
      [rientjes@google.com: remove unused mmput_async]
        Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1708141733130.50317@chino.kir.corp.google.com
      [aarcange@redhat.com: microoptimization]
        Link: http://lkml.kernel.org/r/20170817171240.GB5066@redhat.com
      Link: http://lkml.kernel.org/r/20170726162912.GA29716@redhat.com
      Fixes: 26db62f1 ("oom: keep mm of the killed task available")
      Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
      Signed-off-by: David Rientjes <rientjes@google.com>
      Reported-by: David Rientjes <rientjes@google.com>
      Tested-by: David Rientjes <rientjes@google.com>
      Reviewed-by: Michal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Hugh Dickins <hughd@google.com>
      Cc: "Kirill A. Shutemov" <kirill@shutemov.name>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      21292580
    • mm, oom: do not rely on TIF_MEMDIE for memory reserves access · cd04ae1e
      Michal Hocko authored
      For ages we have been relying on the TIF_MEMDIE thread flag to mark OOM
      victims and then, among other things, to give these threads full access
      to memory reserves.  There are a few shortcomings of this implementation,
      though.
      
      First of all, and the most serious one, is that the full access to memory
      reserves is quite dangerous because we leave no safety room for the
      system to operate and potentially take last emergency steps to move on.
      
      Secondly this flag is per task_struct while the OOM killer operates on
      mm_struct granularity so all processes sharing the given mm are killed.
      Giving the full access to all these task_structs could lead to a quick
      memory reserves depletion.  We have tried to reduce this risk by giving
      TIF_MEMDIE only to the main thread and the currently allocating task but
      that doesn't really solve this problem, while it surely opens up room
      for corner cases - e.g. GFP_NO{FS,IO} requests might loop inside the
      allocator without access to memory reserves because a particular thread
      was not the group leader.
      
      Now that we have the oom reaper and that all oom victims are reapable
      after 1b51e65e ("oom, oom_reaper: allow to reap mm shared by the
      kthreads") we can be more conservative and grant only partial access to
      memory reserves because there are reasonable chances of parallel memory
      freeing.  We still want some access to reserves because we do not want
      other consumers to eat up the victim's freed memory.  OOM victims will
      still contend with __GFP_HIGH users, but those shouldn't be so aggressive
      as to starve oom victims completely.
      
      Introduce an ALLOC_OOM flag and give all tsk_is_oom_victim tasks access to
      half of the reserves.  This makes the access to reserves independent
      of which task has passed through mark_oom_victim.  Also drop any usage
      of TIF_MEMDIE from the page allocator proper and replace it by
      tsk_is_oom_victim as well, which will finally make page_alloc.c completely
      TIF_MEMDIE free.
      
      CONFIG_MMU=n doesn't have oom reaper so let's stick to the original
      ALLOC_NO_WATERMARKS approach.
      
      There is a demand to make the oom killer memcg aware which will imply
      many tasks killed at once.  This change will allow such a usecase
      without worrying about complete memory reserves depletion.
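
      In watermark terms the grant boils down to something like this (a
      simplified sketch of the idea, not the page allocator's exact code):

        /* How deep each class of request may dig into the min reserve. */
        static unsigned long effective_min(unsigned long min,
                                           unsigned int alloc_flags)
        {
                if (alloc_flags & ALLOC_HIGH)           /* __GFP_HIGH */
                        min -= min / 2;
                if (alloc_flags & ALLOC_OOM)            /* tsk_is_oom_victim() */
                        min -= min / 2;                 /* half of the reserves */
                return min;
        }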
      
      Link: http://lkml.kernel.org/r/20170810075019.28998-2-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Acked-by: Mel Gorman <mgorman@techsingularity.net>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Roman Gushchin <guro@fb.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      cd04ae1e
  15. 11 Jul, 2017: 1 commit
    • mm/oom_kill.c: add tracepoints for oom reaper-related events · 422580c3
      Roman Gushchin authored
      During the debugging of the problem described in
      https://lkml.org/lkml/2017/5/17/542 and fixed by Tetsuo Handa in
      https://lkml.org/lkml/2017/5/19/383 , I've found that the existing debug
      output is not really useful to understand issues related to the oom
      reaper.
      
      So I assume that adding some tracepoints might help with debugging
      similar issues.
      
      Trace the following events:
       1) a process is marked as an oom victim,
       2) a process is added to the oom reaper list,
       3) the oom reaper starts reaping process's mm,
       4) the oom reaper finished reaping,
       5) the oom reaper skips reaping.
      
      How does it work in practice?  Below is an example which shows how the
      problem mentioned above can be found: one process is added twice to the
      oom_reaper list:
      
        $ cd /sys/kernel/debug/tracing
        $ echo "oom:mark_victim" > set_event
        $ echo "oom:wake_reaper" >> set_event
        $ echo "oom:skip_task_reaping" >> set_event
        $ echo "oom:start_task_reaping" >> set_event
        $ echo "oom:finish_task_reaping" >> set_event
        $ cat trace_pipe
                allocate-502   [001] ....    91.836405: mark_victim: pid=502
                allocate-502   [001] .N..    91.837356: wake_reaper: pid=502
                allocate-502   [000] .N..    91.871149: wake_reaper: pid=502
              oom_reaper-23    [000] ....    91.871177: start_task_reaping: pid=502
              oom_reaper-23    [000] .N..    91.879511: finish_task_reaping: pid=502
              oom_reaper-23    [000] ....    91.879580: skip_task_reaping: pid=502
      
      Link: http://lkml.kernel.org/r/20170530185231.GA13412@castle
      Signed-off-by: Roman Gushchin <guro@fb.com>
      Acked-by: Michal Hocko <mhocko@suse.com>
      Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      422580c3
  16. 07 Jul, 2017: 1 commit
  17. 04 May, 2017: 1 commit
    • oom: improve oom disable handling · d75da004
      Michal Hocko authored
      Tetsuo has reported that the sysrq-triggered OOM killer will print
      misleading information when no tasks are selected:
      
        sysrq: SysRq : Manual OOM execution
        Out of memory: Kill process 4468 ((agetty)) score 0 or sacrifice child
        Killed process 4468 ((agetty)) total-vm:43704kB, anon-rss:1760kB, file-rss:0kB, shmem-rss:0kB
        sysrq: SysRq : Manual OOM execution
        Out of memory: Kill process 4469 (systemd-cgroups) score 0 or sacrifice child
        Killed process 4469 (systemd-cgroups) total-vm:10704kB, anon-rss:120kB, file-rss:0kB, shmem-rss:0kB
        sysrq: SysRq : Manual OOM execution
        sysrq: OOM request ignored because killer is disabled
        sysrq: SysRq : Manual OOM execution
        sysrq: OOM request ignored because killer is disabled
        sysrq: SysRq : Manual OOM execution
        sysrq: OOM request ignored because killer is disabled
      
      The real reason is that there are no eligible tasks for the OOM killer
      to select, but since commit 7c5f64f8 ("mm: oom: deduplicate victim
      selection code for memcg and global oom") the semantics of out_of_memory
      have changed without moom_callback being updated.
      
      This patch updates moom_callback to report that no task was eligible,
      which is the case both when the oom killer is disabled and when there are
      genuinely no eligible tasks.  In order to help distinguish the first case
      from the second, add a printk to both oom_killer_{enable,disable}.  This
      information is useful on its own because it might help debugging
      potential memory allocation failures.
      
      Fixes: 7c5f64f8 ("mm: oom: deduplicate victim selection code for memcg and global oom")
      Link: http://lkml.kernel.org/r/20170404134705.6361-1-mhocko@kernel.org
      Signed-off-by: Michal Hocko <mhocko@suse.com>
      Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d75da004
  18. 02 Mar, 2017: 3 commits
  19. 28 Feb, 2017: 1 commit
  20. 25 Feb, 2017: 1 commit
  21. 23 Feb, 2017: 3 commits