1. 25 Jun 2015, 7 commits
  2. 16 Apr 2015, 1 commit
  3. 15 Apr 2015, 1 commit
  4. 12 Feb 2015, 6 commits
    •
      mm: account pmd page tables to the process · dc6c9a35
      Committed by Kirill A. Shutemov
      Dave noticed that an unprivileged process can allocate a significant
      amount of memory -- >500 MiB on x86_64 -- and stay unnoticed by the
      oom-killer and memory cgroup.  The trick is to allocate a lot of PMD
      page tables.  The Linux kernel doesn't account PMD tables to the
      process, only PTE tables.

      The use-case below uses a few tricks to allocate a lot of PMD page
      tables while keeping VmRSS and VmPTE low.  oom_score for the process
      will be 0.
      
      	#include <errno.h>
      	#include <stdio.h>
      	#include <stdlib.h>
      	#include <unistd.h>
      	#include <sys/mman.h>
      	#include <sys/prctl.h>
      
      	#define PUD_SIZE (1UL << 30)
      	#define PMD_SIZE (1UL << 21)
      
      	#define NR_PUD 130000
      
      	int main(void)
      	{
      		char *addr = NULL;
      		unsigned long i;
      
      		/* Unused prctl arguments must be 0, otherwise the
      		 * kernel rejects PR_SET_THP_DISABLE with -EINVAL. */
      		prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0);
      		for (i = 0; i < NR_PUD ; i++) {
      			addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ,
      					MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
      			if (addr == MAP_FAILED) {
      				perror("mmap");
      				break;
      			}
      			/* Fault one byte to allocate PTE and PMD tables,
      			 * then unmap and remap the first 2M: the PTE
      			 * table is freed (keeping VmPTE and RSS low) but
      			 * the PMD table survives for the rest of the VMA. */
      			*addr = 'x';
      			munmap(addr, PMD_SIZE);
      			if (mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ,
      					MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED,
      					-1, 0) == MAP_FAILED)
      				perror("re-mmap"), exit(1);
      		}
      		printf("PID %d consumed %lu KiB in PMD page tables\n",
      				getpid(), i * 4096 >> 10);
      		return pause();
      	}
      
      The patch addresses the issue by accounting PMD tables to the process
      the same way we account PTE tables.

      The main places where PMD tables are accounted are __pmd_alloc() and
      free_pmd_range().  But there are a few corner cases:

       - HugeTLB can share PMD page tables.  The patch handles this by
         accounting the table to every process that shares it.

       - x86 PAE pre-allocates a few PMD tables on fork.

       - Architectures with FIRST_USER_ADDRESS > 0.  We need to adjust the
         sanity check on exit(2).

      Accounting only happens on configurations where the PMD page table
      level is actually present (PMD is not folded).  As with nr_ptes we use
      a per-mm counter.  The counter value is used to calculate the baseline
      for the badness score by the oom-killer, as sketched below.
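
      In sketch form -- assuming a per-mm atomic counter modeled on the
      existing nr_ptes, with illustrative helper names -- the accounting
      boils down to:

      	/* Illustrative helpers mirroring the nr_ptes pattern. */
      	static inline void mm_inc_nr_pmds(struct mm_struct *mm)
      	{
      		atomic_long_inc(&mm->nr_pmds);
      	}

      	static inline void mm_dec_nr_pmds(struct mm_struct *mm)
      	{
      		atomic_long_dec(&mm->nr_pmds);
      	}

      __pmd_alloc() would increment the counter after installing a new
      table, free_pmd_range() would decrement it on teardown, and HugeTLB
      PMD sharing would account the shared table once per sharer.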
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Reported-by: Dave Hansen <dave.hansen@linux.intel.com>
      Cc: Hugh Dickins <hughd@google.com>
      Reviewed-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Pavel Emelyanov <xemul@openvz.org>
      Cc: David Rientjes <rientjes@google.com>
      Tested-by: Sedat Dilek <sedat.dilek@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      dc6c9a35
    •
      oom, PM: make OOM detection in the freezer path raceless · c32b3cbe
      Committed by Michal Hocko
      Commit 5695be14 ("OOM, PM: OOM killed task shouldn't escape PM
      suspend") left a race window: the OOM killer can still manage to
      note_oom_kill() after freeze_processes() has checked the counter.  The
      race window is quite small and really unlikely, so a partial solution
      was deemed sufficient at the time of submission.
      
      Tejun wasn't happy about this partial solution, though, and insisted
      on a full one.  That requires full mutual exclusion between the OOM
      killer and the freezer's task freezing.  This is done by this patch,
      which introduces an oom_sem RW lock and turns oom_killer_disable()
      into a full OOM barrier.

      The oom_killer_disabled check is moved from the allocation path to the
      OOM level and we take oom_sem for reading for both the check and the
      whole OOM invocation.

      oom_killer_disable() takes oom_sem for writing so it waits for all
      currently running OOM killer invocations.  Then it disables all
      further OOMs by setting oom_killer_disabled and checks for any OOM
      victims.  Victims are counted via mark_tsk_oom_victim() and
      unmark_oom_victim().  The last victim wakes up all waiters enqueued by
      oom_killer_disable().  Therefore this function acts as the full OOM
      barrier, as sketched below.
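
      A sketch of the barrier, assuming the oom_sem rwsem, an atomic
      oom_victims counter and an oom_victims_wait waitqueue as described
      above (names are illustrative, not the exact patch):

      	void oom_killer_disable(void)
      	{
      		/* Wait out all running OOM invocations, which hold
      		 * oom_sem for reading, and forbid new ones. */
      		down_write(&oom_sem);
      		oom_killer_disabled = true;
      		up_write(&oom_sem);

      		/* Wait until the last victim calls unmark_oom_victim(). */
      		wait_event(oom_victims_wait, !atomic_read(&oom_victims));
      	}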
      
      The page fault path is covered now as well, although it was assumed to
      be safe before.  As per Tejun, "We used to have freezing points deep
      in file system code which may be reachable from page fault." so it is
      better and more robust not to rely on freezing points here.  The same
      applies to the memcg OOM killer.
      
      out_of_memory tells the caller whether the OOM was allowed to trigger
      and the callers are supposed to handle the situation.  The page
      allocation path simply fails the allocation, the same as before.  The
      page fault path will retry the fault (more on that later) and the
      Sysrq OOM trigger will simply complain to the log.
      
      Normally there wouldn't be any unfrozen user tasks after
      try_to_freeze_tasks so the function will not block. But if there was an
      OOM killer racing with try_to_freeze_tasks and the OOM victim didn't
      finish yet then we have to wait for it. This should complete in a finite
      time, though, because
      
      	- the victim cannot loop in the page fault handler (it would die
      	  on the way out from the exception)
      	- it cannot loop in the page allocator because all further
      	  allocations would fail and __GFP_NOFAIL allocations are not
      	  acceptable at this stage
      	- it shouldn't be blocked on any locks held by frozen tasks
      	  (try_to_freeze expects lockless context) and kernel threads and
      	  work queues are not frozen yet
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Suggested-by: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      c32b3cbe
    •
      oom: thaw the OOM victim if it is frozen · 63a8ca9b
      Committed by Michal Hocko
      oom_kill_process only sets the TIF_MEMDIE flag and sends a signal to
      the victim.  This is basically a no-op when the task is frozen,
      though, because the task sleeps in uninterruptible sleep.  The victim
      is eventually thawed later, when oom_scan_process_thread meets the
      task again in a later OOM invocation, so the OOM killer doesn't
      livelock.  But this is less than optimal.
      
      Let's add __thaw_task into mark_tsk_oom_victim after we set TIF_MEMDIE
      for the victim, as sketched below.  We are not checking whether the
      task is frozen because that would be racy and __thaw_task does that
      check already.  oom_scan_process_thread doesn't need to care about the
      freezer anymore because TIF_MEMDIE and the freezer are now excluded
      completely.
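
      In sketch form (a sketch, not the exact hunk):

      	void mark_tsk_oom_victim(struct task_struct *tsk)
      	{
      		set_tsk_thread_flag(tsk, TIF_MEMDIE);
      		/* __thaw_task() checks the frozen state under the
      		 * proper locks itself; no racy check is needed here. */
      		__thaw_task(tsk);
      	}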
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      63a8ca9b
    •
      oom: add helpers for setting and clearing TIF_MEMDIE · 49550b60
      Committed by Michal Hocko
      This patchset addresses a race which was described in the changelog for
      5695be14 ("OOM, PM: OOM killed task shouldn't escape PM suspend"):
      
      : PM freezer relies on having all tasks frozen by the time devices are
      : getting frozen so that no task will touch them while they are getting
      : frozen.  But OOM killer is allowed to kill an already frozen task in order
      : to handle OOM situation.  In order to protect from late wake ups OOM
      : killer is disabled after all tasks are frozen.  This, however, still keeps
      : a window open when a killed task didn't manage to die by the time
      : freeze_processes finishes.
      
      The original patch hasn't closed the race window completely because
      that would require a more complex solution, as can be seen from this
      patchset.

      The primary motivation was to close the race condition between the OOM
      killer and the PM freezer _completely_.  As Tejun pointed out, even
      though the race condition is unlikely, weird bugs deep in the PM
      freezer would be that much harder to debug because the debugging
      options are reduced considerably there.  I can only speculate about
      what might happen when a task is unexpectedly still runnable.

      On the plus side, and as a side effect, the oom enable/disable now has
      better (full barrier) semantics without polluting hot paths.
      
      I have tested the series in KVM with 100M RAM:
      - many small tasks (20M anon mmap) which are triggering OOM continually
      - s2ram which resumes automatically is triggered in a loop
      	echo processors > /sys/power/pm_test
      	while true
      	do
      		echo mem > /sys/power/state
      		sleep 1s
      	done
      - simple module which allocates and frees 20M in 8K chunks. If it sees
        freezing(current) then it tries another round of allocation before calling
        try_to_freeze
      - debugging messages of PM stages and OOM killer enable/disable/fail added
        and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before
        it wakes up waiters.
      - rebased on top of the current mmotm, which means some necessary updates
        in mm/oom_kill.c. mark_tsk_oom_victim is now called under task_lock but
        I think this should be OK because __thaw_task shouldn't interfere with
        any locking down in wake_up_process. Oleg?
      
      As expected there are no OOM killed tasks after oom is disabled, and
      allocations requested by the kernel thread are failing after all the
      tasks are frozen and OOM disabled.  I wasn't able to catch a race
      where oom_killer_disable would really have to wait, but I kind of
      expected that, since the race is really unlikely.
      
      [  242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB
      [  243.628071] Unmarking 2992 OOM victim. oom_victims: 1
      [  243.636072] (elapsed 2.837 seconds) done.
      [  243.641985] Trying to disable OOM killer
      [  243.643032] Waiting for concurent OOM victims
      [  243.644342] OOM killer disabled
      [  243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done.
      [  243.652983] Suspending console(s) (use no_console_suspend to debug)
      [  243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010
      [...]
      [  243.992600] PM: suspend of devices complete after 336.667 msecs
      [  243.993264] PM: late suspend of devices complete after 0.660 msecs
      [  243.994713] PM: noirq suspend of devices complete after 1.446 msecs
      [  243.994717] ACPI: Preparing to enter system sleep state S3
      [  243.994795] PM: Saving platform NVS memory
      [  243.994796] Disabling non-boot CPUs ...
      
      The first 2 patches are simple cleanups for OOM.  They should go in
      regardless of the rest, IMO.

      Patches 3 and 4 are trivial printk -> pr_info conversions and they
      should go in ditto.

      The main patch is the last one and I would appreciate acks from Tejun
      and Rafael.  I think the OOM part should be OK (except for __thaw_task
      vs. task_lock, where a look from Oleg would be appreciated) but I am
      not so sure I haven't screwed anything up in the freezer code.  I have
      found several surprises there.
      
      This patch (of 5):
      
      This patch is just preparatory and doesn't introduce any functional
      change.  A sketch of the helpers follows.
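
      A sketch of the helpers, wrapping the raw TIF_MEMDIE flag operations:

      	void mark_tsk_oom_victim(struct task_struct *tsk)
      	{
      		set_tsk_thread_flag(tsk, TIF_MEMDIE);
      	}

      	void unmark_oom_victim(void)
      	{
      		/* Only ever called for current, hence no argument. */
      		clear_thread_flag(TIF_MEMDIE);
      	}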
      
      Note:
      I am utterly unhappy about the lowmemory killer abusing TIF_MEMDIE
      just to wait for the oom victim and to prevent new killing.  This is
      just a side effect of the flag.  The primary meaning is to give the
      oom victim access to the memory reserves, and that shouldn't be
      necessary here.
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Cong Wang <xiyou.wangcong@gmail.com>
      Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      49550b60
    •
      oom: make sure that TIF_MEMDIE is set under task_lock · 83363b91
      Committed by Michal Hocko
      The OOM killer tries to exclude tasks which do not have an mm_struct
      associated because killing such a task wouldn't help much.  The OOM
      victim gets TIF_MEMDIE set to disable the OOM killer while the current
      victim releases its memory, and then enables the OOM killer again by
      dropping the flag.
      
      oom_kill_process is currently prone to a race condition where the OOM
      victim is already exiting and TIF_MEMDIE is set after the task has
      released its address space.  This might theoretically lead to an OOM
      livelock if the OOM victim blocks on an allocation later during exit,
      because the OOM killer wouldn't kill any other process and the exiting
      one won't be able to exit.  The situation is highly unlikely because
      the OOM victim is expected to release some memory, which should help
      to sort out the OOM situation.

      Fix this by checking task->mm and setting the TIF_MEMDIE flag under
      task_lock, which serializes the OOM killer with exit_mm, which in turn
      sets task->mm to NULL, as sketched below.  Setting the flag for
      current is not necessary because check-and-set on the current task is
      not racy.
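
      A sketch of the serialized path (helper name borrowed from earlier in
      this series):

      	static void oom_mark_victim(struct task_struct *victim)
      	{
      		task_lock(victim);	/* excludes exit_mm() */
      		if (victim->mm)
      			mark_tsk_oom_victim(victim);
      		task_unlock(victim);
      	}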
      Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      83363b91
    •
      oom: don't count on mm-less current process · d7a94e7e
      Committed by Tetsuo Handa
      out_of_memory() doesn't trigger the OOM killer if the current task is
      already exiting or has fatal signals pending, and gives the task
      access to memory reserves instead.  However, doing so is wrong if
      out_of_memory() is called by an allocation (e.g. from exit_task_work())
      after the current task has already released its memory and cleared
      TIF_MEMDIE at exit_mm().  If we set TIF_MEMDIE again on a
      post-exit_mm() current task, the OOM killer will be blocked by the
      task sitting in the final schedule() waiting for its parent to reap
      it.  This triggers an OOM livelock if the parent is unable to reap it
      because it is itself doing an allocation and waiting for the OOM
      killer to kill it.  A sketch of the fix follows.
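
      In sketch form, the fast path in out_of_memory() gains a current->mm
      check:

      	/*
      	 * Give current access to memory reserves only if it is still
      	 * attached to its mm; a post-exit_mm() task cannot free any
      	 * memory and flagging it would only block the OOM killer.
      	 */
      	if (current->mm &&
      	    (fatal_signal_pending(current) || current->flags & PF_EXITING)) {
      		set_thread_flag(TIF_MEMDIE);
      		return;
      	}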
      Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      d7a94e7e
  5. 14 Dec 2014, 2 commits
  6. 11 Dec 2014, 1 commit
  7. 27 Oct 2014, 1 commit
    •
      cpuset: simplify cpuset_node_allowed API · 344736f2
      Committed by Vladimir Davydov
      The current cpuset API for checking whether a zone/node is allowed to
      allocate from looks rather awkward.  We have hardwall and softwall
      versions of cpuset_node_allowed, with the softwall version doing
      literally the same as the hardwall version if __GFP_HARDWALL is passed
      to it in the gfp flags.  If it isn't, the softwall version may check
      the given node against the enclosing hardwall cpuset, for which it
      needs to take the callback lock.

      Such a distinction was introduced by commit 02a0e53d ("cpuset:
      rework cpuset_zone_allowed api").  Before, we had only one version,
      with the __GFP_HARDWALL flag determining its behavior.  The purpose of
      the commit was to avoid sleep-in-atomic bugs when someone would
      mistakenly call the function without the __GFP_HARDWALL flag for an
      atomic allocation.  The suffixes introduced were intended to make the
      callers think before using the function.

      However, since the callback lock was converted from a mutex to a
      spinlock by the previous patch, the softwall check function cannot
      sleep, and these precautions are no longer necessary.

      So let's simplify the API back to a single check, as sketched below.
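
      A usage sketch of the unified check inside a zonelist walk, with
      __GFP_HARDWALL selecting the stricter behavior:

      	/* Softwall check for an ordinary allocation... */
      	if (!cpuset_node_allowed(node, gfp_mask))
      		continue;

      	/* ...hardwall check by passing the flag explicitly. */
      	if (!cpuset_node_allowed(node, gfp_mask | __GFP_HARDWALL))
      		continue;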
      Suggested-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Vladimir Davydov <vdavydov@parallels.com>
      Acked-by: Christoph Lameter <cl@linux.com>
      Acked-by: Zefan Li <lizefan@huawei.com>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      344736f2
  8. 22 Oct 2014, 1 commit
    •
      OOM, PM: OOM killed task shouldn't escape PM suspend · 5695be14
      Committed by Michal Hocko
      The PM freezer relies on having all tasks frozen by the time devices
      are getting frozen, so that no task will touch them while they are
      getting frozen.  But the OOM killer is allowed to kill an already
      frozen task in order to handle an OOM situation.  In order to protect
      against late wake-ups, the OOM killer is disabled after all tasks are
      frozen.  This, however, still keeps a window open when a killed task
      didn't manage to die by the time freeze_processes finishes.
      
      Reduce the race window by checking all tasks after the OOM killer has
      been disabled.  This is still not completely race free, unfortunately,
      because oom_killer_disable cannot stop an already ongoing OOM kill, so
      a task might still wake up from the fridge and get killed without
      freeze_processes noticing.  Full synchronization of OOM and freezer
      is, however, too heavyweight for this highly unlikely case.

      Introduce and check an oom_kills counter which gets incremented early,
      when the allocator enters the __alloc_pages_may_oom path, and only
      check all the tasks if the counter changes during the freezing
      attempt.  The counter is updated that early to reduce the race window,
      since the allocator has already checked oom_killer_disabled, which is
      set by the PM-freezing code.  A false positive will push the
      PM-freezer into a slow path but that is not a big deal.  A sketch of
      the counter follows.
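
      A sketch of the counter side (names follow the description above):

      	static atomic_t oom_kills = ATOMIC_INIT(0);

      	/* Called as the allocator enters __alloc_pages_may_oom(). */
      	void note_oom_kill(void)
      	{
      		atomic_inc(&oom_kills);
      	}

      	int oom_kills_count(void)
      	{
      		return atomic_read(&oom_kills);
      	}

      freeze_processes() would snapshot oom_kills_count() before freezing
      and fall into the slow re-check of all tasks only if it changed.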
      
      Changes since v1
      - push the re-check loop out of freeze_processes into
        check_frozen_processes and invert the condition to make the code more
        readable as per Rafael
      
      Fixes: f660daac (oom: thaw threads if oom killed thread is frozen before deferring)
      Cc: 3.2+ <stable@vger.kernel.org> # 3.2+
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      5695be14
  9. 10 Oct 2014, 1 commit
  10. 07 Aug 2014, 3 commits
    •
      mm, oom: remove unnecessary exit_state check · fb794bcb
      Committed by David Rientjes
      The oom killer scans each process and determines whether it is eligible
      for oom kill or whether the oom killer should abort because of
      concurrent memory freeing.  It will abort when an eligible process is
      found to have TIF_MEMDIE set, meaning it has already been oom killed and
      we're waiting for it to exit.
      
      Processes with task->mm == NULL should not be considered because they
      are either kthreads or have already detached their memory, and killing
      them would not lead to memory freeing.  That memory is only freed
      after exit_mm() has returned, however, not when task->mm is first set
      to NULL.
      
      Clear TIF_MEMDIE after exit_mm()'s mmput() so that an oom killed
      process is no longer considered for oom kill, but only once exit_mm()
      has returned.  This was fragile in the past because it relied on
      exit_notify() being reached before TIF_MEMDIE processes stopped being
      considered.
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fb794bcb
    •
      mm, oom: rename zonelist locking functions · e972a070
      Committed by David Rientjes
      try_set_zonelist_oom() and clear_zonelist_oom() are not named in a way
      that implies they provide the locking semantics needed to keep
      out_of_memory() calls from being reordered.

      zone_scan_lock is required by both functions to ensure proper locking
      synchronization.

      Rename try_set_zonelist_oom() to oom_zonelist_trylock() and rename
      clear_zonelist_oom() to oom_zonelist_unlock() to imply the proper
      locking semantics, as in the usage sketch below.

      At the same time, convert oom_zonelist_trylock() to return bool
      instead of int, since only success and failure are tested.
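
      A usage sketch of the renamed pair (the argument lists shown are
      illustrative of that era's tree):

      	if (!oom_zonelist_trylock(zonelist, gfp_mask))
      		return;		/* another OOM kill is in flight */
      	out_of_memory(zonelist, gfp_mask, order, NULL, false);
      	oom_zonelist_unlock(zonelist, gfp_mask);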
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e972a070
    •
      mm, oom: ensure memoryless node zonelist always includes zones · 8d060bf4
      Committed by David Rientjes
      With memoryless node support being worked on, it's possible, as an
      optimization, that a node may not have a non-NULL zonelist.  When
      CONFIG_NUMA is enabled and node 0 is memoryless, this means the
      zonelist for first_online_node may become NULL.
      
      The oom killer requires a zonelist that includes all memory zones for
      the sysrq trigger and the pagefault out-of-memory handler.

      Ensure that a non-NULL zonelist is always passed to the oom killer, as
      sketched below.
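
      In sketch form, the callers pick a zonelist that is guaranteed to be
      populated:

      	/* first_memory_node always has a full zonelist, even when
      	 * first_online_node is memoryless. */
      	struct zonelist *zonelist = node_zonelist(first_memory_node,
      						  gfp_mask);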
      
      [akpm@linux-foundation.org: fix non-numa build]
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Rik van Riel <riel@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8d060bf4
  11. 31 Jan 2014, 1 commit
    •
      mm, oom: base root bonus on current usage · 778c14af
      Committed by David Rientjes
      A bonus of 3% of system memory is sometimes too excessive in
      comparison to other processes.
      
      With commit a63d83f4 ("oom: badness heuristic rewrite"), the OOM
      killer tries to avoid killing privileged tasks by subtracting 3% of
      overall memory (system or cgroup) from their per-task consumption.  But
      as a result, all root tasks that consume less than 3% of overall memory
      are considered equal, and so it only takes 33+ privileged tasks pushing
      the system out of memory for the OOM killer to do something stupid and
      kill dhclient or other root-owned processes.  For example, on a 32G
      machine it can't tell the difference between the 1M agetty and the 10G
      fork bomb member.
      
      The changelog describes this 3% boost as the equivalent to the global
      overcommit limit being 3% higher for privileged tasks, but this is not
      the same as discounting 3% of overall memory from _every privileged task
      individually_ during OOM selection.
      
      Replace the 3% of system memory bonus with a 3% of current memory
      usage bonus, as sketched below.
      
      By giving root tasks a bonus that is proportional to their actual size,
      they remain comparable even when relatively small.  In the example
      above, the OOM killer will discount the 1M agetty's 256 badness points
      down to 179, and the 10G fork bomb's 262144 points down to 183500 points
      and make the right choice, instead of discounting both to 0 and killing
      agetty because it's first in the task list.
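
      In sketch form, the discount in oom_badness() becomes proportional
      (assuming points already holds the task's memory footprint):

      	/* Discount 3% of the task's own usage rather than 3% of
      	 * overall memory for root-owned tasks. */
      	if (has_capability_noaudit(p, CAP_SYS_ADMIN))
      		points -= (points * 3) / 100;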
      Signed-off-by: David Rientjes <rientjes@google.com>
      Reported-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      778c14af
  12. 24 Jan 2014, 1 commit
  13. 22 Jan 2014, 3 commits
  14. 15 Nov 2013, 1 commit
    •
      mm: convert mm->nr_ptes to atomic_long_t · e1f56c89
      Committed by Kirill A. Shutemov
      With split page table lock for the PMD level we can't hold
      mm->page_table_lock while updating nr_ptes.

      Let's convert it to atomic_long_t to avoid races, as sketched below.
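
      In sketch form:

      	/* Before: mm->nr_ptes++ under mm->page_table_lock.
      	 * After: a lock-free update. */
      	atomic_long_inc(&mm->nr_ptes);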
      Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
      Tested-by: Alex Thorlton <athorlton@sgi.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
      Cc: "Eric W . Biederman" <ebiederm@xmission.com>
      Cc: "Paul E . McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Dave Hansen <dave.hansen@intel.com>
      Cc: Dave Jones <davej@redhat.com>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Robin Holt <robinmholt@gmail.com>
      Cc: Sedat Dilek <sedat.dilek@gmail.com>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Hugh Dickins <hughd@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e1f56c89
  15. 17 Oct 2013, 1 commit
    •
      mm: memcg: handle non-error OOM situations more gracefully · 49426420
      Committed by Johannes Weiner
      Commit 3812c8c8 ("mm: memcg: do not trap chargers with full
      callstack on OOM") assumed that only a few places that can trigger a
      memcg OOM situation do not return VM_FAULT_OOM, like optional page
      cache readahead.  But there are many more, and it's impractical to
      annotate them all.

      First of all, we don't want to invoke the OOM killer when the failed
      allocation is gracefully handled, so defer the actual kill to the end
      of the fault handling as well.  This simplifies the code quite a bit
      as an added bonus.
      
      Second, since a failed allocation might not be the abrupt end of the
      fault, the memcg OOM handler needs to be re-entrant until the fault
      finishes, for subsequent allocation attempts.  If an allocation is
      attempted after the task has already OOMed, allow it to bypass the
      limit so that it can quickly finish the fault and invoke the OOM
      killer.  A sketch of the fault-path hook follows.
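
      A sketch of the fault-handler bracket, with helper names along the
      lines this series introduces (illustrative, not the exact hunk):

      	if (flags & FAULT_FLAG_USER)
      		mem_cgroup_oom_enable();	/* allow memcg OOM noting */

      	ret = handle_mm_fault(mm, vma, address, flags);

      	if (flags & FAULT_FLAG_USER) {
      		mem_cgroup_oom_disable();
      		/* Any noted memcg OOM is resolved from here, at the
      		 * end of the fault, with no locks held. */
      	}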
      Reported-by: azurIt <azurit@pobox.sk>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      49426420
  16. 13 Sep 2013, 1 commit
    •
      mm: memcg: do not trap chargers with full callstack on OOM · 3812c8c8
      Committed by Johannes Weiner
      The memcg OOM handling is incredibly fragile and can deadlock.  When a
      task fails to charge memory, it invokes the OOM killer and loops right
      there in the charge code until it succeeds.  Similarly, any other task
      that enters the charge path at this point will go to a waitqueue right
      then and there and sleep until the OOM situation is resolved.  The
      problem is that these tasks may hold filesystem locks and the
      mmap_sem; locks that the selected OOM victim may need in order to
      exit.
      
      For example, in one reported case, the task invoking the OOM killer was
      about to charge a page cache page during a write(), which holds the
      i_mutex.  The OOM killer selected a task that was just entering truncate()
      and trying to acquire the i_mutex:
      
      OOM invoking task:
        mem_cgroup_handle_oom+0x241/0x3b0
        mem_cgroup_cache_charge+0xbe/0xe0
        add_to_page_cache_locked+0x4c/0x140
        add_to_page_cache_lru+0x22/0x50
        grab_cache_page_write_begin+0x8b/0xe0
        ext3_write_begin+0x88/0x270
        generic_file_buffered_write+0x116/0x290
        __generic_file_aio_write+0x27c/0x480
        generic_file_aio_write+0x76/0xf0           # takes ->i_mutex
        do_sync_write+0xea/0x130
        vfs_write+0xf3/0x1f0
        sys_write+0x51/0x90
        system_call_fastpath+0x18/0x1d
      
      OOM kill victim:
        do_truncate+0x58/0xa0              # takes i_mutex
        do_last+0x250/0xa30
        path_openat+0xd7/0x440
        do_filp_open+0x49/0xa0
        do_sys_open+0x106/0x240
        sys_open+0x20/0x30
        system_call_fastpath+0x18/0x1d
      
      The OOM handling task will retry the charge indefinitely while the OOM
      killed task is not releasing any resources.
      
      A similar scenario can happen when the kernel OOM killer for a memcg is
      disabled and a userspace task is in charge of resolving OOM situations.
      In this case, ALL tasks that enter the OOM path will be made to sleep on
      the OOM waitqueue and wait for userspace to free resources or increase
      the group's limit.  But a userspace OOM handler is prone to deadlock
      itself on the locks held by the waiting tasks.  For example one of the
      sleeping tasks may be stuck in a brk() call with the mmap_sem held for
      writing but the userspace handler, in order to pick an optimal victim,
      may need to read files from /proc/<pid>, which tries to acquire the same
      mmap_sem for reading and deadlocks.
      
      This patch changes the way tasks behave after detecting a memcg OOM and
      makes sure nobody loops or sleeps with locks held:
      
      1. When OOMing in a user fault, invoke the OOM killer and restart the
         fault instead of looping on the charge attempt.  This way, the OOM
         victim can not get stuck on locks the looping task may hold.
      
      2. When OOMing in a user fault but somebody else is handling it
         (either the kernel OOM killer or a userspace handler), don't go to
         sleep in the charge context.  Instead, remember the OOMing memcg in
         the task struct (see the sketch after this list) and then fully
         unwind the page fault stack with -ENOMEM.
         pagefault_out_of_memory() will then call back into the memcg code
         to check if the -ENOMEM came from the memcg, and then either put
         the task to sleep on the memcg's OOM waitqueue or just restart the
         fault.  The OOM victim can no longer get stuck on any lock a
         sleeping task may hold.
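
      A sketch of the per-task state remembered for point 2 above (field
      names are illustrative):

      	struct task_struct {
      		/* ... */
      		struct memcg_oom_info {
      			struct mem_cgroup *memcg;  /* OOMing memcg or NULL */
      			gfp_t gfp_mask;
      			int order;
      		} memcg_oom;
      	};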
      
      Debugged by Michal Hocko.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: azurIt <azurit@pobox.sk>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3812c8c8
  17. 15 Jul 2013, 1 commit
  18. 24 Feb 2013, 1 commit
    •
      memcg, oom: provide more precise dump info while memcg oom happening · 58cf188e
      Committed by Sha Zhengju
      Currently, when a memcg OOM happens, the OOM dump messages still show
      global state and provide little useful info for users.  This patch
      prints more pointed memcg page statistics for a memcg OOM and takes
      the hierarchy into consideration.

      Based on Michal's advice, we take the hierarchy into consideration:
      suppose we trigger an OOM on A's limit
      
              root_memcg
                  |
                  A (use_hierarchy=1)
                 / \
                B   C
                |
                D
      then the printed info will be:
      
        Memory cgroup stats for /A:...
        Memory cgroup stats for /A/B:...
        Memory cgroup stats for /A/C:...
        Memory cgroup stats for /A/B/D:...
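
      A sketch of how such a dump can walk the OOMing memcg's subtree using
      the memcontrol hierarchy iterator (the print helpers named here are
      illustrative):

      	struct mem_cgroup *iter;

      	for_each_mem_cgroup_tree(iter, memcg) {
      		pr_info("Memory cgroup stats for %s:", memcg_name(iter));
      		print_memcg_page_stats(iter);
      	}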
      
      Following are samples of oom output:
      
      (1) Before change:
      
          mal-80 invoked oom-killer:gfp_mask=0xd0, order=0, oom_score_adj=0
          mal-80 cpuset=/ mems_allowed=0
          Pid: 2976, comm: mal-80 Not tainted 3.7.0+ #10
          Call Trace:
           [<ffffffff8167fbfb>] dump_header+0x83/0x1ca
           ..... (call trace)
           [<ffffffff8168a818>] page_fault+0x28/0x30
                                   <<<<<<<<<<<<<<<<<<<<< memcg specific information
          Task in /A/B/D killed as a result of limit of /A
          memory: usage 101376kB, limit 101376kB, failcnt 57
          memory+swap: usage 101376kB, limit 101376kB, failcnt 0
          kmem: usage 0kB, limit 9007199254740991kB, failcnt 0
                                   <<<<<<<<<<<<<<<<<<<<< print per cpu pageset stat
          Mem-Info:
          Node 0 DMA per-cpu:
          CPU    0: hi:    0, btch:   1 usd:   0
          ......
          CPU    3: hi:    0, btch:   1 usd:   0
          Node 0 DMA32 per-cpu:
          CPU    0: hi:  186, btch:  31 usd: 173
          ......
          CPU    3: hi:  186, btch:  31 usd: 130
                                   <<<<<<<<<<<<<<<<<<<<< print global page state
          active_anon:92963 inactive_anon:40777 isolated_anon:0
           active_file:33027 inactive_file:51718 isolated_file:0
           unevictable:0 dirty:3 writeback:0 unstable:0
           free:729995 slab_reclaimable:6897 slab_unreclaimable:6263
           mapped:20278 shmem:35971 pagetables:5885 bounce:0
           free_cma:0
                                   <<<<<<<<<<<<<<<<<<<<< print per zone page state
          Node 0 DMA free:15836kB ... all_unreclaimable? no
          lowmem_reserve[]: 0 3175 3899 3899
          Node 0 DMA32 free:2888564kB ... all_unrelaimable? no
          lowmem_reserve[]: 0 0 724 724
          lowmem_reserve[]: 0 0 0 0
          Node 0 DMA: 1*4kB (U) ... 3*4096kB (M) = 15836kB
          Node 0 DMA32: 41*4kB (UM) ... 702*4096kB (MR) = 2888316kB
          120710 total pagecache pages
          0 pages in swap cache
                                   <<<<<<<<<<<<<<<<<<<<< print global swap cache stat
          Swap cache stats: add 0, delete 0, find 0/0
          Free swap  = 499708kB
          Total swap = 499708kB
          1040368 pages RAM
          58678 pages reserved
          169065 pages shared
          173632 pages non-shared
          [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
          [ 2693]     0  2693     6005     1324      17        0             0 god
          [ 2754]     0  2754     6003     1320      16        0             0 god
          [ 2811]     0  2811     5992     1304      18        0             0 god
          [ 2874]     0  2874     6005     1323      18        0             0 god
          [ 2935]     0  2935     8720     7742      21        0             0 mal-30
          [ 2976]     0  2976    21520    17577      42        0             0 mal-80
          Memory cgroup out of memory: Kill process 2976 (mal-80) score 665 or sacrifice child
          Killed process 2976 (mal-80) total-vm:86080kB, anon-rss:69964kB, file-rss:344kB
      
      We can see that the messages dumped by show_free_areas() are long and
      provide only limited info for the memcg that just hit OOM.
      
      (2) After change
          mal-80 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
          mal-80 cpuset=/ mems_allowed=0
          Pid: 2704, comm: mal-80 Not tainted 3.7.0+ #10
          Call Trace:
           [<ffffffff8167fd0b>] dump_header+0x83/0x1d1
           .......(call trace)
           [<ffffffff8168a918>] page_fault+0x28/0x30
          Task in /A/B/D killed as a result of limit of /A
                                   <<<<<<<<<<<<<<<<<<<<< memcg specific information
          memory: usage 102400kB, limit 102400kB, failcnt 140
          memory+swap: usage 102400kB, limit 102400kB, failcnt 0
          kmem: usage 0kB, limit 9007199254740991kB, failcnt 0
          Memory cgroup stats for /A: cache:32KB rss:30984KB mapped_file:0KB swap:0KB inactive_anon:6912KB active_anon:24072KB inactive_file:32KB active_file:0KB unevictable:0KB
          Memory cgroup stats for /A/B: cache:0KB rss:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
          Memory cgroup stats for /A/C: cache:0KB rss:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
          Memory cgroup stats for /A/B/D: cache:32KB rss:71352KB mapped_file:0KB swap:0KB inactive_anon:6656KB active_anon:64696KB inactive_file:16KB active_file:16KB unevictable:0KB
          [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
          [ 2260]     0  2260     6006     1325      18        0             0 god
          [ 2383]     0  2383     6003     1319      17        0             0 god
          [ 2503]     0  2503     6004     1321      18        0             0 god
          [ 2622]     0  2622     6004     1321      16        0             0 god
          [ 2695]     0  2695     8720     7741      22        0             0 mal-30
          [ 2704]     0  2704    21520    17839      43        0             0 mal-80
          Memory cgroup out of memory: Kill process 2704 (mal-80) score 669 or sacrifice child
          Killed process 2704 (mal-80) total-vm:86080kB, anon-rss:71016kB, file-rss:340kB
      
      This version provides more pointed info for the memcg in the "Memory
      cgroup stats for XXX" section.
      Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      58cf188e
  19. 13 Dec 2012, 3 commits
  20. 12 Dec 2012, 3 commits
    •
      mm, oom: fix race when specifying a thread as the oom origin · e1e12d2f
      Committed by David Rientjes
      test_set_oom_score_adj() and compare_swap_oom_score_adj() are used to
      specify that current should be killed first if an oom condition occurs in
      between the two calls.
      
      The usage is
      
      	short oom_score_adj = test_set_oom_score_adj(OOM_SCORE_ADJ_MAX);
      	...
      	compare_swap_oom_score_adj(OOM_SCORE_ADJ_MAX, oom_score_adj);
      
      to store the thread's oom_score_adj, temporarily change it to the maximum
      score possible, and then restore the old value if it is still the same.
      
      This happens to still be racy, however, if the user writes
      OOM_SCORE_ADJ_MAX to /proc/pid/oom_score_adj in between the two calls.
      compare_swap_oom_score_adj() will then incorrectly restore the value
      from before the user's write, discarding it.
      
      To fix this, introduce a new oom_flags_t member in struct
      signal_struct that will be used for per-thread oom killer flags, as
      sketched below.  KSM and swapoff can now use a bit in this member to
      specify that threads should be killed first in oom conditions, without
      playing around with oom_score_adj.
      
      This also allows the correct oom_score_adj to always be shown when reading
      /proc/pid/oom_score.
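
      A sketch of the flag-based origin marking, assuming an oom_flags_t
      bitmask in signal_struct:

      	#define OOM_FLAG_ORIGIN		((oom_flags_t)0x1)

      	static inline void set_current_oom_origin(void)
      	{
      		current->signal->oom_flags |= OOM_FLAG_ORIGIN;
      	}

      	static inline void clear_current_oom_origin(void)
      	{
      		current->signal->oom_flags &= ~OOM_FLAG_ORIGIN;
      	}

      	static inline bool oom_task_origin(const struct task_struct *p)
      	{
      		return !!(p->signal->oom_flags & OOM_FLAG_ORIGIN);
      	}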
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Anton Vorontsov <anton.vorontsov@linaro.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      e1e12d2f
    •
      mm, oom: change type of oom_score_adj to short · a9c58b90
      Committed by David Rientjes
      The maximum oom_score_adj is 1000 and the minimum is -1000, so this
      range can be represented by the signed short type with no functional
      change (see the sketch below).  The extra space this frees up in
      struct signal_struct will be used for per-thread oom kill flags in the
      next patch.
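
      In sketch form:

      	/* [-1000, 1000] fits comfortably in a signed short. */
      	short oom_score_adj;		/* OOM_SCORE_ADJ_MIN..MAX */
      	short oom_score_adj_min;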
      Signed-off-by: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Reviewed-by: Michal Hocko <mhocko@suse.cz>
      Cc: Anton Vorontsov <anton.vorontsov@linaro.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      a9c58b90
    •
      mm, oom: allow exiting threads to have access to memory reserves · 9ff4868e
      Committed by David Rientjes
      Exiting threads, those with PF_EXITING set, can pagefault and require
      memory before they can make forward progress.  This happens, for instance,
      when a process must fault task->robust_list, a userspace structure, before
      detaching its memory.
      
      These threads also aren't guaranteed to get access to memory reserves
      unless oom killed or killed from userspace.  The oom killer won't
      grant memory reserves if other threads, besides current, are also
      exiting and stalling at the same point.  This prevents needlessly
      killing processes when others are already exiting.
      
      Instead of special-casing all the possible situations between
      PF_EXITING getting set and a thread detaching its mm where it may
      allocate memory -- special cases which probably wouldn't get updated
      when a change is made to the exit path -- the solution is to give all
      exiting threads access to memory reserves if they call the oom killer,
      as sketched below.  This allows them to quickly allocate, detach their
      mm, and free the memory it represents.
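
      In sketch form, the check at the top of the OOM path becomes:

      	/* Any exiting thread that reaches the OOM killer gets access
      	 * to reserves so it can fault, detach its mm and exit. */
      	if (current->flags & PF_EXITING) {
      		set_thread_flag(TIF_MEMDIE);
      		return;
      	}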
      
      Summary of Luigi's bug report:
      
      : He had an oom condition where threads were faulting on task->robust_list
      : and repeatedly called the oom killer but it would defer killing a thread
      : because it saw other PF_EXITING threads.  This can happen anytime we need
      : to allocate memory after setting PF_EXITING and before detaching our mm;
      : if there are other threads in the same state then the oom killer won't do
      : anything unless one of them happens to be killed from userspace.
      :
      : So instead of only deferring for PF_EXITING and !task->robust_list, it's
      : better to just give them access to memory reserves to prevent a potential
      : livelock so that any other faults that may be introduced in the future in
      : the exit path don't cause the same problem (and hopefully we don't allow
      : too many of those!).
      Signed-off-by: David Rientjes <rientjes@google.com>
      Acked-by: Minchan Kim <minchan@kernel.org>
      Tested-by: Luigi Semenzato <semenzato@google.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      9ff4868e