1. 13 November 2013, 1 commit
  2. 17 October 2013, 1 commit
    • mm: memcg: handle non-error OOM situations more gracefully · 49426420
      Authored by Johannes Weiner
      Commit 3812c8c8 ("mm: memcg: do not trap chargers with full
      callstack on OOM") assumed that only a few places that can trigger a
      memcg OOM situation do not return VM_FAULT_OOM, like optional page cache
      readahead.  But there are many more and it's impractical to annotate
      them all.
      
      First of all, we don't want to invoke the OOM killer when the failed
      allocation is gracefully handled, so defer the actual kill to the end of
      the fault handling as well.  As an added bonus, this simplifies the code
      quite a bit.
      
      Second, since a failed allocation might not be the abrupt end of the
      fault, the memcg OOM handler needs to be re-entrant until the fault
      finishes for subsequent allocation attempts.  If an allocation is
      attempted after the task already OOMed, allow it to bypass the limit so
      that it can quickly finish the fault and invoke the OOM killer.
      Reported-by: azurIt <azurit@pobox.sk>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      49426420
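      A minimal, standalone C model of the deferral described above. The types
      and helper names (try_charge, fault_end, oom_memcg) are hypothetical
      simplifications written from the changelog, not the kernel's memcg API.
      It shows a first failed charge merely recording the OOM state, a retry by
      the same task bypassing the limit, and the kill happening only once the
      fault has finished:

        /* Build: cc -std=c99 -Wall memcg_oom_defer.c */
        #include <stdbool.h>
        #include <stdio.h>

        struct memcg { long usage, limit; };

        struct task {
            struct memcg *oom_memcg;    /* memcg this task already OOMed against */
            bool in_user_fault;
        };

        static void oom_kill(struct memcg *mc) { (void)mc; puts("invoking OOM killer"); }

        /*
         * Charge attempt inside a fault: never kill here.  On the first failure
         * in a user fault just remember the memcg; if the task already OOMed
         * against this memcg, bypass the limit so the fault can finish quickly.
         */
        static int try_charge(struct task *t, struct memcg *mc, long nr)
        {
            if (t->oom_memcg == mc) {
                mc->usage += nr;        /* bypass the limit */
                return 0;
            }
            if (mc->usage + nr <= mc->limit) {
                mc->usage += nr;
                return 0;
            }
            if (t->in_user_fault)
                t->oom_memcg = mc;      /* defer the kill */
            return -1;                  /* -ENOMEM */
        }

        /* End of fault handling: only now decide whether to actually kill. */
        static void fault_end(struct task *t, bool fault_returned_oom)
        {
            if (t->oom_memcg && fault_returned_oom)
                oom_kill(t->oom_memcg);
            t->oom_memcg = NULL;
        }

        int main(void)
        {
            struct memcg mc = { .usage = 90, .limit = 100 };
            struct task t = { .in_user_fault = true };

            printf("first charge: %d\n", try_charge(&t, &mc, 20)); /* fails, records */
            printf("retry:        %d\n", try_charge(&t, &mc, 20)); /* bypasses       */
            fault_end(&t, true);
            return 0;
        }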
  3. 09 October 2013, 19 commits
  4. 28 September 2013, 1 commit
  5. 25 September 2013, 4 commits
    • sched: Prepare for per-cpu preempt_count · a233f112
      Authored by Peter Zijlstra
      When using per-cpu preempt_count variables we need to save/restore the
      preempt_count on context switch (into per task storage; for instance
      the old thread_info::preempt_count variable) because of
      PREEMPT_ACTIVE.
      
      However, this means that on fork() the preempt_count value of the last
      context switch gets copied; if there was a PREEMPT_ACTIVE switch right
      before cloning a child task, the child will also have PREEMPT_ACTIVE set
      and start its life with an extra PREEMPT_ACTIVE count.
      
      Therefore we need to make init_task_preempt_count() unconditional;
      this resets whatever preempt_count we inherited from our parent
      process.
      
      Doing so for !per-cpu implementations is harmless.
      
      For !PREEMPT_COUNT kernels we need to be careful not to start life
      with an increased preempt_count.
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/n/tip-4k0b7oy1rcdyzochwiixuwi9@git.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      a233f112
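      A standalone sketch of the idea, not the kernel's implementation: the
      per-CPU preempt_count is parked in per-task storage across a context
      switch, so a naive copy on fork() could leak PREEMPT_ACTIVE into the
      child. Only init_task_preempt_count() is a name taken from the changelog;
      everything else here is an illustrative stand-in.

        /* Build: cc -std=c99 -Wall fork_preempt.c */
        #include <stdio.h>

        #define PREEMPT_ACTIVE  0x10000000  /* illustrative value */
        #define PREEMPT_ENABLED 0           /* count of 0 == preemptible */

        struct task {
            int saved_preempt_count;    /* per-task slot the per-cpu count is
                                           parked in across a context switch */
        };

        static int cpu_preempt_count;   /* stand-in for the per-cpu variable */

        /* Context switch: save the current count into 'prev', load 'next's. */
        static void switch_preempt_count(struct task *prev, struct task *next)
        {
            prev->saved_preempt_count = cpu_preempt_count;
            cpu_preempt_count = next->saved_preempt_count;
        }

        /*
         * fork(): the child is a copy of the parent and may therefore carry
         * PREEMPT_ACTIVE from the parent's last switch; reset unconditionally.
         */
        static void init_task_preempt_count(struct task *child)
        {
            child->saved_preempt_count = PREEMPT_ENABLED;
        }

        int main(void)
        {
            struct task parent = { .saved_preempt_count = PREEMPT_ACTIVE | 1 };
            struct task child = parent;         /* naive copy on fork */

            init_task_preempt_count(&child);    /* drop the inherited count */
            switch_preempt_count(&parent, &child);
            printf("child runs with preempt_count=%d\n", cpu_preempt_count);
            return 0;
        }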
    • sched: Extract the basic add/sub preempt_count modifiers · bdb43806
      Authored by Peter Zijlstra
      Rewrite the preempt_count macros in order to extract the 3 basic
      preempt_count value modifiers:
      
        __preempt_count_add()
        __preempt_count_sub()
      
      and the new:
      
        __preempt_count_dec_and_test()
      
      And since we're at it anyway, replace the unconventional
      $op_preempt_count names with the more conventional preempt_count_$op.
      
      Since these basic operators are equivalent to the previous _notrace()
      variants, do away with the _notrace() versions.
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/n/tip-ewbpdbupy9xpsjhg960zwbv8@git.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      bdb43806
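      A standalone model of the three basic modifiers named above. The real
      kernel versions operate on thread_info or a per-cpu variable and are
      arch-specific; the bodies below are illustrative only.

        /* Build: cc -std=c99 -Wall preempt_ops.c */
        #include <stdbool.h>
        #include <stdio.h>

        static int preempt_count;   /* stand-in for the per-task/per-cpu count */

        static inline void __preempt_count_add(int val) { preempt_count += val; }
        static inline void __preempt_count_sub(int val) { preempt_count -= val; }

        /* Decrement and report whether we dropped to zero, i.e. preemption is
         * enabled again and a reschedule check is worth doing. */
        static inline bool __preempt_count_dec_and_test(void)
        {
            return --preempt_count == 0;
        }

        /* The conventional preempt_count_$op names layered on top. */
        #define preempt_count_inc()     __preempt_count_add(1)
        #define preempt_count_dec()     __preempt_count_sub(1)

        int main(void)
        {
            preempt_count_inc();
            preempt_count_inc();
            preempt_count_dec();
            printf("dropped to zero: %d\n", __preempt_count_dec_and_test());
            return 0;
        }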
    • sched: Add NEED_RESCHED to the preempt_count · f27dde8d
      Authored by Peter Zijlstra
      In order to combine the preemption and need_resched test we need to
      fold the need_resched information into the preempt_count value.
      
      Since the NEED_RESCHED flag is set across CPUs, this needs to be an
      atomic operation.  However, we very much want to avoid making
      preempt_count atomic, so we keep the existing TIF_NEED_RESCHED
      infrastructure in place but test it at three sites and fold its value
      into preempt_count, namely:
      
       - resched_task() when setting TIF_NEED_RESCHED on the current task;
       - scheduler_ipi(): when resched_task() sets TIF_NEED_RESCHED on a
                          remote task it follows up with a reschedule IPI,
                          and we can modify the CPU-local preempt_count from
                          there;
       - cpu_idle_loop() for when resched_task() found tsk_is_polling().
      
      We use an inverted bitmask to indicate need_resched so that a 0 means
      both need_resched and !atomic.
      
      Also remove the barrier() in preempt_enable() between
      preempt_enable_no_resched() and preempt_check_resched() to avoid
      having to reload the preemption value and allow the compiler to use
      the flags of the previous decrement.  I couldn't come up with any sane
      reason for this barrier() to be there, as preempt_enable_no_resched()
      already has a barrier() before doing the decrement.
      Suggested-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/n/tip-7a7m5qqbn5pmwnd4wko9u6da@git.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      f27dde8d
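      A standalone model of the inverted-bit folding described above. Only the
      "raw value 0 means need_resched and !atomic" rule comes from the
      changelog; the bit value and helper names here are illustrative
      assumptions, not the kernel's definitions.

        /* Build: cc -std=c99 -Wall need_resched_fold.c */
        #include <stdbool.h>
        #include <stdio.h>

        #define PREEMPT_NEED_RESCHED 0x80000000u   /* illustrative bit choice */

        /* The stored value keeps the bit *set* while no reschedule is needed,
         * so a raw value of 0 means "need_resched and not atomic" in one test. */
        static unsigned int raw_count = PREEMPT_NEED_RESCHED;

        static void fold_need_resched(void)  { raw_count &= ~PREEMPT_NEED_RESCHED; }
        static void clear_need_resched(void) { raw_count |=  PREEMPT_NEED_RESCHED; }

        static void preempt_disable(void)    { raw_count += 1; }

        /* On enable, one comparison against zero covers both conditions. */
        static bool preempt_enable_should_resched(void)
        {
            return --raw_count == 0;
        }

        int main(void)
        {
            preempt_disable();
            fold_need_resched();    /* folded in from TIF_NEED_RESCHED */
            printf("resched on enable: %d\n", preempt_enable_should_resched());
            clear_need_resched();
            return 0;
        }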
    • sched, idle: Fix the idle polling state logic · ea811747
      Authored by Peter Zijlstra
      Mike reported that commit 7d1a9417 ("x86: Use generic idle loop")
      regressed several workloads and caused excessive reschedule
      interrupts.
      
      The patch in question failed to notice that the x86 code had an
      inverted sense of the polling state versus the new generic code (x86:
      default polling, generic: default !polling).
      
      Fix the two prominent x86 mwait based idle drivers and introduce a few
      new generic polling helpers (fixing the wrong smp_mb__after_clear_bit
      usage).
      
      Also switch the idle routines to using tif_need_resched(), which is an
      immediate TIF_NEED_RESCHED test, as opposed to need_resched(), which
      will end up being slightly different.
      Reported-by: Mike Galbraith <bitbucket@online.de>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: lenb@kernel.org
      Cc: tglx@linutronix.de
      Link: http://lkml.kernel.org/n/tip-nc03imb0etuefmzybzj7sprf@git.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      ea811747
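      A standalone model of the polling idea, using C11 atomics in place of the
      kernel's thread-info flag operations. tif_need_resched() is named in the
      changelog; the set/clear-and-test helpers are hypothetical stand-ins for
      the "new generic polling helpers" it mentions, and the ordering comes from
      the seq_cst read-modify-write rather than the kernel's barriers.

        /* Build: cc -std=c11 -Wall idle_polling.c */
        #include <stdatomic.h>
        #include <stdbool.h>
        #include <stdio.h>

        #define TIF_NEED_RESCHED   (1u << 0)
        #define TIF_POLLING_NRFLAG (1u << 1)    /* "I am polling, skip the IPI" */

        static _Atomic unsigned int thread_flags;

        /* An immediate test of the TIF flag, as opposed to need_resched(). */
        static bool tif_need_resched(void)
        {
            return atomic_load(&thread_flags) & TIF_NEED_RESCHED;
        }

        /* Enter the polling idle state and re-check need_resched afterwards;
         * the RMW's ordering keeps the flag set visible before the re-check. */
        static bool set_polling_and_test(void)
        {
            atomic_fetch_or(&thread_flags, TIF_POLLING_NRFLAG);
            return tif_need_resched();
        }

        static bool clr_polling_and_test(void)
        {
            atomic_fetch_and(&thread_flags, ~TIF_POLLING_NRFLAG);
            return tif_need_resched();
        }

        int main(void)
        {
            if (!set_polling_and_test())
                puts("mwait/poll until TIF_NEED_RESCHED is set");
            if (clr_polling_and_test())
                puts("reschedule");
            return 0;
        }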
  6. 20 September 2013, 2 commits
  7. 13 September 2013, 2 commits
    • mm: memcg: do not trap chargers with full callstack on OOM · 3812c8c8
      Authored by Johannes Weiner
      The memcg OOM handling is incredibly fragile and can deadlock.  When a
      task fails to charge memory, it invokes the OOM killer and loops right
      there in the charge code until it succeeds.  Meanwhile, any other task
      that enters the charge path at this point will go to a waitqueue right
      then and there and sleep until the OOM situation is resolved.  The problem
      is that these tasks may hold filesystem locks and the mmap_sem; locks that
      the selected OOM victim may need in order to exit.
      
      For example, in one reported case, the task invoking the OOM killer was
      about to charge a page cache page during a write(), which holds the
      i_mutex.  The OOM killer selected a task that was just entering truncate()
      and trying to acquire the i_mutex:
      
      OOM invoking task:
        mem_cgroup_handle_oom+0x241/0x3b0
        mem_cgroup_cache_charge+0xbe/0xe0
        add_to_page_cache_locked+0x4c/0x140
        add_to_page_cache_lru+0x22/0x50
        grab_cache_page_write_begin+0x8b/0xe0
        ext3_write_begin+0x88/0x270
        generic_file_buffered_write+0x116/0x290
        __generic_file_aio_write+0x27c/0x480
        generic_file_aio_write+0x76/0xf0           # takes ->i_mutex
        do_sync_write+0xea/0x130
        vfs_write+0xf3/0x1f0
        sys_write+0x51/0x90
        system_call_fastpath+0x18/0x1d
      
      OOM kill victim:
        do_truncate+0x58/0xa0              # takes i_mutex
        do_last+0x250/0xa30
        path_openat+0xd7/0x440
        do_filp_open+0x49/0xa0
        do_sys_open+0x106/0x240
        sys_open+0x20/0x30
        system_call_fastpath+0x18/0x1d
      
      The OOM handling task will retry the charge indefinitely while the OOM
      killed task is not releasing any resources.
      
      A similar scenario can happen when the kernel OOM killer for a memcg is
      disabled and a userspace task is in charge of resolving OOM situations.
      In this case, ALL tasks that enter the OOM path will be made to sleep on
      the OOM waitqueue and wait for userspace to free resources or increase
      the group's limit.  But a userspace OOM handler is prone to deadlock
      itself on the locks held by the waiting tasks.  For example one of the
      sleeping tasks may be stuck in a brk() call with the mmap_sem held for
      writing but the userspace handler, in order to pick an optimal victim,
      may need to read files from /proc/<pid>, which tries to acquire the same
      mmap_sem for reading and deadlocks.
      
      This patch changes the way tasks behave after detecting a memcg OOM and
      makes sure nobody loops or sleeps with locks held:
      
      1. When OOMing in a user fault, invoke the OOM killer and restart the
         fault instead of looping on the charge attempt.  This way, the OOM
         victim can not get stuck on locks the looping task may hold.
      
      2. When OOMing in a user fault but somebody else is handling it
         (either the kernel OOM killer or a userspace handler), don't go to
         sleep in the charge context.  Instead, remember the OOMing memcg in
         the task struct and then fully unwind the page fault stack with
         -ENOMEM.  pagefault_out_of_memory() will then call back into the
         memcg code to check if the -ENOMEM came from the memcg, and then
         either put the task to sleep on the memcg's OOM waitqueue or just
         restart the fault.  The OOM victim can no longer get stuck on any
         lock a sleeping task may hold.
      
      Debugged by Michal Hocko.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: azurIt <azurit@pobox.sk>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      3812c8c8
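      A standalone model of the two-phase flow described above: record the
      memcg at charge time, unwind with -ENOMEM, and synchronize only once all
      locks are dropped. The types, the names charge()/oom_synchronize(), and
      the waitqueue stand-in are hypothetical simplifications, not kernel code.

        /* Build: cc -std=c99 -Wall memcg_unwind.c */
        #include <stdbool.h>
        #include <stdio.h>

        struct memcg { bool oom_in_progress; };

        struct task {
            struct memcg *memcg_in_oom;     /* recorded at charge time */
        };

        /* Charge path: do NOT sleep or loop here, locks may still be held.
         * Remember the memcg and let the fault unwind with -ENOMEM. */
        static int charge(struct task *t, struct memcg *mc)
        {
            if (mc->oom_in_progress) {
                t->memcg_in_oom = mc;
                return -1;                  /* -ENOMEM, full stack unwind */
            }
            return 0;
        }

        /* Called once the fault has unwound and every lock has been dropped. */
        static bool oom_synchronize(struct task *t)
        {
            struct memcg *mc = t->memcg_in_oom;

            if (!mc)
                return false;               /* the -ENOMEM was not a memcg OOM */

            /* Stand-in for sleeping on the memcg's OOM waitqueue until the
             * handler (kernel OOM killer or userspace) resolves the situation. */
            mc->oom_in_progress = false;

            t->memcg_in_oom = NULL;
            return true;                    /* caller restarts the fault */
        }

        int main(void)
        {
            struct memcg mc = { .oom_in_progress = true };
            struct task t = { 0 };

            if (charge(&t, &mc) < 0)        /* unwind with -ENOMEM */
                printf("restart fault: %d\n", oom_synchronize(&t));
            return 0;
        }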
    • mm: memcg: enable memcg OOM killer only for user faults · 519e5247
      Authored by Johannes Weiner
      System calls and kernel faults (uaccess, gup) can handle an out of memory
      situation gracefully and just return -ENOMEM.
      
      Enable the memcg OOM killer only for user faults, where it's really the
      only option available.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: azurIt <azurit@pobox.sk>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      519e5247
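      A small standalone sketch of the gate: only a charge made under a user
      fault arms the OOM machinery, while syscall/gup paths just see -ENOMEM.
      The per-task flag and the charge helper are hypothetical stand-ins.

        /* Build: cc -std=c99 -Wall user_fault_only.c */
        #include <stdbool.h>
        #include <stdio.h>

        static bool task_in_user_fault;     /* hypothetical per-task flag */

        static int try_charge(void)
        {
            bool over_limit = true;         /* pretend the memcg is at its limit */

            if (!over_limit)
                return 0;
            if (!task_in_user_fault)
                return -1;                  /* syscall/gup path: plain -ENOMEM */
            puts("user fault: memcg OOM killer is the only option");
            return -1;
        }

        int main(void)
        {
            printf("write() charge -> %d\n", try_charge());

            /* Only the user-fault path arms the OOM machinery around charges. */
            task_in_user_fault = true;
            printf("fault charge   -> %d\n", try_charge());
            task_in_user_fault = false;
            return 0;
        }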
  8. 12 September 2013, 1 commit
  9. 23 August 2013, 1 commit
  10. 14 August 2013, 1 commit
  11. 30 July 2013, 1 commit
    • freezer: set PF_SUSPEND_TASK flag on tasks that call freeze_processes · 2b44c4db
      Authored by Colin Cross
      Calling freeze_processes sets a global flag that will cause any
      process that calls try_to_freeze to enter the refrigerator.  It
      skips sending a signal to the current task, but if the current
      task ever hits try_to_freeze, all threads will be frozen and the
      system will deadlock.
      
      Set a new flag, PF_SUSPEND_TASK, on the task that calls
      freeze_processes.  The flag notifies the freezer that the thread
      is involved in suspend and should not be frozen.  Also add a
      WARN_ON in thaw_processes if the caller does not have the
      PF_SUSPEND_TASK flag set to catch if a different task calls
      thaw_processes than the one that called freeze_processes, leaving
      a task with PF_SUSPEND_TASK permanently set on it.
      
      Threads spawned off from a task with PF_SUSPEND_TASK set (which
      swsusp does) will also have PF_SUSPEND_TASK set, preventing them
      from freezing while they are helping with suspend, but they need
      to be dead by the time suspend is triggered, otherwise they may
      run when userspace is expected to be frozen.  Add a WARN_ON in
      thaw_processes if more than one thread has the PF_SUSPEND_TASK
      flag set.
      Reported-and-tested-by: Michael Leun <lkml20130126@newton.leun.net>
      Signed-off-by: Colin Cross <ccross@android.com>
      Signed-off-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
      2b44c4db
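      A standalone model of the PF_SUSPEND_TASK handshake described above; the
      flag value, the simplified freezer state, and the assert() standing in
      for the kernel's WARN_ON are illustrative, not the kernel's freezer code.

        /* Build: cc -std=c99 -Wall pf_suspend_task.c */
        #include <assert.h>
        #include <stdbool.h>
        #include <stdio.h>

        #define PF_SUSPEND_TASK (1u << 0)   /* illustrative bit value */

        struct task { unsigned int flags; };

        static bool freezing_enabled;

        static void freeze_processes(struct task *curr)
        {
            curr->flags |= PF_SUSPEND_TASK; /* never freeze the suspending task */
            freezing_enabled = true;
        }

        /* try_to_freeze() honours the flag, so the suspend task cannot deadlock. */
        static bool try_to_freeze(struct task *t)
        {
            return freezing_enabled && !(t->flags & PF_SUSPEND_TASK);
        }

        static void thaw_processes(struct task *curr)
        {
            freezing_enabled = false;
            /* Catch a different task thawing than the one that froze. */
            assert(curr->flags & PF_SUSPEND_TASK);  /* WARN_ON in the kernel */
            curr->flags &= ~PF_SUSPEND_TASK;
        }

        int main(void)
        {
            struct task suspender = { 0 }, other = { 0 };

            freeze_processes(&suspender);
            printf("suspender freezes: %d, other freezes: %d\n",
                   try_to_freeze(&suspender), try_to_freeze(&other));
            thaw_processes(&suspender);
            return 0;
        }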
  12. 23 July 2013, 1 commit
    • sched: Implement smarter wake-affine logic · 62470419
      Authored by Michael Wang
      The wake-affine scheduler feature is currently always trying to pull
      the wakee close to the waker. In theory this should be beneficial if
      the waker's CPU caches hot data for the wakee, and it's also beneficial
      in the extreme ping-pong high context switch rate case.
      
      Testing shows it can benefit hackbench up to 15%.
      
      However, the feature is somewhat blind and some workloads, such as
      pgbench, suffer from it.  It is also algorithmically expensive.
      
      Testing shows it can damage pgbench up to 50% - far more than the
      benefit it brings in the best case.
      
      So wake-affine should be smarter and it should realize when to
      stop its thankless effort at trying to find a suitable CPU to wake on.
      
      This patch introduces 'wakee_flips', which will be increased each
      time the task flips (switches) its wakee target.
      
      So a high 'wakee_flips' value means the task has more than one
      wakee, and the bigger the number, the higher the wakeup frequency.
      
      Now, when deciding whether to pull, pay attention to the wakee's
      'wakee_flips': pulling a task with a high 'wakee_flips' may still benefit
      the wakee, but it also implies that the waker will face heavy competition
      later; how heavy depends on the story behind 'wakee_flips', and the waker
      suffers either way.
      
      Furthermore, if the waker also has a high 'wakee_flips', multiple tasks
      rely on it, and the waker's higher latency will hurt all of them, so
      pulling the wakee looks like a bad deal.
      
      Thus, as 'waker->wakee_flips / wakee->wakee_flips' grows, the expected
      cost of pulling grows with it.
      
      The patch therefore helps the wake-affine feature to stop its pulling
      work when:
      
      	wakee->wakee_flips > factor &&
      	waker->wakee_flips > (factor * wakee->wakee_flips)
      
      The 'factor' here is the number of CPUs in the current CPU's NUMA node,
      so a bigger node leads to more pulling, since the condition for giving
      up becomes harder to meet.
      
      After applying the patch, pgbench shows improvements of up to 40% and no regressions.
      
      Tested on a 12-CPU x86 server with tip 3.10.0-rc7.
      
      The percentages in the final column highlight the areas with the biggest wins;
      all other areas improved as well:
      
      	pgbench		    base	smart
      
      	| db_size | clients |  tps  |	|  tps  |
      	+---------+---------+-------+   +-------+
      	| 22 MB   |       1 | 10598 |   | 10796 |
      	| 22 MB   |       2 | 21257 |   | 21336 |
      	| 22 MB   |       4 | 41386 |   | 41622 |
      	| 22 MB   |       8 | 51253 |   | 57932 |
      	| 22 MB   |      12 | 48570 |   | 54000 |
      	| 22 MB   |      16 | 46748 |   | 55982 | +19.75%
      	| 22 MB   |      24 | 44346 |   | 55847 | +25.93%
      	| 22 MB   |      32 | 43460 |   | 54614 | +25.66%
      	| 7484 MB |       1 |  8951 |   |  9193 |
      	| 7484 MB |       2 | 19233 |   | 19240 |
      	| 7484 MB |       4 | 37239 |   | 37302 |
      	| 7484 MB |       8 | 46087 |   | 50018 |
      	| 7484 MB |      12 | 42054 |   | 48763 |
      	| 7484 MB |      16 | 40765 |   | 51633 | +26.66%
      	| 7484 MB |      24 | 37651 |   | 52377 | +39.11%
      	| 7484 MB |      32 | 37056 |   | 51108 | +37.92%
      	| 15 GB   |       1 |  8845 |   |  9104 |
      	| 15 GB   |       2 | 19094 |   | 19162 |
      	| 15 GB   |       4 | 36979 |   | 36983 |
      	| 15 GB   |       8 | 46087 |   | 49977 |
      	| 15 GB   |      12 | 41901 |   | 48591 |
      	| 15 GB   |      16 | 40147 |   | 50651 | +26.16%
      	| 15 GB   |      24 | 37250 |   | 52365 | +40.58%
      	| 15 GB   |      32 | 36470 |   | 50015 | +37.14%
      Signed-off-by: Michael Wang <wangyun@linux.vnet.ibm.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/51D50057.9000809@linux.vnet.ibm.com
      [ Improved the changelog. ]
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
      62470419
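      A standalone sketch of the wakee_flips bookkeeping and the stop condition
      quoted in the changelog. The names record_wakee()/wake_wide() are
      hypothetical, the numbers are made up, and 'factor' is passed in rather
      than derived from the NUMA topology.

        /* Build: cc -std=c99 -Wall wakee_flips.c */
        #include <stdbool.h>
        #include <stdio.h>

        struct task {
            struct task *last_wakee;
            unsigned int wakee_flips;   /* incremented when the wakee target flips */
        };

        /* Waker-side bookkeeping on every wakeup. */
        static void record_wakee(struct task *waker, struct task *wakee)
        {
            if (waker->last_wakee != wakee) {
                waker->last_wakee = wakee;
                waker->wakee_flips++;
            }
        }

        /* Stop the affine pull when both sides flip a lot and the waker does so
         * by a factor-sized margin (factor = CPUs in the current NUMA node). */
        static bool wake_wide(const struct task *waker, const struct task *wakee,
                              unsigned int factor)
        {
            return wakee->wakee_flips > factor &&
                   waker->wakee_flips > factor * wakee->wakee_flips;
        }

        int main(void)
        {
            unsigned int factor = 12;
            struct task waker = { .wakee_flips = 400 }; /* wakes many different tasks */
            struct task busy  = { .wakee_flips = 20 };
            struct task quiet = { .wakee_flips = 1 };

            record_wakee(&waker, &busy);    /* another flip: now 401 */

            printf("skip pull of busy wakee:  %d\n", wake_wide(&waker, &busy, factor));
            printf("skip pull of quiet wakee: %d\n", wake_wide(&waker, &quiet, factor));
            return 0;
        }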
  13. 18 July 2013, 2 commits
  14. 11 July 2013, 1 commit
  15. 10 July 2013, 1 commit
  16. 04 July 2013, 1 commit