1. 20 Mar, 2014 2 commits
    • sched: declare pid_alive as inline · 80e0b6e8
      Richard Guy Briggs committed
      We accidentally declared pid_alive without any extern/inline connotation.
      Some platforms were fine with this, some like ia64 and mips were very angry.
      If the function is inline, the prototype should be inline!
      
      on ia64:
      include/linux/sched.h:1718: warning: 'pid_alive' declared inline after
      being called
      Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
      Signed-off-by: Eric Paris <eparis@redhat.com>
    • pid: get pid_t ppid of task in init_pid_ns · ad36d282
      Richard Guy Briggs committed
      Added the functions task_ppid_nr_ns() and task_ppid_nr() to abstract the lookup
      of the PPID (real_parent's pid_t) of a process, including rcu locking, in the
      arbitrary and init_pid_ns.
      This provides an alternative to sys_getppid(), which is relative to the child
      process' pid namespace.
      
      (informed by ebiederman's 6c621b7e)
      Cc: stable@vger.kernel.org
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
  2. 11 Dec, 2013 2 commits
    • sched/fair: Rework sched_fair time accounting · 9dbdb155
      Peter Zijlstra committed
      Christian suffers from a bad BIOS that wrecks his i5's TSC sync. This
      results in him occasionally seeing time going backwards - which
      crashes the scheduler ...
      
      Most of our time accounting can actually handle that except the most
      common one; the tick time update of sched_fair.
      
      There is a further problem with that code; previously we assumed that
      because we get a tick every TICK_NSEC our time delta could never
      exceed 32bits and math was simpler.
      
      However, ever since Frederic managed to get NO_HZ_FULL merged; this is
      no longer the case since now a task can run for a long time indeed
      without getting a tick. It only takes about ~4.2 seconds to overflow
      our u32 in nanoseconds.
      
      This means we not only need to better deal with time going backwards;
      but also means we need to be able to deal with large deltas.
      
      This patch reworks the entire code and uses mul_u64_u32_shr() as
      proposed by Andy a long while ago.
      
      We express our virtual time scale factor in a u32 multiplier and shift
      right and the 32bit mul_u64_u32_shr() implementation reduces to a
      single 32x32->64 multiply if the time delta is still short (common
      case).
      
      For 64bit a 64x64->128 multiply can be used if ARCH_SUPPORTS_INT128.
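
A minimal userspace sketch of the multiply-and-shift described above, assuming shift values of at most 32. The names `mul_u64_u32_shr_split` and `mul_u64_u32_shr_wide` are made up for this illustration; the kernel's own helper may differ in detail.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Split variant for 32-bit machines: compute (a * mul) >> shift using
 * only 32x32->64 multiplies. A u32 of nanoseconds wraps after 2^32 ns
 * (about 4.29 s), so the high half is non-zero only for long deltas.
 */
static inline uint64_t mul_u64_u32_shr_split(uint64_t a, uint32_t mul,
                                             unsigned int shift)
{
    uint32_t al = (uint32_t)a;          /* low 32 bits of the delta */
    uint32_t ah = (uint32_t)(a >> 32);  /* high 32 bits, usually 0 */
    uint64_t ret;

    assert(shift <= 32);                /* limit of this sketch */

    /* Common case: a short delta needs just this one multiply. */
    ret = ((uint64_t)al * mul) >> shift;

    /* Long delta: add the contribution of the high half. */
    if (ah)
        ret += ((uint64_t)ah * mul) << (32 - shift);

    return ret;
}

#ifdef __SIZEOF_INT128__
/* Wide variant: one 64x64->128 multiply where the compiler has __int128. */
static inline uint64_t mul_u64_u32_shr_wide(uint64_t a, uint32_t mul,
                                            unsigned int shift)
{
    return (uint64_t)(((unsigned __int128)a * mul) >> shift);
}
#endif
```

With a multiplier of `1u << 31` and a shift of 32 the scale factor is one half, so a 5,000,000,000 ns delta (which no longer fits in a u32) scales to 2,500,000,000.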
      Reported-and-Tested-by: Christian Engelmayer <cengelma@gmx.at>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: fweisbec@gmail.com
      Cc: Paul Turner <pjt@google.com>
      Cc: Stanislaw Gruszka <sgruszka@redhat.com>
      Cc: Andy Lutomirski <luto@amacapital.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20131118172706.GI3866@twins.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched: Remove PREEMPT_NEED_RESCHED from generic code · ba1f14fb
      Peter Zijlstra committed
      While hunting a preemption issue with Alexander, Ben noticed that the
      currently generic PREEMPT_NEED_RESCHED stuff is horribly broken for
      load-store architectures.
      
      We currently rely on the IPI to fold TIF_NEED_RESCHED into
      PREEMPT_NEED_RESCHED, but when this IPI lands while we already have
      a load for the preempt-count but before the store, the store will erase
      the PREEMPT_NEED_RESCHED change.
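
The lost update described above can be replayed deterministically in userspace. This is only a sketch: the bit position, the function names, and the use of a plain "set" bit (rather than the inverted encoding the kernel uses) are assumptions made to keep the example short.

```c
#include <assert.h>
#include <stdint.h>

#define NEED_RESCHED_BIT 0x80000000u  /* hypothetical bit position */

static uint32_t preempt_count;        /* stand-in for the real counter */

/* What the reschedule IPI would do: fold the flag into the counter. */
static void ipi_fold_need_resched(void)
{
    preempt_count |= NEED_RESCHED_BIT;
}

/*
 * preempt_count += val the way a load-store architecture executes it:
 * load into a register, modify, store back. When ipi_in_window is
 * non-zero, the IPI is simulated between the load and the store.
 */
static uint32_t add_load_store(uint32_t val, int ipi_in_window)
{
    uint32_t reg;

    assert((val & NEED_RESCHED_BIT) == 0);  /* callers only touch the count */

    reg = preempt_count;            /* load */
    if (ipi_in_window)
        ipi_fold_need_resched();    /* interrupt modifies memory */
    reg += val;                     /* modify the stale register copy */
    preempt_count = reg;            /* store: overwrites the IPI's update */

    return preempt_count;
}
```

Calling `add_load_store(1, 1)` on a zeroed counter leaves it at 1 with the flag gone, whereas folding the flag before the load keeps it. A single read-modify-write instruction on memory has no such window between load and store.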
      
      The current preempt-count only works on load-store archs because
      interrupts are assumed to be completely balanced wrt their preempt_count
      fiddling; the previous preempt_count load will match the preempt_count
      state after the interrupt and therefore nothing gets lost.
      
      This patch removes the PREEMPT_NEED_RESCHED usage from generic code and
      pushes it into x86 arch code; the generic code goes back to relying on
      TIF_NEED_RESCHED.
      
      Boot tested on x86_64 and compile tested on ppc64.
      Reported-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Reported-and-Tested-by: Alexander Graf <agraf@suse.de>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Link: http://lkml.kernel.org/r/20131128132641.GP10022@twins.programming.kicks-ass.net
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  3. 20 Nov, 2013 1 commit
  4. 14 Nov, 2013 1 commit
  5. 13 Nov, 2013 2 commits
    • exec/ptrace: fix get_dumpable() incorrect tests · d049f74f
      Kees Cook committed
      The get_dumpable() return value is not boolean.  Most users of the
      function actually want to be testing for non-SUID_DUMP_USER(1) rather than
      SUID_DUMP_DISABLE(0).  The SUID_DUMP_ROOT(2) is also considered a
      protected state.  Almost all places did this correctly, excepting the two
      places fixed in this patch.
      
      Wrong logic:
          if (dumpable == SUID_DUMP_DISABLE) { /* be protective */ }
              or
          if (dumpable == 0) { /* be protective */ }
              or
          if (!dumpable) { /* be protective */ }
      
      Correct logic:
          if (dumpable != SUID_DUMP_USER) { /* be protective */ }
              or
          if (dumpable != 1) { /* be protective */ }
      
      Without this patch, if the system had set the sysctl fs/suid_dumpable=2, a
      user was able to ptrace attach to processes that had dropped privileges to
      that user.  (This may have been partially mitigated if Yama was enabled.)
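
A small standalone sketch of the two tests, using the three values named in the message. The helper names `must_protect_wrong` and `must_protect` are invented for this example and do not exist in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

#define SUID_DUMP_DISABLE 0  /* never dump */
#define SUID_DUMP_USER    1  /* dump as the user: the only unprotected state */
#define SUID_DUMP_ROOT    2  /* dump as root: still a protected state */

/* Buggy test: only treats the "disable" state as protected. */
static bool must_protect_wrong(int dumpable)
{
    return dumpable == SUID_DUMP_DISABLE;
}

/* Fixed test: everything except SUID_DUMP_USER is protected. */
static bool must_protect(int dumpable)
{
    assert(dumpable >= SUID_DUMP_DISABLE && dumpable <= SUID_DUMP_ROOT);
    return dumpable != SUID_DUMP_USER;
}
```

The two helpers agree for the values 0 and 1 and differ only for `SUID_DUMP_ROOT`, which is the case the message describes with fs/suid_dumpable=2.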
      
      The macros have been moved into the file that declares get/set_dumpable(),
      which means things like the ia64 code can see them too.
      
      CVE-2013-2929
      Reported-by: Vasily Kulikov <segoon@openwall.com>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • sched: remove ARCH specific fpu_counter from task_struct · 27f69e68
      Vineet Gupta committed
      fpu_counter in task_struct was used only by sh/x86.  Both of these now
      carry it in ARCH specific thread_struct, hence this can now be removed
      from generic task_struct, shrinking it slightly for other arches.
      Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Paul Mundt <paul.mundt@gmail.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 06 Nov, 2013 1 commit
  7. 17 Oct, 2013 1 commit
    • mm: memcg: handle non-error OOM situations more gracefully · 49426420
      Johannes Weiner committed
      Commit 3812c8c8 ("mm: memcg: do not trap chargers with full
      callstack on OOM") assumed that only a few places that can trigger a
      memcg OOM situation do not return VM_FAULT_OOM, like optional page cache
      readahead.  But there are many more and it's impractical to annotate
      them all.
      
      First of all, we don't want to invoke the OOM killer when the failed
      allocation is gracefully handled, so defer the actual kill to the end of
      the fault handling as well.  This simplifies the code quite a bit for
      added bonus.
      
      Second, since a failed allocation might not be the abrupt end of the
      fault, the memcg OOM handler needs to be re-entrant until the fault
      finishes for subsequent allocation attempts.  If an allocation is
      attempted after the task already OOMed, allow it to bypass the limit so
      that it can quickly finish the fault and invoke the OOM killer.
      Reported-by: azurIt <azurit@pobox.sk>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Cc: Michal Hocko <mhocko@suse.cz>
      Cc: <stable@kernel.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  8. 09 Oct, 2013 19 commits
  9. 28 Sep, 2013 1 commit
  10. 25 Sep, 2013 4 commits
    • sched: Prepare for per-cpu preempt_count · a233f112
      Peter Zijlstra committed
      When using per-cpu preempt_count variables we need to save/restore the
      preempt_count on context switch (into per task storage; for instance
      the old thread_info::preempt_count variable) because of
      PREEMPT_ACTIVE.
      
      However, this means that on fork() the preempt_count value of the last
      context switch gets copied and if we had a PREEMPT_ACTIVE switch right
      before cloning a child task the child task will now too have
      PREEMPT_ACTIVE set and start its life with an extra PREEMPT_ACTIVE
      count.
      
      Therefore we need to make init_task_preempt_count() unconditional;
      this resets whatever preempt_count we inherited from our parent
      process.
      
      Doing so for !per-cpu implementations is harmless.
      
      For !PREEMPT_COUNT kernels we need to be careful not to start life
      with an increased preempt_count.
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/n/tip-4k0b7oy1rcdyzochwiixuwi9@git.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched: Extract the basic add/sub preempt_count modifiers · bdb43806
      Peter Zijlstra committed
      Rewrite the preempt_count macros in order to extract the 3 basic
      preempt_count value modifiers:
      
        __preempt_count_add()
        __preempt_count_sub()
      
      and the new:
      
        __preempt_count_dec_and_test()
      
      And since we're at it anyway, replace the unconventional
      $op_preempt_count names with the more conventional preempt_count_$op.
      
      Since these basic operators are equivalent to the previous _notrace()
      variants, do away with the _notrace() versions.
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/n/tip-ewbpdbupy9xpsjhg960zwbv8@git.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched: Add NEED_RESCHED to the preempt_count · f27dde8d
      Peter Zijlstra committed
      In order to combine the preemption and need_resched test we need to
      fold the need_resched information into the preempt_count value.
      
      Since the NEED_RESCHED flag is set across CPUs this needs to be an
      atomic operation, however we very much want to avoid making
      preempt_count atomic, therefore we keep the existing TIF_NEED_RESCHED
      infrastructure in place but at 3 sites test it and fold its value into
      preempt_count; namely:
      
       - resched_task() when setting TIF_NEED_RESCHED on the current task
       - scheduler_ipi() when resched_task() sets TIF_NEED_RESCHED on a
                         remote task it follows it up with a reschedule IPI
                         and we can modify the cpu local preempt_count from
                         there.
       - cpu_idle_loop() for when resched_task() found tsk_is_polling().
      
      We use an inverted bitmask to indicate need_resched so that a 0 means
      both need_resched and !atomic.
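
A userspace sketch of the inverted encoding, so that a single compare against zero answers both questions at once. The bit value `0x80000000` and all function names here are assumptions for illustration, not taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Inverted flag: bit SET means "no reschedule needed". */
#define PREEMPT_NEED_RESCHED 0x80000000u  /* assumed bit position */

/* Initial state: nesting count 0, no reschedule pending. */
static uint32_t preempt_count = PREEMPT_NEED_RESCHED;

static void set_need_resched(void)   { preempt_count &= ~PREEMPT_NEED_RESCHED; }
static void clear_need_resched(void) { preempt_count |= PREEMPT_NEED_RESCHED; }

static void count_add(uint32_t val)  { preempt_count += val; }

/*
 * Drop one nesting level and report whether the whole word is now zero,
 * i.e. the count reached 0 (not atomic) AND the inverted flag is clear
 * (a reschedule is pending).
 */
static bool count_dec_and_test(void)
{
    assert((preempt_count & ~PREEMPT_NEED_RESCHED) != 0);  /* no underflow */
    return --preempt_count == 0;
}
```

If the flag were stored non-inverted, the same decision would need two checks (count is zero, flag is set). With the inverted bit, the decrement and the test collapse into one operation on one word.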
      
      Also remove the barrier() in preempt_enable() between
      preempt_enable_no_resched() and preempt_check_resched() to avoid
      having to reload the preemption value and allow the compiler to use
      the flags of the previous decrement. I couldn't come up with any sane
      reason for this barrier() to be there as preempt_enable_no_resched()
      already has a barrier() before doing the decrement.
      Suggested-by: Ingo Molnar <mingo@kernel.org>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/n/tip-7a7m5qqbn5pmwnd4wko9u6da@git.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
    • sched, idle: Fix the idle polling state logic · ea811747
      Peter Zijlstra committed
      Mike reported that commit 7d1a9417 ("x86: Use generic idle loop")
      regressed several workloads and caused excessive reschedule
      interrupts.
      
      The patch in question failed to notice that the x86 code had an
      inverted sense of the polling state versus the new generic code (x86:
      default polling, generic: default !polling).
      
      Fix the two prominent x86 mwait based idle drivers and introduce a few
      new generic polling helpers (fixing the wrong smp_mb__after_clear_bit
      usage).
      
      Also switch the idle routines to using tif_need_resched() which is an
      immediate TIF_NEED_RESCHED test as opposed to need_resched which will
      end up being slightly different.
      Reported-by: Mike Galbraith <bitbucket@online.de>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: lenb@kernel.org
      Cc: tglx@linutronix.de
      Link: http://lkml.kernel.org/n/tip-nc03imb0etuefmzybzj7sprf@git.kernel.org
      Signed-off-by: Ingo Molnar <mingo@kernel.org>
  11. 20 Sep, 2013 2 commits
  12. 13 Sep, 2013 2 commits
    • mm: memcg: do not trap chargers with full callstack on OOM · 3812c8c8
      Johannes Weiner committed
      The memcg OOM handling is incredibly fragile and can deadlock.  When a
      task fails to charge memory, it invokes the OOM killer and loops right
      there in the charge code until it succeeds.  Comparably, any other task
      that enters the charge path at this point will go to a waitqueue right
      then and there and sleep until the OOM situation is resolved.  The problem
      is that these tasks may hold filesystem locks and the mmap_sem; locks that
      the selected OOM victim may need to exit.
      
      For example, in one reported case, the task invoking the OOM killer was
      about to charge a page cache page during a write(), which holds the
      i_mutex.  The OOM killer selected a task that was just entering truncate()
      and trying to acquire the i_mutex:
      
      OOM invoking task:
        mem_cgroup_handle_oom+0x241/0x3b0
        mem_cgroup_cache_charge+0xbe/0xe0
        add_to_page_cache_locked+0x4c/0x140
        add_to_page_cache_lru+0x22/0x50
        grab_cache_page_write_begin+0x8b/0xe0
        ext3_write_begin+0x88/0x270
        generic_file_buffered_write+0x116/0x290
        __generic_file_aio_write+0x27c/0x480
        generic_file_aio_write+0x76/0xf0           # takes ->i_mutex
        do_sync_write+0xea/0x130
        vfs_write+0xf3/0x1f0
        sys_write+0x51/0x90
        system_call_fastpath+0x18/0x1d
      
      OOM kill victim:
        do_truncate+0x58/0xa0              # takes i_mutex
        do_last+0x250/0xa30
        path_openat+0xd7/0x440
        do_filp_open+0x49/0xa0
        do_sys_open+0x106/0x240
        sys_open+0x20/0x30
        system_call_fastpath+0x18/0x1d
      
      The OOM handling task will retry the charge indefinitely while the OOM
      killed task is not releasing any resources.
      
      A similar scenario can happen when the kernel OOM killer for a memcg is
      disabled and a userspace task is in charge of resolving OOM situations.
      In this case, ALL tasks that enter the OOM path will be made to sleep on
      the OOM waitqueue and wait for userspace to free resources or increase
      the group's limit.  But a userspace OOM handler is prone to deadlock
      itself on the locks held by the waiting tasks.  For example one of the
      sleeping tasks may be stuck in a brk() call with the mmap_sem held for
      writing but the userspace handler, in order to pick an optimal victim,
      may need to read files from /proc/<pid>, which tries to acquire the same
      mmap_sem for reading and deadlocks.
      
      This patch changes the way tasks behave after detecting a memcg OOM and
      makes sure nobody loops or sleeps with locks held:
      
      1. When OOMing in a user fault, invoke the OOM killer and restart the
         fault instead of looping on the charge attempt.  This way, the OOM
         victim can not get stuck on locks the looping task may hold.
      
      2. When OOMing in a user fault but somebody else is handling it
         (either the kernel OOM killer or a userspace handler), don't go to
         sleep in the charge context.  Instead, remember the OOMing memcg in
         the task struct and then fully unwind the page fault stack with
         -ENOMEM.  pagefault_out_of_memory() will then call back into the
         memcg code to check if the -ENOMEM came from the memcg, and then
         either put the task to sleep on the memcg's OOM waitqueue or just
         restart the fault.  The OOM victim can no longer get stuck on any
         lock a sleeping task may hold.
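
The control flow of the two points above can be modelled with a toy in plain C. Everything here (the `toy_` types and functions, the lock counter, the way an OOM is "resolved") is hypothetical and only mirrors the shape of the idea: fail the charge without looping, remember the memcg in the task, unwind, and handle the OOM once no locks are held.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_memcg { long usage, limit; int oom_handled; };
struct toy_task  { struct toy_memcg *oom_memcg; int locks_held; };

/* Charge path: never loops or sleeps. On failure, record the memcg. */
static int toy_charge(struct toy_task *t, struct toy_memcg *m, long pages)
{
    if (m->usage + pages > m->limit) {
        t->oom_memcg = m;       /* remembered for the end of the fault */
        return -ENOMEM;
    }
    m->usage += pages;
    return 0;
}

/* Fault handler: takes a lock, charges, drops the lock on every path. */
static int toy_handle_fault(struct toy_task *t, struct toy_memcg *m)
{
    int ret;

    t->locks_held++;            /* stand-in for mmap_sem, i_mutex, ... */
    ret = toy_charge(t, m, 1);
    t->locks_held--;            /* fully unwound before returning */
    return ret;
}

/*
 * End-of-fault hook: runs with the fault stack unwound. Returns true
 * if the -ENOMEM came from a memcg and the fault should be restarted.
 */
static bool toy_pagefault_oom(struct toy_task *t)
{
    struct toy_memcg *m = t->oom_memcg;

    if (!m)
        return false;           /* not a memcg OOM */

    assert(t->locks_held == 0); /* safe to kill or sleep here */
    m->oom_handled++;           /* stand-in for the kill or the waitqueue */
    m->usage = 0;               /* pretend the victim released its memory */
    t->oom_memcg = NULL;
    return true;
}
```

A caller would run `toy_handle_fault()`, and on `-ENOMEM` call `toy_pagefault_oom()` and retry the fault if it returns true. The point of the structure is the assertion: OOM handling happens only after every lock taken during the fault has been released.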
      
      Debugged by Michal Hocko.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Reported-by: azurIt <azurit@pobox.sk>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • mm: memcg: enable memcg OOM killer only for user faults · 519e5247
      Johannes Weiner committed
      System calls and kernel faults (uaccess, gup) can handle an out of memory
      situation gracefully and just return -ENOMEM.
      
      Enable the memcg OOM killer only for user faults, where it's really the
      only option available.
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Acked-by: Michal Hocko <mhocko@suse.cz>
      Cc: David Rientjes <rientjes@google.com>
      Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
      Cc: azurIt <azurit@pobox.sk>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  13. 12 Sep, 2013 1 commit
  14. 23 Aug, 2013 1 commit