1. 23 6月, 2014 5 次提交
    • V
      hrtimer: Kick lowres dynticks targets on timer enqueue · 49a2a075
      Viresh Kumar 提交于
      In lowres mode, hrtimers are serviced by the tick instead of a clock
      event. It works well as long as the tick stays periodic but we must also
      make sure that the hrtimers are serviced in dynticks mode targets,
      pretty much like timer list timers do.
      
      Note that all dynticks modes are concerned: get_nohz_timer_target()
      tries not to return remote idle CPUs but there is nothing to prevent
      the elected target from entering dynticks idle mode until we lock its
      base. It's also prefectly legal to enqueue hrtimers on full dynticks CPU.
      
      So there are two requirements to correctly handle dynticks:
      
      1) On target's tick stop time, we must not delay the next tick further
         the next hrtimer.
      
      2) On hrtimer queue time. If the tick of the target is stopped, we must
         wake up that CPU such that it sees the new hrtimer and recalculate
         the next tick accordingly.
      
      The point 1 is well handled currently through get_nohz_timer_interrupt() and
      cmp_next_hrtimer_event().
      
      But the point 2 isn't handled at all.
      
      Fixing this is easy though as we have the necessary API ready for that.
      All we need is to call wake_up_nohz_cpu() on a target when a newly
      enqueued hrtimer requires tick rescheduling, like timer list timer do.
      Signed-off-by: NViresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Link: http://lkml.kernel.org/r/3d7ea08ce008698e26bd39fe10f55949391073ab.1403507178.git.viresh.kumar@linaro.orgSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      49a2a075
    • V
      hrtimer: Store cpu-number in struct hrtimer_cpu_base · cddd0248
      Viresh Kumar 提交于
      In lowres mode, hrtimers are serviced by the tick instead of a clock
      event. Now it works well as long as the tick stays periodic but we
      must also make sure that the hrtimers are serviced in dynticks mode.
      
      Part of that job consist in kicking a dynticks hrtimer target in order
      to make it reconsider the next tick to schedule to correctly handle the
      hrtimer's expiring time. And that part isn't handled by the hrtimers
      subsystem.
      
      To prepare for fixing this, we need __hrtimer_start_range_ns() to be
      able to resolve the CPU target associated to a hrtimer's object
      'cpu_base' so that the kick can be centralized there.
      
      So lets store it in the 'struct hrtimer_cpu_base' to resolve the CPU
      without overhead. It is set once at CPU's online notification.
      Signed-off-by: NViresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Link: http://lkml.kernel.org/r/1403393357-2070-4-git-send-email-fweisbec@gmail.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      cddd0248
    • V
      timer: Kick dynticks targets on mod_timer*() calls · 9f6d9baa
      Viresh Kumar 提交于
      When a timer is enqueued or modified on a dynticks target, that CPU
      must re-evaluate the next tick to service that timer.
      
      The tick re-evaluation is performed by an IPI kick on the target.
      Now while we correctly call wake_up_nohz_cpu() from add_timer_on(), the
      mod_timer*() API family doesn't support so well dynticks targets.
      
      The reason for this is likely that __mod_timer() isn't supposed to
      select an idle target for a timer, unless that target is the current
      CPU, in which case a dynticks idle kick isn't actually needed.
      
      But there is a small race window lurking behind that assumption: the
      elected target has all the time to turn dynticks idle between the call
      to get_nohz_timer_target() and the locking of its base. Hence a risk
      that we enqueue a timer on a dynticks idle destination without kicking
      it. As a result, the timer might be serviced too late in the future.
      
      Also a target elected by __mod_timer() can be in full dynticks mode
      and thus require to be kicked as well. And unlike idle dynticks, this
      concern both local and remote targets.
      
      To fix this whole issue, lets centralize the dynticks kick to
      internal_add_timer() so that it is well handled for all sort of timer
      enqueue. Even timer migration is concerned so that a full dynticks target
      is correctly kicked as needed when timers are migrating to it.
      Signed-off-by: NViresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Link: http://lkml.kernel.org/r/1403393357-2070-3-git-send-email-fweisbec@gmail.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      9f6d9baa
    • V
      timer: Store cpu-number in struct tvec_base · d6f93829
      Viresh Kumar 提交于
      Timers are serviced by the tick. But when a timer is enqueued on a
      dynticks target, we need to kick it in order to make it reconsider the
      next tick to schedule to correctly handle the timer's expiring time.
      
      Now while this kick is correctly performed for add_timer_on(), the
      mod_timer*() family has been a bit neglected.
      
      To prepare for fixing this, we need internal_add_timer() to be able to
      resolve the CPU target associated to a timer's object 'base' so that the
      kick can be centralized there.
      
      This can't be passed as an argument as not all the callers know the CPU
      number of a timer's base. So lets store it in the struct tvec_base to
      resolve the CPU without much overhead. It is set once for good at every
      CPU's first boot.
      Signed-off-by: NViresh Kumar <viresh.kumar@linaro.org>
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Link: http://lkml.kernel.org/r/1403393357-2070-2-git-send-email-fweisbec@gmail.comSigned-off-by: NThomas Gleixner <tglx@linutronix.de>
      d6f93829
    • T
      time/timers: Move all time(r) related files into kernel/time · 5cee9645
      Thomas Gleixner 提交于
      Except for Kconfig.HZ. That needs a separate treatment.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      5cee9645
  2. 12 6月, 2014 5 次提交
  3. 09 6月, 2014 1 次提交
  4. 07 6月, 2014 24 次提交
    • J
      sysctl: convert use of typedef ctl_table to struct ctl_table · 6f8fd1d7
      Joe Perches 提交于
      This typedef is unnecessary and should just be removed.
      Signed-off-by: NJoe Perches <joe@perches.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6f8fd1d7
    • F
      kernel/seccomp.c: kernel-doc warning fix · 119ce5c8
      Fabian Frederick 提交于
      + fix small typo
      Signed-off-by: NFabian Frederick <fabf@skynet.be>
      Cc: "David S. Miller" <davem@davemloft.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      119ce5c8
    • P
      ipc, kernel: clear whitespace · 46c0a8ca
      Paul McQuade 提交于
      trailing whitespace
      Signed-off-by: NPaul McQuade <paulmcquad@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      46c0a8ca
    • P
      ipc, kernel: use Linux headers · 7153e402
      Paul McQuade 提交于
      Use #include <linux/uaccess.h> instead of <asm/uaccess.h>
      Use #include <linux/types.h> instead of <asm/types.h>
      Signed-off-by: NPaul McQuade <paulmcquad@gmail.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      7153e402
    • F
      kernel/profile.c: use static const char instead of static char · f3da64d1
      Fabian Frederick 提交于
      schedstr, sleepstr and kvmstr are only used in strcmp & strlen
      Signed-off-by: NFabian Frederick <fabf@skynet.be>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f3da64d1
    • F
      kernel/profile.c: convert printk to pr_foo() · aba871f1
      Fabian Frederick 提交于
      Signed-off-by: NFabian Frederick <fabf@skynet.be>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      aba871f1
    • F
      kernel/user_namespace.c: kernel-doc/checkpatch fixes · 68a9a435
      Fabian Frederick 提交于
      -uid->gid
      -split some function declarations
      -if/then/else warning
      Signed-off-by: NFabian Frederick <fabf@skynet.be>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      68a9a435
    • K
      sysctl: allow for strict write position handling · f4aacea2
      Kees Cook 提交于
      When writing to a sysctl string, each write, regardless of VFS position,
      begins writing the string from the start.  This means the contents of
      the last write to the sysctl controls the string contents instead of the
      first:
      
        open("/proc/sys/kernel/modprobe", O_WRONLY)   = 1
        write(1, "AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"..., 4096) = 4096
        write(1, "/bin/true", 9)                = 9
        close(1)                                = 0
      
        $ cat /proc/sys/kernel/modprobe
        /bin/true
      
      Expected behaviour would be to have the sysctl be "AAAA..." capped at
      maxlen (in this case KMOD_PATH_LEN: 256), instead of truncating to the
      contents of the second write.  Similarly, multiple short writes would
      not append to the sysctl.
      
      The old behavior is unlike regular POSIX files enough that doing audits
      of software that interact with sysctls can end up in unexpected or
      dangerous situations.  For example, "as long as the input starts with a
      trusted path" turns out to be an insufficient filter, as what must also
      happen is for the input to be entirely contained in a single write
      syscall -- not a common consideration, especially for high level tools.
      
      This provides kernel.sysctl_writes_strict as a way to make this behavior
      act in a less surprising manner for strings, and disallows non-zero file
      position when writing numeric sysctls (similar to what is already done
      when reading from non-zero file positions).  For now, the default (0) is
      to warn about non-zero file position use, but retain the legacy
      behavior.  Setting this to -1 disables the warning, and setting this to
      1 enables the file position respecting behavior.
      
      [akpm@linux-foundation.org: fix build]
      [akpm@linux-foundation.org: move misplaced hunk, per Randy]
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f4aacea2
    • K
      sysctl: refactor sysctl string writing logic · 2ca9bb45
      Kees Cook 提交于
      Consolidate buffer length checking with new-line/end-of-line checking.
      Additionally, instead of reading user memory twice, just do the
      assignment during the loop.
      
      This change doesn't affect the potential races here.  It was already
      possible to read a sysctl that was in the middle of a write.  In both
      cases, the string will always be NULL terminated.  The pre-existing race
      remains a problem to be solved.
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      2ca9bb45
    • K
      sysctl: clean up char buffer arguments · f8808300
      Kees Cook 提交于
      When writing to a sysctl string, each write, regardless of VFS position,
      began writing the string from the start.  This meant the contents of the
      last write to the sysctl controlled the string contents instead of the
      first.
      
      This misbehavior was featured in an exploit against Chrome OS.  While
      it's not in itself a vulnerability, it's a weirdness that isn't on the
      mind of most auditors: "This filter looks correct, the first line
      written would not be meaningful to sysctl" doesn't apply here, since the
      size of the write and the contents of the final write are what matter
      when writing to sysctls.
      
      This adds the sysctl kernel.sysctl_writes_strict to control the write
      behavior.  The default (0) reports when VFS position is non-0 on a
      write, but retains legacy behavior, -1 disables the warning, and 1
      enables the position-respecting behavior.
      
      The long-term plan here is to wait for userspace to be fixed in response
      to the new warning and to then switch the default kernel behavior to the
      new position-respecting behavior.
      
      This patch (of 4):
      
      The char buffer arguments are needlessly cast in weird places.  Clean it
      up so things are easier to read.
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Cc: Randy Dunlap <rdunlap@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f8808300
    • F
      kernel/kexec.c: convert printk to pr_foo() · e1bebcf4
      Fabian Frederick 提交于
      + some pr_warning -> pr_warn and checkpatch warning fixes
      Signed-off-by: NFabian Frederick <fabf@skynet.be>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Vivek Goyal <vgoyal@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e1bebcf4
    • M
      kernel/panic.c: add "crash_kexec_post_notifiers" option for kdump after panic_notifers · f06e5153
      Masami Hiramatsu 提交于
      Add a "crash_kexec_post_notifiers" boot option to run kdump after
      running panic_notifiers and dump kmsg.  This can help rare situations
      where kdump fails because of unstable crashed kernel or hardware failure
      (memory corruption on critical data/code), or the 2nd kernel is already
      broken by the 1st kernel (it's a broken behavior, but who can guarantee
      that the "crashed" kernel works correctly?).
      
      Usage: add "crash_kexec_post_notifiers" to kernel boot option.
      
      Note that this actually increases risks of the failure of kdump.  This
      option should be set only if you worry about the rare case of kdump
      failure rather than increasing the chance of success.
      Signed-off-by: NMasami Hiramatsu <masami.hiramatsu.pt@hitachi.com>
      Acked-by: NMotohiro Kosaki <Motohiro.Kosaki@us.fujitsu.com>
      Acked-by: NVivek Goyal <vgoyal@redhat.com>
      Cc: Eric Biederman <ebiederm@xmission.com>
      Cc: Yoshihiro YUNOMAE <yoshihiro.yunomae.ez@hitachi.com>
      Cc: Satoru MORIYA <satoru.moriya.br@hitachi.com>
      Cc: Tomoki Sekiyama <tomoki.sekiyama@hds.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      f06e5153
    • S
      smp: print more useful debug info upon receiving IPI on an offline CPU · a219ccf4
      Srivatsa S. Bhat 提交于
      There is a longstanding problem related to CPU hotplug which causes IPIs
      to be delivered to offline CPUs, and the smp-call-function IPI handler
      code prints out a warning whenever this is detected.  Every once in a
      while this (usually harmless) warning gets reported on LKML, but so far
      it has not been completely fixed.  Usually the solution involves finding
      out the IPI sender and fixing it by adding appropriate synchronization
      with CPU hotplug.
      
      However, while going through one such internal bug reports, I found that
      there is a significant bug in the receiver side itself (more
      specifically, in stop-machine) that can lead to this problem even when
      the sender code is perfectly fine.  This patchset fixes that
      synchronization problem in the CPU hotplug stop-machine code.
      
      Patch 1 adds some additional debug code to the smp-call-function
      framework, to help debug such issues easily.
      
      Patch 2 modifies the stop-machine code to ensure that any IPIs that were
      sent while the target CPU was online, would be noticed and handled by
      that CPU without fail before it goes offline.  Thus, this avoids
      scenarios where IPIs are received on offline CPUs (as long as the sender
      uses proper hotplug synchronization).
      
      In fact, I debugged the problem by using Patch 1, and found that the
      payload of the IPI was always the block layer's trigger_softirq()
      function.  But I was not able to find anything wrong with the block
      layer code.  That's when I started looking at the stop-machine code and
      realized that there is a race-window which makes the IPI _receiver_ the
      culprit, not the sender.  Patch 2 fixes that race and hence this should
      put an end to most of the hard-to-debug IPI-to-offline-CPU issues.
      
      This patch (of 2):
      
      Today the smp-call-function code just prints a warning if we get an IPI
      on an offline CPU.  This info is sufficient to let us know that
      something went wrong, but often it is very hard to debug exactly who
      sent the IPI and why, from this info alone.
      
      In most cases, we get the warning about the IPI to an offline CPU,
      immediately after the CPU going offline comes out of the stop-machine
      phase and reenables interrupts.  Since all online CPUs participate in
      stop-machine, the information regarding the sender of the IPI is already
      lost by the time we exit the stop-machine loop.  So even if we dump the
      stack on each CPU at this point, we won't find anything useful since all
      of them will show the stack-trace of the stopper thread.  So we need a
      better way to figure out who sent the IPI and why.
      
      To achieve this, when we detect an IPI targeted to an offline CPU, loop
      through the call-single-data linked list and print out the payload
      (i.e., the name of the function which was supposed to be executed by the
      target CPU).  This would give us an insight as to who might have sent
      the IPI and help us debug this further.
      
      [akpm@linux-foundation.org: correctly suppress warning output on second and later occurrences]
      Signed-off-by: NSrivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Mel Gorman <mgorman@suse.de>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Borislav Petkov <bp@suse.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Mike Galbraith <mgalbraith@suse.de>
      Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      a219ccf4
    • O
      signals: change wait_for_helper() to use kernel_sigaction() · 76e0a6f4
      Oleg Nesterov 提交于
      Now that we have kernel_sigaction() we can change wait_for_helper() to
      use it and cleans up the code a bit.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      76e0a6f4
    • O
      signals: introduce kernel_sigaction() · b4e74264
      Oleg Nesterov 提交于
      Now that allow_signal() is really trivial we can unify it with
      disallow_signal().  Add the new helper, kernel_sigaction(), and
      reimplement allow_signal/disallow_signal as a trivial wrappers.
      
      This saves one EXPORT_SYMBOL() and the new helper can have more users.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b4e74264
    • O
      signals: disallow_signal() should flush the potentially pending signal · 580d34e4
      Oleg Nesterov 提交于
      disallow_signal() simply sets SIG_IGN, this is not enough and
      recalc_sigpending() is simply pointless because in can never change the
      state of TIF_SIGPENDING.
      
      If we ignore a signal, we also need to do flush_sigqueue_mask() for the
      case when this signal is pending, this way recalc_sigpending() can
      actually clear TIF_SIGPENDING and we do not "leak" the allocated
      siginfo's.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      580d34e4
    • O
      signals: kill the obsolete sigdelset() and recalc_sigpending() in allow_signal() · ec5955b8
      Oleg Nesterov 提交于
      allow_signal() does sigdelset(current->blocked) due to historic reason,
      previously it could be called by a daemonize()'ed kthread, and
      daemonize() played with current->blocked.
      
      Now that daemonize() has gone away we can remove sigdelset() and
      recalc_sigpending().  If a user really wants to unblock a signal, it
      must use sigprocmask() or set_current_block() explicitely.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ec5955b8
    • O
      signals: mv {dis,}allow_signal() from sched.h/exit.c to signal.[ch] · 0341729b
      Oleg Nesterov 提交于
      Move the declaration/definition of allow_signal/disallow_signal to
      signal.h/signal.c.  The new place is more logical and allows to use the
      static helpers in signal.c (see the next changes).
      
      While at it, make them return void and remove the valid_signal() check.
      Nobody checks the returned value, and in-kernel users must not pass the
      wrong signal number.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      0341729b
    • O
      signals: cleanup the usage of t/current in do_sigaction() · afe2b038
      Oleg Nesterov 提交于
      The usage of "task_struct *t" and "current" in do_sigaction() looks really
      annoying and chaotic.  Initially "t" is used as a cached value of current
      but not consistently, then it is reused as a loop variable and we have to
      use "current" again.
      
      Clean up this mess and also convert the code to use for_each_thread().
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      afe2b038
    • O
      signals: rename rm_from_queue_full() to flush_sigqueue_mask() · c09c1441
      Oleg Nesterov 提交于
      "rm_from_queue_full" looks ugly and misleading, especially now that
      rm_from_queue() has gone away.  Rename it to flush_sigqueue_mask(), this
      matches flush_sigqueue() we already have.
      
      Also remove the obsolete comment which explains the difference with
      rm_from_queue() we already killed.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      c09c1441
    • O
      signals: kill rm_from_queue(), change prepare_signal() to use for_each_thread() · 9490592f
      Oleg Nesterov 提交于
      rm_from_queue() doesn't make sense.  The only caller, prepare_signal(),
      can use rm_from_queue_full() with the same effect.
      
      While at it, change prepare_signal() to use for_each_thread() instead of
      do/while_each_thread.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9490592f
    • O
      signals: s/siginitset/sigemptyset/ in do_sigtimedwait() · 6114041a
      Oleg Nesterov 提交于
      Cosmetic, but siginitset(0) looks a bit strange, sigemptyset() is what
      do_sigtimedwait() needs.
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Al Viro <viro@ZenIV.linux.org.uk>
      Cc: David Woodhouse <dwmw2@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Geert Uytterhoeven <geert@linux-m68k.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Richard Weinberger <richard@nod.at>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Tejun Heo <tj@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      6114041a
    • O
      ptrace: task_clear_jobctl_trapping()->wake_up_bit() needs mb() · 650226bd
      Oleg Nesterov 提交于
      __wake_up_bit() checks waitqueue_active() and thus the caller needs mb()
      as wake_up_bit() documents, fix task_clear_jobctl_trapping().
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      650226bd
    • M
      ptrace: fix fork event messages across pid namespaces · 4e52365f
      Matthew Dempsky 提交于
      When tracing a process in another pid namespace, it's important for fork
      event messages to contain the child's pid as seen from the tracer's pid
      namespace, not the parent's.  Otherwise, the tracer won't be able to
      correlate the fork event with later SIGTRAP signals it receives from the
      child.
      
      We still risk a race condition if a ptracer from a different pid
      namespace attaches after we compute the pid_t value.  However, sending a
      bogus fork event message in this unlikely scenario is still a vast
      improvement over the status quo where we always send bogus fork event
      messages to debuggers in a different pid namespace than the forking
      process.
      Signed-off-by: NMatthew Dempsky <mdempsky@chromium.org>
      Acked-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Julien Tinnes <jln@chromium.org>
      Cc: Roland McGrath <mcgrathr@chromium.org>
      Cc: Jan Kratochvil <jan.kratochvil@redhat.com>
      Cc: <stable@vger.kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      4e52365f
  5. 06 6月, 2014 4 次提交
    • T
      futex: Make lookup_pi_state more robust · 54a21788
      Thomas Gleixner 提交于
      The current implementation of lookup_pi_state has ambigous handling of
      the TID value 0 in the user space futex.  We can get into the kernel
      even if the TID value is 0, because either there is a stale waiters bit
      or the owner died bit is set or we are called from the requeue_pi path
      or from user space just for fun.
      
      The current code avoids an explicit sanity check for pid = 0 in case
      that kernel internal state (waiters) are found for the user space
      address.  This can lead to state leakage and worse under some
      circumstances.
      
      Handle the cases explicit:
      
             Waiter | pi_state | pi->owner | uTID      | uODIED | ?
      
        [1]  NULL   | ---      | ---       | 0         | 0/1    | Valid
        [2]  NULL   | ---      | ---       | >0        | 0/1    | Valid
      
        [3]  Found  | NULL     | --        | Any       | 0/1    | Invalid
      
        [4]  Found  | Found    | NULL      | 0         | 1      | Valid
        [5]  Found  | Found    | NULL      | >0        | 1      | Invalid
      
        [6]  Found  | Found    | task      | 0         | 1      | Valid
      
        [7]  Found  | Found    | NULL      | Any       | 0      | Invalid
      
        [8]  Found  | Found    | task      | ==taskTID | 0/1    | Valid
        [9]  Found  | Found    | task      | 0         | 0      | Invalid
        [10] Found  | Found    | task      | !=taskTID | 0/1    | Invalid
      
       [1] Indicates that the kernel can acquire the futex atomically. We
           came came here due to a stale FUTEX_WAITERS/FUTEX_OWNER_DIED bit.
      
       [2] Valid, if TID does not belong to a kernel thread. If no matching
           thread is found then it indicates that the owner TID has died.
      
       [3] Invalid. The waiter is queued on a non PI futex
      
       [4] Valid state after exit_robust_list(), which sets the user space
           value to FUTEX_WAITERS | FUTEX_OWNER_DIED.
      
       [5] The user space value got manipulated between exit_robust_list()
           and exit_pi_state_list()
      
       [6] Valid state after exit_pi_state_list() which sets the new owner in
           the pi_state but cannot access the user space value.
      
       [7] pi_state->owner can only be NULL when the OWNER_DIED bit is set.
      
       [8] Owner and user space value match
      
       [9] There is no transient state which sets the user space TID to 0
           except exit_robust_list(), but this is indicated by the
           FUTEX_OWNER_DIED bit. See [4]
      
      [10] There is no transient state which leaves owner and user space
           TID out of sync.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Will Drewry <wad@chromium.org>
      Cc: Darren Hart <dvhart@linux.intel.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      54a21788
    • T
      futex: Always cleanup owner tid in unlock_pi · 13fbca4c
      Thomas Gleixner 提交于
      If the owner died bit is set at futex_unlock_pi, we currently do not
      cleanup the user space futex.  So the owner TID of the current owner
      (the unlocker) persists.  That's observable inconsistant state,
      especially when the ownership of the pi state got transferred.
      
      Clean it up unconditionally.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Will Drewry <wad@chromium.org>
      Cc: Darren Hart <dvhart@linux.intel.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      13fbca4c
    • T
      futex: Validate atomic acquisition in futex_lock_pi_atomic() · b3eaa9fc
      Thomas Gleixner 提交于
      We need to protect the atomic acquisition in the kernel against rogue
      user space which sets the user space futex to 0, so the kernel side
      acquisition succeeds while there is existing state in the kernel
      associated to the real owner.
      
      Verify whether the futex has waiters associated with kernel state.  If
      it has, return -EINVAL.  The state is corrupted already, so no point in
      cleaning it up.  Subsequent calls will fail as well.  Not our problem.
      
      [ tglx: Use futex_top_waiter() and explain why we do not need to try
        	restoring the already corrupted user space state. ]
      Signed-off-by: NDarren Hart <dvhart@linux.intel.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Will Drewry <wad@chromium.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b3eaa9fc
    • T
      futex-prevent-requeue-pi-on-same-futex.patch futex: Forbid uaddr == uaddr2 in... · e9c243a5
      Thomas Gleixner 提交于
      futex-prevent-requeue-pi-on-same-futex.patch futex: Forbid uaddr == uaddr2 in futex_requeue(..., requeue_pi=1)
      
      If uaddr == uaddr2, then we have broken the rule of only requeueing from
      a non-pi futex to a pi futex with this call.  If we attempt this, then
      dangling pointers may be left for rt_waiter resulting in an exploitable
      condition.
      
      This change brings futex_requeue() in line with futex_wait_requeue_pi()
      which performs the same check as per commit 6f7b0a2a ("futex: Forbid
      uaddr == uaddr2 in futex_wait_requeue_pi()")
      
      [ tglx: Compare the resulting keys as well, as uaddrs might be
        	different depending on the mapping ]
      
      Fixes CVE-2014-3153.
      
      Reported-by: Pinkie Pie
      Signed-off-by: NWill Drewry <wad@chromium.org>
      Signed-off-by: NKees Cook <keescook@chromium.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NDarren Hart <dvhart@linux.intel.com>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      e9c243a5
  6. 05 6月, 2014 1 次提交
    • R
      sched/fair: Fix tg_set_cfs_bandwidth() deadlock on rq->lock · 09dc4ab0
      Roman Gushchin 提交于
      tg_set_cfs_bandwidth() sets cfs_b->timer_active to 0 to
      force the period timer restart. It's not safe, because
      can lead to deadlock, described in commit 927b54fc:
      "__start_cfs_bandwidth calls hrtimer_cancel while holding rq->lock,
      waiting for the hrtimer to finish. However, if sched_cfs_period_timer
      runs for another loop iteration, the hrtimer can attempt to take
      rq->lock, resulting in deadlock."
      
      Three CPUs must be involved:
      
        CPU0               CPU1                         CPU2
        take rq->lock      period timer fired
        ...                take cfs_b lock
        ...                ...                          tg_set_cfs_bandwidth()
        throttle_cfs_rq()  release cfs_b lock           take cfs_b lock
        ...                distribute_cfs_runtime()     timer_active = 0
        take cfs_b->lock   wait for rq->lock            ...
        __start_cfs_bandwidth()
        {wait for timer callback
         break if timer_active == 1}
      
      So, CPU0 and CPU1 are deadlocked.
      
      Instead of resetting cfs_b->timer_active, tg_set_cfs_bandwidth can
      wait for period timer callbacks (ignoring cfs_b->timer_active) and
      restart the timer explicitly.
      Signed-off-by: NRoman Gushchin <klamm@yandex-team.ru>
      Reviewed-by: NBen Segall <bsegall@google.com>
      Signed-off-by: NPeter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/87wqdi9g8e.wl\%klamm@yandex-team.ru
      Cc: pjt@google.com
      Cc: chris.j.arges@canonical.com
      Cc: gregkh@linuxfoundation.org
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      09dc4ab0