1. 07 Jun 2014, 1 commit
    • rtmutex: Handle deadlock detection smarter · 3d5c9340
      Thomas Gleixner authored
      Even in the case when deadlock detection is not requested by the
      caller, we can detect deadlocks. Right now the code stops the lock
      chain walk and keeps the waiter enqueued, even on itself. Silly not to
      yell when such a scenario is detected and to keep the waiter enqueued.
      
      Return -EDEADLK unconditionally and handle it at the call sites.
      
      The futex calls return -EDEADLK. The non futex ones dequeue the
      waiter, throw a warning and put the task into a schedule loop.
      
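      A rough sketch of that call-site handling, reconstructed from the
      description above (helper names are assumptions, not the actual
      diff):

      	if (ret == -EDEADLK && !detect_deadlock) {
      		/*
      		 * Non-futex caller: dequeue the waiter instead of
      		 * leaving it enqueued on itself, yell, and park the
      		 * task.
      		 */
      		remove_waiter(lock, waiter);
      		WARN(1, "rtmutex deadlock detected\n");
      		while (1) {
      			set_current_state(TASK_INTERRUPTIBLE);
      			schedule();
      		}
      	}
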
      Tagged for stable as it makes the code more robust.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Brad Mouring <bmouring@ni.com>
      Link: http://lkml.kernel.org/r/20140605152801.836501969@linutronix.de
      Cc: stable@vger.kernel.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  2. 06 Jun 2014, 4 commits
    • futex: Make lookup_pi_state more robust · 54a21788
      Thomas Gleixner authored
      The current implementation of lookup_pi_state has ambiguous handling of
      the TID value 0 in the user space futex.  We can get into the kernel
      even if the TID value is 0, because either there is a stale waiters bit
      or the owner died bit is set or we are called from the requeue_pi path
      or from user space just for fun.
      
      The current code avoids an explicit sanity check for pid = 0 in case
      kernel internal state (waiters) is found for the user space
      address.  This can lead to state leakage and worse under some
      circumstances.
      
      Handle the cases explicitly (a rough sketch of the resulting checks
      follows the list below):
      
             Waiter | pi_state | pi->owner | uTID      | uODIED | ?
      
        [1]  NULL   | ---      | ---       | 0         | 0/1    | Valid
        [2]  NULL   | ---      | ---       | >0        | 0/1    | Valid
      
        [3]  Found  | NULL     | --        | Any       | 0/1    | Invalid
      
        [4]  Found  | Found    | NULL      | 0         | 1      | Valid
        [5]  Found  | Found    | NULL      | >0        | 1      | Invalid
      
        [6]  Found  | Found    | task      | 0         | 1      | Valid
      
        [7]  Found  | Found    | NULL      | Any       | 0      | Invalid
      
        [8]  Found  | Found    | task      | ==taskTID | 0/1    | Valid
        [9]  Found  | Found    | task      | 0         | 0      | Invalid
        [10] Found  | Found    | task      | !=taskTID | 0/1    | Invalid
      
       [1] Indicates that the kernel can acquire the futex atomically. We
           came here due to a stale FUTEX_WAITERS/FUTEX_OWNER_DIED bit.
      
       [2] Valid, if TID does not belong to a kernel thread. If no matching
           thread is found then it indicates that the owner TID has died.
      
       [3] Invalid. The waiter is queued on a non-PI futex.
      
       [4] Valid state after exit_robust_list(), which sets the user space
           value to FUTEX_WAITERS | FUTEX_OWNER_DIED.
      
       [5] The user space value got manipulated between exit_robust_list()
           and exit_pi_state_list()
      
       [6] Valid state after exit_pi_state_list() which sets the new owner in
           the pi_state but cannot access the user space value.
      
       [7] pi_state->owner can only be NULL when the OWNER_DIED bit is set.
      
       [8] Owner and user space value match
      
       [9] There is no transient state which sets the user space TID to 0
           except exit_robust_list(), but this is indicated by the
           FUTEX_OWNER_DIED bit. See [4]
      
      [10] There is no transient state which leaves owner and user space
           TID out of sync.
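
      A rough C sketch of the checks that the table and notes above
      translate into (uval is the user space futex word; names follow
      the changelog, not necessarily the final diff):

      	pid_t pid = uval & FUTEX_TID_MASK;

      	if (!pi_state->owner) {
      		/* [7] owner may only be NULL if OWNER_DIED is set */
      		if (!(uval & FUTEX_OWNER_DIED))
      			return -EINVAL;
      		/* [5] exit_robust_list() must have cleared the TID */
      		if (pid)
      			return -EINVAL;
      		/* [4] valid robust-exit state */
      	} else if (pid != task_pid_vnr(pi_state->owner)) {
      		/* [6] is the only valid mismatch: OWNER_DIED takeover */
      		if (pid || !(uval & FUTEX_OWNER_DIED))
      			return -EINVAL;	/* [9], [10] */
      	}
      	/* [8] owner and user space value match */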
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Will Drewry <wad@chromium.org>
      Cc: Darren Hart <dvhart@linux.intel.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • futex: Always cleanup owner tid in unlock_pi · 13fbca4c
      Thomas Gleixner authored
      If the owner died bit is set at futex_unlock_pi, we currently do not
      clean up the user space futex.  So the owner TID of the current owner
      (the unlocker) persists.  That is observably inconsistent state,
      especially when the ownership of the pi state got transferred.
      
      Clean it up unconditionally.
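
      A rough sketch of the unconditional cleanup, based on the
      changelog rather than the exact diff:

      	/*
      	 * Rewrite the user space value even when FUTEX_OWNER_DIED
      	 * is set, so the unlocker's TID does not linger there.
      	 */
      	newval = FUTEX_WAITERS | task_pid_vnr(new_owner);
      	if (cmpxchg_futex_value_locked(&curval, uaddr, uval, newval))
      		ret = -EFAULT;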
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Will Drewry <wad@chromium.org>
      Cc: Darren Hart <dvhart@linux.intel.com>
      Cc: stable@vger.kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • futex: Validate atomic acquisition in futex_lock_pi_atomic() · b3eaa9fc
      Thomas Gleixner authored
      We need to protect the atomic acquisition in the kernel against rogue
      user space which sets the user space futex to 0, so the kernel side
      acquisition succeeds while there is existing state in the kernel
      associated with the real owner.
      
      Verify whether the futex has waiters associated with kernel state.  If
      it has, return -EINVAL.  The state is corrupted already, so no point in
      cleaning it up.  Subsequent calls will fail as well.  Not our problem.
      
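      A minimal sketch of the added validation, assuming the
      futex_top_waiter() helper mentioned in the note below:

      	/*
      	 * The user space value claims "no owner", but the kernel
      	 * has waiter state for this futex: user space zeroed the
      	 * word. The state is corrupted, refuse to touch it.
      	 */
      	if (futex_top_waiter(hb, key))
      		return -EINVAL;
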
      [ tglx: Use futex_top_waiter() and explain why we do not need to try
        	restoring the already corrupted user space state. ]
      Signed-off-by: Darren Hart <dvhart@linux.intel.com>
      Cc: Kees Cook <keescook@chromium.org>
      Cc: Will Drewry <wad@chromium.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • futex: Forbid uaddr == uaddr2 in futex_requeue(..., requeue_pi=1) · e9c243a5
      Thomas Gleixner authored
      
      If uaddr == uaddr2, then we have broken the rule of only requeueing from
      a non-pi futex to a pi futex with this call.  If we attempt this, then
      dangling pointers may be left for rt_waiter resulting in an exploitable
      condition.
      
      This change brings futex_requeue() in line with futex_wait_requeue_pi()
      which performs the same check as per commit 6f7b0a2a ("futex: Forbid
      uaddr == uaddr2 in futex_wait_requeue_pi()")
      
      [ tglx: Compare the resulting keys as well, as uaddrs might be
        	different depending on the mapping ]
      
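      A sketch of the resulting checks in futex_requeue(), per the
      changelog and the note above (not the verbatim diff):

      	if (requeue_pi) {
      		/* refuse to requeue a futex onto itself */
      		if (uaddr1 == uaddr2)
      			return -EINVAL;
      	}

      	/* after resolving the keys: different uaddrs can still map
      	 * to the same futex, so compare the keys as well */
      	if (requeue_pi && match_futex(&key1, &key2)) {
      		ret = -EINVAL;
      		goto out_put_keys;
      	}
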
      Fixes CVE-2014-3153.
      
      Reported-by: Pinkie Pie
      Signed-off-by: Will Drewry <wad@chromium.org>
      Signed-off-by: Kees Cook <keescook@chromium.org>
      Cc: stable@vger.kernel.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Reviewed-by: Darren Hart <dvhart@linux.intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 03 Jun 2014, 1 commit
  4. 28 May 2014, 2 commits
    • rtmutex: Fix deadlock detector for real · 397335f0
      Thomas Gleixner authored
      The current deadlock detection logic does not work reliably due to the
      following early exit path:
      
      	/*
      	 * Drop out, when the task has no waiters. Note,
      	 * top_waiter can be NULL, when we are in the deboosting
      	 * mode!
      	 */
      	if (top_waiter && (!task_has_pi_waiters(task) ||
      			   top_waiter != task_top_pi_waiter(task)))
      		goto out_unlock_pi;
      
      So this not only exits when the task has no waiters, it also exits
      unconditionally when the current waiter is not the top priority waiter
      of the task.
      
      So in a nested locking scenario, it might abort the lock chain walk
      and therefore miss a potential deadlock.
      
      Simple fix: Continue the chain walk, when deadlock detection is
      enabled.
      
      We also avoid the whole enqueue if we detect the deadlock right away
      (A-A). It's an optimization, but it also prevents another waiter, who
      comes in after the detection and before the task has undone the
      damage, from observing the situation, detecting the deadlock and
      returning -EDEADLOCK, which would be wrong as the other task is not
      in a deadlock situation.
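
      Sketched against the quoted early exit, the fix roughly becomes
      (variable names approximate):

      	if (top_waiter) {
      		if (!task_has_pi_waiters(task))
      			goto out_unlock_pi;
      		/*
      		 * Abort the walk for a non-top waiter only when
      		 * deadlock detection is disabled.
      		 */
      		if (top_waiter != task_top_pi_waiter(task) &&
      		    !detect_deadlock)
      			goto out_unlock_pi;
      	}

      	/* and, before enqueueing the waiter at all (A-A case): */
      	if (owner == task)
      		return -EDEADLK;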
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/20140522031949.725272460@linutronix.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • powerpc, kexec: Fix "Processor X is stuck" issue during kexec from ST mode · 011e4b02
      Srivatsa S. Bhat authored
      If we try to perform a kexec when the machine is in ST (Single-Threaded) mode
      (ppc64_cpu --smt=off), the kexec operation doesn't succeed properly, and we
      get the following messages during boot:
      
      [    0.089866] POWER8 performance monitor hardware support registered
      [    0.089985] power8-pmu: PMAO restore workaround active.
      [    5.095419] Processor 1 is stuck.
      [   10.097933] Processor 2 is stuck.
      [   15.100480] Processor 3 is stuck.
      [   20.102982] Processor 4 is stuck.
      [   25.105489] Processor 5 is stuck.
      [   30.108005] Processor 6 is stuck.
      [   35.110518] Processor 7 is stuck.
      [   40.113369] Processor 9 is stuck.
      [   45.115879] Processor 10 is stuck.
      [   50.118389] Processor 11 is stuck.
      [   55.120904] Processor 12 is stuck.
      [   60.123425] Processor 13 is stuck.
      [   65.125970] Processor 14 is stuck.
      [   70.128495] Processor 15 is stuck.
      [   75.131316] Processor 17 is stuck.
      
      Note that only the sibling threads are stuck, while the primary threads (0, 8,
      16 etc) boot just fine. Looking closer at the previous step of kexec, we observe
      that kexec tries to wake up (bring online) the sibling threads of all the cores,
      before performing kexec:
      
      [ 9464.131231] Starting new kernel
      [ 9464.148507] kexec: Waking offline cpu 1.
      [ 9464.148552] kexec: Waking offline cpu 2.
      [ 9464.148600] kexec: Waking offline cpu 3.
      [ 9464.148636] kexec: Waking offline cpu 4.
      [ 9464.148671] kexec: Waking offline cpu 5.
      [ 9464.148708] kexec: Waking offline cpu 6.
      [ 9464.148743] kexec: Waking offline cpu 7.
      [ 9464.148779] kexec: Waking offline cpu 9.
      [ 9464.148815] kexec: Waking offline cpu 10.
      [ 9464.148851] kexec: Waking offline cpu 11.
      [ 9464.148887] kexec: Waking offline cpu 12.
      [ 9464.148922] kexec: Waking offline cpu 13.
      [ 9464.148958] kexec: Waking offline cpu 14.
      [ 9464.148994] kexec: Waking offline cpu 15.
      [ 9464.149030] kexec: Waking offline cpu 17.
      
      Instrumenting this piece of code revealed that the cpu_up() operation actually
      fails with -EBUSY. Thus, only the primary threads of all the cores are online
      during kexec, and hence this is a sure-shot recipe for disaster, as explained
      in commit e8e5c215 (powerpc/kexec: Fix orphaned offline CPUs across kexec),
      as well as in the comment above wake_offline_cpus().
      
      It turns out that cpu_up() was returning -EBUSY because the variable
      'cpu_hotplug_disabled' was set to 1; and this disabling of CPU hotplug was done
      by migrate_to_reboot_cpu() inside kernel_kexec().
      
      Now, migrate_to_reboot_cpu() was originally written with the assumption that
      any further code will not need to perform CPU hotplug, since we are anyway in
      the reboot path. However, kexec is clearly not such a case, since we depend on
      onlining CPUs, at least on powerpc.
      
      So re-enable cpu-hotplug after returning from migrate_to_reboot_cpu() in the
      kexec path, to fix this regression in kexec on powerpc.
      
      Also, wrap the cpu_up() in powerpc kexec code within a WARN_ON(), so that we
      can catch such issues more easily in the future.
      
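      A rough sketch of the two changes described above (the
      kernel_kexec() path and the powerpc wake-up loop):

      	/* kernel/kexec.c: kernel_kexec() */
      	migrate_to_reboot_cpu();
      	/*
      	 * migrate_to_reboot_cpu() disables CPU hotplug, but kexec
      	 * on powerpc still needs to online the sibling threads.
      	 */
      	cpu_hotplug_enable();

      	/* powerpc kexec code: make cpu_up() failures loud */
      	WARN_ON(cpu_up(cpu));
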
      Fixes: c97102ba (kexec: migrate to reboot cpu)
      Cc: stable@vger.kernel.org
      Signed-off-by: Srivatsa S. Bhat <srivatsa.bhat@linux.vnet.ibm.com>
      Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
  5. 22 May 2014, 7 commits
  6. 19 May 2014, 5 commits
    • perf: Fix a race between ring_buffer_detach() and ring_buffer_attach() · b69cf536
      Peter Zijlstra authored
      Alexander noticed that we use RCU iteration on rb->event_list but do
      not use list_{add,del}_rcu() to add/remove entries to that list, nor
      do we observe proper grace periods when re-using the entries.
      
      Merge ring_buffer_detach() into ring_buffer_attach() such that
      attaching to the NULL buffer is detaching.
      
      Furthermore, ensure that between any 'detach' and 'attach' of the same
      event we observe the required grace period, but only when strictly
      required. In effect this means that only ioctl(.request =
      PERF_EVENT_IOC_SET_OUTPUT) will wait for a grace period, while the
      normal initial attach and final detach will not be delayed.
      
      This patch should, I think, do the right thing under all
      circumstances; the 'normal' cases should never see the extra grace
      period, but the two cases:
      
       1) PERF_EVENT_IOC_SET_OUTPUT on an event which already has a
          ring_buffer set, will now observe the required grace period between
          removing itself from the old and attaching itself to the new buffer.
      
          This case is 'simple' in that both buffers are present in
          perf_event_set_output(); one could think an unconditional
          synchronize_rcu() would be sufficient; however...
      
       2) an event that has a buffer attached, the buffer is destroyed
          (munmap) and then the event is attached to a new/different buffer
          using PERF_EVENT_IOC_SET_OUTPUT.
      
          This case is more complex because the buffer destruction does:
            ring_buffer_attach(.rb = NULL)
          followed by the ioctl() doing:
            ring_buffer_attach(.rb = foo);
      
          and we still need to observe the grace period between these two
          calls due to us reusing the event->rb_entry list_head.
      
      In order to make 2 happen we use Paul's latest cond_synchronize_rcu()
      call.
      
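      A condensed sketch of the merged attach/detach path (shape per the
      changelog; the rcu_batches/rcu_pending bookkeeping names are
      assumptions):

      	static void ring_buffer_attach(struct perf_event *event,
      				       struct ring_buffer *rb)
      	{
      		if (event->rb) {
      			/* detach == attach to NULL: leave the old list */
      			list_del_rcu(&event->rb_entry);
      			event->rcu_batches = get_state_synchronize_rcu();
      			event->rcu_pending = 1;
      		}

      		if (rb) {
      			/* wait only if no grace period elapsed since
      			 * the detach */
      			if (event->rcu_pending) {
      				cond_synchronize_rcu(event->rcu_batches);
      				event->rcu_pending = 0;
      			}
      			list_add_rcu(&event->rb_entry, &rb->event_list);
      		}
      	}
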
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Stephane Eranian <eranian@google.com>
      Cc: Andi Kleen <ak@linux.intel.com>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Reported-by: Alexander Shishkin <alexander.shishkin@linux.intel.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/20140507123526.GD13658@twins.programming.kicks-ass.net
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • perf: Prevent false warning in perf_swevent_add · 39af6b16
      Jiri Olsa authored
      The perf cpu offline callback takes down all cpu context
      events and releases swhash->swevent_hlist.
      
      This could race with a task context software event being
      scheduled on this cpu via perf_swevent_add, while the cpu hotplug
      code has already cleaned up the event's data.
      
      The race happens in the gap between the cpu notifier code
      and the cpu being actually taken down. Note that only cpu
      ctx events are terminated in the perf cpu hotplug code.
      
      It's easily reproduced with:
        $ perf record -e faults perf bench sched pipe
      
      while putting one of the cpus offline:
        # echo 0 > /sys/devices/system/cpu/cpu1/online
      
      Console emits the following warning:
        WARNING: CPU: 1 PID: 2845 at kernel/events/core.c:5672 perf_swevent_add+0x18d/0x1a0()
        Modules linked in:
        CPU: 1 PID: 2845 Comm: sched-pipe Tainted: G        W    3.14.0+ #256
        Hardware name: Intel Corporation Montevina platform/To be filled by O.E.M., BIOS AMVACRB1.86C.0066.B00.0805070703 05/07/2008
         0000000000000009 ffff880077233ab8 ffffffff81665a23 0000000000200005
         0000000000000000 ffff880077233af8 ffffffff8104732c 0000000000000046
         ffff88007467c800 0000000000000002 ffff88007a9cf2a0 0000000000000001
        Call Trace:
         [<ffffffff81665a23>] dump_stack+0x4f/0x7c
         [<ffffffff8104732c>] warn_slowpath_common+0x8c/0xc0
         [<ffffffff8104737a>] warn_slowpath_null+0x1a/0x20
         [<ffffffff8110fb3d>] perf_swevent_add+0x18d/0x1a0
         [<ffffffff811162ae>] event_sched_in.isra.75+0x9e/0x1f0
         [<ffffffff8111646a>] group_sched_in+0x6a/0x1f0
         [<ffffffff81083dd5>] ? sched_clock_local+0x25/0xa0
         [<ffffffff811167e6>] ctx_sched_in+0x1f6/0x450
         [<ffffffff8111757b>] perf_event_sched_in+0x6b/0xa0
         [<ffffffff81117a4b>] perf_event_context_sched_in+0x7b/0xc0
         [<ffffffff81117ece>] __perf_event_task_sched_in+0x43e/0x460
         [<ffffffff81096f1e>] ? put_lock_stats.isra.18+0xe/0x30
         [<ffffffff8107b3c8>] finish_task_switch+0xb8/0x100
         [<ffffffff8166a7de>] __schedule+0x30e/0xad0
         [<ffffffff81172dd2>] ? pipe_read+0x3e2/0x560
         [<ffffffff8166b45e>] ? preempt_schedule_irq+0x3e/0x70
         [<ffffffff8166b45e>] ? preempt_schedule_irq+0x3e/0x70
         [<ffffffff8166b464>] preempt_schedule_irq+0x44/0x70
         [<ffffffff816707f0>] retint_kernel+0x20/0x30
         [<ffffffff8109e60a>] ? lockdep_sys_exit+0x1a/0x90
         [<ffffffff812a4234>] lockdep_sys_exit_thunk+0x35/0x67
         [<ffffffff81679321>] ? sysret_check+0x5/0x56
      
      Fix this by tracking the cpu hotplug state and emitting the WARN
      only if the current cpu is initialized properly.
      
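      Roughly, with an 'online' flag maintained by the hotplug callbacks
      (a sketch, not the verbatim diff):

      	head = find_swevent_head(swhash, event);
      	if (!head) {
      		/*
      		 * We can race with the cpu hotplug code tearing down
      		 * the swevent hash; only warn when this cpu is still
      		 * properly initialized.
      		 */
      		WARN_ON_ONCE(swhash->online);
      		return -EINVAL;
      	}
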
      Cc: Corey Ashford <cjashfor@linux.vnet.ibm.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: stable@vger.kernel.org
      Reported-by: Fengguang Wu <fengguang.wu@intel.com>
      Signed-off-by: Jiri Olsa <jolsa@redhat.com>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Link: http://lkml.kernel.org/r/1396861448-10097-1-git-send-email-jolsa@redhat.com
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • perf: Limit perf_event_attr::sample_period to 63 bits · 0819b2e3
      Peter Zijlstra authored
      Vince reported that using a large sample_period (one with bit 63 set)
      results in wreckage, since while the sample_period is fundamentally
      unsigned (negative periods don't make sense), the way we implement
      things very much relies on signed logic.
      
      So limit sample_period to 63 bits to avoid tripping over this.
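
      The check itself is tiny; roughly:

      	/* sample_period is u64, but the arithmetic relies on
      	 * signed logic: refuse periods with bit 63 set */
      	if (attr.sample_period & (1ULL << 63))
      		return -EINVAL;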
      Reported-by: Vince Weaver <vincent.weaver@maine.edu>
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/n/tip-p25fhunibl4y3qi0zuqmyf4b@git.kernel.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • futex: Prevent attaching to kernel threads · f0d71b3d
      Thomas Gleixner authored
      We happily allow userspace to declare a random kernel thread to be the
      owner of a user space PI futex.
      
      Found while analysing the fallout of Dave Jones' syscall fuzzer.
      
      We also should validate the thread group for private futexes and find
      some fast way to validate whether the "alleged" owner has RW access on
      the file which backs the SHM, but that's a separate issue.
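
      A sketch of the rejection, placed where the futex code resolves
      the alleged owner's TID (per the changelog):

      	p = futex_find_get_task(pid);
      	if (!p)
      		return -ESRCH;

      	/* user space may not declare a kernel thread the owner */
      	if (unlikely(p->flags & PF_KTHREAD)) {
      		put_task_struct(p);
      		return -EPERM;
      	}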
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Darren Hart <darren@dvhart.com>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Clark Williams <williams@redhat.com>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: Carlos ODonell <carlos@redhat.com>
      Cc: Jakub Jelinek <jakub@redhat.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Link: http://lkml.kernel.org/r/20140512201701.194824402@linutronix.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
    • futex: Add another early deadlock detection check · 866293ee
      Thomas Gleixner authored
      Dave Jones' trinity syscall fuzzer exposed an issue in the deadlock
      detection code of rtmutex:
        http://lkml.kernel.org/r/20140429151655.GA14277@redhat.com
      
      That underlying issue has been fixed with a patch to the rtmutex code,
      but the futex code must not call into rtmutex in that case because
          - it can detect that issue early
          - it avoids a different and more complex fixup for backing out
      
      If the user space variable got manipulated to 0x80000000, which means
      no lock holder but the waiters bit set, and an active pi_state in the
      kernel is found, we can figure out the recursive locking issue by
      looking at the pi_state owner. If that is the current task, then we
      can safely return -EDEADLK.
      
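      Roughly, the added check (a sketch following the description
      above):

      	/*
      	 * uval == 0x80000000: waiters bit set, no owner TID. If the
      	 * kernel pi_state says we already own the futex, this is
      	 * recursive locking: bail out before calling into rtmutex.
      	 */
      	if (!(uval & FUTEX_TID_MASK) && pi_state->owner == current)
      		return -EDEADLK;
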
      The check should have been added in commit 59fa6245 (futex: Handle
      futex_pi OWNER_DIED take over correctly) already, but I did not see
      the above issue caused by user space manipulation back then.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Dave Jones <davej@redhat.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Darren Hart <darren@dvhart.com>
      Cc: Davidlohr Bueso <davidlohr@hp.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Clark Williams <williams@redhat.com>
      Cc: Paul McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: Roland McGrath <roland@hack.frob.com>
      Cc: Carlos ODonell <carlos@redhat.com>
      Cc: Jakub Jelinek <jakub@redhat.com>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
      Link: http://lkml.kernel.org/r/20140512201701.097349971@linutronix.de
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@vger.kernel.org
  7. 13 May 2014, 3 commits
    • cgroup: fix rcu_read_lock() leak in update_if_frozen() · 36e9d2eb
      Tejun Heo authored
      While updating cgroup_freezer locking, 68fafb77d827 ("cgroup_freezer:
      replace freezer->lock with freezer_mutex") introduced a bug in
      update_if_frozen() where it returns with rcu_read_lock() held.  Fix it
      by adding rcu_read_unlock() before returning.
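
      Schematically (the loop condition is a placeholder):

      	rcu_read_lock();
      	css_for_each_child(pos, css) {
      		/* any early return must drop the read lock */
      		if (child_not_frozen) {
      			rcu_read_unlock();
      			return;
      		}
      	}
      	rcu_read_unlock();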
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Reported-by: kbuild test robot <fengguang.wu@intel.com>
    • cgroup_freezer: replace freezer->lock with freezer_mutex · e5ced8eb
      Tejun Heo authored
      After 96d365e0 ("cgroup: make css_set_lock a rwsem and rename it
      to css_set_rwsem"), css task iterators require a sleepable context as
      they may block on css_set_rwsem.  I missed that cgroup_freezer was
      iterating tasks under the IRQ-safe spinlock freezer->lock.  This
      leads to errors like the following on freezer state reads and
      transitions.
      
        BUG: sleeping function called from invalid context at /work
       /os/work/kernel/locking/rwsem.c:20
        in_atomic(): 0, irqs_disabled(): 0, pid: 462, name: bash
        5 locks held by bash/462:
         #0:  (sb_writers#7){.+.+.+}, at: [<ffffffff811f0843>] vfs_write+0x1a3/0x1c0
         #1:  (&of->mutex){+.+.+.}, at: [<ffffffff8126d78b>] kernfs_fop_write+0xbb/0x170
         #2:  (s_active#70){.+.+.+}, at: [<ffffffff8126d793>] kernfs_fop_write+0xc3/0x170
         #3:  (freezer_mutex){+.+...}, at: [<ffffffff81135981>] freezer_write+0x61/0x1e0
         #4:  (rcu_read_lock){......}, at: [<ffffffff81135973>] freezer_write+0x53/0x1e0
        Preemption disabled at:[<ffffffff81104404>] console_unlock+0x1e4/0x460
      
        CPU: 3 PID: 462 Comm: bash Not tainted 3.15.0-rc1-work+ #10
        Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
         ffff88000916a6d0 ffff88000e0a3da0 ffffffff81cf8c96 0000000000000000
         ffff88000e0a3dc8 ffffffff810cf4f2 ffffffff82388040 ffff880013aaf740
         0000000000000002 ffff88000e0a3de8 ffffffff81d05974 0000000000000246
        Call Trace:
         [<ffffffff81cf8c96>] dump_stack+0x4e/0x7a
         [<ffffffff810cf4f2>] __might_sleep+0x162/0x260
         [<ffffffff81d05974>] down_read+0x24/0x60
         [<ffffffff81133e87>] css_task_iter_start+0x27/0x70
         [<ffffffff8113584d>] freezer_apply_state+0x5d/0x130
         [<ffffffff81135a16>] freezer_write+0xf6/0x1e0
         [<ffffffff8112eb88>] cgroup_file_write+0xd8/0x230
         [<ffffffff8126d7b7>] kernfs_fop_write+0xe7/0x170
         [<ffffffff811f0756>] vfs_write+0xb6/0x1c0
         [<ffffffff811f121d>] SyS_write+0x4d/0xc0
         [<ffffffff81d08292>] system_call_fastpath+0x16/0x1b
      
      freezer->lock used to be used in hot paths but that time is long gone
      and there's no reason for the lock to be IRQ-safe spinlock or even
      per-cgroup.  In fact, given that a cgroup may contain a large
      number of tasks, it's not a good idea to iterate over them while
      holding IRQ-safe spinlock.
      
      Let's simplify locking by replacing the per-cgroup freezer->lock with
      a global freezer_mutex.  This also makes the comments explaining the
      intricacies of policy inheritance and the locking around it simpler,
      as the states are protected by a common mutex.
      
      The conversion is mostly straightforward.  The following points are
      worth mentioning.
      
      * freezer_css_online() no longer needs double locking.
      
      * freezer_attach() now performs propagation simply while holding
        freezer_mutex.  update_if_frozen() race no longer exists and the
        comment is removed.
      
      * freezer_fork() now tests whether the task is in the root cgroup using
        the new task_css_is_root() without doing rcu_read_lock/unlock().  If
        not, it grabs freezer_mutex and performs the operation.
      
      * freezer_read() and freezer_change_state() grab freezer_mutex across
        the whole operation and pin the css while iterating so that each
        descendant processing happens in sleepable context.
      
      Fixes: 96d365e0 ("cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem")
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
    • cgroup: introduce task_css_is_root() · 5024ae29
      Tejun Heo authored
      Determining the css of a task usually requires RCU read lock as that's
      the only thing which keeps the returned css accessible till its
      reference is acquired; however, testing whether a task belongs to the
      root can be performed without dereferencing the returned css by
      comparing the returned pointer against the root one in init_css_set[]
      which never changes.
      
      Implement task_css_is_root() which can be invoked in any context.
      This will be used by the scheduled cgroup_freezer change.
      
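      The resulting helper is small; a sketch close to the description:

      	static inline bool task_css_is_root(struct task_struct *task,
      					    int subsys_id)
      	{
      		/*
      		 * init_css_set's css pointers never change, so the
      		 * comparison needs no RCU read lock.
      		 */
      		return task_css_check(task, subsys_id, true) ==
      			init_css_set.subsys[subsys_id];
      	}
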
      v2: cgroup no longer supports modular controllers.  No need to export
          init_css_set.  Pointed out by Li.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Li Zefan <lizefan@huawei.com>
  8. 12 May 2014, 1 commit
  9. 08 May 2014, 1 commit
  10. 07 May 2014, 9 commits
  11. 06 May 2014, 1 commit
  12. 03 May 2014, 1 commit
    • tracing: Use rcu_dereference_sched() for trace event triggers · 561a4fe8
      Steven Rostedt (Red Hat) authored
      As trace event triggers are now part of the mainline kernel, I added
      my trace event trigger tests to the test suite I run on all my kernels.
      Now these tests get run under different config options, and one of
      those options is CONFIG_PROVE_RCU, which checks under lockdep that
      the rcu locking primitives are being used correctly. This triggered
      the following splat:
      
      ===============================
      [ INFO: suspicious RCU usage. ]
      3.15.0-rc2-test+ #11 Not tainted
      -------------------------------
      kernel/trace/trace_events_trigger.c:80 suspicious rcu_dereference_check() usage!
      
      other info that might help us debug this:
      
      rcu_scheduler_active = 1, debug_locks = 0
      4 locks held by swapper/1/0:
       #0:  ((&(&j_cdbs->work)->timer)){..-...}, at: [<ffffffff8104d2cc>] call_timer_fn+0x5/0x1be
       #1:  (&(&pool->lock)->rlock){-.-...}, at: [<ffffffff81059856>] __queue_work+0x140/0x283
       #2:  (&p->pi_lock){-.-.-.}, at: [<ffffffff8106e961>] try_to_wake_up+0x2e/0x1e8
       #3:  (&rq->lock){-.-.-.}, at: [<ffffffff8106ead3>] try_to_wake_up+0x1a0/0x1e8
      
      stack backtrace:
      CPU: 1 PID: 0 Comm: swapper/1 Not tainted 3.15.0-rc2-test+ #11
      Hardware name:                  /DG965MQ, BIOS MQ96510J.86A.0372.2006.0605.1717 06/05/2006
       0000000000000001 ffff88007e083b98 ffffffff819f53a5 0000000000000006
       ffff88007b0942c0 ffff88007e083bc8 ffffffff81081307 ffff88007ad96d20
       0000000000000000 ffff88007af2d840 ffff88007b2e701c ffff88007e083c18
      Call Trace:
       <IRQ>  [<ffffffff819f53a5>] dump_stack+0x4f/0x7c
       [<ffffffff81081307>] lockdep_rcu_suspicious+0x107/0x110
       [<ffffffff810ee51c>] event_triggers_call+0x99/0x108
       [<ffffffff810e8174>] ftrace_event_buffer_commit+0x42/0xa4
       [<ffffffff8106aadc>] ftrace_raw_event_sched_wakeup_template+0x71/0x7c
       [<ffffffff8106bcbf>] ttwu_do_wakeup+0x7f/0xff
       [<ffffffff8106bd9b>] ttwu_do_activate.constprop.126+0x5c/0x61
       [<ffffffff8106eadf>] try_to_wake_up+0x1ac/0x1e8
       [<ffffffff8106eb77>] wake_up_process+0x36/0x3b
       [<ffffffff810575cc>] wake_up_worker+0x24/0x26
       [<ffffffff810578bc>] insert_work+0x5c/0x65
       [<ffffffff81059982>] __queue_work+0x26c/0x283
       [<ffffffff81059999>] ? __queue_work+0x283/0x283
       [<ffffffff810599b7>] delayed_work_timer_fn+0x1e/0x20
       [<ffffffff8104d3a6>] call_timer_fn+0xdf/0x1be
       [<ffffffff8104d2cc>] ? call_timer_fn+0x5/0x1be
       [<ffffffff81059999>] ? __queue_work+0x283/0x283
       [<ffffffff8104d823>] run_timer_softirq+0x1a4/0x22f
       [<ffffffff8104696d>] __do_softirq+0x17b/0x31b
       [<ffffffff81046d03>] irq_exit+0x42/0x97
       [<ffffffff81a08db6>] smp_apic_timer_interrupt+0x37/0x44
       [<ffffffff81a07a2f>] apic_timer_interrupt+0x6f/0x80
       <EOI>  [<ffffffff8100a5d8>] ? default_idle+0x21/0x32
       [<ffffffff8100a5d6>] ? default_idle+0x1f/0x32
       [<ffffffff8100ac10>] arch_cpu_idle+0xf/0x11
       [<ffffffff8107b3a4>] cpu_startup_entry+0x1a3/0x213
       [<ffffffff8102a23c>] start_secondary+0x212/0x219
      
      The cause is that the triggers are protected by rcu_read_lock_sched() but
      the data is dereferenced with rcu_dereference() which expects it to
      be protected with rcu_read_lock(). The proper reference should be
      rcu_dereference_sched().
      
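      Schematically, the one-line change (the exact field lives in
      trace_events_trigger.c; shown here as data->filter):

      	/* the trigger list walk runs under rcu_read_lock_sched(),
      	 * so the matching dereference flavor must be used */
      	filter = rcu_dereference_sched(data->filter);
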
      Cc: Tom Zanussi <tom.zanussi@linux.intel.com>
      Cc: stable@vger.kernel.org # 3.14+
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
  13. 30 Apr 2014, 3 commits
    • timer: Prevent overflow in apply_slack · 98a01e77
      Jiri Bohac authored
      On architectures with sizeof(int) < sizeof(long), the
      computation of mask inside apply_slack() can be undefined if the
      computed bit is > 32.
      
      E.g. with: expires = 0xffffe6f5 and slack = 25, we get:
      
      expires_limit = 0x20000000e
      bit = 33
      mask = (1 << 33) - 1  /* undefined */
      
      On x86, mask becomes 1 and the slack is not applied properly.
      On s390, mask is -1, expires is set to 0 and the timer fires immediately.
      
      Use 1UL << bit to solve that issue.
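
      The fix itself:

      	/* force the shift to unsigned long width; 1 << 33 on a
      	 * 32-bit int is undefined */
      	mask = (1UL << bit) - 1;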
      Suggested-by: Deborah Townsend <dstownse@us.ibm.com>
      Signed-off-by: Jiri Bohac <jbohac@suse.cz>
      Cc: stable@vger.kernel.org
      Link: http://lkml.kernel.org/r/20140418152310.GA13654@midget.suse.cz
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • hrtimer: Prevent remote enqueue of leftmost timers · 012a45e3
      Leon Ma authored
      If a cpu is idle and starts an hrtimer which is not pinned on that
      same cpu, the nohz code might target the timer to a different cpu.
      
      In the case that we switch the cpu base of the timer we already have a
      sanity check in place, which determines whether the timer is earlier
      than the current leftmost timer on the target cpu. In that case we
      enqueue the timer on the current cpu because we cannot reprogram the
      clock event device on the target.
      
      If the timer's base is already the target CPU, we do not have this
      sanity check in place, so we enqueue the timer as the leftmost timer
      in the target cpu's rb tree, but we cannot reprogram the clock event
      device on the target cpu. So the timer expires late and subsequently
      prevents the reprogramming of the target cpu clock event device until
      the previously programmed event fires or a timer with an earlier
      expiry time gets enqueued on the target cpu itself.
      
      Add the same target check as we have for the switch base case and
      start the timer on the current cpu if it would become the leftmost
      timer on the target.
      
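      A sketch of the added branch in switch_hrtimer_base(), mirroring
      the existing switch-base check:

      	} else if (cpu != this_cpu &&
      		   hrtimer_check_target(timer, new_base)) {
      		/*
      		 * The timer would become leftmost on a remote cpu
      		 * whose clock event device we cannot reprogram:
      		 * retry with the current cpu instead.
      		 */
      		cpu = this_cpu;
      		goto again;
      	}
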
      [ tglx: Rewrote subject and changelog ]
      Signed-off-by: Leon Ma <xindong.ma@intel.com>
      Link: http://lkml.kernel.org/r/1398847391-5994-1-git-send-email-xindong.ma@intel.com
      Cc: stable@vger.kernel.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • hrtimer: Prevent all reprogramming if hang detected · 6c6c0d5a
      Stuart Hayes authored
      If the last hrtimer interrupt detected a hang it sets hang_detected=1
      and programs the clock event device with a delay to let the system
      make progress.
      
      If hang_detected == 1, we prevent reprogramming of the clock event
      device in hrtimer_reprogram() but not in hrtimer_force_reprogram().
      
      This can lead to the following situation:
      
      hrtimer_interrupt()
         hang_detected = 1;
         program ce device to Xms from now (hang delay)
      
      We have two timers pending:
         T1 expires 50ms from now
         T2 expires 5s from now
      
      Now T1 gets canceled, which causes hrtimer_force_reprogram() to be
      invoked, which in turn programs the clock event device to T2 (5
      seconds from now).
      
      Any hrtimer_start after that will not reprogram the hardware due to
      hang_detected still being set. So we effectively block all timers until
      the T2 event fires and cleans up the hang situation.
      
      Add a check for hang_detected to hrtimer_force_reprogram() which
      prevents the reprogramming of the hang delay in the hardware
      timer. The subsequent hrtimer_interrupt will resolve all outstanding
      issues.
      
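      The added check, roughly:

      	/* in hrtimer_force_reprogram(): do not touch the hardware
      	 * while a detected hang is being resolved; the subsequent
      	 * hrtimer interrupt cleans everything up */
      	if (cpu_base->hang_detected)
      		return;
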
      [ tglx: Rewrote subject and changelog and fixed up the comment in
        	hrtimer_force_reprogram() ]
      Signed-off-by: Stuart Hayes <stuart.w.hayes@gmail.com>
      Link: http://lkml.kernel.org/r/53602DC6.2060101@gmail.com
      Cc: stable@vger.kernel.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  14. 28 Apr 2014, 1 commit
    • ftrace/module: Hardcode ftrace_module_init() call into load_module() · a949ae56
      Steven Rostedt (Red Hat) authored
      A race exists between module loading and enabling of function tracer.
      
      	CPU 1				CPU 2
      	-----				-----
        load_module()
         module->state = MODULE_STATE_COMING
      
      				register_ftrace_function()
      				 mutex_lock(&ftrace_lock);
      				 ftrace_startup()
      				  update_ftrace_function();
      				   ftrace_arch_code_modify_prepare()
      				    set_all_module_text_rw();
      				   <enables-ftrace>
      				    ftrace_arch_code_modify_post_process()
      				     set_all_module_text_ro();
      
      				[ here all module text is set to RO,
      				  including the module that is
      				  loading!! ]
      
         blocking_notifier_call_chain(MODULE_STATE_COMING);
          ftrace_init_module()
      
           [ tries to modify code, but it's RO, and fails!
             ftrace_bug() is called]
      
      When this race happens, ftrace_bug() will produce a nasty warning and
      all of the function tracing features will be disabled until reboot.
      
      The simple solution is to treat module load the same way the core
      kernel is treated at boot: hardcode the ftrace function modification
      of converting calls to mcount into nops. This is done in init/main.c,
      and there's no reason it could not be done in load_module(). This
      gives better control of the changes and doesn't tie the state of the
      module to its notifiers as much. Ftrace is special, it needs to be
      treated as such.
      
      The reason this would work, is that the ftrace_module_init() would be
      called while the module is in MODULE_STATE_UNFORMED, which is ignored
      by the set_all_module_text_ro() call.
      
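      Schematically, the hardcoded call (placement within load_module()
      per the changelog):

      	static int load_module(struct load_info *info, ...)
      	{
      		...
      		/*
      		 * Convert the mcount calls to nops now, while
      		 * mod->state is still MODULE_STATE_UNFORMED and thus
      		 * ignored by set_all_module_text_ro()/rw().
      		 */
      		ftrace_module_init(mod);
      		...
      	}
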
      Link: http://lkml.kernel.org/r/1395637826-3312-1-git-send-email-indou.takao@jp.fujitsu.com
      Reported-by: Takao Indoh <indou.takao@jp.fujitsu.com>
      Acked-by: Rusty Russell <rusty@rustcorp.com.au>
      Cc: stable@vger.kernel.org # 2.6.38+
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>