1. 23 8月, 2009 7 次提交
    • P
      rcu: Merge preemptable-RCU functionality into hierarchical RCU · f41d911f
      Paul E. McKenney 提交于
      Create a kernel/rcutree_plugin.h file that contains definitions
      for preemptable RCU (or, under the #else branch of the #ifdef,
      empty definitions for the classic non-preemptable semantics).
      These definitions fit into plugins defined in kernel/rcutree.c
      for this purpose.
      
      This variant of preemptable RCU uses a new algorithm whose
      read-side expense is roughly that of classic hierarchical RCU
      under CONFIG_PREEMPT. This new algorithm's update-side expense
      is similar to that of classic hierarchical RCU, and, in absence
      of read-side preemption or blocking, is exactly that of classic
      hierarchical RCU.  Perhaps more important, this new algorithm
      has a much simpler implementation, saving well over 1,000 lines
      of code compared to mainline's implementation of preemptable
      RCU, which will hopefully be retired in favor of this new
      algorithm.
      
      The simplifications are obtained by maintaining per-task
      nesting state for running tasks, and using a simple
      lock-protected algorithm to handle accounting when tasks block
      within RCU read-side critical sections, making use of lessons
      learned while creating numerous user-level RCU implementations
      over the past 18 months.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: laijs@cn.fujitsu.com
      Cc: dipankar@in.ibm.com
      Cc: akpm@linux-foundation.org
      Cc: mathieu.desnoyers@polymtl.ca
      Cc: josht@linux.vnet.ibm.com
      Cc: dvhltc@us.ibm.com
      Cc: niv@us.ibm.com
      Cc: peterz@infradead.org
      Cc: rostedt@goodmis.org
      LKML-Reference: <12509746134003-git-send-email->
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      f41d911f
    • P
      rcu: Simplify rcu_pending()/rcu_check_callbacks() API · a157229c
      Paul E. McKenney 提交于
      All calls from outside RCU are of the form:
      
      	if (rcu_pending(cpu))
      		rcu_check_callbacks(cpu, user);
      
      This is silly, instead we put a call to rcu_pending() in
      rcu_check_callbacks(), and then make the outside calls be to
      rcu_check_callbacks().  This cuts down on the code a bit and
      also gives the compiler a better chance of optimizing.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: laijs@cn.fujitsu.com
      Cc: dipankar@in.ibm.com
      Cc: akpm@linux-foundation.org
      Cc: mathieu.desnoyers@polymtl.ca
      Cc: josht@linux.vnet.ibm.com
      Cc: dvhltc@us.ibm.com
      Cc: niv@us.ibm.com
      Cc: peterz@infradead.org
      Cc: rostedt@goodmis.org
      LKML-Reference: <125097461311-git-send-email->
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      a157229c
    • P
      rcu: Use debugfs_remove_recursive() simplify code. · 22f00b69
      Paul E. McKenney 提交于
      Suggested by Josh Triplett.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: laijs@cn.fujitsu.com
      Cc: dipankar@in.ibm.com
      Cc: akpm@linux-foundation.org
      Cc: mathieu.desnoyers@polymtl.ca
      Cc: josht@linux.vnet.ibm.com
      Cc: dvhltc@us.ibm.com
      Cc: niv@us.ibm.com
      Cc: peterz@infradead.org
      Cc: rostedt@goodmis.org
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      LKML-Reference: <12509746132173-git-send-email->
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      22f00b69
    • P
      rcu: Merge per-RCU-flavor initialization into pre-existing macro · 65cf8f86
      Paul E. McKenney 提交于
      Rename the RCU_DATA_PTR_INIT() macro to RCU_INIT_FLAVOR() and
      make it do the rcu_init_one() and rcu_boot_init_percpu_data()
      calls.  Merge the loop that was in the original macro with the
      loops that were in __rcu_init().
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: laijs@cn.fujitsu.com
      Cc: dipankar@in.ibm.com
      Cc: akpm@linux-foundation.org
      Cc: mathieu.desnoyers@polymtl.ca
      Cc: josht@linux.vnet.ibm.com
      Cc: dvhltc@us.ibm.com
      Cc: niv@us.ibm.com
      Cc: peterz@infradead.org
      Cc: rostedt@goodmis.org
      LKML-Reference: <12509746133916-git-send-email->
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      65cf8f86
    • P
      rcu: Fix online/offline indication for rcudata.csv trace file · 5699ed8f
      Paul E. McKenney 提交于
      The heading said "Online?", but the column had "Y" for offline
      CPUs and "N" for online CPUs.  Swap the "Y" and "N" to match
      the heading.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: laijs@cn.fujitsu.com
      Cc: dipankar@in.ibm.com
      Cc: akpm@linux-foundation.org
      Cc: mathieu.desnoyers@polymtl.ca
      Cc: josht@linux.vnet.ibm.com
      Cc: dvhltc@us.ibm.com
      Cc: niv@us.ibm.com
      Cc: peterz@infradead.org
      Cc: rostedt@goodmis.org
      LKML-Reference: <12509746132841-git-send-email->
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      5699ed8f
    • P
      rcu: Renamings to increase RCU clarity · d6714c22
      Paul E. McKenney 提交于
      Make RCU-sched, RCU-bh, and RCU-preempt be underlying
      implementations, with "RCU" defined in terms of one of the
      three.  Update the outdated rcu_qsctr_inc() names, as these
      functions no longer increment anything.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: laijs@cn.fujitsu.com
      Cc: dipankar@in.ibm.com
      Cc: akpm@linux-foundation.org
      Cc: mathieu.desnoyers@polymtl.ca
      Cc: josht@linux.vnet.ibm.com
      Cc: dvhltc@us.ibm.com
      Cc: niv@us.ibm.com
      Cc: peterz@infradead.org
      Cc: rostedt@goodmis.org
      LKML-Reference: <12509746132696-git-send-email->
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      d6714c22
    • P
      rcu: Move private definitions from include/linux/rcutree.h to kernel/rcutree.h · 9f77da9f
      Paul E. McKenney 提交于
      Some information hiding that makes it easier to merge
      preemptability into rcutree without descending into #include
      hell.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: laijs@cn.fujitsu.com
      Cc: dipankar@in.ibm.com
      Cc: akpm@linux-foundation.org
      Cc: mathieu.desnoyers@polymtl.ca
      Cc: josht@linux.vnet.ibm.com
      Cc: dvhltc@us.ibm.com
      Cc: niv@us.ibm.com
      Cc: peterz@infradead.org
      Cc: rostedt@goodmis.org
      LKML-Reference: <1250974613373-git-send-email->
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      9f77da9f
  2. 19 8月, 2009 1 次提交
    • P
      rcu: Delay rcu_barrier() wait until beginning of next CPU-hotunplug operation. · 1423cc03
      Paul E. McKenney 提交于
      Ingo Molnar reported this lockup:
      
       [  200.380003] Hangcheck: hangcheck value past margin!
       [  248.192003] INFO: task S99local:2974 blocked for more than 120 seconds.
       [  248.194532] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
       [  248.202330] S99local      D 0000000c  6256  2974   2687 0x00000000
       [  248.208929]  9c7ebe90 00000086 6b67ef8b 0000000c 9f25a610 81a69869 00000001 820b6990
       [  248.216123]  820b6990 820b6990 9c6e4c20 9c6e4eb4 82c78990 00000000 6b993559 0000000c
       [  248.220616]  9c7ebe90 8105f22a 9c6e4eb4 9c6e4c20 00000001 9c7ebe98 9c7ebeb4 81a65cb3
       [  248.229990] Call Trace:
       [  248.234049]  [<81a69869>] ? _spin_unlock_irqrestore+0x22/0x37
       [  248.239769]  [<8105f22a>] ? prepare_to_wait+0x48/0x4e
       [  248.244796]  [<81a65cb3>] rcu_barrier_cpu_hotplug+0xaa/0xc9
       [  248.250343]  [<8105f029>] ? autoremove_wake_function+0x0/0x38
       [  248.256063]  [<81062cf2>] notifier_call_chain+0x49/0x71
       [  248.261263]  [<81062da0>] raw_notifier_call_chain+0x11/0x13
       [  248.266809]  [<81a0b475>] _cpu_down+0x272/0x288
       [  248.271316]  [<81a0b4d5>] cpu_down+0x4a/0xa2
       [  248.275563]  [<81a0c48a>] store_online+0x2a/0x5e
       [  248.280156]  [<81a0c460>] ? store_online+0x0/0x5e
       [  248.284836]  [<814ddc35>] sysdev_store+0x20/0x28
       [  248.289429]  [<8112e403>] sysfs_write_file+0xb8/0xe3
       [  248.294369]  [<8112e34b>] ? sysfs_write_file+0x0/0xe3
       [  248.299396]  [<810e4c8f>] vfs_write+0x91/0x120
       [  248.303817]  [<810e4dc1>] sys_write+0x40/0x65
       [  248.308150]  [<81002d73>] sysenter_do_call+0x12/0x28
      
      This change moves an RCU grace period delay off of the
      critical path for CPU-hotunplug operations.
      
      Since RCU callback migration is only performed on
      CPU-hotunplug operations, and since the rcu_barrier() race is
      provoked only by consecutive CPU-hotunplug operations, it is
      not necessary to delay the end of a given CPU-hotunplug
      operation.
      
      We can instead choose to delay the beginning of the next
      CPU-hotunplug operation.
      Reported-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Josh Triplett <josht@linux.vnet.ibm.com>
      Cc: laijs@cn.fujitsu.com
      Cc: dipankar@in.ibm.com
      Cc: akpm@linux-foundation.org
      Cc: mathieu.desnoyers@polymtl.ca
      Cc: dvhltc@us.ibm.com
      Cc: niv@us.ibm.com
      Cc: peterz@infradead.org
      Cc: rostedt@goodmis.org
      Cc: hugh.dickins@tiscali.co.uk
      Cc: benh@kernel.crashing.org
      LKML-Reference: <20090819060614.GA14383@linux.vnet.ibm.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      1423cc03
  3. 16 8月, 2009 5 次提交
    • J
      rcu: Fix typo in rcu_irq_exit() comment header · 684ca5cc
      Josh Triplett 提交于
      Signed-off-by: NJosh Triplett <josh@joshtriplett.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: laijs@cn.fujitsu.com
      Cc: dipankar@in.ibm.com
      Cc: josht@linux.vnet.ibm.com
      Cc: akpm@linux-foundation.org
      Cc: mathieu.desnoyers@polymtl.ca
      Cc: dvhltc@us.ibm.com
      Cc: niv@us.ibm.com
      Cc: peterz@infradead.org
      Cc: rostedt@goodmis.org
      Cc: hugh.dickins@tiscali.co.uk
      Cc: benh@kernel.crashing.org
      LKML-Reference: <1250355231169-git-send-email->
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      684ca5cc
    • P
      rcu: Make rcupreempt_trace.c look at offline CPUs · b612ba80
      Paul E. McKenney 提交于
      Given that offline CPUs can now have non-zero counters, we need
      to dump counters for offline CPUs as well as for online CPUs.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: laijs@cn.fujitsu.com
      Cc: dipankar@in.ibm.com
      Cc: josht@linux.vnet.ibm.com
      Cc: akpm@linux-foundation.org
      Cc: mathieu.desnoyers@polymtl.ca
      Cc: dvhltc@us.ibm.com
      Cc: niv@us.ibm.com
      Cc: peterz@infradead.org
      Cc: rostedt@goodmis.org
      Cc: hugh.dickins@tiscali.co.uk
      Cc: benh@kernel.crashing.org
      LKML-Reference: <12503552313921-git-send-email->
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      b612ba80
    • P
      rcu: Make preemptable RCU scan all CPUs when summing RCU counters · 8064d549
      Paul E. McKenney 提交于
      This patch eliminates the counter-moving during CPU-offline
      notifiers, eliminating potential confusion if counters are
      scanned during counter-movement process.
      
      This confusion could result in premature ending of an RCU grace
      period. For example, if there are two tasks in RCU read-side
      critical sections (so that the sum of the counters is two), and
      the counter for the CPU going offline is -2, then moving the
      count to another CPU can result in the sum momentarily
      appearing to be zero.  Since there are no memory barriers in
      either case, many more such scenarios are possible.
      
      So just don't move the counts!!!
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: laijs@cn.fujitsu.com
      Cc: dipankar@in.ibm.com
      Cc: josht@linux.vnet.ibm.com
      Cc: akpm@linux-foundation.org
      Cc: mathieu.desnoyers@polymtl.ca
      Cc: dvhltc@us.ibm.com
      Cc: niv@us.ibm.com
      Cc: peterz@infradead.org
      Cc: rostedt@goodmis.org
      Cc: hugh.dickins@tiscali.co.uk
      Cc: benh@kernel.crashing.org
      LKML-Reference: <12503552312863-git-send-email->
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      8064d549
    • P
      rcu: Simplify RCU CPU-hotplug notification · 2e597558
      Paul E. McKenney 提交于
      Use the new cpu_notifier() API to simplify RCU's CPU-hotplug
      notifiers, collapsing down to a single such notifier.
      
      This makes it trivial to provide the notifier-ordering
      guarantee that rcu_barrier() depends on.
      
      Also remove redundant open_softirq() calls from Hierarchical
      RCU notifier.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: laijs@cn.fujitsu.com
      Cc: dipankar@in.ibm.com
      Cc: josht@linux.vnet.ibm.com
      Cc: akpm@linux-foundation.org
      Cc: mathieu.desnoyers@polymtl.ca
      Cc: dvhltc@us.ibm.com
      Cc: niv@us.ibm.com
      Cc: peterz@infradead.org
      Cc: rostedt@goodmis.org
      Cc: hugh.dickins@tiscali.co.uk
      Cc: benh@kernel.crashing.org
      LKML-Reference: <12503552312510-git-send-email->
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      2e597558
    • P
      rcu: Split hierarchical RCU initialization into boot-time and CPU-online pieces · 27569620
      Paul E. McKenney 提交于
      This patch divides the rcutree initialization into boot-time
      and hotplug-time components, so that the tree data structures
      are guaranteed to be fully linked at boot time regardless of
      what might happen in CPU hotplug operations.
      
      This makes RCU more resilient against CPU hotplug misbehavior
      (and vice versa), but more importantly, does a better job of
      compartmentalizing the code.
      Reported-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: laijs@cn.fujitsu.com
      Cc: dipankar@in.ibm.com
      Cc: josht@linux.vnet.ibm.com
      Cc: akpm@linux-foundation.org
      Cc: mathieu.desnoyers@polymtl.ca
      Cc: dvhltc@us.ibm.com
      Cc: niv@us.ibm.com
      Cc: peterz@infradead.org
      Cc: rostedt@goodmis.org
      Cc: hugh.dickins@tiscali.co.uk
      Cc: benh@kernel.crashing.org
      LKML-Reference: <1250355231152-git-send-email->
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      27569620
  4. 14 8月, 2009 1 次提交
    • L
      genirq: prevent wakeup of freed irq thread · 2d860ad7
      Linus Torvalds 提交于
      free_irq() can remove an irqaction while the corresponding interrupt
      is in progress, but free_irq() sets action->thread to NULL
      unconditionally, which might lead to a NULL pointer dereference in
      handle_IRQ_event() when the hard interrupt context tries to wake up
      the handler thread.
      
      Prevent this by moving the thread stop after synchronize_irq(). No
      need to set action->thread to NULL either as action is going to be
      freed anyway.
      
      This fixes a boot crash reported against preempt-rt which uses the
      mainline irq threads code to implement full irq threading.
      
      [ tglx: removed local irqthread variable ]
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      2d860ad7
  5. 13 8月, 2009 6 次提交
    • P
      perf_counter: Report the cloning task as parent on perf_counter_fork() · 94d5d1b2
      Peter Zijlstra 提交于
      A bug in (9f498cc5: perf_counter: Full task tracing) makes
      profiling multi-threaded apps it go belly up.
      
      [ output as: (PID:TID):(PPID:PTID) ]
      
       # ./perf report -D | grep FORK
      0x4b0 [0x18]: PERF_EVENT_FORK: (3237:3237):(3236:3236)
      0xa10 [0x18]: PERF_EVENT_FORK: (3237:3238):(3236:3236)
      0xa70 [0x18]: PERF_EVENT_FORK: (3237:3239):(3236:3236)
      0xad0 [0x18]: PERF_EVENT_FORK: (3237:3240):(3236:3236)
      0xb18 [0x18]: PERF_EVENT_FORK: (3237:3241):(3236:3236)
      
      Shows us that the test (27d028de perf report: Update for the new
      FORK/EXIT events) in builtin-report.c:
      
              /*
               * A thread clone will have the same PID for both
               * parent and child.
               */
              if (thread == parent)
                      return 0;
      
      Will clearly fail.
      
      The problem is that perf_counter_fork() reports the actual
      parent, instead of the cloning thread.
      
      Fixing that (with the below patch), yields:
      
       # ./perf report -D | grep FORK
      0x4c8 [0x18]: PERF_EVENT_FORK: (1590:1590):(1589:1589)
      0xbd8 [0x18]: PERF_EVENT_FORK: (1590:1591):(1590:1590)
      0xc80 [0x18]: PERF_EVENT_FORK: (1590:1592):(1590:1590)
      0x3338 [0x18]: PERF_EVENT_FORK: (1590:1593):(1590:1590)
      0x66b0 [0x18]: PERF_EVENT_FORK: (1590:1594):(1590:1590)
      
      Which both makes more sense and doesn't confuse perf report
      anymore.
      Reported-by: NPekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: paulus@samba.org
      Cc: Anton Blanchard <anton@samba.org>
      Cc: Arjan van de Ven <arjan@infradead.org>
      LKML-Reference: <1250172882.5241.62.camel@twins>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      94d5d1b2
    • P
      perf_counter: Fix an ipi-deadlock · 970892a9
      Peter Zijlstra 提交于
      perf_pending_counter() is called from IRQ context and will call
      perf_counter_disable(), however perf_counter_disable() uses
      smp_call_function_single() which doesn't fancy being used with
      IRQs disabled due to IPI deadlocks.
      
      Fix this by making it use the local __perf_counter_disable()
      call and teaching the counter_sched_out() code about pending
      disables as well.
      
      This should cover the case where a counter migrates before the
      pending queue gets processed.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Corey J Ashford <cjashfor@us.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: stephane eranian <eranian@googlemail.com>
      LKML-Reference: <20090813103655.244097721@chello.nl>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      970892a9
    • P
      perf: Rework/fix the whole read vs group stuff · 3dab77fb
      Peter Zijlstra 提交于
      Replace PERF_SAMPLE_GROUP with PERF_SAMPLE_READ and introduce
      PERF_FORMAT_GROUP to deal with group reads in a more generic
      way.
      
      This allows you to get group reads out of read() as well.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Corey J Ashford <cjashfor@us.ibm.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: stephane eranian <eranian@googlemail.com>
      LKML-Reference: <20090813103655.117411814@chello.nl>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      3dab77fb
    • P
      perf_counter: Fix swcounter context invariance · bcfc2602
      Peter Zijlstra 提交于
      perf_swcounter_is_counting() uses a lock, which means we cannot
      use swcounters from NMI or when holding that particular lock,
      this is unintended.
      
      The below removes the lock, this opens up race window, but not
      worse than the swcounters already experience due to RCU
      traversal of the context in perf_swcounter_ctx_event().
      
      This also fixes the hard lockups while opening a lockdep
      tracepoint counter.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: stephane eranian <eranian@googlemail.com>
      Cc: Corey J Ashford <cjashfor@us.ibm.com>
      LKML-Reference: <1250149915.10001.66.camel@twins>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      bcfc2602
    • I
      perf_counter: Provide hw_perf_counter_setup_online() APIs · 28402971
      Ingo Molnar 提交于
      Provide weak aliases for hw_perf_counter_setup_online(). This is
      used by the BTS patches (for v2.6.32), but it interacts with
      fixes so propagate this upstream. (it has no effect as of yet)
      
      Also export perf_counter_output() to architecture code.
      
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      LKML-Reference: <new-submission>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      28402971
    • A
      Remove double removal of blktrace directory · 39cbb602
      Alan D. Brunelle 提交于
      commit fd51d251
      Author: Stefan Raspl <raspl@linux.vnet.ibm.com>
      Date:   Tue May 19 09:59:08 2009 +0200
      
          blktrace: remove debugfs entries on bad path
      
      added in an explicit invocation of debugfs_remove for bt->dir, in
      blk_remove_buf_file_callback we are also getting the directory removed. On
      occasion I am seeing memory corruption that I have bisected down to
      this commit. [The testing involves a (long) series of I/O benchmarks
      with blktrace invoked around the actual runs.] I believe that this
      committed patch is correct, but the problem actually lies in the code
      in blk_remove_buf_file_callback.
      
      With this patch I am able to consistently get complete runs whereas
      previously I could not get a single run to complete.
      
      The first part of the patch simply moves the debugfs_remove below the
      relay_close: the relay_close call will remove files under bt->dir, and
      so we should not remove the directory until all the files we created
      have been removed. (Note: This is not sufficient to fix the problem -
      the file system code has ref counts on the directoy, so our invocation
      does not cause the directory to actually be removed. Nonetheless, we
      should not rely upon that feature.)
      Signed-off-by: NAlan D. Brunelle <alan.brunelle@hp.com>
      Signed-off-by: NJens Axboe <jens.axboe@oracle.com>
      39cbb602
  6. 11 8月, 2009 1 次提交
    • D
      futex: Fix handling of bad requeue syscall pairing · 392741e0
      Darren Hart 提交于
      If futex_requeue(requeue_pi=1) finds a futex_q that was created by a call
      other the futex_wait_requeue_pi(), the q.rt_waiter may be null.  If so,
      this will result in an oops from the following call graph:
      
      futex_requeue()
        rt_mutex_start_proxy_lock()
          task_blocks_on_rt_mutex()
            waiter->task dereference
              OOPS
      
      We currently WARN_ON() if this is detected, clearly this is inadequate.
      If we detect a mispairing in futex_requeue(), bail out, seding -EINVAL to
      user-space.
      
      V2: Fix parenthesis warnings.
      Signed-off-by: NDarren Hart <dvhltc@us.ibm.com>
      Acked-by: NPeter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: John Kacur <jkacur@redhat.com>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Dinakar Guniguntala <dino@in.ibm.com>
      Cc: John Stultz <johnstul@linux.vnet.ibm.com>
      LKML-Reference: <4A7CA8C0.7010809@us.ibm.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      392741e0
  7. 10 8月, 2009 5 次提交
    • D
      futex: Fix compat_futex to be same as futex for REQUEUE_PI · 4dc88029
      Dinakar Guniguntala 提交于
      Need to add the REQUEUE_PI checks to the compat_sys_futex API
      as well to ensure 32 bit requeue's work fine on a 64 bit
      system. Patch is against latest tip
      Signed-off-by: NDinakar Guniguntala <dino@in.ibm.com>
      Cc: Darren Hart <dvhltc@us.ibm.com>
      LKML-Reference: <20090810130142.GA23619@in.ibm.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      4dc88029
    • P
      locking, sched: Give waitqueue spinlocks their own lockdep classes · 2fc39111
      Peter Zijlstra 提交于
      Give waitqueue spinlocks their own lockdep classes when they
      are initialised from init_waitqueue_head().  This means that
      struct wait_queue::func functions can operate other waitqueues.
      
      This is used by CacheFiles to catch the page from a backing fs
      being unlocked and to wake up another thread to take a copy of
      it.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: NDavid Howells <dhowells@redhat.com>
      Tested-by: NTakashi Iwai <tiwai@suse.de>
      Cc: linux-cachefs@redhat.com
      Cc: torvalds@osdl.org
      Cc: akpm@linux-foundation.org
      LKML-Reference: <20090810113305.17284.81508.stgit@warthog.procyon.org.uk>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      2fc39111
    • P
      perf_counter: Require CAP_SYS_ADMIN for raw tracepoint data · a4e95fc2
      Peter Zijlstra 提交于
      Raw tracepoint data contains various kernel internals and
      data from other users, so restrict this to CAP_SYS_ADMIN.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul Mackerras <paulus@samba.org>
      LKML-Reference: <1249896452.17467.75.camel@twins>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      a4e95fc2
    • P
      perf_counter: Correct PERF_SAMPLE_RAW output · a044560c
      Peter Zijlstra 提交于
      PERF_SAMPLE_* output switches should unconditionally output the
      correct format, as they are the only way to unambiguously parse
      the PERF_EVENT_SAMPLE data.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul Mackerras <paulus@samba.org>
      LKML-Reference: <1249896447.17467.74.camel@twins>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      a044560c
    • D
      futex: Update futex_q lock_ptr on requeue proxy lock · beda2c7e
      Darren Hart 提交于
      futex_requeue() can acquire the lock on behalf of a waiter
      early on or during the requeue loop if it is uncontended or in
      the event of a lock steal or owner died. On wakeup, the waiter
      (in futex_wait_requeue_pi()) cleans up the pi_state owner using
      the lock_ptr to protect against concurrent access to the
      pi_state. The pi_state is hung off futex_q's on the requeue
      target futex hash bucket so the lock_ptr needs to be updated
      accordingly.
      
      The problem manifested by triggering the WARN_ON in
      lookup_pi_state() about the pid != pi_state->owner->pid.  With
      this patch, the pi_state is properly guarded against concurrent
      access via the requeue target hb lock.
      
      The astute reviewer may notice that there is a window of time
      between when futex_requeue() unlocks the hb locks and when
      futex_wait_requeue_pi() will acquire hb2->lock.  During this
      time the pi_state and uval are not in sync with the underlying
      rtmutex owner (but the uval does indicate there are waiters, so
      no atomic changes will occur in userspace).  However, this is
      not a problem. Should a contending thread enter
      lookup_pi_state() and acquire hb2->lock before the ownership is
      fixed up, it will find the pi_state hung off a waiter's
      (possibly the pending owner's) futex_q and block on the
      rtmutex.  Once futex_wait_requeue_pi() fixes up the owner, it
      will also move the pi_state from the old owner's
      task->pi_state_list to its own.
      
      v3: Fix plist lock name for application to mainline (rather
          than -rt) Compile tested against tip/v2.6.31-rc5.
      Signed-off-by: NDarren Hart <dvhltc@us.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Dinakar Guniguntala <dino@in.ibm.com>
      Cc: John Stultz <johnstul@linux.vnet.ibm.com>
      LKML-Reference: <4A7F4EFF.6090903@us.ibm.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      beda2c7e
  8. 09 8月, 2009 7 次提交
    • P
      perf_counter: Fix a race on perf_counter_ctx · 3a80b4a3
      Peter Zijlstra 提交于
      While extending perfcounters with BTS hw-tracing, Markus
      Metzger managed to trigger this warning:
      
         [  995.557128] WARNING: at kernel/perf_counter.c:1191 __perf_counter_task_sched_out+0x48/0x6b()
      
      triggers because commit
      9f498cc5 (perf_counter: Full
      task tracing) removed clearing of tsk->perf_counter_ctxp out
      from under ctx->lock which introduced a race (against
      perf_lock_task_context).
      
      Move it back and deal with the exit notification by explicitly
      passing along the former task context.
      Reported-by: NMarkus T Metzger <markus.t.metzger@intel.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Mackerras <paulus@samba.org>
      LKML-Reference: <1249667341.17467.5.camel@twins>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      3a80b4a3
    • F
      perf_counter: Fix tracepoint sampling to be part of generic sampling · 3a43ce68
      Frederic Weisbecker 提交于
      Based on Peter's comments, make tracepoint sampling generic
      just like all the other sampling bits are. This is a rename
      with no code changes:
      
      - PERF_SAMPLE_TP_RECORD to PERF_SAMPLE_RAW
      - struct perf_tracepoint_record to perf_raw_record
      
      We want the system in place that transport tracepoints raw
      samples events into the perf ring buffer to be generalized and
      usable by any type of counter.
      
      Reported-by; Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul Mackerras <paulus@samba.org>
      LKML-Reference: <1249698400-5441-4-git-send-email-fweisbec@gmail.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      3a43ce68
    • F
      perf_counter: Work around gcc warning by initializing tracepoint record unconditionally · 10b8e306
      Frederic Weisbecker 提交于
      Despite that the tracepoint record is always present when the
      PERF_SAMPLE_TP_RECORD flag is set, gcc raises a warning,
      thinking it might not be initialized:
      
        kernel/perf_counter.c: In function ‘perf_counter_output’:
        kernel/perf_counter.c:2650: warning: ‘tp’ may be used uninitialized in this function
      
      Then, initialize it to NULL and always check if it's not NULL
      before dereference it.
      Reported-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul Mackerras <paulus@samba.org>
      LKML-Reference: <1249698400-5441-2-git-send-email-fweisbec@gmail.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      10b8e306
    • P
      perf_counter: Fix software counters for fast moving event sources · 7b4b6658
      Peter Zijlstra 提交于
      Reimplement the software counters to deal with fast moving
      event sources (such as tracepoints). This means being able
      to generate multiple overflows from a single 'event' as well
      as support throttling.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Mackerras <paulus@samba.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      7b4b6658
    • F
      perf_counter: Fix/complete ftrace event records sampling · f413cdb8
      Frederic Weisbecker 提交于
      This patch implements the kernel side support for ftrace event
      record sampling.
      
      A new counter sampling attribute is added:
      
         PERF_SAMPLE_TP_RECORD
      
      which requests ftrace events record sampling. In this case
      if a PERF_TYPE_TRACEPOINT counter is active and a tracepoint
      fires, we emit the tracepoint binary record to the
      perfcounter event buffer, as a sample.
      
      Result, after setting PERF_SAMPLE_TP_RECORD attribute from perf
      record:
      
       perf record -f -F 1 -a -e workqueue:workqueue_execution
       perf report -D
      
       0x21e18 [0x48]: event: 9
       .
       . ... raw event: size 72 bytes
       .  0000:  09 00 00 00 01 00 48 00 d0 c7 00 81 ff ff ff ff  ......H........
       .  0010:  0a 00 00 00 0a 00 00 00 21 00 00 00 00 00 00 00  ........!......
       .  0020:  2b 00 01 02 0a 00 00 00 0a 00 00 00 65 76 65 6e  +...........eve
       .  0030:  74 73 2f 31 00 00 00 00 00 00 00 00 0a 00 00 00  ts/1...........
       .  0040:  e0 b1 31 81 ff ff ff ff                          .......
      .
      0x21e18 [0x48]: PERF_EVENT_SAMPLE (IP, 1): 10: 0xffffffff8100c7d0 period: 33
      
      The raw ftrace binary record starts at offset 0020.
      
      Translation:
      
       struct trace_entry {
      	type		= 0x2b = 43;
      	flags		= 1;
      	preempt_count	= 2;
      	pid		= 0xa = 10;
      	tgid		= 0xa = 10;
       }
      
       thread_comm = "events/1"
       thread_pid  = 0xa = 10;
       func	    = 0xffffffff8131b1e0 = flush_to_ldisc()
      
      What will come next?
      
       - Userspace support ('perf trace'), 'flight data recorder' mode
         for perf trace, etc.
      
       - The unconditional copy from the profiling callback brings
         some costs however if someone wants no such sampling to
         occur, and needs to be fixed in the future. For that we need
         to have an instant access to the perf counter attribute.
         This is a matter of a flag to add in the struct ftrace_event.
      
       - Take care of the events recursivity! Don't ever try to record
         a lock event for example, it seems some locking is used in
         the profiling fast path and lead to a tracing recursivity.
         That will be fixed using raw spinlock or recursivity
         protection.
      
       - [...]
      
       - Profit! :-)
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Tom Zanussi <tzanussi@gmail.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Gabriel Munteanu <eduard.munteanu@linux360.ro>
      Cc: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      f413cdb8
    • P
      perf_counter, ftrace: Fix perf_counter integration · 3a659305
      Peter Zijlstra 提交于
      Adds possible second part to the assign argument of TP_EVENT().
      
        TP_perf_assign(
      	__perf_count(foo);
      	__perf_addr(bar);
        )
      
      Which, when specified make the swcounter increment with @foo instead
      of the usual 1, and report @bar for PERF_SAMPLE_ADDR (data address
      associated with the event) when this triggers a counter overflow.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: NSteven Rostedt <rostedt@goodmis.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      3a659305
    • S
      posix_cpu_timers_exit_group(): Do not use thread_group_cputimer() · 17d42c1c
      Stanislaw Gruszka 提交于
      When the process exits we don't have to run new cputimer nor
      use running one (as it not accounts when tsk->exit_state != 0)
      to get process CPU times.  As there is only one thread we can
      just use CPU times fields from task and signal structs.
      Signed-off-by: NStanislaw Gruszka <sgruszka@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: Vitaly Mayatskikh <vmayatsk@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      17d42c1c
  9. 08 8月, 2009 6 次提交
    • T
      tracing/filters: Always free pred on filter_add_subsystem_pred() failure · 26528e77
      Tom Zanussi 提交于
      If filter_add_subsystem_pred() fails due to ENOSPC or ENOMEM,
      the pred doesn't get freed, while as a side effect it does for
      other errors. Make it so the caller always frees the pred for
      any error.
      Signed-off-by: NTom Zanussi <tzanussi@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      LKML-Reference: <1249746593.6453.32.camel@tropicana>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      26528e77
    • T
      tracing/filters: Don't use pred on alloc failure · 96b2de31
      Tom Zanussi 提交于
      Dan Carpenter sent me a fix to prevent pred from being used if
      it couldn't be allocated.  I noticed the same problem also
      existed for the create_pred() case and added a fix for that.
      Reported-by: NDan Carpenter <error27@gmail.com>
      Signed-off-by: NTom Zanussi <tzanussi@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      LKML-Reference: <1249746549.6453.29.camel@tropicana>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      96b2de31
    • Y
      x86/irq: Fix move_irq_desc() for nodes without ram · ad7d6c7a
      Yinghai Lu 提交于
      Don't move it if target node is -1.
      Signed-off-by: NYinghai Lu <yinghai@kernel.org>
      LKML-Reference: <4A785B5D.4070702@kernel.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      ad7d6c7a
    • E
      execve: must clear current->clear_child_tid · 9c8a8228
      Eric Dumazet 提交于
      While looking at Jens Rosenboom bug report
      (http://lkml.org/lkml/2009/7/27/35) about strange sys_futex call done from
      a dying "ps" program, we found following problem.
      
      clone() syscall has special support for TID of created threads.  This
      support includes two features.
      
      One (CLONE_CHILD_SETTID) is to set an integer into user memory with the
      TID value.
      
      One (CLONE_CHILD_CLEARTID) is to clear this same integer once the created
      thread dies.
      
      The integer location is a user provided pointer, provided at clone()
      time.
      
      kernel keeps this pointer value into current->clear_child_tid.
      
      At execve() time, we should make sure kernel doesnt keep this user
      provided pointer, as full user memory is replaced by a new one.
      
      As glibc fork() actually uses clone() syscall with CLONE_CHILD_SETTID and
      CLONE_CHILD_CLEARTID set, chances are high that we might corrupt user
      memory in forked processes.
      
      Following sequence could happen:
      
      1) bash (or any program) starts a new process, by a fork() call that
         glibc maps to a clone( ...  CLONE_CHILD_SETTID | CLONE_CHILD_CLEARTID
         ...) syscall
      
      2) When new process starts, its current->clear_child_tid is set to a
         location that has a meaning only in bash (or initial program) context
         (&THREAD_SELF->tid)
      
      3) This new process does the execve() syscall to start a new program.
         current->clear_child_tid is left unchanged (a non NULL value)
      
      4) If this new program creates some threads, and initial thread exits,
         kernel will attempt to clear the integer pointed by
         current->clear_child_tid from mm_release() :
      
              if (tsk->clear_child_tid
                  && !(tsk->flags & PF_SIGNALED)
                  && atomic_read(&mm->mm_users) > 1) {
                      u32 __user * tidptr = tsk->clear_child_tid;
                      tsk->clear_child_tid = NULL;
      
                      /*
                       * We don't check the error code - if userspace has
                       * not set up a proper pointer then tough luck.
                       */
      << here >>      put_user(0, tidptr);
                      sys_futex(tidptr, FUTEX_WAKE, 1, NULL, NULL, 0);
              }
      
      5) OR : if new program is not multi-threaded, but spied by /proc/pid
         users (ps command for example), mm_users > 1, and the exiting program
         could corrupt 4 bytes in a persistent memory area (shm or memory mapped
         file)
      
      If current->clear_child_tid points to a writeable portion of memory of the
      new program, kernel happily and silently corrupts 4 bytes of memory, with
      unexpected effects.
      
      Fix is straightforward and should not break any sane program.
      Reported-by: NJens Rosenboom <jens@mcbone.net>
      Acked-by: NLinus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      Signed-off-by: NOleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Sonny Rao <sonnyrao@us.ibm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ulrich Drepper <drepper@redhat.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      9c8a8228
    • X
      generic-ipi: fix hotplug_cfd() · 69dd647f
      Xiao Guangrong 提交于
      Use CONFIG_HOTPLUG_CPU, not CONFIG_CPU_HOTPLUG
      
      When hot-unpluging a cpu, it will leak memory allocated at cpu hotplug,
      but only if CPUMASK_OFFSTACK=y, which is default to n.
      
      The bug was introduced by 8969a5ed
      ("generic-ipi: remove kmalloc()").
      Signed-off-by: NXiao Guangrong <xiaoguangrong@cn.fujitsu.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Jens Axboe <jens.axboe@oracle.com>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: <stable@kernel.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      69dd647f
    • E
      ring-buffer: Fix memleak in ring_buffer_free() · bd3f0221
      Eric Dumazet 提交于
      I noticed oprofile memleaked in linux-2.6 current tree,
      and tracked this ring-buffer leak.
      Signed-off-by: NEric Dumazet <eric.dumazet@gmail.com>
      LKML-Reference: <4A7C06B9.2090302@gmail.com>
      Cc: stable@kernel.org
      Signed-off-by: NSteven Rostedt <rostedt@goodmis.org>
      bd3f0221
  10. 07 8月, 2009 1 次提交