1. 11 Feb 2011 (2 commits)
  2. 08 Feb 2011 (2 commits)
  3. 04 Feb 2011 (1 commit)
  4. 03 Feb 2011 (6 commits)
    • tracing: Replace syscall_meta_data struct array with pointer array · 3d56e331
      By Steven Rostedt
      Currently the syscall_meta structures for the syscall tracepoints are
      placed in the __syscall_metadata section, and at link time, the linker
      makes one large array of all these syscall metadata structures. On boot
      up, this array is read (much like the initcall sections) and the syscall
      data is processed.
      
      The problem is that there is no guarantee that gcc will place complex
      structures nicely together in an array format. Two structures in the
      same file may be placed awkwardly, because gcc has no clue that they
      are supposed to be in an array.
      
      A hack was previously used to force the alignment to 4 in order to pack
      the structures together, but this caused alignment issues with other
      architectures (sparc).
      
      Instead of packing the structures into an array, the structures' addresses
      are now put into the __syscall_metadata section. As pointers are always the
      natural alignment, gcc should always pack them tightly together
      (otherwise initcall, extable, etc would also fail).
      
      By having the pointers to the structures in the section, we can still
      iterate the trace_events without causing unnecessary alignment problems
      with other architectures, or depending on the current behaviour of
      gcc that will likely change in the future just to tick us kernel developers
      off a little more.
      
      The __syscall_metadata section is also moved into the .init.data section
      as it is now only needed at boot up.
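      
      The mechanism can be pictured with this minimal sketch (illustrative names,
      not the kernel's actual macros): the structure is defined normally, and only
      its address is emitted into the dedicated section, so the section becomes a
      densely packed pointer array no matter how the compiler pads the structures
      themselves.
      
      	/* Sketch only: illustrative names, not the kernel's real macros. */
      	struct syscall_metadata {
      		const char	*name;
      		int		nb_args;
      		/* ... */
      	};
      
      	/*
      	 * Emit only the address of an already-defined metadata structure
      	 * into the section; pointers pack naturally, so the linker builds
      	 * a dense array of addresses regardless of structure padding.
      	 */
      	#define PUT_SYSCALL_META_PTR(meta)				\
      		static struct syscall_metadata * const __ptr_##meta	\
      		__attribute__((used, section("__syscall_metadata")))	\
      		= &(meta)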
      Suggested-by: David Miller <davem@davemloft.net>
      Acked-by: David S. Miller <davem@davemloft.net>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      3d56e331
    • tracepoints: Fix section alignment using pointer array · 65498646
      By Mathieu Desnoyers
      Make the tracepoints more robust by not relying on compiler-specific
      behavior with respect to structure alignment. Implement an approach proposed
      by David Miller: use an array of const pointers to refer to the individual
      structures, and export this pointer array through the linker script rather
      than the structures themselves. This consumes 32 extra bytes per tracepoint
      (24 for structure padding and 8 for the pointer), but is less likely to
      break due to compiler changes.
      
      History:
      
      Commit 7e066fb8 ("tracepoints: add DECLARE_TRACE() and DEFINE_TRACE()")
      added the aligned(32) type and variable attribute to the tracepoint structures
      to deal with gcc happily aligning statically defined structures on 32-byte
      multiples.
      
      One attempt was to use an 8-byte alignment for tracepoint structures by applying
      both the variable and type attribute to tracepoint structure definitions and
      declarations. It worked fine with gcc 4.5.1, but broke with gcc 4.4.4 and 4.4.5.
      
      The reason is that the "aligned" attribute only specifies the _minimum_ alignment
      for a structure, leaving both the compiler and the linker free to align on
      larger multiples. Because tracepoint.c expects the structures to be placed as an
      array within each section, up-alignment causes NULL-pointer exceptions due to the
      extra unexpected padding.
      
      (this patch applies on top of -tip)
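      
      Roughly, consumers then iterate the pointer array between linker-provided
      start/stop symbols and dereference each entry; a minimal sketch (the symbol
      names here are assumptions about what the linker script provides):
      
      	extern struct tracepoint * const __start___tracepoints_ptrs[];
      	extern struct tracepoint * const __stop___tracepoints_ptrs[];
      
      	static void for_each_tracepoint_sketch(void (*fct)(struct tracepoint *tp))
      	{
      		struct tracepoint * const *iter;
      
      		/* Only the pointer array must be contiguous; the structures
      		 * it points to may be padded however the compiler likes. */
      		for (iter = __start___tracepoints_ptrs;
      		     iter < __stop___tracepoints_ptrs; iter++)
      			fct(*iter);
      	}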
      Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      LKML-Reference: <20110126222622.GA10794@Krystal>
      CC: Frederic Weisbecker <fweisbec@gmail.com>
      CC: Ingo Molnar <mingo@elte.hu>
      CC: Thomas Gleixner <tglx@linutronix.de>
      CC: Andrew Morton <akpm@linux-foundation.org>
      CC: Peter Zijlstra <peterz@infradead.org>
      CC: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      65498646
    • sched: Fix update_curr_rt() · 06c3bc65
      By Peter Zijlstra
      cpu_stopper_thread()
        migration_cpu_stop()
          __migrate_task()
            deactivate_task()
              dequeue_task()
                dequeue_task_rq()
                  update_curr_rt()
      
      Will call update_curr_rt() on rq->curr, which at that time is
      rq->stop. The problem is that rq->stop.prio matches an RT prio, so
      update_curr_rt() falsely assumes it is dealing with an rt_sched_class task.
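      
      The shape of the fix, as a sketch (simplified from the description, not the
      verbatim patch): bail out of update_curr_rt() unless the running task really
      belongs to the RT scheduling class, instead of trusting the priority value.
      
      	static void update_curr_rt(struct rq *rq)
      	{
      		struct task_struct *curr = rq->curr;
      
      		/* rq->curr may be rq->stop, whose prio looks like an RT prio;
      		 * only account runtime for genuine rt_sched_class tasks. */
      		if (curr->sched_class != &rt_sched_class)
      			return;
      
      		/* ... existing RT runtime accounting ... */
      	}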
      Reported-Debugged-Tested-Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Cc: stable@kernel.org # .37
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      06c3bc65
    • perf: Fix reading in perf_event_read() · 542e72fc
      By Peter Zijlstra
      It is quite possible for the event to have been disabled between
      perf_event_read() sending the IPI and the CPU servicing the IPI and
      calling __perf_event_read(), hence revalidate the state.
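      
      A sketch of the idea (simplified, not the exact patch): the cross-call
      handler re-checks the event state before touching the PMU, since the IPI may
      have raced with a disable.
      
      	static void __perf_event_read(void *info)	/* runs on the target CPU */
      	{
      		struct perf_event *event = info;
      
      		/* The event may have been disabled between perf_event_read()
      		 * sending the IPI and this handler running; revalidate. */
      		if (event->state != PERF_EVENT_STATE_ACTIVE)
      			return;
      
      		/* ... update context time and call event->pmu->read(event) ... */
      	}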
      Reported-by: Stephane Eranian <eranian@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      542e72fc
    • tracing: Replace trace_event struct array with pointer array · e4a9ea5e
      By Steven Rostedt
      Currently the trace_event structures are placed in the _ftrace_events
      section, and at link time, the linker makes one large array of all
      the trace_event structures. On boot up, this array is read (much like
      the initcall sections) and the events are processed.
      
      The problem is that there is no guarantee that gcc will place complex
      structures nicely together in an array format. Two structures in the
      same file may be placed awkwardly, because gcc has no clue that they
      are supposed to be in an array.
      
      A hack was previously used to force the alignment to 4 in order to pack
      the structures together, but this caused alignment issues with other
      architectures (sparc).
      
      Instead of packing the structures into an array, the structures' addresses
      are now put into the _ftrace_events section. As pointers are always the
      natural alignment, gcc should always pack them tightly together
      (otherwise initcall, extable, etc would also fail).
      
      By having the pointers to the structures in the section, we can still
      iterate the trace_events without causing unnecessary alignment problems
      with other architectures, or depending on the current behaviour of
      gcc that will likely change in the future just to tick us kernel developers
      off a little more.
      
      The _ftrace_events section is also moved into the .init.data section
      as it is now only needed at boot up.
      Suggested-by: David Miller <davem@davemloft.net>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Acked-by: David S. Miller <davem@davemloft.net>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      e4a9ea5e
    • genirq: Prevent irq storm on migration · f1a06390
      By Thomas Gleixner
      move_native_irq() masks and unmasks the interrupt line
      unconditionally, but the interrupt line might be masked due to a
      threaded oneshot handler in progress. Unmasking the line in that case
      can lead to interrupt storms. Observed on PREEMPT_RT.
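      
      The fix can be sketched like this (simplified, using the irq_desc/chip
      callbacks of that era): remember whether the line was already masked and only
      touch the mask state when it was not.
      
      	void move_native_irq(int irq)
      	{
      		struct irq_desc *desc = irq_to_desc(irq);
      		bool masked;
      
      		if (likely(!(desc->status & IRQ_MOVE_PENDING)))
      			return;
      		if (unlikely(desc->status & IRQ_DISABLED))
      			return;
      
      		/* A threaded oneshot handler may have left the line masked;
      		 * unmasking it here could trigger an interrupt storm. */
      		masked = desc->status & IRQ_MASKED;
      		if (!masked)
      			desc->irq_data.chip->irq_mask(&desc->irq_data);
      		move_masked_irq(irq);
      		if (!masked)
      			desc->irq_data.chip->irq_unmask(&desc->irq_data);
      	}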
      
      Originally-from: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: stable@kernel.org
      f1a06390
  5. 31 Jan 2011 (4 commits)
  6. 28 Jan 2011 (1 commit)
    • perf: Fix alloc_callchain_buffers() · 88d4f0db
      By Eric Dumazet
      Commit 927c7a9e ("perf: Fix race in callchains") introduced
      a mismatch in the sizing of struct callchain_cpus_entries.
      
      nr_cpu_ids must be used instead of num_possible_cpus(), or we
      might get out-of-bounds memory accesses on some machines.
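      
      In other words, an array indexed by CPU number must be sized by nr_cpu_ids,
      since possible CPU numbers can be sparse; roughly (a simplified fragment):
      
      	/* cpu_entries[] is indexed by the CPU number (smp_processor_id()),
      	 * so the allocation must cover nr_cpu_ids slots, not merely the
      	 * count of possible CPUs, which is smaller when numbering is sparse. */
      	size = sizeof(*entries) +
      	       sizeof(struct perf_callchain_entry *) * nr_cpu_ids;
      	entries = kzalloc(size, GFP_KERNEL);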
      Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: David Miller <davem@davemloft.net>
      Cc: Stephane Eranian <eranian@google.com>
      CC: stable@kernel.org
      LKML-Reference: <1295980851.3588.351.camel@edumazet-laptop>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      88d4f0db
  7. 26 Jan 2011 (4 commits)
  8. 25 Jan 2011 (1 commit)
  9. 24 Jan 2011 (2 commits)
  10. 22 Jan 2011 (1 commit)
    • perf: perf_event_exit_task_context: s/rcu_dereference/rcu_dereference_raw/ · 806839b2
      By Oleg Nesterov
      In theory, almost every user of task->child->perf_event_ctxp[]
      is wrong. find_get_context() can install the new context at any
      moment, we need read_barrier_depends().
      
      Commit dbe08d82 ("perf: Fix
      find_get_context() vs perf_event_exit_task() race") added
      rcu_dereference() into perf_event_exit_task_context() to set
      the precedent, but this makes __rcu_dereference_check() unhappy.
      Use rcu_dereference_raw() to shut up the warning.
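      
      A simplified sketch of the resulting access: the dependency ordering of an
      RCU dereference is kept, while the _raw variant tells lockdep-RCU that no
      rcu_read_lock() or mutex protection is being claimed at this point.
      
      	/* Sketch of the exit-path access in perf_event_exit_task_context(). */
      	struct perf_event_context *child_ctx;
      
      	child_ctx = rcu_dereference_raw(child->perf_event_ctxp[ctxn]);
      	if (!child_ctx)
      		return;
      	/* ... tear down the context ... */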
      Reported-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Cc: acme@redhat.com
      Cc: paulus@samba.org
      Cc: stern@rowland.harvard.edu
      Cc: a.p.zijlstra@chello.nl
      Cc: fweisbec@gmail.com
      Cc: roland@redhat.com
      Cc: prasad@linux.vnet.ibm.com
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      LKML-Reference: <20110121174547.GA8796@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      806839b2
  11. 21 Jan 2011 (4 commits)
    • perf: Annotate cpuctx->ctx.mutex to avoid a lockdep splat · 547e9fd7
      By Peter Zijlstra
      Lockdep spotted:
      
      	loop_1b_instruc/1899 is trying to acquire lock:
      	 (event_mutex){+.+.+.}, at: [<ffffffff810e1908>] perf_trace_init+0x3b/0x2f7
      
      	but task is already holding lock:
      	 (&ctx->mutex){+.+.+.}, at: [<ffffffff810eb45b>] perf_event_init_context+0xc0/0x218
      
      	which lock already depends on the new lock.
      
      	the existing dependency chain (in reverse order) is:
      
      	-> #3 (&ctx->mutex){+.+.+.}:
      	-> #2 (cpu_hotplug.lock){+.+.+.}:
      	-> #1 (module_mutex){+.+...}:
      	-> #0 (event_mutex){+.+.+.}:
      
      But because the deadlock would be cpuhotplug (cpu-event) vs fork
      (task-event) it cannot, in fact, happen. We can annotate this by giving the
      perf_event_context used for the cpuctx a different lock class from those
      used by tasks.
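      
      A sketch of the annotation (simplified, close to the description): give the
      per-cpu context's mutex its own lock class key so lockdep stops chaining it
      with the task-context mutexes.
      
      	/* One lock class for all per-cpu contexts, distinct from task ones. */
      	static struct lock_class_key cpuctx_mutex;
      
      	/* after the per-cpu perf_event_context has been initialized */
      	lockdep_set_class(&cpuctx->ctx.mutex, &cpuctx_mutex);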
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      547e9fd7
    • genirq: Remove __do_IRQ · 1c77ff22
      By Thomas Gleixner
      All architectures are finally converted. Remove the cruft.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Richard Henderson <rth@twiddle.net>
      Cc: Mike Frysinger <vapier@gentoo.org>
      Cc: David Howells <dhowells@redhat.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Greg Ungerer <gerg@uclinux.org>
      Cc: Michal Simek <monstr@monstr.eu>
      Acked-by: David Howells <dhowells@redhat.com>
      Cc: Kyle McMartin <kyle@mcmartin.ca>
      Acked-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Chen Liqin <liqin.chen@sunplusct.com>
      Cc: "David S. Miller" <davem@davemloft.net>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Jeff Dike <jdike@addtoit.com>
      1c77ff22
    • kernel/smp.c: consolidate writes in smp_call_function_interrupt() · 225c8e01
      By Milton Miller
      We have to test the cpu mask in the interrupt handler before checking the
      refs, otherwise we can start to follow an entry before it is deleted and
      find it partially initialized for the next trip.  Presently we also clear
      the cpumask bit before executing the called function, which implies
      getting write access to the line.  After the function is called we then
      decrement refs, and if they go to zero we then unlock the structure.
      
      However, this implies getting write access to the call function data
      before and after the function is called.  If we can assert that no
      smp_call_function execution function is allowed to enable interrupts, then
      we can move both writes to after the function is called, hopefully allowing
      both writes with one cache line bounce.
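      
      The resulting ordering in the interrupt handler looks roughly like the
      following sketch (a fragment, with declarations and error handling trimmed):
      
      	list_for_each_entry_rcu(data, &call_function.queue, csd.list) {
      		if (!cpumask_test_cpu(cpu, data->cpumask))
      			continue;		/* not aimed at this CPU */
      
      		smp_rmb();			/* read mask before refs */
      
      		if (atomic_read(&data->refs) == 0)
      			continue;		/* entry not set up for us yet */
      
      		data->csd.func(data->csd.info);
      
      		/* Both writes after the call: ideally one cache line bounce. */
      		cpumask_clear_cpu(cpu, data->cpumask);
      		smp_wmb();
      		refs = atomic_dec_return(&data->refs);
      		WARN_ON(refs < 0);
      		if (refs)
      			continue;
      		/* last reference gone: the owner may now reuse the entry */
      	}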
      
      On a 256 thread system with a kernel compiled for 1024 threads, the time
      to execute testcase in the "smp_call_function_many race" changelog was
      reduced by about 30-40ms out of about 545 ms.
      
      I decided to keep this as WARN because it is now a buggy function, even
      though the stack trace is of no value -- a simple printk would give us the
      information needed.
      
      Raw data:
      
      Without patch:
        ipi_test startup took 1219366ns complete 539819014ns total 541038380ns
        ipi_test startup took 1695754ns complete 543439872ns total 545135626ns
        ipi_test startup took 7513568ns complete 539606362ns total 547119930ns
        ipi_test startup took 13304064ns complete 533898562ns total 547202626ns
        ipi_test startup took 8668192ns complete 544264074ns total 552932266ns
        ipi_test startup took 4977626ns complete 548862684ns total 553840310ns
        ipi_test startup took 2144486ns complete 541292318ns total 543436804ns
        ipi_test startup took 21245824ns complete 530280180ns total 551526004ns
      
      With patch:
        ipi_test startup took 5961748ns complete 500859628ns total 506821376ns
        ipi_test startup took 8975996ns complete 495098924ns total 504074920ns
        ipi_test startup took 19797750ns complete 492204740ns total 512002490ns
        ipi_test startup took 14824796ns complete 487495878ns total 502320674ns
        ipi_test startup took 11514882ns complete 494439372ns total 505954254ns
        ipi_test startup took 8288084ns complete 502570774ns total 510858858ns
        ipi_test startup took 6789954ns complete 493388112ns total 500178066ns
      
      	#include <linux/module.h>
      	#include <linux/init.h>
      	#include <linux/kernel.h>    /* printk */
      	#include <linux/smp.h>       /* smp_call_function */
      	#include <linux/workqueue.h> /* work_struct, schedule_work_on */
      	#include <linux/sched.h> /* sched clock */
      
      	#define ITERATIONS 100
      
      	static void do_nothing_ipi(void *dummy)
      	{
      	}
      
      	static void do_ipis(struct work_struct *dummy)
      	{
      		int i;
      
      		for (i = 0; i < ITERATIONS; i++)
      			smp_call_function(do_nothing_ipi, NULL, 1);
      
      		printk(KERN_DEBUG "cpu %d finished\n", smp_processor_id());
      	}
      
      	static struct work_struct work[NR_CPUS];
      
      	static int __init testcase_init(void)
      	{
      		int cpu;
      		u64 start, started, done;
      
      		start = local_clock();
      		for_each_online_cpu(cpu) {
      			INIT_WORK(&work[cpu], do_ipis);
      			schedule_work_on(cpu, &work[cpu]);
      		}
      		started = local_clock();
      		for_each_online_cpu(cpu)
      			flush_work(&work[cpu]);
      		done = local_clock();
      		pr_info("ipi_test startup took %lldns complete %lldns total %lldns\n",
      			started-start, done-started, done-start);
      
      		return 0;
      	}
      
      	static void __exit testcase_exit(void)
      	{
      	}
      
      	module_init(testcase_init)
      	module_exit(testcase_exit)
      	MODULE_LICENSE("GPL");
      	MODULE_AUTHOR("Anton Blanchard");
      Signed-off-by: Milton Miller <miltonm@bga.com>
      Cc: Anton Blanchard <anton@samba.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      225c8e01
    • kernel/smp.c: fix smp_call_function_many() SMP race · 6dc19899
      By Anton Blanchard
      I noticed a failure where we hit the following WARN_ON in
      generic_smp_call_function_interrupt:
      
                      if (!cpumask_test_and_clear_cpu(cpu, data->cpumask))
                              continue;
      
                      data->csd.func(data->csd.info);
      
                      refs = atomic_dec_return(&data->refs);
                      WARN_ON(refs < 0);      <-------------------------
      
      We atomically tested and cleared our bit in the cpumask, and yet the
      number of cpus left (ie refs) was 0.  How can this be?
      
      It turns out commit 54fdade1
      ("generic-ipi: make struct call_function_data lockless") is at fault.  It
      removes locking from smp_call_function_many and in doing so creates a
      rather complicated race.
      
      The problem comes about because:
      
       - The smp_call_function_many interrupt handler walks call_function.queue
         without any locking.
       - We reuse a percpu data structure in smp_call_function_many.
       - We do not wait for any RCU grace period before starting the next
         smp_call_function_many.
      
      Imagine a scenario where CPU A does two smp_call_functions back to back,
      and CPU B does an smp_call_function in between.  We concentrate on how CPU
      C handles the calls:
      
      CPU A            CPU B                  CPU C              CPU D
      
      smp_call_function
                                              smp_call_function_interrupt
                                                  walks
      					call_function.queue sees
      					data from CPU A on list
      
                       smp_call_function
      
                                              smp_call_function_interrupt
                                                  walks
      
                                              call_function.queue sees
                                                (stale) CPU A on list
      							   smp_call_function int
      							   clears last ref on A
      							   list_del_rcu, unlock
      smp_call_function reuses
      percpu *data A
                                               data->cpumask sees and
                                               clears bit in cpumask
                                               might be using old or new fn!
                                               decrements refs below 0
      
      set data->refs (too late!)
      
      The important thing to note is since the interrupt handler walks a
      potentially stale call_function.queue without any locking, then another
      cpu can view the percpu *data structure at any time, even when the owner
      is in the process of initialising it.
      
      The following test case hits the WARN_ON 100% of the time on my PowerPC
      box (having 128 threads does help :)
      
#include <linux/module.h>
#include <linux/init.h>
#include <linux/kernel.h>    /* printk */
#include <linux/smp.h>       /* smp_call_function */
#include <linux/workqueue.h> /* work_struct, schedule_work_on */
      
      #define ITERATIONS 100
      
      static void do_nothing_ipi(void *dummy)
      {
      }
      
      static void do_ipis(struct work_struct *dummy)
      {
      	int i;
      
      	for (i = 0; i < ITERATIONS; i++)
      		smp_call_function(do_nothing_ipi, NULL, 1);
      
      	printk(KERN_DEBUG "cpu %d finished\n", smp_processor_id());
      }
      
      static struct work_struct work[NR_CPUS];
      
      static int __init testcase_init(void)
      {
      	int cpu;
      
      	for_each_online_cpu(cpu) {
      		INIT_WORK(&work[cpu], do_ipis);
      		schedule_work_on(cpu, &work[cpu]);
      	}
      
      	return 0;
      }
      
      static void __exit testcase_exit(void)
      {
      }
      
      module_init(testcase_init)
      module_exit(testcase_exit)
      MODULE_LICENSE("GPL");
      MODULE_AUTHOR("Anton Blanchard");
      
      I tried to fix it by ordering the read and the write of ->cpumask and
      ->refs.  In doing so I missed a critical case, but Paul McKenney was able
      to spot my bug, thankfully :) To ensure we aren't viewing previous
      iterations, the interrupt handler needs to read ->refs, then ->cpumask, then
      ->refs _again_.
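      
      That read sequence can be sketched as follows (a simplified fragment of the
      handler's per-entry checks, following the description above rather than the
      verbatim patch):
      
      	/* Make sure we observe a consistent generation of the entry. */
      	if (!atomic_read(&data->refs))
      		continue;		/* not yet initialized, or torn down */
      	smp_rmb();
      	if (!cpumask_test_cpu(cpu, data->cpumask))
      		continue;		/* not aimed at this CPU */
      	/* the mask could still belong to a previous, stale iteration ... */
      	smp_rmb();
      	if (!atomic_read(&data->refs))
      		continue;		/* ... so re-check refs before using it */
      
      	/* only now is it safe to consume the entry on this CPU */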
      
      Thanks to Milton Miller and Paul McKenney for helping to debug this issue.
      
      [miltonm@bga.com: add WARN_ON and BUG_ON, remove extra read of refs before initial read of mask that doesn't help (also noted by Peter Zijlstra), adjust comments, hopefully clarify scenario ]
      [miltonm@bga.com: remove excess tests]
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Milton Miller <miltonm@bga.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
      Cc: <stable@kernel.org> [2.6.32+]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      6dc19899
  12. 20 Jan 2011 (5 commits)
    • smp: Allow on_each_cpu() to be called while early_boot_irqs_disabled is set · bd924e8c
      By Tejun Heo
      percpu may end up calling vfree() during early boot, which in
      turn may call on_each_cpu() for TLB flushes.  What on_each_cpu()
      does can be performed safely while IRQs are disabled during early
      boot, but the function assumed it is always called with local
      IRQs enabled, which ended up enabling local IRQs prematurely
      during boot and triggering a couple of warnings.
      
      This patch updates on_each_cpu() and smp_call_function_many()
      so that on_each_cpu() can be used safely while
      early_boot_irqs_disabled is set.
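      
      The two adjustments can be sketched as follows (simplified fragments):
      
      	/* smp_call_function_many() (sketch): the deadlock warning must not
      	 * fire while the early-boot "IRQs legitimately off" window is open. */
      	WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled()
      		     && !oops_in_progress && !early_boot_irqs_disabled);
      
      	/* on_each_cpu() (sketch): preserve the caller's IRQ state instead of
      	 * unconditionally re-enabling interrupts after the local call. */
      	local_irq_save(flags);
      	func(info);
      	local_irq_restore(flags);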
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Acked-by: Pekka Enberg <penberg@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      LKML-Reference: <20110120110713.GC6036@htj.dyndns.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Reported-by: Ingo Molnar <mingo@elte.hu>
      bd924e8c
    • lockdep: Move early boot local IRQ enable/disable status to init/main.c · 2ce802f6
      By Tejun Heo
      During early boot, local IRQ is disabled until IRQ subsystem is
      properly initialized.  During this time, no one should enable
      local IRQ and some operations which usually are not allowed with
      IRQ disabled, e.g. operations which might sleep or require
      communications with other processors, are allowed.
      
      lockdep tracked this with early_boot_irqs_off/on() callbacks.
      As other subsystems need this information too, move it to
      init/main.c and make it generally available.  While at it,
      invert the boolean: it is now early_boot_irqs_disabled instead of
      enabled, so it can be initialized to %false, and %true
      indicates the exceptional condition.
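      
      Conceptually the generic flag ends up looking like this sketch (simplified):
      
      	/* init/main.c (sketch): true only while early boot runs with IRQs off. */
      	bool early_boot_irqs_disabled __read_mostly;
      
      	asmlinkage void __init start_kernel(void)
      	{
      		/* ... */
      		early_boot_irqs_disabled = true;
      		/* ... early setup, IRQ subsystem initialization ... */
      		early_boot_irqs_disabled = false;
      		local_irq_enable();
      		/* ... */
      	}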
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Pekka Enberg <penberg@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      LKML-Reference: <20110120110635.GB6036@htj.dyndns.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      2ce802f6
    • hrtimers: Notify hrtimer users of switches to NOHZ mode · 2d0640b4
      By Stephen Boyd
      When NOHZ=y and high res timers are disabled (via cmdline or
      Kconfig) tick_nohz_switch_to_nohz() will notify the user about
      switching into NOHZ mode. Nothing is printed for the case where
      HIGH_RES_TIMERS=y. Fix this for the HIGH_RES_TIMERS=y case by
      duplicating the printk from the low res NOHZ path in the high
      res NOHZ path.
      
      This confused me since I was thinking 'dmesg | grep -i NOHZ' would
      tell me if NOHZ was enabled, but if I have hrtimers there is
      nothing.
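      
      A sketch of the change (simplified): when the high-res path switches the
      sched tick into NOHZ mode, print the same kind of message the low-res path
      already does.
      
      	/* high resolution switch path (sketch) */
      	if (tick_nohz_enabled) {
      		ts->nohz_mode = NOHZ_MODE_HIGHRES;
      		printk(KERN_INFO "Switched to NOHz mode on CPU #%d\n",
      		       smp_processor_id());
      	}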
      Signed-off-by: Stephen Boyd <sboyd@codeaurora.org>
      Acked-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1295419594-13085-1-git-send-email-sboyd@codeaurora.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      2d0640b4
    • perf: Fix perf_event_init_task()/perf_event_free_task() interaction · 8550d7cb
      By Oleg Nesterov
      perf_event_init_task() should clear child->perf_event_ctxp[]
      before anything else. Otherwise, if
      perf_event_init_context(perf_hw_context) fails,
      perf_event_free_task() can free perf_event_ctxp[perf_sw_context]
      copied from parent->perf_event_ctxp[] by dup_task_struct().
      
      Also move the initialization of perf_event_mutex and
      perf_event_list from perf_event_init_context() to
      perf_event_init_task().
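      
      The resulting ordering can be sketched like this (simplified):
      
      	int perf_event_init_task(struct task_struct *child)
      	{
      		int ctxn, ret;
      
      		/* Clear first: dup_task_struct() copied the parent's context
      		 * pointers, and an error below must not let
      		 * perf_event_free_task() free the parent's contexts. */
      		memset(child->perf_event_ctxp, 0, sizeof(child->perf_event_ctxp));
      		mutex_init(&child->perf_event_mutex);
      		INIT_LIST_HEAD(&child->perf_event_list);
      
      		for_each_task_context_nr(ctxn) {
      			ret = perf_event_init_context(child, ctxn);
      			if (ret)
      				return ret;
      		}
      		return 0;
      	}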
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Prasad <prasad@linux.vnet.ibm.com>
      Cc: Roland McGrath <roland@redhat.com>
      LKML-Reference: <20110119182228.GC12183@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      8550d7cb
    • perf: Fix find_get_context() vs perf_event_exit_task() race · dbe08d82
      By Oleg Nesterov
      find_get_context() must not install the new perf_event_context
      if the task has already passed perf_event_exit_task().
      
      If nothing else, this means a memory leak. Initially
      ctx->refcount == 2; it is assumed that
      perf_event_exit_task_context() will participate and do the
      necessary put_ctx().
      
      find_lively_task_by_vpid() checks PF_EXITING but this buys
      nothing, by the time we call find_get_context() this task can be
      already dead. To the point, cmpxchg() can succeed when the task
      has already done the last schedule().
      
      Change find_get_context() to populate task->perf_event_ctxp[]
      under task->perf_event_mutex, this way we can trust PF_EXITING
      because perf_event_exit_task() takes the same mutex.
      
      Also, change perf_event_exit_task_context() to use
      rcu_dereference(). Probably this is not strictly needed, but
      with or without this change find_get_context() can race with
      setup_new_exec()->perf_event_exit_task(), rcu_dereference()
      looks better.
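      
      A sketch of the install path after the change (simplified, error handling
      trimmed): the new context is published only under task->perf_event_mutex,
      where PF_EXITING can be trusted because perf_event_exit_task() takes the
      same mutex.
      
      	mutex_lock(&task->perf_event_mutex);
      	if (task->flags & PF_EXITING)
      		err = -ESRCH;			/* task already exiting */
      	else if (task->perf_event_ctxp[ctxn])
      		err = -EAGAIN;			/* somebody beat us to it */
      	else
      		rcu_assign_pointer(task->perf_event_ctxp[ctxn], ctx);
      	mutex_unlock(&task->perf_event_mutex);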
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Prasad <prasad@linux.vnet.ibm.com>
      Cc: Roland McGrath <roland@redhat.com>
      LKML-Reference: <20110119182207.GB12183@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      dbe08d82
  13. 19 Jan 2011 (4 commits)
    • blktrace: Don't output messages if NOTIFY isn't set. · 490da40d
      By Tao Ma
      Now, if we enable blktrace, cfq outputs too many messages to the
      trace buffer. That is fine if we don't specify any action mask,
      but if I do this:
      blktrace /dev/sdb -a issue -a complete -o - | blkparse -i -
      I only want to see 'D' and 'C', while with the following command
      dd if=/mnt/ocfs2/test of=/dev/null bs=4k count=1 iflag=direct
      
      I will get (with a 2.6.37 vanilla kernel):
        8,16   0        0     0.000000000     0  m   N cfq3805 alloced
        8,16   0        0     0.000004126     0  m   N cfq3805 insert_request
        8,16   0        0     0.000004884     0  m   N cfq3805 add_to_rr
        8,16   0        0     0.000008417     0  m   N cfq workload slice:300
        8,16   0        0     0.000009557     0  m   N cfq3805 set_active wl_prio:0 wl_type:2
        8,16   0        0     0.000010640     0  m   N cfq3805 fifo=          (null)
        8,16   0        0     0.000011193     0  m   N cfq3805 dispatch_insert
        8,16   0        0     0.000012221     0  m   N cfq3805 dispatched a request
        8,16   0        0     0.000012802     0  m   N cfq3805 activate rq, drv=1
        8,16   0        1     0.000013181  3805  D   R 114759 + 8 [dd]
        8,16   0        2     0.000164244     0  C   R 114759 + 8 [0]
        8,16   0        0     0.000167997     0  m   N cfq3805 complete rqnoidle 0
        8,16   0        0     0.000168782     0  m   N cfq3805 set_slice=100
        8,16   0        0     0.000169874     0  m   N cfq3805 arm_idle: 8 group_idle: 0
        8,16   0        0     0.000170189     0  m   N cfq schedule dispatch
        8,16   0        0     0.000397938     0  m   N cfq3805 slice expired t=0
        8,16   0        0     0.000399763     0  m   N cfq3805 sl_used=1 disp=1 charge=1 iops=0 sect=8
        8,16   0        0     0.000400227     0  m   N cfq3805 del_from_rr
        8,16   0        0     0.000400882     0  m   N cfq3805 put_queue
      
      See, there are 19 lines while I only need 2. That is not reasonable
      output for a user.
      
      So this patch will disable any messages if the BLK_TC_NOTIFY isn't set.
      Now the output for the same command will look like:
        8,16   0        1     0.000000000  4908  D   R 114759 + 8 [dd]
        8,16   0        2     0.000146827     0  C   R 114759 + 8 [0]
      
      Yes, it is what I want to see.
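      
      The filter itself is small; roughly (a sketch of the check in the
      note-message path):
      
      	/* __trace_note_message() (sketch): honour the user's action mask, so
      	 * 'm'/N note messages are only emitted when BLK_TC_NOTIFY was asked for. */
      	if (!(bt->act_mask & BLK_TC_NOTIFY))
      		return;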
      
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Jeff Moyer <jmoyer@redhat.com>
      Signed-off-by: Tao Ma <boyu.mt@taobao.com>
      Signed-off-by: Jens Axboe <jaxboe@fusionio.com>
      490da40d
    • sched, cgroup: Use exit hook to avoid use-after-free crash · 068c5cc5
      By Peter Zijlstra
      By not notifying the controller of the on-exit move back to
      init_css_set, we fail to move the task out of the previous
      cgroup's cfs_rq. This leads to an opportunity for a
      cgroup-destroy to come in and free the cgroup (there are no
      active tasks left in it after all) to which the not-quite dead
      task is still enqueued.
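      
      A sketch of the hook (simplified, with an abbreviated callback signature):
      the cpu controller gets an exit callback that re-attaches the dying task's
      scheduling entities, so nothing is left enqueued on the soon-to-be-freed
      cgroup's cfs_rq.
      
      	static void cpu_cgroup_exit(struct cgroup_subsys *ss,
      				    struct task_struct *task)
      	{
      		/* move back onto the task's current (root) group */
      		sched_move_task(task);
      	}
      
      	struct cgroup_subsys cpu_cgroup_subsys = {
      		.name	= "cpu",
      		/* ... */
      		.exit	= cpu_cgroup_exit,
      	};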
      Reported-by: Miklos Vajna <vmiklos@frugalware.org>
      Fixed-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: <stable@kernel.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      LKML-Reference: <1293206353.29444.205.camel@laptop>
      068c5cc5
    • perf: Validate cpu early in perf_event_alloc() · 66832eb4
      By Oleg Nesterov
      Starting from perf_event_alloc()->perf_init_event(), the kernel
      assumes that event->cpu is either -1 or the valid CPU number.
      
      Change perf_event_alloc() to validate this argument early. This
      also means we can remove the similar check in
      find_get_context().
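      
      The check is small; roughly (a sketch):
      
      	/* Early in perf_event_alloc() (sketch): reject out-of-range CPU
      	 * numbers up front; cpu == -1 is only meaningful with a task. */
      	if ((unsigned int)cpu >= nr_cpu_ids) {
      		if (!task || cpu != -1)
      			return ERR_PTR(-EINVAL);
      	}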
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Prasad <prasad@linux.vnet.ibm.com>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: gregkh@suse.de
      Cc: stable@kernel.org
      LKML-Reference: <20110118161032.GC693@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      66832eb4
    • perf: Find_get_context: fix the per-cpu-counter check · 22a4ec72
      By Oleg Nesterov
      If task == NULL, find_get_context() should always check that cpu
      is correct.
      
      Afaics, the bug was introduced by commit 38a81da2 ("perf events: Clean
      up pid passing"), but even before that commit "&& cpu != -1" was
      not exactly right; -ESRCH from find_task_by_vpid() is not
      accurate.
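      
      A sketch of the corrected check (simplified): with no target task, the cpu
      argument must always be validated, rather than only when a particular
      pid/cpu combination happens to hold.
      
      	/* find_get_context() (sketch) */
      	if (!task) {
      		/* per-cpu counter: the cpu must be real and online */
      		if (!cpu_online(cpu))
      			return ERR_PTR(-ENODEV);
      		/* ... use the per-cpu context for 'cpu' ... */
      	}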
      Signed-off-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Prasad <prasad@linux.vnet.ibm.com>
      Cc: Roland McGrath <roland@redhat.com>
      Cc: gregkh@suse.de
      Cc: stable@kernel.org
      LKML-Reference: <20110118161008.GB693@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      22a4ec72
  14. 18 Jan 2011 (3 commits)