  1. 19 Oct 2010, 3 commits
    • irq_work: Add generic hardirq context callbacks · e360adbe
      Committed by Peter Zijlstra
      Provide a mechanism that allows running code in IRQ context. It is
      most useful for NMI code that needs to interact with the rest of the
      system -- for example, waking up a task to drain buffers.
      
      Perf currently has such a mechanism, so extract that and provide it as
      a generic feature, independent of perf so that others may also
      benefit.
      
      The IRQ context callback is generated through self-IPIs where
      possible, or, on architectures like powerpc, by programming the
      decrementer (the built-in timer facility) to generate an interrupt
      immediately.
      
      Architectures that don't have anything like this have to make do with a
      callback from the timer tick. These architectures can call
      irq_work_run() at the tail of any IRQ handlers that might enqueue such
      work (like the perf IRQ handler) to avoid undue latencies in
      processing the work.
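      
      A rough sketch of how such an interface is typically used from NMI-like
      context (the calls below follow the irq_work API this patch introduces,
      but treat the exact signatures and the per-CPU setup as illustrative
      rather than a verbatim copy of the patch):
      
        #include <linux/irq_work.h>
        #include <linux/percpu.h>
        
        /* Runs later in hardirq context, where it is safe to take locks,
         * wake up tasks, etc. */
        static void drain_buffers(struct irq_work *work)
        {
                /* drain per-CPU buffers, wake the reader task, ... */
        }
        
        static DEFINE_PER_CPU(struct irq_work, drain_work);
        
        static void drain_work_setup(void)
        {
                int cpu;
        
                for_each_possible_cpu(cpu)
                        init_irq_work(per_cpu_ptr(&drain_work, cpu),
                                      drain_buffers);
        }
        
        /* NMI path: only queues the entry and arms a self-IPI (or falls
         * back to the next timer tick on architectures without one). */
        static void nmi_handler_path(void)
        {
                irq_work_queue(this_cpu_ptr(&drain_work));
        }
      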
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Kyle McMartin <kyle@mcmartin.ca>
      Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      [ various fixes ]
      Signed-off-by: Huang Ying <ying.huang@intel.com>
      LKML-Reference: <1287036094.7768.291.camel@yhuang-dev>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf_events: Fix transaction recovery in group_sched_in() · 8e5fc1a7
      Committed by Stephane Eranian
      The group_sched_in() function uses a transactional approach to schedule
      a group of events. In a group, either all events can be scheduled or
      none are. To schedule each event in, the function calls event_sched_in().
      In case of error, event_sched_out() is called on each event in the group.
      
      The problem is that event_sched_out() does not completely cancel the
      effects of event_sched_in(). Furthermore, event_sched_out() changes the
      state of the event as if it had run, which is not true in this particular
      case.
      
      Those inconsistencies impact time tracking fields and may lead to events
      in a group not all reporting the same time_enabled and time_running values.
      This is demonstrated with the example below:
      
      $ task -eunhalted_core_cycles,baclears,baclears -e unhalted_core_cycles,baclears,baclears sleep 5
      1946101 unhalted_core_cycles (32.85% scaling, ena=829181, run=556827)
        11423 baclears (32.85% scaling, ena=829181, run=556827)
         7671 baclears (0.00% scaling, ena=556827, run=556827)
      
      2250443 unhalted_core_cycles (57.83% scaling, ena=962822, run=405995)
        11705 baclears (57.83% scaling, ena=962822, run=405995)
        11705 baclears (57.83% scaling, ena=962822, run=405995)
      
      Notice that in the first group, the last baclears event does not
      report the same timings as its siblings.
      
      This issue comes from the fact that tstamp_stopped is updated
      by event_sched_out() as if the event had actually run.
      
      To solve the issue, we must ensure that, in case of error, there is
      no change in the event state whatsoever. That means timings must
      remain as they were when entering group_sched_in().
      
      To do this, we defer updating tstamp_running until we know the
      transaction succeeded. Therefore, we have split event_sched_in()
      into two parts, separating out the update to tstamp_running.
      
      Similarly, in case of error, we do not want to update tstamp_stopped.
      Therefore, we have split event_sched_out() into two parts, separating
      out the update to tstamp_stopped.
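      
      The idea, reduced to a self-contained illustration (plain C with made-up
      helper names; the real code goes through the PMU transaction interface):
      
        #include <stdbool.h>
        #include <stdio.h>
        
        struct event {
                bool scheduled;
                long tstamp_running;    /* must not move if the group fails */
        };
        
        /* part 1: try to program the event; no timestamp side effects */
        static bool event_sched_in_try(struct event *e)
        {
                e->scheduled = true;    /* pretend programming succeeded */
                return true;
        }
        
        /* part 2: applied only once the whole transaction has committed */
        static void event_commit_tstamp_running(struct event *e, long now)
        {
                e->tstamp_running = now;
        }
        
        static int group_sched_in(struct event *ev, int n, long now)
        {
                int i;
        
                for (i = 0; i < n; i++) {
                        if (!event_sched_in_try(&ev[i])) {
                                /* roll back without touching timestamps */
                                while (i--)
                                        ev[i].scheduled = false;
                                return -1;
                        }
                }
                for (i = 0; i < n; i++)         /* transaction succeeded */
                        event_commit_tstamp_running(&ev[i], now);
                return 0;
        }
        
        int main(void)
        {
                struct event group[3] = { { false, 0 } };
        
                if (group_sched_in(group, 3, 100) == 0)
                        printf("tstamp_running=%ld\n", group[2].tstamp_running);
                return 0;
        }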
      
      With this patch, we now get the following output:
      
      $ task -eunhalted_core_cycles,baclears,baclears -e unhalted_core_cycles,baclears,baclears sleep 5
      2492050 unhalted_core_cycles (71.75% scaling, ena=1093330, run=308841)
        11243 baclears (71.75% scaling, ena=1093330, run=308841)
        11243 baclears (71.75% scaling, ena=1093330, run=308841)
      
      1852746 unhalted_core_cycles (0.00% scaling, ena=784489, run=784489)
         9253 baclears (0.00% scaling, ena=784489, run=784489)
         9253 baclears (0.00% scaling, ena=784489, run=784489)
      
      Note that the uneven timing between groups is a side effect of
      the process spending most of its time sleeping, i.e., not enough
      event rotations (but that's a separate issue).
      Signed-off-by: Stephane Eranian <eranian@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <4cb86b4c.41e9d80a.44e9.3e19@mx.google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf_events: Fix bogus context time tracking · c530ccd9
      Committed by Stephane Eranian
      You can only call update_context_time() when the context
      is active, i.e., the thread it is attached to is still running.
      
      However, perf_event_read() can be called even when the context
      is inactive, e.g., when user space read()s the counters. The call to
      update_context_time() must therefore be conditioned on the status of
      the context, otherwise bogus time_enabled and time_running values may
      be returned. Here is an example on AMD64. The task program
      is an example from libpfm4; the -p option prints deltas every 1s.
      
      $ task -p -e cpu_clk_unhalted sleep 5
              2,266,610 cpu_clk_unhalted (0.00% scaling, ena=2,158,982, run=2,158,982)
                      0 cpu_clk_unhalted (0.00% scaling, ena=2,158,982, run=2,158,982)
                      0 cpu_clk_unhalted (0.00% scaling, ena=2,158,982, run=2,158,982)
                      0 cpu_clk_unhalted (0.00% scaling, ena=2,158,982, run=2,158,982)
                      0 cpu_clk_unhalted (0.00% scaling, ena=2,158,982, run=2,158,982)
          5,242,358,071 cpu_clk_unhalted (99.95% scaling, ena=5,000,359,984, run=2,319,270)
      
      Whereas if you don't read deltas, e.g., no call to perf_event_read() until
      the process terminates:
      
      $ task -e cpu_clk_unhalted sleep 5
          2,497,783 cpu_clk_unhalted (0.00% scaling, ena=2,376,899, run=2,376,899)
      
      Notice that time_enabled and time_running are bogus in the first example,
      causing bogus scaling.
      
      This patch fixes the problem, by conditionally calling update_context_time()
      in perf_event_read().
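      
      Reduced to a self-contained analogue (plain C, hypothetical field names),
      the rule is simply that the read path may only advance the context clock
      while the context is active:
      
        #include <stdbool.h>
        #include <stdio.h>
        
        struct ctx {
                bool active;
                long timestamp;         /* when the context last went active */
                long time_enabled;
        };
        
        static void update_context_time(struct ctx *c, long now)
        {
                c->time_enabled += now - c->timestamp;
                c->timestamp = now;
        }
        
        static long read_time_enabled(struct ctx *c, long now)
        {
                if (c->active)          /* the fix: condition the update */
                        update_context_time(c, now);
                return c->time_enabled;
        }
        
        int main(void)
        {
                /* task is sleeping: context inactive, 50 units accumulated */
                struct ctx c = { false, 0, 50 };
        
                /* a read() at now=1000 must not inflate time_enabled */
                printf("time_enabled=%ld\n", read_time_enabled(&c, 1000));
                return 0;
        }
      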
      Signed-off-by: Stephane Eranian <eranian@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: stable@kernel.org
      LKML-Reference: <4cb856dc.51edd80a.5ae0.38fb@mx.google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  2. 15 Oct 2010, 2 commits
    • ftrace: Rename config option HAVE_C_MCOUNT_RECORD to HAVE_C_RECORDMCOUNT · cf4db259
      Committed by Steven Rostedt
      The config option used by archs to let the build system know that
      the C version of the recordmcount works for said arch is currently
      called HAVE_C_MCOUNT_RECORD which enables BUILD_C_RECORDMCOUNT. To
      be more consistent with the name that all archs may use, it has been
      renamed to HAVE_C_RECORDMCOUNT. This will be less confusing since
      we are building a C recordmcount and not an mcount_record.
      Suggested-by: Ingo Molnar <mingo@elte.hu>
      Cc: <linux-arch@vger.kernel.org>
      Cc: Michal Marek <mmarek@suse.cz>
      Cc: linux-kbuild@vger.kernel.org
      Cc: John Reiser <jreiser@bitwagon.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
    • ftrace/x86: Add support for C version of recordmcount · 72441cb1
      Committed by Steven Rostedt
      This patch adds support for the C version of recordmcount; compile
      times show a ~12% improvement.
      
      After verifying this works, other archs can add:
      
       HAVE_C_MCOUNT_RECORD
      
      to their Kconfig and the build will use the C version of recordmcount
      instead of the Perl version.
      
      Cc: <linux-arch@vger.kernel.org>
      Cc: Michal Marek <mmarek@suse.cz>
      Cc: linux-kbuild@vger.kernel.org
      Cc: John Reiser <jreiser@bitwagon.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
  3. 14 Oct 2010, 1 commit
  4. 13 Oct 2010, 1 commit
    • tracing: Fix function-graph build warning on 32-bit · 14cae9bd
      Committed by Borislav Petkov
      Fix
      
      kernel/trace/trace_functions_graph.c: In function ‘trace_print_graph_duration’:
      kernel/trace/trace_functions_graph.c:652: warning: comparison of distinct pointer types lacks a cast
      
      when building 36-rc6 on a 32-bit kernel, due to the strict type check
      failing in the min() macro.
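      
      For reference, a self-contained illustration of why the warning fires and
      of the usual min_t()-style remedy (GNU C, mimicking the kernel's
      type-checking min() macro; the actual hunk in this commit may differ):
      
        #include <stdio.h>
        
        /* same trick the kernel's min() uses: comparing pointers to the
         * two temporaries warns when their types differ */
        #define strict_min(x, y) ({                     \
                typeof(x) _x = (x);                     \
                typeof(y) _y = (y);                     \
                (void)(&_x == &_y);                     \
                _x < _y ? _x : _y; })
        
        #define min_t(type, x, y) strict_min((type)(x), (type)(y))
        
        int main(void)
        {
                unsigned long long duration = 1000;     /* u64-like field */
                unsigned long usecs = 2000;             /* 32 bits on i386 */
        
                /* strict_min(duration, usecs) would warn on 32-bit:
                 * "comparison of distinct pointer types lacks a cast" */
                printf("%lu\n", min_t(unsigned long, duration, usecs));
                return 0;
        }
      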
      Signed-off-by: Borislav Petkov <bp@alien8.de>
      Cc: Chase Douglas <chase.douglas@canonical.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      LKML-Reference: <20100929080823.GA13595@liondog.tnic>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
  5. 11 Oct 2010, 1 commit
  6. 06 Oct 2010, 1 commit
    • modules: Fix module_bug_list list corruption race · 5336377d
      Committed by Linus Torvalds
      With all the recent module loading cleanups, we've minimized the code
      that sits under module_mutex, fixing various deadlocks and making it
      possible to do most of the module loading in parallel.
      
      However, that whole conversion totally missed the rather obscure code
      that adds a new module to the list for BUG() handling.  That code was
      doubly obscure because (a) the code itself lives in lib/bug.c (for
      dubious reasons) and (b) it gets called from the architecture-specific
      "module_finalize()" rather than from generic code.
      
      Calling it from arch-specific code makes no sense whatsoever to begin
      with, and is now actively wrong since that code isn't protected by the
      module loading lock any more.
      
      So this commit moves the "module_bug_{finalize,cleanup}()" calls away
      from the arch-specific code, and into the generic code - and in the
      process protects it with the module_mutex so that the list operations
      are now safe.
      
      Future fixups:
       - move the module list handling code into kernel/module.c where it
         belongs.
       - get rid of 'module_bug_list' and just use the regular list of modules
         (called 'modules' - imagine that) that we already create and maintain
         for other reasons.
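      
      Schematically, the call now happens in the generic loader inside the
      mutex-protected region rather than in each arch's module_finalize(); a
      sketch of the shape only, argument names approximate, not the literal
      diff:
      
        /* kernel/module.c, load_module() -- sketch */
        mutex_lock(&module_mutex);
        /* add to the global module list while holding the lock ... */
        list_add_rcu(&mod->list, &modules);
        /* ... and to module_bug_list, which previously happened without
         * any locking from arch-specific module_finalize() */
        module_bug_finalize(hdr, sechdrs, mod);
        mutex_unlock(&module_mutex);
      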
      Reported-and-tested-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Adrian Bunk <bunk@kernel.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: stable@kernel.org
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 04 Oct 2010, 1 commit
    • perf_events: Fix invalid pointer when pid is invalid · 540804b5
      Committed by Stephane Eranian
      This patch fixes an error in perf_event_open() when the pid
      provided by the user is invalid. find_lively_task_by_vpid()
      does not return NULL on error, but an error code. Without the
      fix, the error code was silently passed to find_get_context(),
      which would eventually cause an invalid pointer dereference.
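      
      The shape of the fix (sketch; surrounding perf_event_open() code is
      abbreviated and the error label name is illustrative):
      
        task = find_lively_task_by_vpid(pid);
        if (IS_ERR(task)) {
                err = PTR_ERR(task);    /* e.g. -ESRCH for a bad pid */
                goto err_out;           /* label name illustrative */
        }
        /* only a validated task pointer reaches find_get_context() */
      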
      Signed-off-by: Stephane Eranian <eranian@google.com>
      Cc: peterz@infradead.org
      Cc: paulus@samba.org
      Cc: davem@davemloft.net
      Cc: fweisbec@gmail.com
      Cc: perfmon2-devel@lists.sf.net
      Cc: eranian@gmail.com
      Cc: robert.richter@amd.com
      LKML-Reference: <4ca9a5d1.e8e9d80a.3dbb.ffff8f2e@mx.google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  8. 02 Oct 2010, 1 commit
    • kfifo: fix scatterlist usage · 399f1e30
      Committed by Ira W. Snyder
      The kfifo_dma family of functions use sg_mark_end() on the last element in
      their scatterlist.  This forces use of a fresh scatterlist for each DMA
      operation, which makes recycling a single scatterlist impossible.
      
      Change the behavior of the kfifo_dma functions to match the usage of the
      dma_map_sg function.  This means that users must respect the returned
      nents value.  The sample code is updated to reflect the change.
      
      This bug is trivial to cause: call kfifo_dma_in_prepare() such that it
      prepares a scatterlist with a single entry comprising the whole fifo.
      This is the case when you map the entirety of a newly created empty fifo.
      This causes the setup_sgl() function to mark the first scatterlist entry
      as the end of the chain, no matter what comes after it.
      
      Afterwards, add and remove some data from the fifo such that another call
      to kfifo_dma_in_prepare() will create two scatterlist entries.  It returns
      nents=2.  However, due to the previous sg_mark_end() call, sg_is_last()
      will now return true for the first scatterlist element.  This causes the
      sample code to print a single scatterlist element when it should print
      two.
      
      By removing the call to sg_mark_end(), we make the API as similar as
      possible to the DMA mapping API.  All users are required to respect the
      returned nents.
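      
      A sketch of the intended calling convention after this change (it mirrors
      how dma_map_sg() results are consumed; the descriptor-setup helper is
      hypothetical):
      
        struct scatterlist sg[2];
        unsigned int nents, i;
        
        sg_init_table(sg, ARRAY_SIZE(sg));
        nents = kfifo_dma_in_prepare(&fifo, sg, ARRAY_SIZE(sg), len);
        if (!nents)
                return -ENOSPC;         /* fifo full, nothing was mapped */
        
        /* only the first 'nents' entries are valid; do not rely on
         * sg_is_last()/sg_mark_end() to find the end of the list */
        for (i = 0; i < nents; i++)
                setup_dma_descriptor(&sg[i]);   /* hypothetical helper */
      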
      Signed-off-by: Ira W. Snyder <iws@ovro.caltech.edu>
      Cc: Stefani Seibold <stefani@seibold.net>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 23 Sep 2010, 5 commits
  10. 21 Sep 2010, 2 commits
    • perf: Avoid RCU vs preemption assumptions · 41945f6c
      Committed by Peter Zijlstra
      The per-pmu per-cpu context patch converted things from
      get_cpu_var() to this_cpu_ptr(), but that only works if
      rcu_read_lock() actually disables preemption, and since
      there is no such guarantee, we need to fix that.
      
      Use the newly introduced {get,put}_cpu_ptr().
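      
      Sketched usage (the pmu_cpu_context field name follows the per-pmu
      context code of the time; treat the fragment as illustrative):
      
        struct perf_cpu_context *cpuctx;
        
        /* get_cpu_ptr() disables preemption, so the pointer cannot go
         * stale even if rcu_read_lock() does not disable preemption */
        cpuctx = get_cpu_ptr(pmu->pmu_cpu_context);
        /* ... use cpuctx ... */
        put_cpu_ptr(pmu->pmu_cpu_context);      /* re-enables preemption */
      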
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Tejun Heo <tj@kernel.org>
      LKML-Reference: <20100917093009.308453028@chello.nl>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched: Fix nohz balance kick · f6c3f168
      Committed by Suresh Siddha
      There's a situation where the nohz balancer will try to wake itself:
      
      cpu-x is idle and is also the ilb_cpu. It gets a scheduler tick during
      idle, and nohz_kick_needed() in trigger_load_balance() checks
      rq_x->nr_running, which might not be zero (because someone woke a task
      on this rq, etc.). This leads to cpu-x sending a kick to itself.
      
      And this can cause a lockup.
      
      Avoid this by not marking ourselves eligible for kicking.
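      
      The guard amounts to something like the following (illustrative only, not
      the literal hunk; the accessor name is approximate):
      
        int ilb_cpu = get_nohz_load_balancer();  /* accessor name approximate */
        
        /* never pick ourselves as the kick target */
        if (ilb_cpu == smp_processor_id())
                return;
        smp_send_reschedule(ilb_cpu);
      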
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1284400941.2684.19.camel@sbsiddha-MOBL3.sc.intel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  11. 17 Sep 2010, 5 commits
  12. 15 Sep 2010, 15 commits
  13. 14 Sep 2010, 2 commits
    • tracing: Replace typecasted void pointer in set_ftrace_filter code · 4aeb6967
      Committed by Steven Rostedt
      The set_ftrace_filter uses seq_file and reads from two lists. The
      pointer returned by t_next() can either be of type struct dyn_ftrace
      or struct ftrace_func_probe. If there is a bug (there was one), the
      wrong pointer may be used, and dereferencing it can cause an oops.
      
      This patch makes t_next() and friends return only the iterator structure,
      which now has pointers of type struct dyn_ftrace and struct
      ftrace_func_probe. t_show() can then test whether each pointer is NULL,
      and if a pointer is set, it is guaranteed to be of the correct type.
      
      Now if there's a bug, only wrong data will be shown but not an oops.
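      
      A simplified sketch of the resulting shape (field names approximate):
      
        struct ftrace_iterator {
                struct dyn_ftrace        *func;   /* set while walking functions */
                struct ftrace_func_probe *probe;  /* set while walking probes */
        };
        
        static int t_show(struct seq_file *m, void *v)
        {
                struct ftrace_iterator *iter = v;
        
                if (iter->probe)
                        /* print the probe entry (details omitted) */
                        return 0;
                if (!iter->func)
                        return 0;       /* NULL pointer: print nothing, no oops */
                seq_printf(m, "%ps\n", (void *)iter->func->ip);
                return 0;
        }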
      
      Cc: Chris Wright <chrisw@sous-sol.org>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
    • tracing: Do not reset *pos in set_ftrace_filter · 2bccfffd
      Committed by Steven Rostedt
      After the filtered functions are read, the probed functions are read
      from the hash in set_ftrace_filter. When the hashed probed functions
      are read, the *pos passed in is reset. Instead of modifying the pos
      given to the read function, just record the pos where the filtered
      functions ended and subtract from that.
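      
      Roughly, the hash walk now starts from an offset relative to a recorded
      func_pos instead of rewinding *pos (names approximate, helper
      hypothetical):
      
        static void *t_hash_start(struct seq_file *m, loff_t *pos)
        {
                struct ftrace_iterator *iter = m->private;
                loff_t l = *pos - iter->func_pos;  /* index into the hash */
        
                if (l < 0)
                        return NULL;
                return t_hash_lookup(iter, l);     /* hypothetical helper */
        }
      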
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>