1. 01 Sep, 2010 1 commit
    • tracing: Fix a race in function profile · 3aaba20f
      Li Zefan committed
      While we are reading trace_stat/functionX and someone just
      disabled function_profile at that time, we can trigger this:
      
      	divide error: 0000 [#1] PREEMPT SMP
      	...
      	EIP is at function_stat_show+0x90/0x230
      	...
      
      This fix just takes the ftrace_profile_lock and checks if
      rec->counter is 0. If it's 0, we know the profile buffer
      has been reset.
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      Cc: stable@kernel.org
      LKML-Reference: <4C723644.4040708@cn.fujitsu.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
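The guard described in the function-profile fix can be illustrated in user space. This is a minimal sketch, not the kernel code: `struct profile_rec`, `profile_lock`, and `profile_rec_average()` are hypothetical names, and a pthread mutex stands in for ftrace_profile_lock. It shows the idea of taking the lock and treating a zero counter as "the buffer has been reset" instead of dividing by it.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-in for one profile record; not the kernel's struct. */
struct profile_rec {
    unsigned long long counter; /* number of calls recorded */
    unsigned long long time;    /* total time spent in the function */
};

/* Stand-in for the lock that serializes readers against a reset. */
static pthread_mutex_t profile_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Compute the average time per call under the lock.
 * Returns false and leaves *avg untouched when counter is 0,
 * which is taken to mean the record was reset underneath the reader.
 */
bool profile_rec_average(const struct profile_rec *rec,
                         unsigned long long *avg)
{
    bool ok = false;

    pthread_mutex_lock(&profile_lock);
    if (rec->counter != 0) {
        *avg = rec->time / rec->counter; /* safe: divisor checked above */
        ok = true;
    }
    pthread_mutex_unlock(&profile_lock);

    return ok;
}
```

A caller would skip printing the record when the function returns false.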
  2. 30 Aug, 2010 1 commit
    • perf_events: Fix time tracking for events with pid != -1 and cpu != -1 · fa66f07a
      Stephane Eranian committed
      Per-thread events with a cpu filter, i.e., cpu != -1, were not
      reporting correct timings when the thread never ran on the
      monitored cpu. The time enabled was reported as a negative
      value.
      
      This patch fixes the problem by updating tstamp_stopped and
      tstamp_running in event_sched_out() for events with filters
      that are marked as INACTIVE.
      
      The function group_sched_out() is modified to systematically
      call into event_sched_out() to avoid duplicating the timing
      adjustment code.
      
      With the patch, I now get:
      
      $ task_cpu -i -e unhalted_core_cycles,unhalted_core_cycles noploop 2
      noploop for 2 seconds
      CPU0 0		   unhalted_core_cycles (ena=1,991,136,594, run=0)
      CPU0 0		   unhalted_core_cycles (ena=1,991,136,594, run=0)
      
      CPU1 0		   unhalted_core_cycles (ena=1,991,136,594, run=0)
      CPU1 0		   unhalted_core_cycles (ena=1,991,136,594, run=0)
      
      CPU2 0		   unhalted_core_cycles (ena=1,991,136,594, run=0)
      CPU2 0		   unhalted_core_cycles (ena=1,991,136,594, run=0)
      
      CPU3 4,747,990,931 unhalted_core_cycles (ena=1,991,136,594, run=1,991,136,594)
      CPU3 4,747,990,931 unhalted_core_cycles (ena=1,991,136,594, run=1,991,136,594)
      Signed-off-by: Stephane Eranian <eranian@gmail.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: paulus@samba.org
      Cc: davem@davemloft.net
      Cc: fweisbec@gmail.com
      Cc: perfmon2-devel@lists.sf.net
      Cc: eranian@google.com
      LKML-Reference: <4c76802d.aae9d80a.115d.70fe@mx.google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  3. 25 Aug, 2010 1 commit
    • tracing/trace_stack: Fix stack trace on ppc64 · 151772db
      Anton Blanchard committed
      save_stack_trace() stores the instruction pointer, not the
      function descriptor. On ppc64 the trace stack code currently
      dereferences the instruction pointer and shows 8 bytes of
      instructions in our backtraces:
      
       # cat /sys/kernel/debug/tracing/stack_trace
              Depth    Size   Location    (26 entries)
              -----    ----   --------
        0)     5424     112   0x6000000048000004
        1)     5312     160   0x60000000ebad01b0
        2)     5152     160   0x2c23000041c20030
        3)     4992     240   0x600000007c781b79
        4)     4752     160   0xe84100284800000c
        5)     4592     192   0x600000002fa30000
        6)     4400     256   0x7f1800347b7407e0
        7)     4144     208   0xe89f0108f87f0070
        8)     3936     272   0xe84100282fa30000
      
      Since we aren't dealing with function descriptors, use %pS
      instead of %pF to fix it:
      
       # cat /sys/kernel/debug/tracing/stack_trace
              Depth    Size   Location    (26 entries)
              -----    ----   --------
        0)     5424     112   ftrace_call+0x4/0x8
        1)     5312     160   .current_io_context+0x28/0x74
        2)     5152     160   .get_io_context+0x48/0xa0
        3)     4992     240   .cfq_set_request+0x94/0x4c4
        4)     4752     160   .elv_set_request+0x60/0x84
        5)     4592     192   .get_request+0x2d4/0x468
        6)     4400     256   .get_request_wait+0x7c/0x258
        7)     4144     208   .__make_request+0x49c/0x610
        8)     3936     272   .generic_make_request+0x390/0x434
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Cc: rostedt@goodmis.org
      Cc: fweisbec@gmail.com
      LKML-Reference: <20100825013238.GE28360@kryten>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  4. 23 Aug, 2010 2 commits
    • mutex: Improve the scalability of optimistic spinning · 9d0f4dcc
      Tim Chen committed
      There is a scalability issue in the current implementation of optimistic
      mutex spinning in the kernel.  It was found on an 8-node, 64-core
      Nehalem-EX system (HT mode).
      
      The intention of the optimistic mutex spin is to busy-wait and spin on a
      mutex if the owner of the mutex is running, in the hope that the mutex
      will be released soon and can be acquired without the thread trying to
      acquire it going to sleep. However, when a large number of threads
      contend for the mutex, the mutex may be grabbed by another thread, and
      then by yet another, and we keep spinning, wasting cpu cycles and adding
      to the contention.  One possible fix is to quit spinning and put the
      current thread on the wait-list if the mutex switches to a new owner
      while we spin, indicating heavy contention (see the patch included).
      
      I did some testing on an 8-socket Nehalem-EX system with a total of 64
      cores. Using Ingo's test-mutex program, which creates/deletes files with
      256 threads (http://lkml.org/lkml/2006/1/8/50), I see the following
      speedup after putting in the mutex spin fix:
      
       ./mutex-test V 256 10
                       Ops/sec
       2.6.34          62864
       With fix        197200
      
      Repeating the test with Aim7 fserver workload, again there is a speed up
      with the fix:
      
                       Jobs/min
       2.6.34          91657
       With fix        149325
      
      To look at the impact on the distribution of mutex acquisition time, I
      collected the mutex acquisition time on Aim7 fserver workload with some
      instrumentation.  The average acquisition time is reduced by 48% and
      number of contentions reduced by 32%.
      
                       #contentions    Time to acquire mutex (cycles)
       2.6.34          72973           44765791
       With fix        49210           23067129
      
      The histogram of mutex acquisition time is listed below.  The acquisition
      time is in 2^bin cycles.  We see that without the fix, the acquisition
      time is mostly around 2^26 cycles.  With the fix, the distribution gets
      spread out a lot more towards the lower cycles, starting from 2^13.
      However, there is an increase of the tail distribution with the fix at
      2^28 and 2^29 cycles.  It seems a small price to pay for the reduced
      average acquisition time and also getting the cpu to do useful work.
      
       Mutex acquisition time distribution (acq time = 2^bin cycles):
               2.6.34                  With Fix
       bin     #occurrence     %       #occurrence     %
       11      2               0.00%   120             0.24%
       12      10              0.01%   790             1.61%
       13      14              0.02%   2058            4.18%
       14      86              0.12%   3378            6.86%
       15      393             0.54%   4831            9.82%
       16      710             0.97%   4893            9.94%
       17      815             1.12%   4667            9.48%
       18      790             1.08%   5147            10.46%
       19      580             0.80%   6250            12.70%
       20      429             0.59%   6870            13.96%
       21      311             0.43%   1809            3.68%
       22      255             0.35%   2305            4.68%
       23      317             0.44%   916             1.86%
       24      610             0.84%   233             0.47%
       25      3128            4.29%   95              0.19%
       26      63902           87.69%  122             0.25%
       27      619             0.85%   286             0.58%
       28      0               0.00%   3536            7.19%
       29      0               0.00%   903             1.83%
       30      0               0.00%   0               0.00%
      
      I've done similar experiments with the 2.6.35 kernel on smaller boxes as
      well.  One is a dual-socket Westmere box (12 cores total, with HT).
      The other is an old dual-socket Core 2 box (4 cores total, no HT).
      
      On the 12-core Westmere box, I see a 250% increase for Ingo's mutex-test
      program with my mutex patch but no significant difference in aim7's
      fserver workload.
      
      On the 4-core Core 2 box, the difference with the patch for both
      mutex-test and aim7 fserver is negligible.
      
      So far, it seems the patch has not caused any regression on smaller
      systems.
      Signed-off-by: Tim Chen <tim.c.chen@linux.intel.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: <stable@kernel.org> # .35.x
      LKML-Reference: <1282168827.9542.72.camel@schen9-DESK>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
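The "stop spinning when the owner changes" idea from the optimistic spinning commit can be sketched in user space with C11 atomics. This is an illustration under stated assumptions, not the kernel patch: `struct task`, `struct sketch_mutex`, `spin_check()`, and `spin_on_owner()` are hypothetical names, and the iteration budget exists only so that the sketch always terminates.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical task: only tracks whether it is currently on a CPU. */
struct task {
    atomic_bool on_cpu;
};

/* Hypothetical mutex: only tracks its current owner (NULL when free). */
struct sketch_mutex {
    _Atomic(struct task *) owner;
};

enum spin_result {
    SPIN_CONTINUE,         /* same owner, still running: keep spinning */
    SPIN_LOCK_FREE,        /* no owner: try to take the lock */
    SPIN_OWNER_CHANGED,    /* new owner: heavy contention, go to sleep */
    SPIN_OWNER_SLEEPING,   /* owner is off-CPU: spinning is pointless */
    SPIN_BUDGET_EXHAUSTED  /* sketch-only bound on the number of spins */
};

/* One spin iteration, relative to the owner seen when spinning began. */
enum spin_result spin_check(struct sketch_mutex *m, struct task *first)
{
    struct task *now = atomic_load(&m->owner);

    if (now == NULL)
        return SPIN_LOCK_FREE;
    if (now != first)
        return SPIN_OWNER_CHANGED;
    if (!atomic_load(&now->on_cpu))
        return SPIN_OWNER_SLEEPING;
    return SPIN_CONTINUE;
}

/* Spin until something other than SPIN_CONTINUE is observed. */
enum spin_result spin_on_owner(struct sketch_mutex *m, unsigned long budget)
{
    struct task *first = atomic_load(&m->owner);

    if (first == NULL)
        return SPIN_LOCK_FREE;

    for (unsigned long i = 0; i < budget; i++) {
        enum spin_result r = spin_check(m, first);
        if (r != SPIN_CONTINUE)
            return r;
    }
    return SPIN_BUDGET_EXHAUSTED;
}
```

A caller would retry the lock on SPIN_LOCK_FREE and join the wait-list on every other result.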
    • watchdog: Don't throttle the watchdog · c6db67cd
      Peter Zijlstra committed
      Stephane reported that when the machine locks up, the regular ticks,
      which are responsible for resetting the throttle count, stop too.
      
      Hence the NMI watchdog can end up being throttled before it reports on
      the locked up state, and we end up being sad.
      
      Cure this by having the watchdog overflow reset its own throttle count.
      Reported-by: Stephane Eranian <eranian@google.com>
      Tested-by: Stephane Eranian <eranian@google.com>
      Cc: Don Zickus <dzickus@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1282215916.1926.4696.camel@laptop>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  5. 22 Aug, 2010 1 commit
    • workqueue: Add basic tracepoints to track workqueue execution · e36c886a
      Arjan van de Ven committed
      With the introduction of the new unified work queue thread pools,
      we lost one feature: It's no longer possible to know which worker
      is causing the CPU to wake out of idle. The result is that PowerTOP
      now reports a lot of "kworker/a:b" instead of more readable results.
      
      This patch adds a pair of tracepoints to the new workqueue code,
      similar in style to the timer/hrtimer tracepoints.
      
      With this pair of tracepoints, the next PowerTOP can correctly
      report which work item caused the wakeup (and how long it took):
      
      Interrupt (43)            i915      time   3.51ms    wakeups 141
      Work      ieee80211_iface_work      time   0.81ms    wakeups  29
      Work              do_dbs_timer      time   0.55ms    wakeups  24
      Process                   Xorg      time  21.36ms    wakeups   4
      Timer    sched_rt_period_timer      time   0.01ms    wakeups   1
      Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  6. 21 Aug, 2010 2 commits
  7. 18 Aug, 2010 3 commits
    • fs: fs_struct rwlock to spinlock · 2a4419b5
      Nick Piggin committed
      struct fs_struct.lock is an rwlock with the read-side used to protect root and
      pwd members while taking references to them. Taking a reference to a path
      typically requires just 2 atomic ops, so the critical section is very small.
      Parallel read-side operations would have cacheline contention on the lock, the
      dentry, and the vfsmount cachelines, so the rwlock is unlikely to ever give a
      real parallelism increase.
      
      Replace it with a spinlock to avoid one or two atomic operations in typical
      path lookup fastpath.
      Signed-off-by: Nick Piggin <npiggin@kernel.dk>
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • Fix unprotected access to task credentials in waitid() · f362b732
      Daniel J Blueman committed
      Using a program like the following:
      
      	#include <signal.h>
      	#include <stdlib.h>
      	#include <unistd.h>
      	#include <sys/types.h>
      	#include <sys/wait.h>
      
      	int main(void) {
      		id_t id;
      		siginfo_t infop;
      		pid_t res;
      
      		id = fork();
      		if (id == 0) { sleep(1); exit(0); }
      		kill(id, SIGSTOP);
      		alarm(1);
      		waitid(P_PID, id, &infop, WCONTINUED);
      		return 0;
      	}
      
      to call waitid() on a stopped process results in access to the child task's
      credentials, which may be replaced in the meantime, without the RCU read lock
      being held, eliciting the following warning:
      
      	===================================================
      	[ INFO: suspicious rcu_dereference_check() usage. ]
      	---------------------------------------------------
      	kernel/exit.c:1460 invoked rcu_dereference_check() without protection!
      
      	other info that might help us debug this:
      
      	rcu_scheduler_active = 1, debug_locks = 1
      	2 locks held by waitid02/22252:
      	 #0:  (tasklist_lock){.?.?..}, at: [<ffffffff81061ce5>] do_wait+0xc5/0x310
      	 #1:  (&(&sighand->siglock)->rlock){-.-...}, at: [<ffffffff810611da>]
      	wait_consider_task+0x19a/0xbe0
      
      	stack backtrace:
      	Pid: 22252, comm: waitid02 Not tainted 2.6.35-323cd+ #3
      	Call Trace:
      	 [<ffffffff81095da4>] lockdep_rcu_dereference+0xa4/0xc0
      	 [<ffffffff81061b31>] wait_consider_task+0xaf1/0xbe0
      	 [<ffffffff81061d15>] do_wait+0xf5/0x310
      	 [<ffffffff810620b6>] sys_waitid+0x86/0x1f0
      	 [<ffffffff8105fce0>] ? child_wait_callback+0x0/0x70
      	 [<ffffffff81003282>] system_call_fastpath+0x16/0x1b
      
      This is fixed by holding the RCU read lock in wait_task_continued() to ensure
      that the task's current credentials aren't destroyed between us reading the
      cred pointer and us reading the UID from those credentials.
      
      Furthermore, protect wait_task_stopped() in the same way.
      
      We don't need to keep holding the RCU read lock once we've read the UID from
      the credentials as holding the RCU read lock doesn't stop the target task from
      changing its creds under us - so the credentials may be outdated immediately
      after we've read the pointer, lock or no lock.
      Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • Make do_execve() take a const filename pointer · d7627467
      David Howells committed
      Make do_execve() take a const filename pointer so that kernel_execve() compiles
      correctly on ARM:
      
      arch/arm/kernel/sys_arm.c:88: warning: passing argument 1 of 'do_execve' discards qualifiers from pointer target type
      
      This also requires the argv and envp arguments to be consted twice, once for
      the pointer array and once for the strings the array points to.  This is
      because do_execve() passes a pointer to the filename (now const) to
      copy_strings_kernel().  A simpler alternative would be to cast the filename
      pointer in do_execve() when it's passed to copy_strings_kernel().
      
      do_execve() may not change any of the strings it is passed as part of the argv
      or envp lists, as some of them are in .rodata, so marking these strings as
      const should be fine.
      
      Furthermore, kernel_execve() and sys_execve() need to be changed to match.
      
      This has been test built on x86_64, frv, arm and mips.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Tested-by: Ralf Baechle <ralf@linux-mips.org>
      Acked-by: Russell King <rmk+kernel@arm.linux.org.uk>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
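The phrase "consted twice" in the do_execve() commit refers to qualifying both the pointer array and the strings it points to. A minimal sketch of that parameter shape follows; `total_arg_bytes()` is a hypothetical helper for illustration only and is not part of the kernel.

```c
#include <stddef.h>
#include <string.h>

/*
 * argv is "const twice":
 *   const char *        -> the strings may not be modified
 *   ... *const *argv    -> the array slots may not be modified
 * The function sums the bytes needed to copy every string, including
 * each terminating NUL, and stops at the NULL sentinel.
 */
size_t total_arg_bytes(const char *const *argv)
{
    size_t total = 0;

    for (size_t i = 0; argv[i] != NULL; i++)
        total += strlen(argv[i]) + 1;

    return total;
}
```

With this signature, an argument vector made of string literals placed in read-only data can be passed without casts when it is declared as `static const char *const argv[]`.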
  8. 17 Aug, 2010 1 commit
    • kdb: fix compile error without CONFIG_KALLSYMS · b590cddf
      Jason Wessel committed
      If CONFIG_KGDB_KDB is set and CONFIG_KALLSYMS is not set the kernel
      will fail to build with the error:
      
      kernel/built-in.o: In function `kallsyms_symbol_next':
      kernel/debug/kdb/kdb_support.c:237: undefined reference to `kdb_walk_kallsyms'
      kernel/built-in.o: In function `kallsyms_symbol_complete':
      kernel/debug/kdb/kdb_support.c:193: undefined reference to `kdb_walk_kallsyms'
      
      The kdb_walk_kallsyms() declaration in the header needs a proper #ifdef
      to match the C implementation.  This patch also fixes the compiler
      warnings in kdb_support.c when compiling without CONFIG_KALLSYMS set.
      The compiler warnings are a result of the kallsyms_lookup() macro not
      initializing two of the pass-by-reference variables.
      Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
      Reported-by: Michal Simek <monstr@monstr.eu>
  9. 14 Aug, 2010 2 commits
    • tracing: Sanitize value returned from write(trace_marker, "...", len) · 1aa54bca
      Marcin Slusarz committed
      When userspace code writes a string that is not newline-terminated to
      the trace_marker file, the write handler appends a newline and returns
      the number of bytes written to the trace buffer, so
      write(fd, "abc", 3) will return 4.
      
      That's unexpected and unfortunately it confuses glibc's fprintf function.
      
      Example:
      #include <stdio.h>
      
      int main(void) {
        fprintf(stderr, "abc");
        return 0;
      }
      
      $ gcc test.c -o test
      $ echo mmiotrace > /sys/kernel/debug/tracing/current_tracer
      $ ./test 2>/sys/kernel/debug/tracing/trace_marker
      
      results in infinite loop:
      write(fd, "abc", 3) = 4
      write(fd, "", 1) = 0
      write(fd, "", 1) = 0
      write(fd, "", 1) = 0
      write(fd, "", 1) = 0
      write(fd, "", 1) = 0
      write(fd, "", 1) = 0
      write(fd, "", 1) = 0
      (...)
      
      ...and kernel trace buffer full of empty markers.
      
      Fix it by sanitizing write return value.
      Signed-off-by: Marcin Slusarz <marcin.slusarz@gmail.com>
      LKML-Reference: <20100727231801.GB2826@joi.lan>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@redhat.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
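The return-value rule behind the trace_marker fix can be shown with a small user-space sketch. It is not the kernel's write handler: `marker_write()` and its buffer parameters are hypothetical. The point is that a handler may store more bytes than it was given (the appended newline) but should report only the bytes consumed from the caller.

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

/*
 * Copy cnt bytes from src into dst, append '\n' if the payload does not
 * already end with one, and NUL-terminate the result.
 * Returns cnt (bytes consumed from the caller), never the stored length,
 * or -1 if dst is too small.
 */
ssize_t marker_write(char *dst, size_t dst_size, const char *src, size_t cnt)
{
    size_t stored = cnt;

    /* Need room for the payload, a possible '\n', and the final '\0'. */
    if (dst_size < 2 || cnt > dst_size - 2)
        return -1;

    memcpy(dst, src, cnt);
    if (cnt == 0 || dst[cnt - 1] != '\n')
        dst[stored++] = '\n';
    dst[stored] = '\0';

    return (ssize_t)cnt;
}
```

Returning `stored` here instead of `cnt` would reproduce the behaviour that the commit message describes as confusing glibc's fprintf.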
    • time: Workaround gcc loop optimization that causes 64bit div errors · c7dcf87a
      John Stultz committed
      Early 4.3 versions of gcc apparently aggressively optimize the raw
      time accumulation loop, replacing it with a divide.
      
      On 32bit systems, this causes the following link errors:
      	undefined reference to `__umoddi3'
      	undefined reference to `__udivdi3'
      
      The gcc issue has been fixed in 4.4 and greater.
      
      This patch replaces the accumulation loop with a do_div, as suggested
      by Linus.
      Signed-off-by: John Stultz <johnstul@us.ibm.com>
      CC: Jason Wessel <jason.wessel@windriver.com>
      CC: Larry Finger <Larry.Finger@lwfinger.net>
      CC: Ingo Molnar <mingo@elte.hu>
      CC: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
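The equivalence that lets a compiler turn an accumulation loop into a divide can be sketched in user space. The names `struct split_time`, `split_by_loop()`, and `split_by_division()` are hypothetical, and the sketch uses the plain `/` and `%` operators, whereas the kernel patch uses do_div. In user space on 32-bit targets, the 64-bit division helpers come from libgcc; the kernel does not link against libgcc, which is why the commit reports undefined references.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical result type: whole seconds plus leftover nanoseconds. */
struct split_time {
    uint64_t sec;
    uint64_t nsec;
};

/* Repeated subtraction, in the style of an accumulation loop. */
struct split_time split_by_loop(uint64_t nsecs)
{
    struct split_time t = { 0, 0 };

    while (nsecs >= NSEC_PER_SEC) {
        nsecs -= NSEC_PER_SEC;
        t.sec++;
    }
    t.nsec = nsecs;
    return t;
}

/* The same result expressed as one explicit 64-bit divide and modulo. */
struct split_time split_by_division(uint64_t nsecs)
{
    struct split_time t;

    t.sec = nsecs / NSEC_PER_SEC;
    t.nsec = nsecs % NSEC_PER_SEC;
    return t;
}
```

Both functions return the same pair for any input; the loop form is only cheaper when the quotient is known to be small.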
  10. 13 Aug, 2010 4 commits
    • Revert "fsnotify: store struct file not struct path" · 2069601b
      Linus Torvalds committed
      This reverts commit 3bcf3860 (and the
      accompanying commit c1e5c954 "vfs/fsnotify: fsnotify_close can delay
      the final work in fput" that was a horribly ugly hack to make it work at
      all).
      
      The 'struct file' approach not only causes that disgusting hack, it
      somehow breaks pulseaudio, probably due to some other subtlety with
      f_count handling.
      
      Fix up various conflicts due to later fsnotify work.
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • tracing/events: Convert format output to seq_file · 2a37a3df
      Steven Rostedt committed
      Two new events were added that broke the current format output.
      
      Both from the SCSI system: scsi_dispatch_cmd_done and scsi_dispatch_cmd_timeout
      
      The reason is that their print_fmt exceeded a page size. Since the output
      of the format used simple_read_from_buffer and trace_seq, it was limited
      to a page size in output.
      
      This patch converts the printing of the format of an event into seq_file,
      which allows greater than a page size to be shown.
      
      I diffed all event formats comparing the output with and without this
      patch. All matched except for the above two, which showed just:
      
        FORMAT TOO BIG
      
      without this patch, but now properly displays the output with this patch.
      
      v2: Remove updating *pos in seq start function.
         [ Thanks to Li Zefan for pointing that out ]
      Reviewed-by: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Martin K. Petersen <martin.petersen@oracle.com>
      Cc: Kei Tokunaga <tokunaga.keiich@jp.fujitsu.com>
      Cc: James Bottomley <James.Bottomley@suse.de>
      Cc: Tomohiro Kusumi <kusumi.tomohiro@jp.fujitsu.com>
      Cc: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
    • timekeeping: Fix overflow in rawtime tv_nsec on 32 bit archs · deda2e81
      Jason Wessel committed
      The tv_nsec is a long and when added to the shifted interval it can wrap
      and become negative, which later causes looping problems in
      getrawmonotonic().  The edge case occurs when the system has slept for
      a short period of time of ~2 seconds.
      
      A trace printk of the values in this patch illustrates the problem:
      
      ftrace time stamp: log
      43.716079: logarithmic_accumulation: raw: 3d0913 tv_nsec d687faa
      43.718513: logarithmic_accumulation: raw: 3d0913 tv_nsec da588bd
      43.722161: logarithmic_accumulation: raw: 3d0913 tv_nsec de291d0
      46.349925: logarithmic_accumulation: raw: 7a122600 tv_nsec e1f9ae3b
      46.349930: logarithmic_accumulation: raw: 1e848980 tv_nsec 8831c0e3
      
      The kernel starts looping at 46.349925 in getrawmonotonic() due to
      the negative value from adding the raw value to tv_nsec.
      
      A simple solution is to accumulate into a u64, and then normalize it
      to a timespec_t.
      Signed-off-by: Jason Wessel <jason.wessel@windriver.com>
       [ Reworked variable names and simplified some of the code. - John ]
      Signed-off-by: John Stultz <johnstul@us.ibm.com>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
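The "accumulate into a u64, then normalize" approach from the timekeeping fix can be sketched as follows. This is an illustration under stated assumptions, not the kernel code: `struct raw_time` and `raw_time_accumulate()` are hypothetical, `int32_t` models a 32-bit `long` tv_nsec, and tv_nsec is assumed to be in the range [0, NSEC_PER_SEC) on entry.

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical timespec-like type; int32_t models a 32-bit long. */
struct raw_time {
    int64_t tv_sec;
    int32_t tv_nsec;
};

/*
 * Add interval_nsecs to *t without ever storing an out-of-range value
 * in the 32-bit tv_nsec field: the sum is formed in 64 bits, whole
 * seconds are carried into tv_sec, and only the remainder (which is
 * below NSEC_PER_SEC and therefore fits) is written back.
 */
void raw_time_accumulate(struct raw_time *t, uint64_t interval_nsecs)
{
    uint64_t nsecs = interval_nsecs + (uint64_t)t->tv_nsec;

    while (nsecs >= NSEC_PER_SEC) {
        nsecs -= NSEC_PER_SEC;
        t->tv_sec++;
    }
    t->tv_nsec = (int32_t)nsecs;
}
```

Adding an interval of about two seconds directly to a 32-bit tv_nsec would exceed its positive range, which is the wrap to a negative value that the commit message describes.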
    • Add a dummy printk function for the maintenance of unused printks · 12fdff3f
      David Howells committed
      Add a dummy printk function for the maintenance of unused printks through gcc
      format checking, and also so that side-effect checking is maintained too.
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
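The idea of a dummy print function that keeps format checking and argument side effects can be sketched in user space. The names `dummy_printf()`, `sketch_debug()`, and `SKETCH_DEBUG` are hypothetical and are not the kernel's identifiers. The `format` attribute and `##__VA_ARGS__` are GCC/Clang extensions, so this sketch assumes one of those compilers.

```c
#include <stdio.h>

/*
 * Does nothing at run time, but because it is a real function with a
 * printf-style format attribute, the compiler still checks the format
 * string against the arguments and still evaluates each argument.
 */
__attribute__((format(printf, 1, 2)))
static inline int dummy_printf(const char *fmt, ...)
{
    (void)fmt;
    return 0;
}

/* Debug output compiles to a real printf or to the checked no-op. */
#ifdef SKETCH_DEBUG
#define sketch_debug(fmt, ...) printf(fmt, ##__VA_ARGS__)
#else
#define sketch_debug(fmt, ...) dummy_printf(fmt, ##__VA_ARGS__)
#endif
```

Compared with defining the disabled macro as empty, a mismatched format such as `sketch_debug("%s", 42)` still produces a compiler warning when debugging is turned off.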
  11. 12 Aug, 2010 2 commits
  12. 11 Aug, 2010 18 commits
  13. 10 Aug, 2010 2 commits