1. 23 Jul, 2010 1 commit
  2. 21 Jul, 2010 4 commits
    • tracing: Shrink max latency ringbuffer if unnecessary · ef710e10
      KOSAKI Motohiro committed
      Documentation/trace/ftrace.txt says
      
        buffer_size_kb:
      
              This sets or displays the number of kilobytes each CPU
              buffer can hold. The tracer buffers are the same size
              for each CPU. The displayed number is the size of the
              CPU buffer and not total size of all buffers. The
              trace buffers are allocated in pages (blocks of memory
              that the kernel uses for allocation, usually 4 KB in size).
              If the last page allocated has room for more bytes
              than requested, the rest of the page will be used,
              making the actual allocation bigger than requested.
              ( Note, the size may not be a multiple of the page size
                due to buffer management overhead. )
      
              This can only be updated when the current_tracer
              is set to "nop".
      
      But this is incorrect. Currently the total memory consumption is
      'buffer_size_kb x CPUs x 2'.
      
      Why the factor of two? Because ftrace implicitly allocates
      a buffer for max latency too.
      
      That gives a sad result when an admin wants to use a large buffer (for
      example, to log everything and analyse it in detail). If the admin
      has a 24-CPU machine and writes 200MB to buffer_size_kb, the system
      consumes ~10GB of memory (200MB x 24 x 2). Wasting ~5GB of memory is
      usually unacceptable.
      
      Fortunately, almost all users don't use the max latency feature.
      The max latency buffer can be disabled easily.
      
      This patch shrinks the max latency buffer if it is
      unnecessary.
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      LKML-Reference: <20100701104554.DA2D.A69D9226@jp.fujitsu.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
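The arithmetic in the commit message above can be checked with a small sketch. The function name and parameters are hypothetical, and the estimate ignores page rounding, buffer management overhead, and whatever small size the max latency buffer keeps after being shrunk.

```python
def ftrace_buffer_bytes(buffer_size_kb, cpus, max_latency_buffer=True):
    """Rough total ring buffer memory in bytes (hypothetical helper).

    Assumes one main buffer per CPU, plus an equally sized max latency
    buffer per CPU when max_latency_buffer is True.
    """
    copies = 2 if max_latency_buffer else 1
    return buffer_size_kb * 1024 * cpus * copies


size_kb = 200 * 1024  # 200MB written to buffer_size_kb
with_max = ftrace_buffer_bytes(size_kb, cpus=24)
without_max = ftrace_buffer_bytes(size_kb, cpus=24, max_latency_buffer=False)

print(f"with max latency buffer:    {with_max / 2**30:.2f} GiB")
print(f"without max latency buffer: {without_max / 2**30:.2f} GiB")
```

Under these assumptions the duplicated buffer accounts for exactly half of the total, which is the waste the patch targets.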
    • tracing: Reduce latency and remove percpu trace_seq · bc289ae9
      Lai Jiangshan committed
      __print_flags() and __print_symbolic() use percpu trace_seq:
      
      1) Its memory is allocated at compile time, so it wastes memory if we don't use tracing.
      2) It is percpu data, so it wastes even more memory on multi-CPU systems.
      3) It disables preemption when it executes its core routine
         "trace_seq_printf(s, "%s: ", #call);" and introduces latency.
      
      So we move this trace_seq to struct trace_iterator.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      LKML-Reference: <4C078350.7090106@cn.fujitsu.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
    • trace: Reorder struct ring_buffer_per_cpu to remove padding on 64bit · 985023de
      Richard Kennedy committed
      Reorder the structure to remove 8 bytes of padding on 64 bit builds.
      This shrinks the size to 128 bytes, allowing allocation from a smaller
      slab and needing one fewer cache line.
      Signed-off-by: Richard Kennedy <richard@rsk.demon.co.uk>
      LKML-Reference: <1269516456.2054.8.camel@localhost>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
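The effect of reordering members can be illustrated with a toy structure. This is a minimal sketch using Python's ctypes, which follows the platform's C layout rules; the two structures and their field names are hypothetical and are not the real struct ring_buffer_per_cpu. The 8-byte saving assumes a 64 bit build where 8-byte integers are 8-byte aligned.

```python
import ctypes


class Original(ctypes.Structure):
    # A 4-byte member on each side of an 8-byte member: on a 64 bit build
    # the compiler pads after "a" and again after "c".
    _fields_ = [
        ("a", ctypes.c_uint32),
        ("b", ctypes.c_uint64),
        ("c", ctypes.c_uint32),
    ]


class Reordered(ctypes.Structure):
    # Same members, but the two 4-byte members now share one 8-byte slot.
    _fields_ = [
        ("b", ctypes.c_uint64),
        ("a", ctypes.c_uint32),
        ("c", ctypes.c_uint32),
    ]


saved = ctypes.sizeof(Original) - ctypes.sizeof(Reordered)
print(ctypes.sizeof(Original), ctypes.sizeof(Reordered), saved)
```

No member is removed or resized; only the order changes, which is why such a patch is safe as long as nothing depends on the field offsets.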
    • tracing: Allow to disable cmdline recording · e870e9a1
      Li Zefan committed
      We found that even enabling a single trace event that will rarely be
      triggered can add big overhead to context switch.
      
      (lmbench context switch test)
      --------------------------------------------------
       2p/0K 2p/16K 2p/64K 8p/16K 8p/64K 16p/16K 16p/64K
       ctxsw  ctxsw  ctxsw  ctxsw  ctxsw   ctxsw   ctxsw
      ------ ------ ------ ------ ------ ------- -------
        2.19    2.3   2.21   2.56   2.13    2.54    2.07
        2.39   2.51   2.35   2.75   2.27    2.81    2.24
      
      The overhead is 6% ~ 11%.
      
      It's because when a trace event is enabled 3 tracepoints (sched_switch,
      sched_wakeup, sched_wakeup_new) will be activated to map pid to cmdname.
      
      We'd like to avoid this overhead, so add a trace option '(no)record-cmd'
      that allows cmdline recording to be disabled.
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      LKML-Reference: <4C2D57F4.2050204@cn.fujitsu.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
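The 6% ~ 11% figure can be reproduced from the lmbench table above. The sketch below assumes the first data row is the baseline and the second row is the run with a trace event enabled; the rows are not labelled in the commit message, so that reading is an assumption.

```python
# Column order as in the table: 2p/0K, 2p/16K, 2p/64K, 8p/16K, 8p/64K, 16p/16K, 16p/64K
baseline = [2.19, 2.3, 2.21, 2.56, 2.13, 2.54, 2.07]  # assumed: no event enabled
enabled = [2.39, 2.51, 2.35, 2.75, 2.27, 2.81, 2.24]  # assumed: one event enabled

# Relative slowdown of each context switch measurement, in percent.
overheads = [(e - b) / b * 100 for b, e in zip(baseline, enabled)]

for pct in overheads:
    print(f"{pct:.1f}%")
print(f"range: {min(overheads):.1f}% to {max(overheads):.1f}%")
```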
  3. 16 Jul, 2010 1 commit
    • tracing: Remove ksym tracer · 5d550467
      Frederic Weisbecker committed
      The ksym (breakpoint) ftrace plugin has been superseded by perf
      tools that are much more powerful for using the cpu breakpoints.
      This tracer doesn't bring any additional features. It has been deprecated
      for a while now, so let's remove it.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Prasad <prasad@linux.vnet.ibm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
  4. 06 Jul, 2010 1 commit
    • tracing/kprobes: Support "string" type · e09c8614
      Masami Hiramatsu committed
      Support string type tracing and printing in kprobe-tracer.
      
      This allows users to trace string data in the kernel, including __user data. Note
      that sometimes __user data may not be accessible if it is paged out (sorry, but
      kprobes operations must be done in atomic context, so we cannot wait for page-in).
      
      Committer note: Fixed up conflicts with b7e2ecef.
      
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      LKML-Reference: <20100519195724.2885.18788.stgit@localhost6.localdomain6>
      Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
  5. 05 Jul, 2010 1 commit
  6. 01 Jul, 2010 2 commits
    • sched: Cure nr_iowait_cpu() users · 8c215bd3
      Peter Zijlstra committed
      Commit 0224cf4c (sched: Intoduce get_cpu_iowait_time_us())
      broke things by not making sure preemption was indeed disabled
      by the callers of nr_iowait_cpu() which took the iowait value of
      the current cpu.
      
      This resulted in a heap of preempt warnings. Cure this by making
      nr_iowait_cpu() take a cpu number and fix up the callers to pass
      in the right number.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Arjan van de Ven <arjan@infradead.org>
      Cc: Sergey Senozhatsky <sergey.senozhatsky@gmail.com>
      Cc: Rafael J. Wysocki <rjw@sisk.pl>
      Cc: Maxim Levitsky <maximlevitsky@gmail.com>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Jiri Slaby <jslaby@suse.cz>
      Cc: linux-pm@lists.linux-foundation.org
      LKML-Reference: <1277968037.1868.120.camel@laptop>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • futex: futex_find_get_task remove credentails check · 7a0ea09a
      Michal Hocko committed
      futex_find_get_task is currently used (through lookup_pi_state) from two
      contexts, futex_requeue and futex_lock_pi_atomic.  Neither of the paths
      looks like it needs the credentials check, though.  Different (e)uids
      shouldn't matter at all because the only thing that is important for
      a shared futex is the accessibility of the shared memory.
      
      The credential check results in a glibc assert failure or a process hang (if
      glibc is compiled without assert support) for a shared robust pthread
      mutex with priority inheritance if a process tries to lock an already held
      lock owned by a process with a different euid:
      
      pthread_mutex_lock.c:312: __pthread_mutex_lock_full: Assertion `(-(e)) != 3 || !robust' failed.
      
      The problem is that futex_lock_pi_atomic which is called when we try to
      lock already held lock checks the current holder (tid is stored in the
      futex value) to get the PI state.  It uses lookup_pi_state which in turn
      gets task struct from futex_find_get_task.  ESRCH is returned either
      when the task is not found or if credentials check fails.
      
      futex_lock_pi_atomic simply returns if it gets ESRCH.  glibc code,
      however, doesn't expect that robust lock returns with ESRCH because it
      should get either success or owner died.
      Signed-off-by: Michal Hocko <mhocko@suse.cz>
      Acked-by: Darren Hart <dvhltc@us.ibm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Nick Piggin <npiggin@suse.de>
      Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  7. 30 Jun, 2010 1 commit
  8. 29 Jun, 2010 7 commits
  9. 25 Jun, 2010 2 commits
    • sched: Prevent compiler from optimising the sched_avg_update() loop · 0d98bb26
      Will Deacon committed
      GCC 4.4.1 on ARM has been observed to replace the while loop in
      sched_avg_update with a call to uldivmod, resulting in the
      following build failure at link-time:
      
      kernel/built-in.o: In function `sched_avg_update':
       kernel/sched.c:1261: undefined reference to `__aeabi_uldivmod'
       kernel/sched.c:1261: undefined reference to `__aeabi_uldivmod'
      make: *** [.tmp_vmlinux1] Error 1
      
      This patch introduces a fake data hazard to the loop body to
      prevent the compiler optimising the loop away.
      Signed-off-by: Will Deacon <will.deacon@arm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Catalin Marinas <catalin.marinas@arm.com>
      Cc: Russell King <rmk@arm.linux.org.uk>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: <stable@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • hw_breakpoints: Fix per task breakpoint tracking · 45a73372
      Frederic Weisbecker committed
      Freeing a perf event can happen in several ways. A task
      calls perf_event_exit_task() right before exiting. This helper
      will detach all the events from the task context and queue their
      removal through free_event() if they are child tasks. The task
      also loses its context reference there.
      
      Releasing the breakpoint slot from the constraint table is made
      from free_event() that calls release_bp_slot(). We count the number
      of breakpoints this task is running by looking at the task's
      perf_event_ctxp and iterating through its attached events.
      But at this time, the reference to this context has been cleaned up
      already.
      
      So looking at the event->ctx instead of task->perf_event_ctxp
      to count the remaining breakpoints should solve the problem.
      At least it would for child breakpoints, but not for parent ones.
      If the parent exits before the child, it will remove all its
      events from the context but free_event() will be called later,
      on fd release time. And checking the number of breakpoints the
      task has attached to its context at this time is unreliable as all
      events have been removed from the context.
      
      To solve this, we keep track of the list of per task breakpoints.
      On top of it, we maintain our array of numbers of breakpoints used
      by the tasks. We use the context address as a task id.
      
      So, instead of looking at the number of events attached to a context,
      we walk through our list of per task breakpoints and count the number
      of breakpoints that use the same ctx as the one to be reserved or
      released from the constraint table, and update the count on top of this
      result.
      
      Besides fixing the bad refcounting, this also fixes a warning
      reported by Paul.
      
      Badness at /home/paulus/kernel/perf/kernel/hw_breakpoint.c:114
      NIP: c0000000000cb470 LR: c0000000000cb46c CTR: c00000000032d9b8
      REGS: c000000118e7b570 TRAP: 0700   Not tainted  (2.6.35-rc3-perf-00008-g76b0f133)
      MSR: 9000000000029032 <EE,ME,CE,IR,DR>  CR: 44004424  XER: 000fffff
      TASK = c0000001187dcad0[3143] 'perf' THREAD: c000000118e78000 CPU: 1
      GPR00: c0000000000cb46c c000000118e7b7f0 c0000000009866a0 0000000000000020
      GPR04: 0000000000000000 000000000000001d 0000000000000000 0000000000000001
      GPR08: c0000000009bed68 c00000000086dff8 c000000000a5bf10 0000000000000001
      GPR12: 0000000024004422 c00000000ffff200 0000000000000000 0000000000000000
      GPR16: 0000000000000000 0000000000000000 0000000000000018 00000000101150f4
      GPR20: 0000000010206b40 0000000000000000 0000000000000000 00000000101150f4
      GPR24: c0000001199090c0 0000000000000001 0000000000000000 0000000000000001
      GPR28: 0000000000000000 0000000000000000 c0000000008ec290 0000000000000000
      NIP [c0000000000cb470] .task_bp_pinned+0x5c/0x12c
      LR [c0000000000cb46c] .task_bp_pinned+0x58/0x12c
      Call Trace:
      [c000000118e7b7f0] [c0000000000cb46c] .task_bp_pinned+0x58/0x12c (unreliable)
      [c000000118e7b8a0] [c0000000000cb584] .toggle_bp_task_slot+0x44/0xe4
      [c000000118e7b940] [c0000000000cb6c8] .toggle_bp_slot+0xa4/0x164
      [c000000118e7b9f0] [c0000000000cbafc] .release_bp_slot+0x44/0x6c
      [c000000118e7ba80] [c0000000000c4178] .bp_perf_event_destroy+0x10/0x24
      [c000000118e7bb00] [c0000000000c4aec] .free_event+0x180/0x1bc
      [c000000118e7bbc0] [c0000000000c54c4] .perf_event_release_kernel+0x14c/0x170
      Reported-by: Paul Mackerras <paulus@samba.org>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Prasad <prasad@linux.vnet.ibm.com>
      Cc: Mahesh Salgaonkar <mahesh@linux.vnet.ibm.com>
      Cc: Will Deacon <will.deacon@arm.com>
      Cc: Jason Wessel <jason.wessel@windriver.com>
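The counting scheme described in the commit message above can be modelled in a few lines. This is only a sketch of the idea, not the kernel code: the class, the list, and the function names are hypothetical, and a plain Python object stands in for the context address that the commit uses as a task id.

```python
class Breakpoint:
    def __init__(self, ctx):
        self.ctx = ctx  # stands in for event->ctx, used as the task id


# Every per task breakpoint that currently holds a slot (hypothetical name).
tracked_breakpoints = []


def count_for_ctx(ctx):
    """Walk the list and count the breakpoints that share this context."""
    return sum(1 for bp in tracked_breakpoints if bp.ctx is ctx)


def reserve(bp):
    """Track the breakpoint and return the task's new breakpoint count."""
    tracked_breakpoints.append(bp)
    return count_for_ctx(bp.ctx)


def release(bp):
    """Untrack the breakpoint and return the task's remaining count."""
    tracked_breakpoints.remove(bp)
    return count_for_ctx(bp.ctx)


task_a, task_b = object(), object()
first, second, other = Breakpoint(task_a), Breakpoint(task_a), Breakpoint(task_b)
for bp in (first, second, other):
    reserve(bp)

print(count_for_ctx(task_a), count_for_ctx(task_b))
```

Because the count comes from the tracking list rather than from the events attached to a context, it stays correct after the events have been detached from that context, which is the situation the commit describes.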
  10. 24 Jun, 2010 1 commit
  11. 23 Jun, 2010 1 commit
    • rcu: apply RCU protection to wake_affine() · f3b577de
      Daniel J Blueman committed
      The task_group() function returns a pointer that must be protected
      by either RCU, the ->alloc_lock, or the cgroup lock (see the
      rcu_dereference_check() in task_subsys_state(), which is invoked by
      task_group()).  The wake_affine() function currently does none of these,
      which means that a concurrent update would be within its rights to free
      the structure returned by task_group().  Because wake_affine() uses this
      structure only to compute load-balancing heuristics, there is no reason
      to acquire either of the two locks.
      
      Therefore, this commit introduces an RCU read-side critical section that
      starts before the first call to task_group() and ends after the last use
      of the "tg" pointer returned from task_group().  Thanks to Li Zefan for
      pointing out the need to extend the RCU read-side critical section from
      that proposed by the original patch.
      Signed-off-by: Daniel J Blueman <daniel.blueman@gmail.com>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
  12. 18 Jun, 2010 2 commits
    • sched: Fix over-scheduling bug · 3c93717c
      Alex,Shi committed
      Commit e7097159 ("sched: Optimize unused cgroup configuration") introduced
      an imbalanced scheduling bug.
      
      If we do not use CGROUP, the function update_h_load won't update h_load. When the
      system has far more tasks than logical CPUs, the
      incorrect cfs_rq[cpu]->h_load value will cause load_balance() to pull too
      many tasks to the local CPU from the busiest CPU. So the busiest CPU keeps
      changing in a round robin fashion. That hurts performance.
      
      The issue was originally found with a scientific calculation workload
      developed by Yanmin. With that commit, the workload performance drops
      about 40%.
      
       CPU  before    after
      
       00   : 2       : 7
       01   : 1       : 7
       02   : 11      : 6
       03   : 12      : 7
       04   : 6       : 6
       05   : 11      : 7
       06   : 10      : 6
       07   : 12      : 7
       08   : 11      : 6
       09   : 12      : 6
       10   : 1       : 6
       11   : 1       : 6
       12   : 6       : 6
       13   : 2       : 6
       14   : 2       : 6
       15   : 1       : 6
      Reviewed-by: Yanmin zhang <yanmin.zhang@intel.com>
      Signed-off-by: Alex Shi <alex.shi@intel.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1276754893.9452.5442.camel@debian>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
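The imbalance in the table above can be quantified with a short sketch. It reads the two columns as per-CPU task counts for CPUs 00 to 15; the commit message does not label the unit, so that reading is an assumption.

```python
# Values copied from the "before" and "after" columns, CPUs 00..15.
before = [2, 1, 11, 12, 6, 11, 10, 12, 11, 12, 1, 1, 6, 2, 2, 1]
after = [7, 7, 6, 7, 6, 7, 6, 7, 6, 6, 6, 6, 6, 6, 6, 6]


def spread(counts):
    """Difference between the most and the least loaded CPU."""
    return max(counts) - min(counts)


print("total:", sum(before), sum(after))
print("spread before:", spread(before))
print("spread after: ", spread(after))
```

Both columns add up to the same total, so the difference between them is purely one of distribution across CPUs.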
    • nohz: Fix nohz ratelimit · 3310d4d3
      Peter Zijlstra committed
      Chris Wedgwood reports that 39c0cbe2 (sched: Rate-limit nohz) causes a
      serial console regression, unresponsiveness, and indeed it does. The
      reason is that the nohz code is skipped even when the tick was already
      stopped before the nohz_ratelimit(cpu) condition changed.
      
      Move the nohz_ratelimit() check to the other conditions which prevent
      long idle sleeps.
      Reported-by: Chris Wedgwood <cw@f00f.org>
      Tested-by: Brian Bloniarz <bmb@athenacr.com>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Jiri Kosina <jkosina@suse.cz>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Greg KH <gregkh@suse.de>
      Cc: Alan Cox <alan@lxorguk.ukuu.org.uk>
      Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
      Cc: Jef Driesen <jefdriesen@telenet.be>
      LKML-Reference: <1276790557.27822.516.camel@twins>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  13. 11 Jun, 2010 1 commit
    • perf/tracing: Fix regression of perf losing kprobe events · a8fb2608
      Steven Rostedt committed
      With the addition of the code to shrink the kernel tracepoint
      infrastructure, we lost kprobes being traced by perf. The reason
      is that I tested if the "tp_event->class->perf_probe" existed before
      enabling it. This prevents "ftrace only" events (like the function
      trace events) from being enabled by perf.
      
      Unfortunately, kprobe events do not use perf_probe. This causes
      kprobes to be missed by perf. To fix this, we add the test to
      see if "tp_event->class->reg" exists as well as perf_probe.
      
      Normal trace events have only "perf_probe" but no "reg" function,
      and kprobes and syscalls have the "reg" but no "perf_probe".
      The ftrace unique events do not have either, so this is a valid
      test. If a kprobe or syscall is not to be probed by perf, the
      "reg" function is called anyway, and will return a failure and
      prevent perf from probing it.
      Reported-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Tested-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
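The check described in the commit message above amounts to a small truth table. The sketch below models it with two booleans; the function names are hypothetical and only the presence or absence of the "perf_probe" and "reg" callbacks is represented.

```python
def old_check(has_perf_probe, has_reg):
    """Test before the fix: only perf_probe was considered."""
    return has_perf_probe


def new_check(has_perf_probe, has_reg):
    """Test after the fix: either callback makes the event a candidate."""
    return has_perf_probe or has_reg


# (has_perf_probe, has_reg) per event kind, as described in the commit message.
event_kinds = {
    "normal trace event": (True, False),
    "kprobe or syscall": (False, True),
    "ftrace-only event": (False, False),
}

for name, callbacks in event_kinds.items():
    print(f"{name}: old={old_check(*callbacks)} new={new_check(*callbacks)}")
```

The only row that changes between the two checks is the kprobe and syscall case, which matches the regression being fixed.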
  14. 10 Jun, 2010 1 commit
  15. 09 Jun, 2010 14 commits