1. 15 December 2008, 2 commits
    • perfcounters: implement "counter inheritance" · 9b51f66d
      Authored by Ingo Molnar
      Impact: implement new performance feature
      
      Counter inheritance can be used to run performance counters in a workload,
      transparently - and pipe back the counter results to the parent counter.
      
      Inheritance for performance counters works the following way: when creating
      a counter it can be marked with the .inherit=1 flag. Such counters are then
      'inherited' by all child tasks (be they fork()-ed or clone()-ed). These
      counters get inherited through exec() boundaries as well (except through
      setuid boundaries).
      
      The counter values get added back to the parent counter(s) when the child
      task(s) exit - much like stime/utime statistics are gathered. So inherited
      counters are ideal to gather summary statistics about an application's
      behavior via shell commands, without having to modify that application.
      
      The timec.c command utilizes counter inheritance:
      
        http://redhat.com/~mingo/perfcounters/timec.c
      
      Sample output:
      
         $ ./timec -e 1 -e 3 -e 5 ls -lR /usr/include/ >/dev/null
      
         Performance counter stats for 'ls':
      
                 163516953 instructions
                      2295 cache-misses
                   2855182 branch-misses
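      A minimal sketch of how a tool like timec can request inheritance -
      assuming the .inherit bit in perf_counter_hw_event and the struct-based
      syscall from the API-restructure commit below; names are illustrative,
      not the patch's exact code:

        struct perf_counter_hw_event hw_event = {
                .type    = PERF_COUNT_INSTRUCTIONS,  /* -e 1 in the sample above */
                .inherit = 1,   /* children inherit this counter */
        };

        int fd = perf_counter_open(&hw_event, 0 /* self */, -1 /* any cpu */, -1);
        /* fork() + exec() the workload here; when the children exit, their
         * counts are folded back into this parent counter. */
        u64 count;
        read(fd, &count, sizeof(count));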
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perfcounters: restructure x86 counter math · ee06094f
      Authored by Ingo Molnar
      Impact: restructure code
      
      Change counter math from absolute values to clear delta logic.
      
      We try to extract elapsed deltas from the raw hw counter - and put
      that into the generic counter.
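      A hedged sketch of that delta logic (simplified; names approximate the
      x86 code): atomically snapshot the raw MSR value against the previously
      seen value, and accumulate only the difference:

        u64 prev, now;

        again:
                prev = atomic64_read(&hwc->prev_count);
                rdmsrl(hwc->counter_base + idx, now);
                if (atomic64_cmpxchg(&hwc->prev_count, prev, now) != prev)
                        goto again;     /* raced with another update, retry */

                atomic64_add(now - prev, &counter->count);  /* elapsed delta */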
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  2. 11 December 2008, 14 commits
    • perf counters: clean up state transitions · 6a930700
      Authored by Ingo Molnar
      Impact: cleanup
      
      Introduce a proper enum for the 3 states of a counter:
      
      	PERF_COUNTER_STATE_OFF		= -1
      	PERF_COUNTER_STATE_INACTIVE	=  0
      	PERF_COUNTER_STATE_ACTIVE	=  1
      
      and rename counter->active to counter->state and propagate the
      changes everywhere.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf counters: add prctl interface to disable/enable counters · 1d1c7ddb
      Authored by Ingo Molnar
      Add a way for self-monitoring tasks to disable/enable counters summarily,
      via a prctl:
      
      	PR_TASK_PERF_COUNTERS_DISABLE		31
      	PR_TASK_PERF_COUNTERS_ENABLE		32
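      A self-monitoring task can then bracket a region of interest (sketch;
      the PR_* values are the ones introduced above):

        #include <sys/prctl.h>

        #define PR_TASK_PERF_COUNTERS_DISABLE   31
        #define PR_TASK_PERF_COUNTERS_ENABLE    32

        prctl(PR_TASK_PERF_COUNTERS_ENABLE, 0, 0, 0, 0);
        /* ... region whose counters we want ticking ... */
        prctl(PR_TASK_PERF_COUNTERS_DISABLE, 0, 0, 0, 0);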
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf counters: implement PERF_COUNT_TASK_CLOCK · bae43c99
      Authored by Ingo Molnar
      Impact: add new perf-counter type
      
      The 'task clock' counter counts the amount of time a task is executing,
      in nanoseconds.  It stops ticking when the task is scheduled out, whether
      due to blocking, sleeping, or being preempted.
      
      This counter type is a Linux kernel based abstraction, it is available
      even if the hardware does not support native hardware performance counters.
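      Minimal usage sketch, assuming the struct-based syscall from the
      API-restructure commit above:

        struct perf_counter_hw_event hw_event = {
                .type = PERF_COUNT_TASK_CLOCK,   /* kernel software counter */
        };
        int fd = perf_counter_open(&hw_event, 0 /* this task */,
                                   -1 /* any cpu */, -1 /* no group */);

        u64 ns;
        read(fd, &ns, sizeof(ns));   /* ns the task actually spent executing */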
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf counters: consolidate hw_perf save/restore APIs · 01b2838c
      Authored by Ingo Molnar
      Impact: cleanup
      
      Rename them to better match the usual IRQ disable/enable APIs:
      
       hw_perf_disable_all()  => hw_perf_save_disable()
       hw_perf_restore_ctrl() => hw_perf_restore()
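      The call pattern now mirrors local_irq_save()/local_irq_restore():

        u64 perf_flags;

        perf_flags = hw_perf_save_disable();   /* was: hw_perf_disable_all()  */
        /* ... modify counter state atomically ... */
        hw_perf_restore(perf_flags);           /* was: hw_perf_restore_ctrl() */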
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf counters: implement PERF_COUNT_CPU_CLOCK · 5c92d124
      Authored by Ingo Molnar
      Impact: add new perf-counter type
      
      The 'CPU clock' counter counts elapsed CPU clock time, in nanoseconds,
      regardless of how much of that time the task actually spends executing
      on a CPU.
      
      This counter type is a Linux kernel based abstraction, it is available
      even if the hardware does not support native hardware performance counters.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf counters: hw driver API · 621a01ea
      Authored by Ingo Molnar
      Impact: restructure code, introduce hw_ops driver abstraction
      
      Introduce this abstraction to handle counter details:
      
       struct hw_perf_counter_ops {
      	void (*hw_perf_counter_enable)	(struct perf_counter *counter);
      	void (*hw_perf_counter_disable)	(struct perf_counter *counter);
      	void (*hw_perf_counter_read)	(struct perf_counter *counter);
       };
      
      This will be useful to support asymmetric hw details, and it will also
      be useful to implement "software counters" - counters that count
      kernel-managed sw events such as pagefaults, context switches,
      wall-clock time or task-local time.
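      For instance, a software CPU-clock counter can supply its own ops - a
      hedged sketch modeled on the software-counter commits in this series,
      with illustrative bodies:

        static void cpu_clock_perf_counter_enable(struct perf_counter *counter)  { }
        static void cpu_clock_perf_counter_disable(struct perf_counter *counter) { }

        static void cpu_clock_perf_counter_read(struct perf_counter *counter)
        {
                atomic64_set(&counter->count, cpu_clock(smp_processor_id()));
        }

        static const struct hw_perf_counter_ops perf_ops_cpu_clock = {
                .hw_perf_counter_enable  = cpu_clock_perf_counter_enable,
                .hw_perf_counter_disable = cpu_clock_perf_counter_disable,
                .hw_perf_counter_read    = cpu_clock_perf_counter_read,
        };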
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf counters: group counter, fixes · ccff286d
      Authored by Ingo Molnar
      Impact: bugfix
      
      Check that a group does not span outside the context of a CPU or a task.
      
      Also, do not allow deep recursive hierarchies.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf counters: add support for group counters · 04289bb9
      Authored by Ingo Molnar
      Impact: add group counters
      
      This patch adds the "counter groups" abstraction.
      
      Groups of counters behave much like normal 'single' counters, with a
      few semantic and behavioral extensions on top of that.
      
      A counter group is created by creating a new counter with the open()
      syscall's group-leader group_fd file descriptor parameter pointing
      to another, already existing counter.
      
      Groups of counters are scheduled in and out as one atomic unit, and
      they are also round-robin scheduled atomically.
      
      Counters that are members of a group can also record events with an
      (atomic) extended timestamp that extends to all members of the group,
      if the record type is set to PERF_RECORD_GROUP.
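      Group construction from userspace, in sketch form (group_fd == -1
      creates a new group with this counter as leader):

        struct perf_counter_hw_event cycles = { .type = PERF_COUNT_CYCLES };
        struct perf_counter_hw_event instrs = { .type = PERF_COUNT_INSTRUCTIONS };

        int leader = perf_counter_open(&cycles, 0, -1, -1);      /* new group */
        int member = perf_counter_open(&instrs, 0, -1, leader);  /* joins it  */
        /* both counters now go on and off the PMU as one atomic unit */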
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf counters: restructure the API · 9f66a381
      Authored by Ingo Molnar
      Impact: clean up new API
      
      Thorough cleanup of the new perf counters API; we now get clean
      separation of the various concepts:
      
       - introduce perf_counter_hw_event to separate out the event source details
      
       - move special type flags into separate attributes: PERF_COUNT_NMI,
         PERF_COUNT_RAW
      
       - extend the type to u64 and reserve it fully to the architecture in the
         raw type case.
      
      And make use of all these changes in the core and x86 perfcounters code.
      
      Also change the syscall signature to:
      
        asmlinkage int sys_perf_counter_open(
      
      	struct perf_counter_hw_event	*hw_event_uptr		__user,
      	pid_t				pid,
      	int				cpu,
      	int				group_fd);
      
      ( Note that group_fd is unused for now - it's reserved for the counter
        groups abstraction. )
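      Usage sketch of the new shape, with the former type flags expressed as
      attributes (the raw event code below is an arbitrary illustrative value):

        struct perf_counter_hw_event hw_event = {
                .raw  = 1,        /* .type carries a raw, arch-specific code */
                .type = 0x412e,   /* illustrative raw PMU event value */
                .nmi  = 1,        /* PERF_COUNT_NMI, now an attribute */
        };

        int fd = perf_counter_open(&hw_event, 0 /* pid: self */,
                                   -1 /* cpu: any */, -1 /* group_fd: unused */);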
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf counters: expand use of counter->event · dfa7c899
      Authored by Thomas Gleixner
      Impact: change syscall, cleanup
      
      Make use of the new perf_counters event type.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • perf counters: clean up 'raw' type API · eab656ae
      Authored by Thomas Gleixner
      Impact: cleanup
      
      Introduce a separate hw_event type.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • fix mapping_writably_mapped() · b88ed205
      Authored by Hugh Dickins
      Lee Schermerhorn noticed yesterday that I broke the mapping_writably_mapped
      test in 2.6.7!  Bad bad bug, good good find.
      
      The i_mmap_writable count must be incremented for VM_SHARED (just as
      i_writecount is for VM_DENYWRITE, but while holding the i_mmap_lock)
      when dup_mmap() copies the vma for fork: it has its own more optimal
      version of __vma_link_file(), and I missed this out.  So the count
      was later going down to 0 (dangerous) when one end unmapped, then
      wrapping negative (inefficient) when the other end unmapped.
      
      The only impact on x86 would have been that setting a mandatory lock on
      a file which has at some time been opened O_RDWR and mapped MAP_SHARED
      (but not necessarily PROT_WRITE) across a fork, might fail with -EAGAIN
      when it should succeed, or succeed when it should fail.
      
      But those architectures which rely on flush_dcache_page() to flush
      userspace modifications back into the page before the kernel reads it,
      may in some cases have skipped the flush after such a fork - though any
      repetitive test will soon wrap the count negative, in which case it will
      flush_dcache_page() unnecessarily.
      
      The fix would be a two-liner, but a mapping variable is added and a
      comment moved.
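      The essence of those two lines in dup_mmap()'s file-vma path (hedged
      sketch, surrounding context elided):

        spin_lock(&mapping->i_mmap_lock);
        if (tmp->vm_flags & VM_SHARED)
                mapping->i_mmap_writable++;  /* the increment dup_mmap() missed */
        /* ... link the vma, as __vma_link_file() would ... */
        spin_unlock(&mapping->i_mmap_lock);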
      Reported-by: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • KSYM_SYMBOL_LEN fixes · 9c246247
      Authored by Hugh Dickins
      Miles Lane tailing /sys files hit a BUG which Pekka Enberg has tracked
      to my commit 966c8c12 ("sprint_symbol(): use less stack"), which exposed
      a bug in slub's list_locations(): kallsyms_lookup() writes a 0 to
      namebuf[KSYM_NAME_LEN-1], but that was beyond the end of the page
      provided.
      
      The 100 bytes of slop which list_locations() allows at the end of the
      page look roughly enough for all the other stuff it might print after
      the symbol before it checks again: break out KSYM_SYMBOL_LEN earlier
      than before.
      
      Latencytop and ftrace are using KSYM_NAME_LEN buffers where they need
      KSYM_SYMBOL_LEN buffers, and vmallocinfo uses a 2*KSYM_NAME_LEN buffer
      where it wants a KSYM_SYMBOL_LEN buffer: fix those before anyone copies
      them.
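      The sizing rule being enforced, in sketch form:

        char buf[KSYM_SYMBOL_LEN];    /* not KSYM_NAME_LEN: sprint_symbol()
                                       * also emits module name and offsets */
        sprint_symbol(buf, address);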
      
      [akpm@linux-foundation.org: ftrace.h needs module.h]
      Signed-off-by: Hugh Dickins <hugh@veritas.com>
      Cc: Christoph Lameter <cl@linux-foundation.org>
      Cc: Miles Lane <miles.lane@gmail.com>
      Acked-by: Pekka Enberg <penberg@cs.helsinki.fi>
      Acked-by: Steven Rostedt <srostedt@redhat.com>
      Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • relayfs: fix infinite loop with splice() · fbb5b7ae
      Authored by Tom Zanussi
      Running kmemtraced, which uses splice() on relayfs, causes a hard lock on
      x86-64 SMP.  As described by Tom Zanussi:
      
        It looks like you hit the same problem as described here:
      
        commit 8191ecd1
      
            splice: fix infinite loop in generic_file_splice_read()
      
        relay uses the same loop but it never got noticed or fixed.
      
      Cc: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
      Tested-by: Pekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: Tom Zanussi <tzanussi@gmail.com>
      Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  3. 10 December 2008, 1 commit
    • sched: CPU remove deadlock fix · 9a2bd244
      Authored by Brian King
      Impact: fix possible deadlock in CPU hot-remove path
      
      This patch fixes a possible deadlock scenario in the CPU remove path.
      migration_call grabs rq->lock, then wakes up everything on rq->migration_queue
      with the lock held. Then one of the tasks on the migration queue ends up
      calling tg_shares_up which then also tries to acquire the same rq->lock.
      
      [c000000058eab2e0] c000000000502078 ._spin_lock_irqsave+0x98/0xf0
      [c000000058eab370] c00000000008011c .tg_shares_up+0x10c/0x20c
      [c000000058eab430] c00000000007867c .walk_tg_tree+0xc4/0xfc
      [c000000058eab4d0] c0000000000840c8 .try_to_wake_up+0xb0/0x3c4
      [c000000058eab590] c0000000000799a0 .__wake_up_common+0x6c/0xe0
      [c000000058eab640] c00000000007ada4 .complete+0x54/0x80
      [c000000058eab6e0] c000000000509fa8 .migration_call+0x5fc/0x6f8
      [c000000058eab7c0] c000000000504074 .notifier_call_chain+0x68/0xe0
      [c000000058eab860] c000000000506568 ._cpu_down+0x2b0/0x3f4
      [c000000058eaba60] c000000000506750 .cpu_down+0xa4/0x108
      [c000000058eabb10] c000000000507e54 .store_online+0x44/0xa8
      [c000000058eabba0] c000000000396260 .sysdev_store+0x3c/0x50
      [c000000058eabc10] c0000000001a39b8 .sysfs_write_file+0x124/0x18c
      [c000000058eabcd0] c00000000013061c .vfs_write+0xd0/0x1bc
      [c000000058eabd70] c0000000001308a4 .sys_write+0x68/0x114
      [c000000058eabe30] c0000000000086b4 syscall_exit+0x0/0x40
      Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  4. 09 December 2008, 4 commits
  5. 08 December 2008, 1 commit
    • performance counters: core code · 0793a61d
      Authored by Thomas Gleixner
      Implement the core kernel bits of Performance Counters subsystem.
      
      The Linux Performance Counter subsystem provides an abstraction of
      performance counter hardware capabilities. It provides per task and per
      CPU counters, and it provides event capabilities on top of those.
      
      Performance counters are accessed via special file descriptors.
      There's one file descriptor per virtual counter used.
      
      The special file descriptor is opened via the perf_counter_open()
      system call:
      
       int
       perf_counter_open(u32 hw_event_type,
                         u32 hw_event_period,
                         u32 record_type,
                         pid_t pid,
                         int cpu);
      
      The syscall returns the new fd. The fd can be used via the normal
      VFS system calls: read() can be used to read the counter, fcntl()
      can be used to set the blocking mode, etc.
      
      Multiple counters can be kept open at a time, and the counters
      can be poll()ed.
      
      See more details in Documentation/perf-counters.txt.
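      A minimal userspace sketch against this initial five-argument form
      (taking event type 0 to mean hardware cycles, per the series' examples):

        u64 count;
        int fd = perf_counter_open(0,   /* hw_event_type: cycles       */
                                   0,   /* hw_event_period             */
                                   0,   /* record_type                 */
                                   0,   /* pid: monitor current task   */
                                   -1); /* cpu: wherever the task runs */

        read(fd, &count, sizeof(count));   /* plain VFS read of the counter */
        close(fd);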
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  6. 04 December 2008, 2 commits
    • [PATCH] kill obsolete temporary comment in swsusp_close() · 50c396d3
      Authored by Al Viro
      It had been put there to mark the call of blkdev_put() that needed the
      proper argument propagated to it; a later patch in the same series did
      just that.
      Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
    • time: catch xtime_nsec underflows and fix them · 6c9bacb4
      Authored by john stultz
      Impact: fix time warp bug
      
      Alex Shi, along with Yanmin Zhang have been noticing occasional time
      inconsistencies recently. Through their great diagnosis, they found that
      the xtime_nsec value used in update_wall_time was occasionally going
      negative.  After looking through the code for a while, I realized we have
      the possibility of an underflow when three conditions are met in
      update_wall_time():
      
        1) We have accumulated a second's worth of nanoseconds, so we
           incremented xtime.tv_sec and appropriately decremented xtime_nsec.
           (This doesn't cause xtime_nsec to go negative, but it can cause it
            to be small).
      
        2) The remaining offset value is large, but just slightly less than
           cycle_interval.
      
        3) clocksource_adjust() is speeding up the clock, causing a
           corrective amount (compensating for the increase in the multiplier
           being multiplied against the unaccumulated offset value) to be
           subtracted from xtime_nsec.
      
      This can cause xtime_nsec to underflow.
      
      Unfortunately, since we notify the NTP subsystem via second_overflow()
      whenever we accumulate a full second, and this affects the error
      accumulation that has already occurred, we cannot simply revert the
      accumulated second from xtime nor move the second accumulation to after
      the clocksource_adjust call without a change in behavior.
      
      This leaves us with (at least) two options:
      
      1) Simply return from clocksource_adjust() without making a change if we
         notice the adjustment would cause xtime_nsec to go negative.
      
      This would work, but I'm concerned that if a large adjustment was needed
      (due to the error being large), it may be possible to get stuck with an
      ever increasing error that becomes too large to correct (since it may
      always force xtime_nsec negative). This may just be paranoia on my part.
      
      2) Catch xtime_nsec if it is negative, then add back the amount by
         which it is negative to both xtime_nsec and the error.
      
      This second method is consistent with how we've handled earlier rounding
      issues, and it also has the benefit that the error being added is always
      in the opposite direction and always equal to or smaller than the
      correction being applied.  So the risk of a corner case where things get
      out of control is lessened.
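      A sketch of method 2 as it would sit in the timekeeping code (field
      names per the 2.6.28-era clocksource structure; treat the details as
      approximate):

        if ((s64)clock->xtime_nsec < 0) {
                s64 neg = -(s64)clock->xtime_nsec;

                clock->xtime_nsec = 0;   /* clamp the underflow...            */
                clock->error += neg << (NTP_SCALE_SHIFT - clock->shift);
                                         /* ...and feed it back as NTP error
                                          * to be corrected gradually         */
        }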
      
      This patch fixes bug 11970, as tested by Yanmin Zhang
      http://bugzilla.kernel.org/show_bug.cgi?id=11970
      
      Reported-by: alex.shi@intel.com
      Signed-off-by: John Stultz <johnstul@us.ibm.com>
      Acked-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
      Tested-by: "Zhang, Yanmin" <yanmin_zhang@linux.intel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  7. 03 December 2008, 1 commit
  8. 02 December 2008, 2 commits
    • taint: add missing comment · a8005992
      Authored by Arjan van de Ven
      The description for 'D' was missing in the comment...  (causing me a
      minute of WTF followed by looking at more of the code)
      Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    • epoll: introduce resource usage limits · 7ef9964e
      Authored by Davide Libenzi
      It had been thought that the per-user file descriptor limit would also
      limit the resources that a normal user can request via the epoll
      interface.  Vegard Nossum reported a very simple program (a modified
      version attached) that allows a normal user to request a pretty large
      amount of kernel memory, well within its maximum number of fds.  To
      solve this problem, default limits are now imposed, and /proc based
      configuration has been introduced.  A new directory has been created,
      named /proc/sys/fs/epoll/, and inside it there are two configuration
      points:
      
        max_user_instances = Maximum number of devices - per user
      
        max_user_watches   = Maximum number of "watched" fds - per user
      
      The current default for "max_user_watches" limits the memory used by
      epoll to store "watches" to 1/32 of the low RAM.  For example, a 32-bit
      machine with 256MB of RAM will have "max_user_watches" set to roughly
      90000.  That should be enough not to break existing heavy epoll users.
      The default value for "max_user_instances" is 128, which should be
      enough too.
      
      This also changes userspace-visible behavior, because a new error code
      can now come out of EPOLL_CTL_ADD (-ENOSPC).  The EMFILE from
      epoll_create() was already listed, so that should be ok.
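      A caller distinguishing the new error (sketch; epfd and fd assumed
      already set up):

        #include <errno.h>
        #include <sys/epoll.h>

        struct epoll_event ev = { .events = EPOLLIN, .data.fd = fd };

        if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) < 0 && errno == ENOSPC) {
                /* per-user /proc/sys/fs/epoll/max_user_watches limit reached */
        }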
      
      [akpm@linux-foundation.org: use get_current_user()]
      Signed-off-by: Davide Libenzi <davidel@xmailserver.org>
      Cc: Michael Kerrisk <mtk.manpages@gmail.com>
      Cc: <stable@kernel.org>
      Cc: Cyrill Gorcunov <gorcunov@gmail.com>
      Reported-by: Vegard Nossum <vegardno@ifi.uio.no>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
  9. 01 December 2008, 2 commits
  10. 30 November 2008, 2 commits
    • sched: prevent divide by zero error in cpu_avg_load_per_task, update · af6d596f
      Authored by Ingo Molnar
      Regarding the bug addressed in:
      
        4cd42620: sched: prevent divide by zero error in cpu_avg_load_per_task
      
      Linus points out that the fix is not complete:
      
      > There's nothing that keeps gcc from deciding not to reload
      > rq->nr_running.
      >
      > Of course, in _practice_, I don't think gcc ever will (if it decides
      > that it will spill, gcc is likely going to decide that it will
      > literally spill the local variable to the stack rather than decide to
      > reload off the pointer), but it's a valid compiler optimization, and
      > it even has a name (rematerialization).
      >
      > So I suspect that your patch does fix the bug, but it still leaves the
      > fairly unlikely _potential_ for it to re-appear at some point.
      >
      > We have ACCESS_ONCE() as a macro to guarantee that the compiler
      > doesn't rematerialize a pointer access. That also would clarify
      > the fact that we access something unsafe outside a lock.
      
      So make sure our nr_running value is immutable and cannot change
      after we check it for nonzero.
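      The resulting shape (sketch, close to the actual change):

        static unsigned long cpu_avg_load_per_task(int cpu)
        {
                struct rq *rq = cpu_rq(cpu);
                unsigned long nr_running = ACCESS_ONCE(rq->nr_running);

                if (nr_running)
                        rq->avg_load_per_task = rq->load.weight / nr_running;
                else
                        rq->avg_load_per_task = 0;

                return rq->avg_load_per_task;
        }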
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • sched, cpusets: fix warning in kernel/cpuset.c · 1583715d
      Authored by Ingo Molnar
      this warning:
      
        kernel/cpuset.c: In function ‘generate_sched_domains’:
        kernel/cpuset.c:588: warning: ‘ndoms’ may be used uninitialized in this function
      
      triggers because GCC does not recognize that ndoms stays uninitialized
      only if doms is NULL - but that flow is covered at the end of
      generate_sched_domains().
      
      Help out GCC by initializing this variable to 0. (that's prudent anyway)
      
      Also, this function needs a split-up and code-flow simplification:
      at 160 lines it is clearly too long.
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  11. 27 November 2008, 2 commits
    • sched: prevent divide by zero error in cpu_avg_load_per_task · 4cd42620
      Authored by Steven Rostedt
      Impact: fix divide by zero crash in scheduler rebalance irq
      
      While testing the branch profiler, I hit this crash:
      
      divide error: 0000 [#1] PREEMPT SMP
      [...]
      RIP: 0010:[<ffffffff8024a008>]  [<ffffffff8024a008>] cpu_avg_load_per_task+0x50/0x7f
      [...]
      Call Trace:
       <IRQ> <0> [<ffffffff8024fd43>] find_busiest_group+0x3e5/0xcaa
       [<ffffffff8025da75>] rebalance_domains+0x2da/0xa21
       [<ffffffff80478769>] ? find_next_bit+0x1b2/0x1e6
       [<ffffffff8025e2ce>] run_rebalance_domains+0x112/0x19f
       [<ffffffff8026d7c2>] __do_softirq+0xa8/0x232
       [<ffffffff8020ea7c>] call_softirq+0x1c/0x3e
       [<ffffffff8021047a>] do_softirq+0x94/0x1cd
       [<ffffffff8026d5eb>] irq_exit+0x6b/0x10e
       [<ffffffff8022e6ec>] smp_apic_timer_interrupt+0xd3/0xff
       [<ffffffff8020e4b3>] apic_timer_interrupt+0x13/0x20
      
      The code for cpu_avg_load_per_task has:
      
      	if (rq->nr_running)
      		rq->avg_load_per_task = rq->load.weight / rq->nr_running;
      
      The runqueue lock is not held here, and there is nothing that prevents
      the rq->nr_running from going to zero after it passes the if condition.
      
      The branch profiler simply made the race window bigger.
      
      This patch saves off the rq->nr_running to a local variable and uses that
      for both the condition and the division.
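      In sketch form (the follow-up commit above adds ACCESS_ONCE() on top of
      this):

        unsigned long nr_running = rq->nr_running;   /* snapshot once */

        if (nr_running)
                rq->avg_load_per_task = rq->load.weight / nr_running;
        else
                rq->avg_load_per_task = 0;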
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • ftrace: prevent recursion · 4f5a7f40
      Authored by Lai Jiangshan
      Impact: prevent unnecessary stack recursion
      
      If the resched flag was set before we entered, then don't reschedule.
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Acked-by: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  12. 24 November 2008, 2 commits
  13. 21 November 2008, 2 commits
  14. 20 November 2008, 3 commits