1. 09 Feb 2009 (6 commits)
  2. 08 Feb 2009 (4 commits)
    • trace: trivial fixes in comment typos. · 57794a9d
      Committed by Wenji Huang
      Impact: clean up
      
      Fixed several typos in the comments.
      Signed-off-by: Wenji Huang <wenji.huang@oracle.com>
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      57794a9d
    • ring-buffer: use generic version of in_nmi · a81bd80a
      Committed by Steven Rostedt
      Impact: clean up
      
      Now that a generic in_nmi is available, this patch removes the
      special code from the ring buffer and uses the generic in_nmi
      version instead.
      
      With this change, I was also able to rename the "arch_ftrace_nmi_enter"
      back to "ftrace_nmi_enter" and remove the code from the ring buffer.
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      a81bd80a
    • ring-buffer: add NMI protection for spinlocks · 78d904b4
      Committed by Steven Rostedt
      Impact: prevent deadlock in NMI
      
      The ring buffers are not yet totally lockless with writing to
      the buffer. When a writer crosses a page, it grabs a per cpu spinlock
      to protect against a reader. The spinlocks taken by a writer are not
      to protect against other writers, since a writer can only write to
      its own per cpu buffer. The spinlocks protect against readers that
      can touch any cpu buffer. The writers are made to be reentrant
      with the spinlocks disabling interrupts.
      
      The problem arises when an NMI writes to the buffer, and that write
      crosses a page boundary. If it grabs a spinlock, it can be racing
      with another writer (since disabling interrupts does not protect
      against NMIs) or with a reader on the same CPU. Luckily, most of the
      users are not reentrant and protect against this issue. But if a
      user of the ring buffer becomes reentrant (which the ring buffers
      do allow), and the NMI also writes to the ring buffer, then we
      risk a deadlock.
      
      This patch moves the ftrace_nmi_enter called by nmi_enter() into the
      ring buffer code. It renames the ftrace_nmi_enter used by arch
      specific code to arch_ftrace_nmi_enter and updates the Kconfig to
      handle it.
      
      When an NMI is called, it will set a per cpu variable in the ring buffer
      code and will clear it when the NMI exits. If a write to the ring buffer
      crosses page boundaries inside an NMI, a trylock is used on the spin
      lock instead. If the spinlock fails to be acquired, then the entry
      is discarded.
      
      This bug appeared in the ftrace work in the RT tree, where event tracing
      is reentrant. This workaround solved the deadlocks that appeared there.
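      
      A minimal sketch of the trylock-in-NMI idea described above (the
      helper names and structure here are illustrative, not the actual
      kernel code):
      
          /* Illustrative only: per cpu NMI flag plus trylock fallback. */
          static DEFINE_PER_CPU(int, rb_in_nmi);
          
          void ftrace_nmi_enter(void) { __get_cpu_var(rb_in_nmi) = 1; }
          void ftrace_nmi_exit(void)  { __get_cpu_var(rb_in_nmi) = 0; }
          
          static int rb_take_reader_lock(spinlock_t *lock)
          {
                  if (__get_cpu_var(rb_in_nmi))
                          /* In NMI: never spin, we may have interrupted
                           * the lock holder on this very CPU. */
                          return spin_trylock(lock);
                  spin_lock(lock);
                  return 1;       /* 0 means: discard the entry */
          }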
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      78d904b4
    • trace: remove deprecated entry->cpu · 1830b52d
      Committed by Steven Rostedt
      Impact: fix to prevent developers from using entry->cpu
      
      With the new ring buffer infrastructure, the cpu for the entry is
      implicit with which CPU buffer it is on.
      
      The original code used to record the current cpu in the generic
      entry header, where it could be retrieved via entry->cpu. When the
      ring buffer was introduced, users were converted to use the cpu
      number of the cpu ring buffer in use (this was passed to the
      tracers by the iterator: iter->cpu).
      
      Unfortunately, the cpu item in the entry structure was never removed.
      This allowed for developers to use it instead of the proper iter->cpu,
      unknowingly, using an uninitialized variable. This was not the fault
      of the developers, since it would seem like the logical place to
      retrieve the cpu identifier.
      
      This patch removes the cpu item from the entry structure and fixes
      all the users that should have been using iter->cpu.
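      
      For illustration, an output callback should now take the cpu from
      the iterator (a hypothetical callback, not part of the patch):
      
          static enum print_line_t my_print_line(struct trace_iterator *iter)
          {
                  struct trace_entry *entry = iter->ent;
          
                  /* iter->cpu is the cpu buffer being read; the old
                   * entry->cpu field no longer exists. */
                  trace_seq_printf(&iter->seq, "cpu=%d pid=%d\n",
                                   iter->cpu, entry->pid);
                  return TRACE_TYPE_HANDLED;
          }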
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      1830b52d
  3. 07 Feb 2009 (1 commit)
  4. 06 Feb 2009 (6 commits)
    • trace: Call tracing_reset_online_cpus before tracer->init() · b6f11df2
      Committed by Arnaldo Carvalho de Melo
      Impact: cleanup
      
      To make it easy for ftrace plugin writers, as this was open coded in
      the existing plugins.
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Acked-by: Frédéric Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      b6f11df2
    • tracing: Introduce trace_buffer_{lock_reserve,unlock_commit} · 51a763dd
      Committed by Arnaldo Carvalho de Melo
      Impact: new API
      
      These new functions do what previously was being open coded, reducing
      the number of details ftrace plugin writers have to worry about.
      
      It also standardizes the handling of stacktrace, userstacktrace and
      other trace options we may introduce in the future.
      
      With this patch, for instance, the blk tracer (and some others already
      in the tree) can use the "userstacktrace" /d/tracing/trace_options
      facility.
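      
      A sketch of the intended usage, with signatures approximated from
      this era of the code (check the actual headers before relying on
      them):
      
          struct ring_buffer_event *event;
          struct ftrace_entry *entry;
          
          event = trace_buffer_lock_reserve(tr, TRACE_FN, sizeof(*entry),
                                            flags, pc);
          if (!event)
                  return;
          entry = ring_buffer_event_data(event);
          entry->ip        = ip;
          entry->parent_ip = parent_ip;
          /* handles stacktrace/userstacktrace options, then commits */
          trace_buffer_unlock_commit(tr, event, flags, pc);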
      
      $ codiff /tmp/vmlinux.before /tmp/vmlinux.after
      linux-2.6-tip/kernel/trace/trace.c:
        trace_vprintk              |   -5
        trace_graph_return         |  -22
        trace_graph_entry          |  -26
        trace_function             |  -45
        __ftrace_trace_stack       |  -27
        ftrace_trace_userstack     |  -29
        tracing_sched_switch_trace |  -66
        tracing_stop               |   +1
        trace_seq_to_user          |   -1
        ftrace_trace_special       |  -63
        ftrace_special             |   +1
        tracing_sched_wakeup_trace |  -70
        tracing_reset_online_cpus  |   -1
       13 functions changed, 2 bytes added, 355 bytes removed, diff: -353
      
      linux-2.6-tip/block/blktrace.c:
        __blk_add_trace |  -58
       1 function changed, 58 bytes removed, diff: -58
      
      linux-2.6-tip/kernel/trace/trace.c:
        trace_buffer_lock_reserve  |  +88
        trace_buffer_unlock_commit |  +86
       2 functions changed, 174 bytes added, diff: +174
      
      /tmp/vmlinux.after:
       16 functions changed, 176 bytes added, 413 bytes removed, diff: -237
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Acked-by: Frédéric Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      51a763dd
    • ring_buffer: remove unused flags parameter · 0a987751
      Committed by Arnaldo Carvalho de Melo
      Impact: API change, cleanup
      
      From ring_buffer_{lock_reserve,unlock_commit}.
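      
      In effect (illustrative before/after, signatures approximated):
      
          /* before: an unused flags pointer had to be passed around */
          event = ring_buffer_lock_reserve(buffer, len, &irq_flags);
          ring_buffer_unlock_commit(buffer, event, irq_flags);
          
          /* after: the parameter is gone */
          event = ring_buffer_lock_reserve(buffer, len);
          ring_buffer_unlock_commit(buffer, event);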
      
      $ codiff /tmp/vmlinux.before /tmp/vmlinux.after
      linux-2.6-tip/kernel/trace/trace.c:
        trace_vprintk              |  -14
        trace_graph_return         |  -14
        trace_graph_entry          |  -10
        trace_function             |   -8
        __ftrace_trace_stack       |   -8
        ftrace_trace_userstack     |   -8
        tracing_sched_switch_trace |   -8
        ftrace_trace_special       |  -12
        tracing_sched_wakeup_trace |   -8
       9 functions changed, 90 bytes removed, diff: -90
      
      linux-2.6-tip/block/blktrace.c:
        __blk_add_trace |   -1
       1 function changed, 1 bytes removed, diff: -1
      
      /tmp/vmlinux.after:
       10 functions changed, 91 bytes removed, diff: -91
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Acked-by: Frédéric Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      0a987751
    • wait: prevent exclusive waiter starvation · 777c6c5f
      Committed by Johannes Weiner
      With exclusive waiters, every process woken up through the wait queue must
      ensure that the next waiter down the line is woken when it has finished.
      
      Interruptible waiters don't do that when aborting due to a signal.  And if
      an aborting waiter is concurrently woken up through the waitqueue, no one
      will ever wake up the next waiter.
      
      This has been observed with __wait_on_bit_lock() used by
      lock_page_killable(): the first contender on the queue was aborting when
      the actual lock holder woke it up concurrently.  The aborted contender
      didn't acquire the lock and therefore never did an unlock followed by
      waking up the next waiter.
      
      Add abort_exclusive_wait() which removes the process' wait descriptor from
      the waitqueue, iff still queued, or wakes up the next waiter otherwise.
      It does so under the waitqueue lock.  Racing with a wake up means the
      aborting process is either already woken (removed from the queue) and will
      wake up the next waiter, or it will remove itself from the queue and the
      concurrent wake up will apply to the next waiter after it.
      
      Use abort_exclusive_wait() in __wait_event_interruptible_exclusive() and
      __wait_on_bit_lock() when they were interrupted by other means than a wake
      up through the queue.
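      
      A simplified sketch of the resulting __wait_on_bit_lock() shape (not
      the verbatim patch):
      
          do {
                  prepare_to_wait_exclusive(wq, &q->wait, mode);
                  if (test_bit(q->key.bit_nr, q->key.flags)) {
                          ret = (*action)(q->key.flags);
                          if (ret) {
                                  /* Interrupted: dequeue ourselves or, if
                                   * already woken, pass the wakeup on. */
                                  abort_exclusive_wait(wq, &q->wait,
                                                       mode, &q->key);
                                  return ret;
                          }
                  }
          } while (test_and_set_bit(q->key.bit_nr, q->key.flags));
          finish_wait(wq, &q->wait);
          return 0;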
      
      [akpm@linux-foundation.org: coding-style fixes]
      Reported-by: Chris Mason <chris.mason@oracle.com>
      Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
      Mentored-by: Oleg Nesterov <oleg@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Chuck Lever <cel@citi.umich.edu>
      Cc: Nick Piggin <nickpiggin@yahoo.com.au>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: <stable@kernel.org>		["after some testing"]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      777c6c5f
    • revert "rlimit: permit setting RLIMIT_NOFILE to RLIM_INFINITY" · 60fd760f
      Committed by Andrew Morton
      Revert commit 0c2d64fb because it causes
      (arguably poorly designed) existing userspace to spend interminable
      periods closing billions of not-open file descriptors.
      
      We could bring this back, with some sort of opt-in tunable in /proc, which
      defaults to "off".
      
      Peter's analysis follows:
      
      : I spent several hours trying to get to the bottom of a serious
      : performance issue that appeared on one of our servers after upgrading to
      : 2.6.28.  In the end it's what could be considered a userspace bug that
      : was triggered by a change in 2.6.28.  Since this might also affect other
      : people I figured I'd at least document what I found here, and maybe we
      : can even do something about it:
      :
      :
      : So, I upgraded some of debian.org's machines to 2.6.28.1 and immediately
      : the team maintaining our ftp archive complained that one of their
      : scripts that previously ran in a few minutes still hadn't even come
      : close to being done after an hour or so.  Downgrading to 2.6.27 fixed
      : that.
      :
      : Turns out that script is forking a lot, and something in it or python
      : or wherever closes all the file descriptors it doesn't want to pass on.
      : That is, it starts at zero and goes up to ulimit -n/RLIMIT_NOFILE and
      : closes them all with a few exceptions.
      :
      : Turns out that takes a long time when your ulimit -n is now 2^20 (1048576).
      :
      : With 2.6.27.* the ulimit -n was the standard 1024, but with 2.6.28 it is
      : now a thousand times that.
      :
      : 2.6.28 included a patch titled "rlimit: permit setting RLIMIT_NOFILE to
      : RLIM_INFINITY" (0c2d64fb)[1] that
      : allows, as the title implies, to set the limit for number of files to
      : infinity.
      :
      : Closer investigation showed that the broken default ulimit did not apply
      : to "system" processes (like stuff started from init).  In the end I
      : could establish that all processes that passed through pam_limit at one
      : point had the bad resource limit.
      :
      : Apparently the pam library in Debian etch (4.0) initializes the limits
      : to some default values when it doesn't have any settings in limit.conf
      : to override them.  Turns out that for nofiles this is RLIM_INFINITY.
      : Commenting out "case RLIMIT_NOFILE" in pam_limit.c:267 of our pam
      : package version 0.79-5 fixes that - tho I'm not sure what side effects
      : that has.
      :
      : Debian lenny (the upcoming 5.0 version) doesn't have this issue as it
      : uses a different pam (version).
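      
      The userspace pattern in question is the classic close-everything
      loop before exec, which suddenly went from ~1024 to ~2^20 iterations
      per fork (illustrative):
      
          #include <sys/resource.h>
          #include <unistd.h>
          
          static void close_inherited_fds(void)
          {
                  struct rlimit rl;
                  int fd;
          
                  /* With the reverted patch, rlim_cur could be ~2^20 (or
                   * RLIM_INFINITY), making this loop painfully slow. */
                  if (getrlimit(RLIMIT_NOFILE, &rl) == 0)
                          for (fd = 3; fd < (int)rl.rlim_cur; fd++)
                                  close(fd);
          }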
      Reported-by: Peter Palfrader <weasel@debian.org>
      Cc: Adam Tkac <vonsch@gmail.com>
      Cc: Michael Kerrisk <mtk.manpages@googlemail.com>
      Cc: <stable@kernel.org>		[2.6.28.x]
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      60fd760f
    • kernel/async.c: fix printk warnings · 58763a29
      Committed by Andrew Morton
      alpha:
      
      kernel/async.c: In function 'run_one_entry':
      kernel/async.c:141: warning: format '%lli' expects type 'long long int', but argument 2 has type 'async_cookie_t'
      kernel/async.c:149: warning: format '%lli' expects type 'long long int', but argument 2 has type 'async_cookie_t'
      kernel/async.c:149: warning: format '%lld' expects type 'long long int', but argument 4 has type 's64'
      kernel/async.c: In function 'async_synchronize_cookie_special':
      kernel/async.c:250: warning: format '%lli' expects type 'long long int', but argument 3 has type 's64'
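      
      The usual fix for such warnings is an explicit cast, since
      async_cookie_t is a u64 whose underlying type differs between
      architectures (sketch of the pattern):
      
          printk("calling %lli\n", (long long)entry->cookie);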
      
      Cc: Arjan van de Ven <arjan@infradead.org>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      58763a29
  5. 05 Feb 2009 (6 commits)
  6. 04 Feb 2009 (1 commit)
  7. 03 Feb 2009 (5 commits)
    • trace: Change struct trace_event callbacks parameter list · 2c9b238e
      Committed by Arnaldo Carvalho de Melo
      Impact: API change
      
      The trace_seq and trace_entry are in trace_iterator, where there are
      more fields that may be needed by tracers, so just pass the
      trace_iterator, as is already the case for struct tracer->print_line.
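      
      A trace_event callback after this change looks roughly like this
      (signature approximated):
      
          static enum print_line_t my_trace(struct trace_iterator *iter,
                                            int flags)
          {
                  /* trace_seq and the entry are reached via the iterator */
                  trace_seq_printf(&iter->seq, "type=%d\n", iter->ent->type);
                  return TRACE_TYPE_HANDLED;
          }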
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      2c9b238e
    • trace: better manage the context info for events · c4a8e8be
      Committed by Frederic Weisbecker
      Impact: make trace_event more convenient for tracers
      
      All tracers (for the moment) that use the struct trace_event want to
      have the context info printed before their own output: the pid/cmdline,
      cpu, and timestamp.
      
      But some other tracers that want to implement their trace_event
      callbacks will not necessarily need this information, or they may
      want to format it as they wish.
      
      This patch adds a new default-enabled trace option,
      TRACE_ITER_CONTEXT_INFO. When it is disabled through:
      
      echo nocontext-info > /debugfs/tracing/trace_options
      
      the pid, cpu and timestamp headers will not be printed.
      
      E.g. the sched_switch tracer with context-info (default):
      
           bash-2935 [001] 100.356561: 2935:120:S ==> [001]  0:140:R <idle>
         <idle>-0    [000] 100.412804:    0:140:R   + [000] 11:115:S events/0
         <idle>-0    [000] 100.412816:    0:140:R ==> [000] 11:115:R events/0
       events/0-11   [000] 100.412829:   11:115:S ==> [000]  0:140:R <idle>
      
      Without context-info:
      
       2935:120:S ==> [001]  0:140:R <idle>
          0:140:R   + [000] 11:115:S events/0
          0:140:R ==> [000] 11:115:R events/0
         11:115:S ==> [000]  0:140:R <idle>
      
      A tracer can disable it at runtime by clearing the bit
      TRACE_ITER_CONTEXT_INFO in trace_flags.
      
      The print routines were renamed to trace_print_context and
      trace_print_lat_context, so that they can be used by tracers if they
      want to use them for one of the trace_event callbacks.
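      
      Internally this boils down to a flag test before the headers are
      printed (simplified):
      
          if (trace_flags & TRACE_ITER_CONTEXT_INFO)
                  trace_print_context(iter);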
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      c4a8e8be
    • trace: let boot trace be chosen by command line · 79fb0768
      Committed by Steven Rostedt
      Now that we have a working ftrace=<tracer> command line option, make
      the boot tracer get activated by it. This way we can turn it on or off
      without recompiling the kernel, as well as keep the selftests on. The
      selftests are disabled whenever a default tracer starts running.
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      79fb0768
    • trace: fix default boot up tracer · b2821ae6
      Committed by Steven Rostedt
      Peter Zijlstra started the functionality to start a default tracer
      at bootup. This patch finishes the work.
      
      Now if you add 'ftrace=<tracer>' to the command line, when that tracer
      is registered on bootup, that tracer is selected and starts tracing.
      
      Note, all selftests for tracers that are registered after this tracer
      are disabled. This prevents the selftests from disturbing the running
      tracer, or the running tracer from disturbing the selftest.
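      
      For example, appending this to the kernel command line selects the
      function tracer as soon as it registers during boot:
      
          ftrace=function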
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      b2821ae6
    • modules: Use a better scheme for refcounting · 720eba31
      Committed by Eric Dumazet
      Current refcounting for modules (done if CONFIG_MODULE_UNLOAD=y) is
      using a lot of memory.
      
      Each 'struct module' contains an [NR_CPUS] array of full cache lines.
      
      This patch uses existing infrastructure (percpu_modalloc() &
      percpu_modfree()) to allocate percpu space for the refcount storage.
      
      Instead of wasting NR_CPUS*128 bytes (on i386), we now use
      nr_cpu_ids*sizeof(local_t) bytes.
      
      On a typical distro, where NR_CPUS=8, shipping 2000 modules, we reduce
      the size of module files by about 2 MB (1 KB per module).
      
      Instead of having all refcounters in the same memory node - with TLB
      misses because of vmalloc() - this new implementation permits better
      NUMA properties, since each CPU will use storage on its preferred
      node, thanks to percpu storage.
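      
      Simplified shape of the new accessor and its use (see the commit for
      the CONFIG_SMP/UP details):
      
          static inline local_t *__module_ref_addr(struct module *mod, int cpu)
          {
                  return (local_t *)(mod->refptr + per_cpu_offset(cpu));
          }
          
          /* module_get() then only touches this CPU's counter: */
          local_inc(__module_ref_addr(module, get_cpu()));
          put_cpu();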
      Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      720eba31
  8. 01 Feb 2009 (6 commits)
  9. 31 Jan 2009 (4 commits)
    • hrtimer: prevent negative expiry value after clock_was_set() · b0a9b511
      Committed by Thomas Gleixner
      Impact: prevent false positive WARN_ON() in clockevents_program_event()
      
      clock_was_set() changes the base->offset of CLOCK_REALTIME and
      enforces the reprogramming of the clockevent device to expire timers
      which are based on CLOCK_REALTIME. If the clock change is large enough
      then the subtraction of the timer expiry value and base->offset can
      become negative which triggers the warning in
      clockevents_program_event().
      
      Check the subtraction result and set a negative value to 0.
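      
      The fix amounts to a clamp on the computed expiry delta (simplified):
      
          expires = ktime_sub(hrtimer_get_expires(timer), base->offset);
          /* a large clock_was_set() offset change can make this negative */
          if (expires.tv64 < 0)
                  expires.tv64 = 0;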
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      b0a9b511
    • hrtimers: allow the hot-unplugging of all cpus · 94df7de0
      Committed by Sebastien Dugue
      Impact: fix CPU hotplug hang on Power6 testbox
      
      On architectures that support offlining all cpus (at least
      powerpc/pseries), hot-unplugging the tick_do_timer_cpu can result in
      a system hang.
      
      This comes from the fact that if the cpu going down happens to be the
      cpu doing the tick, then as the tick_do_timer_cpu handover happens after the
      cpu is dead (via the CPU_DEAD notification), we're left without ticks,
      jiffies are frozen and any task relying on timers (msleep, ...) is stuck.
      That's particularly the case for the cpu looping in __cpu_die() waiting
      for the dying cpu to be dead.
      
      This patch addresses this by having the tick_do_timer_cpu handover happen
      earlier during the CPU_DYING notification. For this, a new clockevent
      notification type is introduced (CLOCK_EVT_NOTIFY_CPU_DYING) which is triggered
      in hrtimer_cpu_notify().
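      
      Sketch of the notifier hook (simplified from the description above):
      
          case CPU_DYING:
          case CPU_DYING_FROZEN:
                  /* hand off tick duty while the cpu can still run code */
                  clockevents_notify(CLOCK_EVT_NOTIFY_CPU_DYING, &scpu);
                  break;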
      Signed-off-by: Sebastien Dugue <sebastien.dugue@bull.net>
      Cc: <stable@kernel.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      94df7de0
    • hrtimers: increase clock min delta threshold while interrupt hanging · 7f22391c
      Committed by Frederic Weisbecker
      Impact: avoid timer IRQ hanging slow systems
      
      While using the function graph tracer on a virtualized system, the
      hrtimer_interrupt can hang the system on an infinite loop.
      
      This can be caused in several situations:
      
       - the hardware is very slow and HZ is set too high
      
       - something intrusive is slowing the system down (tracing under emulation)
      
      ... and the next clock events to program are always before the current time.
      
      This patch implements a reasonable compromise: if such a situation is
      detected, we cap hrtimer interrupt processing at a quarter of the
      CPU's time. This is enough to let the system run without serious
      starvation.
      
      It has been successfully tested under VirtualBox with 1000 HZ and 100
      HZ with the function graph tracer launched. In both cases, the clock
      event intervals were increased until about 25 ms periodic ticks, which
      means 40 HZ.
      
      So we change a hard-to-debug hang into a warning message and a system
      that still manages to limp along.
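      
      Schematically (illustrative only, the real patch is more involved):
      
          /* if expiries keep landing in the past, back off by raising
           * the device's minimum delta instead of looping forever */
          if (++nr_retries > 3) {
                  dev->min_delta_ns *= 2;
                  printk(KERN_WARNING "hrtimer: interrupt too slow\n");
          }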
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      7f22391c
    • generic-ipi: use per cpu data for single cpu ipi calls · d7240b98
      Committed by Steven Rostedt
      The smp_call_function can be passed a wait parameter telling it to
      wait for all the functions running on other CPUs to complete before
      returning, or to return without waiting. Unfortunately, this is
      currently just a suggestion and not mandatory. That is, the
      smp_call_function can decide to wait instead of returning immediately.
      
      The reason is that it uses kmalloc to allocate storage to send to
      the called CPU, and that CPU will free it when it is done.
      But if we fail to allocate the storage, the stack is used instead.
      This means we must wait for the called CPU to finish before
      continuing.
      
      Unfortunately, some callers do not abide by this hint and act as if
      the non-wait option is mandatory. The MTRR code for instance will
      deadlock if the smp_call_function is set to wait. This is because
      the smp_call_function will wait for the other CPUs to finish their
      called functions, but those functions are waiting on the caller to
      continue.
      
      This patch changes the generic smp_call_function code to use per cpu
      variables if the allocation of the data fails for a single CPU call. The
      smp_call_function_many will fall back to the smp_call_function_single
      if it fails its alloc. The smp_call_function_single is modified
      to not force the wait state.
      
      Since we are now using a single data item per cpu, we must synchronize
      the callers to prevent a second caller from modifying the data before
      the first called IPI functions complete. To do so, I added a flag to
      the call_single_data called CSD_FLAG_LOCK. When the single CPU is
      called (which can happen when a many call fails an alloc), we
      set the LOCK bit on this per cpu data. When the called CPU finishes
      with the data, it clears the LOCK bit.
      
      The caller must wait till the LOCK bit is cleared before setting
      it. When it is cleared, there is no IPI function using it.
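      
      The locking sketch, simplified from the description above:
      
          /* one call_single_data per calling cpu, guarded by a flag */
          struct call_single_data *data = &__get_cpu_var(csd_data);
          
          /* wait until the previous IPI using this csd has finished */
          while (data->flags & CSD_FLAG_LOCK)
                  cpu_relax();
          data->flags = CSD_FLAG_LOCK;
          
          data->func = func;
          data->info = info;
          generic_exec_single(cpu, data);
          /* the called cpu clears CSD_FLAG_LOCK once func has run */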
      Signed-off-by: Steven Rostedt <srostedt@redhat.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Jens Axboe <jens.axboe@oracle.com>
      Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      d7240b98
  10. 30 Jan 2009 (1 commit)