1. 16 Nov 2009 (1 commit)
    • perf_event: Optimize perf_output_lock() · 559fdc3c
      Peter Zijlstra authored
      The purpose of perf_output_{un,}lock() is to:
      
       1) avoid publishing incomplete data
          [ possible when publishing a head that is ahead of an entry
            that is still being written ]
      
       2) guarantee fwd progress
          [ a simple refcount on pending writers doesn't need to drop to
            0; making it do so would end up implementing something like
            the forced quiescent states of RCU ]
      
      To satisfy the above without undue complexity it serializes
      between CPUs, this means that a pending writer can only be the
      same cpu in a nested context, and since (under normal operation)
      a cpu always makes progress we're good -- if the head is only
      published when the bottom  most writer completes.
      
      Now we don't need to disable IRQs in order to serialize between
      CPUs; disabling preemption ought to be sufficient, especially
      since we already deal with nesting due to NMIs.
      
      This avoids potentially expensive (and needless) local IRQ
      disable/enable ops.
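
      A minimal sketch of the idea (illustrative only, not the actual
      kernel/perf_event.c code; publish_head() is a hypothetical helper):
      serialize per CPU with preemption disabled plus a nesting count, so
      NMI/IRQ writers nest on the same CPU and only the bottom-most writer
      publishes the head.

          static DEFINE_PER_CPU(int, output_nest);

          static void output_lock(void)
          {
                  preempt_disable();              /* cheaper than local_irq_save() */
                  __this_cpu_inc(output_nest);    /* nested (NMI) writers just count */
          }

          static void output_unlock(void)
          {
                  if (__this_cpu_dec_return(output_nest) == 0)
                          publish_head();         /* bottom-most writer only */
                  preempt_enable();
          }
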
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      LKML-Reference: <1258373161.26714.254.camel@laptop>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      559fdc3c
  2. 13 Nov 2009 (1 commit)
  3. 10 Nov 2009 (2 commits)
  4. 08 Nov 2009 (7 commits)
    • ksym_tracer: Remove KSYM_SELFTEST_ENTRY · 30ff21e3
      Li Zefan authored
      The macro used to be used in both trace_selftest.c and
      trace_ksym.c, but it no longer is, so remove it from the header file.
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Prasad <prasad@linux.vnet.ibm.com>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      30ff21e3
    • hw-breakpoints: Arbitrate access to pmu following registers constraints · ba1c813a
      Frederic Weisbecker authored
      Allow or refuse to build a counter using the breakpoint PMU, following
      the given constraints.
      
      We keep track of the pmu users by using three per cpu variables:
      
      - nr_cpu_bp_pinned stores the number of pinned per-cpu breakpoint
        counters on the given cpu

      - nr_bp_flexible stores the number of non-pinned breakpoint counters
        on the given cpu

      - task_bp_pinned stores the number of pinned task breakpoints on a cpu
      
      The latter is not a simple counter but gathers the number of tasks that
      have n pinned breakpoints.
      Considering HBP_NUM the number of available breakpoint address
      registers:
         task_bp_pinned[0] is the number of tasks having 1 breakpoint
         task_bp_pinned[1] is the number of tasks having 2 breakpoints
         [...]
         task_bp_pinned[HBP_NUM - 1] is the number of tasks having the
         maximum number of registers (HBP_NUM).
      
      When a breakpoint counter is created and wants access to the PMU,
      we evaluate the following constraints (a rough C sketch of the
      single-cpu non-pinned check follows the list):
      
      == Non-pinned counter ==
      
      - If attached to a single cpu, check:
      
          (per_cpu(nr_bp_flexible, cpu) || (per_cpu(nr_cpu_bp_pinned, cpu)
               + max(per_cpu(task_bp_pinned, cpu)))) < HBP_NUM
      
             -> If there are already non-pinned counters on this cpu, it
                means there is already a free slot for them.
                Otherwise, we check that the maximum number of per-task
                breakpoints (for this cpu) plus the number of per-cpu
                breakpoints (for this cpu) doesn't cover every register.
      
      - If attached to every cpu, check:
      
          (per_cpu(nr_bp_flexible, *) || (max(per_cpu(nr_cpu_bp_pinned, *))
                 + max(per_cpu(task_bp_pinned, *)))) < HBP_NUM
      
             -> This is roughly the same, except we check the number of
                per-cpu breakpoints for every cpu and keep the max one.
                Same for the per-task breakpoints.
      
      == Pinned counter ==
      
      - If attached to a single cpu, check:
      
             ((per_cpu(nr_bp_flexible, cpu) > 1)
                  + per_cpu(nr_cpu_bp_pinned, cpu)
                  + max(per_cpu(task_bp_pinned, cpu))) < HBP_NUM
      
             -> Same checks as before. But now nr_bp_flexible, if any,
                must keep at least one register (or flexible breakpoints
                will never be fed).
      
      - If attached to every cpu, check:
      
            ((per_cpu(nr_bp_flexible, *) > 1)
                 + max(per_cpu(nr_cpu_bp_pinned, *))
                 + max(per_cpu(task_bp_pinned, *))) < HBP_NUM
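
      A rough, illustration-only C sketch of the single-cpu non-pinned
      check described above. The per-cpu variables match the text;
      max_task_bp_pinned() is a hypothetical helper returning the largest
      per-task pinned count on that cpu.

          static int flexible_bp_fits(int cpu)
          {
                  /* flexible events already share a slot on this cpu */
                  if (per_cpu(nr_bp_flexible, cpu))
                          return 1;

                  return per_cpu(nr_cpu_bp_pinned, cpu) +
                         max_task_bp_pinned(cpu) < HBP_NUM;
          }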
      
      Changes in v2:
      
      - Counter -> event rename
      
      Changes in v5:
      
      - Fix unreleased non-pinned task-bound-only counters. They were only
        released on the first cpu. (Thanks to Paul Mackerras for reporting that)
      
      Changes in v6:
      
      - Currently, event scheduling is done in this order: cpu context
        pinned + cpu context non-pinned + task context pinned + task context
        non-pinned events. So our current constraints are right in theory
        but not in practice, because non-pinned counters may be scheduled
        before we can apply every possible pinned counter. So consider
        non-pinned counters as pinned for now.
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Prasad <prasad@linux.vnet.ibm.com>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Jan Kiszka <jan.kiszka@web.de>
      Cc: Jiri Slaby <jirislaby@gmail.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Masami Hiramatsu <mhiramat@redhat.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      ba1c813a
    • hw-breakpoints: Rewrite the hw-breakpoints layer on top of perf events · 24f1e32c
      Frederic Weisbecker authored
      This patch rebases the implementation of the breakpoints API on top of
      perf event instances.
      
      Each breakpoint is now a perf event that handles the
      register scheduling, thread/cpu attachment, etc.
      
      The new layering is now made as follows:
      
             ptrace       kgdb      ftrace   perf syscall
                \          |          /         /
                 \         |         /         /
                                              /
                  Core breakpoint API        /
                                            /
                           |               /
                           |              /
      
                    Breakpoints perf events
      
                           |
                           |
      
                     Breakpoints PMU ---- Debug Register constraints handling
                                          (Part of core breakpoint API)
                           |
                           |
      
                   Hardware debug registers
      
      Reasons for this rewrite:

      - Use the centralized/optimized pmu register scheduling,
        implying easier arch integration
      - More powerful register handling: perf attributes (pinned/flexible
        events, exclusive/non-exclusive, tunable period, etc...)
      
      Impact:
      
      - New perf ABI: the hardware breakpoint counters
      - Ptrace breakpoint setting remains tricky and still needs some
        per-thread breakpoint references.
      
      Todo (in order):
      
      - Support breakpoints perf counter events for perf tools (ie: implement
        perf_bpcounter_event())
      - Support from perf tools
      
      Changes in v2:
      
      - Follow the perf "event" rename
      - The ptrace regression has been fixed (ptrace breakpoint perf events
        weren't released when a task ended)
      - Drop the struct hw_breakpoint and store generic fields in
        perf_event_attr.
      - Separate core and arch specific headers, drop
        asm-generic/hw_breakpoint.h and create linux/hw_breakpoint.h
      - Use new generic len/type for breakpoint
      - Handle off case: when breakpoints api is not supported by an arch
      
      Changes in v3:
      
      - Fix broken CONFIG_KVM: we need to propagate the breakpoint api
        changes to kvm when we exit the guest and restore the bp registers
        to the host.
      
      Changes in v4:
      
      - Drop the hw_breakpoint_restore() stub as it is only used by KVM
      - EXPORT_SYMBOL_GPL hw_breakpoint_restore() as KVM can be built as a
        module
      - Restore the breakpoints unconditionally on kvm guest exit:
        TIF_DEBUG_THREAD no longer covers every case of running
        breakpoints, and vcpu->arch.switch_db_regs might not always be
        set when the guest used debug registers.
        (Waiting for a reliable optimization)
      
      Changes in v5:
      
      - Split-up the asm-generic/hw-breakpoint.h moving to
        linux/hw_breakpoint.h into a separate patch
      - Optimize the breakpoint restoring while switching from kvm guest
        to host. We only want to restore the state if there are active
        breakpoints on the host; otherwise we don't care about messed-up
        address registers.
      - Add asm/hw_breakpoint.h to Kbuild
      - Fix bad breakpoint type in trace_selftest.c
      
      Changes in v6:
      
      - Fix wrong header inclusion in trace.h (triggered a build
        error with CONFIG_FTRACE_SELFTEST)
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Prasad <prasad@linux.vnet.ibm.com>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Jan Kiszka <jan.kiszka@web.de>
      Cc: Jiri Slaby <jirislaby@gmail.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Masami Hiramatsu <mhiramat@redhat.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      24f1e32c
    • sched: Use root_task_group_empty only with FAIR_GROUP_SCHED · e9036b36
      Cyrill Gorcunov authored
      root_task_group_empty is used only with FAIR_GROUP_SCHED
      so if we use other scheduler options we get:
      
        kernel/sched.c:314: warning: 'root_task_group_empty' defined but not used
      
      So move CONFIG_FAIR_GROUP_SCHED up so that it covers
      root_task_group_empty().
      Signed-off-by: Cyrill Gorcunov <gorcunov@openvz.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      LKML-Reference: <20091026192414.GB5321@lenovo>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      e9036b36
    • sched: Fix kernel-doc function parameter name · 968c8645
      Randy Dunlap authored
      Fix variable name in sched.c kernel-doc notation.
      
      Fixes this DocBook warning:
      
       Warning(kernel/sched.c:2008): No description found for parameter 'p'
       Warning(kernel/sched.c:2008): Excess function parameter 'k'
       description in 'kthread_bind'
      Signed-off-by: Randy Dunlap <randy.dunlap@oracle.com>
      LKML-Reference: <4AF4B1BC.8020604@oracle.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      968c8645
    • tracing, perf_events: Protect the buffer from recursion in perf · 444a2a3b
      Frederic Weisbecker authored
      While tracing using events with perf, if one enables the
      lockdep:lock_acquire event, it will infect every other perf
      trace event.
      
      Basically, you can enable whatever set of trace events through
      perf but if this event is part of the set, the only result we
      can get is a long list of lock_acquire events of rcu read lock,
      and only that.
      
      This is because of a recursion inside perf.
      
      1) When a trace event is triggered, it will fill a per cpu
         buffer and submit it to perf.
      
      2) Perf will commit this event but will also protect some data
         using rcu_read_lock
      
      3) A recursion appears: rcu_read_lock triggers a lock_acquire
         event that will fill the per cpu event and then submit the
         buffer to perf.
      
      4) Perf detects a recursion and ignores it
      
      5) Perf continues its work on the previous event, but its buffer
         has been overwritten by the lock_acquire event; it has thus
         been turned into a lock_acquire event of the rcu read lock
      
      Such a scenario also happens with lock_release and
      rcu_read_unlock().
      
      We could turn the rcu_read_lock() into __rcu_read_lock() to drop
      the lock debugging from the perf fast path, but that would make us
      lose the rcu debugging, and it doesn't prevent other possible
      kinds of recursion from perf in the future.
      
      This patch adds a recursion protection based on a counter on the
      perf trace per cpu buffers to solve the problem.
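
      A minimal sketch of the idea (illustrative only, not the actual
      kernel code; names are hypothetical): a per-cpu recursion count
      guarding the per-cpu trace buffer, so a nested event (e.g.
      lock_acquire from rcu_read_lock) bails out instead of overwriting
      the buffer of the event being processed.

          static DEFINE_PER_CPU(int, perf_trace_recursion);

          static int perf_trace_buf_get(void)
          {
                  if (__this_cpu_inc_return(perf_trace_recursion) != 1) {
                          __this_cpu_dec(perf_trace_recursion);
                          return -EBUSY;  /* recursion detected, drop the event */
                  }
                  return 0;
          }

          static void perf_trace_buf_put(void)
          {
                  __this_cpu_dec(perf_trace_recursion);
          }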
      
      -v2: Fixed lost whitespace, added reviewed-by tag
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Reviewed-by: Masami Hiramatsu <mhiramat@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Jason Baron <jbaron@redhat.com>
      LKML-Reference: <1257477185-7838-1-git-send-email-fweisbec@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      444a2a3b
    • genirq: try_one_irq() must be called with irq disabled · e7e7e0c0
      Yong Zhang authored
      Prarit reported:
      =================================
      [ INFO: inconsistent lock state ]
      2.6.32-rc5 #1
      ---------------------------------
      inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
      swapper/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
       (&irq_desc_lock_class){?.-...}, at: [<ffffffff810c264e>] try_one_irq+0x32/0x138
      {IN-HARDIRQ-W} state was registered at:
       [<ffffffff81095160>] __lock_acquire+0x2fc/0xd5d
       [<ffffffff81095cb4>] lock_acquire+0xf3/0x12d
       [<ffffffff814cdadd>] _spin_lock+0x40/0x89
       [<ffffffff810c3389>] handle_level_irq+0x30/0x105
       [<ffffffff81014e0e>] handle_irq+0x95/0xb7
       [<ffffffff810141bd>] do_IRQ+0x6a/0xe0
       [<ffffffff81012813>] ret_from_intr+0x0/0x16
      irq event stamp: 195096
      hardirqs last  enabled at (195096): [<ffffffff814cd7f7>] _spin_unlock_irq+0x3a/0x5c
      hardirqs last disabled at (195095): [<ffffffff814cdbdd>] _spin_lock_irq+0x29/0x95
      softirqs last  enabled at (195088): [<ffffffff81068c92>] __do_softirq+0x1c1/0x1ef
      softirqs last disabled at (195093): [<ffffffff8101304c>] call_softirq+0x1c/0x30
      
      other info that might help us debug this:
      1 lock held by swapper/0:
       #0:  (kernel/irq/spurious.c:21){+.-...}, at: [<ffffffff81070cf2>]
      run_timer_softirq+0x1a9/0x315
      
      stack backtrace:
      Pid: 0, comm: swapper Not tainted 2.6.32-rc5 #1
      Call Trace:
       <IRQ>  [<ffffffff81093e94>] valid_state+0x187/0x1ae
       [<ffffffff81093fe4>] mark_lock+0x129/0x253
       [<ffffffff810951d4>] __lock_acquire+0x370/0xd5d
       [<ffffffff81095cb4>] lock_acquire+0xf3/0x12d
       [<ffffffff814cdadd>] _spin_lock+0x40/0x89
       [<ffffffff810c264e>] try_one_irq+0x32/0x138
       [<ffffffff810c2795>] poll_all_shared_irqs+0x41/0x6d
       [<ffffffff810c27dd>] poll_spurious_irqs+0x1c/0x49
       [<ffffffff81070d82>] run_timer_softirq+0x239/0x315
       [<ffffffff81068bd3>] __do_softirq+0x102/0x1ef
       [<ffffffff8101304c>] call_softirq+0x1c/0x30
       [<ffffffff81014b65>] do_softirq+0x59/0xca
       [<ffffffff810686ad>] irq_exit+0x58/0xae
       [<ffffffff81029b84>] smp_apic_timer_interrupt+0x94/0xba
       [<ffffffff81012a33>] apic_timer_interrupt+0x13/0x20
      
      The reason is that try_one_irq() is called from hardirq context with
      interrupts disabled and from softirq context (poll_all_shared_irqs())
      with interrupts enabled.
      
      Disable interrupts before calling it from poll_all_shared_irqs().
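
      A simplified, illustration-only sketch of the fix (not the exact
      patch; the real function also skips irq 0 and checks the descriptor
      status): call try_one_irq() with interrupts off in the timer/softirq
      path, matching the hardirq path.

          static void poll_all_shared_irqs(void)
          {
                  struct irq_desc *desc;
                  int i;

                  for_each_irq_desc(i, desc) {
                          local_irq_disable();
                          try_one_irq(i, desc);   /* now always runs with IRQs off */
                          local_irq_enable();
                  }
          }
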
      Reported-and-tested-by: Prarit Bhargava <prarit@redhat.com>
      Signed-off-by: Yong Zhang <yong.zhang0@gmail.com>
      LKML-Reference: <1257563773-4620-1-git-send-email-yong.zhang0@gmail.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      e7e7e0c0
  5. 04 Nov 2009 (5 commits)
    • tracing/kprobes: Rename Kprobe-tracer to kprobe-event · 77b44d1b
      Masami Hiramatsu authored
      Rename Kprobes-based event tracer to kprobes-based tracing event
      (kprobe-event), since it is not a tracer but an extensible
      tracing event interface.
      
      This also changes CONFIG_KPROBE_TRACER to CONFIG_KPROBE_EVENT
      and sets it to y by default.
      Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
      Acked-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Jim Keniston <jkenisto@us.ibm.com>
      Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Frank Ch. Eigler <fche@redhat.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: K.Prasad <prasad@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      LKML-Reference: <20091104001247.3454.14131.stgit@harusame>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      77b44d1b
    • ftrace: Fix unmatched locking in ftrace_regex_write() · ed146b25
      Li Zefan authored
      When a command is passed to set_ftrace_filter, the
      ftrace_regex_lock is still held when going back to user space.
      
       # echo 'do_open : foo' > set_ftrace_filter
       (still holding ftrace_regex_lock when returning to user space!)
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      LKML-Reference: <4AEF7F8A.3080300@cn.fujitsu.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      ed146b25
    • ring-buffer: Synchronize resizing buffer with reader lock · f7112949
      Lai Jiangshan authored
      We got a sudden panic when we reduced the size of the
      ringbuffer.
      
      We can reproduce the panic by the following steps:
      
      echo 1 > events/sched/enable
      cat trace_pipe > /dev/null &
      
      while ((1))
      do
      echo 12000 > buffer_size_kb
      echo 512 > buffer_size_kb
      done
      
      (not more than 5 seconds, panic ...)
      Reported-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      LKML-Reference: <4AF01735.9060409@cn.fujitsu.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      f7112949
    • perf/core: Add a callback to perf events · 97eaf530
      Frederic Weisbecker authored
      A simple callback in a perf event can be used for multiple purposes.
      For example it is useful for trigger-based events like hardware
      breakpoints that need a callback to dispatch a triggered breakpoint
      event.
      
      v2: Simplify a bit the callback attribution as suggested by Paul
          Mackerras
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: "K.Prasad" <prasad@linux.vnet.ibm.com>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Paul Mundt <lethal@linux-sh.org>
      97eaf530
    • perf/core: Provide a kernel-internal interface to get to performance counters · fb0459d7
      Arjan van de Ven authored
      There are reasons for kernel code to ask for, and use, performance
      counters.
      For example, in CPU freq governors this tends to be a good idea, but
      there are other examples possible as well of course.
      
      This patch adds the needed bits to enable this functionality; they
      have been tested in an experimental cpufreq driver that I'm working on,
      and the changes are all that I needed to access counters properly.
      
      [fweisbec@gmail.com: added pid to perf_event_create_kernel_counter so
      that we can profile a particular task too
      
      TODO: Have better error reporting, don't just return NULL in the fail
      case.]
      
      v2: Remove the wrong comment about the fact
          perf_event_create_kernel_counter must be called from a kernel
          thread.
      Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: "K.Prasad" <prasad@linux.vnet.ibm.com>
      Cc: Alan Stern <stern@rowland.harvard.edu>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Jan Kiszka <jan.kiszka@siemens.com>
      Cc: Jiri Slaby <jirislaby@gmail.com>
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Avi Kivity <avi@redhat.com>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Masami Hiramatsu <mhiramat@redhat.com>
      Cc: Paul Mundt <lethal@linux-sh.org>
      Cc: Jan Kiszka <jan.kiszka@web.de>
      Cc: Avi Kivity <avi@redhat.com>
      LKML-Reference: <20090925122556.2f8bd939@infradead.org>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      fb0459d7
  6. 03 Nov 2009 (5 commits)
    • Correct nr_processes() when CPUs have been unplugged · 1d510750
      Ian Campbell authored
      nr_processes() returns the sum of the per cpu counter process_counts for
      all online CPUs. This counter is incremented for the current CPU on
      fork() and decremented for the current CPU on exit(). Since a process
      does not necessarily fork and exit on the same CPU the process_count for
      an individual CPU can be either positive or negative and effectively has
      no meaning in isolation.
      
      Therefore calculating the sum of process_counts over only the online
      CPUs omits the processes which were started or stopped on any CPU which
      has since been unplugged. Only the sum of process_counts across all
      possible CPUs has meaning.
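
      A simplified sketch of the point made above (not the exact patch):
      sum process_counts over every possible cpu, not just the online
      ones, so counts left behind on a now-offline cpu are not lost.

          int nr_processes(void)
          {
                  int cpu;
                  int total = 0;

                  for_each_possible_cpu(cpu)      /* was: for_each_online_cpu() */
                          total += per_cpu(process_counts, cpu);

                  return total;
          }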
      
      The only caller of nr_processes() is proc_root_getattr() which
      calculates the number of links to /proc as
              stat->nlink = proc_root.nlink + nr_processes();
      
      You don't have to be all that unlucky for nr_processes() to return a
      negative value leading to a negative number of links (or rather, an
      apparently enormous number of links). If this happens then you can get
      failures where things like "ls /proc" start to fail because they got an
      -EOVERFLOW from some stat() call.
      
      Example with some debugging inserted to show what goes on:
              # ps haux|wc -l
              nr_processes: CPU0:     90
              nr_processes: CPU1:     1030
              nr_processes: CPU2:     -900
              nr_processes: CPU3:     -136
              nr_processes: TOTAL:    84
              proc_root_getattr. nlink 12 + nr_processes() 84 = 96
              84
              # echo 0 >/sys/devices/system/cpu/cpu1/online
              # ps haux|wc -l
              nr_processes: CPU0:     85
              nr_processes: CPU2:     -901
              nr_processes: CPU3:     -137
              nr_processes: TOTAL:    -953
              proc_root_getattr. nlink 12 + nr_processes() -953 = -941
              75
              # stat /proc/
              nr_processes: CPU0:     84
              nr_processes: CPU2:     -901
              nr_processes: CPU3:     -137
              nr_processes: TOTAL:    -954
              proc_root_getattr. nlink 12 + nr_processes() -954 = -942
                File: `/proc/'
                Size: 0               Blocks: 0          IO Block: 1024   directory
              Device: 3h/3d   Inode: 1           Links: 4294966354
              Access: (0555/dr-xr-xr-x)  Uid: (    0/    root)   Gid: (    0/    root)
              Access: 2009-11-03 09:06:55.000000000 +0000
              Modify: 2009-11-03 09:06:55.000000000 +0000
              Change: 2009-11-03 09:06:55.000000000 +0000
      
      I'm not 100% convinced that the per_cpu regions remain valid for offline
      CPUs, although my testing suggests that they do. If not then I think the
      correct solution would be to aggregate the process_count for a given CPU
      into a global base value in cpu_down().
      
      This bug appears to pre-date the transition to git and it looks like it
      may even have been present in linux-2.6.0-test7-bk3 since it looks like
      the code Rusty patched in http://lwn.net/Articles/64773/ was already
      wrong.
      Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      1d510750
    • PM / Hibernate: Add newline to load_image() fail path · bf9fd67a
      Jiri Slaby authored
      Finish the line with a \n when load_image() fails in the middle of loading.
      Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
      Acked-by: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      bf9fd67a
    • PM / Hibernate: Fix error handling in save_image() · 4ff277f9
      Jiri Slaby authored
      There are too many retval variables in save_image(). Thus the error
      return value from snapshot_read_next() may be ignored and only part of
      the snapshot (successfully) written.
      
      Remove 'error' variable, invert the condition in the do-while loop
      and convert the loop to use only 'ret' variable.
      
      Switch the rest of the function to consider only 'ret'.
      
      Also make sure we end the printed line with a \n if an error occurs.
      Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
      Acked-by: Pavel Machek <pavel@ucw.cz>
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      4ff277f9
    • PM / Hibernate: Fix blkdev refleaks · 76b57e61
      Jiri Slaby authored
      While cruising through the swsusp code I found a few blkdev reference
      leaks of resume_bdev.
      
      swsusp_read: remove blkdev_put altogether. Some fail paths do
                   not do that.
      swsusp_check: make sure we always put a reference on fail paths
      software_resume: all fail paths between swsusp_check and swsusp_read
                       omit swsusp_close. Add it in those cases. And since
                       swsusp_read doesn't drop the reference anymore, do
                       it here unconditionally.
      
      [rjw: Fixed a small coding style issue.]
      Signed-off-by: Jiri Slaby <jirislaby@gmail.com>
      Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
      76b57e61
    • sched: Fix kthread_bind() by moving the body of kthread_bind() to sched.c · b84ff7d6
      Mike Galbraith authored
      Eric Paris reported that commit
      f685ceac causes boot time
      PREEMPT_DEBUG complaints.
      
       [    4.590699] BUG: using smp_processor_id() in preemptible [00000000] code: rmmod/1314
       [    4.593043] caller is task_hot+0x86/0xd0
      
      Since kthread_bind() messes with scheduler internals, move the
      body to sched.c, and lock the runqueue.
      Reported-by: Eric Paris <eparis@redhat.com>
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Tested-by: Eric Paris <eparis@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1256813310.7574.3.camel@marge.simson.net>
      [ v2: fix !SMP build and clean up ]
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      b84ff7d6
  7. 02 Nov 2009 (3 commits)
    • rcu: Fix long-grace-period race between forcing and initialization · 83f5b01f
      Paul E. McKenney authored
      Very long RCU read-side critical sections (50 milliseconds or
      so) can cause a race between force_quiescent_state() and
      rcu_start_gp() as follows on kernel builds with multi-level
      rcu_node hierarchies:
      
      1.	CPU 0 calls force_quiescent_state(), sees that there is a
      	grace period in progress, and acquires ->fsqlock.
      
      2.	CPU 1 detects the end of the grace period, and so
      	cpu_quiet_msk_finish() sets rsp->completed to rsp->gpnum.
      	This operation is carried out under the root rnp->lock,
      	but CPU 0 has not yet acquired that lock.  Note that
      	rsp->signaled is still RCU_SAVE_DYNTICK from the last
      	grace period.
      
      3.	CPU 1 calls rcu_start_gp(), but no one wants a new grace
      	period, so it drops the root rnp->lock and returns.
      
      4.	CPU 0 acquires the root rnp->lock and picks up rsp->completed
      	and rsp->signaled, then drops rnp->lock.  It then enters the
      	RCU_SAVE_DYNTICK leg of the switch statement.
      
      5.	CPU 2 invokes call_rcu(), and now needs a new grace period.
      	It calls rcu_start_gp(), which acquires the root rnp->lock, sets
      	rsp->signaled to RCU_GP_INIT (too bad that CPU 0 is already in
      	the RCU_SAVE_DYNTICK leg of the switch statement!)  and starts
      	initializing the rcu_node hierarchy.  If there are multiple
      	levels to the hierarchy, it will drop the root rnp->lock and
      	initialize the lower levels of the hierarchy.
      
      6.	CPU 0 notes that rsp->completed has not changed, which permits
              both CPU 2 and CPU 0 to try updating it concurrently.  If CPU 0's
      	update prevails, later calls to force_quiescent_state() can
      	count old quiescent states against the new grace period, which
      	can in turn result in premature ending of grace periods.
      
      	Not good.
      
      This patch adds an RCU_GP_IDLE state for rsp->signaled that is
      set initially at boot time and any time a grace period ends.
      This prevents CPU 0 from getting into the workings of
      force_quiescent_state() in step 4.  Additional locking and
      checks prevent the concurrent update of rsp->signaled in step 6.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: laijs@cn.fujitsu.com
      Cc: dipankar@in.ibm.com
      Cc: mathieu.desnoyers@polymtl.ca
      Cc: josh@joshtriplett.org
      Cc: dvhltc@us.ibm.com
      Cc: niv@us.ibm.com
      Cc: peterz@infradead.org
      Cc: rostedt@goodmis.org
      Cc: Valdis.Kletnieks@vt.edu
      Cc: dhowells@redhat.com
      LKML-Reference: <1256742889199-git-send-email->
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      83f5b01f
    • uids: Prevent tear down race · b00bc0b2
      Thomas Gleixner authored
      Ingo triggered the following warning:
      
      WARNING: at lib/debugobjects.c:255 debug_print_object+0x42/0x50()
      Hardware name: System Product Name
      ODEBUG: init active object type: timer_list
      Modules linked in:
      Pid: 2619, comm: dmesg Tainted: G        W  2.6.32-rc5-tip+ #5298
      Call Trace:
       [<81035443>] warn_slowpath_common+0x6a/0x81
       [<8120e483>] ? debug_print_object+0x42/0x50
       [<81035498>] warn_slowpath_fmt+0x29/0x2c
       [<8120e483>] debug_print_object+0x42/0x50
       [<8120ec2a>] __debug_object_init+0x279/0x2d7
       [<8120ecb3>] debug_object_init+0x13/0x18
       [<810409d2>] init_timer_key+0x17/0x6f
       [<81041526>] free_uid+0x50/0x6c
       [<8104ed2d>] put_cred_rcu+0x61/0x72
       [<81067fac>] rcu_do_batch+0x70/0x121
      
      debugobjects warns about an enqueued timer being initialized. If
      CONFIG_USER_SCHED=y the user management code uses delayed work to
      remove the user from the hash table and tear down the sysfs objects.
      
      free_uid is called from RCU and initializes/schedules delayed work if
      the usage count of the user_struct is 0. The init/schedule happens
      outside of the uidhash_lock protected region, which allows a concurrent
      caller of find_user() to reference the about-to-be-destroyed
      user_struct without preventing the work from being scheduled. If the
      next free_uid call happens before the work timer expires, then the
      active timer is initialized and the work scheduled again.
      
      The race was introduced in commit 5cb350ba (sched: group scheduling,
      sysfs tunables) and made more prominent by commit 3959214f (sched:
      delayed cleanup of user_struct)
      
      Move the init/schedule_delayed_work inside of the uidhash_lock
      protected region to prevent the race.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Acked-by: Dhaval Giani <dhaval@linux.vnet.ibm.com>
      Cc: Paul E. McKenney <paulmck@us.ibm.com>
      Cc: Kay Sievers <kay.sievers@vrfy.org>
      Cc: stable@kernel.org
      b00bc0b2
    • sched: Fix boot crash by zalloc()ing most of the cpu masks · 49557e62
      Rusty Russell authored
      I got a boot crash when forcing cpumasks offstack on 32 bit,
      because find_new_ilb() returned 3 on my UP system (nohz.cpu_mask
      wasn't zeroed).
      
      AFAICT the others need to be zeroed too: only
      nohz.ilb_grp_nohz_mask is initialized before use.
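
      A hedged sketch of the approach (not the complete patch; the helper
      name is hypothetical): allocate the nohz masks zeroed with
      zalloc_cpumask_var() so no stale bits can leak into find_new_ilb().

          static void __init alloc_nohz_masks(void)
          {
                  /* zeroed variants of alloc_cpumask_var() */
                  if (!zalloc_cpumask_var(&nohz.cpu_mask, GFP_KERNEL) ||
                      !zalloc_cpumask_var(&nohz.ilb_grp_nohz_mask, GFP_KERNEL))
                          panic("not enough memory for nohz cpumasks");
          }
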
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Cc: Peter Zijlstra <peterz@infradead.org>
      LKML-Reference: <200911022037.21282.rusty@rustcorp.com.au>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      49557e62
  8. 29 Oct 2009 (9 commits)
    • sysctl: fix false positives when PROC_SYSCTL=n · 8c85dd87
      Alexey Dobriyan authored
      Having ->procname but not ->proc_handler is valid when PROC_SYSCTL=n,
      people use such combination to reduce ifdefs with non-standard handlers.
      
      Addresses http://bugzilla.kernel.org/show_bug.cgi?id=14408
      Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
      Reported-by: Peter Teoh <htmldeveloper@gmail.com>
      Cc: "Eric W. Biederman" <ebiederm@xmission.com>
      Cc: "Rafael J. Wysocki" <rjw@sisk.pl>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      8c85dd87
    • cgroup: fix strstrip() misuse · 478988d3
      KOSAKI Motohiro authored
      cgroup_write_X64() and cgroup_write_string() ignore the return value of
      strstrip().  This leads to subtly inconsistent behavior.
      
      example:
      =========================
       # cd /mnt/cgroup/hoge
       # cat memory.swappiness
       60
       # echo "59 " > memory.swappiness
       # cat memory.swappiness
       59
       # echo " 58" > memory.swappiness
       bash: echo: write error: Invalid argument
      
      This patch fixes it.
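
      Illustration-only sketch of the pattern being fixed (do_parse() is a
      hypothetical parser): strstrip() strips trailing whitespace in place
      but returns a pointer past the leading whitespace, so its return
      value must be used.

          static int write_value(char *buffer)
          {
                  buffer = strstrip(buffer);  /* was: strstrip(buffer); retval dropped */
                  return do_parse(buffer);
          }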
      
      Cc: Li Zefan <lizf@cn.fujitsu.com>
      Acked-by: Paul Menage <menage@google.com>
      Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      478988d3
    • connector: fix regression introduced by sid connector · 0d0df599
      Christian Borntraeger authored
      Since commit 02b51df1 (proc connector: add
      event for process becoming session leader) we have the following warning:
      
      Badness at kernel/softirq.c:143
      [...]
      Krnl PSW : 0404c00180000000 00000000001481d4 (local_bh_enable+0xb0/0xe0)
      [...]
      Call Trace:
      ([<000000013fe04100>] 0x13fe04100)
       [<000000000048a946>] sk_filter+0x9a/0xd0
       [<000000000049d938>] netlink_broadcast+0x2c0/0x53c
       [<00000000003ba9ae>] cn_netlink_send+0x272/0x2b0
       [<00000000003baef0>] proc_sid_connector+0xc4/0xd4
       [<0000000000142604>] __set_special_pids+0x58/0x90
       [<0000000000159938>] sys_setsid+0xb4/0xd8
       [<00000000001187fe>] sysc_noemu+0x10/0x16
       [<00000041616cb266>] 0x41616cb266
      
      The warning is
      --->    WARN_ON_ONCE(in_irq() || irqs_disabled());
      
      The network code must not be called with disabled interrupts but
      sys_setsid holds the tasklist_lock with spinlock_irq while calling the
      connector.
      
      After a discussion we agreed that we can move proc_sid_connector from
      __set_special_pids to sys_setsid.
      
      We also agreed that it is sufficient to change the check from
      task_session(curr) != pid into err > 0, since if we don't change the
      session, this means we were already the leader and return -EPERM.
      
      One last thing:
      There is also daemonize(), and some people might want to get a
      notification in that case. Since daemonize() is only needed if user
      space does kernel_thread(), this does not look important (and there seems
      to be no consensus on whether this connector should be called in daemonize). If
      we really want this, we can add proc_sid_connector to daemonize() in an
      additional patch (Scott?)
      Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
      Cc: Scott James Remnant <scott@ubuntu.com>
      Cc: Matt Helsley <matthltc@us.ibm.com>
      Cc: David S. Miller <davem@davemloft.net>
      Acked-by: Oleg Nesterov <oleg@redhat.com>
      Acked-by: Evgeniy Polyakov <zbr@ioremap.net>
      Acked-by: David Rientjes <rientjes@google.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      0d0df599
    • tracing/filters: Fix to make system filter work · 3ed67776
      Li Zefan authored
      commit fce29d15
      ("tracing/filters: Refactor subsystem filter code")
      accidentally broke the system filter.
      Signed-off-by: Li Zefan <lizf@cn.fujitsu.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Tom Zanussi <tzanussi@gmail.com>
      LKML-Reference: <4AE810BD.3070009@cn.fujitsu.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      3ed67776
    • kprobe-tracer: Compare both of event-name and event-group to find probe · dd004c47
      Masami Hiramatsu authored
      Fix find_probe_event() to compare both the event-name and the
      event-group. Without this fix, kprobe-tracer overwrites an existing
      probe with the same event-name even if its group-name is different.
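
      Illustrative sketch of the fix (types and names hypothetical, not the
      actual trace_kprobe.c code): match a probe on both its event name and
      its group name rather than on the event name alone.

          struct probe {
                  const char *event;
                  const char *group;
                  struct probe *next;
          };

          static struct probe *find_probe_event(struct probe *list,
                                                const char *event,
                                                const char *group)
          {
                  struct probe *p;

                  for (p = list; p; p = p->next)
                          if (!strcmp(p->event, event) &&
                              !strcmp(p->group, group))   /* group check added */
                                  return p;
                  return NULL;
          }
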
      Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Jim Keniston <jkenisto@us.ibm.com>
      Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Frank Ch. Eigler <fche@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: K.Prasad <prasad@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      LKML-Reference: <20091027204244.30545.27516.stgit@harusame>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      dd004c47
    • param: fix setting arrays of bool · 3c7d76e3
      Rusty Russell authored
      We create a dummy struct kernel_param on the stack for parsing each
      array element, but we didn't initialize the flags word.  This matters
      for arrays of type "bool", where the flag indicates if it really is
      an array of bools or unsigned int (old-style).
      Reported-by: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Cc: stable@kernel.org
      3c7d76e3
    • param: fix NULL comparison on oom · d553ad86
      Rusty Russell authored
      kp->arg is always true: it's the contents of that pointer we care about.
      Reported-by: Takashi Iwai <tiwai@suse.de>
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Cc: stable@kernel.org
      d553ad86
    • param: fix lots of bugs with writing charp params from sysfs, by leaking mem. · 65afac7d
      Rusty Russell authored
      e180a6b7 "param: fix charp parameters set via sysfs" fixed the case
      where charp parameters written via sysfs were freed, leaving drivers
      accessing random memory.
      
      Unfortunately, storing a flag in the kparam struct was a bad idea: it's
      rodata so setting it causes an oops on some archs.  But that's not all:
      
      1) module_param_array() on charp doesn't work reliably, since we use an
         uninitialized temporary struct kernel_param.
      2) there's a fundamental race if a module uses this parameter and then
         it's changed: they will still access the old, freed, memory.
      
      The simplest fix (ie. for 2.6.32) is to never free the memory.  This
      prevents all these problems, at the cost of a memory leak.  In practice,
      there are only 18 places where a charp is writable via sysfs, and all are
      root-only writable.
      Reported-by: Takashi Iwai <tiwai@suse.de>
      Cc: Sitsofe Wheeler <sitsofe@yahoo.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Christof Schmitt <christof.schmitt@de.ibm.com>
      Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
      Cc: stable@kernel.org
      65afac7d
    • futex: Fix spurious wakeup for requeue_pi really · 11df6ddd
      Thomas Gleixner authored
      The requeue_pi path doesn't use unqueue_me() (and the racy lock_ptr ==
      NULL test) nor does it use the wake_list of futex_wake(), which were
      the reasons for commit 41890f2 (futex: Handle spurious wake up)
      
      See the debugging discussion on LKML, Message-ID: <4AD4080C.20703@us.ibm.com>
      
      The changes in this fix to the wait_requeue_pi path were considered to
      be a likely unnecessary, but harmless, safety net. But it turns out that
      due to the fact that for unknown $@#!*( reasons EWOULDBLOCK is defined
      as EAGAIN, we built an endless loop in the code path which correctly
      returns EWOULDBLOCK.
      
      Spurious wakeups in the wait_requeue_pi code path are unlikely, so we
      take the easy solution and return EWOULDBLOCK^WEAGAIN to user space and
      let it deal with the spurious wakeup.
      
      Cc: Darren Hart <dvhltc@us.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: John Stultz <johnstul@linux.vnet.ibm.com>
      Cc: Dinakar Guniguntala <dino@in.ibm.com>
      LKML-Reference: <4AE23C74.1090502@us.ibm.com>
      Cc: stable@kernel.org
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      11df6ddd
  9. 28 Oct 2009 (2 commits)
    • sched: move rq_weight data array out of .percpu · 4a6cc4bd
      Jiri Kosina authored
      Commit 34d76c41 introduced the percpu array update_shares_data, whose
      size is proportional to NR_CPUS. Unfortunately this blows up ia64 for
      large NR_CPUS configurations, as ia64 allows only 64k for the .percpu
      section.
      
      Fix this by allocating the array dynamically and keeping only a pointer
      to it in percpu space.
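
      An illustration-only sketch of the approach (names hypothetical, not
      the actual sched.c code): keep a per-cpu pointer and allocate the
      NR_CPUS-sized array at init time instead of placing it in .percpu.

          static DEFINE_PER_CPU(unsigned long *, rq_weight_ptr);

          static int __init alloc_rq_weight(void)
          {
                  int cpu;

                  for_each_possible_cpu(cpu) {
                          unsigned long *p = kcalloc(nr_cpu_ids, sizeof(*p),
                                                     GFP_KERNEL);

                          if (!p)
                                  return -ENOMEM;
                          per_cpu(rq_weight_ptr, cpu) = p;
                  }
                  return 0;
          }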
      
      The per-cpu handling doesn't impose a significant performance penalty on
      the potentially contended path in tg_shares_up().
      
      ...
      ffffffff8104337c:       65 48 8b 14 25 20 cd    mov    %gs:0xcd20,%rdx
      ffffffff81043383:       00 00
      ffffffff81043385:       48 c7 c0 00 e1 00 00    mov    $0xe100,%rax
      ffffffff8104338c:       48 c7 45 a0 00 00 00    movq   $0x0,-0x60(%rbp)
      ffffffff81043393:       00
      ffffffff81043394:       48 c7 45 a8 00 00 00    movq   $0x0,-0x58(%rbp)
      ffffffff8104339b:       00
      ffffffff8104339c:       48 01 d0                add    %rdx,%rax
      ffffffff8104339f:       49 8d 94 24 08 01 00    lea    0x108(%r12),%rdx
      ffffffff810433a6:       00
      ffffffff810433a7:       b9 ff ff ff ff          mov    $0xffffffff,%ecx
      ffffffff810433ac:       48 89 45 b0             mov    %rax,-0x50(%rbp)
      ffffffff810433b0:       bb 00 04 00 00          mov    $0x400,%ebx
      ffffffff810433b5:       48 89 55 c0             mov    %rdx,-0x40(%rbp)
      ...
      
      After:
      
      ...
      ffffffff8104337c:       65 8b 04 25 28 cd 00    mov    %gs:0xcd28,%eax
      ffffffff81043383:       00
      ffffffff81043384:       48 98                   cltq
      ffffffff81043386:       49 8d bc 24 08 01 00    lea    0x108(%r12),%rdi
      ffffffff8104338d:       00
      ffffffff8104338e:       48 8b 15 d3 7f 76 00    mov    0x767fd3(%rip),%rdx        # ffffffff817ab368 <update_shares_data>
      ffffffff81043395:       48 8b 34 c5 00 ee 6d    mov    -0x7e921200(,%rax,8),%rsi
      ffffffff8104339c:       81
      ffffffff8104339d:       48 c7 45 a0 00 00 00    movq   $0x0,-0x60(%rbp)
      ffffffff810433a4:       00
      ffffffff810433a5:       b9 ff ff ff ff          mov    $0xffffffff,%ecx
      ffffffff810433aa:       48 89 7d c0             mov    %rdi,-0x40(%rbp)
      ffffffff810433ae:       48 c7 45 a8 00 00 00    movq   $0x0,-0x58(%rbp)
      ffffffff810433b5:       00
      ffffffff810433b6:       bb 00 04 00 00          mov    $0x400,%ebx
      ffffffff810433bb:       48 01 f2                add    %rsi,%rdx
      ffffffff810433be:       48 89 55 b0             mov    %rdx,-0x50(%rbp)
      ...
      Signed-off-by: Jiri Kosina <jkosina@suse.cz>
      Acked-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Tejun Heo <tj@kernel.org>
      4a6cc4bd
    • perf_event: Add alignment-faults and emulation-faults software events · f7d79860
      Anton Blanchard authored
      Add two more software events that are common to many cpus.
      
      Alignment faults: When a load or store is not aligned properly.
      
      Emulation faults: When an instruction is emulated in software.
      
      Both cause a very significant slowdown (100x or worse), so identifying and
      fixing them is very important.
      Signed-off-by: Anton Blanchard <anton@samba.org>
      Signed-off-by: Paul Mackerras <paulus@samba.org>
      f7d79860
  10. 24 Oct 2009 (5 commits)
    • tracing: Remove cpu arg from the rb_time_stamp() function · 6d3f1e12
      Jiri Olsa authored
      The cpu argument is not used inside the rb_time_stamp() function.
      Plus fix a typo.
      Signed-off-by: Jiri Olsa <jolsa@redhat.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      LKML-Reference: <20091023233647.118547500@goodmis.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      6d3f1e12
    • tracing: Fix comment typo and documentation example · 67b394f7
      Jiri Olsa authored
      Trivial patch to fix a documentation example and to fix a
      comment.
      Signed-off-by: Jiri Olsa <jolsa@redhat.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      LKML-Reference: <20091023233646.871719877@goodmis.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      67b394f7
    • tracing: Fix trace_seq_printf() return value · 3e69533b
      Jiri Olsa authored
      trace_seq_printf() return value is a little ambiguous. It
      currently returns the length of the space available in the
      buffer. printf usually returns the amount written. This is not
      adequate here, because:
      
        trace_seq_printf(s, "");
      
      is perfectly legal, and returning 0 would indicate that it
      failed.
      
      We can always see the amount written by looking at the before
      and after values of s->len. This is not quite the same use as
      printf. We only care if the string was successfully written to
      the buffer or not.
      
      Make trace_seq_printf() return 0 if the output overflows the
      buffer's free space, 1 otherwise.
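
      A small usage sketch of the new convention (print_entry() is a
      hypothetical caller): trace_seq_printf() returns 0 when the string
      did not fit in the remaining buffer space, 1 on success.

          static int print_entry(struct trace_seq *s, unsigned long ip)
          {
                  if (!trace_seq_printf(s, "ip: %lx\n", ip))
                          return 0;       /* buffer full, report failure */
                  return 1;
          }
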
      Signed-off-by: Jiri Olsa <jolsa@redhat.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      LKML-Reference: <20091023233646.631787612@goodmis.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      3e69533b
    • tracing: Update *ppos instead of filp->f_pos · cf8517cf
      Jiri Olsa authored
      Instead of directly updating filp->f_pos we should update the *ppos
      argument. The filp->f_pos gets updated within the file_pos_write()
      function called from sys_write().
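
      A sketch of the pattern the patch moves to (example_write() is a
      hypothetical ->write() handler, not code from the patch): update
      *ppos and let file_pos_write() update filp->f_pos.

          static ssize_t example_write(struct file *filp, const char __user *ubuf,
                                       size_t cnt, loff_t *ppos)
          {
                  /* ... consume cnt bytes from ubuf ... */
                  *ppos += cnt;           /* was: filp->f_pos += cnt; */
                  return cnt;
          }
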
      Signed-off-by: Jiri Olsa <jolsa@redhat.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      LKML-Reference: <20091023233646.399670810@goodmis.org>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      cf8517cf
    • sched: Strengthen buddies and mitigate buddy induced latencies · f685ceac
      Mike Galbraith authored
      This patch restores the effectiveness of LAST_BUDDY in preventing
      pgsql+oltp from collapsing due to wakeup preemption. It also
      switches LAST_BUDDY to exclusively do what it does best, namely
      mitigate the effects of aggressive wakeup preemption, which
      improves vmark throughput markedly, and restores mysql+oltp
      scalability.
      
      Since buddies are about scalability, enable them beginning at the
      point where we begin expanding sched_latency, namely
      sched_nr_latency. Previously, buddies were cleared aggressively,
      which seriously reduced their effectiveness. Not clearing
      aggressively however, produces a small drop in mysql+oltp
      throughput immediately after peak, indicating that LAST_BUDDY is
      actually doing some harm. This is right at the point where X on the
      desktop in competition with another load wants low latency service.
      Ergo, do not enable until we need to scale.
      
      To mitigate latency induced by buddies, or by a task just missing
      wakeup preemption, check latency at tick time.
      
      Last hunk prevents buddies from stymieing BALANCE_NEWIDLE via
      CACHE_HOT_BUDDY.
      
      Supporting performance tests:
      
       tip   = v2.6.32-rc5-1497-ga525b32
       tipx  = NO_GENTLE_FAIR_SLEEPERS NEXT_BUDDY granularity knobs = 31 knobs + 31 buddies
       tip+x = NO_GENTLE_FAIR_SLEEPERS granularity knobs = 31 knobs
      
      (Three run averages except where noted.)
      
       vmark:
       ------
       tip           108466 messages per second
       tip+          125307 messages per second
       tip+x         125335 messages per second
       tipx          117781 messages per second
       2.6.31.3      122729 messages per second
      
       mysql+oltp:
       -----------
       clients          1        2        4        8       16       32       64        128    256
       ..........................................................................................
       tip        9949.89 18690.20 34801.24 34460.04 32682.88 30765.97 28305.27 25059.64 19548.08
       tip+      10013.90 18526.84 34900.38 34420.14 33069.83 32083.40 30578.30 28010.71 25605.47
       tipx       9698.71 18002.70 34477.56 33420.01 32634.30 31657.27 29932.67 26827.52 21487.18
       2.6.31.3   8243.11 18784.20 34404.83 33148.38 31900.32 31161.90 29663.81 25995.94 18058.86
      
       pgsql+oltp:
       -----------
       clients          1        2        4        8       16       32       64      128      256
       ..........................................................................................
       tip       13686.37 26609.25 51934.28 51347.81 49479.51 45312.65 36691.91 26851.57 24145.35
       tip+ (1x) 13907.85 27135.87 52951.98 52514.04 51742.52 50705.43 49947.97 48374.19 46227.94
       tip+x     13906.78 27065.81 52951.19 52542.59 52176.11 51815.94 50838.90 49439.46 46891.00
       tipx      13742.46 26769.81 52351.99 51891.73 51320.79 50938.98 50248.65 48908.70 46553.84
       2.6.31.3  13815.35 26906.46 52683.34 52061.31 51937.10 51376.80 50474.28 49394.47 47003.25
      Signed-off-by: Mike Galbraith <efault@gmx.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <new-submission>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      f685ceac