1. 10 Dec, 2009 (1 commit)
    • hrtimer: Tune hrtimer_interrupt hang logic · 41d2e494
      Committed by Thomas Gleixner
      The hrtimer_interrupt hang logic adjusts min_delta_ns based on the
      execution time of the hrtimer callbacks.
      
      This is error-prone for virtual machines, where a guest vcpu can be
      scheduled out during the execution of the callbacks (and the callbacks
      themselves can do operations that translate to blocking operations in
      the hypervisor), which can in turn lead to a large min_delta_ns, rendering the
      system unusable.
      
      Replace the current heuristics with something more reliable. Allow the
      interrupt code to try 3 times to catch up with the lost time. If that
      fails, use the total time spent in the interrupt handler to defer the
      next timer interrupt so the system can catch up with other things
      which got delayed. Limit that deferment to 100ms.
      
      The retry events and the maximum time spent in the interrupt handler
      are recorded and exposed via /proc/timer_list.
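      A rough, self-contained sketch of that policy (hedged: the names and
      the program_next() callback are illustrative, not the actual kernel
      code):

        #include <stdint.h>
        #include <stdbool.h>

        #define NSEC_PER_MSEC 1000000LL

        /* Try 3 times to program the next event; on failure, defer by the
         * time already spent in the handler, capped at 100ms.  Returns the
         * forced expiry time, or 0 if the next event was armed normally. */
        static int64_t hang_protect_sketch(int64_t entry_time_ns,
                                           int64_t now_ns,
                                           bool (*program_next)(void))
        {
            int64_t delta;
            int retries;

            for (retries = 0; retries < 3; retries++) {
                if (program_next())       /* next event lies in the future */
                    return 0;
                /* event already expired: expire timers again and retry */
            }

            /* Hang detected: give the system time to catch up. */
            delta = now_ns - entry_time_ns;     /* time spent in handler */
            if (delta > 100 * NSEC_PER_MSEC)
                delta = 100 * NSEC_PER_MSEC;    /* limit deferment to 100ms */
            return now_ns + delta;
        }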
      
      Inspired by a patch from Marcelo.
      Reported-by: Michael Tokarev <mjt@tls.msk.ru>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Tested-by: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: kvm@vger.kernel.org
      41d2e494
  2. 06 Dec, 2009 (4 commits)
  3. 04 Dec, 2009 (5 commits)
  4. 03 Dec, 2009 (8 commits)
    • mutex: Fix missing conditions to build mutex_spin_on_owner() · c08f7829
      Committed by Frederic Weisbecker
      We don't need to build mutex_spin_on_owner() if we have
      CONFIG_DEBUG_MUTEXES or CONFIG_HAVE_DEFAULT_NO_SPIN_MUTEXES, as
      it won't be used under such configs.
      
      Use CONFIG_MUTEX_SPIN_ON_OWNER as it gathers all the necessary
      checks before building it.
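      A hedged sketch of the resulting structure (the guard, not the exact
      mutex code):

        /* Built only when adaptive spinning is actually usable;
         * CONFIG_MUTEX_SPIN_ON_OWNER gathers SMP, !DEBUG_MUTEXES and
         * !HAVE_DEFAULT_NO_SPIN_MUTEXES in one place. */
        #ifdef CONFIG_MUTEX_SPIN_ON_OWNER
        int mutex_spin_on_owner(struct mutex *lock, struct thread_info *owner)
        {
                /* ... spin while the lock owner is still running ... */
                return 0;       /* placeholder body for this sketch */
        }
        #endif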
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      LKML-Reference: <1259783357-8542-2-git-send-regression-fweisbec@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <peterz@infradead.org>
      c08f7829
    • mutex: Better control mutex adaptive spinning config · c0226027
      Committed by Frederic Weisbecker
      Introduce CONFIG_MUTEX_SPIN_ON_OWNER so that we can centralize
      in a single place the conditions that determine its definition
      and use.
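      The centralized condition plausibly reduces to the following test,
      shown here as its C preprocessor equivalent (the exact Kconfig wording
      is an assumption):

        /* Roughly: def_bool SMP && !DEBUG_MUTEXES &&
         *          !HAVE_DEFAULT_NO_SPIN_MUTEXES */
        #if defined(CONFIG_SMP) && !defined(CONFIG_DEBUG_MUTEXES) && \
            !defined(CONFIG_HAVE_DEFAULT_NO_SPIN_MUTEXES)
        # define MUTEX_CAN_SPIN_ON_OWNER 1
        #endif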
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      LKML-Reference: <1259783357-8542-1-git-send-regression-fweisbec@gmail.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      Cc: Peter Zijlstra <peterz@infradead.org>
      c0226027
    • rcu: Add expedited grace-period support for preemptible RCU · d9a3da06
      Committed by Paul E. McKenney
      Implement a synchronize_rcu_expedited() for preemptible RCU
      that actually is expedited.  This uses
      synchronize_sched_expedited() to force all threads currently
      running in a preemptible-RCU read-side critical section onto the
      appropriate ->blocked_tasks[] list, then takes a snapshot of all
      of these lists and waits for them to drain.
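      A minimal sketch of the snapshot-and-drain idea, with simplified
      stand-in types (the real code walks the rcu_node hierarchy and its
      ->blocked_tasks[] lists):

        #include <stdbool.h>

        struct rnp_sketch {
            int blocked;   /* readers currently blocked in a critical section */
            int snapshot;  /* readers we must wait for, taken at GP start */
        };

        /* Phase 1: after forcing running readers onto the lists via
         * synchronize_sched_expedited(), snapshot every list. */
        static void expedited_start(struct rnp_sketch *nodes, int n)
        {
            for (int i = 0; i < n; i++)
                nodes[i].snapshot = nodes[i].blocked;
        }

        /* A snapshotted reader leaving its critical section drains the list. */
        static void reader_exit(struct rnp_sketch *node)
        {
            if (node->snapshot > 0)
                node->snapshot--;
        }

        /* Phase 2: the expedited grace period ends once all lists drain. */
        static bool expedited_done(const struct rnp_sketch *nodes, int n)
        {
            for (int i = 0; i < n; i++)
                if (nodes[i].snapshot > 0)
                    return false;
            return true;
        }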
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: laijs@cn.fujitsu.com
      Cc: dipankar@in.ibm.com
      Cc: mathieu.desnoyers@polymtl.ca
      Cc: josh@joshtriplett.org
      Cc: dvhltc@us.ibm.com
      Cc: niv@us.ibm.com
      Cc: peterz@infradead.org
      Cc: rostedt@goodmis.org
      Cc: Valdis.Kletnieks@vt.edu
      Cc: dhowells@redhat.com
      LKML-Reference: <1259784616158-git-send-email->
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      d9a3da06
    • rcu: Enable fourth level of TREE_RCU hierarchy · cf244dc0
      Committed by Paul E. McKenney
      Enable a fourth level of rcu_node hierarchy for TREE_RCU and
      TREE_PREEMPT_RCU.  This is for stress-testing and experimental
      purposes only, although in theory this would enable 16,777,216
      CPUs on 64-bit systems, though only 1,048,576 CPUs on 32-bit
      systems.  Experimental use of this fourth level will normally
      set CONFIG_RCU_FANOUT=2, requiring a 16-CPU system, though the
      more adventurous (and more fortunate) experimenters may wish to
      choose CONFIG_RCU_FANOUT=3 for 81-CPU systems or even
      CONFIG_RCU_FANOUT=4 for 256-CPU systems.
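      The CPU counts are simply fanout^levels with four levels; a quick
      standalone check (not kernel code):

        #include <stdio.h>

        int main(void)
        {
            /* 32 and 64 are the default fanouts on 32/64-bit systems */
            long fanouts[] = { 2, 3, 4, 32, 64 };

            for (int i = 0; i < 5; i++) {
                long cpus = 1;
                for (int level = 0; level < 4; level++)
                    cpus *= fanouts[i];
                printf("CONFIG_RCU_FANOUT=%2ld -> %8ld CPUs\n",
                       fanouts[i], cpus);
            }
            return 0;  /* prints 16, 81, 256, 1048576, 16777216 */
        }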
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Acked-by: Josh Triplett <josh@joshtriplett.org>
      Acked-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: dipankar@in.ibm.com
      Cc: mathieu.desnoyers@polymtl.ca
      Cc: dvhltc@us.ibm.com
      Cc: niv@us.ibm.com
      Cc: peterz@infradead.org
      Cc: rostedt@goodmis.org
      Cc: Valdis.Kletnieks@vt.edu
      Cc: dhowells@redhat.com
      LKML-Reference: <12597846161257-git-send-email->
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      cf244dc0
    • rcu: Rename "quiet" functions · d3f6bad3
      Committed by Paul E. McKenney
      The number of "quiet" functions has grown recently, and the
      names are no longer very descriptive.  The point of all of these
      functions is to do some portion of the task of reporting a
      quiescent state, so rename them accordingly:
      
      o	cpu_quiet() becomes rcu_report_qs_rdp(), which reports a
      	quiescent state to the per-CPU rcu_data structure.  If this
      	turns out to be a new quiescent state for this grace period,
      	then rcu_report_qs_rnp() will be invoked to propagate the
      	quiescent state up the rcu_node hierarchy.
      
      o	cpu_quiet_msk() becomes rcu_report_qs_rnp(), which reports
      	a quiescent state for a given CPU (or possibly a set of CPUs)
      	up the rcu_node hierarchy.
      
      o	cpu_quiet_msk_finish() becomes rcu_report_qs_rsp(), which
      	reports a full set of quiescent states to the global rcu_state
      	structure.
      
      o	task_quiet() becomes rcu_report_unblock_qs_rnp(), which reports
      	a quiescent state due to a task exiting an RCU read-side critical
      	section that had previously blocked in that same critical section.
      	As indicated by the new name, this type of quiescent state is
      	reported up the rcu_node hierarchy (using rcu_report_qs_rnp()
      	to do so).
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Acked-by: Josh Triplett <josh@joshtriplett.org>
      Acked-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Cc: dipankar@in.ibm.com
      Cc: mathieu.desnoyers@polymtl.ca
      Cc: dvhltc@us.ibm.com
      Cc: niv@us.ibm.com
      Cc: peterz@infradead.org
      Cc: rostedt@goodmis.org
      Cc: Valdis.Kletnieks@vt.edu
      Cc: dhowells@redhat.com
      LKML-Reference: <12597846163698-git-send-email->
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      d3f6bad3
    • modules: don't export section names of empty sections via sysfs · 35dead42
      Committed by Helge Deller
      On the parisc architecture, each and every loaded kernel module
      triggers this kernel "badness warning":
        sysfs: cannot create duplicate filename '/module/ac97_bus/sections/.text'
        Badness at fs/sysfs/dir.c:487
      
      The reason is that on parisc all kernel modules have multiple
      .text sections, due to the -ffunction-sections compiler flag
      which is needed to reach all jump targets on this platform.
      
      An objdump on such a kernel module gives:
      Sections:
      Idx Name          Size      VMA       LMA       File off  Algn
        0 .note.gnu.build-id 00000024  00000000  00000000  00000034  2**2
                        CONTENTS, ALLOC, LOAD, READONLY, DATA
        1 .text         00000000  00000000  00000000  00000058  2**0
                        CONTENTS, ALLOC, LOAD, READONLY, CODE
        2 .text.ac97_bus_match 0000001c  00000000  00000000  00000058  2**2
                        CONTENTS, ALLOC, LOAD, READONLY, CODE
        3 .text         00000000  00000000  00000000  000000d4  2**0
                        CONTENTS, ALLOC, LOAD, READONLY, CODE
      ...
      Since the .text sections are empty (size of 0 bytes) and won't be
      loaded by the kernel module loader anyway, I don't see a reason
      why such sections need to be listed under
      /sys/module/<module_name>/sections/<section_name> either.
      
      The attached patch solves this issue by not exporting the names
      of empty sections.
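      The core of such a fix plausibly reduces to an emptiness test applied
      before each sysfs attribute is created (a hedged sketch; the helper
      name is illustrative):

        /* Skip sections the module loader won't map anyway. */
        static inline bool sect_empty(const Elf_Shdr *sect)
        {
                return !(sect->sh_flags & SHF_ALLOC) || sect->sh_size == 0;
        }

        /* ...when building /sys/module/<name>/sections/: */
        if (sect_empty(&sechdrs[i]))
                continue;  /* duplicate empty ".text" names never reach sysfs */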
      
      This fixes bugzilla http://bugzilla.kernel.org/show_bug.cgi?id=14703
      Signed-off-by: Helge Deller <deller@gmx.de>
      CC: rusty@rustcorp.com.au
      CC: akpm@linux-foundation.org
      CC: James.Bottomley@HansenPartnership.com
      CC: roland@redhat.com
      CC: dave@hiauly1.hia.nrc.ca
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      35dead42
    • sched, cputime: Introduce thread_group_times() · 0cf55e1e
      Committed by Hidetoshi Seto
      This is a real fix for the problem of utime/stime values decreasing,
      described in the thread:
      
         http://lkml.org/lkml/2009/11/3/522
      
      Now cputime is accounted in the following way:
      
       - {u,s}time in task_struct are increased every time the thread
         is interrupted by a tick (timer interrupt).
      
       - When a thread exits, its {u,s}time are added to signal->{u,s}time,
         after being adjusted by task_times().
      
       - When all threads in a thread_group exit, the accumulated {u,s}time
         (and also c{u,s}time) in the signal struct are added to c{u,s}time
         in the signal struct of the group's parent.
      
      So {u,s}time in the task struct are "raw" tick counts, while
      {u,s}time and c{u,s}time in the signal struct are "adjusted" values.
      
      These accounted values are used by:
      
       - task_times(), to get the cputime of a thread:
         This function returns adjusted values that originate from the raw
         {u,s}time, scaled by the sum_exec_runtime accounted by CFS.
      
       - thread_group_cputime(), to get the cputime of a thread group:
         This function returns the sum of all {u,s}time of living threads in
         the group, plus the {u,s}time in the signal struct, which is the
         sum of the adjusted cputimes of all exited threads that belonged
         to the group.
      
      The problem is the return value of thread_group_cputime(),
      because it is a mixed sum of "raw" and "adjusted" values:
      
        group's {u,s}time = foreach(thread){{u,s}time} + exited({u,s}time)
      
      This misbehavior can break {u,s}time monotonicity.  If there is a
      thread whose raw values are greater than its adjusted values (e.g.
      interrupted by 1000Hz ticks 50 times but running for only 45ms) and
      it exits, the group's cputime will decrease (e.g. by 5ms).
      
      To fix this, we could do:
      
        group's {u,s}time = foreach(t){task_times(t)} + exited({u,s}time)
      
      But task_times() contains hard divisions, so applying it for
      every thread should be avoided.
      
      This patch fixes the above problem in the following way:
      
       - Modify thread exit (= __exit_signal()) not to use task_times().
         This means {u,s}time in the signal struct accumulate raw values
         instead of adjusted values.  As a result, thread_group_cputime()
         returns a pure sum of "raw" values.
      
       - Introduce a new function thread_group_times(*task, *utime, *stime)
         that converts the "raw" values of thread_group_cputime() to
         "adjusted" values, using the same calculation procedure as
         task_times() (see the sketch below).
      
       - Modify group exit (= wait_task_zombie()) to use the newly
         introduced thread_group_times().  This keeps c{u,s}time in the
         signal struct adjusted, as before this patch.
      
       - Replace some thread_group_cputime() calls by thread_group_times().
         These replacements are applied only where the "adjusted" cputime
         is conveyed to users and where task_times() is already used nearby
         (i.e. sys_times(), getrusage(), and /proc/<PID>/stat).
      
      This patch has a positive side effect:
      
       - Before this patch, if a group contained many short-lived threads
         (e.g. each running 0.9ms and never interrupted by a tick), the
         group's cputime could be invisible, since each thread's cputime
         was accumulated after being adjusted: imagining the adjustment
         function as adj(ticks, runtime),
           {adj(0, 0.9) + adj(0, 0.9) + ....} = {0 + 0 + ....} = 0.
         After this patch that cannot happen, because the adjustment is
         applied after accumulation.
      
      v2:
       - remove if()s, put new variables into signal_struct.
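      A hedged, standalone sketch of the raw-to-adjusted conversion that
      thread_group_times() performs (same idea as task_times(): split the
      CFS runtime in the raw utime:stime proportion; names are illustrative):

        #include <stdint.h>

        typedef uint64_t u64;

        /* raw_ut/raw_st: summed raw tick counts for the whole group;
         * rtime: the group's sum_exec_runtime as accounted by CFS. */
        static void thread_group_times_sketch(u64 raw_ut, u64 raw_st,
                                              u64 rtime, u64 *ut, u64 *st)
        {
            u64 total = raw_ut + raw_st;

            /* One hard division for the whole group, not one per thread. */
            *ut = total ? rtime * raw_ut / total : rtime;
            *st = rtime - *ut;
        }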
      Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Spencer Candland <spencer@bluehost.com>
      Cc: Americo Wang <xiyou.wangcong@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Stanislaw Gruszka <sgruszka@redhat.com>
      LKML-Reference: <4B162517.8040909@jp.fujitsu.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      0cf55e1e
    • sched, cputime: Cleanups related to task_times() · d99ca3b9
      Committed by Hidetoshi Seto
      - Remove the if({u,s}t) checks, since no one calls it with NULL now.
      - Use cputime_{add,sub}().
      - Add an ifndef-endif guard for prev_{u,s}time, since they are used
        only when !VIRT_CPU_ACCOUNTING (see the sketch below).
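      The guard from the third item plausibly looks like this (a sketch):

        /* prev_{u,s}time are only needed when the architecture does not
         * account cputime precisely itself. */
        #ifndef CONFIG_VIRT_CPU_ACCOUNTING
                cputime_t prev_utime, prev_stime;
        #endif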
      Signed-off-by: Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Spencer Candland <spencer@bluehost.com>
      Cc: Americo Wang <xiyou.wangcong@gmail.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Cc: Stanislaw Gruszka <sgruszka@redhat.com>
      LKML-Reference: <4B1624C7.7040302@jp.fujitsu.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      d99ca3b9
  5. 02 Dec, 2009 (12 commits)
  6. 01 Dec, 2009 (3 commits)
    • SLOW_WORK: Fix the CONFIG_MODULES=n case · fa1dae49
      Committed by David Howells
      Commit 3d7a641e ("SLOW_WORK: Wait for outstanding work items belonging to a
      module to clear") introduced some code to make sure that all of a module's
      slow-work items were complete before that module was removed, and commit
      3bde31a4 ("SLOW_WORK: Allow a requeueable work item to sleep till the thread is
      needed") further extended that, breaking it in the process if CONFIG_MODULES=n:
      
          CC      kernel/slow-work.o
        kernel/slow-work.c: In function 'slow_work_execute':
        kernel/slow-work.c:313: error: 'slow_work_thread_processing' undeclared (first use in this function)
        kernel/slow-work.c:313: error: (Each undeclared identifier is reported only once
        kernel/slow-work.c:313: error: for each function it appears in.)
        kernel/slow-work.c: In function 'slow_work_wait_for_items':
        kernel/slow-work.c:950: error: 'slow_work_unreg_sync_lock' undeclared (first use in this function)
        kernel/slow-work.c:951: error: 'slow_work_unreg_wq' undeclared (first use in this function)
        kernel/slow-work.c:961: error: 'slow_work_unreg_work_item' undeclared (first use in this function)
        kernel/slow-work.c:974: error: 'slow_work_unreg_module' undeclared (first use in this function)
        kernel/slow-work.c:977: error: 'slow_work_thread_processing' undeclared (first use in this function)
        make[1]: *** [kernel/slow-work.o] Error 1
      
      Fix this by:
      
       (1) Extracting the bits of slow_work_execute() that are contingent on
           CONFIG_MODULES, and the bits that should be, into inline functions,
           placing them in the #ifdef'd section that defines the relevant
           variables, and adding stubs for moduleless kernels (sketched
           below).  This allows the removal of some #ifdefs.
      
       (2) #ifdef'ing out the contents of slow_work_wait_for_items() in moduleless
           kernels.
      
      The four functions related to handling module unloading synchronisation (and
      their associated variables) could be offloaded into a separate .c file, but
      each function is only used once and three of them are tiny, so doing so would
      prevent them from being inlined.
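      Pattern (1) plausibly looks like this (hedged sketch; the inline names
      are illustrative stand-ins for the extracted functions):

        #ifdef CONFIG_MODULES
        static struct module *slow_work_thread_processing[SLOW_WORK_THREAD_LIMIT];

        static inline void slow_work_begin_exec(int id, struct slow_work *work)
        {
                slow_work_thread_processing[id] = work->owner;
        }

        static inline void slow_work_end_exec(int id, struct slow_work *work)
        {
                /* clear the slot; wake any module unloader waiting on us */
                slow_work_thread_processing[id] = NULL;
        }
        #else
        /* CONFIG_MODULES=n: nothing to synchronise against. */
        static inline void slow_work_begin_exec(int id, struct slow_work *work) {}
        static inline void slow_work_end_exec(int id, struct slow_work *work) {}
        #endif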
      Reported-by: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: David Howells <dhowells@redhat.com>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fa1dae49
    • perf_event: Initialize data.period in perf_swevent_hrtimer() · 59d069eb
      Committed by Xiao Guangrong
      In the current perf_swevent_hrtimer() code, data.period is not
      initialized; the result is obviously wrong:
      
       # ./perf record -f -e cpu-clock make
       # ./perf report
       # Samples: 1740
       #
       # Overhead   Command                                   ......
       # ........  ........  ..........................................
       #
         1025422183050275328.00%        sh  libc-2.9.90.so ...
         1025422183050275328.00%      perl  libperl.so     ...
         1025422168240043264.00%      perl  [kernel]       ...
         1025422030011210752.00%      perl  [kernel]       ...
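      The fix presumably amounts to giving data.period a sane value before
      the overflow path consumes it (hedged sketch, following the 2009-era
      perf API):

        struct perf_sample_data data;

        data.addr   = 0;
        data.period = event->hw.last_period;  /* was left uninitialized */

        if (regs && perf_event_overflow(event, 0, &data, regs))
                ret = HRTIMER_NORESTART;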
      Signed-off-by: Xiao Guangrong <xiaoguangrong@cn.fujitsu.com>
      Acked-by: Peter Zijlstra <peterz@infradead.org>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: <stable@kernel.org>
      LKML-Reference: <4B14E220.2050107@cn.fujitsu.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      59d069eb
    • trace_kprobes: Fix a memory leak bug and check kstrdup() return value · ba8665d7
      Committed by Masami Hiramatsu
      Fix a memory leak in create_trace_probe(): when an argument is too
      long (> MAX_ARGSTR_LEN), the code jumps straight to the error path,
      and in that case tp->args[i].name is never released.  Also fix a bug
      by checking kstrdup()'s return value.
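      A hedged sketch of the corrected path (names approximate the
      trace_kprobes code of the time):

        tp->args[i].name = kstrdup(argv[i], GFP_KERNEL);
        if (!tp->args[i].name) {
                ret = -ENOMEM;           /* kstrdup() failure is now checked */
                goto error;
        }

        ret = parse_probe_arg(arg, &tp->args[i].fetch, is_return);
        if (ret) {
                kfree(tp->args[i].name); /* previously leaked on this path */
                tp->args[i].name = NULL;
                goto error;
        }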
      Signed-off-by: Masami Hiramatsu <mhiramat@redhat.com>
      Cc: systemtap <systemtap@sources.redhat.com>
      Cc: DLE <dle-develop@lists.sourceforge.net>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Jim Keniston <jkenisto@us.ibm.com>
      Cc: Ananth N Mavinakayanahalli <ananth@in.ibm.com>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Frank Ch. Eigler <fche@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Jason Baron <jbaron@redhat.com>
      Cc: K.Prasad <prasad@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      LKML-Reference: <20091201001919.10235.56455.stgit@harusame>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      ba8665d7
  7. 30 Nov, 2009 (1 commit)
    • core: Fix user return notifier on fork() · 8e7cac79
      Committed by Avi Kivity
      fork() clones all thread_info flags, including
      TIF_USER_RETURN_NOTIFY; if the new task is first scheduled on a cpu
      which doesn't have user return notifiers set, this causes user
      return notifiers to trigger with no way of clearing themselves.
      
      This is easy to trigger with a forky workload on the host in
      parallel with kvm, resulting in a cpu in an endless loop on the
      verge of returning to userspace.
      
      Fix by dropping TIF_USER_RETURN_NOTIFY immediately after fork.
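      The fix plausibly reduces to one call in copy_process() plus a small
      helper (a sketch; treat the helper name as illustrative):

        /* include/linux/user-return-notifier.h */
        static inline void clear_user_return_notifier(struct task_struct *p)
        {
                clear_tsk_thread_flag(p, TIF_USER_RETURN_NOTIFY);
        }

        /* kernel/fork.c, copy_process(): the child must not inherit the
         * flag, since its CPU may have no notifier registered to clear it. */
        clear_user_return_notifier(p);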
      Signed-off-by: Avi Kivity <avi@redhat.com>
      LKML-Reference: <1259505288-16559-1-git-send-email-avi@redhat.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      8e7cac79
  8. 27 Nov, 2009 (6 commits)