1. 25 9月, 2012 1 次提交
    • F
      cputime: Use a proper subsystem naming for vtime related APIs · bf9fae9f
      Frederic Weisbecker 提交于
      Use a naming based on vtime as a prefix for virtual based
      cputime accounting APIs:
      
      - account_system_vtime() -> vtime_account()
      - account_switch_vtime() -> vtime_task_switch()
      
      It makes it easier to allow for further declension such
      as vtime_account_system(), vtime_account_idle(), ... if we
      want to find out the context we account to from generic code.
      
      This also make it better to know on which subsystem these APIs
      refer to.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <peterz@infradead.org>
      bf9fae9f
  2. 26 7月, 2012 1 次提交
  3. 12 12月, 2011 1 次提交
    • P
      rcu: Track idleness independent of idle tasks · 9b2e4f18
      Paul E. McKenney 提交于
      Earlier versions of RCU used the scheduling-clock tick to detect idleness
      by checking for the idle task, but handled idleness differently for
      CONFIG_NO_HZ=y.  But there are now a number of uses of RCU read-side
      critical sections in the idle task, for example, for tracing.  A more
      fine-grained detection of idleness is therefore required.
      
      This commit presses the old dyntick-idle code into full-time service,
      so that rcu_idle_enter(), previously known as rcu_enter_nohz(), is
      always invoked at the beginning of an idle loop iteration.  Similarly,
      rcu_idle_exit(), previously known as rcu_exit_nohz(), is always invoked
      at the end of an idle-loop iteration.  This allows the idle task to
      use RCU everywhere except between consecutive rcu_idle_enter() and
      rcu_idle_exit() calls, in turn allowing architecture maintainers to
      specify exactly where in the idle loop that RCU may be used.
      
      Because some of the userspace upcall uses can result in what looks
      to RCU like half of an interrupt, it is not possible to expect that
      the irq_enter() and irq_exit() hooks will give exact counts.  This
      patch therefore expands the ->dynticks_nesting counter to 64 bits
      and uses two separate bitfields to count process/idle transitions
      and interrupt entry/exit transitions.  It is presumed that userspace
      upcalls do not happen in the idle loop or from usermode execution
      (though usermode might do a system call that results in an upcall).
      The counter is hard-reset on each process/idle transition, which
      avoids the interrupt entry/exit error from accumulating.  Overflow
      is avoided by the 64-bitness of the ->dyntick_nesting counter.
      
      This commit also adds warnings if a non-idle task asks RCU to enter
      idle state (and these checks will need some adjustment before applying
      Frederic's OS-jitter patches (http://lkml.org/lkml/2011/10/7/246).
      In addition, validation of ->dynticks and ->dynticks_nesting is added.
      Signed-off-by: NPaul E. McKenney <paul.mckenney@linaro.org>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Reviewed-by: NJosh Triplett <josh@joshtriplett.org>
      9b2e4f18
  4. 10 6月, 2011 1 次提交
  5. 05 3月, 2011 1 次提交
    • A
      BKL: That's all, folks · 4ba8216c
      Arnd Bergmann 提交于
      This removes the implementation of the big kernel lock,
      at last. A lot of people have worked on this in the
      past, I so the credit for this patch should be with
      everyone who participated in the hunt.
      
      The names on the Cc list are the people that were the
      most active in this, according to the recorded git
      history, in alphabetical order.
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Acked-by: NAlan Cox <alan@linux.intel.com>
      Cc: Alessio Igor Bogani <abogani@texware.it>
      Cc: Al Viro <viro@zeniv.linux.org.uk>
      Cc: Andrew Hendry <andrew.hendry@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Christoph Hellwig <hch@infradead.org>
      Cc: Eric W. Biederman <ebiederm@xmission.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Hans Verkuil <hverkuil@xs4all.nl>
      Acked-by: NIngo Molnar <mingo@elte.hu>
      Cc: Jan Blunck <jblunck@infradead.org>
      Cc: John Kacur <jkacur@redhat.com>
      Cc: Jonathan Corbet <corbet@lwn.net>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Oliver Neukum <oliver@neukum.org>
      Cc: Paul Menage <menage@google.com>
      Acked-by: NThomas Gleixner <tglx@linutronix.de>
      Cc: Trond Myklebust <Trond.Myklebust@netapp.com>
      4ba8216c
  6. 19 11月, 2010 1 次提交
    • L
      hardirq.h: needs sched.h if using BKL · ed1d77b1
      Linus Torvalds 提交于
      This really isn't the right thing to do, and strictly speaking we should
      have the BKL depth count in the thread info right next to the preempt
      count.  The two really do go together.
      
      However, since that would involve a patch to all architectures, and the
      BKL is finally going away, it's simply not worth the effort to do the
      RightThing(tm).  Just re-instate the <linux/sched.h> include that we
      used to get accidentally from the smp_lock.h one.
      
      This is all fallout from the same old "BKL: remove extraneous #include
      <smp_lock.h>" commit.
      Reported-by: NIngo Molnar <mingo@elte.hu>
      Tested-by: NRandy Dunlap <randy.dunlap@oracle.com>
      Cc: Arnd Bergmann <arnd@arndb.de>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      ed1d77b1
  7. 18 11月, 2010 3 次提交
  8. 02 11月, 2010 1 次提交
  9. 19 10月, 2010 3 次提交
    • V
      sched: Add IRQ_TIME_ACCOUNTING, finer accounting of irq time · b52bfee4
      Venkatesh Pallipadi 提交于
      s390/powerpc/ia64 have support for CONFIG_VIRT_CPU_ACCOUNTING which does
      the fine granularity accounting of user, system, hardirq, softirq times.
      Adding that option on archs like x86 will be challenging however, given the
      state of TSC reliability on various platforms and also the overhead it will
      add in syscall entry exit.
      
      Instead, add a lighter variant that only does finer accounting of
      hardirq and softirq times, providing precise irq times (instead of timer tick
      based samples). This accounting is added with a new config option
      CONFIG_IRQ_TIME_ACCOUNTING so that there won't be any overhead for users not
      interested in paying the perf penalty.
      
      This accounting is based on sched_clock, with the code being generic.
      So, other archs may find it useful as well.
      
      This patch just adds the core logic and does not enable this logic yet.
      Signed-off-by: NVenkatesh Pallipadi <venki@google.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-5-git-send-email-venki@google.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      b52bfee4
    • V
      sched: Consolidate account_system_vtime extern declaration · e1e10a26
      Venkatesh Pallipadi 提交于
      Just a minor cleanup patch that makes things easier to the following patches.
      No functionality change in this patch.
      Signed-off-by: NVenkatesh Pallipadi <venki@google.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-3-git-send-email-venki@google.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      e1e10a26
    • V
      sched: Fix softirq time accounting · 75e1056f
      Venkatesh Pallipadi 提交于
      Peter Zijlstra found a bug in the way softirq time is accounted in
      VIRT_CPU_ACCOUNTING on this thread:
      
         http://lkml.indiana.edu/hypermail//linux/kernel/1009.2/01366.html
      
      The problem is, softirq processing uses local_bh_disable internally. There
      is no way, later in the flow, to differentiate between whether softirq is
      being processed or is it just that bh has been disabled. So, a hardirq when bh
      is disabled results in time being wrongly accounted as softirq.
      
      Looking at the code a bit more, the problem exists in !VIRT_CPU_ACCOUNTING
      as well. As account_system_time() in normal tick based accouting also uses
      softirq_count, which will be set even when not in softirq with bh disabled.
      
      Peter also suggested solution of using 2*SOFTIRQ_OFFSET as irq count
      for local_bh_{disable,enable} and using just SOFTIRQ_OFFSET while softirq
      processing. The patch below does that and adds API in_serving_softirq() which
      returns whether we are currently processing softirq or not.
      
      Also changes one of the usages of softirq_count in net/sched/cls_cgroup.c
      to in_serving_softirq.
      
      Looks like many usages of in_softirq really want in_serving_softirq. Those
      changes can be made individually on a case by case basis.
      Signed-off-by: NVenkatesh Pallipadi <venki@google.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1286237003-12406-2-git-send-email-venki@google.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      75e1056f
  10. 07 10月, 2010 1 次提交
  11. 20 8月, 2010 1 次提交
    • P
      rcu: Add a TINY_PREEMPT_RCU · a57eb940
      Paul E. McKenney 提交于
      Implement a small-memory-footprint uniprocessor-only implementation of
      preemptible RCU.  This implementation uses but a single blocked-tasks
      list rather than the combinatorial number used per leaf rcu_node by
      TREE_PREEMPT_RCU, which reduces memory consumption and greatly simplifies
      processing.  This version also takes advantage of uniprocessor execution
      to accelerate grace periods in the case where there are no readers.
      
      The general design is otherwise broadly similar to that of TREE_PREEMPT_RCU.
      
      This implementation is a step towards having RCU implementation driven
      off of the SMP and PREEMPT kernel configuration variables, which can
      happen once this implementation has accumulated sufficient experience.
      
      Removed ACCESS_ONCE() from __rcu_read_unlock() and added barrier() as
      suggested by Steve Rostedt in order to avoid the compiler-reordering
      issue noted by Mathieu Desnoyers (http://lkml.org/lkml/2010/8/16/183).
      
      As can be seen below, CONFIG_TINY_PREEMPT_RCU represents almost 5Kbyte
      savings compared to CONFIG_TREE_PREEMPT_RCU.  Of course, for non-real-time
      workloads, CONFIG_TINY_RCU is even better.
      
      	CONFIG_TREE_PREEMPT_RCU
      
      	   text	   data	    bss	    dec	   filename
      	     13	      0	      0	     13	   kernel/rcupdate.o
      	   6170	    825	     28	   7023	   kernel/rcutree.o
      				   ----
      				   7026    Total
      
      	CONFIG_TINY_PREEMPT_RCU
      
      	   text	   data	    bss	    dec	   filename
      	     13	      0	      0	     13	   kernel/rcupdate.o
      	   2081	     81	      8	   2170	   kernel/rcutiny.o
      				   ----
      				   2183    Total
      
      	CONFIG_TINY_RCU (non-preemptible)
      
      	   text	   data	    bss	    dec	   filename
      	     13	      0	      0	     13	   kernel/rcupdate.o
      	    719	     25	      0	    744	   kernel/rcutiny.o
      				    ---
      				    757    Total
      Requested-by: NLoïc Minier <loic.minier@canonical.com>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      a57eb940
  12. 26 10月, 2009 1 次提交
    • P
      rcu: "Tiny RCU", The Bloatwatch Edition · 9b1d82fa
      Paul E. McKenney 提交于
      This patch is a version of RCU designed for !SMP provided for a
      small-footprint RCU implementation.  In particular, the
      implementation of synchronize_rcu() is extremely lightweight and
      high performance. It passes rcutorture testing in each of the
      four relevant configurations (combinations of NO_HZ and PREEMPT)
      on x86.  This saves about 1K bytes compared to old Classic RCU
      (which is no longer in mainline), and more than three kilobytes
      compared to Hierarchical RCU (updated to 2.6.30):
      
      	CONFIG_TREE_RCU:
      
      	   text	   data	    bss	    dec	    filename
      	    183       4       0     187     kernel/rcupdate.o
      	   2783     520      36    3339     kernel/rcutree.o
      				   3526 Total (vs 4565 for v7)
      
      	CONFIG_TREE_PREEMPT_RCU:
      
      	   text	   data	    bss	    dec	    filename
      	    263       4       0     267     kernel/rcupdate.o
      	   4594     776      52    5422     kernel/rcutree.o
      	   			   5689 Total (6155 for v7)
      
      	CONFIG_TINY_RCU:
      
      	   text	   data	    bss	    dec	    filename
      	     96       4       0     100     kernel/rcupdate.o
      	    734      24       0     758     kernel/rcutiny.o
      	    			    858 Total (vs 848 for v7)
      
      The above is for x86.  Your mileage may vary on other platforms.
      Further compression is possible, but is being procrastinated.
      
      Changes from v7 (http://lkml.org/lkml/2009/10/9/388)
      
      o	Apply Lai Jiangshan's review comments (aside from
      might_sleep() 	in synchronize_sched(), which is covered by SMP builds).
      
      o	Fix up expedited primitives.
      
      Changes from v6 (http://lkml.org/lkml/2009/9/23/293).
      
      o	Forward ported to put it into the 2.6.33 stream.
      
      o	Added lockdep support.
      
      o	Make lightweight rcu_barrier.
      
      Changes from v5 (http://lkml.org/lkml/2009/6/23/12).
      
      o	Ported to latest pre-2.6.32 merge window kernel.
      
      	- Renamed rcu_qsctr_inc() to rcu_sched_qs().
      	- Renamed rcu_bh_qsctr_inc() to rcu_bh_qs().
      	- Provided trivial rcu_cpu_notify().
      	- Provided trivial exit_rcu().
      	- Provided trivial rcu_needs_cpu().
      	- Fixed up the rcu_*_enter/exit() functions in linux/hardirq.h.
      
      o	Removed the dependence on EMBEDDED, with a view to making
      	TINY_RCU default for !SMP at some time in the future.
      
      o	Added (trivial) support for expedited grace periods.
      
      Changes from v4 (http://lkml.org/lkml/2009/5/2/91) include:
      
      o	Squeeze the size down a bit further by removing the
      	->completed field from struct rcu_ctrlblk.
      
      o	This permits synchronize_rcu() to become the empty function.
      	Previous concerns about rcutorture were unfounded, as
      	rcutorture correctly handles a constant value from
      	rcu_batches_completed() and rcu_batches_completed_bh().
      
      Changes from v3 (http://lkml.org/lkml/2009/3/29/221) include:
      
      o	Changed rcu_batches_completed(), rcu_batches_completed_bh()
      	rcu_enter_nohz(), rcu_exit_nohz(), rcu_nmi_enter(), and
      	rcu_nmi_exit(), to be static inlines, as suggested by David
      	Howells.  Doing this saves about 100 bytes from rcutiny.o.
      	(The numbers between v3 and this v4 of the patch are not directly
      	comparable, since they are against different versions of Linux.)
      
      Changes from v2 (http://lkml.org/lkml/2009/2/3/333) include:
      
      o	Fix whitespace issues.
      
      o	Change short-circuit "||" operator to instead be "+" in order
      to 	fix performance bug noted by "kraai" on LWN.
      
      		(http://lwn.net/Articles/324348/)
      
      Changes from v1 (http://lkml.org/lkml/2009/1/13/440) include:
      
      o	This version depends on EMBEDDED as well as !SMP, as suggested
      	by Ingo.
      
      o	Updated rcu_needs_cpu() to unconditionally return zero,
      	permitting the CPU to enter dynticks-idle mode at any time.
      	This works because callbacks can be invoked upon entry to
      	dynticks-idle mode.
      
      o	Paul is now OK with this being included, based on a poll at
      the 	Kernel Miniconf at linux.conf.au, where about ten people said
      	that they cared about saving 900 bytes on single-CPU systems.
      
      o	Applies to both mainline and tip/core/rcu.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Acked-by: NDavid Howells <dhowells@redhat.com>
      Acked-by: NJosh Triplett <josh@joshtriplett.org>
      Reviewed-by: NLai Jiangshan <laijs@cn.fujitsu.com>
      Cc: dipankar@in.ibm.com
      Cc: mathieu.desnoyers@polymtl.ca
      Cc: dvhltc@us.ibm.com
      Cc: niv@us.ibm.com
      Cc: peterz@infradead.org
      Cc: rostedt@goodmis.org
      Cc: Valdis.Kletnieks@vt.edu
      Cc: avi@redhat.com
      Cc: mtosatti@redhat.com
      LKML-Reference: <12565226351355-git-send-email->
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      9b1d82fa
  13. 22 8月, 2009 1 次提交
    • P
      rcu: Expunge lingering references to CONFIG_CLASSIC_RCU, optimize on !SMP · b560d8ad
      Paul E. McKenney 提交于
      A couple of references to CONFIG_CLASSIC_RCU have survived.
      Although these are harmless, it is past time for them to go.
      The one in hardirq.h is strictly a readability problem.
      
      The two in pagemap.h appear to disable a !SMP performance
      optimization (which this patch re-enables).
      
      This does raise the issue as to whether pagemap.h should really
      be referring to the CPU implementation.  Long term, I intend to
      make the RCU implementation driven by CONFIG_PREEMPT, at which
      point these should change from defined(CONFIG_TREE_RCU) to
      !defined(CONFIG_PREEMPT). In the meantime, is there something
      else that could be done in pagemap.h?
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: laijs@cn.fujitsu.com
      Cc: dipankar@in.ibm.com
      Cc: akpm@linux-foundation.org
      Cc: mathieu.desnoyers@polymtl.ca
      Cc: josht@linux.vnet.ibm.com
      Cc: dvhltc@us.ibm.com
      Cc: niv@us.ibm.com
      Cc: peterz@infradead.org
      Cc: rostedt@goodmis.org
      LKML-Reference: <20090822050851.GA8414@linux.vnet.ibm.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      b560d8ad
  14. 09 8月, 2009 1 次提交
    • A
      sched: Add default defines for PREEMPT_ACTIVE · 8e5b59a2
      Arnd Bergmann 提交于
      The PREEMPT_ACTIVE setting doesn't actually need to be
      arch-specific, so set up a sane default for all arches to
      (hopefully) migrate to.
      
      > if we look at linux/hardirq.h, it makes this claim:
      >  * - bit 28 is the PREEMPT_ACTIVE flag
      > if that's true, then why are we letting any arch set this define ?  a
      > quick survey shows that half the arches (11) are using 0x10000000 (bit
      > 28) while the other half (10) are using 0x4000000 (bit 26).  and then
      > there is the ia64 oddity which uses bit 30.  the exact value here
      > shouldnt really matter across arches though should it ?
      
      actually alpha, arm and avr32 also use bit 30 (0x40000000),
      there are only five (or eight, depending on how you count)
      architectures (blackfin, h8300, m68k, s390 and sparc) using bit
      26.
      Signed-off-by: NArnd Bergmann <arnd@arndb.de>
      Signed-off-by: NMike Frysinger <vapier@gentoo.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      8e5b59a2
  15. 13 7月, 2009 1 次提交
  16. 24 3月, 2009 1 次提交
    • T
      genirq: add threaded interrupt handler support · 3aa551c9
      Thomas Gleixner 提交于
      Add support for threaded interrupt handlers:
      
      A device driver can request that its main interrupt handler runs in a
      thread. To achive this the device driver requests the interrupt with
      request_threaded_irq() and provides additionally to the handler a
      thread function. The handler function is called in hard interrupt
      context and needs to check whether the interrupt originated from the
      device. If the interrupt originated from the device then the handler
      can either return IRQ_HANDLED or IRQ_WAKE_THREAD. IRQ_HANDLED is
      returned when no further action is required. IRQ_WAKE_THREAD causes
      the genirq code to invoke the threaded (main) handler. When
      IRQ_WAKE_THREAD is returned handler must have disabled the interrupt
      on the device level. This is mandatory for shared interrupt handlers,
      but we need to do it as well for obscure x86 hardware where disabling
      an interrupt on the IO_APIC level redirects the interrupt to the
      legacy PIC interrupt lines.
      Signed-off-by: NThomas Gleixner <tglx@linutronix.de>
      Reviewed-by: NIngo Molnar <mingo@elte.hu>
      3aa551c9
  17. 13 2月, 2009 2 次提交
    • S
      sched: do not account for NMIs · 2a7b8df0
      Steven Rostedt 提交于
      Impact: avoid corruption in system time accounting
      
      Martin Schwidefsky told me that there was an issue with NMIs and
      system accounting. The problem is that the accounting code is
      not reentrant, and if an NMI goes off after an interrupt it can
      corrupt the accounting.
      
      For now, the best we can do is to treat NMIs like SMIs and they
      are not accounted for.
      
      This patch changes nmi_enter to not call __irq_enter and to do
      the preempt-count and tracing calls directly.
      Signed-off-by: NSteven Rostedt <srostedt@redhat.com>
      2a7b8df0
    • S
      preempt-count: force hardirq-count to max of 10 · 5a5fb7db
      Steven Rostedt 提交于
      To add a bit in the preempt_count to be set when in NMI context, we
      found that some archs did not have enough bits to spare. This is
      due to the hardirq_count being a mask that can hold NR_IRQS.
      
      Some archs allow for over 16000 IRQs, and that would require a mask
      of 14 bits. The sofitrq mask is 8 bits and the preempt disable mask
      is also 8 bits.  The PREEMP_ACTIVE bit is bit 30, and bit 31 would
      make the preempt_count (which is type int) a negative number.
      A negative preempt_count is a sign of failure.
      
      Add them up 14+8+8+1+1 you get 32 bits. No room for the NMI bit.
      
      But the hardirq_count is to track the number of nested IRQs, not
      the number of total IRQs.  This originally took the paranoid approach
      of setting the max nesting to NR_IRQS. But when we have archs with
      over 1000 IRQs, it is not practical to think they will ever all
      nest on a single CPU. Not to mention that this would most definitely
      cause a stack overflow.
      
      This patch sets a max of 10 bits to be used for IRQ nesting.
      I did a 'git grep HARDIRQ' to examine all users of HARDIRQ_BITS and
      HARDIRQ_MASK, and found that making it a max of 10 would not hurt
      anyone. I did find that the m68k expected it to be 8 bits, so
      I allow for the archs to set the number to be less than 10.
      
      I removed the setting of HARDIRQ_BITS from the archs that set it
      to more than 10. This includes ALPHA, ia64 and avr32.
      
      This will always allow room for the NMI bit, and if we need to allow
      for NMI nesting, we have 4 bits to play with.
      Signed-off-by: NSteven Rostedt <srostedt@redhat.com>
      5a5fb7db
  18. 08 2月, 2009 1 次提交
  19. 19 12月, 2008 1 次提交
    • P
      "Tree RCU": scalable classic RCU implementation · 64db4cff
      Paul E. McKenney 提交于
      This patch fixes a long-standing performance bug in classic RCU that
      results in massive internal-to-RCU lock contention on systems with
      more than a few hundred CPUs.  Although this patch creates a separate
      flavor of RCU for ease of review and patch maintenance, it is intended
      to replace classic RCU.
      
      This patch still handles stress better than does mainline, so I am still
      calling it ready for inclusion.  This patch is against the -tip tree.
      Nevertheless, experience on an actual 1000+ CPU machine would still be
      most welcome.
      
      Most of the changes noted below were found while creating an rcutiny
      (which should permit ejecting the current rcuclassic) and while doing
      detailed line-by-line documentation.
      
      Updates from v9 (http://lkml.org/lkml/2008/12/2/334):
      
      o	Fixes from remainder of line-by-line code walkthrough,
      	including comment spelling, initialization, undesirable
      	narrowing due to type conversion, removing redundant memory
      	barriers, removing redundant local-variable initialization,
      	and removing redundant local variables.
      
      	I do not believe that any of these fixes address the CPU-hotplug
      	issues that Andi Kleen was seeing, but please do give it a whirl
      	in case the machine is smarter than I am.
      
      	A writeup from the walkthrough may be found at the following
      	URL, in case you are suffering from terminal insomnia or
      	masochism:
      
      	http://www.kernel.org/pub/linux/kernel/people/paulmck/tmp/rcutree-walkthrough.2008.12.16a.pdf
      
      o	Made rcutree tracing use seq_file, as suggested some time
      	ago by Lai Jiangshan.
      
      o	Added a .csv variant of the rcudata debugfs trace file, to allow
      	people having thousands of CPUs to drop the data into
      	a spreadsheet.	Tested with oocalc and gnumeric.  Updated
      	documentation to suit.
      
      Updates from v8 (http://lkml.org/lkml/2008/11/15/139):
      
      o	Fix a theoretical race between grace-period initialization and
      	force_quiescent_state() that could occur if more than three
      	jiffies were required to carry out the grace-period
      	initialization.  Which it might, if you had enough CPUs.
      
      o	Apply Ingo's printk-standardization patch.
      
      o	Substitute local variables for repeated accesses to global
      	variables.
      
      o	Fix comment misspellings and redundant (but harmless) increments
      	of ->n_rcu_pending (this latter after having explicitly added it).
      
      o	Apply checkpatch fixes.
      
      Updates from v7 (http://lkml.org/lkml/2008/10/10/291):
      
      o	Fixed a number of problems noted by Gautham Shenoy, including
      	the cpu-stall-detection bug that he was having difficulty
      	convincing me was real.  ;-)
      
      o	Changed cpu-stall detection to wait for ten seconds rather than
      	three in order to reduce false positive, as suggested by Ingo
      	Molnar.
      
      o	Produced a design document (http://lwn.net/Articles/305782/).
      	The act of writing this document uncovered a number of both
      	theoretical and "here and now" bugs as noted below.
      
      o	Fix dynticks_nesting accounting confusion, simplify WARN_ON()
      	condition, fix kerneldoc comments, and add memory barriers
      	in dynticks interface functions.
      
      o	Add more data to tracing.
      
      o	Remove unused "rcu_barrier" field from rcu_data structure.
      
      o	Count calls to rcu_pending() from scheduling-clock interrupt
      	to use as a surrogate timebase should jiffies stop counting.
      
      o	Fix a theoretical race between force_quiescent_state() and
      	grace-period initialization.  Yes, initialization does have to
      	go on for some jiffies for this race to occur, but given enough
      	CPUs...
      
      Updates from v6 (http://lkml.org/lkml/2008/9/23/448):
      
      o	Fix a number of checkpatch.pl complaints.
      
      o	Apply review comments from Ingo Molnar and Lai Jiangshan
      	on the stall-detection code.
      
      o	Fix several bugs in !CONFIG_SMP builds.
      
      o	Fix a misspelled config-parameter name so that RCU now announces
      	at boot time if stall detection is configured.
      
      o	Run tests on numerous combinations of configurations parameters,
      	which after the fixes above, now build and run correctly.
      
      Updates from v5 (http://lkml.org/lkml/2008/9/15/92, bad subject line):
      
      o	Fix a compiler error in the !CONFIG_FANOUT_EXACT case (blew a
      	changeset some time ago, and finally got around to retesting
      	this option).
      
      o	Fix some tracing bugs in rcupreempt that caused incorrect
      	totals to be printed.
      
      o	I now test with a more brutal random-selection online/offline
      	script (attached).  Probably more brutal than it needs to be
      	on the people reading it as well, but so it goes.
      
      o	A number of optimizations and usability improvements:
      
      	o	Make rcu_pending() ignore the grace-period timeout when
      		there is no grace period in progress.
      
      	o	Make force_quiescent_state() avoid going for a global
      		lock in the case where there is no grace period in
      		progress.
      
      	o	Rearrange struct fields to improve struct layout.
      
      	o	Make call_rcu() initiate a grace period if RCU was
      		idle, rather than waiting for the next scheduling
      		clock interrupt.
      
      	o	Invoke rcu_irq_enter() and rcu_irq_exit() only when
      		idle, as suggested by Andi Kleen.  I still don't
      		completely trust this change, and might back it out.
      
      	o	Make CONFIG_RCU_TRACE be the single config variable
      		manipulated for all forms of RCU, instead of the prior
      		confusion.
      
      	o	Document tracing files and formats for both rcupreempt
      		and rcutree.
      
      Updates from v4 for those missing v5 given its bad subject line:
      
      o	Separated dynticks interface so that NMIs and irqs call separate
      	functions, greatly simplifying it.  In particular, this code
      	no longer requires a proof of correctness.  ;-)
      
      o	Separated dynticks state out into its own per-CPU structure,
      	avoiding the duplicated accounting.
      
      o	The case where a dynticks-idle CPU runs an irq handler that
      	invokes call_rcu() is now correctly handled, forcing that CPU
      	out of dynticks-idle mode.
      
      o	Review comments have been applied (thank you all!!!).
      	For but one example, fixed the dynticks-ordering issue that
      	Manfred pointed out, saving me much debugging.  ;-)
      
      o	Adjusted rcuclassic and rcupreempt to handle dynticks changes.
      
      Attached is an updated patch to Classic RCU that applies a hierarchy,
      greatly reducing the contention on the top-level lock for large machines.
      This passes 10-hour concurrent rcutorture and online-offline testing on
      128-CPU ppc64 without dynticks enabled, and exposes some timekeeping
      bugs in presence of dynticks (exciting working on a system where
      "sleep 1" hangs until interrupted...), which were fixed in the
      2.6.27 kernel.  It is getting more reliable than mainline by some
      measures, so the next version will be against -tip for inclusion.
      See also Manfred Spraul's recent patches (or his earlier work from
      2004 at http://marc.info/?l=linux-kernel&m=108546384711797&w=2).
      We will converge onto a common patch in the fullness of time, but are
      currently exploring different regions of the design space.  That said,
      I have already gratefully stolen quite a few of Manfred's ideas.
      
      This patch provides CONFIG_RCU_FANOUT, which controls the bushiness
      of the RCU hierarchy.  Defaults to 32 on 32-bit machines and 64 on
      64-bit machines.  If CONFIG_NR_CPUS is less than CONFIG_RCU_FANOUT,
      there is no hierarchy.  By default, the RCU initialization code will
      adjust CONFIG_RCU_FANOUT to balance the hierarchy, so strongly NUMA
      architectures may choose to set CONFIG_RCU_FANOUT_EXACT to disable
      this balancing, allowing the hierarchy to be exactly aligned to the
      underlying hardware.  Up to two levels of hierarchy are permitted
      (in addition to the root node), allowing up to 16,384 CPUs on 32-bit
      systems and up to 262,144 CPUs on 64-bit systems.  I just know that I
      am going to regret saying this, but this seems more than sufficient
      for the foreseeable future.  (Some architectures might wish to set
      CONFIG_RCU_FANOUT=4, which would limit such architectures to 64 CPUs.
      If this becomes a real problem, additional levels can be added, but I
      doubt that it will make a significant difference on real hardware.)
      
      In the common case, a given CPU will manipulate its private rcu_data
      structure and the rcu_node structure that it shares with its immediate
      neighbors.  This can reduce both lock and memory contention by multiple
      orders of magnitude, which should eliminate the need for the strange
      manipulations that are reported to be required when running Linux on
      very large systems.
      
      Some shortcomings:
      
      o	More bugs will probably surface as a result of an ongoing
      	line-by-line code inspection.
      
      	Patches will be provided as required.
      
      o	There are probably hangs, rcutorture failures, &c.  Seems
      	quite stable on a 128-CPU machine, but that is kind of small
      	compared to 4096 CPUs.  However, seems to do better than
      	mainline.
      
      	Patches will be provided as required.
      
      o	The memory footprint of this version is several KB larger
      	than rcuclassic.
      
      	A separate UP-only rcutiny patch will be provided, which will
      	reduce the memory footprint significantly, even compared
      	to the old rcuclassic.  One such patch passes light testing,
      	and has a memory footprint smaller even than rcuclassic.
      	Initial reaction from various embedded guys was "it is not
      	worth it", so am putting it aside.
      
      Credits:
      
      o	Manfred Spraul for ideas, review comments, and bugs spotted,
      	as well as some good friendly competition.  ;-)
      
      o	Josh Triplett, Ingo Molnar, Peter Zijlstra, Mathieu Desnoyers,
      	Lai Jiangshan, Andi Kleen, Andy Whitcroft, and Andrew Morton
      	for reviews and comments.
      
      o	Thomas Gleixner for much-needed help with some timer issues
      	(see patches below).
      
      o	Jon M. Tollefson, Tim Pepper, Andrew Theurer, Jose R. Santos,
      	Andy Whitcroft, Darrick Wong, Nishanth Aravamudan, Anton
      	Blanchard, Dave Kleikamp, and Nathan Lynch for keeping machines
      	alive despite my heavy abuse^Wtesting.
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      64db4cff
  20. 07 11月, 2008 1 次提交
  21. 03 11月, 2008 1 次提交
  22. 31 10月, 2008 1 次提交
    • S
      ftrace: nmi safe code modification · 17666f02
      Steven Rostedt 提交于
      Impact: fix crashes that can occur in NMI handlers, if their code is modified
      
      Modifying code is something that needs special care. On SMP boxes,
      if code that is being modified is also being executed on another CPU,
      that CPU will have undefined results.
      
      The dynamic ftrace uses kstop_machine to make the system act like a
      uniprocessor system. But this does not address NMIs, that can still
      run on other CPUs.
      
      One approach to handle this is to make all code that are used by NMIs
      not be traced. But NMIs can call notifiers that spread throughout the
      kernel and this will be very hard to maintain, and the chance of missing
      a function is very high.
      
      The approach that this patch takes is to have the NMIs modify the code
      if the modification is taking place. The way this works is that just
      writing to code executing on another CPU is not harmful if what is
      written is the same as what exists.
      
      Two buffers are used: an IP buffer and a "code" buffer.
      
      The steps that the patcher takes are:
      
       1) Put in the instruction pointer into the IP buffer
          and the new code into the "code" buffer.
       2) Set a flag that says we are modifying code
       3) Wait for any running NMIs to finish.
       4) Write the code
       5) clear the flag.
       6) Wait for any running NMIs to finish.
      
      If an NMI is executed, it will also write the pending code.
      Multiple writes are OK, because what is being written is the same.
      Then the patcher must wait for all running NMIs to finish before
      going to the next line that must be patched.
      
      This is basically the RCU approach to code modification.
      
      Thanks to Ingo Molnar for suggesting the idea, and to Arjan van de Ven
      for his guidence on what is safe and what is not.
      Signed-off-by: NSteven Rostedt <srostedt@redhat.com>
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      17666f02
  23. 11 5月, 2008 1 次提交
    • L
      BKL: revert back to the old spinlock implementation · 8e3e076c
      Linus Torvalds 提交于
      The generic semaphore rewrite had a huge performance regression on AIM7
      (and potentially other BKL-heavy benchmarks) because the generic
      semaphores had been rewritten to be simple to understand and fair.  The
      latter, in particular, turns a semaphore-based BKL implementation into a
      mess of scheduling.
      
      The attempt to fix the performance regression failed miserably (see the
      previous commit 00b41ec2 'Revert
      "semaphore: fix"'), and so for now the simple and sane approach is to
      instead just go back to the old spinlock-based BKL implementation that
      never had any issues like this.
      
      This patch also has the advantage of being reported to fix the
      regression completely according to Yanmin Zhang, unlike the semaphore
      hack which still left a couple percentage point regression.
      
      As a spinlock, the BKL obviously has the potential to be a latency
      issue, but it's not really any different from any other spinlock in that
      respect.  We do want to get rid of the BKL asap, but that has been the
      plan for several years.
      
      These days, the biggest users are in the tty layer (open/release in
      particular) and Alan holds out some hope:
      
        "tty release is probably a few months away from getting cured - I'm
         afraid it will almost certainly be the very last user of the BKL in
         tty to get fixed as it depends on everything else being sanely locked."
      
      so while we're not there yet, we do have a plan of action.
      Tested-by: NYanmin Zhang <yanmin_zhang@linux.intel.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Andi Kleen <andi@firstfloor.org>
      Cc: Matthew Wilcox <matthew@wil.cx>
      Cc: Alexander Viro <viro@ftp.linux.org.uk>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      8e3e076c
  24. 29 3月, 2008 1 次提交
  25. 01 3月, 2008 1 次提交
  26. 26 1月, 2008 1 次提交
  27. 10 7月, 2007 1 次提交
  28. 17 2月, 2007 2 次提交
  29. 04 10月, 2006 1 次提交
    • E
      [PATCH] genirq: irq: generalize the check for HARDIRQ_BITS · 23d0b8b0
      Eric W. Biederman 提交于
      This patch adds support for systems that cannot receive every interrupt on a
      single cpu simultaneously, in the check to see if we have enough HARDIRQ_BITS.
      
      MAX_HARDIRQS_PER_CPU becomes the count of the maximum number of hardare
      generated interrupts per cpu.
      
      On architectures that support per cpu interrupt delivery this can be a
      significant space savings and scalability bonus.
      
      This patch adds support for systems that cannot receive every interrupt on
      Signed-off-by: NEric W. Biederman <ebiederm@xmission.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Rajesh Shah <rajesh.shah@intel.com>
      Cc: Andi Kleen <ak@muc.de>
      Cc: "Protasevich, Natalie" <Natalie.Protasevich@UNISYS.com>
      Cc: "Luck, Tony" <tony.luck@intel.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      23d0b8b0
  30. 04 7月, 2006 2 次提交
    • I
      [PATCH] lockdep: core · fbb9ce95
      Ingo Molnar 提交于
      Do 'make oldconfig' and accept all the defaults for new config options -
      reboot into the kernel and if everything goes well it should boot up fine and
      you should have /proc/lockdep and /proc/lockdep_stats files.
      
      Typically if the lock validator finds some problem it will print out
      voluminous debug output that begins with "BUG: ..." and which syslog output
      can be used by kernel developers to figure out the precise locking scenario.
      
      What does the lock validator do?  It "observes" and maps all locking rules as
      they occur dynamically (as triggered by the kernel's natural use of spinlocks,
      rwlocks, mutexes and rwsems).  Whenever the lock validator subsystem detects a
      new locking scenario, it validates this new rule against the existing set of
      rules.  If this new rule is consistent with the existing set of rules then the
      new rule is added transparently and the kernel continues as normal.  If the
      new rule could create a deadlock scenario then this condition is printed out.
      
      When determining validity of locking, all possible "deadlock scenarios" are
      considered: assuming arbitrary number of CPUs, arbitrary irq context and task
      context constellations, running arbitrary combinations of all the existing
      locking scenarios.  In a typical system this means millions of separate
      scenarios.  This is why we call it a "locking correctness" validator - for all
      rules that are observed the lock validator proves it with mathematical
      certainty that a deadlock could not occur (assuming that the lock validator
      implementation itself is correct and its internal data structures are not
      corrupted by some other kernel subsystem).  [see more details and conditionals
      of this statement in include/linux/lockdep.h and
      Documentation/lockdep-design.txt]
      
      Furthermore, this "all possible scenarios" property of the validator also
      enables the finding of complex, highly unlikely multi-CPU multi-context races
      via single single-context rules, increasing the likelyhood of finding bugs
      drastically.  In practical terms: the lock validator already found a bug in
      the upstream kernel that could only occur on systems with 3 or more CPUs, and
      which needed 3 very unlikely code sequences to occur at once on the 3 CPUs.
      That bug was found and reported on a single-CPU system (!).  So in essence a
      race will be found "piecemail-wise", triggering all the necessary components
      for the race, without having to reproduce the race scenario itself!  In its
      short existence the lock validator found and reported many bugs before they
      actually caused a real deadlock.
      
      To further increase the efficiency of the validator, the mapping is not per
      "lock instance", but per "lock-class".  For example, all struct inode objects
      in the kernel have inode->inotify_mutex.  If there are 10,000 inodes cached,
      then there are 10,000 lock objects.  But ->inotify_mutex is a single "lock
      type", and all locking activities that occur against ->inotify_mutex are
      "unified" into this single lock-class.  The advantage of the lock-class
      approach is that all historical ->inotify_mutex uses are mapped into a single
      (and as narrow as possible) set of locking rules - regardless of how many
      different tasks or inode structures it took to build this set of rules.  The
      set of rules persist during the lifetime of the kernel.
      
      To see the rough magnitude of checking that the lock validator does, here's a
      portion of /proc/lockdep_stats, fresh after bootup:
      
       lock-classes:                            694 [max: 2048]
       direct dependencies:                  1598 [max: 8192]
       indirect dependencies:               17896
       all direct dependencies:             16206
       dependency chains:                    1910 [max: 8192]
       in-hardirq chains:                      17
       in-softirq chains:                     105
       in-process chains:                    1065
       stack-trace entries:                 38761 [max: 131072]
       combined max dependencies:         2033928
       hardirq-safe locks:                     24
       hardirq-unsafe locks:                  176
       softirq-safe locks:                     53
       softirq-unsafe locks:                  137
       irq-safe locks:                         59
       irq-unsafe locks:                      176
      
      The lock validator has observed 1598 actual single-thread locking patterns,
      and has validated all possible 2033928 distinct locking scenarios.
      
      More details about the design of the lock validator can be found in
      Documentation/lockdep-design.txt, which can also found at:
      
         http://redhat.com/~mingo/lockdep-patches/lockdep-design.txt
      
      [bunk@stusta.de: cleanups]
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NArjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: NAdrian Bunk <bunk@stusta.de>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      fbb9ce95
    • I
      [PATCH] lockdep: irqtrace subsystem, core · de30a2b3
      Ingo Molnar 提交于
      Accurate hard-IRQ-flags and softirq-flags state tracing.
      
      This allows us to attach extra functionality to IRQ flags on/off
      events (such as trace-on/off).
      Signed-off-by: NIngo Molnar <mingo@elte.hu>
      Signed-off-by: NArjan van de Ven <arjan@linux.intel.com>
      Signed-off-by: NAndrew Morton <akpm@osdl.org>
      Signed-off-by: NLinus Torvalds <torvalds@osdl.org>
      de30a2b3
  31. 26 4月, 2006 1 次提交
  32. 15 1月, 2006 1 次提交
  33. 14 11月, 2005 1 次提交