1. 15 May, 2012 (1 commit)
    • P
      lockdep: fix oops in processing workqueue · 4d82a1de
      Peter Zijlstra committed
      Under memory load, on x86_64, with lockdep enabled, the workqueue's
      process_one_work() has been seen to oops in __lock_acquire(), barfing
      on a 0xffffffff00000000 pointer in the lockdep_map's class_cache[].
      
      Because it's permissible to free a work_struct from its callout function,
      the map used is an onstack copy of the map given in the work_struct: and
      that copy is made without any locking.
      
      Surprisingly, gcc (4.5.1 in Hugh's case) uses "rep movsl" rather than
      "rep movsq" for that structure copy: which might race with a workqueue
      user's wait_on_work() doing lock_map_acquire() on the source of the
      copy, putting a pointer into the class_cache[], but only in time for
      the top half of that pointer to be copied to the destination map.
      
      Boom when process_one_work() subsequently does lock_map_acquire()
      on its onstack copy of the lockdep_map.
      
      Fix this, and a similar instance in call_timer_fn(), with a
      lockdep_copy_map() function which additionally NULLs the class_cache[].
      
      Note: this oops was actually seen on 3.4-next, where flush_work() newly
      does the racing lock_map_acquire(); but Tejun points out that 3.4 and
      earlier are already vulnerable to the same through wait_on_work().
      
      * Patch originally from Peter.  Hugh modified it a bit and wrote the
        description.
      Signed-off-by: Peter Zijlstra <peterz@infradead.org>
      Reported-by: Hugh Dickins <hughd@google.com>
      LKML-Reference: <alpine.LSU.2.00.1205070951170.1544@eggly.anvils>
      Signed-off-by: Tejun Heo <tj@kernel.org>
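      The copy-then-clear idea behind lockdep_copy_map() can be sketched in
      user space roughly as follows. This is a minimal illustration, not the
      kernel source: struct map_sketch, copy_map_sketch() and
      CLASS_CACHE_SIZE are hypothetical stand-ins for the real lockdep_map
      definitions.

```c
#include <stddef.h>

#define CLASS_CACHE_SIZE 2	/* assumed size, for illustration only */

/* Simplified, hypothetical stand-in for the kernel's struct lockdep_map. */
struct map_sketch {
	const void *key;
	const void *class_cache[CLASS_CACHE_SIZE];
	const char *name;
};

/*
 * Copy a map that another CPU may be updating at the same time, then clear
 * the cached class pointers in the copy. Even if the plain structure copy
 * picked up a half-written cache pointer, it is never dereferenced: the
 * copy starts with an empty cache and pays for a fresh lookup instead.
 */
static void copy_map_sketch(struct map_sketch *to, const struct map_sketch *from)
{
	size_t i;

	*to = *from;
	for (i = 0; i < CLASS_CACHE_SIZE; i++)
		to->class_cache[i] = NULL;
}
```

      The cost of this design is one extra class lookup per copy, in exchange
      for never trusting a cache entry that was read without locking.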
  2. 09 December, 2011 (1 commit)
  3. 24 November, 2011 (2 commits)
  4. 31 October, 2011 (1 commit)
  5. 03 June, 2011 (1 commit)
  6. 08 March, 2011 (1 commit)
    • S
      debugobjects: Add hint for better object identification · 99777288
      Stanislaw Gruszka committed
      In complex subsystems like mac80211 structures can contain several
      timers and work structs, so identifying a specific instance from the
      call trace and object type output of debugobjects can be hard.
      
      Allow the subsystems which support debugobjects to provide a hint
      function. This function returns a pointer to a kernel address
      (preferably the object's callback function) which is printed along
      with the debugobjects type.
      
      Add hint methods for timer_list, work_struct and hrtimer.
      
      [ tglx: Massaged changelog, made it compile ]
      Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
      LKML-Reference: <20110307085809.GA9334@redhat.com>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  7. 16 February, 2011 (1 commit)
  8. 08 February, 2011 (1 commit)
  9. 04 February, 2011 (1 commit)
  10. 31 January, 2011 (1 commit)
  11. 13 December, 2010 (1 commit)
    • C
      timers: Use this_cpu_read · 7496351a
      Christoph Lameter committed
      Eric asked for this.
      
      [ tglx: Because it generates faster code according to Eric ]
      Signed-off-by: Christoph Lameter <cl@linux.com>
      Cc: Pekka Enberg <penberg@cs.helsinki.fi>
      Cc: Eric Dumazet <eric.dumazet@gmail.com>
      Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
      Cc: Tejun Heo <tj@kernel.org>
      Cc: linux-mm@kvack.org
      LKML-Reference: <alpine.DEB.2.00.1011301404490.4039@router.home>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  12. 09 December, 2010 (2 commits)
    • H
      nohz: Fix get_next_timer_interrupt() vs cpu hotplug · dbd87b5a
      Heiko Carstens committed
      This fixes a bug as seen on 2.6.32 based kernels where timers got
      enqueued on offline cpus.
      
      If a cpu goes offline it might still have pending timers. These will
      be migrated during CPU_DEAD handling after the cpu is offline.
      However while the cpu is going offline it will schedule the idle task
      which will then call tick_nohz_stop_sched_tick().
      
      That function in turn will call get_next_timer_interrupt() to figure
      out if the tick of the cpu can be stopped or not. If it turns out that
      the next tick is just one jiffy off (delta_jiffies == 1)
      tick_nohz_stop_sched_tick() incorrectly assumes that the tick should
      not stop and takes an early exit and thus it won't update the load
      balancer cpu.
      
      Just afterwards the cpu will be killed and the load balancer cpu could
      be the offline cpu.
      
      On 2.6.32 based kernels get_nohz_load_balancer() gets called to decide
      on which cpu a timer should be enqueued (see __mod_timer()), which
      leads to the possibility that timers get enqueued on an offline cpu.
      These will never expire and can cause a system hang.

      This has been observed on 2.6.32 kernels. On current kernels
      __mod_timer() uses get_nohz_timer_target() which doesn't have that
      problem. However there might be other problems because of the too
      early exit from tick_nohz_stop_sched_tick() in case a cpu goes offline.
      
      The easiest and probably safest fix seems to be to let
      get_next_timer_interrupt() just lie and let it say there isn't any
      pending timer if the current cpu is offline.
      
      I also thought of moving migrate_[hr]timers() from CPU_DEAD to
      CPU_DYING, but seeing that there already have been fixes at least in
      the hrtimer code in this area I'm afraid that this could add new
      subtle bugs.
      Signed-off-by: Heiko Carstens <heiko.carstens@de.ibm.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20101201091109.GA8984@osiris.boeblingen.de.ibm.com>
      Cc: stable@kernel.org
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
    • P
      sched: Cure more NO_HZ load average woes · 0f004f5a
      Peter Zijlstra committed
      There's a long-running regression that proved difficult to fix and
      which is hitting certain people and is rather annoying in its effects.
      
      Damien reported that after 74f5187a (sched: Cure load average vs
      NO_HZ woes) his load average is unnaturally high; he also noted that
      even with that patch reverted the load average numbers are not
      correct.

      The problem is that the previous patch only solved half the NO_HZ
      problem: it addressed the part of going into NO_HZ mode, not of
      coming out of NO_HZ mode. This patch implements that missing half.

      When coming out of NO_HZ mode there are two important things to take
      care of:
      
       - Folding the pending idle delta into the global active count.
       - Correctly aging the averages for the idle-duration.
      
      So with this patch the NO_HZ interaction should be complete and
      behaviour between CONFIG_NO_HZ=[yn] should be equivalent.
      
      Furthermore, this patch slightly changes the load average computation
      by adding a rounding term to the fixed point multiplication.
      Reported-by: Damien Wyart <damien.wyart@free.fr>
      Reported-by: Tim McGrath <tmhikaru@gmail.com>
      Tested-by: Damien Wyart <damien.wyart@free.fr>
      Tested-by: Orion Poplawski <orion@cora.nwra.com>
      Tested-by: Kyle McMartin <kyle@mcmartin.ca>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: stable@kernel.org
      Cc: Chase Douglas <chase.douglas@canonical.com>
      LKML-Reference: <1291129145.32004.874.camel@laptop>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
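      The rounding term mentioned at the end of the message above can be
      illustrated with a small fixed-point sketch. The constants follow the
      usual load-average convention (11 fractional bits, a one-minute decay
      factor of 1884/2048), but treat them and the function names as
      assumptions of this example rather than a copy of the patch.

```c
#define FSHIFT	11			/* bits of fixed-point precision */
#define FIXED_1	(1UL << FSHIFT)		/* 1.0 in fixed point */
#define EXP_1	1884UL			/* assumed 1-minute decay factor */

/* Exponential average step that truncates the fixed-point product. */
static unsigned long calc_load_trunc(unsigned long load, unsigned long exp,
				     unsigned long active)
{
	load *= exp;
	load += active * (FIXED_1 - exp);
	return load >> FSHIFT;
}

/* Same step, but adding half a unit before the shift rounds to nearest. */
static unsigned long calc_load_rounded(unsigned long load, unsigned long exp,
					unsigned long active)
{
	load *= exp;
	load += active * (FIXED_1 - exp);
	load += 1UL << (FSHIFT - 1);	/* the rounding term */
	return load >> FSHIFT;
}
```

      With a steady input (load and active both equal to FIXED_1) either
      variant returns FIXED_1 again; the two only differ when the product
      lands between two representable values.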
  13. 22 October, 2010 (3 commits)
  14. 21 October, 2010 (1 commit)
  15. 19 October, 2010 (1 commit)
    • P
      irq_work: Add generic hardirq context callbacks · e360adbe
      Peter Zijlstra committed
      Provide a mechanism that allows running code in IRQ context. It is
      most useful for NMI code that needs to interact with the rest of the
      system -- like waking up a task to drain buffers.
      
      Perf currently has such a mechanism, so extract that and provide it as
      a generic feature, independent of perf so that others may also
      benefit.
      
      The IRQ context callback is generated through self-IPIs where
      possible, or on architectures like powerpc the decrementer (the
      built-in timer facility) is set to generate an interrupt immediately.
      
      Architectures that don't have anything like this have to make do with a
      callback from the timer tick. These architectures can call
      irq_work_run() at the tail of any IRQ handlers that might enqueue such
      work (like the perf IRQ handler) to avoid undue latencies in
      processing the work.
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Kyle McMartin <kyle@mcmartin.ca>
      Acked-by: Martin Schwidefsky <schwidefsky@de.ibm.com>
      [ various fixes ]
      Signed-off-by: Huang Ying <ying.huang@intel.com>
      LKML-Reference: <1287036094.7768.291.camel@yhuang-dev>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  16. 11 August, 2010 (1 commit)
  17. 04 August, 2010 (2 commits)
    • P
      timer: Added usleep_range timer · 5e7f5a17
      Patrick Pannuto committed
      usleep_range is a finer-precision implementation of msleep
      and is designed to be a drop-in replacement for udelay where
      a precise sleep / busy-wait is unnecessary.
      
      Since an easy interface to hrtimers could lead to an undesired
      proliferation of interrupts, we provide only a "range" API,
      forcing the caller to think about an acceptable tolerance on
      both ends and hopefully avoiding introducing another interrupt.
      
      INTRO
      
      As discussed here ( http://lkml.org/lkml/2007/8/3/250 ), msleep(1) is not
      precise enough for many drivers (yes, sleep precision is an unfair notion,
      but consistently sleeping for ~an order of magnitude greater than requested
      is worth fixing). This patch adds a usleep API so that udelay does not have
      to be used. Obviously not every udelay can be replaced (those in atomic
      contexts or being used for simple bitbanging come to mind), but there are
      many, many examples of
      
      mydriver_write(...)
      /* Wait for hardware to latch */
      udelay(100)
      
      in various drivers where a busy-wait loop is neither beneficial nor
      necessary, but msleep simply does not provide enough precision and people
      are using a busy-wait loop instead.
      
      CONCERNS FROM THE RFC
      
      Why is udelay a problem / necessary? Most callers of udelay are in device/
      driver initialization code, which is serial...
      
      	As I see it, there is only benefit to sleeping over a delay; the
      	notion of "refactoring" areas that use udelay was presented, but
      	I see usleep as the refactoring. Consider i2c, if the bus is busy,
      	you need to wait a bit (say 100us) before trying again, your
      	current options are:
      
      		* udelay(100)
      		* msleep(1) <-- As noted above, actually as high as ~20ms
      				on some platforms, so not really an option
      		* Manually set up an hrtimer to try again in 100us (which
      		  is what usleep does anyway...)
      
      	People choose the udelay route because it is EASY; we need to
      	provide a better easy route.
      
      	Device / driver / boot code is *currently* serial, but every few
      	months someone makes noise about parallelizing boot, and IMHO, a
      	little forward-thinking now is one less thing to worry about
      	if/when that ever happens
      
      udelay's could be preempted
      
      	Sure, but if udelay plans on looping 1000 times, and it gets
      	preempted on loop 200, whenever it's scheduled again, it is
      	going to do the next 800 loops.
      
      Is the interruptible case needed?
      
      	Probably not, but I see usleep as a very logical parallel to msleep,
      	so it made sense to include the "full" API. Processors are getting
      	faster (albeit not as quickly as they are becoming more parallel),
      	so if someone wanted to be interruptible for a few usecs, why not
      	let them? If this is a contentious point, I'm happy to remove it.
      
      OTHER THOUGHTS
      
      I believe there is also value in exposing the usleep_range option; it gives
      the scheduler a lot more flexibility and allows the programmer to express
      his intent much more clearly; it's something I would hope future driver
      writers will take advantage of.
      
      To get the results in the NUMBERS section below, I literally s/udelay/usleep
      the kernel tree; I had to go in and undo the changes to the USB drivers, but
      everything else booted successfully; I find that extremely telling in and
      of itself -- many people are using a delay API where a sleep will suit them
      just fine.
      
      SOME ATTEMPTS AT NUMBERS
      
      It turns out that calculating quantifiable benefit on this is challenging,
      so instead I will simply present the current state of things, and I hope
      this to be sufficient:
      
      How many udelay calls are there in 2.6.35-rc5?
      
      	udelay(ARG) >=	| COUNT
      	1000		| 319
      	500		| 414
      	100		| 1146
      	20		| 1832
      
      I am working on Android, so that is my focus for this. The following table
      is a modified usleep that simply printk's the amount of time requested to
      sleep; these tests were run on a kernel with udelay >= 20 --> usleep
      
      "boot" is power-on to lock screen
      "power collapse" is when the power button is pushed and the device suspends
      "resume" is when the power button is pushed and the lock screen is displayed
               (no touchscreen events or anything, just turning on the display)
      "use device" is from the unlock swipe to clicking around a bit; there is no
      	sd card in this phone, so fail loading music, video, camera
      
      	ACTION		| TOTAL NUMBER OF USLEEP CALLS	| NET TIME (us)
      	boot		| 22				| 1250
      	power-collapse	| 9				| 1200
      	resume		| 5				| 500
      	use device	| 59				| 7700
      
      The most interesting category to me is the "use device" field; 7700us of
      busy-wait time that could be put towards better responsiveness, or at the
      least less power usage.
      Signed-off-by: Patrick Pannuto <ppannuto@codeaurora.org>
      Cc: apw@canonical.com
      Cc: corbet@lwn.net
      Cc: arjan@linux.intel.com
      Cc: Randy Dunlap <rdunlap@xenotime.net>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
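      Why a "range" API can avoid extra interrupts may be easier to see with
      a toy model. The sketch below is not how the kernel implements
      usleep_range; pick_wakeup() and its parameters are hypothetical, and
      the list of pending wakeups stands in for timer interrupts that are
      already scheduled.

```c
#include <stddef.h>

/*
 * Toy model: the caller is happy to wake anywhere in [min_us, max_us]
 * (microseconds from now). If a wakeup that is already pending falls inside
 * that window, reuse it; otherwise a new wakeup has to be programmed at
 * max_us. The first pending entry inside the window (in array order) wins.
 * *reused is set to 1 when no additional interrupt is needed, else 0.
 */
static unsigned long pick_wakeup(const unsigned long *pending, size_t n,
				 unsigned long min_us, unsigned long max_us,
				 int *reused)
{
	size_t i;

	for (i = 0; i < n; i++) {
		if (pending[i] >= min_us && pending[i] <= max_us) {
			*reused = 1;
			return pending[i];
		}
	}
	*reused = 0;
	return max_us;
}
```

      In a driver, the udelay(100) pattern quoted in the message above would
      become something like usleep_range(100, 200), where the upper bound of
      200 is an assumed tolerance chosen only for illustration. The wider the
      window a caller can accept, the more likely an existing wakeup can be
      shared.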
    • T
      Revert "timer: Added usleep[_range] timer" · e1b004c3
      Thomas Gleixner committed
      This reverts commit 22b8f15c to merge
      an advanced version.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
  18. 03 August, 2010 (1 commit)
  19. 23 July, 2010 (2 commits)
  20. 09 June, 2010 (1 commit)
    • V
      sched: Change nohz idle load balancing logic to push model · 83cd4fe2
      Venkatesh Pallipadi committed
      In the new push model, all idle CPUs indeed go into nohz mode. There is
      still the concept of an idle load balancer (performing the load balancing
      on behalf of all the idle CPUs in the system). A busy CPU kicks the nohz
      balancer when any of the nohz CPUs needs idle load balancing.
      The kickee CPU does the idle load balancing on behalf of all idle CPUs
      instead of the normal idle balance.
      
      This addresses the below two problems with the current nohz ilb logic:
      * the idle load balancer continued to have periodic ticks during idle and
        woke up frequently, even though it did not have any rebalancing to do on
        behalf of any of the idle CPUs.
      * On x86 and CPUs that have APIC timer stoppage on idle CPUs, this
        periodic wakeup can result in a periodic additional interrupt on a CPU
        doing the timer broadcast.
      
      Also, currently we are migrating the unpinned timers from an idle cpu to the
      cpu doing idle load balancing (when all the cpus in the system are idle,
      there is no idle load balancing cpu and timers get added to the same idle cpu
      where the request was made, so the existing optimization works only on a
      semi-idle system).

      And in a semi-idle system, we no longer have periodic ticks on the idle load
      balancer CPU. Using that cpu will add more delays to the timers than intended
      (as that cpu's timer base may not be up to date wrt jiffies etc). This was
      causing mysterious slowdowns during boot etc.

      For now, in the semi-idle case, use the nearest busy cpu for migrating timers
      from an idle cpu.  This is good for power-savings anyway.
      Signed-off-by: Venkatesh Pallipadi <venki@google.com>
      Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      LKML-Reference: <1274486981.2840.46.camel@sbs-t61.sc.intel.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
  21. 05 June, 2010 (1 commit)
  22. 28 May, 2010 (1 commit)
  23. 26 May, 2010 (2 commits)
    • T
      timers: Move local variable into else section · 2abfb9e1
      Thomas Gleixner committed
      Fix nit-picking coding style detail.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    • T
      timers: Fix slack calculation really · 8e63d779
      Thomas Gleixner committed
      commit f00e047e (timers: Fix slack calculation for expired timers)
      fixed the issue of slack on expired timers only partially. Linus
      noticed that jiffies is volatile so it is reloaded twice, which
      generates bad code.
      
      But it's worse. This can defeat the time_after() check if jiffies are
      incremented between time_after() and the slack calculation.
      
      Fix it by reading jiffies into a local variable, which prevents the
      compiler from loading it twice. While at it make the > -1 check into
      >= 0 which is easier to read.
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
  24. 24 May, 2010 (1 commit)
  25. 13 May, 2010 (1 commit)
  26. 07 April, 2010 (1 commit)
    • A
      timers: Introduce the concept of timer slack for legacy timers · 3bbb9ec9
      Arjan van de Ven committed
      While HR timers have had the concept of timer slack for quite some time
      now, the legacy timers lacked this concept, and had to make do with
      round_jiffies() and friends.
      
      Timer slack is important for power management; grouping timers reduces the
      number of wakeups which in turn reduces power consumption.
      
      This patch introduces timer slack to the legacy timers using the following
      pieces:
      * A slack field in the timer struct
      * An api (set_timer_slack) that callers can use to set explicit timer slack
      * A default slack of 0.4% of the requested delay for callers that do not set
        any explicit slack
      * Rounding code that is part of mod_timer() that tries to
        group timers around jiffies values every 'power of two'
        (so quick timers will group around every 2, but longer timers
        will group around every 4, 8, 16, 32 etc)
      Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
      Cc: johnstul@us.ibm.com
      Cc: Ingo Molnar <mingo@elte.hu>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
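      The default slack and the power-of-two rounding listed above can be
      sketched in user space as follows. This is an approximation under
      stated assumptions, not the patch itself: apply_slack_sketch() and
      highest_bit() are hypothetical names, the current time is passed in as
      a parameter instead of being read from jiffies, and a negative slack
      value is assumed to mean "no explicit slack set".

```c
/* Index of the highest set bit of a non-zero word (0 = least significant). */
static int highest_bit(unsigned long v)
{
	int bit = 0;

	while (v >>= 1)
		bit++;
	return bit;
}

/*
 * Return a possibly later expiry for a timer that was asked to fire at
 * 'expires'. The timer may fire anywhere in [expires, limit]; within that
 * window we pick the value with the most trailing zero bits, so that timers
 * with overlapping windows tend to land on the same tick.
 */
static unsigned long apply_slack_sketch(unsigned long now,
					unsigned long expires, long slack)
{
	unsigned long limit, diff;

	if (slack >= 0) {
		limit = expires + (unsigned long)slack;	/* explicit slack */
	} else {
		long delta = (long)(expires - now);

		if (delta < 256)	/* too short (or expired): no slack */
			return expires;
		limit = expires + (unsigned long)delta / 256;	/* ~0.4% */
	}

	diff = expires ^ limit;
	if (diff == 0)
		return expires;

	/* Clear every bit below the highest bit in which the two differ. */
	return limit & ~((1UL << highest_bit(diff)) - 1);
}
```

      For example, a timer requested 1000 ticks ahead gets a window of three
      ticks (1000 / 256) and is moved to the next even tick inside it, while
      a timer only 100 ticks ahead is left untouched.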
  27. 30 March, 2010 (1 commit)
    • T
      include cleanup: Update gfp.h and slab.h includes to prepare for breaking... · 5a0e3ad6
      Tejun Heo committed
      include cleanup: Update gfp.h and slab.h includes to prepare for breaking implicit slab.h inclusion from percpu.h
      
      percpu.h is included by sched.h and module.h and thus ends up being
      included when building most .c files.  percpu.h includes slab.h which
      in turn includes gfp.h making everything defined by the two files
      universally available and complicating inclusion dependencies.
      
      percpu.h -> slab.h dependency is about to be removed.  Prepare for
      this change by updating users of gfp and slab facilities to include those
      headers directly instead of assuming availability.  As this conversion
      needs to touch a large number of source files, the following script is
      used as the basis of conversion.
      
        http://userweb.kernel.org/~tj/misc/slabh-sweep.py
      
      The script does the following.
      
      * Scan files for gfp and slab usages and update includes such that
        only the necessary includes are there.  ie. if only gfp is used,
        gfp.h, if slab is used, slab.h.
      
      * When the script inserts a new include, it looks at the include
        blocks and tries to put the new include such that its order conforms
        to its surrounding.  It's put in the include block which contains
        core kernel includes, in the same order that the rest are ordered -
        alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
        doesn't seem to be any matching order.
      
      * If the script can't find a place to put a new include (mostly
        because the file doesn't have a fitting include block), it prints out
        an error message indicating which .h file needs to be added to the
        file.
      
      The conversion was done in the following steps.
      
      1. The initial automatic conversion of all .c files updated slightly
         over 4000 files, deleting around 700 includes and adding ~480 gfp.h
         and ~3000 slab.h inclusions.  The script emitted errors for ~400
         files.
      
      2. Each error was manually checked.  Some didn't need the inclusion,
         some needed manual addition while adding it to implementation .h or
         embedding .c file was more appropriate for others.  This step added
         inclusions to around 150 files.
      
      3. The script was run again and the output was compared to the edits
         from #2 to make sure no file was left behind.
      
      4. Several build tests were done and a couple of problems were fixed.
         e.g. lib/decompress_*.c used malloc/free() wrappers around slab
         APIs requiring slab.h to be added manually.
      
      5. The script was run on all .h files but without automatically
         editing them as sprinkling gfp.h and slab.h inclusions around .h
         files could easily lead to inclusion dependency hell.  Most gfp.h
         inclusion directives were ignored as stuff from gfp.h was usually
         widely available and often used in preprocessor macros.  Each
         slab.h inclusion directive was examined and added manually as
         necessary.
      
      6. percpu.h was updated not to include slab.h.
      
      7. Build tests were done on the following configurations and failures
         were fixed.  CONFIG_GCOV_KERNEL was turned off for all tests (as my
         distributed build env didn't work with gcov compiles) and a few
         more options had to be turned off depending on archs to make things
         build (like ipr on powerpc/64 which failed due to missing writeq).
      
         * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
         * powerpc and powerpc64 SMP allmodconfig
         * sparc and sparc64 SMP allmodconfig
         * ia64 SMP allmodconfig
         * s390 SMP allmodconfig
         * alpha SMP allmodconfig
         * um on x86_64 SMP allmodconfig
      
      8. percpu.h modifications were reverted so that it could be applied as
         a separate patch and serve as bisection point.
      
      Given the fact that I had only a couple of failures from tests on step
      6, I'm fairly confident about the coverage of this conversion patch.
      If there is a breakage, it's likely to be something in one of the arch
      headers which should be easily discoverable on most builds of
      the specific arch.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
  28. 13 March, 2010 (4 commits)
  29. 21 January, 2010 (1 commit)
  30. 17 December, 2009 (1 commit)