1. 07 Jan 2011, 13 commits
  2. 06 Jan 2011, 1 commit
    • perf: Add calls to suspend trace point · 938cfed1
      Authored by Jean Pihet
      Uses the machine_suspend trace point, called from the
      generic kernel suspend_devices_and_enter function.
      Signed-off-by: Jean Pihet <j-pihet@ti.com>
      Acked-by: Rafael J. Wysocki <rjw@sisk.pl>
      Cc: Arjan van de Ven <arjan@linux.intel.com>
      CC: Thomas Renninger <trenn@suse.de>
      Cc: Len Brown <len.brown@intel.com>
      Cc: Pavel Machek <pavel@ucw.cz>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: linux-pm@lists.linux-foundation.org
      LKML-Reference: <1294253342-29056-2-git-send-email-j-pihet@ti.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      938cfed1
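      A minimal sketch of what the commit above describes: the suspend path is
      bracketed by the machine_suspend trace point.  The call sites are simplified
      here; the surrounding suspend_devices_and_enter() logic and the
      PWR_EVENT_EXIT exit marker are recalled from memory as an illustration, not
      quoted from the patch:

          #include <trace/events/power.h>

          static int suspend_sketch(suspend_state_t state)
          {
                  trace_machine_suspend(state);           /* entering suspend */

                  /* ... suspend devices, enter the low-power state, resume ... */

                  trace_machine_suspend(PWR_EVENT_EXIT);  /* back from suspend */
                  return 0;
          }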
  3. 05 Jan 2011, 2 commits
    • sched: Change wait_for_completion_*_timeout() to return a signed long · 6bf41237
      Authored by NeilBrown
      wait_for_completion_*_timeout() can return:
      
         0: if the wait timed out
       -ve: if the wait was interrupted
       +ve: if the completion was completed.
      
      As they currently return an 'unsigned long', the last two cases
      are not easily distinguished which can easily result in buggy
      code, as is the case for the recently added
      wait_for_completion_interruptible_timeout() call in
      net/sunrpc/cache.c
      
      So change them both to return 'long'.  As MAX_SCHEDULE_TIMEOUT
      is LONG_MAX, a large +ve return value should never overflow.
      Signed-off-by: NeilBrown <neilb@suse.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: J.  Bruce Fields <bfields@fieldses.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      LKML-Reference: <20110105125016.64ccab0e@notabene.brown>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      6bf41237
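      A short usage sketch of the distinction this change makes possible (the
      completion, timeout value and error handling below are illustrative, not
      taken from the patch):

          /* hypothetical caller: 'done' is a struct completion signalled
           * elsewhere, e.g. from an interrupt handler */
          long ret = wait_for_completion_interruptible_timeout(&done,
                                                               msecs_to_jiffies(500));
          if (ret == 0)
                  return -ETIMEDOUT;      /* timed out */
          if (ret < 0)
                  return ret;             /* interrupted, e.g. -ERESTARTSYS */
          /* ret > 0: completed; 'ret' is the remaining timeout in jiffies */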
    • [S390] mutex: Introduce arch_mutex_cpu_relax() · 34b133f8
      Authored by Gerald Schaefer
      The spinning mutex implementation uses cpu_relax() in busy loops as a
      compiler barrier. Depending on the architecture, cpu_relax() may do more
      than needed in these specific mutex spin loops. On System z we also give
      up the time slice of the virtual cpu in cpu_relax(), which prevents
      effective spinning on the mutex.
      
      This patch replaces cpu_relax() in the spinning mutex code with
      arch_mutex_cpu_relax(), which can be defined by each architecture that
      selects HAVE_ARCH_MUTEX_CPU_RELAX. The default is still cpu_relax(), so
      this patch should not affect architectures other than System z for now.
      Signed-off-by: Gerald Schaefer <gerald.schaefer@de.ibm.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <1290437256.7455.4.camel@thinkpad>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      34b133f8
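      A sketch of the resulting fallback (the Kconfig symbol and helper name come
      from the description above; the placement of the #define is schematic):

          /*
           * Unless the architecture selects HAVE_ARCH_MUTEX_CPU_RELAX and
           * supplies its own definition (as System z does, to yield the virtual
           * cpu), the mutex owner-spin loop keeps using cpu_relax().
           */
          #ifndef CONFIG_HAVE_ARCH_MUTEX_CPU_RELAX
          #define arch_mutex_cpu_relax()  cpu_relax()
          #endif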
  4. 04 Jan 2011, 5 commits
  5. 03 Jan 2011, 1 commit
    • watchdog: Improve initialisation error message and documentation · 55142374
      Authored by Ben Hutchings
      The error message 'NMI watchdog failed to create perf event...'
      does not make it clear that this is a fatal error for the
      watchdog.  It also currently prints the error value as a
      pointer, rather than extracting the error code with PTR_ERR().
      Fix that.
      
      Add a note to the description of the 'nowatchdog' kernel
      parameter to associate it with this message.
      Reported-by: Cesare Leonardi <celeonar@gmail.com>
      Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
      Cc: 599368@bugs.debian.org
      Cc: 608138@bugs.debian.org
      Cc: Don Zickus <dzickus@redhat.com>
      Cc: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: <stable@kernel.org> # .37.x and later
      LKML-Reference: <1294009362.3167.126.camel@localhost>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      55142374
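      An illustrative fragment of the kind of reporting this fixes (variable names
      and message wording are approximations, not the exact patch):

          /* perf_event_create_kernel_counter() returns an ERR_PTR on failure;
           * report the numeric code via PTR_ERR() instead of printing the
           * pointer, and make clear that the NMI watchdog stays disabled. */
          if (IS_ERR(event)) {
                  pr_err("NMI watchdog disabled (cpu%i): unable to create perf event: %ld\n",
                         cpu, PTR_ERR(event));
                  return PTR_ERR(event);
          }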
  6. 30 Dec 2010, 1 commit
  7. 24 Dec 2010, 1 commit
    • ring_buffer: Off-by-one and duplicate events in ring_buffer_read_page · e1e35927
      Authored by David Sharp
      Fix two related problems in the event-copying loop of
      ring_buffer_read_page.
      
      The loop condition for copying events is off-by-one.
      "len" is the remaining space in the caller-supplied page.
      "size" is the size of the next event (or two events).
      If len == size, then there is just enough space for the next event.
      
      size was set to rb_event_ts_length, which may include the size of two
      events if the first event is a time-extend, in order to assure time-
      extends are kept together with the event after it. However,
      rb_advance_reader always advances by one event. This would result in the
      event after any time-extend being duplicated. Instead, get the size of
      a single event for the memcpy, but use rb_event_ts_length for the loop
      condition.
      Signed-off-by: David Sharp <dhsharp@google.com>
      LKML-Reference: <1293064704-8101-1-git-send-email-dhsharp@google.com>
      LKML-Reference: <AANLkTin7nLrRPc9qGjdjHbeVDDWiJjAiYyb-L=gH85bx@mail.gmail.com>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      e1e35927
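      A pared-down sketch of the corrected loop shape (buffer-position bookkeeping
      omitted; the rb_*() helpers are ring-buffer internals recalled from memory,
      not quoted from the patch):

          size = rb_event_ts_length(event);       /* may span a time-extend pair */
          while (len >= size) {                   /* '>=' fixes the off-by-one   */
                  /* copy and advance by a single event only */
                  size = rb_event_length(event);
                  memcpy(to, from, size);
                  len -= size;
                  rb_advance_reader(cpu_buffer);

                  event = rb_reader_event(cpu_buffer);
                  size = rb_event_ts_length(event);  /* bound for the next pass */
          }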
  8. 23 Dec 2010, 3 commits
    • module: Move RO/NX module protection to after ftrace module update · 94462ad3
      Authored by Steven Rostedt
      The commit:
      
      84e1c6bb
      x86: Add RO/NX protection for loadable kernel modules
      
      Broke the function tracer with this output:
      
      ------------[ cut here ]------------
      WARNING: at kernel/trace/ftrace.c:1014 ftrace_bug+0x114/0x171()
      Hardware name: Precision WorkStation 470
      Modules linked in: i2c_core(+)
      Pid: 86, comm: modprobe Not tainted 2.6.37-rc2+ #68
      Call Trace:
       [<ffffffff8104e957>] warn_slowpath_common+0x85/0x9d
       [<ffffffffa00026db>] ? __process_new_adapter+0x7/0x34 [i2c_core]
       [<ffffffffa00026db>] ? __process_new_adapter+0x7/0x34 [i2c_core]
       [<ffffffff8104e989>] warn_slowpath_null+0x1a/0x1c
       [<ffffffff810a9dfe>] ftrace_bug+0x114/0x171
       [<ffffffffa00026db>] ? __process_new_adapter+0x7/0x34 [i2c_core]
       [<ffffffff810aa0db>] ftrace_process_locs+0x1ae/0x274
       [<ffffffffa00026db>] ? __process_new_adapter+0x7/0x34 [i2c_core]
       [<ffffffff810aa29e>] ftrace_module_notify+0x39/0x44
       [<ffffffff814405cf>] notifier_call_chain+0x37/0x63
       [<ffffffff8106e054>] __blocking_notifier_call_chain+0x46/0x5b
       [<ffffffff8106e07d>] blocking_notifier_call_chain+0x14/0x16
       [<ffffffff8107ffde>] sys_init_module+0x73/0x1f3
       [<ffffffff8100acf2>] system_call_fastpath+0x16/0x1b
      ---[ end trace 2aff4f4ca53ec746 ]---
      ftrace faulted on writing [<ffffffffa00026db>]
      __process_new_adapter+0x7/0x34 [i2c_core]
      
      The cause was that the module text was set to read only before ftrace
      could convert the calls to mcount to nops. Thus, the conversions failed
      due to not being able to write to the text locations.
      
      The simple fix is to move setting the module to read only after the
      module notifiers are called (where ftrace sets the module mcounts to nops).
      Reported-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Acked-by: Rusty Russell <rusty@rustcorp.com.au>
      Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
      94462ad3
    • taskstats: pad taskstats netlink response for alignment issues on ia64 · 4be2c95d
      Authored by Jeff Mahoney
      The taskstats structure is internally aligned on 8 byte boundaries but the
      layout of the aggregate reply, with two NLA headers and the pid (each 4
      bytes), actually force the entire structure to be unaligned.  This causes
      the kernel to issue unaligned access warnings on some architectures like
      ia64.  Unfortunately, some software out there doesn't properly unroll the
      NLA packet and assumes that the start of the taskstats structure will
      always be 20 bytes from the start of the netlink payload.  Aligning the
      start of the taskstats structure breaks this software, which we don't
      want.  So, for now the alignment only happens on architectures that
      require it and those users will have to update to fixed versions of those
      packages.  Space is reserved in the packet only when needed.  This ifdef
      should be removed in several years e.g.  2012 once we can be confident
      that fixed versions are installed on most systems.  We add the padding
      before the aggregate since the aggregate is already a defined type.
      
      Commit 85893120 ("delayacct: align to 8 byte boundary on 64-bit systems")
      previously addressed the alignment issues by padding out the pid field.
      This was supposed to be a compatible change but the circumstances
      described above mean that it wasn't.  This patch backs out that change,
      since it was a hack, and introduces a new NULL attribute type to provide
      the padding.  Padding the response with 4 bytes avoids allocating an
      aligned taskstats structure and copying it back.  Since the structure
      weighs in at 328 bytes, it's too big to do it on the stack.
      Signed-off-by: Jeff Mahoney <jeffm@suse.com>
      Reported-by: Brian Rogers <brian@xyzw.org>
      Cc: Jeff Mahoney <jeffm@suse.com>
      Cc: Guillaume Chazarain <guichaz@gmail.com>
      Cc: Balbir Singh <balbir@in.ibm.com>
      Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
      Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      4be2c95d
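      A hedged sketch of the padding idea: on architectures that need it, an empty
      attribute (a bare 4-byte netlink header with no payload) is emitted before
      the aggregate so the taskstats payload lands on an 8-byte boundary.  The
      TASKSTATS_TYPE_NULL name comes from the description above; the guard macro
      is illustrative:

          #ifdef TASKSTATS_NEEDS_PADDING          /* arch-dependent, illustrative */
                  if (nla_put(skb, TASKSTATS_TYPE_NULL, 0, NULL) < 0)
                          goto err;
          #endif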
    • Fix rounding in clocks_calc_mult_shift() · b5776c4a
      Authored by John Stultz
      Russell King reports:
      | On the ARM dev boards, we have a 32-bit counter running at 24MHz.  Calling
      | clocks_calc_mult_shift(&mult, &shift, 24MHz, NSEC_PER_SEC, 60) gives
      | us a multiplier of 2796202666 and a shift of 26.
      |
      | Over a large counter delta, this produces an error - let's take a count
      | from 362976315 to 4280663372:
      |
      | (4280663372-362976315) * 2796202666 / 2^26 - (4280663372-362976315) * (1000/24)
      |  => -38.91872422891230269990
      |
      | Can we do better?
      |
      | (4280663372-362976315) * 2796202667 / 2^26 - (4280663372-362976315) * (1000/24)
      | 19.45936211449532822051
      |
      | which is about twice as good as the 2796202666 multiplier.
      |
      | Looking at the equivalent divisions obtained, 2796202666 / 2^26 gives
      | 41.66666665673255920410ns per tick, whereas 2796202667 / 2^26 gives
      | 41.66666667163372039794ns.  The actual value wanted is 1000/24 =
      | 41.66666666666666666666ns.
      
      Fix this by ensuring we round to nearest when calculating the
      multiplier.
      Signed-off-by: John Stultz <john.stultz@linaro.org>
      Tested-by: Santosh Shilimkar <santosh.shilimkar@ti.com>
      Tested-by: Will Deacon <will.deacon@arm.com>
      Tested-by: Mikael Pettersson <mikpe@it.uu.se>
      Tested-by: Eric Miao <eric.y.miao@gmail.com>
      Tested-by: Olof Johansson <olof@lixom.net>
      Tested-by: Jamie Iles <jamie@jamieiles.com>
      Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
      b5776c4a
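      The essence of the fix, sketched with clocks_calc_mult_shift()'s from/to/sft
      naming: round the multiplier division to nearest rather than truncating.

          tmp = ((u64) to) << sft;
          tmp += from / 2;        /* add half the divisor: round to nearest */
          do_div(tmp, from);
          /* for from = 24MHz, to = NSEC_PER_SEC, sft = 26 this yields
           * 2796202667 instead of the truncated 2796202666 quoted above */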
  9. 22 Dec 2010, 1 commit
  10. 21 Dec 2010, 1 commit
    • workqueue: allow chained queueing during destruction · c8efcc25
      Authored by Tejun Heo
      Currently, destroy_workqueue() makes the workqueue deny all new
      queueing by setting WQ_DYING and flushes the workqueue once before
      proceeding with destruction; however, there are cases where work items
      queue more related work items.  Currently, such users need to
      explicitly flush the workqueue multiple times depending on the
      possible depth of such chained queueing.
      
      This patch updates the queueing path such that a work item can queue
      further work items on the same workqueue even when WQ_DYING is set.
      The flush on destruction is automatically retried until the workqueue
      is empty.  This guarantees that the workqueue is empty on destruction
      while allowing chained queueing.
      
      The flush retry logic whines if it takes too many retries to drain the
      workqueue.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
      c8efcc25
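      An illustrative example of the chained-queueing pattern this permits during
      destruction (all names below are made up for the example):

          /* step1 queues step2 on the same workqueue from its own callback; with
           * this patch that is safe even after destroy_workqueue() has started,
           * and the destruction-time flush retries until the chain drains. */
          static struct workqueue_struct *my_wq;

          static void step2_fn(struct work_struct *work)
          {
                  /* final stage of the chain */
          }
          static DECLARE_WORK(step2_work, step2_fn);

          static void step1_fn(struct work_struct *work)
          {
                  queue_work(my_wq, &step2_work);         /* chained queueing */
          }
          static DECLARE_WORK(step1_work, step1_fn);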
  11. 20 Dec 2010, 1 commit
  12. 19 Dec 2010, 2 commits
    • sched: Fix interactivity bug by charging unaccounted run-time on entity re-weight · 19e5eebb
      Authored by Paul Turner
      Mike Galbraith reported poor interactivity[*] when the new shares distribution
      code was combined with autogroups.
      
      The root cause turns out to be a mis-ordering of accounting accrued execution
      time and shares updates.  Since update_curr() is issued hierarchically,
      updating the parent entity weights to reflect child enqueue/dequeue results in
      the parent's unaccounted execution time then being accrued (vs vruntime) at the
      new weight as opposed to the weight present at accumulation.
      
      While this doesn't have much effect on processes with timeslices that cross a
      tick, it is particularly problematic for an interactive process (e.g. Xorg)
      which incurs many (tiny) timeslices.  In this scenario almost all updates are
      at dequeue which can result in significant fairness perturbation (especially if
      it is the only thread, resulting in potential {tg->shares, MIN_SHARES}
      transitions).
      
      Correct this by ensuring unaccounted time is accumulated prior to manipulating
      an entity's weight.
      
      [*] http://xkcd.com/619/ is perversely Nostradamian here.
      Signed-off-by: Paul Turner <pjt@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      LKML-Reference: <20101216031038.159704378@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      19e5eebb
    • sched: Move periodic share updates to entity_tick() · 43365bd7
      Authored by Paul Turner
      Long running entities that do not block (dequeue) require periodic updates to
      maintain accurate share values.  (Note: group entities with several threads are
      quite likely to be non-blocking in many circumstances).
      
      By virtue of being long-running however, we will see entity ticks (otherwise
      the required update occurs in dequeue/put and we are done).  Thus we can move
      the detection (and associated work) for these updates into the periodic path.
      
      This restores the 'atomicity' of update_curr() with respect to accounting.
      Signed-off-by: Paul Turner <pjt@google.com>
      Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      LKML-Reference: <20101216031038.067028969@google.com>
      Signed-off-by: Ingo Molnar <mingo@elte.hu>
      43365bd7
  13. 18 Dec 2010, 8 commits
    • irq_work: Use per cpu atomics instead of regular atomics · 20b87691
      Authored by Christoph Lameter
      The irq work queue is a per cpu object and it is sufficient for
      synchronization if per cpu atomics are used. Doing so simplifies
      the code and reduces the overhead of the code.
      
      Before:
      
      christoph@linux-2.6$ size kernel/irq_work.o
         text	   data	    bss	    dec	    hex	filename
          451	      8	      1	    460	    1cc	kernel/irq_work.o
      
      After:
      
      christoph@linux-2.6$ size kernel/irq_work.o 
         text	   data	    bss	    dec	    hex	filename
          438	      8	      1	    447	    1bf	kernel/irq_work.o
      
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Signed-off-by: Christoph Lameter <cl@linux.com>
      20b87691
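      A schematic illustration of the idea (the per-cpu variable and the
      before/after operations below are examples, not the literal diff):

          /* state that only its owning CPU touches can use this_cpu_*()
           * accessors instead of full atomics on a per-cpu variable */
          static DEFINE_PER_CPU(struct irq_work *, irq_work_list);

          /* before (schematic): cmpxchg(&__get_cpu_var(irq_work_list), old, new);
           * after:              this_cpu_cmpxchg(irq_work_list, old, new);       */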
    • rcu: reduce __call_rcu()-induced contention on rcu_node structures · b52573d2
      Authored by Paul E. McKenney
      When the current __call_rcu() function was written, the expedited
      APIs did not exist.  The __call_rcu() implementation therefore went
      to great lengths to detect the end of old grace periods and to start
      new ones, all in the name of reducing grace-period latency.  Now the
      expedited APIs do exist, and the usage of __call_rcu() has increased
      considerably.  This commit therefore causes __call_rcu() to avoid
      worrying about grace periods unless there are a large number of
      RCU callbacks stacked up on the current CPU.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      b52573d2
    • rcu: limit rcu_node leaf-level fanout · 0209f649
      Authored by Paul E. McKenney
      Some recent benchmarks have indicated possible lock contention on the
      leaf-level rcu_node locks.  This commit therefore limits the number of
      CPUs per leaf-level rcu_node structure to 16, in other words, there
      can be at most 16 rcu_data structures fanning into a given rcu_node
      structure.  Prior to this, the limit was 32 on 32-bit systems and 64 on
      64-bit systems.
      
      Note that the fanout of non-leaf rcu_node structures is unchanged.  The
      organization of accesses to the rcu_node tree is such that references
      to non-leaf rcu_node structures are much less frequent than to the
      leaf structures.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      0209f649
    • rcu: fine-tune grace-period begin/end checks · 121dfc4b
      Authored by Paul E. McKenney
      Use the CPU's bit in rnp->qsmask to determine whether or not the CPU
      should try to report a quiescent state.  Handle overflow in the check
      for rdp->gpnum having fallen behind.
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      121dfc4b
    • rcu: Keep gpnum and completed fields synchronized · 5ff8e6f0
      Authored by Frederic Weisbecker
      When a CPU that was in an extended quiescent state wakes
      up and catches up with grace periods that remote CPUs
      completed on its behalf, we update the completed field
      but not gpnum, which then keeps a stale value from an earlier
      grace period ID.
      
      Later, note_new_gpnum() will interpret the mismatch between
      the local CPU's and the node's grace period IDs as a new grace
      period to handle and will then start hunting for a quiescent state.
      
      But if every grace period has already been completed, this
      interpretation breaks down, and we'll be stuck in clusters
      of spurious softirqs because rcu_report_qs_rdp() will make
      this broken state run into an infinite loop.
      
      The solution, as suggested by Lai Jiangshan, is to ensure that
      the gpnum and completed fields are kept synchronized when we catch
      up with grace periods that other CPUs completed on our behalf.
      This way we won't start noting spurious new grace periods.
      Suggested-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      5ff8e6f0
    • rcu: Stop chasing QS if another CPU did it for us · 20377f32
      Authored by Frederic Weisbecker
      When a CPU is idle and other CPUs handle its extended
      quiescent state to complete grace periods on its behalf,
      it will catch up with the completed grace period numbers
      when it wakes up.
      
      But at this point there might be no more grace periods to
      complete; still, the woken CPU keeps its stale
      qs_pending value and will then continue to chase quiescent
      states even though it's no longer needed.
      
      This results in clusters of spurious softirqs until a new
      real grace period is started. If we continue to
      chase quiescent states even though every grace
      period has completed, rcu_report_qs_rdp() is puzzled and makes that
      state run into an infinite loop.
      
      As suggested by Lai Jiangshan, just reset qs_pending if
      someone completed every grace periods on our behalf.
      Suggested-by: Lai Jiangshan <laijs@cn.fujitsu.com>
      Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Ingo Molnar <mingo@elte.hu>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      20377f32
    • rcu: increase synchronize_sched_expedited() batching · e27fc964
      Authored by Tejun Heo
      The fix in commit #6a0cc49 requires more than three concurrent instances
      of synchronize_sched_expedited() before batching is possible.  This
      patch uses a ticket-counter-like approach that is also not unrelated to
      Lai Jiangshan's Ring RCU to allow sharing of expedited grace periods even
      when there are only two concurrent instances of synchronize_sched_expedited().
      
      This commit builds on Tejun's original posting, which may be found at
      http://lkml.org/lkml/2010/11/9/204, adding memory barriers, avoiding
      overflow of signed integers (other than via atomic_t), and fixing the
      detection of batching.
      Signed-off-by: Tejun Heo <tj@kernel.org>
      Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      e27fc964
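      A very rough conceptual sketch of ticket-counter batching (this is not the
      upstream code; the helper below is hypothetical and the memory barriers the
      real patch adds are elided):

          static atomic_t sync_started = ATOMIC_INIT(0);
          static atomic_t sync_done    = ATOMIC_INIT(0);

          void expedited_sync_sketch(void)
          {
                  int snap = atomic_inc_return(&sync_started);  /* take a ticket */

                  while (!try_force_grace_period()) {     /* hypothetical helper */
                          /* someone else's expedited grace period covered us? */
                          if (atomic_read(&sync_done) - snap >= 0)
                                  return;
                          cpu_relax();
                  }
                  /* publish that grace periods up to 'snap' are done, keeping
                   * the counter monotonic */
                  for (;;) {
                          int s = atomic_read(&sync_done);
                          if (s - snap >= 0 || atomic_cmpxchg(&sync_done, s, snap) == s)
                                  break;
                  }
          }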
    • resources: add arch hook for preventing allocation in reserved areas · fcb11918
      Authored by Bjorn Helgaas
      This adds arch_remove_reservations(), which an arch can implement if it
      needs to protect part of the address space from allocation.
      
      Sometimes that can be done by just putting a region in the resource tree,
      but there are cases where that doesn't work well.  For example, x86 BIOS
      E820 reservations are not related to devices, so they may overlap part of,
      all of, or more than a device resource, so they may not end up at the
      correct spot in the resource tree.
      Acked-by: H. Peter Anvin <hpa@zytor.com>
      Signed-off-by: Bjorn Helgaas <bjorn.helgaas@hp.com>
      Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>
      fcb11918
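      The hook shape this describes, sketched: a weak no-op default in the generic
      resource code, which an architecture overrides to clip reserved ranges (for
      example x86 E820 reservations) out of the candidate region.

          /* generic default: nothing to remove from the available region */
          void __weak arch_remove_reservations(struct resource *avail)
          {
          }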