1. 21 Jan 2013, 1 commit
  2. 20 Dec 2012, 1 commit
  3. 18 Dec 2012, 1 commit
  4. 14 Dec 2012, 1 commit
  5. 11 Dec 2012, 11 commits
    • M
      mm: sched: numa: Delay PTE scanning until a task is scheduled on a new node · 5bca2303
       Committed by Mel Gorman
      Due to the fact that migrations are driven by the CPU a task is running
      on there is no point tracking NUMA faults until one task runs on a new
      node. This patch tracks the first node used by an address space. Until
      it changes, PTE scanning is disabled and no NUMA hinting faults are
      trapped. This should help workloads that are short-lived, do not care
      about NUMA placement or have bound themselves to a single node.
      
      This takes advantage of the logic in "mm: sched: numa: Implement slow
      start for working set sampling" to delay when the checks are made. This
      will take advantage of processes that set their CPU and node bindings
      early in their lifetime. It will also potentially allow any initial load
      balancing to take place.
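
       As a rough illustration, the gate amounts to a check like the
       following at scan time (a minimal sketch; the field and helper
       names are illustrative, not necessarily those in the patch):

          /* Skip PTE scanning until the task leaves the node on
           * which the address space first ran. */
          static bool numa_scan_worthwhile(struct task_struct *p)
          {
                  struct mm_struct *mm = p->mm;
                  int nid = numa_node_id();

                  if (mm->first_nid == -1)        /* first use: record it */
                          mm->first_nid = nid;

                  return mm->first_nid != nid;    /* scan only after a move */
          }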
       Signed-off-by: Mel Gorman <mgorman@suse.de>
      5bca2303
    • M
      mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG · 3105b86a
       Committed by Mel Gorman
      The "mm: sched: numa: Control enabling and disabling of NUMA balancing"
      depends on scheduling debug being enabled but it's perfectly legimate to
      disable automatic NUMA balancing even without this option. This should
      take care of it.
       Signed-off-by: Mel Gorman <mgorman@suse.de>
      3105b86a
    • M
      mm: sched: numa: Control enabling and disabling of NUMA balancing · 1a687c2e
       Committed by Mel Gorman
      This patch adds Kconfig options and kernel parameters to allow the
       enabling and disabling of automatic NUMA balancing. The existence
      of such a switch was and is very important when debugging problems
      related to transparent hugepages and we should have the same for
      automatic NUMA placement.
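
       For reference, assuming the numa_balancing= parameter name this
       patch introduces, the switch is flipped from the kernel command
       line:

          numa_balancing=disable    (or numa_balancing=enable)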
       Signed-off-by: Mel Gorman <mgorman@suse.de>
      1a687c2e
    • M
      mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate · b8593bfd
       Committed by Mel Gorman
      The PTE scanning rate and fault rates are two of the biggest sources of
      system CPU overhead with automatic NUMA placement.  Ideally a proper policy
      would detect if a workload was properly placed, schedule and adjust the
      PTE scanning rate accordingly. We do not track the necessary information
      to do that but we at least know if we migrated or not.
      
      This patch scans slower if a page was not migrated as the result of a
      NUMA hinting fault up to sysctl_numa_balancing_scan_period_max which is
      now higher than the previous default. Once every minute it will reset
      the scanner in case of phase changes.
      
      This is hilariously crude and the numbers are arbitrary. Workloads will
      converge quite slowly in comparison to what a proper policy should be able
      to do. On the plus side, we will chew up less CPU for workloads that have
      no need for automatic balancing.
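
       A minimal sketch of the back-off (the increment is as arbitrary
       as the changelog admits; names follow the surrounding series):

          void task_numa_fault(int node, int pages, bool migrated)
          {
                  struct task_struct *p = current;

                  /* Page was already placed correctly (no migration):
                   * scan slower, up to the sysctl maximum. */
                  if (!migrated)
                          p->numa_scan_period = min(sysctl_numa_balancing_scan_period_max,
                                                    p->numa_scan_period + 10);

                  task_numa_placement(p);
          }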
       Signed-off-by: Mel Gorman <mgorman@suse.de>
      b8593bfd
    • M
      sched: numa: Slowly increase the scanning period as NUMA faults are handled · fb003b80
       Committed by Mel Gorman
      Currently the rate of scanning for an address space is controlled
      by the individual tasks. The next scan is simply determined by
      2*p->numa_scan_period.
      
      The 2*p->numa_scan_period is arbitrary and never changes. At this point
      there is still no proper policy that decides if a task or process is
      properly placed. It just scans and assumes the next NUMA fault will
      place it properly. As it is assumed that pages will get properly placed
      over time, increase the scan window each time a fault is incurred. This
      is a big assumption as noted in the comments.
      
      It should be noted that changing to p->numa_scan_period will increase
      system CPU usage because now the scanning rate has effectively doubled.
      If that is a problem then the min_rate should be made 200ms instead of
      restoring the 2* logic.
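
       In sketch form (illustrative, not the literal diff):

          /* Before: next scan pinned at now + 2*p->numa_scan_period.
           * After: the period itself grows as faults are handled,
           * assuming pages get placed properly over time. */
          mm->numa_next_scan = now + msecs_to_jiffies(p->numa_scan_period);
          p->numa_scan_period = min(sysctl_numa_balancing_scan_period_max,
                                    p->numa_scan_period * 2);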
       Signed-off-by: Mel Gorman <mgorman@suse.de>
      fb003b80
    • M
      mm: numa: Rate limit setting of pte_numa if node is saturated · e14808b4
       Committed by Mel Gorman
      If there are a large number of NUMA hinting faults and all of them
      are resulting in migrations it may indicate that memory is just
      bouncing uselessly around. NUMA balancing cost is likely exceeding
      any benefit from locality. Rate limit the PTE updates if the node
      is migration rate-limited. As noted in the comments, this distorts
      the NUMA faulting statistics.
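
       Conceptually the guard is a one-liner in the scanner (sketch;
       migrate_ratelimited() stands in for the rate-limit test used by
       this series):

          /* Don't mark more PTEs pte_numa while the local node is
           * migration rate-limited: the resulting hinting faults
           * could not migrate anything anyway. */
          if (migrate_ratelimited(numa_node_id()))
                  return;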
       Signed-off-by: Mel Gorman <mgorman@suse.de>
      e14808b4
    • P
      mm: sched: numa: Implement slow start for working set sampling · 4b96a29b
       Committed by Peter Zijlstra
      Add a 1 second delay before starting to scan the working set of
      a task and starting to balance it amongst nodes.
      
      [ note that before the constant per task WSS sampling rate patch
        the initial scan would happen much later still, in effect that
        patch caused this regression. ]
      
      The theory is that short-run tasks benefit very little from NUMA
      placement: they come and go, and they better stick to the node
       they were started on. As tasks mature and rebalance to other CPUs
       and nodes, their NUMA placement has to change, and it starts to
       matter more and more.
      
      In practice this change fixes an observable kbuild regression:
      
         # [ a perf stat --null --repeat 10 test of ten bzImage builds to /dev/shm ]
      
         !NUMA:
         45.291088843 seconds time elapsed                                          ( +-  0.40% )
         45.154231752 seconds time elapsed                                          ( +-  0.36% )
      
         +NUMA, no slow start:
         46.172308123 seconds time elapsed                                          ( +-  0.30% )
         46.343168745 seconds time elapsed                                          ( +-  0.25% )
      
         +NUMA, 1 sec slow start:
         45.224189155 seconds time elapsed                                          ( +-  0.25% )
         45.160866532 seconds time elapsed                                          ( +-  0.17% )
      
      and it also fixes an observable perf bench (hackbench) regression:
      
         # perf stat --null --repeat 10 perf bench sched messaging
      
         -NUMA:                  0.246225691 seconds time elapsed                   ( +-  1.31% )
         +NUMA no slow start:    0.252620063 seconds time elapsed                   ( +-  1.13% )
      
         +NUMA 1sec delay:       0.248076230 seconds time elapsed                   ( +-  1.35% )
      
      The implementation is simple and straightforward, most of the patch
      deals with adding the /proc/sys/kernel/numa_balancing_scan_delay_ms tunable
      knob.
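
       The knob follows the usual sysctl pattern, e.g.:

          # raise the delay before the first scan to 2 seconds:
          echo 2000 > /proc/sys/kernel/numa_balancing_scan_delay_ms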
       Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      [ Wrote the changelog, ran measurements, tuned the default. ]
       Signed-off-by: Ingo Molnar <mingo@kernel.org>
       Signed-off-by: Mel Gorman <mgorman@suse.de>
       Reviewed-by: Rik van Riel <riel@redhat.com>
      4b96a29b
    • M
      sched, numa, mm: Count WS scanning against present PTEs, not virtual memory ranges · 9f40604c
       Committed by Mel Gorman
      By accounting against the present PTEs, scanning speed reflects the
      actual present (mapped) memory.
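
       A minimal sketch of the accounting change (assuming, as this
       patch does, that change_prot_numa() returns the number of present
       PTEs it updated):

          long pages = sysctl_numa_balancing_scan_size;
          pages <<= 20 - PAGE_SHIFT;      /* scan size is in MB */

          for (vma = mm->mmap; vma && pages > 0; vma = vma->vm_next) {
                  /* Charge the budget by PTEs actually updated,
                   * not by the virtual range walked. */
                  pages -= change_prot_numa(vma, vma->vm_start, vma->vm_end);
          }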
       Suggested-by: Ingo Molnar <mingo@kernel.org>
       Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      Cc: Mel Gorman <mgorman@suse.de>
       Signed-off-by: Ingo Molnar <mingo@kernel.org>
       Signed-off-by: Mel Gorman <mgorman@suse.de>
      9f40604c
    • P
      mm: sched: numa: Implement constant, per task Working Set Sampling (WSS) rate · 6e5fb223
       Committed by Peter Zijlstra
      Previously, to probe the working set of a task, we'd use
      a very simple and crude method: mark all of its address
      space PROT_NONE.
      
      That method has various (obvious) disadvantages:
      
       - it samples the working set at dissimilar rates,
         giving some tasks a sampling quality advantage
         over others.
      
       - creates performance problems for tasks with very
         large working sets
      
       - over-samples processes with large address spaces but
         which only very rarely execute
      
      Improve that method by keeping a rotating offset into the
      address space that marks the current position of the scan,
      and advance it by a constant rate (in a CPU cycles execution
      proportional manner). If the offset reaches the last mapped
       address of the mm then it starts over at the first
      address.
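
       A minimal sketch of that rotating window (the field name and the
       last_mapped_address() helper are illustrative):

          unsigned long start = mm->numa_scan_offset;
          unsigned long end   = start + (256UL << 20);    /* 256MB */

          if (start >= last_mapped_address(mm)) {         /* wrap */
                  start = 0;
                  end = 256UL << 20;
          }
          /* ... mark PTEs in [start, end) for hinting faults ... */
          mm->numa_scan_offset = end;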
      
      The per-task nature of the working set sampling functionality in this tree
      allows such constant rate, per task, execution-weight proportional sampling
      of the working set, with an adaptive sampling interval/frequency that
      goes from once per 100ms up to just once per 8 seconds.  The current
      sampling volume is 256 MB per interval.
      
      As tasks mature and converge their working set, so does the
      sampling rate slow down to just a trickle, 256 MB per 8
      seconds of CPU time executed.
      
       This, beyond being adaptive, also rate-limits rarely
       executing tasks and does not over-sample on overloaded
       systems.
      
      [ In AutoNUMA speak, this patch deals with the effective sampling
        rate of the 'hinting page fault'. AutoNUMA's scanning is
        currently rate-limited, but it is also fundamentally
        single-threaded, executing in the knuma_scand kernel thread,
        so the limit in AutoNUMA is global and does not scale up with
        the number of CPUs, nor does it scan tasks in an execution
        proportional manner.
      
        So the idea of rate-limiting the scanning was first implemented
        in the AutoNUMA tree via a global rate limit. This patch goes
        beyond that by implementing an execution rate proportional
        working set sampling rate that is not implemented via a single
        global scanning daemon. ]
      
      [ Dan Carpenter pointed out a possible NULL pointer dereference in the
        first version of this patch. ]
       Based-on-idea-by: Andrea Arcangeli <aarcange@redhat.com>
       Bug-Found-By: Dan Carpenter <dan.carpenter@oracle.com>
       Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      [ Wrote changelog and fixed bug. ]
       Signed-off-by: Ingo Molnar <mingo@kernel.org>
       Signed-off-by: Mel Gorman <mgorman@suse.de>
       Reviewed-by: Rik van Riel <riel@redhat.com>
      6e5fb223
    • P
      mm: numa: Add fault driven placement and migration · cbee9f88
       Committed by Peter Zijlstra
      NOTE: This patch is based on "sched, numa, mm: Add fault driven
      	placement and migration policy" but as it throws away all the policy
      	to just leave a basic foundation I had to drop the signed-offs-by.
      
      This patch creates a bare-bones method for setting PTEs pte_numa in the
      context of the scheduler that when faulted later will be faulted onto the
      node the CPU is running on.  In itself this does nothing useful but any
      placement policy will fundamentally depend on receiving hints on placement
      from fault context and doing something intelligent about it.
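
       On the fault side, the hint boils down to something like this
       sketch (migrate_misplaced_page() is the helper style used
       elsewhere in this series; placement policy proper comes later):

          /* A pte_numa fault pulls the page toward the node of the
           * CPU that touched it. */
          int target_nid = numa_node_id();

          if (page_to_nid(page) != target_nid)
                  migrate_misplaced_page(page, target_nid);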
       Signed-off-by: Mel Gorman <mgorman@suse.de>
       Acked-by: Rik van Riel <riel@redhat.com>
      cbee9f88
    • I
      Revert "sched/autogroup: Fix crash on reboot when autogroup is disabled" · c1ad41f1
       Committed by Ingo Molnar
      This reverts commit 5258f386,
      because the underlying autogroups bug got fixed upstream in
      a better way, via:
      
        fd8ef117 Revert "sched, autogroup: Stop going ahead if autogroup is disabled"
      
      Cc: Mike Galbraith <efault@gmx.de>
      Cc: Yong Zhang <yong.zhang0@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
       Signed-off-by: Ingo Molnar <mingo@kernel.org>
      c1ad41f1
  6. 04 Dec 2012, 1 commit
    • M
      Revert "sched, autogroup: Stop going ahead if autogroup is disabled" · fd8ef117
       Committed by Mike Galbraith
      This reverts commit 800d4d30.
      
      Between commits 8323f26c ("sched: Fix race in task_group()") and
      800d4d30 ("sched, autogroup: Stop going ahead if autogroup is
      disabled"), autogroup is a wreck.
      
      With both applied, all you have to do to crash a box is disable
      autogroup during boot up, then reboot..  boom, NULL pointer dereference
      due to commit 800d4d30 not allowing autogroup to move things, and
      commit 8323f26c making that the only way to switch runqueues:
      
        BUG: unable to handle kernel NULL pointer dereference at           (null)
        IP: [<ffffffff81063ac0>] effective_load.isra.43+0x50/0x90
        Pid: 7047, comm: systemd-user-se Not tainted 3.6.8-smp #7 MEDIONPC MS-7502/MS-7502
        RIP: effective_load.isra.43+0x50/0x90
        Process systemd-user-se (pid: 7047, threadinfo ffff880221dde000, task ffff88022618b3a0)
        Call Trace:
          select_task_rq_fair+0x255/0x780
          try_to_wake_up+0x156/0x2c0
          wake_up_state+0xb/0x10
          signal_wake_up+0x28/0x40
          complete_signal+0x1d6/0x250
          __send_signal+0x170/0x310
          send_signal+0x40/0x80
          do_send_sig_info+0x47/0x90
          group_send_sig_info+0x4a/0x70
          kill_pid_info+0x3a/0x60
          sys_kill+0x97/0x1a0
          ? vfs_read+0x120/0x160
          ? sys_read+0x45/0x90
          system_call_fastpath+0x16/0x1b
        Code: 49 0f af 41 50 31 d2 49 f7 f0 48 83 f8 01 48 0f 46 c6 48 2b 07 48 8b bf 40 01 00 00 48 85 ff 74 3a 45 31 c0 48 8b 8f 50 01 00 00 <48> 8b 11 4c 8b 89 80 00 00 00 49 89 d2 48 01 d0 45 8b 59 58 4c
        RIP  [<ffffffff81063ac0>] effective_load.isra.43+0x50/0x90
         RSP <ffff880221ddfbd8>
        CR2: 0000000000000000
       Signed-off-by: Mike Galbraith <efault@gmx.de>
       Acked-by: Ingo Molnar <mingo@kernel.org>
      Cc: Yong Zhang <yong.zhang0@gmail.com>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: stable@vger.kernel.org # 2.6.39+
       Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
      fd8ef117
  7. 01 Dec 2012, 1 commit
    • F
       context_tracking: New context tracking subsystem · 91d1aa43
       Committed by Frederic Weisbecker
      Create a new subsystem that probes on kernel boundaries
      to keep track of the transitions between level contexts
      with two basic initial contexts: user or kernel.
      
       This is an abstraction of some RCU code that uses such tracking
      to implement its userspace extended quiescent state.
      
      We need to pull this up from RCU into this new level of indirection
      because this tracking is also going to be used to implement an "on
      demand" generic virtual cputime accounting. A necessary step to
       shut down the tick while still accounting the cputime.
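
       The boundary probes are a pair of hooks called from arch entry
       and exit paths; a minimal sketch of their use:

          #include <linux/context_tracking.h>

          /* kernel -> user transition (e.g. syscall return path) */
          user_enter();

          /* user -> kernel transition (e.g. syscall/exception entry) */
          user_exit();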
       Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Gilad Ben-Yossef <gilad@benyossef.com>
       Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      [ paulmck: fix whitespace error and email address. ]
       Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      91d1aa43
  8. 29 Nov 2012, 4 commits
    • F
      cputime: Comment cputime's adjusting code · fa092057
       Committed by Frederic Weisbecker
      The reason for the scaling and monotonicity correction performed
      by cputime_adjust() may not be immediately clear to the reviewer.
      
      Add some comments to explain what happens there.
       Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      fa092057
    • F
      cputime: Consolidate cputime adjustment code · d37f761d
       Committed by Frederic Weisbecker
      task_cputime_adjusted() and thread_group_cputime_adjusted()
      essentially share the same code. They just don't use the same
      source:
      
      * The first function uses the cputime in the task struct and the
      previous adjusted snapshot that ensures monotonicity.
      
      * The second adds the cputime of all tasks in the group and the
      previous adjusted snapshot of the whole group from the signal
      structure.
      
      Just consolidate the common code that does the adjustment. These
      functions just need to fetch the values from the appropriate
      source.
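
       In sketch form the consolidation looks as follows (close to,
       but not necessarily identical to, the patch):

          void task_cputime_adjusted(struct task_struct *p,
                                     cputime_t *ut, cputime_t *st)
          {
                  struct task_cputime cputime = {
                          .utime = p->utime,
                          .stime = p->stime,
                          .sum_exec_runtime = p->se.sum_exec_runtime,
                  };

                  /* Same adjustment code as the thread-group variant;
                   * only the source of the raw values differs. */
                  cputime_adjust(&cputime, &p->prev_cputime, ut, st);
          }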
       Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      d37f761d
    • F
      cputime: Rename thread_group_times to thread_group_cputime_adjusted · e80d0a1a
       Committed by Frederic Weisbecker
      We have thread_group_cputime() and thread_group_times(). The naming
      doesn't provide enough information about the difference between
      these two APIs.
      
      To lower the confusion, rename thread_group_times() to
      thread_group_cputime_adjusted(). This name better suggests that
      it's a version of thread_group_cputime() that does some stabilization
       on the raw cputime values, i.e. scaling on top of CFS runtime
       stats and enforcing a lower bound for monotonicity.
       Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      e80d0a1a
    • F
      cputime: Move thread_group_cputime() to sched code · a634f933
       Committed by Frederic Weisbecker
      thread_group_cputime() is a general cputime API that is not only
      used by posix cpu timer. Let's move this helper to sched code.
       Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      a634f933
  9. 28 Nov 2012, 1 commit
  10. 20 Nov 2012, 2 commits
  11. 19 Nov 2012, 3 commits
    • F
      vtime: No need to disable irqs on vtime_account() · 1017769b
       Committed by Frederic Weisbecker
      vtime_account() is only called from irq entry. irqs
      are always disabled at this point so we can safely
      remove the irq disabling guards on that function.
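
       The result is simply the unguarded body (sketch, close to the
       patch):

          void vtime_account(struct task_struct *tsk)
          {
                  /* Only reached from irq entry, so irqs are already
                   * off; no local_irq_save()/restore() pair needed. */
                  if (in_interrupt() || !is_idle_task(tsk))
                          vtime_account_system(tsk);
                  else
                          vtime_account_idle(tsk);
          }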
       Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
       Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      1017769b
    • F
      vtime: Consolidate a bit the ctx switch code · e3942ba0
       Committed by Frederic Weisbecker
       On ia64 and powerpc, the vtime context switch only consists of
       flushing pending system and user time, plus a bit of arch
       housekeeping.
      
      Consolidate that into a generic implementation. s390 is
       a special case because pending user and system time are hard
       to dissociate there, so it keeps its own implementation.
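
       The generic version reads roughly as follows (sketch):

          void vtime_task_switch(struct task_struct *prev)
          {
                  /* Flush the pending time of the task switching out. */
                  if (is_idle_task(prev))
                          vtime_account_idle(prev);
                  else
                          vtime_account_system(prev);

                  vtime_account_user(prev);
                  arch_vtime_task_switch(prev);   /* arch housekeeping */
          }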
       Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
       Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      e3942ba0
    • F
      vtime: Remove the underscore prefix invasion · fd25b4c2
       Committed by Frederic Weisbecker
       Prepending irq-unsafe vtime APIs with underscores was actually
       a bad idea: the result is a big mess in the API namespace that
       is only waiting to be extended further. Also, these helpers
       are always called from irq-safe callers, except in KVM. Just
      provide a vtime_account_system_irqsafe() for this specific
      case so that we can remove the underscore prefix on other
      vtime functions.
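
       The helper for that one case is just an irq-guarded wrapper
       (sketch):

          void vtime_account_system_irqsafe(struct task_struct *tsk)
          {
                  unsigned long flags;

                  local_irq_save(flags);
                  vtime_account_system(tsk);
                  local_irq_restore(flags);
          }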
       Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
       Reviewed-by: Steven Rostedt <rostedt@goodmis.org>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      fd25b4c2
  12. 17 Nov 2012, 1 commit
    • P
      sched: Mark RCU reader in sched_show_task() · 4e79752c
       Committed by Paul E. McKenney
      When sched_show_task() is invoked from try_to_freeze_tasks(), there is
      no RCU read-side critical section, resulting in the following splat:
      
      [  125.780730] ===============================
      [  125.780766] [ INFO: suspicious RCU usage. ]
      [  125.780804] 3.7.0-rc3+ #988 Not tainted
      [  125.780838] -------------------------------
      [  125.780875] /home/rafael/src/linux/kernel/sched/core.c:4497 suspicious rcu_dereference_check() usage!
      [  125.780946]
      [  125.780946] other info that might help us debug this:
      [  125.780946]
      [  125.781031]
      [  125.781031] rcu_scheduler_active = 1, debug_locks = 0
      [  125.781087] 4 locks held by s2ram/4211:
      [  125.781120]  #0:  (&buffer->mutex){+.+.+.}, at: [<ffffffff811e2acf>] sysfs_write_file+0x3f/0x160
      [  125.781233]  #1:  (s_active#94){.+.+.+}, at: [<ffffffff811e2b58>] sysfs_write_file+0xc8/0x160
      [  125.781339]  #2:  (pm_mutex){+.+.+.}, at: [<ffffffff81090a81>] pm_suspend+0x81/0x230
      [  125.781439]  #3:  (tasklist_lock){.?.?..}, at: [<ffffffff8108feed>] try_to_freeze_tasks+0x2cd/0x3f0
      [  125.781543]
      [  125.781543] stack backtrace:
      [  125.781584] Pid: 4211, comm: s2ram Not tainted 3.7.0-rc3+ #988
      [  125.781632] Call Trace:
      [  125.781662]  [<ffffffff810a3c73>] lockdep_rcu_suspicious+0x103/0x140
      [  125.781719]  [<ffffffff8107cf21>] sched_show_task+0x121/0x180
      [  125.781770]  [<ffffffff8108ffb4>] try_to_freeze_tasks+0x394/0x3f0
      [  125.781823]  [<ffffffff810903b5>] freeze_kernel_threads+0x25/0x80
      [  125.781876]  [<ffffffff81090b65>] pm_suspend+0x165/0x230
      [  125.781924]  [<ffffffff8108fa29>] state_store+0x99/0x100
      [  125.781975]  [<ffffffff812f5867>] kobj_attr_store+0x17/0x20
      [  125.782038]  [<ffffffff811e2b71>] sysfs_write_file+0xe1/0x160
      [  125.782091]  [<ffffffff811667a6>] vfs_write+0xc6/0x180
      [  125.782138]  [<ffffffff81166ada>] sys_write+0x5a/0xa0
      [  125.782185]  [<ffffffff812ff6ae>] ? trace_hardirqs_on_thunk+0x3a/0x3f
      [  125.782242]  [<ffffffff81669dd2>] system_call_fastpath+0x16/0x1b
      
      This commit therefore adds the needed RCU read-side critical section.
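
       The fix is the canonical pattern: the parent-pointer access is
       wrapped in an RCU read-side critical section (sketch; the exact
       fields printed by sched_show_task() may differ):

          pid_t ppid;

          rcu_read_lock();
          ppid = task_pid_nr(rcu_dereference(p->real_parent));
          rcu_read_unlock();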
      Reported-by: N"Rafael J. Wysocki" <rjw@sisk.pl>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      4e79752c
  13. 30 Oct 2012, 3 commits
    • M
      sched/autogroup: Fix crash on reboot when autogroup is disabled · 5258f386
       Committed by Mike Galbraith
      Due to these two commits:
      
        8323f26c sched: Fix race in task_group()
        800d4d30 sched, autogroup: Stop going ahead if autogroup is disabled
      
      ... autogroup scheduling's dynamic knobs are wrecked.
      
      With both patches applied, all you have to do to crash a box is
      disable autogroup during boot up, then reboot.. boom, NULL pointer
      dereference due to 800d4d30 not allowing autogroup to move things,
      and 8323f26c making that the only way to switch runqueues.
      
      Remove most of the (dysfunctional) knobs and turn the remaining
      sched_autogroup_enabled knob readonly.
      
      If the user fiddles with cgroups hereafter, once tasks
      are moved, autogroup won't mess with them again unless
      they call setsid().
      
      No knobs, no glitz, nada, just a cute little thing folks can
      turn on if they don't want to muck about with cgroups and/or
      systemd.
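
       After this, the only visible control is the read-only flag:

          # 1 if autogroup is active, 0 if it was disabled at boot
          cat /proc/sys/kernel/sched_autogroup_enabled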
       Signed-off-by: Mike Galbraith <efault@gmx.de>
      Cc: Xiaotian Feng <xtfeng@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Xiaotian Feng <dannyfeng@tencent.com>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: <stable@vger.kernel.org> # v3.6
       Link: http://lkml.kernel.org/r/1351451963.4999.8.camel@maggy.simpson.net
       Signed-off-by: Ingo Molnar <mingo@kernel.org>
      5258f386
    • F
      cputime: Separate irqtime accounting from generic vtime · 3e1df4f5
       Committed by Frederic Weisbecker
      vtime_account() doesn't have the same role in
      CONFIG_VIRT_CPU_ACCOUNTING and CONFIG_IRQ_TIME_ACCOUNTING.
      
      In the first case it handles time accounting in any context. In
      the second case it only handles irq time accounting.
      
       So when vtime_account() is called from outside vtime_account_irq_*(),
       the call is pointless under CONFIG_IRQ_TIME_ACCOUNTING.
      
      To fix the confusion, change vtime_account() to irqtime_account_irq()
      in CONFIG_IRQ_TIME_ACCOUNTING. This way we ensure future account_vtime()
       calls won't waste cycles in the irqtime APIs.
       Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      3e1df4f5
    • F
      vtime: Make vtime_account_system() irqsafe · 11113334
       Committed by Frederic Weisbecker
       vtime_account_system() currently has only one caller,
       vtime_account(), which is irq-safe.
      
      Now we are going to call it from other places like kvm where
      irqs are not always disabled by the time we account the cputime.
      
      So let's make it irqsafe. The arch implementation part is now
      prefixed with "__".
      
      vtime_account_idle() arch implementation is prefixed accordingly
      to stay consistent.
       Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Tony Luck <tony.luck@intel.com>
      Cc: Fenghua Yu <fenghua.yu@intel.com>
      Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
      Cc: Paul Mackerras <paulus@samba.org>
      Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
      Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
      11113334
  14. 24 Oct 2012, 9 commits