1. 23 4月, 2013 3 次提交
    • F
      nohz: Re-evaluate the tick for the new task after a context switch · 99e5ada9
      Frederic Weisbecker 提交于
      When a task is scheduled in, it may have some properties
      of its own that could make the CPU reconsider the need for
      the tick: posix cpu timers, perf events, ...
      
      So notify the full dynticks subsystem when a task gets
      scheduled in and re-check the tick dependency at this
      stage. This is done through a self IPI to avoid messing
      up with any current lock scenario.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Gilad Ben Yossef <gilad@benyossef.com>
      Cc: Hakan Akkan <hakanakkan@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Kevin Hilman <khilman@linaro.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      99e5ada9
    • F
      nohz: Re-evaluate the tick from the scheduler IPI · ff442c51
      Frederic Weisbecker 提交于
      The scheduler IPI is used by the scheduler to kick
      full dynticks CPUs asynchronously when more than one
      task are running or when a new timer list timer is
      enqueued. This way the destination CPU can decide
      to restart the tick to handle this new situation.
      
      Now let's call that kick in the scheduler IPI.
      
      (Reusing the scheduler IPI rather than implementing
      a new IPI was suggested by Peter Zijlstra a while ago)
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Gilad Ben Yossef <gilad@benyossef.com>
      Cc: Hakan Akkan <hakanakkan@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Kevin Hilman <khilman@linaro.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      ff442c51
    • F
      sched: New helper to prevent from stopping the tick in full dynticks · ce831b38
      Frederic Weisbecker 提交于
      Provide a new helper to be called from the full dynticks engine
      before stopping the tick in order to make sure we don't stop
      it when there is more than one task running on the CPU.
      
      This way we make sure that the tick stays alive to maintain
      fairness.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Gilad Ben Yossef <gilad@benyossef.com>
      Cc: Hakan Akkan <hakanakkan@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Kevin Hilman <khilman@linaro.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Oleg Nesterov <oleg@redhat.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      ce831b38
  2. 16 4月, 2013 1 次提交
    • F
      nohz: Switch from "extended nohz" to "full nohz" based naming · c5bfece2
      Frederic Weisbecker 提交于
      "Extended nohz" was used as a naming base for the full dynticks
      API and Kconfig symbols. It reflects the fact the system tries
      to stop the tick in more places than just idle.
      
      But that "extended" name is a bit opaque and vague. Rename it to
      "full" makes it clearer what the system tries to do under this
      config: try to shutdown the tick anytime it can. The various
      constraints that prevent that to happen shouldn't be considered
      as fundamental properties of this feature but rather technical
      issues that may be solved in the future.
      Reported-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Gilad Ben Yossef <gilad@benyossef.com>
      Cc: Hakan Akkan <hakanakkan@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Kevin Hilman <khilman@linaro.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      c5bfece2
  3. 03 4月, 2013 1 次提交
    • F
      nohz: Rename CONFIG_NO_HZ to CONFIG_NO_HZ_COMMON · 3451d024
      Frederic Weisbecker 提交于
      We are planning to convert the dynticks Kconfig options layout
      into a choice menu. The user must be able to easily pick
      any of the following implementations: constant periodic tick,
      idle dynticks, full dynticks.
      
      As this implies a mutual exclusion, the two dynticks implementions
      need to converge on the selection of a common Kconfig option in order
      to ease the sharing of a common infrastructure.
      
      It would thus seem pretty natural to reuse CONFIG_NO_HZ to
      that end. It already implements all the idle dynticks code
      and the full dynticks depends on all that code for now.
      So ideally the choice menu would propose CONFIG_NO_HZ_IDLE and
      CONFIG_NO_HZ_EXTENDED then both would select CONFIG_NO_HZ.
      
      On the other hand we want to stay backward compatible: if
      CONFIG_NO_HZ is set in an older config file, we want to
      enable CONFIG_NO_HZ_IDLE by default.
      
      But we can't afford both at the same time or we run into
      a circular dependency:
      
      1) CONFIG_NO_HZ_IDLE and CONFIG_NO_HZ_EXTENDED both select
         CONFIG_NO_HZ
      2) If CONFIG_NO_HZ is set, we default to CONFIG_NO_HZ_IDLE
      
      We might be able to support that from Kconfig/Kbuild but it
      may not be wise to introduce such a confusing behaviour.
      
      So to solve this, create a new CONFIG_NO_HZ_COMMON option
      which gathers the common code between idle and full dynticks
      (that common code for now is simply the idle dynticks code)
      and select it from their referring Kconfig.
      
      Then we'll later create CONFIG_NO_HZ_IDLE and map CONFIG_NO_HZ
      to it for backward compatibility.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Gilad Ben Yossef <gilad@benyossef.com>
      Cc: Hakan Akkan <hakanakkan@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Kevin Hilman <khilman@linaro.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Namhyung Kim <namhyung.kim@lge.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      3451d024
  4. 21 3月, 2013 1 次提交
    • F
      nohz: Wake up full dynticks CPUs when a timer gets enqueued · 1c20091e
      Frederic Weisbecker 提交于
      Wake up a CPU when a timer list timer is enqueued there and
      the target is part of the full dynticks range. Sending an IPI
      to it makes it reconsidering the next timer to program on top
      of recent updates.
      
      This may later be improved by checking if the tick is really
      stopped on the target. This would need some careful
      synchronization though. So deal with such optimization later
      and start simple.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Chris Metcalf <cmetcalf@tilera.com>
      Cc: Christoph Lameter <cl@linux.com>
      Cc: Geoff Levand <geoff@infradead.org>
      Cc: Gilad Ben Yossef <gilad@benyossef.com>
      Cc: Hakan Akkan <hakanakkan@gmail.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Kevin Hilman <khilman@linaro.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Namhyung Kim <namhyung.kim@lge.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      1c20091e
  5. 18 3月, 2013 1 次提交
  6. 08 3月, 2013 1 次提交
    • F
      context_tracking: Restore preempted context state after preempt_schedule_irq() · b22366cd
      Frederic Weisbecker 提交于
      From the context tracking POV, preempt_schedule_irq() behaves pretty much
      like an exception: It can be called anytime and schedule another task.
      
      But currently it doesn't restore the context tracking state of the preempted
      code on preempt_schedule_irq() return.
      
      As a result, if preempt_schedule_irq() is called in the tiny frame between
      user_enter() and the actual return to userspace, we resume userspace with
      the wrong context tracking state.
      
      Fix this by using exception_enter/exit() which are a perfect fit for this
      kind of issue.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Kevin Hilman <khilman@linaro.org>
      Cc: Mats Liljegren <mats.liljegren@enea.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Namhyung Kim <namhyung.kim@lge.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      b22366cd
  7. 06 3月, 2013 2 次提交
  8. 28 2月, 2013 1 次提交
    • S
      hlist: drop the node parameter from iterators · b67bfe0d
      Sasha Levin 提交于
      I'm not sure why, but the hlist for each entry iterators were conceived
      
              list_for_each_entry(pos, head, member)
      
      The hlist ones were greedy and wanted an extra parameter:
      
              hlist_for_each_entry(tpos, pos, head, member)
      
      Why did they need an extra pos parameter? I'm not quite sure. Not only
      they don't really need it, it also prevents the iterator from looking
      exactly like the list iterator, which is unfortunate.
      
      Besides the semantic patch, there was some manual work required:
      
       - Fix up the actual hlist iterators in linux/list.h
       - Fix up the declaration of other iterators based on the hlist ones.
       - A very small amount of places were using the 'node' parameter, this
       was modified to use 'obj->member' instead.
       - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
       properly, so those had to be fixed up manually.
      
      The semantic patch which is mostly the work of Peter Senna Tschudin is here:
      
      @@
      iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
      
      type T;
      expression a,c,d,e;
      identifier b;
      statement S;
      @@
      
      -T b;
          <+... when != b
      (
      hlist_for_each_entry(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue(a,
      - b,
      c) S
      |
      hlist_for_each_entry_from(a,
      - b,
      c) S
      |
      hlist_for_each_entry_rcu(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_rcu_bh(a,
      - b,
      c, d) S
      |
      hlist_for_each_entry_continue_rcu_bh(a,
      - b,
      c) S
      |
      for_each_busy_worker(a, c,
      - b,
      d) S
      |
      ax25_uid_for_each(a,
      - b,
      c) S
      |
      ax25_for_each(a,
      - b,
      c) S
      |
      inet_bind_bucket_for_each(a,
      - b,
      c) S
      |
      sctp_for_each_hentry(a,
      - b,
      c) S
      |
      sk_for_each(a,
      - b,
      c) S
      |
      sk_for_each_rcu(a,
      - b,
      c) S
      |
      sk_for_each_from
      -(a, b)
      +(a)
      S
      + sk_for_each_from(a) S
      |
      sk_for_each_safe(a,
      - b,
      c, d) S
      |
      sk_for_each_bound(a,
      - b,
      c) S
      |
      hlist_for_each_entry_safe(a,
      - b,
      c, d, e) S
      |
      hlist_for_each_entry_continue_rcu(a,
      - b,
      c) S
      |
      nr_neigh_for_each(a,
      - b,
      c) S
      |
      nr_neigh_for_each_safe(a,
      - b,
      c, d) S
      |
      nr_node_for_each(a,
      - b,
      c) S
      |
      nr_node_for_each_safe(a,
      - b,
      c, d) S
      |
      - for_each_gfn_sp(a, c, d, b) S
      + for_each_gfn_sp(a, c, d) S
      |
      - for_each_gfn_indirect_valid_sp(a, c, d, b) S
      + for_each_gfn_indirect_valid_sp(a, c, d) S
      |
      for_each_host(a,
      - b,
      c) S
      |
      for_each_host_safe(a,
      - b,
      c, d) S
      |
      for_each_mesh_entry(a,
      - b,
      c, d) S
      )
          ...+>
      
      [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
      [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
      [akpm@linux-foundation.org: checkpatch fixes]
      [akpm@linux-foundation.org: fix warnings]
      [akpm@linux-foudnation.org: redo intrusive kvm changes]
      Tested-by: NPeter Senna Tschudin <peter.senna@gmail.com>
      Acked-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      Signed-off-by: NSasha Levin <sasha.levin@oracle.com>
      Cc: Wu Fengguang <fengguang.wu@intel.com>
      Cc: Marcelo Tosatti <mtosatti@redhat.com>
      Cc: Gleb Natapov <gleb@redhat.com>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      b67bfe0d
  9. 24 2月, 2013 1 次提交
    • T
      sched: do not use cpu_to_node() to find an offlined cpu's node. · aa00d89c
      Tang Chen 提交于
      If a cpu is offline, its nid will be set to -1, and cpu_to_node(cpu)
      will return -1.  As a result, cpumask_of_node(nid) will return NULL.  In
      this case, find_next_bit() in for_each_cpu will get a NULL pointer and
      cause panic.
      
      Here is a call trace:
        Call Trace:
         <IRQ>
          select_fallback_rq+0x71/0x190
          try_to_wake_up+0x2cb/0x2f0
          wake_up_process+0x15/0x20
          hrtimer_wakeup+0x22/0x30
          __run_hrtimer+0x83/0x320
          hrtimer_interrupt+0x106/0x280
          smp_apic_timer_interrupt+0x69/0x99
          apic_timer_interrupt+0x6f/0x80
      
      There is a hrtimer process sleeping, whose cpu has already been
      offlined.  When it is waken up, it tries to find another cpu to run, and
      get a -1 nid.  As a result, cpumask_of_node(-1) returns NULL, and causes
      ernel panic.
      
      This patch fixes this problem by judging if the nid is -1.  If nid is
      not -1, a cpu on the same node will be picked.  Else, a online cpu on
      another node will be picked.
      Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com>
      Signed-off-by: NWen Congyang <wency@cn.fujitsu.com>
      Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
      Cc: David Rientjes <rientjes@google.com>
      Cc: Jiang Liu <liuj97@gmail.com>
      Cc: Minchan Kim <minchan.kim@gmail.com>
      Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
      Cc: Mel Gorman <mel@csn.ul.ie>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: "H. Peter Anvin" <hpa@zytor.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Signed-off-by: NAndrew Morton <akpm@linux-foundation.org>
      Signed-off-by: NLinus Torvalds <torvalds@linux-foundation.org>
      aa00d89c
  10. 20 2月, 2013 1 次提交
  11. 15 2月, 2013 1 次提交
  12. 08 2月, 2013 1 次提交
  13. 05 2月, 2013 1 次提交
  14. 29 1月, 2013 1 次提交
  15. 28 1月, 2013 1 次提交
    • F
      cputime: Safely read cputime of full dynticks CPUs · 6a61671b
      Frederic Weisbecker 提交于
      While remotely reading the cputime of a task running in a
      full dynticks CPU, the values stored in utime/stime fields
      of struct task_struct may be stale. Its values may be those
      of the last kernel <-> user transition time snapshot and
      we need to add the tickless time spent since this snapshot.
      
      To fix this, flush the cputime of the dynticks CPUs on
      kernel <-> user transition and record the time / context
      where we did this. Then on top of this snapshot and the current
      time, perform the fixup on the reader side from task_times()
      accessors.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Namhyung Kim <namhyung.kim@lge.com>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Paul Gortmaker <paul.gortmaker@windriver.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      [fixed kvm module related build errors]
      Signed-off-by: NSedat Dilek <sedat.dilek@gmail.com>
      6a61671b
  16. 25 1月, 2013 1 次提交
    • L
      sched: split out css_online/css_offline from tg creation/destruction · ace783b9
      Li Zefan 提交于
      This is a preparaton for later patches.
      
      - What do we gain from cpu_cgroup_css_online():
      
      After ss->css_alloc() and before ss->css_online(), there's a small
      window that tg->css.cgroup is NULL. With this change, tg won't be seen
      before ss->css_online(), where it's added to the global list, so we're
      guaranteed we'll never see NULL tg->css.cgroup.
      
      - What do we gain from cpu_cgroup_css_offline():
      
      tg is freed via RCU, so is cgroup. Without this change, This is how
      synchronization works:
      
      cgroup_rmdir()
        no ss->css_offline()
      diput()
        syncornize_rcu()
        ss->css_free()       <-- unregister tg, and free it via call_rcu()
        kfree_rcu(cgroup)    <-- wait possible refs to cgroup, and free cgroup
      
      We can't just kfree(cgroup), because tg might access tg->css.cgroup.
      
      With this change:
      
      cgroup_rmdir()
        ss->css_offline()    <-- unregister tg
      diput()
        synchronize_rcu()    <-- wait possible refs to tg and cgroup
        ss->css_free()       <-- free tg
        kfree_rcu(cgroup)    <-- free cgroup
      
      As you see, kfree_rcu() is redundant now.
      Signed-off-by: NLi Zefan <lizefan@huawei.com>
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Acked-by: NIngo Molnar <mingo@kernel.org>
      ace783b9
  17. 23 1月, 2013 1 次提交
  18. 21 1月, 2013 1 次提交
  19. 19 1月, 2013 1 次提交
    • T
      workqueue: rename kernel/workqueue_sched.h to kernel/workqueue_internal.h · ea138446
      Tejun Heo 提交于
      Workqueue wants to expose more interface internal to kernel/.  Instead
      of adding a new header file, repurpose kernel/workqueue_sched.h.
      Rename it to workqueue_internal.h and add include protector.
      
      This patch doesn't introduce any functional changes.
      Signed-off-by: NTejun Heo <tj@kernel.org>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Ingo Molnar <mingo@redhat.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      ea138446
  20. 11 12月, 2012 5 次提交
    • M
      mm: sched: numa: Control enabling and disabling of NUMA balancing if !SCHED_DEBUG · 3105b86a
      Mel Gorman 提交于
      The "mm: sched: numa: Control enabling and disabling of NUMA balancing"
      depends on scheduling debug being enabled but it's perfectly legimate to
      disable automatic NUMA balancing even without this option. This should
      take care of it.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      3105b86a
    • M
      mm: sched: numa: Control enabling and disabling of NUMA balancing · 1a687c2e
      Mel Gorman 提交于
      This patch adds Kconfig options and kernel parameters to allow the
      enabling and disabling of automatic NUMA balancing. The existance
      of such a switch was and is very important when debugging problems
      related to transparent hugepages and we should have the same for
      automatic NUMA placement.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      1a687c2e
    • M
      mm: sched: Adapt the scanning rate if a NUMA hinting fault does not migrate · b8593bfd
      Mel Gorman 提交于
      The PTE scanning rate and fault rates are two of the biggest sources of
      system CPU overhead with automatic NUMA placement.  Ideally a proper policy
      would detect if a workload was properly placed, schedule and adjust the
      PTE scanning rate accordingly. We do not track the necessary information
      to do that but we at least know if we migrated or not.
      
      This patch scans slower if a page was not migrated as the result of a
      NUMA hinting fault up to sysctl_numa_balancing_scan_period_max which is
      now higher than the previous default. Once every minute it will reset
      the scanner in case of phase changes.
      
      This is hilariously crude and the numbers are arbitrary. Workloads will
      converge quite slowly in comparison to what a proper policy should be able
      to do. On the plus side, we will chew up less CPU for workloads that have
      no need for automatic balancing.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      b8593bfd
    • P
      mm: sched: numa: Implement slow start for working set sampling · 4b96a29b
      Peter Zijlstra 提交于
      Add a 1 second delay before starting to scan the working set of
      a task and starting to balance it amongst nodes.
      
      [ note that before the constant per task WSS sampling rate patch
        the initial scan would happen much later still, in effect that
        patch caused this regression. ]
      
      The theory is that short-run tasks benefit very little from NUMA
      placement: they come and go, and they better stick to the node
      they were started on. As tasks mature and rebalance to other CPUs
      and nodes, so does their NUMA placement have to change and so
      does it start to matter more and more.
      
      In practice this change fixes an observable kbuild regression:
      
         # [ a perf stat --null --repeat 10 test of ten bzImage builds to /dev/shm ]
      
         !NUMA:
         45.291088843 seconds time elapsed                                          ( +-  0.40% )
         45.154231752 seconds time elapsed                                          ( +-  0.36% )
      
         +NUMA, no slow start:
         46.172308123 seconds time elapsed                                          ( +-  0.30% )
         46.343168745 seconds time elapsed                                          ( +-  0.25% )
      
         +NUMA, 1 sec slow start:
         45.224189155 seconds time elapsed                                          ( +-  0.25% )
         45.160866532 seconds time elapsed                                          ( +-  0.17% )
      
      and it also fixes an observable perf bench (hackbench) regression:
      
         # perf stat --null --repeat 10 perf bench sched messaging
      
         -NUMA:
      
         -NUMA:                  0.246225691 seconds time elapsed                   ( +-  1.31% )
         +NUMA no slow start:    0.252620063 seconds time elapsed                   ( +-  1.13% )
      
         +NUMA 1sec delay:       0.248076230 seconds time elapsed                   ( +-  1.35% )
      
      The implementation is simple and straightforward, most of the patch
      deals with adding the /proc/sys/kernel/numa_balancing_scan_delay_ms tunable
      knob.
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Linus Torvalds <torvalds@linux-foundation.org>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
      Cc: Andrea Arcangeli <aarcange@redhat.com>
      Cc: Rik van Riel <riel@redhat.com>
      [ Wrote the changelog, ran measurements, tuned the default. ]
      Signed-off-by: NIngo Molnar <mingo@kernel.org>
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Reviewed-by: NRik van Riel <riel@redhat.com>
      4b96a29b
    • P
      mm: numa: Add fault driven placement and migration · cbee9f88
      Peter Zijlstra 提交于
      NOTE: This patch is based on "sched, numa, mm: Add fault driven
      	placement and migration policy" but as it throws away all the policy
      	to just leave a basic foundation I had to drop the signed-offs-by.
      
      This patch creates a bare-bones method for setting PTEs pte_numa in the
      context of the scheduler that when faulted later will be faulted onto the
      node the CPU is running on.  In itself this does nothing useful but any
      placement policy will fundamentally depend on receiving hints on placement
      from fault context and doing something intelligent about it.
      Signed-off-by: NMel Gorman <mgorman@suse.de>
      Acked-by: NRik van Riel <riel@redhat.com>
      cbee9f88
  21. 01 12月, 2012 1 次提交
    • F
      context_tracking: New context tracking susbsystem · 91d1aa43
      Frederic Weisbecker 提交于
      Create a new subsystem that probes on kernel boundaries
      to keep track of the transitions between level contexts
      with two basic initial contexts: user or kernel.
      
      This is an abstraction of some RCU code that use such tracking
      to implement its userspace extended quiescent state.
      
      We need to pull this up from RCU into this new level of indirection
      because this tracking is also going to be used to implement an "on
      demand" generic virtual cputime accounting. A necessary step to
      shutdown the tick while still accounting the cputime.
      Signed-off-by: NFrederic Weisbecker <fweisbec@gmail.com>
      Cc: Andrew Morton <akpm@linux-foundation.org>
      Cc: H. Peter Anvin <hpa@zytor.com>
      Cc: Ingo Molnar <mingo@kernel.org>
      Cc: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
      Cc: Peter Zijlstra <peterz@infradead.org>
      Cc: Steven Rostedt <rostedt@goodmis.org>
      Cc: Thomas Gleixner <tglx@linutronix.de>
      Cc: Li Zhong <zhong@linux.vnet.ibm.com>
      Cc: Gilad Ben-Yossef <gilad@benyossef.com>
      Reviewed-by: NSteven Rostedt <rostedt@goodmis.org>
      [ paulmck: fix whitespace error and email address. ]
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      91d1aa43
  22. 28 11月, 2012 1 次提交
  23. 20 11月, 2012 2 次提交
  24. 17 11月, 2012 1 次提交
    • P
      sched: Mark RCU reader in sched_show_task() · 4e79752c
      Paul E. McKenney 提交于
      When sched_show_task() is invoked from try_to_freeze_tasks(), there is
      no RCU read-side critical section, resulting in the following splat:
      
      [  125.780730] ===============================
      [  125.780766] [ INFO: suspicious RCU usage. ]
      [  125.780804] 3.7.0-rc3+ #988 Not tainted
      [  125.780838] -------------------------------
      [  125.780875] /home/rafael/src/linux/kernel/sched/core.c:4497 suspicious rcu_dereference_check() usage!
      [  125.780946]
      [  125.780946] other info that might help us debug this:
      [  125.780946]
      [  125.781031]
      [  125.781031] rcu_scheduler_active = 1, debug_locks = 0
      [  125.781087] 4 locks held by s2ram/4211:
      [  125.781120]  #0:  (&buffer->mutex){+.+.+.}, at: [<ffffffff811e2acf>] sysfs_write_file+0x3f/0x160
      [  125.781233]  #1:  (s_active#94){.+.+.+}, at: [<ffffffff811e2b58>] sysfs_write_file+0xc8/0x160
      [  125.781339]  #2:  (pm_mutex){+.+.+.}, at: [<ffffffff81090a81>] pm_suspend+0x81/0x230
      [  125.781439]  #3:  (tasklist_lock){.?.?..}, at: [<ffffffff8108feed>] try_to_freeze_tasks+0x2cd/0x3f0
      [  125.781543]
      [  125.781543] stack backtrace:
      [  125.781584] Pid: 4211, comm: s2ram Not tainted 3.7.0-rc3+ #988
      [  125.781632] Call Trace:
      [  125.781662]  [<ffffffff810a3c73>] lockdep_rcu_suspicious+0x103/0x140
      [  125.781719]  [<ffffffff8107cf21>] sched_show_task+0x121/0x180
      [  125.781770]  [<ffffffff8108ffb4>] try_to_freeze_tasks+0x394/0x3f0
      [  125.781823]  [<ffffffff810903b5>] freeze_kernel_threads+0x25/0x80
      [  125.781876]  [<ffffffff81090b65>] pm_suspend+0x165/0x230
      [  125.781924]  [<ffffffff8108fa29>] state_store+0x99/0x100
      [  125.781975]  [<ffffffff812f5867>] kobj_attr_store+0x17/0x20
      [  125.782038]  [<ffffffff811e2b71>] sysfs_write_file+0xe1/0x160
      [  125.782091]  [<ffffffff811667a6>] vfs_write+0xc6/0x180
      [  125.782138]  [<ffffffff81166ada>] sys_write+0x5a/0xa0
      [  125.782185]  [<ffffffff812ff6ae>] ? trace_hardirqs_on_thunk+0x3a/0x3f
      [  125.782242]  [<ffffffff81669dd2>] system_call_fastpath+0x16/0x1b
      
      This commit therefore adds the needed RCU read-side critical section.
      Reported-by: N"Rafael J. Wysocki" <rjw@sisk.pl>
      Signed-off-by: NPaul E. McKenney <paulmck@linux.vnet.ibm.com>
      4e79752c
  25. 24 10月, 2012 6 次提交
  26. 05 10月, 2012 2 次提交
    • T
      sched: Update sched_domains_numa_masks[][] when new cpus are onlined · 301a5cba
      Tang Chen 提交于
      Once array sched_domains_numa_masks[] []is defined, it is never updated.
      
      When a new cpu on a new node is onlined, the coincident member in
      sched_domains_numa_masks[][] is not initialized, and all the masks are 0.
      As a result, the build_overlap_sched_groups() will initialize a NULL
      sched_group for the new cpu on the new node, which will lead to kernel panic:
      
      [ 3189.403280] Call Trace:
      [ 3189.403286]  [<ffffffff8106c36f>] warn_slowpath_common+0x7f/0xc0
      [ 3189.403289]  [<ffffffff8106c3ca>] warn_slowpath_null+0x1a/0x20
      [ 3189.403292]  [<ffffffff810b1d57>] build_sched_domains+0x467/0x470
      [ 3189.403296]  [<ffffffff810b2067>] partition_sched_domains+0x307/0x510
      [ 3189.403299]  [<ffffffff810b1ea2>] ? partition_sched_domains+0x142/0x510
      [ 3189.403305]  [<ffffffff810fcc93>] cpuset_update_active_cpus+0x83/0x90
      [ 3189.403308]  [<ffffffff810b22a8>] cpuset_cpu_active+0x38/0x70
      [ 3189.403316]  [<ffffffff81674b87>] notifier_call_chain+0x67/0x150
      [ 3189.403320]  [<ffffffff81664647>] ? native_cpu_up+0x18a/0x1b5
      [ 3189.403328]  [<ffffffff810a044e>] __raw_notifier_call_chain+0xe/0x10
      [ 3189.403333]  [<ffffffff81070470>] __cpu_notify+0x20/0x40
      [ 3189.403337]  [<ffffffff8166663e>] _cpu_up+0xe9/0x131
      [ 3189.403340]  [<ffffffff81666761>] cpu_up+0xdb/0xee
      [ 3189.403348]  [<ffffffff8165667c>] store_online+0x9c/0xd0
      [ 3189.403355]  [<ffffffff81437640>] dev_attr_store+0x20/0x30
      [ 3189.403361]  [<ffffffff8124aa63>] sysfs_write_file+0xa3/0x100
      [ 3189.403368]  [<ffffffff811ccbe0>] vfs_write+0xd0/0x1a0
      [ 3189.403371]  [<ffffffff811ccdb4>] sys_write+0x54/0xa0
      [ 3189.403375]  [<ffffffff81679c69>] system_call_fastpath+0x16/0x1b
      [ 3189.403377] ---[ end trace 1e6cf85d0859c941 ]---
      [ 3189.403398] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
      
      This patch registers a new notifier for cpu hotplug notify chain, and
      updates sched_domains_numa_masks every time a new cpu is onlined or offlined.
      Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com>
      Signed-off-by: NWen Congyang <wency@cn.fujitsu.com>
      [ fixed compile warning ]
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1348578751-16904-3-git-send-email-tangchen@cn.fujitsu.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      301a5cba
    • T
      sched: Ensure 'sched_domains_numa_levels' is safe to use in other functions · 5f7865f3
      Tang Chen 提交于
      We should temporarily reset 'sched_domains_numa_levels' to 0 after
      it is reset to 'level' in sched_init_numa(). If it fails to allocate
      memory for array sched_domains_numa_masks[][], the array will contain
      less then 'level' members. This could be dangerous when we use it to
      iterate array sched_domains_numa_masks[][] in other functions.
      
      This patch set sched_domains_numa_levels to 0 before initializing
      array sched_domains_numa_masks[][], and reset it to 'level' when
      sched_domains_numa_masks[][] is fully initialized.
      Signed-off-by: NTang Chen <tangchen@cn.fujitsu.com>
      Signed-off-by: NWen Congyang <wency@cn.fujitsu.com>
      Signed-off-by: NPeter Zijlstra <a.p.zijlstra@chello.nl>
      Link: http://lkml.kernel.org/r/1348578751-16904-2-git-send-email-tangchen@cn.fujitsu.comSigned-off-by: NIngo Molnar <mingo@kernel.org>
      5f7865f3